Incorporating External Risk Information with the Cox Model under Population Heterogeneity: Applications to Trans-Ancestry Polygenic Hazard Scores–Incorporating External Risk Information with the Cox Model under Population Heterogeneity: Applications to Trans-Ancestry Polygenic Hazard Scores
Incorporating External Risk Information with the Cox Model under Population Heterogeneity: Applications to Trans-Ancestry Polygenic Hazard Scores
Abstract
Polygenic hazard score (PHS) models designed for European ancestry (EUR) individuals provide ample information regarding survival risk discrimination. Incorporating such information can improve the performance of risk discrimination in an internal small-sized non-EUR cohort. However, given that external EUR-based model and internal individual-level data come from different populations, ignoring population heterogeneity can introduce substantial bias. In this paper, we develop a Kullback-Leibler-based Cox model (CoxKL) to integrate internal individual-level time-to-event data with external risk scores derived from published prediction models, accounting for population heterogeneity. Partial-likelihood-based KL information is utilized to measure the discrepancy between the external risk information and the internal data. We establish the asymptotic properties of the CoxKL estimator. Simulation studies show that the integration model by the proposed CoxKL method achieves improved estimation efficiency and prediction accuracy. We applied the proposed method to develop a trans-ancestry PHS model for prostate cancer and found that integrating a previously published EUR-based PHS with an internal genotype data of African ancestry (AFR) males yielded considerable improvement on the prostate cancer risk discrimination.
keywords:
Data integration; Kullback-Leibler information; Prostate cancer; Risk discrimination; Survival analysis; Trans-ancestry polygenic hazard score.1 Introduction
Survival risk discrimination is important for identifying individuals with a high risk of disease progression. For example, polygenic hazard scores (PHS), calculated as the sum of single-nucleotide polymorphisms (SNPs) weighted by the risk allele effect sizes derived from Cox proportional hazard models, are effective tools for personalized disease risk assessment and are widely applied in cancer research (Seibert et al., 2018; Zhang et al., 2020). However, the development of PHS has often included only individuals of European ancestry (EUR). A major challenge in developing PHS models for a minority population is that non-EUR individuals are usually less represented in genetic databases (Landry et al., 2018), limiting their utility and exacerbating existing health disparities (Martin et al., 2017).
Our research is motivated by the study of prostate cancer, which is the second most prevalent cancer in males across the world (Bray et al., 2018). Recently, a PHS calculated with 46 SNPs (denoted as PHS46) has been developed to predict the age of onset of prostate cancer in EUR individuals (Huynh-Le et al., 2021). However, it is well-known that African ancestry (AFR) individuals have higher prostate cancer incidences and mortality rates than EUR individuals (DeSantis et al., 2016), which may be partially explained by disparities in gene mutations associated with prostate cancer (Huang et al., 2017). Due to these clinical and genetic heterogeneity, directly applying EUR-based PHS has limited utility for non-EUR populations (Vilhjálmsson et al., 2015). On the other hand, constructing PHS for non-EUR populations suffers from limited sample sizes, high dimensionality, and low signal-to-noise ratios. Thus, it would be desirable to develop a method that integrates risk information from published prediction models upon EUR individuals (e.g. PHS46) with individual-level data collected in non-EUR cohorts (e.g. genotype data of AFR males), automatically measure the heterogeneity across different populations, and efficiently balance the contribution of each information source.
Ideally, the access to individual-level data will enable the pooled analysis and optimize the use of each information source. However, due to restrictions on data privacy, summary-level information is commonly used for data sharing in most applications. In the context of survival analysis, several methods for integrating summary-level external information with an internal study have been proposed. For example, Huang et al. (2016) proposed a double empirical likelihood approach based on a homogeneous assumption; i.e. both the covariate distribution and the association between covariates and outcomes are the same in the internal and external cohorts. However, this strong assumption is rarely met. Ignoring heterogeneity across different studies yields substantially biased parameter estimates and a loss of efficiency (Estes et al., 2017).
To relax the homogeneity assumption, Chen et al. (2021) proposed an adaptive double empirical likelihood estimator by penalizing potential differences on the subgroup survival probabilities. Sheng et al. (2021) constructed a heterogeneity test based on the penalized empirical likelihood and proposed a semiparametric density ratio model. Although these methods are useful, they typically require external subgroup survival probabilities, which are not available through EUR-based PHS models. The published PHS contains only covariate coefficients, but with no baseline hazard information. Such risk scores cannot be given any direct probability interpretation. Furthermore, it can be computationally expensive for existing methods to handle large-scale data, especially when the number of predictors largely exceeds the sample size (e.g. non-EUR genotype data), leading to a high demand for new integration methods that are computationally efficient and enable full utilization of high-dimensional information.
Kullback-Leibler (KL) information (Kullback and Leibler, 1951) has been widely applied in the development of data integration methods for binary and ordinal outcomes under population heterogeneity (Liu and Shum, 2003; Schapire et al., 2005; Jiang et al., 2016; Hector and Martin, 2022). However, these methods apply only to generalized linear models (GLM). For the integration with survival outcomes, Wang et al. (2021) extended the GLM-based KL to discrete failure time models and developed an integration procedure to incorporate external survival probabilities across multiple time points with internal time-to-event data. However, it is not clear how to incorporate external survival risk information without probability interpretation. To fill this gap, we propose a KL-based integrated Cox model (CoxKL), which incorporates external risk scores with internal individual-level time-to-event data to improve the estimation and prediction performance in the internal cohort. A key ingredient of the proposed method is the partial-likelihood-based KL information, which measures the heterogeneity between the external and the internal populations. We first develop the CoxKL method under low-dimensional settings, and then extend it to CoxKL-LASSO with a penalization, which can be further applied to high-dimensional trans-ancestry PHS developments.
The proposed CoxKL procedure attains the following advantages: (1) it only requires summary-level risk scores to be obtained from external studies, but with no use of any individual-level external data or survival probabilities, and therefore achieves greater flexibility in data sharing and integrative analysis; (2) it accounts for population heterogeneity, automatically detects the quality of different information sources, and provides an optimal integrated model based on the aggregated information; and (3) its objective function retains a similar form of the log-partial likelihood of the classical Cox model; thus, it is computationally efficient for high-dimensional problems, which is crucial for the motivating trans-ancestry PHS analysis; and (4) it provides a unified integration framework to incorporate risk information from various types of external prediction models or multiple external models.
The rest of the paper is organized as follows: In Section 2, we introduce the notations and model setup. In Section 3, we propose the CoxKL method. Asymptotic properties are discussed in Section 4. Section 5 conducts simulations to evaluate the proposed method. In Section 6, we applied CoxKL method to develop a trans-ancestry PHS model for prostate cancer in AFR males. Section 7 concludes with a brief discussion.
2 Notation and Model Setup
In Section 2.1, we introduce the notations of the Cox proportional hazards model for the target internal cohort, which collects individual-level data with time-to-event outcomes and a set of covariates from a limited-size sample (denoted as internal data). In Section 2.2, we describe various types of external risk scores that can be integrated by the CoxKL method.
2.1 Internal Cox proportional hazards model
Let denote the event time of interest and be the censoring time for subject , , where is the sample size of the internal cohort. Let be a -dimensional covariate vector for the -th subject. We assume that, conditional on , is independently censored by . The observed time is denoted by , and the event indicator is given by . The internal cohort consists of independent vectors . Consider a hazard function specified by
where is the baseline hazard function, and is a -dimensional regression parameter. Suppose the internal cohort comprises unique failure times . Considering a Cox proportional hazards model (Cox, 1972), the corresponding partial likelihood for is given by
(1) |
where is the covariate vector for the subject who experiences event at , and is the at-risk indicator. The log-partial likelihood is given by
(2) |
2.2 External risk scores
Due to restrictions on data privacy, the access to the individual-level genotype data from the external EUR studies is usually limited. Only summary-level information, such as coefficient estimates , can be obtained from the published PHS model. Specifically, we consider the situation that some external risk scores, , can be computed based on published EUR-based PHS models, where is the input covariate vector for the -th patient in the internal non-EUR cohort.
In practice, the variables contained in the internal data are not necessarily the same as those used in the external risk scores. For example, the PHS46 model only utilized 46 SNPs, but the internal genotype data of AFR individuals contains millions of SNPs after imputation. By allowing to be a sparse vector (i.e. some entries of can be zero), the proposed method can be applied to the case that the internal data and the external risk score share the same set of covariates, as well as the case that the internal data contains additional covariates.
In addition to accommodating for differences in the covariate space, another advantage of external risk scores is that it can incorporate different forms of external risk score models, such as the commonly used Cox proportional hazards model, with , which assumes a linear and proportional effect. Moreover, can also be obtained from machine learning algorithms (Gao et al., 2023), as long as the external model provides a predicted risk score for each subject of the internal population.
3 Proposed Methods
We propose using KL information to evaluate the disparity between the internal data and external risk scores, and consequently integrate the optimally weighted external information with the internal data to obtain the best integrated model for the target internal population. We present CoxKL under low-dimensional settings in Section 3.1, and then present CoxKL-LASSO, which extends CoxKL to high-dimensional trans-ancestry PHS problems, in Section 3.3.
3.1 CoxKL: KL-based integrated Cox model
Consider the internal Cox model, its partial likelihood in (1) arises as a product of conditional probability statements (Cox, 1975; Kalbfleisch and Prentice, 2002). Specifically, let specify the event that individual fails in , and let specify all the censoring and failure information up to time as well as the information that one failure occurs in . Then is a sequence of conditional experiments, where the -field generated by is contained by that generated by . The density function of conditional on is
(3) |
where the second equality is obtained by canceling in the numerator and the denominator. To extract information from external risk scores, we adopt a similar conditional density
(4) |
The partial likelihood-based KL information between conditional densities corresponding to the external risk scores and the internal Cox model, contained in , is given by
(5) |
where the expectation E is with respect to the conditional density . The accumulated KL information in the sequence of conditional experiments is then defined as , which measures the discrepancy between the external risk scores and the internal Cox model for the target cohort.
To integrate the external risk scores with the internal data, we construct a penalized log-partial likelihood, which combines the log-partial likelihood from the internal Cox model and the accumulated KL information ,
(6) |
where is a tuning parameter weighing the relative importance of external and internal data sources. In the extreme case of , the penalized log-partial likelihood is reduced to the log-partial likelihood of the internal Cox model .
Remark 1: The goal of the penalized objective function is to balance the trade-off between the external risk scores and the internal data; i.e. maximize the log-partial likelihood of the internal data, while at the same time keeping the KL information small. When internal and external information are homogeneous, it’s intuitive that the CoxKL estimator attains better estimation efficiency and higher discrimination power. In contrast, when internal and external information are heterogeneous, external information introduces bias to the integrated model. However, due to the bias and variance trade-off, it can be shown that there exists a range of within which the CoxKL method improves the estimation efficiency; further details are provided in Section 4.
3.2 Estimation
Proposition 1 below shows that the objective function retains a similar form as the log-partial likelihood of a classical Cox model. Hence, the integrated model can be solved by applying the standard optimization algorithms for the classical Cox models, which ensures the computational efficiency of the proposed method. A proof of Proposition 1 is provided in Supporting Information.
Proposition 1 The accumulated KL information is given by
where
is a weighted average of the covariates for subjects at risk at time , with weights defined upon external risk scores. Furthermore, the objective function in (6) is equivalent to
(7) |
3.3 CoxKL-LASSO: Extending CoxKL with penalization
A key challenge for applying existing integration methods to develop trans-ancestry PHS models is the high-dimension of the genotype data. Fortunately, the concise form of CoxKL objective function in (7) facilitates its extension to high-dimensional problems, such as the LASSO penalization (Tibshirani, 1996) that simultaneously achieves variable selection and parameter estimation.
Specifically, the objective function of CoxKL method with a penalty (denoted as CoxKL-LASSO) is given by
(8) |
where is the norm of , and is a tuning parameter that determines the amount of penalization. Since the penalized objective function in (8) has a similar form as the classical Cox Lasso (Tibshirani, 1997), the parameter estimation can be easily obtained by the commonly used coordinate descent-based procedures (Simon et al., 2011). Therefore, the proposed CoxKL method is computationally efficient for high-dimensional problems such as developing trans-ancestry PHS models for non-EUR populations.
Remark 2: LASSO penalization has been widely utilized in polygenic risk scores (PRS) for quantitative and binary traits (Erin et al., 2013; Mak et al., 2017; Chen et al., 2021). However, we could not directly apply these PRS constructing methods for developing PHS models, since the time-to-event data requires unique approaches for handling censored outcomes. We note that, in addition to trans-ancetry PHS models, the CoxKL-LASSO can also be applied to develop classical PHS (e.g. EUR-based PHS) by setting .
3.4 Tuning parameter selection
The optimal tuning parameter and can be determined by conventional cross-validation approaches, such as the V&VH approach (Verweij and Van Houwelingen, 1993), which computes the difference between the log-partial likelihood evaluated on the whole set and the training set as a measure of the model performance. This approach ensures the validity and stability of the calculation of log-partial likelihood, and it has been widely used for penalized Cox proportional hazards models (Friedman et al., 2010; Simon et al., 2011; Dai and Breheny, 2019).
4 Theoretical Properties
In this section, we study the asymptotic distribution of the CoxKL estimator under the low-dimensional setting (i.e. is fixed and ) and show that the proposed method improves the asymptotic mean squared error (aMSE) given a certain range of .
We first provide some prerequisite notations. For a given , let be the solution obtained by maximizing the penalized objective function in (7), and let be the maximum of the expectation of the penalized objective function . Denote as the maximum log-partial likelihood estimator of the internal model in (2). The true value of the regression parameter for the Cox model using the internal cohort only is denoted by . Define , where are the estimated coefficients from the external model. For the ease of notations, we assume that the external risk scores can be constructed as . Denote
where . Then the score function of in (7) is given by
The corresponding information matrix is
To derive the theoretical properties of the CoxKL method under the low-dimensional setting, we impose the following regularity conditions for the internal cohort:
-
(C1).
, , are independent and identically distributed random vectors.
-
(C2).
, where is a pre-specified time point.
-
(C3).
for , and , where is a constant and is the -th component of .
-
(C4).
.
-
(C5).
Continuity of , , for any between and , where is the limiting value of , with and bounded, and bounded away from for .
-
(C6).
Positive-definiteness of the matrix , for any between and , where
Conditions (C1)-(C6) are analogous to the regularity conditions employed in Andersen and Gill (1982) and Andersen et al. (1993) to prove the consistency and asymptotic normality of estimators from classical Cox models.
Lemma 1. If conditions (C1)-(C6) hold, then for each , .
Lemma 1 is a standard results showing that the CoxKL estimator is consistent for the unique maximum for the expectation of the CoxKL objective function . By Proposition 1, the CoxKL objective function (7) retains a similar form as that of a classical Cox model, thus, Lemma 1 can be easily verified following standard arguments on the asymptotic results of maximum partial-likelihood estimator for classical Cox model under standard regularity conditions (Andersen et al., 1993).
Lemma 2. If conditions (C1)-(C6) hold, and if , then as ,
(9) |
where . If , then as ,
(10) |
Lemma 2 shows the asymptotic distribution of . When , is an unbiased estimator for . In contrast, when under population heterogeneity, as and with the order of , . Moreover, the CoxKL estimator yields a considerable efficiency gain over by incorporating external risk information, which can be easily seen from the form of asymptotic variance of . We then have the following Theorem 1, which states the advantage on the estimation of the CoxKL method.
Theorem 1. If conditions (C1)-(C6) hold, then there exists a range of , on which the asymptotic mean squared error (aMSE) of is strictly less than that of . Moreover, when , the aMSE of is monotonically decreasing in ; when , the aMSE of is monotonically decreasing in and monotonically increasing in , with
where is a vector between and such that
Theorem 1 implies that the CoxKL method improves the aMSE given an appropriate range of regardless of the potential heterogeneity. When , is unbiased, and it’s intuitive that for all since the CoxKL method improves the estimation efficiency as proved in Lemma 2. Moreover, even if the external and internal information are heterogeneous and deviate from each other, there exists a range of such that the CoxKL estimator achieves a smaller aMSE for the model regression parameters compared to the maximum partial-likelihood estimator using the internal data only. Furthermore, as mentioned in Section 2.2, we do not require that the internal data and the external risk score share the same set of covariates. Since can be a sparse vector (i.e. some entries of are zero), we allow internal data to include additional predictors.
5 Simulation
To evaluate the performance of the proposed CoxKL method, we conducted two sets of simulation studies. In Setting I, we focused on the evaluation of the estimation performance of CoxKL, while in Setting II, we evaluated the prediction performance of CoxKL under different scenarios, including the integration with various types of external risk score models (CoxPH and XGBoost (Chen and Guestrin, 2016)) and multiple external risk scores. The tuning parameter was selected by the 5-fold cross validation with the V&VH cross-validated partial likelihood as the measure of model performance. In addition to the internal and the external models, we compared the performance of the CoxKL method with the stacked regression method (Debray et al., 2014), which includes the prediction of the external model as a covariate in the integration model. We evaluated the model performance on an independent testing dataset, which was simulated from the same distribution as the internal data. The simulation studies were replicated 500 times.
5.1 Setting I: Evaluation of the estimation performance
In the first set of simulations, we considered 6 covariates , where , were continuous variables simulated from a multivariate normal distribution with zero mean, unit variance and a first-order autoregressive (AR1) correlation structure with the auto-correlation parameter 0.5. were binary variables with . Regarding and , to simulate the situation where the internal cohort comes from a different underlying distribution from the external population, we denoted as a latent binary variable with , and then let and be two continuous variables simulated from normal distribution with unit variance and mean and , respectively. That is, the covariate distribution of the internal cohort could be different from that of the external cohort. Note that, as a latent variable, was unobserved in the simulated datasets. The survival time for the internal data was generated from a Cox model with a hazard function of
(11) |
where . The censoring times were generated from a uniform distribution with varying upper bounds to control for different censoring proportions. We set for the internal cohort. The true generating mechanism for the external cohort was the same as (11), but the external model was determined from a large dataset with varying (% homogeneous) and (covariates used in the external model). Specifically, we considered the following external settings:
-
(E1).
The covariate distribution of the external population is the same as the internal cohort, that is, ; and the external model is the true model: .
-
(E2).
The covariate distribution of the external population is slightly different from the internal cohort with ; and the external model is a misspecified model: .
-
(E3).
The covariate distribution of the external population is completely different from the internal cohort with ; and the external model is a misspecified model: .
External model | |||||||
Method | Setting | % Homogeneous | Covariates | Bias | SE | MSE | C-Index |
Internal data: sample size = 50, censoring rate = 60% | |||||||
Internal | — | — | — | 0.057 | 0.475 | 0.229 | 0.594 |
CoxKL | (E1) | 100 | 0.023 | 0.136 | 0.019 | 0.640 | |
(E2) | 50 | 0.101 | 0.165 | 0.038 | 0.624 | ||
(E3) | 0 | 0.145 | 0.192 | 0.058 | 0.605 | ||
Internal data: sample size = 50, censoring rate = 30% | |||||||
Internal | — | — | — | 0.040 | 0.354 | 0.127 | 0.608 |
CoxKL | (E1) | 100 | 0.022 | 0.114 | 0.014 | 0.642 | |
(E2) | 50 | 0.088 | 0.149 | 0.030 | 0.627 | ||
(E3) | 0 | 0.129 | 0.176 | 0.048 | 0.612 | ||
Internal data: sample size = 100, censoring rate = 30% | |||||||
Internal | — | — | — | 0.019 | 0.207 | 0.043 | 0.628 |
CoxKL | (E1) | 100 | 0.023 | 0.066 | 0.005 | 0.645 | |
(E2) | 50 | 0.079 | 0.111 | 0.019 | 0.633 | ||
(E3) | 0 | 0.093 | 0.138 | 0.028 | 0.627 |
Table 1 and Figure 1 summarize the results of simulation Setting I, including average empirical biases (Bias), average empirical standard errors (SE), and average empirical mean squared errors (MSE) for the estimated regression coefficients . In addition, we also presented the Harrel’s C-Index (Harrel Jr et al., 1996). As shown in Table 1, under all three external settings regarding good, fair, and poor external model quality, the integration model by the CoxKL method attained better estimation efficiency than the internal model, with smaller SE and MSE; moreover, the SE and MSE of the CoxKL method decreased as the quality of the external risk information increased. Incorporating external information from homogeneous models (external setting E1) reduced the bias when the quality of the internal cohort was limited. Unsurprisingly, when the external models and internal cohort were heterogeneous, incorporating information from misspecified external risk scores increased the bias. However, due to the bias and variance trade-off, the MSE of the CoxKL estimator still decreased. Compared to the internal model, the CoxKL method also improved the prediction performance by incorporating external information, with a higher C-Index. As the internal sample size increased or the internal censoring rate decreased, the gain on estimation efficiency and prediction accuracy by incorporating external information via the CoxKL method decreased, as expected. As shown in Figure 1, there existed a range of integration weight that the integration model by CoxKL method reduced the MSE and increased the C-Index compared to the internal model. With a decrease in the external information quality, the range of that improved the model performance shrank; that is, the optimal integration weight of the external risk scores decreased. However, the CoxKL method outperformed the internal model in all cases with a properly selected integration weight .
5.2 Setting II: Evaluation of the prediction performance
In Setting II, we included 5 covariates , which were simulated from the same distributions as in Setting I. In addition to the main effect, we also considered the interaction term in the data generating model. Since in practice, we usually cannot observe and collect all the variables in the true model, here we simulated two additional variables, and , which contributed to the true underlying generating distribution of the data, but were not observed and collected in the simulated datasets. was a binary variable with , and came from a standard normal distribution. The true generating model for the internal data was a Cox model with a hazard function of
(12) | |||||
where . The true generating mechanism for the external cohort was the same as (12). However, the external models were determined from large datasets with varying and .
We first considered the settings with two external risk scores derived from different types of models. Specifically, in each setting, the External1 was derived from a XGBoost model, and the External2 was derived from a CoxPH model. Two extreme settings with extremely well or poorly performing external models were explored. Details are given as follows:
-
(E4).
Two good-quality external models: External1: , and ; External2: , and .
-
(E5).
One good-quality external model and one poor-quality external model. External1: , and ; External2: , and .
-
(E6).
Two poor-quality external models: External1: , and ; External2: Null model.
For the internal cohort, we set the sample size as 50, the censoring rate as around 30%, and . As shown in Figure 2, under extreme scenarios such as both external risk scores were derived from extremely well or poorly performing models, the CoxKL method could achieve comparable prediction performance as the external or internal model by selecting an extremely large or small integration weight . Moreover, when integrating two external risk scores with different prediction performances, the CoxKL method could optimally determine the weight of information integration, assigning more weight to the better performing external risk score, and resulting in improved prediction performance with the highest C-Index. Incorporating information from external models resulted in a narrower interquartile range (IQR) compared with that of the internal model, which indicates that the proposed method reduced the variance of prediction. Note that under all three settings, the External1 risk scores were derived by the XGBoost, which indicates that the CoxKL method worked well when integrating a risk score derived from a machine learning model.
Then we considered different settings in terms of sample sizes and censoring rates of the internal cohort. We set for the internal cohort, and let varying from for the external cohort. The external covariates were set as . The sample size of the internal cohort varied from , and the censoring rate of the internal data varied from . The simulation results are shown in Figure 3. With increasing censoring rates of the internal data, the CoxKL method outperformed the internal and stacked models with higher C-index and smaller variance of prediction. The prediction performances of the internal and stacked models decreased rapidly as the censoring rate increased or the sample size decreased. In addition, for fairly small datasets, incorporating information from the external prediction model by the CoxKL method improved the prediction performance over the internal model. As expected, the improvement decreased when the internal cohort contained more data.
6 Application: Prostate Cancer Trans-ancestry Polygenic Hazard Score
Prostate cancer is the most prevalent cancer and second leading cause of cancer death in American males (Siegel et al., 2022). Identifying high-risk population and offering personalized suggestions on when to initialize prostate cancer screening is important to reduce the cancer-related morbidity and mortality (Schröder et al., 2014). PHS46 has shown accurate discrimination of onset of prostate cancer in EUR males but reduced performance in non-EUR males, such as AFR males (Huynh-Le et al., 2021). To improve the risk discrimination of prostate cancer in AFR males, we applied CoxKL to develop a trans-ancestry PHS by integrating PHS46 with a small-sized internal dataset of AFR individuals.
The PHS46 model is a Cox proportional hazards model developed using a genotype dataset consisting of EUR males. By analyzing associations between genetic factors and the age at diagnosis of prostate cancer, the model includes 46 SNPs and predicts prostate cancer risk in EUR males. Based on the external validation results, the PHS46 model performs poorly with a C-index of 0.55 in AFR males (Karunamuni et al., 2021). Thus, to improve the risk discrimination for prostate cancer in AFR males, an ancestry specific PHS model is desired.
The internal dataset obtained from the Michigan Genomics Initiative (MGI) included 1,036 AFR males with both genotype and phenotype data (Zawistowski et al., 2021). The outcome was defined as the age at diagnosis of prostate cancer. For subjects who were not diagnosed with prostate cancer, the study follow-up was censored at the age of the last visit, and the overall censoring rate was . Standard quality control was performed on the genotype data. Eventually, a total of 161,861 imputed SNPs passing a quality threshold of , with a minor allele frequency , and LD pruned with , and were considered in the analysis. In addition, we conducted a GWAS analysis to further identify the SNPs that may be associated with the outcome using univariate Cox regression models. Top 10,000 SNPs identified based on p-values from the GWAS results were included in the PHS model development.
With the moderate sample size, high censoring rate and high-dimensional covariate space, we considered incorporating PHS46 with the internal data by utilizing the proposed CoxKL-LASSO procedure (introduced in Section 3.2). The internal data was randomly split into training and testing sets with a ratio of 3:1. We developed PHS models on the training set and evaluated the model performance on the testing set. A total of 25 independent random splits were implemented. As comparisons, results of the PHS46 model and the internal model by applying a penalized Cox regression model with LASSO on the internal data are also presented.
As shown in Figure 4, the proposed method had the best prediction performance compared to the internal model and PHS46 model. Specifically, when the dimension of the covariate space was much larger than the sample size, it was difficult to develop a proper PHS model based on the internal small-sized data only. Therefore, the CoxKL method incorporated more information from PHS46 with an optimal at around . Furthermore, as shown in Figure 5, the CoxKL method also achieved better risk discrimination of prostate cancer in AFR males than both the internal and PHS46 models. High-risk populations (subjects with PHS score ranging from ) were clearly identified by the proposed CoxKL method, which indicates that our method can be used for better personalized suggestions on when to initialize prostate cancer screening for AFR males.
7 Discussion
In this paper, we proposed the CoxKL method to incorporate external risk scores derived from published prediction models with internal time-to-event data. Partial-likelihood-based KL information is utilized to provide an overall assessment of heterogeneity across populations, and a penalized objective function is developed to balance the contribution of different information sources. Extensive simulation studies show the advantages of the CoxKL method, including improvements on estimation efficiency and prediction accuracy.
The CoxKL method is computationally efficient and, thus, can be easily extended to integration problems under high-dimensional settings. We presented CoxKL-LASSO for developing trans-ancestry PHS prediction models in non-EUR populations. We applied our method to develop a trans-ancestry PHS model for prostate cancer by integrating PHS46, a previously published EUR-based PHS, with a small-sized AFR genotype data. The prostate cancer risk discrimination in AFR males has been substantially improved (C-Index from 0.49 [Internal] and 0.54 [PHS46] to 0.68 [CoxKL]). The proposed AFR-based trans-ancestry PHS can clearly distinguish subjects with high prostate cancer risk.
As the CoxKL method summarizes external risk information into personalized risk scores, it provides a unified framework for integrating risk information from various types of external models, including prediction models by machine learning algorithms, as long as the external models can provide personalized risk scores for each subject in the internal data. Moreover, as shown by both theoretical properties and simulation studies, the CoxKL method allows internal data include additional covariates than that used in external risk score models. The CoxKL method can also be easily extended to the integration with risk information from multiple external models, details are given in Supporting Information.
In summary, the proposed CoxKL method serves as a general and flexible integration tool for time-to-event data, with an important application in developing trans-ancestry PHS models and enhancing risk discrimination in non-EUR populations.
Acknowledgements
The authors acknowledge the Michigan Genomics Initiative participants, Precision Health at the University of Michigan, the University of Michigan Medical School Central Biorepository, the University of Michigan Medical School Data Office for Clinical and Translational Research, the University of Michigan Advanced Genomics Core for providing data storage and specimen storage, management, processing, and distribution services, and the Center for Statistical Genetics in the Department of Biostatistics at the School of Public Health for genotype data curation, imputation, and management in support of the research reported in this publication. This work was partially supported by the National Institute of Diabetes and Digestive and Kidney Diseases under Grant R01DK129539.
References
- Andersen et al. (1993) Andersen, P. K., Borgan, O., Gill, R. D., and Keiding, N. (1993). Statistical Models Based on Counting Processes. New York, NY: Springer New York.
- Andersen and Gill (1982) Andersen, P. K. and Gill, R. D. (1982). Cox’s regression model for counting processes: A large sample study. The Annals of Statistics 10, 1100–1120.
- Bray et al. (2018) Bray, F., Ferlay, J., Soerjomataram, I., Siegel, R. L., Torre, L. A., and Jemal, A. (2018). Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a Cancer Journal for Clinicians 68, 394–424.
- Chen et al. (2021) Chen, T., Chatterjee, N., Landi, M. T., and Shi, J. (2021). A penalized regression framework for building polygenic risk models based on summary statistics from genome-wide association studies and incorporating external information.
- Chen and Guestrin (2016) Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on knowledge discovery and data mining page 785–794.
- Chen et al. (2021) Chen, Z., Ning, J., Shen, Y., and Qin, J. (2021). Combining primary cohort data with external aggregate information without assuming comparability. Biometrics 77, 1024–1036.
- Cox (1972) Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B 34, 187–202.
- Cox (1975) Cox, D. R. (1975). Partial likelihood. Biometrika 62, 269–279.
- Dai and Breheny (2019) Dai, B. and Breheny, P. (2019). Cross validation approaches for penalized Cox regression. arXiv:1905.10432 .
- Debray et al. (2014) Debray, T. P., Koffijberg, H., Nieboer, D., Vergouwe, Y., Steyerberg, E. W., and Moons, K. G. (2014). Meta-analysis and aggregation of multiple published prediction models. Statistics in Medicine 33, 2341–2362.
- DeSantis et al. (2016) DeSantis, C. E., Siegel, R. L., Sauer, A. G., Miller, K. D., Fedewa, S. A., Alcaraz, K. I., and Jemal, A. (2016). Cancer statistics for African Americans, 2016: Progress and opportunities in reducing racial disparities. CA: a Cancer Journal for Clinicians 66, 290–308.
- Erin et al. (2013) Erin, A., Wei, P., and Shen, X. (2013). Penalized regression and risk prediction in genome-wide association studies. Statistical Analysis and Data Mining 6, 315–328.
- Estes et al. (2017) Estes, J. P., Mukherjee, B., and Taylor, J. M. G. (2017). Empirical Bayes estimation and prediction using summary-level information from external big data sources adjusting for violations of transportability. Statistics in Biosciences 10, 568–586.
- Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1–22.
- Gao et al. (2023) Gao, X. R., Chiariglione, M., Qin, K., Nuytemans, K., Scharre, D. W., Li, Y., and Martin, E. R. (2023). Explainable machine learning aggregates polygenic risk scores and electronic health records for alzheimer’s disease prediction. Scientific Reports 13, 450.
- Harrel Jr et al. (1996) Harrel Jr, F. E., Lee, K. L., and Mark, D. B. (1996). Tutorial in biostatistics: multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing error. Statistics in Medicine 15, 361–387.
- Hector and Martin (2022) Hector, E. and Martin, R. (2022). Turning the information-sharing dial: efficient inference from different data sources. arXiv: 2207.08886 .
- Huang et al. (2016) Huang, C., Qin, J., and Tsai, H. (2016). Efficient estimation of the Cox model with auxiliary subgroup survival information. Journal of the American Statistical Association 111, 787–799.
- Huang et al. (2017) Huang, F. W., Mosquera, J. M., Garofalo, A., Oh, C., Baco, M., Amin-Mansour, A., Rabasha, B., et al. (2017). Exome sequencing of African-American prostate cancer reveals loss-of-function erf mutations. Cancer Discovery 7, 973–983.
- Huynh-Le et al. (2021) Huynh-Le, M., Fan, C. C., Karunamuni, R., Thompson, W., Martinez, M. E., Eeles, R., Kote-Jarai, Z., et al. (2021). Polygenic hazard score is associated with prostate cancer in multi-ethnic populations. Nature Communications 12, 1236–9.
- Jiang et al. (2016) Jiang, Y., He, H., and Zhang, H. (2016). Variable selection with prior information for generalized linear models via the prior lasso method. Journal of the American Statistical Association 111, 355–376.
- Kalbfleisch and Prentice (2002) Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data. John Wiley and Sons, Second edition.
- Karunamuni et al. (2021) Karunamuni, R. A., Huynh‐Le, M. P., Fan, C. C., Thompson, W., Eeles, R. A., Kote‐Jarai, Z., Muir, K., et al. (2021). African‐specific improvement of a polygenic hazard score for age at diagnosis of prostate cancer. International Journal of Cancer 148, 99–105.
- Kullback and Leibler (1951) Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics 22, 79–86.
- Landry et al. (2018) Landry, L. G., Ali, N., Williams, D. R., Rehm, H. L., and Bonham, V. L. (2018). Lack of diversity in genomic databases is a barrier to translating precision medicine research into practice. Health Affairs 37, 780–785.
- Liu and Shum (2003) Liu, C. and Shum, H. Y. (2003). Kullback-Leibler boosting. 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings 1, I–I.
- Mak et al. (2017) Mak, T. S. H., Porsch, R. M., Choi, S. W., Zhou, X., and Sham, P. C. (2017). Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology 41, 469–480.
- Martin et al. (2017) Martin, A. R., Gignoux, C. R., Walters, R. K., Wojcik, G. L., Neale, B. M., Gravel, S., Daly, M. J., et al. (2017). Human demographic history impacts genetic risk prediction across diverse populations. The American Journal of Human Genetics 100, 635–649.
- Schapire et al. (2005) Schapire, R., Rochery, M., Gilbert, M., and Gupta, N. (2005). Boosting with prior knowledge for call classification. IEEE transactions on speech and audio processing 13, 174–181.
- Schröder et al. (2014) Schröder, F. H., Hugosson, J., Roobol, M. J., Tammela, T. L. J., Zappa, M., Nelen, V., Kwiatkowski, M., et al. (2014). Screening and prostate cancer mortality: results of the European Randomised Study of Screening for Prostate Cancer (erspc) at 13 years of follow-up. The Lancet 384, 2027–2035.
- Seibert et al. (2018) Seibert, T. M., Fan, C. C., Wang, Y., Zuber, V., Karunamuni, R., Parsons, J. K., Eeles, R. A., et al. (2018). Polygenic hazard score to guide screening for aggressive prostate cancer: development and validation in large scale cohorts. BMJ 360, j5757–j5757.
- Sheng et al. (2021) Sheng, Y., Sun, Y., Huang, C., and Kim, M. (2021). Synthesizing external aggregated information in the penalized Cox regression under population heterogeneity. Statistics in Medicine 40, 4915–4930.
- Siegel et al. (2022) Siegel, R. L., Miller, K. D., Fuchs, H. E., and Jemal, A. (2022). Cancer statistics, 2022. CA: A Cancer Journal for Clinicians 72, 7–33.
- Simon et al. (2011) Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2011). Regularization paths for Cox’s proportional hazards model via coordinate descent. Journal of Statistical Software 39, 1–13.
- Simon et al. (2011) Simon, R. M., Subramanian, J., Li, M. C., and Menezes, S. (2011). An evaluation of resampling methods for assessment of survival risk classifiers based on high-dimensional data. Briefings in Bioinformatics 12, 203–214.
- Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, Methodological 58, 267–288.
- Tibshirani (1997) Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine 16, 385–395.
- Verweij and Van Houwelingen (1993) Verweij, P. J. and Van Houwelingen, H. C. (1993). Cross-validation in survival analysis. Statistics in Medicine 12, 2305–2314.
- Vilhjálmsson et al. (2015) Vilhjálmsson, B. J., Yang, J., Finucane, H. K., Gusev, A., Lindström, S., Ripke, S., Genovese, G., et al. (2015). Modeling linkage disequilibrium increases accuracy of polygenic risk scores. The American Journal of Human Genetics 97, 576–592.
- Wang et al. (2021) Wang, D., Ye, W., Sung, R., Jiang, H., Taylor, J. M. G., Ly, L., and He, K. (2021). Kullback-Leibler-based discrete failure time models for integration of published prediction models with new time-to-event dataset. arXiv:2101.02354 .
- Zawistowski et al. (2021) Zawistowski, M., Fritsche, L. G., Pandit, A., Vanderwerff, B., Patil, S., Schmidt, E. M., VandeHaar, P., et al. (2021). The Michigan Genomics Initiative: a biobank linking genotypes and electronic clinical records in Michigan Medicine patients. medRxiv: 2021.12.15.21267864 .
- Zhang et al. (2020) Zhang, H., Deng, L., Schiffman, M., Qin, J., and Yu, K. (2020). Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika 107, 689–703.