This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

\pagerange

Incorporating External Risk Information with the Cox Model under Population Heterogeneity: Applications to Trans-Ancestry Polygenic Hazard ScoresIncorporating External Risk Information with the Cox Model under Population Heterogeneity: Applications to Trans-Ancestry Polygenic Hazard Scores

Incorporating External Risk Information with the Cox Model under Population Heterogeneity: Applications to Trans-Ancestry Polygenic Hazard Scores

Di Wang1    Wen Ye1    Ji Zhu2    Gongjun Xu2    Weijing Tang3    Matthew Zawistowski1,4   
Lars G. Fritsche1,4,5{}^{\textbf{1,4,5}}
   and Kevin He1,*{}^{\textbf{1,*}}
1Department of Biostatistics
[email protected]
   University of Michigan    Ann Arbor    Michigan    U.S.A.
2Department of Statistics
   University of Michigan    Ann Arbor    Michigan    U.S.A
3Department of Biostatistics
   Harvard University    Boston    Massachusetts    U.S.A.
4Center for Statistical Genetics
   University of Michigan    Ann Arbor    Michigan    U.S.A.
5Rogel Cancer Center
   University of Michigan    Ann Arbor    Michigan    U.S.A
Abstract

Polygenic hazard score (PHS) models designed for European ancestry (EUR) individuals provide ample information regarding survival risk discrimination. Incorporating such information can improve the performance of risk discrimination in an internal small-sized non-EUR cohort. However, given that external EUR-based model and internal individual-level data come from different populations, ignoring population heterogeneity can introduce substantial bias. In this paper, we develop a Kullback-Leibler-based Cox model (CoxKL) to integrate internal individual-level time-to-event data with external risk scores derived from published prediction models, accounting for population heterogeneity. Partial-likelihood-based KL information is utilized to measure the discrepancy between the external risk information and the internal data. We establish the asymptotic properties of the CoxKL estimator. Simulation studies show that the integration model by the proposed CoxKL method achieves improved estimation efficiency and prediction accuracy. We applied the proposed method to develop a trans-ancestry PHS model for prostate cancer and found that integrating a previously published EUR-based PHS with an internal genotype data of African ancestry (AFR) males yielded considerable improvement on the prostate cancer risk discrimination.

keywords:
Data integration; Kullback-Leibler information; Prostate cancer; Risk discrimination; Survival analysis; Trans-ancestry polygenic hazard score.

1 Introduction

Survival risk discrimination is important for identifying individuals with a high risk of disease progression. For example, polygenic hazard scores (PHS), calculated as the sum of single-nucleotide polymorphisms (SNPs) weighted by the risk allele effect sizes derived from Cox proportional hazard models, are effective tools for personalized disease risk assessment and are widely applied in cancer research (Seibert et al., 2018; Zhang et al., 2020). However, the development of PHS has often included only individuals of European ancestry (EUR). A major challenge in developing PHS models for a minority population is that non-EUR individuals are usually less represented in genetic databases (Landry et al., 2018), limiting their utility and exacerbating existing health disparities (Martin et al., 2017).

Our research is motivated by the study of prostate cancer, which is the second most prevalent cancer in males across the world (Bray et al., 2018). Recently, a PHS calculated with 46 SNPs (denoted as PHS46) has been developed to predict the age of onset of prostate cancer in EUR individuals (Huynh-Le et al., 2021). However, it is well-known that African ancestry (AFR) individuals have higher prostate cancer incidences and mortality rates than EUR individuals (DeSantis et al., 2016), which may be partially explained by disparities in gene mutations associated with prostate cancer (Huang et al., 2017). Due to these clinical and genetic heterogeneity, directly applying EUR-based PHS has limited utility for non-EUR populations (Vilhjálmsson et al., 2015). On the other hand, constructing PHS for non-EUR populations suffers from limited sample sizes, high dimensionality, and low signal-to-noise ratios. Thus, it would be desirable to develop a method that integrates risk information from published prediction models upon EUR individuals (e.g. PHS46) with individual-level data collected in non-EUR cohorts (e.g. genotype data of AFR males), automatically measure the heterogeneity across different populations, and efficiently balance the contribution of each information source.

Ideally, the access to individual-level data will enable the pooled analysis and optimize the use of each information source. However, due to restrictions on data privacy, summary-level information is commonly used for data sharing in most applications. In the context of survival analysis, several methods for integrating summary-level external information with an internal study have been proposed. For example, Huang et al. (2016) proposed a double empirical likelihood approach based on a homogeneous assumption; i.e. both the covariate distribution and the association between covariates and outcomes are the same in the internal and external cohorts. However, this strong assumption is rarely met. Ignoring heterogeneity across different studies yields substantially biased parameter estimates and a loss of efficiency (Estes et al., 2017).

To relax the homogeneity assumption, Chen et al. (2021) proposed an adaptive double empirical likelihood estimator by penalizing potential differences on the subgroup survival probabilities. Sheng et al. (2021) constructed a heterogeneity test based on the penalized empirical likelihood and proposed a semiparametric density ratio model. Although these methods are useful, they typically require external subgroup survival probabilities, which are not available through EUR-based PHS models. The published PHS contains only covariate coefficients, but with no baseline hazard information. Such risk scores cannot be given any direct probability interpretation. Furthermore, it can be computationally expensive for existing methods to handle large-scale data, especially when the number of predictors largely exceeds the sample size (e.g. non-EUR genotype data), leading to a high demand for new integration methods that are computationally efficient and enable full utilization of high-dimensional information.

Kullback-Leibler (KL) information (Kullback and Leibler, 1951) has been widely applied in the development of data integration methods for binary and ordinal outcomes under population heterogeneity (Liu and Shum, 2003; Schapire et al., 2005; Jiang et al., 2016; Hector and Martin, 2022). However, these methods apply only to generalized linear models (GLM). For the integration with survival outcomes, Wang et al. (2021) extended the GLM-based KL to discrete failure time models and developed an integration procedure to incorporate external survival probabilities across multiple time points with internal time-to-event data. However, it is not clear how to incorporate external survival risk information without probability interpretation. To fill this gap, we propose a KL-based integrated Cox model (CoxKL), which incorporates external risk scores with internal individual-level time-to-event data to improve the estimation and prediction performance in the internal cohort. A key ingredient of the proposed method is the partial-likelihood-based KL information, which measures the heterogeneity between the external and the internal populations. We first develop the CoxKL method under low-dimensional settings, and then extend it to CoxKL-LASSO with a L1L_{1} penalization, which can be further applied to high-dimensional trans-ancestry PHS developments.

The proposed CoxKL procedure attains the following advantages: (1) it only requires summary-level risk scores to be obtained from external studies, but with no use of any individual-level external data or survival probabilities, and therefore achieves greater flexibility in data sharing and integrative analysis; (2) it accounts for population heterogeneity, automatically detects the quality of different information sources, and provides an optimal integrated model based on the aggregated information; and (3) its objective function retains a similar form of the log-partial likelihood of the classical Cox model; thus, it is computationally efficient for high-dimensional problems, which is crucial for the motivating trans-ancestry PHS analysis; and (4) it provides a unified integration framework to incorporate risk information from various types of external prediction models or multiple external models.

The rest of the paper is organized as follows: In Section 2, we introduce the notations and model setup. In Section 3, we propose the CoxKL method. Asymptotic properties are discussed in Section 4. Section 5 conducts simulations to evaluate the proposed method. In Section 6, we applied CoxKL method to develop a trans-ancestry PHS model for prostate cancer in AFR males. Section 7 concludes with a brief discussion.

2 Notation and Model Setup

In Section 2.1, we introduce the notations of the Cox proportional hazards model for the target internal cohort, which collects individual-level data with time-to-event outcomes and a set of covariates from a limited-size sample (denoted as internal data). In Section 2.2, we describe various types of external risk scores that can be integrated by the CoxKL method.

2.1 Internal Cox proportional hazards model

Let TiT_{i} denote the event time of interest and CiC_{i} be the censoring time for subject ii, i=1,,ni=1,\ldots,n, where nn is the sample size of the internal cohort. Let 𝒁i\bm{Z}_{i} be a pp-dimensional covariate vector for the ii-th subject. We assume that, conditional on 𝒁i\bm{Z}_{i}, TiT_{i} is independently censored by CiC_{i}. The observed time is denoted by Xi=min{Ti,Ci}X_{i}=\mbox{min}\{T_{i},C_{i}\}, and the event indicator is given by δi=I(TiCi)\delta_{i}=I(T_{i}\leq C_{i}). The internal cohort consists of nn independent vectors (Xi,δi,𝒁i)(X_{i},\delta_{i},\bm{Z}_{i}). Consider a hazard function specified by

λ(t|𝒁i)=limdt01dtPr(tTi<t+dt|Tit,𝒁i)=λ0(t)exp(𝒁i𝜷),\displaystyle\lambda(t|\bm{Z}_{i})=\lim_{dt\rightarrow 0}\frac{1}{dt}Pr(t\leq T_{i}<t+dt|T_{i}\geq t,\bm{Z}_{i})=\lambda_{0}(t)\exp{(\bm{Z}_{i}^{\top}\bm{\beta})},

where λ0(t)\lambda_{0}(t) is the baseline hazard function, and 𝜷=(β1,,βp)\bm{\beta}=(\beta_{1},\cdots,\beta_{p})^{\top} is a pp-dimensional regression parameter. Suppose the internal cohort comprises KK unique failure times t1<<tKt_{1}<\cdots<t_{K}. Considering a Cox proportional hazards model (Cox, 1972), the corresponding partial likelihood for 𝜷\bm{\beta} is given by

L(𝜷)=k=1Kexp(𝒁k𝜷)i=1nYi(tk)exp(𝒁i𝜷),\displaystyle L(\boldsymbol{\beta})=\prod_{k=1}^{K}\frac{\exp{(\bm{Z}_{k}^{\top}\bm{\beta})}}{\sum_{i=1}^{n}Y_{i}(t_{k})\exp{(\bm{Z}_{i}^{\top}\bm{\beta})}}, (1)

where 𝒁k\bm{Z}_{k} is the covariate vector for the subject who experiences event at tkt_{k}, and Yi(tk)=I(Xitk)Y_{i}(t_{k})=I(X_{i}\geq t_{k}) is the at-risk indicator. The log-partial likelihood is given by

(𝜷)\displaystyle\ell(\boldsymbol{\beta}) =\displaystyle= k=1K[𝒁k𝜷log{i=1nYi(tk)exp(𝒁i𝜷)}].\displaystyle{}\sum_{k=1}^{K}\left[\bm{Z}_{k}^{\top}\bm{\beta}-\log\left\{\sum_{i=1}^{n}Y_{i}(t_{k})\exp{(\bm{Z}_{i}^{\top}\bm{\beta})}\right\}\right]. (2)

2.2 External risk scores

Due to restrictions on data privacy, the access to the individual-level genotype data from the external EUR studies is usually limited. Only summary-level information, such as coefficient estimates 𝜷~\widetilde{\boldsymbol{\beta}}, can be obtained from the published PHS model. Specifically, we consider the situation that some external risk scores, r(𝒁i,𝜷~)r(\bm{Z}_{i},\widetilde{\boldsymbol{\beta}}), can be computed based on published EUR-based PHS models, where 𝒁i\bm{Z}_{i} is the input covariate vector for the ii-th patient in the internal non-EUR cohort.

In practice, the variables contained in the internal data are not necessarily the same as those used in the external risk scores. For example, the PHS46 model only utilized 46 SNPs, but the internal genotype data of AFR individuals contains millions of SNPs after imputation. By allowing 𝜷~\widetilde{\bm{\beta}} to be a sparse vector (i.e. some entries of 𝜷~\widetilde{\bm{\beta}} can be zero), the proposed method can be applied to the case that the internal data and the external risk score share the same set of covariates, as well as the case that the internal data contains additional covariates.

In addition to accommodating for differences in the covariate space, another advantage of external risk scores is that it can incorporate different forms of external risk score models, such as the commonly used Cox proportional hazards model, with r(𝒁i,𝜷~)=𝒁i𝜷~r(\bm{Z}_{i},\widetilde{\boldsymbol{\beta}})=\bm{Z}_{i}^{\top}\widetilde{\boldsymbol{\beta}}, which assumes a linear and proportional effect. Moreover, r(𝒁i,𝜷~)r(\bm{Z}_{i},\widetilde{\boldsymbol{\beta}}) can also be obtained from machine learning algorithms (Gao et al., 2023), as long as the external model provides a predicted risk score for each subject of the internal population.

3 Proposed Methods

We propose using KL information to evaluate the disparity between the internal data and external risk scores, and consequently integrate the optimally weighted external information with the internal data to obtain the best integrated model for the target internal population. We present CoxKL under low-dimensional settings in Section 3.1, and then present CoxKL-LASSO, which extends CoxKL to high-dimensional trans-ancestry PHS problems, in Section 3.3.

3.1 CoxKL: KL-based integrated Cox model

Consider the internal Cox model, its partial likelihood in (1) arises as a product of conditional probability statements (Cox, 1975; Kalbfleisch and Prentice, 2002). Specifically, let AkA_{k} specify the event that individual kk fails in [tk,tk+dtk)[t_{k},t_{k}+dt_{k}), and let BkB_{k} specify all the censoring and failure information up to time tkt_{k}^{-} as well as the information that one failure occurs in [tk,tk+dtk)[t_{k},t_{k}+dt_{k}). Then A1|B1,,AK|BKA_{1}|B_{1},\dots,A_{K}|B_{K} is a sequence of conditional experiments, where the σ\sigma-field generated by BkB_{k} is contained by that generated by Bk+1B_{k+1}. The density function of aka_{k} conditional on Bk=bkB_{k}=b_{k} is

f(ak|bk;𝜷)\displaystyle f(a_{k}|b_{k};\boldsymbol{\beta}) =\displaystyle= λ(tk|𝒁k)dtki=1nYi(tk)λ(tk|𝒁i)dtk=exp(𝒁k𝜷)i=1nYi(tk)exp(𝒁i𝜷),\displaystyle{}\frac{\lambda(t_{k}|\bm{Z}_{k})dt_{k}}{\sum_{i=1}^{n}Y_{i}(t_{k})\lambda(t_{k}|\bm{Z}_{i})dt_{k}}=\frac{\exp(\bm{Z}_{k}^{\top}\bm{\beta})}{\sum_{i=1}^{n}Y_{i}(t_{k})\exp(\bm{Z}_{i}^{\top}\bm{\beta})}, (3)

where the second equality is obtained by canceling λ0(tk)dtk\lambda_{0}(t_{k})dt_{k} in the numerator and the denominator. To extract information from external risk scores, we adopt a similar conditional density

f(ak|bk;𝜷~)=exp{r(𝒁k,𝜷~)}i=1nYi(tk)exp{r(𝒁i,𝜷~)}.f(a_{k}|b_{k};\widetilde{\boldsymbol{\beta}})=\frac{\exp{\{r(\bm{Z}_{k},\widetilde{\boldsymbol{\beta}})\}}}{\sum_{i=1}^{n}Y_{i}(t_{k})\exp{\{r(\bm{Z}_{i},\widetilde{\boldsymbol{\beta}})\}}}. (4)

The partial likelihood-based KL information between conditional densities corresponding to the external risk scores and the internal Cox model, contained in ak|bka_{k}|b_{k}, is given by

dKL(𝜷~𝜷;tk)=E[log{f(ak|bk;𝜷~)f(ak|bk;𝜷)}|bk;𝜷~],d_{KL}(\widetilde{\boldsymbol{\beta}}\parallel\bm{\beta};t_{k})=\mbox{E}\left[\log\left\{\frac{f(a_{k}|b_{k};\widetilde{\boldsymbol{\beta}})}{f(a_{k}|b_{k};\bm{\beta})}\right\}\bigg{|}b_{k};\widetilde{\boldsymbol{\beta}}\right], (5)

where the expectation E is with respect to the conditional density f(ak|bk;𝜷~)f(a_{k}|b_{k};\widetilde{\boldsymbol{\beta}}). The accumulated KL information in the sequence of conditional experiments a1|b1,,aK|bKa_{1}|b_{1},\cdots,a_{K}|b_{K} is then defined as DKL(𝜷~𝜷)=k=1KdKL(𝜷~𝜷;tk)D_{KL}(\widetilde{\boldsymbol{\beta}}\parallel\bm{\beta})=\sum_{k=1}^{K}d_{KL}(\widetilde{\boldsymbol{\beta}}\parallel\bm{\beta};t_{k}), which measures the discrepancy between the external risk scores and the internal Cox model for the target cohort.

To integrate the external risk scores with the internal data, we construct a penalized log-partial likelihood, which combines the log-partial likelihood from the internal Cox model and the accumulated KL information DKL(𝜷~𝜷)D_{KL}(\widetilde{\boldsymbol{\beta}}\parallel\bm{\beta}),

η(𝜷)\displaystyle\ell_{\eta}(\bm{\beta}) =(𝜷)ηDKL(𝜷~𝜷),\displaystyle=\ell(\bm{\beta})-\eta D_{KL}(\widetilde{\boldsymbol{\beta}}\parallel\bm{\beta}), (6)

where η\eta is a tuning parameter weighing the relative importance of external and internal data sources. In the extreme case of η=0\eta=0, the penalized log-partial likelihood η(𝜷)\ell_{\eta}(\bm{\beta}) is reduced to the log-partial likelihood of the internal Cox model (𝜷)\ell(\bm{\beta}).

Remark 1: The goal of the penalized objective function is to balance the trade-off between the external risk scores and the internal data; i.e. maximize the log-partial likelihood of the internal data, while at the same time keeping the KL information small. When internal and external information are homogeneous, it’s intuitive that the CoxKL estimator attains better estimation efficiency and higher discrimination power. In contrast, when internal and external information are heterogeneous, external information introduces bias to the integrated model. However, due to the bias and variance trade-off, it can be shown that there exists a range of η\eta within which the CoxKL method improves the estimation efficiency; further details are provided in Section 4.

3.2 Estimation

Proposition 1 below shows that the objective function η(𝜷)\ell_{\eta}(\bm{\beta}) retains a similar form as the log-partial likelihood of a classical Cox model. Hence, the integrated model can be solved by applying the standard optimization algorithms for the classical Cox models, which ensures the computational efficiency of the proposed method. A proof of Proposition 1 is provided in Supporting Information.

Proposition 1 The accumulated KL information is given by

DKL(𝜷~𝜷)\displaystyle D_{KL}(\widetilde{\boldsymbol{\beta}}\parallel\bm{\beta}) \displaystyle\propto k=1K[𝒁~k𝜷log{i=1nYi(tk)exp(𝒁i𝜷)}],\displaystyle{}-\sum_{k=1}^{K}\left[\widetilde{\bm{Z}}_{k}^{\top}\bm{\beta}-\log\left\{\sum_{i=1}^{n}Y_{i}(t_{k})\exp(\bm{Z}_{i}^{\top}\bm{\beta})\right\}\right],

where

𝒁~k=i=1nYi(tk)exp{r(𝒁i,𝜷~)}𝒁ii=1nYi(tk)exp{r(𝒁i,𝜷~)}\displaystyle\widetilde{\bm{Z}}_{k}=\frac{\sum_{i=1}^{n}Y_{i}(t_{k})\exp{\{r(\bm{Z}_{i},\widetilde{\boldsymbol{\beta}})\}}\bm{Z}_{i}}{\sum_{i^{\prime}=1}^{n}Y_{i^{\prime}}(t_{k})\exp{\{r(\bm{Z}_{i^{\prime}},\widetilde{\boldsymbol{\beta}})\}}}

is a weighted average of the covariates for subjects at risk at time tkt_{k}, with weights defined upon external risk scores. Furthermore, the objective function η(𝛃)\ell_{\eta}(\bm{\beta}) in (6) is equivalent to

η(𝜷)\displaystyle\ell_{\eta}(\bm{\beta}) \displaystyle\propto k=1K[{𝒁k+η𝒁~k1+η}𝜷log{i=1nYi(tk)exp(𝒁i𝜷)}].\displaystyle{}\sum_{k=1}^{K}\left[\left\{\frac{\bm{Z}_{k}+\eta\widetilde{\bm{Z}}_{k}}{1+\eta}\right\}^{\top}\boldsymbol{\beta}-\log\left\{\sum_{i=1}^{n}Y_{i}(t_{k})\exp(\bm{Z}_{i}^{\top}\boldsymbol{\beta})\right\}\right]. (7)

3.3 CoxKL-LASSO: Extending CoxKL with L1L_{1} penalization

A key challenge for applying existing integration methods to develop trans-ancestry PHS models is the high-dimension of the genotype data. Fortunately, the concise form of CoxKL objective function η(𝜷)\ell_{\eta}(\bm{\beta}) in (7) facilitates its extension to high-dimensional problems, such as the LASSO penalization (Tibshirani, 1996) that simultaneously achieves variable selection and parameter estimation.

Specifically, the objective function of CoxKL method with a L1L_{1} penalty (denoted as CoxKL-LASSO) is given by

η,λ(𝜷)\displaystyle\ell_{\eta,\lambda}(\bm{\beta}) =\displaystyle= η(𝜷)λ𝜷1,\displaystyle{}\ell_{\eta}(\bm{\beta})-\lambda\|\boldsymbol{\beta}\|_{1}, (8)

where 𝜷1=j=1p|βj|\|\boldsymbol{\beta}\|_{1}=\sum_{j=1}^{p}|\beta_{j}| is the L1L_{1} norm of 𝜷\boldsymbol{\beta}, and λ\lambda is a tuning parameter that determines the amount of penalization. Since the penalized objective function in (8) has a similar form as the classical Cox Lasso (Tibshirani, 1997), the parameter estimation can be easily obtained by the commonly used coordinate descent-based procedures (Simon et al., 2011). Therefore, the proposed CoxKL method is computationally efficient for high-dimensional problems such as developing trans-ancestry PHS models for non-EUR populations.

Remark 2: LASSO penalization has been widely utilized in polygenic risk scores (PRS) for quantitative and binary traits (Erin et al., 2013; Mak et al., 2017; Chen et al., 2021). However, we could not directly apply these PRS constructing methods for developing PHS models, since the time-to-event data requires unique approaches for handling censored outcomes. We note that, in addition to trans-ancetry PHS models, the CoxKL-LASSO can also be applied to develop classical PHS (e.g. EUR-based PHS) by setting η=0\eta=0.

3.4 Tuning parameter selection

The optimal tuning parameter η\eta and λ\lambda can be determined by conventional cross-validation approaches, such as the V&VH approach (Verweij and Van Houwelingen, 1993), which computes the difference between the log-partial likelihood evaluated on the whole set and the training set as a measure of the model performance. This approach ensures the validity and stability of the calculation of log-partial likelihood, and it has been widely used for penalized Cox proportional hazards models (Friedman et al., 2010; Simon et al., 2011; Dai and Breheny, 2019).

4 Theoretical Properties

In this section, we study the asymptotic distribution of the CoxKL estimator under the low-dimensional setting (i.e. pp is fixed and pnp\ll n) and show that the proposed method improves the asymptotic mean squared error (aMSE) given a certain range of η\eta.

We first provide some prerequisite notations. For a given η\eta, let 𝜷^(η)\hat{\bm{\beta}}(\eta) be the solution obtained by maximizing the penalized objective function in (7), and let 𝜷(η)\bm{\beta}^{*}(\eta) be the maximum of the expectation of the penalized objective function 𝔼{η(𝜷))}\mathbb{E}\{\ell_{\eta}(\bm{\beta}))\}. Denote 𝜷^\hat{\bm{\beta}} as the maximum log-partial likelihood estimator of the internal model in (2). The true value of the regression parameter for the Cox model using the internal cohort only is denoted by 𝜷0\bm{\beta}_{0}. Define 𝚫=𝜷0𝜷~\bm{\Delta}=\bm{\beta}_{0}-\widetilde{\bm{\beta}}, where 𝜷~\widetilde{{\bm{\beta}}} are the estimated coefficients from the external model. For the ease of notations, we assume that the external risk scores can be constructed as 𝒁i𝜷~\bm{Z}_{i}^{\top}\widetilde{\boldsymbol{\beta}}. Denote

S(0)(t;𝜷)=\displaystyle S^{(0)}(t;\bm{\beta})= 1ni=1nYi(t)exp(𝒁i𝜷),\displaystyle{}\frac{1}{n}\sum_{i=1}^{n}Y_{i}(t)\exp(\bm{Z}_{i}^{\top}\bm{\beta}),
S(1)(t;𝜷)=\displaystyle S^{(1)}(t;\bm{\beta})= S(0)(t;𝜷)𝜷=1ni=1nYi(t)𝒁iexp(𝒁i𝜷),\displaystyle{}\frac{\partial S^{(0)}(t;\bm{\beta})}{\partial\bm{\beta}}=\frac{1}{n}\sum_{i=1}^{n}Y_{i}(t)\bm{Z}_{i}\exp(\bm{Z}_{i}^{\top}\bm{\beta}),
S(2)(t;𝜷)=\displaystyle S^{(2)}(t;\bm{\beta})= 2S(0)(t;𝜷)𝜷𝜷=1ni=1nYi(t)𝒁i2exp(𝒁i𝜷),\displaystyle{}\frac{\partial^{2}S^{(0)}(t;\bm{\beta})}{\partial\bm{\beta}\partial\bm{\beta}^{\top}}=\frac{1}{n}\sum_{i=1}^{n}Y_{i}(t)\bm{Z}_{i}^{\otimes 2}\exp(\bm{Z}_{i}^{\top}\bm{\beta}),

where 𝒃2=𝒃𝒃\bm{b}^{\otimes 2}=\bm{b}\bm{b}^{\top}. Then the score function of η(𝜷)\ell_{\eta}(\bm{\beta}) in (7) is given by

𝑼η(𝜷)=η(𝜷)𝜷=1nk=1K{𝒁k+η𝒁~k1+ηS(1)(tk;𝜷)S(0)(tk;𝜷)}.\displaystyle\bm{U}_{\eta}(\bm{\beta})=\frac{\partial\ell_{\eta}(\bm{\beta})}{\partial\bm{\beta}}=\frac{1}{n}\sum_{k=1}^{K}\left\{\frac{\bm{Z}_{k}+\eta\widetilde{\bm{Z}}_{k}}{1+\eta}-\frac{S^{(1)}(t_{k};\bm{\beta})}{S^{(0)}(t_{k};\bm{\beta})}\right\}.

The corresponding information matrix is

𝑯η(𝜷)=2η(𝜷)𝜷𝜷=1nk=1K[S(2)(tk;𝜷)S(0)(tk;𝜷){S(1)(tk;𝜷)S(0)(tk;𝜷)}2].\displaystyle\bm{H}_{\eta}(\bm{\beta})=-\frac{\partial^{2}\ell_{\eta}(\bm{\beta})}{\partial\bm{\beta}\partial\bm{\beta}^{\top}}=\frac{1}{n}\sum_{k=1}^{K}\left[\frac{S^{(2)}(t_{k};\bm{\beta})}{S^{(0)}(t_{k};\bm{\beta})}-\left\{\frac{S^{(1)}(t_{k};\bm{\beta})}{S^{(0)}(t_{k};\bm{\beta})}\right\}^{\otimes 2}\right].

To derive the theoretical properties of the CoxKL method under the low-dimensional setting, we impose the following regularity conditions for the internal cohort:

  • (C1).

    (Xi,δi,𝒁i)(X_{i},\delta_{i},\bm{Z}_{i}), i=1,,ni=1,\ldots,n, are independent and identically distributed random vectors.

  • (C2).

    P(Xiτ)=0P(X_{i}\geq\tau)=0, where τ\tau is a pre-specified time point.

  • (C3).

    Zij<κZ_{ij}<\kappa for i=1,,ni=1,\ldots,n, and j=1,,pj=1,\dots,p, where κ\kappa is a constant and ZijZ_{ij} is the jj-th component of 𝒁i\bm{Z}_{i}.

  • (C4).

    0τλ0(t)<\int_{0}^{\tau}\lambda_{0}(t)<\infty.

  • (C5).

    Continuity of s(d)(t;𝜷)s^{(d)}(t;\bm{\beta}), d=0,1,2d=0,1,2, for any 𝜷\bm{\beta} between 𝜷0\bm{\beta}_{0} and 𝜷~\widetilde{\bm{\beta}}, where s(d)(t;𝜷)s^{(d)}(t;\bm{\beta}) is the limiting value of S(d)(t;𝜷)S^{(d)}(t;\bm{\beta}), with s(1)(t;𝜷)s^{(1)}(t;\bm{\beta}) and s(2)(t;𝜷)s^{(2)}(t;\bm{\beta}) bounded, and s(0)(t;𝜷)s^{(0)}(t;\bm{\beta}) bounded away from 0 for t[0,τ]t\in[0,\tau].

  • (C6).

    Positive-definiteness of the matrix 𝒉(𝜷)=0τv(t;𝜷)s(0)(t;𝜷)λ0(t)𝑑t\bm{h}(\bm{\beta})=\int_{0}^{\tau}v(t;\bm{\beta})s^{(0)}(t;\bm{\beta})\lambda_{0}(t)dt, for any 𝜷\bm{\beta} between 𝜷0\bm{\beta}_{0} and 𝜷~\widetilde{\bm{\beta}}, where

    v(t;𝜷)=[sk(2)(𝜷)sk(0)(𝜷){sk(1)(𝜷)sk(0)(𝜷)}2].\displaystyle v(t;\bm{\beta})=\left[\frac{s_{k}^{(2)}(\bm{\beta})}{s_{k}^{(0)}(\bm{\beta})}-\left\{\frac{s_{k}^{(1)}(\bm{\beta})}{s_{k}^{(0)}(\bm{\beta})}\right\}^{\otimes 2}\right].

Conditions (C1)-(C6) are analogous to the regularity conditions employed in Andersen and Gill (1982) and Andersen et al. (1993) to prove the consistency and asymptotic normality of estimators from classical Cox models.

Lemma 1. If conditions (C1)-(C6) hold, then for each η0\eta\geq 0, 𝛃^(η)𝛃(η)=Op(n1/2)\hat{\bm{\beta}}(\eta)-\bm{\beta}^{*}(\eta)=O_{p}(n^{-1/2}).

Lemma 1 is a standard results showing that the CoxKL estimator 𝜷^(η)\hat{\bm{\beta}}(\eta) is consistent for the unique maximum for the expectation of the CoxKL objective function 𝜷(η)\bm{\beta}^{*}(\eta). By Proposition 1, the CoxKL objective function (7) retains a similar form as that of a classical Cox model, thus, Lemma 1 can be easily verified following standard arguments on the asymptotic results of maximum partial-likelihood estimator for classical Cox model under standard regularity conditions (Andersen et al., 1993).

Lemma 2. If conditions (C1)-(C6) hold, and if η=O(n1/2)\eta=O(n^{-1/2}), then as nn\rightarrow\infty,

n1/2{𝜷^(η)𝜷(η)}𝑑𝒩(𝟎,𝑮),n^{1/2}\{\hat{\bm{\beta}}(\eta)-\bm{\beta}^{*}(\eta)\}\overset{d}{\rightarrow}\mathcal{N}(\bm{0},\bm{G}), (9)

where 𝐆limn(1+η)1𝐇1(𝛃0)𝐡(𝛃0)𝐇1(𝛃0)\bm{G}\equiv\lim_{n\rightarrow\infty}(1+\eta)^{-1}\bm{H}^{-1}(\bm{\beta}_{0})\bm{h}(\bm{\beta}_{0})\bm{H}^{-1}(\bm{\beta}_{0}). If η=o(n1/2)\eta=o(n^{-1/2}), then as nn\rightarrow\infty,

n1/2{𝜷^(η)𝜷0}𝑑𝒩(𝟎,𝑮).n^{1/2}\{\hat{\bm{\beta}}(\eta)-\bm{\beta}_{0}\}\overset{d}{\rightarrow}\mathcal{N}(\bm{0},\bm{G}). (10)

Lemma 2 shows the asymptotic distribution of 𝜷^(η)\hat{\bm{\beta}}(\eta). When 𝚫=𝟎\bm{\Delta}=\bm{0}, 𝜷^(η)\hat{\bm{\beta}}(\eta) is an unbiased estimator for 𝜷0\bm{\beta}_{0}. In contrast, when 𝚫𝟎\bm{\Delta}\neq\bm{0} under population heterogeneity, as nn\rightarrow\infty and with the order of η=o(n1/2)\eta=o(n^{-1/2}), 𝜷^(η)𝑝𝜷0\hat{\bm{\beta}}(\eta)\overset{p}{\rightarrow}\bm{\beta}_{0}. Moreover, the CoxKL estimator 𝜷^(η)\hat{\bm{\beta}}(\eta) yields a considerable efficiency gain over 𝜷^\hat{\bm{\beta}} by incorporating external risk information, which can be easily seen from the form of asymptotic variance of 𝜷^(η)\hat{\bm{\beta}}(\eta). We then have the following Theorem 1, which states the advantage on the estimation of the CoxKL method.

Theorem 1. If conditions (C1)-(C6) hold, then there exists a range of η>0\eta>0, on which the asymptotic mean squared error (aMSE) of 𝛃^(η)\hat{\bm{\beta}}(\eta) is strictly less than that of 𝛃^\hat{\bm{\beta}}. Moreover, when 𝚫=𝟎\bm{\Delta}=\bm{0}, the aMSE of 𝛃^(η)\hat{\bm{\beta}}(\eta) is monotonically decreasing in [0,)[0,\infty); when 𝚫𝟎\bm{\Delta}\neq\bm{0}, the aMSE of 𝛃^(η)\hat{\bm{\beta}}(\eta) is monotonically decreasing in [0,η)[0,\eta^{*}) and monotonically increasing in [η,)[\eta^{*},\infty), with

η=trace{𝒉1(𝜷0)}2n𝚫𝒉(𝒄)𝒉2(𝜷0)𝒉(𝒄)𝚫trace{𝒉1(𝜷0)},\eta^{*}=\frac{\text{trace}\{\bm{h}^{-1}(\bm{\beta}_{0})\}}{2n\bm{\Delta}^{\top}\bm{h}(\bm{c})^{\top}\bm{h}^{-2}(\bm{\beta}_{0})\bm{h}(\bm{c})\bm{\Delta}-\text{trace}\{\bm{h}^{-1}(\bm{\beta}_{0})\}},

where 𝐜p\bm{c}\in\mathbb{R}^{p} is a vector between 𝛃0\bm{\beta}_{0} and 𝛃~\tilde{\bm{\beta}} such that

1nk=1K{S(1)(tk;𝜷~)S(0)(tk;𝜷~)S(1)(tk;𝜷0)S(0)(tk;𝜷0)}=𝑯(𝒄)𝚫.\frac{1}{n}\sum_{k=1}^{K}\left\{\frac{S^{(1)}(t_{k};\widetilde{\bm{\beta}})}{S^{(0)}(t_{k};\widetilde{\bm{\beta}})}-\frac{S^{(1)}(t_{k};\bm{\beta}_{0})}{S^{(0)}(t_{k};\bm{\beta}_{0})}\right\}=-\bm{H}(\bm{c})\bm{\Delta}.

Theorem 1 implies that the CoxKL method improves the aMSE given an appropriate range of η\eta regardless of the potential heterogeneity. When 𝚫=0\bm{\Delta}=0, 𝜷^(η)\hat{\bm{\beta}}(\eta) is unbiased, and it’s intuitive that aMSE(𝜷^(η))<aMSE(𝜷^)\text{aMSE}(\hat{\bm{\beta}}(\eta))<\text{aMSE}(\hat{\bm{\beta}}) for all η>0\eta>0 since the CoxKL method improves the estimation efficiency as proved in Lemma 2. Moreover, even if the external and internal information are heterogeneous and deviate from each other, there exists a range of η>0\eta>0 such that the CoxKL estimator 𝜷^(η)\hat{\bm{\beta}}(\eta) achieves a smaller aMSE for the model regression parameters compared to the maximum partial-likelihood estimator 𝜷^\hat{\bm{\beta}} using the internal data only. Furthermore, as mentioned in Section 2.2, we do not require that the internal data and the external risk score share the same set of covariates. Since 𝜷~\widetilde{\bm{\beta}} can be a sparse vector (i.e. some entries of 𝜷~\widetilde{\bm{\beta}} are zero), we allow internal data to include additional predictors.

5 Simulation

To evaluate the performance of the proposed CoxKL method, we conducted two sets of simulation studies. In Setting I, we focused on the evaluation of the estimation performance of CoxKL, while in Setting II, we evaluated the prediction performance of CoxKL under different scenarios, including the integration with various types of external risk score models (CoxPH and XGBoost (Chen and Guestrin, 2016)) and multiple external risk scores. The tuning parameter η\eta was selected by the 5-fold cross validation with the V&VH cross-validated partial likelihood as the measure of model performance. In addition to the internal and the external models, we compared the performance of the CoxKL method with the stacked regression method (Debray et al., 2014), which includes the prediction of the external model as a covariate in the integration model. We evaluated the model performance on an independent testing dataset, which was simulated from the same distribution as the internal data. The simulation studies were replicated 500 times.

5.1 Setting I: Evaluation of the estimation performance

In the first set of simulations, we considered 6 covariates 𝒁1,,𝒁6\bm{Z}_{1},\cdots,\bm{Z}_{6}, where 𝒁1\bm{Z}_{1}, 𝒁2\bm{Z}_{2} were continuous variables simulated from a multivariate normal distribution with zero mean, unit variance and a first-order autoregressive (AR1) correlation structure with the auto-correlation parameter 0.5. 𝒁3,𝒁4\bm{Z}_{3},\bm{Z}_{4} were binary variables with Pr(𝒁i=1)=0.5Pr(\bm{Z}_{i}=1)=0.5. Regarding 𝒁5\bm{Z}_{5} and 𝒁6\bm{Z}_{6}, to simulate the situation where the internal cohort comes from a different underlying distribution from the external population, we denoted 𝒁l\bm{Z}^{l} as a latent binary variable with Pr(𝒁il=1)=plPr(\bm{Z}^{l}_{i}=1)=p_{l}, and then let 𝒁5\bm{Z}_{5} and 𝒁6\bm{Z}_{6} be two continuous variables simulated from normal distribution with unit variance and mean 2𝒁l2\bm{Z}^{l} and 2𝒁l-2\bm{Z}^{l}, respectively. That is, the covariate distribution of the internal cohort could be different from that of the external cohort. Note that, as a latent variable, 𝒁l\bm{Z}^{l} was unobserved in the simulated datasets. The survival time for the internal data was generated from a Cox model with a hazard function of

λ(t|𝒁1i,,𝒁6i)=2t×exp(β1𝒁1i+β2𝒁2i+β3𝒁3i+β4𝒁4i+β5𝒁5i+β6𝒁6i),\displaystyle\lambda(t|\bm{Z}_{1i},\dots,\bm{Z}_{6i})=2t\times\exp(\beta_{1}\bm{Z}_{1i}+\beta_{2}\bm{Z}_{2i}+\beta_{3}\bm{Z}_{3i}+\beta_{4}\bm{Z}_{4i}+\beta_{5}\bm{Z}_{5i}+\beta_{6}\bm{Z}_{6i}), (11)

where 𝜷I=(β1,β2,β3,β4,β5,β6)=(0.3,0.3,0.3,0.3,0.3,0.3)\bm{\beta}^{I}=(\beta_{1},\beta_{2},\beta_{3},\beta_{4},\beta_{5},\beta_{6})^{\top}=(0.3,-0.3,0.3,-0.3,0.3,-0.3)^{\top}. The censoring times were generated from a uniform distribution with varying upper bounds to control for different censoring proportions. We set plI=1p_{l}^{I}=1 for the internal cohort. The true generating mechanism for the external cohort was the same as (11), but the external model was determined from a large dataset with varying plEp_{l}^{E} (% homogeneous) and 𝒁E\bm{Z}^{E} (covariates used in the external model). Specifically, we considered the following external settings:

  • (E1).

    The covariate distribution of the external population is the same as the internal cohort, that is, plE=1p_{l}^{E}=1; and the external model is the true model: 𝒁E=(𝒁1,𝒁2,𝒁3,𝒁4,𝒁5,𝒁6)\bm{Z}^{E}=(\bm{Z}_{1},\bm{Z}_{2},\bm{Z}_{3},\bm{Z}_{4},\bm{Z}_{5},\bm{Z}_{6})^{\top}.

  • (E2).

    The covariate distribution of the external population is slightly different from the internal cohort with plE=0.5p_{l}^{E}=0.5; and the external model is a misspecified model: 𝒁E=(𝒁1,𝒁3,𝒁5,𝒁6)\bm{Z}^{E}=(\bm{Z}_{1},\bm{Z}_{3},\bm{Z}_{5},\bm{Z}_{6})^{\top}.

  • (E3).

    The covariate distribution of the external population is completely different from the internal cohort with plE=0p_{l}^{E}=0; and the external model is a misspecified model: 𝒁E=(𝒁1,𝒁5)\bm{Z}^{E}=(\bm{Z}_{1},\bm{Z}_{5})^{\top}.

Table 1: Comparison of estimation performance under simulation setting I. Internal = Cox proportional hazards model based on the internal data only; CoxKL = integrated model by the proposed CoxKL method. For the internal data, the sample size varied from {50,100}\{50,100\}, and the censoring rate varied from {30%,60%}\{30\%,60\%\}. For external model, setting (E1) corresponds to the good-quality external model, setting (E2) corresponds to the fair-quality external model, while setting (E3) corresponds to the poor-quality external model. The simulation study was replicated 500 times. Bias = average empirical bias; SE = average empirical standard error; MSE = average empirical mean squared error. C-Index = Harrel’s C-Index. Lower Bias, SE, and MSE indicated better estimation performance. Higher C-index indicated better prediction performance.
External model
Method Setting % Homogeneous Covariates Bias SE MSE C-Index
Internal data: sample size = 50, censoring rate = 60%
Internal 0.057 0.475 0.229 0.594
CoxKL (E1) 100 (𝒁1,𝒁2,𝒁3,𝒁4,𝒁5,𝒁6)(\bm{Z}_{1},\bm{Z}_{2},\bm{Z}_{3},\bm{Z}_{4},\bm{Z}_{5},\bm{Z}_{6}) 0.023 0.136 0.019 0.640
(E2) 50 (𝒁1,𝒁3,𝒁5,𝒁6)(\bm{Z}_{1},\bm{Z}_{3},\bm{Z}_{5},\bm{Z}_{6}) 0.101 0.165 0.038 0.624
(E3) 0 (𝒁1,𝒁5)(\bm{Z}_{1},\bm{Z}_{5}) 0.145 0.192 0.058 0.605
Internal data: sample size = 50, censoring rate = 30%
Internal 0.040 0.354 0.127 0.608
CoxKL (E1) 100 (𝒁1,𝒁2,𝒁3,𝒁4,𝒁5,𝒁6)(\bm{Z}_{1},\bm{Z}_{2},\bm{Z}_{3},\bm{Z}_{4},\bm{Z}_{5},\bm{Z}_{6}) 0.022 0.114 0.014 0.642
(E2) 50 (𝒁1,𝒁3,𝒁5,𝒁6)(\bm{Z}_{1},\bm{Z}_{3},\bm{Z}_{5},\bm{Z}_{6}) 0.088 0.149 0.030 0.627
(E3) 0 (𝒁1,𝒁5)(\bm{Z}_{1},\bm{Z}_{5}) 0.129 0.176 0.048 0.612
Internal data: sample size = 100, censoring rate = 30%
Internal 0.019 0.207 0.043 0.628
CoxKL (E1) 100 (𝒁1,𝒁2,𝒁3,𝒁4,𝒁5,𝒁6)(\bm{Z}_{1},\bm{Z}_{2},\bm{Z}_{3},\bm{Z}_{4},\bm{Z}_{5},\bm{Z}_{6}) 0.023 0.066 0.005 0.645
(E2) 50 (𝒁1,𝒁3,𝒁5,𝒁6)(\bm{Z}_{1},\bm{Z}_{3},\bm{Z}_{5},\bm{Z}_{6}) 0.079 0.111 0.019 0.633
(E3) 0 (𝒁1,𝒁5)(\bm{Z}_{1},\bm{Z}_{5}) 0.093 0.138 0.028 0.627

Refer to caption

Figure 1: Comparison of MSE and C-index with different η\eta values under simulation setting I. Internal = Cox proportional hazards model based on the internal data only; CoxKL = integrated model by the proposed CoxKL method. Panel A corresponds to external Setting (E1) with one good-quality external model. Panel B corresponds to external Setting (E2) with one fair-quality external model. Panel C corresponds to external Setting (E3) with one poor-quality external model. Details of external settings can be found in Section 5.1. The simulation study was replicated 500 times. MSE = average empirical mean squared error. C-Index = Harrel’s C-Index. Lower MSE indicated better estimation performance. Higher C-index indicated better prediction performance.

Table 1 and Figure 1 summarize the results of simulation Setting I, including average empirical biases (Bias), average empirical standard errors (SE), and average empirical mean squared errors (MSE) for the estimated regression coefficients 𝜷I=(β1,β2,β3,β4,β5,β6)\bm{\beta}^{I}=(\beta_{1},\beta_{2},\beta_{3},\beta_{4},\beta_{5},\beta_{6})^{\top}. In addition, we also presented the Harrel’s C-Index (Harrel Jr et al., 1996). As shown in Table 1, under all three external settings regarding good, fair, and poor external model quality, the integration model by the CoxKL method attained better estimation efficiency than the internal model, with smaller SE and MSE; moreover, the SE and MSE of the CoxKL method decreased as the quality of the external risk information increased. Incorporating external information from homogeneous models (external setting E1) reduced the bias when the quality of the internal cohort was limited. Unsurprisingly, when the external models and internal cohort were heterogeneous, incorporating information from misspecified external risk scores increased the bias. However, due to the bias and variance trade-off, the MSE of the CoxKL estimator still decreased. Compared to the internal model, the CoxKL method also improved the prediction performance by incorporating external information, with a higher C-Index. As the internal sample size increased or the internal censoring rate decreased, the gain on estimation efficiency and prediction accuracy by incorporating external information via the CoxKL method decreased, as expected. As shown in Figure 1, there existed a range of integration weight η\eta that the integration model by CoxKL method reduced the MSE and increased the C-Index compared to the internal model. With a decrease in the external information quality, the range of η\eta that improved the model performance shrank; that is, the optimal integration weight of the external risk scores decreased. However, the CoxKL method outperformed the internal model in all cases with a properly selected integration weight η\eta.

5.2 Setting II: Evaluation of the prediction performance

In Setting II, we included 5 covariates 𝒁1,,𝒁5\bm{Z}_{1},\dots,\bm{Z}_{5}, which were simulated from the same distributions as in Setting I. In addition to the main effect, we also considered the interaction term 𝒁1𝒁5\bm{Z}_{1}*\bm{Z}_{5} in the data generating model. Since in practice, we usually cannot observe and collect all the variables in the true model, here we simulated two additional variables, 𝒁u1\bm{Z}^{u1} and 𝒁u2\bm{Z}^{u2}, which contributed to the true underlying generating distribution of the data, but were not observed and collected in the simulated datasets. 𝒁u1\bm{Z}^{u1} was a binary variable with Pr(𝒁iu1=1)=0.5Pr(\bm{Z}^{u1}_{i}=1)=0.5, and 𝒁u2\bm{Z}^{u2} came from a standard normal distribution. The true generating model for the internal data was a Cox model with a hazard function of

λ(t|𝒁1i,,𝒁5i,𝒁iu1,𝒁iu2)=2t\displaystyle\lambda(t|\bm{Z}_{1i},\dots,\bm{Z}_{5i},\bm{Z}^{u1}_{i},\bm{Z}^{u2}_{i})=2t ×\displaystyle\times exp(β1𝒁1i+β2𝒁2i+β3𝒁3i+β4𝒁4i+β5𝒁5i\displaystyle{}\exp(\beta_{1}\bm{Z}_{1i}+\beta_{2}\bm{Z}_{2i}+\beta_{3}\bm{Z}_{3i}+\beta_{4}\bm{Z}_{4i}+\beta_{5}\bm{Z}_{5i} (12)
+β6𝒁1i𝒁5i+βu1𝒁iu1+βu2𝒁iu2),\displaystyle{}+\beta_{6}\bm{Z}_{1i}*\bm{Z}_{5i}+\beta_{u1}\bm{Z}^{u1}_{i}+\beta_{u2}\bm{Z}^{u2}_{i}),

where (β1,β2,β3,β4,β5,β6,βu1,βu2)=(0.3,0.3,0.3,0.3,0.3,0.5,1,1)(\beta_{1},\beta_{2},\beta_{3},\beta_{4},\beta_{5},\beta_{6},\beta_{u1},\beta_{u2})^{\top}=(0.3,-0.3,0.3,-0.3,-0.3,0.5,1,1)^{\top}. The true generating mechanism for the external cohort was the same as (12). However, the external models were determined from large datasets with varying plp_{l} and 𝒁E\bm{Z}^{E}.

Refer to caption

Figure 2: Comparison of prediction performance under simulation setting II. Panel A corresponds to External setting (E4) with two good-quality external models. Panel B corresponds to External setting (E5) with one good-quality external model and one poor-quality external model. Panel C corresponds to External setting (E6) with two poor-quality external models. Details of external settings can be found in Section 5.2. Internal = Cox proportional hazards model based on the internal data only; External1 was derived from a XGBoost model, and External2 was derived from a Cox model, as specified in settings (E4), (E5), and (E6); CoxKL = integrated model by CoxKL method incorporates both External1 and External2; Stacked = stacked model by stacking regression incorporates both External1 and External2. Plots on the left panel present prediction performance of each method. Higher C-index indicated better prediction performance. Plots on the right panel present the optimal selection of the integration weight η\eta by CoxKL. Larger η\eta indicated the external model contributed more information to the integrated model. Simulation was replicated 500 times.

We first considered the settings with two external risk scores derived from different types of models. Specifically, in each setting, the External1 was derived from a XGBoost model, and the External2 was derived from a CoxPH model. Two extreme settings with extremely well or poorly performing external models were explored. Details are given as follows:

  • (E4).

    Two good-quality external models: External1: plE=1p_{l}^{E}=1, and 𝒁E=(𝒁1,𝒁2,𝒁3,𝒁4,𝒁5)\bm{Z}^{E}=(\bm{Z}_{1},\bm{Z}_{2},\bm{Z}_{3},\bm{Z}_{4},\bm{Z}_{5})^{\top}; External2: plE=1p_{l}^{E}=1, and 𝒁E=(𝒁1,𝒁2,𝒁3,𝒁4,𝒁5)\bm{Z}^{E}=(\bm{Z}_{1},\bm{Z}_{2},\bm{Z}_{3},\bm{Z}_{4},\bm{Z}_{5})^{\top}.

  • (E5).

    One good-quality external model and one poor-quality external model. External1: plE=0.25p_{l}^{E}=0.25, and 𝒁E=(𝒁1,𝒁3,𝒁5)\bm{Z}^{E}=(\bm{Z}_{1},\bm{Z}_{3},\bm{Z}_{5})^{\top}; External2: plE=0.25p_{l}^{E}=0.25, and 𝒁E=(𝒁2,𝒁4)\bm{Z}^{E}=(\bm{Z}_{2},\bm{Z}_{4})^{\top}.

  • (E6).

    Two poor-quality external models: External1: plE=0p_{l}^{E}=0, and 𝒁E=(𝒁2,𝒁4)\bm{Z}^{E}=(\bm{Z}_{2},\bm{Z}_{4})^{\top}; External2: Null model.

For the internal cohort, we set the sample size as 50, the censoring rate as around 30%, and plI=1p_{l}^{I}=1. As shown in Figure 2, under extreme scenarios such as both external risk scores were derived from extremely well or poorly performing models, the CoxKL method could achieve comparable prediction performance as the external or internal model by selecting an extremely large or small integration weight η\eta. Moreover, when integrating two external risk scores with different prediction performances, the CoxKL method could optimally determine the weight of information integration, assigning more weight to the better performing external risk score, and resulting in improved prediction performance with the highest C-Index. Incorporating information from external models resulted in a narrower interquartile range (IQR) compared with that of the internal model, which indicates that the proposed method reduced the variance of prediction. Note that under all three settings, the External1 risk scores were derived by the XGBoost, which indicates that the CoxKL method worked well when integrating a risk score derived from a machine learning model.

Refer to caption

Figure 3: Comparison of prediction performance under simulation setting II. For plots on the left panel, the sample size of the internal data varied from {50,75,100}\{50,75,100\} with the event rate (1 - censoring rate) set to be 40%. For plots on the right panel, the event rate of the internal data varied from {40%,60%,80%}\{40\%,60\%,80\%\} with the sample size set to be 50. Panel A corresponds to a fair-quality external model setting with plE=1p_{l}^{E}=1 (%Homogeneous=100\%\text{Homogeneous}=100) and covariates as 𝒁E=(𝒁1,𝒁3,𝒁5)\bm{Z}^{E}=(\bm{Z}_{1},\bm{Z}_{3},\bm{Z}_{5}). Panel B corresponds to a poor-quality external model setting with plE=0p_{l}^{E}=0 (%Homogeneous=0\%\text{Homogeneous}=0) and covariates as 𝒁E=(𝒁1,𝒁3,𝒁5)\bm{Z}^{E}=(\bm{Z}_{1},\bm{Z}_{3},\bm{Z}_{5}). Higher C-index indicates better prediction performance. Results were averaged for 500 replicates.

Then we considered different settings in terms of sample sizes and censoring rates of the internal cohort. We set plI=1p_{l}^{I}=1 for the internal cohort, and let plEp_{l}^{E} varying from {0,1}\{0,1\} for the external cohort. The external covariates were set as 𝒁E={𝒁1,𝒁2,𝒁3}\bm{Z}^{E}=\{\bm{Z}_{1},\bm{Z}_{2},\bm{Z}_{3}\}. The sample size of the internal cohort varied from {50,75,100}\{50,75,100\}, and the censoring rate of the internal data varied from {20%,50%,80%}\{20\%,50\%,80\%\}. The simulation results are shown in Figure 3. With increasing censoring rates of the internal data, the CoxKL method outperformed the internal and stacked models with higher C-index and smaller variance of prediction. The prediction performances of the internal and stacked models decreased rapidly as the censoring rate increased or the sample size decreased. In addition, for fairly small datasets, incorporating information from the external prediction model by the CoxKL method improved the prediction performance over the internal model. As expected, the improvement decreased when the internal cohort contained more data.

6 Application: Prostate Cancer Trans-ancestry Polygenic Hazard Score

Prostate cancer is the most prevalent cancer and second leading cause of cancer death in American males (Siegel et al., 2022). Identifying high-risk population and offering personalized suggestions on when to initialize prostate cancer screening is important to reduce the cancer-related morbidity and mortality (Schröder et al., 2014). PHS46 has shown accurate discrimination of onset of prostate cancer in EUR males but reduced performance in non-EUR males, such as AFR males (Huynh-Le et al., 2021). To improve the risk discrimination of prostate cancer in AFR males, we applied CoxKL to develop a trans-ancestry PHS by integrating PHS46 with a small-sized internal dataset of AFR individuals.

The PHS46 model is a Cox proportional hazards model developed using a genotype dataset consisting of EUR males. By analyzing associations between genetic factors and the age at diagnosis of prostate cancer, the model includes 46 SNPs and predicts prostate cancer risk in EUR males. Based on the external validation results, the PHS46 model performs poorly with a C-index of 0.55 in AFR males (Karunamuni et al., 2021). Thus, to improve the risk discrimination for prostate cancer in AFR males, an ancestry specific PHS model is desired.

The internal dataset obtained from the Michigan Genomics Initiative (MGI) included 1,036 AFR males with both genotype and phenotype data (Zawistowski et al., 2021). The outcome was defined as the age at diagnosis of prostate cancer. For subjects who were not diagnosed with prostate cancer, the study follow-up was censored at the age of the last visit, and the overall censoring rate was 82.5%82.5\%. Standard quality control was performed on the genotype data. Eventually, a total of 161,861 imputed SNPs passing a quality threshold of R2>0.3R^{2}>0.3, with a minor allele frequency >0.05>0.05, and LD pruned with R2>0.1R^{2}>0.1, +/500kb+/-500kb and step size=1\text{step size}=1 were considered in the analysis. In addition, we conducted a GWAS analysis to further identify the SNPs that may be associated with the outcome using univariate Cox regression models. Top 10,000 SNPs identified based on p-values from the GWAS results were included in the PHS model development.

With the moderate sample size, high censoring rate and high-dimensional covariate space, we considered incorporating PHS46 with the internal data by utilizing the proposed CoxKL-LASSO procedure (introduced in Section 3.2). The internal data was randomly split into training and testing sets with a ratio of 3:1. We developed PHS models on the training set and evaluated the model performance on the testing set. A total of 25 independent random splits were implemented. As comparisons, results of the PHS46 model and the internal model by applying a penalized Cox regression model with LASSO on the internal data are also presented.

Refer to caption

Figure 4: Results of the PHS models predicting the onset of prostate cancer in African American ancestry individuals. Sample size of the internal data is n=1036n=1036, and the censoring rate is 82.5%82.5\%. Panel A = comparison of prediction performance across internal, CoxKL, and PHS46 models. Panel B = performance of the integrated model by CoxKL-LASSO method with different weight η\eta. CoxKL = integrated model by the proposed CoxKL-LASSO incorporating PHS46 with the internal data. Internal = Cox LASSO model fitted on the internal data only. PHS46 = the published PHS model for prostate cancer in European ancestry. A total of 10,000 SNPs after standard quality control and screening were included in the developments of the PHS models. The internal data was randomly split into training and testing sets with a ratio of 3:1. We developed PHS models on the training set and evaluated the model performance on the testing set. A total of 25 independent random splits were implemented. The optimal η\eta was selected to be around 33. Larger C-index indicates better prediction performance.

Refer to caption

Figure 5: Risk discrimination of the PHS models on the onset of prostate cancer in African American ancestry males. Kaplan-Meier patient Pca-free survival curves by each PHS model are presented. Panel A = risk discrimination by the proposed CoxKL-LASSO method. Panel B = risk discrimination by the PHS46 model. Panel C = risk discrimination by the internal model. PHS46 = the published PHS for prostate cancer in European ancestry. CoxKL = the integrated model by the proposed CoxKL-LASSO incorporating PHS46 with the internal data. Internal = Cox LASSO model fitted on the internal data only. PCa = prostate cancer. High-risk populations (subjects with PHS score ranging from 81100%81-100\%; represented by the orange line) were clearly identified by the proposed CoxKL method. Larger C-index indicates better prediction performance.

As shown in Figure 4, the proposed method had the best prediction performance compared to the internal model and PHS46 model. Specifically, when the dimension of the covariate space was much larger than the sample size, it was difficult to develop a proper PHS model based on the internal small-sized data only. Therefore, the CoxKL method incorporated more information from PHS46 with an optimal η\eta at around 33. Furthermore, as shown in Figure 5, the CoxKL method also achieved better risk discrimination of prostate cancer in AFR males than both the internal and PHS46 models. High-risk populations (subjects with PHS score ranging from 81100%81-100\%) were clearly identified by the proposed CoxKL method, which indicates that our method can be used for better personalized suggestions on when to initialize prostate cancer screening for AFR males.

7 Discussion

In this paper, we proposed the CoxKL method to incorporate external risk scores derived from published prediction models with internal time-to-event data. Partial-likelihood-based KL information is utilized to provide an overall assessment of heterogeneity across populations, and a penalized objective function is developed to balance the contribution of different information sources. Extensive simulation studies show the advantages of the CoxKL method, including improvements on estimation efficiency and prediction accuracy.

The CoxKL method is computationally efficient and, thus, can be easily extended to integration problems under high-dimensional settings. We presented CoxKL-LASSO for developing trans-ancestry PHS prediction models in non-EUR populations. We applied our method to develop a trans-ancestry PHS model for prostate cancer by integrating PHS46, a previously published EUR-based PHS, with a small-sized AFR genotype data. The prostate cancer risk discrimination in AFR males has been substantially improved (C-Index from 0.49 [Internal] and 0.54 [PHS46] to 0.68 [CoxKL]). The proposed AFR-based trans-ancestry PHS can clearly distinguish subjects with high prostate cancer risk.

As the CoxKL method summarizes external risk information into personalized risk scores, it provides a unified framework for integrating risk information from various types of external models, including prediction models by machine learning algorithms, as long as the external models can provide personalized risk scores for each subject in the internal data. Moreover, as shown by both theoretical properties and simulation studies, the CoxKL method allows internal data include additional covariates than that used in external risk score models. The CoxKL method can also be easily extended to the integration with risk information from multiple external models, details are given in Supporting Information.

In summary, the proposed CoxKL method serves as a general and flexible integration tool for time-to-event data, with an important application in developing trans-ancestry PHS models and enhancing risk discrimination in non-EUR populations.

Acknowledgements

The authors acknowledge the Michigan Genomics Initiative participants, Precision Health at the University of Michigan, the University of Michigan Medical School Central Biorepository, the University of Michigan Medical School Data Office for Clinical and Translational Research, the University of Michigan Advanced Genomics Core for providing data storage and specimen storage, management, processing, and distribution services, and the Center for Statistical Genetics in the Department of Biostatistics at the School of Public Health for genotype data curation, imputation, and management in support of the research reported in this publication. This work was partially supported by the National Institute of Diabetes and Digestive and Kidney Diseases under Grant R01DK129539.

References

  • Andersen et al. (1993) Andersen, P. K., Borgan, O., Gill, R. D., and Keiding, N. (1993). Statistical Models Based on Counting Processes. New York, NY: Springer New York.
  • Andersen and Gill (1982) Andersen, P. K. and Gill, R. D. (1982). Cox’s regression model for counting processes: A large sample study. The Annals of Statistics 10, 1100–1120.
  • Bray et al. (2018) Bray, F., Ferlay, J., Soerjomataram, I., Siegel, R. L., Torre, L. A., and Jemal, A. (2018). Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA: a Cancer Journal for Clinicians 68, 394–424.
  • Chen et al. (2021) Chen, T., Chatterjee, N., Landi, M. T., and Shi, J. (2021). A penalized regression framework for building polygenic risk models based on summary statistics from genome-wide association studies and incorporating external information.
  • Chen and Guestrin (2016) Chen, T. and Guestrin, C. (2016). Xgboost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on knowledge discovery and data mining page 785–794.
  • Chen et al. (2021) Chen, Z., Ning, J., Shen, Y., and Qin, J. (2021). Combining primary cohort data with external aggregate information without assuming comparability. Biometrics 77, 1024–1036.
  • Cox (1972) Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society: Series B 34, 187–202.
  • Cox (1975) Cox, D. R. (1975). Partial likelihood. Biometrika 62, 269–279.
  • Dai and Breheny (2019) Dai, B. and Breheny, P. (2019). Cross validation approaches for penalized Cox regression. arXiv:1905.10432 .
  • Debray et al. (2014) Debray, T. P., Koffijberg, H., Nieboer, D., Vergouwe, Y., Steyerberg, E. W., and Moons, K. G. (2014). Meta-analysis and aggregation of multiple published prediction models. Statistics in Medicine 33, 2341–2362.
  • DeSantis et al. (2016) DeSantis, C. E., Siegel, R. L., Sauer, A. G., Miller, K. D., Fedewa, S. A., Alcaraz, K. I., and Jemal, A. (2016). Cancer statistics for African Americans, 2016: Progress and opportunities in reducing racial disparities. CA: a Cancer Journal for Clinicians 66, 290–308.
  • Erin et al. (2013) Erin, A., Wei, P., and Shen, X. (2013). Penalized regression and risk prediction in genome-wide association studies. Statistical Analysis and Data Mining 6, 315–328.
  • Estes et al. (2017) Estes, J. P., Mukherjee, B., and Taylor, J. M. G. (2017). Empirical Bayes estimation and prediction using summary-level information from external big data sources adjusting for violations of transportability. Statistics in Biosciences 10, 568–586.
  • Friedman et al. (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software 33, 1–22.
  • Gao et al. (2023) Gao, X. R., Chiariglione, M., Qin, K., Nuytemans, K., Scharre, D. W., Li, Y., and Martin, E. R. (2023). Explainable machine learning aggregates polygenic risk scores and electronic health records for alzheimer’s disease prediction. Scientific Reports 13, 450.
  • Harrel Jr et al. (1996) Harrel Jr, F. E., Lee, K. L., and Mark, D. B. (1996). Tutorial in biostatistics: multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing error. Statistics in Medicine 15, 361–387.
  • Hector and Martin (2022) Hector, E. and Martin, R. (2022). Turning the information-sharing dial: efficient inference from different data sources. arXiv: 2207.08886 .
  • Huang et al. (2016) Huang, C., Qin, J., and Tsai, H. (2016). Efficient estimation of the Cox model with auxiliary subgroup survival information. Journal of the American Statistical Association 111, 787–799.
  • Huang et al. (2017) Huang, F. W., Mosquera, J. M., Garofalo, A., Oh, C., Baco, M., Amin-Mansour, A., Rabasha, B., et al. (2017). Exome sequencing of African-American prostate cancer reveals loss-of-function erf mutations. Cancer Discovery 7, 973–983.
  • Huynh-Le et al. (2021) Huynh-Le, M., Fan, C. C., Karunamuni, R., Thompson, W., Martinez, M. E., Eeles, R., Kote-Jarai, Z., et al. (2021). Polygenic hazard score is associated with prostate cancer in multi-ethnic populations. Nature Communications 12, 1236–9.
  • Jiang et al. (2016) Jiang, Y., He, H., and Zhang, H. (2016). Variable selection with prior information for generalized linear models via the prior lasso method. Journal of the American Statistical Association 111, 355–376.
  • Kalbfleisch and Prentice (2002) Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data. John Wiley and Sons, Second edition.
  • Karunamuni et al. (2021) Karunamuni, R. A., Huynh‐Le, M. P., Fan, C. C., Thompson, W., Eeles, R. A., Kote‐Jarai, Z., Muir, K., et al. (2021). African‐specific improvement of a polygenic hazard score for age at diagnosis of prostate cancer. International Journal of Cancer 148, 99–105.
  • Kullback and Leibler (1951) Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics 22, 79–86.
  • Landry et al. (2018) Landry, L. G., Ali, N., Williams, D. R., Rehm, H. L., and Bonham, V. L. (2018). Lack of diversity in genomic databases is a barrier to translating precision medicine research into practice. Health Affairs 37, 780–785.
  • Liu and Shum (2003) Liu, C. and Shum, H. Y. (2003). Kullback-Leibler boosting. 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings 1, I–I.
  • Mak et al. (2017) Mak, T. S. H., Porsch, R. M., Choi, S. W., Zhou, X., and Sham, P. C. (2017). Polygenic scores via penalized regression on summary statistics. Genetic Epidemiology 41, 469–480.
  • Martin et al. (2017) Martin, A. R., Gignoux, C. R., Walters, R. K., Wojcik, G. L., Neale, B. M., Gravel, S., Daly, M. J., et al. (2017). Human demographic history impacts genetic risk prediction across diverse populations. The American Journal of Human Genetics 100, 635–649.
  • Schapire et al. (2005) Schapire, R., Rochery, M., Gilbert, M., and Gupta, N. (2005). Boosting with prior knowledge for call classification. IEEE transactions on speech and audio processing 13, 174–181.
  • Schröder et al. (2014) Schröder, F. H., Hugosson, J., Roobol, M. J., Tammela, T. L. J., Zappa, M., Nelen, V., Kwiatkowski, M., et al. (2014). Screening and prostate cancer mortality: results of the European Randomised Study of Screening for Prostate Cancer (erspc) at 13 years of follow-up. The Lancet 384, 2027–2035.
  • Seibert et al. (2018) Seibert, T. M., Fan, C. C., Wang, Y., Zuber, V., Karunamuni, R., Parsons, J. K., Eeles, R. A., et al. (2018). Polygenic hazard score to guide screening for aggressive prostate cancer: development and validation in large scale cohorts. BMJ 360, j5757–j5757.
  • Sheng et al. (2021) Sheng, Y., Sun, Y., Huang, C., and Kim, M. (2021). Synthesizing external aggregated information in the penalized Cox regression under population heterogeneity. Statistics in Medicine 40, 4915–4930.
  • Siegel et al. (2022) Siegel, R. L., Miller, K. D., Fuchs, H. E., and Jemal, A. (2022). Cancer statistics, 2022. CA: A Cancer Journal for Clinicians 72, 7–33.
  • Simon et al. (2011) Simon, N., Friedman, J., Hastie, T., and Tibshirani, R. (2011). Regularization paths for Cox’s proportional hazards model via coordinate descent. Journal of Statistical Software 39, 1–13.
  • Simon et al. (2011) Simon, R. M., Subramanian, J., Li, M. C., and Menezes, S. (2011). An evaluation of resampling methods for assessment of survival risk classifiers based on high-dimensional data. Briefings in Bioinformatics 12, 203–214.
  • Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, Methodological 58, 267–288.
  • Tibshirani (1997) Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Statistics in Medicine 16, 385–395.
  • Verweij and Van Houwelingen (1993) Verweij, P. J. and Van Houwelingen, H. C. (1993). Cross-validation in survival analysis. Statistics in Medicine 12, 2305–2314.
  • Vilhjálmsson et al. (2015) Vilhjálmsson, B. J., Yang, J., Finucane, H. K., Gusev, A., Lindström, S., Ripke, S., Genovese, G., et al. (2015). Modeling linkage disequilibrium increases accuracy of polygenic risk scores. The American Journal of Human Genetics 97, 576–592.
  • Wang et al. (2021) Wang, D., Ye, W., Sung, R., Jiang, H., Taylor, J. M. G., Ly, L., and He, K. (2021). Kullback-Leibler-based discrete failure time models for integration of published prediction models with new time-to-event dataset. arXiv:2101.02354 .
  • Zawistowski et al. (2021) Zawistowski, M., Fritsche, L. G., Pandit, A., Vanderwerff, B., Patil, S., Schmidt, E. M., VandeHaar, P., et al. (2021). The Michigan Genomics Initiative: a biobank linking genotypes and electronic clinical records in Michigan Medicine patients. medRxiv: 2021.12.15.21267864 .
  • Zhang et al. (2020) Zhang, H., Deng, L., Schiffman, M., Qin, J., and Yu, K. (2020). Generalized integration model for improved statistical inference by leveraging external summary data. Biometrika 107, 689–703.