FastCPH: Efficient Survival Analysis for Neural Networks

Xuelin Yang¹ Louis Abraham² Sejin Kim³ Petr Smirnov³
Feng Ruan⁴ Benjamin Haibe-Kains³ Robert Tibshirani¹

Abstract

The Cox proportional hazards model is a canonical method in survival analysis for prediction of the life expectancy of a patient given clinical or genetic covariates – it is a linear model in its original form. In recent years, several methods have been proposed to generalize the Cox model to neural networks, but none of these are both numerically correct and computationally efficient. We propose FastCPH, a new method that runs in linear time and supports both the standard Breslow and Efron methods for tied events. We also demonstrate the performance of FastCPH combined with LassoNet, a neural network that provides interpretability through feature sparsity, on survival datasets. The final procedure is efficient, selects useful covariates and outperforms existing CoxPH approaches.

1 Introduction

Survival analysis is a domain in statistics that studies the dependency of the time to the occurrence of an event on predictor variables. We usually call the estimated period "duration" and the event of interest "death" or "failure." Censored data, where the endpoint of observation is not a failure, is an important component in this field and requires specialized techniques (in this work, censored data refers to right-censored data, the most common type).
The Cox proportional hazards model (CoxPH) is a classic semi-parametric method to handle censored data (Cox 1972). It was originally used as a linear regression model, supposing the log risk of failure is a linear combination of the clinical or genetic predictor variables. Its core idea is that the dependency of the hazard rate on covariates is time-invariant and multiplicative. The formulation of the CoxPH likelihood is explained in Section 2. A major advantage of CoxPH models over methods like Kaplan-Meier curves and the log-rank test is their ability to work easily with quantitative predictor variables and to generalize patterns from censored data (Efron 1988). Therefore, they are particularly suitable for survival analysis and prediction and are applied extensively in the biomedical field, including in analyzing gene expression and the likelihood of various diseases such as liver diseases, coronary heart disease, diabetes, etc. (Chang et al. 1990; Li and Sun 2020; Bastien 2004; Sleeper and Harrington 1990). CoxPH models also have applications beyond biomedicine. For example, when applied to bank failure, the CoxPH model gives lower type I errors than multiple discriminant analysis methods (Lane, Looney, and Wansley 1986).
In classical survival studies using CoxPH, a lot of feature engineering effort is needed to make the model work well. Over the years, several methods have been proposed to generalize it to nonlinear situations, allowing a more complex formulation of the log-risk function, with mixed results (Mariani et al. 1997; Xiang et al. 2000; Sargent 2001).

1.1 Related work

	Sksurv	PyCox	DeepSurv	PySurvival	FastCPH (Ours)
Time complexity	$\mathcal{O}(n)$	$\mathcal{O}(n)$	$\mathcal{O}(n)$	$\mathcal{O}(n^{2})$	$\mathcal{O}(n)$
Deep learning	✗	✓	✓	✗	✓
Handling ties
Tie-awareness	✓	✗	✗	✓	✓
Efron approximation	✓	✗	✗	✓	✓
Breslow approximation	✓	✗	✗	✗	✓

Table 1: Comparisons for different Cox PH implementations

Deep learning in survival analysis

With the rise of deep learning applications in many scientific fields, some studies have tried to combine CoxPH functions with deep neural networks for better time-to-event predictions for larger datasets (Katzman et al. 2018; Spooner et al. 2020; Ching, Zhu, and Garmire 2018). The inclusion of a neural network can simplify the a priori covariates selection and make the model adaptively learn them while preserving the effectiveness of CoxPH functions in survival analysis. Our work focuses on the deep learning method to model the survival hazards using CoxPH.

Handling ties

CoxPH is developed under the assumption of continuous data, but tied event times are common in real datasets. Several ways have been proposed to deal with ties, including the exact partial likelihood, the Breslow approximation, and the Efron approximation, the latter two being well-tested in theory and widely used in practice (Breslow 1975; Efron 1977; Therneau and Grambsch 2000). The most commonly used approach is the Breslow method, which simply uses the number of tied events as the exponent in the denominator of the relative risk. The Efron method is thought to produce superior outcomes, yet its formulation is more complicated to implement efficiently. Its difference from the Breslow approach is minor when the number of ties is not large (Therneau 1997). Our work supports both the Breslow and Efron methods to handle ties.

Implementations of CoxPH

While most studies focus on the application of linear and nonlinear CoxPH to survival datasets, few of them emphasize the implementation of the Cox partial likelihood function itself, which is the foundation of CoxPH-based methods. We review four popular implementations of the CoxPH model, summarized in Table 1:

•

DeepSurv (Katzman et al. 2018) implements a deep learning generalization of the Cox proportional hazards model using Theano and Lasagne. It supports tensor operations and runs in $\mathcal{O}(n)$ by vectorized cumulative sums over the entire input, but assumes there are no tied events.
•

Pycox (Kvamme, Borgan, and Scheel 2019) provides a Python package for survival analysis and time-to-event prediction with PyTorch. It computes in $\mathcal{O}(n)$ a cumulative sum of all input hazards but not the true risk sets of the CoxPH function.
•

PySurvival (Fotso et al. 2019–) is an open-source Python package for survival analysis modeling published in 2019 and compatible with PyTorch. Its implementation of the nonlinear CoxPH model is based on DeepSurv (Katzman et al. 2018), while adding an $\mathcal{O}(n^{2})$ index matrix for Efron’s method to handle ties.
•

Scikit-survival (Pölsterl 2020a) (sksurv) is a Python library for survival analysis compatible with scikit-learn (Pedregosa et al. 2011), published in 2020. It implements the Breslow and Efron approximations of the CoxPH model in $\mathcal{O}(n)$ using an inner for-loop at each distinct event time when going through all events. However, it does not support deep learning.

It is important for CoxPH-based methods to properly handle ties, support efficient computation, and use the correct approximation at the same time. Simply assuming the absence of ties or ignoring all tied cases is statistically inappropriate. Failure to handle ties and oversimplifications may cause unexpected consequences, especially when the behaviors of neural networks possess randomness and the results can be hard to interpret (Therneau and Grambsch 2000; Borucka 2014). However, none of the existing popular survival analysis packages have CoxPH implemented in both an efficient and correct way for neural networks.
In recent years, the CoxPH model has been applied to survival datasets of larger scale and complexity (Spooner et al. 2020). Existing methods either sacrifice the correct formula for computational simplicity or the other way around, whereas a method can both strictly follow the well-tested approximations and be efficient. Here we present the Fast Cox Proportional Hazard model (FastCPH), a computationally efficient and statistically correct method for survival analysis using neural networks.

1.2 Our contribution

We propose FastCPH to overcome the limitations of existing CoxPH methods in machine learning. It is a fully vectorized method that runs in $\mathcal{O}(n)$ and yields both standard Breslow and Efron methods for tied events. We implement it with PyTorch and it can be easily used for any other machine learning research requiring the CoxPH model, allowing computationally efficient deep learning training for a larger scale of data.
As a demonstration of FastCPH, we combine it with LassoNet to present FastCPH-LassoNet, a simple neural network that provides interpretability through feature sparsity in survival analysis. We evaluate FastCPH-LassoNet on multiple survival datasets, and FastCPH-LassoNet outperforms existing CoxPH approaches.

2 Fast Cox Proportional Hazards Model

We propose FastCPH as a linear-time implementation for neural networks that follows the exact formulas of CoxPH with the Breslow and Efron approximations. As noted in Table 1, FastCPH is the first CoxPH method that is both computationally efficient for neural networks and supports the Efron and Breslow approximations for handling tied events. We also provide a step-by-step proof verifying the correctness of our tensor-based implementation.

The input is given as a feature matrix x where each row is a sample, and each column is a feature. Each row is associated with an event time $t_{i}$ (that can produce ties) and an indicator $\delta_{i}$ indicating whether the event is censored or not (1 if uncensored).

2.1 Without ties

Definition 2.1.

Given a regression model that gives a real number $g(x_{i})$ for each sample, the CoxPH likelihood that is maximized in the absence of ties is:

$$L(g)=\prod_{i|\delta_{i}=1}\frac{\exp(g(\mathbf{x_{i}}))}{\sum_{j|t_{j}\geq t_{i}}\exp(g(\mathbf{x_{j}}))}\tag{1}$$

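The negative log likelihood (loss function) is

$$LL(g)=\sum_{i\in J}\left(\log\left(\sum_{j\in R_{i}}\exp[g(\mathbf{x_{j}})]\right)-g(\mathbf{x_{i}})\right)\tag{2}$$

where $J=\{i\mid\delta_{i}=1\}$ is the set of uncensored events and $R_{i}=\{j\mid t_{j}\geq t_{i}\}$ is the risk set at time $t_{i}$.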

In implementation, assuming the event times $t$ are sorted in decreasing order (an $\mathcal{O}(n\log n)$ preprocessing step), we can compute all values of $\log\left(\sum_{j\in R_{i}}\exp[g(\mathbf{x_{j}})]\right)$ in $\mathcal{O}(n)$ by using the logcumsumexp function, as implemented in PyTorch (Paszke et al. 2019):

$$LL(g)=\sum_{i\in J}\left(\texttt{logcumsumexp}(g(\mathbf{x}))_{i}-g(\mathbf{x_{i}})\right)\tag{3}$$

Here $\texttt{logcumsumexp}(g(\mathbf{x}))_{i}$ is a slight abuse of notation: logcumsumexp is applied to all events, not just those in $J$, and the result is then indexed on the elements of $J$.
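The computation in Equation 3 can be illustrated with a minimal pure-Python sketch. It is not the vectorized PyTorch implementation described in this paper: the function names `logcumsumexp` and `cox_nll_no_ties` and the list-based interface are assumptions made for illustration, and the running log-sum-exp is written as an explicit loop.

```python
import math


def logcumsumexp(values):
    """Return out[i] = log(sum_{j <= i} exp(values[j])), computed stably."""
    out = []
    running = -math.inf
    for v in values:
        hi = max(running, v)
        # log(exp(running) + exp(v)), shifted by hi to avoid overflow
        running = hi + math.log(math.exp(running - hi) + math.exp(v - hi))
        out.append(running)
    return out


def cox_nll_no_ties(scores, times, events):
    """Negative log partial likelihood, assuming no tied event times.

    scores[i] plays the role of g(x_i), times[i] of t_i, events[i] of delta_i.
    """
    # Sort by decreasing time so that the risk set of sample i is a prefix.
    order = sorted(range(len(times)), key=lambda i: -times[i])
    g = [scores[i] for i in order]
    d = [events[i] for i in order]
    log_risk = logcumsumexp(g)
    # Sum only over uncensored samples (the set J).
    return sum(lr - gi for lr, gi, di in zip(log_risk, g, d) if di == 1)
```

After the sort, the loop touches each sample once, which is where the linear cost comes from.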

2.2 Breslow’s method

When there are ties, the theoretically best solution would assume that the events still happened in some order and sum the formula without ties over all orders. This is inefficient because there are $d_{i}!$ possible orders for each tie, where $d_{i}$ is the number of events tied at time $t_{i}$, and it is thus rarely used in applications.

The Breslow approximation assumes that all $d_{i}$ elements were selected from the same risk set. Thus, the above formula stays unchanged. The implementation simply indexes the logcumsumexp terms so that events with the same failure time have the same denominator.
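The indexing step for Breslow's method can be sketched as follows. This is a hedged pure-Python illustration rather than the paper's tensor code; the name `cox_nll_breslow` is hypothetical, and the backward pass over tie groups stands in for the vectorized indexing.

```python
import math


def cox_nll_breslow(scores, times, events):
    """Negative log partial likelihood with Breslow's handling of ties."""
    order = sorted(range(len(times)), key=lambda i: -times[i])
    g = [scores[i] for i in order]
    t = [times[i] for i in order]
    d = [events[i] for i in order]

    # Running log-sum-exp over samples sorted by decreasing time.
    log_cum = []
    running = -math.inf
    for v in g:
        hi = max(running, v)
        running = hi + math.log(math.exp(running - hi) + math.exp(v - hi))
        log_cum.append(running)

    # Samples sharing a time must share one denominator: the value at the
    # last position of their tie group, which covers the whole risk set.
    log_risk = log_cum[:]
    for i in range(len(t) - 2, -1, -1):
        if t[i] == t[i + 1]:
            log_risk[i] = log_risk[i + 1]

    return sum(lr - gi for lr, gi, di in zip(log_risk, g, d) if di == 1)
```

With three samples of equal score, two of them tied at the earliest-sorted time, both tied events see a risk set of size 2, so the loss is $2\log 2+\log 3$.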

2.3 Efron’s method

Efron’s method observes that the denominator is too big in Breslow’s approximation: when multiple elements are selected from the same risk set, that risk set gets smaller.

Definition 2.2.

Efron’s approximation results in the following likelihood:

$$L(g)=\prod_{i\in J^{\prime}}\frac{\displaystyle\prod_{j\in D_{i}}\exp(g(\mathbf{x_{j}}))}{\displaystyle\prod_{k=0}^{d_{i}-1}\left(\sum_{j\in R_{i}}\exp(g(\mathbf{x_{j}}))-\frac{k}{d_{i}}\sum_{j\in D_{i}}\exp(g(\mathbf{x_{j}}))\right)}\tag{4}$$

where $J^{\prime}$ is a subset of $J$ with unique event times, $D_{i}$ is the set of uncensored events tied at time $t_{i}$, and $d_{i}=|D_{i}|$. The idea is to discount the denominator over the whole risk set $\sum_{j\in R_{i}}\exp(g(\mathbf{x_{j}}))$ by the average risk over $D_{i}$: $\frac{1}{d_{i}}\sum_{j\in D_{i}}\exp(g(\mathbf{x_{j}}))$. The index $k$ runs over the elements of $D_{i}$.

Compared with Equation 1, the numerator did not change (it is still the product over all uncensored events). The denominator has three terms:

•

$\sum_{j\in R_{i}}\exp(g(\mathbf{x_{j}}))$ can be computed for all $i$ in linear time as before.
•

$\frac{k}{d_{i}}$ is trivial to compute for all elements of $J$ (which is the union of all $D_{i}$ for $i\in J^{\prime}$ ).
•

$\sum_{j\in D_{i}}\exp(g(\mathbf{x_{j}}))$ can also be computed in linear time with a vectorized scatter operation.

Finally, those terms are easy to combine in linear time with vectorized operations. All computations are realized in the log space to avoid numerical errors, using tricks similar to those used to implement the log-sum-exp operation.
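As a reference for the three denominator terms, the following pure-Python sketch evaluates the negative logarithm of Equation 4 directly. It is an illustration under stated assumptions, not the paper's implementation: the name `cox_nll_efron` is hypothetical, per-time dictionaries replace the vectorized scatter operation, and stability comes from subtracting the maximum score (which leaves the loss unchanged) instead of working fully in log space.

```python
import math
from collections import defaultdict


def cox_nll_efron(scores, times, events):
    """Negative log partial likelihood with Efron's handling of ties."""
    shift = max(scores)  # the loss is invariant to a constant shift
    weights = [math.exp(s - shift) for s in scores]

    all_sum = defaultdict(float)     # sum of exp(g) over all samples at a time
    tied_sum = defaultdict(float)    # sum of exp(g) over D_i
    tied_count = defaultdict(int)    # d_i
    tied_score = defaultdict(float)  # sum of g over D_i (log numerator)
    for s, w, t, e in zip(scores, weights, times, events):
        all_sum[t] += w
        if e == 1:
            tied_sum[t] += w
            tied_count[t] += 1
            tied_score[t] += s - shift

    loss = 0.0
    risk = 0.0
    # Visit distinct times in decreasing order so the risk set only grows.
    for t in sorted(all_sum, reverse=True):
        risk += all_sum[t]
        d = tied_count.get(t, 0)
        for k in range(d):
            # Discount the risk set by k/d of the tied events' total weight.
            loss += math.log(risk - (k / d) * tied_sum[t])
        loss -= tied_score.get(t, 0.0)
    return loss
```

For three samples of equal score with two events tied at the later time, this gives $\log 2+\log 1+\log 3=\log 6$, smaller than the $\log 12$ obtained with Breslow's shared denominator.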

We provide FastCPH as an efficient, vectorized and linear-time function implemented in PyTorch that can be conveniently used by any neural network.

2.4 FastCPH-LassoNet

LassoNet (Lemhadri, Ruan, and Tibshirani 2021) is a neural network that achieves feature sparsity using a LASSO-style regularization (Tibshirani 1996). Since it provides interpretability through global feature selection and yields promising results in different domains, we apply it as the backbone of our method. Originally, LassoNet was applied with mean squared error and cross-entropy losses for regression and classification problems. We use it with our FastCPH loss function to apply it to survival analysis. Thus, we present FastCPH-LassoNet, a survival prediction method with feature sparsity.
Like LASSO, LassoNet penalizes the $\textit{L}^{1}$ norm of coefficients applied to features. The Lagrange multiplier associated with that penalty is denoted $\lambda$. The model is trained with increasing values of $\lambda$, on a dense-to-sparse path where the values of $\lambda$ follow a geometric scale. The starting value of $\lambda$ is a hyperparameter that should be carefully selected: if it is too small, the model will train on many useless configurations; if it is too large, the optimizer will discard features too quickly. Another hyperparameter is $M$, the hierarchy coefficient that balances the linear and non-linear parts of the model.
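A geometric scale for $\lambda$ can be sketched as below. The function name `lambda_path` and its parameters `lambda_start`, `path_multiplier`, and `n_steps` are illustrative assumptions; in practice the path would continue until the model becomes sparse rather than for a fixed number of steps.

```python
def lambda_path(lambda_start, path_multiplier, n_steps):
    """Return n_steps increasing penalties on a geometric scale.

    Each value is the previous one times path_multiplier (assumed > 1),
    so the model is pushed from dense towards sparse.
    """
    return [lambda_start * path_multiplier ** k for k in range(n_steps)]
```

A small multiplier gives a finer path at the cost of more training steps, which mirrors the trade-off described for the starting value.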

3 Experiments

Figure 1: Runtime comparisons between different CoxPH implementations. The x-axis is the size of the data, and the y-axis is the time for a single CoxPH calculation in milliseconds.

	Breast cancer	WHAS500	Veteran’s lung cancer	HNSCC
CoxPH Linear	51.4	71.3	66.4	59.1
CoxNet	57.0	71.4	72.6	74.3
GlmNet	60.3 ( $\pm$ 4.67)	70.1 ( $\pm$ 0.68)	70.7 ( $\pm$ 1.34)	74.8 ( $\pm$ 2.18)
DeepSurv	57.8 ( $\pm$ 1.52)	70.0 ( $\pm$ 2.05)	66.1 ( $\pm$ 3.14)	73.1 ( $\pm$ 1.53)
FastCPH-LassoNet	67.0 ( $\pm$ 5.39)	76.6 ( $\pm$ 1.21)	71.9 ( $\pm$ 1.90)	70.1 ( $\pm$ 3.96)

Table 2: Performance of different CoxPH models on survival datasets (in percentage, 95% CI if indicated)

	Breast cancer	WHAS500	Veteran’s lung cancer	HNSCC
$\#$ selected features	24.6 ( $\pm$ 2.95)	14.0 ( $\pm$ 0.00)	10.6 ( $\pm$ 0.78)	11.0 ( $\pm$ 3.95)
$\#$ total features	80	14	11	107
run time	261s	225s	201s	283s

Table 3: Number of selected features and run time of FastCPH-LassoNet (95% CI if indicated). Run time is in seconds per run per CPU. Training of FastCPH-LassoNet is performed on a 2.8 GHz Quad-Core Intel Core i7 with 16 GB memory. As shown in the table, training only requires a few minutes, demonstrating the computational efficiency of FastCPH.

	Breast cancer	WHAS500	Veteran’s lung cancer	HNSCC
(16, 16)
FastCPH-LassoNet	69.7 ( $\pm$ 5.35)	76.6 ( $\pm$ 1.35)	73.0 ( $\pm$ 2.49)	63.7 ( $\pm$ 5.89)
DeepSurv	68.2 ( $\pm$ 2.47)	64.5 ( $\pm$ 1.06)	66.3 ( $\pm$ 1.47)	61.6 ( $\pm$ 3.05)
(32)
FastCPH-LassoNet	67.4 ( $\pm$ 4.90)	76.8 ( $\pm$ 1.29)	72.6 ( $\pm$ 2.12)	69.3 ( $\pm$ 4.68)
DeepSurv	66.6 ( $\pm$ 1.27)	65.8 ( $\pm$ 0.68)	69.3 ( $\pm$ 0.72)	58.9 ( $\pm$ 2.56)
(32, 16)
FastCPH-LassoNet	69.0 ( $\pm$ 3.47)	75.4 ( $\pm$ 2.49)	71.9 ( $\pm$ 3.23)	65.3 ( $\pm$ 5.38)
DeepSurv	68.7 ( $\pm$ 1.20)	66.1 ( $\pm$ 1.07)	66.2 ( $\pm$ 1.46)	57.1 ( $\pm$ 2.39)
(64)
FastCPH-LassoNet	68.6 ( $\pm$ 5.11)	76.8 ( $\pm$ 1.40)	72.9 ( $\pm$ 2.50)	69.0 ( $\pm$ 4.22)
DeepSurv	67.0 ( $\pm$ 1.24)	64.8 ( $\pm$ 0.63)	67.3 ( $\pm$ 1.13)	58.5 ( $\pm$ 2.00)

Table 4: Results of FastCPH-LassoNet and DeepSurv with more complex architectures (in percentage, 95% CI)

We want to evaluate the following three aspects of FastCPH:

1.

Is the implementation of FastCPH computationally efficient for neural networks?
2.

Can FastCPH-LassoNet obtain feature sparsity along the regularization path?
3.

Does FastCPH-LassoNet have promising performance compared to other CoxPH-based models on real-world survival datasets?

3.1 Experiment 1: Runtime analysis of FastCPH

Baselines

To analyze the computational efficiency of FastCPH, we compare its runtime with other popular vectorized implementations: PyCox, DeepSurv, and PySurvival. PySurvival uses risk / fail matrices to compute the Efron method. PyCox and DeepSurv are for deep learning purposes and do not support tie approximations. Detailed descriptions of these methods are in Section 1.1.

Implementation details

We assume all events are uncensored and the data is sorted by duration. We randomly generate data of sizes from 1 to $10^{3}$. We record the runtime of computing the negative log likelihood once and report the mean over 5 random trials.

Results

Among the compared methods, FastCPH is the most computationally efficient CoxPH implementation that supports the Breslow and Efron approximations. As shown in Fig 1, the curve of FastCPH with the Breslow method is very close to those of PyCox and DeepSurv, which are not tie-aware. As the size of the data increases, the difference among FastCPH with Breslow, PyCox, and DeepSurv becomes smaller. Of the two methods using the Efron approximation, FastCPH with the Efron method is significantly faster than PySurvival. The gap between FastCPH with Efron and with Breslow is likely due to the computation of the weighted terms in the denominator. The experimental results align with our claims on the linearity of FastCPH’s runtime.

3.2 Dataset settings

To evaluate FastCPH-LassoNet in real-world scenarios, we use the following four datasets:

•

Breast cancer dataset (Desmedt et al. 2007): This dataset comes from experiments set up to validate a certain gene signature in primary breast tumors. It contains data points from 198 patients, with 80 features each. The endpoint of this dataset is distant metastases, which occurred for 51 patients (25.8%).
•

WHAS500 dataset (Muche 2001): The Worcester Heart Attack Study dataset is an observational dataset set up to track trends in acute myocardial infarction and out-of-hospital coronary heart disease deaths in Worcester, Massachusetts. The endpoint of this dataset is death. Out of 500 patients in the dataset with 14 features each, the endpoint occurred for 215 patients (43.0%).
•

Veteran’s lung cancer dataset (Kalbfleisch and Prentice 2011): This dataset comes from a lung cancer trial by the Department of Veterans Affairs. The endpoint of this dataset is death. Out of 137 patients in the dataset with 6 features each, the endpoint occurred for 128 patients (93.4%).
•

HNSCC (Grossberg et al. 2020): This dataset is composed of 451 head and neck squamous cell carcinoma (HNSCC) patients treated with curative-intent intensity modulated radiotherapy (IMRT). This dataset was previously used to predict local recurrence and HPV status (Head, Group et al. 2018). Survival analysis has been performed on it for data exploration, but not for prediction. We include it as a showcase of our method. The endpoint used for our analysis is death.

Data preprocessing

For the breast cancer, WHAS500, and veteran’s lung cancer datasets, we retrieve the data from the Scikit-survival package (Pölsterl 2020b) and use one-hot encodings to quantify entries such as treatment received, cell types, prior therapy, etc. The number of final entries is shown in the second row of Table 3. The HNSCC dataset is publicly available via the Cancer Imaging Archive with a TCIA Limited Access License (Grossberg et al. 2020). The DICOM imaging data is processed using the Med-ImageTools pipeline (Sejin 2022) to extract the computed tomography (CT) images and gross tumor volume (GTV) segmentation masks with uniform voxel spacing for consistent feature extraction. These images and masks are converted into the NIfTI file format, which is a common standard for 3D medical images and is compatible with PyRadiomics. The processed images and GTV masks are then passed to PyRadiomics to extract shape, texture, and statistical features (Van Griethuysen et al. 2017).

Metric

We use Harrell’s concordance index (C-index) as a metric to evaluate the performance of our model and the baselines. It is a generalization of the area under the ROC curve (AUC) to censored data, reflecting the accuracy of the pairwise ordering induced by the risk function output by the model (Harrell et al. 1982; Uno et al. 2011; James et al. 2013). For input $\textbf{x}_{i}$, duration $t_{i}$, and event indicator $\delta_{i}$ (1 if uncensored, 0 otherwise), the C-index is computed as:

$$C=\frac{\sum_{i,j:t_{j}<t_{i}}\mathbf{1}_{g(\textbf{x}_{j})>g(\textbf{x}_{i})}\,\delta_{j}}{\sum_{i,j:t_{j}<t_{i}}\delta_{j}}.\tag{5}$$

We also provide a $\mathcal{O}(n\log n)$ implementation of the C-index using ordered data structures.
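For reference, Equation 5 can be evaluated directly with a naive double loop. The sketch below is quadratic in the number of samples and is not the $\mathcal{O}(n\log n)$ implementation mentioned above; the function name `concordance_index` and its list-based interface are assumptions for illustration.

```python
def concordance_index(scores, times, events):
    """Naive O(n^2) evaluation of the C-index as defined in Equation 5.

    A pair (i, j) is comparable when sample j has the strictly earlier time
    and is uncensored; it is concordant when j also has the higher risk score.
    """
    concordant = 0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if times[j] < times[i] and events[j] == 1:
                comparable += 1
                if scores[j] > scores[i]:
                    concordant += 1
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

Pairs with equal scores count as non-concordant here, following the strict inequality in the indicator of Equation 5.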

3.3 Experiment 2: FastCPH-LassoNet with a linear architecture

Baselines

We first compare FastCPH-LassoNet with other CoxPH-based methods using linear structures. We use CoxPH linear model (Cox 1972), CoxNet (Simon et al. 2011), GlmNet (Friedman, Hastie, and Tibshirani 2010) and DeepSurv (Katzman et al. 2018) as baselines. The first three are classical CoxPH-based models with different regularizations. DeepSurv is considered the most advanced deep learning CoxPH-based method (Katzman et al. 2018; Spooner et al. 2020).

Implementation details

The implementations of the Cox linear model and CoxNet are from Scikit-survival (Pölsterl 2020a). For the CoxPH linear model, we set $\alpha=10^{-6}$ as the regularization parameter in the ridge regression penalty. CoxNet is the CoxPH model with an elastic net penalty. We use cross-validation to choose the best $\alpha$ on the regularization path from $10^{-1}$ to $10^{-5}$. For GlmNet, we use the R built-in cross-validation cv.glmnet with the Breslow method to select the model for testing. The number of folds in cross-validation (if used) is 5 for the breast cancer and veteran’s lung cancer datasets and 10 for the WHAS500 and HNSCC datasets. For FastCPH-LassoNet and DeepSurv, we fix the number of hidden dimensions to 1 and the number of hidden layers to 1. They share the same architecture and setting (ReLU, Adam, lr= $10^{-3}$ ). For FastCPH-LassoNet, we set $M=10$ and start at $\lambda=10^{-6}$. The prox method of LassoNet is called on the dense model along a geometric path until the model becomes sparse. The resulting value of $\lambda$ is divided by 10 to give lambda_start. 5-fold cross-validation is used to select the best $\lambda$ value from multiple runs. We use Efron’s method to break ties.

Training and testing

We use stratified sampling with respect to uncensored/censored events to split each dataset into a training set (80%) and a test set (20%). For models with randomness in training, we run 5 trials for each set of hyperparameters and report the average performance.
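A split stratified on the event indicator can be sketched as follows. This is a minimal standard-library illustration, not the code used in the experiments; the function name `stratified_split`, the `seed` argument, and the rounding of the per-stratum test size are assumptions.

```python
import random


def stratified_split(events, test_fraction=0.2, seed=0):
    """Split sample indices so censored and uncensored samples keep
    roughly the same proportion in the training and test sets."""
    rng = random.Random(seed)
    train, test = [], []
    for label in (0, 1):  # 0 = censored, 1 = uncensored
        idx = [i for i, e in enumerate(events) if e == label]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)
```

Stratifying on the event indicator keeps the share of uncensored samples in the test set close to that of the full dataset, which matters when events are rare.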

Results

Firstly, FastCPH-LassoNet obtains promising results on the different datasets and generally outperforms existing CoxPH-based methods, as shown in Table 2. Its C-index ranks 1st, 1st, 2nd, and 4th on the breast cancer, WHAS500, Veteran’s lung cancer, and HNSCC datasets, respectively, showing its ability to provide a reliable ranking of survival times based on risk scores. Compared to DeepSurv with the same architecture, FastCPH-LassoNet performs better on 3 of the 4 datasets (all except HNSCC).
Moreover, FastCPH-LassoNet is able to attain an effective recovery of signals given a subset of features. As shown in Tables 2 and 3, the model performs well with sparsity in covariates (a C-index of 0.70 with 10% of features selected on the HNSCC dataset and a C-index of 0.67 with 18% of features selected on the breast cancer dataset). It achieves its highest validation accuracy with a low ratio of the number of selected features to the number of total features. However, when the number of total features is small, the model may not be able to obtain sparsity over covariates, as shown by its result on WHAS500. Fig 2 gives an example of the training curve of FastCPH-LassoNet on the breast cancer dataset. We can see that LassoNet is optimized properly with FastCPH as the loss function. The sparsity demonstrated in the experiments suggests its potential on large-scale, more complicated real-world datasets.

3.4 Experiment 3: FastCPH-LassoNet with more complex architectures

Baselines

In the context of NN methods, we further analyze the performance of FastCPH-LassoNet using more complex architectures. We use DeepSurv as the baseline because it is commonly recognized as the most advanced CoxPH-based deep learning method.

Implementation details

We let FastCPH-LassoNet and DeepSurv share the same architecture (as indicated in parentheses in Table 4) and setting (ReLU, Adam, lr= $10^{-3}$ ). For both methods, we run 15 trials to give a 95% CI. The implementation of FastCPH-LassoNet is the same as in Experiment 2.

Results

FastCPH-LassoNet outperforms DeepSurv with the same architecture on all datasets in our experiments. It is a more robust deep learning method with promising results in survival analysis. Looking at the results in Tables 2 and 4 together, FastCPH-LassoNet is the best CoxPH-based method on 3 out of the 4 survival datasets we use. Notice that for the breast cancer and WHAS500 datasets, FastCPH-LassoNet yields a significantly better C-index than DeepSurv and the other existing methods in Table 2. These results indicate that FastCPH-LassoNet has better overall performance than other CoxPH plus regularization models. To summarize, FastCPH-LassoNet is an efficient and robust way to conduct survival analysis based on FastCPH and the $\textit{L}^{1}$ penalty (Tibshirani 1996).

4 Discussion

In this paper, we have proposed FastCPH, an efficient CoxPH method for survival analysis in neural networks that follows the exact formula of well-tested methods to handle tied events. FastCPH is an efficient and numerically correct solution for neural networks in survival analysis. It can be quickly adapted to other deep learning methods and used in more real-world scenarios with censored data, such as those in (Lane, Looney, and Wansley 1986; Liang, Self, and Liu 1990; Bendell, Wightman, and Walker 1991). We have shown that FastCPH-LassoNet outperforms other CoxPH-based methods on various survival datasets. Further study could apply FastCPH as an objective function in more complex neural network architectures. It will be interesting to see the effect of tied events on the behavior of neural networks. In addition, it is worth pointing out to the machine learning community the importance of following the exact formulae of classic statistical methods in implementation and avoiding oversimplifications.

References

Bastien (2004) Bastien, P. 2004. PLS-Cox model: application to gene expression. Proceedings in Computational Statistics, 655–662.
Bendell, Wightman, and Walker (1991) Bendell, A.; Wightman, D.; and Walker, E. 1991. Applying proportional hazards modelling in reliability. Reliability Engineering & System Safety, 34(1): 35–53.
Borucka (2014) Borucka, J. 2014. Methods of handling tied events in the Cox proportional hazard model. Studia Oeconomica Posnaniensia, 2(2): 91–106.
Breslow (1975) Breslow, N. E. 1975. Analysis of survival data under the proportional hazards model. International Statistical Review/Revue Internationale de Statistique, 45–57.
Chang et al. (1990) Chang, H.-G. H.; Lininger, L. L.; Doyle, J. T.; Maccubbin, P. A.; and Rothenberg, R. B. 1990. Application of the Cox model as a predictor of relative risk of coronary heart disease in the Albany Study. Statistics in Medicine, 9(3): 287–292.
Ching, Zhu, and Garmire (2018) Ching, T.; Zhu, X.; and Garmire, L. X. 2018. Cox-nnet: an artificial neural network method for prognosis prediction of high-throughput omics data. PLoS computational biology, 14(4): e1006076.
Cox (1972) Cox, D. R. 1972. Regression models and life-tables. Journal of the Royal Statistical Society: Series B (Methodological), 34(2): 187–202.
Desmedt et al. (2007) Desmedt, C.; Piette, F.; Loi, S.; Wang, Y.; Lallemand, F.; Haibe-Kains, B.; Viale, G.; Delorenzi, M.; Zhang, Y.; d’Assignies, M. S.; et al. 2007. Strong time dependence of the 76-gene prognostic signature for node-negative breast cancer patients in the TRANSBIG multicenter independent validation series. Clinical cancer research, 13(11): 3207–3214.
Efron (1977) Efron, B. 1977. The efficiency of Cox’s likelihood function for censored data. Journal of the American statistical Association, 72(359): 557–565.
Efron (1988) Efron, B. 1988. Logistic regression, survival analysis, and the Kaplan-Meier curve. Journal of the American statistical Association, 83(402): 414–425.
Fotso et al. (2019–) Fotso, S.; et al. 2019–. PySurvival: Open source package for Survival Analysis modeling.
Friedman, Hastie, and Tibshirani (2010) Friedman, J.; Hastie, T.; and Tibshirani, R. 2010. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1): 1–22.
Grossberg et al. (2020) Grossberg, A.; Elhalawani, H.; Mohamed, A.; Mulder, S.; Williams, B.; White, A. L.; Zafereo, J.; Wong, A. J.; Berends, J. E.; AboHashem, S.; Aymard, J. M.; Kanwar, A.; Perni, S.; Rock, C. D.; Chamchod, S.; Kantor, M. E.; Browne, T.; Hutcheson, K. A.; Gunn, G. B.; Frank, S. J.; Rosenthal, D.; Garden, A. S.; Fuller, C.; Head, M. A. C. C.; and Group, N. Q. I. W. 2020. HNSCC.
Harrell et al. (1982) Harrell, F. E.; Califf, R. M.; Pryor, D. B.; Lee, K. L.; and Rosati, R. A. 1982. Evaluating the yield of medical tests. Jama, 247(18): 2543–2546.
Head, Group et al. (2018) Head, M. A. C. C.; Group, N. Q. I. W.; et al. 2018. Investigation of radiomic signatures for local recurrence using primary tumor texture analysis in oropharyngeal head and neck cancer patients. Scientific reports, 8.
James et al. (2013) James, G.; Witten, D.; Hastie, T.; and Tibshirani, R. 2013. An introduction to statistical learning, volume 112. Springer.
Kalbfleisch and Prentice (2011) Kalbfleisch, J. D.; and Prentice, R. L. 2011. The statistical analysis of failure time data. John Wiley & Sons.
Katzman et al. (2018) Katzman, J. L.; Shaham, U.; Cloninger, A.; Bates, J.; Jiang, T.; and Kluger, Y. 2018. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC medical research methodology, 18(1): 1–12.
Kvamme, Borgan, and Scheel (2019) Kvamme, H.; Borgan, Ø.; and Scheel, I. 2019. Time-to-event prediction with neural networks and Cox regression. arXiv preprint arXiv:1907.00825.
Lane, Looney, and Wansley (1986) Lane, W. R.; Looney, S. W.; and Wansley, J. W. 1986. An application of the Cox proportional hazards model to bank failure. Journal of Banking & Finance, 10(4): 511–531.
Lemhadri, Ruan, and Tibshirani (2021) Lemhadri, I.; Ruan, F.; and Tibshirani, R. 2021. Lassonet: Neural networks with feature sparsity. In International Conference on Artificial Intelligence and Statistics, 10–18. PMLR.
Li and Sun (2020) Li, C.; and Sun, J. 2020. Variable selection for high-dimensional quadratic Cox model with application to Alzheimer’s disease. The International Journal of Biostatistics, 16(2).
Liang, Self, and Liu (1990) Liang, K.-Y.; Self, S. G.; and Liu, X. 1990. The Cox proportional hazards model with change point: An epidemiologic application. Biometrics, 783–793.
Mariani et al. (1997) Mariani, L.; Coradini, D.; Biganzoli, E.; Boracchi, P.; Marubini, E.; Pilotti, S.; Salvadori, B.; Silvestrini, R.; Veronesi, U.; Zucali, R.; et al. 1997. Prognostic factors for metachronous contralateral breast cancer: a comparison of the linear Cox regression model and its artificial neural network extension. Breast cancer research and treatment, 44(2): 167–178.
Muche (2001) Muche, R. 2001. Applied Survival Analysis: Regression Modeling of Time to Event Data. DW Hosmer, Jr., S Lemeshow. New York: John Wiley, 1999, pp. 386, US $89.95. ISBN: 0-471-15410-5.
Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
Pedregosa et al. (2011) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12: 2825–2830.
Pölsterl (2020a) Pölsterl, S. 2020a. scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn. Journal of Machine Learning Research, 21(212): 1–6.
Pölsterl (2020b) Pölsterl, S. 2020b. scikit-survival: A Library for Time-to-Event Analysis Built on Top of scikit-learn. Journal of Machine Learning Research, 21(212): 1–6.
Sargent (2001) Sargent, D. J. 2001. Comparison of artificial neural networks with other statistical approaches: results from medical data sets. Cancer: Interdisciplinary International Journal of the American Cancer Society, 91(S8): 1636–1642.
Sejin (2022) Sejin, K. 2022. Med-Imagetools: Transparent and Reproducible Medical Image Processing Pipelines in Python. https://github.com/bhklab/med-imagetools.
Simon et al. (2011) Simon, N.; Friedman, J.; Hastie, T.; and Tibshirani, R. 2011. Regularization paths for Cox’s proportional hazards model via coordinate descent. Journal of statistical software, 39(5): 1.
Sleeper and Harrington (1990) Sleeper, L. A.; and Harrington, D. P. 1990. Regression splines in the Cox model with application to covariate effects in liver disease. Journal of the American Statistical Association, 85(412): 941–949.
Spooner et al. (2020) Spooner, A.; Chen, E.; Sowmya, A.; Sachdev, P.; Kochan, N. A.; Trollor, J.; and Brodaty, H. 2020. A comparison of machine learning methods for survival analysis of high-dimensional clinical data for dementia prediction. Scientific reports, 10(1): 1–10.
Therneau (1997) Therneau, T. M. 1997. Extending the Cox model. In Proceedings of the First Seattle symposium in biostatistics, 51–84. Springer.
Therneau and Grambsch (2000) Therneau, T. M.; and Grambsch, P. M. 2000. The Cox model. In Modeling survival data: extending the Cox model, 39–77. Springer.
Tibshirani (1996) Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1): 267–288.
Uno et al. (2011) Uno, H.; Cai, T.; Pencina, M. J.; D’Agostino, R. B.; and Wei, L.-J. 2011. On the C-statistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Statistics in medicine, 30(10): 1105–1117.
Van Griethuysen et al. (2017) Van Griethuysen, J. J.; Fedorov, A.; Parmar, C.; Hosny, A.; Aucoin, N.; Narayan, V.; Beets-Tan, R. G.; Fillion-Robin, J.-C.; Pieper, S.; and Aerts, H. J. 2017. Computational radiomics system to decode the radiographic phenotype. Cancer research, 77(21): e104–e107.
Xiang et al. (2000) Xiang, A.; Lapuerta, P.; Ryutov, A.; Buckley, J.; and Azen, S. 2000. Comparison of the performance of neural network methods and Cox regression for censored survival data. Computational statistics & data analysis, 34(2): 243–257.

Appendix

This supplementary document is organized as follows. First, we give the distribution of ties in each dataset. We then provide additional information on the datasets used in the experiments, including an illustration of the covariate breakdown, the pipeline, and the correlation matrix of the HNSCC dataset. The code for FastCPH and the experiments is available at https://github.com/lasso-net/lassonet.

Datasets statistics

Ties

Ties are common in the datasets we use, as listed in Table 6. The Breslow and Efron methods give the same log likelihood when a dataset contains no ties. Our method in Table 2, which uses the Breslow approximation, handles datasets both with and without ties.
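The agreement of the two methods in the absence of ties can be checked numerically. The following is a minimal sketch that evaluates the standard Breslow and Efron partial log likelihoods with a naive loop over distinct event times. It is not the linear-time FastCPH algorithm, and the function name `cox_partial_loglik` and the toy data are assumptions made for illustration only.

```python
import numpy as np


def cox_partial_loglik(eta, time, event, ties="breslow"):
    """Naive Cox partial log likelihood (illustrative, not FastCPH).

    eta   : log-risk scores, one per observation
    time  : observed durations
    event : 1 if the failure was observed, 0 if right-censored
    ties  : "breslow" or "efron"
    """
    eta = np.asarray(eta, dtype=float)
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    w = np.exp(eta)

    loglik = 0.0
    # Loop over the distinct times at which at least one failure occurs.
    for t in np.unique(time[event]):
        at_risk = time >= t              # risk set at time t
        dead = event & (time == t)       # failures tied at time t
        d = int(dead.sum())
        risk_sum = w[at_risk].sum()
        dead_sum = w[dead].sum()

        loglik += eta[dead].sum()
        if ties == "breslow":
            loglik -= d * np.log(risk_sum)
        elif ties == "efron":
            # Remove a growing fraction l/d of the tied failures' weight.
            l = np.arange(d)
            loglik -= np.log(risk_sum - (l / d) * dead_sum).sum()
        else:
            raise ValueError("ties must be 'breslow' or 'efron'")
    return loglik


rng = np.random.default_rng(0)
eta = rng.normal(size=5)  # hypothetical log-risk scores

# No tied event times: both methods give the same value.
time_no_ties = [1.0, 2.0, 3.0, 4.0, 5.0]
event_no_ties = [1, 0, 1, 1, 0]
b0 = cox_partial_loglik(eta, time_no_ties, event_no_ties, "breslow")
e0 = cox_partial_loglik(eta, time_no_ties, event_no_ties, "efron")

# Two uncensored events tied at time 2: the two methods differ.
time_ties = [1.0, 2.0, 2.0, 3.0, 4.0]
event_ties = [1, 1, 1, 0, 1]
b1 = cox_partial_loglik(eta, time_ties, event_ties, "breslow")
e1 = cox_partial_loglik(eta, time_ties, event_ties, "efron")

print("no ties  :", b0, e0)
print("with ties:", b1, e1)
```

When every failure time is unique, each tied group has size one, so the Efron correction term vanishes and the two expressions coincide.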

# of patients	451
Outcome
Alive / Censored	395 (88%)
Death	56 (12%)
Sex
Male	387 (86%)
Female	64 (14%)
Disease subsite
Base of Tongue	238 (53%)
Tonsil	174 (39%)
Glossopharyngeal sulcus	10 (2%)
Other	29 (6%)
HPV Status
Positive	232 (51%)
Negative / Unknown	219 (49%)
Stage
I	3 (1%)
II	14 (3%)
III	63 (14%)
IV	371 (83%)
Tumor Laterality
R	222 (49%)
L	215 (48%)
Other	14 (3%)

Table 5: Event and clinical variable distribution of HNSCC.

	Breast cancer	WHAS500	Veteran’s lung cancer	HNSCC
# total observations	198	500	137	451
# total tied events	6	178	64	232
# uncensored events	51	215	128	56
# uncensored tied events	0	80	55	4

Table 6: Statistics on tied events in different datasets
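Counts of this kind can be computed directly from the durations and event indicators. The sketch below assumes one plausible definition, which is not stated in the text: an observation is "tied" if its duration is shared with at least one other observation, and an uncensored event is "tied" if it shares its duration with another uncensored event. The function name `tie_statistics` and the toy data are hypothetical.

```python
import numpy as np


def tie_statistics(time, event):
    """Count tied observations under the assumed definitions above."""
    time = np.asarray(time)
    event = np.asarray(event, dtype=bool)

    # How many observations share each duration (censored or not).
    _, inverse, counts = np.unique(time, return_inverse=True, return_counts=True)
    tied_all = counts[inverse] > 1

    # Same computation restricted to uncensored observations.
    _, inv_e, counts_e = np.unique(
        time[event], return_inverse=True, return_counts=True
    )
    tied_uncensored = counts_e[inv_e] > 1

    return {
        "total observations": int(time.size),
        "total tied events": int(tied_all.sum()),
        "uncensored events": int(event.sum()),
        "uncensored tied events": int(tied_uncensored.sum()),
    }


# Toy data: durations 1 and 3 are shared; only time 3 has two failures.
stats = tie_statistics(
    time=[1, 1, 2, 3, 3, 3, 4],
    event=[1, 0, 1, 1, 1, 0, 0],
)
print(stats)
```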

HNSCC dataset

The HNSCC dataset is a challenging prediction dataset to adapt for survival analysis. Fig 4 visualizes the correlation matrix of the clinical covariates in the HNSCC dataset, generated with PySurvival (Fotso et al. 2019–). As the figure shows, many of the covariates are heavily correlated with each other, which creates a need to select useful features in order to construct an efficient solution. According to a singular value decomposition, the matrix is rank deficient. Despite that, FastCPH-LassoNet successfully selects a subset of covariates (11.04 out of 107) and attains good and stable performance, as shown in Table 2.
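A rank check of this kind can be sketched with NumPy. The example below uses synthetic data rather than the HNSCC covariates: it builds a small matrix in which one column is an exact linear combination of two others, forms the correlation matrix, and counts the singular values above a relative tolerance. The tolerance of 1e-8 is an assumption chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic covariates (not HNSCC): four independent columns plus a fifth
# that is the sum of the first two, so the columns are linearly dependent.
X = rng.normal(size=(50, 4))
X = np.column_stack([X, X[:, 0] + X[:, 1]])

# Correlation matrix of the covariates (columns are variables).
corr = np.corrcoef(X, rowvar=False)

# Singular values only; the numerical rank is the number of singular
# values above a tolerance relative to the largest one (assumed 1e-8).
s = np.linalg.svd(corr, compute_uv=False)
tol = 1e-8 * s.max()
rank = int((s > tol).sum())

print("singular values:", s)
print("rank:", rank, "of", corr.shape[0])
```

A rank smaller than the number of covariates indicates redundant, perfectly collinear features, which is the situation in which feature selection is most useful.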