Deep Representation Learning-Based Dynamic Trajectory Phenotyping for Acute Respiratory Failure in Medical Intensive Care Units

Alan Wu, Tilendra Choudhary, Pulakesh Upadhyaya, Ayman Ali, Philip Yang, and Rishikesan Kamaleswaran

The authors were supported by the National Institutes of Health under Award Numbers R01GM139967 and R21GM151703. (Corresponding author: Tilendra Choudhary, e-mail: [email protected])

Alan Wu is with the Department of Quantitative Theory and Methods, Emory University, Atlanta, GA, USA. Tilendra Choudhary, Pulakesh Upadhyaya, Ayman Ali, and Rishikesan Kamaleswaran are with the Duke University School of Medicine, Durham, NC, USA. Philip Yang is with the Division of Pulmonary, Allergy, Critical Care, and Sleep Medicine, Emory University, Atlanta, GA, USA.

Abstract

Sepsis-induced acute respiratory failure (ARF) is a serious complication with a poor prognosis. This paper presents a deep representation learning-based phenotyping method to identify distinct groups of clinical trajectories of septic patients with ARF. For this retrospective study, we created a dataset from electronic medical records (EMR) consisting of data from sepsis patients admitted to medical intensive care units who required at least 24 hours of invasive mechanical ventilation at a quaternary care academic hospital in the southeastern USA between 2016 and 2021. A total of N=3349 patient encounters were included in this study. The Clustering Representation Learning on Incomplete Time Series Data (CRLI) algorithm was applied to a parsimonious set of EMR variables in this dataset. To validate the optimal number of clusters, the K-means algorithm was used in conjunction with dynamic time warping. Our model yielded four distinct patient phenotypes that a critical care expert characterized as liver dysfunction/heterogeneous, hypercapnia, hypoxemia, and multiple organ dysfunction syndrome. A Kaplan-Meier analysis comparing 28-day mortality trends showed significant differences ($p<0.005$) between the four phenotypes. The study demonstrates the utility of our deep representation learning-based approach in uncovering phenotypes that reflect the heterogeneity of sepsis-induced ARF in terms of mortality outcomes and severity. These phenotypes might reveal important clinical insights for prognosis and tailored treatment strategies.

Index Terms

Acute respiratory failure, Sepsis, CRLI, Time series, Phenotyping, Deep learning

1 Introduction

Sepsis is a serious syndrome defined as life-threatening organ dysfunction caused by a dysregulated host response to infection [1]. This syndrome results in more than 350,000 deaths in the United States and around 11 million worldwide each year [2, 3]. Patients with sepsis in intensive care units (ICU) often develop acute respiratory failure (ARF), in which the respiratory system fails to adequately oxygenate the blood and/or remove carbon dioxide from it, and which, in severe cases, requires invasive mechanical ventilation (IMV) to maintain end-organ perfusion [4, 5]. Previous studies of patients with sepsis and ARF requiring IMV have demonstrated a poor prognosis, with mortality rates of approximately 43%, which is a source of significant concern for critical care experts [6]. Respiratory failure is a common complication in patients with sepsis, with various determinants, physiological processes, and immunological responses. The result is extreme heterogeneity in clinical trajectories, involving multiorgan dysfunction and other comorbidities, that is difficult to interpret and characterize in clinical settings. Therefore, to study the clinical course of ARF, it is imperative to differentiate between phenotypes of patients, each of which groups patients who follow similar clinical trajectories.

Trajectory clustering is a domain where similar trajectories in time or space are grouped into distinct clusters. Since clinical trajectories used in this study are multivariate time series data, “time series” and “trajectory” are used interchangeably in this paper. In the existing literature, many methods have been reported to cluster time series data. One group of methods utilizes distance or similarity metrics such as Euclidean distance or dynamic time warping (DTW) distance to calculate the distance between time series [7]. A traditional clustering technique, e.g., K-means, DBSCAN, hierarchical clustering, is then used to cluster the distance matrix. Another group of methods utilizes the recent path-breaking advances in the field of representation learning to design deep neural networks to cluster time series. They do not require that a distance metric be applied to the data before fitting the model. Examples include Clustering Representation Learning on Incomplete Time Series Data (CRLI)[8], Variational Deep Embedding with Recurrence (VaDER)[9], and Deep Temporal Contrastive Clustering (DTCC)[10]. These methods utilize various neural network frameworks, such as convolutional neural networks, autoencoders, and recurrent neural networks. They also may employ a clustering objective such as K-means to encourage the development of clusters [11].

The variability of clinical signs and symptoms leading to sepsis-induced ARF can make the task of patient management particularly challenging. Although clinicians have access to a large variety of lab tests and vital signs, the breadth of pathophysiological etiologies that can lead to sepsis-induced ARF (such as localized direct lung injury from pneumonia, inflammatory lung injury from systemic inflammation) could result in difficulty in predicting the clinical trajectory of an individual patient after developing ARF. Trajectory clustering may help simplify the task of prediction by grouping patients into subgroups that follow a similar path prior to ventilation. Trajectory-based phenotyping methods can also help clinicians create more targeted and patient-specific treatment regimens. These subgroups can reveal opportunities to identify patients early based on their initial disease progression and also allow clinicians to intervene prior to the need for invasive respiratory support.

Figure 1: Block diagram depicting an overview of our study design for sepsis-induced ARF trajectory phenotyping.

Trajectory clustering has been used in many areas of medicine [12, 13]. For example, previous work has used longitudinal data to find respiratory subphenotypes for COVID-19-related acute respiratory distress syndrome (ARDS) [14]. However, the time series data of a more generic cohort of sepsis-induced ARF patients prior to ventilation (as opposed to a specific complication like ARDS) has not yet been analyzed using such methodologies. Once patients with sepsis are mechanically ventilated, mortality and multi-organ failure are common, and grouping patients with similar clinical trajectories may help optimize therapies and improve these poor outcomes [15].

Furthermore, the large-scale archival and storage of time-dependent patient information in electronic medical records (EMR) has facilitated the use of novel deep learning techniques in various clinical tasks [16]. Thus, in this paper, we propose an ARF trajectory phenotyping method that applies a deep representation learning-based approach to multivariate time series EMR data in order to elucidate novel phenotypes of sepsis-induced ARF in the ICU and characterize them with the help of a clinical expert. The rest of the paper is organized as follows. Section 2 presents our study design and method. Section 3 presents the phenotyping results and analysis, followed by a discussion in Section 4. Finally, conclusions are drawn in Section 5.

2 Methodology

A block diagram, representing the study design of the proposed sepsis-induced ARF trajectory phenotyping method, is shown in Fig. 1. It consists of four major blocks: data collection, data preprocessing, trajectory clustering, and phenotype characterization. Data collection involves patient selection and collection of EMR data up to 24 hours prior to ventilation. Data preprocessing introduces a multivariate feature-set from a parsimonious set of cardio-respiratory features, and data operations such as outlier rejection, missingness imputation and standardization. Subsequently, a deep representation learning-based trajectory clustering is used to derive optimal data-driven phenotypes that are characterized by following expert suggestions. The various components of the study are described in detail in subsequent subsections.

2.1 Data Collection

This study is a retrospective cohort study conducted at the Emory University Hospital network, located in Georgia, United States. This study was reviewed and approved by the Emory University IRB (#STUDY00000302) and performed in accordance with the ethical standards of the Emory University IRB and the Helsinki Declaration of 1975. Adult patients ( $\geq$ 18 years) who were admitted to medical intensive care units (MICU) with sepsis (according to the sepsis-3 criteria [1]) between 2016 and 2021 and who developed ARF during their ICU stay were included. ARF was defined as the requirement of $\geq$ 24 hours of invasive mechanical ventilation (IMV).

To elucidate the trajectory phenotypes early in the course of sepsis-induced ARF, we considered up to 24 hours of data prior to the time of IMV. Clinical data collected from the EMR, including laboratory values and vital signs, were used for phenotyping. A set of demographic factors (such as age, sex, race and ethnicity), mortality outcomes, and comorbidity information were also incorporated for further analysis of the derived phenotypes. We defined the time of IMV as the moment at which the mechanical ventilation parameters (positive end-expiratory pressure (PEEP), tidal volume, and/or plateau pressure) were first recorded in the EMR for patients who met the aforementioned inclusion criteria. We excluded patients who did not meet sepsis-3 criteria, patients admitted to neurological ICUs or surgical ICUs, and patients whose EMR did not include any physiological data in the 24 hours preceding IMV.

2.2 Data Preprocessing

We used a parsimonious set of six cardio-respiratory features, namely partial pressure of oxygen (PaO₂), partial pressure of carbon dioxide (PaCO₂), fraction of inspired oxygen (FiO₂), oxygen saturation by pulse oximetry (SpO₂), heart rate (HR) and mean arterial pressure (MAP), as inputs to the clustering algorithms. Other demographic and routinely collected clinical features, comorbidities at admission, as well as outcomes such as mortality were analyzed after clustering. Values of different variables were deemed to be spurious if they were above or below the accepted range. The spurious values were dropped and corrected. The following three imputation steps were performed subsequently for all the features:

• Forward filling: The missing values were imputed with the last known value in the time series.

• Backward filling: The missing values were imputed with the next known value in the time series.

• Median filling: A completely missing time series of any specific feature of a patient was filled with the population median. For FiO₂, however, 0.21 was used to replace such missingness, because room air contains 21% oxygen.
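The three imputation steps above can be sketched for a single feature of a single patient as follows. This is a minimal illustration, not the study's code: the function names are hypothetical, missing values are assumed to be encoded as `None`, and the population median is assumed to be taken over all observed values of that feature, which the text does not specify.

```python
from statistics import median
from typing import List, Optional, Sequence

Series = Sequence[Optional[float]]  # None marks a missing hourly value (assumption)


def forward_fill(series: Series) -> List[Optional[float]]:
    """Replace each missing value with the last known value before it."""
    filled, last = [], None
    for value in series:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def backward_fill(series: Series) -> List[Optional[float]]:
    """Replace each missing value with the next known value after it."""
    return forward_fill(list(series)[::-1])[::-1]


def population_median(all_series: Sequence[Series]) -> float:
    """Median of every observed value of one feature across patients (assumed definition)."""
    observed = [v for s in all_series for v in s if v is not None]
    return median(observed)


def impute(series: Series, fallback: float) -> List[float]:
    """Forward fill, then backward fill; use `fallback` if nothing was observed."""
    if all(v is None for v in series):
        return [fallback] * len(series)
    return backward_fill(forward_fill(series))
```

For FiO₂, `fallback` would be 0.21; for the other features it would be the value returned by `population_median`.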

To give every time series the fixed length required by our algorithms, series shorter than 24 data points were padded at the beginning by repeating their first value until they were 24 data points long. Each indicator was standardized using the formula: $z_{i}=(f_{i}-\mu)/\sigma$ ; where $f_{i}$ is the unstandardized feature, and $\mu$ and $\sigma$ denote the mean and standard deviation of $f_{i}$ , respectively. The order of patient encounters was then randomized.
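A minimal sketch of the padding and standardization steps is given below. The function names are hypothetical. Two details are assumptions because the text does not state them: the standard deviation is the population standard deviation, and $\mu$ and $\sigma$ are computed over whatever pool of values is passed in (for example, all values of one feature across the cohort).

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def front_pad(series: Sequence[float], length: int = 24) -> List[float]:
    """Repeat the first value at the start until the series has `length` points."""
    if not series:
        raise ValueError("cannot pad an empty series; impute it first")
    if len(series) > length:
        raise ValueError("series is longer than the target length")
    return [series[0]] * (length - len(series)) + list(series)


def standardize(values: Sequence[float]) -> List[float]:
    """z = (f - mu) / sigma, using the population standard deviation (assumption)."""
    mu, sigma = fmean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # a constant feature carries no signal
    return [(v - mu) / sigma for v in values]
```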

2.3 Clustering Algorithms

2.3.1 Clustering Representation Learning on Incomplete time-series data (CRLI)

The state-of-the-art clustering algorithm CRLI, published in 2021, is a deep neural network used for clustering multivariate time series data [8]. The architecture is shown in Figure 1. It consists of a generative adversarial network (GAN) framework with a bidirectional recurrent neural network (RNN) and an encoder-decoder network with a soft K-means objective. The clustering process relies on the deep temporal representations learned by the architecture. The overall loss function of CRLI is:

$$L_{CRLI}=L_{pre}+L_{rec}+L_{adv}+\lambda\cdot L_{K-means}$$

The loss $L_{pre}$ is derived from the bidirectional RNN of the generator and attempts to capture the temporal dynamics of the time series data. $L_{rec}$ is the reconstruction loss of the autoencoder network and attempts to identify informative features. $L_{adv}$ is the adversarial loss and attempts to capture the error propagation during clustering. $L_{K-means}$ is the soft K-means loss and encourages the network to form clusters. $\lambda$ is the weight of the K-means loss [8]. CRLI consists of a bidirectional multi-layer RNN as the generator (encoder), a single-layer RNN as the decoder, and a five-layer RNN as the discriminator. Each of these RNNs uses gated recurrent units (GRU). We used a weight of $10^{-3}$ for the K-means loss, a batch size equal to the size of the entire dataset, a maximum of 100,000 epochs, and a patience of 100 epochs for early stopping [17]. The default values of all other hyper-parameters were used. CRLI was implemented with the Python package PyPOTS [17].
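For readers unfamiliar with the soft K-means term, the block below sketches the spectral relaxation of K-means commonly used in this line of deep time series clustering. The symbols $H$, $F$, $d$, and $k$ are introduced here for illustration, and whether the CRLI implementation matches this form exactly is an assumption that should be checked against [8].

```latex
% H \in \mathbb{R}^{d \times N}: latent representations, one column per time series
% F \in \mathbb{R}^{N \times k}: relaxed (continuous) cluster-indicator matrix
L_{K-means} = \mathrm{Tr}\!\left(H^{\top}H\right)
            - \mathrm{Tr}\!\left(F^{\top}H^{\top}HF\right),
\qquad \text{subject to } F^{\top}F = I_{k}
```

With $H$ fixed, the minimizing $F$ is formed by the $k$ leading eigenvectors of $H^{\top}H$, so the loss equals the sum of the remaining eigenvalues and shrinks as the representations concentrate in a $k$-dimensional subspace.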

2.3.2 K-means and Dynamic Time Warping (DTW)

The second clustering algorithm combines dynamic time warping (DTW) with K-means. DTW is a technique for measuring the similarity between two time series that allows warping along the time dimension so that the two series can be better matched. This is useful because similar trajectories often differ in timing. The DTW distance is defined as: $DTW(X,Y)=\min\{C_{p}(X,Y)|\textrm{$p$ is a warping path}\}$ in which $X=(x_{1},x_{2},\cdots,x_{N})$ and $Y=(y_{1},y_{2},\cdots,y_{M})$ are time series, $p=(p_{1},p_{2},\cdots,p_{L})$ is a sequence of index pairs, and $C_{p}(X,Y)$ is the total cost: the sum of absolute differences between $X$ and $Y$ over the index pairs given by $p$ . The sequence $p$ is a warping path when $p_{l}=(n_{l},m_{l})\in[1:N]\times[1:M]$ for $l\in[1:L]$ , with $\max(N,M)\leq L\leq N+M-1$ , such that the boundary, monotonicity, and step size conditions hold. In summary, these conditions mean that the path starts at the first index pair $(1,1)$ and ends at the last index pair $(N,M)$ ; that the alignments do not cross, i.e., $n_{1}\leq n_{2}\leq\cdots\leq n_{L}$ and $m_{1}\leq m_{2}\leq\cdots\leq m_{L}$ ; and that every index of one series is matched to at least one index of the other [7].
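The definition above can be computed with the standard dynamic programming recursion. The sketch below is for univariate series and uses the absolute difference as the local cost, as in the text; the function name is hypothetical, and library implementations may use a different local cost (for example, squared Euclidean distance), so values need not match those of a given package.

```python
from math import inf
from typing import Sequence


def dtw_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Minimum total cost over all warping paths between x and y."""
    n, m = len(x), len(y)
    # acc[i][j] holds the cheapest cost of aligning x[:i] with y[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0  # boundary condition: the path starts at the first index pair
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            # Step size condition: arrive from the left, from below, or diagonally.
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]  # boundary condition: the path ends at (N, M)
```

For example, `dtw_distance([0, 1, 0], [0, 0, 1, 0])` is zero because the repeated leading value can be absorbed by warping, whereas a pointwise Euclidean comparison is not even defined for series of different lengths.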

K-means is a common clustering algorithm that creates $k$ clusters by minimizing the distance between the points and the cluster center. While it is computationally efficient and versatile in its ability to utilize different types of distance measures, it has a few limitations. It requires the number of clusters to be set prior to its execution, it may converge to local minima instead of the global minimum, and it assumes that the clusters are spherical.

K-means was run with 10 random initializations, a maximum of 1000 iterations, and DTW as the distance metric, for 3, 4, and 5 clusters. The implementation from the Python library tslearn was used since it conveniently packages K-means and DTW together in the model TimeSeriesKMeans [18].

2.4 Optimal Clustering and Phenotype Characterization

The performance of the models and the choice of the number of clusters were evaluated using silhouette scores. For CRLI, silhouette scores were calculated on an intermediate latent representation called the clustering latent. This is the 2D representation of the time series that the neural network generates. For K-means + DTW, silhouette scores were calculated using DTW as the distance metric. The silhouette score was calculated for each predefined number of clusters: 3, 4, and 5. A higher silhouette score suggests better clustering. The silhouette score is defined as the average silhouette width over all points. The silhouette width is defined as:

$$s(x_{i})=(b(x_{i})-a(x_{i}))/\max\{b(x_{i}),a(x_{i})\}$$

where $x_{i}$ is an element in cluster $k$ , $a(x_{i})$ is the average distance from $x_{i}$ to all other elements in cluster $k$ , and $b(x_{i})$ is the smallest average distance from $x_{i}$ to the points of any single cluster other than $k$ [19]. The silhouette score was calculated using implementations from tslearn and PyPOTS [18, 17]. The clustered trajectories were then assessed by an expert critical care physician who characterized and named the phenotypes.
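The silhouette definition above can be sketched directly from a precomputed pairwise distance matrix, which is how a DTW-based score can be obtained. This is an illustrative implementation rather than the tslearn or PyPOTS code; the function name is hypothetical, and assigning a width of zero to singleton clusters is a common convention assumed here.

```python
from typing import Dict, Hashable, List, Sequence


def silhouette_score(dist: Sequence[Sequence[float]], labels: Sequence[Hashable]) -> float:
    """Average silhouette width given pairwise distances and cluster labels."""
    clusters: Dict[Hashable, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")

    widths = []
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            widths.append(0.0)  # convention for singleton clusters (assumption)
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return sum(widths) / len(widths)
```

Running this for each candidate number of clusters and keeping the one with the highest score mirrors the selection rule described above.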

2.5 Statistical Analysis

A Kaplan-Meier curve shows the survival probability of a group of patients over time [20]. The statistical differences in mortality trends between clusters were evaluated using Kaplan-Meier curves and a multivariate log-rank test [21]. One-way analysis of variance (ANOVA) tests were also performed on comorbidity indexes created from comorbidity data.
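As an illustration of how a Kaplan-Meier curve is built, the sketch below implements the product-limit estimator for one group. The function name and the encoding are assumptions: `durations` holds follow-up times in days and `events` is 1 for death and 0 for censoring, so a patient alive at the end of a 28-day window would be entered as `(28, 0)`.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(durations: Sequence[float], events: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (time, survival probability) at each distinct event time."""
    event_times = sorted({t for t, e in zip(durations, events) if e})
    curve, survival = [], 1.0
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - deaths / at_risk
        curve.append((t, survival))
    return curve
```

Computing one such curve per phenotype gives the step functions that a log-rank test then compares.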

3 Results

3.1 Demographics

The demographics of our patient cohort, including race, ethnicity, sex, and age, are listed in Table 1. Our data consist of 3,225 patients with a total of 3,349 patient encounters or admissions. Of the 3,225 patients, 1,295 died, corresponding to an in-hospital mortality rate of 40.2%.

Table 1: Demographics of patient population.

Demographic Feature	Value
Age in years, mean (standard deviation)	62.3 (15.5)
Males, count (%)	1814 (54.2)
Race: African American or Black, count (%)	1624 (48.5)
Race: Caucasian or White, count (%)	1442 (43.1)
Ethnicity: Hispanic, count (%)	129 (3.9)
Ethnicity: Non-Hispanic, count (%)	2975 (88.8)

3.2 Silhouette Score Analysis

The silhouette score for CRLI was highest for 4 clusters, indicating the optimal number of clusters. The choice of 4 clusters was corroborated by the K-means + DTW algorithm, whose silhouette score was also highest for 4 clusters. The silhouette scores may not be comparable between the two clustering methods, since they are computed on different representations (the clustering latent versus DTW distances); however, Ma et al. reported that CRLI performed better than K-means + DTW when evaluated with more reliable ground-truth-based metrics [8]. The silhouette scores of the two methods are shown in Table 2.

Table 2: Silhouette scores calculated for CRLI using clustering latent and silhouette scores calculated for K-means + DTW using DTW.

Clusters	CRLI	K-means + DTW
3	0.362	0.177
4	0.389	0.193
5	0.206	0.153

3.3 Phenotyping Results

Using the silhouette score criterion, the CRLI algorithm yielded four distinct data-driven clusters. Various demographics and mortality outcomes of each cluster are shown in Table 3. Both the number of unique patients and the number of encounters are reported because some patients had more than one encounter. The derived clusters were analyzed and characterized into phenotypes by an expert physician. Cluster 1 was characterized as a liver dysfunction/heterogeneous phenotype, cluster 2 as a hypercapnia phenotype, cluster 3 as a hypoxemia phenotype, and cluster 4 as a multiple organ dysfunction syndrome (MODS) phenotype.

The mean trajectory for each phenotype was plotted for various indicators, as shown in Fig. 4. Dashed lines show $83.4\%$ confidence intervals, which were included to assess statistically significant differences ( $p<0.05$ ) between points in the time series: two means whose $83.4\%$ confidence intervals do not overlap differ at approximately the $p<0.05$ level [22].
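The link between $83.4\%$ intervals and $p<0.05$ can be checked with a short calculation. The sketch assumes two independent means with equal standard errors and a normal approximation, which is a simplification; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

# Half-width of an 83.4% interval, in units of one mean's standard error.
z_ci = std_normal.inv_cdf(0.5 + 0.834 / 2)

# If two such intervals just touch, the means are 2 * z_ci standard errors
# apart, while the standard error of their difference is sqrt(2) of one.
z_diff = 2 * z_ci / sqrt(2)

# Two-sided p-value for a difference of that size.
p_two_sided = 2 * std_normal.cdf(-z_diff)

print(f"z_ci={z_ci:.3f}, z_diff={z_diff:.3f}, p={p_two_sided:.3f}")
```

Under these assumptions `z_diff` is close to 1.96, the usual critical value for a two-sided test at the 0.05 level, which is why non-overlap of the dashed bands serves as a visual significance check.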

To visualize our phenotypes, the Uniform Manifold Approximation and Projection (UMAP) [23] scheme was applied to the latent representation (3349 $\times$ 128) to reduce its dimensions to (3349 $\times$ 2), as shown in Fig. 2. UMAP was used solely for visualization of the phenotypes and was not part of the clustering process itself.

3.4 Kaplan-Meier Analysis

The Kaplan-Meier curves were plotted for the derived phenotypes to show the 28-day short-term survival probability as shown in Fig. 3. Phenotype 2 has the best survival probability trends followed by phenotype 1. Phenotype 3 and phenotype 4 have the worst survival probability trends.

3.5 Comorbidity Analysis

For each cluster, the Charlson Comorbidity Index (CCI) and the age-adjusted Charlson Comorbidity Index (ACCI) were calculated using ICD-10 codes [24]. These comorbidity indexes estimate the risk of death, with a higher number indicating a higher risk. Comorbidities were taken from the admission diagnoses. Charlson component scores were also calculated; each is the proportion of patients within a phenotype who were affected by a given component comorbidity of the CCI. The Python package comorbidipy was used for the calculations [25]. Both CCI and ACCI were computed with the Quan mappings and Quan weights. The CCI and ACCI for the four phenotypes are shown in Table 4.

Table 3: Demographic information of the four derived phenotypes.

Demo. variable	Phenotype 1	Phenotype 2	Phenotype 3	Phenotype 4
#Encounters, n(%)	1638 (48.91)	347 (10.36)	1169 (34.90)	195 (5.82)
#Patients, n	1586	332	1161	193
Mortality, n(%)	609 (38.40)	93 (28.01)	505 (43.50)	88 (45.60)
Age, m(SD)	61.53 (15.27)	64.51 (15.19)	62.70 (15.89)	62.53 (15.39)
Males, n(%)	882 (53.85)	176 (50.72)	641 (54.83)	115 (58.97)
Race
Black, n(%)	816 (49.82)	180 (51.87)	532 (45.51)	96 (49.23)
White, n(%)	676 (41.27)	159 (45.82)	519 (44.40)	88 (45.13)
Ethnicity
Hispanic, n(%)	62 (3.79)	10 (2.88)	53 (4.53)	4 (2.05)
Non-Hisp., n(%)	1443 (88.10)	321 (92.51)	1035 (88.54)	176 (90.26)

Symbols used — #, n: count, m: mean, SD: standard deviation. Please note that mortality was computed with respect to patients (not encounters).

Table 4: Charlson comorbidity indexes (CCI) and age-adjusted CCI (ACCI) for the derived phenotypes on admission diagnosis.

Phenotype	CCI	ACCI
1	4.27	6.10
2	4.11	6.17
3	4.01	5.95
4	4.30	6.17

A one-way ANOVA was performed on both CCI and ACCI. The difference in CCI between phenotypes was statistically significant ( $p<0.05$ ), while the difference in ACCI was not ( $p=0.42$ ). The frequencies of certain pertinent comorbidities and Charlson components were compared between phenotypes for statistical significance using two-proportion z-tests. Cardiac arrest rates were significantly higher for phenotype 4 at 37.44% ( $p<0.00001$ ) when compared to all other phenotypes (phenotype 1 = 15.32%, phenotype 2 = 20.46%, phenotype 3 = 19.25%). The Charlson component score for chronic obstructive pulmonary disease was significantly higher for phenotype 2 at 0.5187 ( $p<0.00001$ ) when compared to all other phenotypes (phenotype 1 = 0.2851, phenotype 3 = 0.2797, phenotype 4 = 0.2769). The Charlson component score for mild liver disease was significantly higher for phenotype 1 at 0.2735 when compared to all other phenotypes (phenotype 2 = 0.1470, $p<0.00001$ ; phenotype 3 = 0.1993, $p<0.00001$ ; phenotype 4 = 0.2000, $p<0.05$ ). The Charlson component score for moderate or severe liver disease was significantly higher for phenotype 1 at 0.1538 when compared to phenotypes 2 and 3 but not phenotype 4 (phenotype 2 = 0.0634, $p<0.00001$ ; phenotype 3 = 0.0924, $p<0.00001$ ; phenotype 4 = 0.1282, $p=0.34$ ).
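A pooled two-proportion z-test of the kind used above can be sketched as follows. The function name is hypothetical, and the example counts (73 of 195 and 251 of 1638) are not reported in the paper: they are back-calculated here from the cardiac arrest percentages and the encounter totals in Table 3, on the assumption that encounters are the denominator.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> Tuple[float, float]:
    """Pooled two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * NormalDist().cdf(-abs(z))
    return z, p_value


# Cardiac arrest, phenotype 4 vs. phenotype 1 (counts are back-calculated assumptions).
z, p = two_proportion_z_test(73, 195, 251, 1638)
print(f"z = {z:.2f}, p < 0.00001: {p < 1e-5}")
```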

4 Discussion

Four data-driven trajectory phenotypes were derived: a liver dysfunction/heterogeneous phenotype, a hypercapnia phenotype, a hypoxemia phenotype, and a MODS phenotype. Phenotype 1 is the liver dysfunction/heterogeneous phenotype, with the fewest distinguishing features. It has an intermediate mortality of 38.40%, the second highest CCI, and the third highest ACCI. However, from the trajectory clustering, we note that phenotype 1 has a high International Normalized Ratio (INR), high total bilirubin, and low platelets. This could indicate liver dysfunction and/or coagulopathy. Liver dysfunction is supported by the comorbidity analysis, which shows high rates of mild liver disease and of moderate to severe liver disease. It should also be noted that there is a precipitous drop in SpO₂/FiO₂ prior to ventilation. Patients in this phenotype seem to go from having little difficulty breathing to quickly being in a state of respiratory distress requiring mechanical ventilation, similar to other phenotypes.

Phenotype 2 is the hypercapnia phenotype and has the lowest mortality of the phenotypes at 28.01%. It also has the third highest CCI and is tied for the highest ACCI. This phenotype has the lowest pH, the highest PaCO₂, and the highest levels of bicarbonate. These trajectories are suggestive of hypercapnia. In the comorbidity analysis, phenotype 2 has the highest proportion of patients with chronic obstructive pulmonary disease (COPD) at 0.5187. COPD often leads to hypercapnia, making this high proportion unsurprising [26]. In patients with COPD, the ability to increase minute ventilation is limited, so their need for mechanical ventilation may be more a consequence of their baseline lung function than of the severity of their septic process. This may explain why this phenotype has the lowest mortality [27].

Phenotype 3 is the hypoxemia phenotype. It has a relatively high mortality rate of 43.50%, and also has the lowest CCI and ACCI. This phenotype has the highest FiO₂, the lowest SpO₂/FiO₂ ratio, and the lowest PaO₂/FiO₂ ratio. These patients may benefit from subgroup analyses, as they have the lowest burden of comorbid conditions, yet relatively high mortality and severe respiratory decline. Future studies that focus on this phenotype may be able to identify targets for intervention prior to respiratory and other end-organ decline.

Phenotype 4 is the multiple organ dysfunction syndrome (MODS) phenotype. It has the highest mortality of the phenotypes at 45.60%, the highest CCI, and an ACCI tied with phenotype 2 for the highest. This phenotype has the highest creatinine, blood urea nitrogen (BUN), lactic acid, and B-type natriuretic peptide (BNP). Although the differences between phenotype 4 and the other phenotypes for BUN and BNP were not statistically significant, this pattern suggests heart failure and kidney injury. Interestingly, phenotype 4 also had the highest SpO₂, PaO₂, and PaO₂/FiO₂ ratio, indicating that these patients did not suffer from significant hypoxemia despite being the sickest phenotype. In the comorbidity analysis, phenotype 4 had a significantly higher rate of cardiac arrest at 37.44% than the other phenotypes. The high mortality rate, high comorbidity indexes, trajectory patterns, and high cardiac arrest rate suggest MODS or multiple organ failure.

Some clinical significance can be derived from these phenotypes. While the distinction between hypoxemic and hypercapnic ARF translated into respective phenotypes, CRLI yielded two additional phenotypes: the liver dysfunction/heterogeneous phenotype and the MODS phenotype. Of the four, the liver dysfunction/heterogeneous phenotype and the hypercapnic respiratory failure phenotype appear to correlate with their underlying comorbidities. Interestingly, the hypoxemic ARF phenotype had the lowest CCI and ACCI, yet had the most severe hypoxemia and the second highest mortality. This phenotype may represent a subgroup of sepsis-induced ARF patients whose poor outcomes are directly related to their respiratory failure and who may benefit from targeted interventions and adjunctive treatments for severe hypoxemia. The MODS phenotype, on the other hand, may represent a high-risk subgroup who suffer ARF and poor outcomes as a consequence of multiple organ failures. In this way, our clustering algorithm can help identify the sickest subset of patients with sepsis-induced ARF. The liver dysfunction/heterogeneous phenotype is interesting because it is the largest group, yet CRLI treated it as a separate phenotype rather than assigning it to one of the two main types of ARF. This suggests that classifying ARF simply as hypoxemic or hypercapnic respiratory failure may be an oversimplification and that additional comorbidities and organ failures may influence the clinical phenotype of sepsis-induced ARF.

CRLI has been shown to outperform earlier deep representation learning-based algorithms, such as VaDER, deep temporal clustering representation (DTCR), and deep temporal clustering (DTC) [8]. Our results illustrate its technical capabilities. By yielding phenotypes with statistically different Kaplan-Meier curves, selecting the same number of clusters as the standard K-means + DTW algorithm, and reproducing two phenotypes that were already well established, CRLI demonstrated its ability to cluster our time series data effectively. This also supports our choice of four phenotypes. In addition, several pre-processing and evaluation challenges were overcome. Multiple imputation methods were used to handle the high percentage of missing values in the data. After the clusters were created, there was no obvious way to determine statistical significance between trajectories. Overlaying the 83.4% confidence intervals was a simple-to-implement solution that shows at which time points the clusters differ significantly from one another. This helped analysts and clinicians determine which trajectories had significant differences.

This study has several limitations. Our patient cohort is defined by the requirement for mechanical ventilation, which describes a small but clinically important subset of patients with sepsis and septic shock. How these phenotypes apply a priori to sepsis patients without respiratory failure, or to sepsis patients who require non-invasive forms of respiratory support, is unclear. Furthermore, our work considered data from a single academic hospital system only, so further validation using comprehensive data from external centers is required to establish its generalizability and reproducibility across populations with different demographics and comorbidities. Additionally, the sparse collection (not always at hourly intervals) of a few labs (e.g., PaO₂, PaCO₂) required imputation that sometimes rendered the trajectories flat. However, the values of these measurements usually do not change significantly over a short period of time. Finally, while using 83.4% confidence intervals is a simple way to determine statistical significance, it does not take the shape of the trajectories into consideration. Trajectories that differ in shape may be missed if only the distance between them is assessed.

Despite these limitations, this ARF trajectory phenotyping study represents a pioneering effort to identify distinct groups among MICU patients with a wide range of medical conditions. We expect that an ML model trained on these derived phenotypes could be used prospectively in ICU settings to obtain real-time phenotype predictions. The next phases of this research will involve analyzing more patients’ longitudinal data and conducting a prospective study to evaluate responses to different treatments, such as high vs. low PEEP strategies, vasopressors, and steroids. We recommend that the results be validated by clustering data from other hospitals’ MICUs or by trying to classify patients from other MICUs into one of the four phenotypes. Better data collection is also needed to reduce missing and spurious values. Finally, techniques to handle values that are missing at random and missing not at random should be explored [28].

5 Conclusion

This exploratory study uses a multivariate set of six cardiorespiratory indicators and the deep representation learning-based CRLI clustering algorithm to derive data-driven trajectory groups of patients with sepsis-induced ARF. Our method yielded four distinct groups that could be interpreted as four meaningful phenotypes. These phenotypes are clinically characterized as liver dysfunction/heterogeneous, hypercapnia, hypoxemia, and MODS. The primary categories of ARF, hypercapnia and hypoxemia, were well detected by the CRLI model, and two other phenotypes were revealed that did not fit either category. Using the comorbidity analysis and the Kaplan-Meier analysis of mortality data, we characterized the derived trajectory clusters and the differences between them. In addition to the clinically relevant results, we demonstrated the potential of multivariate time series clustering algorithms for the exploration of ICU data. Future studies are required to reveal the effects of treatment on the derived phenotypes in order to optimize critical care and improve patient outcomes.

References

[1] M. Singer, C. S. Deutschman, C. W. Seymour, M. Shankar-Hari, D. Annane, M. Bauer, R. Bellomo, G. R. Bernard, J.-D. Chiche, C. M. Coopersmith et al., “The third international consensus definitions for sepsis and septic shock (Sepsis-3),” JAMA, vol. 315, no. 8, pp. 801–810, 2016.
[2] CDC, “Sepsis Is The Body’s Extreme Response To An Infection. — cdc.gov,” https://www.cdc.gov/sepsis/what-is-sepsis.html, [Accessed 08-04-2024].
[3] “Sepsis — who.int,” https://www.who.int/news-room/fact-sheets/detail/sepsis, [Accessed 08-04-2024].
[4] D. P. Gurka and R. A. Balk, “Chapter 38 - acute respiratory failure,” in Critical Care Medicine (Third Edition), J. E. Parrillo and R. P. Dellinger, Eds. Philadelphia: Mosby, 2008, pp. 773–794. [Online]. Available: https://www.sciencedirect.com/science/article/pii/B9780323048415500406
[5] V. S. Mirabile, E. Shebl, A. Sankari, and B. Burns, “Respiratory failure,” in StatPearls [Internet]. StatPearls Publishing, 2023.
[6] K. Lewandowski, J. Metz, C. Deutschmann, H. Preiss, R. Kuhlen, A. Artigas, and K. J. Falke, “Incidence, severity, and mortality of acute respiratory failure in Berlin, Germany,” American journal of respiratory and critical care medicine, vol. 151, no. 4, pp. 1121–1125, 1995.
[7] M. Müller, “Dynamic time warping,” Information retrieval for music and motion, pp. 69–84, 2007.
[8] Q. Ma, C. Chen, S. Li, and G. W. Cottrell, “Learning representations for incomplete time series clustering,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 10, pp. 8837–8846, May 2021. [Online]. Available: https://ojs.aaai.org/index.php/AAAI/article/view/17070
[9] J. de Jong, M. A. Emon, P. Wu, R. Karki, M. Sood, P. Godard, A. Ahmad, H. Vrooman, M. Hofmann-Apitius, and H. Fröhlich, “Deep learning for clustering of multivariate clinical patient trajectories with missing values,” GigaScience, vol. 8, no. 11, p. giz134, 2019.
[10] Y. Zhong, D. Huang, and C.-D. Wang, “Deep temporal contrastive clustering,” Neural Processing Letters, vol. 55, no. 6, pp. 7869–7885, 2023.
[11] A. Alqahtani, M. Ali, X. Xie, and M. W. Jones, “Deep time-series clustering: A review,” Electronics, vol. 10, no. 23, p. 3001, 2021.
[12] P. Schulam, F. Wigley, and S. Saria, “Clustering longitudinal clinical marker trajectories from electronic health data: Applications to phenotyping and endotype discovery,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 29, no. 1, 2015.
[13] C. Liu, F. Wang, J. Hu, and H. Xiong, “Temporal phenotyping from longitudinal electronic health records: A graph based framework,” in Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 705–714.
[14] L. D. Bos, M. Sjoding, P. Sinha, S. V. Bhavani, P. G. Lyons, A. F. Bewley, M. Botta, A. M. Tsonas, A. S. Neto, M. J. Schultz et al., “Longitudinal respiratory subphenotypes in patients with COVID-19-related acute respiratory distress syndrome: results from three observational cohorts,” The Lancet Respiratory Medicine, vol. 9, no. 12, pp. 1377–1386, 2021.
[15] M. Komorowski, A. Green, K. C. Tatham, C. Seymour, and D. Antcliffe, “Sepsis biomarkers and diagnostic tools with a focus on machine learning,” EBioMedicine, vol. 86, 2022.
[16] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi, “Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis,” IEEE Journal of Biomedical and Health Informatics, vol. 22, no. 5, pp. 1589–1604, 2018.
[17] W. Du, “PyPOTS: a Python toolbox for data mining on partially-observed time series,” arXiv preprint arXiv:2305.18811, 2023.
[18] R. Tavenard, J. Faouzi, G. Vandewiele, F. Divo, G. Androz, C. Holtz, M. Payne, R. Yurchak, M. Rußwurm, K. Kolar et al., “Tslearn, a machine learning toolkit for time series data,” Journal of machine learning research, vol. 21, no. 118, pp. 1–6, 2020.
[19] M. Shutaywi and N. N. Kachouie, “Silhouette analysis for performance evaluation in machine learning with applications to clustering,” Entropy, vol. 23, no. 6, p. 759, 2021.
[20] M. K. Goel, P. Khanna, and J. Kishore, “Understanding survival analysis: Kaplan-Meier estimate,” International journal of Ayurveda research, vol. 1, no. 4, p. 274, 2010.
[21] “lifelines - lifelines 0.28.0 documentation — lifelines.readthedocs.io,” https://lifelines.readthedocs.io/en/latest/, [Accessed 15-03-2024].
[22] M. J. Knol, W. R. Pestman, and D. E. Grobbee, “The (mis)use of overlap of confidence intervals to assess effect modification,” European journal of epidemiology, vol. 26, pp. 253–254, 2011.
[23] L. McInnes, J. Healy, N. Saul, and L. Großberger, “UMAP: Uniform manifold approximation and projection,” Journal of Open Source Software, vol. 3, no. 29, p. 861, 2018. [Online]. Available: https://doi.org/10.21105/joss.00861
[24] M. E. Charlson, D. Carrozzino, J. Guidi, and C. Patierno, “Charlson comorbidity index: a critical review of clinimetric properties,” Psychotherapy and psychosomatics, vol. 91, no. 1, pp. 8–35, 2022.
[25] “comorbidipy — pypi.org,” https://pypi.org/project/comorbidipy/, [Accessed 15-03-2024].
[26] B. Csoma, M. R. Vulpi, S. Dragonieri, A. Bentley, T. Felton, Z. Lázár, and A. Bikov, “Hypercapnia in COPD: causes, consequences, and therapy,” Journal of Clinical Medicine, vol. 11, no. 11, p. 3180, 2022.
[27] W. F. Abdo and L. M. Heunks, “Oxygen-induced hypercapnia in COPD: myths and facts,” Critical Care, vol. 16, pp. 1–4, 2012.
[28] C. Ma and C. Zhang, “Identifiable generative models for missing not at random data imputation,” Advances in Neural Information Processing Systems, vol. 34, pp. 27645–27658, 2021.