Temporal-spatial Correlation Attention Network for Clinical Data Analysis in Intensive Care Unit
Abstract
In recent years, medical information technology has made it possible for electronic health records (EHRs) to store fairly complete clinical data, bringing health care into the era of "big data". However, medical data are often sparse and strongly correlated, which makes many medical problems difficult to solve effectively. The rapid development of deep learning in recent years has created opportunities for the use of big data in healthcare. In this paper, we propose a temporal-spatial correlation attention network (TSCAN) to handle several clinical characteristic prediction problems, such as predicting mortality, predicting length of stay, detecting physiologic decline, and classifying phenotypes. Based on the design of its attention mechanism, our approach can effectively remove irrelevant items in clinical data and irrelevant nodes in time according to different tasks, and thereby obtain more accurate prediction results. Our method can also identify key clinical indicators of important outcomes that can be used to improve treatment options. Our experiments use the publicly available Medical Information Mart for Intensive Care (MIMIC-IV) database. Our model achieves significant performance gains over other state-of-the-art prediction methods, reaching an AUC-ROC of 90.7% for in-hospital mortality prediction and a kappa score of 45.1% for length-of-stay prediction. The source code is available at https://github.com/yuyuheintju/TSCAN.
Keywords: Deep Learning, Medical Time Series, MIMIC-IV, Clinical Data Analysis.
1 Introduction
The explosive growth of electronic healthcare data and the rapid development of artificial intelligence technology have facilitated the deep application of healthcare data in multiple scenarios [1]. On the one hand, the rapid development of the Internet of Things (IoT) and big data technologies provides technical support for the large-scale acquisition, safe storage, and rapid analysis of medical health data; the health care sector has now entered the era of big data. On the other hand, artificial intelligence has reached a new stage, with significant advances in both theory and technology. Machine learning, especially deep learning, provides technical support for intelligent medical applications.
Electronic Health Records (EHRs) contain a variety of medical time-series data organized in different forms, such as records of patients' admission times, medications taken and procedures scheduled, observations of laboratory results, etc. The health status of patients in the Intensive Care Unit (ICU) needs to be monitored in real time to ensure that medical staff can provide timely and effective care and that treatment plans can be adjusted as a patient's condition evolves. Therefore, regularly monitored indicators of a patient's physical health, such as respiratory rate, heart rate, and blood pressure, often contain a wealth of information that can be used to formulate a patient's treatment strategy.
Quantifying patient health and predicting future outcomes is an important area of critical care research [2]. One of the most immediately relevant outcomes for the ICU is in-hospital mortality, leading many studies toward the development of mortality prediction models. The accuracy of early mortality prediction can benefit the formulation of patients’ care strategies. Traditionally, discharge date estimation is done manually by clinicians, which is time-delayed and unreliable [3]. Deep learning models based on EHR can improve the accuracy of length of stay prediction, which can reduce the burden on clinicians in terms of administration. Meanwhile, decompensation prediction enables more complex decision planning, and phenotype classification can also be used early on as a complementary aid to the doctor’s diagnosis.
However, while medical time series are highly valuable, their information density is very low, which poses great challenges for accurate data analysis. As a result, the core problem of healthcare big data research is how to use artificial intelligence technology to fully exploit the value in the data and turn that value into applications. Most existing work uses only a small number of physiological variables and pays little attention to the correlations between different variables. For instance, [4, 5] proposed benchmarks to address four meaningful clinical tasks, [6] proposed LSTM recurrent neural networks to handle clinical diagnosis, and [3] proposed a temporal pointwise convolutional network to handle length-of-stay prediction. However, all of these methods often lose too much relevant diagnostic information because of data sparsity. Moreover, they cannot quantify how important each piece of diagnostic information is to the task, which makes it difficult to derive guidance for diagnosis and treatment.
We propose a temporal-spatial correlation attention network (TSCAN) to address various clinical characteristic prediction problems, such as predicting death, predicting length of stay, detecting physiologic decline, and classifying phenotypes. Our proposed temporal and spatial attention networks are designed to facilitate multiple core information fusion and extraction tasks. To fully exploit the information in the temporal dimension of the medical time series, we introduce a recursive and merged attention mechanism, which enables the fusion of information from previous time periods in the current time period and allows for an orderly exploration of the hidden information in the temporal dimension when dealing with long time series. Additionally, we consider both temporal and spatial dimensions, taking into account the relevance of both dimensions and fusing them, making our approach more comprehensive than most studies on medical time series, which only consider temporal connections.
Our method can effectively remove irrelevant items in clinical data as well as irrelevant nodes in time based on different tasks, resulting in more accurate prediction results. Our method can also find key clinical indicators of important outcomes that can be used to improve treatment options. Finally, we select data from the MIMIC-IV dataset to demonstrate the performance of our approach.
The contributions of this paper are as follows:
• We propose a temporal-spatial correlation attention network (TSCAN) to handle several clinical characteristic prediction problems; it exceeds the current best algorithms on all reported metrics.
• We expanded the physiological variables in the medical time series to 155 with the advice of professionals from collaborating local hospitals. We also publish these data (https://github.com/yuyuheintju/TSCAN) for other researchers' work.
• We discuss the contribution ratios of the 155 physiological variables selected on the advice of professionals, which has important implications for subsequent optimization of the model and for medical recommendations.
The remainder of this article is organized as follows. Section 2 presents related work. Section 3 describes the MIMIC-IV dataset and the data selection process. Section 4 provides the details of our approach. The corresponding experimental results and analysis are given in Section 5. Finally, we conclude this paper and discuss limitations and future work in Section 6.
2 Related Work
EHR contains different types of medical time-series data, and a great deal of research has been accomplished in medical time series analysis. The key to the research is how to automatically extract temporal associations and long-range dependencies from the data, and the main methods of the research focus on traditional machine learning methods and deep learning methods based on neural networks, of which deep learning-based medical time series analysis has gradually become a major research direction in recent years.
2.1 Traditional machine learning algorithms
Traditional machine learning methods for building predictive models based on medical time-series data have been developed over decades, including conditional random field (CRF) [7], Expectation-Maximization algorithm (EM) [8], Bayesian networks [9] and others. Caruana et al. [10] showed that backpropagation can be effective in medical data analysis, and Cooper et al. [11] demonstrated the application of eight machine learning algorithms, including K-nearest neighbor (KNN), in predicting patients’ survival. Awad et al. [12] used a random forest algorithm to predict early mortality of patients, and Bayesian networks [13, 14] were used to develop multi-label classifiers, which were applied to medical multi-classification problems. Logistic Regression (LR) was widely used in the field of predicting mortality risk among hospitalized patients [15, 16, 17, 11] and was chosen as one of the baselines.
2.2 Deep learning algorithms
In recent years, deep learning technology with deep neural networks at its core has gradually emerged in the field of medical applications as computing power has increased. Lasko et al. [18] used hierarchical autoencoders (AEs) based on time-series data of uric acid measurements to predict the risk of gout and acute leukaemia. Chen et al. [19] proposed a CNN-based model to extract local temporal associations in time-series data for predicting the future risk of congestive heart failure (CHF) and chronic obstructive pulmonary disease (COPD). Mobley et al. [20] applied an ANN to length-of-stay prediction, Grigsby et al. [21] treated length-of-stay prediction as a binary classification task in order to identify patients at risk of long-term hospitalisation at an early stage, and Rocheteau et al. [3] proposed the Temporal Pointwise Convolution (TPC), achieving relatively good results on this task.
Lipton et al. [6] extracted 13 variables of laboratory outcome from paediatric ICU patients and used Long Short-Term Memory networks (LSTM) to learn temporal associations and long-distance dependencies in time-series data. Choi et al. [22] used a variety of RNN models such as Gated Recurrent Unit (GRU) and LSTM networks to process clinical time-series data for the prediction of medical events. It is worth mentioning the extensive application of LSTM in clinical prediction tasks, including prediction of cardiac arrest [23], acute kidney injury [24], missing data inference [25], prediction of drugs [26, 27] and length-of-stay prediction [28].
2.3 Application on specific tasks
In the task of mortality prediction, [29, 30] illustrated that artificial neural networks (ANNs) can achieve better results than Logistic Regression. For decompensation prediction, the Recurrent Attentive and Intensive Model (RAIM) [31] and Simply Attend and Diagnose (SAnD) [32] have applied attention mechanisms to predict physiological decompensation of critical patients in the ICU. Phenotype classification is also a promising task, and temporal convolutional networks [33] and feedforward networks [34, 18] have already been used to predict diagnostic codes from medical time-series data. Several studies have indicated that deep learning models, represented by LSTM, perform well in predicting mortality [35, 36], length of stay [28], and diagnostic classification [37]. LSTM [4] was therefore chosen as another baseline.

3 MIMIC-IV Dataset
Table 1: 24 of the 155 selected physiological variables, their MIMIC-IV source tables, and how they are modeled.
Variable | MIMIC-IV table | Model as |
Albumin | chartevents | continuous |
Anion gap | chartevents | continuous |
Capillary refill rate | chartevents | categorical |
Cholesterol | chartevents | continuous |
Diastolic blood pressure | chartevents | continuous |
Fraction inspired oxygen | chartevents | continuous |
Glasgow coma scale eye opening | chartevents | categorical |
Glasgow coma scale motor response | chartevents | categorical |
Glasgow coma scale total | chartevents | categorical |
Glasgow coma scale verbal response | chartevents | categorical |
Glucose | chartevents,labevents | continuous |
Heart Rate | chartevents | continuous |
Height | chartevents | continuous |
Hemoglobin | chartevents | continuous |
Magnesium | chartevents | continuous |
Mean blood pressure | chartevents | continuous |
Oxygen saturation | chartevents,labevents | continuous |
Prothrombin time | chartevents | continuous |
Respiratory rate | chartevents | continuous |
Systolic blood pressure | chartevents | continuous |
Temperature | chartevents | continuous |
Troponin-T | chartevents | continuous |
Weight | chartevents | continuous |
pH | chartevents,labevents | continuous |
3.1 Data Description
We use the Medical Information Mart for Intensive Care (MIMIC-IV v0.4) dataset [38]. The MIMIC-IV database is a public database of clinical data that researchers from all over the world can use for free. The database has clinical information on more than 380,000 patients who were admitted to Beth Israel Deaconess Medical Center in Boston, Massachusetts, USA, from 2008 to 2019. Like other EHR databases, it keeps detailed information on patients’ demographics, lab tests, medication administration, vital signs, surgical operations, disease diagnosis, medication management, survival status, and more.
MIMIC-IV organizes its data in a modular way: three modules comprising 27 tables make it easy to use data from different sources separately or together. Another advantage is the protection of patients' privacy, which is de-identified in two steps: first, patient identifiers, hospital identifiers, and similar fields are replaced with random codes; second, all dates for a given patient are shifted by a fixed, randomly chosen number of days.
3.2 Data Selection
The data pre-processing workflow is illustrated in Fig.1. The MIMIC-IV critical care database contains 76,540 ICU stays from 53,150 admitted patients.
In the first step, the relevant data are extracted directly from the original MIMIC-IV tables and organized by subject (i.e., by patient). This step applies two filters: we keep only adult patients (over 18 years of age) who were in the ICU only once and who did not transfer between ICU wards or other wards during the same hospital stay. This removes differences between pediatric and adult physiology and other confounding factors, leaving 47,046 unique patients with a total of 59,372 ICU stays.
In the second step, we exclude clinical events that cannot be matched with ICU stays. First, events with a missing admission ID (HADM_ID) are considered, and only events with a valid HADM_ID are retained. Second, since the icustays table links each HADM_ID to its ICU stays, we exclude events whose HADM_ID does not appear in it. Then, for events whose ICU stay ID (ICUSTAY_ID) is missing, we attempt to recover the ID by matching on the admission. Finally, we retain only events whose ICUSTAY_ID is present in the icustays table.
In the third step, each single ICU admission of a patient is treated as an "episode". For each episode, the events are assembled in chronological order, keeping only the variables from a pre-defined list. We use 155 physiologic variables, expanded from the 17 variables that form a subset of the PhysioNet/CinC Challenge 2012 [39]. The 155 physiological variables, comprising 5 categorical variables and 150 continuous variables, were chosen with the help of the collaborating local hospitals. 24 of the 155 variables are listed in Table 1. In total, we obtained over 92 million events from five tables (icu/inputevents, icu/procedureevents, icu/chartevents, icu/outputevents, hosp/labevents). Finally, we fixed a test set of 15% (7,057) of patients, comprising 8,906 ICU stays.
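As a concrete illustration of the first filtering step, the following sketch applies the adult, single-stay, and no-transfer criteria with pandas, assuming the MIMIC-IV tables patients and icustays have been exported to CSV with their standard column names; the file paths are placeholders.

```python
# A minimal sketch of the step-one cohort filtering described above, assuming the
# public MIMIC-IV schema (anchor_age in patients; first/last care unit in icustays).
import pandas as pd

patients = pd.read_csv("hosp/patients.csv")   # subject_id, anchor_age, ...
stays = pd.read_csv("icu/icustays.csv")       # subject_id, hadm_id, stay_id,
                                              # first_careunit, last_careunit, ...

# Keep adults only (over 18 years of age).
adults = patients.loc[patients["anchor_age"] > 18, ["subject_id"]]
stays = stays.merge(adults, on="subject_id", how="inner")

# Drop stays that involve a transfer between care units during the same stay.
stays = stays[stays["first_careunit"] == stays["last_careunit"]]

# Keep only patients with a single ICU stay.
n_stays = stays.groupby("subject_id")["stay_id"].transform("count")
stays = stays[n_stays == 1]

print(f"{stays['subject_id'].nunique()} patients, {len(stays)} ICU stays retained")
```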
4 Our Approach


Our model is designed to find correlations in both time and space, taking into account both temporal trends and connections between features. For long medical time sequences, we want to integrate the features of earlier time periods when processing the information of the final time period, and to fully exploit the hidden features of the temporal dimension. For the task of predicting in-hospital mortality, we also want to know how much each of the 155 pre-selected variables contributes to the prediction; for example, the model should determine which of several indicators (such as blood protein, blood pressure, or heart rate) is the strongest predictor of in-hospital mortality, helping doctors focus on the most important information about their patients. The framework is illustrated in Fig.2 and includes two key modules: 1) the Encoder, which obtains the feature vector of each element in both the temporal and spatial dimensions; and 2) the Fusion-Encoder, which fuses the information of these elements and produces the final feature at the end time. This module enables a time-segmented solution that integrates information from all time periods and takes full advantage of the features of the temporal dimension.
4.1 Problem Definition
For each patient, we need to predict in-hospital mortality and handle other tasks at each time interval, such as length-of-stay prediction. Fig.3 illustrates these problems. For each prediction, we use only the clinical data recorded before the prediction time point. As shown in Fig.3, we choose the data from the $T$ hours before the prediction time point as the reference data for the prediction tasks in this paper.
For each patient, we can obtain a sample $X = \{x_1, x_2, \dots, x_T\}$, where $x_t \in \mathbb{R}^{D}$. $T$ denotes the number of hours of clinical data selected before the prediction time point, and $D$ denotes the dimension of the clinical test values collected in each hour. In Section 3, we select the clinical variables based on the advice of professional clinicians. Here, we apply the one-hot method [40] to handle categorical clinical data and convert the clinical values in each hour into vector format. Finally, for each hour, the clinical data are converted into a $D$-dimensional vector for the final prediction task.
In the prediction task, our model needs to map medical features to clinically meaningful labels such as length of stay, in-hospital mortality, etc. A predictive model for $c$-class classification can be represented as a mapping from input to category, i.e., $f: \mathbb{R}^{T \times D} \rightarrow \{1, \dots, c\}$, and the entire sample data set is denoted as $\{(X^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $y^{(i)}$ is the target label and $N$ is the number of samples. For the in-hospital mortality task, the patient target label is $y \in \{0, 1\}$, where $0$ denotes survival and $1$ denotes death. The predictive model produces the output $\hat{y} = f(X; \theta)$ from the input data $X$, where $\theta$ denotes the parameters of the model. The key of our work is to obtain the best parameters $\theta^{*}$ for different tasks, which minimize the difference between the model output $\hat{y}$ and the true label $y$. In the next subsection, we detail the structure of our model.
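As a small illustration of this encoding step, the sketch below converts one hour of measurements into a fixed-length vector and stacks $T$ such vectors into a sample; the variable lists are placeholders rather than the full set of 155 indicators.

```python
# A minimal sketch of turning one hour of clinical measurements into a fixed-length
# vector: categorical variables are one-hot encoded, continuous variables are kept
# as raw values. The variable lists here are illustrative placeholders.
import numpy as np

CONTINUOUS = ["heart_rate", "systolic_bp", "respiratory_rate"]
CATEGORICAL = {"gcs_eye_opening": 4, "capillary_refill_rate": 2}  # name -> #categories

def encode_hour(measurements: dict) -> np.ndarray:
    """measurements maps variable name -> raw value (continuous) or category index."""
    parts = []
    for name in CONTINUOUS:
        parts.append(np.array([measurements.get(name, 0.0)], dtype=np.float32))
    for name, n_cat in CATEGORICAL.items():
        one_hot = np.zeros(n_cat, dtype=np.float32)
        idx = measurements.get(name)
        if idx is not None:
            one_hot[idx] = 1.0
        parts.append(one_hot)
    return np.concatenate(parts)              # shape: (D,)

# A patient sample is a T x D matrix of T consecutive hourly vectors (here T = 48).
sample = np.stack([encode_hour({"heart_rate": 84.0, "gcs_eye_opening": 3})
                   for _ in range(48)])
```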
4.2 Temporal-spatial Correlation Attention Network
In order to consider both temporal and spatial information, we apply the two-branch network shown in Fig.2. Each branch has the same structure, which includes an initial Encoder $E$ and a chain of Fusion-Encoders $F$.
We divide each sample into $K$ equal parts along the time dimension, and the model takes these chunks as input:

$$X = [X_1, X_2, \dots, X_K], \quad X_j \in \mathbb{R}^{(T/K) \times D}. \qquad (1)$$
Temporal Branch. As shown in the upper part of Fig.2, this branch consists of two parts: an Encoder and a chain of Fusion-Encoders. $X_1$ is the input of the Encoder, and its output is denoted as $O_1$:

$$O_1 = E(X_1; \theta_E), \qquad (2)$$
where $\theta_E$ denotes the network parameters of the Encoder. The Encoder's core structure is based on a multi-head self-attention mechanism. The calculation in the Encoder can be expressed as follows:

$$Q = X_1 W^{Q}, \quad K = X_1 W^{K}, \quad V = X_1 W^{V},$$
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V, \qquad (3)$$
$$O_1 = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},$$

where $W^{Q}$, $W^{K}$, $W^{V}$, and $W^{O}$ are learnable projection matrices, $d_k$ is the key dimension, and $\mathrm{head}_i$ denotes the $i$-th attention head.
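To make the Encoder concrete, the following is a minimal PyTorch sketch of such a block, assuming a standard transformer-style layout with residual connections and layer normalization; the hidden size d_model and the number of heads are illustrative choices, not values reported in the paper.

```python
# A minimal sketch of the Encoder: multi-head self-attention over the hourly vectors
# of one time chunk, followed by a feed-forward layer.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (batch, T/K, d_model)
        h, _ = self.attn(x, x, x)         # Q, K, V all come from the same chunk
        x = self.norm1(x + h)
        return self.norm2(x + self.ffn(x))
```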
Then we process the next chunk of data, $X_j$, and merge it with the previous information, $O_{j-1}$. All Fusion-Encoders share the same internal structure, but there is a recursive relationship between them. The Fusion-Encoder for the $j$-th chunk ($j = 2, \dots, K$) is represented by:

$$O_j = F_j(X_j, O_{j-1}; \theta_{F_j}), \qquad (4)$$
where $\theta_{F_j}$ denotes the network parameters of the $j$-th Fusion-Encoder. When $j = 2$, the first connection line in Fig.2 is used, and the second input of the 1st Fusion-Encoder is the output of the Encoder, $O_1$. For $j > 2$, the second input of the Fusion-Encoder is the output of the previous Fusion-Encoder, $O_{j-1}$, at which point the second connection line is used. The calculation in the $j$-th Fusion-Encoder is as follows:

$$Q_j = \mathrm{SelfAttn}(X_j)\, W^{Q}, \quad K_j = O_{j-1} W^{K}, \quad V_j = \phi(Q_j, K_j),$$
$$O_j = \mathrm{softmax}\!\left(\frac{Q_j K_j^{\top}}{\sqrt{d_k}}\right) V_j. \qquad (5)$$
The core is a multi-head cross-attention mechanism in which $Q_j$ is derived from the self-attentive output of the $j$-th time chunk $X_j$, $K_j$ from the previous Fusion-Encoder's output $O_{j-1}$, and $V_j$ from a mapping $\phi$ that combines $Q_j$ and $K_j$. In this case, $V_j$ captures a relatively global feature, so the recursive structure can also attend to global information.
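The Fusion-Encoder step can be sketched in the same spirit. The snippet below is one possible reading of the description above (Q from a self-attention pass over the current chunk, K from the previous output, V from a learned combination of Q and K); it is not the authors' exact implementation, and all layer sizes are assumptions.

```python
# A minimal sketch of one Fusion-Encoder step as described in the text.
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(2 * d_model, d_model)   # combines Q and K into V
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x_j, o_prev):
        # x_j:    (batch, T/K, d_model)  current time chunk
        # o_prev: (batch, T/K, d_model)  output of the Encoder / previous Fusion-Encoder
        s, _ = self.self_attn(x_j, x_j, x_j)
        q = self.w_q(s)
        k = self.w_k(o_prev)
        v = self.w_v(torch.cat([q, k], dim=-1))      # V mixes current and past information
        h, _ = self.cross_attn(q, k, v)
        return self.norm(q + h)
```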
Spatial Branch. We transpose the $K$ equal time-series blocks and use them as the input of the spatial branch, so that the attention mechanism captures the relevance between variable dimensions. As shown in Fig.2, the transposed chunks $X_j^{\top}$ replace the $X_j$ of the temporal branch. There is no other difference between this branch and the temporal one; both are composed of an Encoder and Fusion-Encoders.
Finally, we obtain $O_K^{\mathrm{tem}}$ from the temporal branch and $O_K^{\mathrm{spa}}$ from the spatial branch. We fuse $O_K^{\mathrm{tem}}$ and $O_K^{\mathrm{spa}}$ into the final feature of the patient at the prediction time point, and apply a Softmax layer to train the model and obtain the final prediction. Algorithm 1 shows the overall training process.
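Before turning to the training procedure, the following sketch shows one way the two branches could be assembled, reusing the Encoder and FusionEncoder classes from the sketches above. The input projections, the mean pooling of the final chunk output, and the chunk sizes are illustrative assumptions rather than the paper's exact design.

```python
# A minimal sketch of a TSCAN-like model: the temporal branch runs the Encoder /
# Fusion-Encoder chain over K time chunks, the spatial branch does the same over the
# transposed chunks (variables as the sequence axis), and the two final features are
# concatenated for the classification head. Encoder / FusionEncoder are the classes
# sketched earlier.
import torch
import torch.nn as nn

class TSCANBranch(nn.Module):
    def __init__(self, in_dim, d_model=128, n_heads=4, n_chunks=4):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
        self.encoder = Encoder(d_model, n_heads)
        self.fusions = nn.ModuleList(FusionEncoder(d_model, n_heads)
                                     for _ in range(n_chunks - 1))

    def forward(self, chunks):                 # list of K tensors (batch, L, in_dim)
        out = self.encoder(self.proj(chunks[0]))
        for fusion, chunk in zip(self.fusions, chunks[1:]):
            out = fusion(self.proj(chunk), out)
        return out.mean(dim=1)                 # (batch, d_model) pooled final feature

class TSCAN(nn.Module):
    def __init__(self, n_vars=155, chunk_len=12, n_chunks=4, d_model=128, n_classes=2):
        super().__init__()
        self.temporal = TSCANBranch(in_dim=n_vars, d_model=d_model, n_chunks=n_chunks)
        self.spatial = TSCANBranch(in_dim=chunk_len, d_model=d_model, n_chunks=n_chunks)
        self.head = nn.Linear(2 * d_model, n_classes)
        self.n_chunks = n_chunks

    def forward(self, x):                      # x: (batch, T, n_vars), T = chunk_len * K
        chunks = list(x.chunk(self.n_chunks, dim=1))
        o_t = self.temporal(chunks)                                # time axis as sequence
        o_s = self.spatial([c.transpose(1, 2) for c in chunks])    # variable axis as sequence
        return self.head(torch.cat([o_t, o_s], dim=-1))            # logits for softmax/CE
```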
Require: the initial parameters $\theta_0$, the number of epochs, the number of batches in each epoch, and the number of parts $K$.
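Since the body of Algorithm 1 is not reproduced here, the sketch below shows a standard mini-batch training loop matching the quantities listed in the Require line; the optimizer, learning rate, and loss function are assumptions rather than settings reported in the paper.

```python
# A minimal sketch of the training loop implied by Algorithm 1's requirements:
# iterate over epochs and mini-batches (the model splits each sample into K parts
# internally) and update the parameters by minimizing cross-entropy.
import torch
import torch.nn as nn

def train(model, loader, n_epochs=30, lr=1e-3, device="cpu"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(n_epochs):
        total = 0.0
        for x, y in loader:                    # x: (batch, T, D), y: (batch,)
            x, y = x.to(device), y.to(device)
            logits = model(x)
            loss = criterion(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch + 1}: mean loss {total / max(len(loader), 1):.4f}")
```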
5 Experiments
In this section, we select samples from patients according to the different tasks. In the following subsections, we first introduce the evaluation metrics. We then present the results of our approach on the MIMIC-IV dataset and discuss these experimental results. In particular, based on the experimental results, we invited professional clinicians to analyze the findings and provide clinical feedback, which confirmed that the model is applicable and effective.
5.1 Evaluation Metrics
In-hospital mortality prediction is a binary classification task, so the area under the receiver operating characteristic curve (AUC-ROC) is the main evaluation metric, supplemented by the area under the precision-recall curve (AUC-PR). The main metric for length-of-stay prediction is the median absolute deviation (MAD), where lower is better, supplemented by Cohen's linear weighted kappa score (KAPPA), where higher is better. Since most patients have more than one condition, phenotype classification is a multi-label classification problem, and we choose macro- and micro-averaged AUC-ROC as its evaluation criteria. Decompensation prediction is treated as a binary classification task, like mortality prediction, and is likewise mainly evaluated by AUC-ROC.
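For concreteness, these metrics can be computed as in the following sketch using scikit-learn; MAD is taken here as the median absolute deviation between predicted and true remaining length of stay, following the text.

```python
# A minimal sketch of the evaluation metrics described above.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, cohen_kappa_score

def mortality_metrics(y_true, y_score):
    return {"AUC-ROC": roc_auc_score(y_true, y_score),
            "AUC-PR": average_precision_score(y_true, y_score)}

def los_metrics(y_true_hours, y_pred_hours, y_true_bucket, y_pred_bucket):
    mad = np.median(np.abs(np.asarray(y_true_hours) - np.asarray(y_pred_hours)))
    kappa = cohen_kappa_score(y_true_bucket, y_pred_bucket, weights="linear")
    return {"MAD": mad, "KAPPA": kappa}

def phenotype_metrics(y_true, y_score):          # shapes: (n_samples, n_labels)
    return {"macro AUC-ROC": roc_auc_score(y_true, y_score, average="macro"),
            "micro AUC-ROC": roc_auc_score(y_true, y_score, average="micro")}
```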
5.2 Results
Table 2: In-hospital mortality prediction results.
Group | Model | AUC-ROC | AUC-PR |
baseline | Logistic Regression [4] | 0.848 | 0.474 |
 | LSTM [4] | 0.855 | 0.485 |
others | channel-wise LSTM [4] | 0.862 | 0.515 |
 | SAnD [32] | 0.857 | 0.518 |
 | TimeNet [32] | 0.764 | 0.813 |
 | TPC [3] | 0.905 | 0.691 |
 | Cox Time-Varying [41] | 0.740 | 0.290 |
 | BoXHED [41] | 0.780 | 0.350 |
 | Bagging [42] | 0.780 | \ |
 | Gradient boosting [42] | 0.830 | \ |
 | Random Forests [42] | 0.820 | \ |
 | GRU-D [43] | 0.876 | 0.532 |
TSCAN | Feature concatenation | 0.859 ± 0.025 | 0.491 ± 0.026 |
 | Max pooling | 0.907 ± 0.020 | 0.692 ± 0.019 |
Table 3: Length-of-stay prediction results.
Group | Model | KAPPA | MAD |
baseline | Logistic Regression [4] | 0.402 | 162.3 |
 | LSTM [4] | 0.438 | 123.1 |
others | channel-wise LSTM [4] | 0.442 | 136.6 |
 | SAnD [32] | 0.429 | \ |
TSCAN | Max pooling | 0.451 ± 0.013 | 120.1 ± 1.3 |
Table 4: Phenotype classification results.
Group | Model | Macro AUC-ROC | Micro AUC-ROC |
baseline | Logistic Regression [4] | 0.739 | 0.799 |
 | LSTM [4] | 0.770 | 0.821 |
others | channel-wise LSTM [4] | 0.776 | 0.825 |
 | SAnD [32] | 0.766 | 0.816 |
 | TimeNet [32] | 0.764 | 0.813 |
TSCAN | Max pooling | 0.795 ± 0.022 | 0.839 ± 0.018 |
Table 5: Decompensation prediction results.
Group | Model | AUC-ROC | AUC-PR |
baseline | Logistic Regression [4] | 0.870 | 0.214 |
 | LSTM [4] | 0.892 | 0.324 |
others | channel-wise LSTM [4] | 0.906 | 0.333 |
 | SAnD [32] | 0.895 | 0.316 |
 | CNN-RNN [31] | 0.874 | 0.231 |
 | CNN-AttRNN [31] | 0.881 | 0.258 |
 | RAIM [31] | 0.901 | 0.279 |
TSCAN | Max pooling | 0.913 ± 0.021 | 0.326 ± 0.019 |
Traditionally, Logistic Regression (LR) has been among the most accurate models for medical time series. In recent years, LSTM and other deep learning methods have also performed well on time series. In this section, Logistic Regression [4] and LSTM [4] are used as baselines and compared with our approach.
For in-hospital mortality prediction, the prediction time point is defined as 48 h after the beginning of the ICU stay, with $T$ = 48 h, which means one ICU stay corresponds to one sample. As can be seen from Table 2, TSCAN shows a large improvement in classification scores relative to the baselines, with feature concatenation improving the AUC-ROC by 0.011 relative to LR and max pooling improving it by 0.052 relative to LSTM. We also achieve a performance gain of 0.2% in AUC-ROC compared to the strongest other method (TPC).
For length-of-stay prediction, we select a prediction time point every 12 hours, beginning at the fourth hour after the patient is admitted to the ICU and ending with the patient's discharge or death; the task aims to predict the remaining time the patient spends in the ICU. As shown in Table 3, TSCAN achieves a reduction of three points in MAD compared to LSTM and more than forty points compared to logistic regression. For KAPPA, TSCAN also improves by 0.009 over channel-wise LSTM.
For phenotype classification, the prediction time point is defined as the end of the ICU stay (when the patient is discharged or dies), and the time-series data of up to 320 hours are used to classify phenotypes, in order to make full use of the data during the patient's stay in the ICU. In other words, sequences shorter than 320 steps in the temporal dimension are zero-padded, and longer sequences are truncated. As shown in Table 4, TSCAN improves over logistic regression by 0.056 in macro-averaged AUC-ROC and 0.040 in micro-averaged AUC-ROC. Compared to LSTM, the improvement is 0.025 in macro-averaged AUC-ROC and 0.018 in micro-averaged AUC-ROC. This shows that the internal Fusion-Encoder module also works well for multi-label problems.
For decompensation prediction, we select a prediction time point every hour, starting 4 h after the beginning of the ICU stay. The results are shown in Table 5. TSCAN improves on logistic regression by 0.043 in AUC-ROC and 0.112 in AUC-PR, and on LSTM by 0.021 in AUC-ROC. We also achieve a performance gain of 0.7% in AUC-ROC compared to the strongest other method (channel-wise LSTM). Even though the data are pre-processed into sequences of equal length, our method still performs very well.
5.3 Discussion on In-hospital Mortality
We take the data from the initial 48 hours of each ICU stay as the sample for this prediction task. Early prediction of in-hospital death can help identify high-risk patients and alert medical staff. In the following subsections, we use this task to determine some parameters of the prediction model that are then reused in the other tasks.
5.3.1 Number of Indicators
In this paper, data from 155 clinical indicators were selected to handle the related prediction tasks, based on the recommendations of the local hospital. From a medical point of view, the accuracy of mortality prediction is expected to vary with the number of variables used as reference; in principle, the more variables, the higher the accuracy. To examine the effect of the number of variables on the experimental results, we randomly select between 17 and 155 clinical variables and verify their impact on prediction accuracy. The related experimental results are shown in Fig.4.

As shown in Fig.4, as the number of indicators increases from 17 to 155, the accuracy of mortality prediction improves, which is consistent with clinical expectations. This indicates that our model can exploit the indicators suggested by the hospital and make full use of them in the prediction task. However, we also find that the rate of improvement gradually decreases, which suggests that continuing to increase the number of indicators may not yield a significant further improvement. Due to data limitations, the following tasks also use the 155 variables as input.
5.3.2 Ablation Studies
As shown in Fig.2, the model can be separated into two branches, the temporal domain and the characteristic domain (i.e., the spatial domain), so the ablation experiments mainly compare these two branches and different fusion methods. We conduct experiments in the spatial and temporal domains respectively, and then fuse the two branches in different ways. Four fusion methods are compared: feature concatenation, additive fusion, bilinear pooling, and max pooling (prediction followed by fusion). The results of the ablation studies are shown in Table 6.
Table 6: Ablation studies on in-hospital mortality prediction.
Studies | Temporal domain | Spatial domain | Concatenate fusion | Adding fusion | Bilinear pooling [44] | Max pooling |
AUC-ROC | 0.857 | 0.832 | 0.859 | 0.856 | 0.843 | 0.907 |
As seen in Table 6, attention in the temporal domain alone performs better than attention in the spatial domain alone, with the former's AUC-ROC score 0.025 higher than the latter's. This suggests that, for medical time series, the correlation between adjacent time points is extremely high, while the variables are relatively independent of each other. It is also clear that combining the temporal and spatial dimensions gives better results, with max-pooling fusion performing best and concatenation fusion second best.
5.3.3 Attention in Temporal Domain
In this paper, we use sequential time data to handle the prediction problems, so we naturally want to know the contribution of data at different time points to the prediction results. For the in-hospital mortality prediction task, we divide the 48 hours of data into 4 parts, each containing 12 hours of data, and thus obtain 4 attention maps. Here, we calculate the weights and visualize them in Fig.5. The horizontal axis represents the data of the 12 hourly time steps and the vertical axis represents the correlation extracted along the temporal dimension.
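One way to obtain such per-hour weights, assuming the attention modules are implemented with PyTorch's nn.MultiheadAttention, is sketched below; the reduction over the query axis is an illustrative choice.

```python
# A minimal sketch of extracting temporal attention weights for visualization:
# nn.MultiheadAttention returns head-averaged attention weights, which can be
# summed over the query axis to get one contribution score per hour in the chunk.
import torch

@torch.no_grad()
def hourly_attention(attn_module, x):
    # x: (batch, L, d_model) - one 12-hour chunk already projected to d_model
    _, weights = attn_module(x, x, x, need_weights=True)   # weights: (batch, L, L)
    per_hour = weights.mean(dim=0).sum(dim=0)              # (L,) contribution per hour
    return (per_hour / per_hour.sum()).cpu().numpy()       # normalize for plotting
```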

From these visualization results, we find that the closer a time point is to the prediction time point, the more it contributes to the prediction; as the distance from the prediction time point increases, the correlation gradually decreases. This is consistent with actual clinical diagnosis and treatment. We also modified the grouping parameter $K$ and obtained similar results. For the other prediction and classification tasks, we observe similar behavior; due to space limitations, we do not discuss the temporal weights for those tasks.
5.3.4 Attention in Clinical Indicators
The attention mechanism is also used to obtain the weight of each indicator. We want to know how much each of the 155 chosen indicators contributes to the task, and attention extraction in the spatial domain helps us reach this goal.
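Given the normalized attention scores from the spatial branch, ranking the indicators is straightforward; the helper below is a minimal sketch with placeholder variable names.

```python
# A minimal sketch of ranking indicators by their spatial attention weight: in the
# spatial branch the sequence axis is the variable axis, so the same reduction used
# for the temporal maps yields one score per clinical indicator.
import numpy as np

def top_indicators(weights: np.ndarray, names: list, k: int = 15):
    # weights: (n_vars,) normalized attention scores from the spatial branch
    order = np.argsort(weights)[::-1][:k]
    return [(names[i], float(weights[i])) for i in order]
```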

As shown in Fig.6, the horizontal axis shows the 15 indicators with the highest contribution ratios among the 155 used in the experiment, and the vertical axis shows how much each variable contributed to the prediction of in-hospital mortality. For example, Activated Clotting Time accounts for more than 0.025 of the weight, the highest share. Knowing how important each indicator is can help doctors decide what to focus on and how to treat their patients. These results indicate that coagulation is a major factor in a patient's chance of survival, as are the use of amiodarone for arrhythmias, SCUF for heart failure, and magnesium for treating disorders of the internal environment. These findings have been confirmed by professional clinicians, which demonstrates that our approach has clear advantages in discovering the causes of outcomes and can be applied to similar prediction tasks and to the discovery of potential disease causes, such as DNA screening and functional discovery.
5.4 Discussion on Length-of-stay Prediction
This task is to predict remaining time spent in ICU at every 12 hours of stay, which is framed as a classification problem with 10 buckets (one for ICU stays shorter than a day, seven day-long buckets for each day of the first week, one for stays of over one week but less than two, and one for stays of over two weeks). With this work, ICU beds can be used more efficiently and patients can be cared for better.
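A minimal sketch of this 10-bucket mapping (under the assumption that bucket boundaries fall on whole days) is given below.

```python
# A minimal sketch of the 10-bucket mapping described above: one bucket for stays
# under a day, one per day for the first week, one for 1-2 weeks, one for longer.
def los_bucket(remaining_hours: float) -> int:
    days = remaining_hours / 24.0
    if days < 1:
        return 0                      # < 1 day
    if days < 8:
        return int(days)              # buckets 1..7, one per day of the first week
    if days < 14:
        return 8                      # over one week but less than two
    return 9                          # over two weeks

assert los_bucket(6) == 0 and los_bucket(30) == 1 and los_bucket(200) == 8
```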

We also examine the contribution of the selected 155 indicators to this task. As shown in Fig.7, the horizontal axis shows the 15 variables with the highest proportion among the 155 variables used in the experiment, and the vertical axis indicates their contribution to the prediction of length of stay. From these results, we can see that ALT carries the largest weight, close to 0.03.
5.5 Discussion on Phenotype Classification
Phenotyping has applications in cohort construction for clinical studies, comorbidity detection and risk adjustment, quality improvement and surveillance, and diagnosis [45]. Based on the ICU data, the phenotype classification task is to determine which acute care conditions are present in a patient. This paper selects 25 common conditions, including 8 chronic conditions that are common co-morbidities and risk factors in the ICU, such as essential hypertension; 12 severe conditions that pose a relatively greater threat to life, such as pneumonia; and 5 mixed conditions that are recurrent or chronic with periodic acute episodes, as listed in Table 7.
Table 7: The 25 selected phenotypes and their types.
Phenotype | Type |
Acute and unspecified renal failure | severe |
Acute cerebrovascular disease | severe |
Acute myocardial infarction | severe |
Cardiac dysrhythmias | mixed |
Chronic kidney disease | chronic |
Chronic obstructive pulmonary disease | chronic |
Complications of surgical/medical care | severe |
Conduction disorders | mixed |
Congestive heart failure; nonhypertensive | mixed |
Coronary atherosclerosis and related | severe |
Diabetes mellitus with complications | mixed |
Diabetes mellitus without complication | severe |
Disorders of lipid metabolism | severe |
Essential hypertension | severe |
Fluid and electrolyte disorders | severe |
Gastrointestinal hemorrhage | severe |
Hypertension with complications | severe |
Other liver diseases | mixed |
Other lower respiratory disease | severe |
Other upper respiratory disease | severe |
Pleurisy; pneumothorax; pulmonary collapse | severe |
Pneumonia | severe |
Respiratory failure; insufficiency; arrest | severe |
Septicemia (except in labor) | severe |
Shock | severe |
The attention map of clinical indicators is shown in Fig.8. The horizontal axis shows the top 15 variables by weight, and the vertical axis shows their contributions to the phenotype classification task. Several indicators, such as esophageal echo, height, PA catheter, and PICC line, account for a higher weighting of over 0.008, which suggests that medical staff could pay more attention to these factors when judging the phenotype.

5.6 Discussion on Decompensation Prediction
The goal of this task is to identify patients whose condition is likely to deteriorate significantly in the next 24 hours. In current clinical practice, this is handled with manually designed early warning scores and thresholds below which alerts are triggered, so most of these scoring systems rely on simple thresholds over a small number of common physiological indicators, such as the National Early Warning Score (NEWS) [46] and the Modified Early Warning Score (MEWS) [47]. Decompensation prediction in this paper aims to detect patients who are physiologically decompensating, i.e., whose conditions are deteriorating rapidly.
Following previous work [5], and to match the early warning scoring setting as closely as possible, the decompensation prediction task in our experiments is defined as predicting mortality within the next 24 hours for patients in the ICU, evaluated every hour. Our experiments can therefore be run on the MIMIC-IV database, from which the corresponding features and labels are extracted.
For each hourly prediction time point, the data are matched to a target label indicating whether the patient died within 24 hours of that point. In contrast to in-hospital mortality prediction and phenotype classification, a single ICU stay can yield more than one sample.
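The labelling rule can be sketched as follows, assuming timestamps are expressed in hours since ICU admission; the function name and the 4-hour starting point follow the task description above.

```python
# A minimal sketch of the decompensation labelling: for each hourly prediction
# point, the label is 1 if the patient dies within the next 24 hours, otherwise 0.
from typing import Optional

def decompensation_labels(stay_length_h: int, death_hour: Optional[float], start_h: int = 4):
    labels = []
    for t in range(start_h, stay_length_h + 1):          # one sample per hour of the stay
        died_soon = death_hour is not None and t < death_hour <= t + 24
        labels.append((t, int(died_soon)))
    return labels

# Example: a 72-hour stay ending in death at hour 60 yields positive labels for
# prediction points between hours 36 and 59.
print(decompensation_labels(72, 60)[:3])
```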

The attention map of clinical indicators is shown in Fig.9. The vertical axis shows the contribution ratios of the top 15 variables, which account for the highest weights in the task. The weights of these top 15 variables are relatively uniform, ranging from 0.005 to 0.006. The predictions are consistent with clinical reality: a PICC line indicates that the patient needs intravenous nutrition and rehydration therapy, and a PA catheter indicates that the patient is being assessed for complex hemodynamic disturbances; both are treatments for severely ill patients.
On the one hand, the results suggest that patients on left-heart assist devices such as an IABP, patients receiving arrhythmia drugs such as amiodarone or verapamil, and patients with chest pain may be more likely to deteriorate, while patients with cardiac dysfunction may be less likely to survive. On the other hand, the capillary refill rate and phenylephrine are important predictors of a patient's prognosis in cases of circulatory failure, which means that stable circulation and adequate blood flow to tissues and organs are needed to improve the prognosis. This indicates that cardiovascular disease plays an important role in predicting decompensation.
6 Conclusion
The widespread use of EHRs and the rapid development of deep learning have opened new ways to process and use medical time series. In this paper, we build a deep learning model for medical time series on the MIMIC-IV database. The model performs well at predicting in-hospital mortality, with evaluation metrics considerably better than the baselines. First, based on the attention mechanism, the segmentation of the data and the internal structure of the Fusion-Encoder in TSCAN give full consideration to the temporal correlation of clinical time series. Second, our model can identify key clinical indicators of important outcomes, which can be used to improve treatment options and provides effective guidance for clinical diagnosis and treatment in practice.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China (62272337, 61902277) and the Natural Science Foundation of Tianjin (16JCZDJC31100).
References
- [1] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi, “Deep EHR: A survey of recent advances in deep learning techniques for electronic health record (EHR) analysis,” IEEE J. Biomed. Health Informatics, vol. 22, no. 5, pp. 1589–1604, 2018.
- [2] A. E. W. Johnson, T. J. Pollard, and R. G. Mark, “Reproducibility in critical care: a mortality prediction case study,” in Proceedings of the Machine Learning for Health Care Conference, MLHC 2017, Boston, Massachusetts, USA, 18-19 August 2017, ser. Proceedings of Machine Learning Research, F. Doshi-Velez, J. Fackler, D. C. Kale, R. Ranganath, B. C. Wallace, and J. Wiens, Eds., vol. 68. PMLR, 2017, pp. 361–376.
- [3] E. Rocheteau, P. Liò, and S. L. Hyland, “Temporal pointwise convolutional networks for length of stay prediction in the intensive care unit,” CoRR, vol. abs/2007.09483, 2020.
- [4] H. Harutyunyan, H. Khachatrian, D. C. Kale, and A. Galstyan, “Multitask learning and benchmarking with clinical time series data,” CoRR, vol. abs/1703.07771, 2017.
- [5] S. Purushotham, C. Meng, Z. Che, and Y. Liu, “Benchmarking deep learning models on large healthcare datasets,” J. Biomed. Informatics, vol. 83, pp. 112–134, 2018.
- [6] Z. C. Lipton, D. C. Kale, C. Elkan, and R. C. Wetzel, “Learning to diagnose with LSTM recurrent neural networks,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016.
- [7] C.-H. Lee, M. Schmidt, A. Murtha, A. Bistritz, J. Sander, and R. Greiner, “Segmenting brain tumors with conditional random fields and support vector machines,” in Computer Vision for Biomedical Image Applications, Y. Liu, T. Jiang, and C. Zhang, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 469–478.
- [8] N. Friedman, “The bayesian structural em algorithm,” arXiv preprint arXiv:1301.7373, 2013.
- [9] M. Van der Heijden, M. Velikova, and P. J. Lucas, “Learning bayesian networks for clinical time series analysis,” Journal of biomedical informatics, vol. 48, pp. 94–105, 2014.
- [10] R. Caruana, S. Baluja, and T. Mitchell, “Using the future to ”sort out” the present: Rankprop and multitask learning for medical risk evaluation,” ser. NIPS’95. Cambridge, MA, USA: MIT Press, 1995, p. 959–965.
- [11] G. F. Cooper, C. F. Aliferis, R. Ambrosino, J. Aronis, B. G. Buchanan, R. Caruana, M. J. Fine, C. Glymour, G. Gordon, B. H. Hanusa et al., “An evaluation of machine-learning methods for predicting pneumonia mortality,” Artificial intelligence in medicine, vol. 9, no. 2, pp. 107–138, 1997.
- [12] A. Awad, M. Bader-El-Den, J. McNicholas, and J. Briggs, “Early hospital mortality prediction of intensive care unit patients using an ensemble learning approach,” International journal of medical informatics, vol. 108, pp. 185–195, 2017.
- [13] P. Lucas, “Bayesian analysis, pattern analysis, and data mining in health care,” Current opinion in critical care, vol. 10, no. 5, pp. 399–403, 2004.
- [14] B. Sierra, N. Serrano, P. Larrañaga, E. J. Plasencia, I. Inza, J. J. Jiménez, P. Revuelta, and M. L. Mora, “Using bayesian networks in the construction of a bi-level multi-classifier. a case study using intensive care unit patients data,” Artificial Intelligence in Medicine, vol. 22, no. 3, pp. 233–248, 2001.
- [15] A. L. Rosenberg, “Recent innovations in intensive care unit risk-prediction models,” Current opinion in critical care, vol. 8, no. 4, pp. 321–330, 2002.
- [16] W. A. Knaus, “Apache 1978-2001: the development of a quality assurance system based on prognosis: milestones and personal reflections,” Archives of Surgery, vol. 137, no. 1, pp. 37–41, 2002.
- [17] J.-R. Le Gall, P. Loirat, A. Alperovitch, P. Glaser, C. Granthil, D. Mathieu, P. Mercier, R. Thomas, and D. Villers, “A simplified acute physiology score for icu patients.” Critical care medicine, vol. 12, no. 11, pp. 975–977, 1984.
- [18] T. A. Lasko, J. C. Denny, and M. A. Levy, “Computational phenotype discovery using unsupervised feature learning over noisy, sparse, and irregular clinical data,” PloS one, vol. 8, no. 6, p. e66341, 2013.
- [19] Y. Cheng, F. Wang, P. Zhang, and J. Hu, “Risk prediction with electronic health records: A deep learning approach,” in Proceedings of the 2016 SIAM International Conference on Data Mining, Miami, Florida, USA, May 5-7, 2016, S. C. Venkatasubramanian and W. M. Jr., Eds. SIAM, 2016, pp. 432–440.
- [20] B. A. Mobley, R. Leasure, and L. Davidson, “Artificial neural network predictions of lengths of stay on a post-coronary care unit,” Heart & lung, vol. 24, no. 3, pp. 251–256, 1995.
- [21] J. Grigsby, R. Kooken, and J. Hershberger, “Simulated neural networks to predict outcomes, costs, and length of stay among orthopedic rehabilitation patients,” Archives of physical medicine and rehabilitation, vol. 75, no. 10, pp. 1077–1081, 1994.
- [22] E. Choi, A. Schuetz, W. F. Stewart, and J. Sun, “Using recurrent neural network models for early detection of heart failure onset,” J. Am. Medical Informatics Assoc., vol. 24, no. 2, pp. 361–370, 2017.
- [23] S. Tonekaboni, M. Mazwi, P. Laussen, D. Eytan, R. Greer, S. D. Goodfellow, A. J. Goodwin, M. Brudno, and A. Goldenberg, “Prediction of cardiac arrest from physiological signals in the pediatric ICU,” in Proceedings of the Machine Learning for Healthcare Conference, MLHC 2018, 17-18 August 2018, Palo Alto, California, ser. Proceedings of Machine Learning Research, F. Doshi-Velez, J. Fackler, K. Jung, D. C. Kale, R. Ranganath, B. C. Wallace, and J. Wiens, Eds., vol. 85. PMLR, 2018, pp. 534–550.
- [24] N. Tomašev, X. Glorot, J. W. Rae, M. Zielinski, H. Askham, A. Saraiva, A. Mottram, C. Meyer, S. Ravuri, I. Protsyuk et al., “A clinically applicable approach to continuous prediction of future acute kidney injury,” Nature, vol. 572, no. 7767, pp. 116–119, 2019.
- [25] W. Cao, D. Wang, J. Li, H. Zhou, L. Li, and Y. Li, “Brits: Bidirectional recurrent imputation for time series,” Advances in neural information processing systems, vol. 31, 2018.
- [26] E. Choi, M. T. Bahadori, and J. Sun, “Doctor AI: predicting clinical events via recurrent neural networks,” CoRR, vol. abs/1511.05942, 2015.
- [27] H. Suresh, N. Hunt, A. E. W. Johnson, L. A. Celi, P. Szolovits, and M. Ghassemi, “Clinical intervention prediction and understanding with deep neural networks,” in Proceedings of the Machine Learning for Health Care Conference, MLHC 2017, Boston, Massachusetts, USA, 18-19 August 2017, ser. Proceedings of Machine Learning Research, F. Doshi-Velez, J. Fackler, D. C. Kale, R. Ranganath, B. C. Wallace, and J. Wiens, Eds., vol. 68. PMLR, 2017, pp. 322–337.
- [28] S. Sheikhalishahi, V. Balaraman, and V. Osmani, “Benchmarking machine learning models on eicu critical care dataset,” CoRR, vol. abs/1910.00964, 2019.
- [29] G. Clermont, D. C. Angus, S. M. DiRusso, M. Griffin, and W. T. Linde-Zwirble, “Predicting hospital mortality for patients in the intensive care unit: a comparison of artificial neural networks with logistic regression models,” Critical care medicine, vol. 29, no. 2, pp. 291–296, 2001.
- [30] L. A. Celi, S. Galvin, G. Davidzon, J. Lee, D. Scott, and R. Mark, “A database-driven decision support system: customized mortality prediction,” Journal of personalized medicine, vol. 2, no. 4, pp. 138–148, 2012.
- [31] Y. Xu, S. Biswal, S. R. Deshpande, K. O. Maher, and J. Sun, “Raim: Recurrent attentive and intensive model of multimodal patient monitoring data,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD ’18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 2565–2573.
- [32] H. Song, D. Rajan, J. Thiagarajan, and A. Spanias, “Attend and diagnose: Clinical time series analysis using attention models,” in Proceedings of the AAAI conference on artificial intelligence, vol. 32, no. 1, 2018.
- [33] N. Razavian, J. Marcus, and D. A. Sontag, “Multi-task prediction of disease onsets from longitudinal lab tests,” CoRR, vol. abs/1608.00647, 2016.
- [34] Z. Che, D. C. Kale, W. Li, M. T. Bahadori, and Y. Liu, “Deep computational phenotyping,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, August 10-13, 2015, L. Cao, C. Zhang, T. Joachims, G. I. Webb, D. D. Margineantu, and G. Williams, Eds. ACM, 2015, pp. 507–516.
- [35] Z. Che, S. Purushotham, K. Cho, D. A. Sontag, and Y. Liu, “Recurrent neural networks for multivariate time series with missing values,” CoRR, vol. abs/1606.01865, 2016.
- [36] B. Shickel, T. J. Loftus, L. Adhikari, T. Ozrazgat-Baslanti, A. Bihorac, and P. Rashidi, “Deepsofa: a continuous acuity score for critically ill patients using clinically interpretable deep learning,” Scientific reports, vol. 9, no. 1, pp. 1–12, 2019.
- [37] A. Rajkomar, E. Oren, K. Chen, A. M. Dai, N. Hajaj, M. Hardt, P. J. Liu, X. Liu, J. Marcus, M. Sun et al., “Scalable and accurate deep learning with electronic health records,” NPJ digital medicine, vol. 1, no. 1, pp. 1–10, 2018.
- [38] A. Johnson, L. Bulgarelli, T. Pollard, S. Horng, L. A. Celi, and R. Mark, “MIMIC-IV (version 0.4),” PhysioNet, https://doi.org/10.13026/a3wn-hq05, 2020.
- [39] I. Silva, G. Moody, D. J. Scott, L. A. Celi, and R. G. Mark, “Predicting in-hospital mortality of icu patients: The physionet/computing in cardiology challenge 2012,” in 2012 Computing in Cardiology. IEEE, 2012, pp. 245–248.
- [40] P. Rodríguez, M. A. Bautista, J. Gonzàlez, and S. Escalera, “Beyond one-hot encoding: Lower dimensional target embedding,” Image and Vision Computing, vol. 75, pp. 21–31, 2018.
- [41] J. P. Royalty, “Machine learning time-to-event mortality prediction in mimic-iv critical care database,” Ph.D. dissertation, 2021.
- [42] T. N. Pattalung and S. Chaichulee, “Comparison of machine learning algorithms for mortality prediction in intensive care patients on multi-center critical care databases,” in IOP Conference Series: Materials Science and Engineering, vol. 1163, no. 1. IOP Publishing, 2021, p. 012027.
- [43] S. Wang, M. B. McDermott, G. Chauhan, M. Ghassemi, M. C. Hughes, and T. Naumann, “Mimic-extract: A data extraction, preprocessing, and representation pipeline for mimic-iii,” in Proceedings of the ACM conference on health, inference, and learning, 2020, pp. 222–235.
- [44] T. Lin, A. RoyChowdhury, and S. Maji, “Bilinear CNN models for fine-grained visual recognition,” CoRR, vol. abs/1504.07889, 2015.
- [45] V. Agarwal, T. Podchiyska, J. M. Banda, V. Goel, T. I. Leung, E. P. Minty, T. E. Sweeney, E. Gyang, and N. H. Shah, “Learning statistical models of phenotypes using noisy labeled training data,” Journal of the American Medical Informatics Association, vol. 23, no. 6, pp. 1166–1173, 2016.
- [46] Royal College of Physicians, “National early warning score (NEWS): standardising the assessment of acute-illness severity in the NHS. Report of a working party,” Royal College of Physicians, London, 2012.
- [47] C. P. Subbe, M. Kruger, P. Rutherford, and L. Gemmel, “Validation of a modified early warning score in medical admissions,” Qjm, vol. 94, no. 10, pp. 521–526, 2001.