Predicting Intraoperative Hypoxemia with Hybrid Inference Sequence Autoencoder Networks
Abstract.
We present an end-to-end model that uses streaming physiological time series to predict near-term risk of hypoxemia, a rare but life-threatening condition known to cause serious patient harm during surgery. Inspired by the fact that a hypoxemia event is defined based on a future sequence of low SpO2 (i.e., blood oxygen saturation) instances, we propose the hybrid inference network (hiNet), which makes hybrid inference on both future low SpO2 instances and hypoxemia outcomes. hiNet integrates 1) a joint sequence autoencoder that simultaneously optimizes a discriminative decoder for label prediction, and 2) two auxiliary decoders trained for data reconstruction and forecasting, which seamlessly learn contextual latent representations that capture the transition from present states to future states. All decoders share a memory-based encoder that helps capture the global dynamics of patient measurements. On a large surgical cohort of 72,018 surgeries at a major academic medical center, our model outperforms strong baselines, including the model used by the state-of-the-art hypoxemia prediction system. With its capability to make real-time predictions of near-term hypoxemia at clinically acceptable alarm rates, hiNet shows promise in improving clinical decision making and easing the burden of perioperative care.
1. Introduction
Hypoxemia, or low blood oxygen saturation (SpO2), is an adverse physiological condition known to cause serious patient harm during surgery and general anesthesia (dunham2014perioperative, ). Without early intervention, prolonged hypoxemia can seriously affect surgical outcomes and is associated with many adverse complications such as cardiac arrest, encephalopathy, delirium, and post-operative infections (lundberg2018explainable, ). To mitigate the effects of hypoxemia, anesthesiologists monitor SpO2 levels during general anesthesia using pulse oximetry, so that actions can be taken in a timely manner. Despite the availability of real-time data, however, reliable prediction of hypoxemic events remains elusive (ehrenfeld2010incidence, ).
In recent years, data from electronic health records (EHR) have been used to develop predictive models that anticipate risks of future adverse events, facilitating early interventions to mitigate their occurrence (west2016interventions, ; liu2022hipal, ; lou2022predicting, ). Similar attempts (elmoaget2016multi, ; erion2017anesthesiologist, ) have been made to target hypoxemia based on SpO2 data. Recently, (lundberg2018explainable, ) proposed a gradient boosting machine (GBM) model for predicting intraoperative hypoxemia by integrating a number of preoperative static variables and intraoperative physiological time series. Compared to prior works that utilized only SpO2, the use of multi-modal data helps train a more reliable prediction model. However, as classical models such as GBM cannot directly utilize multivariate time series, they require the extraction of hand-crafted features with limited capacity for capturing the temporal dynamics of time series. In addition, the uniquely selected combination of features may not generalize to a new cohort from another hospital.
Another major limitation of the aforementioned hypoxemia models is that they were developed to target any low SpO2 occurrence, most of which are short-term (e.g., a single minute), transient SpO2 drops. In practice, most occurrences of low SpO2 do not reflect a high patient risk of deterioration that actually requires anesthesiologist intervention (laffin2020severity, ). In contrast to transient SpO2 reduction, significant risks arise from persistent hypoxemia, defined as continuously low SpO2 over a longer time window (e.g., 5 minutes). Persistent hypoxemia can develop rapidly and unexpectedly due to acute respiratory failure or other circumstances, and is immediately life-threatening if not treated (mehta2016management, ). For this study we collected a large surgical dataset from a major academic medical center. In this cohort, despite the rarity of low SpO2 (1.5% of all monitored time), 24.0% of the surgical encounters experienced at least one instance of SpO2 drop (measured per minute), while merely 1.9% experienced persistent hypoxemia (over a 5-minute time window). Nonessential alarms can desensitize clinicians to alarms, so that the truly critical ones may be missed. Hence, it is of high clinical importance to reliably predict persistent hypoxemia (i.e., with high sensitivity for adverse events at a clinically acceptable alarm rate), which facilitates the most pivotal early interventions. However, predicting persistent hypoxemia is a greater challenge: it is difficult for a machine learning model to learn reliable patterns when one class is severely underrepresented (0.13% positive rate in our real-world dataset). Moreover, it requires the model to foresee the future over a longer horizon, where past data become decreasingly indicative of distal future outcomes.
We aim to address these challenges by developing an end-to-end learning framework that utilizes streaming physiological time series (e.g., heart rate, SpO2) and produces risk predictions of hypoxemia, while simultaneously learning powerful latent representations known to improve model robustness against class imbalance (hendrycks2019using, ). In addition to general hypoxemia (i.e., any SpO2 drop), we focus on predicting persistent hypoxemia given its clinical significance. Intuitively, if we can forecast future input data, especially the SpO2 variation, we can anticipate potential hypoxemia risk more accurately. We propose a novel deep model, the hybrid inference network (hiNet), that simultaneously makes inference on both future hypoxemic events and the sequence of future SpO2 levels. This end-to-end framework is enabled by jointly optimizing: (i) a memory-augmented sequence encoder that both aggregates local temporal features and captures global patient dynamics; (ii) a sequence decoder for data reconstruction; (iii) a sequence decoder that models the evolution of future SpO2 levels; and (iv) a discriminative decoder (classifier) trained for hypoxemia event prediction. With joint training, the classifier can leverage the learned latent representation, while the supervisory signal from the classifier is propagated to seamlessly direct representation learning towards optimizing the desired prediction task.
The proposed model is trained and evaluated on a large real-world pediatric cohort from a major academic medical center. The data include minute-resolution multi-modal time series collected from 72,018 surgeries corresponding to 118,000 hours of surgery. The experiments show that our proposed model can predict both general and persistent hypoxemia more precisely and with lower alarm rates than a set of strong baselines, including the model employed by the state-of-the-art hypoxemia prediction system.
Specifically, our contributions are threefold:
• We propose the first learning-based approach for persistent hypoxemia prediction, a challenging but clinically significant problem.
• We design a novel sequence learning framework for multivariate time series that jointly optimizes multiple highly correlated tasks, including a supervised discriminative task and two sequence generation tasks. Through joint training, the learned contextual latent representations facilitate better predictions while being seamlessly optimized for task-specific effectiveness.
• Extensive experiments on a large pediatric surgical cohort show the improvement of our proposed model over strong baselines and the potential of hiNet to support clinical decisions and impact surgical practice.
2. Hypoxemia Prediction Problem

2.1. Intraoperative Time Series Data
During anesthesia procedures (i.e., surgeries), a set of the patient's physiological signals, such as vital signs (e.g., heart rate, SpO2) and ventilator parameters (e.g., respiration rate), are recorded at one-minute intervals. These intraoperative time series track the patient's physical status during surgery and may contain information associated with potential complications and adverse surgical events (meyer2018machine, ).
The data used in this work were collected from 79,142 pediatric surgical encounters spanning from 2014 to 2018 with approximately 118,000 hours of surgeries in total (89 min per case in average) at the St. Louis Children’s Hospital, a free-standing tertiary care pediatric hospital, and St. Louis Children’s Specialty Care Center, an outpatient pediatric surgical center. The institutional review board of Washington University approved this study with a waiver of consent (IRB #201906030).
2.2. Hypoxemia Definition and Labeling
We use a stringent clinical definition of hypoxemia to assess a patient encounter during surgery. We follow the guideline recommended by the World Health Organization (WHO2011pulse, ) and use the emergency level of SpO2 as the threshold for hypoxemia. We define two types of hypoxemia events based on two severity levels and assign labels for the two prediction problems following the criteria shown in Figure 1.
• General hypoxemia: any low-SpO2 instance (similar to prior work (lundberg2018explainable, ); light purple region).
• Persistent hypoxemia: a low-SpO2 instance that lasts for 5 consecutive minutes or more (dark purple region).
In clinical practice, temporary drops in SpO2 (i.e., general hypoxemia) are common (24.0% of patient encounters in our dataset) and less concerning, as they are usually rapidly correctable with simple maneuvers and carry no short- or long-term sequelae. Persistent hypoxemia (1.9% of patient encounters) is more clinically relevant and can rapidly become life-threatening; it is also much more difficult to predict, since the model must anticipate deeper into the future.
Figure 1 shows how labels are assigned for model development. For each prediction problem, given a 5-minute prediction horizon, we assign positive labels to the time steps within the 5-min predictive window right before the start of a hypoxemic event (persistent or general hypoxemia, respectively). Other time steps prior to this predictive window are assigned negative labels. We leave unlabeled the samples during the time window when the patient is already in a hypoxemic event, as there is little clinical benefit in predicting hypoxemia that has already occurred.
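To make the labeling scheme concrete, below is a minimal NumPy sketch of how labels and the unlabeled-sample mask could be derived from a per-minute SpO2 trace; the numeric threshold constant and the function names are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

SPO2_THRESHOLD = 90  # assumed numeric stand-in for the WHO emergency level
HORIZON = 5          # prediction horizon in minutes
PERSIST_MIN = 5      # minimum duration of a persistent event in minutes

def in_event_minutes(spo2, persist_min):
    """Mark minutes that fall inside a hypoxemic event: a run of
    consecutive low-SpO2 minutes of length >= persist_min.
    persist_min=1 recovers the general-hypoxemia definition."""
    low = np.asarray(spo2) < SPO2_THRESHOLD
    in_event = np.zeros(len(low), dtype=bool)
    t = 0
    while t < len(low):
        if low[t]:
            end = t
            while end < len(low) and low[end]:
                end += 1
            if end - t >= persist_min:
                in_event[t:end] = True
            t = end
        else:
            t += 1
    return in_event

def assign_labels(spo2, persist_min=PERSIST_MIN, horizon=HORIZON):
    """Positive labels fill the `horizon` minutes before each event
    onset; minutes already inside an event receive mask 0 (unlabeled)."""
    in_event = in_event_minutes(spo2, persist_min)
    labels = np.zeros(len(in_event), dtype=int)
    onsets = np.flatnonzero(np.diff(np.r_[0, in_event.astype(int)]) == 1)
    for s in onsets:
        labels[max(0, s - horizon):s] = 1
    mask = (~in_event).astype(int)  # 0 = in event, excluded from the Predictor loss
    return labels, mask
```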
2.3. Near-term Prediction Problem
Our dataset consists of $N$ independent surgeries, denoted as $\{\mathbf{X}^{(i)}\}_{i=1}^{N}$, where $i$ is the index of surgeries and $\mathbf{X}^{(i)}$ is a set of time series inputs. We assume the time span is divided into equal-length time intervals. The multi-channel time series $\mathbf{X}^{(i)} = [\mathbf{x}^{(i)}_1, \dots, \mathbf{x}^{(i)}_{T_i}] \in \mathbb{R}^{D \times T_i}$, where $D$ is the number of time series channels, $T_i$ is the length of surgery $i$, and $\mathbf{x}^{(i)}_t \in \mathbb{R}^{D}$ is the vector of recorded data values at timestep $t$. For timestep $t$, we have a binary label $y^{(i)}_t \in \{0, 1\}$, where $y^{(i)}_t = 1$ indicates that a hypoxemic event will occur anytime within the next fixed-length time window $(t, t + T_h]$, and $y^{(i)}_t = 0$ otherwise. $T_h$ denotes the prediction horizon. We aim to solve the following:
Problem 1.
Given a new surgery $i$ where the patient is not already in hypoxemia, and the data window $\mathbf{X}^{(i)}_{t-T+1:t}$ at time $t$ (zero padded if $t < T$), the goal is to train a classifier $f$ that produces the label $\hat{y}^{(i)}_t$: $f: \mathbf{X}^{(i)}_{t-T+1:t} \mapsto \hat{y}^{(i)}_t$.
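As a small illustration of the window construction (function and variable names are ours), each classifier input is the $T$-minute window ending at minute $t$, zero padded at the start of the surgery:

```python
import numpy as np

def extract_windows(X, T):
    """X: (L, D) minute-resolution series for one surgery.
    Returns an (L, T, D) array whose t-th slice is the window ending
    at minute t, zero padded where t < T."""
    L, D = X.shape
    padded = np.vstack([np.zeros((T - 1, D)), X])
    return np.stack([padded[t:t + T] for t in range(L)])
```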
3. Related Works
3.1. Learning to Predict Hypoxemia
Recently, several attempts have been made to target hypoxemia using data-driven approaches. For instance, (elmoaget2016multi, ) used a linear auto-regressive (AR) model for SpO2 forecasting. (erion2017anesthesiologist, ) used deep learning models such as LSTM to classify hypoxemia directly from past SpO2 data. (nguyen2018reducing, ) used AdaBoost to identify false SpO2 alarms, without directly targeting prediction. Recently, a more comprehensive approach, Prescience (lundberg2018explainable, ), employed GBM to predict general hypoxemia based on both preoperative static variables and intraoperative time series. All these approaches aimed either at forecasting SpO2 (regression) or at predicting only general hypoxemia.

3.2. Encoder-decoder Sequential Learning
The autoencoder (AE), as shown in Figure 2(a), is widely used for representation learning. An AE simultaneously optimizes an encoder that maps the input into a latent representation and a decoder that recovers the input by minimizing the reconstruction error. However, AEs cannot directly handle sequential data. Recently, (dai2015semi, ) proposed a seq2seq AE, instantiating the encoder and decoder with LSTMs, referred to as LSTM-AE, shown in Figure 2(b). (srivastava2015unsupervised, ) further extended it to the composite LSTM autoencoder (LSTM-CAE), which additionally trains another decoder fitting future data as regularization. LSTM-AE-based methods have shown promising performance in learning representations for sequential data. For instance, (laptev2017time, ; zhu2017deep, ) used pretrained LSTM-AEs to extract deep features from time series for Uber trip and rare event forecasting. Recent clinical applications (suresh2017use, ; ballinger2018deepheart, ; baytas2017patient, ) use LSTM-AE to extract patient-level representations for phenotyping and cardiovascular risk prediction. These works are related to our approach in that they all use representations learned by a pretrained LSTM-AE to facilitate a classification task. However, with the goal of continuously providing real-time predictions, instead of extracting patient-level representations, we use a sequence AE to aggregate the local data sequence and learn a representation of a data window sliding along each surgical trajectory. Unsupervised pretraining tends to learn general task-agnostic underlying structure, so greedy layer-wise optimization in separate steps can lead to a suboptimal classifier (zhou2014joint, ). Instead, our approach builds an end-to-end model that jointly optimizes classification and latent representation learning while balancing them more delicately.
4. The HiNet Framework

Figure 3 shows an overview of our approach. This end-to-end framework jointly optimizes the desired classification task for prediction and two auxiliary tasks for representation learning. The addition of the sequence forecasting decoder contributes to learning future-related contextual representations. Joint training allows the supervised loss to direct representation learning towards being effective for the desired classification task. Hence, the hybrid integration of the three decoders enables the model to balance between extracting the underlying structure of the data and providing accurate predictions.
4.1. Memory-augmented Sequence Autoencoder
When applying a sequence AE to streaming time series using a sliding data window, the sequence encoder tends to learn mainly the local temporal patterns within the window. Recently, memory networks (sukhbaatar2015end, ; gong2019memorizing, ) have shown promising results in data representation. Generally, a memory network updates an external memory consisting of a set of basis features for look-up, preserving general patterns optimized over the whole dataset (santoro2016meta, ). To help capture the global dynamics of physiological time series, we design a novel memory-augmented sequence AE with dual-level embedding of the input sequences.
4.1.1. Dual-level Sequence Embedding
Given a sliding window $\mathbf{X}_t = [\mathbf{x}_{t-T+1}, \dots, \mathbf{x}_t]$, we first represent the features at each step using a set of basis vectors that memorize the most representative global patterns across all surgery cases, and then use a sequence encoder to aggregate all the memory-encoded vectors within the local input window into one representation vector, as shown in Figure 3.
Level 1: Step-level Global Memory Encoding. We assume there are $B$ feature basis vectors ($B$ is a hyperparameter) for a specific dataset, and initialize a global memory $\mathbf{M} = [\mathbf{m}_1, \dots, \mathbf{m}_B]$. We assume that the features of each step can be embedded as a linear combination of the feature bases. Given the feature vector $\mathbf{x}_t$ of the $t$-th step, we obtain the attention for each basis by calculating the similarity of $\mathbf{x}_t$ to each of them, normalized by a softmax function. The embedded vector $\hat{\mathbf{x}}_t$ is then the sum of all the bases weighted by their attention. The memory is updated jointly with the network. More concretely,

(1) $\quad a_{t,j} = \dfrac{\exp(\mathbf{x}_t^\top \mathbf{m}_j)}{\sum_{k=1}^{B} \exp(\mathbf{x}_t^\top \mathbf{m}_k)}, \qquad \hat{\mathbf{x}}_t = \sum_{j=1}^{B} a_{t,j}\, \mathbf{m}_j$
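As a sketch, Eq. (1) can be written as a trainable Keras layer (the paper's implementation uses TensorFlow/Keras per Section 5.4, but this particular layer is our own illustrative rendering):

```python
import tensorflow as tf

class GlobalMemoryEncoding(tf.keras.layers.Layer):
    """Step-level memory encoding of Eq. (1): attention over B basis
    vectors, followed by an attention-weighted sum of the bases."""

    def __init__(self, num_bases=128, **kwargs):
        super().__init__(**kwargs)
        self.num_bases = num_bases  # B, a hyperparameter (128 in Section 5.4)

    def build(self, input_shape):
        d = int(input_shape[-1])
        # Global memory M, updated jointly with the rest of the network.
        self.memory = self.add_weight(
            name="memory", shape=(self.num_bases, d),
            initializer="glorot_uniform", trainable=True)

    def call(self, x):
        # x: (batch, T, D); similarity of each step to each basis vector.
        attn = tf.nn.softmax(tf.einsum("btd,kd->btk", x, self.memory), axis=-1)
        # Embedded steps: attention-weighted sums of the bases, (batch, T, D).
        return tf.einsum("btk,kd->btd", attn, self.memory)
```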
Level 2: Window-level Local Feature Aggregation. We then use a standard sequence model parameterized by $\phi$ to encode the local temporal patterns within the embedded sequence. To keep the framework generalizable, the sequence encoder can be instantiated by any standard sequence model (e.g., LSTM, TCN (bai2018empirical, ), FCN (wang2017time, )). We use the output $\mathbf{z}_t$ as the representation for the current time $t$:

(2) $\quad \mathbf{z}_t = \mathrm{Seq}_{\phi}(\hat{\mathbf{x}}_{t-T+1}, \dots, \hat{\mathbf{x}}_t)$
Please see Appendix A for more details about the design choice of the sequence encoder base model.
For convenience, we denote this mapping from the input to the latent space as Encoder $E: \mathbf{X}_t \mapsto \mathbf{z}_t$, with trainable parameters $\{\mathbf{M}, \phi\}$.
4.1.2. Data Reconstruction
We copy the vector $\mathbf{z}_t$ that represents the patient state at present for every step in the window as the input to the sequence disaggregation layers (i.e., an LSTM and then a linear layer as a surrogate of an inverse memory) to reconstruct $\mathbf{X}_t$:

(3) $\quad \tilde{\mathbf{H}}_t = \mathrm{LSTM}_{\psi_r}(\mathbf{z}_t, \dots, \mathbf{z}_t)$

(4) $\quad \tilde{\mathbf{X}}_t = \mathrm{Linear}_{\psi_r}(\tilde{\mathbf{H}}_t)$

This mapping from the input space back to itself is denoted as Reconstructor $R: \mathbf{X}_t \mapsto \tilde{\mathbf{X}}_t$, in which $\psi_r$ represents the parameters of the sequence disaggregation layers. They can be learned by minimizing the reconstruction loss $\mathcal{L}_r = \sum_t \|\mathbf{X}_t - \tilde{\mathbf{X}}_t\|_2^2$.
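A minimal Keras sketch of the Reconstructor, assuming the 128-unit hidden sizes from Section 5.4 and the 18 raw input channels from Table 1 (before the missingness mask is concatenated):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_reconstructor(window_len, latent_dim=128, n_channels=18):
    """Sequence disaggregation of Eqs. (3)-(4): copy z_t for every step,
    then an LSTM and a step-wise linear layer (surrogate inverse memory)."""
    z = tf.keras.Input(shape=(latent_dim,))
    h = layers.RepeatVector(window_len)(z)                  # Eq. (3): copied states
    h = layers.LSTM(latent_dim, return_sequences=True)(h)   # Eq. (3): LSTM
    x_hat = layers.TimeDistributed(layers.Dense(n_channels))(h)  # Eq. (4): linear
    return tf.keras.Model(z, x_hat, name="Reconstructor")
```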
4.2. Multi-decoder Hybrid Inference
Reconstructor helps learn latent representations that improve model robustness against class imbalance. However, Reconstructor alone may not provide enough of the future-indicative patterns that the prediction task relies on. To learn more future-related contextual representations, we propose the hybrid inference network (hiNet), which incorporates both generative and discriminative components and simultaneously makes inference on both the sequence of future low SpO2 instances and the hypoxemia event outcome. The overall architecture of hiNet is shown in Figure 3.
4.2.1. Latent State Transition
Given the present patient state $\mathbf{z}_t$, we use a fully-connected (FC) network parameterized by $\theta$ to model the contextual transition to the patient state $\mathbf{z}'_t$ over a future time horizon $T_f$:

(5) $\quad \mathbf{z}'_t = \mathrm{FC}_{\theta}(\mathbf{z}_t)$
This vector representing the future patient state is shared as the encoded data representation used by both a sequence forecasting decoder and a hypoxemia classifier, so that future-indicative representation learning and classification can seamlessly benefit from each other.
4.2.2. Future SpO2 Forecast
Since future hypoxemia events are strictly defined based on the sequence of future SpO2 levels (i.e., whether SpO2 is low and how long it stays low), we build another sequence decoder to forecast the future SpO2 sequence over a time horizon $T_f$, using the future state representation $\mathbf{z}'_t$. We use $\mathbf{s}_{t+1:t+T_f}$ to denote the SpO2 levels in the future data window. Similar to Eqs. (3) and (4), we apply sequence disaggregation layers to the copied vectors $(\mathbf{z}'_t, \dots, \mathbf{z}'_t)$. We denote the mapping from the input to the future SpO2 sequence as Forecaster $F: \mathbf{X}_t \mapsto \tilde{\mathbf{s}}_{t+1:t+T_f}$. Hence

(6) $\quad \tilde{\mathbf{s}}_{t+1:t+T_f} = \mathrm{Linear}_{\psi_f}\big(\mathrm{LSTM}_{\psi_f}(\mathbf{z}'_t, \dots, \mathbf{z}'_t)\big)$

where $\psi_f$ denotes the task-specific parameters of the sequence disaggregation layers. The parameters can be learned by minimizing the forecast loss $\mathcal{L}_f = \sum_t \|\mathbf{s}_{t+1:t+T_f} - \tilde{\mathbf{s}}_{t+1:t+T_f}\|_2^2$.
4.2.3. Hypoxemia Event Prediction
Given the new representation $\mathbf{z}'_t$ from Eq. (5) that contains indicative patterns of future data, we build a classifier to estimate the label $y_t$. We feed $\mathbf{z}'_t$ into an FC network with a softmax output. Using Event Predictor $P: \mathbf{X}_t \mapsto \hat{y}_t$ to denote the mapping, and $\psi_c$ to denote the task-specific FC parameters, we have

(7) $\quad \hat{y}_t = \mathrm{softmax}\big(\mathrm{FC}_{\psi_c}(\mathbf{z}'_t)\big)$
4.2.4. Masked Predictor Loss
For a binary classifier, given true labels $y_t$ and predicted probabilities $\hat{y}_t$, one typically uses the cross-entropy loss

(8) $\quad \mathcal{L}_c = -\sum_t \big[\, y_t \log \hat{y}_t + (1 - y_t) \log(1 - \hat{y}_t) \,\big]$
For the prediction of either general or persistent hypoxemia, it is trivial and clinically less meaningful to predict an event that has already occurred, so we focus on predicting only the start of hypoxemia and leave the samples where hypoxemia has already begun unlabeled (see Figure 1). A straightforward strategy is to exclude the unlabeled samples from both training and testing of the model, as in (lundberg2018explainable, ; erion2017anesthesiologist, ). However, the unlabeled samples can still provide our model useful information indicative of SpO2 tendency, and shape both Reconstructor $R$ and Forecaster $F$. Instead, we propose a masked loss for Predictor. For surgery $i$, we have a binary mask vector $\mathbf{u}^{(i)}$, where $u^{(i)}_t = 0$ indicates that surgery $i$ at time $t$ is in hypoxemia, and $u^{(i)}_t = 1$ otherwise. The modified Predictor loss is:

(9) $\quad \mathcal{L}_p = -\sum_t u_t \big[\, y_t \log \hat{y}_t + (1 - y_t) \log(1 - \hat{y}_t) \,\big]$
In this way, the unlabeled data are filtered out for Predictor but still used in training Reconstructor and Forecaster. This masked loss mechanism is similar in spirit to semi-supervised learning, where only part of the samples are mapped to labels (liu2017semi, ).
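A sketch of the masked loss in Eq. (9); normalizing by the number of labeled samples rather than taking the plain sum is our own choice for batch-size invariance:

```python
import tensorflow as tf

def masked_bce(y_true, y_pred, mask, eps=1e-7):
    """Cross-entropy over labeled minutes only: mask u_t = 0 removes
    minutes where the patient is already in a hypoxemic event."""
    y_true = tf.cast(y_true, tf.float32)
    mask = tf.cast(mask, tf.float32)
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    ce = -(y_true * tf.math.log(y_pred) + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_sum(mask * ce) / tf.maximum(tf.reduce_sum(mask), 1.0)
```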
4.2.5. Objective for Joint Learning
Combining the three decoders, we define the overall objective function for learning the model parameters of our end-to-end hybrid inference framework as minimizing the following joint loss:

(10) $\quad \mathcal{L} = \mathcal{L}_p + \lambda \, (\mathcal{L}_r + \mathcal{L}_f)$
where $\lambda$ is the weighting coefficient, shared by Reconstructor and Forecaster for simplicity. The two sequence decoders $R$ and $F$ regularize the classifier with optimized data representations and future-indicative patterns. Given the end-to-end architecture, we can jointly update all parameters during training. After the joint model is trained, we only need Predictor $P$ for label inference on new patients.
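A sketch of one training step under the joint objective of Eq. (10), reusing `masked_bce` from the sketch above; `model` is assumed to return the reconstruction, the SpO2 forecast, and the event probability for a batch of windows:

```python
import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(model, x, spo2_future, y, u, lam=0.1):
    with tf.GradientTape() as tape:
        x_hat, s_hat, y_hat = model(x, training=True)
        loss_r = mse(x, x_hat)             # Reconstructor loss
        loss_f = mse(spo2_future, s_hat)   # Forecaster loss
        loss_p = masked_bce(y, y_hat, u)   # masked Predictor loss, Eq. (9)
        loss = loss_p + lam * (loss_r + loss_f)  # joint loss, Eq. (10)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```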
5. Experiment
5.1. Data Preprocessing
Examples with recorded SpO2 of less than 60% were considered aberrant and excluded. We further excluded 5,606 cardiac surgery cases (with ICD-9 codes 745-747 and ICD-10 codes Q20-Q26) and 1,449 surgery cases with initial persistent low SpO2 (most likely cardiac-related surgeries), in which SpO2 levels were likely affected by the surgical procedure. After extensive data cleaning, 72,018 surgeries remain, with 18 channels of time series variables sampled every minute during surgical procedures. We aim to build a hypoxemia prediction system with 1-minute resolution. However, not all variables were originally observed and recorded at every minute during surgery. We use carry-forward imputation for missing values when the gap between two observations is less than 20 min, and fill values that were never observed or have not been observed for the past 20 min with zeros. In addition, we concatenate a binary mask matrix with the time series input to indicate variable missingness. All variables are standardized to zero mean and unit variance.
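As an illustration, the pandas sketch below applies this imputation and masking scheme to one surgery's minute-resolution table; the function name is hypothetical, and the normalization statistics are assumed to come from the training cohort:

```python
import pandas as pd

def preprocess(df, train_mean, train_std, limit_min=20):
    """df: one surgery, one row per minute, NaN where unobserved."""
    mask = df.notna().astype(float)                 # 1 = observed, 0 = missing
    filled = df.ffill(limit=limit_min).fillna(0.0)  # carry forward <= 20 min, else 0
    standardized = (filled - train_mean) / train_std  # stats from the training cohort
    return pd.concat([standardized, mask.add_suffix("_mask")], axis=1)
```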
To further explore the effect of incorporating preoperative features (e.g., age, sex, weight) as part of the empirical analysis (see Sections 5.3 and 6.1), we also collected a set of preoperative variables for modeling. Table 1 lists the 18 intraoperative variables and 9 preoperative variables used in this study.
Intraoperative Time Series | Preoperative Variable |
---|---|
Invasive blood pressure, diastolic | Age |
Invasive blood pressure, mean | Height |
Invasive blood pressure, systolic | Weight |
Noninvasive blood pressure, diastolic | Sex |
Noninvasive blood pressure, mean | ASA physical status |
Noninvasive blood pressure, systolic | ASA emergency status |
Heart rate | Surgery type |
SpO2 | Second hand smoke |
Respiratory rate | Operating room |
Positive end expiration pressure (PEEP) | |
Peak respiratory pressure | |
Tidal volume | |
Pulse | |
End tidal CO2 (ETCO2) | |
O2 flow | |
N2O flow | |
Air flow | |
Temperature |
5.2. Experimental Setup
We randomly select 70% of all surgery cases for model training, setting aside 10% as a validation set for hyperparameter tuning and the remaining 20% for model testing. We make sure that all data points from the same surgery case stay in the same subset of data. We follow (lundberg2018explainable, ) and set the prediction horizon $T_h$ to 5 min, which is short enough for a model to capture relevant and predictive information about potential future hypoxemia, but long enough for clinicians to take action.
5.2.1. Evaluation Metrics
We use the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC) to evaluate overall prediction performance averaged over all possible output thresholds. Note that PR-AUC is more informative for evaluating imbalanced datasets (davis2006relationship, ). We also report the average number of false alarms per hour of surgery (False Ala./Hr) with a 5-minute redundant alarm suppression window (i.e., a second alarm that goes off within a 5-minute window is counted as part of a single alarm), averaged over all possible model sensitivities. See subsection 6.3 for further discussion of alarm suppression.
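The sketch below illustrates how alarms are counted under suppression (names are ours); separating true from false alarms additionally requires the event onset times and is omitted here:

```python
import numpy as np

def alarms_per_hour(y_prob, threshold, suppress_min=5):
    """y_prob: per-minute predicted probabilities for one surgery.
    Alarms within `suppress_min` minutes of a kept alarm are silenced."""
    alarm_times = np.flatnonzero(np.asarray(y_prob) >= threshold)
    kept, last = [], -np.inf
    for t in alarm_times:
        if t - last > suppress_min:  # outside the suppression window
            kept.append(t)
            last = t
    return len(kept) / (len(y_prob) / 60.0)  # alarms per hour of surgery
```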
5.2.2. Hyperparameters
For both hypoxemia event prediction tasks, we use an observation window of $T$ min. For persistent hypoxemia prediction, we set the horizon of Forecaster to $T_f = 10$ min; for general hypoxemia prediction, we set $T_f = 6$ min. The intuition is that, given a 5-min prediction horizon, we need to see 10 min into the future to speculate about persistent hypoxemia, but only 6 min for single SpO2 drops.
Model | Persistent Hypoxemia (≥ 5 min) | General Hypoxemia (≥ 1 min)
---|---|---|---|---|---|---|
PR-AUC | ROC-AUC | False Ala./Hr | PR-AUC | ROC-AUC | False Ala./Hr | |
LR | .0421 | .9198 | 1.21 | .1213 | .8910 | 2.87 |
GBM | .0570 | .9305 | .81 | .1652 | .8932 | 1.44 |
LSTM | .0574 | .9283 | .69 | .1542 | .8920 | 1.78 |
TCN (bai2018empirical, ) | .0654 | .9302 | .51 | .1811 | .8956 | 1.32 |
FCN (wang2017time, ) | .0681 | .9354 | .47 | .2005 | .9024 | 1.05 |
LSTM-AE (zhu2017deep, ) | .0695 | .9321 | .58 | .1772 | .8921 | 1.54 |
LSTM-CAE (srivastava2015unsupervised, ) | .0734 | .9345 | .50 | .1823 | .8965 | 1.28 |
TCN-AE | .0744 | .9334 | .49 | .1842 | .8971 | 1.25 |
FCN-AE | .0801 | .9440 | .44 | .2011 | .9078 | 1.02 |
hiNet-l | .0775 | .9421 | .44 | .1897 | .9011 | 1.16 |
hiNet-t | .0866 | .9471 | .36 | .2124 | .9167 | .96 |
hiNet-f | .0893 | .9475 | .34 | .2120 | .9196 | .98 |
GBM w/ PreOp (lundberg2018explainable, ) | .0716 | .9322 | .52 | .1785 | .8942 | 1.35 |
hiNet-f w/ PreOp | .1021 | .9624 | .28 | .2199 | .9208 | .95 |
5.3. Baseline Methods
We compare hiNet to the following classical models, deep sequential models, and unsupervised pretraining based methods:
• LR: Logistic Regression. Since LR cannot directly process time series, for a fair comparison we follow (li2020deepalerts, ; fritz2019deep, ) and extract a series of summary statistics (e.g., min, max, trend, energy, kurtosis) that capture temporal patterns of the historical time series within a window of the same length as hiNet's.
• GBM: Gradient Boosting Machines, employed by the state-of-the-art hypoxemia prediction system (lundberg2018explainable, ) and implemented using XGBoost (chen2016xgboost, ). We use the same statistical features as in LR.
• LSTM: Stacked bi-LSTM layers for feature extraction and an FC block as the classifier, with layers configured the same way as the Event Predictor in hiNet.
• TCN (bai2018empirical, ): Temporal Convolutional Network with causal convolutions and exponentially increasing dilation, configured the same way as the TCN module in hiNet.
• FCN (wang2017time, ): Fully Convolutional Networks, a deep CNN architecture with Batch Normalization, shown to outperform multiple strong baselines on 44 time series classification benchmarks.
• LSTM-AE (zhu2017deep, ): A deep LSTM classifier with weights pretrained on an LSTM-AE.
• LSTM-CAE (srivastava2015unsupervised, ): A deep LSTM classifier with weights pretrained on an LSTM-CAE that jointly reconstructs the input and forecasts future input.
• TCN-AE and FCN-AE: LSTM-AE (zhu2017deep, ) with the RNN encoder replaced by TCN and FCN, respectively, for comparison.
• hiNet-l, hiNet-t, and hiNet-f: hiNet variants with the sequence encoder implemented by LSTM, TCN, and FCN, respectively.
• GBM w/ PreOp and hiNet w/ PreOp: For GBM w/ PreOp, the preoperative variables are added to the input of GBM as in (lundberg2018explainable, ). For hiNet w/ PreOp, the preoperative features are concatenated directly with the data representation in Eq. (7).
5.4. Implementation Details
For TCN and hiNet-t, we use 3 TCN blocks with the number of filters set to 64 and the dilation of each block set to 2, 4, and 8. For FCN and hiNet-f, we use a 3-block FCN with the filter sizes set per block. Each FC block in hiNet has one hidden layer with the rectified linear unit (ReLU) as the activation function. The number of neurons in each hidden layer of both the LSTM and FC blocks, and the number of bases $B$ in the memory, are all set to 128. We select the regularizer coefficient $\lambda$ on the validation set; the best $\lambda$ is 0.1 for persistent hypoxemia and 0.01 for general hypoxemia.
We use Adam as the optimizer with the default learning rate and train the model with mini-batches. For each batch, we feed 32 independent surgeries, containing on average about 2,880 extracted examples, into the model. We use the same settings for all deep baselines and all variants of hiNet. All models are trained for 50 epochs with early stopping and dropout applied to prevent overfitting. The model with the lowest epoch-end classification loss in each run is saved and evaluated on the test data. The proposed model is implemented using TensorFlow 2.4.0 and Keras 2.4.3 with Python 3.8, and trained using NVIDIA GeForce RTX 3080 Ti GPUs and Intel Core i9-10850K 3.60GHz CPUs.
6. Result and Discussion
6.1. Overall Performance
6.1.1. Overall performance
Table 2 summarizes the overall performance of all models. Our proposed hiNet framework outperforms all baseline methods, achieving improved PR-AUC and ROC-AUC scores while reducing the average false alarm rate by 46% and 32% relative to the state-of-the-art hypoxemia prediction system (GBM w/ PreOp) and the best deep baseline (LSTM-CAE), respectively. hiNet thus effectively improves the efficacy of alerts.
6.1.2. Feature Engineering vs. Representation Learning
In general, the non-deep models LR and GBM show inferior prediction performance compared to the deep learning models. This may result from the limited capacity of simple statistical features to capture complex physiological dynamics and temporality. In contrast, armed with the proposed dual-level sequence embedding, the deep models can learn representations capable of encoding more complicated patterns.
6.1.3. Supervised vs. Self-supervised Learning
Among the deep models, those with AE-based pretraining (e.g., LSTM-AE, TCN-AE, FCN-AE) outperform their corresponding base models (e.g., LSTM, TCN, FCN). Self-supervised learning helps learn better data representations that benefit the discriminative task. Our hiNet framework further improves performance by jointly learning representations that are simultaneously optimized for prediction. The self-supervised components aid contextual representation learning, which potentially improves robustness against extreme class imbalance.
6.2. Ablation Study
As shown in Table 3, we analyze the contribution of each component by removing it from hiNet. Note that for the variant w/o Memory, the memory layers are replaced with two stacked independent linear layers to maintain similar model complexity for a fair comparison. For both prediction tasks (persistent and general hypoxemia), each component of hiNet plays an essential role in improving prediction performance.
Persist. Hypo. | PR-AUC | ROC-AUC | False Ala./Hr |
---|---|---|---|
w/o Forecaster | .0812 | .9416 | .42 |
w/o Reconstructor | .0795 | .9367 | .45 |
w/o Memory | .0765 | .9356 | .48 |
hiNet-f | .0893 | .9475 | .34 |
Gener. Hypo. | PR-AUC | ROC-AUC | False Ala./Hr |
---|---|---|---|
w/o Forecaster | .2069 | .9108 | 1.01 |
w/o Reconstructor | .1997 | .9022 | 1.14 |
w/o Memory | .2007 | .9096 | 1.05 |
hiNet-f | .2120 | .9196 | .98 |
6.3. Practical Effectiveness
6.3.1. Alarm Suppression
The hiNet framework is designed to provide real-time prediction of near-term hypoxemia events at a one-minute resolution. For a continuous prediction system, multiple alarms raised by the prediction model within a short time window should be considered predictions of one approaching event rather than of multiple independent ones. To reduce alarm fatigue in a practical setting, we suppress redundant alarms within a short window (e.g., 5 minutes): whenever a subsequent alarm goes off within a certain time window of the first alarm, we silence it and consider only the first alarm in the true and false alarm evaluation. Figure 4 shows the impact of window size on the false alarm rate. By applying alarm suppression with a 5-minute window, the average false alarm rate drops by 88%, and the rate changes only slightly with much larger windows. Hence, we use 5 minutes as the window size for the remaining false alarm evaluations in this paper.

6.3.2. False Alarm vs. Sensitivity
To mitigate alarm fatigue, it is crucial for a clinical alarm system to minimize its false alarm rate at a given sensitivity. Hence, we plot the False Alarm vs. Sensitivity curve to evaluate the impact of our model on clinical practice, as shown in Figure 6. In practical hypoxemia intervention and mitigation, the cost of a false positive is much lower than that of missing a true positive (i.e., a relatively low-cost intervention vs. life-threatening persistent hypoxemia), so we prefer high model sensitivity for hypoxemia prediction. For practical evaluation at fixed sensitivity, Table 4 compares the false alarm rates of GBM w/ PreOp and hiNet-f w/ PreOp. Our hiNet framework reduces false alarms by 64% and 74% compared to the state-of-the-art system (GBM w/ PreOp (lundberg2018explainable, )) at sensitivities of 0.8 and 0.6, respectively.
Sensitivity | GBM w/ PreOp (False Ala./Hr) | hiNet-f w/ PreOp (False Ala./Hr) | Improv. (%)
---|---|---|---
0.8 | 0.89 | 0.32 | 64%
0.6 | 0.31 | 0.08 | 74%


6.3.3. Lead Time of Alarms
In the problem formulation (Section 2) and the label assignment for model training (Figure 1), we define the prediction horizon as 5 minutes and assign a positive label to each of the 5 minutes before the onset of a hypoxemia event. Hence, the lead time of an alarm relative to the onset of the event may range from 1 to 5 minutes. For surgical care, it is important to analyze the actual lead time of the alarms, which can have a significant impact on clinical intervention during surgeries. Figure 5 shows the histogram, during model inference, of the actual lead time of the first true alarm for each persistent hypoxemia event in the testing set. For sensitivity thresholds of both 0.8 and 0.6, our hiNet predicted the majority of persistent hypoxemia events 5 minutes before they occurred, providing adequate lead time for intervention.
6.4. Representation Learning
As shown in Figure 7, we extract the latent representations learned by LSTM-AE and by various layers of hiNet for general hypoxemia prediction, and visualize these vectors in a 2D space using t-SNE (van2014accelerating, ). Considering the extreme class imbalance, we randomly select 50 surgeries in which persistent hypoxemia occurred and 50 hypoxemia-free surgeries from the test set for visualization purposes. The representations of the same class tend to group together in hiNet's latent space, which enlarges the partitioning margin and makes the classes easier to separate. More explicit grouping patterns can be observed at layers closer to the Event Predictor output, where the supervisory signal is stronger. In contrast, the representation learned by the unsupervised LSTM-AE is structured but shows much less salient grouping. We observe that hiNet learns powerful task-specific representations via joint training, where the supervisory signal is propagated to fine-tune latent representations towards task-specific effectiveness.

6.5. Potential Limitations
When producing labels for clinical outcomes such as hypoxemia, anesthesiologist interventions can indirectly affect the prediction outcome (lundberg2018explainable, ). Because these interventions may affect certain vital parameters including SpO2, models that use these parameters can learn when a doctor is likely to intervene and hence underestimate the risk of a potentially high-risk patient. The ideal solution would be to remove all samples where clinicians have intervened from model training, but this is difficult in practice, since fully identifying when clinicians are taking hypoxemia-preventing interventions is not possible. Hence, our model, like all previous learning-based approaches to this problem (lundberg2018explainable, ; erion2017anesthesiologist, ), must rest on the natural assumption that the model predicts hypoxemia while clinicians follow standard procedures, including (possibly) taking interventions to prevent potential hypoxemia anticipated based on their professional knowledge.
7. Conclusion and Potential Impact
Hypoxemia, especially persistent hypoxemia, is a rare but critical adverse surgical condition of high clinical significance. We developed hiNet, a novel end-to-end learning approach that employs a joint sequence autoencoder to predict hypoxemia during surgeries. In a large dataset of pediatric surgeries from a major academic medical center, hiNet achieved the best performance in predicting both general and persistent hypoxemia while outperforming strong baselines including the model used by the state-of-the-art hypoxemia prediction system. Our method produces low average false alarm rates, which helps mitigate alarm fatigue, an important concern in clinical care settings.
This work has the potential to impact clinical practice by predicting clinically significant intraoperative hypoxemia and facilitating timely interventions. By tackling the challenging problem of predicting rare but critical persistent hypoxemia, our model could help prevent adverse patient outcomes. We are currently working to implement our method in an application that pulls live intraoperative data streams from our health system's EHR and presents real-time predictions to surgeons and anesthesiologists in operating rooms. This will allow us to prospectively test the utility of our method in a real-world setting by evaluating how accurately alarms are raised and how they inform actual anesthetic interventions.
Acknowledgement
This study was funded by the Fullgraf Foundation and the Washington University/BJC HealthCare Big Ideas Healthcare Innovation Award.
Appendix A Design Choice of Sequence Encoder
A.1. Temporal Convolutional Networks
Temporal convolutional networks (TCN) are a family of efficient 1-D convolutional sequence models in which convolutions are computed across time (bai2018empirical, ; lea2017temporal, ). TCN differs from a typical 1-D CNN mainly in its convolution mechanism: dilated causal convolution. Formally, for a 1-D sequence input $\mathbf{x} \in \mathbb{R}^{T}$ and a convolution filter $f: \{0, \dots, k-1\} \to \mathbb{R}$, the dilated causal convolution operation $F$ on element $s$ of the sequence is defined as

(11) $\quad F(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i}$

where $d$ is the dilation factor, $k$ is the filter size, and $s - d \cdot i$ accounts for the past. Dilated convolution, i.e., using a larger dilation factor $d$, enables an output at the top level to represent a wider range of inputs, effectively expanding the receptive field (yu2015multi, ) of the convolution. Causal convolution, i.e., convolving each step only with previous steps, ensures that no future information is leaked to the past (bai2018empirical, ). This property gives TCN a directional structure similar to RNN models. The output sequence of the dilated convolution layer can then be written as

(12) $\quad \mathbf{h} = \big( F(1), F(2), \dots, F(T) \big)$
Usually Layer Normalization or Batch Normalization is applied after the convolutional layer for better performance (lea2017temporal, ; bai2018empirical, ). A TCN model is typically built by stacking multiple causal convolutional layers, yielding a wide receptive field that accounts for long sequences.
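A sketch of a dilated causal convolution stack in Keras, using the dilation factors and filter count from Section 5.4 (the kernel size here is an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers

def tcn_block(x, filters=64, kernel_size=3, dilation=1):
    """One dilated causal convolution layer: `causal` padding ensures
    no future information leaks to the past."""
    h = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation)(x)
    h = layers.BatchNormalization()(h)
    return layers.ReLU()(h)

def build_tcn_encoder(seq_len, n_channels):
    x = tf.keras.Input(shape=(seq_len, n_channels))
    h = x
    for d in (2, 4, 8):  # exponentially increasing receptive field
        h = tcn_block(h, dilation=d)
    return tf.keras.Model(x, h)
```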
A.2. Fully Convolutional Networks
Fully Convolutional Networks (FCN), a deep CNN architecture with Batch Normalization, have shown compelling quality and efficiency for image tasks such as semantic segmentation.
An FCN model consists of several basic convolutional blocks. A basic block is a convolutional layer followed by a Batch Normalization layer and a ReLU activation layer, as follows:
(13) $\quad \mathbf{h} = \mathrm{ReLU}\big(\mathrm{BN}(\mathbf{W} \ast \mathbf{x} + \mathbf{b})\big)$

where $\ast$ is the convolution operator.
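A sketch of the basic block in Keras; the filter count and kernel size are illustrative rather than the configuration used in the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def fcn_block(x, filters=128, kernel_size=8):
    """Eq. (13): convolution -> Batch Normalization -> ReLU."""
    y = layers.Conv1D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(y)
```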
References
- (1) Bai, S., Kolter, J. Z., and Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018).
- (2) Ballinger, B., Hsieh, J., Singh, A., Sohoni, N., Wang, J., Tison, G., Marcus, G., Sanchez, J., Maguire, C., Olgin, J., et al. Deepheart: semi-supervised sequence learning for cardiovascular risk prediction. In AAAI Conference on Artificial Intelligence (AAAI) (2018), vol. 32.
- (3) Baytas, I. M., Xiao, C., Zhang, X., Wang, F., Jain, A. K., and Zhou, J. Patient subtyping via time-aware lstm networks. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2017), pp. 65–74.
- (4) Chen, T., and Guestrin, C. Xgboost: A scalable tree boosting system. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2016), pp. 785–794.
- (5) Dai, A. M., and Le, Q. V. Semi-supervised sequence learning. Conference and Workshop on Neural Information Processing Systems 28 (2015), 3079–3087.
- (6) Davis, J., and Goadrich, M. The relationship between precision-recall and roc curves. In International Conference on Machine Learning (2006), pp. 233–240.
- (7) Dunham, C. M., Hileman, B. M., Hutchinson, A. E., Chance, E. A., and Huang, G. S. Perioperative hypoxemia is common with horizontal positioning during general anesthesia and is associated with major adverse outcomes: a retrospective study of consecutive patients. BMC anesthesiology 14, 1 (2014), 1–10.
- (8) Ehrenfeld, J. M., Funk, L. M., Van Schalkwyk, J., Merry, A. F., Sandberg, W. S., and Gawande, A. The incidence of hypoxemia during surgery: evidence from two institutions. Canadian Journal of Anesthesia 57, 10 (2010), 888–897.
- (9) ElMoaqet, H., Tilbury, D. M., and Ramachandran, S. K. Multi-step ahead predictions for critical levels in physiological time series. IEEE Transactions on Cybernetics 46, 7 (2016), 1704–1714.
- (10) Erion, G., Chen, H., Lundberg, S. M., and Lee, S.-I. Anesthesiologist-level forecasting of hypoxemia with only spo2 data using deep learning. In Conference and Workshop on Neural Information Processing Systems Workshop ML4H (2017).
- (11) Fritz, B. A., Cui, Z., Zhang, M., He, Y., Chen, Y., Kronzer, A., Abdallah, A. B., King, C. R., and Avidan, M. S. Deep-learning model for predicting 30-day postoperative mortality. British journal of anaesthesia 123, 5 (2019), 688–695.
- (12) Gong, D., Liu, L., Le, V., Saha, B., Mansour, M. R., Venkatesh, S., and Hengel, A. v. d. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In International Conference on Computer Vision (2019), pp. 1705–1714.
- (13) Hendrycks, D., Lee, K., and Mazeika, M. Using pre-training can improve model robustness and uncertainty. International Conference on Machine Learning (2019).
- (14) Laffin, A. E., Kendale, S. M., and Huncke, T. K. Severity and duration of hypoxemia during outpatient endoscopy in obese patients: a retrospective cohort study. Canadian Journal of Anaesthesia (2020).
- (15) Laptev, N., Yosinski, J., Li, L. E., and Smyl, S. Time-series extreme event forecasting with neural networks at uber. In International Conference on Machine Learning (2017), vol. 34, pp. 1–5.
- (16) Lea, C., Flynn, M. D., Vidal, R., Reiter, A., and Hager, G. D. Temporal convolutional networks for action segmentation and detection. In Conference on Computer Vision and Pattern Recognition (2017), pp. 156–165.
- (17) Li, D., Lyons, P. G., Lu, C., and Kollef, M. Deepalerts: Deep learning based multi-horizon alerts for clinical deterioration on oncology hospital wards. In AAAI Conference on Artificial Intelligence (AAAI) (2020), vol. 34, pp. 743–750.
- (18) Liu, H., Han, J., and Nie, F. Semi-supervised orthogonal graph embedding with recursive projections. In International Joint Conference in Artificial Intelligence (2017), pp. 2308–2314.
- (19) Liu, H., Lou, S. S., Warner, B. C., Harford, D. R., Kannampallil, T., and Lu, C. Hipal: A deep framework for physician burnout prediction using activity logs in electronic health records. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022).
- (20) Lou, S. S., Liu, H., Warner, B. C., Harford, D., Lu, C., and Kannampallil, T. Predicting physician burnout using clinical activity logs: model performance and lessons learned. Journal of Biomedical Informatics 127 (2022), 104015.
- (21) Lundberg, S. M., Nair, B., Vavilala, M. S., Horibe, M., Eisses, M. J., Adams, T., Liston, D. E., Low, D. K.-W., Newman, S.-F., Kim, J., et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature Biomedical Engineering 2, 10 (2018), 749–760.
- (22) Mehta, C., and Mehta, Y. Management of refractory hypoxemia. Annals of cardiac anaesthesia 19, 1 (2016), 89.
- (23) Meyer, A., Zverinski, D., Pfahringer, B., Kempfert, J., Kuehne, T., Sündermann, S. H., Stamm, C., Hofmann, T., Falk, V., and Eickhoff, C. Machine learning for real-time prediction of complications in critical care: a retrospective study. The Lancet Respiratory Medicine 6, 12 (2018), 905–914.
- (24) Nguyen, H., Jang, S., Ivanov, R., Bonafide, C. P., Weimer, J., and Lee, I. Reducing pulse oximetry false alarms without missing life-threatening events. Smart Health 9 (2018), 287–296.
- (25) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (2016), pp. 1842–1850.
- (26) Srivastava, N., Mansimov, E., and Salakhudinov, R. Unsupervised learning of video representations using lstms. In International Conference on Machine Learning (2015), pp. 843–852.
- (27) Sukhbaatar, S., Weston, J., Fergus, R., et al. End-to-end memory networks. In Conference and Workshop on Neural Information Processing Systems (2015), pp. 2440–2448.
- (28) Suresh, H., Szolovits, P., and Ghassemi, M. The use of autoencoders for discovering patient phenotypes. Conference and Workshop on Neural Information Processing Systems Workshop ML4H (2017).
- (29) Van Der Maaten, L. Accelerating t-sne using tree-based algorithms. Journal of Machine Learning Research 15, 1 (2014), 3221–3245.
- (30) Wang, Z., Yan, W., and Oates, T. Time series classification from scratch with deep neural networks: A strong baseline. In International Joint Conference on Neural Networks (2017), IEEE, pp. 1578–1585.
- (31) West, C. P., Dyrbye, L. N., Erwin, P. J., and Shanafelt, T. D. Interventions to prevent and reduce physician burnout: a systematic review and meta-analysis. The Lancet 388, 10057 (2016), 2272–2281.
- (32) World Health Organization. Pulse Oximetry Training Manual, 2011.
- (33) Yu, F., and Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).
- (34) Zhou, Y., Arpit, D., Nwogu, I., and Govindaraju, V. Is joint training better for deep auto-encoders? arXiv preprint arXiv:1405.1380 (2014).
- (35) Zhu, L., and Laptev, N. Deep and confident prediction for time series at uber. In International Conference on Data Mining Workshop (2017), pp. 103–110.