Predicting Intraoperative Hypoxemia with Hybrid Inference Sequence Autoencoder Networks
Abstract.
We present an end-to-end model that uses streaming physiological time series to predict near-term risk of hypoxemia, a rare but life-threatening condition known to cause serious patient harm during surgery. Inspired by the fact that a hypoxemia event is defined based on a future sequence of low SpO2 (i.e., blood oxygen saturation) instances, we propose the hybrid inference network (hiNet), which makes hybrid inference on both future low SpO2 instances and hypoxemia outcomes. hiNet integrates 1) a joint sequence autoencoder that simultaneously optimizes a discriminative decoder for label prediction, and 2) two auxiliary decoders trained for data reconstruction and forecasting, which seamlessly learn contextual latent representations that capture the transition from present states to future states. All decoders share a memory-based encoder that helps capture the global dynamics of patient measurements. On a large surgical cohort of 72,018 surgeries at a major academic medical center, our model outperforms strong baselines, including the model used by the state-of-the-art hypoxemia prediction system. With its capability to make real-time predictions of near-term hypoxemia at clinically acceptable alarm rates, hiNet shows promise in improving clinical decision making and easing the burden of perioperative care.
1. Introduction
Hypoxemia, or low blood oxygen saturation (SpO2), is an adverse physiological condition known to cause serious patient harm during surgery and general anesthesia (dunham2014perioperative, ). Without early intervention, prolonged hypoxemia can seriously affect surgical outcomes and is associated with many adverse complications such as cardiac arrest, encephalopathy, delirium, and post-operative infections (lundberg2018explainable, ). To mitigate the effects of hypoxemia, anesthesiologists monitor SpO2 levels during general anesthesia using pulse oximetry, so that actions can be taken in a timely manner. Despite the availability of real-time data, however, reliable prediction of hypoxemic events remains elusive (ehrenfeld2010incidence, ).
In recent years, data from electronic health records (EHR) have been used to develop predictive models that anticipate risks of future adverse events, facilitating early interventions to mitigate their occurrence (west2016interventions, ; liu2022hipal, ; lou2022predicting, ). Similar attempts (elmoaget2016multi, ; erion2017anesthesiologist, ) have been made to target hypoxemia based on SpO2 data. Recently, (lundberg2018explainable, ) proposed a gradient boosting machine (GBM) model for predicting intraoperative hypoxemia by integrating a number of preoperative static variables and intraoperative physiological time series. Compared to prior works that utilized only SpO2, the use of multi-modal data helps train a more reliable prediction model. However, as classical models such as GBM cannot directly utilize multivariate time series, they require the extraction of hand-crafted features with limited capacity for capturing the temporal dynamics of time series. In addition, the uniquely selected combination of features may not generalize to a new cohort from another hospital.
Another major limitation of the aforementioned hypoxemia models is that they were developed to target any low SpO2 occurrence, most of which are short-term (e.g., a single minute), transient SpO2 drops. In practice, most occurrences of low SpO2 do not reflect a high patient risk of deterioration that actually requires anesthesiologist intervention (laffin2020severity, ). In contrast to transient SpO2 reduction, significant risks arise from persistent hypoxemia, defined as continuously low SpO2 over a longer time window (e.g., 5 minutes). Persistent hypoxemia can develop rapidly and unexpectedly due to acute respiratory failure or other circumstances, and is immediately life-threatening if not treated (mehta2016management, ). For this study we collected a large surgical dataset from a major academic medical center. In this cohort, despite the rarity of low SpO2 (1.5% of all monitored time), 24.0% of the surgical encounters experienced at least one instance of SpO2 drop (measured per minute), while merely 1.9% experienced persistent hypoxemia (over a 5-minute time window). Nonessential alarms can desensitize clinicians to alarms, so that the truly critical ones may be missed. Hence, it is of high clinical importance to reliably predict persistent hypoxemia (i.e., with high sensitivity for adverse events at a clinically acceptable alarm rate), which facilitates the most pivotal early interventions. However, predicting persistent hypoxemia is a greater challenge: it is difficult for a machine learning model to learn reliable patterns when one class is severely underrepresented (0.13% positive rate in our real-world dataset). Moreover, it requires the model to foresee the future over a longer horizon, where past data become decreasingly indicative of distal future outcomes.
We aim to address these challenges by developing an end-to-end learning framework that utilizes streaming physiological time series (e.g., heart rate, SpO2) and produces risk predictions of hypoxemia, while simultaneously learning powerful latent representations known to improve model robustness against class imbalance (hendrycks2019using, ). In addition to general hypoxemia (i.e., any SpO2 drop), we focus on predicting persistent hypoxemia given its clinical significance. Intuitively, if we can forecast future input data, especially the SpO2 variation, we can anticipate potential hypoxemia risk more accurately. We propose a novel deep model, the hybrid inference network (hiNet), that simultaneously makes inference on both future hypoxemic events and the sequence of future SpO2 levels. This end-to-end framework is enabled by jointly optimizing: (i) a memory-augmented sequence encoder that both aggregates local temporal features and captures global patient dynamics; (ii) a sequence decoder for data reconstruction; (iii) a sequence decoder that models the evolution of future SpO2 levels; and (iv) a discriminative decoder (classifier) trained for hypoxemia event prediction. With joint training, the classifier can leverage the learned latent representation, while the supervisory signal from the classifier is propagated to seamlessly direct representation learning towards optimizing the desired prediction task.
The proposed model is trained and evaluated on a large real-world pediatric cohort from a major academic medical center. The data include minute-resolution multi-modal time series collected from 72,018 surgeries corresponding to 118,000 hours of surgery. The experiments show that our proposed model can predict both general and persistent hypoxemia more precisely and with lower alarm rates than a set of strong baselines, including the model employed by the state-of-the-art hypoxemia prediction system.
Specifically, our contributions are threefold:
• We propose the first learning-based approach for persistent hypoxemia prediction, a challenging but clinically significant problem.
• We design a novel sequence learning framework for multivariate time series that jointly optimizes multiple highly correlated tasks, including a supervised discriminative task and two sequence generation tasks. Through joint training, the learned contextual latent representations facilitate better predictions while being seamlessly optimized for task-specific effectiveness.
• Extensive experiments on a large pediatric surgical cohort show the improvement of our proposed model over strong baselines and the potential of hiNet to support clinical decisions and impact surgical practice.
2. Hypoxemia Prediction Problem

2.1. Intraoperative Time Series Data
During anesthesia procedures (i.e., surgeries), a set of the patient's physiological signals, such as vital signs (e.g., heart rate, SpO2) and ventilator parameters (e.g., respiration rate), are recorded at one-minute intervals. These intraoperative time series track the patient's physical status during surgery and may contain information associated with potential complications and adverse surgical events (meyer2018machine, ).
The data used in this work were collected from 79,142 pediatric surgical encounters spanning from 2014 to 2018 with approximately 118,000 hours of surgeries in total (89 min per case in average) at the St. Louis Children’s Hospital, a free-standing tertiary care pediatric hospital, and St. Louis Children’s Specialty Care Center, an outpatient pediatric surgical center. The institutional review board of Washington University approved this study with a waiver of consent (IRB #201906030).
2.2. Hypoxemia Definition and Labeling
We use a stringent clinical definition of hypoxemia to assess a patient encounter during surgery. We follow the guideline recommended by the World Health Organization (WHO2011pulse, ) and use the emergency level of SpO2 as the threshold for hypoxemia. We define two types of hypoxemia events based on two severity levels and assign labels for the two prediction problems following the criteria shown in Figure 1.
• General hypoxemia: any low-SpO2 instance (similar to prior work (lundberg2018explainable, ); light purple region).
• Persistent hypoxemia: a low-SpO2 instance that lasts for 5 consecutive minutes or more (dark purple region).
In clinical practice, temporary drops in SpO2 (i.e., general hypoxemia) are common (24.0% of patient encounters in our dataset) and less concerning, as they are usually rapidly correctable with simple maneuvers and carry no short- or long-term sequelae. Persistent hypoxemia (1.9% of patient encounters) is more clinically relevant and can rapidly become life-threatening; it is also much more difficult to predict, since the model must anticipate deeper into the future.
Figure 1 shows how labels are assigned for model development. For each prediction problem, given a 5-minute prediction horizon, we assign positive labels to the time steps within the 5-min predictive window right before the start of a hypoxemic event (persistent or general hypoxemia, respectively). Other time steps prior to this predictive window are assigned negative labels. We leave unlabeled the samples during the time window when the patient is already in a hypoxemic event, as there is little clinical benefit in predicting hypoxemia that has already occurred.
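To make the labeling scheme concrete, below is a minimal NumPy sketch of how labels and the unlabeled-sample mask could be derived from a per-minute SpO2 trace; the numeric threshold constant and the function names are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

SPO2_THRESHOLD = 90  # assumed numeric stand-in for the WHO emergency level
HORIZON = 5          # prediction horizon in minutes
PERSIST_MIN = 5      # minimum duration of a persistent event in minutes

def in_event_minutes(spo2, persist_min):
    """Mark minutes that fall inside a hypoxemic event: a run of
    consecutive low-SpO2 minutes of length >= persist_min.
    persist_min=1 recovers the general-hypoxemia definition."""
    low = np.asarray(spo2) < SPO2_THRESHOLD
    in_event = np.zeros(len(low), dtype=bool)
    t = 0
    while t < len(low):
        if low[t]:
            end = t
            while end < len(low) and low[end]:
                end += 1
            if end - t >= persist_min:
                in_event[t:end] = True
            t = end
        else:
            t += 1
    return in_event

def assign_labels(spo2, persist_min=PERSIST_MIN, horizon=HORIZON):
    """Positive labels fill the `horizon` minutes before each event
    onset; minutes already inside an event receive mask 0 (unlabeled)."""
    in_event = in_event_minutes(spo2, persist_min)
    labels = np.zeros(len(in_event), dtype=int)
    onsets = np.flatnonzero(np.diff(np.r_[0, in_event.astype(int)]) == 1)
    for s in onsets:
        labels[max(0, s - horizon):s] = 1
    mask = (~in_event).astype(int)  # 0 = in event, excluded from the Predictor loss
    return labels, mask
```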
2.3. Near-term Prediction Problem
Our dataset consists of $N$ independent surgeries, denoted as $\{\mathbf{X}^{(i)}\}_{i=1}^{N}$, where $i$ is the index of surgeries and $\mathbf{X}^{(i)}$ is a set of time series inputs. We assume the time span is divided into equal-length time intervals. The multi-channel time series $\mathbf{X}^{(i)} = [\mathbf{x}^{(i)}_1, \dots, \mathbf{x}^{(i)}_{T_i}] \in \mathbb{R}^{D \times T_i}$, where $D$ is the number of time series channels, $T_i$ is the length of surgery $i$, and $\mathbf{x}^{(i)}_t \in \mathbb{R}^{D}$ is the vector of recorded data values at timestep $t$. For timestep $t$, we have a binary label $y^{(i)}_t \in \{0, 1\}$, where $y^{(i)}_t = 1$ indicates that a hypoxemic event will occur anytime within the next fixed-length time window $(t, t + T_h]$, and $y^{(i)}_t = 0$ otherwise. $T_h$ denotes the prediction horizon. We aim to solve the following:
Problem 1.
Given a new surgery $i$ where the patient is not already in hypoxemia, and the data window $\mathbf{X}^{(i)}_{t-T+1:t}$ at time $t$ (zero padded if $t < T$), the goal is to train a classifier $f$ that produces the label $\hat{y}^{(i)}_t$: $f: \mathbf{X}^{(i)}_{t-T+1:t} \mapsto \hat{y}^{(i)}_t$.
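As a small illustration of the window construction (function and variable names are ours), each classifier input is the $T$-minute window ending at minute $t$, zero padded at the start of the surgery:

```python
import numpy as np

def extract_windows(X, T):
    """X: (L, D) minute-resolution series for one surgery.
    Returns an (L, T, D) array whose t-th slice is the window ending
    at minute t, zero padded where t < T."""
    L, D = X.shape
    padded = np.vstack([np.zeros((T - 1, D)), X])
    return np.stack([padded[t:t + T] for t in range(L)])
```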
3. Related Works
3.1. Learning to Predict Hypoxemia
Recently, several attempts have been made to target hypoxemia using data-driven approaches. For instance, (elmoaget2016multi, ) used a linear auto-regressive (AR) model for SpO2 forecasting. (erion2017anesthesiologist, ) used deep learning models such as LSTM to classify hypoxemia directly from past SpO2 data. (nguyen2018reducing, ) used AdaBoost to identify false SpO2 alarms, without directly targeting prediction. Recently, a more comprehensive approach, Prescience (lundberg2018explainable, ), employed GBM to predict general hypoxemia based on both preoperative static variables and intraoperative time series. All these approaches aimed either at forecasting SpO2 (regression) or at predicting only general hypoxemia.

3.2. Encoder-decoder Sequential Learning
The autoencoder (AE), as shown in Figure 2(a), is widely used for representation learning. An AE simultaneously optimizes an encoder that maps the input into a latent representation and a decoder that recovers the input by minimizing the reconstruction error. However, AEs cannot directly handle sequential data. Recently, (dai2015semi, ) proposed a seq2seq AE, instantiating the encoder and decoder with LSTMs, referred to as LSTM-AE, shown in Figure 2(b). (srivastava2015unsupervised, ) further extended it to the composite LSTM autoencoder (LSTM-CAE), which additionally trains another decoder fitting future data as regularization. LSTM-AE-based methods have shown promising performance in learning representations for sequential data. For instance, (laptev2017time, ; zhu2017deep, ) used pretrained LSTM-AEs to extract deep features from time series for Uber trip and rare event forecasting. Recent clinical applications (suresh2017use, ; ballinger2018deepheart, ; baytas2017patient, ) use LSTM-AE to extract patient-level representations for phenotyping and cardiovascular risk prediction. These works are related to our approach in that they all use representations learned by a pretrained LSTM-AE to facilitate a classification task. However, with the goal of continuously providing real-time predictions, instead of extracting patient-level representations, we use a sequence AE to aggregate the local data sequence and learn a representation of a data window sliding along each surgical trajectory. Unsupervised pretraining tends to learn general task-agnostic underlying structure, so greedy layer-wise optimization in separate steps can lead to a suboptimal classifier (zhou2014joint, ). Instead, our approach builds an end-to-end model that jointly optimizes classification and latent representation learning while balancing them more delicately.
4. The HiNet Framework

Figure 3 shows an overview of our approach. This end-to-end framework jointly optimizes the desired classification task for prediction and two auxiliary tasks for representation learning. The addition of the sequence forecasting decoder contributes to learning future-related contextual representations. Joint training allows the supervised loss to direct representation learning towards being effective for the desired classification task. Hence, the hybrid integration of the three decoders enables the model to balance between extracting the underlying structure of the data and providing accurate predictions.
4.1. Memory-augmented Sequence Autoencoder
When applying a sequence AE to streaming time series using a sliding data window, the sequence encoder tends to learn mainly the local temporal patterns within the window. Recently, memory networks (sukhbaatar2015end, ; gong2019memorizing, ) have shown promising results in data representation. Generally, a memory network updates an external memory consisting of a set of basis features for look-up, preserving general patterns optimized over the whole dataset (santoro2016meta, ). To help capture the global dynamics of physiological time series, we design a novel memory-augmented sequence AE with dual-level embedding of the input sequences.
4.1.1. Dual-level Sequence Embedding
Given a sliding window $\mathbf{X}_t = [\mathbf{x}_{t-T+1}, \dots, \mathbf{x}_t]$, we first represent the features at each step using a set of basis vectors that memorize the most representative global patterns across all surgery cases, and then use a sequence encoder to aggregate all the memory-encoded vectors within the local input window into one representation vector, as shown in Figure 3.
Level 1: Step-level Global Memory Encoding. We assume there are $B$ feature basis vectors ($B$ is a hyperparameter) for a specific dataset, and initialize a global memory $\mathbf{M} = [\mathbf{m}_1, \dots, \mathbf{m}_B]$. We assume that the features of each step can be embedded as a linear combination of the feature bases. Given the feature vector $\mathbf{x}_t$ of the $t$-th step, we obtain the attention for each basis by calculating the similarity of $\mathbf{x}_t$ to each of them, normalized by a softmax function. The embedded vector $\hat{\mathbf{x}}_t$ is then the sum of all the bases weighted by their attention. The memory is updated jointly with the network. More concretely,

(1) $\quad a_{t,j} = \dfrac{\exp(\mathbf{x}_t^\top \mathbf{m}_j)}{\sum_{k=1}^{B} \exp(\mathbf{x}_t^\top \mathbf{m}_k)}, \qquad \hat{\mathbf{x}}_t = \sum_{j=1}^{B} a_{t,j}\, \mathbf{m}_j$
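As a sketch, Eq. (1) can be written as a trainable Keras layer (the paper's implementation uses TensorFlow/Keras per Section 5.4, but this particular layer is our own illustrative rendering):

```python
import tensorflow as tf

class GlobalMemoryEncoding(tf.keras.layers.Layer):
    """Step-level memory encoding of Eq. (1): attention over B basis
    vectors, followed by an attention-weighted sum of the bases."""

    def __init__(self, num_bases=128, **kwargs):
        super().__init__(**kwargs)
        self.num_bases = num_bases  # B, a hyperparameter (128 in Section 5.4)

    def build(self, input_shape):
        d = int(input_shape[-1])
        # Global memory M, updated jointly with the rest of the network.
        self.memory = self.add_weight(
            name="memory", shape=(self.num_bases, d),
            initializer="glorot_uniform", trainable=True)

    def call(self, x):
        # x: (batch, T, D); similarity of each step to each basis vector.
        attn = tf.nn.softmax(tf.einsum("btd,kd->btk", x, self.memory), axis=-1)
        # Embedded steps: attention-weighted sums of the bases, (batch, T, D).
        return tf.einsum("btk,kd->btd", attn, self.memory)
```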
Level 2: Window-level Local Feature Aggregation. We then use a standard sequence model parameterized by $\phi$ to encode the local temporal patterns within the embedded sequence. To keep the framework generalizable, the sequence encoder can be instantiated by any standard sequence model (e.g., LSTM, TCN (bai2018empirical, ), FCN (wang2017time, )). We use the output $\mathbf{z}_t$ as the representation for the current time $t$:

(2) $\quad \mathbf{z}_t = \mathrm{Seq}_{\phi}(\hat{\mathbf{x}}_{t-T+1}, \dots, \hat{\mathbf{x}}_t)$
Please see Appendix A for more details about the design choice of the sequence encoder base model.
For convenience, we denote this mapping from the input to the latent space as Encoder $E: \mathbf{X}_t \mapsto \mathbf{z}_t$, with trainable parameters $\{\mathbf{M}, \phi\}$.
4.1.2. Data Reconstruction
We copy the vector $\mathbf{z}_t$ that represents the patient state at present for every step in the window as the input to the sequence disaggregation layers (i.e., an LSTM and then a linear layer as a surrogate of an inverse memory) to reconstruct $\mathbf{X}_t$:

(3) $\quad \tilde{\mathbf{H}}_t = \mathrm{LSTM}_{\psi_r}(\mathbf{z}_t, \dots, \mathbf{z}_t)$

(4) $\quad \tilde{\mathbf{X}}_t = \mathrm{Linear}_{\psi_r}(\tilde{\mathbf{H}}_t)$

This mapping from the input space back to itself is denoted as Reconstructor $R: \mathbf{X}_t \mapsto \tilde{\mathbf{X}}_t$, in which $\psi_r$ represents the parameters of the sequence disaggregation layers. They can be learned by minimizing the reconstruction loss $\mathcal{L}_r = \sum_t \|\mathbf{X}_t - \tilde{\mathbf{X}}_t\|_2^2$.
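A minimal Keras sketch of the Reconstructor, assuming the 128-unit hidden sizes from Section 5.4 and the 18 raw input channels from Table 1 (before the missingness mask is concatenated):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_reconstructor(window_len, latent_dim=128, n_channels=18):
    """Sequence disaggregation of Eqs. (3)-(4): copy z_t for every step,
    then an LSTM and a step-wise linear layer (surrogate inverse memory)."""
    z = tf.keras.Input(shape=(latent_dim,))
    h = layers.RepeatVector(window_len)(z)                  # Eq. (3): copied states
    h = layers.LSTM(latent_dim, return_sequences=True)(h)   # Eq. (3): LSTM
    x_hat = layers.TimeDistributed(layers.Dense(n_channels))(h)  # Eq. (4): linear
    return tf.keras.Model(z, x_hat, name="Reconstructor")
```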
4.2. Multi-decoder Hybrid Inference
Reconstructor helps learn latent representations that improve model robustness against class imbalance. However, Reconstructor alone may not provide enough of the future-indicative patterns that the prediction task relies on. To learn more future-related contextual representations, we propose the hybrid inference network (hiNet), which incorporates both generative and discriminative components and simultaneously makes inference on both the sequence of future low SpO2 instances and the hypoxemia event outcome. The overall architecture of hiNet is shown in Figure 3.
4.2.1. Latent State Transition
Given the present patient state $\mathbf{z}_t$, we use a fully-connected (FC) network parameterized by $\theta$ to model the contextual transition to the patient state $\mathbf{z}'_t$ over a future time horizon $T_f$:

(5) $\quad \mathbf{z}'_t = \mathrm{FC}_{\theta}(\mathbf{z}_t)$
This vector representing the future patient state is shared as the encoded data representation used by both a sequence forecasting decoder and a hypoxemia classifier, so that future-indicative representation learning and classification can seamlessly benefit from each other.
4.2.2. Future SpO2 Forecast
Since future hypoxemia events are strictly defined based on the sequence of future SpO2 levels (i.e., whether SpO2 is low and how long it stays low), we build another sequence decoder to forecast the future SpO2 sequence over a time horizon $T_f$, using the future state representation $\mathbf{z}'_t$. We use $\mathbf{s}_{t+1:t+T_f}$ to denote the SpO2 levels in the future data window. Similar to Eqs. (3) and (4), we apply sequence disaggregation layers to the copied vectors $(\mathbf{z}'_t, \dots, \mathbf{z}'_t)$. We denote the mapping from the input to the future SpO2 sequence as Forecaster $F: \mathbf{X}_t \mapsto \tilde{\mathbf{s}}_{t+1:t+T_f}$. Hence

(6) $\quad \tilde{\mathbf{s}}_{t+1:t+T_f} = \mathrm{Linear}_{\psi_f}\big(\mathrm{LSTM}_{\psi_f}(\mathbf{z}'_t, \dots, \mathbf{z}'_t)\big)$

where $\psi_f$ denotes the task-specific parameters of the sequence disaggregation layers. The parameters can be learned by minimizing the forecast loss $\mathcal{L}_f = \sum_t \|\mathbf{s}_{t+1:t+T_f} - \tilde{\mathbf{s}}_{t+1:t+T_f}\|_2^2$.
4.2.3. Hypoxemia Event Prediction
Given the new representation $\mathbf{z}'_t$ from Eq. (5) that contains indicative patterns of future data, we build a classifier to estimate the label $y_t$. We feed $\mathbf{z}'_t$ into an FC network with a softmax output. Using Event Predictor $P: \mathbf{X}_t \mapsto \hat{y}_t$ to denote the mapping, and $\psi_c$ to denote the task-specific FC parameters, we have

(7) $\quad \hat{y}_t = \mathrm{softmax}\big(\mathrm{FC}_{\psi_c}(\mathbf{z}'_t)\big)$
4.2.4. Masked Predictor Loss
For a binary classifier, given true labels $y_t$ and predicted probabilities $\hat{y}_t$, one typically uses the cross-entropy loss

(8) $\quad \mathcal{L}_c = -\sum_t \big[\, y_t \log \hat{y}_t + (1 - y_t) \log(1 - \hat{y}_t) \,\big]$
For the prediction of either general or persistent hypoxemia, it is trivial and clinically less meaningful to predict an event that has already occurred, so we focus on predicting only the start of hypoxemia and leave the samples where hypoxemia has already begun unlabeled (see Figure 1). A straightforward strategy is to exclude the unlabeled samples from both training and testing of the model, as in (lundberg2018explainable, ; erion2017anesthesiologist, ). However, the unlabeled samples can still provide our model useful information indicative of SpO2 tendency, and shape both Reconstructor $R$ and Forecaster $F$. Instead, we propose a masked loss for Predictor. For surgery $i$, we have a binary mask vector $\mathbf{u}^{(i)}$, where $u^{(i)}_t = 0$ indicates that surgery $i$ at time $t$ is in hypoxemia, and $u^{(i)}_t = 1$ otherwise. The modified Predictor loss is:

(9) $\quad \mathcal{L}_p = -\sum_t u_t \big[\, y_t \log \hat{y}_t + (1 - y_t) \log(1 - \hat{y}_t) \,\big]$
In this way, the unlabeled data are filtered out for Predictor but still used in training Reconstructor and Forecaster. This masked loss mechanism is similar in spirit to semi-supervised learning, where only part of the samples are mapped to labels (liu2017semi, ).
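A sketch of the masked loss in Eq. (9); normalizing by the number of labeled samples rather than taking the plain sum is our own choice for batch-size invariance:

```python
import tensorflow as tf

def masked_bce(y_true, y_pred, mask, eps=1e-7):
    """Cross-entropy over labeled minutes only: mask u_t = 0 removes
    minutes where the patient is already in a hypoxemic event."""
    y_true = tf.cast(y_true, tf.float32)
    mask = tf.cast(mask, tf.float32)
    y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
    ce = -(y_true * tf.math.log(y_pred) + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
    return tf.reduce_sum(mask * ce) / tf.maximum(tf.reduce_sum(mask), 1.0)
```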
4.2.5. Objective for Joint Learning
Combining the three decoders, we define the overall objective function for learning the model parameters of our end-to-end hybrid inference framework as minimizing the following joint loss:

(10) $\quad \mathcal{L} = \mathcal{L}_p + \lambda \, (\mathcal{L}_r + \mathcal{L}_f)$
where $\lambda$ is the weighting coefficient, shared by Reconstructor and Forecaster for simplicity. The two sequence decoders $R$ and $F$ regularize the classifier with optimized data representations and future-indicative patterns. Given the end-to-end architecture, we can jointly update all parameters during training. After the joint model is trained, we only need Predictor $P$ for label inference on new patients.
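A sketch of one training step under the joint objective of Eq. (10), reusing `masked_bce` from the sketch above; `model` is assumed to return the reconstruction, the SpO2 forecast, and the event probability for a batch of windows:

```python
import tensorflow as tf

mse = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(model, x, spo2_future, y, u, lam=0.1):
    with tf.GradientTape() as tape:
        x_hat, s_hat, y_hat = model(x, training=True)
        loss_r = mse(x, x_hat)             # Reconstructor loss
        loss_f = mse(spo2_future, s_hat)   # Forecaster loss
        loss_p = masked_bce(y, y_hat, u)   # masked Predictor loss, Eq. (9)
        loss = loss_p + lam * (loss_r + loss_f)  # joint loss, Eq. (10)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```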
5. Experiment
5.1. Data Preprocessing
Examples with recorded SpO2 of less than 60% were considered aberrant and excluded. We further excluded 5,606 cardiac surgery cases (with ICD-9 codes 745-747 and ICD-10 codes Q20-Q26) and 1,449 surgery cases with initial persistent low SpO2 (most likely cardiac-related surgeries), in which SpO2 levels were likely affected by the surgical procedure. After extensive data cleaning, 72,018 surgeries remain, with 18 channels of time series variables sampled every minute during surgical procedures. We aim to build a hypoxemia prediction system with 1-minute resolution. However, not all variables were originally observed and recorded at every minute during surgery. We use carry-forward imputation for missing values when the gap between two observations is less than 20 min, and fill values that were never observed or have not been observed for the past 20 min with zeros. In addition, we concatenate a binary mask matrix with the time series input to indicate variable missingness. All variables are standardized to zero mean and unit variance.
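As an illustration, the pandas sketch below applies this imputation and masking scheme to one surgery's minute-resolution table; the function name is hypothetical, and the normalization statistics are assumed to come from the training cohort:

```python
import pandas as pd

def preprocess(df, train_mean, train_std, limit_min=20):
    """df: one surgery, one row per minute, NaN where unobserved."""
    mask = df.notna().astype(float)                 # 1 = observed, 0 = missing
    filled = df.ffill(limit=limit_min).fillna(0.0)  # carry forward <= 20 min, else 0
    standardized = (filled - train_mean) / train_std  # stats from the training cohort
    return pd.concat([standardized, mask.add_suffix("_mask")], axis=1)
```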
To further explore the effect of incorporating preoperative features (e.g., age, sex, weight) as part of the empirical analysis (see Sections 5.3 and 6.1), we also collected a set of preoperative variables for modeling. Table 1 lists the 18 intraoperative variables and 9 preoperative variables used in this study.
Intraoperative Time Series | Preoperative Variable |
---|---|
Invasive blood pressure, diastolic | Age |
Invasive blood pressure, mean | Height |
Invasive blood pressure, systolic | Weight |
Noninvasive blood pressure, diastolic | Sex |
Noninvasive blood pressure, mean | ASA physical status |
Noninvasive blood pressure, systolic | ASA emergency status |
Heart rate | Surgery type |
SpO2 | Second hand smoke |
Respiratory rate | Operating room |
Positive end expiration pressure (PEEP) | |
Peak respiratory pressure | |
Tidal volume | |
Pulse | |
End tidal CO2 (ETCO2) | |
O2 flow | |
N2O flow | |
Air flow | |
Temperature |
5.2. Experimental Setup
We randomly select 70% of all surgery cases for model training, setting aside 10% as a validation set for hyperparameter tuning and the remaining 20% for model testing. We make sure that all data points from the same surgery case stay in the same subset of data. We follow (lundberg2018explainable, ) and set the prediction horizon $T_h$ to 5 min, which is short enough for a model to capture relevant and predictive information about potential future hypoxemia, but long enough for clinicians to take action.
5.2.1. Evaluation Metrics
We use the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC) to evaluate overall prediction performance averaged over all possible output thresholds. Note that PR-AUC is more informative for evaluating imbalanced datasets (davis2006relationship, ). We also report the average number of false alarms per hour of surgery (False Ala./Hr) with a 5-minute redundant alarm suppression window (i.e., a second alarm that goes off within a 5-minute window is counted as part of a single alarm), averaged over all possible model sensitivities. See subsection 6.3 for further discussion of alarm suppression.
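The sketch below illustrates how alarms are counted under suppression (names are ours); separating true from false alarms additionally requires the event onset times and is omitted here:

```python
import numpy as np

def alarms_per_hour(y_prob, threshold, suppress_min=5):
    """y_prob: per-minute predicted probabilities for one surgery.
    Alarms within `suppress_min` minutes of a kept alarm are silenced."""
    alarm_times = np.flatnonzero(np.asarray(y_prob) >= threshold)
    kept, last = [], -np.inf
    for t in alarm_times:
        if t - last > suppress_min:  # outside the suppression window
            kept.append(t)
            last = t
    return len(kept) / (len(y_prob) / 60.0)  # alarms per hour of surgery
```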
5.2.2. Hyperparameters
For both hypoxemia event prediction tasks, we use an observation window of $T$ min. For persistent hypoxemia prediction, we set the horizon of Forecaster to $T_f = 10$ min; for general hypoxemia prediction, we set $T_f = 6$ min. The intuition is that, given a 5-min prediction horizon, we need to see 10 min into the future to speculate about persistent hypoxemia, but only 6 min for single SpO2 drops.
Model | Persistent Hypoxemia (≥ 5 min) | General Hypoxemia (≥ 1 min)
---|---|---|---|---|---|---|
PR-AUC | ROC-AUC | False Ala./Hr | PR-AUC | ROC-AUC | False Ala./Hr | |
LR | .0421 | .9198 | 1.21 | .1213 | .8910 | 2.87 |
GBM | .0570 | .9305 | .81 | .1652 | .8932 | 1.44 |
LSTM | .0574 | .9283 | .69 | .1542 | .8920 | 1.78 |
TCN (bai2018empirical, ) | .0654 | .9302 | .51 | .1811 | .8956 | 1.32 |
FCN (wang2017time, ) | .0681 | .9354 | .47 | .2005 | .9024 | 1.05 |
LSTM-AE (zhu2017deep, ) | .0695 | .9321 | .58 | .1772 | .8921 | 1.54 |
LSTM-CAE (srivastava2015unsupervised, ) | .0734 | .9345 | .50 | .1823 | .8965 | 1.28 |
TCN-AE | .0744 | .9334 | .49 | .1842 | .8971 | 1.25 |
FCN-AE | .0801 | .9440 | .44 | .2011 | .9078 | 1.02 |
hiNet-l | .0775 | .9421 | .44 | .1897 | .9011 | 1.16 |
hiNet-t | .0866 | .9471 | .36 | .2124 | .9167 | .96 |
hiNet-f | .0893 | .9475 | .34 | .2120 | .9196 | .98 |
GBM w/ PreOp (lundberg2018explainable, ) | .0716 | .9322 | .52 | .1785 | .8942 | 1.35 |
hiNet-f w/ PreOp | .1021 | .9624 | .28 | .2199 | .9208 | .95 |
5.3. Baseline Methods
We compare hiNet to the following classical models, deep sequential models, and unsupervised pretraining based methods:
• LR: Logistic Regression. Since LR cannot directly process time series, for a fair comparison we follow (li2020deepalerts, ; fritz2019deep, ) and extract a series of summary statistics (e.g., min, max, trend, energy, kurtosis) that capture temporal patterns of the historical time series within a window of the same length as hiNet's.
• GBM: Gradient Boosting Machines, employed by the state-of-the-art hypoxemia prediction system (lundberg2018explainable, ) and implemented using XGBoost (chen2016xgboost, ). We use the same statistical features as in LR.
• LSTM: Stacked bi-LSTM layers for feature extraction and an FC block as the classifier, with layers configured the same way as the Event Predictor in hiNet.
• TCN (bai2018empirical, ): Temporal Convolutional Network with causal convolutions and exponentially increasing dilation, configured the same way as the TCN module in hiNet.
• FCN (wang2017time, ): Fully Convolutional Networks, a deep CNN architecture with Batch Normalization, shown to outperform multiple strong baselines on 44 time series classification benchmarks.
• LSTM-AE (zhu2017deep, ): A deep LSTM classifier with weights pretrained on an LSTM-AE.
• LSTM-CAE (srivastava2015unsupervised, ): A deep LSTM classifier with weights pretrained on an LSTM-CAE that jointly reconstructs the input and forecasts future input.
• TCN-AE and FCN-AE: LSTM-AE (zhu2017deep, ) with the RNN encoder replaced by TCN and FCN, respectively, for comparison.
• hiNet-l, hiNet-t, and hiNet-f: hiNet variants with the sequence encoder implemented by LSTM, TCN, and FCN, respectively.
• GBM w/ PreOp and hiNet w/ PreOp: For GBM w/ PreOp, the preoperative variables are added to the input of GBM as in (lundberg2018explainable, ). For hiNet w/ PreOp, the preoperative features are concatenated directly with the data representation in Eq. (7).
5.4. Implementation Details
For TCN and hiNet-t, we use 3 TCN blocks with the number of filters set to 64 and the dilation of each block set to 2, 4, and 8. For FCN and hiNet-f, we use a 3-block FCN with the filter sizes set per block. Each FC block in hiNet has one hidden layer with the rectified linear unit (ReLU) as the activation function. The number of neurons in each hidden layer of both the LSTM and FC blocks, and the number of bases $B$ in the memory, are all set to 128. We select the regularizer coefficient $\lambda$ on the validation set; the best $\lambda$ is 0.1 for persistent hypoxemia and 0.01 for general hypoxemia.
We use Adam as the optimizer with the default learning rate and train the model with mini-batches. For each batch, we feed 32 independent surgeries, containing on average about 2,880 extracted examples, into the model. We use the same settings for all deep baselines and all variants of hiNet. All models are trained for 50 epochs with early stopping and dropout applied to prevent overfitting. The model with the lowest epoch-end classification loss in each run is saved and evaluated on the test data. The proposed model is implemented using TensorFlow 2.4.0 and Keras 2.4.3 with Python 3.8, and trained using NVIDIA GeForce RTX 3080 Ti GPUs and Intel Core i9-10850K 3.60GHz CPUs.
6. Result and Discussion
6.1. Overall Performance
6.1.1. Overall performance
Table 2 summarizes the overall performance of all models. Our proposed hiNet framework outperforms all baseline methods, achieving improved PR-AUC and ROC-AUC scores while reducing the average false alarm rate by 46% and 32% relative to the state-of-the-art hypoxemia prediction system (GBM w/ PreOp) and the best deep baseline (LSTM-CAE), respectively. hiNet thus effectively improves the efficacy of alerts.
6.1.2. Feature Engineering vs. Representation Learning
In general, the non-deep models LR and GBM show inferior prediction performance compared to the deep learning models. This may result from the limited capacity of simple statistical features to capture complex physiological dynamics and temporality. In contrast, armed with the proposed dual-level sequence embedding, the deep models can learn representations capable of encoding more complicated patterns.
6.1.3. Supervised vs. Self-supervised Learning
Among the deep models, those with AE-based pretraining (e.g., LSTM-AE, TCN-AE, FCN-AE) outperform their corresponding base models (e.g., LSTM, TCN, FCN). Self-supervised learning helps learn better data representations that benefit the discriminative task. Our hiNet framework further improves performance by jointly learning representations that are simultaneously optimized for prediction. The self-supervised components aid contextual representation learning, which potentially improves robustness against extreme class imbalance.
6.2. Ablation Study
As shown in Table 3, we analyze the contribution of each component by removing it from hiNet. Note that for the variant w/o Memory, the memory layers are replaced with two stacked independent linear layers to maintain similar model complexity for a fair comparison. For both prediction tasks (persistent and general hypoxemia), each component of hiNet plays an essential role in improving prediction performance.
Persist. Hypo. | PR-AUC | ROC-AUC | False Ala./Hr |
---|---|---|---|
w/o Forecaster | .0812 | .9416 | .42 |
w/o Reconstructor | .0795 | .9367 | .45 |
w/o Memory | .0765 | .9356 | .48 |
hiNet-f | .0893 | .9475 | .34 |
Gener. Hypo. | PR-AUC | ROC-AUC | False Ala./Hr |
---|---|---|---|
w/o Forecaster | .2069 | .9108 | 1.01 |
w/o Reconstructor | .1997 | .9022 | 1.14 |
w/o Memory | .2007 | .9096 | 1.05 |
hiNet-f | .2120 | .9196 | .98 |
6.3. Practical Effectiveness
6.3.1. Alarm Suppression
The hiNet framework is designed to provide real-time prediction of near-term hypoxemia events at a one-minute resolution. For a continuous prediction system, multiple alarms raised by the prediction model within a short time window should be considered predictions of one approaching event rather than of multiple independent ones. To reduce alarm fatigue in a practical setting, we suppress redundant alarms within a short window (e.g., 5 minutes): whenever a subsequent alarm goes off within a certain time window of the first alarm, we silence it and consider only the first alarm in the true and false alarm evaluation. Figure 4 shows the impact of window size on the false alarm rate. By applying alarm suppression with a 5-minute window, the average false alarm rate drops by 88%, and the rate changes only slightly with much larger windows. Hence, we use 5 minutes as the window size for the remaining false alarm evaluations in this paper.

6.3.2. False Alarm vs. Sensitivity
To mitigate alarm fatigue, it is crucial for a clinical alarm system to minimize its false alarm rate at a given sensitivity. Hence, we plot the False Alarm vs. Sensitivity curve to evaluate the impact of our model on clinical practice, as shown in Figure 6. In practical hypoxemia intervention and mitigation, the cost of a false positive is much lower than that of missing a true positive (i.e., a relatively low-cost intervention vs. life-threatening persistent hypoxemia), so we prefer high model sensitivity for hypoxemia prediction. For practical evaluation at fixed sensitivity, Table 4 compares the false alarm rates of GBM w/ PreOp and hiNet-f w/ PreOp. Our hiNet framework reduces false alarms by 64% and 74% compared to the state-of-the-art system (GBM w/ PreOp (lundberg2018explainable, )) at sensitivities of 0.8 and 0.6, respectively.
Sensitivity | GBM w/ PreOp (False Ala./Hr) | hiNet-f w/ PreOp (False Ala./Hr) | Improv. (%)
---|---|---|---
0.8 | 0.89 | 0.32 | 64%
0.6 | 0.31 | 0.08 | 74%


6.3.3. Lead Time of Alarms
In the problem formulation (Section 2) and the label assignment for model training (Figure 1), we define the prediction horizon as 5 minutes and assign a positive label to each of the 5 minutes before the onset of a hypoxemia event. Hence, the lead time of an alarm relative to the onset of the event may range from 1 to 5 minutes. For surgical care, it is important to analyze the actual lead time of the alarms, which can have a significant impact on clinical intervention during surgeries. Figure 5 shows the histogram, during model inference, of the actual lead time of the first true alarm for each persistent hypoxemia event in the testing set. For sensitivity thresholds of both 0.8 and 0.6, our hiNet predicted the majority of persistent hypoxemia events 5 minutes before they occurred, providing adequate lead time for intervention.
6.4. Representation Learning
As shown in Figure 7, we extract the latent representations learned by LSTM-AE and by various layers of hiNet for general hypoxemia prediction, and visualize these vectors in a 2D space using t-SNE (van2014accelerating, ). Considering the extreme class imbalance, we randomly select 50 surgeries in which persistent hypoxemia occurred and 50 hypoxemia-free surgeries from the test set for visualization purposes. The representations of the same class tend to group together in hiNet's latent space, which enlarges the partitioning margin and makes the classes easier to separate. More explicit grouping patterns can be observed at layers closer to the Event Predictor output, where the supervisory signal is stronger. In contrast, the representation learned by the unsupervised LSTM-AE is structured but shows much less salient grouping. We observe that hiNet learns powerful task-specific representations via joint training, where the supervisory signal is propagated to fine-tune latent representations towards task-specific effectiveness.

6.5. Potential Limitations
When producing labels for clinical outcomes such as hypoxemia, anesthesiologist interventions can indirectly affect the prediction outcome (lundberg2018explainable, ). Because these interventions may affect certain vital parameters including SpO2, models that use these parameters can learn when a doctor is likely to intervene and hence underestimate the risk of a potentially high-risk patient. The ideal solution would be to remove all samples where clinicians have intervened from model training, but this is difficult in practice, since fully identifying when clinicians are taking hypoxemia-preventing interventions is not possible. Hence, our model, like all previous learning-based approaches to this problem (lundberg2018explainable, ; erion2017anesthesiologist, ), must rest on the natural assumption that the model predicts hypoxemia while clinicians follow standard procedures, including (possibly) taking interventions to prevent potential hypoxemia anticipated based on their professional knowledge.
7. Conclusion and Potential Impact
Hypoxemia, especially persistent hypoxemia, is a rare but critical adverse surgical condition of high clinical significance. We developed hiNet, a novel end-to-end learning approach that employs a joint sequence autoencoder to predict hypoxemia during surgeries. In a large dataset of pediatric surgeries from a major academic medical center, hiNet achieved the best performance in predicting both general and persistent hypoxemia while outperforming strong baselines including the model used by the state-of-the-art hypoxemia prediction system. Our method produces low average false alarm rates, which helps mitigate alarm fatigue, an important concern in clinical care settings.
This work has the potential to impact clinical practice by predicting clinically significant intraoperative hypoxemia and facilitating timely interventions. By tackling the challenging problem of predicting rare but critical persistent hypoxemia, our model could help prevent adverse patient outcomes. We are currently working to implement our method in an application that pulls live intraoperative data streams from our health system's EHR and presents real-time predictions to surgeons and anesthesiologists in operating rooms. This will allow us to prospectively test the utility of our method in a real-world setting by evaluating how accurately alarms are raised and how they inform actual anesthetic interventions.
Acknowledgement
This study was funded by the Fullgraf Foundation and the Washington University/BJC HealthCare Big Ideas Healthcare Innovation Award.
Appendix A Design Choice of Sequence Encoder
A.1. Temporal Convolutional Networks
Temporal convolutional networks (TCN) are a family of efficient 1-D convolutional sequence models in which convolutions are computed across time (bai2018empirical, ; lea2017temporal, ). TCN differs from a typical 1-D CNN mainly in its convolution mechanism: dilated causal convolution. Formally, for a 1-D sequence input $\mathbf{x} \in \mathbb{R}^{T}$ and a convolution filter $f: \{0, \dots, k-1\} \to \mathbb{R}$, the dilated causal convolution operation $F$ on element $s$ of the sequence is defined as

(11) $\quad F(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i}$

where $d$ is the dilation factor, $k$ is the filter size, and $s - d \cdot i$ accounts for the past. Dilated convolution, i.e., using a larger dilation factor $d$, enables an output at the top level to represent a wider range of inputs, effectively expanding the receptive field (yu2015multi, ) of the convolution. Causal convolution, i.e., convolving each step only with previous steps, ensures that no future information is leaked to the past (bai2018empirical, ). This property gives TCN a directional structure similar to RNN models. The output sequence of the dilated convolution layer can then be written as

(12) $\quad \mathbf{h} = \big( F(1), F(2), \dots, F(T) \big)$
Usually Layer Normalization or Batch Normalization is applied after the convolutional layer for better performance (lea2017temporal, ; bai2018empirical, ). A TCN model is typically built by stacking multiple causal convolutional layers, yielding a wide receptive field that accounts for long sequences.
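A sketch of a dilated causal convolution stack in Keras, using the dilation factors and filter count from Section 5.4 (the kernel size here is an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers

def tcn_block(x, filters=64, kernel_size=3, dilation=1):
    """One dilated causal convolution layer: `causal` padding ensures
    no future information leaks to the past."""
    h = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation)(x)
    h = layers.BatchNormalization()(h)
    return layers.ReLU()(h)

def build_tcn_encoder(seq_len, n_channels):
    x = tf.keras.Input(shape=(seq_len, n_channels))
    h = x
    for d in (2, 4, 8):  # exponentially increasing receptive field
        h = tcn_block(h, dilation=d)
    return tf.keras.Model(x, h)
```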
A.2. Fully Convolutional Networks
Fully Convolutional Networks (FCN), a deep CNN architecture with Batch Normalization, have shown compelling quality and efficiency for image tasks such as semantic segmentation.
An FCN model consists of several basic convolutional blocks. A basic block is a convolutional layer followed by a Batch Normalization layer and a ReLU activation layer, as follows:
(13) $\quad \mathbf{h} = \mathrm{ReLU}\big(\mathrm{BN}(\mathbf{W} \ast \mathbf{x} + \mathbf{b})\big)$

where $\ast$ is the convolution operator.
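A sketch of the basic block in Keras; the filter count and kernel size are illustrative rather than the configuration used in the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers

def fcn_block(x, filters=128, kernel_size=8):
    """Eq. (13): convolution -> Batch Normalization -> ReLU."""
    y = layers.Conv1D(filters, kernel_size, padding="same")(x)
    y = layers.BatchNormalization()(y)
    return layers.ReLU()(y)
```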
References
- (1) Bai, S., Kolter, J. Z., and Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271 (2018).
- (2) Ballinger, B., Hsieh, J., Singh, A., Sohoni, N., Wang, J., Tison, G., Marcus, G., Sanchez, J., Maguire, C., Olgin, J., et al. Deepheart: semi-supervised sequence learning for cardiovascular risk prediction. In AAAI Conference on Artificial Intelligence (AAAI) (2018), vol. 32.
- (3) Baytas, I. M., Xiao, C., Zhang, X., Wang, F., Jain, A. K., and Zhou, J. Patient subtyping via time-aware lstm networks. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2017), pp. 65–74.
- (4) Chen, T., and Guestrin, C. Xgboost: A scalable tree boosting system. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2016), pp. 785–794.
- (5) Dai, A. M., and Le, Q. V. Semi-supervised sequence learning. Conference and Workshop on Neural Information Processing Systems 28 (2015), 3079–3087.
- (6) Davis, J., and Goadrich, M. The relationship between precision-recall and roc curves. In International Conference on Machine Learning (2006), pp. 233–240.
- (7) Dunham, C. M., Hileman, B. M., Hutchinson, A. E., Chance, E. A., and Huang, G. S. Perioperative hypoxemia is common with horizontal positioning during general anesthesia and is associated with major adverse outcomes: a retrospective study of consecutive patients. BMC anesthesiology 14, 1 (2014), 1–10.
- (8) Ehrenfeld, J. M., Funk, L. M., Van Schalkwyk, J., Merry, A. F., Sandberg, W. S., and Gawande, A. The incidence of hypoxemia during surgery: evidence from two institutions. Canadian Journal of Anesthesia 57, 10 (2010), 888–897.
- (9) ElMoaqet, H., Tilbury, D. M., and Ramachandran, S. K. Multi-step ahead predictions for critical levels in physiological time series. IEEE Transactions on Cybernetics 46, 7 (2016), 1704–1714.
- (10) Erion, G., Chen, H., Lundberg, S. M., and Lee, S.-I. Anesthesiologist-level forecasting of hypoxemia with only spo2 data using deep learning. In Conference and Workshop on Neural Information Processing Systems Workshop ML4H (2017).
- (11) Fritz, B. A., Cui, Z., Zhang, M., He, Y., Chen, Y., Kronzer, A., Abdallah, A. B., King, C. R., and Avidan, M. S. Deep-learning model for predicting 30-day postoperative mortality. British journal of anaesthesia 123, 5 (2019), 688–695.
- (12) Gong, D., Liu, L., Le, V., Saha, B., Mansour, M. R., Venkatesh, S., and Hengel, A. v. d. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In International Conference on Computer Vision (2019), pp. 1705–1714.
- (13) Hendrycks, D., Lee, K., and Mazeika, M. Using pre-training can improve model robustness and uncertainty. International Conference on Machine Learning (2019).
- (14) Laffin, A. E., Kendale, S. M., and Huncke, T. K. Severity and duration of hypoxemia during outpatient endoscopy in obese patients: a retrospective cohort study. Canadian Journal of Anaesthesia (2020).
- (15) Laptev, N., Yosinski, J., Li, L. E., and Smyl, S. Time-series extreme event forecasting with neural networks at uber. In International Conference on Machine Learning (2017), vol. 34, pp. 1–5.
- (16) Lea, C., Flynn, M. D., Vidal, R., Reiter, A., and Hager, G. D. Temporal convolutional networks for action segmentation and detection. In Conference on Computer Vision and Pattern Recognition (2017), pp. 156–165.
- (17) Li, D., Lyons, P. G., Lu, C., and Kollef, M. Deepalerts: Deep learning based multi-horizon alerts for clinical deterioration on oncology hospital wards. In AAAI Conference on Artificial Intelligence (AAAI) (2020), vol. 34, pp. 743–750.
- (18) Liu, H., Han, J., and Nie, F. Semi-supervised orthogonal graph embedding with recursive projections. In International Joint Conference in Artificial Intelligence (2017), pp. 2308–2314.
- (19) Liu, H., Lou, S. S., Warner, B. C., Harford, D. R., Kannampallil, T., and Lu, C. Hipal: A deep framework for physician burnout prediction using activity logs in electronic health records. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022).
- (20) Lou, S. S., Liu, H., Warner, B. C., Harford, D., Lu, C., and Kannampallil, T. Predicting physician burnout using clinical activity logs: model performance and lessons learned. Journal of Biomedical Informatics 127 (2022), 104015.
- (21) Lundberg, S. M., Nair, B., Vavilala, M. S., Horibe, M., Eisses, M. J., Adams, T., Liston, D. E., Low, D. K.-W., Newman, S.-F., Kim, J., et al. Explainable machine-learning predictions for the prevention of hypoxaemia during surgery. Nature Biomedical Engineering 2, 10 (2018), 749–760.
- (22) Mehta, C., and Mehta, Y. Management of refractory hypoxemia. Annals of cardiac anaesthesia 19, 1 (2016), 89.
- (23) Meyer, A., Zverinski, D., Pfahringer, B., Kempfert, J., Kuehne, T., Sündermann, S. H., Stamm, C., Hofmann, T., Falk, V., and Eickhoff, C. Machine learning for real-time prediction of complications in critical care: a retrospective study. The Lancet Respiratory Medicine 6, 12 (2018), 905–914.
- (24) Nguyen, H., Jang, S., Ivanov, R., Bonafide, C. P., Weimer, J., and Lee, I. Reducing pulse oximetry false alarms without missing life-threatening events. Smart Health 9 (2018), 287–296.
- (25) Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., and Lillicrap, T. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (2016), pp. 1842–1850.
- (26) Srivastava, N., Mansimov, E., and Salakhudinov, R. Unsupervised learning of video representations using lstms. In International Conference on Machine Learning (2015), pp. 843–852.
- (27) Sukhbaatar, S., Weston, J., Fergus, R., et al. End-to-end memory networks. In Conference and Workshop on Neural Information Processing Systems (2015), pp. 2440–2448.
- (28) Suresh, H., Szolovits, P., and Ghassemi, M. The use of autoencoders for discovering patient phenotypes. Conference and Workshop on Neural Information Processing Systems Workshop ML4H (2017).
- (29) Van Der Maaten, L. Accelerating t-sne using tree-based algorithms. Journal of Machine Learning Research 15, 1 (2014), 3221–3245.
- (30) Wang, Z., Yan, W., and Oates, T. Time series classification from scratch with deep neural networks: A strong baseline. In International Joint Conference on Neural Networks (2017), IEEE, pp. 1578–1585.
- (31) West, C. P., Dyrbye, L. N., Erwin, P. J., and Shanafelt, T. D. Interventions to prevent and reduce physician burnout: a systematic review and meta-analysis. The Lancet 388, 10057 (2016), 2272–2281.
- (32) World Health Organization. Pulse Oximetry Training Manual, 2011.
- (33) Yu, F., and Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015).
- (34) Zhou, Y., Arpit, D., Nwogu, I., and Govindaraju, V. Is joint training better for deep auto-encoders? arXiv preprint arXiv:1405.1380 (2014).
- (35) Zhu, L., and Laptev, N. Deep and confident prediction for time series at uber. In International Conference on Data Mining Workshop (2017), pp. 103–110.