Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, BC, Canada — {masoud, purang, rjliao}@ece.ubc.ca
Vancouver General Hospital, Vancouver, BC, Canada — [email protected]
EchoGNN: Explainable Ejection Fraction Estimation with Graph Neural Networks
Abstract
Ejection fraction (EF) is a key indicator of cardiac function, allowing identification of patients prone to heart dysfunctions such as heart failure. EF is estimated from cardiac ultrasound videos known as echocardiograms (echo) by manually tracing the left ventricle and estimating its volume on certain frames. These estimations exhibit high inter-observer variability due to the manual process and varying video quality. Such sources of inaccuracy and the need for rapid assessment necessitate reliable and explainable machine learning techniques. In this work, we introduce EchoGNN, a model based on graph neural networks (GNNs) to estimate EF from echo videos. Our model first infers a latent echo-graph from the frames of one or multiple echo cine series. It then estimates weights over the nodes and edges of this graph, indicating the importance of individual frames that aid EF estimation. A GNN regressor uses this weighted graph to predict EF. We show, qualitatively and quantitatively, that the learned graph weights provide explainability through identification of critical frames for EF estimation, which can be used to determine when human intervention is required. On the EchoNet-Dynamic public EF dataset, EchoGNN achieves EF prediction performance on par with the state of the art and provides explainability, which is crucial given the high inter-observer variability inherent in this task. Our source code is publicly available at: https://github.com/MasoudMo/echognn.
Keywords:
Ultrasound · Ejection Fraction · Cardiac Imaging · Explainable Models · Graph Neural Networks · Deep Learning

1 Introduction
Ejection fraction (EF) is a ratio indicating the proportion of blood pumped out of the heart's left ventricle with each contraction. This measurement is crucial in monitoring cardiovascular health and is a potential indicator of heart failure [9, 17]. EF is computed using the stroke volume, which is the difference between the Left Ventricle (LV) blood volumes during the End-Systolic (ES) and End-Diastolic (ED) phases of the cardiac cycle, denoted by ESV and EDV, respectively [2]. These volumes are estimated from ultrasound videos of the heart, i.e., echocardiograms (echo), which involves detecting the frames corresponding to ES and ED and tracing the LV region. The manual process of detecting the correct frames and making proper traces is prone to human error. Therefore, the American Society of Echocardiography recommends performing EF estimation for up to 5 cardiac cycles and averaging the results [16]. However, this guideline is seldom followed in practice, and a single representative beat is selected for evaluation instead. This results in inter-observer variations from 7.6% to 13.9% in the EF ratio [18].
Automatic EF estimation techniques aid professionals by adding another layer of verification. Additionally, with the emergence of Point-of-Care Ultrasound (POCUS) imaging devices, which are routinely used by less experienced echo users, automation of clinical measurements such as EF is all the more needed [1]. However, to be adopted broadly, such automation techniques must be explainable so that cases requiring human intervention can be detected. Different machine learning (ML) architectures have been proposed to perform automatic EF estimation [12, 10, 18, 21, 23], most of which lack reliable explainability mechanisms. Some of these models do not provide confidence estimates for their predictions [18, 21, 23] or have low accuracy due to unrealistic data augmentation during training and over-reliance on ground truth labels [21].
In this work, we introduce EchoGNN, a novel deep learning model for explainable EF estimation. Our approach first infers a latent graph between frames of one or multiple echo cine series. It then estimates EF based on this latent graph via Graph Neural Networks (GNNs) [22], which are a class of deep learning models that efficiently capture graph data. To the best of our knowledge, our work is the first one that investigates GNNs in the context of ultrasound videos and EF estimation. Moreover, our work brings explainability through latent graph learning, inspiring further work in this domain. Our contributions are threefold:
- We introduce EchoGNN, a novel deep learning model for explainable EF estimation through GNN-based latent graph learning.
- We present a weakly-supervised training pipeline for EF estimation without direct reliance on ground truth ES/ED frame labels.
- Our model has a much lower number of parameters compared to prior work, significantly reducing computational and memory requirements.
2 Related Work
Most prior works use Convolutional Neural Networks (CNNs) in their EF estimation pipeline [12, 10, 18, 21]. Ouyang et al. [18] use ResNet-based (2+1)D convolutions [24] to estimate and average EF over all possible 32-frame clips in an echo, while Kazemi Esfeh et al. [12] use a similar approach in a Bayesian Neural Network (BNN) setting. Recent work uses the encoder of ResNetAE [8] to reduce data dimensionality before using transformers [25] to jointly perform ES/ED frame detection and EF estimation [21]. While these methods show different levels of accuracy and success in predicting EF, they either lack explainability or rely heavily on accurate clinical labels, which are inherently noisy and subject to significant inter-observer variability. As an example, the transformer-based approach requires ES/ED frame index labels in addition to EF labels in its training pipeline [21]. Moreover, while Kazemi Esfeh et al. [12] and Jafari et al. [10] report uncertainty over their predictions, they still lack explainable indicators as to why the models fail or succeed for different cases. Our proposed framework based on GNNs aims to alleviate these shortcomings: it provides explainability while relying only on EF labels, without requiring ES/ED frame labels for supervision. Lastly, as an added advantage, the number of parameters of our model is significantly smaller than in prior work, which is highly desirable for deploying such models on mobile clinical devices.
3 Methodology
We consider the following supervised problem for EF estimation: assume that for each patient $i$ in dataset $\mathcal{D}$, there is a ground-truth EF ratio $y_i$ and $K$ echo videos $\{X_i^k\}_{k=1}^{K}$, where $X_i^k \in \mathbb{R}^{T \times H \times W}$, $T$ is the number of frames, and $H$ and $W$ are the height and width of each frame. The goal of our model is to learn a function mapping $\{X_i^k\}_{k=1}^{K}$ to an EF estimate $\hat{y}_i$. For notational simplicity, and since our evaluation dataset only contains one video per patient, we assume that $K=1$. However, it must be noted that our model is flexible in this regard and can handle multiple videos per patient.
3.1 EchoGNN Architecture
As shown in Fig. 1, EchoGNN is composed of three main components: Video Encoder, Attention Encoder, and Graph Regressor. In the following subsections, we discuss the details pertaining to each component.

3.1.1 Video Encoder
The original echo videos are high-dimensional and must be mapped into lower-dimensional embeddings to reduce memory footprint and remove redundant information.
The Video Encoder is used to learn a mapping from the input echo video $X_i \in \mathbb{R}^{T \times H \times W}$ to $d$-dimensional frame embeddings $z_t \in \mathbb{R}^{d}$, where $t$ is the frame number. The temporal dimension is preserved because the Attention Encoder requires embeddings for all frames to produce interpretable weights over them. We use a custom network consisting of 3D convolutions and residual connections to exploit both the spatial and temporal information in the video when generating the embeddings. This network's architecture is provided in the supp. material. Lastly, following [25], periodic positional encodings are added to the generated frame embeddings to encode the sequential nature of video data.
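A minimal sketch of such an encoder in PyTorch (the block structure, kernel sizes, spatial pooling, and sinusoidal positional encoding below are illustrative assumptions; the exact architecture is given in the supp. material):

```python
import math
import torch
import torch.nn as nn

class ResBlock3D(nn.Module):
    """3D-convolutional block with a residual connection (illustrative)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch), nn.ELU(),
            nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm3d(out_ch))
        self.skip = nn.Conv3d(in_ch, out_ch, kernel_size=1)
        # Downsample only the spatial dimensions so the temporal length T is preserved.
        self.pool = nn.MaxPool3d(kernel_size=(1, 2, 2))

    def forward(self, x):                              # x: (B, C, T, H, W)
        return self.pool(nn.functional.elu(self.conv(x) + self.skip(x)))

class VideoEncoder(nn.Module):
    """Maps an echo clip (B, 1, T, H, W) to per-frame embeddings (B, T, d)."""
    def __init__(self, channels=(16, 32, 64, 128, 256), d=128, max_t=64):
        super().__init__()
        chans = (1,) + channels
        self.blocks = nn.Sequential(*[ResBlock3D(c_in, c_out)
                                      for c_in, c_out in zip(chans[:-1], chans[1:])])
        self.proj = nn.Linear(channels[-1], d)
        # Sinusoidal positional encodings over the temporal dimension [25].
        pos = torch.arange(max_t).unsqueeze(1)
        div = torch.exp(torch.arange(0, d, 2) * (-math.log(10000.0) / d))
        pe = torch.zeros(max_t, d)
        pe[:, 0::2], pe[:, 1::2] = torch.sin(pos * div), torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                              # x: (B, 1, T, H, W)
        h = self.blocks(x)                             # (B, C, T, H', W')
        h = h.mean(dim=(-2, -1)).transpose(1, 2)       # spatial pooling -> (B, T, C)
        z = self.proj(h)                               # (B, T, d)
        return z + self.pe[: z.size(1)]                # add positional encoding
```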
3.1.2 Attention Encoder
For each patient, we construct an echo-graph, which is a complete graph where each node corresponds to a frame in the echo video, and the edges show the non-Euclidean relationships between these frames. Formally, we denote the echo-graph by $\mathcal{G} = (V, E)$, where $V$ is the set of nodes corresponding to echo frames such that $|V| = T$, and $E$ is the set of edges between the nodes capturing the relationships between video frames, such that if $v_i, v_j \in V$ are connected, then $e_{(i,j)} \in E$. We use the frame embeddings from our Video Encoder as node features of $\mathcal{G}$. That is, $\{z_t\}_{t=1}^{T}$ are the features of the nodes in $V$. These embeddings can be represented as a matrix $Z_i \in \mathbb{R}^{T \times d}$ such that each row is the embedding of a frame in the echo video for patient $i$.
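For concreteness, a complete directed echo-graph over $T$ frame-nodes can be constructed as follows (a small sketch in the edge-index format used by PyTorch Geometric; dropping self-loops is our assumption):

```python
import torch

def build_echo_graph(T: int) -> torch.Tensor:
    """Edge index of a complete directed graph over T frame-nodes (no self-loops)."""
    src, dst = torch.meshgrid(torch.arange(T), torch.arange(T), indexing="ij")
    mask = src != dst                                   # drop self-loops
    edge_index = torch.stack([src[mask], dst[mask]])    # shape (2, T*(T-1))
    return edge_index

# Example: node features are the per-frame embeddings from the Video Encoder.
T, d = 64, 128
z = torch.randn(T, d)                                   # frame embeddings for one patient
edge_index = build_echo_graph(T)                        # every frame connected to every other
```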
Inspired by [14], we propose using GNNs to learn and assign weights to both edges and nodes of the echo-graph. The edge and node weights are learned to encode the importance of each frame (node weights) and the relationships among frames (edge weights) for the final EF estimation.
The Attention Encoder infers weights over edges and nodes of the echo-graph using message-passing GNNs [7]. A single message-passing step is enough for each node to capture information from all other nodes because the echo-graph is a complete graph. More specifically, the following operations are used to obtain the weight over each edge $(v_i, v_j)$:

$$h_j^{1} = f_{\mathrm{emb}}(z_j), \tag{1}$$
$$h_{(i,j)}^{1} = f_e^{1}\big(h_i^{1} \,\|\, h_j^{1}\big), \tag{2}$$
$$h_j^{2} = f_v^{1}\Big(\textstyle\sum_{i \neq j} h_{(i,j)}^{1}\Big), \tag{3}$$
$$a_{(i,j)} = \sigma\big(f_e^{2}\big(h_i^{2} \,\|\, h_j^{2}\big)\big), \tag{4}$$

where $\sigma$ is the Sigmoid function, $\|$ is the concatenation operator, $f_{\mathrm{emb}}$, $f_e^{1}$, $f_v^{1}$, and $f_e^{2}$ are MLPs, and $a_{(i,j)}$ is the inferred weight for the directed edge from $v_i$ to $v_j$. Similarly, a weight $a_j$ for each node $v_j$ is generated by inserting another MLP operation, $a_j = \sigma\big(f_v^{2}(h_j^{2})\big)$, after Eq. 3. All MLPs use two fully connected linear layers with ELU [4] activation and batch normalization.
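A dense, single-sample sketch of Eqs. (1)-(4) in PyTorch (the hidden sizes and the dense pairwise formulation are our assumptions; the paper's implementation uses message-passing GNN layers [7]):

```python
import torch
import torch.nn as nn

def mlp(d_in, d_hid, d_out):
    # Two fully connected layers, each followed by batch normalization and ELU.
    return nn.Sequential(nn.Linear(d_in, d_hid), nn.BatchNorm1d(d_hid), nn.ELU(),
                         nn.Linear(d_hid, d_out), nn.BatchNorm1d(d_out), nn.ELU())

class AttentionEncoder(nn.Module):
    """Infers edge and node weights over the complete echo-graph (Eqs. 1-4)."""
    def __init__(self, d=128, d_hid=128):
        super().__init__()
        self.f_emb = mlp(d, d_hid, d_hid)          # Eq. (1): frame embedding
        self.f_e1 = mlp(2 * d_hid, d_hid, d_hid)   # Eq. (2): node pair -> edge message
        self.f_v1 = mlp(d_hid, d_hid, d_hid)       # Eq. (3): aggregated messages -> node
        self.f_e2 = mlp(2 * d_hid, d_hid, 1)       # Eq. (4): node pair -> edge weight
        self.f_v2 = mlp(d_hid, d_hid, 1)           # node-weight head (inserted after Eq. 3)

    def forward(self, z):                          # z: (T, d) frame embeddings
        T = z.size(0)
        h1 = self.f_emb(z)                                               # (T, d_hid)
        pair = torch.cat([h1.unsqueeze(1).expand(T, T, -1),
                          h1.unsqueeze(0).expand(T, T, -1)], dim=-1)     # [h_i || h_j]
        h1_edge = self.f_e1(pair.reshape(T * T, -1)).reshape(T, T, -1)
        mask = (~torch.eye(T, dtype=torch.bool, device=z.device)).unsqueeze(-1)
        h2 = self.f_v1((h1_edge * mask).sum(dim=0))                      # sum over i != j
        pair2 = torch.cat([h2.unsqueeze(1).expand(T, T, -1),
                           h2.unsqueeze(0).expand(T, T, -1)], dim=-1)
        edge_w = torch.sigmoid(self.f_e2(pair2.reshape(T * T, -1))).reshape(T, T)
        node_w = torch.sigmoid(self.f_v2(h2)).squeeze(-1)                # (T,)
        return edge_w, node_w                      # weighted adjacency, frame weights
```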
3.1.3 Regressor
Our Regressor network uses GNN layers on the learned weighted echo-graph to perform EF estimation. Specifically, for each patient, the output of the Attention Encoder can be represented as a weighted adjacency matrix $A_i \in \mathbb{R}^{T \times T}$ and a node weight vector $a_i \in \mathbb{R}^{T}$. The Regressor uses $A_i$ to generate embeddings over frames of the echo video:

$$Z_i^{(l+1)} = f_{\mathrm{GNN}}^{(l)}\big(A_i, Z_i^{(l)}\big), \tag{5}$$

where $Z_i^{(l)}$ is the matrix of learned node embeddings at layer $l$, $Z_i^{(0)} = Z_i$ is the matrix of frame embeddings from the Video Encoder, and $f_{\mathrm{GNN}}^{(l)}$ is composed of a Graph Convolutional Network (GCN) layer [15] followed by batch normalization and ELU activation. To represent the whole graph with a single vector embedding, the node embeddings of the final layer $L$ are averaged using the frame weights generated by the Attention Encoder:

$$z_{\mathcal{G}_i} = \frac{1}{T}\sum_{t=1}^{T} a_{i,t}\, Z_{i,t}^{(L)}, \tag{6}$$

where $Z_{i,t}^{(L)}$ is the $t$-th row of $Z_i^{(L)}$, and $a_{i,t}$ is the $t$-th scalar weight in the frame weight vector. $z_{\mathcal{G}_i}$ is mapped into an EF estimate $\hat{y}_i$ using an MLP with two fully connected linear layers, ELU activation, and batch normalization.
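A sketch of the Regressor built on PyG's GCNConv (hidden sizes follow Sec. 4.2; the dense-to-sparse edge-weight handling and the single-graph batching are our assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class Regressor(nn.Module):
    """GCN layers over the weighted echo-graph, weighted pooling, and an MLP head."""
    def __init__(self, d=128, hidden=(128, 64, 32), mlp_hidden=16):
        super().__init__()
        dims = (d,) + hidden
        self.convs = nn.ModuleList(GCNConv(i, o) for i, o in zip(dims[:-1], dims[1:]))
        self.norms = nn.ModuleList(nn.BatchNorm1d(o) for o in dims[1:])
        # Final MLP head (the paper's batch normalization is omitted in this
        # single-graph sketch, where the pooled embedding is a batch of one).
        self.head = nn.Sequential(nn.Linear(hidden[-1], mlp_hidden), nn.ELU(),
                                  nn.Linear(mlp_hidden, 1))

    def forward(self, z, edge_index, edge_w, node_w):
        # z: (T, d), edge_index: (2, E), edge_w: (E,), node_w: (T,)
        h = z
        for conv, norm in zip(self.convs, self.norms):
            h = F.elu(norm(conv(h, edge_index, edge_weight=edge_w)))     # Eq. (5)
        g = (node_w.unsqueeze(-1) * h).mean(dim=0)                       # Eq. (6)
        return self.head(g).squeeze()                                    # scalar EF estimate

# The (T, T) edge-weight matrix from the Attention Encoder can be gathered into
# the sparse form expected here via: edge_w = edge_w_mat[edge_index[0], edge_index[1]].
```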
3.1.4 Learning Algorithm
The model is differentiable in an end-to-end manner. Therefore, we use gradient descent with the Mean-Absolute-Error (MAE) between predicted EF estimates and ground truth EF values as the optimization objective, computed as $\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$, where $N$ is the number of training samples.
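A minimal training step under this objective, assuming the sketched modules above (the composition and shapes are our assumptions, not the released implementation):

```python
import torch

def train_step(video_encoder, attention_encoder, regressor, optimizer, x, y, edge_index):
    """One gradient step with the MAE objective; x: (1, 1, T, H, W), y: scalar EF."""
    optimizer.zero_grad()
    z = video_encoder(x).squeeze(0)                       # (T, d) frame embeddings
    edge_w_mat, node_w = attention_encoder(z)             # weights over edges and nodes
    edge_w = edge_w_mat[edge_index[0], edge_index[1]]     # (E,) sparse edge weights
    y_hat = regressor(z, edge_index, edge_w, node_w)      # predicted EF
    loss = torch.abs(y_hat - y).mean()                    # Mean-Absolute-Error
    loss.backward()
    optimizer.step()
    return loss.item()
```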
4 Experiments
4.1 Dataset
We use the EchoNet-Dynamic public EF dataset consisting of 10,030 AP4 echo videos obtained between 2016 and 2018 at Stanford University Hospital. Each echo frame has a dimension of 112×112 pixels, and the dataset provides ESV, EDV, contour tracings of the LV, and EF ratios for each patient [18]. We use the provided splits in the dataset from mutually exclusive patients, including 7465 samples for training, 1288 samples for validation, and 1277 samples for testing. The data distribution in the training set is unbalanced, with only 12.7% of samples having an EF ratio below 40%. Clinically, however, such patients are the most critical to detect for timely intervention [3, 11].
Frame Sampling: To stay within reasonable memory requirements, we use a fixed number of frames per echo, denoted by $T$. During training, we uniformly sample an initial frame index $t_0$ in $[0, T_i - T]$, where $T_i$ is the total number of frames in echo video $i$. We then use $T$ consecutive frames starting from $t_0$. Following [18], we set $T$ to 64 and use zero padding in the temporal dimension when $T_i < T$. At test time, we extract multiple back-to-back clips, each containing $T = 64$ frames, with the first clip starting from index 0. We use zero padding in the temporal dimension if $T_i < T$ and overlap the last clip with the previous one if the last clip overshoots $T_i$. We independently estimate EF for each clip and report the average prediction.
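A sketch of this sampling scheme (the clip-extraction details at test time follow our reading of the text):

```python
import torch

def sample_training_clip(video: torch.Tensor, T: int = 64) -> torch.Tensor:
    """video: (T_i, H, W). Returns a T-frame clip, zero-padded if T_i < T."""
    T_i = video.shape[0]
    if T_i <= T:
        pad = torch.zeros(T - T_i, *video.shape[1:], dtype=video.dtype)
        return torch.cat([video, pad], dim=0)
    t0 = torch.randint(0, T_i - T + 1, (1,)).item()    # uniformly sampled initial index
    return video[t0:t0 + T]

def test_clips(video: torch.Tensor, T: int = 64) -> list:
    """Back-to-back T-frame clips; the last clip overlaps the previous one if needed."""
    T_i = video.shape[0]
    if T_i <= T:
        return [sample_training_clip(video, T)]         # single zero-padded clip
    starts = list(range(0, T_i - T + 1, T))
    if starts[-1] + T < T_i:                            # overshoot: overlap the last clip
        starts.append(T_i - T)
    return [video[s:s + T] for s in starts]
```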
Data Augmentation: Occasionally, AP4 echo is zoomed in on the LV region for certain clinical studies [5, 20]. To allow learning of this under-represented distribution, we augment our training set by using a fixed-size cropping window centered at the top of each frame and interpolating the result back to the original dimensions, which creates the desired zoom-in effect.
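A sketch of this zoom-in augmentation (the crop size is left as a parameter since the exact window size is not restated here):

```python
import torch
import torch.nn.functional as F

def zoom_in_lv(frames: torch.Tensor, crop_hw: int) -> torch.Tensor:
    """frames: (T, H, W). Crop a window anchored at the top of each frame, horizontally
    centered, and interpolate back to (H, W) to mimic a zoomed-in AP4 view."""
    T, H, W = frames.shape
    left = (W - crop_hw) // 2                            # horizontally centered
    crop = frames[:, :crop_hw, left:left + crop_hw]      # anchored at the top row
    crop = crop.unsqueeze(1)                             # (T, 1, h, w) for interpolate
    out = F.interpolate(crop, size=(H, W), mode="bilinear", align_corners=False)
    return out.squeeze(1)
```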
4.2 Implementation
The Video Encoder uses custom convolution blocks with 16, 32, 64, 128, and 256 channels. The Attention Encoder uses a hidden dimension of 128 for MLP layers, and the Regressor uses a 3-layer GNN with 128, 64, and 32 hidden dimensions followed by an MLP with a hidden dimension of 16. We use the Adam optimizer [13] with a learning rate of 1e-4, a batch size of 80, and 2500 training epochs. Our framework is implemented using PyTorch [19] and PyG [6], and training was performed on two Nvidia Titan V GPUs.
Pretraining: We use ES/ED index labels in a pretraining step to train the Video Encoder and the Attention Encoder to give higher weights to ES and ED frames.
Classification Loss: We bin the EF values into 4 ranges and use a cross-entropy loss encouraging the model to learn EF's clinical categories [3].
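A sketch of this auxiliary classification loss over binned EF values (the bin edges and the existence of a separate classification head producing logits are our assumptions, guided by clinical EF categories [3]):

```python
import torch
import torch.nn.functional as F

# Assumed clinical EF bin edges (percent): <40 (reduced), 40-55, 55-70 (normal), >70.
EF_BINS = torch.tensor([40.0, 55.0, 70.0])

def classification_loss(class_logits: torch.Tensor, y_ef: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over 4 EF ranges; class_logits: (B, 4), y_ef: (B,) in percent."""
    target = torch.bucketize(y_ef, EF_BINS)              # bin index in {0, 1, 2, 3}
    return F.cross_entropy(class_logits, target)
```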
4.3 Results and Discussion
4.3.1 Explainability
The key advantage of EchoGNN over prior work is the explainability it provides through the learned weights on the echo-graph. As shown in Fig. 2, the learned weights can indicate when human intervention is required. We observe two different scenarios: (1) the model learns the periodic nature of echo videos and assigns larger weights to frames and edges that are in between ES and ED phases before performing EF estimation. This means that the location of ES and ED can be approximated using these weights as illustrated in Fig. 2. (2) The model cannot detect the location of ES and ED frames and distributes weights more evenly. We see that in these cases, we have either an atypical zoomed-in AP4 echo or an echo where the LV is not entirely visible and is cropped. In such cases, an expert can evaluate the video and determine if new videos must be obtained. More explainability examples are provided in the supp. material.
To quantitatively measure the explainability of EchoGNN, for the cases where the model learns the periodic nature of the data (1173 samples out of 1277), we use the average frame distance (aFD) as in [21], computed as $\mathrm{aFD} = \frac{1}{N}\sum_{i=1}^{N} \lvert t_i - \hat{t}_i \rvert$, with $t_i$ and $\hat{t}_i$ being the true and approximated ES (or ED) frame indices, respectively, for sample $i$. As shown in Table 1, our model achieves better ED aFD and comparable ES aFD without using ground-truth ES/ED locations for training, whereas Reynaud et al. [21] use such supervision. This shows the explainability power of EchoGNN. aFD computation details are provided in the supp. material.
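The aFD metric reduces to a mean absolute index difference; a small sketch (how the ES/ED indices are read off the learned weights is described in the supp. material and is not reproduced here):

```python
import numpy as np

def afd(true_idx: np.ndarray, pred_idx: np.ndarray) -> float:
    """Average frame distance between true and approximated ES (or ED) frame indices."""
    return float(np.mean(np.abs(true_idx - pred_idx)))
```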

4.3.2 EF Estimation
To evaluate the error in predicted EF values, we use the Mean-Absolute-Error (MAE). Additionally, as a measure of the amount of explained variance in the data, we report the model's $R^2$ score. Moreover, we report the F1 score for the task of indicating whether EF values are lower than 40%, which is a strong indicator of heart failure [11].
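These metrics can be computed with scikit-learn, thresholding EF at 40% for the binary F1 (a small sketch):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score, f1_score

def ef_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """MAE and R^2 on EF values, plus F1 for detecting EF below 40%."""
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "R2": r2_score(y_true, y_pred),
        "F1_below_40": f1_score(y_true < 40.0, y_pred < 40.0),
    }
```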
As shown in Table 1, our model significantly outperforms [21] without direct supervision of ES and ED frame locations during training. Our model has similar predictive performance to [12] with a much lower number of parameters and the added benefit of explainability through the learned latent graph structures. EchoNet (AF) [18] requires large amounts of RAM due to sampling all 32-frame clips in a video, making us unable to train and evaluate the model. Because of this, we only report results from the original paper and cannot produce additional metrics such as the F1 score, which is not originally reported; hence we show this as N/A in Table 1. This model's weaker performance compared to our model shows the sensitivity of EchoNet (AF) to frame locations in a clip. Lastly, our model has a significantly lower number of parameters, making it desirable for deployment on mobile clinical devices. Our model's EF scatter plot and confusion matrix are provided in the supp. material.
Table 1: EF estimation and ES/ED detection results on the EchoNet-Dynamic test set.

| Model | R² | MAE | F1 (<40%) | ES aFD | ED aFD | #params (M) |
|---|---|---|---|---|---|---|
| EchoNet (AF) [18] | 0.4 | 7.35 | N/A | - | - | 31.5 |
| Transformer (R) [21] | 0.48 | 6.76 | 0.70 | 2.86 | 7.88 | 346.8 |
| Transformer (M) [21] | 0.52 | 5.95 | 0.55 | 3.35 | 7.17 | 346.8 |
| Bayesian [12] | 0.75 | 4.46 | 0.77 | - | - | 31.5 |
| EchoGNN (ours) | 0.76 | 4.45 | 0.78 | 4.15 | 3.68 | 1.7 |
4.4 Ablation Study
In Table 2, we see that the classification loss improves the model's performance for under-represented samples, while pretraining and data augmentation reduce EF error and increase the model's ability to represent the variance in the data.
Table 2: Ablation study of data augmentation (Aug.), classification loss (Class.), and pretraining.

| Aug. | Class. | Pretrain | R² | MAE | F1 (<40%) |
|---|---|---|---|---|---|
| ✓ | ✗ | ✗ | 0.75 | 4.48 | 0.76 |
| ✓ | ✓ | ✗ | 0.74 | 4.59 | 0.77 |
| ✓ | ✗ | ✓ | 0.75 | 4.47 | 0.73 |
| ✗ | ✓ | ✓ | 0.75 | 4.47 | 0.77 |
| ✓ | ✓ | ✓ | 0.76 | 4.45 | 0.78 |
5 Limitations
While our model outperforms prior works for EF estimation and also provides explainability, there are certain limitations that can be addressed in future work. Firstly, while the explainability provided over the frames and edges of the echo-graph allows identification of cases that need closer inspection, it does not reveal the regions within each frame that the model is uncertain about. We argue that an attention map over the pixels in each frame could further help with explainability. Secondly, creating a complete graph for long videos leads to a large memory cost. While this is not an issue for echo, where videos are relatively short, alternative graph construction methods should be considered for longer videos.
6 Conclusion
In this work, we introduce a deep learning model that provides the benefit of explainability via GNN-based latent graph learning. While we showcased the success of our framework for EF estimation, we argue that the same pipeline could be used for other datasets and problems, introducing a new paradigm for video processing and prediction tasks from clinical data and beyond.
Acknowledgements.
This research was supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canadian Institutes of Health Research (CIHR) and computational resources provided by Advanced Research Computing at the University of British Columbia.
References
- [1] Amaral, C., Ralston, D., Becker, T.: Prehospital point-of-care ultrasound: A transformative technology. SAGE Open Medicine 8, 205031212093270 (07 2020)
- [2] Bamira, D., Picard, M.: Imaging: Echocardiology—assessment of cardiac structure and function. In: Vasan, R.S., Sawyer, D.B. (eds.) Encyclopedia of Cardiovascular Research and Medicine, pp. 35–54. Elsevier, Oxford (2018)
- [3] Carroll, M.: Ejection fraction: Normal range, low range, and treatment (Nov 2021), https://www.healthline.com/health/ejection-fraction
- [4] Clevert, D.A., Unterthiner, T., Hochreiter, S.: Fast and accurate deep network learning by exponential linear units (elus). arXiv: Learning (2016)
- [5] Ferraioli, D., Santoro, G., Bellino, M., Citro, R.: Ventricular septal defect complicating inferior acute myocardial infarction: A case of percutaneous closure. Journal of Cardiovascular Echography 29, 17 (01 2019)
- [6] Fey, M., Lenssen, J.E.: Fast graph representation learning with PyTorch Geometric. In: ICLR Workshop on Representation Learning on Graphs and Manifolds (2019)
- [7] Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. CoRR abs/1704.01212 (2017)
- [8] Hou, B.: ResNetAE (2019), https://github.com/farrell236/ResNetAE
- [9] Huang, H., Nijjar, P., Misialek, J., Blaes, A., Derrico, N., Kazmirczak, F., Klem, I., Farzaneh-Far, A., Shenoy, C.: Accuracy of left ventricular ejection fraction by contemporary multiple gated acquisition scanning in patients with cancer: Comparison with cardiovascular magnetic resonance. Journal of Cardiovascular Magnetic Resonance 19 (12 2017)
- [10] Jafari, M.H., Woudenberg, N.V., Luong, C., Abolmaesumi, P., Tsang, T.: Deep bayesian image segmentation for a more robust ejection fraction estimation. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1264–1268 (2021)
- [11] Kalogeropoulos, A.P., Fonarow, G.C., Georgiopoulou, V., Burkman, G., Siwamogsatham, S., Patel, A., Li, S., Papadimitriou, L., Butler, J.: Characteristics and Outcomes of Adult Outpatients With Heart Failure and Improved or Recovered Ejection Fraction. JAMA Cardiology 1(5), 510–518 (08 2016)
- [12] Kazemi Esfeh, M.M., Luong, C., Behnami, D., Tsang, T., Abolmaesumi, P.: A deep bayesian video analysis framework: Towards a more robust estimation of ejection fraction. In: Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. pp. 582–590. Springer International Publishing (2020)
- [13] Kingma, D., Ba, J.: Adam: A method for stochastic optimization. International Conference on Learning Representations (12 2014)
- [14] Kipf, T., Fetaya, E., Wang, K.C., Welling, M., Zemel, R.: Neural relational inference for interacting systems (2018)
- [15] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
- [16] Lang, R.M., Badano, L.P., Mor-Avi, V., Afilalo, J., Armstrong, A., Ernande, L., Flachskampf, F.A., Foster, E., Goldstein, S.A., Kuznetsova, T., Lancellotti, P., Muraru, D., Picard, M.H., Rietzschel, E.R., Rudski, L., Spencer, K.T., Tsang, W., Voigt, J.U.: Recommendations for cardiac chamber quantification by echocardiography in adults: An update from the american society of echocardiography and the european association of cardiovascular imaging. Journal of the American Society of Echocardiography 28(1), 1–39.e14 (2015)
- [17] Loehr, L., Rosamond, W., Chang, P., Folsom, A., Chambless, L.: Heart failure incidence and survival (from the atherosclerosis risk in communities study). The American journal of cardiology 101, 1016–22 (04 2008)
- [18] Ouyang, D., He, B., Ghorbani, A., Yuan, N., Ebinger, J., Langlotz, C., Heidenreich, P., Harrington, R., Liang, D., Ashley, E., Zou, J.: Video-based ai for beat-to-beat assessment of cardiac function. Nature 580 (04 2020)
- [19] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019)
- [20] Patil, V., Patil, H.: Isolated non-compaction cardiomyopathy presented with ventricular tachycardia. Heart views : the official journal of the Gulf Heart Association 12, 74–8 (04 2011)
- [21] Reynaud, H., Vlontzos, A., Hou, B., Beqiri, A., Leeson, P., Kainz, B.: Ultrasound video transformers for cardiac ejection fraction estimation. In: de Bruijne, M., Cattin, P.C., Cotin, S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2021. pp. 495–505. Springer International Publishing (2021)
- [22] Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE transactions on neural networks 20(1), 61–80 (2008)
- [23] Smistad, E., Østvik, A., Salte, I.M., Melichova, D., Nguyen, T.M., Haugaa, K., Brunvand, H., Edvardsen, T., Leclerc, S., Bernard, O., Grenne, B., Løvstakken, L.: Real-time automatic ejection fraction and foreshortening detection using deep learning. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control 67(12), 2595–2604 (2020)
- [24] Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., Paluri, M.: A closer look at spatiotemporal convolutions for action recognition. CoRR abs/1711.11248 (2017)
- [25] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. CoRR abs/1706.03762 (2017)