Evaluating Feature Attribution Methods for Electrocardiogram
Abstract
The performance of cardiac arrhythmia detection with electrocardiograms (ECGs) has improved considerably since the introduction of deep learning models. In practice, however, high performance alone is not sufficient; a proper explanation is also required. Recently, researchers have started adopting feature attribution methods to address this requirement, but it has been unclear which of the methods are appropriate for ECG. In this work, we identify and customize three evaluation metrics for feature attribution methods based on the characteristics of ECG: localization score, pointing game, and degradation score. Using the three evaluation metrics, we evaluate and analyze eleven widely-used feature attribution methods. We find that some of the feature attribution methods are much more adequate for explaining ECG, with Grad-CAM outperforming the second-best method by a large margin.
Index Terms— ECG, deep learning, feature attribution methods, evaluation metrics, Grad-CAM
1 Introduction
For electrocardiograms (ECGs), deep learning models have achieved near cardiologist-level performance on arrhythmia detection and classification [1]. The models, however, are often considered a “black box” because their decision-making process is challenging to explain. In the image domain, feature attribution has become a popular technique for providing an explanation, as it highlights the relevant features in the input [2]. It is also believed to be useful in the ECG domain because a cardiac abnormality is usually declared based on a specific area of an ECG, and its usage has therefore been consistently increasing. However, there has been no clear analysis of which feature attribution methods are adequate for ECG [3].
In this study, we examine eleven feature attribution methods that are widely used in the image domain [2, 4] and determine which of them are suitable for ECG. We propose three evaluation metrics identified and customized by considering ECG-specific characteristics, and use them to evaluate the eleven attribution methods. Our main contributions are: (1) We propose three metrics for evaluating feature attribution methods on ECG: localization score, pointing game, and degradation score. (2) We evaluate the eleven feature attribution methods with the proposed metrics and provide a quantitative analysis. (3) We find that Grad-CAM is the most appropriate method for ECG data. Our experiments can be easily reproduced and transferred to other experimental settings because they are implemented with Captum [5], an open-source library for model interpretability, and because we use a publicly available ECG dataset. Our code is available at https://github.com/SNU-DRL/Attribution-ECG.
2 Background
2.1 Feature Attribution Methods
Feature attribution is a technique that measures the importance of each element of an input (e.g., each pixel of an image) for a given prediction of a model. Because feature attribution values can be easily visualized as a heatmap showing which pixels are important, the technique has become popular in the image domain [2]. Feature attribution methods can be categorized according to the mechanisms they use [2, 6]. Backpropagation-based methods use gradients of a model computed by the backpropagation process to measure the importance of each element in the input. This group includes Saliency [7], InputGradient [8], Guided Backprop [9], Integrated Gradients [10], DeepLIFT [8], DeepSHAP [11], and LRP [12]. Perturbation-based methods measure feature attributions by tracking the model outputs for perturbed inputs. This group includes LIME [13] and KernelSHAP [11]. Activation-based methods use activation maps on which the importance for the target class is computed. This group includes Grad-CAM [14]. We also classify Guided Grad-CAM [14], which combines Grad-CAM and Guided Backprop, as activation-based.
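Since our experiments are built on Captum [5], all eleven methods can be instantiated from its `captum.attr` module. The sketch below shows one possible setup, not the authors' exact code; the trained classifier `model` and its last convolutional block `last_conv` are hypothetical names.

```python
import torch
from captum.attr import (Saliency, InputXGradient, GuidedBackprop,
                         IntegratedGradients, DeepLift, DeepLiftShap, LRP,
                         Lime, KernelShap, LayerGradCam, GuidedGradCam,
                         LayerAttribution)

def build_attributors(model: torch.nn.Module, last_conv: torch.nn.Module):
    """Map each of the eleven method names to its Captum attributor."""
    return {
        "Saliency": Saliency(model),
        "InputGradient": InputXGradient(model),
        "Guided Backprop": GuidedBackprop(model),
        "Integrated Gradients": IntegratedGradients(model),
        "DeepLIFT": DeepLift(model),
        "DeepSHAP": DeepLiftShap(model),
        "LRP": LRP(model),
        "LIME": Lime(model),
        "KernelSHAP": KernelShap(model),
        "Grad-CAM": LayerGradCam(model, last_conv),
        "Guided Grad-CAM": GuidedGradCam(model, last_conv),
    }

# Grad-CAM attributes at the resolution of `last_conv`; LayerAttribution
# can upsample the result back to the input length for comparison, e.g.:
# attr = build_attributors(model, last_conv)["Grad-CAM"].attribute(x, target=y)
# attr = LayerAttribution.interpolate(attr, x.shape[-1])
```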
2.2 Evaluation Metrics for Feature Attribution Methods
A series of studies have proposed evaluation metrics to analyze and compare feature attribution methods in the context of computer vision research [2]. Evaluation metrics examine whether each attribution method explains a model’s prediction with high fidelity. They fall mainly into two groups: localization-based metrics and perturbation-based metrics.
Localization-based metrics [7, 15, 16, 17, 18, 19] assess how successfully a feature attribution method localizes the input’s class-discriminative features. Evaluating feature attribution methods with localization-based metrics requires a dataset with additional ground-truth information on which area of an input is closely related to the label. For this reason, researchers in the image domain have conducted experiments with special datasets containing object bounding boxes or have constructed image grids [6]. As localization-based metrics utilize human-made labels, they reflect human knowledge about where a model should attend.
Perturbation-based metrics [16, 20, 21] assess feature attribution methods by perturbing an input and measuring how much the prediction of a model is affected. If an attribution method is effective, its heatmap should match the areas that are sensitive to the perturbation. An advantage of perturbation-based metrics is that they do not require additional ground-truth information. Therefore, perturbation-based metrics are applicable to a wider variety of datasets than localization-based metrics.
2.3 Evaluating Feature Attribution Methods for ECG
Feature attribution methods, and the evaluation metrics for them, have been developed mainly in computer vision research. The results of [2] imply that the performance of attribution methods should be measured for each specific use case. Thus, to apply feature attribution methods to the ECG domain, such methods should be verified using ECG data and tasks. A limited number of studies have focused on evaluating feature attribution methods using ECG data [3, 4, 22]. However, the evaluation metrics used in these studies are designed for general time-series data, based on shapelets [3] or on perturbation [3, 4, 22]. Moreover, the number of feature attribution methods studied in the three studies is at most six.
A sound evaluation can be achieved by concurrently evaluating additional feature attribution methods with customized metrics that reflect ECG-specific characteristics. In this study, we propose two localization-based metrics that utilize the pseudo-periodic structure of an ECG signal and one perturbation-based metric that jointly considers the performance drop under the increasing and decreasing orders of attribution values. We compare eleven feature attribution methods with our proposed evaluation metrics, roughly twice as many methods as in the previous studies.
[Figure 1: Overview of the proposed evaluation metrics.]
3 Proposed Metrics
We propose three metrics to evaluate feature attribution methods for ECG. An overview of the proposed evaluation metrics is shown in Figure 1. The localization-based metrics, Localization Score and Pointing Game, need extra information on where the true cardiac abnormalities lie in an example. To implement these metrics, we propose using an ECG dataset with labels for every beat. Since we mostly focus on detecting abnormal beats among normal beats, we can regard the label of an example as the label of its abnormal beats if the example contains any, as shown in Figure 2. Because we know where the abnormal beats are, we can apply localization-based metrics to the ECG domain; this is analogous to image datasets with object bounding boxes being used for localization-based evaluation in the image domain. On the other hand, the perturbation-based metric, Degradation Score, can be applied to an ECG dataset with any type of labels, whether for a beat or for a specific span of ECG.
[Figure 2: An example’s label regarded as the label of its abnormal beats.]
3.1 Localization Score
Localization Score is a metric that measures the overlap between the area with high attribution values and the ground-truth area [7, 15, 16]. Considering the characteristics of ECG data, we calculate the localization score as follows. If the abnormal beats of an example contain $n$ data points, we select the top-$n$ data points with the highest attribution values. Then we calculate the Intersection-over-Union (IoU) between the selected data points and the data points in the abnormal beats. The IoU between two sets $A$ and $B$ is defined as:

$$\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (1)$$
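As a concrete reference, here is a minimal NumPy sketch of this computation under our assumptions: `attr` holds one attribution value per data point, and `abnormal_mask` is a boolean array marking the points inside the ground-truth abnormal beats (both hypothetical names).

```python
import numpy as np

def localization_score(attr: np.ndarray, abnormal_mask: np.ndarray) -> float:
    """Top-n IoU between the highest-attribution points and abnormal-beat points."""
    n = int(abnormal_mask.sum())             # n = number of points in abnormal beats
    selected = np.zeros_like(abnormal_mask, dtype=bool)
    selected[np.argsort(attr)[-n:]] = True   # top-n attribution points
    intersection = np.logical_and(selected, abnormal_mask).sum()
    union = np.logical_or(selected, abnormal_mask).sum()
    return float(intersection / union)       # Eq. (1)
```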
3.2 Pointing Game
Pointing Game is a metric that measures whether the data point with the maximum attribution value correctly points to the ground-truth area [17, 18, 19]. It differs from the localization score in that the pointing game does not require highlighting the full extent of the target and only focuses on the maximum attribution point [17]. We measure the pointing game accuracy as follows: if the data point with the highest attribution value falls on one of the abnormal beats, it counts as a hit; otherwise, it counts as a miss. Then we calculate the accuracy as described in [17]:
$$\mathrm{Acc} = \frac{\#\mathrm{Hits}}{\#\mathrm{Hits} + \#\mathrm{Misses}} \qquad (2)$$
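A corresponding sketch, reusing the hypothetical `attr`/`abnormal_mask` representation from above across a set of examples:

```python
import numpy as np

def pointing_game_accuracy(attrs, abnormal_masks) -> float:
    """Fraction of examples whose max-attribution point hits an abnormal beat."""
    hits = sum(bool(mask[np.argmax(attr)])       # does the argmax land on a hit?
               for attr, mask in zip(attrs, abnormal_masks))
    return hits / len(attrs)                     # Eq. (2): hits / (hits + misses)
```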
3.3 Degradation Score
Degradation Score is a metric, suggested by [16], that examines how a model’s prediction degrades under cumulative perturbation of the original input. Perturbations can be applied to small windows of the input in order of decreasing attribution values (Most Relevant removed First, MoRF) [20] or of increasing attribution values (Least Relevant removed First, LeRF) [23]. The degradation score combines both perturbation processes. We scale the model’s output probability for the correct class so that the original example maps to one and the fully perturbed example maps to zero. Then we plot the MoRF and LeRF curves, as shown in the lower right of Figure 1, and compute the integral between the two curves. As the perturbation, we use a mean operation: we fill a selected window with the mean value of that window. We choose the mean operation because it erases detailed wave information while preserving the typical shape of an ECG signal.
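The sketch below, under the same assumptions as before plus a PyTorch classifier `model` taking a `(1, 1, L)` tensor, illustrates one way to compute this score with non-overlapping windows; dropping the trailing remainder of signals not divisible by the window size is our own choice, not specified in the text.

```python
import numpy as np
import torch

@torch.no_grad()
def degradation_score(model, x, target, attr, window=16):
    """Integral between the LeRF and MoRF curves of normalized probabilities."""
    n_win = x.shape[-1] // window                 # drop any trailing remainder
    win_attr = attr[: n_win * window].reshape(n_win, window).sum(axis=1)
    morf_order = np.argsort(win_attr)[::-1]       # MoRF: most relevant first

    def curve(order):
        probs, xp = [], x.clone()
        for w in order:                           # cumulative perturbation
            s = slice(w * window, (w + 1) * window)
            xp[..., s] = xp[..., s].mean()        # mean-fill the window
            probs.append(torch.softmax(model(xp), dim=-1)[0, target].item())
        return np.array(probs)

    p0 = torch.softmax(model(x), dim=-1)[0, target].item()   # original prob -> 1
    morf, lerf = curve(morf_order), curve(morf_order[::-1])
    p_end = morf[-1]                              # fully perturbed input -> 0
    return float(np.mean((lerf - morf) / (p0 - p_end)))
```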
4 Experiments
Table 1: Beat classification accuracy on the Icentia11K test set.

| Model | Accuracy |
|---|---|
| ConvResNet (48 layers) [1, 24] | 0.85 ± 0.03 |
| Basic ConvNet (5 layers) [24] | 0.80 ± 0.01 |
| 1D ResNet-18 (ours) | 0.90 ± 0.05 |
Table 2: Evaluation results of the eleven feature attribution methods with the three proposed metrics.

| Category | Method | Localization Score | Pointing Game | Degradation Score | Average |
|---|---|---|---|---|---|
| — | Random (baseline) | 0.1087 | 0.1857 | 0.0037 | 0.0094 |
| Backpropagation-based | Saliency∘ [7] | 0.1899 | 0.3646 | 0.3460 | 0.3002 |
| | InputGradient∘ [8] | 0.2178 | 0.5735 | 0.8751 | 0.5555 |
| | Guided Backprop∘ [9] | 0.1899 | 0.3646 | 0.3460 | 0.3002 |
| | Integrated Gradients∘ [10] | 0.2236 | 0.6837 | 0.8852 | 0.5825 |
| | DeepLIFT∘ [8] | 0.2193 | 0.5901 | 0.8791 | 0.5628 |
| | DeepSHAP∘ [11] | 0.2134 | 0.5557 | 0.8827 | 0.5506 |
| | LRP∘ [12] | 0.1375 | 0.2250 | 0.3844 | 0.2490 |
| Perturbation-based | LIME∙ [13] | 0.1214 | 0.3398 | 0.4230 | 0.2497 |
| | KernelSHAP∙ [11] | 0.1155 | 0.3977 | 0.2367 | 0.2500 |
| Activation-based | Grad-CAM∙ [14] | 0.4848 | 0.7048 | 0.8876 | 0.6924 |
| | Guided Grad-CAM∘ [14] | 0.3148 | 0.5710 | 0.6943 | 0.5267 |
4.1 Experimental Setup
Dataset. We experimented with the Icentia11K dataset [24, 25]. Icentia11K is one of the largest public ECG datasets, containing 11 thousand patients and 2 billion labeled beats from a single-lead ECG device. We mostly followed the experimental settings of the original Icentia11K beat classification task [24]. The goal of the task is to predict whether an example consists of all normal beats or contains at least one premature atrial contraction (PAC) or premature ventricular contraction (PVC). Each ECG signal has 2049 data points, corresponding to about 8 seconds at the dataset’s sampling rate of 250 Hz. We use the R-peak locations provided in the dataset to extract beats, taking the midpoint of two adjacent R-peaks as the boundary between consecutive beats, as done in [26].
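A minimal sketch of this beat extraction, assuming `r_peaks` is a sorted array of R-peak indices within one 2049-point signal:

```python
import numpy as np

def beat_boundaries(r_peaks: np.ndarray, signal_len: int = 2049):
    """Split a signal into beats; midpoints of adjacent R-peaks are boundaries."""
    mids = (r_peaks[:-1] + r_peaks[1:]) // 2
    starts = np.concatenate(([0], mids))          # first beat starts at index 0
    ends = np.concatenate((mids, [signal_len]))   # last beat ends at the signal end
    return list(zip(starts, ends))                # one (start, end) pair per beat
```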
Training and Evaluation. The training and test sets each contain 6000 examples: 2000 for each of the normal, PAC, and PVC classes. When evaluating feature attribution methods with the proposed metrics, we only used test examples correctly predicted as PAC or PVC with over 90% probability. We use only PAC and PVC examples because the localization-based metrics cannot be measured on an example labeled as normal: it consists of all normal beats, so all beats are true localization targets. We use only correctly predicted examples with high probability to ensure that a low evaluation result is attributable to the attribution method rather than to the classification model’s poor performance [19]. The number of examples used to compute the evaluation metrics is about 3272 on average.
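The following sketch illustrates this filtering step; the class-index convention (`0` for normal, `1` for PAC, `2` for PVC) is our assumption for illustration.

```python
import torch

@torch.no_grad()
def select_eval_examples(model, x_test, y_test, abnormal_classes=(1, 2)):
    """Keep examples correctly classified as PAC/PVC with > 90% probability."""
    probs = torch.softmax(model(x_test), dim=-1)
    conf, pred = probs.max(dim=-1)
    is_abnormal = torch.isin(y_test, torch.tensor(abnormal_classes))
    keep = (pred == y_test) & (conf > 0.9) & is_abnormal
    return x_test[keep], y_test[keep]
```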
Implementation Details. Since the original ResNet [27] was designed for 2D images, we build a modified 18-layer ResNet with 1D kernels suited to ECG signals. Each ECG signal of 2049 points is standardized to zero mean and unit standard deviation. The model is trained with the cross-entropy loss and the Adam optimizer; we set the learning rate to 5e-4, the batch size to 128, the weight decay to 1e-7, and the number of training epochs to 50. The classification accuracy of our model on the test set is reported in Table 1 and exceeds that of previous models evaluated on the same dataset. For the degradation score, we set the perturbation window size to 16. All results are averaged over 5 repeated experiments with different dataset splits and model initializations.
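For concreteness, a sketch of this training setup with the stated hyperparameters; `resnet18_1d` and `train_loader` are hypothetical stand-ins for the 1D model constructor and data pipeline, not the authors' code.

```python
import torch

model = resnet18_1d(num_classes=3)    # hypothetical 1D ResNet-18: normal/PAC/PVC
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=1e-7)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(50):
    for x, y in train_loader:         # hypothetical loader, batch size 128
        # per-signal standardization: zero mean, unit standard deviation
        x = (x - x.mean(dim=-1, keepdim=True)) / x.std(dim=-1, keepdim=True)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```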
4.2 Evaluation Result and Discussion
The results are shown in Table 2. Among the eleven attribution methods, Grad-CAM achieves the best performance across all three evaluation metrics. For the localization score, the two activation-based methods, including Grad-CAM, significantly outperform all of the backpropagation-based and perturbation-based methods. For the pointing game, only Integrated Gradients performs comparably with Grad-CAM. For the degradation score, four of the backpropagation-based methods perform comparably with Grad-CAM. Overall, the average score of Grad-CAM is outstanding: it is 18% better than the second-best method, Integrated Gradients. A visualization is provided in Figure 3, which shows the best method from each of the three categories.
An extensive evaluation of feature attribution methods in the image domain can be found in [2]. Compared to the image domain, the ECG domain turns out to be distinct in a few ways. For instance, the two perturbation-based methods, LIME and KernelSHAP, perform reasonably well in the image domain but poorly in the ECG domain. DeepSHAP, which is known to achieve performance competitive with Grad-CAM on ImageNet [2], also performs poorly in the ECG domain except on the degradation score. Grad-CAM, however, turns out to perform well in both domains. As confirmed in [2], different attribution metrics seem to measure different underlying concepts. For the ECG domain, we conclude that Grad-CAM is an appropriate feature attribution method, capturing underlying concepts that are strongly relevant to cardiac arrhythmia detection.
[Figure 3: Attribution visualizations from the best method of each category.]
5 Conclusion
We have identified and customized three evaluation metrics for ECG and assessed eleven feature attribution methods using them. The experimental results show that Grad-CAM outperforms the other methods, including ones known to perform well in the image domain. A possible future work is to design a new attribution method that specifically utilizes the characteristics of ECG. A limitation of our work is that we have considered only PAC and PVC with the Icentia11K dataset.
References
- [1] Pranav Rajpurkar, Awni Y Hannun, Masoumeh Haghpanahi, Codie Bourn, and Andrew Y Ng, “Cardiologist-level arrhythmia detection with convolutional neural networks,” arXiv preprint arXiv:1707.01836, 2017.
- [2] Arne Gevaert, Axel-Jan Rousseau, Thijs Becker, Dirk Valkenborg, Tijl De Bie, and Yvan Saeys, “Evaluating feature attribution methods in the image domain,” arXiv preprint arXiv:2202.12270, 2022.
- [3] Inês Neves, Duarte Folgado, Sara Santos, Marília Barandas, Andrea Campagner, Luca Ronzio, Federico Cabitza, and Hugo Gamboa, “Interpretable heartbeat classification using local model-agnostic explanations on ECGs,” Computers in Biology and Medicine, vol. 133, pp. 104393, 2021.
- [4] Udo Schlegel, Hiba Arnout, Mennatallah El-Assady, Daniela Oelke, and Daniel A Keim, “Towards a rigorous evaluation of XAI methods on time series,” in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW). IEEE, 2019, pp. 4197–4201.
- [5] Narine Kokhlikyan, Vivek Miglani, Miguel Martin, Edward Wang, Bilal Alsallakh, Jonathan Reynolds, Alexander Melnikov, Natalia Kliushkina, Carlos Araya, Siqi Yan, and Orion Reblitz-Richardson, “Captum: A unified and generic model interpretability library for PyTorch,” arXiv preprint arXiv:2009.07896, 2020.
- [6] Sukrut Rao, Moritz Böhle, and Bernt Schiele, “Towards better understanding attribution methods,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 10223–10232.
- [7] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman, “Deep inside convolutional networks: Visualising image classification models and saliency maps,” arXiv preprint arXiv:1312.6034, 2013.
- [8] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje, “Learning important features through propagating activation differences,” in International conference on machine learning. PMLR, 2017, pp. 3145–3153.
- [9] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
- [10] Mukund Sundararajan, Ankur Taly, and Qiqi Yan, “Axiomatic attribution for deep networks,” in International conference on machine learning. PMLR, 2017, pp. 3319–3328.
- [11] Scott M Lundberg and Su-In Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017.
- [12] Sebastian Bach, Alexander Binder, Grégoire Montavon, Frederick Klauschen, Klaus-Robert Müller, and Wojciech Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PloS one, vol. 10, no. 7, pp. e0130140, 2015.
- [13] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin, ““Why should I trust you?”: Explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 2016, pp. 1135–1144.
- [14] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 618–626.
- [15] Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et al., “Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 2956–2964.
- [16] Karl Schulz, Leon Sixt, Federico Tombari, and Tim Landgraf, “Restricting the flow: Information bottlenecks for attribution,” in International Conference on Learning Representations, 2020.
- [17] Jianming Zhang, Sarah Adel Bargal, Zhe Lin, Jonathan Brandt, Xiaohui Shen, and Stan Sclaroff, “Top-down neural attention by excitation backprop,” International Journal of Computer Vision, vol. 126, no. 10, pp. 1084–1102, 2018.
- [18] Ruth C Fong and Andrea Vedaldi, “Interpretable explanations of black boxes by meaningful perturbation,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 3429–3437.
- [19] Moritz Bohle, Mario Fritz, and Bernt Schiele, “Convolutional dynamic alignment networks for interpretable classifications,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10029–10038.
- [20] Wojciech Samek, Alexander Binder, Grégoire Montavon, Sebastian Lapuschkin, and Klaus-Robert Müller, “Evaluating the visualization of what a deep neural network has learned,” IEEE transactions on neural networks and learning systems, vol. 28, no. 11, pp. 2660–2673, 2016.
- [21] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross, “Towards better understanding of gradient-based attribution methods for deep neural networks,” in International Conference on Learning Representations, 2018.
- [22] Hugues Turbé, Mina Bjelogrlic, Christian Lovis, and Gianmarco Mengaldo, “Interprettime: a new approach for the systematic evaluation of neural-network interpretability in time series classification,” arXiv preprint arXiv:2202.05656, 2022.
- [23] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross, “A unified view of gradient-based attribution methods for deep neural networks,” NIPS 2017 Workshop on Interpreting, Explaining and Visualizing Deep Learning, 2017.
- [24] Shawn Tan, Guillaume Androz, Ahmad Chamseddine, Pierre Fecteau, Aaron Courville, Yoshua Bengio, and Joseph Paul Cohen, “Icentia11k: An unsupervised representation learning dataset for arrhythmia subtype discovery,” in 2021 Computing in Cardiology (CinC), 2021.
- [25] Shawn Tan, Guillaume Androz, Ahmad Chamseddine, Pierre Fecteau, Aaron Courville, Yoshua Bengio, and Joseph Paul Cohen, “Icentia11k single lead continuous raw electrocardiogram dataset (version 1.0),” 2022, PhysioNet. https://doi.org/10.13026/kk0v-r952.
- [26] Yola Jones, Fani Deligianni, and Jeff Dalton, “Improving ECG classification interpretability using saliency maps,” in 2020 IEEE 20th International Conference on Bioinformatics and Bioengineering (BIBE). IEEE, 2020, pp. 675–682.
- [27] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.