
Decorrelative Network Architecture for Robust Electrocardiogram Classification

Christopher Wiedeman1, Ge Wang2 (Correspondence: [email protected])

1 Department of Electrical and Computer Systems Engineering, Rensselaer Polytechnic Institute ([email protected])
2 Department of Biomedical Engineering, Rensselaer Polytechnic Institute ([email protected])
Abstract

Artificial intelligence has made great progress in medical data analysis, but the lack of robustness and trustworthiness has kept these methods from being widely deployed. As it is not possible to train networks that are accurate in all scenarios, models must recognize situations where they cannot operate confidently. Bayesian deep learning methods sample the model parameter space to estimate uncertainty, but these parameters are often subject to the same vulnerabilities, which can be exploited by adversarial attacks. We propose a novel ensemble approach based on feature decorrelation and Fourier partitioning for teaching networks diverse complementary features, reducing the chance of perturbation-based fooling. We test our approach on single and multi-channel electrocardiogram classification, and adapt adversarial training and DVERGE into the Bayesian ensemble framework for comparison. Our results indicate that the combination of decorrelation and Fourier partitioning generally maintains performance on unperturbed data while demonstrating superior robustness and uncertainty estimation on projected gradient descent and smooth adversarial attacks of various magnitudes. Furthermore, our approach does not require expensive optimization with adversarial samples, adding much less compute to the training process than adversarial training or DVERGE. These methods can be applied to other tasks for more robust and trustworthy models.

Keywords: Deep Learning, Adversarial Attacks, Bayesian Neural Networks, Electrocardiogram Classification

Introduction

The exponential increase of high-dimensional patient datasets and constant demand for personalized healthcare justify the urgent need for artificial intelligence (AI) in medicine. As an excellent example, electrocardiograms (ECG) are commonly used for inpatient monitoring of cardiac conditions, and are now available in smart or implantable devices. While the proper application of continuous ECG monitoring requires further clinical investigation, large scale collection and analysis of ECG, in either inpatient or outpatient populations, has the potential to improve healthcare by monitoring for signs of heart problems or alerting medical services to emergency situations. Deeper analysis of numerous samples is necessary to extract more healthcare-relevant information hidden in these signals. Big data can be leveraged in this instance, but it is infeasible for human clinicians to individually analyze all these recordings, making AI a natural solution to this problem [1, 2, 3].

For this purpose, many researchers applied deep learning to ECG classification. The 2017 PhysioNet Challenge is a milestone in this field, where deep neural networks (DNNs) were trained to classify atrial fibrillation from single-lead ECG [4]. Similarly, the 2018 China Physiological Signal Challenge (CPSC 2018) observed classification of several rhythm abnormalities from 12-lead ECG [5]. Top-scoring models can often achieve high classification accuracies on test data, but their interpretability and robustness are major concerns [6]. Chief among these concerns are adversarial attacks, which have been demonstrated both in machine learning broadly and specific healthcare tasks.

Adversarial Attacks: Background and Characteristics

Adversarial attacks are small input perturbations that do not change the semantic content yet cause massive errors in a network output; for example, imperceptible noise patterns that, when added to an image, cause a model to misclassify the image [7]. Given a target model and input, projected gradient descent (PGD) is the most common algorithm for finding adversarial perturbations under an \ell_{\infty} bound [8, 9]. Various other algorithms exist, including the use of generative adversarial ensembles [10, 11, 12]. Impressively, universal adversarial perturbations can be crafted to fool a network when added to any sample [13, 14].

The understanding of adversarial attacks has rapidly developed over the past several years. Akhtar and Mian wrote a broad survey on adversarial attacks in computer vision [15]. Although adversarial instability may relate to overfitting, DNNs often generalize well to unseen data yet fail on previously seen data that are only slightly altered [16]. Furthermore, it has been shown that linear models and other machine learning methods are also vulnerable to adversarial attacks [17]. Early research attributed this phenomenon to lack of data in high-dimensional problems, which leaves large portions of the total ‘data-manifold’ unstable [18, 19]. The literature also reports a relationship between large local Lipschitz constants (with regard to the loss function) and adversarial instability [20, 21, 22]. To our knowledge, the most unifying explanation is the robust features model, where it is shown that data distributions often exhibit statistical patterns that are meaningless to humans but correlate well with different classes [23]. From a human’s perspective, these patterns are arbitrary and easily perturbed, but since models are trained only to maximize distributional accuracy, they have no reason to prioritize human-favored features over these patterns.

Training models for defending against adversarial attacks remains an open problem, affecting nearly every application of machine learning. Early attempts at defense methods by obfuscating the loss gradient were found to beat only weak attackers, proving ineffective against sophisticated attackers [24, 25, 26, 27, 28]. Adversarial training, in which a model is iteratively trained on strong adversarial samples, has shown the best results in terms of adversarial robustness [8, 29]. However, the network size and computational time required are considerable even for small problems, and improving adversarial robustness appears to sacrifice performance on clean data [30]. Satisfactory performance on larger problems has not been achieved due to these limitations.

Another troubling, well-documented characteristic of these attacks is their transferability: models trained on the same task will often be fooled by the same attacks, despite having different parameters [7, 17, 31]. This phenomenon is largely congruent with the robust features model, since these models are likely learning the same useful, but non-robust features. Nevertheless, transferability makes black-box attacks viable, where a malicious attacker does not need detailed knowledge of the target model.

Han et al. have shown that models trained for ECG classification are concerningly susceptible to natural-looking adversarial attacks [32]. In short, the authors observed that traditional \ell_{\infty}-bounded PGD attacks produce square-wave artifacts that are not physiologically plausible in ECG; to rectify this, the perturbation space was modified by applying Gaussian smoothing kernels in the attack objective, rendering plausible yet still highly effective adversarial samples.

Electrocardiography Background

ECG is a front-line, noninvasive tool for monitoring heart health. Skin electrodes measure electrical signals originating from the heart; the behavior of these signals over time corresponds to various events during the cardiac cycle. A healthy rhythm consists of a P wave, QRS complex, and T wave, which correspond to atrial depolarization, ventricular depolarization, and ventricular repolarization, respectively. Clinical ECGs have traditionally used 12 or 5-7 (Holter) leads, but single-channel ECGs have become more prevalent for continuous monitoring [33]. These devices are either external or implantable loop recorders (ILRs) designed to record for multiple years.

A variety of downstream analytical tasks are associated with ECG, including biometric identification, respiratory estimation, emotional monitoring, and even fetal heartbeat monitoring. However, the most common application is detection of various arrhythmias, which can indicate disorders or disease risk [33]. Atrial fibrillation (AFib), or an abnormally rapid atrial firing rate, is commonly assessed (such as in the 2017 PhysioNet Challenge [4]), but many arrhythmic classes exist, including left or right bundle branch block, premature atrial or ventricular contraction, ventricular fibrillation, tachycardia, and myocardial infarction (heart attack) itself. Certain classes require immediate and serious medical intervention while others, such as AFib, are not immediately harmful but could still indicate risk of disease. The clinical utility of AFib detection is still an active area of investigation: a systematic review found AFib to be associated with increased risk of myocardial infarction in patients without coronary heart disease and increased risk of all-cause mortality and heart failure in all patients, implying value in detecting it [34]. On the other hand, a randomized control trial using ILR in patients with at least one risk factor for stroke concluded that continuously monitoring for AFib in this population did not reduce risk of stroke [35]. This suggests that detecting AFib early may not provide additional information for managing stroke in patients that are already known to be at risk of the event. Nevertheless, ECG and its subsequent analysis is widely applied and investigated for its implications on patient cardiovascular health.

Clinician review is often required for ECG analysis; this process is resource-consuming, especially in the case of continuous monitoring or large patient samples. Thus, automatic classification of these signals is desirable, but simple rule-based classifications often fail to generalize due to data heterogeneity between patients and the non-stationary nature of the signal within patients. As such, researchers have turned to data-driven techniques and machine learning to build ECG classification models. Convolutional neural networks (CNNs) have been the most dominant architecture for ECG arrhythmia classification, but deep belief networks, recurrent neural networks, long short-term memory, and gated recurrent units have all been investigated for the same task [36]. For extensive background, Merdjanovska et al. provide a comprehensive review of applications, public datasets, and deep learning research for ECG, and Ebrahimi et al. further survey common deep learning architectures for ECG [33, 36].

Uncertainty Estimation in Healthcare Applications

As misdiagnosis in healthcare contexts can cause serious harm, the standard of trust required for AI to operate in this space is high. Rather than replacing clinicians, we envision AI tools augmenting clinical workflows by monitoring inputs over a large population, flagging alarming or low-confidence instances for human observation. Figure 1 broadly illustrates this scenario, where AI could allow a few experts to analyze ECG signals from a large patient population, continuously or transiently monitored. To achieve this synergy, models must be capable of gauging their own confidence, recognizing conditions where they can and cannot perform well [37]. Bayesian deep learning (BDL) is a promising field that models the parameters of a DNN as a distribution rather than a point estimate; sampling this distribution at inference time then allows one to estimate model certainty in an inference [38, 39]. Approaches for approximating and sampling the parameter distribution, including variational inference and Markov Chain Monte Carlo with Hamiltonian Dynamics, are often difficult to scale to large spaces [40, 41]. One simple approach is to train an ensemble of networks for the same task, with each network acting as a sample of the parameter space [42]. However, this approach does not guarantee robustness: adversarial attacks in particular are known to transfer between different models because these models (even with vastly different parameters) often learn the same unstable features. Furthermore, in high dimensional problems with large parameter spaces, training, storing, and running inferences from numerous models quickly becomes infeasible. As such, the goal in training such ensembles should be to achieve adequate robustness and feature diversity with a small number of models.

Figure 1: Example of proposed AI augmented clinical workflow for monitoring ECG signals in a patient population. Data are first processed by a deep learning model, which infers a class for each signal (e.g., healthy or diseased) and judges the confidence of each inference. Signals classified with low confidence are reviewed by human experts.

Motivation: Diversifying Features in Deep Ensembles

To our knowledge, prior work on adversarial robustness has primarily quantified either white-box accuracy or black-box transferability, but has not evaluated uncertainty via BDL. Furthermore, works in this field primarily test methods on lower-dimensional datasets, such as MNIST or CIFAR10. Our goal is to efficiently train small but diverse deep ensembles capable of gauging uncertainty in worst-case scenarios, i.e., adversarial attacks. We contextualize this in the aforementioned ECG classification, a problem that is much higher in dimension. Adversarial training as a means of diversifying an ensemble is explored, and we also introduce two novel diversification methods that do not require adversarial sample computations, adding almost no overhead to the regular training process.

According to the robust features model, simply training networks with different parameters in isolation does not achieve adversarial robustness, as networks trained under the same conditions tend to converge toward the same learned features and vulnerabilities [23, 43]. As such, rather than achieving diversity in the parameter space, we turn the conversation to diversity in the feature space. A mechanism for incentivizing networks to learn different features is necessary. To this end, Yang, et al. conceived DVERGE, which diversifies the learned features and adversarial weaknesses in a classification ensemble [44]. However, this method requires full or partial computations of adversarial samples and round-robin style training of networks, which adds considerable compute. In this work, we experiment with ensemble diversification methods that are based on adversarial gradients and other methods that are agnostic to these calculations.

Adapting Adversarial Training for Ensembles

Adversarial training is the best known defense against adversarial attacks, and essentially consists of training a network on adversarial samples [8]. Unfortunately, adversarial training is computationally expensive and reduces accuracy on natural data [30].

Moreover, conventional adversarial training does not detect attacks or quantify uncertainty; rather, it attempts to make a single network more robust to attacks. It also only achieves satisfactory performance on small problems by using large networks to fit more complex decision boundaries [8]. As such, we adapt adversarially trained ensembles, in which individual networks are adversarially trained for additional time after natural training. Consequently, we are able to study adversarial training in the Bayesian ensemble framework and observe how vulnerabilities may be diversified between models.

Alternative Feature Diversification Methods

We propose two distinct methods for diversifying learned features, and test these against the adversarial ECG attacks in [32]. The first method, linear feature decorrelation, is based on previous work [43], which not only found a strong linear correlation between the latent spaces of networks trained on the same task, but also found that adding a loss term to reduce this linear correlation greatly decreases the transferability of adversarial attacks. However, the decorrelation process proposed in that work is expensive, as it requires large batch sizes and parallel training of networks. We modify this decorrelation process to make it scalable to larger problems. The second method, which we refer to as Fourier partitioning, is heuristically simpler, employing linear time-invariant filters to partition the input space by frequency, forcing networks to learn features in different frequency bands. This method is inspired by recently discovered connections between the Fourier space and adversarial vulnerability, which not only demonstrated that neural networks can make accurate inferences by relying only on low or high-frequency characteristics but also that most robustifying training methods only shift a network’s sensitivity to different frequency bands [30]. As such, we find that a crude but efficient way to teach networks different features is to partition the original inputs by frequency, feeding data in different bands to different networks and integrating their outputs via ensemble learning.

Results

Overview

Two ECG datasets are used for all experiments: the 2017 PhysioNet challenge data (single-channel, four classes) [4] and the 2018 China Physiological Signal Challenge (CPSC) data (twelve-channel, nine classes) [5]. The following ensemble training strategies were tested:

  • baseline: Conventionally trained ensemble, where each model is identically and independently trained.

  • dec: Ensemble trained with the proposed linear feature decorrelation to diversify the model features.

  • part: Ensemble trained using the proposed Fourier partitioning scheme.

  • adv: A baseline ensemble that undergoes additional ensemble adversarial training.

  • dec+part: Ensemble that employs both linear feature decorrelation and Fourier partitioning.

  • dec+adv: A decorrelated ensemble that undergoes additional ensemble adversarial training.

  • dverge: A baseline ensemble that undergoes additional DVERGE training.

Ensembles produce multiple inferences, which can be processed in various ways to gauge epistemic and aleatoric uncertainty [45, 46]. Here, we adopt a normalized uncertainty approach from [47], which calculates a normalized uncertainty measure I_{norm} based on the mutual information between the sample and the model parameters (see Methods: Ensemble Training and Inference).

We test each ensemble using validation data perturbed by both PGD and physiologically feasible SAP attacks of varying magnitude \varepsilon, targeting the first model in each ensemble [32]. Figure 2 displays several example attacks from the PhysioNet 2017 data along with inferences, probability, and uncertainty values outputted by the baseline, dec, part, and dec+part ensembles.

Figure 2: Sample results from the PhysioNet 2017 data. Left: Examples of projected gradient descent (PGD) and smooth adversarial perturbations (SAP) in the ECG dataset. Right: correct (green) or incorrect (red) aggregate inferences of each ensemble network (normal rhythm, atrial fibrillation, other rhythm, or noise) along with the inferred class probability P and normalized uncertainty score I_{norm}.

Notably, we found that implementing the decorrelation step only slightly increased training time: conventional training took about 352 and 358 minutes per model on average for PhysioNet and CPSC, respectively; decorrelation only added about 10 minutes on average to this time in both cases (see Methods: Experimental Details for training parameters). Fourier partitioning did not noticeably increase training time.

Uncertainty and Accuracy Performance

To classify an ensemble inference as either certain or uncertain, a threshold I_T \in [0,1] can be applied to differentiate certain (I_{norm} \leq I_T) and uncertain (I_{norm} > I_T) predictions. A robust model is generally correct when it is certain and uncertain when it is incorrect. Thus, in addition to inference accuracy, we also adopt the following three evaluation metrics from [47]:

  • Correct-certain ratio R_{cc}(I_T) = P_{I_T}(correct\,|\,certain): Probability the model inference is correct when it is certain.

  • Incorrect-uncertain ratio R_{iu}(I_T) = P_{I_T}(uncertain\,|\,incorrect): Probability the model is uncertain when it is incorrect.

  • Uncertainty accuracy UA(I_T) = P_{I_T}((correct \cap certain) \cup (incorrect \cap uncertain)): Probability of a desired outcome (either correct and certain or incorrect and uncertain).

All three measures depend on the variable uncertainty threshold I_T. Thus, similar to a binary classifier, the overall efficacy can be found by integrating the measure as a function of I_T \in [0,1] (i.e., finding the area under the curve, where larger values are more desirable).
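As an illustration only, the following sketch (assuming NumPy arrays of per-sample correctness flags and normalized uncertainty scores; the helper name is ours) estimates the three areas under the curve by sweeping the threshold I_T:

import numpy as np

def uncertainty_aucs(correct, i_norm, n_thresholds=101):
    # correct: boolean array of per-sample ensemble correctness
    # i_norm:  array of normalized uncertainty scores for the same samples
    thresholds = np.linspace(0.0, 1.0, n_thresholds)
    r_cc, r_iu, ua = [], [], []
    incorrect = ~correct
    for i_t in thresholds:
        certain = i_norm <= i_t
        # R_cc: P(correct | certain); defined as 0 if no sample is certain
        r_cc.append(correct[certain].mean() if certain.any() else 0.0)
        # R_iu: P(uncertain | incorrect); defined as 0 if every sample is correct
        r_iu.append((~certain)[incorrect].mean() if incorrect.any() else 0.0)
        # UA: P((correct and certain) or (incorrect and uncertain))
        ua.append(((correct & certain) | (incorrect & ~certain)).mean())
    # Areas under each curve over I_T in [0, 1] (larger is better)
    return (np.trapz(r_cc, thresholds),
            np.trapz(r_iu, thresholds),
            np.trapz(ua, thresholds))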

Table 1 summarizes the average prediction accuracy and areas under the curve (AUCs) for the correct-certain ratio, incorrect-uncertain ratio, and uncertainty accuracy for natural, PGD, and SAP adversarial datasets for the PhysioNet 2017 experiments. Similarly, Table 2 reports the same metrics for the CPSC 2018 experiments. Some initial observations from these numbers are as follows: 1) For the PhysioNet 2017 experiments, dec, part, and dec+part generally have comparable or superior metrics to the baseline on natural (\varepsilon=0) samples, but achieve better performance on stronger \varepsilon=50, 75, 100 PGD and SAP attacks. Adversarial training (adv) improves performance on all adversarial attacks, but the combination of dec+adv generally leads to better performance in these instances, while dverge does not seem to improve on the baseline in this instance. 2) On the CPSC 2018 experiments, adv and dec+adv achieve improved robustness on perturbed data but sacrifice considerable accuracy on natural samples. dverge experiences a mild reduction in natural performance for a moderate performance increase on perturbed data, although this advantage diminishes with higher magnitude attacks. The combination of dec+part provides better performance on perturbed data relative to the baseline without sacrificing performance on natural samples.

Attack Strength \varepsilon (PGD)    Attack Strength \varepsilon (SAP)
0 10 50 75 100 10 50 75 100
Accuracy (%)
baseline 83.82 72.45 10.90 5.63 4.10 74.91 16.18 8.32 5.39
dec 84.41 72.57 26.61 13.95 9.50 74.21 27.78 14.77 7.50
part 86.75 70.46 32.12 23.56 19.81 72.33 38.45 25.67 19.58
dec+part 85.58 73.86 54.51 44.55 36.34 75.38 57.80 46.66 36.58
adv 83.35 79.72 58.15 32.59 15.01 80.42 66.71 45.25 21.45
dverge 83.59 69.52 10.67 5.39 2.81 71.75 14.30 7.27 4.34
dec+adv 80.07 77.02 60.14 43.96 26.73 77.96 64.83 50.29 33.53
R_{cc} (% AUC)
baseline 85.46 74.17 4.77 2.00 1.38 76.63 8.23 3.17 1.88
dec 86.18 76.74 16.22 7.29 4.95 78.35 17.63 7.08 3.34
part 88.18 73.10 18.64 13.40 11.23 75.80 23.20 15.05 11.76
dec+part 87.64 77.50 27.76 22.72 18.76 79.29 30.02 23.89 19.09
adv 85.08 81.72 61.40 34.90 13.74 82.42 69.24 47.09 21.57
dverge 85.00 69.55 4.03 2.18 1.25 71.84 5.45 2.69 1.84
dec+adv 81.20 78.33 63.27 47.63 26.72 79.19 67.55 54.31 35.65
R_{iu} (% AUC)
baseline 13.77 19.57 34.20 37.69 35.76 18.52 37.45 44.23 44.91
dec 15.91 29.62 44.57 41.51 38.71 28.87 46.06 42.41 39.02
part 15.27 28.56 46.64 46.18 43.85 28.53 50.02 51.57 51.87
dec+part 19.57 29.69 42.53 49.03 51.06 30.19 43.36 50.87 53.58
adv 16.97 18.65 24.37 22.07 16.17 18.57 24.49 27.19 20.53
dverge 12.68 13.85 15.25 14.73 15.99 12.88 15.97 19.59 22.44
dec+adv 10.58 11.60 16.81 21.80 23.39 11.47 16.12 21.35 23.74
UA (% AUC)
baseline 81.65 67.27 33.73 36.94 35.32 69.92 36.56 42.51 43.68
dec 81.31 68.22 44.01 41.44 39.01 70.33 45.08 41.70 38.87
part 82.74 62.36 45.81 45.62 44.10 65.75 47.35 49.19 50.23
dec+part 81.07 66.66 40.92 44.42 46.80 68.69 40.78 45.06 48.20
adv 78.04 74.60 57.12 40.86 24.92 75.42 61.34 47.03 31.54
dverge 80.35 64.00 17.32 15.99 16.70 65.99 18.60 20.61 23.11
dec+adv 77.43 75.19 61.68 49.73 37.68 75.93 65.22 54.52 42.60
Table 1: Performance metrics on PGD and SAP adversarial samples of varying magnitudes on the PhysioNet 2017 dataset. R_{cc}: correct-certain ratio, R_{iu}: incorrect-uncertain ratio, and UA: uncertainty accuracy.
Attack Strength \varepsilon (PGD)    Attack Strength \varepsilon (SAP)
0 .025 .05 .15 .175 .025 .05 .15 .175
Accuracy (%)
baseline 80.28 41.47 23.94 3.60 3.13 41.00 23.16 2.82 1.88
dec 79.66 38.50 21.91 2.03 1.41 40.22 20.50 0.78 0.16
part 79.81 35.68 16.28 3.44 2.82 37.25 9.86 0.94 1.41
dec+part 80.28 44.76 31.77 12.68 8.45 46.79 33.33 18.78 17.53
dverge 78.40 47.26 30.99 12.36 9.08 48.83 31.92 10.80 5.63
adv 58.37 49.61 37.25 16.90 13.93 50.86 38.03 15.49 11.58
dec+adv 58.06 46.48 37.40 11.89 9.08 47.10 38.03 14.40 9.08
R_{cc} (% AUC)
baseline 84.92 33.09 7.16 0.49 0.36 32.13 6.16 0.18 0.15
dec 83.90 33.64 6.85 0.55 0.32 31.96 5.96 0.17 0.05
part 84.13 48.71 9.68 1.30 0.85 47.46 4.51 0.58 0.51
dec+part 85.54 53.50 13.81 4.84 2.66 51.85 12.32 6.89 5.28
dverge 82.60 39.64 9.82 2.02 1.03 40.37 8.75 1.55 0.38
adv 70.98 56.13 31.22 9.21 7.19 57.57 33.19 7.50 5.81
dec+adv 65.68 54.28 42.92 8.89 5.28 54.79 43.81 7.11 4.70
R_{iu} (% AUC)
baseline 30.01 32.03 32.89 27.75 26.73 34.41 38.46 42.40 45.67
dec 28.02 34.52 34.35 30.18 32.05 36.94 37.30 35.10 42.43
part 31.47 48.33 57.04 52.14 50.30 48.08 60.47 70.94 72.94
dec+part 35.65 50.75 61.70 61.71 61.59 52.28 63.92 68.91 71.33
dverge 26.13 30.93 32.98 35.74 35.46 31.40 37.26 39.37 38.64
adv 59.11 58.11 58.15 60.71 61.17 58.00 58.34 59.84 63.95
dec+adv 29.77 28.97 27.53 30.45 32.70 28.86 27.91 41.75 41.07
UA (% AUC)
baseline 77.91 41.90 30.44 27.20 26.23 42.24 33.89 41.37 44.94
dec 75.82 43.38 32.81 30.08 31.90 42.49 34.99 34.97 42.41
part 70.20 53.04 54.28 51.48 49.64 52.07 58.03 70.58 72.25
dec+part 71.54 52.30 51.32 57.48 58.52 51.21 51.82 60.38 62.21
dverge 76.31 42.88 30.22 33.02 33.12 42.92 31.80 36.43 36.79
adv 51.38 50.74 51.80 56.61 57.50 50.58 51.88 55.88 60.51
dec+adv 60.84 54.51 47.87 34.03 34.26 54.72 48.58 41.82 41.28
Table 2: Performance metrics on PGD and SAP adversarial samples of varying magnitudes on the CPSC 2018 dataset. R_{cc}: correct-certain ratio, R_{iu}: incorrect-uncertain ratio, and UA: uncertainty accuracy.

Uncertainty Difference Between Incorrect and Correct Predictions

Since uncertainty I_{norm} should be a useful discriminative feature for distinguishing correct and incorrect predictions, it is desirable that incorrect predictions, on average, output higher uncertainty than correct predictions. Thus, we simply define the average difference in uncertainty between incorrectly and correctly classified samples:

\Delta I_{norm} = \mathbb{E}[I_{norm}\,|\,incorrect] - \mathbb{E}[I_{norm}\,|\,correct]

Figure 3 compares this \Delta I_{norm} for all experiments and ensembles as a function of attack strength \varepsilon. In all instances, \Delta I_{norm} decreases sharply for the baseline ensemble as perturbation strength increases. dverge follows this same trend on the PhysioNet data but maintains better robustness on the CPSC data. part, dec, dec+part, and dec+adv generally maintain higher \Delta I_{norm} on perturbed data in both cases.

Figure 3: Difference in average normalized uncertainty between incorrect and correct samples (higher is better) on PGD (left) and SAP (right) adversarial samples with respect to attack magnitude \varepsilon. Top: PhysioNet 2017. Bottom: CPSC 2018.

Performance on Partially Attacked Dataset

In a clinical setting, deep models should be used to augment clinical workflows, as shown in Figure 1. When the amount of data needing analysis outstrips the available time of qualified clinicians, deep ensembles can initially assess all inputs and defer the most uncertain samples to human readers. This begs the question: if a deep ensemble is budgeted a certain number of cases that it can refer to human experts, then how many cases would still be misclassified? To investigate this, we run the following hypothetical experiment: 1) A dataset of all natural samples and a partially perturbed dataset are drawn. In the partially perturbed dataset, 50% of the data are unperturbed, and 25%, 15%, and 10% of the data have \varepsilon = 10, 50, 75 (PhysioNet) and \varepsilon = 0.025, 0.05, 0.15 (CPSC) perturbations, respectively. 2) Each ensemble evaluates all samples, ordering the inferences from most to least confident. 3) Starting with the most confident samples, varying fractions of cases defer to the deep model’s classification, while all other (less confident) samples are assumed to be correctly classified, presumably after review by clinicians. Note that this experiment is not meant to exactly reflect the actions and metrics of a true clinical workflow; rather, it is to investigate and compare the potential benefit of the aforementioned deep ensemble methods in augmenting human workers, particularly in the face of adversarially corrupted data.

Figures 4 and 5 plot the percentage of misclassified instances in the sample as a function of the percentage of cases referred to the deep ensemble for the natural and partially perturbed datasets, respectively. Additionally, the total areas under the curve are shown, with lower values being better in this instance. Initially, one can see that adv and dec+adv perform poorly on the natural datasets, particularly the CPSC 2018 data. dec, part, and dec+part perform better than the baseline on the partially perturbed datasets while still performing comparably or better than the baseline on the natural PhysioNet 2017 dataset.

Figure 4: % of misclassified cases with respect to % of cases deferred to the deep ensemble in the natural only dataset experiments for PhysioNet 2017 (left) and CPSC 2018 (right). Numbers are AUCs (lower is better).
Figure 5: % of misclassified cases with respect to % of cases deferred to the deep ensemble in the partially attacked dataset experiments for PhysioNet 2017 (left) and CPSC 2018 (right). Numbers are AUCs (lower is better).

Discussions

Table 1 reflects the accuracy and uncertainty scores of tested ensembles on the PhysioNet 2017 data. Overall, it can be seen that decorrelation and partitioning do not negatively impact natural accuracy in this instance: in fact, overall ensemble accuracy, R_{cc}, and R_{iu} are marginally higher on natural data for dec, part, and dec+part compared to baseline. Furthermore, all ensembles, including the baseline, are more robust to small magnitude perturbations: [32] reported that their \varepsilon=10 SAP attacks fooled a single network 74% of the time, but our results show only a 26.09% error rate for the baseline ensemble under these conditions. This indicates that in this instance, the network size is large enough relative to the input data dimension to reduce the adversarial transferability in lower magnitude attacks, even without diversifying measures [8]. However, this initial robustness plummets in the face of more challenging, larger magnitude attacks, as seen in Table 1. The use of decorrelation and partitioning both seem to have positive effects on the ensemble robustness against higher magnitude attacks, with dec+part outperforming even the more expensive adversarially trained ensemble on accuracy and R_{iu} for \varepsilon=75, 100 PGD and SAP attacks. UA in particular is highest with the combination of decorrelation and adversarial training on all but the highest magnitude adversarial attacks. These observations all suggest improved network diversification with both feature decorrelation and Fourier partitioning.

Table 1 also indicates little to no benefit from DVERGE training on the PhysioNet 2017 data, as dverge has slightly reduced natural accuracy compared to baseline while showing no general improvement on any metrics for any attack strength. We theorize that this lack of improvement may be due to the low number (4) of discrete classes in the problem. DVERGE creates new samples by distilling the non-robust features of one randomly drawn sample onto another [44]. However, if these two samples belong to the same class (which is more likely to occur when there are few classes or class imbalances), then the features distilled from one sample may already be similar to the other, negating any benefit. To support this explanation, DVERGE shows some robustness benefit on the CPSC 2018 data (Table 2), which has 9 classes. Consequently, the number of classes and class balance should be considered when implementing DVERGE.

One can see that while adversarial training minimally affects natural performance on the PhysioNet data, natural performance is greatly degraded for both adv and dec+adv on the CPSC data in Table 2. This degradation has been observed for adversarial training, and is likely due to the network’s limited capacity, which forces a tradeoff between natural discriminative features and adversarially robust features [30, 23]. Indeed, as the dimension of the input space increases, as is the case for the higher dimensional CPSC data, much larger networks and more training (both adversarial and natural) are needed to fit the robust but complex decision boundaries [8], rendering adversarial training less feasible for high dimensional problems. Decorrelation alone also provides less benefit in this instance; we suspect that this is due to the increase in classes. As the number of classes increases, the final feature layer, where decorrelation takes place, must increase in dimension such that feature vectors extracted from different networks can be uncorrelated while still correlating with the correct class. As such, architectural changes may be needed to optimize the decorrelation mechanism in this instance. Despite this, unlike adversarial training, the combination of dec+part still provides increased accuracy against all PGD and SAP attacks without any degradation in natural accuracy, even outperforming dverge in most cases. Additionally, the UA, R_{iu}, and R_{cc} for dec+part are superior to the baseline in all adversarial instances, and even outperform adv in some instances. This suggests that dec+part can add adversarial robustness without sacrificing natural accuracy, even in higher dimensional problems.

While the metrics in Tables 1 and 2 overall indicate robustness benefits for linear feature decorrelation and Fourier partitioning, these results are admittedly heterogeneous. It is clear that, especially in higher dimensional data, there is a strong tradeoff between natural performance and adversarial robustness. Furthermore, an ensemble’s inference accuracy is not always correlated with the utility of its uncertainty estimation: some ensembles may achieve better overall accuracy but worse UA, R_{iu}, and/or R_{cc}. It is important to evaluate BNNs in a manner similar to how they would be deployed in order to observe potential tradeoffs between these qualities. This is the motivation behind the mixed dataset experiments (Figures 4 and 5), which initially explore how well the uncertainty measures of different ensembles can prioritize clinician attention. Figure 4 shows the worst performance on natural data with adversarially trained networks, reflecting the loss of natural performance with adversarial training. On the other hand, dec, part, and dec+part all perform comparably or better than the baseline and all other methods on the PhysioNet data; dec and dec+part also perform only marginally worse than the baseline on the natural CPSC 2018 data while showing superior performance on both perturbed datasets. Indeed, Figure 5 shows dec, part, and dec+part outperforming both dverge and the baseline on the PhysioNet data in this scenario, and dec+part achieves the best overall performance (smallest AUC) on the perturbed CPSC data.

Additionally, Figure 3 illustrates that part, dec+part, and dec+adv are the methods that best maintain a higher uncertainty difference \Delta I_{norm} across varying attack magnitudes for both datasets. Interestingly, while adv and dec individually fail to maintain this uncertainty difference against stronger attacks on the CPSC data, the combination dec+adv performs better in this metric. Once again, we suspect that feature decorrelation may benefit from expansion of the feature space as the number of classes increases.

Our results generally indicate that the proposed modified linear feature decorrelation and Fourier partitioning methods show promise for diversifying extracted features in deep ensembles, and can be used in other high-dimensional classification problems, such as medical image analysis. Using the fast Fourier transform, Fourier partitioning is an efficient way to force ensemble models to extract different features. Previous work mentioned several challenges with scaling decorrelated ensembles to larger problems, such as the need to train models in parallel and the large batch size needed to overdetermine the feature space with each training step [43]. We find that our modifications, such as selecting the final hidden layer for decorrelation and compressing the feature space with random projections, have allowed us to scale this mechanism to higher dimensions (see details in the Methods section).

Furthermore, we found that both methods added little to no extra training time, and penalized natural accuracy less than adversarial training. It should also be noted that both methods are orthogonal to gradient-based methods such as adversarial training and DVERGE, and thus can be combined with these methods.

A number of limitations exist for this study, which can be explored in future work:

  • Optimization of the design space: Fourier partitioning and decorrelation introduce new hyperparameters for tuning. For decorrelation, the compression ratio for the features, the dimension of the feature space, and the batch size are all critical considerations. As previously discussed, we believe that expanding the feature space may be necessary as the number of discriminative classes increases. Fourier partitioning was inspired by the discovery that neural networks can often solve computer vision tasks with only partial frequency information [30]. Thus, each ensemble filter should be designed to preserve sufficient information for the task but have non-overlapping vulnerabilities. Our experiments simply used two ‘ring filters’ which summed to an impulse response, but many other schemes could be explored in the same spirit. Additionally, these methods can be used in combination with DVERGE or adversarial training. Future experiments should extensively explore all these considerations, as well as introduce new applications.

  • Investigation of robustness against various attacks, corruptions, and shifts: This work focuses on adversarial attacks in ECG of varying magnitude, using both PGD and the more domain-specific SAP algorithms. However, robust uncertainty quantification is desirable in many other contexts, such as domain-specific noise corruptions, data domain shifts, and out-of-sample detection. Since neither decorrelation nor Fourier partitioning explicitly optimize against adversarial attacks, we hypothesize that their diversification benefits may extend to these other contexts.

  • Experiments mimicking clinical deployment: The tradeoffs between a model’s inference accuracy and uncertainty accuracy make it necessary to test how deep models can work in synergy with clinicians. Our experiments compare the number of misclassified samples when uncertainty is used to theoretically prioritize clinician attention. In reality, however, clinical workflows are more complicated, and desirable outcomes will depend on the cost of false positives/negatives for different diagnoses, clinical resources, and the ability of human clinicians to more accurately diagnose certain diseases relative to a model. Thus, all these aspects should be considered in future translational work. Furthermore, we only tested a natural dataset and a partially attacked dataset where the distribution of attack magnitudes was roughly based on the assumption that larger perturbation attacks are less common. Models should be rigorously tested specifically with the kinds and magnitudes of perturbations that one might expect in deployment.

Conclusion

Efficient and accurate confidence measurement is necessary for trust in AI systems. We have presented a novel approach for diverse network ensembles using two unique training methods which add little to no training time: a streamlined and accelerated decorrelation training strategy and a Fourier partitioning scheme. These ensembles achieve robustness by focusing on feature diversity between models. Additionally, we adapt adversarial training to ensembles, and test all methods along with DVERGE in the Bayesian ensemble framework. All approaches are applied to ECG classification with uncertainty estimation, and tested for stability against state-of-the-art adversarial ECG attacks, demonstrating their merits and potential in solving large problems. Incorrect diagnoses can cause major harm in many healthcare tasks where AI can work alongside clinicians; predicting model confidence in these contexts is crucial. Thus, we speculate that diverse ensembles will play a key role in elevating trustworthiness and confidence in AI for applications such as tomographic image processing, radiomics, and multimodal diagnosis. We see applications of this approach for robust uncertainty estimation with a diversified ensemble, which discourages different models from extracting redundant features.

Methods

Ensemble Training and Inference

Basic ensembles consist of multiple deep neural networks trained for the same task. An ensemble inference on sample x is simply the average output of each model in the ensemble. Thus, for an ensemble with models f_1, f_2, \ldots, f_K:

\hat{y} = f_{ens}(x) = \frac{1}{K}\sum_{k=1}^{K} f_k(x)

Note that for a classification task, this output is a discrete probability distribution.

For estimating epistemic uncertainty, we adopt the approach from [47], which defines the uncertainty of sample x as the mutual information between the inferred label \hat{y} and the underlying parameter distribution; in other words, it measures how much additional information sample x tells us about the true parameters:

I(y, \theta|x, \mathcal{D}) = H(y|x, \mathcal{D}) - H(y|x, \theta, \mathcal{D}) = H(y|x, \mathcal{D}) - \mathbb{E}_{\theta|\mathcal{D}}[H(y|x, \theta)]

\mathcal{D} is the training data. The first term is intractable, but can be estimated using the network ensemble as the entropy of the expected inference [47]. Thus, for an ensemble with models f_1, f_2, \ldots, f_K, each of which outputs a discrete probability distribution over C classes:

I(y, \theta|x, \mathcal{D}) = -\sum_{c=1}^{C} f_{ens}(x)[c]\,\text{log}\,f_{ens}(x)[c] + \frac{1}{K}\sum_{k=1}^{K}\sum_{c=1}^{C} f_k(x)[c]\,\text{log}\,f_k(x)[c]

The scale of I is relative and can vary between models and ensembles. Thus, we normalize the uncertainty with the minimum and maximum uncertainty values found during training.

I_{norm} = \frac{I - I_{min}}{I_{max} - I_{min}}

Note that test samples can have greater or less uncertainty than any sample encountered in the training set. Thus, values for I_{norm} are not necessarily limited to [0,1]. For a threshold I_T which classifies samples as either ’certain’ or ’uncertain’, the metrics R_{cc}, R_{iu}, and UA were calculated empirically over an adversarial dataset based on the number of correct & certain, correct & uncertain, incorrect & certain, and incorrect & uncertain samples.
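For concreteness, a minimal PyTorch sketch of this inference and uncertainty computation is given below, assuming each ensemble member returns class probabilities and that I_min and I_max were recorded on the training set; the function name and interface are illustrative:

import torch

def ensemble_inference(models, x, i_min, i_max, eps=1e-12):
    # models: list of K trained classifiers, each returning class probabilities
    with torch.no_grad():
        probs = torch.stack([m(x) for m in models])          # (K, N, C)
    p_ens = probs.mean(dim=0)                                 # ensemble prediction
    # H(y | x, D): entropy of the averaged prediction
    h_ens = -(p_ens * (p_ens + eps).log()).sum(dim=-1)
    # E_theta[H(y | x, theta)]: mean entropy of the individual members
    h_mem = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)
    i = h_ens - h_mem                                          # mutual information
    i_norm = (i - i_min) / (i_max - i_min)                     # normalize with training-set extrema
    return p_ens, i_norm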

For the baseline ensemble, each model was trained independently in sequence. Other ensembles were trained using the methods described below.

Decorrelation training

Decorrelation of Two Networks

The intent of decorrelation training between two networks is to minimize the Pearson correlation coefficient between the latent features extracted by the two networks, as explained in [43]. Minimizing this value incentivizes the networks to extract different features for a task, reducing the transferability of network weaknesses [43, 23]. Assume two classification models f_1 and f_2, and a sample batch X of inputs with accompanying labels Y. Next, Z_i = f_i^l(X) for i = 1, 2 are the latent features extracted by model i at layer l from batch X. Note that Z_i \in \mathbb{R}^{N \times D}, where N is the batch size and D is the dimension of the latent space. A linear relationship estimating Z_2 from Z_1 with weights W can be found using ordinary least squares regression:

\underset{W}{\operatorname{minimize}}\;\; ||Z_2 - \mathbf{Z_1}W||^2_2, \quad \text{where}\;\; \mathbf{Z_1} = [Z_1, \mathbf{1}]

\underset{W}{\operatorname{min}}\, ||Z_2 - \mathbf{Z_1}W||^2_2 = ||Z_2 - \mathbf{Z_1}(\mathbf{Z_1}^{\top}\mathbf{Z_1})^{-1}\mathbf{Z_1}^{\top}Z_2||^2_2 = SS_{res}

The (squared) Pearson correlation coefficient is then one minus the ratio of the residual to the total sum of squares:

R^2 = 1 - \frac{SS_{res}}{SS_{total}} = 1 - \frac{||(I - \mathbf{Z_1}(\mathbf{Z_1}^{\top}\mathbf{Z_1})^{-1}\mathbf{Z_1}^{\top})Z_2||^2_2}{||Z_2 - \bar{Z_2}||^2_2}

To reduce this term during training, the decorrelation loss is defined as:

\mathcal{L}_{R} = \text{log}(SS_{total} + \epsilon) - \text{log}(SS_{res} + \epsilon)    (1)
\mathcal{L}_{R}(Z_1, Z_2) = \text{log}(||Z_2 - \bar{Z_2}||^2_2 + \epsilon) - \text{log}(||(I - \mathbf{Z_1}(\mathbf{Z_1}^{\top}\mathbf{Z_1})^{-1}\mathbf{Z_1}^{\top})Z_2||^2_2 + \epsilon)    (2)

where \epsilon is some small constant for stability (set to 10^{-5} in our experiments) and \bar{Z_2} is an N \times D matrix where each row is the sample mean of Z_2. This decorrelation can be applied to model training simply by weighting and adding this loss to a conventional training objective (e.g., cross-entropy loss) for both networks, balancing feature decorrelation and individual network performance.
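A minimal PyTorch sketch of this loss, assuming Z_1 and Z_2 are (N, D) feature batches and using a pseudo-inverse for the least-squares fit (the function name is ours), could look as follows:

import torch

def correlation_loss(z1, z2, eps=1e-5):
    # z1, z2: (N, D) latent feature batches; z1 acts as the regressor
    n = z1.shape[0]
    ones = torch.ones(n, 1, device=z1.device, dtype=z1.dtype)
    z1_aug = torch.cat([z1, ones], dim=1)                      # [Z_1, 1]
    # Ordinary least-squares projection of z2 onto the column space of z1_aug
    proj = z1_aug @ (torch.pinverse(z1_aug) @ z2)
    ss_res = ((z2 - proj) ** 2).sum()
    ss_total = ((z2 - z2.mean(dim=0, keepdim=True)) ** 2).sum()
    # Equation (2): log(SS_total + eps) - log(SS_res + eps)
    return torch.log(ss_total + eps) - torch.log(ss_res + eps)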

Scaling Decorrelation to Ensembles

While decorrelation of two networks was shown to reduce transferability of adversarial attacks in [43], several issues impede its use in larger ensembles and higher-dimensional problems. First, this method requires training multiple networks in parallel to obtain Z_1 and Z_2, which linearly scales the memory needed for training, and quickly becomes infeasible for more networks, larger networks, and larger data sizes. Secondly, calculating the loss in Equation 2 requires taking the pseudo-inverse of \mathbf{Z_1} \in \mathbb{R}^{N \times (D+1)}, requiring batch size N > D+1 for a sufficiently overdetermined system. Higher dimensional problems often necessitate higher dimensional latent spaces and smaller batch sizes, rendering this condition impractical.

To solve the above problems and extend the method to ensembles consisting of more than two networks, we employ the following modifications:

  • Selecting the regression layer: Most deep classification networks can be divided into a multi-layer feature extractor and a linear layer followed by a softmax function (i.e., a logistic regression). We select the network layer just before this final linear layer, as it represents the highest-level features and is typically of lower dimension than previous layers.

  • Dimensionality reduction via random projections: Prior to computing the pseudo-inverse of the regressor, we compress its D+1 dimensionality to r < D+1 by applying a random projection R \in \mathbb{R}^{(D+1) \times r}. To balance the asymmetry in this relationship, we randomly select which network’s extracted features act as the regressor and which as the regressand with each training batch. Our new loss is expressed as follows:

    \mathcal{L}^{*}_{R}(Z_1, Z_2) = \begin{cases} \mathcal{L}_{R}(Z_1, Z_2 R) & \text{with prob. } 0.5 \\ \mathcal{L}_{R}(Z_2, Z_1 R) & \text{with prob. } 0.5 \end{cases}
    R \in \mathbb{R}^{(D+1) \times r} \sim N(0, 1/\sqrt{D})

    Although these projections individually do not capture all information in the regressor, they are drawn randomly with each training batch, preventing the networks from only decorrelating a subspace of the original feature space.

  • Models are trained in sequence instead of parallel: The first network is trained without any decorrelation loss. After training a model, its extracted features on all training samples are saved. While training the next model, these features are loaded with the corresponding batch samples and then used for decorrelation. As such, rather than dynamically decorrelating multiple networks at once, which requires simultaneous training of all networks, we simply use the features extracted by the previously trained networks as constant values to decorrelate against. For decorrelating against multiple models, we average the modified correlation loss over all the previously trained models. Thus, the entire decorrelation loss for model k in an ensemble is:

    \mathcal{L}_{cor}(Z_k, Z_{k-1} \cdots Z_0) = \frac{1}{k}\sum_{i=0}^{k-1} \mathcal{L}^{*}_{R}(Z_k, Z_i)    (3)

Final Decorrelation Scheme

Figure 6 illustrates the sequential training of the decorrelated ensemble. The total loss for model k on batch (X_b, Y_b), with the features extracted from X_b by the k-1 previous models denoted (Z_{k-1,b}, \ldots, Z_{0,b}), is:

\mathcal{L}_{total} = \mathcal{L}_{ce}(f_k(X_b), Y_b) + \lambda\,\mathcal{L}_{cor}(Z_{k,b}, Z_{k-1,b} \cdots Z_{0,b})    (4)

where \mathcal{L}_{ce} is the cross-entropy loss and \lambda is a weighting hyperparameter. The implementation is summarized in Algorithm 1. While stochastic gradient descent is shown here, any optimizer can be used for the gradient step.

Algorithm 1 Training Step for Model f_k (with parameters \theta_k) using decorrelation. \lambda and r are hyperparameters.
Draw X_b, Y_b, [Z_{k-1,b} \cdots Z_{0,b}] \triangleright Draw training batch and corresponding features from prior models
Z_{k,b} \leftarrow f^l_k(X_b)
N, D \leftarrow shape(Z_{k,b})
\hat{Y_b} \leftarrow f_k(X_b)
\mathcal{L} \leftarrow \mathcal{L}_{ce}(\hat{Y_b}, Y_b)
i \leftarrow 0
while i \leq k-1 do
     Z_1, Z_2 \leftarrow Z_{k,b}, Z_{i,b}
     if t \sim Uniform[0,1] < 0.5 then Z_1, Z_2 \leftarrow Z_2, Z_1
     end if
     R \sim N(0, 1/\sqrt{D}) \in \mathbb{R}^{(D+1) \times r}
     Z_1 \leftarrow [Z_1, \mathbf{1}]R
     \mathcal{L} \leftarrow \mathcal{L} + \frac{\lambda}{k}\mathcal{L}_{R}(Z_1, Z_2) \triangleright Apply decorrelation loss from Equation 2
     i \leftarrow i + 1
end while
\theta_k \leftarrow \theta_k - \eta\nabla_{\theta_k}\mathcal{L}
Figure 6: Illustration of the decorrelation training process. The current model f_k is trained using both cross entropy and a correlation loss. The correlation loss references previous models’ extracted sample features as opposed to training multiple models in parallel.
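A hedged PyTorch sketch of this training step, following Algorithm 1 and reusing the correlation_loss function sketched earlier, is given below. The model.features() interface, the optimizer handling, and the exact scaling of the random projection are illustrative assumptions (a constant rescaling of the projection does not change the least-squares residual, since it does not change the regressor's column space):

import random
import torch
import torch.nn.functional as F

def decorrelation_step(model, optimizer, x_b, y_b, prior_features, lam=0.2, r=32):
    # prior_features: list of saved (N, D) feature batches from previously
    # trained ensemble members, aligned with this training batch
    logits = model(x_b)
    z_k = model.features(x_b)              # penultimate-layer features (assumed interface)
    loss = F.cross_entropy(logits, y_b)
    n, d = z_k.shape
    k = len(prior_features)
    for z_i in prior_features:
        # Randomly swap regressor/regressand to balance the asymmetry
        z1, z2 = (z_k, z_i) if random.random() < 0.5 else (z_i, z_k)
        # Random projection compressing the (D+1)-dim regressor to r dimensions
        proj = torch.randn(d + 1, r, device=z_k.device) / d ** 0.5
        ones = torch.ones(n, 1, device=z_k.device)
        z1 = torch.cat([z1, ones], dim=1) @ proj
        loss = loss + (lam / k) * correlation_loss(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()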

Fourier Partitioning Scheme

With the Fourier partitioning scheme, the first ensemble model receives unmodified inputs. The other models are trained normally, but their inputs are filtered during both training and inference (Figure 7). This approach is inspired by [30], which showed that neural networks can achieve high classification accuracy in many computer vision tasks with only a portion of an input’s frequency data, and that models often overfit to discriminative features in specific frequency bands.

\hat{y}_{i,k} = f_k(h_k \circledast x_i)

In practice, filter convolution was done by pointwise multiplication in the Fourier domain, computed using the fast Fourier transform.
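As a sketch only (assuming one-sided real FFTs and a precomputed real-valued frequency response; the function name is ours), this filtering step could be implemented as:

import torch

def fourier_partition(x, h_freq):
    # x:      (N, channels, T) real-valued ECG batch
    # h_freq: (T // 2 + 1,) real-valued one-sided frequency response of the filter
    x_freq = torch.fft.rfft(x, dim=-1)                 # one-sided spectrum
    x_freq = x_freq * h_freq                           # pointwise multiplication = circular convolution
    return torch.fft.irfft(x_freq, n=x.shape[-1], dim=-1)

Complementary filters whose impulse responses sum to a unit impulse (e.g., h_2 = 1 - h_1 in the frequency domain), as used in our experiments, then partition the spectrum so that the filtered inputs sum back to the original signal.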

Figure 7: Illustration of Fourier transform-based input data decomposition. a) Diagram showing frequency partitioning of each sample into two inputs, which are fed into different models, where h is a partitioning filter and f is a classification model; and b) the frequency responses of h_1 and h_2 (real-valued only). f_s is the data sampling frequency (300 Hz for PhysioNet 2017, 500 Hz for CPSC 2018).

Adversarial Attacks

Adversarial attacks are formulated by maximizing the loss objective J of the model f with respect to a perturbation of x (with the paired label y) within a set of valid perturbations \Delta:

\underset{\delta}{\text{maximize}}\;\; J(x+\delta, y)    (5)
\text{subject to}\;\; \delta \in \Delta(x)

Two algorithms were used to craft adversarial attacks: projected gradient descent (PGD) and smoothed adversarial perturbations (SAP). PGD is widely used as a strong attack with an \ell_{\infty} bound through iterative optimization [8]:

x^{\prime}_{i} = \text{Clip}_{\varepsilon}(x^{\prime}_{i-1} + \alpha\,\text{sgn}(\nabla_{x}L(f(x^{\prime}_{i-1}), y)))    (6)

where x^{\prime}_{i} is the sample at the i^{th} iteration, y is the corresponding label, \alpha is a step size, L is the loss function, and the clipping operation clips all values to be within the \ell_{\infty} ball of radius \varepsilon around x, as well as any implicit bounds on the domain of X.
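A minimal PyTorch sketch of this attack (helper name ours; clamping to any valid signal range is omitted) is:

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps, n_steps=20):
    alpha = eps / 10.0                       # step size used in our experiments
    x_adv = x.clone().detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)   # project into the eps-ball around x
        x_adv = x_adv.detach()
    return x_adv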

SAP is a variation of PGD designed to craft smooth attacks for ECG signals. The details of SAP can be found in [32]. To summarize, a perturbation \theta is iteratively optimized in a fashion similar to PGD, but is also convolved with a sequence of M Gaussian kernels at every step, each of which is parameterized by its width s and standard deviation \sigma:

\theta_{i} = \text{Clip}_{\varepsilon}(\theta_{i-1} + \alpha\,\text{sgn}(\nabla_{\theta}L(f(x^{\prime}(\theta_{i-1})), y)))
x^{\prime}(\theta) = x + \frac{1}{M}\sum_{m=1}^{M}\theta \circledast K(s_m, \sigma_m)    (7)

The convolution with Gaussian kernels smooths high frequency perturbations, removing unrealistic square wave artifacts. PGD and SAP attacks were optimized over 20 steps in total. For both attacks, \alpha was scaled as \varepsilon/10. All adversarial attacks were crafted from the validation data to target one of the models in an ensemble. For experiments with PhysioNet 2017 data, the convolution kernels are identical to those used in [32]: s=(5, 7, 11, 15, 19), \sigma=(1, 3, 5, 7, 10). For the CPSC 2018 experiments, we used s=(9, 11, 15, 19, 21), \sigma=(5, 7, 10, 13, 17).
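A hedged sketch of SAP in the same style, assuming odd kernel widths so that 'same' padding preserves the signal length (helper names ours), is:

import torch
import torch.nn.functional as F

def gaussian_kernel(size, sigma):
    t = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    k = torch.exp(-t ** 2 / (2 * sigma ** 2))
    return k / k.sum()

def sap_attack(model, x, y, eps, sizes, sigmas, n_steps=20):
    alpha = eps / 10.0
    n, c, t = x.shape

    def smooth(theta):
        # Average of theta convolved with each Gaussian kernel (sizes assumed odd)
        out = torch.zeros_like(theta)
        for s, sigma in zip(sizes, sigmas):
            k = gaussian_kernel(s, sigma).to(theta.device).repeat(c, 1, 1)   # (C, 1, s)
            out = out + F.conv1d(theta, k, padding=s // 2, groups=c)
        return out / len(sizes)

    theta = torch.zeros_like(x)
    for _ in range(n_steps):
        theta.requires_grad_(True)
        loss = F.cross_entropy(model(x + smooth(theta)), y)
        grad = torch.autograd.grad(loss, theta)[0]
        with torch.no_grad():
            theta = torch.clamp(theta + alpha * grad.sign(), -eps, eps)
        theta = theta.detach()
    with torch.no_grad():
        return x + smooth(theta)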

Experimental Details

The two ECG datasets used in this work are from the PhysioNet 2017 and CPSC 2018 challenges. The PhysioNet dataset contains 30-60 second single-channel ECG recordings sampled at 300 Hz. All samples were zero-padded to be 60 seconds. Class labels are Normal, A. Fib., Other Rhythm, and Noise. Scaling of the signal magnitudes was identical to that in [32]. The CPSC 2018 data contains 6-60 second ECG recordings at 500 Hz with nine class labels: Normal, A. Fib., 1st-Degree Atrioventricular Block, Left Bundle Branch Block, Right Bundle Branch Block, Premature Atrial Contractions, Premature Ventricular Contraction, ST-segment depression, and ST-segment elevated. All samples were truncated or zero-padded to be 48 seconds. For simplicity, the minority of samples labelled with multiple diagnoses were not used. Additionally, all channels were normalized from -1 to 1.

For all experiments, each ensemble consisted of K=3 classification networks, each with the architecture used in [6] for experiments on the PhysioNet 2017 data. For experiments using the CPSC 2018 data, this architecture was simply modified to have 12 input channels and 9 output classes. Each network was trained for 80 epochs (batch size of 64) using the Adam optimizer with a learning rate of 10^{-3}. PyTorch 1.8.1 was used with two NVIDIA Titan RTX GPUs and a 90/10 training/validation split. For all ensembles using decorrelation training, hyperparameters r=32 and \lambda=0.2 were used (the uncompressed latent dimension D was 64).
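For reference, the baseline training loop can be sketched as follows. The decorrelation penalty (with hyperparameters r and \lambda) described earlier is omitted here for brevity, so this sketch covers only the standard cross-entropy training of each ensemble member.

```python
import torch
from torch.utils.data import DataLoader

def train_ensemble(models, train_set, epochs=80, batch_size=64, lr=1e-3, device="cuda"):
    """Sketch of the baseline training loop: each of the K=3 networks is
    trained independently with Adam (decorrelation loss omitted here)."""
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    loss_fn = torch.nn.CrossEntropyLoss()
    for model in models:
        model.to(device).train()
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                x, y = x.to(device), y.to(device)
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
    return models
```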

For ensembles that underwent additional ensemble adversarial training, each model in the ensemble was sequentially trained using adversarial samples. In practice, this is identical to regular training, except that each sample batch is perturbed using PGD (Equation 6) prior to the forward training step. The perturbations target the model being trained, with \varepsilon=10 and 0.025 for the PhysioNet 2017 and CPSC 2018 datasets, respectively. Each model was trained for an additional two hours, translating to six extra training hours in total for each adversarially trained ensemble.
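A sketch of one pass of this procedure is given below; pgd_attack refers to the PGD sketch above, and the bookkeeping that stops training after two hours per model is omitted.

```python
import torch

def adversarial_finetune_epoch(model, loader, eps, optimizer, device="cuda"):
    """Sketch of one epoch of ensemble adversarial training: each batch is
    perturbed with PGD (Equation 6) against the model being trained, then
    used as an ordinary training batch."""
    loss_fn = torch.nn.CrossEntropyLoss()
    model.to(device).train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x_adv = pgd_attack(model, x, y, eps=eps)  # perturb against this model
        optimizer.zero_grad()
        loss_fn(model(x_adv), y).backward()
        optimizer.step()
```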

DVERGE training is similar to adversarial ensemble training in that both use PGD to perturb a sample x_{s} within a small \ell_{\infty} bound. The differences are that in DVERGE 1) this optimization is used to maximize similarity between the distilled features of x_{s} and those of some other randomly drawn sample x at a randomly selected feature layer l of the network, and 2) the samples perturbed using network i are used to train the other networks j\neq i. Thus, we use the feature distillation objective and training procedure from [44]:

\displaystyle x^{\prime}_{f^{l}_{i}}(x,x_{s})=\underset{z}{\text{argmin}}\;||f^{l}_{i}(z)-f^{l}_{i}(x)||^{2}_{2}, (8)
\displaystyle\text{subject to}\;\;||z-x_{s}||_{\infty}\leq\varepsilon

Values for \varepsilon were identical to those used in adversarial ensemble training. The DVERGE training also ran for the same total training time (six additional hours) as adversarial ensemble training. For each batch, the feature layer l was selected uniformly at random from the post batch-norm layers of the networks.
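A sketch of the inner feature-distillation step of Equation 8 is shown below. The layer_hook helper (assumed to return the activations of the chosen layer, e.g. via a forward hook), the number of inner steps, and the \alpha=\varepsilon/10 step size are illustrative assumptions; the full DVERGE training procedure follows [44].

```python
import torch

def distill_features(model_i, layer_hook, x, x_s, eps, steps=10):
    """Sketch of Equation 8: starting from x_s, find z within an eps
    l-infinity ball of x_s whose activations at layer l of model i match
    those of x. layer_hook(model, input) is a hypothetical helper that
    returns the chosen layer's activations."""
    alpha = eps / 10  # assumed inner step size
    target = layer_hook(model_i, x).detach()
    z = x_s.clone().detach()
    for _ in range(steps):
        z.requires_grad_(True)
        dist = ((layer_hook(model_i, z) - target) ** 2).sum()
        grad = torch.autograd.grad(dist, z)[0]
        z = z.detach() - alpha * grad.sign()       # descend the feature distance
        z = x_s + torch.clamp(z - x_s, -eps, eps)  # stay within the eps-ball of x_s
    return z.detach()
```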

Data Availability

Data used in this paper is from the 2017 PhysioNet Cardiology Challenge [4] and the 2018 China Physiological Signal Challenge [5]. Code for implementing our methods and replicating experiments can be found at: https://github.com/WANG-AXIS/DNA_ECG.

References

  • [1] Fatma Murat, Ozal Yildirim, Muhammed Talo, Ulas Baran Baloglu, Yakup Demir, and U. Rajendra Acharya. Application of deep learning techniques for heartbeats detection using ECG signals-analysis and review. Computers in Biology and Medicine, 120:103726, 2020.
  • [2] Jianbiao Xiao, Jiahao Liu, Huanqi Yang, Qingsong Liu, Ning Wang, Zhen Zhu, Yulong Chen, Yu Long, Liang Chang, Liang Zhou, and Jun Zhou. Ulecgnet: An ultra-lightweight end-to-end ecg classification neural network. IEEE Journal of Biomedical and Health Informatics, 26(1):206–217, 2022.
  • [3] Shenda Hong, Zhou Yuxi, Junyuan Shang, Cao Xiao, and Sun Jimeng. Opportunities and challenges of deep learning methods for electrocardiogram data: A systematic review. Computers in Biology and Medicine, 122, 2020.
  • [4] Gari Clifford, Chengyu Liu, Benjamin Moody, Li-wei Lehman, Ikaro Silva, Qiao Li, Alistair Johnson, and Roger Mark. Af classification from a short single lead ECG recording: the physionet computing in cardiology challenge 2017, 2017.
  • [5] Feifei Liu, Chengyu Liu, Lina Zhao, Xiangyu Zhang, Xiaoling Wu, Xiaoyan Xu, Yulin Liu, Caiyun Ma, Shoushui Wei, Zhiqiang He, Jianqing Li, and Eddie Ng. An open access database for evaluating the algorithms of electrocardiogram rhythm and morphology abnormality detection. Journal of Medical Imaging and Health Informatics, 8:1368–1373, 09 2018.
  • [6] Sebastian D. Goodfellow, Andrew Goodwin, Robert Greer, Peter C. Laussen, Mjaye Mazwi, and Danny Eytan. Towards understanding ECG rhythm classification using convolutional neural networks and attention mappings. In Proceedings of the 3rd Machine Learning for Healthcare Conference, pages 83–101. PMLR, 2018. ISSN: 2640-3498.
  • [7] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks, 2014.
  • [8] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv:1706.06083 [cs, stat], 2019.
  • [9] Kui Ren, Tianhang Zheng, Zhan Qin, and Xue Liu. Adversarial attacks and defenses in deep learning. Engineering, 6(3):346–360, 2020.
  • [10] Yang Song, Rui Shu, Nate Kushman, and Stefano Ermon. Constructing unrestricted adversarial examples with generative models. arXiv:1805.07894 [cs, stat], 2018.
  • [11] Chaowei Xiao, Bo Li, Jun-Yan Zhu, Warren He, Mingyan Liu, and Dawn Song. Generating adversarial examples with adversarial networks. arXiv:1801.02610 [cs, stat], 2019.
  • [12] Xuanqing Liu and Cho-Jui Hsieh. Rob-GAN: Generator, discriminator, and adversarial attacker. arXiv:1807.10454 [cs, stat], 2019.
  • [13] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. arXiv:1511.04599 [cs], 2016.
  • [14] Ashutosh Chaubey, Nikhil Agrawal, Kavya Barnwal, Keerat K. Guliani, and Pramod Mehta. Universal adversarial perturbations: A survey. arXiv:2005.08087 [cs], 2020.
  • [15] Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430, 2018. Conference Name: IEEE Access.
  • [16] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv:1412.6572 [cs, stat], 2015.
  • [17] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv:1605.07277 [cs], 2016.
  • [18] Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. arXiv:1801.02774 [cs], 2018.
  • [19] Simant Dube. High dimensional spaces, deep learning and adversarial examples. arXiv:1801.00634 [cs], 2018.
  • [20] Chongli Qin, James Martens, Sven Gowal, Dilip Krishnan, Krishnamurthy Dvijotham, Alhussein Fawzi, Soham De, Robert Stanforth, and Pushmeet Kohli. Adversarial robustness through local linearization. arXiv:1907.02610 [cs, stat], 2019.
  • [21] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. Advances in Neural Information Processing Systems, 30, 2017.
  • [22] Kevin Roth, Yannic Kilcher, and Thomas Hofmann. Adversarial training is a form of data-dependent operator norm regularization. arXiv:1906.01527 [cs, stat], 2020-10-23.
  • [23] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. arXiv:1905.02175 [cs, stat], 2019.
  • [24] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv:1802.00420 [cs], 2018.
  • [25] Nicholas Carlini and David Wagner. Adversarial examples are not easily detected: Bypassing ten detection methods. arXiv:1705.07263 [cs], 2017.
  • [26] Jonathan Uesato, Brendan O’Donoghue, Aaron van den Oord, and Pushmeet Kohli. Adversarial risk and the dangers of evaluating against weak attacks. arXiv:1802.05666 [cs, stat], 2018.
  • [27] Nicholas Carlini and David Wagner. Defensive distillation is not robust to adversarial examples. arXiv:1607.04311 [cs], 2016.
  • [28] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. arxiv, 2016.
  • [29] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv:1607.02533 [cs, stat], 2017.
  • [30] Dong Yin, Raphael Gontijo Lopes, Jon Shlens, Ekin Dogus Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • [31] Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The space of transferable adversarial examples. arXiv:1704.03453 [cs, stat], 2017.
  • [32] Xintian Han, Yuxuan Hu, Luca Foschini, Larry Chinitz, Lior Jankelson, and Rajesh Ranganath. Deep learning models for electrocardiograms are susceptible to adversarial attack. Nature Medicine, 26(3):360–363, 2020.
  • [33] Elena Merdjanovska and Aleksandra Rashkovska. Comprehensive survey of computational ECG analysis: Databases, methods and applications. Expert Systems with Applications, 203:117206, October 2022.
  • [34] Vidar Ruddox, Irene Sandven, John Munkhaugen, Julie Skattebu, Thor Edvardsen, and Jan Erik Otterstad. Atrial fibrillation and the risk for myocardial infarction, all-cause mortality and heart failure: A systematic review and meta-analysis. European Journal of Preventive Cardiology, 24(14):1555–1566, September 2017.
  • [35] Jesper H Svendsen, Søren Z Diederichsen, Søren Højberg, Derk W Krieger, Claus Graff, Christian Kronborg, Morten S Olesen, Jonas B Nielsen, Anders G Holst, Axel Brandes, Ketil J Haugan, and Lars Køber. Implantable loop recorder detection of atrial fibrillation to prevent stroke (The LOOP Study): a randomised controlled trial. The Lancet, 398(10310):1507–1516, October 2021.
  • [36] Zahra Ebrahimi, Mohammad Loni, Masoud Daneshtalab, and Arash Gharehbaghi. A review on deep learning methods for ECG arrhythmia classification. Expert Systems with Applications: X, 7:100033, 2020.
  • [37] Yonatan Elul, Aviv A. Rosenberg, Assaf Schuster, Alex M. Bronstein, and Yael Yaniv. Meeting the unmet needs of clinicians from AI systems showcased for cardiology with deep-learning-based ECG analysis. Proceedings of the National Academy of Sciences, 118(24):e2020620118, 2021.
  • [38] Andrew Gordon Wilson. The case for bayesian deep learning, 2020. Number: arXiv:2001.10995.
  • [39] Andrew G Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. In Advances in Neural Information Processing Systems, volume 33, pages 4697–4708. Curran Associates, Inc., 2020.
  • [40] Alex Graves. Practical variational inference for neural networks. In NIPS, 2011.
  • [41] Radford Neal. Bayesian learning via stochastic dynamics. In NIPS, 1992.
  • [42] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In NIPS, 2017.
  • [43] Christopher Wiedeman and Ge Wang. Disrupting adversarial transferability in deep neural networks. Patterns, page 100472, 2022.
  • [44] Huanrui Yang, Jingyang Zhang, Hongliang Dong, Nathan Inkawhich, Andrew Gardner, Andrew Touchet, Wesley Wilkes, Heath Berry, and Hai Li. DVERGE: Diversifying vulnerabilities for enhanced robust generation of ensembles. arXiv:2009.14720 [cs, stat], 2020.
  • [45] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. International Conference of Machine Learning, 48, 2016.
  • [46] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla. Bayesian segnet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In Proceedings of the British Machine Vision Conference, pages 57.1–57.12. BMVA Press, 2017.
  • [47] Aryan Mobiny, Pengyu Yuan, Supratik K. Moulik, Naveen Garg, Carol C. Wu, and Hien Van Nguyen. Dropconnect is effective in modeling uncertainty of bayesian deep networks. Scientific Reports, 11(5458), 2021.

Acknowledgments

This work was partially supported by U.S. National Institutes of Health (NIH) grants R01EB026646, R01CA233888, R01CA237267, R01HL151561, R21CA264772, and R01EB031102, and by a National Science Foundation Graduate Research Fellowship supporting C.W.

Author Contributions

C.W. and G.W. jointly conceived the idea for this study. C.W. designed code for executing all experiments and drafted the paper. G.W. was heavily involved in supervising the project, interpreting results, and editing the paper.

Competing Interests

The authors declare no competing interests.