
Hierarchical Pruning of Deep Ensembles with Focal Diversity

Yanzhao Wu ([email protected], ORCID 0000-0001-8761-5486), Florida International University, 11200 SW 8th St, Miami, Florida, USA 33199, and Georgia Institute of Technology, 266 Ferst Drive, Atlanta, Georgia, USA 30332; Ka-Ho Chow ([email protected]), Wenqi Wei ([email protected]), and Ling Liu ([email protected]), Georgia Institute of Technology, 266 Ferst Drive, Atlanta, Georgia, USA 30332
(2023)
Abstract.

Deep neural network ensembles combine the wisdom of multiple deep neural networks to improve the generalizability and robustness over individual networks. Studying and applying deep ensemble techniques has gained increasing popularity in the deep learning community. Some mission-critical applications utilize a large number of deep neural networks to form deep ensembles to achieve the desired accuracy and resilience, which introduces high time and space costs for ensemble execution. Two critical challenges remain: whether a small subset of the entire deep ensemble can achieve the same or better generalizability, and how to effectively identify such small deep ensembles to improve the space and time efficiency of ensemble execution. This paper presents a novel deep ensemble pruning approach, which can efficiently identify smaller deep ensembles that provide higher ensemble accuracy than the entire deep ensemble of a large number of member networks. Our hierarchical ensemble pruning approach (HQ) leverages three novel ensemble pruning techniques. First, we show that our focal ensemble diversity metrics can accurately capture the complementary capacity of the member networks of an ensemble team, which can guide ensemble pruning. Second, we design a focal ensemble diversity based hierarchical pruning approach, which iteratively finds high quality deep ensembles with low cost and high accuracy. Third, we develop a focal diversity consensus method that integrates multiple focal diversity metrics to refine the ensemble pruning results, so that smaller deep ensembles can be effectively identified to offer high accuracy, high robustness and high ensemble execution efficiency. Evaluated on popular benchmark datasets, the proposed hierarchical ensemble pruning approach effectively identifies high quality deep ensembles with better classification generalizability while being more time and space efficient in ensemble decision making. We have released the source code on GitHub at https://github.com/git-disl/HQ-Ensemble.

Ensemble Pruning, Ensemble Learning, Ensemble Diversity, Deep Learning
Copyright: ACM licensed; journal year: 2023; DOI: 10.1145/3633286; journal: TIST; volume: 1; number: 1; article: 1; publication month: 11. CCS: Computing methodologies, Ensemble methods.

1. Introduction

Leveraging deep neural network (DNN) ensembles to improve overall generalizability and robustness has become an attractive technique in many deep learning systems. Some mission-critical applications require a large number of deep neural networks to achieve the target accuracy and robustness, which entails high space and time costs for ensemble execution. Recent studies have revealed that deep neural network ensembles with highly diverse member networks tend to have high failure independence, which is critical for enhancing overall ensemble predictive performance, including ensemble accuracy and robustness under adverse situations (Liu et al., 2019; Wei and Liu, 2021; Wu et al., 2020, 2021, 2023). However, the member networks in a large deep ensemble team may not have enough ensemble diversity and failure independence to complement each other, which results in sub-optimal ensemble prediction performance and high ensemble execution cost in practice (Fort et al., 2020; Liu et al., 2019; Wei et al., 2020; Wu et al., 2020, 2021; Wu and Liu, 2021). For a deep ensemble team of a large size, such as 10, it is often not only possible to identify substantially smaller deep ensembles (e.g., 3 to 5 member networks) with the same or improved ensemble accuracy, but also beneficial to reduce the ensemble execution cost (Lazarevic and Obradovic, 2001; Martínez-Muñoz et al., 2009; Wu et al., 2021, 2020; Wu and Liu, 2021). This motivates us to propose an efficient hierarchical ensemble pruning approach, coined as HQ. By leveraging our focal diversity metrics, the HQ pruning method can effectively identify high quality deep ensembles with small sizes and high ensemble diversity, which not only improves ensemble accuracy over the entire large ensemble but also significantly reduces the space and time cost of ensemble execution.

1.1. Related Work and Problem Statement

Most of the existing ensemble learning studies can be summarized into three broad categories. The first category builds ensemble teams by training a set of member models, represented by bagging (Breiman, 1996), boosting (Breiman et al., 1998) and random forests (Breiman, 2001). In practice, this has led to large ensemble teams of tens to hundreds of member models, such as the commonly-used random forests. The second category leverages diversity measurements to compare and select high quality ensemble teams whose member models complement each other, improving the ensemble predictive performance without requiring additional model training. Ensemble diversity can be evaluated using pairwise or non-pairwise diversity measures, represented by Cohen's Kappa (CK) (McHugh, 2012) and Binary Disagreement (BD) (Skalak, 1996) for pairwise metrics, and Kohavi-Wolpert Variance (KW) (Kohavi and Wolpert, 1996; Kuncheva and Whitaker, 2003) and Generalized Diversity (GD) (Partridge and Krzanowski, 1997) for non-pairwise metrics. Early studies (Kuncheva and Whitaker, 2003; Tang et al., 2006) have reported that it is challenging to use these diversity metrics for evaluating the performance quality of ensemble teams. Some recent studies (Fort et al., 2020; Liu et al., 2019; Wu et al., 2021, 2020) discussed inherent problems of using these diversity metrics to measure ensemble diversity in terms of failure independence and provided guidelines for improving ensemble diversity measurement. This paper contributes to this second category by leveraging focal ensemble diversity metrics for effective pruning of deep ensembles. The third category covers the ensemble consensus methods for aggregating member model predictions and producing the ensemble prediction, such as soft voting (model averaging), majority voting and plurality voting (Ju et al., 2017), or learn-to-combine algorithms (Liu, 2011; Wu et al., 2020).

Table 1. Summary of Three Categories of Related Studies

Category | Description | Representative methods
1. Ensemble training | Training a set of member models to build ensemble teams | Bagging (Breiman, 1996), boosting (Breiman et al., 1998), random forests (Breiman, 2001), etc.
2. Ensemble diversity powered ensemble pruning | Leveraging ensemble diversity measurements to compare and select high-quality ensembles | CK pruning (McHugh, 2012), BD pruning (Skalak, 1996), GD pruning (Partridge and Krzanowski, 1997), etc.
3. Ensemble consensus methods | Aggregating member model predictions to produce the ensemble prediction | Soft voting, majority voting, plurality voting, etc.
Table 2. Comparison of Hierarchical Pruning and Existing Studies

Method | Model Types | Diversity Comparison | Samples for Measurements | Diversity Calculation
Early studies on diversity based ensemble pruning before 2015 (Caruana et al., 2004; Banfield et al., 2005; Tsoumakas et al., 2009; Martínez-Muñoz et al., 2009; Lazarevic and Obradovic, 2001) | Ensembles of traditional ML models | Compare all ensembles of different sizes | Random samples from the validation set | Directly calculated on random samples
Focal diversity powered hierarchical ensemble pruning | Ensembles of (1) DNNs and (2) traditional ML models | Compare the ensembles of the same size $S$ | Random negative samples from a focal model $F_f$ | Obtain the focal negative correlation for each focal (member) model $F_f$ in an ensemble team of size $S$, then average the $S$ focal negative correlation scores to obtain the focal diversity score

(The last three columns constitute the Diversity Measurement Principles.)

We summarize the three categories of related studies in Table 1. These three categories are complementary. To develop an ensemble team for improving prediction performance, we can employ one of two ways to obtain an ensemble of member models: (1) training multiple models together using an ensemble learning algorithm, such as bagging, boosting, or random forests; or (2) training multiple models independently in parallel and employing a voting method to produce ensemble consensus based predictions. Most of the early studies before 2015 in the first two categories (Caruana et al., 2004; Banfield et al., 2005; Tsoumakas et al., 2009; Martínez-Muñoz et al., 2009; Lazarevic and Obradovic, 2001) focused on ensemble pruning for traditional machine learning models. Only recently have several research endeavors targeted deep neural network ensembles, most of which centered on training multiple networks jointly, such as diversity based weighted kernels (Chow and Liu, 2021; Bian et al., 2020; Yin et al., 2014), or on leveraging deep ensembles to strengthen the robustness and resilience of a single deep neural network under adverse situations (Liu et al., 2019; Wei et al., 2020; Wei and Liu, 2021; Chow et al., 2019; Chow and Liu, 2021; Wu et al., 2021, 2023). However, to the best of our knowledge, very few studies have brought forward solutions for efficient deep ensemble pruning that improve prediction performance and reduce ensemble execution cost.

We compare our focal diversity powered hierarchical ensemble pruning approach with early studies using ensemble diversity for ensemble pruning in Table 2. Our focal diversity promotes fair comparison of ensemble diversity among the ensembles of the same size $S$, and leverages the focal model concept, focal negative correlation, and averaging of multiple focal negative correlation scores to obtain accurate ensemble diversity measurements. We provide a detailed description of focal diversity powered hierarchical pruning in Section 3 and demonstrate that our hierarchical pruning approach can be effectively applied to both deep neural network ensembles and ensembles of traditional machine learning models in Section 5.

1.2. Scope and Contribution

This paper presents a holistic approach to efficient deep ensemble pruning. Given a large deep ensemble of $M$ member networks, we propose a hierarchical ensemble pruning framework, denoted as HQ, to efficiently identify high quality ensembles with lower cost and higher accuracy than the entire ensemble of $M$ networks. Our HQ framework combines three novel ensemble pruning techniques. First, we leverage focal ensemble diversity metrics (Wu et al., 2021) to measure the failure independence among member networks of a deep ensemble. A higher focal diversity score indicates a higher level of failure independence among the member networks of a deep ensemble, that is, a higher complementary capacity for improving ensemble predictions. Our focal diversity metrics can precisely capture such correlations among the member networks of an ensemble team, which can be used to effectively guide ensemble pruning. Second, we present a novel hierarchical ensemble pruning method powered by focal diversity metrics. Our hierarchical pruning approach iteratively identifies subsets of member networks with low diversity, which tend to make similar prediction errors, and prunes them out from the entire ensemble team. Third, we perform focal diversity consensus voting to combine multiple focal diversity metrics for ensemble pruning, which further refines the hierarchical ensemble pruning results obtained by a single focal diversity metric. Comprehensive experiments are performed on four popular benchmark datasets: CIFAR-10 (Krizhevsky and Hinton, 2009), ImageNet (Russakovsky et al., 2015), Cora (Lu and Getoor, 2003) and MNIST (Lecun et al., 1998). The experimental results demonstrate that our focal diversity based hierarchical pruning approach is effective in identifying high quality deep ensembles with significantly smaller sizes and better ensemble accuracy than the entire deep ensemble team.

2. Ensemble Pruning with Focal Diversity

Given an entire deep ensemble of a large size $M$, it consists of $M$ individual member networks ($F_i$, $i \in \{0, 1, \dots, M-1\}$) trained for a specific learning task and dataset. We denote the set of all possible sub-ensembles, composed of any subset of these $M$ individual networks, as $EnsSet$. For a specific team size $S$, let $EnsSet(S)$ denote the set of all possible sub-ensembles of size $S$ in $EnsSet$. The cardinality of $EnsSet(S)$ is the number of ways to select $S$ networks from the $M$ base networks, that is, $|EnsSet(S)| = \binom{M}{S}$. Therefore, the total number of candidate sub-ensembles for $S = 2, \dots, M-1$ is $|EnsSet| = \sum_{S=2}^{M-1} |EnsSet(S)| = \binom{M}{2} + \binom{M}{3} + \dots + \binom{M}{M-1} = 2^M - (2 + M)$, which grows exponentially with $M$. For example, when $M = 3, 5, 10, 20$, we have $|EnsSet| = 3, 25, 1012, 1048554$, respectively. Exhaustive search over all possible ensembles in $EnsSet$ quickly becomes infeasible for a large $M$. Hence, it is critical to develop efficient ensemble pruning methods that examine the sub-ensembles in the candidate set $EnsSet$, remove the sub-ensembles with low diversity, and obtain the set $GEnsSet$ of high quality sub-ensembles with lower cost and better ensemble accuracy than the entire deep ensemble of $M$ member networks.
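To make the combinatorics concrete, the following minimal Python sketch (with illustrative helper names, not the released HQ-Ensemble code) enumerates the candidate set $EnsSet$ and verifies the $2^M - (M+2)$ count:

from itertools import combinations

def enumerate_sub_ensembles(M):
    """Return all candidate sub-ensembles of sizes S = 2..M-1."""
    members = list(range(M))
    ens_set = []
    for S in range(2, M):  # team sizes 2..M-1
        ens_set.extend(combinations(members, S))
    return ens_set

for M in (3, 5, 10, 20):
    n = len(enumerate_sub_ensembles(M))
    assert n == 2 ** M - (M + 2)
    print(f"M={M}: |EnsSet|={n}")  # 3, 25, 1012, 1048554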

Table 3. Example Deep Ensembles for CIFAR-10 and ImageNet

CIFAR-10:
Ensemble Team | 0123456789 | 0123 | 01238 | 123 | 1234
Ensemble Acc (%) | 96.33 | 97.15 | 96.87 | 96.81 | 96.63
Team Size | 10 | 4 | 5 | 3 | 4
Acc Improv (%) | 0 | 0.82 | 0.54 | 0.48 | 0.30
Cost Reduction (%) | 0 | 60 | 50 | 70 | 60

ImageNet:
Ensemble Team | 0123456789 | 12345 | 2345 | 1234 | 124
Ensemble Acc (%) | 79.82 | 80.77 | 80.70 | 80.29 | 79.84
Team Size | 10 | 5 | 4 | 4 | 3
Acc Improv (%) | 0 | 0.95 | 0.88 | 0.47 | 0.02
Cost Reduction (%) | 0 | 50 | 60 | 60 | 70
Table 4. All Individual Member Models for Four Benchmark Datasets

ID | CIFAR-10 (10,000 testing samples) | Acc (%) | ImageNet (50,000 testing samples) | Acc (%) | Cora (1,000 testing samples) | Acc (%) | MNIST (10,000 testing samples) | Acc (%)
0 | DenseNet190 | 96.68 | AlexNet | 56.63 | GCN | 81.70 | KNN | 94.23
1 | DenseNet100 | 95.46 | DenseNet | 77.15 | GAT | 82.80 | Logistic Regression | 91.89
2 | ResNeXt | 96.23 | EfficientNet-B0 | 75.80 | SGC | 81.70 | Linear SVM | 92.48
3 | WRN | 96.21 | ResNeXt50 | 77.40 | ARMA | 82.10 | RBF SVM | 96.31
4 | VGG19 | 93.34 | Inception3 | 77.25 | APPNP | 82.20 | Random Forest | 95.91
5 | ResNet20 | 91.73 | ResNet152 | 78.25 | APPNP1 | 83.80 | GBDT | 92.89
6 | ResNet32 | 92.63 | ResNet18 | 69.64 | APPNP2 | 88.70 | Neural Network | 96.18
7 | ResNet44 | 93.10 | SqueezeNet | 58.00 | SplineCNN | 88.90 | |
8 | ResNet56 | 93.39 | VGG16 | 71.63 | SplineCNN1 | 88.30 | |
9 | ResNet110 | 93.68 | VGG19-BN | 74.22 | SplineCNN2 | 88.50 | |
MIN | ResNet20 | 91.73 | AlexNet | 56.63 | GCN/SGC | 81.70 | Logistic Regression | 91.89
AVG | | 94.25 | | 71.60 | | 84.87 | | 94.27
MAX | DenseNet190 | 96.68 | ResNet152 | 78.25 | SplineCNN | 88.90 | RBF SVM | 96.31
Table 5. Examples on ImageNet and Top-3 Classification Confidence (the three example images and the per-model Top-3 confidence charts are omitted in this conversion)

Ground Truth Label | ice cream | lemon | ski
Accuracy (%) | Ensemble $F_1F_2F_4$: 79.84 (DenseNet: 77.15, EfficientNet-B0: 75.80, Inception3: 77.25)
Ensemble Output | ice cream | lemon | ski

We show 5 example deep ensemble teams per dataset in Table 3 for CIFAR-10 and ImageNet. For each dataset, we list the entire deep ensemble (0123456789) of 10 member networks in addition to the 4 high quality deep ensembles identified by our HQ ensemble pruning approach. The 10 member networks in each entire ensemble for CIFAR-10 and ImageNet are given in Table 4. The 4 sub-ensembles recommended by HQ have much smaller team sizes, with only 3 to 5 individual member networks, and achieve better ensemble accuracy than the given entire ensemble of 10 networks, significantly reducing the ensemble execution cost by 50% to 70% for both CIFAR-10 and ImageNet. Table 5 further presents 3 image examples from ImageNet and the prediction results with Top-3 classification confidence from the member networks of the sub-ensemble team 124. This ensemble team achieves higher accuracy than each individual member network. For each image, one member model makes a prediction error, but the ensemble team still gives the correct prediction by repairing the wrong prediction of its member network. This sub-ensemble team of 3 member networks also outperforms the entire ensemble of 10 models on ImageNet. For a given deep ensemble of size $M = 10$, there are a total of 1012 possible sub-ensembles to consider in ensemble pruning. It is challenging to design an efficient ensemble pruning approach that identifies high quality sub-ensembles to improve both ensemble accuracy and the time and space efficiency of ensemble execution.
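The repair effect described above can be illustrated with a minimal sketch of plurality voting over member predictions; the predictions below are hypothetical placeholders, not actual model outputs from the paper:

from collections import Counter

def plurality_vote(labels):
    """Return the most common predicted label among member networks."""
    return Counter(labels).most_common(1)[0][0]

# Each row: hypothetical (DenseNet, EfficientNet-B0, Inception3) predictions
# for one image; in each case one member errs and the vote repairs it.
member_predictions = [
    ("ice cream", "ice cream", "frozen yogurt"),
    ("lemon", "orange", "lemon"),
    ("snowmobile", "ski", "ski"),
]
for preds in member_predictions:
    print(plurality_vote(preds))  # ice cream, lemon, ski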

Problems with Baseline Ensemble Pruning. In related work, we discussed several recent studies that utilize deep ensembles to strengthen the robustness of individual deep neural networks against adversarial attacks (Liu et al., 2019; Wei and Liu, 2021; Chow et al., 2019; Chow and Liu, 2021). Most of these existing methods leverage Cohen's Kappa (CK) (McHugh, 2012) for measuring ensemble diversity, since early studies (Kuncheva and Whitaker, 2003; Tang et al., 2006; Tsoumakas et al., 2009; Martínez-Muñoz et al., 2009) in the literature have shown that pairwise and non-pairwise diversity metrics, including the CK, BD, KW and GD metrics mentioned in related work, produce similar diversity evaluation results with regard to ensemble predictive performance. However, we show that these existing diversity metrics may not precisely capture the inherent failure independence among the member networks of a deep ensemble team and thus may not provide optimal guidance for ensemble pruning.

We first study the baseline diversity metrics for ensemble pruning and analyze their inherent problems. For a given diversity metric, such as BD or GD, the baseline ensemble pruning method calculates the ensemble diversity score for each candidate ensemble in $EnsSet$ using a set of random samples drawn from the validation set, as suggested by (Kuncheva and Whitaker, 2003). A practical method for pruning a given large ensemble of $M$ member networks follows three steps: (1) compute the ensemble diversity score for each candidate sub-ensemble using CK, BD, KW or GD; (2) calculate the mean diversity score as the pruning threshold; and (3) select the sub-ensembles in $EnsSet$ whose ensemble diversity scores are above this pruning threshold and add them into the set $GEnsSet$ of high quality ensembles. The remaining sub-ensembles are pruned out, given that their ensemble diversity falls below the pruning threshold.
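A minimal Python sketch of this three-step baseline pruning is shown below, assuming a diversity(team) function (e.g., BD or GD computed on random validation samples) is available; all names are illustrative:

import numpy as np

def mean_threshold_pruning(ens_set, diversity):
    # Step 1: compute the diversity score of every candidate sub-ensemble.
    scores = {team: diversity(team) for team in ens_set}
    # Step 2: use the mean diversity score as the pruning threshold.
    threshold = np.mean(list(scores.values()))
    # Step 3: keep sub-ensembles scoring above the threshold; prune the rest.
    g_ens_set = [team for team, s in scores.items() if s > threshold]
    return g_ens_set, threshold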

Figure 1. Pruning All Possible Deep Ensembles by the Mean Threshold on CIFAR-10: (a) Baseline Diversity (GD), Pruning 427 out of 1012; (b) Focal Diversity (F-GD), Pruning 517 out of 1012

We compute the baseline GD scores and ensemble accuracy for all 1013 deep ensembles (1012 sub-ensembles + 1 entire ensemble) on CIFAR-10, as shown in Figure 1a, where each dot denotes an ensemble team and each color marks the team size according to the color bar on the right. The baseline GD ensemble pruning approach utilizes the GD mean threshold of 0.524 (vertical red dashed line) to select the sub-ensemble teams to the right of this diversity threshold line into $GEnsSet$. The ensemble accuracy of 96.33% for the entire ensemble (marked by the black circle) is represented by the horizontal black dashed line, which serves as the reference accuracy for evaluating ensemble pruning methods. Ideally, an ensemble pruning algorithm should identify high quality sub-ensembles with ensemble accuracy above this reference accuracy.

For an entire ensemble of a large size $M$, we use four performance metrics to assess ensemble pruning algorithms: (1) the accuracy range of the selected sub-ensembles in $GEnsSet$; (2) the precision, which measures the proportion of selected sub-ensembles with equal or higher ensemble accuracy than the reference accuracy of the given entire ensemble (e.g., 96.33% for CIFAR-10 and 79.82% for ImageNet) among all selected sub-ensembles; (3) the recall, which measures the proportion of selected sub-ensembles in $GEnsSet$ over all sub-ensembles in $EnsSet$ whose ensemble accuracy is equal to or higher than the reference accuracy of the entire deep ensemble; and (4) the cost reduction, which measures the reduction in the ensemble team size of the selected sub-ensembles relative to the entire ensemble size $M$. For example, the small sub-ensemble 123 in Table 3 has a cost reduction of 70% ($=(M-S)/M=(10-3)/10$). We evaluate the baseline mean threshold based ensemble pruning algorithm using the baseline BD (2nd column) and GD (4th column) diversity metrics on the CIFAR-10 dataset in Table 6 and the ImageNet dataset in Table 7. The BD-based baseline pruning suffers from low precision (6.18% for CIFAR-10 and 8.53% for ImageNet) and low recall (11.72% for CIFAR-10 and 23.15% for ImageNet). Even though the GD-based baseline pruning has higher precision (36.90% for CIFAR-10 and 17.74% for ImageNet) and recall (63.10% for CIFAR-10 and 46.31% for ImageNet), these values are still below acceptable levels, such as a precision of over 50%. Moreover, the accuracy lower bound for both BD and GD based baseline ensemble pruning, i.e., 93.56% on CIFAR-10 and 61.39% (BD) and 70.79% (GD) on ImageNet, is much lower than the reference accuracy of the entire ensemble team, i.e., 96.33% for CIFAR-10 and 79.82% for ImageNet. We find similar observations for other diversity metrics, including CK and KW. This motivates us to investigate how to properly measure ensemble diversity and optimize the ensemble pruning process.
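For clarity, the four evaluation metrics can be sketched as follows, assuming acc(team) returns the ensemble accuracy of a candidate team; the helper names are ours, not from the released code:

def evaluate_pruning(g_ens_set, ens_set, acc, ref_acc, M):
    accs = [acc(t) for t in g_ens_set]
    acc_range = (min(accs), max(accs))                      # metric (1)
    selected_good = sum(acc(t) >= ref_acc for t in g_ens_set)
    all_good = sum(acc(t) >= ref_acc for t in ens_set)
    precision = selected_good / len(g_ens_set)              # metric (2)
    recall = selected_good / all_good                       # metric (3)
    cost_reduction = [(M - len(t)) / M for t in g_ens_set]  # metric (4)
    return acc_range, precision, recall, cost_reduction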

Table 6. Baseline Ensemble Pruning by Mean Threshold (CIFAR-10)

Methods | BD > 0.339 (baseline) | F-BD > 0.602 (optimized) | GD > 0.524 (baseline) | F-GD > 0.543 (optimized)
Acc Range (%) of GEnsSet | 93.56–96.72 | 95.71–97.15 | 93.56–97.15 | 95.71–97.15
Precision (%) | 6.18 | 63.53 | 36.90 | 53.90
Recall (%) | 11.72 | 95.51 | 63.10 | 97.59
Cost Reduction | 10%–80% | 10%–80% | 10%–80% | 10%–80%
Table 7. Baseline Ensemble Pruning by Mean Threshold (ImageNet)

Methods | BD > 0.314 (baseline) | F-BD > 0.632 (optimized) | GD > 0.335 (baseline) | F-GD > 0.594 (optimized)
Acc Range (%) of GEnsSet | 61.39–80.54 | 76.77–80.77 | 70.79–80.60 | 76.41–80.77
Precision (%) | 8.53 | 41.93 | 17.74 | 36.75
Recall (%) | 23.15 | 98.52 | 46.31 | 99.01
Cost Reduction | 10%–80% | 10%–80% | 10%–80% | 10%–80%

Focal Diversity based Hierarchical Pruning. We improve the ensemble diversity measurement and the ensemble pruning process with three novel techniques in this paper. First, we use focal diversity metrics to optimize the ensemble diversity measurements, utilizing the concept of focal models to draw negative samples and precisely capture the failure independence of the member networks of a deep ensemble team. Second, we introduce a novel hierarchical pruning approach, which leverages focal diversity metrics and iteratively prunes out subsets of redundant member networks with low diversity from the entire ensemble. Third, we combine multiple focal diversity metrics through focal diversity consensus voting to further enhance the hierarchical ensemble pruning performance.

3. Focal Diversity Based Hierarchical Pruning

3.1. Focal Diversity Concept and Algorithm

Our focal diversity metrics are designed to provide more accurate measurements of the failure independence of the member networks in a deep ensemble team, including both pairwise (F-CK and F-BD) and non-pairwise (F-GD and F-KW) diversity metrics. Unlike traditional ensemble diversity evaluation approaches, which randomly select samples from the validation set to evaluate ensemble diversity, our focal diversity measurements draw random negative samples from $NegSampSet(F_f)$, the set of samples on which a specific focal model $F_f$ makes prediction errors, and then calculate the focal negative correlation score. For a given deep ensemble of $S$ member networks, each of the $S$ member networks serves in turn as the focal model $F_f$ to draw negative samples. Therefore, we obtain a total of $S$ focal negative correlation scores, each corresponding to a member network serving as the focal model. We then obtain the focal diversity score for this deep ensemble of $S$ member networks through the (weighted) average of the $S$ focal negative correlation scores. The focal model concept is motivated by ensemble defense against adversarial attacks (Chow et al., 2019; Wei and Liu, 2021), where the attack victim model is protected in a defense ensemble team. The focal model can be viewed as the victim model, and the focal negative correlation score evaluates the failure independence of the other member models with respect to this focal model in an ensemble team. Averaging the $S$ focal negative correlation scores, taking each member model as the focal model in turn, mitigates the bias of evaluating diversity with only one focal model. Our preliminary results have demonstrated promising performance of the focal diversity metrics in capturing the failure independence of the member networks of a deep ensemble team (Wu et al., 2020, 2021; Wu and Liu, 2021).
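A minimal sketch of constructing the negative sample set for a focal model is given below; the array names and shapes are assumptions for illustration:

import numpy as np

def negative_sample_set(focal_preds, labels, sample_size, seed=None):
    """focal_preds, labels: 1-D arrays of predicted / true class indices."""
    rng = np.random.default_rng(seed)
    neg_idx = np.flatnonzero(focal_preds != labels)  # focal model errs here
    size = min(sample_size, neg_idx.size)
    return rng.choice(neg_idx, size=size, replace=False)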

Algorithm 1 Focal Diversity Metric Calculation
1: procedure getFQ($NegSampSet$, $Q$, $EnsSet$)
2:     Input: $NegSampSet$: negative sample sets for each focal model $F_f$; $Q$: the focal diversity metric, such as F-CK, F-BD, F-KW and F-GD; $EnsSet$: the set of candidate sub-ensemble teams to be considered;
3:     Output: $FQ$: focal diversity scores
4:     Initialize $D(Q) = \{\}$, $\overline{D}(Q) = \{\}$
5:     Initialize $FQ = \{\}$ ▷ A map of diversity scores and teams
6:     for $S = 2$ to $M-1$ do
7:         for $f = 0$ to $M-1$ do
8:             Obtain $EnsSet(F_f, S)$ with the candidate sub-ensembles of size $S$ containing the focal model $F_f$.
9:             Initialize $D(Q, S, F_f) = \{\}$
10:            for $i = 1$ to $|EnsSet(F_f, S)|$ do
11:                ▷ Calculate the focal negative correlation score for $T_i \in EnsSet(F_f, S)$ using the definition of $Q$
12:                $q_i = FocalNegativeCorrelation(Q, T_i, NegSampSet(F_f))$
13:                $D(Q, S, F_f)$.append($q_i$)
14:            end for
15:            for $i = 1$ to $|EnsSet(F_f, S)|$ do
16:                ▷ Scale the focal negative correlation scores for sub-ensembles of the same size $S$
17:                $\overline{D}(Q, S, F_f, T_i) = \frac{q_i - \min(D(Q, S, F_f))}{\max(D(Q, S, F_f)) - \min(D(Q, S, F_f))}$
18:            end for
19:        end for
20:        Obtain $EnsSet(S)$ with the candidate sub-ensemble teams of size $S$
21:        for $i = 1$ to $|EnsSet(S)|$ do
22:            Initialize $tmpD = \{\}$
23:            for $j = 1$ to $|T_i|$ do
24:                $tmpD$.append($\overline{D}(Q, S, F_f = T_i[j], T_i)$)
25:            end for
26:            ▷ Obtain the member model accuracy ranks as the member model weights
27:            $w = MemberModelAccuracyRank(T_i)$
28:            $FQ(T_i) = WeightedAverage(w, tmpD)$
29:        end for
30:    end for
31:    return $FQ$
32: end procedure

In general, our focal diversity metrics compare the ensemble diversity scores among the ensembles of the same size $S$: we randomly draw negative samples for each focal model $F_f$ (i.e., $NegSampSet(F_f)$) and calculate the focal diversity score as the (weighted) average of the $S$ focal negative correlation scores, taking each member model as the focal model. Algorithm 1 gives a skeleton of calculating the focal diversity scores for all candidate sub-ensembles in $EnsSet$. For each team size $S$ (Lines 6 to 30), we follow two general steps to calculate the focal diversity score of each ensemble. First, for each member model $F_f$, let $EnsSet(F_f, S)$ denote all candidate ensembles of size $S$ containing the focal model $F_f$. We compute the focal negative correlation score for each ensemble in $EnsSet(F_f, S)$ with the negative samples randomly drawn for the focal model $F_f$ ($NegSampSet(F_f)$) and store them in $D(Q, S, F_f)$ (Lines 10 to 14). Then, in order to make the scores comparable across different member models $F_f$, we scale $D(Q, S, F_f)$ into $[0, 1]$ and store the results in $\overline{D}(Q, S, F_f, T_i)$ for each ensemble team $T_i$ (Lines 15 to 18). Second, for each candidate sub-ensemble $T_i$ of size $S$, we perform a weighted average of the scaled focal negative correlation scores $\overline{D}(Q, S, F_f = T_i[j], T_i)$ associated with each of its focal (member) models $F_f = T_i[j]$ to obtain the focal diversity score. The weight is derived from the rank of the accuracy of the member model $T_i[j]$ in the ensemble team $T_i$, i.e., member models with higher accuracy receive higher weights (Lines 21 to 29). In addition, we subtract the CK value from 1 when using the CK formula (McHugh, 2012) to present the consistent view that high diversity values correspond to high ensemble diversity.
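The scaling and aggregation steps of Algorithm 1 for a fixed team size $S$ can be condensed into the following Python sketch, assuming raw_score(team, focal) implements the chosen focal negative correlation metric $Q$; the function names are illustrative rather than the released implementation:

import numpy as np

def focal_diversity_scores(teams_of_size_S, members, raw_score, acc):
    # Step 1: for each focal model, min-max scale the raw focal negative
    # correlation scores across all size-S teams containing that focal model.
    scaled = {}
    for f in members:
        teams_f = [t for t in teams_of_size_S if f in t]
        raw = np.array([raw_score(t, f) for t in teams_f])
        span = max(raw.max() - raw.min(), 1e-12)
        for t, q in zip(teams_f, (raw - raw.min()) / span):
            scaled[(t, f)] = q
    # Step 2: per team, weighted-average the scaled scores over its focal
    # models, weighting members by their accuracy rank within the team.
    fq = {}
    for t in teams_of_size_S:
        ranks = np.argsort(np.argsort([acc(m) for m in t])) + 1.0
        weights = ranks / ranks.sum()  # higher accuracy -> higher weight
        fq[t] = float(np.dot(weights, [scaled[(t, m)] for m in t]))
    return fq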

We show a visual comparison between our focal diversity metric F-GD in Figure 1b and the baseline GD metric in Figure 1a for CIFAR-10. Here the baseline mean threshold based ensemble pruning method is used for both F-GD and GD diversity metrics. Compared to the GD based ensemble pruning, the F-GD powered pruning obtains better ensemble pruning results and identifies a larger portion of sub-ensembles (on the right side of the vertical red dashed line) with ensemble accuracy above the reference accuracy of 96.33% (horizontal black dashed line).

When the focal diversity metrics are used in our hierarchical pruning algorithm with a desired team size $S_d$, we only calculate the diversity scores for the ensembles of size 2 to $S_d$, by replacing $M-1$ with $S_d$ in Line 6 of Algorithm 1, to reduce the computation cost. Compared to the baseline diversity metrics, our focal diversity score computation takes about 1.2 to 2.4 times as long as the baseline diversity score computation over all candidate sub-ensembles in $EnsSet$. The computation of all focal diversity metrics can be finished in several seconds on MNIST, CIFAR-10 and Cora, and in several minutes on ImageNet (see Table 16). Given that our focal diversity metrics significantly outperform the baseline diversity metrics in capturing the failure independence of a group of member models, it is beneficial and worthwhile to use our focal diversity metrics to measure ensemble diversity and identify high quality sub-ensembles. In addition, our hierarchical pruning can effectively prune out about 80% of the ensemble teams in $EnsSet$, which compensates for the increased computation time of the focal diversity metrics and makes our overall solution very efficient.

We show the baseline mean threshold based ensemble pruning results using our focal diversity metrics F-BD and F-GD in the 3rd and 5th columns of Table 6 for CIFAR-10 and Table 7 for ImageNet. Both F-BD and F-GD (optimized) substantially outperform the baseline BD and GD based ensemble pruning, as measured by the accuracy range of the selected sub-ensembles in $GEnsSet$, precision and recall. For the other two diversity metrics, CK and KW, we observed similar performance improvements when using our focal diversity metrics, F-CK and F-KW, in the baseline ensemble pruning. Even though our focal diversity metrics significantly improve the baseline mean threshold based ensemble pruning method, achieving a very high recall of over 95%, there is still room to raise the precision (63.53% for F-BD and 53.90% for F-GD on CIFAR-10, and 41.93% for F-BD and 36.75% for F-GD on ImageNet) to more accurately identify high-quality sub-ensembles. In addition, the baseline mean threshold based ensemble pruning examines every sub-ensemble in $EnsSet$, which incurs a high computational cost. These observations motivate us to explore new methods to enhance ensemble pruning efficiency and accuracy.

Algorithm 2 Hierarchical Pruning
1: procedure HQ-Pruning($Q$, $S_d$, $\beta$, $EnsSet$)
2:     Input: $Q$: the focal diversity metric; $S_d$: the desired ensemble size; $\beta$: the percentage of ensemble teams to be pruned out in each iteration; $EnsSet$: the set of sub-ensemble teams to be considered;
3:     Output: $GEnsSet(Q, S_d)$: the set of good ensemble teams of size $S_d$ identified by the focal diversity metric $Q$.
4:     Initialize $pruneSet = \{\}$ ▷ Subsets of member models to prune out
5:     for $S = 2$ to $S_d$ do
6:         Initialize $GEnsSet(Q, S) = \{\}$, $D = \{\}$
7:         Construct $EnsSet(S)$ of size-$S$ ensembles
8:         for $i = 1$ to $|EnsSet(S)|$ do
9:             if $T_i$ contains any subset in $pruneSet$ then
10:                continue ▷ Prune out this sub-ensemble $T_i$ and avoid expanding along this branch
11:            else
12:                $q_i = FQ(T_i)$ ▷ $FQ(T_i)$ is obtained through Algorithm 1 using the focal diversity metric $Q$
13:                $D$.append($q_i$)
14:                $GEnsSet(Q, S)$.add($T_i$)
15:            end if
16:        end for
17:        $n = \beta \times |GEnsSet(Q, S)|$
18:        Sort $T_i \in GEnsSet(Q, S)$ by $q_i \in D$
19:        Remove the $n$ sub-ensembles with the lowest ensemble diversity from $GEnsSet(Q, S)$ and add them into $pruneSet$
20:    end for
21:    return $GEnsSet(Q, S_d)$
22: end procedure

3.2. Hierarchical Pruning Overview and Algorithm

Our focal diversity metrics exhibit an anti-monotonicity property: a superset of a low diversity ensemble also has low diversity. Specifically, a low focal diversity score for an ensemble team, e.g., $F_0F_2$, often implies insufficient ensemble diversity, that is, a high correlation of its member models in making similar prediction errors, which makes these member models highly redundant within any larger superset ensemble team, such as $F_0F_1F_2$. Hence, large ensembles that contain a small sub-ensemble with low diversity (e.g., $F_0F_2$), such as $F_0F_1F_2$, $F_0F_2F_3$, $F_0F_1F_2F_3$, and $F_0F_1F_2F_4$, tend to have lower ensemble diversity than other ensembles of the same size with more diverse member models. Therefore, when we identify a sub-ensemble with low ensemble diversity, the larger ensembles that are supersets of this sub-ensemble can be preemptively pruned out, which motivates our hierarchical ensemble pruning approach. The pseudo code is given in Algorithm 2. Overall, our hierarchical pruning is an iterative process of composing and selecting deep ensembles for a desired ensemble team size $S_d$. First, we start the hierarchical pruning process with the ensembles of size $S = 2$, that is, $|EnsSet(S=2)| = \binom{M}{2} = 10(10-1)/2 = 45$ candidate ensemble teams for $M = 10$. Second, for a given focal diversity metric, we rank the candidate ensembles of size $S$, such as $S = 2$, by their focal diversity scores ($q_i$), prune out the bottom $\beta$ percent of ensembles with low focal diversity scores, and add them into $pruneSet$ as the pruning targets (Lines 17 to 19). Here, a dynamic $\beta$ can be configured to accommodate the concrete number of candidate ensembles and the diversity score distribution. A conservative strategy is to set a small $\beta$, e.g., $\beta = 10\%$ by default. Third, we preemptively prune out all subsequent ensemble teams of the larger size $S+1$ that contain one or more pruned sub-ensembles in $pruneSet$ (Lines 7 to 16). By iterating through these pruning steps, our hierarchical pruning can efficiently identify high quality sub-ensembles of the desired team size $S_d$ and add them into $GEnsSet(Q, S_d)$.
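A minimal Python sketch of this iterative superset pruning is given below, where teams are frozensets of member indices and fq maps each surviving team to its focal diversity score from Algorithm 1; all names are illustrative:

from itertools import combinations

def hq_pruning(members, fq, S_d, beta=0.10):
    prune_set = []  # low-diversity teams; their supersets are skipped
    survivors = []
    for S in range(2, S_d + 1):
        candidates = [frozenset(t) for t in combinations(members, S)]
        survivors = [t for t in candidates
                     if not any(p <= t for p in prune_set)]  # superset check
        survivors.sort(key=lambda t: fq[t])   # ascending focal diversity
        n = int(beta * len(survivors))
        prune_set.extend(survivors[:n])       # prune the bottom-beta teams
        survivors = survivors[n:]
    return survivors                          # GEnsSet(Q, S_d)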

Figure 2. Hierarchical Pruning on ImageNet

We present a hierarchical ensemble pruning example on ImageNet in Figure 2. With $M = 10$, we have all 10 member networks at the top, followed by all sub-ensembles of size $S = 2$. For each tier, we add one additional network to the candidate ensemble teams of size $S$ ($2 \leq S < S_d$) and place these extended ensembles of size $S+1$ in the next tier, until the ensembles of the desired team size $S_d$ are placed in the last tier. Meanwhile, for each pruned ensemble, such as $F_0F_2$, our hierarchical pruning algorithm preemptively cuts off the branches of candidate ensemble teams that contain this pruned ensemble, i.e., all ensembles that are supersets of $F_0F_2$, marked by the red crosses in Figure 2. This way of building ensemble teams allows us to compose high quality sub-ensembles and strategically remove the ensembles with low diversity. The example in Figure 2 shows that our hierarchical pruning can indeed effectively prune out the ensemble teams with low diversity and low accuracy, such as $F_0F_1F_2$, $F_0F_2F_3$, $F_0F_1F_2F_3$ and $F_0F_1F_2F_4$, and avoid exploring those unpromising branches.

We first apply our F-BD and F-GD hierarchical ensemble pruning algorithms on CIFAR-10 to prune the entire deep ensemble of 10 member networks, setting $\beta = 10\%$ and $S_d = 5$. Figure 3 visualizes the ensemble pruning results, where the black and red dots denote the pruned and selected ensemble teams respectively. Three interesting observations should be highlighted. First, for our focal diversity metrics, including F-BD and F-GD, the ensemble teams with high focal diversity scores tend to have high ensemble accuracy, especially when we compare ensemble teams of the same size $S$, e.g., $S = 3, 4, 5$, which is an important property for guiding ensemble pruning. Second, our hierarchical ensemble pruning approach can effectively identify the ensemble teams with insufficient ensemble diversity by their low focal diversity scores. The effectiveness of our hierarchical pruning can be attributed to two factors. On the one hand, our focal diversity powered hierarchical pruning encourages a fairer comparison of focal diversity scores among the ensembles of the same size $S$. Compared to Figure 1b, which shows all possible ensembles of mixed team sizes, Figure 3 shows a much clearer correlation between the focal diversity score and ensemble accuracy for the ensembles of the same size $S = 3, 4, 5$. On the other hand, our focal diversity metrics can more precisely capture the failure independence of the member networks of a deep ensemble with the anti-monotonicity property, ensuring that the focal diversity based ensemble pruning is highly accurate and efficient. Third, the time and space complexity of the hierarchical pruning algorithm and the pruned ensemble execution cost are ultimately bounded by the desired team size $S_d$. For example, we recommend setting the desired ensemble team size $S_d$ up to $50\% \times M$, which allows our hierarchical pruning to effectively find sub-ensembles that offer on-par or improved ensemble accuracy over the entire ensemble of $M$ member networks at a substantially smaller team size, such as one third or one half of $M$, with a significant cost reduction in ensemble execution. Finally, in Figures 3c and 3f, all the selected sub-ensembles (red dots) with the desired team size $S_d = 5$ provide higher ensemble accuracy than the reference accuracy of 96.33% for CIFAR-10, which yields 100% precision for our hierarchical pruning with F-BD and F-GD. We observe similar ensemble pruning results using the other two focal diversity metrics, F-CK and F-KW.

Figure 3. Deep Ensembles of Size $S = 3, 4, 5$ on CIFAR-10 ($\beta = 10\%$, $S_d = 5$): top three panels for F-BD, bottom three panels for F-GD. (a) S=3, pruning 35 out of 120; (b) S=4, pruning 124 out of 210; (c) S=5, pruning 216 out of 252; (d) S=3, pruning 35 out of 120; (e) S=4, pruning 124 out of 210; (f) S=5, pruning 215 out of 252.

Focal Diversity Pruning by Focal Diversity Consensus Voting. Using different focal diversity metrics in hierarchical ensemble pruning may produce different sets of selected ensembles ($GEnsSet$). We observe that the ensemble teams chosen by the majority of our focal diversity metrics (e.g., F-BD pruning, F-KW pruning, F-GD pruning, etc.) tend to consistently outperform the entire deep ensemble in both ensemble accuracy and execution cost. Therefore, we introduce a third step in our hierarchical ensemble pruning, which consolidates the ensembles selected by different focal diversity metrics into a majority voting based final selection. Concretely, an ensemble team is added to this final selection only if it is chosen by the majority of the focal diversity metrics, i.e., at least three of the four focal diversity metrics, via hierarchical pruning. This third step further refines the focal diversity powered hierarchical pruning results and consistently delivers enhanced precision and robustness in ensemble pruning.
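This consensus step reduces to a simple vote count across the per-metric selections, as the following sketch shows; the dictionary layout is an assumption for illustration:

from collections import Counter

def majority_f(selections, min_votes=3):
    """selections: dict mapping metric name (e.g., 'F-GD') -> set of teams."""
    votes = Counter(team for teams in selections.values() for team in teams)
    return {team for team, v in votes.items() if v >= min_votes}

# Example usage with hypothetical per-metric selections s1..s4:
# final = majority_f({"F-CK": s1, "F-BD": s2, "F-KW": s3, "F-GD": s4})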

4. Formal Analysis

Deep neural network ensembles form a team of multiple deep neural networks and combine the predictions of the individual member networks to make the final prediction. It is a commonly held view that an ensemble team consisting of diverse member networks achieves high prediction performance (Tsoumakas et al., 2009; Martínez-Muñoz et al., 2009; Lazarevic and Obradovic, 2001). Our focal diversity based ensemble pruning methods can efficiently identify small pruned deep ensembles with highly diverse member networks. These pruned deep ensembles can achieve the same or even improved prediction performance with significantly reduced space and time costs compared to the entire deep ensemble. In this section, we present a formal analysis of the desired properties of these pruned deep ensembles to further demonstrate the effectiveness of our proposed methods.

4.1. Diversity by Uncorrelated Errors

For an ensemble team of size $S$, following (Tumer and Ghosh, 1996; Wu et al., 2021), we can derive the added error of its ensemble prediction, $E_{add}^{avg}$, using model averaging ($avg$) as the ensemble consensus method, as Formula (1) shows,

(1) $E_{add}^{avg} = E_{add}\left(\frac{1 + (S-1)\delta}{S}\right)$

where $E_{add}$ is the added error of a single network and $\delta$ is the expected average correlation of all member networks in the ensemble. The ideal scenario is therefore when all member networks in an ensemble team of size $S$ are diverse: they learn and predict with uncorrelated errors (failure independence), i.e., $\delta \leq 0$, and a simple model averaging method reduces the overall added error by at least a factor of $S$. The worst scenario happens when the errors of the individual networks are perfectly correlated, with $\delta = 1$. For example, when all $S$ member networks are perfect duplicates, the error of the ensemble is identical to the single-network error without any improvement. In practice, given that it is challenging to directly measure the correlation $\delta$, many ensemble diversity metrics have been proposed to quantify the correlation among the member networks of an ensemble team. Our focal diversity metrics significantly improve ensemble diversity measurement and are closely correlated with ensemble prediction performance, so they can be directly leveraged for identifying high-quality ensemble teams.
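As a quick numeric illustration of Formula (1), the following snippet evaluates the added ensemble error for a team of size $S = 5$ as the average correlation $\delta$ varies:

def added_ensemble_error(e_add, S, delta):
    # Formula (1): E_add^avg = E_add * (1 + (S - 1) * delta) / S
    return e_add * (1 + (S - 1) * delta) / S

for delta in (0.0, 0.5, 1.0):
    print(delta, added_ensemble_error(1.0, 5, delta))
# delta=0.0 -> 0.2 (error cut by a factor of S); delta=1.0 -> 1.0 (no gain)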

4.2. Ensemble Robustness

A deep neural network model is typically trained to minimize a cross-entropy loss and to output a probability vector that approximates the posterior probabilities of the corresponding classes. Let $f_i(\textbf{x})$ denote the $i$th element of the probability vector of a classifier $F$ for input $\textbf{x}$, which predicts the probability of class $i$. The classifier $F$ outputs the predicted label $c$ with the highest probability for input $\textbf{x}$, that is, $c = \arg\max_{1 \leq i \leq C} f_i(\textbf{x})$.

The robustness of a classifier $F$ can be assessed by its ability to maintain a consistent prediction for a given input $\textbf{x}$ under an input perturbation $\mu$, such as the noise from adversarial samples. When the magnitude of the input perturbation $\mu$ remains within a certain bound, the prediction made by the classifier $F$ remains unaffected. We can leverage such a bound to compare the level of robustness of different classifiers, where a higher bound indicates a higher level of robustness. In Theorem 1, we introduce the concept of the robustness bound ($R$) and formally show that this bound fulfills the aforementioned requirement.

Theorem 1 (Robustness Bound ($R$)).

The robustness bound ($R$) for a classifier $F$ can be defined as Formula (2) shows,

(2) $R = \min_{j \neq c} \frac{f_c(\textbf{x}_0) - f_j(\textbf{x}_0)}{\max_{\textbf{x}} ||\nabla(f_c(\textbf{x}) - f_j(\textbf{x}))||_q}$

where $\textbf{x} = \textbf{x}_0 + \mu$. When the magnitude of the perturbation $\mu$ on the input $\textbf{x}_0$ is limited by $||\mu||_p \leq R$, with $p, q$ satisfying $\frac{1}{p} + \frac{1}{q} = 1$ and $1 \leq p, q \leq \infty$, the predicted labels for $\textbf{x}$ and $\textbf{x}_0$ by the classifier $F$ will be identical, indicating that the input perturbation $\mu$ does not change the prediction.

We first introduce Lemma 2 on Lipschitz continuity below and then show the formal proof of Theorem 1.

Lemma 2 (Lipschitz Continuity).

If $g(\textbf{x})$ is Lipschitz continuous, then according to (Paulavičius and Žilinskas, 2006), the following inequality (3) holds:

(3) $|g(\textbf{x}) - g(\textbf{y})| \leq L_q ||\textbf{x} - \textbf{y}||_p$

where $\textbf{x}$ and $\textbf{y}$ are inputs, the Lipschitz constant is $L_q = \max_{\textbf{x}} ||\nabla g(\textbf{x})||_q$, $\frac{1}{p} + \frac{1}{q} = 1$, and $1 \leq p, q \leq \infty$.

Proof of Theorem 1.

Without loss of generality, we assume $g(\textbf{x}) = f_c(\textbf{x}) - f_j(\textbf{x})$ is Lipschitz continuous with Lipschitz constant $L_q^j$, for $j \neq c$. Following Lemma 2 with $\textbf{x} = \textbf{x}_0 + \mu$ and $\textbf{y} = \textbf{x}_0$, we derive Formula (4) below,

(4) $|g(\textbf{x}_0 + \mu) - g(\textbf{x}_0)| \leq L_q^j ||\mu||_p$

where $L_q^j = \max_{\textbf{x}} ||\nabla g(\textbf{x})||_q = \max_{\textbf{x}} ||\nabla(f_c(\textbf{x}) - f_j(\textbf{x}))||_q$, $\frac{1}{p} + \frac{1}{q} = 1$, and $1 \leq p, q \leq \infty$.

Formula (4) can be rearranged as Formula (5),

(5) $g(\textbf{x}_0) - L_q^j ||\mu||_p \leq g(\textbf{x}_0 + \mu) \leq g(\textbf{x}_0) + L_q^j ||\mu||_p$

When $g(\textbf{x}_0 + \mu) < 0$, the predicted class label changes. However, $g(\textbf{x}_0 + \mu)$ is lower bounded by $g(\textbf{x}_0) - L_q^j ||\mu||_p$. If $0 \leq g(\textbf{x}_0) - L_q^j ||\mu||_p$, then $g(\textbf{x}_0 + \mu) \geq 0$, which ensures that the prediction does not change under the small perturbation $\mu$ on the input $\textbf{x}_0$. This leads to Formula (6),

(6) $g(\textbf{x}_0) - L_q^j ||\mu||_p \geq 0 \Rightarrow ||\mu||_p \leq \frac{g(\textbf{x}_0)}{L_q^j}$

that is, Formula (7):

(7) $||\mu||_p \leq \frac{f_c(\textbf{x}_0) - f_j(\textbf{x}_0)}{L_q^j}$

where $L_q^j = \max_{\textbf{x}} ||\nabla(f_c(\textbf{x}) - f_j(\textbf{x}))||_q$, $\frac{1}{p} + \frac{1}{q} = 1$, and $1 \leq p, q \leq \infty$.

In order to ensure that the classification result does not change, that is, $\arg\max_{1 \leq i \leq C} f_i(\textbf{x}_0 + \mu) = c$, we take the minimum of the bound on $\mu$ over $j \neq c$ to obtain inequality (8), with $\frac{1}{p} + \frac{1}{q} = 1$ and $1 \leq p, q \leq \infty$,

(8) $||\mu||_p \leq \min_{j \neq c} \frac{f_c(\textbf{x}_0) - f_j(\textbf{x}_0)}{L_q^j} = \min_{j \neq c} \frac{f_c(\textbf{x}_0) - f_j(\textbf{x}_0)}{\max_{\textbf{x}} ||\nabla(f_c(\textbf{x}) - f_j(\textbf{x}))||_q} = R$

which indicates that as long as $||\mu||_p$ is small enough to satisfy the above bound, the classifier decision never changes, which characterizes the robustness of the classifier $F$. Hence, we formally prove Theorem 1 on the robustness bound ($R$) for a classifier $F$. ∎

For a deep neural network $F_k$, we have its robustness bound as Formula (9) shows.

(9) $R^k = \min_{j \neq c} \frac{f_c^k(\textbf{x}_0) - f_j^k(\textbf{x}_0)}{\max_{\textbf{x}} ||\nabla(f_c^k(\textbf{x}) - f_j^k(\textbf{x}))||_q}$

Letting $g_j^k(\textbf{x}) = f_c^k(\textbf{x}) - f_j^k(\textbf{x})$, we have the robustness bound as Formula (10) shows.

(10) $R^k = \min_{j \neq c} \frac{g_j^k(\textbf{x}_0)}{\max_{\textbf{x}} ||\nabla(g_j^k(\textbf{x}))||_q}$

Given $S$ networks whose predictions are combined with model averaging ($avg$), the $i$th element of the combined probability vector is $f_i^{avg}(\textbf{x}) = \frac{1}{S}\sum_{k=1}^{S} f_i^k(\textbf{x})$, corresponding to the robustness bound in Formula (11).

(11) $R^{avg} = \min_{j \neq c} \frac{f_c^{avg}(\textbf{x}_0) - f_j^{avg}(\textbf{x}_0)}{\max_{\textbf{x}} ||\nabla(f_c^{avg}(\textbf{x}) - f_j^{avg}(\textbf{x}))||_q} = \min_{j \neq c} \frac{g_j^{avg}(\textbf{x}_0)}{\max_{\textbf{x}} ||\nabla(g_j^{avg}(\textbf{x}))||_q}$

Assume the minimum of the robustness bound is achieved with the prediction results $c$ and $j$ for each model $F^k$, including the ensemble $F^{avg}$, as Formula (12) shows.

(12) $R^k = \frac{g_j^k(\textbf{x}_0)}{\max_{\textbf{x}} ||\nabla(g_j^k(\textbf{x}))||_q}, \quad R^{avg} = \frac{g_j^{avg}(\textbf{x}_0)}{\max_{\textbf{x}} ||\nabla(g_j^{avg}(\textbf{x}))||_q}$

where $g_j^k(\textbf{x}) = f_c^k(\textbf{x}) - f_j^k(\textbf{x})$ and $g_j^{avg}(\textbf{x}) = \frac{1}{S}\sum_{k=1}^{S} g_j^k(\textbf{x})$.

We can then derive Theorem 3, which states that $\exists\, k$, $1 \leq k \leq S$, such that $R^k \leq R^{avg}$, where equality corresponds to the case that all member networks are perfect duplicates. This theorem further indicates that ensembles of high diversity can improve the robustness of the individual member networks.

Theorem 3 (Ensemble Robustness Bound Enhancement).

Let $R^{avg}$ denote the robustness bound of an ensemble that combines its member network predictions through model averaging ($avg$). We can always find one member network $F^k$ whose robustness bound $R^k$ satisfies $R^k \leq R^{avg}$.

Proof of Theorem 3.

We prove the ensemble robustness bound enhancement by contradiction. First, we assume $\forall\, k$, $1 \leq k \leq S$, $R^k > R^{avg}$, that is, as Formula (13) shows,

(13) $g_j^k(\textbf{x}_0)\,(\max_{\textbf{x}} ||\nabla(g_j^{avg}(\textbf{x}))||_q) > g_j^{avg}(\textbf{x}_0)\,(\max_{\textbf{x}} ||\nabla(g_j^k(\textbf{x}))||_q)$

For each $k \in \{1, \dots, S\}$, this inequality holds. Summing over all $k$, we have Formula (14).

(14) $\sum_{k=1}^{S} g_j^k(\textbf{x}_0)\,(\max_{\textbf{x}} ||\nabla(g_j^{avg}(\textbf{x}))||_q) > \sum_{k=1}^{S} g_j^{avg}(\textbf{x}_0)\,(\max_{\textbf{x}} ||\nabla(g_j^k(\textbf{x}))||_q)$

That is, Formula (15).

(15) $(\max_{\textbf{x}} ||\nabla(g_j^{avg}(\textbf{x}))||_q) \sum_{k=1}^{S} g_j^k(\textbf{x}_0) > g_j^{avg}(\textbf{x}_0) \sum_{k=1}^{S} (\max_{\textbf{x}} ||\nabla(g_j^k(\textbf{x}))||_q)$

Given $g_j^{avg}(\textbf{x}) = \frac{1}{S}\sum_{k=1}^{S} g_j^k(\textbf{x})$, we have Formula (16):

(16) $(\max_{\textbf{x}} ||\nabla(\sum_{k=1}^{S} g_j^k(\textbf{x}))||_q)\,\frac{1}{S}\sum_{k=1}^{S} g_j^k(\textbf{x}_0) > \frac{1}{S}\sum_{k=1}^{S} g_j^k(\textbf{x}_0) \sum_{k=1}^{S} (\max_{\textbf{x}} ||\nabla(g_j^k(\textbf{x}))||_q)$

Therefore, we have the following Formula (17).

(17) $\max_{\textbf{x}} ||\nabla(\sum_{k=1}^{S} g_j^k(\textbf{x}))||_q > \sum_{k=1}^{S} (\max_{\textbf{x}} ||\nabla(g_j^k(\textbf{x}))||_q)$

According to the triangle inequality, we have Formula (18).

(18) $\max_{\textbf{x}} ||\nabla(\sum_{k=1}^{S} g_j^k(\textbf{x}))||_q \leq \max_{\textbf{x}} (\sum_{k=1}^{S} ||\nabla(g_j^k(\textbf{x}))||_q) \leq \sum_{k=1}^{S} (\max_{\textbf{x}} ||\nabla(g_j^k(\textbf{x}))||_q)$

which contradicts the derived inequality in Formula (17). Therefore, the assumption does not hold, and we have $\exists\, k$, $1 \leq k \leq S$, such that $R^k \leq R^{avg}$, demonstrating that the robustness of a member network can be further improved by an ensemble team. Furthermore, for a network $F^k$ whose robustness bound $R^k$ is not attained at class $j$, there exists $i \neq j$ ($i, j \neq c$) with $R^k = \frac{g_i^k(\textbf{x}_0)}{\max_{\textbf{x}} ||\nabla(g_i^k(\textbf{x}))||_q} \leq \frac{g_j^k(\textbf{x}_0)}{\max_{\textbf{x}} ||\nabla(g_j^k(\textbf{x}))||_q}$, so Theorem 3 still holds. ∎

The above analysis formally certifies that an ensemble team of diverse member networks can further improve the robustness of individual networks.

5. Experimental Evaluation

We conducted a comprehensive experimental evaluation on four benchmark datasets, CIFAR-10, ImageNet, Cora and MNIST, for pruning the given entire ensemble of 10 individual member models for CIFAR-10, ImageNet and Cora and 7 member models for MNIST (see Table 4 for all member models). All the experiments were performed on an Intel i7-10700K server with the NVIDIA GeForce RTX 3090 (24GB) GPU on Ubuntu 20.04.

Figure 4. Impact of $\beta$ on Precision and Recall ($S_d = 4$, F-GD)

5.1. Efficiency of Pruning with Varying β\beta

The hyperparameter $\beta$ in our hierarchical ensemble pruning determines the percentage of candidate ensembles to be pruned out during each iteration for a specific ensemble team size $S$. In general, a higher $\beta$ removes more ensemble teams in our hierarchical pruning, potentially lowering the recall. On the other hand, a high precision is expected when the ensemble pruning algorithm can precisely find the high quality sub-ensembles with much smaller team sizes and equal or better ensemble accuracy than the reference accuracy of the entire ensemble team. We analyze the impact of the hyperparameter $\beta$ in Figure 4, where we vary $\beta$ from 5% to 35% for both CIFAR-10 and ImageNet using F-GD based hierarchical pruning and the desired team size $S_d = 4$. As shown in Figure 4, as $\beta$ increases, the precision of ensemble pruning increases and the recall decreases for both CIFAR-10 and ImageNet. Our focal diversity based hierarchical ensemble pruning is designed to achieve high precision and good recall, e.g., over 75% precision and 50% recall. Therefore, with $S_d = 4$ on CIFAR-10, we set $\beta = 20\%$ to obtain 81.25% precision and 52% recall with F-GD based hierarchical pruning. We follow the same principles in determining the $\beta$ value for the other experiments.

Table 8. Impact of Desired Team Size $S_d$ on Hierarchical Pruning (CIFAR-10)
$S_d$ 4 5 6
$\beta$ (%) 20 10 4
Precision (%) 81.25 100 97.56
Recall (%) 52 47.14 57.14

5.2. Impact of Desired Team Size $S_d$

The hyperparameter $S_d$ specifies the target ensemble team size after ensemble pruning. Table 8 presents the impact of varying $S_d$ on our focal diversity powered hierarchical pruning approach on CIFAR-10. For each $S_d$, we find the optimal $\beta$ following the principles introduced in Section 5.1. We highlight two interesting observations. First, for different $S_d$, our hierarchical pruning consistently delivers high precision of at least 81.25% with good recall, which demonstrates the effectiveness of our hierarchical pruning approach. In particular, $S_d=5$ produces the highest precision of 100%. Second, as the desired team size $S_d$ increases, the optimal $\beta$ value decreases, indicating that fewer ensembles should be pruned out in each iteration to achieve a good recall rate. In practice, we recommend setting the desired ensemble team size $S_d$ up to half the size of the entire ensemble, which allows our hierarchical pruning to efficiently find sub-ensembles that offer comparable or improved ensemble accuracy over the entire ensemble while substantially reducing the ensemble execution cost.
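To make the interaction between $S_d$ and $\beta$ concrete, here is a simplified Python reconstruction of the hierarchical pruning loop, not the released HQ implementation: only teams that survive the diversity ranking at size $S$ are expanded to size $S+1$, and the bottom $\beta$ fraction by focal diversity is pruned at each size. The `focal_diversity` argument is an assumed caller-supplied scoring function.

```python
from itertools import combinations

def hierarchical_prune(members, focal_diversity, S_d, beta):
    """Simplified sketch of focal diversity guided hierarchical pruning.

    members         : list of member network identifiers, e.g. range(10)
    focal_diversity : callable scoring a team (frozenset) by focal diversity
    S_d             : desired (maximum) team size after pruning
    beta            : fraction of candidate teams pruned out per iteration
    """
    # Start from all pairs, then iteratively expand only surviving teams.
    survivors = [frozenset(t) for t in combinations(members, 2)]
    for size in range(2, S_d + 1):
        if size > 2:
            # Supersets of pruned low-diversity teams are never generated.
            survivors = list({team | {m} for team in survivors
                              for m in members if m not in team})
        ranked = sorted(survivors, key=focal_diversity, reverse=True)
        keep = max(1, int(round(len(ranked) * (1 - beta))))
        survivors = ranked[:keep]        # prune the bottom beta fraction
    return survivors
```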

5.3. Focal Diversity Pruning Methods

We then evaluate our hierarchical ensemble pruning algorithm with four focal diversity metrics and the focal diversity consensus voting based refinement. The precision, recall and cost reduction of ensemble pruning are primarily used to measure the pruning efficiency on four benchmark datasets.

Table 9. Hierarchical Pruning with $S_d$=5 on CIFAR-10
Methods F-CK F-BD F-KW F-GD MAJORITY-F
Precision (%) 85.71 100 100 100 100
Recall (%) 17.14 51.43 51.43 52.86 47.14
Cost Reduction 50% 50% 50% 50% 50%
Table 10. Hierarchical Pruning with $S_d$=4 on CIFAR-10
Methods F-CK F-BD F-KW F-GD MAJORITY-F
Precision (%) 26.09 81.25 81.25 81.25 81.25
Recall (%) 12.00 52.00 52.00 52.00 52.00
Cost Reduction 60% 60% 60% 60% 60%

CIFAR-10. We compare the four focal diversity metrics using our hierarchical ensemble pruning and the focal diversity consensus voting based pruning (MAJORITY-F) in Table 9, where we set $\beta=10\%$ and $S_d=5$; the F-BD and F-GD pruning results correspond to Figures 3c and 3f. Two interesting observations should be highlighted. First, our hierarchical ensemble pruning approach achieves a very high precision of over 85% with all four focal diversity metrics. In particular, three focal diversity metrics, F-BD, F-KW and F-GD, produce 100% precision in identifying high quality sub-ensembles with much smaller team sizes and equal or better ensemble accuracy than the entire deep ensemble. Second, the focal diversity consensus voting based ensemble pruning, coined as MAJORITY-F, maintains the 100% precision. MAJORITY-F refines the ensemble pruning results by pruning out the ensemble teams that are not chosen by the majority of the four focal diversity metrics, which slightly reduces the total number of selected ensembles (i.e., true positives given 100% precision). Hence, the slightly lower recall score of MAJORITY-F compared with F-BD, F-KW and F-GD is expected. We find similar observations in Table 10 by setting $\beta=20\%$ and $S_d=4$. Even though the F-CK based hierarchical pruning achieves lower performance than the other three focal diversity metrics, the focal diversity consensus voting method, MAJORITY-F, still delivers a consistent high precision of 81.25% and recall of 52%.
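As a sketch of this consensus step, the following hypothetical helper keeps a team only when a strict majority of the focal diversity metrics selected it; the exact vote threshold and tie handling in our implementation may differ.

```python
def majority_f_consensus(selections):
    """Focal diversity consensus voting (MAJORITY-F) sketch.

    selections : dict mapping a focal metric name to the set of ensemble
                 teams selected under that metric, e.g.
                 {"F-CK": {...}, "F-BD": {...}, "F-KW": {...}, "F-GD": {...}}
    """
    votes = {}
    for teams in selections.values():
        for team in teams:
            votes[team] = votes.get(team, 0) + 1
    # A strict majority of the focal metrics must select the team.
    threshold = len(selections) / 2
    return {team for team, v in votes.items() if v > threshold}
```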

Table 11. Hierarchical Pruning with $S_d$=5 on ImageNet
Methods F-CK F-BD F-KW F-GD MAJORITY-F
Precision (%) 0 75.61 75.61 74.42 78.95
Recall (%) 0 64.58 64.58 66.67 62.50
Cost Reduction 50% 50% 50% 50% 50%

ImageNet. We then evaluate our hierarchical ensemble pruning approach on ImageNet by setting $\beta=10\%$ and $S_d=5$. Table 11 shows the experimental results. We highlight two interesting observations. First, our hierarchical ensemble pruning approach performs very well with a high precision of at least 74.42% when using the F-BD, F-KW and F-GD metrics: it is effective and accurate in selecting high quality sub-ensembles with a much smaller team size ($S=5$ vs. $M=10$) and higher ensemble accuracy than the 79.82% reference accuracy of the entire ensemble team on ImageNet. On the other hand, the F-CK based pruning failed to identify any satisfactory ensemble teams, which indicates an inherent limitation of the CK diversity, even though it has been optimized by the focal diversity measures. Second, our focal diversity consensus voting based ensemble pruning, MAJORITY-F, effectively improves the precision by over 3.34%, from 74.42%$\sim$75.61% to 78.95%, which further demonstrates the enhanced ensemble pruning performance gained by consolidating the majority of the focal diversity metrics.

Table 12. Hierarchical Pruning with $S_d$=4 on Cora
Methods F-CK F-BD F-KW F-GD MAJORITY-F
Precision (%) 91.67 86.30 86.30 84.93 87.14
Recall (%) 64.71 61.76 61.76 60.78 59.80
Cost Reduction 60% 60% 60% 60% 60%

Cora. We also evaluate and compare our focal diversity based hierarchical pruning approach on Cora, a graph dataset. Table 12 shows the results with $\beta=10\%$ and $S_d=4$. All focal diversity based hierarchical pruning methods achieve at least 84.93% precision in identifying high quality sub-ensembles with the same or improved ensemble accuracy over 87.90%, the reference accuracy of the entire ensemble for Cora. Interestingly, the F-CK pruning achieves the best ensemble pruning performance on Cora, with the highest precision of 91.67% and the highest recall of 64.71%, indicating that different focal diversity pruning methods offer different utilities on different datasets. Therefore, the focal diversity consensus voting based ensemble pruning is beneficial for delivering more stable and consistent ensemble pruning efficiency in terms of precision and robustness.

Table 13. Hierarchical Pruning with $S_d$=3 on MNIST
Methods F-CK F-BD F-KW F-GD MAJORITY-F
Precision (%) 100 75 75 75 75
Recall (%) 60 60 60 60 60
Cost Reduction 57% 57% 57% 57% 57%

MNIST. We then use popular machine learning models, such as KNN, SVM and Logistic Regression in Table 4, to evaluate the effectiveness of our focal diversity based hierarchical pruning methods on MNIST. Table 13 shows the results with $\beta=50\%$ and $S_d=3$. We highlight two interesting observations. First, all focal diversity based pruning methods achieve at least 75% precision in identifying sub-ensembles with the same or improved ensemble accuracy over 96.36%, the reference accuracy of the entire ensemble for MNIST. This demonstrates that our focal diversity based hierarchical pruning methods are also effective on representative machine learning models. Second, among the Top-3 ensemble teams selected by F-GD or F-CK, the ensemble team 346 (RBF SVM, Random Forest and Neural Network) achieves the highest accuracy of 97.04%, significantly outperforming the entire ensemble of 7 machine learning models with 96.36% accuracy while reducing the ensemble execution cost by over 57%. Moreover, given that Random Forest is itself an ensemble of decision trees, this shows that integrating ensemble models, such as Random Forest, into an ensemble team can further improve the overall predictive performance.
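To illustrate the kind of ensemble behind team 346, here is a hedged scikit-learn sketch that trains an RBF SVM, a Random Forest and a neural network and combines them by hard (majority) voting; it uses the small sklearn digits dataset as a stand-in for MNIST, and the hyperparameters are illustrative rather than those of Table 4.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)          # small stand-in for MNIST
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# A team like "346": RBF SVM, Random Forest and a neural network.
team = VotingClassifier(
    estimators=[
        ("rbf_svm", SVC(kernel="rbf", gamma="scale")),
        ("random_forest", RandomForestClassifier(n_estimators=100)),
        ("neural_net", MLPClassifier(max_iter=500)),
    ],
    voting="hard",                           # simple majority vote
)
team.fit(X_tr, y_tr)
print("ensemble accuracy:", team.score(X_te, y_te))
```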

5.4. Focal vs. Baseline Diversity in Hierarchical Pruning

We next compare our focal diversity metrics with the baseline diversity metrics by using the same hierarchical pruning algorithm. Figure 5 compares our focal F-GD diversity (Figure 5a) and the baseline GD diversity (Figure 5b) in hierarchical pruning for the ensemble teams of size $S=4$. The black dots mark the ensembles that are pruned out by the hierarchical pruning, while the red ones represent the remaining selected ensembles. The reference ensemble accuracy of 96.33% achieved by the entire deep ensemble on CIFAR-10 is marked by the horizontal black dashed line. Overall, the hierarchical pruning with our focal F-GD diversity metric achieves much better performance than with the baseline GD diversity metric. There are two primary reasons behind this observation. First, our focal diversity metrics better capture the failure independence of the member networks of a deep ensemble than the baseline diversity metrics. Therefore, when a low focal diversity branch is pruned out, such as in Figure 3d with $S=3$, most larger ensembles with low diversity (low F-GD scores) are also pruned out in Figure 3e (same as Figure 5a) with $S=4$. Second, focal diversity metrics correlate more strongly with ensemble accuracy than baseline diversity metrics. Therefore, by pruning out low diversity ensembles, our focal diversity metrics can successfully identify high performance ensemble teams with high ensemble accuracy and low ensemble execution cost.
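For reference, the baseline GD metric in Figure 5b is the generalized diversity of Partridge and Krzanowski (1997), which can be computed from the per-sample failure counts of the member networks; the sketch below follows that published definition and assumes a boolean correctness matrix as input.

```python
import numpy as np

def generalized_diversity(correct):
    """Baseline GD metric (Partridge & Krzanowski, 1997) for one team.

    correct : (n_samples, n_members) boolean array, True where a member
              network classifies the sample correctly.
    """
    n_samples, n = correct.shape
    failures = (~correct).sum(axis=1)             # members failing per sample
    p = np.bincount(failures, minlength=n + 1) / n_samples  # P(i members fail)
    i = np.arange(n + 1)
    p1 = np.sum(i / n * p)                        # expected failure fraction
    p2 = np.sum(i * (i - 1) / (n * (n - 1)) * p)  # P(two random members both fail)
    # GD = 1 - p(2)/p(1); return maximal diversity if no member ever fails.
    return 1.0 - p2 / p1 if p1 > 0 else 1.0
```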

Figure 5. Hierarchical Pruning with F-GD and GD for Ensemble Teams of $S=4$ (CIFAR-10, $\beta=10\%$, $S_d=5$): (a) Focal Diversity (F-GD); (b) Baseline Diversity (GD)
Table 14. Comparing Top-3 Ensemble Teams of $S$=4 by F-GD and GD (CIFAR-10, $\beta=10\%$, $S_d=5$) with the 96.33% accuracy of the entire ensemble of 10 models shown in Table 3
Method | Ensemble Team | Ensemble Acc (%) | Acc Improv (%) (Over 96.33%)
F-GD | 0235 | 96.89 | +0.56
F-GD | 0234 | 96.84 | +0.51
F-GD | 0123 | 97.15 | +0.82
GD | 0268 | 96.18 | -0.15
GD | 0456 | 95.63 | -0.70
GD | 1345 | 96.17 | -0.16
Table 15. Comparing Top-3 Ensemble Teams of $S$=5 by F-GD and GD (ImageNet, $\beta=10\%$, $S_d=5$) with the 79.82% accuracy of the entire ensemble of 10 models
Method | Ensemble Team | Ensemble Acc (%) | Acc Improv (%) (Over 79.82%)
F-GD | 12345 | 80.77 | +0.95
F-GD | 23459 | 80.50 | +0.68
F-GD | 23458 | 80.44 | +0.62
GD | 03478 | 78.53 | -1.29
GD | 02356 | 79.25 | -0.57
GD | 01257 | 79.19 | -0.63

Table 14 lists the Top-3 sub-ensembles selected by F-GD and GD in Figure 5. All of the Top-3 sub-ensembles selected by our hierarchical pruning using the focal diversity F-GD improve over the reference accuracy of 96.33% achieved by the entire ensemble on CIFAR-10. In comparison, the Top-3 sub-ensembles chosen by the baseline GD diversity all deliver ensemble accuracy below this reference accuracy, failing to meet the ensemble pruning objective. Similar observations hold for the ImageNet dataset in Table 15.

The empirical results on all four benchmark datasets demonstrate that our focal diversity based hierarchical pruning framework and algorithms are effective in identifying and selecting space and time efficient sub-ensembles from a given large ensemble team, while offering competitive ensemble accuracy.

Table 16. Execution Time (s) Comparison of Baseline and Hierarchical Ensemble Pruning
Approach | Metric | CIFAR-10 | ImageNet
Baseline (No Pruning) | BD | 0.90 | 79.08
Baseline (No Pruning) | GD | 0.84 | 75.83
Baseline (No Pruning) | F-BD | 2.01 | 97.26
Baseline (No Pruning) | F-GD | 2.01 | 93.51
Hierarchical Pruning | F-BD | 0.63 | 21.89
Hierarchical Pruning | F-GD | 0.62 | 21.79

5.5. Execution Time Comparison

We show the execution time of our hierarchical ensemble pruning based on the F-BD and F-GD metrics in Table 16 for CIFAR-10 and ImageNet. The baseline is no pruning, where a diversity score, such as GD or F-GD, is computed for every possible ensemble team. We highlight three interesting observations. First, compared to the baseline focal diversity metrics, F-BD and F-GD, our hierarchical pruning significantly accelerates the process of identifying high quality sub-ensembles by 3.2$\sim$4.4$\times$, substantially improving the ensemble pruning efficiency. Table 17 presents the execution time breakdown for each team size $S$ from 2 to 5 under the baseline no pruning and our hierarchical ensemble pruning. For each team size, we list the number of all sub-ensemble teams (#All Teams), the F-GD diversity computation time over all sub-ensemble teams for the baseline no pruning, and, for the hierarchical pruning approach, the execution time for calculating F-GD scores, the number of teams that participate in the diversity computation (#Teams), and the reduction in both execution time and #Teams, on CIFAR-10 and ImageNet respectively. Our hierarchical pruning examines all 45 smallest ensembles without #Teams reduction for $S$=2. The slight execution time reduction (0.05s, 3.65%) on ImageNet for $S$=2 can be attributed to random measurement error. For both CIFAR-10 and ImageNet, the execution time reduction and #Teams reduction achieved by our hierarchical pruning closely match each other, such as the 89.95% time reduction and 89.29% of teams pruned out for $S$=5 on ImageNet. This indicates that the overall performance improvement (3.2$\sim$4.4$\times$) of our hierarchical pruning originates from the pruned teams, which do not participate in the diversity computation and thus reduce the execution time. Second, for the baseline without pruning, our focal diversity metrics, F-BD and F-GD, take longer than BD and GD, because computing a focal diversity score involves two steps: computing the focal negative correlation scores and aggregating the $S$ focal negative correlation scores for each sub-ensemble of size $S$. Third, compared to the baseline diversity metrics, BD and GD, our F-BD and F-GD powered hierarchical pruning reduces the execution time by 26%$\sim$72%, further demonstrating the efficiency of our hierarchical pruning approach.
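The two-step focal diversity computation mentioned above can be sketched as follows; this is a minimal reading of the procedure, assuming that the focal negative correlation score for a focal model is obtained by applying a base diversity measure to the samples that the focal model misclassifies, and that averaging is the aggregation over the $S$ focal scores. The base metric, e.g., `generalized_diversity` above, is supplied by the caller.

```python
import numpy as np

def focal_diversity(correct, base_metric):
    """Two-step focal diversity sketch for one team of S member networks.

    correct     : (n_samples, S) boolean matrix; correct[n, k] is True when
                  member k classifies sample n correctly.
    base_metric : base diversity measure, e.g. generalized_diversity above.
    """
    S = correct.shape[1]
    focal_scores = []
    for focal in range(S):
        # Step 1: focal negative correlation score, computed only on the
        # negative samples (those misclassified) of the focal model.
        negative_samples = correct[~correct[:, focal]]
        if len(negative_samples) > 0:
            focal_scores.append(base_metric(negative_samples))
    # Step 2: aggregate the S focal negative correlation scores.
    return float(np.mean(focal_scores)) if focal_scores else 0.0
```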

Table 17. Execution Time (s) Breakdown Comparison of Baseline and Hierarchical Ensemble Pruning (F-GD)
CIFAR-10:
Team Size | #All Teams | No Pruning Time (s) | Hierarchical Time (s) | Hierarchical #Teams | Time Reduction (%) | #Teams Reduction (%)
2 | 45 | 0.06 | 0.06 | 45 | 0 | 0
3 | 120 | 0.26 | 0.18 | 85 | 30.77 | 29.17
4 | 210 | 0.63 | 0.23 | 86 | 63.49 | 59.05
5 | 252 | 1.03 | 0.15 | 37 | 85.44 | 85.32
ImageNet:
Team Size | #All Teams | No Pruning Time (s) | Hierarchical Time (s) | Hierarchical #Teams | Time Reduction (%) | #Teams Reduction (%)
2 | 45 | 1.37 | 1.32 | 45 | 3.65 | 0
3 | 120 | 9.17 | 5.82 | 82 | 36.53 | 31.67
4 | 210 | 28.34 | 9.78 | 78 | 65.49 | 62.86
5 | 252 | 52.94 | 5.32 | 27 | 89.95 | 89.29
Table 18. Ensemble Execution Cost Reduction by Hierarchical Ensemble Pruning (ImageNet, $\beta$=5%, $S_d$=5)
Ensemble Team | Ensemble Acc (%) | #Params (M) | GFLOPs | Inference Time (ms)
Baseline (entire ensemble): 0123456789 | 79.82 | 481.73 | 149.94 | 456.00
Top-1 team, $S$=3: 245 | 80.42 (+0.60) | 92.64 (-81%) | 58.50 (-61%) | 153.69 (-66%)
Top-1 team, $S$=4: 2345 | 80.70 (+0.88) | 117.67 (-76%) | 67.04 (-55%) | 160.93 (-65%)
Top-1 team, $S$=5: 12345 | 80.77 (+0.95) | 125.65 (-74%) | 72.80 (-51%) | 208.49 (-54%)

Compared with the entire ensemble of all 10 base models, our focal diversity based hierarchical pruning approach can effectively reduce ensemble execution costs in terms of #Params (storage cost), GFLOPs (computing cost) and inference latency. Table 18 presents the ensemble execution costs for the entire ensemble and the three Top-1 ensembles for $S=3,4,5$ identified by our F-GD powered hierarchical pruning approach. These three sub-ensembles all achieve higher ensemble accuracy than the large 10-model ensemble with much smaller team sizes, significantly improving the ensemble execution efficiency: they cut the storage cost by 74%$\sim$81%, the computing cost by 51%$\sim$61%, and the inference latency by 54%$\sim$66%.
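As an illustration of how such cost numbers can be collected, the sketch below tallies parameter counts and wall-clock inference latency for a candidate team of PyTorch models; GFLOPs would typically come from a separate profiler, and the averaging combiner shown here is only one plausible ensemble consensus method, not necessarily the one used in our released code.

```python
import time
import torch

def ensemble_cost(models, sample):
    """Rough storage and latency cost of an ensemble team (a sketch).

    models : list of torch.nn.Module member networks (in eval mode)
    sample : one input batch used to time sequential ensemble inference
    """
    # Storage cost: total number of parameters across all member networks.
    n_params = sum(p.numel() for m in models for p in m.parameters())
    with torch.no_grad():
        start = time.perf_counter()
        outputs = [m(sample) for m in models]        # run every member network
        latency_ms = (time.perf_counter() - start) * 1e3
    prediction = torch.stack(outputs).mean(dim=0)    # simple model averaging
    return n_params / 1e6, latency_ms, prediction    # (M params, ms, logits)
```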

6. Conclusion

This paper presents our hierarchical ensemble pruning approach powered by the focal diversity metrics. The hierarchical pruning can effectively identify high quality sub-ensembles with a significantly smaller team size and the same or better ensemble accuracy than the entire ensemble team. We made three original contributions. First, we optimize ensemble diversity measurements by using focal diversity metrics to accurately capture the failure independence among the member networks of a deep ensemble, which closely correlates with ensemble predictive performance and provides effective guidance in ensemble pruning. Second, we introduce a hierarchical ensemble pruning algorithm, powered by our focal diversity metrics, which iteratively identifies high quality sub-ensembles and preemptively prunes out candidate ensembles of low focal diversity. Third, we provide a systematic ensemble pruning approach (HQ), which consists of the focal ensemble diversity metrics, the hierarchical ensemble pruning algorithm and the focal diversity consensus voting method. We conducted comprehensive experimental evaluations on four benchmark datasets, CIFAR-10, ImageNet, Cora and MNIST, demonstrating that our HQ approach can efficiently prune large ensemble teams and obtain high quality sub-ensembles with enhanced ensemble accuracy and significantly reduced execution cost compared to the entire large ensemble team. Deep ensembles have been widely used in many domain-specific applications, including intelligent medical care (Hong et al., 2020; Aversano et al., 2021), intelligent transportation (Min et al., 2019; Liu et al., 2020) and intelligent manufacturing (Kim et al., 2020; Hsu and Chien, 2022). Our future work will focus on applying and optimizing our hierarchical ensemble pruning to enhance deep ensemble efficiency for these domain-specific applications.

Acknowledgements.
We acknowledge the partial support from the NSF CISE grants 1564097, 2038029, 2302720, and 2312758, an IBM Faculty Award, and a CISCO Edge AI grant.

References

  • Aversano et al. (2021) Lerina Aversano, Mario Luca Bernardi, Marta Cimitile, and Riccardo Pecori. 2021. Deep neural networks ensemble to detect COVID-19 from CT scans. Pattern Recognition 120 (2021), 108135. https://doi.org/10.1016/j.patcog.2021.108135
  • Banfield et al. (2005) Robert E. Banfield, Lawrence O. Hall, Kevin W. Bowyer, and W. Philip Kegelmeyer. 2005. Ensemble diversity measures and their application to thinning. Information Fusion 6, 1 (2005), 49–62. https://doi.org/10.1016/j.inffus.2004.04.005 Diversity in Multiple Classifier Systems.
  • Bian et al. (2020) Yijun Bian, Yijun Wang, Yaqiang Yao, and Huanhuan Chen. 2020. Ensemble Pruning Based on Objection Maximization With a General Distributed Framework. IEEE Transactions on Neural Networks and Learning Systems 31, 9 (2020), 3766–3774. https://doi.org/10.1109/TNNLS.2019.2945116
  • Breiman (1996) Leo Breiman. 1996. Bagging predictors. Machine learning 24, 2 (1996), 123–140.
  • Breiman (2001) Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5–32.
  • Breiman et al. (1998) Leo Breiman et al. 1998. Arcing classifier (with discussion and a rejoinder by the author). The Annals of Statistics 26, 3 (1998), 801–849.
  • Caruana et al. (2004) Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. 2004. Ensemble Selection from Libraries of Models. In Proceedings of the Twenty-First International Conference on Machine Learning (Banff, Alberta, Canada) (ICML ’04). Association for Computing Machinery, New York, NY, USA, 18. https://doi.org/10.1145/1015330.1015432
  • Chow and Liu (2021) Ka-Ho Chow and Ling Liu. 2021. Robust Object Detection Fusion Against Deception. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining (Virtual Event, Singapore) (KDD ’21). Association for Computing Machinery, New York, NY, USA, 2703–2713. https://doi.org/10.1145/3447548.3467121
  • Chow et al. (2019) Ka-Ho Chow, Wenqi Wei, Yanzhao Wu, and Ling Liu. 2019. Denoising and Verification Cross-Layer Ensemble Against Black-box Adversarial Attacks. In 2019 IEEE International Conference on Big Data (Big Data). 1282–1291. https://doi.org/10.1109/BigData47090.2019.9006090
  • Fort et al. (2020) Stanislav Fort, Huiyi Hu, and Balaji Lakshminarayanan. 2020. Deep Ensembles: A Loss Landscape Perspective. arXiv:1912.02757 [stat.ML]
  • Hong et al. (2020) Shenda Hong, Yanbo Xu, Alind Khare, Satria Priambada, Kevin Maher, Alaa Aljiffry, Jimeng Sun, and Alexey Tumanov. 2020. HOLMES: Health OnLine Model Ensemble Serving for Deep Learning Models in Intensive Care Units. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Virtual Event, CA, USA) (KDD ’20). Association for Computing Machinery, New York, NY, USA, 1614–1624. https://doi.org/10.1145/3394486.3403212
  • Hsu and Chien (2022) Chia-Yu Hsu and Ju-Chien Chien. 2022. Ensemble convolutional neural networks with weighted majority for wafer bin map pattern classification. Journal of Intelligent Manufacturing 33, 3 (2022), 831–844. https://doi.org/10.1007/s10845-020-01687-7
  • Ju et al. (2017) Cheng Ju, Aurélien Bibaut, and Mark van der Laan. 2017. The Relative Performance of Ensemble Methods with Deep Convolutional Neural Networks for Image Classification. Journal of Applied Statistics 45 (04 2017).
  • Kim et al. (2020) Myeongso Kim, Minyoung Lee, Minjeong An, and Hongchul Lee. 2020. Effective automatic defect classification process based on CNN with stacking ensemble model for TFT-LCD panel. Journal of Intelligent Manufacturing 31, 5 (2020), 1165–1174. https://doi.org/10.1007/s10845-019-01502-y
  • Kohavi and Wolpert (1996) Ron Kohavi and David Wolpert. 1996. Bias plus Variance Decomposition for Zero-One Loss Functions. In Proceedings of the Thirteenth International Conference on International Conference on Machine Learning (Bari, Italy) (ICML’96). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 275–283.
  • Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report. University of Toronto.
  • Kuncheva and Whitaker (2003) Ludmila I. Kuncheva and Christopher J. Whitaker. 2003. Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Mach. Learn. 51, 2 (May 2003), 181–207. https://doi.org/10.1023/A:1022859003006
  • Lazarevic and Obradovic (2001) A. Lazarevic and Z. Obradovic. 2001. Effective pruning of neural network classifier ensembles. In IJCNN’01. International Joint Conference on Neural Networks. Proceedings (Cat. No.01CH37222), Vol. 2. 796–801 vol.2.
  • Lecun et al. (1998) Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86, 11 (Nov 1998), 2278–2324. https://doi.org/10.1109/5.726791
  • Liu et al. (2019) L. Liu, W. Wei, K. Chow, M. Loper, E. Gursoy, S. Truex, and Y. Wu. 2019. Deep Neural Network Ensembles Against Deception: Ensemble Diversity, Accuracy and Robustness. In 2019 IEEE 16th International Conference on Mobile Ad Hoc and Sensor Systems (MASS). 274–282.
  • Liu (2011) Tie-Yan Liu. 2011. Learning to rank for information retrieval. Springer.
  • Liu et al. (2020) Yang Liu, Zhiyuan Liu, Cheng Lyu, and Jieping Ye. 2020. Attention-Based Deep Ensemble Net for Large-Scale Online Taxi-Hailing Demand Prediction. IEEE Transactions on Intelligent Transportation Systems 21, 11 (2020), 4798–4807. https://doi.org/10.1109/TITS.2019.2947145
  • Lu and Getoor (2003) Qing Lu and Lise Getoor. 2003. Link-Based Classification. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning (Washington, DC, USA) (ICML’03). AAAI Press, 496–503.
  • Martínez-Muñoz et al. (2009) G. Martínez-Muñoz, D. Hernández-Lobato, and A. Suárez. 2009. An Analysis of Ensemble Pruning Techniques Based on Ordered Aggregation. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 2 (2009), 245–259.
  • McHugh (2012) Mary L. McHugh. 2012. Interrater reliability: the kappa statistic. Biochemia Medica 22, 3 (2012), 276–282. https://europepmc.org/articles/PMC3900052
  • Min et al. (2019) Kyushik Min, Dongchan Kim, Jongwon Park, and Kunsoo Huh. 2019. RNN-Based Path Prediction of Obstacle Vehicles With Deep Ensemble. IEEE Transactions on Vehicular Technology 68, 10 (2019), 10252–10256. https://doi.org/10.1109/TVT.2019.2933232
  • Partridge and Krzanowski (1997) D. Partridge and W. Krzanowski. 1997. Software diversity: practical statistics for its measurement and exploitation. Information and Software Technology 39, 10 (1997), 707 – 717. https://doi.org/10.1016/S0950-5849(97)00023-2
  • Paulavičius and Žilinskas (2006) Remigijus Paulavičius and Julius Žilinskas. 2006. Analysis of different norms and corresponding Lipschitz constants for global optimization. Ukio Technologinis ir Ekonominis Vystymas 12, 4 (2006), 301–306.
  • Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. 2015. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115, 3 (2015), 211–252. https://doi.org/10.1007/s11263-015-0816-y
  • Skalak (1996) David B. Skalak. 1996. The Sources of Increased Accuracy for Two Proposed Boosting Algorithms. In Proc. American Association for Artificial Intelligence, AAAI-96, Integrating Multiple Learned Models Workshop. 120–125.
  • Tang et al. (2006) E Ke Tang, Ponnuthurai N Suganthan, and Xin Yao. 2006. An analysis of diversity measures. Machine learning 65, 1 (2006), 247–271.
  • Tsoumakas et al. (2009) Grigorios Tsoumakas, Ioannis Partalas, and Ioannis Vlahavas. 2009. An Ensemble Pruning Primer. Springer Berlin Heidelberg, Berlin, Heidelberg, 1–13.
  • Tumer and Ghosh (1996) Kagan Tumer and Joydeep Ghosh. 1996. Error Correlation and Error Reduction in Ensemble Classifiers. Connection Science 8, 3-4 (1996), 385–404. https://doi.org/10.1080/095400996116839
  • Wei and Liu (2021) Wenqi Wei and Ling Liu. 2021. Robust Deep Learning Ensemble Against Deception. IEEE Transactions on Dependable and Secure Computing 18, 4 (2021), 1513–1527. https://doi.org/10.1109/TDSC.2020.3024660
  • Wei et al. (2020) W. Wei, L. Liu, M. Loper, K. Chow, E. Gursoy, S. Truex, and Y. Wu. 2020. Cross-Layer Strategic Ensemble Defense Against Adversarial Examples. In 2020 International Conference on Computing, Networking and Communications (ICNC). 456–460.
  • Wu et al. (2023) Yanzhao Wu, Ka-Ho Chow, Wenqi Wei, and Ling Liu. 2023. Exploring Model Learning Heterogeneity for Boosting Ensemble Robustness. In 2023 IEEE International Conference on Data Mining (ICDM).
  • Wu and Liu (2021) Yanzhao Wu and Ling Liu. 2021. Boosting Deep Ensemble Performance with Hierarchical Pruning. In 2021 IEEE International Conference on Data Mining (ICDM). 1433–1438. https://doi.org/10.1109/ICDM51629.2021.00184
  • Wu et al. (2020) Yanzhao Wu, Ling Liu, Zhongwei Xie, Juhyun Bae, Ka-Ho Chow, and Wenqi Wei. 2020. Promoting High Diversity Ensemble Learning with EnsembleBench. In 2020 IEEE Second International Conference on Cognitive Machine Intelligence (CogMI). 208–217. https://doi.org/10.1109/CogMI50398.2020.00034
  • Wu et al. (2021) Yanzhao Wu, Ling Liu, Zhongwei Xie, Ka-Ho Chow, and Wenqi Wei. 2021. Boosting Ensemble Accuracy by Revisiting Ensemble Diversity Metrics. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 16464–16472. https://doi.org/10.1109/CVPR46437.2021.01620
  • Yin et al. (2014) Xu-Cheng Yin, Chun Yang, and Hong-Wei Hao. 2014. Learning to Diversify via Weighted Kernels for Classifier Ensemble. arXiv:1406.1167 [cs.LG]