Learning Performance Maximizing Ensembles with Explainability Guarantees
Abstract
In this paper we propose a method for the optimal allocation of observations between an intrinsically explainable glass box model and a black box model. An optimal allocation is defined as one which, for any given explainability level (i.e. the proportion of observations for which the explainable model is the prediction function), maximizes the performance of the ensemble on the underlying task and, subject to this maximal ensemble performance condition, maximizes the performance of the explainable model on the observations allocated to it. The proposed method is shown to produce such explainability-optimal allocations on a benchmark suite of tabular datasets across a variety of explainable and black box model types. These learned allocations are found to consistently maintain ensemble performance at very high explainability levels (explaining a large proportion of observations on average), and in some cases to even outperform both the component explainable and black box models while improving explainability.
Introduction
In most high stakes settings, such as medical diagnosis (Gulum, Trombley, and Kantardzic 2021) and criminal justice (Rudin 2019), model predictions have two viability requirements. Firstly, they must exceed a given global performance threshold, thus ensuring an adequate understanding of the underlying process. Secondly, model predictions must be explainable.
Explainability, however defined (Linardatos, Papastefanopoulos, and Kotsiantis 2020), is a desirable characteristic in any prediction function. Intrinsically interpretable “glass box” models ((Agarwal et al. 2021), (Lemhadri, Ruan, and Tibshirani 2021), (Rymarczyk et al. 2020)), which are explainable by construction, are particularly advantageous as they require no additional post-hoc processing ((Ribeiro, Singh, and Guestrin 2016), (Lundberg and Lee 2017)) to achieve explainability, and thus also avoid complications arising from post-hoc explanation learning ((Rudin 2019), (Garreau and Luxburg 2020)). Due to these advantages, glass box models are uniquely suited to settings where faithful explanations of predictions are required.
However, using an approach of “complete explainability”, in which a glass box model is used as the prediction function across the entire feature space, may not be viable. It may be the case that, in a given setting, no glass box exists that can adequately model the relationship of interest in all regions of the feature space. Thus in some regions, the model’s predictions will fail to exceed the performance threshold required by the use-case. If, as a consequence, the model exceeds the application’s global error tolerance (e.g. a low accuracy in stroke prediction (Gage et al. 2001)), it may not be usable in practice.
An alternative, “partial explainability” approach requires instead that only a proportion of observations be provided intrinsically explainable predictions. We will refer to this proportion, which is the proportion of observations for which the explainable model is the prediction function, as the explainability level $q$. Such approaches, including our proposed method, Ensembles with Explainability Guarantees (EEG), can provide high performance while maximizing explainability, and work especially well in cases where the explainable model can be paired with an alternate model with complementary strengths. As demonstrated in Fig. 1, by identifying the areas of expertise of the glass box and black box models, the EEG approach can allocate predictions accordingly to improve both performance and explainability.

Generally, implementations of a partial explainability approach consist of an ensemble of models, including at least one explainable model and one alternate model (often a black box model), and an allocation scheme by which observations are distributed among the ensemble members for prediction. Individual methods differ in the following aspects.
Methods vary in the range of component models they can accommodate. Some are defined for only one set of glass box, black box, and allocator model types - for example LSP (Wang and Saligrama 2012) and OTSAM (Wang, Fujimaki, and Motohashi 2015), which use binary tree-type splitting to define regions, and linear models and sparse additive models respectively to predict within regions. Other methods are black box agnostic but still limited in glass box and allocator model type - for example HyRS (Wang 2019), HyPM (Wang and Lin 2021), CRL (Pan, Wang, and Hara 2020), and HybridCORELS (Ferry, Laberge, and Aïvodji 2023), which use rule-based models as both glass box and allocator. EEG is the only fully model-agnostic partial explainability method which can be implemented with any combination of glass box, black box, and allocator models.
Methods also vary in the approach used to learn each ensemble member model (i.e. glass box and black box). Most methods first learn the black box model globally (on the full dataset), and then learn the glass box model locally (on its allocated subset of the data), either simultaneously with the allocator (HyRS, HyPM, CRL, and HybridCORELS) or in an alternating EM-style (LSP, OTSAM, and AdaBudg (Nan and Saligrama 2017)). EEG on the other hand, learns both ensemble member models globally first before learning the allocations between them - similar to most general adaptive ensembling methods, e.g. (Gao et al. 2019), (Inoue 2019).
Finally, methods are characterized by their allocation criteria which commonly consist of an objective which combines one or more of the following - the explainability level, the underlying task performance of the ensemble, and the complexity of the glass box model. Most methods optimize a measure of post-allocation ensemble performance - LSP, HyRS, HyPM, and HybridCORELS minimize a 0/1 misclassification loss, AdaBudg uses a more flexible logistic loss, and CRL maximizes accuracy across a range of explainability levels. Several methods with rule-based glass box/allocator hybrid models (HyRS, HyPM, CRL, and HybridCORELS) also include a penalty on the complexity of these models. To control the explainability level, methods either include a reward term in the loss (HyRS, HyPM, and CRL), or directly restrict the model space to candidates which achieve the explainability level (HybridCORELS). In contrast, EEG optimizes an MSE loss between the predicted and actual “glass box allocation desirability” percentile of each observation.
More extensive reviews of the partial explainability approach and explainability methods in general are available in (Linardatos, Papastefanopoulos, and Kotsiantis 2020), (Nauta et al. 2022), and (Sahakyan, Aung, and Rahwan 2021).
As outlined above, our proposed method, Ensembles with Explainability Guarantees (EEG), differs from existing works in its approach to the partial explainability problem. The key novelties of this new approach, and their corresponding advantages are summarized below.
Independent and Global Component Models: The first key innovation of the EEG approach is the independent learning of each component model (i.e. the ensemble member models and allocator). As a result, EEG is agnostic to task, data, and component model type. Thus, the most powerful models can be used for each component as determined by the setting - in contrast with previous works which are more restricted.
Another important consequence of separate component model learning is that glass box predictions are independently explainable in the global context, and thus immune from “explainability collapse” - a scenario in which the allocator subsumes the glass box’s prediction role, diminishing the value of the explainable prediction, in the extreme case reducing the glass box to an uninformative constant function. On the other hand, methods which either learn glass box models locally, or jointly with the allocator, are vulnerable to this type of degeneration.
Allocation Desirability Ranking: The second novel aspect of the EEG approach is the concept of allocation desirability. Given an ensemble of models, allocation desirability quantifies how beneficial it is for a given observation to be allocated to the default ensemble member model, say the glass box. Thus, it induces a preference for glass box allocation between all pairs of observations and consequently also defines a ranking of allocation preference across all observations that is optimal irrespective of the desired explainability level.
A key advantage of such a ranking is that it is independent of the training criteria of the ensemble member models, and thus can be adapted to score allocation desirability using metrics that best fit the setting. Indeed, the EEG desirability metric builds a ranking using a combination of relative sufficient performance and absolute performance measures which can natively accommodate any underlying problem type (e.g. regression, classification). This particular desirability metric also offers several additional benefits including allocation desirability percentile and sufficiency category estimates for each observation.
Q-Complete Allocation Optimality: The final key point of novelty of the EEG approach is the optimality of allocation, as defined in Proposition 1 and Proposition 2, which is encoded in the allocation desirability ranking for any explainability level. Thus, the learned allocator, which estimates this ranking, is an explicit function of $q$ and provides the allocation solution for any explainability level after training only once. This capability is in contrast with previous works which provide, at most, several explainability level solutions with varying degrees of stability (Ferry, Laberge, and Aïvodji 2023).
These unique capabilities of the EEG method enable the following practical use cases:
- Given a minimum performance requirement on the underlying task, the method can be used to obtain the allocation with the highest explainability level that achieves or exceeds the performance threshold (see the sketch following this list).
- Given a minimum explainability level requirement, the method can be used to obtain the allocation with the highest ensemble performance which meets or exceeds the required explainability level.
- Given a minimal level of post-allocation glass box-specific performance, the allocation that achieves the highest explainability level while meeting or exceeding this requirement can be found.
- Given a set of observations, sufficiency category estimates can be obtained for each, identifying which observations are likely to yield incorrect decisions and describing the likely failure mode for each such case to inform potential post-hoc remedies.
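As an illustration of the first use case, the following minimal sketch (with illustrative function names and a synthetic performance curve, not the paper's implementation) selects the highest explainability level whose estimated ensemble performance still meets a required threshold:

```python
import numpy as np

def highest_q_meeting_threshold(qs, ensemble_perf, perf_threshold):
    """Return the largest explainability level q whose estimated ensemble
    performance meets or exceeds perf_threshold.

    qs            : 1-D array of candidate explainability levels (ascending).
    ensemble_perf : 1-D array of ensemble performance at each q
                    (e.g. validation accuracy of the allocated ensemble).
    """
    qs, ensemble_perf = np.asarray(qs), np.asarray(ensemble_perf)
    feasible = ensemble_perf >= perf_threshold
    if not feasible.any():
        return None  # no explainability level satisfies the requirement
    return qs[feasible].max()

# Example with a made-up performance curve that decays slowly with q.
qs = np.linspace(0.0, 1.0, 21)
perf = 0.90 - 0.08 * qs**3
print(highest_q_meeting_threshold(qs, perf, perf_threshold=0.88))
```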
In the following sections, we first describe our method in detail and provide some theoretical assurances on the characteristics of the resulting allocator in the Methodology section. Then, in the Experiments section, we describe the experimental settings and the estimation of the allocator, and demonstrate the method's favorable performance.
Methodology
Setting
First, we define the underlying task as the estimation of the function $f: \mathcal{X} \rightarrow \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^{d}$. We also define observations as $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$, the training dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, and the loss function for the underlying task $\ell(\hat{y}, y)$. Next, we define the ensemble component models: the intrinsically explainable glass box model $f_{gb}$ and the alternate, black box model $f_{bb}$, both of which are learned independently on the full training dataset $\mathcal{D}$.
Next, we define the allocation task. We define the class of all allocator functions as $\mathcal{A} = \{a : \mathcal{X} \times \mathcal{Y} \rightarrow \{0, 1\}\}$ and the class of all “proper” allocator functions as $\mathcal{A}^{P} = \{a : \mathcal{Z} \rightarrow \{0, 1\}\}$, where $\mathcal{Z}$ is a general space of inputs (typically $\mathcal{X}$) containing only information available at allocation time, and $a = 1$ indicates allocation to the glass box model. Next we define the class of $q$-explainable allocators as $\mathcal{A}_{q} = \{a \in \mathcal{A} : \tfrac{1}{n}\sum_{i=1}^{n} a(x_i, y_i) = q\}$ and the corresponding class of “proper” $q$-explainable allocators as $\mathcal{A}^{P}_{q} = \{a \in \mathcal{A}^{P} : \tfrac{1}{n}\sum_{i=1}^{n} a(z_i) = q\}$, for $q \in [0, 1]$, with $q$ being the explainability level. Note, the set $\mathcal{A}_{q}$ is used to define the optimal allocator, whereas the set $\mathcal{A}^{P}_{q}$ is searched to obtain an estimator of this optimum.
We next define indicators of performance sufficiency. These functions should be thought of as context-dependent indicators of whether performance within a region of the feature space is sufficiently high to use the model in question reliably for explanation. For classification tasks, we define performance sufficiency as $s_{m}(x, y) = \mathbb{1}\left[f_{m}(x) = y\right]$, and for regression tasks as $s_{m}(x, y) = \mathbb{1}\left[\ell(f_{m}(x), y) \leq \tau\right]$, with $m \in \{gb, bb\}$ and threshold $\tau > 0$. In practice $\tau$ should be selected based on problem-specific context; however, lacking such context in the regression experiments conducted for this study, $\tau$ was selected to be the lower of the average validation losses of $f_{gb}$ and $f_{bb}$, as a reasonable threshold for prediction correctness. These sufficiency indicators generate the following partition of the data: $S_{11} = \{i : s_{gb,i} = 1, s_{bb,i} = 1\}$, $S_{10} = \{i : s_{gb,i} = 1, s_{bb,i} = 0\}$, $S_{01} = \{i : s_{gb,i} = 0, s_{bb,i} = 1\}$, and $S_{00} = \{i : s_{gb,i} = 0, s_{bb,i} = 0\}$, with $s_{m,i} = s_{m}(x_i, y_i)$.
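A minimal sketch of how the sufficiency indicators and the induced four-way partition might be computed under the notation above; the squared-error loss and the helper names are illustrative assumptions:

```python
import numpy as np

def sufficiency_classification(y_pred, y_true):
    # s(x, y) = 1 if the prediction is correct
    return (np.asarray(y_pred) == np.asarray(y_true)).astype(int)

def sufficiency_regression(y_pred, y_true, tau):
    # s(x, y) = 1 if the loss (here squared error) is at most tau
    return ((np.asarray(y_pred) - np.asarray(y_true)) ** 2 <= tau).astype(int)

def sufficiency_partition(s_gb, s_bb):
    """Split observation indices into the four sufficiency categories:
    S_11 (both sufficient), S_10 (only glass box), S_01 (only black box),
    S_00 (neither)."""
    s_gb, s_bb = np.asarray(s_gb), np.asarray(s_bb)
    idx = np.arange(len(s_gb))
    return {
        "S_11": idx[(s_gb == 1) & (s_bb == 1)],
        "S_10": idx[(s_gb == 1) & (s_bb == 0)],
        "S_01": idx[(s_gb == 0) & (s_bb == 1)],
        "S_00": idx[(s_gb == 0) & (s_bb == 0)],
    }
```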
Next we motivate the use of the sufficiency perspective. Sufficiency functions are critical for defining coherent allocations when, as is often the case, the absolute performance measures used to learn ensemble component models do not match allocation preference (e.g. loss minimization vs accuracy maximization). Consider the following constrained allocation decision in which only one observation can be allocated to $f_{gb}$:
Obs | $\ell_{bb}$ | $s_{bb}$ | $\ell_{gb}$ | $s_{gb}$
---|---|---|---|---
$x_1$ | 0 | 1 | 2 | 1
$x_2$ | 3 | 1 | 4 | 0
In this case, loss minimization dictates an allocation of $x_2$ to $f_{gb}$ and $x_1$ to $f_{bb}$, which would allocate $x_2$ to an insufficient prediction. Sufficiency maximization would however yield a more satisfactory allocation of $x_1$ to $f_{gb}$ and $x_2$ to $f_{bb}$. This example demonstrates the utility of sufficiency allocation - distinguishing between a case where the user is willing to sacrifice “a bit of performance” (as quantified by sufficiency) for explanation ($x_1$), and a case where even a small performance drop results in an explanation that is not sufficiently trustworthy to use ($x_2$).
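The two allocation rules can be checked directly on the example above; the small script below simply re-derives the allocations from the table's values:

```python
# Losses and sufficiency flags for the two observations in the table above.
obs = {
    "x1": {"loss_bb": 0, "suff_bb": 1, "loss_gb": 2, "suff_gb": 1},
    "x2": {"loss_bb": 3, "suff_bb": 1, "loss_gb": 4, "suff_gb": 0},
}

# Exactly one observation may be allocated to the glass box.
def total_loss(gb_obs):
    return sum(v["loss_gb"] if k == gb_obs else v["loss_bb"] for k, v in obs.items())

def total_sufficiency(gb_obs):
    return sum(v["suff_gb"] if k == gb_obs else v["suff_bb"] for k, v in obs.items())

loss_min_choice = min(obs, key=total_loss)         # -> "x2" (total loss 0 + 4 = 4)
suff_max_choice = max(obs, key=total_sufficiency)  # -> "x1" (both predictions sufficient)
print(loss_min_choice, suff_max_choice)
```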
In the next section, we define the objective of the allocation task and introduce our proposed approach for addressing it.
Optimal Allocation
In the allocation task, the objective is to construct an allocator that will determine which model, either the explainable $f_{gb}$ or the black box $f_{bb}$, is used for prediction on any given observation $x_i$, in a manner that is optimal relative to the following criteria. Firstly, for any given explainability level $q$, the allocator should distribute observations in a way that maximizes sufficient ensemble performance, defined as $\Pi(a) = \frac{1}{n}\sum_{i=1}^{n}\left[a_i\, s_{gb,i} + (1 - a_i)\, s_{bb,i}\right]$, where $a_i$ denotes the allocation decision for observation $i$ ($a_i = 1$ for allocation to $f_{gb}$). Secondly, and again for any $q$, the allocator should maximize sufficient explainable prediction performance $\Pi_{gb}(a) = \frac{1}{n}\sum_{i=1}^{n} a_i\, s_{gb,i}$, i.e. the performance of the $f_{gb}$ model on the subset of observations it has been allocated, subject to maintaining maximal $\Pi(a)$. Finally, the allocator should also be consistent in its allocations across the values of $q$, meaning that if an observation is allocated to $f_{gb}$ for a given $q$, it should remain allocated to the glass box for all higher explainability levels as well. Next, we define our allocator, and show that it meets the three criteria introduced above.
Our proposed allocation function is defined as $a^{*}_{q}(x_i) = \mathbb{1}\left[p(x_i) > 1 - q\right]$, where the rescaled ranking $p(x_i) = r(x_i)/n$, and the ranking $r(x_i) \in \{1, \ldots, n\}$ orders observations by sufficiency category ($S_{10} \succ S_{11} \succ S_{00} \succ S_{01}$, the most desirable category receiving the highest ranks) and, within each category, by the loss difference $d_i$, with $d_i = \ell(f_{bb}(x_i), y_i) - \ell(f_{gb}(x_i), y_i)$.
The intuition behind the allocator is as follows. First, all observations are sorted in sufficient performance maximizing order, i.e. allocation of observations in $S_{10}$ to $f_{gb}$ is prioritized over allocation of observations in $S_{11}$ and $S_{00}$, which in turn are prioritized over $S_{01}$. Next, observations are sorted in explainable sufficient performance maximizing order, i.e. $S_{11}$ is prioritized ahead of $S_{00}$ for allocation to $f_{gb}$. Then, within each sufficiency category, observations are ordered in absolute performance maximizing order, i.e. observations with large relative performance of $f_{gb}$ over $f_{bb}$ (large $d_i$) are prioritized for allocation to $f_{gb}$. Next, this ranking is normalized, yielding the glass box allocation desirability percentile $p(x_i)$. An important feature of this percentile is that it is constant with respect to $q$, thus the optimal observations to allocate to $f_{gb}$, for any level of $q$, are simply the most highly ranked, resulting in the allocator $a^{*}_{q}$.
Note that in the described methodology, sufficiency based allocation can be viewed as a generalization of allocation via absolute performance, and can thus be reduced to the latter by selecting either $s_{m} \equiv 0$ or $s_{m} \equiv 1$, $m \in \{gb, bb\}$, in which case the ranking is driven entirely by the loss difference $d_i$.
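The following sketch illustrates the ranking-based allocator as reconstructed above; the category priority and tie-breaking details are assumptions consistent with this section rather than a verified reference implementation:

```python
import numpy as np

# Assumed priority of sufficiency categories for glass box allocation
# (highest first), matching the reconstruction in the text above.
CATEGORY_PRIORITY = {"S_10": 3, "S_11": 2, "S_00": 1, "S_01": 0}

def desirability_percentile(loss_gb, loss_bb, s_gb, s_bb):
    """Glass box allocation desirability percentile p(x_i) in (0, 1]."""
    loss_gb, loss_bb = np.asarray(loss_gb, float), np.asarray(loss_bb, float)
    s_gb, s_bb = np.asarray(s_gb, bool), np.asarray(s_bb, bool)
    n = len(loss_gb)

    # Sufficiency category of each observation.
    cat = np.where(s_gb & s_bb, "S_11",
          np.where(s_gb & ~s_bb, "S_10",
          np.where(~s_gb & s_bb, "S_01", "S_00")))
    cat_score = np.array([CATEGORY_PRIORITY[c] for c in cat])

    d = loss_bb - loss_gb                      # relative advantage of the glass box
    order = np.lexsort((-d, -cat_score))       # most desirable observation first
    rank = np.empty(n, dtype=int)
    rank[order] = np.arange(n, 0, -1)          # most desirable gets rank n
    return rank / n

def allocate(p, q):
    """Allocate to the glass box the fraction q of observations with the
    highest desirability percentile."""
    return (p > 1 - q).astype(int)             # 1 = glass box, 0 = black box
```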
Next, we state the optimality properties of the proposed allocator . The proofs are available in the long form paper on arxiv.org in the Theoretical Results section of the Appendix.
Proposition 1.
(Maximal Sufficient Performance) For all $q \in [0, 1]$, $a^{*}_{q} \in \operatorname{arg\,max}_{a \in \mathcal{A}_{q}} \Pi(a)$, where $\Pi(a) = \frac{1}{n}\sum_{i=1}^{n}\left[a_i\, s_{gb,i} + (1 - a_i)\, s_{bb,i}\right]$ and $\mathcal{A}_{q}$ is the class of $q$-explainable allocators.
Proposition 2.
(Maximal Sufficient Explainable Performance) For all $q \in [0, 1]$, $a^{*}_{q} \in \operatorname{arg\,max}_{a \in \mathcal{M}_{q}} \Pi_{gb}(a)$, where $\Pi_{gb}(a) = \frac{1}{n}\sum_{i=1}^{n} a_i\, s_{gb,i}$ and $\mathcal{M}_{q} = \operatorname{arg\,max}_{a \in \mathcal{A}_{q}} \Pi(a)$ is the set of allocators attaining maximal sufficient ensemble performance.
Proposition 3.
(Monotone Allocation) For all $0 \leq q_1 \leq q_2 \leq 1$ and all $x_i$, $a^{*}_{q_1}(x_i) = 1 \Rightarrow a^{*}_{q_2}(x_i) = 1$.
Experiments
In this section we describe the data, model training procedures, performance evaluation metrics, and results of our experiments.
Datasets
Tabular data is used to evaluate the proposed methodology as it is the setting for which the required intrinsically explainable glass box models are most readily available. Following the tabular data benchmarking framework proposed by (Grinsztajn, Oyallon, and Varoquaux 2022), we conduct experiments on a set of 31 datasets (13 classification, 18 regression). These datasets represent the full set of provided datasets with quantitative features less the four largest scale datasets (omitted due to computational limitations). These datasets are summarized in Table 4.
Each dataset is split (70%, 9%, 21%) into training, validation, and test sets respectively, following (Grinsztajn, Oyallon, and Varoquaux 2022). All features and regression response variables are rescaled to the range [-1,1].
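A minimal sketch of the splitting and rescaling step, assuming scikit-learn utilities and fitting the scaler on the training split only (an assumption not stated in the text); the regression response would be rescaled analogously:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

def split_and_rescale(X, y, seed=0):
    """70% / 9% / 21% train / validation / test split with features
    rescaled to the range [-1, 1]."""
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, train_size=0.70, random_state=seed)
    # 9% of the full data is 0.09 / 0.30 = 30% of the remaining 30%.
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, train_size=0.30, random_state=seed)
    scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X_train)
    return (scaler.transform(X_train), y_train,
            scaler.transform(X_val), y_val,
            scaler.transform(X_test), y_test)
```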
Models
Both glass box and black box models are learned on the full training dataset for each underlying task. For classification datasets, two types of glass box model are fitted, a logistic regression and a classification tree, as well as two types of black box model, a gradient boosting trees classifier and a neural network classifier. Analogously, for regression datasets, two types of glass box model are fitted, a linear regression and a regression tree, as well as two types of black box model, a gradient boosting trees regressor and a neural network regressor. In all cases, the architecture of the neural networks is the “Wide ResNet-28” model (Zagoruyko and Komodakis 2016) adapted to tabular data with the replacement of convolutional layers with fully connected layers.
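For concreteness, a fully connected residual block of the kind implied by replacing convolutional layers with linear layers is sketched below in PyTorch; the exact widths, normalization placement, and depth of the adapted Wide ResNet-28 are assumptions, not the paper's specification:

```python
import torch.nn as nn

class FCResidualBlock(nn.Module):
    """A fully connected analogue of a Wide ResNet block: the convolutional
    layers are replaced with linear layers of equal width (a simplified,
    assumed adaptation for tabular inputs)."""
    def __init__(self, width, dropout=0.0):
        super().__init__()
        self.norm1 = nn.BatchNorm1d(width)
        self.fc1 = nn.Linear(width, width)
        self.norm2 = nn.BatchNorm1d(width)
        self.fc2 = nn.Linear(width, width)
        self.drop = nn.Dropout(dropout)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.fc1(self.act(self.norm1(x)))
        out = self.fc2(self.drop(self.act(self.norm2(out))))
        return x + out   # residual connection
```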
An allocator is subsequently also learned on the full training dataset. Both gradient boosting trees regressors and neural networks are fitted as allocators for each allocation task.
For allocator training, the features are augmented with four additional constructed features: the predictions $f_{gb}(x)$ and $f_{bb}(x)$, and two distance measures between them, the cross-entropy and the MSE. In our experiments, inclusion of these features improved allocator learning - likely by removing the need for the allocator to attempt to learn these quantities on its own. Allocation performance is further improved by ensembling the feature-dependent learned allocator $A_f$ with a strong feature-independent allocator $A_d$, which ranks observations for glass box allocation by the distance between the predictions of $f_{gb}$ and $f_{bb}$. $A_d$ can be viewed as an “assume the black box is correct” allocation rule, which is more likely to assign an observation to $f_{bb}$ if the distance between the predictions of $f_{gb}$ and $f_{bb}$ is high. Which of the two allocators is used for a given $q$ is determined by their respective performances on the validation set.
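A minimal sketch of the allocator input augmentation and the feature-independent allocation rule described above; the binary cross-entropy form (classification only) and the variable names are illustrative assumptions:

```python
import numpy as np

def augment_features(X, pred_gb, pred_bb):
    """Augment the original features with both model predictions and two
    discrepancy measures between them (cross-entropy and squared error)."""
    p_gb = np.asarray(pred_gb, dtype=float).reshape(-1, 1)
    p_bb = np.asarray(pred_bb, dtype=float).reshape(-1, 1)
    eps = 1e-12
    # Binary cross-entropy between the two predicted probabilities
    # (for regression the squared-error term alone would apply).
    ce = -(p_bb * np.log(p_gb + eps) + (1 - p_bb) * np.log(1 - p_gb + eps))
    se = (p_gb - p_bb) ** 2
    return np.hstack([np.asarray(X, dtype=float), p_gb, p_bb, ce, se])

def feature_independent_allocation(pred_gb, pred_bb, q):
    """'Assume the black box is correct' rule: allocate to the glass box the
    fraction q of observations whose two predictions agree most closely."""
    disagreement = (np.asarray(pred_gb, float) - np.asarray(pred_bb, float)) ** 2
    n = len(disagreement)
    k = int(round(q * n))
    alloc = np.zeros(n, dtype=int)               # 0 = black box
    if k > 0:
        alloc[np.argsort(disagreement)[:k]] = 1  # 1 = glass box
    return alloc
```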
Hyperparameter Tuning
Hyperparameter tuning for all models is done using 4-fold cross-validation, with the exception of the neural network tuning which is done using the validation set. A grid search is done to select the best hyperparameters for each model with search values available in Tables 5, 6, and 7.
Each glass box and black box model is tuned on the full set of hyperparameters each time it is replicated. The gradient boosting trees allocator models are retuned on the full hyperparameter set each time as well. The neural network allocator is not retuned however, and instead uses the optimal settings found in the fitting of the black box on each dataset.
Metrics
We define the following metrics which are used to measure the performance of our method. First, we define the Percentage Performance Captured over Random (PPCR) for a given allocator $A$ as $\mathrm{PPCR}(A) = \frac{\mathrm{AUC}(A) - \mathrm{AUC}(A_r)}{\mathrm{AUC}(A_o) - \mathrm{AUC}(A_r)}$, where $\mathrm{AUC}(\cdot)$ is the area under the curve of ensemble performance over all values of $q$ in its domain, $A_o$ is the oracle allocator which has perfect information on the whole dataset, and $A_r$ is the random allocator which selects a subset from the data being allocated uniformly at random. The PPCR metric is a percentage and represents the proportion of the oracle AUC, in excess of that also covered by random allocation, that the learned allocator is able to capture. Thus a value of zero indicates performance on par with $A_r$ and a value of one represents perfect allocation.
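A minimal sketch of the PPCR computation from performance-versus-$q$ curves, using a simple trapezoidal AUC; function names are illustrative:

```python
import numpy as np

def auc_over_q(qs, perf):
    """Area under the performance-versus-explainability curve (trapezoidal rule)."""
    qs, perf = np.asarray(qs, float), np.asarray(perf, float)
    return float(np.sum(np.diff(qs) * (perf[1:] + perf[:-1]) / 2.0))

def ppcr(qs, perf_learned, perf_oracle, perf_random):
    """Share of the oracle AUC, in excess of that covered by random
    allocation, captured by the learned allocator."""
    auc_learned = auc_over_q(qs, perf_learned)
    auc_oracle = auc_over_q(qs, perf_oracle)
    auc_random = auc_over_q(qs, perf_random)
    return (auc_learned - auc_random) / (auc_oracle - auc_random)
```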
Next, we define the Percent Q Equal or Over Max (PQEOM) as the percentage of $q$ values for which the allocator is performing at least as well as the most accurate ensemble member model (i.e. $f_{gb}$ or $f_{bb}$) and the Percent Q Over Max (PQOM) as the percentage of $q$ values for which the allocator is performing better than the most accurate ensemble member model (i.e. $f_{gb}$ or $f_{bb}$).
Next, we define the Percent Contribution of Feature-dependent Allocator (PCFA) as the percentage of $q$ values for which the feature-dependent allocator $A_f$ is used for allocation decisions as opposed to the feature-independent allocator $A_d$. A value close to one indicates that $A_f$ is used often, while a value close to zero indicates $A_d$ is used instead.
Next, we define the Threshold Q Max (95TQM) as the highest value of $q$ for which the ensemble performance meets or exceeds 95% of the performance of the better of $f_{gb}$ and $f_{bb}$. Thus this is a measure of how much explainability can be utilized before the performance price becomes material.
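The threshold-style metrics can be computed directly from the performance curve of the ensemble and the fixed performances of the two component models, as in the following illustrative sketch:

```python
import numpy as np

def threshold_metrics(qs, perf_ensemble, perf_gb, perf_bb, threshold=0.95):
    """Compute PQEOM, PQOM, and the 95TQM-style threshold metric on a grid of
    explainability levels; perf_gb and perf_bb are the (q-independent)
    performances of the individual models on an accuracy-like scale."""
    qs = np.asarray(qs, dtype=float)
    perf_ensemble = np.asarray(perf_ensemble, dtype=float)
    best_single = max(perf_gb, perf_bb)

    pqeom = float(np.mean(perf_ensemble >= best_single))  # at least as good as the best single model
    pqom = float(np.mean(perf_ensemble > best_single))    # strictly better
    meets = perf_ensemble >= threshold * best_single
    tqm = float(qs[meets].max()) if meets.any() else 0.0  # highest q within the threshold
    return pqeom, pqom, tqm
```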
Next, we define the maximum accuracy achieved by the allocator across all $q$ (Max Acc), and the highest value of $q$ for which this accuracy is maintained (Argmax $q$). The Max Acc can be benchmarked against the AUC, interpretable as the average accuracy across $q$. Each of these metrics is a percentage, and higher values correspond with higher performance and higher explainability at this maximum performance level, respectively.
Finally, we define the accuracy with which the four sufficiency categories ($S_{11}$, $S_{10}$, $S_{01}$, and $S_{00}$) can be estimated as the sufficiency accuracy (S Acc). The higher this accuracy, the better able the allocator is to inform the user of which category a given observation is likely to be a member of.
Results
Evaluation of allocator performance using the metrics defined previously as well as visual inspection of the performance vs explainability trade-off curves (Fig. 2) revealed both the benefits and some of the limitations of learned allocation in the tabular data setting.
Firstly, performance was found to consistently and significantly outperform random allocation, as quantified by a cross-dataset average PPCR of 37% (Table 1), indicating that the learned allocation captured over a third of the oracle area under the curve in excess of that covered by random allocation. It was also found that on some datasets in particular, learned allocation performed close to oracle allocation (e.g. PPCR values of 89% and 71% on the IsoletR and BrazilianHousesR datasets).
Learned allocation was also found to perform at least at the level of the best ensemble member model across an average of 74% of the explainability range (PQEOM in Table 1). This indicates that for many datasets, there is a substantial explainability “free lunch” to be taken advantage of without performance loss. On a few datasets, performance of the allocated ensemble was found to outperform both $f_{gb}$ and $f_{bb}$ for a majority (93%) of the $q$ range (PolR and FifaR PQOM). The 95TQM metric also supported these conclusions, with a cross-dataset average value of 94% indicating that allocation performance was within 5% of maximal individual model performance across approximately all values of $q$.
Assessing the PCFA metric suggests some limits to the upside of learned, feature-dependent allocation - at least in the tested tabular data setting. On many datasets, the range of $q$ values for which the feature-dependent allocator $A_f$ is used over the feature-independent $A_d$ is indistinguishable from zero (PCFA in Table 1). This is consistent with a visual inspection of the representative performance-explainability curve, e.g. Fig. 2 (b), where there is no improvement to be had in excess of that offered by $A_d$. However, it is noted that the only possibility for “homerun” allocations is through the feature-dependent $A_f$, as seen in Fig. 2 (a) with the PolR dataset and also in Table 1 for the SulfurR, BikeSharingR, and FifaR datasets. Thus the ensembled allocation scheme offers this upside without the downside risk of relying on either $A_f$ or $A_d$ alone.
Evaluation of the case in which a single allocation is needed is also positive. On a cross-dataset average, the maximal accuracy achieved is quite high (84%), and is also achieved at a high average explainability level (64%). Particularly strong individual results can be seen in the Pol and SulfurR datasets (Table 1). We also find that, at the observation level, the allocation is an accurate estimator of sufficiency category, with a cross-dataset average S Acc of 76% and with few datasets falling below 60%.


Dataset | AUC | PPCR | PQEOM | PQOM | PCFA | 95TQM | Max Acc | Argmax $q$ | S Acc
---|---|---|---|---|---|---|---|---|---
Wine | 79 ± 0 | 21 ± 0 | 71 ± 0 | 0 ± 0 | 7 ± 0 | 98 ± 0 | 80 ± 0 | 70 ± 0 | 78 ± 0
Phoneme | 87 ± 0 | 12 ± 1 | 78 ± 4 | 7 ± 2 | 6 ± 1 | 100 ± 0 | 87 ± 0 | 50 ± 34 | 81 ± 2
KDDIPUMS | 88 ± 0 | 17 ± 1 | 65 ± 3 | 34 ± 5 | 17 ± 7 | 100 ± 0 | 88 ± 0 | 66 ± 9 | 80 ± 1
EyeMovements | 66 ± 0 | 33 ± 0 | 55 ± 1 | 7 ± 6 | 15 ± 2 | 70 ± 0 | 68 ± 0 | 31 ± 24 | 52 ± 0
Pol | 98 ± 0 | 49 ± 0 | 98 ± 0 | 2 ± 0 | 0 ± 0 | 100 ± 0 | 99 ± 0 | 98 ± 0 | 96 ± 0
Bank | 76 ± 0 | -19 ± 0 | 4 ± 1 | 0 ± 0 | 0 ± 0 | 100 ± 0 | 79 ± 0 | 100 ± 0 | 71 ± 0
MagicTelescope | 86 ± 0 | 39 ± 0 | 87 ± 2 | 12 ± 11 | 10 ± 0 | 100 ± 0 | 86 ± 0 | 47 ± 28 | 82 ± 1
House16H | 89 ± 0 | 40 ± 0 | 84 ± 4 | 9 ± 6 | 6 ± 2 | 98 ± 0 | 89 ± 0 | 82 ± 8 | 86 ± 0
Credit | 78 ± 0 | 5 ± 1 | 56 ± 4 | 14 ± 19 | 95 ± 0 | 100 ± 0 | 78 ± 0 | 76 ± 6 | 72 ± 0
California | 90 ± 0 | 52 ± 0 | 88 ± 0 | 0 ± 0 | 7 ± 0 | 98 ± 0 | 91 ± 0 | 88 ± 0 | 90 ± 0
Electricity | 92 ± 0 | 58 ± 0 | 88 ± 0 | 0 ± 0 | 7 ± 0 | 98 ± 0 | 93 ± 0 | 88 ± 0 | 92 ± 0
Jannis | 79 ± 0 | 30 ± 0 | 53 ± 5 | 21 ± 5 | 14 ± 1 | 98 ± 0 | 79 ± 0 | 35 ± 6 | 76 ± 0
MiniBooNE | 94 ± 0 | 54 ± 0 | 90 ± 0 | 0 ± 0 | 5 ± 0 | 100 ± 0 | 94 ± 0 | 90 ± 0 | 92 ± 0
WineR | 73 ± 0 | 43 ± 0 | 85 ± 0 | 10 ± 0 | 20 ± 0 | 90 ± 0 | 74 ± 0 | 78 ± 0 | 67 ± 0
IsoletR | 91 ± 0 | 89 ± 0 | 68 ± 0 | 0 ± 0 | 49 ± 10 | 73 ± 0 | 95 ± 0 | 68 ± 0 | 87 ± 1
CPUR | 75 ± 0 | 40 ± 2 | 52 ± 18 | 0 ± 0 | 29 ± 11 | 80 ± 0 | 77 ± 0 | 70 ± 0 | 59 ± 1
SulfurR | 98 ± 0 | 63 ± 2 | 73 ± 9 | 1 ± 2 | 79 ± 4 | 100 ± 0 | 98 ± 0 | 84 ± 22 | 96 ± 0
BrazilianHousesR | 96 ± 0 | 71 ± 1 | 83 ± 0 | 64 ± 6 | 1 ± 2 | 93 ± 0 | 98 ± 0 | 8 ± 3 | 88 ± 0
AileronsR | 75 ± 0 | 13 ± 0 | 70 ± 10 | 5 ± 7 | 4 ± 2 | 100 ± 0 | 76 ± 0 | 40 ± 35 | 64 ± 0
MiamiHousingR | 76 ± 0 | 44 ± 0 | 76 ± 0 | 0 ± 0 | 67 ± 10 | 80 ± 0 | 78 ± 0 | 75 ± 0 | 65 ± 0
PolR | 88 ± 0 | 44 ± 1 | 96 ± 1 | 93 ± 1 | 88 ± 0 | 100 ± 0 | 88 ± 0 | 81 ± 2 | 84 ± 0
ElevatorsR | 75 ± 0 | 25 ± 0 | 64 ± 4 | 2 ± 2 | 17 ± 0 | 88 ± 0 | 75 ± 0 | 51 ± 29 | 63 ± 0
BikeSharingR | 77 ± 0 | 26 ± 1 | 88 ± 4 | 22 ± 16 | 82 ± 1 | 100 ± 0 | 78 ± 0 | 21 ± 13 | 72 ± 0
FifaR | 77 ± 0 | 33 ± 0 | 95 ± 0 | 93 ± 0 | 78 ± 0 | 100 ± 0 | 77 ± 0 | 50 ± 0 | 69 ± 0
CaliforniaR | 78 ± 0 | 38 ± 0 | 73 ± 0 | 68 ± 0 | 68 ± 0 | 93 ± 0 | 79 ± 0 | 42 ± 0 | 63 ± 0
HousesR | 78 ± 0 | 41 ± 0 | 73 ± 0 | 0 ± 0 | 48 ± 1 | 80 ± 0 | 79 ± 0 | 73 ± 0 | 62 ± 0
SuperconductR | 83 ± 0 | 42 ± 0 | 60 ± 13 | 24 ± 11 | 95 ± 0 | 95 ± 0 | 83 ± 0 | 57 ± 1 | 76 ± 0
HouseSalesR | 76 ± 0 | 50 ± 1 | 64 ± 12 | 0 ± 0 | 56 ± 6 | 85 ± 0 | 78 ± 0 | 78 ± 0 | 63 ± 0
House16HR | 92 ± 0 | 55 ± 0 | 90 ± 0 | 0 ± 0 | 6 ± 0 | 98 ± 0 | 92 ± 0 | 90 ± 0 | 86 ± 0
DiamondsR | 70 ± 0 | 19 ± 0 | 83 ± 0 | 34 ± 6 | 73 ± 0 | 100 ± 0 | 71 ± 0 | 20 ± 6 | 65 ± 0
MedicalChargesR | 86 ± 0 | 24 ± 0 | 91 ± 1 | 85 ± 1 | 0 ± 0 | 100 ± 0 | 86 ± 0 | 67 ± 2 | 83 ± 0
Average | 83 ± 9 | 37 ± 21 | 74 ± 19 | 20 ± 29 | 35 ± 34 | 94 ± 9 | 84 ± 8 | 64 ± 24 | 76 ± 12
Ablation Studies
Allocator Feature Set Selection
In addition to the features used to learn the glass box and black box models, the allocation task also has access to their predictions $f_{gb}(x)$ and $f_{bb}(x)$, and any functions of the two, since the allocator is learned subsequent to the training of these models. To obtain the optimal feature set for allocation, standard tuning procedures (e.g. cross validation) can be employed to evaluate all feature sets of interest. However, as each candidate feature set requires the training of a corresponding allocator for evaluation, this approach can be prohibitively costly.
Thus, the following study was conducted to determine whether a consistently best feature set exists for the tabular data context used in the experiments. First, the universe of candidate features was selected to be the original features $X$ used as inputs for the ensemble component models, the predictions of both of these models, $f_{gb}(x)$ and $f_{bb}(x)$, and finally two measures of discrepancy between the predictions, the cross-entropy $CE(f_{gb}(x), f_{bb}(x))$ and the mean squared error $MSE(f_{gb}(x), f_{bb}(x))$. The measures of disagreement were included as features as they translate to the “feature independent” strategy of allocating to $f_{gb}$ when the discrepancy is low - in other words the optimal allocation strategy assuming $f_{bb}$ is always correct.
The candidate features were grouped into the sets listed in Table 2 and then used to train allocators on a subset of the benchmark datasets (Wine, WineR, Phoneme, SulfurR, Bank, BrazilianHousesR, FifaR, KDDIPUMS) with 6 replicates per model.
Next, each feature set was evaluated as follows. First, within each dataset, each feature set's performance (defined as the AUC) was compared to the performance of the best alternative set of features. Then, the proportion of datasets for which the feature set being evaluated was not significantly worse (i.e. either significantly better or not significantly different) than the best alternative was recorded and reported in Table 2 for three significance levels.
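A minimal sketch of this comparison, assuming a one-sided Welch t-test over replicate AUCs (the specific test used in the paper is not stated):

```python
import numpy as np
from scipy.stats import ttest_ind

def not_significantly_worse(auc_per_set, alpha):
    """For each candidate feature set (name -> list of replicate AUCs), report
    whether it is NOT significantly worse than the best alternative at level
    alpha. A one-sided Welch t-test is an assumed choice of test."""
    results = {}
    for name, aucs in auc_per_set.items():
        alternatives = {k: v for k, v in auc_per_set.items() if k != name}
        best_alt = max(alternatives, key=lambda k: float(np.mean(alternatives[k])))
        # H1: the best alternative has a higher mean AUC than this feature set.
        _, p_value = ttest_ind(alternatives[best_alt], aucs,
                               equal_var=False, alternative="greater")
        results[name] = p_value >= alpha   # True = not significantly worse
    return results
```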
The results support the following three conclusions. First, no one feature set proved universally best across the tested datasets, and thus a full search across feature sets would be advised in settings without resource constraints. Second, although no universally best feature set was found, the “kitchen sink” set of all candidate features ($X$, $f_{gb}(x)$, $f_{bb}(x)$, $CE$, $MSE$) was found to be best most consistently and was thus used to train all allocators reported in Table 1. Finally, allocators trained on just the original features $X$ were found to be consistently worst among all alternatives, thus supporting the augmentation of the original features in some form. This finding is consistent with the intuition that the predictions of the component models are very useful for learning the optimal allocation and would be either very difficult or impossible to learn from $p$, the optimal allocation ranking response, alone during training.
Feature Set | |||
---|---|---|---|
18.75% | 18.75% | 12.50% | |
, | 43.75% | 43.75% | 31.25% |
31.25% | 18.75% | 18.75% | |
50.00% | 43.75% | 37.50% | |
, | 37.50% | 25.00% | 18.75% |
, | 56.25% | 37.50% | 25.00% |
, , | 31.25% | 25.00% | 25.00% |
, , | 50.00% | 50.00% | 37.50% |
, , | 56.25% | 43.75% | 37.50% |
, , , | 50.00% | 43.75% | 37.50% |
, , , | 56.25% | 37.50% | 37.50% |
, , , , | 75.00% | 56.25% | 43.75% |
Ensemble Component Model Selection
The performance of any allocated ensemble is highly dependent not only on the individual performance of its component models (i.e. $f_{gb}$ and $f_{bb}$) but on their level of synergy as well. In particular, it may be the case that one candidate pair of component models individually outperforms an alternative pair, but that the allocator trained with the individually stronger pair underperforms the allocator trained with the alternative. In this case, the high relative advantages of the alternative pair's models in different segments of the feature space overcome their global performance disadvantages as individual models to yield a stronger ensemble.
Thus, to determine how often high relative advantage overcomes superior individual performance in allocator training, the following study was conducted. For each dataset, an allocator was trained on each combination of available glass box (tree and regression) and black box (gradient boosting trees and neural network) models (i.e. four allocators per dataset). Then the allocator $A_I$, trained using the pair of component models with the highest individual validation performance, was identified along with the allocator $A_C$, trained using the pair of component models resulting in the highest ensemble validation performance. Finally, the difference in test performance between $A_C$ and $A_I$ was measured ($\Delta$AUC).
The resulting $\Delta$AUC values support the following two conclusions. Firstly, while a relatively high proportion (41.9%) of datasets yield different allocators depending on which of the two component model selection processes (individual vs. combined performance) they utilize, the cross-dataset average difference in allocator performance is not significantly different from zero (0.01 ± 0.03). This result suggests that the glass box and black box model types used for the experiments did not exhibit high relative expertise in different parts of the feature space, indicating that it may be beneficial to use a more diverse set of component models in this setting. However, in rare cases (e.g. IsoletR, BrazilianHousesR) the combined performance selection method yields as much as 0.15 AUC in additional performance. Thus, in resource constrained settings, or in cases in which many glass box and black box model types are under consideration, the individual performance selection method appears relatively low risk, although a full search across all component model combinations (the method used for Table 1) is recommended when feasible.
Dataset | GB ($A_I$) | BB ($A_I$) | GB ($A_C$) | BB ($A_C$) | Match? | $\Delta$AUC
---|---|---|---|---|---|---
Wine | Tree | DNN | Tree | DNN | Yes | 0.0 |
Phoneme | Tree | GBT | Tree | DNN | No | (0.01) |
KDDIPUMS | Tree | GBT | Tree | GBT | Yes | 0.0 |
EyeMovements | Regr. | GBT | Regr. | GBT | Yes | 0.0 |
Pol | Tree | DNN | Tree | GBT | No | (0.0) |
Bank | Tree | DNN | Tree | DNN | Yes | 0.0 |
MagicTelescope | Tree | GBT | Tree | GBT | Yes | 0.0 |
House16H | Tree | GBT | Tree | GBT | Yes | 0.0 |
Credit | Tree | GBT | Tree | GBT | Yes | 0.0 |
California | Tree | GBT | Tree | GBT | Yes | 0.0 |
Electricity | Tree | GBT | Tree | GBT | Yes | 0.0 |
Jannis | Tree | GBT | Tree | GBT | Yes | 0.0 |
MiniBooNE | Tree | GBT | Tree | GBT | Yes | 0.0 |
WineR | Regr. | GBT | Regr. | GBT | Yes | 0.0 |
IsoletR | Regr. | DNN | Tree | DNN | No | 0.15 |
CPUR | Tree | GBT | Tree | DNN | No | 0.0 |
SulfurR | Tree | DNN | Tree | GBT | No | 0.04 |
BrazilianHousesR | Tree | GBT | Tree | DNN | No | 0.08 |
AileronsR | Regr. | GBT | Regr. | DNN | No | 0.0 |
MiamiHousingR | Tree | GBT | Tree | GBT | Yes | 0.0 |
PolR | Tree | DNN | Tree | GBT | No | (0.0) |
ElevatorsR | Regr. | DNN | Regr. | GBT | No | 0.01 |
BikeSharingR | Tree | GBT | Tree | GBT | Yes | 0.0 |
FifaR | Tree | GBT | Tree | DNN | No | 0.0 |
CaliforniaR | Tree | GBT | Tree | DNN | No | 0.01 |
HousesR | Tree | GBT | Tree | DNN | No | 0.01 |
SuperconductR | Tree | GBT | Tree | GBT | Yes | 0.0 |
HouseSalesR | Tree | GBT | Tree | GBT | Yes | 0.0 |
House16HR | Tree | GBT | Tree | GBT | Yes | 0.0 |
DiamondsR | Tree | GBT | Tree | GBT | Yes | 0.0 |
MedicalChargesR | Tree | GBT | Tree | DNN | No | 0.0 |
Median/Average | Tree (83.9%) | GBT (77.4%) | Tree (87.1%) | GBT (64.5%) | Yes (58.1%) | 0.01 ± 0.03
Acknowledgements / Thanks
We would like to thank Jonathan Siegel for valuable discussion of the theoretical aspects of this work.
This research is supported by the National Science Foundation under grant number CCF-2205004. Computations for this research were performed on the Pennsylvania State University’s Institute for Computational and Data Sciences’ Roar supercomputer.
References
- Agarwal et al. (2021) Agarwal, R.; Melnick, L.; Frosst, N.; Zhang, X.; Lengerich, B.; Caruana, R.; and Hinton, G. E. 2021. Neural additive models: Interpretable machine learning with neural nets. Advances in neural information processing systems, 34: 4699–4711.
- Ferry, Laberge, and Aïvodji (2023) Ferry, J.; Laberge, G.; and Aïvodji, U. 2023. Learning Hybrid Interpretable Models: Theory, Taxonomy, and Methods. arXiv preprint arXiv:2303.04437.
- Gage et al. (2001) Gage, B. F.; Waterman, A. D.; Shannon, W.; Boechler, M.; Rich, M. W.; and Radford, M. J. 2001. Validation of clinical classification schemes for predicting stroke: results from the National Registry of Atrial Fibrillation. Jama, 285(22): 2864–2870.
- Gao et al. (2019) Gao, X.; Shan, C.; Hu, C.; Niu, Z.; and Liu, Z. 2019. An adaptive ensemble machine learning model for intrusion detection. Ieee Access, 7: 82512–82521.
- Garreau and Luxburg (2020) Garreau, D.; and Luxburg, U. 2020. Explaining the explainer: A first theoretical analysis of LIME. In International conference on artificial intelligence and statistics, 1287–1296. PMLR.
- Grinsztajn, Oyallon, and Varoquaux (2022) Grinsztajn, L.; Oyallon, E.; and Varoquaux, G. 2022. Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, 35: 507–520.
- Gulum, Trombley, and Kantardzic (2021) Gulum, M. A.; Trombley, C. M.; and Kantardzic, M. 2021. A review of explainable deep learning cancer detection models in medical imaging. Applied Sciences, 11(10): 4573.
- Inoue (2019) Inoue, H. 2019. Adaptive ensemble prediction for deep neural networks based on confidence level. In The 22nd International Conference on Artificial Intelligence and Statistics, 1284–1293. PMLR.
- Lemhadri, Ruan, and Tibshirani (2021) Lemhadri, I.; Ruan, F.; and Tibshirani, R. 2021. LassoNet: Neural Networks with Feature Sparsity. In Banerjee, A.; and Fukumizu, K., eds., Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, 10–18. PMLR.
- Linardatos, Papastefanopoulos, and Kotsiantis (2020) Linardatos, P.; Papastefanopoulos, V.; and Kotsiantis, S. 2020. Explainable ai: A review of machine learning interpretability methods. Entropy, 23(1): 18.
- Lundberg and Lee (2017) Lundberg, S. M.; and Lee, S.-I. 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems, 30.
- Nan and Saligrama (2017) Nan, F.; and Saligrama, V. 2017. Adaptive classification for prediction under a budget. Advances in neural information processing systems, 30.
- Nauta et al. (2022) Nauta, M.; Trienes, J.; Pathak, S.; Nguyen, E.; Peters, M.; Schmitt, Y.; Schlötterer, J.; van Keulen, M.; and Seifert, C. 2022. From anecdotal evidence to quantitative evaluation methods: A systematic review on evaluating explainable ai. ACM Computing Surveys.
- Pan, Wang, and Hara (2020) Pan, D.; Wang, T.; and Hara, S. 2020. Interpretable companions for black-box models. In International conference on artificial intelligence and statistics, 2444–2454. PMLR.
- Ribeiro, Singh, and Guestrin (2016) Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 1135–1144.
- Rudin (2019) Rudin, C. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature machine intelligence, 1(5): 206–215.
- Rymarczyk et al. (2020) Rymarczyk, D.; Struski, Ł.; Tabor, J.; and Zieliński, B. 2020. Protopshare: Prototype sharing for interpretable image classification and similarity discovery. arXiv preprint arXiv:2011.14340.
- Sahakyan, Aung, and Rahwan (2021) Sahakyan, M.; Aung, Z.; and Rahwan, T. 2021. Explainable Artificial Intelligence for Tabular Data: A Survey. IEEE Access, 9: 135392–135422.
- Wang, Fujimaki, and Motohashi (2015) Wang, J.; Fujimaki, R.; and Motohashi, Y. 2015. Trading interpretability for accuracy: Oblique treed sparse additive models. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 1245–1254.
- Wang and Saligrama (2012) Wang, J.; and Saligrama, V. 2012. Local supervised learning through space partitioning. Advances in Neural Information Processing Systems, 25.
- Wang (2019) Wang, T. 2019. Gaining free or low-cost interpretability with interpretable partial substitute. In International Conference on Machine Learning, 6505–6514. PMLR.
- Wang and Lin (2021) Wang, T.; and Lin, Q. 2021. Hybrid predictive models: When an interpretable model collaborates with a black-box model. The Journal of Machine Learning Research, 22(1): 6085–6122.
- Zagoruyko and Komodakis (2016) Zagoruyko, S.; and Komodakis, N. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.
Appendix A
Theoretical Results
Definition 1.
(Z Sets)
and
Definition 2.
(C Sets)
Lemma 1.
(Increasing in )
Proof.
Notice , statement follows. ∎
Lemma 2.
( to ) where
Proof.
Part 1:
Part 2: Observe that, by construction, is the percentile. Thus, .
∎
Lemma 3.
(Set Equivalence 1) , where and
Proof.
Part 1: We show that for given .
Assume for contradiction that . Then s.t. i.e. there exists a pair of observations for which swapping allocation decisions increases the objective function.
Case 1 (L.3.1.1): .
But, , we have that
which contradicts the definition of , thus .
Case 2 (L.3.1.2): . Then, . Thus, which by assumption on which is a contradiction with , so .
Case 3 (L.3.1.3): . Using the same argument as in Case 2 (L.3.1.2), contradiction and thus follows.
Case 4 (L.3.1.4): . Using the same argument as in Case 2 (L.3.1.2), contradiction and thus follows.
Thus we have shown that , no such exist, and so .
Part 2: We show for given .
Assume for contradiction that . Then s.t. .
Case 1 (L.3.2.1): . Then, and
and . But then, i.e. swapping the allocation decisions of and increases the objective function which is a contradiction. Thus, .
Case 2 (L.3.2.2): . Using the same argument as in Case 1 (L.3.2.1), follows.
Case 3 (L.3.2.3): . Using the same argument as in Case 1 (L.3.2.1), follows.
Thus,
Having shown both directions of implication, we conclude
∎
Proposition 1.
(Maximal Sufficient Performance) For all $q \in [0, 1]$, $a^{*}_{q} \in \operatorname{arg\,max}_{a \in \mathcal{A}_{q}} \Pi(a)$, where $\Pi(a) = \frac{1}{n}\sum_{i=1}^{n}\left[a_i\, s_{gb,i} + (1 - a_i)\, s_{bb,i}\right]$ and $\mathcal{A}_{q}$ is the class of $q$-explainable allocators.
Proof.
From Lemma 3, it is enough to show that . First, from Lemma 2, it follows that . Next, we show that .
Case 1 (P.1.1): . From Lemma 1, and thus .
Case 2 (P.1.2): . Using the same argument as in Case 1 (P.1.1), follows.
Case 3 (P.1.3): . Using the same argument as in Case 1 (P.1.1), follows.
Case 4 (P.1.4): by definition.
We have shown that and , thus .
∎
Lemma 4.
(Set Equivalence 2) where and
Proof.
Part 1: First we show that .
Assume for contradiction that , then s.t.
i.e. there exists a pair of observations for which swapping allocation decisions increases the -specific objective function.
Next, , and in particular for , . Observe,
Case 1 (L.4.1.1): . Then . But this is a contradiction with the assumed , thus .
Case 2 (L.4.1.2): . Then, since for . But this is a contradiction with , thus
Case 3 (L.4.1.3): . Then, since . Equivalently, , which is a contradiction with , thus .
Thus, we have shown that .
Part 2: Next we show that . . Thus, , .
Having shown both sides of the implication, we conclude .
∎
Proposition 2.
(Maximal Sufficient Explainable Performance) For all $q \in [0, 1]$, $a^{*}_{q} \in \operatorname{arg\,max}_{a \in \mathcal{M}_{q}} \Pi_{gb}(a)$, where $\Pi_{gb}(a) = \frac{1}{n}\sum_{i=1}^{n} a_i\, s_{gb,i}$ and $\mathcal{M}_{q} = \operatorname{arg\,max}_{a \in \mathcal{A}_{q}} \Pi(a)$ is the set of allocators attaining maximal sufficient ensemble performance.
Proof.
Proposition 3.
(Monotone Allocation) For all $0 \leq q_1 \leq q_2 \leq 1$ and all $x_i$, $a^{*}_{q_1}(x_i) = 1 \Rightarrow a^{*}_{q_2}(x_i) = 1$.
Proof.
From Lemma 2, . Since , thus . ∎
Dataset | Task | Observations | Features | Epochs | Tree | Regr. | GBT | DNN
---|---|---|---|---|---|---|---|---
Wine | classification | 2,554 | 11 | 800 | 73.9 | 72.1 | 80.5 | 79.5 |
Phoneme | classification | 3,172 | 5 | 800 | 85.3 | 73.3 | 88.5 | 87.1 |
KDDIPUMS | classification | 5,188 | 20 | 800 | 88.3 | 85.8 | 88.0 | 83.6 |
EyeMovements | classification | 7,608 | 20 | 800 | 58.5 | 53.7 | 68.2 | 56.6 |
Pol | classification | 10,082 | 26 | 400 | 97.1 | 86.9 | 98.4 | 98.5 |
Bank | classification | 10,578 | 7 | 400 | 79.0 | 73.6 | 79.7 | 75.6 |
MagicTelescope | classification | 13,376 | 10 | 400 | 81.7 | 77.6 | 85.9 | 84.4 |
House16H | classification | 13,488 | 16 | 400 | 84.7 | 83.2 | 89.2 | 86.2 |
Credit | classification | 16,714 | 10 | 400 | 77.1 | 73.9 | 77.7 | 71.4 |
California | classification | 20,634 | 8 | 400 | 85.5 | 82.4 | 90.6 | 87.8 |
Electricity | classification | 38,474 | 7 | 400 | 86.8 | 74.6 | 92.6 | 81.5 |
Jannis | classification | 57,580 | 54 | 400 | 74.6 | 72.4 | 79.3 | 77.6 |
MiniBooNE | classification | 72,998 | 50 | 400 | 89.8 | 84.6 | 93.9 | 93.0 |
WineR | regression | 6,497 | 11 | 800 | 0.25 | 0.25 | 0.20 | 0.23 |
IsoletR | regression | 7,797 | 613 | 800 | 0.39 | 0.38 | 0.25 | 0.16 |
CPUR | regression | 8,192 | 21 | 800 | 0.06 | 0.19 | 0.04 | 0.05 |
SulfurR | regression | 10,081 | 6 | 400 | 0.06 | 0.08 | 0.05 | 0.03 |
BrazilianHousesR | regression | 10,692 | 8 | 400 | 0.02 | 0.09 | 0.01 | 0.01 |
AileronsR | regression | 13,750 | 33 | 400 | 0.11 | 0.10 | 0.09 | 0.09 |
MiamiHousingR | regression | 13,932 | 14 | 400 | 0.12 | 0.17 | 0.08 | 0.08 |
PolR | regression | 15,000 | 26 | 400 | 0.15 | 0.61 | 0.09 | 0.07 |
ElevatorsR | regression | 16,599 | 16 | 400 | 0.10 | 0.09 | 0.07 | 0.06 |
BikeSharingR | regression | 17,379 | 6 | 400 | 0.22 | 0.30 | 0.21 | 0.21 |
FifaR | regression | 18,063 | 5 | 400 | 0.25 | 0.34 | 0.23 | 0.24 |
CaliforniaR | regression | 20,640 | 8 | 400 | 0.23 | 0.27 | 0.16 | 0.18 |
HousesR | regression | 20,640 | 8 | 400 | 0.17 | 0.20 | 0.13 | 0.13 |
SuperconductR | regression | 21,263 | 79 | 400 | 0.14 | 0.21 | 0.10 | 0.11 |
HouseSalesR | regression | 21,613 | 15 | 400 | 0.10 | 0.12 | 0.08 | 0.08 |
House16HR | regression | 22,784 | 16 | 400 | 0.10 | 0.11 | 0.08 | 0.10 |
DiamondsR | regression | 53,940 | 6 | 400 | 0.12 | 0.14 | 0.12 | 0.11 |
MedicalChargesR | regression | 163,065 | 5 | 100 | 0.04 | 0.12 | 0.04 | 0.04 |
Model | L1 penalty | min split | max leaf | max depth
---|---|---|---|---
Logistic regression | for in range(-10,10) | - | - | - |
Linear regression | for in range(-10,10) | - | - | - |
Classification tree | - | for in range(1,12) | for in range(1,12) | for in range(1,12) |
Regression tree | - | for in range(1,12) | for in range(1,12) | for in range(1,12) |
Model | learning rate | n estimators | max depth | subsample |
---|---|---|---|---|
GBT classifier | 0.001, 0.01, 0.1 | , in range(4,10) | , in range(3,7) | (2,4,6,8,10)*0.1 |
GBT regressor | 0.001, 0.01, 0.1 | , in range(4,10) | , in range(3,7) | (2,4,6,8,10)*0.1 |
GBT allocator | 0.01, 0.1 | , in range(2,11,2) | , in range(2,6) | (25,50,75,100)*0.01 |
Model | L2 penalty | learning rate | lr schedule | optimizer |
---|---|---|---|---|
WRN regressor | 0, 1e-7, 1e-5, 1e-3 | (1e-5, 1e-4, 1e-3), (1e-4, 1e-3, 1e-2) | constant, cosine | Adam, SGD+Mtm |
WRN allocator | 0, 1e-7, 1e-5, 1e-3 | (1e-5, 1e-4, 1e-3), (1e-4, 1e-3, 1e-2) | constant, cosine | Adam, SGD+Mtm |

