
Ensemble Squared: A Meta AutoML System

Jason Yoo, University of British Columbia, Vancouver, Canada, [email protected]; Tony Joseph, University of British Columbia, Vancouver, Canada, [email protected]; Dylan Yung, Georgia Institute of Technology, Atlanta, United States of America, [email protected]; S. Ali Nasseri, University of British Columbia, Vancouver, Canada, [email protected]; and Frank Wood, University of British Columbia, Vancouver, Canada, [email protected]
(2021)
Abstract.

There are currently many barriers that prevent non-experts from exploiting machine learning solutions, ranging from the lack of intuition on statistical learning techniques to the trickiness of hyperparameter tuning. Such barriers have led to an explosion of interest in automated machine learning (AutoML), whereby an off-the-shelf system can take care of many of the steps for end-users without the need for expertise in machine learning. This paper presents Ensemble Squared (Ensemble2), an AutoML system that ensembles the results of state-of-the-art open-source AutoML systems. Ensemble2 exploits the diversity of existing AutoML systems by leveraging the differences in their model search spaces and heuristics. Empirically, we show that the diversity across existing AutoML systems is sufficient to justify ensembling at the AutoML system level. In demonstrating this, we also establish new state-of-the-art AutoML results on the OpenML tabular classification benchmark.

automated machine learning, ensemble learning, tabular data
copyright: acmcopyright; journalyear: 2021; doi: 10.1145/1122445.1122456; conference: SIGKDD ’21: ACM SIG on Knowledge Discovery and Data Mining, August 14–18, 2021, Virtual Event, Singapore; booktitle: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’21), August 14–18, 2021, Virtual Event, Singapore; submissionid: 1948; ccs: Computing methodologies – Machine learning; ccs: Computing methodologies – Artificial Intelligence; ccs: Computing methodologies – Search methodologies

1. Introduction

Advances in computer hardware and the ever-expanding abundance of data are enabling the construction of machine learning models that are yielding progressively more value in a growing set of application domains. Unfortunately, it is well known that there is no single best machine learning model that can handle all problems (Wolpert, 1996). As such, for every new application area or problem, a laborious, largely manual process must be followed that includes data cleaning, feature engineering, testing and model design. This is what data scientists do, often in collaboration with domain experts. With the demand for data science talent on the job market exceeding supply (Manyika et al., 2011; Pompa and Burke, 2017; McKinsey Analytics, 2016), the societal “value add” from machine learning is bottlenecked by the availability of data scientists and the complexity of applied machine learning methods.

One solution to this problem is to automate much of the machine learning model development and pipeline selection process using automated machine learning (AutoML) techniques. AutoML systems allow data scientists to focus directly on value creation, i.e. developing methods for solving the underlying problem, while other tasks such as data cleaning, model selection, model fitting, and hyperparameter tuning are automated. The full promise of AutoML is that, eventually, it will enable non-data scientists to exploit performant predictive models and extract insights from data.

A variety of AutoML systems exist today (Erickson et al., 2020; Feurer et al., 2015; Heffetz et al., 2020; LeDell and Poirier, 2020) and have begun to coalesce around reporting results on the OpenML classification benchmark (Gijsbers et al., 2019). Our review of the open-source AutoML systems and their results led us to ask the following question: Ensembling individual models has been shown to be an extremely powerful idea (Hansen and Salamon, 1990); could it be that there is sufficient diversity in AutoML systems so that ensembling AutoML systems would lead to quantitative gains?

This is the question that this paper addresses. Ensemble2, the system we built, ensembles machine learning pipelines searched for in parallel by a set of diverse AutoML systems (our base systems), and achieves a new state of the art baseline on the OpenML classification benchmark. To our knowledge, no other existing work ensembles results among different AutoML systems, exploiting the different heuristics between them in terms of pipeline search, model selection, and hyperparameter optimization strategies to achieve better performance.

Figure 1. Overview of the Ensemble2 workflow. Ensemble2 is made up of two underlying subsystems. The training dataset (in CSV format), along with a user-specified target column, is sent to the Pipeline Search Subsystem. This spins up M different AutoML systems (our base systems) as Singularity containers and performs the pipeline search procedure in parallel for a set time duration. Once the time limit has been hit and all P discovered pipelines have been collected, it ranks the pipelines based on their validation scores. The ranked pipelines, alongside the training and test datasets, are then passed on to the second subsystem, the Ensemble Prediction Subsystem. Here, the top K pipelines (where K is an Ensemble2 hyperparameter and K ≤ P) are selected, optionally refit, and made to generate predictions on the test dataset. The predictions are then passed on to the ensembling module, which generates the final Ensemble2 predictions either using majority voting or a stacking model. The stacking model can alternatively ensemble the best pipeline from each AutoML system.

We are aware that numerous AutoML solutions already internally ensemble machine learning pipelines discovered during their internal search phase (Feurer et al., 2015, 2020; Erickson et al., 2020; Chen et al., 2018; Zaidi et al., 2020). They do so for the same reason that we explore ensembling AutoML systems externally: ensembling can reduce both bias and variance of a machine learning models’ predictions.

Ensemble2 has the additional benefit of being more robust than its base systems. Existing AutoML systems are brittle and can fail to return solutions on some datasets due to out-of-memory errors, missing values, missing metafeatures, incorrect task identification, and incorrectly typed data columns (Zöller and Huber, 2019). These failures are strongly linked to differences in the systems' pre-processing steps and the machine learning techniques they implement, which can have different bugs and varying degrees of functional flexibility. Ensemble2, by its construction, is more robust than all of its underlying AutoML systems and only fails when all base systems fail.

Figure 1 presents an overview of Ensemble2. Our system consists of two subsystems, one that performs pipeline search using the base systems in parallel and another that ensembles results. Tables 2 and 5 show Ensemble2's performance on the OpenML classification benchmark datasets relative to the AutoML systems it ensembles. We find that Ensemble2 achieves the highest average rank overall and that it outperforms notable current open-source AutoML systems such as AutoGluon (Erickson et al., 2020), Auto-Sklearn (Feurer et al., 2015), Auto-Sklearn 2.0 (Feurer et al., 2020), CMU AutoML (https://github.com/autonlab/cmu-ta2), and H2O AutoML (LeDell and Poirier, 2020).

In addition to our results, we have built a public-facing web interface that enables anyone to upload their own CSV and use Ensemble2. Ensemble2 is also easily scalable by virtue of containerization using Singularity (Gannon and Sochat, 2017) and integration with Slurm (Yoo et al., 2003).

2. Background

Ensemble2 relies on ensembling the best pipelines from five base AutoML systems: H2O AutoML (LeDell and Poirier, 2020), Auto-Sklearn (Feurer et al., 2015), Auto-Sklearn 2.0 (Feurer et al., 2020), AutoGluon (Erickson et al., 2020), and CMU AutoML, which was developed as part of the DARPA D3M program. These systems were selected for their differing heuristics and search spaces. Table 1 contains a brief overview of these base systems and the different approaches they take to automating the pipeline search and hyperparameter optimization problem.

AutoML System | Primitive Library | Model Discovery and Hyperparameter Tuning | Internal Ensembling
AutoGluon | Gluon Library | Fixed Defaults | Multi-Layer Stacking and Bagging
Auto-Sklearn | Scikit-Learn | BayesOpt + Meta-Learning | Forward Search
Auto-Sklearn 2.0 | Scikit-Learn | Portfolio Learning | Forward Search
CMU AutoML | D3M Primitives | Templates + Grid Search | -
H2O AutoML | H2O Library | Grid and Random Search | Super Learner

Table 1. Base AutoML Systems used in Ensemble2. This table highlights some of the differences between these AutoML systems and the diversity of methods that Ensemble2 benefits from.

2.1. Problem Formulation

While there are several definitions of the AutoML problem, our work follows the definition in Zöller and Huber (2019). Let a machine learning pipeline $P:\mathcal{X}\rightarrow\mathcal{Y}$ be a sequential combination of algorithms that transforms a feature vector $x\in\mathcal{X}$ to a target value $y\in\mathcal{Y}$. For example, $y$ is a one-hot vector of class labels for a classification problem and a real number for a one-dimensional regression problem. Let $\mathcal{A}=\{A^{(1)},A^{(2)},...,A^{(n)}\}$ be a fixed set of data-cleaning, feature extraction, and estimator algorithms, where each algorithm $A^{(i)}$ is configured by a set of hyperparameters $\lambda^{(i)}$ from the domain $\Lambda^{(i)}$. Then, $P$'s structure can be described as a Directed Acyclic Graph (DAG) where each node is an algorithm $A^{(i)}$ and each edge represents the data flow between algorithms.

The objective of an AutoML system is to find the configuration of algorithms and hyperparameters that minimizes the loss on an unseen test dataset. Ensemble2 estimates $P^{*}$ by ensembling (Section 2.2) the pipelines returned by its base AutoML systems, where $P^{*}$ is defined as:

(1) $P^{*}=\operatorname*{argmin}_{P\in\mathcal{P}}\;\mathcal{L}(\mathcal{D}_{train},\mathcal{D}_{test},P)$

where $P$ is a valid pipeline from the space of all valid DAG-structured pipelines $\mathcal{P}$, $\mathcal{L}$ is the task loss, and $\mathcal{D}_{train}$ and $\mathcal{D}_{test}$ are the training and test datasets respectively.
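To make the formulation concrete, the following minimal sketch (illustrative only; it uses scikit-learn rather than Ensemble2's own primitive libraries) expresses one such pipeline $P$ as a sequential chain of configurable algorithms $A^{(i)}$:

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier

# Each step is one algorithm A^(i) with its own hyperparameters lambda^(i);
# the chain as a whole is one (linear) DAG-structured pipeline P.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),           # data cleaning
    ("scale", StandardScaler()),                             # feature pre-processing
    ("classify", RandomForestClassifier(n_estimators=200)),  # estimator
])
# pipeline.fit(X_train, y_train) and pipeline.predict(X_test) then realize P: X -> Y.
```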

2.2. Ensemble Learning

Ensembling methods are commonly used to boost performance by reducing model bias and variance compared to the base learners (Rahman and Tasnim, 2014). Popular ensembling methods often utilize one of voting, bagging (Breiman, 1996), boosting (Freund and Schapire, 1996), or stacking (Wolpert, 1992) techniques. Voting involves having the base learners vote on the correct class with equal weight and taking the class with the most votes as the final prediction. Bagging involves independently training the base learners using randomly drawn subsets of the training set and having the base learners vote with equal weight. Boosting involves incrementally constructing an ensemble by training base learners to better classify training data points that the previous base learners misclassified. Lastly, stacking involves training a classifier that generates predictions by looking at the predictions of its base learners. An instance of the stacking paradigm is the Super Learner algorithm (van der Laan et al., [n.d.]), which trains its stacker on out-of-fold predictions from the base learners.

In this paper, we explored majority voting and stacking strategies (Section 3.1). In majority voting, Ensemble2 selects the top-$K$ pipelines with the lowest cross-validation loss, where $K$ is an integer set by the user; for this work and the web interface backend, we set $K$ to three. In the second strategy, stacking, the top pipeline returned by each base AutoML system is used to learn a stacking model.

2.3. Base AutoML Systems

As previously mentioned, five base AutoML systems were used in Ensemble2. These systems were selected as the base AutoML systems because of their strong individual performance shown in (Zöller and Huber, 2019) as well as our own empirical results. We explored other state-of-the-art AutoML systems such as GAMA (Gijsbers et al., 2019), TPOT (Olson and Moore, 2016) (which uses genetic programming), DeepLine (Heffetz et al., 2020), AlphaD3M (Drori et al., 2019) (which both use reinforcement learning), and Auto-Weka 2.0 (Kotthoff et al., 2017) (which uses Bayesian optimization), but settled on the chosen five based on performance and ease of integration.

The different search strategies are elaborated below, so we only detail the systems' different search spaces in this paragraph. Auto-Sklearn and Auto-Sklearn 2.0 restrict their ML algorithms (e.g., data-cleaning, feature pre-processing, and estimator algorithms) to the scikit-learn library (Pedregosa et al., 2011). CMU AutoML restricts its ML algorithms to the D3M library. AutoGluon uses a mix of scikit-learn models, neural networks, and custom tree-based ML algorithms. Lastly, H2O AutoML uses its own pre-built models, consisting of random forest, gradient boosting machine, linear, and deep learning models.

AutoGluon.   AutoGluon (Erickson et al., 2020) ensembles pre-built classification models that have been shown to perform well empirically on classification tasks. AutoGluon employs a stacking technique called Multi-Layer Stack Ensembling alongside K-fold ensemble bagging. Multi-Layer Stack Ensembling stacks models in multiple layers and trains them in a layer-wise manner, which helps ensure high-quality predictions within a given time frame. To prevent overfitting, AutoGluon relies on K-fold ensemble bagging at all layers.

Auto-Sklearn.   Auto-Sklearn (Feurer et al., 2015) uses Bayesian optimization to search through the model and hyperparameter space of selected scikit-learn (Pedregosa et al., 2011) modules. Meta-learning is used to warm start the search procedure. When the model search is over, Auto-Sklearn constructs an ensemble from discovered pipelines by employing forward search (Caruana et al., 2004).

Auto-Sklearn 2.0.   Auto-Sklearn 2.0 (Feurer et al., 2020) improves upon Auto-Sklearn by employing portfolio learning. It first runs Bayesian optimization on all meta-train datasets and constructs a portfolio of pipelines to run on all future tasks. In addition, it learns which model selection strategy to use on a new dataset based on simple meta-features, involving some combination of k-fold cross-validation, a validation holdout set, and successive halving (Jamieson and Talwalkar, 2016). It also incorporates other improvements, such as intermittent result storage, to return results more quickly on larger datasets.

CMU AutoML.   CMU AutoML takes a template approach to the pipeline synthesis problem. More specifically, it uses multiple hand-crafted pre-processing steps and cycles through all possible pre-processing and estimator combinations to assess many pipelines in a short amount of time. This follows from the authors' finding that foregoing hyperparameter search, and instead using that time to evaluate as many models as possible, achieves superior performance when dealing with short search times. However, some key pipeline component hyperparameters are grid-searched.

H2O AutoML.   H2O AutoML (LeDell and Poirier, 2020) ensembles variations of the following models: random forests, gradient boosting machines, linear models, deep learning models, and stacked combinations of the aforementioned models. The system uses a 5-fold cross-validation scheme for training and evaluating each model, takes the model with the best cross-validation score (log loss for classification, mean squared error for regression) across all 5 folds, and refits the top-ranked model on the entire training dataset. This refit model is then returned and used for prediction. Every model has a predefined range for each hyperparameter deemed "most important", which is hard-coded into the system. Random search is then performed over this finite set of hyperparameters to find the optimal models. H2O AutoML also creates two stacked models (Super Learner models) to evaluate against the other models: one that ensembles all models and one that ensembles the best model of each of the aforementioned types.

3. Methodology

In this section, we will discuss the two ensembling approaches we have experimented with for tackling tabular classification problems: Majority Voting and Super Learner (van der Laan et al., [n.d.]).

3.1. Ensemble Methods

Majority Voting.   Let us define the feature vector space as $\mathcal{X}$ and the $|C|$-dimensional one-hot target vector space as $\mathcal{Y}$. We denote by $P:\mathcal{X}\rightarrow\mathcal{Y}$ a machine learning pipeline with fully defined algorithm and hyperparameter values. Let $\mathbf{P}=\{P_{1},P_{2},...,P_{J}\}$ be the set of $J$ machine learning pipelines from the base AutoML systems that achieve the lowest cross-validation loss when the search procedure is over. Lastly, we denote the indicator function as $\mathbb{I}$.

A majority voting ensemble model $M:\mathcal{X}\rightarrow\mathcal{Y}$ assigns the label $\hat{y}$ to a data point $x$ as follows:

(2) $\hat{y}=\operatorname*{argmax}_{y}\left(\sum_{i=1}^{J}\mathbb{I}(\mathbf{P}_{i}(x)=y)\right)$

where each $\mathbf{P}_{i}$ is trained on $\mathcal{D}_{train}$ with the base classifier's loss. The advantage of the majority voting approach is that no additional training is required beyond training the pipelines in $\mathbf{P}$.
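A minimal sketch of this voting rule is shown below, assuming (our assumption, not part of the paper's code) that the top-$K$ pipelines expose a scikit-learn-style predict method and that class labels are encoded as non-negative integers:

```python
import numpy as np

def majority_vote(pipelines, X):
    # Each pipeline casts one equally weighted vote per row (Eq. 2).
    votes = np.stack([p.predict(X) for p in pipelines])   # shape: (K, num_rows)
    # For every row, count the votes per class label and keep the most frequent one.
    return np.array([np.bincount(row_votes).argmax() for row_votes in votes.T])
```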

Super Learner.   The Super Learner algorithm is a stacking algorithm that fits a stacking model $M$ on the out-of-fold predictions of the base learners $\mathbf{P}$. At training time, the algorithm first partitions the training dataset $\mathcal{D}_{train}$ into $K$ mutually exclusive folds and fits each of its $J$ base learners $K$ times, each time holding out one fold. Each base learner then produces an out-of-fold prediction on the held-out, unseen part of the training data. Lastly, a stacking model of choice $M$ learns a weighting over the predictions from all base learners.

A formal definition of $M$'s training objective is as follows:

(3) $\theta=\operatorname*{argmin}_{\theta}\;\mathcal{L}(M,\mathcal{D}_{train},\mathbf{P},\theta)$

where $\theta$ denotes $M$'s parameters, $\mathbf{P}$ is the set of $J\times K$ base learners fit on different folds of $\mathcal{D}_{train}$, and $\mathcal{L}$ is the loss function minimized over the parameter space of $\theta$.

Ensemble2 uses a softmax regression model for $M$. However, we note that $M$ can be any classifier, or even an artifact from another AutoML system. Given all the base learner predictions on a single data point $x$, the softmax regression model $M:\mathcal{J}\rightarrow\mathcal{Y}$ assigns the label $\hat{y}$ to $x$ as follows:

(4) $\hat{y}=\operatorname*{argmax}_{y}\left(\frac{\exp(\theta_{y}^{T}x)}{\sum_{c=1}^{C}\exp(\theta_{c}^{T}x)}\right)$

where $c\in\{1,..,C\}$ and $\theta_{c}\in\mathbb{R}^{J}$ are the coefficients learned during the training phase.
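The sketch below illustrates this Super Learner setup with scikit-learn components. It is an approximation of the procedure described above (here the stacker is fit on out-of-fold class probabilities rather than raw label predictions), and the base learners, fold count, and data variables are placeholders for illustration rather than Ensemble2's actual implementation:

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LogisticRegression

def fit_super_learner(base_learners, X_train, y_train, n_folds=5):
    # Out-of-fold predictions from every base learner, as in Section 2.2.
    oof = [cross_val_predict(m, X_train, y_train, cv=n_folds, method="predict_proba")
           for m in base_learners]
    Z = np.hstack(oof)  # stacked features: one block of class probabilities per learner
    # Softmax regression stacker M with an L2 penalty, in the spirit of Eq. (4).
    stacker = LogisticRegression(penalty="l2", max_iter=1000)
    stacker.fit(Z, y_train)
    # Refit every base learner on the full training set for test-time use.
    fitted = [m.fit(X_train, y_train) for m in base_learners]
    return fitted, stacker

def super_learner_predict(fitted, stacker, X_test):
    Z_test = np.hstack([m.predict_proba(X_test) for m in fitted])
    return stacker.predict(Z_test)
```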

4. Architecture

Ensemble2 employs a minimalist API that interfaces with the D3M gRPC API (https://gitlab.com/datadrivendiscovery/ta3ta2-api), the Auto-Sklearn API (https://automl.github.io/auto-sklearn/master/index.html), the AutoGluon API, and the H2O API. Its design is highly extensible: to add an AutoML system, one only needs to create a Singularity container that can run in search and predict modes. It also allows for a client-server setup, where the AutoML systems run in server mode and Ensemble2 schedules commands to them from the client side, all within a single Singularity container.

Ensemble2’s containerization approach enables us to run the AutoML systems locally or remotely on clusters using a workload manager like Slurm. Scaling Ensemble2 for production use is as simple as specifying which Slurm partition(s) Ensemble2 should run on. A single Ensemble2 run can trivially be deployed across many nodes because of its parallel design with respect to its base systems. Containerization using Singularity also helps us overcome the specific OS-dependencies that some base AutoML systems have.

Such a minimalist client-server API interface allowed us to build a web-based user interface, which end-users can use without expertise in machine learning or programming. The user can directly upload their training dataset, select the target column, and specify the search time. Once the search request is submitted, the pipeline search stage starts and the user is notified when the search is completed. The user can then upload their test dataset and send a prediction request, which starts the ensemble prediction stage. Once this stage is done, the user can download the final prediction dataset.

4.1. Pipeline Search Stage

Given a training CSV, Ensemble2 spins up a Singularity container for each of its base AutoML systems in parallel. This is done by sending search requests to each base AutoML system to perform the pipeline search procedure for a given duration of time with a specific seed. Discovered ML pipelines, along with their cross-validation scores and out-of-fold predictions, are saved to disk. If the Super Learner algorithm is selected, Ensemble2 fits the softmax regression stacker on the out-of-fold predictions generated by the base AutoML systems.
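A simplified orchestration sketch of this stage is given below; the container image names, the "search" entry point, and its command-line flags are hypothetical placeholders, not the actual contract used by Ensemble2's containers:

```python
import subprocess

# Hypothetical Singularity images, one per base AutoML system.
BASE_SYSTEM_IMAGES = ["autogluon.sif", "autosklearn.sif", "autosklearn2.sif",
                      "cmu_automl.sif", "h2o_automl.sif"]

def launch_pipeline_search(train_csv, target_column, time_limit_s, seed, out_dir):
    # One container per base system, all searching in parallel.
    procs = [subprocess.Popen([
                "singularity", "run", image, "search",
                "--train", train_csv, "--target", target_column,
                "--time-limit", str(time_limit_s), "--seed", str(seed),
                "--output", out_dir])
             for image in BASE_SYSTEM_IMAGES]
    # Each container writes its discovered pipelines, cross-validation scores,
    # and out-of-fold predictions to disk before exiting.
    for p in procs:
        p.wait()
```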

4.2. Ensembling Stage

Given a test CSV, Ensemble2 first collects all discovered pipelines and ranks them based on their search-time validation scores. The top-K pipelines (where K is an Ensemble2 hyperparameter) are then selected, and their respective AutoML systems are spun up in parallel to generate predictions on the test CSV. After the predictions are generated, Ensemble2 produces its final test dataset predictions by top-K majority voting or by passing the individual predictions to the stacking model M.
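The ranking step reduces to a sort over the stored validation scores; a small sketch follows, where the record fields are illustrative rather than Ensemble2's actual on-disk schema:

```python
def select_top_k(discovered_pipelines, k=3):
    # Rank all pipelines (from any base system) by search-time validation score
    # and keep the top K for the ensembling stage.
    ranked = sorted(discovered_pipelines,
                    key=lambda record: record["validation_score"], reverse=True)
    return ranked[:k]
```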

5. Empirical Evaluation

We experimented with two versions of Ensemble2. Version-1 (V1) ensembles AutoGluon, Auto-Sklearn, and Auto-Sklearn 2.0, while Version-2 (V2) ensembles AutoGluon, Auto-Sklearn, Auto-Sklearn 2.0, CMU AutoML, and H2O AutoML. This enables us to investigate whether adding more AutoML systems improves the overall performance and robustness of Ensemble2. V1 and V2 both perform majority voting and Super Learner ensembling. The sections below describe the evaluation datasets, the different experiments we performed, and their respective results.

5.1. Datasets

All experiments were performed on the OpenML classification benchmark datasets (Gijsbers et al., 2019). The benchmark was curated so that its datasets vary by orders of magnitude in the number of data points and features. Each dataset also varies in the number of categorical features, numerical features, and missing values. The benchmark excludes classification problems that are too easy to solve (e.g., most artificially generated datasets) and is gradually updated over time to prevent AutoML tools from overfitting to the member datasets. At the time of this experiment, there were 41 datasets in the benchmark. The evaluation metric is accuracy for all datasets in the benchmark.
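For reference, a single benchmark dataset can be fetched with the openml-python package as sketched below (dataset ID 31 is one of the IDs appearing in Tables 2 and 5); this is a convenience illustration, not part of Ensemble2 itself:

```python
import openml

# Download one OpenML dataset by ID and split it into features and target.
dataset = openml.datasets.get_dataset(31)
X, y, categorical_mask, attribute_names = dataset.get_data(
    target=dataset.default_target_attribute
)
```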

5.2. Evaluating Ensemble2 V1

This experiment compares the accuracy of the predictions generated by the base systems against Ensemble2's predictions, which are generated by ensembling the aforementioned predictions. We call this the wall-clock experiment, since all systems are given the same amount of time, one hour, for pipeline search.

Setup.   For each benchmark dataset, all AutoML systems were run in parallel for the time limit. Every system had access to 4 CPUs and 8 GB of RAM, and runs were performed across machines with Intel E5-2683 v4 Broadwell (2.1 GHz) and Intel Xeon Gold 5120 Skylake (2.2 GHz) processors. Ensemble2's majority voting ensembled the three pipelines with the best validation scores from the base AutoML systems.

We used the default settings for AutoGluon, Auto-Sklearn, and Auto-Sklearn 2.0 with the following exceptions. AutoGluon used its best performance mode (auto_stack=True), since we wanted performance over resource efficiency for benchmarking purposes. Auto-Sklearn used cross-validation as its resampling strategy with the number of folds set to 3 and its memory limit set to 8 GB. Auto-Sklearn 2.0's configuration was identical to Auto-Sklearn's except for its resampling strategy, which it determined based on the dataset meta-features. Both Auto-Sklearn and Auto-Sklearn 2.0 had 25% of their search time allocated to refitting their best discovered models on the entire training dataset, as the authors recommend refitting on the whole dataset when using the cross-validation resampling strategy. Allocating less than 25% of the search time for refitting led to failures on certain datasets.
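For illustration, the configuration described above corresponds roughly to the sketch below; argument names vary across AutoGluon and Auto-Sklearn releases, and the data variables (train_df, X_train, y_train) are placeholders, so treat this as an assumption-laden sketch rather than the exact code we ran:

```python
from autogluon.tabular import TabularPredictor
from autosklearn.classification import AutoSklearnClassifier

# AutoGluon in its best-performance mode (stacking and bagging enabled).
predictor = TabularPredictor(label="target").fit(
    train_data=train_df, auto_stack=True, time_limit=3600
)

# Auto-Sklearn with 3-fold cross-validation and an 8 GB memory limit;
# roughly 25% of the one-hour budget is left for refitting the best models.
askl = AutoSklearnClassifier(
    time_left_for_this_task=2700,
    resampling_strategy="cv",
    resampling_strategy_arguments={"folds": 3},
    memory_limit=8192,
)
askl.fit(X_train, y_train)
askl.refit(X_train, y_train)
```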

For the Super Learner, the best models from each AutoML system were used, mainly because the majority of the AutoML systems only produce out-of-fold predictions for their best models. When a model could not produce out-of-fold predictions, it was excluded from the Super Learner. An example of this is Auto-Sklearn 2.0, which uses a different evaluation method depending on the dataset and only generates out-of-fold predictions when cross-validation is chosen. Lastly, we applied an L2 penalty when training our softmax regression stacking model.

To account for scenarios where AutoML systems did not terminate on time, we gave each Singularity container a 30-minute grace period to smoothly exit the search and predict phases. Despite this precaution, we found that, on rare occasions, AutoML systems did not terminate even after the grace period but did save model weights to disk that could be used for prediction. In such cases, we loaded the saved models at prediction time and counted those runs as successes. If multiple models had been saved in this manner, a random model was selected and scored.

We also observed a number of out-of-memory errors during our initial runs, so each Singularity container was given an extra 4 GB of RAM to prevent it from crashing. Lastly, in cases where AutoML runs still failed, we re-ran them one more time with identical configurations to ensure that the failures were not sporadic. Note that this measure was taken for accurate reporting of the scores generated by the base AutoML systems. In a production scenario, a non-expert using our system would still receive a model if even one of the base AutoML systems successfully completed the search; a failure would occur only if all the base AutoML systems failed.

OpenML Dataset ID | AutoGluon | Auto-Sklearn | Auto-Sklearn 2.0 | Ensemble2 Voting | Ensemble2 Stacking
2 | 0.989±0.001 | 0.993±0.001 | 0.989±0.002 | 0.991±0.002 | 0.989±0.001
3 | 0.993±0.002 | 0.996±0.001 | 0.997±0.000 | 0.994±0.001 | 0.993±0.002
5 | 0.684±0.012 | 0.740±0.013 | 0.736±0.010 | 0.707±0.019 | 0.673±0.012
12 | 0.975±0.002 | 0.984±0.003 | 0.977±0.001 | 0.978±0.004 | 0.975±0.002
31 | 0.779±0.003 | 0.772±0.003 | 0.778±0.004 | 0.778±0.002 | 0.779±0.003
54 | 0.829±0.016 | 0.845±0.010 | 0.798±0.016 | 0.846±0.011 | 0.825±0.014
1067 | 0.859±0.003 | 0.856±0.004 | 0.854±0.003 | 0.859±0.004 | 0.859±0.003
1111 | 0.983±0.000 | 0.983±0.000 | 0.983±0.000 | 0.983±0.000 | 0.983±0.000
1169 | 0.632±0.014 | 0.669±0.000 | 0.670±0.001 | 0.667±0.004 | 0.576±0.041
1461 | 0.908±0.001 | 0.906±0.000 | 0.905±0.001 | 0.908±0.001 | 0.908±0.001
1464 | 0.751±0.012 | 0.761±0.008 | 0.742±0.009 | 0.751±0.011 | 0.751±0.012
1468 | 0.918±0.008 | 0.950±0.006 | 0.947±0.013 | 0.936±0.017 | 0.918±0.008
1486 | 0.973±0.001 | 0.973±0.001 | 0.971±0.002 | 0.973±0.001 | 0.973±0.001
1489 | 0.902±0.002 | 0.904±0.002 | 0.902±0.002 | 0.902±0.002 | 0.902±0.002
1590 | 0.876±0.001 | 0.875±0.001 | 0.876±0.001 | 0.876±0.001 | 0.876±0.001
1596 | 0.889±0.006 | 0.969±0.000 | 0.968±0.001 | 0.951±0.031 | 0.889±0.006
4135 | 0.950±0.000 | 0.949±0.000 | 0.949±0.001 | 0.949±0.001 | 0.950±0.000
23512 | 0.727±0.001 | 0.731±0.000 | 0.729±0.000 | 0.727±0.001 | 0.727±0.001
23517 | 0.509±0.002 | 0.520±0.000 | 0.520±0.001 | 0.502±0.004 | 0.509±0.002
40668 | 0.839±0.007 | 0.849±0.000 | 0.658±0.000 | 0.843±0.005 | 0.839±0.007
40685 | 1.000±0.000 | 1.000±0.000 | 1.000±0.000 | 1.000±0.000 | 1.000±0.000
40975 | 0.979±0.002 | 0.981±0.002 | 0.984±0.001 | 0.979±0.003 | 0.979±0.002
40981 | 0.869±0.006 | 0.869±0.010 | 0.870±0.005 | 0.868±0.008 | 0.869±0.006
40984 | 0.941±0.001 | 0.935±0.001 | 0.940±0.001 | 0.940±0.001 | 0.941±0.001
40996 | 0.900±0.002 | 0.717±0.309 | 0.731±0.315 | 0.900±0.002 | 0.900±0.002
41027 | 0.966±0.002 | 0.910±0.020 | 0.862±0.001 | 0.966±0.002 | 0.966±0.002
41138 | 0.994±0.000 | 0.993±0.000 | 0.990±0.000 | 0.994±0.000 | 0.994±0.000
41142 | 0.736±0.003 | 0.750±0.004 | 0.737±0.002 | 0.736±0.003 | 0.736±0.003
41143 | 0.804±0.002 | 0.820±0.006 | 0.808±0.000 | 0.804±0.002 | 0.804±0.002
41146 | 0.949±0.002 | 0.948±0.004 | 0.944±0.001 | 0.949±0.002 | 0.949±0.002
41147 | 0.679±0.005 | 0.615±0.062 | 0.689±0.000 | 0.684±0.006 | 0.679±0.005
41150 | 0.948±0.001 | 0.945±0.000 | 0.948±0.000 | 0.948±0.001 | 0.948±0.001
41159 | 0.821±0.006 | 0.643±0.051 | 0.643±0.051 | 0.821±0.005 | 0.821±0.006
41161 | 0.998±0.000 | 0.997±0.000 | 0.997±0.000 | 0.998±0.001 | 0.998±0.000
41163 | 0.989±0.001 | 0.986±0.002 | 0.978±0.001 | 0.989±0.001 | 0.989±0.001
41164 | 0.723±0.002 | 0.714±0.004 | 0.714±0.004 | 0.723±0.002 | 0.723±0.002
41165 | 0.478±0.015 | 0.480±0.000 | 0.480±0.000 | 0.477±0.015 | 0.478±0.015
41166 | 0.716±0.001 | 0.127±0.000 | 0.700±0.000 | 0.710±0.001 | 0.716±0.001
41167 | 0.913±0.003 | 0.005±0.000 | 0.005±0.000 | 0.913±0.003 | 0.612±0.428
41168 | 0.719±0.001 | 0.706±0.000 | 0.720±0.000 | 0.718±0.001 | 0.719±0.001
41169 | 0.259±0.080 | 0.007±0.000 | 0.007±0.000 | 0.297±0.058 | 0.144±0.146
Average Accuracy | 0.838 | 0.790 | 0.797 | 0.842 | 0.826
Average Rank | 2.878 | 3.024 | 3.268 | 2.817 | 3.012
# First Place | 20 | 16 | 12 | 18 | 19

Table 2. Comparison between AutoGluon, Auto-Sklearn, Auto-Sklearn 2.0 and Ensemble2’s ensemble of their results. All systems were run for one hour and the results were averaged across five seeds. Failures were omitted during the mean calculations. Ensemble2’s majority voting setup ensembles top three pipelines irrespective of which AutoML systems they come from while the Super Learning setup ensembles the best pipelines returned by each system. The highest accuracy achieved by a method on a dataset is shown in boldface.

Result.   All base system scores were measured by the performance of each system's single best pipeline. We note that most of these systems already employ ensembling internally, so the returned pipelines are often already ensembles. Ensemble2's majority voting score was computed from the predictions generated by ensembling the top three pipelines with the highest validation accuracy from any of the base systems. Ensemble2's Super Learner score was computed from the predictions generated by ensembling the best pipelines returned by each successful AutoML system. Failures were excluded from the average calculations.

Table 2 summarizes the results of this experiment. The average rank of each AutoML system relative to the others is written in the bottom rows of the table, with ties allowed. Overall, Ensemble2's majority voting approach achieved the highest average accuracy (0.842) and the best average rank (2.817) across five seeds, slightly ahead of the second-best average accuracy (0.838) and rank (2.878) achieved by AutoGluon. The Super Learner approach did not perform as well as the majority voting approach, with an average accuracy of 0.826 and an average rank of 3.012. We suspect that the Super Learner did not perform as well because it only received predictions from three base learners. In future experiments, we plan to increase the number of base learners to assess their impact on performance.

To ensure that the differences in performance distribution over the benchmark datasets between pairs of AutoML systems were statistically significant, we ran a Wilcoxon signed-rank test with α = 0.05, treating each AutoML system's result on each unique dataset-seed combination as a data point. Since Ensemble2's majority voting scheme achieved first place, we computed the test statistic between majority voting Ensemble2 and every other AutoML system. The resulting p-values for rejecting the null hypothesis that a pair produced the same results all indicated statistical significance. Specifically, the p-values for AutoGluon, Auto-Sklearn, and Auto-Sklearn 2.0 were 0.0008, 0.0473, and 10^{-5} respectively.
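The significance test amounts to a paired Wilcoxon signed-rank test over aligned per-(dataset, seed) accuracy arrays; a sketch with SciPy (variable names are placeholders) is:

```python
from scipy.stats import wilcoxon

# ensemble2_scores and baseline_scores are aligned arrays of accuracies,
# one entry per (dataset, seed) combination.
statistic, p_value = wilcoxon(ensemble2_scores, baseline_scores)
significant = p_value < 0.05  # alpha = 0.05, as in the experiments above
```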

We noticed that AutoGluon achieves first place on more datasets than Ensemble2's voting scheme in Table 2 (20/41 vs 18/41 datasets) even though Ensemble2 had both higher average accuracy and a better average rank than AutoGluon. When we counted first-place finishes between only AutoGluon and Ensemble2, we found that Ensemble2 had first place on 33 of the 41 datasets and AutoGluon had first place on 30 of the 41 datasets. This means that Ensemble2 does have more wins over AutoGluon across the benchmark datasets, but on the datasets where Ensemble2 outperforms AutoGluon, other AutoML systems may still perform better than both.

System Avg. Accuracy Avg. Rank # 1st Place
AutoGluon 0.841 2.878 15
Auto-Sklearn 0.790 2.536 18
Auto-Sklearn 2.0 0.797 3.134 15
Ensemble2 Voting 0.842 3.146 10
Ensemble2 Stacking 0.826 3.304 10
Table 3. Summary statistics for 1 hour Ensemble2 V1 runs vs three hour base system runs. Voting and Stacking denote Ensemble2’s majority voting and Super Learner implementations respectively.

Equal-Compute Comparison.   While the wall-clock experiment shows that Ensemble2 produces superior results in the same time frame, Ensemble2 has access to more compute than its competitors under that setup. To investigate how Ensemble2 fares when it has access to the same amount of compute as its base systems, we compared one-hour Ensemble2 runs against three-hour runs of the base AutoML systems. The comparison results are listed in Table 3. The reported three-hour performances were averaged across five seeds.

While Ensemble2's voting scheme retained the highest average accuracy, AutoGluon's average accuracy increased to the point where Ensemble2's performance gain is insignificant. In addition, while Auto-Sklearn's average accuracy decreased, it had the most first-place finishes among all systems.

5.3. Evaluating Ensemble2 V2

Setup.   The setup for Ensemble2 V2 is identical to the setup for Ensemble2 V1 except for some key differences. Firstly, the AutoML systems were run across four machines with Intel Core i7-5820K, 3.3 GHz processors, and 48 GB DDR4 RAM. Secondly, both CMU AutoML and H2O AutoML used cross-validation as their resampling strategy, with number of folds set to 3 and 5 respectively. H2O AutoML’s allowed number of threads was set to 4 and maximum memory size was set to 8 GB.

Result.   Table 5 summarizes the results of this experiment. Again, Ensemble2's majority voting approach achieved the highest average accuracy (0.844) and the best average rank (2.963) across five seeds, ahead of the second-best average rank of 3.085 achieved by AutoGluon.

System Avg. Accuracy Avg. Rank # 1st Place
AutoGluon 0.848 2.183 19
Auto-Sklearn 0.783 5.183 5
Auto-Sklearn 2.0 0.804 4.695 4
CMU AutoML 0.806 4.720 7
H2O AutoML 0.793 4.976 1
Ensemble2 Voting 0.844 2.902 9
Ensemble2 Stacking 0.840 3.341 5
Table 4. Summary statistics for 1 hour Ensemble2 V2 runs vs five hour base system runs. Voting and Stacking denote Ensemble2’s majority voting and Super Learner implementations respectively.

We ran the same Wilcoxon signed-rank test with α = 0.05 to ensure that the performance differences between Ensemble2 majority voting and the base AutoML systems were significant. The resulting p-values for rejecting the null hypothesis that a pair produced the same results all indicated statistical significance. Specifically, the p-values for AutoGluon, Auto-Sklearn, and Auto-Sklearn 2.0 were 0.0048, 0.0117, and 0.0001. The p-values for CMU AutoML and H2O AutoML were less than 10^{-5}.

OpenML Dataset ID | AutoGluon | Auto-Sklearn | Auto-Sklearn 2.0 | CMU AutoML | H2O AutoML | Ensemble2 Voting | Ensemble2 Stacking
2 | 0.989±0.000 | 0.992±0.002 | 0.992±0.000 | 0.992±0.000 | 0.993±0.001 | 0.993±0.001 | 0.990±0.002
3 | 0.994±0.001 | 0.997±0.001 | 0.997±0.000 | 0.985±0.022 | 0.992±0.003 | 0.993±0.002 | 0.993±0.002
5 | 0.687±0.006 | 0.734±0.006 | 0.745±0.009 | 0.696±0.064 | 0.727±0.022 | 0.725±0.020 | 0.685±0.023
12 | 0.973±0.002 | 0.983±0.002 | 0.972±0.005 | 0.975±0.004 | 0.967±0.004 | 0.977±0.003 | 0.973±0.003
31 | 0.778±0.004 | 0.781±0.005 | 0.777±0.005 | 0.757±0.020 | 0.752±0.006 | 0.776±0.004 | 0.777±0.005
54 | 0.833±0.017 | 0.841±0.003 | 0.798±0.012 | 0.783±0.007 | 0.820±0.007 | 0.835±0.011 | 0.829±0.014
1067 | 0.854±0.003 | 0.857±0.004 | 0.852±0.002 | 0.853±0.005 | 0.805±0.016 | 0.854±0.003 | 0.856±0.004
1111 | 0.983±0.000 | 0.983±0.000 | 0.983±0.000 | 0.983±0.000 | 0.972±0.004 | 0.983±0.000 | 0.983±0.000
1169 | 0.658±0.005 | 0.653±0.030 | 0.670±0.000 | 0.634±0.040 | 0.633±0.005 | 0.658±0.005 | 0.658±0.005
1461 | 0.908±0.001 | 0.906±0.000 | 0.906±0.001 | 0.903±0.002 | 0.905±0.002 | 0.908±0.001 | 0.908±0.001
1464 | 0.751±0.012 | 0.761±0.007 | 0.745±0.005 | 0.770±0.012 | 0.737±0.031 | 0.764±0.012 | 0.747±0.009
1468 | 0.917±0.005 | 0.946±0.004 | 0.952±0.003 | 0.819±0.064 | 0.941±0.002 | 0.949±0.005 | 0.930±0.008
1486 | 0.973±0.000 | 0.971±0.000 | 0.954±0.006 | 0.958±0.014 | 0.971±0.001 | 0.973±0.000 | 0.973±0.000
1489 | 0.902±0.002 | 0.906±0.002 | 0.905±0.001 | 0.902±0.003 | 0.892±0.003 | 0.902±0.002 | 0.902±0.002
1590 | 0.876±0.001 | 0.876±0.000 | 0.875±0.000 | 0.834±0.026 | 0.876±0.001 | 0.876±0.000 | 0.876±0.001
1596 | 0.899±0.036 | 0.969±0.000 | 0.968±0.000 | 0.674±0.159 | 0.783±0.227 | 0.968±0.002 | 0.899±0.036
4135 | 0.949±0.001 | 0.949±0.000 | 0.950±0.000 | 0.946±0.002 | 0.947±0.001 | 0.949±0.001 | 0.949±0.001
23512 | 0.726±0.001 | 0.730±0.002 | 0.728±0.003 | 0.683±0.035 | 0.722±0.002 | 0.726±0.001 | 0.725±0.001
23517 | 0.510±0.002 | 0.520±0.000 | 0.519±0.001 | 0.518±0.001 | 0.507±0.001 | 0.507±0.005 | 0.510±0.002
40668 | 0.828±0.007 | 0.848±0.000 | 0.725±0.081 | 0.797±0.032 | 0.853±0.004 | 0.851±0.005 | 0.853±0.004
40685 | 1.000±0.000 | 0.958±0.084 | 0.958±0.084 | 0.920±0.094 | 1.000±0.000 | 1.000±0.000 | 1.000±0.000
40975 | 0.979±0.002 | 0.982±0.001 | 0.984±0.001 | 0.960±0.020 | 0.978±0.004 | 0.980±0.002 | 0.980±0.004
40981 | 0.867±0.005 | 0.866±0.002 | 0.867±0.011 | 0.865±0.005 | 0.859±0.007 | 0.862±0.012 | 0.862±0.006
40984 | 0.940±0.002 | 0.935±0.002 | 0.938±0.002 | 0.908±0.045 | 0.932±0.003 | 0.940±0.003 | 0.940±0.002
40996 | 0.897±0.001 | 0.829±0.093 | 0.887±0.001 | 0.761±0.161 | 0.861±0.010 | 0.897±0.001 | 0.895±0.001
41027 | 0.965±0.002 | 0.914±0.019 | 0.862±0.001 | 0.907±0.081 | 0.861±0.002 | 0.965±0.002 | 0.965±0.002
41138 | 0.994±0.000 | 0.993±0.000 | 0.990±0.000 | 0.993±0.000 | 0.993±0.001 | 0.994±0.000 | 0.994±0.000
41142 | 0.736±0.002 | 0.749±0.006 | 0.737±0.002 | 0.700±0.038 | 0.721±0.003 | 0.735±0.002 | 0.736±0.002
41143 | 0.802±0.003 | 0.809±0.005 | 0.810±0.002 | 0.794±0.010 | 0.791±0.008 | 0.803±0.002 | 0.789±0.010
41146 | 0.948±0.001 | 0.944±0.002 | 0.945±0.001 | 0.948±0.009 | 0.941±0.003 | 0.949±0.001 | 0.945±0.004
41147 | 0.689±0.001 | 0.565±0.000 | 0.687±0.005 | 0.631±0.000 | 0.662±0.001 | 0.688±0.002 | 0.689±0.001
41150 | 0.947±0.001 | 0.944±0.002 | 0.948±0.000 | 0.946±0.001 | 0.944±0.001 | 0.947±0.001 | 0.945±0.001
41159 | 0.819±0.001 | 0.749±0.054 | 0.813±0.012 | 0.813±0.012 | 0.783±0.014 | 0.819±0.001 | 0.819±0.001
41161 | 0.997±0.000 | 0.997±0.000 | 0.997±0.000 | 0.997±0.000 | 0.947±0.060 | 0.997±0.000 | 0.997±0.000
41163 | 0.989±0.002 | 0.983±0.000 | 0.975±0.002 | 0.975±0.002 | 0.968±0.007 | 0.988±0.002 | 0.989±0.001
41164 | 0.723±0.003 | 0.714±0.006 | 0.711±0.004 | 0.701±0.011 | 0.692±0.006 | 0.723±0.003 | 0.722±0.003
41165 | 0.480±0.016 | 0.481±0.002 | 0.482±0.012 | 0.482±0.012 | 0.380±0.002 | 0.479±0.015 | 0.474±0.015
41166 | 0.715±0.001 | 0.608±0.118 | 0.693±0.006 | 0.620±0.038 | 0.668±0.003 | 0.710±0.000 | 0.714±0.001
41167 | 0.871±0.047 | 0.570±0.344 | 0.703±0.318 | 0.240±0.000 | 0.300±0.003 | 0.856±0.066 | 0.870±0.048
41168 | 0.718±0.001 | 0.706±0.000 | 0.718±0.002 | 0.697±0.014 | 0.712±0.002 | 0.718±0.001 | 0.715±0.003
41169 | 0.396±0.001 | 0.353±0.027 | 0.371±0.002 | 0.345±0.022 | 0.312±0.005 | 0.396±0.001 | 0.391±0.001
Average Accuracy | 0.840 | 0.826 | 0.831 | 0.807 | 0.797 | 0.844 | 0.840
Average Rank | 3.085 | 3.585 | 3.634 | 5.707 | 5.427 | 2.963 | 3.598
# First Place | 19 | 13 | 13 | 4 | 4 | 16 | 13

Table 5. Comparison between AutoGluon, Auto-Sklearn, Auto-Sklearn 2.0, CMU AutoML, H2O AutoML, and Ensemble2’s ensemble of their results. All systems were run for one hour and the results were averaged across five seeds. Failures were omitted during the mean calculations. Ensemble2’s majority voting setup ensembles top three pipelines irrespective of which AutoML systems they come from while the Super Learning setup ensembles the best pipelines returned by each system. The highest accuracy achieved by a method on a dataset is shown in boldface.

We noticed again that AutoGluon achieves first place on more datasets than Ensemble2's voting scheme in Table 5 (19/41 vs 16/41 datasets) even though Ensemble2 had both higher average accuracy and a better average rank than AutoGluon. When we counted first-place finishes between only AutoGluon and Ensemble2, we found that Ensemble2 had first place on 31 of the 41 datasets and AutoGluon had first place on 30 of the 41 datasets.

Equal-Compute Comparison.   To investigate how Ensemble2 V2 fares when it has access to the same amount of compute as its base systems, we compared one-hour Ensemble2 runs against five-hour runs of the base AutoML systems. The comparison results are listed in Table 4. The reported five-hour performances were computed from a single seed due to time constraints.

We found that, overall, AutoGluon performed best under these conditions. Curiously, the average accuracy of all other AutoML systems dropped slightly on five-hour runs compared to one-hour runs. This is likely because running AutoML search for a long time risks overfitting, and AutoGluon takes extensive measures to counteract that, such as repeated k-fold bagging.

5.4. AutoML System Performance Correlation

Figure 2. Correlation in test set performance between Ensemble2’s base AutoML systems.

We investigate whether Ensemble2's base AutoML systems perform similarly across benchmark tasks by looking at the correlation between the accuracies of the base systems. This correlation measures whether the base systems generally perform better on different kinds of datasets: the less correlated the base systems are, the more they are suited to different kinds of datasets. Ensemble2 generally benefits from having a suite of weakly correlated base systems, since it can then rely on at least one of the base systems to do well on a wide range of problems. Figure 2 shows that some diversity exists in the performance of the best pipelines generated by Ensemble2's base systems. For example, the correlation between AutoGluon and CMU AutoML is approximately 0.79 and the correlation between H2O AutoML and Auto-Sklearn 2.0 is approximately 0.85. Hence, we argue that Ensemble2 is overall more well-rounded than its base systems.
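The correlation analysis itself is a straightforward pairwise computation over per-dataset accuracies; a sketch with pandas (assuming a DataFrame with one column of accuracies per base system, which is our illustrative setup rather than the paper's exact code) is:

```python
import pandas as pd

# accuracies: rows indexed by OpenML dataset ID, one column per base AutoML system.
correlation_matrix = accuracies.corr()  # pairwise Pearson correlations, as in Figure 2
print(correlation_matrix.round(2))
```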

6. Conclusion

In this paper, we have established that ensembling AutoML systems generally leads to quantitative gains in accuracy. We observed that Ensemble2 outperformed all state-of-the-art AutoML systems under wall-clock comparisons. In equal-compute comparisons, Ensemble2's performance was competitive with state-of-the-art AutoML systems. This, together with our explicit examination of the correlation between the accuracies of various AutoML systems across a variety of problems, suggests that there is currently exploitable diversity between AutoML systems.

In addition to those findings, we have built a public-facing Ensemble2 web interface designed for simplicity and ease of scaling. This web system greatly increases the accessibility of machine learning for the general public: a non-data scientist can simply upload a training CSV, specify which column to classify, and finally upload a test CSV to receive predictions.

Our experiments used a one-hour search time because that is the most common timeout in AutoML papers. To assess how Ensemble2's performance gain fluctuates, for both the wall-clock and equal-compute setups, it would be valuable to run experiments with very short and very long search times. In addition to experimenting with different search times, it would be informative to ensemble many more combinations of AutoML systems, taking into account their search spaces and heuristics, to observe which combinations yield the best empirical results. We observed that some AutoML systems had lower average accuracy than others, so choosing better-performing AutoML systems and taking further measures to prevent overfitting could also improve the performance of the ensembling system relative to a single system like AutoGluon in the equal-compute setup.

Acknowledgements.
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada CIFAR AI Chairs Program, and the Intel Parallel Computing Centers program. This material is based upon work supported by the United States Air Force Research Laboratory (AFRL) under the Defense Advanced Research Projects Agency (DARPA) Data Driven Discovery Models (D3M) program (Contract No. FA8750-19-2-0222) and Learning with Less Labels (LwLL) program (Contract No.FA8750-19-C-0515). Additional support was provided by UBC’s Composites Research Network (CRN), Data Science Institute (DSI) and Support for Teams to Advance Interdisciplinary Research (STAIR) Grants. This research was enabled in part by technical support and computational resources provided by WestGrid (https://www.westgrid.ca/) and Compute Canada (www.computecanada.ca).

References

  • Breiman (1996) Leo Breiman. 1996. Bagging predictors. Machine Learning 24, 2 (Aug. 1996), 123–140. https://doi.org/10.1007/bf00058655
  • Caruana et al. (2004) Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. 2004. Ensemble selection from libraries of models. In Machine Learning, Proceedings of the Twenty-first International Conference(ICML) (ACM International Conference Proceeding Series, Vol. 69), Carla E. Brodley (Ed.). ACM.
  • Chen et al. (2018) Boyuan Chen, Harvey Wu, Warren Mo, Ishanu Chattopadhyay, and Hod Lipson. 2018. Autostacker: a compositional evolutionary learning system. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO). ACM, 402–409.
  • Drori et al. (2019) Iddo Drori, Yamuna Krishnamurthy, Raoni Lourenço, Rémi Rampin, Kyunghyun Cho, Cláudio T. Silva, and Juliana Freire. 2019. Automatic Machine Learning by Pipeline Synthesis using Model-Based Reinforcement Learning and a Grammar. Computing Research Repository abs/1905.10345 (2019).
  • Erickson et al. (2020) Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander J. Smola. 2020. AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. Computing Research Repository (2020).
  • Feurer et al. (2020) Matthias Feurer, Katharina Eggensperger, Stefan Falkner, Marius Lindauer, and Frank Hutter. 2020. Auto-Sklearn 2.0: The Next Generation. Computing Research Repository abs/2007.04074 (2020).
  • Feurer et al. (2015) Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Springenberg, Manuel Blum, and Frank Hutter. 2015. Efficient and robust automated machine learning. In Advances in Neural Information Processing Systems. 2962–2970.
  • Freund and Schapire (1996) Yoav Freund and Robert E. Schapire. 1996. Experiments with a New Boosting Algorithm. In Proceedings of the Thirteenth International Conference on International Conference on Machine Learning (Bari, Italy) (ICML’96). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 148–156.
  • Gannon and Sochat (2017) Dennis Gannon and Vanessa Sochat. 2017. Singularity: A Container System for HPC Applications.
  • Gijsbers et al. (2019) Pieter Gijsbers, Erin LeDell, Janek Thomas, Sébastien Poirier, Bernd Bischl, and Joaquin Vanschoren. 2019. An Open Source AutoML Benchmark. arXiv:1907.00909 [cs.LG]
  • Hansen and Salamon (1990) Lars Kai Hansen and Peter Salamon. 1990. Neural Network Ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12, 10 (1990), 993–1001.
  • Heffetz et al. (2020) Yuval Heffetz, Roman Vainshtein, Gilad Katz, and Lior Rokach. 2020. DeepLine: AutoML Tool for Pipelines Generation using Deep Reinforcement Learning and Hierarchical Actions Filtering. In Conference on Knowledge Discovery and Data Mining. ACM, 2103–2113.
  • Jamieson and Talwalkar (2016) Kevin G. Jamieson and Ameet Talwalkar. 2016. Non-stochastic Best Arm Identification and Hyperparameter Optimization. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 51. 240–248.
  • Kotthoff et al. (2017) Lars Kotthoff, Chris Thornton, Holger H. Hoos, Frank Hutter, and Kevin Leyton-Brown. 2017. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA. Journal of Machine Learning Research 18, 25 (2017), 1–5. http://jmlr.org/papers/v18/16-261.html
  • LeDell and Poirier (2020) Erin LeDell and Sebastien Poirier. 2020. H2O AutoML: Scalable Automatic Machine Learning. 7th ICML Workshop on Automated Machine Learning (AutoML) (July 2020). https://www.automl.org/wp-content/uploads/2020/07/AutoML_2020_paper_61.pdf
  • Manyika et al. (2011) James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela Hung Byers. 2011. Big data: The next frontier for innovation, competition and productivity. Technical Report. McKinsey Global Institute.
  • McKinsey Analytics (2016) McKinsey Analytics. 2016. The age of analytics: competing in a data-driven world. Technical Report. San Francisco: McKinsey & Company.
  • Olson and Moore (2016) Randal S Olson and Jason H Moore. 2016. TPOT: A tree-based pipeline optimization tool for automating machine learning. In Workshop on automatic machine learning. 66–74.
  • Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake VanderPlas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Edouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research (JMLR) 12 (2011), 2825–2830.
  • Pompa and Burke (2017) Claudia Pompa and Travis Burke. 2017. Data science and analytics skills shortage: equipping the APEC workforce with the competencies demanded by employers. APEC Human Resource Development Working Group (2017).
  • Rahman and Tasnim (2014) Akhlaqur Rahman and Sumaira Tasnim. 2014. Ensemble Classifiers and Their Applications: A Review. Computing Research Repository abs/1404.4088 (2014).
  • van der Laan et al. ([n.d.]) Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard. [n.d.]. Super Learner. U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 222. ([n. d.]). https://biostats.bepress.com/ucbbiostat/paper222
  • Wolpert (1992) David H. Wolpert. 1992. Stacked generalization. Neural Networks 5, 2 (Jan. 1992), 241–259. https://doi.org/10.1016/s0893-6080(05)80023-1
  • Wolpert (1996) D. H. Wolpert. 1996. The Lack of A Priori Distinctions Between Learning Algorithms. Neural Computation 8, 7 (1996), 1341–1390. https://doi.org/10.1162/neco.1996.8.7.1341
  • Yoo et al. (2003) Andy B. Yoo, Morris A. Jette, and Mark Grondona. 2003. SLURM: Simple Linux Utility for Resource Management. In Job Scheduling Strategies for Parallel Processing. Springer Berlin Heidelberg, 44–60. https://doi.org/10.1007/10968987_3
  • Zaidi et al. (2020) Sheheryar Zaidi, Arber Zela, Thomas Elsken, Chris Holmes, Frank Hutter, and Yee Whye Teh. 2020. Neural Ensemble Search for Performant and Calibrated Predictions. Computing Research Repository abs/2006.08573 (2020).
  • Zöller and Huber (2019) Marc-André Zöller and Marco F Huber. 2019. Benchmark and Survey of Automated Machine Learning Frameworks. Computing Research Repository (2019).