This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Hyperparameter Optimization: Foundations, Algorithms, Best Practices and Open Challenges

Bernd Bischl Department of Statistics, Ludwig-Maximilians-Universität München - first.last@stat.uni-muenchen.de Corresponding Author Martin Binder Department of Statistics, Ludwig-Maximilians-Universität München - first.last@stat.uni-muenchen.de Michel Lang Department of Statistics, Ludwig-Maximilians-Universität München - first.last@stat.uni-muenchen.de Department of Statistics, TU Dortmund University - last@statistik.tu-dortmund.de Tobias Pielok Department of Statistics, Ludwig-Maximilians-Universität München - first.last@stat.uni-muenchen.de Jakob Richter Department of Statistics, Ludwig-Maximilians-Universität München - first.last@stat.uni-muenchen.de Department of Statistics, TU Dortmund University - last@statistik.tu-dortmund.de Stefan Coors Department of Statistics, Ludwig-Maximilians-Universität München - first.last@stat.uni-muenchen.de Janek Thomas Department of Statistics, Ludwig-Maximilians-Universität München - first.last@stat.uni-muenchen.de Theresa Ullmann Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-Universität München - tullmann@ibe.med.uni-muenchen.de Marc Becker Department of Statistics, Ludwig-Maximilians-Universität München - first.last@stat.uni-muenchen.de Anne-Laure Boulesteix Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-Universität München - last@ibe.med.uni-muenchen.de Difan Deng Institute for Information Processing, Leibniz University Hannover - last@tnt.uni-hannover.de Marius Lindauer Institute for Information Processing, Leibniz University Hannover - last@tnt.uni-hannover.de
?abstractname?

Most machine learning algorithms are configured by a set of hyperparameters whose values must be carefully chosen and which often considerably impact performance. To avoid a time-consuming and irreproducible manual process of trial-and-error to find well-performing hyperparameter configurations, various automatic hyperparameter optimization (HPO) methods – e.g., based on resampling error estimation for supervised machine learning – can be employed. After introducing HPO from a general perspective, this paper reviews important HPO methods, from simple techniques such as grid or random search to more advanced methods like evolution strategies, Bayesian optimization, Hyperband and racing. This work gives practical recommendations regarding important choices to be made when conducting HPO, including the HPO algorithms themselves, performance evaluation, how to combine HPO with machine learning pipelines, runtime improvements, and parallelization. This work is accompanied by an appendix that contains information on specific software packages in R and Python, as well as information and recommended hyperparameter search spaces for specific learning algorithms. We also provide notebooks that demonstrate concepts from this work as supplementary files.

1 Introduction

Machine learning (ML) algorithms are highly configurable by their hyperparameters (HPs). These parameters often substantially influence the complexity, behavior, speed as well as other aspects of the learner, and their values must be selected with care in order to achieve optimal performance. Human trial-and-error to select these values is time-consuming, often somewhat biased, error-prone and computationally irreproducible.

As the mathematical formalization of hyperparameter optimization (HPO) is essentially black-box optimization, often in a higher-dimensional space, this is better delegated to appropriate algorithms and machines to increase efficiency and ensure reproducibility. Many HPO methods have been developed to assist in and automate the search for well-performing hyperparameter configuration (HPCs) over the last 20 to 30 years. However, more sophisticated HPO approaches in particular are not as widely used as they could (or should) be in practice. We postulate that the reason for this may be a combination of the following factors:

  • poor understanding of HPO methods by potential users, who may perceive them as (too) complex ?black-boxes?;

  • poor confidence of potential users in the superiority of HPO methods over trivial approaches and resulting skepticism of the expected return on (time) investment;

  • missing guidance on the choice and configuration of HPO methods for the problem at hand;

  • difficulty to define the search space of an HPO process appropriately.

With these obstacles in mind, this paper formally and algorithmically introduces HPO, with many hints for practical application. Our target audience are scientists and users with a basic knowledge of ML and evaluation.

In this article, we mainly discuss HP for supervised ML, which is arguably the default scenario for HPO. We mainly do this to keep notation simple and to not overwhelm less experienced readers, especially for less experienced readers. Nevertheless, all covered techniques can be applied to practically any algorithm in ML in which the algorithm is trained on a collection of instances and performance is quantitatively measurable – e.g., in semi-supervised learning, reinforcement learning, and potentially even unsupervised learning111where the measurement of performance is arguably much less straightforward, especially via a single metric..

Subsequent sections of this paper are organized as follows: Section 2 discusses related work. Section 3 introduces the concept of supervised ML and discusses the evaluation of ML algorithms. The principle of HPO is introduced in Section 4. Major classes of HPO methods are described, including their strengths and limitations. The problem of over-tuning, the handling of noise in the context of HPO, and the topic of threshold tuning are also addressed. Section 5 introduces the most common preprocessing steps and the concept of ML pipelines, which enables us to include preprocessing and model selection within HPO. Section 6 offers practical recommendations on how to choose resampling strategies as well as define tuning search spaces, provides guidance on which HPO algorithm to use, and describes how HPO can be parallelized. In Section 7, we also briefly discuss how HPO directly connects to a much broader field of algorithm selection and configuration beyond ML and other related fields. Section 8 concludes with a discussion of relevant open issues in HPO.

The appendices contain additional material of particularly practical relevance for HPO: Appendix LABEL:app:important_ml_algorithms lists the most popular ML algorithms, describes some of their properties, and proposes sensible HP search spaces; Appendix A does the same for preprocessing methods; Appendix B contains a table of common evaluation metrics; Appendix C lists relevant considerations and software packages for ML and HPO in the two popular ML scripting languages R (C.1) and Python (C.2). Furthermore, we provide several R markdown notebooks as ancillary files which demonstrate many practical HPO concepts and implement them in mlr3 [226].

2 Related Work

As one of the most studied sub-fields of automated ML (AutoML), there exist several previous surveys on HPO. [45] offered a thorough overview about existing HPO approaches, open challenges, and future research directions. In contrast to our paper, however, that work does not focus on specific advice for issues that arise in practice. [145] provide a very high-level overview of search spaces, HPO techniques, and tools. Although we expect that the paper by [145] will be a more accessible paper for first-time users of HPO compared to the survey by [45], it does not explain HPO’s mathematical and algorithmic details or practical tips on how to apply HPO efficiently. Last but not least, [2] provides an overview about HPO methods, but with a focus on computational complexity aspects. We see this work described here as filling the gap between these papers by providing all necessary details both for first-time users of HPO as well as experts in ML and data science who seek to understand the concepts of HPO in sufficient depth.

Our focus is on providing a general overview of HPO without a special focus on concrete ML model classes. However, since the ML field has many large sub-communities by now, there are also several specialized HPO and AutoML surveys. For example, [58] and [133] focus on AutoML for deep learning models, [78] on HPO for forecasting models in smart grids, and [149] on AutoML on graph models. [6] investigate model-based HPO and also give search spaces and examples for many specific ML algorithms. Other more general reviews of AutoML are [146], [37], and [147].

3 Supervised Machine Learning

3.1 Terminology and Notations

Supervised ML addresses the problem of inferring a model from labeled training data that is then used to predict data from the same underlying distribution with minimal error. Let 𝒟\mathcal{D} be a labeled data set with nn observations, where each observation (𝐱(i),y(i))(\mathbf{x}^{(i)},y^{(i)}) consists of a pp-dimensional feature vector222More generally, with a slight abuse of notation, the feature vector 𝐱(i)\mathbf{x}^{(i)} could be taken as a tensor (for example when the feature is an image). 𝐱(i)𝒳\mathbf{x}^{(i)}\in\mathcal{X} and its label y(i)𝒴y^{(i)}\in\mathcal{Y}. Hence, we define the data set333More precisely, 𝒟\mathcal{D} is an indexed tuple, but we will continue to use common terminology and call it a data set. 𝒟=((𝐱(1),y(1)),,(𝐱(n),y(n)))\mathcal{D}=\left(\left(\mathbf{x}^{(1)},y^{(1)}\right),\ldots,\left(\mathbf{x}^{(n)},y^{(n)}\right)\right). We assume that 𝒟\mathcal{D} has been sampled i.i.d. from an underlying, unknown distribution, so 𝒟(xy)n\mathcal{D}\sim(\mathds{P}_{xy})^{n}. Each dimension of a pp-dimensional 𝐱\mathbf{x} will usually be of numerical, integer, or categorical type. While some ML algorithms can handle all of these data types natively (e.g., tree-based methods), others can only work on numeric features and require encoding techniques for categorical types. The most common supervised ML tasks are regression and classification, where 𝒴=\mathcal{Y}=\mathds{R} for regression and 𝒴\mathcal{Y} is finite and categorical for classification with |𝒴|=g|\mathcal{Y}|=g classes. Although we mainly discuss HPO in the context of regression and classification, all covered concepts easily generalize to other supervised ML tasks, such as Poisson regression, survival analysis, cost-sensitive classification, multi-output tasks, and many more. An ML model is a function f:𝒳gf:\mathcal{X}\rightarrow\mathds{R}^{g} that assigns a prediction in g\mathds{R}^{g} to a feature vector from 𝒳\mathcal{X}. For regression, gg is 11, while in classification the output represents the gg decision scores or posterior probabilities of the gg candidate classes. Binary classification is usually simplified to g=1g=1, with a single decision score in \mathds{R} or only the posterior probability for the positive class. The function space – usually parameterized – to which a model belongs is called the hypothesis space and denoted as \mathcal{H}.

The goal of supervised ML is to fit a model given nn observations sampled from xy\mathds{P}_{xy}, so that it generalizes well to new observations from the same data generating process. Formally, an ML learner or inducer \mathcal{I} configured by HPs 𝝀Λ\bm{\lambda}\in\Lambda maps a data set 𝒟\mathcal{D} to a model f^\hat{f} or equivalently to its associated parameter vector 𝜽^\bm{\hat{\theta}}, i.e.,

:𝔻×𝚲\displaystyle\mathcal{I}:\mathds{D}\times\bm{\Lambda} ,(𝒟,𝝀)\displaystyle\to\mathcal{H},\quad(\mathcal{D},\bm{\lambda}) f^,\displaystyle\mapsto\hat{f}, (1)

where 𝔻:=n(𝒳×𝒴)n\mathds{D}:=\bigcup_{n\in\mathds{N}}(\mathcal{X}\times\mathcal{Y})^{n} is the set of all data sets. While model parameters 𝜽^\bm{\hat{\theta}} are an output of the learner \mathcal{I}, HPs 𝝀\bm{\lambda} are an input. We also write 𝝀\mathcal{I}_{\bm{\lambda}} for \mathcal{I} or f^𝒟,𝝀\hat{f}_{\mathcal{D},\bm{\lambda}} for f^\hat{f} if we want to stress that the inducer was configured with 𝝀\bm{\lambda} or that the model was learned on 𝒟\mathcal{D} by an inducer configured by 𝝀\bm{\lambda}. A loss function L:𝒴×g0+L:\mathcal{Y}\times\mathds{R}^{g}\rightarrow\mathds{R}^{+}_{0} measures the discrepancy between the prediction and the true label. Many ML learners use the concept of empirical risk minimization (ERM) in their training routine to produce their fitted model f^\hat{f}, i.e., they optimize emp(f)\mathcal{R}_{\text{emp}}(f) or emp(𝜽)\mathcal{R}_{\text{emp}}(\bm{\theta}) over all candidate models ff\in\mathcal{H}

emp(f):=i=1nL(y(i),f(𝐱(i))),f^=argminfemp(f)\mathcal{R}_{\text{emp}}(f):=\sum\limits_{i=1}^{n}L\left(y^{(i)},f(\mathbf{x}^{(i)})\right)\mathrm{,}\qquad\hat{f}=\operatorname*{arg\,min}_{f}\mathcal{R}_{\text{emp}}(f) (2)

on the training data 𝒟\mathcal{D} (c.f. Figure 1). This empirical risk is only a stochastic proxy for what we are actually interested in, namely the theoretical risk or true generalization error (f):=𝔼(𝐱,y)xy[L(y,f(𝐱))]\mathcal{R}(f):=\mathds{E}_{(\mathbf{x},y)\sim\mathds{P}_{xy}}[L\left(y,f(\mathbf{x})\right)]. For many complex hypothesis spaces, emp(f)\mathcal{R}_{\text{emp}}(f) can become considerably smaller than its true risk (f)\mathcal{R}(f). This phenomenon is known as overfitting, which in ML is usually addressed by either constraining the hypothesis space or regularized risk minimization, i.e., adding a complexity penalty J(𝜽)J(\bm{\theta}) to (2).

3.2 Evaluation of ML Algorithms

After training an ML model f^\hat{f}, a natural follow-up step is to evaluate its future performance given new, unseen data. We seek to use an unbiased, high-quality statistical estimator, which numerically quantifies the performance of our model when it is used to predict the target variable for new observations drawn from the same data-generating process xy\mathds{P}_{xy}.

3.2.1 Performance Metrics

A general performance measure ρ\rho for an arbitrary data set of size mm is defined as a two-argument function that maps the mm-size vector of true labels 𝐲\mathbf{y} and the m×gm\times g matrix of prediction scores 𝑭\bm{F} to a scalar performance value:

ρ:m(𝒴m×m×g),(𝐲,𝑭)ρ(𝐲,𝑭).\rho:\bigcup_{m\in\mathds{N}}\left(\mathcal{Y}^{m}\times\mathds{R}^{m\times g}\right)\rightarrow\mathds{R},\quad(\mathbf{y},\bm{F})\mapsto\rho(\mathbf{y},\bm{F}). (3)

This more general set-based definition is needed for performance measures – such as area under the ROC curve (AUC) – or for most measures from survival time analysis, where loss values cannot be computed with respect to only a single observation. For usual point-wise losses L(y,f(𝐱))L\left(y,f(\mathbf{x})\right), we can simply extend LL to ρ\rho by averaging over the size-mm set used for testing:

ρL(𝐲,𝑭)=1mi=1mL(y(i),𝑭(i)),\rho_{L}(\mathbf{y},\bm{F})=\frac{1}{m}\sum\limits_{i=1}^{m}L(y^{(i)},\bm{F}^{(i)}), (4)

where 𝑭(i)\bm{F}^{(i)} is the ii-th row of 𝑭\bm{F}; this corresponds to estimating the theoretical risk (f)\mathcal{R}(f) corresponding to the given loss. Popular performance metrics corresponding to different loss functions can be found in Table 1 in Appendix B.

Furthermore, the introduction of ρ\rho allows the evaluation of a learner with respect to a different performance metric than the loss used for risk minimization. Because of this, we call the loss used in (2) inner loss, and ρ\rho the outer performance measure or outer loss444Surrogate loss for the inner loss and target loss for the outer loss are also commonly used terminologies.. Both can coincide, but quite often we select an outer performance measure based on the prediction task we would like to solve, and opt to approximate this metric with a computationally cheaper and possibly differentiable version during inner risk minimization.

3.2.2 Generalization Error

Due to potential overfitting, every predictive model should be evaluated on unseen test data to ensure unbiased performance estimation. Assuming (for now) dedicated train and test data sets 𝒟train\mathcal{D}_{\text{train}} and 𝒟test\mathcal{D}_{\text{test}} of sizes ntrainn_{\mathrm{train}} and ntestn_{\mathrm{test}}, respectively, we define the generalization error of a learner \mathcal{I} with HPs 𝝀\bm{\lambda} trained on ntrainn_{\mathrm{train}} observations, w.r.t. measure ρ\rho as

GE(,𝝀,ntrain,ρ):=limntest𝔼𝒟train,𝒟testxy[ρ(𝐲test,𝑭𝒟test,(𝒟train,𝝀))],\mathrm{GE}(\mathcal{I},\bm{\lambda},n_{\mathrm{train}},\rho):=\lim_{n_{\mathrm{test}}\rightarrow\infty}\mathds{E}_{\mathcal{D}_{\text{train}},\mathcal{D}_{\text{test}}\sim\mathds{P}_{xy}}\left[\rho\left(\mathbf{y}_{\mathrm{test}},\bm{F}_{\mathcal{D}_{\text{test}},\mathcal{I}(\mathcal{D}_{\text{train}},\bm{\lambda})}\right)\right]\textrm{,} (5)

where we take the expectation over the data sets 𝒟train\mathcal{D}_{\text{train}} and 𝒟test\mathcal{D}_{\text{test}}, both i.i.d. from xy\mathds{P}_{xy}, and 𝑭𝒟test,(𝒟train,𝝀)\bm{F}_{\mathcal{D}_{\text{test}},\mathcal{I}(\mathcal{D}_{\text{train}},\bm{\lambda})} is the matrix of predictions when the model is trained on 𝒟train\mathcal{D}_{\text{train}} and predicts on 𝒟test\mathcal{D}_{\text{test}}. Note that in the simpler and common case of a point-wise loss L(y,f(𝐱))L\left(y,f(\mathbf{x})\right), the above trivially reduces to the more common form

GE(,𝝀,ntrain,ρL)=𝔼𝒟train,(𝐱,y)xy[L(y,𝝀(𝒟train)(𝐱))]\mathrm{GE}(\mathcal{I},\bm{\lambda},n_{\mathrm{train}},\rho_{L})=\mathds{E}_{\mathcal{D}_{\text{train}},(\mathbf{x},y)\sim\mathds{P}_{xy}}\left[L(y,\mathcal{I}_{\bm{\lambda}}(\mathcal{D}_{\text{train}})(\mathbf{x}))\right] (6)

with expectation over data set 𝒟train\mathcal{D}_{\text{train}} and test sample (𝐱,y)(\mathbf{x},y), both independently sampled from xy\mathds{P}_{xy}. This corresponds to the expectation of (f)\mathcal{R}(f) – which references a given, fixed model – over all possible models fitted to different realizations of 𝒟train\mathcal{D}_{\text{train}} of size ntrainn_{\mathrm{train}} (see Figure 1).

Refer to caption
?figurename? 1: Learner \mathcal{I} takes input data, performs ERM, and returns model f^\hat{f} and its parameters 𝜽^\bm{\hat{\theta}}. The GE of f^\hat{f} is evaluated on the fresh test set 𝒟test\mathcal{D}_{\text{test}}.

3.2.3 Data splitting and Resampling

The generalization error must usually be estimated from a single given data set 𝒟\mathcal{D}. For a simple estimator based on a single random split, 𝒟train\mathcal{D}_{\text{train}} and 𝒟test\mathcal{D}_{\text{test}} can be represented as index vectors Jtrain{1,,n}ntrainJ_{\mathrm{train}}\in\{1,\dots,n\}^{n_{\mathrm{train}}} and Jtest{1,,n}ntestJ_{\mathrm{test}}\in\{1,\dots,n\}^{n_{\mathrm{test}}}, which usually partition the data set. For an index vector JJ of length mm, one can define the corresponding vector of labels 𝐲J𝒴m\mathbf{y}_{J}\in\mathcal{Y}^{m}, and the corresponding matrix of prediction scores 𝑭J,fm×g\bm{F}_{J,f}\in\mathds{R}^{m\times g} for a model ff. The holdout estimator is then:

GE^Jtrain,Jtest(,𝝀,ntrain,ρ)=ρ(𝐲Jtest,𝑭Jtest,(𝒟train,𝝀)).\widehat{\mathrm{GE}}_{J_{\mathrm{train}},J_{\mathrm{test}}}(\mathcal{I},\bm{\lambda},n_{\mathrm{train}},\rho)=\rho(\mathbf{y}_{J_{\mathrm{test}}},\bm{F}_{J_{\mathrm{test}},\mathcal{I}(\mathcal{D}_{\text{train}},\bm{\lambda})}). (7)

The holdout approach has the following trade-off: (i) Because ntrainn_{\mathrm{train}} must be smaller than nn, the estimator is a pessimistically biased estimator of GE(,𝝀,n,ρ)\mathrm{GE}(\mathcal{I},\bm{\lambda},n,\rho), as we do not use all available data for training. In a certain sense, we are estimating with respect to the wrong training set size. (ii) If 𝒟train\mathcal{D}_{\text{train}} is large, 𝒟test\mathcal{D}_{\text{test}} will be small, and the estimator (7) has a large variance. This trade-off not only depends on relative sizes of ntrainn_{\mathrm{train}} and ntestn_{\mathrm{test}}, but also the absolute number of observations, as learning with respect to ntrainn_{\mathrm{train}} sample size and test error estimation based on ntestn_{\mathrm{test}} samples both show a saturating effect for larger sample sizes. However, a typical rule of thumb is to choose ntrain=23nn_{\mathrm{train}}=\frac{2}{3}n [83, 33].

Resampling methods offer a partial solution to this dilemma. These methods repeatedly split the available data into training and test sets, then apply an estimator (7) for each of these, and finally aggregate over all obtained ρ\rho performance values. Formally, we can identify a resampling strategy with a vector of corresponding splits, i.e., 𝒥=((Jtrain,1,Jtest,1),,(Jtrain,B,Jtest,B)),\mathcal{J}=\left((J_{\mathrm{train},1},J_{\mathrm{test},1}),\dots,(J_{\mathrm{train},B},J_{\mathrm{test},B})\right), where Jtrain,i,Jtest,iJ_{\mathrm{train},i},J_{\mathrm{test},i} are index vectors and BB is the number of splits. Hence, the estimator for Eq. (5) is:

GE^(,𝒥,ρ,𝝀)=\displaystyle\widehat{\mathrm{GE}}(\mathcal{I},\mathcal{J},\rho,\bm{\lambda})= (8)
=agr(GE^Jtrain,1,Jtest,1(,𝝀,|Jtrain,1|,ρ),,GE^Jtrain,B,Jtest,B(,𝝀,|Jtrain,B|,ρ))\displaystyle=\mathrm{agr}\Big{(}\widehat{\mathrm{GE}}_{J_{\mathrm{train},1},J_{\mathrm{test},1}}(\mathcal{I},\bm{\lambda},|J_{\mathrm{train},1}|,\rho),\dots,\widehat{\mathrm{GE}}_{J_{\mathrm{train},B},J_{\mathrm{test},B}}(\mathcal{I},\bm{\lambda},|J_{\mathrm{train},B}|,\rho)\Big{)}
=agr(ρ(𝐲Jtest,1,𝑭Jtest,1,(𝒟train,1,𝝀)),,ρ(𝐲Jtest,B,𝑭Jtest,B,(𝒟train,B,𝝀))),\displaystyle=\mathrm{agr}\Big{(}\rho\left(\mathbf{y}_{J_{\mathrm{test},1}},\bm{F}_{J_{\mathrm{test},1},\mathcal{I}(\mathcal{D}_{\text{train},1},\bm{\lambda})}\right),\dots,\rho\left(\mathbf{y}_{J_{\mathrm{test},B}},\bm{F}_{J_{\mathrm{test},B},\mathcal{I}(\mathcal{D}_{\text{train},B},\bm{\lambda})}\right)\Big{)},

where the aggregator agr\mathrm{agr} is often chosen to be the mean. For Eq. (8) to be a valid estimator of Eq. (6), we must specify to what ntrainn_{\mathrm{train}} training set size an estimator refers in GE(,𝝀,ntrain,ρ)\mathrm{GE}(\mathcal{I},\bm{\lambda},n_{\mathrm{train}},\rho). As the training set sizes can be different during resampling (they usually do not vary much), it should at least hold that ntrainntrain,1ntrain,Bn_{\mathrm{train}}\approx n_{\mathrm{train},1}\approx\dots\approx n_{\mathrm{train},B}, and we could take the average for such a required reference size with ntrain=1Bj=1Bntrain,jn_{\mathrm{train}}=\frac{1}{B}\sum_{j=1}^{B}n_{\mathrm{train},j}.

Resampling uses the data more efficiently than a single holdout split, as the repeated estimation and averaging over multiple splits results in an estimate of generalization error with lower variance [83, 123]. Additionally, the pessimistic bias of simple holdout is also kept to a minimum and can be reduced to nearly 0 by choosing training sets of size close to nn. The most widely-used resampling technique is arguably kk-fold-cross-validation (CV), which partitions the available data in kk subsets of approximately equal size, and uses each partition to evaluate a model fitted on its complement. For small data sets, it makes sense to repeat CV with multiple random partitions and to average the resulting estimates in order to average out the variability, which results in repeated kk-fold-cross-validation. Furthermore, note that performance values generated from resampling splits and especially CV splits are not statistically independent because of their overlapping traing sets, so the variance of GE^(,𝒥,ρ,𝝀)\widehat{\mathrm{GE}}(\mathcal{I},\mathcal{J},\rho,\bm{\lambda}) is not proportional to 1/B1/B. Somewhat paradoxically, a leave-one-out strategy is not the optimal choice, and repeated cross-validation with many (but fewer than nn) folds and many repetitions is often a better choice [8]. An overview of existing resampling techniques can be found in [21] or [25].

4 Hyperparameter Optimization

4.1 HPO Problem Definition

Most learners are highly configurable by HPs, and their generalization performance usually depends on this configuration in a non-trivial and subtle way. HPO algorithms automatically identify a well-performing HPC 𝝀Λ~\bm{\lambda}\in\tilde{\Lambda} for an ML algorithm 𝝀\mathcal{I}_{\bm{\lambda}}. The search space Λ~Λ\tilde{\Lambda}\subset\Lambda contains all considered HPs for optimization and their respective ranges:

Λ~=Λ~1×Λ~2××Λ~l,\tilde{\Lambda}=\tilde{\Lambda}_{1}\times\tilde{\Lambda}_{2}\times\dots\times\tilde{\Lambda}_{l}, (9)

where Λ~i\tilde{\Lambda}_{i} is a bounded subset of the domain of the ii-th HP Λi\Lambda_{i}, and can be either continuous, discrete, or categorical. This already mixed search space can also contain dependent HPs, leading to a hierarchical search space: An HP λi\lambda_{i} is said to be conditional on λj\lambda_{j} if λi\lambda_{i} is only active when λj\lambda_{j} is an element of a given subset of Λj\Lambda_{j} and inactive otherwise, i.e., not affecting the resulting learner [135]. Common examples are kernel HPs of a kernelized machine such as the SVM, when we tune over the kernel type and its respective hyperparameters as well. Such conditional HPs usually introduce tree-like dependencies in the search space, and may in general lead to dependencies that may be represented by directed acyclic graphs.

The general HPO problem as visualized in Figure 2 is defined as:

𝝀argmin𝝀Λ~c(𝝀)=argmin𝝀Λ~GE^(,𝒥,ρ,𝝀)\displaystyle\bm{\lambda}^{*}\in\operatorname*{arg\,min}_{\bm{\lambda}\in\tilde{\Lambda}}c(\bm{\lambda})=\operatorname*{arg\,min}_{\bm{\lambda}\in\tilde{\Lambda}}\widehat{\mathrm{GE}}(\mathcal{I},\mathcal{J},\rho,\bm{\lambda}) (10)

where 𝝀\bm{\lambda}^{*} denotes the theoretical optimum, and c(𝝀)c(\bm{\lambda}) is a shorthand for the estimated generalization error when ,𝒥,ρ\mathcal{I},\mathcal{J},\rho are fixed. We therefore estimate and optimize the generalization error GE^(,𝒥,ρ,𝝀)\widehat{\mathrm{GE}}(\mathcal{I},\mathcal{J},\rho,\bm{\lambda}) of a learner 𝝀\mathcal{I}_{\bm{\lambda}}, w.r.t. an HPC 𝝀\bm{\lambda}, based on a resampling split 𝒥=((Jtrain,1,Jtest,1),,(Jtrain,B,Jtest,B))\mathcal{J}=\left((J_{\mathrm{train},1},J_{\mathrm{test},1}),\dots,(J_{\mathrm{train},B},J_{\mathrm{test},B})\right).555Note again that optimizing the resampling error will result in biased estimates, which are problematic when reporting the generalization error; use nested CV for this, see Section 4.4. Note that c(𝝀)c(\bm{\lambda}) is a black-box, as it usually has no closed-form mathematical representation, and hence no analytic gradient information is available. Furthermore, the evaluation of c(𝝀)c(\bm{\lambda}) can take a significant amount of time. Therefore, the minimization of c(𝝀)c(\bm{\lambda}) forms an expensive black-box optimization problem.

Taken together, these properties define an optimization problem of considerable difficulty. Furthermore, they rule out many popular optimization methods that require gradients or entirely numerical search spaces or that must perform a large number of evaluations to converge to a well-performing solution, like many meta-heuristics. Furthermore, as c(𝝀)=GE^(,𝒥,ρ,𝝀)c(\bm{\lambda})=\widehat{\mathrm{GE}}(\mathcal{I},\mathcal{J},\rho,\bm{\lambda}), which is defined via resampling and evaluates 𝝀\bm{\lambda} on randomly chosen validation sets, cc should be considered a stochastic objective – although many HPO algorithms may ignore this fact or simply handle it by assuming that we average out the randomness through enough resampling replications.

Refer to caption
?figurename? 2: General HPO loop with inner risk minimization.

We can thus define the HP tuner τ:(𝒟,,Λ~,ρ)𝝀^\tau:(\mathcal{D},\mathcal{I},\tilde{\Lambda},\rho)\mapsto\hat{\bm{\lambda}} that proposes its estimate 𝝀^\hat{\bm{\lambda}} of the true optimal configuration 𝝀\bm{\lambda}^{*} given a dataset 𝒟\mathcal{D}, an inducer \mathcal{I} with corresponding search space Λ~\tilde{\Lambda} to optimize, and a target measure ρ\rho. The specific resampling splits 𝒥\mathcal{J} used can either be passed into τ\tau as well or are internally handled to facilitate adaptive splitting or multi-fidelity optimization (e.g., as done in [82]).

4.2 Well-Established HPO Algorithms

All HPO algorithms presented here work by the same principle: they iteratively propose HPCs 𝝀+\bm{\lambda}^{+} and then evaluate the performance on these configurations. We store these HPs and their respective evaluations successively in the so-called archive 𝒜=((𝝀(1),c(𝝀(1))),(𝝀(2),c(𝝀(2))),)\mathcal{A}=((\bm{\lambda}^{(1)},c(\bm{\lambda}^{(1)})),(\bm{\lambda}^{(2)},c(\bm{\lambda}^{(2)})),\dots), with 𝒜[t+1]=𝒜[t](𝝀+,c(𝝀+))\mathcal{A}^{[t+1]}=\mathcal{A}^{[t]}\cup(\bm{\lambda}^{+},c(\bm{\lambda}^{+})) if a single point is proposed by the tuner.

Many algorithms can be characterized by how they handle two different trade-offs: a) The exploration vs. exploitation trade-off refers to how much budget an optimizer spends on either attempting to directly exploit the currently available knowledge base by evaluating very close to the currently best candidates (e.g., local search) or exploring the search space to gather new knowledge (e.g., random search). b) The inference vs. search trade-off refers to how much time and overhead is spent to induce a model from the currently available archive data in order to exploit past evaluations as much as possible. Other relevant aspects that HPO algorithms differ in are: Parallelizability, i.e., how many configurations a tuner can (reasonably) propose at the same time; global vs. local behavior of the optimizer, i.e., if updates are always quite close to already evaluated configurations; noise handling, i.e., if the optimizer takes into account that the estimated generalization error is noisy; multifidelity, i.e., if the tuner uses cheaper evaluations, for example on smaller subsets of the data, to infer performance on the full data; search space complexity, i.e., if and how hierarchical search spaces as introduced in Section 5 can be handled.

4.2.1 Grid Search and Random Search

Grid search (GS) is the process of discretizing the range of each HP and exhaustively evaluating every combination of values. Numeric and integer HP values are usually equidistantly spaced in their box constraints. The number of distinct values per HP is called the resolution of the grid. For categorical HPs, either a subset or all possible values are considered. A second simple HPO algorithm is random search (RS). In its simplest form, values for each HP are drawn independently of each other and from a pre-specified (often uniform) distribution, which works for (box-constrained) numeric, integer, or categorical parameters (c.f. Figure 3). Due to their simplicity, both GS and RS can handle hierarchical search spaces.

RS often has much better performance than GS in higher-dimensional HPO settings [11]. GS suffers directly from the curse of dimensionality [7], as the required number of evaluations increases exponentially with the number of HPs for a fixed grid resolution. This seems to be true as well for RS at first glance, and we certainly require an exponential number of points in dim(Λ~)\dim(\tilde{\Lambda}) to cover the space well. However, in practice, HPO problems often have low effective dimensionality [11]: The set of HPs that have an influence on performance is often a small subset of all available HPs. Consider the example illustrated in Figure 3, where an HPO problem with HPs λ1\lambda_{1} and λ2\lambda_{2} is shown. A GS with resolution 33 resulting in 99 HPCs is evaluated, and we discover that only HP λ1\lambda_{1} has any relevant influence on the performance, so only 3 of 9 evaluations provided any meaningful information. In comparison, RS would have given us 9 different configurations for HP λ1\lambda_{1}, which results in a higher chance of finding the optimum. Another advantage of RS is that it can easily be extended by further samples; in contrast, the number of points on a grid must be specified beforehand, and refining the resolution of GS afterwards is more complicated. Altogether, this makes RS preferable to GS and a surprisingly strong baseline for HPO in many practical settings. Notably, there are sampling methods that attempt to cover the search space more evenly than the uniform sampling of RS, e.g., Latin Hypercube Sampling [99], or Sobol sequences [3]. However, these do not seem to significantly outperform naive i.i.d. sampling [11].

Refer to caption
?figurename? 3: RS and GS where only HP λ1\lambda_{1} has a strong influence on cc (figure based on [10]).

4.2.2 Evolution Strategies

Evolution strategies (ES) are a class of stochastic population-based optimization methods inspired by the concepts of biological evolution, belonging to the larger class of evolutionary algorithms. They do not require gradients, making them generally applicable in black-box settings such as HPO. In ES terminology, an individual is a single HPC, the population is a currently maintained set of HPCs, and the fitness of an individual is its (inverted) generalization error c(𝝀)c(\bm{\lambda}). Mutation is the (randomized) change of one or a few HP values in a configuration. Crossover creates a new HPC by (randomly) mixing the values of two other configurations. An ES follows iterative steps to find individuals with high fitness values (c.f. Figure 4): (i) An initial population is sampled at random. (ii) The fitness of each individual is evaluated. (iii) A set of individuals is selected as parents for reproduction.666Either completely at random, or with a probability according to their fitness, the most popular variants being roulette wheel and tournament selection. (iv) The population is enlarged through crossover and mutation of the parents. (v) The offspring is evaluated. (vi) The top-k fittest individuals are selected.777In ES-terminology this is called a (μ+λ\mu+\lambda) elite survival selection, where μ\mu denotes the population size, and λ\lambda the number of offspring; other variants like (μ,λ\mu,\lambda) selection exist. (vii) Steps (ii) to (v) are repeated until a termination condition is reached. For a more comprehensive introduction to ES, see [13].

Refer to caption
?figurename? 4: Schematic representation of a single iteration of an ES as a four dimensional discrete problem. Parameter values are symbolized by geometric shapes.

ES were limited to numeric spaces in their original formulation, but they can easily be extended to handle mixed spaces by treating components of different types independently, e.g., by adding a normally distributed random value to real-valued HPs while adding the difference of two geometrically distributed values to integer-valued HPs [90]. By defining mutation and crossover operations that operate on tree structures or graphs, it is even possible to perform optimization of preprocessing pipelines [104, 41] or neural network architectures [114] using evolutionary algorithms. The properties of ES can be summarized as follows: ES have a low likelihood to get stuck in local minima, especially if so-called nested ES are used [13]. They can be straightforwardly modified to be robust to noise [14], and can also be easily extended to multi-objective settings [30]. Additionally, ES can be applied in settings with complex search spaces and can therefore work with spaces where other optimizers may fail [59]. ES are more efficient than RS and GS but still often require a large number of iterations to find good solutions, which makes them unsatisfactory for expensive optimization settings like HPO.

4.2.3 Bayesian Optimization

Bayesian optimization (BO) has become increasingly popular as a global optimization technique for expensive black-box functions, and specifically for HPO [76, 67, 125].

BO is an iterative algorithm whose key strategy is to model the mapping 𝝀c(𝝀)\bm{\lambda}\leavevmode\nobreak\ \mapsto\leavevmode\nobreak\ c(\bm{\lambda}) based on observed performance values found in the archive 𝒜\mathcal{A} via (non-linear) regression. This approximating model is called a surrogate model, for which a Gaussian process or a random forest are typically used. BO starts on an archive 𝒜\mathcal{A} filled with evaluated configurations, typically sampled randomly, using Latin Hypercube Sampling or the Sobol sampling [24]. BO then uses the archive to fit the surrogate model, which for each 𝝀\bm{\lambda} produces both an estimate of performance c^(𝝀)\hat{c}(\bm{\lambda}) as well as an estimate of prediction uncertainty σ^(𝝀)\hat{\sigma}(\bm{\lambda}), which then gives rise to a predictive distribution for one test HPC or a joint distribution for a set of HPCs. Based on the predictive distribution, BO establishes a cheap-to-evaluate acquisition function u(𝝀)u(\bm{\lambda}) that encodes a trade-off between exploitation and exploration: The former means that the surrogate model predicts a good, low cc value for a candidate HPC 𝝀\bm{\lambda}, while the latter implies that the surrogate is very uncertain about c(𝝀)c(\bm{\lambda}), likely because the surrounding area has not been explored thoroughly.

Instead of working on the true expensive objective, the acquisition function u(𝝀)u(\bm{\lambda}) is then optimized in order to generate a new candidate 𝝀+\bm{\lambda}^{+} for evaluation. The optimization problem u(𝝀)u(\bm{\lambda}) inherits most characteristics from c(𝝀)c(\bm{\lambda}); so it is often still multi-modal and defined on a mixed, hierarchical search space. Therefore, u(𝝀)u(\bm{\lambda}) may still be quite complex, but it is at least cheap to evaluate. This allows the usage of more budget-demanding optimizers on the acquisition function. If the space is real-valued and the combination of surrogate model and acquisition function supports it, even gradient information can be used.

Among the possible optimization methods are: iterated local search (as used by [69]), evolutionary algorithms (as in [140]), ES using derivatives (as used by [120, 254]), and a focusing RS called DIRECT [75].

The true objective value c(𝝀+)c(\bm{\lambda}^{+}) of the proposed HPC 𝝀+\bm{\lambda}^{+} – generated by optimization of u(𝝀)u(\bm{\lambda}) – is finally evaluated and added to the archive 𝒜\mathcal{A}. The surrogate model is updated, and BO iterates until a predefined budget is exhausted, or a different termination criterion is reached. These steps are summarized in Algorithm 1. BO methods can use different ways of deciding which 𝝀\bm{\lambda} to return, referred to as the identification step by [70]. This can either be the best observed 𝝀\bm{\lambda} during optimization, the best (mean, or quantile) predicted 𝝀\bm{\lambda} from the archive according to the surrogate model [110, 70], or the best predicted 𝝀\bm{\lambda} overall [119]. The latter options serve as a way of smoothing the observed performance values and reducing the influence of noise on the choice of 𝝀^\hat{\bm{\lambda}}.

1 Generate 𝝀(1),,𝝀(k)\bm{\lambda}^{(1)},\dots,\bm{\lambda}^{(k)} with sampling scheme or fixed design
2 Initialize archive 𝒜[0]=((𝝀(1),c(𝝀(1))),,(𝝀(k),c(𝝀(k))))\mathcal{A}^{[0]}=((\bm{\lambda}^{(1)},c(\bm{\lambda}^{(1)})),\dots,(\bm{\lambda}^{(k)},c(\bm{\lambda}^{(k)})))
3 for t=1,2,3,\text{t}=1,2,3,\ldots until termination do
4      1: Fit surrogate model (c^(𝝀),σ^(𝝀))(\hat{c}(\bm{\lambda}),\hat{\sigma}(\bm{\lambda})) on 𝒜[t1]\mathcal{A}^{[t-1]}
5       2: Build acquisition function u(𝝀)u(\bm{\lambda}) from (c^(𝝀),σ^(𝝀))(\hat{c}(\bm{\lambda}),\hat{\sigma}(\bm{\lambda}))
6       3: Obtain proposal 𝝀+\bm{\lambda}^{+} by optimizing uu: 𝝀+argmax𝝀Λ~u(𝝀)\bm{\lambda}^{+}\in\operatorname*{arg\,max}_{\bm{\lambda}\in\tilde{\Lambda}}u(\bm{\lambda})
7       4: Evaluate c(𝝀+)c(\bm{\lambda}^{+})
8       5: Obtain 𝒜[t]\mathcal{A}^{[t]} by augmenting 𝒜[t1]\mathcal{A}^{[t-1]} with (𝝀+,c(𝝀+))(\bm{\lambda}^{+},c(\bm{\lambda}^{+}))
9 end for
Result: 𝝀^\hat{\bm{\lambda}}: Best-performing 𝝀\bm{\lambda} from archive or according to surrogates prediction.
Algorithm 1 BO for a black-box objective c(𝝀)c(\bm{\lambda}).
Refer to caption
?figurename? 5: Illustration of how BO generates the proposal by maximizing an acquisition function (figure inspired by [65]).
Surrogate model

The choice of surrogate model has great influence on BO performance and is often linked to properties of Λ~\tilde{\Lambda}. If Λ~\tilde{\Lambda} is purely real-valued, Gaussian process (GP) regression [113] – sometimes referred to as Kriging – is used most often. In its basic form, BO with a GP does not support HPO with non-numeric or conditional HPs, and tends to show deteriorating performance when Λ~\tilde{\Lambda} has more than roughly ten dimensions. Dealing with integer-valued or categorical HPs requires special care [52]. Extensions for mixed-hierarchical spaces that are based on special kernels [130] exist, and the use of random embeddings has been suggested for high-dimensional spaces [139, 102]. Most importantly, standard GPs have runtime complexity that is cubic in the number of samples, which can result in a significant overhead when the archive 𝒜\mathcal{A} becomes large.

[98] propose to use an adapted, sparse GP that restrains training data from uninteresting areas. Local Bayesian optimization [189] is implemented in the TuRBO algorithm and has been successfully applied to various black-box problems.

Random forests, most notably used in SMAC [67], have also shown good performance as surrogate models for BO. Their advantage is their native ability to handle discrete HPs and, with minor modifications, e.g., in [67], even dependent HPs without the need for preprocessing. Standard random forest implementations are still able to handle dependent HPs by treating infeasible HP values as missing and performing imputation. Random forests tend to work well with larger archives and introduce less overhead than GPs. SMAC uses the standard deviation of tree predictions as a heuristic uncertainty estimate σ^(𝝀)\hat{\sigma}(\bm{\lambda}) [67]. However, more sophisticated alternatives exist to provide unbiased estimates [121]. Since trees are not distance-based spatial models, the uncertainty estimator does not increase the further we extrapolate away from observed training points. This might be one explanation as to why tree-based surrogates are outperformed by GP regression on purely numerical search spaces [35].

Neural networks (NNs) have shown good performance in particular with nontrivial input spaces, and they are thus increasingly considered as surrogate models for BO [126]. Discrete inputs can be handled by one-hot encoding or by automatic techniques, e.g., entity embedding where a dense representation is learned from the output of a simple, direct encoding, such as one-hot encoding by the NN. [56]. NNs offer efficient and versatile implementations that allow the use of gradients for more efficient optimization of the acquisition function. Uncertainty bounds on the predictions can be obtained, for example, by using Bayesian neural networks (BNNs), which combine NNs with a probabilistic model of the network weights or adaptive basis regression where only a Bayesian linear regressor is added to the last layer of the NN [126].

Acquisition function

The acquisition function balances out the surrogate model’s prediction c^(𝝀)\hat{c}(\bm{\lambda}) and its posterior uncertainty σ^(𝝀)\hat{\sigma}(\bm{\lambda}) to ensure both exploration of unexplored regions of Λ~\tilde{\Lambda}, as well as exploitation of regions that have performed well in previous evaluations. A very popular acquisition function is the expected improvement (EI) [76]:

uEI(𝝀)=𝔼C𝒩(c^(𝝀),σ^(𝝀)2)[max{cminC(𝝀),0}]=(cminc^(𝝀))Φ(cminc^(𝝀)σ^(𝝀))+σ^(𝝀)ϕ(cminc^(𝝀)σ^(𝝀)),\displaystyle\begin{split}u_{\operatorname{EI}}(\bm{\lambda})&=\operatorname{\mathds{E}}_{C\sim\mathcal{N}\left(\hat{c}(\bm{\lambda}),\,\hat{\sigma}(\bm{\lambda})^{2}\right)}\left[\max\left\{c_{\min}-C(\bm{\lambda}),0\right\}\right]\\ &=\left(c_{\min}-\hat{c}(\bm{\lambda})\right)\Phi\left(\frac{c_{\min}-\hat{c}(\bm{\lambda})}{\hat{\sigma}(\bm{\lambda})}\right)+\hat{\sigma}(\bm{\lambda})\phi\left(\frac{c_{\min}-\hat{c}(\bm{\lambda})}{\hat{\sigma}(\bm{\lambda})}\right)\ ,\end{split} (11)

where cminc_{\min} denotes the best observed outcome of cc so far, and Φ\Phi and ϕ\phi are the cumulative distribution function and density of the standard normal distribution, respectively. The EI was introduced in connection with GPs that have a Bayesian interpretation, expressing the posterior distribution of the true performance value given already observed values as a Gaussian random variable C(𝝀)C(\bm{\lambda}) with C(𝝀)𝒩(c^(𝝀),σ^(𝝀)2)C(\bm{\lambda})\sim\mathcal{N}(\hat{c}(\bm{\lambda}),\hat{\sigma}(\bm{\lambda})^{2}). Under this condition, Eq. (11)\eqref{eq:EI} can be analytically expressed as above, and the resulting formula is often heuristically applied to other surrogates that supply c^(𝝀)\hat{c}(\bm{\lambda}) and σ^(𝝀)\hat{\sigma}(\bm{\lambda}).

A further, very simple acquisition function is the lower confidence bound (LCB) [74]:

uLCB(𝝀)=(1)(c^(𝝀)κσ^(𝝀)),\displaystyle u_{\operatorname{LCB}}\left(\bm{\lambda}\right)=(-1)\cdot\left(\hat{c}(\bm{\lambda})-\kappa\cdot\hat{\sigma}(\bm{\lambda})\right), (12)

here negated to yield a maximization problem for Algorithm 1. The LCB treats local uncertainty as an additive bonus at each 𝝀\bm{\lambda} to enforce exploration, with κ\kappa being a control parameter that is not easy to set.

Adaptive Explore-Exploit Tradeoffs

In a theoretical analysis without a limit on evaluations, [128] suggest to increase the κ\kappa-parameter of LCB over time to encourage exploration in later phases of optimization. However, in the context of practical HPO with a finite budget for evaluations, it seems plausible to decrease κ\kappa over time to enforce exploration in the beginning and exploitation at the end in a similar cooldown scheme as in simulated annealing, such as e.g. suggested by [150], which uses a further cyclical scheme to escape local minima. [118] propose a cooldown scheme for expected improvement: They use the “generalized expected improvement” uGEI(𝝀)=𝔼[max{cminC(𝝀),0}g]u_{\operatorname{GEI}}(\bm{\lambda})=\operatorname{\mathds{E}}\left[\max\left\{c_{\min}-C(\bm{\lambda}),0\right\}^{g}\right], where larger values for exponent gg also enforce more exploration. They suggest to start with a large gg value in the beginning and to gradually decrease it to enforce exploitation at the end. [73] propose to dynamically give exploitation more weight as a function of mean model-uncertainty in what they call “contextual improvement”. This has a similar effect of encouraging late exploitation, as model uncertainty generally decreases over the course of optimization. Finally, [32] show that a very simply ϵ\epsilon-greedy BO strategy can perform well or even better than established acquisition functions. They simply propose HPCs that maximize the predicted mean, but interleave random HPCs with ϵ\epsilon probability. They show that this strategy performs well in settings with a low evaluation budget or with many dimensions, which is consistent with the other proposed methods, which adaptively emphasize exploitation more when remaining budget is low.

Multi-point proposal

In its original formulation, BO only proposes one candidate HPC per iteration and then waits for the performance evaluation of that configuration to conclude. However, in many situations, it is preferable to evaluate multiple HPCs in parallel by proposing multiple configurations at once, or by asynchronously proposing HPCs while other proposals are still being evaluated.

While in the sequential variant, the best point can be determined unambiguously from the full information of the acquisition function. In the parallel variant, many points must be proposed at the same time without information about how the other points will perform. The objective here is to some degree to ensure that the proposed points are sufficiently different from each other.

The proposal of nbatch>1n_{\text{batch}}>1 configurations in one BO iteration is called batch proposal or synchronous parallelization and works well if the runtimes of all black-box evaluations are somewhat homogeneous. If the runtimes are heterogeneous, one may seek to spontaneously generate new proposals whenever an evaluation thread finishes in what is called asynchronous parallelization. This offers some advantages to synchronous parallelization, but is more complicated to implement in practice.

The simplest option to obtain nbatchn_{\text{batch}} proposals is to use the LCB criterion in Eq. (12) with different values for κ\kappa. For this so-called qLCB (also referred to as qUCB) approach, [68] propose to draw κ\kappa from an exponential distribution with rate parameter 11. This can work relatively well in practice but has the potential drawback of creating proposals that are too similar to each other [23]. [23] instead propose to maximize both c^(𝝀)\hat{c}(\bm{\lambda}) and σ^(𝝀)\hat{\sigma}(\bm{\lambda}) simultaneously, using multi-objective optimization, and to choose nbatchn_{\text{batch}} points from the approximated Pareto-front. Further ways to obtain nbatchn_{\text{batch}} proposals are constant liar, Kriging believer (both described in [54]), and q-EI [28]. Constant liar sets fake constant response values for the first points proposed in the batch to generate additional one via the normal EI principle and the approach; Kriging believer does the same but uses the GP model’s mean prediction as fake value instead of a constant. The qEI optimizes a true multivariate EI criterion and is computationally expensive for larger batch sizes, but [158] implement methods to efficiently calculate the qEI (and qNEI for noisy observations) through MC simulations.

Efficient Performance Evaluation

While BO models only optimize the HPC prediction performance in its standard setup, there are several extensions that aim to make optimization more efficient by considering runtime or resource usage. These extensions mainly modify the acquisition function to influence the HPCs that are being proposed. [125] suggests the expected improvement per second (EIPS) as a new acquisition function. The EIPS includes a second surrogate model that predicts the runtime of evaluating a HPC in order to compromise between expected improvement and required runtime for evaluation. Most methods that trade off between runtime and information gain fall under the category of multi-fidelity methods, which is further discussed in Section 4.2.4. Acquisition functions that are especially relevant here consider information gain-based criteria like Entropy Search [60] or Predictive Entropy Search [61]. These acquisition functions can be used for selective subsample evaluation [82], reducing the number of necessary resampling iterations [131], and stopping certain model classes, such as NNs, early.

4.2.4 Multifidelity and Hyperband

The multifidelity (MF) concept in HPO refers to all tuning approaches that can efficiently handle a learner (𝒟,𝝀)\mathcal{I}(\mathcal{D},\bm{\lambda}) with a fidelity HP λfid\lambda_{\text{fid}} as a component of 𝝀\bm{\lambda}, which influences the computational cost of the fitting procedure in a monotonically increasing manner. Higher λfid\lambda_{\text{fid}} values imply a longer runtime of the fit. This directly implies that the lower we set λfid\lambda_{\text{fid}}, the more points we can explore in our search space, albeit with much less reliable information w.r.t. their true performance. If λfid\lambda_{\text{fid}} has a linear relationship with the true computational costs, we can directly sum the λfid\lambda_{\text{fid}} values for all evaluations to measure the computational costs of a complete optimization run. We assume to know box-constraints of λfid\lambda_{\text{fid}} in form of a lower and upper limit, so λfid[λfidlow,λfidupp]\lambda_{\text{fid}}\in[\lambda_{\text{fid}}^{\textrm{low}},\lambda_{\text{fid}}^{\textrm{upp}}], where the upper limit implies the highest fidelity returning values closest to the true objective value at the highest computational cost. Usually, we expect higher values of λfid\lambda_{\text{fid}} to be better in terms of predictive performance yet naturally more computationally expensive. However, overfitting can occur at some point, for example when λfid\lambda_{\text{fid}} controls the number of training epochs when fitting an NN. Furthermore, we assume that the relationship of the fidelity to the prediction performance changes somewhat smoothly. Consequently, when evaluating multiple HPCs with small λfid\lambda_{\text{fid}}, this at least indicates their true ranking. Typically, this implies a sequential fitting procedure, where λfid\lambda_{\text{fid}} is, for example, the number of (stochastic) gradient descent steps or the number of sequentially added (boosting) ensemble members. A further, generally applicable option is to subsample the training data from a small fraction to 100% before training and to treat this as a fidelity control [82]. HPO algorithms that exploit such a λfid\lambda_{\text{fid}} parameter – usually by spending budget on cheap HPCs with low λfid\lambda_{\text{fid}} values earlier for exploration, and then concentrating on the most promising ones later – are called multifidelity methods. One can define two versions of the MF-HPO problem. (a) If overfitting can occur with higher values of λfid\lambda_{\text{fid}} (e.g., if it encodes training iterations), simply minimizing min𝝀Λ~c(𝝀)\min_{\bm{\lambda}\in\tilde{\Lambda}}c(\bm{\lambda}) is already appropriate. (b) If the assumption holds that a higher fidelity always results in a better model (e.g., if λfid\lambda_{\text{fid}} controls the size of the training set), we are interested in finding the configuration 𝝀\bm{\lambda}^{*} for which the inducer will return the best model given the full budget, so min𝝀Λ~,λfid=λfiduppc(𝝀)\min_{\bm{\lambda}\in\tilde{\Lambda},\lambda_{\text{fid}}=\lambda_{\text{fid}}^{\textrm{upp}}}c(\bm{\lambda}). Of course, in both versions, the optimizer can make use of cheap HPCs with low settings of λfid\lambda_{\text{fid}} on its path to its result.

Hyperband [229] can best be understood as repeated execution of the successive halving (SH) procedure [71]. SH assumes a fidelity-budget BB for the sum of λfid\lambda_{\text{fid}} for all evaluations. It starts with a given, fixed number of candidates 𝝀(i)\bm{\lambda}^{(i)} that we denote with p[0]p^{[0]} and ?races them down? in stages tt to a single best candidate by repeatedly evaluating all candidates with increased fidelity in a certain schedule. Typically, this is controlled by the ηHB\eta_{\text{HB}} control multiplier of Hyperband with ηHB>1\eta_{\text{HB}}>1 (typically set to 2 or 3): After each batch evaluation tt of the current population of size p[t]p^{[t]}, we reduce the population to the best 1ηHB\frac{1}{\eta_{\text{HB}}} fraction and set the new fidelity for a candidate evaluation to ηHB×λfid\eta_{\text{HB}}\times\lambda_{\text{fid}}. Thus, promising HPCs are assigned a higher fidelity overall, and sub-optimal ones are discarded early on. The starting fidelity λfid[0]\lambda_{\text{fid}}^{[0]} and the number of stages s+1s+1 are computed in a way such that each batch evaluation of an SH population has approximately B/(s+1)B/(s+1) amount of fidelity units spent. Overall, this ensures that approximately, but not more than, BB fidelity units are spent in SH:

t=0sp[0]ηHBtλfid[0]ηHBtB.\sum^{s}_{t=0}\left\lfloor p^{[0]}\eta_{\text{HB}}^{-t}\right\rfloor\lambda_{\text{fid}}^{[0]}\eta_{\text{HB}}^{t}\leq B. (13)

However, the efficiency of SH strongly depends on a sensible choice of the number of starting configurations and the resulting schedule. If we assume a fixed fidelity-budget for HPO, the user has the choice of running either (a) more configurations but with less fidelity, or (b) fewer configurations, but with higher fidelity. While the former naturally explores more, the latter schedules evaluations with stronger correlation to the true objective value and more informative evaluations. As an example, consider how λ(6)\lambda^{(6)} is discarded in favor of λ(8)\lambda^{(8)} at 25% in Figure 6. Because their performance lines would have crossed close to 100%, λ(6)\lambda^{(6)} is ultimately the better configuration. However, in this case, the superiority of λ(6)\lambda^{(6)} was only observable after full evaluation.

Refer to caption
?figurename? 6: Right: Bracket design of HB with λfidupp=8\lambda_{\text{fid}}^{\textrm{upp}}=8 and ηHB=2\eta_{\text{HB}}=2 (resulting in four brackets). Left: Exemplary bracket run (figure inspired by [65]). Faint lines represent future performance of HPCs that were discarded early.

As we often have no prior knowledge regarding this effect, HB simply runs SH for different numbers of starting configurations ps[0]p_{s}^{[0]}, and each SH run or schedule is called a bracket. As input, HB takes ηHB\eta_{\text{HB}} and the maximum fidelity λfidupp>ηHB\lambda_{\text{fid}}^{\textrm{upp}}>\eta_{\text{HB}}. HB then constructs the target fidelity budget BB for each bracket by considering the most explorative bracket: Here, the number of batch evaluations smax+1s_{\text{max}}+1 is chosen to be logηHB(λfidupp)+1\left\lfloor\log_{\eta_{\text{HB}}}(\lambda_{\text{fid}}^{\textrm{upp}})\right\rfloor+1 for which λfid[0]=λfiduppηHBsmax(ηHB1,ηHB),λfid[smax]=λfidupp\lambda_{\text{fid}}^{[0]}=\lambda_{\text{fid}}^{\textrm{upp}}\eta_{\text{HB}}^{-s_{\text{max}}}\in(\eta_{\text{HB}}^{-1},\eta_{\text{HB}}),\,\lambda_{\text{fid}}^{[s_{\text{max}}]}=\lambda_{\text{fid}}^{\textrm{upp}}, and we collect these values in 𝒓=(λfiduppηHBsmax,λfiduppηHBsmax+1,λfiduppηHBsmax+2,,λfidupp)smax+1\bm{r}=(\lambda_{\text{fid}}^{\textrm{upp}}\eta_{\text{HB}}^{-s_{\text{max}}},\lambda_{\text{fid}}^{\textrm{upp}}\eta_{\text{HB}}^{-s_{\text{max}}+1},\lambda_{\text{fid}}^{\textrm{upp}}\eta_{\text{HB}}^{-s_{\text{max}}+2},\dots,\lambda_{\text{fid}}^{\textrm{upp}})\in\mathds{R}^{s_{max}+1}. Since we want to spend approximately the same total fidelity and reduce the candidates to one winning HPC in every batch evaluation, the fidelity budget of each bracket is B=(smax+1)λfiduppB=(s_{\text{max}}+1)\lambda_{\text{fid}}^{\textrm{upp}}. For every s{0,,smax}s\in\{0,\dots,s_{\text{max}}\}, a bracket is defined by setting the starting fidelity λfid[0]λfidlow\lambda_{\text{fid}}^{[0]}\geq\lambda_{\text{fid}}^{\textrm{low}} of the bracket to 𝒓(1+smaxs)\bm{r}^{(1+s_{\text{max}}-s)}, resulting in smax+1s_{\text{max}}+1 brackets and an overall fidelity budget of (smax+1)B(s_{\text{max}}+1)B spent by HB. Consequently, every bracket ss consists of s+1s+1 batch evaluations, and the starting population size ps[0]p_{s}^{[0]} is the maximum value that fulfills Eq. (13). The full algorithm is outlined in Algorithm 2, and the bracket design of HB with λfidupp=8\lambda_{\text{fid}}^{\textrm{upp}}=8 and ηHB=2\eta_{\text{HB}}=2 is shown in Figure 6.
Starting configurations are usually sampled uniformly, but [229] also show that any stationary sampling distribution is valid. Because HB is a random-sampling-based method, it can trivially handle hierarchical HP spaces in the same manner as RS.

input: maximum fidelity per HPC λfidupp,ηHB\lambda_{\text{fid}}^{\textrm{upp}},\eta_{\text{HB}}
initialization: smax=logηHB(λfidupp),B=(smax+1)λfidupps_{\text{max}}=\left\lfloor\log_{\eta_{\text{HB}}}(\lambda_{\text{fid}}^{\textrm{upp}})\right\rfloor,\;B=(s_{\text{max}}+1)\lambda_{\text{fid}}^{\textrm{upp}}
𝒓=(λfiduppηHBsmax,λfiduppηHBsmax+1,λfiduppηHBsmax+2,,λfidupp)\quad\quad\bm{r}=(\lambda_{\text{fid}}^{\textrm{upp}}\eta_{\text{HB}}^{-s_{\text{max}}},\lambda_{\text{fid}}^{\textrm{upp}}\eta_{\text{HB}}^{-s_{\text{max}}+1},\lambda_{\text{fid}}^{\textrm{upp}}\eta_{\text{HB}}^{-s_{\text{max}}+2},\dots,\lambda_{\text{fid}}^{\textrm{upp}})
for s=smax,smax1,,0s=s_{\text{max}},s_{\text{max}}-1,\dots,0 do
      ps=BλfiduppηHBss+1p_{s}=\left\lceil\frac{B}{\lambda_{\text{fid}}^{\textrm{upp}}}\frac{\eta_{\text{HB}}^{s}}{s+1}\right\rceil
      Λs[0]=get_HPCs(ps)(={𝝀s(1),,𝝀s(ps)},𝝀s(i)Λ~\Lambda_{s}^{[0]}=\textrm{get\_HPCs}(p_{s})\quad(=\{\bm{\lambda}^{(1)}_{s},\dots,\bm{\lambda}^{(p_{s})}_{s}\},\,\bm{\lambda}^{(i)}_{s}\in\tilde{\Lambda})
      // Successive Halving inner loop
      for t=0,,st=0,\dots,s do
            ps[t]=psηHBtp_{s}^{[t]}=\left\lfloor p_{s}\eta_{\text{HB}}^{-t}\right\rfloor
            Set λfid\lambda_{\text{fid}} components of entries of Λs[t]\Lambda_{s}^{[t]} to 𝒓(1+smaxs+t)(=(λfiduppηHBs)ηHBt)\bm{r}^{(1+s_{\text{max}}-s+t)}\quad(=(\lambda_{\text{fid}}^{\textrm{upp}}\eta_{\text{HB}}^{-s})\cdot\eta_{\text{HB}}^{t})
            C[t]={c(𝝀):𝝀Λs[t]}C^{[t]}=\{c(\bm{\lambda}):\bm{\lambda}\in\Lambda_{s}^{[t]}\}
            Λs[t+1]=top_k(Λs[t],C[t],ps[t]/ηHB)\Lambda_{s}^{[t+1]}=\textrm{top\_k}(\Lambda_{s}^{[t]},C^{[t]},\left\lfloor p_{s}^{[t]}/\eta_{\text{HB}}\right\rfloor)
      end for
end for
Result: HPC with best performance
Algorithm 2 Hyperband algorithm [229] where


• get_HPCs(p)\textrm{get\_HPCs}(p) uses a stationary sampling distribution to generate the initial HPC population of size pp,

• top_k(Λs,C,k)\textrm{top\_k}(\Lambda_{s},C,k) selects the kk HPCs in Λs\Lambda_{s} associated to the kk best performances in CC as the next HPC population.
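To make the bracket arithmetic of Algorithm 2 concrete, the following minimal Python sketch implements Hyperband under simplifying assumptions; sample_hpc and evaluate are hypothetical placeholders for a stationary HPC sampler and for the (multi-fidelity) evaluation of c(𝝀), and only the incumbent among full-fidelity evaluations is tracked.

import math
import random

def hyperband(sample_hpc, evaluate, fid_upper, eta=3):
    # sample_hpc() draws one random HPC; evaluate(hpc, fidelity) returns the
    # estimated generalization error at the given fidelity (lower is better).
    s_max = int(math.floor(math.log(fid_upper, eta) + 1e-9))
    budget = (s_max + 1) * fid_upper          # fidelity budget B per bracket
    best, best_cost = None, float("inf")
    for s in range(s_max, -1, -1):            # one SH bracket per value of s
        p = math.ceil(budget / fid_upper * eta ** s / (s + 1))
        population = [sample_hpc() for _ in range(p)]
        for t in range(s + 1):                # successive halving stages
            fidelity = fid_upper * eta ** (-s) * eta ** t
            costs = [evaluate(hpc, fidelity) for hpc in population]
            ranked = sorted(zip(costs, range(len(population))))
            if t == s:                        # full-fidelity stage: update incumbent
                cost, idx = ranked[0]
                if cost < best_cost:
                    best, best_cost = population[idx], cost
            keep = max(1, math.floor(len(population) / eta))
            population = [population[i] for _, i in ranked[:keep]]
    return best, best_cost

# toy usage: a single numeric HP, with the fidelity playing the role of epochs
sample = lambda: {"lr": 10 ** random.uniform(-4, 0)}
cost = lambda hpc, fid: (math.log10(hpc["lr"]) + 2) ** 2 + 1.0 / fid
print(hyperband(sample, cost, fid_upper=81, eta=3))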
Multifidelity Bayesian Optimization

The idea behind Hyperband – trying to discard HPCs that do not perform well early on – is somewhat orthogonal to the idea behind BO, i.e. intelligently proposing HPCs that are likely to improve performance or to otherwise gain information about the location of the optimum. It is therefore natural to combine these two methods. This was first achieved with BOHB by [192], who progressively increase λfid\lambda_{\text{fid}} of suggested HPCs as in Hyperband. However, instead of proposing HPCs randomly, they use a model-based approach equivalent to maximizing expected improvement. They show that BOHB performs similarly to HB in the low-budget regime, where it is superior to standard BO methods, but outperforms HB and performs similarly to or better than BO when enough budget for tens of full-budget evaluations is available. BOHB was later extended to A-BOHB [136], which efficiently performs asynchronously parallelized optimization by sampling possible outcomes of evaluations currently under way.

Hyperband-based multi-fidelity methods have a control parameter that functions similarly to ηHB\eta_{\text{HB}} described above, which determines the fraction of configurations that are discarded at every λfid\lambda_{\text{fid}} value for which evaluations are performed. However, the optimal proportion of configurations to discard may vary depending on how strongly performance values at different fidelities are correlated. An alternative approach is to use the surrogate model from BO to make adaptive decisions about which λfid\lambda_{\text{fid}} values to use and which HPCs to discard. Algorithms following this approach typically use a method first proposed by [129], who use a surrogate model for both the performance and the resources needed to evaluate an HPC. Entropy search [60] is then used to maximize the information gained about the location of the optimum at λfid=λfidupp\lambda_{\text{fid}}=\lambda_{\text{fid}}^{\textrm{upp}} per unit of predicted resource expenditure. Low-fidelity HPCs are evaluated whenever they contribute a disproportionately large amount of information about this optimum relative to their required resources. A special challenge that these methods need to solve is the modeling of performance with varying λfid\lambda_{\text{fid}}, which often has a different influence than other HPs and is therefore often treated as a separate case. An early HPO method building on this concept is freeze-thaw BO [132], which considers optimization of iterative ML methods such as deep learning that can be suspended (“frozen”) and continued (“thawed”). Another HPO method that specifically considers the training set size as fidelity HP is FABOLAS [80], which actively chooses the training set size for each evaluation by trading off the computational cost of an evaluation on a large training set against the expected information gain about the optimal configuration.

In general, there could be other proposal mechanisms instead of random sampling as in Hyperband or BO as in BOHB. For example, [4] showed that differential evolution can perform even better; however, the evolution of population members across fidelities needs to be adjusted accordingly.

4.2.5 Iterated Racing

The iterated racing (IR, [20]) procedure is a general algorithm configuration method that searches for a configuration of an arbitrary (not necessarily ML) algorithm that performs well over a given distribution of problem instances. In most HPO algorithms, HPCs are evaluated using a resampling procedure such as CV, so a noisy function (the error estimate of a single resampling iteration) is evaluated multiple times and averaged. In order to connect racing to HPO, we now define a problem instance as a single holdout resampling split for a given ML data set, as suggested in [135, 84], and we will from now on describe racing only in terms of HPO.

The fundamental idea of racing [97] is that HPCs that show particularly poor performance when evaluated on the first problem instances (in our case: resampling folds) are unlikely to catch up in later folds and can be discarded early to save computation time for more interesting HPCs. This is determined by running a (paired) statistical test w.r.t. HPC performance values on folds. This allows for an efficient and dynamic allocation of the number of folds in the computation of c(𝝀)c(\bm{\lambda}) – a property of IR that is unique, at least when compared to the algorithms covered in this article.

Racing is similar to HB in that it discards poorly-performing HPCs early. Like HB, racing must also be combined with a sampling metaheuristic to initialize a race. Particularly well-suited for HPO are iterated races [234], and we will use the terminology of that implementation to explain the main control parameters of IR. IR starts by racing down an initial population of randomly sampled HPCs and then uses the surviving HPCs of the race to stochastically initialize the population of the subsequent race to focus on interesting regions of the search space.

Sampling is performed by first selecting a parent configuration 𝝀\bm{\lambda} among the NeliteN^{\textrm{elite}} survivors of the previous generation, according to a categorical distribution with probabilities p𝝀=2(Neliter𝝀+1)/(Nelite(Nelite+1))p_{\bm{\lambda}}=2(N^{\textrm{elite}}-r_{\bm{\lambda}}+1)/(N^{\textrm{elite}}(N^{\textrm{elite}}+1)), where r𝝀r_{\bm{\lambda}} is the rank of the configuration 𝝀\bm{\lambda}. A new HPC is then generated from this parent by mutating numeric HPs using a truncated normal distribution, always centered at the numeric HP value of the parent. Discrete HPs are mutated using a discrete probability distribution. This is visualized in Figure 7. The parameters of these distributions are updated as the optimization continues: The standard deviation of the Gaussian is narrowed to enforce exploitation and convergence, and the categorical distribution is updated to more strongly favor the values of recent ancestors. IR is able to handle search spaces with dependencies by sampling HPs that were inactive in the parent configuration from the initial (uniform) distribution.
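The following short sketch illustrates this sampling step for a single numeric HP, using hypothetical elite values; parent selection uses the rank-based probabilities p𝝀 given above, and mutation uses a normal distribution truncated to the HP bounds.

import numpy as np
from scipy.stats import truncnorm

def sample_offspring(elites, lower, upper, sd, rng):
    # elites: surviving HP values, ordered from best (rank 1) to worst;
    # a parent is drawn with probability 2(N - r + 1)/(N(N + 1)), then mutated
    # with a normal distribution truncated to [lower, upper] and centered at
    # the parent value; sd shrinks over iterations to enforce convergence.
    n = len(elites)
    ranks = np.arange(1, n + 1)
    probs = 2 * (n - ranks + 1) / (n * (n + 1))
    parent = rng.choice(elites, p=probs)
    a, b = (lower - parent) / sd, (upper - parent) / sd   # standardized bounds
    return truncnorm.rvs(a, b, loc=parent, scale=sd, random_state=rng)

rng = np.random.default_rng(1)
elites = [0.12, 0.31, 0.55]            # hypothetical surviving HP values, best first
print(sample_offspring(elites, lower=0.0, upper=1.0, sd=0.2, rng=rng))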

Figure 7: Scheme of the iterated racing algorithm (figure based on [234]).

This algorithmic principle, in which a distribution centered around well-performing candidates is continuously sampled from and updated, is close to an estimation-of-distribution algorithm (EDA), a well-known template for ES [86]. Therefore, IR could be described as an EDA with racing for noise handling.

IR has several control parameters that determine how the racing experiments are executed. We only describe the most important ones here; many of these have heuristic defaults set in the implementation introduced by [234]. NiterN^{\textrm{iter}} (nbIterations) determines the number of performed races, defaulting to 2+log2dim(Λ~)\lfloor 2+\log_{2}\dim(\tilde{\Lambda})\rfloor (with dim(Λ~)\dim(\tilde{\Lambda}) the number of HPs being optimized). Within a race, each HPC is first evaluated on TfirstT^{\textrm{first}} (firstTest) folds before a first comparison test is made. Subsequent tests are then made after every TeachT^{\textrm{each}} (eachTest) additional evaluations. IR can be performed as elitist, which means surviving configurations from a generation are part of the next generation. The statistical test that discards individuals can be the Friedman test or the tt-test [19], the latter possibly with multiple testing correction. [234] recommend the tt-test when performance values for evaluations on different instances are commensurable and the tuning objective is the mean over instances, which is usually the case for our resampled performance metric where instances are simply resampling folds.

4.3 HPO - A Bilevel Inference Perspective

As discussed in Section 4.1, HPO produces an approximately optimal HPC 𝝀^=τ(𝒟,,Λ~,ρ)\hat{\bm{\lambda}}=\tau(\mathcal{D},\mathcal{I},\tilde{\Lambda},\rho) by optimizing it w.r.t. the resampled performance c(𝝀)=GE^(,𝒥,ρ,𝝀)c(\bm{\lambda})=\widehat{\mathrm{GE}}(\mathcal{I},\mathcal{J},\rho,\bm{\lambda}). This is still risk minimization w.r.t. (hyper)parameters, where we search for optimal parameters 𝝀^\hat{\bm{\lambda}} so that the risk of our predictor f^𝝀^,𝜽^\hat{f}_{\hat{\bm{\lambda}},\bm{\hat{\theta}}} becomes minimal when measured on validation data via

ρL(𝐲,𝑭𝝀^,𝜽^)=i=1mL(y(i),𝑭𝝀^,𝜽^(i)),\rho_{L}(\mathbf{y},\bm{F}_{\hat{\bm{\lambda}},\bm{\hat{\theta}}})=\sum\limits_{i=1}^{m}L\left(y^{(i)},\bm{F}^{(i)}_{\hat{\bm{\lambda}},\bm{\hat{\theta}}}\right)\textrm{,} (14)

where f^𝝀^,𝜽^=(𝒟train,𝝀^)\hat{f}_{\hat{\bm{\lambda}},\bm{\hat{\theta}}}=\mathcal{I}(\mathcal{D}_{\text{train}},\hat{\bm{\lambda}}) and 𝑭𝝀^,𝜽^\bm{F}_{\hat{\bm{\lambda}},\bm{\hat{\theta}}} is the prediction matrix of f^\hat{f} on validation data, for a pointwise loss function. The above is formulated for a single holdout split (Jtrain,Jtest)(J_{\mathrm{train}},J_{\mathrm{test}}) in order to demonstrate the tight connection between (first level) risk minimization and HPO; Eq. (8) provides the generalization for arbitrary resampling with multiple folds. This is somewhat obfuscated and complicated by the fact that we cannot evaluate Eq. (8) in one go, but must rather fit one or multiple models f^\hat{f} during its computation (hence also its black-box nature). It is useful to conceptualize this as a bilevel inference mechanism; while the parameters 𝜽^\bm{\hat{\theta}} of ff for a given HPC are estimated in the first level, in the second level we infer the HPs 𝝀^\hat{\bm{\lambda}}. However, both levels are conceptually very similar in the sense that we are optimizing a risk function for model parameters which should be optimal for the data distribution at hand. In case of the second level, this risk function is not emp(𝜽)\mathcal{R}_{\text{emp}}(\bm{\theta}), but the harder-to-evaluate generalization error GE^\widehat{\mathrm{GE}}. An intuitive, alternative term for HPO is second level inference [55], visualized in Figure 8.

Figure 8: Self-tuning learner with integrated HPO wrapped around the inducer.

There are mainly two reasons why such a bilevel optimization is preferable to a direct, joint risk minimization of parameters and HPs [55]:

  • Typically, learners are constructed in such a way that optimized first-level parameters can be more efficiently computed for fixed HPs, e.g., often the first-level problem is convex, while the joint problem is not.

  • Since the generalization error is eventually optimized for the bilevel approach, the resulting model should be less prone to overfitting.

Thus, we can define a learner with integrated tuning as a mapping 𝒯,Λ~,ρ,𝒥:𝔻\mathcal{T}_{\mathcal{I},\tilde{\Lambda},\rho,\mathcal{J}}:\mathds{D}\rightarrow\mathcal{H}, 𝒟τ(𝒟,,Λ~,ρ)(𝒟)\mathcal{D}\rightarrow\mathcal{I}_{\tau(\mathcal{D},\mathcal{I},\tilde{\Lambda},\rho)}(\mathcal{D}), which maps a data set 𝒟\mathcal{D} to the model f^𝝀^\hat{f}_{\hat{\bm{\lambda}}} that has the HPC set to 𝝀^\hat{\bm{\lambda}} as optimized by τ\tau on 𝒟\mathcal{D} and is then itself trained on the whole of 𝒟\mathcal{D}; all for a given inducer \mathcal{I}, performance measure ρ\rho, and search space Λ~\tilde{\Lambda}. Algorithmically, this learner has a 2-step training procedure (see Figure 8), where tuning is performed before the final model fit. This "self-tuning" learner 𝒯\mathcal{T} "shadows" the tuned HPs of its search space Λ~\tilde{\Lambda} from the original learner and integrates their configuration into the training procedure (note that the self-tuning learner also adds new HPs, namely the control parameters of the HPO procedure). If such a learner is cross-validated, we naturally arrive at the concept of nested CV, which is discussed in the following Section 4.4.
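In scikit-learn, such a self-tuning learner is readily available: a search wrapper with refit enabled tunes on the given data via inner resampling and then retrains with the best HPC on all of the data. The following sketch (with an arbitrarily chosen toy search space) illustrates the 2-step training procedure; it is an illustration of the concept, not a prescribed implementation.

from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC
from scipy.stats import loguniform

# self-tuning learner: RandomizedSearchCV wraps the inducer (SVC), tunes λ on D
# via inner 5-fold CV, and with refit=True retrains the model with the best HPC
# on the whole of D -- exactly the 2-step training procedure described above
X, y = make_classification(n_samples=300, random_state=0)
search_space = {"C": loguniform(1e-3, 1e3), "gamma": loguniform(1e-4, 1e1)}
self_tuning_svm = RandomizedSearchCV(SVC(), search_space, n_iter=25, cv=5,
                                     scoring="accuracy", refit=True, random_state=0)
self_tuning_svm.fit(X, y)              # tune, then fit on all of D
print(self_tuning_svm.best_params_)    # λ̂ -- a temporary, data-dependent quantity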

4.4 Nested Resampling and Meta-Overfitting

As discussed in Section 3.2, the evaluation of any learner should always be performed via resampling on independent test sets to ensure unbiased estimation of its generalization error. This is necessary because evaluating f^\hat{f} on the data set 𝒟\mathcal{D} that was used for its construction would lead to an optimistic bias. In the general HPO problem as in Eq. (10), we already minimize this generalization error by resampling:

𝝀^argmin𝝀Λ~c(𝝀)=argmin𝝀Λ~GE^(,𝒥,ρ,𝝀).\hat{\bm{\lambda}}\in\operatorname*{arg\,min}_{\bm{\lambda}\in\tilde{\Lambda}}c(\bm{\lambda})=\operatorname*{arg\,min}_{\bm{\lambda}\in\tilde{\Lambda}}\widehat{\mathrm{GE}}(\mathcal{I},\mathcal{J},\rho,\bm{\lambda}). (15)

If we simply report the estimated c(𝝀^)c(\hat{\bm{\lambda}}) value of the returned best HPC, this also creates an optimistically biased estimator of the generalization error, as we have violated the fundamental "untouched test set" principle by optimizing on the test set(s) instead.

Figure 9: While nested resampling delivers correct results for the performance around 0.50.5, taking the tuning result directly results in a biased, optimistic estimator, especially on smaller data sets.

To better understand the necessity of an additional resampling step, we consider the following example in Figure 9, introduced by [21]. Assume a balanced binary classification task and an inducer 𝝀\mathcal{I}_{\bm{\lambda}} that ignores the data. Hence, 𝝀\bm{\lambda} has no effect, and the learner rather "predicts" the class labels in a balanced but random manner. Such a learner always has a true misclassification error of GE(,𝝀,ntrain,ρ)=0.5\mathrm{GE}(\mathcal{I},\bm{\lambda},n_{\mathrm{train}},\rho)=0.5 (using ρCE\rho_{CE} as a metric), and any normal CV-based estimator will provide an approximately correct value as long as our data set is not too small. We now "tune" this learner, for example, by RS – which is meaningless, as 𝝀\bm{\lambda} has no effect. The more tuning iterations are performed, the more likely it becomes that some model from our archive will produce partially correct labels simply by random chance, and the (only randomly) "best" of these is selected by our tuner at the end. The more we tune, the smaller our data set, or the more variance our GE estimator exhibits, the more pronounced this optimistic bias will be.
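A small simulation, a minimal sketch along the lines of this example with arbitrarily chosen sample sizes, makes the bias visible: the usual CV estimate of the featureless random classifier is centered around 0.5, but reporting the best CV value over many meaningless "tuning" iterations is clearly optimistic.

import numpy as np

rng = np.random.default_rng(0)
n, n_folds, n_configs = 60, 5, 200     # small data set, many meaningless "HPCs"
y = rng.integers(0, 2, size=n)         # balanced binary labels
folds = np.array_split(rng.permutation(n), n_folds)

def cv_error_of_random_predictor():
    errs = []
    for test_idx in folds:
        pred = rng.integers(0, 2, size=len(test_idx))   # ignores the data entirely
        errs.append(np.mean(pred != y[test_idx]))
    return np.mean(errs)

scores = [cv_error_of_random_predictor() for _ in range(n_configs)]
print("true error:            0.5")
print("mean CV estimate:      %.3f" % np.mean(scores))   # close to 0.5, unbiased
print("'best' tuned estimate: %.3f" % np.min(scores))    # clearly below 0.5, optimistic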

To avoid this bias, we introduce an additional outer resampling loop around this inner HPO-resampling procedure – or as discussed in Section 4.3, we simply regard this as cleanly cross-validating the self-tuned learner 𝒯,Λ~,ρ,𝒥\mathcal{T}_{\mathcal{I},\tilde{\Lambda},\rho,\mathcal{J}}. This is called nested resampling, which is illustrated in Figure 10.

Figure 10: Nested CV with 3 inner and outer folds: each HPC is evaluated on an inner CV, while the resulting tuned model is evaluated on the outer test set.

The procedure works as follows: In the outer loop, an outer model-building or training set is selected, and an outer test set is cleanly set aside. Each proposed HPC 𝝀+\bm{\lambda}^{+} during tuning is evaluated via inner resampling on the outer training set. The best performing HPC 𝝀^\hat{\bm{\lambda}} returned by tuning is then used to fit a final model for the current outer loop on the outer training set, and this model is then cleanly evaluated on the test set. This is repeated for all outer loops, and all outer test performances are aggregated at the end.
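With scikit-learn, nested resampling amounts to cross-validating a tuning wrapper; the following minimal sketch (toy search space and fold numbers chosen arbitrarily) uses a 3-fold inner CV inside each of 5 outer folds.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# inner loop: 3-fold CV used by the tuner to evaluate each proposed HPC
inner_search = GridSearchCV(SVC(), {"C": [0.1, 1, 10, 100]}, cv=3)

# outer loop: 5-fold CV of the self-tuned learner; each outer training set is
# tuned independently and the resulting model is scored on its outer test set
outer_scores = cross_val_score(inner_search, X, y, cv=5)
print(outer_scores.mean(), outer_scores.std())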

Some further comments on this general procedure: (i) Any resampling scheme is possible on the inside and outside, and these schemes can be flexibly combined based on statistical and computational considerations. Nested CV and nested holdout are most common. (ii) Nested holdout is often called the train-validation-test procedure, where these three terms refer to the three data sets resulting from the 3-way split. (iii) Many users often wonder which "optimal" HPC 𝝀^\hat{\bm{\lambda}} they are supposed to report or study if nested CV is performed, with multiple outer loops, and hence multiple outer HPCs 𝝀^\hat{\bm{\lambda}}. However, the learned HPs that result from optimizations within CV are considered temporary objects that merely exist in order to estimate GE^(,𝒥,ρ,𝝀)\widehat{\mathrm{GE}}(\mathcal{I},\mathcal{J},\rho,\bm{\lambda}). The comparison to first-level risk minimization from Section 4.3 is instructive here: The formal goal of nested CV is simply to produce the performance distribution on outer test sets; the 𝝀^\hat{\bm{\lambda}} can be considered as the fitted HPs of the self-tuned learner 𝒯\mathcal{T}. If the parameters of a final model are of interest for further study, the tuner 𝒯\mathcal{T} should be fitted one final time on the complete data set. This would imply a final tuning run on the complete data set for second-level inference.

Nested resampling ensures unbiased outer evaluation of the HPO process, but, like CV for the first level, it is only a procedure used to estimate performance – it does not directly help in constructing a better model. The biased estimation of performance values is not a problem for the optimization itself, as long as all evaluated HPCs are still ranked correctly. But after a considerably large number of evaluations, wrong HPCs might be selected due to stochasticity or overfitting to the splits of the inner resampling. This effect has been called overtuning, meta-overfitting or oversearching [103, 112]. At least parts of this problem seem directly related to the problem of multiple hypothesis testing. However, it has not been analysed very well yet, and unlike regularization for (first level) ERM, not many countermeasures are currently known for HPO.

4.5 Threshold Tuning

Most classifiers do not directly output class labels, but rather probabilities or real-valued decision scores, although many metrics require predicted class labels. A score is converted to a predicted label by comparing it to a threshold tt so that y^=[f(𝐱)t]\hat{y}=[f(\mathbf{x})\geq t], where we use the Iverson bracket [][] with y^=1\hat{y}=1 if f(𝐱)tf(\mathbf{x})\geq t and y^=0\hat{y}=0 in all other cases. For binary classification, the default thresholds are t=0.5t=0.5 for probabilities and t=0t=0 for scores. However, depending on the metric and the classifier, different thresholds can be optimal. Threshold tuning [122] is the practice of optimizing the classification threshold tt for a given model to improve performance. Strictly speaking, the threshold constitutes a further HP that must be chosen carefully. However, since the threshold can be varied freely after a model has been built, it does not need to be tuned jointly with the remaining HPs and can be optimized in a separate, subsequent, and cheaper step for each proposed HPC 𝝀+\bm{\lambda}^{+}.

After models have been fitted and predictions for all test sets have been obtained when c(𝝀+)c(\bm{\lambda}^{+}) is computed via resampling, the vector of joint test set scores 𝑭~\tilde{\bm{F}} can be compared against the joint vector of test set labels 𝐲\mathbf{y} to find an optimal threshold t^=argmintρ(𝐲~,[𝑭~t])\hat{t}=\operatorname*{arg\,min}_{t}\rho(\tilde{\mathbf{y}},[\tilde{\bm{F}}\geq t]), where the Iverson bracket is evaluated component-wise and t[0,1]t\in[0,1] for a binary probabilistic classifier and tt\in\mathds{R} for a score classifier. Since tt is scalar and evaluations are cheap, t^\hat{t} can be found easily via a line search. This two-step approach ensures that every HPC is coupled with its optimal threshold. c(𝝀+)c(\bm{\lambda}^{+}) is then defined as the optimal performance value for 𝝀+\bm{\lambda}^{+} in combination with t^\hat{t}.

The procedure can be generalized to multi-class classification. Class probabilities or scores π^k(𝐱),k=1,,g\hat{\pi}_{k}(\mathbf{x}),k=1,\ldots,g are divided by threshold weights wkw_{k}, k=1,,gk=1,\ldots,g, and the kk that yields the maximal π^k(𝐱)wk\frac{\hat{\pi}_{k}(\mathbf{x})}{w_{k}} is chosen as the predicted class label. The weights wk,k=1,,gw_{k},k=1,\ldots,g are optimized in the same way as tt in the binary case. Generally, threshold tuning can be performed jointly with any HPO algorithm. In practice, threshold tuning is implemented as a post-processing step of an ML pipeline (Sections 5.1 & 5.2).
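A minimal sketch of threshold tuning as a cheap post-processing step is given below; the function name and toy data are hypothetical, and the grid-based line search maximizes a label-based metric (here F1), which corresponds to minimizing the associated loss ρ.

import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(y_true, scores, metric, grid_size=200):
    # line search over candidate thresholds on the pooled validation scores F~
    candidates = np.linspace(scores.min(), scores.max(), grid_size)
    values = [metric(y_true, (scores >= t).astype(int)) for t in candidates]
    best = int(np.argmax(values))
    return candidates[best], values[best]

# toy usage: scores of the positive class are shifted upwards
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
scores = rng.normal(loc=y * 0.8, scale=1.0)       # imperfect scoring model
t_hat, best_f1 = tune_threshold(y, scores, f1_score)
print(t_hat, best_f1)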

5 Pipelining, Preprocessing, and AutoML

ML typically involves several data transformation steps before a learner can be trained. If one or multiple preprocessing steps are executed successively, the data flows through a linear graph, also called pipeline. Subsection 5.1 explains why a pipeline forms an ML algorithm itself, and why its performance should be evaluated accordingly through resampling. Finally, Subsection 5.2 introduces the concept of flexible pipelines via hierarchical spaces.

5.1 Linear Pipelines

In the following, we will extend the HPO problem for a specific learner towards configuring a full pipeline including preprocessing. A linear ML pipeline is a succession of preprocessing methods followed by a learner at the end, all arranged as nodes in a linear graph. Each node acts in a very similar manner as the learner, and has an associated training and prediction procedure. During training, each node learns its parameters (based on the training data) and sends a potentially transformed version of the training data to its successor. Afterwards, a pipeline can be used to obtain predictions: The nodes operate on new data according to their model parameters. Figure 11 shows a simple example.

Figure 11: Example of a linear pipeline, including node parameters.

Pipelines are important to properly embed the full model building procedure, including preprocessing, into cross-validation, so every aspect of the model is only inferred from the training data. This is necessary to avoid overfitting and biased performance evaluation [21, 63], as it is for basic ML. As each node represents a configurable piece of code, each node can have HPs, and the HPs of the pipeline are simply the joint set of all HPs of its contained nodes. Therefore, we can model the whole pipeline as a single HPO problem with the combined search space Λ~=Λ~op,1××Λ~op,k×Λ~\tilde{\Lambda}=\tilde{\Lambda}_{\text{op},1}\times\cdots\times\tilde{\Lambda}_{\text{op},k}\times\tilde{\Lambda}_{\mathcal{I}}.
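As an illustration of a jointly tuned linear pipeline, the following scikit-learn sketch (with an arbitrarily chosen preprocessing chain and search space, not a recommendation of specific ranges) combines the HPs of all nodes into a single search space and optimizes them together:

from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from scipy.stats import loguniform

X, y = load_breast_cancer(return_X_y=True)

# linear pipeline: scaling -> PCA -> SVM; every node is fitted on training
# data only within each resampling split, avoiding information leakage
pipe = Pipeline([("scale", StandardScaler()), ("pca", PCA()), ("svm", SVC())])

# joint search space over preprocessing and learner, addressed via <node>__<HP>
search_space = {
    "pca__n_components": [2, 5, 10, 20],
    "svm__C": loguniform(1e-3, 1e3),
    "svm__gamma": loguniform(1e-4, 1e1),
}
tuner = RandomizedSearchCV(pipe, search_space, n_iter=30, cv=5, random_state=0)
tuner.fit(X, y)
print(tuner.best_params_)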

5.2 Operator Selection and AutoML

More flexible pipelining, and especially the selection of appropriate nodes in a data-driven manner via HPO, can be achieved by representing our pipeline as a directed acyclic graph. Usually, this implies a single source node that accepts the original data and a single sink node that returns the predictions. Each node represents a preprocessing operation, a learner, a postprocessing operation, or a directive operation that directs how the data is passed to the child node(s).

One instance of such a pipeline is illustrated in Figure 12.

Figure 12: Example for a graph pipeline with operator selection via branches.

Here, we consider the choice between multiple mutually exclusive preprocessing steps, as well as the choice between different ML algorithms. Such a choice is represented by a branching operator, which can be configured through a categorical parameter and can determine the flow of the data, resulting in multiple "modeling paths" in our graph.

The HP space induced by a flexible pipeline is potentially more complex. Depending on the setting of the branching HP, different nodes and therefore different HPs are active, resulting in a highly hierarchical search space. If we build a graph that includes a sufficiently large selection of preprocessing steps combined with sufficiently many ML models, the result can be flexible enough to work well on a large number of data sets – assuming it is correctly configured in a data-dependent manner. Combining such a graph with an efficient tuner is the key principle of AutoML [47, 100, 104].
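The following sketch shows a minimal branching choice with scikit-learn (the specific learners and grids are arbitrary illustrations): the final pipeline step acts as the categorical branching HP, and the HPs listed in each sub-grid are only active when the corresponding learner is selected.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# the "branch": the final pipeline step is itself a tunable categorical choice
pipe = Pipeline([("scale", StandardScaler()), ("model", SVC())])
search_space = [
    {"model": [SVC()],
     "model__C": [0.1, 1, 10], "model__gamma": [0.01, 0.1]},
    {"model": [RandomForestClassifier(random_state=0)],
     "model__n_estimators": [100, 500], "model__max_features": ["sqrt", 0.5]},
]
tuner = GridSearchCV(pipe, search_space, cv=5)
tuner.fit(X, y)
print(tuner.best_params_)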

6 Practical Aspects of HPO

In this section, we discuss practical aspects of HPO, which are more qualitative in nature and less often discussed in an academic context. Some of these recommendations are more "rules of thumb", based solely on experience, while others are at least partially confirmed by empirical benchmarks – keeping in mind that empirical benchmarks are only as good as the selection of data sets. Even if a proper empirical study cannot be cited for every piece of advice in this section, we still propose that such a compilation is highly valuable for HPO beginners.

6.1 Choosing Resampling and Performance Metrics

A resampling procedure is usually chosen based on two fundamental properties of the available data: (i) the number of observations, i.e., to what degree do we face a small sample size situation, and (ii) whether the i.i.d. assumption for our data sampling process is violated.

For smaller data sets, e.g., n<500n<500, repeated CV with a high number of repetitions should be used to reduce the variance while keeping the pessimistic bias small [21]. The larger the data set, the fewer splits are necessary. Consequently, for data sets of "medium" size with 500n50000500\leq n\leq 50000, usually 5- or 10-fold CV is recommended. Beyond that, simple holdout might be sufficient. Note that even for large nn, sample size problems can occur, for example, when data is imbalanced. In that case, repeated resampling might still be required to obtain sufficiently accurate performance estimates. Stratified sampling, which ensures that the relative class frequencies for each train/test split are consistent with the original data set, helps in such a case.
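These rules of thumb can be encoded in a small helper; the following sketch (thresholds taken from the text, the concrete resampling objects being one possible choice in scikit-learn) is meant as an illustration rather than a fixed recipe.

from sklearn.model_selection import KFold, RepeatedStratifiedKFold, ShuffleSplit

def choose_resampling(n, imbalanced=False):
    # heuristic sketch of the rules of thumb above; thresholds are approximate
    if n < 500 or imbalanced:
        # small or imbalanced data: repeated (stratified) CV to reduce variance
        return RepeatedStratifiedKFold(n_splits=10, n_repeats=10)
    if n <= 50000:
        # medium-sized data: 5- or 10-fold CV is usually enough
        return KFold(n_splits=5, shuffle=True, random_state=0)
    # large data: a simple holdout split is often sufficient
    return ShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

print(choose_resampling(300))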

One fundamental assumption about our data is that observations are i.i.d., i.e., (𝐱(i),y(i))i.i.dxy.\left(\mathbf{x}^{(i)},y^{(i)}\right)\overset{i.i.d}{\sim}\mathds{P}_{xy}. This assumption is often violated in practice. A typical example is repeated measurements, where observations occur in "blocks" of multiple, correlated data, e.g., from different hospitals, cities or persons. In such a scenario, we are usually interested in the ability of the model to generalize to new blocks. We must then perform CV with respect to the blocks, e.g., "leave one block out". A related problem occurs if data are collected sequentially over a period of time. In such a setting, we are usually interested in how the model will generalize in the future, and the rolling or expanding window forecast must be used for evaluation [9] instead of regular CV. However, discussing these special cases is out of scope for this work.

In HPO, resampling strategies must be specified for the inner as well as the outer level of nested resampling. The outer level is simply regular ML evaluation, and all comments from above hold. We advise readers to study further material such as [72]. The inner level concerns the evaluation of c(𝝀)c(\bm{\lambda}) through resampling during tuning. While the same general comments from above apply, in order to reduce runtime, repetitions can also be scaled down. We are not particularly interested in very accurate numerical performance estimates at the inner level, and we must only ensure that HPCs are properly ranked during tuning to achieve correct selection, as discussed in Section 4.4. Hence, it might be appropriate to use a 10-fold CV on the outside to ensure proper generalization error estimation of the tuned learner, but to use only 2 folds or simple holdout on the inside. In general, controlling the number of resampling repetitions on the inside should be considered an aspect of the tuner and should probably be automated away from the user (without taking away flexible control in cases of i.i.d. violations, or other deviations from standard scenarios). However, not many current tuners provide this, although racing is one of the attractive exceptions.

The choice of the performance measure should be guided by the costs that suboptimal predictions by the model and subsequent actions in the real-world context of applying the model would incur. Often, popular but simple measures like accuracy do not meet this requirement. Misclassification of different classes can imply different costs. For example, failing to detect an illness may have a higher associated cost than mistakenly admitting a person to the hospital. There exists a plethora of performance measures that attempt to emphasize different aspects of misclassification with respect to prediction probabilities and class imbalances, cf. [72] and the many listed in Table 1 in Appendix B. For other applications, it might be necessary to design a performance measure from scratch or based on underlying business key performance indicators (KPIs).

While a further discussion of metrics is again out of scope for this article, two pieces of advice are pertinent. First, as HPO treats the objective as a black box, no real constraints exist regarding the mathematical properties of ρ\rho (or the associated outer loss). For first-level risk minimization, on the other hand, we usually require differentiability and convexity of LL; if this is not fulfilled, we must approximate the KPI with a more convenient version. Second, for many applications, it is quite unclear whether there is a single metric that captures all aspects of model quality in a balanced manner. In such cases, it can be preferable to optimize multiple measures simultaneously, resulting in a multi-criteria optimization problem [62].

6.2 Choosing a Pipeline and Corresponding Search Space

For HPO, it is necessary to define the search space Λ~\tilde{\Lambda} over which the optimization is to be performed. This choice can have a large impact on the performance of the tuned model. A simple search space is a (lower dimensional) Cartesian product of individual HP sets that are either numeric (continuous or integer-valued) or categorical. Encoding categorical values as integers is a common mistake that degrades the performance of optimizers that rely on information about distances between HPCs, such as BO. The search intervals of numeric HPs typically must be bounded within a region of plausibly well-performing values for the given method and data set.

Numeric HPs are often either bounded in a closed interval (e.g., [0,1][0,1]) or only bounded from below (e.g., [0,)[0,\infty)). The former can usually be tuned without modifications. The latter should often be tuned on a logarithmic scale with a generous upper bound, as the influence of larger values often diminishes. For example, the decision whether kk-NN uses k=2k=2 vs. 3 neighbors will have a larger impact than whether it uses k=102k=102 vs. k=103k=103 neighbors. The logarithmic scale can either be defined in the tuning software or must be set up manually by adjusting the algorithm to use transformations: If the desired range of the HP is [a,b][a,b], the tuner optimizes on [loga,logb][\log a,\log b], and any proposed value is transformed through an exponential function before it is passed to the ML algorithm. The logarithm and exponentiation must refer to the same base here, but which base is chosen does not influence the tuning process.
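A minimal sketch of this manual log-scale transformation follows (the function name and ranges are illustrative assumptions): the tuner samples uniformly on the log scale, and values are exponentiated before being passed to the learner.

import numpy as np

def sample_log_uniform(lower, upper, size, rng):
    # the tuner works on [log(lower), log(upper)]; results are exponentiated
    # before being passed to the ML algorithm, so the base cancels out
    return np.exp(rng.uniform(np.log(lower), np.log(upper), size=size))

rng = np.random.default_rng(0)
# e.g., a regularization strength searched in [1e-6, 1e2] on a log scale
print(np.sort(sample_log_uniform(1e-6, 1e2, size=5, rng=rng)))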

The size of the search space will also considerably influence the quality of the resulting model and the necessary budget of HPO. If chosen too small, the search space may not contain a particularly well-performing HPC. Choosing too wide HP intervals or including inadequate HPs in the search space can have an adverse effect on tuning outcomes in the given budget. If Λ~\tilde{\Lambda} is simply too large, it is more difficult for the optimizer to find the optimum or promising regions within the given budget. Furthermore, restricting the bounds of an HP may be beneficial to avoid values that are a priori known to cause problems due to unstable behavior or large resource consumption. If multiple HPs lead to poor performance throughout a large part of their range – for example, by resulting in a degenerate ML model or a software crash – the fraction of the search space that leads to fair performance then shrinks exponentially in the number of HPs with this problem. This effect can be viewed as a further manifestation of the so-called curse of dimensionality.

Due to this curse of dimensionality and the considerable runtime costs of HPO, we would like to tune as few HPs as possible. If no prior knowledge from earlier experiments or expert knowledge exists, it is common practice to leave other HPs at their software default values, with the assumption that the developers of the algorithm chose values that work well under a wide range of conditions – which is not necessarily the case, and it is often not documented how these defaults were chosen. Recent approaches have studied how to empirically find optimal default values, tuning ranges and HPC prior distributions based on extensive meta-data [142, 141, 47, 109, 138, 251, 108, 53].

It is possible to optimize several learning algorithms in combination, as described in Section 5.2, but this introduces HP dependencies. The question then arises of which of the large number of ML algorithms (or preprocessing operations) should be considered. However, [193] showed that in many cases, only a small but diverse set of learners is necessary to choose one ML algorithm that performs sufficiently well.

6.3 Choosing an HPO Algorithm

The number of HPs considered for optimization has a large influence on the difficulty of an HPO problem. Particularly large search spaces typically arise from optimizing over large pipelines or multiple learners (Section 5). With very few HPs (up to about 2-3) and well-functioning discretization, GS may be useful due to its interpretable, deterministic, and reproducible nature; however, it is not recommended beyond this [11]. BO with GPs works well for up to around 10 HPs. However, more HPs typically require more function evaluations – which in turn is problematic runtime-wise, since fitting a GP scales cubically with the number of evaluations. On the other hand, BO with RFs has been used successfully on search spaces with hundreds of HPs [135] and can usually handle mixed hierarchical search spaces better. Pure sampling-based methods, such as RS and Hyperband, work well even for very large HP spaces as long as the "effective" dimension (i.e., the number of HPs that have a large impact on performance) is low, which is often observed in ML models [11]. Evolutionary algorithms (and those using similar metaheuristics, such as racing) can also work with truly large search spaces, and even with search spaces of arbitrarily complex structure, if one is willing to use custom mutation and crossover operators. Evolutionary algorithms may also require fewer objective evaluations than RS. Therefore, they occupy a middle ground between (highly complex, sample-efficient) BO and (very simple but possibly wasteful) RS.

Another property of algorithms that is especially relevant to practitioners with access to large computational resources is parallelizability, which is discussed in Subsection 6.7.2. Furthermore, HPO algorithms differ in their simplicity, both in terms of algorithmic principle and in terms of usability of available implementations, which can often have implications for their usefulness in practice. While more complex optimization methods, such as those based on BO, are often more sample-efficient or have other desirable properties compared to simple methods, they also have more components that can fail. When performance evaluations are cheap and the search space is small, it may therefore be beneficial to fall back on simple and robust approaches such as RS, Hyperband, or any tuner with minimal inference overhead. The availability of any implementation at all (and the quality of that implementation) is also important; there may be optimization algorithms based on beautiful theoretical principles that have a poorly maintained implementation or are based on out-of-date software. The additional cost of having to port an algorithm to the software platform being used, or even implement it from scratch, could be spent on running more evaluations with an existing algorithm.

One might wish to select an HPO algorithm that performed best on previous benchmarks. However, no single benchmark exists which includes all relevant scenarios and whose results generalize to all possible applications. Specific benchmark results can therefore only indicate how well an algorithm works for a selected set of data sets, a predefined budget, specific parallelization, specific learners, and search spaces. Even worse, extensive comparison studies are missing in the current literature, although efforts have been made to establish unified benchmarks. [35] showed that (i) BO with GPs were a strong approach for small continuous spaces with few evaluations, (ii) BO with Random Forests performed well for mixed spaces with a few hundred evaluations, and (iii) for large spaces and many evaluations, ES were the best optimizers.

6.4 Choosing an Implementation

In addition to the choice of algorithm, the choice of implementation is also important in practice. We note that a thorough comparison of all available software packages and frameworks is not within the scope of this article. Nevertheless, we provide a brief list of popular implementations and frameworks as reasonable starting points for newcomers.

Simple and established tuning algorithms for HPO are usually shipped with general-purpose ML frameworks such as Scikit-learn [248], TensorFlow (https://www.tensorflow.org/), PyTorch [247], MXNet [176], mlr3 [226, 163], tidymodels (https://www.tidymodels.org/) or h2o.ai (https://github.com/h2oai/h2o-3). Although modern state-of-the-art algorithms often build on, extend, or connect to such an ML framework, they are usually developed in independent software projects.

For Python, there exists a plethora of HPO toolkits, e.g., Spearmint [262], SMAC (https://github.com/automl/SMAC3) [213], BoTorch [158], Dragonfly [77], or Oríon (https://github.com/Epistimio/orion). Multiple HPO methods are supported by toolkits like Hyperopt [160], Optuna [153] or Weights & Biases (https://www.wandb.com/). A popular framework that combines modern HPO approaches with the Scikit-learn toolbox for ML in Python is auto-sklearn [194].

The availability of HPO implementations in R is comparably smaller. However, more established HPO tuners are either already shipped with the ML framework or can be found, e.g., in the packages mlrMBO (https://github.com/mlr-org/mlrMBO), irace [234], DiceOptim [254], or rBayesianOptimization (https://github.com/yanyachen/rBayesianOptimization).

See Appendix C for more information.

6.5 When to Terminate HPO

Choosing an appropriate budget for HPO or dynamically terminating HPO based on collected archive data is a difficult practical challenge that is not readily solved. The user typically has these options: (i) A certain amount of runtime is specified before HPO is started, solely based on practical considerations and intuition. This is what currently happens nearly exclusively in practice. A simple rule of thumb might be scaling the budget of HPC evaluations linearly with search space dimensionality, e.g., 50×l50\times l or 100×l100\times l, which would also define the number of full budget units in a multi-fidelity setup. The downside is that while it is usually simple to define when a result is needed, it is nearly impossible to determine whether this computational budget is enough to solve the HPO problem reasonably well. (ii) The user can set a lower bound regarding the required generalization error, i.e., what model quality is deemed good enough in order to put the resulting model to practical use. Notably, if this bound is set too optimistically, the HPO process might never terminate or simply take too long. (iii) If no (significant) progress is observed for a certain amount of time during tuning, HPO is considered to have converged or stagnated and is stopped. This criterion bears the risk of terminating too early, especially for large and complex search spaces. Furthermore, this approach is conceptually problematic, as it mainly makes sense for iterative, local optimizers. No HPO algorithm from this article, except for maybe ES, belongs to this category, as they all contain a strong exploratory, global search component. This often implies that search performance can flatline for a longer time and then abruptly change. (iv) For BO, a small maximum value of the EI\operatorname{EI} acquisition function (see Eq. 11) indicates that further progress is estimated to be unlikely [76]. This criterion could even be combined with other, non-BO tuners. It implies that the surrogate model is trustworthy enough for such an estimate, including its local uncertainty estimate, which is somewhat unlikely for large, complex search spaces. In practice, it is probably prudent to set up a combination of all mentioned criteria (if possible) as an "and" or "or" combination, with somewhat liberal thresholds to continue the optimization for (ii), (iii), and (iv), and an absolute bound on the maximal runtime for (i).
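The combination of criteria can be sketched as a simple check queried after every evaluation; the function below is a hypothetical illustration combining criteria (i)-(iii) with an "or", with all thresholds chosen by the user.

import time

def should_terminate(start_time, archive, max_runtime_s=3600,
                     good_enough=None, stagnation_window=50):
    # archive: observed c(λ) values in evaluation order (lower is better)
    if time.time() - start_time > max_runtime_s:                            # (i)
        return True
    if good_enough is not None and archive and min(archive) <= good_enough: # (ii)
        return True
    if len(archive) > stagnation_window:                                    # (iii)
        best_before = min(archive[:-stagnation_window])
        if min(archive[-stagnation_window:]) >= best_before:
            return True   # no improvement within the most recent window
    return False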

With many HPO tuners, it is possible to continue the optimization even after it has been terminated. One can even use (part of) the archive as an initialization of the next HPO run, e.g., as an initial design for BO or as the initial population for an ES.

Some methods, e.g., ES and under some circumstances BO, may get stuck in a subspace of the search space and fail to explore other regions. While some means of mitigation exist, such as interleaving proposed points with randomly generated points, there is always the possibility of re-starting the optimization from scratch multiple times, and selecting the best performance from the combined runs as the final result. The decision regarding the termination criterion is itself always a matter of cost-benefit calculation of potential increased performance against cost of additional evaluations. One should also consider the possibility that, for more local search strategies like ES, terminating early and restarting the optimization can be more cost-effective than letting a single optimization run continue for much longer. For more global samplers like BO and HB, it is much less clear whether and how such an efficient restart could be executed.

6.6 Warm-Starts

HPO may require substantial computational resources, as a large search space may require many performance evaluations, and individual performance evaluations can also be very computationally expensive. One way to reduce the computational time is warm-starting, where information from previous experiments is used as a starting solution.

Warm-Starting Evaluations

Certain model classes may offer specific methods that reduce the computational resources needed for training a single model by transferring model parameters from other, similar configurations that were already trained. NNs with similar architectures can be initialized from trained networks to reduce training time – a concept known as weight sharing. Some neural architecture search algorithms are specifically designed to make use of this fact [187].

Warm-Starting Optimization

Many HPO algorithms do not take any input regarding the relative merit of different HPCs beyond box-constraints on the search space Λ~\tilde{\Lambda}. However, it is often the case that some HPCs work relatively well on a large variety of data sets [251]. At other times, HPO may have been performed on other problems that are similar to the problem currently being considered. In both cases, it is possible to warm-start the HPO process itself – for example, by choosing the initial design as HPCs that have performed well in the past or by choosing HPCs that are known to perform well in general [91]. This can also be regarded as a transfer learning mechanism for HPO. Large systematic collections of HPC performance on different data sets, such as those collected in [17] or on OpenML [137], can be used to build these initial designs [109, 47].
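A minimal sketch of warm-starting the initial design (all names and the archive format are illustrative assumptions): the best HPCs from a previous, related tuning run seed the design, and the remainder is filled with random samples.

import numpy as np

def warm_start_design(previous_archive, n_init, sample_random, k_best=5):
    # previous_archive: list of (hpc, cost) pairs from an earlier tuning run;
    # the k_best cheapest HPCs seed the design, the rest is sampled randomly
    ranked = sorted(previous_archive, key=lambda pair: pair[1])
    seeds = [hpc for hpc, _ in ranked[:k_best]]
    return seeds + [sample_random() for _ in range(max(0, n_init - len(seeds)))]

rng = np.random.default_rng(0)
previous = [({"C": 1.0}, 0.12), ({"C": 0.1}, 0.20), ({"C": 50.0}, 0.09)]  # hypothetical
design = warm_start_design(previous, n_init=8,
                           sample_random=lambda: {"C": 10 ** rng.uniform(-3, 3)})
print(design)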

6.7 Control of Execution

HP tuning is usually computationally expensive, and running optimization in parallel is one way to reduce the overall time needed for a satisfactory result. During parallelization, computational jobs (which can be arbitrary function calls) are mapped to a pool of workers (usually distributed among individual physical CPU cores, or remote computational nodes). In practice, a parallelization backend orchestrates starting up, shutting down, and communication with the workers. The achieved speedup can be proportional to the number of workers in ideal cases, but less than linear scaling is typically observed, depending on the algorithm used (Amdahl’s law, [116]). Besides parallelization, ensuring a fail-safe program flow is often also mandatory, and requires special care.

6.7.1 Job Hierarchy for HPO with Nested Resampling

HPO can be parallelized at different granularity or parallelization levels. These levels result from the nested loop outlined in Algorithm 3 and described in detail in the following (from coarse to fine granularity):

1 foreach outer resampling iteration do
2       Initialize archive 𝒜={}\mathcal{A}=\{\};
3       foreach tuning iteration do
4             {𝝀+,1,,𝝀+,nbatch}\{\bm{\lambda}^{+,1},\ldots,\bm{\lambda}^{+,n_{\text{batch}}}\} = propose_points(𝒜\mathcal{A});
5             foreach i{1,,nbatch}i\in\{1,\ldots,n_{\text{batch}}\} do
6                   𝒄={}\bm{c}=\{\};
7                   foreach inner resampling iteration do
8                         𝒄=𝒄eval(𝝀+,i\bm{c}=\bm{c}\cup\operatorname{eval}(\bm{\lambda}^{+,i}, 𝒟inner_train,𝒟inner_test)\mathcal{D}_{\text{inner\_train}},\mathcal{D}_{\text{inner\_test}});
9                        
10                   end foreach
11                  𝒜\mathcal{A} = 𝒜(𝝀+,i,agr(𝒄))\mathcal{A}\cup(\bm{\lambda}^{+,i},\mathrm{agr}(\bm{c}));
12                  
13             end foreach
14            
15       end foreach
16      𝝀\bm{\lambda}^{*} = get_best(𝒜\mathcal{A});
17       eval(𝝀\bm{\lambda}^{*}, 𝒟train,𝒟test\mathcal{D}_{\text{train}},\mathcal{D}_{\text{test}});
18      
19 end foreach
Algorithm 3 Nested resampling as nested loops.
  (a) One iteration of the outer resampling loop (Line 1), i.e., tuning an ML algorithm on the respective training set of the outer resampling split. The result is the outer test set performance of the best HPC found and trained on the outer training set.

  (b) The execution of one tuning iteration (Line 3), i.e., the proposal and evaluation of a batch of new HPCs. The result is an updated optimization archive.

  (c) One evaluation of a single proposed HPC (Line 5), i.e., one inner resampling with configuration 𝝀+,i\bm{\lambda}^{+,i}. The result is an aggregated performance score.

  (d) One iteration of the inner resampling loop (Line 7). The result is the performance score for a single inner train / test split.

  (e) The model fit for the evaluation of the model itself is sometimes also parallelizable (Lines 8 and 17). For example, the individual trees of a random forest can be fitted independently and as such are an obvious target for parallelization.

Note that the kinnerk_{\text{inner}} resampling iterations created in (d) are independent between the nbatchn_{\text{batch}} HPC evaluations created in (c). Therefore, they form nbatchkinnern_{\text{batch}}\cdot k_{\text{inner}} independent jobs that can be executed in parallel.

An HPO problem can now be distributed to the workers in several ways. For example, if one wants to perform a 10-fold (outer) CV of BO that proposes one point per tuning iteration, with a budget of 50 HP evaluations done via a 3-fold CV (inner resampling), one can decide to (i) spawn 10 parallel jobs (level a) with a long runtime once, or to (ii) spawn 3 parallel jobs (level d) with a short runtime 50 times.

Consider another example with an ES running for 50 generations with an offspring size of 20. This translates to 50 tuning iterations with 20 HPC proposals per iteration. If this is evaluated using nested CV, with 10 outer and 3 inner resampling loops, then the parallelization options are: (i) spawn 10 parallel jobs with a long runtime once (level a), (ii) spawn 20 parallel jobs (level c) 105010\cdot 50 times with a medium runtime, (iii) spawn 3 parallel jobs (level d) with a short runtime 10502010\cdot 50\cdot 20 times, or (iv) spawn 203=6020\cdot 3=60 parallel jobs (level d and c combined) with a short runtime 105010\cdot 50 times.
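Since the jobs on levels (c) and (d) are independent, they can be distributed with any standard parallelization backend; the following sketch uses Python's concurrent.futures with a toy stand-in for the inner train/test evaluation (evaluate_hpc_on_fold is a hypothetical placeholder, not a library function).

from concurrent.futures import ProcessPoolExecutor
from itertools import product
import math

def evaluate_hpc_on_fold(hpc, fold):
    # toy stand-in for one inner train/test evaluation of a single HPC
    return (math.log10(hpc["lr"]) + 2) ** 2 + 0.01 * fold

def evaluate_job(args):
    hpc, fold = args
    return hpc, fold, evaluate_hpc_on_fold(hpc, fold)

def parallel_batch_evaluation(hpcs, folds, n_workers=4):
    # the n_batch * k_inner jobs (levels c and d combined) are independent
    jobs = list(product(hpcs, folds))
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(evaluate_job, jobs))

if __name__ == "__main__":
    batch = [{"lr": 10 ** -e} for e in range(4)]              # n_batch = 4 HPCs
    print(parallel_batch_evaluation(batch, folds=[0, 1, 2]))  # k_inner = 3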

6.7.2 Parallelizability

These examples demonstrate that the choice of the parallelization level also depends on the choice of the tuner. Parallelization for RS and GS is so straightforward that it is also called "embarrassingly parallel", since all HPCs to be evaluated can be determined in advance as they have only one (trivial) tuning iteration (level b). Algorithms based on iterations over batch proposals (ES, BO, IR, Hyperband) are limited in how many parallel workers they can use and profit from: they can be parallelized up to their (current) batch size, which decreases for HB and IR during a bracket/race. The situation looks very different for BO; by construction, it is a sequential algorithm with a batch size of 1. Multi-point proposals (see 4.2.3) must be used to parallelize multiple configurations in each iteration (level c). However, parallelization does not scale very well at this level. If possible, a different parallelization level should be chosen to achieve a similar speedup but avoid the problems of multi-point proposal.

Therefore, if a truly large number of parallel resources is available, then the best optimization method may be RS due to its relatively boundless parallelizability and low synchronization overhead. With fewer parallel resources, more efficient algorithms have a larger relative advantage.

6.7.3 Parallelization Tweaks

The more jobs generated, the more workers can be utilized in general, and the better the tuning tends to scale with available computational resources. However, there are some caveats and technical difficulties to consider. First, the process of starting a job, communicating the inputs to the worker, and communicating the results back often comes with considerable overhead, depending on the parallelization backend used. To maximize utilization and minimize overhead, some backends therefore allow chunking multiple jobs together into groups that are calculated sequentially on the workers. Second, whenever a batch of parallel jobs is started, the main process must wait for all of them to finish, which leads to synchronization overhead if nodes that finish early are left idling because they wait for longer running jobs to finish. This can be a particular problem for HPO, where it is likely that the jobs have heterogeneous runtimes. For example, fitting a boosting model with 10 iterations will be significantly faster than fitting a boosting model with 10,000 iterations. Unwittingly chunking many jobs together can exacerbate synchronization overhead and lead to a situation where most workers idle and wait for one worker to finish a relatively large chunk of jobs. While there are approaches to mitigate such problems, as briefly discussed in Section 4.2.3 for BO, it is often sufficient – as a rule of thumb – to aim for as many jobs as possible, as long as each individual job has an expected average runtime of \geq 5 minutes. Additionally, randomizing the order of jobs can increase the utilization of search strategies such as GS.

Budget and Overhead

For an unbiased comparison, it is mandatory that all tuners are granted the same budget. However, if the budget is defined by a fixed number of evaluations, it is unclear how to count multi-fidelity evaluations and how to compare them against "regular" evaluations. We recommend comparing evaluations by wall-clock time and explicitly reporting the overhead for model inference and point proposal.

Anytime Behavior

In practice, it is unclear how long tuners will and should be run. Thus, when benchmarking tuners, the performance of the optimization process should be considered at many stages. To achieve this objective, one should not (only) compare the final performance at the maximum budget, but rather compare the whole optimization traces from initialization to termination. Furthermore, tuning runs must be terminated if the wall-clock time is exceeded, which may leave the tuner in an undefined state and incapable of extracting a final result.

Parallelization

The effect of parallelization must be factored in, at least if the budget is defined by available wall-clock time. While some tuners scale linearly with the number of workers without a degradation in performance (e.g., GS), other tuners (e.g., BO) show diminishing returns from excessive parallelism. A fair comparison w.r.t. a time budget is also hampered by technical aspects like cache interference or the difficult-to-control parallelization of the ML algorithms to tune.

6.8 Simple Sequential Approach to Model Selection

In practical model selection, it is often the case that practitioners care not only about predictive performance, but also about other aspects of the model such as its simplicity and interpretability. If a practitioner considers only a small number of different model classes, then a practical alternative to AutoML on a full pipeline space of different ML models is to construct a manual sequence of preferred models of increasing complexity, to tune them separately, and to compare their performance. An informed choice can then be made about how much additional complexity is acceptable for a certain amount of improved predictive performance. [57] suggest the "one-standard-error rule", i.e., to use the simplest model within one standard error of the performance of the best model. The specific sequence of learners will differ for each use case but might look something like this: (1) linear model with only a single feature or tree stump, (2) L1/L2-regularized linear model or a CART tree, (3) component-wise boosting of a GAM, (4) random forest, (5) gradient tree boosting, and (6) full AutoML search space with pipelines. If even more focus is put on predictive performance, stacking a diverse set of models (possibly multi-level) can give a final (often rather small) boost [143].

6.9 Benchmarking HPO Algorithms

While few well-designed large-scale benchmarks of HPO exist [81, 36, 35], the community is still working towards better and more comprehensive studies. Usually, tuners are compared on fixed sets of data sets and search spaces, or on cheaper approximations via empirical performance models / surrogates [36]. It is still unclear what an ideal and representative benchmark set for HPO that also enables efficient benchmarking might look like.

7 Related Problems

Besides HPO, there are many related scientific fields that face very similar problems on an abstract level and nowadays resort to techniques that are very similar to those described in this work. Although detailed coverage of these related areas is out of scope for this paper, we nevertheless briefly summarize and compare them to HPO.

Neural Architecture Search

A specific type of HPO is neural architecture search (NAS) [187], where the task is to find a well-performing architecture of a deep NN for a given data set. Although NAS can also be formulated as an HPO problem [e.g., 151], it is usually approached as a bi-level optimization problem that can be solved simultaneously while training the NN [93]. Although it is common practice to optimize architecture and HPCs in sequence, recent evidence suggests that they should be jointly optimized [148, 92].

Algorithm Selection and Traditional Meta-Learning

In HPO, we actively search for a well-performing HPC through iterative optimization. In the related problem of algorithm selection [115], we usually train a meta-learning model offline to select an optimal element from a finite set of algorithms or HPCs [22]. This model can be trained on empirical meta-data of the candidates’ performances on a large set of problem instances or data sets. While this allows an instantaneous prediction of a model class and/or a configuration for a new data set without any time investment, these meta-models are rarely sufficiently accurate in practice, and modern approaches use them mainly to warmstart HPO [48] in a hybrid manner.

Algorithm Configuration

Algorithm configuration (AC) is the general task of searching for a well-performing parameter configuration of an arbitrary algorithm for a given, finite, arbitrary set of problem instances. We assume that the performance of a configuration can be quantified by some scoring metric for any given instance [66]. Usually, this performance can only be accessed empirically in a black-box manner and comes at the cost of running this algorithm. This is sometimes called offline configuration for a set of instances, and the obtained configuration is then kept fixed for all future problem instances from the same domain. Alternatively, algorithms can also be configured per instance. AC is much broader than HPO and has a particularly rich history regarding the configuration of discrete optimization solvers [144]. In HPO, we typically search for a well-performing HPC on a single data set, which can be seen as a case of per-instance configuration. The case of finding optimal pipelines or default configurations [109] across a larger domain of ML tasks is much closer to traditional AC across sets of instances. The field of AutoML, as introduced in Section 5, originally stems from AC and was initially formalized as the Combined Algorithm Selection and Hyperparameter Optimization (CASH) problem [135].

Dynamic Algorithm Configuration

In HPO (and AC), we assume that the HPC is chosen once for the entire training of an ML model. However, many HPs can actually be adapted while training. A well-known example is the learning rate of an NN optimizer, which might be adapted via heuristics [79] or pre-defined schedules [95]; however, both are potentially sub-optimal. A more adaptive view on HPO is called dynamic algorithm configuration [15]. This allows a policy to be learned (e.g., based on reinforcement learning) that maps a state of the learner to an HPC, e.g., the learning rate [31]. We note that dynamic algorithm configuration is a generalization of algorithm selection and configuration (and thus HPO) [127].
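
As a small illustration of a pre-defined schedule in the spirit of [95], the following sketch (parameter names and values are our own choices) computes a cosine-annealing learning rate with warm restarts; note that the parameters of such a schedule become HPs themselves, whereas dynamic algorithm configuration would instead learn a policy that sets the learning rate from the current training state:

```python
# Illustrative cosine-annealing schedule with warm restarts for the learning rate.
import math

def cosine_warm_restart_lr(step, eta_min=1e-4, eta_max=1e-1, cycle_len=100):
    """Learning rate at a given training step; the schedule restarts every cycle_len steps."""
    t = step % cycle_len
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / cycle_len))

print([round(cosine_warm_restart_lr(s), 4) for s in (0, 25, 50, 99, 100)])
```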

Learning to Learn and to Optimize

[27] and [88] considered methods beyond simple HPO for fixed hyperparameters, and proposed replacing entire components of learners and optimizers. For example, [27] proposed using an LSTM to learn how to update the weights of an NN. In contrast, [88] applied reinforcement learning to learn where to sample next in black-box optimization such as HPO. Both of these approaches attempt to learn these meta-learners on diverse sets of applications. While it is already non-trivial to select an HPO algorithm for a given problem, such meta-learning approaches face the even greater challenge of generalizing to new (previously unseen) tasks.

8 Conclusion and Open Challenges

This work has sought to provide an overview of the fundamental concepts and algorithms for HPO in ML. Although HPO can be applied to a single ML algorithm, state-of-the-art systems typically optimize the entire predictive pipeline – including pre-processing, model selection, and post-processing. By using a structured search space, even such complex optimization tasks can be handled efficiently by HPO techniques. This final section outlines several open challenges in HPO that we deem particularly important.

General vs. Narrow HPO Frameworks

For HPO tools, there is a general trade-off between handling many tasks (such as auto-sklearn [47]) and specializing in a few, narrowly focused tasks. The former has the advantage that it can be applied more flexibly, but comes with the disadvantages that (i) it requires more development time to set up initially and (ii) the search space is larger, so that its efficiency might be sub-optimal compared to a task-specific approach. The latter has the advantage that a more specialized search space can lead to higher efficiency on a specific task, but might not be applicable to a different task. When a specialized tool for an HPO problem can be found, it is often preferable to a generalized tool.

Interactive HPO

It is still unclear how HPO tools can be fully integrated into the exploratory and prototype-driven workflow of data scientists and ML experts. On the one hand, it is desirable to support these experts in tuning HPs to relieve them of this tedious and error-prone task. On the other hand, this can lead to lengthy periods of waiting that interrupt the workflow of the data scientist, ranging from a few minutes for smaller data sets and simpler search spaces to hours or even days for large-scale data and complex pipelines. Therefore, the open problem remains of how HPO approaches can be designed so that users can monitor progress in an efficient and transparent manner, and how interaction with a running HPO tool can be enabled in case the desired progress or solutions are not achieved.

HPO for Deep Learning and Reinforcement Learning

For many applications of reinforcement learning and deep learning, especially in the field of natural language processing and large computer vision models, expensive training often prohibits several training runs with different HPCs. Iterative HPO for such computationally extremely expensive tasks is infeasible even with efficient approaches like BO and multi-fidelity variants. There are three ways to address this issue. First, gradient-based approaches can directly make use of gradient information on how to update HPs [96, 50, 134]. However, gradient-based HPO is only applicable to a few HPs and requires specific gradient information that is often only available in deep NNs. Second, some models do not have to be trained from scratch, e.g., by applying transfer learning or few-shot learning [49]. This allows cost-effective application of these models without incredibly expensive training. Using meta-learning techniques for HPO is an option and can be further improved if gradient-based HPO or NAS is feasible [51, 38]. Third, dynamic configuration approaches allow HPO to be applied on the fly during training. A very prominent example is population-based training (PBT) [87], which uses a population of training runs with different settings and applies ideas from evolutionary algorithms (especially mutation and tournament selection) from time to time while the model training makes progress. The disadvantage is that this method requires a substantial amount of computational resources for training several models in parallel. This cost can be reduced by combining PBT with ideas from bandits and BO [105].
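
The following highly simplified sketch illustrates the basic exploit/explore loop of PBT on a toy objective; it is our own illustrative pseudo-implementation and not the algorithm of [87] or [105]:

```python
# Toy sketch of population-based training: partial training, exploit, explore.
import random

def train_step(w, lr):
    """Toy 'training': one gradient step towards the minimum of f(w) = w^2."""
    return w - lr * 2 * w

def evaluate(w):
    return w ** 2  # lower is better

population = [{"w": random.uniform(-5, 5), "lr": 10 ** random.uniform(-3, 0)} for _ in range(8)]
for generation in range(20):
    for member in population:          # partial training with each member's current HPs
        for _ in range(5):
            member["w"] = train_step(member["w"], member["lr"])
    population.sort(key=lambda m: evaluate(m["w"]))
    for loser in population[-2:]:      # exploit: copy weights and HPs from a good member
        winner = random.choice(population[:2])
        loser["w"], loser["lr"] = winner["w"], winner["lr"]
        loser["lr"] *= random.choice([0.8, 1.25])  # explore: perturb the inherited HP
print("best member:", min(population, key=lambda m: evaluate(m["w"])))
```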

Overtuning and Regularization for HPO

As discussed in Section 4.4, long HPO runs can lead to biased performance estimators, which in the end can also lead to incorrect HPC selection. It seems plausible that this effect is increasingly exacerbated the smaller the data set and the fewer resampling iterations the user has configured. (Issues like class imbalance of a classification task and non-standard resampling and evaluation metrics can complicate this matter further.) It seems even more plausible that better testing of results (on even more separate validation data) can mitigate or resolve the issue, but this makes data usage even less efficient in HPO, as we are already operating with 3-way splits. HPO tools should probably control resampling more intelligently, e.g., by increasing the number of resampling folds the longer the tuning process runs. However, determining how to ideally set up such a schedule seems to be an under-explored issue.
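
One conceivable schedule (purely an assumption for illustration, not an established recommendation) would simply couple the number of cross-validation folds to the fraction of the tuning budget already spent:

```python
# Illustrative sketch: more resampling folds later in the run to stabilize estimates.
def n_folds(budget_used_fraction, min_folds=3, max_folds=10):
    """Number of CV folds as a function of the share of the tuning budget already used."""
    assert 0.0 <= budget_used_fraction <= 1.0
    return min_folds + round((max_folds - min_folds) * budget_used_fraction)

print([n_folds(f) for f in (0.0, 0.25, 0.5, 1.0)])  # grows from min_folds towards max_folds
```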

Interpretability of HPO Process

Many HPO approaches mainly return a well-performing HPC and leave users without insights into decisions of the optimization process. Not all data scientists trust the outcome of an AutoML system due to the lack of transparency [34]. Consequently, they might not deploy an AutoML model despite all performance gains. In addition, larger performance gains could potentially be generated (especially by avoiding meta-configuration problems of HPO) if a user is presented with a better understanding of the functional landscape of the objective c(𝝀), the sampling evolution of the HPO algorithm, the importance of HPs, or their effects on the resulting performance. Hence, in future work, HPO may need to be combined more often with approaches from interpretable ML and sensitivity analysis.

Multi-Criteria HPO and Model Simplicity

Until now, we have mainly considered the scenario of having one well-defined metric for predictive performance available to guide HPO. However, in practical applications, there is often an unknown trade-off between multiple metrics at play, even when only considering predictive performance (e.g., consider an imbalanced multiclass task with unknown misclassification costs). Additionally, there often exists an inherent preference towards simple, interpretable, efficient, and sparse solutions. Such solutions are easier to debug, deploy and maintain, and help in overcoming the “last-mile” problem of integrating ML systems into business workflows [29]. Since the optimal trade-off between predictive performance and simplicity is usually unknown a priori, an attractive approach is multi-criteria HPO, which learns the HPCs of all Pareto-optimal trade-offs [62, 162]. Such multi-criteria approaches introduce a variety of additional challenges for HPO that go beyond the scope of this paper. Recently, model distillation approaches have been proposed to extract a simple(r) model from a complex ensemble [42].
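
As a small sketch of the underlying selection step, the following snippet (with purely hypothetical numbers) extracts the Pareto-optimal configurations from a set of evaluated HPCs when trading off validation error against model sparsity, with both objectives to be minimized:

```python
# Minimal sketch: Pareto front of evaluated configurations under two minimized objectives.
def pareto_front(points):
    """Return all points that are not dominated by another point in both objectives."""
    return [p for p in points
            if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)]

# hypothetical (validation error, number of selected features) per evaluated HPC
evaluated = [(0.12, 40), (0.10, 80), (0.10, 60), (0.15, 10), (0.11, 90)]
print(pareto_front(evaluated))  # e.g. [(0.12, 40), (0.10, 60), (0.15, 10)]
```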

Optimizing for Interpretable Models and Sparseness

A different challenge regarding interpretability is to bias the HPO process towards more interpretable models and return them if their predictive performance is not significantly and/or relevantly worse than that of a less understandable model. To integrate the search for interpretable models into HPO, it would be beneficial if the interpretability of an ML model could be quantitatively measured. This would allow direct optimization against such a metric [101], e.g., using multi-criteria HPO methods.

HPO Beyond Supervised Learning

This paper discussed HPO for supervised ML. Most algorithms from other sub-fields of ML, such as clustering (unsupervised) or anomaly detection (semi- or unsupervised), are also configurable by HPs. In these settings, the HPO techniques discussed in this paper can hardly be applied effectively, since performance evaluation, especially with a single, well-defined metric, is much less clear. Nevertheless, HPO is ready to be used. Since the seminal paper by [11], HPO has again gained exceptional attention in the last decade, and this development has sparked tremendous progress both in terms of efficiency and applicability of HPO. Currently, many HPO packages in various programming languages are readily available, allowing easy application of HPO to a large variety of problems.

Funding Resources

The authors of this work take full responsibility for its content. This work was supported by the Federal Statistical Office of Germany; the Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876, A3; the German Federal Ministry of Education and Research (BMBF) under Grant No. 01IS18036A; and the Bavarian Ministry for Economic Affairs, Infrastructure, Transport and Technology through the Center for Analytics-Data-Applications (ADA-Center) within the framework of “BAYERN DIGITAL II”.

Acknowledgments

We would like to thank Eyke Hüllermeier, Lars Kotthoff, Jürgen Branke and Eduardo C. Garrido-Merchán for their valuable feedback on the manuscript.

References

  • [1] T. Akiba et al. “Optuna: A Next-Generation Hyperparameter Optimization Framework” In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, 2019, pp. 2623–2631 DOI: 10.1145/3292500.3330701
  • [2] R Andonie “Hyperparameter optimization in learning systems” In J. Membr. Comput. 1.4, 2019, pp. 279–291
  • [3] Ilya A Antonov and VM Saleev “An economic method of computing LPτ-sequences” In USSR Computational Mathematics and Mathematical Physics 19.1, 1979, pp. 252–256 DOI: 10.1016/0041-5553(79)90085-5
  • [4] N. Awad, N. Mallik and F. Hutter “DEHB: Evolutionary Hyperband for Scalable, Robust and Efficient Hyperparameter Optimization” In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI, ijcai.org, 2021, pp. 2147–2153
  • [5] Maximilian Balandat et al. “BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization” In Advances in Neural Information Processing Systems 33, 2020
  • [6] Eva Bartz, Martin Zaefferer, Olaf Mersmann and Thomas Bartz-Beielstein “Experimental Investigation and Evaluation of Model-based Hyperparameter Optimization”, 2021 arXiv:2107.08761 [cs.LG]
  • [7] Richard E. Bellman “Adaptive Control Processes” Princeton: Princeton University Press, 2015, pp. 94 DOI: 10.1515/9781400874668
  • [8] Yoshua Bengio and Yves Grandvalet “No unbiased estimator of the variance of k-fold cross-validation” In Journal of Machine Learning Research 5.Sep, 2004, pp. 1089–1105
  • [9] Christoph Bergmeir, Rob J. Hyndman and Bonsoo Koo “A note on the validity of cross-validation for evaluating autoregressive time series prediction” In Computational Statistics & Data Analysis 120, 2018, pp. 70–83 DOI: 10.1016/j.csda.2017.11.003
  • [10] J. Bergstra and Y. Bengio “Random Search for Hyper-Parameter Optimization” In Journal of Machine Learning Research 13, 2012, pp. 281–305
  • [11] James Bergstra and Yoshua Bengio “Random Search for Hyper-Parameter Optimization” In Journal of Machine Learning Research 13.10, 2012, pp. 281–305
  • [12] James Bergstra, Daniel Yamins and David Cox “Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures” In Proceedings of the 30th International Conference on Machine Learning 28.1, Proceedings of Machine Learning Research Atlanta, Georgia, USA: PMLR, 2013, pp. 115–123
  • [13] Hans-Georg Beyer and Hans-Paul Schwefel “Evolution strategies - A comprehensive introduction” In Natural Computing 1, 2002, pp. 3–52 DOI: 10.1023/A:1015059928466
  • [14] H.-G. Beyer and B. Sendhoff “Evolution Strategies for Robust Optimization” In 2006 IEEE International Conference on Evolutionary Computation, 2006, pp. 1346–1353 DOI: 10.1109/CEC.2006.1688465
  • [15] Andre Biedenkapp et al. “Dynamic Algorithm Configuration: Foundation of a New Meta-Algorithmic Framework” In ECAI 2020 - 24th European Conference on Artificial Intelligence, 29 August-8 September 2020, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020) 325, Frontiers in Artificial Intelligence and Applications IOS Press, 2020, pp. 427–434
  • [16] Martin Binder, Julia Moosbauer, Janek Thomas and Bernd Bischl “Multi-objective hyperparameter tuning and feature selection using filter ensembles” In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, 2020, pp. 471–479
  • [17] Martin Binder, Florian Pfisterer and Bernd Bischl “Collecting Empirical Data About Hyperparameters for Data Driven AutoML” In Proceedings of the 7th ICML Workshop on Automated Machine Learning (AutoML 2020), 2020
  • [18] Martin Binder et al. “mlr3pipelines - Flexible Machine Learning Pipelines in R” In Journal of Machine Learning Research 22.184, 2021, pp. 1–7
  • [19] Mauro Birattari, Thomas Stützle, Luis Paquete and Klaus Varrentrapp “A Racing Algorithm for Configuring Metaheuristics” In Proceedings of the 4th Annual Conference on Genetic and Evolutionary Computation, GECCO’02 New York City, New York: Morgan Kaufmann Publishers Inc., 2002, pp. 11–18
  • [20] Mauro Birattari, Zhi Yuan, Prasanna Balaprakash and Thomas Stützle “F-Race and iterated F-Race: An overview” In Experimental methods for the analysis of optimization algorithms Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 311–336 DOI: 10.1007/978-3-642-02538-9_13
  • [21] B. Bischl, O. Mersmann, H. Trautmann and C. Weihs “Resampling Methods for Meta-Model Validation with Recommendations for Evolutionary Computation” In Evolutionary Computation 20.2, 2012, pp. 249–275 DOI: 10.1162/EVCO_a_00069
  • [22] Bernd Bischl et al. “ASlib: A benchmark library for algorithm selection” In Artif. Intell. 237, 2016, pp. 41–58
  • [23] Bernd Bischl et al. “MOI-MBO: Multiobjective Infill for Parallel Model-Based Optimization” In Learning and Intelligent Optimization Conference, Lecture Notes in Computer Science Florida: Springer, 2014, pp. 173–186 DOI: 10.1007/978-3-319-09584-4_17
  • [24] Jakob Bossek, Carola Doerr and Pascal Kerschke “Initial Design Strategies and Their Effects on Sequential Model-Based Optimization: An Exploratory Case Study Based on BBOB” In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, GECCO ’20 New York, NY, USA: Association for Computing Machinery, 2020, pp. 778–786 DOI: 10.1145/3377930.3390155
  • [25] A-L Boulesteix, Carolin Strobl, Thomas Augustin and Martin Daumer “Evaluating microarray-based classifiers: an overview” In Cancer Informatics 6 SAGE Publications Sage UK: London, England, 2008, pp. 77–97 DOI: 10.4137/CIN.S408
  • [26] T. Chen et al. “MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems” In CoRR abs/1512.01274, 2015
  • [27] Yutian Chen et al. “Learning to Learn without Gradient Descent by Gradient Descent” In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 70, Proceedings of Machine Learning Research PMLR, 2017, pp. 748–756
  • [28] Clément Chevalier and David Ginsbourger “Fast Computation of the Multi-Points Expected Improvement with Applications in Batch Selection” In Learning and Intelligent Optimization Springer Berlin Heidelberg, 2013, pp. 59–69 DOI: 10.1007/978-3-642-44973-4_7
  • [29] Michael Chui et al. “Notes from the AI frontier: Insights from hundreds of use cases” In McKinsey Global Institute, 2018
  • [30] Carlos A. Coello Coello, Gary B. Lamont and David A. Van Veldhuizen “Evolutionary algorithms for solving multi-objective problems” Springer, 2007 DOI: 10.1007/978-1-4757-5184-0
  • [31] Christian Daniel, Jonathan Taylor and Sebastian Nowozin “Learning Step Size Controllers for Robust Neural Network Training” In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA AAAI Press, 2016, pp. 1519–1525
  • [32] George De Ath, Richard M Everson, Alma AM Rahat and Jonathan E Fieldsend “Greed is good: Exploration and exploitation trade-offs in Bayesian optimisation” In ACM Transactions on Evolutionary Learning and Optimization 1.1 ACM New York, NY, 2021, pp. 1–22
  • [33] Kevin K Dobbin and Richard M Simon “Optimally splitting cases for training and testing high dimensional classifiers” In BMC medical genomics 4.1 BioMed Central, 2011, pp. 1–8
  • [34] Jaimie Drozdal et al. “Trust in AutoML: exploring information needs for establishing trust in automated machine learning systems” In IUI ’20: 25th International Conference on Intelligent User Interfaces, Cagliari, Italy, March 17-20, 2020 ACM, 2020, pp. 297–307
  • [35] Katharina Eggensperger et al. “Towards an empirical foundation for assessing bayesian optimization of hyperparameters” In NIPS workshop on Bayesian Optimization in Theory and Practice 10, 2013, pp. 3
  • [36] Katharina Eggensperger et al. “Efficient benchmarking of algorithm configurators via model-based surrogates” In Mach. Learn. 107.1, 2018, pp. 15–41
  • [37] Radwa Elshawi, Mohamed Maher and Sherif Sakr “Automated Machine Learning: State-of-The-Art and Open Challenges”, 2019 arXiv:1906.02287 [cs.LG]
  • [38] T. Elsken, B. Staffler, J. Metzen and F. Hutter “Meta-Learning of Neural Architectures for Few-Shot Learning” In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020 Computer Vision Foundation / IEEE, 2020, pp. 12362–12372
  • [39] Thomas Elsken, Jan Hendrik Metzen and Frank Hutter “Neural Architecture Search: A Survey” In Journal of Machine Learning Research 20.55, 2019, pp. 1–21
  • [40] David Eriksson et al. “Scalable Global Optimization via Local Bayesian Optimization” In Advances in Neural Information Processing Systems, 2019, pp. 5496–5507
  • [41] Hugo Jair Escalante, Manuel Montes and Luis Enrique Sucar “Particle swarm model selection.” In Journal of Machine Learning Research 10.2, 2009
  • [42] Rasool Fakoor et al. “Fast, Accurate, and Simple Models for Tabular Data via Augmented Distillation” In Advances in Neural Information Processing Systems 33, 2020
  • [43] S. Falkner, A. Klein and F. Hutter “BOHB: Robust and Efficient Hyperparameter Optimization at Scale” In Proceedings of the 35th International Conference on Machine Learning (ICML’18) 80 Proceedings of Machine Learning Research, 2018, pp. 1437–1446
  • [44] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro and Dinani Amorim “Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?” In The Journal of Machine Learning Research 15.1 JMLR.org, 2014, pp. 3133–3181
  • [45] M. Feurer and F. Hutter “Hyperparameter Optimization” In AutoML: Methods, Sytems, Challenges Springer, 2019, pp. 3–33
  • [46] M. Feurer et al. “Efficient and Robust Automated Machine Learning” In Proceedings of the 28th International Conference on Advances in Neural Information Processing Systems (NeurIPS’15), 2015, pp. 2962–2970
  • [47] Matthias Feurer et al. “Efficient and Robust Automated Machine Learning” In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15 Montreal, Canada: MIT Press, 2015, pp. 2755–2763
  • [48] Matthias Feurer, Jost Tobias Springenberg and Frank Hutter “Initializing Bayesian Hyperparameter Optimization via Meta-Learning” In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, January 25-30, 2015, Austin, Texas, USA AAAI Press, 2015, pp. 1128–1135
  • [49] C. Finn, P. Abbeel and S. Levine “Model-agnostic meta-learning for fast adaptation of deep networks” In Proceedings of the 34th International Conference on Machine Learning (ICML’17) 70 Proceedings of Machine Learning Research, 2017, pp. 1126–1135
  • [50] L. Franceschi, M. Donini, P. Frasconi and M. Pontil “Forward and Reverse Gradient-Based Hyperparameter Optimization” In Proceedings of the 34th International Conference on Machine Learning (ICML’17) 70 Proceedings of Machine Learning Research, 2017, pp. 1165–1173
  • [51] L. Franceschi et al. “Bilevel Programming for Hyperparameter Optimization and Meta-Learning” In Proceedings of the 35th International Conference on Machine Learning (ICML’18) 80 Proceedings of Machine Learning Research, 2018, pp. 1568–1577
  • [52] Eduardo C Garrido-Merchán and Daniel Hernández-Lobato “Dealing with categorical and integer-valued variables in bayesian optimization with gaussian processes” In Neurocomputing 380 Elsevier, 2020, pp. 20–35
  • [53] Pieter Gijsbers et al. “Meta-learning for symbolic hyperparameter defaults”, 2021, pp. 151–152
  • [54] David Ginsbourger, Rodolphe Le Riche and Laurent Carraro “Kriging Is Well-Suited to Parallelize Optimization” In Computational Intelligence in Expensive Optimization Problems Berlin, Heidelberg: Springer Berlin Heidelberg, 2010, pp. 131–162 DOI: 10.1007/978-3-642-10701-6_6
  • [55] I. Guyon, A. Saffari, G. Dror and G. Cawley “Model selection: Beyond the Bayesian/frequentist divide” In The Journal of Machine Learning Research 11.3, 2010, pp. 61–87
  • [56] John T. Hancock and Taghi M. Khoshgoftaar “Survey on categorical data for neural networks” In Journal of Big Data 7.1 Springer Science+Business Media LLC, 2020 DOI: 10.1186/s40537-020-00305-w
  • [57] Trevor Hastie, Robert Tibshirani and Jerome Friedman “The elements of statistical learning: data mining, inference, and prediction” New York: Springer Science & Business Media, 2009
  • [58] X. He, K. Zhao and X. Chu “AutoML: A survey of the state-of-the-art” In Knowl. Based Syst. 212, 2021, pp. 106622
  • [59] Xin He, Kaiyong Zhao and Xiaowen Chu “AutoML: A survey of the state-of-the-art” In Knowledge-Based Systems 212, 2021, pp. 106622 DOI: https://doi.org/10.1016/j.knosys.2020.106622
  • [60] Philipp Hennig and Christian J. Schuler “Entropy Search for Information-Efficient Global Optimization” In Journal of Machine Learning Research 13.6, 2012, pp. 1809–1837
  • [61] Daniel Hernández-Lobato, Jose Hernandez-Lobato, Amar Shah and Ryan Adams “Predictive entropy search for multi-objective bayesian optimization” In International Conference on Machine Learning, 2016, pp. 1492–1501
  • [62] Daniel Horn and Bernd Bischl “Multi-objective parameter configuration of machine learning algorithms using model-based optimization” In 2016 IEEE symposium series on computational intelligence (SSCI), 2016, pp. 1–8 IEEE DOI: 10.1109/SSCI.2016.7850221
  • [63] Roman Hornung et al. “A measure of the impact of CV incompleteness on prediction error estimation with application to PCA and normalization” In BMC Medical Research Methodology 15 BioMed Central, 2015, pp. 95 DOI: 10.1186/s12874-015-0088-9
  • [64] F. Hutter, H. Hoos and K. Leyton-Brown “Sequential Model-Based Optimization for General Algorithm Configuration” In Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION’11) 6683, Lecture Notes in Computer Science Springer, 2011, pp. 507–523
  • [65] “Automated Machine Learning: Methods, Systems, Challenges” Springer, 2019. Available for free at http://automl.org/book
  • [66] Frank Hutter, Youssef Hamadi, Holger H Hoos and Kevin Leyton-Brown “Performance prediction and automated tuning of randomized and parametric algorithms” In International Conference on Principles and Practice of Constraint Programming, 2006, pp. 213–228 Springer
  • [67] Frank Hutter, Holger H Hoos and Kevin Leyton-Brown “Sequential model-based optimization for general algorithm configuration” In International conference on learning and intelligent optimization Berlin, Heidelberg: Springer, 2011, pp. 507–523 DOI: 10.1007/978-3-642-25566-3_40
  • [68] Frank Hutter, Holger H. Hoos and Kevin Leyton-Brown “Parallel Algorithm Configuration” In Learning and Intelligent Optimization, Lecture Notes in Computer Science 7219 Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 55–70 DOI: 10.1007/978-3-642-34413-8_5
  • [69] Frank Hutter, Holger H. Hoos, Kevin Leyton-Brown and Thomas Stützle “ParamILS: An Automatic Algorithm Configuration Framework” In J. Artif. Intell. Res. 36, 2009, pp. 267–306
  • [70] Hamed Jalali, Inneke Van Nieuwenhuyse and Victor Picheny “Comparison of kriging-based algorithms for simulation optimization with heterogeneous noise” In European Journal of Operational Research 261.1 Elsevier, 2017, pp. 279–301
  • [71] Kevin Jamieson and Ameet Talwalkar “Non-stochastic Best Arm Identification and Hyperparameter Optimization” In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, pp. 240–248
  • [72] Nathalie Japkowicz and Mohak Shah “Evaluating Learning Algorithms: A Classification Perspective” Cambridge University Press, 2011 DOI: 10.1017/CBO9780511921803
  • [73] Dipti Jasrasaria and Edward O Pyzer-Knapp “Dynamic Control of Explore/Exploit Trade-Off in Bayesian Optimization” In Science and Information Conference, 2018, pp. 1–15 Springer
  • [74] Donald R. Jones “A Taxonomy of Global Optimization Methods Based on Response Surfaces” In Journal of Global Optimization 21.4, 2001, pp. 345–383 DOI: 10.1023/A:1012771025575
  • [75] Donald R Jones “Direct Global Optimization Algorithm.” In Encyclopedia of optimization 1.1, 2009, pp. 431–440
  • [76] Donald R. Jones, Matthias Schonlau and William J. Welch “Efficient Global Optimization of Expensive Black-Box Functions” In Journal of Global Optimization 13.4, 1998, pp. 455–492 DOI: 10.1023/A:1008306431147
  • [77] Kirthevasan Kandasamy et al. “Tuning Hyperparameters without Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly” In Journal of Machine Learning Research 21.81, 2020, pp. 1–27
  • [78] Rabiya Khalid and Nadeem Javaid “A survey on hyperparameters optimization algorithms of forecasting models in smart grid” In Sustainable Cities and Society 61 Elsevier, 2020, pp. 102275
  • [79] Diederik P. Kingma and Jimmy Ba “Adam: A Method for Stochastic Optimization” In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015
  • [80] A. Klein et al. “Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets” In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics (AISTATS) 54 Proceedings of Machine Learning Research, 2017
  • [81] Aaron Klein et al. “Meta-surrogate benchmarking for hyperparameter optimization” In Advances in Neural Information Processing Systems 32, 2019, pp. 6270–6280
  • [82] Aaron Klein et al. “Fast bayesian optimization of machine learning hyperparameters on large datasets” In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics 54, Proceedings of Machine Learning Research Fort Lauderdale, FL, USA: PMLR, 2017, pp. 528–536 PMLR
  • [83] R. Kohavi “A study of cross-validation and bootstrap for accuracy estimation and model selection” In International Joint Conference on Artificial Intelligence (IJCAI), 1995, pp. 1137–1143
  • [84] M. Lang et al. “Automatic model selection for high-dimensional survival analysis” In Journal of Statistical Computation and Simulation 85.1 Taylor & Francis, 2015, pp. 62–76 DOI: 10.1080/00949655.2014.929131
  • [85] Michel Lang et al. “mlr3: A modern object-oriented machine learning framework in R” In Journal of Open Source Software 4.44, 2019, pp. 1903 DOI: 10.21105/joss.01903
  • [86] Pedro Larrañaga and Jose A Lozano “Estimation of distribution algorithms: A new tool for evolutionary computation” Springer Science & Business Media, 2001
  • [87] A. Li et al. “A Generalized Framework for Population Based Training” In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2019, Anchorage, AK, USA, August 4-8, 2019 ACM, 2019, pp. 1791–1799
  • [88] Ke Li and Jitendra Malik “Learning to Optimize” In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings OpenReview.net, 2017
  • [89] Lisha Li et al. “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization” In Journal of Machine Learning Research 18.185, 2018, pp. 1–52
  • [90] Rui Li et al. “Mixed integer evolution strategies for parameter optimization” In Evolutionary Computation 21.1 MIT Press, 2013, pp. 29–64 DOI: 10.1162/EVCO_a_00059
  • [91] Marius Lindauer and Frank Hutter “Warmstarting of Model-Based Algorithm Configuration” In Proceedings of the AAAI Conference on Artificial Intelligence 32.1, 2018
  • [92] Marius Lindauer and Frank Hutter “Best Practices for Scientific Research on Neural Architecture Search” In Journal of Machine Learning Research 21.243, 2020, pp. 1–18
  • [93] Hanxiao Liu, Karen Simonyan and Yiming Yang “DARTS: Differentiable Architecture Search” In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019 OpenReview.net, 2019
  • [94] Manuel López-Ibáñez et al. “The irace package: Iterated Racing for Automatic Algorithm Configuration” In Operations Research Perspectives 3, 2016, pp. 43–58 DOI: 10.1016/j.orp.2016.09.002
  • [95] Ilya Loshchilov and Frank Hutter “SGDR: Stochastic Gradient Descent with Warm Restarts” In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings OpenReview.net, 2017
  • [96] D. Maclaurin, D. Duvenaud and R. Adams “Gradient-based Hyperparameter Optimization through Reversible Learning” In Proceedings of the 32nd International Conference on Machine Learning (ICML’15) 37 Omnipress, 2015, pp. 2113–2122
  • [97] Oded Maron and Andrew W Moore “Hoeffding races: Accelerating model selection search for classification and function approximation” In Advances in Neural Information Processing Systems 6, 1994, pp. 59–66
  • [98] Mitchell McIntire, Daniel Ratner and Stefano Ermon “Sparse Gaussian Processes for Bayesian Optimization” In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16 Arlington, Virginia, USA: AUAI Press, 2016, pp. 517–526
  • [99] MD McKay, RJ Beckman and WJ Conover “Comparison of Three Methods for Selecting Values of Input Variables in the Analysis of Output from a Computer Code” In Technometrics 21.2, 1979, pp. 239–245 DOI: 10.2307/1268522
  • [100] F. Mohr, M. Wever and E. Hüllermeier “ML-Plan: Automated machine learning via hierarchical planning” In Machine Learning 107.8-10, 2018, pp. 1495–1515
  • [101] Christoph Molnar, Giuseppe Casalicchio and Bernd Bischl “Quantifying model complexity via functional decomposition for better post-hoc interpretability” In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2019, pp. 193–204 Springer
  • [102] Amin Nayebi, Alexander Munteanu and Matthias Poloczek “A Framework for Bayesian Optimization in Embedded Subspaces” In Proceedings of the 36th International Conference on Machine Learning 97, Proceedings of Machine Learning Research PMLR, 2019, pp. 4752–4761
  • [103] Andrew Y Ng “Preventing “overfitting” of cross-validation data” In ICML 97, 1997, pp. 245–253 Citeseer
  • [104] Randal S. Olson, Nathan Bartley, Ryan J. Urbanowicz and Jason H. Moore “Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science” In Proceedings of the Genetic and Evolutionary Computation Conference 2016, GECCO ’16 Denver, Colorado, USA: ACM, 2016, pp. 485–492 DOI: 10.1145/2908812.2908918
  • [105] Jack Parker-Holder, Vu Nguyen and Stephen J. Roberts “Provably Efficient Online Hyperparameter Optimization with Population-Based Bandits” In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems, 2020
  • [106] A. Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library” In Proceedings of the 32nd International Conference on Advances in Neural Information Processing Systems (NeurIPS’19), 2019, pp. 8024–8035
  • [107] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python” In Journal of Machine Learning Research 12, 2011, pp. 2825–2830
  • [108] Valerio Perrone et al. “Learning search spaces for Bayesian optimization: Another view of hyperparameter transfer learning” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019
  • [109] Florian Pfisterer et al. “Learning Multiple Defaults for Machine Learning Algorithms” In Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO ’21 Lille, France: Association for Computing Machinery, 2021, pp. 241–242 DOI: 10.1145/3449726.3459523
  • [110] Victor Picheny, Tobias Wagner and David Ginsbourger “A benchmark of kriging-based infill criteria for noisy optimization” In Structural and Multidisciplinary Optimization 48.3 Springer, 2013, pp. 607–626
  • [111] Philipp Probst, Anne-Laure Boulesteix and Bernd Bischl “Tunability: Importance of Hyperparameters of Machine Learning Algorithms.” In Journal of Machine Learning Research 20.53, 2019, pp. 1–32
  • [112] J Quinlan and R Cameron-Jones “Oversearching and layered search in empirical learning” In breast cancer 286 Citeseer, 1995, pp. 2–7
  • [113] Carl Edward Rasmussen and Christopher K. I. Williams “Gaussian processes for machine learning” MIT Press, 2006
  • [114] Esteban Real, Alok Aggarwal, Yanping Huang and Quoc V. Le “Regularized Evolution for Image Classifier Architecture Search” In Proceedings of the AAAI Conference on Artificial Intelligence 33, 2019, pp. 4780–4789 DOI: 10.1609/aaai.v33i01.33014780
  • [115] John R. Rice “The Algorithm Selection Problem” In Adv. Comput. 15, 1976, pp. 65–118
  • [116] David P. Rodgers “Improvements in multiprocessor system design” In ACM SIGARCH Computer Architecture News 13.3 Association for Computing Machinery (ACM), 1985, pp. 225–231 DOI: 10.1145/327070.327215
  • [117] Olivier Roustant, David Ginsbourger and Yves Deville “DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by kriging-based metamodeling and optimization” In Journal of Statistical Software 51, 2012 DOI: 10.18637/jss.v051.i01
  • [118] Michael J Sasena, Panos Papalambros and Pierre Goovaerts “Exploration of metamodeling sampling criteria for constrained global optimization” In Engineering optimization 34.3 Taylor & Francis, 2002, pp. 263–278
  • [119] Warren Scott, Peter Frazier and Warren Powell “The correlated knowledge gradient for simulation optimization of continuous parameters using gaussian process regression” In SIAM Journal on Optimization 21.3 SIAM, 2011, pp. 996–1026
  • [120] Jasjeet S Sekhon and Walter R Mebane “Genetic optimization using derivatives” In Political Analysis 7 JSTOR, 1998, pp. 187–210 DOI: 10.1093/pan/7.1.187
  • [121] Joseph Sexton and Petter Laake “Standard errors for bagged and random forest estimators” In Computational Statistics & Data Analysis 53.3 Elsevier, 2009, pp. 801–811 DOI: 10.1016/j.csda.2008.08.007
  • [122] Victor S Sheng and Charles X Ling “Thresholding for making classifiers cost-sensitive” In AAAI’06: Proceedings of the 21st national conference on artificial intelligence 6, 2006, pp. 476–481
  • [123] Richard Simon “Resampling Strategies for Model Assessment and Selection” In Fundamentals of Data Mining in Genomics and Proteomics Boston, MA: Springer, 2007, pp. 173–186 DOI: 10.1007/978-0-387-47509-7_8
  • [124] J. Snoek, H. Larochelle and R. Adams “Practical Bayesian Optimization of Machine Learning Algorithms” In Proceedings of the 25th International Conference on Advances in Neural Information Processing Systems (NeurIPS’12), 2012, pp. 2960–2968
  • [125] Jasper Snoek, Hugo Larochelle and Ryan P Adams “Practical Bayesian Optimization of Machine Learning Algorithms” In Advances in Neural Information Processing Systems 25 Curran Associates, Inc., 2012, pp. 2951–2959
  • [126] Jasper Snoek et al. “Scalable Bayesian Optimization Using Deep Neural Networks” In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015 37, JMLR Workshop and Conference Proceedings JMLR.org, 2015, pp. 2171–2180
  • [127] David Speck et al. “Learning Heuristic Selection with Dynamic Algorithm Configuration” In Proceedings of the 31st International Conference on Automated Planning and Scheduling (ICAPS’21), 2021
  • [128] Niranjan Srinivas, Andreas Krause, Sham Kakade and Matthias Seeger “Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design” In Proceedings of the 27th International Conference on International Conference on Machine Learning, ICML’10 Madison, WI, USA: Omnipress, 2010, pp. 1015–1022
  • [129] K. Swersky, J. Snoek and R. Adams “Multi-task Bayesian optimization” In Proceedings of the 26th International Conference on Advances in Neural Information Processing Systems (NeurIPS’13), 2013, pp. 2004–2012
  • [130] Kevin Swersky et al. “Raiders of the lost architecture: Kernels for Bayesian optimization in conditional parameter spaces” In NeurIPS workshop on Bayesian Optimization in Theory and Practice, 2014
  • [131] Kevin Swersky, Jasper Snoek and Ryan P Adams “Multi-Task Bayesian Optimization” In Advances in Neural Information Processing Systems 26 Curran Associates, Inc., 2013, pp. 2004–2012
  • [132] Kevin Swersky, Jasper Snoek and Ryan Prescott Adams “Freeze-thaw bayesian optimization”, 2014 arXiv: http://arxiv.org/abs/1406.3896
  • [133] El-Ghazali Talbi “Optimization of deep neural networks: a survey and unified taxonomy”, 2020 URL: https://hal.inria.fr/hal-02570804
  • [134] L. Thiede and U. Parlitz “Gradient based hyperparameter optimization in Echo State Networks” In Neural Networks 115, 2019, pp. 23–29
  • [135] Chris Thornton, Frank Hutter, Holger H. Hoos and Kevin Leyton-Brown “Auto-WEKA: Combined Selection and Hyperparameter Optimization of Classification Algorithms” In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’13 Chicago, Illinois, USA: Association for Computing Machinery, 2013, pp. 847–855
  • [136] Louis C Tiao, Aaron Klein, Cedric Archambeau and Matthias Seeger “Model-based Asynchronous Hyperparameter Optimization” arXiv: https://arxiv.org/abs/2003.10865
  • [137] Jan N Van Rijn et al. “OpenML: A collaborative science platform” In Joint european conference on machine learning and knowledge discovery in databases, 2013, pp. 645–649 Springer
  • [138] Jan N Van Rijn and Frank Hutter “Hyperparameter importance across datasets” In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2367–2376
  • [139] Ziyu Wang et al. “Bayesian optimization in a billion dimensions via random embeddings” In Journal of Artificial Intelligence Research 55, 2016, pp. 361–387
  • [140] Colin White, Willie Neiswanger and Yash Savani “BANANAS: Bayesian Optimization with Neural Architectures for Neural Architecture Search” In Proceedings of the AAAI Conference on Artificial Intelligence, 2021
  • [141] M. Wistuba, N. Schilling and L. Schmidt-Thieme “Hyperparameter Search Space Pruning - A New Component for Sequential Model-Based Hyperparameter Optimization” In Machine Learning and Knowledge Discovery in Databases (ECML/PKDD’15) 9285, Lecture Notes in Computer Science Springer, 2015, pp. 104–119
  • [142] M. Wistuba, N. Schilling and L. Schmidt-Thieme “Learning hyperparameter optimization initializations” In Proceedings of the International Conference on Data Science and Advanced Analytics (DSAA) IEEE, 2015, pp. 1–10
  • [143] David H Wolpert “Stacked generalization” In Neural networks 5.2 Elsevier, 1992, pp. 241–259
  • [144] Lin Xu, Frank Hutter, Holger H Hoos and Kevin Leyton-Brown “SATzilla: portfolio-based algorithm selection for SAT” In Journal of artificial intelligence research 32, 2008, pp. 565–606
  • [145] L. Yang and A. Shami “On hyperparameter optimization of machine learning algorithms: Theory and practice” In Neurocomputing 415, 2020, pp. 295–316
  • [146] Quanming Yao et al. “Taking Human out of Learning Applications: A Survey on Automated Machine Learning”, 2019 arXiv:1810.13306 [cs.AI]
  • [147] Tong Yu and Hong Zhu “Hyper-Parameter Optimization: A Review of Algorithms and Applications”, 2020 arXiv:2003.05689 [cs.LG]
  • [148] Arber Zela, Aaron Klein, Stefan Falkner and Frank Hutter “Towards Automated Deep Learning: Efficient Joint Neural Architecture and Hyperparameter Search” In ICML 2018 AutoML Workshop, 2018
  • [149] Z. Zhang, X. Wang and W. Zhu “Automated Machine Learning on Graphs: A Survey” In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021 ijcai.org, 2021, pp. 4704–4712
  • [150] Jun Zheng, Zilong Li, Liang Gao and Guosheng Jiang “A parameterized lower confidence bounding scheme for adaptive metamodel-based design optimization” In Engineering Computations Emerald Group Publishing Limited, 2016
  • [151] Lucas Zimmer, Marius Lindauer and Frank Hutter “Auto-PyTorch Tabular: Multi-Fidelity MetaLearning for Efficient and Robust AutoDL” In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, pp. 1–12

References

  • [152] M. Abadi et al. “TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems”, 2015 URL: https://www.tensorflow.org/
  • [153] T. Akiba et al. “Optuna: A Next-Generation Hyperparameter Optimization Framework” In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’19, 2019, pp. 2623–2631 DOI: 10.1145/3292500.3330701
  • [154] JJ Allaire and François Chollet “keras: R Interface to ’Keras”’ R package version 2.3.0.0, 2020 URL: https://CRAN.R-project.org/package=keras
  • [155] A. Aronio de Romblay, N. Cherel, M. Maskani and H. Gerard “MLBox” URL: https://mlbox.readthedocs.io/en/latest/index.html
  • [156] Sunil Arya et al. “An optimal algorithm for approximate nearest neighbor searching fixed dimensions” In Journal of the ACM (JACM) 45.6 ACM New York, NY, USA, 1998, pp. 891–923 DOI: 10.1145/293347.293348
  • [157] N. Awad, N. Mallik and F. Hutter “DEHB: Evolutionary Hyperband for Scalable, Robust and Efficient Hyperparameter Optimization” URL: https://github.com/automl/DEHB
  • [158] Maximilian Balandat et al. “BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization” In Advances in Neural Information Processing Systems 33, 2020
  • [159] J. Bergstra, R. Bardenet, Y. Bengio and B. Kégl “Algorithms for Hyper-Parameter Optimization” In Proceedings of the 24th International Conference on Advances in Neural Information Processing Systems (NeurIPS’11), 2011, pp. 2546–2554
  • [160] James Bergstra, Daniel Yamins and David Cox “Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures” In Proceedings of the 30th International Conference on Machine Learning 28.1, Proceedings of Machine Learning Research Atlanta, Georgia, USA: PMLR, 2013, pp. 115–123
  • [161] Alina Beygelzimer et al. “FNN: Fast Nearest Neighbor Search Algorithms and Applications” R package version 1.1.3, 2019 URL: https://CRAN.R-project.org/package=FNN
  • [162] Martin Binder, Julia Moosbauer, Janek Thomas and Bernd Bischl “Multi-objective hyperparameter tuning and feature selection using filter ensembles” In Proceedings of the 2020 Genetic and Evolutionary Computation Conference, 2020, pp. 471–479
  • [163] Martin Binder et al. “mlr3pipelines - Flexible Machine Learning Pipelines in R” In Journal of Machine Learning Research 22.184, 2021, pp. 1–7
  • [164] Bernd Bischl et al. “mlr: Machine Learning in R” In Journal of Machine Learning Research 17.170, 2016, pp. 1–5
  • [165] Bernd Bischl et al. “mlrMBO: A Modular Framework for Model-Based Optimization of Expensive Black-Box Functions”, 2017 arXiv: http://arxiv.org/abs/1703.03373
  • [166] Bernd Bischl, Julia Schiffner and Claus Weihs “Benchmarking classification algorithms on high-performance computing clusters” In Data Analysis, Machine Learning and Knowledge Discovery Springer, 2014, pp. 23–31
  • [167] Andrea Bommert et al. “Benchmark for filter methods for feature selection in high-dimensional classification data” In Computational Statistics & Data Analysis 143 Elsevier, 2020, pp. 106839
  • [168] Hans Werner Borchers “adagio: Discrete and Global Optimization Routines” R package version 0.7.1, 2018 URL: https://CRAN.R-project.org/package=adagio
  • [169] Bernhard E. Boser, Isabelle M. Guyon and Vladimir N. Vapnik “A training algorithm for optimal margin classifiers” In Proceedings of the fifth annual workshop on Computational learning theory New York, NY, USA: Association for Computing Machinery, 1992, pp. 144–152 DOI: 10.1145/130385.130401
  • [170] Jakob Bossek “ecr 2.0: A modular framework for evolutionary computation in R” In Proceedings of the Genetic and Evolutionary Computation Conference Companion, 2017, pp. 1187–1193 DOI: 10.1145/3067695.3082470
  • [171] X. Bouthillier et al. “Epistimio/orion: Plotting API and Database commands” Zenodo, 2020 DOI: 10.5281/zenodo.3478592
  • [172] Leo Breiman “Random forests” In Machine Learning 45.1 Springer, 2001, pp. 5–32 DOI: 10.1023/A:1010933404324
  • [173] Leo Breiman, Jerome Friedman, Charles J Stone and Richard A Olshen “Classification and regression trees” CRC press, 1984
  • [174] Chih-Chung Chang and Chih-Jen Lin “LIBSVM: A library for support vector machines” In ACM Transactions on Intelligent Systems and Technology 2, 2011, pp. 27:1–27:27 DOI: 10.1145/1961189.1961199
  • [175] N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer “SMOTE: Synthetic Minority Over-sampling Technique” In Journal of Artificial Intelligence Research 16 AI Access Foundation, 2002, pp. 321–357 DOI: 10.1613/jair.953
  • [176] T. Chen et al. “MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems” In CoRR abs/1512.01274, 2015
  • [177] Tianqi Chen and Carlos Guestrin “XGBoost: A Scalable Tree Boosting System” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16 San Francisco, California, USA: ACM, 2016, pp. 785–794 DOI: 10.1145/2939672.2939785
  • [178] Tianqi Chen et al. “XGBoost: Extreme Gradient Boosting” R package version 1.2.0.1, 2020 URL: https://CRAN.R-project.org/package=xgboost
  • [179] Corinna Cortes and Vladimir Vapnik “Support-vector networks” In Machine Learning 20.3, 1995, pp. 273–297 DOI: 10.1023/A:1022627411411
  • [180] T. Cover and P. Hart “Nearest neighbor pattern classification” In IEEE Transactions on Information Theory 13.1, 1967, pp. 21–27 DOI: 10.1109/TIT.1967.1053964
  • [181] “dask” URL: https://dask.org/
  • [182] A. G. de G. Matthews et al. “GPflow: A Gaussian process library using TensorFlow” In Journal of Machine Learning Research 18.40, 2017, pp. 1–6
  • [183] Yufeng Ding and Jeffrey S. Simonoff “An Investigation of Missing Data Methods for Classification Trees Applied to Binary Response Data” In Journal of Machine Learning Research 11.6, 2010, pp. 131–170
  • [184] Harris Drucker et al. “Support Vector Regression Machines” In Proceedings of the 9th International Conference on Neural Information Processing Systems, NIPS’96 Denver, Colorado: MIT Press, 1996, pp. 155–161
  • [185] Peter Dunn and Gordon Smyth “Generalized Linear Models With Examples in R”, Springer Texts in Statistics New York: Springer, 2018 DOI: 10.1007/978-1-4419-0118-7
  • [186] K. Eggensperger, M. Lindauer and F. Hutter “Pitfalls and Best Practices in Algorithm Configuration” In Journal of Artificial Intelligence Research, 2019, pp. 861–893
  • [187] Thomas Elsken, Jan Hendrik Metzen and Frank Hutter “Neural Architecture Search: A Survey” In Journal of Machine Learning Research 20.55, 2019, pp. 1–21
  • [188] N. Erickson et al. “AutoGluon”
  • [189] David Eriksson et al. “Scalable Global Optimization via Local Bayesian Optimization” In Advances in Neural Information Processing Systems, 2019, pp. 5496–5507
  • [190] S. Estévez-Velarde, Y. Gutiérrez, Y. Almeida-Cruz and A. Montoyo “General-purpose hierarchical optimisation of machine learning pipelines with grammatical evolution” In Information Sciences Elsevier, 2020 DOI: 10.1016/j.ins.2020.07.035
  • [191] Daniel Falbel and Javier Luraschi “torch: Tensors and Neural Networks with ’GPU’ Acceleration” R package version 0.1.1, 2020 URL: https://CRAN.R-project.org/package=torch
  • [192] S. Falkner, A. Klein and F. Hutter “BOHB: Robust and Efficient Hyperparameter Optimization at Scale” In Proceedings of the 35th International Conference on Machine Learning (ICML’18) 80 Proceedings of Machine Learning Research, 2018, pp. 1437–1446
  • [193] Manuel Fernández-Delgado, Eva Cernadas, Senén Barro and Dinani Amorim “Do We Need Hundreds of Classifiers to Solve Real World Classification Problems?” In The Journal of Machine Learning Research 15.1 JMLR.org, 2014, pp. 3133–3181
  • [194] M. Feurer et al. “Efficient and Robust Automated Machine Learning” In Proceedings of the 28th International Conference on Advances in Neural Information Processing Systems (NeurIPS’15), 2015, pp. 2962–2970
  • [195] F. Fortin et al. “DEAP: Evolutionary Algorithms Made Easy” In Journal of Machine Learning Research 13, 2012, pp. 2171–2175
  • [196] Yoav Freund “Boosting a weak learning algorithm by majority” In Information and Computation 121.2, 1995, pp. 256–285 DOI: 10.1006/inco.1995.1136
  • [197] Yoav Freund and Robert E Schapire “A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting” In Journal of Computer and System Sciences 55.1, 1997, pp. 119–139 DOI: 10.1006/jcss.1997.1504
  • [198] Jerome Friedman “Greedy function approximation: a gradient boosting machine” In The Annals of Statistics 29.5, 2001, pp. 1189–1232 DOI: 10.1214/aos/1013203451
  • [199] Jerome Friedman, Trevor Hastie and Robert Tibshirani “Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors)” In The Annals of Statistics 28.2, 2000, pp. 337–407 DOI: 10.1214/aos/1016218223
  • [200] Jerome Friedman, Trevor Hastie and Robert Tibshirani “Regularization Paths for Generalized Linear Models via Coordinate Descent” In Journal of Statistical Software 33.1, 2010, pp. 1–22 DOI: 10.18637/jss.v033.i01
  • [201] Unai Garciarena and Roberto Santana “An extensive analysis of the interaction between missing data types, imputation methods, and supervised classifiers” In Expert Systems with Applications 89 Elsevier BV, 2017, pp. 52–65 DOI: 10.1016/j.eswa.2017.07.026
  • [202] J. Gardner et al. “GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration” In Advances in Neural Information Processing Systems, 2018
  • [203] M. Gelbart, J. Snoek and R. Adams “Bayesian Optimization with Unknown Constraints” In Proceedings of the 30th conference on Uncertainty in Artificial Intelligence (UAI’14) AUAI Press, 2014, pp. 250–258
  • [204] Brandon Greenwell, Bradley Boehmke, Jay Cunningham and GBM Developers “gbm: Generalized Boosted Regression Models” R package version 2.1.8, 2020 URL: https://CRAN.R-project.org/package=gbm
  • [205] Sebastian Gruber and Bernd Bischl “mlr3hyperband: Hyperband for ’mlr3”’, 2019 URL: https://mlr3hyperband.mlr-org.com
  • [206] Isabelle Guyon and André Elisseeff “An Introduction to Variable and Feature Selection” In The Journal of Machine Learning Research 3, 2003, pp. 1157–1182
  • [207] Haibo He and Yunqian Ma “Imbalanced learning: foundations, algorithms, and applications” Wiley-IEEE Press, 2013
  • [208] Klaus Hechenbichler and Klaus Schliep “Weighted k-nearest-neighbor techniques and ordinal classification”, 2004
  • [209] Arthur E. Hoerl and Robert W. Kennard “Ridge regression: Biased estimation for nonorthogonal problems” In Technometrics 12.1, 1970, pp. 55–67 DOI: 10.1080/00401706.1970.10488634
  • [210] Torsten Hothorn et al. “Model-based Boosting 2.0” In Journal of Machine Learning Research 11, 2010, pp. 2109–2113
  • [211] Torsten Hothorn, Kurt Hornik and Achim Zeileis “Unbiased recursive partitioning: A conditional inference framework” In Journal of Computational and Graphical Statistics 15.3, 2006, pp. 651–674 DOI: 10.1198/106186006X1339
  • [212] Torsten Hothorn and Achim Zeileis “partykit: A modular toolkit for recursive partytioning in R” In The Journal of Machine Learning Research 16.1 JMLR. org, 2015, pp. 3905–3909
  • [213] F. Hutter, H. Hoos and K. Leyton-Brown “Sequential Model-Based Optimization for General Algorithm Configuration” In Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION’11) 6683, Lecture Notes in Computer Science Springer, 2011, pp. 507–523
  • [214] Anil Jadhav, Dhanya Pramod and Krishnan Ramanathan “Comparison of Performance of Data Imputation Methods for Numeric Dataset” In Applied Artificial Intelligence 33.10 Informa UK Limited, 2019, pp. 913–933 DOI: 10.1080/08839514.2019.1637138
  • [215] H. Jin, Q. Song and X. Hu “Auto-Keras: An Efficient Neural Architecture Search System” In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 1946–1956 ACM
  • [216] Steven G Johnson “The NLopt nonlinear-optimization package”, 2014 URL: https://github.com/stevengj/nlopt
  • [217] K. Kandasamy, J. Schneider and B. Póczos “High Dimensional Bayesian Optimisation and Bandits via Additive Models” In Proceedings of the 32nd International Conference on Machine Learning (ICML’15) 37 Omnipress, 2015, pp. 295–304
  • [218] K. Kandasamy et al. “Tuning Hyperparameters without Grad Students: Scalable and Robust Bayesian Optimisation with Dragonfly” In Journal of Machine Learning Research 21.81, 2020, pp. 1–27
  • [219] Alexandros Karatzoglou, Alex Smola, Kurt Hornik and Achim Zeileis “kernlab – An S4 Package for Kernel Methods in R” In Journal of Statistical Software 11.9, 2004, pp. 1–20 URL: http://www.jstatsoft.org/v11/i09/
  • [220] N. Knudde, J. van der Herten, T. Dhaene and I. Couckuyt “GPflowOpt” URL: https://gpflowopt.readthedocs.io/en/latest/index.html
  • [221] Bartosz Krawczyk “Learning from imbalanced data: open challenges and future directions” In Progress in Artificial Intelligence 5.4 Springer, 2016, pp. 221–232
  • [222] Max Kuhn “Building Predictive Models in R Using the caret Package” In Journal of Statistical Software 28.5 Foundation for Open Access Statistic, 2008, pp. 1–26 DOI: 10.18637/jss.v028.i05
  • [223] Max Kuhn “Futility Analysis in the Cross-Validation of Machine Learning Models”, 2014 arXiv: http://arxiv.org/abs/1405.6974
  • [224] Max Kuhn and Kjell Johnson “Feature engineering and selection: A practical approach for predictive models” CRC Press, 2019
  • [225] Max Kuhn and Hadley Wickham “tidymodels: a collection of packages for modeling and machine learning using tidyverse principles.”, 2020 URL: https://www.tidymodels.org
  • [226] Michel Lang et al. “mlr3: A modern object-oriented machine learning framework in R” In Journal of Open Source Software 4.44, 2019, pp. 1903 DOI: 10.21105/joss.01903
  • [227] E. LeDell and S. Poirier “H2O AutoML: Scalable Automatic Machine Learning” In 7th ICML Workshop on Automated Machine Learning (AutoML), 2020
  • [228] Erin LeDell et al. “h2o: R Interface for the ’H2O’ Scalable Machine Learning Platform”, 2020 URL: https://CRAN.R-project.org/package=h2o
  • [229] Lisha Li et al. “Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization” In Journal of Machine Learning Research 18.185, 2018, pp. 1–52
  • [230] Yang Li et al. “OpenBox: A Generalized Black-box Optimization Service” In KDD ACM, 2021, pp. 3209–3219
  • [231] R. Liaw et al. “Tune” URL: https://docs.ray.io/en/master/tune/index.html
  • [232] M. Lindauer et al. “SMAC v3: Algorithm Configuration in Python” GitHub, 2017 URL: https://github.com/automl/SMAC3
  • [233] Marius Lindauer et al. “SMAC3: A Versatile Bayesian Optimization Package for Hyperparameter Optimization”, 2021 arXiv:2109.09831 [cs.LG]
  • [234] Manuel López-Ibáñez et al. “The irace package: Iterated Racing for Automatic Algorithm Configuration” In Operations Research Perspectives 3, 2016, pp. 43–58 DOI: 10.1016/j.orp.2016.09.002
  • [235] Walter R. Mebane, Jr. and Jasjeet S. Sekhon “Genetic Optimization Using Derivatives: The rgenoud Package for R” In Journal of Statistical Software 42.11, 2011, pp. 1–26 DOI: 10.18637/jss.v042.i11
  • [236] Olaf Mersmann “mco: Multiple Criteria Optimization Algorithms and Related Functions” R package version 1.0-15.1, 2014 URL: https://CRAN.R-project.org/package=mco
  • [237] David Meyer et al. “e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien” R package version 1.7-3, 2019 URL: https://CRAN.R-project.org/package=e1071
  • [238] Daniele Micci-Barreca “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems” In ACM SIGKDD Explorations Newsletter 3.1 ACM New York, NY, USA, 2001, pp. 27–32
  • [239] K. Moore et al. “TransmogrifAI” URL: https://github.com/salesforce/TransmogrifAI
  • [240] Katharine Mullen et al. “DEoptim: An R Package for Global Optimization by Differential Evolution” In Journal of Statistical Software 40.6, 2011, pp. 1–26 DOI: 10.18637/jss.v040.i06
  • [241] L. Nardi, D. Koeplinger and K. Olukotun “Practical design space exploration” In 2019 IEEE 27th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), 2019, pp. 347–358 IEEE
  • [242] John Ashworth Nelder and Robert WM Wedderburn “Generalized linear models” In Journal of the Royal Statistical Society: Series A (General) 135.3, 1972, pp. 370–384 DOI: 10.2307/2344614
  • [243] F. Nogueira “Bayesian Optimization: Open source constrained global optimization tool for Python”, 2014 URL: https://github.com/fmfn/BayesianOptimization
  • [244] C. Oh, E. Gavves and M. Welling “BOCK : Bayesian Optimization with Cylindrical Kernels” In Proceedings of the 35th International Conference on Machine Learning (ICML’18) 80 Proceedings of Machine Learning Research, 2018, pp. 3865–3874
  • [245] R. Olson, N. Bartley, R. Urbanowicz and J. Moore “Evaluation of a Tree-based Pipeline Optimization Tool for Automating Data Science” In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO’16) ACM, 2016, pp. 485–492
  • [246] Florian Pargent, Florian Pfisterer, Janek Thomas and Bernd Bischl “Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features”, 2021 arXiv:2104.00629 [stat.ML]
  • [247] A. Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library” In Proceedings of the 32nd International Conference on Advances in Neural Information Processing Systems (NeurIPS’19), 2019, pp. 8024–8035
  • [248] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python” In Journal of Machine Learning Research 12, 2011, pp. 2825–2830
  • [249] John Platt “Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods” In Advances in Large Margin Classifiers MIT Press, 1999, pp. 61–74
  • [250] Philipp Probst and Anne-Laure Boulesteix “To tune or not to tune the number of trees in random forest” In Journal of Machine Learning Research 18.1, 2017, pp. 6673–6690
  • [251] Philipp Probst, Anne-Laure Boulesteix and Bernd Bischl “Tunability: Importance of Hyperparameters of Machine Learning Algorithms.” In Journal of Machine Learning Research 20.53, 2019, pp. 1–32
  • [252] Philipp Probst, Marvin N Wright and Anne-Laure Boulesteix “Hyperparameters and tuning strategies for random forest” In WIREs Data Mining and Knowledge Discovery 9.3 Wiley Online Library, 2019, pp. e1301 DOI: 10.1002/widm.1301
  • [253] J. Rapin and O. Teytaud “Nevergrad - A gradient-free optimization platform” In GitHub repository GitHub, 2018 URL: https://GitHub.com/FacebookResearch/Nevergrad
  • [254] Olivier Roustant, David Ginsbourger and Yves Deville “DiceKriging, DiceOptim: Two R packages for the analysis of computer experiments by kriging-based metamodeling and optimization” In Journal of Statistical Software 51, 2012 DOI: 10.18637/jss.v051.i01
  • [255] Md Sahidullah and Goutam Saha “Design, analysis and experimental evaluation of block based transformation in MFCC computation for speaker recognition” In Speech communication 54.4 Elsevier, 2012, pp. 543–565
  • [256] Daniel Schalk, Janek Thomas and Bernd Bischl “compboost: Modular Framework for Component-Wise Boosting” In Journal of Open Source Software 3.30, 2018, pp. 967 DOI: 10.21105/joss.00967
  • [257] Robert E. Schapire “The strength of weak learnability” In Machine Learning 5.2, 1990, pp. 197–227 DOI: 10.1007/BF00116037
  • [258] Klaus Schliep and Klaus Hechenbichler “kknn: Weighted k-Nearest Neighbors” R package version 1.3.1, 2016 URL: https://CRAN.R-project.org/package=kknn
  • [259] Bernhard Schölkopf, Alex J. Smola, Robert C. Williamson and Peter L. Bartlett “New support vector algorithms” In Neural Computation 12.5, 2000, pp. 1207–1245 DOI: 10.1162/089976600300015565
  • [260] “scikit-optimize” URL: https://scikit-optimize.github.io/
  • [261] J. Snoek “Spearmint” URL: https://github.com/HIPS/Spearmint/
  • [262] J. Snoek, H. Larochelle and R. Adams “Practical Bayesian Optimization of Machine Learning Algorithms” In Proceedings of the 25th International Conference on Advances in Neural Information Processing Systems (NeurIPS’12), 2012, pp. 2960–2968
  • [263] J. Snoek, K. Swersky, R. Zemel and R. Adams “Input Warping for Bayesian Optimization of Non-stationary Functions” In Proceedings of the 31th International Conference on Machine Learning, (ICML’14) Omnipress, 2014, pp. 1674–1682
  • [264] N. Stander and K. Craig “On the robustness of a simple domain reduction scheme for simulation‐based optimization” In Engineering Computations 19, 2002, pp. 431–450
  • [265] Robert E. Sweeney and Edwin F. Ulveling “A Transformation for Simplifying the Interpretation of Coefficients of Binary Variables in Regression Analysis” In The American Statistician 26.5 American Statistical Association, Taylor & Francis, Ltd., 1972, pp. 30–32 DOI: 10.2307/2683780
  • [266] Terry Therneau and Beth Atkinson “rpart: Recursive Partitioning and Regression Trees” R package version 4.1-15, 2019 URL: https://CRAN.R-project.org/package=rpart
  • [267] C. Thornton, F. Hutter, H. Hoos and K. Leyton-Brown “Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms” In The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’13) ACM Press, 2013, pp. 847–855
  • [268] Robert Tibshirani “Regression shrinkage and selection via the lasso” In Journal of the Royal Statistical Society: Series B (Methodological) 58.1, 1996, pp. 267–288 DOI: 10.1111/j.2517-6161.1996.tb02080.x
  • [269] Heike Trautmann, Olaf Mersmann and David Arnu “cmaes: Covariance Matrix Adapting Evolution Strategy” R package version 1.0-11, 2011 URL: https://CRAN.R-project.org/package=cmaes
  • [270] Mark J. van der Laan, Eric C. Polley and Alan E. Hubbard “Super Learner” In Statistical Applications in Genetics and Molecular Biology 6.1 Walter de Gruyter GmbH, 2007, Article 25 DOI: 10.2202/1544-6115.1309
  • [271] Yap Bee Wah et al. “Feature Selection Methods: Case of Filter and Wrapper Approaches for Maximising Classification Accuracy.” In Pertanika Journal of Science & Technology 26.1, 2018
  • [272] Katarzyna Woźnica and Przemysław Biecek “Does imputation matter? Benchmark for predictive models”, 2020 arXiv: http://arxiv.org/abs/2007.02837
  • [273] Marvin Wright and Andreas Ziegler “ranger: A fast implementation of random forests for high dimensional data in C++ and R” In Journal of Statistical Software 77.1, 2017, pp. 1–17 DOI: 10.18637/jss.v077.i01
  • [274] Yang Xiang, Sylvain Gubian, Brian Suomela and Julia Hoeng “Generalized Simulated Annealing for Efficient Global Optimization: the GenSA Package for R.” In The R Journal 5.1, 2013, pp. 13–28 DOI: 10.32614/RJ-2013-002
  • [275] Bing Xue, Mengjie Zhang and Will N Browne “A comprehensive comparison on evolutionary feature selection approaches to classification” In International Journal of Computational Intelligence and Applications 14.02 World Scientific, 2015, pp. 1550008
  • [276] Yachen Yan “rBayesianOptimization: Bayesian Optimization of Hyperparameters” R package version 1.1.0, 2016 URL: https://CRAN.R-project.org/package=rBayesianOptimization
  • [277] Yaohui Zeng and Patrick Breheny “The biglasso Package: A Memory- and Computation-Efficient Solver for Lasso Model Fitting with Big Data in R”, 2018 arXiv:1701.05936 [stat.CO]
  • [278] L. Zimmer, M. Lindauer and F. Hutter “Auto-PyTorch Tabular: Multi-Fidelity MetaLearning for Efficient and Robust AutoDL” IEEE early access; also available under https://arxiv.org/abs/2006.13799 In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, pp. 1–12
  • [279] Barret Zoph, Vijay Vasudevan, Jonathon Shlens and Quoc V. Le “Learning Transferable Architectures for Scalable Image Recognition” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 8697–8710
  • [280] Hui Zou and Trevor Hastie “Regularization and variable selection via the elastic net” In Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2, 2005, pp. 301–320 DOI: 10.1111/j.1467-9868.2005.00503.x


Appendix A Preprocessing

Before applying a learner, it may be helpful or even necessary to modify the data before training in a step called preprocessing. Preprocessing mainly serves two purposes: (i) it transforms the data to meet the technical requirements of a specific learner, e.g., by converting categorical features to numerical features or by imputing missing values; or (ii) it modifies the data to improve predictive performance, e.g., by bringing features to a common scale or by extracting components of a date column.

A.1 Missing Data Imputation

Imputation describes the task of replacing missing values with feasible ones, since many ML algorithms do not natively handle missing values. In practice, data is often not missing fully at random, and therefore the fact that part of the data is missing may itself be informative for predicting the target label. Hence, it is generally recommended to preserve the information of whether a certain value has been imputed. For categorical features, we often simply introduce a new level for missing values. For numerical or integer features, the simplest techniques impute a constant value, e.g., the per-column mean or median, or sample from the empirical univariate distribution of the feature. To preserve the missingness information in this case, an additional binary dummy column can be added per imputed feature column. When tree-based methods are used, it is often preferred to impute missing values with very large or very small values outside the range of the training data [183], as the tree can then split between available and imputed data, thereby preserving the information about the fact that data is missing without the need for an extra column. Much more elaborate methods exist, often based on multiple imputation, but these are mainly geared towards ensuring proper inference. For purely predictive tasks, it is still somewhat unclear how beneficial these more elaborate imputation methods are. More detailed discussions are available in [201, 214, 272].
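As a minimal illustration of mean imputation with an added missingness indicator, consider the following Python/scikit-learn sketch (the toy matrix is made up and not taken from the paper's supplementary notebooks):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix with missing entries (NaN); columns are numeric features.
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan]])

# Mean imputation per column; add_indicator=True appends one binary dummy
# column per imputed feature, preserving the "was missing" information.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_imp = imputer.fit_transform(X)
print(X_imp)  # first columns: imputed values; last columns: missingness indicators
```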

A.2 Encoding of Categorical Features

Many ML algorithms operate on purely numerical inputs, and hence, we often need an additional numerical encoding procedure for categorical features. Dummy encoding, also called one-hot encoding, is arguably the most popular choice. It replaces a categorical feature with $k$ levels by $k$ or $k-1$ binary indicator features. The latter case avoids multicollinearity of the newly created features by using the $k$-th level as a reference level, encoding it by setting all indicator features to 0. For data sets with many categorical features and data sets with categorical features with many levels (high-cardinality features), this leads to a considerable increase in dimensionality, resulting in increased runtime and memory requirements. In such cases, methods like impact encoding – also called target encoding [265] – replace the categorical feature with only a single numeric feature, based on the average outcome value over all observations for which that level has been observed. [246] have shown that target encoding using generalized linear mixed models (GLMMs) [238] tends to outperform other common encoding methods in settings where categorical feature cardinality is high.
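To make the two encodings concrete, the following Python/pandas sketch (toy data, not from the paper) contrasts dummy encoding with $k-1$ indicators and a plain per-level target encoding; in practice, target encoding should be regularized and computed within resampling splits to avoid leakage, as in the GLMM-based approach cited above.

```python
import pandas as pd

# Made-up toy data: one categorical feature and a numeric target.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue"],
    "y":     [1.0,    0.0,    1.0,   0.0,     1.0],
})

# Dummy / one-hot encoding with k-1 indicator columns (reference level dropped).
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)

# Simple (unregularized) impact/target encoding: replace each level by the
# mean target value observed for that level in the training data.
level_means = df.groupby("color")["y"].mean()
df["color_te"] = df["color"].map(level_means)
print(pd.concat([df, dummies], axis=1))
```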

Problems arise if factor levels occur during prediction that have not been encountered during training. To some extent, this can be mitigated by merging infrequent factor levels into a new level “other”, or by stratifying on the levels during resampling, but it can never be completely ruled out. Since the new level is unknown to the model, it can be treated as a special case of missing data and therefore be handled by imputation. A typical HP of the feature encoding process would be the threshold on the number of levels of a feature that decides when to perform target encoding instead of dummy encoding.

Several benchmarks of encoding methods have been performed, each with a different emphasis and experimental setting; see, e.g., the work by [246] and the references contained therein.

A.3 Feature Filtering

Feature filtering is used to select a subset of features prior to model fitting. Subsetting features can serve two purposes: (i) to reduce computational cost, or (ii) to eliminate noise variables that may deteriorate model performance. Feature filters assign a numerical score to each feature, usually based on a simple dependency measure between each feature and the outcome, e.g., mutual information or correlation. The percentage of features to keep is typically an HP. Note that feature filtering is one of several possibilities to perform feature selection. See [206] for a description of feature wrappers and embedded feature selection methods.
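A minimal sketch of such a filter in Python/scikit-learn (synthetic data; the 25% threshold is an arbitrary example of the percentage HP mentioned above):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

# Synthetic data: 20 features, of which only 5 are informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)

# Filter: score each feature by mutual information with the target and
# keep roughly the top 25%; the percentile is a typical HP to tune.
filt = SelectPercentile(score_func=mutual_info_classif, percentile=25)
X_filtered = filt.fit_transform(X, y)
print(X_filtered.shape)  # roughly a quarter of the features remain
```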

See [271] for a benchmark of filter methods on simulated datasets that assesses the correctness of the selected features. See also [167] and [275] for benchmarks assessing filters with respect to predictive performance as well as secondary aspects of interest such as runtime. [162] evaluates a variety of feature filters with respect to their similarity and their suitability for constructing a feature filter ensemble, which can subsequently be tuned for the data set at hand with BO.

A.4 Data Augmentation and Sampling

Data augmentation has the goal of improving predictive performance by adding or removing rows of the training data. A common use-case for data augmentation is classification with imbalanced class labels. Here, we can add additional observations by oversampling, i.e., repeating observations of the minority class. A more flexible alternative is SMOTE (“Synthetic Minority Oversampling Technique”, [175]), where convex combinations of observations from the minority class are added to the training data. When the available dataset is very large, it is also possible to undersample in order to correct for class imbalance, i.e., filtering out observations of the majority class. Also, subsampling the data set as a whole is an old trick to reduce training time during HPO, and modern multifidelity approaches exploit this in a more principled manner. Sampling and data augmentation methods differ from many other preprocessing methods in that they only affect the training phase of a model. Typical HPs include the under-, over-, or subsampling ratio or the number of observations to use for the convex combinations in SMOTE.
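The following NumPy sketch illustrates the convex-combination idea behind SMOTE on made-up minority-class data; unlike real SMOTE, it pairs observations at random instead of interpolating between a point and one of its k nearest minority neighbors, so it only conveys the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up minority-class observations (numeric features only).
X_min = rng.normal(size=(10, 3))

# SMOTE-style oversampling sketch: synthetic points are convex combinations
# lam * x_a + (1 - lam) * x_b of two minority observations.
n_new = 5
idx_a = rng.integers(0, len(X_min), size=n_new)
idx_b = rng.integers(0, len(X_min), size=n_new)
lam = rng.uniform(size=(n_new, 1))
X_synth = lam * X_min[idx_a] + (1 - lam) * X_min[idx_b]

X_augmented = np.vstack([X_min, X_synth])
print(X_augmented.shape)  # (15, 3): original plus synthetic minority points
```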

For a more thorough description of how to treat imbalanced data in ML, see [221]. Current challenges and areas of research are described in [207].

A.5 Feature Extraction

A crucial step in improving the predictive performance of a model is feature extraction or feature engineering. The key idea is to add new features to the data that the model can exploit better than the original data. This is often both domain-specific and model-specific, and therefore hard to generalize. For linear models, for example, adding polynomial terms can provide a better view on the data, effectively allowing the model to capture non-linear effects and higher-order interactions. Simple tree-based methods might profit from methods like principal component analysis, which rotate all or a subset of numerical features to align feature dimensions with the (orthogonal) directions of major variance. On the other hand, such simple extraction techniques very often do not provide a competitive benefit in terms of predictive performance when many non-parametric models are included in model selection and HPO [166].
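A small Python/scikit-learn sketch of these two generic extraction steps (polynomial expansion and PCA rotation) on a made-up numeric feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # made-up numeric feature matrix

# Polynomial expansion: adds squared terms and pairwise interactions,
# allowing e.g. a linear model to capture non-linear effects.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)

# PCA rotation: aligns the features with directions of major variance;
# n_components is a typical HP of this extraction step.
X_pca = PCA(n_components=2).fit_transform(X)

print(X_poly.shape, X_pca.shape)  # (100, 9) and (100, 2)
```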

More beneficial is often domain-specific feature extraction for more complex data types. For example, for functional data features, where a collection of columns represents a curve, one can either extract generic characteristics (mean value, maximum value, average slope, number of peaks, etc.) or run domain-specific extraction, e.g., we might extract mel-frequency cepstral coefficients (MFCCs) [255] for an audio signal. For a practical guide to using feature extraction in predictive modeling, see e.g. [224].

All extracted features can be used either instead of or in addition to the original features. As it is often highly unclear how to perform ideal feature extraction for a specific task, it is particularly attractive if an ML software system allows the user to embed custom extraction code, with exposed custom HPs, into the preprocessing pipeline, so that the extraction step can be automatically configured by the tuner in a data-dependent manner.

Appendix B Evaluation Metrics

Name | Formula | Direction | Range | Description
Performance measures for regression
Mean Squared Error (MSE) | $\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}\left(y^{(i)}-\hat{y}^{(i)}\right)^{2}$ | min | $[0,\infty)$ | Mean of the squared distances between the target variable $y$ and the predicted target $\hat{y}$.
Mean Absolute Error (MAE) | $\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}\left\lvert y^{(i)}-\hat{y}^{(i)}\right\rvert$ | min | $[0,\infty)$ | More robust than MSE, since it is less influenced by large errors.
$R^{2}$ | $1-\frac{\sum_{i=1}^{n_{\mathrm{test}}}\left(y^{(i)}-\hat{y}^{(i)}\right)^{2}}{\sum_{i=1}^{n_{\mathrm{test}}}\left(y^{(i)}-\bar{y}\right)^{2}}$ | max | $(-\infty,1]$ | Compares the sum of squared errors (SSE) of the model with that of a constant baseline model.
Performance measures for classification based on class labels
Accuracy (ACC) | $\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}\mathbb{I}_{\left\{y^{(i)}=\hat{y}^{(i)}\right\}}$ | max | $[0,1]$ | Proportion of correctly classified observations.
Balanced Accuracy (BA) | $\frac{1}{g}\sum_{k=1}^{g}\frac{1}{n_{\mathrm{test},k}}\sum_{i:\,y^{(i)}=k}\mathbb{I}_{\left\{y^{(i)}=\hat{y}^{(i)}\right\}}$ | max | $[0,1]$ | Variant of the accuracy that accounts for imbalanced classes.
Classification Error (CE) | $\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}\mathbb{I}_{\left\{y^{(i)}\neq\hat{y}^{(i)}\right\}}$ | min | $[0,1]$ | $\mathrm{CE}=1-\mathrm{ACC}$ is the proportion of incorrect predictions.
ROC measures | $\mathrm{TPR}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$ | max | $[0,1]$ | True Positive Rate: how many observations of the positive class 1 are predicted as 1?
 | $\mathrm{FPR}=\frac{\mathrm{FP}}{\mathrm{TN}+\mathrm{FP}}$ | min | $[0,1]$ | False Positive Rate: how many observations of the negative class 0 are falsely predicted as 1?
 | $\mathrm{TNR}=\frac{\mathrm{TN}}{\mathrm{TN}+\mathrm{FP}}$ | max | $[0,1]$ | True Negative Rate: how many observations of the negative class 0 are predicted as 0?
 | $\mathrm{FNR}=\frac{\mathrm{FN}}{\mathrm{TP}+\mathrm{FN}}$ | min | $[0,1]$ | False Negative Rate: how many observations of the positive class 1 were falsely predicted as 0?
 | $\mathrm{PPV}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$ | max | $[0,1]$ | Positive Predictive Value: how likely is a predicted 1 a true 1?
 | $\mathrm{NPV}=\frac{\mathrm{TN}}{\mathrm{FN}+\mathrm{TN}}$ | max | $[0,1]$ | Negative Predictive Value: how likely is a predicted 0 a true 0?
$F_{1}$ | $2\,\frac{\mathrm{PPV}\cdot\mathrm{TPR}}{\mathrm{PPV}+\mathrm{TPR}}$ | max | $[0,1]$ | $F_{1}$ is the harmonic mean of PPV and TPR. Especially useful for imbalanced classes.
Cost measure | $\sum_{i=1}^{n_{\mathrm{test}}}C(y^{(i)},\hat{y}^{(i)})$ | min | $[0,\infty)$ | Cost of incorrect predictions based on a (usually non-negative) cost matrix $C\in\mathbb{R}^{g\times g}$.
Performance measures for classification based on class probabilities
Brier Score (BS) | $\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}\sum_{k=1}^{g}\left(\hat{\pi}_{k}(\mathbf{x}^{(i)})-\sigma_{k}(y^{(i)})\right)^{2}$ | min | $[0,1]$ | Measures squared distances of probabilities from the one-hot encoded class labels.
Log-Loss (LL) | $\frac{1}{n_{\mathrm{test}}}\sum_{i=1}^{n_{\mathrm{test}}}\left(-\sum_{k=1}^{g}\sigma_{k}(y^{(i)})\log\left(\hat{\pi}_{k}(\mathbf{x}^{(i)})\right)\right)$ | min | $[0,\infty)$ | A.k.a. Bernoulli, binomial or cross-entropy loss.
AUC | – | max | $[0,1]$ | Area under the ROC curve.
$\hat{y}^{(i)}$ denotes the predicted label for observation $\mathbf{x}^{(i)}$. ACC, BA, CE, BS, and LL can be used for multi-class classification with $g$ classes. For AUC, multi-class extensions exist as well. $\mathbb{I}_{\{\cdot\}}$ denotes the indicator function. $\sigma_{k}(y)=\mathbb{I}_{\{y=k\}}$ is 1 if $y$ is class $k$, and 0 otherwise (multi-class one-hot encoding). $n_{\mathrm{test},k}$ is the number of observations in the test set with class $k$. $\hat{\pi}_{k}(\mathbf{x}^{(i)})$ is the estimated probability of observation $\mathbf{x}^{(i)}$ belonging to class $k$. TP is the number of true positives (observations of class 1 with predicted class 1), FP is the number of false positives (observations of class 0 with predicted class 1), TN is the number of true negatives (observations of class 0 with predicted class 0), and FN is the number of false negatives (observations of class 1 with predicted class 0).
Table 1: Popular performance measures used for ML, assuming an arbitrary test set of size $n_{\mathrm{test}}$.
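For concreteness, the following NumPy sketch evaluates a few of the binary-classification entries of Table 1 directly from their formulas (the tiny test set is made up; libraries such as scikit-learn provide the same metrics):

```python
import numpy as np

# Tiny made-up test set for illustration.
y_true = np.array([0, 1, 1, 0])             # class labels
p_hat  = np.array([0.2, 0.9, 0.6, 0.7])     # predicted P(y = 1)
y_hat  = (p_hat >= 0.5).astype(int)         # thresholded label prediction

acc = np.mean(y_true == y_hat)              # accuracy (ACC)
ce  = 1 - acc                               # classification error (CE)
# Binary log-loss / cross-entropy, written out for two classes.
ll  = -np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat))
# Binary Brier score: squared distance of probabilities from the labels.
bs  = np.mean((p_hat - y_true) ** 2)

print(acc, ce, round(ll, 3), round(bs, 3))
```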

Appendix C Software

This section lists relevant software packages in both R and Python. Libraries providing popular machine learning algorithms are listed first, followed by additional information about useful packages and relevant considerations for HPO in both languages.

C.1 Software in R

C.1.1 Important Machine Learning Algorithms

Here we present a curated list of learners that work well in general and are well integrated with ML frameworks in R (see Section C.1.3). For a comprehensive list of R packages that offer ML models, categorized by the type of model implemented, see the “CRAN Task View” for ML and Statistical Learning (https://cran.r-project.org/view=MachineLearning).

k-NN

The basic k-NN algorithm is implemented in the class package that comes installed with R. The distance-weighted k-NN is implemented in the kknn package [258]. For large data sets, one may prefer the approximate algorithms [156] in the FNN package [161].

Regularized Linear Models

The elastic net is implemented in the glmnet package [200], which also offers an internal, fast optimization of its regularization HPs via the cv.glmnet() function. Both biglasso [277] and LiblineaR implement regularized linear models specifically optimized for large datasets.

Support Vector Machines

A frequently used implementation of the SVM for R can be found in the e1071 package [237], which uses the LIBSVM library internally [174]. kernlab [219] is a more comprehensive package with an emphasis on flexibility.

Decision Trees

The basic CART algorithm is implemented in the rpart R package [266], while a more flexible implementation of decision trees is included in partykit [212].

Random Forests

Three of the most widely used R implementations of RF are provided in the packages (i) randomForest, implementing the original version [172], (ii) party, implementing (unbiased) conditional inference forests [211], and (iii) ranger, providing a newer implementation of RF, including additional variants and a wide range of options, optimized for use on high-dimensional data [273].

Boosting

Popular implementations of gradient boosting with tree base learners can be found in gbm [204] and, implemented more efficiently, in xgboost [177, 178]. Gradient boosting of generalized linear and additive models is implemented in xgboost as well as in mboost [210]; compboost [256] provides a more efficient implementation of the latter.

Artificial Neural Networks

Modern and fast implementations of NNs are provided by the keras [154] and torch [191] packages. torch is more low-level than keras, and some properties that are explicit HPs of keras need to be implemented by the user through R code in torch.

C.1.2 Black-Box and HP Optimizers

Evolution Strategies

Popular R packages that implement different ES are rgenoud [235], cmaes [269], adagio [168], DEoptim [240], mco [236], ecr [170], and mosmafs [162].

Bayesian Optimization

tune, rBayesianOptimization [276], and DiceOptim [254] are R packages that implement BO with Gaussian processes for purely numerical search spaces. mlrMBO [165] and its successor mlr3mbo allow constructing a flexible class of BO variants with arbitrary regression models and acquisition functions. These packages can also work with mixed and hierarchical search spaces.

Other Methods

Hyperband is implemented in the package mlr3hyperband [205]. Iterated F-racing is provided by irace [234].

C.1.3 ML Frameworks

There are many technical difficulties and intricacies to attend to when practically combining a black-box optimizer with different ML algorithms, especially pipelines, to perform HPO. Although it would be possible to use one of the optimizers listed above and write a custom objective function that performs the resampling-based performance evaluation, this is not recommended. Instead, one of several ML frameworks should be used. These frameworks simplify model fitting, evaluation, and optimization while guarding against the most common pitfalls and providing robust, parallel technical execution. In the following paragraphs, we summarize the tuning capabilities of the most important ML frameworks for R. The feature matrix in Table 2 gives a comprehensive overview of the implemented capabilities.

mlr mlr3 caret tidymodels h2o
Tuners
- Random Search
- Grid Search
- Evolutionary
- Bayesian Optimization
- Hyperband
- Racing*
- Other Tuners: mlr: NSGA2, SA; mlr3: NLopt, SA; caret: –; tidymodels: SA; h2o: –
Search Space
- Transformations
- Default Search Spaces i
Joint Preproc Tuning
*: not yet released on CRAN but prototype available; i: see description in main text.
Table 2: Feature matrix of tuning capabilities of R's ML frameworks.
mlr

[164] supports a broad range of optimizers: random search, grid search, CMA-ES via cmaes, BO via mlrMBO [165], iterated F-racing via irace, NSGA2 via mco, and simulated annealing via GenSA [274]. Random search, grid search, and NSGA2 also support multi-criteria optimization. Arbitrary search spaces can be defined using the ParamHelpers package, which also supports HP transformations. HP spaces of preprocessing steps and ML algorithms can be jointly tuned with the mlrCPO package.

mlr3

[226] superseded mlr in 2019. This new package is designed with a more modular structure in mind and offers a much improved system for pipelines in mlr3pipelines [163]. mlr3tuning implements grid search, random search, simulated annealing via GenSA, CMA-ES via cmaes, and non-linear optimization via nloptr [216]. Additionally, mlr3hyperband provides Hyperband for multifidelity HPO, and the miesmuschel package implements a flexible toolbox for optimization using ES. BO is available through mlrMBO when the mlrintermbo compatibility bridge package is used. A further extension for BO is currently in development (https://github.com/mlr-org/mlr3mbo). mlr3 uses the paradox package, a re-implementation of ParamHelpers with a comparable set of features.

caret

[222] ships with grid search, random search, and adaptive resampling (AR) [223] – a racing approach where an iterative optimization algorithm favors regions of empirically good predictive performance. Default HPs to tune are encoded in the respective call of each ML algorithm. However, if forbidden regions, transformations, or custom search spaces are required, the user must specify a custom design to evaluate. Embedding preprocessing into the tuning process is possible. However, tuning over the HPs of preprocessing methods is not supported.

tidymodels

[225] is the successor of caret. This package ships with grid search and random search, and the recently released tune package comes with Bayesian optimization. The AR racing approach and simulated annealing can be found in the finetune package. HP defaults and transformations are supported via dials. HPs of preprocessing operations using tidymodels’ recipes pipeline system can be tuned jointly with the HPs of the ML algorithm.

h2o

[228] connects to the H2O cross-platform ML framework written in Java. Unlike the other discussed frameworks, which connect third-party packages from CRAN, h2o ships with its own implementations of ML models. The package supports grid search and random grid search, a variant of random search where points to be evaluated are randomly sampled from a discrete grid. The possibilities for preprocessing are limited to imputation, different encoders for categorical features, and correcting for class imbalances via under- and oversampling. H2O automatically constructs a search space for a given set of learning algorithms and preprocessing methods, and HPs of both can be tuned jointly. It is worth mentioning that h2o was developed with a strong focus on AutoML and offers the functionality to perform random search over a pre-defined grid, evaluating configurations of generalized linear models with elastic net penalty, xgboost models, gradient boosting machines, and deep learning models. The best performing configurations found by the AutoML system are then stacked together [270] for the final model.

C.2 Software in Python

C.2.1 Machine Learning Packages in Python

Scikit-learn [248, sklearn] is a general ML framework implemented in Python and the default choice for ML projects that do not require deep learning (DL). The framework provides the most important traditional ML models (incl. SVMs, RFs, boosting, etc.), complemented by a variety of preprocessing and post-processing methods (e.g., ensembling). The clean and well-documented API allows users to easily build fairly complex pipelines, which can provide strong performance on tabular data; see, e.g., the results of auto-sklearn [194].
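As an illustration of the typical workflow (a sketch, not code prescribed by the paper), the following example wraps preprocessing and an SVM into a scikit-learn pipeline and tunes both jointly with random search; the search space bounds are arbitrary choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Pipeline combining preprocessing and learner, so that HPO tunes both jointly
# and the preprocessing is re-fit within each resampling split.
pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

search = RandomizedSearchCV(
    pipe,
    param_distributions={
        "svm__C": loguniform(1e-3, 1e3),       # log-uniform prior, as usual for C
        "svm__gamma": loguniform(1e-4, 1e1),
        "svm__kernel": ["rbf", "poly"],
    },
    n_iter=25, cv=5, random_state=0,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```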

Modern DL frameworks such as Tensorflow [152], PyTorch [247] and MXNet [176] accelerate large-scale computation tasks (in particular matrix multiplication) by executing them on GPUs. Furthermore, GPflow [182] and GPyTorch [202] are GP libraries built on top of TensorFlow and PyTorch, respectively, and can therefore be used to implement surrogate models for BO.

C.2.2 Open-Source HPO Packages in Python

The rapid development of the Python package landscape has enabled many developers to provide HPO tools in Python as well. Here, we give a selective overview of commonly used and well-known packages.

Spearmint [262, 261] was one of the first successful open-source BO packages for HPO. As proposed by [262], Spearmint implements standard BO with Markov chain Monte Carlo (MCMC) integration of the acquisition function for a fully Bayesian treatment of the GP's HPs. Additionally, the second version of Spearmint allows warping the input space with a Beta cumulative distribution function [263] to deal with non-stationary loss landscapes. Spearmint also supports constrained BO [203] and parallel optimization [262].

Hyperopt [160] is a distributed HPO framework. Unlike frameworks that model the expected performance of a given configuration, Hyperopt implements the Tree-structured Parzen Estimator (TPE) [159], which estimates the distribution of configurations given their performance. This approach makes it easy to evaluate different configurations in parallel, as multiple density estimators can be executed at the same time.
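A minimal Hyperopt/TPE sketch; the quadratic objective is a hypothetical stand-in for a resampled validation error, and the search space bounds are arbitrary:

```python
import numpy as np
from hyperopt import STATUS_OK, Trials, fmin, hp, tpe

# Hypothetical stand-in objective; in practice this would return a resampled
# estimate of the validation error for the given configuration.
def objective(config):
    loss = (np.log10(config["C"]) - 1.0) ** 2 \
         + (np.log10(config["gamma"]) + 2.0) ** 2
    return {"loss": loss, "status": STATUS_OK}

# Log-uniform priors over C and gamma (bounds are arbitrary examples).
space = {
    "C": hp.loguniform("C", np.log(1e-3), np.log(1e3)),
    "gamma": hp.loguniform("gamma", np.log(1e-4), np.log(1e1)),
}

best = fmin(fn=objective, space=space, algo=tpe.suggest,
            max_evals=50, trials=Trials())
print(best)  # best configuration found by TPE
```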

Scikit-optimize (skopt) [260] is a simple yet efficient library to optimize expensive and noisy black-box functions with BO. Skopt offers RF-based (including decision trees and gradient-boosted trees) and GP surrogate models built on sklearn [248]. Similar to sklearn, skopt allows for several ways of preprocessing input features, e.g., one-hot encoding, log transformation, normalization, and label encoding.
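A small skopt sketch with a mixed search space (again with a hypothetical stand-in objective and arbitrary bounds):

```python
from skopt import gp_minimize
from skopt.space import Categorical, Integer, Real

# Hypothetical stand-in objective; params arrive as a list in the order of
# the dimensions defined below.
def objective(params):
    c, depth, kernel = params
    return (c - 1.0) ** 2 + 0.1 * depth + (0.0 if kernel == "rbf" else 0.5)

space = [
    Real(1e-3, 1e3, prior="log-uniform", name="C"),
    Integer(1, 10, name="depth"),
    Categorical(["rbf", "poly"], name="kernel"),
]

# Bayesian optimization with a GP surrogate over the mixed space.
res = gp_minimize(objective, dimensions=space, n_calls=30, random_state=0)
print(res.x, res.fun)
```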

SMAC [213, 232] was originally developed as a tool for algorithm configuration [186]. However, in recent years, it has also been successfully used as a key component of several AutoML tools, such as auto-sklearn [194] and Auto-PyTorch [278]. SMAC implements both RF and GP surrogate models to handle various kinds of HP spaces. Additionally, BOHB [192] is implemented as a multi-fidelity approach in SMAC; however, instead of the TPE model used in BOHB, SMAC uses an RF as the surrogate model for the BO part. The most recent version of SMAC [233] also allows for parallel optimization with dask [181].

Similarly, HyperMapper [241] also builds an RF surrogate model and thus supports mixed and structured HP configuration spaces. However, HyperMapper does not implement a Bayesian optimization approach in the strict sense. It incorporates prior knowledge by rescaling the samples with a Beta distribution and handles unknown constraints by training an additional RF model as a probabilistic classifier. Furthermore, it supports multi-objective optimization. As HyperMapper does not support distributed computing environments, it might not be applicable to large-scale HPO problems.

OpenBox [230] is a general framework for black-box optimization – including HPO – and supports multi-objective optimization, multi-fidelity optimization, early stopping, transfer learning, and parallel BO under both synchronous and asynchronous settings. OpenBox further implements an ensemble surrogate model to integrate information from previously seen, similar tasks.

Dragonfly [218] is an open-source library for scalable BO. Dragonfly extends standard BO in several ways to scale to high-dimensional and expensive HPO problems: it implements GPs with additive kernels and uses additive GP-UCB [217] as an acquisition function, and it supports multi-fidelity optimization to handle expensive problems. To increase the robustness of the system w.r.t. its own HPs, Dragonfly sets the GP HPs either by sampling them from the posterior conditioned on the data or by optimizing the likelihood, which allows it to handle different degrees of smoothness of the optimized loss landscape. In addition, Dragonfly maximizes its acquisition function with an evolutionary algorithm, which enables the system to work on various kinds of configuration spaces, e.g., with different variable types and constraints.

The BO-based HPO tools described so far keep their search space fixed during optimization. The Bayesian Optimization package [243] can concentrate the domain around the current optimum and adjust this domain dynamically [264]. A similar idea is implemented in TuRBO [189], which only samples inside a trust region while keeping the configuration space fixed; the trust region shrinks or expands based on the performance of the suggested configurations.

GPflowOpt [220] and BoTorch [158] are BO-based HPO frameworks built on top of GPflow and GPyTorch, respectively. The automatic differentiation provided by the underlying frameworks makes it easy for users to extend existing models with their own ideas. Ax adds an easy-to-use API on top of BoTorch.

DEHB [157] uses differential evolution (DE) and combines it with a multi-fidelity HPO framework, inspired by BOHB's combination of Hyperband [229] and BO. DEHB overcomes several drawbacks of BOHB: it does not require a surrogate model, and thus its runtime overhead does not grow over time; it can achieve stronger final performance than BOHB, especially for discrete-valued and high-dimensional problems, where BO often fails because it tends to suggest points on the boundary of the search space [244]; and it is simpler to implement an efficient parallel DE algorithm than a parallel BO-based approach. For instance, DEAP [195] and Nevergrad [253] provide many EA implementations that enable parallel computation.

There are several other general HPO tools. For instance, Optuna [153] is an automatic HP optimization framework that allows users to dynamically construct the search space (define-by-run API), making it much easier to define highly complex, conditional search spaces. Oríon [171] is an asynchronous framework for black-box function optimization. Finally, Tune [231] is a scalable HP tuning framework that provides APIs to several of the optimizers mentioned before, e.g., Dragonfly, skopt, or Hyperopt.
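A brief sketch of Optuna's define-by-run interface with a conditional HP; the objective is again a hypothetical stand-in rather than a real resampled error:

```python
import optuna

# Define-by-run: the search space is constructed while the objective runs,
# so conditional HPs are just ordinary Python control flow. The returned
# value is a hypothetical stand-in for a resampled validation error.
def objective(trial):
    booster = trial.suggest_categorical("booster", ["linear", "tree"])
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    depth = trial.suggest_int("max_depth", 1, 10) if booster == "tree" else 0
    return (lr - 0.01) ** 2 + 0.01 * depth

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```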

C.2.3 AutoML tools

So far, we have discussed several HPO tools in Python that allow for flexible application to different algorithms, search spaces, and data sets. Here, we briefly discuss several AutoML packages of which HPO is a central component.

Auto-WEKA [267] is one of the first AutoML tools; it automates the design choices of an entire ML toolkit, namely WEKA, with SMAC. The same optimizer is applied in auto-sklearn [194], where meta-learning and ensembling further boost performance. Similarly, Auto-Keras [215] and Auto-PyTorch [278] use BO to automate the design choices of deep NNs. TPOT [245] automates the design of an ML pipeline with genetic algorithms, while AutoGOAL [190] applies probabilistic grammatical evolution to HPO problems. H2O AutoML [227] and AutoGluon [188] train a set of base models and ensemble them with a single layer (H2O) or multi-layer stacks (AutoGluon). Additionally, TransmogrifAI [239] is an AutoML tool built on top of Apache Spark, and MLBox [155] is another AutoML library for Python.