JUMBO: Scalable Multi-task Bayesian Optimization using Offline Data

Kourosh Hakhamaneshi¹, Pieter Abbeel¹, Vladimir Stojanovic¹, Aditya Grover²

Abstract

The goal of Multi-task Bayesian Optimization (MBO) is to minimize the number of queries required to accurately optimize a target black-box function, given access to offline evaluations of other auxiliary functions. When offline datasets are large, the scalability of prior approaches comes at the expense of expressivity and inference quality. We propose JUMBO, an MBO algorithm that sidesteps these limitations by querying additional data based on a combination of acquisition signals derived from training two Gaussian Processes (GP): a cold-GP operating directly in the input domain and a warm-GP that operates in the feature space of a deep neural network pretrained using the offline data. Such a decomposition can dynamically control the reliability of information derived from the online and offline data and the use of pretrained neural networks permits scalability to large offline datasets. Theoretically, we derive regret bounds for JUMBO and show that it achieves no-regret under conditions analogous to GP-UCB (Srinivas et al. 2010). Empirically, we demonstrate significant performance improvements over existing approaches on two real-world optimization problems: hyper-parameter optimization and automated circuit design.

Introduction

Many domains in science and engineering involve the optimization of an unknown black-box function. Such functions can be expensive to evaluate, due to costs such as time and money. Bayesian optimization (BO) is a popular framework for such problems as it seeks to minimize the number of function evaluations required for optimizing a target black-box function (Shahriari et al. 2015; Frazier 2018). In real-world scenarios, however, we often have access to offline evaluations of one or more auxiliary black-box functions related to the target function. For example, one might be interested in finding the optimal hyperparameters of a machine learning model for a given problem and may have access to an offline dataset from previous runs of training the same model on a different dataset for various configurations. Multi-task Bayesian optimization (MBO) is an optimization paradigm that extends BO to exploit such additional sources of information from related black-box functions for efficient optimization (Swersky, Snoek, and Adams 2013).

Early works in MBO employ multi-task Gaussian Processes (GP) with inter-task kernels to capture the correlations between the auxiliary and target functions (Swersky, Snoek, and Adams 2013; Williams, Bonilla, and Chai 2007; Poloczek, Wang, and Frazier 2016). Multi-task GPs, however, fail to scale to large offline datasets. More recent works have proposed combining neural networks (NN) with probabilistic models to improve scalability. For example, MT-BOHAMIANN (Springenberg et al. 2016) uses Bayesian NNs (BNN) (Neal 2012) as surrogate models for MBO. Its performance, however, depends on the quality of the inference procedure. In contrast, MT-ABLR (Perrone et al. 2018) uses a deterministic NN followed by a Bayesian Linear Regression (BLR) layer at the output to achieve scalability while permitting exact inference. However, the use of a linear kernel can limit the expressiveness of the posterior.

We propose JUMBO, an MBO algorithm that sidesteps the limitations in expressivity and tractability of prior approaches. In JUMBO, we first train a NN on the auxiliary data to learn a feature space, akin to MT-ABLR but without the BLR restriction on the output layer. Thereafter, we train two GPs simultaneously for online data acquisition via BO: a warm-GP on the feature space learned by the NN and a cold-GP on the raw input space. The acquisition function in JUMBO combines the individual acquisition functions of both the GPs. It uses the warm-GP to reduce the search space by filtering out poor points. The remaining candidates are scored by the acquisition function for the cold-GP to account for imperfections in learning the feature space of the warm-GP. The use of GPs in the entire framework ensures tractability in posterior inference and updates.

Theoretically, we show that JUMBO is a no-regret algorithm under conditions analogous to those used for analyzing GP-UCB (Srinivas et al. 2010). In practice, we observe significant improvements over the closest baseline on two real-world applications: transferring prior knowledge in hyper-parameter optimization and automated circuit design.

Background

We are interested in maximizing a target black-box function $f:\chi\to\mathbb{R}$ defined over a discrete or compact set $\chi\subseteq\mathbb{R}^{d}$. We assume only query access to $f$. For every query point $x$, we receive a noisy observation $y=f(x)+\epsilon$. Here, we assume $\epsilon$ is zero-mean Gaussian noise, i.e., $\epsilon\sim\mathcal{N}(0,\sigma_{n}^{2})$ where $\sigma_{n}$ is the noise standard deviation. Our strategy for optimizing $f$ will be to learn a probabilistic model for regressing the inputs $x$ to $y$ using the available data and to use that model to guide the acquisition of additional data for updating the model. In particular, we will be interested in using Gaussian Process regression models within a Bayesian Optimization framework, as described below.

Gaussian Process (GP) Regression

A Gaussian Process (GP) is defined as a set of random variables such that any finite subset of them follows a multivariate normal distribution. A GP can be used to define a prior distribution over the unknown function $f$, which can be converted to a posterior distribution once we observe additional data. Formally, a GP prior is defined by a mean function $\mu_{0}:\chi\to\mathbb{R}$ and a valid kernel function $\kappa:\chi\times\chi\to\mathbb{R}$. A kernel function $\kappa$ is valid if it is symmetric and the Gram matrix $K$ is positive semi-definite. Intuitively, the entries of the kernel matrix $K_{i,j}=\kappa(x_{i},x_{j})$ measure the similarity between any two points $x_{i}$ and $x_{j}$. Given points $X=\{x_{1},x_{2},\dots,x_{n}\}$, the distribution of the function evaluations $\mathbf{f}=[f(x_{1}),f(x_{2}),\dots,f(x_{n})]$ in a GP prior follows a normal distribution, such that $\mathbf{f}\mid X\sim\mathcal{N}(\mu_{0}(X),K(X,X))$ where $\mu_{0}(X)=[\mu_{0}(x_{1}),\mu_{0}(x_{2}),\dots,\mu_{0}(x_{n})]$ and $K(X,X)$ is a covariance matrix. For simplicity, we will henceforth assume $\mu_{0}$ to be a zero mean function.

Given a training dataset $\mathcal{D}$ , let $X_{\mathcal{D}}$ and $\mathbf{y}_{\mathcal{D}}$ denote the inputs and their noisy observations. Since the observation model is also assumed to be Gaussian, the posterior over $f$ at a test set of points $X^{\ast}$ will follow a multivariate normal distribution with the following mean and covariance:

$\mu(\mathbf{f}^{\ast}\mid\mathcal{D},X^{\ast})=K(X^{\ast},X_{\mathcal{D}})^{T}\tilde{K}_{\mathcal{D}}^{-1}\mathbf{y}_{\mathcal{D}},$
$\Sigma(\mathbf{f}^{\ast}\mid\mathcal{D},X^{\ast})=K(X^{\ast},X^{\ast})-K(X^{\ast},X_{\mathcal{D}})^{T}\tilde{K}_{\mathcal{D}}^{-1}K(X^{\ast},X_{\mathcal{D}}),$
$\text{where }\tilde{K}_{\mathcal{D}}=K(X_{\mathcal{D}},X_{\mathcal{D}})+\sigma_{n}^{2}I.$

Because posterior computation requires inverting an $n\times n$ matrix for $n$ training points, at a cost that is cubic in $n$ in general, standard GPs can be computationally prohibitive for modeling large datasets. We direct the reader to Rasmussen (2003) for an overview of GPs.
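The posterior equations above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the RBF kernel, its hyperparameters, the toy data, and the function names `rbf_kernel` and `gp_posterior` are all assumptions made for the example. The cross-kernel is stored with shape (training points, test points), so that its transpose appears exactly as in the mean and covariance formulas.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix between rows of A and rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, noise_std=0.1, kernel=rbf_kernel):
    """Posterior mean and covariance of a zero-mean GP at X_test."""
    # K_tilde = K(X_D, X_D) + sigma_n^2 I
    K_tilde = kernel(X_train, X_train) + noise_std**2 * np.eye(len(X_train))
    K_cross = kernel(X_train, X_test)  # shape (n_train, n_test)
    # Cholesky solves replace the explicit inverse for numerical stability.
    L = np.linalg.cholesky(K_tilde)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    V = np.linalg.solve(L, K_cross)
    mean = K_cross.T @ alpha
    cov = kernel(X_test, X_test) - V.T @ V
    return mean, cov

# Toy 1-D data (illustrative only).
X_train = np.array([[0.1], [0.4], [0.8]])
y_train = np.sin(2 * np.pi * X_train[:, 0])
X_test = np.linspace(0.0, 1.0, 5)[:, None]
mean, cov = gp_posterior(X_train, y_train, X_test)
```

The Cholesky factorization is the step whose cost grows cubically with the number of training points.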

Bayesian Optimization (BO)

Bayesian Optimization (BO) is a class of sequential algorithms for sample-efficient optimization of expensive black-box functions (Frazier 2018; Shahriari et al. 2015). A BO algorithm typically runs for a fixed number of rounds. At every round $t$ , the algorithm selects a query point $x_{t}$ and observes a noisy function value $y_{t}$ . To select $x_{t}$ , the algorithm first infers the posterior distribution over functions $p(\mathbf{f}|\{(x_{i},y_{i})\}_{i=1}^{t-1})$ via a probabilistic model (e.g., Gaussian Processes). Thereafter, $x_{t}$ is chosen to optimize an uncertainty-aware acquisition function that balances exploration and exploitation. For example, a popular acquisition function is the Upper Confidence Bound (UCB) which prefers points that have high expected value (exploitation) and high uncertainty (exploration). With the new point $(x_{t},y_{t})$ , the posterior distribution can be updated and the whole process is repeated in the next round.

At round $t$, we define the instantaneous regret as $r_{t}=f(x^{\ast})-f(x_{t})$ where $x^{\ast}$ is the global optimum and $x_{t}$ maximizes the acquisition function. Similarly, we can define the cumulative regret at round $T$ as the sum of instantaneous regrets $R_{T}=\sum_{t=1}^{T}r_{t}$. A desired property of any BO algorithm is to be no-regret, meaning that the cumulative regret is sub-linear in $T$ as $T\to\infty$, i.e., $\lim_{T\to\infty}R_{T}/T=0$.
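To make the BO loop and the regret definitions concrete, the following sketch runs a UCB-based loop on a finite 1-D domain and tracks the cumulative regret. It is a toy illustration under stated assumptions: the RBF kernel with unit prior variance, the lengthscale, the objective `toy_f`, and all function names are hypothetical, and the confidence parameter follows the finite-domain GP-UCB schedule of Srinivas et al. (2010).

```python
import numpy as np

def rbf(A, B, ls=0.2):
    """RBF kernel between two 1-D arrays of points (unit prior variance)."""
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def posterior(Xq, X, y, noise_std):
    """Zero-mean GP posterior mean and standard deviation at 1-D points Xq."""
    if len(X) == 0:
        return np.zeros(len(Xq)), np.ones(len(Xq))
    K = rbf(X, X) + noise_std**2 * np.eye(len(X))
    Ks = rbf(X, Xq)
    mean = Ks.T @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, np.sqrt(np.maximum(var, 0.0))

def run_gp_ucb(f, domain, T=30, noise_std=0.05, delta=0.1, seed=0):
    """Run T rounds of UCB on a finite domain; return cumulative regret R_1..R_T."""
    rng = np.random.default_rng(seed)
    X, y, regrets = np.empty(0), np.empty(0), []
    f_star = f(domain).max()  # known here only because f is a toy function
    for t in range(1, T + 1):
        beta_t = 2.0 * np.log(len(domain) * np.pi**2 * t**2 / (6.0 * delta))
        mean, std = posterior(domain, X, y, noise_std)
        x_t = domain[np.argmax(mean + np.sqrt(beta_t) * std)]  # UCB maximizer
        y_t = f(x_t) + noise_std * rng.standard_normal()       # noisy observation
        X, y = np.append(X, x_t), np.append(y, y_t)
        regrets.append(f_star - f(x_t))                        # instantaneous regret r_t
    return np.cumsum(regrets)

domain = np.linspace(0.0, 1.0, 101)
toy_f = lambda x: x * np.sin(6.0 * x)
R = run_gp_ucb(toy_f, domain)
print("average regret R_T / T:", R[-1] / len(R))
```

A no-regret algorithm is one for which the printed ratio tends to zero as the number of rounds grows.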

Multi-task Bayesian Optimization (MBO)

The setting we focus on in this work is a variant of BO called Multi-Task Bayesian Optimization (MBO). Here, we assume $K>0$ auxiliary real-valued black-box functions $\{f_{1},\dots,f_{K}\}$, each having the same domain $\chi$ as the target function $f$ (Swersky, Snoek, and Adams 2013; Springenberg et al. 2016). For each function $f_{k}$, we have an offline dataset $\mathcal{D}^{(k)}$ consisting of pairs of input points $x$ and the corresponding function evaluations $f_{k}(x)$. If these auxiliary functions are related to the target function, then we can transfer knowledge from the offline data $\mathcal{D}^{\text{aux}}=\mathcal{D}^{(1)}\cup\dots\cup\mathcal{D}^{(K)}$ to improve the sample-efficiency for optimizing $f$. In certain applications, we might also have access to offline data from $f$ itself. However, in practice, $f$ is typically expensive to query and its offline dataset $\mathcal{D}^{f}$ will be very small.

We discuss some prominent works in MBO that are most closely related to our proposed approach below. See Section Related Work for further discussion about other relevant work.

Multi-task BO (Swersky, Snoek, and Adams 2013) is an early approach that employs a custom kernel within a multi-task GP (Williams, Bonilla, and Chai 2007) to model the relationship between the auxiliary and target functions. Similar to standard GPs, multi-task GPs fail to scale for large offline datasets.

On the other hand, parametric models such as neural networks (NN) can effectively scale to larger datasets but do not de facto quantify uncertainty. Hybrid methods such as DNGO (Snoek et al. 2015) achieve scalability for (single-task) BO through the use of a feed-forward deep NN followed by Bayesian Linear Regression (BLR) (Bishop 2006). The NN is trained on the existing data via a simple regression loss (e.g., mean squared error). Once trained, the NN parameters are frozen and the output layer is replaced by BLR for the BO routine. This step can be understood as applying a GP to the output features of the NN with a linear kernel (i.e., $\kappa(x_{i},x_{j})=h_{\phi}(x_{i})^{T}h_{\phi}(x_{j})$ where $h$ is the NN function with parameters $\phi$). For BLR, the computational complexity of posterior inference and updates is linear w.r.t. the number of data points, and thus DNGO can scale to large offline datasets.
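The correspondence between BLR on fixed features and a GP with a linear kernel can be checked numerically. In the sketch below, which is illustrative only, the random matrix `Phi` stands in for the frozen NN features $h_{\phi}(x)$, and `prior_std`, `noise_std`, and both function names are assumptions rather than part of DNGO. With a weight prior of unit variance, the GP kernel reduces to the inner product given above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, noise_std, prior_std = 200, 8, 0.1, 1.0

# Stand-in for frozen NN features h_phi(x); random purely for illustration.
Phi = rng.standard_normal((n, d))
y = Phi @ rng.standard_normal(d) + noise_std * rng.standard_normal(n)
Phi_test = rng.standard_normal((5, d))

def blr_predict(Phi, y, Phi_test, noise_std, prior_std):
    """Weight-space view: only a d x d system is solved, so cost is linear in n."""
    A = Phi.T @ Phi / noise_std**2 + np.eye(Phi.shape[1]) / prior_std**2
    w_mean = np.linalg.solve(A, Phi.T @ y) / noise_std**2
    mean = Phi_test @ w_mean
    var = np.sum(Phi_test * np.linalg.solve(A, Phi_test.T).T, axis=1)
    return mean, var

def linear_gp_predict(Phi, y, Phi_test, noise_std, prior_std):
    """Function-space view: GP with kernel prior_std^2 * h(x)^T h(x'), n x n solve."""
    K = prior_std**2 * Phi @ Phi.T + noise_std**2 * np.eye(len(Phi))
    Ks = prior_std**2 * Phi @ Phi_test.T  # shape (n, n_test)
    mean = Ks.T @ np.linalg.solve(K, y)
    prior_var = prior_std**2 * np.sum(Phi_test**2, axis=1)
    var = prior_var - np.sum(Ks * np.linalg.solve(K, Ks), axis=0)
    return mean, var

blr_mean, blr_var = blr_predict(Phi, y, Phi_test, noise_std, prior_std)
gp_mean, gp_var = linear_gp_predict(Phi, y, Phi_test, noise_std, prior_std)
print(np.max(np.abs(blr_mean - gp_mean)), np.max(np.abs(blr_var - gp_var)))
```

Both views give the same predictive distribution for the latent function; the weight-space form is what makes the approach scale, while the function-space form makes the linear-kernel restriction explicit.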

MT-ABLR (Perrone et al. 2018) extends DNGO to multi-task settings by training a single NN to learn a shared representation $h_{\phi}(x)$ followed by task-specific BLR layers (i.e., predicting $f_{1}(x),...,f_{K}(x)$, and $f(x)$ based on inputs). The learning objective corresponds to the maximization of the sum of the marginal log-likelihoods for each task: $\sum_{t=1}^{K+1}\log p(\mathbf{y}_{t}|w_{t},h_{\phi}(X_{t}),\sigma_{t})$. The main task is included in the last index, $w_{t}$ are the Bayesian linear layer weights for task $t$ with prior $p(w_{t})=\mathcal{N}(0,\sigma_{w_{t}}^{2}I)$, $\sigma_{t}$ and $\sigma_{w_{t}}$ are the hyper-prior parameters, and $(X_{t},\mathbf{y}_{t})$ is the observed data from task $t$. Learning $h_{\phi}(x)$ by directly maximizing the marginal likelihood improves the performance of DNGO while maintaining the computational scalability of its posterior inference in the case of large offline data. However, both DNGO and ABLR implicitly assume the existence of a feature space under which the target function can be expressed as a linear combination of features. This can be a restrictive assumption and, furthermore, there is no guarantee that such a feature space can be learned from finite data.

MT-BOHAMIANN (Springenberg et al. 2016) addresses the limited expressivity of prior approaches by employing Bayesian NNs to specify the posterior over $\mathbf{f}$ and feeding the NN with input $x$ and additional learned task-specific embeddings $\psi(t)$ for task $t$. While allowing for a principled treatment of uncertainties, fully Bayesian NNs are computationally expensive to train and their performance depends on the approximation quality of the stochastic gradient HMC methods used for posterior inference.

Scalable MBO via JUMBO

In the previous section, we observed that prior MBO algorithms make trade-offs in either posterior expressivity or inference quality in order to scale to large offline datasets. Our goal is to show that these trade-offs can be significantly mitigated and consequently, the design of our proposed MBO framework, which we refer to as Joint Upper confidence Multi-task Bayesian Optimization (JUMBO), will be guided by the following desiderata: (1) scalability to large offline datasets (e.g., via NNs); (2) exact and computationally tractable posterior updates (e.g., via GPs); and (3) flexible and expressive posteriors (e.g., via non-linear kernels).

Regression Model

Figure 1: JUMBO. During the pretraining phase, we learn a NN mapping $h_{\phi^{\ast}}$ (orange) for the warm-GP. The next query based on $\alpha_{t}$ (purple) will be the point that has a high score based on the acquisition functions of both the warm and cold GPs (blue).

The regression model in JUMBO is composed of two GPs: a warm-GP and a cold-GP denoted by $\mathcal{GP}^{\text{warm}}(0,\kappa^{w})$ and $\mathcal{GP}^{\text{cold}}(0,\kappa^{c})$, respectively. As shown in Figure 1, both GPs are trained to model the target function $f$ but operate in different input spaces, as we describe next.

$\mathcal{GP}^{\text{warm}}$ (with hyperparameters $\theta_{w}$ ) operates on a feature representation of the input space $h_{\phi}(x)$ derived from the offline dataset $\mathcal{D}^{\text{aux}}$ . To learn this feature space, we train a multi-headed feed-forward NN to minimize the mean squared error for each auxiliary task, akin to DNGO (Snoek et al. 2015). Thereafter, in contrast to both DNGO and ABLR, we do not train separate output BLR modules. Rather, we will directly train $\mathcal{GP}^{\text{warm}}$ on the output of the NN using the data acquired from the target function $f$ . Note that for training $\mathcal{GP}^{\text{warm}}$ , we can use any non-linear kernel, which results in an expressive posterior that allows for exact and tractable inference using closed-form expressions.

Additionally, we can encounter scenarios where some of the auxiliary functions are insufficient in reducing the uncertainty in inferring the target function. In such scenarios, relying solely on $\mathcal{GP}^{\text{warm}}$ can significantly hurt performance. Therefore, we additionally initialize $\mathcal{GP}^{\text{cold}}$ (with hyperparameters $\theta_{c}$ ) directly on the input space $\textstyle\chi$ .

If we also have access to offline data from $f$ (i.e. $\mathcal{D}^{f}$ ), the hyperparameters of the warm and cold GPs can also be pre-trained jointly along with the neural network parameters. The overall pre-training objective is then given by:

$\mathcal{L}(\phi,\theta_{w},\theta_{c})=\mathcal{L}^{\text{MSE}}(\phi|\mathcal{D}^{\text{aux}})+\mathcal{L}^{\mathcal{GP}}(\theta_{w}|\mathcal{D}^{f})+\mathcal{L}^{\mathcal{GP}}(\theta_{c}|\mathcal{D}^{f})$

(1)

where $\mathcal{L}^{\mathcal{GP}}(\cdot|\mathcal{D}^{f})$ denotes the negative marginal log-likelihood for the corresponding GP on $\mathcal{D}^{f}$ .
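The three terms of Eq. 1 can be evaluated as in the sketch below. It is a simplified illustration, not the authors' training code: a single tanh layer with weights `W` stands in for the multi-headed NN, all auxiliary tasks are assumed to share the same inputs, both GPs use an RBF kernel with fixed hyperparameters, and every name and all data are hypothetical. In practice the loss would be minimized with respect to $\phi$, $\theta_{w}$ and $\theta_{c}$ using a gradient-based optimizer.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale):
    """Squared-exponential kernel matrix between rows of A and rows of B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def gp_nll(inputs, y, lengthscale, noise_std):
    """Negative marginal log-likelihood of a zero-mean GP with an RBF kernel."""
    n = len(y)
    K = rbf_kernel(inputs, inputs, lengthscale) + noise_std**2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 y^T K^-1 y + 0.5 log|K| + (n/2) log(2 pi); sum(log diag L) = 0.5 log|K|
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2.0 * np.pi)

def mse_loss(W, heads, X_aux, Y_aux):
    """Multi-headed regression loss: shared features tanh(X W), one linear head per task."""
    H = np.tanh(X_aux @ W)
    return np.mean((H @ heads - Y_aux) ** 2)

rng = np.random.default_rng(0)
d, width, n_tasks = 3, 4, 2
W = rng.standard_normal((d, width))            # toy stand-in for the NN parameters phi
heads = rng.standard_normal((width, n_tasks))  # one output head per auxiliary task
X_aux, Y_aux = rng.uniform(size=(50, d)), rng.standard_normal((50, n_tasks))
X_f, y_f = rng.uniform(size=(6, d)), rng.standard_normal(6)  # small offline set from f

loss = (mse_loss(W, heads, X_aux, Y_aux)
        + gp_nll(np.tanh(X_f @ W), y_f, lengthscale=1.0, noise_std=0.1)  # warm-GP on features
        + gp_nll(X_f, y_f, lengthscale=1.0, noise_std=0.1))              # cold-GP on raw inputs
print(loss)
```

The warm-GP term depends on the NN parameters through the features, which is what couples the three terms during joint pre-training.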

Acquisition Procedure

After offline pre-training of JUMBO's regression model, we can use it for online data acquisition in a standard BO loop. The key design choice here is the acquisition function, which we describe next. At round $t$, let $\alpha_{t}^{\text{warm}}(x)$ and $\alpha_{t}^{\text{cold}}(x)$ be the single-task acquisition functions (e.g., UCB) of the warm and cold GPs, respectively, after observing $t-1$ data points.

Our guiding intuition for the acquisition function in JUMBO is that we are most interested in querying points which are scored highly by both acquisition functions. Ideally, we want to first sort points based on $\alpha^{\text{warm}}$ scores and then from the top choices select the ones with highest $\alpha^{\text{cold}}$ score. To realize this acquisition function on a continuous input domain, we define it as a convex combination of the individual acquisition functions by employing a dynamic interpolation coefficient $\lambda_{t}(x)\in[0,1]$ . Formally,

$\alpha_{t}(x)=\lambda_{t}(x)\alpha^{\text{cold}}_{t}(x)+(1-\lambda_{t}(x))\alpha^{\text{warm}}_{t}(x).$

(2)

In Eq. 2, by choosing $\lambda_{t}(x)$ to be close to $1$ for points with $\alpha_{t}^{\text{warm}}(x)\approx\max_{x}\alpha_{t}^{\text{warm}}(x)$, we ensure that the acquired points have high acquisition scores as per both $\alpha_{t}^{\text{cold}}(x)$ and $\alpha_{t}^{\text{warm}}(x)$. Next, we will discuss some theoretical results that shed more light on the design of $\lambda_{t}(x)$.

Theoretical Analysis

Here, we will formally derive the regret bound for JUMBO and provide insights on the conditions under which JUMBO outperforms GP-UCB (Srinivas et al. 2010). For this analysis, we will use Upper Confidence Bound (UCB) as our acquisition function for the warm and cold GPs. To do so, we utilize the notion of Maximum Information Gain (MIG).

Definition 1 (Maximum Information Gain (Srinivas et al. 2010)).

Let $f\sim\mathcal{GP}(0,\kappa)$, $\kappa:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R}$. Consider any $\chi\subset\mathbb{R}^{d}$ and let $\tilde{\chi}=\{x_{1},...,x_{n}\}\subset\chi$ be a finite subset. Let $\mathbf{y}_{\tilde{\chi}}\in\mathbb{R}^{n}$ be $n$ noisy observations such that $(\mathbf{y}_{\tilde{\chi}})_{i}=(\mathbf{f}_{\tilde{\chi}})_{i}+\epsilon_{i}$, $\epsilon_{i}\sim\mathcal{N}(0,\sigma_{n}^{2})$. Let $I$ denote the Shannon mutual information.

The MIG $\Psi_{n}(\chi)$ of set $\chi$ after $n$ evaluations is the maximum mutual information between the function values and observations among all choices of $n$ points in $\chi$. Formally,

$\Psi_{n}(\chi)=\max_{\tilde{\chi}\subset\chi,\,|\tilde{\chi}|=n}I(\mathbf{y}_{\tilde{\chi}};\mathbf{f}_{\tilde{\chi}})$

This quantity depends on the kernel parameters and the set $\chi$, and also serves as an important tool for characterizing the difficulty of a GP-bandit. For a given kernel, it can be shown that $\Psi_{n}(\chi)\propto\Pi(\chi)$ where $\Pi(\chi)=|\chi|$ for the discrete case and $\text{Vol}(\chi)$ for the continuous case (Srinivas et al. 2010). For example, for the radial basis function (RBF) kernel, $\Psi_{n}([0,1]^{d})\in O(\log(n)^{d+1})$. For brevity, we focus on settings where $\chi$ is discrete. Further results and analysis for the continuous case are deferred to Appendix A.
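For a GP with Gaussian observation noise, the mutual information in Definition 1 has the closed form $I(\mathbf{y};\mathbf{f})=\frac{1}{2}\log\det(I+\sigma_{n}^{-2}K)$, where $K$ is the kernel matrix of the selected points. The sketch below uses it to illustrate how the MIG depends on the size of the domain. It is an illustration under stated assumptions: the RBF kernel, its lengthscale, the two 1-D domains, and the function names are hypothetical, and the maximization over subsets is replaced by greedy selection, which yields a lower bound on the MIG (Srinivas et al. 2010 show it is near-optimal because the information gain is submodular).

```python
import numpy as np

def rbf(A, B, ls=0.2):
    """RBF kernel between two 1-D arrays of points (unit prior variance)."""
    return np.exp(-0.5 * ((A[:, None] - B[None, :]) / ls) ** 2)

def info_gain(points, noise_std=0.1):
    """I(y; f) = 0.5 * log det(I + sigma_n^-2 K) for a GP observed at `points`."""
    K = rbf(points, points)
    return 0.5 * np.linalg.slogdet(np.eye(len(points)) + K / noise_std**2)[1]

def greedy_mig(domain, n, noise_std=0.1):
    """Greedy lower bound on the MIG for subsets of size 1..n of a finite domain."""
    chosen, gains = [], []
    for _ in range(n):
        rest = [i for i in range(len(domain)) if i not in chosen]
        scores = [info_gain(domain[chosen + [i]], noise_std) for i in rest]
        chosen.append(rest[int(np.argmax(scores))])
        gains.append(max(scores))
    return np.array(gains)

wide = greedy_mig(np.linspace(0.0, 1.0, 50), n=5)    # larger domain
narrow = greedy_mig(np.linspace(0.0, 0.1, 50), n=5)  # compressed domain
print("wide:", wide[-1], "narrow:", narrow[-1])
```

Points in the compressed domain are strongly correlated under the kernel, so each additional observation carries less new information and the information gain grows more slowly.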

For GP-UCB (Srinivas et al. 2010), it has been shown that for any $\delta\in(0,1)$ , if $f\sim\mathcal{GP}(0,\kappa)$ (i.e., the GP assigns non-zero probability to the target function $f$ ), then the cumulative regret $R_{T}$ after $T$ rounds will be bounded with probability at least $1-\delta$ :

$\Pr\{R_{T}\leq\sqrt{CT\beta_{T}\Psi_{T}(\chi)},\ \forall T\geq 1\}\geq 1-\delta$

(3)

with $C=\frac{8}{\log(1+\sigma_{n}^{-2})}$ and $\beta_{T}=2\log\left(\frac{|\chi|\pi^{2}T^{2}}{6\delta}\right)$.

Recall that $h_{\phi}:\chi\to\mathcal{Z}$ is a mapping from the input space $\chi$ to the feature space $\mathcal{Z}$. We will further make the following modeling assumptions to ensure that the target black-box function $f$ is a sample from both the cold and warm GPs.

Assumption 1.

$f\sim\mathcal{GP}^{\text{cold}}(0,\kappa^{c})$ .

Assumption 2.

Let $\phi^{\ast}$ denote the NN parameters obtained via pretraining (Eq. 1). Then, there exists a function $g\sim\mathcal{GP}^{\text{warm}}(0,\kappa^{w})$ such that $f=g\circ h_{\phi^{\ast}}$.

Theorem 1.

Let $\chi_{g}\subset\chi$ and $\bar{\chi}_{g}=\chi\setminus\chi_{g}$ be some arbitrary partitioning of the input domain $\chi$. Define the interpolation coefficient as an indicator $\lambda_{t}(x)=\mathbbm{1}(x\in\chi_{g})$. Then under Assumptions 1 and 2, JUMBO is no-regret.

Specifically, let $s$ be the number of rounds in which JUMBO queries points $x_{t}\in\bar{\chi}_{g}$. Then, for any $\delta\in(0,1)$, running JUMBO for $T$ iterations results in a sequence of candidates $(x_{t})_{t=1}^{T}$ for which the following holds with probability at least $1-\delta$:

$R_{T}<\sqrt{CT\beta_{T}\{\Psi_{T-s}(\chi_{g})+\Psi_{s}(\bar{\mathcal{Z}}_{g})\}},\ \forall T\geq 1$

(4)

where $C=\frac{8}{\log\left(1+\sigma_{n}^{-2}\right)}$, $\beta_{t}=2\log\left(\frac{|\chi|\pi^{2}t^{2}}{3\delta}\right)$, and $\bar{\mathcal{Z}}_{g}=\{h_{\phi^{\ast}}(x)\mid x\in\bar{\chi}_{g}\}$ is the set of output features for $\bar{\chi}_{g}$.

Based on the regret bound in Eq. 4, we can conclude that if the partitioning $\chi_{g}$ is chosen such that $\Pi(\bar{\mathcal{Z}}_{g})\ll\Pi(\bar{\chi}_{g})$ and $\Pi(\chi_{g})\ll\Pi(\chi)$, then JUMBO has a tighter bound than GP-UCB. The first condition implies that the second term in Eq. 4 is negligible and intuitively means that $\mathcal{GP}^{\text{warm}}$ will only need a few samples to infer the posterior of $f$ defined on $\bar{\chi}_{g}$, making BO more sample efficient. The second condition implies that $\Psi_{T-s}(\chi_{g})\ll\Psi_{T}(\chi)$, which in turn makes the regret bound of JUMBO tighter than that of GP-UCB. Note that $\chi_{g}$ cannot be made arbitrarily small, since $\bar{\chi}_{g}$ (and therefore $\bar{\mathcal{Z}}_{g}$) would then grow larger, which conflicts with the first condition.

Figure 2 provides an illustrative example. If the learned feature space $h_{\phi^{\ast}}(x)$ compresses the set $\bar{\chi}_{g}$ to a smaller set $\bar{\mathcal{Z}}_{g}$, then $\mathcal{GP}^{\text{warm}}$ can infer the posterior of $g(h_{\phi^{\ast}}(x))$ with only a few samples in $\bar{\chi}_{g}$ (because the MIG is lower). Such an $h_{\phi^{\ast}}(x)$ will likely emerge when tasks share high-level features with one another. In the appendix, we have included an empirical analysis to show that $\mathcal{GP}^{\text{warm}}$ is indeed operating on a compressed space $\mathcal{Z}$. Consequently, if $\chi_{g}$ is reflective of promising regions consisting of near-optimal points, i.e., $\chi_{g}=\{x\in\chi\mid f(x^{\ast})-f(x)\leq l_{f}\}$ for some $l_{f}>0$, BO will be able to quickly discard points from the subset $\bar{\chi}_{g}$ and acquire most of its points from $\chi_{g}$.

Choice of interpolation coefficient $\lambda_{t}(x)$

The above discussion suggests that the partitioning $\chi_{g}$ should ideally consist of near-optimal points. In practice, we do not know $f$ and hence, we rely on our surrogate model to define $\chi_{g}^{(t)}=\{x\in\chi\mid\alpha_{t}^{\text{warm}\ast}-\alpha_{t}^{\text{warm}}(x)\leq l_{\alpha}\}$. Here, $\alpha_{t}^{\text{warm}\ast}$ is the optimal value of $\alpha_{t}^{\text{warm}}(x)$ and the acquisition threshold $l_{\alpha}>0$ is a hyper-parameter used for defining near-optimal points w.r.t. $\alpha_{t}^{\text{warm}}(x)$. At one extreme, $l_{\alpha}\to\infty$ corresponds to the case where $\alpha_{t}(x)=\alpha_{t}^{\text{cold}}(x)$ (i.e. the GP-UCB routine), and at the other extreme, $l_{\alpha}\to 0$ corresponds to the case where $\alpha_{t}(x)=\alpha_{t}^{\text{warm}}(x)$.
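The thresholding rule above can be sketched on a finite candidate set. This is a minimal illustration rather than the released implementation: the function names and the toy acquisition values are hypothetical, and the combination rule follows the interpolation $\alpha_{t}(x)=\lambda_{t}(x)\alpha_{t}^{\text{cold}}(x)+(1-\lambda_{t}(x))\alpha_{t}^{\text{warm}}(x)$ used in Algorithm 1.

```python
from typing import List, Sequence


def interpolation_mask(alpha_warm: Sequence[float], l_alpha: float) -> List[int]:
    """lambda_t(x): 1 if alpha_warm* - alpha_warm(x) <= l_alpha, else 0."""
    best = max(alpha_warm)  # alpha_t^{warm*}, the best warm acquisition value
    return [1 if best - a <= l_alpha else 0 for a in alpha_warm]


def combined_acquisition(
    alpha_cold: Sequence[float], alpha_warm: Sequence[float], l_alpha: float
) -> List[float]:
    """alpha_t(x) = lambda_t(x) * alpha_cold(x) + (1 - lambda_t(x)) * alpha_warm(x)."""
    lam = interpolation_mask(alpha_warm, l_alpha)
    return [l * c + (1 - l) * w for l, c, w in zip(lam, alpha_cold, alpha_warm)]


# Toy acquisition values on five candidate points (illustrative only).
alpha_warm = [0.10, 0.80, 0.95, 1.00, 0.30]
alpha_cold = [0.50, 0.20, 1.70, 1.40, 0.90]

alpha = combined_acquisition(alpha_cold, alpha_warm, l_alpha=0.1)
next_index = max(range(len(alpha)), key=alpha.__getitem__)
```

With $l_{\alpha}=\infty$ the mask is all ones and the cold acquisition is recovered; with $l_{\alpha}=0$ only the maximizer of the warm acquisition keeps its cold score.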

Figure 4 illustrates, on a synthetic 1D example, how JUMBO obtains the next query point. Figure 4(a) shows the main objective $f(x)$ (red) and the auxiliary task $f_{1}(x)$ (blue). They share a periodic structure but have different optima. Figure 4(b) shows the correlation between the two.

Applying GP-UCB (Srinivas et al. 2010) will require a considerable number of samples to learn the periodic structure and the optimal solution. In JUMBO, however, as shown in Figure 4(c), the warm-GP, trained on $(h_{\phi^{\ast}}(x),y)$ samples, can learn the periodic structure using only 6 samples, while the posterior of the cold-GP has not yet learned this structure.

It can also be noted from Figure 4(c) that JUMBO’s acquisition function is $\alpha_{t}^{\text{cold}}(x)$ when the value of $\alpha_{t}^{\text{warm}}(x)$ is close to $\alpha_{t}^{\text{warm}\ast}$ . Therefore, the next query point (marked with a star) has a high score based on both acquisition functions. We summarize JUMBO in Algorithm 1.

Input: Offline auxiliary dataset $\mathcal{D}^{\text{aux}}$, offline target dataset $\mathcal{D}_{0}^{f}$ (optional; default: empty set), threshold $l_{\alpha}$

Output: Sequence of solution candidates $\{x_{t}\}_{t=1}^{T}$ maximizing target function $f$

1 Initialize NN $h_{\phi}(x)$, $\mathcal{GP}^{\text{cold}}$, and $\mathcal{GP}^{\text{warm}}$
2 Pretrain NN params jointly with $\mathcal{GP}^{\text{cold}}$ and $\mathcal{GP}^{\text{warm}}$ hyper-params using $\mathcal{D}^{\text{aux}}$ and $\mathcal{D}_{0}^{f}$ as per Eq. 1
3 Initialize $\mathcal{D}_{0}^{\text{cold}}=\{\}$, $\mathcal{D}_{0}^{\text{warm}}=\{\}$
4 for round $t=1$ to $T$ do
5     Set $\alpha_{t}^{\text{warm}\ast}=\max_{x\in\chi}\alpha_{t}^{\text{warm}}(x)$
6     Set $\lambda_{t}(x)=\mathbbm{1}(\alpha_{t}^{\text{warm}\ast}-\alpha_{t}^{\text{warm}}(x)\leq l_{\alpha})$
7     Set $\alpha_{t}(x)=\lambda_{t}(x)\alpha^{\text{cold}}_{t}(x)+(1-\lambda_{t}(x))\alpha^{\text{warm}}_{t}(x)$
8     Pick $x_{t}=\operatorname*{argmax}_{x\in\chi}\alpha_{t}(x)$
9     Obtain noisy observation $y_{t}$ for $x_{t}$
10    Update $\mathcal{D}^{\text{cold}}_{t}\leftarrow\mathcal{D}^{\text{cold}}_{t-1}\cup\{(x_{t},y_{t})\}$ and $\mathcal{GP}^{\text{cold}}$
11    Update $\mathcal{D}^{\text{warm}}_{t}\leftarrow\mathcal{D}^{\text{warm}}_{t-1}\cup\{(h_{\phi^{\ast}}(x_{t}),y_{t})\}$ and $\mathcal{GP}^{\text{warm}}$
12 end for

Algorithm 1 JUMBO
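The online loop of Algorithm 1 can be sketched in a few lines of NumPy. This is only an illustration under strong simplifying assumptions, none of which come from the paper's codebase: the pretraining step is skipped and the feature map `h` is supplied by hand as a stand-in for $h_{\phi^{\ast}}$, both GPs use a squared-exponential kernel with fixed hyper-parameters instead of a Matern kernel with learned ones, the UCB coefficient `beta` is constant, and the domain is a finite grid.

```python
import numpy as np


def rbf(a, b, lengthscale):
    """Squared-exponential kernel between point sets a (n, d) and b (m, d)."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale**2)


def gp_posterior(x_train, y_train, x_test, lengthscale, noise_var=1e-2):
    """Posterior mean and std of a zero-mean, unit-variance GP (prior if no data)."""
    if len(x_train) == 0:
        return np.zeros(len(x_test)), np.ones(len(x_test))
    k = rbf(x_train, x_train, lengthscale) + noise_var * np.eye(len(x_train))
    k_s = rbf(x_train, x_test, lengthscale)
    solve = np.linalg.solve(k, k_s)               # K^{-1} k_*
    mean = solve.T @ y_train
    var = 1.0 - np.sum(k_s * solve, axis=0)
    return mean, np.sqrt(np.clip(var, 1e-12, None))


def jumbo(f, h, candidates, n_rounds, l_alpha, beta=4.0, seed=0):
    """Run the online loop of Algorithm 1 on a finite candidate set."""
    rng = np.random.default_rng(seed)
    feats = h(candidates)                         # stand-in for h_{phi*}(x)
    xs, zs, ys = [], [], []                       # cold inputs, warm inputs, observations
    for _ in range(n_rounds):
        x_arr = np.array(xs).reshape(-1, candidates.shape[1])
        z_arr = np.array(zs).reshape(-1, feats.shape[1])
        y_arr = np.array(ys)
        mu_c, sd_c = gp_posterior(x_arr, y_arr, candidates, lengthscale=0.1)
        mu_w, sd_w = gp_posterior(z_arr, y_arr, feats, lengthscale=1.0)
        a_cold = mu_c + np.sqrt(beta) * sd_c      # UCB of the cold-GP
        a_warm = mu_w + np.sqrt(beta) * sd_w      # UCB of the warm-GP
        lam = (a_warm.max() - a_warm) <= l_alpha  # lambda_t(x) as a boolean mask
        alpha = np.where(lam, a_cold, a_warm)     # combined acquisition alpha_t(x)
        i = int(np.argmax(alpha))
        y = f(candidates[i]) + 0.1 * rng.standard_normal()  # noisy observation
        xs.append(candidates[i])
        zs.append(feats[i])
        ys.append(y)
    return np.array(xs), np.array(ys)


# Hypothetical 1D problem with a periodic target and hand-written periodic features.
candidates = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
f = lambda x: float(np.sin(6 * np.pi * x[0]))
h = lambda x: np.hstack([np.sin(6 * np.pi * x), np.cos(6 * np.pi * x)])
xs, ys = jumbo(f, h, candidates, n_rounds=15, l_alpha=0.1)
```

Both GPs are refit from scratch in every round, which keeps the sketch short; an incremental update would serve the same purpose.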

Experiments

We are interested in investigating the following questions: (1) How does JUMBO perform on benchmark real-world black-box optimization problems relative to baselines? (2) How does the choice of threshold $l_{\alpha}$ impact the performance of JUMBO? (3) Is it necessary to have a non-linear mapping on the features learned from the offline dataset, or is a BLR layer sufficient?

Our codebase is based on BoTorch (Balandat et al. 2020) and is provided in the Supplementary Materials with additional details in Appendix C.

Application: Hyperparameter optimization

Datasets. We consider the task of optimizing hyperparameters for fully-connected NN architectures on 4 regression benchmarks from HPOBench (Klein and Hutter 2019): Protein Structure (Rana 2013), Parkinsons Telemonitoring (Tsanas et al. 2010), Naval Propulsion (Coraddu et al. 2016), and Slice Localization (Graf et al. 2011). HPOBench provides a look-up-table-based API for querying the validation error of all possible hyper-parameter configurations for a given regression task. These configurations are specified via 9 hyperparameters, which include continuous, categorical, and integer-valued variables.

The objective we wish to minimize is the validation error of a regression task after 100 epochs of training. For this purpose, we consider an offline dataset that consists of validation errors for some randomly chosen configurations after 3 epochs on a given dataset. The target task is to optimize this error after 100 epochs. Klein and Hutter (2019) show that this problem is non-trivial, as there is only a small correlation between the errors at epochs 3 and 100 for the top-1% configurations across all datasets of interest.

Evaluation protocol. We validate the performance of JUMBO against the following baselines with a UCB acquisition function (Srinivas et al. 2010):

• GP-UCB (Srinivas et al. 2010) (i.e. cold-GP only) trains a GP from scratch disregarding $\mathcal{D}^{\text{aux}}$ completely. Equivalently, it can be interpreted as JUMBO with $\lambda_{t}(x)=1\;\;\forall x,t\geq 1$ in Eq. 2 and $\alpha(x)=\alpha^{\text{UCB}}(x)$.
• MT-BOHAMIANN (Springenberg et al. 2016) trains a BNN on all tasks jointly via SGHMC (Section Multi-task Bayesian Optimization (MBO)).
• MT-ABLR (Perrone et al. 2018) trains a shared NN followed by task-specific BLR layers (Section Multi-task Bayesian Optimization (MBO)).
• GCP (Salinas, Shen, and Perrone 2020) uses Gaussian Copula Processes to jointly model the offline and online data.
• MF-GP-UCB (Kandasamy et al. 2019) extends the GP-UCB baseline to a multi-fidelity setting where the source task can be interpreted as a low-fidelity proxy for the target task.
• Offline DKL (i.e. warm-GP only) is our proposed extension to Deep Kernel Learning, where we train a single GP online in the latent space of a NN pretrained on $\mathcal{D}^{\text{aux}}$ (see Section Related Work for details). Equivalently, it can be interpreted as JUMBO with $\lambda_{t}(x)=0$ in Eq. 2.

Results. We run JUMBO (with $l_{\alpha}=0.1$) and all baselines for 50 rounds with 5 random seeds each and measure the simple regret per iteration. The regret curves are shown in Figure 5. We find that JUMBO achieves lower regret than the previous state-of-the-art algorithms for MBO in almost all cases. We believe the slightly worse performance on the Slice dataset relative to other baselines is due to the extremely low top-1% correlation between epoch 3 and epoch 100 on this dataset as compared to others (see Figure 10 in (Klein and Hutter 2019)), which could result in a suboptimal search space partitioning obtained via the warm-GP. For all other datasets, we find JUMBO to be the best performing method. Notably, on the Protein dataset, JUMBO is always able to find the global optimum, unlike the other approaches.
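For reference, the simple-regret curve reported per iteration can be computed as below. The sketch assumes the standard definition for a minimization problem (the gap between the best validation error observed so far and the best achievable error, which a look-up-table benchmark makes available); the function name and the error values are hypothetical.

```python
from typing import List, Sequence


def simple_regret(observed_errors: Sequence[float], best_error: float) -> List[float]:
    """Simple regret after each iteration of a minimization run."""
    regrets, best_so_far = [], float("inf")
    for err in observed_errors:
        best_so_far = min(best_so_far, err)       # incumbent after this query
        regrets.append(best_so_far - best_error)  # gap to the global optimum
    return regrets


# Hypothetical validation errors returned by five successive queries.
errors = [0.40, 0.35, 0.37, 0.30, 0.32]
curve = simple_regret(errors, best_error=0.25)
```

The resulting curve is non-increasing by construction, so averaging it over random seeds yields plots of the kind described above.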

Application: Automated Circuit Design

Next, we consider a real-world use case in optimizing circuit design configurations for a suitable performance metric, e.g., power, noise, etc. In practice, designers are interested in performing layout simulations for measuring the performance metric on any design configuration. These simulations are however expensive to run; designers instead often turn to schematic simulations which return inexpensive proxy metrics correlated with the target metric.

In this problem, the circuit configurations are represented by an 8-dimensional vector, with elements taking continuous values between 0 and 1. The offline dataset consists of 1000 pairs of circuit configurations and 3 auxiliary signals, including a scalar goodness score based on the schematic simulations. We consider the same baselines as before. We also consider BOX-GP-UCB (Perrone et al. 2019), which confines the search space to a hyper-cube over the promising region based on all auxiliary tasks in the offline data. Unlike the considered HPO problems, the offline circuit dataset contains data from more than just one auxiliary task, allowing us to consider BOX-GP-UCB as a viable baseline. MF-GP-UCB was run with the schematic score as the lower-fidelity approximation of the target function. We ran each algorithm for 100 iterations, using $l_{\alpha}=0.1$ for JUMBO, and measured simple regret against iteration. As reflected in the regret curves in Figure 7(a), JUMBO outperforms the other algorithms.

Ablations

Effect of auxiliary tasks. It is important to analyze how learning on other tasks affects the performance. To this end, we considered the circuit design problem with 1 and 3 auxiliary offline tasks. In Figure 7(b), task 1 (yellow) is the most correlated and task 3 (red) is the least correlated task with the objective function. The regret curves suggest that the performance would be poor if the correlation between tasks is low. Moreover, the features pre-trained on the combination of all three tasks provide more information to the warm-GP than those pre-trained only on one of the tasks.

BLR with JUMBO’s acquisition function. A key difference between JUMBO and ABLR (Perrone et al. 2018) is replacing the BLR layer with a GP. To show the merits of having a GP, we ran an experiment on the Protein dataset and replaced the GP with a BLR in JUMBO’s procedure. Figure 7(c) shows that JUMBO with $\mathcal{GP}^{\text{warm}}$ significantly outperforms JUMBO with a BLR layer.

Related Work

Transfer Learning in Bayesian Optimization: Utilizing prior information for applying transfer learning to improve Bayesian optimization has been explored in several prior papers. Early work of (Swersky, Snoek, and Adams 2013) focuses on the design of multi-task kernels for modeling task correlations (Poloczek, Wang, and Frazier 2016). These models tend to suffer from lack of scalability; (Wistuba, Schilling, and Schmidt-Thieme 2018; Feurer, Letham, and Bakshy 2018) show that this challenge can be partially mitigated by training an ensemble of task-specific GPs that scale linearly with the number of tasks but still suffer from cubic complexity in the number of observations for each task. To address scalability and robust treatment of uncertainty, several approaches have been proposed (Salinas, Shen, and Perrone 2020; Springenberg et al. 2016; Perrone et al. 2018). (Salinas, Shen, and Perrone 2020) employs a Gaussian Copula to learn a joint prior on hyper-parameters based on offline tasks, and then utilizes a GP on the online task to adapt to the target function. (Springenberg et al. 2016) uses a BNN as the surrogate for MBO; however, since training BNNs is computationally intensive, (Perrone et al. 2018) proposes to use a deterministic NN followed by a BLR layer at the output to achieve scalability.

Other prior works exploit certain assumptions between the source and target data. For example, (Shilton et al. 2017; Golovin et al. 2017) assume an ordering of the tasks and use this information to train GPs to model residuals between the target and auxiliary tasks. (Feurer, Springenberg, and Hutter 2015; Wistuba, Schilling, and Schmidt-Thieme 2015) assume the existence of a similarity measure between prior and target data, which may not be easy to define for problems other than hyper-parameter optimization. A simpler idea is to use prior data to confine the search space to promising regions (Perrone et al. 2019). However, this relies heavily on whether the confined region includes the optimal solution to the target task. Another line of work studies utilizing prior optimization runs to meta-learn acquisition functions (Volpp et al. 2019). This idea can be utilized in addition to our method and is not a competing direction.

Multi-fidelity Black-box Optimization (MFBO): In multi-fidelity scenarios we can query for noisy approximations to the target function relatively cheaply. For example, in hyperparameter optimization, we can query for cheap proxies to the performance of a configuration on a smaller subset of the training data (Petrak 2000), early stopping (Li et al. 2017), or by predicting learning curves (Domhan, Springenberg, and Hutter 2015; Klein et al. 2017). We direct the reader to Section 1.4 in (Hutter, Kotthoff, and Vanschoren 2019) for a comprehensive survey on MFBO. Such methods, similar to MF-GP-UCB (Kandasamy et al. 2019) (Section Application: Automated Circuit Design), are typically constrained to scenarios where such low fidelities are explicitly available and make strong continuity assumptions between the low fidelities and the target function.

Deep Kernel Learning (DKL): Commonly used GP kernels (e.g. RBF, Matern) can only capture simple correlations between points a priori. DKL (Huang et al. 2015; Calandra et al. 2016) addresses this issue by learning a latent representation via NN that can be fed to a standard kernel at the output. (Snoek et al. 2015) employs linear kernels at the output of a pre-trained NN while (Huang et al. 2015) extends it to use non-linear kernels. The warm-GP in JUMBO can be understood as a DKL surrogate model trained using offline data from auxiliary tasks.

Conclusion

We proposed JUMBO, a no-regret algorithm that employs a careful hybrid of neural networks and Gaussian Processes and a novel acquisition procedure for scalable and sample-efficient Multi-task Bayesian Optimization. We derived JUMBO’s theoretical regret bound and empirically showed that it outperforms competing approaches on a set of real-world optimization problems.

References

Balandat et al. (2020) Balandat, M.; Karrer, B.; Jiang, D.; Daulton, S.; Letham, B.; Wilson, A. G.; and Bakshy, E. 2020. BoTorch: A framework for efficient Monte-Carlo Bayesian optimization. Advances in Neural Information Processing Systems, 33.
Bishop (2006) Bishop, C. M. 2006. Pattern recognition and machine learning. Springer.
Calandra et al. (2016) Calandra, R.; Peters, J.; Rasmussen, C. E.; and Deisenroth, M. P. 2016. Manifold Gaussian processes for regression. In 2016 International Joint Conference on Neural Networks (IJCNN), 3338–3345. IEEE.
Coraddu et al. (2016) Coraddu, A.; Oneto, L.; Ghio, A.; Savio, S.; Anguita, D.; and Figari, M. 2016. Machine learning approaches for improving condition-based maintenance of naval propulsion plants. Proceedings of the Institution of Mechanical Engineers, Part M: Journal of Engineering for the Maritime Environment, 230(1): 136–153.
Domhan, Springenberg, and Hutter (2015) Domhan, T.; Springenberg, J. T.; and Hutter, F. 2015. Speeding up automatic hyperparameter optimization of deep neural networks by extrapolation of learning curves. In Twenty-fourth international joint conference on artificial intelligence.
Feurer, Letham, and Bakshy (2018) Feurer, M.; Letham, B.; and Bakshy, E. 2018. Scalable meta-learning for Bayesian optimization. arXiv preprint arXiv:1802.02219.
Feurer, Springenberg, and Hutter (2015) Feurer, M.; Springenberg, J.; and Hutter, F. 2015. Initializing bayesian hyperparameter optimization via meta-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29.
Frazier (2018) Frazier, P. I. 2018. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811.
Garrido-Merchán and Hernández-Lobato (2020) Garrido-Merchán, E. C.; and Hernández-Lobato, D. 2020. Dealing with categorical and integer-valued variables in bayesian optimization with gaussian processes. Neurocomputing, 380: 20–35.
Golovin et al. (2017) Golovin, D.; Solnik, B.; Moitra, S.; Kochanski, G.; Karro, J.; and Sculley, D. 2017. Google vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, 1487–1495.
Graf et al. (2011) Graf, F.; Kriegel, H.-P.; Schubert, M.; Pölsterl, S.; and Cavallaro, A. 2011. 2d image registration in ct images using radial image descriptors. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 607–614. Springer.
Hansen (2016) Hansen, N. 2016. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772.
Huang et al. (2015) Huang, W.; Zhao, D.; Sun, F.; Liu, H.; and Chang, E. 2015. Scalable gaussian process regression using deep neural networks. In Twenty-fourth international joint conference on artificial intelligence. Citeseer.
Hutter, Kotthoff, and Vanschoren (2019) Hutter, F.; Kotthoff, L.; and Vanschoren, J. 2019. Automated machine learning: methods, systems, challenges. Springer Nature.
Kandasamy et al. (2019) Kandasamy, K.; Dasarathy, G.; Oliva, J.; Schneider, J.; and Poczos, B. 2019. Multi-fidelity gaussian process bandit optimisation. Journal of Artificial Intelligence Research, 66: 151–196.
Klein et al. (2017) Klein, A.; Falkner, S.; Bartels, S.; Hennig, P.; and Hutter, F. 2017. Fast bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, 528–536. PMLR.
Klein and Hutter (2019) Klein, A.; and Hutter, F. 2019. Tabular benchmarks for joint architecture and hyperparameter optimization. arXiv preprint arXiv:1905.04970.
Li et al. (2017) Li, L.; Jamieson, K.; DeSalvo, G.; Rostamizadeh, A.; and Talwalkar, A. 2017. Hyperband: A novel bandit-based approach to hyperparameter optimization. The Journal of Machine Learning Research, 18(1): 6765–6816.
Neal (2012) Neal, R. M. 2012. Bayesian learning for neural networks, volume 118. Springer Science & Business Media.
Perrone et al. (2018) Perrone, V.; Jenatton, R.; Seeger, M.; and Archambeau, C. 2018. Scalable hyperparameter transfer learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 6846–6856.
Perrone et al. (2019) Perrone, V.; Shen, H.; Seeger, M.; Archambeau, C.; and Jenatton, R. 2019. Learning search spaces for bayesian optimization: Another view of hyperparameter transfer learning. arXiv preprint arXiv:1909.12552.
Petrak (2000) Petrak, J. 2000. Fast subsampling performance estimates for classification algorithm selection. In Proceedings of the ECML-00 Workshop on Meta-Learning: Building Automatic Advice Strategies for Model Selection and Method Combination, 3–14.
Poloczek, Wang, and Frazier (2016) Poloczek, M.; Wang, J.; and Frazier, P. I. 2016. Warm starting Bayesian optimization. In 2016 Winter Simulation Conference (WSC), 770–781. IEEE.
Rana (2013) Rana, P. 2013. Physicochemical properties of protein tertiary structure data set. UCI Machine Learning Repository.
Rasmussen (2003) Rasmussen, C. E. 2003. Gaussian processes in machine learning. In Summer school on machine learning, 63–71. Springer.
Salinas, Shen, and Perrone (2020) Salinas, D.; Shen, H.; and Perrone, V. 2020. A quantile-based approach for hyperparameter transfer learning. In International Conference on Machine Learning, 8438–8448. PMLR.
Shahriari et al. (2015) Shahriari, B.; Swersky, K.; Wang, Z.; Adams, R. P.; and De Freitas, N. 2015. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1): 148–175.
Shilton et al. (2017) Shilton, A.; Gupta, S.; Rana, S.; and Venkatesh, S. 2017. Regret bounds for transfer learning in Bayesian optimisation. In Artificial Intelligence and Statistics, 307–315. PMLR.
Snoek et al. (2015) Snoek, J.; Rippel, O.; Swersky, K.; Kiros, R.; Satish, N.; Sundaram, N.; Patwary, M.; Prabhat, M.; and Adams, R. 2015. Scalable bayesian optimization using deep neural networks. In International conference on machine learning, 2171–2180. PMLR.
Springenberg et al. (2016) Springenberg, J. T.; Klein, A.; Falkner, S.; and Hutter, F. 2016. Bayesian optimization with robust Bayesian neural networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 4141–4149.
Srinivas et al. (2010) Srinivas, N.; Krause, A.; Kakade, S.; and Seeger, M. 2010. Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design. In Proceedings of the 27th International Conference on Machine Learning. Omnipress.
Swersky, Snoek, and Adams (2013) Swersky, K.; Snoek, J.; and Adams, R. P. 2013. Multi-Task Bayesian Optimization. Advances in Neural Information Processing Systems, 26: 2004–2012.
Tsanas et al. (2010) Tsanas, A.; Little, M. A.; McSharry, P. E.; and Ramig, L. O. 2010. Enhanced classical dysphonia measures and sparse regression for telemonitoring of Parkinson’s disease progression. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, 594–597. IEEE.
Volpp et al. (2019) Volpp, M.; Fröhlich, L. P.; Fischer, K.; Doerr, A.; Falkner, S.; Hutter, F.; and Daniel, C. 2019. Meta-learning acquisition functions for transfer learning in bayesian optimization. arXiv preprint arXiv:1904.02642.
Williams, Bonilla, and Chai (2007) Williams, C.; Bonilla, E. V.; and Chai, K. M. 2007. Multi-task Gaussian process prediction. Advances in neural information processing systems, 153–160.
Wistuba, Schilling, and Schmidt-Thieme (2015) Wistuba, M.; Schilling, N.; and Schmidt-Thieme, L. 2015. Learning hyperparameter optimization initializations. In 2015 IEEE international conference on data science and advanced analytics (DSAA), 1–10. IEEE.
Wistuba, Schilling, and Schmidt-Thieme (2018) Wistuba, M.; Schilling, N.; and Schmidt-Thieme, L. 2018. Scalable gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning, 107(1): 43–78.

Appendix A Proofs of Theoretical Results

We will present proofs for Theorem 1 and an additional result in Theorem 5 that extends the no-regret guarantees of JUMBO to continuous domains. Our theoretical derivations will build on prior results from (Srinivas et al. 2010) and (Kandasamy et al. 2019).

Theorem 1

Let $\mu^{c}_{t}(x)$ and $\sigma^{c}_{t}(x)$ denote the posterior mean and standard deviation of $\mathcal{GP}^{\text{cold}}$ at the end of round $t$ after observing $\mathcal{D}_{t-1}^{\text{cold}}=\{(x_{i},y_{i})\}_{i=1}^{t-1}$. Similarly, we will use $\mu^{w}_{t}(x)$ and $\sigma^{w}_{t}(x)$ to denote the posterior mean and standard deviation of $\mathcal{GP}^{\text{warm}}$ at the end of round $t$ after observing $\mathcal{D}_{t-1}^{\text{warm}}=\{(h_{\phi}(x_{i}),y_{i})\}_{i=1}^{t-1}$.

Lemma 2.

Pick $\delta\in(0,1)$ and set $\beta_{t}=2\log\left(\frac{|\chi|\pi_{t}}{\delta}\right)$ where $\sum_{t\geq 1}\pi_{t}^{-1}=0.5$, $\pi_{t}>0$ (e.g. $\pi_{t}=\frac{\pi^{2}t^{2}}{3}$). Define $\mu_{t}(x)=\lambda_{t}(x)\mu^{c}_{t}(x)+(1-\lambda_{t}(x))\mu^{w}_{t}(x)$ and $\sigma_{t}(x)=\lambda_{t}(x)\sigma^{c}_{t}(x)+(1-\lambda_{t}(x))\sigma^{w}_{t}(x)$. Then,

P\{|f(x)-\mu_{t}(x)|\leq\beta_{t}^{1/2}\sigma_{t}(x),\forall x\in\chi,\forall t\geq 1\}\geq 1-\delta.

Proof.

Fix $t\geq 1$ and $x\in\chi$. Based on Assumption 1, conditioned on $\mathcal{D}_{t-1}^{\text{cold}}$, $f(x)\sim\mathcal{N}(\mu^{c}_{t}(x),(\sigma^{c}_{t}(x))^{2})$. Similarly, Assumption 2 implies that conditioned on $\mathcal{D}_{t-1}^{\text{warm}}$, $f(x)\sim\mathcal{N}(\mu^{w}_{t}(x),(\sigma^{w}_{t}(x))^{2})$. Let $\mathcal{A}$ be the event that $|f(x)-\mu^{c}_{t}(x)|\leq\beta_{t}^{1/2}\sigma^{c}_{t}(x)$ and $\mathcal{B}$ the event that $|f(x)-\mu^{w}_{t}(x)|\leq\beta_{t}^{1/2}\sigma^{w}_{t}(x)$. From the proof of Lemma 5.1 in (Srinivas et al. 2010) we know that for a standard normal variable $z\sim\mathcal{N}(0,1)$ and $c>0$, $P\{z>c\}\leq 0.5e^{-\frac{c^{2}}{2}}$. Using $z=\frac{f(x)-\mu^{c}_{t}(x)}{\sigma^{c}_{t}(x)}$ and $c=\beta_{t}^{1/2}$, and applying the bound to both tails, $P\{\bar{\mathcal{A}}\}\leq e^{-\beta_{t}/2}$. Similarly, $P\{\bar{\mathcal{B}}\}\leq e^{-\beta_{t}/2}$. Using the union bound, we have:

P\{\bar{\mathcal{A}}\lor\bar{\mathcal{B}}\}\leq P\{\bar{\mathcal{A}}\}+P\{\bar{\mathcal{B}}\}\leq 2e^{-\beta_{t}/2}.

Since $e^{-\beta_{t}/2}=\delta/(|\chi|\pi_{t})$ and $\sum_{t\geq 1}\pi_{t}^{-1}=0.5$, a further union bound over all $x\in\chi$ and $t\geq 1$ gives:

P\{\exists x\in\chi,\ \exists t\geq 1:\ \bar{\mathcal{A}}\lor\bar{\mathcal{B}}\}\leq|\chi|\sum_{t\geq 1}2e^{-\beta_{t}/2}=\delta.

The event in this Lemma is just $\mathcal{A}\land\mathcal{B}$ and the proof is concluded. ∎
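As a numerical sanity check of the constants in Lemma 2, the following sketch evaluates the example choice $\pi_{t}=\pi^{2}t^{2}/3$ and the resulting total failure probability. The function names, the domain size, and the truncation horizon are illustrative assumptions; the infinite sums are approximated by partial sums.

```python
import math


def pi_t(t: int) -> float:
    """Example choice pi_t = pi^2 t^2 / 3 from Lemma 2."""
    return (math.pi ** 2) * (t ** 2) / 3.0


def beta_t(t: int, domain_size: int, delta: float) -> float:
    """beta_t = 2 log(|X| pi_t / delta) for a finite domain of size |X|."""
    return 2.0 * math.log(domain_size * pi_t(t) / delta)


def failure_mass(domain_size: int, delta: float, horizon: int) -> float:
    """Partial sum of |X| * sum_t 2 exp(-beta_t / 2) over t = 1..horizon."""
    return sum(
        domain_size * 2.0 * math.exp(-beta_t(t, domain_size, delta) / 2.0)
        for t in range(1, horizon + 1)
    )


# Partial sums approach 0.5 and delta from below, respectively.
partial = sum(1.0 / pi_t(t) for t in range(1, 100_001))
mass = failure_mass(domain_size=1000, delta=0.05, horizon=100_000)
```

Here `partial` approximates $\sum_{t\geq 1}\pi_{t}^{-1}$ and `mass` approximates the right-hand side of the union bound, which stays below the chosen $\delta$.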

Next, we state two lemmas from prior work.

Lemma 3.

If $|f(x)-\mu_{t}(x)|\leq\beta_{t}^{1/2}\sigma_{t}(x)$ , then $r_{t}$ is bounded by $2\beta_{t}^{1/2}\sigma_{t}(x_{t})$ .

Proof.

See Lemma 5.2 in (Srinivas et al. 2010). It employs the results of Lemma 2 to prove the statement. ∎

Lemma 4.

Let $\sigma^{2}_{t}(x)$ denote the posterior variance of a GP after $t-1$ observations, and let $A\subset\chi$. Assume that we have queried $f$ at $n$ points $(x_{t})_{t=1}^{n}$ of which $s$ points are in $A$. Then $\sum_{t:x_{t}\in A}\sigma_{t}^{2}(x_{t})\leq\frac{2}{\log(1+\sigma^{-2})}\Psi_{s}(A)$.

Proof.

See Lemma 8 in (Kandasamy et al. 2019). ∎

Proof for Theorem 1

Proof.

From Lemma 3, we have:

$\displaystyle r_{t}^{2}\leq 4\beta_{t}\sigma_{t}^{2}(x_{t})$ (5)

Summing over instantaneous regrets for $T$ rounds, we get:

$\displaystyle\sum_{t=1}^{T}r_{t}^{2}\leq\sum_{t=1}^{T}4\beta_{t}\sigma_{t}^{2}(x_{t})$ (6)
$\displaystyle\leq 4\beta_{T}\sum_{t=1}^{T}\sigma_{t}^{2}(x_{t})$ (7)
$\displaystyle\leq 4\beta_{T}\left(\sum_{t:x_{t}\in\chi_{g}}\sigma^{c^{2}}_{t}(x_{t})+\sum_{t:x_{t}\in\bar{\chi}_{g}}\sigma^{w^{2}}_{t}(x_{t})\right)$ (8)
$\displaystyle\leq\frac{8\beta_{T}}{\log(1+\sigma_{n}^{-2})}\left(\Psi_{T-s}(\chi_{g})+\Psi_{s}(\bar{\mathcal{Z}}_{g})\right)$ (9)

Eq. 7 follows from the monotonicity of $\beta_{t}=2\log(|\chi|\pi_{t}/\delta)$, Eq. 8 follows from the definition of $\sigma_{t}$ in Lemma 2, and the last inequality in Eq. 9 follows from Lemma 4.

Finally, from the Cauchy-Schwarz inequality, we know that $R^{2}_{T}\leq T\sum_{t=1}^{T}r^{2}_{t}$. Combining with Eq. 9, we obtain the result in Theorem 1. ∎
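Written out, the final step reads as follows. This is a sketch that assumes $R_{T}=\sum_{t=1}^{T}r_{t}$ denotes the cumulative regret, as in (Srinivas et al. 2010), and writes $C$ for the constant factor multiplying $\beta_{T}$ in Eq. 9.

```latex
\begin{align*}
R_T^2 = \Big(\sum_{t=1}^{T} r_t\Big)^2
  &\le T \sum_{t=1}^{T} r_t^2
  && \text{(Cauchy-Schwarz)} \\
  &\le C\, T\, \beta_T \big(\Psi_{T-s}(\chi_g) + \Psi_s(\bar{\mathcal{Z}}_g)\big)
  && \text{(Eq. 9)}
\end{align*}
```

Taking square roots gives $R_{T}\leq\sqrt{CT\beta_{T}(\Psi_{T-s}(\chi_{g})+\Psi_{s}(\bar{\mathcal{Z}}_{g}))}$.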

Extension to Continuous Domains

We will now derive regret bounds for the general case where $\chi\subset[0,r]^{d}$ is a $d$-dimensional compact and convex set with $r>0$. This will critically require an additional Lipschitz continuity assumption on $f$.

Theorem 5.

Suppose that kernels $\kappa^{c}$ and $\kappa^{w}$ are such that the derivatives of $\mathcal{GP}^{\text{cold}}$ and $\mathcal{GP}^{\text{warm}}$ sample paths are bounded with high probability. Precisely, for some constants $a,b>0$,

$\displaystyle P\left\{\left|\sup_{x\in\chi}\frac{\partial f}{\partial x_{j}}\right|>L\right\}\leq ae^{-(L/b)^{2}},\quad j=1,2,\dots,d.$ (10)

Pick $\delta\in(0,1)$ , and set $\beta_{t}=2\log(4\pi^{2}t^{2}/3\delta)+4d\log(dtbr\sqrt{\log(4da/\delta)})$ . Then, running JUMBO for $T$ iterations results in a sequence of candidates $(x_{t})_{t=1}^{t=T}$ for which the following holds with probability at least $1-\delta$ :

R_{T}<\sqrt{CT\beta_{T}\{\Psi_{T-s}(\chi_{g})+\Psi_{s}(\bar{\mathcal{Z}}_{g})\}}+\frac{\pi^{2}}{6},\forall T\geq 1

where $C=\nicefrac{{8}}{{\log(1+\sigma_{n}^{-2})}}$, i.e. the constant obtained by combining Lemma 3 with Lemma 4.

To start the proof, we first show that we have confidence on all the points visited by the algorithm.

Lemma 6.

Pick $\delta\in(0,1)$ and set $\beta_{t}=2\log(\pi_{t}/\delta)$ , where $\sum_{t\geq 1}\pi_{t}^{-1}=0.5$ , $\pi_{t}>0$ . Define $\mu_{t}(x)=\lambda_{t}(x)\mu^{c}_{t}(x)+(1-\lambda_{t}(x))\mu^{w}_{t}(x)$ and $\sigma_{t}(x)=\lambda_{t}(x)\sigma^{c}_{t}(x)+(1-\lambda_{t}(x))\sigma^{w}_{t}(x)$ . Then,

|f(x_{t})-\mu_{t}(x_{t})|\leq\beta_{t}^{1/2}\sigma_{t}(x_{t}),\ \forall t\geq 1

holds with probability of at least $1-\delta$ .

Proof.

Fix $t\geq 1$ and $x\in\chi$. Similar to Lemma 2, $P\{\bar{\mathcal{A}}\lor\bar{\mathcal{B}}\}\leq 2e^{-\beta_{t}/2}$. Since $e^{-\beta_{t}/2}=\delta/\pi_{t}$, using the union bound for $t\geq 1$ concludes the statement. ∎

For the purpose of analysis, we define a discretization set $\chi_{t}\subset\chi$, so that the results derived earlier can be re-applied to bound the regret in the continuous case. To enable this approach we will use conditions on $L$-Lipschitz continuity to obtain a valid confidence interval on the optimal solution $x^{\ast}$. Similar to (Srinivas et al. 2010), let us choose a discretization $\chi_{t}$ of size $\tau_{t}^{d}$ (i.e. $\tau_{t}$ uniformly spaced points per dimension in $\chi$) such that for all $x\in\chi$ the closest point to $x$ in $\chi_{t}$, $[x]_{t}$, has a distance less than some threshold. Formally, $||x-[x]_{t}||_{1}\leq\nicefrac{{rd}}{{\tau_{t}}}$.
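The discretization condition can be checked empirically with a short sketch. It assumes, for illustration only, that the domain is the full hypercube $[0,r]^{d}$ and that the grid places $\tau_{t}$ evenly spaced values per axis including both endpoints; the function names and the values of $r$, $d$, and $\tau_{t}$ are hypothetical.

```python
import random


def nearest_grid_point(x, r, tau):
    """Closest point [x]_t on a grid with tau evenly spaced values per axis in [0, r]."""
    step = r / (tau - 1)
    return [min(tau - 1, max(0, round(xi / step))) * step for xi in x]


def l1_distance(a, b):
    """L1 distance between two points given as equal-length sequences."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


r, d, tau = 2.0, 3, 10
random.seed(0)

# Largest observed distance between a random point and its grid representative.
worst = 0.0
for _ in range(10_000):
    x = [random.uniform(0.0, r) for _ in range(d)]
    worst = max(worst, l1_distance(x, nearest_grid_point(x, r, tau)))

bound = r * d / tau  # the threshold rd / tau_t from the text
```

Under these assumptions each coordinate is at most $r/(2(\tau_{t}-1))$ away from the grid, which is no larger than $r/\tau_{t}$ whenever $\tau_{t}\geq 2$, so `worst` stays below `bound`.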

Lemma 7.

Pick $\delta\in(0,1)$ and set $\beta_{t}=2\log(4\pi_{t}/\delta)+4d\log(dtbr\sqrt{\log(4da/\delta)})$ , where $\sum_{t\geq 1}\pi_{t}^{-1}=0.5$ , $\pi_{t}>0$ . Then, for all $t\geq 1$ , the regret is bounded as follows:

$\displaystyle r_{t}\leq 2\beta_{t}^{1/2}\sigma_{t}(x_{t})+\frac{1}{t^{2}}$ (11)

with probability of at least $1-\delta$.

Proof.

In light of Lemma 6, the proof follows directly from Lemma 5.8 in (Srinivas et al. 2010). ∎

Proof of Theorem 5

Proof.

From Eq. 9 in the proof of Theorem 1, we have shown that:

\sum_{t=1}^{T}4\beta_{t}\sigma_{t}^{2}(x_{t})\leq C\beta_{T}(\Psi_{T-s}(\chi_{g})+\Psi_{s}(\bar{\mathcal{Z}}_{g}))\ \ \ \ \ \forall T\geq 1.

Therefore, applying the Cauchy-Schwarz inequality over the $T$ rounds:

$$\sum_{t=1}^{T}2\beta_{t}^{1/2}\sigma_{t}(x_{t})\leq\sqrt{CT\beta_{T}(\Psi_{T-s}(\chi_{g})+\Psi_{s}(\bar{\mathcal{Z}}_{g}))}\quad\forall T\geq 1.$$

Since $\sum_{t=1}^{T}\frac{1}{t^{2}}\leq\frac{\pi^{2}}{6}$, Theorem 5 follows from Lemma 7. ∎

Appendix B Implementation Details

Pre-training Details

Figure 8 illustrates the skeleton of the architecture that was used for all experiments. The input configuration is fed to a multi-layer perceptron of $n_{l}$ layers with $n_{u}$ hidden units. Then, optionally, a dropout layer is applied to the output and the result is fed to another non-linear layer with $n_{z}$ outputs. The latent features are then mapped to the output with a linear layer. All activations are $\tanh$ .
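The architecture described above can be sketched as a plain forward pass. This is a minimal, dependency-free illustration rather than the authors' implementation: the weight initialisation, the inverted-dropout convention, the single scalar output, and all function names are assumptions introduced here.

```python
import math
import random


def dense(inputs, weights, biases, activation=None):
    """Fully connected layer; weights holds one row of coefficients per output unit."""
    out = [sum(w * x for w, x in zip(row, inputs)) + b
           for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out


def init_layer(n_in, n_out, rng):
    """Small uniform random weights and zero biases (arbitrary choice for the sketch)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_network(n_in, n_l, n_u, n_z, rng):
    """MLP with n_l hidden layers of n_u units, a latent layer of n_z units, linear head."""
    sizes = [n_in] + [n_u] * n_l
    hidden = [init_layer(sizes[i], sizes[i + 1], rng) for i in range(n_l)]
    return {"hidden": hidden,
            "latent": init_layer(n_u, n_z, rng),
            "head": init_layer(n_z, 1, rng)}


def forward(net, x, dropout_rate=0.0, rng=None):
    """Return (latent features z, scalar prediction y); all activations are tanh."""
    h = list(x)
    for weights, biases in net["hidden"]:
        h = dense(h, weights, biases, math.tanh)
    if dropout_rate > 0.0 and rng is not None:
        keep = 1.0 - dropout_rate  # optional inverted dropout, training time only
        h = [v / keep if rng.random() < keep else 0.0 for v in h]
    z = dense(h, *net["latent"], activation=math.tanh)
    y = dense(z, *net["head"])[0]
    return z, y


rng = random.Random(0)
net = build_network(n_in=6, n_l=3, n_u=32, n_z=4, rng=rng)  # n_in = 6 is hypothetical
z, y = forward(net, [0.1] * 6)
print(len(z), y)
```

The latent vector `z` is the quantity a feature-space GP would consume, while `y` is only needed during pre-training.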

For the HPO experiments, we used $n_{u}=32,n_{z}=4,n_{l}=3,\text{learning rate}=5\times 10^{-5}$, and $\text{batch size}=128$. For the circuit experiments, we used $n_{u}=200,n_{z}=32,n_{l}=3,\text{learning rate}=3\times 10^{-4},\text{batch size}=64$, and a dropout rate of $0.5$. These hyper-parameters were chosen by random search, based on the prediction accuracy of the pre-trained model on the auxiliary validation set, which comprised 20% of the overall dataset.

Details of training the Gaussian Process hyper-parameters

For both the warm and cold GP, we use a Matérn kernel, i.e. $\kappa(x,x^{\prime})=\frac{2^{1-\nu}}{\Gamma(\nu)}(\sqrt{2\nu}\,r)^{\nu}K_{\nu}(\sqrt{2\nu}\,r)$ with $\nu=2.5$, where $r=\frac{||x-x^{\prime}||}{\theta}$ and $K_{\nu}$ is the modified Bessel function of the second kind. The length scale $\theta$ and observation noise $\sigma_{n}$ are optimized in every iteration of BO by taking 100 gradient steps on the likelihood of the observations using the Adam optimizer with a learning rate of $0.1$.
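For $\nu=2.5$ the Matérn kernel has a well-known closed form that avoids the Bessel function, $(1+s+s^{2}/3)e^{-s}$ with $s=\sqrt{5}\,||x-x^{\prime}||/\theta$. The sketch below evaluates it, assuming $\theta$ acts as a length scale dividing the Euclidean distance and a unit signal variance; the function name is hypothetical.

```python
import math


def matern52(x, x_prime, theta):
    """Matern kernel with nu = 5/2, unit variance, and length scale theta (assumed)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_prime)))
    s = math.sqrt(5.0) * dist / theta
    return (1.0 + s + s * s / 3.0) * math.exp(-s)


# Identical points give a covariance of 1; it decays as the points move apart.
print(matern52([0.0, 0.0], [0.0, 0.0], theta=1.0))
print(matern52([0.0, 0.0], [1.0, 0.0], theta=1.0))
print(matern52([0.0, 0.0], [3.0, 0.0], theta=1.0))
```

A larger $\theta$ makes distant points more strongly correlated, which is why the length scale is refit at every BO iteration as new observations arrive.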

Acquisition Function Details

For all experiments, we used the Upper Confidence Bound with the exploration-exploitation hyper-parameter at round $t$ set to $10\exp(-t/T)$, where $T$ is the total iteration budget. This favors exploration initially and gradually shifts toward exploitation as the budget runs out. To optimize the acquisition function, we use the derivative-free algorithm CMA-ES (Hansen 2016).
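The decay schedule is simple enough to sketch directly. In the snippet below, the coefficient is assumed to multiply the posterior standard deviation directly in the UCB score; the text does not state whether it enters as $\beta_{t}$ or $\beta_{t}^{1/2}$, so this is an illustrative choice, and the function names are hypothetical.

```python
import math


def exploration_coefficient(t, T, initial=10.0):
    """Exploration weight at round t for a total budget of T rounds: 10 * exp(-t / T)."""
    return initial * math.exp(-t / T)


def ucb(mean, std, t, T):
    """UCB score; scaling the std directly by the coefficient is an assumption."""
    return mean + exploration_coefficient(t, T) * std


T = 50  # hypothetical budget
for t in (0, 25, 50):
    print(t, round(exploration_coefficient(t, T), 3))
```

Over the whole budget the weight falls from 10 to $10/e$, so exploration is damped but never switched off entirely.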

Dealing with Categorical Variables in HPO

We handle categorical and integer-valued variables in BO similarly to (Garrido-Merchán and Hernández-Lobato 2020). In particular, we use $\kappa^{c}(T(x),T(x^{\prime}))$ as the kernel, where $T:\chi\to\mathcal{T}$ is a deterministic transformation that maps the continuous optimization variable $x$ to a representation space $\mathcal{T}$ with a meaningful distance measure. For example, for categorical parameters, it converts a continuous input to a one-hot encoding corresponding to a choice for that parameter, and for integer-valued variables, it converts the continuous variable to the closest integer value. Similarly, in the pre-training phase, we train using $h_{\phi}(T(x))$.
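A transformation of this kind might look as follows. This sketch assumes that each categorical parameter with $k$ choices occupies $k$ continuous coordinates and is mapped to a one-hot vector at the largest coordinate, in the spirit of the cited approach; the `spec` format and the function name are hypothetical.

```python
def transform(x, spec):
    """Map a continuous vector x to the representation space.

    spec is a list of ("cont",), ("int",), or ("cat", k) entries, consumed
    left to right; a ("cat", k) entry uses k consecutive coordinates of x.
    """
    out, i = [], 0
    for kind, *args in spec:
        if kind == "cat":
            k = args[0]
            block = x[i:i + k]
            winner = max(range(k), key=lambda j: block[j])
            out.extend(1.0 if j == winner else 0.0 for j in range(k))
            i += k
        elif kind == "int":
            # Python's round() resolves exact .5 ties to the nearest even integer.
            out.append(float(round(x[i])))
            i += 1
        else:
            out.append(float(x[i]))
            i += 1
    return out


# One continuous, one integer, and one 3-way categorical parameter (hypothetical).
spec = [("cont",), ("int",), ("cat", 3)]
print(transform([0.37, 2.6, 0.1, 0.8, 0.3], spec))
```

Because the kernel only ever sees `transform(x)`, two continuous inputs that round to the same configuration are treated as identical points, which keeps the GP from modeling spurious variation between them.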

Appendix C Experimental Evidence

All the experiments were done on a quad-core desktop.

Space compression through the pretrained NN

In this experiment, we studied the latent space of a NN fed with uniformly sampled inputs for circuit design and found that 75% of the data variance is preserved in only 4 dimensions (with $n_{z}=32$), suggesting that the warm-GP operates in a compressed space.

Discussion: Choice of $l_{\alpha}$

The threshold $l_{\alpha}$ is a key design hyperparameter for defining the acquisition function in JUMBO. JUMBO with $l_{\alpha}=\infty$ reduces to GP-UCB (i.e. cold-GP only) and with $l_{\alpha}=0$ , it reduces to offline DKL (i.e. warm-GP only).

Table 1 shows the effect of different choices of $l_{\alpha}$ on the performance of the algorithm for both the HPO and circuit design problems. Small values of $l_{\alpha}$ (e.g. $\sim 0.01$) cause JUMBO to rely more on the accuracy of the warm-GP model and result in sub-optimal convergence when there is a model discrepancy between the warm-GP and the target task. On the other hand, larger values of $l_{\alpha}$ (e.g. $\sim 0.2$) cause JUMBO to give more weight to the cold-GP and rely less on prior data. We also note that there is a wide range of $l_{\alpha}$ over which JUMBO performs well relative to the other baselines, suggesting a good degree of robustness and little need for tuning in practice. Even though the optimal choice of $l_{\alpha}$ depends on the exact problem setup (e.g. $0.05$ for the circuit problem and $0.2$ for Slice localization), we have found that $l_{\alpha}=0.1$ is a good initial choice for all the problems considered.

Table 1: The average normalized simple regret at the last iteration for different values of $l_{\alpha}$ (lower is better). The scores are normalized to GP-UCB’s simple regret at the last iteration.

Task	GP-UCB	JUMBO-0.01	JUMBO-0.05	JUMBO-0.1	JUMBO-0.2	Offline DKL ( $l_{\alpha}=0$ )
Protein	1.0 $\pm$ 0.08	2.09 $\pm$ 0.00	1.45 $\pm$ 0.18	$\mathbf{0.00\pm 0.00}$	0.29 $\pm$ 0.05	0.77 $\pm$ 0.13
Parkinsons	1.0 $\pm$ 0.05	0.52 $\pm$ 0.07	0.19 $\pm$ 0.05	$\mathbf{0.07\pm 0.05}$	0.45 $\pm$ 0.05	1.0 $\pm$ 0.02
Naval	1.0 $\pm$ 0.07	0.97 $\pm$ 0.07	$\mathbf{0.46\pm 0.03}$	0.49 $\pm$ 0.04	0.78 $\pm$ 0.07	1.78 $\pm$ 0.06
Slice	1.0 $\pm$ 0.02	5.54 $\pm$ 0.5	2.87 $\pm$ 0.12	2.69 $\pm$ 0.29	$\mathbf{0.94\pm 0.17}$	13.77 $\pm$ 0.57
Circuit	1.0 $\pm$ 0.06	0.16 $\pm$ 0.00	$\mathbf{0.09\pm 0.00}$	0.18 $\pm$ 0.01	0.24 $\pm$ 0.01	1.01 $\pm$ 0.01

Ablation: Dynamic choice of $\lambda_{t}$

In this ablation we illustrate that the dynamic choice of $\lambda_{t}$ is indeed better than choosing it to be a constant value. The intuition behind it is that by choosing a constant coefficient we essentially allow the acquisition function to choose points with very high $\alpha^{\text{cold}}$ but low $\alpha^{\text{warm}}$ scores. However, $\alpha^{\text{cold}}$ should not be trusted because of the warm-start problem in BO.

Figure 10 compares JUMBO with dynamic and constant $\lambda_{t}$ on the four HPO problems. JUMBO with constant $\lambda_{t}=0.5$ prematurely converges to a sub-optimal solution in all the experiments.

Tabular Experimental Results

In this section, we present a quantitative comparison between JUMBO and the best-performing prior work for each experimental case. Table 2 also includes BOHB as another relevant multi-fidelity baseline.

BOHB combines the successive halving approach introduced in HyperBand (Li et al. 2017) with a probabilistic model that captures the density of good configurations in the input space. Unlike the other methods, BOHB employs a fixed budget and uses information beyond epoch 3. It runs multiple hyperparameter configurations in parallel and terminates a subset of them after every few epochs, based on their current validation error, until the budget is exhausted.
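The early-termination mechanism described above can be illustrated with a generic successive-halving loop. This is a simplified sketch of that one ingredient only: it omits BOHB's probabilistic model and HyperBand's bracket scheduling, and the `evaluate` callback, the reduction factor `eta`, and the toy error function are assumptions made for the example.

```python
def successive_halving(configs, evaluate, min_epochs=1, eta=2):
    """Repeatedly keep the best 1/eta of the configurations.

    evaluate(config, epochs) must return a validation error (lower is better)
    after training for the given number of epochs; the epoch budget grows by a
    factor of eta at every rung.
    """
    survivors = list(configs)
    epochs = min_epochs
    while len(survivors) > 1:
        ranked = sorted(survivors, key=lambda c: evaluate(c, epochs))
        survivors = ranked[:max(1, len(ranked) // eta)]
        epochs *= eta
    return survivors[0]


def toy_error(config, epochs):
    """Hypothetical validation error that is smallest near config = 0.3."""
    return abs(config - 0.3) + 1.0 / epochs


print(successive_halving([0.1, 0.5, 0.9, 0.3], toy_error))
```

Poor configurations are dropped after only a few epochs, so most of the training budget is spent on the candidates that still look promising.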

Table 2: Comparison of simple regret for HPO. Lower is better. On average JUMBO’s simple regret at convergence is 45% better than the state-of-the-art MBO baseline in each experiment.

Method	Protein ( $\times 10^{-3}$ )	Parkinsons ( $\times 10^{-3}$ )	Naval ( $\times 10^{-5}$ )	Slice ( $\times 10^{-4}$ )
GP-UCB	1.98	4.93	1.4	0.77
MT-BOHAMIANN	6.71	2.13	2	0.84
MT-ABLR	13.52	4.91	2.3	1.42
OfflineDKL	1.40	2.67	4.9	10.67
BOHB	6.38	3.16	2.1	0.23
GCP	7.50	3.15	3.3	0.46
JUMBO (ours)	0	0.23	0.7	0.73
