
Randomised Gaussian Process Upper Confidence Bound for Bayesian Optimisation

Julian Berk (Contact Author), Sunil Gupta, Santu Rana, Svetha Venkatesh
Applied Artificial Intelligence Institute
{jmberk, sunil.gupta, santu.rana, svetha.venkatesh}@deakin.edu.au
Abstract

In order to improve the performance of Bayesian optimisation, we develop a modified Gaussian process upper confidence bound (GP-UCB) acquisition function. This is done by sampling the exploration-exploitation trade-off parameter from a distribution. We prove that this allows the expected trade-off parameter to be altered to better suit the problem without compromising a bound on the function’s Bayesian regret. We also provide results showing that our method achieves better performance than GP-UCB in a range of real-world and synthetic problems.

1 Introduction

Global optimisation is a cornerstone of modern scientific innovation. Optimisation of alloys and other materials has allowed us to create massive vehicles that are both strong and light enough to fly. Optimisation in medical science has seen us live longer and healthier lives than previously thought possible. This optimisation usually involves a trial-and-error approach of repeated experiments with different inputs to determine which input produces the most desirable output. Unfortunately, many systems are expensive to sample, and the heuristic methods commonly used to select inputs are not sample-efficient. This can make these optimisation experiments prohibitively costly. As such, methods that can select inputs in a sample-efficient manner can lead to faster and cheaper innovation in a wide range of fields.

Bayesian optimisation is one of the most sample-efficient methods for optimising expensive, noisy systems. It has shown excellent performance on a range of practical problems, including problems in biomedical science Turgeon et al. (2016); Gonzalez et al. (2015), materials science Li et al. (2017); Ju et al. (2017), and machine learning Snoek et al. (2012); Klein et al. (2017); Xia et al. (2017). It does so by using the data from previous samples to generate a statistical model of the system. This model is then used to suggest the next input through an acquisition function. The design of the acquisition function is non-trivial and critically important to the algorithm's performance. It must balance selecting points with the goal of improving the statistical model (exploration) against selecting points with the goal of utilising the improved statistical model to find the global optimum (exploitation). In addition, the costly nature of these problems means that it is desirable to have a theoretical guarantee of the algorithm's performance.

There is a wide range of acquisition functions, all with different balances of exploration and exploitation, but in this work we focus on the Gaussian process upper confidence bound (GP-UCB) Srinivas et al. (2010). It controls its exploration-exploitation trade-off with a single hyperparameter, βt\beta_{t}. GP-UCB has strong theoretical guarantees on the overall convergence rate, but the bound is fairly loose. This causes the prescribed value of βt\beta_{t} to be too large, leading to significant over-exploration and hence poor practical performance. In practice, the theoretical guarantees need to be weakened by selecting a far smaller βt\beta_{t}.

We introduce a novel modification to the GP-UCB acquisition function that significantly improves its exploration-exploitation balance while retaining a strong convergence guarantee. This is done by sampling its trade-off parameter from a distribution that allows for a range of exploration factors. The distribution is chosen such that the regret is guaranteed to be sub-linear while the sampled βt\beta_{t} is generally smaller than that of traditional GP-UCB. This reduction leads to a direct improvement in convergence efficiency in practice. We demonstrate this improved performance over the standard GP-UCB implementation on a range of both synthetic benchmark functions and real-world applications.

In summary, our contributions are:

  • The development of a modified acquisition function: RGP-UCB.

  • A convergence analysis of Bayesian optimisation using RGP-UCB.

  • The demonstration of the performance of our method on a range of synthetic and real-world problems.

2 Background

In this section, we provide a brief overview of Bayesian optimisation, with an emphasis on acquisition functions and regret bounds. For a more in-depth overview of Bayesian optimisation, we refer readers to Rasmussen and Williams (2006) and Brochu et al. (2010).

2.1 Bayesian Optimisation

Bayesian optimisation is a method for optimising expensive, noisy black-box functions. It represents the system being optimised as an unknown function, ff. As this is black-box, it is impossible to observe directly. However, we can sample it with an input variable, xx, to obtain a noisy output, y=f(x)+ϵnoisey=f(x)+\epsilon_{noise}, where ϵnoiseN(0,σnoise2)\epsilon_{noise}\sim N(0,\sigma^{2}_{noise}) is the random noise corrupting the measurement. Bayesian optimisation seeks to efficiently find the optimal input for such systems over a bounded search space, 𝒳\mathcal{X}:

x=argmaxx𝒳f(x)x^{*}=\underset{x\in\mathcal{X}}{\arg\max}f(x) (1)

To do this, it creates a statistical model of ff using all previously sampled input-output pairs, Dt1={xi,yi}i=1t1D_{t-1}=\{x_{i},y_{i}\}_{i=1}^{t-1}. This statistical model is usually a Gaussian process, but other models can be used. The statistical model is then used to create an acquisition function, α(x)\alpha(x). This is essentially a map of our belief of how useful a given input will be for optimising the system. As such, it can be used to suggest xtx_{t}; the next input with which to sample the system. This input and its corresponding output can then be added to the previous data, Dt=Dt1{xt,yt}D_{t}=D_{t-1}\cup\{x_{t},y_{t}\}. This process can be iterated, with the Gaussian process improving with each iteration, until a stopping condition has been met. As Bayesian optimisation is generally used on expensive systems, this stopping condition is often a maximum number of experiments, TT.

2.2 Gaussian Process

The Gaussian process is the most common statistical model used in Bayesian optimisation. It models the function value at each point x𝒳x\in\mathcal{X} as a Gaussian random variable. As such, it is completely characterised by a mean function, μ(x)\mu(x), and a variance function, σ2(x)\sigma^{2}(x). However, in order to model sensibly behaved functions there must be correlation between neighbouring points. This correlation is encoded by a kernel function. There are many viable kernel functions, but one of the simplest and most popular is the squared exponential kernel, as it depends on only a single hyperparameter, the lengthscale ll. It is given by

kSE(xi,xj)=exp(xixj22l2)k_{SE}(x_{i},x_{j})=\exp\left(-\frac{\left\|x_{i}-x_{j}\right\|^{2}}{2l^{2}}\right) (2)
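As an illustration, Equation (2) can be computed directly (a minimal sketch in Python; the function name and default lengthscale are our own choices):

```python
import numpy as np

def squared_exponential_kernel(xi, xj, lengthscale=1.0):
    # k_SE(xi, xj) = exp(-||xi - xj||^2 / (2 l^2)), as in Equation (2)
    xi, xj = np.atleast_1d(xi), np.atleast_1d(xj)
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * lengthscale ** 2))
```

The kernel equals 1 when the two inputs coincide and decays towards 0 as they move apart, with the lengthscale controlling how quickly the correlation falls off.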

Using a kernel such as this, a predictive distribution, f(x)𝒩(μt(x),σt2(x))f(x)\sim\mathcal{N}(\mu_{t}(x),\sigma_{t}^{2}(x)), can be obtained by conditioning the Gaussian process on the data, DtD_{t}:

μt(x)\displaystyle\mu_{t}(x) =\displaystyle= 𝐤(𝐊t+σnoise2𝐈)1𝒚,\displaystyle\mathbf{k_{*}}(\mathbf{K}_{t}+\sigma^{2}_{noise}\mathbf{I})^{-1}\boldsymbol{y}, (3)
σt2(x)\displaystyle\sigma_{t}^{2}(x) =\displaystyle= k𝐤(𝐊t+σnoise2𝐈)1𝐤T\displaystyle k_{*\!*}-\mathbf{k_{*}}(\mathbf{K}_{t}+\sigma^{2}_{noise}\mathbf{I})^{\mathrm{-1}}\mathbf{k}_{*}^{\mathit{T}}

where 𝐊t,(i,j)=k(xi,xj)\mathbf{K}_{t,(i,j)}=k(x_{i},x_{j}) is the kernel matrix, k=kt(x,x)k_{*\!*}=k_{t}(x,x), and 𝐤=[k(x1,x),k(x2,x),,k(xt,x)]\mathbf{k}_{*}=[k(x_{1},x),k(x_{2},x),\ldots,k(x_{t},x)]. Here 𝐈\mathbf{I} is the identity matrix with the same dimensions as 𝐊t\mathbf{K}_{t}, and σnoise\sigma_{noise} is the output noise standard deviation.
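The predictive equations above can be sketched as follows (a minimal Python implementation under the squared exponential kernel; the function and argument names are ours, and no numerical safeguards such as a Cholesky factorisation or jitter are included):

```python
import numpy as np

def gp_posterior(X, y, x_star, lengthscale=1.0, noise_std=0.1):
    # Posterior mean and variance at a single query point, Equations (3).
    # X: (t, d) array of sampled inputs; y: (t,) array of noisy outputs.
    def k(a, b):
        return np.exp(-np.sum((a - b) ** 2) / (2.0 * lengthscale ** 2))

    t = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(t)] for i in range(t)])
    k_star = np.array([k(x, x_star) for x in X])
    A = K + noise_std ** 2 * np.eye(t)                 # K_t + sigma_noise^2 I
    mu = k_star @ np.linalg.solve(A, y)                # posterior mean
    var = k(x_star, x_star) - k_star @ np.linalg.solve(A, k_star)  # posterior variance
    return mu, var
```

Near observed data the posterior variance collapses towards the noise level, while far from the data it reverts to the prior variance of 1.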

2.3 Acquisition Functions

Once the Gaussian process has been generated, it must be used to create an acquisition function. This is chosen such that its global maximum in 𝒳\mathcal{X} is the best next point to sample:

xt=argmaxx𝒳αt1(x)x_{t}=\underset{x\in\mathcal{X}}{\arg\max}\alpha_{t-1}(x) (4)

However, the design of such a function is non-trivial. It must first suggest points spread over the search space to improve the Gaussian process. This is called exploration. Once the Gaussian process has been improved enough, it must then transition to suggesting points in regions that have a high probability of containing the global optimum. This is called exploitation. If an acquisition function does not explore enough, it may get stuck exploiting sub-optimal regions and never find the global optimum. However, if it explores too much, it may waste costly evaluations improving an already adequate Gaussian process. This makes balancing exploration and exploitation vital. There is a wide range of common acquisition functions, all of which have different balances of exploration and exploitation. These include entropy search (ES) by Hennig and Schuler (2012), predictive entropy search (PES) by Hernández-Lobato et al. (2014), knowledge gradient (KG) by Scott et al. (2011), and others. However, we will only consider the Gaussian process upper confidence bound (GP-UCB) by Srinivas et al. (2010), Thompson sampling by Russo and Van Roy (2014), and expected improvement (EI) by Jones et al. (1998), with the latter two only being used as baselines.

2.3.1 GP-UCB

GP-UCB Srinivas et al. (2010) is one of the most intuitive acquisition functions. It balances exploration and exploitation through a single hyperparameter, βt\beta_{t}:

αtGPUCB(x)=μt(x)+βtσt(x)\alpha_{t}^{GP-UCB}(x)=\mu_{t}(x)+\sqrt{\beta_{t}}\sigma_{t}(x) (5)

Increasing βt\beta_{t} makes the acquisition function favour points with high variance, causing more exploration. Decreasing βt\beta_{t} makes it favour points with high mean, causing more exploitation. However, βt\beta_{t} is not selected to optimally balance exploitation and exploration, but rather so that the cumulative regret is bounded. It has been proved that, assuming functions drawn from the chosen kernel satisfy

P{supx𝒳|f/xi|>L}ae(L/b)2,i=1dP\left\{\sup_{x\in\mathcal{X}}|\partial f/\partial x_{i}|>L\right\}\leq ae^{-(L/b)^{2}},\quad i=1\ldots d (6)

for some constants a,b>0a,b>0, then with probability 1δ1-\delta, the algorithm will have sub-linear regret if

βt=2log(t2π2/(3δ))+2dlog(t2dbrlog(4da/δ))\beta_{t}=2\log(t^{2}\pi^{2}/(3\delta))+2d\log\left(t^{2}dbr\sqrt{\log(4da/\delta)}\right) (7)

While the regret bound provided by this choice of βt\beta_{t} is desirable, the value is unfortunately far larger than needed. This leads to sub-optimal real-world performance due to over-exploration. In their own paper, the authors divided the suggested βt\beta_{t} by a factor of 5 to achieve better performance Srinivas et al. (2010).
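To make the scale of the prescribed βt\beta_{t} concrete, Equation (7) can be evaluated numerically (a sketch; the values of aa, bb, rr, δ\delta and the dimension dd below are illustrative choices, not values used in the paper):

```python
import numpy as np

def gp_ucb_beta(t, d, delta=0.1, a=1.0, b=1.0, r=1.0):
    # Equation (7): beta_t = 2 log(t^2 pi^2 / (3 delta))
    #                      + 2 d log(t^2 d b r sqrt(log(4 d a / delta)))
    term1 = 2 * np.log(t ** 2 * np.pi ** 2 / (3 * delta))
    term2 = 2 * d * np.log(t ** 2 * d * b * r * np.sqrt(np.log(4 * d * a / delta)))
    return term1 + term2
```

Even for modest t and d this gives a βt\beta_{t} in the tens, so the exploration weight √βt dominates the acquisition function, which is why dividing βt\beta_{t} by a constant is common in practice.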

3 Proposed Method

In this section we describe our improved GP-UCB acquisition function, randomised Gaussian process upper confidence bound (RGP-UCB), and prove that it has a sub-linear regret bound.

3.1 RGP-UCB

While the standard GP-UCB method has a desirable regret bound, it has relatively poor performance. This is due to the βt\beta_{t} used to satisfy this bound being far too large, forcing significant over-exploration. As such, a method for selecting a smaller βt\beta_{t} while maintaining a regret bound is desirable. We show that this can be done by sampling βt\beta_{t} from a distribution, and as such, we call our method randomised Gaussian process upper confidence bound (RGP-UCB). Doing so means that we bound the Bayesian regret instead of the regret, but it allows for far greater freedom in selecting βt\beta_{t}, letting it be set far smaller while still maintaining convergence guarantees. However, we require βt>0\beta_{t}>0 since a negative βt\beta_{t} will punish exploration. We also do not want our distribution to suggest a very large βt\beta_{t} as that will cause over-exploration. As such, we draw βt\beta_{t} from a Γ(κt,θ)\Gamma(\kappa_{t},\theta) distribution. Examples of this distribution can be seen in Figure 1 and the complete algorithm is shown in Algorithm 1. Much like with standard GP-UCB, the parameters of this distribution are chosen to satisfy a regret bound, as per Theorem 3. However, we show that we only need to set one of the two distribution parameters for the bound to hold. Unlike standard GP-UCB, this allows us to tune βt\beta_{t} to substantially improve our method's performance without compromising its theoretical guarantees.

Figure 1: A range of β5\beta_{5} distributions with t=5t=5. The parameters are chosen to satisfy Theorem 3. Note that increasing θ\theta shifts the distribution right, increasing exploration.
Algorithm 1 Bayesian Optimisation with RGP-UCB

Input: Dt0={xi,yi}i=1t0D_{t_{0}}=\{x_{i},y_{i}\}_{i=1}^{t_{0}}, Γ\Gamma scale parameter θ\theta, # Iterations TT, Kernel lengthscale ll

1:  for t=t0t=t_{0} to TT do
2:     Build a Gaussian Process (GP) with DtD_{t}.
3:     Set κt=log(12π(t2+1))log(1+θ/2)\kappa_{t}=\frac{\log\left(\frac{1}{\sqrt{2\pi}}{(t^{2}+1)}\right)}{\log\left(1+\theta/2\right)} and draw βt\beta_{t} from Γ(κt,θ)\Gamma(\kappa_{t},\theta)
4:     Optimise the acquisition function, αt(x)=μt(x)+βtσt(x)\alpha_{t}(x)=\mu_{t}(x)+\sqrt{\beta_{t}}\sigma_{t}(x) to obtain xt+1x_{t+1} and use it to sample yt+1=f(xt+1)+ϵny_{t+1}=f(x_{t+1})+\epsilon_{n} from the system.
5:     Augment the data, Dt+1=Dt{xt+1,yt+1}D_{t+1}=D_{t}\cup\{x_{t+1},y_{t+1}\}.
6:  end for
7:  return  x+=xix^{+}=x_{i} such that yi=maxt[1,T]yty_{i}=\underset{t\in[1,T]}{\max}y_{t}.
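Step 3 of Algorithm 1 can be sketched as follows (the function names are our own; note that κt\kappa_{t} is only positive for t ≥ 2, so at least two initial points are assumed):

```python
import numpy as np

def sample_beta_t(t, theta, rng):
    # kappa_t = log((t^2 + 1) / sqrt(2 pi)) / log(1 + theta / 2), Algorithm 1
    kappa_t = np.log((t ** 2 + 1) / np.sqrt(2 * np.pi)) / np.log(1 + theta / 2)
    # draw beta_t ~ Gamma(kappa_t, theta); numpy parametrises by shape and scale
    return rng.gamma(shape=kappa_t, scale=theta)

rng = np.random.default_rng(0)
beta_t = sample_beta_t(5, theta=1.0, rng=rng)
# the acquisition function is then alpha_t(x) = mu_t(x) + sqrt(beta_t) * sigma_t(x)
```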

3.2 Theoretical Analysis

Bayesian optimisation is commonly used in high-value problems. As such, theoretical measures of its convergence are desirable. The cumulative regret is one such measure, and is the cornerstone of GP-UCB Srinivas et al. (2010). Regret is simply the difference between the value of the current sample and the value of the global optimum. The cumulative regret is the sum of the regret over all iterations:

RT=t=1T[f(x)f(xt)]R_{T}=\sum_{t=1}^{T}\left[f(x^{*})-f(x_{t})\right] (8)

As RGP-UCB is probabilistic, we instead need to use the Bayesian regret of Russo and Van Roy (2014):

BRT=t=1T𝔼[f(x)f(xt)]BR_{T}=\sum_{t=1}^{T}\mathbb{E}\left[f(x^{*})-f(x_{t})\right] (9)

However, their proof was for Gaussian processes with a finite search space, i.e. |𝒳|<|\mathcal{X}|<\infty. As such, we follow a method similar to Srinivas et al. (2010) and Kandasamy et al. (2017) and introduce a discretisation of our search space into a τd\tau^{d} grid of equally spaced points, 𝒳𝖽𝗂𝗌\mathcal{X}_{\mathsf{dis}}. We denote by [x]τ[x]_{\tau} the closest point to xx in 𝒳𝖽𝗂𝗌\mathcal{X}_{\mathsf{dis}}.

With this, we can begin bounding the Bayesian regret of our algorithm by decomposing it into components that are easier to bound.

Lemma 1.

The Bayesian regret of a probabilistic RGP-UCB algorithm, αt1\alpha_{t-1}, over TT iterations can be decomposed as

BRT\displaystyle BR_{T} \displaystyle\leq t=1T𝔼[αt1([xt]τ)f([xt]τ)]R1\displaystyle\underbrace{\sum_{t=1}^{T}\mathbb{E}\left[\alpha_{t-1}([x_{t}]_{\tau})-f([x_{t}]_{\tau})\right]}_{R_{1}} (10)
+t=1T𝔼[f([x]τ)αt1([x]τ)]R2\displaystyle+\underbrace{\sum_{t=1}^{T}\mathbb{E}\left[f([x^{*}]_{\tau})-\alpha_{t-1}([x^{*}]_{\tau})\right]}_{R_{2}}
+t=1T𝔼[f(x)f([x]τ)]R3\displaystyle+\underbrace{\sum_{t=1}^{T}\mathbb{E}\left[f(x^{*})-f([x^{*}]_{\tau})\right]}_{R_{3}}
+t=1T𝔼[f([xt]τ)f(xt)]R4\displaystyle+\underbrace{\sum_{t=1}^{T}\mathbb{E}\left[f([x_{t}]_{\tau})-f(x_{t})\right]}_{R_{4}}
Proof.

This simply follows from the fact that, as xt=argmaxx𝒳αt1(x)x_{t}=\arg\max_{x\in\mathcal{X}}\alpha_{t-1}(x), we have that αt1(xt)αt1(x)\alpha_{t-1}(x_{t})\geq\alpha_{t-1}(x^{*}). ∎

With this decomposition, we simply need to find a bound for each term to bound the Bayesian regret. We will start with the second term.

Theorem 1.

Assuming that βt\beta_{t} is drawn from a Γ(κt,θ)\Gamma(\kappa_{t},\theta) distribution with κt=log(12π(t2+1))log(1+θ/2)\kappa_{t}=\frac{\log\left(\frac{1}{\sqrt{2\pi}}{(t^{2}+1)}\right)}{\log\left(1+\theta/2\right)}, the following bound holds

R2\displaystyle R_{2} =\displaystyle= t=1T𝔼[f([x]τ)αt1([x]τ)]\displaystyle\sum_{t=1}^{T}\mathbb{E}\left[f([x^{*}]_{\tau})-\alpha_{t-1}([x^{*}]_{\tau})\right] (11)
\displaystyle\leq t=1T1t2+1\displaystyle\sum_{t=1}^{T}\frac{1}{t^{2}+1}
Proof.

As the posterior distribution of f(x)f(x) at iteration t1t-1 is 𝒩(μt1,σt12)\mathcal{N}(\mu_{t-1},\sigma^{2}_{t-1}), the distribution of f(x)αt1(x)|βtf(x)-\alpha_{t-1}(x)|\beta_{t} is simply 𝒩(βtσt1(x),σt12(x))\mathcal{N}(-\sqrt{\beta_{t}}\sigma_{t-1}(x),\sigma_{t-1}^{2}(x)). Hence

𝔼β\displaystyle\mathbb{E}_{\beta} [𝔼f[𝟏{f(x)αt1(x)0}[f(x)αt1(x)|βt]]]\displaystyle\left[\mathbb{E}_{f}\left[\mathbf{1}\left\{f(x)-\alpha_{t-1}(x)\geq 0\right\}[f(x)-\alpha_{t-1}(x)|\beta_{t}]\right]\right] (12)
\displaystyle\leq 𝔼β[σt1(x)2πexp{βt2}]\displaystyle\mathbb{E}_{\beta}\left[\frac{\sigma_{t-1}(x)}{\sqrt{2\pi}}\exp\left\{-\frac{\beta_{t}}{2}\right\}\right]
\displaystyle\leq σt1(x)2π𝔼β[exp{βt2}]\displaystyle\frac{\sigma_{t-1}(x)}{\sqrt{2\pi}}\mathbb{E}_{\beta}\left[\exp\left\{-\frac{\beta_{t}}{2}\right\}\right]

We note that the expectation of the exponential term is simply the moment generating function of βt\beta_{t} evaluated at −1/2, which has a closed form for a gamma distribution. This lets us express our inequality as

𝔼β\displaystyle\mathbb{E}_{\beta} [𝔼f[𝟏{f(x)αt1(x)0}[f(x)αt1(x)|βt]]]\displaystyle\left[\mathbb{E}_{f}\left[\mathbf{1}\left\{f(x)-\alpha_{t-1}(x)\geq 0\right\}[f(x)-\alpha_{t-1}(x)|\beta_{t}]\right]\right] (13)
\displaystyle\leq σt1(x)2π1(1+θt1/2)κt1\displaystyle\frac{\sigma_{t-1}(x)}{\sqrt{2\pi}}\frac{1}{(1+\theta_{t-1}/2)^{\kappa_{t-1}}}

As we want this to decay at a sub-linear rate, we need

σt1(x)2π1(1+θt1/2)κt11(t2+1)|𝒳𝖽𝗂𝗌|\displaystyle\frac{\sigma_{t-1}(x)}{\sqrt{2\pi}}\frac{1}{(1+\theta_{t-1}/2)^{\kappa_{t-1}}}\leq\frac{1}{(t^{2}+1)|\mathcal{X}_{\mathsf{dis}}|}
κt1log(12πσt1(x)(t2+1)|𝒳𝖽𝗂𝗌|)log(1+θt1/2)\displaystyle\kappa_{t-1}\geq\frac{\log\left(\frac{1}{\sqrt{2\pi}}{\sigma_{t-1}(x)(t^{2}+1)|\mathcal{X}_{\mathsf{dis}}|}\right)}{\log\left(1+\theta_{t-1}/2\right)} (14)

Setting κt1=log(12π(t2+1))log(1+θt1/2)\kappa_{t-1}=\frac{\log\left(\frac{1}{\sqrt{2\pi}}{(t^{2}+1)}\right)}{\log\left(1+\theta_{t-1}/2\right)} to satisfy this, equation R2R_{2} is bounded by the following:

R2\displaystyle R_{2} =\displaystyle= t=1T𝔼[f([x]τ)αt1([x]τ)]\displaystyle\sum_{t=1}^{T}\mathbb{E}\left[f([x^{*}]_{\tau})-\alpha_{t-1}([x^{*}]_{\tau})\right] (15)
\displaystyle\leq t=1Tx𝒳𝔼βt[𝔼f[𝟏{f(x)\displaystyle\sum_{t=1}^{T}\sum_{x\in\mathcal{X}}\mathbb{E}_{\beta_{t}}[\mathbb{E}_{f}[\mathbf{1}\{f(x)
\displaystyle- αt1(x)0}[f(x)αt1(x)|βt]]]\displaystyle\alpha_{t-1}(x)\geq 0\}[f(x)-\alpha_{t-1}(x)|\beta_{t}]]]
\displaystyle\leq t=1T1t2+1\displaystyle\sum_{t=1}^{T}\frac{1}{t^{2}+1}

Next, we attempt to bound the first component.

Theorem 2.

Assuming that βt\beta_{t} is drawn from a Γ(k,θ)\Gamma(k,\theta) distribution, the following bound holds

R1\displaystyle R_{1} =\displaystyle= t=1T𝔼[αt1([xt]τ)f([xt]τ)]\displaystyle\sum_{t=1}^{T}\mathbb{E}\left[\alpha_{t-1}([x_{t}]_{\tau})-f([x_{t}]_{\tau})\right] (16)
\displaystyle\leq [1+k1F1(11T)]γ+F1(11T)\displaystyle\sqrt{\left[1+\frac{k-1}{F^{-1}\left(1-\frac{1}{T}\right)}\right]\gamma+F^{-1}\left(1-\frac{1}{T}\right)}
×Tt=1Tσt12(xt)\displaystyle\times\sqrt{T\sum_{t=1}^{T}\sigma^{2}_{t-1}(x_{t})}
Proof.

Using Jensen’s Inequality we have

R1\displaystyle R_{1} =\displaystyle= t=1T𝔼[αt1([xt]τ)f([xt]τ)]\displaystyle\sum_{t=1}^{T}\mathbb{E}\left[\alpha_{t-1}([x_{t}]_{\tau})-f([x_{t}]_{\tau})\right] (17)
=\displaystyle= t=1T𝔼βt[𝔼f[αt1([xt]τ)f([xt]τ)|βt]]\displaystyle\sum_{t=1}^{T}\mathbb{E}_{\beta_{t}}\left[\mathbb{E}_{f}\left[\alpha_{t-1}([x_{t}]_{\tau})-f([x_{t}]_{\tau})|\beta_{t}\right]\right]
=\displaystyle= t=1T𝔼βt[βtσt1([xt]τ)]\displaystyle\sum_{t=1}^{T}\mathbb{E}_{\beta_{t}}\left[\sqrt{\beta_{t}}\sigma_{t-1}([x_{t}]_{\tau})\right]
\displaystyle\leq t=1Tσt1([xt]τ)𝔼βt[βt]\displaystyle\sum_{t=1}^{T}\sigma_{t-1}([x_{t}]_{\tau})\sqrt{\mathbb{E}_{\beta_{t}}\left[\beta_{t}\right]}

We can then use the Cauchy-Schwartz inequality to get

R1\displaystyle R_{1} \displaystyle\leq t=1T𝔼βt[βt]t=1Tσt12([xt]τ)\displaystyle\sqrt{\sum_{t=1}^{T}\mathbb{E}_{\beta_{t}}\left[\beta_{t}\right]}\sqrt{\sum_{t=1}^{T}\sigma^{2}_{t-1}([x_{t}]_{\tau})} (18)
\displaystyle\leq T𝔼βt[maxtTβt]t=1Tσt12([xt]τ)\displaystyle\sqrt{T\mathbb{E}_{\beta_{t}}\left[\underset{t\leq T}{\max}\beta_{t}\right]}\sqrt{\sum_{t=1}^{T}\sigma^{2}_{t-1}([x_{t}]_{\tau})}

As βt\beta_{t} follows a gamma distribution with shape parameter κt\kappa_{t} and scale parameter θ\theta, its expected maximum over TT draws is approximately given by

𝔼[maxtTβt][1+κt1F1(11T)]γ+F1(11T)\mathbb{E}\left[\underset{t\leq T}{\max}\beta_{t}\right]\approx\left[1+\frac{\kappa_{t}-1}{F^{-1}\left(1-\frac{1}{T}\right)}\right]\gamma+F^{-1}\left(1-\frac{1}{T}\right) (19)

where γ\gamma is the Euler-Mascheroni constant and F1(x)F^{-1}(x) is the inverse CDF of β\beta. This finally gives us the following bound:

R1\displaystyle R_{1} \displaystyle\leq \displaystyle[1+k1F1(11T)]γ+F1(11T)\displaystyle\sqrt{\left[1+\frac{k-1}{F^{-1}\left(1-\frac{1}{T}\right)}\right]\gamma+F^{-1}\left(1-\frac{1}{T}\right)} (20)
×Tt=1Tσt12(xt)\displaystyle\times\sqrt{T\sum_{t=1}^{T}\sigma^{2}_{t-1}(x_{t})}

Finally, we need to bound components R3R_{3} and R4R_{4}. For these, we can use Lemma 10 from Kandasamy et al. (2017).

Lemma 2.

At step tt, for all x𝒳x\in\mathcal{X}, 𝔼[|f(x)f([x]τ)|]12t2\mathbb{E}[|f(x)-f([x]_{\tau})|]\leq\frac{1}{2t^{2}}.

This means that we have that

R3+R4\displaystyle R_{3}+R_{4} \displaystyle\leq 2t=1T12t2\displaystyle 2\sum^{T}_{t=1}\frac{1}{2t^{2}} (21)
\displaystyle\leq π26\displaystyle\frac{\pi^{2}}{6}

With this we can finally find our Bayesian regret bound.

Theorem 3.

If βt\beta_{t} is sampled from a Γ(κt,θ)\Gamma(\kappa_{t},\theta) distribution with

κt=log(12π(t2+1))log(1+θ/2)\kappa_{t}=\frac{\log\left(\frac{1}{\sqrt{2\pi}}{(t^{2}+1)}\right)}{\log\left(1+\theta/2\right)} (22)

then the RGP-UCB acquisition function has its Bayesian regret bounded by

BRT\displaystyle BR_{T} \displaystyle\leq [1+κt1F1(11T)]γ+F1(11T)\displaystyle\sqrt{\left[1+\frac{\kappa_{t}-1}{F^{-1}\left(1-\frac{1}{T}\right)}\right]\gamma+F^{-1}\left(1-\frac{1}{T}\right)} (23)
×Tt=1Tσt12(xt)+t=1T1t2+1+π26\displaystyle\times\sqrt{T\sum_{t=1}^{T}\sigma^{2}_{t-1}(x_{t})}+\sum_{t=1}^{T}\frac{1}{t^{2}+1}+\frac{\pi^{2}}{6}

where γ\gamma is the Euler-Mascheroni constant and F1(x)F^{-1}(x) is the inverse CDF of βt\beta_{t}.

Proof.

The result follows simply by combining the bounds for the various components. ∎

4 Results

In this section we present results that demonstrate the performance of RGP-UCB in comparison to other common acquisition functions. We also demonstrate the impact of varying the θ\theta parameter of the gamma distribution used to sample βt\beta_{t}. The Python code used for this paper can be found at https://github.com/jmaberk/RGPUCB.

4.1 Experimental Setup

We test our method against a selection of common acquisition functions on a range of Bayesian optimisation problems. These include a range of synthetic benchmark functions and real-world optimisation problems. In each case, the experiment was run for 40d40d iterations and repeated 10 times with 3d+13d+1 different initial points. The initial points are chosen randomly with a Latin hypercube sampling scheme Jones (2001). The methods being tested are:

  • Our randomised Gaussian process upper confidence bound with θ=8\theta=8 (RGP-UCB θ=8\theta=8).

  • Our randomised Gaussian process upper confidence bound with θ=1\theta=1 (RGP-UCB θ=1\theta=1).

  • Our randomised Gaussian process upper confidence bound with θ=0.5\theta=0.5 (RGP-UCB θ=0.5\theta=0.5).

  • Standard Gaussian process upper confidence bound (GP-UCB) Srinivas et al. (2010).

  • Expected improvement (EI) Jones et al. (1998).

  • Thompson sampling (Thompson) Russo and Van Roy (2014).

Note that we turn all functions that are traditionally minimised into maximisation problems by taking their negative for consistency. As such, higher results are always better.
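The Latin hypercube initialisation used for the starting points can be sketched in a few lines of numpy (a generic one-point-per-stratum implementation; any details beyond the basic stratification are our own choices):

```python
import numpy as np

def latin_hypercube(n, d, seed=None):
    # n points in [0, 1]^d with exactly one point in each of the n
    # equal-width strata of every dimension
    rng = np.random.default_rng(seed)
    samples = np.empty((n, d))
    for j in range(d):
        # jitter within each stratum, then shuffle strata between points
        samples[:, j] = (rng.permutation(n) + rng.uniform(size=n)) / n
    return samples
```

The resulting unit-cube points are then rescaled to the bounds of the search space 𝒳\mathcal{X} before being evaluated.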

4.2 Selection of the Trade-off Parameter

An advantage of our method is that it can change its exploration-exploitation balance without compromising its convergence guarantee. This is done by changing the θ\theta parameter in the βtΓ(κt,θ)\beta_{t}\sim\Gamma(\kappa_{t},\theta) distribution. Increasing θ\theta will increase the expected βt\beta_{t}, increasing exploration.
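Since the mean of a Γ(κt,θ)\Gamma(\kappa_{t},\theta) distribution is κtθ\kappa_{t}\theta, the effect of θ\theta on the expected trade-off parameter can be checked directly (a sketch using the κt\kappa_{t} of Theorem 3; the example values of tt and θ\theta are ours):

```python
import numpy as np

def expected_beta(t, theta):
    # E[beta_t] = kappa_t * theta for beta_t ~ Gamma(kappa_t, theta)
    kappa_t = np.log((t ** 2 + 1) / np.sqrt(2 * np.pi)) / np.log(1 + theta / 2)
    return kappa_t * theta

# larger theta gives a larger expected beta_t, and hence more exploration
means = [expected_beta(5, theta) for theta in (0.5, 1.0, 8.0)]
```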

As different problems favour different exploration-exploitation balances, we tested a range of θ\theta values on a range of different problems. In Figure 2, we show the performance of a range of θ\theta values on an exploitation-favouring problem, the Alpine 2 (5D) function, and an exploration-favouring problem, the Dropwave (2D) function.

θ\theta      Dropwave (2D)     Alpine 2 (5D)
0.1          0.738 ± 6.7e-2    78.9 ± 12.4
0.5          0.755 ± 5.0e-2    92.1 ± 12.2
1            0.754 ± 6.9e-2    77.8 ± 12.7
2            0.727 ± 8.6e-2    77.5 ± 12.5
4            0.847 ± 5.7e-2    71.5 ± 13.6
8            0.848 ± 3.3e-2    43.4 ± 9.84
16           0.814 ± 6.2e-2    45.4 ± 10.0
Figure 2: Best found values using different θ\theta parameters on an exploitation favouring function (Alpine 2) and an exploration favouring function (Dropwave).

It was found that θ=8\theta=8 gives good performance on the exploration-favouring problem above and on other similar problems tested. Likewise, θ=0.5\theta=0.5 is a good choice for exploitation-favouring problems. We also note that θ=1\theta=1 has decent performance on both problems, making it a good choice for problems where the required exploration-exploitation balance is completely unknown.

4.3 Synthetic Benchmark Functions

The first demonstration of our method's performance is the optimisation of several common synthetic benchmark functions: the Dropwave (2D), Sphere (4D), Alpine 2 (5D), and Ackley (5D) functions. (All benchmark functions use the recommended parameters from https://www.sfu.ca/~ssurjano/optimization.html.) Results for these are shown in Figure 3.

Figure 3: Performance of the RGP-UCB acquisition function compared to other common methods on a range of optimisation problems, including synthetic (left and middle) and real-world (right) examples.

Here we can see that RGP-UCB has competitive performance in all of the above cases. In general, it does significantly better than the standard GP-UCB and Thompson sampling acquisition functions. EI has better early performance in many cases, as it starts exploiting earlier. However, RGP-UCB tends to explore better and is therefore often able to beat it in the long term.

The leftmost functions were chosen to be pathological cases which disproportionately favour exploration (Dropwave 2D) and exploitation (Sphere 4D). These can be seen as best-case examples for GP-UCB and EI respectively. However, RGP-UCB is able to outperform them even on these if its θ\theta parameter is chosen properly, and does so while maintaining its convergence guarantee. This formulation of EI does not have a known regret bound, as it follows the standard implementation and hence does not satisfy the assumptions required by the current bounds Bull (2011); Ryzhov (2016); Wang and de Freitas (2014); Nguyen et al. (2017).

In the middle two plots, RGP-UCB is superior even with the conservative parametrisation of θ=1\theta=1.

4.4 Machine Learning Hyperparameter Tuning

Our first demonstration of the real-world performance of RGP-UCB is the hyperparameter tuning of a support vector regression (SVR) Drucker et al. (1997) algorithm. This is the support vector machine classification algorithm extended to regression problems, with performance measured in root mean squared error (RMSE). It has three hyperparameters: the threshold, ϵ\epsilon, the kernel parameter, γ\gamma, and a soft margin parameter, CC. All experiments are done with the public Space GA scale dataset (available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html). The results are shown in Figure 3.

We can see that the final performance of all three variants of our method exceeds that of standard GP-UCB. The high exploitation and balanced variants are competitive with EI, with the former achieving higher final performance. As with many real-world problems, SVR is known to favour higher exploitation, and is therefore an example of when the user would know to try a smaller θ\theta.

4.5 Materials Science Application: Alloy Heat Treatment

The second demonstration of RGP-UCB's performance is the optimisation of an Aluminium-Scandium alloy heat treatment simulation Robson et al. (2003). The goal of the simulation is to optimise the resulting alloy's hardness, measured in MPa. The hardening process is controlled through multiple cooking stages, each with two hyperparameters, a duration and a temperature. As we use a two-stage cooking simulation, there are a total of four hyperparameters to optimise through Bayesian optimisation. The results are shown in Figure 3.

The results are very similar to the previous SVR example, with the high-exploitation method having the best performance and the balanced method being competitive with EI.

4.6 Conclusion

We have developed a modified UCB-based acquisition function with substantially improved performance that maintains a sub-linear regret bound. We have proved that this bound holds in terms of Bayesian regret while allowing flexibility in the selection of its parameters, and we have demonstrated the impact of those parameters on performance. Moreover, we have shown that its performance is competitive with or better than that of existing methods in a range of synthetic and real-world applications.

Acknowledgements

This research was supported by an Australian Government Research Training Program (RTP) Scholarship awarded to JMA Berk, and was partially funded by the Australian Government through the Australian Research Council (ARC). Prof Venkatesh is the recipient of an ARC Australian Laureate Fellowship (FL170100006).

References

  • Brochu et al. [2010] Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
  • Bull [2011] Adam D Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12(Oct):2879–2904, 2011.
  • Drucker et al. [1997] Harris Drucker, Christopher JC Burges, Linda Kaufman, Alex J Smola, and Vladimir Vapnik. Support vector regression machines. In Advances in neural information processing systems, pages 155–161, 1997.
  • Gonzalez et al. [2015] Javier Gonzalez, Joseph Longworth, David C James, and Neil D Lawrence. Bayesian optimization for synthetic gene design. arXiv preprint arXiv:1505.01627, 2015.
  • Hennig and Schuler [2012] Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809–1837, 2012.
  • Hernández-Lobato et al. [2014] José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in neural information processing systems, pages 918–926, 2014.
  • Jones et al. [1998] Donald R Jones, Matthias Schonlau, and William J Welch. Efficient global optimization of expensive black-box functions. Journal of Global optimization, 13(4):455–492, 1998.
  • Jones [2001] D. R Jones. A taxonomy of global optimization methods based on response surfaces. Journal of global optimization, 21(4):345–383, 2001.
  • Ju et al. [2017] Shenghong Ju, Takuma Shiga, Lei Feng, Zhufeng Hou, Koji Tsuda, and Junichiro Shiomi. Designing nanostructures for phonon transport via bayesian optimization. Physical Review X, 7(2):021024, 2017.
  • Kandasamy et al. [2017] Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. Asynchronous parallel bayesian optimisation via thompson sampling. arXiv preprint arXiv:1705.09236, 2017.
  • Klein et al. [2017] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast bayesian optimization of machine learning hyperparameters on large datasets. In International Conference on Artificial Intelligence and Statistics (AISTATS 2017), pages 528–536. PMLR, 2017.
  • Li et al. [2017] Cheng Li, David Rubín de Celis Leal, Santu Rana, Sunil Gupta, Alessandra Sutti, Stewart Greenhill, Teo Slezak, Murray Height, and Svetha Venkatesh. Rapid bayesian optimisation for synthesis of short polymer fiber materials. Scientific reports, 7(1):1–10, 2017.
  • Nguyen et al. [2017] Vu Nguyen, Sunil Gupta, Santu Rana, Cheng Li, and Svetha Venkatesh. Regret for expected improvement over the best-observed value and stopping condition. In Asian Conference on Machine Learning, pages 279–294, 2017.
  • Rasmussen and Williams [2006] Carl Edward Rasmussen and Christopher KI Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • Robson et al. [2003] JD Robson, MJ Jones, and PB Prangnell. Extension of the n-model to predict competing homogeneous and heterogeneous precipitation in al-sc alloys. Acta Materialia, 51(5):1453–1468, 2003.
  • Russo and Van Roy [2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
  • Ryzhov [2016] Ilya O Ryzhov. On the convergence rates of expected improvement methods. Operations Research, 64(6):1515–1528, 2016.
  • Scott et al. [2011] Warren Scott, Peter Frazier, and Warren Powell. The correlated knowledge gradient for simulation optimization of continuous parameters using gaussian process regression. SIAM Journal on Optimization, 21(3):996–1026, 2011.
  • Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pages 2951–2959, 2012.
  • Srinivas et al. [2010] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, pages 1015–1022, 2010.
  • Turgeon et al. [2016] Martine Turgeon, Cindy Lustig, and Warren H Meck. Cognitive aging and time perception: roles of bayesian optimization and degeneracy. Frontiers in aging neuroscience, 8:102, 2016.
  • Wang and de Freitas [2014] Ziyu Wang and Nando de Freitas. Theoretical analysis of bayesian optimisation with unknown gaussian process hyper-parameters. NIPS Workshop on Bayesian Optimization, 2014.
  • Xia et al. [2017] Yufei Xia, Chuanzhe Liu, YuYing Li, and Nana Liu. A boosted decision tree approach using bayesian hyper-parameter optimization for credit scoring. Expert Systems with Applications, 78:225–241, 2017.