
Power of human-in-the-loop Bayesian optimization

Abstract

In this paper, we show that the global optimization technique Bayesian optimization can approximate a class of finite domain problems arbitrarily well and efficiently in expectation. We build on a recent article by [], which showed that there exists a Bayesian optimization algorithm that can retrieve values arbitrarily close to the optimum of a problem instance in time exponential in the size of the problem. We show that this result can be taken further if one assumes perfect knowledge of the Gaussian process prior and restricts the class of objective functions. In that case, the running time can be reduced to polynomial in the size of the problem. We note, however, that while the results are promising, the technique is theoretical and of little to no interest in practical situations, where the Gaussian prior is usually not known in advance or has to be estimated on the fly.


1 Introduction

Bayesian optimization (BO) is a technique used in global optimization (GO) to search for optimal values in continuous, combinatorial, and discrete domains. BO has a wide range of applications, and in recent years its usage in optimizing hard black-box functions has increased [ref]. BO is often used in low-data regimes where evaluation of the objective function is costly or inefficient. Such problems include robot navigation and planning [ref], tuning the hyperparameters of deep learning models [ref], predicting earthquake intensities [ref], finding optimal stock values in stock markets [ref], and many more [see refs].

The advantage of BO is that it can optimize any black-box function. That is, if one can only access function values and not, say, gradient information, BO can still be very efficient. Moreover, unlike multiple state-of-the-art optimization algorithms [ref], BO does not usually make structural assumptions about the function it optimizes. Those algorithms expect the objective function to be Lipschitz continuous [ref], unimodal [ref], or to have a certain shape near the global optimizer [refs]. BO, however, assumes that the user has some knowledge of or expectation about the shape of the objective function, which is expressed as a (Gaussian) prior. The main difference between BO and traditional GO is that in BO one is not assumed to provide, say, a differentiable function; one just has to know to some extent what kind of function it is.

In BO, one places a (Gaussian) prior on the unknown and possibly non-convex function, which reflects one's understanding of the function: whether it is continuous, differentiable, smooth, unimodal, and so forth. After the prior is set over the unknown function, an algorithm suggests points where optimal values may lie. Based on these suggestions and the corresponding function evaluations, the algorithm updates the prior into a posterior and uses the posterior to suggest even better points where the optimum might be located. In many settings, BO finds the global optimum much faster than, say, brute-force search [ref]. The suggestions, and how they are derived from the prior and posterior, are defined by an acquisition function. Multiple choices of acquisition function exist, including the upper confidence bound [ref], expected improvement [ref], and Thompson sampling [ref], to name a few.
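As a concrete illustration of this loop, the following is a minimal Python sketch assuming a zero-mean Gaussian process with a squared-exponential kernel and an upper-confidence-bound acquisition function over a discrete domain; these particular choices, and all names in the sketch, are illustrative only and not taken from any referenced algorithm.

import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel: one common (assumed) choice of GP prior."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def bo_ucb(f, domain, n_iter=20, beta=2.0, noise=1e-6):
    """Minimal BO loop: GP posterior update followed by UCB acquisition."""
    domain = np.asarray(domain, dtype=float)
    X = [domain[len(domain) // 2]]    # arbitrary first query point
    y = [f(X[0])]
    for _ in range(n_iter):
        Xa, ya = np.array(X), np.array(y)
        K_inv = np.linalg.inv(rbf(Xa, Xa) + noise * np.eye(len(Xa)))
        k_s = rbf(domain, Xa)
        mu = k_s @ K_inv @ ya                                  # posterior mean
        var = 1.0 - np.einsum('ij,jk,ik->i', k_s, K_inv, k_s)  # posterior variance
        ucb = mu + beta * np.sqrt(np.maximum(var, 0.0))
        x_next = domain[int(np.argmax(ucb))]                   # acquisition maximizer
        X.append(x_next)
        y.append(f(x_next))
    return max(y)                                              # best value found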

Multiple studies have shown that Bayesian optimization converges to the global optimum in reasonable time [refs]. [refs] showed an exponential convergence rate of Bayesian optimization to the global optimum (that is, the highest or lowest value of the function, depending on the context) if the function has a certain shape near the global optimum. [ref] showed similar results for a more restricted version of Bayesian optimization in which the function is a realization of a Gaussian process called Brownian motion. Brownian motion is commonly used to model stock market prices and their optima [refs].

While it is tempting to praise Bayesian optimization for its ability to find optimal values efficiently, one downside of the approach is that both the values found and the expected suprema (globally optimal values) depend on how well the prior of the Gaussian process is specified. If the black-box function is a realization of the prior, then BO works surprisingly well. On the other hand, a poor choice of prior can cause the algorithm not to converge at all [refs].

Furthermore, the regret [ref], which is used to characterize the convergence rate of a BO algorithm, is defined in terms of expectations: here, the ratio between the expected supremum and the expected best value found on the function. These expectations are close to the actual best values found and the actual suprema of the black-box function only if the prior is set over the function accurately.

In the next sections we complement the previous results of [refs] to obtain an exponential convergence rate for BO. We restrict ourselves to functions that can be changed while preserving the global optimum. We show that any combinatorial optimization problem with a combinatorial complexity of $O(2^n)$ can be reduced to such a function. Later, we show how to derive the exponential convergence rate with a human-in-the-loop algorithm.

2 Combinatorial problems as univariate finite domain functions

In this section, we show that any problem with a combinatorial complexity of $O(2^n)$ can be reduced to a univariate function with a domain of size $\Omega(2^n)$. The reduction is quite general, and we expect that with modifications it could be applied to problems with a combinatorial complexity of, say, $O(n!)$.

The reduction algorithm (Algorithm 1) assumes that the problem instance can be viewed as a decision tree in which a left child arc of a node implies that the decision variable at the node is assigned 0 and a right child arc implies assignment 1 (or vice versa). The function produced by the reduction takes values from a domain $D=[1,2^n]^d$, $d=1$, and uses the value passed to it to find a corresponding assignment for the combinatorial problem instance. Finally, the function evaluates the combinatorial problem with that assignment to produce a scalar value, which is then scaled by the function's scaling parameter.

The algorithm starts by ordering the decision variables (or vertices, depending on the problem instance) randomly. Second, the algorithm creates an objective function that accepts a scalar value from the domain $[1,2^n]^d$, $d=1$. Based on the value passed to the function, the function repeatedly divides the remaining domain into two equally sized halves and selects the half that contains the value. At each selection of a half, the assignment is updated for the decision variable associated with the current depth of the decision tree. When the domain has been completely split so that no more intervals can be selected, the original combinatorial problem instance is evaluated with the final assignment, and the scalar value of the evaluation (the number of vertices in a clique, the number of clauses satisfied, the number of vertices in a vertex cover, and so forth) is scaled with the scaling parameter and returned. The full procedure is shown in Algorithm 1, and a code sketch follows it.

Algorithm 1 Combinatorial problem to finite domain univariate function
1:  input: X (problem instance), V (variables), scale (scale of function values)
2:  randomly sort V
3:  create black-box function with parameters x (point to evaluate), and D (domain):
4:  a := first half of D
5:  b := second half of D
6:  current := a if x \in a else b
7:  V_1 := {1, 0} based on current (i.e. V_1 := 0 if a else 1)
8:  i := 2
9:  while current \neq single value do
10:     a := first half of current
11:     b := second half of current
12:     current := a if x \in a else b
13:     V_i := {1, 0} based on current
14:     i := i + 1
15:  end while
16:  assignment := evaluate X with V
17:  y := value of assignment (value of maximization / minimization problem)
18:  scale y based on scale
19:  return black box function (lines 3 - 18)
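For concreteness, the following is a minimal Python sketch of Algorithm 1, assuming a hypothetical problem-instance object exposing an evaluate(assignment) method; the interface and names are illustrative only.

import random

def reduce_to_univariate(instance, variables, scale):
    """Algorithm 1: reduce a combinatorial problem with 2^n possible
    assignments to a univariate black-box function on the domain [1, 2^n].
    `instance` is assumed to expose an evaluate(assignment) method that
    returns the scalar value of the maximization/minimization problem
    (hypothetical interface)."""
    order = list(variables)
    random.shuffle(order)              # line 2: randomly sort V
    n = len(order)

    def black_box(x):                  # lines 3-18: the produced function
        lo, hi = 1, 2 ** n             # the full domain D
        assignment = {}
        for v in order:                # one bisection per decision variable
            mid = (lo + hi) // 2
            if x <= mid:               # x in the first half -> assign 0
                hi = mid
                assignment[v] = 0
            else:                      # x in the second half -> assign 1
                lo = mid + 1
                assignment[v] = 1
        y = instance.evaluate(assignment)
        return scale * y               # line 18: scale the value

    return black_box                   # line 19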

2.1 Human-in-the-loop Bayesian optimization

In this section we build on a recent paper by [ref], in which the authors showed that any ratio between the expected global optimum and the expected optimum found by their Bayesian optimization (BO) algorithm can be achieved in exponential time. We complement their results to show that, for a specific type of problem, their result can be extended to a polynomial-time algorithm. We stress that the algorithm is merely theoretical and requires querying Gaussian priors from an external expert multiple times. Hence, we do not claim to solve any notoriously hard open problems in computational complexity theory. Instead, we show that with this type of human-in-the-loop approach, even the hardest problems can be solved in theory.

The key ingredient of their work is that instead of the standard simple regret (that is, the difference between the expected global optimum and the expected best value found on the function), they provided a proof for the expected approximation ratio [ref]: the ratio between the expected best value found on the function and the expected global optimum. The other useful feature of their work is that their regret bound is tight; that is, their algorithm is optimal with respect to the worst-case bound, meaning the upper bound equals the lower bound for their algorithm. As such, their work cannot directly be extended to a polynomial-time algorithm achieving a constant-factor approximation, because their bounds are only expectations (as in standard BO). This means that to obtain the exact ratio provided by their algorithm, one would have to evaluate arbitrarily many different functions. We overcome this restriction by narrowing the scope to a special type of function that can be changed arbitrarily many times while using the same problem instance domain the whole time and retaining a single globally optimal value.
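In our notation (introduced here for concreteness; $x_t$ denotes the point queried at step $t$ and $T$ the number of evaluations), the simple regret and the expected approximation ratio read

$$r_T = E\left[\sup_{x\in D} f(x)\right] - E\left[\max_{t\leq T} f(x_t)\right], \qquad \rho_T = \frac{E\left[\max_{t\leq T} f(x_t)\right]}{E\left[\sup_{x\in D} f(x)\right]}.$$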

In Algorithm 1 we showed that any combinatorial problem with a combinatorial complexity of $O(2^n)$ can be reduced to a black-box univariate function. By running the algorithm multiple times with different scaling factors, we obtain different functions from the same problem instance; this is because the algorithm always randomly sorts the variables (line 2 in Algorithm 1). We use this together with Hoeffding's inequality to obtain, with high probability, convergence to the expected supremum and to the expected best value found by [ref]. Hoeffding's inequality

$$P\left(\frac{1}{n}\sum^{n}_{i=1} X_{i}-E[X]\geq\epsilon\right)\leq e^{-\frac{2n^{2}\epsilon^{2}}{\sum^{n}_{i=1}(b-a)^{2}}} \qquad (1)$$

states that the sample mean of a random variable converges to its expected value exponentially fast. In our case, we run our reduction algorithm $m$ times and use the algorithm from [ref] to obtain sharp bounds on the expected global optimum and the expected local optimum found by the algorithm of [ref]. We then branch into the optimal regions of the domain where the global optimum is likely to reside, as in [refs]. Hence, our algorithm follows the popular branch-and-bound framework used in multiple global optimization algorithms [refs], in which the search space is shrunk after function evaluations and the intervals where the global optimizer is unlikely to reside are discarded.
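As a worked example of how (1) dictates the number of re-samplings, the following sketch (our own, not from [ref]) computes the smallest $m$ for which the sample mean deviates from its expectation by at most $\epsilon$ with probability at least $1-\delta$, assuming every run's values lie in a common interval of width $b-a$.

import math

def hoeffding_repetitions(epsilon, delta, value_range):
    """Smallest m with exp(-2 m epsilon^2 / (b - a)^2) <= delta, i.e.
    enough runs of Algorithm 1 so that, by inequality (1), the sample
    mean is within epsilon of its expectation w.p. >= 1 - delta.
    `value_range` is the width b - a of the (scaled) function values."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# e.g. values scaled into [0, 1], epsilon = 0.05, delta = 0.01:
print(hoeffding_repetitions(0.05, 0.01, 1.0))  # 922 repetitions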

Because the prior given to the Gaussian process upper-bounds the expected optimum and the expected local optimum found by the algorithm of [ref], at every expansion we query a new prior from an external expert. We assume the prior given by the expert is optimal; that is, it gives tight bounds on the global optimum, and the function with the reduced domain is indeed a realization of the expert's prior. While this is not practical, perfect knowledge of the prior is generally assumed in the BO literature [refs]. Our extension of the algorithm of [ref] is shown in Algorithm 2.

Algorithm 2 Bayesian optimization
1:  inputs: D (domain)
2:  a := first half of D
3:  b := second half of D
4:  while re-sampling and re-scaling is not done do
5:     query new covariance matrix and mean from expert for a and b
6:     solve [ref] for a
7:     solve [ref] for b
8:     re-sample and re-scale a and b
9:  end while
10:  current := a if mean upper bound of a \geq mean upper bound of b else b
11:  while not stopping condition do
12:     a := first half of current
13:     b := second half of current
14:     remove current cell
15:     while re-sampling and re-scaling is not done do
16:        query new covariance matrix and mean from expert for a and b
17:        solve [ref] for a
18:        solve [ref] for b
19:        re-sample and re-scale a and b
20:     end while
21:     current := argmax of cells (the cell with highest mean upper bound)
22:  end while
23:  return max value of any cell found by [ref]

If the prior given by the expert is correct, then Algorithm 2 converges to the global optimum in $\log_{2}(|D|)$ time. This is because Algorithm 2 always divides the cell where the optimum lies into two equally sized halves, so the rate of convergence is exponential. Of course, the algorithm could spend processing time dividing sub-optimal cells; with high probability, however, this is not the case.
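A minimal Python skeleton of this cell-splitting loop is given below. Here upper_bound stands in for both querying the expert for a fresh prior on a cell and running the BO subroutine of [ref] on it; this callable, and the integer-interval representation of cells, are hypothetical simplifications.

def human_in_the_loop_bo(domain, upper_bound, stop_width=1):
    """Skeleton of the cell-splitting loop of Algorithm 2 on an integer
    domain (lo, hi). `upper_bound(cell)` abstracts querying the expert
    for a fresh prior on the cell and solving the BO subproblem of [ref]
    for the cell's mean upper bound (hypothetical callable)."""
    lo, hi = domain
    mid = (lo + hi) // 2
    # initial split of D into halves a and b (lines 2-3)
    cells = {(lo, mid): upper_bound((lo, mid)),
             (mid + 1, hi): upper_bound((mid + 1, hi))}
    while True:
        # pick the cell with the highest mean upper bound (line 21)
        current = max(cells, key=cells.get)
        clo, chi = current
        if chi - clo < stop_width:   # stopping condition (line 11)
            return current           # the optimum is sought within this cell
        del cells[current]           # remove the expanded cell (line 14)
        mid = (clo + chi) // 2
        for half in ((clo, mid), (mid + 1, chi)):
            cells[half] = upper_bound(half)   # fresh prior + solve per half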

Lemma 1. In expectation, the cell with the highest upper bound contains the optimizer.

Proof.

In [ref], the authors proved that with either the UCB2 or the EI2 algorithm, for sufficiently large $T$, the upper bound equals the lower bound. This directly implies that if the function is drawn from a Gaussian process with a covariance matrix $\Sigma$ and mean 0, then in expectation UCB2 or EI2 will find exactly a preferred fraction of the global optimum. The higher the upper bound of a cell, the higher the value of the optimizer within it. ∎

Assumption 1: The expert always provides a correct prior for a black-box function.

Lemma 2. When running Algorithm 1 $m$ times, with high probability the optimal cell is expanded.

Proof.

By Hoeffding's inequality (1), the sample mean of a random variable converges exponentially fast to its expected value. That is, by running Algorithm 1 $m$ times and producing different functions with the same optimal value, the mean of the values obtained by UCB2 or EI2 converges exponentially fast to the expected value for the cell. By Assumption 1, the prior tightly bounds the expected values of the actual global optimum and of the local optimum found by EI2 or UCB2. Together, these imply that the cell with the highest expected optimum contains the global optimizer with high probability. ∎

By the proof of Lemma 2, Algorithm 2 always divides the optimal cell: if the cell's function is indeed a realization of the expert's prior, the expected optimum is bounded tightly, and the expectation is with high probability met through re-sampling (Hoeffding's inequality). The algorithm therefore always expands the correct cell with high probability. The optimality of the expanded cell follows from the fact that [ref] provided tight regret bounds; that is, in expectation the upper bound and the lower bound of [ref] differ by a constant $\epsilon$. This means that a cell with a lower expected supremum can be selected only if its supremum is $\epsilon$-close to the global supremum.

Of course, the approach provided here is merely theoretical. The first assumption is that all functions are indeed realizations of Gaussian processes (which might not be the case), and the second is that the expert from whom the prior is queried always has the domain knowledge to provide a correct prior (note that we expect the expert to know the prior but not the actual values of the suprema). What is more, as the domain size increases, the more priors (even in the best possible case) the expert has to provide, which is not feasible even for moderately sized problems in real life. The other disadvantage of the algorithm is that some covariance functions may produce better results and admit less strict bounds than those in [ref], because those bounds are general and not prior-dependent. This, of course, leads to a situation where, for some priors, Algorithm 2 spends more time searching sub-optimal regions of the search space.

3 Conclusions