
Computationally Efficient High-Dimensional Bayesian Optimization via Variable Selection

Yihang Shen
Department of Computational Biology
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]
&Carl Kingsford
Department of Computational Biology
Carnegie Mellon University
Pittsburgh, PA 15213
[email protected]
Corresponding author
Abstract

Bayesian Optimization (BO) is a method for globally optimizing black-box functions. While BO has been successfully applied to many scenarios, developing effective BO algorithms that scale to functions with high-dimensional domains is still a challenge. Optimizing such functions by vanilla BO is extremely time-consuming. Alternative strategies for high-dimensional BO that are based on the idea of embedding the high-dimensional space into a lower-dimensional one are sensitive to the choice of the embedding dimension, which needs to be pre-specified. We develop a new computationally efficient high-dimensional BO method that exploits variable selection. Our method is able to automatically learn axis-aligned sub-spaces, i.e., spaces containing selected variables, without requiring any pre-specified hyperparameters. We theoretically analyze the computational complexity of our algorithm. We empirically show the efficacy of our method on several synthetic and real problems.

1 Introduction

We study the problem of globally maximizing a black-box function $f(\mathbf{x})$ over an input domain $\mathcal{X}=[0,1]^{D}$, where the function has some special properties: (1) its first- and second-order derivatives are hard to compute, so gradient-based optimization algorithms are not useful; (2) additional assumptions on the function, such as convexity, are too strong to impose; and (3) the function is expensive to evaluate, so classical global optimization algorithms such as evolutionary algorithms (EAs) are not applicable.

Bayesian optimization (BO) is a popular global optimization method for the problem described above. It aims to find the input $\mathbf{x}^{*}$ that maximizes the function $f$ by sequentially acquiring queries that are likely to achieve the maximum and evaluating the function on these queries. BO has been successfully applied in many scenarios such as hyperparameter tuning (Snoek et al., 2012; Klein et al., 2017), automated machine learning (Nickson et al., 2014; Yao et al., 2018), reinforcement learning (Brochu et al., 2010; Marco et al., 2017; Wilson et al., 2014), robotics (Calandra et al., 2016; Berkenkamp et al., 2016), and chemical design (Griffiths and Hernández-Lobato, 2017; Negoescu et al., 2011). However, most problems that have been solved successfully by BO involve black-box functions with low-dimensional domains, typically with $D\leq 20$ (Frazier, 2018). Scaling BO to high-dimensional black-box functions is challenging for two reasons: (1) due to the curse of dimensionality, the global optimum is harder to find as $D$ increases; and (2) computationally, vanilla BO is extremely time-consuming on functions with large $D$. As global optimization of high-dimensional black-box functions has become a necessity in several scientific fields such as algorithm configuration (Hutter et al., 2010), computer vision (Bergstra et al., 2013) and biology (Gonzalez et al., 2015), developing new BO algorithms that can effectively optimize high-dimensional black-box functions is very important for practical applications.

A large class of algorithms for high-dimensional BO is based on the assumption that the black-box function has an effective subspace with dimension $d_{e}\ll D$ (Djolonga et al., 2013; Wang et al., 2016; Moriconi et al., 2019; Nayebi et al., 2019; Letham et al., 2020). These algorithms first embed the high-dimensional domain $\mathcal{X}$ into a space whose embedding dimension $d$ is pre-specified by the user, run vanilla BO in the embedding space to obtain the new query, and then project it back and evaluate the function $f$. These algorithms are time-efficient since BO is done in a low-dimensional space. Wang et al. (2016) prove that if $d\geq d_{e}$, then with probability $1$ the embedding space contains a global optimum. However, since $d_{e}$ is usually unknown, it is difficult for users to set a suitable $d$. Previous work such as Eriksson and Jankowiak (2021) shows that different settings of $d$ impact the performance of embedding-based algorithms, and there has been little work on how to choose $d$ heuristically. Letham et al. (2020) also point out that when the optimal point in the embedding space is projected back to the original space, the projected point is not guaranteed to lie inside $\mathcal{X}$, so these algorithms may fail to find an optimum within the input domain.

We develop a new algorithm, called VS-BO (Variable Selection Bayesian Optimization), to address the issues mentioned above. Our method is based on the assumption that the $D$ variables (elements) of the input $\mathbf{x}$ can be divided into two disjoint sets $\mathbf{x}=\{\mathbf{x}_{ipt},\mathbf{x}_{nipt}\}$: (1) $\mathbf{x}_{ipt}$, the important variables, which have significant effects on the output value of $f$; and (2) $\mathbf{x}_{nipt}$, the unimportant variables, which have little or no effect on the output. Previous work such as Hutter et al. (2014) shows that the performance of many machine learning methods is strongly affected by only a small subset of hyperparameters, supporting this assumption. We propose a robust strategy to identify $\mathbf{x}_{ipt}$ and run BO on the space of $\mathbf{x}_{ipt}$ to reduce time consumption. In particular, our method learns the dimension of $\mathbf{x}_{ipt}$ automatically, so there is no need to pre-specify the hyperparameter $d$ as in embedding-based algorithms. Since the space of $\mathbf{x}_{ipt}$ is axis-aligned, issues caused by space projection do not arise in our method. We theoretically analyze the computational complexity of VS-BO, showing that our method decreases the computational complexity of both fitting the Gaussian process (GP) and optimizing the acquisition function. We formalize the assumption that some variables of the input are important while others are unimportant, and derive the regret bound of our method. Finally, we empirically show the good performance of VS-BO on several synthetic and real problems.

The source code to reproduce the results from this study can be found at https://github.com/Kingsford-Group/vsbo. This work has been accepted at AutoML 2023; the camera-ready version is available at https://openreview.net/forum?id=QXKWSM0rFCK1.

2 Related work

The basic framework of BO has two steps in each iteration. First, a GP is used as the surrogate to model $f$ based on all the previous query-output pairs $(\mathbf{x}^{1:n},y^{1:n})$:

$$y^{1:n}\sim\mathcal{N}\left(\mathbf{0},\,K(\mathbf{x}^{1:n},\Theta)+\sigma^{2}_{0}\mathbf{I}\right),$$

Here $y^{1:n}=[y^{1},\dots,y^{n}]$ is an $n$-dimensional vector, $y^{i}=f(\mathbf{x}^{i})+\epsilon^{i}$ is the output of $f$ with random noise $\epsilon^{i}\sim\mathcal{N}(0,\sigma_{0}^{2})$, and $K(\mathbf{x}^{1:n},\Theta)$ is an $n\times n$ covariance matrix whose entry $K_{i,j}=k(\mathbf{x}^{i},\mathbf{x}^{j},\Theta)$ is the value of a kernel function $k$, where $\mathbf{x}^{i}$ and $\mathbf{x}^{j}$ are the $i$-th and $j$-th queries respectively. $\Theta$ and $\sigma_{0}$ are parameters of the GP that are optimized at each iteration, and $\mathbf{I}$ is the $n\times n$ identity matrix. A detailed description of GPs and their applications can be found in Williams and Rasmussen (2006).

Given a new input $\mathbf{x}^{\prime}$, we can compute the posterior distribution of $f(\mathbf{x}^{\prime})$ from the GP, which is again Gaussian with mean $\mu(\mathbf{x}^{\prime}\mid\mathbf{x}^{1:n},y^{1:n})$ and variance $\sigma^{2}(\mathbf{x}^{\prime}\mid\mathbf{x}^{1:n})$ of the following forms:

$$\mu(\mathbf{x}^{\prime}\mid\mathbf{x}^{1:n},y^{1:n})=\mathbf{k}(\mathbf{x}^{\prime},\mathbf{x}^{1:n})[K(\mathbf{x}^{1:n},\Theta)+\sigma^{2}_{0}\mathbf{I}]^{-1}(y^{1:n})^{\top}$$
$$\sigma^{2}(\mathbf{x}^{\prime}\mid\mathbf{x}^{1:n})=k(\mathbf{x}^{\prime},\mathbf{x}^{\prime},\Theta)-\mathbf{k}(\mathbf{x}^{\prime},\mathbf{x}^{1:n})[K(\mathbf{x}^{1:n},\Theta)+\sigma^{2}_{0}\mathbf{I}]^{-1}\mathbf{k}(\mathbf{x}^{\prime},\mathbf{x}^{1:n})^{\top}$$

Here, $\mathbf{k}(\mathbf{x}^{\prime},\mathbf{x}^{1:n})=[k(\mathbf{x}^{\prime},\mathbf{x}^{1},\Theta),\dots,k(\mathbf{x}^{\prime},\mathbf{x}^{n},\Theta)]$ is an $n$-dimensional vector.

The second step of BO is to use $\mu$ and $\sigma$ to construct an acquisition function $acq$ and maximize it to get the new query $\mathbf{x}^{new}$, on which the function $f$ is evaluated to obtain the new pair $(\mathbf{x}^{new},y^{new})$:

$$\mathbf{x}^{new}=\operatorname*{argmax}_{\mathbf{x}^{\prime}\in\mathcal{X}}\;acq(\mu(\mathbf{x}^{\prime}\mid\mathbf{x}^{1:n},y^{1:n}),\sigma(\mathbf{x}^{\prime}\mid\mathbf{x}^{1:n})).$$
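For concreteness, the two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation (which uses BoTorch): the SE kernel with fixed hyperparameters, the UCB acquisition $\mu+\beta\sigma$, and the random candidate search used as the inner maximizer are all illustrative choices, and every name and constant below is ours.

```python
# Minimal sketch of one BO iteration (GP posterior + UCB acquisition),
# assuming a fixed SE kernel; real BO refits Theta and sigma_0 each iteration.
import numpy as np

def se_kernel(A, B, rho=1.0, alpha0=1.0):
    # Squared-exponential kernel between row-stacked inputs A (m x D), B (n x D).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return alpha0**2 * np.exp(-0.5 * rho**2 * d2)

def gp_posterior(Xq, X, y, sigma0=0.1):
    # Posterior mean and standard deviation of f at query points Xq given (X, y).
    K = se_kernel(X, X) + sigma0**2 * np.eye(len(X))
    Kinv = np.linalg.inv(K)
    kq = se_kernel(Xq, X)                       # cross-covariances, (q x n)
    mu = kq @ Kinv @ y
    var = se_kernel(Xq, Xq).diagonal() - np.einsum('ij,jk,ik->i', kq, Kinv, kq)
    return mu, np.sqrt(np.maximum(var, 1e-12))

def ucb_step(X, y, D, beta=2.0, n_cand=2048, rng=None):
    # Maximize the UCB acquisition over random candidates in [0, 1]^D.
    if rng is None:
        rng = np.random.default_rng(0)
    cand = rng.random((n_cand, D))
    mu, sd = gp_posterior(cand, X, y)
    return cand[np.argmax(mu + beta * sd)]

rng = np.random.default_rng(1)
f = lambda x: -((x - 0.3) ** 2).sum()          # toy black-box function
X = rng.random((5, 2)); y = np.array([f(x) for x in X])
x_new = ucb_step(X, y, D=2, rng=rng)           # next query in [0, 1]^2
```

In practice the inner maximization is done with a gradient-based optimizer (e.g. L-BFGS) rather than random search; random candidates keep the sketch short.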

A wide variety of methods related to high-dimensional BO have been proposed, most of them based on extra assumptions about the intrinsic structure of the domain $\mathcal{X}$ or the function $f$. As mentioned in the previous section, a considerable body of algorithms is based on the assumption that the black-box function has an effective subspace with a significantly smaller dimension than $\mathcal{X}$. Among them, REMBO (Wang et al., 2016) uses a randomly generated matrix as the projection operator to embed $\mathcal{X}$ into a low-dimensional subspace. SI-BO (Djolonga et al., 2013), DSA (Ulmasov et al., 2016) and MGPC-BO (Moriconi et al., 2019) propose different ways to learn the projection operator from data, whose major shortcoming is that a large number of data points are required to make the learning process accurate. HeSBO (Nayebi et al., 2019) uses a hashing-based method for subspace embedding. Finally, ALEBO (Letham et al., 2020) aims to improve the performance of REMBO with several novel refinements.

Another assumption is that the black-box function has an additive structure. Kandasamy et al. (2015) first developed a high-dimensional BO algorithm, called Add-GP, adopting this assumption. They derive a simplified acquisition function and prove a regret bound that depends linearly on the dimension. Their framework is subsequently generalized by Li et al. (2016); Wang et al. (2017) and Rolland et al. (2018).

As described in the previous section, our method is based on the assumption that some variables are more “important" than others, which is similar to axis-aligned subspace embedding. Several previous works propose different methods to choose axis-aligned subspaces in high-dimensional BO. Li et al. (2016) use the idea of dropout: at each iteration of BO, a subset of variables is randomly chosen and optimized, whereas our work chooses the variables that are important rather than choosing at random. Eriksson and Jankowiak (2021) develop a method called SAASBO based on Bayesian inference. SAASBO defines a prior distribution for each parameter in the kernel function $k$, and at each iteration the parameters are sampled from their posterior distributions and used in the step of optimizing the acquisition function. Since those priors concentrate the parameters near zero, the method is able to learn a sparse axis-aligned subspace (SAAS) during the BO process. As with vanilla BO, the main drawback of SAASBO is that it is very time-consuming. While it is traditionally assumed that the function $f$ is so expensive to evaluate that the runtime of BO itself need not be considered, previous work such as Ulmasov et al. (2016) points out that in some application scenarios the runtime of BO cannot be neglected. Spagnol et al. (2019) propose a high-dimensional BO framework similar to ours; they use the Hilbert-Schmidt independence criterion (HSIC) to select variables, and use the chosen variables to do BO. However, they do not provide a comprehensive comparison with other high-dimensional BO methods: their method is only compared with that of Li et al. (2016) on several synthetic functions. In addition, they do not provide any theoretical analysis.

Algorithm 1 VS-BO
1:Input: $f(\mathbf{x})$, $\mathcal{X}=[0,1]^{D}$, $N_{init}$, $N$, $N_{vs}$
2:Output: Approximate maximizer $\mathbf{x}^{max}$
3:Initialize the set of important variables to all variables in $\mathbf{x}$, i.e., $\mathbf{x}_{ipt}=\mathbf{x}$ and $\mathbf{x}_{nipt}=\emptyset$
4:Uniformly sample $N_{init}$ points $\mathbf{x}^{i}$ and evaluate $y^{i}=f(\mathbf{x}^{i})$; let $\mathcal{D}=\{(\mathbf{x}^{i},y^{i})\}_{i=1}^{N_{init}}$
5:Initialize the distribution $p(\mathbf{x}\mid\mathcal{D})$
6:for $t=N_{init}+1,N_{init}+2,\ldots,N_{init}+N$ do
7:     if mod($t-N_{init}$, $N_{vs}$) = 0 then
8:         Perform variable selection to update $\mathbf{x}_{ipt}$ and let $\mathbf{x}_{nipt}=\mathbf{x}\setminus\mathbf{x}_{ipt}$ (Algorithm 2)
9:         Update $p(\mathbf{x}\mid\mathcal{D})$, then derive the conditional distribution $p(\mathbf{x}_{nipt}\mid\mathbf{x}_{ipt},\mathcal{D})$
10:     end if
11:     Fit a GP to $\mathcal{D}_{ipt}:=\{(\mathbf{x}_{ipt}^{i},y^{i})\}_{i=1}^{t-1}$
12:     Maximize the acquisition function to obtain $\mathbf{x}_{ipt}^{t}$
13:     Sample $\mathbf{x}_{nipt}^{t}$ from $p(\mathbf{x}_{nipt}\mid\mathbf{x}_{ipt}^{t},\mathcal{D})$
14:     Evaluate $y^{t}=f(\mathbf{x}^{t})+\epsilon^{t}=f(\{\mathbf{x}_{ipt}^{t},\mathbf{x}_{nipt}^{t}\})+\epsilon^{t}$ and update $\mathcal{D}=\mathcal{D}\cup\{(\mathbf{x}^{t},y^{t})\}$
15:end for
16:return $\mathbf{x}^{max}$, the $\mathbf{x}^{i}$ with maximal $y^{i}$

3 Framework of VS-BO

Given the black-box function $f(\mathbf{x}):\mathcal{X}\to\mathbb{R}$ on the domain $\mathcal{X}=[0,1]^{D}$ with large $D$, the goal of high-dimensional BO is to find the maximizer $\mathbf{x}^{*}=\operatorname*{argmax}_{\mathbf{x}\in\mathcal{X}}f(\mathbf{x})$ efficiently. As mentioned in the introduction, VS-BO is based on the assumption that the variables in $\mathbf{x}$ can be divided into important variables $\mathbf{x}_{ipt}$ and unimportant variables $\mathbf{x}_{nipt}$, and the algorithm uses different strategies to set the values of the variables in the two sets.

The high-level framework of VS-BO (Algorithm 1) is similar to Spagnol et al. (2019). Every $N_{vs}$ iterations, VS-BO updates $\mathbf{x}_{ipt}$ and $\mathbf{x}_{nipt}$ (line 8 in Algorithm 1); at every BO iteration $t$, only variables in $\mathbf{x}_{ipt}$ are used to fit the GP (line 11 in Algorithm 1), and the new query of important variables $\mathbf{x}_{ipt}^{t}$ is obtained by maximizing the acquisition function (line 12 in Algorithm 1). Unlike Spagnol et al. (2019), VS-BO learns a conditional distribution $p(\mathbf{x}_{nipt}\mid\mathbf{x}_{ipt},\mathcal{D})$ from the existing query-output pairs $\mathcal{D}$ (lines 5 and 9 in Algorithm 1). This distribution is used to choose values of $\mathbf{x}_{nipt}$ that make $f(\mathbf{x})$ large when $\mathbf{x}_{ipt}$ is fixed. Hence, once $\mathbf{x}_{ipt}^{t}$ is obtained, the algorithm samples $\mathbf{x}_{nipt}^{t}$ from $p(\mathbf{x}_{nipt}\mid\mathbf{x}_{ipt}^{t},\mathcal{D})$ (line 13 in Algorithm 1), concatenates it with $\mathbf{x}_{ipt}^{t}$, and evaluates $f(\{\mathbf{x}_{ipt}^{t},\mathbf{x}_{nipt}^{t}\})$.

Compared to Spagnol et al. (2019), our method is novel in three respects. First, we propose a new variable selection method that takes full advantage of the information in the fitted GP model and has no hyperparameters that need to be pre-specified. Second, we develop a new mechanism, called VS-momentum, to improve the robustness of variable selection. Finally, we integrate an evolutionary algorithm into the BO framework to make the sampling of unimportant variables more precise. The following subsections describe these three points in detail.

Algorithm 2 Variable Selection (line 8 in Algorithm 1)
1:Input: $\mathcal{D}=\{(\mathbf{x}^{i},y^{i})\}_{i=1}^{t}$
2:Output: Set of important variables $\mathbf{x}_{ipt}$
3:Fit a GP to $\mathcal{D}$ and calculate the importance scores $IS$ of the variables, where $IS[i]$ is the importance score of the $i$-th variable
4:Sort variables by importance score, $[\mathbf{x}_{s(1)},\dots,\mathbf{x}_{s(D)}]$, from most to least important
5:for $m=1,2,\ldots,D$ do \triangleright Stepwise forward selection
6:     Fit a GP to $\mathcal{D}_{m}:=\{(\mathbf{x}_{s(1):s(m)}^{i},y^{i})\}_{i=1}^{t}$, where $\mathbf{x}_{s(1):s(m)}^{i}$ is the $i$-th input restricted to the $m$ most important variables; let $L_{m}$ be the final negative marginal log likelihood
7:     if $m<3$ then
8:         continue
9:     else if $L_{m-1}-L_{m}\leq 0$ or $L_{m-1}-L_{m}<\frac{L_{m-2}-L_{m-1}}{10}$ then
10:         break
11:     end if
12:end for
13:return $\mathbf{x}_{ipt}=\{\mathbf{x}_{s(1)},\dots,\mathbf{x}_{s(m-1)}\}$

3.1 Variable selection

The variable selection step in VS-BO (Algorithm 2) has two substeps: (1) calculate the importance score ($IS$) of each variable (line 3 in Algorithm 2), and (2) perform stepwise-forward variable selection (Derksen and Keselman, 1992) according to the importance scores.

For step one, we develop a gradient-based $IS$ calculation method, called Grad-IS, inspired by Paananen et al. (2019). Intuitively, if the partial derivative of the function $f$ with respect to a variable is large on average, then that variable ought to be important. Since the derivative of $f$ is unknown, VS-BO instead estimates the expectation of the gradient of the posterior mean of a fitted GP model, normalized by the posterior standard deviation:

$$IS=\mathbb{E}_{\mathbf{x}\sim\mathrm{Unif}(\mathcal{X})}\left[\frac{\nabla_{\mathbf{x}}\mathbb{E}_{p(f(\mathbf{x})\mid\mathbf{x},\mathcal{D})}\left[f(\mathbf{x})\right]}{\sqrt{\mathrm{Var}_{p(f(\mathbf{x})\mid\mathbf{x},\mathcal{D})}\left[f(\mathbf{x})\right]}}\right]=\mathbb{E}_{\mathbf{x}\sim\mathrm{Unif}(\mathcal{X})}\left[\frac{\nabla_{\mathbf{x}}\mu(\mathbf{x}\mid\mathcal{D})}{\sigma(\mathbf{x}\mid\mathcal{D})}\right]\approx\frac{1}{N_{is}}\sum_{k=1}^{N_{is}}\frac{\nabla_{\mathbf{x}}\mu(\mathbf{x}^{k}\mid\mathcal{D})}{\sigma(\mathbf{x}^{k}\mid\mathcal{D})},\quad\mathbf{x}^{k}\stackrel{\mathrm{i.i.d.}}{\sim}\mathrm{Unif}(\mathcal{X}).$$

Here, both $\nabla_{\mathbf{x}}\mu(\cdot\mid\mathcal{D})$ and $\sigma(\cdot\mid\mathcal{D})$ have explicit forms. Both Grad-IS and the Kullback-Leibler divergence (KLD)-based method in Paananen et al. (2019) estimate the same expectation above. Since the KLD method only calculates approximate derivatives around the chosen points in $\mathcal{D}$, which are always unevenly distributed, it is a biased estimator, whereas our importance score estimator is unbiased.

Each time the algorithm fits a GP to the existing query-output pairs, the marginal log likelihood (MLL) of the GP is maximized by updating the parameters $\Theta$ and $\sigma_{0}$. VS-BO takes the negative MLL as the loss and uses its value as the stopping criterion of the stepwise-forward selection. More specifically, VS-BO selects variables sequentially in order of importance score; each time a new variable is added, the algorithm fits a GP using only the chosen variables and records the new final loss (line 6 in Algorithm 2). If the new loss is nearly identical to the previous loss (the loss of the fitted GP without the new variable), the selection stops (line 9 in Algorithm 2) and the already-chosen variables are taken as the important variables.
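The selection loop and its stopping rule can be sketched as follows. For brevity this sketch computes the negative MLL of a fixed-hyperparameter SE-kernel GP as the loss $L_m$, whereas Algorithm 2 refits (optimizes) the GP hyperparameters at each step; function and variable names are ours.

```python
# Sketch of stepwise-forward selection with the negative-MLL stopping rule
# of Algorithm 2 (loss computed with fixed GP hyperparameters for brevity).
import numpy as np

def neg_mll(X, y, sigma0=0.1):
    # Negative marginal log likelihood of a zero-mean SE-kernel GP.
    n = len(y)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-0.5 * d2) + sigma0**2 * np.eye(n)
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * (y @ np.linalg.solve(K, y) + logdet + n * np.log(2 * np.pi))

def forward_select(X, y, order):
    # 'order' ranks the variables by importance score, most important first.
    losses = []
    for m in range(1, len(order) + 1):
        losses.append(neg_mll(X[:, order[:m]], y))   # L_m with m variables
        if m >= 3:
            d_new = losses[-2] - losses[-1]          # L_{m-1} - L_m
            d_prev = losses[-3] - losses[-2]         # L_{m-2} - L_{m-1}
            if d_new <= 0 or d_new < d_prev / 10:    # stopping rule (line 9)
                return order[:m - 1]                 # drop the last variable
    return order
```

The returned set is always a prefix of the importance ranking, and contains at least two variables because the rule is only checked from $m=3$ onward.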

Consider the squared exponential (SE) kernel, a common kernel choice for GPs, which is given by

$$k(\mathbf{x},\mathbf{x}^{\prime},\Theta=\{\rho^{2}_{1:D},\alpha_{0}^{2}\})=\alpha_{0}^{2}\exp\left(-\frac{1}{2}\sum_{i=1}^{D}\rho_{i}^{2}(\mathbf{x}_{i}-\mathbf{x}^{\prime}_{i})^{2}\right),$$

where $\rho_{i}^{2}$ is the inverse squared length scale of the $i$-th variable. On the one hand, when only a small subset of the variables in $\mathbf{x}$ is important, variable selection acts like an $L_{0}$ regularization on the GP fitting step. Let $\mathbf{\rho}=[\rho^{2}_{1},\dots,\rho^{2}_{D}]$; the variable selection step chooses a subset of variables and sets $\rho_{i}^{2}=0$ whenever the $i$-th variable is in $\mathbf{x}_{nipt}$, making $\lVert\mathbf{\rho}\rVert_{0}$ small. Therefore, fitting a GP using only the variables in $\mathbf{x}_{ipt}$ is similar to learning a kernel function with sparse parameters. On the other hand, in the worst case where every variable is equally important, $\mathbf{x}_{ipt}$ is likely to contain nearly all the variables in $\mathbf{x}$, and VS-BO degenerates to vanilla BO.
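The sparsity view can be checked in a couple of lines: setting $\rho_i^{2}=0$ makes the SE kernel ignore the $i$-th coordinate entirely, so a GP fit on $\mathbf{x}_{ipt}$ alone is equivalent to one with a zero inverse length scale on every unimportant variable (values below are toy numbers).

```python
# Sketch: with rho_i^2 = 0, the SE kernel is blind to coordinate i, which
# is why dropping a variable equals a sparse (L0-like) kernel parameter.
import numpy as np

def se_kernel(x, xp, rho2, alpha0=1.0):
    return alpha0**2 * np.exp(-0.5 * np.sum(rho2 * (x - xp) ** 2))

rho2 = np.array([4.0, 1.0, 0.0])        # third variable made "unimportant"
x  = np.array([0.2, 0.5, 0.1])
xp = np.array([0.4, 0.3, 0.9])          # differs from x in all coordinates
xq = np.array([0.4, 0.3, 0.1])          # same as xp except third coordinate
# The kernel value ignores the zero-weight coordinate entirely:
print(np.isclose(se_kernel(x, xp, rho2), se_kernel(x, xq, rho2)))  # True
```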

3.2 Momentum mechanism in variable selection

The idea of VS-momentum is to some extent similar to momentum in stochastic gradient descent (Loizou and Richtárik, 2017). Intuitively, the queries obtained after a variable selection step give extra information on the accuracy of that selection. If a new empirical maximizer is found, then the variable selection step is likely to have found truly important variables, so most of these variables should be kept at the next variable selection step. Otherwise, most should be removed and new variables added.

More specifically, we say that the variable selection at iteration $t+N_{vs}$ is in the accurate case when $\max_{k\in\{t+1,\dots,t+N_{vs}\}}y^{k}>\max_{k\in\{1,\dots,t\}}y^{k}$; otherwise it is in the inaccurate case. In the accurate case, VS-BO first uses a recursive feature elimination (RFE)-based algorithm to remove redundant variables from the $\mathbf{x}_{ipt}$ selected at iteration $t$, then adds new variables to the remainder only if the loss decreases markedly (Figure 1a). In the inaccurate case, variables selected at $t$ are not considered at $t+N_{vs}$ unless they still obtain very high importance scores at $t+N_{vs}$ (marked by the blue box in Figure 1b). New variables are added via the stepwise-forward algorithm. The details of variable selection with the momentum mechanism are described in section A of the appendix.
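The accurate/inaccurate test itself is a one-liner; a small sketch (our own names, outputs made up) makes the condition concrete:

```python
# Sketch of the VS-momentum accuracy test: a variable-selection round at
# iteration t + N_vs is "accurate" iff a new best output appeared among
# the most recent N_vs queries.
def vs_accurate(ys, n_vs):
    # ys: all outputs y^1 .. y^{t+N_vs} observed so far, in query order.
    return max(ys[-n_vs:]) > max(ys[:-n_vs])

print(vs_accurate([1.0, 2.0, 1.5, 2.5, 2.2], n_vs=2))  # True: 2.5 > 2.0
print(vs_accurate([1.0, 3.0, 1.5, 2.5, 2.2], n_vs=2))  # False: 2.5 < 3.0
```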

Figure 1: Momentum mechanism in VS-BO. (a) Accurate case, RFE is first used to remove redundant variables, and then new variables are added. (b) Inaccurate case, most variables are removed except those that are considered very important in both variable selection steps (blue box). New variables are then added.

3.3 Sampling for unimportant variables

We propose a method based on the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) to obtain new values of the unimportant variables at each iteration. CMA-ES is an evolutionary algorithm for numerical function optimization. In each generation $k$, the algorithm samples new offspring from a multivariate Gaussian distribution $\mathcal{N}\left(m^{(k-1)},(\sigma^{(k-1)})^{2}\right)$ and updates $m^{(k-1)}$ and $(\sigma^{(k-1)})^{2}$ based on these new samples and their corresponding function values. Details of this algorithm can be found in Hansen (2016).

Following the same approach as CMA-ES, VS-BO uses the initialization data $\{(\mathbf{x}^{i},y^{i})\}_{i=1}^{N_{init}}$ to initialize the multivariate Gaussian distribution $p(\mathbf{x}\mid\mathcal{D})$ (line 5 in Algorithm 1), and every $N_{vs}$ iterations it updates the distribution based on the new query-output pairs (line 9 in Algorithm 1). By the properties of the Gaussian distribution, the conditional distribution $p(\mathbf{x}_{nipt}\mid\mathbf{x}_{ipt},\mathcal{D})$ is easily derived and is also multivariate Gaussian. Therefore, once $\mathbf{x}_{ipt}^{t}$ is obtained, $\mathbf{x}_{nipt}^{t}$ can be sampled from $p(\mathbf{x}_{nipt}\mid\mathbf{x}_{ipt}^{t},\mathcal{D})$ (line 13 in Algorithm 1).
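The Gaussian conditioning step can be sketched with the standard conditional-normal formulas (Schur complement); the mean, covariance, and index sets below are toy values, not learned from any data.

```python
# Sketch of sampling the unimportant variables: condition the joint
# multivariate normal p(x | D) on the chosen important variables x_ipt,
# using mu_{n|i} = m_n + S_ni S_ii^{-1} (x_ipt - m_i) and the Schur
# complement S_nn - S_ni S_ii^{-1} S_in for the conditional covariance.
import numpy as np

def sample_nipt(mean, cov, ipt_idx, nipt_idx, x_ipt, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    mi, mn = mean[ipt_idx], mean[nipt_idx]
    Sii = cov[np.ix_(ipt_idx, ipt_idx)]
    Sni = cov[np.ix_(nipt_idx, ipt_idx)]
    Snn = cov[np.ix_(nipt_idx, nipt_idx)]
    A = Sni @ np.linalg.inv(Sii)
    cond_mean = mn + A @ (x_ipt - mi)
    cond_cov = Snn - A @ Sni.T          # Schur complement, symmetric PSD
    return rng.multivariate_normal(cond_mean, cond_cov)

mean = np.zeros(4)
cov = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))   # correlated toy covariance
x_nipt = sample_nipt(mean, cov, [0, 1], [2, 3], x_ipt=np.array([0.3, 0.6]))
```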

Compared to BO, updating the evolutionary algorithm and obtaining new queries is much faster, although these queries are less precise than those from BO. VS-BO exploits the strengths of both methods by using them on different variables. Important variables are crucial to the function value, so VS-BO uses the BO framework on them to obtain precise queries. Unimportant variables do not affect the function value much, so there is no need to spend a large time budget searching for extremely precise queries; they are determined by CMA-ES to reduce runtime. In addition, when the variable selection step is inaccurate, VS-BO degenerates to an algorithm similar to CMA-ES rather than to random sampling, so this sampling strategy may improve the robustness of the whole algorithm.

4 Computational complexity analysis

From the theoretical perspective, we prove that running BO using only the important variables decreases the runtime of both fitting the GP and maximizing the acquisition function. Specifically, we have the following proposition:

Proposition 4.1.

Suppose the cardinality of $\mathbf{x}_{ipt}$ is $p$ and a Quasi-Newton (QN) method is used for both fitting the GP and maximizing the acquisition function. Under commonly used kernel functions and acquisition functions, if only the variables in $\mathbf{x}_{ipt}$ are used, then the complexity of each QN step is $\mathcal{O}(p^{2}+pn^{2}+n^{3})$ for fitting the GP and $\mathcal{O}(p^{2}+pn+n^{2})$ for maximizing the acquisition function, where $n$ is the number of queries obtained so far.

The proof is in section B of the appendix. Note that the method used for fitting the GP and maximizing the acquisition function in the BoTorch framework is limited-memory BFGS, which is indeed a QN method. Since the complexity is quadratic in $p$, selecting a small subset of variables (so that $p$ is small) decreases the runtime of BO. Figure 6 empirically shows that, compared to vanilla BO, VS-BO reduces the runtime of both fitting the GP and optimizing the acquisition function, especially when $n$ is not small.

5 Regret bound analysis

Let $\mathbf{x}^{*}$ be a maximizer of $f(\mathbf{x})$. To quantify the efficacy of the optimization algorithm, we are interested in the cumulative regret $R_{N}$, defined as $R_{N}=\sum_{t=1}^{N}\left[f(\mathbf{x}^{*})-f(\mathbf{x}^{t})\right]$, where $\mathbf{x}^{t}$ is the query at iteration $t$. Intuitively, the algorithm is better when $R_{N}$ is small, and a desirable property is no regret: $\lim_{N\to\infty}R_{N}/N=0$. Here, we provide an upper bound on the cumulative regret of a simplified VS-BO algorithm, called VS-GP-UCB (Algorithm 6 in the appendix). As in Srinivas et al. (2009), proving the regret bound requires a smoothness assumption on the kernel function. In addition, we assume that $D-d$ of the variables in $\mathbf{x}$ are unimportant (without loss of generality, we index the unimportant variables from $d+1$ to $D$), meaning that the absolute values of the partial derivatives of $f$ with respect to those $D-d$ variables are in general smaller than those with respect to the important variables. Formally, we have the following assumption:
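As a toy numerical illustration of the definition (the function values below are made up), the cumulative regret and its average can be computed directly:

```python
# Toy illustration of cumulative regret R_N = sum_t [f(x*) - f(x^t)].
f_star = 1.0                                   # f(x*), the optimal value
f_vals = [0.2, 0.6, 0.9, 0.95]                 # f(x^t) for t = 1..4 (made up)
R = [sum(f_star - v for v in f_vals[:n]) for n in range(1, len(f_vals) + 1)]
avg = [r / n for n, r in enumerate(R, start=1)]
# R_N grows, but R_N / N shrinks as the queries approach the maximizer,
# which is the "no regret" behavior lim_{N -> inf} R_N / N = 0.
```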

Assumption 5.1.

Let $\mathcal{X}\subset[0,1]^{D}$ be compact and convex, $D\in\mathbb{N}$, and let $f$ be a sample path of a GP with mean zero and kernel function $k$ that satisfies the following high-probability bounds on the derivatives of $f$ for some constants $a,b>0$ and $1>\alpha\geq 0$:

$$P\left(\sup_{\mathbf{x}\in\mathcal{X}}\left\lvert\frac{\partial f}{\partial\mathbf{x}_{j}}\right\rvert>L\right)\leq a\exp\left(-\left(\frac{L}{b}\right)^{2}\right),\quad j=1,\dots,d$$

And:

$$P\left(\sup_{\mathbf{x}\in\mathcal{X}}\left\lvert\frac{\partial f}{\partial\mathbf{x}_{j}}\right\rvert>L\right)\leq a\exp\left(-\left(\frac{L}{\alpha b}\right)^{2}\right),\quad j=d+1,\dots,D.$$

In VS-GP-UCB, the values of the unimportant variables are fixed in advance (line 4 in Algorithm 6), denoted $\mathbf{x}^{0}_{[d+1:D]}$, and the important variables are queried at each iteration by maximizing the upper confidence bound (UCB) acquisition function (Auer, 2002) with those fixed unimportant variables (line 6 in Algorithm 6). We have the following regret bound for VS-GP-UCB:

Theorem 5.1.

Let $\mathcal{X}\subset[0,1]^{D}$ be compact and convex, suppose Assumption 5.1 is satisfied, pick $\delta\in(0,1)$, and define

$$\beta_{t}=2\log\frac{8\pi^{2}t^{2}}{3\delta}+2(D-d)\log\left(\alpha Dt^{2}b\sqrt{\log\left(\frac{8Da}{\delta}\right)}+1\right)+2d\log\left(Dt^{2}b\sqrt{\log\left(\frac{8Da}{\delta}\right)}\right).$$

Running VS-GP-UCB, with probability $\geq 1-\delta$, we have:

$$\frac{R_{N}}{N}=\frac{\sum_{t=1}^{N}r_{t}}{N}\leq 2\sqrt{C_{1}\frac{\beta_{N}\gamma_{N}}{N}}+\frac{\pi^{2}}{3N}+\alpha b\sqrt{\log\left(\frac{8Da}{\delta}\right)}(D-d).$$

Here, $\gamma_{N}:=\max_{A\subset\mathcal{X}:|A|=N}\mathbf{I}(\mathbf{y}_{A};\mathbf{f}_{A})$ is the maximum information gain over a finite set of sampling points $A$, $\mathbf{f}_{A}=[f(\mathbf{x})]_{\mathbf{x}\in A}$, $\mathbf{y}_{A}=\mathbf{f}_{A}+\epsilon_{A}$, and $C_{1}=\frac{8}{\log\left(1+\sigma_{0}^{-2}\right)}$.

The proof of Theorem 5.1 is in section C of the appendix. Srinivas et al. (2009) upper bounded the maximum information gain for several commonly used kernel functions; for example, they prove that for the SE kernel with identical length scales (ρi=ρ0\rho_{i}=\rho_{0} for all ii), γN=𝒪((logN)D+1)\gamma_{N}=\mathcal{O}\left((\log N)^{D+1}\right), so that limN(βNγN)/N=0\lim_{N\to\infty}(\beta_{N}\gamma_{N})/N=0. However, every variable is equally important under the SE kernel with identical length scales, which violates Assumption 5.1. We hypothesize that for an SE kernel whose length scales differ so that Assumption 5.1 is satisfied, limN(βNγN)/N=0\lim_{N\to\infty}(\beta_{N}\gamma_{N})/N=0 still holds, although we do not prove it here.

Li et al. (2016) also derive a regret bound for their dropout algorithm (Lemma 5 in Li et al. (2016)). Compared to the regret bound of Srinivas et al. (2009) (their Theorem 2), both Li et al. (2016) and our work have an additional residual term in the bound, but ours is scaled by the small coefficient α\alpha. When α0\alpha\to 0, the bound in Theorem 5.1 coincides with that of Theorem 2 in Srinivas et al. (2009) and the algorithm is no-regret. These results show the benefit of variable selection, since it helps decrease the value of α\alpha. In addition, compared to fixing the unimportant variables as in VS-GP-UCB, sampling them from the CMA-ES posterior may further decrease the residual.

6 Experiments

We compare VS-BO to a broad selection of existing methods: vanilla BO, REMBO and its variant REMBO Interleave, Dragonfly, HeSBO and ALEBO. The implementation details and hyperparameter settings of these methods are described in section D of the appendix.

Figure 2: Performance of BO methods on the Branin, Hartmann6 and Styblinski-Tang4 test functions. For each test function, we do 20 independent runs for each method. We plot the mean and 1/8 standard deviation of the best value found as a function of iteration.

6.1 Synthetic problems

We use the Branin (de=2d_{e}=2), Hartmann6 (de=6d_{e}=6) and Styblinski-Tang4 (de=4d_{e}=4) functions as test functions. Previous high-dimensional BO work extends these functions to high dimension by adding unrelated variables; we present a harder test setting, not tried before, by adding both unrelated and unimportant (but not totally unrelated) variables. For example, in the Hartmann6 case with the standard Hartmann6 function fHartmann6(𝐱[1:6])f_{Hartmann6}(\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}6]}), we first construct a new function Fhm6(𝐱)F_{hm6}(\mathbf{x}) by adding variables with different importance, Fhm6(𝐱)=fHartmann6(𝐱[1:6])+0.1fHartmann6(𝐱[7:12])+0.01fHartmann6(𝐱[13:18])F_{hm6}(\mathbf{x})=f_{Hartmann6}(\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}6]})+0.1f_{Hartmann6}(\mathbf{x}_{[7\mathrel{\mathop{\mathchar 58\relax}}12]})+0.01f_{Hartmann6}(\mathbf{x}_{[13\mathrel{\mathop{\mathchar 58\relax}}18]}), and we further extend it to D=50D=50 by adding unrelated variables; see section D for full details. The dimension of the effective subspace of Fhm6F_{hm6} is 18, while the dimension of the important variables is only 6. We hope that VS-BO can find those important variables successfully. For each embedding-based method we evaluate both d=4d=4 and d=6d=6.
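As an illustration of this construction, the following sketch builds the extended test function using the standard Hartmann6 constants (negated so that the function is maximized); the unrelated variables beyond coordinate 18 are simply ignored. The exact padding used in our experiments is described in section D, so this is an approximation of the setup rather than the exact benchmark code.

```python
import numpy as np

# standard Hartmann6 constants (negated form, to be maximized)
ALPHA = np.array([1.0, 1.2, 3.0, 3.2])
A = np.array([[10, 3, 17, 3.5, 1.7, 8],
              [0.05, 10, 17, 0.1, 8, 14],
              [3, 3.5, 1.7, 10, 17, 8],
              [17, 8, 0.05, 10, 0.1, 14]])
P = 1e-4 * np.array([[1312, 1696, 5569, 124, 8283, 5886],
                     [2329, 4135, 8307, 3736, 1004, 9991],
                     [2348, 1451, 3522, 2883, 3047, 6650],
                     [4047, 8828, 8732, 5743, 1091, 381]])

def hartmann6(x6):
    inner = (A * (np.asarray(x6) - P) ** 2).sum(axis=1)
    return float((ALPHA * np.exp(-inner)).sum())

def F_hm6(x):
    # three blocks of decreasing importance; coordinates 18..49 are
    # unrelated and ignored, giving a 50-dimensional input domain
    x = np.asarray(x)
    return (hartmann6(x[0:6]) + 0.1 * hartmann6(x[6:12])
            + 0.01 * hartmann6(x[12:18]))
```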

Figures 2 and 4 show the performance of VS-BO and other BO methods on these three synthetic functions. When the iteration budget is fixed (Figure 2), the average best value found by VS-BO after 200 iterations is the largest, or only slightly smaller than the largest, in all three cases. When the wall clock time or CPU time budget is fixed (Figure 4), the results show that VS-BO finds large function values with high computational efficiency. Figure 5 shows that VS-BO accurately finds all the truly important variables while controlling false positives. We also test VS-BO on a function with a non-axis-aligned subspace, and the results in Figure 7 show that VS-BO performs well there too. Vanilla BO under the BoTorch framework can also achieve good performance for a fixed iteration budget, but it is very computationally inefficient. For embedding-based methods, the results reflect some of their shortcomings. First, their performance is more variable than that of VS-BO; for example, HeSBO with d=6d=6 performs very well in the Styblinski-Tang4 case but not in the others. Second, embedding-based methods are sensitive to the choice of the embedding dimension dd: they perform especially badly when dd is smaller than the number of important variables (see the Hartmann6 case) and may still perform poorly even when dd is larger (such as ALEBO with d=6d=6 in the Styblinski-Tang4 case), while VS-BO learns the dimension automatically. One advantage of embedding-based methods is that they may outperform VS-BO within a very limited iteration budget (for example, 50 iterations), which is expected since VS-BO needs a number of data points to make its variable selection accurate.

6.2 Real-world problems

We compare VS-BO with other methods on two real-world problems. First, VS-BO is tested on the rover trajectory optimization problem presented in Wang et al. (2017), which has a 60-dimensional input domain. Second, it is tested on the vehicle design problem MOPTA08 (Jones, 2008), which has 124 dimensions. On these two problems, we evaluate both d=6d=6 and d=10d=10 for each embedding-based method, except that we omit ALEBO with d=10d=10 since it is very time-consuming. The detailed settings of these two problems are described in section D of the appendix.

Figure 3: Performance of BO methods on the rover trajectory and MOPTA08 problems. We do 20 independent runs on the rover trajectory problem and 15 on the MOPTA08 problem. We plot the mean and 1/4 standard deviation of the best value found as a function of iteration. The curves of vanilla BO and of ALEBO with d=6d=6 do not reach the maximum iteration because these methods are time-consuming and cannot complete the maximum number of iterations within the wall clock time budget (3600 seconds per run for the rover trajectory problem and 4800 seconds for MOPTA08).

Figures 3 and 8 show the performance of VS-BO and other BO methods on these two problems. When the iteration budget is fixed (Figure 3), VS-BO and vanilla BO perform better than the other methods on both problems. When the wall clock or CPU time is fixed (Figure 8), Dragonfly and VS-BO achieve the best performance on the rover trajectory problem, while on MOPTA08 Dragonfly performs best and VS-BO second best. Vanilla BO is computationally inefficient, so it does not perform well with a fixed runtime. The left column of Figure 9 shows, for each variable, the frequency with which VS-BO selects it as important. Since there is no ground truth of important variables in real-world cases, we use a sampling experiment to test whether the more frequently chosen variables are indeed more important. Specifically, we use a Sobol sequence to sample the 5 variables that have been chosen most frequently, fixing the values of the other variables to those of the best query found (the query with the highest function value), and we evaluate the function on this set of samples. Likewise, we sample the 5 variables that have been chosen least frequently and evaluate the function. The right column of Figure 9 shows that the variance of the function values from the first set of samples is significantly higher than that from the second, especially on MOPTA08, indicating that the frequently selected variables indeed have a more significant effect on the function value.
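The sampling experiment above can be sketched as follows; for brevity this sketch substitutes plain uniform sampling for the Sobol sequence used in the paper, and the function `f`, the best query `x_best`, and the index sets are stand-ins for the real problem.

```python
import numpy as np

def perturbation_variance(f, x_best, idx, n=512, rng=None):
    # resample only the coordinates in `idx` (the paper uses a Sobol
    # sequence; uniform sampling is substituted here for brevity),
    # fix all other coordinates at the best query found, and measure
    # the variance of the resulting function values
    rng = rng or np.random.default_rng(0)
    X = np.tile(np.asarray(x_best, dtype=float), (n, 1))
    X[:, idx] = rng.random((n, len(idx)))
    return float(np.var([f(x) for x in X]))
```

A larger variance when perturbing the frequently selected variables, as in the right column of Figure 9, indicates that those variables have a stronger effect on the function value.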

7 Conclusion

We propose a new method, VS-BO, for high-dimensional BO, based on the assumption that the input variables can be divided into two categories: important and unimportant. Our method assigns variables to these two categories without any crucial pre-specified hyperparameter and uses different strategies to decide the values of variables in each category, reducing runtime. The good performance of our method on synthetic and real-world problems further supports this assumption. We show the computational efficiency of our method both theoretically and empirically. In addition, the information from variable selection improves the interpretability of the BO model: VS-BO finds important variables and can thereby increase our understanding of the black-box function. We also notice that in practice vanilla BO under the BoTorch framework usually performs well if runtime is not a concern, especially when the dimension is not too large (D<100D<100); however, it is usually not used as a baseline in previous high-dimensional BO work.

We also find some limitations of our method when running experiments. First, as the input dimension increases, it becomes harder to do variable selection accurately; therefore, embedding-based methods are still the first choice when the input of a function has thousands of dimensions. It would be interesting to develop new algorithms that can do variable selection robustly even when the dimension is extremely large. Further, Grad-IS may be invalid when variables are discrete or categorical, so new methods for computing the importance scores of these kinds of variables are needed. These are several directions for future improvement of VS-BO. It would also be interesting to pursue further theoretical study, such as investigating bounds on the maximum information gain of kernels that satisfy Assumption 5.1.

References

  • Auer [2002] Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422, 2002.
  • Bergstra et al. [2013] James Bergstra, Daniel Yamins, and David Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. In International conference on machine learning, pages 115–123, 2013.
  • Berkenkamp et al. [2016] Felix Berkenkamp, Andreas Krause, and Angela P Schoellig. Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics. arXiv preprint arXiv:1602.04450, 2016.
  • Brochu et al. [2010] Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
  • Calandra et al. [2016] Roberto Calandra, André Seyfarth, Jan Peters, and Marc Peter Deisenroth. Bayesian optimization for learning gaits under uncertainty. Annals of Mathematics and Artificial Intelligence, 76(1):5–23, 2016.
  • Derksen and Keselman [1992] Shelley Derksen and Harvey J Keselman. Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. British Journal of Mathematical and Statistical Psychology, 45(2):265–282, 1992.
  • Djolonga et al. [2013] Josip Djolonga, Andreas Krause, and Volkan Cevher. High-dimensional Gaussian process bandits. In Advances in Neural Information Processing Systems, pages 1025–1033, 2013.
  • Eriksson and Jankowiak [2021] David Eriksson and Martin Jankowiak. High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces. arXiv preprint arXiv:2103.00349, 2021.
  • Frazier [2018] Peter I Frazier. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
  • Gonzalez et al. [2015] Javier Gonzalez, Joseph Longworth, David C James, and Neil D Lawrence. Bayesian optimization for synthetic gene design. arXiv preprint arXiv:1505.01627, 2015.
  • Griffiths and Hernández-Lobato [2017] Ryan-Rhys Griffiths and José Miguel Hernández-Lobato. Constrained Bayesian optimization for automatic chemical design. arXiv preprint arXiv:1709.05501, 2017.
  • Hansen [2016] Nikolaus Hansen. The CMA evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.
  • Hutter et al. [2010] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Automated configuration of mixed integer programming solvers. In International Conference on Integration of Artificial Intelligence (AI) and Operations Research (OR) Techniques in Constraint Programming, pages 186–202. Springer, 2010.
  • Hutter et al. [2014] Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. An efficient approach for assessing hyperparameter importance. In International Conference on Machine Learning, pages 754–762. PMLR, 2014.
  • Jones [2008] Donald R Jones. Large-scale multi-disciplinary mass optimization in the auto industry. In MOPTA 2008 Conference (20 August 2008), 2008.
  • Kandasamy et al. [2015] Kirthevasan Kandasamy, Jeff Schneider, and Barnabás Póczos. High dimensional Bayesian optimisation and bandits via additive models. In International Conference on Machine Learning, pages 295–304, 2015.
  • Kandasamy et al. [2020] Kirthevasan Kandasamy, Karun Raju Vysyaraju, Willie Neiswanger, Biswajit Paria, Christopher R Collins, Jeff Schneider, Barnabas Poczos, and Eric P Xing. Tuning hyperparameters without grad students: Scalable and robust bayesian optimisation with dragonfly. Journal of Machine Learning Research, 21(81):1–27, 2020.
  • Klein et al. [2017] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast Bayesian optimization of machine learning hyperparameters on large datasets. In Artificial Intelligence and Statistics, pages 528–536. PMLR, 2017.
  • Letham et al. [2020] Ben Letham, Roberto Calandra, Akshara Rai, and Eytan Bakshy. Re-examining linear embeddings for high-dimensional Bayesian optimization. Advances in Neural Information Processing Systems, 33, 2020.
  • Li et al. [2016] Chun-Liang Li, Kirthevasan Kandasamy, Barnabás Póczos, and Jeff Schneider. High dimensional Bayesian optimization via restricted projection pursuit models. In Artificial Intelligence and Statistics, pages 884–892, 2016.
  • Loizou and Richtárik [2017] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, Newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.
  • Marco et al. [2017] Alonso Marco, Felix Berkenkamp, Philipp Hennig, Angela P Schoellig, Andreas Krause, Stefan Schaal, and Sebastian Trimpe. Virtual vs. real: Trading off simulations and physical experiments in reinforcement learning with Bayesian optimization. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 1557–1563. IEEE, 2017.
  • Metzen [2016] Jan Hendrik Metzen. Minimum regret search for single- and multi-task optimization. arXiv preprint arXiv:1602.01064, 2016.
  • Močkus [1975] Jonas Močkus. On Bayesian methods for seeking the extremum. In Optimization Techniques IFIP Technical Conference, pages 400–404. Springer, 1975.
  • Moriconi et al. [2019] Riccardo Moriconi, Marc P Deisenroth, and KS Kumar. High-dimensional Bayesian optimization using low-dimensional feature spaces. arXiv preprint arXiv:1902.10675, 2019.
  • Nayebi et al. [2019] Amin Nayebi, Alexander Munteanu, and Matthias Poloczek. A framework for Bayesian optimization in embedded subspaces. In International Conference on Machine Learning, pages 4752–4761. PMLR, 2019.
  • Negoescu et al. [2011] Diana M Negoescu, Peter I Frazier, and Warren B Powell. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346–363, 2011.
  • Nickson et al. [2014] Thomas Nickson, Michael A Osborne, Steven Reece, and Stephen J Roberts. Automated machine learning on big data using stochastic algorithm tuning. arXiv preprint arXiv:1407.7969, 2014.
  • Paananen et al. [2019] Topi Paananen, Juho Piironen, Michael Riis Andersen, and Aki Vehtari. Variable selection for gaussian processes via sensitivity analysis of the posterior predictive distribution. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1743–1752, 2019.
  • Rolland et al. [2018] Paul Rolland, Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. High-dimensional Bayesian optimization via additive models with overlapping groups. arXiv preprint arXiv:1802.07028, 2018.
  • Snoek et al. [2012] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959, 2012.
  • Spagnol et al. [2019] Adrien Spagnol, Rodolphe Le Riche, and Sébastien Da Veiga. Bayesian optimization in effective dimensions via kernel-based sensitivity indices. In International Conference on Applications of Statistics and Probability in Civil Engineering, 2019.
  • Srinivas et al. [2009] Niranjan Srinivas, Andreas Krause, Sham M Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995, 2009.
  • Stewart [1980] Gilbert W Stewart. The efficient generation of random orthogonal matrices with an application to condition estimators. SIAM Journal on Numerical Analysis, 17(3):403–409, 1980.
  • Ulmasov et al. [2016] Doniyor Ulmasov, Caroline Baroukh, Benoit Chachuat, Marc Peter Deisenroth, and Ruth Misener. Bayesian optimization with dimension scheduling: Application to biological systems. In Computer Aided Chemical Engineering, volume 38, pages 1051–1056. Elsevier, 2016.
  • Wang et al. [2017] Zi Wang, Chengtao Li, Stefanie Jegelka, and Pushmeet Kohli. Batched high-dimensional bayesian optimization via structural kernel learning. In International Conference on Machine Learning, pages 3656–3664. PMLR, 2017.
  • Wang et al. [2016] Ziyu Wang, Frank Hutter, Masrour Zoghi, David Matheson, and Nando de Feitas. Bayesian optimization in a billion dimensions via random embeddings. Journal of Artificial Intelligence Research, 55:361–387, 2016.
  • Williams and Rasmussen [2006] Christopher KI Williams and Carl Edward Rasmussen. Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA, 2006.
  • Wilson et al. [2014] Aaron Wilson, Alan Fern, and Prasad Tadepalli. Using trajectory data to improve Bayesian optimization for reinforcement learning. The Journal of Machine Learning Research, 15(1):253–282, 2014.
  • Yao et al. [2018] Quanming Yao, Mengshuo Wang, Yuqiang Chen, Wenyuan Dai, Hu Yi-Qi, Li Yu-Feng, Tu Wei-Wei, Yang Qiang, and Yu Yang. Taking human out of learning applications: A survey on automated machine learning. arXiv preprint arXiv:1810.13306, 2018.

Checklist

  1. 1.

    For all authors…

    1. (a)

      Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes] The last paragraph of the introduction section describes the paper’s contributions.

    2. (b)

      Did you describe the limitations of your work? [Yes] The last paragraph of the conclusion section describes some limitations of our work.

    3. (c)

Did you discuss any potential negative societal impacts of your work? [N/A] We foresee no potential negative societal impacts of our work.

    4. (d)

Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes] We have read them carefully.

  2. 2.

    If you are including theoretical results…

    1. (a)

      Did you state the full set of assumptions of all theoretical results? [Yes] It is in section 5.

    2. (b)

      Did you include complete proofs of all theoretical results? [Yes] It is in section B and C.

  3. 3.

    If you ran experiments…

    1. (a)

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Code is attached in the supplementary material.

    2. (b)

      Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] They are described in section D.

    3. (c)

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] For each case we repeat 20 independent runs, and we plot the mean and standard deviation in our figures.

    4. (d)

      Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] They are described in section D.

  4. 4.

    If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. (a)

      If your work uses existing assets, did you cite the creators? [Yes]

    2. (b)

      Did you mention the license of the assets? [N/A]

    3. (c)

      Did you include any new assets either in the supplemental material or as a URL? [Yes] They are described in section D.

    4. (d)

      Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]

    5. (e)

      Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

  5. 5.

    If you used crowdsourcing or conducted research with human subjects…

    1. (a)

      Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

    2. (b)

      Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

    3. (c)

      Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Variable selection with momentum mechanism

In this section, we provide pseudo-code for the variable selection algorithm with the momentum mechanism. Note that Algorithm 4 (momentum in the inaccurate case) is similar to Algorithm 2, except for the lines marked in red.

Algorithm 3 Variable Selection (VS) with Momentum
1:Input: iteration index tt, 𝒟={(𝐱i,yi)}i=1t\mathcal{D}=\{(\mathbf{x}^{i},y^{i})\}_{i=1}^{t}, NinitN_{init}, NvsN_{vs}, set of important variables chosen at iteration tNvst-N_{vs}, denoted as 𝐱^ipt\hat{\mathbf{x}}_{ipt}
2:Output: Set of important variables chosen at iteration tt, denoted as 𝐱ipt\mathbf{x}_{ipt}
3:if t=Ninit+Nvst=N_{init}+N_{vs} or 𝐱^ipt=𝐱\hat{\mathbf{x}}_{ipt}=\mathbf{x} then \triangleright First time to do variable selection or 𝐱^ipt\hat{\mathbf{x}}_{ipt} contains all variables
4:     return Algorithm 2
5:else if maxk{tNvs+1,tNvs+2,,t}ykmaxk{1,,tNvs}yk\max_{k\in\{t-N_{vs}+1,t-N_{vs}+2,\dots,t\}}y^{k}\leq\max_{k\in\{1,\dots,t-N_{vs}\}}y^{k} then \triangleright Inaccurate case
6:     return Algorithm 4
7:else\triangleright Accurate case
8:     return Algorithm 5
9:end if
Algorithm 4 Momentum in the inaccurate case
1:Input: 𝒟={(𝐱i,yi)}i=1t\mathcal{D}=\{(\mathbf{x}^{i},y^{i})\}_{i=1}^{t}, NvsN_{vs}, set of important variables chosen at iteration tNvst-N_{vs}, denoted as 𝐱^ipt\hat{\mathbf{x}}_{ipt}
2:Output: Set of important variables chosen at iteration tt, denoted as 𝐱ipt\mathbf{x}_{ipt}
3:Fit a GP to 𝒟\mathcal{D} and calculate importance scores of the variables, ISIS, where IS[i]IS[i] is the importance score of the i-th variable
4:Sort variables according to their importance scores, [𝐱s(1),,𝐱s(D)][\mathbf{x}_{s(1)},\dots,\mathbf{x}_{s(D)}], from the most important to the least
5:for n=1,Dn=1,\ldots D do
6:     if 𝐱s(n)𝐱^ipt\mathbf{x}_{s(n)}\notin\hat{\mathbf{x}}_{ipt} then
7:         break
8:     end if
9:end for
10:for m=n,n+1,Dm=n,n+1,\ldots D do \triangleright Stepwise forward selection
11:     Fit a GP to 𝒟m:={(𝐱s(1):s(m)i,yi)}i=1t1\mathcal{D}_{m}\mathrel{\mathop{\mathchar 58\relax}}=\{(\mathbf{x}_{s(1)\mathrel{\mathop{\mathchar 58\relax}}s(m)}^{i},y^{i})\}_{i=1}^{t-1} where 𝐱s(1):s(m)i\mathbf{x}_{s(1)\mathrel{\mathop{\mathchar 58\relax}}s(m)}^{i} is the ii-th input restricted to the first mm important variables; let LmL_{m} be the final negative marginal log likelihood
12:     if mn<2m-n<2 then
13:         continue
14:     else if Lm1Lm0L_{m-1}-L_{m}\leq 0 or Lm1Lm<Lm2Lm110L_{m-1}-L_{m}<\frac{L_{m-2}-L_{m-1}}{10} then
15:         break
16:     end if
17:end for
18:return 𝐱ipt={𝐱s(1),,𝐱s(m1)}\mathbf{x}_{ipt}=\{\mathbf{x}_{s(1)},\dots,\mathbf{x}_{s(m-1)}\}
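The stopping rule of the stepwise forward selection above (also reused in Algorithm 5) can be sketched in Python; this is our reading of the rule, with the GP fits abstracted into a precomputed list of negative marginal log likelihoods.

```python
def forward_selection_stop(neg_mlls):
    """Given neg_mlls[m] = final negative marginal log likelihood of a GP
    fit on the first m+1 sorted variables, return how many variables to
    keep: stop once adding a variable no longer decreases the negative
    MLL, or decreases it by less than a tenth of the previous decrease
    (our reading of the stopping rule in Algorithm 4)."""
    for m in range(2, len(neg_mlls)):
        gain = neg_mlls[m - 1] - neg_mlls[m]           # improvement from variable m+1
        prev_gain = neg_mlls[m - 2] - neg_mlls[m - 1]  # previous improvement
        if gain <= 0 or gain < prev_gain / 10.0:
            return m  # keep the first m variables
    return len(neg_mlls)
```

For example, if the fits yield negative MLLs 10, 5, 3, 2.9, 2.89, the rule keeps three variables, since the fourth variable's improvement (0.1) is below a tenth of the previous improvement (2.0).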
Algorithm 5 Momentum in accurate case
1:Input: 𝒟={(𝐱i,yi)}i=1t\mathcal{D}=\{(\mathbf{x}^{i},y^{i})\}_{i=1}^{t}, NvsN_{vs}, set of important variables chosen at iteration tNvst-N_{vs}, denoted as 𝐱^ipt\hat{\mathbf{x}}_{ipt}, with cardinality w=|𝐱^ipt|w=\mathinner{\!\left\lvert\hat{\mathbf{x}}_{ipt}\right\rvert}
2:Output: Set of important variables chosen at iteration tt, denoted as 𝐱ipt\mathbf{x}_{ipt}
3:Fit a GP to 𝒟\mathcal{D} and calculate importance scores of the variables, ISIS, where IS[i]IS[i] is the importance score of the i-th variable
4:Sort variables according to ISIS, [𝐱s(1),,𝐱s(D)][\mathbf{x}_{s(1)},\dots,\mathbf{x}_{s(D)}], from the most important to the least
5:Fit a GP using the variables in 𝐱^ipt\hat{\mathbf{x}}_{ipt}, i.e., fit a GP to {(𝐱^ipti,yi)}i=1t\{(\hat{\mathbf{x}}_{ipt}^{i},y^{i})\}_{i=1}^{t}, and calculate the importance scores IS^\widehat{IS} of these variables. Let L^w\hat{L}_{w} be the final negative marginal log likelihood
6:Sort variables in 𝐱^ipt\hat{\mathbf{x}}_{ipt} according to IS^\widehat{IS}, [𝐱s(1),,𝐱s(w)][\mathbf{x}_{s^{\prime}(1)},\dots,\mathbf{x}_{s^{\prime}(w)}], from the most important to the least.
7:for m=w1,w2,0m=w-1,w-2,\ldots 0 do \triangleright Recursive feature elimination
8:     if m=0m=0 then
9:         Set 𝐱ipt={𝐱s(1)}\mathbf{x}_{ipt}=\{\mathbf{x}_{s^{\prime}(1)}\}
10:         break
11:     end if
12:     Fit a GP using only the first mm important variables according to IS^\widehat{IS}. Let L^m\hat{L}_{m} be the final negative marginal log likelihood
13:     if L^m>L^m+1\hat{L}_{m}>\hat{L}_{m+1} then
14:         Set 𝐱ipt={𝐱s(1),,𝐱s(m+1)}\mathbf{x}_{ipt}=\{\mathbf{x}_{s^{\prime}(1)},\dots,\mathbf{x}_{s^{\prime}(m+1)}\}
15:         Set L0=L^m+1L_{0}=\hat{L}_{m+1}
16:         break
17:     end if
18:end for
19:for m=1,2,Dm=1,2,\ldots D do \triangleright Stepwise forward selection
20:     if 𝐱s(m)𝐱ipt\mathbf{x}_{s(m)}\in\mathbf{x}_{ipt} then
21:         Set Lm=Lm1L_{m}=L_{m-1}, Lm1=Lm2L_{m-1}=L_{m-2}
22:         continue
23:     end if
24:     Fit a GP using the variables in 𝐱ipt{𝐱s(m)}\mathbf{x}_{ipt}\cup\{\mathbf{x}_{s(m)}\}. Let LmL_{m} be the final negative marginal log likelihood
25:     if m<2m<2 then
26:         𝐱ipt=𝐱ipt{𝐱s(m)}\mathbf{x}_{ipt}=\mathbf{x}_{ipt}\cup\{\mathbf{x}_{s(m)}\}
27:         continue
28:     else if Lm1Lm0L_{m-1}-L_{m}\leq 0 or Lm1Lm<Lm2Lm110L_{m-1}-L_{m}<\frac{L_{m-2}-L_{m-1}}{10} then
29:         break
30:     end if
31:     𝐱ipt=𝐱ipt{𝐱s(m)}\mathbf{x}_{ipt}=\mathbf{x}_{ipt}\cup\{\mathbf{x}_{s(m)}\}
32:end for
33:return 𝐱ipt\mathbf{x}_{ipt}

Appendix B Proof of Proposition 4.1

Proof of Proposition 4.1.

Given query-output pairs 𝒟={(𝐱i,yi)}i=1n\mathcal{D}=\{(\mathbf{x}^{i},y^{i})\}_{i=1}^{n}, the marginal log likelihood (MLL) that needs to be maximized when fitting a GP has the following explicit form:

logp(Θ={ρ1:D2,α02},σ0𝒟)=12𝐲M1𝐲12log|M|nlog2π2\displaystyle\log p(\Theta=\{\rho^{2}_{1\mathrel{\mathop{\mathchar 58\relax}}D},\alpha_{0}^{2}\},\sigma_{0}\mid\mathcal{D})=-\frac{1}{2}\mathbf{y}M^{-1}\mathbf{y}^{\top}-\frac{1}{2}\log\mathinner{\!\left\lvert M\right\rvert}-\frac{n\log 2\pi}{2}

where 𝐲=[y1,yn]\mathbf{y}=[y^{1},\ldots y^{n}] is an nn-dimensional vector and M=(K(𝐱1:n,Θ)+σ02𝐈)M=\left(K(\mathbf{x}^{1\mathrel{\mathop{\mathchar 58\relax}}n},\Theta)+\sigma^{2}_{0}\mathbf{I}\right). When the quasi-Newton method is used to maximize the MLL, the gradient must be calculated at each iteration:

Θ,σ0logp(Θ,σ0𝒟)=12𝐲M1(Θ,σ0M)M1𝐲12tr(M1(Θ,σ0M))\displaystyle\bigtriangledown_{\Theta,\sigma_{0}}\log p(\Theta,\sigma_{0}\mid\mathcal{D})=-\frac{1}{2}\mathbf{y}M^{-1}\left(\bigtriangledown_{\Theta,\sigma_{0}}M\right)M^{-1}\mathbf{y}^{\top}-\frac{1}{2}\textbf{tr}\left(M^{-1}\left(\bigtriangledown_{\Theta,\sigma_{0}}M\right)\right)

When only variables in 𝐱ipt\mathbf{x}_{ipt} are used, we define the distance between two queries 𝐱i\mathbf{x}^{i} and 𝐱j\mathbf{x}^{j} as:

d(𝐱i,𝐱j)=m:m𝐱iptρm2(𝐱mi𝐱mj)2\displaystyle d(\mathbf{x}^{i},\mathbf{x}^{j})=\sqrt{\sum_{m\mathrel{\mathop{\mathchar 58\relax}}m\in\mathbf{x}_{ipt}}\rho_{m}^{2}(\mathbf{x}_{m}^{i}-\mathbf{x}_{m}^{j})^{2}}

and all the other inverse squared length scales, corresponding to unimportant variables, are fixed to 0. Commonly chosen kernel functions are functions of the distance defined above; for example, the squared exponential (SE) kernel is:

kSE(𝐱i,𝐱j,Θ)=α02exp(12d2(𝐱i,𝐱j))\displaystyle k_{SE}(\mathbf{x}^{i},\mathbf{x}^{j},\Theta)=\alpha_{0}^{2}\exp\left(-\frac{1}{2}d^{2}(\mathbf{x}^{i},\mathbf{x}^{j})\right)

and the Matern-5/2 kernel is:

kMt(𝐱i,𝐱j,Θ)=α02(1+5d(𝐱i,𝐱j)+53d2(𝐱i,𝐱j))exp(5d(𝐱i,𝐱j))\displaystyle k_{Mt}(\mathbf{x}^{i},\mathbf{x}^{j},\Theta)=\alpha_{0}^{2}\left(1+\sqrt{5}d(\mathbf{x}^{i},\mathbf{x}^{j})+\frac{5}{3}d^{2}(\mathbf{x}^{i},\mathbf{x}^{j})\right)\exp\left(-\sqrt{5}d(\mathbf{x}^{i},\mathbf{x}^{j})\right)
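These two kernels and the restricted distance can be sketched directly in numpy; `ipt` lists the indices of the selected variables, and fixing the remaining inverse squared length scales to 0 is implemented with a mask. The amplitude and length-scale values below are illustrative, not tuned hyperparameters.

```python
import numpy as np

def weighted_dist(xi, xj, rho, ipt):
    # inverse squared length scales of unimportant variables are fixed
    # to 0, so only the selected coordinates contribute to the distance
    mask = np.zeros_like(rho)
    mask[ipt] = 1.0
    return float(np.sqrt(np.sum((rho * mask) ** 2 * (xi - xj) ** 2)))

def k_se(xi, xj, rho, ipt, amp2=1.0):
    # squared exponential kernel as a function of the restricted distance
    d = weighted_dist(xi, xj, rho, ipt)
    return amp2 * np.exp(-0.5 * d ** 2)

def k_matern52(xi, xj, rho, ipt, amp2=1.0):
    # Matern-5/2 kernel as a function of the restricted distance
    d = weighted_dist(xi, xj, rho, ipt)
    return amp2 * (1 + np.sqrt(5) * d + 5.0 / 3.0 * d ** 2) * np.exp(-np.sqrt(5) * d)
```

Note that changing an unimportant coordinate leaves both kernel values unchanged, which is exactly why the gradient computations below only cost 𝒪(p)\mathcal{O}(p) per kernel entry.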

Since the cardinality of 𝐱ipt\mathbf{x}_{ipt} is pp, the number of kernel parameters not fixed to 0 is p+1p+1; hence the complexity of calculating the gradient of the distance is 𝒪(p)\mathcal{O}(p). Therefore, whether the SE or the Matern-5/2 kernel is used, the complexity of calculating Θk(𝐱i,𝐱j,Θ)\bigtriangledown_{\Theta}k(\mathbf{x}^{i},\mathbf{x}^{j},\Theta) is 𝒪(p)\mathcal{O}(p).

Since MM is an n×nn\times n matrix and each entry MijM_{ij} equals k(𝐱i,𝐱j,Θ)+σ02𝟙(i=j)k(\mathbf{x}^{i},\mathbf{x}^{j},\Theta)+\sigma_{0}^{2}\mathbbm{1}(i=j), the complexity of calculating Θ,σ0M\bigtriangledown_{\Theta,\sigma_{0}}M is 𝒪(pn2)\mathcal{O}(pn^{2}). The complexity of calculating the inverse matrix M1M^{-1} is 𝒪(n3)\mathcal{O}(n^{3}) in general, and the subsequent matrix multiplication and trace calculation need 𝒪(pn2)\mathcal{O}(pn^{2}); therefore the complexity of calculating the gradient of the MLL is 𝒪(pn2+n3)\mathcal{O}(pn^{2}+n^{3}). Once the gradient is obtained, each quasi-Newton step needs an additional 𝒪(p2)\mathcal{O}(p^{2}); therefore the complexity of one quasi-Newton step when fitting a GP is 𝒪(p2+pn2+n3)\mathcal{O}(p^{2}+pn^{2}+n^{3}).

As described in section 2, the acquisition function depends on the posterior mean μ\mu and the posterior standard deviation σ\sigma; hence, the gradients of μ\mu and σ\sigma must be calculated whenever the gradient of the acquisition function is needed.

When only variables in 𝐱ipt\mathbf{x}_{ipt} are used, the gradient of μ\mu with respect to 𝐱ipt\mathbf{x}_{ipt} has the following form:

𝐱iptμ(𝐱ipt𝒟)=(𝐱iptK(𝐱ipt,𝐱ipt1:n))(K(𝐱ipt1:n,Θ)+σ02𝐈)1𝐲\displaystyle\bigtriangledown_{\mathbf{x}_{ipt}}\mu(\mathbf{x}_{ipt}\mid\mathcal{D})=\left(\bigtriangledown_{\mathbf{x}_{ipt}}K(\mathbf{x}_{ipt},\mathbf{x}^{1\mathrel{\mathop{\mathchar 58\relax}}n}_{ipt})\right)\left(K(\mathbf{x}^{1\mathrel{\mathop{\mathchar 58\relax}}n}_{ipt},\Theta)+\sigma^{2}_{0}\mathbf{I}\right)^{-1}\mathbf{y}^{\top}

Here (K(𝐱ipt1:n,Θ)+σ02𝐈)1𝐲\left(K(\mathbf{x}^{1\mathrel{\mathop{\mathchar 58\relax}}n}_{ipt},\Theta)+\sigma^{2}_{0}\mathbf{I}\right)^{-1}\mathbf{y}^{\top} is fixed, so its value can be calculated in advance and stored as an nn-dimensional vector. K(𝐱ipt,𝐱ipt1:n)K(\mathbf{x}_{ipt},\mathbf{x}^{1\mathrel{\mathop{\mathchar 58\relax}}n}_{ipt}) is an nn-dimensional vector whose elements are kernel values between 𝐱ipt\mathbf{x}_{ipt} and the 𝐱ipti\mathbf{x}_{ipt}^{i}, so the complexity of calculating the gradient of each element of K(𝐱ipt,𝐱ipt1:n)K(\mathbf{x}_{ipt},\mathbf{x}^{1\mathrel{\mathop{\mathchar 58\relax}}n}_{ipt}) is 𝒪(p)\mathcal{O}(p). Therefore, the complexity of calculating 𝐱iptK(𝐱ipt,𝐱ipt1:n)\bigtriangledown_{\mathbf{x}_{ipt}}K(\mathbf{x}_{ipt},\mathbf{x}^{1\mathrel{\mathop{\mathchar 58\relax}}n}_{ipt}) is 𝒪(pn)\mathcal{O}(pn), with a further 𝒪(pn)\mathcal{O}(pn) for the remaining matrix manipulation; hence the total complexity of calculating 𝐱iptμ(𝐱ipt𝒟)\bigtriangledown_{\mathbf{x}_{ipt}}\mu(\mathbf{x}_{ipt}\mid\mathcal{D}) is 𝒪(pn)\mathcal{O}(pn).

The gradient of σ\sigma has the following form:

𝐱iptσ(𝐱ipt𝒟)\displaystyle\bigtriangledown_{\mathbf{x}_{ipt}}\sigma(\mathbf{x}_{ipt}\mid\mathcal{D}) =𝐱iptk(𝐱ipt,𝐱ipt,Θ)K(𝐱ipt,𝐱ipt1:n)[K(𝐱ipt1:n,Θ)+σ02𝐈]1K(𝐱ipt,𝐱ipt1:n)\displaystyle=\bigtriangledown_{\mathbf{x}_{ipt}}\sqrt{k(\mathbf{x}_{ipt},\mathbf{x}_{ipt},\Theta)-K(\mathbf{x}_{ipt},\mathbf{x}_{ipt}^{1\mathrel{\mathop{\mathchar 58\relax}}n})[K(\mathbf{x}_{ipt}^{1\mathrel{\mathop{\mathchar 58\relax}}n},\Theta)+\sigma^{2}_{0}\mathbf{I}]^{-1}K(\mathbf{x}_{ipt},\mathbf{x}_{ipt}^{1\mathrel{\mathop{\mathchar 58\relax}}n})^{\top}}
=(𝐱iptK(𝐱ipt,𝐱ipt1:n))(K(𝐱ipt1:n,Θ)+σ02𝐈)1K(𝐱ipt,𝐱ipt1:n)k(𝐱ipt,𝐱ipt,Θ)K(𝐱ipt,𝐱ipt1:n)[K(𝐱ipt1:n,Θ)+σ02𝐈]1K(𝐱ipt,𝐱ipt1:n)\displaystyle=-\frac{\left(\bigtriangledown_{\mathbf{x}_{ipt}}K(\mathbf{x}_{ipt},\mathbf{x}^{1\mathrel{\mathop{\mathchar 58\relax}}n}_{ipt})\right)\left(K(\mathbf{x}^{1\mathrel{\mathop{\mathchar 58\relax}}n}_{ipt},\Theta)+\sigma^{2}_{0}\mathbf{I}\right)^{-1}K(\mathbf{x}_{ipt},\mathbf{x}_{ipt}^{1\mathrel{\mathop{\mathchar 58\relax}}n})^{\top}}{\sqrt{k(\mathbf{x}_{ipt},\mathbf{x}_{ipt},\Theta)-K(\mathbf{x}_{ipt},\mathbf{x}_{ipt}^{1\mathrel{\mathop{\mathchar 58\relax}}n})[K(\mathbf{x}_{ipt}^{1\mathrel{\mathop{\mathchar 58\relax}}n},\Theta)+\sigma^{2}_{0}\mathbf{I}]^{-1}K(\mathbf{x}_{ipt},\mathbf{x}_{ipt}^{1\mathrel{\mathop{\mathchar 58\relax}}n})^{\top}}}

Once 𝐱iptK(𝐱ipt,𝐱ipt1:n)\bigtriangledown_{\mathbf{x}_{ipt}}K(\mathbf{x}_{ipt},\mathbf{x}^{1\mathrel{\mathop{\mathchar 58\relax}}n}_{ipt}) is calculated, 𝒪(pn+n2)\mathcal{O}(pn+n^{2}) is needed for additional matrix manipulation, hence the total complexity for calculating 𝐱iptσ(𝐱ipt𝒟)\bigtriangledown_{\mathbf{x}_{ipt}}\sigma(\mathbf{x}_{ipt}\mid\mathcal{D}) is 𝒪(pn+n2)\mathcal{O}(pn+n^{2}).
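The two closed forms above can be sketched in a few lines of NumPy, again assuming an ARD RBF kernel with unit signal variance so that k(𝐱,𝐱,Θ)=1k(\mathbf{x},\mathbf{x},\Theta)=1 (the paper's implementation uses a Matern 5/2 kernel; `posterior_and_grads` is an illustrative name, not the paper's code):

```python
import numpy as np

def posterior_and_grads(x, X, y, ell, sigma0):
    """GP posterior mean/std at a point x and their gradients w.r.t. x.

    Implements the closed forms from the text:
      grad mu    = (grad Kx) Minv y             -> O(pn) once Minv y is cached
      grad sigma = -(grad Kx) Minv Kx^T / sigma -> O(pn + n^2)
    where Kx = K(x, x^{1:n}) and M = K(x^{1:n}) + sigma0^2 I."""
    n, p = X.shape
    diff = X[:, None, :] - X[None, :, :]
    K = np.exp(-0.5 * np.sum((diff / ell) ** 2, axis=2))
    Minv = np.linalg.inv(K + sigma0 ** 2 * np.eye(n))   # fixed; precomputable
    kx = np.exp(-0.5 * np.sum(((x - X) / ell) ** 2, axis=1))
    # Jacobian of the cross-covariance vector, shape (p, n):
    # d k(x, x_i) / d x = -k(x, x_i) * (x - x_i) / ell^2
    Jk = -kx[None, :] * ((x - X) / ell ** 2).T
    alpha = Minv @ y                                    # precomputable n-vector
    mu = kx @ alpha
    sigma = np.sqrt(1.0 - kx @ Minv @ kx)               # k(x, x) = 1 for this kernel
    grad_mu = Jk @ alpha
    grad_sigma = -(Jk @ (Minv @ kx)) / sigma
    return mu, sigma, grad_mu, grad_sigma
```

Both gradients can be validated by central finite differences of the returned mean and standard deviation.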

For commonly used acquisition functions such as upper confidence bound (UCB) [Auer, 2002]:

UCB(𝐱ipt𝒟)=μ(𝐱ipt𝒟)+βnσ(𝐱ipt𝒟)\displaystyle UCB(\mathbf{x}_{ipt}\mid\mathcal{D})=\mu(\mathbf{x}_{ipt}\mid\mathcal{D})+\sqrt{\beta_{n}}\sigma(\mathbf{x}_{ipt}\mid\mathcal{D})

and expected improvement (EI) [Močkus, 1975]:

EI(𝐱ipt𝒟)=(μ(𝐱ipt𝒟)yn)Φ(μ(𝐱ipt𝒟)ynσ(𝐱ipt𝒟))+σ(𝐱ipt𝒟)φ(μ(𝐱ipt𝒟)ynσ(𝐱ipt𝒟))\displaystyle EI(\mathbf{x}_{ipt}\mid\mathcal{D})=\left(\mu(\mathbf{x}_{ipt}\mid\mathcal{D})-y^{*}_{n}\right)\Phi\left(\frac{\mu(\mathbf{x}_{ipt}\mid\mathcal{D})-y^{*}_{n}}{\sigma(\mathbf{x}_{ipt}\mid\mathcal{D})}\right)+\sigma(\mathbf{x}_{ipt}\mid\mathcal{D})\varphi\left(\frac{\mu(\mathbf{x}_{ipt}\mid\mathcal{D})-y^{*}_{n}}{\sigma(\mathbf{x}_{ipt}\mid\mathcal{D})}\right)

where yn=maxinyiy_{n}^{*}=\max_{i\leq n}y^{i}, Φ()\Phi(\cdot) is the cumulative distribution function of the standard normal distribution, and φ()\varphi(\cdot) is its probability density function. Once the gradients of μ\mu and σ\sigma are derived, only an additional 𝒪(p)\mathcal{O}(p) is needed for vector calculation, hence the total complexity of calculating the gradient of the acquisition function is 𝒪(pn+n2)\mathcal{O}(pn+n^{2}). Again, once the gradient is obtained, each quasi-Newton step needs an additional 𝒪(p2)\mathcal{O}(p^{2}); therefore, the complexity of one step of the quasi-Newton method for maximizing the acquisition function is 𝒪(p2+pn+n2)\mathcal{O}(p^{2}+pn+n^{2}).
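The two acquisition functions above translate directly into code; a minimal sketch, assuming SciPy's standard-normal `norm.cdf`/`norm.pdf` for Φ\Phi and φ\varphi:

```python
import numpy as np
from scipy.stats import norm

def ucb(mu, sigma, beta):
    """Upper confidence bound: mu + sqrt(beta) * sigma."""
    return mu + np.sqrt(beta) * sigma

def expected_improvement(mu, sigma, y_best):
    """Expected improvement for maximization, matching the formula above:
    (mu - y*) Phi(z) + sigma phi(z), with z = (mu - y*) / sigma."""
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
```

At mu = y*, EI reduces to sigma * phi(0) = sigma / sqrt(2 pi), a handy sanity check.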

Algorithm 6 VS-GP-UCB
1:Input: domain 𝒳\mathcal{X}, GP prior with mean μ0=0\mu_{0}=0 and the kernel function kk, maximal iteration NN
2:Output: Approximate maximizer 𝐱max\mathbf{x}^{max}
3:Let 𝒟=\mathcal{D}=\emptyset
4:Fix the last DdD-d variables, 𝐱[d+1:D]0\mathbf{x}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}^{0}
5:for t=1,2,Nt=1,2,\ldots N do
6:     Choose 𝐱[1:d]t=argmax𝐱[1:d]μ({𝐱[1:d],𝐱[d+1:D]0}𝒟)+βtσ({𝐱[1:d],𝐱[d+1:D]0}𝒟)\mathbf{x}^{t}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]}=\operatorname*{arg\,max}_{\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]}}\mu(\{\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]},\mathbf{x}^{0}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}\}\mid\mathcal{D})+\sqrt{\beta_{t}}\sigma(\{\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]},\mathbf{x}^{0}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}\}\mid\mathcal{D})
7:     Sample yt=f(𝐱t={𝐱[1:d]t,𝐱[d+1:D]0})+ϵty^{t}=f\left(\mathbf{x}^{t}=\{\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]}^{t},\mathbf{x}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}^{0}\}\right)+\epsilon_{t}, where ϵt𝒩(0,σ02)\epsilon_{t}\sim\mathcal{N}(0,\sigma_{0}^{2}) is Gaussian noise
8:     Update 𝒟=𝒟{(𝐱t,yt)}\mathcal{D}=\mathcal{D}\cup\{(\mathbf{x}^{t},y^{t})\}
9:end for
10:return 𝐱max\mathbf{x}^{max} that is equal to 𝐱i\mathbf{x}^{i} with maximal yiy^{i}

Appendix C Proof of Theorem 5.1

We show the details of VS-GP-UCB and provide the proof of Theorem 5.1 below.

Proof of Theorem 5.1.

The second inequality of Assumption 5.1 is equivalent to the following inequality:

P(sup𝐱𝒳|f𝐱j|>αL)aexp((Lb)2),j=d+1,,D\displaystyle P(\sup_{\mathbf{x}\in\mathcal{X}}\mathinner{\!\left\lvert\frac{\partial f}{\partial\mathbf{x}_{j}}\right\rvert}>\alpha L)\leq a\exp\left(-\left(\frac{L}{b}\right)^{2}\right),j=d+1,\dots,D

Therefore, according to Assumption 5.1 and the union bound, we have that w.p.1Daexp((Lb)2)w.p.\geq 1-Da\exp\left(-\left(\frac{L}{b}\right)^{2}\right):

𝐱,𝐱𝒳,|f(𝐱)f(𝐱)|L𝐱[1:d]𝐱[1:d]1+αL𝐱[d+1:D]𝐱[d+1:D]1\displaystyle\forall\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{X},\mathinner{\!\left\lvert f(\mathbf{x})-f(\mathbf{x}^{\prime})\right\rvert}\leq L\mathinner{\!\left\lVert\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]}-\mathbf{x}^{\prime}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]}\right\rVert}_{1}+\alpha L\mathinner{\!\left\lVert\mathbf{x}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}-\mathbf{x}^{\prime}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}\right\rVert}_{1}

Let δ2=Daexp((Lb)2)\frac{\delta}{2}=Da\exp\left(-\left(\frac{L}{b}\right)^{2}\right), i.e., L=blog(2Daδ)L=b\sqrt{\log\left(\frac{2Da}{\delta}\right)}; then we have that w.p.1δ2w.p.\geq 1-\frac{\delta}{2},

𝐱,𝐱𝒳,|f(𝐱)f(𝐱)|blog(2Daδ)(𝐱[1:d]𝐱[1:d]1+α𝐱[d+1:D]𝐱[d+1:D]1)\displaystyle\forall\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{X},\mathinner{\!\left\lvert f(\mathbf{x})-f(\mathbf{x}^{\prime})\right\rvert}\leq b\sqrt{\log\left(\frac{2Da}{\delta}\right)}\left(\mathinner{\!\left\lVert\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]}-\mathbf{x}^{\prime}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]}\right\rVert}_{1}+\alpha\mathinner{\!\left\lVert\mathbf{x}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}-\mathbf{x}^{\prime}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}\right\rVert}_{1}\right)
Algorithm 7 GP-UCB (Algorithm 1 in Srinivas et al. [2009])
1:Input: domain 𝒳\mathcal{X}, GP prior with mean μ0=0\mu_{0}=0 and the kernel function kk, maximal iteration NN
2:Output: Approximate maximizer 𝐱max\mathbf{x}^{max}
3:Let 𝒟^=\hat{\mathcal{D}}=\emptyset
4:for t=1,2,Nt=1,2,\ldots N do
5:     Choose 𝐱^t=argmax𝐱𝒳μ^(𝐱𝒟^={(𝐱^i,y^i)}i=1t1)+βtσ^(𝐱𝒟^={(𝐱^i,y^i)}i=1t1)\hat{\mathbf{x}}^{t}=\operatorname*{arg\,max}_{\mathbf{x}\in\mathcal{X}}\hat{\mu}(\mathbf{x}\mid\hat{\mathcal{D}}=\{(\hat{\mathbf{x}}^{i},\hat{y}^{i})\}_{i=1}^{t-1})+\sqrt{\beta_{t}}\hat{\sigma}(\mathbf{x}\mid\hat{\mathcal{D}}=\{(\hat{\mathbf{x}}^{i},\hat{y}^{i})\}_{i=1}^{t-1})
6:     Sample y^t=f(𝐱^t)+ϵ^t\hat{y}^{t}=f\left(\hat{\mathbf{x}}^{t}\right)+\hat{\epsilon}_{t}, where ϵ^t𝒩(0,σ02)\hat{\epsilon}_{t}\sim\mathcal{N}(0,\sigma_{0}^{2}) is Gaussian noise
7:     Update 𝒟^=𝒟^{(𝐱^t,y^t)}\hat{\mathcal{D}}=\hat{\mathcal{D}}\cup\{(\hat{\mathbf{x}}^{t},\hat{y}^{t})\}
8:end for
9:return 𝐱^max\hat{\mathbf{x}}^{max} that is equal to 𝐱^i\hat{\mathbf{x}}^{i} with maximal y^i\hat{y}^{i}

Now we consider the standard GP-UCB algorithm (Algorithm 7). Given the same GP prior and the same kernel function, since VS-GP-UCB and GP-UCB are two different algorithms, for each iteration they obtain different queries and have different posterior mean and standard deviation. We use the hat symbol to represent elements from GP-UCB, and in the following proof we use μt()\mu_{t}(\cdot) (μ^t()\hat{\mu}_{t}(\cdot)) to represent μ(𝒟={(𝐱i,yi)}i=1t)\mu(\cdot\mid\mathcal{D}=\{(\mathbf{x}^{i},y^{i})\}_{i=1}^{t}) (μ^(𝒟^={(𝐱^i,y^i)}i=1t)\hat{\mu}(\cdot\mid\hat{\mathcal{D}}=\{(\hat{\mathbf{x}}^{i},\hat{y}^{i})\}_{i=1}^{t})) and σt()\sigma_{t}(\cdot) (σ^t()\hat{\sigma}_{t}(\cdot)) to represent σ(𝒟={(𝐱i,yi)}i=1t)\sigma(\cdot\mid\mathcal{D}=\{(\mathbf{x}^{i},y^{i})\}_{i=1}^{t}) (σ^(𝒟^={(𝐱^i,y^i)}i=1t)\hat{\sigma}(\cdot\mid\hat{\mathcal{D}}=\{(\hat{\mathbf{x}}^{i},\hat{y}^{i})\}_{i=1}^{t})) for convenience.

Srinivas et al. [2009] prove the following lemma:

Lemma C.1 (Lemma 5.6 in Srinivas et al. [2009]).

Consider subsets 𝒳t𝒳\mathcal{X}_{t}\subset\mathcal{X} with finite size, |𝒳t|<|\mathcal{X}_{t}|<\infty, pick δ(0,1)\delta\in(0,1) and set βt=2log(|𝒳t|πtδ)\beta_{t}=2\log\left(\frac{\mathinner{\!\left\lvert\mathcal{X}_{t}\right\rvert}\pi_{t}}{\delta}\right), where t1πt1=1,πt>0\sum_{t\geq 1}\pi_{t}^{-1}=1,\pi_{t}>0. Running GP-UCB, we have that

|f(𝐱)μ^t1(𝐱)|βtσ^t1(𝐱)𝐱𝒳t,t1\displaystyle\mathinner{\!\left\lvert f(\mathbf{x})-\hat{\mu}_{t-1}(\mathbf{x})\right\rvert}\leq\sqrt{\beta_{t}}\hat{\sigma}_{t-1}(\mathbf{x})\quad\forall\mathbf{x}\in\mathcal{X}_{t},\forall t\geq 1

holds with probability 1δ\geq 1-\delta.

By using the same proof procedure, we can easily obtain the following lemma:

Lemma C.2.

Consider subsets 𝒳t𝒳\mathcal{X}_{t}\subset\mathcal{X} with finite size, |𝒳t|<|\mathcal{X}_{t}|<\infty, pick δ(0,1)\delta\in(0,1) and set βt=2log(|𝒳t|πtδ)\beta_{t}=2\log\left(\frac{\mathinner{\!\left\lvert\mathcal{X}_{t}\right\rvert}\pi_{t}}{\delta}\right), where t1πt1=1,πt>0\sum_{t\geq 1}\pi_{t}^{-1}=1,\pi_{t}>0. Running VS-GP-UCB, we have

|f(𝐱)μt1(𝐱)|βtσt1(𝐱)𝐱𝒳t,t1\displaystyle\mathinner{\!\left\lvert f(\mathbf{x})-\mu_{t-1}(\mathbf{x})\right\rvert}\leq\sqrt{\beta_{t}}\sigma_{t-1}(\mathbf{x})\quad\forall\mathbf{x}\in\mathcal{X}_{t},\forall t\geq 1

holds with probability 1δ\geq 1-\delta.

Now consider 𝒳t\mathcal{X}_{t} to be the following discretization: for each dimension j=1,,dj=1,\dots,d, discretize it with τt\tau_{t} uniformly spaced points, and for each dimension j=d+1,,Dj=d+1,\dots,D, discretize it with ατt\alpha\tau_{t} uniformly spaced points plus the jj-th element of 𝐱t\mathbf{x}^{t}, 𝐱jt=𝐱j0\mathbf{x}^{t}_{j}=\mathbf{x}^{0}_{j} (note that 𝐱jt\mathbf{x}^{t}_{j} is fixed when jd+1j\geq d+1). Let (𝐱)~t\widetilde{(\mathbf{x})}_{t} be the closest point in 𝒳t\mathcal{X}_{t} to 𝐱\mathbf{x} under the L1L_{1} norm (note that 𝐱t\mathbf{x}^{t} is the tt-th query of the VS-GP-UCB algorithm, while (𝐱t)~t\widetilde{(\mathbf{x}^{t})}_{t^{\prime}} is its closest point in 𝒳t\mathcal{X}_{t^{\prime}}). Then, because of the smoothness assumption, we have w.p.1δ2w.p.\geq 1-\frac{\delta}{2}:

|f(𝐱)f((𝐱)~t)|blog(2Daδ)dτt+αblog(2Daδ)Ddατt\displaystyle\mathinner{\!\left\lvert f(\mathbf{x})-f\left(\widetilde{(\mathbf{x})}_{t}\right)\right\rvert}\leq b\sqrt{\log\left(\frac{2Da}{\delta}\right)}\frac{d}{\tau_{t}}+\alpha b\sqrt{\log\left(\frac{2Da}{\delta}\right)}\frac{D-d}{\alpha\tau_{t}} =blog(2Daδ)Dτt,\displaystyle=b\sqrt{\log\left(\frac{2Da}{\delta}\right)}\frac{D}{\tau_{t}},
𝐱𝒳,t1\displaystyle\forall\mathbf{x}\in\mathcal{X},\forall t\geq 1

By choosing τt=Dt2blog(2Daδ)\tau_{t}=Dt^{2}b\sqrt{\log\left(\frac{2Da}{\delta}\right)}, we have w.p.1δ2w.p.\geq 1-\frac{\delta}{2}:

|f(𝐱)f((𝐱)~t)|1t2𝐱𝒳,t1\displaystyle\mathinner{\!\left\lvert f(\mathbf{x})-f(\widetilde{(\mathbf{x})}_{t})\right\rvert}\leq\frac{1}{t^{2}}\quad\forall\mathbf{x}\in\mathcal{X},\forall t\geq 1

In this case, we have:

|𝒳t|=(ατt+1)Ddτtd=(αDt2blog(2Daδ)+1)Dd(Dt2blog(2Daδ))d\displaystyle\mathinner{\!\left\lvert\mathcal{X}_{t}\right\rvert}=(\alpha\tau_{t}+1)^{D-d}\tau_{t}^{d}=\left(\alpha Dt^{2}b\sqrt{\log\left(\frac{2Da}{\delta}\right)}+1\right)^{D-d}\left(Dt^{2}b\sqrt{\log\left(\frac{2Da}{\delta}\right)}\right)^{d}

Combining this with Lemma C.2 and setting:

βt=2log(4|𝒳t|πtδ)=2log4πtδ\displaystyle\beta_{t}=2\log\left(\frac{4\mathinner{\!\left\lvert\mathcal{X}_{t}\right\rvert}\pi_{t}}{\delta}\right)=2\log\frac{4\pi_{t}}{\delta} +2(Dd)log(αDt2blog(2Daδ)+1)\displaystyle+2(D-d)\log\left(\alpha Dt^{2}b\sqrt{\log\left(\frac{2Da}{\delta}\right)}+1\right)
+2dlog(Dt2blog(2Daδ))\displaystyle+2d\log\left(Dt^{2}b\sqrt{\log\left(\frac{2Da}{\delta}\right)}\right)

then w.p. 1δ\geq 1-\delta:

|f(𝐱)μt1((𝐱)~t)|\displaystyle\mathinner{\!\left\lvert f(\mathbf{x})-\mu_{t-1}\left(\widetilde{(\mathbf{x})}_{t}\right)\right\rvert} |f(𝐱)f((𝐱)~t)+f((𝐱)~t)μt1((𝐱)~t)|\displaystyle\leq\mathinner{\!\left\lvert f(\mathbf{x})-f\left(\widetilde{(\mathbf{x})}_{t}\right)+f\left(\widetilde{(\mathbf{x})}_{t}\right)-\mu_{t-1}\left(\widetilde{(\mathbf{x})}_{t}\right)\right\rvert}
|f(𝐱)f((𝐱)~t)|+|f((𝐱)~t)μt1((𝐱)~t)|\displaystyle\leq\mathinner{\!\left\lvert f(\mathbf{x})-f\left(\widetilde{(\mathbf{x})}_{t}\right)\right\rvert}+\mathinner{\!\left\lvert f\left(\widetilde{(\mathbf{x})}_{t}\right)-\mu_{t-1}\left(\widetilde{(\mathbf{x})}_{t}\right)\right\rvert}
1t2+βtσt1((𝐱)~t)𝐱𝒳,t1.\displaystyle\leq\frac{1}{t^{2}}+\sqrt{\beta_{t}}\sigma_{t-1}\left(\widetilde{(\mathbf{x})}_{t}\right)\quad\forall\mathbf{x}\in\mathcal{X},\forall t\geq 1.

Likewise, by setting the same βt\beta_{t}, we have w.p. 1δ\geq 1-\delta:

|f(𝐱)μ^t1((𝐱)~t)|1t2+βtσ^t1((𝐱)~t)𝐱𝒳,t1.\displaystyle\mathinner{\!\left\lvert f(\mathbf{x})-\hat{\mu}_{t-1}\left(\widetilde{(\mathbf{x})}_{t}\right)\right\rvert}\leq\frac{1}{t^{2}}+\sqrt{\beta_{t}}\hat{\sigma}_{t-1}\left(\widetilde{(\mathbf{x})}_{t}\right)\quad\forall\mathbf{x}\in\mathcal{X},\forall t\geq 1.

Let 𝐱mixt={𝐱^[1:d]t,𝐱[d+1:D]0}\mathbf{x}^{t}_{mix}=\{\hat{\mathbf{x}}^{t}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]},\mathbf{x}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}^{0}\} (first dd variables are equal to those in 𝐱^t\hat{\mathbf{x}}^{t}, and the others are equal to 𝐱[d+1:D]0\mathbf{x}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}^{0}), again because of the smoothness assumption, we have w.p.1δ2w.p.\geq 1-\frac{\delta}{2}:

|f(𝐱mixt)f(𝐱^t)|αblog(2Daδ)𝐱[d+1:D]0𝐱^[d+1:D]t1αblog(2Daδ)(Dd)\displaystyle\mathinner{\!\left\lvert f(\mathbf{x}^{t}_{mix})-f(\hat{\mathbf{x}}^{t})\right\rvert}\leq\alpha b\sqrt{\log\left(\frac{2Da}{\delta}\right)}\mathinner{\!\left\lVert\mathbf{x}^{0}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}-\hat{\mathbf{x}}^{t}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}\right\rVert}_{1}\leq\alpha b\sqrt{\log\left(\frac{2Da}{\delta}\right)}(D-d)

Note that the point in 𝒳t\mathcal{X}_{t} that is closest to 𝐱mixt\mathbf{x}^{t}_{mix} is (𝐱mixt)~t={(𝐱^[1:d]t)~t,𝐱[d+1:D]0}\widetilde{(\mathbf{x}^{t}_{mix})}_{t}=\left\{\widetilde{(\hat{\mathbf{x}}^{t}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]})}_{t},\mathbf{x}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}^{0}\right\} because all elements in 𝐱[d+1:D]0\mathbf{x}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}^{0} are contained in the discretization. Since 𝐱[1:d]t\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]}^{t} is the maximizer of UCB with fixed DdD-d variables, we have:

μt1((𝐱mixt)~t)+βtσt1((𝐱mixt)~t)μt1(𝐱t)+βtσt1(𝐱t)\displaystyle\mu_{t-1}\left(\widetilde{(\mathbf{x}^{t}_{mix})}_{t}\right)+\sqrt{\beta_{t}}\sigma_{t-1}\left(\widetilde{(\mathbf{x}^{t}_{mix})}_{t}\right)\leq\mu_{t-1}(\mathbf{x}^{t})+\sqrt{\beta_{t}}\sigma_{t-1}(\mathbf{x}^{t})

Similarly, considering GP-UCB algorithm, we have:

μ^t1((𝐱)~t)+βtσ^t1((𝐱)~t)μ^t1(𝐱^t)+βtσ^t1(𝐱^t)\displaystyle\hat{\mu}_{t-1}(\widetilde{(\mathbf{x}^{*})}_{t})+\sqrt{\beta_{t}}\hat{\sigma}_{t-1}(\widetilde{(\mathbf{x}^{*})}_{t})\leq\hat{\mu}_{t-1}(\hat{\mathbf{x}}^{t})+\sqrt{\beta_{t}}\hat{\sigma}_{t-1}(\hat{\mathbf{x}}^{t})

To finish the proof, we need to use another lemma in Srinivas et al. [2009]:

Lemma C.3 (Lemma 5.5 in Srinivas et al. [2009]).

Let {𝐱^t}\{\hat{\mathbf{x}}^{t}\} be a sequence of points chosen by GP-UCB, pick δ(0,1)\delta\in(0,1) and set βt=2log(πtδ)\beta_{t}=2\log\left(\frac{\pi_{t}}{\delta}\right), where t1πt1=1,πt>0\sum_{t\geq 1}\pi_{t}^{-1}=1,\pi_{t}>0. Then,

|f(𝐱^t)μ^t1(𝐱^t)|βtσ^t1(𝐱^t),t1\displaystyle\mathinner{\!\left\lvert f(\hat{\mathbf{x}}^{t})-\hat{\mu}_{t-1}(\hat{\mathbf{x}}^{t})\right\rvert}\leq\sqrt{\beta_{t}}\hat{\sigma}_{t-1}(\hat{\mathbf{x}}^{t}),\forall t\geq 1

holds with probability 1δ\geq 1-\delta.

By using the same proof procedure, we can easily obtain the following lemma:

Lemma C.4.

Let {𝐱t}\{\mathbf{x}^{t}\} be a sequence of points chosen by VS-GP-UCB, pick δ(0,1)\delta\in(0,1) and set βt=2log(πtδ)\beta_{t}=2\log\left(\frac{\pi_{t}}{\delta}\right), where t1πt1=1,πt>0\sum_{t\geq 1}\pi_{t}^{-1}=1,\pi_{t}>0. Then,

|f(𝐱t)μt1(𝐱t)|βtσt1(𝐱t),t1\displaystyle\mathinner{\!\left\lvert f(\mathbf{x}^{t})-\mu_{t-1}(\mathbf{x}^{t})\right\rvert}\leq\sqrt{\beta_{t}}\sigma_{t-1}(\mathbf{x}^{t}),\forall t\geq 1

holds with probability 1δ\geq 1-\delta.

Finally, by using δ4\frac{\delta}{4} and setting (a common choice is πt=π2t26\pi_{t}=\frac{\pi^{2}t^{2}}{6}):

βt=2log16πtδ+2(Dd)log(αDt2blog(8Daδ)+1)+2dlog(Dt2blog(8Daδ))\displaystyle\beta_{t}=2\log\frac{16\pi_{t}}{\delta}+2(D-d)\log\left(\alpha Dt^{2}b\sqrt{\log\left(\frac{8Da}{\delta}\right)}+1\right)+2d\log\left(Dt^{2}b\sqrt{\log\left(\frac{8Da}{\delta}\right)}\right)

we have w.p.1δw.p.\geq 1-\delta that the simple regret rtr_{t} of VS-GP-UCB, rt=f(𝐱)f(𝐱t={𝐱[1:d]t,𝐱[d+1:D]0})r_{t}=f(\mathbf{x}^{*})-f(\mathbf{x}^{t}=\{\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}d]}^{t},\mathbf{x}_{[d+1\mathrel{\mathop{\mathchar 58\relax}}D]}^{0}\}), is upper bounded by:

rt\displaystyle r_{t} =f(𝐱)f(𝐱t)\displaystyle=f(\mathbf{x}^{*})-f(\mathbf{x}^{t})
μ^t1((𝐱)~t)+βtσ^t1((𝐱)~t)+1t2f(𝐱t)\displaystyle\leq\hat{\mu}_{t-1}(\widetilde{(\mathbf{x}^{*})}_{t})+\sqrt{\beta_{t}}\hat{\sigma}_{t-1}(\widetilde{(\mathbf{x}^{*})}_{t})+\frac{1}{t^{2}}-f(\mathbf{x}^{t})
μ^t1(𝐱^t)+βtσ^t1(𝐱^t)+1t2f(𝐱t)\displaystyle\leq\hat{\mu}_{t-1}(\hat{\mathbf{x}}^{t})+\sqrt{\beta_{t}}\hat{\sigma}_{t-1}(\hat{\mathbf{x}}^{t})+\frac{1}{t^{2}}-f(\mathbf{x}^{t})
=μ^t1(𝐱^t)+βtσ^t1(𝐱^t)f(𝐱^t)+f(𝐱^t)f(𝐱mixt)+f(𝐱mixt)f(𝐱t)+1t2\displaystyle=\hat{\mu}_{t-1}(\hat{\mathbf{x}}^{t})+\sqrt{\beta_{t}}\hat{\sigma}_{t-1}(\hat{\mathbf{x}}^{t})-f(\hat{\mathbf{x}}^{t})+f(\hat{\mathbf{x}}^{t})-f(\mathbf{x}^{t}_{mix})+f(\mathbf{x}^{t}_{mix})-f(\mathbf{x}^{t})+\frac{1}{t^{2}}
2βtσ^t1(𝐱^t)+αblog(8Daδ)(Dd)+μt1((𝐱mixt)~t)\displaystyle\leq 2\sqrt{\beta_{t}}\hat{\sigma}_{t-1}(\hat{\mathbf{x}}^{t})+\alpha b\sqrt{\log\left(\frac{8Da}{\delta}\right)}(D-d)+\mu_{t-1}\left(\widetilde{(\mathbf{x}^{t}_{mix})}_{t}\right)
+βtσt1((𝐱mixt)~t)f(𝐱t)+2t2\displaystyle+\sqrt{\beta_{t}}\sigma_{t-1}\left(\widetilde{(\mathbf{x}^{t}_{mix})}_{t}\right)-f(\mathbf{x}^{t})+\frac{2}{t^{2}}
2βtσ^t1(𝐱^t)+αblog(8Daδ)(Dd)+μt1(𝐱t)+βtσt1(𝐱t)f(𝐱t)+2t2\displaystyle\leq 2\sqrt{\beta_{t}}\hat{\sigma}_{t-1}(\hat{\mathbf{x}}^{t})+\alpha b\sqrt{\log\left(\frac{8Da}{\delta}\right)}(D-d)+\mu_{t-1}(\mathbf{x}^{t})+\sqrt{\beta_{t}}\sigma_{t-1}(\mathbf{x}^{t})-f(\mathbf{x}^{t})+\frac{2}{t^{2}}
2βtσ^t1(𝐱^t)+2βtσt1(𝐱t)+2t2+αblog(8Daδ)(Dd)\displaystyle\leq 2\sqrt{\beta_{t}}\hat{\sigma}_{t-1}(\hat{\mathbf{x}}^{t})+2\sqrt{\beta_{t}}\sigma_{t-1}(\mathbf{x}^{t})+\frac{2}{t^{2}}+\alpha b\sqrt{\log\left(\frac{8Da}{\delta}\right)}(D-d)

By using Lemma 5.4 in Srinivas et al. [2009] and the Cauchy-Schwarz inequality, both t=1N2βtσ^t1(𝐱^t)\sum_{t=1}^{N}2\sqrt{\beta_{t}}\hat{\sigma}_{t-1}(\hat{\mathbf{x}}^{t}) and t=1N2βtσt1(𝐱t)\sum_{t=1}^{N}2\sqrt{\beta_{t}}\sigma_{t-1}(\mathbf{x}^{t}) are upper bounded by C1NβNγN\sqrt{C_{1}N\beta_{N}\gamma_{N}}, where C1=8log(1+σ02)C_{1}=\frac{8}{\log\left(1+\sigma_{0}^{-2}\right)}, and γN:=maxA𝒳:|A|=N𝐈(𝐲A;𝐟A)\gamma_{N}\mathrel{\mathop{\mathchar 58\relax}}=\max_{A\subset\mathcal{X}\mathrel{\mathop{\mathchar 58\relax}}|A|=N}\mathbf{I}(\mathbf{y}_{A};\mathbf{f}_{A}) is the maximum information gain with a finite set of sampling points AA, 𝐟A=[f(𝐱)]𝐱A\mathbf{f}_{A}=[f(\mathbf{x})]_{\mathbf{x}\in A}, 𝐲A=𝐟A+ϵA\mathbf{y}_{A}=\mathbf{f}_{A}+\epsilon_{A}. Therefore, we have:

RN=t=1Nrt2C1NβNγN+π23+αblog(8Daδ)N(Dd)\displaystyle R_{N}=\sum_{t=1}^{N}r_{t}\leq 2\sqrt{C_{1}N\beta_{N}\gamma_{N}}+\frac{\pi^{2}}{3}+\alpha b\sqrt{\log\left(\frac{8Da}{\delta}\right)}N(D-d)

Hence:

RNN=t=1NrtN2C1βNγNN+π23N+αblog(8Daδ)(Dd)\displaystyle\frac{R_{N}}{N}=\frac{\sum_{t=1}^{N}r_{t}}{N}\leq 2\sqrt{C_{1}\frac{\beta_{N}\gamma_{N}}{N}}+\frac{\pi^{2}}{3N}+\alpha b\sqrt{\log\left(\frac{8Da}{\delta}\right)}(D-d)

Appendix D Detailed experimental settings and extended discussion of experimental results

We use the BoTorch framework to implement VS-BO. We compare VS-BO to the following existing BO methods: vanilla BO, implemented with the standard BoTorch framework (https://botorch.org); REMBO and its variant REMBO Interleave [Wang et al., 2016], whose implementations are based on Metzen [2016] (https://github.com/jmetzen/bayesian_optimization); Dragonfly (https://github.com/dragonfly/dragonfly/) [Kandasamy et al., 2020], which contains the Add-GP method; HeSBO [Nayebi et al., 2019], which is implemented in the Adaptive Experimentation Platform (Ax) (https://github.com/facebook/Ax/tree/master/ax/modelbridge/strategies); and ALEBO (https://github.com/facebookresearch/alebo) [Letham et al., 2020]. At the time of writing, the source code for Spagnol et al. [2019] and Eriksson and Jankowiak [2021] had not been released, so we could not compare VS-BO with these two methods. Both VS-BO and vanilla BO use Matern 5/2 as the kernel function and expected improvement as the acquisition function, and use limited-memory BFGS (L-BFGS) to fit the GP and optimize the acquisition function. The number of initial samples NinitN_{init} is set to 5 for all methods; NvsN_{vs} in VS-BO is set to 20 and NisN_{is} to 10000 for all experiments. The number of interleaved cycles for REMBO Interleave is set to 4. Since our algorithm maximizes the black-box function, every test function that is originally minimized is negated.

In the synthetic experiments, as described in section 6.1, for each test function we add some unimportant variables as well as unrelated variables to make it high-dimensional. The standard Branin function fBraninf_{Branin} has two dimensions with the input domain 𝒳Branin=[5,10]×[0,10]\mathcal{X}_{Branin}=[-5,10]\times[0,10], and we construct a new Branin function FbraninF_{branin} as follows:

Fbranin(𝐱)=fBranin(𝐱[1:2])+0.1fBranin(𝐱[3:4])\displaystyle F_{branin}(\mathbf{x})=f_{Branin}(\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}2]})+0.1f_{Branin}(\mathbf{x}_{[3\mathrel{\mathop{\mathchar 58\relax}}4]}) +0.01fBranin(𝐱[5:6]),\displaystyle+0.01f_{Branin}(\mathbf{x}_{[5\mathrel{\mathop{\mathchar 58\relax}}6]}),
𝐱(i=13𝒳Branin)i=144[0,1]\displaystyle\mathbf{x}\in\left(\bigotimes_{i=1}^{3}\mathcal{X}_{Branin}\right)\bigotimes_{i=1}^{44}[0,1]

where \bigotimes represents the direct product. We use de=[2,2,2]d_{e}=[2,2,2] to represent the dimensions of the effective subspace of FbraninF_{branin}; the total effective dimension is 6, but the number of important variables is only 2.
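The construction above can be sketched as follows; the Branin constants are the standard benchmark values, and `branin`/`F_branin` are illustrative names rather than the experiment code:

```python
import numpy as np

def branin(x1, x2):
    """Standard 2-D Branin function (minimized form; global minimum ~0.397887)."""
    a, b, c = 1.0, 5.1 / (4 * np.pi ** 2), 5.0 / np.pi
    r, s, t = 6.0, 10.0, 1.0 / (8 * np.pi)
    return a * (x2 - b * x1 ** 2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s

def F_branin(x):
    """50-D test function from the text: dims 1-2 are important, dims 3-6 are
    effective but unimportant (weights 0.1 and 0.01), dims 7-50 are inert."""
    assert len(x) == 50
    return (branin(x[0], x[1])
            + 0.1 * branin(x[2], x[3])
            + 0.01 * branin(x[4], x[5]))
```

By construction, perturbing any of the last 44 coordinates leaves the function value unchanged.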

Likewise, for the standard Hartmann6 function fHartmann6f_{Hartmann6} that has six dimensions with the input domain [0,1]6[0,1]^{6}, we construct Fhm6F_{hm6} as:

Fhm6(𝐱)=fHartmann6(𝐱[1:6])+0.1fHartmann6(𝐱[7:12])+0.01fHartmann6(𝐱[13:18])𝐱[0,1]50\displaystyle F_{hm6}(\mathbf{x})=f_{Hartmann6}(\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}6]})+0.1f_{Hartmann6}(\mathbf{x}_{[7\mathrel{\mathop{\mathchar 58\relax}}12]})+0.01f_{Hartmann6}(\mathbf{x}_{[13\mathrel{\mathop{\mathchar 58\relax}}18]})\quad\mathbf{x}\in[0,1]^{50}

and use de=[6,6,6]d_{e}=[6,6,6] to represent the dimension of the effective subspace. For the Styblinski-Tang4 function fST4f_{ST4} that has four dimensions with the input domain [5,5]4[-5,5]^{4}, we construct FST4F_{ST4} as:

FST4(𝐱)=fST4(𝐱[1:4])+0.1fST4(𝐱[5:8])+0.01fST4(𝐱[9:12])𝐱[5,5]50\displaystyle F_{ST4}(\mathbf{x})=f_{ST4}(\mathbf{x}_{[1\mathrel{\mathop{\mathchar 58\relax}}4]})+0.1f_{ST4}(\mathbf{x}_{[5\mathrel{\mathop{\mathchar 58\relax}}8]})+0.01f_{ST4}(\mathbf{x}_{[9\mathrel{\mathop{\mathchar 58\relax}}12]})\quad\mathbf{x}\in[-5,5]^{50}

and use de=[4,4,4]d_{e}=[4,4,4] to represent the dimension of the effective subspace. All synthetic experiments are run on the same Linux cluster that has 40 3.0 GHz 10-Core Intel Xeon E5-2690 v2 CPUs.

Refer to caption
Figure 4: Performance of BO methods on Branin, Hartmann6 and Styblinski-Tang4 test functions. For each test function, we do 20 independent runs for each method. We plot the mean and 1/8 standard deviation of the best maximum value found by wall clock time used for BO (first row) and CPU time (second row).

Figures 2 and 4 show the performance of the different BO methods. They are compared under a fixed iteration budget (Figure 2), a fixed wall-clock-time budget used for BO itself (time for evaluating the black-box function is excluded), or a fixed CPU-time budget (Figure 4). Figure 5 shows, for each variable, the frequency with which it is chosen as important in the variable-selection steps of VS-BO. Since we do 20 runs of VS-BO on each test function, each run has 210 iterations, and important variables are re-selected every 20 iterations, the maximum frequency with which each variable can be chosen as important is 200. In the Branin case with de=[2,2,2]d_{e}=[2,2,2], the first two variables are important, and the left panel of Figure 5 shows that the frequency of choosing the first two variables as important is significantly higher than that of choosing the other variables. In the Hartmann6 case with de=[6,6,6]d_{e}=[6,6,6], the first six variables are important, and the middle panel of Figure 5 shows that the frequency of choosing the first six variables as important is the highest. In the Styblinski-Tang4 case with de=[4,4,4]d_{e}=[4,4,4], the first four variables are important, and again the right panel of Figure 5 shows that the frequency of choosing the first four variables as important is the highest. These results indicate that VS-BO is able to find the truly important variables and to control false positives.

Refer to caption
Figure 5: The total frequency of being chosen as important for each variable on Branin case (left), Hartmann6 case (middle) and Styblinski-Tang4 case (right). For the Branin function, the first two variables are important; for the Hartmann6 function, the first six variables are important; and for the Styblinski-Tang4 function, the first four variables are important.

Figure 6 compares the wall clock time and CPU time per iteration of VS-BO and vanilla BO for the GP-fitting step and the acquisition-function-optimization step. As the number of iterations increases, the runtime of vanilla BO increases significantly, while the runtime of VS-BO increases only slightly. These results empirically show that VS-BO is able to reduce the runtime of the BO process.

Refer to caption
Figure 6: The wall clock time or CPU time comparison between VS-BO and vanilla BO for each iteration. The Branin test function with de=[2,2,2]d_{e}=[2,2,2] and D=50D=50 is run here. (a) Wall clock time comparison at the GP fitting step (b) CPU time comparison at the GP fitting step (c) Wall clock time comparison at the acquisition function optimization step (d) CPU time comparison at the acquisition function optimization step

To test the performance of VS-BO on a function that has a non-axis-aligned effective subspace, we construct a rotational Hartmann6 function Frot_H(𝐱)F_{rot\_H}(\mathbf{x}) in the same way as Letham et al. [2020]. We sample a rotation matrix AA from the Haar distribution on the orthogonal group SO(100)SO(100) [Stewart, 1980] and define Frot_H(𝐱)F_{rot\_H}(\mathbf{x}) as follows:

Frot_H(𝐱)=fHartmann6(A[:6]𝐱)\displaystyle F_{rot\_H}(\mathbf{x})=f_{Hartmann6}(A[\mathrel{\mathop{\mathchar 58\relax}}6]\mathbf{x})

where A[:6]6×100A[\mathrel{\mathop{\mathchar 58\relax}}6]\in\mathbb{R}^{6\times 100} represents the first 6 rows of AA and 𝐱[1,1]100\mathbf{x}\in[-1,1]^{100} is the input. We then run VS-BO as well as the other methods on this function and compare their performance. Figure 7 shows that in this case REMBO with d=6d=6 has the best performance. VS-BO also performs well, and surprisingly it outperforms several embedding-based methods such as ALEBO (d=6d=6) and REMBO Interleave (d=6d=6), even though it tries to learn an axis-aligned subspace.
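This rotated construction can be sketched as below, assuming SciPy's `special_ortho_group` for the Haar draw on SO(dim) and the standard Hartmann6 benchmark constants; the helper names are illustrative, not the experiment code:

```python
import numpy as np
from scipy.stats import special_ortho_group

# Standard Hartmann6 benchmark constants (minimized form; the paper negates it).
ALPHA = np.array([1.0, 1.2, 3.0, 3.2])
A_H = np.array([[10, 3, 17, 3.5, 1.7, 8],
                [0.05, 10, 17, 0.1, 8, 14],
                [3, 3.5, 1.7, 10, 17, 8],
                [17, 8, 0.05, 10, 0.1, 14]])
P_H = 1e-4 * np.array([[1312, 1696, 5569, 124, 8283, 5886],
                       [2329, 4135, 8307, 3736, 1004, 9991],
                       [2348, 1451, 3522, 2883, 3047, 6650],
                       [4047, 8828, 8732, 5743, 1091, 381]])

def hartmann6(z):
    """Standard 6-D Hartmann function."""
    return -np.sum(ALPHA * np.exp(-np.sum(A_H * (z - P_H) ** 2, axis=1)))

def make_rotated_hartmann6(dim=100, seed=0):
    """Draw a Haar-distributed rotation A from SO(dim) and embed Hartmann6
    along its first 6 rows, as in the construction above."""
    A = special_ortho_group.rvs(dim, random_state=seed)  # Haar measure on SO(dim)
    A6 = A[:6]                                           # shape (6, dim)
    return lambda x: hartmann6(A6 @ x)
```

Because A6 mixes all input coordinates, the resulting effective subspace is not axis-aligned.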

Refer to caption
Figure 7: Performance of BO methods on the rotational Hartmann6 function. We do 20 independent runs for each method. We plot the mean and 1/8 standard deviation of the best maximum value found by iterations.

Spagnol et al. [2019] introduce several methods to sample the unimportant variables and show that the mix strategy is the best: at each iteration, the algorithm samples values of the unimportant variables from the uniform distribution with probability 0.5, or reuses their values from the previous query with the maximal function value with probability 0.5. To compare the mix strategy with our sampling strategy, i.e., sampling from the CMA-ES-related posterior, we replace our strategy with the mix strategy in VS-BO, creating a variant method called VSBO mix. All other parts of VSBO mix are the same as those of VS-BO. We compare VS-BO with VSBO mix on the three synthetic functions, and Figure 10 shows that in general our sampling strategy is clearly better than the mix strategy.
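The mix strategy just described can be sketched as follows (illustrative names and signature; VS-BO's own strategy instead samples these variables from a CMA-ES-related posterior):

```python
import numpy as np

def mix_strategy_sample(history_X, history_y, unimp_idx, bounds, rng):
    """Mix strategy of Spagnol et al. [2019] as described above: with
    probability 0.5 sample the unimportant variables uniformly, otherwise
    reuse their values from the best query found so far (maximization)."""
    lo, hi = bounds                                  # per-dimension bounds, shape (D,)
    if rng.random() < 0.5:
        return rng.uniform(lo[unimp_idx], hi[unimp_idx])
    best = history_X[np.argmax(history_y)]           # incumbent query
    return best[unimp_idx]
```

Over many draws, roughly half of the samples should equal the incumbent's unimportant coordinates and the rest should be uniform within the bounds.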

For the real-world problems, the rover trajectory problem is a high-dimensional optimization problem with input domain [0,1]60[0,1]^{60}. The problem setting in our experiment is the same as that in Wang et al. [2017], and the source code of this problem can be found at https://github.com/zi-w/Ensemble-Bayesian-Optimization. MOPTA08 is another high-dimensional optimization problem with input domain [0,1]124[0,1]^{124}. It has one objective function fmopta(𝐱)f_{mopta}(\mathbf{x}) that needs to be minimized and 68 constraints ci(𝐱),i{1,2,68}c_{i}(\mathbf{x}),i\in\{1,2,\ldots 68\}. Similar to Eriksson and Jankowiak [2021], we convert these constraints to soft penalties and convert the minimization problem into a maximization problem by negating the objective function, i.e., we construct the following new function FmoptaF_{mopta}:

Fmopta(𝐱)=(fmopta(𝐱)+10i=168max(0,ci(𝐱)))\displaystyle F_{mopta}(\mathbf{x})=-\left(f_{mopta}(\mathbf{x})+10\sum_{i=1}^{68}\max(0,c_{i}(\mathbf{x}))\right)

The Fortran code of MOPTA can be found at https://www.miguelanjos.com/jones-benchmark, and we use the code at https://gist.github.com/denis-bz/c951e3e59fb4d70fd1a52c41c3675187 to wrap it in Python. All experiments for these two real-world problems are run on the same Linux cluster, which has 80 2.40 GHz 20-Core Intel Xeon 6148 CPUs.
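MOPTA08 itself is an external Fortran executable, but the soft-penalty conversion above is generic; a minimal sketch with placeholder objective and constraints (the names are illustrative):

```python
import numpy as np

PENALTY = 10.0  # penalty weight from the construction above

def soft_penalty_objective(f, constraints):
    """Wrap a minimization objective f and constraints c_i(x) <= 0 into the
    maximization target F(x) = -(f(x) + 10 * sum_i max(0, c_i(x)))."""
    def F(x):
        violation = sum(max(0.0, c(x)) for c in constraints)
        return -(f(x) + PENALTY * violation)
    return F
```

On feasible points the wrapper simply negates f; each unit of constraint violation subtracts a further 10 from the maximization target.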

Figures 3 and 8 show the performance of the different BO methods on these two real-world problems. Note that Dragonfly performs quite well when the wall clock or CPU time used for BO is fixed, but not as well as VS-BO under a fixed iteration budget. Similar to Figure 5, the left column of Figure 9 shows, for each variable, the frequency with which it is chosen as important in the variable-selection steps of VS-BO. As described in section 6.2, we design a sampling experiment to test the accuracy of the variable selection. The indices of the five variables chosen most frequently are {1,2,3,59,60}\{1,2,3,59,60\} on the rover trajectory problem and {30,37,42,79,112}\{30,37,42,79,112\} on MOPTA08, and the indices of the five variables chosen least frequently are {15,18,29,38,51}\{15,18,29,38,51\} and {59,77,91,105,114}\{59,77,91,105,114\}, respectively. The total number of samples in each set is 800000. The right column of Figure 9 shows the empirical distributions of function values from the two sets of samples. The significant difference between the two distributions in each panel shows that changing the values of the frequently chosen variables alters the function value more substantially, indicating that these variables are indeed more important.

Refer to caption
Figure 8: Performance of BO methods on the rover trajectory and MOPTA08 problems. We do 20 independent runs on the rover trajectory problem and 15 on the MOPTA08 problem. We plot the mean and 1/4 standard deviation of the best maximum value found by wall clock time (first row) and CPU time (second row).

Eriksson and Jankowiak [2021] show an impressive performance of their method, SAASBO, on the MOPTA08 problem. According to their results, SAASBO outperforms VS-BO in this case. We do not compare VS-BO with SAASBO in our work since the source code of SAASBO has not been released. One potential drawback of SAASBO is that it is very time-consuming: for each iteration, SAASBO needs significantly more runtime than ALEBO (section C of Eriksson and Jankowiak [2021]), while our experiments show that ALEBO is significantly more time-consuming than VS-BO, sometimes even more time-consuming than vanilla BO, especially when dd is large (for example, when d10d\geq 10). Therefore, SAASBO might not be a good choice when the runtime of BO needs to be considered.

Refer to caption
Figure 9: (left column) The total frequency of being chosen as important for each variable on the rover trajectory and MOPTA08 problems. (right column) The distribution of function values when sampling the five variables chosen most frequently (important) or the five variables chosen least frequently (unimportant), with all other variables fixed.
Refer to caption
Figure 10: Performance of VS-BO and VSBO mix on Branin, Hartmann6 and Styblinski-Tang4 test functions. For each test function, we do 20 independent runs for each method. We plot the mean and 1/8 standard deviation of the best maximum value found by iterations.