
A black-box optimization method with polynomial-based kernels and quadratic-optimization annealing

Yuki Minamoto [email protected] and Yuya Sakamoto [email protected]

[1] Fixstars Amplify Corporation, Minato-ku, Tokyo 108-0023, Japan
[2] Fixstars Corporation, Minato-ku, Tokyo 108-0023, Japan
Abstract

We introduce kernel-QA, a black-box optimization (BBO) method that constructs surrogate models analytically using low-order polynomial kernels within a quadratic unconstrained binary optimization (QUBO) framework, enabling efficient utilization of Ising machines. The method has been evaluated on artificial landscapes, ranging from uni-modal to multi-modal, with input dimensions extending to 80 for real variables and 640 for binary variables. The results demonstrate that kernel-QA is particularly effective for optimizing black-box functions characterized by local minima and high-dimensional inputs, showcasing its potential as a robust and scalable BBO approach.

keywords:
black-box optimization, annealing, Ising, quantum, Bayesian

1 Introduction

Black-box optimization (BBO) is a promising tool in various sectors where the objectives are expensive-to-evaluate black-box functions such as complex numerical simulations or experimental measurements. For such optimization problems, a serial optimization method is typically chosen. Here, an evaluation point in the search space is determined at each “cycle” (i.e., an iteration of the optimization loop) using information gathered from previous evaluations, as shown in Fig. 1(a). Typically, a model function is constructed in each cycle based on a relatively small dataset consisting of input-output pairs to/from the black-box function (plus, in some cases, uncertainty information). Then, an input set that may minimize the model function is obtained as a potentially optimal input during the optimization step (2) in Fig. 1(a). Subsequently, the black-box function is evaluated with the newly obtained input set, and finally, the new input-output pair is added to the dataset before the next cycle is repeated.
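For concreteness, the loop in Fig. 1(a) can be sketched in a few lines of Python. The callables black_box, build_model, and optimize_model below are hypothetical placeholders standing in for steps (1)-(3); they do not correspond to any particular implementation.

    import numpy as np

    # A minimal sketch of the serial optimization loop in Fig. 1(a), assuming the
    # user supplies the three callables; not tied to any specific BBO method.
    def serial_optimization(black_box, build_model, optimize_model, initial_inputs, n_cycles):
        X = [np.asarray(x) for x in initial_inputs]
        y = [black_box(x) for x in X]          # initial input-output pairs
        for _ in range(n_cycles):
            model = build_model(X, y)          # (1) construct the model from the current dataset
            x_new = optimize_model(model)      # (2) find a candidate minimizer of the model
            y_new = black_box(x_new)           # (3) evaluate the expensive black-box function
            X.append(x_new)                    # (4) add the new pair to the dataset and repeat
            y.append(y_new)
        i_best = int(np.argmin(y))
        return X[i_best], y[i_best]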

Figure 1: A flow of a typical serial optimization for a black-box objective function.

Among existing serial optimization methods, Bayesian optimization stands out for its efficiency in dealing with expensive-to-evaluate black-box functions by building a probabilistic model [1, 2, 3]. In Bayesian optimization, Gaussian process (GP) regression typically plays a central role in searching for the next evaluation point, balancing the need to explore regions of high uncertainty (to gather more information, typically called “exploration”) with the need to exploit regions with low expected values (in the case of a minimization problem, as in this study, called “exploitation”). This balance makes Bayesian optimization particularly effective for complex black-box functions.

However, the performance of Bayesian optimization degrades as the dimensionality of the search space increases, a challenge known as the curse of dimensionality. This issue arises because the volume of the search space grows exponentially with the number of dimensions, rendering the Gaussian process model less informative. Consequences of high dimensionality include (i) optimization of the model function (step (2) in Fig. 1(a)) becomes increasingly difficult, (ii) the cost of constructing the acquisition function becomes high [4], and (iii) the constructed function becomes insensitive to the inputs (due to the complexity of the kernel function used) [5]. As a result, the number of evaluations needed to find an optimal solution becomes prohibitively large, and the method’s ability to exploit the model’s predictions diminishes.

Factorization machine with quadratic-optimization annealing (FMQA) is another serial optimization method for BBO, originally proposed and applied in materials informatics [6, 7, 8, 9]. (FMQA originally stands for factorization machine with quantum annealing; however, the method also works straightforwardly with simulated annealers and with gate-based quantum computers running the quantum approximate optimization algorithm (QAOA). Thus, factorization machine with quadratic-optimization annealing may be more appropriate and general, a point acknowledged by the authors of the original FMQA study [6].) The key in FMQA is to use a factorization machine (FM) [10] as a surrogate model for the black-box objective function. Since the FM model is defined by a quadratic formulation, it allows Ising machines (simulated or quantum annealers) to solve the optimization problem represented by the model function as a quadratic unconstrained binary optimization (QUBO) at each cycle. An Ising machine is designed to solve combinatorial optimization problems quickly by mimicking the behavior of spins in an Ising model: the QUBO problem to be solved is encoded as a network of spins, where each spin represents a binary variable and the interactions between spins represent the constraints or relationships between these variables, and the machine then searches for the spin configuration that minimizes the system’s energy, which corresponds to the optimal solution of the problem. Such QUBO solvers include quantum annealers [11] and simulated annealers implemented on GPUs or FPGAs [12]. FMQA has also been extended and applied to problems with integer/real variables [13] and networks [14]. FMQA can handle a relatively large number of decision variables since the model function is optimized using Ising machines. However, the cost of model construction at each cycle is not negligible for large-dimension problems: Rendle [10] reported that the cost of training an FM is $\mathcal{O}(n \cdot kd)$, where $n$, $k$, and $d$ are the number of samples in the current training dataset, the size of the latent embedding vector (a model hyperparameter), and the number of decision variables (problem dimensions), respectively. Also, the stochastic gradient descent typically used to construct an FM may yield locally optimal model coefficients, which is not ideal in a BBO context.

Bayesian optimization of combinatorial structures (BOCS) is a type of Bayesian optimization method where optimization of the acquisition function may be performed with simulated or quantum annealing, thereby circumventing consequence (i) [4, 15]. However, the Gibbs sampling used in the model (acquisition function) construction costs $\mathcal{O}(n^2 d^2 T)$, where $T$ is the number of iterations [4].

This paper focuses on high-dimensional BBO problems where the number of initial samples is small relative to the input dimensions. For such problems, we propose an approach that mitigates the curse of dimensionality, particularly within serial optimization. The method proposed here is named polynomial-based kernels and quadratic-optimization annealing (kernel-QA). Using low-order polynomial-based kernels for the model function resolves the abovementioned issues (i)–(iii), especially for high-dimensional BBO problems. The proposed method has been evaluated on different benchmark landscapes, highlighting its robustness and efficiency in handling problems, including high-dimensional ones, where traditional methods typically falter in terms of cost or performance.

The rest of the paper is structured as follows. We describe the formulation of the proposed BBO method in Sec. 2. The technique is then assessed using the test functions and conditions described in Secs. 3.1 and 3.2. The results of the assessments are presented and discussed in Sec. 4, and the important findings of the present study are summarized in the conclusions.

2 Method

The keys in kernel-QA are the use of a QUBO solver (or a similar fast optimization solver, with or without native constraint support) and the fast construction of the (surrogate) model function for the black-box function.

2.1 Expected value of the black-box function

Here, we describe the details of the surrogate model construction for the black-box function (exploitation in a Bayesian optimization context). As noted above, the surrogate model constructed here must be a second-order polynomial to utilize QUBO solvers. Using a relatively low-order polynomial also helps avoid over-fitting of the model parameters when training data are scarce, which is typical for most BBO problems.

Suppose $f(\mathbf{x})$ is a black-box function with input variables $\mathbf{x}$. The black-box function $f(\mathbf{x})$ is a complex function (such as a numerical simulation or an experimental measurement). In BBO, the purpose is to find an input vector $\mathbf{x}$ that minimizes $f(\mathbf{x})$ with as few cycles as possible (Fig. 1(a)). Let $(\mathbf{x}_i^*, y_i^*)$ be the $i$-th input-output pair to/from $f(\mathbf{x})$, and suppose we have $n$ such pairs in the training dataset. Here, $\mathbf{x}_i^*$ can be a vector of binary, integer, real, or mixed variables with dimension $d$. Using appropriate coefficients $\mathbf{Q} \in \mathbb{R}^{d \times d}$ (a symmetric matrix), $\mathbf{q} \in \mathbb{R}^d$, and a scalar $r$, the output $y_i$ may be written as:

y_i = \mathbf{x}_i^T \mathbf{Q} \mathbf{x}_i + \mathbf{q}^T \mathbf{x}_i + r.   (1)

The coefficients $\mathbf{Q}$, $\mathbf{q}$, and $r$ can be estimated by solving the following least-squares problem:

\hat{\mathbf{Q}}, \hat{\mathbf{q}}, \hat{r} = \min_{\mathbf{Q}, \mathbf{q}, r} \sum_{i=1}^{n} \left\{ y_i^* - (\mathbf{x}_i^{*T} \mathbf{Q} \mathbf{x}_i^* + \mathbf{q}^T \mathbf{x}_i^* + r) \right\}^2 + \lambda \left( \|\mathbf{Q}\|_F^2 + \|\mathbf{q}\|_2^2 + r^2 \right).   (2)

Here, $\lambda$ is a regularization parameter. Once Eq. (2) is solved, we can obtain the surrogate model for the black-box function from Eq. (1). However, the number of parameters to optimize in Eq. (2) increases with the square of the problem size $d$, which is not ideal for high-dimensional BBO problems.

Now, let our first kernel function be:

k_\mu(\mathbf{x}_1, \mathbf{x}_2) = (\mathbf{x}_1^T \mathbf{x}_2 + \gamma)^2,   (3)

where $\gamma$ is a constant; Eq. (3) is essentially a second-order polynomial. Also, let the black-box function $f$ be estimated using the kernel function in Eq. (3) as:

\hat{f}_\mu(\mathbf{x}) = \sum_t c_t k_\mu(\mathbf{z}_t, \mathbf{x})
 = \sum_t c_t (\mathbf{z}_t^T \mathbf{x} + \gamma)^2
 = \mathbf{x}^T \left( \sum_t c_t \mathbf{z}_t \mathbf{z}_t^T \right) \mathbf{x} + 2\gamma \left( \sum_t c_t \mathbf{z}_t \right)^T \mathbf{x} + c_\mu
 = \mathbf{x}^T \hat{\mathbf{Q}}_\mu \mathbf{x} + 2\gamma \hat{\mathbf{q}}_\mu^T \mathbf{x} + c_\mu.   (4)

Here, all resulting constant terms are collected in $c_\mu$.

We now introduce an optimization problem alternative to Eq. (2) to obtain the surrogate model function as:

\hat{f}_\mu = \min_f \sum_{i=1}^{n} \left( f(\mathbf{x}_i^*) - y_i^* \right)^2 + \lambda \|f\|_{\mathcal{V}}^2
 = \min_{\mathbf{c}, \mathbf{Z}} \sum_{i=1}^{n} \left( \sum_t c_t k_\mu(\mathbf{z}_t, \mathbf{x}_i^*) - y_i^* \right)^2 + \lambda \sum_{t, t'} c_t c_{t'} k_\mu(\mathbf{z}_t, \mathbf{z}_{t'}).   (5)

Here, the subscript $\mathcal{V}$ denotes the inner product $\langle f, f' \rangle_{\mathcal{V}} = \sum_t \sum_{t'} c_t c'_{t'} k_\mu(\mathbf{z}_t, \mathbf{z}'_{t'})$ for functions of the above form. The optimization problem in Eq. (5) is identical to Eq. (2) with $\mathbf{Q} = \sum_t c_t \mathbf{z}_t \mathbf{z}_t^T$, $\mathbf{q} = 2\gamma \sum_t c_t \mathbf{z}_t$, and $r = c_\mu$. By the representer theorem, the optimal solution for Eq. (5) is $\hat{f}_\mu(\mathbf{x}) = \sum_{j=1}^{n} c_j k_\mu(\mathbf{x}_j^*, \mathbf{x})$, without the need to estimate $\mathbf{z}_t$ [16]. Thus, the optimization problem in Eq. (5) reduces to the following problem to estimate the optimal $\mathbf{c} \in \mathbb{R}^n$:

\hat{\mathbf{c}} = \min_{\mathbf{c}} \|\mathbf{K}^*_\mu \mathbf{c} - \mathbf{y}^*\|^2 + \lambda \mathbf{c}^T \mathbf{K}^*_\mu \mathbf{c},   (6)

where $\mathbf{K}^*_\mu$ is the Gram matrix whose $(i,j)$-element is $k_\mu(\mathbf{x}_i^*, \mathbf{x}_j^*)$. Assuming $\mathbf{K}^*_\mu + \lambda\mathbf{I}$ is invertible, Eq. (6) can be solved analytically as:

\hat{\mathbf{c}} = (\mathbf{K}^*_\mu + \lambda\mathbf{I})^{-1} \mathbf{y}^*.   (7)

Thus, the coefficients of the surrogate model are not estimated directly but through kernel regression with a second-order polynomial kernel. When the number of observations $n$ is small relative to $d^2$, this is computationally much cheaper than directly estimating the $\mathcal{O}(d^2)$ coefficients. Also, the representer theorem yields model coefficients that are globally optimal with respect to the samples in the dataset, which is beneficial for black-box optimization.

Note that the inverse operation involved in Eq. (7) becomes expensive as the number of samples $n$ grows. In the present study, $(\mathbf{K}^*_\mu + \lambda\mathbf{I})^{-1}$ is updated sequentially using the Woodbury formula [17] each time a new input-output pair is added at the end of optimization cycle (4) in Fig. 1(a). This reduces the per-cycle cost of maintaining the inverse from the $\mathcal{O}(n^3)$ of a full inversion to the $\mathcal{O}(n^2)$ of a low-rank update.
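As an illustration of this sequential update, the minimal NumPy sketch below grows $(\mathbf{K}^*_\mu + \lambda\mathbf{I})^{-1}$ by one row and column when a new sample arrives, using a standard block-inverse (Woodbury-type) identity; the function name and interface are illustrative choices, not the authors' implementation.

    import numpy as np

    def grow_inverse(A_inv, b, d):
        """Given A_inv = A^{-1} (n x n), return the inverse of the bordered matrix
        [[A, b], [b^T, d]] in O(n^2) operations (block-inverse / Woodbury-type update).
        For kernel-QA, b holds k_mu(x_i*, x_new) and d = k_mu(x_new, x_new) + lambda."""
        u = A_inv @ b                          # A^{-1} b
        s = d - b @ u                          # scalar Schur complement
        top_left = A_inv + np.outer(u, u) / s
        top_right = -u[:, None] / s
        return np.block([[top_left, top_right],
                         [top_right.T, np.array([[1.0 / s]])]])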

Finally, the coefficient matrix and vector can be obtained as $\hat{\mathbf{Q}}_\mu = \sum_{i=1}^{n} \hat{c}_i \mathbf{x}_i^* \mathbf{x}_i^{*T}$ and $\hat{\mathbf{q}}_\mu = \sum_{i=1}^{n} \hat{c}_i \mathbf{x}_i^*$, and the final surrogate model $\hat{f}_\mu$ for the black-box function $f$ based on the given dataset is:

\hat{f}_\mu(\mathbf{x}) = \mathbf{x}^T \hat{\mathbf{Q}}_\mu \mathbf{x} + 2\gamma \hat{\mathbf{q}}_\mu^T \mathbf{x} + c_\mu.   (8)

We can drop the constant $c_\mu$, as it does not generally affect the optimization.
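To summarize this subsection, the following minimal NumPy sketch assembles $\hat{\mathbf{Q}}_\mu$ and $\hat{\mathbf{q}}_\mu$ from a sample matrix; the function name fit_surrogate and the direct solve (instead of the sequential Woodbury update) are illustrative choices under the assumptions above, not the authors' implementation.

    import numpy as np

    def fit_surrogate(X, y, gamma=0.0, lam=1.0):
        """Kernel-regression surrogate of Eq. (8) from samples X (n x d) and outputs y (n,).
        Returns (Q_mu, q_mu) with f_mu(x) ~ x^T Q_mu x + 2*gamma*q_mu^T x (constant dropped)."""
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        K = (X @ X.T + gamma) ** 2                        # Gram matrix of the kernel in Eq. (3)
        c = np.linalg.solve(K + lam * np.eye(len(y)), y)  # Eq. (7)
        Q_mu = X.T @ (c[:, None] * X)                     # sum_i c_i x_i x_i^T
        q_mu = X.T @ c                                    # sum_i c_i x_i
        return Q_mu, q_mu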

2.2 Standard deviation

In some cases, it may be helpful to consider the uncertainty $\hat{f}_\sigma(\mathbf{x})$ of the constructed surrogate model $\hat{f}_\mu(\mathbf{x})$. The standard deviation of a Gaussian process regression can capture such uncertainty. However, in the context of the kernel-QA method, the formulation of $\hat{f}_\sigma(\mathbf{x})$ must be QUBO-compatible, whereas well-known acquisition functions are not. Therefore, we construct a QUBO-compatible acquisition function based on the lower confidence bound (LCB):

\hat{f}(\mathbf{x}) = \hat{f}_\mu(\mathbf{x}) - \beta \hat{f}_\sigma(\mathbf{x}),   (9)

where Eq. (8) is used for $\hat{f}_\mu(\mathbf{x})$. Using Gaussian process regression, the standard deviation $\sigma(\mathbf{x})$ can be written as:

\sigma(\mathbf{x}) = \sqrt{k_\sigma(\mathbf{x}, \mathbf{x}) - \mathbf{k}_\sigma^T (\mathbf{K}^*_\sigma + \lambda\mathbf{I})^{-1} \mathbf{k}_\sigma},   (10)

where $\mathbf{K}^*_\sigma$ is another Gram matrix whose $(i,j)$-element is $k_\sigma(\mathbf{x}_i^*, \mathbf{x}_j^*)$, and $\mathbf{k}_\sigma = [k_\sigma(\mathbf{x}, \mathbf{x}_1^*), k_\sigma(\mathbf{x}, \mathbf{x}_2^*), \cdots, k_\sigma(\mathbf{x}, \mathbf{x}_n^*)]$. Also, for the standard deviation, the following polynomial kernel is used instead of Eq. (3):

k_\sigma(\mathbf{x}_a, \mathbf{x}_b) = \mathbf{x}_a^T \mathbf{x}_b + \gamma.   (11)

The formulation in Eq. (10) is not a second-order polynomial and therefore requires some modification. Eq. (10) is simplified into the QUBO-compatible standard deviation $\hat{f}_\sigma(\mathbf{x})$ as follows:

\hat{f}_\sigma(\mathbf{x}) = k_\sigma(\mathbf{x}, \mathbf{x}) - \mathbf{k}_\sigma^T (\mathbf{K}^*_\sigma + \lambda\mathbf{I})^{-1} \mathbf{k}_\sigma
 = \mathbf{x}^T \mathbf{x} + \gamma - \sum_{i=1}^{n} \sum_{j=1}^{n} L_{ij} (\mathbf{x}_i^{*T} \mathbf{x} + \gamma)(\mathbf{x}_j^{*T} \mathbf{x} + \gamma)
 = \mathbf{x}^T \mathbf{x} - \sum_{i=1}^{n} \sum_{j=1}^{n} L_{ij} (\mathbf{x}_i^{*T} \mathbf{x})(\mathbf{x}_j^{*T} \mathbf{x}) - 2\gamma \sum_{i=1}^{n} \sum_{j=1}^{n} L_{ij} (\mathbf{x}_i^{*T} \mathbf{x}) + c_\sigma
 = \mathbf{x}^T \left\{ \mathbf{I} - \left( \sum_{i=1}^{n} \sum_{j=1}^{n} L_{ij} \mathbf{x}_i^* \mathbf{x}_j^{*T} \right) \right\} \mathbf{x} - 2\gamma \sum_{i=1}^{n} \sum_{j=1}^{n} L_{ij} (\mathbf{x}_i^{*T} \mathbf{x}) + c_\sigma
 = \mathbf{x}^T \hat{\mathbf{Q}}_\sigma \mathbf{x} - 2\gamma \hat{\mathbf{q}}_\sigma^T \mathbf{x} + c_\sigma,   (12)

where $L_{ij}$ is the $(i,j)$-element of $(\mathbf{K}^*_\sigma + \lambda\mathbf{I})^{-1}$. In addition, all resulting constant terms are collected in $c_\sigma$. As for $\hat{f}_\mu$ described in Sec. 2.1, $(\mathbf{K}^*_\sigma + \lambda\mathbf{I})^{-1}$ is computed sequentially using the Woodbury formula to suppress the computational cost [17]. Also, as in Eq. (8), the constant $c_\sigma$ can be dropped, as it does not affect the optimization.

Both $\hat{f}_\mu(\mathbf{x})$ in Eq. (8) and $\hat{f}_\sigma(\mathbf{x})$ in Eq. (12) are second-order polynomials, so the acquisition function in Eq. (9) is QUBO-compatible even with non-zero $\beta$. Also, all the coefficients in the polynomial are obtained analytically rather than fitted iteratively, which is advantageous when constructing a surrogate model with a relatively large input dimension.
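Combining Secs. 2.1 and 2.2, a minimal sketch of the QUBO coefficients of Eq. (9) is given below; it reuses the fit_surrogate sketch above and uses a direct matrix inverse for clarity, so it illustrates the formulation rather than reproducing the authors' implementation.

    import numpy as np

    def fit_lcb_qubo(X, y, beta, gamma=0.0, lam=1.0):
        """Quadratic and linear QUBO coefficients of the LCB acquisition in Eq. (9):
        f_hat(x) ~ x^T (Q_mu - beta*Q_sig) x + 2*gamma*(q_mu + beta*q_sig)^T x (constants dropped)."""
        X = np.asarray(X, dtype=float)
        n, d = X.shape
        Q_mu, q_mu = fit_surrogate(X, y, gamma=gamma, lam=lam)  # expected value, Eq. (8)
        L = np.linalg.inv(X @ X.T + gamma + lam * np.eye(n))    # (K_sigma + lambda*I)^{-1}, Eq. (11)
        Q_sig = np.eye(d) - X.T @ L @ X                         # I - sum_ij L_ij x_i x_j^T, Eq. (12)
        q_sig = X.T @ (L @ np.ones(n))                          # sum_ij L_ij x_i
        return Q_mu - beta * Q_sig, 2.0 * gamma * (q_mu + beta * q_sig)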

2.3 Variable conversion

Kernel-QA inherently requires the variables to be binary to utilize a fast QUBO solver for the model-function optimization. However, with appropriate variable encoding, real and integer variables, as well as their mix, can be handled. Conversion from non-binary to binary variables can be performed either before or after the construction of the model function $\hat{f}(\mathbf{x})$; these are referred to here as a priori and a posteriori conversions, respectively.

In the present study, the a priori conversion is chosen due to its straightforwardness, with a domain-wall encoding method. A real variable $x$ with lower and upper bounds $(x_{lower}, x_{upper})$ is discretized onto a vector $\mathbf{x}_{disc}$ of size $n_{bins}$. The spacing of the bins, $\Delta$, can be either uniform or non-uniform. For example, a variable $x$ with $(x_{lower}, x_{upper}) = (0.0, 1.0)$ discretized into $n_{bins} = 5$ bins gives $\mathbf{x}_{disc} = [x_{disc,1}, \cdots, x_{disc,n_{bins}}] = [0.0, 0.25, 0.50, 0.75, 1.0]$.

Using a binary vector $\mathbf{x}_b$ of size $n_{bins} - 1$, the conversion from $x$ to $\mathbf{x}_b$ is:

x_{b,i} = \begin{cases} 1, & \text{if } i \leq i_{disc} \\ 0, & \text{otherwise} \end{cases} \quad (\forall i \in \mathbb{Z} \mid 1 \leq i \leq n_{bins} - 1),   (13)

where the discretization index $i_{disc}$ is:

i_{disc} = \begin{cases} \lfloor (x - x_{lower})/\Delta + 0.5 \rfloor, & \text{if } x \text{ is uniformly discretized} \\ \min\{ i \mid x_{disc,i} = x \}, & \text{otherwise}. \end{cases}   (14)

The conversion from $\mathbf{x}_b$ back to $x$ is:

x = \mathbf{x}_{disc}\left[ \sum_{i=1}^{n_{bins}-1} x_{b,i} + 1 \right].   (15)

In the a priori variable conversion method, variable conversion (encoding) from real/integer values to binary vectors is performed before the construction of the surrogate model at (1) in Fig. 1(a). Thus, any inputs to the constructed surrogate model in Eq. (9) are binary (each non-binary value is converted to a binary vector of size $n_{bins} - 1$). Also, variable conversion (decoding) from binary vectors back to real/integer values is required before evaluating the black-box function at (3) in Fig. 1(a). Evaluation of other encoding methods, such as one-hot encoding and binary encoding with or without a posteriori conversion, is an interesting topic that is out of the scope of this study.
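A minimal sketch of the uniform-bin domain-wall encoding and decoding of Eqs. (13)-(15) is given below; 0-based indices are used in the code, whereas the equations are 1-based, and the function names are illustrative.

    import numpy as np

    def encode_domain_wall(x, x_lower, x_upper, n_bins):
        """Encode a real value x into a domain-wall binary vector of length n_bins - 1 (Eqs. (13)-(14))."""
        delta = (x_upper - x_lower) / (n_bins - 1)           # uniform bin spacing
        i_disc = int(np.floor((x - x_lower) / delta + 0.5))  # nearest-bin index (0-based)
        return np.array([1 if i < i_disc else 0 for i in range(n_bins - 1)], dtype=int)

    def decode_domain_wall(x_b, x_lower, x_upper, n_bins):
        """Decode a domain-wall binary vector back to its bin value (Eq. (15))."""
        x_disc = np.linspace(x_lower, x_upper, n_bins)
        return x_disc[int(np.sum(x_b))]

    # Example: x = 0.3 on [0, 1] with n_bins = 5 encodes to [1, 0, 0, 0] and decodes to 0.25.
    bits = encode_domain_wall(0.3, 0.0, 1.0, 5)
    print(bits, decode_domain_wall(bits, 0.0, 1.0, 5))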

2.4 Exponential transformation of output values

In kernel-QA with $\beta = 0$, the construction of an appropriate surrogate model (Eq. (8)) is the key. Here, the model does not need to be highly accurate: with a complex black-box function and a relatively small dataset, such a highly accurate model cannot be obtained, especially for large-dimension problems. In the BBO context, an appropriate model is one that correlates positively with the black-box function. This correlation characteristic may be assessed by the cross-correlation between the true output values of the black-box function in the training dataset and the values predicted by the trained surrogate model. Typically, such a correlation coefficient should be positive, preferably greater than 0.5, throughout the cycles.

Sometimes, the output values of the black-box function $f(\mathbf{x})$ may have an extensive dynamic range, and the output may be sensitive to the input values. Such situations can be identified by inspecting the initial dataset $(\mathbf{x}^*, \mathbf{y}^*)$. Constructing an “appropriate” model for such $f(\mathbf{x})$ can be difficult and often requires some feature scaling, as in usual machine learning.

Additional care must be taken for BBO purposes. Minimization problems focus on the model’s behavior at relatively small output values. Thus, the model must correlate positively with the black-box function among samples with relatively small output values, even if other samples have large output values. To construct an effective surrogate model in this situation, we consider a transformation that diminishes returns for large values of $f(\mathbf{x})$ and encourages the optimizer to explore regions of the input space where $f(\mathbf{x})$ is smaller. Such a transformation is beneficial if there are local minima in the original $f(\mathbf{x})$ that one would like to avoid or de-emphasize in favor of a deeper global search.

One possible transformation would be as follows:

\bar{y} = -\exp(-y / c_m),   (16)

where $y = f(\mathbf{x})$ and $\bar{y}$ is the transformed output value. Here, $c_m$ is a model parameter, which should be determined based on the initial training dataset. The optimization outcome resulting from this transformation is relatively insensitive to the choice of $c_m$ (see Sec. 4.3). In the present study, we used the ensemble average $\langle \cdot \rangle$ of the output values in the initial training dataset $\mathbf{y}_{init}$:

c_m = \alpha_{exp} \langle \mathbf{y}_{init} \rangle,   (17)

where $\alpha_{exp} = 1$ by default. We also consider cases with $\alpha_{exp} \neq 1$ to assess the sensitivity of the overall optimization performance to the choice of $c_m$ in Sec. 4.3. Note that $c_m$ needs to be positive, which may require a linear transformation so that the values $y$ in Eq. (16) are mostly positive before applying Eq. (16). A straightforward way is to subtract $\min(\mathbf{y}_{init})$ from the original output values $\mathbf{y}_{init}$ if $\min(\mathbf{y}_{init}) < 0$.

The exponential transformation in Eq. (16) magnifies small output values of $f(\mathbf{x})$ while compressing large ones. Such a transformation is helpful in problems where the objective function has a long tail, or where the optimizer needs to focus on further reducing already small values of $f(\mathbf{x})$.
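A minimal sketch of Eqs. (16)-(17), including the optional shift that makes the outputs non-negative, is shown below; whether the shift is applied before computing $c_m$ is our assumption for illustration.

    import numpy as np

    def transform_outputs(y, y_init, alpha_exp=1.0):
        """Exponential transformation of black-box outputs (Eqs. (16)-(17))."""
        y = np.asarray(y, dtype=float)
        y_init = np.asarray(y_init, dtype=float)
        shift = min(np.min(y_init), 0.0)            # shift so that outputs are mostly positive
        c_m = alpha_exp * np.mean(y_init - shift)   # Eq. (17), assumed computed on shifted outputs
        return -np.exp(-(y - shift) / c_m)          # Eq. (16)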

3 Assessments

Here, we perform various assessments of the proposed kernel-QA by treating the test functions described in Sec. 3.1 as black boxes with binary and real decision variables.

3.1 Artificial landscapes

In the present study, we consider the following artificial landscapes to assess the proposed method, kernel-QA. These functions accept inputs of arbitrary dimension $d$ and are formulated as follows.

  • The Rosenbrock function:

    f(\mathbf{x}) = \sum_{i=1}^{d-1} \left[ (1 - x_i)^2 + 100 (x_{i+1} - x_i^2)^2 \right],   (18)

    Global minimum: f(1, 1, \cdots, 1) = 0.

  • The Rastrigin function:

    f(\mathbf{x}) = 10 d + \sum_{i=1}^{d} \left[ x_i^2 - 10 \cos(2\pi x_i) \right],   (19)

    Global minimum: f(0, 0, \cdots, 0) = 0.

Typical two-dimensional ($d = 2$) logarithmic surfaces of these landscapes are shown in Fig. 2. These artificial landscapes have different characteristics. Both functions are non-convex. The Rosenbrock function is uni-modal, but the global minimum lies in a narrow, parabolic valley [18]. As clearly shown in Fig. 2(b), the Rastrigin function is an example of a non-linear multi-modal function with numerous local minima.
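For reference, Eqs. (18) and (19) can be implemented directly, as in the short sketch below.

    import numpy as np

    def rosenbrock(x):
        """Eq. (18); global minimum f(1, ..., 1) = 0."""
        x = np.asarray(x, dtype=float)
        return np.sum((1.0 - x[:-1]) ** 2 + 100.0 * (x[1:] - x[:-1] ** 2) ** 2)

    def rastrigin(x):
        """Eq. (19); global minimum f(0, ..., 0) = 0."""
        x = np.asarray(x, dtype=float)
        return 10.0 * x.size + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x))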

Depending on the purpose of the assessment, input dimensions of $d = 5$–$80$ are considered for the real-variable cases, and $d = 40$–$640$ for the binary-variable cases. Note that for kernel-QA, due to the conversion from the $d$ real variables (see Sec. 2.3), the variables $\mathbf{x}$ considered in the surrogate model (Eq. (9)) are binary. The number of such binary variables that the surrogate model and its optimization need to handle is $d_B = d(n_{bins} - 1)$ in the present study, where all real variables are encoded using $n_{bins}$ uniform bins. For the binary-variable cases, no variable conversion is necessary.

Figure 2: The two-dimensional ($d = 2$) landscapes of the test functions, (a) the Rosenbrock and (b) the Rastrigin function, on a logarithmic scale.

3.2 Assessment conditions

Kernel-QA is parametrically assessed in the present study on the artificial landscapes described in Sec. 3.1. For reference, results from typical Bayesian optimization are also compared. Here, the optimization conditions considered in the assessments are described.

Table 1: Summary of assessment conditions for real- and binary-variable problems. $n_{init} = 10$, $\alpha_{exp} = 1$, and $\beta = 0$ unless otherwise noted.

name   d (real or binary)   (x_low, x_up)   n_bins   d_B
r5n    5 (real)             (-3, 3)         301      1,500
r5     5 (real)             (-3, 3)         61       300
r10    10 (real)            (-3, 3)         61       600
r20    20 (real)            (-3, 3)         61       1,200
r40    40 (real)            (-3, 3)         61       2,400
r80    80 (real)            (-3, 3)         61       4,800
b40    40 (binary)          -               -        40
b80    80 (binary)          -               -        80
b160   160 (binary)         -               -        160
b320   320 (binary)         -               -        320
b640   640 (binary)         -               -        640

A summary of the optimization conditions considered in the present assessments is shown in Table 1. For both kernel-QA and Bayesian optimization, each artificial landscape in Fig. 2 (with the input dimensions, i.e., the number of real/binary variables $d$, shown in Table 1) is regarded as a black box and optimized under each condition in Table 1. Also, identical initial exploration data are used for the kernel-QA and Bayesian optimization cases; their length (the number of input-output pairs to/from the black-box function) is denoted by $n_{init}$. The initial exploration data are constructed from randomly chosen input vectors. For each BBO method, optimization runs are repeated ten times independently with different initial seeds for the initial exploration data to collect statistics.

For kernel-QA, the same annealing timeout of 5 seconds is used for all conditions, and the Fixstars Amplify Annealing Engine (Amplify AE) [12] is used to optimize the acquisition function. Amplify AE is a publicly available GPU-based Ising machine that accepts QUBO problems with up to 256,000 bits; thus, many typical BBO problems are solvable in terms of decision-variable dimensions. Also, for the variable conversion required in the real-decision-variable cases, the a priori conversion (Sec. 2.3) is applied. These conditions are also applied to the FMQA runs considered in some assessments. As for the model function for kernel-QA, Eq. (9) is used with $\beta = 0$ unless otherwise noted, without the linear terms ($\gamma = 0$ in Eqs. (3) and (11)) for simplicity. For all cases, the regularization parameter is $\lambda = 1$.

As for Bayesian optimization, the expected improvement (EI) is used as the acquisition function, whose coefficients are determined analytically [19]. The minimization of the acquisition function is performed using the limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) algorithm [20], which costs $\mathcal{O}(d n^2 T)$, where $T$ is the number of iterations required. These settings are typical in Bayesian optimization and are implemented using the GPyOpt library [21] in the present study. Note that of the parameters shown in Table 1, only $d$, $(x_{low}, x_{up})$, and $n_{init}$ are directly relevant to Bayesian optimization.

Ten independent runs (different initial data samples) are performed for each optimization case, and their average and standard deviation are discussed in the present assessment. Note that identical initial training data samples are used between different optimization methods.

4 Results

Figure 3: Evolution of $f(\mathbf{x}_{best,i})$ for (a) the Rosenbrock and (b) the Rastrigin function with real-variable dimension $d = 5$ (r5n). Optimization uses kernel-QA (red) and Bayesian optimization (black). Corresponding results of FMQA are also shown in blue as a reference. Note that plots at negative cycles show $f(\mathbf{x})$ evaluated on the initial training dataset.

Figure 4: Evolution of the distance between the found and true solutions obtained for Fig. 3(b).

4.1 General features

Kernel-QA and Bayesian optimization are performed to find an optimal input for the “black-box” functions, namely the Rosenbrock and Rastrigin functions, with real-number input dimension $d = 5$ under the condition labeled r5n in Table 1. The evolution of the objective function value at the best solution found at the $i$-th cycle, $f(\mathbf{x}_{best,i})$, averaged over ten independent optimization runs, is shown in Fig. 3. Generally, all optimization methods show reasonable trends. That said, the results clarify the different characteristics of the different optimization methods.

For the Rosenbrock function (no local minima), after the 1000-th cycle, the values of $f(\mathbf{x}_{best,1000})$ averaged over the ten runs are 0.7 for kernel-QA, 4.9 for Bayesian optimization, and 13.4 for FMQA. FMQA's result is roughly an order of magnitude worse than kernel-QA's, potentially because the stochastic gradient descent used to train the FM on a relatively small dataset may yield locally optimal model coefficients, as discussed in Secs. 1 and 2.1; such behavior does not seem beneficial for the Rosenbrock landscape, where the global minimum lies in a very narrow valley.

As for the Rastrigin function in Fig. 3(b), Bayesian optimization yields a rapid decrease of $f(\mathbf{x}_{best,i})$ in the early phase of the optimization cycles (say, the first 100 cycles), but the slope of the evolution then becomes smaller than for the other methods. Kernel-QA and FMQA catch up with Bayesian optimization after approximately 100 cycles and continue to find better solutions that yield smaller $f(\mathbf{x})$. The values of $f(\mathbf{x}_{best,1000})$ averaged over the ten runs are 1.4 for kernel-QA, 3.1 for Bayesian optimization, and 1.3 for FMQA.

An interesting observation is shown in Fig. 4, which plots the distance between the best solution found at the $i$-th cycle, $\mathbf{x}_{best,i}$, and the true solution $\mathbf{x}_{true}$, computed as $\|\mathbf{x}_{best,i} - \mathbf{x}_{true}\|$, for the Rastrigin case. The evolution shows that the solutions found by kernel-QA are closer to the true solution than the ones found by Bayesian optimization as soon as the optimization cycles start. On the other hand, in terms of $f(\mathbf{x}_{best,i})$ in Fig. 3(b), Bayesian optimization shows a rapid decrease and seems to outperform kernel-QA for the first 100 cycles. This seemingly conflicting observation reveals that kernel-QA and FMQA successfully avoid being trapped in local minima of the Rastrigin function. This tendency of kernel-QA to avoid local minima is attributed to the low-order polynomial in its surrogate model, which circumvents overfitting of the model parameters with the relatively small datasets typical of BBO problems. The similar trend shown in Figs. 3(b) and 4 for FMQA, which also uses a second-order polynomial surrogate model, supports this insight.

4.2 Mitigation of curse of dimensionality

Figure 5: Evolution of $f(\mathbf{x}_{best,i})$ with various real-variable dimensions $d = 5$ (thin solid), $10$ (dashed), $20$ (dotted), $40$ (dash-dotted), and $80$ (solid line), for (a) the Rosenbrock and (b) the Rastrigin function. Optimization uses kernel-QA (red) and Bayesian optimization (black). Note that plots at negative cycles show $f(\mathbf{x})$ evaluated on the initial training dataset.

4.2.1 Real-variable problems

Figure 5 shows the evolution of $f(\mathbf{x}_{best,i})$ for different real-variable dimensions $d = 5$ (r5), 10 (r10), 20 (r20), 40 (r40), and 80 (r80), to examine the effect of the “curse of dimensionality” on kernel-QA and Bayesian optimization.

For the “black-box” function with the uni-modal feature (Rosenbrock), kernel-QA and Bayesian optimization perform at a similar level on average. However, the standard deviation observed for Bayesian optimization is substantially larger than for kernel-QA throughout the cycles (see Fig. 5(a)). The large standard deviations for Bayesian optimization become obvious when $d \geq 20$ in the present study. For instance, these deviations are 3.2 ($d = 20$), 4.4 ($d = 40$), and 7.3 ($d = 80$) times greater than those of kernel-QA at the 50-th optimization cycle. This observation suggests that the quality of the found best solution, especially for cycles $10 < i < 300$, varies significantly from run to run for the present Bayesian optimization.

For the multi-modal function (Rastrigin), the difference between the two optimization methods is also apparent: optimization progresses at a greater rate and in a more consistent manner for kernel-QA, as shown by the much lower $f(\mathbf{x}_{best,i})$ values throughout the cycles in Fig. 5(b). On the other hand, the results show that the adverse effect of increased dimensionality on typical Bayesian optimization is emphasized when the objective function is multi-modal.

For $d = [5, 10, 20, 40, 80]$, the minimum objective function values found at the end of the shown cycles, averaged over the ten optimization runs, are $[1.1, 4.8, 89.4, 893.1, 3950.0]$ (Rosenbrock) and $[1.6, 4.5, 13.0, 60.5, 243.3]$ (Rastrigin) for kernel-QA. For Bayesian optimization, they are $[5.3, 53.7, 278.7, 793.9, 5700.5]$ (Rosenbrock) and $[2.7, 19.2, 127.9, 335.2, 759.9]$ (Rastrigin).

When considering real-variable problems, Bayesian optimization seems reasonable for uni-modal landscapes like the Rosenbrock function with input dimensions $d \lessapprox 20$, whereas kernel-QA performs well for both uni- and multi-modal landscapes with input dimensions $d \gg 20$.

Figure 6: Evolution of $f(\mathbf{x}_{best,i})$ with different binary-variable dimensions $d = 40$ (thin solid), $80$ (dashed), $160$ (dotted), $320$ (dash-dotted), and $640$ (solid line), for (a) the Rosenbrock and (b) the Rastrigin function. Optimization uses kernel-QA (red) and Bayesian optimization (black). Note that plots at negative cycles show $f(\mathbf{x})$ evaluated on the initial training dataset.

4.2.2 Binary-variable problems

Given that kernel-QA utilizes an Ising machine, its advantage for large-dimension problems is even more evident when optimization cases with binary decision variables (combinatorial BBO) are considered. The test functions are based on Eqs. (18) and (19); however, half of the input elements $x_i$ are randomly flipped to $\hat{x}_i$, and $\hat{\mathbf{x}}$ is used as the input to the function:

\hat{x}_i = \begin{cases} 1 - x_i, & \text{if } i \in \{j_1, j_2, \dots, j_{d/2}\}, \\ x_i, & \text{otherwise}, \end{cases}   (20)

where $j_1, j_2, \dots, j_{d/2}$ are randomly chosen unique indices between 1 and $d$. This flipping treatment is essential for a fair and meaningful comparison: the Ising machine used here, Amplify AE, initially searches for solutions around $\mathbf{x} = \mathbf{0}$, which would result in superior performance of kernel-QA on problems where $\mathbf{x}_{true} = \mathbf{0}$; such behavior does not reflect the general performance characteristics of kernel-QA.
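A minimal sketch of this bit-flip treatment (Eq. (20)) is shown below; the wrapper name and the fixed seed are illustrative choices.

    import numpy as np

    def make_flipped_objective(f, d, seed=0):
        """Wrap a binary objective f so that a fixed, randomly chosen half of its input
        bits are flipped before evaluation (Eq. (20)), moving the optimum away from x = 0."""
        rng = np.random.default_rng(seed)
        flip_idx = rng.choice(d, size=d // 2, replace=False)
        def f_flipped(x):
            x_hat = np.asarray(x, dtype=int).copy()
            x_hat[flip_idx] = 1 - x_hat[flip_idx]
            return f(x_hat)
        return f_flipped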

Figure 6 shows assessment results for the binary input dimensions $d = 40$ (b40), 80 (b80), 160 (b160), 320 (b320), and 640 (b640). For these input dimensions, the minimum objective function values found at the end of the shown cycles (averaged over the ten independent optimization runs) are $[186.1, 734.5, 2284.0, 5136.7, 13039.2]$ (Rosenbrock) and $[0.0, 0.0, 1.2, 15.2, 67.7]$ (Rastrigin) for kernel-QA. For Bayesian optimization, they are $[270.3, 1538.6, 4769.8, 12407.1, 27876.9]$ (Rosenbrock) and $[1.9, 12.6, 33.9, 86.1, 196.9]$ (Rastrigin). While Bayesian optimization is not particularly known to be advantageous for discrete or binary variables, kernel-QA consistently demonstrates robust and reliable performance even for very large variable dimensions. Also, on par with the real-variable problems discussed in Sec. 4.2.1, relatively large standard deviations of $f(\mathbf{x}_{best,i})$ are observed for Bayesian optimization at larger $d$, as shown in Fig. 6(b).

4.3 Effect of αexp\alpha_{exp}

Figure 7: Evolution of $f(\mathbf{x}_{best,i})$ with different $\alpha_{exp}$: (a, b) real variables with $d = 80$ and (c, d) binary variables with $d = 640$, for (a, c) the Rosenbrock and (b, d) the Rastrigin function.

Sec. 2.4 described the exponential transformation of the training data (the $f(\mathbf{x})$ output values) to help construct an appropriate surrogate model for optimization even when the black-box function has an extensive dynamic range. The exponential transformation requires a transformation parameter $c_m$, which can be determined from the initial training data, typically as the averaged output value $\alpha_{exp} \langle \mathbf{y}_{init} \rangle$ of the samples in the initial training dataset, with $\alpha_{exp}$ being unity by default. This section discusses the effect of this transformation and the sensitivity of the overall optimization performance to $c_m$ (or $\alpha_{exp}$).

Figure 7 shows the optimization history with kernel-QA under the conditions r80 and b640 for the Rosenbrock and Rastrigin functions. Four values $\alpha_{exp} = 0.1, 0.5, 1.0, 2.0$ and one condition without the exponential transformation are considered for each case. The results with $\alpha_{exp} = 1.0$ are identical to those for r80 and b640 in Figs. 5 and 6. There are substantial improvements for all cases using this transformation with essentially any $\alpha_{exp}$. The default $\alpha_{exp} = 1$ seems a reasonable choice, and the results show that doubling or halving this value also yields comparable (or better) optimization performance.

4.4 Effect of initial data size

Figure 1(a) shows that serial optimization methods require initial training data to construct the first model function. A relatively sizeable initial dataset may result in a “better” model function for the first few optimization cycles. In contrast, a smaller initial dataset reduces the cost of evaluating the black-box function during the initial data construction. Also, with a smaller initial dataset, the ratio of newly added data samples during the optimization cycles to the total data becomes larger. This may result in faster convergence since, at the same cycle count, the surrogate model fits better to the samples obtained during the optimization cycles rather than to the initial data obtained, for example, randomly. The number of samples in the initial training data in the present study is set to $n_{init} = 10$, as summarized in Table 1. Here, the effect of the initial data size on the optimization progress is discussed.

Figure 8 shows the optimization history with different initial data sizes $n_{init} = 2$, $10$, and $100$. While Fig. 8 shows the optimization results for the Rastrigin function, the trend for the corresponding Rosenbrock cases is very similar. As clearly shown, the choice of $n_{init}$ does not substantially affect the overall optimization performance. That said, the result with $n_{init} = 100$ shows a lagged evolution of the $f(\mathbf{x}_{best,i})$ reduction for the binary-variable problem in Fig. 8(b). Also, such a large $n_{init}$ requires more black-box evaluations, which is impractical. While $n_{init} = 10$ is the default in the present experiments, the optimization cycles in kernel-QA can start with as few as $n_{init} = 2$ samples without affecting the optimization performance.

Figure 8: Evolution of $f(\mathbf{x}_{best,i})$ with different initial data sizes $n_{init} = 2$ (dotted), $10$ (red thick solid), and $100$ (dashed) for the Rastrigin function with kernel-QA: (a) r80 and (b) b640.

4.5 Effect of β\beta

The assessments of kernel-QA so far have utilized the surrogate model (Eq. (8)) rather than an acquisition function (Eq. (9)). However, for some BBO problems, it may be beneficial to consider both Eqs. (8) and (12) in the context of an acquisition function. The balance between the expected value (exploitation) and the uncertainty (exploration) is controlled by $\beta$ in Eq. (9); when $\beta = 0$, Eq. (9) reduces to the surrogate model.

Figure 9 compares the evolution of $f(\mathbf{x}_{best,i})$ for the Rosenbrock and Rastrigin functions with zero and non-zero $\beta$ under the conditions r10, r80, b80, and b640 summarized in Table 1. The considered non-zero $\beta$ values are 0.0001, 0.001, and 0.01, where $\beta = 0.01$ puts more emphasis on exploration than $\beta = 0$ (the default value). The effect of $\beta$ is more prominent, in both positive and negative directions, for the larger dimensions: r80 rather than r10 for the real-variable cases, and b640 rather than b80 for the binary-variable cases.

For the larger-dimension cases (r80 and b640), the optimization performance with $\beta = 0.01$ (and, expectedly, larger $\beta$) is poorer regardless of the function or variable type. This result suggests an adverse effect of (excess) exploration for larger-dimension problems, where the number of samples needed to sufficiently “cover” the search space grows rapidly with dimensionality. As for smaller non-zero $\beta$, its influence on the overall optimization performance differs depending on the function and variable type, and the trend is not entirely consistent. Such inconsistency is attributed to the approximation of the standard deviation in Eq. (12). However, the inconsistency diminishes towards the end of the optimization cycles, and at the 1000-th cycle the values of $f(\mathbf{x})$ converge to similar values for $\beta = 0$, $0.0001$, and $0.001$ in each case.

Although the present experiment does not demonstrate a consistent benefit of non-zero $\beta$, the results show that the optimization performance could be further improved at optimization cycles $i \ll 1000$ for relatively large-dimension problems if an appropriate $\beta$ value is used. Alternatively, the QUBO-compatible formulation of $\sigma$ (the simplification from Eq. (10) to Eq. (12)) could be refined to achieve consistent performance in future studies.

Figure 9: Evolution of $f(\mathbf{x}_{best,i})$ with different “exploration” intensities (see Eq. (9)), $\beta = 0$ (thick), $0.0001$ (dotted), $0.001$ (dash-dotted), and $0.01$ (dotted): (a, b) r10 and r80, and (c, d) b80 and b640, for (a, c) the Rosenbrock and (b, d) the Rastrigin function. Optimization is performed using kernel-QA. The shown value is normalized by the best value at the beginning of the optimization cycles, $f(\mathbf{x}_{best,0})$.

5 Conclusions

In the present study, a BBO method, kernel-QA, has been proposed, which is based on relatively low-order polynomial kernels and quadratic-optimization annealing, similar to FMQA. Instead of using a well-known acquisition function, kernel-QA considers a surrogate model function constructed analytically in a QUBO-compatible form to leverage optimization on an Ising machine. Therefore, kernel-QA with the default $\beta = 0$ considers exploitation alone, whereas typical Bayesian optimization performs both exploitation and exploration. Using a relatively low-order model function avoids model overfitting given the relatively small training samples typical in BBO, and this feature helps avoid being trapped in local minima.

Kernel-QA, Bayesian optimization, and, in part, FMQA have been assessed using artificial landscapes, the Rosenbrock and Rastrigin functions, at various input dimensions (up to 80 for real and 640 for binary variables), variable types, and optimization conditions. For all test functions and variable types considered, kernel-QA performs well in terms of the final solution, the evolution of the best objective function value, and its run-to-run standard deviation. The performance difference between the two methods becomes more pronounced for larger-dimension problems or functions with local minima. The latter point is clarified by comparing the history of the best objective value with the distance between the best and true solutions; kernel-QA (as well as FMQA) shows a tendency to avoid lingering in local minima.

Several optimization parameters, namely the exponential transformation coefficient $\alpha_{exp}$ and the initial dataset size $n_{init}$, are also explored around their default values of $\alpha_{exp} = 1$ and $n_{init} = 10$. While $\alpha_{exp} = 1$ seems reasonable, doubling or halving this parameter does not unduly influence the optimization performance. As for the initial dataset size, $n_{init} = 2$ also shows an almost identical optimization history. In contrast, we do not recommend $n_{init} = 100$ due to its negative influence on performance and the cost of constructing such a dataset.

Finally, a slight extension of kernel-QA to include exploration is also proposed and assessed. An LCB parameter $\beta$ controls the balance between exploitation and exploration, and different $\beta$ values are considered. The assessment shows a negative effect of exploration with a relatively large $\beta$ for the larger-dimension problems. Such an adverse effect arises because the difficulty of sufficiently covering the search space grows rapidly with dimension, and the probability of sampling “promising” regions diminishes exponentially. Although the assessment showed some positive influence of exploration with appropriate $\beta$ values, the effect is inconsistent, and this extension requires further improvement.

These results suggest that the proposed kernel-QA is a suitable BBO method for black-box functions with local minima and relatively large input dimensions.

Data availability statements

The optimization program code used in the present study is available from https://pypi.org/project/amplify-bbopt/0.1.0/#files.

Appendix A Computational cost

The effect of the problem dimension and the number of cycles on the computational time is discussed here. Figure 10 shows the variation of the per-cycle computational time for kernel-QA under the conditions b40, b640, r5, r20, and r80 for the Rastrigin function. The per-cycle computational time is almost constant, except for r80, and even for r80 the dependency of the computational time on the cycle count does not seem strong (less than linear).

As mentioned in Sec. 3.2, the annealing timeout for the optimization step ((2) in Fig. 1(a)) is five seconds for all cases. Thus, for the cases b40, b640, and r5, the annealing cost is predominant in the overall computational cost. The computational time increases with the problem dimension (more specifically, with the number of binary variables $d_B$ after variable conversion). Compared to r5 with $d_B = 300$ converted binary variables, r20 and r80 show on average 2.5 and 7.8 times higher computational costs, whereas their $d_B$ values are 4 and 16 times larger. Most of the cost increase is due to the QUBO formulation and the preprocessing of the request data before annealing on the Ising machine.

Figure 10: Variation of the per-cycle computational time for kernel-QA for the binary-variable dimensions $d = 40$ (black dashed) and $640$ (black dotted), and the real-variable dimensions $d = 5$ (red dotted), $20$ (red thin solid), and $80$ (red thick solid) for the Rastrigin function. The plot shows the average and standard deviation over ten independent runs.

Appendix B Exponential transformation for Bayesian optimization

The exponential transformation described in Sec. 2.4 and assessed in Sec. 4.3 is a robust way to facilitate surrogate model construction in kernel-QA. Here, the same transformation is applied to Bayesian optimization, and the results are compared in Fig. 11.

The effect of the exponential transformation on the Bayesian optimization results is not as substantial as for kernel-QA in Sec. 4.3. In fact, the method performs best without the transformation. This result implies the predominance of exploration in this optimization method for large-dimension problems.

Figure 11: Evolution of $f(\mathbf{x}_{best,i})$ with different conditions for the exponential transformation (Sec. 2.4) for Bayesian optimization under the condition r80. The shown value is normalized by the best value at the beginning of the optimization cycles, $f(\mathbf{x}_{best,0})$.

References

  • Frazier [2018] Frazier, P.I.: A Tutorial on Bayesian Optimization (2018). https://arxiv.org/abs/1807.02811
  • Snoek et al. [2012] Snoek, J., Larochelle, H., Adams, R.P.: Practical bayesian optimization of machine learning algorithms. In: Pereira, F., Burges, C.J., Bottou, L., Weinberger, K.Q. (eds.) Advances in Neural Information Processing Systems, vol. 25. Curran Associates, Inc., NY (2012)
  • Shahriari et al. [2016] Shahriari, B., Swersky, K., Wang, Z., Adams, R.P., Freitas, N.: Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE 104(1), 148–175 (2016) https://doi.org/10.1109/JPROC.2015.2494218
  • Baptista and Poloczek [2018] Baptista, R., Poloczek, M.: Bayesian optimization of combinatorial structures. In: Dy, J., Krause, A. (eds.) Proceedings of the 35th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 80, pp. 462–471 (2018). https://proceedings.mlr.press/v80/baptista18a.html
  • Hvarfner et al. [2024] Hvarfner, C., Hellsten, E.O., Nardi, L.: Vanilla Bayesian Optimization Performs Great in High Dimensions. Preprint at https://arxiv.org/abs/2402.02229 (2024)
  • Kitai et al. [2020] Kitai, K., Guo, J., Ju, S., Tanaka, S., Tsuda, K., Shiomi, J., Tamura, R.: Designing metamaterials with quantum annealing and factorization machines. Phys. Rev. Res. 2, 013319 (2020)
  • Inoue et al. [2022] Inoue, T., Seki, Y., Tanaka, S., Togawa, N., Ishizaki, K., Noda, S.: Towards optimization of photonic-crystal surface-emitting lasers via quantum annealing. Opt. Express 30, 43503–43512 (2022)
  • Kim et al. [2022] Kim, S., Shang, W., Moon, S., Pastega, T., Lee, E., Luo, T.: High-performance transparent radiative cooler designed by quantum computing. ACS Energy Letters 7(12), 4134–4141 (2022) https://doi.org/10.1021/acsenergylett.2c01969
  • Nawa et al. [2023] Nawa, K., Suzuki, T., Masuda, K., Tanaka, S., Miura, Y.: Quantum annealing optimization method for the design of barrier materials in magnetic tunnel junctions. Phys. Rev. Appl. 20, 024044 (2023) https://doi.org/10.1103/PhysRevApplied.20.024044
  • Rendle [2010] Rendle, S.: Factorization machines. In: 2010 IEEE International Conference on Data Mining, pp. 995–1000 (2010). https://doi.org/10.1109/ICDM.2010.127
  • [11] D-Wave Systems. Accessed on September 2nd 2024. https://www.dwavesys.com/
  • [12] Fixstars Amplify. Accessed on September 2nd 2024. https://amplify.fixstars.com/en/
  • Izawa et al. [2022] Izawa, S., Kitai, K., Tanaka, S., Tamura, R., Tsuda, K.: Continuous black-box optimization with an Ising machine and random subspace coding. Phys. Rev. Res. 4, 023062 (2022)
  • Mao et al. [2023] Mao, Z., Matsuda, Y., Tamura, R., Tsuda, K.: Chemical design with GPU-based Ising machines. Digital Discovery 2(4), 1098–1103 (2023)
  • Kadowaki and Ambai [2022] Kadowaki, T., Ambai, M.: Lossy compression of matrices by black box optimisation of mixed integer nonlinear programming. Sci. Rep. 12, 15482 (2022)
  • Schölkopf et al. [2001] Schölkopf, B., Herbrich, R., Smola, A.J.: A generalized representer theorem. In: International Conference on Computational Learning Theory, pp. 416–426 (2001)
  • Golub and Van Loan [1996] Golub, G.H., Van Loan, C.F.: Matrix Computations, 3rd edn., p. 51. Johns Hopkins University Press, Baltimore, MD (1996)
  • Picheny et al. [2013] Picheny, V., Wagner, T., Ginsbourger, D.: A benchmark of kriging-based infill criteria for noisy optimization. Struct. Multidisc. Optim. 48, 607–626 (2013)
  • Jones et al. [1998] Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-box functions. Journal of Global Optimization 13(4), 455–492 (1998)
  • Liu and Nocedal [1989] Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Mathematical Programming 45(1), 503–528 (1989)
  • [21] GPyOpt: A Bayesian Optimization framework in Python (2016). http://github.com/SheffieldML/GPyOpt