This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Efficient Robust Bayesian Optimization
for Arbitrary Uncertain Inputs

Lin Yang
Huawei Noah’s Ark Lab
China
yanglin33@huawei.com
&Junlong Lyu
Huawei Noah’s Ark Lab
Hong Kong SAR, China
lyujunlong@huawei.com
Wenlong Lyu
Huawei Noah’s Ark Lab
China
lvwenlong2@huawei.com
&Zhitang Chen
Huawei Noah’s Ark Lab
Hong Kong SAR, China
chenzhitang2@huawei.com
Abstract

Bayesian Optimization (BO) is a sample-efficient optimization algorithm widely employed across various applications. In some challenging BO tasks, input uncertainty arises due to the inevitable randomness in the optimization process, such as machining errors, execution noise, or contextual variability. This uncertainty deviates the input from the intended value before evaluation, resulting in significant performance fluctuations in final result. In this paper, we introduce a novel robust Bayesian Optimization algorithm, AIRBO, which can effectively identify a robust optimum that performs consistently well under arbitrary input uncertainty. Our method directly models the uncertain inputs of arbitrary distributions by empowering the Gaussian Process with the Maximum Mean Discrepancy (MMD) and further accelerates the posterior inference via Nyström approximation. Rigorous theoretical regret bound is established under MMD estimation error and extensive experiments on synthetic functions and real problems demonstrate that our approach can handle various input uncertainties and achieve a state-of-the-art performance.

1 Introduction

Bayesian Optimization (BO) is a powerful sequential decision-making algorithm for high-cost black-box optimization. Owing to its remarkable sample efficiency and capacity to balance exploration and exploitation, BO has been successfully applied in diverse domains, including neural architecture search [32], hyper-parameter tuning [4, 29, 12], and robotic control [18, 5], among others. Nevertheless, in some real-world problems, the stochastic nature of the optimization process, such as machining error during manufacturing, execution noise of control, or variability in contextual factor, inevitably introduces input randomness, rendering the design parameter xx to deviate to xx^{\prime} before evaluation. This deviation produces a fluctuation of function value yy and eventually leads to a performance instability of the outcome. In general, the input randomness is determined by the application scenario and can be of arbitrary distribution, even quite complex ones. Moreover, in some cases, we cannot observe the exact deviated input xx^{\prime} but a rough estimation for the input uncertainty. This is quite common for robotics and process controls. For example, consider a robot control task shown in Figure 1(a), a drone is sent to a target location xx to perform a measurement task. However, due to the execution noise caused by the fuzzy control or a sudden wind, the drone ends up at location xP(x)x^{\prime}\sim P(x) and gets a noisy measurement y=f(x)+ζy=f(x^{\prime})+\zeta. Instead of observing the exact value of xx^{\prime}, we only have a coarse estimation of the input uncertainty P(x)P(x). The goal is to identify a robust location that gives the maximal expected measurement under the process randomness.

Refer to caption
(a) An example case: drone measurement with execution noise.
Refer to caption
(b) Problem formulation.
Figure 1: Robust Bayesian optimization problem.

To find a robust optimum, it is crucial to account for input uncertainty during the optimization process. Existing works [24, 3, 7, 10] along this direction assume that the exact input value, i.e., xx^{\prime} in Figure 1(b), is observable and construct a surrogate model using these exact inputs. Different techniques are then employed to identify the robust optimum: [24] utilize the unscented transform to propagate input uncertainty to the acquisition function [24], while [3] integrate over the exact GP model to obtain the posterior with input uncertainty [3]. Meanwhile, [7] designs a robust confidence-bounded acquisition and applies min-max optimization to identify the robust optimum. Similarly, [10] constructs an adversarial surrogate with samples from the exact surrogate. These methods work quite well but are constrained by their dependence on observable input values, which may not always be practical.

An alternative approach involves directly modeling the uncertain inputs. A pioneering work by Moreno et al. [20] assumes Gaussian input distribution and employs a symmetric Kullback-Leibler divergence (SKL) to measure the distance of input variables. Dallaire et al. [13] implement a Gaussian process model with an expected kernel and derive a closed-form solution by restricting the kernel to linear, quadratic, or squared exponential kernels and assuming Gaussian inputs. Nonetheless, the applicability of these methods is limited due to their restrictive Gaussian input distribution assumption and kernel choice. To surmount these limitations, Oliveira et al. propose a robust Gaussian process model that incorporates input distribution by computing an integral kernel. Although this kernel can be applied to various distributions and offers a rigorous regret bound, its posterior inference requires a large sampling and can be time-consuming.

In this work, we propose an Arbitrary Input uncertainty Robust Bayesian Optimization algorithm (AIRBO). This algorithm can directly model the uncertain input of arbitrary distribution and propagate the input uncertainty into the surrogate posterior, which can then be used to guide the search for robust optimum. To achieve this, we employ Gaussian Process (GP) as the surrogate and empower its kernel design with the Maximum Mean Discrepancy (MMD), which allows us to comprehensively compare the uncertain inputs in Reproducing Kernel Hilbert Space (RKHS) and accurately quantify the target function under various input uncertainties(Sec. 3.1). Moreover, to stabilize the MMD estimation and accelerate the posterior inference, we utilize Nyström approximation to reduce the space complexity of MMD estimation from O(m2)O(m^{2}) to O(mh)O(mh), where hmh\ll m (Sec. 3.2). This can substantially improve the parallelization of posterior inference and a rigorous theoretical regret bound is also established under the approximation error (Sec. 4). Comprehensive evaluations on synthetic functions and real problems in Sec.5 demonstrate that our algorithm can efficiently identify robust optimum under complex input uncertainty and achieve state-of-the-art performance.

2 Problem Formulation

In this section, we first formulize the robust optimization problem under input uncertainty then briefly review the intuition behind Bayesian Optimization and Gaussian Processes.

2.1 Optimization with Input Uncertainty

As illustrated in Figure 1(b), we consider an optimization of expensive black-box function: f(x)f(x), where xx is the design parameter to be tuned. At each iteration nn, we select a new query point xnx_{n} according to the optimization heuristics. However, due to the stochastic nature of the process, such as machining error or execution noise, the query point is perturbed to xnx_{n}^{\prime} before the function evaluation. Moreover, we cannot observe the exact value of xnx_{n}^{\prime} and only have a vague probability estimation of its value: PxnP_{x_{n}}. After the function evaluation, we get a noisy measurement y=f(xn)+ζny=f(x_{n}^{\prime})+\zeta_{n}, where ζn\zeta_{n} is homogeneous measurement noise sampled from 𝒩(0,σ2)\mathcal{N}(0,\sigma^{2}). The goal is to find an optimal design parameter xx^{*} that maximizes the expected function value under input uncertainty:

x=argmaxxxPxf(x)𝑑x=argmaxx𝔼Px[f]x^{*}=\operatorname*{arg\,max}_{x}{\int_{x^{\prime}\sim P_{x}}f(x^{\prime})dx^{\prime}}=\operatorname*{arg\,max}_{x}\mathbb{E}_{P_{x}}[f] (1)

Depending on the specific problem and randomness source, the input distribution PxP_{x} can be arbitrary in general and even become quite complex sometimes. Here we do not place any additional assumption on them, except assuming we can sample from these input distributions, which can be easily done by approximating it with Bayesian methods and learning a parametric probabilistic model [16]. Additionally, we assume the exact values of xx^{\prime} are inaccessible, which is quite common in some real-world applications, particularly in robotics and process control [25].

2.2 Bayesian Optimization

In this paper, we focus on finding the robust optimum with BO. Each iteration of BO involves two key steps: I) fitting a surrogate model and II) maximizing an acquisition function.

Gaussian Process Surrogate: To build a sample-efficient surrogate, we choose Gaussian Process (GP) as the surrogate model in this paper. Following [34], GP can be interpreted from a weight-space view: given a set of nn observations, 𝒟n={(xi,yi)|i=1,,n}\mathcal{D}_{n}=\{(x_{i},y_{i})|i=1,...,n\}. Denote all the inputs as XD×nX\in\mathbb{R}^{D\times n} and all the output vector as yn×1y\in\mathbb{R}^{n\times 1}. We first consider a linear surrogate:

f(x)=xTw,y=f(x)+ζ,ζ𝒩(0,σ2),f(x)=x^{T}w,\;y=f(x)+\zeta,\,\zeta\sim\mathcal{N}(0,\sigma^{2}), (2)

where ww is the model parameters and ζ\zeta is the observation noise. This model’s capacity is limited due to its linear form. To obtain a more powerful surrogate, we can extend it by projecting the input xx into a feature space ϕ(x)\phi(x). By taking a Bayesian treatment and placing a zero mean Gaussian prior on the weight vector: w𝒩(0,Σp)w\sim\mathcal{N}(0,\Sigma_{p}), its predictive distribution can be derived as follows (see Section2.1 of [34] for detailed derivation):

f|x,X,y𝒩(ϕT(x)Σpϕ(X)(A+σn2I)1y,ϕT(x)Σpϕ(x)ϕT(x)Σpϕ(X)(A+σn2I)1ϕT(X)Σpϕ(x)),\begin{split}f_{*}|x_{*},X,y\sim\mathcal{N}\big{(}&\phi^{T}(x_{*})\Sigma_{p}\phi(X)(A+\sigma_{n}^{2}I)^{-1}y,\\ &\phi^{T}(x_{*})\Sigma_{p}\phi(x_{*})-\phi^{T}(x_{*})\Sigma_{p}\phi(X)(A+\sigma_{n}^{2}I)^{-1}\phi^{T}(X)\Sigma_{p}\phi(x_{*})\big{)},\end{split} (3)

where A=ϕT(X)Σpϕ(X)A=\phi^{T}(X)\Sigma_{p}\phi(X) and II is a identity matrix. Note the predictive distribution is also a Gaussian and the feature mappings are always in the form of inner product with respect to Σp\Sigma_{p}. This implies we are comparing inputs in a feature space and enables us to apply kernel trick. Therefore, instead of exactly defining a feature mapping ϕ()\phi(\cdot), we can define a kernel: k(x,x)=ϕ(x)TΣpϕ(x)=ψ(x)ψ(x)k(x,x^{\prime})=\phi(x)^{T}\Sigma_{p}\phi(x^{\prime})=\psi(x)\cdot\psi(x^{\prime}). Substituting it into Eq. 3 gives the vanilla GP posterior:

f|X,y,X𝒩(μ,Σ),where μ=K(X,X)[K(X,X)+σn2I]1y,Σ=K(X,X)K(X,X)[K(X,X)+σn2I]1K(X,X).\begin{split}&f_{*}|X,y,X_{*}\sim\mathcal{N}(\mu_{*},\Sigma_{*}),\text{where }\mu_{*}=K(X_{*},X)[K(X,X)+\sigma_{n}^{2}I]^{-1}y,\\ &\Sigma_{*}=K(X_{*},X_{*})-K(X_{*},X)[K(X,X)+\sigma_{n}^{2}I]^{-1}K(X,X_{*}).\end{split} (4)

From this interpretation of GP, we note that its core idea is to project the input xx to a (possibly infinite) feature space ψ(x)\psi(x) and compare them in the Reproducing Kernel Hilbert Space (RKHS) defined by kernel.

Acquisition Function Optimization: Given the posterior of GP surrogate model, the next step is to decide a query point xnx_{n}. The exploitation and exploration balance is achieved by designing an acquisition function α(x|𝒟n)\alpha(x|\mathcal{D}_{n}). Through numerous acquisition functions exist [28], we follow [25, 7] and adopt the Upper Confidence Bound (UCB) acquisition:

α(x|𝒟n)=μ(x)+βσ(x),\alpha(x|\mathcal{D}_{n})=\mu_{*}(x)+\beta\sigma_{*}(x), (5)

where β\beta is a hyper-parameter to control the level of exploration.

3 Proposed Method

To cope with randomness during the optimization process, we aim to build a robust surrogate that can directly accept the uncertain inputs of arbitrary distributions and propagate the input uncertainty into the posterior. Inspired by the weight-space interpretation of GP, we empower GP kernel with MMD to compare the uncertain inputs in RKHS. In this way, the input randomness is considered during the covariance computation and naturally reflected in the resulting posterior , which then can be used to guide the search for a robust optimum(Sec. 3.1). To further accelerate the posterior inference, we employ Nyström approximation to stabilize the MMD estimation and reduce its space complexity (Sec. 3.2).

3.1 Modeling the Uncertain Inputs

Assume Px𝒫𝒳𝒫P_{x}\in\mathcal{P}_{\mathcal{X}}\subset\mathcal{P} are a set of distribution densities over d\mathbb{R}^{d}, representing the distributions of the uncertain inputs. We are interested in building a GP surrogate over the probability space 𝒫\mathcal{P}, which requires to measure the difference between the uncertain inputs.

To do so, we turn to the Integral Probabilistic Metric (IPM) [23]. The basic idea behind IPM is to define a distance measure between two distributions PP and QQ as the supremum over a class of functions 𝒢\mathcal{G} of the absolute expectation difference:

d(P,Q)=supg𝒢|𝔼uPg(u)𝔼vQg(v)|,d(P,Q)=\sup_{g\in\mathcal{G}}|\mathbb{E}_{u\sim P}g(u)-\mathbb{E}_{v\sim Q}g(v)|, (6)

where 𝒢\mathcal{G} is a class of functions that satisfies certain conditions. Different choices of 𝒢\mathcal{G} lead to various IPMs. For example, if we restrict the function class to be uniformly bounded in RKHS we can get the MMD [15], while a Lipschitz-continuous 𝒢\mathcal{G} realizes the Wasserstein distance [14].

In this work, we choose MMD as the distance measurement for the uncertain inputs because of its intrinsic connection with distance measurement in RKHS. Given a characteristic kernel k:d×dk:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R} and associate RKHS k\mathcal{H}_{k}, define the mean map ψ:𝒫k\psi:\mathcal{P}\to\mathcal{H}_{k} such that ψ(P),g=𝔼P[g],gk\langle\psi(P),g\rangle=\mathbb{E}_{P}[g],\forall g\in\mathcal{H}_{k}. The MMD between P,Q𝒫P,Q\in\mathcal{P} is defined as:

MMD(P,Q)=supgk1[𝔼uPg(u)𝔼vQg(v)]=ψPψQ,\text{MMD}(P,Q)=\sup_{||g||_{k}\leq 1}[\mathbb{E}_{u\sim P}g(u)-\mathbb{E}_{v\sim Q}g(v)]=||\psi_{P}-\psi_{Q}||, (7)

Without any additional assumption on the input distributions, except we can get mm samples {ui}i=1m,{vi}i=1m\{u_{i}\}_{i=1}^{m},\{v_{i}\}_{i=1}^{m} from P,QP,Q respectively, MMD can be empirically estimated as follows [21]:

MMD2(P,Q)1m(m1)1i,jm,ij(k(ui,uj)+k(vi,vj))2m21i,jmk(ui,vj),\begin{split}\text{MMD}^{2}(P,Q)\approx\frac{1}{m(m-1)}\sum_{1\leq i,j\leq m,i\neq j}\left(k(u_{i},u_{j})+k(v_{i},v_{j})\right)-\frac{2}{m^{2}}\sum_{1\leq i,j\leq m}k(u_{i},v_{j}),\end{split} (8)

To integrate MMD into the GP surrogate, we design an MMD-based kernel over 𝒫\mathcal{P} as follows:

k^(P,Q)=exp(αMMD2(P,Q)),\hat{k}(P,Q)=\exp(-\alpha\text{MMD}^{2}(P,Q)), (9)

with a learnable scaling parameter α\alpha. This is a valid kernel, and universal w.r.t. C(𝒫)C(\mathcal{P}) under mild conditions (see Theorem 2.2, [11]). Also, it is worth to mention that, to compute the GP posterior, we only need to sample mm points from the input distributions, but do not require their corresponding function values.

With the MMD kernel, our surrogate model places a prior 𝒢𝒫(0,k^(Px,Px))\mathcal{GP}(0,\hat{k}(P_{x},P_{x^{\prime}})) and obtain a dataset 𝒟n={(x^i,yi)|x^iPxi,i=1,2,,n)}\mathcal{D}_{n}=\{(\hat{x}_{i},y_{i})|\hat{x}_{i}\sim P_{x_{i}},i=1,2,...,n)\}. The posterior is Gaussian with mean and variance:

μ^n(P)\displaystyle\hat{\mu}_{n}(P_{\ast}) =𝐤^n(P)T(𝐊^n+σ2𝐈)1𝐲n\displaystyle=\hat{\bf k}_{n}(P_{\ast})^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf y}_{n} (10)
σ^n2(P)\displaystyle\hat{\sigma}_{n}^{2}(P_{\ast}) =k^(P,P)𝐤^n(P)T(𝐊^n+σ2𝐈)1𝐤^n(P),\displaystyle=\hat{k}(P_{\ast},P_{\ast})-\hat{\bf k}_{n}(P_{\ast})^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}\hat{\bf k}_{n}(P_{\ast}), (11)

where 𝐲n:=[y1,,yn]T{\bf y}_{n}:=[y_{1},\cdots,y_{n}]^{T}, 𝐤^n(P):=[k^(P,Px1),,k^(P,Pxn)]T\hat{\bf k}_{n}(P_{\ast}):=[\hat{k}(P_{\ast},P_{x_{1}}),\cdots,\hat{k}(P_{\ast},P_{x_{n}})]^{T} and [𝐊^n]ij=k^(Pxi,Pxj)[\hat{\bf K}_{n}]_{ij}=\hat{k}(P_{x_{i}},P_{x_{j}}).

3.2 Boosting posterior inference with Nyström Approximation

To derive the posterior distribution of our robust GP surrogate, it requires estimating the MMD between each pair of inputs. Gretton et al. prove the empirical estimator in Eq. 8 approximates MMD in a bounded and asymptotic way [15]. However, the sampling size mm used for estimation greatly affects the approximation error and insufficient sampling leads to a high estimation variance(ref. Figure 3(a)).

Such an MMD estimation variance causes numerical instability of the covariance matrix and propagates into the posterior distribution and acquisition function, rendering the search for optimal query point a challenging task. Figure 3(b) gives an example of MMD-GP posterior with insufficient samples, which produces a noisy acquisition function and impedes the search of optima. Increasing the sampling size can help alleviate this issue. However, the computation and space complexities of the empirical MMD estimator scale quadratically with the sampling size mm. This leaves us with a dilemma that insufficient sampling results in a highly-varied posterior while a larger sample size can occupy significant GPU memory and reduce the ability for parallel computation.

To reduce the space and computation complexity while retaining a stable MMD estimation, we resort to the Nyström approximation [33]. This method alleviates the computational cost of kernel matrix by randomly selecting hh subsamples from the mm samples(hmh\ll m) and computes an approximated matrix via K~=KmhKh+KmhT\tilde{K}=K_{mh}K_{h}^{+}K_{mh}^{T}. Combining this with the MMD definition gives its Nyström estimator:

MMD2(P,Q)=𝔼u,uPP[k(u,u)]+𝔼v,vQQ[k(v,v)]2𝔼u,vPQ[k(u,v)]1m2𝟏mTU𝟏m+1m2𝟏mTV𝟏m2m2𝟏mTW𝟏m1m2𝟏mTUmhUh+UmhT𝟏n+1m2𝟏mTVmhVh+VmhT𝟏m2m2𝟏mTWmhWh+WmhT𝟏m\begin{split}{\text{MMD}}^{2}(P,Q)&=\mathbb{E}_{u,u^{\prime}\sim P\bigotimes P}[k(u,u^{\prime})]+\mathbb{E}_{v,v^{\prime}\sim Q\bigotimes Q}[k(v,v^{\prime})]-2\mathbb{E}_{u,v\sim P\bigotimes Q}[k(u,v)]\\ &\approx\frac{1}{m^{2}}\mathbf{1}_{m}^{T}U\mathbf{1}_{m}+\frac{1}{m^{2}}\mathbf{1}_{m}^{T}V\mathbf{1}_{m}-\frac{2}{m^{2}}\mathbf{1}_{m}^{T}W\mathbf{1}_{m}\\ &\approx\frac{1}{m^{2}}\mathbf{1}_{m}^{T}U_{mh}U_{h}^{+}U_{mh}^{T}\mathbf{1}_{n}+\frac{1}{m^{2}}\mathbf{1}_{m}^{T}V_{mh}V_{h}^{+}V_{mh}^{T}\mathbf{1}_{m}-\frac{2}{m^{2}}\mathbf{1}_{m}^{T}W_{mh}W_{h}^{+}W_{mh}^{T}\mathbf{1}_{m}\end{split} (12)

where U=K(𝐮,𝐮)U=K({\bf u},{\bf u}^{\prime}), V=K(𝐯,𝐯)V=K({\bf v},{\bf v}^{\prime}), W=K(𝐮,𝐯)W=K({\bf u},{\bf v}) are the kernel matrices, 𝟏m\mathbf{1}_{m} represents a m-by-1 vector of ones, mm defines the sampling size and hh controls the sub-sampling size. Note that this Nyström estimator reduces the space complexity of posterior inference from O(MNm2)O(MNm^{2}) to O(MNmh)O(MNmh), where MM and NN are the numbers of training and testing samples, mm is the sampling size for MMD estimation while hmh\ll m is the sub-sampling size. This can significantly boost the posterior inference of robust GP by allowing more inference to run in parallel on GPU.

4 Theoretical Analysis

Assume x𝒳dx\in\mathcal{X}\subset\mathbb{R}^{d}, and Px𝒫𝒳𝒫P_{x}\in\mathcal{P}_{\mathcal{X}}\subset\mathcal{P} are a set of distribution densities over d\mathbb{R}^{d}, representing the distribution of the noisy input. Given a characteristic kernel k:d×dk:\mathbb{R}^{d}\times\mathbb{R}^{d}\to\mathbb{R} and associate RKHS k\mathcal{H}_{k}, we define the mean map ψ:𝒫k\psi:\mathcal{P}\to\mathcal{H}_{k} such that ψ(P),g=𝔼P[g],gk\langle\psi(P),g\rangle=\mathbb{E}_{P}[g],\forall g\in\mathcal{H}_{k}.

We consider a more general case. Choosing any suitable functional LL such that k^(P,P):=L(ψP,ψP)\hat{k}(P,P^{\prime}):=L(\psi_{P},\psi_{P^{\prime}}) is a positive-definite kernel over 𝒫\mathcal{P}, for example the linear kernel ψP,ψPk\langle\psi_{P},\psi_{P^{\prime}}\rangle_{k} and radial kernels exp(αψPψPk2)\exp(-\alpha\|\psi_{P}-\psi_{P^{\prime}}\|^{2}_{k}) using the MMD distance as a metric. Such a kernel k^\hat{k} is associated with a RKHS k^\mathcal{H}_{\hat{k}} containing functions over the space of probability measures 𝒫\mathcal{P}.

One important theoretical guarantee to conduct 𝒢𝒫\mathcal{GP} model is that our object function can be approximated by functions in k^\mathcal{H}_{\hat{k}}, which relies on the universality of k^\hat{k}. Let C(𝒫)C(\mathcal{P}) be the class of continuous functions over 𝒫\mathcal{P} endowed with the topology of weak convergence and the associated Borel σ\sigma-algebra, and we define f^C(𝒫)\hat{f}\in C(\mathcal{P}) such that

f^(P):=𝔼P[f],P𝒫,\hat{f}(P):=\mathbb{E}_{P}[f],\forall P\in\mathcal{P},

which is just our object function, For k^\hat{k} be radial kernels, it has been shown that k^\hat{k} is universal w.r.t C(𝒫)C(\mathcal{P}) given that 𝒳\mathcal{X} is compact and the mean map ψ\psi is injective [11, 22]. For k^\hat{k} be linear kernel which is not universal, it has been shown in Lemma 1, [26] that f^k^\hat{f}\in\mathcal{H}_{\hat{k}} if and only if ff\in\mathcal{H} and further f^k^=fk\|\hat{f}\|_{\hat{k}}=\|f\|_{k}. Thus, in the remain of this chapter, we may simply assume f^k^\hat{f}\in\mathcal{H}_{\hat{k}}.

Suppose we have an approximation kernel function k~(P,Q)\tilde{k}(P,Q) near to the exact kernel function k^(P,Q)\hat{k}(P,Q). The mean μ^n(p)\hat{\mu}_{n}(p_{\ast}) and variance σ^n2(p)\hat{\sigma}_{n}^{2}(p_{\ast}) are approximated by

μ~n(P)\displaystyle\tilde{\mu}_{n}(P_{\ast}) =𝐤~n(P)T(𝐊~n+σ2𝐈)1𝐲n\displaystyle=\tilde{\bf k}_{n}(P_{\ast})^{T}(\tilde{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf y}_{n} (13)
σ~n2(P)\displaystyle\tilde{\sigma}_{n}^{2}(P_{\ast}) =k~(P,P)𝐤~n(P)T(𝐊~n+σ2𝐈)1𝐤~n(P),\displaystyle=\tilde{k}(P_{\ast},P_{\ast})-\tilde{\bf k}_{n}(P_{\ast})^{T}(\tilde{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}\tilde{\bf k}_{n}(P_{\ast}), (14)

where 𝐲n:=[y1,,yn]T{\bf y}_{n}:=[y_{1},\cdots,y_{n}]^{T}, 𝐤~n(P):=[k~(P,P1),,k~(P,Pn)]T\tilde{\bf k}_{n}(P_{\ast}):=[\tilde{k}(P_{\ast},P_{1}),\cdots,\tilde{k}(P_{\ast},P_{n})]^{T} and [𝐊~n]ij=k~(Pi,Pj)[\tilde{\bf K}_{n}]_{ij}=\tilde{k}(P_{i},P_{j}).

The maximum information gain corresponding to the kernel k^\hat{k} is denoted as

γ^n:=sup𝒫𝒳;||=nI^(𝐲n;𝐟^n|)=12lndet(𝐈+σ2𝐊^n),\hat{\gamma}_{n}:=\sup_{\mathcal{R}\in\mathcal{P}_{\mathcal{X}};|\mathcal{R}|=n}\hat{I}({\bf y}_{n};\hat{\bf f}_{n}|\mathcal{R})=\frac{1}{2}\ln\det({\bf I}+\sigma^{-2}\hat{\bf K}_{n}),

Denote e(P,Q)=k^(P,Q)k~(P,Q)e(P,Q)=\hat{k}(P,Q)-\tilde{k}(P,Q) as the error function when estimating the kernel k^\hat{k}. We suppose e(P,Q)e(P,Q) has an upper bound with high probability:

Assumption 1.

For any ε>0\varepsilon>0, P,Q𝒫𝒳P,Q\in\mathcal{P_{\mathcal{X}}}, we may choose an estimated k~(P,Q)\tilde{k}(P,Q) such that the error function e(P,Q)e(P,Q) can be upper-bounded by eεe_{\varepsilon} with probability at least 1ε1-\varepsilon, that is, (|e(P,Q)|eε)>1ε.\mathbb{P}\left(|e(P,Q)|\leq e_{\varepsilon}\right)>1-\varepsilon.

Remark. Note that this assumption is standard in our case: we may assume maxx𝒳ϕkΦ\max_{x\in\mathcal{X}}\|\phi\|_{k}\leq\Phi, where ϕ\phi is the feature map corresponding to the kk. Then when using empirical estimator, the error between MMDempirical\text{MMD}_{\text{empirical}} and MMD is controlled by 4Φ2log(6/ε)m14\Phi\sqrt{2\log(6/\varepsilon)m^{-1}} with probability at least 1ε1-\varepsilon according to Lemma E.1, [8]. When using the Nyström estimator, the error has a similar form as the empirical one, and under mild conditions, when h=O(mlog(m))h=O(\sqrt{m}\log(m)), we get the error of the order O(m1/2log(1/ε))O(m^{-{1/2}}\log(1/\varepsilon)) with probability at least 1ε1-\varepsilon. One can check more details in Lemma 1.

Now we restrict our Gaussian process in the subspace 𝒫𝒳={Px,x𝒳}𝒫\mathcal{P}_{\mathcal{X}}=\{P_{x},x\in\mathcal{X}\}\subset\mathcal{P}. We assume the observation yi=f(xi)+ζiy_{i}=f(x_{i})+\zeta_{i} with the noise ζi\zeta_{i}. The input-induced noise is defined as Δfpxi:=f(xi)𝔼Pxi[f]=f(xi)f^(Pxi)\Delta f_{p_{x_{i}}}:=f(x_{i})-\mathbb{E}_{P_{x_{i}}}[f]=f(x_{i})-\hat{f}(P_{x_{i}}). Then the total noise is yi𝔼Pxi[f]=ζi+Δfpxiy_{i}-\mathbb{E}_{P_{x_{i}}}[f]=\zeta_{i}+\Delta f_{p_{x_{i}}}. We can state our main result, which gives a cumulative regret bound under inexact kernel calculations,

Theorem 1.

Let δ>0,fk,\delta>0,f\in\mathcal{H}_{k}, and the corresponding f^k^b,maxx𝒳|f(x)|=M\|\hat{f}\|_{\hat{k}}\leq b,\max_{x\in\mathcal{X}}|f(x)|=M. Suppose the observation noise ζi=yif(xi)\zeta_{i}=y_{i}-f(x_{i}) is σζ\sigma_{\zeta}-sub-Gaussian, and thus with high probability |ζi|<A|\zeta_{i}|<A for some A>0A>0. Assume that both kk and PxP_{x} satisfy the conditions for ΔfPx\Delta f_{P_{x}} to be σE\sigma_{E}-sub-Gaussian, for a given σE>0\sigma_{E}>0. Then, under Assumption 1 with ε>0\varepsilon>0 and corresponding eεe_{\varepsilon}, setting σ2=1+2n\sigma^{2}=1+\frac{2}{n}, running Gaussian Process with acquisition function

α~(x|𝒟n)\displaystyle\tilde{\alpha}(x|\mathcal{D}_{n}) =μ~n(Px)+βnσ~n(Px)\displaystyle=\tilde{\mu}_{n}(P_{x})+\beta_{n}\tilde{\sigma}_{n}(P_{x}) (15)
where βn=\displaystyle\text{where }\beta_{n}= (b+σE2+σζ22(γ^n+1lnδ)),\displaystyle\left(b+\sqrt{\sigma_{E}^{2}+\sigma_{\zeta}^{2}}\sqrt{2\left(\hat{\gamma}_{n}+1-\ln\delta\right)}\right),

we have that the uncertain-inputs cumulative regret satisfies:

R~nO(nγ^n(γ^nlnδ)+n2(γ^nlnδ)eε+n3eε)\tilde{R}_{n}\in O\left(\sqrt{n\hat{\gamma}_{n}(\hat{\gamma}_{n}-\ln\delta)}+n^{2}\sqrt{(\hat{\gamma}_{n}-\ln\delta)e_{\varepsilon}}+n^{3}e_{\varepsilon}\right) (16)

with probability at least 1δnε1-\delta-n\varepsilon. Here R~n=t=1nr~t\tilde{R}_{n}=\sum_{t=1}^{n}\tilde{r}_{t}, and r~t=maxx𝒳𝔼Px[f]𝔼Pxt[f]\tilde{r}_{t}=\max_{x\in\mathcal{X}}\mathbb{E}_{P_{x}}[f]-\mathbb{E}_{P_{x_{t}}}[f]

The proof of our main theorem 1 can be found in appendix B.3.

The assumption that ζi\zeta_{i} is σζ\sigma_{\zeta}-sub-Gaussian is standard in 𝒢𝒫\mathcal{GP} fields. The assumption that ΔfPx\Delta f_{P_{x}} is σE\sigma_{E}-sub-Gaussian can be met when PxP_{x} is uniformly bounded or Gaussian, as stated in Proposition 3, [26]. Readers may check the definition of sub-Gaussian in appendix, Definition 1.

To achieve an regret of order R~nO(nγ^n)\tilde{R}_{n}\in O(\sqrt{n}\hat{\gamma}_{n}) , the same order as the exact Improved 𝒢𝒫\mathcal{GP} regret (23), and ensure this with high probability, we need to take ε=O(δ/n)\varepsilon=O(\delta/n), eε=O(n52γ^n(γ^n2n12))e_{\varepsilon}=O(n^{-\frac{5}{2}}\hat{\gamma}_{n}(\hat{\gamma}_{n}^{-2}\wedge n^{-\frac{1}{2}})), and this requires a sample size mm of order O(n5γ^n2(γ^n4n)log(n))O(n^{5}\hat{\gamma}_{n}^{-2}(\hat{\gamma}_{n}^{4}\vee n)\log(n)) for MCMC approximation, or with a same sample size mm and a subsample size hh of order O(n52+νγ^n1ν(γ^n2n12))O(n^{\frac{5}{2}+\nu}\hat{\gamma}_{n}^{-1-\nu}(\hat{\gamma}_{n}^{2}\vee n^{\frac{1}{2}})) for Nyström approximation with some ν>0\nu>0. Note that (16) only offers an upper bound for cumulative regret, in real applications the calculated regret may be much smaller than this bound, as the approximation error eϵe_{\epsilon} can be fairly small even with a few samples when the input noise is relatively weak.

To analysis the exact order of γ^n\hat{\gamma}_{n} could be difficult, as it is influenced by the specific choice of embedding kernel kk and input uncertainty distributions Pxi,xi𝒳P_{x_{i}},x_{i}\in\mathcal{X}. Nevertheless, we can deduce the following result for a wide range of cases, showing that cumulative regret is sub-linear under mild conditions. One can check the proof in appendix B.4.

Theorem 2 (Bounding the Maximum information gain).

Suppose kk is rr-th differentiable with bounded derivatives and translation invariant, i.e., k(x,y)=k(xy,0)k(x,y)=k(x-y,0). Suppose the input uncertainty is i.i.d., that is, the noised input density satisfies Pxi(x)=P0(xxi),xi𝒳P_{x_{i}}(x)=P_{0}(x-x_{i}),\forall x_{i}\in\mathcal{X}. Then if the space 𝒳\mathcal{X} is compact in d\mathbb{R}^{d}, the maximum information gain γ^n\hat{\gamma}_{n} satisfies

γ^n=O(nd(d+1)r+d(d+1)log(n)).\hat{\gamma}_{n}=O(n^{\frac{d(d+1)}{r+d(d+1)}}\log(n)).

Thus, when r>d(d+1)r>d(d+1), the accumulate regret is sub-linear respect to nn, with sufficiently small eεe_{\varepsilon}.

5 Evaluation

In this section, we first experimentally demonstrate AIRBO’s ability to model uncertain inputs of arbitrary distributions, then validate the Nyström-based inference acceleration for GP posterior, followed by experiments on robust optimization of synthetic functions and real-world benchmark.

5.1 Robust Surrogate

Modeling arbitrary uncertain inputs: We demonstrate MMDGP’s capabilities by employing an RKHS function as the black-box function and randomly selecting 10 samples from its input domain. Various types of input randomness are introduced into the observation and produce training datasets of 𝒟={(xi,f(xi+δi))|δiPxi}i=110\mathcal{D}=\{(x_{i},f(x_{i}+\delta_{i}))|\delta_{i}\sim P_{x_{i}}\}_{i=1}^{10} with different PxP_{x} configurations. Figure 2(a) and 2(b) compare the modeling results of a conventional GP and MMDGP under a Gaussian input uncertainty Px=𝒩(0,0.012)P_{x}=\mathcal{N}(0,0.01^{2}). We observe that the GP model appears to overfit the observed samples without recognizing the input uncertainty, whereas MMDGP properly incorporates the input randomness into its posterior.

Refer to caption
(a) GP posterior with a Gaussian input uncertainty P=𝒩(0,0.01)P=\mathcal{N}(0,0.01).
Refer to caption
(b) MMDGP posterior with an input uncertainty P=𝒩(0,0.01)P=\mathcal{N}(0,0.01).
Refer to caption
(c) MMDGP posterior with a variance-changing beta distribution.
Refer to caption
(d) MMDGP posterior with a Chi-squared distribution of changing DoF.
Figure 2: Modeling results under different types of input uncertainties.

To further examine our model’s ability under complex input uncertainty, we design the input distribution to follow a beta distribution with input-dependent variance: Px=beta(α=0.5,β=0.5,σ=0.9(sin4πx+1))P_{x}=beta(\alpha=0.5,\beta=0.5,\sigma=0.9(\sin{4\pi x}+1)). The MMDGP posterior is shown in Figure 2(c). As the input variance σ\sigma changes along xx, inputs from the left and right around a given location xix_{i} yield different MMD distances, resulting in an asymmetric posterior (e.g., around x=0.05x=0.05 and x=0.42x=0.42). This suggests that MMDGP can precisely model the multimodality and asymmetry of the input uncertainty.

Moreover, we evaluated MMDGP using a step-changing Chi-squared distribution Px=χ2(g(x),σ=0.01)P_{x}=\chi^{2}(g(x),\sigma=0.01), where g(x)=0.5g(x)=0.5 if x[0,0.6]x\in[0,0.6], and g(x)=7.0g(x)=7.0 otherwise. This abrupt change in g(x)g(x) significantly alters the input distribution from a sharply peaked distribution to a flat one with a long tail. Figure 2(d) illustrates that our model can accurately capture this distribution shape variation, as evidenced by the sudden posterior change around x=0.6x=0.6. This demonstrates our model can thoroughly quantify the characteristics of complex input uncertainties.

Comparing with the other surrogate models: We also compare our model with the other surrogate models under the step-changing Chi-squared input distribution. The results are reported in Figure 7 and they demonstrate our model outperforms obviously under such a complex input uncertainty (see Appendix D.1 for more details)

Refer to caption
(a) The empirical MMD and covariance values between two beta distributions PP and QQ as QQ moves away from PP.
Refer to caption
(b) The noisy posterior derived from a sampling size of 20 (upper) traps the acq. optimizer at x=0x=0 (lower).
Refer to caption
(c) The posterior becomes smoother with a sampling size of 100 and acq. optimizer can easily identify the optima.
Refer to caption
(d) The Nyström estimator with less memory consumption also produces a smooth posterior that is easy to optimize.
Figure 3: The posterior derived from the empirical and Nyström MMD approximators with varying sampling sizes.

5.2 Accelerating the Posterior Inference

Estimation variance of MMD: We first examine the variance of MMD estimation by employing two beta distributions P=beta(α=0.4,β=0.2,σ=0.1)P=beta(\alpha=0.4,\beta=0.2,\sigma=0.1) and Q=beta(α=0.4,β=0.2,σ=0.1)+cQ=beta(\alpha=0.4,\beta=0.2,\sigma=0.1)+c, where cc is an offset value. Figure 3(a) shows the empirical MMDs computed via Eq. 8 with varying sampling sizes as QQ moves away from PP. We find that a sampling size of 20 is inadequate, leading to high estimation variance, and increasing the sampling size to 100 stabilizes the estimation.

We further utilize this beta distribution PP as the input distribution and derive the MMDGP posterior via empirical estimator in Figure 3(b). Note that the MMD instability caused by inadequate sampling subsequently engenders a fluctuating posterior and culminates in a noisy acquisition function, which prevents the acquisition optimizer (e.g., L-BFGS-B in this experiment) from identifying the optima. Although Figure 3(c) shows that this issue can be mitigated by using more samples during empirical MMD estimation, it is crucial to note that a larger sampling size significantly increases GPU memory usage because of its quadratic space complexity of O(MNm2)O(MNm^{2}) (MM and NN are the sample number of training and testing, mm is the sampling size for MMD estimation). This limitation severely hinders parallel inference for multiple samples and slows the overall speed of posterior computation.

Table 1 summarizes the inference time of MMDGP posteriors at 512 samples with different sampling sizes. We find that, for beta distribution defined in this experiment, the Nyström MMD estimator with a sampling size of 100 and sub-sampling size of 10 already delivers a comparable result to the empirical estimator with 100 samples (as seen in the acquisition plot of Figure 3(d)). Also, the inference time is reduced from 8.117 to 0.78 seconds by enabling parallel computation for more samples. For the cases that require much more samples for MMD estimation (e.g., the input distribution is quite complex or high-dimensional), this Nyström-based acceleration can have a more pronounced impact.

Table 1: Performance of Posterior inference for 512 samples.
Method Sampling Size Sub-sampling Size Inference Time (seconds) Batch Size (samples)
Empirical 20 - 1.143 ±\pm 0.083 512
Empirical 100 - 8.117 ±\pm 0.040 128
Empirical 1000 - 840.715 ±\pm 2.182 1
Nystrom 100 10 0.780 ±\pm 0.001 512
Nystrom 1000 100 21.473 ±\pm 0.984 128

Effect of Nyström estimator on optimization: To investigate the effect of Nyström estimator on optimization, we also perform an ablation study in Appendix D.2, the results in Figure 8 suggest that Nyström estimator slightly degrades the optimization performance but greatly improves the inference efficiency.

5.3 Robust Optimization

Experiment setup: To experimentally validate AIRBO’s performance, we implement our algorithm 111The code will be available on https://github.com/huawei-noah/HEBO, and more implementation details can be found in Appendix C.1. based on BoTorch [2] and employ a linear combination of multiple rational quadratic kernels [6] to compute the MMD as Eq. 9. We compare our algorithm with several baselines: 1) uGP-UCB [26] is a closely related work that employs an integral kernel to model the various input distributions. It has a quadratic inference complexity of O(MNm2)O(MNm^{2}), where MM and NN are the sample numbers of the training and testing set, and mm indicates the sampling size of the integral kernel. 3)GP-UCB is the standard GP with UCB acquisition, which represents a broad range of existing methods that focus on non-robust optimization. 3) SKL-UCB employs symmetric Kullback-Leibler divergence to measure the distance between the uncertain inputs [20]. Its closed-form solution only exists if the input distributions are the Gaussians. 4) ERBF-UCB is the robust GP with the expected Radial Basis Function kernel proposed in [13]. It computes the expected kernel under input distribution using the Gaussian integrals. Assuming the input distributions are sub-Gaussians, this method can efficiently find the robust optimum. Since all the methods use UCB acquisition, we simply distinguish them by their surrogate names in the following tests.

Refer to caption
(a) The RKHS function
Refer to caption
(b) Robust regrets of RKHS function.
Refer to caption
(c) The double-peak function
Refer to caption
(d) Robust regrets of double-peak function
Figure 4: Robust optimization results on synthetic functions.

At the end of the optimization, each algorithm needs to decide a final outcome xnrx_{n}^{r}, perceived to be the robust optimum under input uncertainty at step nn. For a fair comparison, we employ the same outcome policy across all the algorithms: xnr=argmaxx𝒟nμ^(x)x_{n}^{r}=\operatorname*{arg\,max}_{x\in\mathcal{D}_{n}}\hat{\mu}_{*}(x), where μ^(x)\hat{\mu}_{*}(x) is the posterior mean of robust surrogate at xx and 𝒟n={(xi,f(xi+δi))|δiPxi}\mathcal{D}_{n}=\{(x_{i},f(x_{i}+\delta_{i}))|\delta_{i}\sim P_{x_{i}}\} are the observations so far. The optimization performance is measured in terms of robust regret as follows:

r(xnr)=𝔼δPx[f(x+δ)]𝔼δPxnr[f(xnr+δ)],r(x_{n}^{r})=\mathbb{E}_{\delta\sim P_{x^{*}}}[f(x^{*}+\delta)]-\mathbb{E}_{\delta\sim P_{x_{n}^{r}}}[f(x_{n}^{r}+\delta)], (17)

where xx^{*} is the global robust optimum and xnrx_{n}^{r} represents the outcome point at step nn. For each algorithm, we repeat the optimization process 12 times and compare the average robust regret.

1D RKHS function: We begin the optimization evaluation with an RKHS function that is widely used in previous BO works [1, 24, 10]. Figure 4(a) shows its exact global optimum resides at x=0.892x=0.892 while the robust optimum is around x=0.08x=0.08 when the inputs follow a Gaussian distribution 𝒩(0,0.012)\mathcal{N}(0,0.01^{2}). According to Figure 4(b), all the robust BO methods work well with Gaussian uncertain inputs and efficiently identify the robust optimum, but the GP-UCB stagnates at a local optimum due to its neglect of input uncertainty. Also, we notice the regret of our method decrease slightly slower than uGP works in this low-dimensional and Gaussian-input case, but later cases with higher dimension and more complex distribution show our method is more stable and efficient.

1D double-peak function: To test with more complex input uncertainty, we design a blackbox function with double peaks and set the input distribution to be a multi-modal distribution Px=beta(α=0.4,β=0.2,σ=0.1)P_{x}=beta(\alpha=0.4,\beta=0.2,\sigma=0.1). Figure 4(c) shows the blackbox function (black solid line) and the corresponding function expectations estimated numerically via sampling from the input distribution (i.e., the colored lines). Note the true robust optimum is around x=0.251x=0.251 under the beta distribution, but an erroneous location at x=0.352x=0.352 may be determined if the input uncertainty is incorrectly presumed to be Gaussian. This explains the results in Figure 4(d): the performance of SKL-UCB and ERBF-UCB are sub-optimal due to their misidentification of inputs as Gaussian variables, while our method accurately quantifies the input uncertainty and outperforms the others.

10D bumped-bowl function: we also extend our evaluation to a 10D bumped-bowl function [27] under a concatenated circular distribution. Figure 9 demonstrates AIRBO scales efficiently to high dimension and outperforms the others under complex input uncertainty(see Appendix D.3).

Refer to caption
(a) Contour of the robot push world
Refer to caption
(b) Robust regrets of different algorithms
Figure 5: Robust optimization of the robot push problem.
Refer to caption
Figure 6: The robot’s initial locations and push times found by different algorithms

Robust robot pushing: To evaluate AIRBO in a real-world problem, we employ a robust robot pushing benchmark from [31], in which a ball is placed at the origin point of a 2D space and a robot learns to push it to a predefined target location (gx,gy)(g_{x},g_{y}). This benchmark takes a 3-dimensional input (rx,ry,rt)(r_{x},r_{y},r_{t}), where rx,ry[5,+5]r_{x},r_{y}\in[-5,+5] are the 2D coordinates of the initial robot location and rt[0,30]r_{t}\in[0,30] controls the push duration. We set four targets in separate quadrants, i.e., g1=(3,3),g2=(3,3),g3=(4.3,4.3)g1=(-3,-3),g_{2}=(-3,3),g_{3}=(4.3,4.3), and a “twin” target at g3=(5.1,3.0)g_{3}^{\prime}=(5.1,3.0), and describe the input uncertainty via a two-component Gaussian Mixture Model (defined in Appendix D.4). Following [7, 10], this blackbox benchmark outputs the minimum distance to these 4 targets under squared and linear distances: loss=min(d2(g1,l),d(g2,l),d(g3,l),d(g3,l))loss=\min(d^{2}(g_{1},l),d(g_{2},l),d(g_{3},l),d(g_{3}^{\prime},l)), where d(gi,l)d(g_{i},l) is the Euclidean distance between the ball’s ending location ll and the ii-th target. This produces a loss landscape as shown in Figure 5(a). Note that g2g_{2} is a more robust target than g1g_{1} because of its linear-form distance while pushing the ball to quadrant I is the best choice as the targets, g3g_{3} and g3g_{3}^{\prime}, match the dual-mode pattern of the input uncertainty. According to Figure 5(b), our method obviously outperforms the others because it efficiently quantifies the multimodal input uncertainty. This can be further evidenced by the push configurations found by different algorithms in Figure 6, in which each dot represents the robot’s initial location and its color represents the push duration. We find that AIRBO successfully recognizes the targets in quadrant I as an optimal choice and frequently pushes from quadrant III to quadrant I. Moreover, the pushes started close to the origin can easily go far away under input variation, so our method learns to push the ball from a corner with a long push duration, which is more robust in this case.

6 Discussion and Conclusion

In this work, we generalize robust Bayesian Optimization to an uncertain input setting. The weight-space interpretation of GP inspires us to empower the GP kernel with MMD and build a robust surrogate for uncertain inputs of arbitrary distributions. We also employ the Nyström approximation to boost the posterior inference and provide theoretical regret bound under approximation error. The experiments on synthetic blackbox function and benchmarks demonstrate our method can handle various input uncertainty and achieve state-of-the-art optimization performance.

There are several interesting directions that worth to explore: though we come to current MMD-based kernel from the weight-space interpretation of GP and the RKHS realization of MMD, our kernel design exhibits a deep connection with existing works on kernel over probability measures [22, 11]. Along this direction, as our theoretic regret analysis in Section 4 does not assume any particular form of kernel and the Nyström acceleration can also be extended to the other kernel computation, it is possible that AIRBO can be further generalized to a more rich family of kernels. Moreover, the MMD used in our kernel is by no means limited to its RKHS realization. In fact, any function class \mathcal{F} that comes with uniform convergence guarantees and is sufficiently rich can be used, which renders different realizations of MMD. With proper choice of function class \mathcal{F}, MMD can be expressed as the Kolmogorov metric or other Earth-mover distances [15]. It is also interesting to extend AIRBO with the other IPMs.

7 Acknowledgements

We sincerely thank Yanbin Zhu and Ke Ma for their help on formulating the problem. Also, a heartfelt appreciation goes to Lu Kang for her constant encouragement and support throughout this work.

References

  • [1] John-Alexander M Assael, Ziyu Wang, Bobak Shahriari and Nando Freitas “Heteroscedastic treed bayesian optimisation” In arXiv preprint arXiv:1410.7172, 2014
  • [2] Maximilian Balandat et al. “BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization” In Advances in Neural Information Processing Systems 33, 2020 URL: http://arxiv.org/abs/1910.06403
  • [3] Justin J Beland and Prasanth B Nair “Bayesian Optimization Under Uncertainty” In Advances in neural information processing systems, 2017, pp. 5
  • [4] James Bergstra, Rémi Bardenet, Yoshua Bengio and Balázs Kégl “Algorithms for hyper-parameter optimization” In Advances in neural information processing systems 24, 2011
  • [5] Felix Berkenkamp, Andreas Krause and Angela P Schoellig “Bayesian optimization with safety constraints: safe and automatic parameter tuning in robotics” In Machine Learning Springer, 2021, pp. 1–35
  • [6] Mikołaj Bińkowski, Danica J. Sutherland, Michael Arbel and Arthur Gretton “Demystifying MMD GANs” arXiv, 2021 DOI: 10.48550/arXiv.1801.01401
  • [7] Ilija Bogunovic, Jonathan Scarlett, Stefanie Jegelka and Volkan Cevher “Adversarially Robust Optimization with Gaussian Processes” In Advances in Neural Information Processing Systems 31 Curran Associates, Inc., 2018
  • [8] Antoine Chatalic, Nicolas Schreuder, Alessandro Rudi and Lorenzo Rosasco “Nyström Kernel Mean Embeddings” In International Conference on Machine Learning, 2022
  • [9] Sayak Ray Chowdhury and Aditya Gopalan “On Kernelized Multi-armed Bandits” In International Conference on Machine Learning, 2017
  • [10] Ryan B. Christianson and Robert B. Gramacy “Robust Expected Improvement for Bayesian Optimization” arXiv, 2023 arXiv:2302.08612
  • [11] Andreas Christmann and Ingo Steinwart “Universal Kernels on Non-Standard Input Spaces” In NIPS, 2010 URL: https://api.semanticscholar.org/CorpusID:16968759
  • [12] Alexander I Cowen-Rivers et al. “HEBO: pushing the limits of sample-efficient hyper-parameter optimisation” In Journal of Artificial Intelligence Research 74, 2022, pp. 1269–1349
  • [13] Patrick Dallaire, Camille Besse and Brahim Chaib-draa “An Approximate Inference with Gaussian Process to Latent Functions from Uncertain Data” In Neurocomputing 74.11, Adaptive Incremental Learning in Neural Networks, 2011, pp. 1945–1955 DOI: 10.1016/j.neucom.2010.09.024
  • [14] Arthur Gretton, Dougal Sutherland and Wittawat Jitkrittum “Interpretable comparison of distributions and models” In Advances in Neural Information Processing Systems [Tutorial], 2019
  • [15] Arthur Gretton et al. “A kernel two-sample test” In The Journal of Machine Learning Research 13.1 JMLR. org, 2012, pp. 723–773
  • [16] Trevor Hastie, Robert Tibshirani, Jerome H Friedman and Jerome H Friedman “The elements of statistical learning: data mining, inference, and prediction” Springer, 2009
  • [17] Thomas Kühn “Eigenvalues of integral operators with smooth positive definite kernels” In Archiv der Mathematik 49, 1987, pp. 525–534 URL: https://api.semanticscholar.org/CorpusID:121372638
  • [18] Ruben Martinez-Cantin, Nando Freitas, Arnaud Doucet and José A Castellanos “Active policy learning for robot planning and exploration under uncertainty.” In Robotics: Science and systems 3, 2007, pp. 321–328
  • [19] Seyedali Mirjalili and Andrew Lewis “Obstacles and difficulties for robust benchmark problems: A novel penalty-based robust optimisation method” In Information Sciences 328 Elsevier, 2016, pp. 485–509
  • [20] Pedro Moreno, Purdy Ho and Nuno Vasconcelos “A Kullback-Leibler Divergence Based Kernel for SVM Classification in Multimedia Applications” In Advances in Neural Information Processing Systems 16 MIT Press, 2003
  • [21] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur and Bernhard Schölkopf “Kernel Mean Embedding of Distributions: A Review and Beyond” In Foundations and Trends® in Machine Learning 10.1-2, 2017, pp. 1–141 DOI: 10.1561/2200000060
  • [22] Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo and Bernhard Schölkopf “Learning from Distributions via Support Measure Machines” In NIPS, 2012
  • [23] Alfred Müller “Integral probability metrics and their generating classes of functions” In Advances in applied probability 29.2 Cambridge University Press, 1997, pp. 429–443
  • [24] Jose Nogueira, Ruben Martinez-Cantin, Alexandre Bernardino and Lorenzo Jamone “Unscented Bayesian Optimization for Safe Robot Grasping” In 2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) Daejeon, South Korea: IEEE, 2016, pp. 1967–1972 DOI: 10.1109/IROS.2016.7759310
  • [25] Rafael Oliveira, Lionel Ott and Fabio Ramos “Bayesian Optimisation under Uncertain Inputs” In Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics PMLR, 2019, pp. 1177–1184
  • [26] Rafael Oliveira, Lionel Ott and Fabio Ramos “Bayesian optimisation under uncertain inputs”, 2019 arXiv:1902.07908
  • [27] Nicholas D. Sanders, Richard M. Everson, Jonathan E. Fieldsend and Alma A.. Rahat “Bayesian Search for Robust Optima” arXiv, 2021 arXiv:1904.11416
  • [28] Bobak Shahriari et al. “Taking the Human Out of the Loop: A Review of Bayesian Optimization” In Proceedings of the IEEE 104.1, 2016, pp. 148–175 DOI: 10.1109/JPROC.2015.2494218
  • [29] Jasper Snoek, Hugo Larochelle and Ryan P Adams “Practical bayesian optimization of machine learning algorithms” In Advances in neural information processing systems 25, 2012
  • [30] Niranjan Srinivas, Andreas Krause, Sham M. Kakade and Matthias W. Seeger “Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design” In International Conference on Machine Learning, 2009 URL: https://api.semanticscholar.org/CorpusID:59031327
  • [31] Zi Wang and Stefanie Jegelka “Max-value entropy search for efficient Bayesian optimization” In International Conference on Machine Learning, 2017, pp. 3627–3635 PMLR
  • [32] Colin White, Willie Neiswanger and Yash Savani “Bananas: Bayesian optimization with neural architectures for neural architecture search” In Proceedings of the AAAI Conference on Artificial Intelligence 35.12, 2021, pp. 10293–10301
  • [33] Christopher Williams and Matthias Seeger “Using the Nyström Method to Speed Up Kernel Machines” In Advances in Neural Information Processing Systems 13 MIT Press, 2000
  • [34] Christopher KI Williams and Carl Edward Rasmussen “Gaussian processes for machine learning” MIT press Cambridge, MA, 2006

Appendix A Nyström Estimator Error Bound

Nyström estimator can easily approximate the kernel mean embedding ψp1,ψp2\psi_{p_{1}},\psi_{p_{2}} as well as the MMD distance between two distribution density p1p_{1} and p2p_{2}. We need first assume the boundedness of the feature map to the kernel kk:

Assumption 2.

There exists a positive constant KK\leq\infty such that supx𝒳ϕ(x)K\sup_{x\in\mathcal{X}}\|\phi(x)\|\leq K

The true MMD distance between p1p_{1} and p2p_{2} is denoted as MMD(p1,p2)\text{MMD}(p_{1},p_{2}). The estimated MMD distance when using a Nyström sample size mim_{i}, sub-sample size hih_{i} for pip_{i} respectively, is denoted as MMD(pi,mi,hi)\text{MMD}_{(p_{i},m_{i},h_{i})}. Then the error

Err(pi,mi,hi):=|MMD(p1,p2)MMD(pi,mi,hi)|\text{Err}_{(p_{i},m_{i},h_{i})}:=|\text{MMD}(p_{1},p_{2})-\text{MMD}_{(p_{i},m_{i},h_{i})}|

and now we have the lemma from Theorem 5.1 in [8]

Lemma 1.

Let Assumption 2 hold. Furthermore, assume that for i1,2i\in{1,2}, the data points X1i,,XmiiX_{1}^{i},\cdots,X_{m_{i}}^{i} are drawn i.i.d. from the distribution ρi\rho_{i} and that himih_{i}\leq m_{i} sub-samples X~1i,,X~hii\tilde{X}_{1}^{i},\cdots,\tilde{X}_{h_{i}}^{i} are drawn uniformly with replacement from the dataset {X1i,,Xmii}.\{X_{1}^{i},\cdots,X_{m_{i}}^{i}\}. Then, for any δ(0,1)\delta\in(0,1), it holds with probability at least 12δ1-2\delta

Err(pi,mi,hi)i=1,2(c1mi+c2hi+log(hi/δ)hi𝒩pi(12K2log(hi/δ)hi)),\text{Err}_{(p_{i},m_{i},h_{i})}\leq\sum_{i=1,2}\left(\frac{c_{1}}{\sqrt{m_{i}}}+\frac{c_{2}}{h_{i}}+\frac{\sqrt{\log(h_{i}/\delta)}}{h_{i}}\sqrt{\mathcal{N}^{p_{i}}(\frac{12K^{2}\log(h_{i}/\delta)}{h_{i}})}\right),

provided that, for i{1,2}i\in\{1,2\},

himax(67,12K2Ci()1)log(hi/δ)h_{i}\geq\max(67,12K^{2}\|C_{i}\|^{-1}_{\mathcal{L}(\mathcal{H})})\log(h_{i}/\delta)

where c1=2K2log(6/δ),c2=43Klog(12/δ)c_{1}=2K\sqrt{2\log(6/\delta)},c_{2}=4\sqrt{3}K\log(12/\delta) and c4=6Klog(12/δ)c_{4}=6K\sqrt{\log(12/\delta)}. The notation 𝒩pi\mathcal{N}^{p_{i}} denotes the effective dimension associated to the distribution pkp_{k}.

Specifically, when the effective dimension 𝒩\mathcal{N} satisfies, for some c0c\geq 0,

  • either 𝒩ρi(σ2)cσ2γ\mathcal{N}^{\rho_{i}}(\sigma^{2})\leq c\sigma^{2^{-\gamma}} for some γ(0,1)\gamma\in(0,1),

  • or 𝒩ρi(σ2)log(1+c/σ2)/β,\mathcal{N}^{\rho_{i}}(\sigma^{2})\leq\log(1+c/\sigma^{2})/\beta, for some β>0\beta>0.

Then, choosing the subsample size mm to be

  • hi=mi1/(2γ)log(mi/δ)h_{i}=m_{i}^{1/(2-\gamma)}\log(m_{i}/\delta) in the first case

  • or hi=milog(mimax(1/δ,c/(6K2))h_{i}=\sqrt{m_{i}}\log(\sqrt{m_{i}}\max(1/\delta,c/(6K^{2})) in the second case,

we get Err(ρi,mi,hi)=O(1/mi)\text{Err}_{(\rho_{i},m_{i},h_{i})}=O(1/\sqrt{m_{i}})

Appendix B Proofs of Section 4

B.1 Exact Kernel Uncertainty GP Formulating

Following the same notation in Section 4, now we can construct a Gaussian process 𝒢𝒫(0,k^)\mathcal{GP}(0,\hat{k}) modeling over 𝒫\mathcal{P}. This 𝒢𝒫\mathcal{GP} model can then be applied to learn f^\hat{f} from a given set of observations 𝒟n={(Pi,yi)}i=1n.\mathcal{D}_{n}=\{(P_{i},y_{i})\}_{i=1}^{n}. Under zero mean condition, the value of f^(P)\hat{f}(P_{\ast}) for a given P𝒫P_{\ast}\in\mathcal{P} follows a Gaussian posterior distribution with

μ^n(P)\displaystyle\hat{\mu}_{n}(P_{\ast}) =𝐤^n(P)T(𝐊^n+σ2𝐈)1𝐲n\displaystyle=\hat{\bf k}_{n}(P_{\ast})^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf y}_{n} (18)
σ^n2(P)\displaystyle\hat{\sigma}_{n}^{2}(P_{\ast}) =k^(P,P)𝐤^n(P)T(𝐊^n+σ2𝐈)1𝐤^n(P),\displaystyle=\hat{k}(P_{\ast},P_{\ast})-\hat{\bf k}_{n}(P_{\ast})^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}\hat{\bf k}_{n}(P_{\ast}), (19)

where 𝐲n:=[y1,,yn]T{\bf y}_{n}:=[y_{1},\cdots,y_{n}]^{T}, 𝐤^n(P):=[k^(P,P1),,k^(P,Pn)]T\hat{\bf k}_{n}(P_{\ast}):=[\hat{k}(P_{\ast},P_{1}),\cdots,\hat{k}(P_{\ast},P_{n})]^{T} and [𝐊^n]ij=k^(Pi,Pj)[\hat{\bf K}_{n}]_{ij}=\hat{k}(P_{i},P_{j}).

Now we restrict our Gaussian process in the subspace 𝒫𝒳={Px,x𝒳}𝒫\mathcal{P}_{\mathcal{X}}=\{P_{x},x\in\mathcal{X}\}\subset\mathcal{P}. We assume the observation yi=f(xi)+ζiy_{i}=f(x_{i})+\zeta_{i} with the noise ζi\zeta_{i}. The input-induced noise is defined as Δfpxi:=f(xi)𝔼Pxi[f]=f(xi)f^(Pxi)\Delta f_{p_{x_{i}}}:=f(x_{i})-\mathbb{E}_{P_{x_{i}}}[f]=f(x_{i})-\hat{f}(P_{x_{i}}). Then the total noise is yi𝔼Pxi[f]=ζi+Δfpxiy_{i}-\mathbb{E}_{P_{x_{i}}}[f]=\zeta_{i}+\Delta f_{p_{x_{i}}}. To formulate the regret bounds, we introduce the information gain and estimated information gain given any {Pt}t=1n𝒫\{P_{t}\}_{t=1}^{n}\subset\mathcal{P}:

I^(𝐲n;𝐟^n|{Pt}t=1n):=12lndet(𝐈+σ2𝐊^n),\displaystyle\hat{I}({\bf y}_{n};\hat{\bf f}_{n}|\{P_{t}\}_{t=1}^{n}):=\frac{1}{2}\ln\det({\bf I}+\sigma^{-2}\hat{\bf K}_{n}), (20)
I~(𝐲n;𝐟^n|{Pt}t=1n):=12lndet(𝐈+σ2𝐊~n),\displaystyle\tilde{I}({\bf y}_{n};\hat{\bf f}_{n}|\{P_{t}\}_{t=1}^{n}):=\frac{1}{2}\ln\det({\bf I}+\sigma^{-2}\tilde{\bf K}_{n}), (21)

and the maximum information gain is defined as γ^n:=sup𝒫𝒳;||=nI^(𝐲n;𝐟^n|).\hat{\gamma}_{n}:=\sup_{\mathcal{R}\in\mathcal{P}_{\mathcal{X}};|\mathcal{R}|=n}\hat{I}({\bf y}_{n};\hat{\bf f}_{n}|\mathcal{R}). Here 𝐟^n:=[f^(p1),,f^(pn)]T\hat{\bf f}_{n}:=[\hat{f}(p_{1}),\cdots,\hat{f}(p_{n})]^{T}.

We define the sub-Gaussian condition as follows:

Definition 1.

For a given σξ>0\sigma_{\xi}>0, a real-valued random variable ξ\xi is said to be σξ\sigma_{\xi}-sub-Gaussian if:

λ,𝔼[eλξ]eλ2σξ2/2\forall\lambda\in\mathbb{R},\mathbb{E}[e^{\lambda\xi}]\leq e^{\lambda^{2}\sigma_{\xi}^{2}/2}

Now we can state the lemma for bounding the uncertain-inputs regret of exact kernel evaluations, which is originally stated in Theorem 5 in [26].

Lemma 2.

Let δ(0,1),fk,\delta\in(0,1),f\in\mathcal{H}_{k}, and the corresponding f^k^b\|\hat{f}\|_{\hat{k}}\leq b. Suppose the observation noise ζi=yif(xi)\zeta_{i}=y_{i}-f(x_{i}) is conditionally σζ\sigma_{\zeta}-sub-Gaussian. Assume that both kk and PxP_{x} satisfy the conditions for ΔfPx\Delta f_{P_{x}} to be σE\sigma_{E}-sub-Gaussian, for a given σE>0\sigma_{E}>0. Then, we have the following results:

  • The following holds for all x𝒳x\in\mathcal{X} and t1t\geq 1:

    |μ^n(Px)f^(Px)|(b+σE2+σζ22(I^(𝐲n;𝐟^n|{Pt}t=1n)+1+ln(1/δ)))σ^n(Px)|\hat{\mu}_{n}(P_{x})-\hat{f}(P_{x})|\leq\left(b+\sqrt{\sigma_{E}^{2}+\sigma_{\zeta}^{2}}\sqrt{2\left(\hat{I}({\bf y}_{n};\hat{\bf f}_{n}|\{P_{t}\}_{t=1}^{n})+1+\ln(1/\delta)\right)}\right)\hat{\sigma}_{n}(P_{x}) (22)
  • Running with upper confidence bound (UCB) acquisition function α(x|𝒟n)=μ^n(Px)+β^nσ^n(Px)\alpha(x|\mathcal{D}_{n})=\hat{\mu}_{n}(P_{x})+\hat{\beta}_{n}\hat{\sigma}_{n}(P_{x}) where

    β^n=b+σE2+σζ22(I^(𝐲n;𝐟^n|{Pt}t=1n)+1+ln(1/δ)),\hat{\beta}_{n}=b+\sqrt{\sigma_{E}^{2}+\sigma_{\zeta}^{2}}\sqrt{2\left(\hat{I}({\bf y}_{n};\hat{\bf f}_{n}|\{P_{t}\}_{t=1}^{n})+1+\ln(1/\delta)\right)},

    and set σ2=1+2/n\sigma^{2}=1+2/n, the uncertain-inputs cumulative regret satisfies:

    R^nO(nγ^n(b+γ^n+ln(1/δ)))\hat{R}_{n}\in O(\sqrt{n\hat{\gamma}_{n}}(b+\sqrt{\hat{\gamma}_{n}+\ln(1/\delta)})) (23)

    with probability at least 1δ1-\delta.

Note that although the original theorem restricted to the case when k^(p,q)=ψP,ψQk\hat{k}(p,q)=\langle\psi_{P},\psi_{Q}\rangle_{k}, the results can be easily generated to other kernels over 𝒫\mathcal{P}, as long as its universal w.r.t C(𝒫)C(\mathcal{P}) given that 𝒳\mathcal{X} is compact and the mean map ψ\psi is injective [11, 22].

B.2 Error Estimates for Inexact Kernel Approximation

Now let us derivative the inference under the inexact kernel estimations.

Theorem 3.

Under the Assumption 1 for ε>0\varepsilon>0, let μ~n,σ~n,I~(𝐲n;𝐟^n|{Pt}t=1n)\tilde{\mu}_{n},\tilde{\sigma}_{n},\tilde{I}({\bf y}_{n};\hat{\bf f}_{n}|\{P_{t}\}_{t=1}^{n}) as defined in (13),(14),(21) respectively, and μ^n,σ^n,I^(𝐲n;𝐟^n|{Pt}t=1n)\hat{\mu}_{n},\hat{\sigma}_{n},\hat{I}({\bf y}_{n};\hat{\bf f}_{n}|\{P_{t}\}_{t=1}^{n}) as defined in (18),(19),(20). Assume maxx𝒳f(x)=M\max_{x\in\mathcal{X}}f(x)=M, and assume the observation error ζi=yif(xi)\zeta_{i}=y_{i}-f(x_{i}) satisfies |ζi|<A|\zeta_{i}|<A for all ii. Then we have the following error bound holds with probability at least 1nε1-n\varepsilon:

|μ^n(P)μ~n(P)|\displaystyle|\hat{\mu}_{n}(P_{\ast})-\tilde{\mu}_{n}(P_{\ast})| <(nσ2+n2σ4)(M+A)eε+O(eε2)\displaystyle<(\frac{n}{\sigma^{2}}+\frac{n^{2}}{\sigma^{4}})(M+A)e_{\varepsilon}+O(e_{\varepsilon}^{2}) (24)
|σ^n2(P)σ~n2(P)|\displaystyle|\hat{\sigma}_{n}^{2}(P_{\ast})-\tilde{\sigma}_{n}^{2}(P_{\ast})| <(1+nσ2)2eε+O(eε2)\displaystyle<(1+\frac{n}{\sigma^{2}})^{2}e_{\varepsilon}+O(e_{\varepsilon}^{2}) (25)
|I~(𝐲n;𝐟^n|{Pt}t=1n)I^(𝐲n;𝐟^n|{Pt}t=1n)|\displaystyle\left|\tilde{I}({\bf y}_{n};\hat{\bf f}_{n}|\{P_{t}\}_{t=1}^{n})-\hat{I}({\bf y}_{n};\hat{\bf f}_{n}|\{P_{t}\}_{t=1}^{n})\right| <n3/22σ2eε+O(eε2)\displaystyle<\frac{n^{3/2}}{2\sigma^{2}}e_{\varepsilon}+O(e_{\varepsilon}^{2}) (26)
Proof.

Denote e(P,Q)=k~(P,Q)k^(P,Q)e(P_{\ast},Q)=\tilde{k}(P_{\ast},Q)-\hat{k}(P_{\ast},Q), 𝐞n(P)=[e(P,P1),,e(P.Pn)]T{\bf e}_{n}(P_{\ast})=[e(P_{\ast},P_{1}),\cdots,e(P_{\ast}.P_{n})]^{T}, and [𝐄n]i,j=e(Pi,Pj)[{\bf E}_{n}]_{i,j}=e(P_{i},P_{j}). Now according to the matrix inverse perturbation expansion,

(X+δX)1=X1X1δXX1+O(δX2),(X+\delta X)^{-1}=X^{-1}-X^{-1}\delta XX^{-1}+O(\|\delta X\|^{2}),

we have

(𝐊^n+σ2𝐈+𝐄n)1=(𝐊^n+σ2𝐈)1(𝐊^n+σ2𝐈)1𝐄n(𝐊^n+σ2𝐈)1+O(𝐄n2),(\hat{\bf K}_{n}+\sigma^{2}{\bf I}+{\bf E}_{n})^{-1}=(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}-(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf E}_{n}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}+O(\|{\bf E}_{n}\|^{2}),

thus

μ~n(P)=\displaystyle\tilde{\mu}_{n}(P_{\ast})= (𝐤^n(P)+𝐞n(P))T(𝐊^n+σ2𝐈+𝐄n)1𝐲n\displaystyle(\hat{\bf k}_{n}(P_{\ast})+{\bf e}_{n}(P_{\ast}))^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I}+{\bf E}_{n})^{-1}{\bf y}_{n}
=\displaystyle= μ^n(P)+𝐞n(P)T(𝐊^n+σ2𝐈)1𝐲n𝐤^n(P)T(𝐊^n+σ2𝐈)1𝐄n(𝐊^n+σ2𝐈)1𝐲n\displaystyle\hat{\mu}_{n}(P_{\ast})+{\bf e}_{n}(P_{\ast})^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf y}_{n}-\hat{\bf k}_{n}(P_{\ast})^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf E}_{n}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf y}_{n}
+O(𝐄n2)+O(𝐞n(P))𝐄n)\displaystyle+O(\|{\bf E}_{n}\|^{2})+O(\|{\bf e}_{n}(P_{\ast}))\|\cdot\|{\bf E}_{n}\|)
σ~n2(P)=\displaystyle\tilde{\sigma}_{n}^{2}(P_{\ast})= σ^n2(P)+e(P,P)(𝐤^n(P)+𝐞n(P))T(𝐊^n+σ2𝐈+𝐄n)1(𝐤^n(P)+𝐞n(P))\displaystyle\hat{\sigma}_{n}^{2}(P_{\ast})+e(P_{\ast},P_{\ast})-(\hat{\bf k}_{n}(P_{\ast})+{\bf e}_{n}(P_{\ast}))^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I}+{\bf E}_{n})^{-1}(\hat{\bf k}_{n}(P_{\ast})+{\bf e}_{n}(P_{\ast}))
=\displaystyle= σ^n2(P)+e(P,P)2𝐞n(P)T(𝐊^n+σ2𝐈)1𝐤^n(P)\displaystyle\hat{\sigma}_{n}^{2}(P_{\ast})+e(P_{\ast},P_{\ast})-2{\bf e}_{n}(P)^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}\hat{\bf k}_{n}(P_{\ast})
+𝐤^n(P)T(𝐊^n+σ2𝐈)1𝐄n(𝐊^n+σ2𝐈)1𝐤^n(P)\displaystyle+\hat{\bf k}_{n}(P)^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf E}_{n}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}\hat{\bf k}_{n}(P_{\ast})
+O(𝐄n2)+O(𝐞n𝐄n)+O(𝐞n2𝐄n)\displaystyle+O(\|{\bf E}_{n}\|^{2})+O(\|{\bf e}_{n}\|\cdot\|{\bf E}_{n}\|)+O(\|{\bf e}_{n}\|^{2}\cdot\|{\bf E}_{n}\|)

Notic that the following holds with a probability at least 1nε1-n\varepsilon, according to the Assumption 1,

|𝐞n(P)T(K^n+σ2𝐈)1𝐲n|𝐞n(P)2(K^n+σ2𝐈)12𝐲n2nσ2(M+A)eε,|{\bf e}_{n}(P_{\ast})^{T}(\hat{K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf y}_{n}|\leq\|{\bf e}_{n}(P_{\ast})\|_{2}\|(\hat{K}_{n}+\sigma^{2}{\bf I})^{-1}\|_{2}\|{\bf y}_{n}\|_{2}\leq\frac{n}{\sigma^{2}}(M+A)e_{\varepsilon},
|𝐤^n(P)T(𝐊^n+σ2𝐈)1𝐄n(𝐊^n+σ2𝐈)1𝐲n|\displaystyle|\hat{\bf k}_{n}(P_{\ast})^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf E}_{n}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf y}_{n}| 𝐤^n(P)2(K^n+σ2𝐈)122𝐄n2𝐲n2\displaystyle\leq\|\hat{\bf k}_{n}(P_{\ast})\|_{2}\|(\hat{K}_{n}+\sigma^{2}{\bf I})^{-1}\|_{2}^{2}\|{\bf E}_{n}\|_{2}\|{\bf y}_{n}\|_{2}
nσ4neεn(M+A)=n2σ4(M+A),\displaystyle\leq\sqrt{n}\sigma^{-4}ne_{\varepsilon}\sqrt{n}(M+A)=\frac{n^{2}}{\sigma^{4}}(M+A),

here we use the fact that K^n\hat{K}_{n} semi-definite (which means (K^n+σ2I)12σ2\|(\hat{K}_{n}+\sigma^{2}I)^{-1}\|_{2}\leq\sigma^{-2}), k^(P,P)1\hat{k}(P_{\ast},P_{\ast})\leq 1, |yi|M+A|y_{i}|\leq M+A. Combining these results, we have that

|μ~n(P)μ^n(P)|<(nσ2+n2σ4)(M+A)eε+O(eε2),|\tilde{\mu}_{n}(P_{\ast})-\hat{\mu}_{n}(P_{\ast})|<(\frac{n}{\sigma^{2}}+\frac{n^{2}}{\sigma^{4}})(M+A)e_{\varepsilon}+O(e_{\varepsilon}^{2}),

holds with a probability at least 1nε1-n\varepsilon.

Similarly, we can conduct the same estimation to 𝐞n(P)T(𝐊^n+σ2𝐈)1𝐤^n(P){\bf e}_{n}(P)^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}\hat{\bf k}_{n}(P_{\ast}) and 𝐤^n(P)T(𝐊^n+σ2𝐈)1𝐄n(𝐊^n+σ2𝐈)1𝐤^n(P)\hat{\bf k}_{n}(P)^{T}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}{\bf E}_{n}(\hat{\bf K}_{n}+\sigma^{2}{\bf I})^{-1}\hat{\bf k}_{n}(P_{\ast}), and get |σ~n2(P)σ^n2(P)|<(1+nσ2)2eε+O(eε2)|\tilde{\sigma}_{n}^{2}(P_{\ast})-\hat{\sigma}_{n}^{2}(P_{\ast})|<(1+\frac{n}{\sigma^{2}})^{2}e_{\varepsilon}+O(e_{\varepsilon}^{2}) holds with a probability at least 1nε1-n\varepsilon.

It remains to estimate the error for estimating the information gain. Notice that, with a probability at least 1nε1-n\varepsilon,

|I~(𝐲n;𝐟^n|{pt}t=1n)I^(𝐲n;𝐟^n|{pt}t=1n)|\displaystyle\left|\tilde{I}({\bf y}_{n};\hat{\bf f}_{n}|\{p_{t}\}_{t=1}^{n})-\hat{I}({\bf y}_{n};\hat{\bf f}_{n}|\{p_{t}\}_{t=1}^{n})\right| =|12logdet(𝐈+σ2𝐊~n)det(𝐈+σ2𝐊^n)|\displaystyle=\left|\frac{1}{2}\log\frac{\det({\bf I}+\sigma^{-2}\tilde{\bf K}_{n})}{\det({\bf I}+\sigma^{-2}\hat{\bf K}_{n})}\right|
=|12logdet(𝐈(σ2𝐈+𝐊^n)1𝐄n)|\displaystyle=\left|\frac{1}{2}\log\det({\bf I}-(\sigma^{2}{\bf I}+\hat{\bf K}_{n})^{-1}{\bf E}_{n})\right|
=|12Tr(log(𝐈(σ2𝐈+𝐊^n)1𝐄n))|\displaystyle=\left|\frac{1}{2}\text{Tr}(\log({\bf I}-(\sigma^{2}{\bf I}+\hat{\bf K}_{n})^{-1}{\bf E}_{n}))\right|
=|12Tr((σ2𝐈+𝐊^n)1𝐄n)+O(𝐄n2)|\displaystyle=\left|\frac{1}{2}\text{Tr}(-(\sigma^{2}{\bf I}+\hat{\bf K}_{n})^{-1}{\bf E}_{n})+O(\|{\bf E}_{n}\|^{2})\right|
n3/22σ2eε+O(𝐄n2),\displaystyle\leq\frac{n^{3/2}}{2\sigma^{2}}e_{\varepsilon}+O(\|{\bf E}_{n}\|^{2}),

here the second equation uses the fact that det(AB1)=det(A)det(B)1\det(AB^{-1})=\det(A)\det(B)^{-1}, and the third and fourth equations use logdet(I+A)=Trlog(I+A)=Tr(AA22+)\log\det(I+A)=\text{Tr}\log(I+A)=\text{Tr}(A-\frac{A^{2}}{2}+\cdots). The last inequality follows from the fact

Tr(σ2𝐈+𝐊^n)1𝐄n)(σ2𝐈+𝐊^n)1F𝐄nFn3/2σ2eε\text{Tr}(\sigma^{2}{\bf I}+\hat{\bf K}_{n})^{-1}{\bf E}_{n})\leq\|(\sigma^{2}{\bf I}+\hat{\bf K}_{n})^{-1}\|_{F}\|{\bf E}_{n}\|_{F}\leq n^{3/2}\sigma^{-2}e_{\varepsilon}

and 𝐊^n\hat{\bf K}_{n} is semi-definite. ∎

With the uncertainty bound given by Lemma 3, let us prove that under inexact kernel estimations, the posterior mean is concentrated around the unknown reward function f^\hat{f}

Theorem 4.

Under the former setting as in Theorem 3, with probability at least 1δnε1-\delta-n\varepsilon, let σν=σζ2+σE2\sigma_{\nu}=\sqrt{\sigma_{\zeta}^{2}+\sigma_{E}^{2}}, taking σ=1+2n\sigma=1+\frac{2}{n}, the following holds for all x𝒳x\in\mathcal{X}:

|μ~n(Px)f^(Px)|\displaystyle|\tilde{\mu}_{n}(P_{x})-\hat{f}(P_{x})|\leq βnσ~n(Px)+βn(1+n)eε1/2+(n+n2)(M+A)eε,\displaystyle\beta_{n}\tilde{\sigma}_{n}(P_{x})+\beta_{n}(1+n)e_{\varepsilon}^{1/2}+\left(n+n^{2}\right)(M+A)e_{\varepsilon}, (27)
where βn=\displaystyle\text{where }\beta_{n}= (b+σν2(γ^nln(δ)+1))\displaystyle\left(b+\sigma_{\nu}\sqrt{2(\hat{\gamma}_{n}-\ln(\delta)+1)}\right)
Proof.

According to Lemma 2, equation (22), we have

|μ^n(Px)f^(Px)|β^nσ^n(Px)|\hat{\mu}_{n}(P_{x})-\hat{f}(P_{x})|\leq\hat{\beta}_{n}\hat{\sigma}_{n}(P_{x})

with

β^n=b+σν2(I^(𝐲n;𝐟^n|{Pt}t=1n)+1+ln(1/δ))βn.\hat{\beta}_{n}=b+\sigma_{\nu}\sqrt{2\left(\hat{I}({\bf y}_{n};\hat{\bf f}_{n}|\{P_{t}\}_{t=1}^{n})+1+\ln(1/\delta)\right)}\leq\beta_{n}.

Notice that

|μ~n(Px)f^(Px)||μ~n(Px)μ^n(Px)|+|μ^n(Px)f^(Px)|,|\tilde{\mu}_{n}(P_{x})-\hat{f}(P_{x})|\leq|\tilde{\mu}_{n}(P_{x})-\hat{\mu}_{n}(P_{x})|+|\hat{\mu}_{n}(P_{x})-\hat{f}(P_{x})|, (28)

We also have (25), which means

σ^n(Px)=σ^n(Px)2σ~n(Px)2+(1+n)2eεσ~n(Px)+(1+n)eε1/2,\hat{\sigma}_{n}(P_{x})=\sqrt{\hat{\sigma}_{n}(P_{x})^{2}}\leq\sqrt{\tilde{\sigma}_{n}(P_{x})^{2}+(1+n)^{2}e_{\varepsilon}}\leq\tilde{\sigma}_{n}(P_{x})+(1+n)e_{\varepsilon}^{1/2}, (29)

combining (24), (28) and (29), we finally get the result in (27).

B.3 Proofs for Theorem 1

Now we can prove our main theorem 1.

Proof of Theorem 1..

Let xx^{\ast} maximize f^(Px)\hat{f}(P_{x}) over 𝒳\mathcal{X}. Observing that at each round n1n\geq 1, by the choice of xnx_{n} to maximize the acquisition function α~(x|𝒟n1)=μ~n1(Px)+βn1σ~n1(Px)\tilde{\alpha}(x|\mathcal{D}_{n-1})=\tilde{\mu}_{n-1}(P_{x})+\beta_{n-1}\tilde{\sigma}_{n-1}(P_{x}), we have

r~n\displaystyle\tilde{r}_{n} =f^(Px)f^(Pxn)\displaystyle=\hat{f}(P_{x^{\ast}})-\hat{f}(P_{x_{n}})
μ~n1(Px)+βn1σ~n1(Px)μ~n1(Pxn)+βn1σ~n1(Pxn)+2Err(n1,eε)\displaystyle\leq\tilde{\mu}_{n-1}(P_{x^{\ast}})+\beta_{n-1}\tilde{\sigma}_{n-1}(P_{x^{\ast}})-\tilde{\mu}_{n-1}(P_{x_{n}})+\beta_{n-1}\tilde{\sigma}_{n-1}(P_{x_{n}})+2Err(n-1,e_{\varepsilon})
2βn1σ~n1(Pxn)+2Err(n1,eε).\displaystyle\leq 2\beta_{n-1}\tilde{\sigma}_{n-1}(P_{x_{n}})+2Err(n-1,e_{\varepsilon}).

Here we denote Err(n,eε):=(βn(1+n)+σ~n(Px)σνn3/4)eε1/2+(n+n2)(M+A)eε.Err(n,e_{\varepsilon}):=\left(\beta_{n}(1+n)+\tilde{\sigma}_{n}(P_{x})\sigma_{\nu}n^{3/4}\right)e_{\varepsilon}^{1/2}+\left(n+n^{2}\right)(M+A)e_{\varepsilon}. The second inequality follows from (27),

f^(Px)μ~n1(Px)βn1σ~n1(Px)+Err(n1,eε)\displaystyle\hat{f}(P_{x^{\ast}})-\tilde{\mu}_{n-1}(P_{x^{\ast}})\leq\beta_{n-1}\tilde{\sigma}_{n-1}(P_{x^{\ast}})+Err(n-1,e_{\varepsilon})
μ~n1(Pxn)f^(Pxn)βn1σ~n1(Pxn)+Err(n1,eε),\displaystyle\tilde{\mu}_{n-1}(P_{x_{n}})-\hat{f}(P_{x_{n}})\leq\beta_{n-1}\tilde{\sigma}_{n-1}(P_{x_{n}})+Err(n-1,e_{\varepsilon}),

and the third inequality follows from the choice of xnx_{n}:

μ~n1(Px)+βn1σ~n1(Px)μ~n1(Pxn)+βn1σ~n1(Pxn).\tilde{\mu}_{n-1}(P_{x^{\ast}})+\beta_{n-1}\tilde{\sigma}_{n-1}(P_{x^{\ast}})\leq\tilde{\mu}_{n-1}(P_{x_{n}})+\beta_{n-1}\tilde{\sigma}_{n-1}(P_{x_{n}}).

Thus we have

R~n=t=1nr~t2βnt=1nσ~t1(Pxt)+t=1TErr(t1,eε).\tilde{R}_{n}=\sum_{t=1}^{n}\tilde{r}_{t}\leq 2\beta_{n}\sum_{t=1}^{n}\tilde{\sigma}_{t-1}(P_{x_{t}})+\sum_{t=1}^{T}Err(t-1,e_{\varepsilon}).

From Lemma 4 in [9], we have that

t=1nσ~t1(Pxt)4(n+2)lndet(I+σ2K~n)4(n+2)(γ^n+n322eε),\sum_{t=1}^{n}\tilde{\sigma}_{t-1}(P_{x_{t}})\leq\sqrt{4(n+2)\ln\det(I+\sigma^{-2}\tilde{K}_{n})}\leq\sqrt{4(n+2)(\hat{\gamma}_{n}+\frac{n^{\frac{3}{2}}}{2}e_{\varepsilon})},

and thus

2βnt=1nσ~t1(Pxt)=O(nγ^n+nγ^n(γ^nlnδ)+n52eε+n52(γ^nlnδ)eε).2\beta_{n}\sum_{t=1}^{n}\tilde{\sigma}_{t-1}(P_{x_{t}})=O\left(\sqrt{n\hat{\gamma}_{n}}+\sqrt{n\hat{\gamma}_{n}(\hat{\gamma}_{n}-\ln\delta)}+\sqrt{n^{\frac{5}{2}}e_{\varepsilon}}+\sqrt{n^{\frac{5}{2}}(\hat{\gamma}_{n}-\ln\delta)e_{\varepsilon}}\right). (30)

On the other hand, notice that

t=1nErr(t1,eε)=O(n2(γ^nlnδ)eε+(n2+n3)eϵ),\sum_{t=1}^{n}Err(t-1,e_{\varepsilon})=O\left(n^{2}\sqrt{(\hat{\gamma}_{n}-\ln\delta)e_{\varepsilon}}+(n^{2}+n^{3})e_{\epsilon}\right), (31)

we find that the eεe_{\varepsilon} term in (30) can be controlled by in (31), thus we immediately get the result. ∎

B.4 Proofs for Theorem 2

Proof.

Define the square of the MMD distance between Px1,Px2P_{x_{1}},P_{x_{2}} as dM(x1,x2),d_{M}(x_{1},x_{2}), we have

dM(x1,x2)\displaystyle d_{M}(x_{1},x_{2})
=dk(x,x)Px1(x)Px1(x)dxdx+dk(x,x)Px2(x)Px2(x)dxdx\displaystyle=\int_{\mathbb{R}^{d}}k(x,x^{\prime})P_{x_{1}}(x)P_{x_{1}}(x^{\prime})\mathrm{d}x\mathrm{d}x^{\prime}+\int_{\mathbb{R}^{d}}k(x,x^{\prime})P_{x_{2}}(x)P_{x_{2}}(x^{\prime})\mathrm{d}x\mathrm{d}x^{\prime}
2dk(x,x)Px1(x)Px2(x)dxdx\displaystyle-2\int_{\mathbb{R}^{d}}k(x,x^{\prime})P_{x_{1}}(x)P_{x_{2}}(x^{\prime})\mathrm{d}x\mathrm{d}x^{\prime}
=d(k(xx1,xx1)+k(xx2,xx2)2k(xx1,xx2))P0(x)P0(x)dxdx.\displaystyle=\int_{\mathbb{R}^{d}}(k(x-x_{1},x^{\prime}-x_{1})+k(x-x_{2},x^{\prime}-x_{2})-2k(x-x_{1},x^{\prime}-x_{2}))P_{0}(x)P_{0}(x^{\prime})\mathrm{d}x\mathrm{d}x^{\prime}.

It is not hard to verify that dMd_{M} is shift invariant: dM(x1,x2)=dM(x1x2,0)d_{M}(x_{1},x_{2})=d_{M}(x_{1}-x_{2},0), and dMd_{M} has rr-th bounded derivatives, thus k^(x1,x2):=k^(Px1,Px2)=exp(αdM(x1,x2))\hat{k}^{\ast}(x_{1},x_{2}):=\hat{k}(P_{x_{1}},P_{x_{2}})=\exp(-\alpha d_{M}(x_{1},x_{2})) is shift invariant with rr-th bounded derivatives. Then take μ(x)\mu(x) as the Lebesgue measure over 𝒳\mathcal{X}, according to Theorem 4, [17], the integral operator Tk,μ:Tk,μf(x)=𝒳K(x,y)f(y)𝑑μ(y)T_{k,\mu}:T_{k,\mu}f(x)=\int_{\mathcal{X}}K(x,y)f(y)d\mu(y) is a symmetric compact operator in L2(𝒳,μ)L_{2}(\mathcal{X},\mu), and the spectrum of Tk,μT_{k,\mu} satisfies

λn(Tk,μ)=O(n1r/d).\lambda_{n}(T_{k,\mu})=O(n^{-1-r/d}).

Then according to Theorem 5 in [30], we have γ^n=O(nd(d+1)r+d(d+1)log(n))\hat{\gamma}_{n}=O(n^{\frac{d(d+1)}{r+d(d+1)}}\log(n)), which finish the proof. ∎

Appendix C Evaluation Details

C.1 Implementation

In our implementation of AIRBO, we design the kernel kk used for MMD estimation to be a linear combination of multiple Rational Quadratic kernels as its long tail behavior circumvents the fast decay issue of kernel [6]:

k(x,x)=ai{0.2,0.5,1,2,5}(1+(xx)22aili2)ai,k(x,x^{\prime})=\sum_{a_{i}\in\{0.2,0.5,1,2,5\}}\big{(}1+\frac{(x-x^{\prime})^{2}}{2a_{i}l_{i}^{2}}\big{)}^{-a_{i}}, (32)

where lil_{i} is a learnable lengthscale and aia_{i} determines the relative weighting of large-scale and small-scale variations.

Depending on the form of input distributions, the sampling and sub-sampling sizes for Nyström MMD estimator are empirically selected via experiments. Moreover, as the input uncertainty is already modeled in the surrogate, we employ a classic UCB-based acquisition as Eq. 5 with β=2.0\beta=2.0 and maximize it via an L-BFGS-B optimizer.

Refer to caption
(a) MMD-GP with a Nystrom estimator, in which the sampling size m=160m=160 and sub-sampling size h=10h=10.
Refer to caption
(b) uGP model that uses an integral kernel [26] and a sampling size of m=40m=40.
Refer to caption
(c) uGP with an integral kernel [26] and uses a much larger sampling size (m=160m=160).
Refer to caption
(d) Conventional noisy GP model.
Refer to caption
(e) GP model with a symmetric KL-divergence kernel [20].
Refer to caption
(f) Robust GP model with an expected RBF kernel [13]
Figure 7: Modeling performance with a step-changing Chi-squared distribution.

Appendix D More Experiments

D.1 Comparing with the Other Models

To compare the modeling performances with the other models, we design the input uncertainty to follow a step-changing Chi-squared distribution: Px=χ2(g(x),σ=0.01)P_{x}=\chi^{2}(g(x),\sigma=0.01), where g(x)=0.5g(x)=0.5 if x[0.0,0.6)x\in[0.0,0.6) and g(x)=7.0g(x)=7.0 when x[0.6,1.0]x\in[0.6,1.0]. Due to this sudden parameter change, the uncertainty at point x=0.6x=0.6 is expected to be asymmetric: 1) on its left-hand side, as the Chi-squared distribution becomes quite lean and sharp with a small value of g(x)=0.5g(x)=0.5, the distance from x=0.6x=0.6 to its LHS points, xlhs[0.0,0.6)x_{lhs}\in[0.0,0.6), are relatively large, thus their covariances are small, resulting a fast-growing uncertainty. 2)Meanwhile, when x[0.6,1.0]x\in[0.6,1.0], the g(x)g(x) suddenly increases to 7.07.0, rendering the input distribution a quite flat one with a long tail. Therefore, the distances between x=0.6x=0.6 and its RHS points become relatively small, which leads to large covariances and small uncertainties for points in [0.6,1.0][0.6,1.0]. As a result, we expect to observe an asymmetric posterior uncertainty at x=0.6x=0.6.

Several surrogate models are employed in this comparison, including:

  • MMDGP-nystrom(160/10) is our method with a sampling size m=160m=160 and sub-sampling size h=10h=10. Its complexity is O(MNmh)O(MNmh), where MM and NN are the sizes of training and test samples (Note: all the models in this experiment use the same training and testing samples for a fair comparison).

  • uGP(40) is the surrogate from [25], which employs an integral kernel with sampling size m=40m=40. Due to its O(MNm2)O(MNm^{2}) complexity, we set the sampling size m=40m=40 to ensure a similar complexity as ours.

  • uGP(160) is also the surrogate from [25] but uses a much larger sampling size (m=160m=160). Given the same training and testing samples, its complexity is 16 times higher than MMDGP-nystrom(160/10).

  • skl is a robust GP surrogate equipped with a symmetric KL-based kernel, which is described in [20].

  • ERBF [13] assumes the input uncertainty to be Gaussians and employs a close-form expected RBF kernel.

  • GP utilizes a noisy Gaussian Process model with a learnable output noise level.

According to Figure 7(a), our method, MMDGP-nystrom(160/10), can comprehensively quantify the sudden change of the input uncertainty, evidenced by its abrupt posterior change at x=0.6x=0.6. However, Figure 7(b) shows that uGP(40) with the same complexity fails to model the uncertainty correctly. We suspect this is because uGP requires much larger samples to stabilize its estimation of the integral kernel and thus can perform poorly with insufficient sample size, so we further evaluate the uGP(160) with a much larger sampling size m=160m=160 in Figure 7(c). It does successfully alleviate the issue but also results in a 16 times higher complexity. Apart from this, Figure 7(d) suggests the noisy GP model with a learnable output noise level is not aware of this uncertainty change at all as it treats the inputs as the exact values instead of random variables. Moreover, Figure 7(e) and 7(f) show that both the skl and ERBF fail in this case, this may be due to their misassumption of Gaussian input uncertainty.

D.2 Ablation Test for Nyström Approximation

In this experiment, we aim to examine the effect of Nyström approximation for optimization. To this end, we choose to optimize an RKHS function (Figure 4(a)) under a beta input distribution: Px=beta(α=0.4,β=0.2,σ=0.1)P_{x}=beta(\alpha=0.4,\beta=0.2,\sigma=0.1). Several amortized candidates include:

  • MMDGP-nystrom is our method with Nystrom approximation, in which the sampling size m=16m=16 and sub-sampling size h=9h=9. Its complexity is O(MNmh)O(MNmh), where MM and NN are the sizes of training and test samples respectively, mm is the sampling size for MMD estimation, and hh indicates the sub-sampling size during the Nystrom approximation.

  • MMDGP-raw-S does not use the Nystrom approximation but employs an empirical MMD estimator. Due to its O(MNm2)O(MNm^{2}) complexity, we set the sampling size m=12m=12 to ensure a similar complexity as the MMDGP-nystrom.

  • MMDGP-raw-L also uses an empirical MMD estimator, but with a larger sampling size (m=16m=16).

  • GP utilizes a vanilla GP with a learnable output noise level and optimizes with the upper-confidence-bound acquisition222For a fair comparison, all the methods in this test use a UCB acquisition..

Refer to caption
Figure 8: Ablation test for the Nyström approximation.

According to Figure 8, we find that 1) with sufficient computation power, the MMDGP-raw-L can obtain the best performance by using a large sample size. 2)However, with limited complexity, the performance MMDGP-raw-S degrades obviously while the MMDGP-nystrom performs much better. This suggests that the Nyström approximation can significantly improve the efficiency with a mild cost of performance degradation. 3) All the MMDGP-based methods are better than the vanilla GP-UCB.

D.3 Optimization on 10D Bumped-Bowl Problem

To further evaluate AIRBO’s optimization performance on the high-dimensional problem, we employ a 10-dimensional bumped bowl function from [27, 19]:

f(𝒙)=g(𝒙1:2)h(𝒙3:),where {g(𝒙)=2log(0.8𝒙2+e10𝒙2)+2.54h(𝒙)=id5xi2+1f(\boldsymbol{x})=g(\boldsymbol{x}_{1:2})h(\boldsymbol{x}_{3:}),\text{where }\begin{cases}&g(\boldsymbol{x})=2\log{(0.8\|\boldsymbol{x}\|^{2}+e^{-10\|\boldsymbol{x}\|^{2}})}+2.54\\ &h(\boldsymbol{x})=\sum_{i}^{d}5x_{i}^{2}+1\end{cases} (33)

Here, xix_{i} is the i-th dimension of 𝒙\boldsymbol{x}, 𝒙1:2\boldsymbol{x}_{1:2} represents the first 2 dimensions for the variable, and 𝒙3:\boldsymbol{x}_{3:} indicates the rest dimensions. The input uncertainty is designed to follow a concatenated distribution of a 2D circular distribution(r=0.5r=0.5) and a multivariate normal distribution with a zero mean and diagonal covariance of 0.01.

Figure 9 shows the mean and std values of the optimization regrets. We note that 1)when it comes to a high-dimensional problem and complex input distribution, the misassumption of Gaussian input uncertainty renders the skl and ERBF fail to locate the robust optimum and get stuck at local optimums. 2)Our method outperforms the others and can find the robust optimum efficiently and stably, while the uGP with a similar inference cost suffers the instability caused by insufficient sampling and stumbles over iterations, which can be evidenced by its high std values of optimization regret.

Refer to caption
Figure 9: Optimization regret on 10D bumped-bowl problem.
Refer to caption
Figure 10: The input GMM distribution.
Refer to caption
Figure 11: Simulation results of the push configurations found by different algorithms.

D.4 Robust Robot Pushing

This benchmark is based on a Box2D simulator from [31], where our objective is to identify a robust push configuration, enabling a robot to push a ball to predetermined targets under input randomness. In our experiment, we simplify the task by setting the push angle to ra=arctanryrxr_{a}=\arctan{\frac{r_{y}}{r_{x}}}, ensuring the robot is always facing the ball. Also, we intentionally define the input distribution as a two-component Gaussian Mixture Model as follows:

(rx,ry,rt)GMM(μ=[000110],𝚺=[0.120.321e60.320.121e61e61e61.02],w=[0.50.5]),(r_{x},r_{y},r_{t})\sim GMM\big{(}\mathbf{\mu}=\begin{bmatrix}0&0&0\\ -1&1&0\end{bmatrix},\mathbf{\Sigma}=\begin{bmatrix}0.1^{2}&-0.3^{2}&1e-6\\ -0.3^{2}&0.1^{2}&1e-6\\ 1e-6&1e-6&1.0^{2}\end{bmatrix},w=\begin{bmatrix}0.5\\ 0.5\end{bmatrix}\big{)}, (34)

where the covariance matrix Σ\Sigma is shared among components and ww is the weights of mixture components. Meanwhile, as the SKL-UCB and ERBF-UCB surrogates can only accept Gaussian input distributions, we choose to approximate the true input distribution with a Gaussian. As shown in Figure 10, the approximation error is obvious, which explains the performance gap among these algorithms in Figure 5(b).

Apart from the statistics of the found pre-images in Figure 6, we also simulate the robot pushes according to the found configurations and visualize the results in Figure 11. In this figure, each black hollow square represents an instance of the robot’s initial location, the grey arrow indicates the push direction and duration, and the blue circle marks the ball’s ending position after the push. We can find that, as the GP-UCB ignores the input uncertainty, it randomly pushes to these targets and the ball ending positions fluctuate. Also, due to the incorrect assumption of the input distribution, the SKL-UCB and ERBF-UCB fail to control the ball’s ending position under input randomness. On the contrary, AIRBO successfully recognizes the twin targets in quadrant I as an optimal choice and frequently pushes to this area. Moreover, all the ball’s ending positions are well controlled and centralized around the targets under input randomness.