
Bayesian Optimization under Stochastic Delayed Feedback

Arun Verma    Zhongxiang Dai    Bryan Kian Hsiang Low
Abstract

Bayesian optimization (BO) is a widely-used sequential method for zeroth-order optimization of complex and expensive-to-compute black-box functions. The existing BO methods assume that the function evaluation (feedback) is available to the learner immediately or after a fixed delay. Such assumptions may not be practical in many real-life problems like online recommendations, clinical trials, and hyperparameter tuning where feedback is available after a random delay. To benefit from the experimental parallelization in these problems, the learner needs to start new function evaluations without waiting for delayed feedback. In this paper, we consider the problem of BO under stochastic delayed feedback. We propose algorithms with sub-linear regret guarantees that efficiently address the dilemma of selecting new function queries while waiting for randomly delayed feedback. Building on our results, we also make novel contributions to batch BO and contextual Gaussian process bandits. Experiments on synthetic and real-life datasets verify the performance of our algorithms.

Bayesian Optimization, Stochastic Delayed Feedback, Batch Bayesian Optimization, Contextual Gaussian Process Bandits

1 Introduction

Bayesian optimization (BO) (Shahriari et al., 2015; Frazier, 2018; Garnett, 2022) is a popular and widely-used sequential method for zeroth-order optimization of unknown black-box functions that are complex and expensive to compute. The existing BO methods assume that the function evaluation (feedback) is available to the learner immediately or after a fixed delay. However, these assumptions are impractical in many real-life problems where feedback is available after a random delay. To take advantage of the experimental parallelization in these problems, the learner needs to start new function evaluations without waiting for the randomly delayed feedback. We refer to this new BO problem as ‘Bayesian Optimization under Stochastic Delayed Feedback’ (BO-SDF). In this paper, we propose algorithms that efficiently address the problem of selecting new function queries while waiting for randomly delayed feedback. Specifically, we answer the following question:
How should the learner select a new function query when the observations of past function queries are randomly delayed?

Many real-life applications can be cast as BO-SDF problems. For example, when performing clinical trials to discover new medicines, we need to optimize the amounts of different raw materials in order to find their most effective composition using a small number of clinical trials (Chow & Chang, 2006), which is a natural application for BO due to its sample efficiency. However, the evaluation of the effectiveness of a medicine needs to take into account the side effects which are usually not observed immediately but are instead revealed after a random period due to the varying physiology of different patients. Therefore, a natural challenge here is how to use BO to choose the next composition for testing, while accounting for the stochastically delayed observations from some previous clinical trials. A similar challenge also arises when we aim to find the most effective dose of a new medicine through clinical trials (Takahashi & Suzuki, 2021).

Another motivating example arises from online product recommendation (Chapelle, 2014; Diemert et al., 2017). Most online platforms make recommendations in milliseconds, but the user’s response (i.e., watching a movie or buying a product) generally happens after a random time period which may range from hours to days or even weeks. Furthermore, the issue of randomly delayed feedback also plagues other common applications of BO such as hyperparameter tuning of machine learning (ML) models (Snoek et al., 2012) (e.g., the training time of a neural network depends on the number of layers and the network width, among other factors) and material discovery (Ueno et al., 2016).

The closest BO setting to our BO-SDF is batch BO (Desautels et al., 2014; Daxberger & Low, 2017; Chowdhury & Gopalan, 2019) which also allows multiple queries to be performed on the black-box function in parallel. Both problems require choosing an input query to evaluate the black-box function while the observations of some previously selected inputs are not available (i.e., delayed). However, there are important differences because the delays are fixed in a certain way in batch BO but can be random in BO-SDF. Interestingly, batch BO can be considered as a special case of our BO-SDF problem and hence, our algorithms proposed for BO-SDF can be applied to batch BO while achieving important improvements (Section 4).

Following the practice of BO, in order to choose the input queries to quickly approach the global optima, we employ a Gaussian process (GP) (Rasmussen & Williams, 2006) to model the black-box function, which builds a posterior belief of the function using the history of function queries and their feedback. However, when constructing this posterior belief in BO-SDF problems, we are faced with the important question of how to handle the randomly delayed feedback whose observations are not available. To this end, we replace the unavailable delayed feedback by the minimum function value, which we refer to as censored feedback. (In many problems, the minimum value of a function is known, e.g., a user’s minimum response in online recommendation systems is ‘no click’, i.e., 0, and the minimum accuracy for hyperparameter tuning of machine learning models is 0.) The use of censored feedback, interestingly, improves the exploration in BO-SDF problems (Section 3.1) and hence leads to better theoretical and empirical performances. With the posterior belief built using censored feedback, we propose algorithms using upper confidence bound (UCB) and Thompson sampling (TS), both of which enjoy sub-linear regret guarantees. Specifically, our contributions are as follows:

  • We introduce and formalize the notion of censored feedback in the BO-SDF problem, and propose UCB- and TS-based algorithms with sub-linear regret guarantees (Section 3).

  • Since batch BO is a special case of BO-SDF, we apply our proposed algorithms (Section 3) to batch BO and show that our algorithms enjoy tighter regret upper bounds than classic batch BO algorithms (Section 4). This gain is mainly attributed to the censored feedback, which leads to a better exploration policy.

  • We extend our algorithms for the BO-SDF problem to contextual Gaussian process bandits (Krause & Ong, 2011) with stochastic delayed feedback in Section 5. This new contribution is itself of independent interest.

  • Our experimental results in Section 6 validate the different performance aspects of our proposed algorithms on synthetic and real-life datasets.

2 Problem Setting

This paper considers the Bayesian optimization (BO) problem where the function evaluations (feedback) are available to the learner after a random delay. Let $\mathcal{Q}\subset\mathbb{R}^{n}$ be a finite domain of all function queries where $n\geq 1$. (We assume $\mathcal{Q}$ to be finite, but it is straightforward to extend our theoretical results to a compact domain using a suitable discretization scheme (Li & Scarlett, 2022, Lemma 2).) The learner selects a new function query as soon as enough resources are available (or when experiment parallelization is possible). We denote the $t$-th query by $x_{t}\in\mathcal{Q}$. After the learner selects the query $x_{t}$ to evaluate the unknown black-box function $f$, the environment generates a noisy function evaluation, denoted by feedback $y_{t}=f(x_{t})+\varepsilon_{t}$. We assume that $\varepsilon_{t}$ is an $R$-sub-Gaussian noise. The learner observes $y_{t}$ only after a stochastic delay $d_{t}$, which is generated from an unknown distribution $\mathcal{D}$.

We refer to this new BO problem as ‘Bayesian Optimization under Stochastic Delayed Feedback’ (BO-SDF). The unknown function $f$, query space $\mathcal{Q}$, and unknown delay distribution $\mathcal{D}$ identify an instance of the BO-SDF problem. The optimal query $x^{\star}$ has the maximum function value (global maximizer), i.e., $x^{\star}\in\arg\!\max_{x\in\mathcal{Q}}f(x)$. After selecting a query $x_{t}$, the learner incurs a penalty (or instantaneous regret) $p_{t}=f(x^{\star})-f(x_{t})$.

Since the optimal function query is unknown, we sequentially estimate it using the available information of the selected queries and observed feedback. Our goal is to learn a sequential policy for selecting queries that finds the optimal query (or global maximizer) as quickly as possible. There are two common performance measures for evaluating a sequential policy. The first performance measure is simple regret. Let $x_{t}$ be the $t$-th function query selected by the policy. Then, after observing $T$ function evaluations, the simple regret is $r_{T}=f(x^{\star})-\max_{t\in\{1,\ldots,T\}}f(x_{t})$. Any good policy must have no regret, i.e., $\lim_{T\to\infty}r_{T}=0$. The second performance measure of the policy is cumulative regret, which is the sum of total penalties incurred by the learner. After observing $T$ function evaluations, the cumulative regret of a policy is given by $\mathfrak{R}_{T}=\sum_{t=1}^{T}\left(f(x^{\star})-f(x_{t})\right)$. Any good policy should have sub-linear regret, i.e., $\lim_{T\to\infty}\mathfrak{R}_{T}/T=0$. Even though simple regret and cumulative regret are different performance measures, having a policy with no regret or sub-linear regret implies that the policy will eventually converge to the optimal query (or global optimizer).
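To make these two measures concrete, the following minimal Python sketch computes them from the true optimum value $f(x^{\star})$ and the sequence of query values $f(x_{1}),\ldots,f(x_{T})$; these quantities are of course unknown to the learner and are only used here for illustration, and the variable names are ours rather than from our released code.

import numpy as np

def simple_regret(f_opt, f_queries):
    # r_T = f(x*) - max_{t <= T} f(x_t)
    return f_opt - np.max(f_queries)

def cumulative_regret(f_opt, f_queries):
    # R_T = sum_{t <= T} (f(x*) - f(x_t))
    return np.sum(f_opt - np.asarray(f_queries))

# Example: f(x*) = 1.0 and five query values
f_opt, f_vals = 1.0, [0.2, 0.5, 0.9, 0.7, 0.95]
print(simple_regret(f_opt, f_vals))      # ≈ 0.05
print(cumulative_regret(f_opt, f_vals))  # ≈ 1.75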

3 BO under Stochastic Delayed Feedback

A function estimator is the main component of any BO problem for achieving good performance. We use a Gaussian process (GP) as a surrogate for the posterior belief of the unknown function (Rasmussen & Williams, 2006; Srinivas et al., 2010). To deal with the delayed feedback, we will introduce the notion of censored feedback in GPs, in which the delayed feedback is replaced by the minimum function value. Finally, we will propose UCB- and TS-based algorithms with sub-linear regret guarantees.

3.1 Estimating Function using Gaussian Processes

We assume that the unknown function $f$ belongs to the reproducing kernel Hilbert space (RKHS) associated with a kernel $k$, i.e., $\left\|f\right\|_{k}\leq\mathcal{B}_{f}$ where $\mathcal{B}_{f}>0$. By following standard practice in the literature, we assume without loss of generality that $k(x,x^{\prime})\leq 1$ for all $x,x^{\prime}\in\mathcal{Q}$. We also assume that feedback is bounded, i.e., $|y_{t}|\leq\mathcal{B}_{y}$ for all $t\geq 1$. This assumption is not restrictive since any bounded function can be re-scaled to satisfy this boundedness requirement. For example, the validation accuracy of machine learning models in our hyperparameter tuning experiment in Section 6 is bounded in $[0,1]$.

Censored Feedback.

The learner has to efficiently exploit the available information about queries and feedback to achieve good performance. The estimated GP posterior belief has two parts, i.e., a mean function and a covariance function. The posterior covariance function only needs the function queries for building tighter confidence intervals, whereas the posterior mean function needs both the queries and their feedback. Incorporating the queries with delayed feedback when updating the posterior mean function leads to a better estimate, but the question is how to do so. One possible solution is to replace the delayed feedback with some value, but then the question is what that value should be.

To pick a suitable value for replacing delayed feedback, we take motivation from real-life applications such as the online movie recommendation problem. When an online platform recommends a movie, it expects one of the following responses from the user: not watching the movie, watching some part of the movie, or watching the entire movie. The platform can map these responses to $[0,1]$ where $0$ is assigned for ‘not watching the movie’, $1$ for ‘watching the entire movie’, and an appropriate value in $(0,1)$ for ‘watching some part of the movie.’ There are two reasons for the delay in the user’s response: the user does not like the recommended movie or does not have time to watch the movie. Before the user starts watching the movie, the platform can replace the user’s delayed response with ‘not watching the movie’ (i.e., ‘0’) and update it later when more information is available. Therefore, we can replace the delayed feedback with the minimum function value. (It is possible to replace the delayed feedback with other values. When it is replaced by the current GP posterior mean, it has the same effect as the hallucination in batch BO (more details in Section 4). Another possible value for replacing the delayed feedback is the $k$-nearest posterior means where $k>0$.) We refer to this replaced feedback as censored feedback. The idea of censored feedback is also used in other problems such as censoring delayed binary feedback (Vernade et al., 2020) and censoring losses (Verma et al., 2019, 2021) or rewards (Verma & Hanawal, 2020) in online resource allocation problems.

When the minimum function value replaces the delayed feedback, the posterior mean around that query becomes small, consequently ensuring that the learner will not select queries near those with delayed feedback before exploring other parts of the query domain. Therefore, censored feedback leads to better exploration than replacing the delayed feedback with a larger value (e.g., the current posterior mean). However, the queries with delayed feedback need to be stored so that the posterior mean can be updated when their feedback is revealed. Due to several reasons (e.g., limited storage, making resources available to new queries), the learner cannot keep all queries with delayed feedback and instead replaces the old queries with the latest ones. (Our motivation for removing older queries comes from online recommendation problems where the chance of observing feedback diminishes with time. The learner can also use a more complicated scheme for query removal.) Discarding queries can degrade the algorithm’s performance, as discussed in Section 3.3 and demonstrated by the experiments in Section 6.

Let an incomplete query denote a function query whose feedback is not yet observed, and let $m$ be the number of incomplete queries the learner can store. We assume that the random delay for a query is the number of new queries that need to be started (i.e., the number of iterations elapsed) before observing its feedback. Therefore, the delay $d_{t}$ only takes values in the non-negative integers. (We have introduced the delay based on iterations here to bring out the main ideas of censored feedback. Any suitable discrete distribution, e.g., Poisson, can model such delays. We have also introduced censored feedback with time-based delay, which is more practical and can be modeled using a suitable continuous distribution; see Section A.4 for more details.) The censored feedback of the $s$-th query before selecting the $t$-th query is denoted by $\tilde{y}_{s,t}=y_{s}\mathbbm{1}\{d_{s}\leq\min(m,t-s)\}$. That is, the learner censors the incomplete queries by assigning $0$ to them. (We can make the minimum value of any bounded function $0$ by adding to it the negative of the minimum function value. This does not change the optimizer since the optimizer of a function is invariant to a constant shift. By doing this, we can represent censored feedback by an indicator function, making censored feedback easier to handle in the theoretical analysis.) Now, the GP posterior mean and covariance functions can be expressed using the available function queries and their censored feedback as follows:

$$\mu_{t-1}(x)=\mathbf{k}_{t-1}(x)^{\top}\mathbf{K}_{t,\lambda}^{-1}\,\tilde{\mathbf{y}}_{t-1}\ ,\qquad \sigma_{t-1}^{2}(x,x^{\prime})=k(x,x^{\prime})-\mathbf{k}_{t-1}(x)^{\top}\mathbf{K}_{t,\lambda}^{-1}\,\mathbf{k}_{t-1}(x^{\prime}) \qquad (1)$$

where $\mathbf{K}_{t,\lambda}=\mathbf{K}_{t-1}+\lambda\mathbf{I}$, $\mathbf{k}_{t-1}(x)=(k(x,x_{t^{\prime}}))^{\top}_{t^{\prime}\in[t-1]}$, $\tilde{\mathbf{y}}_{t-1}=(\tilde{y}_{s,t})^{\top}_{s\in[t-1]}$, $\mathbf{K}_{t-1}=(k(x_{t^{\prime}},x_{t^{\prime\prime}}))_{t^{\prime},t^{\prime\prime}\in[t-1]}$, and $\lambda$ is a regularization parameter that ensures $\mathbf{K}_{t,\lambda}$ is a positive definite matrix. We denote $\sigma_{t-1}^{2}(x)=\sigma_{t-1}^{2}(x,x)$.
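For concreteness, below is a minimal Python/NumPy sketch of the censoring rule and of the GP posterior in Eq. (1), assuming an SE kernel with $k(x,x)=1$; the function and variable names are illustrative and are not taken from our released code.

import numpy as np

def se_kernel(A, B, lengthscale=0.2):
    # Squared exponential kernel with k(x, x) = 1
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def censored_posterior(X_prev, y_prev, delays, t, m, X_query, lam=1.0):
    # X_prev: (t-1, n) past queries; y_prev: their feedback; delays: their delays d_s.
    # Feedback of the s-th query is visible before selecting the t-th query only if
    # d_s <= min(m, t - s); otherwise it is censored to 0 (the minimum function value).
    s = np.arange(1, len(X_prev) + 1)
    observed = np.asarray(delays) <= np.minimum(m, t - s)
    y_tilde = np.where(observed, y_prev, 0.0)            # censored feedback vector
    K = se_kernel(X_prev, X_prev) + lam * np.eye(len(X_prev))
    k_x = se_kernel(X_prev, X_query)                     # shape (t-1, |Q|)
    K_inv_k = np.linalg.solve(K, k_x)
    mu = K_inv_k.T @ y_tilde                             # posterior mean of Eq. (1)
    var = 1.0 - np.sum(k_x * K_inv_k, axis=0)            # posterior variance of Eq. (1)
    return mu, np.sqrt(np.clip(var, 0.0, None))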

3.2 Algorithms for the BO-SDF problem

After having a posterior belief of the unknown function, the learner can decide which query needs to be selected for the subsequent function evaluation. Since the feedback is only observed for the selected query, the learner needs to deal with the exploration-exploitation trade-off. We use UCB- and TS-based algorithms for BO-SDF problems that handle the exploration-exploitation trade-off efficiently.

UCB-based Algorithm for BO-SDF problems. UCB is an effective and widely-used technique for dealing with the exploration-exploitation trade-off in various sequential decision-making problems (Auer et al., 2002; Garivier & Cappé, 2011). We propose a UCB-based algorithm named GP-UCB-SDF for BO-SDF problems. It works as follows: when the learner is ready to start the $t$-th function query, it first updates the GP posterior mean and standard deviation defined in Eq. 1 using the available censored noisy feedback. Then, it selects the input $x_{t}$ for the next function evaluation by maximizing the following UCB value:

$$x_{t}=\arg\!\max_{x\in\mathcal{Q}}\left(\mu_{t-1}(x)+\nu_{t}\sigma_{t-1}(x)\right) \qquad (2)$$

where $\nu_{t}=\mathcal{B}_{y}\sum^{t-1}_{s=t-m}\sigma_{t-1}(x_{s})+\beta_{t}$, $\beta_{t}=\mathcal{B}_{f}+(R+\mathcal{B}_{y})\sqrt{2(\gamma_{t-1}+1+\log(2/\delta))}$, $\delta\in(0,1)$, and $\gamma_{t}$ is the maximum information gain about the function $f$ from any set of $t$ function queries (Srinivas et al., 2010).

GP-UCB-SDF: UCB-based Algorithm for BO-SDF
1:  Input: $\lambda>0$, $m\in\mathbb{N}$
2:  for $t=1,2,3,\ldots$ do
3:     Update the GP posterior defined in Eq. 1 using the available censored noisy feedback
4:     Select input $x_{t}$ using Eq. 2 and compute $f$ at $x_{t}$
5:     Keep observing the noisy feedback until the start of the next function evaluation
6:  end for

After selecting the input $x_{t}$, the learner queries the unknown function $f$ at $x_{t}$ and the environment generates a noisy feedback $y_{t}$ with an associated delay $d_{t}\sim\mathcal{D}$. The learner observes $y_{t}$ only after a delay of $d_{t}$, and only if $d_{t}\leq m$. Before starting a new function evaluation, the learner keeps observing the noisy feedback of past queries. The same process is repeated for the subsequent function evaluations.
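The following Python sketch puts one iteration of GP-UCB-SDF together, reusing the illustrative censored_posterior helper sketched above; it assumes $\beta_{t}$ is supplied by the caller (computed as in the formula above) and that the candidate set Q is a finite array of queries, so it is a sketch rather than our released implementation.

def gp_ucb_sdf_step(X_prev, y_prev, delays, t, m, Q, B_y, beta_t, lam=1.0):
    # Posterior mean/std at all candidate queries using censored feedback (Eq. 1)
    mu, sigma = censored_posterior(X_prev, y_prev, delays, t, m, Q, lam)
    # nu_t = B_y * sum_{s = t-m}^{t-1} sigma_{t-1}(x_s) + beta_t
    recent = X_prev[max(0, t - 1 - m):]                  # queries x_{t-m}, ..., x_{t-1}
    _, sigma_recent = censored_posterior(X_prev, y_prev, delays, t, m, recent, lam)
    nu_t = B_y * np.sum(sigma_recent) + beta_t
    # Maximize the UCB value of Eq. (2) over the finite candidate set Q
    return Q[np.argmax(mu + nu_t * sigma)]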

TS-based Algorithm for BO-SDF problems. In contrast to the deterministic UCB, TS (Thompson, 1933; Agrawal & Goyal, 2012, 2013a; Kaufmann et al., 2012) is based on Bayesian updates that select inputs according to their probability of being the best input. Many works have shown that TS is empirically superior to UCB-based algorithms (Chapelle & Li, 2011; Chowdhury & Gopalan, 2019). We name our TS-based algorithm GP-TS-SDF, which works similarly to GP-UCB-SDF, except that the $t$-th function query is selected as follows:

$$x_{t}=\arg\!\max_{x\in\mathcal{Q}}f_{t}(x) \qquad (3)$$

where $f_{t}\sim\mathcal{GP}(\mu_{t-1}(\cdot),\nu_{t}^{2}\sigma_{t-1}^{2}(\cdot))$.

GP-TS-SDF: TS-based Algorithm for BO-SDF
1:  Input: $\lambda>0$, $m\in\mathbb{N}$
2:  for $t=1,2,3,\ldots$ do
3:     Update the GP posterior defined in Eq. 1 using the available censored noisy feedback
4:     Select input $x_{t}$ using Eq. 3 and compute $f$ at $x_{t}$
5:     Keep observing the noisy feedback until the start of the next function evaluation
6:  end for
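A matching sketch of one GP-TS-SDF iteration is given below; over a finite domain, sampling $f_{t}$ amounts to drawing from a multivariate Gaussian with the scaled posterior covariance. It again reuses the illustrative helpers from above and is an assumption-laden sketch, not our released implementation.

def gp_ts_sdf_step(X_prev, y_prev, delays, t, m, Q, B_y, beta_t, lam=1.0, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    mu, _ = censored_posterior(X_prev, y_prev, delays, t, m, Q, lam)
    recent = X_prev[max(0, t - 1 - m):]
    _, sigma_recent = censored_posterior(X_prev, y_prev, delays, t, m, recent, lam)
    nu_t = B_y * np.sum(sigma_recent) + beta_t
    # Full posterior covariance over the finite domain Q, needed to sample f_t
    K = se_kernel(X_prev, X_prev) + lam * np.eye(len(X_prev))
    k_Q = se_kernel(X_prev, Q)
    cov = se_kernel(Q, Q) - k_Q.T @ np.linalg.solve(K, k_Q)
    cov += 1e-8 * np.eye(len(Q))                          # jitter for numerical stability
    f_t = rng.multivariate_normal(mu, nu_t ** 2 * cov)    # f_t ~ GP(mu_{t-1}, nu_t^2 sigma_{t-1}^2)
    return Q[np.argmax(f_t)]                              # Eq. (3)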

3.3 Regret Analysis

Let $\rho_{m}=\mathbb{P}(d_{s}\leq m)$ for $s\geq 1$ be the probability of observing delayed feedback within the next $m$ iterations. The following result gives the confidence bounds of the GP posterior mean function and is important to our subsequent regret analysis:

Theorem 1 (Confidence Ellipsoid).

Let $x\in\mathcal{Q}$, $\lambda>1$, and $t\geq 1$. Then, with probability at least $1-\delta$,

$$\left|\mu_{t-1}(x)-\rho_{m}f(x)\right|\leq\nu_{t}\sigma_{t-1}(x).$$

Intuitively, censoring scales the expected feedback of each (sufficiently old) query by $\rho_{m}$, which is why the GP posterior mean concentrates around $\rho_{m}f(x)$ rather than $f(x)$. The detailed proofs of Theorem 1 and the other results below are given in Appendix A. Now, we state the regret upper bounds of our proposed GP-UCB-SDF and GP-TS-SDF:

Theorem 2 (GP-UCB-SDF).

Let $C_{1}=\sqrt{2/\log(1+\lambda^{-1})}$. Then, with probability at least $1-\delta$,

$$\mathfrak{R}_{T}\leq\frac{2}{\rho_{m}}\left(C_{1}\beta_{T}\sqrt{T\gamma_{T}}+m\mathcal{B}_{y}C_{1}^{2}\gamma_{T}\right).$$

As expected, the regret inversely depends on $\rho_{m}$, i.e., the probability of observing delayed feedback within the next $m$ iterations. If we ignore the logarithmic factors and constants in Theorem 2, then the regret of GP-UCB-SDF is upper bounded by $\tilde{O}\left(\rho^{-1}_{m}\left(\gamma_{T}\sqrt{T}+m\gamma_{T}\right)\right)$.

Theorem 3 (GP-TS-SDF).

With probability at least $1-\delta$,

$$\mathfrak{R}_{T}=\tilde{O}\left(\frac{1}{\rho_{m}}\left(\sqrt{T\gamma_{T}}(\sqrt{\gamma_{T}}+1)+m(\gamma_{T}+\sqrt{T})\right)\right).$$

The regret bounds in both Theorems 2 and 3 are sub-linear for the commonly used squared exponential (SE) kernel, for which $\gamma_{T}=O(\log^{n+1}(T))$ (Srinivas et al., 2010).

By following the common practice in the BO literature (Contal et al., 2013; Kandasamy et al., 2018), we can upper-bound the simple regret by the average cumulative regret, i.e., $r_{T}\leq\mathfrak{R}_{T}/T$.

Discussion on the trade-off regarding $m$. A large value of $m$ ensures that fewer queries are discarded and favors the regret bound (i.e., via the term $1/\rho_{m}$) by making $\rho_{m}$ large. However, a large $m$ also means that our algorithm allows larger delays, which consequently forces us to enlarge the width of the confidence interval (Theorem 1) to account for them. This makes the regret bound worse, which is also reflected by the linear dependence on $m$ in the second term of the regret upper bounds (Theorems 2 and 3). As $m$ is directly related to the storage requirement for input queries whose feedback is yet to be observed, a large value of $m$ is constrained by the resources available to the learner.

Input-dependent random delays. In some applications, the random delay can depend on the function query: for example, a larger number of hidden layers increases the training time of a neural network. Our regret bounds still hold with input-dependent random delays by redefining $\rho_{m}$ for $T$ queries as $\rho_{m}=\min_{t\in[T]}\mathbb{P}\left\{d_{t}\leq m\right\}$.

Limitations of censored feedback. Censoring, i.e., replacing the values of incomplete queries with the minimum function value, leads to an aggressive exploration strategy. However, censoring is required to ensure that our BO method does not unnecessarily explore the incomplete queries or queries near them. Moreover, we can reduce the effect of this aggressive exploration by increasing the value of $m$, since a larger $m$ leads to less aggressive censoring. Another limitation of the censoring method is that it needs to know the minimum function value. When the minimum value is unknown, we can use a suitable lower bound as a proxy for the minimum function value. However, the regret analysis of such a method can be challenging and is left for future research.

4 Improved Algorithms for Batch BO

Notably, our BO-SDF problem subsumes batch BO as a special case. Specifically, the function queries in batch BO can be viewed as being sequentially selected, in which some queries need to be selected while some other queries of the same batch are still incomplete (Desautels et al., 2014). Therefore, in batch BO with a batch size of $\mathcal{B}_{x}$, the incomplete queries can be viewed as delayed feedback with fixed delays s.t. $d_{s}\leq\mathcal{B}_{x}-1$. In this case, by choosing $m=\mathcal{B}_{x}-1$, we can ensure that $\rho_{m}=\mathbb{P}(d_{s}\leq\mathcal{B}_{x}-1)=1$. As a result, our method of censoring (Section 3.1) gives a better treatment to the incomplete queries than the classic technique from batch BO, which we will demonstrate next via both intuitive justification and regret comparison.

The classic technique to handle the incomplete queries in batch BO is hallucination, which was proposed by the GP-BUCB algorithm of Desautels et al. (2014) and has also been adopted by a number of its extensions (Daxberger & Low, 2017; Chowdhury & Gopalan, 2019). In particular, the feedback of incomplete queries is hallucinated to be (i.e., censored with) the GP posterior mean computed using only the completed queries (excluding the incomplete queries). This is equivalent to keeping the GP posterior mean unchanged (i.e., unaffected by the incomplete queries) while updating the GP posterior variance using all selected queries (including the incomplete queries). In contrast to hallucination, our censoring technique sets the incomplete observations to the minimum function value (i.e., 0) (Section 3.1). As a result, compared with hallucination where the GP posterior mean is unaffected by the incomplete queries, our censoring can reduce the value of the GP posterior mean at the incomplete queries (because we have treated their observations as if they are 0) and can hence discourage these incomplete queries from being unnecessarily queried again. This encourages the algorithm to explore other unexplored regions, hence leading to better exploration.

Interestingly, the better exploration resulting from our censoring technique also translates to a tighter regret upper bound. Specifically, in batch BO with a batch size of $\mathcal{B}_{x}$, after plugging in $m=\mathcal{B}_{x}-1$ and $\rho_{m}=1$, our next result gives the regret upper bound of GP-UCB-SDF.

Proposition 1.

With probability at least $1-\delta$,

$$\mathfrak{R}_{T}=\tilde{O}\left(\gamma_{T}\sqrt{T}+\mathcal{B}_{x}\gamma_{T}\right).$$
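To see how Proposition 1 follows from Theorem 2, substitute $\rho_{m}=1$ and $m=\mathcal{B}_{x}-1$ into the bound of Theorem 2:

$$\mathfrak{R}_{T}\leq 2\left(C_{1}\beta_{T}\sqrt{T\gamma_{T}}+(\mathcal{B}_{x}-1)\mathcal{B}_{y}C_{1}^{2}\gamma_{T}\right),$$

and since $\beta_{T}=O(\sqrt{\gamma_{T}})$ up to constants and logarithmic factors, the first term is $\tilde{O}(\gamma_{T}\sqrt{T})$ and the second is $\tilde{O}(\mathcal{B}_{x}\gamma_{T})$, which yields the bound above.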

As long as $\gamma_{T}=\Omega(\log T)$, which holds for most commonly used kernels such as the squared exponential (SE) and Matérn kernels, both terms in Proposition 1 grow slower than $\mathfrak{R}_{T}=\tilde{O}(\gamma_{T}\sqrt{T}\exp(\gamma_{\mathcal{B}_{x}-1}))$, which is the regret upper bound of the GP-BUCB algorithm based on hallucination (Desautels et al., 2014). Importantly, to achieve a sub-linear regret upper bound (i.e., to ensure that $\exp(\gamma_{\mathcal{B}_{x}-1})$ can be upper-bounded by a constant), GP-BUCB (Desautels et al., 2014) and its extensions (Daxberger & Low, 2017; Kandasamy et al., 2018; Chowdhury & Gopalan, 2019) all require a special initialization scheme (i.e., uncertainty sampling). This initialization is often found unnecessary in practice (Kandasamy et al., 2018) and hence represents a gap between theory and practice. Interestingly, our tighter regret upper bound from Proposition 1 has removed the dependence on $\exp(\gamma_{\mathcal{B}_{x}-1})$ and hence eliminated the need for uncertainty sampling, thereby closing this gap between theory and practice.

Note that our discussions above regarding the advantage of censoring over hallucination also apply to TS-based algorithms. In particular, in batch BO with a batch size of $\mathcal{B}_{x}$, the regret upper bound of GP-TS-SDF is given by the following result.

Proposition 2.

With probability at least $1-\delta$,

$$\mathfrak{R}_{T}=\tilde{O}\left(\sqrt{T\gamma_{T}}(\sqrt{\gamma_{T}}+1)+\mathcal{B}_{x}\gamma_{T}+\mathcal{B}_{x}\sqrt{T}\right).$$

All three terms in Proposition 2 grow slower than $\mathfrak{R}_{T}=\tilde{O}(\exp(\gamma_{\mathcal{B}_{x}-1})\sqrt{T\gamma_{T}}(\sqrt{\gamma_{T}}+1))$, which is the regret upper bound of the GP-BTS algorithm based on hallucination (Chowdhury & Gopalan, 2019), as long as $\gamma_{T}=\Omega(\log T)$.

To summarize, though batch BO is only a special case of our BO-SDF problem, we have made non-trivial contributions to the algorithm and theory of batch BO. Our GP-UCB-SDF and GP-TS-SDF algorithms and our theoretical analyses may serve as inspiration for future works on batch BO.

5 Contextual Gaussian Process Bandits under Stochastic Delayed Feedback

In many real-life scenarios, additional information is available to the learner before making a decision (e.g., users’ profile information for an online platform and patients’ medical history before clinical trials). This additional information is referred to as context in the literature, and the value of the function depends on the context (Agrawal & Goyal, 2013b; Chu et al., 2011; Krause & Ong, 2011; Li et al., 2010). The learner can design algorithms that exploit the contextual information to make better decisions. We extend our results for the BO-SDF problem to contextual Gaussian process bandits (Krause & Ong, 2011) with stochastic delayed feedback, where a non-linear function maps a context to the feedback.

Let $\mathcal{Q}\subset\mathbb{R}^{n}$ be a finite set of available actions (or queries) and $\mathcal{Z}\subset\mathbb{R}^{n^{\prime}}$ be a set of all contexts where $n,n^{\prime}\geq 1$. In round $t$, the environment generates a context $z_{t}\in\mathcal{Z}$. Then, the learner selects an action $x_{t}\in\mathcal{Q}$ and observes noisy feedback denoted by $y_{t}=g(z_{t},x_{t})+\varepsilon_{t}$. We assume that $g:\mathcal{Z}\times\mathcal{Q}\rightarrow\mathbb{R}$ and that $\varepsilon_{t}$ is an $R$-sub-Gaussian noise. The feedback $y_{t}$ is only observed after a random delay $d_{t}$, which is generated from an unknown distribution $\mathcal{D}$.

We refer to this new problem as ‘Contextual Gaussian process Bandits under Stochastic Delayed Feedback’ (CGB-SDF). The unknown function $g$, context space $\mathcal{Z}$, action space $\mathcal{Q}$, and unknown delay distribution $\mathcal{D}$ identify an instance of the CGB-SDF problem. The optimal action $x_{t}^{\star}$ for a given context $z_{t}$ is the action where the function $g$ has its maximum value, i.e., $x_{t}^{\star}=\arg\!\max_{x\in\mathcal{Q}}g(z_{t},x)$. The learner incurs a penalty of $g(z_{t},x_{t}^{\star})-g(z_{t},x_{t})$ for a context $z_{t}$ and action $x_{t}$. We aim to learn a policy that achieves the minimum cumulative penalty or regret, which is given by $\mathfrak{R}_{T}=\sum_{t=1}^{T}\left(g(z_{t},x_{t}^{\star})-g(z_{t},x_{t})\right)$.

Our goal is to learn a policy with small sub-linear regret, i.e., $\lim_{T\rightarrow\infty}\mathfrak{R}_{T}/T=0$. Sub-linear regret here implies that the policy will eventually select the optimal action for a given context. We have adapted our algorithms for the BO-SDF problem to the CGB-SDF problem and shown that they also enjoy similar sub-linear regret guarantees; see Appendix B for more details.
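As an illustration of this adaptation (the precise algorithms are in Appendix B), the UCB rule of Eq. (2) can be maximized over actions for the observed context by defining the GP on concatenated context-action pairs; the sketch below reuses the illustrative censored_posterior helper from Section 3.1 and is only an assumption-laden sketch, not the exact acquisition of Appendix B.

def cgb_ucb_sdf_step(ZX_prev, y_prev, delays, t, m, z_t, Q, B_y, beta_t, lam=1.0):
    # The GP is defined on concatenated (context, action) vectors.
    ZX_query = np.hstack([np.tile(z_t, (len(Q), 1)), Q])    # pair z_t with every candidate action
    mu, sigma = censored_posterior(ZX_prev, y_prev, delays, t, m, ZX_query, lam)
    recent = ZX_prev[max(0, t - 1 - m):]                    # recent (context, action) pairs
    _, sigma_recent = censored_posterior(ZX_prev, y_prev, delays, t, m, recent, lam)
    nu_t = B_y * np.sum(sigma_recent) + beta_t
    return Q[np.argmax(mu + nu_t * sigma)]                  # best action for context z_t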

6 Experiments

We compare with previous baseline methods that can be applied to BO-SDF problems after modifications. Firstly, we compare with standard GP-UCB (Srinivas et al., 2010), which ignores the delayed observations when selecting the next query. Note that GP-UCB is likely to repeat previously selected queries in some iterations when the GP posterior remains unchanged (i.e., if no observation is newly collected between two iterations). Next, we also compare with the batch BO method asy-TS (Kandasamy et al., 2018) which, similarly to GP-UCB, ignores the delayed observations when using TS to choose the next query. Lastly, we compare with the batch BO methods GP-BUCB (Desautels et al., 2014) and GP-BTS (Chowdhury & Gopalan, 2019), which handle the delayed observations via hallucination (Section 4). Following the suggestion from Vernade et al. (2020), if not specified otherwise, we set $m$ to be $2\mu$ where $\mu$ is the mean of the delay distribution. However, this is for convenience only since we will demonstrate (Section 6.1) that our algorithms consistently achieve competitive performances as long as $m$ is large enough. Therefore, $\mu$ does not need to be known in practice. We defer some experimental details to Appendix C due to space constraints. Our code is released at https://github.com/daizhongxiang/BO-SDF.

Figure 1: Performances of our GP-UCB-SDF for (a) different delay distributions and (b) different values of $m$ ($\mu$ is fixed at $\mu=10$). Performances of GP-UCB-SDF and baseline methods for (c) stochastic and (d) deterministic delay distributions.
Figure 2: Performances for hyperparameter tuning of SVM: UCB-based methods with (a) stochastic and (b) deterministic delay distributions, and TS-based methods with (c) stochastic and (d) deterministic delay distributions. Performances for hyperparameter tuning of CNN: UCB-based methods with (e) stochastic and (f) deterministic delay distributions, and TS-based methods with (g) stochastic and (h) deterministic delay distributions.

6.1 Synthetic Experiments

For this experiment, we use a Poisson distribution (with a mean parameter $\mu$) as the delay distribution and sample the objective function $f$ from a GP with an SE kernel defined on a discrete $1$-dimensional domain. Fig. 1a plots the performances of our GP-UCB-SDF for different delay distributions and shows that larger delays (i.e., larger $\mu$'s) result in worse performances. Fig. 1b shows that for a fixed delay distribution (i.e., $\mu=10$), an overly small value of $m$ leads to large regrets since it causes our algorithm to ignore a large number of delayed observations (i.e., overly aggressive censoring). On the other hand, a large $m$ is desirable since $m=20=2\mu$ and $m=40$ produce comparable performances.

However, an overly large value of $m$ may impose an excessive resource requirement since we may need to cater to many pending function evaluations. Therefore, when the delay distribution (hence $\mu$) is unknown, we recommend using a large value of $m$ that is allowed by the resource constraints. We now compare the performances of our GP-UCB-SDF with the baselines of GP-UCB and GP-BUCB for both stochastic ($\mu=10$, Fig. 1c) and deterministic (delays fixed at $10$, Fig. 1d) delay distributions. The figures show that our GP-UCB-SDF achieves smaller simple regret than the baselines in both settings.
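To make the synthetic setup concrete, the following Python sketch shows one way the stochastically delayed feedback in this experiment can be simulated: delays are drawn from a Poisson distribution, feedback is revealed only after its delay elapses, and feedback delayed by more than $m$ iterations remains censored. The commented line marks where the GP update and query selection (Eq. 2 or 3) would go; it is an illustrative skeleton only.

import numpy as np

rng = np.random.default_rng(0)
mu_delay, m, T = 10, 20, 100
pending = []                    # (reveal_iteration, query_index, feedback) of incomplete queries

for t in range(1, T + 1):
    # Reveal the feedback whose delay has elapsed
    revealed = [(i, y) for (r, i, y) in pending if r <= t]
    pending = [(r, i, y) for (r, i, y) in pending if r > t]
    # ... update the GP with `revealed`, then select x_t via Eq. (2) or Eq. (3) ...
    y_t = 0.0                                  # placeholder for the noisy evaluation f(x_t) + noise
    d_t = rng.poisson(mu_delay)                # stochastic delay measured in iterations
    if d_t <= m:
        pending.append((t + d_t, t, y_t))      # becomes available d_t iterations later
    # else: the feedback arrives too late and stays censored at 0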

6.2 Real-world Experiments

In this section, we perform real-world experiments on hyperparameter tuning of ML models under two realistic settings. In the first setting, we consider stochastic delay distributions where the delays are sampled from a Poisson distribution with a mean parameter $\mu=10$. This setting is used to simulate real-world scenarios where the evaluations of different hyperparameter configurations may be stochastically delayed because they may require different amounts of computation (e.g., training a neural network with more layers is more time-consuming), or the machines deployed in the experiment may have different computational capabilities. The second setting considers fixed delay distributions where every delay is fixed at $\mathcal{B}_{x}-1=10$, which is used to emulate real-world asynchronous batch BO problems with a batch size of $\mathcal{B}_{x}=11$.

Figure 3: Performances for hyperparameter tuning of LR: UCB-based methods for (a) stochastic and (b) deterministic delay distributions, and TS-based methods for (c) stochastic and (d) deterministic delay distributions.
Figure 4: Cumulative regret of contextual GP bandits for (a) UCB-based and (b) TS-based methods for multi-task BO, and (c) UCB-based and (d) TS-based methods for non-stationary BO.

We now tune the hyperparameters of three different ML models to demonstrate the consistency of our algorithms. Specifically, we tune two hyperparameters of SVM (i.e., the penalty parameter and the RBF kernel parameter) used for diabetes diagnosis, three hyperparameters of convolutional neural networks (CNNs) (i.e., the batch size, learning rate, and learning rate decay) using the MNIST dataset, and the same three hyperparameters of logistic regression (LR) used for breast cancer detection. The experimental results on hyperparameter tuning for SVM and CNN are plotted in Fig. 2, including the results for UCB- and TS-based algorithms in both settings. The results for hyperparameter tuning of LR are shown in Fig. 3. The figures show that our GP-UCB-SDF and GP-TS-SDF consistently outperform the corresponding baselines in both settings with stochastic and deterministic delay distributions. The results here verify the practical effectiveness of our proposed algorithms in real-world problems.

6.3 Real-world Experiments on Contextual Gaussian Process Bandits

In this section, we apply our algorithms to two real-world contextual GP bandit problems (Section 5). Our first experiment performs multi-task BO in the contextual GP bandit setting following the work of Krause & Ong (2011). We use the tabular benchmark dataset for SVM hyperparameter tuning from the work of Wistuba et al. (2015), which consists of the validation accuracy evaluated on a discrete domain of six SVM hyperparameters for $50$ different classification tasks. Each of the $50$ tasks is associated with a set of meta-features which are used as the contexts $z_{t}$. We sequentially tackle the $50$ tasks and perform hyperparameter tuning for each task for $30$ iterations, i.e., the context $z_{t}$ for every task is repeated consecutively for $30$ iterations and the contexts for all tasks arrive sequentially.

Our second experiment performs non-stationary BO in the contextual GP bandit setting. In some real-world BO problems, the objective function is non-stationary (i.e., changing over time). For example, when we tune the hyperparameters of an ML model for diabetes diagnosis (Section 6.2), we may need to perform the hyperparameter tuning task periodically because the size of the dataset is progressively growing due to sequentially arriving patient data. We divide the entire dataset into $20$ progressively growing datasets and assign an index $z_{t}\in\{0,\ldots,19\}$ to each dataset such that a smaller dataset has a smaller index. Then, we use the indices as the contexts because two datasets whose indices are close share many common data points and are hence expected to lead to similar performances. We let the contexts arrive in the order of their indices to simulate sequentially arriving patients and again repeat each context for $30$ iterations.

For both experiments, we use a stochastic delay distribution (i.e., a Poisson distribution with $\mu=3$) and set $m=6$. The cumulative regrets of different algorithms are shown in Fig. 4, in which both GP-UCB-SDF and GP-TS-SDF incur smaller regrets than the other baselines.

7 Related Work

We have developed algorithms for the BO problem with stochastic delayed feedback. In the following, we briefly review the papers relevant to our problem.

Batch BO. Batch BO is the closest BO setting to our BO-SDF problem. Many algorithms have been proposed for different variants of the batch BO problem in the literature (Desautels et al., 2014; González et al., 2016; Daxberger & Low, 2017; Kandasamy et al., 2018; Gong et al., 2019; Chowdhury & Gopalan, 2019; Balandat et al., 2020). However, the delay in observing feedback in all these batch BO problems is fixed. Some batch BO algorithms are based on the UCB index (Desautels et al., 2014; Daxberger & Low, 2017), while others are based on TS (Kandasamy et al., 2018; Chowdhury & Gopalan, 2019; Gong et al., 2019). Batch BO methods like ‘local penalization’ (González et al., 2016) and ‘fantasize’ (Balandat et al., 2020) cannot work with stochastically delayed feedback and have no theoretical guarantees. All variants of the batch BO problem can be treated as special cases of the BO-SDF problem by fixing the delays appropriately. Therefore, our UCB- and TS-based algorithms can be used for any batch BO problem.

Contextual Gaussian Process (GP) Bandits. Several prior works consider the availability of additional information to the learner before making the decision. This line of work is popularly known as contextual bandits (Li et al., 2010). It has many applications in advertising, web search, and e-commerce. In contextual bandits, the mean reward is a function of context and often parameterized (e.g., linear (Li et al., 2010; Chu et al., 2011; Agrawal & Goyal, 2013b), GLM (Li et al., 2017), or non-linear (Valko et al., 2013)). Contextual GP bandits model the posterior belief of a non-linear function using GPs (Krause & Ong, 2011). To the best of our knowledge, our work is the first to introduce stochastic delayed feedback into contextual GP bandits.

Stochastic Delayed Feedback. Motivated by several practical applications, some works have explored fixed or stochastic delayed feedback in stochastic multi-armed bandits (MAB) (Slivkins, 2019; Lattimore & Szepesvári, 2020), which is a simpler sequential decision-making framework than BO. In the MAB literature, problems with known fixed delays (Joulani et al., 2013), unknown delays (Zhong et al., 2017), unknown stochastic delays (Vernade et al., 2017; Grover et al., 2018), and unknown delays in the adversarial setting (Li et al., 2019) are well-studied. The setting of unknown stochastic delays has been further extended to linear bandits (Zhou et al., 2019; Vernade et al., 2020), which is closest to our BO-SDF problem. However, these works on linear bandits assume that the black-box function is linear and has binary feedback, which may not hold in real-life applications.

8 Conclusion

This paper has studied the Bayesian optimization (BO) problem with stochastic delayed feedback (BO-SDF). We have used Gaussian processes as a surrogate model for the posterior belief of the unknown black-box function. To exploit the information of queries with delayed feedback (i.e., incomplete queries) for updating the posterior mean function, we have used censored feedback that assigns the minimum function value (or ‘0’) to the delayed feedback. It discourages the learner from selecting the incomplete queries (or queries near them) when choosing the next query and hence leads to more efficient exploration.

We have proposed UCB- and TS-based algorithms for the BO-SDF problem and given their sub-linear regret guarantees. Interestingly, when delays are fixed in a certain way, batch BO becomes a special case of the BO-SDF problem. Hence, we can adapt our algorithms for the batch BO problem. We have shown that our algorithms are theoretically and empirically better and do not need the special initialization scheme required for the existing batch BO algorithms. We also extended our algorithms to contextual Gaussian process bandits with stochastic delayed feedback, which is in itself a non-trivial contribution.

As the learning environment keeps changing in real life, an interesting future direction will be considering the BO-SDF problem with a non-stationary function. We also plan to generalize our algorithms to nonmyopic BO (Kharkovskii et al., 2020b; Ling et al., 2016), high-dimensional BO (Hoang et al., 2018), private outsourced BO (Kharkovskii et al., 2020a), preferential BO (Nguyen et al., 2021d), federated/collaborative BO (Dai et al., 2020b, 2021; Sim et al., 2021), meta-BO (Dai et al., 2022), and multi-fidelity BO (Zhang et al., 2017, 2019) settings, handle risks (Nguyen et al., 2021b, a; Tay et al., 2022) and information-theoretic acquisition functions (Nguyen et al., 2021c, e), incorporate early stopping (Dai et al., 2019), and/or recursive reasoning (Dai et al., 2020a), and consider its application to neural architecture search (Shu et al., 2022a, b) and inverse reinforcement learning (Balakrishnan et al., 2020). For applications with a huge budget of function evaluations, we would like to couple our algorithms with the use of distributed/decentralized (Chen et al., 2012, 2013a, 2013b, 2015; Hoang et al., 2016, 2019; Low et al., 2015; Ouyang & Low, 2018), online/stochastic (Hoang et al., 2015, 2017; Low et al., 2014; Xu et al., 2014; Yu et al., 2019b), or deep sparse GP models (Yu et al., 2019a, 2021) to represent the belief of the unknown objective function efficiently.

Acknowledgements

This research/project is supported by A*STAR under its RIE2020 Advanced Manufacturing and Engineering (AME) Industry Alignment Fund – Pre Positioning (IAF-PP) (Award A19E4a0101) and by the Singapore Ministry of Education Academic Research Fund Tier 1.

References

  • Agrawal & Goyal (2012) Agrawal, S. and Goyal, N. Analysis of Thompson sampling for the multi-armed bandit problem. In Proc. COLT, pp.  39.1–39.26, 2012.
  • Agrawal & Goyal (2013a) Agrawal, S. and Goyal, N. Further optimal regret bounds for Thompson sampling. In Proc. AISTATS, pp.  99–107, 2013a.
  • Agrawal & Goyal (2013b) Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proc. ICML, pp.  127–135, 2013b.
  • Auer et al. (2002) Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, pp.  235–256, 2002.
  • Balakrishnan et al. (2020) Balakrishnan, S., Nguyen, Q. P., Low, B. K. H., and Soh, H. Efficient exploration of reward functions in inverse reinforcement learning via Bayesian optimization. In Proc. NeurIPS, pp.  4187–4198, 2020.
  • Balandat et al. (2020) Balandat, M., Karrer, B., Jiang, D., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. BoTorch: a framework for efficient Monte-Carlo Bayesian optimization. In Proc. NeurIPS, pp.  21524–21538, 2020.
  • Besson & Kaufmann (2018) Besson, L. and Kaufmann, E. What doubling tricks can and can’t do for multi-armed bandits. arXiv preprint arXiv:1803.06971, 2018.
  • Chapelle (2014) Chapelle, O. Modeling delayed feedback in display advertising. In Proc. KDD, pp.  1097–1105, 2014.
  • Chapelle & Li (2011) Chapelle, O. and Li, L. An empirical evaluation of Thompson sampling. In Proc. NeurIPS, pp.  2249–2257, 2011.
  • Chen et al. (2012) Chen, J., Low, K. H., Tan, C. K.-Y., Oran, A., Jaillet, P., Dolan, J. M., and Sukhatme, G. S. Decentralized data fusion and active sensing with mobile sensors for modeling and predicting spatiotemporal traffic phenomena. In Proc. UAI, pp.  163–173, 2012.
  • Chen et al. (2013a) Chen, J., Cao, N., Low, K. H., Ouyang, R., Tan, C. K.-Y., and Jaillet, P. Parallel Gaussian process regression with low-rank covariance matrix approximations. In Proc. UAI, pp.  152–161, 2013a.
  • Chen et al. (2013b) Chen, J., Low, K. H., and Tan, C. K.-Y. Gaussian process-based decentralized data fusion and active sensing for mobility-on-demand system. In Proc. RSS, 2013b.
  • Chen et al. (2015) Chen, J., Low, K. H., Jaillet, P., and Yao, Y. Gaussian process decentralized data fusion and active sensing for spatiotemporal traffic modeling and prediction in mobility-on-demand systems. IEEE Trans. Autom. Sci. Eng., 12:901–921, 2015.
  • Chow & Chang (2006) Chow, S.-C. and Chang, M. Adaptive Design Methods in Clinical Trials. CRC Press, 2006.
  • Chowdhury & Gopalan (2017) Chowdhury, S. R. and Gopalan, A. On kernelized multi-armed bandits. In Proc. ICML, pp.  844–853, 2017.
  • Chowdhury & Gopalan (2019) Chowdhury, S. R. and Gopalan, A. On batch Bayesian optimization. arXiv preprint arXiv:1911.01032, 2019.
  • Chu et al. (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandits with linear payoff functions. In Proc. AISTATS, pp.  208–214, 2011.
  • Contal et al. (2013) Contal, E., Buffoni, D., Robicquet, A., and Vayatis, N. Parallel Gaussian process optimization with upper confidence bound and pure exploration. In Proc. ECML/PKDD, pp.  225–240, 2013.
  • Dai et al. (2019) Dai, Z., Yu, H., Low, B. K. H., and Jaillet, P. Bayesian optimization meets Bayesian optimal stopping. In Proc. ICML, pp.  1496–1506, 2019.
  • Dai et al. (2020a) Dai, Z., Chen, Y., Low, B. K. H., Jaillet, P., and Ho, T.-H. R2-B2: Recursive reasoning-based Bayesian optimization for no-regret learning in games. In Proc. ICML, pp.  2291–2301, 2020a.
  • Dai et al. (2020b) Dai, Z., Low, B. K. H., and Jaillet, P. Federated Bayesian optimization via Thompson sampling. In Proc. NeurIPS, pp.  9687–9699, 2020b.
  • Dai et al. (2021) Dai, Z., Low, B. K. H., and Jaillet, P. Differentially private federated Bayesian optimization with distributed exploration. In Proc. NeurIPS, pp.  9125–9139, 2021.
  • Dai et al. (2022) Dai, Z., Chen, Y., Yu, H., Low, B. K. H., and Jaillet, P. On provably robust meta-Bayesian optimization. In Proc. UAI, 2022.
  • Daxberger & Low (2017) Daxberger, E. A. and Low, B. K. H. Distributed batch Gaussian process optimization. In Proc. ICML, pp.  951–960, 2017.
  • Desautels et al. (2014) Desautels, T., Krause, A., and Burdick, J. W. Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization. JMLR, 15:3873–3923, 2014.
  • Diemert et al. (2017) Diemert, E., Meynet, J., Galland, P., and Lefortier, D. Attribution modeling increases efficiency of bidding in display advertising. In Proc. KDD Workshop on AdKDD and TargetAd, 2017.
  • Frazier (2018) Frazier, P. I. A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
  • Garivier & Cappé (2011) Garivier, A. and Cappé, O. The KL-UCB algorithm for bounded stochastic bandits and beyond. In Proc. COLT, pp.  359–376, 2011.
  • Garnett (2022) Garnett, R. Bayesian Optimization. Cambridge Univ. Press, 2022.
  • Gong et al. (2019) Gong, C., Peng, J., and Liu, Q. Quantile Stein variational gradient descent for batch Bayesian optimization. In Proc. ICML, pp.  2347–2356, 2019.
  • González et al. (2016) González, J., Dai, Z., Hennig, P., and Lawrence, N. Batch Bayesian optimization via local penalization. In Proc. AISTATS, pp.  648–657, 2016.
  • Grover et al. (2018) Grover, A., Markov, T., Attia, P., Jin, N., Perkins, N., Cheong, B., Chen, M., Yang, Z., Harris, S., Chueh, W., and Ermon, S. Best arm identification in multi-armed bandits with delayed feedback. In Proc. AISTATS, pp.  833–842, 2018.
  • Hoang et al. (2017) Hoang, Q. M., Hoang, T. N., and Low, K. H. A generalized stochastic variational Bayesian hyperparameter learning framework for sparse spectrum Gaussian process regression. In Proc. AAAI, pp.  2007–2014, 2017.
  • Hoang et al. (2015) Hoang, T. N., Hoang, Q. M., and Low, K. H. A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data. In Proc. ICML, pp.  569–578, 2015.
  • Hoang et al. (2016) Hoang, T. N., Hoang, Q. M., and Low, K. H. A distributed variational inference framework for unifying parallel sparse Gaussian process regression models. In Proc. ICML, pp.  382–391, 2016.
  • Hoang et al. (2018) Hoang, T. N., Hoang, Q. M., and Low, B. K. H. Decentralized high-dimensional Bayesian optimization with factor graphs. In Proc. AAAI, pp.  3231–3238, 2018.
  • Hoang et al. (2019) Hoang, T. N., Hoang, Q. M., Low, K. H., and How, J. P. Collective online learning of Gaussian processes in massive multi-agent systems. In Proc. AAAI, 2019.
  • Joulani et al. (2013) Joulani, P., Gyorgy, A., and Szepesvári, C. Online learning under delayed feedback. In Proc. ICML, pp.  1453–1461, 2013.
  • Kandasamy et al. (2018) Kandasamy, K., Krishnamurthy, A., Schneider, J., and Póczos, B. Parallelised Bayesian optimisation via Thompson sampling. In Proc. AISTATS, pp.  133–142, 2018.
  • Kaufmann et al. (2012) Kaufmann, E., Korda, N., and Munos, R. Thompson sampling: An asymptotically optimal finite-time analysis. In Proc. ALT, pp.  199–213, 2012.
  • Kharkovskii et al. (2020a) Kharkovskii, D., Dai, Z., and Low, B. K. H. Private outsourced Bayesian optimization. In Proc. ICML, pp.  5231–5242, 2020a.
  • Kharkovskii et al. (2020b) Kharkovskii, D., Ling, C. K., and Low, B. K. H. Nonmyopic Gaussian process optimization with macro-actions. In Proc. AISTATS, pp.  4593–4604, 2020b.
  • Krause & Ong (2011) Krause, A. and Ong, C. S. Contextual Gaussian process bandit optimization. In Proc. NeurIPS, pp.  2447–2455, 2011.
  • Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit Algorithms. Cambridge Univ. Press, 2020.
  • Li et al. (2019) Li, B., Chen, T., and Giannakis, G. B. Bandit online learning with unknown delays. In Proc. AISTATS, pp.  993–1002, 2019.
  • Li et al. (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. A contextual-bandit approach to personalized news article recommendation. In Proc. WWW, pp.  661–670, 2010.
  • Li et al. (2017) Li, L., Lu, Y., and Zhou, D. Provably optimal algorithms for generalized linear contextual bandits. In Proc. ICML, pp.  2071–2080, 2017.
  • Li & Scarlett (2022) Li, Z. and Scarlett, J. Gaussian process bandit optimization with few batches. In Proc. AISTATS, pp.  92–107, 2022.
  • Ling et al. (2016) Ling, C. K., Low, B. K. H., and Jaillet, P. Gaussian process planning with Lipschitz continuous reward functions: Towards unifying Bayesian optimization, active learning, and beyond. In Proc. AAAI, pp.  1860–1866, 2016.
  • Low et al. (2014) Low, K. H., Xu, N., Chen, J., Lim, K. K., and Özgül, E. B. Generalized online sparse Gaussian processes with application to persistent mobile robot localization. In Proc. ECML/PKDD Nectar Track, pp.  499–503, 2014.
  • Low et al. (2015) Low, K. H., Yu, J., Chen, J., and Jaillet, P. Parallel Gaussian process regression for big data: Low-rank representation meets Markov approximation. In Proc. AAAI, pp.  2821–2827, 2015.
  • Nguyen et al. (2021a) Nguyen, Q. P., Dai, Z., Low, B. K. H., and Jaillet, P. Optimizing conditional value-at-risk of black-box functions. In Proc. NeurIPS, pp.  4170–4180, 2021a.
  • Nguyen et al. (2021b) Nguyen, Q. P., Dai, Z., Low, B. K. H., and Jaillet, P. Value-at-risk optimization with Gaussian processes. In Proc. ICML, pp.  8063–8072, 2021b.
  • Nguyen et al. (2021c) Nguyen, Q. P., Low, B. K. H., and Jaillet, P. An information-theoretic framework for unifying active learning problems. In Proc. AAAI, pp.  9126–9134, 2021c.
  • Nguyen et al. (2021d) Nguyen, Q. P., Tay, S., Low, B. K. H., and Jaillet, P. Top-kk ranking Bayesian optimization. In Proc. AAAI, pp.  9135–9143, 2021d.
  • Nguyen et al. (2021e) Nguyen, Q. P., Wu, Z., Low, B. K. H., and Jaillet, P. Trusted-maximizers entropy search for efficient Bayesian optimization. In Proc. UAI, pp.  1486–1495, 2021e.
  • Ouyang & Low (2018) Ouyang, R. and Low, K. H. Gaussian process decentralized data fusion meets transfer learning in large-scale distributed cooperative perception. In Proc. AAAI, pp.  3876–3883, 2018.
  • Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • Shahriari et al. (2015) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and De Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
  • Shu et al. (2022a) Shu, Y., Cai, S., Dai, Z., Ooi, B. C., and Low, B. K. H. NASI: Label- and data-agnostic neural architecture search at initialization. In Proc. ICLR, 2022a.
  • Shu et al. (2022b) Shu, Y., Chen, Y., Dai, Z., and Low, B. K. H. Neural ensemble search via Bayesian sampling. In Proc. UAI, 2022b.
  • Sim et al. (2021) Sim, R. H. L., Zhang, Y., Low, B. K. H., and Jaillet, P. Collaborative Bayesian optimization with fair regret. In Proc. ICML, pp.  9691–9701, 2021.
  • Slivkins (2019) Slivkins, A. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning, 12(1–2):1–286, 2019.
  • Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Proc. NeurIPS, 2012.
  • Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proc. ICML, pp.  1015–1022, 2010.
  • Takahashi & Suzuki (2021) Takahashi, A. and Suzuki, T. Bayesian optimization design for dose-finding based on toxicity and efficacy outcomes in phase I/II clinical trials. Pharmaceutical Statistics, 20(3):422–439, 2021.
  • Tay et al. (2022) Tay, S. S., Foo, C. S., Urano, D., Leong, R. C. X., and Low, B. K. H. Efficient distributionally robust Bayesian optimization with worst-case sensitivity. In Proc. ICML, 2022.
  • Thompson (1933) Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3–4):285–294, 1933.
  • Ueno et al. (2016) Ueno, T., Rhone, T. D., Hou, Z., Mizoguchi, T., and Tsuda, K. Combo: An efficient Bayesian optimization library for materials science. Materials Discovery, 4:18–21, 2016.
  • Valko et al. (2013) Valko, M., Korda, N., Munos, R., Flaounas, I., and Cristianini, N. Finite-time analysis of kernelised contextual bandits. In Proc. UAI, pp.  654–663, 2013.
  • Verma & Hanawal (2020) Verma, A. and Hanawal, M. K. Stochastic network utility maximization with unknown utilities: Multi-armed bandits approach. In Proc. IEEE INFOCOM, pp.  189–198, 2020.
  • Verma et al. (2019) Verma, A., Hanawal, M., Rajkumar, A., and Sankaran, R. Censored semi-bandits: A framework for resource allocation with censored feedback. In Proc. NeurIPS, pp.  14499–14509, 2019.
  • Verma et al. (2021) Verma, A., Hanawal, M. K., Rajkumar, A., and Sankaran, R. Censored semi-bandits for resource allocation. arXiv preprint arXiv:2104.05781, 2021.
  • Vernade et al. (2017) Vernade, C., Cappé, O., and Perchet, V. Stochastic bandit models for delayed conversions. In Proc. UAI, 2017.
  • Vernade et al. (2020) Vernade, C., Carpentier, A., Lattimore, T., Zappella, G., Ermis, B., and Brueckner, M. Linear bandits with stochastic delayed feedback. In Proc. ICML, pp.  9712–9721, 2020.
  • Wistuba et al. (2015) Wistuba, M., Schilling, N., and Schmidt-Thieme, L. Learning hyperparameter optimization initializations. In Proc. IEEE International Conference on Data Science and Advanced Analytics, 2015.
  • Xu et al. (2014) Xu, N., Low, K. H., Chen, J., Lim, K. K., and Özgül, E. B. GP-Localize: Persistent mobile robot localization using online sparse Gaussian process observation model. In Proc. AAAI, pp.  2585–2592, 2014.
  • Yu et al. (2019a) Yu, H., Chen, Y., Dai, Z., Low, B. K. H., and Jaillet, P. Implicit posterior variational inference for deep Gaussian processes. In Proc. NeurIPS, pp.  14475–14486, 2019a.
  • Yu et al. (2019b) Yu, H., Hoang, T. N., Low, K. H., and Jaillet, P. Stochastic variational inference for Bayesian sparse Gaussian process regression. In Proc. IJCNN, 2019b.
  • Yu et al. (2021) Yu, H., Liu, D., Low, K. H., and Jaillet, P. Convolutional normalizing flows for deep Gaussian processes. In Proc. IJCNN, 2021.
  • Zhang et al. (2017) Zhang, Y., Hoang, T. N., Low, B. K. H., and Kankanhalli, M. Information-based multi-fidelity Bayesian optimization. In Proc. NIPS Workshop on Bayesian Optimization, 2017.
  • Zhang et al. (2019) Zhang, Y., Dai, Z., and Low, B. K. H. Bayesian optimization with binary auxiliary information. In Proc. UAI, pp.  1222–1232, 2019.
  • Zhong et al. (2017) Zhong, J., Huang, Y., and Liu, J. Asynchronous parallel empirical variance guided algorithms for the thresholding bandit problem. arXiv preprint arXiv:1704.04567, 2017.
  • Zhou et al. (2019) Zhou, Z., Xu, R., and Blanchet, J. Learning in generalized linear contextual bandits with stochastic delays. In Proc. NeurIPS, pp.  5197–5208, 2019.

Appendix

 

Appendix A Proofs of Theoretical Results

A.1 Proof of Theorem 1

We define φ(x)=k(x,)\varphi(x)=k(x,\cdot) as the feature map associated with the kernel kk, i.e., φ\varphi maps any input x𝒬x\in\mathcal{Q} into a feature in the (potentially infinite-dimensional) feature space of the RKHS associated with kk. The reproducing property tells us that f(x)=φ(x),fk=φ(x)ff(x)=\langle\varphi(x),f\rangle_{k}=\varphi(x)^{\top}f. Next, we define Vt(λ)s=1t1φ(xs)φ(xs)+λIV_{t}(\lambda)\triangleq\sum^{t-1}_{s=1}\varphi(x_{s})\varphi(x_{s})^{\top}+\lambda I, Bts=1t1φ(xs)y~s,tB_{t}\triangleq\sum^{t-1}_{s=1}\varphi(x_{s})\tilde{y}_{s,t}, and θ^tWVt(λ)1Bt\hat{\theta}^{W}_{t}\triangleq V_{t}(\lambda)^{-1}B_{t}. Of note, the GP posterior mean (1) can be equivalently expressed as μt1(x)=φ(x)Vt(λ)1Bt\mu_{t-1}(x)=\varphi(x)^{\top}V_{t}(\lambda)^{-1}B_{t}, and the GP posterior variance (1) can be equivalently written as σt12(x,x)=σt12(x)=λφ(x)Vt(λ)1φ(x)=λφ(x)Vt(λ)12\sigma^{2}_{t-1}(x,x)=\sigma^{2}_{t-1}(x)=\lambda\varphi(x)V_{t}(\lambda)^{-1}\varphi(x)=\lambda\left\|\varphi(x)\right\|_{V_{t}(\lambda)^{-1}}^{2}. In this section, we prove 1, which is restated below.
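For concreteness, the censored GP posterior used throughout this proof (i.e., the kernel-matrix form of Eq. 1 evaluated on the censored observations) can be computed as in the following minimal NumPy sketch; the SE kernel, the 1-dimensional domain, and all variable names are illustrative assumptions rather than the exact implementation.

import numpy as np

def se_kernel(a, b, lengthscale=0.1):
    # Squared-exponential kernel matrix k(a_i, b_j) for 1-D inputs a (n,) and b (q,).
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def censored_posterior(X, y, delays, t, m, x_query, lam=1.01, lengthscale=0.1):
    # Censored observations y_tilde_{s,t} = y_s * 1{d_s <= min(m, t - s)}: an observation
    # is used only if its feedback has arrived within the waiting window m.
    s = np.arange(1, len(X) + 1)
    y_tilde = y * (delays <= np.minimum(m, t - s))
    K = se_kernel(X, X, lengthscale) + lam * np.eye(len(X))            # K_{t-1} + lambda * I
    k_star = se_kernel(X, x_query, lengthscale)                        # k_{t-1}(x) for each candidate x
    mu = k_star.T @ np.linalg.solve(K, y_tilde)                        # posterior mean mu_{t-1}(x)
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)    # posterior variance sigma^2_{t-1}(x)
    return mu, var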

See Theorem 1.

Proof.

Firstly, we can decompose BtB_{t} as follows:

Bt=s=1t1φ(xs)y~s,t=s=1t1φ(xs)ys𝟙{dsm}+s=tmt1φ(xs)ys(𝟙{dsts}𝟙{dsm}).B_{t}=\sum^{t-1}_{s=1}\varphi(x_{s})\tilde{y}_{s,t}=\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}\mathbbm{1}\{d_{s}\leq m\}+\sum^{t-1}_{s=t-m}\varphi(x_{s})y_{s}\left(\mathbbm{1}\{d_{s}\leq t-s\}-\mathbbm{1}\{d_{s}\leq m\}\right).

Next, we can plug in this decomposition of BtB_{t} to derive:

|μt1(x)ρmf(x)|=|φ(x)Vt(λ)1Btρmφ(x)f||φ(x)Vt(λ)1(s=1t1φ(xs)ys𝟙{dsm})ρmφ(x)f|+|φ(x)Vt(λ)1(s=tmt1φ(xs)ys(𝟙{dsts}𝟙{dsm}))|.\begin{split}|\mu_{t-1}(x)&-\rho_{m}f(x)|=|\varphi(x)^{\top}V_{t}(\lambda)^{-1}B_{t}-\rho_{m}\varphi(x)^{\top}f|\\ &\leq\left|\varphi(x)^{\top}V_{t}(\lambda)^{-1}\left(\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}\mathbbm{1}\{d_{s}\leq m\}\right)-\rho_{m}\varphi(x)^{\top}f\right|+\\ &\left|\varphi(x)^{\top}V_{t}(\lambda)^{-1}\left(\sum^{t-1}_{s=t-m}\varphi(x_{s})y_{s}\left(\mathbbm{1}\{d_{s}\leq t-s\}-\mathbbm{1}\{d_{s}\leq m\}\right)\right)\right|.\end{split} (4)

in which the equality makes use of the expression for the GP posterior mean: μt1(x)=φ(x)Vt(λ)1Bt\mu_{t-1}(x)=\varphi(x)^{\top}V_{t}(\lambda)^{-1}B_{t}, and the reproducing property: f(x)=φ(x)ff(x)=\varphi(x)^{\top}f. Next, we will separately provide upper bounds on the two terms in Eq. 4.

First, we upper-bound the second term in Eq. 4 as follows:

|φ(x)Vt(λ)1(s=tmt1φ(xs)ys(𝟙{dsts}𝟙{dsm}))|φ(x)Vt(λ)1Vt(λ)1(s=tmt1φ(xs)ys(𝟙{dsts}𝟙{dsm}))Vt(λ)σt1(x)1λs=tmt1φ(xs)ys(𝟙{dsts}𝟙{dsm})Vt(λ)1σt1(x)s=tmt1φ(xs)ys(𝟙{dsts}𝟙{dsm})Vt(λ)1σt1(x)ys=tmt1φ(xs)Vt(λ)1=σt1(x)yλs=tmt1σt1(xs)yσt1(x)s=tmt1σt1(xs).\begin{split}\Big{|}\varphi(x)^{\top}V_{t}(\lambda)^{-1}&\left(\sum^{t-1}_{s=t-m}\varphi(x_{s})y_{s}\left(\mathbbm{1}\{d_{s}\leq t-s\}-\mathbbm{1}\{d_{s}\leq m\}\right)\right)\Big{|}\\ &\leq\left\|\varphi(x)\right\|_{V_{t}(\lambda)^{-1}}\left\|V_{t}(\lambda)^{-1}\left(\sum^{t-1}_{s=t-m}\varphi(x_{s})y_{s}\left(\mathbbm{1}\{d_{s}\leq t-s\}-\mathbbm{1}\{d_{s}\leq m\}\right)\right)\right\|_{{V_{t}(\lambda)}}\\ &\leq\sigma_{t-1}(x)\frac{1}{\sqrt{\lambda}}\left\|\sum^{t-1}_{s=t-m}\varphi(x_{s})y_{s}\left(\mathbbm{1}\{d_{s}\leq t-s\}-\mathbbm{1}\{d_{s}\leq m\}\right)\right\|_{V_{t}(\lambda)^{-1}}\\ &\leq\sigma_{t-1}(x)\sum^{t-1}_{s=t-m}\left\|\varphi(x_{s})y_{s}\left(\mathbbm{1}\{d_{s}\leq t-s\}-\mathbbm{1}\{d_{s}\leq m\}\right)\right\|_{V_{t}(\lambda)^{-1}}\\ &\leq\sigma_{t-1}(x)\mathcal{B}_{y}\sum^{t-1}_{s=t-m}\left\|\varphi(x_{s})\right\|_{V_{t}(\lambda)^{-1}}\\ &=\sigma_{t-1}(x)\frac{\mathcal{B}_{y}}{\sqrt{\lambda}}\sum^{t-1}_{s=t-m}\sigma_{t-1}(x_{s})\\ &\leq\mathcal{B}_{y}\sigma_{t-1}(x)\sum^{t-1}_{s=t-m}\sigma_{t-1}(x_{s}).\end{split} (5)

The second inequality and the equality have made use of the expression for the GP posterior variance: σt12(x)=λφ(x)Vt(λ)12\sigma^{2}_{t-1}(x)=\lambda\left\|\varphi(x)\right\|_{V_{t}(\lambda)^{-1}}^{2}; the third and last inequalities have made use of the fact that 1λ<1\frac{1}{\sqrt{\lambda}}<1 because λ>1\lambda>1 (note that if 0<λ10<\lambda\leq 1 instead, there will be an additional multiplicative factor of 1/λ1/\lambda); and the fourth inequality follows since we have assumed that all observations are bounded: |ys|y,s1|y_{s}|\leq\mathcal{B}_{y},\forall s\geq 1.

Next, define μt1(x)\mu^{\prime}_{t-1}(x) as the standard GP posterior mean without censoring the observations (i.e., replace every y~s,t\tilde{y}_{s,t} by ysy_{s} in Eq. 1). Now the first term in Eq. 4 can be upper-bounded as:

|φ(x)Vt(λ)1(s=1t1φ(xs)ys𝟙{dsm})ρmφ(x)f|=|φ(x)Vt(λ)1(s=1t1φ(xs)ys(ρm+ηs))ρmφ(x)f||φ(x)Vt(λ)1s=1t1φ(xs)ysηs|+ρm|φ(x)Vt(λ)1s=1t1φ(xs)ysφ(x)f|=|φ(x)Vt(λ)1s=1t1φ(xs)ysηs|+ρm|μt1(x)f(x)||φ(x)Vt(λ)1s=1t1φ(xs)ysηs|+(f+R2(γt1+1+log(2/δ)))σt1(x).\begin{split}\Big{|}\varphi(x)^{\top}V_{t}(\lambda)^{-1}&\left(\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}\mathbbm{1}\{d_{s}\leq m\}\right)-\rho_{m}\varphi(x)^{\top}f\Big{|}\\ &=\Big{|}\varphi(x)^{\top}V_{t}(\lambda)^{-1}\left(\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}(\rho_{m}+\eta_{s})\right)-\rho_{m}\varphi(x)^{\top}f\Big{|}\\ &\leq\Big{|}\varphi(x)^{\top}V_{t}(\lambda)^{-1}\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}\eta_{s}\Big{|}+\rho_{m}\Big{|}\varphi(x)^{\top}V_{t}(\lambda)^{-1}\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}-\varphi(x)^{\top}f\Big{|}\\ &=\Big{|}\varphi(x)^{\top}V_{t}(\lambda)^{-1}\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}\eta_{s}\Big{|}+\rho_{m}\Big{|}\mu^{\prime}_{t-1}(x)-f(x)\Big{|}\\ &\leq\Big{|}\varphi(x)^{\top}V_{t}(\lambda)^{-1}\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}\eta_{s}\Big{|}+\left(\mathcal{B}_{f}+R\sqrt{2(\gamma_{t-1}+1+\log(2/\delta))}\right)\sigma_{t-1}(x).\end{split} (6)

In the first equality, we have defined ηs𝟙{dsm}ρm\eta_{s}\triangleq\mathbbm{1}\{d_{s}\leq m\}-\rho_{m}. In the second equality, we have again used the expression for the GP posterior mean and the reproducing property for ff. The last inequality follows from Theorem 2 of Chowdhury & Gopalan (2017), which holds with probability 1δ/2\geq 1-\delta/2. We have also used the fact that ρm1\rho_{m}\leq 1 in the last inequality.

To bound the first term in Eq. 6, we first define ηsysηs\eta^{\prime}_{s}\triangleq y_{s}\eta_{s}, which is by definition a y\mathcal{B}_{y}-sub-Gaussian noise. Also define 𝜼1:t1=[ηs]s=1,,t1\boldsymbol{\eta}^{\prime}_{1:t-1}=[\eta^{\prime}_{s}]_{s=1,\ldots,t-1}, which is a (t1)×1(t-1)\times 1-dimensional vector, and define Φt1[φ(x1),,φ(xt1)]\Phi_{t-1}\triangleq[\varphi(x_{1}),\ldots,\varphi(x_{t-1})]^{\top}, which is a (t1)×(t-1)\times\infty-dimensional matrix. These definitions imply that 𝐊t1=Φt1Φt1\mathbf{K}_{t-1}=\Phi_{t-1}\Phi_{t-1}^{\top} and Vt(λ)=Φt1Φt1+λIV_{t}(\lambda)=\Phi_{t-1}^{\top}\Phi_{t-1}+\lambda I. Also define λ=1+η\lambda=1+\eta with η>0\eta>0; the value of λ\lambda is set as (1+2/T)(1+2/T) following Chowdhury & Gopalan (2017), but one can use the Doubling Trick (Besson & Kaufmann, 2018) when TT is unknown. Based on these definitions, we can upper-bound s=1t1φ(xs)ysηsVt(λ)1\left\|\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}\eta_{s}\right\|_{V_{t}(\lambda)^{-1}} by

s=1t1φ(xs)ysηsVt(λ)1=s=1t1φ(xs)ηsVt(λ)1=(s=1t1ηsφ(xs))Vt(λ)1(s=1t1ηsφ(xs))=𝜼1:t1Φt1(Φt1Φt1+λI)1Φt1𝜼1:t1=𝜼1:t1Φt1Φt1(Φt1Φt1+λI)1𝜼1:t1=𝜼1:t1𝐊t1(𝐊t1+(1+η)I)1𝜼1:t1𝜼1:t1(𝐊t1+ηI)(𝐊t1+(1+η)I)1𝜼1:t1=𝜼1:t1((𝐊t1+ηI)1+I)1𝜼1:t1=𝜼1:t1((𝐊t1+ηI)1+I)1.\begin{split}\left\|\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}\eta_{s}\right\|_{V_{t}(\lambda)^{-1}}&=\left\|\sum^{t-1}_{s=1}\varphi(x_{s})\eta^{\prime}_{s}\right\|_{V_{t}(\lambda)^{-1}}=\sqrt{\left(\sum^{t-1}_{s=1}\eta^{\prime}_{s}\varphi(x_{s})^{\top}\right)V_{t}(\lambda)^{-1}\left(\sum^{t-1}_{s=1}\eta^{\prime}_{s}\varphi(x_{s})\right)}\\ &=\sqrt{\boldsymbol{\eta}^{\prime\top}_{1:t-1}\Phi_{t-1}(\Phi_{t-1}^{\top}\Phi_{t-1}+\lambda I)^{-1}\Phi_{t-1}^{\top}\boldsymbol{\eta}^{\prime}_{1:t-1}}\\ &=\sqrt{\boldsymbol{\eta}^{\prime\top}_{1:t-1}\Phi_{t-1}\Phi_{t-1}^{\top}(\Phi_{t-1}\Phi_{t-1}^{\top}+\lambda I)^{-1}\boldsymbol{\eta}^{\prime}_{1:t-1}}\\ &=\sqrt{\boldsymbol{\eta}^{\prime\top}_{1:t-1}\mathbf{K}_{t-1}(\mathbf{K}_{t-1}+(1+\eta)I)^{-1}\boldsymbol{\eta}^{\prime}_{1:t-1}}\\ &\leq\sqrt{\boldsymbol{\eta}^{\prime\top}_{1:t-1}(\mathbf{K}_{t-1}+\eta I)(\mathbf{K}_{t-1}+(1+\eta)I)^{-1}\boldsymbol{\eta}^{\prime}_{1:t-1}}\\ &=\sqrt{\boldsymbol{\eta}^{\prime\top}_{1:t-1}((\mathbf{K}_{t-1}+\eta I)^{-1}+I)^{-1}\boldsymbol{\eta}^{\prime}_{1:t-1}}\\ &=\left\|\boldsymbol{\eta}^{\prime}_{1:t-1}\right\|_{((\mathbf{K}_{t-1}+\eta I)^{-1}+I)^{-1}}.\end{split} (7)

In the equality in the third line, we have made use of the following matrix equality: (Φt1Φt1+λI)1Φt1=Φt1(Φt1Φt1+λI)1(\Phi_{t-1}^{\top}\Phi_{t-1}+\lambda I)^{-1}\Phi_{t-1}^{\top}=\Phi_{t-1}^{\top}(\Phi_{t-1}\Phi_{t-1}^{\top}+\lambda I)^{-1}, and in the second last equality, we have used the matrix equality of (𝐊t1+ηI)(𝐊t1+ηI+I)1=((𝐊t1+ηI)1+I)1(\mathbf{K}_{t-1}+\eta I)(\mathbf{K}_{t-1}+\eta I+I)^{-1}=((\mathbf{K}_{t-1}+\eta I)^{-1}+I)^{-1}. Next, making use of Theorem 1 of Chowdhury & Gopalan (2017) and following similar steps as those in the proof there, we have that with probability of 1δ/2\geq 1-\delta/2,

𝜼1:t1((𝐊t1+ηI)1+I)1y2logdet((1+η)I+𝐊t1)δ/2y2(γt1+1+log(2/δ)).\begin{split}\left\|\boldsymbol{\eta}^{\prime}_{1:t-1}\right\|_{((\mathbf{K}_{t-1}+\eta I)^{-1}+I)^{-1}}&\leq\mathcal{B}_{y}\sqrt{2\log\frac{\sqrt{\text{det}((1+\eta)I+\mathbf{K}_{t-1})}}{\delta/2}}\\ &\leq\mathcal{B}_{y}\sqrt{2(\gamma_{t-1}+1+\log(2/\delta))}.\end{split} (8)

Next, combining Eq. 7 and Eq. 8 above allows us to upper-bound the first term in Eq. 6:

|φ(x)Vt(λ)1s=1t1φ(xs)ysηs|φ(x)Vt(λ)1Vt(λ)1s=1t1φ(xs)ysηsVt(λ)σt1(x)1λs=1t1φ(xs)ysηsVt(λ)1σt1(x)y2(γt1+1+log(2/δ)),\begin{split}\Big{|}\varphi(x)^{\top}V_{t}(\lambda)^{-1}\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}\eta_{s}\Big{|}&\leq\left\|\varphi(x)\right\|_{V_{t}(\lambda)^{-1}}\left\|V_{t}(\lambda)^{-1}\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}\eta_{s}\right\|_{V_{t}(\lambda)}\\ &\leq\sigma_{t-1}(x)\frac{1}{\sqrt{\lambda}}\left\|\sum^{t-1}_{s=1}\varphi(x_{s})y_{s}\eta_{s}\right\|_{V_{t}(\lambda)^{-1}}\\ &\leq\sigma_{t-1}(x)\mathcal{B}_{y}\sqrt{2(\gamma_{t-1}+1+\log(2/\delta))},\end{split} (9)

in which the second inequality has made use of the expression for the GP posterior variance: σt12(x)=λφ(x)Vt(λ)12\sigma^{2}_{t-1}(x)=\lambda\left\|\varphi(x)\right\|_{V_{t}(\lambda)^{-1}}^{2}, and the last inequality follows from Eq. 7 and Eq. 8 (and hence holds with probability of 1δ/2\geq 1-\delta/2), as well as the fact that λ>1\lambda>1. Now, we can plug in Eq. 9 as an upper bound on the first term in Eq. 6, and then use the resulting Eq. 6 as an upper bound on the first term in Eq. 4. Combining this with the upper bound on the second term of Eq. 4 given by Eq. 5, we have that

|μt1(x)ρmf(x)|\displaystyle|\mu_{t-1}(x)-\rho_{m}f(x)| yσt1(x)s=tmt1σt1(xs)+(f+R2(γt1+1+log(2/δ)))σt1(x)+\displaystyle\leq\mathcal{B}_{y}\sigma_{t-1}(x)\sum^{t-1}_{s=t-m}\sigma_{t-1}(x_{s})+\left(\mathcal{B}_{f}+R\sqrt{2(\gamma_{t-1}+1+\log(2/\delta))}\right)\sigma_{t-1}(x)+
σt1(x)y2(γt1+1+log(2/δ))\displaystyle\quad\sigma_{t-1}(x)\mathcal{B}_{y}\sqrt{2(\gamma_{t-1}+1+\log(2/\delta))}
=σt1(x)(ys=tmt1σt1(xs)+f+(R+y)2(γt1+1+log(2/δ)))\displaystyle=\sigma_{t-1}(x)\left(\mathcal{B}_{y}\sum^{t-1}_{s=t-m}\sigma_{t-1}(x_{s})+\mathcal{B}_{f}+(R+\mathcal{B}_{y})\sqrt{2(\gamma_{t-1}+1+\log(2/\delta))}\right)
=νtσt1(x),\displaystyle=\nu_{t}\sigma_{t-1}(x),

in which we have plugged in the definition of νt\nu_{t} in the last equality. This completes the proof. ∎

A.2 Proof of Theorem 2

See Theorem 2.

Proof.

To begin with, we can upper-bound the instantaneous regret rt=f(x)f(xt)r_{t}=f(x^{*})-f(x_{t}) as

rt=f(x)f(xt)=1ρm[ρmf(x)ρmf(xt)]1ρm[μt1(x)+νtσt1(x)ρmf(xt)]1ρm[μt1(xt)+νtσt1(xt)ρmf(xt)]1ρm2νtσt1(xt),\begin{split}r_{t}&=f(x^{*})-f(x_{t})=\frac{1}{\rho_{m}}\left[\rho_{m}f(x^{*})-\rho_{m}f(x_{t})\right]\\ &\leq\frac{1}{\rho_{m}}\left[\mu_{t-1}(x^{*})+\nu_{t}\sigma_{t-1}(x^{*})-\rho_{m}f(x_{t})\right]\\ &\leq\frac{1}{\rho_{m}}\left[\mu_{t-1}(x_{t})+\nu_{t}\sigma_{t-1}(x_{t})-\rho_{m}f(x_{t})\right]\\ &\leq\frac{1}{\rho_{m}}2\nu_{t}\sigma_{t-1}(x_{t}),\end{split} (10)

in which the first and last inequalities have made use of Theorem 1, and the second inequality follows because xtx_{t} is selected using the UCB policy: xt=argmaxx𝒬(μt1(x)+νtσt1(x))x_{t}=\arg\!\max_{x\in\mathcal{Q}}\left(\mu_{t-1}(x)+\nu_{t}\sigma_{t-1}(x)\right) (2). Now the cumulative regret can be upper-bounded as

T=t=1Trt2ρmt=1Tνtσt1(xt)=2ρmt=1Tσt1(xt)βt+2ρmt=1Tσt1(xt)(ys=tmt1σt1(xs)),\mathfrak{R}_{T}=\sum^{T}_{t=1}r_{t}\leq\frac{2}{\rho_{m}}\sum^{T}_{t=1}\nu_{t}\sigma_{t-1}(x_{t})=\frac{2}{\rho_{m}}\sum^{T}_{t=1}\sigma_{t-1}(x_{t})\beta_{t}+\frac{2}{\rho_{m}}\sum^{T}_{t=1}\sigma_{t-1}(x_{t})\left(\mathcal{B}_{y}\sum^{t-1}_{s=t-m}\sigma_{t-1}(x_{s})\right), (11)

where we have plugged in the expression of νt\nu_{t} in the equality.

Next, we use I(𝐲1:T;𝐟1:T)I(\mathbf{y}_{1:T};\mathbf{f}_{1:T}) to denote the information gain from the noisy observations in the first TT iterations about the objective function ff. The first term in Eq. 11 can be upper-bounded as:

2ρmt=1Tσt1(xt)βt2ρmβTt=1Tσt1(xt)2ρmβTTt=1Tσt12(xt)2ρmβTTt=1T1log(1+λ1)log(1+λ1σt12(xt))2ρm1log(1+λ1)βTT212t=1Tlog(1+λ1σt12(xt))2ρm1log(1+λ1)βTT2I(𝐲1:T;𝐟)2ρm2log(1+λ1)βTTγT2ρmC1βTTγT.\begin{split}\frac{2}{\rho_{m}}\sum^{T}_{t=1}\sigma_{t-1}(x_{t})\beta_{t}&\leq\frac{2}{\rho_{m}}\beta_{T}\sum^{T}_{t=1}\sigma_{t-1}(x_{t})\\ &\leq\frac{2}{\rho_{m}}\beta_{T}\sqrt{T}\sqrt{\sum^{T}_{t=1}\sigma^{2}_{t-1}(x_{t})}\\ &\leq\frac{2}{\rho_{m}}\beta_{T}\sqrt{T}\sqrt{\sum^{T}_{t=1}\frac{1}{\log(1+\lambda^{-1})}\log(1+\lambda^{-1}\sigma^{2}_{t-1}(x_{t}))}\\ &\leq\frac{2}{\rho_{m}}\sqrt{\frac{1}{\log(1+\lambda^{-1})}}\beta_{T}\sqrt{T}\sqrt{2\frac{1}{2}\sum^{T}_{t=1}\log(1+\lambda^{-1}\sigma^{2}_{t-1}(x_{t}))}\\ &\leq\frac{2}{\rho_{m}}\sqrt{\frac{1}{\log(1+\lambda^{-1})}}\beta_{T}\sqrt{T}\sqrt{2I(\mathbf{y}_{1:T};\mathbf{f})}\\ &\leq\frac{2}{\rho_{m}}\sqrt{\frac{2}{\log(1+\lambda^{-1})}}\beta_{T}\sqrt{T}\sqrt{\gamma_{T}}\\ &\leq\frac{2}{\rho_{m}}C_{1}\beta_{T}\sqrt{T\gamma_{T}}.\end{split} (12)

The first inequality follows since βt\beta_{t} is monotonically increasing in tt, the inequality in the second line follows because λ1aλ1log(1+λ1)log(1+λ1a)\lambda^{-1}a\leq\frac{\lambda^{-1}}{\log(1+\lambda^{-1})}\log(1+\lambda^{-1}a) for a(0,1)a\in(0,1) (substitute a=σt12(xt)a=\sigma^{2}_{t-1}(x_{t})), and the inequality in the fourth line follows from plugging in the expression for the information gain: I(𝐲1:T;𝐟)=12t=1Tlog(1+λ1σt12(xt))I(\mathbf{y}_{1:T};\mathbf{f})=\frac{1}{2}\sum^{T}_{t=1}\log(1+\lambda^{-1}\sigma^{2}_{t-1}(x_{t})) (Lemma 5.3 of Srinivas et al. (2010)).

Next, we can derive an upper bound on the second term in Eq. 11:

2ρmt=1Tσt1(xt)(ys=tmt1σt1(xs))=2ρmyt=1Ts=tmt1σt1(xt)σt1(xs)1ρmyt=1Ts=tmt1(σt12(xt)+σt12(xs))1ρmyt=1Ts=tmt1(σt12(xt)+σs12(xs))2mρmyt=1Tσt12(xt)2mρmyC12γT.\begin{split}\frac{2}{\rho_{m}}\sum^{T}_{t=1}\sigma_{t-1}(x_{t})&\left(\mathcal{B}_{y}\sum^{t-1}_{s=t-m}\sigma_{t-1}(x_{s})\right)=\frac{2}{\rho_{m}}\mathcal{B}_{y}\sum^{T}_{t=1}\sum^{t-1}_{s=t-m}\sigma_{t-1}(x_{t})\sigma_{t-1}(x_{s})\\ &\leq\frac{1}{\rho_{m}}\mathcal{B}_{y}\sum^{T}_{t=1}\sum^{t-1}_{s=t-m}\left(\sigma^{2}_{t-1}(x_{t})+\sigma^{2}_{t-1}(x_{s})\right)\\ &\leq\frac{1}{\rho_{m}}\mathcal{B}_{y}\sum^{T}_{t=1}\sum^{t-1}_{s=t-m}\left(\sigma^{2}_{t-1}(x_{t})+\sigma^{2}_{s-1}(x_{s})\right)\\ &\leq\frac{2m}{\rho_{m}}\mathcal{B}_{y}\sum^{T}_{t=1}\sigma^{2}_{t-1}(x_{t})\\ &\leq\frac{2m}{\rho_{m}}\mathcal{B}_{y}C_{1}^{2}\gamma_{T}.\end{split} (13)

The first inequality has made use of the inequality ab(a2+b2)/2ab\leq(a^{2}+b^{2})/2, the second inequality follows since σt12(xs)σs12(xs)\sigma^{2}_{t-1}(x_{s})\leq\sigma^{2}_{s-1}(x_{s}), which is because σt12(xs)\sigma^{2}_{t-1}(x_{s}) is calculated by conditioning on more input locations than σs12(xs)\sigma^{2}_{s-1}(x_{s}), and the last inequality follows from the intermediate steps in the derivation of Eq. 12, which imply that t=1Tσt12(xt)C12γT\sum^{T}_{t=1}\sigma^{2}_{t-1}(x_{t})\leq C_{1}^{2}\gamma_{T}.

Lastly, we can plug in Eq. 12 and Eq. 13 as upper bounds on the two terms in Eq. 11, to obtain:

T\displaystyle\mathfrak{R}_{T} 2ρmC1βTTγT+2mρmyC12γT=2ρm(C1βTTγT+myC12γT).\displaystyle\leq\frac{2}{\rho_{m}}C_{1}\beta_{T}\sqrt{T\gamma_{T}}+\frac{2m}{\rho_{m}}\mathcal{B}_{y}C_{1}^{2}\gamma_{T}=\frac{2}{\rho_{m}}\left(C_{1}\beta_{T}\sqrt{T\gamma_{T}}+m\mathcal{B}_{y}C_{1}^{2}\gamma_{T}\right).

This completes the proof. ∎

A.3 Proof of Theorem 3

First, we define βtf+(R+y)2(γt1+1+log(4/δ))\beta_{t}\triangleq\mathcal{B}_{f}+(R+\mathcal{B}_{y})\sqrt{2(\gamma_{t-1}+1+\log(4/\delta))}, νtys=tmt1σt1(xs)+βt\nu_{t}\triangleq\mathcal{B}_{y}\sum^{t-1}_{s=t-m}\sigma_{t-1}(x_{s})+\beta_{t}, and ctνt(1+2log(|𝒬|t2))c_{t}\triangleq\nu_{t}(1+\sqrt{2\log(|\mathcal{Q}|t^{2})}). Note that we have replaced the δ\delta in the definition of βt\beta_{t} (Theorem 1) by δ/2\delta/2. We use t1\mathcal{F}_{t-1} to denote the filtration containing the history of selected inputs and observed outputs up to iteration t1t-1. To begin with, we define two events Ef(t)E^{f}(t) and Eft(t)E^{f_{t}}(t) through the following two lemmas.
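Before proceeding with the analysis, the quantities beta_t, nu_t and c_t defined above, together with the Thompson-sampling step of GP-TS-SDF over a finite domain, could be computed as in the minimal sketch below; drawing f_t from the scaled GP posterior over a discrete domain and the variable names are illustrative assumptions rather than the exact implementation.

import numpy as np

def ts_parameters(recent_sigmas, gamma_prev, B_f, B_y, R, delta, Q_size, t):
    # beta_t, nu_t and c_t exactly as defined above; recent_sigmas holds the posterior
    # standard deviations sigma_{t-1}(x_s) of the last m queried inputs (s = t-m, ..., t-1).
    beta_t = B_f + (R + B_y) * np.sqrt(2.0 * (gamma_prev + 1.0 + np.log(4.0 / delta)))
    nu_t = B_y * np.sum(recent_sigmas) + beta_t
    c_t = nu_t * (1.0 + np.sqrt(2.0 * np.log(Q_size * t ** 2)))
    return beta_t, nu_t, c_t

def ts_select(mu_prev, cov_prev, nu_t, rng):
    # One GP-TS-SDF step over a discrete domain: draw f_t ~ N(mu_{t-1}, nu_t^2 * Sigma_{t-1})
    # and query its maximizer x_t = argmax_x f_t(x); a small jitter may be needed in practice.
    f_t = rng.multivariate_normal(mu_prev, (nu_t ** 2) * cov_prev)
    return int(np.argmax(f_t))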

Lemma 1.

Let δ(0,1)\delta\in(0,1). Define Ef(t)E^{f}(t) as the event that |μt1(x)ρmf(x)|νtσt1(x)|\mu_{t-1}(x)-\rho_{m}f(x)|\leq\nu_{t}\sigma_{t-1}(x) for all x𝒬x\in\mathcal{Q}. We have that [Ef(t)]1δ/2\mathbb{P}\left[E^{f}(t)\right]\geq 1-\delta/2 for all t1t\geq 1.

Note that Lemma 1 is the same as Theorem 1 except that we have replaced the error probability of δ\delta by δ/2\delta/2.

Lemma 2.

Define Eft(t)E^{f_{t}}(t) as the event that |ft(x)μt1(x)|νt2log(|𝒬|t2)σt1(x)|f_{t}(x)-\mu_{t-1}(x)|\leq\nu_{t}\sqrt{2\log(|\mathcal{Q}|t^{2})}\sigma_{t-1}(x). We have that [Eft(t)|t1]11/t2\mathbb{P}\left[E^{f_{t}}(t)|\mathcal{F}_{t-1}\right]\geq 1-1/t^{2} for any possible filtration t1\mathcal{F}_{t-1}.

Lemma 2 makes use of the concentration of a random variable sampled from a Gaussian distribution, and its proof follows from Lemma 5 of Chowdhury & Gopalan (2017). Next, we define a set of saturated points in every iteration, which intuitively represents those inputs that lead to large regrets in an iteration.

Definition 1.

In iteration tt, define the set of saturated points as

St={x𝒬:ρmΔ(x)>ctσt1(x)},S_{t}=\{x\in\mathcal{Q}:\rho_{m}\Delta(x)>c_{t}\sigma_{t-1}(x)\},

where Δ(x)=f(x)f(x)\Delta(x)=f(x^{*})-f(x) and xargmaxx𝒬f(x)x^{*}\in\arg\max_{x\in\mathcal{Q}}f(x).

Based on this definition, we will later prove a lower bound on the probability that the selected input in iteration tt is unsaturated (Lemma 4). To do that, we first need the following auxiliary lemma.

Lemma 3.

For any filtration t1\mathcal{F}_{t-1}, conditioned on the event Ef(t)E^{f}(t), we have that x𝒬\forall x\in\mathcal{Q},

(ft(x)>ρmf(x)|t1)p,\mathbb{P}\left(f_{t}(x)>\rho_{m}f(x)|\mathcal{F}_{t-1}\right)\geq p, (14)

where p=14eπp=\frac{1}{4e\sqrt{\pi}}.

Proof.

Subtracting μt1(x)\mu_{t-1}(x) from both sides of the inequality inside (ft(x)>ρmf(x)|t1)\mathbb{P}\left(f_{t}(x)>\rho_{m}f(x)|\mathcal{F}_{t-1}\right) and dividing both sides by νtσt1(x)\nu_{t}\sigma_{t-1}(x), we get

(ft(x)>ρmf(x)|t1)=(ft(x)μt1(x)νtσt1(x)>ρmf(x)μt1(x)νtσt1(x)|t1)(ft(x)μt1(x)νtσt1(x)>|ρmf(x)μt1(x)|νtσt1(x)|t1)(ft(x)μt1(x)νtσt1(x)>1|t1)14eπ,\begin{split}\mathbb{P}\left(f_{t}(x)>\rho_{m}f(x)|\mathcal{F}_{t-1}\right)&=\mathbb{P}\left(\frac{f_{t}(x)-\mu_{t-1}(x)}{\nu_{t}\sigma_{t-1}(x)}>\frac{\rho_{m}f(x)-\mu_{t-1}(x)}{\nu_{t}\sigma_{t-1}(x)}|\mathcal{F}_{t-1}\right)\\ &\geq\mathbb{P}\left(\frac{f_{t}(x)-\mu_{t-1}(x)}{\nu_{t}\sigma_{t-1}(x)}>\frac{|\rho_{m}f(x)-\mu_{t-1}(x)|}{\nu_{t}\sigma_{t-1}(x)}|\mathcal{F}_{t-1}\right)\\ &\geq\mathbb{P}\left(\frac{f_{t}(x)-\mu_{t-1}(x)}{\nu_{t}\sigma_{t-1}(x)}>1|\mathcal{F}_{t-1}\right)\\ &\geq\frac{1}{4e\sqrt{\pi}},\end{split} (15)

in which the second inequality makes use of Lemma 1 (note that we have conditioned on the event Ef(t)E^{f}(t) here), and the last inequality follows from the Gaussian anti-concentration inequality: (z>a)exp(a2)/(4πa)\mathbb{P}(z>a)\geq\exp(-a^{2})/(4\sqrt{\pi}a) where z𝒩(0,1)z\sim\mathcal{N}(0,1). ∎

The next lemma proves a lower bound on the probability that the selected input xtx_{t} is unsaturated.

Lemma 4.

For any filtration t1\mathcal{F}_{t-1}, conditioned on the event Ef(t)E^{f}(t), we have that,

(xt𝒬St|t1)p1/t2.\mathbb{P}\left(x_{t}\in\mathcal{Q}\setminus S_{t}|\mathcal{F}_{t-1}\right)\geq p-1/t^{2}.
Proof.

To begin with, we have that

(xt𝒬St|t1)(ft(x)>ft(x),xSt|t1).\mathbb{P}\left(x_{t}\in\mathcal{Q}\setminus S_{t}|\mathcal{F}_{t-1}\right)\geq\mathbb{P}\left(f_{t}(x^{*})>f_{t}(x),\forall x\in S_{t}|\mathcal{F}_{t-1}\right). (16)

This inequality can be justified because the event on the right hand side implies the event on the left hand side. Specifically, according to Definition 1, xx^{*} is always unsaturated because Δ(x)=0<ctσt1(x)\Delta(x^{*})=0<c_{t}\sigma_{t-1}(x^{*}). Therefore, because xtx_{t} is selected by xt=argmaxx𝒬ft(x)x_{t}={\arg\max}_{x\in\mathcal{Q}}f_{t}(x), we have that if ft(x)>ft(x),xStf_{t}(x^{*})>f_{t}(x),\forall x\in S_{t}, then the selected xtx_{t} is guaranteed to be unsaturated. Now conditioning on both events Ef(t)E^{f}(t) and Eft(t)E^{f_{t}}(t), for all xStx\in S_{t}, we have that

ft(x)ρmf(x)+ctσt1(x)ρmf(x)+ρmΔ(x)=ρmf(x)+ρmf(x)ρmf(x)=ρmf(x),\begin{split}f_{t}(x)\leq\rho_{m}f(x)+c_{t}\sigma_{t-1}(x)\leq\rho_{m}f(x)+\rho_{m}\Delta(x)=\rho_{m}f(x)+\rho_{m}f(x^{*})-\rho_{m}f(x)=\rho_{m}f(x^{*}),\end{split} (17)

in which the first inequality follows from Lemmas 1 and 2, and the second inequality makes use of Definition 1. Next, separately considering the cases where the event Eft(t)E^{f_{t}}(t) holds or not and making use of Eq. 16 and Eq. 17, we have that

(xt𝒬St|t1)(ft(x)>ft(x),xSt|t1)(ft(x)>ρmf(x)|t1)(Eft(t)¯|t1)p1/t2,\begin{split}\mathbb{P}\left(x_{t}\in\mathcal{Q}\setminus S_{t}|\mathcal{F}_{t-1}\right)&\geq\mathbb{P}\left(f_{t}(x^{*})>f_{t}(x),\forall x\in S_{t}|\mathcal{F}_{t-1}\right)\\ &\geq\mathbb{P}\left(f_{t}(x^{*})>\rho_{m}f(x^{*})|\mathcal{F}_{t-1}\right)-\mathbb{P}\left(\overline{E^{f_{t}}(t)}|\mathcal{F}_{t-1}\right)\\ &\geq p-1/t^{2},\end{split} (18)

in which the last inequality has made use of Lemmas 2 and 3. ∎

Next, we use the following lemma to derive an upper bound on the expected instantaneous regret.

Lemma 5.

For any filtration t1\mathcal{F}_{t-1}, conditioned on the event Ef(t)E^{f}(t), we have that,

𝔼[rt|t1]11ctρmp𝔼[σt1(xt)|t1]+2ft2.\mathbb{E}[r_{t}|\mathcal{F}_{t-1}]\leq\frac{11c_{t}}{\rho_{m}p}\mathbb{E}\left[\sigma_{t-1}(x_{t})|\mathcal{F}_{t-1}\right]+\frac{2\mathcal{B}_{f}}{t^{2}}.
Proof.

To begin with, define x¯t\overline{x}_{t} as the unsaturated input with the smallest GP posterior standard deviation:

x¯t=argminx𝒬Stσt1(x).\overline{x}_{t}={\arg\min}_{x\in\mathcal{Q}\setminus S_{t}}\sigma_{t-1}(x). (19)

This definition gives us:

𝔼[σt1(xt)|t1]𝔼[σt1(xt)|t1,xt𝒬St](xt𝒬St|t1)σt1(x¯t)(p1/t2),\mathbb{E}\left[\sigma_{t-1}(x_{t})|\mathcal{F}_{t-1}\right]\geq\mathbb{E}\left[\sigma_{t-1}(x_{t})|\mathcal{F}_{t-1},x_{t}\in\mathcal{Q}\setminus S_{t}\right]\mathbb{P}\left(x_{t}\in\mathcal{Q}\setminus S_{t}|\mathcal{F}_{t-1}\right)\geq\sigma_{t-1}(\overline{x}_{t})(p-1/t^{2}), (20)

in which the second inequality makes use of Lemma 4, as well as the definition of x¯t\overline{x}_{t}.

Next, conditioned on both events Ef(t)E^{f}(t) and Eft(t)E^{f_{t}}(t), we can upper-bound the instantaneous regret as

rt\displaystyle r_{t} =f(x)f(xt)=1ρm[ρmf(x)ρmf(x¯t)+ρmf(x¯t)ρmf(xt)]\displaystyle=f(x^{*})-f(x_{t})=\frac{1}{\rho_{m}}\left[\rho_{m}f(x^{*})-\rho_{m}f(\overline{x}_{t})+\rho_{m}f(\overline{x}_{t})-\rho_{m}f(x_{t})\right]
1ρm[ρmΔ(x¯t)+ft(x¯t)+ctσt1(x¯t)ft(xt)+ctσt1(xt)]\displaystyle\leq\frac{1}{\rho_{m}}\left[\rho_{m}\Delta(\overline{x}_{t})+f_{t}(\overline{x}_{t})+c_{t}\sigma_{t-1}(\overline{x}_{t})-f_{t}(x_{t})+c_{t}\sigma_{t-1}(x_{t})\right]
1ρm[ctσt1(x¯t)+ft(x¯t)+ctσt1(x¯t)ft(xt)+ctσt1(xt)]\displaystyle\leq\frac{1}{\rho_{m}}\left[c_{t}\sigma_{t-1}(\overline{x}_{t})+f_{t}(\overline{x}_{t})+c_{t}\sigma_{t-1}(\overline{x}_{t})-f_{t}(x_{t})+c_{t}\sigma_{t-1}(x_{t})\right]
1ρm[2ctσt1(x¯t)+ctσt1(xt)],\displaystyle\leq\frac{1}{\rho_{m}}\left[2c_{t}\sigma_{t-1}(\overline{x}_{t})+c_{t}\sigma_{t-1}(x_{t})\right],

in which the first inequality follows from Lemmas 1 and 2 as well as the definition of Δ()\Delta(\cdot) (Definition 1), the second inequality follows because x¯t\overline{x}_{t} is unsaturated, and the last inequality follows because ft(x¯t)ft(xt)0f_{t}(\overline{x}_{t})-f_{t}(x_{t})\leq 0 since xt=argmaxx𝒬ft(x)x_{t}={\arg\max}_{x\in\mathcal{Q}}f_{t}(x). Next, by separately considering the cases where the event Eft(t)E^{f_{t}}(t) holds and otherwise, we are ready to upper-bound the expected instantaneous regret:

𝔼[rt|t1]1ρm𝔼[2ctσt1(x¯t)+ctσt1(xt)|t1]+2ft21ρm𝔼[2ctσt1(xt)p1/t2+ctσt1(xt)|t1]+2ft2ctρm(2p1/t2+1)𝔼[σt1(xt)|t1]+2ft211ctρmp𝔼[σt1(xt)|t1]+2ft2,\begin{split}\mathbb{E}[r_{t}|\mathcal{F}_{t-1}]&\leq\frac{1}{\rho_{m}}\mathbb{E}[2c_{t}\sigma_{t-1}(\overline{x}_{t})+c_{t}\sigma_{t-1}(x_{t})|\mathcal{F}_{t-1}]+\frac{2\mathcal{B}_{f}}{t^{2}}\\ &\leq\frac{1}{\rho_{m}}\mathbb{E}\left[2c_{t}\frac{\sigma_{t-1}(x_{t})}{p-1/t^{2}}+c_{t}\sigma_{t-1}(x_{t})|\mathcal{F}_{t-1}\right]+\frac{2\mathcal{B}_{f}}{t^{2}}\\ &\leq\frac{c_{t}}{\rho_{m}}\left(\frac{2}{p-1/t^{2}}+1\right)\mathbb{E}\left[\sigma_{t-1}(x_{t})|\mathcal{F}_{t-1}\right]+\frac{2\mathcal{B}_{f}}{t^{2}}\\ &\leq\frac{11c_{t}}{\rho_{m}p}\mathbb{E}\left[\sigma_{t-1}(x_{t})|\mathcal{F}_{t-1}\right]+\frac{2\mathcal{B}_{f}}{t^{2}},\end{split} (21)

in which the second inequality results from Eq. 20, and the last inequality follows because 2p1/t210/p\frac{2}{p-1/t^{2}}\leq 10/p and 11/p1\leq 1/p. ∎

Next, we define the following stochastic process (Yt:t=0,,T)(Y_{t}:t=0,\ldots,T), which we prove is a super-martingale in the subsequent lemma by making use of Lemma 5.

Definition 2.

Define Y0=0Y_{0}=0, and for all t=1,,Tt=1,\ldots,T,

r¯t=rt𝕀{Ef(t)},Xt=r¯t11ctρmpσt1(xt)2ft2, and Yt=s=1tXs.\overline{r}_{t}=r_{t}\mathbb{I}\{E^{f}(t)\},\quad X_{t}=\overline{r}_{t}-\frac{11c_{t}}{\rho_{m}p}\sigma_{t-1}(x_{t})-\frac{2\mathcal{B}_{f}}{t^{2}},\text{ and }\quad Y_{t}=\sum^{t}_{s=1}X_{s}.
Lemma 6.

(Yt:t=0,,T)(Y_{t}:t=0,\ldots,T) is a super-martingale with respect to the filtration t\mathcal{F}_{t}.

Proof.

As Xt=YtYt1X_{t}=Y_{t}-Y_{t-1}, we have

𝔼[YtYt1|t1]\displaystyle\mathbb{E}\left[Y_{t}-Y_{t-1}|\mathcal{F}_{t-1}\right] =𝔼[Xt|t1]\displaystyle=\mathbb{E}\left[X_{t}|\mathcal{F}_{t-1}\right]
=𝔼[r¯t11ctρmpσt1(xt)2ft2|t1]\displaystyle=\mathbb{E}\left[\overline{r}_{t}-\frac{11c_{t}}{\rho_{m}p}\sigma_{t-1}(x_{t})-\frac{2\mathcal{B}_{f}}{t^{2}}|\mathcal{F}_{t-1}\right]
=𝔼[r¯t|t1][11ctρmp𝔼[σt1(xt)|t1]+2ft2]\displaystyle=\mathbb{E}\left[\overline{r}_{t}|\mathcal{F}_{t-1}\right]-\left[\frac{11c_{t}}{\rho_{m}p}\mathbb{E}\left[\sigma_{t-1}(x_{t})|\mathcal{F}_{t-1}\right]+\frac{2\mathcal{B}_{f}}{t^{2}}\right]
0.\displaystyle\leq 0.

When the event Ef(t)E^{f}(t) holds, the last inequality follows from Lemma 5; when Ef(t)E^{f}(t) is false, r¯t=0\overline{r}_{t}=0 and hence the inequality trivially holds. ∎

Lastly, we are ready to prove the upper bound of GP-TS-SDF on the cumulative regret T\mathfrak{R}_{T} by applying the Azuma-Hoeffding Inequality to the stochastic process defined above.

See Theorem 3.

Proof.

To begin with, we derive an upper bound on |YtYt1||Y_{t}-Y_{t-1}|:

|YtYt1|=|Xt||r¯t|+11ctρmpσt1(xt)+2ft22f+11ctρmp+2f1ρmp(4f+11ct),\begin{split}|Y_{t}-Y_{t-1}|&=|X_{t}|\leq|\overline{r}_{t}|+\frac{11c_{t}}{\rho_{m}p}\sigma_{t-1}(x_{t})+\frac{2\mathcal{B}_{f}}{t^{2}}\\ &\leq 2\mathcal{B}_{f}+\frac{11c_{t}}{\rho_{m}p}+2\mathcal{B}_{f}\\ &\leq\frac{1}{\rho_{m}p}\left(4\mathcal{B}_{f}+11c_{t}\right),\end{split} (22)

where the second inequality follows because σt1(xt)1\sigma_{t-1}(x_{t})\leq 1, and the last inequality follows since 1ρmp1\frac{1}{\rho_{m}p}\geq 1. Now we are ready to apply the Azuma-Hoeffding Inequality to (Yt:t=0,,T)(Y_{t}:t=0,\ldots,T) with an error probability of δ/2\delta/2:

t=1Tr¯tt=1T11ctρmpσt1(xt)+t=1T2ft2+2log(2/δ)t=1T(1ρmp(4f+11ct))211ρmp(1+2log(|𝒬|T2))t=1Tνtσt1(xt)+Bπ23+4f+11cTρmp2Tlog(2/δ)11ρmp(1+2log(|𝒬|T2))(C1βTTγT+myC12γT)+4f+11cTρmp2Tlog(2/δ)+Bπ23,\begin{split}\sum^{T}_{t=1}\overline{r}_{t}&\leq\sum^{T}_{t=1}\frac{11c_{t}}{\rho_{m}p}\sigma_{t-1}(x_{t})+\sum^{T}_{t=1}\frac{2\mathcal{B}_{f}}{t^{2}}+\sqrt{2\log(2/\delta)\sum^{T}_{t=1}\left(\frac{1}{\rho_{m}p}\left(4\mathcal{B}_{f}+11c_{t}\right)\right)^{2}}\\ &\leq\frac{11}{\rho_{m}p}(1+\sqrt{2\log(|\mathcal{Q}|T^{2})})\sum^{T}_{t=1}\nu_{t}\sigma_{t-1}(x_{t})+\frac{B\pi^{2}}{3}+\frac{4\mathcal{B}_{f}+11c_{T}}{\rho_{m}p}\sqrt{2T\log(2/\delta)}\\ &\leq\frac{11}{\rho_{m}p}(1+\sqrt{2\log(|\mathcal{Q}|T^{2})})\left(C_{1}\beta_{T}\sqrt{T\gamma_{T}}+m\mathcal{B}_{y}C_{1}^{2}\gamma_{T}\right)+\frac{4\mathcal{B}_{f}+11c_{T}}{\rho_{m}p}\sqrt{2T\log(2/\delta)}+\frac{B\pi^{2}}{3},\end{split} (23)

in which the second inequality makes use of the fact that ctc_{t} is monotonically increasing and that t=1T1/t2π2/6\sum^{T}_{t=1}1/t^{2}\leq\pi^{2}/6, and the last inequality follows from the proof of Theorem 2 (Section A.2), which gives an upper bound on t=1Tνtσt1(xt)\sum^{T}_{t=1}\nu_{t}\sigma_{t-1}(x_{t}). Note that Eq. 23 holds with probability 1δ/2\geq 1-\delta/2. Also note that r¯t=rt\overline{r}_{t}=r_{t} with probability of 1δ/2\geq 1-\delta/2 because the event Ef(t)E^{f}(t) holds with probability of 1δ/2\geq 1-\delta/2 (Lemma 1); therefore, the upper bound from Eq. 23 is an upper bound on T=t=1Trt\mathfrak{R}_{T}=\sum^{T}_{t=1}r_{t} with probability of 1δ1-\delta.

Lastly, we can simplify the regret upper bound from Eq. 23 into asymptotic notation. Note that ct=νt(1+2log(|𝒬|t2))c_{t}=\nu_{t}(1+\sqrt{2\log(|\mathcal{Q}|t^{2})}), νt=ys=tmt1σt1(xs)+βt\nu_{t}=\mathcal{B}_{y}\sum^{t-1}_{s=t-m}\sigma_{t-1}(x_{s})+\beta_{t} and βt=f+(R+y)2(γt1+1+log(4/δ))\beta_{t}=\mathcal{B}_{f}+(R+\mathcal{B}_{y})\sqrt{2(\gamma_{t-1}+1+\log(4/\delta))}. To simplify notations, we analyze the asymptotic dependence while ignoring all log factors, and ignore the dependence on constants including BB, y\mathcal{B}_{y} and RR. This gives us βT=O~(γT)\beta_{T}=\tilde{O}(\sqrt{\gamma_{T}}) and cT=O~(m+γT)c_{T}=\tilde{O}(m+\sqrt{\gamma_{T}}). As a result, the regret upper bound can be simplified as

T\displaystyle\mathfrak{R}_{T} =O~(1ρm(γTT+mγT)+1ρm(m+γT)T)=O~(1ρm(TγT(γT+1)+m(γT+T))).\displaystyle=\tilde{O}\left(\frac{1}{\rho_{m}}\left(\gamma_{T}\sqrt{T}+m\gamma_{T}\right)+\frac{1}{\rho_{m}}(m+\sqrt{\gamma_{T}})\sqrt{T}\right)=\tilde{O}\left(\frac{1}{\rho_{m}}\left(\sqrt{T\gamma_{T}}(\sqrt{\gamma_{T}}+1)+m(\gamma_{T}+\sqrt{T})\right)\right).

This completes the proof. ∎

A.4 Time-based Censored Feedback

Let mm be the amount of time the learner waits for the delayed feedback, τ¯t\bar{\tau}_{t} be the time of initiating the selection process for the tt-th query, and τt\tau_{t} be the time of starting the tt-th query. The censored feedback of the ss-th query before selecting the tt-th query is denoted by y~s,t\tilde{y}_{s,t}, where y~s,t=ys𝟙{dsmin(m,τ¯tτs)}\tilde{y}_{s,t}=y_{s}\mathbbm{1}\{d_{s}\leq\min(m,\bar{\tau}_{t}-\tau_{s})\}. Now, the GP posterior mean and covariance functions can be expressed using the available information about the function queries and their censored feedback as follows:

μt1(x)=𝐤t1(x)𝐊t,λ1𝐲~t1,σt12(x,x)=k(x,x)𝐤t1(x)𝐊t,λ1𝐤t1(x),\begin{split}\mu_{t-1}(x)&=\mathbf{k}_{t-1}(x)^{\top}\mathbf{K}_{t,\lambda}^{-1}\tilde{\mathbf{y}}_{t-1},\\ \sigma_{t-1}^{2}(x,x^{\prime})&=k(x,x^{\prime})-\mathbf{k}_{t-1}(x)^{\top}\mathbf{K}_{t,\lambda}^{-1}\mathbf{k}_{t-1}(x^{\prime}),\end{split} (24)

where 𝐊t,λ=𝐊t1+λ𝐈\mathbf{K}_{t,\lambda}=\mathbf{K}_{t-1}+\lambda\mathbf{I}, 𝐤t1(x)=(k(x,xt))τt[0,τ¯t]\mathbf{k}_{t-1}(x)=(k(x,x_{t^{\prime}}))^{\top}_{\tau_{t^{\prime}}\in[0,\bar{\tau}_{t}]}, 𝐲~t1=(y~s,t)τs[0,τ¯t]\tilde{\mathbf{y}}_{t-1}=(\tilde{y}_{s,t})^{\top}_{\tau_{s}\in[0,\bar{\tau}_{t}]}, 𝐊t1=(k(xt,xt′′))τt,τt′′[0,τ¯t]\mathbf{K}_{t-1}=(k(x_{t^{\prime}},x_{t^{\prime\prime}}))_{\tau_{t^{\prime}},\tau_{t^{\prime\prime}}\in[0,\bar{\tau}_{t}]} and λ\lambda is a regularization parameter that ensures 𝐊t,λ\mathbf{K}_{t,\lambda} is an invertible matrix. Our regret analysis carries over to this setting as well after appropriately redefining the relevant variables.
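A minimal sketch of how the time-based censored observations could be formed is given below (NumPy, with illustrative variable names); it simply implements the indicator above using the query times, the delays, and the waiting window m.

import numpy as np

def time_censored_observations(y, query_times, delays, t_bar, m):
    # y_tilde_{s,t} = y_s * 1{d_s <= min(m, t_bar - tau_s)}: the feedback of the s-th query is
    # used only if it arrived within the waiting window m and before the current selection
    # time t_bar (Section A.4 notation, with tau_s = query_times[s]).
    elapsed = t_bar - np.asarray(query_times, dtype=float)
    return np.asarray(y, dtype=float) * (np.asarray(delays) <= np.minimum(m, elapsed))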

Appendix B Theoretical Analysis of Contextual Gaussian Process Bandit

Our regret upper bounds in Theorems 2 and 3 are also applicable in the contextual GP bandit setting after some slight modifications to their proofs. Note that in the contextual GP bandit setting (Section 5), the only major difference with the BO setting is that now the objective function gg depends on both a context vector zt𝒵z_{t}\in\mathcal{Z} and an input vector xt𝒬x_{t}\in\mathcal{Q}, in which every ztz_{t} is generated by the environment and xtx_{t} is selected by our GP-UCB-SDF or GP-TS-SDF algorithms. Therefore, the most important modifications to the proofs of Theorems 2 and 3 involve replacing the original input space of 𝒬\mathcal{Q} by the new input space of 𝒵×𝒬\mathcal{Z}\times\mathcal{Q} and accounting for the modified definition of regret in Section 5.

To begin with, with the same definition of νt\nu_{t}, it is easy to verify that Theorem 1 can be extended to the contextual GP bandit setting:

Theorem 4 (Confidence Ellipsoid for Contextual GP Bandit).

With probability at least 1δ1-\delta,

|μt1(z,x)ρmg(z,x)|νtσt1(z,x),z𝒵,x𝒬.|\mu_{t-1}(z,x)-\rho_{m}g(z,x)|\leq\nu_{t}\sigma_{t-1}(z,x),\forall z\in\mathcal{Z},x\in\mathcal{Q}.

The proof of Theorem 4 is the same as that of Theorem 1 except that the input xx in the proof of Theorem 1 is replaced by (z,x)(z,x).

B.1 GP-UCB-SDF for Contextual Gaussian Process Bandit

For the GP-UCB-SDF algorithm in the contextual GP bandit setting, the instantaneous regret rtr_{t} can be analyzed in a similar way to Eq. 10 in the proof of Theorem 2 (Section A.2):

rt\displaystyle r_{t} =g(zt,xt)g(zt,xt)=1ρm[ρmg(zt,xt)ρmg(zt,xt)]\displaystyle=g(z_{t},x_{t}^{\star})-g(z_{t},x_{t})=\frac{1}{\rho_{m}}\left[\rho_{m}g(z_{t},x_{t}^{\star})-\rho_{m}g(z_{t},x_{t})\right]
1ρm[μt1(zt,xt)+νtσt1(zt,xt)ρmg(zt,xt)]\displaystyle\leq\frac{1}{\rho_{m}}\left[\mu_{t-1}(z_{t},x_{t}^{\star})+\nu_{t}\sigma_{t-1}(z_{t},x_{t}^{\star})-\rho_{m}g(z_{t},x_{t})\right]
1ρm[μt1(zt,xt)+νtσt1(zt,xt)ρmg(zt,xt)]\displaystyle\leq\frac{1}{\rho_{m}}\left[\mu_{t-1}(z_{t},x_{t})+\nu_{t}\sigma_{t-1}(z_{t},x_{t})-\rho_{m}g(z_{t},x_{t})\right]
1ρm2νtσt1(zt,xt),\displaystyle\leq\frac{1}{\rho_{m}}2\nu_{t}\sigma_{t-1}(z_{t},x_{t}),

in which the first and last inequalities both make use of Theorem 4, and the second inequality follows from the way in which xtx_{t} is selected: xt=argmaxx𝒬μt1(zt,x)+νtσt1(zt,x)x_{t}={\arg\max}_{x\in\mathcal{Q}}\mu_{t-1}(z_{t},x)+\nu_{t}\sigma_{t-1}(z_{t},x). Following this, the subsequent steps in the proof in Section A.2 immediately apply, hence producing the same regret upper bound as Theorem 2.

B.2 GP-TS-SDF for Contextual Gaussian Process Bandit

To begin with, the events Ef(t)E^{f}(t) (Lemma 1) and Eft(t)E^{f_{t}}(t) (Lemma 2) can be similarly defined:

Lemma 7.

Let δ(0,1)\delta\in(0,1). Define Ef(t)E^{f}(t) as the event that |μt1(z,x)ρmg(z,x)|νtσt1(z,x)|\mu_{t-1}(z,x)-\rho_{m}g(z,x)|\leq\nu_{t}\sigma_{t-1}(z,x) for all z𝒵,x𝒬z\in\mathcal{Z},x\in\mathcal{Q}. We have that [Ef(t)]1δ/2\mathbb{P}\left[E^{f}(t)\right]\geq 1-\delta/2 for all t1t\geq 1.

Lemma 8.

Define Eft(t)E^{f_{t}}(t) as the event that |gt(z,x)μt1(z,x)|νt2log(|𝒵||𝒬|t2)σt1(z,x)|g_{t}(z,x)-\mu_{t-1}(z,x)|\leq\nu_{t}\sqrt{2\log(|\mathcal{Z}||\mathcal{Q}|t^{2})}\sigma_{t-1}(z,x). We have that [Eft(t)|t1]11/t2\mathbb{P}\left[E^{f_{t}}(t)|\mathcal{F}_{t-1}\right]\geq 1-1/t^{2} for any possible filtration t1\mathcal{F}_{t-1}.

The validity of Lemma 7 follows immediately from Theorem 4 (after replacing the error probability of δ\delta by δ/2\delta/2 in the definition of νt\nu_{t}, which we have omitted here for simplicity), and the proof of Lemma 8 is the same as that of Lemma 2. Next, we modify the definition of saturated points from Definition 1:

Definition 3.

In iteration tt, given a context vector ztz_{t}, define the set of saturated points as

St={x𝒬:ρmΔ(zt,x)>ctσt1(zt,x)},S_{t}=\{x\in\mathcal{Q}:\rho_{m}\Delta(z_{t},x)>c_{t}\sigma_{t-1}(z_{t},x)\},

where Δ(zt,x)=g(zt,xt)g(zt,x)\Delta(z_{t},x)=g(z_{t},x_{t}^{\star})-g(z_{t},x) and xt=argmaxx𝒬g(zt,x)x_{t}^{\star}=\arg\!\max\limits_{x\in\mathcal{Q}}g(z_{t},x).

Note that according to this definition, xtx_{t}^{\star} is always unsaturated. The next lemma is a counterpart of Lemma 3, and the proofs are the same.

Lemma 9.

For any filtration t1\mathcal{F}_{t-1}, conditioned on the event Ef(t)E^{f}(t), we have that z𝒵,x𝒬\forall z\in\mathcal{Z},x\in\mathcal{Q},

(gt(z,x)>ρmg(z,x)|t1)p,\mathbb{P}\left(g_{t}(z,x)>\rho_{m}g(z,x)|\mathcal{F}_{t-1}\right)\geq p,

where p=14eπp=\frac{1}{4e\sqrt{\pi}}.

The next lemma, similar to Lemma 4, shows that in contextual GP bandit problems, the probability that the selected xtx_{t} is unsaturated can also be lower-bounded.

Lemma 10.

For any filtration t1\mathcal{F}_{t-1}, given a context vector ztz_{t}, conditioned on the event Ef(t)E^{f}(t), we have that,

(xt𝒬St|t1)p1/t2.\mathbb{P}\left(x_{t}\in\mathcal{Q}\setminus S_{t}|\mathcal{F}_{t-1}\right)\geq p-1/t^{2}.
Proof.

To begin with, we have that

(xt𝒬St|t1)(gt(zt,xt)>gt(zt,x),xSt|t1).\mathbb{P}\left(x_{t}\in\mathcal{Q}\setminus S_{t}|\mathcal{F}_{t-1}\right)\geq\mathbb{P}\left(g_{t}(z_{t},x_{t}^{*})>g_{t}(z_{t},x),\forall x\in S_{t}|\mathcal{F}_{t-1}\right). (25)

The validity of this inequality can be seen by following the same analysis as for Eq. 16, i.e., the event on the right hand side implies the event on the left hand side. Now conditioning on both events Ef(t)E^{f}(t) and Eft(t)E^{f_{t}}(t), for all xStx\in S_{t}, we have that

gt(zt,x)ρmg(zt,x)+ctσt1(zt,x)ρmg(zt,x)+ρmΔ(zt,x)=ρmg(zt,x)+ρmg(zt,xt)ρmg(zt,x)=ρmg(zt,xt),\begin{split}g_{t}(z_{t},x)&\leq\rho_{m}g(z_{t},x)+c_{t}\sigma_{t-1}(z_{t},x)\leq\rho_{m}g(z_{t},x)+\rho_{m}\Delta(z_{t},x)\\ &=\rho_{m}g(z_{t},x)+\rho_{m}g(z_{t},x_{t}^{*})-\rho_{m}g(z_{t},x)=\rho_{m}g(z_{t},x_{t}^{*}),\end{split} (26)

in which the first inequality follows from Lemmas 7 and 8, and the second inequality makes use of Definition 3.

Next, separately considering the cases where the event Eft(t)E^{f_{t}}(t) holds or not and making use of Eq. 25 and Eq. 26, we have that

(xt𝒬St|t1)\displaystyle\mathbb{P}\left(x_{t}\in\mathcal{Q}\setminus S_{t}|\mathcal{F}_{t-1}\right) (gt(zt,xt)>gt(zt,x),xSt|t1)\displaystyle\geq\mathbb{P}\left(g_{t}(z_{t},x_{t}^{*})>g_{t}(z_{t},x),\forall x\in S_{t}|\mathcal{F}_{t-1}\right)
(gt(zt,xt)>ρmg(zt,xt)|t1)(Eft(t)¯|t1)\displaystyle\geq\mathbb{P}\left(g_{t}(z_{t},x_{t}^{*})>\rho_{m}g(z_{t},x_{t}^{\star})|\mathcal{F}_{t-1}\right)-\mathbb{P}\left(\overline{E^{f_{t}}(t)}|\mathcal{F}_{t-1}\right)
p1/t2,\displaystyle\geq p-1/t^{2},

where the last inequality has made use of Lemmas 8 and 9. ∎

Lemma 11.

For any filtration t1\mathcal{F}_{t-1}, given a context vector ztz_{t}, conditioned on the event Ef(t)E^{f}(t), we have that,

𝔼[rt|t1]11ctρmp𝔼[σt1(zt,xt)|t1]+2ft2.\mathbb{E}[r_{t}|\mathcal{F}_{t-1}]\leq\frac{11c_{t}}{\rho_{m}p}\mathbb{E}\left[\sigma_{t-1}(z_{t},x_{t})|\mathcal{F}_{t-1}\right]+\frac{2\mathcal{B}_{f}}{t^{2}}.
Proof.

To begin with, given a context vector ztz_{t}, define x¯t\overline{x}_{t} as the unsaturated input with the smallest GP posterior standard deviation:

x¯t=argminx𝒬Stσt1(zt,x).\overline{x}_{t}={\arg\min}_{x\in\mathcal{Q}\setminus S_{t}}\sigma_{t-1}(z_{t},x). (27)

This definition gives us:

𝔼[σt1(zt,xt)|t1]𝔼[σt1(zt,xt)|t1,xt𝒬St](xt𝒬St|t1)σt1(zt,x¯t)(p1/t2),\begin{split}\mathbb{E}\left[\sigma_{t-1}(z_{t},x_{t})|\mathcal{F}_{t-1}\right]&\geq\mathbb{E}\left[\sigma_{t-1}(z_{t},x_{t})|\mathcal{F}_{t-1},x_{t}\in\mathcal{Q}\setminus S_{t}\right]\mathbb{P}\left(x_{t}\in\mathcal{Q}\setminus S_{t}|\mathcal{F}_{t-1}\right)\\ &\geq\sigma_{t-1}(z_{t},\overline{x}_{t})(p-1/t^{2}),\end{split} (28)

in which the second inequality makes use of Lemma 10, as well as the definition of x¯t\overline{x}_{t}. Next, conditioned on both events Ef(t)E^{f}(t) and Eft(t)E^{f_{t}}(t), we can upper-bound the instantaneous regret as

rt\displaystyle r_{t} =g(zt,xt)g(zt,xt)\displaystyle=g(z_{t},x_{t}^{\star})-g(z_{t},x_{t})
=1ρm[ρmg(zt,xt)ρmg(zt,x¯t)+ρmg(zt,x¯t)ρmg(zt,xt)]\displaystyle=\frac{1}{\rho_{m}}\left[\rho_{m}g(z_{t},x_{t}^{\star})-\rho_{m}g(z_{t},\overline{x}_{t})+\rho_{m}g(z_{t},\overline{x}_{t})-\rho_{m}g(z_{t},x_{t})\right]
1ρm[ρmΔ(zt,x¯t)+gt(zt,x¯t)+ctσt1(zt,x¯t)gt(zt,xt)+ctσt1(zt,xt)]\displaystyle\leq\frac{1}{\rho_{m}}\left[\rho_{m}\Delta(z_{t},\overline{x}_{t})+g_{t}(z_{t},\overline{x}_{t})+c_{t}\sigma_{t-1}(z_{t},\overline{x}_{t})-g_{t}(z_{t},x_{t})+c_{t}\sigma_{t-1}(z_{t},x_{t})\right]
1ρm[ctσt1(zt,x¯t)+gt(zt,x¯t)+ctσt1(zt,x¯t)gt(zt,xt)+ctσt1(zt,xt)]\displaystyle\leq\frac{1}{\rho_{m}}\left[c_{t}\sigma_{t-1}(z_{t},\overline{x}_{t})+g_{t}(z_{t},\overline{x}_{t})+c_{t}\sigma_{t-1}(z_{t},\overline{x}_{t})-g_{t}(z_{t},x_{t})+c_{t}\sigma_{t-1}(z_{t},x_{t})\right]
1ρm[2ctσt1(zt,x¯t)+ctσt1(zt,xt)],\displaystyle\leq\frac{1}{\rho_{m}}\left[2c_{t}\sigma_{t-1}(z_{t},\overline{x}_{t})+c_{t}\sigma_{t-1}(z_{t},x_{t})\right],

where the first inequality follows from Lemmas 7 and 8 as well as the definition of Δ()\Delta(\cdot) (Definition 3), the second inequality follows because x¯t\overline{x}_{t} is unsaturated, and the last inequality follows because gt(zt,x¯t)gt(zt,xt)0g_{t}(z_{t},\overline{x}_{t})-g_{t}(z_{t},x_{t})\leq 0 since xt=argmaxx𝒬gt(zt,x)x_{t}={\arg\max}_{x\in\mathcal{Q}}g_{t}(z_{t},x). Next, the derivation in Eq. 21 can be immediately applied, hence producing the upper bound on the instantaneous regret in this lemma. ∎

Now the remaining steps in the proof of Theorem 3 in Section A.3 immediately follow. Specifically, we can first define a stochastic process (Yt:t=0,,T)(Y_{t}:t=0,\ldots,T) in the same way as Definition 2, and then use Lemma 6 to show that (Yt:t=0,,T)(Y_{t}:t=0,\ldots,T) is a super-martingale. Lastly, we can apply the Azuma-Hoeffding Inequality to (Yt:t=0,,T)(Y_{t}:t=0,\ldots,T) in the same way as in the proof of Theorem 3, which completes the proof.

Appendix C More Experimental Details

In our experiments, since it has been repeatedly observed that the theoretical value of βt\beta_{t} is overly conservative (Srinivas et al., 2010), we set βt\beta_{t} to be a constant (βt=1\beta_{t}=1) for all methods. In all experiments and for all methods, we use the SE kernel for the GP and optimize the GP hyperparameters by maximizing the marginal likelihood after every 10 iterations. In all experiments where we report the simple regret (e.g., Fig. 1 and Fig. 2), we calculate the simple regret in an iteration using only those function evaluations that have converted (i.e., we ignore all pending observations).
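For example, the simple regret at an iteration could be computed from the converted evaluations only as in the following sketch (a minimal illustration with assumed variable names; the fallback value when no feedback has converted yet is an assumption):

def simple_regret(f_opt, f_values, delays, elapsed_times):
    # Simple regret computed from converted evaluations only: evaluation s counts if its
    # feedback has arrived (delay <= time elapsed since the query); pending ones are ignored.
    converted = [f for f, d, e in zip(f_values, delays, elapsed_times) if d <= e]
    # If nothing has converted yet, fall back to f_opt (assumes f is normalized into [0, 1]).
    return f_opt - max(converted) if converted else f_opt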

C.1 Synthetic Experiment

In the synthetic experiment, the objective function ff is sampled from a GP using the SE kernel with a lengthscale of 0.020.02 defined on a 1-dimensional domain. The domain is an equally spaced grid within the range [0,1][0,1] with a size 10001000. The sampled function is normalized into the range of [0,1][0,1].
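A minimal sketch of how such a synthetic objective could be generated (assuming NumPy and an added jitter term for numerical stability) is:

import numpy as np

def sample_synthetic_objective(n_grid=1000, lengthscale=0.02, seed=0):
    # Draw one function from a zero-mean GP with an SE kernel (lengthscale 0.02) on a
    # 1000-point equally spaced grid over [0, 1], then rescale its values into [0, 1].
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_grid)
    d = x[:, None] - x[None, :]
    K = np.exp(-0.5 * (d / lengthscale) ** 2) + 1e-8 * np.eye(n_grid)  # jitter for stability
    f = rng.multivariate_normal(np.zeros(n_grid), K)
    return x, (f - f.min()) / (f.max() - f.min())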

Figure 5: Performances of GP-TS-SDF and other TS-based baseline methods for (a) stochastic and (b) deterministic delay distributions in the synthetic experiment (Section 6.1).

C.2 Real-world Experiments

In the SVM hyperparameter tuning experiment, the diabetes diagnosis dataset can be found at https://www.kaggle.com/uciml/pima-indians-diabetes-database. We use 70%70\% of the dataset as the training set and the remaining 30%30\% as the validation set. For every evaluated hyperparameter configuration, we train the SVM on the training set with that configuration and then evaluate the learned SVM on the validation set; the resulting validation accuracy is reported as the observation. We tune the penalty parameter within the range of [104,100][10^{-4},100] and the RBF kernel parameter within [104,10][10^{-4},10].
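A minimal sketch of the resulting black-box objective is given below (assuming scikit-learn, a local copy of the Kaggle CSV, and its 'Outcome' label column); it is an illustration rather than the exact experimental code:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def svm_objective(penalty_C, rbf_gamma, csv_path="diabetes.csv", seed=0):
    # Black-box objective queried by BO: train an RBF-kernel SVM with penalty parameter C
    # (in [1e-4, 100]) and kernel parameter gamma (in [1e-4, 10]) on a 70% training split
    # and return the accuracy on the remaining 30% validation split.
    data = pd.read_csv(csv_path)               # local copy of the Kaggle CSV (file name is an assumption)
    X = data.drop(columns=["Outcome"]).values  # 'Outcome' label column per the dataset description
    y = data["Outcome"].values
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=seed)
    model = SVC(C=penalty_C, gamma=rbf_gamma, kernel="rbf").fit(X_tr, y_tr)
    return model.score(X_val, y_val)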

In the experiment on hyperparameter tuning for CNN, we use the MNIST dataset and follow the default training-testing set partition provided by the PyTorch package. The CNN consists of one convolutional layer (with 8 channels and convolutional kernels of size 3), followed by a max-pooling layer (with a pooling size of 3), and then followed by a fully connected layer (with 8 nodes). We use the ReLU activation function and the Adam optimizer. We tune the batch size (within [128,512][128,512]), the learning rate (within [106,1][10^{-6},1]) and the learning rate decay (within [106,1][10^{-6},1]).
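A minimal PyTorch sketch of this CNN and of how the tuned hyperparameters could enter the optimizer is given below; the final 10-way output layer and the use of an exponential scheduler for the learning rate decay are assumptions rather than the exact experimental code:

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    # The CNN described above: one convolutional layer (8 channels, 3x3 kernels), a
    # max-pooling layer (pooling size 3), and a fully connected layer with 8 nodes;
    # the final 10-way output layer is an added assumption needed for classification.
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3)
        self.pool = nn.MaxPool2d(3)
        self.fc = nn.Linear(8 * 8 * 8, 8)
        self.out = nn.Linear(8, 10)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv(x)))  # 28x28 MNIST images -> 8 feature maps of size 8x8
        x = torch.flatten(x, 1)
        return self.out(torch.relu(self.fc(x)))

def make_optimizer(model, lr, lr_decay):
    # The tuned learning rate and learning rate decay enter here; applying the decay through
    # an exponential scheduler is an assumption about how the decay is implemented.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=1.0 - lr_decay)
    return optimizer, scheduler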

In the hyperparameter tuning experiment for LR, the breast cancer dataset we have adopted can be found at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic). Similar to the SVM experiment, we use 70%70\% of the dataset as the training set and the remaining 30%30\% as the testing set. Here we tune three hyperparameters of the LR model: the batch size (within [32,128][32,128]), the learning rate (within [106,1][10^{-6},1]) and learning rate decay (within [106,1][10^{-6},1]).

C.3 Real-world Experiments on Contextual Gaussian Process Bandits

For the multi-task BO experiment, the tabular benchmark on hyperparameter tuning of SVM was introduced by Wistuba et al. (2015) and can be found at https://github.com/wistuba/TST. The dataset consists of 50 tasks, and each task corresponds to a different classification problem with a different dataset. The domain of hyperparameter configurations in this benchmark (which is shared by all 50 tasks) consists of discrete values for six hyperparameters: 3 binary parameters indicating (via one-hot encoding) whether the linear, polynomial, or RBF kernel is used, and the penalty parameter, the degree for the polynomial kernel, and the RBF kernel parameter. The size of the domain is 288. As a result, for each task, the validation accuracy for each one of the 288 hyperparameter configurations is recorded as the observation. As we have mentioned in the main text (Section 6.2), each of the 50 tasks is associated with some meta-features, and here we use the first six meta-features as the contexts, so every task is characterized by a 6-dimensional context vector ztz_{t}. For the non-stationary BO task, the dataset and the other experimental settings (e.g., the tuned hyperparameters, the range of the hyperparameters, etc.) are the same as in the SVM hyperparameter tuning experiment in Section 6.2. Refer to Section C.2 for more details.