\jmlrvolume\jmlryear

2024 \jmlrproceedings \coltauthor\NameNobuaki Kikkawa \Email[email protected]
\NameHiroshi Ohno \Email[email protected]
\addrToyota Central R&D Labs., Inc., 41-1, Yokomichi, Nagakute, Aichi 480-1192, Japan \DeclareMathOperator*\argmaxargmax \DeclareMathOperator*\KLKL \DeclareMathOperator*\erfcerfc \DeclareMathOperator*\ierfcierfc

Unified theory of upper confidence bound policies for bandit problems targeting total reward, maximal reward, and more

Abstract

The upper confidence bound (UCB) policy is recognized as an order-optimal solution for the classical total-reward bandit problem. While similar UCB-based approaches have been applied to the max bandit problem, which aims to maximize the cumulative maximal reward, their order optimality remains unclear. In this study, we clarify the unified conditions under which the UCB policy achieves the order optimality in both total-reward and max bandit problems. A key concept of our theory is the oracle quantity, which identifies the best arm by its highest value. This allows a unified definition of the UCB policy as pulling the arm with the highest UCB of the oracle quantity. Additionally, under this setting, optimality analysis can be conducted by replacing traditional regret with the number of failures as a core measure. One consequence of our analysis is that the confidence interval of the oracle quantity must narrow appropriately as trials increase to ensure the order optimality of UCB policies. From this consequence, we prove that the previously proposed MaxSearch algorithm satisfies this condition and is an order-optimal policy for the max bandit problem. We also demonstrate that new bandit problems and their order-optimal UCB algorithms can be systematically derived by providing the appropriate oracle quantity and its confidence interval. Building on this, we propose PIUCB algorithms, which aim to pull the arm with the highest probability of improvement (PI). These algorithms can be applied to the max bandit problem in practice and perform comparably or better than the MaxSearch algorithm in toy examples. This suggests that our theory has the potential to generate new policies tailored to specific oracle quantities.

keywords:

bandit problem, UCB policy, max bandit problem, probability of improvement, order optimality, oracle quantity

1 Introduction

The bandit problem (robbins1952some), which aims to pull the best arm based on past observations, is a fundamental issue in decision making and machine learning. Despite its simplicity, it serves as a theoretical foundation for real-world applications, such as Bayesian optimization (srinivas2009gaussian) and Monte Carlo Tree Search (MCTS) (kocsis2006bandit; browne2012survey). The scope of the bandit problem extends beyond classical total reward maximization, encompassing a variety of problem settings. These include settings that aim to maximize discounted total rewards (garivier2011upper), maximize a single best reward (cicirello2005max), or identify the optimal arm (audibert2010best). The optimal policy also varies depending on differences in the reward distributions for each arm (auer2002finite; bubeck2013bandits), as well as whether the rewards are provided stochastically or adversarially (auer2002finite; auer2002nonstochastic; bubeck2012regret). Many other policies have been proposed to address these varying conditions (agrawal1995continuum; li2010contextual; yue2012k; besbes2014stochastic). Among these policies, the upper confidence bound (UCB) policy (auer2002finite; audibert2010best; garivier2011upper; kikkawa2024materials), which utilizes confidence bounds of rewards, has proven its effectiveness under various conditions in stochastic settings.

Although UCB policies have proven generally effective in stochastic bandit problems, their validity is often evaluated case by case, depending on specific contexts and conditions. As a result, when considering variant settings, it becomes unclear what is permissible and what is not, based on existing knowledge derived from standard settings. This lack of clarity poses a significant obstacle to applying UCB policies in the variant settings. By establishing unified conditions for the effectiveness of UCB policies, we aim to provide a theoretical guideline for the application of UCB policies across diverse problem settings. As a step toward a unified theory, this study provides a comprehensive proof of the effectiveness of the UCB policy by focusing on two stochastic bandit problems: the classic total-reward bandit problem, which aims to maximize cumulative total rewards (robbins1952some), and the max bandit problem (also known as the max $K$ -armed bandit problem or extreme bandit problem), which seeks to maximize the single best reward (cicirello2005max; carpentier2014extreme; david2016pac). The key in this proof is the introduction of the oracle quantity $z_{k,t}$ , which addresses the differences in the objectives of the bandit problems. This approach proves the effectiveness of the UCB policy not only for the total-reward and max bandit problems but also for bandit problems maximizing other targets. This flexibility of the target definition can improve practical exploration efficiency, tailoring the policy to each target. As an example, using the probability of improvement (PI) as the oracle quantity, we constructed a PIUCB policy, which aims to maximize PI from the cumulative best reward. This policy can effectively be used as one of the max bandit algorithms, which performance is at least comparable to the MaxSearch algorithm, a UCB algorithm proposed by kikkawa2024materials for the max bandit problem.

This paper is organized as follows: First, the Related Work section reviews previous research on the fundamentals of bandit problems and UCB policies. Next, the Contributions section outlines the key achievements of this study. The Preliminaries section introduces the formal definition of the bandit problems, where we employ the oracle quantity $z_{k,t}$ to provide a unified representation of the different targets. This approach offers a consistent framework for understanding various bandit problem settings. In the Key Claims section and the Proofs section, we present and prove a theorem and related propositions that demonstrate the requirements for establishing order optimality of the UCB policy. This theorem serves as a guide for developing UCB policies tailored to each target setting. Following this, the Discussions section confirms that some total-reward UCB and MaxSearch algorithms are order-optimal using our theorem. In addition, we construct an order-optimal PIUCB policy from our theorem using PI as the oracle quantity. We also highlight the relation between our results and the previous research on regret lower bounds (lai1985asymptotically; burnetas1996optimal; garivier2019explore). Finally, in the Summary section, we conclude this paper and discuss future prospects for applying our theorem across a wide range of scenarios.

2 Related Work

The classical total-reward bandit problem was first formulated by robbins1952some. About thirty years later, lai1985asymptotically proved that any consistent policy must pull suboptimal arms at the order of $\ln(t)$ in a stochastic setting. This results established a lower bound on regret for the classical total-reward bandit problem. This lower bound was later extended for general reward distributions by burnetas1996optimal. Furthermore, garivier2019explore proposed a fundamental inequality to derive regret lower bounds in various settings.

UCB-based policies are well-known for achieving the regret that matches the order of the lower bound described above in the classical bandit problems (lai1985asymptotically; agrawal1995sample; auer2002finite). Notably, auer2002finite proposed a simple policy, and since then, the effectiveness of UCB policies has been proven individually across various contexts and conditions (audibert2010best; garivier2011upper; bubeck2013bandits), For example, the UCB policies have been shown to be order-optimal when the reward distributions have heavy tails (bubeck2013bandits).

While many bandit problems focus on the sum or expectation of rewards, a distinct class of problems, known as the max bandit problems, aims to maximize the single best reward (cicirello2005max). In these problems, regret defined in a straightforward manner approaches zero for most policies involving random selection (nishihara2016no; kikkawa2024materials). This poses significant obstacles in proving order optimality of the algorithms. However, kikkawa2024materials recently demonstrated that UCB policies can be constructed by focusing on the number of mistaken choices and employing a carefully designed oracle. Their algorithm, called MaxSearch, practically outperforms other max bandit algorithms (streeter2006simple; achab2017max), as well as the classical UCB (auer2002finite) and the best arm identification algorithms (audibert2010best). Nevertheless, the proofs for the regret lower bound and order optimality of the max bandit problem remain open problems.

In the context of Bayesian optimization, PI appears as an acquisition function alongside expected improvement (EI) and UCB (kushner1964new; jones1998efficient; srinivas2009gaussian). Although it is rare for PI to be used in the context of bandit problem, there are instances of its use in early research on the max bandit problem (streeter2006simple). One of the challenges in using PI in the bandit problem is ensuring its theoretical authenticity, and our study justifies it by using PI as the oracle quantity.

3 Contributions

Our contributions are as follows:

•

We present a unified representation of the bandit problem using the oracle quantity $z_{k,t}$ and the number of mistaken choices (hereafter referred to as failures) instead of regret.
•

We generalize the proof of the lower and upper bounds on failures for any consistent policy and the UCB policy, respectively, within the present framework.
•

We clarify the conditions under which the UCB policy becomes order-optimal in our framework, and providing the first proof of the order optimality of the MaxSearch algorithm for the max bandit problem.
•

We propose the PIUCB algorithm, which uses UCB of PI as the selection index, from our framework. This algorithm is at least comparable to the existing state-of-the-art max bandit algorithms, and in some cases, outperforms them.

4 Preliminaries

In our framework, bandit problems are generalized by the oracle quantity $z_{k,t}$ as the following definitions:

Definition 4.1 (bandit problem).

The $K$ -armed bandit game is a game where a player repeatedly pulls one of the $K$ arms over $T$ times. When a player pulls the $k$ -th arm at the $t$ -th round, they receive a reward $r_{k}(t)$ from an unknown reward distribution $f_{k}(r):=f(r;\theta_{k})$ , where $\theta_{k}$ is a parameter set for the $k$ -th arm’s distribution. The sequence of a tuple of pulling arm indices and rewards over $T$ rounds is denoted $\mathcal{R}_{T}:=\left\{\left(k(t),r_{k(t)}(t)\right)\right\}_{t\in[T]}$ , where $[T]:=\left\{1,2,\dots,T\right\}$ . The player’s goal is to pull the optimal arm $k^{*}(t):=\argmax_{k\in[K]}z_{k,t}$ at round $t$ , where $z_{k,t}:=z_{k}\left(\left\{\theta_{k}\right\}_{k\in[K]},\mathcal{R}_{T}\right)$ is an oracle quantity. The $K$ -armed bandit problem, often referred to as the bandit problem, is to find a policy $\pi:\left\{\mathcal{R}_{t-1},X_{t}\right\}\mapsto k(t)$ that pulls the optimal arm most frequently, where $X_{t}$ is a sequence of random variables sampled from the uniform distribution $U(x)$ , used by the policy at each round $t$ .

Remark 4.2.

If $z_{k,t}=\mathbb{E}\left[\sum_{t\in[T]}r_{k}(t)\right]=T\mu_{k}$ or simply $z_{k,t}=\mu_{k}$ , where $\mu_{k}=\mathbb{E}\left[r_{k}(t)\right]$ , then Definition 4.1 corresponds to the classical total-reward bandit problem.

Remark 4.3.

If $z_{k,t}=\mathbb{E}\left[\max_{t\in[T]}r_{k}(t)\right]$ , then Definition 4.1 is corresponding to the max bandit problem in a sense of carpentier2014extreme. If $z_{k,t}$ is the EI of the $k$ -th arm at time $t$ in terms of the single best reward from $r^{\text}{max}_{t-1}:=\max_{\tau\in[t-1]}r_{k(\tau)}(\tau)$ , then Definition 4.1 corresponds to the max bandit problem in a sense of kikkawa2024materials.

Definition 4.4 (failures).

A failure occurs when the player pulls the non-optimal arm $k(t)\neq k^{*}(t)$ . The number of failures, or simply called failures, until round $t$ is defined as

N(t):=\sum_{\tau\in[t]}\mathbb{I}\left[k(\tau)\neq k^{*}(\tau)\right],

(1)

where $\mathbb{I}[A]$ is the indicating function of the event $A$ .

Remark 4.5.

The expected number of failures, $\mathbb{E}[N(T)]$ , after $T$ rounds is analogous to the expected regret, $R(T)$ , in our framework. This approach avoids difficulties in defining regret in the max bandit problem, where $R(T)\to 0$ at $T\to\infty$ for most policies, which causes fatal problems in theoretical analysis (nishihara2016no; kikkawa2024materials). In the classical bandit settings, $R(T)$ and $\mathbb{E}[N(T)]$ are equivalent up to a constant factor because $\varDelta_{w}\mathbb{E}[N(T)]\leq R(T)\leq\varDelta_{s}\mathbb{E}[N(T)]$ , where $\varDelta_{k}:=\mu_{k^{*}}-\mu_{k}$ , and the arm indices $w$ and $s$ are the worst and sub-optimal arms, respectively.

Definition 4.6 (consistency).

A policy is said to be “consistent” if the failures satisfies $\mathbb{E}[N(t)]=o(t^{\delta})$ , where $0<\delta<1$ .

Remark 4.7.

Under a “consistent” policy, the number of successful pulls satisfies $\mathbb{E}[N^{*}(t)]=\varTheta(t)$ , where $N^{*}(t):=\sum_{\tau\in[t]}\mathbb{I}[k(\tau)=k^{*}(\tau)]$ . In other words, with infinite pulls, a consistent policy always pulls the best arm mostly. Conversely, an inconsistent policy may fail indefinitely.

Definition 4.8 (UCB policy).

Assume that the oracle quantity $z_{k,t}$ satisfies

z_{k}^{\text}{LCB}\left(\mathcal{R}_{t-1}\right)\leq z_{k,t}\leq z_{k}^{\text}{UCB}\left(\mathcal{R}_{t-1}\right)

(2)

with a confidence level $1-\alpha(t)$ , where LCB is an abbreviation for lower confidence bound. A policy that pulls $k(t)=\argmax_{k\in[K]}z_{k}^{\text}{UCB}\left(\mathcal{R}_{t-1}\right)$ at each round $t$ is called a UCB policy in general.

Remark 4.9.

In the classical bandit problem, the UCB of $z_{k,t}=\mu_{k}$ under the sub-gaussian assumption for $f(r;\theta_{k})$ yields the well-known selection index (auer2002finite),

z_{k}^{\text}{UCB}(\mathcal{R}_{t-1})=\bar{\mu}_{k}(\mathcal{R}_{t-1})+c\sqrt{\frac{-\ln\alpha(t)}{N_{k}(t)}},

(3)

where $\bar{\mu}_{k}(\mathcal{R}_{t}):=[N_{k}(t)]^{-1}\sum_{\tau\in[t]}r_{k(\tau)}(\tau)\mathbb{I}[k=k(\tau)]$ , $N_{k}(t):=\sum_{\tau\in[t]}\mathbb{I}[k=k(\tau)]$ , and $c$ are the sample mean, number of the $k$ -th arm pulling, and a positive constant, respectively.

Our definitions are based solely on $z_{k,t}$ . Consequently, the claims and algorithms in the following sections are concretely defined by specifying the form of $z_{k,t}$ .

5 Key Claims

Under the above definitions, we claim the following.

Theorem 5.1 (order optimality of UCB policy).

The UCB policy gives $N(t)=\varTheta(\ln t)$ almost surely for any oracle quantity $z_{k,t}$ that satisfies the following conditions.

a)

$z_{k,t}$ and $z_{k}^{\text}{UCB}(\mathcal{R}_{t-1})$ exist at each round $t$ , and the arms with their maxima are uniquely determined almost surely.

There exists $d>0$ such that the confidence interval

z_{k}^{\text}{UCB}(\mathcal{R}_{t})-z_{k}^{\text}{LCB}(\mathcal{R}_{t})=O\left(\left[z_{k^{*}(t),t}-z_{k,t}\right]\beta_{t}^{d}\right)

(4)

with $\alpha(t)=O(t^{-1})$ , where $\beta_{t}:=(\ln t)/N_{k}(t)$ .

There exists a modified parameter set $\left\{\theta_{k,k^{*}}^{\prime}\right\}_{k,k^{*}\in[K]}$ such that

z_{k(\tau),\tau}^{\prime}:=z_{k(\tau)}\left(\left\{\theta_{k,k^{*}(\tau)}^{\prime}\right\}_{k\in[K]},\mathcal{R}_{\tau}\right)>z_{k^{*}(\tau),\tau}

(5)

and

f_{k(\tau)}(r)\ll f_{k(\tau)}^{\prime}(r):=f\left(r;\theta_{k(\tau),k^{*}(\tau)}^{\prime}\right)

(6)

when $k(\tau)\neq k^{*}(\tau)$ for all $\tau\in[t]$ , where $f\ll g$ indicates that the distribution $f$ is absolutely continuous with respect to the distribution $g$ .

Remark 5.2.

Theorem 5.1 directly implies a generalized UCB algorithm as shown in Algorithm 1. If multiple arms accidentally have the same maximum $z_{k}^{\text}{UCB}(\mathcal{R})$ in any round, one of these arms will be pulled randomly.

Algorithm 1 Generalized UCB

\mathcal{R}\leftarrow\varnothing

2: while

|\mathcal{R}|<T

k\leftarrow\argmax_{k\in[K]}z_{k}^{\text}{UCB}(\mathcal{R})

4: Pull the

k

-th arm and obtain reward

r

\mathcal{R}\leftarrow\mathcal{R}\cup\left\{(k,r)\right\}

6: end while

The proof of Theorem 5.1 is almost the same as the corresponding proofs for the classical bandit problem (lai1985asymptotically; auer2002finite; bartlett2014learning; kikkawa2024materials). This proof consists of the proofs of the following two propositions.

Proposition 5.3 (upper bound of failures).

The UCB policy guarantees $N(t)=O(\ln t)$ for any oracle quantity that satisfies conditions (a) and (b) in Theorem 5.1.

Proposition 5.4 (lower bound of failures).

For any consistent policy and oracle quantity, the failures $N(t)=\varOmega(\ln t)$ almost surely if conditions (a) and (c) in Theorem 5.1 are satisfied.

6 Proofs

6.1 Proof for Proposition 5.3

Assume the confidence bounds are given by Equation (2) in condition (b) with a confidence level of $1-\alpha(t)$ . Then, the number of times any of the $z_{k,t}$ violates these confidence bounds under the UCB policy until round $t$ is given by

\sum_{\tau\in[t]}\mathbb{I}\left[\bigcup_{k\in[K]}z_{k,\tau}<z_{k}^{\text}{LCB}\left(\mathcal{R}_{\tau-1}\right)\cup z_{k}^{\text}{UCB}\left(\mathcal{R}_{\tau-1}\right)<z_{k,\tau}\right]\leq K\sum_{\tau\in[t]}\alpha(\tau)=O(\ln t)

(7)

under the assumption of $\alpha(t)=O(t^{-1})$ in condition (b). In all other cases, where the confidence bounds of Equation (2) hold for all $k$ , the UCB policy pulls a non-optimal arm $k\neq k^{*}(t)$ when $z_{k}^{\text}{UCB}(\mathcal{R}_{t-1})\geq z_{k^{*}(t)}^{\text}{UCB}(\mathcal{R}_{t-1})$ at time $t$ . Then,

z_{k^{*}(t),t}\leq z_{k^{*}(t)}^{\text}{UCB}(\mathcal{R}_{t-1})\leq z_{k}^{\text}{UCB}(\mathcal{R}_{t-1})\leq z_{k}^{\text}{UCB}(\mathcal{R}_{t-1})+z_{k,t}-z_{k}^{\text}{LCB}(\mathcal{R}_{t-1})

(8)

must hold in this case. Thus, if Equation (4) holds, inequality (8) implies $N_{k}(t)\leq O(\ln t)$ . Then, $\sum_{k\in[K]}N_{k}(t)=O(\ln t)$ . Therefore, combining it with Equation (7), Proposition 5.3 is established since all possible cases are covered.

Remark 6.1.

The improvement of this proof over the previous studies (auer2002finite; kikkawa2024materials) lies in focusing on the width of the confidence interval and identifying the conditions under which $N(t)=O(\ln t)$ holds, without providing an explicit expression for the confidence bounds.

6.2 Proof for Proposition 5.4

From Definition 4.1, the probability of an event $N(t)\leq a_{t}$ at time $t$ under the original and modified distributions in condition (c) can be written as:

\mathbb{P}[N(t)\leq a_{t}]=\int_{N(t)\leq a_{t}}\prod_{\tau\in[t]}f_{k(\tau)}(r_{\tau})dr_{\tau}\prod_{x\in X_{\tau}}U(x)dx,

(9)

\mathbb{P}^{\prime}[N(t)\leq a_{t}]=\int_{N(t)\leq a_{t}}\prod_{\tau\in[t]}f_{k(\tau)}^{\prime}(r_{\tau})dr_{\tau}\prod_{x\in X_{\tau}}U(x)dx.

(10)

for any $z_{k,\tau}$ which can uniquely determine $k^{*}(\tau)$ . These probabilities are related to each other, as shown in the following lemma.

Lemma 6.2.

Assume $f_{k(\tau)}(r)\ll f_{k(\tau)}^{\prime}(r)$ for all $\tau\in[t]$ . Then,

\mathbb{P}[N(t)\leq a_{t}]\leq e^{N(t)M}\mathbb{P}^{\prime}[N(t)\leq a_{t}],

(11)

holds almost surely as $t\to\infty$ when setting $\theta_{k^{*},k^{*}}^{\prime}=\theta_{k^{*}}$ . The factor $M$ is given as

M:=\lim_{t\to\infty}\left[\frac{1}{N(t)}\sum_{k,k^{*}\in[K]}N_{k,k^{*}}(t)\KL(f_{k},f_{k}^{\prime})\right]=O(1)

(12)

for $N(t)$ , where $N_{k,k^{*}}(t):=\sum_{\tau\in[t]}\mathbb{I}[k=k(t)\cap k^{*}=k^{*}(t)]$ and $\KL(f,g)$ is the Kullback-Leibler (KL) divergence of $f$ from $g$ .

Proof 6.3.

From the definitions

$\displaystyle\mathbb{P}[N(t)\leq a_{t}]$	$\displaystyle=\int_{N(t)\leq a_{t}}e^{L_{t}}\prod_{\tau\in[t]}f_{k(\tau)}^{\prime}(r_{\tau})dr_{\tau}\prod_{x\in X_{\tau}}U(x)dx$	(13)
	$\displaystyle=\left[\int_{N(t)\leq a_{t}\cap L_{t}\leq b_{t}}+\int_{N(t)\leq a_{t}\cap L_{t}>b_{t}}\right]e^{L_{t}}\prod_{\tau\in[t]}f_{k(\tau)}^{\prime}(r_{\tau})dr_{\tau}\prod_{x\in X_{\tau}}U(x)dx$	(14)
	$\displaystyle\leq e^{b_{t}}\mathbb{P}^{\prime}[N(t)\leq a_{t}]+\mathbb{P}[N(t)\leq a_{t}\cap L_{t}>b_{t}]$	(15)

for any $b_{t}$ , where $L_{t}:=\sum_{k,k^{*}\in[K]}\sum_{n\in[N_{k,k^{*}}(t)]}\ln[f_{k}(r_{n})/f_{k}^{\prime}(r_{n})]$ . Here,

\frac{1}{N_{k,k^{*}}(t)}\sum_{n\in[N_{k,k^{*}}(t)]}\ln\frac{f_{k}(r_{n})}{f_{k}^{\prime}(r_{n})}\to\KL(f_{k},f_{k}^{\prime})<\infty

(16)

as $t\to\infty$ , as a result of the law of large numbers under the assumption of the absolute continuity. Thus, $L_{t}\leq N(t)M$ holds almost surely as $t\to\infty$ , since $M=M\left(\left\{\theta_{k,k^{*}}^{\prime}\right\}_{k,k^{*}\in[K]}\right)$ satisfies Equation (12). Because $N(t)=\sum_{k,k^{*}\in[K]\cap k\neq k^{*}}N_{k,k^{*}}(t)$ and $\KL(f_{k^{*}},f_{k^{*}}^{\prime})=0$ when $\theta_{k,k^{*}}^{\prime}=\theta_{k^{*}}$ , $M=O(1)$ for $N(t)$ .

When $L_{t}\leq N(t)M$ holds, the second term of inequality (15) becomes,

\mathbb{P}[N(t)\leq a_{t}\cap L_{t}>b_{t}]=\mathbb{P}[N(t)\leq a_{t}\cap L_{t}>b_{t}\cap L_{t}\leq N(t)M].

(17)

Considering the event of probability (17), $b_{t}<N(t)M$ should be satisfied. Then, this probability vanishes to $0$ if $b_{t}\geq N(t)M$ . Applying this result to inequality (15), Lemma 6.2 is obtained.

Using Lemma 6.2 and the Markov inequality,