
2024

Nobuaki Kikkawa ([email protected])
Hiroshi Ohno ([email protected])
Toyota Central R&D Labs., Inc., 41-1, Yokomichi, Nagakute, Aichi 480-1192, Japan

Unified theory of upper confidence bound policies for bandit problems targeting total reward, maximal reward, and more

Abstract

The upper confidence bound (UCB) policy is recognized as an order-optimal solution for the classical total-reward bandit problem. While similar UCB-based approaches have been applied to the max bandit problem, which aims to maximize the cumulative maximal reward, their order optimality remains unclear. In this study, we clarify the unified conditions under which the UCB policy achieves order optimality in both the total-reward and max bandit problems. A key concept of our theory is the oracle quantity, which identifies the best arm as the one with the highest value of that quantity. This allows a unified definition of the UCB policy as pulling the arm with the highest UCB of the oracle quantity. Additionally, under this setting, optimality analysis can be conducted by replacing traditional regret with the number of failures as the core measure. One consequence of our analysis is that the confidence interval of the oracle quantity must narrow appropriately as trials increase to ensure the order optimality of UCB policies. From this consequence, we prove that the previously proposed MaxSearch algorithm satisfies this condition and is an order-optimal policy for the max bandit problem. We also demonstrate that new bandit problems and their order-optimal UCB algorithms can be systematically derived by providing an appropriate oracle quantity and its confidence interval. Building on this, we propose PIUCB algorithms, which aim to pull the arm with the highest probability of improvement (PI). These algorithms can be applied to the max bandit problem in practice and perform comparably to or better than the MaxSearch algorithm in toy examples. This suggests that our theory has the potential to generate new policies tailored to specific oracle quantities.

keywords:
bandit problem, UCB policy, max bandit problem, probability of improvement, order optimality, oracle quantity

1 Introduction

The bandit problem (robbins1952some), which aims to pull the best arm based on past observations, is a fundamental issue in decision making and machine learning. Despite its simplicity, it serves as a theoretical foundation for real-world applications, such as Bayesian optimization (srinivas2009gaussian) and Monte Carlo Tree Search (MCTS) (kocsis2006bandit; browne2012survey). The scope of the bandit problem extends beyond classical total reward maximization, encompassing a variety of problem settings. These include settings that aim to maximize discounted total rewards (garivier2011upper), maximize a single best reward (cicirello2005max), or identify the optimal arm (audibert2010best). The optimal policy also varies depending on differences in the reward distributions for each arm (auer2002finite; bubeck2013bandits), as well as whether the rewards are provided stochastically or adversarially (auer2002finite; auer2002nonstochastic; bubeck2012regret). Many other policies have been proposed to address these varying conditions (agrawal1995continuum; li2010contextual; yue2012k; besbes2014stochastic). Among these policies, the upper confidence bound (UCB) policy (auer2002finite; audibert2010best; garivier2011upper; kikkawa2024materials), which utilizes confidence bounds of rewards, has proven its effectiveness under various conditions in stochastic settings.

Although UCB policies have proven generally effective in stochastic bandit problems, their validity is often evaluated case by case, depending on specific contexts and conditions. As a result, when considering variant settings, it becomes unclear what is permissible and what is not based on existing knowledge derived from standard settings. This lack of clarity poses a significant obstacle to applying UCB policies in such variant settings. By establishing unified conditions for the effectiveness of UCB policies, we aim to provide a theoretical guideline for their application across diverse problem settings. As a step toward a unified theory, this study provides a comprehensive proof of the effectiveness of the UCB policy by focusing on two stochastic bandit problems: the classic total-reward bandit problem, which aims to maximize the cumulative total reward (robbins1952some), and the max bandit problem (also known as the max $K$-armed bandit problem or extreme bandit problem), which seeks to maximize the single best reward (cicirello2005max; carpentier2014extreme; david2016pac). The key to this proof is the introduction of the oracle quantity $z_{k,t}$, which addresses the differences in the objectives of the bandit problems. This approach proves the effectiveness of the UCB policy not only for the total-reward and max bandit problems but also for bandit problems maximizing other targets. This flexibility in the target definition can improve practical exploration efficiency by tailoring the policy to each target. As an example, using the probability of improvement (PI) as the oracle quantity, we construct a PIUCB policy, which aims to maximize the PI over the cumulative best reward. This policy can effectively be used as a max bandit algorithm, whose performance is at least comparable to that of the MaxSearch algorithm, a UCB algorithm proposed by kikkawa2024materials for the max bandit problem.

This paper is organized as follows: First, the Related Work section reviews previous research on the fundamentals of bandit problems and UCB policies. Next, the Contributions section outlines the key achievements of this study. The Preliminaries section introduces the formal definition of the bandit problems, where we employ the oracle quantity $z_{k,t}$ to provide a unified representation of the different targets. This approach offers a consistent framework for understanding various bandit problem settings. In the Key Claims and Proofs sections, we present and prove a theorem and related propositions that establish the requirements for the order optimality of the UCB policy. This theorem serves as a guide for developing UCB policies tailored to each target setting. Following this, the Discussions section uses our theorem to confirm that some total-reward UCB algorithms and the MaxSearch algorithm are order-optimal. In addition, we construct an order-optimal PIUCB policy from our theorem using PI as the oracle quantity. We also highlight the relation between our results and previous research on regret lower bounds (lai1985asymptotically; burnetas1996optimal; garivier2019explore). Finally, in the Summary section, we conclude this paper and discuss future prospects for applying our theorem across a wide range of scenarios.

2 Related Work

The classical total-reward bandit problem was first formulated by robbins1952some. About thirty years later, lai1985asymptotically proved that any consistent policy must pull suboptimal arms on the order of $\ln(t)$ times in a stochastic setting. This result established a lower bound on regret for the classical total-reward bandit problem. This lower bound was later extended to general reward distributions by burnetas1996optimal. Furthermore, garivier2019explore proposed a fundamental inequality to derive regret lower bounds in various settings.

UCB-based policies are well known for achieving regret that matches the order of the lower bound described above in the classical bandit problem (lai1985asymptotically; agrawal1995sample; auer2002finite). Notably, auer2002finite proposed a simple policy, and since then, the effectiveness of UCB policies has been proven individually across various contexts and conditions (audibert2010best; garivier2011upper; bubeck2013bandits). For example, UCB policies have been shown to be order-optimal when the reward distributions have heavy tails (bubeck2013bandits).

While many bandit problems focus on the sum or expectation of rewards, a distinct class of problems, known as the max bandit problems, aims to maximize the single best reward (cicirello2005max). In these problems, regret defined in a straightforward manner approaches zero for most policies involving random selection (nishihara2016no; kikkawa2024materials). This poses significant obstacles to proving the order optimality of algorithms. However, kikkawa2024materials recently demonstrated that UCB policies can be constructed by focusing on the number of mistaken choices and employing a carefully designed oracle. Their algorithm, called MaxSearch, outperforms other max bandit algorithms in practice (streeter2006simple; achab2017max), as well as the classical UCB (auer2002finite) and best arm identification algorithms (audibert2010best). Nevertheless, the proofs of the regret lower bound and the order optimality for the max bandit problem remain open problems.

In the context of Bayesian optimization, PI appears as an acquisition function alongside expected improvement (EI) and UCB (kushner1964new; jones1998efficient; srinivas2009gaussian). Although it is rare for PI to be used in the context of bandit problems, there are instances of its use in early research on the max bandit problem (streeter2006simple). One of the challenges in using PI in bandit problems is ensuring its theoretical soundness, and our study justifies this usage by adopting PI as the oracle quantity.

3 Contributions

Our contributions are as follows:

  • We present a unified representation of the bandit problem using the oracle quantity $z_{k,t}$ and the number of mistaken choices (hereafter referred to as failures) instead of regret.

  • We generalize the proof of the lower and upper bounds on failures for any consistent policy and the UCB policy, respectively, within the present framework.

  • We clarify the conditions under which the UCB policy becomes order-optimal in our framework, and provide the first proof of the order optimality of the MaxSearch algorithm for the max bandit problem.

  • We propose the PIUCB algorithm, which uses the UCB of PI as its selection index, within our framework. This algorithm is at least comparable to existing state-of-the-art max bandit algorithms and, in some cases, outperforms them.

4 Preliminaries

In our framework, bandit problems are generalized via the oracle quantity $z_{k,t}$, as in the following definitions:

Definition 4.1 (bandit problem).

The $K$-armed bandit game is a game in which a player repeatedly pulls one of $K$ arms over $T$ rounds. When the player pulls the $k$-th arm at the $t$-th round, they receive a reward $r_{k}(t)$ drawn from an unknown reward distribution $f_{k}(r) := f(r;\theta_{k})$, where $\theta_{k}$ is the parameter set of the $k$-th arm's distribution. The sequence of tuples of pulled arm indices and rewards over $T$ rounds is denoted $\mathcal{R}_{T} := \left\{\left(k(t), r_{k(t)}(t)\right)\right\}_{t\in[T]}$, where $[T] := \{1, 2, \dots, T\}$. The player's goal is to pull the optimal arm $k^{*}(t) := \argmax_{k\in[K]} z_{k,t}$ at each round $t$, where $z_{k,t} := z_{k}\left(\left\{\theta_{k}\right\}_{k\in[K]}, \mathcal{R}_{T}\right)$ is an oracle quantity. The $K$-armed bandit problem, often referred to simply as the bandit problem, is to find a policy $\pi: \left\{\mathcal{R}_{t-1}, X_{t}\right\} \mapsto k(t)$ that pulls the optimal arm most frequently, where $X_{t}$ is a sequence of random variables sampled from the uniform distribution $U(x)$, used by the policy at each round $t$.

Remark 4.2.

If $z_{k,t} = \mathbb{E}\left[\sum_{t\in[T]} r_{k}(t)\right] = T\mu_{k}$ or simply $z_{k,t} = \mu_{k}$, where $\mu_{k} = \mathbb{E}\left[r_{k}(t)\right]$, then Definition 4.1 corresponds to the classical total-reward bandit problem.

Remark 4.3.

If $z_{k,t} = \mathbb{E}\left[\max_{t\in[T]} r_{k}(t)\right]$, then Definition 4.1 corresponds to the max bandit problem in the sense of carpentier2014extreme. If $z_{k,t}$ is the EI of the $k$-th arm at time $t$ in terms of the single best reward observed so far, $r^{\text{max}}_{t-1} := \max_{\tau\in[t-1]} r_{k(\tau)}(\tau)$, then Definition 4.1 corresponds to the max bandit problem in the sense of kikkawa2024materials.
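For concreteness, the improvement-based oracle quantities referred to above can be written in the standard Bayesian-optimization form. The display below is an illustrative sketch: the general definitions only use $r^{\text{max}}_{t-1}$ as above, while the closed forms assume a Gaussian reward model $f_{k} = \mathcal{N}(\mu_{k}, \sigma_{k}^{2})$, which is our assumption and not part of the definition:

\mathrm{PI}_{k,t} := \mathbb{P}\left[r_{k}(t) > r^{\text{max}}_{t-1}\right], \qquad \mathrm{EI}_{k,t} := \mathbb{E}\left[\max\left(r_{k}(t) - r^{\text{max}}_{t-1},\, 0\right)\right],

and, writing $u := (\mu_{k} - r^{\text{max}}_{t-1})/\sigma_{k}$ with $\Phi$ and $\varphi$ the standard normal CDF and PDF,

\mathrm{PI}_{k,t} = \Phi(u), \qquad \mathrm{EI}_{k,t} = \sigma_{k}\left[u\,\Phi(u) + \varphi(u)\right].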

Definition 4.4 (failures).

A failure occurs when the player pulls a non-optimal arm, $k(t) \neq k^{*}(t)$. The number of failures, or simply the failures, up to round $t$ is defined as

N(t) := \sum_{\tau\in[t]} \mathbb{I}\left[k(\tau) \neq k^{*}(\tau)\right], (1)

where $\mathbb{I}[A]$ is the indicator function of the event $A$.

Remark 4.5.

The expected number of failures, $\mathbb{E}[N(T)]$, after $T$ rounds is analogous to the expected regret, $R(T)$, in our framework. This approach avoids difficulties in defining regret in the max bandit problem, where $R(T) \to 0$ as $T \to \infty$ for most policies, which causes fatal problems in theoretical analysis (nishihara2016no; kikkawa2024materials). In the classical bandit setting, $R(T)$ and $\mathbb{E}[N(T)]$ are equivalent up to a constant factor because $\varDelta_{s}\,\mathbb{E}[N(T)] \leq R(T) \leq \varDelta_{w}\,\mathbb{E}[N(T)]$, where $\varDelta_{k} := \mu_{k^{*}} - \mu_{k}$, and the arm indices $w$ and $s$ denote the worst and sub-optimal arms, respectively.
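The sandwich above follows from a one-line computation in the classical setting, where the optimal arm $k^{*}$ is fixed:

R(T) = \sum_{k \neq k^{*}} \varDelta_{k}\, \mathbb{E}[N_{k}(T)], \qquad \varDelta_{s} \leq \varDelta_{k} \leq \varDelta_{w} \;\Longrightarrow\; \varDelta_{s}\, \mathbb{E}[N(T)] \leq R(T) \leq \varDelta_{w}\, \mathbb{E}[N(T)],

since $N(T) = \sum_{k \neq k^{*}} N_{k}(T)$.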

Definition 4.6 (consistency).

A policy is said to be “consistent” if the failures satisfy $\mathbb{E}[N(t)] = o(t^{\delta})$, where $0 < \delta < 1$.

Remark 4.7.

Under a “consistent” policy, the number of successful pulls satisfies $\mathbb{E}[N^{*}(t)] = \varTheta(t)$, where $N^{*}(t) := \sum_{\tau\in[t]} \mathbb{I}[k(\tau) = k^{*}(\tau)]$. In other words, as the number of pulls grows, a consistent policy pulls the best arm at almost every round. Conversely, an inconsistent policy may fail indefinitely.

Definition 4.8 (UCB policy).

Assume that the oracle quantity $z_{k,t}$ satisfies

z_{k}^{\text{LCB}}\left(\mathcal{R}_{t-1}\right) \leq z_{k,t} \leq z_{k}^{\text{UCB}}\left(\mathcal{R}_{t-1}\right) (2)

with a confidence level of $1 - \alpha(t)$, where LCB is an abbreviation for lower confidence bound. A policy that pulls $k(t) = \argmax_{k\in[K]} z_{k}^{\text{UCB}}\left(\mathcal{R}_{t-1}\right)$ at each round $t$ is called a UCB policy in general.

Remark 4.9.

In the classical bandit problem, the UCB of $z_{k,t} = \mu_{k}$ under the sub-Gaussian assumption on $f(r;\theta_{k})$ yields the well-known selection index (auer2002finite),

z_{k}^{\text{UCB}}(\mathcal{R}_{t-1}) = \bar{\mu}_{k}(\mathcal{R}_{t-1}) + c\sqrt{\frac{-\ln\alpha(t)}{N_{k}(t)}}, (3)

where $\bar{\mu}_{k}(\mathcal{R}_{t}) := [N_{k}(t)]^{-1}\sum_{\tau\in[t]} r_{k(\tau)}(\tau)\,\mathbb{I}[k = k(\tau)]$ is the sample mean of the $k$-th arm, $N_{k}(t) := \sum_{\tau\in[t]} \mathbb{I}[k = k(\tau)]$ is the number of times the $k$-th arm has been pulled, and $c$ is a positive constant.
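For illustration, a minimal Python sketch of this selection index is shown below. The function name and the default schedule $\alpha(t) = 1/t$ are our choices for this sketch, not prescriptions of the paper:

import math

def ucb_index(sample_mean, n_pulls, t, c=math.sqrt(2), alpha=None):
    # UCB selection index of Equation (3) for a single arm.
    #   sample_mean : empirical mean reward of the arm, bar{mu}_k
    #   n_pulls     : number of times the arm has been pulled, N_k(t)
    #   t           : current round (1-based)
    #   c           : positive constant in Equation (3)
    #   alpha       : confidence schedule alpha(t); defaults to 1/t
    if n_pulls == 0:
        return float("inf")  # force an initial pull of every arm
    a = alpha(t) if alpha is not None else 1.0 / t
    return sample_mean + c * math.sqrt(-math.log(a) / n_pulls)

For example, ucb_index(0.4, 10, 100) evaluates to roughly $0.4 + \sqrt{2}\sqrt{\ln 100 / 10} \approx 1.36$.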

Our definitions are based solely on $z_{k,t}$. Consequently, the claims and algorithms in the following sections are concretely defined by specifying the form of $z_{k,t}$.

5 Key Claims

Under the above definitions, we claim the following.

Theorem 5.1 (order optimality of UCB policy).

The UCB policy gives $N(t) = \varTheta(\ln t)$ almost surely for any oracle quantity $z_{k,t}$ that satisfies the following conditions.

  (a) $z_{k,t}$ and $z_{k}^{\text{UCB}}(\mathcal{R}_{t-1})$ exist at each round $t$, and the arms attaining their maxima are uniquely determined almost surely.

  (b) There exists $d > 0$ such that the confidence interval satisfies

    z_{k}^{\text{UCB}}(\mathcal{R}_{t}) - z_{k}^{\text{LCB}}(\mathcal{R}_{t}) = O\left(\left[z_{k^{*}(t),t} - z_{k,t}\right]\beta_{t}^{d}\right) (4)

    with $\alpha(t) = O(t^{-1})$, where $\beta_{t} := (\ln t)/N_{k}(t)$.

  (c) There exists a modified parameter set $\left\{\theta_{k,k^{*}}^{\prime}\right\}_{k,k^{*}\in[K]}$ such that

    z_{k(\tau),\tau}^{\prime} := z_{k(\tau)}\left(\left\{\theta_{k,k^{*}(\tau)}^{\prime}\right\}_{k\in[K]}, \mathcal{R}_{\tau}\right) > z_{k^{*}(\tau),\tau} (5)

    and

    f_{k(\tau)}(r) \ll f_{k(\tau)}^{\prime}(r) := f\left(r; \theta_{k(\tau),k^{*}(\tau)}^{\prime}\right) (6)

    whenever $k(\tau) \neq k^{*}(\tau)$, for all $\tau \in [t]$, where $f \ll g$ indicates that the distribution $f$ is absolutely continuous with respect to the distribution $g$.
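As an illustration of condition (c) (our own example, not part of the theorem), consider Gaussian arms $f_{k} = \mathcal{N}(\mu_{k}, \sigma^{2})$ with the classical oracle $z_{k,t} = \mu_{k}$. One admissible modified parameter set shifts each pulled arm's mean just above that of the optimal arm:

\mu_{k,k^{*}}^{\prime} := \mu_{k^{*}} + \epsilon \quad (\epsilon > 0), \qquad \sigma^{2} \text{ unchanged},

so that $z_{k(\tau),\tau}^{\prime} = \mu_{k(\tau),k^{*}(\tau)}^{\prime} > \mu_{k^{*}(\tau)} = z_{k^{*}(\tau),\tau}$ as required by Equation (5), $f_{k(\tau)} \ll f_{k(\tau)}^{\prime}$ holds because both Gaussians have full support (Equation (6)), and

\KL\left(f_{k}, f_{k}^{\prime}\right) = \frac{\left(\mu_{k} - \mu_{k,k^{*}}^{\prime}\right)^{2}}{2\sigma^{2}} < \infty,

which keeps the factor $M$ appearing in Lemma 6.2 below finite.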

Remark 5.2.

Theorem 5.1 directly implies a generalized UCB algorithm, as shown in Algorithm 1. If multiple arms accidentally have the same maximum $z_{k}^{\text{UCB}}(\mathcal{R})$ in any round, one of these arms is pulled at random.

Algorithm 1 Generalized UCB
1:  $\mathcal{R} \leftarrow \varnothing$
2:  while $|\mathcal{R}| < T$ do
3:     $k \leftarrow \argmax_{k\in[K]} z_{k}^{\text{UCB}}(\mathcal{R})$
4:     Pull the $k$-th arm and obtain reward $r$.
5:     $\mathcal{R} \leftarrow \mathcal{R} \cup \{(k, r)\}$
6:  end while
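A runnable counterpart of Algorithm 1 is sketched below in Python. The generic loop mirrors the pseudocode; the Gaussian reward model and the use of the classical index of Equation (3) as $z_{k}^{\text{UCB}}$ in the demonstration are illustrative assumptions, not part of the algorithm itself.

import math
import random

def generalized_ucb(pull_arm, ucb_of_oracle, num_arms, horizon):
    # Algorithm 1: always pull the arm with the largest UCB of the oracle quantity.
    #   pull_arm(k)            -> reward drawn from the unknown distribution f_k
    #   ucb_of_oracle(k, hist) -> z_k^UCB computed from the history of (arm, reward) tuples
    history = []  # the record R of (k, r) tuples
    while len(history) < horizon:
        scores = [ucb_of_oracle(k, history) for k in range(num_arms)]
        best = max(scores)
        k = random.choice([i for i, s in enumerate(scores) if s == best])  # ties broken at random
        r = pull_arm(k)
        history.append((k, r))
    return history

# Illustrative instantiation: classical index of Equation (3) with alpha(t) = 1/t and c = sqrt(2).
def classical_ucb(k, history, c=math.sqrt(2)):
    rewards = [r for (i, r) in history if i == k]
    if not rewards:
        return float("inf")  # force an initial pull of every arm
    t = len(history) + 1
    mean = sum(rewards) / len(rewards)
    return mean + c * math.sqrt(math.log(t) / len(rewards))

if __name__ == "__main__":
    means = [0.0, 0.5, 1.0]  # assumed Gaussian arms with unit variance
    history = generalized_ucb(lambda k: random.gauss(means[k], 1.0),
                              classical_ucb, num_arms=len(means), horizon=1000)
    counts = [sum(1 for (k, _) in history if k == a) for a in range(len(means))]
    print("pull counts per arm:", counts)

Any oracle quantity can be plugged in by supplying a different ucb_of_oracle function, which is exactly the sense in which Algorithm 1 is generalized.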

The proof of Theorem 5.1 is almost the same as the corresponding proofs for the classical bandit problem (lai1985asymptotically; auer2002finite; bartlett2014learning; kikkawa2024materials). This proof consists of the proofs of the following two propositions.

Proposition 5.3 (upper bound of failures).

The UCB policy guarantees $N(t) = O(\ln t)$ for any oracle quantity that satisfies conditions (a) and (b) in Theorem 5.1.

Proposition 5.4 (lower bound of failures).

For any consistent policy and any oracle quantity, the failures satisfy $N(t) = \varOmega(\ln t)$ almost surely if conditions (a) and (c) in Theorem 5.1 are satisfied.

6 Proofs

6.1 Proof for Proposition 5.3

Assume the confidence bounds are given by Equation (2) in condition (b) with a confidence level of $1 - \alpha(t)$. Then, the number of times any of the $z_{k,t}$ violates these confidence bounds under the UCB policy up to round $t$ is bounded as

\sum_{\tau\in[t]} \mathbb{I}\left[\bigcup_{k\in[K]} z_{k,\tau} < z_{k}^{\text{LCB}}\left(\mathcal{R}_{\tau-1}\right) \cup z_{k}^{\text{UCB}}\left(\mathcal{R}_{\tau-1}\right) < z_{k,\tau}\right] \leq K\sum_{\tau\in[t]}\alpha(\tau) = O(\ln t) (7)

under the assumption $\alpha(t) = O(t^{-1})$ in condition (b). In all other cases, where the confidence bounds of Equation (2) hold for all $k$, the UCB policy pulls a non-optimal arm $k \neq k^{*}(t)$ only when $z_{k}^{\text{UCB}}(\mathcal{R}_{t-1}) \geq z_{k^{*}(t)}^{\text{UCB}}(\mathcal{R}_{t-1})$ at time $t$. Then,

z_{k^{*}(t),t} \leq z_{k^{*}(t)}^{\text{UCB}}(\mathcal{R}_{t-1}) \leq z_{k}^{\text{UCB}}(\mathcal{R}_{t-1}) \leq z_{k}^{\text{UCB}}(\mathcal{R}_{t-1}) + z_{k,t} - z_{k}^{\text{LCB}}(\mathcal{R}_{t-1}) (8)

must hold in this case. Thus, if Equation (4) holds, inequality (8) implies $N_{k}(t) \leq O(\ln t)$, and hence $\sum_{k\in[K]} N_{k}(t) = O(\ln t)$. Combining this with Equation (7), Proposition 5.3 is established, since all possible cases are covered.
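To make the final step concrete, here is a sketch of how inequality (8) yields the familiar classical bound, assuming the sub-Gaussian index of Remark 4.9 with $\alpha(t) = t^{-1}$:

z_{k^{*}(t),t} - z_{k,t} = \varDelta_{k} \leq z_{k}^{\text{UCB}}(\mathcal{R}_{t-1}) - z_{k}^{\text{LCB}}(\mathcal{R}_{t-1}) = 2c\sqrt{\frac{\ln t}{N_{k}(t)}} = 2c\,\beta_{t}^{1/2},

hence $N_{k}(t) \leq (4c^{2}/\varDelta_{k}^{2})\ln t = O(\ln t)$; this is condition (b) with $d = 1/2$, since $2c\,\beta_{t}^{1/2} = O(\varDelta_{k}\,\beta_{t}^{1/2})$ when each gap $\varDelta_{k}$ is a fixed positive constant.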

Remark 6.1.

The improvement of this proof over the previous studies (auer2002finite; kikkawa2024materials) lies in focusing on the width of the confidence interval and identifying the conditions under which $N(t) = O(\ln t)$ holds, without providing an explicit expression for the confidence bounds.

6.2 Proof for Proposition 5.4

From Definition 4.1, the probability of the event $N(t) \leq a_{t}$ at time $t$ under the original and modified distributions in condition (c) can be written as

\mathbb{P}[N(t) \leq a_{t}] = \int_{N(t)\leq a_{t}} \prod_{\tau\in[t]} f_{k(\tau)}(r_{\tau})\,dr_{\tau} \prod_{x\in X_{\tau}} U(x)\,dx, (9)

\mathbb{P}^{\prime}[N(t) \leq a_{t}] = \int_{N(t)\leq a_{t}} \prod_{\tau\in[t]} f_{k(\tau)}^{\prime}(r_{\tau})\,dr_{\tau} \prod_{x\in X_{\tau}} U(x)\,dx (10)

for any $z_{k,\tau}$ that uniquely determines $k^{*}(\tau)$. These probabilities are related to each other, as shown in the following lemma.

Lemma 6.2.

Assume $f_{k(\tau)}(r) \ll f_{k(\tau)}^{\prime}(r)$ for all $\tau \in [t]$. Then,

\mathbb{P}[N(t) \leq a_{t}] \leq e^{N(t)M}\,\mathbb{P}^{\prime}[N(t) \leq a_{t}], (11)

holds almost surely as $t \to \infty$ when setting $\theta_{k^{*},k^{*}}^{\prime} = \theta_{k^{*}}$. The factor $M$ is given by

M := \lim_{t\to\infty}\left[\frac{1}{N(t)}\sum_{k,k^{*}\in[K]} N_{k,k^{*}}(t)\,\KL(f_{k}, f_{k}^{\prime})\right] = O(1) (12)

with respect to $N(t)$, where $N_{k,k^{*}}(t) := \sum_{\tau\in[t]} \mathbb{I}[k = k(\tau) \cap k^{*} = k^{*}(\tau)]$ and $\KL(f,g)$ is the Kullback–Leibler (KL) divergence of $f$ from $g$.

Proof 6.3.

From the definitions,

\mathbb{P}[N(t) \leq a_{t}] = \int_{N(t)\leq a_{t}} e^{L_{t}} \prod_{\tau\in[t]} f_{k(\tau)}^{\prime}(r_{\tau})\,dr_{\tau} \prod_{x\in X_{\tau}} U(x)\,dx (13)

= \left[\int_{N(t)\leq a_{t}\,\cap\, L_{t}\leq b_{t}} + \int_{N(t)\leq a_{t}\,\cap\, L_{t}>b_{t}}\right] e^{L_{t}} \prod_{\tau\in[t]} f_{k(\tau)}^{\prime}(r_{\tau})\,dr_{\tau} \prod_{x\in X_{\tau}} U(x)\,dx (14)

\leq e^{b_{t}}\,\mathbb{P}^{\prime}[N(t) \leq a_{t}] + \mathbb{P}[N(t) \leq a_{t} \cap L_{t} > b_{t}] (15)

for any $b_{t}$, where $L_{t} := \sum_{k,k^{*}\in[K]} \sum_{n\in[N_{k,k^{*}}(t)]} \ln[f_{k}(r_{n})/f_{k}^{\prime}(r_{n})]$. Here,

\frac{1}{N_{k,k^{*}}(t)} \sum_{n\in[N_{k,k^{*}}(t)]} \ln\frac{f_{k}(r_{n})}{f_{k}^{\prime}(r_{n})} \to \KL(f_{k}, f_{k}^{\prime}) < \infty (16)

as $t \to \infty$, as a result of the law of large numbers under the assumption of absolute continuity. Thus, $L_{t} \leq N(t)M$ holds almost surely as $t \to \infty$, since $M = M\left(\{\theta_{k,k^{*}}^{\prime}\}_{k,k^{*}\in[K]}\right)$ satisfies Equation (12). Because $N(t) = \sum_{k,k^{*}\in[K],\, k\neq k^{*}} N_{k,k^{*}}(t)$ and $\KL(f_{k^{*}}, f_{k^{*}}^{\prime}) = 0$ when $\theta_{k^{*},k^{*}}^{\prime} = \theta_{k^{*}}$, we have $M = O(1)$ with respect to $N(t)$.

When $L_{t} \leq N(t)M$ holds, the second term of inequality (15) becomes

\mathbb{P}[N(t) \leq a_{t} \cap L_{t} > b_{t}] = \mathbb{P}[N(t) \leq a_{t} \cap L_{t} > b_{t} \cap L_{t} \leq N(t)M]. (17)

For the event in probability (17) to occur, $b_{t} < N(t)M$ must be satisfied. Hence, this probability vanishes to $0$ if we choose $b_{t} \geq N(t)M$. Applying this result to inequality (15), Lemma 6.2 is obtained.

Using Lemma 6.2 and the Markov inequality,