
Improving the Knowledge Gradient Algorithm

Le Yang
Department of Systems Engineering
City University of Hong Kong
[email protected]
Siyang Gao
Department of Systems Engineering
City University of Hong Kong
[email protected]
Chin Pang Ho
School of Data Science
City University of Hong Kong
[email protected]
Abstract

The knowledge gradient (KG) algorithm is a popular policy for the best arm identification (BAI) problem. It is built on the simple idea of always choosing the measurement that yields the greatest expected one-step improvement in the estimate of the best mean of the arms. In this research, we show that this policy has limitations that prevent the algorithm from being asymptotically optimal. We then provide a remedy: following KG's manner of one-step look-ahead, but instead choosing the measurement that yields the greatest one-step improvement in the probability of selecting the best arm. The new policy is called improved knowledge gradient (iKG), and it can be shown to be asymptotically optimal. In addition, we show that compared to KG, iKG is easier to extend to variant problems of BAI, with $\epsilon$-good arm identification and feasible arm identification as two examples. The superior performance of iKG on these problems is further demonstrated using numerical examples.

1 Introduction

Best arm identification (BAI) is a sequential decision problem in which, in each stage, the agent pulls one out of $k$ given arms and observes a noisy sample of the chosen arm. At the end of the sampling stage, the agent needs to select the arm believed to be the best according to the samples. In this research, we let the best arm be the one with the largest mean. BAI is a useful abstraction of issues faced in many practical settings [1, 2] and has been widely studied in the machine learning community [3, 4]. Since in practical problems the target arm(s) to be identified is not necessarily the best arm, some variant models of BAI have also been proposed in the literature, e.g., top-$m$ arm identification [5, 6], Pareto front identification [7], $\epsilon$-good arm identification [8], feasible arm identification [9, 10], etc.

In this research, we focus on fixed-budget BAI, in which the total number of samples (budget) is fixed and known by the agent. The goal is to correctly identify the best arm when the budget is used up. Many methods have been proposed for this problem, e.g., successive rejects (SR) [4], expected improvement (EI) [11], top-two sampling [12, 13], knowledge gradient (KG) [14, 15], optimal computing budget allocation (OCBA) [16, 17, 18], etc. Among these methods, KG has been prevalent. It was first proposed in [19] and further analyzed in [20, 21]. It is built on the simple idea of always pulling the arm that yields the greatest expected one-step improvement in the estimate of the best mean of the arms. This improvement measure is analytical, making the algorithm easy to implement. KG often offers reasonable empirical performance and has been successfully applied in a number of real applications [22, 23].

However, we observe that this definition of KG has limitations that prevent the algorithm from being asymptotically optimal. Here, by not asymptotically optimal, we mean that the KG algorithm is not rate optimal: the probability of the best arm being falsely selected based on the posterior means of the $k$ arms does not converge to zero at the fastest possible rate. This results from KG allocating too few samples to the best arm and excessive samples to the remaining arms. Note that Frazier et al. [20] claimed that KG is "asymptotically optimal", but in their context, "asymptotically optimal" means consistent, i.e., all the arms are sampled infinitely often as the number of rounds $n\rightarrow\infty$, so that the best arm is correctly selected eventually. This is a relatively weak property for BAI algorithms (simple equal allocation is also consistent). In this paper, asymptotically optimal refers to rate optimal.

Contributions. We propose a new policy that overcomes this limitation of KG. The new policy follows KG's manner of one-step look-ahead, but pulls the arm that yields the greatest one-step improvement in the probability of selecting the best arm. We call it improved knowledge gradient (iKG) and show that it is asymptotically optimal. This policy originates from looking at whether the best arm has been selected at the end of sampling, instead of the extent to which the mean of the selected arm has been maximized. Although both views can identify the best arm, it turns out that the algorithms developed from them differ significantly in their rates of posterior convergence. Another advantage of iKG over KG is that iKG is more general and can be more easily extended to variant problems of BAI. We use $\epsilon$-good arm identification and feasible arm identification as examples, develop algorithms for them using the idea of iKG, and establish asymptotic optimality for the algorithms.

This paper is conceptually similar to [12], which improves the EI algorithm for BAI. However, for EI, the sampling ratio of any two arms in the non-best set is already asymptotically optimal. One only needs to introduce a parameter $\beta$ to balance the probabilities of sampling the best arm and the non-best set, without changing the sampling policy within the non-best set, to further improve EI. For KG, the sampling ratios are not asymptotically optimal for any two of the $k$ arms; improving KG requires a fundamental change in the sampling policy that influences the sampling rates of all the arms. Moreover, the improved rate of posterior convergence of EI in [12] still depends on $\beta$, which is not necessarily optimal, while we can show that this rate of iKG is optimal.

2 Knowledge Gradient and its Limitations

In this section, we review KG and discuss its limitations. Suppose there are $k$ arms in BAI. In each round $t$, the agent chooses an arm $i$ to pull and obtains a noisy sample $X_{t+1,i}$. After $n$ rounds, the agent needs to select an arm that he/she believes to be the best. Under the framework of the KG algorithm, the $X_{t+1,i}$'s are assumed to be independent across different rounds $t$ and arms $i$ and to follow the normal distribution $\mathcal{N}(\mu_i,\sigma_i^2)$ with unknown mean $\mu_i$ and known variance $\sigma_i^2$. The best arm is assumed to be unique. Without loss of generality, let $\mu_{\langle 1\rangle}>\mu_{\langle 2\rangle}\geq\ldots\geq\mu_{\langle k\rangle}$, where $\langle i\rangle$ indicates the arm with the $i$-th largest mean.

The KG algorithm can be derived from a dynamic programming (DP) formulation of BAI. The state space $\mathbb{S}$ consists of all possible posterior means and variances of the arms, $\mathbb{S}\triangleq\mathbb{R}^k\times(0,\infty)^k$. The state $S_t$ in round $t$ can be written as $S_t=(\mu_{t,1},\mu_{t,2},\ldots,\mu_{t,k},\sigma_{t,1}^2,\sigma_{t,2}^2,\ldots,\sigma_{t,k}^2)^\top$. In the Bayesian model, the unknown mean $\mu_i$ is treated as random, and we let $\theta_i$ be the random variable following its posterior distribution. We adopt normal priors $\mathcal{N}(\mu_{0,i},\sigma_{0,i}^2)$. With samples of the arms, we can compute their posterior distributions, which by conjugacy are still normal, $\mathcal{N}(\mu_{t,i},\sigma_{t,i}^2)$, in round $t$. The posterior mean and variance of arm $i$ are

\mu_{t+1,i}=\begin{cases}\dfrac{\sigma_{t,i}^{-2}\mu_{t,i}+\sigma_i^{-2}X_{t+1,i}}{\sigma_{t,i}^{-2}+\sigma_i^{-2}} & \text{if } I_t=i,\\[4pt] \mu_{t,i} & \text{if } I_t\neq i,\end{cases}\qquad\text{and}\qquad \sigma_{t+1,i}^2=\begin{cases}\dfrac{1}{\sigma_{t,i}^{-2}+\sigma_i^{-2}} & \text{if } I_t=i,\\[4pt] \sigma_{t,i}^2 & \text{if } I_t\neq i.\end{cases}  (1)
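
As a concrete illustration, the conjugate update (1) takes only a few lines of code. The following is a minimal sketch (function and variable names are ours); the non-informative prior adopted in the next paragraph corresponds to initializing the posterior variances to infinity, for which the update below remains well defined:

```python
import numpy as np

def update_posterior(mu, var_post, i, x, var_noise):
    """Conjugate-normal update of equation (1) after observing sample x
    from arm i. `mu` and `var_post` hold the posterior means mu_{t,i} and
    variances sigma_{t,i}^2; `var_noise` holds the known sigma_i^2.
    Only arm i changes; all other arms keep their posteriors."""
    mu, var_post = mu.copy(), var_post.copy()
    prec = 1.0 / var_post[i] + 1.0 / var_noise[i]   # posterior precision
    mu[i] = (mu[i] / var_post[i] + x / var_noise[i]) / prec
    var_post[i] = 1.0 / prec
    return mu, var_post
```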

In this paper, we adopt a non-informative prior for each arm $i\in\mathbb{A}$, i.e., $\mu_{0,i}=0$ and $\sigma_{0,i}=\infty$. Denote the action space by $\mathbb{A}\triangleq\{1,2,\ldots,k\}$ and the transition function by $\mathcal{T}:\mathbb{S}\times\mathbb{A}\times\mathbb{R}\rightarrow\mathbb{S}$. Suppose $\theta_{t,i}$ is a random variable following the distribution $\mathcal{N}(\mu_{t,i},\sigma_i^2)$ of arm $i$. Then, the state transition can be written as $S_{t+1}=\mathcal{T}(S_t,i,\theta_{t,i})$. Let $\pi$ be the sampling policy that guides the agent to pull arm $I_t$ in round $t$, and let $\Pi$ be the set of sampling policies $\pi=(I_0,I_1,\ldots,I_{n-1})$ adapted to the filtration generated by $I_0,X_{1,I_0},\ldots,I_{t-1},X_{t,I_{t-1}}$. After $n$ rounds, the estimated best arm $I_n^*$ is selected and a terminal reward $v_n(S_n)$ is received. We can write our objective as

\sup_{\pi\in\Pi}\mathbb{E}_{\pi}\,v_n(S_n).  (2)

The DP principle implies that the value function in round $0\leq t<n$ can be computed recursively by

v_t(S)\triangleq\max_{i\in\mathbb{A}}\mathbb{E}[v_{t+1}(\mathcal{T}(S,i,\theta_{t,i}))],\quad S\in\mathbb{S}.

We define the Q-factors as

Q_t(S,i)\triangleq\mathbb{E}[v_{t+1}(\mathcal{T}(S,i,\theta_{t,i}))],\quad S\in\mathbb{S},

and the DP principle tells us that any policy satisfying

I_t(S)\in\operatorname*{argmax}_{i\in\mathbb{A}}Q_t(S,i),\quad S\in\mathbb{S}

is optimal. However, computing the optimal policy is intractable except for problems of very small scale, a difficulty known as the "curse of dimensionality".

On the other hand, note that apart from the terminal reward $v_n(S_n)$, this problem has no rewards in the other rounds, so we can restructure $v_n(S_n)$ as a telescoping sum

v_n(S_n)=[v_n(S_n)-v_n(S_{n-1})]+\ldots+[v_n(S_{t+1})-v_n(S_t)]+v_n(S_t).

Thus, $v_n(S_n)$ can be treated as the accumulation of multiple one-step improvements $v_n(S_l)-v_n(S_{l-1})$, $l=t+1,\ldots,n$. A class of one-step look-ahead algorithms iteratively pulls the arm that maximizes the expectation of the one-step improvement in the value function

\mathbb{E}[v_n(\mathcal{T}(S_t,i,\theta_{t,i}))-v_n(S_t)].  (3)

These algorithms are not optimal in general unless there is only one round left, i.e., $n=t+1$.

The KG algorithm falls in this class. It sets the terminal reward as $v_n(S_n)=\mu_{I_n^*}$. With this reward, the one-step improvement in (3) becomes

\text{KG}_{t,i}=\mathbb{E}\Big[\max\big\{\mathcal{T}(\mu_{t,i},i,\theta_{t,i}),\max_{i'\neq i}\mu_{t,i'}\big\}-\max_{i\in\mathbb{A}}\mu_{t,i}\Big],

and in each round, the KG algorithm pulls the arm $I_t(S_t)\in\operatorname*{argmax}_{i\in\mathbb{A}}\text{KG}_{t,i}$.

Input: $k\geq 2$, $n$
1  Collect $n_0$ samples for each arm $i$;
2  while $t<n$ do
3      Compute $\text{KG}_{t,i}$ and set $I_t=\operatorname*{argmax}_{i\in\mathbb{A}}\text{KG}_{t,i}$;
4      Play $I_t$;
5      Update $\mu_{t+1,i}$ and $\sigma_{t+1,i}^2$;
Output: $I_n^*$
Algorithm 1 KG Algorithm
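
Under independent normal beliefs, the KG factor is known in closed form (Frazier et al. [20]): $\text{KG}_{t,i}=\tilde{\sigma}_{t,i}\,f(-|\mu_{t,i}-\max_{i'\neq i}\mu_{t,i'}|/\tilde{\sigma}_{t,i})$, where $f(z)=z\Phi(z)+\phi(z)$ and $\tilde{\sigma}_{t,i}^2=\sigma_{t,i}^2-\sigma_{t+1,i}^2$ is the one-step reduction in posterior variance. The following is a minimal sketch of Algorithm 1 built on this closed form; it reuses `update_posterior` from the earlier snippet, and `sample(i)` is an assumed user-supplied function that draws one observation from arm $i$.

```python
import numpy as np
from scipy.stats import norm

def kg_factors(mu, var_post, var_noise):
    """Closed-form KG_{t,i} for all arms under independent normal beliefs."""
    k = len(mu)
    kg = np.empty(k)
    for i in range(k):
        var_next = 1.0 / (1.0 / var_post[i] + 1.0 / var_noise[i])
        s_tilde = np.sqrt(var_post[i] - var_next)   # std. of the mean's update
        z = -abs(mu[i] - np.max(np.delete(mu, i))) / s_tilde
        kg[i] = s_tilde * (z * norm.cdf(z) + norm.pdf(z))
    return kg

def kg_run(sample, var_noise, n, n0=2):
    """Algorithm 1: n0 initial samples per arm, then greedy KG sampling."""
    k = len(var_noise)
    mu, var_post = np.zeros(k), np.full(k, np.inf)  # non-informative prior
    t = 0
    for i in range(k):
        for _ in range(n0):
            mu, var_post = update_posterior(mu, var_post, i, sample(i), var_noise)
            t += 1
    while t < n:
        i = int(np.argmax(kg_factors(mu, var_post, var_noise)))
        mu, var_post = update_posterior(mu, var_post, i, sample(i), var_noise)
        t += 1
    return int(np.argmax(mu))   # the estimated best arm I_n^*
```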

We next characterize, for the KG algorithm, the rate of posterior convergence of $1-\mathbb{P}\{I_n^*=I^*\}$, the probability that the best arm is falsely selected.

Proposition 1.

Let $c_{\langle i\rangle}=\frac{(\mu_{\langle 1\rangle}-\mu_{\langle i\rangle})/\sigma_{\langle i\rangle}}{(\mu_{\langle 1\rangle}-\mu_{\langle 2\rangle})/\sigma_{\langle 2\rangle}}$, $i=2,\ldots,k$. For the KG algorithm,

\lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{I_n^*=I^*\})=\Gamma^{\text{KG}},

where

\Gamma^{\text{KG}}=\min_{i\neq 1}\Bigg(\frac{(\mu_{\langle i\rangle}-\mu_{\langle 1\rangle})^2}{2\big(\big(\sum_{i\neq 1}\sigma_{\langle 2\rangle}/c_{\langle i\rangle}+\sigma_{\langle 1\rangle}\big)\sigma_{\langle 1\rangle}+c_{\langle i\rangle}\sigma_{\langle i\rangle}^2\big(\sum_{i\neq 1}1/c_{\langle i\rangle}+\sigma_{\langle 1\rangle}/\sigma_{\langle 2\rangle}\big)\big)}\Bigg).

We observe that $\Gamma^{\text{KG}}$ is not optimal. To make this point, Proposition 2 gives an example in which $\Gamma^{\text{KG}}$ is no better than the corresponding rate of the TTEI algorithm [12] when the parameter $\beta$ (the probability of sampling the best arm) of TTEI is set to a suboptimal value.

Proposition 2.

For the TTEI algorithm [12], the rate of posterior convergence of $1-\mathbb{P}\{I_n^*=I^*\}$ exists and is denoted by $\Gamma^{\text{TTEI}}$. Let its probability of sampling the best arm be $\beta=(\sigma_{\langle 2\rangle}/\sigma_{\langle 1\rangle}\sum_{i\neq 1}1/c_{\langle i\rangle}+1)^{-1}$. We have $\Gamma^{\text{KG}}\leq\Gamma^{\text{TTEI}}$.

According to the proof of Proposition 2, there are configurations of the BAI problem leading to $\Gamma^{\text{KG}}<\Gamma^{\text{TTEI}}$, i.e., $\Gamma^{\text{KG}}$ is not optimal. In fact, with $\beta=(\sigma_{\langle 2\rangle}/\sigma_{\langle 1\rangle}\sum_{i\neq 1}1/c_{\langle i\rangle}+1)^{-1}$, $\Gamma^{\text{KG}}=\Gamma^{\text{TTEI}}$ is achieved only in some special cases, e.g., when $k=2$.

3 Improved Knowledge Gradient

In this section, we propose an improved knowledge gradient (iKG) algorithm. We still follow KG's manner of one-step look-ahead, but set the terminal reward of problem (2) as $v_n(S_n)=\mathbf{1}\{I_n^*=I^*\}$. That is, for the goal of identifying the best arm, we reward the selected arm with a 0-1 quantity indicating whether this arm is the best arm, instead of the mean of this arm (as in KG).

In this case, $\mathbb{E}[v_n(S_n)]=\mathbb{P}\{I_n^*=I^*\}$, where

\mathbb{P}\{I_n^*=I^*\}=\mathbb{P}\bigg\{\bigcap_{i\neq I_n^*}(\theta_{I_n^*}>\theta_i)\bigg\}=1-\mathbb{P}\bigg\{\bigcup_{i\neq I_n^*}(\theta_i>\theta_{I_n^*})\bigg\}.  (4)

However, the probability $\mathbb{P}\big\{\bigcup_{i\neq I_n^*}(\theta_i>\theta_{I_n^*})\big\}$ in (4) does not have an analytical expression. To facilitate the implementation and analysis of the algorithm, we approximate it using the Bonferroni inequality [24]:

\mathbb{P}\bigg\{\bigcup_{i\neq I_n^*}(\theta_i>\theta_{I_n^*})\bigg\}\leq\sum_{i\neq I_n^*}\mathbb{P}(\theta_i>\theta_{I_n^*}),

and $\mathbb{E}[v_n(S_n)]$ can be approximately computed as

\mathbb{E}[v_n(S_n)]\approx 1-\sum_{i\neq I_n^*}\mathbb{P}(\theta_i>\theta_{I_n^*})=1-\sum_{i\neq I_n^*}\exp\Bigg(-\frac{(\mu_{n,i}-\mu_{n,I_n^*})^2}{2(\sigma_{n,i}^2+\sigma_{n,I_n^*}^2)}\Bigg).  (5)

Note that the Bonferroni inequality has been adopted in the literature as an approximation of the probability of correct selection for the development of BAI algorithms [16]. For our purpose, we can show that the resulting algorithm remains asymptotically optimal and empirically superior under this approximation.

Input: $k\geq 2$, $n$
1  Collect $n_0$ samples for each arm $i$;
2  while $t<n$ do
3      Compute $\text{iKG}_{t,i}$ and set $I_t=\operatorname*{argmax}_{i\in\mathbb{A}}\text{iKG}_{t,i}$;
4      Play $I_t$;
5      Update $\mu_{t+1,i}$, $\sigma_{t+1,i}^2$ and $I_{t+1}^*$;
Output: $I_n^*$
Algorithm 2 iKG Algorithm

Let $\text{iKG}_{t,i}$ be the one-step improvement in (3) with $I_t^*$ treated as unchanged after one more sample and $\mathbb{E}[v_n(S_n)]$ approximated by (5). The following proposition gives $\text{iKG}_{t,i}$ in closed form. The iKG algorithm pulls the arm with the largest $\text{iKG}_{t,i}$ in each round.

Proposition 3.

With the definition of $\text{iKG}_{t,i}$ above, we have

\text{iKG}_{t,i}=\begin{cases}\exp\Big(-\frac{(\mu_{t,i}-\mu_{t,I_t^*})^2}{2(\sigma_{t,i}^2+\sigma_{t,I_t^*}^2)}\Big)-\exp\Big(-\frac{(\mu_{t,i}-\mu_{t,I_t^*})^2}{2(\sigma_{t+1,i}^2+\sigma_{t,I_t^*}^2+\sigma_i^2(\sigma_{t+1,i}^2/\sigma_i^2)^2)}\Big), & \text{if } i\neq I_t^*,\\[6pt] \sum_{i'\neq I_t^*}\exp\Big(-\frac{(\mu_{t,i'}-\mu_{t,I_t^*})^2}{2(\sigma_{t,i'}^2+\sigma_{t,I_t^*}^2)}\Big)-\sum_{i'\neq I_t^*}\exp\Big(-\frac{(\mu_{t,i'}-\mu_{t,I_t^*})^2}{2(\sigma_{t,i'}^2+\sigma_{t+1,I_t^*}^2+\sigma_{I_t^*}^2(\sigma_{t+1,I_t^*}^2/\sigma_{I_t^*}^2)^2)}\Big), & \text{if } i=I_t^*.\end{cases}  (6)
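
A direct implementation of (6) is straightforward. The following is a minimal sketch (function and variable names are ours; inputs are NumPy arrays of the posterior means $\mu_{t,i}$, posterior variances $\sigma_{t,i}^2$ and sampling variances $\sigma_i^2$); it can replace `kg_factors` in the earlier KG loop to obtain Algorithm 2:

```python
import numpy as np

def _tail(gap, v1, v2):
    # Gaussian-tail approximation exp(-gap^2 / (2 (v1 + v2))) used in (5)
    return np.exp(-np.asarray(gap)**2 / (2.0 * (v1 + v2)))

def ikg_factors(mu, var_post, var_noise):
    """iKG_{t,i} of equation (6) for all arms."""
    k = len(mu)
    star = int(np.argmax(mu))                            # current best I_t^*
    var_next = 1.0 / (1.0 / var_post + 1.0 / var_noise)  # sigma_{t+1,i}^2
    var_pred = var_next + var_noise * (var_next / var_noise)**2
    others = np.array([j for j in range(k) if j != star])
    ikg = np.empty(k)
    for i in range(k):
        if i != star:
            gap = mu[i] - mu[star]
            ikg[i] = (_tail(gap, var_post[i], var_post[star])
                      - _tail(gap, var_pred[i], var_post[star]))
        else:
            gaps = mu[others] - mu[star]
            ikg[i] = (_tail(gaps, var_post[others], var_post[star]).sum()
                      - _tail(gaps, var_post[others], var_pred[star]).sum())
    return ikg
```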

Both KG and iKG are greedy algorithms that look at the improvement only one step ahead. The essential difference between them lies in the reward they use for the event of best arm identification. For KG, it is the mean of the arm selected, while for iKG, it is a 0-1 quantity indicating whether the best arm is selected. It is interesting to note that the choice between these two rewards has been discussed in the control community for the optimization of complex systems, as cardinal optimization (similar to KG) vs. ordinal optimization (similar to iKG) [25]. The conclusion of that discussion is in line with this research: ordinal optimization has advantages over cardinal optimization in the convergence rates of the optimization algorithms [26].

Theorem 1.

For the iKG algorithm, $\lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{I_n^*=I^*\})=\Gamma^{\text{iKG}}$, where, for any $i\neq 1$,

\Gamma^{\text{iKG}}=\frac{(\mu_{\langle i\rangle}-\mu_{\langle 1\rangle})^2}{2(\sigma_{\langle i\rangle}^2/w_{\langle i\rangle}+\sigma_{\langle 1\rangle}^2/w_{\langle 1\rangle})},  (7)

and $w_i$ is the sampling rate of arm $i$ satisfying

\sum_{i=1}^{k}w_i=1,\quad \frac{w_{\langle 1\rangle}^2}{\sigma_{\langle 1\rangle}^2}=\sum_{i=2}^{k}\frac{w_{\langle i\rangle}^2}{\sigma_{\langle i\rangle}^2} \quad\text{ and }\quad \frac{(\mu_{\langle i\rangle}-\mu_{\langle 1\rangle})^2}{2(\sigma_{\langle i\rangle}^2/w_{\langle i\rangle}+\sigma_{\langle 1\rangle}^2/w_{\langle 1\rangle})}=\frac{(\mu_{\langle i'\rangle}-\mu_{\langle 1\rangle})^2}{2(\sigma_{\langle i'\rangle}^2/w_{\langle i'\rangle}+\sigma_{\langle 1\rangle}^2/w_{\langle 1\rangle})},\ \ i\neq i'\neq 1.  (8)

In addition, for any BAI algorithm,

\limsup_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{I_n^*=I^*\})\leq\Gamma^{\text{iKG}}.

Theorem 1 shows that the rate of posterior convergence $\Gamma^{\text{iKG}}$ of the iKG algorithm is the fastest possible. Again taking TTEI as an example, the theorem indicates that $\Gamma^{\text{TTEI}}\leq\Gamma^{\text{iKG}}$ for any $\beta\in(0,1)$, with equality only when $\beta$ is set to the optimal value $\beta^*$, which is typically unknown.
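
The conditions (8) form a system of $k$ equations in the $k$ sampling rates and can be solved numerically. The sketch below (ours, for illustration only; iKG never solves this system explicitly and only attains these rates asymptotically) uses a generic root finder:

```python
import numpy as np
from scipy.optimize import fsolve

def optimal_rates(mu, sigma2):
    """Solve the optimality conditions (8) for the rates w_1,...,w_k."""
    k, b = len(mu), int(np.argmax(mu))            # b indexes the best arm
    non_best = [i for i in range(k) if i != b]

    def rate(w, i):                               # rate of arm i vs. the best
        return (mu[i] - mu[b])**2 / (2.0 * (sigma2[i]/w[i] + sigma2[b]/w[b]))

    def equations(w):
        eqs = [np.sum(w) - 1.0,                   # rates sum to one
               w[b]**2/sigma2[b] - sum(w[i]**2/sigma2[i] for i in non_best)]
        eqs += [rate(w, i) - rate(w, non_best[0]) for i in non_best[1:]]
        return eqs

    return fsolve(equations, np.full(k, 1.0/k))

# e.g., optimal_rates(np.array([1.0, 0.5, 0.3]), np.ones(3))
```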

4 Variant Problems of BAI

Another advantage of iKG over KG is that iKG is more general, in the sense that it can be easily extended to solve variant problems of BAI. In these variants, the target arms to be identified are not the single best arm; but no matter how the target arms are defined, one can always look at whether they are correctly identified at the end of sampling and investigate the probability of this event to develop the iKG quantity and the algorithm. In contrast, it is difficult to extend KG to identify arms that cannot be found by optimizing the means of these (and/or other) arms. In this section, we extend iKG to two BAI variants: $\epsilon$-good arm identification [8] and feasible arm identification [10]. We develop algorithms for them and establish their asymptotic optimality. Note that in these two variant problems, the target arms need to be found by comparing their means with some fixed values. In such cases, the idea of KG is not straightforward to apply.

4.1 $\epsilon$-Good Arm Identification

We follow the notation of Sections 2 and 3. For the $k$ arms, suppose $\mu_{\langle 1\rangle}\geq\mu_{\langle 2\rangle}\geq\ldots\geq\mu_{\langle k\rangle}$. Given $\epsilon>0$, the $\epsilon$-good arm identification problem aims to find all arms $i$ with $\mu_{\langle i\rangle}>\mu_{\langle 1\rangle}-\epsilon$, i.e., all arms whose means are close enough to the best ($\epsilon$-good). Assume that no arm has mean lying exactly at $\mu_{\langle 1\rangle}-\epsilon$. Denote the set of $\epsilon$-good arms by $G^{\epsilon}$ and the estimated set of $\epsilon$-good arms after $n$ rounds by $G_n^{\epsilon}$. We set the terminal reward $v_n(S_n)=\mathbf{1}\{G_n^{\epsilon}=G^{\epsilon}\}$, i.e., whether the set $G^{\epsilon}$ is correctly selected. Then, $\mathbb{E}[v_n(S_n)]=\mathbb{P}\{G_n^{\epsilon}=G^{\epsilon}\}$, where

\begin{split}\mathbb{P}\{G_n^{\epsilon}=G^{\epsilon}\}&=\mathbb{P}\bigg\{\bigcap_{i\in G_n^{\epsilon}}\Big(\theta_i>\max_{i'\in\mathbb{A}}\theta_{i'}-\epsilon\Big)\cap\bigcap_{i\in\mathbb{A}\setminus G_n^{\epsilon}}\Big(\theta_i<\max_{i'\in\mathbb{A}}\theta_{i'}-\epsilon\Big)\bigg\}\\&=1-\mathbb{P}\bigg\{\bigcup_{i\in G_n^{\epsilon}}\Big(\theta_i<\max_{i'\in\mathbb{A}}\theta_{i'}-\epsilon\Big)\cup\bigcup_{i\in\mathbb{A}\setminus G_n^{\epsilon}}\Big(\theta_i>\max_{i'\in\mathbb{A}}\theta_{i'}-\epsilon\Big)\bigg\}.\end{split}

Again, applying the Bonferroni inequality,

\mathbb{P}\{G_n^{\epsilon}=G^{\epsilon}\}\geq 1-\sum_{i\in G_n^{\epsilon}}\mathbb{P}\Big(\theta_i<\max_{i'\in\mathbb{A}}\theta_{i'}-\epsilon\Big)-\sum_{i\in\mathbb{A}\setminus G_n^{\epsilon}}\mathbb{P}\Big(\theta_i>\max_{i'\in\mathbb{A}}\theta_{i'}-\epsilon\Big).  (9)

Let $\text{iKG}_{t,i}^{\epsilon}$ be the one-step improvement in (3) with $I_t^*$ treated as unchanged after one more sample and $\mathbb{E}[v_n(S_n)]$ approximated by the right-hand side of (9). The following proposition gives $\text{iKG}_{t,i}^{\epsilon}$ in closed form.

Proposition 4.

With the definition of $\text{iKG}_{t,i}^{\epsilon}$ above, we have

\text{iKG}_{t,i}^{\epsilon}=\begin{cases}\exp\Big(-\frac{(\mu_{t,i}-\mu_{t,I_t^*}+\epsilon)^2}{2(\sigma_{t,i}^2+\sigma_{t,I_t^*}^2)}\Big)-\exp\Big(-\frac{(\mu_{t,i}-\mu_{t,I_t^*}+\epsilon)^2}{2(\sigma_{t+1,i}^2+\sigma_{t,I_t^*}^2+\sigma_i^2(\sigma_{t+1,i}^2/\sigma_i^2)^2)}\Big), & \text{if } i\neq I_t^*,\\[6pt] \sum_{i'\neq I_t^*}\exp\Big(-\frac{(\mu_{t,i'}-\mu_{t,I_t^*}+\epsilon)^2}{2(\sigma_{t,i'}^2+\sigma_{t,I_t^*}^2)}\Big)-\sum_{i'\neq I_t^*}\exp\Big(-\frac{(\mu_{t,i'}-\mu_{t,I_t^*}+\epsilon)^2}{2(\sigma_{t,i'}^2+\sigma_{t+1,I_t^*}^2+\sigma_{I_t^*}^2(\sigma_{t+1,I_t^*}^2/\sigma_{I_t^*}^2)^2)}\Big), & \text{if } i=I_t^*.\end{cases}  (10)
Input: $k\geq 2$, $n$
1  Collect $n_0$ samples for each arm $i$;
2  while $t<n$ do
3      Compute $\text{iKG}_{t,i}^{\epsilon}$ and set $I_t=\operatorname*{argmax}_{i\in\mathbb{A}}\text{iKG}_{t,i}^{\epsilon}$;
4      Play $I_t$;
5      Update $\mu_{t+1,i}$, $\sigma_{t+1,i}^2$ and $I_{t+1}^*$;
Output: $G_n^{\epsilon}$
Algorithm 3 iKG-$\epsilon$ Algorithm ($\epsilon$-Good Arm Identification)

To identify the $\epsilon$-good arms, the iKG-$\epsilon$ algorithm pulls the arm with the largest $\text{iKG}_{t,i}^{\epsilon}$ in each round. For this algorithm, we can show that the rate of posterior convergence of $1-\mathbb{P}\{G_n^{\epsilon}=G^{\epsilon}\}$ is the fastest possible.
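
Equation (10) is identical to (6) except that every gap $\mu_{t,i}-\mu_{t,I_t^*}$ is shifted by $+\epsilon$, so the earlier iKG sketch carries over with a one-line change (again, names are ours and `_tail` is the helper defined there):

```python
import numpy as np

def ikg_eps_factors(mu, var_post, var_noise, eps):
    """iKG^eps_{t,i} of equation (10); reuses _tail from the iKG sketch."""
    k = len(mu)
    star = int(np.argmax(mu))
    var_next = 1.0 / (1.0 / var_post + 1.0 / var_noise)
    var_pred = var_next + var_next**2 / var_noise
    others = np.array([j for j in range(k) if j != star])
    ikg = np.empty(k)
    for i in range(k):
        if i != star:
            gap = mu[i] - mu[star] + eps        # the only change vs. (6)
            ikg[i] = (_tail(gap, var_post[i], var_post[star])
                      - _tail(gap, var_pred[i], var_post[star]))
        else:
            gaps = mu[others] - mu[star] + eps
            ikg[i] = (_tail(gaps, var_post[others], var_post[star]).sum()
                      - _tail(gaps, var_post[others], var_pred[star]).sum())
    return ikg
```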

Theorem 2.

For the iKG-$\epsilon$ algorithm, $\lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{G_n^{\epsilon}=G^{\epsilon}\})=\Gamma^{\epsilon}$, where, for any $i\neq 1$,

\Gamma^{\epsilon}=\frac{(\mu_{\langle i\rangle}-\mu_{\langle 1\rangle}+\epsilon)^2}{2(\sigma_{\langle i\rangle}^2/w_{\langle i\rangle}+\sigma_{\langle 1\rangle}^2/w_{\langle 1\rangle})},  (11)

and $w_i$ is the sampling rate of arm $i$ satisfying

\sum_{i=1}^{k}w_i=1,\quad \frac{w_{\langle 1\rangle}^2}{\sigma_{\langle 1\rangle}^2}=\sum_{i=2}^{k}\frac{w_{\langle i\rangle}^2}{\sigma_{\langle i\rangle}^2} \quad\text{ and }\quad \frac{(\mu_{\langle i\rangle}-\mu_{\langle 1\rangle}+\epsilon)^2}{2(\sigma_{\langle i\rangle}^2/w_{\langle i\rangle}+\sigma_{\langle 1\rangle}^2/w_{\langle 1\rangle})}=\frac{(\mu_{\langle i'\rangle}-\mu_{\langle 1\rangle}+\epsilon)^2}{2(\sigma_{\langle i'\rangle}^2/w_{\langle i'\rangle}+\sigma_{\langle 1\rangle}^2/w_{\langle 1\rangle})},\ \ i\neq i'\neq 1.  (12)

In addition, for any $\epsilon$-good arm identification algorithm,

\limsup_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{G_n^{\epsilon}=G^{\epsilon}\})\leq\Gamma^{\epsilon}.

4.2 Feasible Arm Identification

In feasible arm identification, samples from pulling arm $i$ are $m$-dimensional vectors $\bm{X}_{t+1,i}=[X_{t+1,i1},\ldots,X_{t+1,im}]$ instead of scalars, where each dimension of the vector corresponds to some measure of the system performance and $X_{t+1,ij}$ is the observation associated with arm $i$ and measure $j$. Suppose the $X_{t+1,ij}$'s follow normal distributions with unknown means $\mu_{ij}$ and known variances $\sigma_{ij}^2$. We impose constraints $\mu_{ij}\leq\gamma_j$ on arms $i=1,2,\ldots,k$ and measures $j=1,2,\ldots,m$. The goal of this problem is to find the set of feasible arms $\mathcal{S}^1$. Let the estimated set of feasible arms after $n$ rounds be $\mathcal{S}_n^1$, and let $\mathcal{S}^2=\mathbb{A}\setminus\mathcal{S}^1$. We assume that the $X_{t+1,ij}$'s are independent across different rounds $t$ and measures $j$, and that no $\mu_{ij}$ lies exactly on a constraint limit $\gamma_j$. To facilitate the analysis, we also define for round $t$ the set of measures $\mathcal{E}_{t,i}^1\triangleq\{j:\mu_{t,ij}\leq\gamma_j\}$ satisfied by arm $i$ and the set of measures $\mathcal{E}_{t,i}^2\triangleq\{j:\mu_{t,ij}>\gamma_j\}$ violated by arm $i$.

Set the terminal reward $v_n(S_n)=\mathbf{1}\{\mathcal{S}_n^1=\mathcal{S}^1\}$, i.e., whether the set $\mathcal{S}^1$ is correctly selected. Then, $\mathbb{E}[v_n(S_n)]=\mathbb{P}\{\mathcal{S}_n^1=\mathcal{S}^1\}$, where

\begin{split}\mathbb{P}\{\mathcal{S}_n^1=\mathcal{S}^1\}&=\mathbb{P}\bigg\{\bigcap_{i\in\mathcal{S}_n^1}\Big(\bigcap_{j=1}^{m}(\theta_{ij}\leq\gamma_j)\Big)\cap\bigcap_{i\in\mathcal{S}_n^2}\Big(\bigcup_{j=1}^{m}(\theta_{ij}>\gamma_j)\Big)\bigg\}\\&=1-\mathbb{P}\bigg\{\bigcup_{i\in\mathcal{S}_n^1}\Big(\bigcup_{j=1}^{m}(\theta_{ij}>\gamma_j)\Big)\cup\bigcup_{i\in\mathcal{S}_n^2}\Big(\bigcap_{j=1}^{m}(\theta_{ij}\leq\gamma_j)\Big)\bigg\}.\end{split}

Applying the Bonferroni inequality,

\mathbb{P}\{\mathcal{S}_n^1=\mathcal{S}^1\}\geq 1-\sum_{i\in\mathcal{S}_n^1}\sum_{j=1}^{m}\mathbb{P}(\theta_{ij}>\gamma_j)-\sum_{i\in\mathcal{S}_n^2}\prod_{j\in\mathcal{E}_{n,i}^2}\mathbb{P}(\theta_{ij}\leq\gamma_j).  (13)

The inequality holds because $0<\prod_{j\in\mathcal{E}_{n,i}^1}\mathbb{P}(\theta_{ij}\leq\gamma_j)\leq 1$.

Let $\text{iKG}_{t,i}^{\text{F}}$ be the one-step improvement in (3) with $\mathcal{S}_t^1$, $\mathcal{S}_t^2$ and $\mathcal{E}_{t,i}^2$ treated as unchanged after one more sample and $\mathbb{E}[v_n(S_n)]$ approximated by the right-hand side of (13). The following proposition gives $\text{iKG}_{t,i}^{\text{F}}$ in closed form.

Proposition 5.

With the definition of $\text{iKG}_{t,i}^{\text{F}}$ above, we have

\begin{split}\text{iKG}_{t,i}^{\text{F}}=&\sum_{j=1}^{m}\Bigg(\exp\Big(-\frac{(\gamma_j-\mu_{t,ij})^2}{2\sigma_{t,ij}^2}\mathbf{1}\{i\in\mathcal{S}_t^1\}\Big)-\exp\Big(-\frac{(\gamma_j-\mu_{t,ij})^2}{2(\sigma_{t+1,ij}^2+\sigma_{ij}^2(\sigma_{t+1,ij}^2/\sigma_{ij}^2)^2)}\mathbf{1}\{i\in\mathcal{S}_t^1\}\Big)\Bigg)\\&+\exp\Big(-\sum_{j\in\mathcal{E}_{t,i}^2}\frac{(\gamma_j-\mu_{t,ij})^2}{2\sigma_{t,ij}^2}\mathbf{1}\{i\in\mathcal{S}_t^2\}\Big)-\exp\Big(-\sum_{j\in\mathcal{E}_{t,i}^2}\frac{(\gamma_j-\mu_{t,ij})^2}{2(\sigma_{t+1,ij}^2+\sigma_{ij}^2(\sigma_{t+1,ij}^2/\sigma_{ij}^2)^2)}\mathbf{1}\{i\in\mathcal{S}_t^2\}\Big).\end{split}  (14)
Input: $k\geq 2$, $n$
1  Collect $n_0$ samples for each arm $i$;
2  while $t<n$ do
3      Compute $\text{iKG}_{t,i}^{\text{F}}$ and set $I_t=\operatorname*{argmax}_{i\in\mathbb{A}}\text{iKG}_{t,i}^{\text{F}}$;
4      Play $I_t$;
5      Update $\mu_{t+1,i}$, $\sigma_{t+1,i}^2$, $\mathcal{S}_{t+1}^1$, $\mathcal{S}_{t+1}^2$, $\mathcal{E}_{t+1,i}^1$ and $\mathcal{E}_{t+1,i}^2$;
Output: $\mathcal{S}_n^1$
Algorithm 4 iKG-F Algorithm (Feasible Arm Identification)

To identify the feasible arms, the iKG-F algorithm pulls the arm with the largest $\text{iKG}_{t,i}^{\text{F}}$ in each round. For this algorithm, we can show that the rate of posterior convergence of $1-\mathbb{P}\{\mathcal{S}_n^1=\mathcal{S}^1\}$ is also the fastest possible.
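
For completeness, the following is a minimal sketch of (14) for a single arm (names are ours). When an arm is currently declared infeasible, the first sum in (14) vanishes because its indicator makes both exponentials equal to one, and vice versa; the sketch exploits this by branching on the arm's current feasibility status:

```python
import numpy as np

def ikg_f_factor(mu_i, var_post_i, var_noise_i, gamma):
    """iKG^F_{t,i} of equation (14) for one arm with m measures.
    All arguments are length-m NumPy arrays for that arm."""
    dev = (gamma - mu_i)**2
    var_next = 1.0 / (1.0 / var_post_i + 1.0 / var_noise_i)
    var_pred = var_next + var_next**2 / var_noise_i
    if np.all(mu_i <= gamma):                 # i in S_t^1 (declared feasible)
        return np.sum(np.exp(-dev / (2.0 * var_post_i))
                      - np.exp(-dev / (2.0 * var_pred)))
    viol = mu_i > gamma                       # E_{t,i}^2, the violated measures
    return (np.exp(-np.sum(dev[viol] / (2.0 * var_post_i[viol])))
            - np.exp(-np.sum(dev[viol] / (2.0 * var_pred[viol]))))
```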

Theorem 3.

For the iKG-F algorithm, $\lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{\mathcal{S}_n^1=\mathcal{S}^1\})=\Gamma^{\text{F}}$, where, for any $i\in\mathbb{A}$,

\Gamma^{\text{F}}=w_i\min_{j\in\mathcal{E}_i^1}\frac{(\gamma_j-\mu_{ij})^2}{2\sigma_{ij}^2}\mathbf{1}\{i\in\mathcal{S}^1\}+w_i\sum_{j\in\mathcal{E}_i^2}\frac{(\gamma_j-\mu_{ij})^2}{2\sigma_{ij}^2}\mathbf{1}\{i\in\mathcal{S}^2\},  (15)

and $w_i$ is the sampling rate of arm $i$ satisfying

\sum_{i=1}^{k}w_i=1 \quad\text{ and }\quad w_i\min_{j\in\mathcal{E}_i^1}\frac{(\gamma_j-\mu_{ij})^2}{2\sigma_{ij}^2}\mathbf{1}\{i\in\mathcal{S}^1\}+w_i\sum_{j\in\mathcal{E}_i^2}\frac{(\gamma_j-\mu_{ij})^2}{2\sigma_{ij}^2}\mathbf{1}\{i\in\mathcal{S}^2\}=w_{i'}\min_{j\in\mathcal{E}_{i'}^1}\frac{(\gamma_j-\mu_{i'j})^2}{2\sigma_{i'j}^2}\mathbf{1}\{i'\in\mathcal{S}^1\}+w_{i'}\sum_{j\in\mathcal{E}_{i'}^2}\frac{(\gamma_j-\mu_{i'j})^2}{2\sigma_{i'j}^2}\mathbf{1}\{i'\in\mathcal{S}^2\},\ \ i\neq i'.  (16)

In addition, for any feasible arm identification algorithm,

\limsup_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{\mathcal{S}_n^1=\mathcal{S}^1\})\leq\Gamma^{\text{F}}.

5 Numerical Experiments

In this section, we show the empirical performance of the iKG, iKG-$\epsilon$ and iKG-F algorithms on synthetic and real-world examples. For the best arm identification problem, we compare iKG with the following algorithms.

  • Expected Improvement (EI) [11]. This is another common strategy for BAI. In each round, it pulls the arm offering the maximal expected improvement over the current estimate of the best mean of the arms.

  • Top-Two Expected Improvement (TTEI) [12]. This is a modification of the EI algorithm that introduces a parameter $\beta$ to control the probabilities of sampling the best arm and the non-best set. We set the parameter $\beta$ in TTEI to its default value $1/2$.

  • Knowledge Gradient. This is the algorithm under study in this research.

For the $\epsilon$-good arm identification problem, we compare iKG-$\epsilon$ with the following algorithms.

  • APT Algorithm [27]. This is a fixed-budget algorithm for identifying the arms whose means are above a given threshold. We set the input tolerance parameter to $0.0001$ and the threshold to the posterior mean of the estimated best arm minus $\epsilon$.

  • $(\text{ST})^2$ Algorithm [8]. This is a fixed-confidence algorithm for $\epsilon$-good arm identification. It pulls three arms in each round: the estimated best arm, one arm above the threshold and one arm below the threshold. We set the input tolerance parameter to $0.0001$ and $\gamma=0$.

For the feasible arm identification problem, we compare iKG-F with the following algorithms.

  • MD-UCBE Algorithm [10]. This is a fixed-budget algorithm for feasible arm identification based on the upper confidence bound. We set the input tolerance parameter to $0.0001$ and the hyperparameter $a=\frac{25}{36}\frac{n-k}{H}$, where $H$ is a constant that can be computed. Katz-Samuels and Scott [10] showed that with $a=\frac{25}{36}\frac{n-k}{H}$, the performance of MD-UCBE is nearly optimal.

  • MD-SAR Algorithm [10]. This is a fixed-budget algorithm for feasible arm identification based on successive accepts and rejects. We set the input tolerance parameter to $0.0001$.

In addition, iKG, iKG-$\epsilon$ and iKG-F are compared with equal allocation, where each arm is simply pulled the same number of times. It is a naive method and is often used as a benchmark against which improvements can be measured.

The test examples include three synthetic examples, called Examples 1-3, and three real examples, namely the Dose-Finding Problem, the Drug Selection Problem, and the Caption Selection Problem. For Examples 1-3 and the Dose-Finding Problem, samples of the arms are two-dimensional; we call the two dimensions measures 1 and 2. When these examples are tested for best arm identification and $\epsilon$-good arm identification, only measure 1 is used for identifying the good/best arms. When they are tested for feasible arm identification, both measures are used for feasibility detection. For the Drug Selection and Caption Selection Problems, samples of the arms are one-dimensional. They are tested for best arm identification, $\epsilon$-good arm identification and feasible arm identification.

Synthetic Datasets. We consider three examples, all containing ten arms.

Example 1. The means in measure 1 of the ten arms are $0.1927$, $0.6438$, $3.0594$, $3.0220$, $1.3753$, $1.4215$, $0.9108$, $1.0126$, $0.1119$ and $1.8808$, and the means in measure 2 of the ten arms are $0.4350$, $0.7240$, $1.1566$, $0.8560$, $3.4712$, $0.8248$, $3.8797$, $1.9819$, $3.2431$ and $1.4315$, all of which are uniformly generated in $(0,4)$. Samples of the arms are corrupted by normal noise $\mathcal{N}(0,1)$. The best arm is arm 3 and the $0.1$-good arms are arms 3 and 4. For feasible arm identification, we choose arms with means in both measures less than $2$; the feasible arms are then arms 1, 2, 6, 8 and 10.

Example 2. We keep the setting of Example 1 but change the noise distributions of arms 1-5 to $\mathcal{N}(0,4)$.

Example 3. Consider the functions $y_1(x)=-0.05x^2$, $y_2(x)=-0.06(7-x)$ and $y_3(x)=0.06(x-6)$. The means in measure 1 of the ten arms are $y_1(x)$ with $x=1,2,\ldots,10$. The means in measure 2 of the ten arms are $y_2(x)$ with $x=1,\ldots,6$ and $y_3(x)$ with $x=7,\ldots,10$. Noise follows the normal distribution $\mathcal{N}(0,1)$. The best arm is arm 1 and the $0.5$-good arms are arms 1-3. For feasible arm identification, we choose arms with means in measure 1 greater than $-0.5$ and means in measure 2 less than $0$. The feasible arms are arms 1-3.

Dose-Finding Problem. We use the data in [28] (see ACR50 in week 16) for treating rheumatoid arthritis with the drug secukinumab. There are four dosage levels, 25mg, 75mg, 150mg and 300mg, and a placebo, which are treated as five arms. We develop a simulation model based on the dataset. Each arm is associated with two performance measures: the probability of the drug being effective and the probability of the drug causing infections. The means of the five arms are $\bm{\mu}_1=(0.151,0.259)$, $\bm{\mu}_2=(0.184,0.184)$, $\bm{\mu}_3=(0.209,0.209)$, $\bm{\mu}_4=(0.171,0.293)$ and $\bm{\mu}_5=(0.06,0.16)$. Samples of each arm are corrupted by normal noise $\mathcal{N}(0,0.25)$. The best arm is arm 3 and the $0.03$-good arms are arms 2 and 3. For feasible arm identification, we find the arms whose probability of being effective is larger than $0.18$ and whose probability of causing infections is less than $0.25$. The feasible arms are arms 2 and 3.

Drug Selection Problem. We consider five contraceptive alternatives based on the Drug Review Dataset (https://doi.org/10.24432/C5SK5S): Ethinyl estradiol / levonorgest, Ethinyl estradiol / norethindro, Ethinyl estradiol / norgestimat, Etonogestrel and Nexplanon, which are treated as five arms. The dataset provides user reviews of the five drugs along with related conditions and ratings reflecting overall user satisfaction. We set the means of the five arms to $\mu_1=5.8676$, $\mu_2=5.6469$, $\mu_3=5.8765$, $\mu_4=5.8298$ and $\mu_5=5.6332$, and the variances of the five arms to $\sigma_1^2=3.2756$, $\sigma_2^2=3.4171$, $\sigma_3^2=3.2727$, $\sigma_4^2=3.3198$ and $\sigma_5^2=3.3251$, all calculated from the data. When this example is used for best arm identification and $\epsilon$-good arm identification, the best arm (with the highest user satisfaction) and the $0.003$-good arm are both arm 3 (Ethinyl estradiol / norgestimat). When this example is used for feasible arm identification, we select the drugs whose ratings are over $5.6$, and the feasible arms are arm 1 (Ethinyl estradiol / levonorgest), arm 2 (Ethinyl estradiol / norethindro), arm 3 (Ethinyl estradiol / norgestimat), arm 4 (Etonogestrel) and arm 5 (Nexplanon).

Caption Selection Problem. We aim to select good captions based on the New Yorker Cartoon Caption Contest Dataset (https://nextml.github.io/caption-contest-data/). In the contests, each caption can be treated as an arm. The dataset provides the mean and variance of each arm, which can be used to set up our experiments. We will test contests 853 (Caption 853) and 854 (Caption 854).

In Caption 853, we randomly select ten captions as arms. We set the means of the ten arms to $\mu_1=1.1400$, $\mu_2=1.0779$, $\mu_3=1.4160$, $\mu_4=1.0779$, $\mu_5=1.1081$, $\mu_6=1.1467$, $\mu_7=1.1333$, $\mu_8=1.1075$, $\mu_9=1.1026$ and $\mu_{10}=1.4900$, and the variances of the arms to $\sigma_1^2=0.1418$, $\sigma_2^2=0.0991$, $\sigma_3^2=0.4871$, $\sigma_4^2=0.0728$, $\sigma_5^2=0.0977$, $\sigma_6^2=0.1809$, $\sigma_7^2=0.1843$, $\sigma_8^2=0.0970$, $\sigma_9^2=0.0932$ and $\sigma_{10}^2=0.4843$, all calculated from the data. When this example is used for best arm identification, the best arm (with the highest funniness score) is arm 10. When this example is used for $\epsilon$-good arm identification, the $0.1$-good arms are arms 3 and 10. When this example is used for feasible arm identification, we select the captions whose funniness scores are over $1.4$, and the feasible arms are arms 3 and 10.

Table 1: Probabilities of false selection for the tested algorithms in the best arm identification problem.

| Example          | Example 1 | Example 2  | Example 3 | Dose-finding | Drug Selection | Caption 853 | Caption 854 |
| Sample size      | 1000 5000 | 4400 18000 | 400 1000  | 1200 13000   | 2400 98000     | 1600 3000   | 12000 18000 |
| Equal Allocation | 0.38 0.22 | 0.44 0.31  | 0.25 0.13 | 0.35 0.05    | 0.43 0.27      | 0.17 0.11   | 0.26 0.18   |
| EI               | 0.36 0.21 | 0.40 0.28  | 0.28 0.22 | 0.46 0.21    | 0.46 0.37      | 0.14 0.12   | 0.26 0.23   |
| TTEI             | 0.25 0.07 | 0.32 0.09  | 0.13 0.02 | 0.31 0.03    | 0.55 0.28      | 0.04 0.01   | 0.10 0.06   |
| KG               | 0.29 0.14 | 0.32 0.13  | 0.14 0.03 | 0.40 0.03    | 0.44 0.28      | 0.04 0.01   | 0.11 0.05   |
| iKG              | 0.21 0.03 | 0.23 0.03  | 0.09 0.01 | 0.29 0.01    | 0.38 0.23      | 0.02 0.00   | 0.07 0.04   |

In Caption 854, we also randomly select ten captions as arms. We set the means of the ten arms to $\mu_1=1.1986$, $\mu_2=1.1890$, $\mu_3=1.1400$, $\mu_4=1.2621$, $\mu_5=1.1544$, $\mu_6=1.0339$, $\mu_7=1.1349$, $\mu_8=1.2786$, $\mu_9=1.1765$ and $\mu_{10}=1.1367$, and the variances of the arms to $\sigma_1^2=0.1879$, $\sigma_2^2=0.2279$, $\sigma_3^2=0.1346$, $\sigma_4^2=0.3186$, $\sigma_5^2=0.1314$, $\sigma_6^2=0.0330$, $\sigma_7^2=0.1337$, $\sigma_8^2=0.3167$, $\sigma_9^2=0.1858$ and $\sigma_{10}^2=0.1478$, all calculated from the data. When this example is used for best arm identification, the best arm is arm 8. When this example is used for $\epsilon$-good arm identification, the $0.05$-good arms are arms 4 and 8. When this example is used for feasible arm identification, we select the captions whose funniness scores are over $1.25$, and the feasible arms are arms 4 and 8.

Table 2: Probabilities of false selection for the tested algorithms in the $\epsilon$-good arm identification problem.

| Example          | Example 1 | Example 2  | Example 3 | Dose-finding | Drug Selection | Caption 853 | Caption 854 |
| Sample size      | 1000 4000 | 2400 12000 | 400 4000  | 1600 6000    | 2600 90000     | 4000 10000  | 9400 15000  |
| Equal Allocation | 0.54 0.20 | 0.65 0.28  | 0.61 0.26 | 0.46 0.18    | 0.62 0.37      | 0.28 0.19   | 0.14 0.05   |
| APT              | 0.28 0.17 | 0.52 0.25  | 0.72 0.49 | 0.56 0.53    | 0.74 0.70      | 0.41 0.35   | 0.48 0.49   |
| $(\text{ST})^2$  | 0.29 0.07 | 0.35 0.11  | 0.51 0.06 | 0.38 0.17    | 0.64 0.34      | 0.21 0.10   | 0.12 0.04   |
| iKG-$\epsilon$   | 0.17 0.03 | 0.29 0.00  | 0.48 0.03 | 0.34 0.06    | 0.60 0.27      | 0.10 0.02   | 0.11 0.03   |
Table 3: Probabilities of false selection for the tested algorithms in the feasible arm identification problem.

| Example          | Example 1  | Example 2  | Example 3 | Dose-finding | Drug Selection | Caption 853 | Caption 854 |
| Sample size      | 3400 11000 | 4800 14000 | 2200 4800 | 2000 4000    | 100000 140000  | 4000 10000  | 30600 44000 |
| Equal Allocation | 0.34 0.26  | 0.33 0.23  | 0.22 0.14 | 0.22 0.18    | 0.03 0.03      | 0.36 0.29   | 0.18 0.07   |
| MD-UCBE          | 0.27 0.16  | 0.33 0.26  | 0.05 0.01 | 0.20 0.17    | 0.06 0.06      | 0.32 0.15   | 0.06 0.04   |
| MD-SAR           | 0.74 0.33  | 0.68 0.22  | 0.30 0.03 | 0.79 0.55    | 0.06 0.02      | 0.58 0.19   | 0.08 0.05   |
| iKG-F            | 0.23 0.02  | 0.24 0.01  | 0.04 0.00 | 0.14 0.01    | 0.01 0.01      | 0.20 0.07   | 0.05 0.00   |

For the tested algorithms, probabilities of false selection (PFS) are estimated as averages over 100 macro-replications. Tables 1-3 show the PFS of the algorithms under fixed sample sizes (additional numerical results on the PFS and sampling rates of the tested algorithms are provided in the Supplement). The proposed iKG, iKG-$\epsilon$ and iKG-F perform the best. For best arm identification, EI tends to allocate too many samples to the estimated best arm, leading to insufficient exploration of the remaining arms, while KG tends to allocate too few samples to the estimated best arm, leading to excessive exploration of the remaining arms. TTEI always allocates approximately half of the budget to the estimated best arm when $\beta=1/2$, so the budget is not utilized in the best way. For $\epsilon$-good arm identification, APT and $(\text{ST})^2$ are inferior because the former pulls the estimated best arm insufficiently, leading to an inaccurate estimate of the threshold, while the latter falls in the fixed-confidence regime, which focuses on guaranteeing the probability of false selection instead of minimizing it. For feasible arm identification, both MD-UCBE and MD-SAR allocate too many samples to the arms near the constraint limits. Across the three problems, equal allocation performs the worst in general, because it has no efficient sampling mechanism for identifying the target arms.

6 Conclusion

This paper studies the knowledge gradient (KG), a popular policy for best arm identification (BAI). We observe that the KG algorithm is not asymptotically optimal and propose a remedy for it. The new policy follows KG's manner of one-step look-ahead, but utilizes different evidence to identify the best arm. We call it improved knowledge gradient (iKG) and show that it is asymptotically optimal. Another advantage of iKG is that it can be easily extended to variant problems of BAI. We use $\epsilon$-good arm identification and feasible arm identification as two examples for algorithm development and analysis. The superior performance of iKG on BAI and the two variants is further demonstrated using numerical examples.

References

  • [1] D. A. Berry. Modified two-armed bandit strategies for certain clinical trials. Journal of the American Statistical Association, 73(362):339–345, 1978.
  • [2] A. Gilotte, C. Calauzènes, T. Nedelec, A. Abraham, and S. Dollé. Offline A/B testing for recommender systems. In Proceedings of the 11th ACM International Conference on Web Search and Data Mining, pages 198–206, 2018.
  • [3] E. Even-Dar, S. Mannor, Y. Mansour, and S. Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(6), 2006.
  • [4] J.-Y. Audibert, S. Bubeck, and R. Munos. Best arm identification in multi-armed bandits. In 23rd Conference on Learning Theory, pages 41–53, 2010.
  • [5] S. Bubeck, T. Wang, and N. Viswanathan. Multiple identifications in multi-armed bandits. In International Conference on Machine Learning, pages 258–265. PMLR, 2013.
  • [6] H. Xiao and S. Gao. Simulation budget allocation for selecting the top-m designs with input uncertainty. IEEE Transactions on Automatic Control, 63(9):3127–3134, 2018.
  • [7] P. Auer, C. K. Chiang, R. Ortner, and M. Drugan. Pareto front identification from stochastic bandit feedback. In Artificial Intelligence and Statistics, pages 939–947. PMLR, 2016.
  • [8] B. Mason, L. Jain, A. Tripathy, and R. Nowak. Finding all $\epsilon$-good arms in stochastic bandits. Advances in Neural Information Processing Systems, 33:20707–20718, 2020.
  • [9] S. Gao and W. Chen. Efficient feasibility determination with multiple performance measure constraints. IEEE Transactions on Automatic Control, 62(1):113–122, 2016.
  • [10] J. Katz-Samuels and C. Scott. Feasible arm identification. In International Conference on Machine Learning, pages 2535–2543. PMLR, 2018.
  • [11] S. E. Chick, J. Branke, and C. Schmidt. Sequential sampling to myopically maximize the expected value of information. INFORMS Journal on Computing, 22(1):71–80, 2010.
  • [12] C. Qin, D. Klabjan, and D. Russo. Improving the expected improvement algorithm. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 5387–5397, 2017.
  • [13] D. Russo. Simple Bayesian algorithms for best arm identification. Operations Research, 68(6):1625–1647, 2020.
  • [14] I. O. Ryzhov, P. I. Frazier, and W. B. Powell. On the robustness of a one-period look-ahead policy in multi-armed bandit problems. Procedia Computer Science, 1(1):1635–1644, 2010.
  • [15] Y. Li and S. Gao. On the finite-time performance of the knowledge gradient algorithm. In International Conference on Machine Learning, pages 12741–12764. PMLR, 2022.
  • [16] C. H. Chen, J. Lin, E. Yücesan, and S. E. Chick. Simulation budget allocation for further enhancing the efficiency of ordinal optimization. Discrete Event Dynamic Systems, 10:251–270, 2000.
  • [17] S. Gao, W. Chen, and L. Shi. A new budget allocation framework for the expected opportunity cost. Operations Research, 65(3):787–803, 2017.
  • [18] Y. Li and S. Gao. Convergence rate analysis for optimal computing budget allocation algorithms. Automatica, 153:111042, 2023.
  • [19] S. S. Gupta and K. J. Miescke. Bayesian look ahead one stage sampling allocations for selecting the largest normal mean. Statistical Papers, 35(1):169–177, 1994.
  • [20] P. I. Frazier, W. B. Powell, and S. Dayanik. A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5):2410–2439, 2008.
  • [21] P. Frazier, W. Powell, and S. Dayanik. The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, 21(4):599–613, 2009.
  • [22] C. Schoppe. Wind and pumped-hydro power storage: Determining optimal commitment policies with knowledge gradient non-parametric estimation. B.Sc. Thesis at Princeton University, 2010.
  • [23] D. M. Negoescu, P. I. Frazier, and W. B. Powell. The knowledge-gradient algorithm for sequencing experiments in drug discovery. INFORMS Journal on Computing, 23(3):346–363, 2011.
  • [24] J. Galambos. Bonferroni inequalities. The Annals of Probability, pages 577–581, 1977.
  • [25] Y.-C. Ho, S. Sreenivas, and P. Vakili. Ordinal optimization of DEDS. Discrete Event Dynamic Systems, 2(1):61–88, 1992.
  • [26] Y.-C. Ho. An explanation of ordinal optimization: Soft computing for hard problems. Information Sciences, 113(3-4):169–192, 1999.
  • [27] A. Locatelli, M. Gutzeit, and A. Carpentier. An optimal algorithm for the thresholding bandit problem. In International Conference on Machine Learning, pages 1690–1698. PMLR, 2016.
  • [28] M. C. Genovese, P. Durez, H. B. Richards, J. Supronik, E. Dokoupilova, V. Mazurov, J. A. Aelion, S. H. Lee, C. E. Codding, H. Kellner, et al. Efficacy and safety of secukinumab in patients with rheumatoid arthritis: a phase II, dose-finding, double-blind, randomised, placebo controlled study. Annals of the Rheumatic Diseases, 72(6):863–869, 2013.
  • [29] I. O. Ryzhov. On the convergence rates of expected improvement methods. Operations Research, 64(6):1515–1528, 2016.
  • [30] P. Glynn and S. Juneja. A large deviations perspective on ordinal optimization. In Proceedings of the 2004 Winter Simulation Conference, 2004., volume 1. IEEE, 2004.

Appendix A Proof of Proposition 1

To facilitate the analysis, we make the following definition. For two real-valued sequences $\{a_n\}$ and $\{b_n\}$, if $\lim_{n\rightarrow\infty}(1/n)\log(a_n/b_n)=0$, we call them logarithmically equivalent, denoted by $a_n\dot{=}b_n$. We first analyze $1-\mathbb{P}\{I_n^*=I^*\}$. Note that $1-\mathbb{P}\{I_n^*=I^*\}=\mathbb{P}\big\{\bigcup_{i\neq I_n^*}(\theta_i>\theta_{I_n^*})\big\}$ and we have

\max_{i\neq I_n^*}\mathbb{P}(\theta_i>\theta_{I_n^*})\leq\mathbb{P}\bigg\{\bigcup_{i\neq I_n^*}(\theta_i>\theta_{I_n^*})\bigg\}\leq(k-1)\max_{i\neq I_n^*}\mathbb{P}(\theta_i>\theta_{I_n^*}).

Then $1-\mathbb{P}\{I_n^*=I^*\}\dot{=}\max_{i\neq I_n^*}\mathbb{P}(\theta_i>\theta_{I_n^*})$. In round $n$, $\theta_i-\theta_{I_n^*}$ follows $\mathcal{N}(\mu_{n,i}-\mu_{n,I_n^*},\sigma_{n,i}^2+\sigma_{n,I_n^*}^2)$. Let $\Phi(\cdot)$ and $\phi(\cdot)$ be the cumulative distribution function and probability density function of the standard normal distribution, respectively. We have

\mathbb{P}(\theta_i>\theta_{I_n^*})=1-\Phi\Bigg(\frac{\mu_{n,I_n^*}-\mu_{n,i}}{\sqrt{\sigma_{n,i}^2+\sigma_{n,I_n^*}^2}}\Bigg).

Let $z=\frac{\mu_{n,I_n^*}-\mu_{n,i}}{\sqrt{\sigma_{n,i}^2+\sigma_{n,I_n^*}^2}}$; note $z>0$. By the following bounds on the tail of the standard normal distribution

\frac{z}{z^2+1}\phi(z)<1-\Phi(z)<\frac{1}{z}\phi(z)

and $\mu_{n,I_n^*}-\mu_{n,i}>0$, we have

\mathbb{P}(\theta_i>\theta_{I_n^*})\dot{=}\phi\Bigg(\frac{\mu_{n,i}-\mu_{n,I_n^*}}{\sqrt{\sigma_{n,i}^2+\sigma_{n,I_n^*}^2}}\Bigg)\dot{=}\exp\Bigg(-\frac{(\mu_{n,i}-\mu_{n,I_n^*})^2}{2(\sigma_{n,i}^2+\sigma_{n,I_n^*}^2)}\Bigg).  (17)
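
As a quick numeric illustration (our own check, not part of the proof) of the logarithmic equivalence in (17): when the posterior variances shrink at rate $1/n$, the ratio of $-\log\mathbb{P}(\theta_i>\theta_{I_n^*})$ to the exponent in (17) tends to one:

```python
import numpy as np
from scipy.stats import norm

gap = 0.8                                    # mu_{n,I*} - mu_{n,i}
for n in [10, 100, 1000, 10000]:
    v = 1.0 / n                              # posterior variances ~ sigma^2 / T
    logp = norm.logsf(gap / np.sqrt(v + v))  # log P(theta_i > theta_{I*})
    print(n, -logp / (gap**2 / (2 * (v + v))))   # ratio -> 1 as n grows
```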

Denote Tn,iT_{n,i} as the number of samples for arm ii before round tt, i.e., Tt,il=0t1𝟏{Il=i}T_{t,i}\triangleq\sum_{l=0}^{t-1}\mathbf{1}\{I_{l}=i\}. By (1) in the main text, we have

\sigma_{n,i}^2=\frac{1}{(\sigma_i^2/T_{n,i})^{-1}+\sigma_{0,i}^{-2}}.

Then

\begin{split}1-\mathbb{P}\{I_n^*=I^*\}&\dot{=}\max_{i\neq I_n^*}\Bigg(\exp\Bigg(-\frac{(\mu_{n,i}-\mu_{n,I_n^*})^2}{2(\sigma_{n,i}^2+\sigma_{n,I_n^*}^2)}\Bigg)\Bigg)\\&\dot{=}\exp\Bigg(-n\min_{i\neq I_n^*}\frac{(\mu_{n,i}-\mu_{n,I_n^*})^2}{2n(\sigma_{n,i}^2+\sigma_{n,I_n^*}^2)}\Bigg).\end{split}

Hence

\Gamma^{\text{KG}}=\lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{I_n^*=I^*\})=\min_{i\neq I^*}\frac{(\mu_i-\mu_{I^*})^2}{2(\sigma_i^2/w_i+\sigma_{I^*}^2/w_{I^*})}.  (18)

Notice that the sampling rate $w_i$ of each arm $i$ under the KG algorithm has been characterized in [29], with

\frac{w_{\langle 1\rangle}}{w_{\langle 2\rangle}}=\frac{\sigma_{\langle 1\rangle}}{\sigma_{\langle 2\rangle}} \quad\text{ and }\quad \frac{w_{\langle i\rangle}}{w_{\langle i'\rangle}}=\frac{(\mu_{\langle 1\rangle}-\mu_{\langle i'\rangle})/\sigma_{\langle i'\rangle}}{(\mu_{\langle 1\rangle}-\mu_{\langle i\rangle})/\sigma_{\langle i\rangle}},\quad i,i'=2,3,\ldots,k\ \text{ and }\ i\neq i'.

Together with $\sum_{i=1}^{k}w_i=1$, we have

w_{\langle 1\rangle}=\bigg(\frac{\sigma_{\langle 2\rangle}}{\sigma_{\langle 1\rangle}}\sum_{i\neq 1}\frac{1}{c_{\langle i\rangle}}+1\bigg)^{-1}  (19)

and

w_{\langle i\rangle}=\bigg(c_{\langle i\rangle}\bigg(\sum_{i\neq 1}\frac{1}{c_{\langle i\rangle}}+\frac{\sigma_{\langle 1\rangle}}{\sigma_{\langle 2\rangle}}\bigg)\bigg)^{-1},\quad i=2,3,\ldots,k.  (20)

Plugging (19) and (20) into (18),

\Gamma^{\text{KG}}=\min_{i\neq\langle 1\rangle}\bigg{(}\frac{(\mu_{\langle i\rangle}-\mu_{\langle 1\rangle})^{2}}{2((\sum_{i^{\prime}\neq\langle 1\rangle}\sigma_{\langle 2\rangle}/c_{\langle i^{\prime}\rangle}+\sigma_{\langle 1\rangle})\sigma_{\langle 1\rangle}+c_{\langle i\rangle}\sigma_{\langle i\rangle}^{2}(\sum_{i^{\prime}\neq\langle 1\rangle}1/c_{\langle i^{\prime}\rangle}+\sigma_{\langle 1\rangle}/\sigma_{\langle 2\rangle}))}\bigg{)}.
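For concreteness, \Gamma^{\text{KG}} can be evaluated numerically from (18)–(20). The sketch below (Python) assumes that c_{\langle i\rangle} is the gap-to-noise ratio (\mu_{\langle 1\rangle}-\mu_{\langle i\rangle})/\sigma_{\langle i\rangle} normalized by that of arm \langle 2\rangle; this is our reading of the ratio conditions above, and the exact definition is in the main text.

```python
import numpy as np

# Arms sorted so index 0 is <1> (best), index 1 is <2>, etc.; assumed instance.
mu = np.array([1.0, 0.8, 0.6, 0.4])
sigma = np.array([0.5, 0.5, 0.5, 0.5])

# Assumed definition: c_<i> = gap-to-noise ratio of arm <i>, normalized so c_<2> = 1.
gap_ratio = (mu[0] - mu[1:]) / sigma[1:]
c = gap_ratio / gap_ratio[0]

w = np.empty_like(mu)
w[0] = 1.0 / (sigma[1] / sigma[0] * np.sum(1.0 / c) + 1.0)   # eq. (19)
w[1:] = 1.0 / (c * (np.sum(1.0 / c) + sigma[0] / sigma[1]))  # eq. (20)
assert abs(w.sum() - 1.0) < 1e-12                            # rates sum to one

# KG's exponential convergence rate, i.e., (18) with these sampling rates.
gamma_kg = np.min((mu[1:] - mu[0])**2 /
                  (2 * (sigma[1:]**2 / w[1:] + sigma[0]**2 / w[0])))
print(w, gamma_kg)
```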

Appendix B Proof of Proposition 2

Similar to the proof of Proposition 1, we have

ΓTTEI=limn1nlog(1{In=I})=miniI(μiμI)22(σi2/wi+σI2/wI).\Gamma^{\text{TTEI}}=\lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{I_{n}^{*}=I^{*}\})=\min_{i\neq I^{*}}\frac{(\mu_{i}-\mu_{I^{*}})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{I^{*}}^{2}/w_{I^{*}})}.

Since for the TTEI algorithm,

(μiμI)22(σi2/wi+σI2/wI)=(μiμI)22(σi2/wi+σI2/wI),iiI,\frac{(\mu_{i}-\mu_{I^{*}})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{I^{*}}^{2}/w_{I^{*}})}=\frac{(\mu_{i^{\prime}}-\mu_{I^{*}})^{2}}{2(\sigma_{i^{\prime}}^{2}/w_{i^{\prime}}+\sigma_{I^{*}}^{2}/w_{I^{*}})},\quad\forall i\neq i^{\prime}\neq I^{*},

we have

ΓTTEI=(μiμI)22(σi2/wi+σI2/wI)iI.\Gamma^{\text{TTEI}}=\frac{(\mu_{i}-\mu_{I^{*}})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{I^{*}}^{2}/w_{I^{*}})}\quad\forall i\neq I^{*}.

According to (19), for the KG algorithm, w_{\langle 1\rangle}=(\sigma_{\langle 2\rangle}/\sigma_{\langle 1\rangle}\sum_{i\neq\langle 1\rangle}1/c_{\langle i\rangle}+1)^{-1}. Now set \beta of the TTEI algorithm to this same value, so that the two algorithms allocate the same sampling rate to the best arm. According to Theorem 2 of [12], among algorithms allocating the same proportion of the samples to the best arm, TTEI attains the optimal rate \Gamma^{\text{TTEI}}, i.e., \Gamma^{\text{KG}}\leq\Gamma^{\text{TTEI}}.

Appendix C Proof of Propositions 3, 4 and 5

Propositions 3, 4 and 5 give the expressions of \text{iKG}_{t,i}, \text{iKG}_{t,i}^{\epsilon} and \text{iKG}_{t,i}^{\text{F}}. We first introduce a lemma that will be used in the proofs of the three propositions.

Lemma 1.

If arm i is sampled from \mathcal{N}(\mu_{t,i},\sigma_{i}^{2}) in round t, and \theta_{i} and \theta_{i^{\prime}} follow \mathcal{N}(\mu_{t,i},\sigma_{t,i}^{2}) and \mathcal{N}(\mu_{t,i^{\prime}},\sigma_{t,i^{\prime}}^{2}) respectively, then

\mathbb{E}[\mathbb{P}(\theta_{i}>\theta_{i^{\prime}})]\dot{=}\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,i^{\prime}})^{2}}{2(\sigma_{t+1,i}^{2}+\sigma_{t,i^{\prime}}^{2}+\sigma_{i}^{2}(\sigma_{t+1,i}^{2}/\sigma_{i}^{2})^{2})}}\Bigg{)}.

Proof of Lemma 1:
We know that θt,i\theta_{t,i} follows 𝒩(μt,i,σi2)\mathcal{N}(\mu_{t,i},\sigma_{i}^{2}). Then by (1) of the main text, we have

μt+1,i={σt,i2μt,i+σi2θt,iσt,i2+σi2if It=i,μt,iif Iti,andσt+1,i2={1σt,i2+σi2if It=i,σt,i2if Iti.\mu_{t+1,i}=\left\{\begin{aligned} &\frac{\sigma_{t,i}^{-2}\mu_{t,i}+\sigma_{i}^{-2}\theta_{t,i}}{\sigma_{t,i}^{-2}+\sigma_{i}^{-2}}&{\mbox{if~{}}I_{t}=i},\\ &\mu_{t,i}&{\mbox{if~{}}I_{t}\neq i},\end{aligned}\right.\quad\mbox{and}\quad\sigma_{t+1,i}^{2}=\left\{\begin{aligned} &\frac{1}{\sigma_{t,i}^{-2}+\sigma_{i}^{-2}}&{\mbox{if~{}}I_{t}=i},\\ &\sigma_{t,i}^{2}&{\mbox{if~{}}I_{t}\neq i}.\end{aligned}\right.

Recall that

(θi>θIt)=˙exp((μt,iμt,It)22(σt,i2+σt,It2)).\mathbb{P}(\theta_{i}>\theta_{I_{t}^{*}})\dot{=}\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{t,i}^{2}+\sigma_{t,I_{t}^{*}}^{2})}}\Bigg{)}.

Then

\begin{split}\mathbb{E}[\mathbb{P}(\theta_{i}>\theta_{i^{\prime}})]&=\mathbb{E}\Bigg{[}\exp\Bigg{(}{-\frac{(\mu_{t+1,i}-\mu_{t,i^{\prime}})^{2}}{2(\sigma_{t+1,i}^{2}+\sigma_{t,i^{\prime}}^{2})}}\Bigg{)}\Bigg{]}\\ &=\frac{1}{\sqrt{2\pi}\sigma_{i}}\int_{-\infty}^{\infty}\exp\Bigg{(}{-\frac{(\frac{\sigma_{t,i}^{-2}\mu_{t,i}+\sigma_{i}^{-2}\theta_{t,i}}{\sigma_{t+1,i}^{-2}}-\mu_{t,i^{\prime}})^{2}}{2(\sigma_{t+1,i}^{2}+\sigma_{t,i^{\prime}}^{2})}}\Bigg{)}\exp\Bigg{(}-\frac{(\theta_{t,i}-\mu_{t,i})^{2}}{2\sigma_{i}^{2}}\Bigg{)}d\theta_{t,i}\\ &\dot{=}\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,i^{\prime}})^{2}}{2(\sigma_{t+1,i}^{2}+\sigma_{t,i^{\prime}}^{2}+\sigma_{i}^{2}(\sigma_{t+1,i}^{2}/\sigma_{i}^{2})^{2})}}\Bigg{)}.\end{split}
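Lemma 1 can be sanity-checked by Monte Carlo. A minimal sketch (Python, all parameter values assumed): it simulates the sample \theta_{t,i}, performs the posterior-mean update, and compares the average of the exponential surrogate with the closed form; the two agree up to the sub-exponential prefactor hidden by the \dot{=} relation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed current parameters: (posterior mean, posterior var, sampling var) of
# arm i, and (posterior mean, posterior var) of arm i'.
mu_i, var_i, noise_i = 0.9, 0.05, 0.25
mu_ip, var_ip = 0.3, 0.05

var_next = 1.0 / (1.0 / var_i + 1.0 / noise_i)          # sigma_{t+1,i}^2
theta = rng.normal(mu_i, np.sqrt(noise_i), 200_000)     # theta_{t,i} ~ N(mu_{t,i}, sigma_i^2)
mu_next = var_next * (mu_i / var_i + theta / noise_i)   # posterior-mean update

mc = np.mean(np.exp(-(mu_next - mu_ip)**2 / (2 * (var_next + var_ip))))
closed = np.exp(-(mu_i - mu_ip)**2 /
                (2 * (var_next + var_ip + noise_i * (var_next / noise_i)**2)))
print(mc, closed)   # e.g. ~0.155 vs ~0.161: equal on the exponential scale
```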

Proof of Proposition 3:
For the best arm identification problem, if iIti\neq I_{t}^{*},

iKGt,i=𝔼[vn(𝒯(St,i,θt,i))vn(St)]=1iiItexp((μt,iμt,It)22(σt,i2+σt,It2))exp((μt,iμt,It)22(σt+1,i2+σt,It2+σi2(σt+1,i2/σi2)2))(1iItexp((μt,iμt,It)22(σt,i2+σt,It2)))=exp((μt,iμt,It)22(σt,i2+σt,It2))exp((μt,iμt,It)22(σt+1,i2+σt,It2+σi2(σt+1,i2/σi2)2)).\begin{split}&\text{iKG}_{t,i}=\mathbb{E}[v_{n}(\mathcal{T}(S_{t},i,\theta_{t,i}))-v_{n}(S_{t})]\\ =&1-\sum_{i^{\prime}\neq i\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t,I_{t}^{*}}^{2})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{t+1,i}^{2}+\sigma_{t,I_{t}^{*}}^{2}+\sigma_{i}^{2}(\sigma_{t+1,i}^{2}/\sigma_{i}^{2})^{2})}}\Bigg{)}\\ &-\Bigg{(}1-\sum_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t,I_{t}^{*}}^{2})}}\Bigg{)}\Bigg{)}\\ =&\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{t,i}^{2}+\sigma_{t,I_{t}^{*}}^{2})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{t+1,i}^{2}+\sigma_{t,I_{t}^{*}}^{2}+\sigma_{i}^{2}(\sigma_{t+1,i}^{2}/\sigma_{i}^{2})^{2})}}\Bigg{)}.\end{split}

If i=Iti=I_{t}^{*},

iKGt,i=𝔼[vn(𝒯(St,i,θt,i))vn(St)]=1iItexp((μt,iμt,It)22(σt,i2+σt+1,It2+σIt2(σt+1,It2/σIt2)2))(1iItexp((μt,iμt,It)22(σt,i2+σt,It2)))=iItexp((μt,iμt,It)22(σt,i2+σt,It2))iItexp((μt,iμt,It)22(σt,i2+σt+1,It2+σIt2(σt+1,It2/σIt2)2)).\begin{split}&\text{iKG}_{t,i}=\mathbb{E}[v_{n}(\mathcal{T}(S_{t},i,\theta_{t,i}))-v_{n}(S_{t})]\\ =&1-\sum\limits_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t+1,I_{t}^{*}}^{2}+\sigma_{I_{t}^{*}}^{2}(\sigma_{t+1,I_{t}^{*}}^{2}/\sigma_{I_{t}^{*}}^{2})^{2})}}\Bigg{)}-\Bigg{(}1-\sum_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t,I_{t}^{*}}^{2})}}\Bigg{)}\Bigg{)}\\ =&\sum\limits_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t,I_{t}^{*}}^{2})}}\Bigg{)}-\sum\limits_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t+1,I_{t}^{*}}^{2}+\sigma_{I_{t}^{*}}^{2}(\sigma_{t+1,I_{t}^{*}}^{2}/\sigma_{I_{t}^{*}}^{2})^{2})}}\Bigg{)}.\end{split}
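Combining the two cases gives a direct implementation of the iKG acquisition function. The sketch below (Python) uses the plug-in variances \sigma_{t,i}^{2}\approx\sigma_{i}^{2}/T_{t,i}, as in (21) later in the appendix; all numerical values are assumptions for illustration, not the authors' code.

```python
import numpy as np

def ikg_scores(mu, T, noise):
    """iKG_{t,i} for all arms, per Proposition 3 with sigma_{t,i}^2 = sigma_i^2 / T_i."""
    star = int(np.argmax(mu))                    # current best arm I_t^*
    var = noise / T                              # current posterior variances
    # sigma_{t+1,i}^2 + sigma_i^2 (sigma_{t+1,i}^2 / sigma_i^2)^2 with the plug-in variances:
    var_next = noise * (T + 2) / (T + 1)**2
    gap2 = (mu - mu[star])**2
    old = np.exp(-gap2 / (2 * (var + var[star])))
    scores = old - np.exp(-gap2 / (2 * (var_next + var[star])))   # case i != I_t^*
    scores[star] = np.sum(np.delete(                              # case i  = I_t^*
        old - np.exp(-gap2 / (2 * (var + var_next[star]))), star))
    return scores

# Pull the arm with the largest expected one-step improvement (assumed values).
mu = np.array([1.0, 0.8, 0.5]); T = np.array([5, 5, 5]); noise = np.full(3, 0.25)
print(ikg_scores(mu, T, noise), np.argmax(ikg_scores(mu, T, noise)))
```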

Proof of Proposition 4:
We first derive the expression of \mathbb{E}[v_{n}(S_{n})] for the \epsilon-good arm identification problem. We know that

𝔼[vn(Sn)]=1iGnϵ(θi<maxi𝔸θiϵ)i𝔸Gnϵ(θi>maxi𝔸θiϵ).\mathbb{E}[v_{n}(S_{n})]=1-\sum\limits_{i\in G_{n}^{\epsilon}}\mathbb{P}(\theta_{i}<\max_{i^{{}^{\prime}}\in\mathbb{A}}\theta_{i^{{}^{\prime}}}-\epsilon)-\sum\limits_{i\in\mathbb{A}\setminus G_{n}^{\epsilon}}\mathbb{P}(\theta_{i}>\max_{i^{{}^{\prime}}\in\mathbb{A}}\theta_{i^{{}^{\prime}}}-\epsilon).

Note that in round n, \theta_{i}-\theta_{I_{n}^{*}}+\epsilon follows \mathcal{N}(\mu_{n,i}-\mu_{n,I_{n}^{*}}+\epsilon,\sigma_{n,i}^{2}+\sigma_{n,I_{n}^{*}}^{2}). As in the proof of Proposition 1, if i\in G_{n}^{\epsilon},

(θi<maxi𝔸θiϵ)=˙exp((μn,iμn,In+ϵ)22(σn,i2+σn,In2)),\mathbb{P}(\theta_{i}<\max_{i^{{}^{\prime}}\in\mathbb{A}}\theta_{i^{{}^{\prime}}}-\epsilon)\dot{=}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,I_{n}^{*}}+\epsilon)^{2}}{2(\sigma_{n,i}^{2}+\sigma_{n,I_{n}^{*}}^{2})}}\Bigg{)},

and if i𝔸Gnϵi\in\mathbb{A}\setminus G_{n}^{\epsilon}

(θi>maxi𝔸θiϵ)=˙exp((μn,iμn,In+ϵ)22(σn,i2+σn,In2)).\mathbb{P}(\theta_{i}>\max_{i^{{}^{\prime}}\in\mathbb{A}}\theta_{i^{{}^{\prime}}}-\epsilon)\dot{=}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,I_{n}^{*}}+\epsilon)^{2}}{2(\sigma_{n,i}^{2}+\sigma_{n,I_{n}^{*}}^{2})}}\Bigg{)}.

Then

𝔼[vn(Sn)]=1iInexp((μn,iμn,In+ϵ)22(σn,i2+σn,In2)).\mathbb{E}[v_{n}(S_{n})]=1-\sum_{i\neq I_{n}^{*}}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,I_{n}^{*}}+\epsilon)^{2}}{2(\sigma_{n,i}^{2}+\sigma_{n,I_{n}^{*}}^{2})}}\Bigg{)}.

For the ϵ\epsilon-good arm identification problem, if iIti\neq I_{t}^{*},

iKGt,iϵ=𝔼[vn(𝒯(St,i,θt,i))vn(St)]=1iiItexp((μt,iμt,It+ϵ)22(σt,i2+σt,It2))exp((μt,iμt,It+ϵ)22(σt+1,i2+σt,It2+σi2(σt+1,i2/σi2)2))(1iItexp((μt,iμt,It+ϵ)22(σt,i2+σt,It2)))=exp((μt,iμt,It+ϵ)22(σt,i2+σt,It2))exp((μt,iμt,It+ϵ)22(σt+1,i2+σt,It2+σi2(σt+1,i2/σi2)2)).\begin{split}&\text{iKG}_{t,i}^{\epsilon}=\mathbb{E}[v_{n}(\mathcal{T}(S_{t},i,\theta_{t,i}))-v_{n}(S_{t})]\\ =&1-\sum_{i^{\prime}\neq i\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t,I_{t}^{*}}^{2})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{t+1,i}^{2}+\sigma_{t,I_{t}^{*}}^{2}+\sigma_{i}^{2}(\sigma_{t+1,i}^{2}/\sigma_{i}^{2})^{2})}}\Bigg{)}\\ &-\Bigg{(}1-\sum_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t,I_{t}^{*}}^{2})}}\Bigg{)}\Bigg{)}\\ =&\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{t,i}^{2}+\sigma_{t,I_{t}^{*}}^{2})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{t+1,i}^{2}+\sigma_{t,I_{t}^{*}}^{2}+\sigma_{i}^{2}(\sigma_{t+1,i}^{2}/\sigma_{i}^{2})^{2})}}\Bigg{)}.\end{split}

If i=Iti=I_{t}^{*},

iKGt,iϵ=𝔼[vn(𝒯(St,i,θt,i))vn(St)]=1iItexp((μt,iμt,It+ϵ)22(σt,i2+σt+1,It2+σIt2(σt+1,It2/σIt2)2))(1iItexp((μt,iμt,It+ϵ)22(σt,i2+σt,It2)))=iItexp((μt,iμt,It+ϵ)22(σt,i2+σt,It2))iItexp((μt,iμt,It+ϵ)22(σt,i2+σt+1,It2+σIt2(σt+1,It2/σIt2)2)).\begin{split}&\text{iKG}_{t,i}^{\epsilon}=\mathbb{E}[v_{n}(\mathcal{T}(S_{t},i,\theta_{t,i}))-v_{n}(S_{t})]\\ =&1-\sum\limits_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t+1,I_{t}^{*}}^{2}+\sigma_{I_{t}^{*}}^{2}(\sigma_{t+1,I_{t}^{*}}^{2}/\sigma_{I_{t}^{*}}^{2})^{2})}}\Bigg{)}-\Bigg{(}1-\sum_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t,I_{t}^{*}}^{2})}}\Bigg{)}\Bigg{)}\\ =&\sum\limits_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t,I_{t}^{*}}^{2})}}\Bigg{)}-\sum\limits_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{t,i^{\prime}}^{2}+\sigma_{t+1,I_{t}^{*}}^{2}+\sigma_{I_{t}^{*}}^{2}(\sigma_{t+1,I_{t}^{*}}^{2}/\sigma_{I_{t}^{*}}^{2})^{2})}}\Bigg{)}.\end{split}
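Proposition 4 thus differs from Proposition 3 only through the shift +\epsilon in every mean gap. A compact adaptation of the earlier sketch (again illustrative, with the same plug-in variances):

```python
import numpy as np

def ikg_eps_scores(mu, T, noise, eps):
    """iKG_{t,i}^epsilon per Proposition 4: identical to the Proposition 3
    scores except that every mean gap is shifted by +eps before squaring."""
    star = int(np.argmax(mu))
    var = noise / T
    var_next = noise * (T + 2) / (T + 1)**2
    gap2 = (mu - mu[star] + eps)**2              # the only change vs. iKG_{t,i}
    old = np.exp(-gap2 / (2 * (var + var[star])))
    scores = old - np.exp(-gap2 / (2 * (var_next + var[star])))
    scores[star] = np.sum(np.delete(
        old - np.exp(-gap2 / (2 * (var + var_next[star]))), star))
    return scores
```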

Proof of Proposition 5:
We first derive the expression of \mathbb{E}[v_{n}(S_{n})] for the feasible arm identification problem. We know that

\mathbb{E}[v_{n}(S_{n})]=1-\sum_{i\in\mathcal{S}_{n}^{1}}\sum_{j=1}^{m}\mathbb{P}(\theta_{ij}>\gamma_{j})-\sum_{i\in\mathcal{S}_{n}^{2}}\prod_{j\in\mathcal{E}_{n,i}^{2}}\mathbb{P}(\theta_{ij}\leq\gamma_{j}).

Note that in round n, \theta_{ij}-\gamma_{j} follows \mathcal{N}(\mu_{n,ij}-\gamma_{j},\sigma_{n,ij}^{2}). As in the proof of Proposition 1, if i\in\mathcal{S}_{n}^{1} and measure j\in\{1,2,\ldots,m\},

(θij>γj)=˙exp((γjμn,ij)22σn,ij2),\mathbb{P}(\theta_{ij}>\gamma_{j})\dot{=}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{n,ij}^{2}}}\Bigg{)},

and if i𝒮n2i\in\mathcal{S}_{n}^{2} and measure jn,i2j\in\mathcal{E}_{n,i}^{2},

(θijγj)=˙exp((γjμn,ij)22σn,ij2).\mathbb{P}(\theta_{ij}\leq\gamma_{j})\dot{=}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{n,ij}^{2}}}\Bigg{)}.

Then

𝔼[vn(Sn)]=1i𝒮n1j=1mexp((γjμn,ij)22σn,ij2)i𝒮n2exp(jn,i2(γjμn,ij)22σn,ij2).\mathbb{E}[v_{n}(S_{n})]=1-\sum_{i\in\mathcal{S}_{n}^{1}}\sum_{j=1}^{m}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{n,ij}^{2}}}\Bigg{)}-\sum_{i\in\mathcal{S}_{n}^{2}}\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{n,i}^{2}}\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{n,ij}^{2}}}\Bigg{)}.

For the feasible arm identification problem,

\begin{split}&\text{iKG}_{t,i}^{\text{F}}=\mathbb{E}[v_{n}(\mathcal{T}(S_{t},i,\theta_{t,i}))-v_{n}(S_{t})]\\ =&1-\sum_{i^{\prime}\neq i,i^{\prime}\in\mathcal{S}_{t}^{1}}\sum_{j=1}^{m}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{t,i^{\prime}j})^{2}}{2\sigma_{t,i^{\prime}j}^{2}}}\Bigg{)}-\sum_{i^{\prime}\neq i,i^{\prime}\in\mathcal{S}_{t}^{2}}\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{t,i^{\prime}}^{2}}\frac{(\gamma_{j}-\mu_{t,i^{\prime}j})^{2}}{2\sigma_{t,i^{\prime}j}^{2}}}\Bigg{)}\\ &-\sum_{j=1}^{m}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{t,ij})^{2}}{2(\sigma_{t+1,ij}^{2}+\sigma_{ij}^{2}(\sigma_{t+1,ij}^{2}/\sigma_{ij}^{2})^{2})}}\mathbf{1}\{i\in\mathcal{S}_{t}^{1}\}\Bigg{)}-\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{t,i}^{2}}\frac{(\gamma_{j}-\mu_{t,ij})^{2}}{2(\sigma_{t+1,ij}^{2}+\sigma_{ij}^{2}(\sigma_{t+1,ij}^{2}/\sigma_{ij}^{2})^{2})}}\mathbf{1}\{i\in\mathcal{S}_{t}^{2}\}\Bigg{)}\\ &-\Bigg{(}1-\sum_{i^{\prime}\in\mathcal{S}_{t}^{1}}\sum_{j=1}^{m}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{t,i^{\prime}j})^{2}}{2\sigma_{t,i^{\prime}j}^{2}}}\Bigg{)}-\sum_{i^{\prime}\in\mathcal{S}_{t}^{2}}\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{t,i^{\prime}}^{2}}\frac{(\gamma_{j}-\mu_{t,i^{\prime}j})^{2}}{2\sigma_{t,i^{\prime}j}^{2}}}\Bigg{)}\Bigg{)}\\ \end{split}
=j=1m(exp((γjμt,ij)22σt,ij2𝟏{i𝒮t1})exp((γjμt,ij)22(σt+1,ij2+σij2(σt+1,ij2/σij2)2)𝟏{i𝒮t1}))+exp(jt,i2(γjμt,ij)22σt,ij2𝟏{i𝒮t2})exp(jt,i2(γjμt,ij)22(σt+1,ij2+σij2(σt+1,ij2/σij2)2)𝟏{i𝒮t2}).\begin{split}=&\sum_{j=1}^{m}\Bigg{(}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{t,ij})^{2}}{2\sigma_{t,ij}^{2}}}\mathbf{1}\{i\in\mathcal{S}_{t}^{1}\}\Bigg{)}-\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{t,ij})^{2}}{2(\sigma_{t+1,ij}^{2}+\sigma_{ij}^{2}(\sigma_{t+1,ij}^{2}/\sigma_{ij}^{2})^{2})}}\mathbf{1}\{i\in\mathcal{S}_{t}^{1}\}\Bigg{)}\Bigg{)}\\ &+\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{t,i}^{2}}\frac{(\gamma_{j}-\mu_{t,ij})^{2}}{2\sigma_{t,ij}^{2}}}\mathbf{1}\{i\in\mathcal{S}_{t}^{2}\}\Bigg{)}-\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{t,i}^{2}}\frac{(\gamma_{j}-\mu_{t,ij})^{2}}{2(\sigma_{t+1,ij}^{2}+\sigma_{ij}^{2}(\sigma_{t+1,ij}^{2}/\sigma_{ij}^{2})^{2})}}\mathbf{1}\{i\in\mathcal{S}_{t}^{2}\}\Bigg{)}.\end{split}
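A sketch of the resulting acquisition function for feasible arm identification (Python; we assume, consistently with the error events above, that arm i is currently declared feasible, i\in\mathcal{S}_{t}^{1}, when \mu_{t,ij}\leq\gamma_{j} for all measures j, and that \mathcal{E}_{t,i}^{2} collects the violated measures otherwise; all values are illustrative):

```python
import numpy as np

def ikg_f_scores(mu, T, noise, gamma):
    """iKG_{t,i}^F per Proposition 5 with plug-in variances sigma_ij^2 / T_i.
    mu, noise: (k, m) arrays of posterior means / sampling variances;
    gamma: (m,) constraint thresholds; T: (k,) per-arm sample counts."""
    k, _ = mu.shape
    var = noise / T[:, None]                             # sigma_{t,ij}^2
    var_next = noise * (T[:, None] + 2) / (T[:, None] + 1)**2
    dist2 = (gamma[None, :] - mu)**2
    scores = np.empty(k)
    for i in range(k):
        if np.all(mu[i] <= gamma):                       # i in S_t^1 (declared feasible)
            scores[i] = np.sum(np.exp(-dist2[i] / (2 * var[i]))
                               - np.exp(-dist2[i] / (2 * var_next[i])))
        else:                                            # i in S_t^2; E_{t,i}^2 = violated measures
            viol = mu[i] > gamma
            scores[i] = (np.exp(-np.sum(dist2[i][viol] / (2 * var[i][viol])))
                         - np.exp(-np.sum(dist2[i][viol] / (2 * var_next[i][viol]))))
    return scores
```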

Appendix D Proof of Theorem 1

Our proof of Theorem 1 will be divided into the analysis of the consistency, sampling rates and asymptotic optimality of the iKG algorithm.

We first show the consistency, i.e., each arm will be pulled infinitely often by the algorithm as the number of rounds n goes to infinity. Since

iKGt,i={exp((μt,iμt,It)22(σi2/Tt,i+σIt2/Tt,It))exp((μt,iμt,It)22((Tt,i+2)σi2/(Tt,i+1)2+σIt2/Tt,It)),if iIt,iItexp((μt,iμt,It)22(σi2/Tt,i+σIt2/Tt,It))iItexp((μt,iμt,It)22(σi2/Tt,i+(Tt,It+2)σIt2/(Tt,It+1)2)),if i=It,\text{iKG}_{t,i}=\left\{\begin{aligned} &\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{i}^{2}/T_{t,i}+\sigma_{I_{t}^{*}}^{2}/T_{t,I_{t}^{*}})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,I_{t}^{*}})^{2}}{2((T_{t,i}+2)\sigma_{i}^{2}/(T_{t,i}+1)^{2}+\sigma_{I_{t}^{*}}^{2}/T_{t,I_{t}^{*}})}}\Bigg{)},&{\mbox{if~{}}i\neq I_{t}^{*}},\\ &\sum\limits_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{t,i^{\prime}}+\sigma_{I_{t}^{*}}^{2}/T_{t,I_{t}^{*}})}}\Bigg{)}-\sum\limits_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}})^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{t,i^{\prime}}+(T_{t,I_{t}^{*}}+2)\sigma_{I_{t}^{*}}^{2}/(T_{t,I_{t}^{*}}+1)^{2})}}\Bigg{)},&{\mbox{if~{}}i=I_{t}^{*}},\\ \end{aligned}\right. (21)

it is obvious that \text{iKG}_{t,i}>0 for t>0. To prove the consistency, we define the set V\triangleq\{i\in\mathbb{A}:\sum_{l\geq 0}\mathbf{1}\{I_{l}=i\}<\infty\}. It suffices to prove that V=\emptyset; the claim is then straightforward based on the Strong Law of Large Numbers. For any \delta_{1}>0 and arm i\notin V, there exists N_{1} such that when n>N_{1}, |\mu_{n,i}-\mu_{i}|<\delta_{1}, because arms not in V are pulled infinitely often. Since \exp(\cdot) is continuous and \sigma_{i}^{2}/T_{t,i}-\sigma_{i}^{2}(T_{t,i}+2)/(T_{t,i}+1)^{2}=\sigma_{i}^{2}/((T_{t,i}+1)^{2}T_{t,i})\rightarrow 0 for arm i\notin V, for any \delta_{2}>0 there exists N_{2} such that when t>N_{2}, \text{iKG}_{t,i}<\delta_{2}.

Arms i^{\prime}\in V are pulled for only a finite number of rounds. Then \max_{i^{\prime}\in V}T_{t,i^{\prime}} exists and we have \sigma_{i^{\prime}}^{2}/((T_{t,i^{\prime}}+1)^{2}T_{t,i^{\prime}})>\min_{i^{\prime}\neq I_{t}^{*}}\sigma_{i^{\prime}}^{2}/\max_{i^{\prime}\in V}(T_{t,i^{\prime}}+2)/(T_{t,i^{\prime}}+1)^{2}. By the continuity of \exp(\cdot), there exists \delta_{3}>0 such that \text{iKG}_{t,i^{\prime}}>\delta_{3}. Since \delta_{2} is arbitrary, let \delta_{2}<\delta_{3}; then \text{iKG}_{t,i^{\prime}}>\text{iKG}_{t,i} holds, which implies I_{t}\in V. This contradicts arms in V being pulled only finitely often as t grows, so V will become an empty set eventually. In other words, all the arms will be pulled infinitely often and I_{n}^{*}=I^{*}=\langle 1\rangle holds with probability 1.

We next analyze the sampling rate of each arm under the iKG algorithm. Let \delta_{4}=2\delta_{2}>0. We know that when n is large, \text{iKG}_{n,i}<\delta_{2}=\delta_{4}/2 for all i\in\mathbb{A}. Then |\text{iKG}_{n,i}-\text{iKG}_{n,i^{\prime}}|\leq\text{iKG}_{n,i}+\text{iKG}_{n,i^{\prime}}<\delta_{4}/2+\delta_{4}/2=\delta_{4}, where i\neq i^{\prime}. For any i,i^{\prime}\in\mathbb{A} and i\neq i^{\prime}\neq\langle 1\rangle,

|iKGn,iiKGn,i|=|exp((μn,iμn,1)22(σi2/Tn,i+σ12/Tn,1))exp((μn,iμn,1)22(σi2/Tn,i+σ12/Tn,1))+exp((μn,iμn,1)22((Tn,i+2)σi2/(Tn,i+1)2+σ12/Tn,1))exp((μn,iμn,1)22((Tn,i+2)σi2/(Tn,i+1)2+σ12/Tn,1))|2|exp((μn,iμn,1)22(σi2/Tn,i+σ12/Tn,1))exp((μn,iμn,1)22(σi2/Tn,i+σ12/Tn,1))|=2|exp(n(μn,iμn,1)22(σi2/wi+σ12/w1))exp(n(μn,iμn,1)22(σi2/wi+σ12/w1))|,\begin{split}&|\text{iKG}_{n,i}-\text{iKG}_{n,i^{\prime}}|\\ =&\Bigg{|}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/T_{n,i}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{n,i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\\ &+\exp\Bigg{(}{-\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle})^{2}}{2((T_{n,i^{\prime}}+2)\sigma_{i^{\prime}}^{2}/(T_{n,i^{\prime}}+1)^{2}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2((T_{n,i}+2)\sigma_{i}^{2}/(T_{n,i}+1)^{2}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\Bigg{|}\\ \leq&2\Bigg{|}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/T_{n,i}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{n,i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\Bigg{|}\\ =&2\Bigg{|}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}-\exp\Bigg{(}{-n\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i^{\prime}}^{2}/w_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{|},\end{split}

where wi=Tn,i/nw_{i}=T_{n,i}/n is the sampling rate of arm ii. For any δ5=δ12>0\delta_{5}=\delta_{1}^{2}>0, we have |iKGn,iiKGn,i|<δ4|\text{iKG}_{n,i}-\text{iKG}_{n,i^{\prime}}|<\delta_{4} if and only if

|(μiμ1)22(σi2/wi+σ12/w1)(μiμ1)22(σi2/wi+σ12/w1)|<δ5\Bigg{|}\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}-\frac{(\mu_{i^{\prime}}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i^{\prime}}^{2}/w_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}\Bigg{|}<\delta_{5} (22)

by the continuity of the function exp()\exp(\cdot) and |μn,iμi|<δ1|\mu_{n,i}-\mu_{i}|<\delta_{1}. For arms i1i\neq\langle 1\rangle,

\begin{split}&|\text{iKG}_{n,i}-\text{iKG}_{n,\langle 1\rangle}|\\ =&\Bigg{|}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/T_{n,i}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}-\sum_{i^{\prime}\neq\langle 1\rangle}\exp\Bigg{(}{-\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{n,i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\\ &+\sum_{i^{\prime}\neq\langle 1\rangle}\exp\Bigg{(}{-\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{n,i^{\prime}}+(T_{n,\langle 1\rangle}+2)\sigma_{\langle 1\rangle}^{2}/(T_{n,\langle 1\rangle}+1)^{2})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2((T_{n,i}+2)\sigma_{i}^{2}/(T_{n,i}+1)^{2}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\Bigg{|}.\end{split}

Notice that (Tn,i+2)σi2/(Tn,i+1)2=σi2/(Tn,i+1/(Tn,i+2))(T_{n,i}+2)\sigma_{i}^{2}/(T_{n,i}+1)^{2}=\sigma_{i}^{2}/(T_{n,i}+1/(T_{n,i}+2)). When nn is large enough, 1/(Tn,i+2)1/(T_{n,i}+2) is sufficiently small according to the consistency of the algorithm. Then

limnexp((μn,iμn,1)22(σi2/Tn,i+σ12/Tn,1))exp((μn,iμn,1)22((Tn,i+2)σi2/(Tn,i+1)2+σ12/Tn,1))=(exp((μn,iμn,1)22(σi2/Tn,i+σ12/Tn,1)))Tn,i=(exp(n(μn,iμn,1)22(σi2/wi+σ12/w1)))wi.\begin{split}&\lim_{n\rightarrow\infty}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/T_{n,i}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2((T_{n,i}+2)\sigma_{i}^{2}/(T_{n,i}+1)^{2}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\\ =&\frac{\partial\Bigg{(}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/T_{n,i}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial T_{n,i}}=\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{i}}.\end{split}

Since |iKGn,iiKGn,1|<δ4|\text{iKG}_{n,i}-\text{iKG}_{n,\langle 1\rangle}|<\delta_{4} for i1i\neq\langle 1\rangle, given δ6>0\delta_{6}>0, we have

1δ6<|i1(exp(n(μn,iμn,1)22(σi2/wi+σ12/w1)))w1/(exp(n(μn,iμn,1)22(σi2/wi+σ12/w1)))wi|<1+δ6.1-\delta_{6}<\Bigg{|}\sum_{i\neq\langle 1\rangle}\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{{\langle 1\rangle}}}\Bigg{/}\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{i}}\Bigg{|}<1+\delta_{6}.

By (22), we have

(exp(n(μn,iμn,1)22(σi2/wi+σ12/w1)))wi=(exp(n(μn,iμn,1)22(σi2/wi+σ12/w1)))wi.\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{i}}=\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i^{\prime}}^{2}/w_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{i^{\prime}}}.

Then

1δ6<i1|(exp(n(μn,iμn,1)22(σi2/wi+σ12/w1)))w1/(exp(n(μn,iμn,1)22(σi2/wi+σ12/w1)))wi|<1+δ6.1-\delta_{6}<\sum_{i\neq\langle 1\rangle}\Bigg{|}\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{{\langle 1\rangle}}}\Bigg{/}\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{i}}\Bigg{|}<1+\delta_{6}.
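To see how this yields the balance condition, write g_{i}(\bm{w})=\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}. The exponential prefactors cancel in each ratio, and a direct computation gives

\frac{\partial g_{i}/\partial w_{\langle 1\rangle}}{\partial g_{i}/\partial w_{i}}=\frac{\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle}^{2}}{\sigma_{i}^{2}/w_{i}^{2}},

so the displayed bound states that \frac{\sigma_{\langle 1\rangle}^{2}}{w_{\langle 1\rangle}^{2}}\sum_{i\neq\langle 1\rangle}\frac{w_{i}^{2}}{\sigma_{i}^{2}} lies in (1-\delta_{6},1+\delta_{6}); multiplying through by the bounded factor w_{\langle 1\rangle}^{2}/\sigma_{\langle 1\rangle}^{2} (and rescaling \delta_{6} accordingly) gives the bound below.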

Hence

|w12σ12i1wi2σi2|<δ6.\Bigg{|}\frac{w_{\langle 1\rangle}^{2}}{\sigma_{\langle 1\rangle}^{2}}-\sum_{i\neq\langle 1\rangle}\frac{w_{i}^{2}}{\sigma_{i}^{2}}\Bigg{|}<\delta_{6}.

Since \delta_{6} can be arbitrarily small, \frac{w_{\langle 1\rangle}^{2}}{\sigma_{\langle 1\rangle}^{2}}\rightarrow\sum_{i\neq\langle 1\rangle}\frac{w_{i}^{2}}{\sigma_{i}^{2}}.

We have shown that

1{In=1}=˙exp(nmini1(μn,iμn,1)22(σi2n/Tn,i+σ12n/Tn,1)).\begin{split}1-\mathbb{P}\{I_{n}^{*}=\langle 1\rangle\}&\dot{=}\exp\Bigg{(}{-n\min_{i\neq\langle 1\rangle}\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}n/T_{n,i}+\sigma_{\langle 1\rangle}^{2}n/T_{n,\langle 1\rangle})}}\Bigg{)}.\end{split}

Then

ΓiKG=limn1nlog(1{In=1})=mini1(μiμ1)22(σi2/wi+σ12/w1).\Gamma^{\text{iKG}}=\lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{I_{n}^{*}=\langle 1\rangle\})=\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}. (23)

By (22),

ΓiKG=(μiμ1)22(σi2/wi+σ12/w1),i1,\Gamma^{\text{iKG}}=\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})},\quad\forall i\neq\langle 1\rangle, (24)

where wiw_{i} in (23) and (24) is the solution of (8) in the main text.

Next, we will show that for any BAI algorithm, \lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{I_{n}^{*}=\langle 1\rangle\})\leq\Gamma^{\text{iKG}}. Let W\triangleq\{\bm{w}=(w_{1},\ldots,w_{k}):\sum_{i=1}^{k}w_{i}=1\mbox{ and }w_{i}\geq 0,\forall i\in\mathbb{A}\} be the set of feasible sampling rates of the k arms. The proof of this claim is divided into two stages. First, suppose that w_{\langle 1\rangle}=\alpha is fixed for some 0<\alpha<1. We will show that \max_{\bm{w}\in W,w_{\langle 1\rangle}=\alpha}\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)} is achieved when

i1wi=1α, and (μiμ1)22(σi2/wi+σ12/α)=(μiμ1)22(σi2/wi+σ12/α),ii1.\begin{split}\sum\limits_{i\neq\langle 1\rangle}w_{i}=1-\alpha,\quad\mbox{ and }\quad\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}=\frac{(\mu_{i^{\prime}}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i^{\prime}}^{2}/w_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/\alpha)},\ \ i\neq i^{\prime}\neq\langle 1\rangle.\end{split} (25)

In other words, in this stage, we will prove the first and third equations in (8) of the main text. We prove it by contradiction. Suppose there exists a policy with sampling rates \bm{w}^{\prime}=(w^{\prime}_{1},w^{\prime}_{2},\ldots,w^{\prime}_{k}) that attains \max_{\bm{w}\in W,w_{\langle 1\rangle}=\alpha}\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)} but does not satisfy (25). Since the solution of (25) is unique, there exists an arm i^{\prime} satisfying \frac{(\mu_{i^{\prime}}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i^{\prime}}^{2}/w^{\prime}_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/\alpha)}>\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w^{\prime}_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}. We consider a new policy: there exists \delta_{7}>0 such that \tilde{w}_{i^{\prime}}=w^{\prime}_{i^{\prime}}-\delta_{7}\in(0,1) and \tilde{w}_{i}=w^{\prime}_{i}+\delta_{7}/(k-2)\in(0,1) for i\neq i^{\prime}\neq\langle 1\rangle. Then

mini1(μiμ1)22(σi2/w~i+σ12/α)>mini1(μiμ1)22(σi2/wi+σ12/α)=max𝒘W,w1=αmini1(μiμ1)22(σi2/wi+σ12/α),\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/\tilde{w}_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}>\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w^{\prime}_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}=\max_{\bm{w}\in W,w_{\langle 1\rangle}=\alpha}\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)},

which yields a contradiction. Therefore, the first and third equations in (8) of the main text hold.

In the second stage, we will prove the second equation in (8) of the main text. Consider the following optimization problem

maxα(0,1)zs.t.(μiμ1)22(σi2/wi+σ12/α)=(μiμ1)22(σi2/wi+σ12/α)i,i1 and ii,(μiμ1)22(σi2/wi+σ12/α)z,i1,i1wi=1α.\begin{split}\mbox{$\max\limits_{\alpha\in(0,1)}$}\quad&z\\ \mbox{s.t.}\quad&\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}=\frac{(\mu_{i^{\prime}}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i^{\prime}}^{2}/w_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/\alpha)}\quad i,i^{\prime}\neq\langle 1\rangle\mbox{ and }i\neq i^{\prime},\\ &\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}\geq z,\quad i\neq\langle 1\rangle,\\ &\sum_{i\neq\langle 1\rangle}w_{i}=1-\alpha.\end{split} (26)

The Lagrangian function of (26) is

L(α,λi)=z+i1λi((μiμ1)22(σi2/wi+σ12/α)z)+λ1(i1wi1+α),L(\alpha,\lambda_{i})=z+\sum_{i\neq\langle 1\rangle}\lambda_{i}(\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}-z)+\lambda_{1}(\sum_{i\neq\langle 1\rangle}w_{i}-1+\alpha),

where λi\lambda_{i}’s are the Lagrange multipliers. By the KKT conditions, we have λi((μiμ1)22(σi2/wi+σ12/α))/wi+λ1=0\lambda_{i}\partial(\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)})/\partial w_{i}+\lambda_{1}=0 for all i1i\neq\langle 1\rangle and i1λi((μiμ1)22(σi2/wi+σ12/α))/w1+λ1=0\sum_{i\neq\langle 1\rangle}\lambda_{i}\partial(\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)})/\partial w_{\langle 1\rangle}+\lambda_{1}=0. Then

i1((μiμ1)22(σi2/wi+σ12/α))/w1((μiμ1)22(σi2/wi+σ12/α))/wi=1,\sum_{i\neq\langle 1\rangle}\frac{\partial(\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)})/\partial w_{\langle 1\rangle}}{\partial(\frac{(\mu_{i}-\mu_{\langle 1\rangle})^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)})/\partial w_{i}}=1,

i.e., w12σ12=i1wi2σi2\frac{w_{\langle 1\rangle}^{2}}{\sigma_{\langle 1\rangle}^{2}}=\sum_{i\neq\langle 1\rangle}\frac{w_{i}^{2}}{\sigma_{i}^{2}}.

Remark: The conditions in (8) of the main text coincide with the optimality conditions developed in [30] using the OCBA method under normal sampling distributions.
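The conditions in (8) can also be solved numerically. A minimal sketch (Python, assumed instance; a constrained solver would be more robust than fsolve if an iterate leaves the simplex):

```python
import numpy as np
from scipy.optimize import fsolve

# Assumed instance; arm 0 plays the role of <1>.
mu = np.array([1.0, 0.8, 0.6, 0.4])
sigma2 = np.array([0.25, 0.25, 0.25, 0.25])

def rate(w):
    # Large-deviations rate of each suboptimal arm, as in (23)-(24).
    return (mu[1:] - mu[0])**2 / (2 * (sigma2[1:] / w[1:] + sigma2[0] / w[0]))

def conditions(w):
    r = rate(w)
    return np.concatenate([
        [np.sum(w) - 1.0],                                       # rates sum to one
        [w[0]**2 / sigma2[0] - np.sum(w[1:]**2 / sigma2[1:])],   # balance condition
        r[1:] - r[0],                                            # equal rates across i != <1>
    ])

w = fsolve(conditions, np.full(len(mu), 1.0 / len(mu)))
print(w, rate(w))   # Gamma^iKG is the common value in rate(w)
```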

Appendix E Proof of Theorem 2

Our proof of Theorem 2 will be divided into the analysis of the consistency, sampling rates and asymptotic optimality of the iKG-ϵ\epsilon algorithm.

We first show consistency, i.e., each arm will be pulled infinitely often by the algorithm as the number of rounds n goes to infinity. Since

\text{iKG}_{t,i}^{\epsilon}=\left\{\begin{aligned} &\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{i}^{2}/T_{t,i}+\sigma_{I_{t}^{*}}^{2}/T_{t,I_{t}^{*}})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{t,i}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2((T_{t,i}+2)\sigma_{i}^{2}/(T_{t,i}+1)^{2}+\sigma_{I_{t}^{*}}^{2}/T_{t,I_{t}^{*}})}}\Bigg{)},&{\mbox{if~{}}i\neq I_{t}^{*}},\\ &\sum\limits_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{t,i^{\prime}}+\sigma_{I_{t}^{*}}^{2}/T_{t,I_{t}^{*}})}}\Bigg{)}-\sum\limits_{i^{\prime}\neq I_{t}^{*}}\exp\Bigg{(}{-\frac{(\mu_{t,i^{\prime}}-\mu_{t,I_{t}^{*}}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{t,i^{\prime}}+(T_{t,I_{t}^{*}}+2)\sigma_{I_{t}^{*}}^{2}/(T_{t,I_{t}^{*}}+1)^{2})}}\Bigg{)},&{\mbox{if~{}}i=I_{t}^{*}},\end{aligned}\right. (27)

it is obvious that \text{iKG}_{t,i}^{\epsilon}>0 for t>0. To prove the consistency, it suffices to prove that V=\emptyset; the claim is then straightforward based on the Strong Law of Large Numbers. For any \delta_{8}>0 and arm i\notin V, there exists N_{3} such that when n>N_{3}, |\mu_{n,i}-\mu_{i}|<\delta_{8}, because arms not in V are pulled infinitely often. Since \exp(\cdot) is continuous and \sigma_{i}^{2}/T_{t,i}-\sigma_{i}^{2}(T_{t,i}+2)/(T_{t,i}+1)^{2}=\sigma_{i}^{2}/((T_{t,i}+1)^{2}T_{t,i})\rightarrow 0 for arm i\notin V, for any \delta_{9}>0 there exists N_{4} such that when t>N_{4}, \text{iKG}_{t,i}^{\epsilon}<\delta_{9}.

Arms i^{\prime}\in V are pulled for only a finite number of rounds. Then \max_{i^{\prime}\in V}T_{t,i^{\prime}} exists and we have \sigma_{i^{\prime}}^{2}/((T_{t,i^{\prime}}+1)^{2}T_{t,i^{\prime}})>\min_{i^{\prime}\neq I_{t}^{*}}\sigma_{i^{\prime}}^{2}/\max_{i^{\prime}\in V}(T_{t,i^{\prime}}+2)/(T_{t,i^{\prime}}+1)^{2}. By the continuity of \exp(\cdot), there exists \delta_{10}>0 such that \text{iKG}_{t,i^{\prime}}^{\epsilon}>\delta_{10}. Since \delta_{9} is arbitrary, let \delta_{9}<\delta_{10}; then \text{iKG}_{t,i^{\prime}}^{\epsilon}>\text{iKG}_{t,i}^{\epsilon} holds, which implies I_{t}\in V. This contradicts arms in V being pulled only finitely often as t grows, so V will become an empty set eventually. In other words, all the arms will be pulled infinitely often and I_{n}^{*}=I^{*}=\langle 1\rangle holds with probability 1.

We next analyze the sampling rate of each arm under the iKG-\epsilon algorithm. Let \delta_{11}=2\delta_{9}>0. We know that when n is large, \text{iKG}_{n,i}^{\epsilon}<\delta_{9}=\delta_{11}/2 holds for i\in\mathbb{A}. Then |\text{iKG}_{n,i}^{\epsilon}-\text{iKG}_{n,i^{\prime}}^{\epsilon}|\leq\text{iKG}_{n,i}^{\epsilon}+\text{iKG}_{n,i^{\prime}}^{\epsilon}<\delta_{11}/2+\delta_{11}/2=\delta_{11}, where i\neq i^{\prime}. For any i,i^{\prime}\in\mathbb{A} and i\neq i^{\prime}\neq\langle 1\rangle,

|iKGn,iϵiKGn,iϵ|=|exp((μn,iμn,1+ϵ)22(σi2/Tn,i+σ12/Tn,1))exp((μn,iμn,1+ϵ)22(σi2/Tn,i+σ12/Tn,1))+exp((μn,iμn,1+ϵ)22((Tn,i+2)σi2/(Tn,i+1)2+σ12/Tn,1))exp((μn,iμn,1+ϵ)22((Tn,i+2)σi2/(Tn,i+1)2+σ12/Tn,1))|2|exp((μn,iμn,1+ϵ)22(σi2/Tn,i+σ12/Tn,1))exp((μn,iμn,1+ϵ)22(σi2/Tn,i+σ12/Tn,1))|=2|exp(n(μn,iμn,1+ϵ)22(σi2/wi+σ12/w1))exp(n(μn,iμn,1+ϵ)22(σi2/wi+σ12/w1))|,\begin{split}&|\text{iKG}_{n,i}^{\epsilon}-\text{iKG}_{n,i^{\prime}}^{\epsilon}|\\ =&\Bigg{|}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/T_{n,i}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{n,i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\\ &+\exp\Bigg{(}{-\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2((T_{n,i^{\prime}}+2)\sigma_{i^{\prime}}^{2}/(T_{n,i^{\prime}}+1)^{2}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2((T_{n,i}+2)\sigma_{i}^{2}/(T_{n,i}+1)^{2}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\Bigg{|}\\ \leq&2\Bigg{|}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/T_{n,i}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{n,i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\Bigg{|}\\ =&2\Bigg{|}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}-\exp\Bigg{(}{-n\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/w_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{|},\end{split}

where wi=Tn,i/nw_{i}=T_{n,i}/n is the sampling rate of arm ii. For any δ12=δ82>0\delta_{12}=\delta_{8}^{2}>0, we have |iKGn,iϵiKGn,iϵ|<δ11|\text{iKG}_{n,i}^{\epsilon}-\text{iKG}_{n,i^{\prime}}^{\epsilon}|<\delta_{11} if and only if

|(μiμ1+ϵ)22(σi2/wi+σ12/w1)(μiμ1+ϵ)22(σi2/wi+σ12/w1)|<δ12\Bigg{|}\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}-\frac{(\mu_{i^{\prime}}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/w_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}\Bigg{|}<\delta_{12} (28)

by the continuity of the function exp()\exp(\cdot) and |μn,iμi|<δ8|\mu_{n,i}-\mu_{i}|<\delta_{8}. For arms i1i\neq\langle 1\rangle,

|iKGn,iϵiKGn,1ϵ|=|exp((μn,iμn,1+ϵ)22(σi2/Tn,i+σ12/Tn,1))i1exp((μn,iμn,1+ϵ)22(σi2/Tn,i+σ12/Tn,1))+i1exp((μn,iμn,1+ϵ)22(σi2/Tn,i+(Tn,1+2)σ12/(Tn,1+1)2))exp((μn,iμn,1+ϵ)22((Tn,i+2)σi2/(Tn,i+1)2+σ12/Tn,1))|.\begin{split}&|\text{iKG}_{n,i}^{\epsilon}-\text{iKG}_{n,\langle 1\rangle}^{\epsilon}|\\ =&\Bigg{|}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/T_{n,i}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}-\sum_{i^{\prime}\neq\langle 1\rangle}\exp\Bigg{(}{-\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{n,i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\\ &+\sum_{i^{\prime}\neq\langle 1\rangle}\exp\Bigg{(}{-\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/T_{n,i^{\prime}}+(T_{n,\langle 1\rangle}+2)\sigma_{\langle 1\rangle}^{2}/(T_{n,\langle 1\rangle}+1)^{2})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2((T_{n,i}+2)\sigma_{i}^{2}/(T_{n,i}+1)^{2}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\Bigg{|}.\end{split}

Notice that (Tn,i+2)σi2/(Tn,i+1)2=σi2/(Tn,i+1/(Tn,i+2))(T_{n,i}+2)\sigma_{i}^{2}/(T_{n,i}+1)^{2}=\sigma_{i}^{2}/(T_{n,i}+1/(T_{n,i}+2)). When nn is large enough, 1/(Tn,i+2)1/(T_{n,i}+2) is sufficiently small according to the consistency of the algorithm. Then

limnexp((μn,iμn,1+ϵ)22(σi2/Tn,i+σ12/Tn,1))exp((μn,iμn,1+ϵ)22((Tn,i+2)σi2/(Tn,i+1)2+σ12/Tn,1))=(exp((μn,iμn,1+ϵ)22(σi2/Tn,i+σ12/Tn,1)))Tn,i=(exp(n(μn,iμn,1+ϵ)22(σi2/wi+σ12/w1)))wi.\begin{split}&\lim_{n\rightarrow\infty}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/T_{n,i}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}-\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2((T_{n,i}+2)\sigma_{i}^{2}/(T_{n,i}+1)^{2}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\\ =&\frac{\partial\Bigg{(}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/T_{n,i}+\sigma_{\langle 1\rangle}^{2}/T_{n,\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial T_{n,i}}=\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{i}}.\end{split}

Since |iKGn,iϵiKGn,1ϵ|<δ11|\text{iKG}_{n,i}^{\epsilon}-\text{iKG}_{n,\langle 1\rangle}^{\epsilon}|<\delta_{11} for i1i\neq\langle 1\rangle, given δ13>0\delta_{13}>0, we have

1δ13<|i1(exp(n(μn,iμn,1+ϵ)22(σi2/wi+σ12/w1)))w1/(exp(n(μn,iμn,1+ϵ)22(σi2/wi+σ12/w1)))wi|<1+δ13.1-\delta_{13}<\Bigg{|}\sum_{i\neq\langle 1\rangle}\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{{\langle 1\rangle}}}\Bigg{/}\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{i}}\Bigg{|}<1+\delta_{13}.

By (28), we have

(exp(n(μn,iμn,1+ϵ)22(σi2/wi+σ12/w1)))wi=(exp(n(μn,iμn,1+ϵ)22(σi2/wi+σ12/w1)))wi.\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{i}}=\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i^{\prime}}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/w_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{i^{\prime}}}.

Then

1δ13<i1|(exp(n(μn,iμn,1+ϵ)22(σi2/wi+σ12/w1)))w1/(exp(n(μn,iμn,1+ϵ)22(σi2/wi+σ12/w1)))wi|<1+δ13.1-\delta_{13}<\sum_{i\neq\langle 1\rangle}\Bigg{|}\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{{\langle 1\rangle}}}\Bigg{/}\frac{\partial\Bigg{(}\exp\Bigg{(}{-n\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}}\Bigg{)}\Bigg{)}}{\partial w_{i}}\Bigg{|}<1+\delta_{13}.

Hence

|w12σ12i1wi2σi2|<δ13.\Bigg{|}\frac{w_{\langle 1\rangle}^{2}}{\sigma_{\langle 1\rangle}^{2}}-\sum_{i\neq\langle 1\rangle}\frac{w_{i}^{2}}{\sigma_{i}^{2}}\Bigg{|}<\delta_{13}.

Since \delta_{13} can be arbitrarily small, \frac{w_{\langle 1\rangle}^{2}}{\sigma_{\langle 1\rangle}^{2}}\rightarrow\sum_{i\neq\langle 1\rangle}\frac{w_{i}^{2}}{\sigma_{i}^{2}}.

We know that

1{Gnϵ=Gϵ}={iGnϵ(θi<θ1ϵ)i𝔸Gnϵ(θi>θ1ϵ)},\begin{split}1-\mathbb{P}\{G_{n}^{\epsilon}=G^{\epsilon}\}=&\mathbb{P}\bigg{\{}\bigcup\limits_{i\in G_{n}^{\epsilon}}(\theta_{i}<\theta_{\langle 1\rangle}-\epsilon)\cup\bigcup\limits_{i\in\mathbb{A}\setminus G_{n}^{\epsilon}}(\theta_{i}>\theta_{\langle 1\rangle}-\epsilon)\bigg{\}},\end{split}

and

max(maxiGnϵ(θi<θ1ϵ),maxi𝔸Gnϵ(θi>θ1ϵ)){iGnϵ(θi<θ1ϵ)i𝔸Gnϵ(θi>θ1ϵ)}kmax(maxiGnϵ(θi<θ1ϵ),maxi𝔸Gnϵ(θi>θ1ϵ)).\begin{split}&\max(\max_{i\in G_{n}^{\epsilon}}\mathbb{P}(\theta_{i}<\theta_{\langle 1\rangle}-\epsilon),\max_{i\in\mathbb{A}\setminus G_{n}^{\epsilon}}\mathbb{P}(\theta_{i}>\theta_{\langle 1\rangle}-\epsilon))\\ \leq&\mathbb{P}\bigg{\{}\bigcup\limits_{i\in G_{n}^{\epsilon}}(\theta_{i}<\theta_{\langle 1\rangle}-\epsilon)\cup\bigcup\limits_{i\in\mathbb{A}\setminus G_{n}^{\epsilon}}(\theta_{i}>\theta_{\langle 1\rangle}-\epsilon)\bigg{\}}\\ \leq&k\max(\max_{i\in G_{n}^{\epsilon}}\mathbb{P}(\theta_{i}<\theta_{\langle 1\rangle}-\epsilon),\max_{i\in\mathbb{A}\setminus G_{n}^{\epsilon}}\mathbb{P}(\theta_{i}>\theta_{\langle 1\rangle}-\epsilon)).\end{split}

Then

1-\mathbb{P}\{G_{n}^{\epsilon}=G^{\epsilon}\}\dot{=}\max_{i\neq\langle 1\rangle}\exp\Bigg{(}{-\frac{(\mu_{n,i}-\mu_{n,\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{n,i}^{2}+\sigma_{n,\langle 1\rangle}^{2})}}\Bigg{)}.

We have

Γϵ=limn1nlog(1{Gnϵ=Gϵ})=mini1(μiμ1+ϵ)22(σi2/wi+σ12/w1).\Gamma^{\epsilon}=\lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{G_{n}^{\epsilon}=G^{\epsilon}\})=\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})}. (29)

By (28),

Γϵ=(μiμ1+ϵ)22(σi2/wi+σ12/w1),i1,\Gamma^{\epsilon}=\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/w_{\langle 1\rangle})},\quad\forall i\neq\langle 1\rangle, (30)

where wiw_{i} in (29) and (30) is the solution of (12) in the main text.

Next, we will show that for any \epsilon-good arm identification algorithm, \lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{G_{n}^{\epsilon}=G^{\epsilon}\})\leq\Gamma^{\epsilon}. Let W\triangleq\{\bm{w}=(w_{1},\ldots,w_{k}):\sum_{i=1}^{k}w_{i}=1\mbox{ and }w_{i}\geq 0,\forall i\in\mathbb{A}\} be the set of feasible sampling rates of the k arms. The proof of this claim is divided into two stages. First, suppose that w_{\langle 1\rangle}=\alpha is fixed for some 0<\alpha<1. We will show that \max_{\bm{w}\in W,w_{\langle 1\rangle}=\alpha}\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)} is achieved when

i1wi=1α, and (μiμ1+ϵ)22(σi2/wi+σ12/α)=(μiμ1+ϵ)22(σi2/wi+σ12/α),ii1.\begin{split}\sum\limits_{i\neq\langle 1\rangle}w_{i}=1-\alpha,\quad\mbox{ and }\quad\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}=\frac{(\mu_{i^{\prime}}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/w_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/\alpha)},\ \ i\neq i^{\prime}\neq\langle 1\rangle.\end{split} (31)

In other words, in this stage, we will prove the first and third equations in (12) of the main text. We prove it by contradiction. Suppose there exists a policy with sampling rates \bm{w}^{\prime}=(w^{\prime}_{1},w^{\prime}_{2},\ldots,w^{\prime}_{k}) that attains \max_{\bm{w}\in W,w_{\langle 1\rangle}=\alpha}\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)} but does not satisfy (31). Since the solution of (31) is unique, there exists an arm i^{\prime} satisfying \frac{(\mu_{i^{\prime}}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/w^{\prime}_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/\alpha)}>\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w^{\prime}_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}. We consider a new policy: there exists \delta_{14}>0 such that \tilde{w}_{i^{\prime}}=w^{\prime}_{i^{\prime}}-\delta_{14}\in(0,1) and \tilde{w}_{i}=w^{\prime}_{i}+\delta_{14}/(k-2)\in(0,1) for i\neq i^{\prime}\neq\langle 1\rangle. Then

mini1(μiμ1+ϵ)22(σi2/w~i+σ12/α)>mini1(μiμ1+ϵ)22(σi2/wi+σ12/α)=max𝒘W,w1=αmini1(μiμ1+ϵ)22(σi2/wi+σ12/α),\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/\tilde{w}_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}>\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w^{\prime}_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}=\max_{\bm{w}\in W,w_{\langle 1\rangle}=\alpha}\min_{i\neq\langle 1\rangle}\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)},

which yields a contradiction. Therefore, the first and third equations in (12) of the main text hold.

In the second stage, we will prove the second equation in (12) of the main text. Consider the following optimization problem

maxα(0,1)zs.t.(μiμ1+ϵ)22(σi2/wi+σ12/α)=(μiμ1+ϵ)22(σi2/wi+σ12/α),i,i1 and ii,(μiμ1+ϵ)22(σi2/wi+σ12/α)z,i1,i1wi=1α.\begin{split}\mbox{$\max\limits_{\alpha\in(0,1)}$}\quad&z\\ \mbox{s.t.}\quad&\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}=\frac{(\mu_{i^{\prime}}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i^{\prime}}^{2}/w_{i^{\prime}}+\sigma_{\langle 1\rangle}^{2}/\alpha)},\quad i,i^{\prime}\neq\langle 1\rangle\mbox{ and }i\neq i^{\prime},\\ &\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}\geq z,\quad i\neq\langle 1\rangle,\\ &\sum_{i\neq\langle 1\rangle}w_{i}=1-\alpha.\end{split} (32)

The Lagrangian function of (32) is

L(α,λi)=z+i1λi((μiμ1+ϵ)22(σi2/wi+σ12/α)z)+λ1(i1wi1+α),L(\alpha,\lambda_{i})=z+\sum_{i\neq\langle 1\rangle}\lambda_{i}(\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)}-z)+\lambda_{1}(\sum_{i\neq\langle 1\rangle}w_{i}-1+\alpha),

where λi\lambda_{i}’s are the Lagrange multipliers. By the KKT conditions, we have λi((μiμ1+ϵ)22(σi2/wi+σ12/α))/wi+λ1=0\lambda_{i}\partial(\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)})/\partial w_{i}+\lambda_{1}=0 for all i1i\neq\langle 1\rangle and i1λi((μiμ1+ϵ)22(σi2/wi+σ12/α))/w1+λ1=0\sum_{i\neq\langle 1\rangle}\lambda_{i}\partial(\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)})/\partial w_{\langle 1\rangle}+\lambda_{1}=0. Then

i1((μiμ1+ϵ)22(σi2/wi+σ12/α))/w1((μiμ1+ϵ)22(σi2/wi+σ12/α))/wi=1,\sum_{i\neq\langle 1\rangle}\frac{\partial(\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)})/\partial w_{\langle 1\rangle}}{\partial(\frac{(\mu_{i}-\mu_{\langle 1\rangle}+\epsilon)^{2}}{2(\sigma_{i}^{2}/w_{i}+\sigma_{\langle 1\rangle}^{2}/\alpha)})/\partial w_{i}}=1,

i.e., w12σ12=i1wi2σi2\frac{w_{\langle 1\rangle}^{2}}{\sigma_{\langle 1\rangle}^{2}}=\sum_{i\neq\langle 1\rangle}\frac{w_{i}^{2}}{\sigma_{i}^{2}}.

Appendix F Proof of Theorem 3

Our proof of Theorem 3 will be divided into the analysis of the consistency, sampling rates and asymptotic optimality of the iKG-F algorithm.

We first show consistency, i.e., each arm will be pulled infinitely often by the algorithm as the number of rounds n goes to infinity. Since

iKGt,iF=j=1m(exp((γjμt,ij)22σij2/Tt,i𝟏{i𝒮t1})exp((γjμt,ij)22(Tt,i+2)σij2/(Tt,i+1)2𝟏{i𝒮t1}))+exp(jt,i2(γjμt,ij)22σij2/Tt,i𝟏{i𝒮t2})exp(jt,i2(γjμt,ij)22(Tt,i+2)σij2/(Tt,i+1)2𝟏{i𝒮t2}).\begin{split}\text{iKG}_{t,i}^{\text{F}}=&\sum_{j=1}^{m}\Bigg{(}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{t,ij})^{2}}{2\sigma_{ij}^{2}/T_{t,i}}}\mathbf{1}\{i\in\mathcal{S}_{t}^{1}\}\Bigg{)}-\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{t,ij})^{2}}{2(T_{t,i}+2)\sigma_{ij}^{2}/(T_{t,i}+1)^{2}}}\mathbf{1}\{i\in\mathcal{S}_{t}^{1}\}\Bigg{)}\Bigg{)}\\ &+\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{t,i}^{2}}\frac{(\gamma_{j}-\mu_{t,ij})^{2}}{2\sigma_{ij}^{2}/T_{t,i}}}\mathbf{1}\{i\in\mathcal{S}_{t}^{2}\}\Bigg{)}-\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{t,i}^{2}}\frac{(\gamma_{j}-\mu_{t,ij})^{2}}{2(T_{t,i}+2)\sigma_{ij}^{2}/(T_{t,i}+1)^{2}}}\mathbf{1}\{i\in\mathcal{S}_{t}^{2}\}\Bigg{)}.\end{split} (33)

It is obvious that \text{iKG}_{t,i}^{\text{F}}>0 for t>0. To prove the consistency, it suffices to prove that V=\emptyset; the claim is then straightforward based on the Strong Law of Large Numbers. For any \delta_{15}>0 and i\notin V, there exists N_{5} such that when n>N_{5}, |\mu_{n,i}-\mu_{i}|<\delta_{15}, because arms not in V are pulled infinitely often. Since \exp(\cdot) is continuous and \sigma_{i}^{2}/T_{t,i}-\sigma_{i}^{2}(T_{t,i}+2)/(T_{t,i}+1)^{2}=\sigma_{i}^{2}/((T_{t,i}+1)^{2}T_{t,i})\rightarrow 0 for arm i\notin V, for any \delta_{16}>0 there exists N_{6} such that when t>N_{6}, \text{iKG}_{t,i}^{\text{F}}<\delta_{16}.

Arms i^{\prime}\in V are pulled for only a finite number of rounds. Then \max_{i^{\prime}\in V}T_{t,i^{\prime}} exists and we have \sigma_{i^{\prime}}^{2}/((T_{t,i^{\prime}}+1)^{2}T_{t,i^{\prime}})>\min_{i^{\prime}\in\mathbb{A}}\sigma_{i^{\prime}}^{2}/\max_{i^{\prime}\in V}(T_{t,i^{\prime}}+2)/(T_{t,i^{\prime}}+1)^{2}. By the continuity of \exp(\cdot), there exists \delta_{17}>0 such that \text{iKG}_{t,i^{\prime}}^{\text{F}}>\delta_{17}. Since \delta_{16} is arbitrary, let \delta_{16}<\delta_{17}; then \text{iKG}_{t,i^{\prime}}^{\text{F}}>\text{iKG}_{t,i}^{\text{F}} holds, which implies I_{t}\in V. This contradicts arms in V being pulled only finitely often as t grows, so V will become an empty set eventually. In other words, all the arms will be pulled infinitely often.

We next analyze the sampling rate of each arm under the iKG-F algorithm. Let \delta_{18}=2\delta_{16}>0. We know that when n is large, \text{iKG}_{n,i}^{\text{F}}<\delta_{16}=\delta_{18}/2 holds for i\in\mathbb{A}. Then |\text{iKG}_{n,i}^{\text{F}}-\text{iKG}_{n,i^{\prime}}^{\text{F}}|\leq\text{iKG}_{n,i}^{\text{F}}+\text{iKG}_{n,i^{\prime}}^{\text{F}}<\delta_{18}/2+\delta_{18}/2=\delta_{18}, where i\neq i^{\prime}. For any i,i^{\prime}\in\mathbb{A},

\begin{split}&|\text{iKG}_{n,i}^{\text{F}}-\text{iKG}_{n,i^{\prime}}^{\text{F}}|\\ =&\Bigg{|}\sum_{j=1}^{m}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{ij}^{2}/T_{n,i}}}\mathbf{1}\{i\in\mathcal{S}_{n}^{1}\}\Bigg{)}-\sum_{j=1}^{m}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{n,i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}/T_{n,i^{\prime}}}}\mathbf{1}\{i^{\prime}\in\mathcal{S}_{n}^{1}\}\Bigg{)}\\ &+\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{n,i}^{2}}\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{ij}^{2}/T_{n,i}}}\mathbf{1}\{i\in\mathcal{S}_{n}^{2}\}\Bigg{)}-\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{n,i^{\prime}}^{2}}\frac{(\gamma_{j}-\mu_{n,i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}/T_{n,i^{\prime}}}}\mathbf{1}\{i^{\prime}\in\mathcal{S}_{n}^{2}\}\Bigg{)}\\ &+\sum_{j=1}^{m}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{n,i^{\prime}j})^{2}}{2(T_{n,i^{\prime}}+2)\sigma_{i^{\prime}j}^{2}/(T_{n,i^{\prime}}+1)^{2}}}\mathbf{1}\{i^{\prime}\in\mathcal{S}_{n}^{1}\}\Bigg{)}-\sum_{j=1}^{m}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2(T_{n,i}+2)\sigma_{ij}^{2}/(T_{n,i}+1)^{2}}}\mathbf{1}\{i\in\mathcal{S}_{n}^{1}\}\Bigg{)}\\ &+\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{n,i^{\prime}}^{2}}\frac{(\gamma_{j}-\mu_{n,i^{\prime}j})^{2}}{2(T_{n,i^{\prime}}+2)\sigma_{i^{\prime}j}^{2}/(T_{n,i^{\prime}}+1)^{2}}}\mathbf{1}\{i^{\prime}\in\mathcal{S}_{n}^{2}\}\Bigg{)}-\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{n,i}^{2}}\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2(T_{n,i}+2)\sigma_{ij}^{2}/(T_{n,i}+1)^{2}}}\mathbf{1}\{i\in\mathcal{S}_{n}^{2}\}\Bigg{)}\Bigg{|}\\ \leq&2\Bigg{|}\sum_{j=1}^{m}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{ij}^{2}/T_{n,i}}}\mathbf{1}\{i\in\mathcal{S}_{n}^{1}\}\Bigg{)}-\sum_{j=1}^{m}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{n,i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}/T_{n,i^{\prime}}}}\mathbf{1}\{i^{\prime}\in\mathcal{S}_{n}^{1}\}\Bigg{)}\\ &+\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{n,i}^{2}}\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{ij}^{2}/T_{n,i}}}\mathbf{1}\{i\in\mathcal{S}_{n}^{2}\}\Bigg{)}-\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{n,i^{\prime}}^{2}}\frac{(\gamma_{j}-\mu_{n,i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}/T_{n,i^{\prime}}}}\mathbf{1}\{i^{\prime}\in\mathcal{S}_{n}^{2}\}\Bigg{)}\Bigg{|}\\ \leq&2\Bigg{|}m\max_{j\in\mathcal{E}_{n,i}^{1}}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{ij}^{2}/T_{n,i}}}\mathbf{1}\{i\in\mathcal{S}_{n}^{1}\}\Bigg{)}-m\max_{j\in\mathcal{E}_{n,i^{\prime}}^{1}}\exp\Bigg{(}{-\frac{(\gamma_{j}-\mu_{n,i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}/T_{n,i^{\prime}}}}\mathbf{1}\{i^{\prime}\in\mathcal{S}_{n}^{1}\}\Bigg{)}\\ &+\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{n,i}^{2}}\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{ij}^{2}/T_{n,i}}}\mathbf{1}\{i\in\mathcal{S}_{n}^{2}\}\Bigg{)}-\exp\Bigg{(}{-\sum_{j\in\mathcal{E}_{n,i^{\prime}}^{2}}\frac{(\gamma_{j}-\mu_{n,i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}/T_{n,i^{\prime}}}}\mathbf{1}\{i^{\prime}\in\mathcal{S}_{n}^{2}\}\Bigg{)}\Bigg{|}\end{split}
=2|mexp(wiminjn,i1(γjμn,ij)22σij2𝟏{i𝒮n1})mexp(wiminjn,i1(γjμn,ij)22σij2𝟏{i𝒮n1})+exp(wijn,i2(γjμn,ij)22σij2𝟏{i𝒮n2})exp(wijn,i2(γjμn,ij)22σij2𝟏{i𝒮n2})|,\begin{split}=&2\Bigg{|}m\exp\Bigg{(}{-w_{i}\min_{j\in\mathcal{E}_{n,i}^{1}}\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{ij}^{2}}}\mathbf{1}\{i\in\mathcal{S}_{n}^{1}\}\Bigg{)}-m\exp\Bigg{(}{-w_{i^{\prime}}\min_{j\in\mathcal{E}_{n,i}^{1}}\frac{(\gamma_{j}-\mu_{n,i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}}}\mathbf{1}\{i^{\prime}\in\mathcal{S}_{n}^{1}\}\Bigg{)}\\ &+\exp\Bigg{(}{-w_{i}\sum_{j\in\mathcal{E}_{n,i}^{2}}\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{ij}^{2}}}\mathbf{1}\{i\in\mathcal{S}_{n}^{2}\}\Bigg{)}-\exp\Bigg{(}{-w_{i^{\prime}}\sum_{j\in\mathcal{E}_{n,i^{\prime}}^{2}}\frac{(\gamma_{j}-\mu_{n,i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}}}\mathbf{1}\{i^{\prime}\in\mathcal{S}_{n}^{2}\}\Bigg{)}\Bigg{|},\end{split}

where $w_{i}=T_{n,i}/n$ is the sampling rate of arm $i$. We have shown that $|\mu_{n,i}-\mu_{i}|<\delta_{15}$ for any $\delta_{15}>0$ and $i=1,2,\ldots,k$. We can find a sufficiently large positive integer $n^{\prime}$ such that when $n>n^{\prime}$, $\mathcal{S}_{n}^{1}=\mathcal{S}^{1}$, $\mathcal{S}_{n}^{2}=\mathcal{S}^{2}$, $\mathcal{E}_{n,i}^{1}=\mathcal{E}_{i}^{1}$ and $\mathcal{E}_{n,i}^{2}=\mathcal{E}_{i}^{2}$ for $i=1,2,\ldots,k$. Note that $\mathcal{S}_{n}^{1}\cap\mathcal{S}_{n}^{2}=\emptyset$ and $\mathcal{S}^{1}\cap\mathcal{S}^{2}=\emptyset$. If $i\in\mathcal{S}_{n}^{1}=\mathcal{S}^{1}$, $|\text{iKG}_{n,i}^{\text{F}}-\text{iKG}_{n,i^{\prime}}^{\text{F}}|\leq 2m\Big|\exp\Big(-w_{i}\min_{j\in\mathcal{E}_{n,i}^{1}}\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{ij}^{2}}\Big)-\exp\Big(-w_{i^{\prime}}\min_{j\in\mathcal{E}_{n,i^{\prime}}^{1}}\frac{(\gamma_{j}-\mu_{n,i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}}\Big)\Big|\leq 2m\Big|\exp\Big(-w_{i}\min_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij}-\delta_{15})^{2}}{2\sigma_{ij}^{2}}\Big)-\exp\Big(-w_{i^{\prime}}\min_{j\in\mathcal{E}_{i^{\prime}}^{1}}\frac{(\gamma_{j}-\mu_{i^{\prime}j}+\delta_{15})^{2}}{2\sigma_{i^{\prime}j}^{2}}\Big)\Big|$. We have shown that $|\text{iKG}_{n,i}^{\text{F}}-\text{iKG}_{n,i^{\prime}}^{\text{F}}|<\delta_{18}$. Hence, by the continuity of the function $\exp(\cdot)$, $\Big|w_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}-w_{i^{\prime}}\min\limits_{j\in\mathcal{E}_{i^{\prime}}^{1}}\frac{(\gamma_{j}-\mu_{i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}}\Big|<\delta_{19}$ for any $\delta_{19}>0$ once $n$ is sufficiently large. A similar result holds when $i\in\mathcal{S}_{n}^{2}=\mathcal{S}^{2}$. Hence, $|\text{iKG}_{n,i}^{\text{F}}-\text{iKG}_{n,i^{\prime}}^{\text{F}}|<\delta_{18}$ if and only if

\begin{split}&\Bigg|w_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{1}\}+w_{i}\sum_{j\in\mathcal{E}_{i}^{2}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{2}\}\\ &-w_{i^{\prime}}\min\limits_{j\in\mathcal{E}_{i^{\prime}}^{1}}\frac{(\gamma_{j}-\mu_{i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}}\mathbf{1}\{i^{\prime}\in\mathcal{S}^{1}\}-w_{i^{\prime}}\sum_{j\in\mathcal{E}_{i^{\prime}}^{2}}\frac{(\gamma_{j}-\mu_{i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}}\mathbf{1}\{i^{\prime}\in\mathcal{S}^{2}\}\Bigg|<\delta_{19}\end{split} (34)

by the continuity of the function $\exp(\cdot)$ and $|\mu_{n,i}-\mu_{i}|<\delta_{15}$.
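To make the balance condition (34) concrete, the per-arm rate exponents can be computed directly from the problem parameters. The following is a minimal Python sketch, assuming independent Gaussian measures with known variances; the function name `arm_exponents`, the array layout, and the convention that arm $i$ is feasible when $\mu_{ij}\leq\gamma_{j}$ for all $j$ are our own illustrative choices, not code from the paper.

```python
import numpy as np

def arm_exponents(mu, sigma2, gamma, w):
    """Per-arm rate exponents as in (34): for a feasible arm (in S^1),
    w_i * min_j (gamma_j - mu_ij)^2 / (2 sigma_ij^2) over its satisfied
    measures E_i^1; for an infeasible arm (in S^2), w_i times the sum
    over its violated measures E_i^2."""
    k, m = mu.shape
    g = np.zeros(k)
    for i in range(k):
        d = (gamma - mu[i]) ** 2 / (2.0 * sigma2[i])
        if np.all(mu[i] <= gamma):            # i in S^1: all m constraints satisfied
            g[i] = w[i] * d.min()             # easiest-to-violate measure drives the rate
        else:                                 # i in S^2: every violated measure must flip
            g[i] = w[i] * d[mu[i] > gamma].sum()
    return g

# illustrative usage with k = 3 arms and m = 2 measures (numbers are ours)
mu = np.array([[0.2, 0.5], [0.8, 0.3], [1.2, 1.4]])
sigma2 = np.ones((3, 2))
gamma = np.array([1.0, 1.0])
print(arm_exponents(mu, sigma2, gamma, w=np.array([0.3, 0.3, 0.4])))
```

Condition (34) then states that, under iKG-F, these $k$ exponents are asymptotically equalized across arms.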

Recall that

\begin{split}1-\mathbb{P}\{\mathcal{S}_{n}^{1}=\mathcal{S}^{1}\}=&\mathbb{P}\biggl{\{}\bigcup\limits_{i\in\mathcal{S}_{n}^{1}}\bigg{(}\bigcup\limits_{j=1}^{m}(\theta_{ij}>\gamma_{j})\bigg{)}\cup\bigcup\limits_{i\in\mathcal{S}_{n}^{2}}\bigg{(}\bigcap\limits_{j=1}^{m}(\theta_{ij}\leq\gamma_{j})\bigg{)}\biggr{\}}.\end{split}

We have

\begin{split}&\max\Bigg{(}\max_{i\in\mathcal{S}_{n}^{1}}\mathbb{P}\bigg{(}\bigcup\limits_{j=1}^{m}(\theta_{ij}>\gamma_{j})\bigg{)},\max_{i\in\mathcal{S}_{n}^{2}}\mathbb{P}\bigg{(}\bigcap\limits_{j=1}^{m}(\theta_{ij}\leq\gamma_{j})\bigg{)}\Bigg{)}\\ \leq&\mathbb{P}\biggl{\{}\bigcup\limits_{i\in\mathcal{S}_{n}^{1}}\bigg{(}\bigcup\limits_{j=1}^{m}(\theta_{ij}>\gamma_{j})\bigg{)}\cup\bigcup\limits_{i\in\mathcal{S}_{n}^{2}}\bigg{(}\bigcap\limits_{j=1}^{m}(\theta_{ij}\leq\gamma_{j})\bigg{)}\biggr{\}}\\ \leq&k\max\Bigg{(}\max_{i\in\mathcal{S}_{n}^{1}}\mathbb{P}\bigg{(}\bigcup\limits_{j=1}^{m}(\theta_{ij}>\gamma_{j})\bigg{)},\max_{i\in\mathcal{S}_{n}^{2}}\mathbb{P}\bigg{(}\bigcap\limits_{j=1}^{m}(\theta_{ij}\leq\gamma_{j})\bigg{)}\Bigg{)}.\end{split}

Since $k$ does not depend on $n$, the upper and lower bounds above decay at the same exponential rate, and hence

1-\mathbb{P}\{\mathcal{S}_{n}^{1}=\mathcal{S}^{1}\}\dot{=}\max\Bigg{(}\max_{i\in\mathcal{S}_{n}^{1}}\mathbb{P}\bigg{(}\bigcup\limits_{j=1}^{m}(\theta_{ij}>\gamma_{j})\bigg{)},\max_{i\in\mathcal{S}_{n}^{2}}\mathbb{P}\bigg{(}\bigcap\limits_{j=1}^{m}(\theta_{ij}\leq\gamma_{j})\bigg{)}\Bigg{)}.

For arm $i\in\mathcal{S}_{n}^{1}$,

\max_{j\in\mathcal{E}_{n,i}^{1}}\mathbb{P}(\theta_{ij}>\gamma_{j})\leq\mathbb{P}\bigg{(}\bigcup\limits_{j=1}^{m}(\theta_{ij}>\gamma_{j})\bigg{)}\leq m\max_{j\in\mathcal{E}_{n,i}^{1}}\mathbb{P}(\theta_{ij}>\gamma_{j}).

For arm $i\in\mathcal{S}_{n}^{2}$,

\mathbb{P}\bigg{(}\bigcap\limits_{j=1}^{m}(\theta_{ij}\leq\gamma_{j})\bigg{)}\rightarrow\mathbb{P}\bigg{(}\bigcap\limits_{j\in\mathcal{E}_{n,i}^{2}}(\theta_{ij}\leq\gamma_{j})\bigg{)},

because $\lim_{n\rightarrow\infty}\mathbb{P}\bigg{(}\bigcap\limits_{j\in\mathcal{E}_{n,i}^{1}}(\theta_{ij}\leq\gamma_{j})\bigg{)}=1$. Hence

1-\mathbb{P}\{\mathcal{S}_{n}^{1}=\mathcal{S}^{1}\}\dot{=}\max_{i\in\mathbb{A}}\Bigg(\exp\Bigg(-\min_{j\in\mathcal{E}_{n,i}^{1}}\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{ij}^{2}/T_{n,i}}\Bigg)\mathbf{1}\{i\in\mathcal{S}_{n}^{1}\}+\exp\Bigg(-\sum_{j\in\mathcal{E}_{n,i}^{2}}\frac{(\gamma_{j}-\mu_{n,ij})^{2}}{2\sigma_{ij}^{2}/T_{n,i}}\Bigg)\mathbf{1}\{i\in\mathcal{S}_{n}^{2}\}\Bigg).
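The logarithmic equivalences above rest on the Gaussian tail estimate $-\log\mathbb{P}(\theta_{ij}>\gamma_{j})\approx(\gamma_{j}-\mu_{n,ij})^{2}/(2\sigma_{ij}^{2}/T_{n,i})$, which holds up to subexponential factors. The following minimal sketch checks this rate numerically; all parameter values are illustrative choices of ours.

```python
import numpy as np
from scipy.stats import norm

mu_nij, gamma_j, sigma_ij = 0.0, 1.0, 2.0            # illustrative, not from the paper
for T in [10, 100, 1000, 10000]:
    post_sd = sigma_ij / np.sqrt(T)                  # posterior std after T samples
    exact = -norm.logsf(gamma_j, loc=mu_nij, scale=post_sd)  # -log P(theta > gamma)
    rate = (gamma_j - mu_nij) ** 2 / (2 * sigma_ij ** 2 / T)
    print(T, exact / rate)                           # ratio tends to 1 as T grows
```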

We have

\Gamma^{\text{F}}=\lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{\mathcal{S}_{n}^{1}=\mathcal{S}^{1}\})=\min_{i\in\mathbb{A}}\Bigg(w_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{1}\}+w_{i}\sum_{j\in\mathcal{E}_{i}^{2}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{2}\}\Bigg). (35)

By (34),

\Gamma^{\text{F}}=w_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{1}\}+w_{i}\sum_{j\in\mathcal{E}_{i}^{2}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{2}\},\quad\forall i\in\mathbb{A}, (36)

where the $w_{i}$ in (35) and (36) are given by the solution of (16) in the main text.
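Since each arm's exponent in (36) is linear in $w_{i}$, writing the bracketed constant as $c_{i}$ reduces the balance equations to equalizing $w_{i}c_{i}$ across arms under $\sum_{i}w_{i}=1$, which has the closed form $w_{i}\propto 1/c_{i}$. A minimal sketch of this computation; the name `balanced_rates` and the sample constants are our own, not from the paper's code:

```python
import numpy as np

def balanced_rates(c):
    """Solve w_i * c_i = const with sum_i w_i = 1, i.e., w_i = (1/c_i) / sum_l (1/c_l).
    Here c[i] is arm i's rate constant from (36): the bracketed min (arm in S^1)
    or sum (arm in S^2) of (gamma_j - mu_ij)^2 / (2 sigma_ij^2)."""
    inv = 1.0 / np.asarray(c, dtype=float)
    return inv / inv.sum()

c = np.array([0.5, 2.0, 1.25])    # illustrative rate constants, not from the paper
w = balanced_rates(c)
print(w, w * c)                   # all products w_i * c_i coincide; that value is Gamma^F
```

With these constants, the common value is $\Gamma^{\text{F}}=1/\sum_{l}1/c_{l}$.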

Next, we show that for any feasible arm identification algorithm, $\lim_{n\rightarrow\infty}-\frac{1}{n}\log(1-\mathbb{P}\{\mathcal{S}_{n}^{1}=\mathcal{S}^{1}\})\leq\Gamma^{\text{F}}$. Let $W\triangleq\{\bm{w}=(w_{1},\ldots,w_{k}):\sum_{i=1}^{k}w_{i}=1\mbox{ and }w_{i}\geq 0,\forall i\in\mathbb{A}\}$ be the set of feasible sampling rates of the $k$ arms. We prove the claim by contradiction. Suppose there exists a policy with sampling rates $\bm{w}^{\prime}=(w^{\prime}_{1},w^{\prime}_{2},\ldots,w^{\prime}_{k})$ that does not satisfy the balance condition (37) below but still attains

\begin{split}&\min_{i\in\mathbb{A}}\Bigg(w^{\prime}_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{1}\}+w^{\prime}_{i}\sum_{j\in\mathcal{E}_{i}^{2}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{2}\}\Bigg)\\ =&\max_{\bm{w}\in W}\min_{i\in\mathbb{A}}\Bigg(w_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{1}\}+w_{i}\sum_{j\in\mathcal{E}_{i}^{2}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{2}\}\Bigg).\end{split}

We will show that $\max_{\bm{w}\in W}\min_{i\in\mathbb{A}}\Big(w_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{1}\}+w_{i}\sum_{j\in\mathcal{E}_{i}^{2}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{2}\}\Big)$ is achieved when

\begin{split}&w_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{1}\}+w_{i}\sum_{j\in\mathcal{E}_{i}^{2}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{2}\}\\ &=w_{i^{\prime}}\min\limits_{j\in\mathcal{E}_{i^{\prime}}^{1}}\frac{(\gamma_{j}-\mu_{i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}}\mathbf{1}\{i^{\prime}\in\mathcal{S}^{1}\}+w_{i^{\prime}}\sum_{j\in\mathcal{E}_{i^{\prime}}^{2}}\frac{(\gamma_{j}-\mu_{i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}}\mathbf{1}\{i^{\prime}\in\mathcal{S}^{2}\},\ \ i\neq i^{\prime}.\end{split} (37)

Since the solution of (37) is unique and $\bm{w}^{\prime}$ does not satisfy (37), there exists an arm $i^{\prime}$ satisfying

\begin{split}&w^{\prime}_{i^{\prime}}\min\limits_{j\in\mathcal{E}_{i^{\prime}}^{1}}\frac{(\gamma_{j}-\mu_{i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}}\mathbf{1}\{i^{\prime}\in\mathcal{S}^{1}\}+w^{\prime}_{i^{\prime}}\sum_{j\in\mathcal{E}_{i^{\prime}}^{2}}\frac{(\gamma_{j}-\mu_{i^{\prime}j})^{2}}{2\sigma_{i^{\prime}j}^{2}}\mathbf{1}\{i^{\prime}\in\mathcal{S}^{2}\}\\ >&\min_{i\in\mathbb{A}}\Bigg(w^{\prime}_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{1}\}+w^{\prime}_{i}\sum_{j\in\mathcal{E}_{i}^{2}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{2}\}\Bigg).\end{split}

We now consider a perturbed policy. There exists $\delta_{20}>0$ such that $\tilde{w}_{i^{\prime}}=w^{\prime}_{i^{\prime}}-\delta_{20}\in(0,1)$ and $\tilde{w}_{i}=w^{\prime}_{i}+\delta_{20}/(k-1)\in(0,1)$ for $i\neq i^{\prime}$. Then

\begin{split}&\min_{i\in\mathbb{A}}\Bigg(\tilde{w}_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{1}\}+\tilde{w}_{i}\sum_{j\in\mathcal{E}_{i}^{2}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{2}\}\Bigg)\\ >&\min_{i\in\mathbb{A}}\Bigg(w^{\prime}_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{1}\}+w^{\prime}_{i}\sum_{j\in\mathcal{E}_{i}^{2}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{2}\}\Bigg)\\ =&\max_{\bm{w}\in W}\min_{i\in\mathbb{A}}\Bigg(w_{i}\min\limits_{j\in\mathcal{E}_{i}^{1}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{1}\}+w_{i}\sum_{j\in\mathcal{E}_{i}^{2}}\frac{(\gamma_{j}-\mu_{ij})^{2}}{2\sigma_{ij}^{2}}\mathbf{1}\{i\in\mathcal{S}^{2}\}\Bigg),\end{split}

which yields a contradiction. Therefore, the optimal sampling rates must satisfy the equations in (16) of the main text.
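The perturbation step of this argument can also be checked numerically: starting from an unbalanced allocation, shifting a small amount of sampling budget away from an arm whose exponent is strictly above the minimum raises the max-min objective, so no unbalanced allocation can be optimal. A hedged sketch with made-up rate constants of our own:

```python
import numpy as np

c = np.array([0.5, 2.0, 1.25])          # illustrative rate constants c_i
w = np.array([0.2, 0.5, 0.3])           # an unbalanced allocation w'

def objective(w):
    return np.min(w * c)                # the max-min objective over arms

i_slack = int(np.argmax(w * c))         # an arm i' strictly above the minimum
delta = 0.05                            # plays the role of delta_20 in the proof
w_tilde = w + delta / (len(w) - 1)      # give delta/(k-1) to every other arm...
w_tilde[i_slack] = w[i_slack] - delta   # ...taking delta from the slack arm i'
print(objective(w), "->", objective(w_tilde))   # the objective strictly increases
```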

Appendix G Additional Numerical Results

In this section, we provide additional numerical results for the experiments conducted in Section 5 of the main text. Figures 1(a)-7(a) show how the probabilities of false selection (PFS) of the compared algorithms change with the sample size for the best arm identification problem, and Figures 1(b)-7(b) show the sampling rates of the algorithms on selected arms. It can be observed in Figures 1(a)-7(a) that the proposed iKG algorithm performs the best, followed by TTEI, KG, EI and equal allocation. On the log scale, the PFS values of the iKG algorithm exhibit linear patterns, indicating potentially exponential convergence rates. For both EI and KG, the posterior convergence rates are suboptimal, which may explain their weaker empirical performance. Equal allocation performs the worst in general. In Figures 1(b)-7(b), we can see that TTEI always allocates half of the samples to the best arm when $\beta=0.5$, EI allocates too many samples to the best arm, while KG allocates too few.
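For readers reproducing these curves: each PFS value is, as is standard, a Monte Carlo estimate obtained by running a policy to the given budget over many independent replications and recording the fraction of runs that select a wrong arm. A minimal sketch of this evaluation loop, assuming a user-supplied `policy(budget, rng)` callable that returns the index of the selected arm (the names are ours, not the paper's):

```python
import numpy as np

def estimate_pfs(policy, budgets, best_arm, reps=10000, seed=0):
    """Monte Carlo estimate of the probability of false selection per budget.
    An exponentially converging policy yields a near-linear log(PFS) curve."""
    rng = np.random.default_rng(seed)
    pfs = []
    for n in budgets:
        errors = sum(policy(n, rng) != best_arm for _ in range(reps))
        pfs.append(errors / reps)
    return np.array(pfs)
```

Plotting `np.log` of the returned values against the budgets produces the log-scale patterns discussed above.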

Figures 8(a)-14(a) show how the PFS of the compared algorithms changes with the sample size for the $\epsilon$-good arm identification, and Figures 8(b)-14(b) show the sampling rates of the algorithms on selected arms. It can be observed in Figures 8(a)-14(a) that the proposed iKG-$\epsilon$ algorithm performs the best and demonstrates a linear pattern on the log scale. (ST)$^{2}$ and APT are inferior, and equal allocation performs the worst. In Figures 8(b)-14(b), we can see that APT allocates too few samples to the best arm and too many samples to the arms near the threshold. (ST)$^{2}$ also allocates too few samples to the best arm, which influences the accuracy of the estimated threshold.

Figures 15(a)-21(a) show how the PFS of the compared algorithms changes with the sample size for the feasible arm identification, and Figures 15(b)-21(b) show the sampling rates of the algorithms on selected arms. The results in Figures 15(a)-21(a) are similar to those in Figures 1(a)-14(a). The proposed iKG-F algorithm has the best performance, followed by MD-UCBE, equal allocation and MD-SAR. In Figures 15(b)-21(b), we can see that MD-UCBE and MD-SAR allocate too many samples to the arms near the constraint limits.

Figure 1: PFS and sampling rates of selected arms for the best arm identification (Example 1)
Figure 2: PFS and sampling rates of selected arms for the best arm identification (Example 2)
Figure 3: PFS and sampling rates of selected arms for the best arm identification (Example 3)
Figure 4: PFS and sampling rates of selected arms for the best arm identification (Dose-Finding Problem)
Figure 5: PFS and sampling rates of selected arms for the best arm identification (Drug Selection Problem)
Figure 6: PFS and sampling rates of selected arms for the best arm identification (Caption 853)
Figure 7: PFS and sampling rates of selected arms for the best arm identification (Caption 854)
Figure 8: PFS and sampling rates of selected arms for the $\epsilon$-good arm identification (Example 1)
Figure 9: PFS and sampling rates of selected arms for the $\epsilon$-good arm identification (Example 2)
Figure 10: PFS and sampling rates of selected arms for the $\epsilon$-good arm identification (Example 3)
Figure 11: PFS and sampling rates of selected arms for the $\epsilon$-good arm identification (Dose-Finding Problem)
Figure 12: PFS and sampling rates of selected arms for the $\epsilon$-good arm identification (Drug Selection Problem)
Figure 13: PFS and sampling rates of selected arms for the $\epsilon$-good arm identification (Caption 853)
Figure 14: PFS and sampling rates of selected arms for the $\epsilon$-good arm identification (Caption 854)
Figure 15: PFS and sampling rates of selected arms for the feasible arm identification (Example 1)
Figure 16: PFS and sampling rates of selected arms for the feasible arm identification (Example 2)
Figure 17: PFS and sampling rates of selected arms for the feasible arm identification (Example 3)
Figure 18: PFS and sampling rates of selected arms for the feasible arm identification (Dose-Finding Problem)
Figure 19: PFS and sampling rates of selected arms for the feasible arm identification (Drug Selection Problem)
Figure 20: PFS and sampling rates of selected arms for the feasible arm identification (Caption 853)
Figure 21: PFS and sampling rates of selected arms for the feasible arm identification (Caption 854)