
Anand Kalvit ([email protected]) and Assaf Zeevi ([email protected])
Columbia University

Complexity Analysis of a Countable-armed Bandit Problem

Abstract

We consider a stochastic multi-armed bandit (MAB) problem motivated by “large” action spaces, and endowed with a population of arms containing exactly $K$ arm-types, each characterized by a distinct mean reward. The decision maker is oblivious to the statistical properties of reward distributions as well as the population-level distribution of different arm-types, and is precluded also from observing the type of an arm after play. We study the classical problem of minimizing the expected cumulative regret over a horizon of play $n$, and propose algorithms that achieve a rate-optimal finite-time instance-dependent regret of $\mathcal{O}\left(\log n\right)$. We also show that the instance-independent (minimax) regret is $\tilde{\mathcal{O}}\left(\sqrt{n}\right)$ when $K=2$. While the order of regret and complexity of the problem suggests a great degree of similarity to the classical MAB problem, properties of the performance bounds and salient aspects of algorithm design are quite distinct from the latter, as are the key primitives that determine complexity along with the analysis tools needed to study them.

keywords:
Infinite-armed bandits, Explore-then-Commit, UCB, Adaptivity to reservoir-distribution.

1 Introduction

Background and motivation. The MAB model is a classical machine learning paradigm capturing the essence of the exploration-exploitation trade-offs characteristic of sequential decision making problems under uncertainty. In its simplest formulation, the decision maker (DM) must play at each time instant $t\in\{1,...,n\}$ one out of a set of $K\ll n$ possible alternatives (aka arms), each characterized by a distribution of rewards. Oblivious to their statistical properties, the DM must play a sequence of $n$ arms so as to maximize her cumulative expected payoff, an objective often converted to minimizing regret relative to an oracle with perfect ex ante knowledge of the best arm.

Since the seminal paper of [Lai and Robbins (1985)] that laid the main theoretical foundations, there has been a plethora of work developing more advanced MAB models to encapsulate more realistic data-driven decision processes. These include formulations with covariate or contextual information, choice-models, budget constraints, non-stationary rewards, and metric space embeddings, among many others that utilize some structure in the arms, reward distributions, or physics of the problem (see [Slivkins et al. (2019), Lattimore and Szepesvári (2020)] for a comprehensive survey). In this paper, we are motivated by the choice overload phenomenon pervading modern MAB applications with a prohibitively large action space such as those encountered in online marketplaces, matching platforms and the likes.

Modeling choice overload. In several applications of MAB, it is quite common for the number of arms to be “large” to the extent that it may potentially exceed even the horizon of play, i.e., $K\geqslant n$. For example, the problem faced by recommendation systems in large retail platforms, such as Amazon, is characterized by a prohibitively large number of arms (products of a certain type) and limited “display space,” creating a very challenging combinatorial problem (see, e.g., [Agrawal et al. (2019)]). Naturally, the canonical MAB model is ill-suited for the study of such settings. Among problems of this nature, a simple yet illuminating abstraction is one where an infinite population of arms is partitioned into $K$ different arm-types, each characterized uniquely by some reward statistic (e.g., the mean), and the fraction of each arm-type in the population of arms (or the arm-reservoir) remains fixed over the horizon of play. The motivation to study such settings stems from several contemporary applications, e.g., the prototypical task-matching problem arising in the online gig economy: the platform must choose, upon each task arrival, one agent from a large pool of available agents characterized by unknown (or only partially known) skill proficiencies. Such settings arise naturally in populations endowed with latent low-dimensional representations, i.e., an agent can only belong to one of finitely many possible types, each characterized distinctly by some attribute. Market segmentation based on types is also central to the operations research literature; see, e.g., [Bui et al. (2012), Banerjee et al. (2017), Johari et al. (2021)] for examples involving analyses of online service and recommendation systems, among several other areas.

The countable-armed bandit problem. We provide an abstraction of the aforementioned decision problem as a bandit with countably many arms, each queried from an infinite population of arms (aka the arm-reservoir). There are $K$ possible arm-types given by $\mathcal{K}:=\{1,...,K\}$, with $K$ known a priori, and the probability vector $\boldsymbol{\alpha}=\left\{\alpha_{i}:i\in\mathcal{K}\right\}$ denotes their corresponding fractions, i.e., relative prevalence in the reservoir, which is unknown. (The assumption of a known $K$ is mild: it is routine for platforms to run pilot experiments during initial rounds to gather information on key primitives such as the size and stability of clusters, if any exist in the population; one can therefore safely assume, in settings where such clusters strongly exist, that the number $K$ of possible types is accurately estimated.)

Intuitively, the statistical complexity of regret minimization in the simplest formulation of the countable-armed bandit (CAB) with $K=2$ arm-types is governed by three principal primitives: (i) the sub-optimality gap $\underline{\Delta}:=\mu_{1}-\mu_{2}$ between the mean rewards $\mu_{1}>\mu_{2}$ of the optimal and inferior arm-types; (ii) the probability $\alpha_{1}$ of sampling from the population an arm of the optimal type; and (iii) the horizon of play $n$. Absent knowledge of $\left(\underline{\Delta},\alpha_{1}\right)$, exploration is challenging owing to the “large” number of arms. In particular, in contrast with the classical two-armed MAB, in the CAB problem any finite selection of arms may contain only arms of mean $\mu_{2}$. Consequently, any algorithm limited to such a consideration set will suffer linear regret. Absence of information on $\alpha_{1}$ further exacerbates the challenges in the study of the CAB problem.

Contributions. There has been limited technical development in this area and the literature remains sparse. In this work, we resolve several foundational questions pertaining to the complexity of this problem setting and provide a comprehensive understanding of various other aspects thereof. Our theoretical contributions can be projected along the following axes:

(i) Complexity of regret. We establish information-theoretic performance lower bounds that are order-wise tight (in the horizon $n$) in the instance-dependent setting (Theorem 3.2). In the instance-independent (minimax) setting, we answer affirmatively an open question on the achievability of $\tilde{\mathcal{O}}\left(\sqrt{n}\right)$ regret when $K=2$ and show that this order is best achievable up to poly-logarithmic factors in $n$ (Theorem 4.1). In addition, we provide a uniform lower bound on achievable performance that is tight in $n$ and explicitly captures the scaling behavior w.r.t. the fraction $\alpha_{1}$ of optimal arms, and furthermore, has a novel non-information-theoretic proof based entirely on convex analysis (Theorem 3.3).

(ii) Algorithm design. We design algorithms that achieve the aforementioned regret guarantees relying only on knowledge of $K$, and are agnostic to information pertaining to the reward distributions as well as the frequency of occurrence of different types. Our design principles (Algorithms 1 and 2) are functionally distinct from extant work on finite-armed bandits, which is reflected in a fundamentally different scaling of regret (see Theorems 4.1 and 4.3). We also resolve an outstanding design issue in the extant literature for $K=2$ (see Algorithm 3 and Theorem 4.4).

(iii) Regret behavior and arm-type distribution. In contrast to observations on related models in the literature that exhibit a higher-than-$\log n$ (instance-dependent) regret behavior w.r.t. $\boldsymbol{\alpha}$, we establish that when the learner has knowledge of $K$ but not of $\boldsymbol{\alpha}$, one can still achieve $\mathcal{O}\left(\log n\right)$ regret where the dependence on $\boldsymbol{\alpha}$ only manifests as an additive loss (Theorems 4.3 and 4.4).

Before proceeding with a formal description of our model, we provide a brief overview of related works below.

Extant literature on bandits with infinitely many arms. These problems involve an infinite population of arms and a fixed reservoir distribution over a (typically uncountable) set of arm-types; a common reward statistic (usually the mean) uniquely characterizes each arm-type. The infinite-armed bandit problem traces its roots to [Berry et al. (1997)], where the problem was studied under Bernoulli rewards and a reservoir distribution of Bernoulli parameters that is Uniform on $[0,1]$. Subsequent works have considered more general reward and reservoir distributions on $[0,1]$; see, e.g., [Wang et al. (2009), Bonald and Proutiere (2013), Carpentier and Valko (2015), Chan and Hu (2018)]. In terms of the statistical complexity of regret minimization, an uncountably rich set of arm-types is tantamount to the minimal achievable regret being polynomial in the horizon of play (see the aforementioned references). In contrast, the recently studied models in [Kalvit and Zeevi (2020), de Heide et al. (2021)], to which our work is most closely related, are fundamentally simpler owing to a finite set of arm-types; this is central to the achievability of logarithmic (instance-dependent) regret in this class of problems. These two works are briefly discussed below.

The CAB problem first appeared in [Kalvit and Zeevi (2020)], together with an online adaptive policy achieving $\mathcal{O}\left(\log n\right)$ regret when $K=2$. This policy is derived from UCB1 (Auer et al., 2002) and relies on certain newly discovered concentration and convergence properties thereof (see Kalvit and Zeevi (2021) for a detailed discussion of said properties). However, the analysis of this policy cannot be adapted to $K>2$ types; we will later provide an example with $K>2$ where the policy will likely run into issues that can be effectively mitigated by the algorithms proposed in this paper. There is also recent literature (de Heide et al., 2021) on a related setting where the set of inferior arm-types may be arbitrary as long as it is $\underline{\Delta}$-separated from the optimal mean. However, ex ante knowledge of the proportion $\alpha_{1}$ of optimal arms is necessary to achieve logarithmic regret in this setting. This aspect distinguishes their setting from CAB and will be discussed at length later.

Lastly, a formulation of the countable-armed problem based on pure exploration, referred to as the “heaviest coin identification problem,” was studied in Chandrasekaran and Karp (2014) for $K=2$ (see Jamieson et al. (2016) for subsequent developments). In contrast, our problem is based on optimization of cumulative payoff (or regret); as a result, it shares little similarity with the cited works.

Outline of the paper. A formal description of the model is provided in §2; §3 discusses lower bounds on achievable performance. We propose our algorithms in §4 and state supporting theoretical guarantees. Numerical experiments, proofs, and auxiliary results are relegated to the appendix.

2 Problem formulation

The set of arm-types is given by $\mathcal{K}=\left\{1,...,K\right\}$, and the decision maker (DM) only knows the cardinality $K$ of $\mathcal{K}$. Each type $i\in\mathcal{K}$ is characterized by a unique mean reward $\mu_{i}$; the reservoir is thus characterized by the collection $\boldsymbol{\mu}:=\left\{\mu_{i}:i\in\mathcal{K}\right\}$ of possible mean rewards. Without loss of generality, we assume $\mu_{1}>...>\mu_{K}$ and refer to type $1$ as the optimal type (we may refer to the others as inferior types). Define $\bar{\Delta}:=\mu_{1}-\mu_{K}$ and $\underline{\Delta}:=\mu_{1}-\mu_{2}$ as the maximal and minimal sub-optimality gaps respectively, and $\delta:=\min_{1\leqslant i<j\leqslant K}\left(\mu_{i}-\mu_{j}\right)$ as the minimal reward gap. Finally, $\boldsymbol{\alpha}:=\left(\alpha_{i}:i\in\mathcal{K}\right)$ denotes the vector of reservoir probabilities for each type (aka the reservoir distribution), coordinate-wise bounded away from $0$. These primitives will be important drivers of the statistical complexity of the regret minimization problem, as we shall later see. The horizon of play is $n$, and the DM must play one arm at each time $t\in\{1,...,n\}$.

The set of arms that have been played up to and including time $t\in\{1,2,...\}$ is denoted by $\mathcal{I}_{t}$ (with $\mathcal{I}_{0}:=\phi$). The set of actions available to the DM at time $t$ is given by $\mathcal{A}_{t}=\mathcal{I}_{t-1}\cup\{\text{new}_{t}\}$; the DM must either play an arm from $\mathcal{I}_{t-1}$ at time $t$ or select the action “$\text{new}_{t}$,” which corresponds to querying (and playing) a new arm from the reservoir. This new arm is optimal-typed with probability $\alpha_{1}$ and sub-optimal otherwise. The DM is oblivious to $\boldsymbol{\alpha}$, and furthermore precluded from observing the type of an arm upon query or play. A policy $\pi:=\left(\pi_{1},\pi_{2},...\right)$ is an adaptive allocation rule that prescribes at time $t$ an action $\pi_{t}$ from $\mathcal{A}_{t}$ (possibly randomized). Each pull (or play) of an arm results in a stochastic reward. The sequence of rewards realized from the first $k$ pulls of an arm labeled $i$ (henceforth called arm $i$) is denoted by $\left(X_{i,j}\right)_{1\leqslant j\leqslant k}$; these are mean-preserving in time keeping the arm fixed, independent across arms and time, and take values in $[0,1]$. The natural filtration at time $t$, denoted by $\mathcal{F}_{t}$ and defined w.r.t. the sequence of rewards realized up to and including time $t$, is given by $\mathcal{F}_{t}:=\sigma\left\{\left(\pi_{s}\right)_{1\leqslant s\leqslant t},\left\{\left(X_{i,j}\right)_{1\leqslant j\leqslant N_{i}(t)}:i\in\mathcal{I}_{t}\right\}\right\}$ (with $\mathcal{F}_{0}:=\phi$), where $N_{i}(t)$ denotes the number of pulls of arm $i$ up to and including time $t$. The cumulative pseudo-regret of policy $\pi$ after $n$ plays is given by $R_{n}^{\pi}:=\sum_{t=1}^{n}\left(\mu_{1}-\mu_{\mathcal{T}\left(\pi_{t}\right)}\right)$, where $\mathcal{T}\left(\pi_{t}\right)\in\mathcal{K}$ denotes the type of the arm played by $\pi$ at time $t$; note that $R_{n}^{\pi}$ is a sample path-dependent quantity. The DM is interested in the classical problem of minimizing the expectation of the cumulative regret $\hat{R}_{n}^{\pi}:=\sum_{t=1}^{n}\left(\mu_{1}-X_{\pi_{t},N_{\pi_{t}}(t)}\right)$, given by

$$\inf_{\pi\in\Pi}\mathbb{E}\hat{R}_{n}^{\pi}=\inf_{\pi\in\Pi}\mathbb{E}\left[\sum_{t=1}^{n}\left(\mu_{1}-X_{\pi_{t},N_{\pi_{t}}(t)}\right)\right]\underset{\mathrm{({\dagger})}}{=}\inf_{\pi\in\Pi}\mathbb{E}\left[\sum_{t=1}^{n}\left(\mu_{1}-\mu_{\mathcal{T}\left(\pi_{t}\right)}\right)\right]=\inf_{\pi\in\Pi}\mathbb{E}R_{n}^{\pi},\qquad(1)$$

where $({\dagger})$ follows from the Tower property of expectation, and the infimum is over policies satisfying the non-anticipation property $\pi_{t}:\mathcal{F}_{t-1}\to\mathcal{P}\left(\mathcal{A}_{t}\right)$ for $t\in\{1,2,...\}$; $\mathcal{P}\left(\mathcal{A}_{t}\right)$ denotes the probability simplex on $\mathcal{A}_{t}$. Accordingly, the expectation in (1) is w.r.t. all the possible sources of randomness in the problem (rewards, policy, and the arm-reservoir).
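To make the interaction protocol concrete, the following is a minimal Python sketch of the reservoir dynamics and regret accounting described above, under the illustrative assumptions of Bernoulli rewards and a hypothetical `CABEnvironment` interface (neither is part of the formal model; querying and the first play of a new arm are also separated for simplicity). Later sketches in §4 reuse this interface.

```python
import numpy as np

class CABEnvironment:
    """Minimal simulator of the countable-armed bandit protocol (illustrative only).

    query_new_arm() draws a fresh arm whose latent type is sampled from the
    reservoir distribution alpha; pull(arm_id) returns a Bernoulli reward with
    that arm's type-mean and tracks pseudo-regret against the optimal mean mu[0].
    """

    def __init__(self, mu, alpha, rng=None):
        self.mu = np.asarray(mu, dtype=float)        # mean rewards, mu[0] > ... > mu[K-1]
        self.alpha = np.asarray(alpha, dtype=float)  # reservoir probabilities (sum to 1)
        self.rng = rng or np.random.default_rng()
        self.arm_types = []                          # latent types; never revealed to the DM
        self.regret = 0.0

    def query_new_arm(self):
        """Action "new_t": query the reservoir and return the id of a brand-new arm."""
        self.arm_types.append(int(self.rng.choice(len(self.mu), p=self.alpha)))
        return len(self.arm_types) - 1

    def pull(self, arm_id):
        """Play a previously queried arm; returns a Bernoulli reward."""
        arm_type = self.arm_types[arm_id]
        self.regret += self.mu[0] - self.mu[arm_type]
        return float(self.rng.random() < self.mu[arm_type])

# Example: K = 2 types with means (0.7, 0.5) and optimal-type fraction alpha_1 = 0.3.
env = CABEnvironment(mu=[0.7, 0.5], alpha=[0.3, 0.7])
arm = env.query_new_arm()
reward = env.pull(arm)
```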

3 Lower bounds for natural policy classes

There are three fundamental primitives governing the complexity of achievable regret in this setting, viz., (i) the minimal sub-optimality gap $\underline{\Delta}$; (ii) the proportion $\alpha_{1}$ of the optimal arm-type in the reservoir; and (iii) the number of arm-types $K$. We next characterize lower bounds on achievable performance w.r.t. each of these primitives.

3.1 Achievable performance w.r.t. $\underline{\Delta}$: Information-theoretic lower bounds

The statistical complexity of this problem setting is best illustrated via the paradigmatic case of $K=2$ and $\alpha_{1}\leqslant 1/2$. In this case, one anticipates the problem to be at least as hard as the classical two-armed bandit with a mean reward gap of $\underline{\Delta}$. Indeed, we establish this in Theorem 3.2 via information-theoretic reductions adapted to handle a countable number of arms (the proof is provided in Appendix C). In what follows, an instance $\nu$ of the problem refers to a collection of reward distributions with gap $\underline{\Delta}$; note that we exclude the reservoir probabilities $\boldsymbol{\alpha}=\left(\alpha_{1},\alpha_{2}\right)$ from the definition of an instance. Recall that $\alpha_{1}$ denotes the reservoir probability associated with the optimal mean reward, and $\alpha_{2}=1-\alpha_{1}$ is the probability of the inferior one. We will slightly overload the notation for expected cumulative regret to emphasize its dependence on $\nu$ as well as $\boldsymbol{\alpha}$.

Definition 3.1 (Admissible policies when $K=2$).

A policy $\pi$ is deemed admissible if for any instance $\nu$, reservoir distributions $\boldsymbol{\alpha}=\left(\alpha_{1},1-\alpha_{1}\right)$, $\boldsymbol{\alpha}^{\prime}=\left(\alpha_{1}^{\prime},1-\alpha_{1}^{\prime}\right)$, and horizon $n$, one has $\mathbb{E}R_{n}^{\pi}\left(\nu,\boldsymbol{\alpha}^{\prime}\right)\geqslant\mathbb{E}R_{n}^{\pi}\left(\nu,\boldsymbol{\alpha}\right)$ whenever $\alpha_{1}^{\prime}\leqslant\alpha_{1}$. The set of such policies is denoted by $\Pi_{\texttt{adm}}$.

We remark that the aforementioned definition is not restrictive in our problem setting since it is only natural that any reasonable policy should incur a larger cumulative regret (in expectation) in problems where the reservoir holds fewer optimal arms (in proportion).

Theorem 3.2 (Information-theoretic lower bounds when $K=2$).

There exists an absolute constant $C>0$ such that the following holds under any $\pi\in\Pi_{\texttt{adm}}$ and any $\boldsymbol{\alpha}$ with $\alpha_{1}\leqslant 1/2$:

  1. For any $\underline{\Delta}>0$, there exists a problem instance $\nu$ such that $\mathbb{E}R_{n}^{\pi}\left(\nu,\boldsymbol{\alpha}\right)\geqslant C\log n/\underline{\Delta}$ for large enough $n$, where the “large enough $n$” depends exclusively on $\underline{\Delta}$.

  2. For any $n\in\mathbb{N}$, there exists a problem instance $\nu$ such that $\mathbb{E}R_{n}^{\pi}\left(\nu,\boldsymbol{\alpha}\right)\geqslant C\sqrt{n}$.

Distinction from classical MAB. Although the above result bears resemblance to classical information-theoretic lower bounds for finite-armed bandits, it is imperative to note that the setting has a fundamentally greater complexity that requires a more nuanced analysis vis-à-vis the finite-armed problem. Traditional proofs, as a result, cannot be tailored to our setting in a translational manner. To see this, note that when $\alpha_{1}$ is high, a query of the reservoir is very likely to return an arm of the optimal type; in the limit as $\alpha_{1}\to 1$, the problem becomes degenerate as all policies incur zero expected regret. Clearly, the problem cannot be harder than a two-armed bandit with gap $\underline{\Delta}$ uniformly over all values of $\alpha_{1}$. While we conjecture $\alpha_{1}<1$ to be a sufficient condition for the existence of $\Omega\left(\log n/\underline{\Delta}\right)$ instance-dependent and $\Omega\left(\sqrt{n}\right)$ instance-independent (minimax) lower bounds, there are technical challenges due to probabilistic type associations over countably many arms. The restriction to $\alpha_{1}\leqslant 1/2$ and admissible policies (Definition 3.1) is then necessary for tractability of the proof, and it remains unclear if this can be generalized further. Furthermore, when $K>2$, even defining an appropriate notion of admissibility à la Definition 3.1 is non-trivial and will likely involve dependencies on $\boldsymbol{\mu}$ in addition to $\boldsymbol{\alpha}$; pursuits in this direction are currently left to future work.

3.2 Achievable performance w.r.t. $\alpha_{1}$: A uniform lower bound for front-loaded policies

Though the bounds in Theorem 3.2 are tight in $n$, as we shall later see, they fail to provide any actionable insights w.r.t. $\boldsymbol{\alpha}$. A natural question in the CAB setting is whether, and in what manner, the presence of countably many arms affects achievable regret; in particular, how does the difficulty associated with finding optimal arms from the reservoir (and the dependence on the distribution $\boldsymbol{\alpha}$) come into play? Below, we propose a lower bound that explicitly captures this dependence, albeit with respect to a somewhat restricted policy class.

Theorem 3.3 ($\boldsymbol{\alpha}$-dependent lower bound for any $K\geqslant 2$).

Denote by $\tilde{\Pi}$ the class of policies under which the decision to query the arm-reservoir at any time $s\in\{1,2,...\}$ is independent of $\mathcal{F}_{s-1}$. Then, for all problem instances $\nu$ with a minimal sub-optimality gap of at least $\underline{\Delta}>0$, one has

$$\liminf_{n\to\infty}\inf_{\pi\in\tilde{\Pi}}\frac{\mathbb{E}R_{n}^{\pi}\left(\nu,\boldsymbol{\alpha}\right)}{\log n}\geqslant\frac{\left(1-\alpha_{1}\right)^{2}\underline{\Delta}}{\alpha_{1}}.$$

Discussion. The proof is located in Appendix D. Several comments are in order. (i) The class $\tilde{\Pi}$, in particular, includes policies that front-load queries, i.e., query the reservoir upfront for a pre-specified number of arms and then deploy a regret minimizing routine on the queried set until the end of the playing horizon; see, e.g., the Sampling-UCB algorithm due to de Heide et al. (2021). (ii) The cited paper also derives an information-theoretic $\Omega\left(\log n/\alpha_{1}\underline{\Delta}\right)$ lower bound based on a standard reduction to a hypothesis testing problem, although notably their setting is non-trivially distinct from ours (this is reflected in starkly different sensitivities of achievable regret to the $1/\alpha_{1}$-scale, as we shall later see). Importantly though, akin to Theorem 3.2, their bound too establishes only the existence of an instance with logarithmic regret. In contrast, the foremost noticeable aspect of Theorem 3.3 that differs from the aforementioned results is that it provides a uniform lower bound over all instances that are at least $\underline{\Delta}$-separated in the mean reward, as opposed to merely establishing their existence. (iii) The presence of $\underline{\Delta}$ in the numerator (unlike traditional bounds, where $\underline{\Delta}$ resides in the denominator) suggests that while this bound may be vacuous if $\underline{\Delta}$ is “small,” it becomes most relevant when $\underline{\Delta}$ is “well-separated.” At the same time, it should be noted that Theorem 3.3 does not contradict the $\mathcal{O}\left(\log n/\alpha_{1}\underline{\Delta}\right)$ upper bound (up to logarithmic factors in $\underline{\Delta}$) of Sampling-UCB; it merely provides a tool to separate regimes of $\underline{\Delta}$ where one bound captures the dominant effects vis-à-vis the other. (iv) A novelty of Theorem 3.3 lies in its proof, which differs from classical lower bound proofs in that it is based entirely on ideas from convex analysis, as opposed to the information-theoretic and change-of-measure techniques hitherto used in the literature.

Remarks. (i) It is, in fact, possible to avoid the $1/\alpha_{1}$-scaling of the instance-dependent logarithmic regret. We will later show via an upper bound for one of our algorithms (ALG2$(n)$) that the $\alpha_{1}$-dependence can be relegated to constant order terms (ALG2$(n)$ queries arms adaptively based on sample-history and therefore does not belong to $\tilde{\Pi}$). Importantly, this will somewhat surprisingly establish that the instance-dependent logarithmic bound in Theorem 3.2 is optimal w.r.t. its dependence on $\alpha_{1}$ (the scaling w.r.t. $\underline{\Delta}$, however, may not be best possible, as forthcoming upper bounds suggest). (ii) Theorem 3.3 also holds for any arm-reservoir where the optimal type is at least $\underline{\Delta}$-separated from the rest, the nature of the types (countable or uncountable) notwithstanding.

3.3 Achievable performance w.r.t. $K$: The Bandit and the Coupon-collector

In the classical $K$-armed bandit problem, the (instance-dependent) regret scales linearly with the number of arms. We will next show that the $K$-typed countable-armed setting studied in this paper differs from its $K$-armed counterpart on account of a fundamentally distinct scaling of regret w.r.t. $K$. We will illustrate this by pivoting to a full information setting with one-sample learning, i.e., a setting where the decision maker observes the mean reward of an arm immediately upon pulling it, but does not learn whether it is optimal. Under such a setting, the optimal policy $\pi^{\ast}$ for the $K$-armed problem will pull each of the $K$ arms once and subsequently commit the residual budget of play to the optimal arm, thus incurring a lifetime regret of $\mathbb{E}R_{\infty}^{\pi^{\ast}}=\sum_{i=2}^{K}\left(\mu_{1}-\mu_{i}\right)=\Theta(K)$. The optimal policy for the $K$-typed countable-armed setting will, analogously, keep querying new arms from the reservoir until it has collected one of each of the $K$ types, and will subsequently commit to the arm within said collection that has the best mean reward. In this case, regret will only accrue until the decision maker has pulled one arm of each type.

Theorem 3.4 (Regret scaling w.r.t. $K$).

In the full information setting, the lifetime regret of any policy under reservoir distribution $\boldsymbol{\alpha}$ and mean reward vector $\boldsymbol{\mu}$ is at least $\left(\sum_{i=2}^{K}\alpha_{i}\left(\mu_{1}-\mu_{i}\right)\right)K\log K$.

If the reservoir distribution remains non-degenerate w.r.t. the optimal type, i.e., the fraction $\alpha_{1}$ of optimal arms in the reservoir remains bounded away from $1$ as $K$ increases, it is ensured that $\sum_{i=2}^{K}\alpha_{i}\left(\mu_{1}-\mu_{i}\right)$ remains non-vanishing in $K$. Consequently, the lower bound in Theorem 3.4 grows as $\Omega\left(K\underline{\Delta}\log K\right)$.

This result establishes a fundamentally distinct scaling of regret w.r.t. $K$ in the countable-armed setting vis-à-vis the $K$-armed one (in the full information setting). When the true type of an arm is not immediately observable, one only expects the $\Omega\left(K\log K\right)$ scaling to exacerbate. In fact, when the learning horizon is $n$, we conjecture that regret grows at least as $\Omega\left(K\log K\log n\right)$, where the $\Omega(\cdot)$ is modulo gap-dependent constants. Characterizing the information-theoretically optimal rate, however, remains a challenging open problem. The proof of Theorem 3.4 is provided in Appendix E.
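The coupon-collector effect behind Theorem 3.4 is easy to visualize numerically. The following is a small Monte Carlo sketch (an illustration, not part of the proof in Appendix E) of the full-information policy described above: query the reservoir until one arm of each of the $K$ types has been collected, tallying the regret accrued along the way.

```python
import numpy as np

def full_information_regret(mu, alpha, n_runs=10_000, rng=None):
    """Average lifetime regret of the full-information policy that queries the
    reservoir until every one of the K types has appeared (coupon collector)."""
    rng = rng or np.random.default_rng()
    mu, alpha = np.asarray(mu, float), np.asarray(alpha, float)
    K, total = len(mu), 0.0
    for _ in range(n_runs):
        seen = set()
        while len(seen) < K:
            t = int(rng.choice(K, p=alpha))   # type of the freshly queried arm
            total += mu[0] - mu[t]            # one pull of that arm; regret accrues
            seen.add(t)
    return total / n_runs

# Uniform reservoir over K equally spaced types: the empirical regret grows
# roughly like K * log(K) times the average sub-optimality gap, consistent
# with the lower bound in Theorem 3.4.
for K in (2, 4, 8, 16):
    mu = np.linspace(1.0, 0.5, K)
    print(K, round(full_information_regret(mu, np.ones(K) / K), 3))
```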

4 Gap and reservoir adaptive policies

As discussed, our goal here is to investigate the regret achievable under $\boldsymbol{\mu}$-adaptive algorithms that are also agnostic to ex ante information on the distribution $\boldsymbol{\alpha}$ of possible arm-types. We propose two algorithms, ALG1$(n)$ and ALG2$(n)$, both predicated on ex ante knowledge of the horizon of play $n$. §4.1 discusses the first of these, ALG1$(n)$, which uses knowledge of $n$ to calibrate the duration of its exploration phases. ALG1$(n)$ serves as an insightful basal motif for algorithm design in that it satisfies the desiderata of an $\mathcal{O}\left(\log n\right)$ instance-dependent regret for general $K\geqslant 2$ as well as an $\tilde{\mathcal{O}}\left(\sqrt{n}\right)$ instance-independent (minimax) regret when $K=2$; the latter property settles an open problem in the literature. However, its regret has a sub-optimal dependence on $\boldsymbol{\alpha}$. We leverage structural insights from the analysis of ALG1$(n)$ to explore another design, ALG2$(n)$, in §4.2, which determines its exploration phase lengths adaptively, as opposed to pre-specifying them upfront. This new design guarantees the best possible dependence of regret on $\boldsymbol{\alpha}$. Finally, in §4.3, we discuss a fully sequential adaptive algorithm from the extant literature for $K=2$, and propose a simple modification to rid it of a certain fragile assumption pertaining to ex ante knowledge of the support of the reward distributions. We also provide new, sharper bounds for the modified algorithm and discuss potential issues with its generalization to $K\geqslant 2$ vis-à-vis ALG1$(n)$ and ALG2$(n)$.

4.1 Explore-then-commit with a pre-specified exploration schedule

In what follows (and all subsequent algorithms), a new arm is one that is freshly queried from the reservoir (an arm without a history of previous pulls). This arm belongs to type $i$ with probability $\alpha_{i}$, independent of the problem history thus far (the collection of arms and types queried and the corresponding reward realizations until the current time).

Algorithm 1 ALG1$(n)$ (Fixed design ETC)
1: Input: Horizon of play $n$.
2: Set budget $T=n$; set epoch counter $k=1$.
3: Initialize new epoch: Query $K$ new arms; call it consideration set $\mathcal{A}=\{1,...,K\}$.
4: Set exploration duration $L=\left\lceil e^{2\sqrt{k}}\log n\right\rceil$.
5: $m\leftarrow\min\left(L,\left\lfloor T/K\right\rfloor\right)$.
6: Play each arm in $\mathcal{A}$ $m$ times; observe rewards $\left\{\left(X_{1,j},...,X_{K,j}\right):j=1,...,m\right\}$.
7: Update budget: $T\leftarrow T-Km$.
8: if $\exists\ a,b\in\mathcal{A}$, $a<b$ s.t. $\left\lvert\sum_{j=1}^{m}(X_{a,j}-X_{b,j})\right\rvert<2me^{-\sqrt{k}}$ then
9:     Permanently discard $\mathcal{A}$.
10:     $k\leftarrow k+1$.
11:     Repeat from step (3).
12: else
13:     Permanently commit to arm $a^{\ast}\in\arg\max_{a\in\mathcal{A}}\left\{\sum_{j=1}^{m}X_{a,j}\right\}$.

Discussion of Algorithm 1. The foremost noticeable feature of this algorithm is the (nearly) exponentially increasing exploration schedule. Specifically, in the $k$th epoch, each of the $K$ arms in the consideration set is played $\left\lceil e^{2\sqrt{k}}\log n\right\rceil$ times. It suffices to cap the size of the consideration set at $K$ since the decision maker is a priori aware of the existence of exactly $K$ arm-types in the reservoir. Upon completion of the $k$th epoch, the cumulative-difference-of-reward statistic for each of the ${K\choose 2}$ arm-pairs is compared against a threshold of $2e^{-\sqrt{k}}m$, where $2e^{-\sqrt{k}}$ should be imagined as a proxy for a lower bound on the minimal reward gap $\delta$. If said statistic is small relative to the threshold for some pair, the pair is likely to contain arms of the same type (equal means), in which case the algorithm discards the entire consideration set and ushers in a new epoch with a longer exploration phase. This is done to avoid the possibility of incurring linear regret should an optimal arm be missing from the consideration set (e.g., when all $K$ arms belong to type $2$). On the other hand, if all arm-pairs are sufficiently separated, the algorithm simply commits permanently to the empirically best arm. The intuition behind the (nearly) exponential schedule is that as $k$ grows, $2e^{-\sqrt{k}}$ will eventually provide a lower bound on $\delta$, and one may hope to achieve appropriate levels of error control using the window sizes in the $k$th epoch. The full proof is provided in Appendix F.
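For concreteness, the epoch logic of ALG1$(n)$ described above can be sketched in a few lines of Python; the sketch below assumes the illustrative `CABEnvironment` simulator from §2 and is not a reference implementation.

```python
import numpy as np
from itertools import combinations
from math import ceil, exp, floor, log, sqrt

def alg1(env, n, K):
    """Fixed-design ETC (Algorithm 1): epochs with pre-specified exploration lengths."""
    T, k = n, 1
    while T >= K:
        arms = [env.query_new_arm() for _ in range(K)]        # step 3: new consideration set
        L = ceil(exp(2 * sqrt(k)) * log(n))                   # step 4: exploration duration
        m = min(L, floor(T / K))                              # step 5
        rewards = np.array([[env.pull(a) for _ in range(m)] for a in arms])
        T -= K * m                                            # step 7
        sums = rewards.sum(axis=1)
        # Step 8: if some pair looks "too close", discard the set and start a new epoch.
        if any(abs(sums[i] - sums[j]) < 2 * m * exp(-sqrt(k))
               for i, j in combinations(range(K), 2)):
            k += 1
            continue
        best = arms[int(np.argmax(sums))]                     # step 13: commit
        for _ in range(T):
            env.pull(best)
        return
```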

Theorem 4.1 (Upper bound on the regret of ALG1$(n)$).

For a horizon of play $n\geqslant K$, the expected cumulative regret of the policy $\pi$ given by ALG1$(n)$ is bounded as

$$\mathbb{E}R_{n}^{\pi}\leqslant\frac{\tilde{C}_{\boldsymbol{\alpha}}\bar{\Delta}\log n}{\delta^{2}}\log^{2}\left(\frac{4}{\delta}\right)+2K\bar{\Delta},$$

where $\tilde{C}_{\boldsymbol{\alpha}}$ is a constant that depends only on $\boldsymbol{\alpha}$; an exact expression is provided in (13). In particular, $\tilde{C}_{\boldsymbol{\alpha}}\to\infty$ as $\prod_{i=1}^{K}\alpha_{i}$ approaches $0$.

Discussion. The dependence on the minimal reward gap $\delta$ in Theorem 4.1 is not an artifact of our analysis but, in fact, reflective of the operating principle of the algorithm. ALG1$(n)$ keeps querying new consideration sets of size $K$ until it determines with high enough confidence that the queried arms are all distinct-typed; this is the genesis of $\delta$ in the upper bound. Importantly, equipped just with knowledge of $K$, it remains unclear if there exists an alternative sampling strategy that does not rely on assessing pairwise similarities between the queried arms, without necessitating any additional information on $\boldsymbol{\alpha}$. Furthermore, while ALG1$(n)$ is evidently rate-optimal in $n$ (in the instance-dependent sense), the scaling of its upper bound w.r.t. $\boldsymbol{\alpha}$ is far from optimal. In particular, the $\boldsymbol{\alpha}$-dependent factor in the leading term is attributable to a naive pre-determined exploration schedule. This dependence can, in fact, be relegated to $\mathcal{O}(1)$ terms using a more sophisticated policy that operates based on an adaptive determination of stopping and re-initialization times.

Remarks. (i) When $K=2$, the upper bound in Theorem 4.1 reduces to $\left(\tilde{C}_{\boldsymbol{\alpha}}/\underline{\Delta}\right)\log^{2}\left(4/\underline{\Delta}\right)\log n+4\underline{\Delta}$, leading to a worst-case regret (w.r.t. $\underline{\Delta}$) of $\tilde{\mathcal{O}}\left(\tilde{C}_{\boldsymbol{\alpha}}\sqrt{n}\right)$, where the big-Oh only hides poly-logarithmic factors in $n$. This settles an open question concerning the best achievable minimax regret in the countable-armed problem with two types. (Minimax guarantees in previous work were polynomially bounded away from $\sqrt{n}$; see Kalvit and Zeevi (2020).) (ii) While specifying the exploration schedule, the choice of the exponent in $k$ can be fairly general as long as it is coercive and grows sufficiently fast but sub-linearly; the square-root function is chosen for technical convenience. Instead, if one were to use a linear function of $k$ in the exponent, the algorithm’s performance would become fragile w.r.t. ex ante knowledge of $\boldsymbol{\alpha}$; an ill-calibrated ALG1$(n)$ can potentially incur linear regret.

4.2 Explore-then-commit with an adaptive stopping time

Algorithm 2 ALG2$(n)$ (ETC with adaptive stopping times)
1: Input: Horizon of play $n$.
2: Set budget $T=n$; burn-in samples (per arm) $s_{n}$.
3: Initialize new epoch: Query $K$ new arms; call it consideration set $\mathcal{A}=\{1,...,K\}$.
4: Play each arm in $\mathcal{A}$ for $s_{n}$ periods; observe rewards $\left\{X_{a,j}:a\in\mathcal{A},\ j=1,...,s_{n}\right\}$.
5: Set per-arm sample count $m=s_{n}$.
6: Update budget: $T\leftarrow T-s_{n}K$.
7: Generate ${K\choose 2}$ independent standard Gaussian random variables $\left\{\mathcal{Z}_{a,b}:a,b\in\mathcal{A},\ a<b\right\}$.
8: while $T\geqslant K$ do
9:     if $\exists\ a,b\in\mathcal{A}$, $a<b$ s.t. $\left\lvert\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert<4\sqrt{m\log m}$ then
10:         Permanently discard $\mathcal{A}$ and repeat from step (3).
11:     else
12:         if $\left\lvert\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert\geqslant 4\sqrt{m\log n}\ \forall\ a,b\in\mathcal{A},\ a<b$ then
13:             Permanently commit to arm $a^{\ast}\in\arg\max_{a\in\mathcal{A}}\left\{\sum_{j=1}^{m}X_{a,j}\right\}$.
14:         else
15:             Play each arm in $\mathcal{A}$ once; observe rewards $\left\{X_{a,m+1}:a\in\mathcal{A}\right\}$.
16:             $m\leftarrow m+1$.
17:             $T\leftarrow T-K$.

Discussion of Algorithm 2. At any time, the algorithm computes two thresholds, $\mathcal{O}\left(\sqrt{m\log m}\right)$ and $\mathcal{O}\left(\sqrt{m\log n}\right)$, for the ${K\choose 2}$ pairwise difference-of-reward statistics, $m$ being the per-arm sample count. If said statistic is dominated by the former threshold for some arm-pair, the pair is likely to contain arms of the same type (equal means). The explanation stems from the Law of the Iterated Logarithm (see Durrett (2019), Theorem 8.5.2): a zero-drift length-$m$ random walk has its envelope bounded by $\mathcal{O}\left(\sqrt{m\log\log m}\right)$. In this scenario, the algorithm discards the entire consideration set and ushers in a new one. This is done to avoid the possibility of incurring linear regret should an optimal arm be missing from the consideration set. In the other scenario, where the difference-of-reward statistic dominates the larger threshold for all arm-pairs, the consideration set is likely to contain arms of distinct types (no two have equal means) and the algorithm simply commits to the empirically best arm. Lastly, if the difference-of-reward statistic lies between the two thresholds for some pair (signifying insufficient learning), the sample count for each arm is advanced by one, and the entire process repeats.

Reason for introducing zero-mean corruptions supported on $\mathbb{R}$. Centered Gaussian noise is added to the difference-of-reward statistic in step (9) of ALG2$(n)$ to avoid the possibility of incurring linear regret should the support of the reward distributions be a “very small” subset of $[0,1]$. To illustrate this point, suppose that $K=2$, $s_{n}=1$, and the rewards associated with the two types are deterministic with $\underline{\Delta}<2\sqrt{2\log 2}$. Then, as soon as the algorithm queries a heterogeneous consideration set (one arm optimal and the other inferior) and the per-arm sample count reaches $2$, the difference-of-reward statistic will satisfy $\left\lvert\sum_{j=1}^{2}\left(X_{1,j}-X_{2,j}\right)\right\rvert=2\underline{\Delta}<4\sqrt{2\log 2}$, resulting in the consideration set getting discarded. On the other hand, if the consideration set is homogeneous (both arms simultaneously optimal or inferior), the algorithm will still re-initialize as soon as the per-arm count reaches $2$ (in this case, $\left\lvert\sum_{j=1}^{m}\left(X_{1,j}-X_{2,j}\right)\right\rvert=0$ identically for any $m\in\mathbb{N}$, while $4\sqrt{m\log m}>0$ only for $m\geqslant 2$). This will force the algorithm to keep querying new arms from the reservoir at a rate that is linear in time, which is tantamount to incurring linear regret in the horizon. The addition of centered Gaussian noise hedges against this risk by guaranteeing that the difference-of-reward process essentially has infinite support at all times, even when the reward distributions might be degenerate. This rids the regret performance of its fragility w.r.t. the support of reward distributions. The next proposition crystallizes this discussion; the proof is provided in Appendix G.
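The following is a minimal Python sketch of ALG2$(n)$'s adaptive-stopping logic, including the Gaussian corruption discussed above; it again assumes the illustrative `CABEnvironment` simulator from §2 and is intended only to mirror the pseudocode, not to serve as the implementation analyzed in Appendix H.

```python
import numpy as np
from itertools import combinations
from math import log, sqrt

def alg2(env, n, K, s_n=1):
    """ETC with adaptive stopping times (Algorithm 2); assumes s_n * K <= n."""
    rng = np.random.default_rng()
    T = n
    while T >= s_n * K:
        arms = [env.query_new_arm() for _ in range(K)]                      # step 3
        sums = np.array([float(sum(env.pull(a) for _ in range(s_n))) for a in arms])
        m, T = s_n, T - s_n * K                                             # steps 5-6
        pairs = list(combinations(range(K), 2))
        Z = {p: rng.standard_normal() for p in pairs}                       # step 7
        while T >= K:
            # Step 9: corrupted statistic below the LIL-type threshold => same-type pair.
            if any(abs(Z[(i, j)] + sums[i] - sums[j]) < 4 * sqrt(m * log(m))
                   for i, j in pairs):
                break                                                       # step 10: discard set
            # Step 12: all pairs separated at the larger threshold => commit.
            if all(abs(sums[i] - sums[j]) >= 4 * sqrt(m * log(n)) for i, j in pairs):
                best = arms[int(np.argmax(sums))]
                while T > 0:
                    env.pull(best)
                    T -= 1
                return
            sums += np.array([float(env.pull(a)) for a in arms])            # steps 15-17
            m, T = m + 1, T - K
```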

Proposition 4.2 (Persistence of heterogeneous consideration sets).

Suppose $\left\{X_{a,j}:j=1,2,...\right\}$ is a collection of independent samples from an arm of type $a\in\{1,...,K\}=:\mathcal{A}$. Let $\left\{\mathcal{Z}_{a,b}:a,b\in\mathcal{A},\ a<b\right\}$ be a collection of ${K\choose 2}$ independently generated standard Normal random variables. Then,

$$\mathbb{P}\left(\bigcap_{m\geqslant 1}\bigcap_{a,b\in\mathcal{A},\,a<b}\left\{\left\lvert\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert\geqslant 4\sqrt{m\log m}\right\}\right)>\frac{\bar{\Phi}\left(f\left(T_{0}\right)\right)}{2}=:\beta_{\delta,K}>0,\qquad(2)$$

where $\bar{\Phi}(\cdot)$ is the right tail of the standard Normal CDF, and $T_{0}:=\max\left(\left\lceil\left(64/\delta^{2}\right)\log^{2}\left(\frac{64}{\delta^{2}}\right)\right\rceil,\Lambda_{K}\right)$ with $\Lambda_{K}:=\inf\left\{p\in\mathbb{N}:\sum_{m=p}^{\infty}\frac{1}{m^{8}}\leqslant\frac{1}{2K^{2}}\right\}$. Lastly, $f(x):=x+4\sqrt{x\log x}$ for all $x\geqslant 1$.

Interpretation of $\beta_{\delta,K}$. First of all, note that $\beta_{\delta,K}$ admits a closed-form characterization in terms of standard functions and satisfies $\beta_{\delta,K}>0$ for $\delta>0$ with $\lim_{\delta\to 0}\beta_{\delta,K}=0$. Secondly, $\beta_{\delta,K}$ depends exclusively on $\delta$ and $K$, and represents a lower bound on the probability that ALG2$(n)$ will never discard a consideration set containing arms of distinct types. This meta-result will be key to the upper bound on the regret of ALG2$(n)$ stated next in Theorem 4.3.

Theorem 4.3 (Upper bound on the regret of ALG2$(n)$).

For a horizon of play $n\geqslant K$ and a per-arm burn-in phase of $1\leqslant s_{n}\leqslant n/K$ samples, the expected cumulative regret of the policy $\pi$ given by ALG2$(n)$ is bounded as

$$\mathbb{E}R_{n}^{\pi}\leqslant\frac{CK^{3}\bar{\Delta}}{\gamma\left(s_{n}\right)}\left(\frac{\log n}{\delta^{2}}+\frac{s_{n}}{K!\prod_{i=1}^{K}\alpha_{i}}\right),$$

where $C$ is some absolute constant, and $\gamma\left(s_{n}\right)$ is as defined in (14). In particular, $\gamma(t)$ is monotone increasing in $t$ with $\gamma(t)\to 1$ as $t\to\infty$, and $\gamma(1)=\beta_{\delta,K}$, where the latter is as defined in (2).

Remarks. The dependence on $\delta$ in Theorem 4.3 is not incidental and has the same genesis as discussed in the context of Theorem 4.1. However, there is a prominent distinction from Theorem 4.1 in that the dependence on $\boldsymbol{\alpha}$ is captured exclusively through the constant term (as opposed to the logarithmic term). This should be viewed in light of the lower bound in Theorem 3.3; by allowing for policies that query the arm-reservoir adaptively, one can potentially make the regret performance robust w.r.t. $\boldsymbol{\alpha}$. The absence of $\boldsymbol{\alpha}$ from the leading term also leads to the somewhat remarkable conclusion that the lower bound in Theorem 3.2 is optimal w.r.t. its dependence on $\boldsymbol{\alpha}$. The proof is provided in Appendix H.

More on the inverse scaling w.r.t. $\gamma\left(s_{n}\right)$. This multiplicative factor is likely a consequence of the countable nature of the arms (as opposed to finite). When $K=2$, $\alpha_{1}\leqslant 1/2$, and the burn-in phase $s_{n}$ has a fixed duration independent of $n$, the upper bound in Theorem 4.3 reduces to $\mathcal{O}\left(\beta_{\underline{\Delta},2}^{-1}\left(\log n/\underline{\Delta}+\underline{\Delta}/\alpha_{1}\right)\right)$, where the big-Oh only hides absolute constants. Evidently, there is an inflation by $\beta_{\underline{\Delta},2}^{-1}$ relative to the optimal $\mathcal{O}\left(\log n/\underline{\Delta}\right)$ rate achievable in the paradigmatic two-armed bandit with gap $\underline{\Delta}$. By setting $s_{n}$ as a coercive sub-logarithmic function of the horizon $n$ (e.g., $s_{n}=\sqrt{\log n}$), one can shave off the $\beta_{\underline{\Delta},2}^{-1}$ factor and achieve $\mathcal{O}\left(\log n/\underline{\Delta}\right)$ regret. This establishes tightness of the instance-dependent lower bound in Theorem 3.2 when $K=2$. On the other hand, owing to the dependence of $\gamma\left(s_{n}\right)$ (and $\beta_{\underline{\Delta},2}$) on $\underline{\Delta}$, the worst-case (instance-independent) upper bound of ALG2$(n)$ can be observed from Theorem 4.3 to be bounded away from $\Omega\left(\sqrt{n}\right)$. However, recall that Theorem 4.1 already settles the issue of characterizing the optimal minimax rate when $K=2$ (up to logarithmic factors in $n$). Thus, we provide a complete characterization of the complexity of this problem when $K=2$, thereby answering all the open problems in Kalvit and Zeevi (2020). For $K>2$, Theorem 4.3 guarantees an upper bound of $\mathcal{O}\left(K^{3}\bar{\Delta}\log n/\delta^{2}\right)$ under a coercive sub-logarithmic burn-in phase. In this case, characterizing the optimal dependence on $\bar{\Delta}$ and $\delta$ remains an open problem. The scaling w.r.t. $K$, however, cannot be improved to $\mathcal{O}(K)$, as suggested by the $\Omega\left(K\log K\right)$ lower bound in Theorem 3.4. A full characterization of the complexity of the general setting with $K>2$ arm-types remains challenging and is left to future work.

4.3 Towards fully sequential adaptive strategies: Optimism in exploration

In this section, we revisit the UCB-based adaptive policy proposed in Kalvit and Zeevi (2020) for $K=2$. The policy is restated as ALG3 below after suitable modifications, for reasons discussed next. The original policy (Algorithm 2 in the cited paper) achieves an instance-dependent regret of $\mathcal{O}\left(\log n\right)$. Additionally, this algorithm enjoys the benefit of being anytime in $n$ owing to the use of UCB1 (Auer et al., 2002) as a subroutine. However, it suffers a major limitation through its dependence on ex ante knowledge of the support of reward distributions. In particular, the algorithm requires the reward distributions associated with the arm-types to have “full support” on $[0,1]$; e.g., only distributions such as Bernoulli$(\cdot)$, Uniform on $[0,1]$, Beta$(\cdot,\cdot)$, etc., are amenable to its performance guarantees, whereas Uniform on $[0,0.5]$ is not. We identify a simple fix for this issue: drawing inspiration from the design of ALG2$(n)$, we propose adding a centered Gaussian noise term to the difference-of-reward statistic (see step (6) of Algorithm 3) to essentially create an unbounded support. This rids the algorithm of fragility w.r.t. the reward support while also preserving the $\mathcal{O}\left(\log n\right)$ regret guarantee (see Theorem 4.4).

Remark. The original Algorithm 2 in the cited paper uses a threshold that is distinct from $4\sqrt{m\log m}$ (see step (6) of Algorithm 3); the choice of $4\sqrt{m\log m}$ here aims to unify the technical presentation with ALG2$(n)$ and facilitate a more transparent comparison between the corresponding upper bounds.

Algorithm 3 ALG3 (Nested UCB1 for $K=2$ types)
1: Initialize new epoch (resets clock $t\leftarrow 0$): Query two new arms; call it set $\mathcal{A}=\{1,2\}$.
2: Play each arm in $\mathcal{A}$ once; observe rewards $\left\{X_{a,1}:a\in\mathcal{A}\right\}$.
3: Minimum per-arm sample count $m\leftarrow 1$.
4: Generate a standard Gaussian random variable $\mathcal{Z}$.
5: for $t\in\{3,4,...\}$ do
6:     if $\left\lvert\mathcal{Z}+\sum_{j=1}^{m}\left(X_{1,j}-X_{2,j}\right)\right\rvert<4\sqrt{m\log m}$ then
7:         Permanently discard $\mathcal{A}$ and repeat from step (1).
8:     else
9:         Play arm $a_{t}\in\arg\max_{a\in\mathcal{A}}\left(\frac{\sum_{j=1}^{N_{a}(t-1)}X_{a,j}}{N_{a}(t-1)}+\sqrt{\frac{2\log(t-1)}{N_{a}(t-1)}}\right)$.
10:         Observe reward $X_{a_{t},N_{a_{t}}(t)}$.
11:         if $m<\min_{a\in\mathcal{A}}N_{a}(t)$ then
12:             $m\leftarrow m+1$.

Discussion of Algorithm 3. Similar to ALG2$(n)$, ALG3 also has an episodic dynamic with exactly one pair of arms played per episode. The distinction, however, resides in the fact that ALG3 plays arms according to UCB1 in every episode, as opposed to playing them equally often until committing to the empirically superior one. Secondly, unlike ALG2$(n)$, ALG3 never “commits” to an arm (or a consideration set). The implication is that the algorithm will keep querying new consideration sets throughout the playing horizon; this property is at the core of its anytime nature. Despite these differences, the performance guarantees of the two algorithms are essentially identical when $K=2$, as the next result illustrates. The proof is provided in Appendix J.
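A compact Python sketch of ALG3's episodic UCB1 dynamic is given below, once more using the illustrative `CABEnvironment` interface from §2; it mirrors the pseudocode above and is not the exact implementation underlying Theorem 4.4.

```python
import numpy as np
from math import log, sqrt

def alg3(env, n):
    """Nested UCB1 for K = 2 types (Algorithm 3), run for n total pulls."""
    rng = np.random.default_rng()
    pulls = 0                                        # global pull counter (budget n)
    while pulls < n:
        # Step 1: new epoch (epoch-local clock) with a freshly queried pair of arms.
        arms = [env.query_new_arm(), env.query_new_arm()]
        rewards = [[env.pull(arms[0])], [env.pull(arms[1])]]   # step 2
        pulls += 2
        m, Z = 1, rng.standard_normal()              # steps 3-4
        while pulls < n:
            # Step 6: corrupted pairwise statistic over the first m paired samples.
            diff = sum(rewards[0][j] - rewards[1][j] for j in range(m))
            if abs(Z + diff) < 4 * sqrt(m * log(m)):
                break                                # step 7: discard the pair, new epoch
            # Step 9: UCB1 over the current pair, on the epoch-local clock t - 1.
            t_minus_1 = len(rewards[0]) + len(rewards[1])
            ucb = [np.mean(r) + sqrt(2 * log(t_minus_1) / len(r)) for r in rewards]
            a = int(np.argmax(ucb))
            rewards[a].append(env.pull(arms[a]))
            pulls += 1
            m = min(len(rewards[0]), len(rewards[1]))    # steps 11-12
```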

Theorem 4.4 (Upper bound on the regret of ALG3 when $K=2$).

The expected cumulative regret of the policy $\pi$ given by ALG3 after any number of pulls $n\geqslant 2$ is bounded as

$$\mathbb{E}R_{n}^{\pi}\leqslant\frac{C}{\beta_{\underline{\Delta},2}}\left(\frac{\log n}{\underline{\Delta}}+\frac{\underline{\Delta}}{\alpha_{1}}\right),$$

where $\beta_{\underline{\Delta},2}$ is as defined in (2) with $\delta\leftarrow\underline{\Delta}$ and $K\leftarrow 2$, and $C$ is some absolute constant.

Remark. It is possible to shave off the $\beta_{\underline{\Delta},2}$ factor by introducing in ALG3 a horizon-dependent burn-in phase à la ALG2$(n)$. This may be achieved at the expense of ALG3’s anytime property.

Limitation of ALG3. The performance stated in Theorem 4.4, together with its anytime property, might appear to give an edge to ALG3 over ALG2$(n)$. However, the former is theoretically disadvantaged in that its logarithmic upper bound is not currently amenable to extensions to the general $K$-typed setting. The issue traces its roots to the use of UCB1 as a subroutine. The concentration behavior of UCB1 leveraged towards the analysis of ALG3 when $K=2$ fails to hold when $K>2$, rendering proofs intractable. This is illustrated via a simple example with $K=3$ types discussed below.

Technical issues with generalizing ALG3 to $K$ types. When $K=2$, there are only two possibilities for what a consideration set could be: the two arms can have means that are either (i) distinct, or (ii) equal. In the former case, an optimal arm is guaranteed to exist in the consideration set and UCB1 will spend the bulk of its sampling effort on it, which is good for regret performance. In the latter scenario, since the arms have equal means, UCB1 will split samples approximately equally between the two with high probability (see Theorem 4(i) in Kalvit and Zeevi (2020)); subsequently, the consideration set will be discarded within a finite number of samples in expectation (see steps (6) and (7) of ALG3). Contrast this with an alternative setting with $K=3$ and mean rewards $\mu_{1}>\mu_{2}>\mu_{3}$. A natural generalization of ALG3 (see ALG4 in Appendix B) will query consideration sets of size $3$. Thus, a query can potentially return one arm with mean $\mu_{2}$ and two with mean $\mu_{3}$. Since an optimal arm (mean $\mu_{1}$) is missing, the algorithm will incur linear regret on this set; it is therefore imperative to discard it at the earliest. Unfortunately though, UCB1 will invest an overwhelming majority of its sampling effort in the “locally optimal” arm (mean $\mu_{2}$) and allocate logarithmically fewer samples to the other two. This logarithmic rate of sampling of the arms with mean $\mu_{3}$ is proof-inhibiting (vis-à-vis the $K=2$ case where the rate is linear, as previously discussed), making it difficult to theoretically answer whether ALG4 might still be able to discard the arms within, say, logarithmically many pulls of the horizon. This is an open research question and, at the moment, an $\mathcal{O}\left(\log n\right)$ bound exists only for $K=2$; we could only establish asymptotic optimality ($o(n)$ regret) when $K>2$ (see Theorem B.1). Among other things, identifying the optimal (instance-dependent) scaling factors w.r.t. $\left(\boldsymbol{\mu},\boldsymbol{\alpha}\right)$ and the optimal order of minimax regret when $K>2$ remain open problems.
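The sampling imbalance described above is easy to observe numerically. The following self-contained sketch runs plain UCB1 on a fixed consideration set with means $(\mu_2,\mu_3,\mu_3)=(0.6,0.4,0.4)$, i.e., one from which the optimal type is missing; the Gaussian reward noise and all parameter values are illustrative choices.

```python
import numpy as np
from math import log

def ucb1_allocation(means, horizon, sigma=0.1, rng=None):
    """Run UCB1 on a fixed set of arms with Gaussian rewards; return pull counts."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    counts = np.zeros(K, dtype=int)
    sums = np.zeros(K)
    for t in range(1, horizon + 1):
        if t <= K:
            a = t - 1                                   # pull each arm once to start
        else:
            ucb = sums / counts + np.sqrt(2 * log(t) / counts)
            a = int(np.argmax(ucb))
        sums[a] += means[a] + sigma * rng.standard_normal()
        counts[a] += 1
    return counts

# A consideration set missing the optimal type (mu_1 = 0.8 is absent): the
# "locally optimal" arm (mean 0.6) typically absorbs the overwhelming share of
# pulls, while the two mean-0.4 arms are sampled only a vanishing fraction of the time.
print(ucb1_allocation([0.6, 0.4, 0.4], horizon=20_000))
```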

5 Concluding remarks and open problems

This paper provides a first-order characterization of the complexity of the $K$-typed countable-armed bandit problem, with matching lower and upper bounds for $K=2$. For $K>2$, we establish an instance-dependent upper bound of $\mathcal{O}\left(K^{3}\bar{\Delta}\log n/\delta^{2}\right)$ and show that the scaling w.r.t. $K$ cannot beat $\Omega\left(K\log K\right)$; the latter property differentiates this setting fundamentally from the classical $K$-armed problem. Another key takeaway from our work is that achievable regret in this setting has only a second-order dependence on the reservoir distribution, i.e., the dependence on $\boldsymbol{\alpha}$ only manifests through sub-logarithmic terms (see Theorems 4.3 and 4.4). Although this work is predicated on countably many arms, our algorithms can easily be adapted to settings with a large but finite number of arms. For example, the result on second-order dependence w.r.t. $\boldsymbol{\alpha}$ has profound implications for the $N$-armed bandit problem with $K$ arm-types, where each type is characterized by a unique mean reward. A naive implementation of standard MAB algorithms in this setting will result in a regret that scales linearly with $N$. Instead, one can simulate a $K$-typed reservoir over the $N$ arms and deploy ALG2$(n)$ to achieve an $\mathcal{O}\left(K^{3}\right)$ scaling of the leading term; if $K\ll N$, the performance improvement can be substantial vis-à-vis naive MAB algorithms. Another important direction concerns adaptivity to $K$: this paper provides algorithms that adapt to $\boldsymbol{\alpha}$ assuming perfect knowledge of $K$; performance characterization given only an approximation thereof remains an open problem.
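To illustrate the last adaptation, the following hedged sketch wraps $N$ finite arms as a $K$-typed reservoir by querying one of the $N$ arms uniformly at random (with replacement), so that $\alpha_i$ equals the fraction of type-$i$ arms among the $N$; the wrapper `FiniteArmReservoir` mirrors the illustrative `CABEnvironment` interface from §2 and can be fed to the `alg2` sketch from §4.2 (both are this paper's illustrative sketches, not a prescribed API).

```python
import numpy as np

class FiniteArmReservoir:
    """Wrap N finite arms as a reservoir: each query draws one of the N arms
    uniformly at random (with replacement). Interface mirrors CABEnvironment."""

    def __init__(self, arm_means, best_mean, rng=None):
        self.arm_means = np.asarray(arm_means, dtype=float)   # one mean per physical arm
        self.best_mean = best_mean                            # optimal type-mean (for regret)
        self.rng = rng or np.random.default_rng()
        self.queried = []                                     # physical index of each queried arm
        self.regret = 0.0

    def query_new_arm(self):
        self.queried.append(int(self.rng.integers(len(self.arm_means))))
        return len(self.queried) - 1

    def pull(self, arm_id):
        mean = self.arm_means[self.queried[arm_id]]
        self.regret += self.best_mean - mean
        return float(self.rng.random() < mean)                # Bernoulli reward

# N = 10,000 arms but only K = 3 distinct types; the wrapper can now be passed
# to the alg2 sketch from Section 4.2, e.g. alg2(env, n=50_000, K=3, s_n=5).
means = np.random.default_rng(1).choice([0.8, 0.6, 0.5], size=10_000)
env = FiniteArmReservoir(means, best_mean=0.8)
arm = env.query_new_arm()
reward = env.pull(arm)
```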

Acknowledgments

We thank the reviewers for their feedback and suggestions.

References

  • Agrawal et al. (2019) Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. MNL-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453–1485, 2019.
  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2-3):235–256, 2002.
  • Banerjee et al. (2017) Siddhartha Banerjee, Sreenivas Gollapudi, Kostas Kollias, and Kamesh Munagala. Segmenting two-sided markets. In Proceedings of the 26th International Conference on World Wide Web, pages 63–72, 2017.
  • Berry et al. (1997) Donald A Berry, Robert W Chen, Alan Zame, David C Heath, Larry A Shepp, et al. Bandit problems with infinitely many arms. The Annals of Statistics, 25(5):2103–2116, 1997.
  • Bonald and Proutiere (2013) Thomas Bonald and Alexandre Proutiere. Two-target algorithms for infinite-armed bandits with Bernoulli rewards. In Advances in Neural Information Processing Systems, pages 2184–2192, 2013.
  • Bui et al. (2012) Loc Bui, Ramesh Johari, and Shie Mannor. Clustered bandits. arXiv preprint arXiv:1206.4169, 2012.
  • Carpentier and Valko (2015) Alexandra Carpentier and Michal Valko. Simple regret for infinitely many armed bandits. In International Conference on Machine Learning, pages 1133–1141, 2015.
  • Chan and Hu (2018) Hock Peng Chan and Shouri Hu. Infinite arms bandit: Optimality via confidence bounds. arXiv preprint arXiv:1805.11793, 2018.
  • Chandrasekaran and Karp (2014) Karthekeyan Chandrasekaran and Richard Karp. Finding a most biased coin with fewest flips. In Conference on Learning Theory, pages 394–407. PMLR, 2014.
  • Chaudhuri and Kalyanakrishnan (2018) Arghya Roy Chaudhuri and Shivaram Kalyanakrishnan. Quantile-regret minimisation in infinitely many-armed bandits. In UAI, pages 425–434, 2018.
  • de Heide et al. (2021) Rianne de Heide, James Cheshire, Pierre Ménard, and Alexandra Carpentier. Bandits with many optimal arms. In Advances in Neural Information Processing Systems, volume 34, pages 22457–22469, 2021.
  • Durrett (2019) Rick Durrett. Probability: theory and examples, volume 49. Cambridge University Press, 2019.
  • Flajolet et al. (1992) Philippe Flajolet, Daniele Gardy, and Loÿs Thimonier. Birthday paradox, coupon collectors, caching algorithms and self-organizing search. Discrete Applied Mathematics, 39(3):207–229, 1992.
  • Hoeffding (1963) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.
  • Jamieson et al. (2016) Kevin Jamieson, Daniel Haas, and Ben Recht. On the detection of mixture distributions with applications to the most biased coin problem. arXiv preprint arXiv:1603.08037, 2016.
  • Johari et al. (2021) Ramesh Johari, Vijay Kamble, and Yash Kanoria. Matching while learning. Operations Research, 69(2):655–681, 2021.
  • Kalvit and Zeevi (2020) Anand Kalvit and Assaf Zeevi. From finite to countable-armed bandits. In Advances in Neural Information Processing Systems, volume 33, pages 8259–8269, 2020.
  • Kalvit and Zeevi (2021) Anand Kalvit and Assaf Zeevi. A closer look at the worst-case behavior of multi-armed bandit algorithms. In Advances in Neural Information Processing Systems, volume 34, pages 8807–8819, 2021.
  • Lai and Robbins (1985) Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. Bandit algorithms. Cambridge University Press, 2020.
  • Slivkins et al. (2019) Aleksandrs Slivkins et al. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning, 12(1-2):1–286, 2019.
  • Wang et al. (2009) Yizao Wang, Jean-Yves Audibert, and Rémi Munos. Algorithms for infinitely many-armed bandits. In Advances in Neural Information Processing Systems, pages 1729–1736, 2009.
  • Zhu and Nowak (2020) Yinglun Zhu and Robert Nowak. On regret with multiple best arms. Advances in Neural Information Processing Systems, 33:9050–9060, 2020.

Supplementary material: General organization

  1. Appendix A provides numerical experiments.
  2. Appendix B generalizes ALG3 to K>2K>2 and states performance guarantees thereof.
  3. Appendix C provides the proof of Theorem 3.2.
  4. Appendix D provides the proof of Theorem 3.3.
  5. Appendix E provides the proof of Theorem 3.4.
  6. Appendix F provides the proof of Theorem 4.1.
  7. Appendix G provides the proof of Proposition 4.2.
  8. Appendix H provides the proof of Theorem 4.3.
  9. Appendix I provides auxiliary results used in the analysis of ALG3.
  10. Appendix J provides the proof of Theorem 4.4.
  11. Appendix K provides auxiliary results used in the analysis of ALG4.
  12. Appendix L provides the proof of Theorem B.1.

Appendix A Numerical experiments

We evaluate the empirical performance of our algorithms for K=2K=2 and K=3K=3 on synthetic data.

Experiments. In what follows, the graphs show the performance of different algorithms simulated on synthetic data. The horizon is capped at n=105n=10^{5} for K=2K=2 and at n=104n=10^{4} for K=3K=3. Each regret trajectory is averaged over at least 100100 independent experiments (sample-paths). The shaded regions indicate standard 95%95\% confidence intervals. For horizon-dependent algorithms, regret is plotted for discrete values of the horizon nn indicated by “\ast” and interpolated; for anytime algorithms, regret accrued until each t{1,,n}t\in\{1,...,n\} is plotted.
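For concreteness, a minimal sketch of the aggregation step described above is given below; run_algorithm is a placeholder for any of the algorithms being compared, and the pointwise normal-approximation 95% interval is an assumption about how the shaded regions could be computed, not a description of the exact estimator used for the figures.

import numpy as np

def average_regret(run_algorithm, horizon, num_paths=100):
    """Average cumulative-regret trajectories over independent sample-paths.

    `run_algorithm(horizon)` is assumed to return an array of length `horizon`
    containing the cumulative regret after each round. Returns the pointwise
    mean and a normal-approximation 95% confidence-interval half-width.
    """
    paths = np.stack([run_algorithm(horizon) for _ in range(num_paths)])
    mean = paths.mean(axis=0)
    half_width = 1.96 * paths.std(axis=0, ddof=1) / np.sqrt(num_paths)
    return mean, half_width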

Baseline policies. We will benchmark the performance of our algorithms against two policies: (i) Sampling-UCB (de Heide et al., 2021), and (ii) ETC-(2)\infty(2) (Kalvit and Zeevi, 2020). The former is a UCB-styled policy based on front-loading exploration of new arms (Theorem 3.3 thus applies to this policy). It is, however, noteworthy that Sampling-UCB is predicated on ex ante knowledge of (a lower bound on) the probability α1\alpha_{1} of sampling an optimal arm from the reservoir; we reemphasize that this is not the setting of interest in our paper. Furthermore, its regret scales as 𝒪~(logn/(α1Δ¯))\tilde{\mathcal{O}}\left(\log n/\left(\alpha_{1}\underline{\Delta}\right)\right) (up to poly-logarithmic factors in 1/Δ¯1/\underline{\Delta}), which is inferior in terms of its dependence on α1\alpha_{1} relative to ALG2(n)(n) and ALG3 (see Theorems 4.3 and 4.4 respectively). There exist other algorithms as well (see, e.g., Chaudhuri and Kalyanakrishnan (2018); Zhu and Nowak (2020)) developed for formulations with a prohibitively large number of arms. However, these are either sensitive to certain parametric assumptions on the probability of sampling an optimal arm, or focus on a different notion of regret altogether; both directions remain outside the ambit of our setting.

The second policy ETC-(2)\infty(2) is a non-adaptive explore-then-commit-styled algorithm for reservoirs with K=2K=2 types; this policy requires ex ante knowledge of a lower bound on the difference between the two mean rewards. Although ETC-(2)\infty(2) was originally proposed only for K=2K=2, it is easily generalizable and we present in Algorithm 5 a version (ETC-(K)\infty(K)) that is adapted to KK types.

Setup 1 [Figures 1, 2 and 3]. In this setting, we consider K=2K=2 with α1=0.5\alpha_{1}=0.5, i.e., two equiprobable arm-types, characterized by Bernoulli(0.6)(0.6) and Bernoulli(0.4)(0.4) rewards. Via this setup, we intend to illustrate the difference between the empirical performance achievable in the countable-armed setting vis-à-vis its traditional two-armed counterpart. Refer to Figure 1. The red curve indicates the empirical performance of ALG3 in this setting. For reference, the blue one shows the empirical performance of UCB1 (Auer et al., 2002) in a two-armed bandit with Bernoulli(0.6) and Bernoulli(0.4) rewards; the green curve indicates the best achievable instance-dependent regret (Lai and Robbins, 1985) in said two-armed configuration. As expected, the regret of ALG3 is inflated relative to UCB1. This is owing to the βδ,21\beta_{\delta,2}\leqslant 1 factor present in the denominator of ALG3’s upper bound; characterization of the sharpest lower bound on the probability in (2) (see Proposition 4.2) is challenging owing to the limited theoretical tools available to this end, and we leave it as an open problem. Figure 2 shows the empirical performance of the algorithms proposed in this paper as well as Sampling-UCB initialized with α1=1/2\alpha_{1}=1/2 and ETC-(2)\infty(2) initialized with δ¯=δ/2=0.1\underline{\delta}=\delta/2=0.1. Evidently, the (adaptive) explore-then-commit approach in ALG2(n)(n) outperforms the pre-specified exploration schedule-based approach of ALG1(n)(n), and performs almost as well as the gap-aware approach in ETC-(2)\infty(2). While Sampling-UCB outmatches all explore-then-commit-styled approaches, the best-performing algorithm is ALG3. Surprisingly, this is despite the fact that the theoretical performance bounds for ALG2(n)(n) and ALG3 are identical (modulo numerical multiplicative constants) when K=2K=2 and α10.5\alpha_{1}\leqslant 0.5 (see Theorems 4.3 and 4.4). A similar hierarchy in performances is also observable in Figure 3, which corresponds to a slightly “easier” instance with δ=0.4\delta=0.4 (as opposed to 0.20.2) and equiprobable Bernoulli(0.9)(0.9) and Bernoulli(0.5)(0.5) rewards.

Figure 1: 𝑲=𝟐\boldsymbol{K=2} and 𝜶=(𝟏/𝟐,𝟏/𝟐)\boldsymbol{{\boldsymbol{\alpha}}=\left(1/2,1/2\right)}: Achievable regret in 2-CAB vis-à-vis 2-MAB.
Figure 2: 𝑲=𝟐\boldsymbol{K=2} and 𝜶=(𝟏/𝟐,𝟏/𝟐)\boldsymbol{{\boldsymbol{\alpha}}=\left(1/2,1/2\right)}: An instance with Bernoulli 0.6,0.40.6,0.4 rewards.

Setup 2 [Figure 4]. Here, we consider a setting with K=3K=3 arm-types characterized by Bernoulli rewards with means 0.9,0.5,0.10.9,0.5,0.1, each occurring with probability 1/31/3. We compare the performance of ALG1(n)(n), ALG2(n)(n) and ALG4 (the generalization of ALG3 to K2K\geqslant 2; see Appendix B) with ETC-(3)\infty(3) initialized with δ¯=δ/2=0.2\underline{\delta}=\delta/2=0.2, and Sampling-UCB initialized with α1=1/3\alpha_{1}=1/3. It is noteworthy that despite ALG4’s significantly superior empirical performance relative to the aforementioned algorithms, only a weak o(n)o(n) theoretical guarantee on its regret is currently available (see Theorem B.1) due to reasons discussed earlier in the paper. Investigating the best achievable rates under ALG4 remains an active area of research.

Figure 3: 𝑲=𝟐\boldsymbol{K=2} and 𝜶=(𝟏/𝟐,𝟏/𝟐)\boldsymbol{{\boldsymbol{\alpha}}=\left(1/2,1/2\right)}: An instance with Bernoulli 0.9,0.50.9,0.5 rewards.
Figure 4: 𝑲=𝟑\boldsymbol{K=3} and 𝜶=(𝟏/𝟑,𝟏/𝟑,𝟏/𝟑)\boldsymbol{{\boldsymbol{\alpha}}=\left(1/3,1/3,1/3\right)}: Instance with Bernoulli 0.9,0.5,0.10.9,0.5,0.1 rewards.

Appendix B More algorithms for reservoirs with 𝑲𝟐\boldsymbol{K\geqslant 2} arm-types

Algorithm 4 ALG4 (Nested UCB1 for 𝙺\mathtt{K} types)
1:Initialize new epoch (resets clock t0t\leftarrow 0): Query KK new arms; call it set 𝒜={1,,K}\mathcal{A}=\{1,...,K\}.
2:Play each arm in 𝒜\mathcal{A} once; observe rewards {Xa,1:a𝒜}\left\{X_{a,1}:a\in\mathcal{A}\right\}.
3:Minimum per-arm sample count m1m\leftarrow 1.
4:Generate (K2){K\choose 2} independent standard Gaussian random variables {𝒵a,b:a,b𝒜,a<b}\left\{\mathcal{Z}_{a,b}:a,b\in\mathcal{A},a<b\right\}.
5:for t{K+1,K+2,}t\in\{K+1,K+2,...\} do
6:     if \exists a,b𝒜a,b\in\mathcal{A}, a<ba<b s.t. |𝒵a,b+j=1m(Xa,jXb,j)|<4mlogm\left\lvert\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert<4\sqrt{m\log m} then
7:         Permanently discard 𝒜\mathcal{A} and repeat from step (1).
8:     else
9:         Play arm atargmaxa𝒜(j=1Na(t1)Xa,jNa(t1)+2log(t1)Na(t1))a_{t}\in\arg\max_{a\in\mathcal{A}}\left(\frac{\sum_{j=1}^{N_{a}(t-1)}X_{a,j}}{N_{a}(t-1)}+\sqrt{\frac{2\log(t-1)}{N_{a}(t-1)}}\right).
10:         Observe reward Xat,Nat(t)X_{a_{t},N_{a_{t}}(t)}.
11:         if m<mina𝒜Na(t)m<\min_{a\in\mathcal{A}}N_{a}(t) then
12:              mm+1m\leftarrow m+1.               
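To complement the pseudocode above, here is a minimal Python sketch of ALG4. The reservoir interface (a function new_arm() returning a fresh arm, itself a zero-argument function producing i.i.d. rewards in [0,1]) and the finite stopping horizon are assumptions introduced purely so that the sketch runs; the Bernoulli example at the end is illustrative rather than taken from the paper.

import math
import random

def alg4(new_arm, K, horizon):
    # `new_arm()` is assumed to return a fresh arm from the reservoir, i.e. a
    # zero-argument function producing i.i.d. rewards in [0, 1].
    history = []                                   # collected rewards (illustrative)
    t_total = 0
    while t_total < horizon:
        # Steps 1-4: new epoch -- query K new arms, play each once, and draw
        # the (K choose 2) standard-Gaussian dithers.
        arms = [new_arm() for _ in range(K)]
        samples = [[arm()] for arm in arms]        # per-arm reward lists
        history.extend(s[0] for s in samples)
        t_total += K
        m = 1                                      # minimum per-arm sample count
        Z = {(a, b): random.gauss(0.0, 1.0)
             for a in range(K) for b in range(a + 1, K)}
        t = K                                      # epoch clock
        while t_total < horizon:
            # Steps 6-7: discard the epoch if some pair looks indistinguishable.
            threshold = 4.0 * math.sqrt(m * math.log(m)) if m > 1 else 0.0
            if any(abs(z + sum(samples[a][j] - samples[b][j] for j in range(m)))
                   < threshold for (a, b), z in Z.items()):
                break                              # permanently discard; start a new epoch
            # Steps 9-10: play the UCB1 leader among the K arms of this epoch.
            t += 1
            ucb = [sum(s) / len(s) + math.sqrt(2.0 * math.log(t - 1) / len(s))
                   for s in samples]
            a_t = max(range(K), key=lambda a: ucb[a])
            samples[a_t].append(arms[a_t]())
            history.append(samples[a_t][-1])
            t_total += 1
            # Steps 11-12: update the common sample count m.
            if m < min(len(s) for s in samples):
                m += 1
    return history

# Example: K = 2 equiprobable types with Bernoulli(0.6) / Bernoulli(0.4) rewards.
def bernoulli_reservoir():
    p = random.choice([0.6, 0.4])
    return lambda: 1.0 if random.random() < p else 0.0

rewards = alg4(bernoulli_reservoir, K=2, horizon=10000)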
Theorem B.1 (Upper bound on the regret of ALG4).

The expected cumulative regret of the policy π\pi given by ALG4 after any number n1n\geqslant 1 of plays is bounded as

𝔼RnπCKβδ,K(lognΔ¯+Δ¯)+o(Δ¯nβδ,Ki=1Kαi),\displaystyle\mathbb{E}R_{n}^{\pi}\leqslant\frac{CK}{\beta_{\delta,K}}\left(\frac{\log n}{\underline{\Delta}}+\bar{\Delta}\right)+o\left(\frac{\bar{\Delta}n}{\beta_{\delta,K}\prod_{i=1}^{K}\alpha_{i}}\right),

where CC is some absolute constant, βδ,K\beta_{\delta,K} is as defined in (2), and the little-Oh is asymptotic in nn and only hides multiplicative factors in KK.

The proof is provided in Appendix L.

Algorithm 5 ETC-\infty(K)
1:Input: (i) Horizon of play nn, (ii) A lower bound δ¯(0,δ]\underline{\delta}\in\left(0,\delta\right] on the minimal reward gap δ\delta.
2:Set budget T=nT=n.
3:Initialize new epoch: Query KK new arms; call it consideration set 𝒜={1,,K}\mathcal{A}=\{1,...,K\}.
4:Set exploration duration L=2δ¯2lognL=\left\lceil 2\underline{\delta}^{-2}\log n\right\rceil.
5:mmin(L,T/K)m\leftarrow\min\left(L,\left\lfloor T/K\right\rfloor\right).
6:Play each arm in 𝒜\mathcal{A} mm times; observe rewards {(X1,j,,XK,j):j=1,,m}\left\{\left(X_{1,j},...,X_{K,j}\right):j=1,...,m\right\}.
7:Update budget: TTKmT\leftarrow T-Km.
8:if \exists a,b𝒜a,b\in\mathcal{A}, a<ba<b s.t. |j=1m(Xa,jXb,j)|<δ¯m\left\lvert{\sum_{j=1}^{m}(X_{a,j}-X_{b,j})}\right\rvert<\underline{\delta}m then
9:     Permanently discard 𝒜\mathcal{A}, and repeat from step (3).
10:else
11:     Permanently commit to arm aargmaxa𝒜{j=1mXa,j}a^{\ast}\in\arg\max_{a\in\mathcal{A}}\left\{\sum_{j=1}^{m}X_{a,j}\right\}.
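Similarly, a minimal Python sketch of ETC-\infty(K) as listed above is given below, under the same hypothetical reservoir interface as before; spending the entire residual budget on the committed arm is how the sketch interprets "permanently commit", and delta_lb plays the role of the input lower bound on the minimal gap.

import math
import random

def etc_infinity_K(new_arm, K, n, delta_lb):
    """Sketch of ETC-infinity(K): explore-then-commit over K-arm batches.

    `new_arm()` is assumed to return a fresh arm from the reservoir (a
    zero-argument function producing i.i.d. rewards in [0, 1]); `delta_lb`
    is the assumed lower bound on the minimal reward gap.
    """
    history = []
    budget = n                                          # Step 2
    L = math.ceil(2.0 * math.log(n) / delta_lb ** 2)    # Step 4
    while budget > 0:
        arms = [new_arm() for _ in range(K)]            # Step 3: new consideration set
        m = min(L, budget // K)                         # Step 5
        if m == 0:
            break                                       # not enough budget for another batch
        samples = [[arm() for _ in range(m)] for arm in arms]   # Step 6
        for s in samples:
            history.extend(s)
        budget -= K * m                                 # Step 7
        # Step 8: test whether some pair of arms looks indistinguishable.
        alike = any(abs(sum(samples[a][j] - samples[b][j] for j in range(m)))
                    < delta_lb * m
                    for a in range(K) for b in range(a + 1, K))
        if alike:
            continue                                    # Step 9: discard and repeat
        # Step 11: commit to the empirically best arm for the rest of the budget.
        best = max(range(K), key=lambda a: sum(samples[a]))
        history.extend(arms[best]() for _ in range(budget))
        budget = 0
    return history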

Appendix C Proof of Theorem 3.2

Notation. For each i{1,2}i\in\{1,2\}, let 𝒢i(x)\mathcal{G}_{i}(x) be an arbitrary collection of distributions with mean xx\in\mathbb{R}. The tuple (𝒢1(x),𝒢2(y))\left(\mathcal{G}_{1}(x),\mathcal{G}_{2}(y)\right) will be referred to as an instance.

Since the horizon of play is fixed at nn, the decision maker may play at most nn distinct arms. Therefore, it suffices to focus only on the sequence of the first nn arms that may be played. A realization of an instance ν=(𝒢1(μ1),𝒢2(μ2))\nu=\left(\mathcal{G}_{1}(\mu_{1}),\mathcal{G}_{2}(\mu_{2})\right) is defined as the nn-tuple r(ri)1inr\equiv\left(r_{i}\right)_{1\leqslant i\leqslant n}, where ri𝒢1(μ1)𝒢2(μ2)r_{i}\in\mathcal{G}_{1}(\mu_{1})\cup\mathcal{G}_{2}(\mu_{2}) denotes the reward distribution of arm i{1,,n}i\in\{1,...,n\}. It must be noted that the decision maker need not play every arm in rr. Let i:=argmaxi{1,2}μii^{\ast}:=\arg\max_{i\in\{1,2\}}\mu_{i}. Suppose that the distribution over possible realizations of ν=(𝒢1(μ1),𝒢2(μ2))\nu=\left(\mathcal{G}_{1}(\mu_{1}),\mathcal{G}_{2}(\mu_{2})\right) in {r:ri𝒢1(μ1)𝒢2(μ2), 1in}\left\{r:r_{i}\in\mathcal{G}_{1}(\mu_{1})\cup\mathcal{G}_{2}(\mu_{2}),\ 1\leqslant i\leqslant n\right\} satisfies (ri𝒢i(μi))=α\mathbb{P}\left(r_{i}\in\mathcal{G}_{i^{\ast}}\left(\mu_{i^{\ast}}\right)\right)=\alpha^{\ast} (where α(0,1)\alpha^{\ast}\in(0,1) is arbitrary) for all i{1,,n}i\in\{1,...,n\}, i.e., optimal arms occur in the reservoir with probability α\alpha^{\ast}.

Recall that the cumulative pseudo-regret after nn plays of a policy π\pi on ν=(𝒢1(μ1),𝒢2(μ2))\nu=\left(\mathcal{G}_{1}(\mu_{1}),\mathcal{G}_{2}(\mu_{2})\right) is given by Rnπ(ν)=t=1n(μiμ𝒯(πt))R_{n}^{\pi}(\nu)=\sum_{t=1}^{n}\left(\mu_{i^{\ast}}-\mu_{\mathcal{T}\left(\pi_{t}\right)}\right), where 𝒯(πt){1,2}\mathcal{T}(\pi_{t})\in\{1,2\} indicates the type of the arm played by π\pi at time tt. Our goal is to lower bound 𝔼Rnπ(ν)\mathbb{E}R_{n}^{\pi}(\nu), where the expectation is w.r.t. the randomness in π\pi as well as the distribution over possible realizations of ν\nu. To this end, we define the notion of expected cumulative regret of π\pi on a realization rr of ν=(𝒢1(μ1),𝒢2(μ2))\nu=\left(\mathcal{G}_{1}(\mu_{1}),\mathcal{G}_{2}(\mu_{2})\right) by

Snπ(ν,r):=𝔼π[t=1n(μiμ𝒯(πt))],\displaystyle S_{n}^{\pi}(\nu,r):=\mathbb{E}^{\pi}\left[\sum_{t=1}^{n}\left(\mu_{i^{\ast}}-\mu_{\mathcal{T}\left(\pi_{t}\right)}\right)\right],

where the expectation 𝔼π\mathbb{E}^{\pi} is w.r.t. the randomness in π\pi. Note that 𝔼Rnπ(ν)=𝔼νSnπ(ν,r)\mathbb{E}R_{n}^{\pi}(\nu)=\mathbb{E}^{\nu}S_{n}^{\pi}(\nu,r), where the expectation 𝔼ν\mathbb{E}^{\nu} is w.r.t. the distribution over possible realizations of ν\nu. We define our problem class 𝒩Δ¯\mathcal{N}_{\underline{\Delta}} as the collection of Δ¯\underline{\Delta}-separated instances given by

𝒩Δ¯:={(𝒢1(μ1),𝒢2(μ2)):μ1μ2=Δ¯,(μ1,μ2)2}.\displaystyle\mathcal{N}_{\underline{\Delta}}:=\left\{\left(\mathcal{G}_{1}(\mu_{1}),\mathcal{G}_{2}(\mu_{2})\right):\mu_{1}-\mu_{2}=\underline{\Delta},\ (\mu_{1},\mu_{2})\in\mathbb{R}^{2}\right\}.

Fix an arbitrary Δ¯>0\underline{\Delta}>0 and consider an instance ν=({Q1},{Q2})𝒩Δ¯\nu=\left(\{Q_{1}\},\{Q_{2}\}\right)\in\mathcal{N}_{\underline{\Delta}}, where (Q1,Q2)(Q_{1},Q_{2}) are unit-variance Gaussian distributions with means (μ1,μ2)(\mu_{1},\mu_{2}) respectively. Consider an arbitrary realization r{Q1,Q2}nr\in\{Q_{1},Q_{2}\}^{n} of ν\nu and let {1,,n}\mathcal{I}\subseteq\{1,...,n\} denote the set of inferior arms in rr (arms with reward distribution Q2Q_{2}). Consider another instance ν𝒩Δ¯\nu^{\prime}\in\mathcal{N}_{\underline{\Delta}} given by ν=({Q~1},{Q1})\nu^{\prime}=\left(\{\widetilde{Q}_{1}\},\{Q_{1}\}\right), where Q~1\widetilde{Q}_{1} is another unit variance Gaussian with mean μ1+Δ¯\mu_{1}+\underline{\Delta}. Now consider a realization r{Q~1,Q1}nr^{\prime}\in\{\widetilde{Q}_{1},Q_{1}\}^{n} of ν\nu^{\prime} that is such that the arms at positions in \mathcal{I} have distribution Q~1\widetilde{Q}_{1} while those at positions in {1,,n}\\{1,...,n\}\backslash\mathcal{I} have distribution Q1Q_{1}. Notice that \mathcal{I} is the set of optimal arms in rr^{\prime} (arms with reward distribution Q~1\widetilde{Q}_{1}). Then, the following always holds:

Snπ(ν,r)+Snπ(ν,r)(Δ¯n2)(ν,rπ(iNi(n)>n2)+ν,rπ(iNi(n)n2)),\displaystyle S_{n}^{\pi}(\nu,r)+S_{n}^{\pi}(\nu^{\prime},r^{\prime})\geqslant\left(\frac{\underline{\Delta}n}{2}\right)\left(\mathbb{P}^{\pi}_{\nu,r}\left(\sum_{i\in\mathcal{I}}N_{i}(n)>\frac{n}{2}\right)+\mathbb{P}^{\pi}_{\nu^{\prime},r^{\prime}}\left(\sum_{i\in\mathcal{I}}N_{i}(n)\leqslant\frac{n}{2}\right)\right),

where ν,rπ()\mathbb{P}^{\pi}_{\nu,r}(\cdot) and ν,rπ()\mathbb{P}^{\pi}_{\nu^{\prime},r^{\prime}}(\cdot) denote the probability measures w.r.t. the instance-realization pairs (ν,r)(\nu,r) and (ν,r)(\nu^{\prime},r^{\prime}) respectively, and Ni(n)N_{i}(n) denotes the number of plays up to and including time nn of arm i{1,,n}i\in\{1,...,n\}. Using the Bretagnolle-Huber inequality (Theorem 14.2 of Lattimore and Szepesvári (2020)), we obtain

Snπ(ν,r)+Snπ(ν,r)(Δ¯n4)exp(D(ν,rπ,ν,rπ)),\displaystyle S_{n}^{\pi}(\nu,r)+S_{n}^{\pi}(\nu^{\prime},r^{\prime})\geqslant\left(\frac{\underline{\Delta}n}{4}\right)\exp\left(-\text{D}\left(\mathbb{P}^{\pi}_{\nu,r},\mathbb{P}^{\pi}_{\nu^{\prime},r^{\prime}}\right)\right),

where D(ν,rπ,ν,rπ)\text{D}\left(\mathbb{P}^{\pi}_{\nu,r},\mathbb{P}^{\pi}_{\nu^{\prime},r^{\prime}}\right) denotes the KL-Divergence between ν,rπ\mathbb{P}^{\pi}_{\nu,r} and ν,rπ\mathbb{P}^{\pi}_{\nu^{\prime},r^{\prime}}. Using Divergence decomposition (Lemma 15.1 of Lattimore and Szepesvári (2020)), we further obtain

Snπ(ν,r)+Snπ(ν,r)\displaystyle S_{n}^{\pi}(\nu,r)+S_{n}^{\pi}(\nu^{\prime},r^{\prime}) (Δ¯n4)exp((D(Q2,Q~1)Δ¯)Snπ(ν,r))=(Δ¯n4)exp(2Δ¯Snπ(ν,r)),\displaystyle\geqslant\left(\frac{\underline{\Delta}n}{4}\right)\exp\left(-\left(\frac{\text{D}\left(Q_{2},\widetilde{Q}_{1}\right)}{\underline{\Delta}}\right)S_{n}^{\pi}(\nu,r)\right)=\left(\frac{\underline{\Delta}n}{4}\right)\exp\left(-2\underline{\Delta}S_{n}^{\pi}(\nu,r)\right),

where the equality follows since Q~1\widetilde{Q}_{1} and Q2Q_{2} are unit-variance Gaussian distributions with means separated by 2Δ¯2\underline{\Delta}, whence \text{D}\left(Q_{2},\widetilde{Q}_{1}\right)=\left(2\underline{\Delta}\right)^{2}/2=2\underline{\Delta}^{2}. Next, taking the expectation 𝔼ν\mathbb{E}^{\nu} on both sides followed by a direct application of Jensen’s inequality yields

𝔼Rnπ(ν)+𝔼νSnπ(ν,r)(Δ¯n4)exp(2Δ¯𝔼Rnπ(ν)).\displaystyle\mathbb{E}R_{n}^{\pi}(\nu)+\mathbb{E}^{\nu}S_{n}^{\pi}(\nu^{\prime},r^{\prime})\geqslant\left(\frac{\underline{\Delta}n}{4}\right)\exp\left(-2\underline{\Delta}\mathbb{E}R_{n}^{\pi}(\nu)\right). (3)

Consider the 𝔼νSnπ(ν,r)\mathbb{E}^{\nu}S_{n}^{\pi}(\nu^{\prime},r^{\prime}) term in (3). Using a simple change-of-measure argument, we obtain

𝔼νSnπ(ν,r)=𝔼ν[Snπ(ν,r)(1αα)2(Λ(r)n/2)],\displaystyle\mathbb{E}^{\nu}S_{n}^{\pi}(\nu^{\prime},r^{\prime})=\mathbb{E}^{{\nu}^{\prime}}\left[S_{n}^{\pi}(\nu^{\prime},r^{\prime})\left(\frac{1-\alpha^{\ast}}{\alpha^{\ast}}\right)^{2\left(\Lambda(r^{\prime})-n/2\right)}\right],

where Λ(r)\Lambda(r^{\prime}) is the number of optimal arms in realization rr^{\prime}.

Since α\alpha^{\ast} is arbitrary, we fix α=1/2\alpha^{\ast}=1/2 to obtain

𝔼νSnπ(ν,r)=𝔼νSnπ(ν,r)=𝔼Rnπ(ν),\displaystyle\mathbb{E}^{\nu}S_{n}^{\pi}(\nu^{\prime},r^{\prime})=\mathbb{E}^{{\nu}^{\prime}}S_{n}^{\pi}(\nu^{\prime},r^{\prime})=\mathbb{E}R_{n}^{\pi}\left(\nu^{\prime}\right), (4)

Now, from (3) and (4), we have that for α=1/2\alpha^{\ast}=1/2,

𝔼Rnπ(ν)+𝔼Rnπ(ν)Δ¯n4exp(2Δ¯𝔼Rnπ(ν))\displaystyle\mathbb{E}R_{n}^{\pi}\left(\nu\right)+\mathbb{E}R_{n}^{\pi}\left(\nu^{\prime}\right)\geqslant\frac{\underline{\Delta}n}{4}\exp\left(-2\underline{\Delta}\mathbb{E}R_{n}^{\pi}\left(\nu\right)\right)
\displaystyle\implies\ R~nΔ¯n8exp(2Δ¯R~n),\displaystyle\tilde{R}_{n}\geqslant\frac{\underline{\Delta}n}{8}\exp\left(-2\underline{\Delta}\tilde{R}_{n}\right), (5)

where R~n:=max(𝔼Rnπ(ν),𝔼Rnπ(ν))\tilde{R}_{n}:=\max\left(\mathbb{E}R_{n}^{\pi}\left(\nu\right),\mathbb{E}R_{n}^{\pi}\left(\nu^{\prime}\right)\right).

Instance-dependent lower bound

The assertion of the theorem follows from the fact that inequality (5) can be satisfied only if, for any ε(0,1)\varepsilon\in(0,1) and all nn large enough, R~n\tilde{R}_{n} satisfies R~n(1ε)logn/(2Δ¯)\tilde{R}_{n}\geqslant\left(1-\varepsilon\right)\log n/\left(2\underline{\Delta}\right). Therefore, there exists an instance ν\nu with gap Δ¯\underline{\Delta} such that 𝔼Rnπ(ν)Clogn/Δ¯\mathbb{E}R_{n}^{\pi}\left(\nu\right)\geqslant C\log n/\underline{\Delta} for some absolute constant CC and nn large enough, whenever α=1/2\alpha^{\ast}=1/2. In fact, said statement holds for all α1/2\alpha^{\ast}\leqslant 1/2 since the policy π\pi satisfies Definition 3.1.
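To make the last implication transparent, here is a short sketch of the underlying contradiction argument (it uses only (5) and the fact that \underline{\Delta} is held fixed as nn grows): if \tilde{R}_{n}<\left(1-\varepsilon\right)\log n/\left(2\underline{\Delta}\right) along some subsequence, then (5) would force

\displaystyle\tilde{R}_{n}\geqslant\frac{\underline{\Delta}n}{8}\exp\left(-2\underline{\Delta}\tilde{R}_{n}\right)>\frac{\underline{\Delta}n}{8}n^{-\left(1-\varepsilon\right)}=\frac{\underline{\Delta}n^{\varepsilon}}{8},

whose right-hand side diverges with nn while the left-hand side stays below \left(1-\varepsilon\right)\log n/\left(2\underline{\Delta}\right); this is impossible for all nn large enough.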

Instance-independent (minimax) lower bound

Since R~nΔ¯n\tilde{R}_{n}\leqslant\underline{\Delta}n, it follows from (5) that

R~nΔ¯n8exp(2Δ¯2n).\displaystyle\tilde{R}_{n}\geqslant\frac{\underline{\Delta}n}{8}\exp\left(-2\underline{\Delta}^{2}n\right).

Setting Δ¯=1/n\underline{\Delta}=1/\sqrt{n} yields \tilde{R}_{n}\geqslant\sqrt{n}/\left(8e^{2}\right); noting that the inequality, in fact, holds for all α1/2\alpha^{\ast}\leqslant 1/2 (owing to the admissibility of π\pi; see Definition 3.1), this proves the stated assertion. \square

Appendix D Proof of Theorem 3.3

Note that this result is stated for general K2K\geqslant 2 and is not specific to K=2K=2. In fact, the nature of the set of possible sub-optimal types is inconsequential to the proof that follows as long as said set is at least Δ¯\underline{\Delta}-separated from the optimal mean reward. Consider an arbitrary policy πΠ~\pi\in\tilde{\Pi}. Denote by AnπA_{n}^{\pi} the number of distinct arms played by π\pi until time nn. Consider an arbitrary k{1,,n}k\in\{1,...,n\}. Then, conditioned on Anπ=kA_{n}^{\pi}=k, the expected cumulative regret incurred by π\pi is at least

𝔼[Rnπ|Anπ=k]\displaystyle\mathbb{E}\left[R_{n}^{\pi}|A_{n}^{\pi}=k\right] (1α1)Δ¯k+(1α1)kΔ¯(nk)=:f(k).\displaystyle\geqslant(1-\alpha_{1})\underline{\Delta}k+(1-\alpha_{1})^{k}\underline{\Delta}(n-k)=:f(k). (6)

Intuition behind (6). Each of the kk arms played during the horizon has at least one pull associated with it. Consider a clairvoyant policy coupled to π\pi that learns the best among the AnπA_{n}^{\pi} arms played by π\pi as soon as each has been pulled exactly once, i.e., after a total of AnπA_{n}^{\pi} pulls. Clearly, the regret incurred by said clairvoyant policy lower bounds 𝔼Rnπ\mathbb{E}R_{n}^{\pi}. Further, since AnπA_{n}^{\pi} is independent of the sample-history of arms, it follows that the AnπA_{n}^{\pi} arms are statistically identical. Thus, conditioned on Anπ=kA_{n}^{\pi}=k, the expected regret from the first kk pulls of the clairvoyant policy is at least (1α1)Δ¯k(1-\alpha_{1})\underline{\Delta}k. Also, the probability that each of the kk arms is inferior-typed is (1α1)k(1-\alpha_{1})^{k}; the clairvoyant policy thus incurs a regret of at least (1α1)kΔ¯(nk)(1-\alpha_{1})^{k}\underline{\Delta}(n-k) going forward. This explains the lower bound in (6). Therefore, for any k{1,2,,n}k\in\{1,2,...,n\}, we have

𝔼[Rnπ|Anπ=k]\displaystyle\mathbb{E}\left[R_{n}^{\pi}|A_{n}^{\pi}=k\right] mink{1,2,,n}f(k)minx[0,n]f(x).\displaystyle\geqslant\min_{k\in\{1,2,...,n\}}f(k)\geqslant\min_{x\in[0,n]}f(x).

We will show that f(x)f(x) is strictly convex over [0,n][0,n] with f(0)<0f^{\prime}(0)<0 and f(n)>0f^{\prime}(n)>0. Then, it would follow that f()f(\cdot) admits a unique minimizer xn(0,n)x_{n}^{\ast}\in(0,n) given by the solution to f(x)=0f^{\prime}(x)=0. The minimum f(xn)f\left(x_{n}^{\ast}\right) will turn out to be logarithmic in nn. Observe that

f(x)\displaystyle f^{\prime}(x) =(1α1)Δ¯+(1α1)xΔ¯[(nx)log(1α1)1],\displaystyle=(1-\alpha_{1})\underline{\Delta}+(1-\alpha_{1})^{x}\underline{\Delta}\left[(n-x)\log(1-\alpha_{1})-1\right],
f′′(x)\displaystyle f^{\prime\prime}(x) =(1α1)xΔ¯[2(nx)log(1α1)]log(1α1).\displaystyle=-(1-\alpha_{1})^{x}\underline{\Delta}\left[2-(n-x)\log(1-\alpha_{1})\right]\log(1-\alpha_{1}).

Since Δ¯>0\underline{\Delta}>0 and \log\left(1-\alpha_{1}\right)<0, it follows that f′′(x)>0f^{\prime\prime}(x)>0 over [0,n][0,n]. Further, note that

f(0)\displaystyle f^{\prime}(0) =α1Δ¯+Δ¯nlog(1α1)<0,\displaystyle=-\alpha_{1}\underline{\Delta}+\underline{\Delta}n\log(1-\alpha_{1})<0,
f(n)\displaystyle f^{\prime}(n) =(1α1)Δ¯(1α1)nΔ¯>0.\displaystyle=(1-\alpha_{1})\underline{\Delta}-(1-\alpha_{1})^{n}\underline{\Delta}>0.

Solving f(xn)=0f^{\prime}\left(x_{n}^{\ast}\right)=0 for the unique minimizer xnx_{n}^{\ast}, we obtain

(11α1)xn11=(nxn)log(11α1)\displaystyle\left(\frac{1}{1-\alpha_{1}}\right)^{x_{n}^{\ast}-1}-1=\left(n-x_{n}^{\ast}\right)\log\left(\frac{1}{1-\alpha_{1}}\right)
\displaystyle\implies (11α1)xn+xnlog(11α1)>nlog(11α1)\displaystyle\left(\frac{1}{1-\alpha_{1}}\right)^{x_{n}^{\ast}}+x_{n}^{\ast}\log\left(\frac{1}{1-\alpha_{1}}\right)>n\log\left(\frac{1}{1-\alpha_{1}}\right)
\displaystyle\implies 2(11α1)xn>nlog(11α1),\displaystyle 2\left(\frac{1}{1-\alpha_{1}}\right)^{x_{n}^{\ast}}>n\log\left(\frac{1}{1-\alpha_{1}}\right),

where the last inequality follows using y>logyy>\log y. Therefore, we have

(11α1)xn>n2log(11α1)\displaystyle\left(\frac{1}{1-\alpha_{1}}\right)^{x_{n}^{\ast}}>\frac{n}{2}\log\left(\frac{1}{1-\alpha_{1}}\right)
\displaystyle\implies xn>logn+loglog(11α1)log2log(11α1).\displaystyle x_{n}^{\ast}>\frac{\log n+\log\log\left(\frac{1}{1-\alpha_{1}}\right)-\log 2}{\log\left(\frac{1}{1-\alpha_{1}}\right)}.

Thus, for any k{1,,n}k\in\{1,...,n\},

𝔼[Rnπ|Anπ=k]f(xn)>(1α1)Δ¯xn>(1α1)(logn+loglog(11α1)log2log(11α1))Δ¯\displaystyle\mathbb{E}\left[R_{n}^{\pi}|A_{n}^{\pi}=k\right]\geqslant f\left(x_{n}^{\ast}\right)>(1-\alpha_{1})\underline{\Delta}x_{n}^{\ast}>(1-\alpha_{1})\left(\frac{\log n+\log\log\left(\frac{1}{1-\alpha_{1}}\right)-\log 2}{\log\left(\frac{1}{1-\alpha_{1}}\right)}\right)\underline{\Delta}
\displaystyle\implies 𝔼Rnπ(1α1)(logn+loglog(11α1)log2log(11α1))Δ¯\displaystyle\mathbb{E}R_{n}^{\pi}\geqslant(1-\alpha_{1})\left(\frac{\log n+\log\log\left(\frac{1}{1-\alpha_{1}}\right)-\log 2}{\log\left(\frac{1}{1-\alpha_{1}}\right)}\right)\underline{\Delta}
\displaystyle\implies infπΠ~𝔼Rnπlogn(1α1)(1log(11α1)+loglog(11α1)log2(logn)log(11α1))Δ¯\displaystyle\inf_{\pi\in\tilde{\Pi}}\frac{\mathbb{E}R_{n}^{\pi}}{\log n}\geqslant(1-\alpha_{1})\left(\frac{1}{\log\left(\frac{1}{1-\alpha_{1}}\right)}+\frac{\log\log\left(\frac{1}{1-\alpha_{1}}\right)-\log 2}{\left(\log n\right)\log\left(\frac{1}{1-\alpha_{1}}\right)}\right)\underline{\Delta}
\displaystyle\implies infπΠ~𝔼Rnπlogn()(1α1)(1α1α1+loglog(11α1)log2(logn)log(11α1))Δ¯\displaystyle\inf_{\pi\in\tilde{\Pi}}\frac{\mathbb{E}R_{n}^{\pi}}{\log n}\underset{\mathrm{({\dagger})}}{\geqslant}(1-\alpha_{1})\left(\frac{1-\alpha_{1}}{\alpha_{1}}+\frac{\log\log\left(\frac{1}{1-\alpha_{1}}\right)-\log 2}{\left(\log n\right)\log\left(\frac{1}{1-\alpha_{1}}\right)}\right)\underline{\Delta}
\displaystyle\implies infπΠ~𝔼Rnπlogn(1α1)2Δ¯α1+(1α1)(loglog(11α1)log2(logn)log(11α1))Δ¯,\displaystyle\inf_{\pi\in\tilde{\Pi}}\frac{\mathbb{E}R_{n}^{\pi}}{\log n}\geqslant\frac{(1-\alpha_{1})^{2}\underline{\Delta}}{\alpha_{1}}+(1-\alpha_{1})\left(\frac{\log\log\left(\frac{1}{1-\alpha_{1}}\right)-\log 2}{\left(\log n\right)\log\left(\frac{1}{1-\alpha_{1}}\right)}\right)\underline{\Delta},

where ()({\dagger}) follows using logyy1\log y\leqslant y-1. Taking the appropriate limit now proves the assertion. \square
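As a quick numerical sanity check of the bound just derived, the sketch below (with illustrative values of alpha1, gap and n that are not tied to any instance in the paper) minimizes f over the integer grid and compares the minimum against the closed-form lower bound from the display above.

import math

def check_theorem_3_3(alpha1=0.5, gap=0.2, n=10**5):
    """Numerically compare min_k f(k) with the closed-form lower bound."""
    def f(k):
        return (1 - alpha1) * gap * k + (1 - alpha1) ** k * gap * (n - k)
    f_min = min(f(k) for k in range(1, n + 1))
    c = math.log(1.0 / (1.0 - alpha1))
    lower = (1 - alpha1) * gap * (math.log(n) + math.log(c) - math.log(2)) / c
    return f_min, lower   # one should observe f_min >= lower

print(check_theorem_3_3())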

Appendix E Proof of Theorem 3.4

The reservoir distribution is given by 𝜶=(α1,,αK)\boldsymbol{\alpha}=\left(\alpha_{1},...,\alpha_{K}\right). In the full information setting, the decision maker observes the true mean reward of an arm immediately upon pulling it. Let π=(πt:t=1,2,)\pi=\left(\pi_{t}:t=1,2,...\right) be the policy that pulls a new arm from the reservoir in each period. Let NN denote the first time at which one arm of each of the KK types is collected under π\pi. Then, it follows from classical results (see Theorem 4.1 in Flajolet et al. (1992)) for the Coupon-collector problem that

𝔼N=0(1j=1K(1exp(αjy)))𝑑y\displaystyle\mathbb{E}N=\int_{0}^{\infty}\left(1-\prod_{j=1}^{K}\left(1-\exp\left(-\alpha_{j}y\right)\right)\right)dy ()0(1j=1K(1exp(yK)))𝑑y\displaystyle\underset{\mathrm{({\dagger})}}{\geqslant}\int_{0}^{\infty}\left(1-\prod_{j=1}^{K}\left(1-\exp\left(-\frac{y}{K}\right)\right)\right)dy
=()j=1KKj\displaystyle\underset{\mathrm{({\ddagger})}}{=}\sum_{j=1}^{K}\frac{K}{j}
KlogK,\displaystyle\geqslant K\log K, (7)

where ()({\dagger}) follows as j=1K(1exp(αjy))\prod_{j=1}^{K}\left(1-\exp\left(-\alpha_{j}y\right)\right) is maximized when 𝜶\boldsymbol{\alpha} is the Uniform distribution; ()({\ddagger}) is a classical result (see previous reference). The optimal policy π\pi^{\ast} follows π\pi until time NN, and subsequently commits to the arm with the highest mean among the first NN arms. The lifetime regret of π\pi^{\ast} is then given by

𝔼Rπ\displaystyle\mathbb{E}R_{\infty}^{\pi^{\ast}} =𝔼[t=1Ni=2K(μ1μi)𝟙{𝒯(πt)=i}]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{N}\sum_{i=2}^{K}\left(\mu_{1}-\mu_{i}\right)\mathbbm{1}\left\{\mathcal{T}\left(\pi_{t}\right)=i\right\}\right]
=𝔼[t=1i=2K(μ1μi)𝟙{𝒯(πt)=i,tN}]\displaystyle=\mathbb{E}\left[\sum_{t=1}^{\infty}\sum_{i=2}^{K}\left(\mu_{1}-\mu_{i}\right)\mathbbm{1}\left\{\mathcal{T}\left(\pi_{t}\right)=i,\ t\leqslant N\right\}\right]
=t=1i=2K(μ1μi)(𝒯(πt)=i,tN),\displaystyle=\sum_{t=1}^{\infty}\sum_{i=2}^{K}\left(\mu_{1}-\mu_{i}\right)\mathbb{P}\left(\mathcal{T}\left(\pi_{t}\right)=i,\ t\leqslant N\right), (8)

where the last equality follows from Tonelli’s Theorem. Note that

(𝒯(πt)=i,tN)\displaystyle\mathbb{P}\left(\mathcal{T}\left(\pi_{t}\right)=i,\ t\leqslant N\right) =(𝒯(πt)=i)(𝒯(πt)=i,t>N)\displaystyle=\mathbb{P}\left(\mathcal{T}\left(\pi_{t}\right)=i\right)-\mathbb{P}\left(\mathcal{T}\left(\pi_{t}\right)=i,\ t>N\right)
=αi(𝒯(πt)=i|t>N)(t>N)\displaystyle=\alpha_{i}-\mathbb{P}\left(\left.\mathcal{T}\left(\pi_{t}\right)=i\ \right\rvert\ t>N\right)\mathbb{P}\left(t>N\right)
=()αi(𝒯(πt)=i)(t>N)\displaystyle\underset{\mathrm{(\star)}}{=}\alpha_{i}-\mathbb{P}\left(\mathcal{T}\left(\pi_{t}\right)=i\right)\mathbb{P}\left(t>N\right)
=αiαi(t>N)\displaystyle=\alpha_{i}-\alpha_{i}\mathbb{P}\left(t>N\right)
=αi(Nt),\displaystyle=\alpha_{i}\mathbb{P}\left(N\geqslant t\right), (9)

where ()(\star) follows since 𝒯(πt)\mathcal{T}\left(\pi_{t}\right) is i.i.d. in time tt. Therefore, from (8) and (9), we have

𝔼Rπ\displaystyle\mathbb{E}R_{\infty}^{\pi^{\ast}} =t=1i=2Kαi(μ1μi)(Nt)\displaystyle=\sum_{t=1}^{\infty}\sum_{i=2}^{K}\alpha_{i}\left(\mu_{1}-\mu_{i}\right)\mathbb{P}\left(N\geqslant t\right)
=i=2Kαi(μ1μi)t=1(Nt)\displaystyle=\sum_{i=2}^{K}\alpha_{i}\left(\mu_{1}-\mu_{i}\right)\sum_{t=1}^{\infty}\mathbb{P}\left(N\geqslant t\right)
=i=2Kαi(μ1μi)𝔼N.\displaystyle=\sum_{i=2}^{K}\alpha_{i}\left(\mu_{1}-\mu_{i}\right)\mathbb{E}N. (10)

Finally, from (10) and (7), one obtains

𝔼Rπ\displaystyle\mathbb{E}R_{\infty}^{\pi^{\ast}} i=2Kαi(μ1μi)KlogK.\displaystyle\geqslant\sum_{i=2}^{K}\alpha_{i}\left(\mu_{1}-\mu_{i}\right)K\log K.

\square
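A quick Monte Carlo check of the coupon-collector bound (7) is sketched below for the uniform reservoir, which attains the minimum in (†); the type-drawing model, namely that each query independently returns type jj with probability αj, is the same as in the proof, and the trial count is an arbitrary illustrative choice.

import math
import random

def expected_collection_time(alpha, trials=10000):
    """Monte Carlo estimate of E[N]: queries needed to see every type once."""
    K = len(alpha)
    total = 0
    for _ in range(trials):
        seen, draws = set(), 0
        while len(seen) < K:
            draws += 1
            seen.add(random.choices(range(K), weights=alpha)[0])
        total += draws
    return total / trials

K = 5
alpha = [1.0 / K] * K
estimate = expected_collection_time(alpha)
harmonic = sum(K / j for j in range(1, K + 1))      # exact value K * H_K for uniform alpha
print(estimate, harmonic, K * math.log(K))          # estimate ~ harmonic >= K log K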

Appendix F Proof of Theorem 4.1

Let {(X1,jl,,XK,jl):j=1,,ml}\left\{\left(X_{1,j}^{l},...,X_{K,j}^{l}\right):j=1,...,m_{l}\right\} be the reward sequence associated with the KK arms played in the llth epoch, where ml=e2llognm_{l}=\left\lceil e^{2\sqrt{l}}\log n\right\rceil. Let 𝒜:={1,,K}\mathcal{A}:=\{1,...,K\} and define

I:=inf{l:|j=1ml(Xa,jlXb,jl)|2mlela,b𝒜,a<b}.\displaystyle I:=\inf\left\{l\in\mathbb{N}:\left\lvert\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}\right)\right\rvert\geqslant 2m_{l}e^{-\sqrt{l}}\ \forall\ a,b\in\mathcal{A},a<b\right\}.

Then,

(Ik)\displaystyle\mathbb{P}\left(I\geqslant k\right) =(l=1k1a,b𝒜,a<b{|j=1ml(Xa,jlXb,jl)|<2mlel})\displaystyle=\mathbb{P}\left(\bigcap_{l=1}^{k-1}\bigcup_{a,b\in\mathcal{A},a<b}\left\{\left\lvert\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}\right)\right\rvert<2m_{l}e^{-\sqrt{l}}\right\}\right)
=l=1k1(a,b𝒜,a<b{|j=1ml(Xa,jlXb,jl)|<2mlel})\displaystyle=\prod_{l=1}^{k-1}\mathbb{P}\left(\bigcup_{a,b\in\mathcal{A},a<b}\left\{\left\lvert\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}\right)\right\rvert<2m_{l}e^{-\sqrt{l}}\right\}\right)
l=1k1[1(D)+(D)D(a,b𝒜,a<b{|j=1ml(Xa,jlXb,jl)|<2mlel})],\displaystyle\leqslant\prod_{l=1}^{k-1}\left[1-\mathbb{P}\left(\texttt{D}\right)+\mathbb{P}\left(\texttt{D}\right)\mathbb{P}_{\texttt{D}}\left(\bigcup_{a,b\in\mathcal{A},a<b}\left\{\left\lvert\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}\right)\right\rvert<2m_{l}e^{-\sqrt{l}}\right\}\right)\right],

where D denotes the event that the KK arms played in epoch ll are “all-distinct,” i.e., no two arms belong to the same type, 𝙳():=(|𝙳)\mathbb{P}_{\mathtt{D}}(\cdot):=\mathbb{P}\left(\cdot|\mathtt{D}\right) denotes the corresponding conditional measure, and (D)=K!i=1Kαi\mathbb{P}\left(\texttt{D}\right)=K!\prod_{i=1}^{K}\alpha_{i}. Using the Union bound, we obtain

(Ik)\displaystyle\mathbb{P}\left(I\geqslant k\right) l=1k1[1(D)+(D)a,b𝒜,a<bD(|j=1ml(Xa,jlXb,jl)|<2mlel)].\displaystyle\leqslant\prod_{l=1}^{k-1}\left[1-\mathbb{P}\left(\texttt{D}\right)+\mathbb{P}\left(\texttt{D}\right)\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}_{\texttt{D}}\left(\left\lvert\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}\right)\right\rvert<2m_{l}e^{-\sqrt{l}}\right)\right]. (11)

Define τ:=Kl=1Iml\tau:=K\sum_{l=1}^{I}m_{l}. Consider the following events:

𝙴1\displaystyle\mathtt{E}_{1} :={None of the arms played in epoch I belongs to the optimal type},\displaystyle:=\left\{\text{None of the arms played in epoch $I$ belongs to the optimal type}\right\},
𝙴2\displaystyle\mathtt{E}_{2} :={At least one optimal-typed arm is played in epoch I, and the empirically best arm is not optimal-typed}.\displaystyle:=\left\{\text{At least one optimal-typed arm is played in epoch $I$, and the empirically best arm is \text@underline{not} optimal-typed}\right\}.

Recall that Δ¯=μ1μK\bar{\Delta}=\mu_{1}-\mu_{K} denotes the maximal sub-optimality gap. Then, the cumulative pseudo-regret RnR_{n} (superscript π\pi suppressed for notational convenience) of ALG1(n)(n) is bounded as

Rn\displaystyle R_{n} 𝟙{τn}[Δ¯τ+𝟙{𝙴1𝙴2}Δ¯n]+𝟙{τ>n}Δ¯n\displaystyle\leqslant\mathbbm{1}\left\{\tau\leqslant n\right\}\left[\bar{\Delta}\tau+\mathbbm{1}\left\{\mathtt{E}_{1}\cup\mathtt{E}_{2}\right\}\bar{\Delta}n\right]+\mathbbm{1}\left\{\tau>n\right\}\bar{\Delta}n
Δ¯τ+(𝟙{τn,𝙴1}+𝟙{τn,𝙴2})Δ¯n+𝟙{τ>n}Δ¯n.\displaystyle\leqslant\bar{\Delta}\tau+\left(\mathbbm{1}\left\{\tau\leqslant n,\mathtt{E}_{1}\right\}+\mathbbm{1}\left\{\tau\leqslant n,\mathtt{E}_{2}\right\}\right)\bar{\Delta}n+\mathbbm{1}\left\{\tau>n\right\}\bar{\Delta}n.

Taking expectations,

𝔼Rn\displaystyle\mathbb{E}R_{n} Δ¯𝔼τ+[(τn,𝙴1)+(τn,𝙴2)]Δ¯n+(τ>n)Δ¯n\displaystyle\leqslant\bar{\Delta}\mathbb{E}\tau+\left[\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{1}\right)+\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{2}\right)\right]\bar{\Delta}n+\mathbb{P}\left(\tau>n\right)\bar{\Delta}n
2Δ¯𝔼τ+[(τn,𝙴1)+(τn,𝙴2)]Δ¯n,\displaystyle\leqslant 2\bar{\Delta}\mathbb{E}\tau+\left[\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{1}\right)+\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{2}\right)\right]\bar{\Delta}n,

where the last step uses Markov’s inequality. Therefore,

𝔼Rn\displaystyle\mathbb{E}R_{n} 2KΔ¯𝔼[ImI]+[(τn,𝙴1)+(τn,𝙴2)]Δ¯n\displaystyle\leqslant 2K\bar{\Delta}\mathbb{E}\left[Im_{I}\right]+\left[\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{1}\right)+\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{2}\right)\right]\bar{\Delta}n
4KΔ¯𝔼[Ie2I]logn+[(τn,𝙴1)+(τn,𝙴2)]Δ¯n.\displaystyle\leqslant 4K\bar{\Delta}\mathbb{E}\left[Ie^{2\sqrt{I}}\right]\log n+\left[\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{1}\right)+\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{2}\right)\right]\bar{\Delta}n.

F.1 Upper bounding 𝔼[Ie2I]\mathbb{E}\left[Ie^{2\sqrt{I}}\right]

Recall that δ=min1i<jK(μiμj)\delta=\min_{1\leqslant i<j\leqslant K}\left(\mu_{i}-\mu_{j}\right) denotes the smallest gap between any two distinct mean rewards. Then, on the event D, for any a,b𝒜,a<ba,b\in\mathcal{A},a<b, we either have 𝔼[Xa,jlXb,jl]δ\mathbb{E}\left[X_{a,j}^{l}-X_{b,j}^{l}\right]\geqslant\delta or 𝔼[Xa,jlXb,jl]δ\mathbb{E}\left[X_{a,j}^{l}-X_{b,j}^{l}\right]\leqslant-\delta. Without loss of generality, suppose that 𝔼[Xa,jlXb,jl]δ\mathbb{E}\left[X_{a,j}^{l}-X_{b,j}^{l}\right]\geqslant\delta. Then,

𝙳(|j=1ml(Xa,jlXb,jl)|<2mlel)\displaystyle\mathbb{P}_{\mathtt{D}}\left(\left\lvert\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}\right)\right\rvert<2m_{l}e^{-\sqrt{l}}\right) 𝙳(j=1ml(Xa,jlXb,jl)<2mlel)\displaystyle\leqslant\mathbb{P}_{\mathtt{D}}\left(\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}\right)<2m_{l}e^{-\sqrt{l}}\right)
=𝙳(j=1ml(Xa,jlXb,jlδ)<ml(δ2el)).\displaystyle=\mathbb{P}_{\mathtt{D}}\left(\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}-\delta\right)<-m_{l}\left(\delta-2e^{-\sqrt{l}}\right)\right).

Then, for l>log2(4/δ)=:kl>\left\lceil\log^{2}\left(4/\delta\right)\right\rceil=:k^{\ast}, one has that

𝙳(|j=1ml(Xa,jlXb,jl)|<2mlel)𝙳(j=1ml(Xa,jlXb,jlδ)<2mlel)n2,\displaystyle\mathbb{P}_{\mathtt{D}}\left(\left\lvert\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}\right)\right\rvert<2m_{l}e^{-\sqrt{l}}\right)\leqslant\mathbb{P}_{\mathtt{D}}\left(\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}-\delta\right)<-2m_{l}e^{-\sqrt{l}}\right)\leqslant n^{-2}, (12)

where the final inequality follows using the Chernoff-Hoeffding bound (Hoeffding, 1963), together with the fact that 1Xa,jlXb,jl1-1\leqslant X_{a,j}^{l}-X_{b,j}^{l}\leqslant 1 and ml=e2llognm_{l}=\left\lceil e^{2\sqrt{l}}\log n\right\rceil. Using (11) and (12), we obtain for k>k+1k>k^{\ast}+1 that

(Ik)\displaystyle\mathbb{P}\left(I\geqslant k\right) l=1k1[1(D)+(D)a,b𝒜,a<bD(|j=1ml(Xa,jlXb,jl)|<2mlel)]\displaystyle\leqslant\prod_{l=1}^{k-1}\left[1-\mathbb{P}\left(\texttt{D}\right)+\mathbb{P}\left(\texttt{D}\right)\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}_{\texttt{D}}\left(\left\lvert\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}\right)\right\rvert<2m_{l}e^{-\sqrt{l}}\right)\right]
k<lk1[1(D)+K2(D)n2]\displaystyle\leqslant\prod_{k^{\ast}<l\leqslant k-1}\left[1-\mathbb{P}\left(\texttt{D}\right)+\frac{K^{2}\mathbb{P}\left(\texttt{D}\right)}{n^{2}}\right]
[13(D)4]kk1,\displaystyle\leqslant\left[1-\frac{3\mathbb{P}\left(\texttt{D}\right)}{4}\right]^{k-k^{\ast}-1},

where the last inequality holds for n2Kn\geqslant 2K (we will ensure that all guarantees hold for nKn\geqslant K by offsetting regret by 2KΔ¯2K\bar{\Delta} in the end). Thus, for any k1k\geqslant 1 and n2Kn\geqslant 2K, we have

(Ik+k)[13(D)4]k1.\displaystyle\mathbb{P}\left(I\geqslant k^{\ast}+k\right)\leqslant\left[1-\frac{3\mathbb{P}\left(\texttt{D}\right)}{4}\right]^{k-1}.

Now,

𝔼[Ie2I]\displaystyle\mathbb{E}\left[Ie^{2\sqrt{I}}\right] ke2k+k=1(k+k)e2k+k(I=k+k)\displaystyle\leqslant k^{\ast}e^{2\sqrt{k^{\ast}}}+\sum_{k=1}^{\infty}\left(k^{\ast}+k\right)e^{2\sqrt{k^{\ast}+k}}\mathbb{P}\left(I=k^{\ast}+k\right)
()ke2k+k=1(k+k)e2ke2k(I=k+k)\displaystyle\underset{\mathrm{({\dagger})}}{\leqslant}k^{\ast}e^{2\sqrt{k^{\ast}}}+\sum_{k=1}^{\infty}\left(k^{\ast}+k\right)e^{2\sqrt{k^{\ast}}}e^{2\sqrt{k}}\mathbb{P}\left(I=k^{\ast}+k\right)
()ke2k+2k=1kke2ke2k(I=k+k)\displaystyle\underset{\mathrm{({\ddagger})}}{\leqslant}k^{\ast}e^{2\sqrt{k^{\ast}}}+2\sum_{k=1}^{\infty}k^{\ast}ke^{2\sqrt{k^{\ast}}}e^{2\sqrt{k}}\mathbb{P}\left(I=k^{\ast}+k\right)
ke2k+2k=1kke2ke2k(Ik+k)\displaystyle\leqslant k^{\ast}e^{2\sqrt{k^{\ast}}}+2\sum_{k=1}^{\infty}k^{\ast}ke^{2\sqrt{k^{\ast}}}e^{2\sqrt{k}}\mathbb{P}\left(I\geqslant k^{\ast}+k\right)
ke2k+2k=1kke2ke2k[13(D)4]k1\displaystyle\leqslant k^{\ast}e^{2\sqrt{k^{\ast}}}+2\sum_{k=1}^{\infty}k^{\ast}ke^{2\sqrt{k^{\ast}}}e^{2\sqrt{k}}\left[1-\frac{3\mathbb{P}\left(\texttt{D}\right)}{4}\right]^{k-1}
=ke2k[1+2k=1ke2k[13(D)4]k1]\displaystyle=k^{\ast}e^{2\sqrt{k^{\ast}}}\left[1+2\sum_{k=1}^{\infty}ke^{2\sqrt{k}}\left[1-\frac{3\mathbb{P}\left(\texttt{D}\right)}{4}\right]^{k-1}\right]
4ke2kk=1ke2k[13(D)4]k1,\displaystyle\leqslant 4k^{\ast}e^{2\sqrt{k^{\ast}}}\sum_{k=1}^{\infty}ke^{2\sqrt{k}}\left[1-\frac{3\mathbb{P}\left(\texttt{D}\right)}{4}\right]^{k-1},

where ()({\dagger}) follows using k+kk+k\sqrt{k^{\ast}+k}\leqslant\sqrt{k^{\ast}}+\sqrt{k}, and ()({\ddagger}) using k+k2kkk^{\ast}+k\leqslant 2k^{\ast}k (since k,kk^{\ast},k\in\mathbb{N}). Define

C𝜶:=k=1ke2k[13(D)4]k1.\displaystyle C_{{\boldsymbol{\alpha}}}:=\sum_{k=1}^{\infty}ke^{2\sqrt{k}}\left[1-\frac{3\mathbb{P}\left(\texttt{D}\right)}{4}\right]^{k-1}.

Since (D)=K!i=1Kαi\mathbb{P}\left(\texttt{D}\right)=K!\prod_{i=1}^{K}\alpha_{i}, note that the infinite summation is finite since 𝜶=(αi:i=1,,K){\boldsymbol{\alpha}}=\left(\alpha_{i}:i=1,...,K\right) is coordinate-wise bounded away from 0. Therefore,

𝔼[Ie2I]4C𝜶ke2k.\displaystyle\mathbb{E}\left[Ie^{2\sqrt{I}}\right]\leqslant 4C_{{\boldsymbol{\alpha}}}k^{\ast}e^{2\sqrt{k^{\ast}}}.

Note that

e2k=e2log2(4/δ)e2log2(4/δ)+1e2(log2(4/δ)+1)16e2δ2.\displaystyle e^{2\sqrt{k^{\ast}}}=e^{2\sqrt{\left\lceil\log^{2}\left(4/\delta\right)\right\rceil}}\leqslant e^{2\sqrt{\log^{2}\left(4/\delta\right)+1}}\leqslant e^{2\left(\sqrt{\log^{2}\left(4/\delta\right)}+1\right)}\leqslant\frac{16e^{2}}{\delta^{2}}.

Therefore, in conclusion,

𝔼[Ie2I]128e2C𝜶δ2log2(4δ).\displaystyle\mathbb{E}\left[Ie^{2\sqrt{I}}\right]\leqslant\frac{128e^{2}C_{{\boldsymbol{\alpha}}}}{\delta^{2}}{\log^{2}\left(\frac{4}{\delta}\right)}.
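For completeness, the Chernoff-Hoeffding computation behind (12) can be spelled out as a short sketch; it only uses that the pairwise reward differences lie in [-1,1] and that m_{l}=\left\lceil e^{2\sqrt{l}}\log n\right\rceil. For l>k^{\ast} one has e^{-\sqrt{l}}<\delta/4, whence \delta-2e^{-\sqrt{l}}>\delta/2>2e^{-\sqrt{l}}, and therefore

\displaystyle\mathbb{P}_{\mathtt{D}}\left(\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{b,j}^{l}-\delta\right)<-2m_{l}e^{-\sqrt{l}}\right)\leqslant\exp\left(-\frac{2\left(2m_{l}e^{-\sqrt{l}}\right)^{2}}{4m_{l}}\right)=\exp\left(-2m_{l}e^{-2\sqrt{l}}\right)\leqslant\exp\left(-2\log n\right)=n^{-2}.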

F.2 Upper bounding (τn,𝙴1)\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{1}\right)

Note that on the event {τn}\left\{\tau\leqslant n\right\}, the duration of epoch II is KmIKm_{I}. On the event 𝙴1\mathtt{E}_{1}, the consideration set contains at least two arms that belong to the same type. Without loss of generality, suppose that these arms are indexed 11 and 22. Then,

(τn,𝙴1)\displaystyle\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{1}\right) (|j=1mI(X1,jIX2,jI)|2mIeI,τn)\displaystyle\leqslant\mathbb{P}\left(\left\lvert\sum_{j=1}^{m_{I}}\left(X_{1,j}^{I}-X_{2,j}^{I}\right)\right\rvert\geqslant 2m_{I}e^{-\sqrt{I}},\tau\leqslant n\right)
(|j=1mI(X1,jIX2,jI)|2mIeI,In)\displaystyle\leqslant\mathbb{P}\left(\left\lvert\sum_{j=1}^{m_{I}}\left(X_{1,j}^{I}-X_{2,j}^{I}\right)\right\rvert\geqslant 2m_{I}e^{-\sqrt{I}},I\leqslant n\right)
l=1n(|j=1ml(X1,jlX2,jl)|2mlel)\displaystyle\leqslant\sum_{l=1}^{n}\mathbb{P}\left(\left\lvert\sum_{j=1}^{m_{l}}\left(X_{1,j}^{l}-X_{2,j}^{l}\right)\right\rvert\geqslant 2m_{l}e^{-\sqrt{l}}\right)
l=1n2n2\displaystyle\leqslant\sum_{l=1}^{n}\frac{2}{n^{2}}
=2n,\displaystyle=\frac{2}{n},

where the last inequality follows using the Chernoff-Hoeffding bound (Hoeffding, 1963).

F.3 Upper bounding (τn,𝙴2)\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{2}\right)

Note that on the event {τn}\left\{\tau\leqslant n\right\}, the duration of epoch II is KmIKm_{I}. On the event 𝙴2\mathtt{E}_{2}, the consideration set 𝒜\mathcal{A} contains at least one arm of the optimal type, and the empirically best arm belongs to an inferior type. Without loss of generality, suppose that arm 11 belongs to the optimal type and 𝒜\mathcal{I}\subset\mathcal{A} denotes the set of inferior-typed arms. We then have

(τn,𝙴2)\displaystyle\mathbb{P}\left(\tau\leqslant n,\mathtt{E}_{2}\right) (a{j=1mI(Xa,jIX1,jI)2mIeI,τn})\displaystyle\leqslant\mathbb{P}\left(\bigcup_{a\in\mathcal{I}}\left\{\sum_{j=1}^{m_{I}}\left(X_{a,j}^{I}-X_{1,j}^{I}\right)\geqslant 2m_{I}e^{-\sqrt{I}},\tau\leqslant n\right\}\right)
a(j=1mI(Xa,jIX1,jI)2mIeI,In)\displaystyle\leqslant\sum_{a\in\mathcal{I}}\mathbb{P}\left(\sum_{j=1}^{m_{I}}\left(X_{a,j}^{I}-X_{1,j}^{I}\right)\geqslant 2m_{I}e^{-\sqrt{I}},I\leqslant n\right)
al=1n(j=1ml(Xa,jlX1,jl)2mlel)\displaystyle\leqslant\sum_{a\in\mathcal{I}}\sum_{l=1}^{n}\mathbb{P}\left(\sum_{j=1}^{m_{l}}\left(X_{a,j}^{l}-X_{1,j}^{l}\right)\geqslant 2m_{l}e^{-\sqrt{l}}\right)
al=1n1n2\displaystyle\leqslant\sum_{a\in\mathcal{I}}\sum_{l=1}^{n}\frac{1}{n^{2}}
Kn,\displaystyle\leqslant\frac{K}{n},

where the second-to-last inequality follows using the Chernoff-Hoeffding bound (Hoeffding, 1963) since 𝔼[Xa,jlX1,jl]Δ¯<0a\mathbb{E}\left[X_{a,j}^{l}-X_{1,j}^{l}\right]\leqslant-\underline{\Delta}<0\ \forall\ a\in\mathcal{I}.

F.4 Putting everything together

In conclusion, the expected cumulative regret of the policy π\pi given by ALG1(n)(n) is bounded for any nKn\geqslant K as

𝔼RnπC~𝜶KΔ¯lognδ2log2(4δ)+4KΔ¯,\displaystyle\mathbb{E}R_{n}^{\pi}\leqslant\frac{\tilde{C}_{{\boldsymbol{\alpha}}}K\bar{\Delta}\log n}{\delta^{2}}\log^{2}\left(\frac{4}{\delta}\right)+4K\bar{\Delta},

where C~𝜶\tilde{C}_{{\boldsymbol{\alpha}}} is a finite constant that depends only on 𝜶=(αi:i=1,,K){\boldsymbol{\alpha}}=\left(\alpha_{i}:i=1,...,K\right). In particular, C~𝜶\tilde{C}_{{\boldsymbol{\alpha}}} is given by the following infinite summation:

C~𝜶:=512e2C𝜶=512e2k=1ke2k[13K!i=1Kαi4]k1.\displaystyle\tilde{C}_{{\boldsymbol{\alpha}}}:=512e^{2}C_{{\boldsymbol{\alpha}}}=512e^{2}\sum_{k=1}^{\infty}ke^{2\sqrt{k}}\left[1-\frac{3K!\prod_{i=1}^{K}\alpha_{i}}{4}\right]^{k-1}. (13)

\square

Appendix G Proof of Proposition 4.2

Consider the following stopping time:

τ:=inf{m:a,b𝒜,a<bs.t.𝒵a,b+j=1m(Xa,jXb,j)<4mlogm}.\displaystyle\tau:=\inf\left\{m\in\mathbb{N}:\exists a,b\in\mathcal{A},a<b\ \text{s.t.}\ \mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)<4\sqrt{m\log m}\right\}.

Since (m1a,b𝒜,a<b{|𝒵a,b+j=1m(Xa,jXb,j)|4mlogm})(τ=)\mathbb{P}\left(\bigcap_{m\geqslant 1}\bigcap_{a,b\in\mathcal{A},a<b}\left\{\left\lvert\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert\geqslant 4\sqrt{m\log m}\right\}\right)\geqslant\mathbb{P}(\tau=\infty), it suffices to show that (τ=)\mathbb{P}(\tau=\infty) is bounded away from 0. To this end, define the following entities:

ΛK:=inf{p:m=p1m812K2},\displaystyle\Lambda_{K}:=\inf\left\{p\in\mathbb{N}:\sum_{m=p}^{\infty}\frac{1}{m^{8}}\leqslant\frac{1}{2K^{2}}\right\},
T0:=max((64δ2)log2(64δ2),ΛK),\displaystyle T_{0}:=\max\left(\left\lceil\left(\frac{64}{\delta^{2}}\right)\log^{2}\left(\frac{64}{\delta^{2}}\right)\right\rceil,\Lambda_{K}\right),
f(x):=x+4xlogx\displaystyle f(x):=x+4\sqrt{x\log x} for x1.\displaystyle\mbox{for $x\geqslant 1$}.
Lemma G.1.

For any a,b𝒜a,b\in\mathcal{A} s.t. a<ba<b, it is the case that

{𝒵a,b>f(T0)}m=1T0{𝒵a,b+j=1m(Xa,jXb,j)4mlogm}.\displaystyle\left\{\mathcal{Z}_{a,b}>f\left(T_{0}\right)\right\}\subseteq\bigcap_{m=1}^{T_{0}}\left\{\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\geqslant 4\sqrt{m\log m}\right\}.

Proof of Lemma G.1. Note that

𝒵a,b\displaystyle\mathcal{Z}_{a,b} >f(T0)\displaystyle>f\left(T_{0}\right)
=T0+4T0logT0\displaystyle=T_{0}+4\sqrt{T_{0}\log T_{0}}
m+4mlogm 1mT0\displaystyle\geqslant m+4\sqrt{m\log m}\ \forall\ 1\leqslant m\leqslant T_{0}
(𝔞)j=1m(Xb,jXa,j)+4mlogm 1mT0\displaystyle\underset{\mathrm{(\mathfrak{a})}}{\geqslant}\sum_{j=1}^{m}\left(X_{b,j}-X_{a,j}\right)+4\sqrt{m\log m}\ \forall\ 1\leqslant m\leqslant T_{0}
𝒵a,b+j=1m(Xa,jXb,j)\displaystyle\implies\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right) 4mlogm 1mT0,\displaystyle\geqslant 4\sqrt{m\log m}\ \forall\ 1\leqslant m\leqslant T_{0},

where (𝔞)(\mathfrak{a}) follows since the rewards are bounded in [0,1][0,1], i.e., |Xa,jXb,j|1\left\lvert X_{a,j}-X_{b,j}\right\rvert\leqslant 1. \square

Lemma G.2.

For mT0m\geqslant T_{0}, it is the case that

δ8logmm.\displaystyle\delta\geqslant 8\sqrt{\frac{\log m}{m}}.

Proof of Lemma G.2. First of all, note that T0(64/δ2)log2(64/δ2)64T_{0}\geqslant\left({64}/{\delta^{2}}\right)\log^{2}\left({64}/{\delta^{2}}\right)\geqslant 64 (since δ1\delta\leqslant 1). For s=(64/δ2)log2(64/δ2)s=\left({64}/{\delta^{2}}\right)\log^{2}\left({64}/{\delta^{2}}\right), one has

δ2=64log2(64δ2)s(𝔟)64[log(64δ2)+2loglog(64δ2)]s=64logss,\displaystyle\delta^{2}=\frac{64\log^{2}\left(\frac{64}{\delta^{2}}\right)}{s}\underset{\mathrm{(\mathfrak{b})}}{\geqslant}\frac{64\left[\log\left(\frac{64}{\delta^{2}}\right)+2\log\log\left(\frac{64}{\delta^{2}}\right)\right]}{s}=\frac{64\log s}{s},

where (𝔟)(\mathfrak{b}) follows since the function g(x):=x2x2logxg(x):=x^{2}-x-2\log x is monotone increasing for xlog64x\geqslant\log 64 (think of log(64/δ2)\log\left({64}/{\delta^{2}}\right) as xx), and therefore attains its minimum at x=log64x=\log 64; one can verify that this minimum is strictly positive. Furthermore, since logs/s\log s/s is monotone decreasing for s64s\geqslant 64, it follows that for any mT0m\geqslant T_{0},

δ264logmm.\displaystyle\delta^{2}\geqslant\frac{64\log m}{m}.

\square

Now coming back to the proof of Proposition 4.2, consider an arbitrary ll\in\mathbb{N} such that l>T0l>T_{0}. Then,

(τl)\displaystyle\mathbb{P}\left(\tau\leqslant l\right) =(τl,𝒵>f(T0))+(τl,𝒵f(T0))\displaystyle=\mathbb{P}\left(\tau\leqslant l,\mathcal{Z}>f\left(T_{0}\right)\right)+\mathbb{P}\left(\tau\leqslant l,\mathcal{Z}\leqslant f\left(T_{0}\right)\right)
(τl,𝒵>f(T0))+Φ(f(T0)).\displaystyle\leqslant\mathbb{P}\left(\tau\leqslant l,\mathcal{Z}>f\left(T_{0}\right)\right)+\Phi\left(f\left(T_{0}\right)\right).

Now,

(τl,𝒵>f(T0))\displaystyle\mathbb{P}\left(\tau\leqslant l,\mathcal{Z}>f\left(T_{0}\right)\right)
=\displaystyle=\ (m=1la,b𝒜,a<b{𝒵a,b+j=1m(Xa,jXb,j)<4mlogm,𝒵a,b>f(T0)})\displaystyle\mathbb{P}\left(\bigcup_{m=1}^{l}\bigcup_{a,b\in\mathcal{A},a<b}\left\{\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)<4\sqrt{m\log m},\mathcal{Z}_{a,b}>f\left(T_{0}\right)\right\}\right)
=()\displaystyle\underset{\mathrm{({\dagger})}}{=}\ (m=T0la,b𝒜,a<b{𝒵a,b+j=1m(Xa,jXb,j)<4mlogm,𝒵a,b>f(T0)})\displaystyle\mathbb{P}\left(\bigcup_{m=T_{0}}^{l}\bigcup_{a,b\in\mathcal{A},a<b}\left\{\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)<4\sqrt{m\log m},\mathcal{Z}_{a,b}>f\left(T_{0}\right)\right\}\right)
\displaystyle\leqslant\ m=T0la,b𝒜,a<b(𝒵a,b+j=1m(Xa,jXb,j)<4mlogm,𝒵a,b>f(T0))\displaystyle\sum_{m=T_{0}}^{l}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}\left(\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)<4\sqrt{m\log m},\mathcal{Z}_{a,b}>f\left(T_{0}\right)\right)
=\displaystyle=\ m=T0la,b𝒜,a<b(𝒵a,b+j=1m(Xa,jXb,jδ)<m(δ4logmm),𝒵a,b>f(T0))\displaystyle\sum_{m=T_{0}}^{l}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}\left(\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}-\delta\right)<-m\left(\delta-4\sqrt{\frac{\log m}{m}}\right),\mathcal{Z}_{a,b}>f\left(T_{0}\right)\right)
\displaystyle\leqslant\ m=T0la,b𝒜,a<b(j=1m(Xa,jXb,jδ)<m(δ4logmm),𝒵a,b>f(T0))\displaystyle\sum_{m=T_{0}}^{l}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}\left(\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}-\delta\right)<-m\left(\delta-4\sqrt{\frac{\log m}{m}}\right),\mathcal{Z}_{a,b}>f\left(T_{0}\right)\right)
()\displaystyle\underset{\mathrm{({\ddagger})}}{\leqslant}\ m=T0la,b𝒜,a<b(j=1m(Xa,jXb,jδ)<4mlogm,𝒵a,b>f(T0))\displaystyle\sum_{m=T_{0}}^{l}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}\left(\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}-\delta\right)<-4\sqrt{m\log m},\mathcal{Z}_{a,b}>f\left(T_{0}\right)\right)
=\displaystyle=\ Φ¯(f(T0))m=T0la,b𝒜,a<b(j=1m(Xa,jXb,jδ)<4mlogm)\displaystyle\bar{\Phi}\left(f\left(T_{0}\right)\right)\sum_{m=T_{0}}^{l}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}\left(\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}-\delta\right)<-4\sqrt{m\log m}\right)
()\displaystyle\underset{\mathrm{(\star)}}{\leqslant}\ Φ¯(f(T0))m=T0la,b𝒜,a<b1m8\displaystyle\bar{\Phi}\left(f\left(T_{0}\right)\right)\sum_{m=T_{0}}^{l}\sum_{a,b\in\mathcal{A},a<b}\frac{1}{m^{8}}
\displaystyle\leqslant\ Φ¯(f(T0))K2m=T01m8\displaystyle\bar{\Phi}\left(f\left(T_{0}\right)\right)K^{2}\sum_{m=T_{0}}^{\infty}\frac{1}{m^{8}}
()\displaystyle\underset{\mathrm{(\ast)}}{\leqslant}\ Φ¯(f(T0))2,\displaystyle\frac{\bar{\Phi}\left(f\left(T_{0}\right)\right)}{2},

where ()({\dagger}) follows from Lemma G.1, ()({\ddagger}) from Lemma G.2, ()(\star) follows using the Chernoff-Hoeffding bound (Hoeffding, 1963), and finally, ()(\ast) follows from the definition of T0T_{0}. Therefore, we have

(τl)Φ¯(f(T0))2+Φ(f(T0))=1Φ¯(f(T0))2\displaystyle\mathbb{P}\left(\tau\leqslant l\right)\leqslant\frac{\bar{\Phi}\left(f\left(T_{0}\right)\right)}{2}+\Phi\left(f\left(T_{0}\right)\right)=1-\frac{\bar{\Phi}\left(f\left(T_{0}\right)\right)}{2}
\displaystyle\implies\ (τ>l)Φ¯(f(T0))2.\displaystyle\mathbb{P}\left(\tau>l\right)\geqslant\frac{\bar{\Phi}\left(f\left(T_{0}\right)\right)}{2}.

Taking the limit ll\to\infty and appealing to the continuity of probability, we obtain

(τ=)Φ¯(f(T0))2\displaystyle\mathbb{P}\left(\tau=\infty\right)\geqslant\frac{\bar{\Phi}\left(f\left(T_{0}\right)\right)}{2}
\displaystyle\implies\ (m1a,b𝒜,a<b{|𝒵a,b+j=1m(Xa,jXb,j)|4mlogm})Φ¯(f(T0))2.\displaystyle\mathbb{P}\left(\bigcap_{m\geqslant 1}\bigcap_{a,b\in\mathcal{A},a<b}\left\{\left\lvert\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert\geqslant 4\sqrt{m\log m}\right\}\right)\geqslant\frac{\bar{\Phi}\left(f\left(T_{0}\right)\right)}{2}.

\square
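Since sharp analytical lower bounds on the probability appearing in Proposition 4.2 appear difficult to obtain (as remarked in Appendix A), one can at least estimate it numerically. The sketch below is a truncated Monte Carlo estimate that checks the defining event only up to a finite sample count M, and therefore only upper bounds the infinite-horizon probability; the Bernoulli instance at the end is an illustrative assumption.

import math
import random

def estimate_beta(means, M=2000, trials=2000):
    """Truncated Monte Carlo estimate of the probability in Proposition 4.2.

    Simulates K arms of distinct types (Bernoulli with the given means), one
    standard-Gaussian dither per pair, and checks whether
    |Z_{a,b} + sum_{j<=m}(X_{a,j} - X_{b,j})| >= 4*sqrt(m log m) holds for all
    m <= M and all pairs a < b. Truncation at M makes this an upper bound on
    the (infinite-horizon) probability of interest.
    """
    K = len(means)
    pairs = [(a, b) for a in range(K) for b in range(a + 1, K)]
    hits = 0
    for _ in range(trials):
        diff = {p: random.gauss(0.0, 1.0) for p in pairs}   # running Z_{a,b} + partial sums
        ok = True
        for m in range(1, M + 1):
            x = [1.0 if random.random() < mu else 0.0 for mu in means]
            thr = 4.0 * math.sqrt(m * math.log(m)) if m > 1 else 0.0
            for a, b in pairs:
                diff[(a, b)] += x[a] - x[b]
                if abs(diff[(a, b)]) < thr:
                    ok = False
                    break
            if not ok:
                break
        hits += ok
    return hits / trials

print(estimate_beta([0.6, 0.4]))   # K = 2, delta = 0.2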

Appendix H Proof of Theorem 4.3

We will initially assume δ>8logn/n\delta>8\sqrt{\log n/n} for technical convenience. In the final step leading up to the asserted bound, we will relax this assumption by offsetting regret appropriately.

Let 𝒜:={1,,K}\mathcal{A}:=\{1,...,K\}. Define the following stopping times:

τ1(sn)\displaystyle\tau_{1}\left(s_{n}\right) :=inf{msn:a,b𝒜,a<bs.t.|𝒵a,b+j=1m(Xa,jXb,j)|<4mlogm},\displaystyle:=\inf\left\{m\geqslant s_{n}:\exists a,b\in\mathcal{A},a<b\ \text{s.t.}\ \left\lvert\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert<4\sqrt{m\log m}\right\},
τ2(sn)\displaystyle\tau_{2}\left(s_{n}\right) :=inf{msn:|j=1m(Xa,jXb,j)|4mlogna,b𝒜,a<b}.\displaystyle:=\inf\left\{m\geqslant s_{n}:\left\lvert\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert\geqslant 4\sqrt{m\log n}\ \forall\ a,b\in\mathcal{A},\ a<b\right\}.

To keep notations simple, we will suppress the argument and denote τ1(sn)\tau_{1}\left(s_{n}\right) and τ2(sn)\tau_{2}\left(s_{n}\right) by τ1\tau_{1} and τ2\tau_{2} respectively (the dependence on sns_{n} will be implicit going forward). Let RtR_{t} denote the cumulative pseudo-regret of ALG2(n)(n) after tnt\leqslant n pulls. Let D denote the event that the first batch of KK arms queried from the reservoir is “all-distinct,” i.e., no two arms in this batch belong to the same type; let Dc be the complement of this event. Let CI denote the event that the algorithm commits to an inferior-typed arm. Let R~,R¯\tilde{R}_{\cdot},\bar{R}_{\cdot} be independently drawn from the same distribution as RR_{\cdot}. Let x+:=max(x,0)x^{+}:=\max(x,0) for xx\in\mathbb{R}. Then, RnR_{n} evolves according to the following stochastic recursion:

Rn\displaystyle R_{n}
\displaystyle\leqslant\ 𝟙{D}[𝟙{τ1<τ2}(Δ¯min(Kτ1,n)+R~(nKτ1)+)+𝟙{τ1τ2}(Δ¯min(Kτ2,n)+𝟙{CI}Δ¯(nKτ2)+)]\displaystyle\mathbbm{1}\{\texttt{D}\}\left[\mathbbm{1}\left\{\tau_{1}<\tau_{2}\right\}\left(\bar{\Delta}\min\left(K\tau_{1},n\right)+\tilde{R}_{\left(n-K\tau_{1}\right)^{+}}\right)+\mathbbm{1}\left\{\tau_{1}\geqslant\tau_{2}\right\}\left(\bar{\Delta}\min\left(K\tau_{2},n\right)+\mathbbm{1}\{\texttt{CI}\}\bar{\Delta}\left(n-K\tau_{2}\right)^{+}\right)\right]
+𝟙{Dc}[𝟙{τ1<τ2}(Δ¯min(Kτ1,n)+R¯(nKτ1)+)+𝟙{τ1τ2}Δ¯n]\displaystyle+\mathbbm{1}\{\texttt{D\textsuperscript{c}}\}\left[\mathbbm{1}\left\{\tau_{1}<\tau_{2}\right\}\left(\bar{\Delta}\min\left(K\tau_{1},n\right)+\bar{R}_{\left(n-K\tau_{1}\right)^{+}}\right)+\mathbbm{1}\left\{\tau_{1}\geqslant\tau_{2}\right\}\bar{\Delta}n\right]
\displaystyle\leqslant\ 𝟙{D}[𝟙{τ1<τ2}(Δ¯min(Kτ2,n)+R~n)+𝟙{τ1τ2}(Δ¯min(Kτ2,n)+𝟙{CI}Δ¯n)]\displaystyle\mathbbm{1}\{\texttt{D}\}\left[\mathbbm{1}\left\{\tau_{1}<\tau_{2}\right\}\left(\bar{\Delta}\min\left(K\tau_{2},n\right)+\tilde{R}_{n}\right)+\mathbbm{1}\left\{\tau_{1}\geqslant\tau_{2}\right\}\left(\bar{\Delta}\min\left(K\tau_{2},n\right)+\mathbbm{1}\{\texttt{CI}\}\bar{\Delta}n\right)\right]
+𝟙{Dc}[𝟙{τ1<τ2}(Δ¯min(Kτ1,n)+R¯n)+𝟙{τ1τ2}Δ¯n]\displaystyle+\mathbbm{1}\{\texttt{D\textsuperscript{c}}\}\left[\mathbbm{1}\left\{\tau_{1}<\tau_{2}\right\}\left(\bar{\Delta}\min\left(K\tau_{1},n\right)+\bar{R}_{n}\right)+\mathbbm{1}\left\{\tau_{1}\geqslant\tau_{2}\right\}\bar{\Delta}n\right]
\displaystyle\leqslant\ 𝟙{D}[2Δ¯min(Kτ2,n)+𝟙{τ1<τ2}R~n+𝟙{CI}Δ¯n]+𝟙{Dc}[Δ¯Kτ1+R¯n+𝟙{τ1τ2}Δ¯n].\displaystyle\mathbbm{1}\{\texttt{D}\}\left[2\bar{\Delta}\min\left(K\tau_{2},n\right)+\mathbbm{1}\left\{\tau_{1}<\tau_{2}\right\}\tilde{R}_{n}+\mathbbm{1}\{\texttt{CI}\}\bar{\Delta}n\right]+\mathbbm{1}\{\texttt{D\textsuperscript{c}}\}\left[\bar{\Delta}K\tau_{1}+\bar{R}_{n}+\mathbbm{1}\left\{\tau_{1}\geqslant\tau_{2}\right\}\bar{\Delta}n\right].

Taking expectations on both sides, one recovers using the independence of $\tilde{R}_{n},\bar{R}_{n}$ that

\begin{align*}
\mathbb{E}R_{n}\ \leqslant\ \frac{\bar{\Delta}}{\mathbb{P}\left(\left.\tau_{1}\geqslant\tau_{2}\right\rvert\texttt{D}\right)}\left[2\mathbb{E}\left[\left.\min\left(K\tau_{2},n\right)\right\rvert\texttt{D}\right]+\left(\frac{\mathbb{P}\left(\texttt{D\textsuperscript{c}}\right)}{\mathbb{P}\left(\texttt{D}\right)}\right)\mathbb{E}\left[\left.K\tau_{1}\right\rvert\texttt{D\textsuperscript{c}}\right]+\left(\mathbb{P}\left(\left.\texttt{CI}\right\rvert\texttt{D}\right)+\left(\frac{\mathbb{P}\left(\texttt{D\textsuperscript{c}}\right)}{\mathbb{P}\left(\texttt{D}\right)}\right)\mathbb{P}\left(\left.\tau_{1}\geqslant\tau_{2}\right\rvert\texttt{D\textsuperscript{c}}\right)\right)n\right],
\end{align*}

where $\mathbb{P}\left(\texttt{D}\right)=K!\prod_{i=1}^{K}\alpha_{i}$.
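
To spell out the (routine) rearrangement behind the last display: taking expectations of the recursion and using that $\tilde{R}_{n},\bar{R}_{n}$ are independent copies of $R_{n}$,

\begin{align*}
\mathbb{E}R_{n}\ \leqslant\ &\bar{\Delta}\left[2\mathbb{P}(\texttt{D})\,\mathbb{E}\left[\left.\min\left(K\tau_{2},n\right)\right\rvert\texttt{D}\right]+\mathbb{P}(\texttt{D\textsuperscript{c}})\,\mathbb{E}\left[\left.K\tau_{1}\right\rvert\texttt{D\textsuperscript{c}}\right]+\left(\mathbb{P}(\texttt{D})\mathbb{P}\left(\left.\texttt{CI}\right\rvert\texttt{D}\right)+\mathbb{P}(\texttt{D\textsuperscript{c}})\mathbb{P}\left(\left.\tau_{1}\geqslant\tau_{2}\right\rvert\texttt{D\textsuperscript{c}}\right)\right)n\right]\\
&+\left(\mathbb{P}(\texttt{D})\mathbb{P}\left(\left.\tau_{1}<\tau_{2}\right\rvert\texttt{D}\right)+\mathbb{P}(\texttt{D\textsuperscript{c}})\right)\mathbb{E}R_{n},
\end{align*}

and since $1-\mathbb{P}(\texttt{D})\mathbb{P}\left(\left.\tau_{1}<\tau_{2}\right\rvert\texttt{D}\right)-\mathbb{P}(\texttt{D\textsuperscript{c}})=\mathbb{P}(\texttt{D})\mathbb{P}\left(\left.\tau_{1}\geqslant\tau_{2}\right\rvert\texttt{D}\right)$, rearranging and dividing through by $\mathbb{P}(\texttt{D})$ yields the stated bound.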

H.1 Lower bounding $\mathbb{P}\left(\left.\tau_{1}\geqslant\tau_{2}\right\rvert\mathtt{D}\right)$

Define the following:

\begin{equation}
\gamma\left(s_{n}\right):=\mathbb{P}\left(\left.\tau_{1}\left(s_{n}\right)=\infty\right\rvert\mathtt{D}\right),\tag{14}
\end{equation}

where $\tau_{1}\left(s_{n}\right)$ and $\mathtt{D}$ are as defined before. We will suppress the dependence on $s_{n}$ to keep notation minimal. Note that

\begin{align*}
\mathbb{P}\left(\left.\tau_{1}<\tau_{2}\right\rvert\mathtt{D}\right) &=\mathbb{P}\left(\left.\tau_{1}<\tau_{2},\tau_{2}=\infty\right\rvert\mathtt{D}\right)+\mathbb{P}\left(\left.\tau_{1}<\tau_{2},\tau_{2}<\infty\right\rvert\mathtt{D}\right)\\
&\leqslant\mathbb{P}\left(\left.\tau_{2}=\infty\right\rvert\mathtt{D}\right)+\mathbb{P}\left(\left.\tau_{1}<\infty\right\rvert\mathtt{D}\right)\\
&=\mathbb{P}\left(\left.\tau_{1}<\infty\right\rvert\mathtt{D}\right)\\
&=1-\gamma\left(s_{n}\right),
\end{align*}

where the equality in the third step follows since $\tau_{2}$ is almost surely finite on the event D (proved in §H.1.1 below), and the final equality is due to (14). Thus, $\mathbb{P}\left(\left.\tau_{1}\geqslant\tau_{2}\right\rvert\mathtt{D}\right)\geqslant\gamma\left(s_{n}\right)$.

H.1.1 Proof that $\tau_{2}$ is almost surely finite on D

Let $\mathbb{P}_{\texttt{D}}(\cdot):=\mathbb{P}\left(\cdot\,|\,\texttt{D}\right)$ be the conditional measure w.r.t. the event D. Let $\mathcal{A}:=\{1,...,K\}$. Then, by continuity of probability, we have

\begin{align*}
\mathbb{P}_{\texttt{D}}\left(\tau_{2}=\infty\right) &=\lim_{l\to\infty}\mathbb{P}_{\texttt{D}}\left(\tau_{2}>l\right)\\
&=\lim_{l\to\infty}\mathbb{P}_{\texttt{D}}\left(\bigcap_{m=s_{n}}^{l}\bigcup_{a,b\in\mathcal{A},a<b}\left\{\left\lvert\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert<4\sqrt{m\log n}\right\}\right)\\
&\leqslant\lim_{l\to\infty}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}_{\texttt{D}}\left(\left\lvert\sum_{j=1}^{l}\left(X_{a,j}-X_{b,j}\right)\right\rvert<4\sqrt{l\log n}\right).
\end{align*}

On D, it must be that $\left\lvert\mathbb{E}\left[X_{a,j}-X_{b,j}\right]\right\rvert\geqslant\delta$ for every pair $a<b$. Without loss of generality, assume that $\mathbb{E}\left[X_{a,j}-X_{b,j}\right]\geqslant\delta$. Then,

\begin{align*}
\mathbb{P}_{\texttt{D}}\left(\tau_{2}=\infty\right) &\leqslant\lim_{l\to\infty}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}_{\texttt{D}}\left(\sum_{j=1}^{l}\left(X_{a,j}-X_{b,j}\right)<4\sqrt{l\log n}\right)\\
&=\lim_{l\to\infty}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}_{\texttt{D}}\left(\sum_{j=1}^{l}\left(X_{a,j}-X_{b,j}-\delta\right)<-l\left(\delta-4\sqrt{\frac{\log n}{l}}\right)\right)\\
&\leqslant\lim_{l\to\infty}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}_{\texttt{D}}\left(\sum_{j=1}^{l}\left(X_{a,j}-X_{b,j}-\delta\right)<-4l\sqrt{\log n}\left(\frac{2}{\sqrt{n}}-\frac{1}{\sqrt{l}}\right)\right),
\end{align*}

where the last inequality follows since $\delta>8\sqrt{\log n/n}$ (by assumption). Now, using the Chernoff-Hoeffding bound (Hoeffding, 1963) together with the fact that $-1\leqslant X_{a,j}-X_{b,j}\leqslant 1$, we obtain for $l>n$ and any $a,b\in\mathcal{A},a<b$ that

\begin{align*}
\mathbb{P}_{\texttt{D}}\left(\sum_{j=1}^{l}\left(X_{a,j}-X_{b,j}-\delta\right)<-4l\sqrt{\log n}\left(\frac{2}{\sqrt{n}}-\frac{1}{\sqrt{l}}\right)\right) &\leqslant\exp\left[-8l\left(\frac{2}{\sqrt{n}}-\frac{1}{\sqrt{l}}\right)^{2}\log n\right]\\
&=\exp\left[-8\left(2\sqrt{\frac{l}{n}}-1\right)^{2}\log n\right].
\end{align*}

Summing over $a,b\in\mathcal{A},a<b$ and taking the limit $l\to\infty$ proves the stated assertion. $\square$
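
As a numerical sanity check (not part of the proof), the following minimal Python sketch simulates the pairwise boundary-crossing event that defines $\tau_{2}$ on D. Bernoulli rewards and the specific values of $K$, $\delta$, $s_{n}$ and $n$ below are illustrative assumptions only.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions of this sketch, not values from the paper).
n = 10_000            # horizon appearing in the boundary 4*sqrt(m*log n)
K = 3                 # size of the queried batch; on the event D all types are distinct
delta = 0.3           # minimal separation between distinct mean rewards
means = 0.2 + delta * np.arange(K)   # [0.2, 0.5, 0.8]
s_n = 10              # stand-in for the initial sample size s_n

def tau2_realization():
    """First m >= s_n at which EVERY pairwise cumulative reward difference
    clears the boundary 4*sqrt(m*log n) -- the defining event for tau_2 on D."""
    diffs = np.zeros((K, K))             # diffs[a, b] = sum_j (X_{a,j} - X_{b,j})
    pairs = np.triu_indices(K, k=1)
    m = 0
    while True:
        m += 1
        x = rng.binomial(1, means)       # one Bernoulli reward per arm
        diffs += x[:, None] - x[None, :]
        if m >= s_n and np.all(np.abs(diffs[pairs]) >= 4 * np.sqrt(m * np.log(n))):
            return m

samples = [tau2_realization() for _ in range(100)]
print("empirical mean of tau_2:", np.mean(samples))
print("drift-crossing scale 16*log(n)/delta^2:", 16 * np.log(n) / delta ** 2)
\end{verbatim}

In every replication the simulated $\tau_{2}$ is finite and of order $\log n/\delta^{2}$, consistent with the bound derived in §H.2 below.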

H.2 Upper bounding $\mathbb{E}\left[\left.\min\left(K\tau_{2},n\right)\right\rvert\mathtt{D}\right]$

Let $\mathbb{P}_{\texttt{D}}(\cdot):=\mathbb{P}\left(\cdot\,|\,\texttt{D}\right)$ be the conditional measure w.r.t. the event D. Let $\mathcal{A}:=\{1,...,K\}$. Then,

\begin{align*}
\mathbb{E}\left[\left.\min\left(K\tau_{2},n\right)\right\rvert\texttt{D}\right] &=K\mathbb{E}\left[\left.\min\left(\tau_{2},\frac{n}{K}\right)\right\rvert\texttt{D}\right]\\
&\leqslant K\mathbb{E}\left[\left.\min\left(\tau_{2},n\right)\right\rvert\texttt{D}\right]\\
&\leqslant Ks_{n}+K\sum_{k=s_{n}+1}^{n}\mathbb{P}_{\texttt{D}}\left(\tau_{2}\geqslant k\right)\\
&\leqslant Ks_{n}+K\sum_{k=s_{n}}^{n}\mathbb{P}_{\texttt{D}}\left(\tau_{2}\geqslant k+1\right)\\
&\leqslant Ks_{n}+K\sum_{k=1}^{n}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}_{\texttt{D}}\left(\left\lvert\sum_{j=1}^{k}\left(X_{a,j}-X_{b,j}\right)\right\rvert<4\sqrt{k\log n}\right).
\end{align*}

On D, it must be that $\left\lvert\mathbb{E}\left[X_{a,j}-X_{b,j}\right]\right\rvert\geqslant\delta$. Without loss of generality, assume that $\mathbb{E}\left[X_{a,j}-X_{b,j}\right]\geqslant\delta$. Then,

\begin{align*}
\mathbb{E}\left[\left.\min\left(K\tau_{2},n\right)\right\rvert\texttt{D}\right] &\leqslant Ks_{n}+K\sum_{k=1}^{n}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}_{\texttt{D}}\left(\sum_{j=1}^{k}\left(X_{a,j}-X_{b,j}\right)<4\sqrt{k\log n}\right)\\
&=Ks_{n}+K\sum_{k=1}^{n}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}_{\texttt{D}}\left(\sum_{j=1}^{k}\left(X_{a,j}-X_{b,j}-\delta\right)<-k\left(\delta-4\sqrt{\frac{\log n}{k}}\right)\right)\\
&\leqslant Ks_{n}+\frac{32K^{3}\log n}{\delta^{2}}+K\sum_{k=\left\lceil\frac{64\log n}{\delta^{2}}\right\rceil}^{n}\sum_{a,b\in\mathcal{A},a<b}\mathbb{P}_{\texttt{D}}\left(\sum_{j=1}^{k}\left(X_{a,j}-X_{b,j}-\delta\right)<\frac{-k\delta}{2}\right),
\end{align*}

where the last step follows since $\delta>8\sqrt{\log n/n}$ (by assumption) implies $n>64\log n/\delta^{2}$, and $k\geqslant 64\log n/\delta^{2}$ implies $\delta-4\sqrt{\log n/k}\geqslant\delta/2$. Finally, using the Chernoff-Hoeffding inequality (Hoeffding, 1963) together with the fact that $\left\lvert X_{a,j}-X_{b,j}\right\rvert\leqslant 1$, one obtains

\begin{align*}
\mathbb{E}\left[\left.\min\left(K\tau_{2},n\right)\right\rvert\texttt{D}\right]\ \leqslant\ Ks_{n}+\frac{32K^{3}\log n}{\delta^{2}}+\frac{K^{3}}{2}\sum_{k=\left\lceil\frac{64\log n}{\delta^{2}}\right\rceil}^{n}\exp\left(\frac{-\delta^{2}k}{8}\right)\ \leqslant\ Ks_{n}+\frac{64K^{3}\log n}{\delta^{2}}.
\end{align*}
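
For completeness, here is one way to settle the last comparison (a routine estimate, using only $1-e^{-x}\geqslant x/(1+x)$, the assumption $\delta>8\sqrt{\log n/n}$, and the fact that $\delta\leqslant 1$ for rewards supported in $[0,1]$ and $n\geqslant 3$):

\begin{align*}
\sum_{k=\left\lceil\frac{64\log n}{\delta^{2}}\right\rceil}^{n}\exp\left(\frac{-\delta^{2}k}{8}\right)\ \leqslant\ \frac{e^{-8\log n}}{1-e^{-\delta^{2}/8}}\ \leqslant\ \frac{1}{n^{8}}\left(1+\frac{8}{\delta^{2}}\right)\ \leqslant\ \frac{1}{n^{8}}\left(1+\frac{n}{8\log n}\right)\ \leqslant\ 1,
\end{align*}

so the third term above is at most $K^{3}/2\leqslant 32K^{3}\log n/\delta^{2}$, which gives the final inequality.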

H.3 Upper bounding $\mathbb{E}\left[\left.K\tau_{1}\right\rvert\mathtt{D\textsuperscript{c}}\right]$

The event D\textsuperscript{c} will be implicitly assumed and we will drop the conditioning argument for notational simplicity. Let $\mathcal{A}:=\{1,...,K\}$. Without loss of generality, suppose that arms $1$ and $2$ belong to the same type. Then,

\begin{align*}
\mathbb{E}\left[\left.K\tau_{1}\right\rvert\mathtt{D\textsuperscript{c}}\right] &=Ks_{n}+K\sum_{k\geqslant s_{n}+1}\mathbb{P}\left(\tau_{1}\geqslant k\right)\\
&=Ks_{n}+K\sum_{k\geqslant s_{n}}\mathbb{P}\left(\tau_{1}\geqslant k+1\right)\\
&=Ks_{n}+K\sum_{k\geqslant s_{n}}\mathbb{P}\left(\bigcap_{m=s_{n}}^{k}\bigcap_{a,b\in\mathcal{A},a<b}\left\{\left\lvert\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert\geqslant 4\sqrt{m\log m}\right\}\right)\\
&\leqslant Ks_{n}+K\sum_{k\geqslant 1}\mathbb{P}\left(\left\lvert\mathcal{Z}_{1,2}+\sum_{j=1}^{k}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{k\log k}\right).
\end{align*}

Since $\mathcal{Z}_{a,b}$ is a standard Gaussian, and the increments $X_{1,j}-X_{2,j}$ are zero-mean sub-Gaussian with variance proxy $1$, it follows from the Chernoff-Hoeffding concentration bound (Hoeffding, 1963) that

\begin{align*}
\mathbb{E}\left[\left.K\tau_{1}\right\rvert\mathtt{D\textsuperscript{c}}\right]\ \leqslant\ Ks_{n}+2K\sum_{k\geqslant 1}\frac{1}{k^{4}}=\left(s_{n}+\frac{\pi^{4}}{45}\right)K<\left(s_{n}+3\right)K,
\end{align*}

where the equality uses $\sum_{k\geqslant 1}k^{-4}=\zeta(4)=\pi^{4}/90$.
H.4 Upper bounding $\mathbb{P}\left(\left.\mathtt{CI}\right\rvert\mathtt{D}\right)$

Let $\mathbb{P}_{\texttt{D}}(\cdot):=\mathbb{P}\left(\cdot\,|\,\texttt{D}\right)$ be the conditional measure w.r.t. the event D. Let $\mathcal{A}:=\{1,...,K\}$ and, without loss of generality, suppose that arm $1$ is optimal (mean $\mu_{1}$). Then,

\begin{align*}
\mathbb{P}\left(\left.\mathtt{CI}\right\rvert\mathtt{D}\right) &\leqslant\mathbb{P}_{\texttt{D}}\left(\bigcup_{b=2}^{K}\left\{\sum_{j=1}^{\tau_{2}}\left(X_{1,j}-X_{b,j}\right)\leqslant-4\sqrt{\tau_{2}\log n}\right\}\right)\\
&\leqslant\sum_{b=2}^{K}\sum_{k=s_{n}}^{n}\mathbb{P}_{\texttt{D}}\left(\sum_{j=1}^{k}\left(X_{1,j}-X_{b,j}\right)\leqslant-4\sqrt{k\log n}\right)+\sum_{b=2}^{K}\mathbb{P}_{\texttt{D}}\left(\tau_{2}>n\right)\\
&\leqslant\sum_{b=2}^{K}\sum_{k=1}^{n}\frac{1}{n^{8}}+\frac{K^{3}}{n^{8}}\\
&\leqslant\frac{K}{n^{7}}+\frac{K^{3}}{n^{8}},
\end{align*}

where the second-to-last step follows using the Chernoff-Hoeffding inequality (Hoeffding, 1963).
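
To indicate where the $1/n^{8}$ terms come from: since arm $1$ is optimal on D, each increment satisfies $\mathbb{E}\left[X_{1,j}-X_{b,j}\right]\geqslant\delta\geqslant 0$, so Hoeffding's inequality (increments bounded in $[-1,1]$) gives, for every $k$,

\begin{align*}
\mathbb{P}_{\texttt{D}}\left(\sum_{j=1}^{k}\left(X_{1,j}-X_{b,j}\right)\leqslant-4\sqrt{k\log n}\right)\ \leqslant\ \exp\left(-\frac{\left(4\sqrt{k\log n}+k\delta\right)^{2}}{2k}\right)\ \leqslant\ \exp\left(-8\log n\right)=\frac{1}{n^{8}},
\end{align*}

while the bound on $\mathbb{P}_{\texttt{D}}\left(\tau_{2}>n\right)$ follows from the same computation as in §H.1.1 with $l=n$ (at most ${K\choose 2}n^{-8}$, whence the $K^{3}/n^{8}$ term after summing over $b$).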

H.5 Upper bounding $\mathbb{P}\left(\left.\tau_{1}\geqslant\tau_{2}\right\rvert\mathtt{D\textsuperscript{c}}\right)$

Let $\mathbb{P}_{\texttt{D\textsuperscript{c}}}(\cdot):=\mathbb{P}\left(\cdot\,|\,\texttt{D\textsuperscript{c}}\right)$ be the conditional measure w.r.t. the event D\textsuperscript{c}. Let $\mathcal{A}:=\{1,...,K\}$. On D\textsuperscript{c}, there exist $2$ arms in $\mathcal{A}$ that belong to the same type; without loss of generality suppose that these arms are indexed by $1,2$. Then,

\begin{align*}
\mathbb{P}\left(\left.\tau_{1}\geqslant\tau_{2}\right\rvert\mathtt{D\textsuperscript{c}}\right) &\leqslant\mathbb{P}\left(\left.\tau_{1}>n\right\rvert\mathtt{D\textsuperscript{c}}\right)+\mathbb{P}\left(\left.\tau_{2}\leqslant n\right\rvert\mathtt{D\textsuperscript{c}}\right)\\
&\leqslant\frac{2}{n^{4}}+\mathbb{P}_{\texttt{D\textsuperscript{c}}}\left(\tau_{2}\leqslant n\right)\\
&=\frac{2}{n^{4}}+\mathbb{P}_{\texttt{D\textsuperscript{c}}}\left(\bigcup_{m=s_{n}}^{n}\bigcap_{a,b\in\mathcal{A},a<b}\left\{\left\lvert\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert\geqslant 4\sqrt{m\log n}\right\}\right)\\
&\leqslant\frac{2}{n^{4}}+\sum_{m=1}^{n}\mathbb{P}_{\texttt{D\textsuperscript{c}}}\left(\left\lvert\sum_{j=1}^{m}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{m\log n}\right)\\
&\leqslant\frac{2}{n^{4}}+\frac{2}{n^{7}},\tag{15}
\end{align*}

where the last step follows using the Chernoff-Hoeffding bound (Hoeffding, 1963).
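
For concreteness, the final step uses that arms $1$ and $2$ share the same mean on D\textsuperscript{c}, so Hoeffding's inequality (increments bounded in $[-1,1]$) gives, for every $m\leqslant n$,

\begin{align*}
\mathbb{P}_{\texttt{D\textsuperscript{c}}}\left(\left\lvert\sum_{j=1}^{m}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{m\log n}\right)\ \leqslant\ 2\exp\left(-\frac{2\cdot 16\,m\log n}{4m}\right)=\frac{2}{n^{8}},
\end{align*}

and summing over $m=1,...,n$ yields the $2/n^{7}$ term in (15).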

H.6 Putting everything together

Combining everything, one finally obtains that when $\delta>8\sqrt{\log n/n}$,

\begin{align*}
\mathbb{E}R_{n}\ \leqslant\ \frac{CK^{3}\bar{\Delta}}{\gamma\left(s_{n}\right)}\left(\frac{\log n}{\delta^{2}}+\frac{s_{n}}{\mathbb{P}\left(\texttt{D}\right)}\right),
\end{align*}

where $\gamma\left(s_{n}\right)$ is as defined in (14), $\mathbb{P}\left(\texttt{D}\right)=K!\prod_{i=1}^{K}\alpha_{i}$, and $C$ is some absolute constant. When $\delta\leqslant 8\sqrt{\log n/n}$, one has $n\leqslant 64\log n/\delta^{2}$, and therefore the regret is trivially at most $\bar{\Delta}n\leqslant 64\bar{\Delta}\log n/\delta^{2}$. Thus, the aforementioned bound, in fact, holds generally for some large enough absolute constant $C$. $\square$
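
To make the combination step above transparent, substituting the estimates from §H.1–§H.5 into the master inequality and writing $p:=\mathbb{P}(\texttt{D})$ gives

\begin{align*}
\mathbb{E}R_{n}\ \leqslant\ \frac{\bar{\Delta}}{\gamma\left(s_{n}\right)}\left[2\left(Ks_{n}+\frac{64K^{3}\log n}{\delta^{2}}\right)+\left(\frac{1-p}{p}\right)\left(s_{n}+3\right)K+\left(\frac{K}{n^{7}}+\frac{K^{3}}{n^{8}}+\left(\frac{1-p}{p}\right)\left(\frac{2}{n^{4}}+\frac{2}{n^{7}}\right)\right)n\right];
\end{align*}

each group inside the brackets is at most an absolute constant multiple of $K^{3}\left(\log n/\delta^{2}+s_{n}/\mathbb{P}(\texttt{D})\right)$ (using $p\leqslant 1$, $s_{n}\geqslant 1$, $\delta\leqslant 1$ and $n\geqslant 3$), which is the form displayed above.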

Appendix I Auxiliary results used in the analysis of ALG3

Lemma I.1 (Persistence of heterogeneous consideration sets).

Consider a two-armed bandit with rewards bounded in $[0,1]$, means $\mu_{1}>\mu_{2}$, and gap $\underline{\Delta}=\mu_{1}-\mu_{2}$. Let $\left\{X_{i,j}:j=1,2,...\right\}$ denote the sequence of rewards collected from arm $i\in\{1,2\}$ by UCB1 (Auer et al., 2002). Let $\mathcal{Z}$ be an independently generated standard Gaussian random variable. Let $\left(N_{1}(n),N_{2}(n)\right)$ be the per-arm sample counts under UCB1 up to and including time $n$. Define

\begin{align*}
&M_{n}:=\min\left(N_{1}(n),N_{2}(n)\right),\\
&\tau:=\inf\left\{n\in\mathbb{N}:\left\lvert\mathcal{Z}+\sum_{j=1}^{M_{n}}\left(X_{1,j}-X_{2,j}\right)\right\rvert<4\sqrt{M_{n}\log M_{n}}\right\}.
\end{align*}

Then, $\mathbb{P}\left(\tau=\infty\right)>\beta_{\underline{\Delta},2}$, where $\beta_{\underline{\Delta},2}$ is as defined in (2) with $\delta\leftarrow\underline{\Delta}$ and $K\leftarrow 2$.

Lemma I.2 (Fast rejection of homogeneous consideration sets).

Consider a two-armed bandit where both arms have equal means. Let $\left\{X_{i,j}:j=1,2,...\right\}$ denote the sequence of rewards collected from arm $i\in\{1,2\}$ by UCB1 (Auer et al., 2002). Let $\mathcal{Z}$ be an independently generated standard Gaussian random variable. Let $\left(N_{1}(n),N_{2}(n)\right)$ be the per-arm sample counts under UCB1 up to and including time $n$. Define

\begin{align*}
&M_{n}:=\min\left(N_{1}(n),N_{2}(n)\right),\\
&\tau:=\inf\left\{n\in\mathbb{N}:\left\lvert\mathcal{Z}+\sum_{j=1}^{M_{n}}\left(X_{1,j}-X_{2,j}\right)\right\rvert<4\sqrt{M_{n}\log M_{n}}\right\}.
\end{align*}

Then, there exists an absolute constant $C$ such that $\mathbb{E}\tau\leqslant C$.

I.1 Proof of Lemma I.1

Since the rewards are uniformly bounded in $[0,1]$, it follows that $N_{i}(n)\to\infty$ for each arm $i\in\{1,2\}$ as $n\to\infty$ on every sample-path. This is due to the structure of the upper confidence bounds used by UCB1. Consequently, $M_{n}=\min\left(N_{1}(n),N_{2}(n)\right)\to\infty$ as $n\to\infty$ on every sample-path. Also note that $M_{n}$ is a weakly increasing integer-valued process (starting from $1$) with unit increments, wherever they exist. Thus, it follows on every sample-path that $\tau$, in fact, weakly dominates the stopping time $\tau^{\prime}$ defined below:

\begin{equation}
\tau^{\prime}:=\inf\left\{m\in\mathbb{N}:\left\lvert\mathcal{Z}+\sum_{j=1}^{m}\left(X_{1,j}-X_{2,j}\right)\right\rvert<4\sqrt{m\log m}\right\}.\tag{16}
\end{equation}

Therefore, $\mathbb{P}\left(\tau=\infty\right)\geqslant\mathbb{P}\left(\tau^{\prime}=\infty\right)>\beta_{\underline{\Delta},2}$, where the last inequality follows from Proposition 4.2 with $\delta\leftarrow\underline{\Delta}$ and $K\leftarrow 2$. $\square$

I.2 Proof of Lemma I.2

Note that

\begin{align*}
\mathbb{E}\tau &=1+\sum_{k\geqslant 2}\mathbb{P}\left(\tau\geqslant k\right)\\
&=1+\sum_{k\geqslant 2}\mathbb{P}\left(\bigcap_{n=1}^{k-1}\left\{\left\lvert\mathcal{Z}+\sum_{j=1}^{M_{n}}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{M_{n}\log M_{n}}\right\}\right)\\
&\leqslant 1+\sum_{k\geqslant 1}\mathbb{P}\left(\left\lvert\mathcal{Z}+\sum_{j=1}^{M_{k}}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{M_{k}\log M_{k}}\right)\\
&=1+\sum_{k\geqslant 1}\sum_{m=1}^{k}\mathbb{P}\left(\left\lvert\mathcal{Z}+\sum_{j=1}^{M_{k}}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{M_{k}\log M_{k}},\ N_{1}(k)=m\right)\\
&=1+\sum_{k\geqslant 1}\sum_{1\leqslant m\leqslant k/2}\mathbb{P}\left(\left\lvert\mathcal{Z}+\sum_{j=1}^{m}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{m\log m},\ N_{1}(k)=m\right)\\
&\quad+\sum_{k\geqslant 1}\sum_{k/2<m\leqslant k}\mathbb{P}\left(\left\lvert\mathcal{Z}+\sum_{j=1}^{k-m}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{(k-m)\log(k-m)},\ N_{1}(k)=m\right)\\
&=1+\sum_{k\geqslant 1}\sum_{1\leqslant m\leqslant k/2}\mathbb{P}\left(\left\lvert\mathcal{Z}+\sum_{j=1}^{m}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{m\log m},\ N_{1}(k)=m\right)\\
&\quad+\sum_{k\geqslant 1}\sum_{1\leqslant m<k/2}\mathbb{P}\left(\left\lvert\mathcal{Z}+\sum_{j=1}^{m}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{m\log m},\ N_{2}(k)=m\right)\\
&\leqslant 1+2\sum_{k\geqslant 1}\sum_{\theta k\leqslant m\leqslant k/2}\mathbb{P}\left(\left\lvert\mathcal{Z}+\sum_{j=1}^{m}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{m\log m}\right)\\
&\quad+\sum_{k\geqslant 1}\left[\mathbb{P}\left(N_{1}(k)\leqslant\theta k\right)+\mathbb{P}\left(N_{2}(k)\leqslant\theta k\right)\right],
\end{align*}

where $\theta=1/2-\sqrt{15}/8$. Using Theorem 4(i) of Kalvit and Zeevi (2020) with $\epsilon=\sqrt{15}/8$, one obtains

\begin{align*}
\mathbb{E}\tau &\leqslant 1+2\sum_{k\geqslant 1}\sum_{\theta k\leqslant m\leqslant k/2}\mathbb{P}\left(\left\lvert\mathcal{Z}+\sum_{j=1}^{m}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{m\log m}\right)+16\sum_{k\geqslant 1}\frac{1}{k^{2}}\\
&\leqslant 1+4\sum_{k\geqslant 1}\sum_{\theta k\leqslant m\leqslant k/2}\frac{1}{m^{4}}+16\sum_{k\geqslant 1}\frac{1}{k^{2}}\\
&\leqslant 1+\frac{4}{\theta^{4}}\sum_{k\geqslant 1}\frac{1}{k^{3}}+16\sum_{k\geqslant 1}\frac{1}{k^{2}}.
\end{align*}

\square
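
As an illustration of Lemma I.2 (and not a substitute for the proof above), the following minimal simulation sketch estimates $\mathbb{E}\tau$ for a homogeneous pair. Bernoulli$(0.5)$ rewards, the replication count, and the horizon cap are assumptions of the sketch; the arm-selection index is the UCB1 index of Auer et al. (2002).

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)

def tau_homogeneous(mu=0.5, cap=100_000):
    """One realization of the stopping time in Lemma I.2: run UCB1 on two
    Bernoulli(mu) arms and stop once |Z + sum_{j<=M_n}(X_{1,j}-X_{2,j})|
    drops below 4*sqrt(M_n*log M_n), where M_n = min(N_1(n), N_2(n))."""
    Z = rng.standard_normal()
    rewards = ([], [])
    sums = np.zeros(2)
    counts = np.zeros(2, dtype=int)
    for n in range(1, cap + 1):
        if n <= 2:
            a = n - 1                       # initialization: pull each arm once
        else:                               # UCB1 index of Auer et al. (2002)
            a = int(np.argmax(sums / counts + np.sqrt(2 * np.log(n) / counts)))
        x = rng.binomial(1, mu)
        rewards[a].append(x)
        sums[a] += x
        counts[a] += 1
        M = int(counts.min())
        if M >= 2:  # the boundary 4*sqrt(M*log M) equals 0 at M = 1
            stat = Z + sum(rewards[0][j] - rewards[1][j] for j in range(M))
            if abs(stat) < 4 * np.sqrt(M * np.log(M)):
                return n
    return cap  # proxy for "tau = infinity" (not expected for equal means)

taus = [tau_homogeneous() for _ in range(500)]
print("empirical mean of tau:", np.mean(taus), "| max over runs:", max(taus))
\end{verbatim}

Across replications the empirical mean of $\tau$ is a small constant that does not grow with the horizon cap, consistent with $\mathbb{E}\tau\leqslant C$.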

Appendix J Proof of Theorem 4.4

Consider the first epoch and define the following:

\begin{align*}
&M_{n}:=\min\left(N_{1}(n),N_{2}(n)\right),\\
&\tau:=\inf\left\{n\in\mathbb{N}:\left\lvert\mathcal{Z}+\sum_{j=1}^{M_{n}}\left(X_{1,j}-X_{2,j}\right)\right\rvert<4\sqrt{M_{n}\log M_{n}}\right\},
\end{align*}

where $\left(N_{1}(n),N_{2}(n)\right)$ are the per-arm sample counts under UCB1 up to and including time $n$. Note that $\tau$ marks the termination of epoch $1$.

Let $R_{n}$ denote the cumulative pseudo-regret of ALG3 after $n$ pulls (superscript $\pi$ suppressed for notational convenience). Let $S_{n}$ denote the cumulative pseudo-regret of UCB1 after $n$ pulls in a two-armed bandit with gap $\underline{\Delta}$. Let D and I respectively denote the events that the two arms queried in epoch $1$ have distinct and identical types. Similarly, let OPT and INF respectively denote the events that the two arms have “optimal” and “inferior” types (note that $\texttt{I}=\texttt{OPT}\cup\texttt{INF}$). Let $\tilde{R}_{n},\bar{R}_{n},\hat{R}_{n}$ be independently drawn from the same distribution as $R_{n}$. Then, note that $R_{n}$ admits the following stochastic evolution:

\begin{align*}
R_{n} &=\mathbbm{1}\{\texttt{D}\}\left[S_{\min\left(\tau,n\right)}+\tilde{R}_{\left(n-\tau\right)^{+}}\right]+\mathbbm{1}\{\texttt{INF}\}\left[\underline{\Delta}\min\left(\tau,n\right)+\bar{R}_{\left(n-\tau\right)^{+}}\right]+\mathbbm{1}\{\texttt{OPT}\}\hat{R}_{\left(n-\tau\right)^{+}}\\
&\leqslant\mathbbm{1}\{\texttt{D}\}\left[S_{n}+\mathbbm{1}\left\{\tau<n\right\}\tilde{R}_{n}\right]+\mathbbm{1}\{\texttt{INF}\}\left[\underline{\Delta}\tau+\bar{R}_{n}\right]+\mathbbm{1}\{\texttt{OPT}\}\hat{R}_{n}\\
&\leqslant\mathbbm{1}\{\texttt{D}\}\left[S_{n}+\mathbbm{1}\left\{\tau<\infty\right\}\tilde{R}_{n}\right]+\mathbbm{1}\{\texttt{INF}\}\left[\underline{\Delta}\tau+\bar{R}_{n}\right]+\mathbbm{1}\{\texttt{OPT}\}\hat{R}_{n},
\end{align*}

where the first inequality follows since ALG3 is agnostic to $n$, and hence the pseudo-regret $R_{n}$ is weakly increasing in $n$. Taking expectations on both sides, one recovers using the independence of $\tilde{R}_{n},\bar{R}_{n},\hat{R}_{n}$ that

\begin{align*}
\mathbb{E}R_{n} &\leqslant\frac{1}{\mathbb{P}\left(\left.\tau=\infty\right\rvert\texttt{D}\right)}\left[\mathbb{E}S_{n}+\left(\frac{1-\alpha_{1}}{2\alpha_{1}}\right)\underline{\Delta}\,\mathbb{E}\left[\left.\tau\right\rvert\texttt{INF}\right]\right]\\
&\leqslant\frac{1}{\beta_{\underline{\Delta},2}}\left[\mathbb{E}S_{n}+C\left(\frac{1-\alpha_{1}}{2\alpha_{1}}\right)\underline{\Delta}\right],
\end{align*}

where $\beta_{\underline{\Delta},2}$ is as defined in (2) with $\delta\leftarrow\underline{\Delta}$ and $K\leftarrow 2$, and $C$ is some absolute constant; the last inequality follows using Lemmas I.1 and I.2. The stated assertion now follows since $\mathbb{E}S_{n}\leqslant C^{\prime}\left(\log n/\underline{\Delta}+\underline{\Delta}\right)$ for some absolute constant $C^{\prime}$ (Auer et al., 2002). $\square$
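
For the reader's convenience, the first inequality in the last display spells out as follows. Taking expectations of the pathwise recursion (recall that for $K=2$ types the two arms queried in epoch $1$ are independent draws from the reservoir, so $\alpha_{2}=1-\alpha_{1}$, $\mathbb{P}(\texttt{D})=2\alpha_{1}(1-\alpha_{1})$ and $\mathbb{P}(\texttt{INF})=(1-\alpha_{1})^{2}$),

\begin{align*}
\mathbb{E}R_{n}\ \leqslant\ \mathbb{P}(\texttt{D})\,\mathbb{E}S_{n}+\mathbb{P}(\texttt{INF})\,\underline{\Delta}\,\mathbb{E}\left[\left.\tau\right\rvert\texttt{INF}\right]+\left(\mathbb{P}(\texttt{D})\mathbb{P}\left(\left.\tau<\infty\right\rvert\texttt{D}\right)+\mathbb{P}(\texttt{I})\right)\mathbb{E}R_{n},
\end{align*}

and since $1-\mathbb{P}(\texttt{D})\mathbb{P}\left(\left.\tau<\infty\right\rvert\texttt{D}\right)-\mathbb{P}(\texttt{I})=\mathbb{P}(\texttt{D})\mathbb{P}\left(\left.\tau=\infty\right\rvert\texttt{D}\right)$, rearranging and dividing through by $\mathbb{P}(\texttt{D})$ gives the display, with $\mathbb{P}(\texttt{INF})/\mathbb{P}(\texttt{D})=(1-\alpha_{1})^{2}/\left(2\alpha_{1}(1-\alpha_{1})\right)=(1-\alpha_{1})/(2\alpha_{1})$.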

Appendix K Auxiliary results used in the analysis of ALG4

Lemma K.1 (Persistence of heterogeneous consideration sets).

Consider a $K$-armed bandit with rewards bounded in $[0,1]$ and means $\mu_{1}>...>\mu_{K}$. Let $\left\{X_{a,j}:j=1,2,...\right\}$ denote the rewards collected from arm $a\in\{1,...,K\}=:\mathcal{A}$ by UCB1 (Auer et al., 2002). Let $\left\{\mathcal{Z}_{a,b}:a,b\in\mathcal{A},a<b\right\}$ be a collection of ${K\choose 2}$ independent standard Gaussian random variables. Let $N_{a}(n)$ be the sample count of arm $a$ under UCB1 until time $n$. Define

\begin{align*}
&M_{l}:=\min_{a\in\mathcal{A}}N_{a}(l),\\
&\tau:=\inf\left\{l\geqslant K:\exists\ a,b\in\mathcal{A},a<b\ \text{s.t.}\ \left\lvert\mathcal{Z}_{a,b}+\sum_{j=1}^{M_{l}}\left(X_{a,j}-X_{b,j}\right)\right\rvert<4\sqrt{M_{l}\log M_{l}}\right\}.
\end{align*}

Then, $\mathbb{P}\left(\tau=\infty\right)>\beta_{\delta,K}$, where $\beta_{\delta,K}$ is as defined in (2).

Lemma K.2 (Path-wise lower bound on the arm-sampling rate of UCB1).

Consider a $K$-armed bandit with rewards bounded in $[0,1]$. Let $N_{a}(n)$ be the sample count of arm $a\in\{1,...,K\}=:\mathcal{A}$ under UCB1 (Auer et al., 2002) until time $n$. Then, for all $n\geqslant K$,

\begin{align*}
M_{n}:=\min_{a\in\mathcal{A}}N_{a}(n)\geqslant f(n),
\end{align*}

where $\left(f(n):n=K,K+1,...\right)$ is some deterministic monotone non-decreasing integer-valued sequence satisfying $f(K)=1$ and $f(n)\to\infty$ as $n\to\infty$.

K.1 Proof of Lemma K.1

Suppose that there exists a sample-path on which some non-empty subset of arms $\mathfrak{A}\subset\mathcal{A}$ receives a bounded number of pulls asymptotically in the horizon of play. Also suppose that $\mathfrak{A}$ is the maximal such subset, i.e., each arm in $\mathcal{A}\backslash\mathfrak{A}$ is played infinitely often asymptotically on said sample-path. This implies that the UCB score of any arm in $\mathcal{A}\backslash\mathfrak{A}$ is at most $1+o\left(\sqrt{\log t}\right)$ at time $t$ (since the empirical mean term remains bounded in $[0,1]$). At the same time, the boundedness hypothesis implies that the UCB score of any arm in $\mathfrak{A}$ is at least $\Omega\left(\sqrt{\log t}\right)$. Thus, for $t$ large enough, UCB scores of arms in $\mathfrak{A}$ will start to dominate those in $\mathcal{A}\backslash\mathfrak{A}$ and the algorithm will end up playing an arm from $\mathfrak{A}$ at some point, thus increasing the cumulative sample-count of arms in $\mathfrak{A}$ by $1$. As $t$ grows further, one can replicate the preceding argument an arbitrary number of times to conclude that $\mathfrak{A}$ receives an unbounded number of pulls on the sample-path under consideration, thereby contradicting the boundedness hypothesis. Therefore, it must be the case that each arm in $\mathcal{A}$ is played infinitely often on every sample-path. Consequently, $M_{n}=\min_{a\in\mathcal{A}}N_{a}(n)\to\infty$ as $n\to\infty$ on every sample-path.

Now since $\left(M_{n}:n=K,K+1,...\right)$ is an integer-valued process (starting from $M_{K}=1$) with unit increments (wherever they exist), it follows that on every sample-path, $\tau$, in fact, weakly dominates the stopping time $\tau^{\prime}$ given by

\begin{equation}
\tau^{\prime}:=\inf\left\{m\in\mathbb{N}:\exists\ a,b\in\mathcal{A},a<b\ \text{s.t.}\ \left\lvert\mathcal{Z}_{a,b}+\sum_{j=1}^{m}\left(X_{a,j}-X_{b,j}\right)\right\rvert<4\sqrt{m\log m}\right\}.\tag{17}
\end{equation}

Therefore, $\mathbb{P}\left(\tau=\infty\right)\geqslant\mathbb{P}\left(\tau^{\prime}=\infty\right)>\beta_{\delta,K}$; the last inequality follows from Proposition 4.2. $\square$

K.2 Proof of Lemma K.2

Let $\mathcal{S}_{n}=\left\{\left(N_{a}(n):a\in\mathcal{A}\right)\right\}$ denote the set of possible sample-count realizations under UCB1 when the horizon of play is $n$. Define $f(n):=\min_{\left(N_{a}(n):a\in\mathcal{A}\right)\in\mathcal{S}_{n}}\min_{a\in\mathcal{A}}N_{a}(n)$. Since $\mathcal{S}_{n}$ is finite, the aforementioned minimum is attained at some $\left(N_{a}^{\ast}(n):a\in\mathcal{A}\right)\in\mathcal{S}_{n}$. Note that $\left(N_{a}^{\ast}(n):a\in\mathcal{A}\right)$ is not a random vector, as it corresponds to a specific set of sample-paths (possibly non-unique) on which $\min_{a\in\mathcal{A}}N_{a}(n)$ is minimized. Therefore, $f(n)=\min_{a\in\mathcal{A}}N^{\ast}_{a}(n)$ is deterministic. Monotonicity of $f$ is immediate since each $N_{a}(n)$ is non-decreasing in $n$ on every sample-path, and $f(K)=1$ since UCB1 begins by playing each of the $K$ arms once. We have already established in the proof of Lemma K.1 that for each $a\in\mathcal{A}$, $N_{a}(n)\to\infty$ as $n\to\infty$ on every sample-path. In particular, this also implies $N_{a}^{\ast}(n)\to\infty$ as $n\to\infty$. Thus, we have established the existence of a sequence $f(n)$ satisfying the assertions of the lemma. $\square$
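
A small empirical illustration of Lemma K.2 (purely illustrative; the Bernoulli means, horizon and checkpoints below are assumptions of the sketch): across independent UCB1 sample-paths, the smallest observed value of $\min_{a}N_{a}(n)$ grows with $n$, mimicking the deterministic envelope $f(n)$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)

def min_counts_along_path(means, checkpoints):
    """Run UCB1 once and record min_a N_a(n) at each checkpoint."""
    K = len(means)
    sums = np.zeros(K)
    counts = np.zeros(K, dtype=int)
    records = {}
    for n in range(1, max(checkpoints) + 1):
        if n <= K:
            a = n - 1                        # initialization: one pull per arm
        else:                                # UCB1 index of Auer et al. (2002)
            a = int(np.argmax(sums / counts + np.sqrt(2 * np.log(n) / counts)))
        counts[a] += 1
        sums[a] += rng.binomial(1, means[a])
        if n in checkpoints:
            records[n] = int(counts.min())
    return records

means = [0.8, 0.5, 0.3]                      # illustrative Bernoulli means
checkpoints = {10, 100, 1_000, 10_000}
runs = [min_counts_along_path(means, checkpoints) for _ in range(100)]
# Smallest min-count seen at each checkpoint across runs: an empirical stand-in for f(n).
print({n: min(r[n] for r in runs) for n in sorted(checkpoints)})
\end{verbatim}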

Appendix L Proof of Theorem B.1

Let $\mathcal{A}:=\{1,...,K\}$ be the collection of $K$ arms queried during the first epoch. Consider an arbitrary $l\in\mathbb{N}$ s.t. $l\geqslant K$ and define the following:

\begin{align*}
&M_{l}:=\min_{a\in\mathcal{A}}N_{a}(l),\\
&\tau:=\inf\left\{l\geqslant K:\exists\ a,b\in\mathcal{A},a<b\ \text{s.t.}\ \left\lvert\mathcal{Z}_{a,b}+\sum_{j=1}^{M_{l}}\left(X_{a,j}-X_{b,j}\right)\right\rvert<4\sqrt{M_{l}\log M_{l}}\right\},
\end{align*}

where $N_{a}(l)$ denotes the sample count from arm $a$ under UCB1 until time $l$. Note that $\tau$ marks the termination of epoch $1$.

Let $R_{n}$ denote the cumulative pseudo-regret of ALG4 after $n$ pulls (superscript $\pi$ suppressed for notational convenience). Let $S_{n}$ denote the cumulative pseudo-regret of UCB1 after $n$ pulls in a $K$-armed bandit with means $\mu_{1}>\mu_{2}>...>\mu_{K}$. Let D denote the event that the $K$ arms queried in epoch $1$ have distinct types (no two belong to the same type). Let $\tilde{R}_{n},\bar{R}_{n}$ be independently drawn from the same distribution as $R_{n}$. Then, the evolution of $R_{n}$ satisfies

\begin{align*}
R_{n} &\leqslant\mathbbm{1}\{\texttt{D}\}\left[S_{\min\left(\tau,n\right)}+\tilde{R}_{\left(n-\tau\right)^{+}}\right]+\mathbbm{1}\{\texttt{D\textsuperscript{c}}\}\left[\bar{\Delta}\min\left(\tau,n\right)+\bar{R}_{\left(n-\tau\right)^{+}}\right]\\
&\underset{(\dagger)}{\leqslant}\mathbbm{1}\{\texttt{D}\}\left[S_{n}+\mathbbm{1}\left\{\tau<n\right\}\tilde{R}_{n}\right]+\mathbbm{1}\{\texttt{D\textsuperscript{c}}\}\left[\bar{\Delta}\min\left(\tau,n\right)+\bar{R}_{n}\right]\\
&\leqslant\mathbbm{1}\{\texttt{D}\}\left[S_{n}+\mathbbm{1}\left\{\tau<\infty\right\}\tilde{R}_{n}\right]+\mathbbm{1}\{\texttt{D\textsuperscript{c}}\}\left[\bar{\Delta}\min\left(\tau,n\right)+\bar{R}_{n}\right],
\end{align*}

where $(\dagger)$ follows since ALG4 is agnostic to $n$, and hence the pseudo-regret $R_{n}$ is weakly increasing in $n$. Taking expectations on both sides, one recovers using the independence of $\tilde{R}_{n},\bar{R}_{n}$ that

\begin{align*}
\mathbb{E}R_{n}\ \leqslant\ \frac{1}{\mathbb{P}\left(\left.\tau=\infty\right\rvert\texttt{D}\right)}\left[\mathbb{E}S_{n}+\left(\frac{\bar{\Delta}\,\mathbb{E}\left[\left.\min\left(\tau,n\right)\right\rvert\texttt{D\textsuperscript{c}}\right]}{\mathbb{P}(\texttt{D})}\right)\right]\ \leqslant\ \frac{1}{\beta_{\delta,K}}\left[\mathbb{E}S_{n}+\left(\frac{\bar{\Delta}\,\mathbb{E}\left[\left.\min\left(\tau,n\right)\right\rvert\texttt{D\textsuperscript{c}}\right]}{\mathbb{P}(\texttt{D})}\right)\right],
\end{align*}

where $\mathbb{P}(\texttt{D})=K!\prod_{i=1}^{K}\alpha_{i}$, and the last inequality follows using Lemma K.1 with $\beta_{\delta,K}$ as defined in (2). We know that $\mathbb{E}S_{n}\leqslant CK\left(\log n/\underline{\Delta}+\bar{\Delta}\right)$ for some absolute constant $C$ (Auer et al., 2002). The rest of the proof is geared towards showing that $\mathbb{E}\left[\left.\min\left(\tau,n\right)\right\rvert\texttt{D\textsuperscript{c}}\right]=o(n)$.

L.1 Proof of $\boldsymbol{\mathbb{E}\left[\left.\min\left(\tau,n\right)\right\rvert\mathtt{D\textsuperscript{c}}\right]=o(n)}$

Let $\mathbb{P}_{\mathtt{D\textsuperscript{c}}}(\cdot):=\mathbb{P}\left(\cdot\,|\,\mathtt{D\textsuperscript{c}}\right)$ be the conditional measure w.r.t. the event $\mathtt{D\textsuperscript{c}}$. On $\mathtt{D\textsuperscript{c}}$, there exist two arms in the consideration set $\mathcal{A}$ that belong to the same type. Without loss of generality, suppose that these are indexed by $1,2$. Then,

\begin{align*}
\mathbb{E}\left[\left.\min\left(\tau,n\right)\right\rvert\mathtt{D\textsuperscript{c}}\right] &\leqslant K+\sum_{k=K+1}^{n}\mathbb{P}_{\mathtt{D\textsuperscript{c}}}\left(\tau\geqslant k\right)\\
&\leqslant K+\sum_{k=K}^{n}\mathbb{P}_{\mathtt{D\textsuperscript{c}}}\left(\tau\geqslant k+1\right)\\
&=K+\sum_{k=K}^{n}\mathbb{P}_{\mathtt{D\textsuperscript{c}}}\left(\bigcap_{l=1}^{k}\bigcap_{a,b\in\mathcal{A},a<b}\left\{\left\lvert\mathcal{Z}_{a,b}+\sum_{j=1}^{M_{l}}\left(X_{a,j}-X_{b,j}\right)\right\rvert\geqslant 4\sqrt{M_{l}\log M_{l}}\right\}\right)\\
&\leqslant K+\sum_{k=K}^{n}\mathbb{P}_{\mathtt{D\textsuperscript{c}}}\left(\left\lvert\mathcal{Z}_{1,2}+\sum_{j=1}^{M_{k}}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{M_{k}\log M_{k}}\right)\\
&=K+\sum_{k=K}^{n}\sum_{m=1}^{k}\mathbb{P}_{\mathtt{D\textsuperscript{c}}}\left(\left\lvert\mathcal{Z}_{1,2}+\sum_{j=1}^{M_{k}}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{M_{k}\log M_{k}},\ M_{k}=m\right)\\
&\underset{(\dagger)}{=}K+\sum_{k=K}^{n}\sum_{m=f(k)}^{k}\mathbb{P}_{\mathtt{D\textsuperscript{c}}}\left(\left\lvert\mathcal{Z}_{1,2}+\sum_{j=1}^{M_{k}}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{M_{k}\log M_{k}},\ M_{k}=m\right)\\
&\leqslant K+\sum_{k=K}^{n}\sum_{m=f(k)}^{k}\mathbb{P}_{\mathtt{D\textsuperscript{c}}}\left(\left\lvert\mathcal{Z}_{1,2}+\sum_{j=1}^{m}\left(X_{1,j}-X_{2,j}\right)\right\rvert\geqslant 4\sqrt{m\log m}\right)\\
&\underset{(\ddagger)}{\leqslant}K+2\sum_{k=K}^{n}\sum_{m=f(k)}^{k}\frac{1}{m^{4}}\\
&=K+2\sum_{k=K}^{n}\left(\frac{1}{\left(f(k)\right)^{4}}+\sum_{m=f(k)+1}^{k}\frac{1}{m^{4}}\right)\\
&\leqslant K+2\sum_{k=K}^{n}\left(\frac{1}{\left(f(k)\right)^{4}}+\frac{1}{3\left(f(k)\right)^{3}}\right),
\end{align*}

where $(\dagger)$ follows from Lemma K.2, and $(\ddagger)$ using the Chernoff-Hoeffding bound (Hoeffding, 1963). Since $f(k)$ is monotone non-decreasing and coercive in $k$, the summands $2\left(\left(f(k)\right)^{-4}+\left(f(k)\right)^{-3}/3\right)$ vanish as $k\to\infty$, so by Cesàro averaging the final sum is $o(n)$. It follows that $\mathbb{E}\left[\left.\min\left(\tau,n\right)\right\rvert\mathtt{D\textsuperscript{c}}\right]=o(n)$, where the little-Oh only hides dependence on $K$. $\square$