
A. Kagrecha, Stanford University ([email protected])
J. Nair, IIT Bombay ([email protected])
K. Jagannathan, IIT Madras ([email protected])

Constrained regret minimization for multi-criterion multi-armed bandits

Anmol Kagrecha    Jayakrishnan Nair    Krishna Jagannathan
Abstract

We consider a stochastic multi-armed bandit setting and study the problem of constrained regret minimization over a given time horizon. Each arm is associated with an unknown, possibly multi-dimensional distribution, and the merit of an arm is determined by several, possibly conflicting attributes. The aim is to optimize a ‘primary’ attribute subject to user-provided constraints on other ‘secondary’ attributes. We assume that the attributes can be estimated using samples from the arms’ distributions, and that the estimators enjoy suitable concentration properties. We propose an algorithm called Con-LCB that guarantees a logarithmic regret, i.e., the average number of plays of all non-optimal arms is at most logarithmic in the horizon. The algorithm also outputs a boolean flag that correctly identifies, with high probability, whether the given instance is feasible/infeasible with respect to the constraints. We also show that Con-LCB is optimal within a universal constant, i.e., that more sophisticated algorithms cannot do much better universally. Finally, we establish a fundamental trade-off between regret minimization and feasibility identification. Our framework finds natural applications, for instance, in financial portfolio optimization, where risk constrained maximization of expected return is meaningful.

Keywords:
Multi-criterion multi-armed bandits · constrained bandits · regret minimization

1 Introduction

The multi-armed bandit (MAB) problem is a fundamental construct in online learning, where a learner has to quickly identify the best option (a.k.a., arm) among a given set of options. In the stochastic MAB problem, each arm is associated with an (a priori unknown) reward distribution, and a sample from this distribution is revealed each time an arm is chosen (a.k.a., pulled). The classical goal is to use these samples to quickly identify the arm with the highest mean reward. The most popular metric to evaluate the performance of a learning algorithm is regret, which captures how often a suboptimal arm was pulled by the algorithm.

While the classical stochastic MAB formulation has been applied in various application scenarios, including clinical trials, portfolio optimization, anomaly detection, and telecommunication Bouneffouf and Rish [2019], it ignores a key aspect of most real-world decision-making problems—namely, that they have multiple criteria of interest. For example, when comparing testing kits in a clinical trial, one would want to keep track of the false-positive rate as well as the false-negative rate of each kit. Similarly, choosing the best financial portfolio involves balancing risk and reward. A wireless node deciding which channel to transmit on, or what transmission rate to use has to balance several criteria, including throughput, delay, and energy consumption. This multi-criterion nature of decision making is not always adequately captured by the classical MAB approach of optimizing a one-dimensional reward signal.

The most common approach for incorporating multiple arm attributes into MAB formulations is to define the reward as a suitable function (say a linear combination) of the attributes of interest. For example, risk-aware portfolio optimization can be cast as an MAB problem where the best arm is one that optimizes a certain linear combination of mean value and a suitable risk measure (such as standard deviation or Conditional Value at Risk); see, for example, Sani et al. [2012], Vakili and Zhao [2016], Kagrecha et al. [2019]. However, this approach assumes that the different attributes of interest can be expressed and compared on a common scale, which is not always reasonable. For example, how does one ‘equate’ the impact of a certain increment in risk to that of an increment in the mean return of a portfolio?

A more natural approach for multi-criterion decision making is to instead pose the optimal choice as the solution of a constrained optimization problem. In other words, optimize one attribute subject to constraints on the others. However, despite the modest literature on multi-criterion MABs (surveyed below), little attention has been paid to formulating and designing algorithms for constrained multi-criterion MABs. This paper seeks to fill this gap. Formally, we assume that each arm is associated with a $D$-dimensional probability distribution, and the best arm is the one that minimizes a certain arm attribute subject to constraints on $m$ other attributes. For this setting, we pursue regret minimization, i.e., we seek to minimize the average number of pulls of non-optimal arms.

To simplify the presentation, and to highlight the key aspects of this problem, we first consider a single constraint (i.e., $m=1$). Each arm is associated with a probability distribution, and we consider the problem of optimizing an objective attribute, subject to a single constraint attribute satisfying a user-specified constraint. The algorithm design and performance evaluation for this special case (with $m=1$) generalize readily to the multi-criterion problem; see Section 5. An example that fits this template, and is of significant interest in the finance community, is the optimization of the expected return subject to a risk constraint. Here, the objective attribute is the mean of an arm distribution, while the constraint attribute measuring risk could be the Conditional Value at Risk (CVaR).
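To make the risk-constrained example concrete, the constraint attribute could be estimated by the standard empirical CVaR: the average of the worst $\alpha$-fraction of the observed loss samples. The sketch below is illustrative only; the estimator form and all numerical values are our own choices, not part of the paper's development.

```python
import random

def empirical_cvar(samples, alpha=0.05):
    # Empirical CVaR at level alpha: the average of the worst
    # (largest) alpha-fraction of the observed loss samples.
    x = sorted(samples)
    k = max(1, int(len(x) * alpha))
    return sum(x[-k:]) / k

random.seed(0)
losses = [random.gauss(0.0, 1.0) for _ in range(20000)]

mean_hat = sum(losses) / len(losses)           # objective attribute g_0
cvar_hat = empirical_cvar(losses, alpha=0.05)  # constraint attribute g_1
```

For a standard Gaussian loss, the CVaR at level $0.05$ is roughly $2.06$, so the risk attribute can be far more conservative than the mean; constraining it is genuinely different from constraining the mean itself.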

For this problem, we propose an algorithm, called Constrained Lower Confidence Bound (Con-LCB), that guarantees logarithmic regret, i.e., the average number of plays of all non-optimal arms (including those that violate the constraint) is at most logarithmic in the horizon. If Con-LCB is presented with an infeasible instance, i.e., an instance where all arms violate the specified risk constraint, the algorithm in effect relaxes this constraint just enough to make at least one arm compliant. Another feature of Con-LCB is that at the end of the horizon, it outputs a boolean flag that correctly identifies, with high probability, whether or not the given instance was feasible.

Finally, we establish fundamental lower bounds on the performance of any algorithm on this constrained regret minimization problem. Our results demonstrate a fundamental tradeoff between regret minimization and feasibility identification, similar to the well-known tradeoff between regret minimization and best arm identification in the classical (unconstrained) MAB problem Bubeck et al. [2009].

The remainder of this paper is organized as follows. A brief survey of related literature is provided below. In Section 2, we introduce some preliminaries and formulate the constrained mean minimization problem. We present our algorithm (Con-LCB) and its performance guarantees in Section 3. Information-theoretic lower bounds on performance are discussed in Section 4. Finally, the general formulation for multi-criterion MABs is introduced in Section 5. All proofs are deferred to an appendix, which is part of the ‘supplementary materials’ document uploaded separately.

Related literature: The literature related to multi-armed bandit problems is quite large. We refer the reader to Bubeck and Cesa-Bianchi [2012], Lattimore and Szepesvári [2018] for a comprehensive review. Here, we restrict ourselves to papers that consider (i) multi-objective MAB problems with vector rewards, and (ii) risk-aware arm selection.

For multi-objective MAB problems with vector rewards, different notions of optimality are considered. For example, Drugan and Nowe [2013], Yahyaa and Manderick [2015] consider the notion of Pareto optimality. In these papers, all dimensions are considered equally important and the aim is to play all Pareto-optimal arms an equal number of times. Another important notion of optimality is lexicographic optimality (see Ehrgott [2005]), where there is an order of importance among the different dimensions. Tekin and Turğay [2018], Tekin [2019] consider lexicographic optimality for contextual bandit problems. In this line of work, the goal is to obtain a higher reward in the most important dimension, using rewards in dimensions of lower importance for tie-breaking.

Turning now to the literature on risk-aware arm selection, Sani et al. [2012], Galichet et al. [2013], Vakili and Zhao [2016], David and Shimkin [2016], Bhat and Prashanth [2019], Prashanth et al. [2020], Kagrecha et al. [2019] consider the problem of optimizing a risk metric alone, or a linear combination of the mean and a risk metric. Zimin et al. [2014] study the learnability of general functions of mean and variance, and Maillard [2013] proposes the logarithm of the moment generating function as a risk measure in a regret minimization framework. Cassel et al. [2018] look at path-dependent regret and provide a general approach to studying many risk metrics.

None of the above papers considers the constrained MAB problem, which frames the optimal arm as the solution of a constrained optimization problem. There has been some recent work on the constrained MAB problem. A constrained linear bandit setting is considered in Pacchiano et al. [2021], under the assumption that there is at least one arm which satisfies the constraints. The papers Amani et al. [2019], Moradipari et al. [2019] consider the problem of maximizing the reward subject to satisfying a linear constraint with high probability. The constrained setting was also considered in David et al. [2018] and Chang [2020]. Both of these papers consider a single constraint on arm selection; David et al. [2018] considers a constraint on VaR, and Chang [2020] considers an average cost constraint (each arm has a cost distribution that is independent of its reward distribution). All the papers above implicitly assume that the instance presented is feasible, whereas we address the issue of encountering an infeasible instance.

2 Problem formulation

In this section, we describe the formulation of our constrained stochastic MAB problem. To keep the exposition simple, we consider a single constraint (i.e., $m=1$) for now; the general formulation with multiple constraints (i.e., $m\geq 1$) is discussed in Section 5. Informally, under the regret minimization objective considered here, the goal is to play, as often as possible, the arm that optimizes the objective, subject to a constraint on another attribute.

Formally, consider a multi-armed bandit problem with $K$ arms, labeled $1,2,\cdots,K$. Each arm is associated with a (possibly multi-dimensional) probability distribution, with $\nu(k)$ denoting the (joint) distribution corresponding to arm $k\in[K]$ (for a positive integer $n$, we denote $[n]:=\{1,2,\cdots,n\}$). Suppose that $\nu(k)\in\mathcal{C}$, the space of possible arm distributions. The objective and the constraint attributes are defined via functions $g_0$ and $g_1$ respectively, mapping $\mathcal{C}$ to $\mathbb{R}$. Additionally, the user provides a threshold $\tau\in\mathbb{R}$, which specifies an upper bound on the attribute $g_1$. An instance of the constrained MAB problem is defined by $(\nu,\tau)$, where $\nu=(\nu(k),\ 1\leq k\leq K)$. The arms which satisfy the constraint $g_1(\nu(\cdot))\leq\tau$ are called feasible arms; the set of feasible arms is denoted by $\mathcal{K}(\nu)$. The instance $(\nu,\tau)$ is said to be feasible if $\mathcal{K}(\nu)\neq\emptyset$, and infeasible if $\mathcal{K}(\nu)=\emptyset$.

Consider first a feasible instance. An optimal arm in this case is defined as an arm that minimizes $g_0(\nu(\cdot))$, subject to the constraint $g_1(\nu(\cdot))\leq\tau$. Denote the optimal value as $g_0^*=\min_{k\in\mathcal{K}(\nu)}g_0(\nu(k))$. Arms which have $g_0(\nu(\cdot))$ larger than $g_0^*$ (whether or not feasible) are referred to as suboptimal arms. Note that there can also exist infeasible arms with a smaller objective $g_0$ than $g_0^*$. We refer to such arms as deceiver arms; the set of deceiver arms is denoted by $\mathcal{K}^d(\nu)$, where $\mathcal{K}^d(\nu)=\{k\in[K]:g_1(\nu(k))>\tau,\ g_0(\nu(k))\leq g_0^*\}$. For a suboptimal arm $k$, the suboptimality gap is defined by $\Delta(k):=g_0(k)-g_0^*>0$. (Note that the suboptimality gap is also defined for infeasible, non-deceiver arms.) Finally, for an infeasible arm $k$, the infeasibility gap is defined as $\Delta_\tau(k)=g_1(k)-\tau>0$. Figure 1(a) provides a visual representation of a feasible instance.
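The taxonomy above (feasible optimal and suboptimal arms, deceivers, and infeasible non-deceivers) can be summarized in a short sketch; the four-arm instance below is hypothetical, chosen purely for illustration.

```python
def classify_arms(g0, g1, tau):
    # Classify the arms of a feasible instance, given their attribute
    # values g0 (objective) and g1 (constraint) and the threshold tau.
    feasible = [k for k in range(len(g0)) if g1[k] <= tau]
    g0_star = min(g0[k] for k in feasible)  # optimal value over feasible arms
    labels = []
    for k in range(len(g0)):
        if g1[k] <= tau:
            labels.append("optimal" if g0[k] == g0_star else "suboptimal")
        elif g0[k] <= g0_star:
            labels.append("deceiver")  # infeasible, but objective looks better
        else:
            labels.append("infeasible non-deceiver")
    return labels

# Hypothetical 4-arm instance with tau = 1.0
labels = classify_arms(g0=[0.5, 0.9, 0.3, 0.8], g1=[0.7, 0.9, 1.5, 1.2], tau=1.0)
```

Here arm 2 is a deceiver: its objective value $0.3$ beats the feasible optimum $0.5$, but its constraint attribute $1.5$ violates the threshold $\tau=1$.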

Next, consider an infeasible instance. The optimal arm in this case is defined as the one with the smallest value of $g_1(\nu(\cdot))$. Let the optimal value be denoted by $g_1^*=\min_{k\in[K]}g_1(\nu(k))>\tau$. This is equivalent to requiring that if the algorithm is faced with an infeasible instance, it must ‘relax’ the constraint just enough that at least one arm satisfies it. The constraint gap for an arm $k$ that is not optimal is defined as $\Delta_{\text{con}}(k)=g_1(\nu(k))-g_1^*>0$. Figure 1(b) provides a visual representation of an infeasible instance.

For any (feasible or infeasible) instance $(\nu,\tau)$, let the set of optimal arms be denoted by $\mathcal{K}^*(\nu)$. The total number of pulls (or horizon) is denoted by $T$. For an algorithm (a.k.a., policy) $\pi$, the number of pulls of an arm $k$ over the first $t$ pulls, for $t\in[T]$, is denoted by $N_k^\pi(t)$, though we often suppress the dependence on $\pi$ for simplicity.

Figure 1: (a) Feasible instance. (b) Infeasible instance.

Consistency: We now define the notion of consistency of an algorithm in this setting. A policy $\pi$ is said to be consistent over a class of distributions $\mathcal{C}$, given a pre-specified constraint threshold $\tau$, if for all instances $(\nu,\tau)$ such that $\nu\in\mathcal{C}^K$, it holds that $\mathbb{E}[N_k(T)]=o(T^a)$ for all $a>0$ and for all (non-optimal) arms in $[K]\setminus\mathcal{K}^*(\nu)$. This definition is in line with the definition of consistency in the classical unconstrained regret minimization setting (see Lai and Robbins [1985]).

Regret: The formal definition of regret in the present setting is as follows. For a feasible instance, there are two types of regret: suboptimality regret

$$R^{\mathsf{sub}}_T:=\sum_{k\in\mathcal{K}(\nu)\setminus\mathcal{K}^*(\nu)}\Delta(k)\,\mathbb{E}[N_k(T)],$$

which is the regret caused due to the sampling of feasible, suboptimal arms, and infeasibility regret

$$R^{\mathsf{inf}}_T:=\sum_{k\in\mathcal{K}(\nu)^c}\Delta_\tau(k)\,\mathbb{E}[N_k(T)],$$

which is the regret caused due to the sampling of infeasible arms. For an infeasible instance, regret is caused by playing arms that are farther from the constraint boundary than the optimal arm. In this case, we define the constraint regret as

$$R^{\mathsf{con}}_T:=\sum_{k\in[K]\setminus\mathcal{K}^*(\nu)}\Delta_{\text{con}}(k)\,\mathbb{E}[N_k(T)].$$

(Note that our analysis essentially provides upper and lower bounds on $\mathbb{E}[N_k(T)]$ for all arms in $[K]\setminus\mathcal{K}^*(\nu)$. So alternative definitions of regret, involving linear combinations of the expected number of pulls of non-optimal arms, are also supported by our analysis.)

Infeasibility identification: Next, we introduce an optional boolean flag called feasibility_flag, which the policy may set at the end of $T$ plays to indicate post facto whether it considered the instance feasible (by setting the flag to true) or infeasible (by setting the flag to false). For the algorithms proposed in this paper, we provide bounds on the probability that this flag erroneously marks a feasible instance as infeasible and vice versa. We also provide fundamental lower bounds on the probability of error under any consistent policy.

Concentration inequalities for attribute estimators: Naturally, MAB algorithms must estimate the attributes $g_0$ and $g_1$ corresponding to each arm using the data samples obtained by pulling that arm. We make the following assumption on the concentration properties of these estimators. Suppose that for $i\in\{0,1\}$ and distribution $F\in\mathcal{C}$, there exists an estimator $\hat{g}_{i,n}(F)$ of $g_i(F)$ using $n$ i.i.d. samples from $F$, satisfying the following concentration inequality: there exists $a_i>0$ such that for all $\Delta>0$,

$$\mathsf{Pr}\left(\left|\hat{g}_{i,n}(F)-g_i(F)\right|\geq\Delta\right)\leq 2\exp(-a_i n\Delta^2). \quad (1)$$

Concentration inequalities of this form are available in a broad variety of settings. For example, if $g_i(F)=\mathbb{E}[h_i(X)]$, where $X$ is a random vector distributed as $F$, then a concentration inequality of the form (1) is readily obtained if $h_i(X)$ is bounded (using the Hoeffding inequality) or $\sigma$-subGaussian. Similarly, if $h_i$ is Lipschitz and $X$ is a subGaussian random vector, concentration bounds of the form (1) can be obtained by invoking the results in Kontorovich [2014]. Several examples where risk measures can be concentrated in this manner are provided in Cassel et al. [2018]. Also, note that the specific form (1) of the concentration inequality is only assumed to simplify the description of our algorithms. Alternative forms of concentration guarantees (such as those known for the means of subexponential or heavy-tailed distributions) can also be supported by our algorithmic framework via trivial modifications to the confidence bounds.
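Inverting (1) is what produces the confidence radii used later: setting the right-hand side equal to a target failure probability $\delta$ and solving gives $\Delta=\sqrt{\log(2/\delta)/(a_i n)}$. A minimal numerical check of this inversion (the values of $T$, $n$, and $a$ below are illustrative):

```python
import math

def confidence_radius(n, a, delta):
    # Radius Delta such that Pr(|estimate - truth| >= Delta) <= delta,
    # obtained by solving 2 * exp(-a * n * Delta^2) = delta for Delta.
    return math.sqrt(math.log(2.0 / delta) / (a * n))

# With delta = 1/T^2, this recovers the radius sqrt(log(2*T^2) / (a*n))
# appearing in Con-LCB's lower confidence bounds.
T, n, a = 1000, 50, 2.0
r = confidence_radius(n, a, delta=1.0 / T**2)
```

The choice $\delta=1/T^2$ makes the per-round failure probability small enough that a union bound over all $T$ rounds still vanishes with the horizon.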

3 Constrained regret minimization

In this section, we present an algorithm for constrained regret minimization, and provide performance guarantees for the same, assuming that we have estimators for each attribute that satisfy the concentration inequality (1).

The algorithm, which we refer to as the constrained lower confidence bound (Con-LCB) algorithm, is based on the well-known principle of optimism under uncertainty. Con-LCB uses lower confidence bounds (LCBs) on attribute $g_1$ of each arm to maintain a set of plausibly feasible arms, and uses LCBs on attribute $g_0$ of the arms in this set to select the arm to be played. Note that LCBs are used for attribute $g_0$ because we are dealing with minimization of $g_0$. If, at some instant, the set of plausibly feasible arms maintained by Con-LCB becomes empty, the algorithm turns conservative and plays the arm which violates the constraint least, i.e., the one with the smallest LCB on $g_1$. Finally, at the end of $T$ rounds, Con-LCB sets the feasibility flag to true if the set of plausibly feasible arms is non-empty, and false otherwise.

Algorithm 1 Con-LCB
procedure Con-LCB($T,K,\tau$)
     Play each arm once
     for $t=K+1,\cdots,T$ do
          Set $\hat{\mathcal{K}}_t=\left\{k:\hat{g}_{1,N_k(t-1)}(k)-\sqrt{\frac{\log(2T^2)}{a_1 N_k(t-1)}}\leq\tau\right\}$
          if $\hat{\mathcal{K}}_t\neq\varnothing$ then
               Set $\mathrm{L}_0(k)=\hat{g}_{0,N_k(t-1)}(k)-\sqrt{\frac{\log(2T^2)}{a_0 N_k(t-1)}}$
               Play arm $k_t^\dagger\in\arg\min_{k\in\hat{\mathcal{K}}_t}\mathrm{L}_0(k)$
          else
               Set $\mathrm{L}_1(k)=\hat{g}_{1,N_k(t-1)}(k)-\sqrt{\frac{\log(2T^2)}{a_1 N_k(t-1)}}$
               Play arm $k_t^\dagger\in\arg\min_{k\in[K]}\mathrm{L}_1(k)$
          end if
     end for
     if $\hat{\mathcal{K}}_T\neq\varnothing$ then
          Set feasibility_flag = true
     else
          Set feasibility_flag = false
     end if
end procedure
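Algorithm 1 can be rendered as a short simulation sketch. Everything below is an illustrative assumption rather than part of the paper's formal specification: the arm simulator, the Gaussian noise level $\sigma=0.5$, and the constants $a_0=a_1=2$ (which match the tail bound (1) for $\sigma=0.5$ Gaussian attributes).

```python
import math
import random

def con_lcb(pull, K, T, tau, a0=2.0, a1=2.0):
    # Minimal sketch of Con-LCB. pull(k) returns one sample (x0, x1) of the
    # two attributes of arm k; empirical means estimate g_0 and g_1.
    n = [0] * K                      # pulls per arm
    s0 = [0.0] * K; s1 = [0.0] * K   # running attribute sums
    def sample(k):
        x0, x1 = pull(k)
        n[k] += 1; s0[k] += x0; s1[k] += x1
    def radius(k, a):                # confidence radius sqrt(log(2T^2)/(a n))
        return math.sqrt(math.log(2 * T * T) / (a * n[k]))
    for k in range(K):               # play each arm once
        sample(k)
    for _ in range(K, T):
        plausible = [k for k in range(K) if s1[k] / n[k] - radius(k, a1) <= tau]
        if plausible:                # optimism on g_0 within the plausible set
            k = min(plausible, key=lambda k: s0[k] / n[k] - radius(k, a0))
        else:                        # all arms look infeasible: least violation
            k = min(range(K), key=lambda k: s1[k] / n[k] - radius(k, a1))
        sample(k)
    flag = any(s1[k] / n[k] - radius(k, a1) <= tau for k in range(K))
    return n, flag

# Hypothetical 3-arm Gaussian instance with tau = 1.0:
# arm 0 feasible optimal, arm 1 a deceiver, arm 2 feasible suboptimal.
random.seed(1)
means = [(0.2, 0.5), (0.0, 1.5), (0.6, 0.8)]   # (g_0, g_1) per arm
pull = lambda k: (random.gauss(means[k][0], 0.5), random.gauss(means[k][1], 0.5))
pulls, flag = con_lcb(pull, K=3, T=2000, tau=1.0)
```

On this instance, arm 1 is a deceiver ($g_0=0$ but $g_1=1.5>\tau$) and arm 2 is feasible but suboptimal, so arm 0 should receive the bulk of the $T$ pulls and the flag should read true.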

The remainder of this section is devoted to performance guarantees for Con-LCB. We consider feasible and infeasible instances separately.

Theorem 3.1

Consider a feasible instance. Under Con-LCB, the expected number of pulls of a feasible but suboptimal arm $k$ (i.e., satisfying $g_0(\nu(k))>g_0^*$ and $g_1(\nu(k))\leq\tau$) is bounded as

$$\mathbb{E}[N_k(T)]\leq\frac{4\log(2T^2)}{a_0\Delta^2(k)}+5.$$

The expected number of pulls of a deceiver arm $k$ (i.e., satisfying $g_0(\nu(k))\leq g_0^*$ and $g_1(\nu(k))>\tau$) is bounded as

$$\mathbb{E}[N_k(T)]\leq\frac{4\log(2T^2)}{a_1[\Delta_\tau(k)]^2}+2.$$

The expected number of pulls of an arm $k$ which is an infeasible non-deceiver (i.e., satisfying $g_0(\nu(k))>g_0^*$ and $g_1(\nu(k))>\tau$) is bounded as

$$\mathbb{E}[N_k(T)]\leq\min\left(\frac{4\log(2T^2)}{a_0\Delta^2(k)},\frac{4\log(2T^2)}{a_1[\Delta_\tau(k)]^2}\right)+5.$$

The probability of incorrectly setting the feasibility_flag is upper bounded by

$$\mathsf{Pr}\left(\texttt{feasibility\_flag}=\texttt{false}\right)\leq\frac{1}{T}.$$

The main takeaways from Theorem 3.1 are as follows.

• The upper bound on the expected number of pulls for feasible, suboptimal arms is logarithmic in the horizon $T$, and inversely proportional to the square of the suboptimality gap. This is similar to bounds known for classical (unconstrained) MABs.

• For deceiver arms, the upper bound on the expected number of pulls, also logarithmic in $T$, is inversely proportional to the square of the infeasibility gap. This is similar to the bound one would obtain in a pure $g_1$ minimization problem for a hypothetical instance consisting of all the originally infeasible arms, and a single hypothetical (optimal) arm having $g_1$ equal to $\tau$.

• The bound on the expected number of pulls of non-deceiver, infeasible arms involves a minimum of the dominant terms in the above two cases. Intuitively, this is because these arms can be disambiguated in two ways: via the suboptimality gap, and via the infeasibility gap.

• The probability that the feasibility flag incorrectly identifies the instance as infeasible is upper bounded by a power law in the horizon $T$. Note that the specific form of the misidentification probability bound is not fundamental; a small modification to the algorithm would make this bound a faster-decaying power law, at the expense of a multiplicative bump in regret. However, the fact that this probability is not much smaller (for example, exponentially decaying in the horizon $T$) is a consequence of an inherent tension between regret minimization and feasibility identification. A similar tension is known to exist between regret minimization and best arm identification in the unconstrained setting; see Bubeck et al. [2009]. We provide lower bounds on the probability that any consistent algorithm makes a mistake in feasibility identification in Section 4.

Finally, we note that the suboptimality regret as well as the infeasibility regret are logarithmic in the horizon $T$. Indeed, the suboptimality regret for a feasible instance is bounded as

$$R^{\mathsf{sub}}_T\leq\sum_{k\in\mathcal{K}(\nu)\setminus\mathcal{K}^*(\nu)}\left(\frac{4\log(2T^2)}{a_0\Delta(k)}+5\Delta(k)\right),$$

which is similar to regret bounds in the classical unconstrained setting. To express the infeasibility regret compactly, let us interpret $\Delta(k)$ as $0$ (and $1/\Delta(k)$ as $\infty$) for deceiver arms. With this notation, the infeasibility regret of a feasible instance is bounded as

$$R^{\mathsf{inf}}_T\leq\sum_{k\in\mathcal{K}(\nu)^c}5\Delta_\tau(k)+\min\left(\frac{4\Delta_\tau(k)\log(2T^2)}{a_0\Delta^2(k)},\frac{4\log(2T^2)}{a_1\Delta_\tau(k)}\right).$$

Next, we move on to characterizing the performance of Con-LCB over infeasible instances.

Theorem 3.2

Consider an infeasible instance. Under Con-LCB, the expected number of pulls of a non-optimal arm $k$ is bounded as

$$\mathbb{E}[N_k(T)]\leq\frac{4\log(2T^2)}{a_1[\Delta_{\text{con}}(k)]^2}+K+2.$$

Moreover, the probability that the algorithm incorrectly flags the instance as feasible is bounded as $\mathsf{Pr}\left(\texttt{feasibility\_flag}=\texttt{true}\right)\leq\frac{K}{T}$ for $T>T^*(\nu)$, where $T^*(\nu)$ is an instance-dependent constant.

For an infeasible instance, the upper bound on the expected number of pulls of a non-optimal arm, logarithmic in the horizon and inversely proportional to the square of the constraint gap $\Delta_{\text{con}}(k)$, is structurally similar to the bound one would obtain in a pure $g_1$ minimization problem on the same instance. However, note that when faced with an infeasible instance, Con-LCB only starts playing the optimal arm regularly after all $K$ arms appear to be infeasible; this explains the appearance of $K$ in our bounds.

Here, the constraint regret is bounded as

$$R^{\mathsf{con}}_T\leq\sum_{k\in[K]\setminus\mathcal{K}^*(\nu)}(K+2)\Delta_{\text{con}}(k)+\frac{4\log(2T^2)}{a_1\Delta_{\text{con}}(k)}.$$

Finally, as before, the probability that the feasibility flag wrongly identifies the infeasible instance as feasible decays as a power law in the horizon for $T>T^*(\nu)$; the threshold $T^*$ accounts for the time it takes for the algorithm to ‘detect’ that the instance is infeasible with high probability.

4 Information theoretic lower bounds

In this section, we establish fundamental limits on the performance of algorithms for constrained regret minimization. First, we show that the regret bounds obtained for Con-LCB are asymptotically tight up to universal multiplicative constants (on a class of Gaussian bandit instances). We then prove a lower bound on the probability that any consistent algorithm misidentifies a feasible instance as infeasible or vice versa. This result illustrates an inherent tension between regret minimization and feasibility identification: consistent algorithms (recall that consistency means regret is $o(T^a)$ for all $a>0$) cannot have a misidentification probability that decays exponentially in the horizon, and algorithms whose misidentification probability decays exponentially in the horizon cannot be consistent.

To state our information theoretic lower bound on regret, suppose that the class of arm distributions $\mathcal{G}$ is the class of 2-dimensional Gaussian distributions with covariance matrix $\Sigma=\mathrm{diag}(1/(2a_0),\,1/(2a_1))$. Let attribute $g_0$ be the mean of the first dimension, and $g_1$ be the mean of the second dimension. For the assumed covariance matrix structure, the standard empirical mean estimators for $g_0$ and $g_1$ satisfy the concentration properties stated in (1).

Theorem 4.1

Let $\pi$ be any consistent policy over the class of distributions $\mathcal{G}$, given the threshold $\tau\in\mathbb{R}$. Consider a feasible instance $(\nu^f,\tau)$, where $\nu^f\in\mathcal{G}^K$. For any feasible but suboptimal arm $k$,

$$\liminf_{T\to\infty}\frac{\mathbb{E}[N_k^\pi(T)]}{\log(T)}\geq\frac{1}{a_0\Delta^2(k)}.$$

For any deceiver arm $k$,

$$\liminf_{T\to\infty}\frac{\mathbb{E}[N_k^\pi(T)]}{\log(T)}\geq\frac{1}{a_1[\Delta_\tau(k)]^2}.$$

Finally, for any infeasible non-deceiver arm $k$,

$$\liminf_{T\to\infty}\frac{\mathbb{E}[N_k^\pi(T)]}{\log(T)}\geq\frac{1}{2}\min\left(\frac{1}{a_0\Delta^2(k)},\frac{1}{a_1[\Delta_\tau(k)]^2}\right).$$

Similarly, for an infeasible instance $(\nu^i,\tau)$ such that $\nu^i\in\mathcal{G}^K$, and any non-optimal arm $k$,

$$\liminf_{T\to\infty}\frac{\mathbb{E}[N_k^\pi(T)]}{\log(T)}\geq\frac{1}{a_1[\Delta_{\text{con}}(k)]^2}.$$

The proof of Theorem 4.1 can be found in Appendix B. Comparing the lower bounds in Theorem 4.1 with the upper bounds for Con-LCB in Theorems 3.1 and 3.2, we conclude that Con-LCB is asymptotically optimal on regret, up to universal multiplicative constants. (Information theoretic lower bounds on the expected number of pulls of non-optimal arms can be derived, in terms of a certain optimization over KL divergences, for any general class of arm distributions; see Appendix B.1. We have specialized the statement of Theorem 4.1 to a class of Gaussian bandit instances to enable an easy comparison with our upper bounds for Con-LCB.)

Next, we address the fundamental tradeoff between regret minimization and feasibility identification.

Theorem 4.2

Consider a space of arm distributions $\mathcal{C}$, and a threshold $\tau$, such that $\mathcal{C}$ contains both feasible as well as infeasible arm distributions. There exists a feasible instance $(\nu,\tau)$ and an infeasible instance $(\nu',\tau)$ such that for any policy $\pi$ that is consistent over $\mathcal{C}$,

$$\limsup_{T\to\infty}-\frac{1}{T}\log\Big(\mathbb{P}_\nu\left(\texttt{feasibility\_flag}=\texttt{false}\right)+\mathbb{P}_{\nu'}\left(\texttt{feasibility\_flag}=\texttt{true}\right)\Big)\leq 0.$$

Theorem 4.2 states that for any consistent algorithm, the probability that $(\nu,\tau)$ gets misidentified as infeasible and the probability that $(\nu',\tau)$ gets misidentified as feasible cannot both decay exponentially in the horizon. This is of course consistent with the power-law probability of misidentification under Con-LCB. In other words, slower-than-exponential decay of the probability of feasibility misidentification with respect to the horizon is an unavoidable consequence of the exploration-exploitation interplay in regret minimization. A similar tension between regret minimization and best arm identification was noted for unconstrained MABs by Bubeck et al. [2009].

5 General framework for constrained MABs

In this section, we provide a general formulation for constrained stochastic MABs with multiple criteria. We allow each arm to be associated with a $D$-dimensional probability distribution, the goal being to optimize one (dominant) attribute associated with this distribution, subject to constraints on $m$ others. The algorithm design and performance evaluation carried out in Sections 2-4 for the special case of $m=1$ extend naturally to this general formulation, which can in turn be applied in a variety of application scenarios. We illustrate a few here.

\bullet For clinical trials, with the arms corresponding to various treatment protocols, the dominant attribute might, for example, correspond to the success/recovery probability, whereas the constraints might capture recovery time, severity of side effects, etc.

\bullet For product/service rating, where the arms correspond to various service providers, the dominant attribute might correspond to product quality, with constraints capturing reliability, pricing, customer service, etc.

\bullet In wireless networks, the arms might correspond to various access networks or channels, with, for example, the dominant attribute corresponding to throughput, and constraints capturing delay, energy efficiency, etc.

Formulation: Consider a set of KK arms, each associated with a DD-dimensional probability distribution, with ν(k)\nu(k) denoting the joint distribution corresponding to arm k[K].k\in[K]. Suppose that ν(k)𝒞,\nu(k)\in\mathcal{C}, the space of possible arm distributions. The objective and the constraints are defined by functions g0,g1,,gm.g_{0},g_{1},\cdots,g_{m}. Specifically, the optimal arm is defined as the arm kk that minimizes g0(ν(k)),g_{0}(\nu(k)), subject to the constraints {gi(ν(k))τi}i=1m\{g_{i}(\nu(k))\leq\tau_{i}\}_{i=1}^{m} when the instance is feasible (i.e., at least one arm exists that satisfies the above constraints).

If the instance is infeasible, the optimal arm is defined via a ranked list of the constraints, which orders them by ‘importance’. Without loss of generality, assume that the order of importance increases from (g1,τ1)(g_{1},\tau_{1}) to (gm,τm).(g_{m},\tau_{m}). The idea is to relax the constraints one by one, starting with the least important, until a compliant arm is found. Formally, for a given infeasible instance ν\nu, for 2im,2\leq i\leq m, let 𝒦i(ν)\mathcal{K}_{i}(\nu) denote the set of arms that satisfy the constraints {gj(ν(k))τj}j=im.\{g_{j}(\nu(k))\leq\tau_{j}\}_{j=i}^{m}. Let us also define 𝒦m+1(ν):=[K].\mathcal{K}_{m+1}(\nu):=[K]. Now, let i(ν)=min{k[m]:𝒦k+1(ν)}.i^{*}(\nu)=\min\{k\in[m]:\ \mathcal{K}_{k+1}(\nu)\neq\varnothing\}. Here, i(ν)i^{*}(\nu) is the fewest number of constraints one must relax, in order of increasing importance, to obtain at least one compliant arm. An optimal arm is then defined to be argmin{gi(ν)(ν(k)):k𝒦i(ν)+1(ν)}.\operatorname*{arg\,min}\{g_{i^{*}(\nu)}(\nu(k)):\ k\in\mathcal{K}_{i^{*}(\nu)+1}(\nu)\}.
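The ranked-constraint rule above can be made concrete in a few lines of code. The sketch below is our reconstruction (not the paper's code): given the true attribute values, `g[k][i]` holds gi(ν(k))g_{i}(\nu(k)) for i = 0, ..., m, and `tau[i-1]` is the threshold for constraint i.

```python
# Sketch of the optimal-arm definition under ranked constraints.
def optimal_arm(g, tau):
    K, m = len(g), len(tau)

    def satisfies_from(k, i):   # does arm k satisfy constraints i, ..., m?
        return all(g[k][j] <= tau[j - 1] for j in range(i, m + 1))

    feasible = [k for k in range(K) if satisfies_from(k, 1)]
    if feasible:                # feasible instance: minimize g_0
        return min(feasible, key=lambda k: g[k][0])
    for i_star in range(1, m + 1):   # relax constraints, least important first
        candidates = [k for k in range(K) if satisfies_from(k, i_star + 1)]
        if candidates:          # minimize the most important relaxed attribute
            return min(candidates, key=lambda k: g[k][i_star])
```

Note that `satisfies_from(k, m + 1)` is vacuously true, matching the convention 𝒦m+1(ν)=[K],\mathcal{K}_{m+1}(\nu)=[K], so the loop always terminates with an arm.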

Similar to the case where m=1,m=1, we make the following assumption to simplify algorithm design. Suppose that for 0im,0\leq i\leq m, and ν¯𝒞,\overline{\nu}\in\mathcal{C}, there exists an estimator g^i,n(ν¯)\hat{g}_{i,n}(\overline{\nu}) for gi(ν¯)g_{i}(\overline{\nu}) using nn i.i.d. samples from ν¯,\overline{\nu}, satisfying the following concentration inequality: There exists ai>0a_{i}>0 such that for all Δ>0,\Delta>0,

𝖯𝗋(|g^i,n(ν¯)gi(ν¯)|Δ)2exp(ainΔ2).{\mathsf{Pr}}\left(\left|\hat{g}_{i,n}(\overline{\nu})-g_{i}(\overline{\nu})\right|\geq\Delta\right)\leq 2\exp(-a_{i}n\Delta^{2}). (2)

Note that this is simply an extension of our assumption (1).
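As a concrete illustration of when this assumption holds: if an attribute is the mean of a [0,1]-bounded coordinate and g^i,n\hat{g}_{i,n} is the empirical mean, Hoeffding's inequality gives inequality (2) with ai=2a_{i}=2. The Monte Carlo check below (our illustration, with arbitrary values of n and Δ) verifies that the empirical violation frequency stays below the bound.

```python
# Hoeffding's inequality for an empirical mean of [0,1]-bounded samples:
#   Pr(|mean_n - mu| >= Delta) <= 2 exp(-2 n Delta^2),
# i.e. inequality (2) with a_i = 2.
import math
import random

random.seed(0)
n, Delta, runs = 50, 0.15, 20000
mu = 0.5                       # Bernoulli(1/2) samples
violations = 0
for _ in range(runs):
    mean_n = sum(random.random() < mu for _ in range(n)) / n
    violations += abs(mean_n - mu) >= Delta
bound = 2 * math.exp(-2 * n * Delta ** 2)
# the observed violation frequency is below the Hoeffding bound
```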

Algorithm and performance guarantees: For the above problem formulation, we propose Multi-Con-LCB, a simple extension of the Con-LCB algorithm presented earlier for the case of m=1.m=1. Multi-Con-LCB uses upper confidence bounds on the constrained attributes {gi(ν(k))}i=1m\{g_{i}(\nu(k))\}_{i=1}^{m} of each arm kk to maintain a set of plausibly feasible arms. The set of arms plausibly feasible with respect to the constraint on gig_{i} at time t[T]t\in[T] is denoted by 𝒦^i,t\hat{\mathcal{K}}_{i,t}^{\dagger} for i[m].i\in[m]. Intersecting these sets yields the set of arms that plausibly lie in 𝒦(ν),\mathcal{K}(\nu), denoted by 𝒦^t\hat{\mathcal{K}}_{t} for t[T].t\in[T]. Formally, 𝒦^t=i=1m𝒦^i,t.\hat{\mathcal{K}}_{t}=\cap_{i=1}^{m}\hat{\mathcal{K}}_{i,t}^{\dagger}. As this set might be empty, for i{2,,m},i\in\{2,\cdots,m\}, let the estimate for 𝒦i(ν)\mathcal{K}_{i}(\nu) be 𝒦^i,t=j=im𝒦^j,t.\hat{\mathcal{K}}_{i,t}=\cap_{j=i}^{m}\hat{\mathcal{K}}_{j,t}^{\dagger}. Since even the most important constraint may be violated, we also let 𝒦^m+1,t=[K].\hat{\mathcal{K}}_{m+1,t}=[K]. If the set 𝒦^t\hat{\mathcal{K}}_{t} is not empty, the algorithm uses lower confidence bounds (LCBs) on g0g_{0} to select the arm to be played. If 𝒦^t\hat{\mathcal{K}}_{t} is empty, the algorithm finds the smallest index i^\hat{i}^{*} such that 𝒦^i^+1,t\hat{\mathcal{K}}_{\hat{i}^{*}+1,t} is not empty, and plays the arm with the lowest LCB on gi^.g_{\hat{i}^{*}}. Finally, at the end of TT rounds, Multi-Con-LCB sets the feasibility flag to true if the set 𝒦^T\hat{\mathcal{K}}_{T} is not empty, and to false otherwise. The details are presented as Algorithm 2.

Algorithm 2 Multiple Constrained LCB
procedure Multi-Con-LCB(T,K,{τi}i=1mT,K,\{\tau_{i}\}_{i=1}^{m})
     Play each arm once
     for t=K+1,,Tt=K+1,\cdots,T do
         for i=1,,mi=1,\cdots,m do
              Set 𝒦^i,t={k:g^i,Nk(t1)(k)τi+log(2T2)aiNk(t1)}\text{Set }\hat{\mathcal{K}}_{i,t}^{\dagger}=\left\{k:\hat{g}_{i,N_{k}(t-1)}(k)\leq\tau_{i}+\sqrt{\frac{\log(2T^{2})}{a_{i}N_{k}(t-1)}}\right\}
         end for
         Set 𝒦^t=i=1m𝒦^i,t\text{Set }\hat{\mathcal{K}}_{t}=\cap_{i=1}^{m}\hat{\mathcal{K}}_{i,t}^{\dagger}
         if 𝒦^t\hat{\mathcal{K}}_{t}\neq\varnothing then
              kt+1argmink𝒦^tg^0,Nk(t1)(k)log(2T2)a0Nk(t1)k_{t+1}^{\dagger}\in\operatorname*{arg\,min}_{k\in\hat{\mathcal{K}}_{t}}\hat{g}_{0,N_{k}(t-1)}(k)-\sqrt{\frac{\log(2T^{2})}{a_{0}N_{k}(t-1)}}
              Play arm kt+1\text{Play arm }k_{t+1}^{\dagger}
         else
               i^=min{i{1,,m}:𝒦^i+1,t}\hat{i}^{*}=\min\{i\in\{1,\cdots,m\}:\ \hat{\mathcal{K}}_{i+1,t}\neq\varnothing\}
              kt+1argmink𝒦^i^+1,tg^i^,Nk(t1)(k)log(2T2)ai^Nk(t1)k_{t+1}^{\dagger}\in\operatorname*{arg\,min}_{k\in\hat{\mathcal{K}}_{\hat{i}^{*}+1,t}}\hat{g}_{\hat{i}^{*},N_{k}(t-1)}(k)-\sqrt{\frac{\log(2T^{2})}{a_{\hat{i}^{*}}N_{k}(t-1)}}
              Play arm kt+1\text{Play arm }k_{t+1}^{\dagger}
         end if
     end for
     if 𝒦^T\hat{\mathcal{K}}_{T}\neq\varnothing then
         Set feasibility_flag = true
     else
         Set feasibility_flag = false
     end if
end procedure

The remainder of this section is devoted to performance guarantees for Multi-Con-LCB. The suboptimality gap of an arm kk is given by Δ(k)=max(g0(ν(k))g0,0).\Delta(k)=\max(g_{0}(\nu(k))-g_{0}^{*},0). The infeasibility gap of an arm kk for constraint ii is given by Δi,τi(k)=max(gi(ν(k))τi,0).\Delta_{i,\tau_{i}}(k)=\max(g_{i}(\nu(k))-\tau_{i},0). We restrict our attention to feasible instances here; infeasible instances can be handled along similar lines to Theorem 3.2 for Con-LCB.

Theorem 5.1

Consider a feasible instance. Under Multi-Con-LCB, the expected number of pulls of a feasible but suboptimal arm kk (i.e., satisfying g0(ν(k))>g0g_{0}(\nu(k))>g_{0}^{*} and gi(ν(k))τig_{i}(\nu(k))\leq\tau_{i} for all i[m]i\in[m]), is bounded by

𝔼[Nk(T)]4log(2T2)a0Δ2(k)+2m+3.\displaystyle\mathbb{E}\left[N_{k}(T)\right]\leq\frac{4\log(2T^{2})}{a_{0}\Delta^{2}(k)}+2m+3.

The expected number of pulls of a deceiver arm kk (i.e., satisfying g0(ν(k))g0g_{0}(\nu(k))\leq g_{0}^{*} and there exists a constraint indexed by j[m]j\in[m] such that gj(ν(k))>τjg_{j}(\nu(k))>\tau_{j}) is bounded by

𝔼[Nk(T)]mini[m](4log(2T2)ai[Δi,τi(k)]2)+m.\displaystyle\mathbb{E}\left[N_{k}(T)\right]\leq\min_{i\in[m]}\left(\frac{4\log(2T^{2})}{a_{i}[\Delta_{i,\tau_{i}}(k)]^{2}}\right)+m.

The expected number of pulls of an arm kk which is infeasible, but not a deceiver (i.e., satisfying g0(ν(k))>g0g_{0}(\nu(k))>g_{0}^{*} and there exists a constraint indexed by j[m]j\in[m] such that gj(ν(k))>τjg_{j}(\nu(k))>\tau_{j}) is bounded by

𝔼[Nk(T)]min(4log(2T2)a0Δ2(k),mini[m](4log(2T2)ai[Δi,τi(k)]2))+2m+3.\displaystyle\mathbb{E}\left[N_{k}(T)\right]\leq\min\left(\frac{4\log(2T^{2})}{a_{0}\Delta^{2}(k)},\min_{i\in[m]}\left(\frac{4\log(2T^{2})}{a_{i}[\Delta_{i,\tau_{i}}(k)]^{2}}\right)\right)+2m+3.

The probability of incorrectly setting the feasibility_flag is upper bounded by

𝖯𝗋(feasibility_flag = false)mT.\displaystyle{\mathsf{Pr}}\left(\emph{{feasibility\_flag = false}}\right)\leq\frac{m}{T}.

We omit the proof of Theorem 5.1, since it is very similar to the proof of Theorem 3.1. We do, however, state the key takeaways from Theorem 5.1 below.

\bullet Similar to Theorem 3.1, the upper bound on the expected number of pulls of suboptimal arms is logarithmic in TT and is inversely proportional to the suboptimality gap squared. Moreover, the upper bound for a deceiver arm is also logarithmic in TT but there is a minimum over mm terms because the deceiver arm can be identified by any of the constraints that the arm does not satisfy. Similarly, the upper bound for a non-deceiver, suboptimal arm is also logarithmic in TT and there is a minimum over m+1m+1 terms because the arm can be identified as suboptimal, or infeasible for not satisfying one of the mm constraints.

\bullet With an increase in the number of constraints to mm, the upper bounds on the expected number of pulls of non-optimal arms increase linearly in mm and the probability of mis-identification also gets scaled by a factor of m.m. The slight degradation in the guarantees is expected because with more constraints, the optimal arm has to satisfy more conditions, and it becomes harder to compare different arms.

6 Numerical experiments

In this section, we present a simple experiment to show the numerical performance of our algorithm Con-LCB.

We consider an instance with four arms and two criteria. Each arm is associated with a (two-dimensional) multivariate Gaussian distribution, different arms having the same covariance matrix Σ,\Sigma, but different mean vectors. The means corresponding to the first dimension are 0.3, 0.4, 0.2, and 0.5 and the means corresponding to the second dimension are 0.4, 0.4, 1.0, and 1.0. The covariance matrix is given by Σ=(10.50.51).\Sigma=\begin{pmatrix}1&0.5\\ 0.5&1\end{pmatrix}.

The goal is to maximize the pulls of the arm with the minimum mean of the first dimension, subject to a constraint on the mean of the second. Specifically, we require that the mean of the second dimension should be less than or equal to τ:=0.5.\tau:=0.5. The instance is summarized in Table 1. We use empirical averages as the estimators for the means and standard confidence bounds based on a sub-Gaussianity assumption. We run the algorithm 1000 times for values of TT in [10000,100000].[10000,100000]. The suboptimality regret RTsubR_{T}^{\text{sub}} and the infeasibility regret RTinfR_{T}^{\text{inf}} are plotted in Figure 2. Clearly, the regrets grow sub-linearly with respect to the horizon. Also, in our experiments, the algorithm correctly detected the feasibility of the instance in all of the 1000 runs.

Table 1: Gaussian instance
Arm Mean 1 Mean 2 Remarks
1 0.3 0.4 Optimal arm
2 0.4 0.4 Feasible, suboptimal
3 0.2 1.0 Deceiver
4 0.5 1.0 Infeasible, suboptimal
Figure 2: Regret of Gaussian instance
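A quick sanity check (our illustration, with 0-indexed arms) recovers the ‘Remarks’ column of Table 1 from the true means and the threshold τ = 0.5:

```python
# Classify the arms of the Gaussian instance from their true means.
mean1 = [0.3, 0.4, 0.2, 0.5]   # objective attribute (to be minimized)
mean2 = [0.4, 0.4, 1.0, 1.0]   # constrained attribute
tau = 0.5
feasible = [k for k in range(4) if mean2[k] <= tau]
g0_star = min(mean1[k] for k in feasible)   # optimal value g_0^*
deceivers = [k for k in range(4) if mean2[k] > tau and mean1[k] <= g0_star]
```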

We consider another instance where arms are Beta distributed. The parameters are chosen in such a way that the means of the arms are 0.3, 0.4, 0.2, and 0.5, and the variances of the arms are 0.08, 0.06, 0.15, and 0.15. The goal here is to maximize the pulls of the arm with the minimum mean, subject to a constraint on the variance. Specifically, we require that the variance of the arms should be less than or equal to τ:=0.1.\tau:=0.1. The instance is summarized in Table 2. We use the empirical average to estimate the mean of the arms and the standard unbiased estimator of variance to estimate the variance of the arms. The concentration bounds we use are based on the fact that the underlying rewards are bounded between 0 and 1. We run the algorithm 100 times for values of TT in [10000,100000].[10000,100000]. The suboptimality regret RTsubR_{T}^{\text{sub}} and the infeasibility regret RTinfR_{T}^{\text{inf}} are plotted in Figure 3. Similar to the previous case, the algorithm correctly detected feasibility in all of the 100 runs.
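The stated moments pin down the Beta parameters: for mean m and variance v, with c = m(1-m)/v - 1, one has α = mc and β = (1-m)c, valid whenever v < m(1-m). The paper states only the moments; the parameters computed below are our reconstruction.

```python
# Recover Beta(alpha, beta) parameters from the (mean, variance) pairs
# of Table 2.
means = [0.3, 0.4, 0.2, 0.5]
variances = [0.08, 0.06, 0.15, 0.15]
params = []
for m_, v_ in zip(means, variances):
    c = m_ * (1 - m_) / v_ - 1
    params.append((m_ * c, (1 - m_) * c))   # (alpha, beta)
```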

We also compare Con-LCB with an algorithm designed to optimize a single ‘Lagrange relaxed’ objective of the form g0+βg1.g_{0}+\beta g_{1}. Since there is no systematic way of setting β\beta such that the original (constrained) optimal arm also minimizes this metric, this approach is very fragile, as we demonstrate in Figure 4. The instance we use is the Beta distributed instance in Table 2, and we plot the sum of the infeasibility regret and the suboptimality regret. When β\beta is small (β<10/7\beta<10/7 here), a deceiver arm is optimal for the relaxed objective, and when β\beta is large (β>5\beta>5 here), a suboptimal arm is optimal for the relaxed objective; the ‘correct’ range of β\beta is instance dependent (and a priori unknown).

Table 2: Beta instance
Arm Mean Variance Remarks
1 0.3 0.08 Optimal arm
2 0.4 0.06 Feasible, suboptimal
3 0.2 0.15 Deceiver
4 0.5 0.15 Infeasible, suboptimal
Figure 3: Regret of Beta distributed instance
Figure 4: Comparison with the Lagrangian relaxation
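The stated thresholds β = 10/7 and β = 5 follow from elementary arithmetic on the Table 2 moments, which the check below (our illustration, 0-indexed arms) confirms: the deceiver (arm 3) minimizes the relaxed objective iff 0.2 + 0.15β < 0.3 + 0.08β, i.e., β < 10/7, while the feasible suboptimal arm (arm 2) does iff 0.4 + 0.06β < 0.3 + 0.08β, i.e., β > 5.

```python
# Which arm minimizes the Lagrange-relaxed objective g0 + beta*g1
# on the Beta instance (g0 = mean, g1 = variance)?
means = [0.3, 0.4, 0.2, 0.5]
variances = [0.08, 0.06, 0.15, 0.15]

def relaxed_best(beta):
    vals = [m + beta * v for m, v in zip(means, variances)]
    return vals.index(min(vals))
# index 2 (deceiver) for beta < 10/7, index 0 (true optimal arm) for
# 10/7 < beta < 5, index 1 (feasible suboptimal) for beta > 5
```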

7 Concluding remarks

In this paper, we have introduced a general formulation of multi-criterion constrained MABs, which is applicable when there are multiple attributes of interest associated with each arm. We propose algorithms that incur logarithmic regret, and also provide information theoretic lower bounds on the performance of any algorithm. An interesting departure from the classical MAB formulation is the aspect of instance feasibility; our algorithms predict, post facto, with a probability of error that decays as a power law in the horizon, whether the instance presented was feasible. Interestingly, this illustrates a fundamental tradeoff between regret minimization and feasibility detection. Finally, our algorithms ‘auto-tune’ to an infeasible instance, relaxing the constraints on arm attributes (in order of increasing importance), until a compliant arm is found.

The proposed framework and our algorithms can be applied in a wide range of application scenarios (see Bouneffouf and Rish [2019] for a survey). Further, this work motivates the study of multi-criterion variants of other bandit formulations, including contextual bandits, combinatorial bandits, and correlated bandits.

8 Declarations

  • Funding: the authors did not receive support from any organization for the submitted work.

  • Conflicts of interest: the authors frequently collaborate with researchers from IIT Bombay (@iitb.ac.in), researchers from IIT Madras (@iitm.ac.in), and researchers from Stanford University (@stanford.edu). Jayakrishnan Nair received his PhD from Caltech (@caltech.edu) in 2012 and Krishna Jagannathan received his PhD from MIT (@mit.edu) in 2010.

  • Ethics approval: not applicable

  • Consent to participate: not applicable

  • Consent for publication: not applicable

  • Availability of data and material: not applicable

  • Code availability: the Python code for the numerical experiments can be found in this GitHub repository.

  • Author’s contributions: all authors contributed equally to the paper

References

  • Bouneffouf and Rish [2019] Djallel Bouneffouf and Irina Rish. A survey on practical applications of multi-armed and contextual bandits. CoRR, abs/1904.10040, 2019. URL http://arxiv.org/abs/1904.10040.
  • Sani et al. [2012] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
  • Vakili and Zhao [2016] Sattar Vakili and Qing Zhao. Risk-averse multi-armed bandit problems under mean-variance measure. IEEE Journal of Selected Topics in Signal Processing, 10(6):1093–1111, 2016.
  • Kagrecha et al. [2019] Anmol Kagrecha, Jayakrishnan Nair, and Krishna Jagannathan. Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards. In Advances in Neural Information Processing Systems, pages 11272–11281, 2019.
  • Bubeck et al. [2009] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer, 2009.
  • Bubeck and Cesa-Bianchi [2012] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
  • Lattimore and Szepesvári [2018] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, page 28, 2018.
  • Drugan and Nowe [2013] Madalina M Drugan and Ann Nowe. Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2013.
  • Yahyaa and Manderick [2015] Saba Yahyaa and Bernard Manderick. Thompson sampling for multi-objective multi-armed bandits problem. In Proceedings, page 47. Presses universitaires de Louvain, 2015.
  • Ehrgott [2005] Matthias Ehrgott. Multicriteria optimization, volume 491. Springer Science & Business Media, 2005.
  • Tekin and Turğay [2018] Cem Tekin and Eralp Turğay. Multi-objective contextual multi-armed bandit with a dominant objective. IEEE Transactions on Signal Processing, 66(14):3799–3813, 2018.
  • Tekin [2019] Cem Tekin. The biobjective multiarmed bandit: learning approximate lexicographic optimal allocations. Turkish Journal of Electrical Engineering & Computer Sciences, 27(2):1065–1080, 2019.
  • Galichet et al. [2013] Nicolas Galichet, Michele Sebag, and Olivier Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245–260, 2013.
  • David and Shimkin [2016] Yahel David and Nahum Shimkin. Pure exploration for max-quantile bandits. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 556–571. Springer, 2016.
  • Bhat and Prashanth [2019] Sanjay P Bhat and LA Prashanth. Concentration of risk measures: A wasserstein distance approach. In Advances in Neural Information Processing Systems, pages 11739–11748, 2019.
  • Prashanth et al. [2020] L. A. Prashanth, Krishna Jagannathan, and Ravi Kumar Kolla. Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. In International Conference on Machine Learning (To appear), 2020.
  • Zimin et al. [2014] Alexander Zimin, Rasmus Ibsen-Jensen, and Krishnendu Chatterjee. Generalized risk-aversion in stochastic multi-armed bandits. arXiv preprint arXiv:1405.0833, 2014.
  • Maillard [2013] Odalric-Ambrym Maillard. Robust risk-averse stochastic multi-armed bandits. In International Conference on Algorithmic Learning Theory, pages 218–233. Springer, 2013.
  • Cassel et al. [2018] Asaf Cassel, Shie Mannor, and Assaf Zeevi. A general approach to multi-armed bandits under risk criteria. arXiv preprint arXiv:1806.01380, 2018.
  • Pacchiano et al. [2021] Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, and Heinrich Jiang. Stochastic bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pages 2827–2835. PMLR, 2021.
  • Amani et al. [2019] Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Linear stochastic bandits under safety constraints. Advances in Neural Information Processing Systems, 32, 2019.
  • Moradipari et al. [2019] Ahmadreza Moradipari, Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Safe linear thompson sampling. arXiv preprint arXiv:1911.02156, 2019.
  • David et al. [2018] Yahel David, Balázs Szörényi, Mohammad Ghavamzadeh, Shie Mannor, and Nahum Shimkin. PAC bandits with risk constraints. In ISAIM, 2018.
  • Chang [2020] Hyeong Soo Chang. An asymptotically optimal strategy for constrained multi-armed bandit problems. Mathematical Methods of Operations Research, pages 1–13, 2020.
  • Lai and Robbins [1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
  • Kontorovich [2014] Aryeh Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. In International Conference on Machine Learning, pages 28–36, 2014.

Appendix A Upper bounds

In this section, we prove Theorems 3.1 and 3.2. The bounds in Subsections A.1–A.4 imply the statement of Theorem 3.1, and the bounds in Subsections A.5–A.6 imply the statement of Theorem 3.2.

A.1 Feasible instance: Upper bounding the expected pulls of deceiver arms

For a deceiver arm kk, we will define a good event G1,kG_{1,k} where the estimator for g1g_{1} is concentrated enough and derive an upper bound on the number of pulls of the deceiver arm.

G1,k={n[T]|g^1,n(k)g1(k)|<log(2T2)a1n}G_{1,k}=\left\{~{}\forall n\in[T]~{}|\hat{g}_{1,n}(k)-g_{1}(k)|<\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}n}}\right\} (3)

On G1,k,G_{1,k}, we can lower bound the estimator for g1g_{1} for the arm kk as follows

g^1,n(k)>g1(k)log(2T2)a1n\hat{g}_{1,n}(k)>g_{1}(k)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}n}}

If the lower bound is greater than τ+log(2T2)a1n,\tau+\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}n}}, then arm kk cannot be in 𝒦^t.\hat{\mathcal{K}}_{t}. Hence, we can upper bound the number of pulls of an infeasible arm as follows

g1(k)log(2T2)a1nτ+log(2T2)a1n\displaystyle g_{1}(k)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}n}}\leq\tau+\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}n}}
\displaystyle\Rightarrow nvk:=4log(2T2)a1Δτ2(k)\displaystyle n\leq v_{k}:=\left\lceil\frac{4\log(2T^{2})}{a_{1}\Delta_{\tau}^{2}(k)}\right\rceil (4)
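The algebra behind bound (4) can be verified numerically. The values of T, a₁, τ, and g₁(k) below are illustrative assumptions for the check, not from the paper: once n exceeds vₖ, the lower confidence bound on g₁ exceeds the inclusion threshold and the arm leaves the plausibly feasible set.

```python
# Check of bound (4): exclusion kicks in just above v_k.
import math

T, a1, tau, g1k = 10000, 2.0, 0.5, 0.8      # illustrative values
gap = g1k - tau                              # Delta_tau(k) = 0.3
x = 4 * math.log(2 * T * T) / (a1 * gap ** 2)
v_k = math.ceil(x)

def excluded(n):
    r = math.sqrt(math.log(2 * T * T) / (a1 * n))
    return g1k - r > tau + r                 # LCB above inclusion threshold
```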

Event G1,kcG_{1,k}^{c} is given by

G1,kc={n[T]|g^1,n(k)g1(k)|log(2T2)a1n}\displaystyle G_{1,k}^{c}=\left\{~{}\exists n\in[T]~{}|\hat{g}_{1,n}(k)-g_{1}(k)|\geq\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}n}}\right\}

Using our assumption in Equation 1 and a union bound, we can show

𝖯𝗋(G1,kc)T×1T2=1T.\displaystyle{\mathsf{Pr}}\left(G_{1,k}^{c}\right)\leq T\times\frac{1}{T^{2}}=\frac{1}{T}.

Now, let us upper bound the expected number of pulls of a deceiver arm k.k.

𝔼[Nk(T)]\displaystyle\mathbb{E}\left[N_{k}(T)\right] =𝔼[Nk(T)|G1,k]𝖯𝗋(G1,k)+𝔼[Nk(T)|G1,kc]𝖯𝗋(G1,kc)\displaystyle=\mathbb{E}\left[N_{k}(T)|G_{1,k}\right]{\mathsf{Pr}}\left(G_{1,k}\right)+\mathbb{E}\left[N_{k}(T)|G_{1,k}^{c}\right]{\mathsf{Pr}}\left(G_{1,k}^{c}\right)
4log(2T2)a1Δτ2(k)+T×1T\displaystyle\leq\left\lceil\frac{4\log(2T^{2})}{a_{1}\Delta_{\tau}^{2}(k)}\right\rceil+T\times\frac{1}{T}
4log(2T2)a1Δτ2(k)+2\displaystyle\leq\frac{4\log(2T^{2})}{a_{1}\Delta_{\tau}^{2}(k)}+2

A.2 Feasible instance: Upper bounding the expected pulls of feasible suboptimal arms

We will begin by showing that a feasible arm kk remains in the set 𝒦^t\hat{\mathcal{K}}_{t} for t[T]t\in[T] when the estimator of g1g_{1} is concentrated enough. We define an event G1,kG_{1,k} for a feasible arm as in Equation (3). When G1,kG_{1,k} holds, the estimator for g1g_{1} is upper bounded by

g^1,n(k)\displaystyle\hat{g}_{1,n}(k)\leq g1(k)+log(2T2)a1n\displaystyle g_{1}(k)+\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}n}}
\displaystyle\leq τ+log(2T2)a1n(g1(k)τ)\displaystyle\tau+\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}n}}\quad(\because~{}g_{1}(k)\leq\tau)

Hence, arm kk is in 𝒦^t\hat{\mathcal{K}}_{t} for t[T]t\in[T] when G1,kG_{1,k} holds.

We are considering the case where an arm kk is feasible but suboptimal. We will define a good event for arm kk and bound the number of pulls on this good event. Without loss of generality, assume arm 1 is optimal.

G0,k={g0(1)>maxn[T]g^0,n(1)log(2T2)a0n}{g^0,uk(k)log(2T2)a0uk>g0(1)}G1,1G1,kwhere uk=4log(2T2)a0Δ2(k)G_{0,k}=\left\{g_{0}(1)>\max_{n\in[T]}\hat{g}_{0,n}(1)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}n}}\right\}\\ \cap\left\{\hat{g}_{0,u_{k}}(k)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}u_{k}}}>g_{0}(1)\right\}\cap G_{1,1}\cap G_{1,k}\\ \text{where }u_{k}=\left\lceil\frac{4\log(2T^{2})}{a_{0}\Delta^{2}(k)}\right\rceil (5)

We will show that if G0,kG_{0,k} holds, then Nk(T)ukN_{k}(T)\leq u_{k}. We will also show that G0,kcG_{0,k}^{c} holds with a small probability.

The proof is by contradiction. Suppose G0,kG_{0,k} holds and Nk(T)>ukN_{k}(T)>u_{k}, then there exists a t[T]t\in[T] such that Nk(t1)=ukN_{k}(t-1)=u_{k} and At=k.A_{t}=k. Using the definition of G0,k,G_{0,k},

LCBk(t1)\displaystyle\text{LCB}_{k}(t-1) =g^0,uk(k)log(2T2)a0uk\displaystyle=\hat{g}_{0,u_{k}}(k)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}u_{k}}}
>g0(1)(Definition of G0,k)\displaystyle>g_{0}(1)\qquad(\text{Definition of }G_{0,k})
>LCB1(t1)(Definition of G0,k)\displaystyle>\text{LCB}_{1}(t-1)\qquad(\text{Definition of }G_{0,k})

Hence, At=argminjLCBj(t1)k,A_{t}=\operatorname*{arg\,min}_{j}\text{LCB}_{j}(t-1)\neq k, which is a contradiction. Therefore, if G0,kG_{0,k} occurs, Nk(T)uk.N_{k}(T)\leq u_{k}.

Now, consider the event G0,kc.G_{0,k}^{c}.

G0,kc={g0(1)maxn[T]g^0,n(1)log(2T2)a0n}{g^0,uk(k)log(2T2)a0ukg0(1)}G1,1cG1,kc.G_{0,k}^{c}=\left\{g_{0}(1)\leq\max_{n\in[T]}\hat{g}_{0,n}(1)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}n}}\right\}\cup\\ \left\{\hat{g}_{0,u_{k}}(k)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}u_{k}}}\leq g_{0}(1)\right\}\cup G_{1,1}^{c}\cup G_{1,k}^{c}.

Let us bound the probability of the first term above.

𝖯𝗋(g0(1)maxn[T]g^0,n(1)log(2T2)a0n)\displaystyle{\mathsf{Pr}}\left(g_{0}(1)\leq\max_{n\in[T]}\hat{g}_{0,n}(1)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}n}}\right)
\displaystyle\leq 𝖯𝗋(n=1T{g0(1)g^0,n(1)log(2T2)a0n})\displaystyle{\mathsf{Pr}}\left(\bigcup_{n=1}^{T}\left\{g_{0}(1)\leq\hat{g}_{0,n}(1)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}n}}\right\}\right)
\displaystyle\leq T×1T2=1T(Using Equation 1 and union bound)\displaystyle T\times\frac{1}{T^{2}}=\frac{1}{T}\quad(\text{Using Equation~{}\ref{eq:gen_conc_ineq_m1} and union bound})

Now, let us bound the probability of the second event above. By the choice of uku_{k} we have the following

Δ(k)log(2T2)a0uklog(2T2)a0uk\displaystyle\Delta(k)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}u_{k}}}\geq\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}u_{k}}}

Now,

𝖯𝗋(g^0,uk(k)log(2T2)a0ukg0(1))\displaystyle{\mathsf{Pr}}\left(\hat{g}_{0,u_{k}}(k)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}u_{k}}}\leq g_{0}(1)\right)
=\displaystyle= 𝖯𝗋(g0(k)g^0,uk(k)Δ(k)log(2T2)a0uk)(Use Δ(k)=g0(k)g0(1))\displaystyle{\mathsf{Pr}}\left(g_{0}(k)-\hat{g}_{0,u_{k}}(k)\geq\Delta(k)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}u_{k}}}\right)\quad(\text{Use }\Delta(k)=g_{0}(k)-g_{0}(1))
\displaystyle\leq 𝖯𝗋(g0(k)g^0,uk(k)log(2T2)a0uk)(By the choice of uk)\displaystyle{\mathsf{Pr}}\left(g_{0}(k)-\hat{g}_{0,u_{k}}(k)\geq\sqrt{\frac{\log\left(2T^{2}\right)}{a_{0}u_{k}}}\right)\quad(\text{By the choice of }u_{k})
\displaystyle\leq 1T2(Using Equation 1)\displaystyle\frac{1}{T^{2}}\quad(\text{Using Equation~{}}\ref{eq:gen_conc_ineq_m1})

We can show that 𝖯𝗋(G1,1c)1/T{\mathsf{Pr}}\left(G_{1,1}^{c}\right)\leq\nicefrac{{1}}{{T}} and 𝖯𝗋(G1,kc)1/T{\mathsf{Pr}}\left(G_{1,k}^{c}\right)\leq\nicefrac{{1}}{{T}} like in the previous subsection. Hence,

𝖯𝗋(G0,kc)3T+1T2.\displaystyle{\mathsf{Pr}}\left(G_{0,k}^{c}\right)\leq\frac{3}{T}+\frac{1}{T^{2}}.

Hence, we can upper bound the number of pulls of feasible but suboptimal arms as follows

𝔼[Nk(T)]=\displaystyle\mathbb{E}\left[N_{k}(T)\right]= 𝔼[Nk(T)|G0,k]𝖯𝗋(G0,k)+𝔼[Nk(T)|G0,kc]𝖯𝗋(G0,kc)\mathbb{E}\left[N_{k}(T)|G_{0,k}\right]{\mathsf{Pr}}\left(G_{0,k}\right)+\mathbb{E}\left[N_{k}(T)|G_{0,k}^{c}\right]{\mathsf{Pr}}\left(G_{0,k}^{c}\right)
\displaystyle\leq 4log(2T2)a0Δ2(k)+T×(3T+1T2)\left\lceil\frac{4\log(2T^{2})}{a_{0}\Delta^{2}(k)}\right\rceil+T\times\left(\frac{3}{T}+\frac{1}{T^{2}}\right)
\displaystyle\leq 4log(2T2)a0Δ2(k)+5.\frac{4\log(2T^{2})}{a_{0}\Delta^{2}(k)}+5.

A.3 Feasible instance: Upper bounding the expected pulls of infeasible suboptimal arms

Consider arm kk which is both suboptimal and infeasible. Define an event G0,kG_{0,k} as done in Equation (5). Recall that the upper bound on the pulls of the deceiver arms on event G1,kG_{1,k} is denoted by vkv_{k} and the upper bound on the pulls of the feasible but suboptimal arms on event G0,kG_{0,k} is denoted by uk.u_{k}.

On the event G0,k,G_{0,k}, if ukvk,u_{k}\geq v_{k}, then due to the concentration of the estimator of g1g_{1}, this arm cannot be played more than vkv_{k} times. If uk<vk,u_{k}<v_{k}, then due to suboptimality, this arm cannot be played more than uku_{k} times. We can show that the probability of G0,kcG_{0,k}^{c} is at most 3/T+1/T2\nicefrac{{3}}{{T}}+\nicefrac{{1}}{{T^{2}}} as before. Hence, we can upper bound the pulls of infeasible and suboptimal arms as follows

𝔼[Nk(T)]=\displaystyle\mathbb{E}\left[N_{k}(T)\right]= 𝔼[Nk(T)|G0,k]𝖯𝗋(G0,k)+𝔼[Nk(T)|G0,kc]𝖯𝗋(G0,kc)\mathbb{E}\left[N_{k}(T)|G_{0,k}\right]{\mathsf{Pr}}\left(G_{0,k}\right)+\mathbb{E}\left[N_{k}(T)|G_{0,k}^{c}\right]{\mathsf{Pr}}\left(G_{0,k}^{c}\right)
min(uk,vk)+T×(3T+1T2)\displaystyle\leq\min(u_{k},v_{k})+T\times\left(\frac{3}{T}+\frac{1}{T^{2}}\right)
min(uk,vk)+4\displaystyle\leq\min(u_{k},v_{k})+4

A.4 Feasible instance: Upper bounding the probability of misidentification

Feasibility is correctly detected if, for at least one of the feasible arms, the event G1,kG_{1,k} defined in Equation (3) holds. We saw in Subsection A.2 that if the estimator of g1g_{1} is concentrated enough, a feasible arm always remains in the set 𝒦^t\hat{\mathcal{K}}_{t} of plausibly feasible arms.

Without loss of generality, assume that arm 1 is optimal. Then, we can lower bound the probability of correctly setting the flag as follows

𝖯𝗋(feasibility_flag = true)\displaystyle{\mathsf{Pr}}\left(\texttt{feasibility\_flag = true}\right) 𝖯𝗋(G1,1)\displaystyle\geq{\mathsf{Pr}}\left(G_{1,1}\right)
11T.\displaystyle\geq 1-\frac{1}{T}.

This yields the following upper bound on the probability of incorrectly setting the flag

𝖯𝗋(feasibility_flag = false)1T.{\mathsf{Pr}}\left(\texttt{feasibility\_flag = false}\right)\leq\frac{1}{T}.

A.5 Infeasible instance: Upper bounding the expected pulls of non-optimal arms

In this section, we discuss the case when the given instance is infeasible. As defined before, the optimal choice for the algorithm is to play the arm with minimum g1g_{1}. We will upper bound the number of pulls of arms that have g1g_{1} greater than the minimum.

For all the arms, we define good events G1,kG_{1,k} as in Equation (3), under which the estimator for g1g_{1} is concentrated enough. Let r=k=1KG1,k.\mathcal{E}_{r}=\cap_{k=1}^{K}G_{1,k}. When the event r\mathcal{E}_{r} occurs, the set 𝒦^t\hat{\mathcal{K}}_{t} becomes empty after at most k=1Kvk\sum_{k=1}^{K}v_{k} pulls, where vkv_{k} is defined in Equation (4). The analysis is similar to that given in Subsection A.1.

Once the set 𝒦^t\hat{\mathcal{K}}_{t} becomes empty, the algorithm starts pulling arms with the minimum lower confidence bound on the estimator of g1g_{1}. We will upper bound the number of pulls of the non-optimal arms. Without loss of generality, assume that arm 1 has the lowest g1g_{1}. As we are dealing with an infeasible instance, g1(1)>τ.g_{1}(1)>\tau. For a non-optimal arm k,k, we define the following good event

Gr,k={g^1,wk(k)log(2T2)a1wk>g1(1)}r\displaystyle G_{r,k}=\left\{\hat{g}_{1,w_{k}}(k)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}w_{k}}}>g_{1}(1)\right\}\cap\mathcal{E}_{r}
where wk=4log(2T2)a1Δcon2(k).\displaystyle\text{where }w_{k}=\left\lceil\frac{4\log(2T^{2})}{a_{1}\Delta_{\text{con}}^{2}(k)}\right\rceil.

One can check that wkw_{k} is greater than vkv_{k} because the gap Δcon(k)=g1(k)g1(1)\Delta_{\text{con}}(k)=g_{1}(k)-g_{1}(1) is smaller than Δτ(k)=g1(k)τ.\Delta_{\tau}(k)=g_{1}(k)-\tau. It is easy to argue, using LCB-based arguments, that if Gr,kG_{r,k} occurs, then arm kk cannot be pulled more than wkw_{k} times. This is similar to the proof given in Subsection A.2.

Let us upper bound the probability of Gr,kc.G_{r,k}^{c}. Event Gr,kcG_{r,k}^{c} is given by

Gr,kc={g^1,wklog(2T2)a1wkg1(1)}rc.\displaystyle G_{r,k}^{c}=\left\{\hat{g}_{1,w_{k}}-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}w_{k}}}\leq g_{1}(1)\right\}\cup\mathcal{E}_{r}^{c}.

Using the analysis in Subsection A.1, we can show 𝖯𝗋(rc)K/T.{\mathsf{Pr}}\left(\mathcal{E}_{r}^{c}\right)\leq\nicefrac{{K}}{{T}}. Let us bound the probability of the first term above. By the choice of wk,w_{k}, we have

Δcon(k)log(2T2)a1wklog(2T2)a1wk\displaystyle\Delta_{\text{con}}(k)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}w_{k}}}\geq\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}w_{k}}}

Now,

𝖯𝗋(g^1,wklog(2T2)a1wkg1(1))\displaystyle{\mathsf{Pr}}\left(\hat{g}_{1,w_{k}}-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}w_{k}}}\leq g_{1}(1)\right)
=\displaystyle= 𝖯𝗋(g1(k)g^1,wkΔcon(k)log(2T2)a1wk)(Use Δcon(k)=g1(k)g1(1))\displaystyle{\mathsf{Pr}}\left(g_{1}(k)-\hat{g}_{1,w_{k}}\geq\Delta_{\text{con}}(k)-\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}w_{k}}}\right)\quad(\text{Use }\Delta_{\text{con}}(k)=g_{1}(k)-g_{1}(1))
\displaystyle\leq 𝖯𝗋(g1(k)g^1,wklog(2T2)a1wk)(By the choice of wk)\displaystyle{\mathsf{Pr}}\left(g_{1}(k)-\hat{g}_{1,w_{k}}\geq\sqrt{\frac{\log\left(2T^{2}\right)}{a_{1}w_{k}}}\right)\quad(\text{By the choice of }w_{k})
\displaystyle\leq 1T2(Using Equation 1)\displaystyle\frac{1}{T^{2}}\quad(\text{Using Equation~{}}\ref{eq:gen_conc_ineq_m1})

Hence, we can upper bound 𝖯𝗋(Gr,kc){\mathsf{Pr}}\left(G_{r,k}^{c}\right) using a union bound

𝖯𝗋(Gr,kc)KT+1T2K+1T.\displaystyle{\mathsf{Pr}}\left(G_{r,k}^{c}\right)\leq\frac{K}{T}+\frac{1}{T^{2}}\leq\frac{K+1}{T}.

When the instance is infeasible, the expected number of pulls of each non-optimal arm kk is upper bounded by

𝔼[Nk(T)]\displaystyle\mathbb{E}\left[N_{k}(T)\right] =𝔼[Nk(T)|Gr,k]𝖯𝗋(Gr,k)+𝔼[Nk(T)|Gr,kc]𝖯𝗋(Gr,kc)\displaystyle=\mathbb{E}\left[N_{k}(T)|G_{r,k}\right]{\mathsf{Pr}}\left(G_{r,k}\right)+\mathbb{E}\left[N_{k}(T)|G_{r,k}^{c}\right]{\mathsf{Pr}}\left(G_{r,k}^{c}\right)
4log(2T2)a1Δcon2(k)(1K+1T)+T×K+1T\displaystyle\leq\left\lceil\frac{4\log(2T^{2})}{a_{1}\Delta_{\text{con}}^{2}(k)}\right\rceil\left(1-\frac{K+1}{T}\right)+T\times\frac{K+1}{T}
4log(2T2)a1Δcon2(k)+K+2\displaystyle\leq\frac{4\log(2T^{2})}{a_{1}\Delta_{\text{con}}^{2}(k)}+K+2
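As a quick numerical check of the bound above, one can verify that the conditional decomposition with 𝖯𝗋(Gr,kc)(K+1)/T{\mathsf{Pr}}(G_{r,k}^{c})\leq(K+1)/T is indeed dominated by the simplified closed form. A small Python sketch with illustrative parameter values (not from the paper):

```python
import math

def decomposed_bound(T, K, a1, gap):
    # E[N_k(T)] <= ceil(w_k) * Pr(G_{r,k}) + T * Pr(G_{r,k}^c),
    # with Pr(G_{r,k}^c) <= (K + 1) / T.
    w = math.ceil(4 * math.log(2 * T**2) / (a1 * gap**2))
    p_bad = (K + 1) / T
    return w * (1 - p_bad) + T * p_bad

def closed_form_bound(T, K, a1, gap):
    # The simplified bound: ceil(x) <= x + 1 is absorbed into K + 2.
    return 4 * math.log(2 * T**2) / (a1 * gap**2) + K + 2

# Illustrative values (not from the paper).
T, K, a1, gap = 10_000, 5, 0.5, 0.3
assert decomposed_bound(T, K, a1, gap) <= closed_form_bound(T, K, a1, gap)
```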

A.6 Infeasible instance: Upper bounding the probability of misidentification

In this subsection, we will upper bound the probability of incorrectly setting the feasibility_flag.

First, we define TT^{*} to be the minimum value of TT for which the following holds:

T>k=1Kvkk=1K4log(2T2)a1Δτ2(k).\displaystyle T>\sum_{k=1}^{K}v_{k}\geq\sum_{k=1}^{K}\left\lceil\frac{4\log(2T^{2})}{a_{1}\Delta_{\tau}^{2}(k)}\right\rceil.

TT^{*} exists because Clog(T)=o(T)C\log(T)=o(T) for any fixed C(0,).C\in(0,\infty).
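The definition of TT^{*} can be made concrete by a short computation: scan horizons until TT exceeds the total exploration budget kvk\sum_{k}v_{k}. A Python sketch with an illustrative infeasible instance (the gaps Δτ(k)\Delta_{\tau}(k) and the constant a1a_{1} are made up for illustration):

```python
import math

def total_exploration(T, a1, tau_gaps):
    # Sum of v_k over arms: pulls before the plausibly feasible set
    # K_hat_t empties; tau_gaps holds Delta_tau(k) = g1(k) - tau.
    return sum(math.ceil(4 * math.log(2 * T**2) / (a1 * g**2))
               for g in tau_gaps)

def t_star(a1, tau_gaps):
    # Smallest horizon with T > sum_k v_k; it exists because the
    # right-hand side grows only logarithmically in T.
    T = 1
    while T <= total_exploration(T, a1, tau_gaps):
        T += 1
    return T

# Illustrative infeasible instance (values not from the paper).
t = t_star(0.5, [0.5, 0.8, 1.0])
assert t > total_exploration(t, 0.5, [0.5, 0.8, 1.0])
```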

For all the arms, we define good events G1,kG_{1,k} as in Equation (3), under which the estimator for g1g_{1} is sufficiently concentrated. Let r=k=1KG1,k.\mathcal{E}_{r}=\cap_{k=1}^{K}G_{1,k}. On event r,\mathcal{E}_{r}, for t>T,t>T^{*}, the set of plausibly feasible arms 𝒦^t\hat{\mathcal{K}}_{t} remains empty.

We showed in the previous subsection that 𝖯𝗋(r)1KT.{\mathsf{Pr}}\left(\mathcal{E}_{r}\right)\geq 1-\frac{K}{T}. Hence, we can bound the probability of incorrectly setting the feasibility_flag as

𝖯𝗋(feasibility_flag = true)KT for T>T.\displaystyle{\mathsf{Pr}}\left(\texttt{feasibility\_flag = true}\right)\leq\frac{K}{T}\text{ for }T>T^{*}.

Appendix B Lower bounds

In this section, we prove Theorems 4.1 and 4.2. To prove Theorem 4.1, we first prove a result that lower bounds the expected pulls of non-optimal arms when the arm distributions belong to a general distribution class and the attributes are not necessarily means. Theorem 4.1 is proved in Subsection B.1, Theorem 4.2 is proved in Subsection B.2, and the subsequent subsections prove the intermediate results.

The proof technique for deriving lower bounds in the constrained bandit setting is very similar to that used in the standard stochastic bandit setting. We begin by stating some important results that will be used later in the proofs.

We first state the divergence decomposition lemma for the constrained bandit setting. The proof is similar to that of the divergence decomposition for the standard setting, and we leave it to the reader to verify the result (see Lemma 15.1, Lattimore and Szepesvári [2018]).

Lemma 1

Consider two instances (ν,τ)(\nu,\tau) and (ν,τ),(\nu^{\prime},\tau), where ν\nu and ν\nu^{\prime} belong to a space of distributions 𝒞K.\mathcal{C}^{K}. Fix some policy π\pi and let ν=(ν,τ),π\mathbb{P}_{\nu}=\mathbb{P}_{(\nu,\tau),\pi} and ν=(ν,τ),π\mathbb{P}_{\nu^{\prime}}=\mathbb{P}_{(\nu^{\prime},\tau),\pi} be the probability measures on the constrained bandit model induced by the TT-round interconnection between π\pi and (ν,τ)(\nu,\tau) (respectively, π\pi and (ν,τ)(\nu^{\prime},\tau)). Then

KL(ν,ν)=k=1K𝔼ν[Nk(T)]KL(Pk,Pk)\displaystyle\text{KL}(\mathbb{P}_{\nu},\mathbb{P}_{\nu^{\prime}})=\sum_{k=1}^{K}\mathbb{E}_{\nu}\left[N_{k}(T)\right]\text{KL}(P_{k},P_{k}^{\prime}) (6)

We also state the high probability Pinsker inequality (see Theorem 14.2, Lattimore and Szepesvári [2018]).

Lemma 2

Let PP and QQ be probability measures on the same measurable space (Ω,)(\Omega,\mathcal{F}) and let AA\in\mathcal{F} be an arbitrary event. Then,

P(A)+Q(Ac)12exp(KL(P,Q)),\displaystyle P(A)+Q(A^{c})\geq\frac{1}{2}\exp(-\text{KL}(P,Q)),

where Ac=ΩAA^{c}=\Omega\setminus A is the complement of A.
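Lemma 2 (the Bretagnolle–Huber inequality) can be checked numerically on the simplest measurable space: two Bernoulli measures with A={1}A=\{1\}, so that P(A)=pP(A)=p and Q(Ac)=1qQ(A^{c})=1-q. A minimal Python verification over a parameter grid (the grid values are arbitrary):

```python
import math

def kl_bernoulli(p, q):
    # KL divergence between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1.
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Verify P(A) + Q(A^c) >= (1/2) exp(-KL(P, Q)) with A = {1}.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    for q in (0.1, 0.3, 0.5, 0.7, 0.9):
        lhs = p + (1 - q)
        rhs = 0.5 * math.exp(-kl_bernoulli(p, q))
        assert lhs >= rhs, (p, q)
```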

B.1 Lower bounding the pulls of non-optimal arms for the Gaussian case

We first state an instance-dependent lower bound for the expected pulls of non-optimal arms under consistent algorithms for any class of distributions 𝒞\mathcal{C} and a threshold τ.\tau\in\mathbb{R}. For a feasible instance (νf,τ),(\nu^{f},\tau), where νf𝒞K,\nu^{f}\in\mathcal{C}^{K}, define, for each non-optimal arm k,k,

ηf(νf(k),g0,τ,𝒞)=infν(k)𝒞{KL(νf(k),ν(k)):g0(ν(k))<g0,g1(ν(k))τ}.\eta^{f}(\nu^{f}(k),g_{0}^{*},\tau,\mathcal{C})=\inf_{\nu^{\prime}(k)\in\mathcal{C}}\{\text{KL}(\nu^{f}(k),\nu^{\prime}(k)):g_{0}(\nu^{\prime}(k))<g_{0}^{*},~{}g_{1}(\nu^{\prime}(k))\leq\tau\}.

Similarly, for an infeasible instance (νi,τ),(\nu^{i},\tau), where νi𝒞K,\nu^{i}\in\mathcal{C}^{K}, define, for each non-optimal arm k,k,

ηi(νi(k),g1,𝒞)=infν(k)𝒞{KL(νi(k),ν(k)):g1(ν(k))<g1}.\eta^{i}(\nu^{i}(k),g_{1}^{*},\mathcal{C})=\inf_{\nu^{\prime}(k)\in\mathcal{C}}\{\text{KL}(\nu^{i}(k),\nu^{\prime}(k)):g_{1}(\nu^{\prime}(k))<g_{1}^{*}\}.
Theorem B.1

Let π\pi be a consistent policy over the class of distributions 𝒞\mathcal{C} given the threshold τ.\tau\in\mathbb{R}. Then for a feasible instance (νf,τ),(\nu^{f},\tau), where νf𝒞K,\nu^{f}\in\mathcal{C}^{K}, for any non-optimal arm k,k,

lim infT𝔼νf[Nk(T)]log(T)1ηf(νf(k),g0,τ,𝒞).\displaystyle\liminf_{T\to\infty}\frac{\mathbb{E}_{\nu^{f}}\left[N_{k}(T)\right]}{\log(T)}\geq\frac{1}{\eta^{f}(\nu^{f}(k),g_{0}^{*},\tau,\mathcal{C})}.

For an infeasible instance (νi,τ),(\nu^{i},\tau), where νi𝒞K,\nu^{i}\in\mathcal{C}^{K}, for any non-optimal arm k,k,

lim infT𝔼νi[Nk(T)]log(T)1ηi(νi(k),g1,𝒞).\displaystyle\liminf_{T\to\infty}\frac{\mathbb{E}_{\nu^{i}}\left[N_{k}(T)\right]}{\log(T)}\geq\frac{1}{\eta^{i}(\nu^{i}(k),g_{1}^{*},\mathcal{C})}.

The bounds in Subsections B.3 and B.4 imply Theorem B.1. The main message here is that the ‘hardness’ of a non-optimal arm kk, which dictates the minimum number of pulls required on average to distinguish it from the optimal arm, is characterized by the reciprocal of the smallest perturbation, in terms of the KL divergence KL(ν(k),)\text{KL}(\nu(k),\cdot), needed to ‘make’ the arm optimal.

Now, we will use Theorem B.1 to prove Theorem 4.1.

Consider two d-dimensional Gaussian distributions ν1\nu_{1} and ν2\nu_{2} with means μ1\mu_{1} and μ2,\mu_{2}, and covariances Σ1\Sigma_{1} and Σ2.\Sigma_{2}. The KL divergence between these distributions is given by

KL(ν1,ν2)=log(|Σ2|1/2|Σ1|1/2)+tr(Σ21Σ1)d+(μ1μ2)TΣ21(μ1μ2)2.\text{KL}(\nu_{1},\nu_{2})=\log\left(\frac{|\Sigma_{2}|^{1/2}}{|\Sigma_{1}|^{1/2}}\right)+\frac{\text{tr}(\Sigma_{2}^{-1}\Sigma_{1})-d+(\mu_{1}-\mu_{2})^{T}\Sigma_{2}^{-1}(\mu_{1}-\mu_{2})}{2}. (7)

This formula is straightforward to derive from the definition of the KL divergence.

Let 𝒢\mathcal{G} be the class of 2-dimensional Gaussian distributions with covariance matrix Σ=diagonal(1/2a0,1/2a1).\Sigma=\text{diagonal}(\nicefrac{{1}}{{2a_{0}}},\nicefrac{{1}}{{2a_{1}}}). Let attribute g0g_{0} be the mean of the first dimension, and g1g_{1} be the mean of the second dimension. For ν1,ν2𝒢,\nu_{1},\nu_{2}\in\mathcal{G}, Equation 7 gives

KL(ν1,ν2)=a0(g0(ν1)g0(ν2))2+a1(g1(ν1)g1(ν2))2.\text{KL}(\nu_{1},\nu_{2})=a_{0}(g_{0}(\nu_{1})-g_{0}(\nu_{2}))^{2}+a_{1}(g_{1}(\nu_{1})-g_{1}(\nu_{2}))^{2}. (8)
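Equation (8) follows from Equation (7): with equal diagonal covariances the log-determinant and trace terms cancel, leaving only the quadratic mean term. A short numerical check in Python (the attribute values below are chosen arbitrarily):

```python
def kl_gauss_diag(mu1, mu2, var):
    # Equation (7) specialized to equal diagonal covariances: the
    # log-determinant and trace terms cancel, leaving the mean term.
    return sum((m1 - m2) ** 2 / (2 * v)
               for m1, m2, v in zip(mu1, mu2, var))

# Sigma = diag(1/(2 a0), 1/(2 a1)); attribute values chosen arbitrarily.
a0, a1 = 2.0, 3.0
mu1, mu2 = (0.4, 0.7), (0.1, 0.2)
var = (1 / (2 * a0), 1 / (2 * a1))

lhs = kl_gauss_diag(mu1, mu2, var)
rhs = a0 * (mu1[0] - mu2[0]) ** 2 + a1 * (mu1[1] - mu2[1]) ** 2  # Eq. (8)
assert abs(lhs - rhs) < 1e-12
```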

With 𝒞=𝒢\mathcal{C}=\mathcal{G} and attributes as defined in Theorem 4.1, we can use Equation 8 to calculate ηf(νf(k),g0,τ,𝒞)\eta^{f}(\nu^{f}(k),g_{0}^{*},\tau,\mathcal{C}) and ηi(νi(k),g1,𝒞).\eta^{i}(\nu^{i}(k),g_{1}^{*},\mathcal{C}). In particular, for a feasible but suboptimal arm k,k, we have

ηf(νf(k),g0,τ,𝒞)=a0Δ2(k),\displaystyle\eta^{f}(\nu^{f}(k),g_{0}^{*},\tau,\mathcal{C})=a_{0}\Delta^{2}(k),

for a deceiver arm k,k, we have

ηf(νf(k),g0,τ,𝒞)=a1[Δτ(k)]2,\displaystyle\eta^{f}(\nu^{f}(k),g_{0}^{*},\tau,\mathcal{C})=a_{1}[\Delta_{\tau}(k)]^{2},

and for an infeasible arm kk which is not a deceiver, we have

ηf(νf(k),g0,τ,𝒞)=a0Δ2(k)+a1[Δτ(k)]22max(a0Δ2(k),a1[Δτ(k)]2).\displaystyle\eta^{f}(\nu^{f}(k),g_{0}^{*},\tau,\mathcal{C})=a_{0}\Delta^{2}(k)+a_{1}[\Delta_{\tau}(k)]^{2}\leq 2\max(a_{0}\Delta^{2}(k),a_{1}[\Delta_{\tau}(k)]^{2}).

Finally, for a non-optimal arm kk for an infeasible instance, we have

ηi(νi(k),g1,𝒞)=a1[Δcon(k)]2.\displaystyle\eta^{i}(\nu^{i}(k),g_{1}^{*},\mathcal{C})=a_{1}[\Delta_{\text{con}}(k)]^{2}.

Using the results above and combining them with Theorem B.1, we get Theorem 4.1.
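The gap characterizations above are simple enough to encode directly. A small Python sketch, with made-up gap values, that also verifies the sum-versus-max bound for an infeasible non-deceiver arm:

```python
def eta_feasible_suboptimal(a0, gap):
    # Feasible but suboptimal arm: shift g0 past g0^*; gap = Delta(k).
    return a0 * gap ** 2

def eta_deceiver(a1, tau_gap):
    # Deceiver arm: shift g1 past tau; tau_gap = Delta_tau(k).
    return a1 * tau_gap ** 2

def eta_infeasible_nondeceiver(a0, a1, gap, tau_gap):
    # Both attributes must move, so the two KL costs add.
    return a0 * gap ** 2 + a1 * tau_gap ** 2

# Made-up gaps; check the sum-vs-max bound used in the paper.
a0, a1, gap, tau_gap = 1.0, 2.0, 0.4, 0.3
s = eta_infeasible_nondeceiver(a0, a1, gap, tau_gap)
assert s <= 2 * max(eta_feasible_suboptimal(a0, gap),
                    eta_deceiver(a1, tau_gap))
```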

B.2 Disambiguating between feasible and infeasible instances

Consider a feasible instance (ν,τ)(\nu,\tau) where Arm 1 is the only feasible arm and is therefore the only optimal arm. Let g1=mink{2,,K}g1(k)g_{1}^{\dagger}=\min_{k\in\{2,\cdots,K\}}g_{1}(k) be the minimum value of g1g_{1} over the set of infeasible arms. For Arm 1, we define

η(ν(1),g1,𝒞)=infν(1)𝒞{KL(ν(1),ν(1)):g1(ν(1))>g1}.\displaystyle\eta(\nu(1),g_{1}^{\dagger},\mathcal{C})=\inf_{\nu^{\prime}(1)\in\mathcal{C}}\{\text{KL}(\nu^{\prime}(1),\nu(1)):g_{1}(\nu^{\prime}(1))>g_{1}^{\dagger}\}.

Consider another instance (ν,τ)(\nu^{\prime},\tau) where ν(j)=ν(j)\nu^{\prime}(j)=\nu(j) for j1j\neq 1 and ν(1)𝒞\nu^{\prime}(1)\in\mathcal{C} such that KL(ν(1),ν(1))d1+ε\text{KL}(\nu^{\prime}(1),\nu(1))\leq d_{1}+\varepsilon and g1(ν(1))>g1,g_{1}(\nu^{\prime}(1))>g_{1}^{\dagger}, where d1=η(ν(1),g1,𝒞).d_{1}=\eta(\nu(1),g_{1}^{\dagger},\mathcal{C}). Using the divergence decomposition lemma, we have KL(ν,ν)𝔼ν[N1(T)](d1+ε)\text{KL}(\mathbb{P}_{\nu^{\prime}},\mathbb{P}_{\nu})\leq\mathbb{E}_{\nu^{\prime}}\left[N_{1}(T)\right](d_{1}+\varepsilon) and by Lemma 2 we have

ν(A)+ν(Ac)12exp(KL(ν,ν))12exp(𝔼ν[N1(T)](d1+ε)).\displaystyle\mathbb{P}_{\nu}\left(A\right)+\mathbb{P}_{\nu^{\prime}}\left(A^{c}\right)\geq\frac{1}{2}\exp(-\text{KL}(\mathbb{P}_{\nu^{\prime}},\mathbb{P}_{\nu}))\geq\frac{1}{2}\exp(-\mathbb{E}_{\nu^{\prime}}\left[N_{1}(T)\right](d_{1}+\varepsilon)).

Here, AA is the event {feasibility_flag = false}. Taking logarithms on both sides and rearranging gives

log(ν(A)+ν(Ac))Tlog(2)+(d1+ε)𝔼ν[N1(T)]T\displaystyle-\frac{\log(\mathbb{P}_{\nu}\left(A\right)+\mathbb{P}_{\nu^{\prime}}\left(A^{c}\right))}{T}\leq\frac{\log(2)+(d_{1}+\varepsilon)\mathbb{E}_{\nu^{\prime}}\left[N_{1}(T)\right]}{T}

The right-hand side goes to zero as TT goes to infinity. This follows from the definition of consistency and the fact that arm 1 is suboptimal for the instance (ν,τ).(\nu^{\prime},\tau). Hence, we have

lim supTlog(ν(A)+ν(Ac))T0\displaystyle\limsup_{T\to\infty}-\frac{\log(\mathbb{P}_{\nu}\left(A\right)+\mathbb{P}_{\nu^{\prime}}\left(A^{c}\right))}{T}\leq 0

This shows that, for at least one of the two instances, the probability of incorrect feasibility identification decays more slowly than exponentially in T.T.

B.3 Feasible instances

Consider a class of distributions 𝒞\mathcal{C} and a threshold τ.\tau\in\mathbb{R}. For a feasible instance (νf,τ),(\nu^{f},\tau), where νf𝒞K,\nu^{f}\in\mathcal{C}^{K}, define, for each non-optimal arm k,k,

ηf(νf(k),g0,τ,𝒞)=infν(k)𝒞{KL(νf(k),ν(k)):g0(ν(k))<g0,g1(ν(k))τ}.\eta^{f}(\nu^{f}(k),g_{0}^{*},\tau,\mathcal{C})=\inf_{\nu^{\prime}(k)\in\mathcal{C}}\{\text{KL}(\nu^{f}(k),\nu^{\prime}(k)):g_{0}(\nu^{\prime}(k))<g_{0}^{*},~{}g_{1}(\nu^{\prime}(k))\leq\tau\}.

We will show that

lim infT𝔼νf[Nk(T)]log(T)1ηf(νf(k),g0,τ,𝒞).\liminf_{T\to\infty}\frac{\mathbb{E}_{\nu^{f}}\left[N_{k}(T)\right]}{\log(T)}\geq\frac{1}{\eta^{f}(\nu^{f}(k),g_{0}^{*},\tau,\mathcal{C})}.
Proof

Let dk=ηf(νf(k),g0,τ,𝒞)d_{k}=\eta^{f}(\nu^{f}(k),g_{0}^{*},\tau,\mathcal{C}) and fix any ε>0.\varepsilon>0. Let (ν,τ)(\nu^{\prime},\tau) be a bandit instance with ν𝒞K\nu^{\prime}\in\mathcal{C}^{K} and ν(j)=νf(j)\nu^{\prime}(j)=\nu^{f}(j) for jk,j\neq k, such that KL(νf(k),ν(k))dk+ε,\text{KL}(\nu^{f}(k),\nu^{\prime}(k))\leq d_{k}+\varepsilon, g0(ν(k))<g0,g_{0}(\nu^{\prime}(k))<g_{0}^{*}, and g1(ν(k))τ.g_{1}(\nu^{\prime}(k))\leq\tau. Such a distribution ν(k)\nu^{\prime}(k) exists by the definition of dk.d_{k}. Note that arm kk is the unique optimal arm for bandit instance ν.\nu^{\prime}. Using the divergence decomposition lemma (Lemma 1), we have KL(νf,ν)𝔼νf[Nk(T)](dk+ε)\text{KL}(\mathbb{P}_{\nu^{f}},\mathbb{P}_{\nu^{\prime}})\leq\mathbb{E}_{\nu^{f}}\left[N_{k}(T)\right](d_{k}+\varepsilon) and by Lemma 2 we have

νf(A)+ν(Ac)12exp(KL(νf,ν))12exp(𝔼νf[Nk(T)](dk+ε)).\displaystyle\mathbb{P}_{\nu^{f}}\left(A\right)+\mathbb{P}_{\nu^{\prime}}\left(A^{c}\right)\geq\frac{1}{2}\exp(-\text{KL}(\mathbb{P}_{\nu^{f}},\mathbb{P}_{\nu^{\prime}}))\geq\frac{1}{2}\exp(-\mathbb{E}_{\nu^{f}}\left[N_{k}(T)\right](d_{k}+\varepsilon)).

Let event A={Nk(T)>T/2}.A=\{N_{k}(T)>\nicefrac{{T}}{{2}}\}.

𝔼νf[Nk(T)]+jk𝔼ν[Nj(T)]\displaystyle\mathbb{E}_{\nu^{f}}\left[N_{k}(T)\right]+\sum_{j\neq k}\mathbb{E}_{\nu^{\prime}}\left[N_{j}(T)\right] T2(νf(A)+ν(Ac)),\displaystyle\geq\frac{T}{2}(\mathbb{P}_{\nu^{f}}\left(A\right)+\mathbb{P}_{\nu^{\prime}}\left(A^{c}\right)),
T4exp(𝔼νf[Nk(T)](dk+ε)).\displaystyle\geq\frac{T}{4}\exp(-\mathbb{E}_{\nu^{f}}\left[N_{k}(T)\right](d_{k}+\varepsilon)).

Rearranging and taking the limit inferior we get

lim infT𝔼νf[Nk(T)]logT\displaystyle\liminf_{T\to\infty}\frac{\mathbb{E}_{\nu^{f}}\left[N_{k}(T)\right]}{\log T} 1dk+εlim infTlog(T4(𝔼νf[Nk(T)]+jk𝔼ν[Nj(T)]))logT\displaystyle\geq\frac{1}{d_{k}+\varepsilon}\liminf_{T\to\infty}\frac{\log\left(\frac{T}{4(\mathbb{E}_{\nu^{f}}\left[N_{k}(T)\right]+\sum_{j\neq k}\mathbb{E}_{\nu^{\prime}}\left[N_{j}(T)\right])}\right)}{\log T}
=1dk+ε(1lim supTlog(𝔼νf[Nk(T)]+jk𝔼ν[Nj(T)])logT)\displaystyle=\frac{1}{d_{k}+\varepsilon}\left(1-\limsup_{T\to\infty}\frac{\log(\mathbb{E}_{\nu^{f}}\left[N_{k}(T)\right]+\sum_{j\neq k}\mathbb{E}_{\nu^{\prime}}\left[N_{j}(T)\right])}{\log T}\right)
=1dk+ε.\displaystyle=\frac{1}{d_{k}+\varepsilon}.

The last equality follows from the definition of consistency. As an arm that is suboptimal, infeasible, or both is played only o(Ta)o(T^{a}) times in expectation for all a>0,a>0, there exists a constant CaC_{a} such that, for large enough T,T, 𝔼νf[Nk(T)]+jk𝔼ν[Nj(T)]CaTa.\mathbb{E}_{\nu^{f}}\left[N_{k}(T)\right]+\sum_{j\neq k}\mathbb{E}_{\nu^{\prime}}\left[N_{j}(T)\right]\leq C_{a}T^{a}. This gives

lim supTlog(𝔼νf[Nk(T)]+jk𝔼ν[Nj(T)])logTlim supTlog(Ca)+alog(T)log(T)=a.\displaystyle\limsup_{T\to\infty}\frac{\log(\mathbb{E}_{\nu^{f}}\left[N_{k}(T)\right]+\sum_{j\neq k}\mathbb{E}_{\nu^{\prime}}\left[N_{j}(T)\right])}{\log T}\leq\limsup_{T\to\infty}\frac{\log(C_{a})+a\log(T)}{\log(T)}=a.

As this holds for all a>0a>0 and ε>0\varepsilon>0 was arbitrary, we have the result.

B.4 Infeasible instance

For an infeasible instance (νi,τ),(\nu^{i},\tau), where νi𝒞K,\nu^{i}\in\mathcal{C}^{K}, define, for each non-optimal arm k,k,

ηi(νi(k),g1,𝒞)=infν(k)𝒞{KL(νi(k),ν(k)):g1(ν(k))<g1}.\eta^{i}(\nu^{i}(k),g_{1}^{*},\mathcal{C})=\inf_{\nu^{\prime}(k)\in\mathcal{C}}\{\text{KL}(\nu^{i}(k),\nu^{\prime}(k)):g_{1}(\nu^{\prime}(k))<g_{1}^{*}\}.

We will show that for any non-optimal arm k,k,

lim infT𝔼νi[Nk(T)]log(T)1ηi(νi(k),g1,𝒞).\displaystyle\liminf_{T\to\infty}\frac{\mathbb{E}_{\nu^{i}}\left[N_{k}(T)\right]}{\log(T)}\geq\frac{1}{\eta^{i}(\nu^{i}(k),g_{1}^{*},\mathcal{C})}.
Proof

Let dk=ηi(νi(k),g1,𝒞)d_{k}=\eta^{i}(\nu^{i}(k),g_{1}^{*},\mathcal{C}) and fix any ε>0.\varepsilon>0. Let (ν,τ)(\nu^{\prime},\tau) be a bandit instance with ν(j)=νi(j)\nu^{\prime}(j)=\nu^{i}(j) for jkj\neq k and ν(k)𝒞\nu^{\prime}(k)\in\mathcal{C} such that KL(νi(k),ν(k))dk+ε\text{KL}(\nu^{i}(k),\nu^{\prime}(k))\leq d_{k}+\varepsilon and g1(ν(k))<g1.g_{1}(\nu^{\prime}(k))<g_{1}^{*}. Such a distribution ν(k)\nu^{\prime}(k) exists by the definition of dk.d_{k}. Observe that the instance (ν,τ)(\nu^{\prime},\tau) could be feasible. Nonetheless, arm kk is the unique optimal arm irrespective of the feasibility of instance (ν,τ).(\nu^{\prime},\tau). Using the divergence decomposition lemma (Lemma 1), we have KL(νi,ν)𝔼νi[Nk(T)](dk+ε)\text{KL}(\mathbb{P}_{\nu^{i}},\mathbb{P}_{\nu^{\prime}})\leq\mathbb{E}_{\nu^{i}}\left[N_{k}(T)\right](d_{k}+\varepsilon) and by Lemma 2 we have

νi(A)+ν(Ac)12exp(KL(νi,ν))12exp(𝔼νi[Nk(T)](dk+ε)).\displaystyle\mathbb{P}_{\nu^{i}}\left(A\right)+\mathbb{P}_{\nu^{\prime}}\left(A^{c}\right)\geq\frac{1}{2}\exp(-\text{KL}(\mathbb{P}_{\nu^{i}},\mathbb{P}_{\nu^{\prime}}))\geq\frac{1}{2}\exp(-\mathbb{E}_{\nu^{i}}\left[N_{k}(T)\right](d_{k}+\varepsilon)).

Let event A={Nk(T)>T/2}.A=\{N_{k}(T)>\nicefrac{{T}}{{2}}\}.

𝔼νi[Nk(T)]+jk𝔼ν[Nj(T)]\displaystyle\mathbb{E}_{\nu^{i}}\left[N_{k}(T)\right]+\sum_{j\neq k}\mathbb{E}_{\nu^{\prime}}\left[N_{j}(T)\right] T2(νi(A)+ν(Ac)),\displaystyle\geq\frac{T}{2}(\mathbb{P}_{\nu^{i}}\left(A\right)+\mathbb{P}_{\nu^{\prime}}\left(A^{c}\right)),
T4exp(𝔼νi[Nk(T)](dk+ε)).\displaystyle\geq\frac{T}{4}\exp(-\mathbb{E}_{\nu^{i}}\left[N_{k}(T)\right](d_{k}+\varepsilon)).

Rearranging and taking the limit inferior we get

lim infT𝔼νi[Nk(T)]logT\displaystyle\liminf_{T\to\infty}\frac{\mathbb{E}_{\nu^{i}}\left[N_{k}(T)\right]}{\log T} 1dk+εlim infTlog(T4(𝔼νi[Nk(T)]+jk𝔼ν[Nj(T)]))logT\displaystyle\geq\frac{1}{d_{k}+\varepsilon}\liminf_{T\to\infty}\frac{\log\left(\frac{T}{4(\mathbb{E}_{\nu^{i}}\left[N_{k}(T)\right]+\sum_{j\neq k}\mathbb{E}_{\nu^{\prime}}\left[N_{j}(T)\right])}\right)}{\log T}
=1dk+ε(1lim supTlog(𝔼νi[Nk(T)]+jk𝔼ν[Nj(T)])logT)\displaystyle=\frac{1}{d_{k}+\varepsilon}\left(1-\limsup_{T\to\infty}\frac{\log(\mathbb{E}_{\nu^{i}}\left[N_{k}(T)\right]+\sum_{j\neq k}\mathbb{E}_{\nu^{\prime}}\left[N_{j}(T)\right])}{\log T}\right)
=1dk+ε.\displaystyle=\frac{1}{d_{k}+\varepsilon}.

The last equality follows from the definition of consistency. As νi\nu^{i} is an infeasible instance, arm kk is suboptimal and is played only o(Ta)o(T^{a}) times in expectation for all a>0.a>0. For instance ν,\nu^{\prime}, arm kk is the unique optimal arm, so all other arms are played only o(Ta)o(T^{a}) times in expectation for all a>0.a>0. Hence, there exists a constant CaC_{a} such that, for large enough T,T, 𝔼νi[Nk(T)]+jk𝔼ν[Nj(T)]CaTa.\mathbb{E}_{\nu^{i}}\left[N_{k}(T)\right]+\sum_{j\neq k}\mathbb{E}_{\nu^{\prime}}\left[N_{j}(T)\right]\leq C_{a}T^{a}. This gives

lim supTlog(𝔼νi[Nk(T)]+jk𝔼ν[Nj(T)])logTlim supTlog(Ca)+alog(T)log(T)=a.\displaystyle\limsup_{T\to\infty}\frac{\log(\mathbb{E}_{\nu^{i}}\left[N_{k}(T)\right]+\sum_{j\neq k}\mathbb{E}_{\nu^{\prime}}\left[N_{j}(T)\right])}{\log T}\leq\limsup_{T\to\infty}\frac{\log(C_{a})+a\log(T)}{\log(T)}=a.

As this holds for all a>0a>0 and ε>0\varepsilon>0 was arbitrary, we have the result.