J. Nair, IIT Bombay (email: [email protected])
K. Jagannathan, IIT Madras (email: [email protected])
Constrained regret minimization for multi-criterion multi-armed bandits
Abstract
We consider a stochastic multi-armed bandit setting and study the problem of constrained regret minimization over a given time horizon. Each arm is associated with an unknown, possibly multi-dimensional distribution, and the merit of an arm is determined by several, possibly conflicting attributes. The aim is to optimize a ‘primary’ attribute subject to user-provided constraints on other ‘secondary’ attributes. We assume that the attributes can be estimated using samples from the arms’ distributions, and that the estimators enjoy suitable concentration properties. We propose an algorithm called Con-LCB that guarantees a logarithmic regret, i.e., the average number of plays of all non-optimal arms is at most logarithmic in the horizon. The algorithm also outputs a boolean flag that correctly identifies, with high probability, whether the given instance is feasible/infeasible with respect to the constraints. We also show that Con-LCB is optimal within a universal constant, i.e., that more sophisticated algorithms cannot do much better universally. Finally, we establish a fundamental trade-off between regret minimization and feasibility identification. Our framework finds natural applications, for instance, in financial portfolio optimization, where risk constrained maximization of expected return is meaningful.
Keywords: Multi-criterion multi-armed bandits · constrained bandits · regret minimization
1 Introduction
The multi-armed bandit (MAB) problem is a fundamental construct in online learning, where a learner has to quickly identify the best option (a.k.a., arm) among a given set of options. In the stochastic MAB problem, each arm is associated with an (a priori unknown) reward distribution, and a sample from this distribution is revealed each time an arm is chosen (a.k.a., pulled). The classical goal is to use these samples to quickly identify the arm with the highest mean reward. The most popular metric to evaluate the performance of a learning algorithm is regret, which captures how often a suboptimal arm was pulled by the algorithm.
While the classical stochastic MAB formulation has been applied in various application scenarios, including clinical trials, portfolio optimization, anomaly detection, and telecommunication Bouneffouf and Rish [2019], it ignores a key aspect of most real-world decision-making problems—namely, that they have multiple criteria of interest. For example, when comparing testing kits in a clinical trial, one would want to keep track of the false-positive rate as well as the false-negative rate of each kit. Similarly, choosing the best financial portfolio involves balancing risk and reward. A wireless node deciding which channel to transmit on, or what transmission rate to use has to balance several criteria, including throughput, delay, and energy consumption. This multi-criterion nature of decision making is not always adequately captured by the classical MAB approach of optimizing a one-dimensional reward signal.
The most common approach for incorporating multiple arm attributes into MAB formulations is to define the reward as a suitable function (say a linear combination) of the attributes of interest. For example, risk-aware portfolio optimization can be cast as an MAB problem where the best arm is one that optimizes a certain linear combination of mean value and a suitable risk measure, such as standard deviation or Conditional Value at Risk (see, for example, Sani et al. [2012], Vakili and Zhao [2016], Kagrecha et al. [2019]). However, this approach assumes that the different attributes of interest can be expressed and compared on a common scale, which is not always reasonable. For example, how does one 'equate' the impact of a certain increment in risk to that of an increment in the mean return of a portfolio?
A more natural approach for multi-criterion decision making is to instead pose the optimal choice as the solution of a constrained optimization; in other words, optimize one attribute subject to constraints on the others. However, despite the modest literature on multi-criterion MABs (surveyed below), little attention has been paid to formulating, and designing algorithms for, constrained multi-criterion MABs. This paper seeks to fill this gap. Formally, we assume that each arm is associated with a multi-dimensional probability distribution, and the best arm is the one that minimizes a certain arm attribute subject to constraints on other attributes. For this setting, we pursue regret minimization, i.e., we seek to minimize the average number of pulls of non-optimal arms.
To simplify the presentation, and to highlight the key aspects of this problem, we first consider a single constraint. Each arm is associated with a probability distribution, and we consider the problem of optimizing an objective attribute, subject to a single constraint attribute satisfying a user-specified constraint. The algorithm design and performance evaluation for this special case generalize readily to the multi-criterion problem; see Section 5. An example that fits this template, and is of significant interest in the finance community, is the optimization of the expected return subject to a risk constraint. Here, the objective attribute is the mean of an arm distribution, while the constraint attribute measuring risk could be the Conditional Value at Risk (CVaR).
For this problem, we propose an algorithm, called Constrained Lower Confidence Bound (Con-LCB), that guarantees logarithmic regret, i.e., the average number of plays of all non-optimal arms (including those that violate the constraint) is at most logarithmic in the horizon. If Con-LCB is presented with an infeasible instance, i.e., an instance where all arms violate the specified constraint, the algorithm in effect relaxes this constraint just enough to make at least one arm compliant. Another feature of Con-LCB is that, at the end of the horizon, it outputs a boolean flag that correctly identifies, with high probability, whether or not the given instance was feasible.
Finally, we establish fundamental lower bounds on the performance of any algorithm on this constrained regret minimization problem. Our results demonstrate a fundamental tradeoff between regret minimization and feasibility identification, similar to the well-known tradeoff between regret minimization and best arm identification in the classical (unconstrained) MAB problem Bubeck et al. [2009].
The remainder of this paper is organized as follows. A brief survey of related literature is provided below. In Section 2, we introduce some preliminaries and formulate the constrained mean minimization problem. We present our algorithm (Con-LCB) and its performance guarantees in Section 3. Information-theoretic lower bounds on performance are discussed in Section 4. Finally, the general formulation for multi-criterion MABs is introduced in Section 5. All proofs are deferred to an appendix, which is part of the ‘supplementary materials’ document uploaded separately.
Related literature: The literature related to multi-armed bandit problems is quite large. We refer the reader to Bubeck and Cesa-Bianchi [2012], Lattimore and Szepesvári [2018] for a comprehensive review. Here, we restrict ourselves to papers that consider (i) multi-objective MAB problems with vector rewards, and (ii) risk-aware arm selection.
For multi-objective MAB problems with vector rewards, different notions of optimality are considered. For example, Drugan and Nowe [2013], Yahyaa and Manderick [2015] consider the notion of Pareto optimality. In these papers, all dimensions are considered equally important and the aim is to play all Pareto-optimal arms an equal number of times. Another important notion of optimality is lexicographic optimality (see Ehrgott [2005]), where there is an order of importance among the different dimensions. Tekin and Turğay [2018], Tekin [2019] consider lexicographic optimality for contextual bandit problems. In this line of work, the goal is to obtain a higher reward in the more important dimensions, using the rewards obtained in dimensions of lower importance for tie-breaking.
Turning now to the literature on risk-aware arm selection, Sani et al. [2012], Galichet et al. [2013], Vakili and Zhao [2016], David and Shimkin [2016], Bhat and Prashanth [2019], Prashanth et al. [2020], Kagrecha et al. [2019] consider the problem of optimizing a risk metric alone, or a linear combination of the mean and a risk metric. Zimin et al. [2014] look at the learnability of general functions of mean and variance, and Maillard [2013] proposes optimizing the logarithm of the moment generating function as a risk measure in a regret minimization framework. Cassel et al. [2018] look at path-dependent regret and provide a general approach for studying many risk metrics.
None of the above papers considers the constrained MAB problem, which frames the optimal arm as the solution of a constrained optimization problem. There has been some recent work on the constrained MAB problem. A constrained linear bandit setting is considered in Pacchiano et al. [2021], under the assumption that there is at least one arm that satisfies the constraints. The papers Amani et al. [2019], Moradipari et al. [2019] consider the problem of maximizing the reward subject to satisfying a linear constraint with high probability. The constrained setting was also considered in David et al. [2018] and Chang [2020]. Both these papers consider a single constraint on arm selection; David et al. [2018] considers a constraint on VaR, and Chang [2020] considers an average cost constraint (each arm has a cost distribution that is independent of its reward distribution). All the papers above implicitly assume that the instance presented is feasible, whereas we address the issue of encountering an infeasible instance.
2 Problem formulation
In this section, we describe the formulation of our constrained stochastic MAB problem. To keep the exposition simple, we consider a single constraint for now; the general formulation with multiple constraints is discussed in Section 5. Informally, under the regret minimization objective considered here, the goal is to play, as often as possible, the arm that optimizes the objective attribute, subject to a constraint on another attribute.
Formally, consider a multi-armed bandit problem with a finite number of arms. Each arm is associated with a (possibly multi-dimensional) probability distribution belonging to a known space of possible arm distributions; the (joint) distribution of each arm is a priori unknown to the learner. The objective and the constraint attributes are defined via two real-valued functionals on this space of distributions. Additionally, the user provides a threshold, which specifies an upper bound on the constraint attribute. An instance of the constrained MAB problem is specified by the tuple of arm distributions, together with the two attribute functionals and the threshold. Arms whose constraint attribute does not exceed the threshold are called feasible arms. The instance is said to be feasible if at least one feasible arm exists, and infeasible otherwise.
Consider first a feasible instance. An optimal arm in this case is one that minimizes the objective attribute subject to the constraint; we refer to its objective value as the optimal value. Arms whose objective attribute exceeds the optimal value (whether or not they are feasible) are referred to as suboptimal arms. Note that there can also exist infeasible arms with an objective value smaller than the optimal value; we refer to such arms as deceiver arms. For a suboptimal arm, the suboptimality gap is the difference between its objective value and the optimal value. (Note that the suboptimality gap is also defined for infeasible, non-deceiver arms.) Finally, for an infeasible arm, the infeasibility gap is the amount by which its constraint attribute exceeds the threshold. Figure 1(a) provides a visual representation of a feasible instance.
Next, consider an infeasible instance. The optimal arm in this case is defined as the one with the smallest value of the constraint attribute. In effect, this requires that an algorithm faced with an infeasible instance 'relax' the constraint just enough for at least one arm to satisfy it. The constraint gap of a non-optimal arm is the difference between its constraint attribute and that of the optimal arm. Figure 1(b) provides a visual representation of an infeasible instance.
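For concreteness, one way to write the above formulation in symbols is given below; the notation (arms indexed by $k \le K$, distributions $\nu_k$, objective $f$, constraint $g$, threshold $\tau$) is introduced here only for readability and is not taken from the paper.
\[
\mathcal{F} = \{k \le K : g(\nu_k) \le \tau\}, \qquad
k^{*} \in \arg\min_{k \in \mathcal{F}} f(\nu_k) \ (\text{feasible instance}), \qquad
k^{*} \in \arg\min_{k \le K} g(\nu_k) \ (\text{infeasible instance}),
\]
\[
\Delta_k = f(\nu_k) - f(\nu_{k^{*}}) \ (\text{suboptimality gap}), \qquad
\Delta_k^{c} = g(\nu_k) - \tau \ (\text{infeasibility gap}), \qquad
\text{deceivers: } \{k \notin \mathcal{F} : f(\nu_k) < f(\nu_{k^{*}})\}.
\]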
For any (feasible or infeasible) instance, we keep track of the set of optimal arms and of the horizon, i.e., the total number of pulls. For a given algorithm (a.k.a. policy), we also track the number of times each arm has been pulled up to any given round; the dependence of these pull counts on the policy is often suppressed for notational simplicity.


Consistency: We now define the notion of consistency of an algorithm in this setting. A policy is said to be consistent over a class of distributions, given a pre-specified constraint threshold, if for every instance whose arm distributions lie in that class, the expected number of pulls of every non-optimal arm grows slower than any positive power of the horizon. This definition is in line with the definition of consistency in the classical unconstrained regret minimization setting (see Lai and Robbins [1985]).
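In symbols (with $N_k(T)$ denoting the number of pulls of arm $k$ over a horizon of $T$ rounds, a notation we adopt only for these illustrative displays), consistency requires
\[
\mathbb{E}\big[N_k(T)\big] = o(T^{a}) \quad \text{for every } a > 0 \text{ and every non-optimal arm } k.
\]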
Regret: The formal definition of regret in the present setting is as follows. For a feasible instance, there are two types of regret: the suboptimality regret, which is the regret incurred due to the sampling of feasible but suboptimal arms, and the infeasibility regret, which is the regret incurred due to the sampling of infeasible arms. For an infeasible instance, regret is incurred by playing arms that are farther from the constraint boundary than the optimal arm; we refer to this as the constraint regret.
(Note that our analysis essentially provides upper and lower bounds on the expected number of pulls of every non-optimal arm. So alternative definitions of regret, involving linear combinations of the expected number of pulls of non-optimal arms, are also supported by our analysis.)
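One natural formalization of these regret notions, in the same ad hoc notation as above, weights the expected pulls of each non-optimal arm by the corresponding gap; the paper's exact choice of weights may differ, so the displays below should be read as illustrative:
\[
R^{\mathrm{sub}}(T) = \sum_{k \in \mathcal{F}:\, \Delta_k > 0} \Delta_k\, \mathbb{E}[N_k(T)], \qquad
R^{\mathrm{inf}}(T) = \sum_{k \notin \mathcal{F}} \Delta_k^{c}\, \mathbb{E}[N_k(T)],
\]
\[
R^{\mathrm{con}}(T) = \sum_{k \neq k^{*}} \big(g(\nu_k) - g(\nu_{k^{*}})\big)\, \mathbb{E}[N_k(T)] \quad (\text{infeasible instance}).
\]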
Feasibility identification: Next, we introduce an optional boolean flag called feasibility_flag, which the policy may set at the end of the horizon to indicate, post facto, whether it considered the instance feasible (flag set to true) or infeasible (flag set to false). For the algorithms proposed in this paper, we provide bounds on the probability that the flag erroneously marks a feasible instance as infeasible, and vice versa. We also provide fundamental lower bounds on this probability of error under any consistent policy.
Concentration inequalities for attribute estimators: Naturally, MAB algorithms must estimate the objective and constraint attributes of each arm using the samples obtained by pulling that arm. We make the following assumption on the concentration properties of these estimators. For each attribute $g$ (objective or constraint) and each distribution $\nu$ in the class of arm distributions, there exists an estimator $\hat{g}_m(\nu)$, computed from $m$ i.i.d. samples of $\nu$, and a constant $\gamma > 0$ such that for all $\epsilon > 0$,
\[
\mathbb{P}\left(\left|\hat{g}_m(\nu) - g(\nu)\right| \geq \epsilon\right) \leq 2\exp\left(-\gamma\, m\, \epsilon^{2}\right). \tag{1}
\]
Concentration inequalities of this form are available in a broad variety of settings. For example, if the attribute is the expectation of a function of the arm's (possibly vector-valued) reward, then a concentration inequality of the form (1) is readily obtained when that function is bounded (using the Hoeffding inequality) or sub-Gaussian. Similarly, if the attribute is a Lipschitz function of a sub-Gaussian random vector, concentration bounds of the form (1) can be obtained by invoking the results in Kontorovich [2014]. Several examples where risk measures concentrate in this manner are provided in Cassel et al. [2018]. Also, note that the specific form (1) of the concentration inequality is assumed only to simplify the description of our algorithms; alternative forms of concentration guarantees (such as those known for the means of sub-exponential or heavy-tailed distributions) can also be supported by our algorithmic framework via minor modifications to the confidence bounds.
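As a concrete illustration of how a bound of the form (1) translates into confidence intervals, the snippet below inverts the Hoeffding bound for [0, 1]-valued samples; the function name and the choice of confidence level are ours, not part of the paper.

```python
import numpy as np

def hoeffding_radius(num_samples: int, delta: float) -> float:
    """Confidence radius for the empirical mean of [0, 1]-valued samples.

    Inverts the Hoeffding bound P(|mean_hat - mean| >= eps) <= 2 exp(-2 m eps^2),
    an instance of the generic inequality (1), so that the true attribute lies
    within +/- radius of the estimate with probability at least 1 - delta.
    """
    return np.sqrt(np.log(2.0 / delta) / (2.0 * num_samples))

# Example: estimate the mean of a Bernoulli(0.3) arm from 200 samples.
rng = np.random.default_rng(0)
samples = rng.binomial(1, 0.3, size=200)
est = samples.mean()
rad = hoeffding_radius(len(samples), delta=0.05)
print(f"estimate = {est:.3f}, 95% interval = [{est - rad:.3f}, {est + rad:.3f}]")
```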
3 Constrained regret minimization
In this section, we present an algorithm for constrained regret minimization, and provide performance guarantees for the same, assuming that we have estimators for each attribute that satisfy the concentration inequality (1).
The algorithm, which we refer to as the constrained lower confidence bound (Con-LCB) algorithm, is based on the well-known principle of optimism under uncertainty. Con-LCB uses lower confidence bounds (LCBs) on the constraint attribute of each arm to maintain a set of plausibly feasible arms, and uses LCBs on the objective attribute of the arms in this set to select the arm to be played. Note that LCBs are used for the objective attribute because we are dealing with a minimization problem. If, at some instant, the set of plausibly feasible arms maintained by Con-LCB becomes empty, the algorithm turns conservative and plays the arm that violates the constraint the least, i.e., the one with the smallest LCB on the constraint attribute. Finally, at the end of the horizon, Con-LCB sets the feasibility flag to true if the set of plausibly feasible arms is non-empty, and to false otherwise.
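The following sketch (ours, not the authors' released implementation) renders the above description in Python. The sampler interface, the empirical-mean estimators, and the confidence radius sqrt(conf * log T / n_k) are assumptions made for illustration; any estimators satisfying (1), with the corresponding radii, could be substituted.

```python
import numpy as np

def con_lcb(arms, horizon, tau, rng=None, conf=2.0):
    """Sketch of Con-LCB: minimize the objective attribute subject to the
    constraint attribute being at most tau.

    `arms` is a list of callables; arms[k](rng) returns one sample
    (objective_value, constraint_value) from arm k.
    """
    rng = rng or np.random.default_rng()
    K = len(arms)
    pulls = np.zeros(K, dtype=int)
    obj_sum = np.zeros(K)
    con_sum = np.zeros(K)

    def play(k):
        o, c = arms[k](rng)
        pulls[k] += 1
        obj_sum[k] += o
        con_sum[k] += c

    for k in range(K):                    # pull every arm once to initialize
        play(k)

    for _ in range(K, horizon):
        radius = np.sqrt(conf * np.log(horizon) / pulls)
        obj_lcb = obj_sum / pulls - radius       # optimism for minimization
        con_lcb_vals = con_sum / pulls - radius  # optimism w.r.t. feasibility
        plausible = np.flatnonzero(con_lcb_vals <= tau)
        if plausible.size > 0:
            k = plausible[np.argmin(obj_lcb[plausible])]
        else:
            k = int(np.argmin(con_lcb_vals))     # least-violating arm
        play(k)

    radius = np.sqrt(conf * np.log(horizon) / pulls)
    feasibility_flag = bool(np.any(con_sum / pulls - radius <= tau))
    return pulls, feasibility_flag
```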
The remainder of this section is devoted to performance guarantees for Con-LCB. We consider feasible and infeasible instances separately.
Theorem 3.1
Consider a feasible instance. Under Con-LCB, the expected number of pulls of a feasible but suboptimal arm is bounded by
The expected number of pulls of a deceiver arm is bounded by
The expected number of pulls of an infeasible, non-deceiver arm is bounded by
The probability of incorrectly setting the feasibility_flag is upper bounded by
The main takeaways from Theorem 3.1 are as follows.
The upper bound on the expected number of pulls for feasible, suboptimal arms is logarithmic in the horizon and inversely proportional to the square of the suboptimality gap. This is similar to bounds known for classical (unconstrained) MABs.
For deceiver arms, the upper bound on the expected number of pulls, also logarithmic in the horizon, is inversely proportional to the square of the feasibility gap. This is similar to the bound one would obtain in a pure minimization problem for a hypothetical instance consisting of all the originally infeasible arms, together with a single hypothetical (optimal) arm whose constraint attribute equals the threshold.
The bound on the expected number of pulls of non-deceiver, infeasible arms involves a minimum of the dominant terms in the above two cases. Intuitively, this is because these arms can be disambiguated in two ways: via the suboptimality gap, and via the feasibility gap.
The probability that the feasibility flag incorrectly identifies the instance as infeasible is upper bounded by a power law in the horizon. Note that the specific form of the misidentification probability bound is not fundamental; a small modification of the algorithm would make this bound a faster-decaying power law, at the expense of a multiplicative bump in regret. However, the fact that this probability is not much smaller (for example, exponentially decaying in the horizon) is a consequence of an inherent tension between regret minimization and feasibility identification. A similar tension is known to exist between regret minimization and best arm identification in the unconstrained setting; see Bubeck et al. [2009]. We provide lower bounds on the probability that any consistent algorithm makes a mistake in feasibility identification in Section 4.
Finally, we note that both the suboptimality regret and the infeasibility regret are logarithmic in the horizon; the suboptimality-regret bound is similar to regret bounds in the classical unconstrained setting. To express the infeasibility regret compactly, we adopt the convention that, for deceiver arms, only the feasibility gap enters the corresponding term of the bound. With this convention, the infeasibility regret of a feasible instance is also bounded logarithmically in the horizon.
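Schematically, with constants suppressed and $\gamma$ the constant from (1), the per-arm bounds in Theorem 3.1 have the following shape; this is an illustration of their dependence on the gaps and the horizon, in the ad hoc notation introduced earlier, not a restatement of the theorem's exact constants:
\[
\mathbb{E}[N_k(T)] \lesssim \frac{\log T}{\gamma\, \Delta_k^{2}} \ (\text{feasible, suboptimal}), \qquad
\mathbb{E}[N_k(T)] \lesssim \frac{\log T}{\gamma\, (\Delta_k^{c})^{2}} \ (\text{deceiver}), \qquad
\mathbb{E}[N_k(T)] \lesssim \frac{\log T}{\gamma\, \max\{\Delta_k, \Delta_k^{c}\}^{2}} \ (\text{infeasible, non-deceiver}).
\]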
Next, we move on to characterizing the performance of Con-LCB over infeasible instances.
Theorem 3.2
Consider an infeasible instance. Under Con-LCB, the expected number of pulls of a non-optimal arm is bounded by
Moreover, for all horizons beyond an instance-dependent constant, the probability that the algorithm incorrectly flags the instance as feasible is bounded by a power law in the horizon.
For an infeasible instance, the upper bound on the expected number of pulls of a non-optimal arm, logarithmic in the horizon and inversely proportional to the square of the constraint gap, is structurally similar to the bound one would obtain in a pure minimization problem on the same instance. However, note that when faced with an infeasible instance, Con-LCB only starts playing the optimal arm regularly after all arms appear to be infeasible; this explains the additional additive term in our bounds.
Here, the constraint regret is likewise bounded logarithmically in the horizon.
Finally, as before, the probability that the feasibility flag wrongly identifies the infeasible instance as feasible decays as a power law in the horizon once the horizon exceeds an instance-dependent threshold; this threshold accounts for the time it takes for the algorithm to 'detect', with high probability, that the instance is infeasible.
4 Information theoretic lower bounds
In this section, we establish fundamental limits on the performance of algorithms for constrained regret minimization. First, we show that the regret bounds obtained for Con-LCB are asymptotically tight up to universal multiplicative constants (on a class of Gaussian bandit instances). We then prove a lower bound on the probability that any consistent algorithm misidentifies a feasible instance as infeasible, or vice versa. This result illustrates an inherent tension between regret minimization and feasibility identification: consistent algorithms (recall that consistency means the expected number of pulls of every non-optimal arm grows slower than any positive power of the horizon) cannot have a misidentification probability that decays exponentially in the horizon, and algorithms whose misidentification probability decays exponentially in the horizon cannot be consistent.
To state our information-theoretic lower bound on regret, suppose that the class of arm distributions is the class of 2-dimensional Gaussian distributions with a fixed, common covariance matrix. Let the objective attribute be the mean of the first dimension, and the constraint attribute be the mean of the second dimension. For the assumed covariance structure, the standard empirical mean estimators of both attributes satisfy the concentration property stated in (1).
Theorem 4.1
Consider any policy that is consistent over this class of distributions given the threshold, and consider a feasible instance. For any feasible but suboptimal arm,
For any deceiver arm
Finally, for any infeasible non-deceiver arm
Similarly, for an infeasible instance, for any non-optimal arm,
The proof of Theorem 4.1 can be found in Appendix B. Comparing the lower bounds in Theorem 4.1 with the upper bounds for Con-LCB in Theorems 3.1 and 3.2, we conclude that Con-LCB is asymptotically optimal on regret, up to universal multiplicative constants. (Information-theoretic lower bounds on the expected number of pulls of non-optimal arms can be derived, in terms of a certain optimization over KL divergences, for any general class of arm distributions; see Appendix B.1. We have specialized the statement of Theorem 4.1 to a class of Gaussian bandit instances to enable an easy comparison with our upper bounds for Con-LCB.)
Next, we address the fundamental tradeoff between regret minimization and feasibility identification.
Theorem 4.2
Consider a space of arm distributions and a threshold such that the space contains both feasible and infeasible arm distributions. Then there exist a feasible instance and an infeasible instance such that, for any policy that is consistent over this space,
Theorem 4.2 states that, for any consistent algorithm, the probability that the feasible instance gets misidentified as infeasible and the probability that the infeasible instance gets misidentified as feasible cannot both decay exponentially in the horizon. This is of course consistent with the power-law probability of misidentification under Con-LCB. In other words, slower-than-exponential decay of the probability of feasibility misidentification with respect to the horizon is an unavoidable consequence of the exploration-exploitation interplay in regret minimization. A similar tension between regret minimization and best arm identification was noted for unconstrained MABs by Bubeck et al. [2009].
5 General framework for constrained MABs
In this section, we provide a general formulation for constrained stochastic MABs with multiple criteria. We allow each arm to be associated with a multi-dimensional probability distribution, the goal being to optimize one (dominant) attribute associated with this distribution, subject to constraints on the others. The algorithm design and performance evaluation carried out in Sections 2–4 for the special case of a single constraint extend naturally to this general formulation, which can in turn be applied in a variety of application scenarios. We illustrate a few here.
For clinical trials, with the arms corresponding to various treatment protocols, the dominant attribute might, for example, correspond to the success/recovery probability, whereas the constraints might capture recovery time, severity of side effects, etc.
For product/service rating, where the arms correspond to various service providers, the dominant attribute might correspond to product quality, with constraints capturing reliability, pricing, customer service, etc.
In wireless networks, the arms might correspond to various access networks or channels, with, for example, the dominant attribute corresponding to throughput, and constraints capturing delay, energy efficiency, etc.
Formulation: Consider a set of arms, each associated with a multi-dimensional probability distribution; the joint distribution corresponding to each arm belongs to a known space of possible arm distributions. The objective and the constraints are defined via real-valued functions on this space. Specifically, when the instance is feasible (i.e., at least one arm satisfies all the constraints), the optimal arm is the one that minimizes the objective function subject to the constraints.
If the instance is infeasible, the optimal arm is defined via a ranked list of the constraints that orders them by 'importance'. Without loss of generality, assume that the order of importance is increasing in the constraint index. The idea is to relax the constraints one by one, starting with the least important, until a compliant arm is found. Formally, for a given infeasible instance, consider, for each index, the set of arms that satisfy all constraints from that index onwards; the relaxation level is the smallest index for which this set is non-empty, i.e., the fewest number of constraints one must relax, in order of increasing importance, in order to have at least one compliant arm. An optimal arm is then defined with respect to this minimal relaxation (a code sketch of the relaxation rule is given below).
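A small sketch of the relaxation rule just described, computing how many constraints must be dropped before some arm complies. This is our own illustration; the ordering convention (column 0 is the least important constraint) and the function name are assumptions.

```python
import numpy as np

def relaxation_level(constraint_values, thresholds):
    """Smallest number of least-important constraints to relax so that some
    arm satisfies all remaining constraints.

    constraint_values: (num_arms, N) array of true constraint attributes.
    thresholds: length-N array; constraint j requires value <= thresholds[j].
    Columns are ordered from least (0) to most (N-1) important, so
    constraints are dropped left to right.
    """
    values = np.asarray(constraint_values, dtype=float)
    thr = np.asarray(thresholds, dtype=float)
    N = values.shape[1]
    for dropped in range(N + 1):
        ok = np.all(values[:, dropped:] <= thr[dropped:], axis=1)
        if ok.any():   # dropped = N always succeeds (no constraints remain)
            return dropped, np.flatnonzero(ok)

# Example: two constraints; no arm satisfies both, but arm 1 satisfies the
# more important one, so relaxing a single constraint suffices.
vals = np.array([[0.9, 0.8],
                 [0.7, 0.4]])
print(relaxation_level(vals, thresholds=[0.5, 0.5]))  # -> (1, array([1]))
```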
Similar to the single-constraint case, we make the following assumption to simplify algorithm design. For each of the objective and constraint functions $g_j$ and each distribution $\nu$ in the class, there exists an estimator $\hat{g}_{j,m}(\nu)$, computed from $m$ i.i.d. samples of $\nu$, and a constant $\gamma_j > 0$ such that for all $\epsilon > 0$,
\[
\mathbb{P}\left(\left|\hat{g}_{j,m}(\nu) - g_j(\nu)\right| \geq \epsilon\right) \leq 2\exp\left(-\gamma_j\, m\, \epsilon^{2}\right). \tag{2}
\]
Note that this is simply an extension of our assumption (1).
Algorithm and performance guarantees: For the above problem formulation, we propose Multi-Con-LCB, a simple extension of the Con-LCB algorithm presented earlier for the single-constraint case. Multi-Con-LCB uses confidence bounds on the constraint attributes of each arm to maintain, for every constraint, a set of arms that plausibly satisfy that constraint; intersecting these per-constraint sets yields the set of arms that plausibly satisfy all constraints. Since this intersection might be empty, the algorithm also tracks, for each level of relaxation, the set of arms that plausibly satisfy the remaining (more important) constraints, with the convention that the fully relaxed set contains all arms. If the set of arms plausibly satisfying all constraints is non-empty, the algorithm uses lower confidence bounds (LCBs) on the objective attribute to select the arm to be played. If this set is empty, the algorithm finds the smallest relaxation level at which the corresponding set is non-empty, and plays, among the arms in that set, the one with the lowest LCB on the relevant constraint attribute. Finally, at the end of the horizon, Multi-Con-LCB sets the feasibility flag to true if the set of arms plausibly satisfying all constraints is non-empty, and to false otherwise. The details are presented as Algorithm 2.
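The per-round selection described above can be sketched as follows. This is our own rendering, not Algorithm 2 verbatim: the confidence radius, the use of constraint LCBs for the plausibly feasible sets, and the fallback rule of minimizing the LCB of the most important relaxed constraint are assumptions made for illustration.

```python
import numpy as np

def multi_con_lcb_choice(con_est, obj_est, pulls, thresholds, t, conf=2.0):
    """One round of a Multi-Con-LCB-style selection (illustrative sketch).

    con_est: (K, N) empirical means of the N constraint attributes per arm,
             columns ordered from least (0) to most (N-1) important.
    obj_est: length-K empirical means of the objective attribute.
    pulls:   length-K pull counts; t is the current round index (t >= 2).
    Returns (arm to play, True if all constraints are plausibly satisfiable).
    """
    radius = np.sqrt(conf * np.log(t) / pulls)       # one radius per arm
    con_lcb = con_est - radius[:, None]               # optimistic constraint values
    obj_lcb = obj_est - radius                        # optimistic objective values
    plausibly_ok = con_lcb <= np.asarray(thresholds)  # (K, N) boolean matrix

    for dropped in range(con_est.shape[1] + 1):
        candidates = np.flatnonzero(plausibly_ok[:, dropped:].all(axis=1))
        if candidates.size > 0:
            if dropped == 0:
                arm = candidates[np.argmin(obj_lcb[candidates])]
            else:
                # Assumed fallback: minimize the LCB of the most important
                # constraint that had to be relaxed.
                arm = candidates[np.argmin(con_lcb[candidates, dropped - 1])]
            return int(arm), dropped == 0
```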
The remainder of this section is devoted to performance guarantees for Multi-Con-LCB. The suboptimality gap of an arm, and its infeasibility gap with respect to each constraint, are defined analogously to Section 2. We restrict our attention to feasible instances here; infeasible instances can be handled along lines similar to Theorem 3.2 for Con-LCB.
Theorem 5.1
Consider a feasible instance. Under Multi-Con-LCB, the expected number of pulls of a feasible but suboptimal arm (i.e., one that satisfies all the constraints but has a suboptimal objective value) is bounded by
The expected number of pulls of a deceiver arm (i.e., one with objective value smaller than the optimal value that violates at least one constraint) is bounded by
The expected number of pulls of an infeasible, non-deceiver arm (i.e., one with a suboptimal objective value that violates at least one constraint) is bounded by
The probability of incorrectly setting the feasibility_flag is upper bounded by
We omit the proof of Theorem 5.1 from the appendix, since it closely parallels the proof of Theorem 3.1. However, we state the key takeaways from Theorem 5.1 below.
Similar to Theorem 3.1, the upper bound on the expected number of pulls of suboptimal arms is logarithmic in the horizon and inversely proportional to the square of the suboptimality gap. Moreover, the upper bound for a deceiver arm is also logarithmic in the horizon, but involves a minimum over several terms, because the deceiver arm can be identified via any of the constraints that it violates. Similarly, the upper bound for an infeasible, non-deceiver arm is also logarithmic in the horizon, with a minimum over several terms, because such an arm can be identified either as suboptimal or as infeasible with respect to one of the constraints.
As the number of constraints increases, the upper bounds on the expected number of pulls of non-optimal arms grow linearly in the number of constraints, and the probability of misidentification is scaled by a corresponding factor. This slight degradation in the guarantees is expected: with more constraints, the optimal arm has to satisfy more conditions, and it becomes harder to compare different arms.
6 Numerical experiments
In this section, we present a simple experiment to show the numerical performance of our algorithm Con-LCB.
We consider an instance with four arms and two criteria. Each arm is associated with a two-dimensional Gaussian distribution; all arms share the same covariance matrix but have different mean vectors. The means corresponding to the first dimension are 0.3, 0.4, 0.2, and 0.5, and the means corresponding to the second dimension are 0.4, 0.4, 1.0, and 1.0, respectively.
The goal is to maximize the number of pulls of the arm with the minimum mean in the first dimension, subject to a constraint on the mean of the second dimension: we require that the mean of the second dimension be at most a specified threshold, chosen so that arms 1 and 2 are feasible while arms 3 and 4 are not. The instance is summarized in Table 1. We use empirical averages as estimators for the means, with standard confidence bounds based on a sub-Gaussianity assumption. We run the algorithm 1000 times for several values of the horizon. The suboptimality regret and the infeasibility regret are plotted in Figure 2; both grow sub-linearly with the horizon. Moreover, the algorithm correctly detected the feasibility of the instance in all 1000 runs.
Table 1: Summary of the two-criterion Gaussian instance.
Arm | Mean 1 | Mean 2 | Remarks
---|---|---|---
1 | 0.3 | 0.4 | Optimal arm
2 | 0.4 | 0.4 | Feasible, suboptimal
3 | 0.2 | 1.0 | Deceiver
4 | 0.5 | 1.0 | Infeasible, suboptimal
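For completeness, the snippet below (ours, not the repository code) instantiates the four-arm Gaussian instance of Table 1 with the sampler interface used in the Con-LCB sketch of Section 3. The covariance matrix and the threshold value 0.5 are illustrative choices that merely preserve the feasibility pattern indicated in the Remarks column.

```python
import numpy as np

# Dimension 1 is the objective (mean to be minimized); dimension 2 is constrained.
MEANS = np.array([[0.3, 0.4],   # arm 1: optimal
                  [0.4, 0.4],   # arm 2: feasible, suboptimal
                  [0.2, 1.0],   # arm 3: deceiver
                  [0.5, 1.0]])  # arm 4: infeasible, suboptimal
COV = np.array([[0.05, 0.0],    # illustrative covariance (not the paper's matrix)
                [0.0, 0.05]])
TAU = 0.5                        # illustrative threshold (not the paper's value)

def make_arm(mean, cov=COV):
    """Return a sampler: arm(rng) -> (objective_sample, constraint_sample)."""
    def arm(rng):
        x = rng.multivariate_normal(mean, cov)
        return x[0], x[1]
    return arm

arms = [make_arm(m) for m in MEANS]
rng = np.random.default_rng(0)
print(arms[0](rng))   # one (objective, constraint) sample from arm 1
# pulls, flag = con_lcb(arms, horizon=5000, tau=TAU)  # using the Section 3 sketch
```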

We consider another instance where the arms are Beta distributed. The parameters are chosen such that the means of the arms are 0.3, 0.4, 0.2, and 0.5, and the variances are 0.08, 0.06, 0.15, and 0.15, respectively. The goal here is to maximize the number of pulls of the arm with the minimum mean, subject to a constraint on the variance: we require that the variance be at most a specified threshold, chosen so that arms 1 and 2 are feasible while arms 3 and 4 are not. The instance is summarized in Table 2. We use the empirical average to estimate the mean of each arm and the standard unbiased estimator to estimate its variance. The concentration bounds we use rely on the fact that the underlying rewards are bounded between 0 and 1. We run the algorithm 100 times for several values of the horizon. The suboptimality regret and the infeasibility regret are plotted in Figure 3. As in the previous case, the algorithm correctly detected feasibility in all 100 runs.
We also compare Con-LCB with an algorithm designed to optimize a single 'Lagrange-relaxed' objective that combines the objective attribute and the constraint attribute via a weighting parameter. Since there is no systematic way of setting this weight so that the original (constrained) optimal arm also minimizes the relaxed metric, the approach is very fragile, as we demonstrate in Figure 4. The instance we use is the Beta-distributed instance in Table 2, and we plot the sum of the infeasibility regret and the suboptimality regret. When the weight is small, a deceiver arm is optimal for the relaxed objective, and when the weight is large, a suboptimal arm is optimal for the relaxed objective; the 'correct' range of the weight is instance dependent (and a priori unknown), as the simple computation following Table 2 also illustrates.
Table 2: Summary of the Beta-distributed instance.
Arm | Mean | Variance | Remarks
---|---|---|---
1 | 0.3 | 0.08 | Optimal arm
2 | 0.4 | 0.06 | Feasible, suboptimal
3 | 0.2 | 0.15 | Deceiver
4 | 0.5 | 0.15 | Infeasible, suboptimal
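The fragility of the Lagrange-relaxed objective can be checked directly on the population quantities of Table 2. The computation below is a back-of-the-envelope illustration with the weight written as lam; the specific weight values used in Figure 4 are not reproduced here.

```python
means = [0.3, 0.4, 0.2, 0.5]
variances = [0.08, 0.06, 0.15, 0.15]

def relaxed_best_arm(lam):
    """Arm (1-indexed) minimizing the relaxed score mean + lam * variance."""
    scores = [m + lam * v for m, v in zip(means, variances)]
    return scores.index(min(scores)) + 1

for lam in [0.5, 3.0, 10.0]:
    print(lam, relaxed_best_arm(lam))
# lam = 0.5  -> arm 3: the deceiver wins when the constraint is weighted too little
# lam = 3.0  -> arm 1: the constrained-optimal arm, but only for a middle range of lam
# lam = 10.0 -> arm 2: a feasible but suboptimal arm wins when lam is too large
```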


7 Concluding remarks
In this paper, we have introduced a general formulation of multi-criterion constrained MABs, which is applicable when there are multiple attributes of interest associated with each arm. We propose algorithms that incur logarithmic regret, and also provide information theoretic lower bounds on the performance of any algorithm. An interesting departure from the classical MAB formulation is the aspect of instance feasibility; our algorithms predict, post facto, with a probability of error that decays as a power law in the horizon, whether the instance presented was feasible. Interestingly, this illustrates a fundamental tradeoff between regret minimization and feasibility detection. Finally, our algorithms ‘auto-tune’ to an infeasible instance, relaxing the constraints on arm attributes (in order of increasing importance), until a compliant arm is found.
The proposed framework and our algorithms can be applied in a wide range of application scenarios (see Bouneffouf and Rish [2019] for a survey). Further, this work motivates the study of multi-criterion variants of other bandit formulations, including contextual bandits, combinatorial bandits, and correlated bandits.
8 Declarations
• Funding: the authors did not receive support from any organization for the submitted work.
• Conflicts of interest: the authors frequently collaborate with researchers from IIT Bombay (@iitb.ac.in), researchers from IIT Madras (@iitm.ac.in), and researchers from Stanford University (@stanford.edu). Jayakrishnan Nair received his PhD from Caltech (@caltech.edu) in 2012 and Krishna Jagannathan received his PhD from MIT (@mit.edu) in 2010.
• Ethics approval: not applicable
• Consent to participate: not applicable
• Consent for publication: not applicable
• Availability of data and material: not applicable
• Code availability: the Python code for the numerical experiments can be found in this GitHub repository.
• Authors' contributions: all authors contributed equally to the paper
References
- Bouneffouf and Rish [2019] Djallel Bouneffouf and Irina Rish. A survey on practical applications of multi-armed and contextual bandits. CoRR, abs/1904.10040, 2019. URL http://arxiv.org/abs/1904.10040.
- Sani et al. [2012] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
- Vakili and Zhao [2016] Sattar Vakili and Qing Zhao. Risk-averse multi-armed bandit problems under mean-variance measure. IEEE Journal of Selected Topics in Signal Processing, 10(6):1093–1111, 2016.
- Kagrecha et al. [2019] Anmol Kagrecha, Jayakrishnan Nair, and Krishna Jagannathan. Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards. In Advances in Neural Information Processing Systems, pages 11272–11281, 2019.
- Bubeck et al. [2009] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer, 2009.
- Bubeck and Cesa-Bianchi [2012] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Lattimore and Szepesvári [2018] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, page 28, 2018.
- Drugan and Nowe [2013] Madalina M Drugan and Ann Nowe. Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2013.
- Yahyaa and Manderick [2015] Saba Yahyaa and Bernard Manderick. Thompson sampling for multi-objective multi-armed bandits problem. In Proceedings, page 47. Presses universitaires de Louvain, 2015.
- Ehrgott [2005] Matthias Ehrgott. Multicriteria optimization, volume 491. Springer Science & Business Media, 2005.
- Tekin and Turğay [2018] Cem Tekin and Eralp Turğay. Multi-objective contextual multi-armed bandit with a dominant objective. IEEE Transactions on Signal Processing, 66(14):3799–3813, 2018.
- Tekin [2019] Cem Tekin. The biobjective multiarmed bandit: learning approximate lexicographic optimal allocations. Turkish Journal of Electrical Engineering & Computer Sciences, 27(2):1065–1080, 2019.
- Galichet et al. [2013] Nicolas Galichet, Michele Sebag, and Olivier Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245–260, 2013.
- David and Shimkin [2016] Yahel David and Nahum Shimkin. Pure exploration for max-quantile bandits. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 556–571. Springer, 2016.
- Bhat and Prashanth [2019] Sanjay P Bhat and LA Prashanth. Concentration of risk measures: A wasserstein distance approach. In Advances in Neural Information Processing Systems, pages 11739–11748, 2019.
- Prashanth et al. [2020] L. A. Prashanth, Krishna Jagannathan, and Ravi Kumar Kolla. Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. In International Conference on Machine Learning (To appear), 2020.
- Zimin et al. [2014] Alexander Zimin, Rasmus Ibsen-Jensen, and Krishnendu Chatterjee. Generalized risk-aversion in stochastic multi-armed bandits. arXiv preprint arXiv:1405.0833, 2014.
- Maillard [2013] Odalric-Ambrym Maillard. Robust risk-averse stochastic multi-armed bandits. In International Conference on Algorithmic Learning Theory, pages 218–233. Springer, 2013.
- Cassel et al. [2018] Asaf Cassel, Shie Mannor, and Assaf Zeevi. A general approach to multi-armed bandits under risk criteria. arXiv preprint arXiv:1806.01380, 2018.
- Pacchiano et al. [2021] Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, and Heinrich Jiang. Stochastic bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pages 2827–2835. PMLR, 2021.
- Amani et al. [2019] Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Linear stochastic bandits under safety constraints. Advances in Neural Information Processing Systems, 32, 2019.
- Moradipari et al. [2019] Ahmadreza Moradipari, Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Safe linear thompson sampling. arXiv preprint arXiv:1911.02156, 2019.
- David et al. [2018] Yahel David, Balázs Szörényi, Mohammad Ghavamzadeh, Shie Mannor, and Nahum Shimkin. PAC bandits with risk constraints. In ISAIM, 2018.
- Chang [2020] Hyeong Soo Chang. An asymptotically optimal strategy for constrained multi-armed bandit problems. Mathematical Methods of Operations Research, pages 1–13, 2020.
- Lai and Robbins [1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
- Kontorovich [2014] Aryeh Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. In International Conference on Machine Learning, pages 28–36, 2014.
Appendix A Upper bounds
In this section, we prove Theorems 3.1 and 3.2. The bounds in Subsections A.1–A.4 imply the statement of Theorem 3.1, and the bounds in Subsections A.5–A.6 imply the statement of Theorem 3.2.
A.1 Feasible instance: Upper bounding the expected pulls of deceiver arms
For a deceiver arm, we define a good event on which the estimator of the constraint attribute is concentrated enough, and derive an upper bound on the number of pulls of the deceiver arm.
(3)
On this event, we can lower bound the estimator of the constraint attribute for the arm as follows
If this lower bound exceeds the threshold, then the arm cannot be in the set of plausibly feasible arms. Hence, we can upper bound the number of pulls of an infeasible arm as follows
(4)
Now, let us upper bound the expected number of pulls of a deceiver arm
A.2 Feasible instance: Upper bounding the expected pulls of feasible suboptimal arms
We begin by showing that a feasible arm remains in the set of plausibly feasible arms when the estimator of its constraint attribute is concentrated enough. We define an event for a feasible arm as done in Equation (3). When this event holds, the estimator of the constraint attribute is upper bounded by
Hence, the arm remains in the set of plausibly feasible arms whenever the event holds.
We are considering the case where an arm is feasible but suboptimal. We will define a good event for arm and bound the number of pulls on this good event. Without loss of generality, assume arm 1 is optimal.
(5)
We will show that if holds, then . We will also show that holds with a small probability.
The proof is by contradiction. Suppose holds and , then there exists a such that and Using the definition of
Hence, which is a contradiction. Therefore, if occurs,
Now, consider the event
Let us bound the probability of the first term above.
Now, let us bound the second event above. By the choice of we have the following
Now,
We can show that and like in the previous subsection. Hence,
Hence, we can upper bound the number of pulls of feasible but suboptimal arms as follows
A.3 Feasible instance: Upper bounding the expected pulls of infeasible suboptimal arms
Consider arm which is both suboptimal and infeasible. Define an event as done in Equation (5). Recall that the upper bound on the pulls of the deceiver arms on event is denoted by and the upper bound on the pulls of the feasible but suboptimal arms on event is denoted by
On the event if then due to concentration of estimator of , this arm can’t be played more than times. If then due to suboptimality, this arm can’t be played more than times. We can show that the probability of is less than or equal to as we did before. Hence, we can upper bound the pulls of infeasible and suboptimal arms as follows
A.4 Feasible instance: Upper bounding the probability of misidentification
Feasibility is correctly detected if, for at least one of the feasible arms, the event defined in Equation (3) holds. We saw in Subsection A.2 that if the estimator of the constraint attribute is concentrated enough, a feasible arm always remains in the set of plausibly feasible arms.
Without loss of generality, assume that arm 1 is optimal. Then, we can lower bound the probability of correctly setting the flag as follows
This upper bounds the probability of incorrectly setting the flag
A.5 Infeasible instance: Upper bounding the expected pulls of non-optimal arms
In this section, we discuss the case where the given instance is infeasible. As defined before, the optimal choice for the algorithm is to play the arm with the minimum constraint attribute. We will upper bound the number of pulls of arms whose constraint attribute exceeds this minimum.
For all the arms, we define good events as in Equation (3) where the estimator for is concentrated enough. Let When event occurs, the set becomes empty after at most pulls where is defined in Equation (4). The analysis is similar to given in Subsection A.1.
Once the set becomes empty, the algorithm starts pulling arms with minimum lower confidence bound on estimator of . We will upper bound the number of pulls for the non-optimal arms. Without loss of generality, assume that arm 1 has the lowest . As we are dealing with an infeasible instance, For a non-optimal arm we define the following good event
One can check that is greater than because the gap is smaller than It is easy to argue using LCB based arguments that if occurs, then arm can’t be pulled more than times. This is similar to the proof given in Subsection A.2.
Let us upper bound the probability of Event is given by
Using analysis in Subsection A.1, we can show Let us bound the probability of the first term above. By the choice of we have
Now,
Hence, we can upper bound using a union bound
When the instance is infeasible, the expected number of pulls of each non-optimal arm is upper bounded by
A.6 Infeasible instance: Upper bounding the probability of misidentification
In this subsection, we will upper bound the probability of incorrectly setting the feasibility_flag.
Firstly, we define to be the minimum value of for which the following holds:
exists because for a fixed
For all the arms, we define good events as in Equation (3), on which the estimators of the constraint attribute are concentrated enough. On the intersection of these events, for horizons beyond the threshold defined above, the set of plausibly feasible arms remains empty.
We showed in the previous subsection that Hence, we can bound the probability of incorrectly setting the feasibility_flag as
Appendix B Lower bounds
In this section, we will prove Theorems 4.1 and 4.2. To prove Theorem 4.1, we will first prove a result which lower bounds the pulls of non-optimal arms when the arm distributions belong to a general distribution class and the attributes are not necessarily means. Theorem 4.1 is proved in Subsection B.1, Theorem 4.2 is proved in Subsection B.2, and subsequent subsections provide the proof for the intermediate results.
The proof technique to derive lower bounds for the constrained bandit setting is very similar to the technique used for standard stochastic bandit setting. We begin by stating some important results that will be used later in the proofs.
We first state the divergence decomposition lemma for the constrained bandit setting. The proof of the lemma is similar to the proof of divergence decomposition for the standard setting and we leave it to the reader to verify the result (see Lemma 15.1, Lattimore and Szepesvári [2018]).
Lemma 1
Consider two instances $\nu = (\nu_1, \dots, \nu_K)$ and $\nu' = (\nu'_1, \dots, \nu'_K)$, where each $\nu_k$ and $\nu'_k$ belongs to a space of distributions. Fix some policy $\pi$, and let $\mathbb{P}_{\nu}$ and $\mathbb{P}_{\nu'}$ be the probability measures on the constrained bandit model induced by the $T$-round interconnection between $\pi$ and $\nu$ (respectively, $\pi$ and $\nu'$). Then
\[
\mathrm{KL}\left(\mathbb{P}_{\nu}, \mathbb{P}_{\nu'}\right) = \sum_{k=1}^{K} \mathbb{E}_{\nu}\left[N_k(T)\right]\, \mathrm{KL}\left(\nu_k, \nu'_k\right). \tag{6}
\]
We also state the high probability Pinsker inequality (see Theorem 14.2, Lattimore and Szepesvári [2018]).
Lemma 2
Let $P$ and $Q$ be probability measures on the same measurable space, and let $A$ be an arbitrary event. Then,
\[
P(A) + Q(A^{c}) \geq \frac{1}{2} \exp\left(-\mathrm{KL}(P, Q)\right),
\]
where $A^{c}$ is the complement of $A$.
B.1 Lower bounding the pulls of non-optimal arms for the Gaussian case
We first state an instance-dependent lower bound for the expected pulls of non-optimal arms under consistent algorithms for any class of distributions and a threshold For a feasible instance where define, for each non-optimal arm
Similarly, for an infeasible instance where define, for each non-optimal arm
Theorem B.1
Let be a consistent policy over the class of distributions given the threshold Then for a feasible instance where for any non-optimal arm
For an infeasible instance where for any non-optimal arm
The bounds in Subsections B.3 and B.4 imply Theorem B.1. The main message here is that the ’hardness’ of a non-optimal arm , which dictates the minimum number of pulls required on average to distinguish it from the optimal arm, is characterized by the reciprocal of the smallest perturbation in terms of KL divergence needed to ‘make’ the arm optimal.
Consider two $d$-dimensional Gaussian distributions $\nu_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $\nu_2 = \mathcal{N}(\mu_2, \Sigma_2)$. The KL divergence between these distributions is given by
\[
\mathrm{KL}(\nu_1, \nu_2) = \frac{1}{2}\left[ \mathrm{tr}\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2 - \mu_1)^{\top}\Sigma_2^{-1}(\mu_2 - \mu_1) - d + \ln\frac{\det \Sigma_2}{\det \Sigma_1} \right]. \tag{7}
\]
This is easy to derive but one can check Section 9 of these notes.
Let the class of arm distributions be the class of 2-dimensional Gaussian distributions with a fixed, common covariance matrix $\Sigma$. Let the objective attribute be the mean of the first dimension, and the constraint attribute be the mean of the second dimension. For two such distributions $\nu$ and $\nu'$ with mean vectors $\mu$ and $\mu'$, Equation (7) specializes to
\[
\mathrm{KL}(\nu, \nu') = \frac{1}{2}\, (\mu - \mu')^{\top} \Sigma^{-1} (\mu - \mu'). \tag{8}
\]
With and attributes as defined in Theorem 4.1, we can use Equation 8 to calculate and In particular, for a feasible but suboptimal arm we have
for a deceiver arm we have
and for an infeasible arm which is not a deceiver, we have
Finally, for a non-optimal arm for an infeasible instance, we have
Using the results above and combining them with Theorem B.1, we get Theorem 4.1.
B.2 Disambiguating between feasible and infeasible instances
Consider a feasible instance where Arm 1 is the only feasible arm and therefore, the only optimal arm. Let be the minimum value of for the set of infeasible arms. For Arm 1 we define
Consider another instance where for and such that and where Using divergence decomposition lemma, we have and by using Lemma 2 we have
Let event {feasibility_flag = false}. Taking logarithm on both sides and rearranging gives
The right-hand side goes to zero as the horizon goes to infinity; this follows from the definition of consistency and the fact that arm 1 is suboptimal in the modified instance. Hence, we have
This shows that, for at least one of the two instances, the probability of incorrect detection decays slower than exponentially in the horizon.
B.3 Feasible instances
Consider a class of distributions and a threshold For a feasible instance where define, for each non-optimal arm
We will show that
Proof
Let and fix any Let be a bandit instance with and for be such that and A distribution like exists because of the definition of Note that arm is the unique optimal arm for bandit instance Using divergence decomposition lemma, we have and by using Lemma 2 we have
Let event
Rearranging and taking the limit inferior we get
The last equality follows from the definition of consistency. As an arm which is suboptimal or infeasible or both is played only times in expectation for all , there exists a constant for large enough such that This gives
As this holds for all and was an arbitrary constant greater than zero, we have the result.
B.4 Infeasible instance
For an infeasible instance where define, for each non-optimal arm
We will show that for arm
Proof
Let and fix any . Let be a bandit instance with for and such that and A distribution like exists because of the definition of Observe that the instance could be a feasible instance. Nonetheless, arm is the unique optimal arm irrespective of the feasibility of instance Using divergence decomposition lemma, we have and by using Lemma 2 we have
Let event
Rearranging and taking the limit inferior we get
The last equality follows from the definition of consistency. As is an infeasible instance, arm is suboptimal and is played only times in expectation for all For instance arm is the unique optimal arm. Therefore, all the other arms are played only times in expectation for all Hence, there exists a constant for large enough such that This gives
As this holds for all and was an arbitrary constant greater than zero, we have the result.