J. Nair, IIT Bombay (email: [email protected])
K. Jagannathan, IIT Madras (email: [email protected])
Constrained regret minimization for multi-criterion multi-armed bandits
Abstract
We consider a stochastic multi-armed bandit setting and study the problem of constrained regret minimization over a given time horizon. Each arm is associated with an unknown, possibly multi-dimensional distribution, and the merit of an arm is determined by several, possibly conflicting attributes. The aim is to optimize a ‘primary’ attribute subject to user-provided constraints on other ‘secondary’ attributes. We assume that the attributes can be estimated using samples from the arms’ distributions, and that the estimators enjoy suitable concentration properties. We propose an algorithm called Con-LCB that guarantees a logarithmic regret, i.e., the average number of plays of all non-optimal arms is at most logarithmic in the horizon. The algorithm also outputs a boolean flag that correctly identifies, with high probability, whether the given instance is feasible/infeasible with respect to the constraints. We also show that Con-LCB is optimal within a universal constant, i.e., that more sophisticated algorithms cannot do much better universally. Finally, we establish a fundamental trade-off between regret minimization and feasibility identification. Our framework finds natural applications, for instance, in financial portfolio optimization, where risk constrained maximization of expected return is meaningful.
Keywords: Multi-criterion multi-armed bandits · constrained bandits · regret minimization
1 Introduction
The multi-armed bandit (MAB) problem is a fundamental construct in online learning, where a learner has to quickly identify the best option (a.k.a., arm) among a given set of options. In the stochastic MAB problem, each arm is associated with an (a priori unknown) reward distribution, and a sample from this distribution is revealed each time an arm is chosen (a.k.a., pulled). The classical goal is to use these samples to quickly identify the arm with the highest mean reward. The most popular metric to evaluate the performance of a learning algorithm is regret, which captures how often a suboptimal arm was pulled by the algorithm.
While the classical stochastic MAB formulation has been applied in various application scenarios, including clinical trials, portfolio optimization, anomaly detection, and telecommunication Bouneffouf and Rish [2019], it ignores a key aspect of most real-world decision-making problems—namely, that they have multiple criteria of interest. For example, when comparing testing kits in a clinical trial, one would want to keep track of the false-positive rate as well as the false-negative rate of each kit. Similarly, choosing the best financial portfolio involves balancing risk and reward. A wireless node deciding which channel to transmit on, or what transmission rate to use has to balance several criteria, including throughput, delay, and energy consumption. This multi-criterion nature of decision making is not always adequately captured by the classical MAB approach of optimizing a one-dimensional reward signal.
The most common approach for incorporating multiple arm attributes into MAB formulations is to define the reward as a suitable function (say a linear combination) of the attributes of interest. For example, risk-aware portfolio optimization can be cast as an MAB problem where the best arm is one that optimizes a certain linear combination of mean value and a suitable risk measure, such as standard deviation or Conditional Value at Risk (see, for example, Sani et al. [2012], Vakili and Zhao [2016], Kagrecha et al. [2019]). However, this approach assumes that the different attributes of interest can be expressed and compared on a common scale, which is not always reasonable. For example, how does one 'equate' the impact of a certain increment in risk to that of an increment in the mean return of a portfolio?
A more natural approach for multi-criterion decision making is to instead pose the optimal choice as the solution of a constrained optimization; in other words, optimize one attribute subject to constraints on the others. However, despite the modest literature on multi-criterion MABs (surveyed below), little attention has been paid to formulating, and designing algorithms for, constrained multi-criterion MABs. This paper seeks to fill this gap. Formally, we assume that each arm is associated with a multi-dimensional probability distribution, and the best arm is the one that minimizes a certain arm attribute subject to constraints on other attributes. For this setting, we pursue regret minimization, i.e., we seek to minimize the average number of pulls of non-optimal arms.
To simplify the presentation, and to highlight the key aspects of this problem, we first consider a single constraint. Each arm is associated with a probability distribution, and we consider the problem of optimizing an objective attribute, subject to a single constraint attribute satisfying a user-specified constraint. The algorithm design and performance evaluation for this special case generalize readily to the multi-criterion problem; see Section 5. An example that fits this template, and is of significant interest in the finance community, is the optimization of the expected return subject to a risk constraint. Here, the objective attribute is the mean of an arm distribution, while the constraint attribute measuring risk could be the Conditional Value at Risk (CVaR).
For this problem, we propose an algorithm, called Constrained Lower Confidence Bound (Con-LCB), that guarantees logarithmic regret, i.e., the average number of plays of all non-optimal arms (including those that violate the constraint) is at most logarithmic in the horizon. If Con-LCB is presented with an infeasible instance, i.e., an instance where all arms violate the specified constraint, the algorithm in effect relaxes this constraint just enough to make at least one arm compliant. Another feature of Con-LCB is that, at the end of the horizon, it outputs a boolean flag that correctly identifies, with high probability, whether or not the given instance was feasible.
Finally, we establish fundamental lower bounds on the performance of any algorithm on this constrained regret minimization problem. Our results demonstrate a fundamental tradeoff between regret minimization and feasibility identification, similar to the well-known tradeoff between regret minimization and best arm identification in the classical (unconstrained) MAB problem Bubeck et al. [2009].
The remainder of this paper is organized as follows. A brief survey of related literature is provided below. In Section 2, we introduce some preliminaries and formulate the constrained mean minimization problem. We present our algorithm (Con-LCB) and its performance guarantees in Section 3. Information-theoretic lower bounds on performance are discussed in Section 4. Finally, the general formulation for multi-criterion MABs is introduced in Section 5. All proofs are deferred to an appendix, which is part of the ‘supplementary materials’ document uploaded separately.
Related literature: The literature related to multi-armed bandit problems is quite large. We refer the reader to Bubeck and Cesa-Bianchi [2012], Lattimore and Szepesvári [2018] for a comprehensive review. Here, we restrict ourselves to papers that consider (i) multi-objective MAB problems with vector rewards, and (ii) risk-aware arm selection.
For multi-objective MAB problems with vector rewards, different notions of optimality are considered. For example, Drugan and Nowe [2013], Yahyaa and Manderick [2015] consider the notion of Pareto optimality. In these papers, all dimensions are considered equally important and the aim is to play all Pareto-optimal arms an equal number of times. Another important notion of optimality is lexicographic optimality (see Ehrgott [2005]), where there is an order of importance among the different dimensions. Tekin and Turğay [2018], Tekin [2019] consider lexicographic optimality for contextual bandit problems. In this line of work, the goal is to obtain a higher reward in the more important dimensions, using the rewards obtained in dimensions of lower importance for tie-breaking.
Turning now to the literature on risk-aware arm selection, Sani et al. [2012], Galichet et al. [2013], Vakili and Zhao [2016], David and Shimkin [2016], Bhat and Prashanth [2019], Prashanth et al. [2020], Kagrecha et al. [2019] consider the problem of optimizing a risk metric alone, or a linear combination of the mean and a risk metric. Zimin et al. [2014] look at the learnability of general functions of mean and variance, and Maillard [2013] proposes optimizing the logarithm of the moment generating function as a risk measure in a regret minimization framework. Cassel et al. [2018] look at path-dependent regret and provide a general approach for studying many risk metrics.
None of the above papers considers the constrained MAB problem, which frames the optimal arm as the solution of a constrained optimization problem. There has been some recent work on the constrained MAB problem. A constrained linear bandit setting is considered in Pacchiano et al. [2021], under the assumption that there is at least one arm that satisfies the constraints. The papers Amani et al. [2019], Moradipari et al. [2019] consider the problem of maximizing the reward subject to satisfying a linear constraint with high probability. The constrained setting was also considered in David et al. [2018] and Chang [2020]. Both these papers consider a single constraint on arm selection; David et al. [2018] considers a constraint on VaR, and Chang [2020] considers an average cost constraint (each arm has a cost distribution that is independent of its reward distribution). All the papers above implicitly assume that the instance presented is feasible, whereas we address the issue of encountering an infeasible instance.
2 Problem formulation
In this section, we describe the formulation of our constrained stochastic MAB problem. To keep the exposition simple, we consider a single constraint for now; the general formulation with multiple constraints is discussed in Section 5. Informally, under the regret minimization objective considered here, the goal is to play, as often as possible, the arm that optimizes the objective attribute, subject to a constraint on another attribute.
Formally, consider a multi-armed bandit problem with a finite number of arms. Each arm is associated with a (possibly multi-dimensional) probability distribution belonging to a known space of possible arm distributions; the (joint) distribution of each arm is a priori unknown to the learner. The objective and the constraint attributes are defined via two real-valued functionals on this space of distributions. Additionally, the user provides a threshold, which specifies an upper bound on the constraint attribute. An instance of the constrained MAB problem is specified by the tuple of arm distributions, together with the two attribute functionals and the threshold. Arms whose constraint attribute does not exceed the threshold are called feasible arms. The instance is said to be feasible if at least one feasible arm exists, and infeasible otherwise.
Consider first a feasible instance. An optimal arm in this case is one that minimizes the objective attribute subject to the constraint; we refer to its objective value as the optimal value. Arms whose objective attribute exceeds the optimal value (whether or not they are feasible) are referred to as suboptimal arms. Note that there can also exist infeasible arms with an objective value smaller than the optimal value; we refer to such arms as deceiver arms. For a suboptimal arm, the suboptimality gap is the difference between its objective value and the optimal value. (Note that the suboptimality gap is also defined for infeasible, non-deceiver arms.) Finally, for an infeasible arm, the infeasibility gap is the amount by which its constraint attribute exceeds the threshold. Figure 1(a) provides a visual representation of a feasible instance.
Next, consider an infeasible instance. The optimal arm in this case is defined as the one with the smallest value of the constraint attribute. In effect, this requires that an algorithm faced with an infeasible instance 'relax' the constraint just enough for at least one arm to satisfy it. The constraint gap of a non-optimal arm is the difference between its constraint attribute and that of the optimal arm. Figure 1(b) provides a visual representation of an infeasible instance.
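For concreteness, one way to write the above formulation in symbols is given below; the notation (arms indexed by $k \le K$, distributions $\nu_k$, objective $f$, constraint $g$, threshold $\tau$) is introduced here only for readability and is not taken from the paper.
\[
\mathcal{F} = \{k \le K : g(\nu_k) \le \tau\}, \qquad
k^{*} \in \arg\min_{k \in \mathcal{F}} f(\nu_k) \ (\text{feasible instance}), \qquad
k^{*} \in \arg\min_{k \le K} g(\nu_k) \ (\text{infeasible instance}),
\]
\[
\Delta_k = f(\nu_k) - f(\nu_{k^{*}}) \ (\text{suboptimality gap}), \qquad
\Delta_k^{c} = g(\nu_k) - \tau \ (\text{infeasibility gap}), \qquad
\text{deceivers: } \{k \notin \mathcal{F} : f(\nu_k) < f(\nu_{k^{*}})\}.
\]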
For any (feasible or infeasible) instance, we keep track of the set of optimal arms and of the horizon, i.e., the total number of pulls. For a given algorithm (a.k.a. policy), we also track the number of times each arm has been pulled up to any given round; the dependence of these pull counts on the policy is often suppressed for notational simplicity.


Consistency: We now define the notion of consistency of an algorithm in this setting. A policy is said to be consistent over a class of distributions, given a pre-specified constraint threshold, if for every instance whose arm distributions lie in that class, the expected number of pulls of every non-optimal arm grows slower than any positive power of the horizon. This definition is in line with the definition of consistency in the classical unconstrained regret minimization setting (see Lai and Robbins [1985]).
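In symbols (with $N_k(T)$ denoting the number of pulls of arm $k$ over a horizon of $T$ rounds, a notation we adopt only for these illustrative displays), consistency requires
\[
\mathbb{E}\big[N_k(T)\big] = o(T^{a}) \quad \text{for every } a > 0 \text{ and every non-optimal arm } k.
\]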
Regret: The formal definition of regret in the present setting is as follows. For a feasible instance, there are two types of regret: the suboptimality regret, which is the regret incurred due to the sampling of feasible but suboptimal arms, and the infeasibility regret, which is the regret incurred due to the sampling of infeasible arms. For an infeasible instance, regret is incurred by playing arms that are farther from the constraint boundary than the optimal arm; we refer to this as the constraint regret.
(Note that our analysis essentially provides upper and lower bounds on the expected number of pulls of every non-optimal arm. So alternative definitions of regret, involving linear combinations of the expected number of pulls of non-optimal arms, are also supported by our analysis.)
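One natural formalization of these regret notions, in the same ad hoc notation as above, weights the expected pulls of each non-optimal arm by the corresponding gap; the paper's exact choice of weights may differ, so the displays below should be read as illustrative:
\[
R^{\mathrm{sub}}(T) = \sum_{k \in \mathcal{F}:\, \Delta_k > 0} \Delta_k\, \mathbb{E}[N_k(T)], \qquad
R^{\mathrm{inf}}(T) = \sum_{k \notin \mathcal{F}} \Delta_k^{c}\, \mathbb{E}[N_k(T)],
\]
\[
R^{\mathrm{con}}(T) = \sum_{k \neq k^{*}} \big(g(\nu_k) - g(\nu_{k^{*}})\big)\, \mathbb{E}[N_k(T)] \quad (\text{infeasible instance}).
\]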
Feasibility identification: Next, we introduce an optional boolean flag called feasibility_flag, which the policy may set at the end of the horizon to indicate, post facto, whether it considered the instance feasible (flag set to true) or infeasible (flag set to false). For the algorithms proposed in this paper, we provide bounds on the probability that the flag erroneously marks a feasible instance as infeasible, and vice versa. We also provide fundamental lower bounds on this probability of error under any consistent policy.
Concentration inequalities for attribute estimators: Naturally, MAB algorithms must estimate the objective and constraint attributes of each arm using the samples obtained by pulling that arm. We make the following assumption on the concentration properties of these estimators. For each attribute $g$ (objective or constraint) and each distribution $\nu$ in the class of arm distributions, there exists an estimator $\hat{g}_m(\nu)$, computed from $m$ i.i.d. samples of $\nu$, and a constant $\gamma > 0$ such that for all $\epsilon > 0$,
\[
\mathbb{P}\left(\left|\hat{g}_m(\nu) - g(\nu)\right| \geq \epsilon\right) \leq 2\exp\left(-\gamma\, m\, \epsilon^{2}\right). \tag{1}
\]
Concentration inequalities of this form are available in a broad variety of settings. For example, if the attribute is the expectation of a function of the arm's (possibly vector-valued) reward, then a concentration inequality of the form (1) is readily obtained when that function is bounded (using the Hoeffding inequality) or sub-Gaussian. Similarly, if the attribute is a Lipschitz function of a sub-Gaussian random vector, concentration bounds of the form (1) can be obtained by invoking the results in Kontorovich [2014]. Several examples where risk measures concentrate in this manner are provided in Cassel et al. [2018]. Also, note that the specific form (1) of the concentration inequality is assumed only to simplify the description of our algorithms; alternative forms of concentration guarantees (such as those known for the means of sub-exponential or heavy-tailed distributions) can also be supported by our algorithmic framework via minor modifications to the confidence bounds.
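As a concrete illustration of how a bound of the form (1) translates into confidence intervals, the snippet below inverts the Hoeffding bound for [0, 1]-valued samples; the function name and the choice of confidence level are ours, not part of the paper.

```python
import numpy as np

def hoeffding_radius(num_samples: int, delta: float) -> float:
    """Confidence radius for the empirical mean of [0, 1]-valued samples.

    Inverts the Hoeffding bound P(|mean_hat - mean| >= eps) <= 2 exp(-2 m eps^2),
    an instance of the generic inequality (1), so that the true attribute lies
    within +/- radius of the estimate with probability at least 1 - delta.
    """
    return np.sqrt(np.log(2.0 / delta) / (2.0 * num_samples))

# Example: estimate the mean of a Bernoulli(0.3) arm from 200 samples.
rng = np.random.default_rng(0)
samples = rng.binomial(1, 0.3, size=200)
est = samples.mean()
rad = hoeffding_radius(len(samples), delta=0.05)
print(f"estimate = {est:.3f}, 95% interval = [{est - rad:.3f}, {est + rad:.3f}]")
```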
3 Constrained regret minimization
In this section, we present an algorithm for constrained regret minimization, and provide performance guarantees for the same, assuming that we have estimators for each attribute that satisfy the concentration inequality (1).
The algorithm, which we refer to as the constrained lower confidence bound (Con-LCB) algorithm, is based on the well-known principle of optimism under uncertainty. Con-LCB uses lower confidence bounds (LCBs) on the constraint attribute of each arm to maintain a set of plausibly feasible arms, and uses LCBs on the objective attribute of the arms in this set to select the arm to be played. Note that LCBs are used for the objective attribute because we are dealing with a minimization problem. If, at some instant, the set of plausibly feasible arms maintained by Con-LCB becomes empty, the algorithm turns conservative and plays the arm that violates the constraint the least, i.e., the one with the smallest LCB on the constraint attribute. Finally, at the end of the horizon, Con-LCB sets the feasibility flag to true if the set of plausibly feasible arms is non-empty, and to false otherwise.
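The following sketch (ours, not the authors' released implementation) renders the above description in Python. The sampler interface, the empirical-mean estimators, and the confidence radius sqrt(conf * log T / n_k) are assumptions made for illustration; any estimators satisfying (1), with the corresponding radii, could be substituted.

```python
import numpy as np

def con_lcb(arms, horizon, tau, rng=None, conf=2.0):
    """Sketch of Con-LCB: minimize the objective attribute subject to the
    constraint attribute being at most tau.

    `arms` is a list of callables; arms[k](rng) returns one sample
    (objective_value, constraint_value) from arm k.
    """
    rng = rng or np.random.default_rng()
    K = len(arms)
    pulls = np.zeros(K, dtype=int)
    obj_sum = np.zeros(K)
    con_sum = np.zeros(K)

    def play(k):
        o, c = arms[k](rng)
        pulls[k] += 1
        obj_sum[k] += o
        con_sum[k] += c

    for k in range(K):                    # pull every arm once to initialize
        play(k)

    for _ in range(K, horizon):
        radius = np.sqrt(conf * np.log(horizon) / pulls)
        obj_lcb = obj_sum / pulls - radius       # optimism for minimization
        con_lcb_vals = con_sum / pulls - radius  # optimism w.r.t. feasibility
        plausible = np.flatnonzero(con_lcb_vals <= tau)
        if plausible.size > 0:
            k = plausible[np.argmin(obj_lcb[plausible])]
        else:
            k = int(np.argmin(con_lcb_vals))     # least-violating arm
        play(k)

    radius = np.sqrt(conf * np.log(horizon) / pulls)
    feasibility_flag = bool(np.any(con_sum / pulls - radius <= tau))
    return pulls, feasibility_flag
```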
The remainder of this section is devoted to performance guarantees for Con-LCB. We consider feasible and infeasible instances separately.
Theorem 3.1
Consider a feasible instance. Under Con-LCB, the expected number of pulls of a feasible but suboptimal arm is bounded by
The expected number of pulls of a deceiver arm is bounded by
The expected number of pulls of an infeasible, non-deceiver arm is bounded by
The probability of incorrectly setting the feasibility_flag is upper bounded by
The main takeaways from Theorem 3.1 are as follows.
The upper bound on the expected number of pulls for feasible, suboptimal arms is logarithmic in the horizon and inversely proportional to the square of the suboptimality gap. This is similar to bounds known for classical (unconstrained) MABs.
For deceiver arms, the upper bound on the expected number of pulls, also logarithmic in the horizon, is inversely proportional to the square of the feasibility gap. This is similar to the bound one would obtain in a pure minimization problem for a hypothetical instance consisting of all the originally infeasible arms, together with a single hypothetical (optimal) arm whose constraint attribute equals the threshold.
The bound on the expected number of pulls of non-deceiver, infeasible arms involves a minimum of the dominant terms in the above two cases. Intuitively, this is because these arms can be disambiguated in two ways: via the suboptimality gap, and via the feasibility gap.
The probability that the feasibility flag incorrectly identifies the instance as infeasible is upper bounded by a power law in the horizon. Note that the specific form of the misidentification probability bound is not fundamental; a small modification of the algorithm would make this bound a faster-decaying power law, at the expense of a multiplicative bump in regret. However, the fact that this probability is not much smaller (for example, exponentially decaying in the horizon) is a consequence of an inherent tension between regret minimization and feasibility identification. A similar tension is known to exist between regret minimization and best arm identification in the unconstrained setting; see Bubeck et al. [2009]. We provide lower bounds on the probability that any consistent algorithm makes a mistake in feasibility identification in Section 4.
Finally, we note that both the suboptimality regret and the infeasibility regret are logarithmic in the horizon; the suboptimality-regret bound is similar to regret bounds in the classical unconstrained setting. To express the infeasibility regret compactly, we adopt the convention that, for deceiver arms, only the feasibility gap enters the corresponding term of the bound. With this convention, the infeasibility regret of a feasible instance is also bounded logarithmically in the horizon.
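Schematically, with constants suppressed and $\gamma$ the constant from (1), the per-arm bounds in Theorem 3.1 have the following shape; this is an illustration of their dependence on the gaps and the horizon, in the ad hoc notation introduced earlier, not a restatement of the theorem's exact constants:
\[
\mathbb{E}[N_k(T)] \lesssim \frac{\log T}{\gamma\, \Delta_k^{2}} \ (\text{feasible, suboptimal}), \qquad
\mathbb{E}[N_k(T)] \lesssim \frac{\log T}{\gamma\, (\Delta_k^{c})^{2}} \ (\text{deceiver}), \qquad
\mathbb{E}[N_k(T)] \lesssim \frac{\log T}{\gamma\, \max\{\Delta_k, \Delta_k^{c}\}^{2}} \ (\text{infeasible, non-deceiver}).
\]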
Next, we move on to characterizing the performance of Con-LCB over infeasible instances.
Theorem 3.2
Consider an infeasible instance. Under Con-LCB, the expected number of pulls of a non-optimal arm is bounded by
Moreover, for all horizons beyond an instance-dependent constant, the probability that the algorithm incorrectly flags the instance as feasible is bounded by a power law in the horizon.
For an infeasible instance, the upper bound on the expected number of pulls of a non-optimal arm, logarithmic in the horizon and inversely proportional to the square of the constraint gap, is structurally similar to the bound one would obtain in a pure minimization problem on the same instance. However, note that when faced with an infeasible instance, Con-LCB only starts playing the optimal arm regularly after all arms appear to be infeasible; this explains the additional additive term in our bounds.
Here, the constraint regret is likewise bounded logarithmically in the horizon.
Finally, as before, the probability that the feasibility flag wrongly identifies the infeasible instance as feasible decays as a power law in the horizon once the horizon exceeds an instance-dependent threshold; this threshold accounts for the time it takes for the algorithm to 'detect', with high probability, that the instance is infeasible.
4 Information theoretic lower bounds
In this section, we establish fundamental limits on the performance of algorithms for constrained regret minimization. First, we show that the regret bounds obtained for Con-LCB are asymptotically tight up to universal multiplicative constants (on a class of Gaussian bandit instances). We then prove a lower bound on the probability that any consistent algorithm misidentifies a feasible instance as infeasible, or vice versa. This result illustrates an inherent tension between regret minimization and feasibility identification: consistent algorithms (recall that consistency means the expected number of pulls of every non-optimal arm grows slower than any positive power of the horizon) cannot have a misidentification probability that decays exponentially in the horizon, and algorithms whose misidentification probability decays exponentially in the horizon cannot be consistent.
To state our information-theoretic lower bound on regret, suppose that the class of arm distributions is the class of 2-dimensional Gaussian distributions with a fixed, common covariance matrix. Let the objective attribute be the mean of the first dimension, and the constraint attribute be the mean of the second dimension. For the assumed covariance structure, the standard empirical mean estimators of both attributes satisfy the concentration property stated in (1).
Theorem 4.1
Consider any policy that is consistent over this class of distributions given the threshold, and consider a feasible instance. For any feasible but suboptimal arm,
For any deceiver arm
Finally, for any infeasible non-deceiver arm
Similarly, for an infeasible instance, for any non-optimal arm,
The proof of Theorem 4.1 can be found in Appendix B. Comparing the lower bounds in Theorem 4.1 with the upper bounds for Con-LCB in Theorems 3.1 and 3.2, we conclude that Con-LCB is asymptotically optimal on regret, up to universal multiplicative constants. (Information-theoretic lower bounds on the expected number of pulls of non-optimal arms can be derived, in terms of a certain optimization over KL divergences, for any general class of arm distributions; see Appendix B.1. We have specialized the statement of Theorem 4.1 to a class of Gaussian bandit instances to enable an easy comparison with our upper bounds for Con-LCB.)
Next, we address the fundamental tradeoff between regret minimization and feasibility identification.
Theorem 4.2
Consider a space of arm distributions and a threshold such that the space contains both feasible and infeasible arm distributions. Then there exist a feasible instance and an infeasible instance such that, for any policy that is consistent over this space,
Theorem 4.2 states that, for any consistent algorithm, the probability that the feasible instance gets misidentified as infeasible and the probability that the infeasible instance gets misidentified as feasible cannot both decay exponentially in the horizon. This is of course consistent with the power-law probability of misidentification under Con-LCB. In other words, slower-than-exponential decay of the probability of feasibility misidentification with respect to the horizon is an unavoidable consequence of the exploration-exploitation interplay in regret minimization. A similar tension between regret minimization and best arm identification was noted for unconstrained MABs by Bubeck et al. [2009].
5 General framework for constrained MABs
In this section, we provide a general formulation for constrained stochastic MABs with multiple criteria. We allow each arm to be associated with a multi-dimensional probability distribution, the goal being to optimize one (dominant) attribute associated with this distribution, subject to constraints on the others. The algorithm design and performance evaluation carried out in Sections 2–4 for the special case of a single constraint extend naturally to this general formulation, which can in turn be applied in a variety of application scenarios. We illustrate a few here.
For clinical trials, with the arms corresponding to various treatment protocols, the dominant attribute might, for example, correspond to the success/recovery probability, whereas the constraints might capture recovery time, severity of side effects, etc.
For product/service rating, where the arms correspond to various service providers, the dominant attribute might correspond to product quality, with constraints capturing reliability, pricing, customer service, etc.
In wireless networks, the arms might correspond to various access networks or channels, with, for example, the dominant attribute corresponding to throughput, and constraints capturing delay, energy efficiency, etc.
Formulation: Consider a set of arms, each associated with a multi-dimensional probability distribution; the joint distribution corresponding to each arm belongs to a known space of possible arm distributions. The objective and the constraints are defined via real-valued functions on this space. Specifically, when the instance is feasible (i.e., at least one arm satisfies all the constraints), the optimal arm is the one that minimizes the objective function subject to the constraints.
If the instance is infeasible, the optimal arm is defined via a ranked list of the constraints that orders them by 'importance'. Without loss of generality, assume that the order of importance is increasing in the constraint index. The idea is to relax the constraints one by one, starting with the least important, until a compliant arm is found. Formally, for a given infeasible instance, consider, for each index, the set of arms that satisfy all constraints from that index onwards; the relaxation level is the smallest index for which this set is non-empty, i.e., the fewest number of constraints one must relax, in order of increasing importance, in order to have at least one compliant arm. An optimal arm is then defined with respect to this minimal relaxation (a code sketch of the relaxation rule is given below).
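A small sketch of the relaxation rule just described, computing how many constraints must be dropped before some arm complies. This is our own illustration; the ordering convention (column 0 is the least important constraint) and the function name are assumptions.

```python
import numpy as np

def relaxation_level(constraint_values, thresholds):
    """Smallest number of least-important constraints to relax so that some
    arm satisfies all remaining constraints.

    constraint_values: (num_arms, N) array of true constraint attributes.
    thresholds: length-N array; constraint j requires value <= thresholds[j].
    Columns are ordered from least (0) to most (N-1) important, so
    constraints are dropped left to right.
    """
    values = np.asarray(constraint_values, dtype=float)
    thr = np.asarray(thresholds, dtype=float)
    N = values.shape[1]
    for dropped in range(N + 1):
        ok = np.all(values[:, dropped:] <= thr[dropped:], axis=1)
        if ok.any():   # dropped = N always succeeds (no constraints remain)
            return dropped, np.flatnonzero(ok)

# Example: two constraints; no arm satisfies both, but arm 1 satisfies the
# more important one, so relaxing a single constraint suffices.
vals = np.array([[0.9, 0.8],
                 [0.7, 0.4]])
print(relaxation_level(vals, thresholds=[0.5, 0.5]))  # -> (1, array([1]))
```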
Similar to the single-constraint case, we make the following assumption to simplify algorithm design. For each of the objective and constraint functions $g_j$ and each distribution $\nu$ in the class, there exists an estimator $\hat{g}_{j,m}(\nu)$, computed from $m$ i.i.d. samples of $\nu$, and a constant $\gamma_j > 0$ such that for all $\epsilon > 0$,
\[
\mathbb{P}\left(\left|\hat{g}_{j,m}(\nu) - g_j(\nu)\right| \geq \epsilon\right) \leq 2\exp\left(-\gamma_j\, m\, \epsilon^{2}\right). \tag{2}
\]
Note that this is simply an extension of our assumption (1).
Algorithm and performance guarantees: For the above problem formulation, we propose Multi-Con-LCB, a simple extension of the Con-LCB algorithm presented earlier for the single-constraint case. Multi-Con-LCB uses confidence bounds on the constraint attributes of each arm to maintain, for every constraint, a set of arms that plausibly satisfy that constraint; intersecting these per-constraint sets yields the set of arms that plausibly satisfy all constraints. Since this intersection might be empty, the algorithm also tracks, for each level of relaxation, the set of arms that plausibly satisfy the remaining (more important) constraints, with the convention that the fully relaxed set contains all arms. If the set of arms plausibly satisfying all constraints is non-empty, the algorithm uses lower confidence bounds (LCBs) on the objective attribute to select the arm to be played. If this set is empty, the algorithm finds the smallest relaxation level at which the corresponding set is non-empty, and plays, among the arms in that set, the one with the lowest LCB on the relevant constraint attribute. Finally, at the end of the horizon, Multi-Con-LCB sets the feasibility flag to true if the set of arms plausibly satisfying all constraints is non-empty, and to false otherwise. The details are presented as Algorithm 2.
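The per-round selection described above can be sketched as follows. This is our own rendering, not Algorithm 2 verbatim: the confidence radius, the use of constraint LCBs for the plausibly feasible sets, and the fallback rule of minimizing the LCB of the most important relaxed constraint are assumptions made for illustration.

```python
import numpy as np

def multi_con_lcb_choice(con_est, obj_est, pulls, thresholds, t, conf=2.0):
    """One round of a Multi-Con-LCB-style selection (illustrative sketch).

    con_est: (K, N) empirical means of the N constraint attributes per arm,
             columns ordered from least (0) to most (N-1) important.
    obj_est: length-K empirical means of the objective attribute.
    pulls:   length-K pull counts; t is the current round index (t >= 2).
    Returns (arm to play, True if all constraints are plausibly satisfiable).
    """
    radius = np.sqrt(conf * np.log(t) / pulls)       # one radius per arm
    con_lcb = con_est - radius[:, None]               # optimistic constraint values
    obj_lcb = obj_est - radius                        # optimistic objective values
    plausibly_ok = con_lcb <= np.asarray(thresholds)  # (K, N) boolean matrix

    for dropped in range(con_est.shape[1] + 1):
        candidates = np.flatnonzero(plausibly_ok[:, dropped:].all(axis=1))
        if candidates.size > 0:
            if dropped == 0:
                arm = candidates[np.argmin(obj_lcb[candidates])]
            else:
                # Assumed fallback: minimize the LCB of the most important
                # constraint that had to be relaxed.
                arm = candidates[np.argmin(con_lcb[candidates, dropped - 1])]
            return int(arm), dropped == 0
```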
The remainder of this section is devoted to performance guarantees for Multi-Con-LCB. The suboptimality gap of an arm, and its infeasibility gap with respect to each constraint, are defined analogously to Section 2. We restrict our attention to feasible instances here; infeasible instances can be handled along lines similar to Theorem 3.2 for Con-LCB.
Theorem 5.1
Consider a feasible instance. Under Multi-Con-LCB, the expected number of pulls of a feasible but suboptimal arm (i.e., one that satisfies all the constraints but has a suboptimal objective value) is bounded by
The expected number of pulls of a deceiver arm (i.e., one with objective value smaller than the optimal value that violates at least one constraint) is bounded by
The expected number of pulls of an infeasible, non-deceiver arm (i.e., one with a suboptimal objective value that violates at least one constraint) is bounded by
The probability of incorrectly setting the feasibility_flag is upper bounded by
We omit the proof of Theorem 5.1 from the appendix, since it closely parallels the proof of Theorem 3.1. However, we state the key takeaways from Theorem 5.1 below.
Similar to Theorem 3.1, the upper bound on the expected number of pulls of suboptimal arms is logarithmic in the horizon and inversely proportional to the square of the suboptimality gap. Moreover, the upper bound for a deceiver arm is also logarithmic in the horizon, but involves a minimum over several terms, because the deceiver arm can be identified via any of the constraints that it violates. Similarly, the upper bound for an infeasible, non-deceiver arm is also logarithmic in the horizon, with a minimum over several terms, because such an arm can be identified either as suboptimal or as infeasible with respect to one of the constraints.
As the number of constraints increases, the upper bounds on the expected number of pulls of non-optimal arms grow linearly in the number of constraints, and the probability of misidentification is scaled by a corresponding factor. This slight degradation in the guarantees is expected: with more constraints, the optimal arm has to satisfy more conditions, and it becomes harder to compare different arms.
6 Numerical experiments
In this section, we present a simple experiment to show the numerical performance of our algorithm Con-LCB.
We consider an instance with four arms and two criteria. Each arm is associated with a two-dimensional Gaussian distribution; all arms share the same covariance matrix but have different mean vectors. The means corresponding to the first dimension are 0.3, 0.4, 0.2, and 0.5, and the means corresponding to the second dimension are 0.4, 0.4, 1.0, and 1.0, respectively.
The goal is to maximize the number of pulls of the arm with the minimum mean in the first dimension, subject to a constraint on the mean of the second dimension: we require that the mean of the second dimension be at most a specified threshold, chosen so that arms 1 and 2 are feasible while arms 3 and 4 are not. The instance is summarized in Table 1. We use empirical averages as estimators for the means, with standard confidence bounds based on a sub-Gaussianity assumption. We run the algorithm 1000 times for several values of the horizon. The suboptimality regret and the infeasibility regret are plotted in Figure 2; both grow sub-linearly with the horizon. Moreover, the algorithm correctly detected the feasibility of the instance in all 1000 runs.
Table 1: Summary of the two-criterion Gaussian instance.
Arm | Mean 1 | Mean 2 | Remarks
---|---|---|---
1 | 0.3 | 0.4 | Optimal arm
2 | 0.4 | 0.4 | Feasible, suboptimal
3 | 0.2 | 1.0 | Deceiver
4 | 0.5 | 1.0 | Infeasible, suboptimal
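For completeness, the snippet below (ours, not the repository code) instantiates the four-arm Gaussian instance of Table 1 with the sampler interface used in the Con-LCB sketch of Section 3. The covariance matrix and the threshold value 0.5 are illustrative choices that merely preserve the feasibility pattern indicated in the Remarks column.

```python
import numpy as np

# Dimension 1 is the objective (mean to be minimized); dimension 2 is constrained.
MEANS = np.array([[0.3, 0.4],   # arm 1: optimal
                  [0.4, 0.4],   # arm 2: feasible, suboptimal
                  [0.2, 1.0],   # arm 3: deceiver
                  [0.5, 1.0]])  # arm 4: infeasible, suboptimal
COV = np.array([[0.05, 0.0],    # illustrative covariance (not the paper's matrix)
                [0.0, 0.05]])
TAU = 0.5                        # illustrative threshold (not the paper's value)

def make_arm(mean, cov=COV):
    """Return a sampler: arm(rng) -> (objective_sample, constraint_sample)."""
    def arm(rng):
        x = rng.multivariate_normal(mean, cov)
        return x[0], x[1]
    return arm

arms = [make_arm(m) for m in MEANS]
rng = np.random.default_rng(0)
print(arms[0](rng))   # one (objective, constraint) sample from arm 1
# pulls, flag = con_lcb(arms, horizon=5000, tau=TAU)  # using the Section 3 sketch
```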

We consider another instance where the arms are Beta distributed. The parameters are chosen such that the means of the arms are 0.3, 0.4, 0.2, and 0.5, and the variances are 0.08, 0.06, 0.15, and 0.15, respectively. The goal here is to maximize the number of pulls of the arm with the minimum mean, subject to a constraint on the variance: we require that the variance be at most a specified threshold, chosen so that arms 1 and 2 are feasible while arms 3 and 4 are not. The instance is summarized in Table 2. We use the empirical average to estimate the mean of each arm and the standard unbiased estimator to estimate its variance. The concentration bounds we use rely on the fact that the underlying rewards are bounded between 0 and 1. We run the algorithm 100 times for several values of the horizon. The suboptimality regret and the infeasibility regret are plotted in Figure 3. As in the previous case, the algorithm correctly detected feasibility in all 100 runs.
We also compare Con-LCB with an algorithm designed to optimize a single 'Lagrange-relaxed' objective that combines the objective attribute and the constraint attribute via a weighting parameter. Since there is no systematic way of setting this weight so that the original (constrained) optimal arm also minimizes the relaxed metric, the approach is very fragile, as we demonstrate in Figure 4. The instance we use is the Beta-distributed instance in Table 2, and we plot the sum of the infeasibility regret and the suboptimality regret. When the weight is small, a deceiver arm is optimal for the relaxed objective, and when the weight is large, a suboptimal arm is optimal for the relaxed objective; the 'correct' range of the weight is instance dependent (and a priori unknown), as the simple computation following Table 2 also illustrates.
Table 2: Summary of the Beta-distributed instance.
Arm | Mean | Variance | Remarks
---|---|---|---
1 | 0.3 | 0.08 | Optimal arm
2 | 0.4 | 0.06 | Feasible, suboptimal
3 | 0.2 | 0.15 | Deceiver
4 | 0.5 | 0.15 | Infeasible, suboptimal
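The fragility of the Lagrange-relaxed objective can be checked directly on the population quantities of Table 2. The computation below is a back-of-the-envelope illustration with the weight written as lam; the specific weight values used in Figure 4 are not reproduced here.

```python
means = [0.3, 0.4, 0.2, 0.5]
variances = [0.08, 0.06, 0.15, 0.15]

def relaxed_best_arm(lam):
    """Arm (1-indexed) minimizing the relaxed score mean + lam * variance."""
    scores = [m + lam * v for m, v in zip(means, variances)]
    return scores.index(min(scores)) + 1

for lam in [0.5, 3.0, 10.0]:
    print(lam, relaxed_best_arm(lam))
# lam = 0.5  -> arm 3: the deceiver wins when the constraint is weighted too little
# lam = 3.0  -> arm 1: the constrained-optimal arm, but only for a middle range of lam
# lam = 10.0 -> arm 2: a feasible but suboptimal arm wins when lam is too large
```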


7 Concluding remarks
In this paper, we have introduced a general formulation of multi-criterion constrained MABs, which is applicable when there are multiple attributes of interest associated with each arm. We propose algorithms that incur logarithmic regret, and also provide information theoretic lower bounds on the performance of any algorithm. An interesting departure from the classical MAB formulation is the aspect of instance feasibility; our algorithms predict, post facto, with a probability of error that decays as a power law in the horizon, whether the instance presented was feasible. Interestingly, this illustrates a fundamental tradeoff between regret minimization and feasibility detection. Finally, our algorithms ‘auto-tune’ to an infeasible instance, relaxing the constraints on arm attributes (in order of increasing importance), until a compliant arm is found.
The proposed framework and our algorithms can be applied in a wide range of application scenarios (see Bouneffouf and Rish [2019] for a survey). Further, this work motivates the study of multi-criterion variants of other bandit formulations, including contextual bandits, combinatorial bandits, and correlated bandits.
8 Declarations
• Funding: the authors did not receive support from any organization for the submitted work.
• Conflicts of interest: the authors frequently collaborate with researchers from IIT Bombay (@iitb.ac.in), researchers from IIT Madras (@iitm.ac.in), and researchers from Stanford University (@stanford.edu). Jayakrishnan Nair received his PhD from Caltech (@caltech.edu) in 2012 and Krishna Jagannathan received his PhD from MIT (@mit.edu) in 2010.
• Ethics approval: not applicable
• Consent to participate: not applicable
• Consent for publication: not applicable
• Availability of data and material: not applicable
• Code availability: the Python code for the numerical experiments can be found in this GitHub repository.
• Authors' contributions: all authors contributed equally to the paper
References
- Bouneffouf and Rish [2019] Djallel Bouneffouf and Irina Rish. A survey on practical applications of multi-armed and contextual bandits. CoRR, abs/1904.10040, 2019. URL http://arxiv.org/abs/1904.10040.
- Sani et al. [2012] Amir Sani, Alessandro Lazaric, and Rémi Munos. Risk-aversion in multi-armed bandits. In Advances in Neural Information Processing Systems, pages 3275–3283, 2012.
- Vakili and Zhao [2016] Sattar Vakili and Qing Zhao. Risk-averse multi-armed bandit problems under mean-variance measure. IEEE Journal of Selected Topics in Signal Processing, 10(6):1093–1111, 2016.
- Kagrecha et al. [2019] Anmol Kagrecha, Jayakrishnan Nair, and Krishna Jagannathan. Distribution oblivious, risk-aware algorithms for multi-armed bandits with unbounded rewards. In Advances in Neural Information Processing Systems, pages 11272–11281, 2019.
- Bubeck et al. [2009] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In International conference on Algorithmic learning theory, pages 23–37. Springer, 2009.
- Bubeck and Cesa-Bianchi [2012] Sébastien Bubeck and Nicolo Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122, 2012.
- Lattimore and Szepesvári [2018] Tor Lattimore and Csaba Szepesvári. Bandit algorithms. preprint, page 28, 2018.
- Drugan and Nowe [2013] Madalina M Drugan and Ann Nowe. Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2013.
- Yahyaa and Manderick [2015] Saba Yahyaa and Bernard Manderick. Thompson sampling for multi-objective multi-armed bandits problem. In Proceedings, page 47. Presses universitaires de Louvain, 2015.
- Ehrgott [2005] Matthias Ehrgott. Multicriteria optimization, volume 491. Springer Science & Business Media, 2005.
- Tekin and Turğay [2018] Cem Tekin and Eralp Turğay. Multi-objective contextual multi-armed bandit with a dominant objective. IEEE Transactions on Signal Processing, 66(14):3799–3813, 2018.
- Tekin [2019] Cem Tekin. The biobjective multiarmed bandit: learning approximate lexicographic optimal allocations. Turkish Journal of Electrical Engineering & Computer Sciences, 27(2):1065–1080, 2019.
- Galichet et al. [2013] Nicolas Galichet, Michele Sebag, and Olivier Teytaud. Exploration vs exploitation vs safety: Risk-aware multi-armed bandits. In Asian Conference on Machine Learning, pages 245–260, 2013.
- David and Shimkin [2016] Yahel David and Nahum Shimkin. Pure exploration for max-quantile bandits. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 556–571. Springer, 2016.
- Bhat and Prashanth [2019] Sanjay P Bhat and LA Prashanth. Concentration of risk measures: A wasserstein distance approach. In Advances in Neural Information Processing Systems, pages 11739–11748, 2019.
- Prashanth et al. [2020] L. A. Prashanth, Krishna Jagannathan, and Ravi Kumar Kolla. Concentration bounds for CVaR estimation: The cases of light-tailed and heavy-tailed distributions. In International Conference on Machine Learning (To appear), 2020.
- Zimin et al. [2014] Alexander Zimin, Rasmus Ibsen-Jensen, and Krishnendu Chatterjee. Generalized risk-aversion in stochastic multi-armed bandits. arXiv preprint arXiv:1405.0833, 2014.
- Maillard [2013] Odalric-Ambrym Maillard. Robust risk-averse stochastic multi-armed bandits. In International Conference on Algorithmic Learning Theory, pages 218–233. Springer, 2013.
- Cassel et al. [2018] Asaf Cassel, Shie Mannor, and Assaf Zeevi. A general approach to multi-armed bandits under risk criteria. arXiv preprint arXiv:1806.01380, 2018.
- Pacchiano et al. [2021] Aldo Pacchiano, Mohammad Ghavamzadeh, Peter Bartlett, and Heinrich Jiang. Stochastic bandits with linear constraints. In International Conference on Artificial Intelligence and Statistics, pages 2827–2835. PMLR, 2021.
- Amani et al. [2019] Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Linear stochastic bandits under safety constraints. Advances in Neural Information Processing Systems, 32, 2019.
- Moradipari et al. [2019] Ahmadreza Moradipari, Sanae Amani, Mahnoosh Alizadeh, and Christos Thrampoulidis. Safe linear thompson sampling. arXiv preprint arXiv:1911.02156, 2019.
- David et al. [2018] Yahel David, Balázs Szörényi, Mohammad Ghavamzadeh, Shie Mannor, and Nahum Shimkin. PAC bandits with risk constraints. In ISAIM, 2018.
- Chang [2020] Hyeong Soo Chang. An asymptotically optimal strategy for constrained multi-armed bandit problems. Mathematical Methods of Operations Research, pages 1–13, 2020.
- Lai and Robbins [1985] Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22, 1985.
- Kontorovich [2014] Aryeh Kontorovich. Concentration in unbounded metric spaces and algorithmic stability. In International Conference on Machine Learning, pages 28–36, 2014.
Appendix A Upper bounds
In this section, we prove Theorems 3.1 and 3.2. The bounds in Subsections A.1–A.4 imply the statement of Theorem 3.1, and the bounds in Subsections A.5–A.6 imply the statement of Theorem 3.2.
A.1 Feasible instance: Upper bounding the expected pulls of deceiver arms
For a deceiver arm, we define a good event on which the estimator of the constraint attribute is concentrated enough, and derive an upper bound on the number of pulls of the deceiver arm.
(3)
On this event, we can lower bound the estimator of the constraint attribute for the arm as follows
If this lower bound exceeds the threshold, then the arm cannot be in the set of plausibly feasible arms. Hence, we can upper bound the number of pulls of an infeasible arm as follows
(4)
Now, let us upper bound the expected number of pulls of a deceiver arm
A.2 Feasible instance: Upper bounding the expected pulls of feasible suboptimal arms
We begin by showing that a feasible arm remains in the set of plausibly feasible arms when the estimator of its constraint attribute is concentrated enough. We define an event for a feasible arm as done in Equation (3). When this event holds, the estimator of the constraint attribute is upper bounded by
Hence, the arm remains in the set of plausibly feasible arms whenever the event holds.
We are considering the case where an arm is feasible but suboptimal. We will define a good event for arm and bound the number of pulls on this good event. Without loss of generality, assume arm 1 is optimal.
(5)
We will show that if holds, then . We will also show that holds with a small probability.
The proof is by contradiction. Suppose holds and , then there exists a such that and Using the definition of
Hence, which is a contradiction. Therefore, if occurs,
Now, consider the event
Let us bound the probability of the first term above.
Now, let us bound the second event above. By the choice of we have the following
Now,
We can show that and like in the previous subsection. Hence,
Hence, we can upper bound the number of pulls of feasible but suboptimal arms as follows
A.3 Feasible instance: Upper bounding the expected pulls of infeasible suboptimal arms
Consider arm which is both suboptimal and infeasible. Define an event as done in Equation (5). Recall that the upper bound on the pulls of the deceiver arms on event is denoted by and the upper bound on the pulls of the feasible but suboptimal arms on event is denoted by
On the event if then due to concentration of estimator of , this arm can’t be played more than times. If then due to suboptimality, this arm can’t be played more than times. We can show that the probability of is less than or equal to as we did before. Hence, we can upper bound the pulls of infeasible and suboptimal arms as follows
A.4 Feasible instance: Upper bounding the probability of misidentification
Feasibility is correctly detected if, for at least one of the feasible arms, the event defined in Equation (3) holds. We saw in Subsection A.2 that if the estimator of the constraint attribute is concentrated enough, a feasible arm always remains in the set of plausibly feasible arms.
Without loss of generality, assume that arm 1 is optimal. Then, we can lower bound the probability of correctly setting the flag as follows
This upper bounds the probability of incorrectly setting the flag
A.5 Infeasible instance: Upper bounding the expected pulls of non-optimal arms
In this section, we discuss the case where the given instance is infeasible. As defined before, the optimal choice for the algorithm is to play the arm with the minimum constraint attribute. We will upper bound the number of pulls of arms whose constraint attribute exceeds this minimum.
For all the arms, we define good events as in Equation (3) where the estimator for is concentrated enough. Let When event occurs, the set becomes empty after at most pulls where is defined in Equation (4). The analysis is similar to given in Subsection A.1.
Once the set becomes empty, the algorithm starts pulling arms with minimum lower confidence bound on estimator of . We will upper bound the number of pulls for the non-optimal arms. Without loss of generality, assume that arm 1 has the lowest . As we are dealing with an infeasible instance, For a non-optimal arm we define the following good event
One can check that is greater than because the gap is smaller than It is easy to argue using LCB based arguments that if occurs, then arm can’t be pulled more than times. This is similar to the proof given in Subsection A.2.
Let us upper bound the probability of Event is given by
Using analysis in Subsection A.1, we can show Let us bound the probability of the first term above. By the choice of we have
Now,
Hence, we can upper bound using a union bound
When the instance is infeasible, the expected number of pulls of each non-optimal arm is upper bounded by
A.6 Infeasible instance: Upper bounding the probability of misidentification
In this subsection, we will upper bound the probability of incorrectly setting the feasibility_flag.
Firstly, we define to be the minimum value of for which the following holds:
exists because for a fixed
For all the arms, we define good events as in Equation (3), on which the estimators of the constraint attribute are concentrated enough. On the intersection of these events, for horizons beyond the threshold defined above, the set of plausibly feasible arms remains empty.
We showed in the previous subsection that Hence, we can bound the probability of incorrectly setting the feasibility_flag as
Appendix B Lower bounds
In this section, we will prove Theorems 4.1 and 4.2. To prove Theorem 4.1, we will first prove a result which lower bounds the pulls of non-optimal arms when the arm distributions belong to a general distribution class and the attributes are not necessarily means. Theorem 4.1 is proved in Subsection B.1, Theorem 4.2 is proved in Subsection B.2, and subsequent subsections provide the proof for the intermediate results.
The proof technique to derive lower bounds for the constrained bandit setting is very similar to the technique used for standard stochastic bandit setting. We begin by stating some important results that will be used later in the proofs.
We first state the divergence decomposition lemma for the constrained bandit setting. The proof of the lemma is similar to the proof of divergence decomposition for the standard setting and we leave it to the reader to verify the result (see Lemma 15.1, Lattimore and Szepesvári [2018]).
Lemma 1
Consider two instances $\nu = (\nu_1, \dots, \nu_K)$ and $\nu' = (\nu'_1, \dots, \nu'_K)$, where each $\nu_k$ and $\nu'_k$ belongs to a space of distributions. Fix some policy $\pi$, and let $\mathbb{P}_{\nu}$ and $\mathbb{P}_{\nu'}$ be the probability measures on the constrained bandit model induced by the $T$-round interconnection between $\pi$ and $\nu$ (respectively, $\pi$ and $\nu'$). Then
\[
\mathrm{KL}\left(\mathbb{P}_{\nu}, \mathbb{P}_{\nu'}\right) = \sum_{k=1}^{K} \mathbb{E}_{\nu}\left[N_k(T)\right]\, \mathrm{KL}\left(\nu_k, \nu'_k\right). \tag{6}
\]
We also state the high probability Pinsker inequality (see Theorem 14.2, Lattimore and Szepesvári [2018]).
Lemma 2
Let $P$ and $Q$ be probability measures on the same measurable space, and let $A$ be an arbitrary event. Then,
\[
P(A) + Q(A^{c}) \geq \frac{1}{2} \exp\left(-\mathrm{KL}(P, Q)\right),
\]
where $A^{c}$ is the complement of $A$.
B.1 Lower bounding the pulls of non-optimal arms for the Gaussian case
We first state an instance-dependent lower bound for the expected pulls of non-optimal arms under consistent algorithms for any class of distributions and a threshold For a feasible instance where define, for each non-optimal arm
Similarly, for an infeasible instance where define, for each non-optimal arm
Theorem B.1
Let be a consistent policy over the class of distributions given the threshold Then for a feasible instance where for any non-optimal arm
For an infeasible instance where for any non-optimal arm
The bounds in Subsections B.3 and B.4 imply Theorem B.1. The main message here is that the ’hardness’ of a non-optimal arm , which dictates the minimum number of pulls required on average to distinguish it from the optimal arm, is characterized by the reciprocal of the smallest perturbation in terms of KL divergence needed to ‘make’ the arm optimal.
Consider two $d$-dimensional Gaussian distributions $\nu_1 = \mathcal{N}(\mu_1, \Sigma_1)$ and $\nu_2 = \mathcal{N}(\mu_2, \Sigma_2)$. The KL divergence between these distributions is given by
\[
\mathrm{KL}(\nu_1, \nu_2) = \frac{1}{2}\left[ \mathrm{tr}\left(\Sigma_2^{-1}\Sigma_1\right) + (\mu_2 - \mu_1)^{\top}\Sigma_2^{-1}(\mu_2 - \mu_1) - d + \ln\frac{\det \Sigma_2}{\det \Sigma_1} \right]. \tag{7}
\]
This is easy to derive but one can check Section 9 of these notes.
Let the class of arm distributions be the class of 2-dimensional Gaussian distributions with a fixed, common covariance matrix $\Sigma$. Let the objective attribute be the mean of the first dimension, and the constraint attribute be the mean of the second dimension. For two such distributions $\nu$ and $\nu'$ with mean vectors $\mu$ and $\mu'$, Equation (7) specializes to
\[
\mathrm{KL}(\nu, \nu') = \frac{1}{2}\, (\mu - \mu')^{\top} \Sigma^{-1} (\mu - \mu'). \tag{8}
\]
With and attributes as defined in Theorem 4.1, we can use Equation 8 to calculate and In particular, for a feasible but suboptimal arm we have
for a deceiver arm we have
and for an infeasible arm which is not a deceiver, we have
Finally, for a non-optimal arm for an infeasible instance, we have
Using the results above and combining them with Theorem B.1, we get Theorem 4.1.
B.2 Disambiguating between feasible and infeasible instances
Consider a feasible instance where Arm 1 is the only feasible arm and therefore, the only optimal arm. Let be the minimum value of for the set of infeasible arms. For Arm 1 we define
Consider another instance where for and such that and where Using divergence decomposition lemma, we have and by using Lemma 2 we have
Let event {feasibility_flag = false}. Taking logarithm on both sides and rearranging gives
The right-hand side goes to zero as the horizon goes to infinity; this follows from the definition of consistency and the fact that arm 1 is suboptimal in the modified instance. Hence, we have
This shows that, for at least one of the two instances, the probability of incorrect detection decays slower than exponentially in the horizon.
B.3 Feasible instances
Consider a class of distributions and a threshold For a feasible instance where define, for each non-optimal arm
We will show that
Proof
Let and fix any Let be a bandit instance with and for be such that and A distribution like exists because of the definition of Note that arm is the unique optimal arm for bandit instance Using divergence decomposition lemma, we have and by using Lemma 2 we have
Let event
Rearranging and taking the limit inferior we get
The last equality follows from the definition of consistency. As an arm which is suboptimal or infeasible or both is played only times in expectation for all , there exists a constant for large enough such that This gives
As this holds for all and was an arbitrary constant greater than zero, we have the result.
B.4 Infeasible instance
For an infeasible instance where define, for each non-optimal arm
We will show that for arm
Proof
Let and fix any . Let be a bandit instance with for and such that and A distribution like exists because of the definition of Observe that the instance could be a feasible instance. Nonetheless, arm is the unique optimal arm irrespective of the feasibility of instance Using divergence decomposition lemma, we have and by using Lemma 2 we have
Let event
Rearranging and taking the limit inferior we get
The last equality follows from the definition of consistency. As is an infeasible instance, arm is suboptimal and is played only times in expectation for all For instance arm is the unique optimal arm. Therefore, all the other arms are played only times in expectation for all Hence, there exists a constant for large enough such that This gives
As this holds for all and was an arbitrary constant greater than zero, we have the result.