The Symmetry between Arms and Knapsacks: A Primal-Dual Approach for Bandits with Knapsacks
‡ Institute for Computational and Mathematical Engineering, Stanford University
◇ Department of Management Science and Engineering, Stanford University
Abstract
In this paper, we study the bandits with knapsacks (BwK) problem and develop a primal-dual based algorithm that achieves a problem-dependent logarithmic regret bound. The BwK problem extends the multi-armed bandit (MAB) problem to model the resource consumption associated with playing each arm, and the existing BwK literature has mainly focused on deriving asymptotically optimal distribution-free regret bounds. We first study the primal and dual linear programs underlying the BwK problem. From this primal-dual perspective, we discover a symmetry between arms and knapsacks, and then propose a new notion of sub-optimality measure for the BwK problem. The sub-optimality measure highlights the important role of knapsacks in determining algorithm regret and inspires the design of our two-phase algorithm. In the first phase, the algorithm identifies the optimal arms and the binding knapsacks, and in the second phase, it exhausts the binding knapsacks via playing the optimal arms through an adaptive procedure. Our regret upper bound involves the proposed sub-optimality measure, and it has a logarithmic dependence on the length of horizon $T$ and a polynomial dependence on $m$ (the number of arms) and $d$ (the number of knapsacks). To the best of our knowledge, this is the first problem-dependent logarithmic regret bound for solving the general BwK problem.
1 Introduction
The Multi-Armed Bandit (MAB) problem is a problem in which a limited amount of resource must be allocated between competing (alternative) choices in a way that maximizes the expected gain (Gittins et al., 2011). It is a benchmark problem for decision making under uncertainty that has been studied for nearly a century. As a prototypical reinforcement learning problem, the MAB problem exemplifies the exploration–exploitation tradeoff dilemma (Weber et al., 1992). The original problem, first formulated in its predominant version in (Robbins, 1952), has inspired a recent line of research that considers additional constraints that reflect more accurately the reality of the online decision making process. The Bandits with Knapsacks (BwK) problem was introduced by Badanidiyuru et al. (2013) to allow more general constraints on the decisions across time, in addition to the customary limitation on the time horizon. The BwK problem, as a general framework, encompasses a wide range of applications, including dynamic pricing and revenue management (Besbes and Zeevi, 2012), online advertisement (Mehta et al., 2005), network routing (Agrawal et al., 2014), etc.
While the existing BwK literature (Badanidiyuru et al., 2013; Agrawal and Devanur, 2014) has derived algorithms that achieve optimal problem-independent regret bounds, a problem-dependent bound that captures the optimal performance of an algorithm on a specific BwK problem instance remains an open question. For the setting of the standard MAB problem, the problem-dependent bound has been well understood, and an upper bound with logarithmic dependence on $T$ can be achieved by both UCB-based algorithms (Auer et al., 2002) and Thompson sampling-based algorithms (Agrawal and Goyal, 2012). In this paper, we focus on developing a problem-dependent bound for the BwK problem and identify parameters that characterize the hardness of a BwK problem instance.
Two existing works along this line are (Flajolet and Jaillet, 2015) and (Sankararaman and Slivkins, 2020). The paper (Flajolet and Jaillet, 2015) considers several specific settings of the problem and, for the general BwK problem, achieves a logarithmic regret bound that is exponential in $m$ and $d$ (with other problem-dependent parameters omitted), where $m$ is the number of arms and $d$ is the number of knapsacks/resource constraints. In addition, the results in (Flajolet and Jaillet, 2015) require the knowledge of some parameters of the problem instance a priori. The recent work (Sankararaman and Slivkins, 2020) considers the BwK problem under the assumptions that there is only one single knapsack/resource constraint and that there is only one single optimal arm. In contrast to these two pieces of work, we consider the problem in its full generality and do not assume any prior knowledge of the problem instance. We will further compare with their results after we present our regret bound.
Specifically, we adopt a primal-dual perspective to study the BwK problem. Our treatment is new in that we highlight the effects of resources/knapsacks on regret from the dual perspective and define the sub-optimality measure based on the primal and dual problems jointly. To this end, we first derive a generic upper bound that works for all BwK algorithms. The upper bound consists of two elements: (i) the number of times for which a sub-optimal arm is played; (ii) the remaining knapsack resource at the end of the horizon. It emphasizes that the arms and knapsacks are of equal importance in determining the regret of a BwK algorithm. By further exploiting the structure of the primal and dual LPs, we develop a new sub-optimality measure for the BwK problem which can be viewed as a generalization of the sub-optimality measure for the MAB problem first derived in (Lai and Robbins, 1985). The sub-optimality measure accounts for both arms and knapsacks, and it aims to distinguish optimal arms from non-optimal arms, and binding knapsacks from non-binding knapsacks. We use this measure as a key characterization of the hardness of a BwK problem instance.
Inspired by these findings, we propose a two-phase algorithm for the problem. The first phase of our algorithm is elimination-based and its objective is to identify the optimal arms and binding knapsacks of the BwK problem. The second phase of our algorithm utilizes the output of the first phase and uses the optimal arms to exhaust the remaining resources through an adaptive procedure. Our algorithm and analysis are notable for their full generality: we only make a mild assumption on the non-degeneracy of the underlying LP. In addition, the algorithm requires no prior knowledge of the underlying problem instance.
Other related literature: (Agrawal and Devanur, 2015; Agrawal et al., 2016) study the contextual BwK problem where the reward and resource consumption are both linear in a context vector. (Immorlica et al., 2019; Kesselheim and Singla, 2020) study the BwK problem under an adversarial setting. (Ferreira et al., 2018) analyzes Thompson sampling-based algorithms for the BwK problem.
2 Model and Setup
The problem of bandits with knapsacks (BwK) was first defined in Badanidiyuru et al. (2013), and the notations presented here are largely consistent with Badanidiyuru et al. (2013); Agrawal and Devanur (2014). Consider a fixed and known finite set of $m$ arms (possible actions) available to the decision maker, henceforth called the algorithm. There are $d$ types of resources and a finite time horizon $T$, where $T$ is known to the algorithm. In each time step $t$, the algorithm plays one of the $m$ arms, receives a reward, and consumes a certain amount of each resource $j$. The reward and consumption are revealed to the algorithm after it chooses the arm. The rewards and costs in every round are generated i.i.d. from some unknown fixed underlying distribution. More precisely, there are some fixed but unknown $\boldsymbol{\mu} \in [0,1]^m$ and $\boldsymbol{C} \in [0,1]^{d \times m}$ such that
where $\mu_i$ and $\boldsymbol{C}_i$ (the $i$-th column of $\boldsymbol{C}$) are the expected reward and the expected resource consumption of arm $i$. In the beginning of every time step $t$, the algorithm needs to pick an arm, using only the history of plays and outcomes until time step $t-1$. There is a hard capacity $B_j$ for the $j$-th type of resource. The algorithm stops at the earliest time $\tau$ when one or more of the resource constraints is violated, or when the time horizon ends, i.e., $\tau = T$. Its total reward is given by the sum of rewards in all rounds preceding $\tau$. The goal of the algorithm is to maximize the expected total reward. The values of $B_j$ are known to the algorithm, and without loss of generality, we make the following assumption.
Assumption 1.
We assume $B_j = B$ for all $j$ (by scaling the consumption matrix $\boldsymbol{C}$). Moreover, we assume the resource capacity scales linearly with the horizon, i.e., $B = bT$ for some constant $b > 0$.
The assumption is mainly for notational simplicity and it does not change the nature of the analysis in this paper. Given that our focus is to derive an asymptotic problem-dependent bound, it is natural to have the resource capacity scale linearly with the length of the horizon. Throughout this paper, we use bold symbols to denote vectors/matrices and normal symbols to denote scalars.
Furthermore, without loss of generality, a “null” arm is introduced to represent the time constraint (time horizon $T$). Specifically, the null arm yields no reward, consumes one unit of the time resource, and consumes none of the other resources. In this way, the first constraint captures the finiteness of the time horizon, and the “null” arm can be played with no reward achieved and with no cost induced to the other, factual constraints except for the time constraint.
Regret is defined as the difference between the total reward obtained by the algorithm and OPT, where OPT denotes the total expected reward of the optimal dynamic policy. In this paper, we are interested in the (expected) problem-dependent regret,
where $\pi$ denotes the algorithm, and $\mathcal{P}$ encapsulates all the parameters related to the distributions of reward and resource consumption, including $\boldsymbol{\mu}$ and $\boldsymbol{C}$. The expectation is taken with respect to the randomness of the reward and resource consumption.
3 Primal-dual Perspective for BwK
In this section, we present a generic regret upper bound and explore the properties of the underlying primal and dual LPs.
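For reference, a standard form of the underlying primal LP (1) is given below; the notation ($\boldsymbol{\mu}$ for expected rewards, $\boldsymbol{C}$ for the expected consumption matrix, $B$ for the capacity, $x_i$ for the expected total number of plays of arm $i$) is an assumption of this sketch.

```latex
% Primal LP (1): the fluid/LP relaxation of the BwK problem.
\begin{align*}
\mathrm{OPT} \;=\; \max_{\boldsymbol{x} \in \mathbb{R}^m} \;\;
    & \boldsymbol{\mu}^\top \boldsymbol{x} \tag{1}\\
\text{s.t.}\;\; & \boldsymbol{C} \boldsymbol{x} \le B \cdot \boldsymbol{1}, \\
                & \boldsymbol{x} \ge \boldsymbol{0}.
\end{align*}
```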
Let $\mathcal{I}^*$ and $\mathcal{I}^c$ denote the set of optimal basic variables and the set of optimal non-basic variables of (1), respectively. Let $\mathcal{J}^*$ and $\mathcal{J}^c$ denote the set of binding and non-binding constraints of (1), respectively.
So, we know $\mathcal{I}^* \cup \mathcal{I}^c = \{1,\dots,m\}$ and $\mathcal{J}^* \cup \mathcal{J}^c = \{1,\dots,d\}$. Accordingly, we call an arm $i \in \mathcal{I}^*$ an optimal arm and an arm $i \in \mathcal{I}^c$ a sub-optimal arm. Here and hereafter, we will refer to a knapsack as a constraint so that the terminology is more aligned with the LPs.
We make the following assumption on the LP’s optimal solution.
Assumption 2.
The LP (1) has a unique optimal solution. Moreover, the optimal solution is non-degenerate, i.e., all optimal basic variables are strictly positive and all non-binding constraints have strictly positive slack.
The assumption is a standard one in the LP literature, and any LP can be made to satisfy it with an arbitrarily small perturbation (Megiddo and Chandrasekaran, 1989). To interpret the non-degeneracy, note that the optimal solution to LP (1) plays only the optimal arms; when there is no linear dependency between the corresponding columns of $\boldsymbol{C}$, this results in the depletion of the binding resource constraints.
The dual problem of (1) is
$$\begin{aligned} \min_{\boldsymbol{y}} \ & B \cdot \boldsymbol{1}^\top \boldsymbol{y} \qquad (2)\\ \text{s.t. } \ & \boldsymbol{C}^\top \boldsymbol{y} \ge \boldsymbol{\mu}, \quad \boldsymbol{y} \ge \boldsymbol{0}. \end{aligned}$$
Denote its optimal solution as $\boldsymbol{y}^*$. From the LP's complementarity condition, the following relation holds under Assumption 2: the dual variable $y^*_j$ is strictly positive for each binding constraint $j$ and zero for each non-binding constraint $j$.
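The primal–dual relations above can be checked numerically on a toy instance (the instance itself is invented purely for illustration); the sketch below solves the primal and the dual with `scipy.optimize.linprog` and verifies strong duality and complementary slackness.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance (illustrative only): 3 arms, 3 constraints, per-period capacities b.
mu = np.array([1.0, 0.5, 0.3])            # expected rewards
C = np.array([[1.0, 1.0, 1.0],            # "time" constraint: every play costs 1
              [1.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])
b = np.array([1.0, 0.6, 0.9])             # capacities (normalized per period)

# Primal (1): max mu^T x  s.t.  C x <= b, x >= 0  (linprog minimizes, so negate).
primal = linprog(-mu, A_ub=C, b_ub=b, bounds=(0, None), method="highs")
x_star = primal.x

# Dual (2): min b^T y  s.t.  C^T y >= mu, y >= 0.
dual = linprog(b, A_ub=-C.T, b_ub=-mu, bounds=(0, None), method="highs")
y_star = dual.x

# Strong duality: the two optimal values coincide.
assert abs(-primal.fun - dual.fun) < 1e-8

# Complementary slackness: positive x_i <=> optimal arm,
# positive y_j <=> binding primal constraint.
optimal_arms = np.where(x_star > 1e-6)[0]
binding = np.where(b - C @ x_star < 1e-6)[0]
print("x* =", x_star, " y* =", y_star)
print("optimal arms:", optimal_arms, " binding constraints:", binding)
# Under non-degeneracy, the two sets have equal size.
assert len(optimal_arms) == len(binding)
```

On this instance the unique primal optimum plays the first two arms and both the time and the second resource constraint bind, so the dual price of the slack third constraint is zero.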
The following lemma summarizes the LP’s properties.
Lemma 1. Under Assumption 2, the number of optimal arms equals the number of binding constraints of (1).
3.1 A Generic Regret Upper Bound
We begin our discussion by deriving a new upper bound for a generic BwK algorithm. First, we define the knapsack process $\boldsymbol{B}_t$ as the remaining resource capacity at each time $t$. Specifically, we define $\boldsymbol{B}_0 = B \cdot \boldsymbol{1}$ and
$$\boldsymbol{B}_t = \boldsymbol{B}_{t-1} - \boldsymbol{c}_t$$
for $t = 1, \dots, T$. Recall that $\boldsymbol{c}_t$ is the (random) resource consumption vector at time $t$. The process depends on the underlying BwK algorithm. In addition, we follow the convention of the bandits literature and define the count process $n_i(t)$ as the number of times the $i$-th arm is played up to the end of time period $t$.
Proposition 1.
The following inequality holds for any BwK algorithm,
$$\mathrm{Regret} \;\le\; \sum_{i \in \mathcal{I}^c} \Delta_i \, \mathbb{E}[n_i(\tau)] \;+\; \mathbb{E}\big[(\boldsymbol{y}^*)^\top \boldsymbol{B}_\tau\big], \qquad (3)$$
where $\Delta_i = \boldsymbol{C}_i^\top \boldsymbol{y}^* - \mu_i \ge 0$ for each arm $i$, with $\Delta_i = 0$ for every optimal arm.
Here $\Delta_i$ is known as the reduced cost/profit in the LP literature and it quantifies the cost-efficiency of each arm (each basic variable in the LP). The upper bound in Proposition 1 is new to the existing literature and is generally applicable to all BwK algorithms. It consists of two parts: (i) the number of times a sub-optimal arm is played, multiplied by the corresponding reduced cost; (ii) the remaining resource at time $\tau$, either when some resource is depleted or at the end of the time horizon $T$. The first part is consistent with the classic bandits literature in that we always want to upper bound the number of times sub-optimal arms are played throughout the horizon. Each time a sub-optimal arm $i$ is played, a cost of $\Delta_i$ is incurred. Meanwhile, the second part is particular to the bandits-with-knapsacks setting and can easily be overlooked. Recall that $\tau$ is defined as the first time that any resource is exhausted (or the end of the horizon). It tells us that the left-over resources when the process terminates at time $\tau$ may also induce regret. For example, for two binding resources, it would be less desirable for one of them to have a lot remaining while the other is exhausted. Since the binding resources are critical in determining the optimal objective value of LP (1), intuitively, it is not profitable to waste any of them at the end of the procedure.
3.2 Symmetry between Arms and Knapsacks
From Proposition 1, we see the importance of the dual problem (2) in bounding an algorithm's regret. Now, we pursue further along this path and propose a new notion of sub-optimality for the BwK problem. Our sub-optimality measure is built upon both the primal and dual LPs, and it reveals the combinatorial structure of the problem. In the following, we define two classes of LPs, one for the arms and the other for the constraints.
First, for each arm $i$, define $\mathrm{OPT}_i$ as
$$\begin{aligned} \mathrm{OPT}_i = \max_{\boldsymbol{x}} \ & \boldsymbol{\mu}^\top \boldsymbol{x} \qquad (4)\\ \text{s.t. } \ & \boldsymbol{C}\boldsymbol{x} \le B \cdot \boldsymbol{1}, \quad x_i = 0, \quad \boldsymbol{x} \ge \boldsymbol{0}. \end{aligned}$$
By definition, $\mathrm{OPT}_i$ denotes the optimal objective value of an LP that takes the same form as the primal LP (1) except with the extra constraint $x_i = 0$. It represents the optimal objective value when the $i$-th arm is not allowed to be used. For a sub-optimal arm $i$, $\mathrm{OPT}_i = \mathrm{OPT}$, while for an optimal arm $i$, $\mathrm{OPT}_i < \mathrm{OPT}$. In this way, $\mathrm{OPT}_i$ characterizes the importance of arm $i$.
Next, for each constraint $j$, define $\mathrm{OPT}^j$ as
$$\begin{aligned} \mathrm{OPT}^j = \min_{\boldsymbol{y}} \ & B \cdot \big(\boldsymbol{1}^\top \boldsymbol{y} - 1\big) \qquad (5)\\ \text{s.t. } \ & \boldsymbol{C}^\top \boldsymbol{y} \ge \boldsymbol{\mu} + \boldsymbol{C}_{j\cdot}^\top, \quad \boldsymbol{y} \ge \boldsymbol{0}, \end{aligned}$$
where $\boldsymbol{C}_{j\cdot}$ denotes the $j$-th row of the constraint matrix $\boldsymbol{C}$. Though it may not be as obvious as the previous case of $\mathrm{OPT}_i$, the definition of $\mathrm{OPT}^j$ aims to characterize the bindingness/non-bindingness of constraint $j$. The point can be illustrated by looking at the primal problem of (5). From LP strong duality, we know
$$\begin{aligned} \mathrm{OPT}^j = \max_{\boldsymbol{x}} \ & \boldsymbol{\mu}^\top \boldsymbol{x} - \big(B - \boldsymbol{C}_{j\cdot}\boldsymbol{x}\big) \qquad (6)\\ \text{s.t. } \ & \boldsymbol{C}\boldsymbol{x} \le B \cdot \boldsymbol{1}, \quad \boldsymbol{x} \ge \boldsymbol{0}. \end{aligned}$$
Compared to the original primal LP (1), there is an extra term $-(B - \boldsymbol{C}_{j\cdot}\boldsymbol{x})$ in the objective function of LP (6). The extra term penalizes the left-over of the $j$-th constraint, and thus it encourages the usage of the $j$-th resource. A binding constraint $j$ is exhausted under the optimal solution to LP (1), so the penalization term has no effect, i.e., $\mathrm{OPT}^j = \mathrm{OPT}$. In contrast, for a non-binding constraint $j$, the extra term results in a reduction of the objective value, i.e., $\mathrm{OPT}^j < \mathrm{OPT}$. We note that one can introduce any positive weight on the penalization term to trade off between the reward and the left-over of the $j$-th constraint in (6), but the current version suffices for our discussion.
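To make the two families of LPs concrete, the sketch below evaluates $\mathrm{OPT}_i$ and $\mathrm{OPT}^j$ on a toy instance (invented for illustration): removing an optimal arm strictly lowers the value, removing a sub-optimal arm does not, and the penalized LP (6) separates binding from non-binding constraints.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative instance: arms 0 and 1 are optimal, arm 2 is sub-optimal;
# constraints 0 (time) and 1 bind, constraint 2 does not.
mu = np.array([1.0, 0.5, 0.3])
C = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])
b = np.array([1.0, 0.6, 0.9])

def solve_primal(extra_bounds=None, obj=None, offset=0.0):
    """Solve max obj^T x s.t. C x <= b, x >= 0 (plus per-arm bounds)."""
    c_obj = -(mu if obj is None else obj)
    bounds = extra_bounds if extra_bounds is not None else [(0, None)] * len(mu)
    res = linprog(c_obj, A_ub=C, b_ub=b, bounds=bounds, method="highs")
    return -res.fun + offset

OPT = solve_primal()                       # 0.8 on this instance

def opt_without_arm(i):
    """OPT_i: the same LP with the extra constraint x_i = 0 (arm i removed)."""
    bounds = [(0, None)] * len(mu)
    bounds[i] = (0, 0)
    return solve_primal(extra_bounds=bounds)

def opt_penalized(j):
    """OPT^j via its primal form (6): max mu^T x - (b_j - C_{j.} x)."""
    return solve_primal(obj=mu + C[j], offset=-b[j])

for i in range(3):
    print(f"OPT_{i} = {opt_without_arm(i):.3f}")   # < OPT iff arm i is optimal
for j in range(3):
    print(f"OPT^{j} = {opt_penalized(j):.3f}")     # = OPT iff constraint j binds
```

Here removing either optimal arm drops the value below 0.8, removing the sub-optimal arm leaves it unchanged, and only the non-binding third constraint is penalized below 0.8.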
The following proposition summarizes the properties of $\mathrm{OPT}_i$ and $\mathrm{OPT}^j$.
Proposition 2.
Under Assumption 2, we have $\mathrm{OPT}_i < \mathrm{OPT}$ for every optimal arm $i$, $\mathrm{OPT}_i = \mathrm{OPT}$ for every sub-optimal arm $i$, $\mathrm{OPT}^j = \mathrm{OPT}$ for every binding constraint $j$, and $\mathrm{OPT}^j < \mathrm{OPT}$ for every non-binding constraint $j$.
In this way, the definition of $\mathrm{OPT}_i$ distinguishes optimal arms from sub-optimal arms, while the definition of $\mathrm{OPT}^j$ distinguishes binding constraints from non-binding constraints. The importance of such a distinction arises from the upper bound in Proposition 1: on one hand, we should avoid playing sub-optimal arms; on the other hand, we should exhaust the binding resources. A second motivation for defining both $\mathrm{OPT}_i$ and $\mathrm{OPT}^j$ will become clear after we present our algorithm. Furthermore, we remark that a measure of the sub-optimality of the arms has to be defined through the lens of the LP due to the combinatorial nature of the problem. The effect of the $i$-th arm's removal on the objective value can only be gauged by solving the alternative LP for $\mathrm{OPT}_i$. Similarly, a measure of the bindingness of the constraints should also take into account the combinatorial relations between constraints.
Next, define the sub-optimality measure
$$\delta = \frac{1}{T} \min\Big\{ \min_{i \text{ optimal}} \big(\mathrm{OPT} - \mathrm{OPT}_i\big), \ \min_{j \text{ non-binding}} \big(\mathrm{OPT} - \mathrm{OPT}^j\big) \Big\},$$
where the factor $1/T$ normalizes the optimality gap by the number of time periods. Under Assumption 1, all the objective values above scale linearly with $T$.
To summarize, $\delta$ characterizes the hardness of distinguishing optimal arms from non-optimal arms (and binding constraints from non-binding constraints). It can be viewed as a generalization of the sub-optimality gap for the MAB problem (Lai and Robbins, 1985), which characterizes the hardness of an MAB problem instance, i.e., the hardness of distinguishing the optimal arm from sub-optimal arms. In the context of BwK, $\delta$ is a more comprehensive characterization in that it takes into account both the arms and the constraints. Imaginably, it will be critical in both algorithm design and analysis for the BwK problem.
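Concretely, on a toy instance (invented for illustration) the measure can be computed by comparing $\mathrm{OPT}$ against the perturbed LP values; the per-period capacities below play the role of the $1/T$ normalization, which is our reading of the definition above.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative instance; capacities are already normalized per period,
# so no extra 1/T factor is needed when forming delta.
mu = np.array([1.0, 0.5, 0.3])
C = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])
b = np.array([1.0, 0.6, 0.9])

def lp(obj, bounds=None, offset=0.0):
    res = linprog(-np.asarray(obj), A_ub=C, b_ub=b,
                  bounds=bounds if bounds is not None else [(0, None)] * 3,
                  method="highs")
    return -res.fun + offset, res.x

OPT, x_star = lp(mu)
optimal_arms = [i for i in range(3) if x_star[i] > 1e-6]
nonbinding = [j for j in range(3) if b[j] - C[j] @ x_star > 1e-6]

gaps = []
for i in optimal_arms:       # removing an optimal arm hurts: OPT - OPT_i > 0
    bnd = [(0, None)] * 3
    bnd[i] = (0, 0)
    gaps.append(OPT - lp(mu, bounds=bnd)[0])
for j in nonbinding:         # penalizing a non-binding constraint hurts
    gaps.append(OPT - lp(mu + C[j], offset=-b[j])[0])

delta = min(gaps)            # hardness measure of this instance
print("delta =", delta)
```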
3.3 Key Parameters of the LPs
Now, we define two LP-related quantities that will appear in our regret bound:
• Linear dependency between arms: define $\sigma$ as the minimum singular value of the sub-matrix of $\boldsymbol{C}$ whose rows correspond to the binding constraints and whose columns correspond to the optimal arms. In this light, $\sigma$ represents the linear dependency between optimal arms across the binding constraints. For a smaller value of $\sigma$, the optimal arms are more linearly dependent, and it will be harder to identify the optimal numbers of plays. Under Assumption 2, the uniqueness of the optimal solution implies $\sigma > 0$.
• Threshold on the optimal solution: let $\underline{x}$ denote the smallest non-zero entry of the normalized optimal solution $\boldsymbol{x}^*/T$. If the total resource scales linearly with $T$ (Assumption 1), the optimal solution should also scale linearly with $T$; the factor $1/T$ thus normalizes the optimal solution into a probability vector. Intuitively, a small value of $\underline{x}$ implies that the optimal proportion of playing some arm is small, and thus that arm is more prone to being “overplayed.”
Remarks. By definition, it seems that the above parameters $\sigma$ and $\underline{x}$ both involve a factor of $T$. But if we replace $B$ (the right-hand side of the LP) with $bT$ from Assumption 1, then the factor disappears, and the parameters $\sigma$ and $\underline{x}$ essentially depend on $b$, $\boldsymbol{\mu}$, and $\boldsymbol{C}$, which are inherent to the problem instance but bear no dependency on the horizon $T$. In other words, Assumption 1 removes the dependency on $T$ by introducing the quantity $b$. Practically, the assumption states that the resource capacity should be sufficiently large, which is natural in many application contexts (for example, the small-bids assumption in the AdWords problem, Mehta et al. (2005)). Theoretically, in the two previous works Flajolet and Jaillet (2015); Sankararaman and Slivkins (2020), either a factor of $T$ appears in the parameter definition (Sankararaman and Slivkins, 2020) or the assumption is explicitly imposed (Flajolet and Jaillet, 2015). Such an assumption might be inevitable for a logarithmic regret bound.
3.4 LCB and UCB
Throughout this paper, we denote the reward and resource consumption of the $k$-th play of the $i$-th arm by $r_i^{(k)}$ and $\boldsymbol{c}_i^{(k)}$, respectively. Let $n_i(t)$ be the number of times the $i$-th arm is played in the first $t$ time periods. Accordingly, we denote the estimator of the $i$-th arm's expected reward at time $t$ as
$$\hat{\mu}_i(t) = \frac{1}{n_i(t)} \sum_{k=1}^{n_i(t)} r_i^{(k)}$$
for each arm $i$ and time $t$. In a similar manner, we define the estimator $\hat{\boldsymbol{c}}_i(t)$ for the $i$-th arm's resource consumption vector, and the estimator $\hat{\boldsymbol{C}}(t)$ for the consumption matrix. Without changing the nature of the analysis, we ignore the case that $n_i(t) = 0$. Then, we define the lower confidence bound (LCB) and upper confidence bound (UCB) for the parameters as the empirical estimates shifted down and up, respectively, by a confidence radius and projected onto the unit interval,
where $\mathrm{proj}_{[0,1]}(\cdot)$ projects a real number onto the interval $[0,1]$. The following lemma is standard in the bandits literature and it characterizes the relation between the true values and the LCB/UCB estimators. It states that all the true values fall into the intervals defined by the corresponding estimators with high probability.
Lemma 2 (Concentration).
The following event holds with high probability:
$$\mu_i^{\mathrm{LCB}}(t) \le \mu_i \le \mu_i^{\mathrm{UCB}}(t), \qquad C_{ji}^{\mathrm{LCB}}(t) \le C_{ji} \le C_{ji}^{\mathrm{UCB}}(t)$$
for all arms $i$, constraints $j$, and times $t$.
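A minimal sketch of the estimators and confidence bounds for a single arm, assuming a Hoeffding-style radius of $\sqrt{2\log T / n}$ (the exact radius constant is an assumption of this sketch, not taken from the excerpt):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000                  # horizon (illustrative)
true_mu = 0.7             # unknown mean reward of one arm, in [0, 1]

n = 500                                       # times the arm has been played
rewards = rng.binomial(1, true_mu, size=n)    # i.i.d. observed Bernoulli rewards
mu_hat = rewards.mean()

radius = np.sqrt(2 * np.log(T) / n)           # Hoeffding-style radius (assumed form)

def proj(v):
    """Project a real number onto [0, 1]."""
    return min(max(v, 0.0), 1.0)

lcb, ucb = proj(mu_hat - radius), proj(mu_hat + radius)
print(f"mu_hat={mu_hat:.3f}, LCB={lcb:.3f}, UCB={ucb:.3f}")
# With high probability the true mean lies in [LCB, UCB].
```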
With the UCB/LCB estimators for the parameters, we can construct UCB/LCB estimators for the objective value of the primal LP. Specifically,
$$\begin{aligned} \mathrm{OPT}^{\mathrm{UCB}}_t = \max_{\boldsymbol{x}} \ & \big(\boldsymbol{\mu}^{\mathrm{UCB}}_t\big)^\top \boldsymbol{x} \qquad (7)\\ \text{s.t. } \ & \boldsymbol{C}^{\mathrm{LCB}}_t \boldsymbol{x} \le B \cdot \boldsymbol{1}, \quad \boldsymbol{x} \ge \boldsymbol{0}, \end{aligned}$$
$$\begin{aligned} \mathrm{OPT}^{\mathrm{LCB}}_t = \max_{\boldsymbol{x}} \ & \big(\boldsymbol{\mu}^{\mathrm{LCB}}_t\big)^\top \boldsymbol{x} \qquad (8)\\ \text{s.t. } \ & \boldsymbol{C}^{\mathrm{UCB}}_t \boldsymbol{x} \le B \cdot \boldsymbol{1}, \quad \boldsymbol{x} \ge \boldsymbol{0}. \end{aligned}$$
The following lemma states the relation between $\mathrm{OPT}$, $\mathrm{OPT}^{\mathrm{UCB}}_t$, and $\mathrm{OPT}^{\mathrm{LCB}}_t$. Intuitively, if we substitute the original constraint matrix with its LCB (or UCB) and the objective coefficients with their UCBs (or LCBs), the resulting optimal objective value will be a UCB (or LCB) for $\mathrm{OPT}$.
Lemma 3.
The following inequality holds with high probability: $\mathrm{OPT}^{\mathrm{LCB}}_t \le \mathrm{OPT} \le \mathrm{OPT}^{\mathrm{UCB}}_t$.
A similar approach is used in (Agrawal and Devanur, 2014) to develop a UCB-based algorithm for the BwK problem. For our algorithm presented in the following section, we construct estimates not only for the primal LP (1), but also for the LPs (4) and (5). By comparing the estimates of $\mathrm{OPT}$, $\mathrm{OPT}_i$, and $\mathrm{OPT}^j$, we will be able to identify the optimal arms and the non-binding constraints.
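The monotonicity behind Lemma 3 is easy to verify numerically: enlarging the rewards and shrinking the consumption can only increase the LP value. The sketch below uses a fixed perturbation `eps` in place of actual confidence radii, and an invented instance; both are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

mu = np.array([1.0, 0.5, 0.3])
C = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])
b = np.array([1.0, 0.6, 0.9])

def lp_value(reward, cons):
    res = linprog(-reward, A_ub=cons, b_ub=b, bounds=(0, None), method="highs")
    return -res.fun

eps = 0.05                                   # stand-in for the confidence radius
mu_ucb, mu_lcb = np.minimum(mu + eps, 1), np.maximum(mu - eps, 0)
C_ucb, C_lcb = np.minimum(C + eps, 1), np.maximum(C - eps, 0)

OPT = lp_value(mu, C)
OPT_UCB = lp_value(mu_ucb, C_lcb)   # optimistic: high rewards, low consumption
OPT_LCB = lp_value(mu_lcb, C_ucb)   # pessimistic: low rewards, high consumption

print(OPT_LCB, OPT, OPT_UCB)
# The feasible region only grows and the objective only increases, so the
# sandwich OPT_LCB <= OPT <= OPT_UCB holds deterministically here.
assert OPT_LCB <= OPT + 1e-9 <= OPT_UCB + 2e-9
```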
4 Two-Phase Algorithm
In this section, we describe our two-phase algorithm for the BwK problem. The main theme is to use the underlying LP’s solution to guide the plays of the arms. The two phases in the algorithm correspond to the two parts of the regret upper bound in Proposition 1. In the following, we describe the two phases of the algorithm and their intuitions respectively.
Phase I of Algorithm 1 is an elimination algorithm and it aims to identify the optimal arms and the non-binding constraints. In each round of the while loop in Phase I, all the arms are played once to improve the estimators for $\boldsymbol{\mu}$ and $\boldsymbol{C}$. After each round of plays, an identification procedure is conducted by comparing the LCB of the original optimal value ($\mathrm{OPT}^{\mathrm{LCB}}$) against the UCBs ($\mathrm{OPT}_i^{\mathrm{UCB}}$ and $\mathrm{OPT}^{j,\mathrm{UCB}}$). Recall from Proposition 2 that there is a non-zero sub-optimality gap if arm $i$ is optimal or constraint $j$ is non-binding. So, if the algorithm observes a gap between the corresponding LCBs/UCBs, it asserts that arm $i$ is optimal or that constraint $j$ is non-binding, and the assertion is true with high probability.
The stopping rule in Phase I originates from the complementarity condition in Lemma 1, i.e., that the number of optimal arms equals the number of binding constraints. This is key to the primal-dual design of the algorithm and it further justifies the consideration of the dual problem. Without maintaining the set of identified non-binding constraints, we cannot decide whether we have obtained the complete set of optimal arms. Specifically, since there is no precise knowledge of the number of optimal arms while we keep adding arms to the identified set, we do not know when to stop. The complementarity in Lemma 1 provides a condition relating the number of identified arms and the number of identified constraints, and Phase I is terminated when this condition is met. Moreover, we emphasize that a by-product of Phase I is a best-arm identification procedure. To the best of our knowledge, this is the first result on identifying optimal arms for the BwK problem.
$$\begin{aligned} \max_{\boldsymbol{x}} \ & \big(\boldsymbol{\mu}^{\mathrm{UCB}}_t\big)^\top \boldsymbol{x} \qquad (9)\\ \text{s.t. } \ & \boldsymbol{C}^{\mathrm{LCB}}_t \boldsymbol{x} \le \boldsymbol{B}_{t-1} / (T - t + 1), \\ & x_i = 0 \ \text{for each arm } i \text{ not identified as optimal}, \quad \boldsymbol{x} \ge \boldsymbol{0}. \end{aligned}$$
Phase II of Algorithm 1 is built upon the output of Phase I. At each time $t$, the algorithm solves the adaptive LP (9) and normalizes its optimal solution into a sampling scheme over the arms. Also, in Phase II, the algorithm only plays arms identified as optimal; this is achieved by enforcing $x_i = 0$ for all other arms in (9). The adaptive design is exemplified by the right-hand side of LP (9): instead of the static resource capacity, it uses the remaining resource at the end of the last time period, $\boldsymbol{B}_{t-1}$. To see the intuition, consider a binding resource $j$ that is over-used in the first $t-1$ time periods: this results in a smaller value of $B_{t-1,j}$, so the adaptive mechanism will be more reluctant to consume the $j$-th resource in the future, and vice versa for the case of under-use.
We emphasize that the adaptive design is not only intuitive but also necessary to achieve a regret that is logarithmic in $T$. If we adopt a static right-hand side in (9) (as in (Badanidiyuru et al., 2013; Agrawal and Devanur, 2014)), then the fluctuation of the knapsack process will be on the order of $\sqrt{T}$. Consequently, as the process approaches the end of the horizon, a certain binding resource may be exhausted while other binding resources still have left-over, and this may contribute an $\Omega(\sqrt{T})$ term to the upper bound in Proposition 1. The intuition is made rigorous by (Arlotto and Gurvich, 2019); that paper establishes that without an adaptive design, the regret is at least $\Omega(\sqrt{T})$ for the multi-secretary problem (which can be viewed as a one-constraint BwK problem) even if the underlying distribution is known.
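The effect of the adaptive right-hand side can be seen in a deterministic "fluid" sketch in the spirit of Phase II: each period we re-solve the per-period LP with the remaining average budget and consume expected amounts. The instance, the per-period normalization $\boldsymbol{B}_{t-1}/(T-t+1)$, and the use of exact parameters in place of UCB estimates are all assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

mu = np.array([1.0, 0.5, 0.3])
C = np.array([[1.0, 1.0, 1.0],     # row 0 is the time constraint
              [1.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])
b = np.array([1.0, 0.6, 0.9])      # per-period capacities

T = 100
B = b * T                          # total budgets
reward = 0.0
for t in range(T):
    rhs = B / (T - t)              # remaining budget per remaining period
    res = linprog(-mu, A_ub=C, b_ub=rhs, bounds=(0, 1), method="highs")
    x = res.x                      # per-period play proportions
    reward += mu @ x
    B = B - C @ x                  # consume expected amounts (fluid dynamics)

print(f"reward={reward:.2f}, leftover={np.round(B, 4)}")
# The binding budgets (time and resource 1) are driven to zero, while the
# non-binding resource 2 retains its natural surplus.
```

Because the budget is re-normalized every period, any hypothetical over- or under-consumption would be corrected in subsequent periods; with exact parameters the trajectory stays exactly on the LP optimum.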
5 Regret Analysis
In this section, we derive a logarithmic (in $T$) regret upper bound for Algorithm 1 by analyzing the two phases separately.
5.1 Analysis of Phase I
Proposition 3 provides an upper bound on the number of time periods within which Phase I terminates. It also states that the identification of the optimal arms and the binding constraints is precise conditional on the high-probability event in Lemma 2.
Proposition 3.
The surprising point of Proposition 3 lies in that there are combinatorially many possible configurations of optimal arms and binding constraints and, without any prior knowledge, the true configuration can be identified within $O(\log T)$ plays of each arm. In contrast, (Flajolet and Jaillet, 2015) does not utilize the primal-dual structure of the problem and conducts a brute-force search over all possible configurations, which results in a regret bound exponential in $m$ and $d$. In addition, the search therein requires the knowledge of a non-degeneracy parameter a priori. The result also explains why (Sankararaman and Slivkins, 2020) imposes a single-best-arm condition on the BwK problem. This additional assumption greatly simplifies the combinatorial structure and brings the BwK problem closer to the standard MAB problem. In this light, Proposition 3 can be viewed as a best-arm-identification result (Audibert and Bubeck, 2010) for the BwK problem in full generality and without any prior knowledge.
5.2 Analysis of Phase II
Proposition 4 provides an upper bound on the remaining resource of the binding constraints when the procedure terminates, which corresponds to the second part of Proposition 1. Notably, the upper bound has no dependency on $T$. In comparison with the $\sqrt{T}$ fluctuation under the static design, it demonstrates the effectiveness of our adaptive design.
Proposition 4.
The idea of the proof is to introduce an auxiliary process. Recall that $B_{t,j}$ is the $j$-th component of the knapsack process $\boldsymbol{B}_t$, with initial value $B_{0,j} = B = bT$. Define the average remaining resource $\tilde{b}_{t,j} = B_{t,j}/(T-t)$ and the stopping time
$$\tau_j = \min\big\{t : |\tilde{b}_{t,j} - b| \ge \epsilon\big\}$$
for a fixed $\epsilon > 0$. With this definition, $\tilde{b}_{t,j}$ can be interpreted as the average remaining resource (per remaining time period) and $\tau_j$ as the first time it deviates from its initial value by a small amount. It is easy to see that $\tau \ge \min_j \tau_j$ with a proper choice of $\epsilon$. Next, we aim to upper bound this deviation by analyzing the process $\tilde{b}_{t,j}$. From the dynamics of the knapsack process, we know that
$$\tilde{b}_{t,j} = \tilde{b}_{t-1,j} + \frac{\tilde{b}_{t-1,j} - c_{t,j}}{T-t}, \qquad (10)$$
where, as defined earlier, $c_{t,j}$ is the consumption of the $j$-th resource at time $t$. The formula (10) provides a technical explanation for the adaptive design in (9). Intuitively, when the right-hand side of (9) is $\tilde{b}_{t-1,j}$, it leads to a solution that (approximately and in expectation) consumes $\tilde{b}_{t-1,j}$ of the $j$-th resource in each of the following time periods. Ideally, this makes the second term in (10) have zero expectation. However, due to estimation error in the LP's parameters, this may not be the case. The idea is to first provide an upper bound for the "bias" term
$\big|\mathbb{E}[\tilde{b}_{t-1,j} - c_{t,j} \mid \mathcal{H}_{t-1}]\big|$, where $\mathcal{H}_{t-1}$ encapsulates all the information up to time $t-1$. Unsurprisingly, the upper bound is of the order of the parameter estimation error. Next, with this bias upper bound, we can construct a super-martingale (sub-martingale) based on the dynamic (10) and employ the Azuma–Hoeffding inequality to obtain a concentration result for the value of the martingale. Through the analysis of the process $\tilde{b}_{t,j}$, we derive an upper bound on its deviation, and consequently an upper bound on the remaining resources of the binding constraints.
The importance of the adaptive design has been widely recognized in other constrained online learning problems, such as the online matching problem Manshadi et al. (2012), the online assortment problem Golrezaei et al. (2014), the online linear programming problem Li and Ye (2019), the network revenue management problem Jasin and Kumar (2012), etc. The common pattern of these problems is to allocate limited resources sequentially, and the idea of adaptive design is to adjust the allocation rule dynamically according to the remaining resource/inventory. This is in parallel with LP (9), where the solution at each time is contingent on the remaining resources $\boldsymbol{B}_{t-1}$. The significance of our algorithm design and analysis lies in that (i) for the literature on BwK, our paper is the first application of the idea of adaptive design; (ii) for the existing literature on adaptive design in constrained online learning problems, our work provides its first application and analysis in a partial-information environment. For the second aspect, all the existing analyses of adaptive design fall into the paradigm of "first-observe-then-decide," while the BwK problem is "first-decide-then-observe." Specifically, in matching/resource allocation/revenue management problems, at each time period a new agent arrives and, upon the observation, we decide the matching for the agent; or a new customer arrives and, upon observing her preference, we decide the assortment for the customer. So, the existing analyses are analogous to a BwK "setting" where the reward and resource consumption of playing an arm are first observed (magically), and then we decide whether to play the arm or not.
5.3 Regret Upper Bound for Algorithm 1
Combining the two parts, we have the following result on the regret of Algorithm 1.
Proposition 5.
The result reduces the exponential dependence on $m$ and $d$ in (Flajolet and Jaillet, 2015) to a polynomial one, and it does not rely on any prior knowledge. Specifically, those authors consider several settings for the BwK problem, many of which assume special structures such as one or two resource constraints. The most general setting therein, which is comparable to ours, allows an arbitrary number of constraints and an arbitrary number of optimal arms. In terms of the key parameters, the way we define $\delta$ is the same as their definition of the sub-optimality gap. However, their regret bound (Theorem 8 therein) involves a summation of exponentially many such gaps (as many as the total number of bases of the LP). Our parameter $\sigma$ is related to their non-degeneracy parameter (Assumption 8 therein), but the latter is more restrictive: $\sigma$ in our paper is the minimal singular value of the matrix corresponding only to the optimal basis of the primal LP, whereas their parameter lower-bounds the determinants of the matrices corresponding to all the possible (exponentially many) bases of the primal LP. In this light, if they adopted our parameter $\sigma$, their bound would improve by a corresponding factor. Moreover, their parameter is a lower bound for our $\sigma$, and (Flajolet and Jaillet, 2015) explicitly requires its knowledge a priori.
The proposition also generalizes the one-constraint bound in (Sankararaman and Slivkins, 2020) and relaxes the deterministic-resource-consumption assumption therein. Specifically, those authors assume there is one single optimal arm and one single resource (other than time), i.e., the optimal solution to the primal LP has only one non-zero entry. They also assume the underlying LP is non-degenerate. Our results generalize their work in allowing arbitrary numbers of arms, resources, and optimal arms. In terms of the key parameters, under their assumption our parameter $\sigma$ is simply the corresponding consumption entry, because it is defined by a 1-by-1 matrix. Our parameter $\underline{x}$ is treated as a constant in their paper. As for $\delta$, under their assumptions our definition of $\mathrm{OPT}_i$ can be adjusted accordingly, and then our definition of $\delta$ reduces to their sub-optimality gap, both of which characterize the sub-optimality of an arm.
The above comparison against the existing literature highlights that the parameters $\delta$, $\sigma$, and $\underline{x}$, or other LP-based parameters, might be inevitable in characterizing a logarithmic regret bound. Our paper makes some preliminary efforts along this line, but we do not believe our bound is tight: parameters such as $\sigma$ and $\underline{x}$ may be replaced with tighter ones through a better algorithm and sharper analysis. As remarked in Section 3.3, the parameters do not depend on $T$ under Assumption 1. But characterizing the dependency of these parameters on the problem sizes $m$ and $d$ remains an open question. Moreover, our algorithm and analysis rely heavily on the structural properties of the underlying LP, which might not be the unique method to handle the BwK problem.
6 Conclusion
In this paper, we introduce a new BwK algorithm and derive a problem-dependent regret bound for it. Phase I of the algorithm involves a round-robin design and may result in playing sub-optimal arms inefficiently many times. Our regret bound can be large when the parameters $\delta$, $\sigma$, and $\underline{x}$ are small, and the inefficient plays of the sub-optimal arms may prevent the algorithm from achieving a near-optimal worst-case regret. In the extreme case, these parameters may scale with $T$, though Assumption 1 prevents such a possibility. So the question is whether Assumption 1 is necessary for admitting a logarithmic problem-dependent bound.
We conclude our discussion with a new one-phase algorithm – Algorithm 2. The algorithm is also LP-based, and at each time , it solves a UCB version of the primal LP to sample the arm. The algorithm has an adaptive design to exhaust the resources. On one hand, the algorithm naturally incorporates Phase I of Algorithm 1 as part of its Phase II: it directly enters Phase II of Algorithm 1 and lets the adaptive LP fully determine the arm(s) to play (without the extra constraint in (9)). On the other hand, the algorithm can be viewed as an adaptive version of the algorithm in (Agrawal and Devanur, 2014). Our conjecture is that Algorithm 2 is the optimal algorithm for BwK: it achieves the optimal problem-dependent bound while also admitting a problem-independent bound. Unfortunately, its analysis is more challenging than that of Algorithm 1, which we leave as an open question.
(11)
s.t.
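The adaptive design described above — solve a UCB version of the primal LP with the remaining average budget, then play accordingly — can be written in a few lines. The following is a hedged sketch, not Algorithm 2 itself: the estimators, the uniform fallback rule, and all numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def ucb_lp_step(mu_ucb, C_lcb, b_remaining, t_remaining, rng):
    """One round of an adaptive LP-based BwK algorithm (sketch): solve a
    UCB version of the primal LP with the *remaining* average budget,
    then sample an arm from the normalized LP solution."""
    K = len(mu_ucb)
    # maximize mu_ucb @ x  s.t.  C_lcb @ x <= b_remaining / t_remaining,
    #                            sum(x) <= 1,  x >= 0
    res = linprog(
        c=-mu_ucb,  # linprog minimizes, so negate the UCB rewards
        A_ub=np.vstack([C_lcb, np.ones((1, K))]),
        b_ub=np.append(b_remaining / t_remaining, 1.0),
        bounds=[(0, None)] * K,
        method="highs",
    )
    x = np.clip(res.x, 0.0, None)
    if x.sum() <= 0:
        return int(rng.integers(K))      # degenerate LP: play uniformly
    return int(rng.choice(K, p=x / x.sum()))

rng = np.random.default_rng(0)
arm = ucb_lp_step(np.array([0.9, 0.5]), np.array([[0.5, 0.2]]),
                  np.array([10.0]), 20, rng)
```

In this toy instance the LP puts all mass on arm 0, so the sampled arm is deterministic; with noisier estimates the LP solution mixes over several arms.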
Acknowledgements
We would like to thank the anonymous reviewers for their helpful suggestions and insightful comments.
References
- Agrawal and Devanur (2014) Agrawal, Shipra, Nikhil R Devanur. 2014. Bandits with concave rewards and convex knapsacks. Proceedings of the fifteenth ACM conference on Economics and computation. 989–1006.
- Agrawal and Devanur (2015) Agrawal, Shipra, Nikhil R Devanur. 2015. Linear contextual bandits with knapsacks. arXiv preprint arXiv:1507.06738 .
- Agrawal et al. (2016) Agrawal, Shipra, Nikhil R Devanur, Lihong Li. 2016. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. Conference on Learning Theory. PMLR, 4–18.
- Agrawal and Goyal (2012) Agrawal, Shipra, Navin Goyal. 2012. Analysis of thompson sampling for the multi-armed bandit problem. Conference on learning theory. JMLR Workshop and Conference Proceedings, 39–1.
- Agrawal et al. (2014) Agrawal, Shipra, Zizhuo Wang, Yinyu Ye. 2014. A dynamic near-optimal algorithm for online linear programming. Operations Research 62(4) 876–890.
- Arlotto and Gurvich (2019) Arlotto, Alessandro, Itai Gurvich. 2019. Uniformly bounded regret in the multisecretary problem. Stochastic Systems .
- Audibert and Bubeck (2010) Audibert, Jean-Yves, Sébastien Bubeck. 2010. Best arm identification in multi-armed bandits. COLT-23th Conference on Learning Theory-2010. 13–p.
- Auer et al. (2002) Auer, Peter, Nicolo Cesa-Bianchi, Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2) 235–256.
- Badanidiyuru et al. (2013) Badanidiyuru, Ashwinkumar, Robert Kleinberg, Aleksandrs Slivkins. 2013. Bandits with knapsacks. 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 207–216.
- Besbes and Zeevi (2012) Besbes, Omar, Assaf Zeevi. 2012. Blind network revenue management. Operations research 60(6) 1537–1550.
- Ferreira et al. (2018) Ferreira, Kris Johnson, David Simchi-Levi, He Wang. 2018. Online network revenue management using thompson sampling. Operations research 66(6) 1586–1602.
- Flajolet and Jaillet (2015) Flajolet, Arthur, Patrick Jaillet. 2015. Logarithmic regret bounds for bandits with knapsacks. arXiv preprint arXiv:1510.01800 .
- Gittins et al. (2011) Gittins, John, Kevin Glazebrook, Richard Weber. 2011. Multi-armed bandit allocation indices. John Wiley & Sons.
- Golrezaei et al. (2014) Golrezaei, Negin, Hamid Nazerzadeh, Paat Rusmevichientong. 2014. Real-time optimization of personalized assortments. Management Science 60(6) 1532–1551.
- Immorlica et al. (2019) Immorlica, Nicole, Karthik Abinav Sankararaman, Robert Schapire, Aleksandrs Slivkins. 2019. Adversarial bandits with knapsacks. 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 202–219.
- Jasin and Kumar (2012) Jasin, Stefanus, Sunil Kumar. 2012. A re-solving heuristic with bounded revenue loss for network revenue management with customer choice. Mathematics of Operations Research 37(2) 313–345.
- Kesselheim and Singla (2020) Kesselheim, Thomas, Sahil Singla. 2020. Online learning with vector costs and bandits with knapsacks. Conference on Learning Theory. PMLR, 2286–2305.
- Lai and Robbins (1985) Lai, Tze Leung, Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1) 4–22.
- Li and Ye (2019) Li, Xiaocheng, Yinyu Ye. 2019. Online linear programming: Dual convergence, new algorithms, and regret bounds. arXiv preprint arXiv:1909.05499 .
- Manshadi et al. (2012) Manshadi, Vahideh H, Shayan Oveis Gharan, Amin Saberi. 2012. Online stochastic matching: Online actions based on offline statistics. Mathematics of Operations Research 37(4) 559–573.
- Megiddo and Chandrasekaran (1989) Megiddo, Nimrod, Ramaswamy Chandrasekaran. 1989. On the -perturbation method for avoiding degeneracy. Operations Research Letters 8(6) 305–308.
- Mehta et al. (2005) Mehta, Aranyak, Amin Saberi, Umesh Vazirani, Vijay Vazirani. 2005. Adwords and generalized on-line matching. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05). IEEE, 264–273.
- Robbins (1952) Robbins, Herbert. 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58(5) 527–535.
- Sankararaman and Slivkins (2020) Sankararaman, Karthik Abinav, Aleksandrs Slivkins. 2020. Advances in bandits with knapsacks. arXiv preprint arXiv:2002.00253 .
- Weber et al. (1992) Weber, Richard, et al. 1992. On the gittins index for multiarmed bandits. The Annals of Applied Probability 2(4) 1024–1033.
Appendix A Proofs for Sections 2 and 3
A1 Proof of Lemma 1
Proof.
Notice that is a feasible solution to the primal LP (1) and is a feasible solution to the dual LP (2); hence both the primal and the dual LP are feasible. By the strong duality of linear programming, both LPs then have optimal solutions and share the same optimal objective value. ∎
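The strong-duality argument can be illustrated numerically. Below is a minimal sketch using `scipy.optimize.linprog`; the reward vector, consumption matrix, and budgets are made-up illustrative values, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative BwK-style primal LP: maximize expected reward subject to
# average resource budgets (all numbers are illustrative assumptions).
r = np.array([0.8, 0.6, 0.3])            # expected reward per arm
C = np.array([[0.5, 0.2, 0.1],           # expected consumption per arm
              [0.1, 0.4, 0.3]])
b = np.array([0.3, 0.25])                # average budget per period

# Primal: max r^T x  s.t.  C x <= b, x >= 0  (linprog minimizes, so negate r).
primal = linprog(-r, A_ub=C, b_ub=b, bounds=[(0, None)] * 3, method="highs")

# Dual:   min b^T y  s.t.  C^T y >= r, y >= 0.
dual = linprog(b, A_ub=-C.T, b_ub=-r, bounds=[(0, None)] * 2, method="highs")

primal_value = -primal.fun
dual_value = dual.fun
# Strong duality: both LPs are feasible, so the two optimal values coincide.
```

Here the primal is feasible at x = 0 and the dual is feasible for any large enough y, which is exactly the feasibility premise the lemma uses.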
A2 Proof of Proposition 1
Proof.
From , we know
(12)
The idea is to represent with reduced cost and number of plays for the arms. We have
where denotes the arm chosen to play at time . In the above, the second line comes from the independence between the stopping time and the reward generated by a fixed arm. The third line comes from the definition of the reduced cost, which for all gives , and the last line can be obtained by the same argument as for the first three lines. Thus, plugging the equations above into inequality (12) and rearranging terms, we have
Thus we complete the proof. ∎
A3 Proof of Proposition 2
Proof.
It is sufficient to show that
(13)
(14)
Here we prove (13); the proof of (14) follows from the same analysis. By definition, a feasible solution to LP (4) is also feasible to LP (1). So,
If for , then the optimal solution of LP (4) is also an optimal solution to (1). Denote that solution by . Then both and are optimal solutions of LP (1). From the uniqueness in Assumption 2, we know , and hence and for . However, this contradicts the constraint of (4) that . ∎
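The contradiction argument rests on comparing the full LP with a restricted LP in which one arm's play rate is constrained. A hedged numerical sketch with made-up data (forcing one arm's rate to zero, the simplest such restriction; the unit time-budget constraint is also an assumption of this sketch):

```python
import numpy as np
from scipy.optimize import linprog

def lp_value(r, C, b, force_zero=None):
    """Optimal value of  max r^T x  s.t.  C x <= b, sum(x) <= 1, x >= 0,
    optionally forcing one arm's play rate to zero, as in the restricted
    LP used to define a per-arm sub-optimality gap."""
    K = len(r)
    bounds = [(0, None)] * K
    if force_zero is not None:
        bounds[force_zero] = (0, 0)      # exclude this arm entirely
    res = linprog(-r, A_ub=np.vstack([C, np.ones((1, K))]),
                  b_ub=np.append(b, 1.0), bounds=bounds, method="highs")
    return -res.fun

# Illustrative instance (not from the paper): arm 0 is clearly optimal.
r = np.array([0.9, 0.2])
C = np.array([[0.5, 0.5]])
b = np.array([0.5])
full = lp_value(r, C, b)
without_0 = lp_value(r, C, b, force_zero=0)
gap = full - without_0   # a strictly positive gap marks arm 0 as needed
```

A strictly positive `gap` certifies that removing the arm degrades the LP value, which is the numerical shadow of the uniqueness-based contradiction above.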
A4 Proof of Lemma 2
Proof.
The proof follows a standard analysis of the tail bounds for the estimators. Specifically, we first consider the complement of these events and then apply a union bound to complete the proof. The analysis first concerns the case when there is no projection and then extends the results to the case with projection.
For each we have
(15)
where denotes the reward of the -th play of the -th arm. The first inequality is obtained by the union bound, the second follows from Bayes' rule, the third is obtained by , and the last by an application of Hoeffding's inequality with and for and .
Then, recall that
and . Following the definitions of the sets, we have
Then we can apply (15) to the last line above.
Note that the proof relies only on the property that the reward is bounded in .
Next, following the exact same analysis, we have
also holds for each . Then, we complete the proof by taking a union bound with respect to the arm index and the constraint index . ∎
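The Hoeffding tail in (15) and the final union bound can be sanity-checked numerically. This is a hedged sketch: the Bernoulli reward model and the arm/constraint counts are illustrative, not the paper's exact constants.

```python
import numpy as np

def hoeffding_tail(n, eps):
    # Two-sided Hoeffding bound for i.i.d. samples in [0, 1]:
    # P(|mean_n - mu| >= eps) <= 2 * exp(-2 * n * eps**2)
    return 2.0 * np.exp(-2.0 * n * eps**2)

def union_bound(K, m, n, eps):
    # One tail event per estimate: K reward means plus K * m consumption
    # means, aggregated by a union bound over arms and constraints.
    return K * (m + 1) * hoeffding_tail(n, eps)

# Monte-Carlo check that the per-estimate tail bound is not violated.
rng = np.random.default_rng(0)
n, eps, mu = 200, 0.1, 0.5
means = (rng.random((5000, n)) < mu).mean(axis=1)   # Bernoulli(0.5) rewards
freq = np.mean(np.abs(means - mu) >= eps)           # empirical tail frequency
```

The empirical deviation frequency sits well below the Hoeffding bound, and the union-bound helper simply scales that bound by the number of simultaneous estimates.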
A5 Proof of Lemma 3
Proof.
It is sufficient to show that
holds under the event in Lemma 2. Then, applying Lemma 2, the inequality above holds with probability no less than
Recall that corresponds to the optimal objective value of the following LP (7) and denote as its optimal solution. Since holds element-wise under the event in Lemma 2, we have
where those inequalities hold element-wise. Thus, is a feasible solution to LP (1), and
(16)
where the first line comes from the definition of , the second line comes from (under the event in Lemma 2), and the third line comes from the fact that is feasible to LP (1) and the definition of . Thus, is an underestimate of for all .
With a similar analysis, we can show the part for . ∎
Appendix B Analysis of Phase I of Algorithm 1
In this section, we provide the analysis of Phase I of Algorithm 1 and prove Proposition 3. In the proof, we will use some concentration results for the LCB/UCB estimators for the LP, and these concentration results will be summarized and proved in Lemma 4 following the proof of Proposition 3.
B1 Proof of Proposition 3
Proof.
In this proof, we show the result under the high-probability event in Lemma 2.
Recall that an arm will be added to or a resource will be added to when
So, our proof focuses on the relation between and , and the relation between and . Our proof can be divided into two parts:
• First, we prove that and will not include a non-optimal arm or a binding resource . Equivalently, we only need to show
for and and all .
• Second, we prove that Phase I of Algorithm 1 terminates within steps. Equivalently, we only need to show
for and when the time
For the first part, recall that is the optimal objective value of LP (4). For , with a proof similar to that of Lemma 3, we can show that
and from Lemma 3,
for all . For , we have from Proposition 2. So,
With a similar argument,
Therefore, we complete the first part. The implication is that and .
Recall that the algorithm terminates when and that, from Lemma 1, . So the algorithm terminates if and only if and ; that is, it stops only when it has found all the optimal arms and all the non-binding resources.
Now, we move to the second part and prove that and within
steps.
For time that is a multiple of and that
we have
i.e., the condition (19) in the statement of Lemma 4 holds, where in this case. Then, by Lemma 4, we have
Thus, within at most steps, we have
for all and , since and . Thus, all optimal arms and non-binding resources will be detected within at most steps; equivalently, with each arm played at most times.
∎
B2 Concentration of LP UCB/LCB Estimators
Lemma 4.
Proof.
We prove the inequalities in the statement one by one. We note that the condition guarantees that the algorithm does not exhaust any resource before Phase I of the algorithm terminates.
Proof of (17):
For (17), we only need to show that
since the other direction can be shown by the same proof as Lemma 3.
We introduce the following LP (B2) to relate with .
s.t. (21)
Compared to (4) which defines , the LP (B2) scales the right-hand-side resource capacity with a factor and also appends an additional term to the objective function. It is easy to see that the optimal objective value of the LP (B2) is
(22)
Denote the optimal solution of LP (4) by . First, we show that is a feasible solution to (B2); then, with this feasibility, we show that (22) provides an upper bound for OPT and thus establish the relation between OPT and OPT.
To see the feasibility, note that the -th entry of the resource consumption
where the last line comes from the concentration event in Lemma 2 and the fact that
With this feasibility, we have
(23)
Here the second-to-last line comes from a similar argument as the feasibility analysis: it is based on the concentration event in Lemma 2 and the fact that . The last line plugs the upper bound (22) into LP (B2) and comes from the fact that is a feasible solution of LP (B2). By rearranging terms in (23), we complete the proof of (17).
Proof of (18):
As in the proof of (17), we only need to show one side of the inequality, namely
Denote by the optimal solution of LP (5), which is the LP corresponding to . Define
We first show that is feasible to the LP corresponding to in Algorithm 1. To see the feasibility, we have
Here the first two lines come from rearranging terms. The third line comes from the feasibility of . The fourth line comes from the concentration event in Lemma 2 and the fact that . The last line comes from the definition of the UCB estimators.
With the feasibility of , we have
where the first line comes from the feasibility of , the second and fourth lines use the fact that , and the third line comes from the definition of OPT. By rearranging terms, we complete the proof of (18).
Proof of inequality (20):
The idea of the proof is to first show that
is a feasible solution to the LP corresponding to OPT under condition (19) and then analyze its corresponding objective value.
First, to see its feasibility, the condition (19) ensures that
On the other hand, for the constraint
Here the first and second equalities come from rearranging terms, and the last two lines follow a similar analysis as before.
By rearranging terms in the last line of above,
Here the second-to-last line comes from rearranging terms and the last line comes from the feasibility of . In this way, we obtain the feasibility part.
Appendix C Analysis of Phase II of Algorithm 1
In this section, we analyze Phase II of Algorithm 1 and prove Proposition 4. To simplify the presentation, we first introduce a notion of "warm start" and prove Proposition 4 under this warm start condition together with the assumption that all constraints are binding. Indeed, recall that the upper bound in Proposition 1 only involves the remaining resources of the binding constraints. The assumption only simplifies the presentation and does not change the nature of our analysis. Later, in Sections C3 and C4, we discuss how the warm start condition is achieved and what happens when there are non-binding constraints, respectively.
C1 Proof of Proposition 4 under Warm Start Assumption
We first define the following quantity
In the following, we use to define a threshold for parameter estimation. As we will see later, the threshold mainly ensures that the estimated constraint matrix is well-conditioned (with a minimum singular value of ).
Assumption 3 (Warm Start).
We make the following assumption on the parameter estimates
(a) holds element-wise for all .
(b) The following condition holds for all and :
(24)
Here, Part (a) of the assumption restricts to the concentration event in Lemma 2. Part (b) ensures that the estimated LP with constraint matrix is well-conditioned. In Section C3 below, we show that this assumption is met automatically in Algorithm 1, and we discuss the number of time periods it takes to satisfy it.
Assumption 4 (All optimal arms and binding constraints).
We assume that all the arms are optimal and all the constraints are binding, i.e.,
In fact, Assumption 4 is not as strong as it seems. First, Proposition 3 shows that Phase I of Algorithm 1 identifies the optimal arms and binding constraints with high probability. Then, the LP (9) solved in Phase II of the algorithm restricts play to the optimal arms, as if all arms were optimal. Second, from Proposition 1, we know the generic upper bound only involves the remaining resources of the binding constraints, so this assumption allows us to focus on the binding constraints. To make the analysis go through, we only need to show that the non-binding resources are not exhausted before any binding resource is exhausted. Intuitively, this will always happen because the non-binding resources have slack. Later, in Section C4, we return to the case of non-binding resources. We now prove Proposition 4 under these two additional assumptions.
Proof of Proposition 4 under Assumptions 3 and 4
We show that under Assumptions 3 and 4, there exists a constant depending on , , , and such that, for any , the leftover of each binding resource (at time ) satisfies
Proof.
Our proof formalizes the idea described in Section 5.2. Recall that the average remaining resource
where for . Define
where will be specified later. Ideally, the algorithm consumes approximately resource per time period so that the average remaining resource will always stay in . This motivates the definition of a stopping time that represents the first time that escapes from .
Recall that the stopping time is the termination of the procedure. Thus, as long as , we have since no resource is used up at time . Furthermore, we define the following event, for each ,
where is the filtration up to time , is the arm played by Algorithm 1 in time period during Phase II, and is the expected resource consumption of the -th arm. The threshold parameters will be specified later. We remark that in Algorithm 1, the choice of (and its distribution) depends on the remaining resource . Here we represent with , and thus the second term captures the maximal "bias" of resource consumption when the average remaining resource , where the maximum is taken with respect to all .
At time , if for some , there are in total remaining resources and remaining time periods. Ideally, we hope that the resource is consumed at a rate of exactly in each of the remaining time periods. In this sense, the event describes that the average resource level stays in for the first time periods but the consumption bias at time is greater than .
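The even-spreading consumption target described above can be sketched in a few lines. The function below is our own simplification of the idea, not the algorithm's exact rule; the budget and horizon values are illustrative.

```python
import numpy as np

def target_consumption(b_remaining, t, T):
    """Per-period consumption target (sketch): spread the remaining
    budget of each resource evenly over periods t, t+1, ..., T."""
    return b_remaining / (T - t + 1)

# If the algorithm matched the target exactly in every period, the
# budget would be depleted precisely at the end of the horizon:
b, T = 6.0, 10
for t in range(1, T + 1):
    b -= target_consumption(b, t, T)   # consume exactly the target
```

The telescoping product leaves exactly zero budget after period T, which is the ideal trajectory around which the stopping-time analysis controls the stochastic deviations.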
The following lemma states that with a careful choice of and , the event will happen with a small probability.
Lemma 5.
Let . If we choose and
then there exists a constant such that,
for all .
The proof of Lemma 5 is postponed to the end of this proof. The following lemma considers the complements of the events and analyzes the stopping time on those complements. Here we use to denote the complement of an event .
Lemma 6.
If , , and satisfy that for all and
(25)
the following inequality holds
for and , and therefore,
The proof of Lemma 6 will also be postponed to the end of this proof.
C2 Proofs of Lemma 5 and Lemma 6
In this subsection, we prove the two key lemmas of the previous section. The proof of Lemma 5 will utilize the following lemma. The statement of Lemma 7 is quite straightforward: under the assumption that the parameters are well estimated, that all the arms are optimal, and that all the constraints are binding, the optimal solution to LP (9) can simply be obtained as . Its proof is based solely on linear algebra, but it is somewhat tedious, so we postpone it until after the proofs of Lemma 5 and Lemma 6.
Lemma 7.
Proof of Lemma 5
Proof.
First, we show that each arm will be played, in expectation, no less than times in the first time periods. Denote by the normalized probability of playing arm at time ; then, by the definition of , we know
Here, from Assumption 4, we know that in this case, so the constraint consumption matrix is a square matrix. By Lemma 7 and the definition of , if
(26)
where the first line comes from the triangle inequality, the second line comes from the definition of , and the third line is obtained from the relation between the singular value and the matrix infinity norm. Now we can transfer this bound to . From the warm start condition (24) that
together with a proof similar to the first part of Lemma 7, we can show
(27)
Specifically, the warm start condition (24) provides an upper bound for the deviation of and . Putting together (26) and (27), we have
where the inequality holds element-wise. This inequality provides a lower bound on the probability distribution with which the arms are played. The implication is that if , each of the arms will be played with probability no less than . In this way, as long as always holds, all the arms will be played sufficiently often to refine the corresponding estimates.
Now, we apply a concentration argument to show that each arm will be played sufficiently many times, not only in expectation but also with high probability. Concretely, we show that if holds for all and , then the number of times the -th arm is played up to time ,
with high probability for each arm . To see this, we first notice that if for all , then
for each . Also, since is a martingale difference sequence with respect to the filtration , the following holds for all :
by an application of the Azuma–Hoeffding inequality. By rearranging terms, it becomes
and consequently,
(28)
On the event
we know that the inequality holds if for all and
In other words, when for all , the -th arm will be played times with probability , which implies that the estimation error for the -th arm is at most at the beginning of time , with high probability. Then the bias in resource consumption has the following upper bound,
where the last line is defined to be when . Here, the second line comes from the definition of the algorithm, the third and fourth lines come from properties of matrix multiplication, and the fifth line comes from plugging in the estimation error of . The choice of ensures that . Also, we choose as the smallest integer such that
In this way, when , we have the following two cases:
• We have
• : We have and . Consequently, the following holds
for all . This implies that
Combining the two cases, when , we have
and therefore, we know from (28),
∎
Proof of Lemma 6
Proof.
Note that
for each constraint . Thus, we only need to analyze the event
Note that from telescoping sum,
We define , and then have
(29)
We remark that we do not index the process by , since the analysis is the same for all constraints; we focus on a specific constraint . Moreover, from the definition of ,
Next, we develop a concentration argument for the first summation in (29). Specifically, the summand can be viewed as a martingale difference sequence. Note that
for . By applying Hoeffding's inequality, we have
holds for all and . By taking , we have
(30)
Proof of Lemma 7
Proof.
Consider the following LPs for all , and denote their optimal objective values by and , respectively.
s.t.
(32)
s.t.
Then, similarly to the proof of Lemma 2, to prove that is an optimal solution to LP (9), it suffices to show that is feasible to LP (9) and that for all .
First, we show feasibility. It suffices to check the following two results:
(a) The matrix is non-singular, and thus is well-defined.
(b) , and thus is a feasible solution to LP (9).
To show part (a), we prove that the smallest singular value of the matrix is positive. Recall that we use and to denote the smallest and largest singular values of a matrix , respectively.
where the first line comes from Weyl’s inequality on matrix eigenvalues/singular values, the second line comes from the definition of , the third line is obtained from the relation between the spectral norm and the infinity norm of a matrix, and the last line comes from the relation between the matrix infinity norm and the vector infinity norm of each column in the matrix. From the warm start condition (24), we have
and consequently,
(33)
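The Weyl-type perturbation bound used here is easy to verify numerically. A hedged sketch with an illustrative 2-by-2 consumption matrix and error level (neither taken from the paper):

```python
import numpy as np

def sigma_min_lower_bound(A_hat, err_inf):
    """Weyl-type bound (sketch): sigma_min(A) >= sigma_min(A_hat) - ||E||_2,
    and an m x n error matrix E with entries bounded by err_inf has
    spectral norm at most err_inf * sqrt(m * n)."""
    m, n = A_hat.shape
    return np.linalg.svd(A_hat, compute_uv=False)[-1] - err_inf * np.sqrt(m * n)

rng = np.random.default_rng(2)
A = np.array([[0.6, 0.1],
              [0.2, 0.5]])                  # "true" matrix (illustrative)
E = 0.01 * (2 * rng.random((2, 2)) - 1)     # entrywise error at most 0.01
sig_true = np.linalg.svd(A, compute_uv=False)[-1]
lb = sigma_min_lower_bound(A + E, 0.01)     # certified from the estimate alone
```

The point of the bound is that a positive lower bound on the smallest singular value can be certified from the estimated matrix and the entrywise error level alone, without access to the true matrix.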
For part (b), we show the non-negativity of the solution . The starting point is the condition that the optimal solution to the primal LP (1) has strictly positive entries, i.e., from Section 3.3, we know
where the inequality holds element-wise. Now, we show that,
which implies that
holds element-wise by the triangle inequality. To see this, from condition (24) that , we have
The first line comes from the definition of the matrix L∞ norm. The second line comes from the assumption that . The third and sixth lines come from the sub-multiplicativity of the matrix L∞ norm. The fourth and seventh lines come from the definition of following Section 3.3 and the relation between the spectral norm and the L∞ norm, i.e., . The last line reduces the matrix infinity norm to the vector infinity norm and applies condition (24). This finishes the feasibility part for .
Next, we show for all by utilizing and .
Recall that is the optimal objective value of the following LP.
s.t.
With a proof similar to Lemma 4 and using Assumption 3, we can show that
(34)
Then, by LP duality, and are also the optimal values of the dual LPs (C2) and (C2), respectively.
s.t. (35)
s.t. (36)
where is the matrix (vector) obtained by deleting the -th column (entry) of the matrix (vector) . Denote the optimal solution of LP (C2) by . By duality, we have
for all . Moreover, since is also feasible to LP (C2), by duality, we have
where the first line comes from weak duality, the second line comes from rearranging terms, the third line comes from and Hölder's inequality that , and the last line is obtained by inequality (34). Then, we have a similar statement that
Thus, since , we have
for all . Thus we complete the proof by combining the feasibility and the optimality of . ∎
C3 Relaxation of Warm Start Assumption
In this subsection, we relax Assumption 3 for the analysis in Section C1. Indeed, the following proposition shows that the warm start condition in Assumption 3 is automatically satisfied within time periods after Phase I, where is the threshold parameter in Assumption 3. The idea of the proof is straightforward: at the beginning of Phase II, although the parameter estimates may not satisfy the warm start condition, each arm will still be played with at least a certain probability. Then, by continually playing the arms, the corresponding parameters of the optimal arms are gradually refined until the warm start condition is met. The refinement takes at most time periods with high probability.
Proposition 6.
Proof.
First, we show that each arm will be played a linear number of times even if Phase II does not initialize with a warm start, i.e., if there exists an arm such that for some in Phase II. Without loss of generality, we assume that all UCB estimates are non-increasing and all LCB estimates are non-decreasing, since at each time we can use the minimal UCB estimates and the maximal LCB estimates of all quantities from time 1 to . Thus, and are non-increasing and is non-decreasing.
Now, consider the following LPs for all (the same as the LPs used in Lemma 7):
s.t.
s.t.
Note that if . With a proof similar to Lemma 7, we can show that
for all . Then, by triangle inequality, we have, for all in Phase II,
(37)
where the second line is obtained by
after Phase I. This motivates us to focus on the difference between and . Following the proof of Lemma 2, we can show that with probability
hold for all , and . Notice that is also a feasible solution to LP (7). Thus, we have
hold with probability no less than for all in Phase II. Here, the first line comes from the definition of , the second line comes from rearranging terms, the third line comes from and for all , and the last line comes from direct calculation. Plugging this result into inequality (37), we have
for all . Denote the optimal solution corresponding to by , and notice that
for all , we have
Thus, in expectation, the algorithm plays each arm at least times in each step if during Phase II. Then, with a proof similar to Lemma 5, we can show that there exists a depending polynomially on , , , , and such that, for all , the algorithm plays each arm at least times within steps with probability no less than . The algorithm then achieves the warm start. Moreover, since the algorithm only plays arms in and is large enough during this process, no additional regret is incurred. ∎
C4 Analysis of the Non-Binding Constraints
We analyze the non-binding constraints in the same way as the binding constraints. Previously, we derived an upper bound for in Section C1 under the assumption that all constraints are binding. With non-binding constraints present, we only need to show that they will not be exhausted before any binding constraint is exhausted. To adapt the analysis in Section C1 to this case, we only need to change the event to the intersection of the following event and . All the remaining parts of Section C1 then extend to the case with non-binding constraints.
Define
In this way, the event describes the "bad" event that the binding dimensions of the process stay within the region for and the non-binding dimensions of the process are not exhausted before time , but a certain non-binding constraint is exhausted at time . The following lemma establishes that such an event happens with small probability. It thus implies that as long as , the non-binding constraints will not be exhausted. Its proof shares the same structure as that of Lemma 5.
Lemma 8.
There exist constants depending on , such that for all , the following inequality holds
Proof of Lemma 8
Proof.
Based on the definition of , we have
holds for every non-binding constraint , where denotes the optimal solution to (1). Recall that for the non-binding constraints, we only need to ensure that they are not depleted before . The implication of this inequality is that the non-binding constraints have an amount of resource surplus that prevents depletion.
Intuitively, if the average consumption of every binding resource is close to , i.e., for each binding resource , then the average number of plays of each optimal arm will be close to . In turn, the average consumption of each non-binding resource will be close to . Following this intuition, we bound the probability of the statement in this lemma by a concentration inequality.
First, we estimate the number of plays of each arm. With a proof similar to Lemma 5, we can show that, with probability no less than , the following inequality holds for all and :
Based on the algorithm, if for all binding resources , we know that
for all . Thus, if , for all and binding resources, we have
where . Moreover, for all , on the event , we know that no resource is used up.
Second, we estimate the resource consumption of the non-binding resources. Let ; applying a concentration inequality, we have
Thus, with probability no more than , we have for all and . Moreover, notice that for all on the event , we have
holds on the event
where the first inequality comes from , the second line comes from the analysis of , the third line comes from the decomposition of , the fifth line comes from , and the last line comes from . Hence, on the event
and we have
if . Thus, as , the remaining amount of each non-binding resource is no less than at the end of time , implying . Thus, we have
and consequently,
∎
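The slack argument for non-binding resources can be illustrated with a toy simulation. The Bernoulli consumption model and all numbers below are our own assumptions, not the paper's process: a resource whose budget exceeds its mean consumption by a constant slack per period is essentially never depleted before the horizon ends.

```python
import numpy as np

def early_depletion_freq(slack, mean_use, T, trials, rng):
    """Toy model: a non-binding resource starts with budget
    T * (mean_use + slack), while per-period consumption is
    Bernoulli(mean_use). Return the fraction of runs in which the
    budget is depleted before the horizon ends."""
    totals = (rng.random((trials, T)) < mean_use).sum(axis=1)
    return float(np.mean(totals > T * (mean_use + slack)))

rng = np.random.default_rng(3)
p_deplete = early_depletion_freq(slack=0.1, mean_use=0.4, T=500,
                                 trials=2000, rng=rng)
```

Depletion here requires a deviation of order T times the slack, while the stochastic fluctuation is only of order the square root of T, which mirrors the concentration argument in the proof above.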
Appendix D Proof of Proposition 5
Proof.
The regret analysis follows by combining Propositions 1, 3, and 4. Proposition 1 shows that
(38)
First, we bound the first term on the right-hand side of inequality (38). From duality, we know that . Thus, we have
Moreover, Proposition 3 shows that, with probability no less than , the algorithm only plays arms in during Phase I and that Phase I terminates within
time periods. Thus,
(39)