
The Symmetry between Arms and Knapsacks: A Primal-Dual Approach for Bandits with Knapsacks

Xiaocheng Li    Chunlin Sun    Yinyu Ye
( Imperial College Business School
Institute for Computational and Mathematical Engineering, Stanford University
Department of Management Science and Engineering, Stanford University
)
Abstract

In this paper, we study the bandits with knapsacks (BwK) problem and develop a primal-dual based algorithm that achieves a problem-dependent logarithmic regret bound. The BwK problem extends the multi-armed bandit (MAB) problem to model the resource consumption associated with playing each arm, and the existing BwK literature has mainly focused on deriving asymptotically optimal distribution-free regret bounds. We first study the primal and dual linear programs underlying the BwK problem. From this primal-dual perspective, we discover a symmetry between arms and knapsacks, and then propose a new notion of sub-optimality measure for the BwK problem. The sub-optimality measure highlights the important role of knapsacks in determining algorithm regret and inspires the design of our two-phase algorithm. In the first phase, the algorithm identifies the optimal arms and the binding knapsacks, and in the second phase, it exhausts the binding knapsacks via playing the optimal arms through an adaptive procedure. Our regret upper bound involves the proposed sub-optimality measure and has a logarithmic dependence on the length of horizon T and a polynomial dependence on m (the number of arms) and d (the number of knapsacks). To the best of our knowledge, this is the first problem-dependent logarithmic regret bound for solving the general BwK problem.

1 Introduction

The Multi-Armed Bandit (MAB) problem is a problem in which a limited amount of resource must be allocated among competing (alternative) choices in a way that maximizes the expected gain (Gittins et al., 2011). It is a benchmark problem for decision making under uncertainty that has been studied for nearly a century. As a prototypical reinforcement learning problem, the MAB problem exemplifies the exploration–exploitation tradeoff (Weber et al., 1992). The original problem, first formulated in its predominant version in (Robbins, 1952), has inspired a recent line of research that considers additional constraints reflecting more accurately the reality of the online decision making process. Bandits with Knapsacks (BwK) was introduced by Badanidiyuru et al. (2013) to allow more general constraints on the decisions across time, in addition to the customary limitation on the time horizon. The BwK problem, as a general framework, encompasses a wide range of applications, including dynamic pricing and revenue management (Besbes and Zeevi, 2012), online advertisement (Mehta et al., 2005), network routing (Agrawal et al., 2014), etc.

While the existing BwK literature (Badanidiyuru et al., 2013; Agrawal and Devanur, 2014) has derived algorithms that achieve optimal problem-independent regret bounds, a problem-dependent bound that captures the optimal performance of an algorithm on a specific BwK problem instance remains an open question. For the standard MAB problem, the problem-dependent bound is well understood, and an upper bound with logarithmic dependence on T can be achieved by both UCB-based algorithms (Auer et al., 2002) and Thompson sampling-based algorithms (Agrawal and Goyal, 2012). In this paper, we focus on developing a problem-dependent bound for the BwK problem and identify parameters that characterize the hardness of a BwK problem instance.

Two existing works along this line are (Flajolet and Jaillet, 2015) and (Sankararaman and Slivkins, 2020). The paper (Flajolet and Jaillet, 2015) considers several specific settings for the problem, and for the general BwK problem, it achieves an O(2^{m+d}\log T) regret bound (with other problem-dependent parameters omitted), where m is the number of arms and d is the number of the knapsacks/resource constraints. In addition, the results in (Flajolet and Jaillet, 2015) require a priori knowledge of some parameters of the problem instance. The recent work (Sankararaman and Slivkins, 2020) considers the BwK problem under the assumptions that there is only a single knapsack/resource constraint and that there is only a single optimal arm. In contrast to these two works, we consider the problem in its full generality and do not assume any prior knowledge of the problem instance. We will further compare with their results after we present our regret bound.

Specifically, we adopt a primal-dual perspective to study the BwK problem. Our treatment is new in that we highlight the effects of resources/knapsacks on regret from the dual perspective and define the sub-optimality measure based on the primal and dual problems jointly. To begin, we derive a generic upper bound that applies to all BwK algorithms. The upper bound consists of two elements: (i) the number of times a sub-optimal arm is played; (ii) the remaining knapsack resource at the end of the horizon. It emphasizes that the arms and knapsacks are of equal importance in determining the regret of a BwK algorithm. By further exploiting the structure of the primal and dual LPs, we develop a new sub-optimality measure for the BwK problem which can be viewed as a generalization of the sub-optimality measure for the MAB problem first derived in (Lai and Robbins, 1985). The sub-optimality measure accounts for both arms and knapsacks: it distinguishes optimal arms from non-optimal arms, and binding knapsacks from non-binding knapsacks. We use this measure as a key characterization of the hardness of a BwK problem instance.

Inspired by these findings, we propose a two-phase algorithm for the problem. The first phase of our algorithm is elimination-based, and its objective is to identify the optimal arms and binding knapsacks of the BwK problem. The second phase of our algorithm utilizes the output of the first phase and uses the optimal arms to exhaust the remaining resources through an adaptive procedure. Our algorithm and analysis are notable for their full generality: we only make a mild assumption on the non-degeneracy of the underlying LP. In addition, the algorithm requires no prior knowledge of the underlying problem instance.

Other related literature: (Agrawal and Devanur, 2015; Agrawal et al., 2016) study the contextual BwK problem where the reward and resource consumption are both linear in a context vector. (Immorlica et al., 2019; Kesselheim and Singla, 2020) study the BwK problem under an adversarial setting. (Ferreira et al., 2018) analyzes Thompson sampling-based algorithms for the BwK problem.

2 Model and Setup

The problem of bandits with knapsacks (BwK) was first defined in Badanidiyuru et al. (2013), and the notations presented here are largely consistent with Badanidiyuru et al. (2013); Agrawal and Devanur (2014). Consider a fixed and known finite set of m arms (possible actions) available to the decision maker, henceforth called the algorithm. There are d types of resources and a finite time horizon T, where T is known to the algorithm. In each time step t, the algorithm plays one of the m arms, receives reward r_{t}, and consumes an amount C_{j,t}\in[0,1] of each resource j\in[d]. The reward r_{t} and consumption \bm{C}_{t}=(C_{1,t},...,C_{d,t})^{\top}\in\mathbb{R}^{d} are revealed to the algorithm after it chooses arm i_{t}\in[m]. The rewards and costs in every round are generated i.i.d. from some unknown fixed underlying distribution. More precisely, there are some fixed but unknown \bm{\mu}=(\mu_{1},...,\mu_{m})^{\top}\in\mathbb{R}^{m} and \bm{C}=(\bm{c}_{1},...,\bm{c}_{m})\in\mathbb{R}^{d\times m} such that

𝔼[rt|it]=μit,𝔼[𝑪t|it]=𝒄it\mathbb{E}[r_{t}|i_{t}]=\mu_{i_{t}},\ \mathbb{E}[\bm{C}_{t}|i_{t}]=\bm{c}_{i_{t}}

where \mu_{i}\in\mathbb{R} and \bm{c}_{i}=(c_{1i},...,c_{di})^{\top}\in\mathbb{R}^{d} are the expected reward and the expected resource consumption of arm i\in[m]. At the beginning of every time step t, the algorithm picks an arm i_{t}, using only the history of plays and outcomes up to time step t-1. There is a hard capacity constraint B_{j} for the j-th type of resource. The algorithm stops at the earliest time \tau at which one or more of the constraints is violated, i.e., \sum_{t=1}^{\tau+1}C_{j,t}>B_{j} for some j\in[d], or when the time horizon ends, i.e., \tau\geq T. Its total reward is given by the sum of rewards in all rounds preceding \tau, i.e., \sum_{t=1}^{\tau}r_{t}. The goal of the algorithm is to maximize the expected total reward. The values of B_{j} are known to the algorithm, and without loss of generality, we make the following assumption.

Assumption 1.

We assume B_{j}=B=\min_{j}B_{j} for all j\in[d] (by scaling the consumption matrix \bm{C}). Let \bm{B}=(B,...,B)^{\top}\in\mathbb{R}^{d}. Moreover, we assume the resource capacity \bm{B} scales linearly with T, i.e., \bm{B}=T\cdot\bm{b}=T\cdot(b,...,b)^{\top}\in\mathbb{R}^{d} for some b>0.

The assumption is mainly for notational simplicity and does not change the nature of the analysis in this paper. Given that our focus is to derive an asymptotic problem-dependent bound, it is natural to have the resource capacity scale linearly with the length of the horizon. Throughout this paper, we use bold symbols to denote vectors/matrices and normal symbols to denote scalars.

Furthermore, without loss of generality, a “null” arm is introduced to represent the time constraint (time horizon T). Specifically, let \mu_{m}=0, \bm{c}_{m}=(b,0,...,0)^{\top}, and c_{1,i}=b for all i\in[m]. In this way, the first constraint captures the finite time horizon T, and the “null” arm can be played with no reward achieved and no cost induced to the other, actual constraints except for the time constraint.
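To make the interaction protocol concrete, the following is a minimal simulator sketch in Python. The Bernoulli rewards/consumptions, the class and attribute names, and all numbers are illustrative assumptions (the model only requires i.i.d. outcomes in [0,1] with the stated means), not part of the paper.

```python
import numpy as np

# A minimal BwK environment sketch: i.i.d. Bernoulli rewards and consumptions
# with the stated means. Names and distributions are illustrative assumptions.
class BwKEnv:
    def __init__(self, mu, C, B, T, seed=0):
        self.mu = np.asarray(mu, dtype=float)       # expected rewards, length m
        self.C = np.asarray(C, dtype=float)         # expected consumptions, d x m
        self.B = np.asarray(B, dtype=float).copy()  # remaining capacities
        self.T, self.t = T, 0
        self.rng = np.random.default_rng(seed)
        self.total_reward, self.stopped = 0.0, False

    def play(self, i):
        """Play arm i; rewards count only for rounds preceding the stopping time."""
        if self.stopped:
            return None
        r = self.rng.binomial(1, self.mu[i])
        cost = self.rng.binomial(1, self.C[:, i]).astype(float)
        self.B -= cost
        self.t += 1
        if (self.B < 0).any():
            self.stopped = True          # the violating round's reward is discarded
        else:
            self.total_reward += r
            if self.t >= self.T:
                self.stopped = True      # horizon exhausted
        return r, cost
```

For instance, a single deterministic arm with unit consumption and capacity B=3 collects exactly three units of reward before the fourth play violates the constraint.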

Regret is defined as the difference between OPT and the total reward obtained by the algorithm, where OPT denotes the total expected reward of the optimal dynamic policy. In this paper, we are interested in the (expected) problem-dependent regret,

\text{Regret}_{T}^{\pi}(\mathcal{P},\bm{B})\coloneqq\text{OPT}-\mathbb{E}\left[\sum_{t=1}^{\tau}r_{t}\right]

where \pi denotes the algorithm, and \mathcal{P} encapsulates all the parameters related to the distributions of reward and resource consumption, including \bm{\mu} and \bm{C}. The expectation is taken with respect to the randomness of the reward and resource consumption.

Consider the following linear program:

\text{OPT}_{\text{LP}}\coloneqq\max_{\bm{x}}\ \ \bm{\mu}^{\top}\bm{x} \qquad (1)
\text{s.t.}\ \ \bm{C}\bm{x}\leq\bm{B},
\ \ \bm{x}\geq\bm{0}

where the decision variables are \bm{x}=(x_{1},...,x_{m})^{\top}\in\mathbb{R}^{m}. One can show that (see Badanidiyuru et al. (2013))

\text{OPT}_{\text{LP}}\geq\text{OPT}

so that \text{OPT}_{\text{LP}} provides a deterministic upper bound for the expected reward under the optimal dynamic policy. Let \bm{x}^{*}=(x_{1}^{*},...,x_{m}^{*})^{\top} denote the optimal solution to (1).
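As a concrete illustration (not part of the original analysis), LP (1) can be solved with an off-the-shelf solver. The instance below, with m=3 arms (the third being the null arm), d=2 resources, and b=0.5, uses hypothetical numbers of our own choosing.

```python
import numpy as np
from scipy.optimize import linprog

# A small synthetic BwK instance (all numbers are illustrative assumptions):
# m = 3 arms (the last is the "null" arm), d = 2 resources, T = 1000, b = 0.5.
mu = np.array([0.8, 0.6, 0.0])      # expected rewards; mu_m = 0 for the null arm
C = np.array([[0.5, 0.5, 0.5],      # first row: the time constraint, c_{1,i} = b
              [0.7, 0.2, 0.0]])     # second row: an actual resource
T, b = 1000, 0.5
B = T * b * np.ones(2)              # capacities B = T * b

# linprog minimizes, so negate mu to maximize mu^T x s.t. C x <= B, x >= 0.
res = linprog(-mu, A_ub=C, b_ub=B, bounds=[(0, None)] * 3)
x_star = res.x                      # optimal solution x* of LP (1)
opt_lp = -res.fun                   # OPT_LP, a deterministic upper bound on OPT
```

On this instance the optimum plays only the first two arms, with both resources binding, which makes it a convenient running example for the primal-dual quantities defined below.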

3 Primal-dual Perspective for BwK

In this section, we present a generic regret upper bound and explore the properties of the underlying primal and dual LPs.

Let \mathcal{I}^{*} and \mathcal{I}^{\prime} denote the set of optimal basic variables and the set of optimal non-basic variables of (1), respectively. Let \mathcal{J}^{*} and \mathcal{J}^{\prime} denote the set of binding and non-binding constraints of (1), respectively. That is,

\mathcal{I}^{*}\coloneqq\left\{i\in[m]:x_{i}^{*}>0\right\},
\mathcal{I}^{\prime}\coloneqq\left\{i\in[m]:x_{i}^{*}=0\right\},
\mathcal{J}^{*}\coloneqq\left\{j\in[d]:B-\sum_{i=1}^{m}c_{ji}x_{i}^{*}=0\right\},
\mathcal{J}^{\prime}\coloneqq\left\{j\in[d]:B-\sum_{i=1}^{m}c_{ji}x_{i}^{*}>0\right\}.

So, we have \mathcal{I}^{*}\cup\mathcal{I}^{\prime}=[m] and \mathcal{J}^{*}\cup\mathcal{J}^{\prime}=[d]. Accordingly, we call an arm i\in\mathcal{I}^{*} an optimal arm and an arm i\in\mathcal{I}^{\prime} a sub-optimal arm. Here and hereafter, we refer to knapsacks as constraints so that the terminology is more aligned with the LPs.

We make the following assumption on the LP’s optimal solution.

Assumption 2.

The LP (1) has a unique optimal solution. Moreover, the optimal solution is non-degenerate, i.e.,

|\mathcal{I}^{*}|=|\mathcal{J}^{*}|.

The assumption is standard in the LP literature, and any LP can be made to satisfy it with an arbitrarily small perturbation (Megiddo and Chandrasekaran, 1989). To interpret the non-degeneracy, suppose |\mathcal{I}^{*}|=|\mathcal{J}^{*}|=l; then the optimal solution to LP (1) plays only the l arms in \mathcal{I}^{*}. When there is no linear dependency between the corresponding columns of \bm{C}, this results in the depletion of exactly l resource constraints.

The dual problem of (1) is

\min_{\bm{y}}\ \ \bm{B}^{\top}\bm{y} \qquad (2)
\text{s.t.}\ \ \bm{C}^{\top}\bm{y}\geq\bm{\mu},
\ \ \bm{y}\geq\bm{0}.

Denote its optimal solution as \bm{y}^{*}=(y_{1}^{*},...,y_{d}^{*})^{\top}. From the LP's complementary slackness condition, the following relation holds under Assumption 2:

j\in\mathcal{J}^{*}\Leftrightarrow y_{j}^{*}>0,\qquad j\in\mathcal{J}^{\prime}\Leftrightarrow y_{j}^{*}=0.
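This complementarity can be checked numerically. The sketch below solves the dual LP (2) on the same hypothetical instance used earlier (our own toy numbers, not from the paper) and verifies that the positive dual variables match the binding primal constraints.

```python
import numpy as np
from scipy.optimize import linprog

# Same hypothetical toy instance as before (illustrative numbers only).
mu = np.array([0.8, 0.6, 0.0])
C = np.array([[0.5, 0.5, 0.5],
              [0.7, 0.2, 0.0]])
B = 500.0 * np.ones(2)

# Dual LP (2): minimize B^T y s.t. C^T y >= mu, y >= 0.
# linprog uses A_ub x <= b_ub, so flip the sign of the constraint.
dual = linprog(B, A_ub=-C.T, b_ub=-mu, bounds=[(0, None)] * 2)
y_star = dual.x

# Complementary slackness: y*_j > 0 exactly for the binding constraints J*.
primal = linprog(-mu, A_ub=C, b_ub=B, bounds=[(0, None)] * 3)
slack = B - C @ primal.x
binding = np.isclose(slack, 0.0, atol=1e-7)
```

Here both constraints are binding, and accordingly both dual variables come out strictly positive.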

The following lemma summarizes the LP’s properties.

Lemma 1.

Under Assumption 2, the primal LP (1) and the dual LP (2) share the same optimal objective value. Also,

|\mathcal{I}^{*}|+|\mathcal{J}^{\prime}|=d,
|\mathcal{I}^{\prime}|+|\mathcal{J}^{*}|=m.

3.1 A Generic Regret Upper Bound

We begin our discussion by deriving a new upper bound for a generic BwK algorithm. First, we define the knapsack process as the remaining resource capacity at each time t. Specifically, we define \bm{B}^{(0)}\coloneqq\bm{B} and

\bm{B}^{(t+1)}\coloneqq\bm{B}^{(t)}-\bm{C}_{t}

for t\in[T]. Recall that \bm{C}_{t} is the (random) resource consumption at time t. The process \bm{B}^{(t)} depends on the BwK algorithm. In addition, we follow the convention of the bandits literature and define the count process n_{i}(t) as the number of times the i-th arm is played up to the end of time period t.

Proposition 1.

The following inequality holds for any BwK algorithm,

\text{Regret}_{T}^{\pi}(\mathcal{P},\bm{B})\leq\sum_{i\in\mathcal{I}^{\prime}}\mathbb{E}\left[n_{i}(\tau)\right]\Delta_{i}+\mathbb{E}\left[\bm{B}^{(\tau)}\right]^{\top}\bm{y}^{*}, \qquad (3)

where \Delta_{i}=\bm{c}_{i}^{\top}\bm{y}^{*}-\mu_{i} for i\in[m].

Here \Delta_{i} is known as the reduced cost/profit in the LP literature, and it quantifies the cost-efficiency of each arm (each variable in the LP). The upper bound in Proposition 1 is new to the existing literature and is generally applicable to all BwK algorithms. It consists of two parts: (i) the number of times a sub-optimal arm is played, multiplied by the corresponding reduced cost; (ii) the remaining resource at time \tau, either when some resource is depleted or at the end of the time horizon T. The first part is consistent with the classic bandits literature in that we always want to upper bound the number of times sub-optimal arms are played throughout the horizon. Each time a sub-optimal arm i\in\mathcal{I}^{\prime} is played, a cost of \Delta_{i} is incurred. Meanwhile, the second part is particular to the bandits with knapsacks setting and can easily be overlooked. Recall that \tau refers to the first time that any resource j\in[d] is exhausted (or the end of the horizon). It tells us that the left-over resources when the process terminates at time \tau may also induce regret. For example, for two binding resources j,j^{\prime}\in\mathcal{J}^{*}, it would be undesirable for j to have a large amount remaining while j^{\prime} is exhausted. Since the binding resources are critical in determining the optimal objective value of LP (1), intuitively, it is not profitable to waste any of them at the end of the procedure.

3.2 Symmetry between Arms and Knapsacks

From Proposition 1, we see the importance of the dual problem (2) in bounding an algorithm's regret. Now, we pursue this path further and propose a new notion of sub-optimality for the BwK problem. Our sub-optimality measure is built upon both the primal and dual LPs, and it reveals the combinatorial structure of the problem. In the following, we define two classes of LPs, one for the arms i\in[m] and the other for the constraints j\in[d].

First, for each arm i\in[m], define

\text{OPT}_{i}\coloneqq\max_{\bm{x}}\ \ \bm{\mu}^{\top}\bm{x}, \qquad (4)
\text{s.t.}\ \ \bm{C}\bm{x}\leq\bm{B},
\ \ x_{i}=0,\ \bm{x}\geq\bm{0}.

By definition, \text{OPT}_{i} denotes the optimal objective value of an LP that takes the same form as the primal LP (1) but with an extra constraint x_{i}=0. It represents the optimal objective value when the i-th arm is not allowed to be used. For a sub-optimal arm i\in\mathcal{I}^{\prime}, \text{OPT}_{i}=\text{OPT}_{\text{LP}}, while for an optimal arm i\in\mathcal{I}^{*}, \text{OPT}_{i}<\text{OPT}_{\text{LP}}. In this way, \text{OPT}_{i} characterizes the importance of arm i.

Next, for each constraint j\in[d], define

\text{OPT}_{j}\coloneqq\min_{\bm{y}}\ \ \bm{B}^{\top}\bm{y}-B, \qquad (5)
\text{s.t.}\ \ \bm{C}^{\top}\bm{y}\geq\bm{\mu}+\bm{C}_{j,\cdot},
\ \ \bm{y}\geq\bm{0},

where \bm{C}_{j,\cdot} denotes the j-th row of the constraint matrix \bm{C}. Though it may not be as obvious as in the previous case of \text{OPT}_{i}, the definition of \text{OPT}_{j} aims to characterize the bindingness/non-bindingness of constraint j. The point can be illustrated by looking at the primal problem of (5). From LP strong duality, we know

\text{OPT}_{j}=\max_{\bm{x}}\ \ \bm{\mu}^{\top}\bm{x}-(B-\bm{C}_{j,\cdot}^{\top}\bm{x}), \qquad (6)
\text{s.t.}\ \ \bm{C}\bm{x}\leq\bm{B},
\ \ \bm{x}\geq\bm{0}.

Compared to the original primal LP (1), there is an extra term in the objective function of LP (6). The extra term penalizes the left-over of the j-th constraint and thus encourages its usage. For a binding constraint j\in\mathcal{J}^{*}, the constraint is exhausted under the optimal solution \bm{x}^{*} to LP (1), so the penalization term has no effect, i.e., \text{OPT}_{j}=\text{OPT}_{\text{LP}}. In contrast, for a non-binding constraint j\in\mathcal{J}^{\prime}, the extra term results in a reduction of the objective value, i.e., \text{OPT}_{j}<\text{OPT}_{\text{LP}}. We note that one can introduce any positive weight on the penalization term to trade off between the reward and the left-over of the j-th constraint in (6), but the current version suffices for our discussion.

The following proposition summarizes the properties of \text{OPT}_{i} and \text{OPT}_{j}.

Proposition 2.

Under Assumption 2, we have

\text{OPT}_{i}<\text{OPT}_{\text{LP}}\ \Leftrightarrow\ i\in\mathcal{I}^{*},
\text{OPT}_{i}=\text{OPT}_{\text{LP}}\ \Leftrightarrow\ i\in\mathcal{I}^{\prime},
\text{OPT}_{j}=\text{OPT}_{\text{LP}}\ \Leftrightarrow\ j\in\mathcal{J}^{*},
\text{OPT}_{j}<\text{OPT}_{\text{LP}}\ \Leftrightarrow\ j\in\mathcal{J}^{\prime}.

In this way, the definition of \text{OPT}_{i} distinguishes the optimal arms \mathcal{I}^{*} from the sub-optimal arms \mathcal{I}^{\prime}, while the definition of \text{OPT}_{j} distinguishes the binding constraints \mathcal{J}^{*} from the non-binding constraints \mathcal{J}^{\prime}. The importance of such a distinction arises from the upper bound in Proposition 1: on one hand, we should avoid playing sub-optimal arms, and on the other hand, we should exhaust the binding resources. A second motivation for defining both \text{OPT}_{i} and \text{OPT}_{j} will become clear after we present our algorithm. Furthermore, we remark that a measure of the sub-optimality of the arms has to be defined through the lens of the LP due to the combinatorial nature of the problem. The effect of the i-th arm's removal on the objective value can only be gauged by solving the alternative LP defining \text{OPT}_{i}. Similarly, a measure of the bindingness of the constraints should also take into account the combinatorial relations among constraints.

Next, define

\delta\coloneqq\frac{1}{T}\left(\text{OPT}_{\text{LP}}-\max\left\{\max_{i\in\mathcal{I}^{*}}\text{OPT}_{i},\ \max_{j\in\mathcal{J}^{\prime}}\text{OPT}_{j}\right\}\right),

where the factor \frac{1}{T} normalizes the optimality gap by the number of time periods. Under Assumption 1, all the objective values above scale linearly with T.

To summarize, \delta characterizes the hardness of distinguishing optimal arms from non-optimal arms (and binding constraints from non-binding constraints). It can be viewed as a generalization of the sub-optimality measure \delta_{\text{MAB}}=\min_{i\neq i^{*}}(\mu_{i^{*}}-\mu_{i}) for the MAB problem (Lai and Robbins, 1985). \delta_{\text{MAB}} characterizes the hardness of an MAB problem instance, i.e., the hardness of distinguishing the optimal arm from the sub-optimal arms. In the context of BwK, \delta is a more comprehensive characterization in that it takes into account both the arms and the constraints. As one might imagine, it is critical in both algorithm design and analysis for the BwK problem.
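The quantities \text{OPT}_{i}, \text{OPT}_{j}, and \delta can all be computed by solving small LPs. The sketch below does so on the hypothetical toy instance introduced earlier (our own numbers; the helper name `opt_lp` and the hard-coded index sets are assumptions for illustration), using the primal form (6) for \text{OPT}_{j}.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy instance from before (illustrative numbers only).
mu = np.array([0.8, 0.6, 0.0])
C = np.array([[0.5, 0.5, 0.5],
              [0.7, 0.2, 0.0]])
T = 1000
B = 500.0 * np.ones(2)

def opt_lp(extra_bounds=None, penalty_row=None):
    """Solve LP (1); optionally force x_i = 0 (for OPT_i) or penalize the
    left-over of row j (for OPT_j, via its primal form (6))."""
    c = -mu.astype(float)
    if penalty_row is not None:
        # mu^T x - (B_j - C_{j,.} x) = (mu + C_{j,.})^T x - B_j
        c = -(mu + C[penalty_row])
    bounds = extra_bounds if extra_bounds is not None else [(0, None)] * len(mu)
    res = linprog(c, A_ub=C, b_ub=B, bounds=bounds)
    val = -res.fun
    if penalty_row is not None:
        val -= B[penalty_row]
    return val

OPT = opt_lp()                                # OPT_LP
# OPT_i: arm i removed; strictly smaller iff i is an optimal arm.
opt_arm = [opt_lp([(0, 0) if k == i else (0, None) for k in range(3)])
           for i in range(3)]
# OPT_j: left-over of resource j penalized; equals OPT_LP iff j is binding.
opt_res = [opt_lp(penalty_row=j) for j in range(2)]

I_star, J_prime = [0, 1], []                  # known for this toy instance
cands = [opt_arm[i] for i in I_star] + [opt_res[j] for j in J_prime]
delta = (OPT - max(cands)) / T
```

On this instance, removing either optimal arm strictly reduces the value, removing the null arm does not, and both (binding) resources leave \text{OPT}_{j} unchanged, as Proposition 2 predicts.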

3.3 Key Parameters of the LPs

Now, we define two LP-related quantities that will appear in our regret bound:

  • Linear dependency between arms: Define \sigma to be the minimum singular value of the matrix \bm{C}_{\mathcal{J}^{*},\mathcal{I}^{*}}. Specifically,

    \sigma\coloneqq\sigma_{\min}\left(\bm{C}_{\mathcal{J}^{*},\mathcal{I}^{*}}\right)=\sigma_{\min}\left((c_{ji})_{j\in\mathcal{J}^{*},i\in\mathcal{I}^{*}}\right).

    In this light, \sigma represents the linear dependency between the optimal arms across the binding constraints. For a smaller value of \sigma, the optimal arms are more linearly dependent, and it is harder to identify the optimal numbers of plays. Under Assumption 2, the uniqueness of the optimal solution implies \sigma>0.

  • Threshold on the optimal solution: Define

    \chi\coloneqq\frac{1}{T}\cdot\min\left\{x_{i}^{*}:x_{i}^{*}>0,\ i\in[m]\right\}.

    Since the total resource is \bm{B}=T\cdot\bm{b}, the non-zero entries of the optimal solution (x_{1}^{*},...,x_{m}^{*}) scale linearly with T. The factor \frac{1}{T} normalizes the optimal solution into a probability vector, and \chi denotes its smallest non-zero entry. Intuitively, a small value of \chi implies that the optimal proportion of plays of some arm i\in\mathcal{I}^{*} is small, and thus the algorithm is more prone to “overplay” that arm.
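Both parameters are directly computable once the optimal basis is known. The sketch below evaluates \sigma and \chi on the earlier toy instance, where \mathcal{I}^{*}=\{1,2\} and both constraints are binding (the index sets are assumptions of the toy example, not outputs of the paper's algorithm).

```python
import numpy as np

# Toy instance from before; I* = {0, 1} and J* = {0, 1} (0-indexed, assumed known).
C = np.array([[0.5, 0.5, 0.5],
              [0.7, 0.2, 0.0]])
x_star = np.array([600.0, 400.0, 0.0])
T = 1000

C_basis = C[np.ix_([0, 1], [0, 1])]                      # rows J*, columns I*
sigma = np.linalg.svd(C_basis, compute_uv=False).min()   # minimum singular value
chi = x_star[x_star > 0].min() / T                       # smallest non-zero entry / T
```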

Remarks. By definition, the parameters \delta and \chi both seem to involve a factor of T. But if we replace \bm{B} (the right-hand side of the LP) with \bm{b} from Assumption 1, the factor T disappears, and the parameters \chi and \delta essentially depend on \bm{\mu}, \bm{C}, and \bm{b}, which are inherent to the problem instance and bear no dependency on the horizon T. In other words, Assumption 1 removes the dependency on T by introducing the quantity b. Practically, the assumption states that the resource capacity should be sufficiently large, which is natural in many application contexts (for example, the small-bids assumption in the AdWords problem, Mehta et al. (2005)). Theoretically, in the two previous works Flajolet and Jaillet (2015); Sankararaman and Slivkins (2020), either a factor of 1/T appears in the parameter definition (Sankararaman and Slivkins, 2020) or the assumption is explicitly imposed (Flajolet and Jaillet, 2015). Such an assumption might be inevitable for a logarithmic regret bound to be derived.

3.4 LCB and UCB

Throughout this paper, we denote the reward and resource consumption of the s-th play of the i-th arm as r_{i,s} and \bm{C}_{i,s}=(C_{1i,s},...,C_{di,s})^{\top} respectively, for i\in[m] and s\in[T]. Let n_{i}(t) be the number of times the i-th arm is played in the first t time periods. Accordingly, we denote the estimators at time t for the i-th arm as

\hat{\mu}_{i}(t)\coloneqq\frac{1}{n_{i}(t)}\sum_{s=1}^{n_{i}(t)}r_{i,s},\qquad
\hat{C}_{ji}(t)\coloneqq\frac{1}{n_{i}(t)}\sum_{s=1}^{n_{i}(t)}C_{ji,s}

for i\in[m] and j\in[d]. In a similar manner, we define the estimator for the i-th arm's resource consumption vector as \hat{\bm{C}}_{i}(t)\coloneqq(\hat{C}_{1i}(t),...,\hat{C}_{di}(t))^{\top}, and the resource consumption matrix as \hat{\bm{C}}(t)\coloneqq(\hat{\bm{C}}_{1}(t),...,\hat{\bm{C}}_{m}(t)). Without changing the nature of the analysis, we ignore the case when n_{i}(t)=0. Then, we define the lower confidence bound (LCB) and upper confidence bound (UCB) for the parameters as

\mu_{i}^{L}(t)\coloneqq\text{proj}_{[0,1]}\left(\hat{\mu}_{i}(t)-\sqrt{\frac{2\log T}{n_{i}(t)}}\right),
\mu_{i}^{U}(t)\coloneqq\text{proj}_{[0,1]}\left(\hat{\mu}_{i}(t)+\sqrt{\frac{2\log T}{n_{i}(t)}}\right),
C_{ji}^{L}(t)\coloneqq\text{proj}_{[0,1]}\left(\hat{C}_{ji}(t)-\sqrt{\frac{2\log T}{n_{i}(t)}}\right),
C_{ji}^{U}(t)\coloneqq\text{proj}_{[0,1]}\left(\hat{C}_{ji}(t)+\sqrt{\frac{2\log T}{n_{i}(t)}}\right),

where \text{proj}_{[0,1]}(\cdot) projects a real number onto the interval [0,1]. The following lemma is standard in the bandits literature, and it characterizes the relation between the true values and the LCB/UCB estimators. It states that all the true values fall into the intervals defined by the corresponding estimators with high probability.
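These estimators translate directly into code. A minimal sketch (the function name and sample sizes are our own; the radius \sqrt{2\log T/n_{i}(t)} and the projection onto [0,1] follow the definitions above):

```python
import numpy as np

def confidence_bounds(samples, T):
    """LCB/UCB for a mean in [0,1] from i.i.d. samples, using the radius
    sqrt(2 log T / n) defined in the text (a sketch; assumes n > 0)."""
    n = len(samples)
    mean = np.mean(samples)
    radius = np.sqrt(2.0 * np.log(T) / n)
    lcb = np.clip(mean - radius, 0.0, 1.0)   # proj_[0,1]
    ucb = np.clip(mean + radius, 0.0, 1.0)
    return lcb, ucb
```

The same function serves for both the reward estimate \hat{\mu}_{i}(t) and each consumption estimate \hat{C}_{ji}(t), since all quantities live in [0,1].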

Lemma 2 (Concentration).

The following event holds with probability no less than 1-\frac{4md}{T^{2}}:

\mu_{i}\in\left(\mu_{i}^{L}(t),\mu_{i}^{U}(t)\right),\qquad
c_{ji}\in\left(C_{ji}^{L}(t),C_{ji}^{U}(t)\right),

for all i\in[m], j\in[d], t\in[T].

With the UCB/LCB estimators for the parameters, we can construct UCB/LCB estimators for the objective of the primal LP. Specifically,

\text{OPT}_{\text{LP}}^{U}\coloneqq\max_{\bm{x}}\ \ \left(\bm{\mu}^{U}\right)^{\top}\bm{x}, \qquad (7)
\text{s.t.}\ \ \bm{C}^{L}\bm{x}\leq\bm{B},
\ \ \bm{x}\geq\bm{0}.

\text{OPT}_{\text{LP}}^{L}\coloneqq\max_{\bm{x}}\ \ \left(\bm{\mu}^{L}\right)^{\top}\bm{x}, \qquad (8)
\text{s.t.}\ \ \bm{C}^{U}\bm{x}\leq\bm{B},
\ \ \bm{x}\geq\bm{0}.

The following lemma states the relation between \text{OPT}_{\text{LP}}^{U}, \text{OPT}_{\text{LP}}^{L}, and \text{OPT}_{\text{LP}}. Intuitively, if we substitute the original constraint matrix \bm{C} with its LCB (or UCB) and the objective coefficient \bm{\mu} with its UCB (or LCB), the resultant optimal objective value will be a UCB (or LCB) for \text{OPT}_{\text{LP}}.

Lemma 3.

The following inequality holds with probability no less than 1-\frac{4md}{T^{2}}:

\text{OPT}_{\text{LP}}^{L}\leq\text{OPT}_{\text{LP}}\leq\text{OPT}_{\text{LP}}^{U}.

A similar approach is used in (Agrawal and Devanur, 2014) to develop a UCB-based algorithm for the BwK problem. For our algorithm presented in the following section, we construct estimates not only for the primal LP (1), but also for the LPs (4) and (5). By comparing the estimates of \text{OPT}_{\text{LP}}, \text{OPT}_{i}, and \text{OPT}_{j}, we will be able to identify the optimal arms \mathcal{I}^{*} and the non-binding constraints \mathcal{J}^{\prime}.
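The sandwich property of Lemma 3 is easy to verify numerically. The sketch below (the helper name `opt_lp_bounds` and the perturbation radius are our own assumptions) solves (7) and (8) for given entrywise confidence bounds:

```python
import numpy as np
from scipy.optimize import linprog

def opt_lp_bounds(mu_lcb, mu_ucb, C_lcb, C_ucb, B):
    """UCB/LCB for OPT_LP as in (7)-(8): optimistic rewards with optimistic
    (small) consumptions give an upper bound, and vice versa (a sketch)."""
    m = len(mu_ucb)
    ub = -linprog(-mu_ucb, A_ub=C_lcb, b_ub=B, bounds=[(0, None)] * m).fun
    lb = -linprog(-mu_lcb, A_ub=C_ucb, b_ub=B, bounds=[(0, None)] * m).fun
    return lb, ub
```

On the earlier toy instance (whose true value is 720), shrinking/enlarging \bm{\mu} and \bm{C} by any radius brackets the true optimum, since the LCB problem shrinks both the objective and the feasible region while the UCB problem enlarges both.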

4 Two-Phase Algorithm

In this section, we describe our two-phase algorithm for the BwK problem. The main theme is to use the underlying LP’s solution to guide the plays of the arms. The two phases in the algorithm correspond to the two parts of the regret upper bound in Proposition 1. In the following, we describe the two phases of the algorithm and their intuitions respectively.

Phase I of Algorithm 1 is an elimination algorithm that aims to identify the optimal arms \mathcal{I}^{*} and the non-binding constraints \mathcal{J}^{\prime}. In each round of the while loop in Phase I, all the arms are played once to improve the estimators for \bm{\mu} and \bm{C}. After each round of plays, an identification procedure is conducted by comparing the LCB of the original optimal value (\text{OPT}^{L}_{\text{LP}}) against the UCBs (\text{OPT}^{U}_{i} and \text{OPT}^{U}_{j}). Recall from Proposition 2 that there is a non-zero sub-optimality gap if i\in\mathcal{I}^{*} or j\in\mathcal{J}^{\prime}. So, if the algorithm observes a gap between the corresponding LCBs/UCBs, it asserts i\in\mathcal{I}^{*} or j\in\mathcal{J}^{\prime}, and the assertion is true with high probability.

The stopping rule of Phase I originates from the complementarity condition in Lemma 1: the loop continues as long as |\hat{\mathcal{I}}^{*}|+|\hat{\mathcal{J}}^{\prime}|<d. This is key to the primal-dual design of the algorithm, and it further justifies the consideration of the dual problem. Without maintaining the set \hat{\mathcal{J}}^{\prime}, we could not decide whether we have obtained the true primal set, i.e., whether \hat{\mathcal{I}}^{*}=\mathcal{I}^{*}. Specifically, since there is no precise knowledge of the number of optimal arms |\mathcal{I}^{*}|, while we keep adding arms into \hat{\mathcal{I}}^{*}, we do not know when to stop. The complementarity in Lemma 1 provides a condition relating the number of arms in \mathcal{I}^{*} and the number of constraints in \mathcal{J}^{\prime}. Accordingly, Phase I terminates when this condition is met. Moreover, we emphasize that a by-product of Phase I is a best arm identification procedure. To the best of our knowledge, this is the first result on identifying the optimal arms for the BwK problem.

Algorithm 1 Primal-dual Adaptive Algorithm for BwK
1:Input: Resource capacity 𝑩\bm{B}, TT
2:%% Phase I: Identification of \mathcal{I}^{*} and 𝒥\mathcal{J}^{\prime}
3:Initialize ^=𝒥^=\hat{\mathcal{I}}^{*}=\hat{\mathcal{J}}^{\prime}=\emptyset, t=0t=0
4:Initialize the knapsack process 𝑩(0)=𝑩\bm{B}^{(0)}=\bm{B}
5:while |^|+|𝒥^|<d|\hat{\mathcal{I}}^{*}|+|\hat{\mathcal{J}}^{\prime}|<d do
6:     Play each arm i[m]i\in[m] once
7:     Update t=t+mt=t+m and the knapsack process 𝑩(t)\bm{B}^{(t)}
8:     Update the estimates 𝝁^(t)\hat{\bm{\mu}}(t) and 𝑪^(t)\hat{\bm{C}}(t)
9:     Solve the LCB problem (8) and obtain OPT(t)LPL{}_{\text{LP}}^{L}(t)
10:     for i^i\notin\hat{\mathcal{I}}^{*} do
11:         Solve the following UCB problem for OPTi
OPTiU(t)max𝒙\displaystyle\text{OPT}_{i}^{U}(t)\coloneqq\max_{\bm{x}}\ \ (𝝁U(t))𝒙,\displaystyle\left(\bm{\mu}^{U}(t)\right)^{\top}\bm{x},
s.t. 𝑪L(t)𝒙𝑩,\displaystyle\bm{C}^{L}(t)\bm{x}\leq\bm{B},
xi=0,𝒙𝟎.\displaystyle x_{i}=0,\bm{x}\geq\bm{0}.
12:         if  OPTLPL(t)>OPTiU(t)\text{OPT}_{\text{LP}}^{L}(t)>\text{OPT}_{i}^{\text{U}}(t)  then
13:              Update ^=^{i}\hat{\mathcal{I}}^{*}=\hat{\mathcal{I}}^{*}\cup\{i\}
14:         end if
15:     end for
16:     for j𝒥^j\notin\hat{\mathcal{J}}^{\prime} do
17:         Solve the following UCB problem for OPTj
\begin{align*}
\text{OPT}^{U}_{j}(t):=\min_{\bm{y}}\ \ &\bm{B}^{\top}\bm{y}-B_{j},\\
\text{s.t. }&(\bm{C}^{L}(t))^{\top}\bm{y}\geq\bm{\mu}^{U}(t)+\bm{C}_{j,\cdot}^{U}(t),\\
&\bm{y}\geq\bm{0}.
\end{align*}
18:         if  OPTLPL(t)>OPTjU(t)\text{OPT}_{\text{LP}}^{L}(t)>\text{OPT}_{j}^{\text{U}}(t)  then
19:              Update 𝒥^=𝒥^{j}\hat{\mathcal{J}}^{\prime}=\hat{\mathcal{J}}^{\prime}\cup\{j\}
20:         end if
21:     end for
22:end while
23:Update t=t+1t=t+1
24:%% Phase II: Exhausting the binding resources
25:while tτt\leq\tau do
26:     Solve the following LP
\begin{align}
\max_{\bm{x}}\ \ &\left(\bm{\mu}^{U}(t-1)\right)^{\top}\bm{x}, \tag{9}\\
\text{s.t. }&\bm{C}^{L}(t-1)\bm{x}\leq\bm{B}^{(t-1)},\nonumber\\
&x_{i}=0,\ i\notin\hat{\mathcal{I}}^{*},\nonumber\\
&\bm{x}\geq\bm{0}.\nonumber
\end{align}
27:     Denote its optimal solution as 𝒙~\tilde{\bm{x}}
28:     Normalize $\tilde{\bm{x}}$ into a probability distribution and randomly sample an arm from it
29:     Update estimates 𝝁^(t)\hat{\bm{\mu}}(t), 𝑪^(t)\hat{\bm{C}}(t), and 𝑩(t)\bm{B}^{(t)}
30:     Update t=t+1t=t+1
31:end while

Phase II of Algorithm 1 builds upon the output of Phase I. At each time $t$, the algorithm solves the adaptive LP (9) and normalizes its optimal solution into a sampling distribution over arms. Also, in Phase II, the algorithm only plays arms $i\in\hat{\mathcal{I}}^{*}$; this is achieved by enforcing $x_{i}=0$ for $i\notin\hat{\mathcal{I}}^{*}$ in (9). The adaptive design is exemplified by the right-hand side of LP (9): instead of the static resource capacity $\bm{B}$, it uses the resource remaining at the end of the last time period, $\bm{B}^{(t-1)}$. For the intuition, suppose a binding resource $j\in\mathcal{J}^{*}$ is over-used in the first $t$ time periods; this results in a smaller value of $B^{(t-1)}_{j}$, so the adaptive mechanism becomes more reluctant to consume the $j$-th resource in the future, and vice versa for the case of under-use.

We emphasize that the adaptive design is not only intuitive but also necessary for achieving a regret that is logarithmic in $T$. If we adopt a static right-hand side $\bm{B}$ in (9) (as in (Badanidiyuru et al., 2013; Agrawal and Devanur, 2014)), then the fluctuation of the process $\bm{B}^{(t)}$ will be on the order of $\Omega(\sqrt{t})$. Consequently, as the algorithm approaches the end of the horizon, one binding resource may be exhausted while other binding resources still have $\Omega(\sqrt{T})$ left over, which translates into an $\Omega(\sqrt{T})$ term in the upper bound of Proposition 1. This intuition is made rigorous by (Arlotto and Gurvich, 2019), which establishes that without an adaptive design, the regret is at least $\Omega(\sqrt{T})$ for the multi-secretary problem (a one-constraint BwK problem) even if the underlying distribution is known.
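A minimal sketch of Phase II follows: the only essential feature is that the right-hand side of the LP is the remaining budget $\bm{B}^{(t-1)}$ rather than the static $\bm{B}$. The two-arm instance, the confidence bounds, and the Bernoulli consumption model are hypothetical choices for illustration, not part of the paper's specification.

```python
import numpy as np
from scipy.optimize import linprog

def adaptive_step(mu_U, C_L, B_rem, I_hat, rng):
    # LP (9): max (mu^U)^T x  s.t.  C^L x <= B^(t-1), x_i = 0 for i not in I_hat
    m = len(mu_U)
    bounds = [(0, None) if i in I_hat else (0, 0) for i in range(m)]
    res = linprog(-mu_U, A_ub=C_L, b_ub=B_rem, bounds=bounds)
    x = np.maximum(res.x, 0.0)
    return rng.choice(m, p=x / x.sum())      # normalize and sample an arm

rng = np.random.default_rng(1)
mu_U = np.array([0.8, 0.6]); C_true = np.array([[0.5, 0.25]])
C_L = 0.95 * C_true                          # slightly optimistic lower bounds
B_rem = np.array([40.0]); I_hat = {0, 1}     # both arms identified as optimal
t = 0
while t < 100 and B_rem[0] >= 1.0:
    arm = adaptive_step(mu_U, C_L, B_rem, I_hat, rng)
    B_rem[0] -= rng.binomial(1, C_true[0, arm])   # stochastic consumption
    t += 1
print(t, B_rem[0])
```

Because each step re-solves the LP against the depleted budget, over-consumption in early periods automatically tilts later sampling toward the cheaper arm.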

5 Regret Analysis

In this section, we derive a regret upper bound for Algorithm 1 by analyzing the two phases separately.

5.1 Analysis of Phase I

Proposition 3 provides an upper bound on the number of time periods within which Phase I will terminate. It also states that the identification of \mathcal{I}^{*} and 𝒥\mathcal{J}^{\prime} will be precise conditional on the high probability event in Lemma 2.

Proposition 3.

In Phase I of Algorithm 1, each arm $i\in[m]$ is played no more than $\left(2+\frac{1}{b}\right)^{2}\cdot\frac{72\log T}{\delta^{2}}$ times, where $b$ is defined in Assumption 1. If the resources are not exhausted in Phase I, then its output satisfies

(^=,𝒥^=𝒥)14mdT2.\mathbb{P}\left(\hat{\mathcal{I}}^{*}=\mathcal{I}^{*},\hat{\mathcal{J}}^{\prime}=\mathcal{J}^{\prime}\right)\geq 1-\frac{4md}{T^{2}}.

The surprising point of Proposition 3 is that there are $O(2^{m+d})$ possible configurations of $(\mathcal{I}^{*},\mathcal{J}^{\prime})$ and, without any prior knowledge, the true configuration can be identified within $O(\log T)$ plays of each arm. In contrast, (Flajolet and Jaillet, 2015) does not utilize the primal-dual structure of the problem and conducts a brute-force search over all possible configurations, which results in an $O(2^{m+d}\log T)$ regret. In addition, the search therein requires a priori knowledge of a non-degeneracy parameter. The result also explains why (Sankararaman and Slivkins, 2020) imposes a single-best-arm condition for the BwK problem, which assumes $|\mathcal{I}^{*}|=1$. This additional assumption greatly simplifies the combinatorial structure and brings the BwK problem closer to the standard MAB problem. In this light, Proposition 3 can be viewed as a best-arm-identification result (Audibert and Bubeck, 2010) for the BwK problem in full generality and without any prior knowledge.

5.2 Analysis of Phase II

Proposition 4 provides an upper bound on the resource remaining for the binding constraints when the procedure terminates, which corresponds to the second part of Proposition 1. Notably, the upper bound has no dependence on $T$. Compared with the $\Omega(\sqrt{T})$ fluctuation under the static design, it demonstrates the effectiveness of our adaptive design.

Proposition 4.

For each binding constraint j𝒥,j\in\mathcal{J}^{*}, we have

𝔼[Bj(τ)]=O(d3bmin{χ2,δ2}min{1,σ2})\mathbb{E}\left[B_{j}^{(\tau)}\right]=O\left(\frac{d^{3}}{b\min\{\chi^{2},\delta^{2}\}\min\{1,\sigma^{2}\}}\right)

where χ\chi and σ\sigma are defined in Section 3.3, and bb is defined in Assumption 1.

The idea of the proof is to introduce an auxiliary process $b_{j}^{(t)}=\frac{B_{j}^{(t)}}{T-t}$ for $t\in[T-1]$ and $j\in\mathcal{J}^{*}$. Recall that $B_{j}^{(t)}$ is the $j$-th component of the knapsack process $\bm{B}^{(t)}$, so the initial value is $b_{j}^{(0)}=b$. Then define

τj=min{t:bj(t)[bϵ,b+ϵ]}{T}\tau_{j}=\min\{t:b_{j}^{(t)}\notin[b-\epsilon,b+\epsilon]\}\cup\{T\}

for a fixed $\epsilon>0$. With this definition, $b_{j}^{(t)}$ can be interpreted as the average remaining resource (per remaining time period) and $\tau_{j}$ as the first time that $b_{j}^{(t)}$ deviates from its initial value $b$ by more than $\epsilon$. It is easy to see that $\tau_{j}\leq\tau$ with a proper choice of $\epsilon$. Next, we aim to upper bound $\mathbb{E}[T-\tau_{j}]$ by analyzing the process $\{b_{j}^{(t)}\}_{t=0}^{T}$. From the dynamics of the knapsack process $\bm{B}^{(t)}$, we know that

bj(t)\displaystyle b_{j}^{(t)} =Bj(t)Tt=Bj(t1)Cj,tTt\displaystyle=\frac{B_{j}^{(t)}}{T-t}=\frac{B_{j}^{(t-1)}-C_{j,t}}{T-t}
=bj(t1)1Tt(Cj,tbj(t1))\displaystyle=b_{j}^{(t-1)}-\frac{1}{T-t}\left(C_{j,t}-b_{j}^{(t-1)}\right) (10)

where $C_{j,t}$, as defined earlier, is the consumption of the $j$-th resource at time $t$. The formula (10) provides a technical explanation for the adaptive design in (9). Intuitively, when the right-hand side of (9) is $\bm{B}^{(t-1)}$, the resulting solution (approximately and in expectation) consumes $b_{j}^{(t-1)}$ units of the $j$-th resource in each of the following time periods. Ideally, this makes the second term in (10) have zero expectation. However, due to the estimation error in the LP's parameters, this may not be the case. The idea is to first provide an upper bound for the “bias” term

|𝔼[Cj,tbj(t1)|t1]|\left|\mathbb{E}\left[C_{j,t}-b_{j}^{(t-1)}|\mathcal{H}_{t-1}\right]\right|

where $\mathcal{H}_{t-1}=\{(r_{s},\bm{C}_{s},i_{s})\}_{s=1}^{t-1}$ encapsulates all the information up to time $t-1$. Unsurprisingly, the upper bound is on the order of $O\left(\frac{1}{\sqrt{t}}\right)$. Next, with this bias bound, we construct a super-martingale (sub-martingale) based on the dynamic (10) and employ the Azuma–Hoeffding inequality to obtain a concentration result for the value of the martingale. Through the analysis of the process $b_{j}^{(t)}$, we derive an upper bound for $\mathbb{E}[T-\tau_{j}]$, which in turn leads to an upper bound on $\mathbb{E}[B^{(\tau)}_{j}]$.
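The stabilizing effect of recursion (10) can be checked numerically. The sketch below is only an illustration of the ideal zero-bias case, not the paper's analysis: it draws $C_{j,t}$ with conditional mean exactly $b_{j}^{(t-1)}$ and tracks the resulting martingale-like path, which stays close to its initial value $b$ until near the end of the horizon; the values of $T$, $b$, and the stopping margin are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, b = 10_000, 0.5
b_t = b                        # auxiliary process b_j^(t) = B_j^(t) / (T - t)
path = [b_t]
for t in range(1, T - 100):    # stop shortly before T, as the algorithm does at tau
    C_t = rng.binomial(1, min(max(b_t, 0.0), 1.0))   # consumption with mean b_j^(t-1)
    b_t = b_t - (C_t - b_t) / (T - t)                # recursion (10)
    path.append(b_t)
print(max(abs(v - b) for v in path))
```

With any persistent bias in the consumption, the same recursion drifts away from $b$, which is exactly what the $O(1/\sqrt{t})$ bias bound rules out.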

The importance of the adaptive design has been widely recognized in other constrained online learning problems, such as online matching (Manshadi et al., 2012), online assortment (Golrezaei et al., 2014), online linear programming (Li and Ye, 2019), and network revenue management (Jasin and Kumar, 2012). The common pattern of these problems is to allocate limited resources in a sequential manner, and the idea of adaptive design is to adjust the allocation rule dynamically according to the remaining resources/inventory. This parallels LP (9), where the solution at each time $t$ is contingent on the remaining resources $\bm{B}^{(t)}$. The significance of our algorithm design and analysis is twofold: (i) for the BwK literature, our paper is the first application of the idea of adaptive design; (ii) for the existing literature on adaptive design in constrained online learning, our work provides its first application and analysis in a partial-information environment. For the second aspect, all existing analyses of adaptive design fall into the paradigm of “first-observe-then-decide”, while the BwK problem is “first-decide-then-observe”. Specifically, in matching/resource allocation/revenue management problems, at each time period a new agent arrives and, upon the observation, we decide the matching for the agent; or a new customer arrives and, upon observing her preference, we decide the assortment for the customer. So the existing analyses are analogous to a BwK “setting” where the reward and resource consumption of playing an arm are first observed (magically), and then we decide whether to play the arm or not.

5.3 Regret Upper Bound for Algorithm 1

Combining the two parts, we have the following result on the regret of Algorithm 1.

Proposition 5.

The regret of Algorithm 1 has the following upper bound,

O((2+1b)2mdlogTbδ2+d4b2min{χ2,δ2}min{1,σ2})O\left(\left(2+\frac{1}{b}\right)^{2}\frac{md\log T}{b\delta^{2}}+\frac{d^{4}}{b^{2}\min\{\chi^{2},\delta^{2}\}\min\{1,\sigma^{2}\}}\right)

where bb is defined in Assumption 1, δ\delta is defined in Section 3.2, and σ\sigma and χ\chi are defined in Section 3.3.

The result reduces the exponential dependence on $m$ and $d$ in (Flajolet and Jaillet, 2015) to a polynomial one, and it does not rely on any prior knowledge. Specifically, the authors consider several settings for the BwK problem, many of which assume special structures such as one or two resource constraints. The most general setting therein, which is comparable to ours, allows an arbitrary number of constraints and an arbitrary number of optimal arms. In terms of the key parameters, the way we define $\delta$ is the same as their definition of $\Delta_{x}$. However, their regret bound (Theorem 8 therein) involves a summation of exponentially many $\frac{1}{\Delta_{x}}$'s (as many as the total number of bases of the LP). Our parameter $\sigma$ is related to their $\epsilon$ (Assumption 8 therein), but the latter is more restrictive: $\sigma$ in our paper is the minimal singular value of the matrix corresponding to only the optimal basis of the primal LP, whereas their $\epsilon$ is a lower bound on the determinants of the matrices corresponding to all the (exponentially many) possible bases of the primal LP. In this light, if they adopted our parameter $\sigma$, their bound would improve by a factor of $d!$ ($C!$ therein). Moreover, $\epsilon$ is a lower bound for our parameter $\chi$, and (Flajolet and Jaillet, 2015) explicitly requires the knowledge of $\epsilon$ a priori.

The proposition also generalizes the one-constraint bound in (Sankararaman and Slivkins, 2020) and relaxes the deterministic resource consumption assumption therein. Specifically, the authors assume a single optimal arm and a single resource (other than time), i.e., the optimal solution to the primal LP has only one non-zero entry ($|\mathcal{I}^{*}|=|\mathcal{J}^{*}|=1$ and $d=2$). They also assume the underlying LP is non-degenerate. Our results generalize theirs by allowing arbitrary $d$, $|\mathcal{I}^{*}|$, and $|\mathcal{J}^{*}|$. In terms of the key parameters, under their assumptions our parameter $\sigma=1$ because it is defined by a 1-by-1 matrix, and our parameter $\chi$ is treated as a constant in their paper. For $\delta$, under their assumptions, our definition of $\text{OPT}_{i}$ can be adjusted accordingly: we can replace the constraint $x_{i}=0$ in the definition of $\text{OPT}_{i}$ with $x_{i^{\prime}}=0,\ i^{\prime}\neq i$. Then our definition of $\delta$ reduces to their $G_{LAG}(a)$, both of which characterize the sub-optimality gap of an arm.

The above comparison with the existing literature suggests that the parameters $\sigma$, $\delta$, and $\chi$, or other LP-based parameters, might be inevitable in characterizing a logarithmic regret bound. Our paper makes some preliminary efforts along this line, but we do not believe our bound is tight: parameters such as $\chi$ and $\sigma$ may be replaced with tighter ones through a better algorithm and sharper analysis. As remarked in Section 3.3, these parameters do not depend on $T$ under Assumption 1, but characterizing their dependence (for example, of $\delta$ and $\chi$) on the problem size $m$ and $d$ remains an open question. Moreover, our algorithm and analysis rely heavily on the structural properties of the underlying LP, which might not be the only way to handle the BwK problem.

6 Conclusion

In this paper, we introduce a new BwK algorithm and derive a problem-dependent regret bound for it. Phase I of the algorithm involves a round-robin design and may play sub-optimal arms an inefficiently large number of times. Our regret bound can be large when the parameters $\sigma$, $\delta$, and $\chi$ are small, and the inefficient plays of sub-optimal arms may prevent the algorithm from achieving an $O(\sqrt{T})$ worst-case regret. In the extreme case, these parameters may scale as $O(\frac{1}{T})$, though Assumption 1 precludes such a possibility. The question, then, is whether Assumption 1 is necessary for admitting a logarithmic problem-dependent bound.

We conclude our discussion with a new one-phase algorithm, Algorithm 2. The algorithm is also LP-based: at each time $t$, it solves a UCB version of the primal LP to sample the arm, and it uses an adaptive design to exhaust the resources. On one hand, the algorithm naturally incorporates Phase I of Algorithm 1 into its Phase II: it directly enters the Phase II of Algorithm 1 and lets the adaptive LP fully determine the arm(s) to play (without the extra constraint in (9)). On the other hand, the algorithm can be viewed as an adaptive version of the algorithm in (Agrawal and Devanur, 2014). Our conjecture is that Algorithm 2 is the optimal algorithm for BwK, in the sense that it achieves the optimal problem-dependent bound while also admitting an $O(\sqrt{T})$ problem-independent bound. Unfortunately, its analysis is more challenging than that of Algorithm 1, and we leave it as an open question.

Algorithm 2 One-Phase Adaptive Algorithm for BwK
1:Input: Resource capacity 𝑩\bm{B}, TT
2:Initialize the knapsack process 𝑩(0)=𝑩\bm{B}^{(0)}=\bm{B}
3:Initialize the estimates 𝝁^(0)\hat{\bm{\mu}}(0) and 𝑪^(0)\hat{\bm{C}}(0)
4:Set t=1t=1
5:while tτt\leq\tau do
6:     Solve the following LP
max𝒙\displaystyle\max_{\bm{x}}\ \ (𝝁U(t1))𝒙,\displaystyle\left(\bm{\mu}^{U}(t-1)\right)^{\top}\bm{x}, (11)
s.t. 𝑪L(t1)𝒙𝑩(t1),\displaystyle\bm{C}^{L}(t-1)\bm{x}\leq\bm{B}^{(t-1)},
𝒙𝟎.\displaystyle\bm{x}\geq\bm{0}.
7:     Denote its optimal solution as 𝒙~\tilde{\bm{x}}
8:     Normalize $\tilde{\bm{x}}$ into a probability distribution and randomly sample an arm from it
9:     Update estimates 𝝁^(t)\hat{\bm{\mu}}(t), 𝑪^(t)\hat{\bm{C}}(t), and 𝑩(t)\bm{B}^{(t)}
10:     Update t=t+1t=t+1
11:end while
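As a proof of concept, Algorithm 2 can be rendered as a short simulation. The instance below is hypothetical (two Bernoulli arms and one resource), the confidence radius is the $\sqrt{2\log T/n_{i}}$ from Lemma 2, and we add the number of remaining rounds as an extra budget row so that the UCB LP stays bounded in the early rounds; none of these specific numbers come from the paper.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
mu = np.array([0.9, 0.5]); C = np.array([[0.6, 0.3]])   # true parameters, unknown to the learner
T = 2000; B = np.array([0.4 * T])                       # budget with b = 0.4
n = np.zeros(2); r_sum = np.zeros(2); c_sum = np.zeros((1, 2))
B_rem = B.copy(); total_reward = 0.0

for t in range(T):
    if B_rem[0] < 1.0:          # stopping time tau: the resource is (nearly) exhausted
        break
    rad = np.sqrt(2 * np.log(T) / np.maximum(n, 1))
    mu_U = np.clip(r_sum / np.maximum(n, 1) + rad, 0, 1)   # UCB on rewards
    C_L = np.clip(c_sum / np.maximum(n, 1) - rad, 0, 1)    # LCB on consumption
    # adaptive LP (11), with the remaining time T - t as an extra budget row
    A = np.vstack([C_L, np.ones((1, 2))])
    rhs = np.concatenate([B_rem, [T - t]])
    res = linprog(-mu_U, A_ub=A, b_ub=rhs, bounds=[(0, None)] * 2)
    x = np.maximum(res.x, 0.0)
    arm = rng.choice(2, p=x / x.sum()) if x.sum() > 0 else rng.integers(2)
    reward = rng.binomial(1, mu[arm]); cost = rng.binomial(1, C[0, arm])
    n[arm] += 1; r_sum[arm] += reward; c_sum[0, arm] += cost
    B_rem[0] -= cost; total_reward += reward

print(total_reward, B_rem[0])
```

The optimistic estimates shrink as plays accumulate, so the LP gradually shifts from exploration to the budget-feasible optimal mixture.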

Acknowledgements

We would like to thank anonymous reviewers for their helpful suggestions and insightful comments.

References

  • Agrawal and Devanur (2014) Agrawal, Shipra, Nikhil R Devanur. 2014. Bandits with concave rewards and convex knapsacks. Proceedings of the fifteenth ACM conference on Economics and computation. 989–1006.
  • Agrawal and Devanur (2015) Agrawal, Shipra, Nikhil R Devanur. 2015. Linear contextual bandits with knapsacks. arXiv preprint arXiv:1507.06738 .
  • Agrawal et al. (2016) Agrawal, Shipra, Nikhil R Devanur, Lihong Li. 2016. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. Conference on Learning Theory. PMLR, 4–18.
  • Agrawal and Goyal (2012) Agrawal, Shipra, Navin Goyal. 2012. Analysis of thompson sampling for the multi-armed bandit problem. Conference on learning theory. JMLR Workshop and Conference Proceedings, 39–1.
  • Agrawal et al. (2014) Agrawal, Shipra, Zizhuo Wang, Yinyu Ye. 2014. A dynamic near-optimal algorithm for online linear programming. Operations Research 62(4) 876–890.
  • Arlotto and Gurvich (2019) Arlotto, Alessandro, Itai Gurvich. 2019. Uniformly bounded regret in the multisecretary problem. Stochastic Systems .
  • Audibert and Bubeck (2010) Audibert, Jean-Yves, Sébastien Bubeck. 2010. Best arm identification in multi-armed bandits. COLT-23th Conference on Learning Theory-2010. 13–p.
  • Auer et al. (2002) Auer, Peter, Nicolo Cesa-Bianchi, Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2) 235–256.
  • Badanidiyuru et al. (2013) Badanidiyuru, Ashwinkumar, Robert Kleinberg, Aleksandrs Slivkins. 2013. Bandits with knapsacks. 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 207–216.
  • Besbes and Zeevi (2012) Besbes, Omar, Assaf Zeevi. 2012. Blind network revenue management. Operations research 60(6) 1537–1550.
  • Ferreira et al. (2018) Ferreira, Kris Johnson, David Simchi-Levi, He Wang. 2018. Online network revenue management using thompson sampling. Operations research 66(6) 1586–1602.
  • Flajolet and Jaillet (2015) Flajolet, Arthur, Patrick Jaillet. 2015. Logarithmic regret bounds for bandits with knapsacks. arXiv preprint arXiv:1510.01800 .
  • Gittins et al. (2011) Gittins, John, Kevin Glazebrook, Richard Weber. 2011. Multi-armed bandit allocation indices. John Wiley & Sons.
  • Golrezaei et al. (2014) Golrezaei, Negin, Hamid Nazerzadeh, Paat Rusmevichientong. 2014. Real-time optimization of personalized assortments. Management Science 60(6) 1532–1551.
  • Immorlica et al. (2019) Immorlica, Nicole, Karthik Abinav Sankararaman, Robert Schapire, Aleksandrs Slivkins. 2019. Adversarial bandits with knapsacks. 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 202–219.
  • Jasin and Kumar (2012) Jasin, Stefanus, Sunil Kumar. 2012. A re-solving heuristic with bounded revenue loss for network revenue management with customer choice. Mathematics of Operations Research 37(2) 313–345.
  • Kesselheim and Singla (2020) Kesselheim, Thomas, Sahil Singla. 2020. Online learning with vector costs and bandits with knapsacks. Conference on Learning Theory. PMLR, 2286–2305.
  • Lai and Robbins (1985) Lai, Tze Leung, Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1) 4–22.
  • Li and Ye (2019) Li, Xiaocheng, Yinyu Ye. 2019. Online linear programming: Dual convergence, new algorithms, and regret bounds. arXiv preprint arXiv:1909.05499 .
  • Manshadi et al. (2012) Manshadi, Vahideh H, Shayan Oveis Gharan, Amin Saberi. 2012. Online stochastic matching: Online actions based on offline statistics. Mathematics of Operations Research 37(4) 559–573.
  • Megiddo and Chandrasekaran (1989) Megiddo, Nimrod, Ramaswamy Chandrasekaran. 1989. On the ε\varepsilon-perturbation method for avoiding degeneracy. Operations Research Letters 8(6) 305–308.
  • Mehta et al. (2005) Mehta, Aranyak, Amin Saberi, Umesh Vazirani, Vijay Vazirani. 2005. Adwords and generalized on-line matching. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05). IEEE, 264–273.
  • Robbins (1952) Robbins, Herbert. 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58(5) 527–535.
  • Sankararaman and Slivkins (2020) Sankararaman, Karthik Abinav, Aleksandrs Slivkins. 2020. Advances in bandits with knapsacks. arXiv preprint arXiv:2002.00253 .
  • Weber et al. (1992) Weber, Richard, et al. 1992. On the gittins index for multiarmed bandits. The Annals of Applied Probability 2(4) 1024–1033.

Appendix A Proofs for Sections 2 and 3

A1 Proof of Lemma 1

Proof.

Notice that $\bm{x}=\bm{0}$ is a feasible solution to the primal LP (1) and $\bm{y}=(b,...,b)^{\top}$ is a feasible solution to the dual LP (2). Hence both the primal and the dual LPs are feasible. So, by the strong duality of linear programming, both LPs have optimal solutions and share the same optimal objective value.

Then, based on Assumption 2, which gives $|\mathcal{I}^{*}|=|\mathcal{J}^{*}|$, we have

||+|𝒥|\displaystyle|\mathcal{I}^{*}|+|\mathcal{J}^{\prime}| =|𝒥|+|𝒥|=d,\displaystyle=|\mathcal{J}^{*}|+|\mathcal{J}^{\prime}|=d,
||+||\displaystyle|\mathcal{I}^{*}|+|\mathcal{I}^{\prime}| =|𝒥|+||=m.\displaystyle=|\mathcal{J}^{*}|+|\mathcal{I}^{\prime}|=m.

A2 Proof of Proposition 1

Proof.

From OPTLPOPT\text{OPT}_{\text{LP}}\geq\text{OPT}, we know

RegretTπ(𝒫,𝑩)\displaystyle\text{Regret}_{T}^{\pi}(\mathcal{P},\bm{B}) OPT𝔼[t=1τrt]\displaystyle\coloneqq\text{OPT}-\mathbb{E}\left[\sum_{t=1}^{\tau}r_{t}\right] (12)
OPTLP𝔼[t=1τrt]\displaystyle\leq\text{OPT}_{\text{LP}}-\mathbb{E}\left[\sum_{t=1}^{\tau}r_{t}\right]

The idea is to represent 𝔼[t=1τrt]\mathbb{E}\left[\sum_{t=1}^{\tau}r_{t}\right] with reduced cost and number of plays for the arms. We have

𝔼[t=1τrt]\displaystyle\mathbb{E}\left[\sum\limits_{t=1}^{\tau}r_{t}\right] =𝔼[t=1τrit]\displaystyle=\mathbb{E}\left[\sum\limits_{t=1}^{\tau}r_{i_{t}}\right]
=i[m]μi𝔼[ni(τ)]\displaystyle=\sum\limits_{i\in[m]}\mu_{i}\mathbb{E}\left[n_{i}(\tau)\right]
=i[m](𝒄i𝒚Δi)𝔼[ni(τ)]\displaystyle=\sum\limits_{i\in[m]}(\bm{c}_{i}^{\top}\bm{y}^{*}-\Delta_{i})\mathbb{E}\left[n_{i}(\tau)\right]
=𝔼[t=1τ𝒄t]𝒚iΔi𝔼[ni(τ)],\displaystyle=\mathbb{E}\left[\sum\limits_{t=1}^{\tau}\bm{c}_{t}\right]^{\top}\bm{y}^{*}-\sum\limits_{i\in\mathcal{I}^{\prime}}\Delta_{i}\mathbb{E}\left[n_{i}(\tau)\right],

where $i_{t}$ denotes the arm played at time $t$. In the above, the second line follows from the independence between the stopping time and the reward generated from a fixed arm. The third line follows from the definition of the reduced cost $\Delta_{i}=\bm{c}_{i}^{\top}\bm{y}^{*}-\mu_{i}$ for all $i\in[m]$, and the last line applies the same stopping-time argument to the consumption and uses $\Delta_{i}=0$ for $i\in\mathcal{I}^{*}$. Thus, plugging these equations into inequality (12) and rearranging terms, we have

RegretTπ(𝒫,𝑩)\displaystyle\text{Regret}_{T}^{\pi}(\mathcal{P},\bm{B}) OPTLP𝔼[t=1τrt]\displaystyle\leq\text{OPT}_{\text{LP}}-\mathbb{E}\left[\sum_{t=1}^{\tau}r_{t}\right]
=\displaystyle= 𝔼[𝑩t=1τ𝒄t]𝒚+iΔi𝔼[ni(τ)]\displaystyle\mathbb{E}\left[\bm{B}-\sum\limits_{t=1}^{\tau}\bm{c}_{t}\right]^{\top}\bm{y}^{*}+\sum\limits_{i\in\mathcal{I}^{\prime}}\Delta_{i}\mathbb{E}\left[n_{i}(\tau)\right]
=\displaystyle= 𝔼[𝑩(τ)]𝒚+iΔi𝔼[ni(τ)].\displaystyle\mathbb{E}\left[\bm{B}^{(\tau)}\right]^{\top}\bm{y}^{*}+\sum\limits_{i\in\mathcal{I}^{\prime}}\Delta_{i}\mathbb{E}\left[n_{i}(\tau)\right].

Thus we complete the proof. ∎

A3 Proof of Proposition 2

Proof.

It is sufficient to show that

OPTi<OPTLPi,\displaystyle\text{OPT}_{i}<\text{OPT}_{\text{LP}}\Leftrightarrow i\in\mathcal{I}^{*}, (13)
$\text{OPT}_{j}<\text{OPT}_{\text{LP}}\Leftrightarrow j\in\mathcal{J}^{\prime}.$ (14)

Here we prove (13); the proof of (14) follows the same analysis. By definition, a feasible solution to LP (4) is also feasible to LP (1). So,

OPTiOPTLP.\text{OPT}_{i}\leq\text{OPT}_{\text{LP}}.

Suppose $\text{OPT}_{\text{LP}}=\text{OPT}_{i}$ for some $i\in\mathcal{I}^{*}$. Then the optimal solution of LP (4) is also an optimal solution to (1); denote it by $\tilde{\bm{x}}^{*}$. Then both $\tilde{\bm{x}}^{*}$ and $\bm{x}^{*}$ are optimal solutions of LP (1). From the uniqueness in Assumption 2, we know $\tilde{\bm{x}}^{*}=\bm{x}^{*}$, and hence $\tilde{x}^{*}_{i}=x^{*}_{i}\neq 0$ for $i\in\mathcal{I}^{*}$. However, this contradicts the constraint $x_{i}=0$ in (4). ∎

A4 Proof of Lemma 2

Proof.

The proof follows a standard analysis of the tail bounds for the estimators. Specifically, we first consider the complement of these events and then apply a union bound to complete the proof. The analysis first concerns the case when there is no projection and then extends the results to the case with projection.

For each i[m],i\in[m], we have

(t[T]{1ni(t)|s=1ni(t)ri,sni(t)μi|2logTni(t)})\displaystyle\mathbb{P}\left(\bigcup_{t\in[T]}\left\{\frac{1}{n_{i}(t)}\left|\sum\limits_{s=1}^{n_{i}(t)}r_{i,s}-n_{i}(t)\mu_{i}\right|\geq\sqrt{\frac{2\log T}{n_{i}(t)}}\right\}\right)
\displaystyle\leq t=1T(1ni(t)|s=1ni(t)ri,sni(t)μi|2logTni(t))\displaystyle\sum\limits_{t=1}^{T}\mathbb{P}\left(\frac{1}{n_{i}(t)}\left|\sum\limits_{s=1}^{n_{i}(t)}r_{i,s}-n_{i}(t)\mu_{i}\right|\geq\sqrt{\frac{2\log T}{n_{i}(t)}}\right)
\displaystyle\leq t=1Tk=1t(1k|s=1kri,skμi|2logTk)(ni(t)=k)\displaystyle\sum\limits_{t=1}^{T}\sum\limits_{k=1}^{t}\mathbb{P}\left(\frac{1}{k}\left|\sum\limits_{s=1}^{k}r_{i,s}-k\mu_{i}\right|\geq\sqrt{\frac{2\log T}{k}}\right)\mathbb{P}\left(n_{i}(t)=k\right)
\displaystyle\leq t=1Tk=1t(1k|s=1kri,skμi|2logTk)\displaystyle\sum\limits_{t=1}^{T}\sum\limits_{k=1}^{t}\mathbb{P}\left(\frac{1}{k}\left|\sum\limits_{s=1}^{k}r_{i,s}-k\mu_{i}\right|\geq\sqrt{\frac{2\log T}{k}}\right)
\displaystyle\leq t=1Tk=1t2T42T2.\displaystyle\sum\limits_{t=1}^{T}\sum\limits_{k=1}^{t}\frac{2}{T^{4}}\leq\frac{2}{T^{2}}. (15)

where $r_{i,s}$ denotes the reward of the $s$-th play of the $i$-th arm. The first inequality is obtained by the union bound, the second follows from the law of total probability, the third is obtained from $\mathbb{P}\left[n_{i}(t)=k\right]\leq 1$, and the last is obtained by an application of Hoeffding's inequality with $r_{i,s}\in[0,1]$ and $\mathbb{E}[r_{i,s}]=\mu_{i}$ for $i\in[m]$ and $s\in[T]$.
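The Hoeffding step can be sanity-checked numerically. The sketch below estimates how often the empirical mean of $k$ Bernoulli rewards escapes the radius $\sqrt{2\log T/k}$, which Hoeffding's inequality bounds by $2/T^{4}$; the values of $T$, $k$, $\mu_{i}$, and the number of trials are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, k, mu_i, trials = 1000, 50, 0.3, 5000
radius = np.sqrt(2 * np.log(T) / k)
# draw `trials` independent length-k reward samples and measure the escape frequency
means = rng.binomial(1, mu_i, size=(trials, k)).mean(axis=1)
escape_rate = np.mean(np.abs(means - mu_i) >= radius)
print(radius, escape_rate)
```

The observed escape frequency is far below the $2/T^{4}$ level that drives the $2/T^{2}$ bound in (15) after the union bound over $t$ and $k$.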

Then, recall that

μiL(t)\displaystyle\mu_{i}^{L}(t) proj[0,1](μ^i(t)2logTni(t)),\displaystyle\coloneqq proj_{[0,1]}\left(\hat{\mu}_{i}(t)-\sqrt{\frac{2\log T}{n_{i}(t)}}\right),
μiU(t)\displaystyle\mu_{i}^{U}(t) proj[0,1](μ^i(t)+2logTni(t)),\displaystyle\coloneqq proj_{[0,1]}\left(\hat{\mu}_{i}(t)+\sqrt{\frac{2\log T}{n_{i}(t)}}\right),

and $\mu_{i}\in[0,1]$. Following these definitions, we have

(t[T]{μi(μiL(t),μiU(t))})\displaystyle\mathbb{P}\left(\bigcup_{t\in[T]}\left\{\mu_{i}\notin\left(\mu_{i}^{L}(t),\mu_{i}^{U}(t)\right)\right\}\right) =(t[T]{1ni(t)|s=1ni(t)ri,sni(t)μi|2logTni(t)}).\displaystyle=\mathbb{P}\left(\bigcup_{t\in[T]}\left\{\frac{1}{n_{i}(t)}\left|\sum\limits_{s=1}^{n_{i}(t)}r_{i,s}-n_{i}(t)\mu_{i}\right|\geq\sqrt{\frac{2\log T}{n_{i}(t)}}\right\}\right).

Then we can apply (15) for the last line in above.

(t[T]{μi(μiL(t),μiU(t))})2T2.\displaystyle\ \ \ \ \mathbb{P}\left(\bigcup_{t\in[T]}\left\{\mu_{i}\notin\left(\mu_{i}^{L}(t),\mu_{i}^{U}(t)\right)\right\}\right)\leq\frac{2}{T^{2}}.

Note that the proof relies only on the property that the reward is bounded in $[0,1]$.

Next, following the exact same analysis, we have

(t[T]{cji(CjiL(t),CjiU(t))})2T2\displaystyle\ \ \ \ \mathbb{P}\left(\bigcup_{t\in[T]}\left\{c_{ji}\notin\left(C_{ji}^{L}(t),C_{ji}^{U}(t)\right)\right\}\right)\leq\frac{2}{T^{2}}

also holds for each $j\in[d],i\in[m]$. Then, we complete the proof by taking a union bound over the arm indices $i\in[m]$ and the constraint indices $j\in[d]$. ∎

A5 Proof of Lemma 3

Proof.

It is sufficient to show that

OPTLPLOPTLPOPTLPU\text{OPT}_{\text{LP}}^{L}\leq\text{OPT}_{\text{LP}}\leq\text{OPT}_{\text{LP}}^{U}

holds under the event in Lemma 2. Then, applying Lemma 2, we have that the inequality above holds with probability no less than $1-\frac{4md}{T^{2}}$.

Recall that $\text{OPT}^{L}_{\text{LP}}(t)$ is the optimal objective value of the LCB problem (7), and denote by $\bm{x}^{L}(t)$ its optimal solution. Since $\bm{C}^{U}(t)\geq\bm{C}$ holds element-wise under the event in Lemma 2, we have

𝑪𝒙L(t)𝑪U(t)𝒙L(t)𝑩,\bm{C}{\bm{x}}^{L}(t)\leq{{\bm{C}}}^{U}(t){\bm{x}}^{L}(t)\leq\bm{B},

where the inequalities hold element-wise. Thus, $\bm{x}^{L}(t)$ is a feasible solution to LP (1), and

\begin{align}
\text{OPT}^{L}_{\text{LP}}(t)&=\sum_{i=1}^{m}\mu^{L}_{i}(t)x^{L}_{i}(t)\nonumber\\
&\leq\sum_{i=1}^{m}\mu_{i}x^{L}_{i}(t)\tag{16}\\
&\leq\text{OPT}_{\text{LP}},\nonumber
\end{align}

where the first line comes from the definition of $\bm{x}^{L}(t)$, the second line comes from $\bm{\mu}^{L}(t)\leq\bm{\mu}$ (under the event in Lemma 2), and the third line comes from the fact that $\bm{x}^{L}(t)$ is feasible to LP (1) and the definition of $\text{OPT}_{\text{LP}}$. Thus, $\text{OPT}^{L}_{\text{LP}}(t)$ is an underestimate of $\text{OPT}_{\text{LP}}$ for all $t\in[T]$.

With a similar analysis, we can show the part for OPTLPU(t)\text{OPT}^{U}_{\text{LP}}(t). ∎
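The sandwich $\text{OPT}^{L}_{\text{LP}}\leq\text{OPT}_{\text{LP}}\leq\text{OPT}^{U}_{\text{LP}}$ can also be verified numerically on a small instance. As in the proof, the LCB problem pairs $\bm{\mu}^{L}$ with $\bm{C}^{U}$ and the UCB problem pairs $\bm{\mu}^{U}$ with $\bm{C}^{L}$; the 2-arm, 2-resource instance and the confidence radius below are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def lp_value(mu, C, B):
    # max mu^T x  s.t.  C x <= B, x >= 0
    res = linprog(-mu, A_ub=C, b_ub=B, bounds=[(0, None)] * len(mu))
    return -res.fun

mu = np.array([0.6, 0.3]); C = np.array([[0.4, 0.2], [0.3, 0.5]])
B = np.array([30.0, 30.0])
eps = 0.05   # hypothetical confidence radius; the bounds are valid by construction
opt   = lp_value(mu, C, B)
opt_L = lp_value(np.clip(mu - eps, 0, 1), np.clip(C + eps, 0, 1), B)  # LCB: mu^L with C^U
opt_U = lp_value(np.clip(mu + eps, 0, 1), np.clip(C - eps, 0, 1), B)  # UCB: mu^U with C^L
print(opt_L, opt, opt_U)
assert opt_L <= opt <= opt_U
```

The inequalities hold for any valid bounds: every LCB-feasible solution is feasible for the true LP with a smaller objective, and symmetrically for the UCB problem.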

Appendix B Analysis of Phase I of Algorithm 1

In this section, we provide the analysis of Phase I of Algorithm 1 and prove Proposition 3. In the proof, we will use some concentration results for the LCB/UCB estimators for the LP, and these concentration results will be summarized and proved in Lemma 4 following the proof of Proposition 3.

B1 Proof of Proposition 3

Proof.

In this proof, we show the result under the high-probability event in Lemma 2.

Recall that an arm ii will be added to ^\hat{\mathcal{I}}^{*} or resource jj will be added to 𝒥^\hat{\mathcal{J}}^{\prime} when

$\text{OPT}^{U}_{i}(t)<\text{OPT}^{L}_{\text{LP}}(t)$ or $\text{OPT}^{U}_{j}(t)<\text{OPT}^{L}_{\text{LP}}(t)$.

So, our proof focuses on the relation between $\text{OPT}^{U}_{i}(t)$ and $\text{OPT}^{L}_{\text{LP}}(t)$, and the relation between $\text{OPT}^{U}_{j}(t)$ and $\text{OPT}^{L}_{\text{LP}}(t)$. It is divided into two parts:

  • First, we prove that $\hat{\mathcal{I}}^{*}$ and $\hat{\mathcal{J}}^{\prime}$ will not include a non-optimal arm $i\in\mathcal{I}^{\prime}$ or a binding resource $j\in\mathcal{J}^{*}$. Equivalently, we only need to show

    OPTiU(t)OPTLPL(t),OPTjU(t)OPTLPL(t)\text{OPT}^{U}_{i}(t)\geq\text{OPT}^{L}_{\text{LP}}(t),\text{OPT}^{U}_{j}(t)\geq\text{OPT}_{\text{LP}}^{L}(t)

    for ii\in\mathcal{I}^{\prime} and j𝒥j\in\mathcal{J}^{*} and all t[T]t\in[T].

  • Second, Phase I of Algorithm 1 will terminate within (2+1b)272mlogTδ2\left(2+\frac{1}{b}\right)^{2}\frac{72m\log T}{\delta^{2}} steps. Equivalently, we only need to show

    OPTiU(t)<OPTLPL(t),OPTjU(t)<OPTLPL(t)\text{OPT}^{U}_{i}(t)<\text{OPT}^{L}_{\text{LP}}(t),\text{OPT}^{U}_{j}(t)<\text{OPT}^{L}_{\text{LP}}(t)

    for ii\in\mathcal{I}^{*} and j𝒥j\in\mathcal{J}^{\prime} when the time

    t(2+1b)272mlogTδ2.t\geq\left(2+\frac{1}{b}\right)^{2}\frac{72m\log T}{\delta^{2}}.
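The two bullet points above can be turned into a small sketch of the Phase I detection loop. The function update_sets and all numeric values below are hypothetical illustrations, assuming the symmetric elimination rule displayed in the bullets; in the algorithm, the LP estimates would be recomputed from data at every round rather than supplied as constants.

```python
def update_sets(opt_lp_lcb, opt_arm_ucb, opt_res_ucb, I_hat, J_hat):
    """One round of the Phase-I detection rule sketched above: arm i (resp.
    resource j) is identified once its UCB value falls strictly below the
    LCB value of the full LP, which certifies a positive optimality gap."""
    for i, v in opt_arm_ucb.items():
        if v < opt_lp_lcb:
            I_hat.add(i)
    for j, v in opt_res_ucb.items():
        if v < opt_lp_lcb:
            J_hat.add(j)
    return I_hat, J_hat

# Hypothetical estimates at some round t, with d = 2 resources and m = 3 arms.
d = 2
I_hat, J_hat = set(), set()
update_sets(opt_lp_lcb=10.0,
            opt_arm_ucb={0: 9.5, 1: 11.0, 2: 12.0},  # arm 0 is detected
            opt_res_ucb={0: 9.0, 1: 10.5},           # resource 0 is detected
            I_hat=I_hat, J_hat=J_hat)
terminated = len(I_hat) + len(J_hat) >= d            # stopping rule of Phase I
assert I_hat == {0} and J_hat == {0} and terminated
```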

For the first part, recall that \text{OPT}^{U}_{i} is the optimal objective value of the LP (4). For i\in[m], by a proof similar to that of Lemma 3, we can show that

OPTiU(t)OPTi\text{OPT}^{U}_{i}(t)\geq\text{OPT}_{i}

and from Lemma 3,

OPTLPL(t)OPTLP\text{OPT}^{L}_{\text{LP}}(t)\leq\text{OPT}_{\text{LP}}

for all t[T]t\in[T]. For ii\in\mathcal{I}^{\prime}, we have OPTi=OPTLP\text{OPT}_{i}=\text{OPT}_{\text{LP}} from Proposition 2. So,

OPTiU(t)OPTLPL(t).\text{OPT}^{U}_{i}(t)\geq\text{OPT}^{L}_{\text{LP}}(t).

With a similar argument, for all j\in\mathcal{J}^{*},

\text{OPT}^{U}_{j}(t)\geq\text{OPT}_{\text{LP}}\geq\text{OPT}^{L}_{\text{LP}}(t).

Therefore, we complete the first part. The implication is that ^\hat{\mathcal{I}}^{*}\subset{\mathcal{I}}^{*} and 𝒥^𝒥\hat{\mathcal{J}}^{\prime}\subset{\mathcal{J}}^{\prime}.

Recall that the algorithm terminates when |\hat{\mathcal{I}}^{*}|+|\hat{\mathcal{J}}^{\prime}|\geq d and that, from Lemma 1, |{\mathcal{I}}^{*}|+|{\mathcal{J}}^{\prime}|=d. So the algorithm terminates if and only if \hat{\mathcal{I}}^{*}={\mathcal{I}}^{*} and {\hat{\mathcal{J}}^{\prime}}={\mathcal{J}}^{\prime}. That is, the algorithm stops only when it has found all the optimal arms and all the non-binding resources.

Now, we move to the second part and prove that ^=\hat{\mathcal{I}}^{*}={\mathcal{I}}^{*} and 𝒥^=𝒥{\hat{\mathcal{J}}^{\prime}}={\mathcal{J}}^{\prime} within

(2+1b)272mlogTδ2\left(2+\frac{1}{b}\right)^{2}\frac{72m\log T}{\delta^{2}}

steps.

For time t that is a multiple of m and satisfies

t(2+1b)272mlogTδ2,t\geq\left(2+\frac{1}{b}\right)^{2}\frac{72m\log T}{\delta^{2}},

we have

32logTni(t)b23\sqrt{\frac{2\log T}{n_{i}(t)}}\leq\frac{b}{2}

i.e., the condition (19) in the statement of Lemma 4 holds, where ni(t)=tmn_{i}(t)=\frac{t}{m} in this case. Then, by Lemma 4, we have

1T|OPTL(t)OPTLP|\displaystyle\frac{1}{T}|\text{OPT}^{L}(t)-\text{OPT}_{\text{LP}}| <δ2,\displaystyle<\frac{\delta}{2},
1T|OPTiU(t)OPTi|\displaystyle\frac{1}{T}|\text{OPT}^{U}_{i}(t)-\text{OPT}_{i}| <δ2,\displaystyle<\frac{\delta}{2},
1T|OPTjU(t)OPTj|\displaystyle\frac{1}{T}|\text{OPT}^{U}_{j}(t)-\text{OPT}_{j}| <δ2.\displaystyle<\frac{\delta}{2}.

Thus, within at most (2+1b)272mlogTδ2\left(2+\frac{1}{b}\right)^{2}\frac{72m\log T}{\delta^{2}} steps, we have

\displaystyle\text{OPT}^{L}(t)-\text{OPT}^{U}_{i}(t)>\text{OPT}_{\text{LP}}-\frac{\delta}{2}-\text{OPT}_{i}-\frac{\delta}{2}\geq 0,
\displaystyle\text{OPT}^{L}(t)-\text{OPT}^{U}_{j}(t)>\text{OPT}_{\text{LP}}-\frac{\delta}{2}-\text{OPT}_{j}-\frac{\delta}{2}\geq 0,

for all i\in\mathcal{I}^{*} and j\in\mathcal{J}^{\prime}, since \text{OPT}_{\text{LP}}-\text{OPT}_{i}\geq\delta and \text{OPT}_{\text{LP}}-\text{OPT}_{j}\geq\delta. Thus, all the optimal arms and non-binding resources will be detected within at most \left(2+\frac{1}{b}\right)^{2}\frac{72m\log T}{\delta^{2}} steps, or equivalently, with each arm played at most \left(2+\frac{1}{b}\right)^{2}\frac{72\log T}{\delta^{2}} times.
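As a sanity check on the step bound just proved, the snippet below (with hypothetical parameter values) evaluates \left(2+\frac{1}{b}\right)^{2}\frac{72m\log T}{\delta^{2}} and confirms its logarithmic dependence on T: raising the horizon from 10^{3} to 10^{6} only doubles the bound.

```python
import math

def phase1_steps(b, m, T, delta):
    """Evaluate the Phase-I step bound (2 + 1/b)^2 * 72 m log(T) / delta^2."""
    return (2.0 + 1.0 / b) ** 2 * 72.0 * m * math.log(T) / delta ** 2

# Hypothetical problem parameters.
s1 = phase1_steps(b=0.5, m=3, T=10 ** 3, delta=0.1)
s2 = phase1_steps(b=0.5, m=3, T=10 ** 6, delta=0.1)
assert abs(s2 - 2.0 * s1) < 1e-9 * s1  # log T doubles, so the bound doubles
```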

B2 Concentration of LP UCB/LCB Estimators

Lemma 4.

In Phase I of Algorithm 1, for t\in[T] that is a multiple of m and t\leq B, the following inequalities hold for all i\in[m] and j\in[d],

1T|OPTiU(t)OPTi|\displaystyle\frac{1}{T}|\text{OPT}^{U}_{i}(t)-\text{OPT}_{i}| <3(1+1b)2logTni(t),\displaystyle<3\left(1+\frac{1}{b}\right)\sqrt{\frac{2\log T}{n_{i}(t)}}, (17)
1T|OPTjU(t)OPTj|\displaystyle\frac{1}{T}|\text{OPT}^{U}_{j}(t)-\text{OPT}_{j}| <3(2+1b)2logTni(t).\displaystyle<3\left(2+\frac{1}{b}\right)\sqrt{\frac{2\log T}{n_{i}(t)}}. (18)

Moreover, if time tt also satisfies

32logTni(t)b2,\displaystyle 3\ \sqrt{\frac{2\log T}{n_{i}(t)}}\leq\frac{b}{2}, (19)

the following inequality also holds

1T|OPTL(t)OPTLP|\displaystyle\frac{1}{T}|\text{OPT}^{L}(t)-\text{OPT}_{\text{LP}}| <3(1+1b)2logTni(t).\displaystyle<3\left(1+\frac{1}{b}\right)\sqrt{\frac{2\log T}{n_{i}(t)}}. (20)

By the design of Phase I of Algorithm 1, all the n_{i}(t) in this lemma (in both the statement and the proof) equal t/m.

Proof.

We prove the inequalities in the statement one by one. We note that the condition t\leq B ensures that the algorithm does not exhaust any resource before Phase I terminates.

Proof of (17):

For (17), we only need to show that

1T(OPTiUOPTi)<3(1+1b)2logTni(t)\frac{1}{T}(\text{OPT}_{i}^{U}-\text{OPT}_{i})<3\left(1+\frac{1}{b}\right)\sqrt{\frac{2\log T}{n_{i}(t)}}

since for the other direction, we can use the inequality OPTiU(t)OPTi\text{OPT}^{U}_{i}(t)\geq\text{OPT}_{i} with the same proof of Lemma 3.

We introduce the following LP (21) to relate \text{OPT}_{i}^{U}(t) with \text{OPT}_{i}.

max𝒙\displaystyle\max_{\bm{x}}\ \ 𝝁𝒙+32logTni(t)T\displaystyle\bm{\mu}^{\top}\bm{x}+3\sqrt{\frac{2\log T}{n_{i}(t)}}T
s.t. 𝑪𝒙(1+32logTni(t)b2)𝑩\displaystyle{\bm{C}}\bm{x}\leq\left(1+3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}\right)\bm{B} (21)
xi=0,𝒙0.\displaystyle\ \ \ \ x_{i}=0,\ \bm{x}\geq 0.

Compared to (4), which defines \text{OPT}_{i}, the LP (21) scales the right-hand-side resource capacity by a constant factor and appends an additional constant term to the objective function. It is easy to see that the optimal objective value of the LP (21) is

(1+32logTni(t)b2)OPTi+32logTni(t)T.\displaystyle\left(1+3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}\right)\text{OPT}_{i}+3\sqrt{\frac{2\log T}{n_{i}(t)}}T. (22)
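The "easy to see" step relies on the positive homogeneity of the LP value in the capacity vector, plus the fact that adding a constant to the objective shifts the value by exactly that constant. A one-arm closed-form check (hypothetical numbers; one_arm_opt is our own illustration, not the paper's notation):

```python
# For the one-arm LP  max mu*x  s.t.  c*x <= B_res, x <= T_cap, x >= 0,
# the optimal value is mu * min(T_cap, B_res / c).  Scaling both capacities
# by (1 + s) scales the value by exactly (1 + s), which is the homogeneity
# used to read off the optimal value of the scaled LP above.
def one_arm_opt(mu, c, B_res, T_cap):
    return mu * min(T_cap, B_res / c)

mu, c, B_res, T_cap, s = 0.9, 0.4, 50.0, 200.0, 0.3
lhs = one_arm_opt(mu, c, (1 + s) * B_res, (1 + s) * T_cap)
rhs = (1 + s) * one_arm_opt(mu, c, B_res, T_cap)
assert abs(lhs - rhs) < 1e-9
```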

Denote the optimal solution of LP (4) as \bm{x}^{U}(t). First, we show that \bm{x}^{U}(t) is a feasible solution to (21), and then, with the feasibility, we show that (22) provides an upper bound for \text{OPT}_{i}^{U} and thus establishes the relation between \text{OPT}_{i}^{U} and \text{OPT}_{i}.

To see the feasibility, note that the jj-th entry of the resource consumption

(𝑪𝒙U(t))j\displaystyle(\bm{C}\bm{x}^{U}(t))_{j} =(𝑪L(t)𝒙U(t))j+((𝑪𝑪L(t))𝒙U(t))j\displaystyle=({\bm{C}}^{L}(t)\bm{x}^{U}(t))_{j}+\left(\left(\bm{C}-{\bm{C}}^{L}(t)\right)\bm{x}^{U}(t)\right)_{j}
B+(𝑪𝑪L(t))j𝒙U(t)1\displaystyle\leq B+\left\|\left(\bm{C}-{\bm{C}}^{L}(t)\right)_{j}\right\|_{\infty}\|\bm{x}^{U}(t)\|_{1}
B+32logTni(t)T,\displaystyle\leq B+3\sqrt{\frac{2\log T}{n_{i}(t)}}T,

where the last line comes from the concentration event in Lemma 2 and the fact that \|\bm{x}^{U}(t)\|_{1}\leq T. Since B=bT, we have B+3\sqrt{\frac{2\log T}{n_{i}(t)}}T=\left(1+3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}\right)B, which establishes the feasibility for (21).

With this feasibility, we have

OPTiU(t)=\displaystyle\text{OPT}^{U}_{i}(t)= (𝝁U(t))𝒙U(t)\displaystyle({\bm{\mu}}^{U}(t))^{\top}\bm{x}^{U}(t)
=\displaystyle= 𝝁𝒙U(t)+(𝝁U(t)𝝁)𝒙U(t)\displaystyle\bm{\mu}^{\top}\bm{x}^{U}(t)+\left({\bm{\mu}}^{U}(t)-\bm{\mu}\right)^{\top}\bm{x}^{U}(t)
\displaystyle\leq 𝝁𝒙U(t)+𝝁U(t)𝝁𝒙U(t)1\displaystyle\bm{\mu}^{\top}\bm{x}^{U}(t)+\left\|{\bm{\mu}}^{U}(t)-\bm{\mu}\right\|_{\infty}\left\|\bm{x}^{U}(t)\right\|_{1}
<\displaystyle< 𝝁𝒙U(t)+32logTni(t)T\displaystyle\bm{\mu}^{\top}\bm{x}^{U}(t)+3\sqrt{\frac{2\log T}{n_{i}(t)}}T
\displaystyle\leq \left(1+3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}\right)\text{OPT}_{i}+3\sqrt{\frac{2\log T}{n_{i}(t)}}T. (23)

Here the second-to-last line comes from a similar argument as the analysis of feasibility: it is based on the concentration event in Lemma 2 and the fact that \|\bm{x}^{U}(t)\|_{1}\leq T. The last line comes from the fact that \bm{x}^{U}(t) is a feasible solution of LP (21), whose optimal value is given by (22). By rearranging terms in (23), we complete the proof of (17).

Proof of (18):

Just like the proof of (17), we only need to show one side of the inequality, that is

1T(OPTjUOPTj)3(2+1b)2logTni(t).\frac{1}{T}(\text{OPT}_{j}^{U}-\text{OPT}_{j})\leq 3\left(2+\frac{1}{b}\right)\sqrt{\frac{2\log T}{n_{i}(t)}}.

Denote 𝒚~\tilde{\bm{y}} as the optimal solution of the LP (5), which is the LP corresponding to OPTj\text{OPT}_{j}. Define

𝒚~=𝒚~+3(2b+1)2logTni(t)(1,0,,0).\tilde{\bm{y}}^{\prime}=\tilde{\bm{y}}+3\left(2b+1\right)\sqrt{\frac{2\log T}{n_{i}(t)}}(1,0,...,0)^{\top}.

We first show that \tilde{\bm{y}}^{\prime} is feasible to the LP corresponding to \text{OPT}_{j}^{U}(t) in Algorithm 1. To see the feasibility, we have

(𝑪L(t))𝒚~=\displaystyle\left(\bm{C}^{L}(t)\right)^{\top}\tilde{\bm{y}}^{\prime}= (𝑪L(t))𝒚~+3(2b+1)2logTni(t)(𝑪L(t))(1,0,,0)\displaystyle\left(\bm{C}^{L}(t)\right)^{\top}\tilde{\bm{y}}+3\left(2b+1\right)\sqrt{\frac{2\log T}{n_{i}(t)}}\left(\bm{C}^{L}(t)\right)^{\top}(1,0,...,0)^{\top}
=\displaystyle= 𝑪𝒚~+(𝑪L(t)𝑪)𝒚~+3(2b+1)2logTni(t)b2𝟏\displaystyle\bm{C}^{\top}\tilde{\bm{y}}+(\bm{C}^{L}(t)-\bm{C})^{\top}\tilde{\bm{y}}+3\left(2b+1\right)\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}\bm{1}
\displaystyle\geq 𝝁+(𝑪j,)maxj[d]𝒄jL(t)𝒄j(𝟏𝒚~)𝟏+3(2+1b)2logTni(t)𝟏\displaystyle\bm{\mu}+(\bm{C}_{j,\cdot})^{\top}-\max\limits_{j\in[d]}\left\|\bm{c}^{L}_{j}(t)-\bm{c}_{j}\right\|_{\infty}(\bm{1}^{\top}\tilde{\bm{y}})\bm{1}+3\left(2+\frac{1}{b}\right)\sqrt{\frac{2\log T}{n_{i}(t)}}\bm{1}
\displaystyle\geq 𝝁+(𝑪j,)+62logTni(t)𝟏\displaystyle\bm{\mu}+(\bm{C}_{j,\cdot})^{\top}+6\sqrt{\frac{2\log T}{n_{i}(t)}}\bm{1}
\displaystyle\geq \bm{\mu}^{U}(t)+\left(\bm{C}_{j,\cdot}^{U}(t)\right)^{\top}.

Here the first two lines come from rearranging terms. The third line comes from the feasibility of \tilde{\bm{y}}. The fourth line comes from the concentration event in Lemma 2 and the fact that \bm{1}^{\top}\tilde{\bm{y}}\leq\frac{T}{B}. The last line comes from the definition of the UCB estimators together with the concentration event in Lemma 2.

With the feasibility of 𝒚~,\tilde{\bm{y}}^{\prime}, we have

OPTjU\displaystyle\text{OPT}^{U}_{j} 𝑩𝒚~\displaystyle\leq\bm{B}^{\top}\tilde{\bm{y}}^{\prime}
𝑩𝒚~+92logTni(t)T\displaystyle\leq\bm{B}^{\top}\tilde{\bm{y}}+9\sqrt{\frac{2\log T}{n_{i}(t)}}T
\displaystyle=\text{OPT}_{j}+9\sqrt{\frac{2\log T}{n_{i}(t)}}T
\displaystyle<\text{OPT}_{j}+3\left(2+\frac{1}{b}\right)\sqrt{\frac{2\log T}{n_{i}(t)}}T,

where the first line comes from the feasibility of \tilde{\bm{y}}^{\prime}, the second and fourth lines use the fact that b<1, and the third line comes from the definition of \text{OPT}_{j}. By rearranging terms, we complete the proof of (18).

Proof of inequality (20):

The idea of the proof is to first show that

(132logTni(t)b2)𝒙\left(1-3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}\right)\bm{x}^{*}

is a feasible solution to the LP corresponding to \text{OPT}_{\text{LP}}^{L}(t) under condition (19), and then to analyze its corresponding objective value.

First, to see its feasibility, the condition (19) ensures that

(132logTni(t)b2)𝒙𝟎.\left(1-3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}\right)\bm{x}^{*}\geq\bm{0}.

On the other hand, for each constraint j\in[d],

\displaystyle B-\left(1-3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}\right)({\bm{C}}^{U}(t)\bm{x}^{*})_{j}= B-({\bm{C}}^{U}(t)\bm{x}^{*})_{j}+3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}({\bm{C}}^{U}(t)\bm{x}^{*})_{j}
\displaystyle= B-(\bm{C}\bm{x}^{*})_{j}-\left(({\bm{C}}^{U}(t)-\bm{C})\bm{x}^{*}\right)_{j}+3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}({\bm{C}}^{U}(t)\bm{x}^{*})_{j}
\displaystyle\geq B-(\bm{C}\bm{x}^{*})_{j}-\|({\bm{C}}^{U}(t)-\bm{C})_{j,\cdot}\|_{\infty}\|\bm{x}^{*}\|_{1}+3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}(\bm{C}\bm{x}^{*})_{j}
\displaystyle\geq B-(\bm{C}\bm{x}^{*})_{j}+3\sqrt{\frac{2\log T}{n_{i}(t)}}\left(\frac{1}{b}(\bm{C}\bm{x}^{*})_{j}-T\right).

Here the first two equalities come from rearranging terms, and the last two lines follow a similar analysis as before.

By rearranging terms in the last line of above,

B(132logTni(t)b2)(𝑪U(t)𝒙)j\displaystyle B-\left(1-3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}\right)({\bm{C}}^{U}(t)\bm{x}^{*})_{j}\geq B(𝑪𝒙)j+32logTni(t)(1b(𝑪𝒙)jT)\displaystyle B-(\bm{C}\bm{x}^{*})_{j}+3\sqrt{\frac{2\log T}{n_{i}(t)}}\left(\frac{1}{b}(\bm{C}\bm{x}^{*})_{j}-T\right)
=\displaystyle= (132logTni(t)b2)(𝑪𝒙)j+B32logTni(t)T\displaystyle-\left(1-3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}\right)(\bm{C}\bm{x}^{*})_{j}+B-3\sqrt{\frac{2\log T}{n_{i}(t)}}T
\displaystyle\geq 0.\displaystyle 0.

Here the second-to-last line comes from rearranging terms, and the last line comes from the feasibility of \bm{x}^{*} together with B=bT. In this way, we obtain the feasibility part.

Given the feasibility, we can prove the inequality (20) with a similar roadmap as before

OPTLPL(t)\displaystyle\text{OPT}^{L}_{\text{LP}}(t) (132logTni(t)b2)(𝝁L(t))𝒙\displaystyle\geq\left(1-3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}\right)({{\bm{\mu}}}^{L}(t))^{\top}\bm{x}^{*}
=𝝁𝒙+(𝝁L(t)𝝁)𝒙32logTni(t)b2(𝝁L(t))𝒙\displaystyle=\bm{\mu}^{\top}\bm{x}^{*}+\left({{\bm{\mu}}}^{L}(t)-\bm{\mu}\right)^{\top}\bm{x}^{*}-3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}({{\bm{\mu}}}^{L}(t))^{\top}\bm{x}^{*}
OPTLP𝝁L(t)𝝁𝒙132logTni(t)b2(𝝁L(t))𝒙\displaystyle\geq\text{OPT}_{\text{LP}}-\left\|{{\bm{\mu}}}^{L}(t)-\bm{\mu}\right\|_{\infty}\|\bm{x}^{*}\|_{1}-3\sqrt{\frac{2\log T}{n_{i}(t)b^{2}}}({{\bm{\mu}}}^{L}(t))^{\top}\bm{x}^{*}
OPTLP3(1+1b)2logTni(t)T.\displaystyle\geq\text{OPT}_{\text{LP}}-3\left(1+\frac{1}{b}\right)\sqrt{\frac{2\log T}{n_{i}(t)}}T.

Thus we complete the proof of inequality (20). ∎

Appendix C Analysis of Phase II of Algorithm 1

In this section, we analyze Phase II of Algorithm 1 and prove Proposition 4. To simplify the presentation, we first introduce a notion of “warm start” and prove Proposition 4 under this warm start condition, together with the assumption that all the constraints are binding. Indeed, recall that the upper bound in Proposition 1 only involves the remaining resources for the binding constraints. The assumption is only intended to simplify the presentation and does not change the nature of our analysis. Later, in Section C3 and Section C4, we discuss how the warm start condition is achieved and how non-binding constraints are handled, respectively.

C1 Proof of Proposition 4 under Warm Start Assumption

We first define the following quantity

θ=min{min{1,σ2}min{χ,δ}12min{m2,d2},(2+1b)2δ5}.\theta=\min\left\{\frac{\min\{1,\sigma^{2}\}\min\{\chi,\delta\}}{12\min\{m^{2},d^{2}\}},\left(2+\frac{1}{b}\right)^{-2}\cdot\frac{\delta}{5}\right\}.

In the following, we will use \theta to define a threshold for the parameter estimation. As we will see later, the threshold mainly ensures that the estimated constraint matrix \bm{C}^{L}(t) is well-positioned (with minimum singular value at least \sigma/2).

Assumption 3 (Warm Start).

We make the following assumption on the parameter estimates

  • (a)

    𝝁U(t)𝝁,𝑪L(t)𝑪\bm{\mu}^{U}(t)\geq\bm{\mu},\bm{C}^{L}(t)\leq\bm{C} holds element-wise for all t[T]t\in[T].

  • (b)

    The following condition holds for all ii\in\mathcal{I}^{*} and t[T]t\in[T]

    𝒄iL(t)𝒄iθ.\displaystyle\|{\bm{c}}^{L}_{i}(t)-\bm{c}_{i}\|_{\infty}\leq\theta. (24)

Here Part (a) of the assumption restricts attention to the concentration event in Lemma 2. Part (b) ensures that the estimated LP with constraint matrix \bm{C}^{L}(t) is well-positioned. In the following Section C3, we will show that the assumption is met automatically in Algorithm 1, and we discuss the number of time periods it takes to satisfy this assumption.

Assumption 4 (All optimal arms and binding constraints).

We assume that all the arms are optimal and all the constraints are binding, i.e.,

\mathcal{I}^{*}=[m],\ \mathcal{J}^{*}=[d].

In fact, Assumption 4 is not as strong as it seems. First, Proposition 3 shows that Phase I of Algorithm 1 identifies the optimal arms and binding constraints with high probability. Then, the LP (9) solved in Phase II of the algorithm restricts the plays to the identified optimal arms, as if all the arms were optimal. Second, from Proposition 1, we know the generic upper bound only involves the remaining resources for the binding constraints, so this assumption allows us to focus on the binding constraints. To make the analysis go through, we only need to show that the non-binding resources are not exhausted before any binding resource is exhausted. Intuitively, this will always happen because the non-binding resources have slack. Later, in Section C4, we will return to the case of non-binding resources. Now, we prove Proposition 4 under these two additional assumptions.

Proof of Proposition 4 under Assumption 3 and 4

We show that under Assumptions 3 and 4, there exists a constant \underline{T} depending on m, d, \sigma, \chi and \delta such that, for any T\geq\underline{T}, the leftover amount of each binding resource j\in\mathcal{J}^{*} (at time \tau) satisfies

𝔼[Bj(τ)]=O(d3bmin{χ2,δ2}min{1,σ2}).\mathbb{E}\left[B_{j}^{(\tau)}\right]=O\left(\frac{d^{3}}{b\min\{\chi^{2},\delta^{2}\}\min\{1,\sigma^{2}\}}\right).
Proof.

Our proof formalizes the idea described in Section 5.2. Recall that the average remaining resource

𝒃(t)𝑩(t)Tt\bm{b}^{(t)}\coloneqq\frac{\bm{B}^{(t)}}{T-t}

where 𝒃(t)=(b1(t),,bd(t))\bm{b}^{(t)}=\left(b_{1}^{(t)},...,b_{d}^{(t)}\right)^{\top} for t=0,,T1.t=0,...,T-1. Define

𝒵[bϵ,b+ϵ]dd\mathcal{Z}\coloneqq[b-\epsilon,b+\epsilon]^{d}\subset\mathbb{R}^{d}

where ϵ>0\epsilon>0 will be specified later. Ideally, the algorithm consumes approximately 𝒃\bm{b} resource per time period so that the average remaining resource 𝒃(t)\bm{b}^{(t)} will always stay in 𝒵\mathcal{Z}. This motivates the definition of a stopping time that represents the first time that 𝒃(t)\bm{b}^{(t)} escapes from 𝒵\mathcal{Z}.

τmin{t:𝒃(t)𝒵}.\tau^{\prime}\coloneqq\min\{t:\bm{b}^{(t)}\notin\mathcal{Z}\}.

Recall that the stopping time τ\tau is the termination of the procedure. Thus, as long as ϵ<b\epsilon<b, we have ττ\tau^{\prime}\leq\tau since no resource is used up at time τ1\tau^{\prime}-1. Furthermore, we define the following event, for each t[T]t\in[T],

t\displaystyle\mathcal{E}_{t}\coloneqq {𝒃(s)𝒵 for each s[t1] and sup𝒃𝒵𝔼[𝒄it(𝒃)|t1]𝒃>ϵt}\displaystyle\left\{\bm{b}^{(s)}\in\mathcal{Z}\text{ for each $s\in[t-1]$}\ \text{ and }\ \sup_{\bm{b}^{\prime}\in\mathcal{Z}}\left\|\mathbb{E}[\bm{c}_{i_{t}}(\bm{b}^{\prime})|\mathcal{H}_{t-1}]-\bm{b}^{\prime}\right\|_{\infty}>\epsilon_{t}\right\}

where t={(𝒓s,𝑪s)}s=1t\mathcal{H}_{t}=\{(\bm{r}_{s},\bm{C}_{s})\}_{s=1}^{t} is the filtration up to time tt, iti_{t} is the arm played by Algorithm 1 in time period tt during Phase II, and 𝒄it(𝒃)\bm{c}_{i_{t}}(\bm{b}^{\prime}) is the expected resource consumption of the iti_{t}-th arm. The threshold parameters ϵt\epsilon_{t} will be specified later. We remark that in Algorithm 1, the choice of iti_{t} (and its distribution) is dependent on the remaining resource 𝒃(t1)\bm{b}^{(t-1)}. Here we represent 𝒃(t1)\bm{b}^{(t-1)} with 𝒃\bm{b}^{\prime}, and thus the second term captures the maximal “bias” of resource consumption when the average remaining resource 𝒃(t1)=𝒃\bm{b}^{(t-1)}=\bm{b}^{\prime}, where the maximum is taken with respect to all 𝒃𝒵.\bm{b}^{\prime}\in\mathcal{Z}.

At time t, if \bm{b}^{(t-1)}=\bm{b}^{\prime} for some \bm{b}^{\prime}\in\mathcal{Z}, there are in total (T-t+1)\bm{b}^{\prime} remaining resources and T-t+1 remaining time periods. Ideally, we hope that exactly \bm{b}^{\prime} resource is consumed in each of the remaining time periods. In this sense, the event \mathcal{E}_{t} describes that the average resource level \bm{b}^{(s)} stays in \mathcal{Z} for the first t-1 time periods and the consumption bias at time t is greater than \epsilon_{t}.

The following lemma states that with a careful choice of ϵ\epsilon and ϵt\epsilon_{t}, the event t\mathcal{E}_{t} will happen with a small probability.

Lemma 5.

Let \alpha=\frac{\min\{1,\sigma\}\min\{\chi,\delta\}b}{d^{\frac{3}{2}}}. If we choose \epsilon\coloneqq\frac{\alpha}{5} and

ϵt{65,tαT19,11bmin{m3,d3}logTmin{χσ2,δσ2}t,t>αT19,\epsilon_{t}\coloneqq\begin{cases}\frac{6}{5},&t\leq\frac{\alpha T}{19},\\ \frac{11b\sqrt{\min\{m^{3},d^{3}\}\log T}}{\sqrt{\min\{\chi\sigma^{2},\delta\sigma^{2}\}t}},&t>\frac{\alpha T}{19},\end{cases}

then there exists a constant T¯\underline{T} such that,

\mathbb{P}\left(\bigcup\limits_{t=1}^{T}\mathcal{E}_{t}\right)\leq\frac{\min\{m,d\}}{T^{3}}.

for all TT¯T\geq\underline{T}.

The proof of Lemma 5 will be postponed to the end of this proof. The following lemma considers the complements of the events \mathcal{E}_{t} and analyzes the stopping time on their complements. Here we use \mathcal{E}^{\complement} to denote the complement of an event \mathcal{E}.

Lemma 6.

If ϵ\epsilon, ϵ¯\bar{\epsilon}, α\alpha and ϵt\epsilon_{t} satisfy that ϵt=ϵ¯\epsilon_{t}=\bar{\epsilon} for all tαTt\leq\alpha T and

αϵ¯1α+t=αT+1T1ϵtTt2ϵ3,\displaystyle\frac{\alpha\bar{\epsilon}}{1-\alpha}+\sum\limits^{T-1}_{t=\alpha T+1}\frac{\epsilon_{t}}{T-t}\leq\frac{2{\epsilon}}{3}, (25)

the following inequality holds

(bj(s)[bϵ,b+ϵ] for some sts=1ts)2exp{(Tt+1)ϵ218(1+ϵ)2},\displaystyle\mathbb{P}\left(b^{(s)}_{j}\not\in[b-\epsilon,b+\epsilon]\text{ for some $s\leq t$, }\bigcap\limits_{s=1}^{t}\mathcal{E}_{s}^{\complement}\right)\leq 2\exp\left\{-\frac{(T-t+1)\epsilon^{2}}{18(1+\epsilon)^{2}}\right\},

for tT2t\leq T-2 and j[d]j\in[d], and therefore,

(τt,s=1ts)2dexp{(Tt+1)ϵ218(1+ϵ)2}.\mathbb{P}\left(\tau^{\prime}\leq t,\bigcap\limits_{s=1}^{t}\mathcal{E}_{s}^{\complement}\right)\leq 2d\exp\left\{-\frac{(T-t+1)\epsilon^{2}}{18(1+\epsilon)^{2}}\right\}.

The proof of Lemma 6 will also be postponed to the end of this proof.

With Lemma 5 and 6, we can derive an upper bound on 𝔼[Tτ].\mathbb{E}[T-\tau^{\prime}]. Specifically,

\displaystyle\mathbb{P}\left(\tau^{\prime}\leq t\right)= \mathbb{P}\left(\tau^{\prime}\leq t,\bigcap\limits_{s=1}^{t}\mathcal{E}_{s}^{\complement}\right)+\mathbb{P}\left(\tau^{\prime}\leq t,\bigcup\limits_{s=1}^{t}\mathcal{E}_{s}\right)
\displaystyle\leq \mathbb{P}\left(\tau^{\prime}\leq t,\bigcap\limits_{s=1}^{t}\mathcal{E}_{s}^{\complement}\right)+\mathbb{P}\left(\bigcup\limits_{s=1}^{t}\mathcal{E}_{s}\right)
\displaystyle\leq 2d\exp\left\{-\frac{\min\{1,\sigma^{2}\}\min\{\chi^{2},\delta^{2}\}b^{2}\cdot(T-t-1)}{650d^{3}}\right\}+\frac{\min\{m,d\}}{T^{3}},

where the two terms in the third line come from Lemma 6 and Lemma 5, respectively. Then, if we denote u\wedge v=\min\{u,v\}, we have

𝔼[Tτ]\displaystyle\mathbb{E}[T-\tau^{\prime}]\leq t=1T(τt)\displaystyle\sum_{t=1}^{T}\mathbb{P}\left(\tau^{\prime}\leq t\right)
\displaystyle\leq t=1T1(min{m,d}T3+2dexp{min{1,σ2}min{χ2,δ2}b2(Tt1)650d3})\displaystyle\sum_{t=1}^{T}1\wedge\left(\frac{\min\{m,d\}}{T^{3}}+2d\exp\left\{-\frac{\min\{1,\sigma^{2}\}\min\{\chi^{2},\delta^{2}\}b^{2}\cdot(T-t-1)}{650d^{3}}\right\}\right)
\displaystyle\leq 650d3b2min{χ2,δ2}min{1,σ2}+min{m,d}T2.\displaystyle\frac{650d^{3}}{b^{2}\min\{\chi^{2},\delta^{2}\}\min\{1,\sigma^{2}\}}+\frac{\min\{m,d\}}{T^{2}}.

Thus, since ττ\tau^{\prime}\leq\tau and 𝑩(τ)(Tτ)(𝒃+ϵ𝟏)(Tτ)65𝒃\bm{B}^{(\tau^{\prime})}\leq(T-\tau^{\prime})\cdot(\bm{b}+\epsilon\bm{1})\leq(T-\tau^{\prime})\cdot\frac{6}{5}\bm{b}, we have

\displaystyle\mathbb{E}[\bm{B}^{(\tau)}]\leq \mathbb{E}[\bm{B}^{(\tau^{\prime})}]\leq\frac{6}{5}b\,\mathbb{E}[T-\tau^{\prime}]\cdot\bm{1}
\displaystyle= O\left(\frac{d^{3}}{b\min\{\chi^{2},\delta^{2}\}\min\{1,\sigma^{2}\}}\right).

Thus we complete the proof. ∎

C2 Proofs of Lemma 5 and Lemma 6

In this subsection, we prove the two key lemmas of the previous section. The proof of Lemma 5 utilizes the following lemma. The statement of Lemma 7 is quite straightforward: under the assumptions that the parameters are well estimated, that all the arms are optimal, and that all the constraints are binding, the optimal solution to LP (9) can simply be obtained as \left(\bm{C}^{L}(t)\right)^{-1}\bm{B}^{(t)}. Its proof is solely based on linear algebra, but it is a bit tedious, so we postpone it until after the proofs of Lemma 5 and Lemma 6.

Lemma 7.

Under Assumption 3 and Assumption 4, if 𝐛(t)𝒵\bm{b}^{(t)}\in\mathcal{Z}, we have 𝐱^(t)=(𝐂L(t))1𝐁(t)\hat{\bm{x}}(t)=\left(\bm{C}^{L}(t)\right)^{-1}\bm{B}^{(t)} is the optimal solution of LP (9) at time t+1t+1.
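Lemma 7 reduces Phase II's LP (9) to a linear solve that makes every knapsack constraint tight. A minimal numerical check with a hypothetical 2\times 2 estimated consumption matrix and remaining budget:

```python
def solve_2x2(C, rhs):
    """Solve C x = rhs for a 2x2 matrix C by Cramer's rule."""
    (a, b_), (c, d) = C
    det = a * d - b_ * c
    return [(rhs[0] * d - b_ * rhs[1]) / det,
            (a * rhs[1] - rhs[0] * c) / det]

# Hypothetical square (m = d = 2) estimated matrix C^L(t) and budget B^(t).
C_L = [[0.8, 0.1], [0.2, 0.7]]
B_t = [30.0, 27.0]
x_hat = solve_2x2(C_L, B_t)
assert all(x >= 0 for x in x_hat)   # a genuine arm-play allocation
for row, budget in zip(C_L, B_t):   # every knapsack is exactly exhausted
    assert abs(row[0] * x_hat[0] + row[1] * x_hat[1] - budget) < 1e-9
```

The well-positioned condition (minimum singular value bounded below) is what guarantees this inverse is stable; for a near-singular estimate, small estimation errors in C^L(t) would blow up in x_hat.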

Proof of Lemma 5

Proof.

First, we show that each arm will be played, in expectation, no less than \frac{\chi t}{5} times in the first t time periods. Denote the normalized probability distribution used to play the arms at time t+1 as \tilde{\bm{x}}(t); then, by the definition of \bm{b}^{(t)}, we know

𝒙~(t)=(𝑪L(t))1𝒃(t).\tilde{\bm{x}}(t)=\left({\bm{C}}^{L}(t)\right)^{-1}\bm{b}^{(t)}.

Here, from Assumption 4, we know that m=d in this case, so the consumption matrix {\bm{C}}^{L}(t) is a square matrix. By Lemma 7 and the definition of \mathcal{Z}, if \bm{b}^{(t)}\in\mathcal{Z},

𝑪1𝒃(t)\displaystyle\bm{C}^{-1}\bm{b}^{(t)} 𝑪1𝒃𝑪1(𝒃𝒃(t))\displaystyle\geq\bm{C}^{-1}{\bm{b}}-\left\|\bm{C}^{-1}\left(\bm{b}-\bm{b}^{(t)}\right)\right\|_{\infty}
χ𝑪1(𝒃𝒃(t))\displaystyle\geq\chi-\|\bm{C}^{-1}\|_{\infty}\left\|\left(\bm{b}-\bm{b}^{(t)}\right)\right\|_{\infty}
χdσχ5d3/24χ5,\displaystyle\geq\chi-\frac{\sqrt{d}}{\sigma}\frac{\chi}{5d^{3/2}}\geq\frac{4\chi}{5}, (26)

where the first line comes from the triangle inequality, the second line comes from the definition of \chi, and the third line is obtained from the relation between the singular value and the matrix infinity norm. Now we can transfer the bound to \tilde{\bm{x}}(t). From the warm start condition (24), we have

𝒄iL(t)𝒄imin{1,σ2}min{χ,δ}12min{md,d2},i,\|{\bm{c}}^{L}_{i}(t)-\bm{c}_{i}\|_{\infty}\leq\frac{\min\{1,\sigma^{2}\}\min\{\chi,\delta\}}{12\min\{md,d^{2}\}},\ i\in\mathcal{I}^{*},

and with a proof similar to the first part of Lemma 7, we can show

(𝑪L(t))1𝒃(t)𝑪1𝒃(t)3χ5.\displaystyle\left\|\left({\bm{C}}^{L}(t)\right)^{-1}\bm{b}^{(t)}-\bm{C}^{-1}\bm{b}^{(t)}\right\|_{\infty}\leq\frac{3\chi}{5}. (27)

Specifically, the warm start condition (24) provides an upper bound for the deviation of {\bm{C}}^{L}(t) from \bm{C}. Putting together (26) and (27), we have

𝒙~(t)=\displaystyle\tilde{\bm{x}}(t)= (𝑪L(t))1𝒃(t)\displaystyle\left({\bm{C}}^{L}(t)\right)^{-1}\bm{b}^{(t)}
\displaystyle\geq 𝑪1𝒃(t)(𝑪L(t))1𝒃(t)𝑪1𝒃(t)\displaystyle\bm{C}^{-1}\bm{b}^{(t)}-\left\|\left({\bm{C}}^{L}(t)\right)^{-1}\bm{b}^{(t)}-\bm{C}^{-1}\bm{b}^{(t)}\right\|_{\infty}
\displaystyle\geq χ5\displaystyle\frac{\chi}{5}

where the inequality holds element-wise. This inequality provides a lower bound on the probability distribution with which the arms are played. The implication is that if \bm{b}^{(t)}\in\mathcal{Z}, each of the arms will be played with probability no less than \frac{\chi}{5}. In this way, as long as \bm{b}^{(t)}\in\mathcal{Z} always holds, all the arms will be played sufficiently many times to refine the corresponding estimates.

Now, we apply a concentration argument to show that each arm will be played sufficiently many times, not only in expectation, but also with high probability. Concretely, we show that if \bm{b}^{(s)}\in\mathcal{Z} holds for all s\in[t] and t\geq\frac{200\log T}{\chi^{2}}, then the number of times that the i-th arm is played up to time t satisfies

ni(t)χt10n_{i}(t)\geq\frac{\chi t}{10}

with high probability for each arm i[m]i\in[m]. To see this, we first notice that if 𝒃(s)𝒵\bm{b}^{(s)}\in\mathcal{Z} for all s[t]s\in[t], then

s=1tx~i(s)χt5\sum_{s=1}^{t}\tilde{x}_{i}(s)\geq\frac{\chi t}{5}

for each i\in[m]. Also, since n_{i}(t)-n_{i}(t-1)-\tilde{x}_{i}(t) is a martingale difference sequence with respect to the filtration \mathcal{H}_{t}, the following holds for all t\in[T],

(s=1tni(s)ni(s1)x~i(s)2tlogT)1T4,\displaystyle\mathbb{P}\left(\sum\limits_{s=1}^{t}n_{i}(s)-n_{i}(s-1)-\tilde{x}_{i}(s)\leq-\sqrt{2t\log T}\right)\leq\frac{1}{T^{4}},

by an application of Azuma–Hoeffding inequality. By rearranging terms, it becomes

\displaystyle\mathbb{P}\left(n_{i}(t)\leq\sum\limits_{s=1}^{t}\tilde{x}_{i}(s)-\sqrt{2t\log T}\right)\leq\frac{1}{T^{4}},

and consequently,

\displaystyle\mathbb{P}\left(n_{i}(t)\leq\sum\limits_{s=1}^{t}\tilde{x}_{i}(s)-\sqrt{2t\log T},\text{ for some }t\in[T]\right)\leq\frac{1}{T^{3}}. (28)

On the event

\left\{n_{i}(t)\leq\sum\limits_{s=1}^{t}\tilde{x}_{i}(s)-\sqrt{2t\log T},\text{ for some }t\in[T]\right\}^{\complement},

we know that the inequality ni(t)χt10n_{i}(t)\geq\frac{\chi t}{10} holds if 𝒃(s)𝒵\bm{b}^{(s)}\in\mathcal{Z} for all s[t]s\in[t] and t200logTχ2.t\geq\frac{200\log T}{\chi^{2}}.

In other words, when \bm{b}^{(s)}\in\mathcal{Z} for all s\in[t], the i-th arm will be played at least \frac{\chi t}{10} times with probability at least 1-\frac{1}{T^{3}}, which implies that the estimation error for the i-th arm is at most \sqrt{\frac{20\log T}{\chi t}} at the beginning of time t+1 with high probability. Then the bias in the resource consumption has the following upper bound,

𝔼[𝑪it+1(𝒃)|t]𝒃=\displaystyle\left\|\mathbb{E}\left[\bm{C}_{i_{t+1}}(\bm{b}^{\prime})|\mathcal{H}_{t}\right]-\bm{b}^{\prime}\right\|_{\infty}= 𝑪(𝑪L(t))1𝒃𝒃\displaystyle\left\|\bm{C}\left(\bm{C}^{L}(t)\right)^{-1}\bm{b}^{\prime}-\bm{b}^{\prime}\right\|_{\infty}
\displaystyle\leq (𝑪𝑪L(t))(𝑪L(t))1𝒃\displaystyle\left\|\left(\bm{C}-\bm{C}^{L}(t)\right)\left(\bm{C}^{L}(t)\right)^{-1}\bm{b}^{\prime}\right\|_{\infty}
\displaystyle\leq 𝑪𝑪L(t)(𝑪L(t))1𝒃\displaystyle\left\|\bm{C}-\bm{C}^{L}(t)\right\|_{\infty}\left\|\left(\bm{C}^{L}(t)\right)^{-1}\right\|_{\infty}\|\bm{b}^{\prime}\|_{\infty}
\displaystyle\leq 20min{md,d2}logTχt2dσ6b5\displaystyle\sqrt{\frac{{20\min\{md,d^{2}\}\log T}}{\chi t}}\cdot\frac{2\sqrt{d}}{\sigma}\cdot\frac{6b}{5}
\displaystyle\leq 11bmin{md2,d3}logTχσ2(t+1)\displaystyle\frac{11b\sqrt{\min\{md^{2},d^{3}\}\log T}}{\sqrt{\chi\sigma^{2}(t+1)}}

where the last line is defined to be \epsilon_{t} when t\geq\alpha T. Here, the second line comes from the definition of the algorithm, the third and fourth lines come from the sub-multiplicativity of the matrix norm, and the fifth line comes from plugging in the estimation error of \bm{C}^{L}(t). The choice of \alpha ensures that t\geq\alpha T\geq\frac{200\log T}{\chi^{2}}. Also, we choose \underline{T} as the smallest integer such that

αT200logTχ2.\alpha T\geq\frac{200\log T}{\chi^{2}}.

In this way, when TT¯,T\geq\underline{T}, we have the following two cases:

  • t>αT:t>\alpha T: We have

\mathcal{E}_{t}\subset\left\{n_{i}(t)\leq\sum\limits_{s=1}^{t}\tilde{x}_{i}(s)-\sqrt{2t\log T},\text{ for some }t\in[T]\right\},
  • tαTt\leq\alpha T: We have 𝟎𝔼[𝒄it(𝒃)|t]𝟏\bm{0}\leq\mathbb{E}\left[\bm{c}_{i_{t}}(\bm{b}^{\prime})|\mathcal{H}_{t}\right]\leq\bm{1} and 𝟎𝒃65𝟏\bm{0}\leq\bm{b}^{\prime}\leq\frac{6}{5}\bm{1}. Consequently, the following holds

    𝔼[𝒄it(𝒃)|t1]𝒃65\left\|\mathbb{E}\left[\bm{c}_{i_{t}}(\bm{b}^{\prime})|\mathcal{H}_{t-1}\right]-\bm{b}^{\prime}\right\|_{\infty}\leq\frac{6}{5}

    for all 𝒃𝒵.\bm{b}^{\prime}\in\mathcal{Z}. Thus it implies that t=.\mathcal{E}_{t}=\emptyset.

Combining the two cases, when TT¯,T\geq\underline{T}, we have

\bigcup\limits_{t=1}^{T}\mathcal{E}_{t}\subset\left\{n_{i}(t)\leq\sum\limits_{s=1}^{t}\tilde{x}_{i}(s)-\sqrt{2t\log T},\text{ for some }t\in[T]\right\}

and therefore, we know from (28),

\mathbb{P}\left(\bigcup\limits_{t=1}^{T}\mathcal{E}_{t}\right)\leq\frac{\min\{m,d\}}{T^{3}}. ∎

Proof of Lemma 6

Proof.

Note that

{bj(s)[bϵ,b+ϵ] for some st, s=1ts}\displaystyle\left\{{b}^{(s)}_{j}\not\in[b-\epsilon,b+\epsilon]\text{ for some $s\leq t$, }\bigcap\limits_{s=1}^{t}\mathcal{E}_{s}^{\complement}\right\}\subset {b~j(s)[bϵ,b+ϵ] for some st}\left\{\tilde{{b}}^{(s)}_{j}\not\in[b-\epsilon,b+\epsilon]\text{ for some }s\leq t\right\}

for each constraint j[d].j\in[d]. Thus, we only need to analyze the event

{b~j(s)[bϵ,b+ϵ] for some st}.\left\{\tilde{{b}}^{(s)}_{j}\not\in[b-\epsilon,b+\epsilon]\text{ for some }s\leq t\right\}.

Note that, by a telescoping sum,

b~j(t)b\displaystyle\tilde{{b}}^{(t)}_{j}-b =b~j(t)b~j(0)\displaystyle=\tilde{{b}}^{(t)}_{j}-\tilde{{b}}^{(0)}_{j}
=s=1t(b~j(s)b~j(s1)).\displaystyle=\sum_{s=1}^{t}\left(\tilde{{b}}^{(s)}_{j}-\tilde{{b}}^{(s-1)}_{j}\right).

We define βtb~j(t)b~j(t1)\beta_{t}\coloneqq\tilde{{b}}_{j}^{(t)}-\tilde{{{b}}}_{j}^{(t-1)}, and then have

b~j(t)b\displaystyle\tilde{{b}}^{(t)}_{j}-b =s=1tβs\displaystyle=\sum_{s=1}^{t}\beta_{s}
=s=1t(βs𝔼[βs|s1])+s=1t𝔼[βs|s1].\displaystyle=\sum_{s=1}^{t}\left(\beta_{s}-\mathbb{E}[\beta_{s}|\mathcal{H}_{s-1}]\right)+\sum_{s=1}^{t}\mathbb{E}[\beta_{s}|\mathcal{H}_{s-1}]. (29)

We remark that we do not index the process βt\beta_{t} by the constraint index, since the analyses for all the constraints are identical; we therefore focus on a specific constraint jj. Moreover, from the definition of τ~\tilde{\tau},

βt\displaystyle\beta_{t} =b~j(t)b~j(t1)={0,t>τ~,1Tt(Cj,tb~j(t1)),tτ~.\displaystyle=\tilde{{b}}_{j}^{(t)}-\tilde{{b}}_{j}^{(t-1)}=\left\{\begin{matrix}0,&t>\tilde{\tau},\\ -\frac{1}{T-t}(C_{j,t}-\tilde{{b}}_{j}^{(t-1)}),&t\leq\tilde{\tau}.\end{matrix}\right.
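To make the update rule concrete, the following sketch simulates the averaged remaining-budget process for a single constraint j with hypothetical consumption values C_{j,t}, and checks the telescoping identity above numerically.

```python
import random

# Illustrative simulation of the process b_tilde for one constraint j.
# C_{j,t} is a hypothetical per-period consumption drawn from [0, 1];
# b is the initial average budget. The update matches the display above:
#   b_tilde[t] = b_tilde[t-1] - (C_{j,t} - b_tilde[t-1]) / (T - t).
random.seed(0)
T, b = 100, 0.5
b_tilde = [b]
betas = []
for t in range(1, T):           # stop before t = T so that T - t > 0
    c = random.random()         # hypothetical consumption C_{j,t}
    beta = -(c - b_tilde[-1]) / (T - t)
    betas.append(beta)
    b_tilde.append(b_tilde[-1] + beta)

# Telescoping: b_tilde[t] - b_tilde[0] equals the sum of the increments.
gap = abs((b_tilde[-1] - b_tilde[0]) - sum(betas))
```

The quantity `gap` is zero up to floating-point error, which is exactly the telescoping identity used in the decomposition (29).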

Next, we first develop a concentration argument for the first summation in (29). Specifically, the summand can be viewed as a martingale difference sequence. Note that

|βt𝔼[βt|t1]|1+ϵTt|\beta_{t}-\mathbb{E}[\beta_{t}|\mathcal{H}_{t-1}]|\leq\frac{1+\epsilon}{T-t}

for t[T1].t\in[T-1]. By applying Hoeffding’s Inequality, we have

(|k=1s(βk𝔼[βk|k1])|l for some st)\displaystyle\mathbb{P}\left(\left|\sum_{k=1}^{s}(\beta_{k}-\mathbb{E}\left[\beta_{k}|\mathcal{H}_{k-1}\right])\right|\geq l\text{ for some }s\leq t\right)\leq 2exp{2l24s=1t(1+ϵ)2(Ts)2}\displaystyle 2\exp\left\{-\frac{2l^{2}}{4\sum\limits_{s=1}^{t}\frac{(1+\epsilon)^{2}}{(T-s)^{2}}}\right\}
\displaystyle\leq 2exp{(Tt1)l22(1+ϵ)2}\displaystyle 2\exp\left\{-\frac{(T-t-1)l^{2}}{2(1+\epsilon)^{2}}\right\}

holds for all tT2t\leq T-2 and l0l\geq 0. By taking l=ϵ3l=\frac{\epsilon}{3}, we have

(|k=1s(βk𝔼[βk|k1])|ϵ3 for some st)2exp{(Tt1)ϵ218(1+ϵ)2}.\displaystyle\mathbb{P}\left(\left|\sum_{k=1}^{s}\left(\beta_{k}-\mathbb{E}[\beta_{k}|\mathcal{H}_{k-1}]\right)\right|\geq\frac{\epsilon}{3}\text{ for some }s\leq t\right)\leq 2\exp\left\{\frac{-(T-t-1)\epsilon^{2}}{18(1+\epsilon)^{2}}\right\}. (30)
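The Azuma–Hoeffding bound above can be sanity-checked by simulation. The sketch below uses hypothetical parameters, with mean-zero increments bounded by (1+ε)/(T−k) standing in for the centered martingale differences; it only illustrates that the empirical frequency of a large maximal deviation stays below the stated bound (written with the conservative T−t−1 factor).

```python
import math
import random

# Illustrative parameters (not from the proof).
random.seed(1)
T, t, eps, trials = 200, 100, 0.6, 2000

# The tail bound from the display, with threshold l = eps / 3.
bound = 2 * math.exp(-(T - t - 1) * (eps / 3) ** 2 / (2 * (1 + eps) ** 2))

exceed = 0
for _ in range(trials):
    s, hit = 0.0, False
    for k in range(1, t + 1):
        c = (1 + eps) / (T - k)       # envelope on |beta_k - E[beta_k]|
        s += random.uniform(-c, c)    # mean-zero, bounded increment
        if abs(s) >= eps / 3:         # maximal deviation event
            hit = True
            break
    exceed += hit

empirical = exceed / trials           # empirical frequency of the event
```

With these parameters the empirical frequency is far below the analytic bound, as expected, since the bound is loose for a maximal-deviation event of this size.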

Then we analyze the second summation in (29). For s[t]s\in[t],

|k=1s𝔼[βk|k1]|\displaystyle\left|\sum\limits_{k=1}^{s}\mathbb{E}[\beta_{k}|\mathcal{H}_{k-1}]\right| =|k=1min{s,τ~}𝔼[βk|k1]|\displaystyle=\left|\sum\limits_{k=1}^{\min\{s,\tilde{\tau}\}}\mathbb{E}[\beta_{k}|\mathcal{H}_{k-1}]\right|
k=1τ~|𝔼[βk|k1]|\displaystyle\leq\sum\limits_{k=1}^{\tilde{\tau}}\left|\mathbb{E}[\beta_{k}|\mathcal{H}_{k-1}]\right|
k=1αTϵ¯Tk+k=αT+1T1ϵkTk\displaystyle\leq\sum\limits_{k=1}^{\alpha T}\frac{\bar{\epsilon}}{T-k}+\sum\limits_{k=\alpha T+1}^{T-1}\frac{\epsilon_{k}}{T-k}
αϵ¯1α+k=αT+1T1ϵkTk,\displaystyle\leq\frac{\alpha\bar{\epsilon}}{1-\alpha}+\sum\limits_{k=\alpha T+1}^{T-1}\frac{\epsilon_{k}}{T-k},

where the last line corresponds to the condition in (25). In the above, the first line comes from the definition of τ~\tilde{\tau} and the third line comes from the definition of t\mathcal{E}_{t}.

So, if (25) holds, i.e.,

αϵ¯1α+l=αT+1T1ϵlTl23ϵ,\frac{\alpha\bar{\epsilon}}{1-\alpha}+\sum\limits_{l=\alpha T+1}^{T-1}\frac{\epsilon_{l}}{T-l}\leq\frac{2}{3}\epsilon,

we have

|k=1s𝔼[βk|k1]|2ϵ3,\left|\sum\limits_{k=1}^{s}\mathbb{E}[\beta_{k}|\mathcal{H}_{k-1}]\right|\leq\frac{2\epsilon}{3}, (31)

for all s[t]s\in[t].

Finally, by combining (30) and (31) together into (29), we have

{b~j(s)[bϵ,b+ϵ] for some st}\displaystyle\left\{\tilde{{b}}_{j}^{(s)}\not\in\left[b-\epsilon,b+\epsilon\right]\text{ for some }s\leq t\right\}
\displaystyle\subset {|k=1sβk|ϵ for some st}\displaystyle\left\{\left|\sum\limits_{k=1}^{s}\beta_{k}\right|\geq\epsilon\text{ for some }s\leq t\right\}
\displaystyle\subset {|k=1s(βk𝔼[βk|k1])|ϵ3 for some st}{|k=1s𝔼[βk|k1]|>2ϵ3 for some st}.\displaystyle\left\{\left|\sum_{k=1}^{s}\left(\beta_{k}-\mathbb{E}[\beta_{k}|\mathcal{H}_{k-1}]\right)\right|\geq\frac{\epsilon}{3}\text{ for some }s\leq t\right\}\bigcup\left\{\left|\sum\limits_{k=1}^{s}\mathbb{E}[\beta_{k}|\mathcal{H}_{k-1}]\right|>\frac{2\epsilon}{3}\text{ for some }s\leq t\right\}.

Therefore,

(b~j(s)[bϵ,b+ϵ] for some st)\displaystyle\mathbb{P}\left(\tilde{{b}}_{j}^{(s)}\not\in\left[b-\epsilon,b+\epsilon\right]\text{ for some }s\leq t\right)\leq (|k=1s(βk𝔼[βk|k1])|ϵ3 for some st)\displaystyle\mathbb{P}\left(\left|\sum_{k=1}^{s}\left(\beta_{k}-\mathbb{E}[\beta_{k}|\mathcal{H}_{k-1}]\right)\right|\geq\frac{\epsilon}{3}\text{ for some }s\leq t\right)
\displaystyle\leq 2exp{(Tt1)ϵ218(1+ϵ)2}.2\exp\left\{\frac{-(T-t-1)\epsilon^{2}}{18(1+\epsilon)^{2}}\right\}.

As noted earlier, the proof applies to every constraint j[d]j\in[d]. By taking a union bound, we complete the proof. ∎

Proof of Lemma 7

Proof.

Consider the following LPs for all ii\in\mathcal{I}^{*}, and denote their optimal objective values as OPT~Ui(t)\tilde{\text{OPT}}^{U}_{i}(t) and OPT~ULP(t)\tilde{\text{OPT}}^{U}_{\text{LP}}(t), respectively.

OPT~iU(t)=max𝒙\displaystyle\tilde{\text{OPT}}_{i}^{U}(t)=\max_{\bm{x}}\ \ (𝝁U(t))𝒙,\displaystyle\left(\bm{\mu}^{U}(t)\right)^{\top}\bm{x},
s.t. 𝑪L(t)𝒙𝑩(t),\displaystyle\bm{C}^{L}(t)\bm{x}\leq\bm{B}^{(t)},
xi=0,𝒙𝟎.\displaystyle x_{i}=0,\ \bm{x}\geq\bm{0}.
OPT~ULP(t)=max𝒙\displaystyle\tilde{\text{OPT}}^{U}_{\text{LP}}(t)=\max_{\bm{x}}\ \ (𝝁U(t))𝒙,\displaystyle\left(\bm{\mu}^{U}(t)\right)^{\top}\bm{x}, (32)
s.t. 𝑪L(t)𝒙𝑩(t),\displaystyle\bm{C}^{L}(t)\bm{x}\leq\bm{B}^{(t)},
𝒙𝟎.\displaystyle\bm{x}\geq\bm{0}.

Then, similarly to the proof of Lemma 2, to prove that 𝒙^(t)\hat{\bm{x}}(t) is an optimal solution to LP (9), it is sufficient to show that 𝒙^(t)\hat{\bm{x}}(t) is feasible to LP (9) and that OPT~iU(t)<OPT~ULP(t)\tilde{\text{OPT}}_{i}^{U}(t)<\tilde{\text{OPT}}^{U}_{\text{LP}}(t) for all ii\in\mathcal{I}^{*}.

First, we show feasibility. It is sufficient to check the following two results:

  • (a)

    The matrix 𝑪L(t){\bm{C}}^{L}(t) is non-singular and thus 𝒙^(t)\hat{\bm{x}}(t) is well-defined.

  • (b)

    𝒙^(t)𝟎\hat{\bm{x}}(t)\geq\bm{0} and thus 𝒙^(t)\hat{\bm{x}}(t) is a feasible solution to the LP (9).

To show part (a), we prove that the smallest singular value of the matrix is positive. Recall we use σmin(𝑴)\sigma_{\min}(\bm{M}) and σmax(𝑴)\sigma_{\max}(\bm{M}) to denote the smallest and the largest singular value of a matrix 𝑴\bm{M}, respectively.

σmin(𝑪L(t))\displaystyle\sigma_{\min}\left({\bm{C}}^{L}(t)\right) σmin(𝑪)σmax(𝑪L(t)𝑪)\displaystyle\geq\sigma_{\min}\left({\bm{C}}\right)-\sigma_{\max}\left({\bm{C}}^{L}(t)-\bm{C}\right)
=σσmax(𝑪L(t)𝑪)\displaystyle=\sigma-\sigma_{\max}\left({\bm{C}}^{L}(t)-\bm{C}\right)
σd𝑪L(t)𝑪\displaystyle\geq\sigma-\sqrt{d}\|{\bm{C}}^{L}(t)-\bm{C}\|_{\infty}
σdmin{m,d}maxi𝒄Li𝒄i,\displaystyle\geq\sigma-\sqrt{d\cdot\min\{m,d\}}\max\limits_{i\in\mathcal{I}^{*}}\|{\bm{c}}^{L}_{i}-\bm{c}_{i}\|_{\infty},

where the first line comes from Weyl’s inequality on matrix eigenvalues/singular values, the second line comes from the definition of σ\sigma, the third line is obtained from the relation between the spectral norm and the infinity norm of a matrix, and the last line comes from the relation between the matrix infinity norm and the vector infinity norm of each column in the matrix. From the warm start condition (24), we have

σmin(𝑪L(t))σ2,\sigma_{\min}\left({\bm{C}}^{L}(t)\right)\geq\frac{\sigma}{2},

and consequently,

σmax((𝑪L(t))1)2σ.\displaystyle\sigma_{\max}\left(\left({\bm{C}}^{L}(t)\right)^{-1}\right)\leq\frac{2}{\sigma}. (33)
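The two spectral facts used here, Weyl's inequality and the identity σ_max(M⁻¹) = 1/σ_min(M), can be checked numerically on a small hypothetical matrix; this is a sanity check of the linear-algebra inequalities only, not of the algorithm itself.

```python
import numpy as np

# Hypothetical "true" matrix C and a small perturbation E = C^L - C.
rng = np.random.default_rng(0)
C = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
E = 0.01 * rng.standard_normal((3, 3))
C_L = C + E

smin = lambda M: np.linalg.svd(M, compute_uv=False).min()
smax = lambda M: np.linalg.svd(M, compute_uv=False).max()

# Weyl's inequality: sigma_min(C_L) >= sigma_min(C) - sigma_max(C_L - C).
weyl_ok = smin(C_L) >= smin(C) - smax(E) - 1e-12

# If sigma_min(C_L) > 0, then sigma_max(C_L^{-1}) = 1 / sigma_min(C_L),
# which is the step leading to the bound (33).
inv_ok = abs(smax(np.linalg.inv(C_L)) - 1.0 / smin(C_L)) < 1e-9
```

Both checks hold for any nonsingular perturbation, since they are deterministic matrix facts; the random matrices merely instantiate them.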

For part (b), we show the non-negativity of the solution 𝒙^(t).\hat{\bm{x}}(t). The starting point is the condition that the optimal solution to the Primal LP (1) has strictly positive elements, i.e., from Section 3.3, we know

𝒙χT>0,{\bm{x}}^{*}\geq\chi T>0,

where the inequality holds element-wise. Now, we show that,

𝒙^(t)𝒙=(𝑪L(t))1𝑩𝑪1𝑩χT2,\left\|\hat{\bm{x}}(t)-{\bm{x}}^{*}\right\|_{\infty}=\left\|\left({\bm{C}}^{L}(t)\right)^{-1}\bm{B}-\bm{C}^{-1}\bm{B}\right\|_{\infty}\leq\frac{\chi T}{2},

which implies that

𝒙^(t)χT2\hat{\bm{x}}(t)\geq\frac{\chi T}{2}

holds element-wise by triangle inequality. To see this, from condition (24) that maxi𝒄Li𝒄iχσ212dmin{m,d}\max\limits_{i\in\mathcal{I}^{*}}\|{\bm{c}}^{L}_{i}-\bm{c}_{i}\|_{\infty}\leq\frac{\chi\sigma^{2}}{12{d\min\{m,d\}}}, then

(𝑪L(t))1𝑩𝑪1𝑩\displaystyle\left\|\left({\bm{C}}^{L}(t)\right)^{-1}\bm{B}-\bm{C}^{-1}\bm{B}\right\|_{\infty} (𝑪L(t))1𝑪1𝑩\displaystyle\leq\left\|\left({\bm{C}}^{L}(t)\right)^{-1}-\bm{C}^{-1}\right\|_{\infty}\|\bm{B}\|_{\infty}
T(𝑪L(t))1𝑪1\displaystyle\leq T\left\|\left({\bm{C}}^{L}(t)\right)^{-1}-\bm{C}^{-1}\right\|_{\infty}
T𝑪1𝑪(𝑪L(t))1𝑰\displaystyle\leq T\|\bm{C}^{-1}\|_{\infty}\left\|{\bm{C}}\left({\bm{C}}^{L}(t)\right)^{-1}-\bm{I}\right\|_{\infty}
dTσ(𝑪𝑪L(t)+𝑪L(t))(𝑪L(t))1𝑰\displaystyle\leq\frac{\sqrt{d}T}{\sigma}\left\|\left({\bm{C}}-{\bm{C}}^{L}(t)+{\bm{C}}^{L}(t)\right)\left({\bm{C}}^{L}(t)\right)^{-1}-\bm{I}\right\|_{\infty}
=dTσ(𝑪𝑪L(t))(𝑪L(t))1\displaystyle=\frac{\sqrt{d}T}{\sigma}\left\|\left({\bm{C}}-{\bm{C}}^{L}(t)\right)\left({\bm{C}}^{L}(t)\right)^{-1}\right\|_{\infty}
dTσ𝑪𝑪L(t)(𝑪L(t))1\displaystyle\leq\frac{\sqrt{d}T}{\sigma}\left\|{\bm{C}}-{\bm{C}}^{L}(t)\right\|_{\infty}\left\|\left({\bm{C}}^{L}(t)\right)^{-1}\right\|_{\infty}
2dTσ2𝑪𝑪L(t)\displaystyle\leq\frac{2{d}T}{\sigma^{2}}\left\|{\bm{C}}-{\bm{C}}^{L}(t)\right\|_{\infty}
2min{md,d2}Tσ2maxi𝒄Li𝒄iχT2.\displaystyle\leq\frac{2\min\{md,d^{2}\}T}{\sigma^{2}}\max_{i\in\mathcal{I}^{*}}\|{\bm{c}}^{L}_{i}-\bm{c}_{i}\|_{\infty}\leq\frac{\chi T}{2}.

The first line comes from the definition of the matrix LL_{\infty} norm. The second line comes from the assumption that BTB\leq T. The third and sixth lines come from the sub-multiplicativity of the matrix LL_{\infty} norm. The fourth and seventh lines come from the definition of σ\sigma following Section 3.3 and the relation between the spectral norm σmax\sigma_{\max} and the LL_{\infty} norm, i.e., (𝑪L(t))1dσmax((𝑪L(t))1).\left\|\left({\bm{C}}^{L}(t)\right)^{-1}\right\|_{\infty}\leq\sqrt{d}\ \sigma_{\max}\left(\left({\bm{C}}^{L}(t)\right)^{-1}\right). The last line reduces the matrix infinity norm to the vector infinity norm and applies condition (24). This finishes the part on the feasibility of 𝒙^(t).\hat{\bm{x}}(t).
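The chain of inequalities above rests on the identity A⁻¹ − C⁻¹ = C⁻¹(C − A)A⁻¹ together with the sub-multiplicativity of the induced infinity norm (maximum absolute row sum). Both can be verified numerically on hypothetical matrices:

```python
import numpy as np

# Hypothetical matrices: C plays the role of the true matrix and
# A plays the role of the estimate C^L(t).
rng = np.random.default_rng(1)
C = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
A = C + 0.01 * rng.standard_normal((4, 4))

inf = lambda M: np.linalg.norm(M, np.inf)   # induced infinity norm

# Sub-multiplicative bound: ||A^{-1} - C^{-1}|| <= ||C^{-1}|| ||C - A|| ||A^{-1}||.
lhs = inf(np.linalg.inv(A) - np.linalg.inv(C))
rhs = inf(np.linalg.inv(C)) * inf(C - A) * inf(np.linalg.inv(A))

# The exact identity behind the bound: A^{-1} - C^{-1} = C^{-1}(C - A)A^{-1}.
identity_gap = inf((np.linalg.inv(A) - np.linalg.inv(C))
                   - np.linalg.inv(C) @ (C - A) @ np.linalg.inv(A))
```

The identity holds exactly (up to floating-point error), and the norm bound follows because the induced infinity norm is sub-multiplicative.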

Next, we show OPT~iU(t)<OPT~ULP(t)\tilde{\text{OPT}}_{i}^{U}(t)<\tilde{\text{OPT}}^{U}_{\text{LP}}(t) for all ii\in\mathcal{I}^{*} by utilizing OPTiU(t)\text{OPT}_{i}^{U}(t) and OPTU(t)\text{OPT}^{U}(t).

Recall that OPTiU(t)\text{OPT}_{i}^{U}(t) is the optimal objective value of the following LP.

OPTiU(t)max𝒙\displaystyle\text{OPT}_{i}^{U}(t)\coloneqq\max_{\bm{x}}\ \ (𝝁U(t1))𝒙,\displaystyle\left(\bm{\mu}^{U}(t-1)\right)^{\top}\bm{x},
s.t. 𝑪L(t1)𝒙𝑩,\displaystyle\bm{C}^{L}(t-1)\bm{x}\leq\bm{B},
xi=0,𝒙𝟎.\displaystyle x_{i}=0,\bm{x}\geq\bm{0}.

With the similar proof in Lemma 4 and Assumption 3, we can show that

OPTUi(t)\displaystyle\text{OPT}^{U}_{i}(t) OPTi+δ5T.\displaystyle\leq\text{OPT}_{i}+\frac{\delta}{5}T. (34)

Then, from LP duality, we know that OPTUi(t)\text{OPT}^{U}_{i}(t) and OPT~Ui(t)\tilde{\text{OPT}}^{U}_{i}(t) are also the optimal values of the dual LPs (35) and (36), respectively.

min𝒚\displaystyle\min_{\bm{y}}\ \ 𝑩𝒚,\displaystyle\bm{B}^{\top}\bm{y},
s.t. (𝑪iL(t1))𝒚𝝁iU(t1),\displaystyle({{\bm{C}}}_{-i}^{L}(t-1))^{\top}\bm{y}\geq{\bm{\mu}}_{-i}^{U}(t-1), (35)
𝒚0.\displaystyle\bm{y}\geq 0.
min𝒚\displaystyle\min_{\bm{y}}\ \ (𝑩(t1))𝒚,\displaystyle\left(\bm{B}^{(t-1)}\right)^{\top}\bm{y},
s.t. (𝑪iL(t1))𝒚𝝁iU(t1),\displaystyle({{\bm{C}}}_{-i}^{L}(t-1))^{\top}\bm{y}\geq{\bm{\mu}}_{-i}^{U}(t-1), (36)
𝒚0,\displaystyle\bm{y}\geq 0,

where 𝑴i\bm{M}_{-i} is the matrix (vector) obtained by deleting the ii-th column (entry) of the matrix (vector) 𝑴\bm{M}. Denote the optimal solution of LP (35) as 𝒚~\tilde{\bm{y}}. By duality, we have

y~j=1BBy~j1B𝑩𝒚~1b.\displaystyle\tilde{y}_{j}=\frac{1}{B}B\tilde{y}_{j}\leq\frac{1}{B}\bm{B}^{\top}\tilde{\bm{y}}\leq\frac{1}{b}.

for all j[d]j\in[d]. Moreover, since 𝒚~\tilde{\bm{y}} is also feasible to LP (36), by duality, we have

OPT~iU(t)\displaystyle\tilde{\text{OPT}}_{i}^{U}(t)\leq (𝑩(t1))𝒚~\displaystyle(\bm{B}^{(t-1)})^{\top}\tilde{\bm{y}}
=\displaystyle= TtT(𝑩)𝒚~+(𝑩(t)(Tt)𝒃)𝒚~\displaystyle\frac{T-t}{T}(\bm{B})^{\top}\tilde{\bm{y}}+(\bm{B}^{(t)}-(T-t)\bm{b})^{\top}\tilde{\bm{y}}
\displaystyle\leq TtTOPTiU(t)+dbδ5d3/2(Tt)\displaystyle\frac{T-t}{T}\text{OPT}_{i}^{U}(t)+\frac{d}{b}\cdot\frac{\delta}{5d^{3/2}}(T-t)
\displaystyle\leq TtTOPTiU(t)+δ5(Tt)\displaystyle\frac{T-t}{T}\text{OPT}_{i}^{U}(t)+\frac{\delta}{5}(T-t)
\displaystyle\leq TtTOPTi+2δ5(Tt).\displaystyle\frac{T-t}{T}\text{OPT}_{i}+\frac{2\delta}{5}(T-t).

where the first line comes from weak duality, the second line comes from re-arranging terms, the third line comes from 𝒃(t)𝒃δ5d3/2\|\bm{b}^{(t)}-\bm{b}\|_{\infty}\leq\frac{\delta}{5d^{3/2}} and Hölder’s inequality, i.e., |(𝑩(t)(Tt)𝒃)𝒚~|𝑩(t)(Tt)𝒃𝒚~1|(\bm{B}^{(t)}-(T-t)\bm{b})^{\top}\tilde{\bm{y}}|\leq\|\bm{B}^{(t)}-(T-t)\bm{b}\|_{\infty}\|\tilde{\bm{y}}\|_{1}, and the last line is obtained from inequality (34). Analogously, we can show that

OPT~ULP(t)TtTOPTLP2δ5(Tt).\tilde{\text{OPT}}^{U}_{\text{LP}}(t)\geq\frac{T-t}{T}{\text{OPT}}_{\text{LP}}-\frac{2\delta}{5}(T-t).

Thus, since OPTiOPTLPδT\text{OPT}_{i}\leq\text{OPT}_{\text{LP}}-\delta T, we have

OPT~ULP(t)\displaystyle\tilde{\text{OPT}}^{U}_{\text{LP}}(t)\geq TtTOPTLP2δ5(Tt)\displaystyle\frac{T-t}{T}{\text{OPT}}_{\text{LP}}-\frac{2\delta}{5}(T-t)
\displaystyle\geq OPT~Ui(t)+δ5(Tt)\displaystyle\tilde{\text{OPT}}^{U}_{i}(t)+\frac{\delta}{5}(T-t)
>\displaystyle> OPT~Ui(t),\displaystyle\tilde{\text{OPT}}^{U}_{i}(t),

for all ii\in\mathcal{I}^{*}. Thus we complete the proof by combining the feasibility and the optimality of 𝒙^(t).\hat{\bm{x}}(t).

C3 Relaxation of Warm Start Assumption

In this subsection, we relax Assumption 3 for the analysis in Section C1. Indeed, the following proposition shows that the warm start condition in Assumption 3 will be automatically satisfied within 400logTχθ3\frac{400\log T}{\chi\theta^{3}} time periods after Phase I, where θ\theta is the threshold parameter in Assumption 3. The idea of the proof is straightforward: at the beginning of Phase II, although the parameter estimates may not satisfy the warm start condition, they ensure that each arm will be played with at least a certain probability. Then, by continually playing the arms, the corresponding estimates for the optimal arms will be gradually refined until the warm start condition is met. The refinement takes at most 400logTχθ3\frac{400\log T}{\chi\theta^{3}} time periods with high probability.

Proposition 6.

In Phase II of Algorithm 1, the warm start condition in Assumption 3 will be achieved automatically after 400logTχθ3\frac{400\log T}{\chi\theta^{3}} time periods (excluding the periods in Phase I) with probability no less than 120mdT1-\frac{20md}{T}.

Proof.

First, we show that each arm will be played a linear number of times even if Phase II does not initialize with a warm start, i.e., there exists an arm ii\in\mathcal{I}^{*} such that 22logTni(t)>θ2\sqrt{\frac{2\log T}{n_{i}(t)}}>\theta for some tt in Phase II. Without loss of generality, we assume that all UCB estimates are non-increasing and all LCB estimates are non-decreasing, since at each time tt we can use the minimal UCB estimate and the maximal LCB estimate of each quantity over times 11 to t1t-1. Thus, OPTULP(t)\text{OPT}^{U}_{\text{LP}}(t) and OPTUi(t)\text{OPT}^{U}_{i}(t) are non-increasing and OPTLLP(t)\text{OPT}^{L}_{\text{LP}}(t) is non-decreasing.

Now, consider the following LPs for all ii\in\mathcal{I}^{*} (the same as the LPs used in Lemma 7):

OPT~iU(t)=max𝒙\displaystyle\tilde{\text{OPT}}_{i}^{U}(t)=\max_{\bm{x}}\ \ (𝝁U(t1))𝒙,\displaystyle\left(\bm{\mu}^{U}(t-1)\right)^{\top}\bm{x},
s.t. 𝑪L(t1)𝒙𝑩(t1),\displaystyle\bm{C}^{L}(t-1)\bm{x}\leq\bm{B}^{(t-1)},
xi=0,𝒙𝟎,\displaystyle x_{i}=0,\ \bm{x}\geq\bm{0},
OPT~ULP(t)=max𝒙\displaystyle\tilde{\text{OPT}}^{U}_{\text{LP}}(t)=\max_{\bm{x}}\ \ (𝝁U(t1))𝒙,\displaystyle\left(\bm{\mu}^{U}(t-1)\right)^{\top}\bm{x},
s.t. 𝑪L(t1)𝒙𝑩(t1),\displaystyle\bm{C}^{L}(t-1)\bm{x}\leq\bm{B}^{(t-1)},
𝒙𝟎.\displaystyle\bm{x}\geq\bm{0}.

Note that 𝑩(t)Tt𝒃2tT\|\frac{\bm{B}^{(t)}}{T-t}-\bm{b}\|_{\infty}\leq\frac{2t}{T} if tT2t\leq\frac{T}{2}. With a proof similar to that of Lemma 7, we can show that

1Tt|OPT~iU(t)TtTOPTiU(t)|min{χθ,δθ}45,\displaystyle\frac{1}{T-t}|\tilde{\text{OPT}}_{i}^{U}(t)-\frac{T-t}{T}\text{OPT}_{i}^{U}(t)|\leq\frac{\min\{\chi\theta,\delta\theta\}}{45},
1Tt|OPT~ULP(t)TtTOPTLPU(t)|min{χθ,δθ}45\displaystyle\frac{1}{T-t}|\tilde{\text{OPT}}^{U}_{\text{LP}}(t)-\frac{T-t}{T}\text{OPT}_{\text{LP}}^{U}(t)|\leq\frac{\min\{\chi\theta,\delta\theta\}}{45}

for all tbmin{χθ,δθ}90dTt\leq\frac{b\min\{\chi\theta,\delta\theta\}}{90d}T. Then, by the triangle inequality, we have, for all tt in Phase II,

OPT~ULP(t)OPT~iU(t)\displaystyle\tilde{\text{OPT}}^{U}_{\text{LP}}(t)-\tilde{\text{OPT}}_{i}^{U}(t)\geq TtTOPTLPU(t)TtTOPTiU(t)2min{χθ,δθ}45(Tt)\displaystyle\frac{T-t}{T}\text{OPT}_{\text{LP}}^{U}(t)-\frac{T-t}{T}\text{OPT}_{i}^{U}(t)-\frac{2\min\{\chi\theta,\delta\theta\}}{45}(T-t) (37)
\displaystyle\geq TtTOPTLPU(t)TtTOPTLPL(t)2min{χθ,δθ}45(Tt),\displaystyle\frac{T-t}{T}\text{OPT}_{\text{LP}}^{U}(t)-\frac{T-t}{T}\text{OPT}_{\text{LP}}^{L}(t)-\frac{2\min\{\chi\theta,\delta\theta\}}{45}(T-t),

where the second line is obtained by

OPTiU(t)<OPTLLP(t)\text{OPT}_{i}^{U}(t)<\text{OPT}^{L}_{\text{LP}}(t)

after Phase I. This motivates us to focus on the difference between OPTLPU(t)\text{OPT}_{\text{LP}}^{U}(t) and OPTLPL(t)\text{OPT}_{\text{LP}}^{L}(t). With the proof of Lemma 2, we can show that, with probability no less than 14mT1-\frac{4m}{T},

μiμ^i(t)+3logT2ni(t),\displaystyle\mu_{i}\leq\hat{\mu}_{i}(t)+\sqrt{\frac{3\log T}{2n_{i}(t)}},

holds for all i[m]i\in[m] and t[T]t\in[T]. Notice that 𝒙\bm{x}^{*} is also a feasible solution to LP (7). Thus, we have

OPTLPU(t)\displaystyle\text{OPT}_{\text{LP}}^{U}(t)\geq i=1mμiU(t)xi\displaystyle\sum\limits_{i=1}^{m}\mu_{i}^{U}(t)x_{i}^{*}
=\displaystyle= i=1m(μ^iU(t)+3logT2ni(t))xi+i=1m(2logTni(t)3logT2ni(t))xi\displaystyle\sum\limits_{i=1}^{m}\left(\hat{\mu}_{i}^{U}(t)+\sqrt{\frac{3\log T}{2n_{i}(t)}}\right)x_{i}^{*}+\sum\limits_{i=1}^{m}\left(\sqrt{\frac{2\log T}{n_{i}(t)}}-\sqrt{\frac{3\log T}{2n_{i}(t)}}\right)x_{i}^{*}
\displaystyle\geq OPTLP+i=1m(2logTni(t)3logT2ni(t))χT\displaystyle\text{OPT}_{\text{LP}}+\sum\limits_{i=1}^{m}\left(\sqrt{\frac{2\log T}{n_{i}(t)}}-\sqrt{\frac{3\log T}{2n_{i}(t)}}\right)\chi T
\displaystyle\geq OPTLP+115θχT\displaystyle\text{OPT}_{\text{LP}}+\frac{1}{15}\theta\chi T

hold with probability no less than 18mdT1-\frac{8md}{T} for all tt in Phase II. Here, the first line comes from the definition of OPTLPU(t)\text{OPT}_{\text{LP}}^{U}(t), the second line comes from re-arranging terms, the third line comes from xiχTx_{i}^{*}\geq\chi T and μiμ^i(t)+3logT2ni(t)\mu_{i}\leq\hat{\mu}_{i}(t)+\sqrt{\frac{3\log T}{2n_{i}(t)}} for all ii, and the last line comes from direct calculation. Plugging this result into inequality (37), we have

OPT~ULP(t)OPT~iU(t)\displaystyle\tilde{\text{OPT}}^{U}_{\text{LP}}(t)-\tilde{\text{OPT}}_{i}^{U}(t)\geq min{χθ,δθ}45(Tt),\displaystyle\frac{\min\{\chi\theta,\delta\theta\}}{45}(T-t),

for all ii\in\mathcal{I}^{*}. Denote the optimal solution corresponding to OPT~ULP(t)\tilde{\text{OPT}}^{U}_{\text{LP}}(t) as 𝒙~(t)\tilde{\bm{x}}(t) and notice that

OPT~ULP(t)OPT~iU(t)+x~i(t)\tilde{\text{OPT}}^{U}_{\text{LP}}(t)\leq\tilde{\text{OPT}}_{i}^{U}(t)+\tilde{x}_{i}(t)

for all ii\in\mathcal{I}^{*}. We then have

𝒙~(t)χθ45(Tt).\tilde{\bm{x}}(t)\geq\frac{\chi\theta}{45}(T-t).

Thus, in expectation, the algorithm plays each arm with probability at least χθ45\frac{\chi\theta}{45} in each step if tbmin{χθ,δθ}T90dt\leq\frac{b\min\{\chi\theta,\delta\theta\}T}{90d} during Phase II. Then, with a proof similar to that of Lemma 5, we can show that there exists a T0T_{0} depending polynomially on mm, dd, 1/b1/b, 1/σ1/\sigma, 1/χ1/\chi and 1/δ1/\delta such that, for all T>T0T>T_{0}, the algorithm plays each arm at least 8logTθ2\frac{8\log T}{\theta^{2}} times within 400logTχθ3\frac{400\log T}{\chi\theta^{3}} steps with probability no less than 120mdT1-\frac{20md}{T}. Then, the algorithm achieves the warm start. Moreover, since the algorithm only plays arms in \mathcal{I}^{*} during this process and TT0T\geq T_{0} is large enough, it will not incur additional regret. ∎

C4 Analysis of the Non-Binding Constraints

We analyze the non-binding constraints in the same way as the binding constraints. Previously, we derived an upper bound for 𝔼[Tτ]\mathbb{E}[T-\tau] in Section C1 under the assumption that all the constraints are binding. With the presence of non-binding constraints, we only need to show that the non-binding constraints will not be exhausted before any binding constraint. To adapt the analysis in Section C1 to the case with non-binding constraints, we only need to change the event t\mathcal{E}_{t} to the union of t\mathcal{E}_{t} and the following event ~t\tilde{\mathcal{E}}_{t}. All the remaining parts in Section C1 can then be extended to the case with non-binding constraints.

Define

~t{b(s)j[bϵ,b+ϵ] for all s[t1] and for all binding constraints j𝒥B(s)j𝟎 for all s[t1] and for all non-binding constraints j𝒥,B(t)j𝟎 does not hold for some non-binding constraint j𝒥}.\tilde{\mathcal{E}}_{t}\coloneqq\left\{\begin{matrix}{b}^{(s)}_{j}\in[b-\epsilon,b+\epsilon]\text{ for all $s\in[t-1]$}\text{ and for all binding constraints $j\in\mathcal{J}^{*}$, }\\ B^{(s)}_{j}\geq\bm{0}\text{ for all $s\in[t-1]$}\text{ and for all non-binding constraints $j\in\mathcal{J}^{\prime}$,}\\ B^{(t)}_{j}\geq\bm{0}\text{ does not hold}\text{ for some non-binding constraint $j\in\mathcal{J}^{\prime}$}\end{matrix}\right\}.

In this way, the event ~t\tilde{\mathcal{E}}_{t} describes the “bad” event that the binding dimensions of the process 𝒃(s)\bm{b}^{(s)} stay within the region 𝒵\mathcal{Z} for s=1,,t1s=1,...,t-1 and the non-binding dimensions of the process 𝑩(s)\bm{B}^{(s)} are not exhausted before time t1t-1, but a certain non-binding constraint is exhausted at time tt. The following lemma establishes that such an event happens with small probability. It thus implies that as long as 𝒃(t)𝒵\bm{b}^{(t)}\in\mathcal{Z}, the non-binding constraints will not be exhausted. Its proof shares the same structure as that of Lemma 5.

Lemma 8.

There exists a constant T¯\underline{T} depending on dd and χ\chi such that for all TT¯T\geq\underline{T}, the following inequality holds

(t=1T~t)4dT3.\mathbb{P}\left(\bigcup\limits_{t=1}^{T}\tilde{\mathcal{E}}_{t}\right)\leq\frac{4d}{T^{3}}.

Proof of Lemma 8

Proof.

Based on the definition of δ\delta, we have

i=1mcijxiTbδ\sum\limits_{i=1}^{m}\frac{c_{ij}x_{i}^{*}}{T}\leq b-\delta

holds for all non-binding constraints jj, where 𝒙=(x1,,xm)\bm{x}^{*}=(x_{1}^{*},...,x_{m}^{*})^{\top} denotes the optimal solution to (1). Recall that for non-binding constraints, we only need to ensure that they are not depleted before τ\tau^{\prime}. The implication of this inequality is that the non-binding constraints have a δT\delta T amount of resource surplus to prevent depletion.

Intuitively, if the average consumption of all binding resources is close to bb, i.e., bj(s)[bϵ,b+ϵ]b_{j}^{(s)}\in[b-\epsilon,b+\epsilon] for each binding resource jj, then the average fraction of times each optimal arm ii is played will be close to xiT\frac{x_{i}^{*}}{T}. Then, the average consumption of each non-binding resource jj will be close to i=1mcijxiTbδ\sum_{i=1}^{m}\frac{c_{ij}x_{i}^{*}}{T}\leq b-\delta. Following this intuition, we bound the probability of the statement in this lemma by concentration inequalities.

First, we estimate the number of times each arm is played. With a proof similar to that of Lemma 5, we can show that, with probability no less than 12dT31-\frac{2d}{T^{3}}, the following inequality holds for all i[m]i\in[m] and s[T]s\in[T]:

ni(s)l=1sx~i(l)+2slogT.n_{i}(s)\leq\sum\limits_{l=1}^{s}\tilde{x}_{i}(l)+\sqrt{2s\log T}.

Based on the algorithm, if b(s)j[bϵ,b+ϵ]b^{(s)}_{j}\in[b-\epsilon,b+\epsilon] for all binding resources, we know that,

𝒙~(s)=(𝑪L(s))𝒥,1(𝒃(s))𝒥,\tilde{\bm{x}}(s)=({\bm{C}}^{L}(s))_{\mathcal{J}^{*},\mathcal{I}^{*}}^{-1}({\bm{b}}^{(s)})_{\mathcal{J}^{*}},

for all sτn<τbs\leq\tau_{n}<\tau_{b}. Thus, if b(s)j[bϵ,b+ϵ]b^{(s)}_{j}\in[b-\epsilon,b+\epsilon], for all sts\leq t and binding resources, we have

𝒏(s)l=1s(𝑪L(l))𝒥,1(𝒃(l))𝒥+2slogT𝟏,\bm{n}(s)\leq\sum\limits_{l=1}^{s}({\bm{C}}^{L}(l))_{\mathcal{J}^{*},\mathcal{I}^{*}}^{-1}({\bm{b}}^{(l)})_{\mathcal{J}^{*}}+\sqrt{2s\log T}\bm{1},

where 𝒏(s)={ni(s)}i=1m\bm{n}(s)=\{n_{i}(s)\}_{i=1}^{m}. Moreover, for all tTt\leq T, on the event t\mathcal{E}_{t}, we know that no resource is used up.

Second, we estimate the resource consumption of the non-binding resources. Let ~s=(𝒓l,𝑪l,il+1)l=1s\tilde{\mathcal{H}}_{s}=(\bm{r}_{l},\bm{C}_{l},i_{l+1})_{l=1}^{s}. Applying a concentration inequality, we have

(l=1s𝑪l𝔼[𝑪l|~l1]>2slogT)\displaystyle\mathbb{P}\left(\sum\limits_{l=1}^{s}\bm{C}_{l}-\mathbb{E}[\bm{C}_{l}|\tilde{\mathcal{H}}_{l-1}]>\sqrt{2s\log T}\right)\leq 2exp{4slogTs}\displaystyle 2\exp\left\{-\frac{4s\log T}{s}\right\}
=\displaystyle= 2exp{4logT}\displaystyle 2\exp\{-4\log T\}
=\displaystyle= 2T4.\frac{2}{T^{4}}.

Thus, with probability no less than 12dT31-\frac{2d}{T^{3}}, we have l=1s𝑪l𝔼[𝑪l|~l]2slogT\sum\limits_{l=1}^{s}\bm{C}_{l}-\mathbb{E}[\bm{C}_{l}|\tilde{\mathcal{H}}_{l}]\leq\sqrt{2s\log T} for all s[T]s\in[T] and j[d]j\in[d]. Moreover, since 𝔼[𝑪s|~s1]=𝒄is\mathbb{E}[\bm{C}_{s}|\tilde{\mathcal{H}}_{s-1}]=\bm{c}_{i_{s}} for all stT2s\leq t\leq T-2 on the event ~t\tilde{\mathcal{E}}_{t}, we have

s=1t𝔼[𝑪jis,s|~l]=\displaystyle\sum\limits_{s=1}^{t}\mathbb{E}[\bm{C}_{ji_{s},s}|\tilde{\mathcal{H}}_{l}]= 𝑪𝒏(t)\displaystyle\bm{C}\bm{n}(t)
\displaystyle\leq 𝑪(s=1t(𝑪L(s))𝒥,1(𝒃(s))𝒥+2tlogT𝟏)\displaystyle\bm{C}\left(\sum\limits_{s=1}^{t}({\bm{C}}^{L}(s))_{\mathcal{J}^{*},\mathcal{I}^{*}}^{-1}({\bm{b}}^{(s)})_{\mathcal{J}^{*}}+\sqrt{2t\log T}\bm{1}\right)
=\displaystyle= s=1t𝑪(𝑪1𝒥,𝑪1𝒥,(𝑪L(s)𝑪)𝒥,(𝑪L(s))1𝒥,)(𝒃(s))𝒥\sum\limits_{s=1}^{t}\bm{C}\left(\bm{C}^{-1}_{\mathcal{J}^{*},\mathcal{I}^{*}}-\bm{C}^{-1}_{\mathcal{J}^{*},\mathcal{I}^{*}}({\bm{C}}^{L}(s)-\bm{C})_{\mathcal{J}^{*},\mathcal{I}^{*}}\left({\bm{C}}^{L}(s)\right)^{-1}_{\mathcal{J}^{*},\mathcal{I}^{*}}\right)({\bm{b}}^{(s)})_{\mathcal{J}^{*}}
+2tlogT𝑪𝟏\displaystyle+\sqrt{2t\log T}\bm{C}\bm{1}
\displaystyle\leq tT𝑪𝒙+χ𝑪𝟏5dt+2tlogT𝑪𝟏+χt5𝑪𝟏\displaystyle\frac{t}{T}\bm{C}\bm{x}^{*}+\frac{\chi\bm{C}\bm{1}}{5d}t+\sqrt{2t\log T}\bm{C}\bm{1}+\frac{\chi t}{5}\bm{C}\bm{1}
\displaystyle\leq tT𝑪𝒙+2χt5𝟏+2d2tlogT𝟏\displaystyle\frac{t}{T}\bm{C}\bm{x}^{*}+\frac{2\chi t}{5}\bm{1}+\sqrt{2d^{2}t\log T}\bm{1}

holds on the event

~t\\displaystyle\tilde{\mathcal{E}}_{t}\backslash {ni(s)>l=1sx~i(l)+2slogT for some s[T] and i[m]},\displaystyle\left\{n_{i}(s)>\sum\limits_{l=1}^{s}\tilde{x}_{i}(l)+\sqrt{2s\log T}\text{ for some $s\in[T]$ and $i\in[m]$}\right\},

where the first line in the inequalities comes from 𝔼[𝑪is,s|~s1]=𝒄is\mathbb{E}[\bm{C}_{i_{s},s}|\tilde{\mathcal{H}}_{s-1}]=\bm{c}_{i_{s}}, the second line comes from the analysis of 𝒏(s)\bm{n}(s), the third line comes from the decomposition of (𝑪L(s))1𝒥,({\bm{C}}^{L}(s))^{-1}_{\mathcal{J}^{*},\mathcal{I}^{*}}, the fifth line comes from 𝒃(s)𝒵\bm{b}^{(s)}\in\mathcal{Z}, and the last line comes from 𝑪𝟏d𝟏\bm{C}\bm{1}\leq d\bm{1}. Henceforth, on the event

~t\\displaystyle\tilde{\mathcal{E}}_{t}\backslash {ni(s)>l=1sx~i(l)+2slogT or \displaystyle\left\{n_{i}(s)>\sum\limits_{l=1}^{s}\tilde{x}_{i}(l)+\sqrt{2s\log T}\text{ or }\right.
l=1sCjil,l𝔼[Cjil,l|~l]>2slogT for some s[T] i[m] and j[d]},\displaystyle\left.\sum\limits_{l=1}^{s}{C}_{ji_{l},l}-\mathbb{E}[{C}_{ji_{l},l}|\tilde{\mathcal{H}}_{l}]>\sqrt{2s\log T}\text{ for some $s\in[T]$ $i\in[m]$ and $j\in[d]$}\right\},

and tT2t\leq T-2 we have

s=1tCj(s)\displaystyle\sum\limits_{s=1}^{t}C_{j}(s)\leq tT(𝑪𝒙)j+2χt5+22d2tlogT\displaystyle\frac{t}{T}\left(\bm{C}\bm{x}^{*}\right)_{j}+\frac{2\chi t}{5}+2\sqrt{2d^{2}t\log T}
\displaystyle\leq (𝑪𝒙)j+2χt5+22d2tlogT\displaystyle\left(\bm{C}\bm{x}^{*}\right)_{j}+\frac{2\chi t}{5}+2\sqrt{2d^{2}t\log T}
\displaystyle\leq BχTχ+2χT5+22d2TlogT\displaystyle B-\chi T-\chi+\frac{2\chi T}{5}+2\sqrt{2d^{2}T\log T}
\displaystyle\leq BχT5,\displaystyle B-\frac{\chi T}{5},

if T50d2logTχ2T\geq\frac{50d^{2}\log T}{\chi^{2}}. Thus, as T10χT\geq\frac{10}{\chi}, the remaining amount of each non-binding resource is no less than 22 at the end of time tt, implying 𝑩(t+1)𝟏\bm{B}^{(t+1)}\geq\bm{1}. Thus, we have

~t\displaystyle\tilde{\mathcal{E}}_{t}\subset {ni(s)>l=1sx~i(l)+2slogT or \displaystyle\left\{n_{i}(s)>\sum\limits_{l=1}^{s}\tilde{x}_{i}(l)+\sqrt{2s\log T}\text{ or }\right.
l=1sCjil,l𝔼[Cjil,l|~l]>2slogT for some s[T] i[m] and j[d]},\displaystyle\left.\sum\limits_{l=1}^{s}{C}_{ji_{l},l}-\mathbb{E}[{C}_{ji_{l},l}|\tilde{\mathcal{H}}_{l}]>\sqrt{2s\log T}\text{ for some $s\in[T]$ $i\in[m]$ and $j\in[d]$}\right\},

and consequently,

t=0T2~t\displaystyle\bigcup\limits_{t=0}^{T-2}\tilde{\mathcal{E}}_{t}\subset {ni(s)>l=1sx~i(l)+2slogT or \displaystyle\left\{n_{i}(s)>\sum\limits_{l=1}^{s}\tilde{x}_{i}(l)+\sqrt{2s\log T}\text{ or }\right.
l=1sCjil,l𝔼[Cjil,l|~l]>2slogT for some s[T] i[m] and j[d]},\displaystyle\left.\sum\limits_{l=1}^{s}{C}_{ji_{l},l}-\mathbb{E}[{C}_{ji_{l},l}|\tilde{\mathcal{H}}_{l}]>\sqrt{2s\log T}\text{ for some $s\in[T]$ $i\in[m]$ and $j\in[d]$}\right\},
(t=0T2~t)\displaystyle\mathbb{P}\left(\bigcup\limits_{t=0}^{T-2}\tilde{\mathcal{E}}_{t}\right)\leq (ni(s)>l=1sx~i(l)+2slogT for some s[T] and i[m])\displaystyle\mathbb{P}\left(n_{i}(s)>\sum\limits_{l=1}^{s}\tilde{x}_{i}(l)+\sqrt{2s\log T}\text{ for some $s\in[T]$ and $i\in[m]$}\right)
+(l=1sCjil,l𝔼[Cjil,l|~l]>2slogT for some s[T] i[m] and j[d])\displaystyle+\mathbb{P}\left(\sum\limits_{l=1}^{s}{C}_{ji_{l},l}-\mathbb{E}[{C}_{ji_{l},l}|\tilde{\mathcal{H}}_{l}]>\sqrt{2s\log T}\text{ for some $s\in[T]$ $i\in[m]$ and $j\in[d]$}\right)
\displaystyle\leq 4dT3.\frac{4d}{T^{3}}. ∎
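The final comparison in the proof above, bounding B − χT − χ + 2χT/5 + 2√(2d²T log T) by B − χT/5, is elementary arithmetic. It can be verified directly with illustrative values of B, χ, d, and T (hypothetical, chosen only to satisfy T ≥ 50d² log T / χ²):

```python
import math

# Illustrative parameters (not from the paper), chosen so that
# T >= 50 * d^2 * log(T) / chi^2 holds.
d, chi, T = 2, 0.5, 200_000
B = 0.5 * T                      # any budget level works for the comparison
assert T >= 50 * d ** 2 * math.log(T) / chi ** 2

lhs = (B - chi * T - chi + 2 * chi * T / 5
       + 2 * math.sqrt(2 * d ** 2 * T * math.log(T)))
rhs = B - chi * T / 5
```

The condition on T guarantees that the 2√(2d²T log T) term, which grows like √(T log T), is dominated by the linear slack 2χT/5, so `lhs` stays below `rhs`.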

Appendix D Proof of Proposition 5

Proof.

The regret analysis can be accomplished by a combination of Proposition 1, Proposition 3 and Proposition 4. Proposition 1 shows that

RegretTπ(𝒫,𝑩)ini(t)Δi+𝔼[𝑩(τ)]𝒚.\displaystyle\text{Regret}_{T}^{\pi}(\mathcal{P},\bm{B})\leq\sum_{i\in\mathcal{I}^{\prime}}n_{i}(t)\Delta_{i}+\mathbb{E}\left[\bm{B}^{(\tau)}\right]^{\top}\bm{y}^{*}. (38)

First, we bound the first term on the right-hand side of inequality (38). From duality, we know that 𝑩𝒚T\bm{B}^{\top}\bm{y}^{*}\leq T. Thus, we have

𝒚\displaystyle\bm{y}^{*} 1b,Δi𝟏𝒚db.\displaystyle\leq\frac{1}{b},\ \ \ \Delta_{i}\leq\bm{1}^{\top}\bm{y}^{*}\leq\frac{d}{b}.

Moreover, Proposition 3 shows that, with probability no less than 14mdT21-\frac{4md}{T^{2}}, the algorithm only plays arms in \mathcal{I}^{\prime} during Phase I and that Phase I terminates within

O((2+1b)2logTδ2)O\left(\left(2+\frac{1}{b}\right)^{2}\frac{\log T}{\delta^{2}}\right)

time periods. Thus,

ini(t)Δi=O((2+1b)2mdlogTbδ2).\displaystyle\sum_{i\in\mathcal{I}^{\prime}}n_{i}(t)\Delta_{i}=O\left(\left(2+\frac{1}{b}\right)^{2}\frac{md\log T}{b\delta^{2}}\right). (39)

Next, we bound the second term in inequality (38). Applying Proposition 4, we have

𝔼[𝑩(τ)]𝒚\displaystyle\mathbb{E}\left[\bm{B}^{(\tau)}\right]^{\top}\bm{y}^{*}\leq j𝒥𝔼[Bj(τ)]yj\displaystyle\sum\limits_{j\in\mathcal{J}^{*}}\mathbb{E}\left[B_{j}^{(\tau)}\right]{y}^{*}_{j} (40)
\displaystyle\leq O(d4b2min{χ2,δ2}min{1,σ2}).O\left(\frac{d^{4}}{b^{2}\min\{\chi^{2},\delta^{2}\}\min\{1,\sigma^{2}\}}\right). (41)

Finally, plugging (39) and (40) into (38), we complete the proof. ∎