The Symmetry between Arms and Knapsacks: A Primal-Dual Approach for Bandits with Knapsacks
‡ Institute for Computational and Mathematical Engineering, Stanford University
◇ Department of Management Science and Engineering, Stanford University
Abstract
In this paper, we study the bandits with knapsacks (BwK) problem and develop a primal-dual based algorithm that achieves a problem-dependent logarithmic regret bound. The BwK problem extends the multi-armed bandit (MAB) problem to model the resource consumption associated with playing each arm, and the existing BwK literature has mainly focused on deriving asymptotically optimal distribution-free regret bounds. We first study the primal and dual linear programs underlying the BwK problem. From this primal-dual perspective, we discover a symmetry between arms and knapsacks, and then propose a new notion of sub-optimality measure for the BwK problem. The sub-optimality measure highlights the important role of knapsacks in determining algorithm regret and inspires the design of our two-phase algorithm. In the first phase, the algorithm identifies the optimal arms and the binding knapsacks, and in the second phase, it exhausts the binding knapsacks via playing the optimal arms through an adaptive procedure. Our regret upper bound involves the proposed sub-optimality measure, and it has a logarithmic dependence on the length of horizon $T$ and a polynomial dependence on $m$ (the number of arms) and $d$ (the number of knapsacks). To the best of our knowledge, this is the first problem-dependent logarithmic regret bound for solving the general BwK problem.
1 Introduction
The Multi-Armed Bandit (MAB) problem is a problem in which a limited amount of resource must be allocated between competing (alternative) choices in a way that maximizes the expected gain (Gittins et al., 2011). It is a benchmark problem for decision making under uncertainty that has been studied for nearly a century. As a prototypical reinforcement learning problem, the MAB problem exemplifies the exploration–exploitation tradeoff dilemma (Weber et al., 1992). The original problem, first formulated in its predominant version in (Robbins, 1952), has inspired a recent line of research that considers additional constraints that reflect more accurately the reality of the online decision making process. The Bandits with Knapsacks (BwK) problem was introduced by Badanidiyuru et al. (2013) to allow more general constraints on the decisions across time, in addition to the customary limitation on the time horizon. The BwK problem, as a general framework, encompasses a wide range of applications, including dynamic pricing and revenue management (Besbes and Zeevi, 2012), online advertisement (Mehta et al., 2005), network routing (Agrawal et al., 2014), etc.
While the existing BwK literature (Badanidiyuru et al., 2013; Agrawal and Devanur, 2014) has derived algorithms that achieve optimal problem-independent regret bounds, a problem-dependent bound that captures the optimal performance of an algorithm on a specific BwK problem instance remains an open question. For the setting of the standard MAB problem, the problem-dependent bound has been well understood, and an upper bound with logarithmic dependence on $T$ can be achieved by both UCB-based algorithms (Auer et al., 2002) and Thompson sampling-based algorithms (Agrawal and Goyal, 2012). In this paper, we focus on developing a problem-dependent bound for the BwK problem and identify parameters that characterize the hardness of a BwK problem instance.
Two existing works along this line are (Flajolet and Jaillet, 2015) and (Sankararaman and Slivkins, 2020). The paper (Flajolet and Jaillet, 2015) considers several specific settings of the problem and, for the general BwK problem, achieves a logarithmic regret bound that is exponential in $m$ and $d$ (with other problem-dependent parameters omitted), where $m$ is the number of arms and $d$ is the number of knapsacks/resource constraints. In addition, the results in (Flajolet and Jaillet, 2015) require the knowledge of some parameters of the problem instance a priori. The recent work (Sankararaman and Slivkins, 2020) considers the BwK problem under the assumptions that there is only one single knapsack/resource constraint and that there is only one single optimal arm. In contrast to these two pieces of work, we consider the problem in its full generality and do not assume any prior knowledge of the problem instance. We will further compare with their results after we present our regret bound.
Specifically, we adopt a primal-dual perspective to study the BwK problem. Our treatment is new in that we highlight the effects of resources/knapsacks on regret from the dual perspective and define the sub-optimality measure based on the primal and dual problems jointly. To this end, we first derive a generic upper bound that works for all BwK algorithms. The upper bound consists of two elements: (i) the number of times for which a sub-optimal arm is played; (ii) the remaining knapsack resource at the end of the horizon. It emphasizes that the arms and knapsacks are of equal importance in determining the regret of a BwK algorithm. By further exploiting the structure of the primal and dual LPs, we develop a new sub-optimality measure for the BwK problem which can be viewed as a generalization of the sub-optimality measure for the MAB problem first derived in (Lai and Robbins, 1985). The sub-optimality measure accounts for both arms and knapsacks, and it aims to distinguish optimal arms from non-optimal arms, and binding knapsacks from non-binding knapsacks. We use this measure as a key characterization of the hardness of a BwK problem instance.
Inspired by these findings, we propose a two-phase algorithm for the problem. The first phase of our algorithm is elimination-based and its objective is to identify the optimal arms and binding knapsacks of the BwK problem. The second phase of our algorithm utilizes the output of the first phase and uses the optimal arms to exhaust the remaining resources through an adaptive procedure. Our algorithm and analysis are notable for their full generality: we only make a mild assumption on the non-degeneracy of the underlying LP. In addition, the algorithm requires no prior knowledge of the underlying problem instance.
Other related literature: (Agrawal and Devanur, 2015; Agrawal et al., 2016) study the contextual BwK problem where the reward and resource consumption are both linear in a context vector. (Immorlica et al., 2019; Kesselheim and Singla, 2020) study the BwK problem under an adversarial setting. (Ferreira et al., 2018) analyzes Thompson sampling-based algorithms for the BwK problem.
2 Model and Setup
The problem of bandits with knapsacks (BwK) was first defined in Badanidiyuru et al. (2013), and the notations presented here are largely consistent with Badanidiyuru et al. (2013); Agrawal and Devanur (2014). Consider a fixed and known finite set of $m$ arms (possible actions) available to the decision maker, henceforth called the algorithm. There are $d$ types of resources and a finite time horizon $T$, where $T$ is known to the algorithm. In each time step $t$, the algorithm plays one of the $m$ arms, receives a reward, and consumes a certain amount of each resource $j$. The reward and consumption are revealed to the algorithm after it chooses the arm. The rewards and costs in every round are generated i.i.d. from some unknown fixed underlying distribution. More precisely, there are some fixed but unknown $\boldsymbol{\mu} \in [0,1]^m$ and $\boldsymbol{C} \in [0,1]^{d \times m}$ such that
where $\mu_i$ and $\boldsymbol{C}_i$ (the $i$-th column of $\boldsymbol{C}$) are the expected reward and the expected resource consumption of arm $i$. In the beginning of every time step $t$, the algorithm needs to pick an arm, using only the history of plays and outcomes until time step $t-1$. There is a hard capacity $B_j$ for the $j$-th type of resource. The algorithm stops at the earliest time $\tau$ when one or more of the resource constraints is violated, or when the time horizon ends, i.e., $\tau = T$. Its total reward is given by the sum of rewards in all rounds preceding $\tau$. The goal of the algorithm is to maximize the expected total reward. The values of $B_j$ are known to the algorithm, and without loss of generality, we make the following assumption.
Assumption 1.
We assume $B_j = B$ for all $j$ (by scaling the consumption matrix $\boldsymbol{C}$). Moreover, we assume the resource capacity scales linearly with the horizon, i.e., $B = bT$ for some constant $b > 0$.
The assumption is mainly for notational simplicity and it does not change the nature of the analysis in this paper. Given that our focus is to derive an asymptotic problem-dependent bound, it is natural to have the resource capacity scale linearly with the length of the horizon. Throughout this paper, we use bold symbols to denote vectors/matrices and normal symbols to denote scalars.
Furthermore, without loss of generality, a “null” arm is introduced to represent the time constraint (time horizon $T$). Specifically, the null arm yields no reward, consumes one unit of the time resource, and consumes none of the other resources. In this way, the first constraint captures the finiteness of the time horizon, and the “null” arm can be played with no reward achieved and with no cost induced to the other, factual constraints except for the time constraint.
Regret is defined as the difference between the total reward obtained by the algorithm and OPT, where OPT denotes the total expected reward of the optimal dynamic policy. In this paper, we are interested in the (expected) problem-dependent regret,
where $\pi$ denotes the algorithm, and $\mathcal{P}$ encapsulates all the parameters related to the distributions of reward and resource consumption, including $\boldsymbol{\mu}$ and $\boldsymbol{C}$. The expectation is taken with respect to the randomness of the reward and resource consumption.
3 Primal-dual Perspective for BwK
In this section, we present a generic regret upper bound and explore the properties of the underlying primal and dual LPs.
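For reference, a standard form of the underlying primal LP (1) is given below; the notation ($\boldsymbol{\mu}$ for expected rewards, $\boldsymbol{C}$ for the expected consumption matrix, $B$ for the capacity, $x_i$ for the expected total number of plays of arm $i$) is an assumption of this sketch.

```latex
% Primal LP (1): the fluid/LP relaxation of the BwK problem.
\begin{align*}
\mathrm{OPT} \;=\; \max_{\boldsymbol{x} \in \mathbb{R}^m} \;\;
    & \boldsymbol{\mu}^\top \boldsymbol{x} \tag{1}\\
\text{s.t.}\;\; & \boldsymbol{C} \boldsymbol{x} \le B \cdot \boldsymbol{1}, \\
                & \boldsymbol{x} \ge \boldsymbol{0}.
\end{align*}
```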
Let $\mathcal{I}^*$ and $\mathcal{I}^c$ denote the set of optimal basic variables and the set of optimal non-basic variables of (1), respectively. Let $\mathcal{J}^*$ and $\mathcal{J}^c$ denote the set of binding and non-binding constraints of (1), respectively.
So, we know $\mathcal{I}^* \cup \mathcal{I}^c = \{1,\dots,m\}$ and $\mathcal{J}^* \cup \mathcal{J}^c = \{1,\dots,d\}$. Accordingly, we call an arm $i \in \mathcal{I}^*$ an optimal arm and an arm $i \in \mathcal{I}^c$ a sub-optimal arm. Here and hereafter, we will refer to a knapsack as a constraint so that the terminology is more aligned with the LPs.
We make the following assumption on the LP’s optimal solution.
Assumption 2.
The LP (1) has a unique optimal solution. Moreover, the optimal solution is non-degenerate, i.e., all optimal basic variables are strictly positive and all non-binding constraints have strictly positive slack.
The assumption is a standard one in the LP literature, and any LP can be made to satisfy it with an arbitrarily small perturbation (Megiddo and Chandrasekaran, 1989). To interpret the non-degeneracy, note that the optimal solution to LP (1) plays only the optimal arms; when there is no linear dependency between the corresponding columns of $\boldsymbol{C}$, this results in the depletion of the binding resource constraints.
The dual problem of (1) is
$$\begin{aligned} \min_{\boldsymbol{y}} \ & B \cdot \boldsymbol{1}^\top \boldsymbol{y} \qquad (2)\\ \text{s.t. } \ & \boldsymbol{C}^\top \boldsymbol{y} \ge \boldsymbol{\mu}, \quad \boldsymbol{y} \ge \boldsymbol{0}. \end{aligned}$$
Denote its optimal solution as $\boldsymbol{y}^*$. From the LP's complementarity condition, the following relation holds under Assumption 2: the dual variable $y^*_j$ is strictly positive for each binding constraint $j$ and zero for each non-binding constraint $j$.
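The primal–dual relations above can be checked numerically on a toy instance (the instance itself is invented purely for illustration); the sketch below solves the primal and the dual with `scipy.optimize.linprog` and verifies strong duality and complementary slackness.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance (illustrative only): 3 arms, 3 constraints, per-period capacities b.
mu = np.array([1.0, 0.5, 0.3])            # expected rewards
C = np.array([[1.0, 1.0, 1.0],            # "time" constraint: every play costs 1
              [1.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])
b = np.array([1.0, 0.6, 0.9])             # capacities (normalized per period)

# Primal (1): max mu^T x  s.t.  C x <= b, x >= 0  (linprog minimizes, so negate).
primal = linprog(-mu, A_ub=C, b_ub=b, bounds=(0, None), method="highs")
x_star = primal.x

# Dual (2): min b^T y  s.t.  C^T y >= mu, y >= 0.
dual = linprog(b, A_ub=-C.T, b_ub=-mu, bounds=(0, None), method="highs")
y_star = dual.x

# Strong duality: the two optimal values coincide.
assert abs(-primal.fun - dual.fun) < 1e-8

# Complementary slackness: positive x_i <=> optimal arm,
# positive y_j <=> binding primal constraint.
optimal_arms = np.where(x_star > 1e-6)[0]
binding = np.where(b - C @ x_star < 1e-6)[0]
print("x* =", x_star, " y* =", y_star)
print("optimal arms:", optimal_arms, " binding constraints:", binding)
# Under non-degeneracy, the two sets have equal size.
assert len(optimal_arms) == len(binding)
```

On this instance the unique primal optimum plays the first two arms and both the time and the second resource constraint bind, so the dual price of the slack third constraint is zero.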
The following lemma summarizes the LP’s properties.
Lemma 1. Under Assumption 2, the number of optimal arms equals the number of binding constraints of (1).
3.1 A Generic Regret Upper Bound
We begin our discussion by deriving a new upper bound for a generic BwK algorithm. First, we define the knapsack process $\boldsymbol{B}_t$ as the remaining resource capacity at each time $t$. Specifically, we define $\boldsymbol{B}_0 = B \cdot \boldsymbol{1}$ and
$$\boldsymbol{B}_t = \boldsymbol{B}_{t-1} - \boldsymbol{c}_t$$
for $t = 1, \dots, T$. Recall that $\boldsymbol{c}_t$ is the (random) resource consumption vector at time $t$. The process depends on the underlying BwK algorithm. In addition, we follow the convention of the bandits literature and define the count process $n_i(t)$ as the number of times the $i$-th arm is played up to the end of time period $t$.
Proposition 1.
The following inequality holds for any BwK algorithm,
$$\mathrm{Regret} \;\le\; \sum_{i \in \mathcal{I}^c} \Delta_i \, \mathbb{E}[n_i(\tau)] \;+\; \mathbb{E}\big[(\boldsymbol{y}^*)^\top \boldsymbol{B}_\tau\big], \qquad (3)$$
where $\Delta_i = \boldsymbol{C}_i^\top \boldsymbol{y}^* - \mu_i \ge 0$ for each arm $i$, with $\Delta_i = 0$ for every optimal arm.
Here $\Delta_i$ is known as the reduced cost/profit in the LP literature and it quantifies the cost-efficiency of each arm (each basic variable in the LP). The upper bound in Proposition 1 is new to the existing literature and is generally applicable to all BwK algorithms. It consists of two parts: (i) the number of times a sub-optimal arm is played, multiplied by the corresponding reduced cost; (ii) the remaining resource at time $\tau$, either when some resource is depleted or at the end of the time horizon $T$. The first part is consistent with the classic bandits literature in that we always want to upper bound the number of times sub-optimal arms are played throughout the horizon. Each time a sub-optimal arm $i$ is played, a cost of $\Delta_i$ is incurred. Meanwhile, the second part is particular to the bandits-with-knapsacks setting and can easily be overlooked. Recall that $\tau$ is defined as the first time that any resource is exhausted (or the end of the horizon). It tells us that the left-over resources when the process terminates at time $\tau$ may also induce regret. For example, for two binding resources, it would be less desirable for one of them to have a lot remaining while the other is exhausted. Since the binding resources are critical in determining the optimal objective value of LP (1), intuitively, it is not profitable to waste any of them at the end of the procedure.
3.2 Symmetry between Arms and Knapsacks
From Proposition 1, we see the importance of the dual problem (2) in bounding an algorithm's regret. Now, we pursue further along this path and propose a new notion of sub-optimality for the BwK problem. Our sub-optimality measure is built upon both the primal and dual LPs, and it reveals the combinatorial structure of the problem. In the following, we define two classes of LPs, one for the arms and the other for the constraints.
First, for each arm $i$, define $\mathrm{OPT}_i$ as
$$\begin{aligned} \mathrm{OPT}_i = \max_{\boldsymbol{x}} \ & \boldsymbol{\mu}^\top \boldsymbol{x} \qquad (4)\\ \text{s.t. } \ & \boldsymbol{C}\boldsymbol{x} \le B \cdot \boldsymbol{1}, \quad x_i = 0, \quad \boldsymbol{x} \ge \boldsymbol{0}. \end{aligned}$$
By definition, $\mathrm{OPT}_i$ denotes the optimal objective value of an LP that takes the same form as the primal LP (1) except with the extra constraint $x_i = 0$. It represents the optimal objective value when the $i$-th arm is not allowed to be used. For a sub-optimal arm $i$, $\mathrm{OPT}_i = \mathrm{OPT}$, while for an optimal arm $i$, $\mathrm{OPT}_i < \mathrm{OPT}$. In this way, $\mathrm{OPT}_i$ characterizes the importance of arm $i$.
Next, for each constraint $j$, define $\mathrm{OPT}^j$ as
$$\begin{aligned} \mathrm{OPT}^j = \min_{\boldsymbol{y}} \ & B \cdot \big(\boldsymbol{1}^\top \boldsymbol{y} - 1\big) \qquad (5)\\ \text{s.t. } \ & \boldsymbol{C}^\top \boldsymbol{y} \ge \boldsymbol{\mu} + \boldsymbol{C}_{j\cdot}^\top, \quad \boldsymbol{y} \ge \boldsymbol{0}, \end{aligned}$$
where $\boldsymbol{C}_{j\cdot}$ denotes the $j$-th row of the constraint matrix $\boldsymbol{C}$. Though it may not be as obvious as the previous case of $\mathrm{OPT}_i$, the definition of $\mathrm{OPT}^j$ aims to characterize the bindingness/non-bindingness of constraint $j$. The point can be illustrated by looking at the primal problem of (5). From LP strong duality, we know
$$\begin{aligned} \mathrm{OPT}^j = \max_{\boldsymbol{x}} \ & \boldsymbol{\mu}^\top \boldsymbol{x} - \big(B - \boldsymbol{C}_{j\cdot}\boldsymbol{x}\big) \qquad (6)\\ \text{s.t. } \ & \boldsymbol{C}\boldsymbol{x} \le B \cdot \boldsymbol{1}, \quad \boldsymbol{x} \ge \boldsymbol{0}. \end{aligned}$$
Compared to the original primal LP (1), there is an extra term $-(B - \boldsymbol{C}_{j\cdot}\boldsymbol{x})$ in the objective function of LP (6). The extra term penalizes the left-over of the $j$-th constraint, and thus it encourages the usage of the $j$-th resource. A binding constraint $j$ is exhausted under the optimal solution to LP (1), so the penalization term has no effect, i.e., $\mathrm{OPT}^j = \mathrm{OPT}$. In contrast, for a non-binding constraint $j$, the extra term results in a reduction of the objective value, i.e., $\mathrm{OPT}^j < \mathrm{OPT}$. We note that one can introduce any positive weight on the penalization term to trade off between the reward and the left-over of the $j$-th constraint in (6), but the current version suffices for our discussion.
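To make the two families of LPs concrete, the sketch below evaluates $\mathrm{OPT}_i$ and $\mathrm{OPT}^j$ on a toy instance (invented for illustration): removing an optimal arm strictly lowers the value, removing a sub-optimal arm does not, and the penalized LP (6) separates binding from non-binding constraints.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative instance: arms 0 and 1 are optimal, arm 2 is sub-optimal;
# constraints 0 (time) and 1 bind, constraint 2 does not.
mu = np.array([1.0, 0.5, 0.3])
C = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])
b = np.array([1.0, 0.6, 0.9])

def solve_primal(extra_bounds=None, obj=None, offset=0.0):
    """Solve max obj^T x s.t. C x <= b, x >= 0 (plus per-arm bounds)."""
    c_obj = -(mu if obj is None else obj)
    bounds = extra_bounds if extra_bounds is not None else [(0, None)] * len(mu)
    res = linprog(c_obj, A_ub=C, b_ub=b, bounds=bounds, method="highs")
    return -res.fun + offset

OPT = solve_primal()                       # 0.8 on this instance

def opt_without_arm(i):
    """OPT_i: the same LP with the extra constraint x_i = 0 (arm i removed)."""
    bounds = [(0, None)] * len(mu)
    bounds[i] = (0, 0)
    return solve_primal(extra_bounds=bounds)

def opt_penalized(j):
    """OPT^j via its primal form (6): max mu^T x - (b_j - C_{j.} x)."""
    return solve_primal(obj=mu + C[j], offset=-b[j])

for i in range(3):
    print(f"OPT_{i} = {opt_without_arm(i):.3f}")   # < OPT iff arm i is optimal
for j in range(3):
    print(f"OPT^{j} = {opt_penalized(j):.3f}")     # = OPT iff constraint j binds
```

Here removing either optimal arm drops the value below 0.8, removing the sub-optimal arm leaves it unchanged, and only the non-binding third constraint is penalized below 0.8.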
The following proposition summarizes the properties of $\mathrm{OPT}_i$ and $\mathrm{OPT}^j$.
Proposition 2.
Under Assumption 2, we have $\mathrm{OPT}_i < \mathrm{OPT}$ for every optimal arm $i$, $\mathrm{OPT}_i = \mathrm{OPT}$ for every sub-optimal arm $i$, $\mathrm{OPT}^j = \mathrm{OPT}$ for every binding constraint $j$, and $\mathrm{OPT}^j < \mathrm{OPT}$ for every non-binding constraint $j$.
In this way, the definition of $\mathrm{OPT}_i$ distinguishes optimal arms from sub-optimal arms, while the definition of $\mathrm{OPT}^j$ distinguishes binding constraints from non-binding constraints. The importance of such a distinction arises from the upper bound in Proposition 1: on one hand, we should avoid playing sub-optimal arms; on the other hand, we should exhaust the binding resources. A second motivation for defining both $\mathrm{OPT}_i$ and $\mathrm{OPT}^j$ will become clear after we present our algorithm. Furthermore, we remark that a measure of the sub-optimality of the arms has to be defined through the lens of the LP due to the combinatorial nature of the problem. The effect of the $i$-th arm's removal on the objective value can only be gauged by solving the alternative LP for $\mathrm{OPT}_i$. Similarly, a measure of the bindingness of the constraints should also take into account the combinatorial relations between constraints.
Next, define the sub-optimality measure
$$\delta = \frac{1}{T} \min\Big\{ \min_{i \text{ optimal}} \big(\mathrm{OPT} - \mathrm{OPT}_i\big), \ \min_{j \text{ non-binding}} \big(\mathrm{OPT} - \mathrm{OPT}^j\big) \Big\},$$
where the factor $1/T$ normalizes the optimality gap by the number of time periods. Under Assumption 1, all the objective values above scale linearly with $T$.
To summarize, $\delta$ characterizes the hardness of distinguishing optimal arms from non-optimal arms (and binding constraints from non-binding constraints). It can be viewed as a generalization of the sub-optimality gap for the MAB problem (Lai and Robbins, 1985), which characterizes the hardness of an MAB problem instance, i.e., the hardness of distinguishing the optimal arm from sub-optimal arms. In the context of BwK, $\delta$ is a more comprehensive characterization in that it takes into account both the arms and the constraints. Imaginably, it will be critical in both algorithm design and analysis for the BwK problem.
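Concretely, on a toy instance (invented for illustration) the measure can be computed by comparing $\mathrm{OPT}$ against the perturbed LP values; the per-period capacities below play the role of the $1/T$ normalization, which is our reading of the definition above.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative instance; capacities are already normalized per period,
# so no extra 1/T factor is needed when forming delta.
mu = np.array([1.0, 0.5, 0.3])
C = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])
b = np.array([1.0, 0.6, 0.9])

def lp(obj, bounds=None, offset=0.0):
    res = linprog(-np.asarray(obj), A_ub=C, b_ub=b,
                  bounds=bounds if bounds is not None else [(0, None)] * 3,
                  method="highs")
    return -res.fun + offset, res.x

OPT, x_star = lp(mu)
optimal_arms = [i for i in range(3) if x_star[i] > 1e-6]
nonbinding = [j for j in range(3) if b[j] - C[j] @ x_star > 1e-6]

gaps = []
for i in optimal_arms:       # removing an optimal arm hurts: OPT - OPT_i > 0
    bnd = [(0, None)] * 3
    bnd[i] = (0, 0)
    gaps.append(OPT - lp(mu, bounds=bnd)[0])
for j in nonbinding:         # penalizing a non-binding constraint hurts
    gaps.append(OPT - lp(mu + C[j], offset=-b[j])[0])

delta = min(gaps)            # hardness measure of this instance
print("delta =", delta)
```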
3.3 Key Parameters of the LPs
Now, we define two LP-related quantities that will appear in our regret bound:
• Linear dependency between arms: define $\sigma$ as the minimum singular value of the sub-matrix of $\boldsymbol{C}$ whose rows correspond to the binding constraints and whose columns correspond to the optimal arms. In this light, $\sigma$ represents the linear dependency between optimal arms across the binding constraints. For a smaller value of $\sigma$, the optimal arms are more linearly dependent, and it will be harder to identify the optimal numbers of plays. Under Assumption 2, the uniqueness of the optimal solution implies $\sigma > 0$.
• Threshold on the optimal solution: let $\underline{x}$ denote the smallest non-zero entry of the normalized optimal solution $\boldsymbol{x}^*/T$. If the total resource scales linearly with $T$ (Assumption 1), the optimal solution should also scale linearly with $T$; the factor $1/T$ thus normalizes the optimal solution into a probability vector. Intuitively, a small value of $\underline{x}$ implies that the optimal proportion of playing some arm is small, and thus that arm is more prone to being “overplayed.”
Remarks. By definition, it seems that the above parameters $\sigma$ and $\underline{x}$ both involve a factor of $T$. But if we replace $B$ (the right-hand side of the LP) with $bT$ from Assumption 1, then the factor disappears, and the parameters $\sigma$ and $\underline{x}$ essentially depend on $b$, $\boldsymbol{\mu}$, and $\boldsymbol{C}$, which are inherent to the problem instance but bear no dependency on the horizon $T$. In other words, Assumption 1 removes the dependency on $T$ by introducing the quantity $b$. Practically, the assumption states that the resource capacity should be sufficiently large, which is natural in many application contexts (for example, the small-bids assumption in the AdWords problem, Mehta et al. (2005)). Theoretically, in the two previous works Flajolet and Jaillet (2015); Sankararaman and Slivkins (2020), either a factor of $T$ appears in the parameter definition (Sankararaman and Slivkins, 2020) or the assumption is explicitly imposed (Flajolet and Jaillet, 2015). Such an assumption might be inevitable for a logarithmic regret bound.
3.4 LCB and UCB
Throughout this paper, we denote the reward and resource consumption of the $k$-th play of the $i$-th arm by $r_i^{(k)}$ and $\boldsymbol{c}_i^{(k)}$, respectively. Let $n_i(t)$ be the number of times the $i$-th arm is played in the first $t$ time periods. Accordingly, we denote the estimator of the $i$-th arm's expected reward at time $t$ as
$$\hat{\mu}_i(t) = \frac{1}{n_i(t)} \sum_{k=1}^{n_i(t)} r_i^{(k)}$$
for each arm $i$ and time $t$. In a similar manner, we define the estimator $\hat{\boldsymbol{c}}_i(t)$ for the $i$-th arm's resource consumption vector, and the estimator $\hat{\boldsymbol{C}}(t)$ for the consumption matrix. Without changing the nature of the analysis, we ignore the case that $n_i(t) = 0$. Then, we define the lower confidence bound (LCB) and upper confidence bound (UCB) for the parameters as the empirical estimates shifted down and up, respectively, by a confidence radius and projected onto the unit interval,
where $\mathrm{proj}_{[0,1]}(\cdot)$ projects a real number onto the interval $[0,1]$. The following lemma is standard in the bandits literature and it characterizes the relation between the true values and the LCB/UCB estimators. It states that all the true values fall into the intervals defined by the corresponding estimators with high probability.
Lemma 2 (Concentration).
The following event holds with high probability:
$$\mu_i^{\mathrm{LCB}}(t) \le \mu_i \le \mu_i^{\mathrm{UCB}}(t), \qquad C_{ji}^{\mathrm{LCB}}(t) \le C_{ji} \le C_{ji}^{\mathrm{UCB}}(t)$$
for all arms $i$, constraints $j$, and times $t$.
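A minimal sketch of the estimators and confidence bounds for a single arm, assuming a Hoeffding-style radius of $\sqrt{2\log T / n}$ (the exact radius constant is an assumption of this sketch, not taken from the excerpt):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000                  # horizon (illustrative)
true_mu = 0.7             # unknown mean reward of one arm, in [0, 1]

n = 500                                       # times the arm has been played
rewards = rng.binomial(1, true_mu, size=n)    # i.i.d. observed Bernoulli rewards
mu_hat = rewards.mean()

radius = np.sqrt(2 * np.log(T) / n)           # Hoeffding-style radius (assumed form)

def proj(v):
    """Project a real number onto [0, 1]."""
    return min(max(v, 0.0), 1.0)

lcb, ucb = proj(mu_hat - radius), proj(mu_hat + radius)
print(f"mu_hat={mu_hat:.3f}, LCB={lcb:.3f}, UCB={ucb:.3f}")
# With high probability the true mean lies in [LCB, UCB].
```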
With the UCB/LCB estimators for the parameters, we can construct UCB/LCB estimators for the objective value of the primal LP. Specifically,
$$\begin{aligned} \mathrm{OPT}^{\mathrm{UCB}}_t = \max_{\boldsymbol{x}} \ & \big(\boldsymbol{\mu}^{\mathrm{UCB}}_t\big)^\top \boldsymbol{x} \qquad (7)\\ \text{s.t. } \ & \boldsymbol{C}^{\mathrm{LCB}}_t \boldsymbol{x} \le B \cdot \boldsymbol{1}, \quad \boldsymbol{x} \ge \boldsymbol{0}, \end{aligned}$$
$$\begin{aligned} \mathrm{OPT}^{\mathrm{LCB}}_t = \max_{\boldsymbol{x}} \ & \big(\boldsymbol{\mu}^{\mathrm{LCB}}_t\big)^\top \boldsymbol{x} \qquad (8)\\ \text{s.t. } \ & \boldsymbol{C}^{\mathrm{UCB}}_t \boldsymbol{x} \le B \cdot \boldsymbol{1}, \quad \boldsymbol{x} \ge \boldsymbol{0}. \end{aligned}$$
The following lemma states the relation between $\mathrm{OPT}$, $\mathrm{OPT}^{\mathrm{UCB}}_t$, and $\mathrm{OPT}^{\mathrm{LCB}}_t$. Intuitively, if we substitute the original constraint matrix with its LCB (or UCB) and the objective coefficients with their UCBs (or LCBs), the resulting optimal objective value will be a UCB (or LCB) for $\mathrm{OPT}$.
Lemma 3.
The following inequality holds with high probability: $\mathrm{OPT}^{\mathrm{LCB}}_t \le \mathrm{OPT} \le \mathrm{OPT}^{\mathrm{UCB}}_t$.
A similar approach is used in (Agrawal and Devanur, 2014) to develop a UCB-based algorithm for the BwK problem. For our algorithm presented in the following section, we construct estimates not only for the primal LP (1), but also for the LPs (4) and (5). By comparing the estimates of $\mathrm{OPT}$, $\mathrm{OPT}_i$, and $\mathrm{OPT}^j$, we will be able to identify the optimal arms and the non-binding constraints.
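The monotonicity behind Lemma 3 is easy to verify numerically: enlarging the rewards and shrinking the consumption can only increase the LP value. The sketch below uses a fixed perturbation `eps` in place of actual confidence radii, and an invented instance; both are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

mu = np.array([1.0, 0.5, 0.3])
C = np.array([[1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])
b = np.array([1.0, 0.6, 0.9])

def lp_value(reward, cons):
    res = linprog(-reward, A_ub=cons, b_ub=b, bounds=(0, None), method="highs")
    return -res.fun

eps = 0.05                                   # stand-in for the confidence radius
mu_ucb, mu_lcb = np.minimum(mu + eps, 1), np.maximum(mu - eps, 0)
C_ucb, C_lcb = np.minimum(C + eps, 1), np.maximum(C - eps, 0)

OPT = lp_value(mu, C)
OPT_UCB = lp_value(mu_ucb, C_lcb)   # optimistic: high rewards, low consumption
OPT_LCB = lp_value(mu_lcb, C_ucb)   # pessimistic: low rewards, high consumption

print(OPT_LCB, OPT, OPT_UCB)
# The feasible region only grows and the objective only increases, so the
# sandwich OPT_LCB <= OPT <= OPT_UCB holds deterministically here.
assert OPT_LCB <= OPT + 1e-9 <= OPT_UCB + 2e-9
```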
4 Two-Phase Algorithm
In this section, we describe our two-phase algorithm for the BwK problem. The main theme is to use the underlying LP’s solution to guide the plays of the arms. The two phases in the algorithm correspond to the two parts of the regret upper bound in Proposition 1. In the following, we describe the two phases of the algorithm and their intuitions respectively.
Phase I of Algorithm 1 is an elimination algorithm and it aims to identify the optimal arms and the non-binding constraints. In each round of the while loop in Phase I, all the arms are played once to improve the estimators for $\boldsymbol{\mu}$ and $\boldsymbol{C}$. After each round of plays, an identification procedure is conducted by comparing the LCB of the original optimal value ($\mathrm{OPT}^{\mathrm{LCB}}$) against the UCBs ($\mathrm{OPT}_i^{\mathrm{UCB}}$ and $\mathrm{OPT}^{j,\mathrm{UCB}}$). Recall from Proposition 2 that there is a non-zero sub-optimality gap if arm $i$ is optimal or constraint $j$ is non-binding. So, if the algorithm observes a gap between the corresponding LCBs/UCBs, it asserts that arm $i$ is optimal or that constraint $j$ is non-binding, and the assertion is true with high probability.
The stopping rule in Phase I originates from the complementarity condition in Lemma 1, i.e., that the number of optimal arms equals the number of binding constraints. This is key to the primal-dual design of the algorithm and it further justifies the consideration of the dual problem. Without maintaining the set of identified non-binding constraints, we cannot decide whether we have obtained the complete set of optimal arms. Specifically, since there is no precise knowledge of the number of optimal arms while we keep adding arms to the identified set, we do not know when to stop. The complementarity in Lemma 1 provides a condition relating the number of identified arms and the number of identified constraints, and Phase I is terminated when this condition is met. Moreover, we emphasize that a by-product of Phase I is a best-arm identification procedure. To the best of our knowledge, this is the first result on identifying optimal arms for the BwK problem.
$$\begin{aligned} \max_{\boldsymbol{x}} \ & \big(\boldsymbol{\mu}^{\mathrm{UCB}}_t\big)^\top \boldsymbol{x} \qquad (9)\\ \text{s.t. } \ & \boldsymbol{C}^{\mathrm{LCB}}_t \boldsymbol{x} \le \boldsymbol{B}_{t-1} / (T - t + 1), \\ & x_i = 0 \ \text{for each arm } i \text{ not identified as optimal}, \quad \boldsymbol{x} \ge \boldsymbol{0}. \end{aligned}$$
Phase II of Algorithm 1 is built upon the output of Phase I. At each time $t$, the algorithm solves the adaptive LP (9) and normalizes its optimal solution into a sampling scheme over the arms. Also, in Phase II, the algorithm only plays arms identified as optimal; this is achieved by enforcing $x_i = 0$ for all other arms in (9). The adaptive design is exemplified by the right-hand side of LP (9): instead of the static resource capacity, it uses the remaining resource at the end of the last time period, $\boldsymbol{B}_{t-1}$. To see the intuition, consider a binding resource $j$ that is over-used in the first $t-1$ time periods: this results in a smaller value of $B_{t-1,j}$, so the adaptive mechanism will be more reluctant to consume the $j$-th resource in the future, and vice versa for the case of under-use.
We emphasize that the adaptive design is not only intuitive but also necessary to achieve a regret that is logarithmic in $T$. If we adopt a static right-hand side in (9) (as in (Badanidiyuru et al., 2013; Agrawal and Devanur, 2014)), then the fluctuation of the knapsack process will be on the order of $\sqrt{T}$. Consequently, as the process approaches the end of the horizon, a certain binding resource may be exhausted while other binding resources still have left-over, and this may contribute an $\Omega(\sqrt{T})$ term to the upper bound in Proposition 1. The intuition is made rigorous by (Arlotto and Gurvich, 2019); that paper establishes that without an adaptive design, the regret is at least $\Omega(\sqrt{T})$ for the multi-secretary problem (which can be viewed as a one-constraint BwK problem) even if the underlying distribution is known.
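The effect of the adaptive right-hand side can be seen in a deterministic "fluid" sketch in the spirit of Phase II: each period we re-solve the per-period LP with the remaining average budget and consume expected amounts. The instance, the per-period normalization $\boldsymbol{B}_{t-1}/(T-t+1)$, and the use of exact parameters in place of UCB estimates are all assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

mu = np.array([1.0, 0.5, 0.3])
C = np.array([[1.0, 1.0, 1.0],     # row 0 is the time constraint
              [1.0, 0.0, 1.0],
              [0.5, 0.5, 0.5]])
b = np.array([1.0, 0.6, 0.9])      # per-period capacities

T = 100
B = b * T                          # total budgets
reward = 0.0
for t in range(T):
    rhs = B / (T - t)              # remaining budget per remaining period
    res = linprog(-mu, A_ub=C, b_ub=rhs, bounds=(0, 1), method="highs")
    x = res.x                      # per-period play proportions
    reward += mu @ x
    B = B - C @ x                  # consume expected amounts (fluid dynamics)

print(f"reward={reward:.2f}, leftover={np.round(B, 4)}")
# The binding budgets (time and resource 1) are driven to zero, while the
# non-binding resource 2 retains its natural surplus.
```

Because the budget is re-normalized every period, any hypothetical over- or under-consumption would be corrected in subsequent periods; with exact parameters the trajectory stays exactly on the LP optimum.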
5 Regret Analysis
In this section, we derive a logarithmic (in $T$) regret upper bound for Algorithm 1 by analyzing the two phases separately.
5.1 Analysis of Phase I
Proposition 3 provides an upper bound on the number of time periods within which Phase I terminates. It also states that the identification of the optimal arms and the binding constraints is precise conditional on the high-probability event in Lemma 2.
Proposition 3.
The surprising point of Proposition 3 lies in that there are combinatorially many possible configurations of optimal arms and binding constraints and, without any prior knowledge, the true configuration can be identified within $O(\log T)$ plays of each arm. In contrast, (Flajolet and Jaillet, 2015) does not utilize the primal-dual structure of the problem and conducts a brute-force search over all possible configurations, which results in a regret bound exponential in $m$ and $d$. In addition, the search therein requires the knowledge of a non-degeneracy parameter a priori. The result also explains why (Sankararaman and Slivkins, 2020) imposes a single-best-arm condition on the BwK problem. This additional assumption greatly simplifies the combinatorial structure and brings the BwK problem closer to the standard MAB problem. In this light, Proposition 3 can be viewed as a best-arm-identification result (Audibert and Bubeck, 2010) for the BwK problem in full generality and without any prior knowledge.
5.2 Analysis of Phase II
Proposition 4 provides an upper bound on the remaining resource of the binding constraints when the procedure terminates, which corresponds to the second part of Proposition 1. Notably, the upper bound has no dependency on $T$. In comparison with the $\sqrt{T}$ fluctuation under the static design, it demonstrates the effectiveness of our adaptive design.
Proposition 4.
The idea of the proof is to introduce an auxiliary process. Recall that $B_{t,j}$ is the $j$-th component of the knapsack process $\boldsymbol{B}_t$, with initial value $B_{0,j} = B = bT$. Define the average remaining resource $\tilde{b}_{t,j} = B_{t,j}/(T-t)$ and the stopping time
$$\tau_j = \min\big\{t : |\tilde{b}_{t,j} - b| \ge \epsilon\big\}$$
for a fixed $\epsilon > 0$. With this definition, $\tilde{b}_{t,j}$ can be interpreted as the average remaining resource (per remaining time period) and $\tau_j$ as the first time it deviates from its initial value by a small amount. It is easy to see that $\tau \ge \min_j \tau_j$ with a proper choice of $\epsilon$. Next, we aim to upper bound this deviation by analyzing the process $\tilde{b}_{t,j}$. From the dynamics of the knapsack process, we know that
$$\tilde{b}_{t,j} = \tilde{b}_{t-1,j} + \frac{\tilde{b}_{t-1,j} - c_{t,j}}{T-t}, \qquad (10)$$
where, as defined earlier, $c_{t,j}$ is the consumption of the $j$-th resource at time $t$. The formula (10) provides a technical explanation for the adaptive design in (9). Intuitively, when the right-hand side of (9) is $\tilde{b}_{t-1,j}$, it leads to a solution that (approximately and in expectation) consumes $\tilde{b}_{t-1,j}$ of the $j$-th resource in each of the following time periods. Ideally, this makes the second term in (10) have zero expectation. However, due to estimation error in the LP's parameters, this may not be the case. The idea is to first provide an upper bound for the "bias" term
$\big|\mathbb{E}[\tilde{b}_{t-1,j} - c_{t,j} \mid \mathcal{H}_{t-1}]\big|$, where $\mathcal{H}_{t-1}$ encapsulates all the information up to time $t-1$. Unsurprisingly, the upper bound is of the order of the parameter estimation error. Next, with this bias upper bound, we can construct a super-martingale (sub-martingale) based on the dynamic (10) and employ the Azuma–Hoeffding inequality to obtain a concentration result for the value of the martingale. Through the analysis of the process $\tilde{b}_{t,j}$, we derive an upper bound on its deviation, and consequently an upper bound on the remaining resources of the binding constraints.
The importance of the adaptive design has been widely recognized in other constrained online learning problems, such as the online matching problem Manshadi et al. (2012), the online assortment problem Golrezaei et al. (2014), the online linear programming problem Li and Ye (2019), the network revenue management problem Jasin and Kumar (2012), etc. The common pattern of these problems is to allocate limited resources sequentially, and the idea of adaptive design is to adjust the allocation rule dynamically according to the remaining resource/inventory. This is in parallel with LP (9), where the solution at each time is contingent on the remaining resources $\boldsymbol{B}_{t-1}$. The significance of our algorithm design and analysis lies in that (i) for the literature on BwK, our paper is the first application of the idea of adaptive design; (ii) for the existing literature on adaptive design in constrained online learning problems, our work provides its first application and analysis in a partial-information environment. For the second aspect, all the existing analyses of adaptive design fall into the paradigm of "first-observe-then-decide," while the BwK problem is "first-decide-then-observe." Specifically, in matching/resource allocation/revenue management problems, at each time period a new agent arrives and, upon the observation, we decide the matching for the agent; or a new customer arrives and, upon observing her preference, we decide the assortment for the customer. So, the existing analyses are analogous to a BwK "setting" where the reward and resource consumption of playing an arm are first observed (magically), and then we decide whether to play the arm or not.
5.3 Regret Upper Bound for Algorithm 1
Combining the two parts, we have the following result on the regret of Algorithm 1.
Proposition 5.
The result reduces the exponential dependence on $m$ and $d$ in (Flajolet and Jaillet, 2015) to a polynomial one, and it does not rely on any prior knowledge. Specifically, those authors consider several settings for the BwK problem, many of which assume special structures such as one or two resource constraints. The most general setting therein, which is comparable to ours, allows an arbitrary number of constraints and an arbitrary number of optimal arms. In terms of the key parameters, the way we define $\delta$ is the same as their definition of the sub-optimality gap. However, their regret bound (Theorem 8 therein) involves a summation of exponentially many such gaps (as many as the total number of bases of the LP). Our parameter $\sigma$ is related to their non-degeneracy parameter (Assumption 8 therein), but the latter is more restrictive: $\sigma$ in our paper is the minimal singular value of the matrix corresponding only to the optimal basis of the primal LP, whereas their parameter lower-bounds the determinants of the matrices corresponding to all the possible (exponentially many) bases of the primal LP. In this light, if they adopted our parameter $\sigma$, their bound would improve by a corresponding factor. Moreover, their parameter is a lower bound for our $\sigma$, and (Flajolet and Jaillet, 2015) explicitly requires its knowledge a priori.
The proposition also generalizes the one-constraint bound in (Sankararaman and Slivkins, 2020) and relaxes the deterministic-resource-consumption assumption therein. Specifically, those authors assume there is one single optimal arm and one single resource (other than time), i.e., the optimal solution to the primal LP has only one non-zero entry. They also assume the underlying LP is non-degenerate. Our results generalize their work in allowing arbitrary numbers of arms, resources, and optimal arms. In terms of the key parameters, under their assumption our parameter $\sigma$ is simply the corresponding consumption entry, because it is defined by a 1-by-1 matrix. Our parameter $\underline{x}$ is treated as a constant in their paper. As for $\delta$, under their assumptions our definition of $\mathrm{OPT}_i$ can be adjusted accordingly, and then our definition of $\delta$ reduces to their sub-optimality gap, both of which characterize the sub-optimality of an arm.
The above comparison against the existing literature highlights that the parameters $\delta$, $\sigma$, and $\underline{x}$, or other LP-based parameters, might be inevitable in characterizing a logarithmic regret bound. Our paper makes some preliminary efforts along this line, but we do not believe our bound is tight: parameters such as $\sigma$ and $\underline{x}$ may be replaced with tighter ones through a better algorithm and sharper analysis. As remarked in Section 3.3, the parameters do not depend on $T$ under Assumption 1. But characterizing the dependency of these parameters on the problem sizes $m$ and $d$ remains an open question. Moreover, our algorithm and analysis rely heavily on the structural properties of the underlying LP, which might not be the unique method to handle the BwK problem.
6 Conclusion
In this paper, we introduce a new BwK algorithm and derive a problem-dependent regret bound for it. Phase I of the algorithm involves a round-robin design and may result in playing sub-optimal arms inefficiently many times. Our regret bound can be large when the parameters $\delta$, $\sigma$, and $\underline{x}$ are small, and the inefficient plays of the sub-optimal arms may prevent the algorithm from achieving a near-optimal worst-case regret. In the extreme case, these parameters may scale with $T$, though Assumption 1 prevents such a possibility. So the question is whether Assumption 1 is necessary for admitting a logarithmic problem-dependent bound.
We conclude our discussion with a new one-phase algorithm – Algorithm 2. The algorithm is also LP-based, and at each time , it solves a UCB version of the primal LP to sample the arm. The algorithm has an adaptive design to exhaust the resources. On one hand, the algorithm naturally incorporates Phase I of Algorithm 1 as part of its Phase II: it directly enters Phase II of Algorithm 1 and lets the adaptive LP fully determine the arm(s) to play (without the extra constraint in (9)). On the other hand, the algorithm can be viewed as an adaptive version of the algorithm in (Agrawal and Devanur, 2014). Our conjecture is that Algorithm 2 is the optimal algorithm for BwK: it achieves the optimal problem-dependent bound while also admitting a problem-independent bound. Unfortunately, its analysis is more challenging than that of Algorithm 1, which we leave as an open question.
(11)
s.t.
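The adaptive design described above — solve a UCB version of the primal LP with the remaining average budget, then play accordingly — can be written in a few lines. The following is a hedged sketch, not Algorithm 2 itself: the estimators, the uniform fallback rule, and all numbers are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def ucb_lp_step(mu_ucb, C_lcb, b_remaining, t_remaining, rng):
    """One round of an adaptive LP-based BwK algorithm (sketch): solve a
    UCB version of the primal LP with the *remaining* average budget,
    then sample an arm from the normalized LP solution."""
    K = len(mu_ucb)
    # maximize mu_ucb @ x  s.t.  C_lcb @ x <= b_remaining / t_remaining,
    #                            sum(x) <= 1,  x >= 0
    res = linprog(
        c=-mu_ucb,  # linprog minimizes, so negate the UCB rewards
        A_ub=np.vstack([C_lcb, np.ones((1, K))]),
        b_ub=np.append(b_remaining / t_remaining, 1.0),
        bounds=[(0, None)] * K,
        method="highs",
    )
    x = np.clip(res.x, 0.0, None)
    if x.sum() <= 0:
        return int(rng.integers(K))      # degenerate LP: play uniformly
    return int(rng.choice(K, p=x / x.sum()))

rng = np.random.default_rng(0)
arm = ucb_lp_step(np.array([0.9, 0.5]), np.array([[0.5, 0.2]]),
                  np.array([10.0]), 20, rng)
```

In this toy instance the LP puts all mass on arm 0, so the sampled arm is deterministic; with noisier estimates the LP solution mixes over several arms.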
Acknowledgements
We would like to thank the anonymous reviewers for their helpful suggestions and insightful comments.
References
- Agrawal and Devanur (2014) Agrawal, Shipra, Nikhil R Devanur. 2014. Bandits with concave rewards and convex knapsacks. Proceedings of the fifteenth ACM conference on Economics and computation. 989–1006.
- Agrawal and Devanur (2015) Agrawal, Shipra, Nikhil R Devanur. 2015. Linear contextual bandits with knapsacks. arXiv preprint arXiv:1507.06738 .
- Agrawal et al. (2016) Agrawal, Shipra, Nikhil R Devanur, Lihong Li. 2016. An efficient algorithm for contextual bandits with knapsacks, and an extension to concave objectives. Conference on Learning Theory. PMLR, 4–18.
- Agrawal and Goyal (2012) Agrawal, Shipra, Navin Goyal. 2012. Analysis of thompson sampling for the multi-armed bandit problem. Conference on learning theory. JMLR Workshop and Conference Proceedings, 39–1.
- Agrawal et al. (2014) Agrawal, Shipra, Zizhuo Wang, Yinyu Ye. 2014. A dynamic near-optimal algorithm for online linear programming. Operations Research 62(4) 876–890.
- Arlotto and Gurvich (2019) Arlotto, Alessandro, Itai Gurvich. 2019. Uniformly bounded regret in the multisecretary problem. Stochastic Systems .
- Audibert and Bubeck (2010) Audibert, Jean-Yves, Sébastien Bubeck. 2010. Best arm identification in multi-armed bandits. COLT-23th Conference on Learning Theory-2010. 13–p.
- Auer et al. (2002) Auer, Peter, Nicolo Cesa-Bianchi, Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47(2) 235–256.
- Badanidiyuru et al. (2013) Badanidiyuru, Ashwinkumar, Robert Kleinberg, Aleksandrs Slivkins. 2013. Bandits with knapsacks. 2013 IEEE 54th Annual Symposium on Foundations of Computer Science. IEEE, 207–216.
- Besbes and Zeevi (2012) Besbes, Omar, Assaf Zeevi. 2012. Blind network revenue management. Operations research 60(6) 1537–1550.
- Ferreira et al. (2018) Ferreira, Kris Johnson, David Simchi-Levi, He Wang. 2018. Online network revenue management using thompson sampling. Operations research 66(6) 1586–1602.
- Flajolet and Jaillet (2015) Flajolet, Arthur, Patrick Jaillet. 2015. Logarithmic regret bounds for bandits with knapsacks. arXiv preprint arXiv:1510.01800 .
- Gittins et al. (2011) Gittins, John, Kevin Glazebrook, Richard Weber. 2011. Multi-armed bandit allocation indices. John Wiley & Sons.
- Golrezaei et al. (2014) Golrezaei, Negin, Hamid Nazerzadeh, Paat Rusmevichientong. 2014. Real-time optimization of personalized assortments. Management Science 60(6) 1532–1551.
- Immorlica et al. (2019) Immorlica, Nicole, Karthik Abinav Sankararaman, Robert Schapire, Aleksandrs Slivkins. 2019. Adversarial bandits with knapsacks. 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 202–219.
- Jasin and Kumar (2012) Jasin, Stefanus, Sunil Kumar. 2012. A re-solving heuristic with bounded revenue loss for network revenue management with customer choice. Mathematics of Operations Research 37(2) 313–345.
- Kesselheim and Singla (2020) Kesselheim, Thomas, Sahil Singla. 2020. Online learning with vector costs and bandits with knapsacks. Conference on Learning Theory. PMLR, 2286–2305.
- Lai and Robbins (1985) Lai, Tze Leung, Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics 6(1) 4–22.
- Li and Ye (2019) Li, Xiaocheng, Yinyu Ye. 2019. Online linear programming: Dual convergence, new algorithms, and regret bounds. arXiv preprint arXiv:1909.05499 .
- Manshadi et al. (2012) Manshadi, Vahideh H, Shayan Oveis Gharan, Amin Saberi. 2012. Online stochastic matching: Online actions based on offline statistics. Mathematics of Operations Research 37(4) 559–573.
- Megiddo and Chandrasekaran (1989) Megiddo, Nimrod, Ramaswamy Chandrasekaran. 1989. On the -perturbation method for avoiding degeneracy. Operations Research Letters 8(6) 305–308.
- Mehta et al. (2005) Mehta, Aranyak, Amin Saberi, Umesh Vazirani, Vijay Vazirani. 2005. Adwords and generalized on-line matching. 46th Annual IEEE Symposium on Foundations of Computer Science (FOCS’05). IEEE, 264–273.
- Robbins (1952) Robbins, Herbert. 1952. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society 58(5) 527–535.
- Sankararaman and Slivkins (2020) Sankararaman, Karthik Abinav, Aleksandrs Slivkins. 2020. Advances in bandits with knapsacks. arXiv preprint arXiv:2002.00253 .
- Weber et al. (1992) Weber, Richard, et al. 1992. On the gittins index for multiarmed bandits. The Annals of Applied Probability 2(4) 1024–1033.
Appendix A Proofs for Sections 2 and 3
A1 Proof of Lemma 1
Proof.
Notice that is a feasible solution to the primal LP (1) and is a feasible solution to the dual LP (2); hence both the primal and the dual LP are feasible. By the strong duality of linear programming, both LPs then have optimal solutions and share the same optimal objective value. ∎
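The strong-duality argument can be illustrated numerically. Below is a minimal sketch using `scipy.optimize.linprog`; the reward vector, consumption matrix, and budgets are made-up illustrative values, not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative BwK-style primal LP: maximize expected reward subject to
# average resource budgets (all numbers are illustrative assumptions).
r = np.array([0.8, 0.6, 0.3])            # expected reward per arm
C = np.array([[0.5, 0.2, 0.1],           # expected consumption per arm
              [0.1, 0.4, 0.3]])
b = np.array([0.3, 0.25])                # average budget per period

# Primal: max r^T x  s.t.  C x <= b, x >= 0  (linprog minimizes, so negate r).
primal = linprog(-r, A_ub=C, b_ub=b, bounds=[(0, None)] * 3, method="highs")

# Dual:   min b^T y  s.t.  C^T y >= r, y >= 0.
dual = linprog(b, A_ub=-C.T, b_ub=-r, bounds=[(0, None)] * 2, method="highs")

primal_value = -primal.fun
dual_value = dual.fun
# Strong duality: both LPs are feasible, so the two optimal values coincide.
```

Here the primal is feasible at x = 0 and the dual is feasible for any large enough y, which is exactly the feasibility premise the lemma uses.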
A2 Proof of Proposition 1
Proof.
From , we know
(12)
The idea is to represent with reduced cost and number of plays for the arms. We have
where denotes the arm chosen to play at time . In the above, the second line comes from the independence between the stopping time and the reward generated by a fixed arm. The third line comes from the definition of the reduced cost, which for all gives , and the last line can be obtained by the same argument as for the first three lines. Thus, plugging the equations above into inequality (12) and rearranging terms, we have
Thus we complete the proof. ∎
A3 Proof of Proposition 2
Proof.
It is sufficient to show that
(13)
(14)
Here we prove (13); the proof of (14) follows from the same analysis. By definition, a feasible solution to LP (4) is also feasible to LP (1). So,
If for , then the optimal solution of LP (4) is also an optimal solution to (1). Denote that solution by . Then both and are optimal solutions of LP (1). From the uniqueness in Assumption 2, we know , and hence and for . However, this contradicts the constraint of (4) that . ∎
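The contradiction argument rests on comparing the full LP with a restricted LP in which one arm's play rate is constrained. A hedged numerical sketch with made-up data (forcing one arm's rate to zero, the simplest such restriction; the unit time-budget constraint is also an assumption of this sketch):

```python
import numpy as np
from scipy.optimize import linprog

def lp_value(r, C, b, force_zero=None):
    """Optimal value of  max r^T x  s.t.  C x <= b, sum(x) <= 1, x >= 0,
    optionally forcing one arm's play rate to zero, as in the restricted
    LP used to define a per-arm sub-optimality gap."""
    K = len(r)
    bounds = [(0, None)] * K
    if force_zero is not None:
        bounds[force_zero] = (0, 0)      # exclude this arm entirely
    res = linprog(-r, A_ub=np.vstack([C, np.ones((1, K))]),
                  b_ub=np.append(b, 1.0), bounds=bounds, method="highs")
    return -res.fun

# Illustrative instance (not from the paper): arm 0 is clearly optimal.
r = np.array([0.9, 0.2])
C = np.array([[0.5, 0.5]])
b = np.array([0.5])
full = lp_value(r, C, b)
without_0 = lp_value(r, C, b, force_zero=0)
gap = full - without_0   # a strictly positive gap marks arm 0 as needed
```

A strictly positive `gap` certifies that removing the arm degrades the LP value, which is the numerical shadow of the uniqueness-based contradiction above.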
A4 Proof of Lemma 2
Proof.
The proof follows a standard analysis of the tail bounds for the estimators. Specifically, we first consider the complement of these events and then apply a union bound to complete the proof. The analysis first concerns the case when there is no projection and then extends the results to the case with projection.
For each we have
(15)
where denotes the reward of the -th play of the -th arm. The first inequality is obtained by the union bound, the second follows from Bayes' rule, the third is obtained by , and the last by an application of Hoeffding's inequality with and for and .
Then, recall that
and . Following the definitions of the sets, we have
Then we can apply (15) to the last line above.
Note that the proof relies only on the property that the reward is bounded in .
Next, following the exact same analysis, we have
also holds for each . Then, we complete the proof by taking a union bound with respect to the arm index and the constraint index . ∎
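The Hoeffding tail in (15) and the final union bound can be sanity-checked numerically. This is a hedged sketch: the Bernoulli reward model and the arm/constraint counts are illustrative, not the paper's exact constants.

```python
import numpy as np

def hoeffding_tail(n, eps):
    # Two-sided Hoeffding bound for i.i.d. samples in [0, 1]:
    # P(|mean_n - mu| >= eps) <= 2 * exp(-2 * n * eps**2)
    return 2.0 * np.exp(-2.0 * n * eps**2)

def union_bound(K, m, n, eps):
    # One tail event per estimate: K reward means plus K * m consumption
    # means, aggregated by a union bound over arms and constraints.
    return K * (m + 1) * hoeffding_tail(n, eps)

# Monte-Carlo check that the per-estimate tail bound is not violated.
rng = np.random.default_rng(0)
n, eps, mu = 200, 0.1, 0.5
means = (rng.random((5000, n)) < mu).mean(axis=1)   # Bernoulli(0.5) rewards
freq = np.mean(np.abs(means - mu) >= eps)           # empirical tail frequency
```

The empirical deviation frequency sits well below the Hoeffding bound, and the union-bound helper simply scales that bound by the number of simultaneous estimates.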
A5 Proof of Lemma 3
Proof.
It is sufficient to show that
holds under the event in Lemma 2. Then, applying Lemma 2, the inequality above holds with probability no less than
Recall that corresponds to the optimal objective value of the following LP (7) and denote as its optimal solution. Since holds element-wise under the event in Lemma 2, we have
where those inequalities hold element-wise. Thus, is a feasible solution to LP (1), and
(16)
where the first line comes from the definition of , the second line comes from (under the event in Lemma 2), and the third line comes from the fact that is feasible to LP (1) and the definition of . Thus, is an underestimate of for all .
With a similar analysis, we can show the part for . ∎
Appendix B Analysis of Phase I of Algorithm 1
In this section, we provide the analysis of Phase I of Algorithm 1 and prove Proposition 3. In the proof, we will use some concentration results for the LCB/UCB estimators for the LP, and these concentration results will be summarized and proved in Lemma 4 following the proof of Proposition 3.
B1 Proof of Proposition 3
Proof.
In this proof, we show the result under the high-probability event in Lemma 2.
Recall that an arm will be added to or a resource will be added to when
So, our proof focuses on the relation between and , and the relation between and . Our proof can be divided into two parts:
• First, we prove that and will not include a non-optimal arm or a binding resource . Equivalently, we only need to show
for and and all .
• Second, we prove that Phase I of Algorithm 1 terminates within steps. Equivalently, we only need to show
for and when the time
For the first part, recall that is the optimal objective value of LP (4). For , with a proof similar to that of Lemma 3, we can show that
and from Lemma 3,
for all . For , we have from Proposition 2. So,
With a similar argument,
Therefore, we complete the first part. The implication is that and .
Recall that the algorithm terminates when and that, from Lemma 1, . So the algorithm terminates if and only if and ; that is, it stops only when it has found all the optimal arms and all the non-binding resources.
Now, we move to the second part and prove that and within
steps.
For time that is a multiple of and that
we have
i.e., the condition (19) in the statement of Lemma 4 holds, where in this case. Then, by Lemma 4, we have
Thus, within at most steps, we have
for all and , since and . Thus, all optimal arms and non-binding resources will be detected within at most steps; equivalently, with each arm played at most times.
∎
B2 Concentration of LP UCB/LCB Estimators
Lemma 4.
Proof.
We prove the inequalities in the statement one by one. We note that the condition guarantees that the algorithm does not exhaust any resource before Phase I of the algorithm terminates.
Proof of (17):
For (17), we only need to show that
since the other direction can be shown by the same proof as Lemma 3.
We introduce the following LP (B2) to relate with .
s.t. (21)
Compared to (4) which defines , the LP (B2) scales the right-hand-side resource capacity with a factor and also appends an additional term to the objective function. It is easy to see that the optimal objective value of the LP (B2) is
(22)
Denote the optimal solution of LP (4) by . First, we show that is a feasible solution to (B2); then, with this feasibility, we show that (22) provides an upper bound for OPT and thus establish the relation between OPT and OPT.
To see the feasibility, note that the -th entry of the resource consumption
where the last line comes from the concentration event in Lemma 2 and the fact that
With this feasibility, we have
(23)
Here the second-to-last line comes from a similar argument as the feasibility analysis: it is based on the concentration event in Lemma 2 and the fact that . The last line plugs the upper bound (22) into LP (B2) and comes from the fact that is a feasible solution of LP (B2). By rearranging terms in (23), we complete the proof of (17).
Proof of (18):
As in the proof of (17), we only need to show one side of the inequality, namely
Denote by the optimal solution of LP (5), which is the LP corresponding to . Define
We first show that is feasible to the LP corresponding to in Algorithm 1. To see the feasibility, we have
Here the first two lines come from rearranging terms. The third line comes from the feasibility of . The fourth line comes from the concentration event in Lemma 2 and the fact that . The last line comes from the definition of the UCB estimators.
With the feasibility of , we have
where the first line comes from the feasibility of , the second and fourth lines use the fact that , and the third line comes from the definition of OPT. By rearranging terms, we complete the proof of (18).
Proof of inequality (20):
The idea of the proof is to first show that
is a feasible solution to the LP corresponding to OPT under condition (19) and then analyze its corresponding objective value.
First, to see its feasibility, the condition (19) ensures that
On the other hand, for the constraint
Here the first and second equalities come from rearranging terms, and the last two lines follow a similar analysis as before.
By rearranging terms in the last line of above,
Here the second-to-last line comes from rearranging terms and the last line comes from the feasibility of . In this way, we obtain the feasibility part.
Appendix C Analysis of Phase II of Algorithm 1
In this section, we analyze Phase II of Algorithm 1 and prove Proposition 4. To simplify the presentation, we first introduce a notion of "warm start" and prove Proposition 4 under this warm start condition together with the assumption that all constraints are binding. Indeed, recall that the upper bound in Proposition 1 only involves the remaining resources of the binding constraints. The assumption only simplifies the presentation and does not change the nature of our analysis. Later, in Sections C3 and C4, we discuss how the warm start condition is achieved and what happens when there are non-binding constraints, respectively.
C1 Proof of Proposition 4 under Warm Start Assumption
We first define the following quantity
In the following, we use to define a threshold for parameter estimation. As we will see later, the threshold mainly ensures that the estimated constraint matrix is well-conditioned (with a minimum singular value of ).
Assumption 3 (Warm Start).
We make the following assumption on the parameter estimates
(a) holds element-wise for all .
(b) The following condition holds for all and :
(24)
Here, Part (a) of the assumption restricts to the concentration event in Lemma 2. Part (b) ensures that the estimated LP with constraint matrix is well-conditioned. In Section C3 below, we show that this assumption is met automatically in Algorithm 1, and we discuss the number of time periods it takes to satisfy it.
Assumption 4 (All optimal arms and binding constraints).
We assume that all the arms are optimal and all the constraints are binding, i.e.,
In fact, Assumption 4 is not as strong as it seems. First, Proposition 3 shows that Phase I of Algorithm 1 identifies the optimal arms and binding constraints with high probability. Then, the LP (9) solved in Phase II of the algorithm restricts play to the optimal arms, as if all arms were optimal. Second, from Proposition 1, we know the generic upper bound only involves the remaining resources of the binding constraints, so this assumption allows us to focus on the binding constraints. To make the analysis go through, we only need to show that the non-binding resources are not exhausted before any binding resource is exhausted. Intuitively, this will always happen because the non-binding resources have slack. Later, in Section C4, we return to the case of non-binding resources. We now prove Proposition 4 under these two additional assumptions.
Proof of Proposition 4 under Assumptions 3 and 4
We show that under Assumptions 3 and 4, there exists a constant depending on , , , and such that, for any , the leftover of each binding resource (at time ) satisfies
Proof.
Our proof formalizes the idea described in Section 5.2. Recall that the average remaining resource
where for . Define
where will be specified later. Ideally, the algorithm consumes approximately resource per time period so that the average remaining resource will always stay in . This motivates the definition of a stopping time that represents the first time that escapes from .
Recall that the stopping time is the termination of the procedure. Thus, as long as , we have since no resource is used up at time . Furthermore, we define the following event, for each ,
where is the filtration up to time , is the arm played by Algorithm 1 in time period during Phase II, and is the expected resource consumption of the -th arm. The threshold parameters will be specified later. We remark that in Algorithm 1, the choice of (and its distribution) depends on the remaining resource . Here we represent with , and thus the second term captures the maximal "bias" of resource consumption when the average remaining resource , where the maximum is taken with respect to all .
At time , if for some , there are in total remaining resources and remaining time periods. Ideally, we hope that the resource is consumed at a rate of exactly in each of the remaining time periods. In this sense, the event describes that the average resource level stays in for the first time periods but the consumption bias at time is greater than .
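The even-spreading consumption target described above can be sketched in a few lines. The function below is our own simplification of the idea, not the algorithm's exact rule; the budget and horizon values are illustrative.

```python
import numpy as np

def target_consumption(b_remaining, t, T):
    """Per-period consumption target (sketch): spread the remaining
    budget of each resource evenly over periods t, t+1, ..., T."""
    return b_remaining / (T - t + 1)

# If the algorithm matched the target exactly in every period, the
# budget would be depleted precisely at the end of the horizon:
b, T = 6.0, 10
for t in range(1, T + 1):
    b -= target_consumption(b, t, T)   # consume exactly the target
```

The telescoping product leaves exactly zero budget after period T, which is the ideal trajectory around which the stopping-time analysis controls the stochastic deviations.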
The following lemma states that with a careful choice of and , the event will happen with a small probability.
Lemma 5.
Let . If we choose and
then there exists a constant such that,
for all .
The proof of Lemma 5 is postponed to the end of this proof. The following lemma considers the complements of the events and analyzes the stopping time on those complements. Here we use to denote the complement of an event .
Lemma 6.
If , , and satisfy that for all and
(25)
the following inequality holds
for and , and therefore,
The proof of Lemma 6 will also be postponed to the end of this proof.
C2 Proofs of Lemma 5 and Lemma 6
In this subsection, we prove the two key lemmas of the previous section. The proof of Lemma 5 will utilize the following lemma. The statement of Lemma 7 is quite straightforward: under the assumption that the parameters are well estimated, that all the arms are optimal, and that all the constraints are binding, the optimal solution to LP (9) can simply be obtained as . Its proof is based solely on linear algebra, but it is somewhat tedious, so we postpone it until after the proofs of Lemma 5 and Lemma 6.
Lemma 7.
Proof of Lemma 5
Proof.
First, we show that each arm will be played, in expectation, no less than times in the first time periods. Denote by the normalized probability of playing arm at time ; then, by the definition of , we know
Here, from Assumption 4, we know that in this case, so the constraint consumption matrix is a square matrix. By Lemma 7 and the definition of , if
(26)
where the first line comes from the triangle inequality, the second line comes from the definition of , and the third line is obtained from the relation between the singular value and the matrix infinity norm. Now we can transfer this bound to . From the warm start condition (24) that
together with a proof similar to the first part of Lemma 7, we can show
(27)
Specifically, the warm start condition (24) provides an upper bound for the deviation of and . Putting together (26) and (27), we have
where the inequality holds element-wise. This inequality provides a lower bound on the probability distribution with which the arms are played. The implication is that if , each of the arms will be played with probability no less than . In this way, as long as always holds, all the arms will be played sufficiently often to refine the corresponding estimates.
Now, we apply a concentration argument to show that each arm will be played sufficiently many times, not only in expectation but also with high probability. Concretely, we show that if holds for all and , then the number of times the -th arm is played up to time ,
with high probability for each arm . To see this, we first notice that if for all , then
for each . Also, since is a martingale difference sequence with respect to the filtration , the following holds for all :
by an application of the Azuma–Hoeffding inequality. By rearranging terms, it becomes
and consequently,
(28)
On the event
we know that the inequality holds if for all and
In other words, when for all , the -th arm will be played times with probability , which implies that the estimation error for the -th arm is at most at the beginning of time , with high probability. Then the bias in resource consumption has the following upper bound,
where the last line is defined to be when . Here, the second line comes from the definition of the algorithm, the third and fourth lines come from properties of matrix multiplication, and the fifth line comes from plugging in the estimation error of . The choice of ensures that . Also, we choose as the smallest integer such that
In this way, when , we have the following two cases:
• We have
• : We have and . Consequently, the following holds
for all . This implies that
Combining the two cases, when , we have
and therefore, we know from (28),
∎
Proof of Lemma 6
Proof.
Note that
for each constraint . Thus, we only need to analyze the event
Note that from telescoping sum,
We define , and then have
(29)
We remark that we do not index the process by , since the analysis is the same for all constraints; we focus on a specific constraint . Moreover, from the definition of ,
Next, we develop a concentration argument for the first summation in (29). Specifically, the summand can be viewed as a martingale difference sequence. Note that
for . By applying Hoeffding's inequality, we have
holds for all and . By taking , we have
(30)
Proof of Lemma 7
Proof.
Consider the following LPs for all , and denote their optimal objective values by and , respectively.
s.t.
(32)
s.t.
Then, similarly to the proof of Lemma 2, to prove that is an optimal solution to LP (9), it suffices to show that is feasible to LP (9) and that for all .
First, we show feasibility. It suffices to check the following two results:
(a) The matrix is non-singular, and thus is well-defined.
(b) , and thus is a feasible solution to LP (9).
To show part (a), we prove that the smallest singular value of the matrix is positive. Recall that we use and to denote the smallest and largest singular values of a matrix , respectively.
where the first line comes from Weyl’s inequality on matrix eigenvalues/singular values, the second line comes from the definition of , the third line is obtained from the relation between the spectral norm and the infinity norm of a matrix, and the last line comes from the relation between the matrix infinity norm and the vector infinity norm of each column in the matrix. From the warm start condition (24), we have
and consequently,
(33)
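The Weyl-type perturbation bound used here is easy to verify numerically. A hedged sketch with an illustrative 2-by-2 consumption matrix and error level (neither taken from the paper):

```python
import numpy as np

def sigma_min_lower_bound(A_hat, err_inf):
    """Weyl-type bound (sketch): sigma_min(A) >= sigma_min(A_hat) - ||E||_2,
    and an m x n error matrix E with entries bounded by err_inf has
    spectral norm at most err_inf * sqrt(m * n)."""
    m, n = A_hat.shape
    return np.linalg.svd(A_hat, compute_uv=False)[-1] - err_inf * np.sqrt(m * n)

rng = np.random.default_rng(2)
A = np.array([[0.6, 0.1],
              [0.2, 0.5]])                  # "true" matrix (illustrative)
E = 0.01 * (2 * rng.random((2, 2)) - 1)     # entrywise error at most 0.01
sig_true = np.linalg.svd(A, compute_uv=False)[-1]
lb = sigma_min_lower_bound(A + E, 0.01)     # certified from the estimate alone
```

The point of the bound is that a positive lower bound on the smallest singular value can be certified from the estimated matrix and the entrywise error level alone, without access to the true matrix.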
For part (b), we show the non-negativity of the solution . The starting point is the condition that the optimal solution to the primal LP (1) has strictly positive entries, i.e., from Section 3.3, we know
where the inequality holds element-wise. Now, we show that,
which implies that
holds element-wise by the triangle inequality. To see this, from condition (24) that , we have
The first line comes from the definition of the matrix L∞ norm. The second line comes from the assumption that . The third and sixth lines come from the sub-multiplicativity of the matrix L∞ norm. The fourth and seventh lines come from the definition of following Section 3.3 and the relation between the spectral norm and the L∞ norm, i.e., . The last line reduces the matrix infinity norm to the vector infinity norm and applies condition (24). This finishes the feasibility part for .
Next, we show for all by utilizing and .
Recall that is the optimal objective value of the following LP.
s.t.
With a proof similar to Lemma 4 and using Assumption 3, we can show that
(34)
Then, by LP duality, and are also the optimal values of the dual LPs (C2) and (C2), respectively.
s.t. (35)
s.t. (36)
where is the matrix (vector) obtained by deleting the -th column (entry) of the matrix (vector) . Denote the optimal solution of LP (C2) by . By duality, we have
for all . Moreover, since is also feasible to LP (C2), by duality, we have
where the first line comes from weak duality, the second line comes from rearranging terms, the third line comes from and Hölder's inequality that , and the last line is obtained by inequality (34). Then, we have a similar statement that
Thus, since , we have
for all . Thus we complete the proof by combining the feasibility and the optimality of . ∎
C3 Relaxation of Warm Start Assumption
In this subsection, we relax Assumption 3 for the analysis in Section C1. Indeed, the following proposition shows that the warm start condition in Assumption 3 is automatically satisfied within time periods after Phase I, where is the threshold parameter in Assumption 3. The idea of the proof is straightforward: at the beginning of Phase II, although the parameter estimates may not satisfy the warm start condition, each arm will still be played with at least a certain probability. Then, by continually playing the arms, the corresponding parameters of the optimal arms are gradually refined until the warm start condition is met. The refinement takes at most time periods with high probability.
Proposition 6.
Proof.
First, we show that each arm will be played a linear number of times even if Phase II does not initialize with a warm start, i.e., if there exists an arm such that for some in Phase II. Without loss of generality, we assume that all UCB estimates are non-increasing and all LCB estimates are non-decreasing, since at each time we can use the minimal UCB estimates and the maximal LCB estimates of all quantities from time 1 to . Thus, and are non-increasing and is non-decreasing.
Now, consider the following LPs for all (the same as the LPs used in Lemma 7):
s.t.
s.t.
Note that if . With a proof similar to Lemma 7, we can show that
for all . Then, by triangle inequality, we have, for all in Phase II,
(37)
where the second line is obtained by
after Phase I. This motivates us to focus on the difference between and . Following the proof of Lemma 2, we can show that with probability
hold for all , and . Notice that is also a feasible solution to LP (7). Thus, we have
hold with probability no less than for all in Phase II. Here, the first line comes from the definition of , the second line comes from rearranging terms, the third line comes from and for all , and the last line comes from direct calculation. Plugging this result into inequality (37), we have
for all . Denote the optimal solution corresponding to by , and notice that
for all , we have
Thus, in expectation, the algorithm plays each arm at least times in each step if during Phase II. Then, with a proof similar to Lemma 5, we can show that there exists a depending polynomially on , , , , and such that, for all , the algorithm plays each arm at least times within steps with probability no less than . The algorithm then achieves the warm start. Moreover, since the algorithm only plays arms in and is large enough during this process, no additional regret is incurred. ∎
C4 Analysis of the Non-Binding Constraints
We analyze the non-binding constraints in the same way as the binding constraints. Previously, we derived an upper bound for in Section C1 under the assumption that all constraints are binding. With non-binding constraints present, we only need to show that they will not be exhausted before any binding constraint is exhausted. To adapt the analysis in Section C1 to this case, we only need to change the event to the intersection of the following event and . All the remaining parts of Section C1 then extend to the case with non-binding constraints.
Define
In this way, the event describes the "bad" event that the binding dimensions of the process stay within the region for and the non-binding dimensions of the process are not exhausted before time , but a certain non-binding constraint is exhausted at time . The following lemma establishes that such an event happens with small probability. It thus implies that as long as , the non-binding constraints will not be exhausted. Its proof shares the same structure as that of Lemma 5.
Lemma 8.
There exist constants depending on , such that for all , the following inequality holds
Proof of Lemma 8
Proof.
Based on the definition of , we have
holds for every non-binding constraint , where denotes the optimal solution to (1). Recall that for the non-binding constraints, we only need to ensure that they are not depleted before . The implication of this inequality is that the non-binding constraints have an amount of resource surplus that prevents depletion.
Intuitively, if the average consumption of every binding resource is close to , i.e., for each binding resource , then the average number of plays of each optimal arm will be close to . In turn, the average consumption of each non-binding resource will be close to . Following this intuition, we bound the probability of the statement in this lemma by a concentration inequality.
First, we estimate the number of plays of each arm. With a proof similar to Lemma 5, we can show that, with probability no less than , the following inequality holds for all and :
Based on the algorithm, if for all binding resources , we know that
for all . Thus, if , for all and binding resources, we have
where . Moreover, for all , on the event , we know that no resource is used up.
Second, we estimate the resource consumption of the non-binding resources. Let ; applying a concentration inequality, we have
Thus, with probability no more than , we have for all and . Moreover, notice that for all on the event , we have
holds on the event
where the first inequality comes from , the second line comes from the analysis of , the third line comes from the decomposition of , the fifth line comes from , and the last line comes from . Hence, on the event
and we have
if . Thus, as , the remaining amount of each non-binding resource is no less than at the end of time , implying . Thus, we have
and consequently,
∎
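The slack argument for non-binding resources can be illustrated with a toy simulation. The Bernoulli consumption model and all numbers below are our own assumptions, not the paper's process: a resource whose budget exceeds its mean consumption by a constant slack per period is essentially never depleted before the horizon ends.

```python
import numpy as np

def early_depletion_freq(slack, mean_use, T, trials, rng):
    """Toy model: a non-binding resource starts with budget
    T * (mean_use + slack), while per-period consumption is
    Bernoulli(mean_use). Return the fraction of runs in which the
    budget is depleted before the horizon ends."""
    totals = (rng.random((trials, T)) < mean_use).sum(axis=1)
    return float(np.mean(totals > T * (mean_use + slack)))

rng = np.random.default_rng(3)
p_deplete = early_depletion_freq(slack=0.1, mean_use=0.4, T=500,
                                 trials=2000, rng=rng)
```

Depletion here requires a deviation of order T times the slack, while the stochastic fluctuation is only of order the square root of T, which mirrors the concentration argument in the proof above.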
Appendix D Proof of Proposition 5
Proof.
The regret analysis follows by combining Propositions 1, 3, and 4. Proposition 1 shows that
(38)
First, we bound the first term on the right-hand side of inequality (38). From duality, we know that . Thus, we have
Moreover, Proposition 3 shows that, with probability no less than , the algorithm only plays arms in during Phase I and that Phase I terminates within
time periods. Thus,
(39)