
Learning for Bandits under Action Erasures

Osama A. Hanna, Merve Karakas, Lin F. Yang and Christina Fragouli
University of California, Los Angeles
Email: {ohanna, mervekarakas, linyang, christina.fragouli}@ucla.edu
Abstract

We consider a novel multi-armed bandit (MAB) setup, where a learner needs to communicate the actions to distributed agents over erasure channels, while the rewards for the actions are directly available to the learner through external sensors. In our model, while the distributed agents know if an action is erased, the central learner does not (there is no feedback), and thus does not know whether the observed reward resulted from the desired action or not. We propose a scheme that can work on top of any (existing or future) MAB algorithm and make it robust to action erasures. Our scheme results in a worst-case regret over action-erasure channels that is at most a factor of $O(1/\sqrt{1-\epsilon})$ away from the no-erasure worst-case regret of the underlying MAB algorithm, where $\epsilon$ is the erasure probability. We also propose a modification of the successive arm elimination algorithm and prove that its worst-case regret is $\tilde{O}(\sqrt{KT}+K/(1-\epsilon))$, which we show is optimal by providing a matching lower bound.

I Introduction

Multi-armed bandit (MAB) problems have gained popularity in a wide range of applications that include recommendation systems, clinical trials, advertising, and distributed robotics [1, 2]. MABs are sequential decision problems where a learner, at each round, selects an action from a set of $K$ available actions (arms) and receives an associated reward; the aim is to maximize the total reward over $T$ rounds. In this paper, we develop a theoretical framework that explores a new setup, MAB systems with action erasures.

In particular, we consider a MAB system where the learner needs to communicate the actions to be taken to distributed agents over wireless channels subject to erasures with some probability $\epsilon$, while the rewards for the actions taken are directly available to the learner through external sensors. For example, the learner may be regulating drone traffic in a slowly varying environment (weather conditions, obstacles), where passing-by drones receive which path to take or which maneuvers to perform, and the learner monitors the outcome through video cameras, magnetic sensors, and other sensors. Similarly, since online learning has been used for medical micro-robots [3, 4], our setup can help with problems such as directing micro-robots inside veins on how to move or how much of a medical substance to release, while observing the results through patient medical imaging and vital-sign measurements. In our model, while the agent knows if an action is erased, the learner does not (there is no feedback); therefore, at each round the learner receives a reward that may result from playing the requested action or another action (if an erasure occurred).

Note that at any given round, although the requested action may be erased and not received by the agent, the agent still needs to play an action: the micro-robot needs to release some amount (perhaps zero) of substance and the drone needs to continue moving in some way. We make the assumption that if at a round the agent does not receive an action, the agent continues to play the most recently received action, until a new action is successfully received to replace it. This is, we believe, a reasonable assumption. Indeed, during exploitation, we would like the agents to persistently play the identified optimal action, which this strategy achieves. In contrast, as we discuss in Section III, playing a random or a fixed action results in suboptimal performance, while using more sophisticated strategies (e.g., keeping track of all previously played actions and optimizing accordingly) may not be possible in distributed systems, where multiple agents may be playing different actions and where agents may not even be able to observe the rewards.

To the best of our knowledge, this setup has not been systematically studied, despite the extensive MAB literature.

Indeed, existing work has examined MAB systems under a multitude of setups and constraints, such as stochastic bandits [5, 6, 7], adversarial bandits [8, 9, 10], and contextual bandits [10, 11], as well as over a number of distributed setups [12, 13, 14, 15]. In particular, a number of works examine the case where rewards are corrupted adversarially or stochastically [16, 17, 18, 19]. However, all the works we know of assume that the requested actions can be sent through perfect communication channels. This assumption makes sense for delay-tolerant applications, where we can leverage feedback and error-correcting codes to ensure perfect communication of the desired action; yet as the applicability of the MAB framework expands to agents, such as micro-robots, that are communication limited, we believe that our proposed model can be of interest, both in terms of theory and practice.

The main questions we ask (and answer) in this paper are: for our action erasure model (i) what is an optimal strategy? and (ii) if a MAB algorithm is already deployed, can we couple it, at low regret cost, with a generic layer, akin to erasure coding, that makes the MAB algorithm operation robust to erasures?

Our main contributions include:

  • We propose a scheme that can work on top of any MAB algorithm and make it robust to action erasures. Our scheme results in a worst-case regret over the erasure channel that is at most a factor of $O(1/\sqrt{1-\epsilon})$ away from the no-erasure worst-case regret of the underlying MAB algorithm, where $\epsilon\in[0,1)$ is the erasure probability.

  • We propose a modification of the successive arm elimination algorithm and prove that its worst-case regret is $\tilde{O}(\sqrt{KT}+K/(1-\epsilon))$.

  • We prove a matching $\Omega(\frac{K}{1-\epsilon})$ regret lower bound.

I-A Related Work

MAB Algorithms. Over the years, several stochastic MAB algorithms that achieve optimal or near-optimal regret bounds have been proposed under different assumptions for the environment or model parameters.

For instance, over a horizon of length $T$, Thompson sampling [6] and UCB [5, 7] achieve a gap-dependent regret bound of $\tilde{O}(\sum_{i:\Delta_{i}>0}\frac{1}{\Delta_{i}})$ and a worst-case regret bound of $O(\sqrt{KT\log{T}})$, where $\tilde{O}$ hides $\log$ factors. These algorithms achieve nearly optimal regret; a lower bound of $\Omega(\sum_{i:\Delta_{i}>0}\frac{1}{\Delta_{i}})$ on the gap-dependent regret was proved in [20], while a lower bound of $\Omega(\sqrt{KT})$ on the worst-case regret was proved in [21]. However, such algorithms are not robust to action erasures.
MAB with Adversarial Corruption. Recently, [16] introduced stochastic multi-armed bandits with adversarial corruption, where the rewards are corrupted by an adversary with a corruption budget $C$, an upper bound on the accumulated corruption magnitudes over the $T$ rounds. [16] proposes a simple elimination algorithm that achieves an $\tilde{O}(C\sqrt{KT})$ regret bound in the worst case. Later, [17] improved the dependency on $C$ to be additive by achieving a worst-case regret bound of $\tilde{O}(\sqrt{KT}+KC)$, while also providing a regret lower bound of $\Omega(C)$. The work in [19] further improves the worst-case regret bound to $\tilde{O}(\sqrt{KT}+C)$ when the optimal arm is unique. The work in [18] studies a similar problem where rewards are corrupted in each round independently with a fixed probability; it achieves a regret of $\tilde{O}(\sqrt{KT}+CT)$, where $C$ is an upper bound on the expected corruption, and proves an $\Omega(CT)$ lower bound for that model.

While our model can be reduced to MABs with reward corruption, the resulting amount of corruption is $C=\Omega(\epsilon T)$; for example, substituting $C=\Omega(\epsilon T)$ into the $\tilde{O}(\sqrt{KT}+KC)$ bound of [17] gives $\tilde{O}(K\epsilon T)$, which is linear in $T$. This causes the best-known algorithms for MABs with adversarial corruption to suffer a linear regret, given the increased exploration required to handle this large amount of corruption. In contrast, by exploiting the fact that corruptions occur in the actions, we achieve a gap-dependent regret bound of $\tilde{O}(\sum_{i:\Delta_{i}>0}\frac{1}{\Delta_{i}})$ and a worst-case regret bound of $\tilde{O}(\sqrt{KT}+\frac{K}{1-\epsilon})$.

I-B Paper Organization

Section II introduces our system model and notation; Section III discusses some straightforward approaches and why they fail; Sections IV and V describe our proposed algorithms and upper bounds; and Section VI provides our lower bound.

II Problem Formulation

Standard MAB Setup. We consider a MAB problem over a horizon of length $T$, where a learner interacts with an environment by pulling arms from a set of $K$ arms (or actions). That is, we assume operation in $T$ discrete times, where at each time $t\in[T]$, the learner sends the index of an arm $a_{t}$ to an agent; the agent pulls the arm $a_{t}$, resulting in a reward $r_{t}$ that can be observed by the learner. The learner selects the arm $a_{t}$ based on the history of pulled arms and previously seen rewards $a_{1},r_{1},\ldots,a_{t-1},r_{t-1}$. The reward $r_{t}$ is sampled from an unknown distribution with an unknown mean $\mu_{a_{t}}$. The set of $K$ distributions for all arm rewards is referred to as a bandit instance. We use $\Delta_{i}$ to denote the gap between the mean value of arm $i$ and the best (highest-mean) arm. For simplicity, we follow the standard assumption that rewards are supported on $[0,1]$. The analysis directly extends to sub-Gaussian reward distributions.

MAB with Action Erasures. We assume that the learner is connected to the agent over an erasure channel, where the action index sent by the learner to the agent at each time $t$ can be erased with probability $\epsilon\in[0,1)$. We follow the standard erasure channel model where erasures at different times are independent. We assume no feedback in the channel; in particular, while the agent knows when the action is erased (does not receive anything at time $t$), the learner does not. Moreover, we assume that the learner can directly observe the reward of the action played, $r_{t}$.

Agent Operation. Recall that at each time $t$, the learner transmits an action index $a_{t}$. The agent plays the most recent action she received: in particular, at time $t$, the agent plays the action $\tilde{a}_{t}=a_{t}$ if no erasure occurs, while if an erasure occurs, $\tilde{a}_{t}=\tilde{a}_{t-1}$. We assume that $\tilde{a}_{0}$ is initialized uniformly at random, i.e., if the first action is erased, the agent chooses $\tilde{a}_{1}$ uniformly at random. In the example described in Table I, $\tilde{a}_{4}=\tilde{a}_{3}=\tilde{a}_{2}=a_{2}$. Section III motivates why we select this particular agent operation, by arguing that alternative simple strategies can significantly deteriorate the performance.

TABLE I: Example of actions played under action erasures.
Round ($t$)              1   2   3   4   5
Learner ($a_{t}$)        1   3   2   4   2
Agent ($\tilde{a}_{t}$)  1   3   3   3   2
Erasure                  -   -   x   x   -

Performance Metric. The objective is to minimize the regret

$R_{T}=\sum_{t=1}^{T}\max_{i\in[K]}\mu_{i}-\sum_{t=1}^{T}\mu_{\tilde{a}_{t}}.$ (1)
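
To make the model concrete, the following is a minimal Python simulation sketch of the setup above: an erasure channel from learner to agent, an agent that keeps playing the most recently received action, and the regret (1) computed on the arms actually played. The Bernoulli reward model and the round-robin learner are illustrative assumptions, not part of the formulation.

```python
import numpy as np

def simulate(means, eps, T, select_action, rng=None):
    """Simulate the action-erasure MAB model of Section II.

    means:         list of arm means (rewards supported on [0, 1]; Bernoulli here)
    eps:           erasure probability
    select_action: function (t, history) -> requested arm index a_t
    Returns the realized regret per Eq. (1), computed on the arms the agent
    actually played (which the learner does not observe).
    """
    rng = np.random.default_rng() if rng is None else rng
    K, best = len(means), max(means)
    played = int(rng.integers(K))      # \tilde{a}_0 initialized uniformly at random
    history, regret = [], 0.0
    for t in range(T):
        requested = select_action(t, history)
        if rng.random() > eps:         # no erasure: agent switches to the requested arm
            played = requested
        # on erasure, `played` keeps its value: the most recently received action
        reward = float(rng.random() < means[played])
        history.append((requested, reward))   # learner sees the reward, not `played`
        regret += best - means[played]
    return regret

# Illustrative run with a round-robin learner.
means = [0.9, 0.5, 0.4]
print(simulate(means, eps=0.3, T=10_000,
               select_action=lambda t, h: t % len(means)))
```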

III Motivation and Challenges

In this paper we consider a specific agent operation, namely, the agent simply plays the most recently received action. In this section, we argue that this is a reasonable choice, by considering alternate simple strategies and explaining why they do not work well. We also discuss the technical challenges of our approach.

Random Action. A very simple strategy could be, when an erasure occurs, for the agent to select one of the $K$ actions to play uniformly at random. The rewards seen by the learner then follow a new distribution whose mean is an affine function of the pulled arm's mean; in particular,

$\mathbb{E}\left[r_{t}|a_{t}\right]=\left(1-\epsilon\right)\mu_{a_{t}}+\epsilon\,\mathbb{E}\left[r_{t}|\varepsilon_{t}\right]=\left(1-\epsilon\right)\mu_{a_{t}}+\frac{\epsilon}{K}\sum_{k=1}^{K}\mu_{k},$

where $\varepsilon_{t}$ is the event that an erasure occurs at time $t$. Hence, the best arm does not change, the gaps between arm means are only scaled by $(1-\epsilon)$, and the learner can identify the best arm and provide a good policy. However, as actions are erased in $\Omega(\epsilon T)$ iterations, the regret becomes $\Omega(\epsilon T)$: even if the learner always requests the best arm, each erased round plays a uniformly random arm and incurs an expected regret of $\frac{1}{K}\sum_{k}\Delta_{k}$, a constant independent of $T$.

Fixed Action. Very similar arguments hold if, when erasures occur, the agent always plays a fixed (predetermined) action $i$; unless $i$ happens to be the optimal action, the experienced regret will be linear.

Last Received Action. To see why the previous strategies fail, observe that although the learner may have identified the best policy, this will not be consistently played. In this paper, we make the assumption that if an action is erased, the agent plays the last successfully received action. Thus if the optimal action is identified, it will be consistently played.

We note that our selected agent strategy introduces memory, which complicates the analysis since it creates a more involved dependency between actions and erasures. For example, any previously sent action has a non-zero probability of being played at the current iteration, where this probability changes with time and with the sequence of previous actions. Under the previous two strategies, the learner still observes rewards with fixed means, and the problem can be treated as a valid MAB instance (although the optimal action might differ from the ground truth). The last-received-action strategy, unfortunately, changes the reward distribution, and standard MAB analysis no longer holds.

IV Repeat-the-Instruction Module

In this section we propose a scheme that works on top of any MAB algorithm to make it robust to erasures. Our scheme adds a form of repetition coding on top of the MAB algorithm operation. In particular, when the underlying MAB algorithm selects an action to play, our scheme sends the chosen action $\alpha$ times to the agent and only associates the last received reward with the chosen action (hence the name Repeat-the-Instruction). Since the agent plays the last received action, one successful reception within the $\alpha$ slots is sufficient to make the last reward a sample from the distribution of the chosen arm. Thus our algorithm, summarized in Algorithm 1, effectively partitions the $T$ rounds into $T/\alpha$ groups, where each group of $\alpha$ rounds is treated by the underlying bandit algorithm as one time slot. The next theorem describes the resulting performance.

Algorithm 1 Repeat-the-Instruction Module
Input: $\alpha$, ALG
for $t\leftarrow 1,\ldots,T$ do
     Learner:
       1) $a_{t}=\text{ALG}(\frac{t-1}{\alpha}+1)$ if $t\equiv 1 \pmod{\alpha}$, else $a_{t}=a_{t-1}$
       3) observe $r_{t}$ (corresponding to $\tilde{a}_{t}$) if $t\equiv 0 \pmod{\alpha}$
     Agent:
       2) play $\tilde{a}_{t}=\tilde{a}_{t-1}$ if erasure, else play $\tilde{a}_{t}=a_{t}$
          ($\tilde{a}_{0}$ is initialized uniformly at random)
end for
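To complement the pseudo-code, the following is a minimal Python sketch of the Repeat-the-Instruction module. The `next_arm`/`update` interface of ALG and the `send`/`observe` channel primitives are hypothetical names chosen for illustration; any bandit algorithm exposing such an interface could be wrapped.

```python
import math

def repeat_the_instruction(alg, send, observe, T, eps):
    """Wrap a MAB algorithm `alg` to tolerate action erasures (Algorithm 1 sketch).

    alg:       object with next_arm() -> arm and update(arm, reward)
               (hypothetical interface, for illustration)
    send(a):   transmits arm index a to the agent; may be erased (assumes 0 < eps < 1)
    observe(): returns the reward of the arm the agent actually played
    """
    alpha = math.ceil(2 * math.log(T) / math.log(1.0 / eps))  # repetitions per action
    t = 0
    while t < T:
        arm = alg.next_arm()
        run = min(alpha, T - t)
        reward = None
        for _ in range(run):      # send the same instruction alpha times
            send(arm)
            reward = observe()    # a reward is observed in every round
            t += 1
        if run == alpha:
            # Only the last reward of a full run is credited to `arm`: with high
            # probability at least one of the alpha transmissions got through, so
            # this reward is indeed drawn from the distribution of `arm`.
            alg.update(arm, reward)
```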
Theorem 1

Let ALG be a MAB algorithm with expected regret upper bounded by $R^{\text{ALG}}_{T}(\{\Delta_{i}\}_{i=1}^{K})$ for any instance with gaps $\{\Delta_{i}\}_{i=1}^{K}$. For $\alpha=\lceil 2\log T/\log(\frac{1}{\epsilon})\rceil$, using Repeat-the-Instruction on top of ALG achieves an expected regret $\mathbb{E}[R_{T}]$ satisfying

$\mathbb{E}[R_{T}]\leq 2\alpha R^{\text{ALG}}_{\lceil\frac{T}{\alpha}\rceil}(\{\Delta_{i}\}_{i=1}^{K})+\alpha+1,$ (2)

where the expectation is over the randomness in the MAB instance, erasures, and algorithm.

Proof. We use “run" to refer to the $\alpha$ rounds that repeat each action dictated by ALG. Let $G$ be the event that all runs contain at least one round with no erasure. Conditioned on $G$, the maximum number of consecutive plays of an action due to erasures is $2\alpha-1$ (across two runs, when the action transmitted at the first round of the first run is not erased, while the first $\alpha-1$ rounds of the next run are erased). Hence, we have that

$\mathbb{E}\left[R_{T}|G\right]\leq 2\alpha\,\mathbb{E}\Big[\textstyle\sum_{t\equiv 1\text{ mod }\alpha,\ \alpha\leq t\leq T}\mu_{a^{\star}}-\mu_{a_{t}}\,\Big|\,G\Big]+\alpha,$ (3)

where $a^{\star}$ is the optimal arm and the second term bounds the regret of the first $\alpha$ iterations. Note that, conditioned on $G$, if $t\equiv 0\text{ mod }\alpha$ with $t\geq\alpha$, we have that $\tilde{a}_{t}=a_{t-\alpha+1}$, that is, the reward $r_{t}$ is sampled from arm $a_{t-\alpha+1}$. Indeed, in Algorithm 1, $r_{t}$ is the reward associated with arm $a_{t-\alpha+1}$ for $t\equiv 0\text{ mod }\alpha$, $t\geq\alpha$. We observe that, as erasures occur independently of the bandit environment and the learner's actions, for $t\equiv 0\text{ mod }\alpha$, $t\geq\alpha$, any set $A\subseteq\mathbb{R}$, and action $a_{t-\alpha+1}$, we have that $\mathbb{P}[r_{t}\in A|G,a_{t-\alpha+1}]=\mathbb{P}[r_{t}\in A|\tilde{a}_{t}=a_{t-\alpha+1}]$ (note that $G$ only depends on erasures). In particular, conditioned on $G$, the algorithm ALG receives rewards generated from a bandit instance with the same reward distributions as the original instance, hence the same gaps, over a time horizon $\lceil\frac{T}{\alpha}\rceil$. Hence, its regret is upper bounded by $R^{\text{ALG}}_{\lceil\frac{T}{\alpha}\rceil}(\{\Delta_{i}\}_{i=1}^{K})$. Substituting in (3) we get that

$\mathbb{E}\left[R_{T}|G\right]\leq 2\alpha R^{\text{ALG}}_{\lceil\frac{T}{\alpha}\rceil}(\{\Delta_{i}\}_{i=1}^{K})+\alpha.$ (4)

We finally upper bound the probability of the complement event $G^{C}$. Let $G_{i}$ denote the event that run $i$ contains at least one slot with no erasure. The probability of $G^{C}$ can be bounded as

$\mathbb{P}[G^{C}]\stackrel{(1)}{\leq}\sum_{i=1}^{\lceil T/\alpha\rceil}\mathbb{P}[G_{i}^{c}]=\sum_{i=1}^{\lceil T/\alpha\rceil}\epsilon^{\alpha}\leq\sum_{i=1}^{\lceil T/\alpha\rceil}\frac{1}{T^{2}}\leq 1/T,$ (5)

where $(1)$ follows by the union bound, and the second-to-last inequality uses the choice of $\alpha$: $\epsilon^{\alpha}\leq\epsilon^{2\log T/\log(1/\epsilon)}=e^{-2\log T}=1/T^{2}$. From (4), we get that

$\mathbb{E}\left[R_{T}\right]\leq 2\alpha R^{\text{ALG}}_{\lceil\frac{T}{\alpha}\rceil}(\{\Delta_{i}\}_{i=1}^{K})+\alpha+T\,\mathbb{P}[G^{C}]\leq 2\alpha R^{\text{ALG}}_{\lceil\frac{T}{\alpha}\rceil}(\{\Delta_{i}\}_{i=1}^{K})+\alpha+1.\qquad\blacksquare$
Corollary 1

For $\alpha=\lceil 2\log T/\log(\frac{1}{\epsilon})\rceil$, if $R_{T}^{\text{ALG}}$ is the worst-case expected regret of ALG, then

$\mathbb{E}[R_{T}]\leq R_{T}^{\text{ALG}}/\sqrt{1-\epsilon}.$ (6)

This follows from Theorem 1 by observing that $\log(1/\epsilon)=\Omega(1-\epsilon)$, and that the worst-case regret of any algorithm is $\Omega(\sqrt{KT})$. The next corollary follows by substituting the regret bound of the UCB algorithm [5, 7] in Theorem 1.

Corollary 2

For $\alpha=\lceil 2\log T/\log(\frac{1}{\epsilon})\rceil$, the proposed algorithm with the UCB algorithm [5, 7] achieves a gap-dependent regret bounded by $\mathbb{E}[R_{T}]\leq c\alpha\sum_{i:\Delta_{i}>0}\frac{\log T}{\Delta_{i}}$, and a worst-case regret bounded by $\mathbb{E}[R_{T}]\leq c\sqrt{TK\log{T}/(1-\epsilon)}$, where $c$ is a universal constant.

Observation. Note that our proposed module achieves the optimal dependency of the regret on $T$ and $K$. However, we show next that the multiplicative dependency on $\epsilon$ is suboptimal, by providing an algorithm whose regret bound has an additive dependency on $\epsilon$. This effect is small for small values of $\epsilon$ but becomes significant as $\epsilon$ approaches $1$.
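
As a rough numerical illustration of this observation (constants suppressed, natural logarithms assumed): for $T=10^{6}$, the choice $\alpha=\lceil 2\log T/\log(1/\epsilon)\rceil$ gives $\alpha=40$ for $\epsilon=0.5$ but $\alpha=2750$ for $\epsilon=0.99$. For $\epsilon=0.99$, Repeat-the-Instruction on top of UCB thus pays a multiplicative factor of $1/\sqrt{1-\epsilon}=10$ on the $\sqrt{TK\log T}$ term (Corollary 2), whereas the algorithm of the next section pays only an additive $O(K\log T/(1-\epsilon))=O(100\,K\log T)$ term on top of the erasure-free $\tilde{O}(\sqrt{KT})$ regret.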

V The Lingering SAE (L-SAE) Algorithm

The intuition behind the repeat-the-instruction algorithm is that we do not frequently switch between arms - and thus playing the last successfully received action often coincides with playing the desired action. In this section, we propose “Lingering Successive Arm Elimination" (L-SAE), an algorithm that by design does not frequently switch arms, and evaluate its performance; in the next section, we prove a lower bound establishing L-SAE is order optimal.

L-SAE builds on the Successive Arm Elimination (SAE) algorithm [22]. SAE works in batches of exponentially growing length, where at the end of each batch the arms that appear to be suboptimal are eliminated; in batch $i$, each of the surviving arms is pulled $4^{i}$ times. We add two modifications to SAE to make it robust to erasures. First, we avoid the frequent arm switches of the first batches, when $4^{i}$ is small: in the first batch, the algorithm pulls each arm $4\alpha$ times, and the number of pulls for surviving arms in a batch is then $4$ times that of the previous batch. Second, we ignore the first half of the samples for each arm. This ensures that all the retained rewards are collected from the desired arm with high probability, since the probability that half the samples are erased is small. Given these modifications, we also update the arm elimination criterion to account for the higher variance of the mean estimates. In particular, the algorithm starts with $A=[K]$ as the set of good arms and, at the end of batch $i$, eliminates the arms whose empirical means are more than $\sqrt{\log(KT)/(\alpha 4^{i-2})}$ away from the empirical mean of the arm that appears best. The pseudo-code of the algorithm is provided in Algorithm 2.

Algorithm 2 Lingering SAE Algorithm
  • Initialize: set of good arms $A=[K]$, batch index $i=1$.

  • For batch $i$:

    • Pull each arm in $A$ $M_{i}=\alpha 4^{i}$ times to receive rewards $r^{a}_{1},\ldots,r^{a}_{M_{i}}$ for each arm $a\in A$.

    • Update means: $\mu_{a}^{(i)}=\sum_{j=M_{i}/2+1}^{M_{i}}r^{a}_{j}\,/\,(M_{i}/2)\ \ \forall a\in A$.

    • $A\leftarrow\{a\in A\,|\,\max_{\tilde{a}\in A}\mu_{\tilde{a}}^{(i)}-\mu_{a}^{(i)}\leq 4\sqrt{\log(KT)/M_{i}}\}$.

    • $i\leftarrow i+1$.
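
For concreteness, the following is a minimal Python sketch of L-SAE, using the same hypothetical `send`/`observe` channel primitives as in the sketch of Algorithm 1. It mirrors Algorithm 2: $M_{i}=\alpha 4^{i}$ pulls per surviving arm in batch $i$, means estimated from the second half of each arm's pulls, and elimination with radius $4\sqrt{\log(KT)/M_{i}}$.

```python
import math

def lingering_sae(K, T, eps, send, observe):
    """Lingering Successive Arm Elimination (Algorithm 2), sketched."""
    alpha = math.ceil(2 * math.log(T) / math.log(1.0 / eps))
    good = list(range(K))              # A = [K]
    i, t = 1, 0
    while t < T and len(good) > 1:
        M = alpha * 4 ** i             # pulls per surviving arm in batch i
        means = {}
        for a in good:
            rewards = []
            for _ in range(M):
                if t >= T:             # time budget exhausted mid-batch
                    return good
                send(a)
                rewards.append(observe())
                t += 1
            # Ignore the first half of the samples: with high probability the
            # second half was actually collected from arm a despite erasures.
            second_half = rewards[M // 2:]
            means[a] = sum(second_half) / len(second_half)
        radius = 4 * math.sqrt(math.log(K * T) / M)
        best = max(means.values())
        good = [a for a in good if best - means[a] <= radius]
        i += 1
    # A single surviving arm: keep requesting it (equivalent to continuing
    # the batches of Algorithm 2 with one arm).
    while t < T:
        send(good[0])
        observe()
        t += 1
    return good
```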

Theorem 2

For $\alpha=\lceil 2\log T/\log(\frac{1}{\epsilon})\rceil$, Algorithm 2 achieves a regret that is bounded by

$R_{T}\leq c\left(\frac{K\log T}{1-\epsilon}+\sum_{i:\Delta_{i}>0}\frac{\log T}{\Delta_{i}}\right)$

with probability at least $1-1/T$, for a universal constant $c$.

Proof. Let $G$ be the good event that, in every batch, the second half of every arm's pulls are from the correct (desired) arm. Note that $G$ fails only if there is a batch $i$ and an arm $j$ such that the entire first half of arm $j$'s pulls in batch $i$ is erased. Since in any batch each arm is pulled at least $2\alpha$ times, we have that

$\mathbb{P}[G]=1-\mathbb{P}[G^{C}]\geq 1-K\log(T)\,\epsilon^{\alpha}\geq 1-0.25/T.$ (7)

We notice that erasures are independent of the bandit environment and the learner's actions, and $G$ only depends on erasures. Hence, conditioned on $G$ and the picked arm $a$, the second half of the rewards $\{r_{j}^{a}\}_{j=M_{i}/2+1}^{M_{i}}$ are collected from arm $a$ and follow the original reward distribution of arm $a$. Hence, as the rewards are supported on $[0,1]$, we have that conditioned on $G,a$, the reward $r_{j}^{a}$, $j>M_{i}/2$, is $1/4$-subGaussian with mean $\mu_{a}$. By concentration of sub-Gaussian random variables, conditioned on $G$, we also have that the following event

$G^{\prime}=\{|\mu_{a}^{(i)}-\mu_{a}|\leq 2\sqrt{\log(KT)/M_{i}}\ \ \forall a\in A_{i},\ \forall i\in[\log T]\},$

where $M_{i}$ is the number of pulls for surviving arms in batch $i$ and $A_{i}$ is the set of surviving arms in batch $i$, occurs with probability at least $1-0.25/T$ (by Hoeffding's inequality, each deviation event has probability at most $2\exp(-2\cdot\frac{M_{i}}{2}\cdot\frac{4\log(KT)}{M_{i}})=2(KT)^{-4}$, and a union bound over at most $K\log T$ arm-batch pairs gives the claim). Hence, we have that

$\mathbb{P}[G\cap G^{\prime}]\geq(1-0.25/T)^{2}\geq 1-1/T.$

In the remaining part of the proof we condition on the event $G\cap G^{\prime}$. By the elimination criterion in Algorithm 2, and assuming $G\cap G^{\prime}$, the best arm will not be eliminated. This is because the elimination criterion does not hold for the best arm, as

$\mu_{a}^{(i)}-\mu_{a^{\star}}^{(i)}\leq\mu_{a}-\mu_{a^{\star}}+4\sqrt{\log(KT)/M_{i}}\leq 4\sqrt{\log(KT)/M_{i}}\quad\forall a,\ \forall i.$ (8)

Now consider an arm $a$ with gap $\Delta_{a}>0$ and let $i$ be the smallest integer for which $4\sqrt{\log(KT)/M_{i}}<\frac{\Delta_{a}}{2}$. Then, we have that

$\mu_{a^{\star}}^{(i)}-\mu_{a}^{(i)}\geq\mu_{a^{\star}}-\mu_{a}-4\sqrt{\log(KT)/M_{i}}>\Delta_{a}-\frac{\Delta_{a}}{2}>4\sqrt{\log(KT)/M_{i}}.$ (9)

Hence, arm $a$ will be eliminated before the start of batch $i+1$. We also notice that, from $4\sqrt{\log(KT)/M_{i}}<\frac{\Delta_{a}}{2}$, the value of $i$ can be bounded as

$i\leq\max\left\{1,\log_{4}\left(\frac{65\log(KT)}{\alpha\Delta_{a}^{2}}\right)\right\}.$ (10)

By the exponential increase in the number of pulls of each arm, we get that, until it is eliminated, arm $a$ will be pulled by the learner at most

$T_{a}(T)\leq\sum_{j=1}^{i}4^{j}\alpha\leq 4^{i+1}\alpha\leq c\left(\alpha+\frac{\log(KT)}{\Delta_{a}^{2}}\right)$

times, for some absolute constant $c>0$. We also notice that, on the event $G$, the agent will pull arm $a$ at most $4T_{a}(T)$ times. This results in a regret of at most $c(\alpha+\log(KT)/\Delta_{a})$ for arm $a$. Summing the regret over all arms, we get that, conditioned on $G\cap G^{\prime}$,

$R_{T}\leq c\left(\frac{K\log T}{\log(1/\epsilon)}+\sum_{a:\Delta_{a}>0}\frac{\log(KT)}{\Delta_{a}}\right).$

The proof is concluded by noticing that $\log(1/\epsilon)=\Omega(1-\epsilon)$. $\blacksquare$
The previous theorem directly implies the worst-case regret bound $R_{T}\leq c\left(\frac{K\log T}{1-\epsilon}+\sqrt{KT\log(KT)}\right)$.

VI Lower Bound

We here prove a lower bound that matches the upper bound in Theorem 2 up to $\log$ factors. A lower bound of $\Omega(\sum_{i:\Delta_{i}>0}\frac{1}{\Delta_{i}})$ on the gap-dependent regret is already provided in [20], and a lower bound of $\Omega(\sqrt{KT})$ on the worst-case regret is provided in [21]. Thus it suffices to prove a lower bound of $\Omega(K/(1-\epsilon))$, which we provide next in Theorem 3.

Theorem 3

Let $T\geq\frac{K}{4\log(1/\epsilon)}$ and $\epsilon\geq 1/2$. Assuming the agent operation in Section II, for any policy $\pi$ there exists a $K$-armed bandit instance $\nu$ such that

$\mathbb{E}[R_{T}(\pi,\nu)]\geq c\frac{K}{1-\epsilon},$ (11)

where $\mathbb{E}[R_{T}(\pi,\nu)]$ is the expected regret of the policy $\pi$ over the instance $\nu$ and $c$ is a universal constant.

Proof. We consider $K$ bandit instances, each with $K$ arms, where in instance $\nu_{i}$ the means of the reward distributions are

$\mu_{j}^{(i)}=\mathbf{1}\{j=i\},\quad j=1,\ldots,K,$ (12)

with $\mathbf{1}$ the indicator function. We assume a noiseless setting where, in instance $\nu_{i}$, picking arm $j$ results in reward $\mu^{(i)}_{j}$ almost surely. We consider two events:
$E_{i}$ indicates that the first $1/\log(1/\epsilon)$ pulls of arm $i$ are erased;
$E^{\prime}_{i}=\{\tilde{a}_{0}\neq i\}$ indicates that, if the first transmitted arm is erased, the agent does not select to pull arm $i$ in the first round (recall that if an erasure occurs at the first round, the agent randomly selects an action according to some distribution). We note that $E_{i},E^{\prime}_{i}$ are independent events since erasures are independent of the agent operation.

We will show that there is no policy $\pi$ that can make $\mathbb{E}[R_{T}(\pi,\nu_{i})|E_{i}\cap E^{\prime}_{i}]$ small for all $i$. We first note that there are at least $K/2$ arms with $\mathbb{P}[E^{\prime}_{i}]\geq 1/2$. (We assume for simplicity that $K$ is divisible by $4$; the proof easily generalizes by taking the floor in divisions, which only affects the constants.) Pick a set $I\subseteq[K]$ with $|I|=K/2$ and $\mathbb{P}[E^{\prime}_{i}]\geq 1/2$ $\forall i\in I$. Now define the event $\mathcal{P}_{i}$:
$\mathcal{P}_{i}$: arm $i$ is picked no more than $\frac{1}{\log(1/\epsilon)}$ times by the learner in the first $\frac{K}{4\log(1/\epsilon)}$ rounds.
We next consider the minimum worst-case probability of $\mathcal{P}_{i}$ conditioned on $E_{i}\cap E^{\prime}_{i}$ under the distribution induced by instance $\nu_{i}$; namely,

$\min_{\pi}\max_{i\in I}\mathbb{P}_{\nu_{i}}[\mathcal{P}_{i}|E_{i}\cap E^{\prime}_{i}].$ (13)

Let the set $A$ denote the indices of arms for which $\mathcal{P}_{i}$ holds, i.e., $A=\{i\in I\,|\,\mathcal{P}_{i}\}$, and let $B=I\setminus A$. We have that $A,B$ are disjoint and their union is the set $I$. Moreover, the quantity in (13) can be rewritten as

$\min_{\pi}\max_{i\in I}\mathbb{P}_{\nu_{i}}[i\in A|E_{i}\cap E^{\prime}_{i}].$ (14)

We make two observations: (i) after $\frac{K}{4\log(1/\epsilon)}$ rounds, there exist at least $\frac{|I|}{2}=\frac{K}{4}$ arms in $I$ for which $\mathcal{P}_{i}$ holds (hence, $|A|\geq|I|/2$ and $|B|\leq|I|/2$); and (ii) if for instance $\nu_{i}$, conditioned on $E_{i}\cap E^{\prime}_{i}$, arm $i$ is picked fewer than $1/\log(1/\epsilon)$ times by the learner, then all the rewards received by the learner will be zeros, thus no information is provided to the policy during the first $\frac{K}{4\log(1/\epsilon)}$ rounds.
From observation (ii), and the fact that whether $i\in A$ or not does not change after the first reward feedback that equals $1$, it follows that the optimal value of (13) does not change if the learner does not observe rewards. Hence, the learner can proceed assuming that all previous rewards are zeros. As a result, the learner can decide on the arms to pull in the first $K/(4\log(1/\epsilon))$ slots ahead of time (i.e., at $t=1$, since no feedback is required); equivalently, the learner can divide the set $I$ into two sets $A$ and $B$ (possibly in a random way) ahead of time, with $|B|\leq|I|/2$ (from observation (i)). As the learner decides on $A$ ahead of time, the probability of $i\in A$ does not depend on the instance, the reward outcomes, or $\tilde{a}_{0}$; hence, we can simply denote $\mathbb{P}_{\nu_{i}}[i\in A|E_{i}\cap E^{\prime}_{i}]$ by $\mathbb{P}[i\in A]$. This shows that the problem of minimizing $\max_{i\in I}\mathbb{P}_{\nu_{i}}[\mathcal{P}_{i}|E_{i}\cap E^{\prime}_{i}]$ is equivalent to partitioning the set $I$ into two sets $A$ and $B$ (with $A$ containing the indices where $\mathcal{P}_{i}$ holds and $|B|\leq|I|/2$), with the goal of minimizing $\max_{i\in I}\mathbb{P}[i\in A]$. It is easy to see that the minimum value of $\max_{i\in I}\mathbb{P}[i\in A]$, and similarly of $\max_{i\in I}\mathbb{P}_{\nu_{i}}[\mathcal{P}_{i}|E_{i}\cap E^{\prime}_{i}]$, is at least $1/2$, as $\sum_{i\in I}\mathbb{P}[i\notin A]\leq|I|/2$.

Hence, for any policy $\pi$, there is an instance $\nu_{i}$, $i\in I$, such that conditioned on $E_{i}\cap E^{\prime}_{i}$, arm $i$ is picked no more than $1/\log(1/\epsilon)$ times by the learner in the first $\frac{K}{4\log(1/\epsilon)}$ rounds with probability at least $1/2$. Since for instance $\nu_{i}$ a regret of $1$ is incurred whenever arm $i$ is not played, we get that there is $i\in I$ such that $\mathbb{P}\left[R_{T}(\pi,\nu_{i})\geq\frac{K}{4\log(1/\epsilon)}|E_{i}\cap E^{\prime}_{i}\right]\geq 1/2$. Then, we have that $\mathbb{E}[R_{T}(\pi,\nu_{i})|E_{i}\cap E^{\prime}_{i}]\geq c\frac{K}{\log(1/\epsilon)}$. By non-negativity of the regret, we have that

$\mathbb{E}[R_{T}(\pi,\nu_{i})]\geq c\frac{K}{\log(1/\epsilon)}\mathbb{P}[E_{i}\cap E^{\prime}_{i}]\stackrel{(i)}{=}c\frac{K}{\log(1/\epsilon)}\mathbb{P}[E_{i}]\,\mathbb{P}[E^{\prime}_{i}]\stackrel{(ii)}{\geq}\frac{c}{2}\frac{K}{\log(1/\epsilon)}\mathbb{P}[E_{i}],$ (15)

where $(i)$ follows from the fact that $E_{i},E^{\prime}_{i}$ are independent, and $(ii)$ follows since $i\in I$ and thus $\mathbb{P}(E^{\prime}_{i})\geq\frac{1}{2}$. Moreover, $\mathbb{P}[E_{i}]=\epsilon^{1/\log(1/\epsilon)}=e^{-1}$, and thus $\mathbb{E}[R_{T}(\pi,\nu_{i})]\geq c\frac{K}{\log(1/\epsilon)}$. The proof is concluded by observing that $\log(1/\epsilon)=O(1-\epsilon)$ for $\epsilon\geq 1/2$. $\blacksquare$

References

  • [1] P. Matikainen, P. M. Furlong, R. Sukthankar, and M. Hebert, “Multi-armed recommendation bandits for selecting state machine policies for robotic systems,” in 2013 IEEE International Conference on Robotics and Automation, 2013, pp. 4545–4551.
  • [2] D. Bouneffouf, I. Rish, and C. Aggarwal, “Survey on applications of multi-armed and contextual bandits,” in 2020 IEEE Congress on Evolutionary Computation (CEC), 2020, pp. 1–8.
  • [3] Y. Yang, M. A. Bevan, and B. Li, “Hierarchical planning with deep reinforcement learning for 3d navigation of microrobots in blood vessels,” Advanced Intelligent Systems, vol. 4, no. 11, p. 2200168, 2022. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1002/aisy.202200168
  • [4] Z. Zou, Y. Liu, Y. Young, O. S. Pak, and A. C. H. Tsang, “Gait switching and targeted navigation of microswimmers via deep reinforcement learning,” Commun Phys, vol. 5, no. 158, 2022.
  • [5] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, pp. 235–256, 05 2002.
  • [6] W. R. Thompson, “On the likelihood that one unknown probability exceeds another in view of the evidence of two samples,” Biometrika, vol. 25, pp. 285–294, 1933.
  • [7] T. L. Lai, “Adaptive treatment allocation and the multi-armed bandit problem,” Annals of Statistics, vol. 15, pp. 1091–1114, 1987.
  • [8] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, “Gambling in a rigged casino: The adversarial multi-armed bandit problem,” in Proceedings of IEEE 36th Annual Foundations of Computer Science, 1995, pp. 322–331.
  • [9] S. Bubeck and A. Slivkins, “The best of both worlds: Stochastic and adversarial bandits,” in Proceedings of the 25th Annual Conference on Learning Theory, ser. Proceedings of Machine Learning Research, S. Mannor, N. Srebro, and R. C. Williamson, Eds., vol. 23.   Edinburgh, Scotland: PMLR, 2012, pp. 42.1–42.23. [Online]. Available: https://proceedings.mlr.press/v23/bubeck12b.html
  • [10] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “The nonstochastic multiarmed bandit problem,” SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002. [Online]. Available: https://doi.org/10.1137/S0097539701398375
  • [11] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th International Conference on World Wide Web, ser. WWW ’10.   New York, NY, USA: Association for Computing Machinery, 2010, p. 661–670. [Online]. Available: https://doi.org/10.1145/1772690.1772758
  • [12] S. Shahrampour, A. Rakhlin, and A. Jadbabaie, “Multi-armed bandits in multi-agent networks,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 2786–2790.
  • [13] D. Kalathil, N. Nayyar, and R. Jain, “Decentralized learning for multiplayer multiarmed bandits,” IEEE Transactions on Information Theory, vol. 60, no. 4, pp. 2331–2345, 2014.
  • [14] P. Landgren, V. Srivastava, and N. E. Leonard, “Distributed cooperative decision making in multi-agent multi-armed bandits,” Automatica, vol. 125, p. 109445, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0005109820306439
  • [15] P. Landgren, “Distributed multi-agent multi-armed bandits,” 2019.
  • [16] T. Lykouris, V. Mirrokni, and R. Paes Leme, “Stochastic bandits robust to adversarial corruptions,” in Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing (STOC), 2018, pp. 114–122.
  • [17] A. Gupta, T. Koren, and K. Talwar, “Better algorithms for stochastic bandits with adversarial corruptions,” in Proceedings of the Thirty-Second Conference on Learning Theory, ser. Proceedings of Machine Learning Research, A. Beygelzimer and D. Hsu, Eds., vol. 99.   PMLR, 2019, pp. 1562–1578. [Online]. Available: https://proceedings.mlr.press/v99/gupta19a.html
  • [18] S. Kapoor, K. K. Patel, and P. Kar, “Corruption-tolerant bandit learning,” Mach. Learn., vol. 108, no. 4, p. 687–715, 2019. [Online]. Available: https://doi.org/10.1007/s10994-018-5758-5
  • [19] I. Amir, I. Attias, T. Koren, Y. Mansour, and R. Livni, “Prediction with corrupted expert advice,” Advances in Neural Information Processing Systems, vol. 33, pp. 14 315–14 325, 2020.
  • [20] T. L. Lai and H. Robbins, “Asymptotically efficient adaptive allocation rules,” Advances in Applied Mathematics, vol. 6, no. 1, pp. 4–22, 1985.
  • [21] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, “Gambling in a rigged casino: The adversarial multi-armed bandit problem,” in Proceedings of IEEE 36th annual foundations of computer science.   IEEE, 1995, pp. 322–331.
  • [22] P. Auer and R. Ortner, “Ucb revisited: Improved regret bounds for the stochastic multi-armed bandit problem,” Periodica Mathematica Hungarica, vol. 61, no. 1-2, pp. 55–65, 2010.