Robust Multi-Agent Bandits Over Undirected Graphs
Abstract.
We consider a multi-agent multi-armed bandit setting in which $n$ honest agents collaborate over a network to minimize regret, but $m$ malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur $O\left((m + K/n)\frac{\log T}{\Delta}\right)$ regret in this setting, where $K$ is the number of arms and $\Delta$ is the arm gap. For $m \ll K/n$, this improves over the single-agent baseline regret of $O\left(K\frac{\log T}{\Delta}\right)$.
In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in $K$ and $n$. In light of this negative result, we propose a new algorithm for which the $i$-th agent has regret $O\left((d_{\text{mal}}(i) + K/n)\frac{\log T}{\Delta}\right)$ on any connected and undirected graph, where $d_{\text{mal}}(i)$ is the number of $i$'s neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where $d_{\text{mal}}(i) = m$), and show the effect of malicious agents is entirely local (in the sense that only the malicious agents directly connected to $i$ affect its long-term regret).
1. Introduction
Motivated by applications including distributed computing, social recommendation systems, and federated learning, a number of recent papers have studied multi-agent variants of the classical multi-armed bandit problem. Typically, these variants involve a large number of agents playing a bandit while communicating over a network. The goal is to devise communication protocols that allow the agents to efficiently amalgamate information, thereby learning the bandit’s parameters more quickly than they could by running single-agent algorithms in isolation.
Among the many multi-agent variants considered in the literature (see Section 1.5), we focus on a particular setting studied in the recent line of work (Sankararaman et al., 2019; Chawla et al., 2020; Newton et al., 2021; Vial et al., 2021). In these papers, $n$ agents play separate instances of the same $K$-armed bandit and are restricted to infrequent, pairwise, bit-limited communications. We recount two motivating applications from this prior work.
Example 1.
For an e-commerce site (e.g., Amazon), the agents model $n$ servers choosing one of $K$ products to show visitors to the site. The product selection problem can be viewed as a bandit – products are arms, while purchases yield rewards – and communication among the agents/servers is restricted by bandwidth.
Example 2.
For a social recommendation site (e.g., Yelp), the agents represent $n$ users choosing among $K$ items, such as restaurants. This is analogously modeled as a bandit, and communication is limited because agents/users are exposed to a small portion of all reviews.
To contextualize our contributions, we next discuss this line of work in more detail.
1.1. Fully cooperative multi-agent bandits
The goal of (Sankararaman et al., 2019; Chawla et al., 2020; Newton et al., 2021) is to devise fully cooperative algorithms for which the cumulative regret of each agent is small (see (5) for the formal definition of regret). All of these papers follow a similar approach, which roughly proceeds as follows (see Section 3 for details):
-
•
The arms are partitioned into $n$ subsets of size roughly $K/n$, and each agent is assigned a distinct subset, called its sticky set, which it is responsible for exploring.
-
•
Occasionally, each agent asks a random neighbor for an arm recommendation; the neighbor responds with the arm it believes is best, which the requesting agent then begins playing.
For these algorithms, the regret analysis essentially contains two steps:
-
•
First, the authors show that the agent with the true best arm in its sticky set eventually identifies it as such. Thereafter, a gossip process unfolds. Namely, this agent recommends the best arm to its neighbors, who recommend it to their neighbors, etc., spreading the best arm to all agents. The spreading time (and thus the regret before this time) is shown to be polynomial in $K$, $n$, and $1/\Delta$, where $\Delta$ is the gap in mean reward between the two best arms.
-
•
Once the best arm spreads, agents play only it and their sticky sets, so long-term, they effectively face $O(K/n)$-armed bandits instead of the full $K$-armed bandit. By classical bandit results (see, e.g., (Auer et al., 2002)), this implies $O\left(\frac{K}{n}\cdot\frac{\log T}{\Delta}\right)$ regret over horizon $T$.
Hence, summing the two terms, (Sankararaman et al., 2019; Chawla et al., 2020; Newton et al., 2021) provide regret bounds of the form (1) below. (More precisely, (Chawla et al., 2020; Newton et al., 2021) prove (1), while the corresponding term is larger in (Sankararaman et al., 2019).)
(1)
as compared to $O\left(K\frac{\log T}{\Delta}\right)$ regret for running a single-agent algorithm in isolation.
1.2. Robust multi-agent bandits on the complete graph
Despite these improved bounds, (Sankararaman et al., 2019; Chawla et al., 2020; Newton et al., 2021) require all agents to execute the prescribed algorithm, and in particular, to recommend best arm estimates to their neighbors. As pointed out in (Vial et al., 2021), this may be unrealistic: in Example 2, review spam can be modeled as bad arm recommendations, while in Example 1, servers may fail entirely. Hence, (Vial et al., 2021) considers a more realistic setting where honest agents recommend best arm estimates but malicious agents recommend arbitrarily. For this setting, the authors propose a robust version of the algorithm from (Chawla et al., 2020) where honest agents block suspected malicious agents. More specifically, (Vial et al., 2021) considers the following blocking rule:
-
•
If an agent recommends an arm to an honest agent, but that arm subsequently performs poorly for the recipient – in the sense that the upper confidence bound (UCB) algorithm does not select it sufficiently often – then the recipient temporarily suspends communication with the recommender.
As shown in (Vial et al., 2021), this blocking scheme prevents each malicious agent from recommending more than a bounded number of bad arms long-term, which (effectively) results in an $O(m + K/n)$-armed bandit ($O(m)$ malicious recommendations, plus the $O(K/n)$-sized sticky set). Under the assumption that honest and malicious agents are connected by the complete graph, this allows (Vial et al., 2021) to prove
(2)
In (Vial et al., 2021), it is also shown that blocking is necessary: for any $n$, if even one malicious agent is present, the algorithm from (Chawla et al., 2020) (which does not use blocking) incurs regret no better than the single-agent baseline. Thus, one malicious agent negates the improvement over the single-agent baseline.
1.3. Objective and challenges
Our main goal is to generalize the results of (Vial et al., 2021) from the complete graph to the case where the honest agent subgraph is only connected and undirected. This is nontrivial because (Vial et al., 2021) relies heavily on the complete graph assumption. In particular, the analysis in (Vial et al., 2021) requires that the agent with the best arm in its sticky set itself recommends the best arm to each of the other honest agents. In other words, each honest agent relies on this particular agent to inform them of the best arm, which means it must be a neighbor of every other honest agent. Thus, to extend (2) beyond complete graphs, we need to show a gossip process unfolds (like in the fully cooperative case): this agent recommends the best arm to its neighbors, who recommend it to their neighbors, etc., spreading it through the network.
The challenge is that, while blocking is necessary to avoid the large regret described above, it also causes honest agents to accidentally block each other. Indeed, due to the aforementioned blocking rule and the noisy rewards, they will block each other until they collect enough samples to reliably identify good arms. From a network perspective, accidental blocking means that edges in the subgraph of honest agents temporarily fail. Consequently, it is not clear if the best arm spreads to all honest agents, or if (for example) this subgraph eventually becomes disconnected, preventing the spread and causing the agents who do not receive the best arm to suffer large regret.
Analytically, accidental blocking means we must deal with a gossip process over a dynamic graph. This process is extremely complicated, because the graph dynamics are driven by the bandit algorithms, which in turn affect the future evolution of the graph. Put differently, blocking causes the randomness of the communication protocol and that of the bandit algorithms to become interdependent. We note this does not occur for the original non-blocking algorithm, where the two sources of randomness can be cleverly decoupled and separately analyzed – see (Chawla et al., 2020, Proposition 4). Thus, in contrast to existing work, we need to analyze the interdependent processes directly.
1.4. Our contributions
Failure of the existing blocking rule: In Section 4, we show that the algorithm from (Vial et al., 2021) fails to achieve a regret bound of the form (2) for connected and undirected graphs in general. Toward this end, we define a natural "bad instance" in which the honest agent subgraph is an undirected line (thus connected) and all honest agents share a single malicious neighbor. For this instance, we propose a malicious strategy that causes honest agents to repeatedly block one another, which results in the best arm spreading extremely slowly. More specifically, we show that if honest agents run the algorithm from (Vial et al., 2021), then the best arm does not reach the honest agent at the end of the line until time is doubly exponential in the number of agents. Note (Vial et al., 2021) shows the best arm spreads polynomially fast for the complete graph, so we demonstrate a doubly exponential slowdown for complete versus line graphs. This is rather surprising, because for classical rumor processes that do not involve bandits or blocking (see, e.g., (Pittel, 1987)), the slowdown is only exponential (i.e., logarithmically many rounds suffice to spread a rumor on the complete graph versus polynomially many on the line graph). As a consequence of the slow spread, we show the algorithm from (Vial et al., 2021) suffers regret
(3)
i.e., it incurs (nearly) linear regret until the aforementioned doubly exponential time and thereafter incurs logarithmic regret, but with a huge additive term (see Theorem 1).
Refined blocking rule: In light of this negative result, we propose a refined blocking rule in Section 5. Roughly, our rule is as follows: an agent blocks a neighbor for recommending an arm if
-
•
the arm performs poorly, i.e., it is not chosen sufficiently often by UCB,
-
•
and the agent has not changed its own best arm estimate recently.
The second criterion is the key distinction from (Vial et al., 2021). Intuitively, it says that agents should not block for seemingly-poor recommendations until they become confident that their own best arm estimates have settled on truly good arms. This idea is the main new algorithmic insight of the paper. It is directly motivated by the negative result of Section 4; see Remark 5.
Gossip despite blocking: Analytically, our main contribution is to show that, with our refined blocking rule, the best arm quickly spreads to all honest agents. The proof is quite involved; we provide an outline in Section 7. At a very high level, the idea is to show that honest agents using our blocking rule eventually stop blocking each other. Thereafter, we can couple the arm spreading process with a much more tractable noisy rumor process that involves neither bandits nor blocking (see Definition 1), and that is guaranteed to spread the best arm in polynomial time.
Regret upper bound: Combining our novel gossip analysis with some existing regret minimization techniques, we show in Section 5 that our refined algorithm enjoys the regret bound
(4)
where $d_{\text{mal}}(i)$ is the number of malicious neighbors of agent $i$ (see Theorem 2). Thus, our result generalizes (2) from the complete graph (where $d_{\text{mal}}(i) = m$) to connected and undirected graphs. Moreover, note the leading term in (4) is entirely local – only the malicious agents directly connected to $i$ affect its long-term regret. For example, in the sparse regime where agent $i$ has few malicious neighbors, our leading term matches the one in (1) up to constants, which (we recall) (Chawla et al., 2020; Newton et al., 2021) proved in the case where there are no malicious agents anywhere in the network. In fact, for honest agents with no malicious neighbors, we can prove that the leading term in our regret bound matches the corresponding term from (Chawla et al., 2020), including constants (see Corollary 2). In other words, we show that for large horizons $T$, the effects of malicious agents do not propagate beyond one-step neighbors. Furthermore, we note that the additive term in (4) is polynomial in all parameters, whereas for the existing algorithm it can be doubly exponential in $K$ and $n$, as shown in (3) and discussed above.
Numerical results: In Section 6, we replicate the experiments from (Vial et al., 2021) and extend them from the complete graph to random graphs. Among other findings, we show that for moderate and small edge probabilities, respectively, the algorithm from (Vial et al., 2021) can perform worse than the non-blocking algorithm from (Chawla et al., 2020) and the single-agent baseline, respectively. In other words, the existing blocking rule becomes a liability as the edge probability decreases from the complete-graph case considered in (Vial et al., 2021). In contrast, we show that our refined rule has lower regret than (Chawla et al., 2020) across the range of edge probabilities tested. Additionally, it outperforms (Vial et al., 2021) on average for all but the largest edge probabilities and has much lower variance for smaller ones.
Summary: Ultimately, the high-level messages of this paper are twofold:
-
•
In multi-agent bandits with malicious agents, we can devise algorithms that simultaneously (1) learn useful information and spread it through the network via gossip, and (2) learn who is malicious and block them to mitigate the harm they cause. Moreover, this harm is local in the sense that it only affects one-hop neighbors.
-
•
However, blocking must be done carefully; algorithms designed for the complete graph may spread information extremely slowly on general graphs. In particular, the slowdown can be doubly exponential, much worse than the exponential slowdown of simple rumor processes.
1.5. Other related work
In addition to the paper (Vial et al., 2021) discussed above, several others have considered multi-agent bandits where some of the agents are uncooperative. In (Awerbuch and Kleinberg, 2008), the honest agents face a non-stochastic (i.e., adversarial) bandit (Auer et al., 1995) and communicate at every time step, in contrast to the stochastic bandit and limited communication of our work. The authors of (Mitra et al., 2021) consider the objective of best arm identification (Audibert and Bubeck, 2010) instead of cumulative regret. Most of their paper involves a different communication model where the agents/clients collaborate via a central server; Section 6 studies a “peer-to-peer” model which is closer to ours but requires additional assumptions on the number of malicious neighbors. A different line of work considers the case where an adversary can corrupt the observed rewards (see, e.g., (Bogunovic et al., 2020, 2021; Garcelon et al., 2020; Gupta et al., 2019; Jun et al., 2018; Kapoor et al., 2019; Liu and Shroff, 2019; Liu et al., 2021; Lykouris et al., 2018), and the references therein), which is distinct from the role that malicious agents play in our setting.
For the fully cooperative case, there are several papers with communication models that differ from the aforementioned (Chawla et al., 2020; Newton et al., 2021; Sankararaman et al., 2019). For example, agents in (Buccapatnam et al., 2015; Chakraborty et al., 2017) broadcast information instead of exchanging pairwise arm recommendations, communication in (Kolla et al., 2018; Lalitha and Goldsmith, 2021; Martínez-Rubio et al., 2019) is more frequent, the number of transmissions in (Madhushani and Leonard, 2021) depends on the problem instance and so could be large, and agents in (Landgren et al., 2016) exchange arm mean estimates instead of (bit-limited) arm indices.
More broadly, other papers have studied fully cooperative variants of different bandit problems. These include minimizing simple instead of cumulative regret (e.g., (Hillel et al., 2013; Szörényi et al., 2013)), minimizing the total regret across agents rather than ensuring all have low regret (e.g., (Dubey et al., 2020a; Wang et al., 2020)), contextual instead of multi-armed bandits (e.g., (Chawla et al., 2022; Dubey et al., 2020b; Dubey and Pentland, 2020; Korda et al., 2016; Tekin and Van Der Schaar, 2015)), adversarial rather than stochastic bandits (e.g., (Bar-On and Mansour, 2019; Cesa-Bianchi et al., 2016; Kanade et al., 2012)), and bandits that vary across agents (e.g., (Bistritz and Leshem, 2018; Shahrampour et al., 2017; Zhu et al., 2021)). Another long line of work features collision models where rewards are lower if multiple agents simultaneously pull the same arm (e.g., (Anandkumar et al., 2011; Avner and Mannor, 2014; Boursier and Perchet, 2019; Dakdouk et al., 2021; Kalathil et al., 2014; Liu and Zhao, 2010; Liu et al., 2020; Mansour et al., 2018; Rosenski et al., 2016)), unlike our model. Along these lines, other reward structures have been studied, such as reward being a function of the agents’ joint action (e.g., (Bargiacchi et al., 2018; Bistritz and Bambos, 2020; Kao et al., 2022)).
1.6. Organization
The rest of the paper is structured as follows. We begin in Section 2 with definitions. In Section 3, we introduce the algorithm from (Vial et al., 2021). Sections 4 and 5 discuss the existing and proposed blocking rules. Section 6 contains experiments. We discuss our analysis in Section 7 and close in Section 8.
2. Preliminaries
Communication network: Let $G = (V, E)$ be an undirected graph with vertex set $V = \{1, \ldots, n + m\}$. We call agents $1, \ldots, n$ the honest agents and assume they execute the forthcoming algorithm. The remaining $m$ agents are termed malicious; their behavior will be specified shortly. For instance, honest and malicious agents represent functioning and failed servers in Example 1. The edge set $E$ encodes which agents are allowed to communicate, e.g., if $(i, j) \in E$, the $i$-th and $j$-th servers can communicate in the forthcoming algorithm.
Denote by $E_{\text{hon}}$ the set of edges between honest agents and by $G_{\text{hon}} = (\{1, \ldots, n\}, E_{\text{hon}})$ the honest agent subgraph. For each honest agent $i$, we let $N(i)$ denote its neighbors, $N_{\text{hon}}(i)$ its honest neighbors, and $N_{\text{mal}}(i)$ its malicious neighbors. We write $d(i)$, $d_{\text{hon}}(i)$, and $d_{\text{mal}}(i)$ for the associated degrees, and write corresponding barred quantities for the maximal degrees. We make the following assumption, which generalizes the complete graph case of (Vial et al., 2021).
Assumption 1.
The honest agent subgraph $G_{\text{hon}}$ is connected, i.e., for any two honest agents $i$ and $j$, there exists a path $i = i_0, i_1, \ldots, i_k = j$ with $(i_{l-1}, i_l) \in E_{\text{hon}}$ for each $l$.
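Assumption 1 is straightforward to verify computationally. The following Python sketch (with hypothetical agent indexing and a hypothetical helper name) checks connectivity of the honest agent subgraph via breadth-first search.

```python
from collections import deque

def honest_subgraph_connected(honest_agents, edges):
    """Check Assumption 1: the subgraph induced by honest agents is connected.

    `honest_agents` is a set of agent ids; `edges` is a set of undirected
    pairs (i, j).  Only edges with both endpoints honest are traversed.
    """
    honest_agents = set(honest_agents)
    if not honest_agents:
        return True
    # Adjacency lists restricted to honest-honest edges.
    nbrs = {i: [] for i in honest_agents}
    for i, j in edges:
        if i in honest_agents and j in honest_agents:
            nbrs[i].append(j)
            nbrs[j].append(i)
    # Breadth-first search from an arbitrary honest agent.
    start = next(iter(honest_agents))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in nbrs[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == honest_agents

# Example: a line of 4 honest agents (0-1-2-3) plus a malicious agent 4
# connected to all of them; the honest subgraph is connected.
edges = {(0, 1), (1, 2), (2, 3), (0, 4), (1, 4), (2, 4), (3, 4)}
print(honest_subgraph_connected({0, 1, 2, 3}, edges))  # True
```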
Multi-armed bandit: We consider the standard stochastic multi-armed bandit (Lattimore and Szepesvári, 2020, Chapter 4). Denote by $K$ the number of arms and by $[K] = \{1, \ldots, K\}$ the set of arms. For each arm $k \in [K]$, we let $\nu_k$ be a reward distribution and associate with each honest agent an i.i.d. sequence of rewards sampled from $\nu_k$. The interpretation is that, if an honest agent chooses arm $k$ at time $t$, it earns the corresponding reward. The objective (to be formalized shortly) is reward maximization. In Example 2, for instance, $[K]$ represents the set of restaurants in a city, and the reward quantifies how much a person enjoys a restaurant if they dine there on a given day.
For each arm $k$, we let $\mu_k = \mathbb{E}_{X \sim \nu_k}[X]$ denote the corresponding expected reward. Without loss of generality, we assume the arms are labeled such that $\mu_1 \geq \mu_2 \geq \cdots \geq \mu_K$. We additionally assume the following, which generalizes the corresponding setting of (Vial et al., 2021). Notice that under this assumption, the arm gap $\Delta = \mu_1 - \mu_2$ is strictly positive.
Assumption 2.
Rewards are $[0, 1]$-valued, i.e., for each arm $k$, $\nu_k$ is a distribution over $[0, 1]$. Furthermore, the best arm is unique, i.e., $\mu_1 > \mu_2$.
Objective: For each honest agent $i$ and time $t$, let $I_t^{(i)}$ denote the arm chosen by honest agent $i$ at time $t$. Our goal is to minimize the regret $R_T^{(i)}$, which is the expected additive loss in cumulative reward for agent $i$'s sequence of arm pulls compared to the optimal policy that always chooses the best arm. More precisely, we define regret as follows:
(5) $R_T^{(i)} = T \mu_1 - \mathbb{E}\left[\sum_{t=1}^{T} \mu_{I_t^{(i)}}\right].$
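For concreteness, the following sketch shows how the pseudo-regret in (5) could be computed from a recorded sequence of arm pulls; the helper name and data layout are hypothetical, and the arm means are assumed to be sorted so that index 0 is the best arm.

```python
import numpy as np

def expected_regret(mu, pulls_per_agent, T=None):
    """Regret in the sense of (5): T * mu_1 minus the sum of the means of
    the arms an agent actually pulled (index 0 is assumed to be the best arm).

    `mu` is the vector of arm means (sorted in decreasing order) and
    `pulls_per_agent[i]` is the list of arm indices pulled by honest agent i.
    """
    mu = np.asarray(mu)
    best = mu.max()
    regret = {}
    for i, pulls in enumerate(pulls_per_agent):
        horizon = T if T is not None else len(pulls)
        regret[i] = horizon * best - mu[np.asarray(pulls[:horizon])].sum()
    return regret

# Toy example: two agents, three arms with means 0.9, 0.5, 0.2.
mu = [0.9, 0.5, 0.2]
pulls = [[0, 0, 1, 0], [2, 2, 1, 0]]
print(expected_regret(mu, pulls))  # approximately {0: 0.4, 1: 1.8}
```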
3. Algorithm
We next discuss the algorithm from (Vial et al., 2021) (Algorithm 1 below), which modifies the one from (Chawla et al., 2020) to include blocking. For ease of exposition, we begin by discussing the key algorithmic design principles from (Chawla et al., 2020) in Section 3.1. We then define Algorithm 1 formally in Section 3.2. Finally, we introduce and discuss one additional assumption in Section 3.3.
3.1. Key ideas of the non-blocking algorithm
We assume there are no malicious agents ($m = 0$) in this subsection and describe the non-blocking algorithm from (Chawla et al., 2020).
-
•
Phases: In (Chawla et al., 2020), the time steps are grouped into phases, whose role is twofold. First, within each phase, each honest agent only pulls arms belonging to a particular subset of the arms. We call these active sets and detail their construction next. Second, at the end of each phase, the agents construct new active sets by exchanging arm recommendations with neighbors, in a manner to be described shortly. See Figure 1 for a pictorial description. Notice that the phase durations are increasing, which will be discussed in Section 3.2.
-
•
Active sets: The active set will always contain a subset of arms that does not vary with the phase. Following (Chawla et al., 2020; Vial et al., 2021), we call this fixed subset the sticky set and its elements sticky arms. The sticky sets ensure that each arm is explored by some agent, as will be seen in the forthcoming example. In addition, the active set will contain two non-sticky arms that are dynamically updated across phases based on arm recommendations from neighbors.
-
•
Arm recommendations: After each phase, each agent contacts a random neighbor, who responds with whichever of their active arms performed "best" in the current phase. Upon receiving this recommendation, the agent adds it to its active set and discards whichever of its currently-active non-sticky arms performed "worse". (We quantify "best" and "worse" in the formal discussion of Section 3.2.)
Example 3.
Each subfigure of Figure 2 depicts honest agents as circles and their active sets as rectangles. The blue rectangles are sticky sets, the orange rectangles are non-sticky arms, and the arms are sorted by performance. For example, the left agent in Figure 2(a) has an active set consisting of its sticky set and two non-sticky arms, one of which it believes to be the best. Note the blue sticky sets partition the set of arms, so at each phase, each arm is active for some agent. This ensures the best arm is never permanently discarded during the arm recommendations discussed above. Figure 2(b) shows agents recommending the active arms they believe are best, and Figure 2(c) depicts the updated active sets. For instance, the left agent replaces its worse non-sticky arm with the recommendation it received. Figure 2(d) shows a later phase where the best arm has spread to all agents, who have all identified it as such. Thereafter, all agents recommend the best arm, so the active sets remain fixed (Figures 2(e) and 2(f)). Hence, all agents eventually exploit the best arm while only exploring a subset of the suboptimal arms (three instead of five here).
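The bookkeeping behind Example 3 can be sketched as follows; the helper name and the toy parameters (two agents, six arms) are hypothetical, and the sketch assumes the number of agents divides the number of arms.

```python
import random

def make_initial_active_sets(num_agents, num_arms, seed=0):
    """Illustration of the active-set structure from Section 3.1: the arms
    are partitioned into sticky sets (one per agent), and each agent's
    initial active set is its sticky set plus two arbitrary non-sticky arms.
    """
    rng = random.Random(seed)
    arms = list(range(num_arms))
    rng.shuffle(arms)
    size = num_arms // num_agents          # assumes num_agents divides num_arms
    sticky = {i: set(arms[i * size:(i + 1) * size]) for i in range(num_agents)}
    active = {}
    for i in range(num_agents):
        non_sticky = [k for k in range(num_arms) if k not in sticky[i]]
        active[i] = sticky[i] | set(rng.sample(non_sticky, 2))
    return sticky, active

sticky, active = make_initial_active_sets(num_agents=2, num_arms=6)
print(sticky)   # two disjoint 3-arm sticky sets covering all arms
print(active)   # each active set has 5 arms: 3 sticky + 2 non-sticky
```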
3.2. Formal definition of the blocking algorithm
The algorithm in (Vial et al., 2021) supplements the one from Section 3.1 with a blocking procedure. Specifically, honest agents run the algorithm from (Chawla et al., 2020) while maintaining blocklists of neighbors they are unwilling to communicate with. This approach is defined in Algorithm 1 and detailed next.
Inputs (Line 1): The first input is a standard UCB exploration parameter, which will be discussed shortly. The second input controls the lengths of the phases; namely, it determines the times at which each phase ends, and the phase durations grow with the phase index, as shown in Figure 1. The final input is the agent's sticky set (the blue sets in Example 3), which, as in (Vial et al., 2021), we assume is provided to the agents (see Section 3.3 for more details).
Initialization (Lines 1-1): To start, the agent initializes the times at which each phase ends, along with an empty blocklist. Additionally, it chooses two distinct (but otherwise arbitrary) non-sticky arms and constructs its initial active set. Notice that the active set contains the sticky set and two arms that depend on the phase, as described in Section 3.1.
UCB over the active set (Line 1): As was also mentioned in Section 3.1, the agent only pulls arms from its current active set. More specifically, at each time during the current phase, it chooses the active arm that maximizes the UCB index in Line 1 (see (Lattimore and Szepesvári, 2020, Chapters 7-10) for background). This index combines the empirical mean of the arm's observed rewards with an exploration bonus that shrinks as the number of pulls of the arm grows.
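A minimal sketch of this selection step is shown below, assuming a UCB index of the standard form (empirical mean plus an exploration bonus); the exact constants in the paper's index may differ, and all function and variable names are hypothetical.

```python
import math

def ucb_arm(active_set, pull_counts, reward_sums, t, alpha=4.0):
    """Pick the active arm maximizing a UCB index of the usual form
    empirical mean + sqrt(alpha * log t / pulls).  Here alpha stands in for
    the exploration parameter passed to Algorithm 1 (the exact index in the
    paper may differ in constants).  Unpulled arms get an infinite index.
    """
    def index(k):
        if pull_counts[k] == 0:
            return math.inf
        mean = reward_sums[k] / pull_counts[k]
        bonus = math.sqrt(alpha * math.log(max(t, 2)) / pull_counts[k])
        return mean + bonus
    return max(active_set, key=index)

# One step of play for an agent whose active set is {0, 3, 7}.
counts = {0: 10, 3: 2, 7: 0}
sums = {0: 7.0, 3: 1.0, 7: 0.0}
print(ucb_arm({0, 3, 7}, counts, sums, t=12))  # 7 (never pulled yet)
```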
Best arm estimate (Line 1): At the end of each phase, the agent defines its best arm estimate as the active arm it played the most in that phase. The intuition is that, for large horizons, the arm chosen most by UCB is a good estimate of the true best arm (Bubeck et al., 2011). Thus, because phase lengths are increasing (see Figure 1), this estimate will be a good estimate of the best active arm in late phases.
Blocklist update (Line 1): Next, the agent calls the Update-Blocklist subroutine to update its blocklist. The implementation of this subroutine is the key distinction between (Vial et al., 2021) and our work; we discuss the two implementations in Sections 4 and 5, respectively.
Arm recommendations (Line 1): Having updated its blocklist, the agent requests an arm recommendation via Algorithm 2. Algorithm 2 is a black box from the agent's perspective (i.e., the agent provides the input and observes the output); it samples a random non-blocked neighbor. If this neighbor is honest, it recommends its best arm estimate, while if malicious, it recommends an arbitrary arm. (Technically, malicious recommendations need to be measurable; see (Vial et al., 2021, Section 3) for details.)
Updating the active set (Lines 1-1): Finally, the agent updates its active set as in Section 3.1. In particular, if the recommendation is not currently active (if it is currently active, the active set remains unchanged; see Line 1 of Algorithm 1), the agent keeps the non-sticky arm that performed better in the current phase, in the sense that UCB chose it more often (following the above intuition from (Bubeck et al., 2011)). The other non-sticky arm is replaced by the recommendation, and the new active set again consists of the sticky set and two non-sticky arms.
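The end-of-phase bookkeeping described above can be summarized by the following sketch; the function signature is hypothetical, and the paper's Algorithm 1 performs the same updates via its own line-by-line bookkeeping.

```python
def end_of_phase_update(sticky, non_sticky, phase_pull_counts, recommendation):
    """End-of-phase bookkeeping sketched from Section 3.2 (hypothetical
    signature): the best arm estimate is the most played active arm, and if
    the received recommendation is not already active, it replaces the
    non-sticky arm that was played less this phase.
    """
    active = sticky | set(non_sticky)
    best_estimate = max(active, key=lambda k: phase_pull_counts.get(k, 0))
    if recommendation not in active:
        a, b = non_sticky
        # Keep the non-sticky arm UCB chose more often; drop the other.
        keep = a if phase_pull_counts.get(a, 0) >= phase_pull_counts.get(b, 0) else b
        non_sticky = (keep, recommendation)
    return best_estimate, non_sticky

# Agent with sticky set {0, 1}, non-sticky arms (4, 5), and a new
# recommendation of arm 2 that is not currently active.
best, non_sticky = end_of_phase_update({0, 1}, (4, 5), {0: 30, 1: 5, 4: 50, 5: 2}, 2)
print(best, non_sticky)  # 4 (4, 2)
```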
3.3. Additional assumption
Observe that Algorithm 1 does not preclude the case where the best arm is not in any honest agent's sticky set. In this case, the best arm may be permanently discarded, which causes linear regret even in the absence of malicious agents. For example, this would occur in Figure 2 if the best arm were not a sticky arm for the left agent (since the right agent discards it in Figure 2(c)). To prevent this situation, we will follow (Sankararaman et al., 2019; Chawla et al., 2020; Newton et al., 2021; Vial et al., 2021) in assuming the following.
Assumption 3.
There exists an honest agent with the best arm in its sticky set.
Remark 1.
Remark 2.
The choice from Remark 1 requires the honest agents to know an order-accurate estimate of the number of honest agents in order to set the sticky set size appropriately and ensure that Assumption 3 holds. As discussed in (Vial et al., 2021, Remark 7), this amounts to knowing order-accurate estimates of $n + m$ and $n$. The former quantity is the total number of agents, knowledge of which is rather benign and is also assumed in the fully-cooperative setting (Sankararaman et al., 2019; Chawla et al., 2020; Newton et al., 2021). The latter requires the agents to know that, e.g., half of the others are honest, which is similar in spirit to the assumptions in related problems regarding social learning in the presence of adversarial agents (e.g., (LeBlanc et al., 2013)).
Remark 3.
Alternatively, we can avoid Assumption 3 entirely by defining the set of the arms to be those initially known by the honest agents (i.e., their sticky sets), rather than sampling the sticky sets from a larger “base set” as in Remark 1. In this alternative model, the honest agents aim to identify and spread through the network whichever of the initially-known arms is best, similar to what happens on platforms like Yelp (see Example 2). In contrast, the Section 2 model allows for the pathological case where the base set contains a better arm than any initially known to honest agents (e.g., where no honest Yelp user has ever dined at the best restaurant). Coping with these pathological cases either requires Assumption 3, or another mode of exploration (i.e., exploration of base arms) that obfuscates the key point of our work (collaborative bandit exploration amidst adversaries). For these reasons, we prefer the alternative model, but to enable a cleaner comparison with prior work (Vial et al., 2021), we restrict attention to the Section 2 model (which generalizes that of (Vial et al., 2021)).
4. Existing blocking rule
We can now define the blocking approach from (Vial et al., 2021), which is provided in Algorithm 3. In words, the rule is as follows: if the recommendation an agent received at the end of one phase is not that agent's most played arm in the subsequent phase, then the agent who recommended it is added to the blocklists of the next several phases, where the length of this block grows with the phase counter according to a tuning parameter. By Algorithm 2, this means the recommender is blocked (i.e., not communicated with) until a later phase, at the earliest. Thus, agents block others whose recommendations perform poorly – in the sense that UCB does not play them often – and the blocking becomes more severe as the phase counter grows. See (Vial et al., 2021, Remark 4) for further intuition.
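A minimal sketch of this decision, with hypothetical names and omitting the bookkeeping of how long the block lasts, is as follows.

```python
def existing_block_decision(recommended_arm, phase_pull_counts):
    """Existing rule from (Vial et al., 2021), as paraphrased above: block
    the recommender if the arm it recommended is not the agent's most played
    arm in the subsequent phase.  (The duration of the block, which grows
    with the phase counter, is handled separately by the blocklist.)
    """
    most_played = max(phase_pull_counts, key=phase_pull_counts.get)
    return recommended_arm != most_played

# The recommendation (arm 3) was played 40 times but arm 0 was played 60
# times, so the recommender gets blocked.
print(existing_block_decision(3, {0: 60, 3: 40, 5: 1}))  # True
```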
In the remainder of this section, we define a bad instance (Section 4.1) on which this blocking rule provably fails (Section 4.2). Our goal here is to demonstrate a single such instance in order to show this blocking rule must be refined. Therefore, we have opted for a concrete example, which includes some numerical constants (e.g., in (6) and in the statement of Theorem 1) that have no particular meaning. Nevertheless, the instance can be generalized; see Remark 4.
4.1. Bad instance
The network and bandit for the bad instance are as follows:
-
•
There is an even number of honest agents (at least four) arranged in a line, increasing in index from left to right, and there is a single malicious agent connected to each of the honest ones. Mathematically, the honest edges form a path and the malicious agent is adjacent to every honest agent (a small sketch of this network follows the list).
-
•
There are arms that generate deterministic rewards, with means given by
(6)
Intuitively, there are three sets of arms: the best arm, mediocre arms, and bad arms. We provide further intuition in the forthcoming proof sketch. For now, we highlight three key properties. First, the gap from mediocre to bad arms is constant. Second, the gaps between mediocre arms are doubly exponentially small. Third, the gap from the best arm to the mediocre arms is bounded away from zero, as shown in Appendix C.
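For concreteness, the network of this instance (but not the arm means in (6), which we omit) can be constructed as follows; the helper name is hypothetical.

```python
def bad_instance(num_honest):
    """Sketch of the Section 4.1 network (arm means omitted): honest agents
    0, ..., num_honest-1 form an undirected line, and a single malicious
    agent is connected to every honest agent.
    """
    assert num_honest >= 4 and num_honest % 2 == 0
    malicious = num_honest                       # id of the malicious agent
    line_edges = {(i, i + 1) for i in range(num_honest - 1)}
    hub_edges = {(i, malicious) for i in range(num_honest)}
    return line_edges | hub_edges, malicious

edges, mal = bad_instance(6)
print(sorted(edges))  # line 0-1-...-5 plus an edge (i, 6) for every honest i
```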
Observation 1.
Since rewards are deterministic, the most played arm in a phase is a deterministic function of the numbers of plays of the active arms at the beginning of the phase. Hence, when the recommendation is already active (in which case the active set is unchanged in Algorithm 1), the most played arm in the next phase is a function of information that is available to the malicious agent at the communication time. Consequently, the malicious agent can always recommend some currently active arm that will be most played in the next phase, thereby avoiding being blocked.
We make the following assumptions on Algorithms 1 and 2:
- •
-
•
Sticky sets have size one, and the initial active sets of the honest agents in the right half of the line contain only bad arms. Thus, active sets contain three arms, and the right half of the honest agents are initially only aware of the bad arms, i.e., of those that provide no reward.
Remark 4.
Note that Assumptions 1-3 all hold for this instance, and the chosen algorithm parameters are the ones used for the complete graph experiments in (Vial et al., 2021). Additionally, the instance can be significantly generalized – the key properties are that the numbers of arms and agents have the same scaling, the gaps from mediocre arms to the others are constant, the gaps among mediocre arms are doubly exponentially small, and a constant fraction of agents on the right initially only have bad arms active.
Finally, we define a particular malicious agent strategy, built around a doubly exponentially growing sequence of designated phases. The malicious recommendations are as follows:
-
•
At the designated phases, recommend a specific mediocre arm to a specific honest agent (as detailed in the proof sketch below).
-
•
Otherwise, recommend a currently active arm that will be most played in the subsequent phase (see Observation 1), so as to avoid being blocked.
Similar to the arm means, we will wait for the proof sketch to explain this strategy in more detail. For now, we only mention that the designated phases grow doubly exponentially, i.e.,
(7)
4.2. Negative result
We can now state the main result of this section. It shows that if the existing blocking rule from (Vial et al., 2021) is used on the above instance, then the honest agent at the end of the line suffers nearly linear regret until time exceeds a doubly exponential function of the number of agents.
Theorem 1.
Proof sketch.
We provide a complete proof in Appendix C but discuss the intuition here.
-
•
First, suppose the leftmost honest agent of the right half contacts the malicious agent at every phase up to a certain constant phase (this occurs with constant probability since that phase index is constant). Then the right half of the honest agents only have bad arms in their active sets at that phase. This is because their initial active sets only contain such arms (by assumption), the malicious agent only recommends currently-active arms before that phase, and no arm recommendations flow from the left half of the graph to the right half (they would first need to be sent to the leftmost agent of the right half, but we are assuming that agent only contacts the malicious agent before that phase).
-
•
Now consider phase . With constant probability, and both contact , who recommends a currently active (thus bad) arm and the mediocre arm , respectively. Then, again with constant probability, contacts at the next phase ; only has bad arms active and thus recommends a bad arm. Therefore, during phase , agent has the mediocre arm and some bad recommendation from in its active set. The inverse gap squared between these arms is constant, thus less than the length of phase (for appropriate ), so by standard bandit arguments (basically, noiseless versions of best arm identification results from (Bubeck et al., 2011)), will be most played. Consequently, by the blocking rule in Algorithm 3, blocks until phase .
-
•
We then use induction. For each ( in the previous bullet), suppose blocks between phases and . Then during these phases, no arm recommendations flow past , so agents only play arms . At phase , the malicious agent recommends and to agents and , respectively, and at the subsequent phase , recommends to . Similar to the previous bullet, we then show plays arm more than during phase and thus blocks until , completing the inductive step. The proof that is played more than during phase again follows from noiseless best arm identification, although unlike the previous bullet, the relevant arm gap is no longer constant (both could be mediocre arms). However, we chose the mediocre arm means such that their inverse gap squared is at most doubly exponential in , so by (7), the length of phase dominates it.
In summary, we show that due to blocking amongst honest agents, does not receive arm until phase , given that some constant probability events occur at each of the times . This allows us to show that, with probability at least , agent does not receive the best arm until phase , and thus does not play the best arm until time in expectation. Since is constant, we can lower bound regret similarly. ∎
5. Proposed blocking rule
To summarize the previous section, we showed that the existing blocking rule (Algorithm 3) may result in honest agents blocking too aggressively, which causes the best arm to spread very slowly. In light of this, we propose a relaxed blocking criterion (see Algorithm 4): at the end of each phase, an agent will block the neighbor who recommended an arm at the previous phase if
(8)
where the two thresholds are tuning parameters. Thus, the agent blocks if both of the following occur (see the sketch after this list):
-
•
The recommended arm performs poorly, in the sense that UCB has not chosen it sufficiently often by the end of the phase.
-
•
The agent has not changed its own best arm estimate for several phases. Intuitively, this can be viewed as a confidence criterion: if the agent has instead recently changed its estimate, then it is currently unsure which arm is best, so it should not block others for recommendations that appear suboptimal at first glance (i.e., those for which the first criterion in (8) may hold).
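A minimal sketch of the relaxed rule in (8) follows; the parameter names (play_threshold, stability_window) are hypothetical stand-ins for the two tuning parameters.

```python
def proposed_block_decision(pulls_of_recommended_arm, play_threshold,
                            phases_since_estimate_change, stability_window):
    """Sketch of the relaxed rule in (8) with hypothetical parameter names:
    block only if (i) the recommended arm was played fewer than
    `play_threshold` times by the end of the phase, AND (ii) the agent's own
    best arm estimate has been unchanged for at least `stability_window`
    phases.
    """
    performs_poorly = pulls_of_recommended_arm < play_threshold
    estimate_is_stable = phases_since_estimate_change >= stability_window
    return performs_poorly and estimate_is_stable

# The recommendation looks bad (only 3 plays vs. a threshold of 50), but the
# agent changed its own best arm estimate just 1 phase ago, so it does NOT
# block -- this is exactly what prevents the cascade of Section 4.
print(proposed_block_decision(3, 50, phases_since_estimate_change=1,
                              stability_window=5))  # False
```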
Remark 5.
The first criterion in (8) is a natural relaxation of demanding the recommended arm be most played. The second is directly motivated by the negative result from Section 4. In particular, recall from the Theorem 1 proof sketch that honest agents blocked one another shortly after receiving a new mediocre arm from the malicious agent. Thus, blocking amongst honest agents was always precipitated by the blocking agent changing its best arm estimate. The second criterion in (8) aims to avoid this.
Remark 6.
Our proposed rule has two additional tuning parameters compared to the existing one. For our theoretical results, these will be specified in Theorem 2; for experiments, they are discussed in Section 6. For now, we only mention that they should satisfy two properties. First, the play-count threshold should be sublinear in the phase length, so that the first criterion in (8) demands only a sublinear number of plays. Second, the stability window should grow with the phase counter, since (as discussed above) the second criterion represents the confidence in the best arm estimate, which grows as the number of reward observations increases.
In the remainder of this section, we introduce a further definition (Section 5.1), provide a general regret bound under our blocking rule (Section 5.2), and discuss some special cases (Section 5.3).
5.1. Noisy rumor process
As discussed in Section 1.4, we will show that under our proposed rule, (1) honest agents eventually stop blocking each other, and (2) honest agents with the best arm active will eventually recommend it to others. Thereafter, we essentially reduce the arm spreading process to a much simpler rumor process in which each honest agent contacts a uniformly random neighbor and, if that neighbor is an honest agent who knows the rumor (i.e., has the best arm active), then the neighbor informs the contacting agent of the rumor (i.e., recommends the best arm). The only caveat is that we make no assumption on the malicious agents' arm recommendations, so we have no control over whether or not they are blocked. In other words, the rumor process unfolds over a dynamic graph, where edges between honest and malicious agents may or may not be present, and we have no control over these dynamics.
In light of this, we take a worst-case view and lower bound the arm spreading process with a noisy rumor process that unfolds on the (static) honest agent subgraph. More specifically, we consider the process that tracks the honest agents informed of the rumor. Initially, only the agent from Assumption 3 is informed. Then at each phase, each honest agent contacts a random honest neighbor. If the contacted neighbor is informed, then the contacting agent becomes informed as well, subject to some noise: the information is transmitted only with a certain probability, which is upper bounded by the probability with which the agent receives the best arm in the process of the previous paragraph.
More formally, we define the noisy rumor process as follows. The key quantity in Definition 1 is the first phase at which all honest agents are informed. Analogous to (Chawla et al., 2020), our most general result will be in terms of the expected time at which this phase occurs. Under Assumption 1, the latter quantity is polynomial in the problem parameters, which cannot be improved in general (see Appendix D.4). However, Section 5.3 provides sharper bounds in some special cases.
Definition 1.
Let the noise probability be as above. For each honest agent, let there be an i.i.d. sequence of Bernoulli noise variables and an i.i.d. sequence of contacted neighbors chosen uniformly at random from that agent's honest neighbors. Inductively define the informed sets as follows: initially, only the agent from Assumption 3 is informed, and
(9)
Finally, let the spreading time be the first phase at which all honest agents are informed.
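The process in Definition 1 is simple to simulate; the following sketch (hypothetical names, toy line graph) returns the spreading time, i.e., the first phase at which all honest agents are informed.

```python
import random

def noisy_rumor_spreading_time(honest_neighbors, source, q, seed=0, max_phases=10**6):
    """Simulate the noisy rumor process of Definition 1 (hypothetical
    parameter names): initially only `source` is informed; in each phase,
    every honest agent contacts a uniformly random honest neighbor and, if
    that neighbor is informed, becomes informed with probability `q`.
    Returns the first phase at which all honest agents are informed.
    """
    rng = random.Random(seed)
    informed = {source}
    agents = list(honest_neighbors)
    for phase in range(1, max_phases + 1):
        newly = set()
        for i in agents:
            contact = rng.choice(honest_neighbors[i])
            if contact in informed and rng.random() < q:
                newly.add(i)
        informed |= newly
        if len(informed) == len(agents):
            return phase
    return max_phases

# Line graph on 6 honest agents; the source sits at one end.
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
print(noisy_rumor_spreading_time(nbrs, source=0, q=0.5))
```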
5.2. Positive result
We can now present the main result of this section: a regret upper bound for the proposed blocking rule. We state it first and then unpack the statement in some ensuing remarks. The proof of this result (and all others in this section) is deferred to Appendix D.
Theorem 2.
Let Assumptions 1-3 hold. Suppose we run Algorithm 1 and use Algorithm 4 as the Update-Blocklist subroutine with suitably chosen tuning parameters in (8). Also assume
(10)
Then for any honest agent and horizon , we have
(11)
where by convention if . Here is a term independent of satisfying
(12)
where hides dependencies on , , , , and and log dependencies on , , , and .
Remark 7.
The theorem shows that our algorithm's regret scales as in (11), plus an additive term that is independent of the horizon and polynomial in all other parameters. For the sticky set size chosen in Remark 1, the first term matches the bound stated in Section 1.4. Also, when the number of malicious neighbors is large, we recover the single-agent bound (including constants), i.e., if there are many malicious agents, honest ones fare no worse than the single-agent case.
Remark 8.
Remark 9.
The bound in Theorem 2 can be simplified under additional assumptions. For instance, in Example 2, it is reasonable to assume the number of arms is proportional to the number of agents (i.e., the number of restaurants is proportional to the population) and that the degrees are constant (as in sparse social networks). Under these assumptions, the choice from Remark 1, and the parameters from Remark 6, the theorem's regret bound simplifies further.
Remark 10.
Proof sketch.
Let us call the first phase after which the best arm is the most played arm for every honest agent the spreading phase. Before this phase, we simply upper bound regret by the elapsed time. The main novelty of our analysis is bounding the expected time at which the spreading phase occurs, chiefly in terms of the spreading time of the noisy rumor process from Section 5.1. We devote Section 7 to discussing this proof.
After the spreading phase, the best arm is active by definition, so the agent incurs regret logarithmic in the horizon. We let
(13)
be the earliest phase such that the agent blocks all suboptimal recommendations thereafter. We then split the phases after the spreading phase into two groups: those before this blocking phase and those after.
For the phases after the spreading phase but before the blocking phase, we consider three cases:
-
•
: In this case, there are no such phases, so there is nothing to prove.
- •
-
•
In the remaining case, we are considering phases where the best arm is most played by the agent but the agent does not yet block suboptimal recommendations. Note that no such phases arise for the existing blocking rule, so here the proof diverges from (Vial et al., 2021), and most of Appendix D.1 is dedicated to this case. Roughly speaking, the argument analogous to the previous case yields a regret bound involving the blocking phase, and we show this term is of the same order by deriving a tail bound for the blocking phase. The tail bound amounts to showing that, once the best arm is active, the agent can identify suboptimal arms as such within the phase. This in turn follows from best arm identification results and the growing phase lengths.
After the spreading phase, the best arm is most played for all honest agents, so they only recommend this arm. Thus, the agent only plays the best arm, its sticky arms, and any malicious recommendations. Consequently, to bound regret as in (11), we need to show each malicious neighbor contributes only a limited number of suboptimal recommendations. It is easy to see this is the case: if a malicious neighbor recommends a bad arm at some phase, it is blocked for an increasingly long stretch of subsequent phases, so the number of bad recommendations it makes grows only slowly with the horizon. Finally, an argument from (Vial et al., 2021) sharpens this count further. ∎
5.3. Special cases
We next discuss some special cases of our regret bound. First, as in (Chawla et al., 2020), we can prove an explicit bound assuming the honest agent subgraph is regular, i.e., every honest agent has the same number of honest neighbors.
Corollary 1.
Let the assumptions of Theorem 2 hold and further assume is -regular with . Let denote the conductance of . Then for any honest agent and horizon ,
(14)
(15) |
Remark 11.
This corollary includes the complete graph case studied in (Vial et al., 2021). In this case, the term (14) matches the corresponding term from (Vial et al., 2021) exactly, i.e., for large horizons, Corollary 1 is a strict generalization. Moreover, our additive term's dependence on the arm gap matches the fully cooperative case (Chawla et al., 2020), whereas the corresponding dependence in (Vial et al., 2021) is potentially much larger.
Remark 12.
Proof sketch.
In light of Theorem 2, we only need to bound the expected spreading time of the noisy rumor process. To do so, we consider the noiseless version of the process (defined in the same way but without the Bernoulli noise) and its spreading time. We then construct a coupling between the noisy and noiseless processes, which ensures that, with high probability, the noisy process spreads essentially as quickly as the noiseless one. Finally, using this coupling and a tail bound for the noiseless spreading time from (Chawla et al., 2020) (which draws upon the analysis of (Chierichetti et al., 2010)), we derive a tail bound for the noisy spreading time, which yields the desired bound. (When there are no malicious agents, (Chawla et al., 2020) proves a comparable bound, so ours generalizes theirs up to lower-order terms.) ∎
Finally, we can sharpen the above results for honest agents without malicious neighbors.
Corollary 2.
Remark 13.
Proof sketch.
Recall from the Theorem 2 proof sketch that the leading term arises from regret after the spreading phase. At any such phase, the best arm is most played for all honest agents (by definition), so when the agent has no malicious neighbors, all of its neighbors only recommend this arm. Therefore, the agent's active sets after this phase are fixed; they contain the best arm and a few suboptimal ones. Thus, the agent only plays a fixed set of suboptimal arms long-term, so in the worst case it incurs the standard UCB regret for that many arms. ∎
6. Numerical results
Thus far, we have shown the proposed blocking rule adapts to general graphs more gracefully than the existing one, at least in theory. We now illustrate this finding empirically.
Experimental setup: We follow (Vial et al., 2021, Section 6) except we extend those experiments to random graphs in which each edge is present independently with a given probability. For each edge probability and each of two malicious strategies (to be defined shortly), we conduct repeated trials of the following:
-
•
Fix the numbers of honest and malicious agents and generate the network as a random graph, resampling if necessary until the honest agent subgraph is connected (see Assumption 1).
-
•
Fix the number of arms and the two largest arm means, then sample the remaining arm means uniformly at random from a range below them (so the arm gap is strictly positive). For each arm, set its reward distribution to be Bernoulli with the corresponding mean.
-
•
Fix the sticky set size and sample the sticky sets uniformly at random from the subsets of that size, resampling if necessary until some honest agent has the best arm in its sticky set (see Assumption 3).
- •
Algorithmic parameters: We set the phase and exploration parameters as in Remarks 4 and 8. For the parameters in the proposed blocking rule, we choose values that differ from those specified in our theoretical results (which we found to be too conservative in practice) but that satisfy the key properties discussed in Remark 6.
Malicious strategies: Like (Vial et al., 2021), we use strategies we call the naive and smart strategies (they are called uniform and omniscient in (Vial et al., 2021)). The naive strategy simply recommends a uniformly random suboptimal arm. The smart strategy recommends the least played, inactive, suboptimal arm. Intuitively, this is a more devious strategy which forces the honest agent to play the recommended arm often in the next phase (to drive down its upper confidence bound). Consequently, the honest agent may play it most and discard a better arm in favor of it (see Algorithm 1).
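The two strategies can be sketched as follows; the function names and bookkeeping are hypothetical, and tie-breaking in the actual experiments may differ.

```python
import random

def naive_recommendation(num_arms, best_arm, rng):
    """Naive strategy: a uniformly random suboptimal arm."""
    suboptimal = [k for k in range(num_arms) if k != best_arm]
    return rng.choice(suboptimal)

def smart_recommendation(num_arms, best_arm, active_set, pull_counts):
    """Smart strategy (as paraphrased above): among suboptimal arms that the
    contacting agent does not currently have active, recommend the one it
    has played the least, to force many exploratory pulls next phase.
    """
    candidates = [k for k in range(num_arms)
                  if k != best_arm and k not in active_set]
    return min(candidates, key=lambda k: pull_counts.get(k, 0))

rng = random.Random(0)
print(naive_recommendation(5, best_arm=0, rng=rng))
print(smart_recommendation(5, best_arm=0, active_set={0, 1},
                           pull_counts={2: 9, 3: 1, 4: 4}))  # 3
```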
Results: In Figure 3, we plot the average and standard deviation (across trials) of the per-agent regret. For the naive strategy, the existing blocking rule eventually becomes worse than the no blocking baseline as the edge probability decreases. More strikingly, it even becomes worse than the no communication baseline for the smart strategy. In other words, honest agents would have been better off ignoring the network and simply running UCB on their own. As in Section 4, this is because accidental blocking causes the best arm to spread very slowly. Additionally, the standard deviation becomes much higher than for all other algorithms, suggesting that regret is significantly worse in some trials. In contrast, the proposed blocking rule improves as the edge probability decreases, because it is mild enough to spread the best arm at all edge probabilities, and for smaller ones, honest agents have fewer malicious neighbors (on average). We also observe that the proposed rule outperforms both baselines uniformly across edge probabilities. Additionally, it improves over the existing rule more dramatically for the smart strategy, i.e., when the honest agents face a more sophisticated adversary. Finally, it is worth acknowledging that the existing rule is better on the complete graph – although not in a statistically significant sense for the smart strategy – because it does spread the best arm quickly there (as shown in (Vial et al., 2021)), and thereafter it blocks malicious agents more aggressively.
Figure 3. Empirical results for synthetic data. Rows of subfigures correspond to the malicious strategy, while columns correspond to the edge probability of the random graph.
Other results: As in (Vial et al., 2021), we reran the simulations using arm means derived from the MovieLens dataset (Harper and Konstan, 2015). We also experimented with new variants of the smart and naive strategies, where the malicious agents follow these strategies if the best arm is active (in hopes of forcing honest agents to discard it) and recommend the second best arm otherwise. Intuitively, these variants differ in that malicious agents recommend good arms (i.e., the second best) more frequently, while still never revealing the best arm (the only one that leads to logarithmic regret). For all experiments, the key message – that the proposed blocking rule adapts to varying graph structures more gracefully than the existing one – is consistent. See Appendix B for details.
7. Gossip despite blocking
As discussed above, the main analytical contribution of this work is proving that the best arm spreads in a gossip fashion, despite accidental blocking. In this (technical) section, we provide a detailed sketch of this proof. We begin with a high-level outline. The key is to show that honest agents eventually stop blocking each other. This argument (roughly) proceeds as follows:
-
•
Step 1: First, we show that honest agents learn the arm statistics in a certain sense. More specifically, we provide a tail bound for a random phase such that, at all later phases, (1) each honest agent's most played arm in the phase is close to its true best active arm and (2) any active arm close to the true best one is played sufficiently often by the end of the phase.
-
•
Step 2: Next, we show that honest agents communicate with their neighbors frequently. In particular, we establish a tail bound for another random phase such that, thereafter, each honest agent contacts each of its honest neighbors at least once within any sufficiently long window of phases.
-
•
Step 3: Finally, we use the above to show that eventually, no blocking occurs amongst honest agents. The basic idea is as follows. Consider a sufficiently late phase, an honest agent, and an honest neighbor of that agent. If the agent has had the same best arm estimate for long enough – i.e., if the second blocking criterion in (8) holds – then the neighbor would have contacted the agent at some recent phase (by Step 2) and received that stable best arm estimate. Between that phase and the current one, the neighbor's most played arm cannot get significantly worse (by Step 1). Thus, if the agent asks the neighbor for a recommendation, the neighbor will respond with an arm whose mean is close to that of the agent's own best arm estimate, which the agent will then play sufficiently often (by Step 1). Hence, the first criterion in (8) fails, i.e., the two criteria cannot hold simultaneously.
In the next three sub-sections, we discuss these three steps. Then in Section 7.4, we describe how, once accidental blocking stops, the arm spreading process can be coupled to the noisy rumor process from Definition 1. Finally, in Section 7.5, we discuss how to combine all of these steps to bound the term from the Theorem 2 proof sketch.
7.1. Learning the arm statistics
Recall we assume , so for any , is the best arm in , i.e., . Therefore, for any , is the subset of arms at least -close to the best one. For each honest agent and phase , define
(16) |
where will be chosen shortly. Finally, define the random phase
(17) |
In words, this is the earliest phase such that, at all phases thereafter, (1) the most played arms are close to the best active arms and (2) all active arms close to the best are played sufficiently often.
As discussed above, Step 1 involves a tail bound for . The analysis is based on (Bubeck et al., 2011, Theorem 2), which includes a tail bound showing that the most played arm is -close to the best, provided that samples have been collected from each of the -far arms. In our case, phase lasts time steps, so each of active arms is played times on average. Hence, we can show the most played arm within the phase is -close to the best if we choose , which allows us to bound . Analogously, we choose and show that -close arms must be played times before they are distinguished as such, which allows us to bound . Taken together, we can prove a tail bound for (Lemma 9 in Appendix E.1) with these choices and .
7.2. Communicating frequently
Next, for any such that , let denote the event that did not send a recommendation to between phases and . Also define
(18) |
Here we abuse notation slightly; the union is over all (undirected) edges in but viewed as pairs of directed edges. Hence, at all phases , each honest agent receives a recommendation from each of its honest neighbors at some phase between and .
Step 2 involves the tail bound for that was mentioned above (see Lemma 10 in Appendix E.2). The proof amounts to bounding the probability of . Recall this event says did not contact for a recommendation at any phase . Clearly, this means did not block at any such phase. Hence, in the worst case, blocked just before , in which case was un-blocked at , where the inequality holds by assumption in Theorem 2. Hence, implies was not blocking between phases and , so each of the neighbors that contacted in these phases was sampled uniformly from a set containing , yet was never sampled. The probability of this decays exponentially in , which yields an exponential tail for .
7.3. Avoiding accidental blocking
Next, we show honest agents eventually stop blocking each other. Toward this end, we first note
(19) |
where the first inequality uses the definition of , and the second holds because and in Algorithm 1 (see Claim 13 in Appendix E.3 for details). In words, (19) says the best active arm can decay by at most at phase . Applying iteratively and since there are arms total, we then show (see Claim 14 in Appendix E.3)
(20) |
Combining the previous two inequalities, we conclude (see Corollary 3 in Appendix E.3)
(21) |
Now the key part of Step 3 is to use (21) to show (see Claim 15 in Appendix E.3)
(22) |
In words, this result says that if is sufficiently large, and if has sent a recommendation to since phase , then will not block at phase . The proof is by contradiction: if instead blocks at , then by Algorithm 4, has not changed its best arm estimate since phase , so it would have recommended to at some phase . Therefore, . Additionally, since , we know that for any , the choice of in Section 7.1 guarantees that
Combining these observations and using (21) (with and replaced by and ), we then show
(23) |
On the other hand, blocking at phase means plays the recommended arm fewer than times by the end of phase . Since , this implies (by definition of ) that , where as in Section 7.1. Combined with (23) and the choice from Theorem 2, we conclude . This contradicts the assumption in Theorem 2, which completes the proof of (22).
Finally, we use (22) to show honest agents eventually stop blocking each other entirely, i.e.,
(24) |
(see Lemma 11 in Appendix E.3). Intuitively, (24) holds because after new blocking stops (22), old blocking will eventually “wear off”. The proof is again by contradiction: if is blocking some honest at phase , the blocking must have started at some (else, it ends by ). Thus, by assumption , blocked at phase . But by definition of , would have contacted at some phase . Applying (22) (at phase ; note that by the above inequalities, , as required by (22)), we obtain a contradiction.
7.4. Coupling with noisy rumor process
To begin, we define an equivalent way to sample the contacted neighbor in Algorithm 2. (Claim 16 in Appendix E.4 verifies this equivalency; the proof is a straightforward application of the law of total probability.) This equivalent method will allow us to couple the arm spreading and noisy rumor processes through a set of primitive random variables. In particular, for each honest agent, we introduce two i.i.d. sequences of primitive random variables: one of uniformly random honest neighbors and one of auxiliary randomization variables. The contacted neighbor is then chosen according to two cases:
-
•
If , let and consider two sub-cases. First, if , set . Second, if , sample from uniformly.
-
•
If , sample from uniformly.
Next, we observe that since as by the choice of in Section 7.1 and by Assumption 2, we have for large enough . Paired with the definition of , this allows us to show that for all large and with (i.e., with the best arm active), (i.e., the best arm is played most). See Claim 17 in Appendix E.4 for the formal statement.
Finally, we observe that by (24), only the first case of the above sampling strategy occurs for large . Moreover, in this case, is Bernoulli with parameter
where the second inequality holds by Definition 1. Hence, the probability that , and thus the probability that contacts the random honest neighbor in the above sampling strategy, dominates the probability that contacts in the noisy rumor process of Definition 1. Additionally, by the previous paragraph, agents with the best arm active will recommend it (for large enough ). Taken together, we can show that the probability of receiving the best arm in the arm spreading process dominates the probability of being informed of the rumor in the noisy rumor process. This allows us to prove a tail bound for in terms of a tail bound for the random phase from Definition 1, on the event that the tails of and are sufficiently small (in the sense of (24); see Lemma 12 in Appendix E.4 for details).
7.5. Spreading the best arm
In summary, we prove tail bounds for and (Sections 7.1 and 7.2) and show the tails of are controlled by those of , provided the tails of and are not too heavy (Sections 7.3 and 7.4). Combining and summing tails allows us to bound in terms of (which accounts for the tails of and ) and (which accounts for the tail of ), as mentioned in the Theorem 2 proof sketch. See Theorem 3 and Corollary 4 in Appendix E.5 for details.
8. Conclusion
In this work, we showed that existing algorithms for multi-agent bandits with malicious agents fail to generalize beyond the complete graph. In light of this, we proposed a new blocking algorithm and showed it has low regret on any connected and undirected graph. This regret bound relied on the analysis of a novel process involving gossip and blocking. Our work leaves open several questions, such as whether our insights can be applied to multi-agent reinforcement learning.
Acknowledgements.
This work was supported by NSF Grants CCF 22-07547, CCF 19-34986, CNS 21-06801, 2019844, 2112471, and 2107037; ONR Grant N00014-19-1-2566; the Machine Learning Lab (MLL) at UT Austin; and the Wireless Networking and Communications Group (WNCG) Industrial Affiliates Program.
References
- Anandkumar et al. (2011) Animashree Anandkumar, Nithin Michael, Ao Kevin Tang, and Ananthram Swami. 2011. Distributed algorithms for learning and cognitive medium access with logarithmic regret. IEEE Journal on Selected Areas in Communications 29, 4 (2011), 731–745.
- Audibert and Bubeck (2010) Jean-Yves Audibert and Sébastien Bubeck. 2010. Best Arm Identification in Multi-Armed Bandits. In COLT-23th Conference on Learning Theory-2010. 13–p.
- Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47, 2-3 (2002), 235–256.
- Auer et al. (1995) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. 1995. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of IEEE 36th Annual Foundations of Computer Science. IEEE, 322–331.
- Avner and Mannor (2014) Orly Avner and Shie Mannor. 2014. Concurrent bandits and cognitive radio networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 66–81.
- Awerbuch and Kleinberg (2008) Baruch Awerbuch and Robert Kleinberg. 2008. Competitive collaborative learning. J. Comput. System Sci. 74, 8 (2008), 1271–1288.
- Bar-On and Mansour (2019) Yogev Bar-On and Yishay Mansour. 2019. Individual regret in cooperative nonstochastic multi-armed bandits. Advances in Neural Information Processing Systems 32 (2019), 3116–3126.
- Bargiacchi et al. (2018) Eugenio Bargiacchi, Timothy Verstraeten, Diederik Roijers, Ann Nowé, and Hado Hasselt. 2018. Learning to coordinate with coordination graphs in repeated single-stage multi-agent decision problems. In International conference on machine learning. PMLR, 482–490.
- Bistritz and Bambos (2020) Ilai Bistritz and Nicholas Bambos. 2020. Cooperative multi-player bandit optimization. Advances in Neural Information Processing Systems 33 (2020).
- Bistritz and Leshem (2018) Ilai Bistritz and Amir Leshem. 2018. Distributed multi-player bandits-a game of thrones approach. In Advances in Neural Information Processing Systems. 7222–7232.
- Bogunovic et al. (2020) Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. 2020. Corruption-tolerant Gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics. PMLR, 1071–1081.
- Bogunovic et al. (2021) Ilija Bogunovic, Arpan Losalka, Andreas Krause, and Jonathan Scarlett. 2021. Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics. PMLR, 991–999.
- Boursier and Perchet (2019) Etienne Boursier and Vianney Perchet. 2019. SIC-MMAB: Synchronisation Involves Communication in Multiplayer Multi-Armed Bandits. Advances in Neural Information Processing Systems 32 (2019), 12071–12080.
- Bubeck et al. (2011) Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. 2011. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science 412, 19 (2011), 1832–1852.
- Buccapatnam et al. (2015) Swapna Buccapatnam, Jian Tan, and Li Zhang. 2015. Information sharing in distributed stochastic bandits. In 2015 IEEE Conference on Computer Communications (INFOCOM). IEEE, 2605–2613.
- Cesa-Bianchi et al. (2016) Nicolo Cesa-Bianchi, Claudio Gentile, Yishay Mansour, and Alberto Minora. 2016. Delay and cooperation in nonstochastic bandits. In Conference on Learning Theory, Vol. 49. 605–622.
- Chakraborty et al. (2017) Mithun Chakraborty, Kai Yee Phoebe Chua, Sanmay Das, and Brendan Juba. 2017. Coordinated Versus Decentralized Exploration In Multi-Agent Multi-Armed Bandits.. In IJCAI. 164–170.
- Chawla et al. (2020) Ronshee Chawla, Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. 2020. The Gossiping Insert-Eliminate Algorithm for Multi-Agent Bandits. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics. 3471–3481.
- Chawla et al. (2022) Ronshee Chawla, Abishek Sankararaman, and Sanjay Shakkottai. 2022. Multi-agent low-dimensional linear bandits. IEEE Trans. Automat. Control (2022).
- Chierichetti et al. (2010) Flavio Chierichetti, Silvio Lattanzi, and Alessandro Panconesi. 2010. Almost tight bounds for rumour spreading with conductance. In Proceedings of the forty-second ACM symposium on Theory of computing. 399–408.
- Dakdouk et al. (2021) Hiba Dakdouk, Raphaël Féraud, Romain Laroche, Nadège Varsier, and Patrick Maillé. 2021. Collaborative Exploration and Exploitation in massively Multi-Player Bandits. (2021).
- Dubey et al. (2020a) Abhimanyu Dubey et al. 2020a. Cooperative multi-agent bandits with heavy tails. In International Conference on Machine Learning. PMLR, 2730–2739.
- Dubey et al. (2020b) Abhimanyu Dubey et al. 2020b. Kernel methods for cooperative multi-agent contextual bandits. In International Conference on Machine Learning. PMLR, 2740–2750.
- Dubey and Pentland (2020) Abhimanyu Dubey and Alex 'Sandy' Pentland. 2020. Differentially-Private Federated Linear Bandits. Advances in Neural Information Processing Systems 33 (2020).
- Garcelon et al. (2020) Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. 2020. Adversarial Attacks on Linear Contextual Bandits. Advances in Neural Information Processing Systems 33 (2020).
- Gupta et al. (2019) Anupam Gupta, Tomer Koren, and Kunal Talwar. 2019. Better Algorithms for Stochastic Bandits with Adversarial Corruptions. In Conference on Learning Theory. 1562–1578.
- Harper and Konstan (2015) F Maxwell Harper and Joseph A Konstan. 2015. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4 (2015), 1–19.
- Hillel et al. (2013) Eshcar Hillel, Zohar S Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. 2013. Distributed exploration in multi-armed bandits. In Advances in Neural Information Processing Systems. 854–862.
- Jun et al. (2018) Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Jerry Zhu. 2018. Adversarial attacks on stochastic bandits. Advances in Neural Information Processing Systems 31 (2018), 3640–3649.
- Kalathil et al. (2014) Dileep Kalathil, Naumaan Nayyar, and Rahul Jain. 2014. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory 60, 4 (2014), 2331–2345.
- Kanade et al. (2012) Varun Kanade, Zhenming Liu, and Bozidar Radunovic. 2012. Distributed non-stochastic experts. In Advances in Neural Information Processing Systems. 260–268.
- Kao et al. (2022) Hsu Kao, Chen-Yu Wei, and Vijay Subramanian. 2022. Decentralized cooperative reinforcement learning with hierarchical information structure. In International Conference on Algorithmic Learning Theory. PMLR, 573–605.
- Kapoor et al. (2019) Sayash Kapoor, Kumar Kshitij Patel, and Purushottam Kar. 2019. Corruption-tolerant bandit learning. Machine Learning 108, 4 (2019), 687–715.
- Kolla et al. (2018) Ravi Kumar Kolla, Krishna Jagannathan, and Aditya Gopalan. 2018. Collaborative learning of stochastic bandits over a social network. IEEE/ACM Transactions on Networking 26, 4 (2018), 1782–1795.
- Korda et al. (2016) Nathan Korda, Balázs Szörényi, and Li Shuai. 2016. Distributed clustering of linear bandits in peer to peer networks. In Journal of Machine Learning Research Workshop and Conference Proceedings, Vol. 48. International Machine Learning Society, 1301–1309.
- Lalitha and Goldsmith (2021) Anusha Lalitha and Andrea Goldsmith. 2021. Bayesian Algorithms for Decentralized Stochastic Bandits. IEEE Journal on Selected Areas in Information Theory 2, 2 (2021), 564–583.
- Landgren et al. (2016) Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. 2016. Distributed cooperative decision-making in multiarmed bandits: Frequentist and Bayesian algorithms. In 2016 IEEE 55th Conference on Decision and Control (CDC). IEEE, 167–172.
- Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. 2020. Bandit algorithms. Cambridge University Press.
- LeBlanc et al. (2013) Heath J LeBlanc, Haotian Zhang, Xenofon Koutsoukos, and Shreyas Sundaram. 2013. Resilient asymptotic consensus in robust networks. IEEE Journal on Selected Areas in Communications 31, 4 (2013), 766–781.
- Liu and Shroff (2019) Fang Liu and Ness Shroff. 2019. Data Poisoning Attacks on Stochastic Bandits. In International Conference on Machine Learning. 4042–4050.
- Liu et al. (2021) Junyan Liu, Shuai Li, and Dapeng Li. 2021. Cooperative Stochastic Multi-agent Multi-armed Bandits Robust to Adversarial Corruptions. arXiv preprint arXiv:2106.04207 (2021).
- Liu and Zhao (2010) Keqin Liu and Qing Zhao. 2010. Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing 58, 11 (2010), 5667–5681.
- Liu et al. (2020) Lydia T Liu, Horia Mania, and Michael Jordan. 2020. Competing bandits in matching markets. In International Conference on Artificial Intelligence and Statistics. PMLR, 1618–1628.
- Lykouris et al. (2018) Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. 2018. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing. 114–122.
- Madhushani and Leonard (2021) Udari Madhushani and Naomi Leonard. 2021. When to call your neighbor? strategic communication in cooperative stochastic bandits. arXiv preprint arXiv:2110.04396 (2021).
- Mansour et al. (2018) Yishay Mansour, Aleksandrs Slivkins, and Steven Wu. 2018. Competing bandits: Learning under competition. In 9th Innovations in Theoretical Computer Science, ITCS 2018. Schloss Dagstuhl-Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, 48.
- Martínez-Rubio et al. (2019) David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. 2019. Decentralized cooperative stochastic multi-armed bandits. Advances in Neural Information Processing Systems (2019).
- Mitra et al. (2021) Aritra Mitra, Hamed Hassani, and George Pappas. 2021. Exploiting Heterogeneity in Robust Federated Best-Arm Identification. arXiv preprint arXiv:2109.05700 (2021).
- Newton et al. (2021) Conor Newton, AJ Ganesh, and Henry Reeve. 2021. Asymptotic Optimality for Decentralised Bandits. In Reinforcement Learning in Networks and Queues, Sigmetrics 2021.
- Pittel (1987) Boris Pittel. 1987. On spreading a rumor. SIAM J. Appl. Math. 47, 1 (1987), 213–223.
- Rosenski et al. (2016) Jonathan Rosenski, Ohad Shamir, and Liran Szlak. 2016. Multi-player bandits–a musical chairs approach. In International Conference on Machine Learning. 155–163.
- Sankararaman et al. (2019) Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. 2019. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems 3, 3 (2019), 1–35.
- Shahrampour et al. (2017) Shahin Shahrampour, Alexander Rakhlin, and Ali Jadbabaie. 2017. Multi-armed bandits in multi-agent networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2786–2790.
- Szörényi et al. (2013) Balázs Szörényi, Róbert Busa-Fekete, István Hegedűs, Róbert Ormándi, Márk Jelasity, and Balázs Kégl. 2013. Gossip-based distributed stochastic bandit algorithms. In Journal of Machine Learning Research Workshop and Conference Proceedings, Vol. 2. International Machine Learning Society, 1056–1064.
- Tekin and Van Der Schaar (2015) Cem Tekin and Mihaela Van Der Schaar. 2015. Distributed online learning via cooperative contextual bandits. IEEE Transactions on Signal Processing 63, 14 (2015), 3700–3714.
- Vial et al. (2021) Daniel Vial, Sanjay Shakkottai, and R Srikant. 2021. Robust multi-agent multi-armed bandits. In Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing. 161–170.
- Wang et al. (2020) Po-An Wang, Alexandre Proutiere, Kaito Ariu, Yassir Jedra, and Alessio Russo. 2020. Optimal algorithms for multiplayer multi-armed bandits. In International Conference on Artificial Intelligence and Statistics. PMLR, 4120–4129.
- Zhu et al. (2021) Zhaowei Zhu, Jingxuan Zhu, Ji Liu, and Yang Liu. 2021. Federated bandit: A gossiping approach. In Abstract Proceedings of the 2021 ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems. 3–4.
Appendix A Notes on appendices
The appendices are organized as follows. First, Appendix B contains the additional numerical results that were mentioned in Section 6. Next, we prove Theorem 1 in Appendix C and all results from Section 5 in Appendix D. We then provide a rigorous version of the proof sketch from Section 7 in Appendix E. Finally, Appendix F contains some auxiliary results – namely, Appendix F.1 records some simple inequalities, Appendix F.2 provides some bandit results that are essentially known but stated in forms convenient to us, and Appendices F.3-F.4 contain some tedious calculations.
For the analysis, we use , , etc. to denote positive constants depending only on the algorithmic parameters , , , , and . Each is associated with a corresponding claim, e.g., with Claim 1. Within the proofs, we use , , etc. to denote constants whose values may change across proofs. Finally, denotes the indicator function, and are expectation and probability conditioned on all randomness before the -th communication period, and denotes the current phase at time .
Appendix B Additional experiments

Figure 4. Empirical results for real data. Rows of subfigures correspond to the malicious strategy, while columns correspond to the edge probability for the random graph.

Figure 5. Empirical results for synthetic data with and mixed malicious strategies.
As mentioned in Section 6, we also considered arm means derived from real data. The setup was the same as for the synthetic means, except for two changes (as in (Vial et al., 2021)): we choose instead of , and we sample uniformly from a set of arm means derived from MovieLens (Harper and Konstan, 2015) user film ratings via matrix completion; see (Vial et al., 2021, Section 6.2) for details. The results (Figure 4) are qualitatively similar to the synthetic case.
Finally, we repeated the synthetic data experiments from Section 6 with the intermediate graph parameter and two new malicious strategies called mixed naive and mixed smart. As discussed in Section 6, these approaches use a “mixed report” where the malicious agents more frequently recommend good arms – namely, the second best when the best is inactive and the naive or smart recommendation otherwise. Results are shown in Figure 5. They again reinforce the key message that the proposed rule adapts more gracefully to networks beyond the complete graph – in this case, our blocking rule has less than half of the regret of the existing one at the horizon . Additionally, we observe that the no-blocking algorithm from (Chawla et al., 2020) has much lower regret in Figure 5 than in Figure 3, though still higher than our proposed blocking algorithm. This suggests that our algorithm remains superior even for “nicer” malicious strategies under which blocking is less necessary (in the sense that (Chawla et al., 2020) has lower regret in Figure 5 than in Figure 3).
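As a rough illustration of the mixed strategies (a sketch with hypothetical names and signature; the precise strategies are the ones specified in Section 6), the malicious report can be summarized as follows.

```python
def mixed_recommendation(best_arm, second_best_arm, active_set, fallback):
    """Sketch of the 'mixed' malicious report: recommend the second-best arm when
    the best arm is inactive, and otherwise fall back to the naive or smart
    recommendation. All names here are illustrative placeholders."""
    if best_arm not in active_set:
        return second_best_arm
    return fallback
```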
Appendix C Proof of Theorem 1
We first observe that since , we can lower bound the arm gap as follows:
Next, we show that for phases , agents aware of arms at least as good as will (1) play such arms most often in phase and (2) have such arms active thereafter. The proof is basically a noiseless version of a known bandit argument, specialized to the setting of Section 4.
Lemma 1.
Under the assumptions of Theorem 1, for any , , and such that , we have and .
Proof.
First, we prove by contradiction that : suppose instead that . Let and . Then ; otherwise, since by assumption and is the most played arm in phase , we obtain
which is a contradiction. Furthermore, there clearly exists such that
Combining these observations, and since , we obtain that
By the UCB policy and since by assumption, the previous expression implies
(25) |
Since by the assumption and , we also know
(26) |
where we define . Note this function decreases on , since
where the second inequality is . Thus, since , we know . Combined with (25) and (26), we obtain . Finally, recall and , so . Combined with , we conclude
(27) |
We now show that in each of three cases, (27) yields a contradiction.
-
•
: By definition, the left side of (27) is and the right side is , and one can verify that .
-
•
: Here and , i.e., both arms are mediocre. By definition, we thus have , so to obtain a contradiction to (27), it suffices to show . We show by induction that this holds for all .
For , note (so ) and (so ). Thus, .
-
•
: Recall that in the previous case, we showed for any . Therefore, by assumption . Since in this case, we obtain a contradiction to (27).
Thus, we have established the first part of the lemma (). To show , we suppose instead that for some . Then is well-defined. If , then , which violates the assumption of the lemma, so we assume . In this case, we know (since is minimal) and (else, because , we would have ). But since (by assumption and ), this contradicts the first part of the lemma. ∎
Next, for each , we define the event
where by convention (so is well-defined). Thus, in words says (1) is not blocking at phase , (2) no honest agent has ever been aware of arms as good as up to phase , and (3) no such has ever blocked the malicious agent . Point (2) will allow us to show that agent does not become aware of the best arm until phase (when holds). The other events in the definition of will allow us to inductively lower bound their probabilities. The next lemma establishes the base of this inductive argument.
Lemma 2.
Under the assumptions of Theorem 1, .
Proof.
We first observe that at all phases , only the second case of the malicious strategy – where the malicious agent recommends to avoid blocking – arises, which implies . Therefore, it suffices to show , where we define
To do so, we will show and .
To show , first note that by the law of total expectation, we have
Now when occurs, the malicious strategy implies , so is sampled from a set of three agents which includes , for each . Since this sampling is independent, we conclude that when occurs,
Thus, combining the previous two expressions with the definition of and iterating, we obtain
To show , first observe that when occurs, does not contact at any phase , so . Thus, it only remains to show that implies for all and . Suppose instead that holds and for some such and . Let be the earliest phase at which it occurs; note by assumption that for . Let be an agent for which it occurs, i.e., is such that for some . Since is the earliest such phase and , we know was not active for at the previous phase , so it was recommended to at this phase. By the malicious strategy and , this implies that the agent who recommended to is honest, so was active for at the previous phase, which implies (else, we contradict the minimality of ). From the assumed line graph structure, we must have and , i.e., contacted at phase . But this contradicts the definition of , which stipulates that only contacts before . ∎
To continue the inductive argument, we lower bound in terms of .
Lemma 3.
Under the assumptions of Theorem 1, for any , we have .
Proof.
The proof is somewhat lengthy and proceeds in four steps.
Step 1: Probabilistic arguments. First, we define the events
Then by the law of total expectation, we know that
(29) |
Now if occurs, then (by ) and (by ); the latter implies , so combined with the former, . Thus, implies that is sampled from a set of at most three agents containing , so . Substituting into (29), and again using total expectation, we thus obtain
Analogously, when holds, , which by similar logic gives . Therefore, combining the previous two inequalities, we have shown . Consequently, it suffices to show that .
Step 2: Event decomposition. For , we decompose , where
(30) | |||
(31) |
As a simple consequence of these definitions, we note that
(32) | ||||
(33) |
Hence, to prove , it suffices to show . For the remainder of the proof, we thus assume holds and argue holds.
Step 3: Some consequences. We begin by deriving several consequences of . First, note each contacts at phase (by ), who recommends (by the malicious strategy). Since (by ), this implies , so is most played in phase (by Lemma 1). In summary, we have shown
(34) |
Second, as a consequence of the above and Lemma 1, we can also write
(35) |
Third, we know contacts at phase (by ), who responds with a currently active arm (by the malicious strategy), so since (by ), we have
(36) |
As a consequence of (36), we see that when contacts at phase (which occurs by ), recommends some arm strictly worse than . On the other hand, by (35) and Lemma 1, we know the most played arm for in phase has index at most . Taken together, the recommendation is not most played, so
(37) |
Step 4: Completing the proof. Using the above, we prove in turn that , , and hold. For , we use proof by contradiction: if fails, we can find and such that . Let be the minimal such (for this ). Since (by ) and is minimal, we must have , i.e., was blocked for the recommendation it provided at . If , this contradicts the malicious strategy, since and the strategy avoids blocking for such and . A similar contradiction arises if and (since in this case), so we must have and . But in this case, contradicts (34).
Next, we show holds. The logic is similar to the end of the Lemma 2 proof. If instead fails, we can find and such that . Let be the minimal such and an agent with for some . Since (by ), , and is minimal, we know that was recommended to at phase . By the malicious strategy, this implies that the recommending agent (say, ) was honest. Therefore, was active for at phase , so since is minimal, . Hence, by the assumed graph structure, contacted at phase , who recommended . If , this contact cannot occur, since instead contacts at (by ) and does not contact at (by (37)). Hence, we must have , so , contradicting (36).
Finally, we prove . Suppose instead that , i.e., is blocked at . Then since , we must have for some . Let be the maximal such . Then ; otherwise, if , would have been un-blocked by phase . Therefore, the blocking rule implies
(38) |
By , , and (35), we also know
so . Combined with (38), we must have for some , which contradicts Lemma 1 (since ). ∎
Finally, we can prove the theorem. Define . Then by definition, for any . Hence, because in the problem instance of the theorem, we obtain
Thus, by Claim 23 from Appendix F.1 and since by the choice , we can write
(39) |
Let be chosen later. Then implies (since ). Thus, we can write
By definition of and , along with Lemmas 2 and 3, we also know
By (7) from Section 4.1, we know . Combined with the previous three bounds, and letting denote the constant , we thus obtain
(40) |
We now consider three different cases, each with a different choice of .
- •
-
•
If , let . Then , so . Furthermore, we know , which implies
Next, observe that for this case of , so . Thus, we can choose this in (40) and combine with the above bounds to lower bound regret as .
-
•
If , choose . Then , so (40) implies .
Hence, in all three cases, we have shown for some absolute constant . This establishes the theorem.
We return to state and prove the aforementioned Claim 1. We note the analysis is rather coarse; our only goal here is to establish a scaling (not optimize constants).
Claim 1.
Under the assumptions of Theorem 1, we have , where .
Proof.
If , the bound holds by nonnegativity. If , then since and by assumption in Theorem 1, we know , which implies . Thus, only the case remains. By Claim 23 from Appendix F.1 and ,
Thus, it suffices to show that . Suppose instead that this inequality fails. Then since the left side is an integer, we have by assumption. Therefore, we can find such that (else, we violate the assumed inequality). By this choice of and the assumed inequality, we can then write
We can lower bound the right side by (else, applying Claim 20 from Appendix F.1 with , , and yields , a contradiction), which is further bounded by . Combined with the fact that rewards are deterministic, , and in Theorem 1, we obtain
(41) |
Next, let be any other arm which is active for at time . Then clearly
(42) | |||
(43) |
By (41), (43), the fact that , and the UCB policy, we conclude . Since , this further implies . But we also know that . Combining these inequalities gives , a contradiction. ∎
Appendix D Proofs from Section 5
D.1. Proof of Theorem 2
Fix an honest agent . Let , where we recall from the proof sketch that
(44) | |||
(45) |
Let be chosen later. Denote by and the suboptimal sticky and non-sticky arms, respectively, for agent . Then by Claim 23 from Appendix F.1, we can decompose regret as , where
(46) | |||
(47) |
and where whenever by convention. Thus, we have rewritten regret as the sum of four terms: , which accounts for regret before the best arm spreads; , the regret due to sticky arms after the best arm spreads; , regret from non-sticky arms after the best arm spreads but before phase ; and , regret from non-sticky arms after this phase. The subsequent lemmas bound these terms in turn.
Lemma 4.
Under the assumptions of Theorem 2, for any and , we have
Proof.
Lemma 5.
Under the assumptions of Theorem 2, for any and , we have
(48) |
Proof.
Lemma 6.
Under the assumptions of Theorem 2, for any , , and , we have
Proof.
The proof is nontrivial; we defer it to the end of this sub-appendix. ∎
Lemma 7.
Under the assumptions of Theorem 2, for any , , and , we have
Proof sketch.
The proof follows the same logic as that of (Vial et al., 2021, Lemma 4), so we omit it. The only differences are (1) we replace (the number of malicious agents connected to for the complete graph) with , and (2) we use Claim 19 from Appendix F.1 to bound the summation in (Vial et al., 2021, Lemma 4). We refer the reader to the Theorem 2 proof sketch for intuition. ∎
Additionally, we note the sum can be naively bounded as follows.
Lemma 8.
Under the assumptions of Theorem 2, for any and , we have
Proof.
The proof follows the exact same logic as that of Lemma 5, so it is omitted. ∎
We can now prove the theorem. First, we use the regret decomposition , Lemmas 4-7, and the fact that to write
(52) | ||||
(53) |
Now choose . Then
(54) | ||||
(55) |
where the second inequality is due to Lemma 4. Furthermore, by Claim 18 from Appendix F.1, we know that . Combined with , , and the choice of , we can thus write
Therefore, we can bound (52) as follows:
(56) | ||||
(57) |
where the second inequality holds by and . Combining the above yields
(58) | ||||
(59) |
Alternatively, we can simply use Lemmas 4, 5, and 8 to write
(60) |
Therefore, combining the previous two expressions and again invoking Lemma 4 to bound the additive terms in (60) by those in (58), we obtain the desired bound.
Thus, it only remains to prove Lemma 6. We begin by using some standard bandit arguments recounted in Appendix F.2 to bound in terms of a particular tail of .
Claim 2.
Under the assumptions of Theorem 2, for any , , and , we have
(61) | ||||
(62) |
Proof.
If , we can naively bound , which completes the proof. Thus, we assume (which will allow us to divide by later). For any , we first write
(63) | |||
(64) | |||
(65) |
By Corollary 6 from Appendix F.2, (63) is bounded by . By Claim 22 from Appendix F.2 and , (64) is bounded by . For (65), Claim 22 similarly gives
(66) |
which clearly implies the following bound for (65):
(67) |
For the remaining probability term, by Markov’s inequality, we have
(68) |
Hence, combining the previous two inequalities, and since , we can bound (65) by
(69) |
Finally, combining these bounds, then multiplying by , summing over , and using and , we obtain the desired result. ∎
To bound (62), we consider two cases defined in terms of the following inequalities:
(70) | |||
(71) |
Roughly speaking, when all of these inequalities hold, then is large enough to ensure that the event in (62) is unlikely. The next claim makes this precise.
Claim 3.
Proof.
If , the left side is zero and the bound is immediate, so we assume . First note that if , then since by definition and by (70), . By definition with , this further implies . Thus, when and , we have , which by definition of implies . Therefore, we can write
(73) |
Now by definition, implies that . Also by definition, implies that for some and , but . Thus,
(74) | |||
(75) |
Now fix and as in the double summation. Again using , the blocking rule implies that if , , and , then . Since , this means . Hence, there must exist some such that and . Thus, taking another union bound,
(76) | |||
(77) |
Next, note implies that for any and , we have , so by definition of . Therefore, for any ,
(78) |
Now let , , and . Then by definition and (71), respectively. Therefore, we can use Corollary 5 from Appendix F.2 to obtain
(79) |
Combining the above five inequalities, then using Claim 19 from Appendix F.1 and (71), we obtain
so multiplying both sides by completes the proof. ∎
On the other hand, when one of (70)-(71) fails, we can show that is bounded, and thus we bound the term in (62) by a constant and the probability term by , as shown in the following claim.
Claim 4.
Proof.
Finally, Lemma 6 follows by combining the previous three claims.
D.2. Proof of Corollary 1
As discussed in the proof sketch, we will couple with the noiseless process. We define this process as follows: let be i.i.d. random variables for each , and
(81) |
For the coupling, we first define
(82) |
Next, for each and , let . Note this set is nonempty, and since is a deterministic function of , which is independent of , is for each . Hence, we can set
without changing the distribution of . This results in a coupling where the noiseless process dominates the noisy one, in the following sense.
Claim 5.
For the coupling described above, for any .
Proof.
We use induction on . For , we simply have . Now assume ; we aim to show that if , then . We consider two cases, the first of which is straightforward: if , then by the inductive hypothesis, so since by definition and increases monotonically, we obtain , as desired.
For the second case, we assume and . Set and recall by definition. From the coupling above, we know and . Since in the present case, we have as well. Hence, because by the inductive hypothesis, by definition, and is increasing, we obtain . We have thus shown and , so by Definition 1. Finally, using and again appealing to monotonicity, we conclude that . ∎
We can now relate the rumor spreading times of the two processes. In particular, let (as in Definition 1) and . We then have the following.
Claim 6.
For any and , we have .
Proof.
Let and for each . Then clearly
(83) |
For the first term in (83), by definition of and Claim 5, we have
(84) |
For the second term in (83), we first observe that for any ,
where the last inequality holds by assumption on and . Thus, by the union bound, we can write
(85) | ||||
(86) |
Hence, because , we can iterate this argument to obtain that for any ,
(87) | ||||
(88) |
In particular, . Combining with (83) and (84) completes the proof. ∎
To bound the tail of , we will use the following result.
Claim 7 (Lemma 19 from (Chawla et al., 2020)).
Under the assumptions of Corollary 1, there exists an absolute constant such that, for any , we have .
Using the previous two claims, we can prove a tail bound for .
Claim 8.
Under the assumptions of Corollary 1, there exists an absolute constant such that, for any , we have , where we define
Proof.
Since is -regular with by assumption, we have as well. Therefore, setting , we know , which implies
Consequently, if we define and , we can write
(89) | ||||
(90) |
Because , , and , we also know
(91) |
Hence, ; combined with , we can apply Claim 6 to obtain
(92) |
On the other hand, (91) implies , so by definition of ,
Therefore, by Claim 7, we know that
Finally, notice that , so , thus by (91),
Hence, substituting the previous two inequalities into (92) and using completes the proof. ∎
We now bound . First, we define
Notice that for any , we have (else, we can invoke Claim 20 from Appendix F.1 with , , and to obtain , a contradiction). We write
(93) |
Now for any and , we use , the definition of , and Claim 8 to write
Therefore, for any such , we obtain
Furthermore, by Claim 18 in Appendix F.1 and , we know that
Therefore, by the previous two inequalities and (93), and since , we have shown
where the second inequality follows from Claim 19 from Appendix F.1, , and . Because is a constant, the right side is . Therefore, we have shown
Hence, by definition of , we obtain . Combining this bound with Theorem 2 completes the proof of Corollary 1.
D.3. Proof of Corollary 2
Similar to the analysis in Appendix D.1, we can use the decomposition , along with Lemmas 4 and 5, to bound regret as follows:
(94) |
Next, for each , let be the indicator that was active after . Then as in the proof of Lemma 5, we can use Claim 22 and Corollary 6 from Appendix F.2 to write
(95) |
(The only difference from the proof of Lemma 5 is that, when applying Claim 22, we write
where we can multiply by because the left side is also zero when .) Combining (95) with the definitions of and and using , we thus obtain
We claim, and will return to prove, that when ,
(96) |
Assuming (96) holds, we can combine the previous two inequalities and substitute into (94) to obtain . Bounding as in Lemma 4 yields the sharper version of Theorem 2, and further bounding as in Appendix D.2 sharpens Corollary 2.
To prove (96), we first show . Suppose instead that . Then we can find distinct arms , and corresponding phases , such that was active at phase for each . Without loss of generality, we can assume each is minimal, i.e., . We consider two cases (which are exhaustive since ) and show that both yield contradictions.
-
•
: We consider two further sub-cases.
-
–
, i.e., the best arm is sticky. Then , so are all active at phase . But all of these arms are non-sticky and only two such arms are active per phase.
-
–
. Here are both active at phase , as is (by definition of ). But since and are suboptimal, we again have three non-sticky active arms.
-
•
: We can assume (after possibly relabeling) that . Thus, by minimality of , was not active at phase but became active at , so it was recommended by some neighbor at . But since , is honest, and since , the best arm was most played for in phase , so would not have recommended .
Thus, holds. Combined with the fact that by definition, at most terms are nonzero in the summations on the left side of (96). Since by the assumed ordering of the arm means, this completes the proof.
D.4. Coarse analysis of the noisy rumor process
For completeness, we provide a coarser though more general bound for than the one derived in Appendix D.2. To begin, let and denote probability and expectation conditioned on . For each , define the random phase . Note that and . We then have the following tail bound.
Claim 9.
For any , we have .
Proof.
We use induction on . For , ensures for any , so the bound is immediate. Next, assume the bound holds for . We first write
Thus, by the inductive hypothesis, it suffices to bound the first term by . We first write
Now suppose . By Assumption 1, we can find and such that . Then for to occur, it must be the case that, for each , the event did not occur. Therefore, we have
(97) |
By the law of total expectation, we can write
(98) | |||
(99) |
Since is and is , we have
Therefore, combining the previous two expressions and iterating, we obtain
Next, we have a simple technical claim.
Claim 10.
Let . Then for any , .
Proof.
We can now bound . The analysis is similar to Appendix D.2. We first write
By the previous two claims, we know
By Claim 18 from Appendix F.1, we also have
Therefore, combining the previous three expressions, we obtain
where the second inequality uses , , and Claim 19 from Appendix F.1. Hence, we have shown . Note this bound cannot be improved in general – for example, if is a line graph, it becomes , so since , we have , which is the correct scaling (up to log terms) in Definition 1.
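For intuition about this scaling, the following small Python simulation uses a simplified push-gossip stand-in for the noisy rumor process (the dynamics and the success probability q are our own simplifications, not Definition 1's exact mechanics): each phase, every informed agent contacts one uniformly random neighbor, which becomes informed with probability q.

```python
import random

def noisy_rumor_time(adj, source, q, rng):
    """Return the first phase at which all agents are informed under a simple
    push-gossip model: each phase, every informed agent contacts one uniformly
    random neighbor, which becomes informed with probability q."""
    informed = {source}
    phase = 0
    while len(informed) < len(adj):
        phase += 1
        newly = set()
        for agent in informed:
            target = rng.choice(adj[agent])
            if target not in informed and rng.random() < q:
                newly.add(target)
        informed |= newly
    return phase

# Line graph on n agents, as in the discussion above.
n = 20
line = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}
samples = [noisy_rumor_time(line, 0, 0.5, random.Random(s)) for s in range(200)]
print(sum(samples) / len(samples))  # empirically grows roughly linearly in n on the line graph
```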
Appendix E Details from Section 7
In this appendix, we formalize the analysis that was discussed in Section 7. In particular, the subsequent five sub-appendices provide details on the respective five subsections of Section 7.
E.1. Details from Section 7.1
Let . Note that since and by assumption,
so is well-defined and . Next, for any , define
(100) |
Since , , and , we are guaranteed that and for large , so the following is well-defined:
(101) |
Also note (since ). Next, recall from Section 7.1 that
(102) |
Hence, if we let denote the possible active sets for agent (i.e., for any phase ), we can write
(103) |
Consequently, by the union bound, we obtain
(104) |
The next two claims bound the two summands on the right side.
Claim 11.
Under the assumptions of Theorem 2, for any , , and , we have
(105) |
Proof.
If , the claim is immediate. Otherwise, implies for some . Thus, by the union bound,
(106) |
Fix . Then implies (else, by definition of , ). Since (by ), we conclude , so there exists such that
(107) |
Combining and using the union bound and with by definition, we obtain
(108) |
Now fix as in the summation. Observe that since and , we have
Therefore, for any such , we can apply a basic bandit tail bound (namely, Corollary 5 from Appendix F.2 with the parameters , , and ) to obtain
Substituting into (108) and using Claim 19 from Appendix F.1 (which applies since ), we obtain
Substituting into (106) and using completes the proof. ∎
Claim 12.
Under the assumption of Theorem 2, for any , , and ,
(109) |
Proof.
By definition, we have
(110) |
As in the proof of Claim 11, we know that
(111) |
where the final inequality holds since by assumption , which implies
(112) |
Thus, implies , so by the union bound,
(113) | ||||
(114) |
Now fix as in the double summation. Then similar to the proof of Claim 11,
(115) | ||||
(116) | ||||
(117) |
where the second bound follows from applying Claim 21 from Appendix F.2 with , , , , and ; note this claim applies since by assumption ,
Next, observe that for any , by definition of , , and ,
Similar to the proof of Claim 11, we can then use Claim 19 to obtain
Combining with (113) and (115) completes the proof, since
Finally, we provide the tail for .
Lemma 9.
Under the assumptions of Theorem 2, for any ,
(118) |
E.2. Details from Section 7.2
Recall , where and . Hence, for all large , we have
(121) |
Thus, the following is well-defined:
(122) |
Now recall from Section 7.2 that , and
The next lemma provides a tail bound for this random phase.
Lemma 10.
Under the assumptions of Theorem 2, for any ,
(123) |
Proof.
We first use the union bound to write
(124) |
Fix and . Suppose holds. Then ; else, we can find such that (i.e., for ), contradicting . Hence, we have two cases: , or for some and . In the former case, ; in the latter, . Thus,
(125) | ||||
(126) |
Now given that , is sampled uniformly from a set of at most elements which includes , so . Substituting above and iterating yields
(127) |
where the final inequality uses . Combining (124) and (127) and computing a geometric series, we obtain
(128) |
Finally, using , , for any and , and , we obtain the desired bound. ∎
E.3. Details from Section 7.3
We begin with some intermediate claims.
Claim 13.
If the assumptions of Theorem 2 hold, then for any and , we have
Proof.
The first inequality holds by definition of and assumption . The second holds since is the best arm in and in the algorithm. ∎
Claim 14.
If the assumptions of Theorem 2 hold, then for any and ,
(129) |
Proof.
If or , the bound is immediate, so we assume and for the remainder of the proof. Under this assumption, there must exist a phase at which the mean of the best active arm reaches a new strict minimum since , i.e., . Let denote the number of phases at which this occurs and the (ordered) phases themselves; formally,
(130) |
The remainder of the proof relies on the following three inequalities:
(131) |
The first inequality holds since by definition, so are distinct arms; since there are of these arms and in total, . For the second, we have when and when (if the latter fails, we contradict the definition of ). For the third, note by construction, so if , the bound holds with equality; else, , so if the bound fails,
(132) |
which is a contradiction, since is the minimal element of the set at right. Hence, (131) holds. Combined with Claim 13 (note , as required), we obtain
(133) | ||||
(134) |
where the last inequality uses . ∎
As a simple corollary of the previous two claims, we have the following.
Corollary 3.
If the assumptions of Theorem 2 hold, then for any and ,
(135) |
Next, inspecting the analysis in Section 7.3, we see that for large . Thus, the following is well-defined:
(137) |
As discussed in Section 7.3, we can now show that no new accidental blocking occurs at late phases, at least among pairs of honest agents that have recently communicated.
Claim 15.
Under the assumptions of Theorem 2, if , , and for some and , then .
Proof.
Suppose instead that . Then by the algorithm,
(138) |
Since , this implies . We then observe the following:
-
•
Since , the definition of implies .
-
•
Again using , Claim 13 implies .
-
•
Since , (138) implies , so .
-
•
Since , the algorithm implies , so .
-
•
Since , Corollary 3 implies .
Stringing together these inequalities, we obtain
(139) |
i.e., , which contradicts . ∎
Finally, we prove that honest agents eventually stop blocking each other.
Lemma 11.
Under the assumptions of Theorem 2, if , , and , then for any and , .
Proof.
Fix and ; we aim to show . This clearly holds if . Otherwise, (the latest phase up to and including that blocked ) is well-defined. We consider two cases of .
The first case is . Let denote the first phase after that blocked . Combined with the definition of , the algorithm implies
(140) |
Since , the definition of implies
(141) |
so as well. Combined with by definition, holds by (140).
The second case is . By assumption, . Hence, by definition of , for some . Note that and by assumption (and by monotonicity of in the latter case). Hence, we can apply Claim 15 (with and in the claim) to obtain . This is a contradiction. ∎
E.4. Details from Section 7.4
Claim 16.
Proof.
Since conditions on , we can prove the identity separately in the cases and . The identity is immediate in the former case. For the latter, we have
(143) | ||||
(144) | ||||
(145) | ||||
(146) |
Next, note that since as , the following is well-defined:
(147) |
As in Section 7.4, we let be the agents with the best arm active at phase .
Claim 17.
Under the assumptions of Theorem 2, if and , then .
Proof.
Suppose instead that for some and . Since , we know . Hence, because by definition of , we have . Combined with , we get , which contradicts . ∎
Finally, recall and , where is the noisy rumor process from Definition 1.
Lemma 12.
Under the assumptions of Theorem 2, if , , and , then
(148) |
Proof.
Let and
(149) |
Then it suffices to prove the following:
(150) |
For the first inequality in (150), we begin by proving
(151) |
To do so, we show . Assume and hold; by definition of , we aim to show . We use induction. The base of induction () holds by , , and Claim 17. Given the inductive hypothesis , we have by the algorithm and by definition, so we again use the assumption that holds and Claim 17 to obtain . Hence, (151) holds, so
(152) |
For the second inequality in (150), we claim, and will return to prove, the following:
(153) |
Assuming (153) holds, we obtain
(154) |
Hence, it only remains to prove (153). We show by induction on when holds, for each . For , recall by Assumption 3, so . Now assume for some . Let ; we aim to show that , i.e., that . By (149), we have two cases to consider:
-
•
: By the inductive hypothesis, as well, so by definition. Combined with and Claim 17 (recall and when holds, so the claim applies), this implies , so by the algorithm.
-
•
: First observe that since and , we can apply Lemma 11 on the event (with replaced by in that lemma) to obtain . On the other hand, recall that is in Definition 1 and is for the Section 7.4 sampling, so we can realize the former as . Hence, by assumption and definition , we obtain
(155) In summary, we have shown that and . Hence, by the Section 7.4 sampling, we conclude . Let denote this honest agent. Then by the inductive hypothesis, we know that , i.e., that . By Claim 17, this implies , so by the algorithm, .
Finally, for the equality in (150), note that is independent of the randomness before . By , the fact that and are both random variables and and are both sampled uniformly from , has the same distribution as . Also, note that is independent of the randomness before . Finally, by definition of , implies that . These observations successively imply
(156) |
E.5. Details from Section 7.5
Combining the lemmas of the above sub-appendices, we can bound the tail of the spreading time.
Theorem 3.
Proof.
For any , we can use Claim 32 to obtain
(158) | |||
(159) |
In particular, implies that we can use Lemma 9 with replaced by . Combined with (158) (namely, ), this yields
(160) |
Since by (158), we can use Lemma 10 (with replaced by ) and (159) to obtain
(161) |
Furthermore, using the bounds , , and from (158) (the last of which implies , since by ), we can use Lemma 12 to get
(162) |
Finally, by the union bound, we have
(163) | ||||
(164) |
so combining the previous four inequalities yields the desired result. ∎
Finally, as a corollary, we can bound .
Corollary 4.
Proof.
We first observe
(167) |
For the first term in (167), using Claim 18 from Appendix F.1, we compute
(168) |
For the second term in (167), define the constant
(169) |
Then by Theorem 3, we have
(170) | ||||
(171) |
For (170), we simply use nonnegativity to write
(172) |
For the second term in (171), we use Claim 18 and and by the assumptions of Theorem 2 to write
(173) | |||
(174) |
Finally, combining the above bounds completes the proof. ∎
Appendix F Other proofs
F.1. Basic inequalities
In this sub-appendix, we prove some simple inequalities used frequently in the analysis.
Claim 18.
For any , we have
and for any and , we have .
Proof.
For the first pair of inequalities, observe when , and for ,
(175) |
For the second pair of inequalities, we first observe
(176) |
where the equality holds for some by the mean value theorem and the second inequality is . By analogous reasoning, one can also show , so the second pair of inequalities holds. The third inequality holds with equality when , and for , the lower bound in (176) and imply , so since and are integers, . Finally, using , , and , we can write
Claim 19.
For any and , .
Proof.
Since and , we can write
Claim 20.
For any such that , .
Proof.
Multiplying and dividing the right side of the assumed inequality by , we obtain . We can then loosen this bound to get , or . Plugging into the term of the assumed inequality yields . Raising both sides to the power establishes the first bound. The second bound follows by using . ∎
Remark 14.
We typically apply Claim 20 with constant but not. It allows us to invert inequalities of the form to obtain .
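As a cruder illustration of this kind of inversion (our own example, not Claim 20's exact constants or technique): since ln x ≤ x/e for all x > 0, an inequality of the form x ≤ a + b ln x with a, b > 0 and b < e already implies

```latex
x \;\le\; a + \frac{b}{e}\,x
\quad\Longrightarrow\quad
x\Bigl(1 - \frac{b}{e}\Bigr) \;\le\; a
\quad\Longrightarrow\quad
x \;\le\; \frac{a}{1 - b/e}.
```

This crude version requires b < e, whereas Claim 20 provides the form actually used in the analysis.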
F.2. Bandit inequalities
Next, we state and prove some basic bandit inequalities. The proof techniques are mostly modified from existing work (e.g., (Auer et al., 2002)), but we provide the bounds in forms useful for our setting.
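For orientation, the two ingredients these proofs repeatedly combine are the UCB index and Hoeffding's inequality, recalled here in generic notation (the exact constants in Algorithm 1 may differ):

```latex
\mathrm{UCB}_{k}(t) \;=\; \hat{\mu}_{k,s} + \sqrt{\frac{2\ln t}{s}}
\quad \text{(empirical mean of arm $k$ after $s$ pulls, plus an exploration bonus)},
\qquad
\mathbb{P}\bigl(\hat{\mu}_{k,s} - \mu_k \ge \varepsilon\bigr) \;\le\; e^{-2 s \varepsilon^{2}}
\quad \text{(Hoeffding's inequality, i.i.d. rewards in $[0,1]$)}.
```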
Claim 21.
Suppose that , , , and satisfy
Let be such that , i.e., . Then for any , we have
Proof.
For , let be an i.i.d. sequence distributed as , and for , let
(177) |
denote the empirical mean and UCB index of if has pulled the -th arm times before . Then Algorithm 1 implies that if and , we must have
(178) |
Next, note that if , then by monotonicity, as well. Combined with the fact that by definition, we conclude that implies . Similarly, implies (since ). Combined with (178), we obtain that if the event in the statement of the claim occurs, it must be the case that
Therefore, by the union bound, we obtain
(179) |
Now fix and as in the double summation. We claim implies
Indeed, if instead both inequalities fail, then by choice of and the assumption of the claim,
(180) | ||||
(181) |
which is a contradiction. Thus, by the union bound, Hoeffding’s inequality, and , we obtain
so plugging into (179) completes the proof. ∎
Corollary 5.
Suppose that , , and satisfy . Let be such that , i.e., . Then for any , we have
Proof.
Using by definition and applying Claim 21 with and ,
Corollary 6.
For any , , and , we have
Proof.
Claim 22.
For any , such that , and ,
Proof.
Set . Then clearly
Now suppose the right side strictly exceeds . Then since the right side is an integer, we can find times such that and . Let denote the largest such . Because occurred at least times before , we know . But since , this implies , which is a contradiction. ∎
Finally, we recall a well-known regret decomposition.
Claim 23.
The regret defined in (5) satisfies .
Proof.
See, e.g., the proof of (Lattimore and Szepesvári, 2020, Lemma 4.5). ∎
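In generic bandit notation (which may differ cosmetically from (5)), the decomposition reads

```latex
\mathbb{E}\bigl[R_T\bigr] \;=\; \sum_{k=1}^{K} \Delta_k\,\mathbb{E}\bigl[T_k(T)\bigr],
\qquad \Delta_k \;=\; \mu_1 - \mu_k,
```

where arm 1 is optimal and T_k(T) denotes the number of times arm k is played up to time T.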
F.3. Calculations for the early regret
In this sub-appendix, we assume , , , , , , and are chosen as in Theorem 2. Recall , , etc. denote constants associated with Claim that only depend on , , , , and .
Claim 24.
There exists such that and .
Proof.
This follows from the choice in Theorem 2. ∎
Claim 25.
There exists such that .
Claim 26.
There exists such that for any ,
(184) |
Proof.
By Claim 18, we can find depending only on such that whenever . Hence, for any , we know , so
(185) |
where we also used . The claim follows if we set and . ∎
Claim 27.
There exists such that for any ,
(186) |
Proof.
By Claim 26, we can find constants such that for any ,
where we also used . Furthermore, since by assumption, we can find depending only on , , , and , such that for any , we have . Combined with the previous inequality and the choice in Theorem 2, we obtain
Hence, if we set and , the second inequality in (186) holds for . Finally, define . Then and , so . Thus, for any , we know , so by Claim 18, , i.e., the first inequality in (186) holds. ∎
Claim 28.
There exists such that for any , .
Proof.
Claim 29.
There exists such that for any , .
Proof.
We first upper bound . By Claim 24, we can find such that when . Let , where is from Claim 26, and . Then , so we can find such that
where the first inequality uses Claim 26; the second inequality uses , , and Claim 18; and the equality defines . On the other hand, if , where is from Claim 27, we can find such that
where the inequality uses Claim 27 and the equality defines . Finally, by assumption , we can find such that, for any , we have . Combined with the previous two inequalities, we obtain that for and any , . Therefore, by definition of (137), we conclude that , where . Therefore, if we set , we obtain that for any , (since ) and (since ), so stringing together the inequalities, we conclude . ∎
Claim 30.
There exists such that .
Proof.
Let and , where and are the constants from Claim 26. Also define . Then by definition of (147), it suffices to show . Fix such a and suppose instead that . Since , we know , so
(187) |
Hence, because , we have , so by Claims 26 and 18, respectively,
(188) |
Rearranging and using the assumption , this implies . Hence, applying Claim 20 with , , and , we obtain
But since , we have shown , which contradicts (187). ∎
Claim 31.
There exists such that, for any ,
(189) |
Proof.
Let and , and suppose instead that (189) fails for some . Then we can write
(190) |
Since and , we know that , so by Claim 24. Since and , we also have , or . Combining these two bounds with (190), we conclude
(191) |
Applying Claim 20 with , , and , we obtain . But since , this contradicts the assumed lower bound on . ∎
Claim 32.
Define and
Then for any such that , we have
(192) | |||
(193) |
F.4. Calculations for the later regret
In this sub-appendix, we assume , , , , , , and are chosen as in Theorem 2. Recall , , etc. denote constants associated with Claim that only depend on , , , , and .
Claim 33.
There exists such that, for any ,
(194) |
Proof.
Similar to Claim 24 from Appendix F.3, we can find constants such that for any , . If , then , so , and the right side of (194) will hold for . If instead , then , so if the left side of (194) holds, we can write
Hence, applying Claim 20 with , , and , we obtain
where the last inequality holds for any . ∎
Claim 34.
There exists such that, for any ,
Proof.
Claim 35.
There exists such that, for any ,
Proof.
Fix such that . Note that if , then , so since , we must have . This implies , so the claimed bound is immediate. Hence, we assume moving forward. We first observe
where the inequalities use , the assumed upper bound on , Claim 18, and (since , , and ), respectively. Applying Claim 20 with , , and , and noting that (since , , and ), we obtain for any . Therefore, by assumption , we obtain that for any ,
Claim 36.
There exists such that, for any ,
(195) |
Proof.
We first eliminate the corner case where . In this case, one of and must hold. If the former holds, then , and if the latter holds, then , so . In both cases, we can clearly find satisfying the right side of (195).
Next, we assume and . By monotonicity, the former implies for any . For any such , by definition and , we can then write
Therefore, since by assumption in Theorem 2, we obtain
For the summation at right, we use (which implies ) and by assumption in Theorem 2, along with Claim 19, to write
Using (by assumption), we also know
Combining the previous three inequalities, we then obtain
Therefore, if the left side of (195) holds, we are guaranteed that
or, after rearranging, then using and ,
where the second inequality holds for large and the third uses and . Taking logarithms and choosing appropriately in terms of , , and yields the right side of (195). ∎