
Exploiting Heterogeneity in Robust Federated Best-Arm Identification

Aritra Mitra, George J. Pappas, and Hamed Hassani The authors are with the Department of Electrical and Systems Engineering, University of Pennsylvania. Email: {amitra20, pappasg, hassani}@seas.upenn.edu. This work was supported by NSF Award 1837253, NSF CAREER award CIF 1943064, and the Air Force Office of Scientific Research Young Investigator Program (AFOSR-YIP) under award FA9550-20-1-0111.
Abstract

We study a federated variant of the best-arm identification problem in stochastic multi-armed bandits: a set of clients, each of whom can sample only a subset of the arms, collaborate via a server to identify the best arm (i.e., the arm with the highest mean reward) with prescribed confidence. For this problem, we propose Fed-SEL, a simple communication-efficient algorithm that builds on successive elimination techniques and involves local sampling steps at the clients. To study the performance of Fed-SEL, we introduce a notion of arm-heterogeneity that captures the level of dissimilarity between distributions of arms corresponding to different clients. Interestingly, our analysis reveals the benefits of arm-heterogeneity in reducing both the sample- and communication-complexity of Fed-SEL. As a special case of our analysis, we show that for certain heterogeneous problem instances, Fed-SEL outputs the best-arm after just one round of communication. Our findings have the following key implication: unlike federated supervised learning where recent work has shown that statistical heterogeneity can lead to poor performance, one can provably reap the benefits of both local computation and heterogeneity for federated best-arm identification. As our final contribution, we develop variants of Fed-SEL, both for federated and peer-to-peer settings, that are robust to the presence of Byzantine clients, and hence suitable for deployment in harsh, adversarial environments.

1 Introduction

Federated Learning (FL) has recently emerged as a popular framework for training statistical models using diverse data available at multiple clients (mobile phones, smart devices, health organizations, etc.) [1, 2, 3, 4, 5]. Two key challenges intrinsic to FL include statistical heterogeneity that arises due to differences in the data sets of the clients, and stringent communication constraints imposed by bandwidth considerations. Several recent works [6, 7, 8, 9, 10, 11] have reported that under statistical heterogeneity, FL algorithms that perform local computations to save communication can exhibit poor performance. Notably, these observations concern the task of supervised learning, which has almost exclusively been the focus of FL thus far.

In a departure from the standard supervised learning setting, the authors in [12] recently showed that for the problem of federated clustering, one can in fact benefit from a notion of heterogeneity defined suitably for their setting. This is a rare finding that naturally raises the following question.

Are there other classes of cooperative learning problems where, unlike federated supervised learning, local computations and specific notions of heterogeneity can provably help?

In this paper, we answer the above question in the affirmative by introducing and studying a federated variant of the best-arm identification problem [13, 14, 15, 16, 17, 18, 19] in stochastic multi-armed bandits. In our model, a set of arms is partitioned into disjoint arm-sets, and each arm-set is associated with a unique client. Specifically, each client can only sample arms from its respective arm-set. This modeling assumption is consistent with the standard FL setting where each client typically has access to only a portion of a large data set; we will provide further motivation for our model shortly. The task for the clients is to coordinate via a central server and identify the best arm with high probability, i.e., with a given fixed confidence. (We provide a precise problem formulation in Sec. 2, where we explain what is meant by a "best arm".) To achieve this task, communication is clearly necessary since each client can only acquire information about its own arm-set in isolation. Our formulation naturally captures statistical heterogeneity: other than the requirement of sharing the same support, we allow distributions of arms across arm-sets to be arbitrarily different. In what follows, we briefly describe a couple of motivating applications.

Applications in Robotics and Sensor Networks: Our work is practically motivated by the diverse engineering applications (e.g., distributed sensing, estimation, localization, target-tracking, etc.) of cooperative learning [20, 21, 22, 23, 24]. For instance, consider a team of robots deployed over a large environment and tasked with detecting a chemical leakage based on readings from sensors spread out over the environment. Since it may be infeasible for a single robot to explore the entire environment, each robot is instead assigned the task of exploring a sub-region. The robots explore their respective sub-regions and coordinate via a central controller. We can abstract this application into our model by mapping robots to clients, controller to server, sensors to arms, and sub-regions to arm-sets. Based on this abstraction, detecting the sensor with the highest readings directly translates into the problem of identifying the best arm.

Applications in Recommendation Systems: As another concrete application, consider a web-recommendation system for restaurants in a city. People in the city upload ratings of restaurants to this system based on their personal experiences. Based on these ratings, the system constantly updates its scores, and provides recommendations regarding the best restaurant(s) in the city. To capture this scenario via our model, one can interpret people as clients, the web-recommendation system as the server, and restaurants as arms. Since it is somewhat unreasonable to expect that a person visits every restaurant in the city, the assumption that a client samples only a subset of the arms is natural in this context as well.

We have thus motivated the consideration of heterogeneous action spaces (arm-sets) for the clients. In contrast, as we discuss in Section 1.2, prior related works make the arguably unrealistic assumption of a common action space for all clients, i.e., it is assumed that each client can sample every arm. As far as we are aware, this is the first work to study a federated/distributed variant of the heterogeneous multi-armed bandit (MAB) best-arm identification problem. The core question posed by this problem is the following: Although a client can sample only a subset of the arms itself, how can it still learn the best arm via collaboration?

In this context, we will focus on two important themes relevant to each of the applications we described above: communication-efficiency and robustness. To be more precise, we are interested in designing best-arm identification algorithms that (i) minimize communication between the clients and the server, and (ii) are robust to the presence of clients that act adversarially. Minimizing communication is motivated by low-bandwidth considerations in the context of sensor networks, and privacy concerns in the context of recommendation systems. Moreover, for the latter application, we would ideally like to avoid scenarios where a few deliberately biased reviews culminate in a bad recommendation. The need for robustness has a similar rationale in distributed robotics and sensor networks. Having motivated the setup, we now describe our main contributions.

1.1 Contributions

The specific contributions of this paper are as follows.

\bullet Problem Formulation: Our first contribution is to introduce the federated best-arm identification problem in Section 2, and its robust variants in Sections 5 and 6. As we explained earlier in the introduction, the problems we study have diverse practical applications that span collaborative learning in sensor networks, distributed robotics, and recommendation systems.

In [12], the authors conjectured that specific notions of heterogeneity - together with careful analyses - may provide benefits for a plethora of problems in federated learning. By identifying federated best-arm identification as one such problem, our work is one of the earliest efforts to formally support this conjecture.

\bullet Algorithm: In Section 3, we propose Fed-SEL - a simple, communication-efficient algorithm that builds on successive elimination techniques [13, 14] and comprises two distinct phases. During Phase I, each client aims to identify the locally best arm in its own arm-set; this requires no communication. Phase II involves periodic, encoded communication between the clients and the server (akin to standard FL algorithms) to determine the best arm among the locally best arms. As we discuss later in the paper, having two separate phases contributes towards (i) significantly reducing communication; and (ii) tolerating adversaries.

\bullet Benefits of Heterogeneity: To analyze Fed-SEL, we introduce a notion of arm-heterogeneity in Section 4. Roughly speaking, our definition of heterogeneity captures the “closeness” between distributions of arms within an arm-set relative to that across different arm-sets. In Theorem 1, we rigorously characterize the performance of Fed-SEL. Our analysis reveals that the higher the level of heterogeneity in a given problem instance, the fewer the communication rounds it takes for Fed-SEL to output the best arm at the prescribed confidence level, i.e., Fed-SEL adapts to the level of heterogeneity. We also identify problem instances where one round of communication suffices to learn the best arm. Since fewer rounds lead to fewer arm-pulls, we conclude that heterogeneity helps reduce both the sample- and communication-complexity of Fed-SEL.

As far as we are aware, Theorem 1 is the only result to demonstrate provable benefits of heterogeneity in federated/distributed MAB problems. In the process of arriving at this result, we derive new bounds on the Lambert $W$ function [25] that may be of independent interest; see Lemma 5 in Appendix A.1.

\bullet Robustness to Byzantine adversaries: By now, there are several works that explore the challenging problem of tackling adversaries in supervised learning [26, 27, 28, 29, 30, 31, 32, 33, 34, 35]. However, there is no analogous result for the distributed MAB problem we consider here. We fill this gap by developing robust variants of Fed-SEL that output the best arm despite the presence of Byzantine adversarial clients, both in federated and peer-to-peer settings. Our robustness results, namely Theorems 2 and 3, highlight the role of information- and network structure-redundancy in combating adversaries.

Overall, for the problem of cooperative best-arm identification, our work develops simple, intuitive algorithms that are communication-efficient, benefit from heterogeneity, and are practically appealing since they can be deployed in harsh, adversarial environments.

1.2 Related Work

In what follows, we position our work in the context of relevant literature.

\bullet Multi-Armed Bandits in Centralized and Distributed Settings: Best-arm identification is a classical problem in the MAB literature and is typically studied with two distinct objectives: (i) maximizing the probability of identifying the best arm given a fixed sample budget [16, 17, 18]; and (ii) minimizing the number of arm-pulls (samples) required to identify the best arm at a given fixed confidence level [13, 14, 15]. In this paper, we focus on the fixed confidence setting; for an excellent survey on this topic, see [19].

The best-arm identification problem has also been explored in distributed homogeneous environments [36, 37] where every agent (client) can sample all the arms, i.e., each agent can identify the best arm on its own, and communication serves to reduce the sample-complexity (number of arm-pulls) at each agent. A fundamental difference between our work and [36, 37] is that communication is necessary in the heterogeneous setting we consider. Moreover, unlike the vast majority of distributed MAB formulations that focus on minimizing the cumulative regret metric [38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57], the goal of our study is best-arm identification within a PAC framework.

A couple of recent works demonstrate that in the context of multi-agent stochastic linear bandits, if the parameters of the agents share a common low-dimensional representation [53], or exhibit some other form of similarity [54], then such structure can be exploited to improve performance (as measured by regret). Thus, these works essentially show that similarity helps. Our findings complement these results by showing that depending upon the problem formulation and the performance measure, dissimilarity can also help.

\bullet Robust Multi-Armed Bandits: A recent stream of works has focused on understanding the impacts of adversarial corruptions in MAB problems, both in the case of independent arms [58, 59, 60, 61], and also for the linear bandit setting [62, 63, 64]. In these works, it is assumed that an attacker with a bounded attack budget can perturb the stochastic rewards associated with the arms, i.e., the attack is structured to a certain extent. The Byzantine attack model we consider in this work is different from the reward-corruption models in [58, 59, 60, 61, 62, 63, 64] in that it allows a malicious client to act arbitrarily. Thus, the techniques used in these prior works do not apply to our setting.

As in our work, worst-case adversarial behavior is considered in [65], where a blocking approach is used to provide regret guarantees that preserve the benefits of collaboration when the number of adversaries is small. While the authors in [65] assume a complete graph and focus on regret minimization, our analysis in Section 6 accounts for general networks, and our goal is best-arm identification, i.e., we focus on the pure exploration setting.

\bullet Federated Optimization: The standard FL setup involves periodic coordination between the server and clients to solve a distributed optimization problem [1, 2, 3, 4]. The most popular algorithm for this problem, FedAvg [2], has been extensively analyzed both in the homogeneous setting [66, 67, 68, 69, 70, 71, 72], and also under statistical heterogeneity [73, 74, 75, 6, 76]. Several more recent works have focused on improving the convergence guarantees of FedAvg by drawing on a variety of ideas: proximal methods in FedProx [77]; normalized aggregation in FedNova [78]; operator-splitting in FedSplit [10]; variance-reduction in Scaffold [11] and S-Local-SVRG [79]; and gradient-tracking in FedLin [80]. Despite these promising efforts, there is little to no theoretical evidence that justifies performing local computations - a key feature of FL algorithms - under statistical heterogeneity.

Heterogeneity can hurt: Indeed, even in simple, deterministic settings with full client participation, it is now known that FedAvg and FedProx fail to match the basic centralized guarantee of linear convergence to the global minimum for strongly-convex, smooth loss functions [6, 7, 8, 9, 10]. Moreover, for algorithms such as Scaffold [11], S-Local-SVRG [79], and FedLin [80] that do guarantee linear convergence to the true minimizer, their linear convergence rates demonstrate no apparent benefit of performing multiple local gradient steps under heterogeneity. To further exemplify this point, a lower bound is presented in [80] to complement the upper bounds in [11], [79], and [80]; see also Remark 2.2 in [79]. Thus, under heterogeneity, the overall picture is quite grim in FL.

In Section 2, we draw various interesting parallels between our formulation and the standard federated optimization setup. These parallels are particularly insightful since they provide intuition regarding why both local computations and heterogeneity can be helpful (if exploited appropriately) for the problem we study, thereby further motivating our work.

2 Problem Formulation

Our model comprises a set of arms $\mathcal{A}$, a set of clients $\mathcal{C}$, and a central server. The set $\mathcal{A}$ is partitioned into $N$ disjoint arm-sets, one corresponding to each client. In particular, client $i\in\mathcal{C}$ can only sample arms from the $i$-th arm-set, denoted by $\mathcal{A}_{i}=\{a^{(i)}_{1},a^{(i)}_{2},\ldots,a^{(i)}_{|\mathcal{A}_{i}|}\}$. Throughout, we will make the mild assumption that each arm-set contains at least two arms, i.e., $|\mathcal{A}_{i}|\geq 2, \forall i\in[N]$. We associate a notion of time with our model: in each unit of time, a client can only sample one arm. Each time client $i$ samples an arm $\ell\in\mathcal{A}_{i}$, it observes a random reward $X_{\ell}$ drawn independently of the past (actions and observations of all clients) from a distribution $D_{\ell}$ (unknown to client $i$) associated with arm $\ell$. We will use $\hat{r}_{i,\ell}(t)$ to denote the average of the first $t$ independent samples $X_{i,\ell}(s), s\in[t]$, of the reward of an arm $\ell\in\mathcal{A}_{i}$ as observed by client $i$, i.e., $\hat{r}_{i,\ell}(t)=(\sum_{s=1}^{t}X_{i,\ell}(s))/t$.

As is quite standard [13, 14, 16, 17, 18, 15, 19], we assume that all rewards lie in $[0,1]$. For each arm $\ell\in\mathcal{A}$, let $r_{\ell}$ denote the expected reward of arm $\ell$, i.e., $r_{\ell}=\mathbb{E}[X_{\ell}]$. We will use $\Delta_{\ell j}\triangleq r_{\ell}-r_{j}$ to represent the difference between the mean rewards of any two arms $\ell, j\in\mathcal{A}$. An arm with the highest expected reward in $\mathcal{A}_{i}$ (resp., in $\mathcal{A}$) will be called a locally best arm in arm-set $i$ (resp., a globally best arm in $\mathcal{A}$), and be denoted by $a^{*}_{i}$ (resp., by $a^{*}$). For simplicity of exposition, let the arms in each arm-set $\mathcal{A}_{i}$ be ordered such that $r_{a^{(i)}_{1}} > r_{a^{(i)}_{2}} > \cdots > r_{a^{(i)}_{|\mathcal{A}_{i}|}}$; thus, $a^{*}_{i}=a^{(i)}_{1}$. Without loss of generality, we assume that arm $1$ is the unique globally best arm, and that it belongs to arm-set $\mathcal{A}_{1}$, i.e., $a^{*}=a^{(1)}_{1}=1$. We can now concretely state the problem of interest.

Problem 1.

(Federated Best-Arm Identification) Given any $\delta\in(0,1)$, design an algorithm that, with probability at least $1-\delta$, terminates in finite time and outputs the globally best arm $a^{*}$, where

$$a^{*}=\operatorname*{\arg\!\max}_{i\in\{1,\ldots,N\},\,\ell\in\{1,\ldots,|\mathcal{A}_{i}|\}}\left(r_{a^{(i)}_{\ell}}=\mathbb{E}[X_{a^{(i)}_{\ell}}]\right). \quad (1)$$

In words, given a fixed confidence level $\delta$ as input, our goal is to design an algorithm that enables the clients and the server to identify the globally best arm with probability at least $1-\delta$. Following [14], we will call such an algorithm $(0,\delta)$-PAC. Since each client $i\in\mathcal{C}$ is only partially informative (i.e., it can only sample arms from its local arm-set $\mathcal{A}_{i}$), it will not, in general, be able to identify $a^{*}$ on its own. Thus, any algorithm that solves Problem 1 will necessarily require communication between the clients and the server. Given that the communication bottleneck is a major concern in FL, our focus will be to design communication-efficient best-arm identification algorithms. Later, in Section 5, we will consider a robust version of Problem 1 where the goal will be to identify the globally best arm despite the presence of adversarial clients that can act arbitrarily. We now draw some interesting parallels between the above formulation and the familiar federated optimization setup that is used to model and study supervised learning problems.
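Under full information (i.e., if all mean rewards were known), the argmax in (1) is a simple scan over every (client, arm) pair. The following snippet makes this concrete; the arm-set means and the function name are our own toy illustration, not part of the paper's formal setup:

```python
# Hypothetical toy instance: N = 3 clients, each with its own arm-set of
# mean rewards r_{a^{(i)}_l}. Values are illustrative only.
arm_sets = {
    1: [0.9, 0.3],        # client 1's arm-set
    2: [0.8, 0.7],        # client 2's arm-set
    3: [0.6, 0.2, 0.1],   # client 3's arm-set
}

def globally_best_arm(arm_sets):
    """Evaluate the argmax in (1): scan every (client i, arm index l)
    pair and return the pair attaining the highest mean reward."""
    return max(
        ((i, l) for i, arms in arm_sets.items() for l in range(len(arms))),
        key=lambda pair: arm_sets[pair[0]][pair[1]],
    )

i_star, l_star = globally_best_arm(arm_sets)
```

Of course, the whole difficulty of Problem 1 is that the means are unknown and each client can only sample its own arm-set, so this scan must be replaced by adaptive sampling and communication.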

Comparison with the standard FL setting: The canonical FL setting involves solving the following optimization problem via periodic communication between the clients and the server [1, 2, 3, 4]:

$$\min_{x\in\mathbb{R}^{d}}f(x), \quad \textrm{where} \quad f(x)=\frac{1}{N}\sum_{i=1}^{N}f_{i}(x), \quad \textrm{and} \quad f_{i}(x)=\mathbb{E}_{z_{i}\sim P_{i}}[L_{i}(x;z_{i})]. \quad (2)$$

Here, $f_{i}:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is the local loss function of client $i$, and $f(x)$ is the global objective function. Let $x^{*}_{i}=\operatorname*{\arg\!\min}_{x\in\mathbb{R}^{d}}f_{i}(x)$, and $x^{*}=\operatorname*{\arg\!\min}_{x\in\mathbb{R}^{d}}f(x)$. Under statistical heterogeneity, $P_{i}\neq P_{j}$ in general for distinct clients $i$ and $j$, and hence $f_{i}(x)$ and $f_{j}(x)$ may have different minima. A typical FL algorithm that solves for $x^{*}$ operates in rounds. Between two successive communication rounds, each client performs local training on its own data using stochastic gradient descent (SGD) or a variant thereof. As a consequence of such local training, popular FL algorithms such as FedAvg [2] and FedProx [77] exhibit a "client-drift" phenomenon: the iterates of each client $i$ drift off towards the minimizer $x^{*}_{i}$ of its own local loss function $f_{i}(x)$ [6, 7, 8, 9, 10, 11]. Under statistical heterogeneity, there is no apparent relationship between $x^{*}_{i}$ and $x^{*}$, and these points may potentially be far from each other. Thus, client-drift can cause FedAvg and FedProx to exhibit poor convergence. (By poor convergence, we mean that these algorithms either converge slowly to the true global minimum, or converge to an incorrect point, potentially fast; see [8], [9], and [80] for a detailed discussion on this topic.) More recent algorithms such as SCAFFOLD [11] and FedLin [80] try to mitigate the client-drift phenomenon by employing techniques such as variance-reduction and gradient-tracking.
However, even these more refined algorithms are unable to demonstrate any quantifiable benefits of performing multiple local steps under arbitrary statistical heterogeneity. (Indeed, in [80], the authors show that there exist simple instances involving two clients with quadratic loss functions for which performing multiple local steps yields no improvement in the convergence rate, even when gradient-tracking is employed.)
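To make the client-drift discussion concrete, the following self-contained sketch (our own toy construction, in the spirit of the two-client quadratic instances referenced above, not the specific instance from [80]) runs FedAvg with $H$ local gradient steps on two quadratic losses with different curvatures. The averaged iterates converge, but to a biased point rather than the global minimizer:

```python
# Two clients with quadratic losses f_i(x) = a_i (x - c_i)^2 / 2;
# parameters a_i, c_i are chosen arbitrarily for illustration.
A = [1.0, 10.0]    # local curvatures
C = [0.0, 1.0]     # local minimizers x_i^*
eta, H = 0.05, 10  # step size and number of local steps per round

def fedavg_round(x):
    """One FedAvg round: each client runs H deterministic gradient steps
    on its own loss starting from the shared iterate x; the server then
    averages the resulting local iterates."""
    locals_ = []
    for a, c in zip(A, C):
        xi = x
        for _ in range(H):
            xi -= eta * a * (xi - c)   # gradient of a*(x-c)^2/2 is a*(x-c)
        locals_.append(xi)
    return sum(locals_) / len(locals_)

x = 0.0
for _ in range(200):
    x = fedavg_round(x)

x_star = sum(a * c for a, c in zip(A, C)) / sum(A)  # true minimizer of f
# x has converged, but not to x_star: heterogeneity plus H > 1 local
# steps biases the fixed point of the averaged update.
```

In this toy run, the FedAvg iterates settle near 0.713 while the global minimizer is 10/11 ≈ 0.909; with H = 1, the fixed point of the same update coincides with the global minimizer.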

Algorithm 1 Local Successive Elimination at Client ii
1: Parameters: $\{\alpha_{i}(t),\beta_{i}(t)\}$, where
$\alpha_{i}(t)=\sqrt{\log\left(4|\mathcal{C}||\mathcal{A}_{i}|t^{2}/\delta\right)/t}; \quad \beta_{i}(t)=c\,\alpha_{i}(t)$
2: Set: $t=1$, $\mathcal{S}_{i}(t)=\mathcal{A}_{i}$
3: Sample each arm $\ell\in\mathcal{S}_{i}(t)$ once and update the empirical mean $\hat{r}_{i,\ell}(t)$ of arm $\ell$
4: Let $\hat{r}_{i,\max}(t)=\max_{\ell\in\mathcal{S}_{i}(t)}\hat{r}_{i,\ell}(t)$
5: Eliminate sub-optimal arms: $\mathcal{S}_{i}(t+1)=\{\ell\in\mathcal{S}_{i}(t) \ : \ \hat{r}_{i,\max}(t)-\hat{r}_{i,\ell}(t)<\beta_{i}(t)\}$
6: If $|\mathcal{S}_{i}(t+1)|>1$, then set $t=t+1$ and Go To Line 3. Else, Output $a_{i}=\mathcal{S}_{i}(t+1)$
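A minimal Python rendering of Algo. 1 for a single client may help fix ideas. The function name and interface are our own, we take the flexibility parameter $c = 1$ by default, and the deterministic sanity check at the end (arms returning their exact means, with no noise) is purely illustrative:

```python
import math

def local_successive_elimination(sample, n_arms, delta, n_clients, c=1.0):
    """Sketch of Algo. 1 for one client with rewards in [0, 1].
    `sample(l)` draws one reward of arm l. Returns the surviving arm,
    its empirical mean, and the number of epochs t."""
    active = list(range(n_arms))
    sums = [0.0] * n_arms   # running reward sums per arm
    t = 0
    while len(active) > 1:
        t += 1
        for l in active:                 # line 3: sample each active arm once
            sums[l] += sample(l)
        means = {l: sums[l] / t for l in active}
        r_max = max(means.values())      # line 4
        beta = c * math.sqrt(math.log(4 * n_clients * n_arms * t * t / delta) / t)
        # line 5: keep only arms within beta of the empirical leader
        active = [l for l in active if r_max - means[l] < beta]
    l = active[0]
    return l, sums[l] / t, t

# Deterministic sanity check: arms return their exact means, so the
# sub-optimal arm is eliminated once beta_i(t) drops below the gap 0.8.
best, est, epochs = local_successive_elimination(
    lambda l: [0.9, 0.1][l], n_arms=2, delta=0.1, n_clients=1)
```

With actual stochastic rewards, the elimination threshold $\beta_i(t)$ plays the same role, but termination time becomes random and correctness holds with probability at least $1-\delta$.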

Coming back to the MAB problem, $a^{*}_{i}$ can be thought of as the direct analogue of $x^{*}_{i}$ in the standard FL setting. We make two key observations.

  • Observation 1: Unlike the standard FL setting, there is an immediate relationship between the global optimum, namely the globally best arm $a^{*}$, and the locally best arms $\{a^{*}_{i}\}_{i\in\mathcal{C}}$: the globally best arm is the arm with the highest expected reward among the locally best arms.

  • Observation 2: Suppose the distributions $D_{\ell_{1}}$ and $D_{\ell_{2}}$ corresponding to arms $\ell_{1}\in\mathcal{A}_{i}$ and $\ell_{2}\in\mathcal{A}_{j}$ are “very different” (in an appropriate sense). Intuition dictates that this should make the task of identifying the arm with the higher mean easier, i.e., statistical heterogeneity among distributions of arms corresponding to different clients should facilitate best-arm identification.

The first observation, although simple and rather obvious, has the following important implication. Suppose each client $i$ does pure exploration locally to identify the best arm $a^{*}_{i}$ in its arm-set $\mathcal{A}_{i}$. Clearly, such local computations contribute towards the overall task of identifying the globally best arm (as they help eliminate sub-optimal arms), regardless of the level of heterogeneity in the distributions of arms across arm-sets. Thus, we immediately note a major distinction between the standard federated supervised learning setting and the federated best-arm identification problem: for the former, local steps can potentially hurt under statistical heterogeneity; for the latter, we can exploit problem structure to design local computations that always contribute towards the global task. In fact, aligning with the second observation, we will show later in Section 4 that statistical heterogeneity of arm distributions across clients can assist in minimizing communication. Building on the insights from this section, we now proceed to develop a federated best-arm identification algorithm that solves Problem 1.

3 Federated Successive Elimination

In this section, we develop an algorithm for solving Problem 1. Since minimizing communication is one of our primary concerns, let us start by asking what a client can achieve on its own without interacting with the server. To this end, consider the following example to help build intuition.

Example: Suppose there are just two clients with associated arm-sets $\mathcal{A}_{1}=\{1,2\}$ and $\mathcal{A}_{2}=\{3,4\}$. The mean rewards are $r_{1}=0.9$, $r_{2}=0.3$, $r_{3}=0.8$, and $r_{4}=0.7$. From classical MAB theory, we know that to identify the best of two arms whose mean rewards differ by a gap $\Delta$, each of the arms needs to be sampled $O(1/\Delta^{2})$ times. Thus, after sampling each of the arms in $\mathcal{A}_{1}$ for $O(1/\Delta^{2}_{12})$ times (resp., arms in $\mathcal{A}_{2}$ for $O(1/\Delta^{2}_{34})$ times), client 1 (resp., client 2) will be able to identify arm 1 (resp., arm 3) as the locally best arm in $\mathcal{A}_{1}$ (resp., $\mathcal{A}_{2}$) with high probability. However, at this stage, arm 1 will likely have been sampled far fewer times than arm 3 (since $1/\Delta^{2}_{12}\approx 3$ and $1/\Delta^{2}_{34}=100$), and hence client 1’s estimate of arm 1’s true mean will be much coarser than client 2’s estimate of arm 3’s true mean. The main message here is that simply identifying the locally best arm in each arm-set and then naively comparing across them may result in rejection of the globally best arm. Thus, we need a more informed strategy that accounts for imbalances in sampling across arm-sets. With this in mind, we develop Fed-SEL.
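A back-of-the-envelope check of the sample counts quoted in this example (the arithmetic only, not an algorithm):

```python
# With gap Delta, distinguishing two arms takes on the order of
# 1/Delta^2 pulls per arm. Plug in the example's mean rewards.
r = {1: 0.9, 2: 0.3, 3: 0.8, 4: 0.7}
delta_12 = r[1] - r[2]        # gap within arm-set A_1: 0.6
delta_34 = r[3] - r[4]        # gap within arm-set A_2: 0.1
pulls_1 = 1 / delta_12**2     # ~2.8 pulls per arm for client 1
pulls_2 = 1 / delta_34**2     # 100 pulls per arm for client 2
# Client 2 must sample roughly 36x more than client 1, so client 1's
# estimate of arm 1's mean is far coarser than client 2's of arm 3's.
```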

Algorithm 2 Federated Successive Elimination (Fed-SEL)
1: Phase I: Identifying locally best arms
2: Each client $i\in\mathcal{C}$ runs Algo. 1 on $\mathcal{A}_{i}$. If Algo. 1 terminates at client $i$, then client $i$ reports $\{a_{i},\hat{r}_{i,a_{i}}(\bar{t}_{i}),\bar{t}_{i}\}$ to the server
3:
4: Phase II: Identifying the globally best arm
5: for $k\in\{0,1,\ldots\}$ do
6: $\bullet$ Server initializes $\bar{\Psi}(0)=\mathcal{C}$ and does:
7:     For each active client $i\in\bar{\Psi}(k)$, set $\tilde{r}_{i}(k)=\hat{r}_{i,a_{i}}(\bar{t}_{i})$ if $k=0$; else if $k\geq 1$, decode $\tilde{r}_{i}(k)=\texttt{Dec}_{i,k}(\sigma_{i}(k))$, and compute
$\tilde{r}^{(U)}_{i}(k)=\tilde{r}_{i}(k)+2\alpha_{i}(\bar{t}_{i}+kH); \quad \tilde{r}^{(L)}_{i}(k)=\tilde{r}_{i}(k)-2\alpha_{i}(\bar{t}_{i}+kH)$
8:     Compute $\tilde{r}^{(L)}_{\max}(k)=\max_{i\in\bar{\Psi}(k)}\tilde{r}^{(L)}_{i}(k)$
9:     Update active client set: $\bar{\Psi}(k+1)=\{i\in\bar{\Psi}(k) \ : \ \tilde{r}^{(L)}_{\max}(k)<\tilde{r}^{(U)}_{i}(k)\}$
10:     If $|\bar{\Psi}(k+1)|>1$, broadcast the threshold $\tilde{r}^{(L)}_{\max}(k)$. Else, output $\bar{a}=a_{\bar{\Psi}(k+1)}$ and terminate
11: $\bullet$ Each active client $i\in\bar{\Psi}(k)$ does:
12:     If $\tilde{r}^{(L)}_{\max}(k)$ is received from the server and $\tilde{r}^{(L)}_{\max}(k)<\tilde{r}^{(U)}_{i}(k)$, then sample arm $a_{i}$ $H$ times and transmit $\sigma_{i}(k+1)=\texttt{Enc}_{i,k+1}\left(\hat{r}_{i,a_{i}}(\bar{t}_{i}+(k+1)H)\right)$ to the server. Else, deactivate
13: end for

Description of Fed-SEL: Our proposed algorithm Fed-SEL, outlined in Algo. 2 and depicted in Fig. 1, involves two distinct phases. In Phase I, each client $i\in\mathcal{C}$ runs a local successive elimination sub-routine (Algo. 1) in parallel on its arm-set $\mathcal{A}_{i}$ to identify the locally best arm $a^{*}_{i}$ in $\mathcal{A}_{i}$. Note that this requires no communication. Phase II then involves periodic encoded communication between the clients and the server to identify the best arm among the locally best arms. We assume that each client $i$ knows the parameters $|\mathcal{C}|$, $|\mathcal{A}_{i}|$, and $\delta$, and that the server knows $|\mathcal{C}|$, $\{|\mathcal{A}_{i}|\}_{i\in\mathcal{C}}$, and $\delta$. Additionally, we assume that the communication period $H$ is known to both the clients and the server. We now describe each of the main components of Fed-SEL in detail.

Phase I: Algo. 1 is based on the successive elimination technique [14] and proceeds in epochs. In each epoch $t$, client $i$ maintains a set of active arms $\mathcal{S}_{i}(t)\subseteq\mathcal{A}_{i}$. Each active arm $\ell\in\mathcal{S}_{i}(t)$ is sampled once, and its empirical mean $\hat{r}_{i,\ell}(t)$ is then updated. For the next epoch, an arm is retained in the active set if and only if its empirical mean is not much lower than the highest empirical mean in epoch $t$; see line 5 of Algo. 1. This process continues until only one arm is left, at which point Algo. 1 terminates at client $i$ and outputs the last remaining arm $a_{i}$. Let $\bar{t}_{i}$ represent the final epoch count when Algo. 1 terminates at client $i$. It is then easy to see that arm $a_{i}$, the output of Algo. 1 (if it terminates) at client $i$, has been sampled $\bar{t}_{i}$ times at the end of Phase I. The quantities $\{\alpha_{i}(t),\beta_{i}(t)\}$ associated with arm-set $\mathcal{A}_{i}$ are given in line 1 of Algo. 1. Here, $c$ is a flexible parameter we introduce to control the accuracy of client $i$'s estimate $\hat{r}_{i,a_{i}}(\bar{t}_{i})$.

Figure 1: Illustration of Fed-SEL. (Left) In Phase I, each client $i$ performs local computations (represented by self-loops) to identify the best arm $a^{*}_{i}$ in $\mathcal{A}_{i}$. It then transmits $\mathcal{I}_{i}=\{a_{i},\hat{r}_{i,a_{i}}(\bar{t}_{i}),\bar{t}_{i}\}$ to the server. (Right) In round $k$ of Phase II, the server broadcasts a threshold $\tilde{r}^{(L)}_{\max}(k)$ that is used by each client $i$ to determine if it is active. If active, it performs $H$ local sampling steps, and transmits an encoded version $\sigma_{i}(k+1)$ of the estimate $\hat{r}_{i,a_{i}}(\bar{t}_{i}+(k+1)H)$ of arm $a_{i}$'s mean.

Phase II: The goal of Phase II is to identify the best arm from the set $\{a_{i}\}_{i\in\mathcal{C}}$. Comparing arms across clients entails communication via the server. Since such communication is costly, we consider periodic exchanges between the clients and the server, as in standard FL algorithms. To further reduce communication, such exchanges are encoded using a novel adaptive encoding strategy that we will explain shortly. Specifically, Phase II operates in rounds, where in each round $k\in\{0,1,\ldots\}$, the server maintains a set of active clients $\bar{\Psi}(k)$: a client $i$ is deemed active if and only if arm $a_{i}$ is still under contention for being the globally best arm. To decide the latter, the server aggregates the encoded information received from clients to compute the threshold $\tilde{r}^{(L)}_{\max}(k)$ (lines 3 and 4 of Algo. 2). Computing this threshold requires some care to account for (i) imbalances in sampling across arm-sets, and (ii) errors introduced due to encoding (quantization). Having computed $\tilde{r}^{(L)}_{\max}(k)$, the server executes line 5 of Algo. 2 to eliminate sub-optimal arms from the set $\{a_{i}\}_{i\in\bar{\Psi}(k)}$. It then updates the active client set and broadcasts $\tilde{r}^{(L)}_{\max}(k)$ if $|\bar{\Psi}(k+1)|>1$.

Only the arms that remain in $\{a_{i}\}_{i\in\bar{\Psi}(k+1)}$ require further sampling by the corresponding clients. Each client $i\in\bar{\Psi}(k)$ uses the threshold $\tilde{r}^{(L)}_{\max}(k)$ to determine whether it should sample $a_{i}$ any further (i.e., whether to remain active or not). If the conditions in line 7 of Algo. 2 pass, then client $i$ samples arm $a_{i}$ $H$ times, and transmits an encoded version $\sigma_{i}(k+1)$ of $\hat{r}_{i,a_{i}}(\bar{t}_{i}+(k+1)H)$ to the server. Here, $H$ is the communication period; if $H=1$, the active clients interact with the server at every time-step of Phase II. Since we aim to reduce the number of communication rounds, we instead let each client perform multiple local sampling steps before communicating with the server, i.e., $H>1$ in our setting. (Footnote 4: The local sampling steps in line 7 of Fed-SEL can be viewed as the analog of performing multiple local gradient-descent steps in standard federated optimization algorithms.) Phase II lasts until the active client set reduces to a singleton, at which point the locally best arm of the last active client is deemed to be the globally best arm by the server. We now describe our encoding-decoding strategy.
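The server-side elimination step of a Phase II round can be sketched as follows, assuming the per-client lower and upper mean estimates (computed in lines 3-4 of Algo. 2, after decoding) are given; the function names are ours, for illustration only.

```python
def eliminate(lower, upper):
    """One server-side elimination step of Phase II: given dicts mapping each
    active client i to lower/upper estimates of arm a_i's mean, keep client i
    active iff its upper estimate exceeds the largest lower estimate."""
    threshold = max(lower.values())  # plays the role of r_tilde_max^(L)(k)
    return {i for i in lower if upper[i] > threshold}
```

For instance, with `lower = {1: 0.6, 2: 0.3}` and `upper = {1: 0.8, 2: 0.5}`, client 2's upper estimate falls below client 1's lower estimate, so the active set shrinks to `{1}` and Phase II terminates.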

Adaptive Encoding Strategy: At the end of round $k-1$, each active client $i\in\bar{\Psi}(k)$ encodes $\hat{r}_{i,a_{i}}(\bar{t}_{i}+kH)$ using a quantizer with range $[0,1]$ and bit-precision level $B_{i}(k)$, where

$$B_{i}(k)=\left\lceil\log_{2}\left(\frac{1}{\alpha_{i}(\bar{t}_{i}+kH)}\right)\right\rceil. \qquad (3)$$

Specifically, the interval $[0,1]$ is quantized into $2^{B_{i}(k)}$ bins, and each bin is assigned a unique binary symbol. Since all rewards belong to $[0,1]$, $\hat{r}_{i,a_{i}}(\bar{t}_{i}+kH)$ falls in one of these bins. Let $\sigma_{i}(k)$ be the binary representation of the index of this bin. Compactly, we write $\texttt{Enc}_{i,k}\left(\hat{r}_{i,a_{i}}(\bar{t}_{i}+kH)\right)=\sigma_{i}(k)$ to represent the above encoding operation at client $i$. The key idea in our encoding technique is that of adaptive quantization: the bit-precision level $B_{i}(k)$ is successively refined in each round based on the confidence interval $\alpha_{i}(\bar{t}_{i}+kH)$.

For correct decoding, we assume that the server is aware of the encoding strategies at the clients, and of the parameters $|\mathcal{C}|$, $\{|\mathcal{A}_{i}|\}_{i\in\mathcal{C}}$, $H$, and $\delta$. Based on this information, and the fact that each client $i$ transmits $\bar{t}_{i}$ to the server at the end of Phase I, the server can exactly compute $\alpha_{i}(\bar{t}_{i}+kH)$, and hence $B_{i}(k)$, for each $i\in\mathcal{C}$. Let $\tilde{r}_{i}(k)$ be the center of the bin associated with $\sigma_{i}(k)$. Then, we compactly represent the decoding operation at the server (corresponding to client $i$) as $\texttt{Dec}_{i,k}(\sigma_{i}(k))=\tilde{r}_{i}(k)$. Using $\tilde{r}_{i}(k)$ as a proxy for $\hat{r}_{i,a_{i}}(\bar{t}_{i}+kH)$, the server computes upper and lower estimates, namely $\tilde{r}^{(U)}_{i}(k)$ and $\tilde{r}^{(L)}_{i}(k)$, of the mean of arm $a_{i}$ in line 3 of Algo. 2.
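The encode-decode round trip over uniform bins can be sketched as follows; this is a minimal illustration of Eq. (3) with the confidence radius `alpha` passed in directly (how $\alpha_{i}(\cdot)$ is computed is specified in Algo. 1, not here).

```python
import math

def precision(alpha):
    # Bit-precision from Eq. (3): B = ceil(log2(1/alpha)).
    return math.ceil(math.log2(1.0 / alpha))

def encode(r_hat, alpha):
    """Client side: map an empirical mean in [0,1] to the index of one of
    2^B uniform bins; the index is what gets transmitted as the B-bit
    symbol sigma."""
    bins = 2 ** precision(alpha)
    return min(int(r_hat * bins), bins - 1)  # clamp r_hat = 1.0 into last bin

def decode(sigma, alpha):
    """Server side: return the center of the bin indexed by sigma."""
    bins = 2 ** precision(alpha)
    return (sigma + 0.5) / bins
```

Since the bin width $2^{-B_{i}(k)}$ is at most $\alpha$, the round-trip quantization error is at most $\alpha/2$, which is what allows the server's upper/lower estimates to absorb the encoding error.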

This completes the description of Fed-SEL. A few remarks are now in order.

Remark 1.

Fed-SEL is communication-efficient by design: Phase I incurs only one round of communication (at the end of the phase), while Phase II involves periodic communication. Moreover, in each round $k\in\{1,2,\ldots\}$ of Phase II, every client transmits encoded information about the empirical mean of only one arm from its arm-set. We note here that the encoding strategy needs to be designed carefully so as to prevent the globally best arm from being eliminated due to quantization errors; our proposed approach takes this into consideration.

Remark 2.

In our current approach, the clients transmit information without encoding only once, at the end of Phase I (i.e., when $k=0$). This is done to inform the server of the sample counts $\{\bar{t}_{i}\}_{i\in\mathcal{C}}$ needed for correct decoding during Phase II. (Footnote 5: Recall that in the encoding operation at the end of round $k-1$, the bit-precision $B_{i}(k)$ is a function of $\alpha_{i}(\bar{t}_{i}+kH)$. Other than $\bar{t}_{i}$, all other terms featuring in $\alpha_{i}(\bar{t}_{i}+kH)$ are known a priori to the server.) Below, we outline a simple way to bypass the need for transmitting the parameters $\{\bar{t}_{i}\}_{i\in\mathcal{C}}$.

At the end of Phase I, suppose the first time the server receives information from client $i$ is at time-step $t^{\prime}_{i}$. (Footnote 6: Here, we implicitly assume that there are no transmission delays, and that the server is equipped with the means to keep track of time.) Since client $i$ can sample only one arm at each time-step, the minimum number of epochs at client $i$ during Phase I is $t^{\prime}_{i}/|\mathcal{A}_{i}|\triangleq\tilde{t}_{i}$. Thus, $\tilde{t}_{i}$ provides a lower bound on $\bar{t}_{i}$, the number of times arm $a_{i}$ is sampled during Phase I. Using $\tilde{t}_{i}$ in place of $\bar{t}_{i}$, one can employ more conservative confidence intervals $\alpha_{i}(\tilde{t}_{i}+kH)$, both in line 3 of Fed-SEL and for encoding. While this strategy will also lead to identification of the best arm, it will in general require more rounds to terminate relative to Fed-SEL. As a final comment, using $O\left(\log_{2}{|\mathcal{A}_{i}|}\right)$ bits, one can encode the arm label $a_{i}$ as well. Based on the above discussion, it is easy to see that with minor modifications to our current algorithm, one can effectively encode all transmissions from the clients to the server.

4 Performance Guarantees for Fed-SEL

4.1 A Notion of Heterogeneity

In Section 2, we provided an intuitive argument that statistical heterogeneity of arm distributions across different arm-sets should facilitate the task of identifying the globally best arm. To make this argument precise, however, we need to fill in two important missing pieces: (i) defining an appropriate notion of heterogeneity for our problem, and (ii) formally establishing that our proposed algorithm Fed-SEL benefits from the notion of heterogeneity so defined. In this section, we address both these objectives. We start by introducing the concept of arm-heterogeneity indices that will play a key role in our subsequent analysis of Fed-SEL.

Figure 2: Illustration of Definition 1 via a target detection example. An environment is partitioned into 9 sub-regions (arm-sets) with 3 sensors (arms) in each sub-region. The goal is to detect a target based on the sensor readings. For this purpose, an agent (client) is assigned to each sub-region. The target is depicted by the red triangle and the sensors by blue circles. Arms $a$ and $b$ (resp., arms $c$ and $d$) are the two top arms (i.e., closest to the target) in sub-region 1 (resp., sub-region 9). The arm-heterogeneity indices $\{\sigma_{19},\sigma_{91}\}$ for $\mathcal{A}_{1}$ and $\mathcal{A}_{9}$ are as shown in the figure.
Definition 1.

(Arm-Heterogeneity Index) For any two arm-sets $\mathcal{A}_{i}$ and $\mathcal{A}_{j}$ such that $r_{a^{(i)}_{1}}\neq r_{a^{(j)}_{1}}$, the arm-heterogeneity index $\sigma_{ij}$ is defined as follows:

$$\sigma_{ij}=\frac{r_{a^{(i)}_{1}}-r_{a^{(i)}_{2}}}{|r_{a^{(i)}_{1}}-r_{a^{(j)}_{1}}|}. \qquad (4)$$

With Definition 1, we aim to capture the intuitive idea that distributions of arms within an arm-set are more similar than those across arm-sets. The arm-heterogeneity indices we have introduced quantify the extent to which this statement is true. In words, $\sigma_{ij}$ measures the closeness between the two best arms within $\mathcal{A}_{i}$ relative to the closeness between the best arms of the sets $\mathcal{A}_{i}$ and $\mathcal{A}_{j}$; here, "closeness" between two arms is quantified by the difference in their means. (Footnote 7: Recall that we use $a^{(i)}_{1}$ to denote the top arm in $\mathcal{A}_{i}$, i.e., the arm with the highest mean reward in $\mathcal{A}_{i}$.) Together, the index pair $\{\sigma_{ij},\sigma_{ji}\}$ captures the level of dissimilarity or heterogeneity between arm-sets $\mathcal{A}_{i}$ and $\mathcal{A}_{j}$; smaller values of these indices reflect more heterogeneity. To see why, notice that both $\sigma_{ij}$ and $\sigma_{ji}$ scale inversely with the gap in mean rewards of the best arms of $\mathcal{A}_{i}$ and $\mathcal{A}_{j}$. Thus, if these arms have "well-separated" mean rewards (an intuitive measure of heterogeneity), then this leads to small values of $\sigma_{ij}$ and $\sigma_{ji}$.
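As a concrete numerical illustration of Eq. (4), the index can be computed directly from the arm means (the arm-set values below are made up for illustration):

```python
def heterogeneity_index(means_i, means_j):
    """Arm-heterogeneity index sigma_ij of Definition 1, computed from the
    mean rewards of the arms in arm-sets A_i and A_j."""
    top_i, second_i = sorted(means_i, reverse=True)[:2]  # r_{a1^(i)}, r_{a2^(i)}
    top_j = max(means_j)                                 # r_{a1^(j)}
    return (top_i - second_i) / abs(top_i - top_j)

# Well-separated best arms across sets give small indices (high heterogeneity):
A1, A2 = [0.9, 0.8, 0.1], [0.4, 0.35, 0.2]
sigma_12 = heterogeneity_index(A1, A2)  # (0.9 - 0.8) / |0.9 - 0.4|, i.e., 0.2
sigma_21 = heterogeneity_index(A2, A1)  # (0.4 - 0.35) / |0.4 - 0.9|, i.e., 0.1
```

Here both indices are at most 1, which (anticipating the results below) is the regime where heterogeneity is most beneficial.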

How does heterogeneity arise in practice? Our definition of heterogeneity is motivated by scenarios where the partitioning of arms into disjoint arm-sets is not arbitrary, but rather adheres to some structure. Consider for instance the target detection application we discussed in the Introduction; see Fig. 2 for an illustration. For this setting, sensors closer to the target have higher readings than those farther away. As a result, one should expect the readings of sensor $a$ to be more similar to those of sensor $b$ (in the same sub-region) than to those of sensor $c$ in sub-region 9. This spatial heterogeneity effect is precisely captured by low values of the arm-heterogeneity index $\sigma_{19}$. One can similarly argue that spatial heterogeneity leads to low values of $\sigma_{91}$.

4.2 Main Result and Discussion

Our first main result, Theorem 1, shows that Fed-SEL is $(0,\delta)$-PAC, and provides a detailed characterization of its sample- and communication-complexity. To state the result cleanly, we will use $i_{1}$ as a shorthand for $a^{(i)}_{1}$, and employ the following notation:

$$\bar{c}_{i}=\sqrt{\frac{4|\mathcal{C}||\mathcal{A}_{i}|}{\delta}},\quad\Delta^{*}_{i}=r_{a^{(i)}_{1}}-r_{a^{(i)}_{2}},\quad\textrm{and}\quad\Delta^{*}=r_{1}-\max_{i\in\mathcal{C}\setminus\{1\}}r_{i_{1}}.$$

Thus, $\Delta^{*}$ is the difference in the mean rewards of the two best arms in $\mathcal{A}$. We remind the reader that the best arm $a^{*}=1$ belongs to the arm-set $\mathcal{A}_{1}$ of client $1$.

Theorem 1.

Suppose Fed-SEL is run with $\beta_{i}(t)=8\alpha_{i}(t),\forall i\in\mathcal{C}$. Then, with probability at least $1-\delta$, the following statements hold.

  • (Consistency) Fed-SEL terminates in finite time and outputs the globally best arm, i.e., we have $\bar{a}=a^{*}$.

  • (Communication-Complexity) The number of rounds $R_{i}$ that a client $i\in\mathcal{C}\setminus\{1\}$ remains active during Phase II is bounded above as follows: $R_{i}\leq\max\{R_{i1},R_{1i}\}$, where

    $$R_{i1}\leq\frac{1}{H}\left(\frac{128}{{(\Delta^{*}_{i})}^{2}}(\sigma^{2}_{i1}-1)\log{\frac{128\bar{c}_{i}}{\Delta^{2}_{1i_{1}}}}+\frac{256}{{(\Delta^{*}_{i})}^{2}}\log{\sigma_{i1}}+O\left(\frac{1}{\Delta^{2}_{1i_{1}}}\log\log{\frac{\bar{c}_{i}}{\Delta^{2}_{1i_{1}}}}\right)\right)+1,\quad\textrm{if}\ \sigma_{i1}>1, \qquad (5)$$

    and $R_{i1}=1$ if $\sigma_{i1}\leq 1$. Similarly,

    $$R_{1i}\leq\frac{1}{H}\left(\frac{128}{{(\Delta^{*}_{1})}^{2}}(\sigma^{2}_{1i}-1)\log{\frac{128\bar{c}_{1}}{\Delta^{2}_{1i_{1}}}}+\frac{256}{{(\Delta^{*}_{1})}^{2}}\log{\sigma_{1i}}+O\left(\frac{1}{\Delta^{2}_{1i_{1}}}\log\log{\frac{\bar{c}_{1}}{\Delta^{2}_{1i_{1}}}}\right)\right)+1,\quad\textrm{if}\ \sigma_{1i}>1, \qquad (6)$$

    and $R_{1i}=1$ if $\sigma_{1i}\leq 1$. The number of communication rounds $R$ during Phase II is then given by $R=\max_{i\in\mathcal{C}\setminus\{1\}}{R_{i}}$; client $1$ always remains active until Fed-SEL terminates, i.e., $R_{1}=R$. Moreover, each client $i\in\mathcal{C}$ transmits at most

    $$O\left(\log_{2}{\left(\frac{1}{\Delta^{*}}\right)}\right)$$

    bits of information in each round $k\in\{1,\ldots,R_{i1}-1\}$.

  • (Sample-Complexity) The total number of arm-pulls made by client $i$ is bounded above by

    $$\underbrace{O\left(\sum_{\ell\in\mathcal{A}_{i}\setminus\{i_{1}\}}\frac{\log\left(\frac{|\mathcal{C}||\mathcal{A}_{i}|}{\delta\Delta_{i_{1}\ell}}\right)}{\Delta^{2}_{i_{1}\ell}}\right)}_{\textrm{Phase-I arm-pulls}}+\underbrace{HR_{i}}_{\textrm{Phase-II arm-pulls}}. \qquad (7)$$

Since each client can only sample arms from its own arm-set, it is apparent that at least one round of communication is, in general, necessary for solving Problem 1. The next result, an immediate corollary of Theorem 1, identifies heterogeneous regimes where one round of communication is also sufficient. (Footnote 8: When $\sigma_{i1}=1$, we would ideally like the upper bound on $R_{i1}$ to evaluate to $1$. Setting $\sigma_{i1}=1$ in (5) causes the first two dominant terms in the upper bound for $R_{i1}$ to vanish, as desired. The lower-order term that remains is an artifact of our analysis.)

Corollary 1.

Suppose Fed-SEL is run with $\beta_{i}(t)=8\alpha_{i}(t),\forall i\in\mathcal{C}$. Additionally, suppose

$$\max_{i\in\mathcal{C}\setminus\{1\}}\{\sigma_{1i},\sigma_{i1}\}\leq 1.$$

Then, with probability at least $1-\delta$, Fed-SEL terminates after just one round of communication between the clients and the server, and outputs the globally best arm, i.e., $\bar{a}=a^{*}$.

The proof of Theorem 1 is deferred to Appendix A. In what follows, we discuss several important implications of Theorem 1.

$\bullet$ (Benefit of Heterogeneity) From the first two dominant terms in the expressions for $R_{i1}$ and $R_{1i}$ in equations (5) and (6), respectively, observe that lower values of the arm-heterogeneity indices translate to fewer communication rounds. Moreover, fewer rounds imply fewer arm-pulls (and hence, lower sample-complexity) during Phase II, as is evident from Eq. (7). Since low values of the arm-heterogeneity indices correspond to high heterogeneity, we conclude that heterogeneity helps reduce both the sample- and communication-complexity of Fed-SEL.

$\bullet$ (Benefit of Local Steps) Even if the level of heterogeneity is low in a given instance, one can reduce the number of communication rounds of Fed-SEL by increasing the communication period $H$; see equations (5) and (6). Based on these equations and (7), we note that $H$ shows up as an additive term in the upper bound on the number of arm-pulls made during Phase II. The key implication of this fact is that one can ramp up $H$ to significantly reduce communication while paying only a modest price in additional sample-complexity. The main reason for this stems from the observation that with Fed-SEL, more local steps always help by providing better arm-mean estimates.

$\bullet$ (Benefit of Collaboration) We have already discussed how heterogeneity assists in reducing the sample-complexity of Phase II. To focus on the number of arm-pulls made by a client $i$ during Phase I, let us now turn our attention to the first term in Eq. (7). The summation in this term is over the local arm-set $\mathcal{A}_{i}$ of client $i$, as opposed to the global arm-set $\mathcal{A}$. Since $|\mathcal{A}_{i}|$ can be much smaller than $|\mathcal{A}|$ in practice, the benefit of collaboration is immediate: each client needs to explore only a small subset of the global arm-set, i.e., the task of exploration is distributed among the clients.

$\bullet$ (Number of Bits Exchanged) Since $\Delta^{*}$ is the gap in mean rewards of the two best arms (i.e., the two arms that are hardest to distinguish) in the global arm-set $\mathcal{A}$, the quantity $\log_{2}\left(1/\Delta^{*}\right)$ essentially captures the complexity of a given instance of the MAB problem. Intuitively, one should expect the amount of information that needs to be exchanged to solve a given instance of Problem 1 to scale with the complexity of that instance, i.e., with $\log_{2}\left(1/\Delta^{*}\right)$. From Theorem 1, we note that our adaptive encoding strategy formalizes this intuition to a certain extent.

Remark 3.

From the above discussion, note that although our algorithm itself is oblivious to both the level of heterogeneity and the complexity of a given instance, the sample- and communication-complexity bounds we obtain naturally adapt to these defining features of the instance, i.e., our guarantees are instance-dependent.

We now briefly comment on the main steps in the proof of Theorem 1.

Proof Sketch for Theorem 1: We first define a "good event" $\mathcal{E}$ with measure at least $1-\delta$, on which the empirical means of all arms are within their respective confidence intervals at all times. On this event $\mathcal{E}$, we show using standard arguments that at the end of Phase I, each client $i$ successfully identifies the locally best arm $a^{*}_{i}$ in $\mathcal{A}_{i}$, i.e., $a_{i}=a^{*}_{i}=a^{(i)}_{1}$. At this stage, to exploit heterogeneity, we require the locally best arm from each arm-set to be "well-explored". Accordingly, we establish that $|\hat{r}_{i,a_{i}}(\bar{t}_{i})-r_{a^{(i)}_{1}}|\leq c/{(c-2)}^{2}\,\Delta^{*}_{i},\forall i\in\mathcal{C}$, where $c$ is the flexible design parameter in line 1 of Algo. 1. The next key piece of our analysis relates the amount of exploration already done during Phase I to that which remains to be done during Phase II for eliminating all arms in the set $\{a^{(i)}_{1}\}_{i\in\mathcal{C}\setminus\{1\}}$; this is precisely where the arm-heterogeneity indices show up in the analysis. To complete this step, we derive new upper and lower bounds on the Lambert $W$ function that may be of independent interest; see Lemma 5 in Appendix A.1.

Before moving on to the next section, we discuss how certain aspects of our algorithm and analysis can be potentially improved.

Potential Refinements: As is most often the case, there is considerable room for improvement. First, the dependence on the number of arms within the logarithmic terms of our sample-complexity bounds can be improved using a variant of the Median Elimination algorithm [13]. Second, note from (7) that for each client $i$, its sample-complexity during Phase I scales inversely with the local arm-gaps $\Delta_{i_{1}\ell}$, where $\ell\in\mathcal{A}_{i}\setminus\{i_{1}\}$, and $i_{1}=a^{(i)}_{1}$ is the locally best arm in $\mathcal{A}_{i}$. Ideally, however, we would like to achieve sample-complexity bounds that scale inversely with $\Delta_{1\ell}$, where $1\in\mathcal{A}_{1}$ is the globally best arm. For all clients other than client 1 (i.e., clients who cannot sample arm $1$ directly), this would necessarily require communication via the server. Thus, while we save on communication during Phase I, the price paid for doing so is a larger sample-complexity.

One may thus consider the alternative of merging Phases I and II. Our reasons, however, for persisting with separate phases are as follows. First, in our current approach, a client starts communicating when the best arm in its arm-set has already been well-explored. This, in turn, considerably reduces the workload for Phase II, and helps reduce communication significantly (see Corollary 1 for instance). The second important reason is somewhat more subtle: as we explain in the next section, having two separate phases aids the task of tolerating adversarial clients.

5 Robust Federated Best-Arm Identification

In this section, we will focus on the following question: Can we still hope to identify the best arm when certain clients act adversarially? To formally answer this question, we consider a Byzantine adversary model [81, 82] where adversarial clients have complete knowledge of the model, can collude among themselves, and can act arbitrarily. Let us start by building some intuition. Suppose the client associated with the arm-set containing the globally best arm $a^{*}$ is adversarial. Since this is the only client that can sample $a^{*}$, there is not much hope of identifying the best arm in this case. This simple observation reveals the need for information-redundancy: we need multiple clients to sample each arm-set to account for adversarial behavior. Accordingly, with each arm-set $\mathcal{A}_{j}$, we now associate a group of clients $\mathcal{C}_{j}$; thus, $\mathcal{C}=\cup_{j\in[N]}\mathcal{C}_{j}$. (Footnote 9: From now on, we will use the index $j$ to refer to arm-sets and client-groups, and the index $i$ to refer to clients.) We will denote the set of honest clients by $\mathcal{H}$, the set of adversarial clients by $\mathcal{J}$, and allow at most $f$ clients to be adversarial in each client-group, i.e., $|\mathcal{C}_{j}\cap\mathcal{J}|\leq f,\forall j\in[N]$. We will also assume that the server knows $f$.

Simply restricting the number of adversaries in each client-group is not enough to identify the best arm; there are additional challenges to deal with. To see this, consider a scenario where an arm $\ell\in\mathcal{A}_{j}$ has not been adequately explored by an honest client $i\in\mathcal{C}_{j}$. Due to the stochastic nature of our model, client $i$, although honest, may have a poor estimate of arm $\ell$'s true mean reward at this stage. How can we distinguish this scenario from one where an adversarial client in $\mathcal{C}_{j}$ deliberately transmits a completely incorrect estimate of arm $\ell$? In this context, one natural idea is to require each honest client to start transmitting information only when it has sufficiently explored every arm in its arm-set. This further justifies the rationale behind having two separate phases in Fed-SEL. In what follows, we describe a robust variant of Fed-SEL.

Algorithm 3 Robust Federated Successive Elimination (Robust Fed-SEL)
1: Phase I: Identifying locally best arms
2: For each $j\in[N]$, every client $i\in\mathcal{C}_{j}\cap\mathcal{H}$ runs Algo. 1 on $\mathcal{A}_{j}$ with $\beta_{j}(t)=8\alpha_{j}(t)$. If Algo. 1 terminates at client $i$, then client $i$ reports $\{a_{i},\hat{r}_{i,a_{i}}(\bar{t}_{i}),\bar{t}_{i}\}$ to the server
3:
4: Phase II: Identifying the globally best arm
5: For each $j\in[N]$, the server identifies a representative arm for $\mathcal{A}_{j}$ as $\bar{a}_{j}=\texttt{Maj\_Vot}\left(\{a_{i}\}_{i\in\mathcal{C}^{\prime}_{j}}\right)$, where $\mathcal{C}^{\prime}_{j}\subseteq\mathcal{C}_{j}$ are the first $(2f+1)$ clients from $\mathcal{C}_{j}$ to report to the server at the end of Phase I
6: Let $\bar{\mathcal{C}}_{j}=\{i\in\mathcal{C}^{\prime}_{j}:a_{i}=\bar{a}_{j}\}$ be those clients in $\mathcal{C}^{\prime}_{j}$ that report $\bar{a}_{j}$, and define $f_{j}=f-|\mathcal{C}^{\prime}_{j}|+|\bar{\mathcal{C}}_{j}|$
7: for $k\in\{0,1,\ldots\}$ do
8: $\bullet$ Server initializes $\bar{\Psi}(0)=[N]$ and does:
9:     For each active client-group $j\in\bar{\Psi}(k)$, decode $\tilde{r}_{i}(k)$ and compute the pair $\{\tilde{r}^{(U)}_{i}(k),\tilde{r}^{(L)}_{i}(k)\}$ exactly as in Fed-SEL for every client $i\in\bar{\mathcal{C}}_{j}$
10:     For each $j\in\bar{\Psi}(k)$, compute robust upper and lower estimates of the mean reward of $\bar{a}_{j}$:
$$\bar{r}^{(U)}_{j}(k)=\texttt{Trim}\left(\{\tilde{r}^{(U)}_{i}(k)\}_{i\in\bar{\mathcal{C}}_{j}},f_{j}\right)\quad\textrm{and}\quad\bar{r}^{(L)}_{j}(k)=\texttt{Trim}\left(\{\tilde{r}^{(L)}_{i}(k)\}_{i\in\bar{\mathcal{C}}_{j}},f_{j}\right)$$
11:     Compute $\bar{r}^{(L)}_{\max}(k)=\max_{j\in\bar{\Psi}(k)}\bar{r}^{(L)}_{j}(k)$
12:     Update active client-groups: $\bar{\Psi}(k+1)=\{j\in\bar{\Psi}(k):\bar{r}^{(L)}_{\max}(k)<\bar{r}^{(U)}_{j}(k)\}$
13:     If $|\bar{\Psi}(k+1)|>1$, broadcast a binary $N\times 1$ vector $\mathbf{d}_{k}$ that encodes the active client-groups. Else, output $\bar{a}=\bar{a}_{\bar{\Psi}(k+1)}$ and terminate
14: $\bullet$ For each $j\in[N]$, each client $i\in\bar{\mathcal{C}}_{j}\cap\mathcal{H}$ does:
15:     If $\mathbf{d}_{k}$ is received from the server and $\mathbf{d}_{k}(j)=1$, then sample arm $a_{i}$ $H$ times, encode $\hat{r}_{i,a_{i}}(\bar{t}_{i}+(k+1)H)$ as in Fed-SEL, and transmit the encoding $\sigma_{i}(k+1)$ to the server. Else, deactivate
16: end for

Description of Robust Fed-SEL: To identify $a^{*}$ despite adversaries, we propose Robust Fed-SEL, and outline its main steps in Algorithm 3. Like Fed-SEL, Robust Fed-SEL involves two separate phases. During Phase I, in each client-group $\mathcal{C}_{j},j\in[N]$, every honest client $i\in\mathcal{C}_{j}\cap\mathcal{H}$ independently runs Algo. 1 on $\mathcal{A}_{j}$, and then transmits $\{a_{i},\hat{r}_{i,a_{i}}(\bar{t}_{i}),\bar{t}_{i}\}$ to the server exactly as in Fed-SEL. The key operations performed by the server are as follows.

1) Majority Voting to Identify Representative Arms: At the onset of Phase II, the server waits until it has heard from at least $(2f+1)$ clients of each client-group before it starts acting. Next, for each arm-set $\mathcal{A}_{j}$, the server aims to compute a representative locally best arm $\bar{a}_{j}$. Note that adversarial clients in $\mathcal{C}_{j}$ may collude and decide to report an arm to the server whose true mean is much lower than that of the locally best arm $a^{(j)}_{1}$ in $\mathcal{A}_{j}$. Nonetheless, we should expect every honest client $i\in\mathcal{C}_{j}$ to output $a_{i}=a^{(j)}_{1}$ with high probability. Thus, if honest clients are in the majority in $\mathcal{C}_{j}$, a simple majority voting operation (denoted by Maj_Vot) at the server suffices to reveal the best arm in $\mathcal{A}_{j}$ with high probability; this is precisely the rationale behind line 2. During the remainder of Phase II, for each client-group $j\in[N]$, the server only uses the information of the subset $\bar{\mathcal{C}}_{j}\subset\mathcal{C}_{j}$ of clients who reported the representative arm $\bar{a}_{j}$.
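The Maj_Vot operation is a plain plurality vote over the $(2f+1)$ earliest reports; a minimal sketch (arm labels below are made up for illustration):

```python
from collections import Counter

def maj_vot(reported_arms):
    """Maj_Vot over the (2f+1) earliest reports from a client-group: with at
    most f adversaries among 2f+1 reports, the f+1 honest clients that agree
    on the locally best arm form a plurality, so their arm wins the vote."""
    return Counter(reported_arms).most_common(1)[0][0]

# With f = 1, the server waits for 2f+1 = 3 reports; one adversary lies:
representative = maj_vot([7, 7, 3])  # honest majority reports arm 7
```

On the high-probability event that all honest clients output the locally best arm, the vote recovers $a^{(j)}_{1}$ regardless of what the adversaries report.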

2) Trimming to Compute Robust Arm-Statistics: Phase II involves periodic communication and evolves in rounds. In each round $k$, the server maintains a set $\bar{\Psi}(k)$ of active client-groups: group $j$ is deemed active if $\bar{a}_{j}$ is still under contention for being the globally best arm. To decide whether group $j$ should remain active, the server performs certain trimming operations that we describe next. Given a non-negative integer $e$, and a set $\{m_{i}\}_{i\in[K]}$ of $K$ scalars where $K\geq 2e+1$, let $\texttt{Trim}\left(\{m_{i}\}_{i\in[K]},e\right)$ denote the operation of removing the highest $e$ and lowest $e$ numbers from $\{m_{i}\}_{i\in[K]}$, and then returning any one of the remaining numbers. Based on this Trim operator, the server computes robust upper and lower estimates, namely $\bar{r}^{(U)}_{j}(k)$ and $\bar{r}^{(L)}_{j}(k)$, of the mean of the representative arm $\bar{a}_{j},\forall j\in\bar{\Psi}(k)$; see line 6. (Footnote 10: The trimming parameter $f_{j}$ is set based on the observation that clients in $\mathcal{C}^{\prime}_{j}$ who report an arm other than $\bar{a}_{j}$ are adversarial with high probability.) Using these robust estimates, the server subsequently updates the set of active client-groups by executing lines 7-8.
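The Trim operator can be sketched directly; since the definition allows returning any one of the survivors, taking their median is one valid instantiation (our choice here, not prescribed by the text):

```python
def trim(values, e):
    """Trim operator: discard the e largest and e smallest of K >= 2e+1
    scalars, then return one of the survivors (here, their median)."""
    assert len(values) >= 2 * e + 1
    survivors = sorted(values)[e:len(values) - e]
    return survivors[len(survivors) // 2]
```

For example, `trim([0.1, 0.5, 0.52, 0.55, 99.0], 1)` returns `0.52`: a single adversarially large report (`99.0`) is discarded, so it cannot drag the robust estimate outside the range spanned by the honest values.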

If group $j$ remains active, then the server needs more refined estimates of arm $\bar{a}_{j}$ from $\bar{\mathcal{C}}_{j}$. Accordingly, the server sends out a binary $N\times 1$ vector $\mathbf{d}_{k}$ that encodes the active client-groups: $\mathbf{d}_{k}(j)=1$ if and only if $j\in\bar{\Psi}(k+1)$. Upon receiving $\mathbf{d}_{k}$, each honest client $i$ in $\bar{\mathcal{C}}_{j}$ checks whether $\mathbf{d}_{k}(j)=1$. If so, it samples arm $a_{i}=\bar{a}_{j}$ $H$ times, encodes $\hat{r}_{i,a_{i}}(\bar{t}_{i}+(k+1)H)$ as in Fed-SEL, and transmits the encoding $\sigma_{i}(k+1)$ to the server. If $\mathbf{d}_{k}$ is not received from the server, or if $\mathbf{d}_{k}(j)=0$, client $i$ does nothing further. This completes the description of Robust Fed-SEL.

Comments on Algorithm 3: Before providing formal guarantees for Robust Fed-SEL, a few comments are in order. First, note that unlike Fed-SEL, where the server transmits a threshold to enable the clients to figure out whether they are active, the server instead transmits an $N\times 1$ vector $\mathbf{d}_{k}$ in Robust Fed-SEL. The reason is as follows. Even if $\bar{r}^{(L)}_{\max}(k)\geq\bar{r}^{(U)}_{j}(k)$ for some client-group $j$, it may very well be that $\bar{r}^{(L)}_{\max}(k)<\tilde{r}^{(U)}_{i}(k)$ for some $i\in\bar{\mathcal{C}}_{j}\cap\mathcal{H}$. Thus, even if the server is able to eliminate client-group $j$ on its end, simply broadcasting the threshold $\bar{r}^{(L)}_{\max}(k)$ may not be enough for every client in $\bar{\mathcal{C}}_{j}\cap\mathcal{H}$ to realize that it no longer needs to sample arm $\bar{a}_{j}$. Note that for line 10 of Robust Fed-SEL to make sense, we assume that every client is aware of the client-group to which it belongs.

During Phase II, for each group $j$, since the server only uses the information of clients in $\bar{\mathcal{C}}_{j}$, we have not explicitly specified actions for honest clients in $\mathcal{C}_{j}\setminus\bar{\mathcal{C}}_{j}$, to avoid cluttering the exposition. A potential specification is as follows. Suppose that the first time an honest client $i\in\mathcal{C}_{j}$ hears from the server, it is still in the process of completing Phase I. Since in this case client $i$ cannot belong to $\bar{\mathcal{C}}_{j}$, it deactivates and does nothing further. In all other cases, client $i\in\mathcal{C}_{j}$ acts exactly as specified in line 10 of Robust Fed-SEL.

The main result of this section is as follows.

Theorem 2.

Suppose $|\mathcal{C}_{j}\cap\mathcal{J}|\leq f$ and $|\mathcal{C}_{j}|\geq 3f+1, \forall j\in[N]$. Then, Robust Fed-SEL is $(0,\delta)$-PAC.

Discussion: Under information-redundancy, i.e., under the condition that "enough" clients sample every arm in each arm-set ($|\mathcal{C}_{j}|\geq 3f+1, \forall j\in[N]$), Theorem 2 shows that Robust Fed-SEL is an adversarially-robust $(0,\delta)$-PAC algorithm. Despite the flurry of research on tolerating adversaries in distributed optimization/supervised learning [26, 27, 28, 29, 30, 31, 32, 33, 34, 35], we know of no analogous result in the context of MAB problems; Theorem 2 fills this gap. The sample- and communication-complexity bounds for Robust Fed-SEL are of the same order as those of Fed-SEL; hence, we do not state them explicitly here. For a discussion of this topic, see Appendix B, where we prove Theorem 2.

Remark 4.

The type of redundancy assumption we make in Theorem 2 is standard in the literature on tolerating worst-case adversarial models. Similar assumptions appear in [26, 27, 28, 29, 30, 31, 32, 33, 34, 35], where typically fewer than half of the agents (clients) are allowed to be adversarial. The reason we allow at most $1/3$ of the clients in each client-group to be adversarial (as opposed to at most $1/2$) is that some adversaries may choose not to transmit anything at all; in this case, the server may never acquire enough information from $\mathcal{C}_{j}$ to compute $\bar{a}_{j}$. Now consider a slightly weaker adversary model in which every client transmits when it is supposed to, but the information sent by clients under attack can be arbitrarily corrupted. For this setting, our results go through identically under the weaker assumption that $|\mathcal{C}_{j}\cap\mathcal{J}|\leq f$ and $|\mathcal{C}_{j}|\geq 2f+1, \forall j\in[N]$.
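To see why $2f+1$ reports suffice under the weaker (non-omitting) adversary model, consider the following trimming sketch (our illustration, not the paper's exact Trim routine): discarding the $f$ largest and $f$ smallest of at least $2f+1$ reported values guarantees that every surviving value is bracketed by honest reports.

```python
def trim_extremes(values, f):
    """Discard the f largest and f smallest reported values.

    With at most f arbitrarily corrupted entries among >= 2f+1 reports,
    each surviving value lies between two honest reports, so any
    aggregate of the survivors (e.g., their mean) is a safe estimate.
    """
    if len(values) < 2 * f + 1:
        raise ValueError("need at least 2f+1 reports to trim f outliers")
    s = sorted(values)
    return s[f:len(s) - f]
```

Under the stronger (omission-capable) model, up to $f$ clients per group may stay silent, so $3f+1$ clients are needed for the server to still gather $2f+1$ reports to trim.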

We now investigate how the ideas developed herein can be extended to the peer-to-peer setting.

6 Robust Peer-to-Peer Best-Arm Identification

Algorithm 4 Robust Networked Successive Elimination protocol at each client $i\in\mathcal{C}_{j}\cap\mathcal{H}$
1: Step 1: Updating $\mathbf{I}_{i}(j)$ and $\mathbf{R}_{i}(j)$
2: Run Algo. 1 on $\mathcal{A}_{j}$ with $\beta_{j}(t)=6\alpha_{j}(t)$. If Algo. 1 terminates, set $\mathbf{I}_{i}(j)=a_{i}$, $\mathbf{R}_{i}(j)=\hat{r}_{i,a_{i}}(\bar{t}_{i})$, and transmit $\mathbf{I}_{i}(j),\mathbf{R}_{i}(j)$ to each $k\in\mathcal{N}^{+}_{i}$
3: Step 2: Updating $\mathbf{I}_{i}(\ell)$ and $\mathbf{R}_{i}(\ell), \forall \ell\in[N]\setminus\{j\}$
4: Let $\mathcal{N}^{\prime}_{i,\ell}\subseteq\mathcal{N}_{i}$ be the first $(2f+1)$ in-neighbors of $i$ that report about arm-set $\ell$
5: Set $\mathbf{I}_{i}(\ell)=\texttt{Maj\_Vot}\left(\{\mathbf{I}_{k}(\ell)\}_{k\in\mathcal{N}^{\prime}_{i,\ell}}\right)$. Let $\bar{\mathcal{N}}_{i,\ell}=\{k\in\mathcal{N}^{\prime}_{i,\ell}:\mathbf{I}_{k}(\ell)=\mathbf{I}_{i}(\ell)\}$
6: Set $\mathbf{R}_{i}(\ell)=\texttt{Trim}\left(\{\mathbf{R}_{k}(\ell)\}_{k\in\bar{\mathcal{N}}_{i,\ell}},f_{i,\ell}\right)$, where $f_{i,\ell}=f-|\mathcal{N}^{\prime}_{i,\ell}|+|\bar{\mathcal{N}}_{i,\ell}|$. Transmit $\mathbf{I}_{i}(\ell),\mathbf{R}_{i}(\ell)$ to each $k\in\mathcal{N}^{+}_{i}$
7: Step 3: Identifying globally best arm
8: Let $p^{*}_{i}=\arg\max_{\ell\in[N]}\mathbf{R}_{i}(\ell)$. Output $\mathbf{I}_{i}(p^{*}_{i})$.

In this section, we consider the challenging setting of best-arm identification over a fully-distributed network in the presence of adversarial clients. Formally, the network is described by a static directed graph $\mathcal{G}=(\mathcal{V},\mathcal{D})$, where the vertex set $\mathcal{V}$ is composed of the set of clients, i.e., $\mathcal{V}=\mathcal{C}=\cup_{j\in[N]}\mathcal{C}_{j}$, and $\mathcal{D}$ represents the set of edges between them. An edge $(u,v)\in\mathcal{D}$ indicates that client $u$ can directly transmit information to client $v$. The set of all in-neighbors (resp., out-neighbors) of client $u$ is defined as $\mathcal{N}_{u}=\{v\in\mathcal{V}:(v,u)\in\mathcal{D}\}$ (resp., $\mathcal{N}^{+}_{u}=\{v\in\mathcal{V}:(u,v)\in\mathcal{D}\}$).

Our goal is to design a distributed algorithm that ensures that each honest client $i\in\mathcal{H}$ can eventually identify the globally best arm $a^{*}$, despite the actions of the adversarial clients $\mathcal{J}$. Clearly, we need some upper bound on the number of adversaries for the problem to make sense. To this end, we consider an $f$-local Byzantine adversary model [83, 84, 85, 86] in which there are at most $f$ adversarial clients in the in-neighborhood of each honest client, i.e., $|\mathcal{N}_{i}\cap\mathcal{J}|\leq f, \forall i\in\mathcal{H}$. Recall that clients in $\mathcal{C}_{j}$ can only sample arms from $\mathcal{A}_{j}$. Thus, to acquire information about some other arm-set $\mathcal{A}_{\ell}$ (say), we need information to diffuse from the clients in $\mathcal{C}_{\ell}$ to the clients in $\mathcal{C}_{j}$. If there is a small cut (bottleneck) in the graph separating $\mathcal{C}_{\ell}$ from $\mathcal{C}_{j}$, then such a cut can be infested by adversaries, preventing the flow of information necessary for identifying $a^{*}$. From this intuitive reasoning, note that we need sufficiently many disjoint paths linking different client groups to have any hope of tackling adversaries. The key graph-theoretic property that we exploit in this context is a notion of strong-robustness [86] adapted to our setting.

Figure 3: The figure shows a network composed of three client groups. Clients in group $j$ can sample arms in arm-set $\mathcal{A}_{j}$, where $j\in\{1,2,3\}$. Given that client 5 is the only source of information for arms in $\mathcal{A}_{2}$, if it acts adversarially, then all other clients have no way of learning about the arms in $\mathcal{A}_{2}$. This highlights the need for information-redundancy. Moreover, since clients in $\mathcal{C}_{1}$ and $\mathcal{C}_{3}$ can interact only via client 5, if client 5 is adversarial, then it can create a bottleneck between $\mathcal{C}_{1}$ and $\mathcal{C}_{3}$, making it impossible for clients in $\mathcal{C}_{1}$ (resp., $\mathcal{C}_{3}$) to learn about arms in arm-set $\mathcal{A}_{3}$ (resp., $\mathcal{A}_{1}$). This highlights the need for redundant paths in the network. The strong-robustness property in Definition 2 captures both of the requirements described above.
Definition 2.

(Strong-robustness w.r.t. $\mathcal{C}_{j}$) Given any positive integer $r$, the graph $\mathcal{G}$ is said to be strongly $r$-robust w.r.t. the client group $\mathcal{C}_{j}$ if every non-empty subset $\mathcal{B}\subseteq\mathcal{V}\setminus\mathcal{C}_{j}$ contains at least one client $i$ with $r$ in-neighbors outside $\mathcal{B}$, i.e., for every non-empty set $\mathcal{B}\subseteq\mathcal{V}\setminus\mathcal{C}_{j}$, $\exists i\in\mathcal{B}$ such that $|\mathcal{N}_{i}\setminus\mathcal{B}|\geq r$.

We note that if $\mathcal{G}$ is strongly $(3f+1)$-robust w.r.t. $\mathcal{C}_{j}$, then $|\mathcal{C}_{j}|\geq 3f+1$ (this follows by considering $\mathcal{B}=\mathcal{V}\setminus\mathcal{C}_{j}$). Thus, the above definition implicitly captures information-redundancy. At the same time, it ensures that there is adequate redundancy in the network structure to reliably disseminate information about arm-set $\mathcal{A}_{j}$ from $\mathcal{C}_{j}$ to the rest of the network. We now develop an algorithm, namely Algorithm 4, that leverages the strong-robustness property in Definition 2.
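For intuition, strong-robustness can be checked directly from Definition 2 by brute force. The snippet below is our illustrative check, exponential in the network size and meant only for small examples such as Figure 3; [86] gives a polynomial-time test.

```python
from itertools import combinations

def strongly_r_robust(in_neighbors, C_j, r):
    """Check Definition 2: G is strongly r-robust w.r.t. client group C_j.

    in_neighbors: dict mapping each client to the set of its in-neighbors.
    Returns True iff every non-empty subset B of V \\ C_j contains at
    least one client with r or more in-neighbors outside B.
    """
    candidates = [v for v in in_neighbors if v not in C_j]
    for size in range(1, len(candidates) + 1):
        for B in map(set, combinations(candidates, size)):
            # B fails if no member of B has r in-neighbors outside B
            if not any(len(in_neighbors[i] - B) >= r for i in B):
                return False
    return True
```

As a consequence of the definition, whenever this check passes for $r=3f+1$, the group $\mathcal{C}_j$ itself must contain at least $3f+1$ clients.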

Description of Algorithm 4: Each client $i$ maintains two $N$-dimensional vectors $\mathbf{I}_{i}$ and $\mathbf{R}_{i}$. The $m$-th entry of $\mathbf{I}_{i}$, namely $\mathbf{I}_{i}(m)$, stores client $i$'s estimate of the locally best arm in arm-set $\mathcal{A}_{m}$, and $\mathbf{R}_{i}(m)$ stores an estimate of the corresponding arm's mean. Consider a client $i\in\mathcal{C}_{j}\cap\mathcal{H}$. Since client $i$ can sample all arms in $\mathcal{A}_{j}$, it updates $\mathbf{I}_{i}(j)$ and $\mathbf{R}_{i}(j)$ on its own by running Algo. 1 on $\mathcal{A}_{j}$. To learn about the best arm in all arm-sets other than $\mathcal{A}_{j}$, client $i$ needs to interact with its neighbors, some of whom might be adversarial. For each $\ell\in[N]\setminus\{j\}$, the strategy employed by client $i$ to update $\mathbf{I}_{i}(\ell)$ and $\mathbf{R}_{i}(\ell)$ is very similar to that of the server in Robust Fed-SEL. In short, $i$ collects information about $\mathcal{A}_{\ell}$ from its neighbors, performs majority voting to decide $\mathbf{I}_{i}(\ell)$ (line 5), and performs a trimming operation to compute $\mathbf{R}_{i}(\ell)$ (line 6). For this operation, client $i$ only uses the estimates of those neighbors that report the majority-voted arm $\mathbf{I}_{i}(\ell)$. Once all components of $\mathbf{I}_{i}$ and $\mathbf{R}_{i}$ have been populated, client $i$ executes line 8 to identify the globally best arm.
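A minimal sketch of the neighbor-aggregation step of Algorithm 4 might look as follows. We stress that averaging the surviving values after trimming is our own choice of aggregate, made only for concreteness; the paper's exact Trim operator may differ.

```python
from collections import Counter

def aggregate_reports(reports, f):
    """One update of (I_i(l), R_i(l)) from the first 2f+1 neighbor reports.

    reports: list of (arm, reward_estimate) pairs. Majority-vote the arm
    (Maj_Vot), keep only the neighbors agreeing with the vote, then trim
    f_{i,l} = f - |N'| + |N_bar| extreme values: f_{i,l} bounds how many
    adversaries can remain among the agreeing neighbors. Averaging the
    survivors is an illustrative choice of aggregate.
    """
    assert len(reports) == 2 * f + 1
    arm, _ = Counter(a for a, _ in reports).most_common(1)[0]
    agreeing = sorted(r for a, r in reports if a == arm)
    f_rem = f - len(reports) + len(agreeing)   # f_{i,l} from line 6
    if f_rem > 0:
        agreeing = agreeing[f_rem:len(agreeing) - f_rem]
    return arm, sum(agreeing) / len(agreeing)
```

Note how the two defenses compose: a disagreeing report spends part of the adversary budget in the vote, so fewer extreme values need to be trimmed afterwards.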

We now present the main result of this section.

Theorem 3.

Suppose the following arm-heterogeneity condition holds:

$\max_{j\in[N]\setminus\{1\}}\{\sigma_{1j},\sigma_{j1}\}\leq 1.$ (8)

Moreover, suppose the adversary model is $f$-local, and $\mathcal{G}$ is strongly $(3f+1)$-robust w.r.t. $\mathcal{C}_{j}, \forall j\in[N]$. Then, with probability at least $1-\delta$, Algorithm 4 terminates in finite time, and outputs $\mathbf{I}_{i}(p^{*}_{i})=a^{*}, \forall i\in\mathcal{H}$.

Discussion: The above result is the only one we know of that provides any guarantees against adversarial attacks for MAB problems over general networks. While the strong-robustness property in Definition 2 has been exploited earlier in the context of resilient distributed state estimation [86] and hypothesis-testing/inference [87], Theorem 3 reveals that the same graph-theoretic property ends up playing a crucial role for the best-arm identification problem we consider here. (By making a connection to bootstrap percolation theory, the authors in [86] show that the strong-robustness property in Definition 2 can be checked in polynomial time.)

To isolate the subtleties and challenges associated with tolerating adversaries over a general network, we worked under an arm-heterogeneity assumption. This assumption effectively bypasses the need for Phase II, i.e., the sample-complexity of a client is given by the number of arm-pulls it makes during Phase I. We believe that the ideas presented in this paper can be extended in an appropriate way to study the networked setting without making the arm-heterogeneity assumption.

7 Simulations

In this section, we verify some of our theoretical findings on a simple simulation example. Motivated by distributed target tracking applications in cooperative learning [20, 21, 22, 23], we consider a target detection problem involving 9 static sensors distributed over an environment. The environment is divided into 3 sub-regions with 3 sensors and a mobile robot in each sub-region. Each mobile robot can access the readings of the sensors in its sub-region, and transmit information to a central controller. The goal is to detect a static target located in one of the sub-regions. We model this setting by mapping sensors to arms, sub-regions to arm-sets, and robots to clients. Each sensor displays binary readings: sensors located closer to the target have a higher probability of displaying 1. Thus, finding the "best arm" in this setting corresponds to localizing the target, and a robot visiting a sensor corresponds to a client sampling an arm. Ideally, to conserve the robots' battery life, we would like to locate the target with a small number of visits (sample-complexity), and only a few information exchanges with the controller (communication-complexity).

We assume the target is located in the first sub-region, which corresponds to arm-set $\mathcal{A}_{1}$. Arm rewards (sensor readings) are drawn from Bernoulli distributions with means $\{0.9, 0.9-0.05\sigma, 0.1\}$, $\{0.85, 0.8, 0.3\}$, and $\{0.7, 0.6, 0.5\}$ associated with the arm-sets (sub-regions) $\mathcal{A}_{1}, \mathcal{A}_{2}$, and $\mathcal{A}_{3}$, respectively. Based on the definition of arm-heterogeneity indices in Definition 1, the parameter $\sigma\in[1,15]$ captures the level of heterogeneity in the example we have described above. In particular, it is easy to verify that for each $i\in[3]$,

$r_{a^{(i)}_{1}}-r_{a^{(i)}_{2}}\leq\sigma\,|r_{a^{(i)}_{1}}-r_{a^{(j)}_{1}}|,\quad\forall j\in[3]\setminus\{i\}.$ (9)
Figure 4: Illustration of the performance of Fed-SEL. (a) Plot of the number of communication rounds versus the level of heterogeneity $\sigma$. (b) Plot of the number of Phase II arm-pulls versus the communication period $H$.

For the example we have considered, arm-heterogeneity has a very natural spatial interpretation: sensors located closer to the target output higher readings than those farther away. Note that arm 1 in $\mathcal{A}_{1}$ is the best arm, and hence closest to the target. With confidence parameter $\delta=0.1$, we implement a simplified version of Fed-SEL (Algo. 2) without any adaptive quantization. We vary the level of heterogeneity by tuning $\sigma$, and for each value of $\sigma$, perform 50 trials.
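As a sanity check, inequality (9) can be verified numerically for the Bernoulli means listed above; the sketch below does so for a few values of $\sigma$, with a small tolerance to guard against floating-point rounding.

```python
def satisfies_heterogeneity(arm_sets, sigma, tol=1e-12):
    """Check inequality (9): for each arm-set i, the local gap between the
    best and second-best mean is at most sigma times the gap between the
    best arms of arm-sets i and j, for every other arm-set j."""
    best = [max(a) for a in arm_sets]
    second = [sorted(a)[-2] for a in arm_sets]
    return all(
        best[i] - second[i] <= sigma * abs(best[i] - best[j]) + tol
        for i in range(len(arm_sets))
        for j in range(len(arm_sets)) if j != i
    )

# The simulation's reward means for a given heterogeneity level sigma
for sigma in (1, 5, 15):
    arm_sets = [[0.9, 0.9 - 0.05 * sigma, 0.1],
                [0.85, 0.8, 0.3],
                [0.7, 0.6, 0.5]]
    assert satisfies_heterogeneity(arm_sets, sigma)
```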

Main Observations: Below, we discuss our experimental findings from Figure 4.

  • From Fig. 4(a), we note a distinct trend: the number of communication rounds decreases as the level of arm-heterogeneity $\sigma$ increases, exactly as suggested by Theorem 1. Moreover, when $\sigma=1$, Fed-SEL terminates in a single round, complying with Corollary 1. We conclude that Fed-SEL does indeed benefit from the notion of heterogeneity that we have introduced.

  • From Fig. 4(a), we note that doubling the communication period $H$ roughly halves the number of communication rounds, as expected. Importantly, increasing the communication period $H$ does not cause the number of arm-pulls in Phase II to increase much; see Fig. 4(b). These observations again align with our theoretical findings in Theorem 1: since local computations always help for Fed-SEL, one can significantly reduce communication by increasing $H$, while paying little additional price in terms of sample-complexity.

8 Conclusion and Future Work

We introduced and studied the federated best-arm identification problem, where clients with disjoint arm-sets collaborate via a server to identify the globally best arm, i.e., the arm with the highest mean payoff. We defined a novel notion of arm-heterogeneity for this problem, and proved that when such heterogeneity is exploited in a principled way, one can significantly reduce the number of communication rounds required to identify the best arm (with high probability). Our finding is unique in that it reveals that, unlike in federated supervised learning, heterogeneity can provably help for best-arm identification problems in stochastic multi-armed bandits. Finally, a key advantage of our algorithmic approach is that it is robust to worst-case adversarial actions.

There are several interesting avenues of future research, as we discuss below.

  • First, we aim to derive lower-bounds on the sample-complexity for the problem we have introduced. This will shed light on the tightness of our results in Theorem 1, and inform the need for potentially better algorithms. It would also be interesting to explore alternate formulations where the communication budget is fixed a priori, and the goal is to maximize the probability of identifying the best arm subject to such a budget.

  • Perhaps the most interesting direction is to see whether the ideas developed in this paper extend to general Markov decision processes (MDPs) in the context of reinforcement learning. Specifically, we wish to investigate the trade-offs between communication-complexity and performance measures such as PAC bounds or regret for distributed sequential decision-making problems.

  • Another important direction is to explore related settings where statistical heterogeneity can provably help. Our conjecture is that for problems where the globally optimal parameter belongs to a finite set, one can establish results similar to those in this paper by using elimination-based techniques. In this context, one ripe candidate problem is that of hypothesis-testing/statistical inference where the true hypothesis belongs to a finite set of possible hypotheses. Recent results have shown that for peer-to-peer versions of this problem [88, 89], one can design principled approaches to significantly reduce communication.

  • Some recent works have shown that by either considering alternate problem formulations (such as personalized variants of FL [90, 91, 92, 93]), or by drawing on ideas from representation learning [94], one can train meaningful statistical models in FL, despite statistical heterogeneity. It may be worthwhile to investigate whether such ideas apply to the setting we consider. Finally, while we studied robustness to adversarial actions, it would also be interesting to explore robustness to other challenges, such as distribution shifts [95], or stragglers [96, 97, 80].

References

  • [1] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
  • [2] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
  • [3] Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konečnỳ, Stefano Mazzocchi, H Brendan McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
  • [4] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
  • [5] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020.
  • [6] Xiang Li, Kaixuan Huang, Wenhao Yang, Shusen Wang, and Zhihua Zhang. On the convergence of fedavg on non-iid data. arXiv preprint arXiv:1907.02189, 2019.
  • [7] Grigory Malinovskiy, Dmitry Kovalev, Elnur Gasanov, Laurent Condat, and Peter Richtarik. From local sgd to local fixed-point methods for federated learning. In International Conference on Machine Learning, pages 6692–6701. PMLR, 2020.
  • [8] Zachary Charles and Jakub Konečnỳ. On the outsized importance of learning rates in local update methods. arXiv preprint arXiv:2007.00878, 2020.
  • [9] Zachary Charles and Jakub Konečnỳ. Convergence and accuracy trade-offs in federated learning and meta-learning. In International Conference on Artificial Intelligence and Statistics, pages 2575–2583. PMLR, 2021.
  • [10] Reese Pathak and Martin J Wainwright. Fedsplit: An algorithmic framework for fast federated optimization. arXiv preprint arXiv:2005.05238, 2020.
  • [11] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.
  • [12] Don Kurian Dennis, Tian Li, and Virginia Smith. Heterogeneity for the win: One-shot federated clustering. arXiv preprint arXiv:2103.00697, 2021.
  • [13] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Pac bounds for multi-armed bandit and markov decision processes. In International Conference on Computational Learning Theory, pages 255–270. Springer, 2002.
  • [14] Eyal Even-Dar, Shie Mannor, Yishay Mansour, and Sridhar Mahadevan. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of machine learning research, 7(6), 2006.
  • [15] Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. lil’ucb: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439. PMLR, 2014.
  • [16] Jean-Yves Audibert, Sébastien Bubeck, and Rémi Munos. Best arm identification in multi-armed bandits. In COLT, pages 41–53, 2010.
  • [17] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832–1852, 2011.
  • [18] Victor Gabillon, Mohammad Ghavamzadeh, and Alessandro Lazaric. Best arm identification: A unified approach to fixed budget and fixed confidence. In NIPS-Twenty-Sixth Annual Conference on Neural Information Processing Systems, 2012.
  • [19] Kevin Jamieson and Robert Nowak. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE, 2014.
  • [20] Volkan Cevher, Marco F Duarte, and Richard G Baraniuk. Distributed target localization via spatial sparsity. In 2008 16th European Signal Processing Conference, pages 1–5. IEEE, 2008.
  • [21] Antonio Franchi, Paolo Stegagno, Maurizio Di Rocco, and Giuseppe Oriolo. Distributed target localization and encirclement with a multi-robot system. IFAC Proceedings Volumes, 43(16):151–156, 2010.
  • [22] Vladimir Savic, Henk Wymeersch, and Santiago Zazo. Belief consensus algorithms for fast distributed target tracking in wireless sensor networks. Signal Processing, 95:149–160, 2014.
  • [23] Philip Dames, Pratap Tokekar, and Vijay Kumar. Detecting, localizing, and tracking an unknown number of moving targets using a team of mobile robots. The International Journal of Robotics Research, 36(13-14):1540–1553, 2017.
  • [24] Shahin Shahrampour, Alexander Rakhlin, and Ali Jadbabaie. Distributed detection: Finite-time analysis and impact of network topology. IEEE Transactions on Automatic Control, 61(11):3256–3268, 2015.
  • [25] Robert M Corless, Gaston H Gonnet, David EG Hare, David J Jeffrey, and Donald E Knuth. On the lambertw function. Advances in Computational mathematics, 5(1):329–359, 1996.
  • [26] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1–25, 2017.
  • [27] Peva Blanchard, El Mahdi El Mhamdi, Rachid Guerraoui, and Julien Stainer. Machine learning with adversaries: Byzantine tolerant gradient descent. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 118–128, 2017.
  • [28] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5650–5659. PMLR, 2018.
  • [29] Lingjiao Chen, Zachary Charles, Dimitris Papailiopoulos, et al. Draco: Robust distributed training via redundant gradients. arXiv preprint arXiv:1803.09877, 2018.
  • [30] Dan Alistarh, Zeyuan Allen-Zhu, and Jerry Li. Byzantine stochastic gradient descent. arXiv preprint arXiv:1803.08917, 2018.
  • [31] Cong Xie, Oluwasanmi Koyejo, and Indranil Gupta. Generalized byzantine-tolerant sgd. arXiv preprint arXiv:1802.10116, 2018.
  • [32] Liping Li, Wei Xu, Tianyi Chen, Georgios B Giannakis, and Qing Ling. Rsa: Byzantine-robust stochastic aggregation methods for distributed learning from heterogeneous datasets. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1544–1551, 2019.
  • [33] Avishek Ghosh, Justin Hong, Dong Yin, and Kannan Ramchandran. Robust federated learning in a heterogeneous environment. arXiv preprint arXiv:1906.06629, 2019.
  • [34] Avishek Ghosh, Raj Kumar Maity, Swanand Kadhe, Arya Mazumdar, and Kannan Ramachandran. Communication efficient and byzantine tolerant distributed learning. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2545–2550. IEEE, 2020.
  • [35] Avishek Ghosh, Raj Kumar Maity, and Arya Mazumdar. Distributed newton can communicate less and resist byzantine workers. arXiv preprint arXiv:2006.08737, 2020.
  • [36] Eshcar Hillel, Zohar Karnin, Tomer Koren, Ronny Lempel, and Oren Somekh. Distributed exploration in multi-armed bandits. arXiv preprint arXiv:1311.0800, 2013.
  • [37] Chao Tao, Qin Zhang, and Yuan Zhou. Collaborative learning with limited interaction: Tight bounds for distributed exploration in multi-armed bandits. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), pages 126–146. IEEE, 2019.
  • [38] Keqin Liu and Qing Zhao. Distributed learning in multi-armed bandit with multiple players. IEEE Transactions on Signal Processing, 58(11):5667–5681, 2010.
  • [39] Dileep Kalathil, Naumaan Nayyar, and Rahul Jain. Decentralized learning for multiplayer multiarmed bandits. IEEE Transactions on Information Theory, 60(4):2331–2345, 2014.
  • [40] Soummya Kar, H Vincent Poor, and Shuguang Cui. Bandit problems in networks: Asymptotically efficient distributed allocation rules. In 2011 50th IEEE Conference on Decision and Control and European Control Conference, pages 1771–1778. IEEE, 2011.
  • [41] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Distributed cooperative decision-making in multiarmed bandits: Frequentist and bayesian algorithms. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 167–172. IEEE, 2016.
  • [42] Peter Landgren, Vaibhav Srivastava, and Naomi Ehrich Leonard. Distributed cooperative decision making in multi-agent multi-armed bandits. Automatica, 125:109445, 2021.
  • [43] Shahin Shahrampour, Alexander Rakhlin, and Ali Jadbabaie. Multi-armed bandits in multi-agent networks. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2786–2790. IEEE, 2017.
  • [44] Swapna Buccapatnam, Jian Tan, and Li Zhang. Information sharing in distributed stochastic bandits. In 2015 IEEE Conference on Computer Communications (INFOCOM), pages 2605–2613. IEEE, 2015.
  • [45] Ravi Kumar Kolla, Krishna Jagannathan, and Aditya Gopalan. Collaborative learning of stochastic bandits over a social network. IEEE/ACM Transactions on Networking, 26(4):1782–1795, 2018.
  • [46] Yuanhao Wang, Jiachen Hu, Xiaoyu Chen, and Liwei Wang. Distributed bandit learning: Near-optimal regret with efficient communication. arXiv preprint arXiv:1904.06309, 2019.
  • [47] Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. Social learning in multi agent multi armed bandits. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 3(3):1–35, 2019.
  • [48] David Martínez-Rubio, Varun Kanade, and Patrick Rebeschini. Decentralized cooperative stochastic bandits. arXiv preprint arXiv:1810.04468, 2018.
  • [49] Abhimanyu Dubey et al. Kernel methods for cooperative multi-agent contextual bandits. In International Conference on Machine Learning, pages 2740–2750. PMLR, 2020.
  • [50] Abhimanyu Dubey and Alex Pentland. Differentially-private federated linear bandits. arXiv preprint arXiv:2010.11425, 2020.
  • [51] Anusha Lalitha and Andrea Goldsmith. Bayesian algorithms for decentralized stochastic bandits. arXiv preprint arXiv:2010.10569, 2020.
  • [52] Ronshee Chawla, Abishek Sankararaman, Ayalvadi Ganesh, and Sanjay Shakkottai. The gossiping insert-eliminate algorithm for multi-agent bandits. In International Conference on Artificial Intelligence and Statistics, pages 3471–3481. PMLR, 2020.
  • [53] Ronshee Chawla, Abishek Sankararaman, and Sanjay Shakkottai. Multi-agent low-dimensional linear bandits. arXiv preprint arXiv:2007.01442, 2020.
  • [54] Avishek Ghosh, Abishek Sankararaman, and Kannan Ramchandran. Collaborative learning and personalization in multi-agent stochastic linear bandits. arXiv preprint arXiv:2106.08902, 2021.
  • [55] Mridul Agarwal, Vaneet Aggarwal, and Kamyar Azizzadenesheli. Multi-agent multi-armed bandits with limited communication. arXiv preprint arXiv:2102.08462, 2021.
  • [56] Zhaowei Zhu, Jingxuan Zhu, Ji Liu, and Yang Liu. Federated bandit: A gossiping approach. In Abstract Proceedings of the 2021 ACM SIGMETRICS/International Conference on Measurement and Modeling of Computer Systems, pages 3–4, 2021.
  • [57] Chengshuai Shi, Cong Shen, and Jing Yang. Federated multi-armed bandits with personalization. In International Conference on Artificial Intelligence and Statistics, pages 2917–2925. PMLR, 2021.
  • [58] Kwang-Sung Jun, Lihong Li, Yuzhe Ma, and Xiaojin Zhu. Adversarial attacks on stochastic bandits. arXiv preprint arXiv:1810.12188, 2018.
  • [59] Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits. In International Conference on Machine Learning, pages 4042–4050. PMLR, 2019.
  • [60] Thodoris Lykouris, Vahab Mirrokni, and Renato Paes Leme. Stochastic bandits robust to adversarial corruptions. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, pages 114–122, 2018.
  • [61] Anupam Gupta, Tomer Koren, and Kunal Talwar. Better algorithms for stochastic bandits with adversarial corruptions. In Conference on Learning Theory, pages 1562–1578. PMLR, 2019.
  • [62] Ilija Bogunovic, Andreas Krause, and Jonathan Scarlett. Corruption-tolerant gaussian process bandit optimization. In International Conference on Artificial Intelligence and Statistics, pages 1071–1081. PMLR, 2020.
  • [63] Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, and Matteo Pirotta. Adversarial attacks on linear contextual bandits. arXiv preprint arXiv:2002.03839, 2020.
  • [64] Ilija Bogunovic, Arpan Losalka, Andreas Krause, and Jonathan Scarlett. Stochastic linear bandits robust to adversarial attacks. In International Conference on Artificial Intelligence and Statistics, pages 991–999. PMLR, 2021.
  • [65] Daniel Vial, Sanjay Shakkottai, and R Srikant. Robust multi-agent multi-armed bandits. In Proceedings of the Twenty-second International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pages 161–170, 2021.
  • [66] Sebastian U Stich. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767, 2018.
  • [67] Jianyu Wang and Gauri Joshi. Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms. arXiv preprint arXiv:1808.07576, 2018.
  • [68] Artin Spiridonoff, Alex Olshevsky, and Ioannis Ch Paschalidis. Local SGD with a communication overhead depending only on the number of workers. arXiv preprint arXiv:2006.02582, 2020.
  • [69] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. In International Conference on Artificial Intelligence and Statistics, pages 2021–2031. PMLR, 2020.
  • [70] Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe. Local SGD with periodic averaging: Tighter analysis and adaptive synchronization. In Advances in Neural Information Processing Systems, pages 11082–11094, 2019.
  • [71] Farzin Haddadpour, Mohammad Mahdi Kamani, Aryan Mokhtari, and Mehrdad Mahdavi. Federated learning with compression: Unified analysis and sharp guarantees. In International Conference on Artificial Intelligence and Statistics, pages 2350–2358. PMLR, 2021.
  • [72] Blake Woodworth, Kumar Kshitij Patel, Sebastian U Stich, Zhen Dai, Brian Bullins, H Brendan McMahan, Ohad Shamir, and Nathan Srebro. Is local SGD better than minibatch SGD? arXiv preprint arXiv:2002.07839, 2020.
  • [73] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. First analysis of local GD on heterogeneous data. arXiv preprint arXiv:1909.04715, 2019.
  • [74] Ahmed Khaled, Konstantin Mishchenko, and Peter Richtárik. Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics, pages 4519–4529. PMLR, 2020.
  • [75] Farzin Haddadpour and Mehrdad Mahdavi. On the convergence of local descent methods in federated learning. arXiv preprint arXiv:1910.14425, 2019.
  • [76] Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian U Stich. A unified theory of decentralized SGD with changing topology and local updates. arXiv preprint arXiv:2003.10422, 2020.
  • [77] Anit Kumar Sahu, Tian Li, Maziar Sanjabi, Manzil Zaheer, Ameet Talwalkar, and Virginia Smith. On the convergence of federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 3, 2018.
  • [78] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. arXiv preprint arXiv:2007.07481, 2020.
  • [79] Eduard Gorbunov, Filip Hanzely, and Peter Richtárik. Local SGD: Unified theory and new efficient methods. In International Conference on Artificial Intelligence and Statistics, pages 3556–3564. PMLR, 2021.
  • [80] Aritra Mitra, Rayana Jaafar, George J Pappas, and Hamed Hassani. Achieving linear convergence in federated learning under objective and systems heterogeneity. arXiv preprint arXiv:2102.07053, 2021.
  • [81] Marshall Pease, Robert Shostak, and Leslie Lamport. Reaching agreement in the presence of faults. Journal of the ACM (JACM), 27(2):228–234, 1980.
  • [82] Danny Dolev, Nancy A Lynch, Shlomit S Pinter, Eugene W Stark, and William E Weihl. Reaching approximate agreement in the presence of faults. Journal of the ACM (JACM), 33(3):499–516, 1986.
  • [83] C.Y. Koo. Broadcast in radio networks tolerating Byzantine adversarial behavior. In Proc. of the ACM Symp. on Principles of Distributed Comp., pages 275–282. ACM, 2004.
  • [84] A. Pelc and D. Peleg. Broadcasting with locally bounded Byzantine faults. Info. Proc. Letters, 93(3):109–115, 2005.
  • [85] H. J. LeBlanc, H. Zhang, X. Koutsoukos, and S. Sundaram. Resilient asymptotic consensus in robust networks. IEEE Journal on Selected Areas in Comm., 31(4):766–781, 2013.
  • [86] Aritra Mitra and Shreyas Sundaram. Byzantine-resilient distributed observers for LTI systems. Automatica, 108:108487, 2019.
  • [87] Aritra Mitra, John A Richards, and Shreyas Sundaram. A new approach to distributed hypothesis testing and non-Bayesian learning: Improved learning rate and Byzantine-resilience. IEEE Transactions on Automatic Control, 2020.
  • [88] Aritra Mitra, John A Richards, and Shreyas Sundaram. A communication-efficient algorithm for exponentially fast non-Bayesian learning in networks. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 8347–8352. IEEE, 2019.
  • [89] Aritra Mitra, John A Richards, Saurabh Bagchi, and Shreyas Sundaram. Distributed inference with sparse and quantized communication. IEEE Transactions on Signal Processing, 2021.
  • [90] Yihan Jiang, Jakub Konečnỳ, Keith Rush, and Sreeram Kannan. Improving federated learning personalization via model agnostic meta learning. arXiv preprint arXiv:1909.12488, 2019.
  • [91] Sen Lin, Guang Yang, and Junshan Zhang. A collaborative learning framework via federated meta-learning. In 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), pages 289–299. IEEE, 2020.
  • [92] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized federated learning: A meta-learning approach. arXiv preprint arXiv:2002.07948, 2020.
  • [93] Filip Hanzely, Slavomír Hanzely, Samuel Horváth, and Peter Richtárik. Lower bounds and optimal algorithms for personalized federated learning. arXiv preprint arXiv:2010.02372, 2020.
  • [94] Liam Collins, Hamed Hassani, Aryan Mokhtari, and Sanjay Shakkottai. Exploiting shared representations for personalized federated learning. arXiv preprint arXiv:2102.07078, 2021.
  • [95] Amirhossein Reisizadeh, Farzan Farnia, Ramtin Pedarsani, and Ali Jadbabaie. Robust federated learning: The case of affine distribution shifts. arXiv preprint arXiv:2006.08907, 2020.
  • [96] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. Advances in Neural Information Processing Systems, 33, 2020.
  • [97] Amirhossein Reisizadeh, Isidoros Tziotis, Hamed Hassani, Aryan Mokhtari, and Ramtin Pedarsani. Straggler-resilient federated learning: Leveraging the interplay between statistical accuracy and system heterogeneity. arXiv preprint arXiv:2012.14453, 2020.
  • [98] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer, 1994.
  • [99] Ioannis Chatzigeorgiou. Bounds on the lambert function and their application to the outage analysis of user cooperation. IEEE Communications Letters, 17(8):1505–1508, 2013.

Appendix A Proof of Theorem 1

The goal of this section is to prove Theorem 1. Throughout the proof, to simplify notation, we will use $i_{1}$ as a shorthand for the best arm $a^{(i)}_{1}$ in arm-set $\mathcal{A}_{i}$. Also, for any arm $\ell\in\mathcal{A}_{i}$, we will use $T_{\ell}$ to denote the number of pulls made to arm $\ell$ by client $i$ during the course of operation of Fed-SEL. Before diving into the formal argument, we first outline the main steps of the proof.

  • Step 1: The first step of our analysis is to define a “good event” $\mathcal{E}$, and show that it occurs with probability at least $1-\delta$, where $\delta\in(0,1)$ is the given confidence parameter. On this good event, the empirical estimates of all arms are within appropriate confidence intervals (as given by the $\alpha$'s) of the true means. See Lemma 1.

  • Step 2: On $\mathcal{E}$, we show that the local subroutine (Algo. 1) at each client $i\in\mathcal{C}$ terminates in finite time and outputs the locally best arm $a^{(i)}_{1}$ in $\mathcal{A}_{i}$, i.e., $a_{i}=a^{(i)}_{1},\forall i\in\mathcal{C}$. This step follows from standard arguments [13, 14]; we provide a proof only for completeness. See Lemma 2.

  • Step 3: The next key step of our analysis is to bound the error of each client $i$'s estimate $\hat{r}_{i,a_{i}}(\bar{t}_{i})$ of the true mean $r_{i_{1}}$ at the end of Phase I, as a function of the local arm-gap $\Delta^{*}_{i}=r_{i_{1}}-r_{i_{2}}$. This analysis, carried out in Lemma 3, reveals that the locally best arm from each arm-set is “well-explored” at the end of Phase I.

  • Step 4: The final piece of the analysis is to relate the amount of exploration done during Phase I to that which remains to be done during Phase II for eliminating each client in the set $\mathcal{C}\setminus\{1\}$. Specifically, for each client $i\in\mathcal{C}\setminus\{1\}$, we seek a lower bound on the number of times arm $i_{1}$ has been sampled during Phase I, and an upper bound on the number of times it still needs to be sampled during Phase II before being deemed sub-optimal. See Lemma 4. To complete this step, we derive new bounds on the Lambert $W$ function in Lemma 5 of Appendix A.1. These bounds may be of independent interest.

We now proceed to establish each of the above steps, starting with Step 1.

Lemma 1.

Consider the event $\mathcal{E}$ defined as follows:

\[\mathcal{E}\triangleq\bigcap_{i\in\mathcal{C}}\bigcap_{\ell\in\mathcal{A}_{i}}\bigcap_{t\in[T_{\ell}]}\{|\hat{r}_{i,\ell}(t)-r_{\ell}|\leq\alpha_{i}(t)\}. \tag{10}\]

Then, it holds that $\mathbb{P}(\mathcal{E})\geq 1-\delta$.

Proof.

We can assume that the sequence of rewards for each arm is drawn before the algorithm is initiated. That way, for any $t\in\{1,2,\ldots\}$, the empirical estimate of the mean of an arm $\ell$ after $t$ pulls is well defined, even if the arm has not actually been pulled $t$ times by a client. Applying the union bound gives us

\begin{align}
\mathbb{P}(\mathcal{E}^{c}) &\leq \sum_{i\in\mathcal{C}}\sum_{\ell\in\mathcal{A}_{i}}\sum_{t=1}^{\infty}\mathbb{P}\left(|\hat{r}_{i,\ell}(t)-r_{\ell}|>\alpha_{i}(t)\right) \tag{11}\\
&\leq \sum_{i\in\mathcal{C}}\sum_{\ell\in\mathcal{A}_{i}}\sum_{t=1}^{\infty}2\exp(-2t\alpha^{2}_{i}(t))\\
&\leq \sum_{i\in\mathcal{C}}\sum_{\ell\in\mathcal{A}_{i}}\sum_{t=1}^{\infty}\frac{\delta}{2|\mathcal{C}||\mathcal{A}_{i}|t^{2}}\\
&\leq \sum_{i\in\mathcal{C}}\sum_{\ell\in\mathcal{A}_{i}}\frac{\delta}{2|\mathcal{C}||\mathcal{A}_{i}|}\left(1+\int_{1}^{\infty}\frac{1}{z^{2}}\,dz\right)\\
&= \sum_{i\in\mathcal{C}}\sum_{\ell\in\mathcal{A}_{i}}\frac{\delta}{|\mathcal{C}||\mathcal{A}_{i}|}=\delta.
\end{align}

The second step above follows from an application of Hoeffding’s inequality [98]. ∎
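As a quick numerical sanity check on the integral-comparison step above, the following standalone snippet (not part of Fed-SEL) verifies that $\sum_{t=1}^{\infty}1/t^{2}=\pi^{2}/6\leq 1+\int_{1}^{\infty}z^{-2}\,dz=2$, which is the bound used in the second-to-last inequality:

```python
import math

# Partial sum of sum_{t>=1} 1/t^2; the tail beyond 10^6 terms is at most
# int_{10^6}^{inf} z^{-2} dz = 10^{-6}, so the full series is within 1e-6.
partial_sum = sum(1.0 / t**2 for t in range(1, 10**6 + 1))

assert abs(partial_sum - math.pi**2 / 6) < 1e-5  # the series equals pi^2/6
assert partial_sum + 1e-6 <= 2.0                 # the bound used in Lemma 1
```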

For Step 2, we have the following lemma.

Lemma 2.

Suppose Fed-SEL is run with $\beta_{i}(t)=c\alpha_{i}(t)$, where $c\geq 2$. Then, on the event $\mathcal{E}$, for each client $i\in\mathcal{C}$, Algo. 1 terminates in finite time and outputs $a_{i}=a^{(i)}_{1}$.

Proof.

We prove the result for the case $c=2$; the same reasoning applies when $c>2$. Fix a client $i\in\mathcal{C}$. We first argue that on $\mathcal{E}$, the locally best arm $a^{(i)}_{1}$ in $\mathcal{A}_{i}$ never gets eliminated when Algo. 1 is run. To see this, observe that on $\mathcal{E}$, the following holds in each epoch $t$:

\[\hat{r}_{i,i_{1}}(t)\geq r_{i_{1}}-\alpha_{i}(t).\]

At the same time, for any other arm $\ell\in\mathcal{S}_{i}(t)\setminus\{i_{1}\}$, the following also holds on $\mathcal{E}$:

\[\hat{r}_{i,\ell}(t)\leq r_{\ell}+\alpha_{i}(t)<r_{i_{1}}+\alpha_{i}(t).\]

Thus, on $\mathcal{E}$, we have that

\[\hat{r}_{i,\ell}(t)-\hat{r}_{i,i_{1}}(t)<2\alpha_{i}(t)=\beta_{i}(t).\]

From line 5 of Algo. 1, it is then clear that $a^{(i)}_{1}$ does not get eliminated on $\mathcal{E}$. We now argue that every arm $\ell\in\mathcal{A}_{i}\setminus\{i_{1}\}$ eventually gets eliminated. In particular, we claim that

\[\alpha_{i}(t)\leq\frac{\Delta_{i_{1}\ell}}{4}\]

is a sufficient condition to eliminate arm $\ell$. To see this, suppose the above condition holds. We then have

\begin{align}
& r_{i_{1}}-\alpha_{i}(t)\geq r_{\ell}+\alpha_{i}(t)+2\alpha_{i}(t) \tag{12}\\
\overset{\mathcal{E}}{\implies}\;& \hat{r}_{i,i_{1}}(t)\geq\hat{r}_{i,\ell}(t)+2\alpha_{i}(t)=\hat{r}_{i,\ell}(t)+\beta_{i}(t)\\
\implies\;& \hat{r}_{i,\max}(t)\geq\hat{r}_{i,\ell}(t)+\beta_{i}(t).
\end{align}

The claim then follows from line 5 of Algo. 1. This completes the proof. ∎
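To make the elimination rule analyzed above concrete, here is a minimal single-client Python sketch in the spirit of Algo. 1. The Bernoulli rewards, the confidence radius $\alpha(t)=\sqrt{\log(4Kt^{2}/\delta)/(2t)}$ (chosen so that $2\exp(-2t\alpha^{2}(t))=\delta/(2Kt^{2})$, mirroring the proof of Lemma 1 with a single client), and all identifiers are illustrative assumptions, not the paper's exact pseudocode.

```python
import math
import random

def successive_elimination(means, delta=0.05, c=2, rng=None):
    """Pull every surviving arm once per epoch t, and eliminate arm l when
    max_a r_hat_a(t) - r_hat_l(t) >= beta(t) = c * alpha(t)."""
    rng = rng or random.Random(0)
    num_arms = len(means)
    active = list(range(num_arms))
    reward_sums = [0.0] * num_arms
    t = 0
    while len(active) > 1:
        t += 1
        for arm in active:  # one Bernoulli pull per surviving arm
            reward_sums[arm] += 1.0 if rng.random() < means[arm] else 0.0
        alpha = math.sqrt(math.log(4 * num_arms * t * t / delta) / (2 * t))
        estimates = {arm: reward_sums[arm] / t for arm in active}
        best_estimate = max(estimates.values())
        active = [a for a in active if best_estimate - estimates[a] < c * alpha]
    return active[0]

best_arm = successive_elimination([0.2, 0.5, 0.9])
```

With the gap between the top two arms equal to 0.4, the sufficient condition $\alpha(t)\leq\Delta_{i_{1}\ell}/4$ from the proof kicks in after roughly a thousand epochs, so the call above terminates quickly and, with probability at least $1-\delta$, returns the index of the best arm.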

The next lemma provides a bound on the confidence intervals at the end of Phase I.

Lemma 3.

Suppose Fed-SEL is run with $\beta_{i}(t)=c\alpha_{i}(t)$, where $c>2$. Then, on the event $\mathcal{E}$, the following holds for every client $i\in\mathcal{C}$:

\[\alpha_{i}(\bar{t}_{i})<\frac{c}{(c-2)^{2}}\Delta^{*}_{i}, \tag{13}\]

where $\Delta^{*}_{i}=r_{i_{1}}-r_{i_{2}}$.

Proof.

Fix a client $i\in\mathcal{C}$. From Lemma 2, we know that every arm $\ell\in\mathcal{A}_{i}\setminus\{i_{1}\}$ eventually gets eliminated when client $i$ runs Algo. 1 on $\mathcal{A}_{i}$. We split our subsequent analysis into two cases, depending on when the second best arm $i_{2}$ in $\mathcal{A}_{i}$ gets eliminated during Phase I.

Case 1: Suppose arm $i_{2}$ is eliminated in the last epoch $\bar{t}_{i}$. On event $\mathcal{E}$, the arm with the largest empirical mean in $\mathcal{S}_{i}(\bar{t}_{i})$ must be $i_{1}$ (from Lemma 2). Thus, we must have

\begin{align}
& \hat{r}_{i,i_{1}}(\bar{t}_{i})\geq\hat{r}_{i,i_{2}}(\bar{t}_{i})+c\alpha_{i}(\bar{t}_{i}) \tag{14}\\
\overset{\mathcal{E}}{\implies}\;& r_{i_{1}}+\alpha_{i}(\bar{t}_{i})\geq\hat{r}_{i,i_{2}}(\bar{t}_{i})+c\alpha_{i}(\bar{t}_{i})\\
\overset{\mathcal{E}}{\implies}\;& r_{i_{1}}+\alpha_{i}(\bar{t}_{i})\geq r_{i_{2}}-\alpha_{i}(\bar{t}_{i})+c\alpha_{i}(\bar{t}_{i})\\
\implies\;& \alpha_{i}(\bar{t}_{i})\leq\frac{\Delta^{*}_{i}}{(c-2)}.
\end{align}

Case 2: Suppose arm $i_{2}$ is eliminated in some epoch $t<\bar{t}_{i}$. For this to happen on $\mathcal{E}$, it must be that $i_{1}=\arg\max_{\ell\in\mathcal{S}_{i}(t)}\hat{r}_{i,\ell}(t)$; this is easy to verify. Moreover, note that there must exist some arm $\ell\in\mathcal{A}_{i}\setminus\{i_{1},i_{2}\}$ that does not get eliminated in epoch $t$, but gets eliminated in the last epoch $\bar{t}_{i}$. Thus, the following inequalities must hold simultaneously:

\[\hat{r}_{i,i_{1}}(t)-\hat{r}_{i,i_{2}}(t)\geq c\alpha_{i}(t);\qquad\hat{r}_{i,i_{1}}(t)-\hat{r}_{i,\ell}(t)<c\alpha_{i}(t), \tag{15}\]

implying $\hat{r}_{i,\ell}(t)>\hat{r}_{i,i_{2}}(t)$. Since we are on $\mathcal{E}$, this further implies that

\[r_{\ell}+\alpha_{i}(t)>r_{i_{2}}-\alpha_{i}(t)\implies 2\alpha_{i}(t)>\Delta_{i_{2}\ell}. \tag{16}\]

At the same time, since arm $i_{2}$ gets eliminated in epoch $t$, we must also have

\[\alpha_{i}(t)\leq\frac{\Delta^{*}_{i}}{(c-2)}, \tag{17}\]

following the analysis of Case 1. Combining Eq. (16) with the above bound yields

\begin{align}
& \Delta_{i_{2}\ell}<\frac{2\Delta^{*}_{i}}{(c-2)} \tag{18}\\
\implies\;& r_{i_{2}}-r_{i_{1}}+r_{i_{1}}-r_{\ell}<\frac{2\Delta^{*}_{i}}{(c-2)}\\
\implies\;& \Delta_{i_{1}\ell}<\frac{c}{(c-2)}\Delta^{*}_{i}.
\end{align}

Finally, since arm $\ell$ gets eliminated in the last epoch $\bar{t}_{i}$, an argument identical to the Case 1 analysis reveals that

\[\alpha_{i}(\bar{t}_{i})\leq\frac{\Delta_{i_{1}\ell}}{(c-2)}. \tag{19}\]

Combining the above inequality with the one in Eq. (18), we obtain

\[\alpha_{i}(\bar{t}_{i})<\frac{c}{(c-2)^{2}}\Delta^{*}_{i}. \tag{20}\]

This completes the analysis for Case 2. The claim of the lemma then follows immediately from equations (14) and (20). ∎

Before proceeding to our next result, we take a brief detour to familiarize ourselves with the Lambert $W$ function, which will show up in our subsequent analysis. The Lambert $W$ function is defined to be the multi-valued inverse of the function $f(w)=we^{w}$, where $w\in\mathbb{C}$ is any complex number [25] (here, and in what follows, we use $e$ to denote the exponential function). Specifically, the Lambert $W$ function, denoted by $W(x)$, has the following defining property:

\[W(x)e^{W(x)}=x,\quad\textrm{where}\ x\in\mathbb{C}. \tag{21}\]

When $x$ is real and belongs to $[-1/e,0)$, $W(x)$ can take on two possible values. The branch satisfying $W(x)\geq-1$ is called the principal branch and is denoted by $W_{0}(x)$. The other branch, with values satisfying $W(x)\leq-1$, is denoted by $W_{-1}(x)$. The two branches meet at the point $x=-1/e$, where they both take on the value $-1$. For a depiction of these branches, see Fig. 5.

We now state and prove the final ingredient required for establishing Theorem 1.

Lemma 4.

For $t\in[1,\infty)$, define $\alpha(t)\triangleq\sqrt{2\log(\bar{c}t)/t}$, where $\bar{c}>e$. Given $\Delta\in(0,1)$ and $\sigma>1$ such that $\sigma\Delta\leq 1$, suppose $\tilde{t}$ and $t^{\prime}$ satisfy:

\[\alpha(\tilde{t})=\frac{\sigma\Delta}{8},\quad\textrm{and}\quad\alpha(t^{\prime})=\frac{\Delta}{8}.\]

Then, we have

\[t^{\prime}-\tilde{t}\leq\left(\frac{128}{\Delta^{2}}\log\frac{128\bar{c}}{\Delta^{2}}-\frac{128}{\sigma^{2}\Delta^{2}}\log\frac{128\bar{c}}{\sigma^{2}\Delta^{2}}\right)+\left(\frac{512}{\Delta^{2}}\log\log\frac{128\bar{c}}{\Delta^{2}}-\frac{32}{\sigma^{2}\Delta^{2}}\log\log\frac{128\bar{c}}{\sigma^{2}\Delta^{2}}\right). \tag{22}\]
Proof.

Our goal is to leverage the bounds on the $W_{-1}(x)$ branch that we derived in Lemma 5. To do so, we need to first transform the equations concerning $\alpha(\tilde{t})$ and $\alpha(t^{\prime})$ into an appropriate form. Accordingly, simple manipulations yield:

\[\left(-\frac{\sigma^{2}\Delta^{2}\tilde{t}}{128}\right)\exp\left(-\frac{\sigma^{2}\Delta^{2}\tilde{t}}{128}\right)=-\frac{\sigma^{2}\Delta^{2}}{128\bar{c}};\qquad\left(-\frac{\Delta^{2}t^{\prime}}{128}\right)\exp\left(-\frac{\Delta^{2}t^{\prime}}{128}\right)=-\frac{\Delta^{2}}{128\bar{c}}. \tag{23}\]

Based on the conditions of the lemma, it is easy to see that both $-\frac{\sigma^{2}\Delta^{2}}{128\bar{c}}$ and $-\frac{\Delta^{2}}{128\bar{c}}$ belong to the interval $[-1/e^{2},0)$. Using the defining property of the Lambert $W$ function in Eq. (21), we claim:

\[\tilde{t}=-\frac{128}{\sigma^{2}\Delta^{2}}W_{-1}\left(-\frac{\sigma^{2}\Delta^{2}}{128\bar{c}}\right);\qquad t^{\prime}=-\frac{128}{\Delta^{2}}W_{-1}\left(-\frac{\Delta^{2}}{128\bar{c}}\right). \tag{24}\]

We now need to argue why the above solutions for $\tilde{t}$ and $t^{\prime}$ involve the $W_{-1}(x)$ branch, and not the principal branch $W_{0}(x)$. To see why, note that $W_{0}(x)\geq ex,\forall x\in[-1/e^{2},0)$. This in turn follows by observing that:

\[W_{0}(x)\geq-1\implies e\geq e^{-W_{0}(x)}\implies ex\leq xe^{-W_{0}(x)}=W_{0}(x),\]

where we used $x<0$ and Eq. (21). Now suppose the solution for $\tilde{t}$ in Eq. (24) involved $W_{0}(x)$ instead of $W_{-1}(x)$. Using $W_{0}(x)\geq ex$ then leads to the following:

\[\tilde{t}=-\frac{128}{\sigma^{2}\Delta^{2}}W_{0}\left(-\frac{\sigma^{2}\Delta^{2}}{128\bar{c}}\right)\leq\frac{e}{\bar{c}}<1,\]

since $\bar{c}>e$. This leads to a contradiction, as $\alpha(t)$ is only defined for $t\in[1,\infty)$. A similar argument applies to the solution for $t^{\prime}$. We have thus justified Eq. (24).

We can now directly appeal to the bounds in Lemma 5 to obtain

\[W_{-1}\left(-\frac{\sigma^{2}\Delta^{2}}{128\bar{c}}\right)\leq\log\left(\frac{\sigma^{2}\Delta^{2}}{128\bar{c}}\right)-\frac{1}{4}\log\log\frac{128\bar{c}}{\sigma^{2}\Delta^{2}};\qquad W_{-1}\left(-\frac{\Delta^{2}}{128\bar{c}}\right)\geq\log\left(\frac{\Delta^{2}}{128\bar{c}}\right)-4\log\log\frac{128\bar{c}}{\Delta^{2}}. \tag{25}\]

Using the above bounds in conjunction with the expressions for $\tilde{t}$ and $t^{\prime}$ in Eq. (24) immediately leads to the claim of the lemma. ∎
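The characterization in Eq. (24) is straightforward to check numerically. In the standalone snippet below, the bisection routine is our own stand-in for a library Lambert-$W$ implementation, and the parameter values are illustrative assumptions; we recover $t^{\prime}$ from $W_{-1}$ and confirm that $\alpha(t^{\prime})=\Delta/8$:

```python
import math

def lambert_w_minus1(x):
    """Solve w * exp(w) = x on the branch w <= -1 (valid for x in [-1/e, 0));
    w * exp(w) is decreasing there, so bisection applies."""
    lo, hi = -700.0, -1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) > x:
            lo = mid  # root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

c_bar, gap = 20.0, 0.1  # illustrative: c_bar > e, and Delta = 0.1 in (0, 1)

# Eq. (24): t' = -(128 / Delta^2) * W_{-1}(-Delta^2 / (128 * c_bar)).
t_prime = -(128.0 / gap**2) * lambert_w_minus1(-gap**2 / (128.0 * c_bar))

# Check the defining relation alpha(t') = Delta / 8,
# where alpha(t) = sqrt(2 * log(c_bar * t) / t).
alpha_t_prime = math.sqrt(2.0 * math.log(c_bar * t_prime) / t_prime)
assert abs(alpha_t_prime - gap / 8.0) < 1e-9
```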

We are now in a position to complete the proof of Theorem 1.

Proof.

(Theorem 1) We first argue that Fed-SEL is consistent, i.e., it outputs the globally best arm. To this end, we claim that on the event $\mathcal{E}$ in Eq. (10), client $1$ always remains active during Phase II (recall that $a^{*}=1\in\mathcal{A}_{1}$). To see why this is true, fix any client $i\in\mathcal{C}$, and note that based on our encoding-decoding strategy,

\begin{align}
|\tilde{r}_{i}(k)-\hat{r}_{i,a_{i}}(\bar{t}_{i}+kH)| &\leq \frac{1}{2}\frac{1}{2^{B_{i}(k)}} \tag{26}\\
&\leq \frac{\alpha_{i}(\bar{t}_{i}+kH)}{2},\quad\forall k\geq 1,
\end{align}

where we used the expression for $B_{i}(k)$ in Eq. (3). The above bound holds trivially when $k=0$, since $\tilde{r}_{i}(0)=\hat{r}_{i,a_{i}}(\bar{t}_{i})$. Next, observe that at any round $k$,

\begin{align}
\tilde{r}^{(L)}_{i}(k) &= \tilde{r}_{i}(k)-2\alpha_{i}(\bar{t}_{i}+kH) \tag{27}\\
&\overset{(26)}{\leq} \hat{r}_{i,a_{i}}(\bar{t}_{i}+kH)+\frac{\alpha_{i}(\bar{t}_{i}+kH)}{2}-2\alpha_{i}(\bar{t}_{i}+kH)\\
&\overset{(a)}{=} \hat{r}_{i,i_{1}}(\bar{t}_{i}+kH)-\frac{3}{2}\alpha_{i}(\bar{t}_{i}+kH)\\
&\overset{\mathcal{E}}{\leq} r_{i_{1}}+\alpha_{i}(\bar{t}_{i}+kH)-\frac{3}{2}\alpha_{i}(\bar{t}_{i}+kH)\\
&< r_{i_{1}}.
\end{align}

Here, (a) follows from Lemma 2, where we showed that on $\mathcal{E}$, $a_{i}=a^{(i)}_{1}=i_{1},\forall i\in\mathcal{C}$. Using the same reasoning as above, we can show that

\[\tilde{r}^{(U)}_{i}(k)>r_{i_{1}},\quad\forall i\in\mathcal{C}. \tag{28}\]

From (27) and (28), we conclude that on $\mathcal{E}$,

\[\tilde{r}^{(U)}_{1}(k)>r_{1}\quad\textrm{and}\quad\tilde{r}^{(L)}_{i}(k)<r_{i_{1}},\ \forall i\in\mathcal{C}\setminus\{1\}.\]

Based on lines 4 and 5 of Fed-SEL, it is then easy to see that client $1$ always remains in the active-client set $\bar{\Psi}(k)$ on the event $\mathcal{E}$.

We now show that every client $i\in\mathcal{C}\setminus\{1\}$ eventually gets removed from the active-client set. In particular, for a client $i\in\mathcal{C}\setminus\{1\}$ to get removed, we claim that the following is a set of sufficient conditions:

\begin{align}
\alpha_{i}(\bar{t}_{i}+kH) &\leq \frac{\Delta_{1i_{1}}}{8},\quad\textrm{and} \tag{29}\\
\alpha_{1}(\bar{t}_{1}+kH) &\leq \frac{\Delta_{1i_{1}}}{8}.
\end{align}

Suppose both of the above conditions hold. We then have

\begin{align}
\tilde{r}^{(U)}_{i}(k) &= \tilde{r}_{i}(k)+2\alpha_{i}(\bar{t}_{i}+kH) \tag{30}\\
&\overset{(26)}{\leq} \hat{r}_{i,a_{i}}(\bar{t}_{i}+kH)+\frac{\alpha_{i}(\bar{t}_{i}+kH)}{2}+2\alpha_{i}(\bar{t}_{i}+kH)\\
&\overset{\mathcal{E}}{=} \hat{r}_{i,i_{1}}(\bar{t}_{i}+kH)+\frac{5}{2}\alpha_{i}(\bar{t}_{i}+kH)\\
&\overset{\mathcal{E}}{\leq} r_{i_{1}}+\alpha_{i}(\bar{t}_{i}+kH)+\frac{5}{2}\alpha_{i}(\bar{t}_{i}+kH)\\
&\overset{(29)}{\leq} r_{i_{1}}+\frac{7}{16}\Delta_{1i_{1}}.
\end{align}

Similarly, we have

\[\tilde{r}^{(L)}_{1}(k)\geq r_{1}-\frac{7}{16}\Delta_{1i_{1}}. \tag{31}\]

From (30) and (31), we note that $\tilde{r}^{(L)}_{1}(k)>\tilde{r}^{(U)}_{i}(k)$, which is sufficient for client $i$ to get removed from the active-client set (based on line 5 of Fed-SEL). Thus, we have argued that on $\mathcal{E}$, Fed-SEL will eventually terminate with $\bar{\Psi}(k+1)=\{1\}$ for some round $k$. The claim that Fed-SEL is $(0,\delta)$-PAC then follows immediately by noting that on $\mathcal{E}$, $a_{1}=a^{*}=1$.

We now derive bounds on the communication complexity of Fed-SEL. For any $i\in\mathcal{C}\setminus\{1\}$, let $R_{i1}$ (resp., $R_{1i}$) be the first round $k$ such that $\alpha_{i}(\bar{t}_{i}+kH)\leq\Delta_{1i_{1}}/8$ (resp., $\alpha_{1}(\bar{t}_{1}+kH)\leq\Delta_{1i_{1}}/8$). Based on the fact that $\alpha_{i}(t)$ is monotonically decreasing for all $i\in\mathcal{C}$ (this is easy to verify), and the above analysis, client $i$ remains active for at most $R_{i}=\max\{R_{i1},R_{1i}\}$ rounds. To bound $R_{i1}$, we start by noting that with $c=8$ in Lemma 3,

\[\alpha_{i}(\bar{t}_{i})<\frac{2}{9}\Delta^{*}_{i}<\frac{\Delta^{*}_{i}}{8}=\frac{\sigma_{i1}\Delta_{1i_{1}}}{8},\]

where for the last step we used the definition of the arm-heterogeneity index $\sigma_{i1}$ in Eq. (4). Clearly, if $\sigma_{i1}\leq 1$, then $R_{i1}=1$. Now suppose $\sigma_{i1}>1$. In this case, we have

\[R_{i1}\leq\left\lceil\frac{1}{H}\left(\left\lceil\alpha^{-1}_{i}\left(\frac{\Delta_{1i_{1}}}{8}\right)\right\rceil-\left\lfloor\alpha^{-1}_{i}\left(\frac{\sigma_{i1}\Delta_{1i_{1}}}{8}\right)\right\rfloor\right)\right\rceil. \tag{32}\]

We now apply Lemma 4 with

\[\alpha(t)=\alpha_{i}(t),\quad\Delta=\Delta_{1i_{1}},\quad\sigma=\sigma_{i1},\quad\textrm{and}\quad\bar{c}=\bar{c}_{i}=\sqrt{\frac{4|\mathcal{C}||\mathcal{A}_{i}|}{\delta}}.\]

Since $|\mathcal{A}_{i}|\geq 2$, we have $\bar{c}_{i}>e$. Moreover, $\sigma_{i1}\Delta_{1i_{1}}=\Delta^{*}_{i}\leq 1$. Thus, Lemma 4 is indeed applicable. From (22) and (32), we then have

\[R_{i1}\leq\frac{1}{H}\left(\underbrace{\frac{128}{\Delta^{2}_{1i_{1}}}\log\frac{128\bar{c}_{i}}{\Delta^{2}_{1i_{1}}}-\frac{128}{\sigma^{2}_{i1}\Delta^{2}_{1i_{1}}}\log\frac{128\bar{c}_{i}}{\sigma^{2}_{i1}\Delta^{2}_{1i_{1}}}}_{T_{1}}+T_{2}\right)+1,\]

where

\[T_{2}=\frac{512}{\Delta^{2}_{1i_{1}}}\log\log\frac{128\bar{c}_{i}}{\Delta^{2}_{1i_{1}}}-\frac{32}{\sigma^{2}_{i1}\Delta^{2}_{1i_{1}}}\log\log\frac{128\bar{c}_{i}}{\sigma^{2}_{i1}\Delta^{2}_{1i_{1}}}+2=O\left(\frac{1}{\Delta^{2}_{1i_{1}}}\log\log\frac{\bar{c}_{i}}{\Delta^{2}_{1i_{1}}}\right).\]

Using $\sigma_{i1}\Delta_{1i_{1}}=\Delta^{*}_{i}$ to simplify the expression for $T_{1}$, we obtain

\[T_{1}=\frac{128}{(\Delta^{*}_{i})^{2}}\left(\sigma^{2}_{i1}-1\right)\log\frac{128\bar{c}_{i}}{\Delta^{2}_{1i_{1}}}+\frac{256}{(\Delta^{*}_{i})^{2}}\log\sigma_{i1}.\]

Combining all the above pieces together yields the upper bound on $R_{i1}$ in Eq. (5). Following exactly the same reasoning as above, one can derive the upper bound on $R_{1i}$ in Eq. (6).

The total number of communication rounds $R$ of Fed-SEL equals the number of rounds it takes to eliminate every client in the set $\mathcal{C}\setminus\{1\}$. Thus, $R=\max_{i\in\mathcal{C}\setminus\{1\}}R_{i}$. Regarding the claim about the number of bits exchanged, note from the definition of $R_{i1}$ that for every round $k<R_{i1}$,

\[\alpha_{i}(\bar{t}_{i}+kH)\geq\frac{\Delta_{1i_{1}}}{8}\geq\frac{\Delta^{*}}{8},\]

where we used the definition of $\Delta^{*}$. From the expression for $B_{i}(k)$ in Eq. (3), we then have

\[B_{i}(k)\leq\log_{2}\left(\frac{1}{\alpha_{i}(\bar{t}_{i}+kH)}\right)+1\leq\log_{2}\left(\frac{8}{\Delta^{*}}\right)+1=O\left(\log_{2}\left(\frac{1}{\Delta^{*}}\right)\right).\]

This establishes the communication-complexity of Fed-SEL.

Finally, we derive the sample-complexity bounds for Fed-SEL in Eq. (7). For any client $i$, the total number of arm-pulls it makes is the sum of the number of pulls during Phases I and II. During Phase II, on the event $\mathcal{E}$, client $i$ only samples arm $a_{i}=i_{1}$ from its arm-set. Moreover, it keeps sampling $i_{1}$ for the entire duration it remains in the active-client set. Since $i_{1}$ is sampled $H$ times in each round, and client $i$ remains active for at most $R_{i}$ rounds, the number of arm-pulls made by client $i$ during Phase II is at most $HR_{i}$. Since client 1 remains active until the termination of Fed-SEL, we have $R_{1}=R$.

For Phase I, recall from Lemma 2 that an arm $\ell\in\mathcal{A}_{i}\setminus\{i_{1}\}$ gets removed from the set $\mathcal{S}_{i}(t)$ when

\[\alpha_{i}(t)\leq\frac{\Delta_{i_{1}\ell}}{4},\]

which holds for

\[t=O\left(\frac{\log\left(\frac{|\mathcal{C}||\mathcal{A}_{i}|}{\delta\Delta_{i_{1}\ell}}\right)}{\Delta^{2}_{i_{1}\ell}}\right).\]

The above discussion immediately leads to the sample-complexity bound in Eq. (7). This completes the proof of Theorem 1. ∎

A.1 Bounds on the Lambert W function

Figure 5: (Left) Illustration of the two branches of the Lambert $W$ function. (Right) Illustration of the bounds in Lemma 5. In this figure, $U(x)$ and $L(x)$ are the upper and lower bounds, respectively, on $W_{-1}(x)$ that we derive in Lemma 5; see Eq. (33).

In this section, we will derive new bounds on the Lambert $W$ function. To derive our bounds, we will rely on the following two facts: (i) $W_{-1}(x)<-1,\forall x\in[-1/e^{2},0)$, and (ii) $-4<W_{-1}(-1/e^{2})<-3$. The first fact follows readily by noting that $W_{-1}(-1/e)=-1$, and that $W_{-1}(x)$ is a strictly decreasing function on $(-1/e,0)$; see [25]. For the second fact, one can readily use any standard mathematical programming software (such as MATLAB) to verify that $W_{-1}(-1/e^{2})\approx-3.1462$. Alternatively, one can use the bounds derived in [99] to arrive at the second fact. We have the following result. For a pictorial depiction of our bounds, see Fig. 5.
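Both facts are also easy to confirm without specialized software. The short standalone script below (the bisection routine is our own stand-in for a library Lambert-$W$ implementation) verifies fact (ii), and additionally spot-checks the bounds in Eq. (33) at a few points:

```python
import math

def lambert_w_minus1(x):
    """Solve w * exp(w) = x on the branch w <= -1 (valid for x in [-1/e, 0));
    w * exp(w) is decreasing there, so bisection applies."""
    lo, hi = -700.0, -1.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid * math.exp(mid) > x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Fact (ii): -4 < W_{-1}(-1/e^2) < -3, with W_{-1}(-1/e^2) ~ -3.1462.
w_at_endpoint = lambert_w_minus1(-1.0 / math.e**2)
assert -4.0 < w_at_endpoint < -3.0
assert abs(w_at_endpoint - (-3.1462)) < 1e-3

# Spot-check the bounds of Eq. (33) at a few points in [-1/e^2, 0).
for x in (-1.0 / math.e**2, -1e-2, -1e-4, -1e-8):
    lower = math.log(-x) - 4.0 * math.log(-math.log(-x))
    upper = math.log(-x) - 0.25 * math.log(-math.log(-x))
    assert lower <= lambert_w_minus1(x) <= upper
```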

Lemma 5.

For $x\in[-1/e^{2},0)$, we have

\[\log(-x)-4\log(-\log(-x))\leq W_{-1}(x)\leq\log(-x)-\frac{1}{4}\log(-\log(-x)). \tag{33}\]
Proof.

We start by defining a function $f_{\gamma}(x)$ parameterized by $\gamma$:

\[f_{\gamma}(x)=\log(-x)-\gamma\log\left(-\log(-x)\right)-W_{-1}(x). \tag{34}\]

Using $f^{\prime}_{\gamma}(x)$ to denote the derivative of $f_{\gamma}(x)$ w.r.t. $x$, we obtain

\begin{align}
f^{\prime}_{\gamma}(x) &= \frac{1}{x}\left[1-\frac{\gamma}{\log(-x)}-\frac{W_{-1}(x)}{1+W_{-1}(x)}\right] \tag{35}\\
&= \frac{1}{x}\left[\frac{\log(-x)-\gamma\left(1+W_{-1}(x)\right)}{\log(-x)\left(1+W_{-1}(x)\right)}\right].
\end{align}

Here, we used the fact that for $x\in[-1/e^{2},0)$, the derivative of $W_{-1}(x)$ is given by [25]

\[\frac{W_{-1}(x)}{x(1+W_{-1}(x))}.\]

Since $W_{-1}(x)<-1$ for $x\in[-1/e^{2},0)$, the denominator of the expression for $f^{\prime}_{\gamma}(x)$ in Eq. (35) is negative. To study the behavior of the numerator, we consider another $\gamma$-parameterized function, namely $g_{\gamma}(x)$, as follows:

\[g_{\gamma}(x)=\log(-x)-\gamma\left(1+W_{-1}(x)\right). \tag{36}\]

Using $g^{\prime}_{\gamma}(x)$ to denote the derivative of $g_{\gamma}(x)$ w.r.t. $x$, we obtain

\[g^{\prime}_{\gamma}(x)=\frac{1+(1-\gamma)W_{-1}(x)}{x(1+W_{-1}(x))}. \tag{37}\]

Establishing the lower bound: For proving the lower bound on $W_{-1}(x)$ in (33), we set $\gamma=4$ in Eq. (37) and note that both the numerator and the denominator of the resulting expression are positive; this once again follows from the fact that $W_{-1}(x)<-1$ for $x\in[-1/e^{2},0)$. We conclude: $g^{\prime}_{4}(x)>0,\forall x\in[-1/e^{2},0)$. Moreover, setting $\gamma=4$ in (36), we obtain

\begin{align}
g_{4}(-1/e^{2}) &= \log(1/e^{2})-4\left(1+W_{-1}(-1/e^{2})\right) \tag{38}\\
&= -6-4W_{-1}(-1/e^{2})\\
&> -6+12>0,
\end{align}

where we used $W_{-1}(-1/e^{2})<-3$. We have thus shown that $g_{4}(x)>0,\forall x\in[-1/e^{2},0)$. Since the numerator of $f^{\prime}_{4}(x)$ in Eq. (35) is precisely $g_{4}(x)$, we conclude that $f^{\prime}_{4}(x)<0,\forall x\in[-1/e^{2},0)$. At the same time, plugging $\gamma=4$ into (34) yields:

\begin{align}
f_{4}(-1/e^{2}) &= \log(1/e^{2})-4\log(-\log(1/e^{2}))-W_{-1}(-1/e^{2}) \tag{39}\\
&= -2-4\log(2)-W_{-1}(-1/e^{2})\\
&< 2-4\log(2)<0,
\end{align}

where we used $W_{-1}(-1/e^{2})>-4$. Combining all the above pieces together, we conclude that $f_{4}(x)<0,\forall x\in[-1/e^{2},0)$. This immediately leads to the desired lower bound.

Establishing the upper bound: The proof of the upper bound proceeds in a very similar way. We start by noting that

g1/4(x)=1+34W1(x)x(1+W1(x)).g^{\prime}_{1/4}(x)=\frac{1+\frac{3}{4}W_{-1}(x)}{x\left(1+W_{-1}(x)\right)}. (40)

For x[1/e2,0)x\in[-1/e^{2},0), while the denominator of the fraction on the R.H.S. of the above expression is positive, the numerator is negative. The latter claim follows from the following facts: (i) W1(1/e2)<3W_{-1}(-1/e^{2})<-3, and (ii) W1(x)W_{-1}(x) is monotonically decreasing. Thus, g1/4(x)<0,x[1/e2,0).g^{\prime}_{1/4}(x)<0,\forall x\in[-1/e^{2},0). Moreover,

g1/4(1/e2)\displaystyle g_{1/4}(-1/e^{2}) =log(1/e2)14(1+W1(1/e2))\displaystyle=\log(1/e^{2})-\frac{1}{4}\left(1+W_{-1}(-1/e^{2})\right) (41)
=9414W1(1/e2)\displaystyle=-\frac{9}{4}-\frac{1}{4}W_{-1}(-1/e^{2})
<0,\displaystyle<0,

where we used W1(1/e2)>4W_{-1}(-1/e^{2})>-4. We have thus shown that g1/4(x)<0,x[1/e2,0)g_{1/4}(x)<0,\forall x\in[-1/e^{2},0). Since the numerator of f1/4(x)f^{\prime}_{1/4}(x) is g1/4(x)g_{1/4}(x), it follows that f1/4(x)>0,x[1/e2,0)f^{\prime}_{1/4}(x)>0,\forall x\in[-1/e^{2},0). Finally, observe

f1/4(1/e2)\displaystyle f_{1/4}(-1/e^{2}) =log(1/e2)14log(log(1/e2))W1(1/e2)\displaystyle=\log(1/e^{2})-\frac{1}{4}\log(-\log(1/e^{2}))-W_{-1}(-1/e^{2}) (42)
=214log(2)W1(1/e2)\displaystyle=-2-\frac{1}{4}\log(2)-W_{-1}(-1/e^{2})
>114log(2)>0,\displaystyle>1-\frac{1}{4}\log(2)>0,

where we used W1(1/e2)<3W_{-1}(-1/e^{2})<-3. We thus conclude that f1/4(x)>0,x[1/e2,0)f_{1/4}(x)>0,\forall x\in[-1/e^{2},0). The upper bound on W1(x)W_{-1}(x) in Eq. (33) follows immediately. This concludes the proof. ∎
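The two-sided bound in Eq. (33) also admits a quick numerical sanity check (separate from the proof above) using SciPy's implementation of the Lambert WW function; the grid of evaluation points below is an illustrative choice:

```python
import numpy as np
from scipy.special import lambertw

# Grid of points in [-1/e^2, 0); the endpoints and resolution are illustrative.
xs = -np.logspace(-8, np.log10(np.exp(-2)), 200)

for x in xs:
    w = lambertw(x, k=-1).real               # W_{-1}(x) is real on [-1/e, 0)
    lower = np.log(-x) - 4.0 * np.log(-np.log(-x))
    upper = np.log(-x) - 0.25 * np.log(-np.log(-x))
    assert lower <= w <= upper, (x, lower, w, upper)

print("Eq. (33) holds on all grid points.")
```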

Appendix B Proof of Theorem 2

In this section, we analyze Robust Fed-SEL, namely Algorithm 2. Recall that \mathcal{H} and 𝒥\mathcal{J} represent the sets of honest and adversarial clients, respectively. To proceed, let Ti,T_{i,\ell} denote the total number of times an arm 𝒜j\ell\in\mathcal{A}_{j} is pulled by an honest client i𝒞ji\in\mathcal{C}_{j}\cap\mathcal{H} during the execution of Robust Fed-SEL. We then have the following analog of Lemma 1.

Lemma 6.

Consider the event \mathcal{E_{H}} defined as follows:

j[N]i𝒞j𝒜jt[Ti,]{|r^i,(t)r|αj(t)}.\mathcal{E_{H}}\triangleq\bigcap_{j\in[N]}\bigcap_{i\in\mathcal{C}_{j}\cap\mathcal{H}}\bigcap_{\ell\in\mathcal{A}_{j}}\bigcap_{t\in[T_{i,\ell}]}\{|\hat{r}_{i,\ell}(t)-r_{\ell}|\leq\alpha_{j}(t)\}. (43)

Then, it holds that ()1δ\mathbb{P}(\mathcal{E_{H}})\geq 1-\delta.

Proof.

Following the same reasoning as in Lemma 1,

(c)\displaystyle\mathbb{P}(\mathcal{E}^{c}_{\mathcal{H}}) j[N]i𝒞j𝒜jt=1({|r^i,(t)r|>αj(t)})\displaystyle\leq\sum_{j\in[N]}\sum_{i\in\mathcal{C}_{j}\cap\mathcal{H}}\sum_{\ell\in\mathcal{A}_{j}}\sum\limits_{t=1}^{\infty}\mathbb{P}\left(\{|\hat{r}_{i,\ell}(t)-r_{\ell}|>\alpha_{j}(t)\}\right) (44)
j[N]i𝒞j𝒜jt=12exp(2tαj2(t))\displaystyle\leq\sum_{j\in[N]}\sum_{i\in\mathcal{C}_{j}\cap\mathcal{H}}\sum_{\ell\in\mathcal{A}_{j}}\sum\limits_{t=1}^{\infty}2\exp(-2t\alpha^{2}_{j}(t))
j[N]i𝒞j𝒜jt=1δ2|𝒞||𝒜j|t2\displaystyle\leq\sum_{j\in[N]}\sum_{i\in\mathcal{C}_{j}\cap\mathcal{H}}\sum_{\ell\in\mathcal{A}_{j}}\sum\limits_{t=1}^{\infty}\frac{\delta}{2|\mathcal{C}||\mathcal{A}_{j}|t^{2}}
j[N]i𝒞j𝒜jδ2|𝒞||𝒜j|(1+11z2𝑑z)\displaystyle\leq\sum_{j\in[N]}\sum_{i\in\mathcal{C}_{j}\cap\mathcal{H}}\sum_{\ell\in\mathcal{A}_{j}}\frac{\delta}{2|\mathcal{C}||\mathcal{A}_{j}|}\left(1+\int_{1}^{\infty}\frac{1}{z^{2}}dz\right)
=j[N]i𝒞j𝒜jδ|𝒞||𝒜j|\displaystyle=\sum_{j\in[N]}\sum_{i\in\mathcal{C}_{j}\cap\mathcal{H}}\sum_{\ell\in\mathcal{A}_{j}}\frac{\delta}{|\mathcal{C}||\mathcal{A}_{j}|}
=|𝒞||𝒞|δδ,\displaystyle=\frac{|\mathcal{C}\cap\mathcal{H}|}{|\mathcal{C}|}\delta\leq\delta,

which leads to the desired result. ∎
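The chain of inequalities above goes through for any confidence radius satisfying 2exp(2tαj2(t))δ/(2|𝒞||𝒜j|t2)2\exp(-2t\alpha^{2}_{j}(t))\leq\delta/(2|\mathcal{C}||\mathcal{A}_{j}|t^{2}); solving this with equality gives one canonical choice of αj(t)\alpha_{j}(t), used in the sketch below (the paper's exact definition may differ in constants). The sketch numerically confirms that the resulting failure-probability series stays below δ\delta, as the integral-comparison step in the proof guarantees; the values of |𝒞||\mathcal{C}|, |𝒜j||\mathcal{A}_{j}|, and δ\delta are illustrative.

```python
import math

def alpha(t, n_clients, n_arms, delta):
    """Confidence radius making 2*exp(-2*t*alpha^2) = delta/(2*n_clients*n_arms*t^2).
    Illustrative choice; the paper's alpha_j(t) may differ in constants."""
    return math.sqrt(math.log(4 * n_clients * n_arms * t**2 / delta) / (2 * t))

n_clients, n_arms, delta = 10, 5, 0.05

# Tail of the union bound for a single (client, arm) pair, truncated at T terms.
T = 100_000
tail = sum(2 * math.exp(-2 * t * alpha(t, n_clients, n_arms, delta)**2)
           for t in range(1, T + 1))

# Summing over all client-arm pairs; sum_t 1/t^2 <= 2 keeps the total below delta.
total = n_clients * n_arms * tail
print(total, delta)
assert total <= delta
```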

The next lemma provides the main intermediate result we need to prove Theorem 2.

Lemma 7.

Suppose Robust Fed-SEL is run by every honest client. Moreover, suppose |𝒞j𝒥|f|\mathcal{C}_{j}\cap\mathcal{J}|\leq f and |𝒞j|3f+1,j[N]|\mathcal{C}_{j}|\geq 3f+1,\forall j\in[N]. On event \mathcal{E_{H}}, the following then hold:

  • (i)

    a¯j=j1,j[N]\bar{a}_{j}=j_{1},\forall j\in[N].

  • (ii)

    In every round kk,

    r¯j(U)(k)Conv({r~i(U)(k)}i𝒞¯j);r¯j(L)(k)Conv({r~i(L)(k)}i𝒞¯j),j[N],\bar{r}^{(U)}_{j}(k)\in\texttt{Conv}\left(\{\tilde{r}^{(U)}_{i}(k)\}_{i\in\bar{\mathcal{C}}_{j}\cap\mathcal{H}}\right);\hskip 5.69054pt\bar{r}^{(L)}_{j}(k)\in\texttt{Conv}\left(\{\tilde{r}^{(L)}_{i}(k)\}_{i\in\bar{\mathcal{C}}_{j}\cap\mathcal{H}}\right),\forall j\in[N], (45)

where we use Conv(Ω)\texttt{Conv}(\Omega) to represent the convex hull of a set of points Ω\Omega.

Proof.

Fix a client group j[N].j\in[N]. Based on Lemma 2, note that when each client i𝒞ji\in\mathcal{C}_{j}\cap\mathcal{H} runs Algo. 1 on 𝒜j\mathcal{A}_{j}, the local subroutine will eventually terminate on event \mathcal{E_{H}}, and client ii will output ai=j1a_{i}=j_{1}, i.e., each honest client ii in 𝒞j\mathcal{C}_{j} will correctly identify the locally best arm j1j_{1} in 𝒜j\mathcal{A}_{j} at the end of Phase I. Since |𝒞j𝒥|f|\mathcal{C}_{j}\cap\mathcal{J}|\leq f and |𝒞j|3f+1|\mathcal{C}_{j}|\geq 3f+1, at least 2f+12f+1 clients from 𝒞j\mathcal{C}_{j} are guaranteed to report to the server at the end of Phase I. Among them, since at most ff can be adversarial, observe that the majority voting operation in line 2 of Robust Fed-SEL yields a¯j=j1\bar{a}_{j}=j_{1}. This establishes the first claim of the lemma.
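The majority-voting step invoked above can be sketched as follows; `majority_vote` is a hypothetical helper, not the paper's exact implementation. With at least 2f+12f+1 reports of which at most ff are adversarial, at least f+1f+1 honest clients report the true locally best arm, so it always wins the plurality:

```python
from collections import Counter

def majority_vote(reports):
    """Return the most frequently reported arm (hypothetical helper)."""
    return Counter(reports).most_common(1)[0][0]

f = 2
honest_reports = ["j1"] * (f + 1)        # >= f+1 honest clients agree on j1
adversarial_reports = ["bad1", "bad2"]   # at most f adversarial reports
assert len(adversarial_reports) <= f

reports = honest_reports + adversarial_reports  # 2f+1 = 5 reports in total
assert majority_vote(reports) == "j1"
```

Even in the worst case where all ff adversaries vote for the same wrong arm, its ff votes cannot beat the f+1f+1 honest votes for the locally best arm.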

For the second claim, let us start by noting that on \mathcal{E_{H}}, the number of adversaries in the set 𝒞¯j\bar{\mathcal{C}}_{j} is at most fj=f|𝒞j|+|𝒞¯j|f_{j}=f-|\mathcal{C}^{\prime}_{j}|+|\bar{\mathcal{C}}_{j}|. This follows from the fact that on \mathcal{E_{H}}, any client i𝒞ji\in\mathcal{C}^{\prime}_{j} who reports an arm other than j1j_{1} must be adversarial, i.e., on \mathcal{E_{H}}, 𝒞j𝒞¯j𝒥\mathcal{C}^{\prime}_{j}\setminus\bar{\mathcal{C}}_{j}\subseteq\mathcal{J}. Moreover, observe that

2fj+1\displaystyle 2f_{j}+1 =2(f|𝒞j|+|𝒞¯j|)+1\displaystyle=2\left(f-|\mathcal{C}^{\prime}_{j}|+|\bar{\mathcal{C}}_{j}|\right)+1 (46)
=2(f(2f+1)+|𝒞¯j|)+1\displaystyle=2\left(f-(2f+1)+|\bar{\mathcal{C}}_{j}|\right)+1
=(|𝒞¯j|(2f+1))+|𝒞¯j|\displaystyle=\left(|\bar{\mathcal{C}}_{j}|-(2f+1)\right)+|\bar{\mathcal{C}}_{j}|
(|𝒞j|(2f+1))+|𝒞¯j|\displaystyle\leq\left(|\mathcal{C}^{\prime}_{j}|-(2f+1)\right)+|\bar{\mathcal{C}}_{j}|
=|𝒞¯j|.\displaystyle=|\bar{\mathcal{C}}_{j}|.

With these basic observations in place, we are now ready to analyze the Trim operation outlined in line 6 of Robust Fed-SEL. To establish the claim that

r¯j(U)(k)Conv({r~i(U)(k)}i𝒞¯j),\bar{r}^{(U)}_{j}(k)\in\texttt{Conv}\left(\{\tilde{r}^{(U)}_{i}(k)\}_{i\in\bar{\mathcal{C}}_{j}\cap\mathcal{H}}\right),

suppose Trim({r~i(U)(k)}i𝒞¯j,fj)=r~w(U)(k)\texttt{Trim}\left(\{\tilde{r}^{(U)}_{i}(k)\}_{i\in\bar{\mathcal{C}}_{j}},f_{j}\right)=\tilde{r}^{(U)}_{w}(k), where w𝒞¯jw\in\bar{\mathcal{C}}_{j}. If ww\in\mathcal{H}, then the claim holds trivially. Now suppose ww is adversarial, i.e., w𝒥w\in\mathcal{J}. Since |𝒞¯j𝒥|fj|\bar{\mathcal{C}}_{j}\cap\mathcal{J}|\leq f_{j}, and |𝒞¯j|2fj+1|\bar{\mathcal{C}}_{j}|\geq 2f_{j}+1, there must exist honest clients u,v𝒞¯ju,v\in\bar{\mathcal{C}}_{j}\cap\mathcal{H}, such that r~v(U)(k)r~w(U)(k)r~u(U)(k)\tilde{r}^{(U)}_{v}(k)\leq\tilde{r}^{(U)}_{w}(k)\leq\tilde{r}^{(U)}_{u}(k). This establishes the claim about r¯j(U)(k)\bar{r}^{(U)}_{j}(k) in Eq. (45); the claim regarding r¯j(L)(k)\bar{r}^{(L)}_{j}(k) follows the exact same argument. ∎
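One common instantiation of the Trim operation consistent with the argument above (a sketch; the paper's exact rule in line 6 of Robust Fed-SEL may differ) sorts the reported values, discards the fjf_{j} largest and fjf_{j} smallest, and returns a representative of the remainder. With at most fjf_{j} adversarial values among at least 2fj+12f_{j}+1, the returned value is bracketed by honest values, exactly as in the proof:

```python
def trim(values, f):
    """Drop the f largest and f smallest values; return, e.g., the max of the rest.
    Sketch of a trimming rule; the paper's exact Trim may differ."""
    assert len(values) >= 2 * f + 1
    kept = sorted(values)[f:len(values) - f]
    return max(kept)

f_j = 1
honest = [0.52, 0.55, 0.57]   # honest clients' reported confidence bounds
adversarial = [10.0]          # at most f_j corrupted values
out = trim(honest + adversarial, f_j)

# The trimmed output lies in the convex hull of the honest values.
assert min(honest) <= out <= max(honest)
print(out)
```

The bracketing holds because at least one honest value sits among the fj+1f_{j}+1 values at or above the output, and at least one among those at or below it.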

The above lemma tells us that with high probability, (i) the server will correctly identify the locally best arm from each arm-set, and (ii) the thresholds used for eliminating client-groups are “not too corrupted” by the adversaries. We now show how these two properties feature in the proof of Theorem 2.

Proof.

(Theorem 2) Throughout the proof, we will condition on the event \mathcal{E_{H}} defined in Eq. (43). Based on Lemma 6, recall that \mathcal{E_{H}} occurs with probability at least 1δ1-\delta. As in the proof of Theorem 1, we start by arguing that client group 11 always remains active (recall that a=1𝒜1a^{*}=1\in\mathcal{A}_{1}). To see why this is true, fix any j[N]j\in[N], and observe that for each i𝒞¯ji\in\bar{\mathcal{C}}_{j}\cap\mathcal{H} and every round kk,

r~i(L)(k)\displaystyle\tilde{r}^{(L)}_{i}(k) =r~i(k)2αj(t¯i+kH)\displaystyle=\tilde{r}_{i}(k)-2\alpha_{j}(\bar{t}_{i}+kH) (47)
(a)r^i,a¯j(t¯i+kH)+αj(t¯i+kH)22αj(t¯i+kH)\displaystyle\overset{(a)}{\leq}\hat{r}_{i,\bar{a}_{j}}(\bar{t}_{i}+kH)+\frac{\alpha_{j}(\bar{t}_{i}+kH)}{2}-2\alpha_{j}(\bar{t}_{i}+kH)
=(b)r^i,j1(t¯i+kH)32αj(t¯i+kH)\displaystyle\overset{(b)}{=}\hat{r}_{i,j_{1}}(\bar{t}_{i}+kH)-\frac{3}{2}\alpha_{j}(\bar{t}_{i}+kH)
rj1+αj(t¯i+kH)32αj(t¯i+kH)\displaystyle\overset{\mathcal{E}_{\mathcal{H}}}{\leq}{r_{j_{1}}}+\alpha_{j}(\bar{t}_{i}+kH)-\frac{3}{2}\alpha_{j}(\bar{t}_{i}+kH)
<rj1.\displaystyle<r_{j_{1}}.

In the above steps, (a) follows from (i) the fact that by definition of the set 𝒞¯j\bar{\mathcal{C}}_{j}, every client ii in 𝒞¯j\bar{\mathcal{C}}_{j} reports ai=a¯ja_{i}=\bar{a}_{j}, and (ii) by using the bound on the quantization error in Eq. (26). Based on Lemma 7, the majority voted arm a¯j\bar{a}_{j} is j1j_{1} on event \mathcal{E_{H}}; this leads to (b)(b). Using the same reasoning, we can show that for each i𝒞¯ji\in\bar{\mathcal{C}}_{j}\cap\mathcal{H} and every round kk,

r~i(U)(k)>rj1\tilde{r}^{(U)}_{i}(k)>r_{j_{1}} (48)

on event \mathcal{E_{H}}. Appealing to (45) from Lemma 7, and using (47) and (48), we conclude that on \mathcal{E_{H}},

r¯1(U)(k)>r1;r¯j(L)(k)<rj1,j[N]{1}.\bar{r}^{(U)}_{1}(k)>r_{1};\hskip 11.38109pt\bar{r}^{(L)}_{j}(k)<r_{j_{1}},\forall j\in[N]\setminus\{1\}.

From lines 7 and 8 of Robust Fed-SEL, we immediately note that client-group 11 never gets eliminated. To show that Robust Fed-SEL is consistent, it remains to argue that on \mathcal{E_{H}}, every client-group j[N]{1}j\in[N]\setminus\{1\} eventually gets eliminated. We claim that for group j[N]{1}j\in[N]\setminus\{1\} to get eliminated, the following is a set of sufficient conditions:

αj(t¯i+kH)\displaystyle\alpha_{j}(\bar{t}_{i}+kH) Δ1j18,i𝒞¯j,and\displaystyle\leq\frac{\Delta_{1j_{1}}}{8},\forall i\in\bar{\mathcal{C}}_{j}\cap\mathcal{H},\textrm{and} (49)
α1(t¯i+kH)\displaystyle\alpha_{1}(\bar{t}_{i}+kH) Δ1j18,i𝒞¯1.\displaystyle\leq\frac{\Delta_{1j_{1}}}{8},\forall i\in\bar{\mathcal{C}}_{1}\cap\mathcal{H}.

Suppose the above conditions hold. Then, based on the same reasoning we used to arrive at Eq. (30), we can show that for each i𝒞¯ji\in\bar{\mathcal{C}}_{j}\cap\mathcal{H},

r~i(U)(k)rj1+716Δ1j1,\displaystyle\tilde{r}^{(U)}_{i}(k)\leq r_{j_{1}}+\frac{7}{16}\Delta_{1j_{1}}, (50)

and for each i𝒞¯1i\in\bar{\mathcal{C}}_{1}\cap\mathcal{H},

r~i(L)(k)r1716Δ1j1.\tilde{r}^{(L)}_{i}(k)\geq r_{1}-\frac{7}{16}\Delta_{1j_{1}}. (51)

Appealing once again to the second claim of Lemma 7, and using (50) and (51), we conclude that if the conditions in (49) hold, then

r¯1(L)(k)r1716Δ1j1andr¯j(U)(k)rj1+716Δ1j1,\bar{r}^{(L)}_{1}(k)\geq r_{1}-\frac{7}{16}{\Delta_{1j_{1}}}\hskip 2.84526pt\textrm{and}\hskip 2.84526pt\bar{r}^{(U)}_{j}(k)\leq r_{j_{1}}+\frac{7}{16}{\Delta_{1j_{1}}}, (52)

implying r¯1(L)(k)>r¯j(U)(k)\bar{r}^{(L)}_{1}(k)>\bar{r}^{(U)}_{j}(k). From line 8 of Robust Fed-SEL, this is sufficient for client-group jj to get eliminated. Thus, Robust Fed-SEL will eventually terminate on \mathcal{E_{H}} with Ψ¯(k+1)={1}\bar{\Psi}(k+1)=\{1\}, for some round kk. Consistency then follows from Lemma 7 by noting that on \mathcal{E_{H}}, a¯1=a=1\bar{a}_{1}=a^{*}=1.
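The strict separation implied by (52) can be checked symbolically; a minimal sketch:

```python
import sympy as sp

r1, rj1 = sp.symbols("r1 rj1", real=True)
Delta = r1 - rj1  # the gap Delta_{1 j_1}

lower_group1 = r1 - sp.Rational(7, 16) * Delta   # bound on \bar{r}_1^{(L)}(k) in (52)
upper_groupj = rj1 + sp.Rational(7, 16) * Delta  # bound on \bar{r}_j^{(U)}(k) in (52)

gap = sp.simplify(lower_group1 - upper_groupj)
print(gap)  # equals (r1 - rj1)/8, which is positive whenever r1 > rj1
assert sp.simplify(gap - Delta / 8) == 0
```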

The sample- and communication-complexity bounds of Robust Fed-SEL are of the same order as that of Fed-SEL, and can be derived in exactly the same way as in Theorem 1. We now briefly explain how this can be done. Suppose we want to compute an upper bound on the number of rounds client group j[N]{1}j\in[N]\setminus\{1\} remains active. Start by noting that at the end of Phase I, the following bound applies uniformly to every i𝒞¯ji\in\bar{\mathcal{C}}_{j}\cap\mathcal{H}:

αj(t¯i)<29Δj<σj1Δ1j18.\alpha_{j}(\bar{t}_{i})<\frac{2}{9}\Delta^{*}_{j}<\frac{\sigma_{j1}\Delta_{1j_{1}}}{8}.

This follows from Lemma 3. For each i𝒞¯ji\in\bar{\mathcal{C}}_{j}\cap\mathcal{H}, let Ri1(j)R^{(j)}_{i1} be the first round such that

αj(t¯i+kH)Δ1j18.\alpha_{j}(\bar{t}_{i}+kH)\leq\frac{\Delta_{1j_{1}}}{8}.

We can obtain an upper bound on Ri1(j)R^{(j)}_{i1} for each i𝒞¯ji\in\bar{\mathcal{C}}_{j}\cap\mathcal{H} exactly as in Theorem 1, and show that

maxi𝒞¯jRi1(j)1H(128(Δj)2(σj121)log128c¯jΔ1j12+256(Δj)2logσj1+O(1Δ1j12loglogc¯jΔ1j12))+1,ifσj1>1,\max_{i\in\bar{\mathcal{C}}_{j}\cap\mathcal{H}}R^{(j)}_{i1}\leq\frac{1}{H}\left(\frac{128}{{(\Delta^{*}_{j})}^{2}}(\sigma^{2}_{j1}-1)\log{\frac{128\bar{c}_{j}}{\Delta^{2}_{1j_{1}}}}+\frac{256}{{(\Delta^{*}_{j})}^{2}}\log{\sigma_{j1}}+O\left(\frac{1}{\Delta^{2}_{1j_{1}}}\log\log{\frac{\bar{c}_{j}}{\Delta^{2}_{1j_{1}}}}\right)\right)+1,\hskip 2.84526pt\textrm{if}\hskip 2.84526pt\sigma_{j1}>1,

and maxi𝒞¯jRi1(j)=1\max_{i\in\bar{\mathcal{C}}_{j}\cap\mathcal{H}}R^{(j)}_{i1}=1, if σj11.\sigma_{j1}\leq 1. With Rj1maxi𝒞¯jRi1(j)R_{j1}\triangleq\max_{i\in\bar{\mathcal{C}}_{j}\cap\mathcal{H}}R^{(j)}_{i1}, observe that the expression for Rj1R_{j1} exactly matches that for Ri1R_{i1} in Eq. (5) of Theorem 1. (Recall that in the context of Theorem 1, each client group contained only one agent, and hence the index ii was used for both client groups and agents.) Proceeding as we did in Theorem 1, we can now derive an upper bound on the number of rounds group jj remains active. The rest of the proof mirrors that of Theorem 1, and hence we omit the details. ∎

Appendix C Proof of Theorem 3

In this section, we will analyze the peer-to-peer setting described in Section 6. As before, we need a “good event” on which we will base our reasoning. For the setting we consider here, the event \mathcal{E_{H}} described in Eq. (43) will serve such a purpose. Recall that ()1δ.\mathbb{P}(\mathcal{E_{H}})\geq 1-\delta.

Consider any honest client i𝒞j.i\in\mathcal{C}_{j}. The argument we present next applies identically to honest clients in other client groups as well. Since i𝒞ji\in\mathcal{C}_{j}, running Algo. 1 on 𝒜j\mathcal{A}_{j} with βj(t)=6αj(t)\beta_{j}(t)=6\alpha_{j}(t) yields

ai=j1;|r^i,ai(t¯i)rj1|=|r^i,j1(t¯i)rj1|αj(t¯i)<c(c2)2Δj=38Δja_{i}=j_{1};\hskip 2.84526pt|\hat{r}_{i,a_{i}}(\bar{t}_{i})-r_{j_{1}}|=|\hat{r}_{i,j_{1}}(\bar{t}_{i})-r_{j_{1}}|\leq\alpha_{j}(\bar{t}_{i})<\frac{c}{(c-2)^{2}}\Delta^{*}_{j}=\frac{3}{8}\Delta^{*}_{j} (53)

on event \mathcal{E_{H}}. The above claims follow directly from Lemmas 2 and 3. Thus, based on Line 1 of Algo. 4, we have 𝐈i(j)=j1\mathbf{I}_{i}(j)=j_{1}, and 𝐑i(j)=r^i,j1(t¯i)\mathbf{R}_{i}(j)=\hat{r}_{i,j_{1}}(\bar{t}_{i}), where r^i,j1(t¯i)\hat{r}_{i,j_{1}}(\bar{t}_{i}) satisfies the bound in Eq. (53).

Now for client ii to identify the globally best arm, it must eventually receive accurate information about every other arm-set [N]{j}\ell\in[N]\setminus\{j\} from its neighbors, despite the presence of adversaries in the network. The next lemma shows how the strong-robustness property in Definition 2 contributes to this cause.

Lemma 8.

Suppose Algo. 4 is run by every ii\in\mathcal{H}. Moreover, suppose the adversary model is ff-local, and 𝒢\mathcal{G} is strongly-robust w.r.t. 𝒞j,j[N]\mathcal{C}_{j},\forall j\in[N]. Consider any j[N]j\in[N]. Then, on the event \mathcal{E_{H}}, every client i𝒱𝒞ji\in\mathcal{V}\setminus\mathcal{C}_{j} eventually receives information about arm-set jj from at least 2f+12f+1 neighbors in 𝒩i\mathcal{N}_{i}.

Proof.

We prove this lemma by contradiction. Fix any j[N]j\in[N]. Suppose that on the event \mathcal{E_{H}}, there exists a non-empty set 𝒱𝒞j\mathcal{B}\subseteq\mathcal{V}\setminus\mathcal{C}_{j} of clients who never receive information about arm-set jj from 2f+12f+1 neighbors. Based on the strong-robustness property in Def. 2, there exists a client ii\in\mathcal{B} with at least 3f+13f+1 neighbors in 𝒱\mathcal{V}\setminus\mathcal{B}. Consider the set 𝒬i,j=𝒩i{𝒱}\mathcal{Q}_{i,j}=\mathcal{N}_{i}\cap\{\mathcal{V}\setminus\mathcal{B}\}. Based on the strong-robustness property and the ff-local assumption, |𝒬i,j|2f+1|\mathcal{Q}_{i,j}\cap\mathcal{H}|\geq 2f+1. Now a client kk in 𝒬i,j\mathcal{Q}_{i,j}\cap\mathcal{H} can either belong to 𝒞j\mathcal{C}_{j} or not. If k𝒬i,jk\in\mathcal{Q}_{i,j}\cap\mathcal{H} belongs to 𝒞j\mathcal{C}_{j}, then on the event \mathcal{E_{H}}, Algo. 1 will terminate at client kk. Thus, client kk will eventually transmit {𝐈k(j),𝐑k(j)}\{\mathbf{I}_{k}(j),\mathbf{R}_{k}(j)\} to every out-neighbor in 𝒩k+\mathcal{N}^{+}_{k}, including client ii. Even if k𝒬i,jk\in\mathcal{Q}_{i,j}\cap\mathcal{H} does not belong to 𝒞j\mathcal{C}_{j}, since kk\notin\mathcal{B}, client kk must have received information about arm-set jj from 2f+12f+1 in-neighbors. Thus, based on lines 2-4 of Algo. 4, it must transmit {𝐈k(j),𝐑k(j)}\{\mathbf{I}_{k}(j),\mathbf{R}_{k}(j)\} to client ii at some point of time. We conclude that on the event \mathcal{E_{H}}, client ii will eventually receive information about arm-set jj from every neighbor in 𝒬i,j\mathcal{Q}_{i,j}\cap\mathcal{H}. This leads to the desired contradiction and concludes the proof. ∎
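The flooding argument behind Lemma 8 can be illustrated with a small simulation; everything below (the complete graph, group sizes, and relay rule) is an illustrative toy, not the paper's Algo. 4. Honest clients of 𝒞j\mathcal{C}_{j} start activated, any other honest client activates for arm-set jj once it has heard about it from 2f+12f+1 distinct neighbors, and adversaries stay silent (one worst case for a pure relay rule):

```python
def simulate_flooding(n, group_j, adversaries, f):
    """Toy relay dynamics on the complete graph K_n (illustrative sketch)."""
    honest = set(range(n)) - set(adversaries)
    active = group_j & honest              # honest clients of C_j start activated
    heard = {i: set() for i in range(n)}   # neighbors each client has heard from

    for _ in range(n):                     # at most n activation waves on K_n
        for k in list(active):             # every active client transmits
            for i in honest:
                if i != k:
                    heard[i].add(k)
        for i in list(honest - active):    # activation rule of the lemma
            if len(heard[i]) >= 2 * f + 1:
                active.add(i)
    return active

n, f = 8, 1
group_j = {0, 1, 2, 3}   # |C_j| = 3f + 1
adversaries = {3}        # f-local on K_8: each client has at most f bad neighbors
activated = simulate_flooding(n, group_j, adversaries, f)
assert activated == set(range(n)) - adversaries  # every honest client activates
```

Here the adversary sits inside 𝒞j\mathcal{C}_{j}, yet the remaining 2f+12f+1 honest group members suffice to activate every honest client outside the group.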

Our next goal is to prove that on \mathcal{E_{H}}, for each j[N]j\in[N], every honest client i𝒱𝒞ji\in\mathcal{V}\setminus\mathcal{C}_{j} is able to correctly identify the best arm j1j_{1} in 𝒜j\mathcal{A}_{j}, and obtain a reasonable estimate of its mean rj1r_{j_{1}}. We establish these properties in the next lemma.

Lemma 9.

Suppose Algo. 4 is run by every ii\in\mathcal{H}. Moreover, suppose the adversary model is ff-local, and 𝒢\mathcal{G} is strongly-robust w.r.t. 𝒞j,j[N]\mathcal{C}_{j},\forall j\in[N]. On event \mathcal{E_{H}}, the following then hold j[N]\forall j\in[N]:

  • (i)

    𝐈i(j)=j1,i\mathbf{I}_{i}(j)=j_{1},\forall i\in\mathcal{H}.

  • (ii)

    |𝐑i(j)rj1|(3/8)Δj,i.|\mathbf{R}_{i}(j)-r_{j_{1}}|\leq(3/8)\Delta^{*}_{j},\forall i\in\mathcal{H}.

Proof.

Fix any j[N]j\in[N]. If i𝒞ji\in\mathcal{C}_{j}\cap\mathcal{H}, then each of the above claims hold based on the arguments we presented at the beginning of this section; see Eq. (53). Thus, we only need to prove these claims for clients in {𝒱𝒞j}\{\mathcal{V}\setminus\mathcal{C}_{j}\}\cap\mathcal{H}.

To do so, let us first associate a notion of “activation” with Algo. 4. For each client i𝒞ji\in\mathcal{C}_{j}\cap\mathcal{H}, we say that it gets activated when the local subroutine (Algo. 1) terminates at client ii. For each i{𝒱𝒞j}i\in\{\mathcal{V}\setminus\mathcal{C}_{j}\}\cap\mathcal{H}, we say it gets activated upon receiving information about arm-set jj from 2f+12f+1 neighbors in 𝒩i\mathcal{N}_{i}. From Lemmas 2 and 8, observe that all honest clients eventually get activated on the event \mathcal{E_{H}}. Accordingly, on the event \mathcal{E_{H}}, let τ1<τ2<<τK\tau_{1}<\tau_{2}<\cdots<\tau_{K} denote the activation times for clients in {𝒱𝒞j}\{\mathcal{V}\setminus\mathcal{C}_{j}\}\cap\mathcal{H}. (Note that multiple clients may get activated at the same time-step.) Moreover, let τs(j)\mathcal{H}^{(j)}_{\tau_{s}} represent those clients in {𝒱𝒞j}\{\mathcal{V}\setminus\mathcal{C}_{j}\}\cap\mathcal{H} that get activated at time-step τs,s[K]\tau_{s},s\in[K].

We prove the result by inducting on the activation times. For the base case, consider any client ii in the set τ1(j)\mathcal{H}^{(j)}_{\tau_{1}}. Recall that 𝒩i,j𝒩i\mathcal{N}^{\prime}_{i,j}\subseteq\mathcal{N}_{i} are the first 2f+12f+1 neighbors who report about arm-set jj to client ii. From the rules of Algo. 4, notice that an honest client kk transmits {𝐈k(j),𝐑k(j)}\{\mathbf{I}_{k}(j),\mathbf{R}_{k}(j)\} only when it gets activated. Thus, it must be that 𝒩i,j𝒞j\mathcal{N}^{\prime}_{i,j}\cap\mathcal{H}\subseteq\mathcal{C}_{j}\cap\mathcal{H}. Moreover, since the adversary model is ff-local, |𝒩i,j|f+1|\mathcal{N}^{\prime}_{i,j}\cap\mathcal{H}|\geq f+1. We conclude that at least f+1f+1 honest clients in 𝒞j\mathcal{C}_{j} report arm j1j_{1} to client ii. Based on the majority voting operation in line 3 of Algo. 4, it follows that 𝐈i(j)=j1\mathbf{I}_{i}(j)=j_{1}.

Observe also that on the event \mathcal{E_{H}}, any client k𝒩i,jk\in\mathcal{N}^{\prime}_{i,j} that reports an arm other than arm j1j_{1} to client ii must be adversarial. Following the same reasoning as in Lemma 7, we can then establish that (i) |𝒩¯i,j|2fi,j+1|\bar{\mathcal{N}}_{i,j}|\geq 2f_{i,j}+1, where fi,j=f|𝒩i,j|+|𝒩¯i,j|f_{i,j}=f-|\mathcal{N}^{\prime}_{i,j}|+|\bar{\mathcal{N}}_{i,j}|, and 𝒩¯i,j\bar{\mathcal{N}}_{i,j} represents those clients in 𝒩i,j\mathcal{N}^{\prime}_{i,j} that report arm j1j_{1}; and (ii) |𝒩¯i,j𝒥|fi,j|\bar{\mathcal{N}}_{i,j}\cap\mathcal{J}|\leq f_{i,j}. Using the same arguments used to arrive at Eq. (45), we can then show that

𝐑i(j)Conv({𝐑k(j)}k𝒩¯i,j)𝐑i(j)Conv({𝐑k(j)}k𝒞j),\mathbf{R}_{i}(j)\in\texttt{Conv}\left(\{\mathbf{R}_{k}(j)\}_{k\in\bar{\mathcal{N}}_{i,j}\cap\mathcal{H}}\right)\implies\mathbf{R}_{i}(j)\in\texttt{Conv}\left(\{\mathbf{R}_{k}(j)\}_{k\in{\mathcal{C}}_{j}\cap\mathcal{H}}\right), (54)

since 𝒩¯i,j𝒞j\bar{\mathcal{N}}_{i,j}\cap\mathcal{H}\subseteq{\mathcal{C}}_{j}\cap\mathcal{H}. For client ii, claim (ii) of the lemma then follows immediately from the fact that on event \mathcal{E_{H}}, |𝐑k(j)rj1|(3/8)Δj,k𝒞j|\mathbf{R}_{k}(j)-r_{j_{1}}|\leq(3/8)\Delta^{*}_{j},\forall k\in{\mathcal{C}}_{j}\cap\mathcal{H}. This completes the base case.

For the induction hypothesis, suppose for all is[m]τs(j)i\in\bigcup_{s\in[m]}\mathcal{H}^{(j)}_{\tau_{s}}, m[K1]m\in[K-1], it holds that 𝐈i(j)=j1\mathbf{I}_{i}(j)=j_{1}, and

𝐑i(j)Conv({𝐑k(j)}k𝒞j).\mathbf{R}_{i}(j)\in\texttt{Conv}\left(\{\mathbf{R}_{k}(j)\}_{k\in{\mathcal{C}}_{j}\cap\mathcal{H}}\right). (55)

Now consider a client iτm+1(j)i\in\mathcal{H}^{(j)}_{\tau_{m+1}}. Based on the rules of Algo. 4, and the definition of activation times, it must be that

𝒩i,j{𝒞j}{s[m]τs(j)}.\mathcal{N}^{\prime}_{i,j}\cap\mathcal{H}\subseteq\{\mathcal{C}_{j}\cap\mathcal{H}\}\cup\{\bigcup_{s\in[m]}\mathcal{H}^{(j)}_{\tau_{s}}\}.

Using the above fact, |𝒩i,j|f+1|\mathcal{N}^{\prime}_{i,j}\cap\mathcal{H}|\geq f+1, the induction hypothesis, and the same reasoning as in the base case, it is easy to see that 𝐈i(j)=j1\mathbf{I}_{i}(j)=j_{1}, and that 𝐑i(j)\mathbf{R}_{i}(j) satisfies the inclusion in Eq. (55). This completes the induction step and the proof. ∎

We are now ready to complete the proof of Theorem 3.

Proof.

(Theorem 3) To complete the proof of Theorem 3, it remains to argue that for each honest client ii, 𝐑i(1)>𝐑i(j),j[N]{1}\mathbf{R}_{i}(1)>\mathbf{R}_{i}(j),\forall j\in[N]\setminus\{1\}. By way of contradiction, suppose there exists some client ii\in\mathcal{H} and some j[N]{1}j\in[N]\setminus\{1\}, such that 𝐑i(j)𝐑i(1)\mathbf{R}_{i}(j)\geq\mathbf{R}_{i}(1). We then have

𝐑i(j)𝐑i(1)\displaystyle\mathbf{R}_{i}(j)\geq\mathbf{R}_{i}(1) (56)
(a)\displaystyle\overset{(a)}{\implies} rj1+38Δjr138Δ1\displaystyle r_{j_{1}}+\frac{3}{8}\Delta^{*}_{j}\geq r_{1}-\frac{3}{8}\Delta^{*}_{1}
\displaystyle\implies Δ1j138(Δ1+Δj)\displaystyle\Delta_{1j_{1}}\leq\frac{3}{8}(\Delta^{*}_{1}+\Delta^{*}_{j})
(b)\displaystyle\overset{(b)}{\implies} Δ1j138(σ1j+σj1)Δ1j1\displaystyle\Delta_{1j_{1}}\leq\frac{3}{8}\left(\sigma_{1j}+\sigma_{j1}\right)\Delta_{1j_{1}}
(c)\displaystyle\overset{(c)}{\implies} Δ1j134Δ1j1,\displaystyle\Delta_{1j_{1}}\leq\frac{3}{4}\Delta_{1j_{1}},

which leads to the desired contradiction. Here, (a) follows from the second claim of Lemma 9, (b) follows from the definition of arm-heterogeneity indices in Eq. (4), and (c) follows from the assumption on arm-heterogeneity in Eq. (8). This completes the proof. ∎
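A brute-force check of the chain in (56) (an illustrative sketch: random instances with σ1j+σj12\sigma_{1j}+\sigma_{j1}\leq 2 as in Eq. (8), local gaps set at the boundary allowed by Eq. (4), and estimates pushed to the worst case permitted by Lemma 9):

```python
import random

random.seed(0)
for _ in range(10_000):
    r_j1 = random.uniform(0.0, 1.0)
    delta = random.uniform(1e-3, 1.0)        # Delta_{1 j_1} = r_1 - r_{j_1}
    r_1 = r_j1 + delta
    # Heterogeneity indices with sigma_{1j} + sigma_{j1} <= 2, as in Eq. (8).
    s_1j = random.uniform(0.0, 2.0)
    s_j1 = random.uniform(0.0, 2.0 - s_1j)
    # Local gaps at the boundary allowed by Eq. (4).
    d1_star = s_1j * delta
    dj_star = s_j1 * delta
    # Worst-case estimates permitted by Lemma 9: off by (3/8) of the local gap.
    R_1 = r_1 - (3.0 / 8.0) * d1_star
    R_j = r_j1 + (3.0 / 8.0) * dj_star
    assert R_1 > R_j                          # the margin is at least delta/4

print("R_i(1) > R_i(j) on all sampled instances.")
```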