
Federated Linear Bandits
with Finite Adversarial Actions

Li Fan
University of Virginia
[email protected]
Ruida Zhou
Texas A&M University
[email protected]
Chao Tian
Texas A&M University
[email protected]
Cong Shen
University of Virginia
[email protected]
Abstract

We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem with finite adversarial action sets that may be different across clients. To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm, which extends the principles of the SupLinUCB and OFUL algorithms in linear contextual bandits. We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{dT})$, where $T$ is the total number of arm pulls from all clients, and $d$ is the ambient dimension of the linear model. This matches the minimax lower bound and thus is order-optimal (up to polylog terms). We study both asynchronous and synchronous cases and show that the communication cost can be controlled as $O(dM^{2}\log(d)\log(T))$ and $O(\sqrt{d^{3}M^{3}}\log(d))$, respectively. The FedSupLinUCB design is further extended to two scenarios: (1) variance-adaptive, where a total regret of $\tilde{O}(\sqrt{d\sum_{t=1}^{T}\sigma_{t}^{2}})$ can be achieved, with $\sigma_{t}^{2}$ being the noise variance of round $t$; and (2) adversarial corruption, where a total regret of $\tilde{O}(\sqrt{dT}+dC_{p})$ can be achieved, with $C_{p}$ being the total corruption budget. Experimental results corroborate the theoretical analysis and demonstrate the effectiveness of FedSupLinUCB on both synthetic and real-world datasets.

1 Introduction

In the canonical formulation of contextual bandits, a single player repeatedly makes arm-pulling decisions based on contextual information, with the goal of maximizing the long-term reward. With the emerging federated learning paradigm (McMahan et al., 2017), where multiple clients and a server jointly learn a global model, each client locally updating the model with its own data and the server only aggregating the local models periodically, researchers have started exploring contextual bandits algorithms in such a federated learning setting (Dubey and Pentland, 2020; Huang et al., 2021; Li and Wang, 2022a,b). This federated contextual bandits framework broadens the applicability of contextual bandits to practical scenarios such as recommender systems, clinical trials, and cognitive radio. In these applications, although the goal is still to maximize the cumulative reward for the overall system, decision-making and observations are naturally distributed at the participating clients.

Several intrinsic challenges arise with the federated contextual bandit formulation. One important issue is that besides regret, we should also take into account the communication cost, which is usually the system bottleneck. To reduce the communication cost while maintaining the same regret guarantee, the clients should transmit the necessary information to the server only when the local information has accumulated to the extent that it would affect the decision-making. Compared with the centralized contextual bandits, which have a linearly growing communication cost, algorithms for federated contextual bandits attempt to achieve a comparable regret with sub-linear communication cost.

Second, most existing studies on federated contextual bandits focus on the synchronous communication scenario (Huang et al., 2021; Li and Wang, 2022b), in which all participating clients first upload local information and then download updated global information from the server in each communication round. This stringent communication requirement is often not met in practice. A recent work by Li and Wang (2022a) studies the asynchronous federated linear bandit problem; however, communications for different clients are not independent in their approach, because an upload from one client may trigger the server to perform downloads for all clients. To address this issue, He et al. (2022a) propose FedLinUCB, which enables independent synchronizations between clients and the server.

Third, the majority of prior studies on federated linear bandits focus on the infinite-arm setting (Li and Wang, 2022a,b; He et al., 2022a) (see Section 2 for a detailed literature review). From a methodological point of view, these papers largely build on the OFUL principle (Abbasi-Yadkori et al., 2011). One notable exception is Huang et al. (2021), which studies synchronous communication with fixed contexts and proposes the Fed-PE algorithm based on the phased elimination G-optimal design (Lattimore and Szepesvári, 2020). To the best of our knowledge, no prior result exists for federated linear bandits with finite arms and time-evolving adversarial contexts, which is the focus of our work.

Table 1: Comparison of this paper with related works

| System | Action | Algorithm | Regret | Communication |
| --- | --- | --- | --- | --- |
| Single-player | infinite arm | OFUL (Abbasi-Yadkori et al., 2011) | $d\sqrt{T\log T}$ | N/A |
| Single-player | finite fixed arm | PE + G-optimal (Lattimore and Szepesvári, 2020) | $O(\sqrt{dT\log T})$ | N/A |
| Single-player | finite adversarial arm | SupLinUCB (Chu et al., 2011) | $O(\sqrt{dT\log^{3}T})$ | N/A |
| Federated (Async) | infinite arm | FedLinUCB (He et al., 2022a) | $O(d\sqrt{T}\log T)$ | $O(dM^{2}\log T)$ |
| Federated (Async) | infinite arm | Async-LinUCB (Li and Wang, 2022a) | $O(d\sqrt{T}\log T)$ | $O(dM^{2}\log T)$ |
| Federated (Sync) | infinite arm | DisLinUCB (Wang et al., 2019) | $O(d\sqrt{T}\log^{2}T)$ | $O(dM^{3/2})$ |
| Federated (Sync) | finite fixed arm | Fed-PE (Huang et al., 2021) | $O(\sqrt{dT\log T})$ | $O(d^{2}MK\log T)$ |
| Federated (Async) | finite adversarial arm | FedSupLinUCB (This work) | $O(\sqrt{dT\log^{3}T})$ | $O(dM^{2}\log d\log T)$ |
| Federated (Sync) | finite adversarial arm | FedSupLinUCB (This work) | $O(\sqrt{dT\log^{3}T})$ | $O(d^{3/2}M^{3/2}\log(d))$ |

$d$: the dimension of the unknown parameter; $M$: the number of clients; $K$: the number of finite actions; $T$: the total number of arm pulls from all clients.

Main contributions.

Our main contributions are summarized as follows.

  • We develop a general federated bandits framework, termed FedSupLinUCB, for solving the problem of federated linear contextual bandits with finite adversarial actions. FedSupLinUCB extends SupLinUCB (Chu et al., 2011; Ruan et al., 2021) and OFUL (Abbasi-Yadkori et al., 2011), two important principles in (single-player, centralized) linear bandits, to the federated bandits setting with a carefully designed layered successive screening.

  • We instantiate FedSupLinUCB with both asynchronous and synchronous client activities. For the former setting, we propose Async-FedSupLinUCB, where communication is triggered only when the cumulative local information impacts the exploration uncertainty to a certain extent. We prove that Async-FedSupLinUCB achieves $\tilde{O}(\sqrt{dT})$ regret with $O(dM^{2}\log d\log T)$ communication cost, which not only reduces the regret by a factor of $\sqrt{d}$ compared with previous results on asynchronous federated linear bandits with infinite arms, but also matches the minimax lower bound up to polylog terms, indicating that Async-FedSupLinUCB achieves order-optimal regret.

  • For synchronous communications, we propose Sync-FedSupLinUCB, which has a refined communication design where only certain layers are communicated, as opposed to the complete information. Sync-FedSupLinUCB achieves order-optimal regret of $\tilde{O}(\sqrt{dT})$ with horizon-independent communication cost $O(\sqrt{d^{3}M^{3}}\log d)$. Compared with the best previous result (Huang et al., 2021), which achieves the same order-optimal regret but only for fixed actions, we show that it is the finite actions that fundamentally determine the regret behavior in the federated linear bandits setting.

  • We further develop two extensions of FedSupLinUCB: (1) variance-adaptive FedSupLinUCB, for which a total regret of $\tilde{O}(\sqrt{d\sum_{t=1}^{T}\sigma_{t}^{2}})$ is achieved, where $\sigma_{t}^{2}$ is the noise variance at round $t$; and (2) adversarial corruption FedSupLinUCB, for which a total regret of $\tilde{O}(\sqrt{dT}+dC_{p})$ is achieved, where $C_{p}$ is the total corruption budget.

2 Related Works

The linear bandit model, as a generalization of finite-armed bandits with linear contextual information, has been extensively studied. The setting of infinite arm sets solved by LinUCB was analyzed in Dani et al. (2008); Abbasi-Yadkori et al. (2011), which achieves regret $\tilde{O}(d\sqrt{T})$ with an appropriate confidence width (Abbasi-Yadkori et al., 2011) and matches the lower bound (Dani et al., 2008) up to logarithmic factors. In contrast, algorithms like SupLinRel (Auer, 2002) and SupLinUCB (Chu et al., 2011) achieve $\tilde{O}(\sqrt{dT})$ in the setting of finite time-varying adversarial arm sets when $K\ll 2^{d}$, with a lower bound of $\Omega(\sqrt{dT})$ (Chu et al., 2011). The SupLinUCB algorithm was later optimized to match the lower bound up to iterated logarithmic factors in Li et al. (2019). As a special case of the finite-arm setting, if the arm set is time-invariant, an elimination-based algorithm (Lattimore and Szepesvári, 2020) via G-optimal design can be applied to achieve similar optimal performance.

The federated linear bandits problem has been studied under the settings of an infinite arm set (Dubey and Pentland, 2020; Li et al., 2020; Li and Wang, 2022a) and a time-invariant finite arm set (Huang et al., 2021), while the time-varying finite arm set setting has not been well explored. A finite time-varying arm set has many meaningful practical applications, such as recommender systems (Li et al., 2010; Chu et al., 2011), and the distributed (federated) nature of these applications naturally falls within the federated linear bandits problem with finite time-varying arms. This paper fills the gap by generalizing the SupLinUCB algorithm to the federated setting.

We study both the asynchronous setting (Li and Wang, 2022a; He et al., 2022a), where clients are active on their own and full participation is not required, and the synchronous setting (Shi et al., 2021; Dubey and Pentland, 2020), where all clients make decisions at each round and a communication round requires all clients to upload new information to the server and download the updated information. We design algorithms that reduce the communication cost while maintaining optimal regret. Technically, the communication cost is associated with algorithmic adaptivity, since less adaptivity requires fewer updates and thus fewer communication rounds. The algorithmic adaptivity of linear bandits algorithms was studied in the single-player setting (Han et al., 2020; Ruan et al., 2021), and was also considered in the federated setting (Wang et al., 2019; Huang et al., 2021; Salgia and Zhao, 2023).

3 System Model and Preliminaries

3.1 Problem Formulation

We consider a federated linear contextual bandits model with $K$ finite but possibly time-varying arms. The model consists of $M$ clients and one server in a star-shaped communication network. The clients jointly solve a linear bandit problem by collecting local information and communicating with the central server through the star-shaped network in a federated manner, with no direct communication among clients. The only function of the server is to aggregate received client information and to send updated information back to clients; it cannot directly play the bandits game.

Specifically, a set of clients $I_{t}\subseteq[M]$ is active at round $t$. Each client $i\in I_{t}$ receives $K$ arms (actions to take) associated with contexts $\{x_{t,a}^{i}\}_{a\in[K]}\subset\mathbb{R}^{d}$ with $\|x_{t,a}^{i}\|_{2}\leq 1$. We adopt the oblivious adversarial setting, where all contexts are chosen beforehand and do not depend on previous game observations. Client $i$ then pulls an arm $a_{t}^{i}\in[K]$ based on the information collected locally as well as previously communicated from the server. A reward $r_{t,a_{t}^{i}}^{i}=\theta^{\top}x_{t,a_{t}^{i}}^{i}+\epsilon_{t}$ is revealed privately to client $i$, where $\theta\in\mathbb{R}^{d}$ is an unknown weight vector with $\|\theta\|_{2}\leq 1$ and $\epsilon_{t}$ is an independent $1$-sub-Gaussian noise. At the end of round $t$, depending on the communication protocol, client $i$ may exchange the collected local information with the server so that it can update the global information.

We aim to design algorithms to guide the clients' decision-making and overall communication behaviors. We analyze two patterns of client activity: 1) synchronous, where all $M$ clients are active at each round; and 2) asynchronous, where one client is active at each round. For the latter case, we further assume that client activity is independent of the data and history. Denote by $T_{i}$ the number of rounds in which client $i$ is active. In the former case, $T_{i}=T_{j}$ for all $i,j\in[M]$, while in the latter case, the $T_{i}$ may differ among clients. We define $T=\sum_{i=1}^{M}T_{i}$ as the total number of arm pulls from all clients.

The performance is measured under two metrics, total regret and communication cost, which concern the decision-making effectiveness and the communication efficiency, respectively. Denote by $P_{T}^{i}=\{t\in[T]\mid i\in I_{t}\}$ the set of time indices at which client $i$ is active, with $|P_{T}^{i}|=T_{i}$. The total regret is defined as

R_{T}=\sum_{i=1}^{M}R_{T}^{i}=\sum_{i=1}^{M}\mathbb{E}\left[\sum\nolimits_{t\in P_{T}^{i}}\left(r^{i}_{t,a_{t}^{i,*}}-r^{i}_{t,a_{t}^{i}}\right)\right], \qquad (1)

where $a_{t}^{i,*}=\arg\max_{a\in[K]}\theta^{\top}x_{t,a}^{i}$. We define the communication cost as the total number of communication rounds between the clients and the server.
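For concreteness, the following minimal sketch (an illustration under our own bookkeeping assumptions, not the paper's code; numpy is assumed) shows how the total regret in (1) would be accumulated in a simulation:

```python
import numpy as np

def cumulative_regret(theta, contexts, chosen):
    """Accumulate the total regret in (1) over all arm pulls.

    theta:    (d,) true parameter vector.
    contexts: list over arm pulls of (K, d) arrays, the contexts
              observed by the active client in that round.
    chosen:   list of chosen arm indices a_t (one per arm pull).
    """
    regret = 0.0
    for X, a in zip(contexts, chosen):
        means = X @ theta                 # expected reward of every arm
        regret += means.max() - means[a]  # optimal minus chosen
    return regret
```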

3.2 Preliminaries

Information encoding.

In the linear bandits setting (federated or not), the information a client acquires is usually encoded by the gram matrix and the action-reward vector. Specifically, when the client has observed $n$ action-reward pairs $\{(x_{t},r_{t})\}_{t=1}^{n}$, the information is encoded by the matrix $A_{n}=\sum_{t=1}^{n}x_{t}x_{t}^{\top}$ and the vector $b_{n}=\sum_{t=1}^{n}r_{t}x_{t}$. Denote by $\operatorname{Encoder}(\cdot)$ this encoding function, i.e., $A_{n},b_{n}\leftarrow\operatorname{Encoder}(\{x_{t},r_{t}\}_{t=1}^{n})$.
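A direct sketch of this encoding (the `encoder` helper below is a hypothetical name mirroring $\operatorname{Encoder}(\cdot)$; numpy is assumed):

```python
import numpy as np

def encoder(pairs, d):
    """Encode observed (context, reward) pairs into (A_n, b_n).

    pairs: iterable of (x, r) with x a (d,) context and r a scalar reward.
    Returns the gram matrix A_n = sum x x^T and the vector b_n = sum r x.
    """
    A = np.zeros((d, d))
    b = np.zeros(d)
    for x, r in pairs:
        A += np.outer(x, x)
        b += r * x
    return A, b
```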

Communication criterion.

Communication in our proposed framework is data-dependent, in the same spirit as the "doubling trick" introduced in Abbasi-Yadkori et al. (2011) to reduce the computation complexity in single-player linear bandits. The key idea is that communication is triggered only when the cumulative local information, measured by the determinant of the gram matrix $A_{n}$, affects the exploration uncertainty to such an extent that the client needs to communicate with the server. Detailed communication protocols are presented with each algorithm design.
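As an illustration, a hedged sketch of such a determinant-based trigger (the `should_communicate` helper is hypothetical; it matches the criterion later used in Line 7 of Algorithm 3, with `slogdet` used for numerical stability):

```python
import numpy as np

def should_communicate(A, delta_A, C):
    """Trigger a sync once the new local data delta_A inflates
    det(A) by more than a factor of (1 + C)."""
    _, logdet_old = np.linalg.slogdet(A)
    _, logdet_new = np.linalg.slogdet(A + delta_A)
    return logdet_new - logdet_old > np.log(1.0 + C)
```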

Synchronization procedure.

Denote by $\operatorname{Sync}()$ a routine in which $n$ clients (client $1$, \ldots, client $n$) first communicate their local gram matrices and action-reward vectors to the server; the server then aggregates them into one gram matrix (and one action-reward vector) and transmits the result back to the $n$ clients. Specifically, each client $i$ holds newly observed local information $(\Delta A^{i},\Delta b^{i})$, the difference between the client's current information $(A^{i},b^{i})$ and the information at the last synchronization; in other words, $(\Delta A^{i},\Delta b^{i})$ is the information that has not yet been communicated to the server. The server, after receiving the local information $\{(\Delta A^{i},\Delta b^{i})\}_{i=1}^{n}$, updates the server-side information $(A^{ser},b^{ser})$ by $A^{ser}\leftarrow A^{ser}+\sum_{i=1}^{n}\Delta A^{i}$, $b^{ser}\leftarrow b^{ser}+\sum_{i=1}^{n}\Delta b^{i}$ and sends it back to each of the $n$ clients. Each client $i$ then updates its local information by $A^{i}\leftarrow A^{ser}$, $b^{i}\leftarrow b^{ser}$. The procedure is formally presented in Algorithm 1.

Algorithm 1 $\operatorname{Sync}$($s$, server, client $1$, \ldots, client $n$)
1:for $i=1,2,\ldots,n$ do ▷ Client-side local information upload
2:     Client $i$ sends the new local layer-$s$ information $(\Delta A^{i}_{s},\Delta b^{i}_{s})$ to the server
3:end for
4:Update the server's layer-$s$ information: ▷ Server-side aggregation and distribution
$A_{s}^{ser}\leftarrow A_{s}^{ser}+\sum\nolimits_{i=1}^{n}\Delta A_{s}^{i},\quad b_{s}^{ser}\leftarrow b_{s}^{ser}+\sum\nolimits_{i=1}^{n}\Delta b_{s}^{i}$
5:Send the server information $A_{s}^{ser},b_{s}^{ser}$ back to all clients
6:for $i=1,2,\ldots,n$ do
7:     $A_{s}^{i}\leftarrow A_{s}^{ser}$, $b_{s}^{i}\leftarrow b_{s}^{ser}$, $\Delta A_{s}^{i}\leftarrow 0$, $\Delta b_{s}^{i}\leftarrow 0$ ▷ Client $i$ updates its local information
8:end for
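The following is a minimal Python sketch of Algorithm 1; the data layout (server and clients as dictionaries of per-layer numpy arrays) is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def sync(server, clients, s):
    """One call of the Sync routine (Algorithm 1) for layer s.

    server:  dict with per-layer lists server['A'], server['b'].
    clients: list of dicts with per-layer lists 'A', 'b', 'dA', 'db'.
    """
    # Server-side aggregation of all new local layer-s information.
    for c in clients:
        server['A'][s] += c['dA'][s]
        server['b'][s] += c['db'][s]
    # Each client downloads the aggregate and resets its deltas.
    for c in clients:
        c['A'][s] = server['A'][s].copy()
        c['b'][s] = server['b'][s].copy()
        c['dA'][s] = np.zeros_like(c['dA'][s])
        c['db'][s] = np.zeros_like(c['db'][s])
```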

4 The FedSupLinUCB Framework

In this section, we present FedSupLinUCB, a general framework for federated linear bandits with finite oblivious adversarial actions. Two instances of this framework, asynchronous and synchronous, are discussed in the subsequent sections.

Building block: SupLinUCB.

As the name suggests, the proposed FedSupLinUCB framework is built upon the principle of SupLinUCB (Chu et al., 2011; Ruan et al., 2021). The information $(A,b)$ is useful in the sense that the reward corresponding to an action $x$ can be estimated within the confidence interval $x^{\top}\hat{\theta}\pm\alpha\|x\|_{A^{-1}}$, where $\hat{\theta}=A^{-1}b$. It is shown in Abbasi-Yadkori et al. (2011) that in linear bandits (even with an infinite number of actions) with $\alpha=\tilde{O}(\sqrt{d})$, the true reward is within the confidence interval with high probability. Moreover, if the rewards in the action-reward vector $b$ are mutually independent, $\alpha$ can be reduced to $O(1)$. The former choice of $\alpha$ naturally guarantees $\tilde{O}(d\sqrt{T})$ regret; however, to achieve regret $\tilde{O}(\sqrt{dT})$, it is critical to keep $\alpha=O(1)$. This is fulfilled by the SupLinUCB algorithm (Chu et al., 2011), recently improved by Ruan et al. (2021). The key intuition is to successively refine an action set that contains the optimal action, where the estimation precision of the sets is geometrically strengthened. Specifically, the algorithm maintains $(S+1)$ layers of information pairs $\{(A_{s},b_{s})\}_{s=0}^{S}$, and the rewards in the action-reward vectors are mutually independent within each layer, except for layer $0$. The confidence radius for each layer $s$ is $\overline{w}_{s}=2^{-s}d^{1.5}/\sqrt{T}$.
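For illustration, the confidence interval above can be computed as follows (a sketch; `ucb_interval` is a hypothetical helper):

```python
import numpy as np

def ucb_interval(A, b, x, alpha):
    """Reward estimate and confidence width for action x given (A, b):
    x^T theta_hat +/- alpha * ||x||_{A^{-1}}, with theta_hat = A^{-1} b."""
    theta_hat = np.linalg.solve(A, b)
    width = alpha * np.sqrt(x @ np.linalg.solve(A, x))
    return x @ theta_hat, width
```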

Algorithm 2 S-LUCB
1:Initialization: $S=\lceil\log d\rceil$, $\overline{w}_{0}=d^{1.5}/\sqrt{T}$, $\overline{w}_{s}\leftarrow 2^{-s}\overline{w}_{0},\forall s\in[1:S]$.
2:$\alpha_{0}=1+\sqrt{d\ln(2M^{2}T/\delta)}$, $\alpha_{s}\leftarrow 1+\sqrt{2\ln(2KMT\ln d/\delta)},\forall s\in[1:S]$
3:Input: client $i$ (with local information $A^{i},b^{i},\Delta A^{i},\Delta b^{i}$), context set $\{x_{t,1}^{i},\ldots,x_{t,K}^{i}\}$
4:$A_{t,s}^{i}\leftarrow A_{s}^{i}+\Delta A_{s}^{i}$, $b_{t,s}^{i}\leftarrow b_{s}^{i}+\Delta b_{s}^{i}$; or $A_{t,s}^{i}\leftarrow A_{s}^{i}$, $b_{t,s}^{i}\leftarrow b_{s}^{i}$ for lazy update
5:$\hat{\theta}_{s}\leftarrow(A^{i}_{t,s})^{-1}b^{i}_{t,s}$, $\hat{r}_{t,s,a}^{i}=\hat{\theta}_{s}^{\top}x_{t,a}^{i}$, $w_{t,s,a}^{i}\leftarrow\alpha_{s}\|x^{i}_{t,a}\|_{(A^{i}_{t,s})^{-1}}$, $\forall s\in[0:S],\forall a\in[K]$
6:$s\leftarrow 0$; $\mathcal{A}_{0}\leftarrow\{a\in[K]\mid\hat{r}_{t,0,a}^{i}+w_{t,0,a}^{i}\geq\max_{a'\in[K]}(\hat{r}_{t,0,a'}^{i}-w_{t,0,a'}^{i})\}$ ▷ Initial screening
7:repeat ▷ Layered successive screening
8:     if $s=S$ then
9:         Choose action $a_{t}^{i}$ arbitrarily from $\mathcal{A}_{S}$
10:     else if $w_{t,s,a}^{i}\leq\overline{w}_{s}$ for all $a\in\mathcal{A}_{s}$ then
11:         $\mathcal{A}_{s+1}\leftarrow\{a\in\mathcal{A}_{s}\mid\hat{r}_{t,s,a}^{i}\geq\max_{a'\in\mathcal{A}_{s}}(\hat{r}_{t,s,a'}^{i})-2\overline{w}_{s}\}$; $s\leftarrow s+1$
12:     else
13:         $a_{t}^{i}\leftarrow\arg\max_{\{a\in\mathcal{A}_{s},\,w_{t,s,a}^{i}>\overline{w}_{s}\}}w_{t,s,a}^{i}$
14:     end if
15:until action $a_{t}^{i}$ is found
16:Take action $a_{t}^{i}$ and receive reward $r_{t,a_{t}^{i}}^{i}$
17:$\Delta A_{s}^{i}\leftarrow\Delta A_{s}^{i}+x_{t,a_{t}^{i}}^{i}x_{t,a_{t}^{i}}^{i\top}$, $\Delta b_{s}^{i}\leftarrow\Delta b_{s}^{i}+r_{t,a_{t}^{i}}^{i}x_{t,a_{t}^{i}}^{i}$ ▷ Update local information
18:Return layer index $s$

FedSupLinUCB.

S-LUCB, presented in Algorithm 2, combines the principles of SupLinUCB and OFUL (Abbasi-Yadkori et al., 2011) and is the core subroutine of FedSupLinUCB. We maintain $S=\lceil\log d\rceil$ information layers; the estimation accuracy starts from $d^{1.5}/\sqrt{T}$ at layer $0$ and halves as the layer index increases, so that it takes $\Theta(\log d)$ layers to reach the sufficient accuracy of $\sqrt{d/T}$ and achieve the minimax-optimal regret.

When client $i$ is active, the input parameters $(A^{i},b^{i})$ contain the information received from the server at the last communication round, and $(\Delta A^{i},\Delta b^{i})$ is the new local information collected between two consecutive communication rounds. $\{x_{t,1}^{i},\ldots,x_{t,K}^{i}\}$ is the set of contexts observed in this round. Client $i$ can estimate the unknown parameter $\theta$ either with all available information or with a lazy update; this choice depends on the communication protocol and will be elaborated later. During the decision-making process, client $i$ first performs arm elimination at layer $0$ to bootstrap the accuracy parameters. It then enters the layered successive screening of the SupLinUCB algorithm, sequentially eliminating suboptimal arms based on their empirical means and confidence widths. After taking action $a_{t}^{i}$ and receiving the corresponding reward $r_{t,a_{t}^{i}}^{i}$, client $i$ updates its local information set $(\Delta A_{s}^{i},\Delta b_{s}^{i})$ by aggregating the context into the layer $s$ in which the action was taken, before returning the layer index $s$. A minimal sketch of the decision step is given below.
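The sketch below mirrors the decision part of Algorithm 2 (Lines 5-15) in Python; the data layout (a list of per-layer $(A_{s},b_{s},\alpha_{s})$ tuples) is an illustrative assumption:

```python
import numpy as np

def s_lucb_select(layers, X, w_bar):
    """Layered successive screening (decision part of Algorithm 2).

    layers: list over s = 0..S of (A_s, b_s, alpha_s) tuples.
    X:      (K, d) array of the K observed contexts.
    w_bar:  per-layer accuracy targets, w_bar[s] = 2**(-s) * d**1.5 / sqrt(T).
    Returns (chosen arm index, layer index in which it was chosen).
    """
    def estimates(s, idx):
        A, b, alpha = layers[s]
        theta = np.linalg.solve(A, b)
        r_hat = X[idx] @ theta
        sols = np.linalg.solve(A, X[idx].T)                 # (d, k)
        w = alpha * np.sqrt(np.einsum('kd,dk->k', X[idx], sols))
        return r_hat, w

    cand = np.arange(len(X))
    r_hat, w = estimates(0, cand)
    cand = cand[r_hat + w >= np.max(r_hat - w)]             # initial screening
    s, S = 0, len(layers) - 1
    while True:
        r_hat, w = estimates(s, cand)
        if s == S:
            return cand[0], s                               # arbitrary arm in A_S
        if np.all(w <= w_bar[s]):                           # widths small: refine set
            cand, s = cand[r_hat >= r_hat.max() - 2 * w_bar[s]], s + 1
        else:                                               # explore the widest arm
            wide = np.flatnonzero(w > w_bar[s])
            return cand[wide[np.argmax(w[wide])]], s
```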

5 Asynchronous FedSupLinUCB

In the asynchronous setting, only one client is active in each round. Note that global synchronization and coordination are not required, and all inactive clients are idle.

5.1 Algorithm

We first initialize the information for all clients and the server (gram matrices and action-reward vectors) in each layer $s\in[0:S]$. We assume only one client $i_{t}$ is active at round $t$. This is without loss of generality: if multiple clients are active, we can queue them up and activate them in turn. More discussion of this equivalence can be found in He et al. (2022a) and Li and Wang (2022a). The active client chooses an action, receives a reward, updates the local information matrices of layer $s$ with a lazy update according to S-LUCB, and decides whether communication with the server is needed by the criterion in Line 7 of Algorithm 3. If communication is triggered, client $i_{t}$ is synchronized with the server by Algorithm 1.

Algorithm 3 Async-FedSupLinUCB
1:Initialization: $T$, $C$, $S=\lceil\log d\rceil$
2:$\{A_{s}^{ser}\leftarrow I_{d},\ b_{s}^{ser}\leftarrow 0\mid s\in[0:S]\}$ ▷ Server initialization
3:$\{A_{s}^{i}\leftarrow I_{d},\ \Delta A_{s}^{i},b_{s}^{i},\Delta b_{s}^{i}\leftarrow 0\mid s\in[0:S],i\in[M]\}$ ▷ Client initialization
4:for $t=1,2,\cdots,T$ do
5:     Client $i_{t}=i$ is active and observes $K$ contexts $\{x_{t,1}^{i},x_{t,2}^{i},\cdots,x_{t,K}^{i}\}$
6:     $s\leftarrow$ S-LUCB(client $i$, $\{x_{t,1}^{i},x_{t,2}^{i},\cdots,x_{t,K}^{i}\}$) with lazy update
7:     if $\frac{\det(A_{s}^{i}+\Delta A_{s}^{i})}{\det(A_{s}^{i})}>(1+C)$ then
8:         $\operatorname{Sync}$($s$, server, client $i$) for each $s\in[0:S]$
9:     end if
10:end for

5.2 Performance Analysis

Theorem 5.1.

For any $0<\delta<1$, if we run Algorithm 3 with $C=1/M^{2}$, then with probability at least $1-\delta$, the regret of the algorithm is bounded as $R_{T}\leq\tilde{O}\big(\sqrt{d\sum_{i=1}^{M}T_{i}}\big)=\tilde{O}(\sqrt{dT})$. Moreover, the corresponding communication cost is bounded by $O(dM^{2}\log d\log T)$.

Remark 1. The minimax lower bound of the expected regret for linear contextual bandits with $K$ adversarial actions is $\Omega(\sqrt{dT})$, given in Chu et al. (2011). Theorem 5.1 indicates that Async-FedSupLinUCB achieves order-optimal regret (up to polylog terms) with $O(dM^{2}\log d\log T)$ communication cost. To the best of our knowledge, this is the first algorithm that achieves (near) optimal regret in federated linear bandits with finite adversarial actions.

Remark 2. Without any communication, each client would run SupLinUCB (Chu et al., 2011) locally for $T_{i}$ rounds and achieve regret of order $\tilde{O}(\sqrt{dT_{i}})$. The total regret of the $M$ clients is therefore upper bounded by $R_{T}\leq\sum_{i=1}^{M}\sqrt{dT_{i}}\,\text{polylog}(T)\leq\sqrt{dM\sum_{i=1}^{M}T_{i}}\,\text{polylog}(T)$, where the last inequality (by Cauchy-Schwarz) becomes an equality when $T_{i}=T_{j}$ for all $i,j\in[M]$. Compared with running $M$ independent SupLinUCB instances locally, Async-FedSupLinUCB yields an average per-client gain of $1/\sqrt{M}$, demonstrating that communication in the federated system speeds up local linear bandits decision-making at the clients.

Remark 3. Most previous federated linear bandits works consider the infinite-action setting based on the LinUCB principle (Abbasi-Yadkori et al., 2011). Async-FedSupLinUCB considers a finite adversarial action setting and achieves a $\sqrt{d}$ reduction in the regret bound. Fed-PE, proposed in Huang et al. (2021), also considers the finite-action setting, but with fixed action sets. We generalize their formulation to a more challenging scenario, where the finite action set can be chosen adversarially. The regret order is the same as that of Fed-PE (ignoring polylog terms), indicating that it is finite actions, as opposed to fixed actions, that fundamentally lead to the $\sqrt{d}$ regret improvement in the federated linear bandits setting.

Communication cost analysis of Async-FedSupLinUCB.

We sketch the proof for the communication cost bound in Theorem 5.1 in the following, while deferring the detailed proofs for the regret and the communication cost to Appendix C.

We first study the communication cost triggered by a given layer $s$. Denote by $A_{t,s}^{ser}$ the gram matrix at the server, aggregated from the gram matrices uploaded by all clients up to round $t$. Define $T_{n,s}=\min\{t\in[T]\mid\det(A_{t,s}^{ser})\geq 2^{n}\}$ for each $n\geq 0$, and divide the rounds into epochs $\{T_{n,s},T_{n,s}+1,\cdots,\min(T_{n+1,s}-1,T)\}$ for each $n\geq 0$. The number of communications triggered by layer $s$ within any epoch is upper bounded by $2(M+1/C)$ (see Lemma C.1), and the number of non-empty epochs is at most $d\log(1+T/d)$ by Lemma A.1. Since there are $S=\lceil\log d\rceil$ layers and synchronization among all layers is performed once communication is triggered by any layer (Line 8 in Algorithm 3), the total communication cost is upper bounded by $O(d(M+1/C)\log d\log T)$. Plugging in $C=1/M^{2}$ proves the result.

We note that although choosing a larger CC would trigger fewer communications, the final choice of C=1/M2C=1/M^{2} takes into consideration both the regret and the communication cost, i.e., to achieve a small communication cost while maintaining an order-optimal regret.
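The counting argument above can be summarized in one line (a sketch of the arithmetic, not a formal statement):

\underbrace{\lceil\log d\rceil}_{\text{layers}}\times\underbrace{d\log(1+T/d)}_{\text{non-empty epochs}}\times\underbrace{2(M+1/C)}_{\text{comms per epoch}}\;\overset{C=1/M^{2}}{=}\;O\big(d(M+M^{2})\log d\log T\big)=O(dM^{2}\log d\log T).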

6 Synchronous FedSupLinUCB

In the synchronous setting, all clients are active and make decisions at each round. Though this can be viewed as a special case of the asynchronous scenario (with clients active and pulling arms in a round-robin manner), the key difference, beyond all clients being active at each round, is that information updates are broadcast: when any client meets the communication criterion, all clients upload their local information to the server and download the updated matrices. This leads to a higher cost per communication round, but since all clients are known to participate, the communicated information is well utilized by every client. This is in sharp contrast to the asynchronous setting, where broadcasting updated information to all clients is not worthwhile, since only one client is active in each round. To mitigate the total communication cost, we use a more refined communication criterion that enables a time-independent communication cost.

6.1 The Algorithm

The Sync-FedSupLinUCB algorithm lets each client make decisions via the S-LUCB subroutine. Note that decision-making is based on all available local information, instead of the lazy update used in Async-FedSupLinUCB. The communication criterion involves the number of rounds since the last communication, which forces communication so as to prevent local data from becoming obsolete. A layer may trigger the criterion either because the local client has gathered enough new data or because it has not communicated with the server for too long. We collect these layers in CommLayers and synchronize all clients with the server.

Algorithm 4 Sync-FedSupLinUCB
1:Initialization: $T_{c}$, $D$, $S=\lceil\log d\rceil$, $t_{last}^{s}\leftarrow 0,\forall s\in[0:S]$, CommLayers $\leftarrow\emptyset$.
2:$\{A_{s}^{ser}\leftarrow I_{d},\ b_{s}^{ser}\leftarrow 0\mid s\in[0:S]\}$ ▷ Server initialization
3:$\{A_{s}^{i}\leftarrow I_{d},\ \Delta A_{s}^{i},b_{s}^{i},\Delta b_{s}^{i}\leftarrow 0\mid s\in[0:S],i\in[M]\}$ ▷ Client initialization
4:for $t=1,2,\cdots,T_{c}$ do
5:     for $i=1,2,\cdots,M$ do
6:         Client $i$ is active and observes $K$ contexts $\{x_{t,1}^{i},x_{t,2}^{i},\cdots,x_{t,K}^{i}\}$
7:         $s\leftarrow$ S-LUCB(client $i$, $\{x_{t,1}^{i},x_{t,2}^{i},\cdots,x_{t,K}^{i}\}$)
8:         if $(t-t_{last}^{s})\log\frac{\det(A_{s}^{i}+\Delta A_{s}^{i})}{\det(A_{s}^{i})}>D$ then
9:              Add $s$ to CommLayers
10:         end if
11:     end for
12:     for $s\in$ CommLayers do
13:         $\operatorname{Sync}$($s$, server, clients $[M]$); $t_{last}^{s}\leftarrow t$
14:     end for
15:     CommLayers $\leftarrow\emptyset$
16:end for

6.2 Performance Analysis

Theorem 6.1.

For any $0<\delta<1$, if we run Algorithm 4 with $D=\frac{T_{c}\log T_{c}}{d^{2}M}$, then with probability at least $1-\delta$, the regret of the algorithm is bounded as $R_{T}\leq\tilde{O}(\sqrt{dMT_{c}})$, where $T_{c}$ is the number of arm pulls per client. Moreover, the corresponding communication cost is bounded by $O(\sqrt{d^{3}M^{3}}\log d)$.

Remark 4. Theorem 6.1 demonstrates that Sync-FedSupLinUCB also achieves the minimax regret lower bound, while the communication cost is independent of $T_{c}$, which is particularly beneficial for large $T_{c}$. Note that the total number of arm pulls in the synchronous scenario is $T=MT_{c}$, while in the asynchronous setting we have $T=\sum_{i=1}^{M}T_{i}$.

Communication cost analysis of Sync-FedSupLinUCB.

We sketch the proof for the communication cost bound in Theorem 6.1 below, while deferring the detailed proofs for the regret and the communication cost to Appendix D.

We call a chunk of consecutive rounds without communication of layer-$s$ information (except the last round) an epoch. Information in layer $s$ is collected locally by each client and synchronized at the end of the epoch, after which the next epoch starts. Denote by $A_{p,s}^{all}$ the synchronized gram matrix at the end of the $p$-th epoch. For any value $\beta>0$, there are at most $\lceil\frac{T_{c}}{\beta}\rceil$ epochs containing more than $\beta$ rounds, by the pigeonhole principle. If the $p$-th epoch contains fewer than $\beta$ rounds, then $\log(\frac{\det(A_{p,s}^{all})}{\det(A_{p-1,s}^{all})})>\frac{D}{\beta}$ by the communication criterion, and we also have $\sum_{p=1}^{P}\log\frac{\det(A_{p,s}^{all})}{\det(A_{p-1,s}^{all})}\leq R_{s}=O(d\log(T_{c}))$ (see Equation 6). Hence the number of epochs containing fewer than $\beta$ rounds is at most $O(\lceil\frac{R_{s}}{D/\beta}\rceil)$. Noting that $D=\frac{T_{c}\log(T_{c})}{d^{2}M}$, the total number of epochs for layer $s$ is at most $\lceil\frac{T_{c}}{\beta}\rceil+\lceil\frac{R_{s}\beta}{D}\rceil=O(\sqrt{\frac{T_{c}R_{s}}{D}})=O(\sqrt{d^{3}M})$ by taking $\beta=\sqrt{\frac{DT_{c}}{R_{s}}}$. The total communication cost is thus upper bounded by $O(SM\sqrt{d^{3}M})=O(\log(d)\sqrt{d^{3}M^{3}})$.
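The balancing step can be written out explicitly (a sketch of the arithmetic): the epoch-count bound

\Big\lceil\frac{T_{c}}{\beta}\Big\rceil+\Big\lceil\frac{R_{s}\beta}{D}\Big\rceil\ \text{is minimized (up to constants) at}\ \beta=\sqrt{\frac{DT_{c}}{R_{s}}},\quad\text{giving}\quad O\Big(\sqrt{\frac{T_{c}R_{s}}{D}}\Big)=O\Big(\sqrt{T_{c}\cdot d\log T_{c}\cdot\frac{d^{2}M}{T_{c}\log T_{c}}}\Big)=O\big(\sqrt{d^{3}M}\big).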

7 Extensions of FedSupLinUCB

In this section, we extend the FedSupLinUCB algorithm to address two distinct settings in federated systems: scenarios characterized by heterogeneous variances, and those affected by adversarial corruptions.

7.1 Federated Heteroscedastic Linear Bandits

We have so far focused on federated linear bandits with $1$-sub-Gaussian reward noises. In this section, we adapt Async-FedSupLinUCB to the case where the reward noises have heterogeneous variances, which extends the heteroscedastic linear bandits studied in Zhou et al. (2021); Zhou and Gu (2022) to the asynchronous federated setting, where one client is active at a time. Specifically, the reward noises $\{\epsilon_{t}\}_{t\in[T]}$ are independent with $|\epsilon_{t}|\leq R$, $\mathbb{E}[\epsilon_{t}]=0$ and $\mathbb{E}[\epsilon_{t}^{2}]\leq\sigma_{t}^{2}$, where $\sigma_{t}$ is known to the active client.

We propose a variance-adaptive Async-FedSupLinUCB and analyze its regret and communication cost in the theorem below; the algorithm and proof details are deferred to Appendix E due to space constraints. The regret is significantly smaller than that of Async-FedSupLinUCB when the variances $\{\sigma_{t}^{2}\}$ are small.

Theorem 7.1.

For any $0<\delta<1$, if we run the variance-adaptive Async-FedSupLinUCB algorithm in Appendix E with $C=1/M^{2}$, then with probability at least $1-\delta$, the regret is bounded as $R_{T}\leq\tilde{O}(\sqrt{d\sum_{t=1}^{T}\sigma_{t}^{2}})$, and the communication cost is bounded by $O(dM^{2}\log^{2}T)$.

7.2 Federated Linear Bandits with Corruption

We further explore asynchronous federated linear bandits with adversarial corruptions, where an adversary inserts a corruption $c_{t}$ into the reward $r_{t}$ of the active client at round $t$. The total corruption is bounded by $\sum_{t=1}^{T}|c_{t}|\leq C_{p}$. We incorporate the idea of linear bandits with adversarial corruption studied in He et al. (2022b) into the proposed FedSupLinUCB framework and propose the Robust Async-FedSupLinUCB algorithm, with details in Appendix F. Robust Async-FedSupLinUCB achieves the optimal minimax regret (matching the lower bound in He et al. (2022b)) while incurring a low communication cost.

Theorem 7.2.

For any $0<\delta<1$, if we run the Robust Async-FedSupLinUCB algorithm in Appendix F with $C=1/M^{2}$, then with probability at least $1-\delta$, the regret is bounded as $R_{T}\leq\tilde{O}(\sqrt{dT}+dC_{p})$, and the communication cost is bounded by $O(dM^{2}\log d\log T)$.

8 Experiments

We have experimentally evaluated FedSupLinUCB in the asynchronous and synchronous settings on both synthetic and real-world datasets. We report the results in this section.

8.1 Experiment Results Using Synthetic Dataset

We simulate the federated linear bandits environment specified in Section 3, with $T=40000$, $M=20$, $d=25$, and arm set size $|\mathcal{A}|=20$. Contexts are sampled uniformly at random from the $\ell_{2}$ unit sphere, and the reward is $r_{t,a}=\theta^{\top}x_{t,a}+\epsilon_{t}$, where $\epsilon_{t}$ is Gaussian with zero mean and variance $\sigma=0.01$. Note that while all $M$ clients participate in each round in the synchronous scenario, only one client is active per round in the asynchronous case. In the plots, the $x$-axis denotes the number of arm pulls, which flattens the actions in the synchronous setting.
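A sketch of this synthetic environment (the generator below is an illustrative assumption; the paper does not publish its exact script, and whether $\sigma$ denotes the variance or the standard deviation is left as stated above):

```python
import numpy as np

def make_env(T, d, K, noise_sd=0.01, seed=0):
    """Sample the synthetic environment of Section 8.1 (a sketch)."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=d)
    theta /= np.linalg.norm(theta)                   # ||theta||_2 <= 1
    X = rng.normal(size=(T, K, d))
    X /= np.linalg.norm(X, axis=-1, keepdims=True)   # uniform on the unit sphere
    rewards = X @ theta + rng.normal(scale=noise_sd, size=(T, K))
    return theta, X, rewards
```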

Figure 1: Experimental results with the synthetic dataset. (a) Regret under different arrival patterns; (b) communication cost under different arrival patterns; (c) regret for different numbers of clients; (d) regret vs. communication cost.

Arrival pattern.   We first investigate the impact of different arrival patterns (the order in which clients are activated): (1) random, which allocates $T/M$ arm pulls in $[T]$ uniformly at random for each client; (2) round-robin, i.e., $[1,2,3,\cdots,M,1,2,3,\cdots,M,\cdots]$; (3) click-leave, i.e., $[1,1,\cdots,2,2,\cdots,\cdots,M,M,\cdots]$. The regret and communication cost of the three arrival patterns in the synthetic experiment are reported in Figures 1(a) and 1(b), respectively. Although our upper bound analysis is for the worst-case instance, the numerical results show that different arrival patterns lead to diverse regret performances. Round-robin and random patterns are more challenging, since both local bandit learning and each client's policy updates happen relatively slowly. The click-leave pattern, which is the closest to the centralized setting, achieves the best regret. In addition, compared with Async-FedSupLinUCB, Sync-FedSupLinUCB achieves better cumulative regret at a higher communication cost. A sketch of the three pattern generators follows.
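The generator below is hypothetical, for illustration only; it produces the three activation sequences with exactly $T/M$ arm pulls per client:

```python
import numpy as np

def arrival_pattern(kind, T, M, seed=0):
    """Length-T sequence of active client indices for a given pattern."""
    per = T // M
    if kind == 'round-robin':
        return [t % M for t in range(M * per)]
    if kind == 'click-leave':
        return [i for i in range(M) for _ in range(per)]
    if kind == 'random':
        seq = np.repeat(np.arange(M), per)
        np.random.default_rng(seed).shuffle(seq)
        return seq.tolist()
    raise ValueError(kind)
```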

Number of clients.   The per-client cumulative regret as a function of $T_{c}=T/M$ for different numbers of clients is plotted in Figure 1(c). Compared with the SupLinUCB baseline, the FedSupLinUCB algorithms achieve better regret via communication between the clients and the server. The experiment shows that FedSupLinUCB significantly reduces the per-client regret compared with SupLinUCB, and achieves better regret as $M$ increases in both asynchronous and synchronous settings.

Trade-off between regret and communication.   We evaluate the trade-off between communication and regret by running FedSupLinUCB with different communication threshold values $C$ and $D$ in the asynchronous and synchronous settings, respectively. The results are reported in Figure 1(d), where each scattered dot represents the communication cost and cumulative regret that FedSupLinUCB achieves with a given threshold value at round $T=40000$. We see a clear trade-off between regret and communication. More importantly, Sync-FedSupLinUCB achieves a better trade-off than Async-FedSupLinUCB.

8.2 Experiment Results Using Real-world Dataset

We further investigate how efficiently the federated linear bandits algorithms perform in a more realistic and difficult environment, using the real-world recommendation dataset MovieLens 20M (Harper and Konstan, 2015). Following the steps in Li and Wang (2022b), we first filter the data by keeping users with more than $2500$ movie ratings and treating rating points greater than $3$ as positive, ending up with $N=37$ users and $121934$ total movie-rating interactions. We then follow the process described in Cesa-Bianchi et al. (2013) to generate the context sets, using TF-IDF features with $d=25$ and arm sets of size $K=20$. We plot the per-client normalized rewards of the FedSupLinUCB algorithm for different client numbers $M$ in the synchronous and asynchronous cases, where the per-client cumulative rewards are normalized by those of a random strategy. From Figures 2(a) and 2(b), we see that in both the synchronous and asynchronous experiments, FedSupLinUCB attains higher rewards than SupLinUCB, and the advantage becomes more significant as the number of users increases.

Figure 2: Experimental results with the real-world MovieLens-20M dataset. (a) Async-FedSupLinUCB; (b) Sync-FedSupLinUCB.

9 Conclusion

We studied federated linear bandits with finite adversarial actions, a model that had not been investigated before. We proposed FedSupLinUCB, which extends the SupLinUCB and OFUL principles to the federated setting in both asynchronous and synchronous scenarios, and analyzed its regret and communication cost, respectively. The theoretical results show that FedSupLinUCB approaches the minimax regret lower bound (up to polylog terms) while incurring only sublinear communication cost, suggesting that it is the finite actions that fundamentally determine the regret behavior in the federated linear bandits setting. Furthermore, we extended the algorithm design to the variance-adaptive and adversarial-corruption scenarios.

Acknowledgments and Disclosure of Funding

The work of LF and CS was supported in part by the U.S. National Science Foundation (NSF) under grants 2143559, 2029978, and 2132700.

References

  • Abbasi-Yadkori et al., (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24.
  • Auer, (2002) Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
  • Cesa-Bianchi et al., (2013) Cesa-Bianchi, N., Gentile, C., and Zappella, G. (2013). A gang of bandits. Advances in neural information processing systems, 26.
  • Chu et al., (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214. JMLR Workshop and Conference Proceedings.
  • Dani et al., (2008) Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. 21st Annual Conference on Learning Theory, pages 355–366.
  • Dubey and Pentland, (2020) Dubey, A. and Pentland, A. (2020). Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33:6003–6014.
  • Han et al., (2020) Han, Y., Zhou, Z., Zhou, Z., Blanchet, J., Glynn, P. W., and Ye, Y. (2020). Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321.
  • Harper and Konstan, (2015) Harper, F. M. and Konstan, J. A. (2015). The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):1–19.
  • He et al., (2022a) He, J., Wang, T., Min, Y., and Gu, Q. (2022a). A simple and provably efficient algorithm for asynchronous federated contextual linear bandits. arXiv preprint arXiv:2207.03106.
  • He et al., (2022b) He, J., Zhou, D., Zhang, T., and Gu, Q. (2022b). Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. Advances in neural information processing systems.
  • Huang et al., (2021) Huang, R., Wu, W., Yang, J., and Shen, C. (2021). Federated linear contextual bandits. Advances in Neural Information Processing Systems, 34:27057–27068.
  • Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
  • Li and Wang, (2022a) Li, C. and Wang, H. (2022a). Asynchronous upper confidence bound algorithms for federated linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 6529–6553. PMLR.
  • Li and Wang, (2022b) Li, C. and Wang, H. (2022b). Communication efficient federated learning for generalized linear bandits. arXiv preprint arXiv:2202.01087.
  • Li et al., (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670.
  • Li et al., (2020) Li, T., Song, L., and Fragouli, C. (2020). Federated recommendation system via differential privacy. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2592–2597. IEEE.
  • Li et al., (2019) Li, Y., Wang, Y., and Zhou, Y. (2019). Nearly minimax-optimal regret for linearly parameterized bandits. In Conference on Learning Theory, pages 2173–2174. PMLR.
  • McMahan et al., (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. (2017). Communication-efficient learning of deep networks from decentralized data. In Proc. AISTATS, pages 1273–1282, Fort Lauderdale, FL, USA.
  • Ruan et al., (2021) Ruan, Y., Yang, J., and Zhou, Y. (2021). Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74–87.
  • Salgia and Zhao, (2023) Salgia, S. and Zhao, Q. (2023). Distributed linear bandits under communication constraints. In International Conference on Machine Learning, pages 29845–29875. PMLR.
  • Shi et al., (2021) Shi, C., Shen, C., and Yang, J. (2021). Federated multi-armed bandits with personalization. In International Conference on Artificial Intelligence and Statistics, pages 2917–2925. PMLR.
  • Wang et al., (2019) Wang, Y., Hu, J., Chen, X., and Wang, L. (2019). Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations.
  • Zhou and Gu, (2022) Zhou, D. and Gu, Q. (2022). Computationally efficient horizon-free reinforcement learning for linear mixture mdps. arXiv preprint arXiv:2205.11507.
  • Zhou et al., (2021) Zhou, D., Gu, Q., and Szepesvari, C. (2021). Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pages 4532–4576. PMLR.

Appendix A Supporting Lemmas

Lemma A.1.

(Lemma 11 in Abbasi-Yadkori et al., (2011)) Let $\{X_{t}\}_{t=1}^{\infty}$ be a sequence in $\mathbb{R}^{d}$, $V$ be a $d\times d$ positive definite matrix, and define $\bar{V}_{t}=V+\sum_{s=1}^{t}X_{s}X_{s}^{\top}$. Then we have

\log\left(\frac{\det(\bar{V}_{n})}{\det(V)}\right)\leq\sum_{t=1}^{n}\left\|X_{t}\right\|_{\bar{V}_{t-1}^{-1}}^{2}.

Further, if $\|X_{t}\|_{2}\leq L$ for all $t$, then

\sum_{t=1}^{n}\min\left\{1,\left\|X_{t}\right\|_{\bar{V}_{t-1}^{-1}}^{2}\right\}\leq 2\left(\log\det(\bar{V}_{n})-\log\det V\right)\leq 2\left(d\log\left(\left(\operatorname{trace}(V)+nL^{2}\right)/d\right)-\log\det V\right).

Finally, if $\lambda_{\min}(V)\geq\max(1,L^{2})$, then

\sum_{t=1}^{n}\|X_{t}\|^{2}_{\bar{V}_{t-1}^{-1}}\leq 2\log\frac{\det(\bar{V}_{n})}{\det(V)}.
Lemma A.2.

(Lemma 12 in Abbasi-Yadkori et al., (2011)) Let $A$, $B$, and $C$ be positive semi-definite matrices such that $A=B+C$. Then we have

\sup_{x\neq\mathbf{0}}\frac{x^{\top}Ax}{x^{\top}Bx}\leq\frac{\det(A)}{\det(B)}.
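As a quick numerical sanity check of Lemma A.2 (an illustrative sketch, not part of the original proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
G = rng.normal(size=(d, d))
B = G @ G.T + np.eye(d)         # positive definite
H = rng.normal(size=(d, d))
A = B + H @ H.T                 # A = B + C with C positive semi-definite

# sup_{x != 0} (x^T A x)/(x^T B x) = lambda_max(L^{-1} A L^{-T}) with B = L L^T.
L = np.linalg.cholesky(B)
Linv = np.linalg.inv(L)
sup_ratio = np.linalg.eigvalsh(Linv @ A @ Linv.T).max()
det_bound = np.linalg.det(A) / np.linalg.det(B)
assert sup_ratio <= det_bound + 1e-9
```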
Theorem A.1.

(Theorem 2 in Abbasi-Yadkori et al., (2011)) Let $\{\mathcal{F}_{i}\}_{i=0}^{\infty}$ be a filtration. Let $\{x_{i}\}_{i=1}^{\infty}$ be an $\mathbb{R}^{d}$-valued stochastic process such that $x_{i}$ is $\mathcal{F}_{i-1}$-measurable and $\|x_{i}\|\leq 1$ almost surely. Let $\{\epsilon_{i}\}_{i=1}^{\infty}$ be a real-valued stochastic process such that $\epsilon_{i}$ is $\mathcal{F}_{i}$-measurable and is sub-Gaussian with variance proxy $1$ when conditioned on $\mathcal{F}_{i-1}$. Fix $\theta\in\mathbb{R}^{d}$ such that $\|\theta\|\leq 1$. Let $A_{n}=I+\sum_{i=1}^{n}x_{i}x_{i}^{\top}$, $r_{i}=x_{i}^{\top}\theta+\epsilon_{i}$, and $\hat{\theta}_{n}=A_{n}^{-1}\sum_{i=1}^{n}r_{i}x_{i}$. For every $\delta>0$, we have

\mathbb{P}\left[\forall n\geq 0:\left\|\hat{\theta}_{n}-\theta\right\|_{A_{n}}\leq 1+\sqrt{d\ln\left(\frac{1+n}{\delta}\right)}\right]\geq 1-\delta,

where we define $\|x\|_{A}=\sqrt{x^{\top}Ax}$. Furthermore, when the above event holds, we have for every $n\geq 0$ and any vector $x\in\mathbb{R}^{d}$ that

\left|x^{\top}(\hat{\theta}_{n}-\theta)\right|\leq\left(1+\sqrt{d\ln\left(\frac{1+n}{\delta}\right)}\right)\sqrt{x^{\top}A_{n}^{-1}x}.
Lemma A.3.

(Adapted from Lemma B.1 in He et al., 2022a) Under the setting of Theorem 5.1, set $C=1/M^{2}$ and $\alpha_{0}=1+\sqrt{d\ln(2M^{2}T/\delta)}$. In layer $0$, with probability at least $1-\delta$, the good event $\mathcal{E}_{0}$ happens:

\mathcal{E}_{0}\triangleq\left\{\left|x_{t,a}^{i\top}\hat{\theta}_{t,s}^{i}-x_{t,a}^{i\top}\theta\right|\leq w_{t,s,a}^{i},\ \forall i\in[M],a\in[K],t\in[T],s=0\right\}.
Lemma A.4.

(Lemma 31 in Ruan et al., (2021)) Given $\theta,x_{1},x_{2},\ldots,x_{n}\in\mathbb{R}^{d}$ such that $\|\theta\|\leq 1$, for all $i\in[n]$ let $r_{i}=x_{i}^{\top}\theta+\epsilon_{i}$, where $\epsilon_{i}$ is an independent sub-Gaussian random variable with variance proxy $1$. Let $A=I+\sum_{i=1}^{n}x_{i}x_{i}^{\top}$ and $\hat{\theta}=A^{-1}\sum_{i=1}^{n}r_{i}x_{i}$. For any $x\in\mathbb{R}^{d}$ and any $\alpha>0$, we have

\mathbb{P}\left[|x^{\top}(\theta-\hat{\theta})|>(\alpha+1)\|x\|_{A^{-1}}\right]\leq 2\exp(-\alpha^{2}/2).

Appendix B Lemmas for the SupLinUCB Subroutine

We present several useful lemmas based on Algorithm 2. Recall that $\Psi_{t,s}$ represents the index set of rounds up to and including round $t$ during which an action is taken in layer $s$. That is,

\Psi_{t,s}=\{t'\in[t]:\exists i\in[M],\ a_{t'}^{i}\text{ is chosen in layer }s\},\quad\forall s\in[0:S].

Similar to Lemma 4 in Chu et al. (2011), we claim that the rewards associated with the rounds within each $\Psi_{t,s}$, $s\in[S]$ (excluding layer $0$), are mutually independent.

Lemma B.1.

For each $t\in[T]$ and each $s\in[S]$, given any fixed sequence of contexts $\{x_{t,a}^{i},t\in\Psi_{t,s}\}$, the rewards $\{r_{t,s,a}^{i},t\in\Psi_{t,s}\}$ are independent random variables with means $\mathbb{E}[r_{t,s,a}^{i}]=\theta^{\top}x_{t,s,a}^{i}$.

Proof of Lemma B.1.

For each $s\in[S]$ and each time $t$, the procedure generating $\Psi_{t,s}$ only depends on the information in the previous layers $\cup_{\sigma<s}\Psi_{t,\sigma}$ and the confidence widths $\{w_{t,s,a}^{i},a\in[K]\}$. By its definition, $w_{t,s,a}^{i}$ only depends on $\{x_{\tau,a_{\tau}},\tau\in\Psi_{t-1,s}\}$ and on the current context $x_{t,a}^{i}$. Thus the procedure generating $\Psi_{t,s}$ does not depend on the rewards $\{r_{\tau,a_{\tau}},\tau\in\Psi_{t-1,s}\}$, and therefore the rewards are independent random variables when conditioned on $\Psi_{t,s}$. ∎

Given the above-mentioned statistical independence property, and by referring to Lemma A.4, we can establish the following lemma for each layer s[S]s\in[S].

Lemma B.2.

Suppose the time index set $\Psi_{t,s}$ is constructed so that for fixed $x_{\tau,a_{\tau}}$ with $\tau\in\Psi_{t,s}$, the rewards $\{r_{\tau,a_{\tau}}\}$ are independent random variables with means $\mathbb{E}[r_{\tau,a_{\tau}}]=\theta^{\top}x_{\tau,a_{\tau}}$. For any round $t\in[T]$, if client $i_{t}=i$ is active and chooses arm $a_{t}$ in layer $s\in[S]$, then with probability at least $1-\frac{\delta}{MT\ln d}$, we have for any $a_{t}\in[K]$:

\left|\hat{r}_{t,s,a_{t}}-\theta^{\top}x_{t,a_{t}}^{i}\right|\leq w_{t,s,a_{t}}^{i}=\alpha_{s}\|x_{t,a_{t}}^{i}\|_{(A_{t,s}^{i})^{-1}}.

For layer $0$, we employ the self-normalized martingale concentration inequality as outlined in He et al. (2022a). Resorting to Lemma A.3, we obtain the following:

Lemma B.3.

For any round $t\in[T]$, given that client $i_{t}=i$ is active in round $t$ and arm $a_{t}$ is chosen in layer $0$, with probability at least $1-\delta$, we have for any $a_{t}\in[K]$:

\left|\hat{r}_{t,0,a_{t}}-\theta^{\top}x_{t,a_{t}}^{i}\right|\leq w_{t,0,a_{t}}^{i}=\alpha_{0}\|x_{t,a_{t}}^{i}\|_{(A_{t,0}^{i})^{-1}}.

Summarizing the discussions presented in Lemma B.2 and Lemma B.3, we now proceed to define the following good event:

Lemma B.4.

Define the good event $\mathcal{E}$ as

\mathcal{E}\triangleq\left\{\left|\hat{r}_{t,s,a}-x_{t,a}^{i\top}\theta\right|\leq w_{t,s,a}^{i},\ \forall i\in[M],a\in[K],t\in[T],s\in[0:S]\right\}. \qquad (2)

Then $\mathbb{P}[\mathcal{E}]\geq 1-\delta$.

Conditioned on the good event $\mathcal{E}$, the ensuing lemma illustrates that the optimal arm persists in the candidate set of every layer.

Lemma B.5.

Conditioned on the good event $\mathcal{E}$, for $t\in[T]$, assume that client $i$ is active and chooses an action $a_{t}\in\mathcal{A}_{s}$, and recall that $(a_{t}^{i})^{*}$ denotes the optimal arm in the current round. For any $s'\leq s$, we have

(a_{t}^{i})^{*}=\arg\max_{a\in[K]}\theta^{\top}x_{t,a}^{i}=\arg\max_{a\in\mathcal{A}_{s'}}\theta^{\top}x_{t,a}^{i}.
Proof of Lemma B.5.

For any time step $t\in[T]$, when the good event $\mathcal{E}$ holds, by the arm elimination rule in layer $0$ we have

\hat{r}_{t,0,a^{*}}+w_{t,0,a^{*}}\geq\max_{a\in[K]}\theta^{\top}x_{t,a}\geq\max_{a\neq a^{*}}\theta^{\top}x_{t,a}\geq\max_{a\neq a^{*}}(\hat{r}_{t,0,a}-w_{t,0,a}).

Thus $a^{*}\in\mathcal{A}_{0}$. For each layer $s'<s$, we have

\hat{r}_{t,s',a^{*}}+w_{t,s',a^{*}}\geq\max_{a\in\mathcal{A}_{s'}}\theta^{\top}x_{t,a}\geq\max_{a\neq a^{*},a\in\mathcal{A}_{s'}}\theta^{\top}x_{t,a}\geq\max_{a\neq a^{*},a\in\mathcal{A}_{s'}}(\hat{r}_{t,s',a}-w_{t,s',a}).

Thus we derive $\hat{r}_{t,s',a^{*}}\geq\max_{a\in\mathcal{A}_{s'}}(\hat{r}_{t,s',a})-2\overline{w}_{s'}$, which follows from $w_{t,s',a}\leq\overline{w}_{s'}$ for all $a\in\mathcal{A}_{s'}$ by the arm elimination rule in Line 10 of Algorithm 2. Therefore, the arm eliminations preserve the best arm. ∎

The forthcoming lemma demonstrates that, under the good event, the regret experienced in each layer aligns with the order of the corresponding confidence width.

Lemma B.6.

Conditioned on the good event $\mathcal{E}$, for $t\in[T]$, client $i\in[M]$, and $s\in[S]$, it holds that

\mathbb{I}[a_{t}\text{ is chosen in layer }0]\left(\max_{a\in\mathcal{A}_{0}}\theta^{\top}x_{t,a}-\theta^{\top}x_{t,a_{t}}\right)\leq 4w_{t,0,a_{t}}, \qquad (3)

\mathbb{I}[a_{t}\text{ is chosen in layer }s]\left(\max_{a\in\mathcal{A}_{s}}\theta^{\top}x_{t,a}-\theta^{\top}x_{t,a_{t}}\right)\leq 8\overline{w}_{s}. \qquad (4)
Proof of Lemma B.6.

If an action is taken in layer 0, we have that

at=argmaxa𝒜0,wt,0,a>w0¯wt,0,a,a_{t}=\arg\max_{a\in\mathcal{A}_{0},w_{t,0,a}>\overline{w_{0}}}w_{t,0,a},

and

\max_{a\in\mathcal{A}_{0}}\theta^{\top}x_{t,a}-\theta^{\top}x_{t,a_{t}} \leq\max_{a\in\mathcal{A}_{0}}\theta^{\top}x_{t,a}-\min_{a\in\mathcal{A}_{0}}\theta^{\top}x_{t,a}
\leq\max_{a\in\mathcal{A}_{0}}(\hat{\theta}_{0}^{\top}x_{t,a}+w_{t,0,a})-\min_{a\in\mathcal{A}_{0}}(\hat{\theta}_{0}^{\top}x_{t,a}-w_{t,0,a})
\leq 4\max_{a\in\mathcal{A}_{0}}w_{t,0,a}
=4w_{t,0,a_{t}}.

The second inequality is conditioned on the good event \mathcal{E}, and the third inequality arises from the arm elimination rule. If an action is taken in layer ss, we establish the following:

at=argmaxa𝒜s,wt,s,a>w¯swt,s,a,a_{t}=\arg\max_{a\in\mathcal{A}_{s},w_{t,s,a}>\overline{w}_{s}}w_{t,s,a},

and

\max_{a\in\mathcal{A}_{s}}(\theta^{\top}x_{t,a}-\theta^{\top}x_{t,a_{t}}) \leq\max_{a\in\mathcal{A}_{s-1}}(\hat{\theta}_{s-1}^{\top}x_{t,a}+w_{t,s-1,a})-\min_{a\in\mathcal{A}_{s-1}}(\hat{\theta}_{s-1}^{\top}x_{t,a}-w_{t,s-1,a})
\leq 2\max_{a\in\mathcal{A}_{s-1}}w_{t,s-1,a}+\max_{a\in\mathcal{A}_{s-1}}\hat{\theta}_{s-1}^{\top}x_{t,a}-\min_{a\in\mathcal{A}_{s-1}}\hat{\theta}_{s-1}^{\top}x_{t,a}
\leq 2\max_{a\in\mathcal{A}_{s-1}}w_{t,s-1,a}+2\overline{w}_{s-1}
\leq 4\overline{w}_{s-1}\leq 8\overline{w}_{s}.

The first inequality is based on the good event \mathcal{E}, the third inequality follows from the arm elimination rule, and the fourth inequality is due to w_{t,s-1,a}\leq\overline{w}_{s-1} for all a\in\mathcal{A}_{s-1}. ∎

Appendix C Supporting Lemmas and Proofs for Async-FedSupLinUCB

Lemma C.1.

(Lemma 6.2 in He et al., 2022a ) In any epoch from round Tn,sT_{n,s} to round Tn+1,s1T_{n+1,s}-1, the number of communications is at most 2(M+1/C)2(M+1/C).

Proof outline of Async-FedSupLinUCB.

First, we reorganize the arrival pattern, demonstrating that the rearranged system parallels the original system, and present the requisite definitions for our analysis. Second, we deploy a virtual global model encapsulating information about all clients up to round tt, subsequently interconnecting the local models with this global model. Lastly, we derive upper bounds on the regret and communication cost in each layer s[0:S]s\in[0:S] prior to aggregating them to yield the total regret and communication costs, respectively.

Suppose that client ii communicates with the server at rounds t1,t2t_{1},t_{2} with t1<t2t_{1}<t_{2} and does not communicate during the rounds in between. The actions and information gained by client ii at the rounds t1<t<t2t_{1}<t<t_{2} do not impact other clients’ decision-making, since the information is kept local without communication. Therefore, we can reorder the arrival of clients appropriately while keeping the reordered system equivalent to the original system.

More specifically, suppose client i communicates with the server at two rounds t_{m} and t_{n} and does not communicate in the rounds in between (even if it is active). We reorder all the active rounds of client i in t_{m}<t<t_{n} and place them sequentially after round t_{m}. Hence, the arrival of clients can be reordered such that each client communicates with the server and then stays active until the next client's communication begins. We assume that the sequence of communication rounds in the reordered arrival pattern is 0=t_{0}<t_{1}<t_{2}<\cdots<t_{N}=T, where in rounds t_{i}\leq t<t_{i+1} the active client is the same. Details of the reordering process are given in Definition C.2. Due to the equivalence between the original system and the reordered system, we carry out the proofs in the reordered system. Since only one client i_{t} is active at round t, we write a_{t}=a_{t}^{i_{t}}, x_{t}=x^{i_{t}}_{t,a_{t}}, and r_{t}=r_{t,a_{t}}^{i_{t}} for simplicity.

Definition C.1.

Client information. Recall that for each client i\in[M], we denote by L_{i}(t) the last round at which client i communicated with the server, up to and including round t; e.g., L_{i}(t)=t if client i communicates at round t. For each round t, each client i, and each layer s, the information that has been uploaded by client i to the server is A_{t,s}^{i,up}=\sum\nolimits_{t^{\prime}=1}^{L_{i}(t)}x_{t^{\prime}}x_{t^{\prime}}^{\top}\mathbb{I}\{i_{t^{\prime}}=i,\,a_{t^{\prime}}\text{ in layer }s\},\ b_{t,s}^{i,up}=\sum\nolimits_{t^{\prime}=1}^{L_{i}(t)}r_{t^{\prime}}x_{t^{\prime}}\mathbb{I}\{i_{t^{\prime}}=i,\,a_{t^{\prime}}\text{ in layer }s\}, and the local information in the buffer that has not been uploaded to the server is \Delta A_{t,s}^{i}=\sum\nolimits_{t^{\prime}=L_{i}(t)+1}^{t}x_{t^{\prime}}x_{t^{\prime}}^{\top}\mathbb{I}\{i_{t^{\prime}}=i,\,a_{t^{\prime}}\text{ in layer }s\},\ \Delta b_{t,s}^{i}=\sum\nolimits_{t^{\prime}=L_{i}(t)+1}^{t}r_{t^{\prime}}x_{t^{\prime}}\mathbb{I}\{i_{t^{\prime}}=i,\,a_{t^{\prime}}\text{ in layer }s\}.

Server information. The information in the server is the data uploaded by all clients up to round tt: At,sser=I+i=1MAt,si,up,bt,sser=i=1Mbt,si,upA_{t,s}^{ser}=I+\sum_{i=1}^{M}A_{t,s}^{i,up},b_{t,s}^{ser}=\sum_{i=1}^{M}b_{t,s}^{i,up}.

Time index set. Denote by \Psi_{t,s} the set of rounds at which the chosen action is in layer s, i.e., \Psi_{t,s}=\{t^{\prime}\in[t]:a_{t^{\prime}}\text{ is chosen in layer }s\}, for s\in[0:S].

Virtual global information. We define a virtual global model that contains all the information up to round tt as: At,sall=I+tΨt,sxtxt,bt,sall=tΨt,srtxtA_{t,s}^{all}=I+\sum_{t^{\prime}\in\Psi_{t,s}}x_{t^{\prime}}x_{t^{\prime}}^{\top},b_{t,s}^{all}=\sum_{t^{\prime}\in\Psi_{t,s}}r_{t^{\prime}}x_{t^{\prime}}.

The information that is stored on the server and all the information that has not yet been uploaded by clients are combined to generate the global information: At,sall=At,sser+i=1MΔAt,si,bt,sall=bt,sser+i=1MΔbt,siA_{t,s}^{all}=A_{t,s}^{ser}+\sum_{i=1}^{M}\Delta A_{t,s}^{i},b_{t,s}^{all}=b_{t,s}^{ser}+\sum_{i=1}^{M}\Delta b_{t,s}^{i}.
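To make this bookkeeping concrete, the following minimal Python sketch (hypothetical class and function names; an illustration under the definitions above, not the paper's implementation) tracks the uploaded part, the local buffer, and the virtual global statistics of one layer:

import numpy as np

class LayerStats:
    # Per-layer statistics of one client: uploaded part (A_up, b_up) and
    # local buffer (dA, db) not yet sent to the server.
    def __init__(self, d):
        self.A_up = np.zeros((d, d))   # summand of A_{t,s}^{i,up}
        self.b_up = np.zeros(d)        # summand of b_{t,s}^{i,up}
        self.dA = np.zeros((d, d))     # Delta A_{t,s}^i
        self.db = np.zeros(d)          # Delta b_{t,s}^i

    def observe(self, x, r):
        # Local update after pulling an arm with context x and reward r.
        self.dA += np.outer(x, x)
        self.db += r * x

    def sync(self):
        # Communication: move the buffer into the uploaded part.
        self.A_up += self.dA
        self.b_up += self.db
        self.dA = np.zeros_like(self.dA)
        self.db = np.zeros_like(self.db)

def virtual_global(clients, d):
    # A_{t,s}^{all} = A_{t,s}^{ser} + sum_i Delta A_{t,s}^i (and likewise b).
    A = np.eye(d)
    b = np.zeros(d)
    for c in clients:
        A += c.A_up + c.dA
        b += c.b_up + c.db
    return A, b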

Before presenting the proof, we define good event \mathcal{E} as

{|xt,aiθ^t,sixt,aiθ|wt,s,ai,i[M],a[K],t[T],s[0:S]}.\mathcal{E}\triangleq\left\{\left|x_{t,a}^{i\top}\hat{\theta}_{t,s}^{i}-x_{t,a}^{i\top}\theta\right|\leq w_{t,s,a}^{i},\forall i\in[M],a\in[K],t\in[T],s\in[0:S]\right\}.

Recall that \hat{\theta}_{t,s}^{i} is the estimate of \theta by client i, and x_{t,a}^{i} and w_{t,s,a}^{i} are the corresponding context and confidence width of the action taken at round t. The following lemma shows that the good event happens with high probability, similar to the result in Lemma B.4.

Lemma C.2.

It holds that []1δ\mathbb{P}[\mathcal{E}]\geq 1-\delta.

Conditioned on the good event, to upper bound the regret, we bound the confidence width in each layer via the size of each time index set in the lemma below.

Lemma C.3.

Conditioned on the good event \mathcal{E}, for each s[0:S1]s\in[0:S-1] we have:

tΨT,swt,s,aiαs2(1+MC)2d|ΨT,s|log|ΨT,s|+αsdMlog(1+T/d).\displaystyle\sum_{t\in\Psi_{T,s}}w_{t,s,a}^{i}\leq\alpha_{s}\sqrt{2(1+MC)}\sqrt{2d|\Psi_{T,s}|\log|\Psi_{T,s}|}+\alpha_{s}dM\log(1+T/d).

Noting that |ΨT,s|T|\Psi_{T,s}|\leq T naturally holds, we give a tighter (dimension-dependent) bound on the size of ΨT,0\Psi_{T,0} so as to mitigate the larger coefficient α0\alpha_{0} as follows.

Lemma C.4.

The size of ΨT,0\Psi_{T,0} can be bounded by |ΨT,0|TlogTlog(2MT/δ)/d|\Psi_{T,0}|\leq T\log T\log(2MT/\delta)/d.

We postpone the proofs of Lemma C.3 and Lemma C.4 until the end of this section, and instead focus on presenting the regret analysis next. Equipped with the previous lemmas, we are ready to analyze the total regret.

Proof of Theorem 5.1.

(Regret analysis) The total regret can be decomposed w.r.t. layers as follows:

RT=𝔼tΨT,0(rt,ati,irt,ati)+s=1S𝔼tΨT,s(rt,ati,irt,ati).R_{T}=\mathbb{E}\sum_{t\in\Psi_{T,0}}(r^{i}_{t,a_{t}^{i,*}}-r^{i}_{t,a_{t}})+\sum_{s=1}^{S}\mathbb{E}\sum_{t\in\Psi_{T,s}}(r^{i}_{t,a_{t}^{i,*}}-r^{i}_{t,a_{t}}).

Conditioned on the good event \mathcal{E}, we first bound the regret in layer 0 by

\mathbb{E}\sum_{t\in\Psi_{T,0}}(r^{i}_{t,a_{t}^{i,*}}-r^{i}_{t,a_{t}})\leq\sum_{t\in\Psi_{T,0}}4w_{t,0,a_{t}}
\leq 4\alpha_{0}\sqrt{2(1+MC)}\sqrt{2d|\Psi_{T,0}|\log|\Psi_{T,0}|}+4\alpha_{0}dM\log(1+T/d)\leq\tilde{O}(\sqrt{(1+MC)dT}).

The first inequality follows from Lemma B.6, the second inequality is from Lemma C.3, and the last inequality is due to Lemma C.4 together with \alpha_{0}=\tilde{O}(\sqrt{d}). We next bound the regret in each layer s\in[1:S-1] similarly by

tΨT,s𝔼[rt,ati,irt,ati]tΨT,s8w¯stΨT,s8wt,s,at\displaystyle\sum_{t\in\Psi_{T,s}}\mathbb{E}\left[r^{i}_{t,a_{t}^{i,*}}-r^{i}_{t,a_{t}}\right]\leq\sum_{t\in\Psi_{T,s}}8\overline{w}_{s}\leq\sum_{t\in\Psi_{T,s}}8w_{t,s,a_{t}}
8αs2(1+MC)2d|ΨT,s|log|ΨT,s|+8αsdMlog(1+T/d)O~((1+MC)dT)\displaystyle\leq 8\alpha_{s}\sqrt{2(1+MC)}\sqrt{2d|\Psi_{T,s}|\log|\Psi_{T,s}|}+8\alpha_{s}dM\log(1+T/d)\leq\tilde{O}(\sqrt{(1+MC)dT})

where the first inequality follows from Lemma B.6, the second inequality is from the arm selection rule in Line 13 of Algorithm 2, and the third inequality is from Lemma C.3. For the last layer S, we have:

tΨT,S𝔼[rt,ati,irt,ati]tΨT,S8w¯S8w¯S|ΨT,S|8w¯ST8dT.\displaystyle\sum_{t\in\Psi_{T,S}}\mathbb{E}\left[r^{i}_{t,a_{t}^{i,*}}-r^{i}_{t,a_{t}}\right]\leq\sum_{t\in\Psi_{T,S}}8\overline{w}_{S}\leq 8\overline{w}_{S}|\Psi_{T,S}|\leq 8\overline{w}_{S}T\leq 8\sqrt{dT}.

Finally, with Lemma C.2, we have RTO~((1+MC)dT)R_{T}\leq\tilde{O}(\sqrt{(1+MC)dT}).

(Communication cost analysis) Next, we study the communication cost in the asynchronous setting. For each layer s and each integer n\geq 0, we define T_{n,s}=\min\{t\in[T]\mid\det(A_{t,s}^{ser})\geq 2^{n}\}. We divide the rounds in each layer into epochs \{T_{n,s},T_{n,s}+1,\ldots,\min(T,T_{n+1,s}-1)\}, and the number of communication rounds within the epoch T_{n,s}\leq t\leq T_{n+1,s}-1 can be bounded by Lemma C.1. Let N^{\prime} be the largest integer such that T_{N^{\prime},s} is well defined. Since Lemma A.1 gives \log(\det(A_{t,s}^{all}))\leq d\log(1+|\Psi_{T,s}|/d), we have N^{\prime}\leq d\log(1+T/d), i.e., the total number of epochs of layer s is bounded by d\log(1+T/d). By Lemma C.1, the number of communication rounds in layer s is therefore bounded by O((M+1/C)d\log T). Since there are S=\lceil\log d\rceil layers in the FedSupLinUCB algorithm, the total communication cost is upper bounded by O(d(M+1/C)\log d\log T). Plugging in C=1/M^{2} proves the result. ∎
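For reference, the determinant-based communication criterion that generates these epochs can be sketched as follows (a minimal numpy illustration with a hypothetical function name):

import numpy as np

def should_sync(A_local, dA, C):
    # Fires when det(A_local + dA) > (1 + C) det(A_local), i.e., when the
    # buffered information of this layer has grown enough to matter.
    _, logdet_old = np.linalg.slogdet(A_local)
    _, logdet_new = np.linalg.slogdet(A_local + dA)
    return logdet_new - logdet_old > np.log(1.0 + C)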

Definition C.2.

(Reorder function) Without loss of generality, we assume all clients communicate with the server at round t_{0}=0, and the sequence of rounds at which clients communicate with the server in the original system is 0=t_{0}<t_{1}<t_{2}<\cdots<t_{N}\leq T. Define I_{t,i}=\mathbb{I}(\text{client }i\text{ communicates with the server at round }t). Denote by L_{i}(t) the last communication round of client i before and including round t:

Li(t):=inf{u:t=0uIt,i=t=0tIt,i}.\displaystyle L_{i}(t):=\inf\{u:\sum_{t^{\prime}=0}^{u}I_{t^{\prime},i}=\sum_{t^{\prime}=0}^{t}I_{t^{\prime},i}\}.

Denote by Ni(t)N_{i}(t) the next communication round of client ii including and after round tt:

Ni(t):=inf{u:t=tuIt,i=1}.\displaystyle N_{i}(t):=\inf\{u:\sum_{t^{\prime}=t}^{u}I_{t^{\prime},i}=1\}.

The round t\in[T] in the original system is placed at round \phi(t) by the reordering function \phi:[T]\rightarrow[T]. We first reorder the communication rounds. Suppose t_{n} and t_{n+1} are two consecutive communication rounds with t_{n}<t_{n+1}, where client i is active at round t_{n} and client j is active at round t_{n+1}. Then

\phi(t_{n+1})=\phi(t_{n})+\sum\nolimits_{t^{\prime}=t_{n}}^{N_{i}(t_{n}+1)}I(i_{t^{\prime}}=i)-1.

Then we reorder the non-communication rounds. Assume client i is active at round t and does not communicate at this round. We find the last communication round L_{i}(t) of client i and place round t at position \phi(t):

ϕ(t)=ϕ(Li(t))+t=Li(t)tI(it=i)1.\phi(t)=\phi(L_{i}(t))+\sum\nolimits_{t^{\prime}=L_{i}(t)}^{t}I(i_{t^{\prime}}=i)-1.
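The reordering can also be sketched programmatically, as below (a minimal illustration, assuming, as in the definition above, that each client communicates at its first active round and exactly one client is active per round; names are hypothetical):

def reorder(active_client, comm_rounds, T):
    # active_client[t]: index of the client active at round t (0-indexed)
    # comm_rounds: set of rounds at which a communication happens
    # Returns phi as a dict mapping original round -> reordered position.
    order = []
    for t in sorted(comm_rounds):
        i = active_client[t]
        order.append(t)
        # Append client i's subsequent non-communication active rounds,
        # up to (but excluding) its next communication round.
        u = t + 1
        while u < T and not (u in comm_rounds and active_client[u] == i):
            if active_client[u] == i and u not in comm_rounds:
                order.append(u)
            u += 1
    return {t: pos for pos, t in enumerate(order)}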
Lemma C.5.

(Adapted from Lemma 6.5 in He et al., 2022a) For each round t\in[T], each layer s\in[0:S], and each client i\in[M], we have:

At,sser=I+i=1MAt,si,up1CΔAt,si.\displaystyle A_{t,s}^{ser}=I+\sum_{i=1}^{M}A_{t,s}^{i,up}\succeq\frac{1}{C}\Delta A_{t,s}^{i}.

Further averaging the inequality above over MM clients, we have:

At,sser=I+i=1MAt,si,up1MCi=1MΔAt,si.\displaystyle A_{t,s}^{ser}=I+\sum_{i=1}^{M}A_{t,s}^{i,up}\succeq\frac{1}{MC}\sum_{i=1}^{M}\Delta A_{t,s}^{i}.
Proof of Lemma C.5.

Without loss of generality, we consider client i and fix any round t\in[T]. Let t_{1}\leq t be the last round at which client i was active. If client i communicated with the server at round t_{1} and chose action a_{t_{1}} in layer s, then the layer-s buffer was emptied at round t_{1}, i.e.,

\Delta A_{t,s}^{i}=\Delta A_{t_{1},s}^{i}=0,

so the claim holds trivially for layer s. For the other layers s^{\prime}\neq s, according to the determinant-based communication criterion, we have:

\det(A_{t_{1},s^{\prime}}^{i}+\Delta A_{t_{1},s^{\prime}}^{i})<(1+C)\det(A_{t_{1},s^{\prime}}^{i}).

By Lemma A.2 we have

A_{t,s^{\prime}}^{i}=A_{t_{1},s^{\prime}}^{i}\succeq\frac{1}{C}\Delta A_{t_{1},s^{\prime}}^{i},

and since the downloaded matrix satisfies A_{t,s^{\prime}}^{i}\preceq A_{t,s^{\prime}}^{ser}, the claim also holds for these layers.

Otherwise, if no communication happened at round t_{1}, then by the communication criterion, at the end of round t_{1}, for each layer s\in[0:S], we have A_{t_{1},s}^{i}\succeq\frac{1}{C}\Delta A_{t_{1},s}^{i}. Note that \{A_{t_{1},s}^{i},\ s\in[0:S]\} are the Gram matrices downloaded at the last communication before round t_{1}, so they must satisfy A_{t_{1},s}^{i}\preceq A_{t_{1},s}^{ser} for all s\in[0:S]. For round t, since client i is inactive from round t_{1} to t, we have for all s\in[0:S]:

At,sserAt1,sserAt1,si1CΔAt1,si=1CΔAt,siA_{t,s}^{ser}\succeq A_{t_{1},s}^{ser}\succeq A_{t_{1},s}^{i}\succeq\frac{1}{C}\Delta A_{t_{1},s}^{i}=\frac{1}{C}\Delta A_{t,s}^{i}

where the last equality holds because client i is inactive between rounds t_{1} and t, which completes the proof of the first claim. Summing the above inequality over all clients i\in[M] and dividing by M, we get:

At,sser=I+i=1MAt,si,up1MCi=1MΔAt,si.A_{t,s}^{ser}=I+\sum_{i=1}^{M}A_{t,s}^{i,up}\succeq\frac{1}{MC}\sum_{i=1}^{M}\Delta A_{t,s}^{i}.

Recall that client i utilizes A_{t,s}^{i} and b_{t,s}^{i}, received from the server during the last communication, to make its decision at round t. The following lemma establishes a connection between the Gram matrix of the virtual global model and the Gram matrix at the active client at round t.

Lemma C.6.

In the reordered arrival pattern, for any 1\leq t_{1}<t_{2}\leq T, suppose client i communicates with the server at round t_{1} and stays active during rounds t_{1}\leq t\leq t_{2}-1. Then for rounds t_{1}+1\leq t\leq t_{2}-1 and each s\in[0:S], it holds that:

At,si11+MCAt,sall.A_{t,s}^{i}\succeq\frac{1}{1+MC}A_{t,s}^{all}.
Proof of Lemma C.6.

Client i is the only active client from round t_{1} to t_{2}-1 and only communicated with the server at round t_{1}, which implies that for all t_{1}+1\leq t\leq t_{2}-1 and all s\in[0:S], we have

At,si=I+i=1MAt1,si,up=I+i=1MAt,si,up11+MC(I+i=1MAt,si,up+i=1MΔAt,si)11+MCAt,sallA_{t,s}^{i}=I+\sum_{i=1}^{M}A_{t_{1},s}^{i,up}=I+\sum_{i=1}^{M}A_{t,s}^{i,up}\succeq\frac{1}{1+MC}(I+\sum_{i=1}^{M}A_{t,s}^{i,up}+\sum_{i=1}^{M}\Delta A_{t,s}^{i})\succeq\frac{1}{1+MC}A_{t,s}^{all}

where the second equality holds because no client communicates with the server from round t_{1}+1 to t_{2}-1, and the first inequality follows from Lemma C.5. ∎

Proof of Lemma C.3.

For t\in\Psi_{T,s}, if no communication happened at round t, then by Lemma C.6 and Lemma A.2 we can relate the confidence width at the local client to the global Gram matrix as:

\|x_{t,a}^{i_{t}}\|_{(A_{t,s}^{i_{t}})^{-1}}\leq\sqrt{1+MC}\,\|x_{t,a}^{i_{t}}\|_{(A_{t,s}^{all})^{-1}}.

It remains to control the communication rounds in ΨT,s\Psi_{T,s}. We define

Tn=min{tΨT,sdet(At,sall)2n},T_{n}=\min\left\{t\in\Psi_{T,s}\mid\operatorname{\det}\left({A}_{t,s}^{{all}}\right)\geq 2^{n}\right\},

and let N^{\prime} be the largest integer such that T_{N^{\prime}} is well defined. According to Lemma A.1, we have:

log(det(At,sall))dlog(1+|ΨT,s|/d).\log(\det(A_{t,s}^{all}))\leq d\log(1+|\Psi_{T,s}|/d).

Thus, N^{\prime}\leq d\log(1+T/d). For each time interval from T_{n} to T_{n+1} and each client i\in[M], suppose client i communicates with the server more than once, with communication rounds, in order, T_{n,1},T_{n,2},\ldots,T_{n,k}\in\left[T_{n},T_{n+1}\right). Then for each j=2,\ldots,k, since client i is active at rounds T_{n,j-1} and T_{n,j}, we have

\|x_{T_{n,j}}\|_{(A_{T_{n,j},s}^{i})^{-1}}\leq\|x_{T_{n,j}}\|_{(A_{T_{n,j-1}+1,s}^{i})^{-1}}\leq\sqrt{1+MC}\,\|x_{T_{n,j}}\|_{(A_{T_{n,j-1}+1,s}^{all})^{-1}}.

Since det(ATn+11,sall)/det(ATn,j1+1,sall)2n+1/2n=2\det({A}_{T_{n+1}-1,s}^{all})/\det({A}_{T_{n,j-1}+1,s}^{all})\leq 2^{n+1}/2^{n}=2, by the definition of TnT_{n}, we have:

xTn,j(ATn,j,si)12(1+MC)xTn,j(ATn+11,sall)12(1+MC)xTn,j(ATn,j,sall)1,\|x_{T_{n,j}}\|_{(A_{T_{n,j},s}^{i})^{-1}}\leq\sqrt{2(1+MC)}\|x_{T_{n,j}}\|_{(A_{T_{n+1}-1,s}^{all})^{-1}}\leq\sqrt{2(1+MC)}\|x_{T_{n,j}}\|_{(A_{T_{n,j},s}^{all})^{-1}},

where the second inequality comes from A_{T_{n+1}-1,s}^{all}\succeq A_{T_{n,j},s}^{all}. For the first communication round T_{n,1}, we simply bound the corresponding term by 1. Thus, for the communication rounds in \Psi_{T,s}, we have:

\sum_{t\in\Psi_{T,s},\,t\text{ is a communication round}}\|x_{t,a}^{i_{t}}\|_{(A_{t,s}^{i_{t}})^{-1}}\leq MN^{\prime}+\sum_{t\in\Psi_{T,s},\,t\text{ is a communication round}}\sqrt{2(1+MC)}\,\|x_{t,a}^{i_{t}}\|_{(A_{t,s}^{all})^{-1}}.

Finally, we put all rounds in ΨT,s\Psi_{T,s} together:

tΨT,swt,s,a\displaystyle\sum_{t\in\Psi_{T,s}}w_{t,s,a} =αstΨT,sxt,ait(At,sit)1\displaystyle=\alpha_{s}\sum_{t\in\Psi_{T,s}}\|x_{t,a}^{i_{t}}\|_{(A_{t,s}^{i_{t}})^{-1}}
αstΨT,s2(1+MC)xt,ait(At,sall)1+αsMN\displaystyle\leq\alpha_{s}\sum_{t\in\Psi_{T,s}}\sqrt{2(1+MC)}\|x_{t,a}^{i_{t}}\|_{(A_{t,s}^{all})^{-1}}+\alpha_{s}MN^{\prime}
αs2(1+MC)2d|ΨT,s|log|ΨT,s|+αsdMlog(1+T/d)\displaystyle\leq\alpha_{s}\sqrt{2(1+MC)}\sqrt{2d|\Psi_{T,s}|\log|\Psi_{T,s}|}+\alpha_{s}dM\log(1+T/d)

where the second inequality follows from Lemma A.1 together with N^{\prime}\leq d\log(1+T/d). ∎

Proof of Lemma C.4.

Based on the algorithm, if we choose an action in layer 0, the selected arm is

at=argmaxa𝒜0,wt,0,a>w¯0wt,0,a,a_{t}=\arg\max_{a\in\mathcal{A}_{0},w_{t,0,a}>\overline{w}_{0}}w_{t,0,a},

and the corresponding confidence width satisfies wt,0,at>w¯0w_{t,0,a_{t}}>\overline{w}_{0}. Furthermore,

\overline{w}_{0}|\Psi_{T,0}|\leq\sum_{t\in\Psi_{T,0}}w_{t,0,a_{t}}=\alpha_{0}\sum_{t\in\Psi_{T,0}}\|x_{t,a_{t}}\|_{(A_{t,0}^{i_{t}})^{-1}}
\leq\alpha_{0}\sqrt{2(1+MC)}\sqrt{2d|\Psi_{T,0}|\log|\Psi_{T,0}|}+\alpha_{0}dM\log(1+T/d),

where the last inequality is by Lemma C.3. We can thus conclude that |\Psi_{T,0}|\leq T\log T\log(2MT/\delta)/d. ∎

Appendix D Supporting Lemmas and Proofs for Sync-FedSupLinUCB

Proof outline of Sync-FedSupLinUCB.

To prove a high-probability regret bound, we first define the good event \mathcal{E} in the following lemma, under which the regret bound is derived.

Lemma D.1.

Define {|xt,aiθ^t,sixt,aiθ|wt,s,ai,i[M],a[K],t[Tc],0sS}\mathcal{E}\triangleq\{\left|x_{t,a}^{i\top}\hat{\theta}_{t,s}^{i}-x_{t,a}^{i\top}\theta\right|\leq w_{t,s,a}^{i},\forall i\in[M],a\in[K],t\in[T_{c}],0\leq s\leq S\}. Then, []1δ\mathbb{P}[\mathcal{E}]\geq 1-\delta.

Define client i's one-step regret at round t as \mathsf{reg}_{t}^{i}=\theta^{\top}(x^{i}_{t,a^{i*}_{t}}-x^{i}_{t,a_{t}}). Let \mathsf{reg}_{t,s}^{i}=\mathsf{reg}_{t}^{i} if action a_{t} is chosen in layer s, and \mathsf{reg}_{t,s}^{i}=0 otherwise. The total regret can be written as

R_{T}=\sum_{i=1}^{M}\sum_{t=1}^{T_{c}}\mathsf{reg}_{t}^{i}=\sum_{s=0}^{S}\sum_{i=1}^{M}\sum_{t=1}^{T_{c}}\mathsf{reg}_{t,s}^{i}.

Fix an arbitrary s\in\{0,1,\ldots,S\}; we analyze the total regret induced by the actions taken in layer s, i.e., R_{s,T_{c}}=\sum_{i=1}^{M}\sum_{t=1}^{T_{c}}\mathsf{reg}_{t,s}^{i}. The analysis carries over to different s in the same manner.

We call a maximal chunk of consecutive rounds during which no layer-s information is communicated (except at the last round) an epoch. In other words, information in layer s is collected locally by each client and synchronized at the end of the epoch, following which the next epoch starts. The set of rounds in which at least one client pulls an arm in layer s can then be divided into multiple consecutive epochs, and we further dichotomize these epochs into good and bad epochs in the following definition.

Definition D.1.

(Good epoch) Suppose the set of rounds at which at least one client pulls an arm in layer s is divided into P epochs, and denote by A_{p,s}^{all},b_{p,s}^{all} the synchronized Gram matrix and reward-action vector at the end of the p-th epoch. The P epochs can then be dichotomized into \mathcal{P}^{good}_{s}\triangleq\left\{p\in[P]:\frac{\det(A_{p,s}^{all})}{\det(A_{p-1,s}^{all})}\leq 2\right\},\ \mathcal{P}^{bad}_{s}\triangleq[P]\setminus\mathcal{P}^{good}_{s}, where A_{0,s}^{all}\triangleq I. We say round t is good if the epoch containing round t belongs to \mathcal{P}_{s}^{good}; otherwise t is bad.
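As an illustration, the good/bad dichotomy can be computed from the synchronized Gram matrices as in the following sketch (numpy; hypothetical function name):

import numpy as np

def classify_epochs(A_all):
    # A_all: list [A_0, A_1, ..., A_P] of synchronized Gram matrices
    # A_{p,s}^{all} at the epoch boundaries, with A_all[0] the identity.
    good, bad = [], []
    for p in range(1, len(A_all)):
        _, ld_prev = np.linalg.slogdet(A_all[p - 1])
        _, ld_cur = np.linalg.slogdet(A_all[p])
        (good if ld_cur - ld_prev <= np.log(2.0) else bad).append(p)
    return good, bad

Since the log-determinant ratios telescope to \log\det(A_{P,s}^{all})\leq R_{s}, at most O(R_{s}) epochs can be bad, which is what Lemma D.3 exploits.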

We bound the regret in layer s induced by the good and the bad epochs separately in the following lemmas. Recall that \Psi_{t,s} is the set of rounds at which the action is chosen in layer s.

Lemma D.2.

Conditioned on the good event \mathcal{E}, for each layer s\in[0:S], the regret induced by good epochs of layer s is bounded as \sum_{t\in\Psi_{T_{c},s},\,t\text{ is good}}\mathsf{reg}_{t,s}^{i}\leq\tilde{O}\left(\alpha_{s}\sqrt{d|\Psi_{T_{c},s}|\log(MT_{c})}\right).

Lemma D.3.

Define D=\frac{T_{c}\log T_{c}}{d^{2}M} and R_{s}=d\log(1+\frac{|\Psi_{T_{c},s}|}{d}). Conditioned on the good event \mathcal{E}, for each layer s\in[0:S], the regret induced by bad epochs of layer s is bounded as \sum_{t\in\Psi_{T_{c},s},\,t\text{ is bad}}\mathsf{reg}_{t,s}^{i}\leq O\left(\alpha_{s}M\sqrt{D}R_{s}\right).

Lemma D.4.

We have |ΨTc,s|O~(MTcd)|\Psi_{T_{c},s}|\leq\tilde{O}(\frac{MT_{c}}{d}).

Proof of Theorem 6.1.

(Regret analysis) For each s[0:S]s\in[0:S], the regret induced in layer ss is bounded by:

R_{s,T_{c}}\leq\sum_{t\in\Psi_{T_{c},s},\,t\text{ is good}}\mathsf{reg}_{t,s}^{i}+\sum_{t\in\Psi_{T_{c},s},\,t\text{ is bad}}\mathsf{reg}_{t,s}^{i}
\leq O(\alpha_{s}\sqrt{d|\Psi_{T_{c},s}|\log(MT_{c})}+\alpha_{s}M\sqrt{D}R_{s})\leq\tilde{O}(\sqrt{dMT_{c}}),

where the second inequality is from Lemmas D.2 and D.3, and the last inequality is due to Lemma D.4. The total regret can thus be bounded as RT=s=0SRs,Tc=O~(dMTc)R_{T}=\sum_{s=0}^{S}R_{s,T_{c}}=\tilde{O}(\sqrt{dMT_{c}}). ∎

Proof of Lemma D.2.

If tt is good and belongs to the pp-th epoch, we have by Lemma A.2 that

wt,s,ai=αsxt,ai(At,si)12αsxt,ai(Ap,sall)12αsxt,ai(Ap1,sall)1.\displaystyle w_{t,s,a}^{i}=\alpha_{s}\|x_{t,a}^{i}\|_{(A_{t,s}^{i})^{-1}}\leq\sqrt{2}\alpha_{s}\|x_{t,a}^{i}\|_{(A_{p,s}^{all})^{-1}}\leq 2\alpha_{s}\|x_{t,a}^{i}\|_{(A_{p-1,s}^{all})^{-1}}. (5)

Within pp-th good epoch, we have

Ap1,sall+i=1Mtp-th good epochxt,atii(xt,atii)=Ap,sall,\displaystyle A^{all}_{p-1,s}+\sum_{i=1}^{M}\sum_{t\in\text{$p$-th good epoch}}x^{i}_{t,a^{i}_{t}}(x^{i}_{t,a^{i}_{t}})^{\top}=A^{all}_{p,s},

which together with inequality (5) and the last inequality in the elliptical potential lemma (Lemma A.1) imply that

i=1Mtp-th good epochxt,atii(At,si)124logdet(Ap,sall)det(Ap1,sall).\displaystyle\sum_{i=1}^{M}\sum_{t\in\text{$p$-th good epoch}}\|x_{t,a_{t}^{i}}^{i}\|^{2}_{(A_{t,s}^{i})^{-1}}\leq 4\log\frac{\det(A^{all}_{p,s})}{\det(A^{all}_{p-1,s})}.

Thus under event \mathcal{E}, the regret induced by good epochs of layer ss is

\sum_{(i,t)\in\Psi_{T,s},\,t\text{ is good}}\mathsf{reg}_{t,s}^{i}\leq\sum_{(i,t)\in\Psi_{T,s},\,t\text{ is good}}8w_{t,s,a_{t}^{i}}^{i}\leq 8\sqrt{|\Psi_{T,s}|\sum_{(i,t)\in\Psi_{T,s},\,t\text{ is good}}(w_{t,s,a_{t}^{i}}^{i})^{2}}=\tilde{O}\left(\alpha_{s}\sqrt{d|\Psi_{T,s}|\log(MT)}\right),

where the first inequality is from Lemma B.6, the second inequality is by the Cauchy–Schwarz inequality, and the last relation is from

p=1Plogdet(Ap,sall)det(Ap1,sall)=logdet(AP,sall)dlog(1+|ΨT,s|d)=Rs.\displaystyle\sum_{p=1}^{P}\log\frac{\det(A_{p,s}^{all})}{\det(A_{p-1,s}^{all})}=\log\det(A_{P,s}^{all})\leq d\log(1+\frac{|\Psi_{T,s}|}{d})=R_{s}. (6)

Proof of Lemma D.3.

Recall R_{s}=d\log(1+\frac{|\Psi_{T,s}|}{d}). It follows that the number of bad epochs is at most O(R_{s}). Moreover, the regret within a bad epoch can be upper bounded by O(M+\alpha_{s}M\sqrt{D}) by applying the elliptical potential lemma to each client i together with the communication condition, where the extra M term arises because each of the at most M clients that trigger the communication condition at the end of the epoch contributes an extra 1. We thus have

\sum_{t\text{ is bad}}\mathsf{reg}_{t,s}^{i}\leq\sum_{t\in\Psi_{T,s}\text{ is bad}}8w_{t,s}^{i}=O\left(MR_{s}+\alpha_{s}M\sqrt{D}R_{s}\right)=O\left(\alpha_{s}M\sqrt{D}R_{s}\right). ∎

Proof of Lemma D.4.

Recall D=\frac{T\log(T)}{d^{2}M}. Note that if \alpha_{s}\sqrt{d|\Psi_{T,s}|\log(T)}=O(\alpha_{s}M\sqrt{D}R_{s}), we have |\Psi_{T,s}|=\tilde{O}(M^{2}Dd)=\tilde{O}(\frac{MT}{d}). Otherwise |\Psi_{T,s}|\overline{w}_{s}=O(\alpha_{s}\sqrt{d|\Psi_{T,s}|\log(T)}), which implies |\Psi_{T,s}|=\tilde{O}(\frac{\alpha_{s}^{2}d}{\overline{w}_{s}^{2}})=\tilde{O}(\frac{MT4^{s}}{d}). ∎

Appendix E Variance-adaptive Async-FedSupLinUCB

The variance-adaptive SupLinUCB subroutine is presented in Alg. 5, while the complete variance-adaptive Async-FedSupLinUCB is given in Alg. 6.

E.1 Algorithm

Algorithm 5 Variance-adaptive SupLinUCB subroutine: VS-LUCB
1:Initialization: SlogR+logTS\leftarrow\lceil\log R+\log T\rceil, w¯0=dR2\overline{w}_{0}=dR^{2}, w¯s2sw¯0,s[1:S]\overline{w}_{s}\leftarrow 2^{-s}\overline{w}_{0},\forall s\in[1:S],
2:α0=O~(d),αs=1+2ln(2KMTlnd/δ),ρ=1/T,γ=R1/2/d1/4\alpha_{0}=\tilde{O}(\sqrt{d}),\alpha_{s}=1+\sqrt{2\ln(2KMT\ln d/\delta)},\rho=1/\sqrt{T},\gamma=R^{1/2}/d^{1/4}.
3:Input: Client ii (with local information Ai,biA^{i},b^{i}, ΔAi,Δbi\Delta A^{i},\Delta b^{i}), contexts set {xt,1i,,xt,Ki}\{x_{t,1}^{i},\ldots,x_{t,K}^{i}\}
4:At,siAsi,bt,sibsiA_{t,s}^{i}\leftarrow A_{s}^{i},b_{t,s}^{i}\leftarrow b_{s}^{i} for lazy update
5:θ^s(At,si)1bt,si\hat{\theta}_{s}\leftarrow(A^{i}_{t,s})^{-1}b^{i}_{t,s}, r^t,s,ai=θ^sxt,ai\hat{r}_{t,s,a}^{i}=\hat{\theta}_{s}^{\top}x_{t,a}^{i}, wt,s,aiαsxt,ai(At,si)1w_{t,s,a}^{i}\leftarrow\alpha_{s}\|x^{i}_{t,a}\|_{(A^{i}_{t,s})^{-1}}, s[0:S],a[K]\forall s\in[0:S],\forall a\in[K].
6:s0s\leftarrow 0; 𝒜0{a[K]r^t,0,ai+wt,0,aimaxa[K](r^t,0,aiwt,0,ai)}\mathcal{A}_{0}\leftarrow\{a\in[K]\mid\hat{r}_{t,0,a}^{i}+w_{t,0,a}^{i}\geq\max_{a\in[K]}(\hat{r}_{t,0,a}^{i}-w_{t,0,a}^{i})\} \triangleright Initial screening
7:repeat\triangleright Layered successive screening
8:     if s=Ss=S then
9:         Choose action atia_{t}^{i} arbitrarily from 𝒜S\mathcal{A}_{S}
10:     else if wt,s,aiw¯sw_{t,s,a}^{i}\leq\overline{w}_{s} for all a𝒜sa\in\mathcal{A}_{s} then
11:         𝒜s+1{a𝒜sr^t,s,aimaxa𝒜s(r^t,s,ai)2w¯s}\mathcal{A}_{s+1}\leftarrow\{a\in\mathcal{A}_{s}\mid\hat{r}_{t,s,a}^{i}\geq\max_{a^{\prime}\in\mathcal{A}_{s}}(\hat{r}_{t,s,a^{\prime}}^{i})-2\overline{w}_{s}\}; ss+1s\leftarrow s+1
12:     else
13:         Choose at=argmax{a𝒜s,wt,s,ai>w¯s}wt,s,aia_{t}=\arg\max_{\{a\in\mathcal{A}_{s},w_{t,s,a}^{i}>\overline{w}_{s}\}}w_{t,s,a}^{i}
14:     end if
15:until action ata_{t} is found
16:Take action a_{t} and receive reward r_{t,a_{t}}^{i} and variance \sigma_{t}
17:σ¯t=max{σt,ρ,γxt,ati(At,si)11/2}\overline{\sigma}_{t}=\max\{\sigma_{t},\rho,\gamma\|x_{t,a_{t}}^{i}\|_{(A_{t,s}^{i})^{-1}}^{1/2}\}
18:ΔAsiΔAsi+xt,atixt,ati/σ¯t2\Delta A_{s}^{i}\leftarrow\Delta A_{s}^{i}+x_{t,a_{t}}^{i}x_{t,a_{t}}^{i\top}/\overline{\sigma}_{t}^{2}, ΔbsiΔbsi+rt,atixt,ati/σ¯t2\Delta b_{s}^{i}\leftarrow\Delta b_{s}^{i}+r_{t,a_{t}}^{i}x_{t,a_{t}}^{i}/\overline{\sigma}_{t}^{2} \triangleright Update local information
19:Return layer index ss
Algorithm 6 Variance-adaptive Async-FedSupLinUCB
1:Initialization: TT, CC, S=logR+logTS=\lceil\log R+\log T\rceil
2:{AsserId,bsser0s[0:S]}\{A_{s}^{ser}\leftarrow I_{d},b_{s}^{ser}\leftarrow 0\mid s\in[0:S]\} \triangleright Server initialization
3:{AsiId,ΔAsi,bsi,Δbsi0s[0:S],i[M]}\{A_{s}^{i}\leftarrow I_{d},\Delta A_{s}^{i},b_{s}^{i},\Delta b_{s}^{i}\leftarrow 0\mid s\in[0:S],i\in[M]\} \triangleright Clients initialization
4:for t=1,2,,Tt=1,2,...,T do
5:     Client it=ii_{t}=i is active, and observes KK contexts {xt,1i,xt,2i,,xt,Ki}\{x_{t,1}^{i},x_{t,2}^{i},\cdots,x_{t,K}^{i}\}
6:     s\leftarrow VS-LUCB\left(\text{client }i,\{x_{t,1}^{i},x_{t,2}^{i},\cdots,x_{t,K}^{i}\}\right) with the lazy update
7:     if det(Asi+ΔAsi)det(Asi)>(1+C)\frac{\det(A_{s}^{i}+\Delta A_{s}^{i})}{\det(A_{s}^{i})}>(1+C) then
8:         \operatorname{Sync}(s, server, client i) for each s\in[0:S]
9:     end if
10:end for
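To make Lines 17–18 of Algorithm 5 concrete, the variance-clipped weighted update can be sketched as follows (numpy; the function name is hypothetical and this is an illustration of the update rule, not the authors' code):

import numpy as np

def variance_adaptive_update(A, b, x, r, sigma, rho, gamma):
    # Line 17: sigma_bar = max{sigma, rho, gamma * ||x||_{A^{-1}}^{1/2}}
    width = float(np.sqrt(x @ np.linalg.solve(A, x)))   # ||x||_{A^{-1}}
    sigma_bar = max(sigma, rho, gamma * np.sqrt(width))
    # Line 18: rank-one update weighted by 1 / sigma_bar^2
    A = A + np.outer(x, x) / sigma_bar**2
    b = b + (r / sigma_bar**2) * x
    return A, b

Clipping \overline{\sigma}_{t} from below by \rho and \gamma\|x\|^{1/2} keeps the weight of any single observation bounded, which is the property used in Lemma E.1.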

E.2 Supporting Lemmas and Proofs

Theorem E.1.

(Theorem 4.3 in Zhou and Gu, (2022)) Let {t}t=1\left\{\mathcal{F}_{t}\right\}_{t=1}^{\infty} be a filtration, and {xt,ηt}t1\left\{x_{t},\eta_{t}\right\}_{t\geq 1} be a stochastic process such that xtdx_{t}\in\mathbb{R}^{d} is t\mathcal{F}_{t}-measurable and ηt\eta_{t}\in\mathbb{R} is t+1\mathcal{F}_{t+1}-measurable. Let σ,ϵ>0,θd\sigma,\epsilon>0,\theta^{*}\in\mathbb{R}^{d}. For t1t\geq 1, let yt=θ,xt+ηty_{t}=\left\langle\theta^{*},x_{t}\right\rangle+\eta_{t} and suppose that ηt,xt\eta_{t},x_{t} also satisfy

𝔼[ηtt]=0,𝔼[ηt2t]σ2,|ηt|R,xt21.\mathbb{E}[\eta_{t}\mid\mathcal{F}_{t}]=0,\mathbb{E}[\eta_{t}^{2}\mid\mathcal{F}_{t}]\leq\sigma^{2},\left|\eta_{t}\right|\leq R,\left\|x_{t}\right\|_{2}\leq 1.

For t1t\geq 1, let Zt=I+i=1txixi,bt=i=1tyixi,θt=Zt1btZ_{t}=I+\sum_{i=1}^{t}x_{i}x_{i}^{\top},b_{t}=\sum_{i=1}^{t}y_{i}x_{i},\theta_{t}=Z_{t}^{-1}b_{t}, and

βt=\displaystyle\beta_{t}= 12σ2dlog(1+tL2/(d))log(32(log(R/ϵ)+1)t2/δ)\displaystyle 12\sqrt{\sigma^{2}d\log\left(1+tL^{2}/(d)\right)\log\left(32(\log(R/\epsilon)+1)t^{2}/\delta\right)}
+24log(32(log(R/ϵ)+1)t2/δ)max1it{|ηi|min{1,𝐱i𝐙i111}}+6log(32(log(R/ϵ)+1)t2/δ)ϵ.\displaystyle+24\log\left(32(\log(R/\epsilon)+1)t^{2}/\delta\right)\max_{1\leq i\leq t}\left\{\left|\eta_{i}\right|\min\left\{1,\left\|\mathbf{x}_{i}\right\|_{\mathbf{Z}_{i-1}^{-1}}^{-1}\right\}\right\}+6\log\left(32(\log(R/\epsilon)+1)t^{2}/\delta\right)\epsilon.

Then, for any 0<δ<10<\delta<1, we have with probability at least 1δ1-\delta that,

t1,i=1txiηiZt1βt,θtθZtβt+θ2.\forall t\geq 1,\left\|\sum_{i=1}^{t}x_{i}\eta_{i}\right\|_{Z_{t}^{-1}}\leq\beta_{t},\quad\|\theta_{t}-\theta^{*}\|_{Z_{t}}\leq\beta_{t}+\|\theta^{*}\|_{2}.
Lemma E.1.

(Adapted from Lemma B.1 in Zhou and Gu, (2022)). Let {σt,βt}t1\left\{\sigma_{t},\beta_{t}\right\}_{t\geq 1} be a sequence of non-negative numbers, ρ,γ>0,{xt}t1d\rho,\gamma>0,\{x_{t}\}_{t\geq 1}\subset\mathbb{R}^{d} and xt21\|x_{t}\|_{2}\leq 1. Let {Zt}t1\{Z_{t}\}_{t\geq 1} and {σ¯t}t1\left\{\bar{\sigma}_{t}\right\}_{t\geq 1} be recursively defined as follows:

Z1=I;Zt+1=Zt+xtxt/σ¯t2,t1,σ¯t=max{σt,ρ,γxtZt11/2}.\displaystyle Z_{1}=I;\quad Z_{t+1}=Z_{t}+x_{t}x_{t}^{\top}/\bar{\sigma}_{t}^{2},\quad\forall t\geq 1,\bar{\sigma}_{t}=\max\{\sigma_{t},\rho,\gamma\|x_{t}\|_{Z_{t}^{-1}}^{1/2}\}.

Let ι=log(1+T/(dρ2))\iota=\log(1+T/(d\rho^{2})). Then we have

t=1Tmin{1,βtxtZt1}2dι+2βTγ2dι+2dιt=1Tβt2(σt2+ρ2).\sum_{t=1}^{T}\min\{1,\beta_{t}\|x_{t}\|_{Z_{t}^{-1}}\}\leq 2d\iota+2\beta_{T}\gamma^{2}d\iota+2\sqrt{d\iota}\sqrt{\sum_{t=1}^{T}\beta_{t}^{2}(\sigma_{t}^{2}+\rho^{2})}.

Following a proof structure similar to that of Async-FedSupLinUCB, we employ a novel Bernstein-type self-normalized martingale inequality, proposed by Zhou and Gu (2022), for layer 0 to manage the variance information. We define \alpha_{0}=\beta_{T} as specified in Theorem E.1, and establish the following lemma, analogous to Lemma B.3.

Lemma E.2.

For any round t\in[T], if client i_{t}=i is active in round t and its arm is chosen in layer 0, then with probability at least 1-\delta, with \alpha_{0}=\tilde{O}(\sqrt{d}) we have for all a\in[K]:

\left|\hat{r}_{t,0,a}-\theta^{\top}x_{t,a}^{i}\right|\leq w_{t,0,a}^{i}=\alpha_{0}\|x_{t,a}^{i}\|_{(A_{t,0}^{i})^{-1}}.

We define good event \mathcal{E} as {|xt,aiθ^t,sixt,aiθ|wt,s,ai,i[M],a[K],t[T],s[0:S]}.\mathcal{E}\triangleq\left\{\left|x_{t,a}^{i\top}\hat{\theta}_{t,s}^{i}-x_{t,a}^{i\top}\theta\right|\leq w_{t,s,a}^{i},\forall i\in[M],a\in[K],t\in[T],s\in[0:S]\right\}. In a manner similar to the proof of Lemma B.4, we have that []1δ\mathbb{P}[\mathcal{E}]\geq 1-\delta.

Lemma E.3.

Conditioned on the event \mathcal{E}, the regret in layer 0 can be bounded by \mathsf{reg}_{\text{layer }0}\leq\tilde{O}(d).

Proof of Lemma E.3.

We set w¯0=dR2\overline{w}_{0}=dR^{2} to provide a tighter bound for the size of ΨT,0\Psi_{T,0}. Mirroring the proof methodology in Lemma C.4, we establish the following:

\overline{w}_{0}|\Psi_{T,0}|\leq\alpha_{0}\sum_{t\in\Psi_{T,0}}\|x_{t,a_{t}}\|_{(A_{t,0}^{i_{t}})^{-1}}\leq 2d\iota+2\alpha_{0}\gamma^{2}d\iota+2\alpha_{0}\sqrt{d\iota}\sqrt{\sum_{t\in\Psi_{T,0}}(\sigma_{t}^{2}+\rho^{2})}
\leq 2d\iota+2\alpha_{0}\gamma^{2}d\iota+2\alpha_{0}\sqrt{d\iota}\sqrt{|\Psi_{T,0}|(R^{2}+\rho^{2})}.

The first inequality results from the arm selection rule of layer 0, the second is derived from Lemma E.1, and the third holds because \sigma_{t}^{2}\leq R^{2}. Consequently, we infer that |\Psi_{T,0}|\leq O(d^{2}R^{2}/\overline{w}_{0}^{2}). We can then bound the regret in layer 0 as follows:

\mathsf{reg}_{\text{layer }0}\leq 4\alpha_{0}\sum_{t\in\Psi_{T,0}}\|x_{t,a_{t}}\|_{(A_{t,0}^{i_{t}})^{-1}}\leq 8d\iota+8\alpha_{0}\gamma^{2}d\iota+8\alpha_{0}\sqrt{d\iota}\sqrt{|\Psi_{T,0}|(R^{2}+\rho^{2})}\leq\tilde{O}(d). ∎

Lemma E.4.

Conditioned on the event \mathcal{E}, the regret of each layer s\in[1:S-1] can be bounded by \mathsf{reg}_{\text{layer }s}\leq\tilde{O}(\sqrt{d\sum_{t}\sigma_{t}^{2}}).

Proof of Lemma E.4.

For s{1,2,,S1}s\in\{1,2,...,S-1\}, the rewards in each layer ss are mutually independent, as proven in Lemma B.1. We deduce:

\mathsf{reg}_{\text{layer }s}\leq 8\overline{w}_{s}|\Psi_{T,s}|\leq 8\alpha_{s}\sum_{t\in\Psi_{T,s}}\|x_{t,a_{t}}\|_{(A_{t,s}^{i_{t}})^{-1}}
\leq 8\alpha_{s}\sqrt{2(1+MC)}\sum_{t\in\Psi_{T,s}}\|x_{t,a_{t}}\|_{(A_{t,s}^{all})^{-1}}+8\alpha_{s}dM\log(1+T/d)
\leq\tilde{O}\Big(\sqrt{d\sum_{t\in\Psi_{T,s}}\sigma_{t}^{2}}\Big).

The first inequality arises from Lemma B.6, the second is a result of the arm selection rule in Line 13, the third derives from Lemma C.3, and the final inequality is attributable to Lemma E.1. ∎

For the final layer S, applying Lemma B.6 and setting \overline{w}_{S}=d/T, we have \mathsf{reg}_{\text{layer }S}\leq 8\overline{w}_{S}|\Psi_{T,S}|\leq\tilde{O}(d).

Proof of the communication bound in Theorem 7.1.

Having established the bound on the regret in each layer, we have demonstrated that R_{T}\leq\tilde{O}\left(\sqrt{d\sum_{t=1}^{T}\sigma_{t}^{2}}\right). Given that we set \overline{w}_{0}=dR^{2} and \overline{w}_{S}=d/T, it requires S=\log(\overline{w}_{0}/\overline{w}_{S})=\Theta(\log R+\log T) layers to achieve the desired accuracy. The number of communications triggered by each layer can be upper bounded by O(dM^{2}\log T) (Lemma C.1 with C=1/M^{2}). Consequently, the overall communication cost is constrained to \tilde{O}(dM^{2}\log^{2}T). ∎

Appendix F Corruption Robust Async-FedSupLinUCB

The corruption robust SupLinUCB subroutine is presented in Alg. 7, while the complete corruption robust Async-FedSupLinUCB is given in Alg. 8.

F.1 Algorithm

Algorithm 7 Corruption Robust SupLinUCB subroutine: CS-LUCB
1:Initialization: S=logdS=\lceil\log d\rceil, w¯0=d1.5/T\overline{w}_{0}=d^{1.5}/\sqrt{T}, w¯s2sw¯0\overline{w}_{s}\leftarrow 2^{-s}\overline{w}_{0}, γ=d/Cp\gamma=\sqrt{d}/C_{p}.
2:α0=1+dln(2M2T/δ)+γCp,αs1+2ln(2KMTlnd/δ)+γCp,s[1:S]\alpha_{0}=1+\sqrt{d\ln(2M^{2}T/\delta)}+\gamma C_{p},\alpha_{s}\leftarrow 1+\sqrt{2\ln(2KMT\ln d/\delta)}+\gamma C_{p},\forall s\in[1:S]
3:Input: Client ii (with local information Ai,biA^{i},b^{i}, ΔAi,Δbi\Delta A^{i},\Delta b^{i}), contexts set {xt,1i,,xt,Ki}\{x_{t,1}^{i},\ldots,x_{t,K}^{i}\}
4:At,siAsi,bt,sibsiA_{t,s}^{i}\leftarrow A_{s}^{i},b_{t,s}^{i}\leftarrow b_{s}^{i} for lazy update
5:θ^s(At,si)1bt,si\hat{\theta}_{s}\leftarrow(A^{i}_{t,s})^{-1}b^{i}_{t,s}, r^t,s,ai=θ^sxt,ai\hat{r}_{t,s,a}^{i}=\hat{\theta}_{s}^{\top}x_{t,a}^{i}, wt,s,aiαsxt,ai(At,si)1w_{t,s,a}^{i}\leftarrow\alpha_{s}\|x^{i}_{t,a}\|_{(A^{i}_{t,s})^{-1}}, s[0:S],a[K]\forall s\in[0:S],\forall a\in[K].
6:s0s\leftarrow 0; 𝒜0{a[K]r^t,0,ai+wt,0,aimaxa[K](r^t,0,aiwt,0,ai)}\mathcal{A}_{0}\leftarrow\{a\in[K]\mid\hat{r}_{t,0,a}^{i}+w_{t,0,a}^{i}\geq\max_{a\in[K]}(\hat{r}_{t,0,a}^{i}-w_{t,0,a}^{i})\}. \triangleright Initial screening
7:repeat\triangleright Layered successive screening
8:     if s=Ss=S then
9:         Choose action atia_{t}^{i} arbitrarily from 𝒜S\mathcal{A}_{S}
10:     else if wt,s,aiw¯sw_{t,s,a}^{i}\leq\overline{w}_{s} for all a𝒜sa\in\mathcal{A}_{s} then
11:         𝒜s+1{a𝒜sr^t,s,aimaxa𝒜s(r^t,s,ai)2w¯s}\mathcal{A}_{s+1}\leftarrow\{a\in\mathcal{A}_{s}\mid\hat{r}_{t,s,a}^{i}\geq\max_{a^{\prime}\in\mathcal{A}_{s}}(\hat{r}_{t,s,a^{\prime}}^{i})-2\overline{w}_{s}\}; ss+1s\leftarrow s+1
12:     else
13:         atiargmax{a𝒜s,wt,s,ai>w¯s}wt,s,aia_{t}^{i}\leftarrow\arg\max_{\{a\in\mathcal{A}_{s},w_{t,s,a}^{i}>\overline{w}_{s}\}}w_{t,s,a}^{i}
14:     end if
15:until action atia_{t}^{i} is found
16:Take action a_{t}^{i} and receive reward r_{t,a_{t}^{i}}^{i}
17:ηt=min{1,γ/xt,ati(At,si)1}\eta_{t}=\min\{1,\gamma/\|x^{i}_{t,a_{t}}\|_{(A^{i}_{t,s})^{-1}}\}
18:ΔAsiΔAsi+ηtxt,atiixt,atii\Delta A_{s}^{i}\leftarrow\Delta A_{s}^{i}+\eta_{t}x_{t,a_{t}^{i}}^{i}x_{t,a_{t}^{i}}^{i\top}, ΔbsiΔbsi+ηtrt,atiixt,atii\Delta b_{s}^{i}\leftarrow\Delta b_{s}^{i}+\eta_{t}r_{t,a_{t}^{i}}^{i}x_{t,a_{t}^{i}}^{i} \triangleright Update local information
19:Return layer index ss
Algorithm 8 Corruption Robust Async-FedSupLinUCB
1:Initialization: TT, CC, S=logdS=\lceil\log d\rceil
2:{AsserId,bsser0s[0:S]}\{A_{s}^{ser}\leftarrow I_{d},b_{s}^{ser}\leftarrow 0\mid s\in[0:S]\} \triangleright Server initialization
3:{AsiId,ΔAsi,bsi,Δbsi0s[0:S],i[M]}\{A_{s}^{i}\leftarrow I_{d},\Delta A_{s}^{i},b_{s}^{i},\Delta b_{s}^{i}\leftarrow 0\mid s\in[0:S],i\in[M]\} \triangleright Clients initialization
4:for t=1,2,,Tt=1,2,\cdots,T do
5:     Client it=ii_{t}=i is active, and observes KK contexts {xt,1i,xt,2i,,xt,Ki}\{x_{t,1}^{i},x_{t,2}^{i},\cdots,x_{t,K}^{i}\}
6:     ss\leftarrow CS-LUCB (client i,{xt,1i,xt,2i,,xt,Ki})\left(\text{client }i,\{x_{t,1}^{i},x_{t,2}^{i},\cdots,x_{t,K}^{i}\}\right) with lazy update
7:     if det(Asi+ΔAsi)det(Asi)>(1+C)\frac{\det(A_{s}^{i}+\Delta A_{s}^{i})}{\det(A_{s}^{i})}>(1+C) then
8:         \operatorname{Sync}(s, server, client i) for each s\in[0:S]
9:     end if
10:end for

F.2 Supporting Lemmas and Proof

When confronted with adversarial corruption, we utilize weighted ridge regression in which the weight assigned to each selected action depends on its confidence width. Further, we enlarge the confidence width to accommodate the corruption, with \alpha_{0}=1+\sqrt{d\ln(2M^{2}T/\delta)}+\gamma C_{p} and \alpha_{s}=1+\sqrt{2\ln(2KMT\ln d/\delta)}+\gamma C_{p}, as proposed in He et al., 2022b. In our analysis of layer 0, we adapt Lemma B.1 of He et al., 2022b to the federated scenario, yielding the following lemma:
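A minimal sketch of this uncertainty-based weighting (Lines 17–18 of Algorithm 7; numpy, hypothetical function name) is:

import numpy as np

def corruption_robust_update(A, b, x, r, gamma):
    # Line 17: eta = min{1, gamma / ||x||_{A^{-1}}}; actions with a large
    # confidence width are down-weighted, so one corrupted round can shift
    # the weighted regression only by a bounded amount.
    width = float(np.sqrt(x @ np.linalg.solve(A, x)))
    eta = 1.0 if width == 0.0 else min(1.0, gamma / width)
    # Line 18: weighted rank-one update of the local buffer
    A = A + eta * np.outer(x, x)
    b = b + eta * r * x
    return A, b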

Lemma F.1.

(Adapted from Lemma B.1 in He et al., 2022b ) Under the setting of Theorem 5.1, in the layer 0, with probability at least 1δ1-\delta, the following event 0\mathcal{E}_{0} happens:

0{|xt,aiθ^t,sixt,aiθ|wt,s,ai,i[M],a[K],t[T],s=0}.\mathcal{E}_{0}\triangleq\{\left|x_{t,a}^{i\top}\hat{\theta}_{t,s}^{i}-x_{t,a}^{i\top}\theta\right|\leq w_{t,s,a}^{i},\forall i\in[M],a\in[K],t\in[T],s=0\}.

For each layer s[S]s\in[S], the rewards are mutually independent, analogous to the proof of Lemma B.1. We can restate the lemma as follows:

Lemma F.2.

Suppose the time index set \Psi_{t,s} is constructed so that, for fixed x_{\tau,a_{\tau}} with \tau\in\Psi_{t,s}, the rewards \{r_{\tau,a_{\tau}}\} are independent random variables with mean \mathbb{E}[r_{\tau,a_{\tau}}]=\theta^{\top}x_{\tau,a_{\tau}}+c_{\tau}. For any round t\in[T], if client i_{t}=i is active and chooses its arm in layer s\in[S], then with probability at least 1-\frac{\delta}{MT\ln d}, we have for all a\in[K]:

\left|\hat{r}_{t,s,a}-\theta^{\top}x_{t,a}^{i}\right|\leq w_{t,s,a}^{i}=\alpha_{s}\|x_{t,a}^{i}\|_{(A_{t,s}^{i})^{-1}}.

After combining the aforementioned events, we redefine the good event in the presence of corruption as follows:

{|xt,aiθ^t,sixt,aiθ|wt,s,ai,i[M],a[K],t[T],s[0:S]}.\mathcal{E}\triangleq\left\{\left|x_{t,a}^{i\top}\hat{\theta}_{t,s}^{i}-x_{t,a}^{i\top}\theta\right|\leq w_{t,s,a}^{i},\forall i\in[M],a\in[K],t\in[T],s\in[0:S]\right\}.

Similar to the proof of Lemma B.4, we have \mathbb{P}[\mathcal{E}]\geq 1-\delta.

Lemma F.3.

Conditioned on the good event \mathcal{E}, the regret of layer s\in[0:S-1] can be bounded as \mathsf{reg}_{\text{layer }s}\leq\tilde{O}(\sqrt{dT}+dC_{p}).

Proof of Lemma F.3.

Conditioned on the good event \mathcal{E}, we adopt an approach similar to the regret decomposition analysis of He et al., 2022b to bound the regret in each layer s\in[0:S-1]:

𝔼tΨT,s(rt,ati,irt,ati)tΨT,s8wt,s,at=tΨT,s8αsxt,ai(At,si)1\displaystyle\mathbb{E}\sum_{t\in\Psi_{T,s}}(r^{i}_{t,a_{t}^{i,*}}-r^{i}_{t,a_{t}})\leq\sum_{t\in\Psi_{T,s}}8w_{t,s,a_{t}}=\sum_{t\in\Psi_{T,s}}8\alpha_{s}\|x^{i}_{t,a}\|_{(A^{i}_{t,s})^{-1}} (7)
=8αstΨT,s,ηt=1xt,ai(At,si)1I1+8αstΨT,s,ηt<1xt,ai(At,si)1I2.\displaystyle=\underbrace{8\alpha_{s}\sum_{t\in\Psi_{T,s},\eta_{t}=1}\|x^{i}_{t,a}\|_{(A^{i}_{t,s})^{-1}}}_{I_{1}}+\underbrace{8\alpha_{s}\sum_{t\in\Psi_{T,s},\eta_{t}<1}\|x^{i}_{t,a}\|_{(A^{i}_{t,s})^{-1}}}_{I_{2}}. (8)

The first inequality is derived from Lemma B.6, while Equation 8 follows from the definition of ηt\eta_{t}. For the term I1I_{1}, we consider the rounds with ηt=1\eta_{t}=1, assuming these rounds can be listed as {k1,k2,,kn}\{k_{1},k_{2},...,k_{n}\}. To analyze this, we construct the auxiliary matrix Bt,s=I+j=1nxkjxkjI{kjt}B_{t,s}=I+\sum_{j=1}^{n}x_{k_{j}}x_{k_{j}}^{\top}I\{k_{j}\leq t\}. Using the definition of At,siA_{t,s}^{i}, we can establish the inequality At,si11+MCAt,sall11+MCBt,sA_{t,s}^{i}\succeq\frac{1}{1+MC}A_{t,s}^{all}\succeq\frac{1}{1+MC}B_{t,s}.

Then we have

I1=tΨT,s,ηt=18αsxt,ai(At,si)1\displaystyle I_{1}=\sum_{t\in\Psi_{T,s},\eta_{t}=1}8\alpha_{s}\|x^{i}_{t,a}\|_{(A^{i}_{t,s})^{-1}}
8αs2(1+MC)2d|ΨT,s|log|ΨT,s|+8αsdMlog(1+T/d)O~(dT),\displaystyle\leq 8\alpha_{s}\sqrt{2(1+MC)}\sqrt{2d|\Psi_{T,s}|\log|\Psi_{T,s}|}+8\alpha_{s}dM\log(1+T/d)\leq\tilde{O}(\sqrt{dT}),

where the first inequality follows from Lemma C.3, and the second inequality is obtained by noting that the size of \Psi_{T,0} is bounded by \tilde{O}(T/d), as stated in Lemma C.4 for layer 0.

For the term I2I_{2}, using the property ηt<1\eta_{t}<1, we can express ηt\eta_{t} as ηt=γ/xt,ai(At,si)1\eta_{t}=\gamma/\|x^{i}_{t,a}\|_{(A^{i}_{t,s})^{-1}}, which implies:

I2=tΨT,s,ηt<18αsxt,ai(At,si)1\displaystyle I_{2}=\sum_{t\in\Psi_{T,s},\eta_{t}<1}8\alpha_{s}\|x^{i}_{t,a}\|_{(A^{i}_{t,s})^{-1}}
tΨT,s,ηt<18αsγηtxt,ai(At,si)1xt,aiαsγdlog(T)O~(dCp),\displaystyle\leq\sum_{t\in\Psi_{T,s},\eta_{t}<1}8\frac{\alpha_{s}}{\gamma}\eta_{t}x^{i\top}_{t,a}(A^{i}_{t,s})^{-1}x^{i}_{t,a}\leq\frac{\alpha_{s}}{\gamma}d\log(T)\leq\tilde{O}(dC_{p}),

where the first inequality is derived from the definition of \eta_{t}, the second inequality is obtained from the elliptical potential lemma (Lemma A.1), and the third inequality stems from the definitions of \alpha_{s} and \gamma=\sqrt{d}/C_{p}.

By combining I_{1} and I_{2}, we can ultimately bound the regret in each layer s\in[0:S-1] as \mathsf{reg}_{\text{layer }s}\leq\tilde{O}(\sqrt{dT}+dC_{p}). ∎

For the regret that occurs in the last layer SS, we can derive the following bound:

tΨT,S𝔼[rt,ati,irt,ati]tΨT,S8w¯S8w¯S|ΨT,S|8w¯ST8dT.\displaystyle\sum_{t\in\Psi_{T,S}}\mathbb{E}\left[r^{i}_{t,a_{t}^{i,*}}-r^{i}_{t,a_{t}}\right]\leq\sum_{t\in\Psi_{T,S}}8\overline{w}_{S}\leq 8\overline{w}_{S}|\Psi_{T,S}|\leq 8\overline{w}_{S}T\leq 8\sqrt{dT}.

The first inequality is from Lemma B.6, and the last inequality follows from w¯S=d/T\overline{w}_{S}=\sqrt{d/T}.

Proof of the communication bound in Theorem 7.2.

By combining the regret in each layer, we conclude that R_{T}\leq\tilde{O}(\sqrt{dT}+dC_{p}). Note that, since \eta_{t}\leq 1, Lemma A.1 still gives \log(\det(A_{t,s}^{all}))\leq d\log(1+|\Psi_{T,s}|/d). Following a proof similar to that of Lemma C.1, we can bound the number of communication rounds in layer s by O(dM^{2}\log T). Since the FedSupLinUCB algorithm has S=\lceil\log d\rceil layers, the total communication cost is upper bounded by O(dM^{2}\log d\log T). ∎