Federated Linear Bandits
with Finite Adversarial Actions
Abstract
We study a federated linear bandits model, where $M$ clients communicate with a central server to solve a linear contextual bandits problem with finite adversarial action sets that may be different across clients. To address the unique challenges of adversarial finite action sets, we propose the FedSupLinUCB algorithm, which extends the principles of the SupLinUCB and OFUL algorithms in linear contextual bandits. We prove that FedSupLinUCB achieves a total regret of $\tilde{O}(\sqrt{dT})$, where $T$ is the total number of arm pulls from all clients and $d$ is the ambient dimension of the linear model. This matches the minimax lower bound and thus is order-optimal (up to polylog terms). We study both asynchronous and synchronous cases and show that the communication cost can be controlled as $O(dM^2\log d\log T)$ and $O(\sqrt{d^3M^3}\log d)$, respectively. The FedSupLinUCB design is further extended to two scenarios: (1) variance-adaptive, where a total regret of $\tilde{O}(\sqrt{d\sum_{t=1}^{T}\sigma_t^2})$ can be achieved, with $\sigma_t^2$ being the noise variance of round $t$; and (2) adversarial corruption, where a total regret of $\tilde{O}(\sqrt{dT} + dC_p)$ can be achieved, with $C_p$ being the total corruption budget. Experiment results corroborate the theoretical analysis and demonstrate the effectiveness of FedSupLinUCB on both synthetic and real-world datasets.
1 Introduction
In the canonical formulation of contextual bandits, a single player repeatedly makes arm-pulling decisions based on contextual information, with the goal of maximizing the long-term reward. With the emerging federated learning paradigm (McMahan et al., 2017), where multiple clients and a server jointly learn a global model, with each client locally updating the model with its own data and the server only aggregating the local models periodically, researchers have started exploring contextual bandits algorithms in such a federated learning setting (Dubey and Pentland, 2020; Huang et al., 2021; Li and Wang, 2022a,b). This federated contextual bandits framework broadens the applicability of contextual bandits to practical scenarios such as recommender systems, clinical trials, and cognitive radio. In these applications, although the goal is still to maximize the cumulative reward for the overall system, decision-making and observations are naturally distributed at the participating clients.
Several intrinsic challenges arise with the federated contextual bandit formulation. One important issue is that besides regret, we should also take into account the communication cost, which is usually the system bottleneck. To reduce the communication cost while maintaining the same regret guarantee, the clients should transmit the necessary information to the server only when the local information has accumulated to the extent that it would affect the decision-making. Compared with the centralized contextual bandits, which have a linearly growing communication cost, algorithms for federated contextual bandits attempt to achieve a comparable regret with sub-linear communication cost.
Second, most existing studies on federated contextual bandits focus on the synchronous communication scenario (Huang et al., 2021; Li and Wang, 2022b), in which all participating clients first upload local information and then download updated global information from the server in each communication round. This stringent communication requirement is often not met in practice. A recent work of Li and Wang (2022a) studies the asynchronous federated linear bandit problem. However, communications for different clients are not independent in their approach, because the upload from one client may trigger the server to perform downloads for all clients. To address this issue, He et al. (2022a) propose FedLinUCB, which enables independent synchronizations between clients and the server.
Third, the majority of prior studies on federated linear bandits focus on the infinite-arm setting (Li and Wang, 2022a,b; He et al., 2022a) (see Section 2 for a detailed literature review). From a methodology point of view, these papers largely build on the OFUL principle (Abbasi-Yadkori et al., 2011). One notable exception is Huang et al. (2021), which studies synchronous communication with fixed contexts and proposes the Fed-PE algorithm based on the phased elimination G-optimal design (Lattimore and Szepesvári, 2020). To the best of our knowledge, no prior result exists for federated linear bandits with finite arms and time-evolving adversarial contexts, which is the focus of our work.
| System | Action set | Algorithm | Regret | Communication |
|---|---|---|---|---|
| Single-player | infinite | OFUL (Abbasi-Yadkori et al., 2011) | $\tilde{O}(d\sqrt{T})$ | N/A |
| Single-player | finite, fixed | PE + G-optimal (Lattimore and Szepesvári, 2020) | $\tilde{O}(\sqrt{dT\log K})$ | N/A |
| Single-player | finite, adversarial | SupLinUCB (Chu et al., 2011) | $\tilde{O}(\sqrt{dT})$ | N/A |
| Federated (Async) | infinite | FedLinUCB (He et al., 2022a) | $\tilde{O}(d\sqrt{T})$ | $O(dM^2)$ |
| Federated (Async) | infinite | Async-LinUCB (Li and Wang, 2022a) | $\tilde{O}(d\sqrt{T})$ | $O(dM^2\log T)$ |
| Federated (Sync) | infinite | DisLinUCB (Wang et al., 2019) | $\tilde{O}(d\sqrt{T})$ | $O(M^{1.5}d^3)$ |
| Federated (Sync) | finite, fixed | Fed-PE (Huang et al., 2021) | $\tilde{O}(\sqrt{dT\log K})$ | $O((d^2+dK)M\log T)$ |
| Federated (Async) | finite, adversarial | FedSupLinUCB (This work) | $\tilde{O}(\sqrt{dT})$ | $O(dM^2\log d\log T)$ |
| Federated (Sync) | finite, adversarial | FedSupLinUCB (This work) | $\tilde{O}(\sqrt{dT})$ | $O(\sqrt{d^3M^3}\log d)$ |
$d$: the dimension of the unknown parameter; $M$: the number of clients; $K$: the number of finite actions; $T$: the total number of arm pulls from all clients.
Main contributions.
Our main contributions are summarized as follows.
• We develop a general federated bandits framework, termed FedSupLinUCB, for solving the problem of federated linear contextual bandits with finite adversarial actions. FedSupLinUCB extends SupLinUCB (Chu et al., 2011; Ruan et al., 2021) and OFUL (Abbasi-Yadkori et al., 2011), two important principles in (single-player, centralized) linear bandits, to the federated bandits setting with a carefully designed layered successive screening.
• We instantiate FedSupLinUCB with both asynchronous and synchronous client activities. For the former setting, we propose Async-FedSupLinUCB, where communication is triggered only when the cumulative local information impacts the exploration uncertainty to a certain extent. We prove that Async-FedSupLinUCB achieves $\tilde{O}(\sqrt{dT})$ regret with $O(dM^2\log d\log T)$ communication cost, which not only reduces the regret by a factor of $\sqrt{d}$ compared with previous results on asynchronous federated linear bandits with infinite arms, but also matches the minimax lower bound up to polylog terms, indicating that Async-FedSupLinUCB achieves order-optimal regret.
• For synchronous communications, we propose Sync-FedSupLinUCB, which has a refined communication design where only certain layers are communicated, as opposed to the complete information. Sync-FedSupLinUCB achieves an order-optimal regret of $\tilde{O}(\sqrt{dT})$ with horizon-independent communication cost $O(\sqrt{d^3M^3}\log d)$. Compared with the best previous result (Huang et al., 2021), which achieves the same order-optimal regret but only for fixed actions, we show that it is the finiteness of the actions, not their being fixed, that fundamentally determines the regret behavior in the federated linear bandits setting.
• We further develop two extensions of FedSupLinUCB: (1) Variance-adaptive FedSupLinUCB, for which a total regret of $\tilde{O}(\sqrt{d\sum_{t=1}^{T}\sigma_t^2})$ is achieved, where $\sigma_t^2$ is the noise variance at round $t$. (2) Adversarial corruption FedSupLinUCB, for which a total regret of $\tilde{O}(\sqrt{dT} + dC_p)$ is achieved, where $C_p$ is the total corruption budget.
2 Related Works
The linear bandit model, as a generalization of finite-armed bandits with linear contextual information, has been extensively studied. The setting of infinite arm sets solved by LinUCB was analyzed in Dani et al. (2008); Abbasi-Yadkori et al. (2011), which achieves $\tilde{O}(d\sqrt{T})$ regret with an appropriate confidence width (Abbasi-Yadkori et al., 2011) and matches the lower bound (Dani et al., 2008) up to logarithmic factors. In contrast, algorithms like SupLinRel (Auer, 2002) and SupLinUCB (Chu et al., 2011) achieve $\tilde{O}(\sqrt{dT})$ regret (with polylog dependence on the number of actions $K$) in the setting of finite time-varying adversarial arm sets, against an $\Omega(\sqrt{dT})$ lower bound (Chu et al., 2011). The SupLinUCB algorithm was later optimized to match the lower bound up to iterated-logarithmic factors in Li et al. (2019). As a special case of the finite-arm setting, if the arm set is time-invariant, an elimination-based algorithm via G-optimal design (Lattimore and Szepesvári, 2020) can be applied to achieve similarly optimal performance.
The federated linear bandits problem has been studied under the settings of infinite arm sets (Dubey and Pentland, 2020; Li et al., 2020; Li and Wang, 2022a) and time-invariant finite arm sets (Huang et al., 2021), while the time-varying finite arm set setting has not been well explored. A finite time-varying arm set has many meaningful practical applications, such as recommender systems (Li et al., 2010; Chu et al., 2011), and the distributed (federated) nature of these applications naturally falls within the federated linear bandits problem with finite time-varying arms. This paper fills the gap by generalizing the SupLinUCB algorithm to the federated setting.
We study both the asynchronous setting (Li and Wang, 2022a; He et al., 2022a), where clients are active on their own and full participation is not required, and the synchronous setting (Shi et al., 2021; Dubey and Pentland, 2020), where all clients make decisions at each round and each communication round requires all clients to upload new information to the server and download the updated information. We design algorithms that reduce the communication cost while maintaining optimal regret. Technically, the communication cost is tied to algorithmic adaptivity, since less adaptivity requires fewer updates and thus fewer communication rounds. The algorithmic adaptivity of linear bandits algorithms has been studied in the single-player setting (Han et al., 2020; Ruan et al., 2021) as well as in the federated setting (Wang et al., 2019; Huang et al., 2021; Salgia and Zhao, 2023).
3 System Model and Preliminaries
3.1 Problem Formulation
We consider a federated linear contextual bandits model with finite but possibly time-varying arms. The model consists of $M$ clients and one server in a star-shaped communication network. The clients jointly solve a linear bandit problem by collecting local information and communicating with the central server through the star-shaped network in a federated manner, with no direct communication among clients. The only function of the server is to aggregate received client information and to send updated information back to the clients; it cannot directly play the bandits game.
Specifically, some clients are active at round $t$. An active client $i$ receives $K$ arms (actions to take) associated with contexts $\{x^i_{t,a}\}_{a\in[K]} \subset \mathbb{R}^d$ with $\|x^i_{t,a}\|_2 \le 1$. Here we adopt the oblivious adversarial setting, where all contexts are chosen beforehand and do not depend on previous game observations. Client $i$ then pulls an arm $a^i_t \in [K]$ based on the information collected locally as well as that previously communicated from the server. A reward $r^i_t = \theta^{*\top} x^i_{t,a^i_t} + \epsilon_t$ is revealed privately to client $i$, where $\theta^* \in \mathbb{R}^d$ is an unknown weight vector with $\|\theta^*\|_2 \le 1$ and $\epsilon_t$ is an independent 1-sub-Gaussian noise. At the end of round $t$, depending on the communication protocol, client $i$ may exchange the collected local information with the server so that it can update the global information.
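As a concrete illustration, the observation model above can be simulated as follows. This is a minimal sketch with our own variable names, not the authors' code; the noise is Gaussian, a particular sub-Gaussian choice.

```python
import numpy as np

# Minimal sketch of the observation model: an active client sees K adversarial
# contexts in the unit ball and observes a noisy linear reward for the arm it
# pulls. Names (theta_star, d, K) are illustrative.
rng = np.random.default_rng(0)
d, K = 4, 5
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)                  # ||theta*||_2 <= 1

def observe_reward(x: np.ndarray, noise_std: float = 1.0) -> float:
    """Reward r = <theta*, x> + eps, with (here Gaussian) sub-Gaussian noise."""
    return float(theta_star @ x + rng.normal(scale=noise_std))

contexts = rng.normal(size=(K, d))
contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)   # ||x||_2 <= 1
reward = observe_reward(contexts[0])
```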
We aim to design algorithms that guide the clients' decision-making and overall communication behaviors. We analyze two patterns of client activity: 1) synchronous, where all $M$ clients are active at each round; and 2) asynchronous, where one client is active at each round. For the latter case, we further assume that client activity is independent of the data and history. Denote by $T_i$ the number of times client $i$ is active. In the former case, $T_i = T/M$ for every client, while in the latter case, $T_i$ may differ among clients. We define $T = \sum_{i=1}^{M} T_i$ as the total number of arm pulls from all clients.
The performance is measured under two metrics, total regret and communication cost, which concern the decision-making effectiveness and the communication efficiency, respectively. Denote by $\mathcal{T}_i$ the set of time indices at which client $i$ is active, with $T_i = |\mathcal{T}_i|$. The total regret is defined as

$$R_T = \sum_{i=1}^{M} \sum_{t\in\mathcal{T}_i} \mathbb{E}\left[\theta^{*\top} x^i_{t,a^{i,*}_t} - \theta^{*\top} x^i_{t,a^i_t}\right], \qquad (1)$$

where $a^{i,*}_t = \arg\max_{a\in[K]} \theta^{*\top} x^i_{t,a}$. We define the communication cost as the total number of communication rounds between the clients and the server.
3.2 Preliminaries
Information encoding.
In the linear bandits setting (federated or not), the information a client acquires is usually encoded by the gram matrix and the action-reward vector. Specifically, when the client has observed $n$ action-reward pairs $\{(x_s, r_s)\}_{s=1}^{n}$, the information is encoded by the matrix $A = \sum_{s=1}^{n} x_s x_s^\top$ and the vector $b = \sum_{s=1}^{n} r_s x_s$. Denote by $\mathrm{Enc}(\cdot)$ this encoding function, i.e., $(A, b) = \mathrm{Enc}(\{(x_s, r_s)\}_{s=1}^{n})$.
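A minimal sketch of this encoding (function and variable names are ours):

```python
import numpy as np

# Enc: map observed action-reward pairs {(x_s, r_s)} to the gram matrix
# A = sum_s x_s x_s^T and the action-reward vector b = sum_s r_s x_s.
def encode(pairs, d):
    A = np.zeros((d, d))
    b = np.zeros(d)
    for x, r in pairs:
        A += np.outer(x, x)
        b += r * x
    return A, b

pairs = [(np.array([1.0, 0.0]), 0.5), (np.array([0.0, 1.0]), -0.2)]
A, b = encode(pairs, d=2)
# here A = I (the two contexts are orthonormal) and b = (0.5, -0.2)
```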
Communication criterion.
Communication in our proposed framework is data-dependent, in the same spirit as the "doubling trick" introduced in Abbasi-Yadkori et al. (2011) to reduce the computation complexity in single-player linear bandits. The key idea is that communication is triggered only when the cumulative local information, measured by the determinant of the gram matrix $A$, has affected the exploration uncertainty to a sufficient extent, at which point the client communicates with the server. Detailed communication protocols will be presented in each algorithm design.
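The determinant-based trigger can be sketched as follows. The threshold name `C` and the exact form of the test are our assumptions for illustration; the precise criterion appears in each algorithm.

```python
import numpy as np

# Communicate only when the new local data has grown the log-determinant of
# the gram matrix beyond a threshold since the last synchronization.
def should_communicate(delta_A, A_last_sync, C=0.5):
    _, logdet_new = np.linalg.slogdet(A_last_sync + delta_A)
    _, logdet_old = np.linalg.slogdet(A_last_sync)
    return (logdet_new - logdet_old) > np.log(1.0 + C)

A_sync = np.eye(2)
big_update = np.eye(2)          # det quadruples: triggers communication
tiny_update = 1e-3 * np.eye(2)  # det barely moves: stay silent
```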
Synchronization procedure.
Denote by $\mathrm{Sync}(\cdot)$ a routine in which clients (client $1, \ldots,$ client $M$) first communicate their local gram matrices and action-reward vectors to the server, and the server then aggregates the matrices (and vectors) into one gram matrix (and one action-reward vector) and transmits them back to the clients. Specifically, each client $i$ holds newly observed local information $(\Delta A_i, \Delta b_i)$, which is the difference between the client's current information and its information after the last synchronization. In other words, it is the information that has not yet been communicated to the server. The server, after receiving the local information, updates the server-side information by $A \leftarrow A + \sum_{i} \Delta A_i$ and $b \leftarrow b + \sum_{i} \Delta b_i$ and sends the result back to each of the clients. Each client then overwrites its local information with the updated $(A, b)$. The procedure is formally presented in Algorithm 1.
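The synchronization step described above can be sketched as follows (hypothetical names; a simplification of Algorithm 1 to a single information layer):

```python
import numpy as np

def sync(A_ser, b_ser, client_deltas):
    """Server folds each client's uncommunicated (dA_i, db_i) into the global
    information and broadcasts it back; clients overwrite their local copies."""
    for dA, db in client_deltas:
        A_ser = A_ser + dA
        b_ser = b_ser + db
    return A_ser, b_ser   # every client's local (A, b) becomes this pair

d = 2
A_ser, b_ser = np.eye(d), np.zeros(d)
deltas = [(np.outer(x, x), r * x)
          for x, r in [(np.array([1.0, 0.0]), 1.0), (np.array([0.0, 1.0]), 2.0)]]
A_ser, b_ser = sync(A_ser, b_ser, deltas)
```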
4 The FedSupLinUCB Framework
In this section, we present a general framework of federated bandits for linear bandits with finite oblivious adversarial actions. Two instances (asynchronous and synchronous) of this general framework will be discussed in subsequent sections.
Building block: SupLinUCB.
As the name suggests, the proposed FedSupLinUCB framework is built upon the principle of SupLinUCB (Chu et al., 2011; Ruan et al., 2021). The information $(A, b)$ is useful in the sense that the reward corresponding to an action $x$ can be estimated within the confidence interval $\hat{\theta}^\top x \pm \alpha \|x\|_{A^{-1}}$, where $\hat{\theta} = A^{-1} b$. It is shown in Abbasi-Yadkori et al. (2011) that in linear bandits (even with an infinite number of actions), with $\alpha = \tilde{O}(\sqrt{d})$, the true reward is within the confidence interval with high probability. Moreover, if the rewards in the action-reward vector are mutually independent, $\alpha$ can be reduced to $\tilde{O}(1)$. The former choice of $\alpha$ naturally guarantees $\tilde{O}(d\sqrt{T})$ regret. However, to achieve regret $\tilde{O}(\sqrt{dT})$, it is critical to keep $\alpha = \tilde{O}(1)$. This is fulfilled by the SupLinUCB algorithm (Chu et al., 2011), recently improved by Ruan et al. (2021). The key intuition is to successively refine an action set that contains the optimal action, where the estimation precision of the sets is geometrically strengthened. Specifically, the algorithm maintains $S+1$ layers of information pairs $\{(A^s, b^s)\}_{s=0}^{S}$, and the rewards in each action-reward vector are mutually independent, except for layer $0$. The confidence radius halves with each layer, starting from $\tilde{O}(\sqrt{d})$ at layer $0$.
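The point estimate and confidence width used throughout can be computed as below. This is a sketch with our own names; it implements $\hat{\theta} = A^{-1}b$ and width $\alpha\|x\|_{A^{-1}}$.

```python
import numpy as np

def estimate_and_width(A, b, x, alpha=1.0):
    """Point estimate theta_hat^T x and confidence width
    alpha * sqrt(x^T A^{-1} x) for the reward of action x."""
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b
    return float(theta_hat @ x), float(alpha * np.sqrt(x @ A_inv @ x))

# direction e1 is well explored, e2 is not: e1's width is smaller
A = np.eye(2) + 3.0 * np.outer([1.0, 0.0], [1.0, 0.0])
b = np.array([2.0, 0.0])
mean1, w1 = estimate_and_width(A, b, np.array([1.0, 0.0]))
mean2, w2 = estimate_and_width(A, b, np.array([0.0, 1.0]))
```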
FedSupLinUCB.
S-LUCB, presented in Algorithm 2, combines the principles of SupLinUCB and OFUL (Abbasi-Yadkori et al., 2011) and is the core subroutine of FedSupLinUCB. We maintain $S+1$ information layers; the estimation accuracy starts from $\tilde{O}(\sqrt{d})$ at layer $0$ and halves as the layer index increases. It thus takes $S = \Theta(\log d)$ layers to reach the sufficient accuracy of $\tilde{O}(1)$ and achieve the minimax-optimal regret.
When client $i$ is active, the input parameters contain the information received from the server at the last communication round, the new local information collected between two consecutive communication rounds, and the set of contexts observed in this round. Client $i$ can estimate the unknown parameter either with all available information or with a lazy update; this choice depends on the communication protocol and will be elaborated later. During the decision-making process, client $i$ first performs arm elimination at layer $0$ to help bootstrap the accuracy parameters. Then, it carries out the layered successive screening in the same manner as the SupLinUCB algorithm, sequentially eliminating suboptimal arms based on their empirical means and confidence widths. After taking the action and receiving the corresponding reward, client $i$ updates its local information set by aggregating the context into the layer in which the action was taken, before returning that layer's index.
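One screening step of this layered procedure can be sketched as follows. This is our simplification of SupLinUCB-style screening; the exact constants and layer bookkeeping are in Algorithm 2.

```python
def screen(arms, means, widths, s):
    """One screening step at layer s with target accuracy 2^{-s}: explore any
    arm whose confidence width is still too large; otherwise eliminate arms
    whose estimated mean is more than 2 * 2^{-s} below the best."""
    tol = 2.0 ** (-s)
    for arm, width in zip(arms, widths):
        if width > tol:
            return ("explore", arm)        # play an under-explored arm
    best = max(means)
    survivors = [arm for arm, mean in zip(arms, means) if mean >= best - 2 * tol]
    return ("keep", survivors)

# layer 2 (tol = 0.25): all widths are small enough, so arm "c" is eliminated
decision, result = screen(["a", "b", "c"], [1.0, 0.9, 0.2], [0.1, 0.2, 0.1], s=2)
```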
5 Asynchronous FedSupLinUCB
In the asynchronous setting, only one client is active in each round. Note that global synchronization and coordination are not required, and all inactive clients are idle.
5.1 Algorithm
We first initialize the information (gram matrix and action-reward vector) in each layer for all clients and the server. We assume only one client is active at round $t$. This is without loss of generality, since if multiple clients are active, we can queue them up and activate them in turn; more discussion of this equivalence can be found in He et al. (2022a); Li and Wang (2022a). The active client chooses an action, receives a reward, updates the local information matrices of the returned layer with a lazy update according to S-LUCB, and decides whether communication with the server is needed by the criterion in Line 7 of Algorithm 3. If communication is triggered, the client is synchronized with the server by Algorithm 1.
5.2 Performance Analysis
Theorem 5.1.
For any $0 < \delta < 1$, if we run Algorithm 3 with $C = 1/M^2$, then with probability at least $1-\delta$, the regret of the algorithm is bounded as $R_T \le \tilde{O}(\sqrt{dT})$. Moreover, the corresponding communication cost is bounded by $O(dM^2\log d\log T)$.
Remark 1. The minimax lower bound of the expected regret for linear contextual bandits with adversarial actions is $\Omega(\sqrt{dT})$, given in Chu et al. (2011). Theorem 5.1 indicates that Async-FedSupLinUCB achieves order-optimal regret (up to a polylog term) with $O(dM^2\log d\log T)$ communication cost. To the best of our knowledge, this is the first algorithm that achieves (near) optimal regret in federated linear bandits with finite adversarial actions.
Remark 2. Without any communication, each client would execute SupLinUCB (Chu et al., 2011) for $T_i$ rounds locally and achieve regret of order $\tilde{O}(\sqrt{dT_i})$. The total regret of $M$ clients is therefore upper bounded by $\sum_{i=1}^{M}\tilde{O}(\sqrt{dT_i}) \le \tilde{O}(\sqrt{dMT})$, where the inequality becomes an equality when $T_i = T/M$ for all $i$. Compared with running independent SupLinUCB algorithms locally, Async-FedSupLinUCB thus yields an average per-client regret reduction by a factor of $\sqrt{M}$, demonstrating that communication in the federated system can speed up local linear bandits decision-making at the clients.
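The aggregation of the per-client bounds in this remark is an application of the Cauchy-Schwarz inequality:

```latex
\sum_{i=1}^{M} \sqrt{d\,T_i}
  \;\le\; \sqrt{M \sum_{i=1}^{M} d\,T_i}
  \;=\; \sqrt{dMT},
\qquad \text{with equality iff } T_i = T/M \ \text{for all } i.
```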
Remark 3. Most previous works on federated linear bandits consider the infinite-action setting, based on the LinUCB principle (Abbasi-Yadkori et al., 2011). Async-FedSupLinUCB considers a finite adversarial action setting and reduces the regret bound by a factor of $\sqrt{d}$. Fed-PE proposed in Huang et al. (2021) also considers the finite-action setting; however, their action sets are fixed. We generalize their formulation and consider a more challenging scenario, where the finite action set can be chosen adversarially. The regret order is the same as Fed-PE (ignoring the polylog term), indicating that it is finite actions, as opposed to fixed actions, that fundamentally lead to the regret improvement in the federated linear bandits setting.
Communication cost analysis of FedSupLinUCB.
We sketch the proof for the communication cost bound in Theorem 5.1 in the following, while deferring the detailed proofs for the regret and the communication cost to Appendix C.
We first study the communication cost triggered by some layer $s$. Denote by $A^{\mathrm{ser},s}_t$ the layer-$s$ gram matrix at the server, aggregated from the gram matrices uploaded by all clients up to round $t$. We divide the rounds into epochs within which $\det(A^{\mathrm{ser},s}_t)$ at most doubles. The number of communications triggered by layer $s$ within any epoch can be upper bounded by $O(1/C + M)$ (see Lemma C.1), and the number of non-empty epochs is at most $O(d\log T)$ by Lemma A.1. Since there are $O(\log d)$ layers and synchronization among all layers is performed once communication is triggered by any layer (Line 8 in Algorithm 3), the total communication cost is thus upper bounded by $O((1/C + M)\, d\log d\log T)$. Plugging in $C = 1/M^2$ proves the result.
We note that although choosing a larger threshold $C$ would trigger fewer communications, the final choice of $C$ takes into consideration both the regret and the communication cost, i.e., it achieves a small communication cost while maintaining an order-optimal regret.
6 Synchronous FedSupLinUCB
In the synchronous setting, all clients are active and make decisions at each round. Though this can be viewed as a special case of the asynchronous scenario (clients being active and pulling arms in a round-robin manner), the information update is broadcast to all clients. In other words, the key difference from the asynchronous scenario, beyond all clients being active at each round, is that when one client meets the communication criterion, all clients upload local information to the server and download the updated matrices. This leads to a higher cost per communication round, but knowing that all clients are participating allows the communicated information to be well utilized by the other clients. This is in sharp contrast to the asynchronous setting, where information uploaded by one client may sit unused until other clients become active. To mitigate the total communication cost, we use a more refined communication criterion that enables a time-independent communication cost.
6.1 The Algorithm
The Sync-FedSupLinUCB algorithm lets each client make decisions via the S-LUCB subroutine. Note that the decision-making is based on all available local information, instead of the lazy update used in Async-FedSupLinUCB. The communication criterion involves the count of rounds since the last communication, which forces communication so as to prevent the local data from becoming obsolete. A layer may trigger the communication criterion either because the local client has gathered enough new data or because it has not communicated with the server for too long. We collect these layers in CommLayers and synchronize all the clients with the server.
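The two-part trigger can be sketched as follows. The threshold names (`D`, `max_stale`) and the exact form of the data test are our assumptions; the precise criterion is in Algorithm 4.

```python
import numpy as np

def comm_layers(layers, D=0.5, max_stale=100):
    """Flag a layer for synchronization if its accumulated local information
    has grown enough (log-det growth weighted by elapsed rounds, in the spirit
    of DisLinUCB-style triggers) or if it has gone unsynchronized too long."""
    flagged = []
    for s, layer in enumerate(layers):
        _, new = np.linalg.slogdet(layer["A_sync"] + layer["A_new"])
        _, old = np.linalg.slogdet(layer["A_sync"])
        enough_data = layer["rounds"] * (new - old) > D
        too_stale = layer["rounds"] > max_stale
        if enough_data or too_stale:
            flagged.append(s)
    return flagged

layers = [
    {"A_sync": np.eye(2), "A_new": np.eye(2),        "rounds": 10},   # new data
    {"A_sync": np.eye(2), "A_new": np.zeros((2, 2)), "rounds": 5},    # quiet
    {"A_sync": np.eye(2), "A_new": np.zeros((2, 2)), "rounds": 200},  # stale
]
flags = comm_layers(layers)
```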
6.2 Performance Analysis
Theorem 6.1.
For any $0 < \delta < 1$, if we run Algorithm 4 with an appropriately chosen threshold $D$, then with probability at least $1-\delta$, the regret of the algorithm is bounded as $R_T \le \tilde{O}(\sqrt{dT})$, where $T = MN$ and $N$ is the number of per-client arm pulls. Moreover, the corresponding communication cost is bounded by $O(\sqrt{d^3M^3}\log d)$.
Remark 4. Theorem 6.1 demonstrates that Sync-FedSupLinUCB also achieves the minimax regret lower bound, while the communication cost is independent of $T$. This is particularly beneficial for large $T$. Note that the number of total rounds in the synchronous scenario is $N = T/M$, while in the asynchronous setting there are $T$ rounds.
Communication cost analysis of Sync-FedSupLinUCB.
We sketch the proof for the communication cost bound in Theorem 6.1 below, while deferring the detailed proofs for the regret and the communication cost to Appendix D.
We call a chunk of consecutive rounds without communication of layer-$s$ information (except in the last round) an epoch. Information in layer $s$ is collected locally by each client and synchronized at the end of the epoch, following which the next epoch starts. Denote by $A^s_p$ the synchronized layer-$s$ gram matrix at the end of the $p$-th epoch. For any value $R > 0$, there are at most $\lceil N/R \rceil$ epochs that contain more than $R$ rounds by the pigeonhole principle. If the $p$-th epoch contains fewer than $R$ rounds, then $\log\left(\det(A^s_p)/\det(A^s_{p-1})\right) > D/R$, based on the communication criterion (see Equation 6). Since the total log-determinant growth over all epochs is $O(d\log N)$, the number of epochs containing fewer than $R$ rounds is at most $O(dR\log N/D)$. Balancing the two bounds by taking $R = \sqrt{ND/(d\log N)}$, the total number of epochs for layer $s$ is at most $O(\sqrt{dN\log N/D})$. The total communication cost is thus upper bounded by $O(M\sqrt{dN\log N/D}\log d)$, which yields $O(\sqrt{d^3M^3}\log d)$ with the choice of $D$ in Theorem 6.1.
7 Extensions of FedSupLinUCB
In this section, we extend the FedSupLinUCB algorithm to address two distinct settings in federated systems: scenarios characterized by heterogeneous variances, and those affected by adversarial corruptions.
7.1 Federated Heteroscedastic Linear Bandits
We have so far focused on federated linear bandits with 1-sub-Gaussian reward noises. In this section, we adapt Async-FedSupLinUCB to the case where the reward noises have heterogeneous variances, which extends the heteroscedastic linear bandits studied in Zhou et al. (2021); Zhou and Gu (2022) to the asynchronous federated setting, where one client is active at a time. Specifically, the reward noises $\epsilon_t$ are independent with $\mathbb{E}[\epsilon_t] = 0$ and $\mathbb{E}[\epsilon_t^2] \le \sigma_t^2$, where $\sigma_t^2$ is known to the active client.
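A standard way to exploit known variances, shown here purely for illustration (our names; the precise construction is in Appendix E), is to weight each observation by its inverse variance when updating the gram matrix and action-reward vector:

```python
import numpy as np

def weighted_update(A, b, x, r, sigma):
    """Weighted ridge-regression update: a round with small sigma carries
    weight 1/sigma^2, so precise observations dominate the gram matrix."""
    w = 1.0 / sigma**2
    return A + w * np.outer(x, x), b + w * r * x

A, b = np.eye(2), np.zeros(2)
A, b = weighted_update(A, b, np.array([1.0, 0.0]), 1.0, sigma=0.1)   # precise
A, b = weighted_update(A, b, np.array([0.0, 1.0]), 1.0, sigma=10.0)  # noisy
# the low-variance observation dominates its coordinate of A
```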
We propose a variance-adaptive Async-FedSupLinUCB and analyze its regret and communication cost in the theorem below, with the algorithm and proof details deferred to Appendix E due to the space constraint. The regret is significantly smaller than that of Async-FedSupLinUCB when the variances are small.
Theorem 7.1.
For any $0 < \delta < 1$, if we run the variance-adaptive Async-FedSupLinUCB algorithm in Appendix E with $C = 1/M^2$, then with probability at least $1-\delta$, the regret is bounded as $R_T \le \tilde{O}\left(\sqrt{d\sum_{t=1}^{T}\sigma_t^2}\right)$, and the communication cost is bounded by $O(dM^2\log d\log T)$.
7.2 Federated Linear Bandits with Corruption
We further explore asynchronous federated linear bandits with adversarial corruptions, where an adversary inserts a corruption $c_t$ into the reward of the active client at round $t$. The total corruption is bounded as $\sum_{t=1}^{T} |c_t| \le C_p$. We incorporate the idea of linear bandits with adversarial corruption studied in He et al. (2022b) into the proposed FedSupLinUCB framework and propose the Robust Async-FedSupLinUCB algorithm, with details in Appendix F. Robust Async-FedSupLinUCB achieves the optimal minimax regret (matching the lower bound in He et al. (2022b)) while incurring a low communication cost.
Theorem 7.2.
For any $0 < \delta < 1$, if we run the Robust Async-FedSupLinUCB algorithm in Appendix F with $C = 1/M^2$, then with probability at least $1-\delta$, the regret is bounded as $R_T \le \tilde{O}(\sqrt{dT} + dC_p)$, and the communication cost is bounded by $O(dM^2\log d\log T)$.
8 Experiments
We have experimentally evaluated FedSupLinUCB in the asynchronous and synchronous settings on both synthetic and real-world datasets. We report the results in this section.
8.1 Experiment Results Using Synthetic Dataset
We simulate the federated linear bandits environment specified in Section 3. Contexts are sampled uniformly at random from the unit sphere, and the reward is generated as $r = \theta^{*\top} x + \epsilon$, where $\epsilon$ is Gaussian distributed with zero mean. Note that while all $M$ clients participate in each round in the synchronous scenario, only one client is active at a time in the asynchronous case. In the plots, the horizontal-axis coordinate denotes the number of arm pulls, which flattens the actions in the synchronous setting.
[Figure 1: (a) regret and (b) communication cost under different arrival patterns; (c) per-client regret with different numbers of clients; (d) regret vs. communication tradeoff.]
Arrival pattern. We first investigate the impact of different arrival patterns (the sequence of activating clients): (1) random, which allocates the arm pulls uniformly at random among the clients; (2) round-robin, where clients are activated cyclically; and (3) click-leave, where each client completes all of its arm pulls before the next client starts. The regret and the communication cost of these three arrival patterns in the synthetic experiment are reported in Figure 1(a) and Figure 1(b), respectively. We note that although the upper bound analysis in our proof is for the worst-case instance, the numerical results suggest that different arrival patterns result in diverse regret performances. The round-robin and random patterns are more challenging, since both local bandit learning and each client's policy updates happen relatively slowly. The click-leave pattern, which is the closest to the centralized setting, achieves the best regret. In addition, compared with Async-FedSupLinUCB, Sync-FedSupLinUCB achieves better cumulative regret at a higher communication cost.
Number of clients. The per-client cumulative regret as a function of the number of arm pulls for different numbers of clients is plotted in Figure 1(c). In comparison to the baseline SupLinUCB, the FedSupLinUCB algorithms achieve better regret via communication between the clients and the server. We can see from the experiment that FedSupLinUCB significantly reduces the per-client regret compared with SupLinUCB, and achieves better regret as $M$ increases in both asynchronous and synchronous settings.
Tradeoff between regret and communication. We evaluate the tradeoff between communication and regret by running FedSupLinUCB with different communication threshold values ($C$ and $D$ in the asynchronous and synchronous settings, respectively). The results are reported in Figure 1(d), where each scattered dot represents the communication cost and the cumulative regret that FedSupLinUCB achieves with a given threshold value at the end of the horizon. We see a clear tradeoff between regret and communication. More importantly, Sync-FedSupLinUCB achieves a better tradeoff than Async-FedSupLinUCB.
8.2 Experiment Results Using Real-world Dataset
We further investigate how efficiently the federated linear bandits algorithm performs in a more realistic and difficult environment. We have carried out experiments utilizing the real-world recommendation dataset MovieLens 20M (Harper and Konstan, 2015). Following the steps in Li and Wang (2022b), we first filter the data by retaining only users with a sufficient number of movie ratings and treating high rating points as positive, ending up with 121934 total movie rating interactions. Then, we follow the process described in Cesa-Bianchi et al. (2013) to generate the context and arm sets using TF-IDF features. We plot the per-client normalized rewards of the FedSupLinUCB algorithm with different numbers of clients in the synchronous and asynchronous cases, respectively. Note that the per-client cumulative rewards here are normalized by a random strategy. From Figure 2(a) and Figure 2(b), we can see that in both synchronous and asynchronous experiments, FedSupLinUCB obtains better rewards than SupLinUCB, and the advantage becomes more significant as the number of users increases.
[Figure 2: per-client normalized rewards on MovieLens 20M in the (a) synchronous and (b) asynchronous settings.]
9 Conclusion
We studied federated linear bandits with finite adversarial actions, a model that has not been investigated before. We proposed FedSupLinUCB, which extends the SupLinUCB and OFUL principles to the federated setting in both asynchronous and synchronous scenarios, and analyzed its regret and communication cost, respectively. The theoretical results prove that FedSupLinUCB approaches the minimax regret lower bound (up to polylog terms) while only incurring sublinear communication costs, suggesting that it is the finite actions that fundamentally determine the regret behavior in the federated linear bandits setting. Furthermore, we examined extensions of the algorithm design to the variance-adaptive and adversarial corruption scenarios.
Acknowledgments and Disclosure of Funding
The work of LF and CS was supported in part by the U.S. National Science Foundation (NSF) under grants 2143559, 2029978, and 2132700.
References
- Abbasi-Yadkori et al., (2011) Abbasi-Yadkori, Y., Pál, D., and Szepesvári, C. (2011). Improved algorithms for linear stochastic bandits. Advances in neural information processing systems, 24.
- Auer, (2002) Auer, P. (2002). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3(Nov):397–422.
- Cesa-Bianchi et al., (2013) Cesa-Bianchi, N., Gentile, C., and Zappella, G. (2013). A gang of bandits. Advances in neural information processing systems, 26.
- Chu et al., (2011) Chu, W., Li, L., Reyzin, L., and Schapire, R. (2011). Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 208–214. JMLR Workshop and Conference Proceedings.
- Dani et al., (2008) Dani, V., Hayes, T. P., and Kakade, S. M. (2008). Stochastic linear optimization under bandit feedback. 21st Annual Conference on Learning Theory, pages 355–366.
- Dubey and Pentland, (2020) Dubey, A. and Pentland, A. (2020). Differentially-private federated linear bandits. Advances in Neural Information Processing Systems, 33:6003–6014.
- Han et al., (2020) Han, Y., Zhou, Z., Zhou, Z., Blanchet, J., Glynn, P. W., and Ye, Y. (2020). Sequential batch learning in finite-action linear contextual bandits. arXiv preprint arXiv:2004.06321.
- Harper and Konstan, (2015) Harper, F. M. and Konstan, J. A. (2015). The MovieLens datasets: History and context. ACM Trans. Interact. Intell. Syst., 5(4):1–19.
- He et al., (2022a) He, J., Wang, T., Min, Y., and Gu, Q. (2022a). A simple and provably efficient algorithm for asynchronous federated contextual linear bandits. arXiv preprint arXiv:2207.03106.
- He et al., (2022b) He, J., Zhou, D., Zhang, T., and Gu, Q. (2022b). Nearly optimal algorithms for linear contextual bandits with adversarial corruptions. Advances in Neural Information Processing Systems.
- Huang et al., (2021) Huang, R., Wu, W., Yang, J., and Shen, C. (2021). Federated linear contextual bandits. Advances in Neural Information Processing Systems, 34:27057–27068.
- Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit algorithms. Cambridge University Press.
- Li and Wang, (2022a) Li, C. and Wang, H. (2022a). Asynchronous upper confidence bound algorithms for federated linear bandits. In International Conference on Artificial Intelligence and Statistics, pages 6529–6553. PMLR.
- Li and Wang, (2022b) Li, C. and Wang, H. (2022b). Communication efficient federated learning for generalized linear bandits. arXiv preprint arXiv:2202.01087.
- Li et al., (2010) Li, L., Chu, W., Langford, J., and Schapire, R. E. (2010). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pages 661–670.
- Li et al., (2020) Li, T., Song, L., and Fragouli, C. (2020). Federated recommendation system via differential privacy. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2592–2597. IEEE.
- Li et al., (2019) Li, Y., Wang, Y., and Zhou, Y. (2019). Nearly minimax-optimal regret for linearly parameterized bandits. In Conference on Learning Theory, pages 2173–2174. PMLR.
- McMahan et al., (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. (2017). Communication-efficient learning of deep networks from decentralized data. In Proc. AISTATS, pages 1273–1282, Fort Lauderdale, FL, USA.
- Ruan et al., (2021) Ruan, Y., Yang, J., and Zhou, Y. (2021). Linear bandits with limited adaptivity and learning distributional optimal design. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 74–87.
- Salgia and Zhao, (2023) Salgia, S. and Zhao, Q. (2023). Distributed linear bandits under communication constraints. In International Conference on Machine Learning, pages 29845–29875. PMLR.
- Shi et al., (2021) Shi, C., Shen, C., and Yang, J. (2021). Federated multi-armed bandits with personalization. In International Conference on Artificial Intelligence and Statistics, pages 2917–2925. PMLR.
- Wang et al., (2019) Wang, Y., Hu, J., Chen, X., and Wang, L. (2019). Distributed bandit learning: Near-optimal regret with efficient communication. In International Conference on Learning Representations.
- Zhou and Gu, (2022) Zhou, D. and Gu, Q. (2022). Computationally efficient horizon-free reinforcement learning for linear mixture mdps. arXiv preprint arXiv:2205.11507.
- Zhou et al., (2021) Zhou, D., Gu, Q., and Szepesvari, C. (2021). Nearly minimax optimal reinforcement learning for linear mixture markov decision processes. In Conference on Learning Theory, pages 4532–4576. PMLR.
Appendix A Supporting Lemmas
Lemma A.1.
(Lemma 11 in Abbasi-Yadkori et al., (2011)) Let $\{X_t\}_{t=1}^{\infty}$ be a sequence in $\mathbb{R}^d$, $V$ a $d \times d$ positive definite matrix, and define $\bar{V}_t = V + \sum_{s=1}^{t} X_s X_s^\top$. Then, we have
$$\log\left(\frac{\det(\bar{V}_n)}{\det(V)}\right) \leq \sum_{t=1}^{n} \|X_t\|_{\bar{V}_{t-1}^{-1}}^2.$$
Further, if $\|X_t\|_2 \leq L$ for all $t$, then
$$\sum_{t=1}^{n} \min\left\{1, \|X_t\|_{\bar{V}_{t-1}^{-1}}^2\right\} \leq 2\left(\log\det(\bar{V}_n) - \log\det(V)\right) \leq 2\left(d \log\left(\frac{\operatorname{trace}(V) + nL^2}{d}\right) - \log\det(V)\right).$$
Finally, if $\lambda_{\min}(V) \geq \max(1, L^2)$, then
$$\sum_{t=1}^{n} \|X_t\|_{\bar{V}_{t-1}^{-1}}^2 \leq 2 \log\left(\frac{\det(\bar{V}_n)}{\det(V)}\right).$$
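As a quick numerical sanity check of the last claim of Lemma A.1 (an illustration, not part of the paper), the following sketch verifies that when the initial matrix is the identity and the contexts are unit-norm, the summed elliptical potentials are bounded by twice the log-determinant ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 5, 200, 1.0

V = np.eye(d)        # lambda_min(V) = 1 >= max(1, L^2), as the last claim requires
V_bar = V.copy()

potential_sum = 0.0
for _ in range(n):
    x = rng.normal(size=d)
    x *= L / np.linalg.norm(x)                        # enforce ||x||_2 = L
    potential_sum += x @ np.linalg.solve(V_bar, x)    # ||x||^2 in the V_bar^{-1} norm
    V_bar += np.outer(x, x)                           # rank-1 update

_, logdet_n = np.linalg.slogdet(V_bar)
_, logdet_0 = np.linalg.slogdet(V)
print(potential_sum <= 2 * (logdet_n - logdet_0))     # True, by Lemma A.1
```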
Lemma A.2.
(Lemma 12 in Abbasi-Yadkori et al., (2011)). Let $A$, $B$, and $C$ be positive semi-definite matrices such that $A = B + C$. Then, we have
$$\sup_{x \neq 0} \frac{x^\top A x}{x^\top B x} \leq \frac{\det(A)}{\det(B)}.$$
Theorem A.1.
(Theorem 2 in Abbasi-Yadkori et al., (2011)). Let $\{\mathcal{F}_t\}_{t=0}^{\infty}$ be a filtration. Let $\{X_t\}_{t=1}^{\infty}$ be an $\mathbb{R}^d$-valued stochastic process such that $X_t$ is $\mathcal{F}_{t-1}$-measurable and $\|X_t\|_2 \leq L$ almost surely. Let $\{\eta_t\}_{t=1}^{\infty}$ be a real-valued stochastic process such that $\eta_t$ is $\mathcal{F}_t$-measurable and is sub-Gaussian with variance proxy $R^2$ when conditioned on $\mathcal{F}_{t-1}$. Fix $\theta^* \in \mathbb{R}^d$ such that $\|\theta^*\|_2 \leq S$. Let $Y_t = \langle X_t, \theta^* \rangle + \eta_t$, $\bar{V}_t = \lambda I + \sum_{s=1}^{t} X_s X_s^\top$, and $\hat{\theta}_t = \bar{V}_t^{-1} \sum_{s=1}^{t} X_s Y_s$. For every $\delta \in (0, 1)$, we have
$$\mathbb{P}\left(\forall t \geq 0: \ \|\hat{\theta}_t - \theta^*\|_{\bar{V}_t} \leq \beta_t(\delta)\right) \geq 1 - \delta,$$
where we define $\beta_t(\delta) = R \sqrt{d \log\left(\frac{1 + tL^2/\lambda}{\delta}\right)} + \sqrt{\lambda}\, S$. Furthermore, when the above event holds, we have for every $t \geq 0$ and any vector $x \in \mathbb{R}^d$ that
$$\left|x^\top (\hat{\theta}_t - \theta^*)\right| \leq \beta_t(\delta)\, \|x\|_{\bar{V}_t^{-1}}.$$
Lemma A.3.
(Adapted from Lemma B.1 in He et al., 2022a ) Under the setting of Theorem 5.1, establish . In layer , with probability at least , the good event happens:
Lemma A.4.
(Lemma 31 in Ruan et al., (2021)). Given such that , for all , let where is an independent sub-Gaussian random variable with variance proxy . Let , and . For any and any , we have
Appendix B Lemmas for the SupLinUCB Subroutine
We present several useful lemmas that are based on Algorithm 2. Recall that represents the index set of rounds up to and including round during which an action is taken in layer . That is,
Similar to Lemma 4 in Chu et al., (2011), we claim that the rewards associated with rounds within each (excluding layer ) are mutually independent.
Lemma B.1.
For each , given any fixed sequence of contexts , the rewards are independent random variables with means .
Proof of Lemma B.1.
For each and each time , the procedure of generating only depends on the information in previous layers and confidence width . From its definition, only depends on and on the current context . Thus the procedure of generating does not depend on rewards , and therefore the rewards are independent random variables when conditioned on . ∎
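For concreteness, the layered selection that Algorithm 2 builds on can be sketched for a single round as follows. This is a simplified illustration of the SupLinUCB principle (the function name, the `beta` scaling, and the handling of the final layer are assumptions of this sketch, not the paper's exact pseudocode): an arm is pulled in layer s once its confidence width exceeds 2^{-s}; otherwise, arms whose upper confidence bound falls more than 2·2^{-s} below the leader are eliminated before descending to layer s+1.

```python
import numpy as np

def suplinucb_select(contexts, layers, S, beta=1.0):
    """One round of layered SupLinUCB-style selection (illustrative sketch).

    contexts: (K, d) array of finite action features for this round.
    layers:   list of per-layer statistics, layers[s] = (V, b) with V the
              d x d ridge Gram matrix and b the reward-weighted feature sum.
    Returns (arm_index, layer_used).
    """
    candidates = list(range(len(contexts)))
    for s in range(S):
        V, b = layers[s]
        theta = np.linalg.solve(V, b)                       # layer-s estimate
        mu = contexts @ theta                               # point estimates
        w = beta * np.sqrt(np.einsum('kd,dk->k', contexts,
                                     np.linalg.solve(V, contexts.T)))
        wide = [a for a in candidates if w[a] > 2.0 ** (-s)]
        if wide:                       # explore: width too large in this layer
            return wide[0], s
        ucb = mu + w
        best = max(ucb[a] for a in candidates)
        # eliminate clearly suboptimal arms, then descend to the next layer
        candidates = [a for a in candidates if ucb[a] >= best - 2 * 2.0 ** (-s)]
    # all widths small enough: exploit the best surviving arm in the last layer
    V, b = layers[-1]
    theta = np.linalg.solve(V, b)
    return max(candidates, key=lambda a: contexts[a] @ theta), S - 1
```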
Given the above-mentioned statistical independence property, and by referring to Lemma A.4, we can establish the following lemma for each layer .
Lemma B.2.
Suppose the time index set is constructed so that for fixed with , the rewards are independent random variables with mean . For any round , if client is active and chooses arm in layer , then with probability at least , we have for any :
For layer , we employ the self-normalized martingale concentration inequality of He et al., 2022a . Invoking Lemma A.3, we obtain the following:
Lemma B.3.
For any round , given that client is active in round and arm is chosen in layer , with probability at least , we have for any :
Summarizing the discussions presented in Lemma B.2 and Lemma B.3, we now proceed to define the following good event:
Lemma B.4.
Define the good event as:
(2)
We have .
Conditioned on the good event , the ensuing lemma illustrates that the optimal arm persists in the candidate set, and that the regret experienced in each layer aligns with the order of the confidence width.
Lemma B.5.
Conditioned on the good event , for , assume that client is active and chooses an action , and recall represents the optimal arm in the current round. For any , we have:
Proof of Lemma B.5.
For any time step , when the good event holds, by the arm elimination rule in layer , we have
Thus, . For each layer , we have:
Thus, we derive , which follows from for all by the arm elimination rule in Line 10 of Algorithm 2. Therefore, the arm eliminations preserve the best arm. ∎
The forthcoming lemma demonstrates that, under the good event, the regret experienced in each layer aligns with the order of the corresponding confidence width.
Lemma B.6.
Conditioned on the good event , for client and , it holds that:
(3)
(4)
Proof of Lemma B.6.
If an action is taken in layer , we have that
and
The second inequality is conditioned on the good event , and the third inequality arises from the arm elimination rule. If an action is taken in layer , we establish the following:
and
The first inequality is based on the good event , the third inequality follows from the arm elimination rule, and the fourth inequality is due to for all . ∎
Appendix C Supporting Lemmas and Proofs for Async-FedSupLinUCB
Lemma C.1.
(Lemma 6.2 in He et al., 2022a ) In any epoch from round to round , the number of communications is at most .
Proof outline of Async-FedSupLinUCB.
First, we reorganize the arrival pattern, demonstrating that the rearranged system parallels the original system, and present the requisite definitions for our analysis. Second, we deploy a virtual global model encapsulating information about all clients up to round , subsequently interconnecting the local models with this global model. Lastly, we derive upper bounds on the regret and communication cost in each layer prior to aggregating them to yield the total regret and communication costs, respectively.
Suppose that client communicates with the server at rounds with and does not communicate during the rounds in between. The actions and information gained by client at the rounds do not impact other clients’ decision-making, since the information is kept local without communication. Therefore, we can reorder the arrival of clients appropriately while keeping the reordered system equivalent to the original system.
More specifically, suppose client communicates with the server at two rounds and and does not communicate in the rounds in between (even if she is active). We reorder all the active rounds of client in and place them sequentially after round . Hence, the arrival of clients can be reordered such that each client communicates with the server and then stays active until the next client's communication begins. We assume that the sequence of communication rounds in the reordered arrival pattern is , where in rounds the active client is the same. Details of the reordering process are given in Definition C.2. Due to the equivalence between the original system and the reordered system, we carry out the proofs in the reordered system. Note that since only one client is active at round , we write , and for simplicity.
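The reordering described above can be sketched as a stable sort of the rounds, keyed by each client's most recent communication round. This is a simplified illustration of the idea behind Definition C.2 (the event representation is an assumption of this sketch):

```python
def reorder_rounds(events):
    """events: list of (client, communicates) pairs in the original round order,
    where `communicates` marks rounds at which that client syncs with the server.
    Returns the round indices in the reordered arrival pattern: each client's
    no-communication rounds are packed directly after its last communication
    round, preserving their relative order, so every client stays active until
    the next client's communication begins."""
    last_comm = {}   # client -> index of its most recent communication round
    keys = []
    for t, (client, communicates) in enumerate(events):
        if communicates:
            last_comm[client] = t
        # key: the client's last communication round (-1 before its first one);
        # a stable sort then groups each client's subsequent rounds behind it
        keys.append((last_comm.get(client, -1), t))
    return [t for _, t in sorted(keys)]
```

For example, with two clients that both communicate first and then alternate without communicating, the rounds of client 0 are pulled forward to sit contiguously after its communication round, followed by client 1's block.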
Definition C.1.
Client information. Recall for each client , we denote by the last round when client communicated with the server before and including round . E.g., if client communicates at round . For each round each client and each layer , the information that has been uploaded by client to the server is: , and the local information in the buffer that has not been uploaded to the server is: .
Server information. The information in the server is the data uploaded by all clients up to round : .
Time index set. Denote by the time index set when the action is chosen in layer . It can be expressed as .
Virtual global information. We define a virtual global model that contains all the information up to round as: .
The information that is stored on the server and all the information that has not yet been uploaded by clients are combined to generate the global information: .
Before presenting the proof, we define the good event as
Recall that is the estimate of by client , and and are the corresponding context and confidence width of the action taken at round . The following lemma shows that the good event happens with high probability, similar to the result in Lemma B.4.
Lemma C.2.
It holds that .
Conditioned on the good event, to upper bound the regret, we bound the confidence width in each layer via the size of each time index set in the lemma below.
Lemma C.3.
Conditioned on the good event , for each we have:
Noting that naturally holds, we give a tighter (dimension-dependent) bound on the size of so as to mitigate the larger coefficient as follows.
Lemma C.4.
The size of can be bounded by .
We postpone the proofs of Lemma C.3 and Lemma C.4 until the end of this section, and instead focus on presenting the regret analysis next. Equipped with the previous lemmas, we are ready to analyze the total regret.
Proof of Theorem 5.1.
(Regret analysis) The total regret can be decomposed w.r.t. layers as follows:
Conditioned on the good event , we first bound the regret in layer by
The first inequality follows from Lemma B.6, the second inequality is from Lemma C.3, and the last inequality is due to Lemma C.4. We next bound the regret in each layer similarly by
where the first inequality follows from Lemma B.6, the second inequality is from the arm selection rule in Line 13 of Algorithm 2, and the third inequality is from Lemma C.3. For the last layer , we have:
Finally, with Lemma C.2, we have .
(Communication cost analysis) Next, we study the communication cost in the asynchronous setting. For each layer , we define . We divide the rounds in each layer into epochs , and the number of communication rounds in each epoch can be bounded by Lemma C.1. Let be the largest integer such that is not empty. According to Lemma A.1, we have , and thus . The total number of epochs in layer is bounded by . By Lemma C.1, the number of communication rounds in layer is bounded by . Since there are layers in the FedSupLinUCB algorithm, the total communication cost is upper bounded by . Plugging in proves the result.
∎
Definition C.2.
(Reorder function) Without loss of generality, we assume all clients communicate with the server at round , and the sequence of rounds that clients communicate with the server in the original system is . Define . Denote by the last communication round of client before and including round :
Denote by the next communication round of client including and after round :
The round in the original system is placed in round by the reordering function . We first reorder the communication rounds. Suppose and are two consecutive communication rounds with , where client is active at round and client is active at round .
Then we reorder the no-communication rounds. Assume client is active at round and does not communicate at this round. We first find the last communication round of client as , and place round by :
Lemma C.5.
(Adapted from Lemma 6.5 in He et al., 2022a ) For each round each layer and each client , we have:
Further averaging the inequality above over clients, we have:
Proof of Lemma C.5.
Without loss of generality, we consider client and fix any round . Let be the last round such that client was active at round . If client communicated with the server at round , and chose action at layer , then we have
for other layers , according to the determinant-based communication criterion, we have:
By Lemma A.2 we have
Otherwise, if no communication happened at round , then by the communication criterion, at the end of round , for each layer , we have . Note that are the Gram matrices downloaded at the last communication before round , so we must have for all . For round , since client is inactive from round to , we have for all :
where the last equality holds because client is inactive during these rounds, which completes the proof of the first claim. Further averaging the above inequality over all clients , we get:
∎
Recall that client utilizes and , which were received from the server during the last communication, to make the decision at round . The following lemma establishes a connection between the Gram matrix of the virtual global model and the Gram matrix at the active client at round .
Lemma C.6.
In the reordered arrival pattern, for any , suppose client communicates with the server at round and stays active during rounds . Then for rounds , it holds for each that:
Proof of Lemma C.6.
Client is the only active client from round to and only communicated with the server at round , which implies that for , we have
where the second equality holds due to the fact that no clients communicate with the server from round to , and the first inequality follows from Lemma C.5. ∎
Proof of Lemma C.3.
For , if no communication happened at round , then by Lemma C.6 and Lemma A.2, we can connect the confidence width at the local client with the global Gram matrix as:
It remains to control the communication rounds in . We define
and let be the largest integer such that is not empty. According to Lemma A.1, we have:
Thus, . For each time interval from to and each client , suppose client communicates with the server more than once, with its communication rounds, in order, being . Then for each , since client is active at rounds and , we have
Since , by the definition of , we have:
where the second inequality comes from . Specifically, for the first communication round , we can bound the confidence width by . Thus, for the communication rounds in , we have:
Finally, we put all rounds in together:
where the second inequality follows from Lemma A.1. ∎
Appendix D Supporting Lemmas and Proofs for Sync-FedSupLinUCB
Proof outline of Sync-FedSupLinUCB.
To prove a high-probability regret bound, we first define the good event in the following lemma, under which the regret bound is derived.
Lemma D.1.
Define . Then, .
Define client ’s one-step regret at round as . Let if action is chosen in layer ; otherwise . The total regret can be written as
Fix an arbitrary ; we analyze the total regret induced by the actions taken in layer , i.e., . The analysis carries over to other values of in the same manner.
We call a chunk of consecutive rounds during which no information in layer is communicated (except at the last round) an epoch. In other words, information in layer is collected locally by each client and synchronized at the end of the epoch, after which the next epoch starts. The set of rounds in which at least one client pulls an arm in layer can then be divided into multiple consecutive epochs, and we further dichotomize these epochs into good and bad epochs in the following definition.
Definition D.1.
(Good epoch) Suppose the set of rounds in which at least one client pulls an arm in layer is divided into epochs, and denote by the synchronized Gram matrix and reward-action vector at the end of the -th epoch. The epochs can then be dichotomized into , where . We say round is good if the epoch containing round belongs to ; otherwise, is bad.
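The dichotomy is useful because bad epochs are necessarily rare: each bad epoch inflates the synchronized determinant by more than a constant factor, while the total determinant growth is bounded. A minimal numerical illustration (the epoch boundaries and the factor-2 threshold are assumptions of this sketch, not the paper's exact definitions):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_epochs, rounds_per_epoch = 5, 50, 10

V = np.eye(d)
bad = 0
for _ in range(n_epochs):
    _, logdet_start = np.linalg.slogdet(V)
    for _ in range(rounds_per_epoch):
        x = rng.normal(size=d)
        x /= np.linalg.norm(x)                      # unit-norm contexts
        V += np.outer(x, x)
    _, logdet_end = np.linalg.slogdet(V)
    if logdet_end - logdet_start > np.log(2):       # determinant more than doubled
        bad += 1

# each bad epoch contributes more than log(2) to the total log-det growth,
# so the number of bad epochs cannot exceed log2(det(V_final) / det(I))
_, logdet_final = np.linalg.slogdet(V)
print(bad, logdet_final / np.log(2))
```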
We bound regrets in layer induced by the good and bad epochs separately in the following lemmas. Recall is the time index set when the action is chosen in the layer.
Lemma D.2.
Conditioned on the good event , for each layer , the regret induced by good epochs of layer is bounded as .
Lemma D.3.
Define and . Conditioned on the good event , for each layer , the regret induced by bad epochs of layer is bounded as .
Lemma D.4.
We have .
Proof of Theorem 6.1.
Proof of Lemma D.2.
If is good and belongs to the -th epoch, we have by Lemma A.2 that
(5)
Within -th good epoch, we have
which, together with inequality (5) and the last inequality in the elliptical potential lemma (Lemma A.1), implies that
Thus under event , the regret induced by good epochs of layer is
where the first inequality is from Lemma B.6, the second inequality is by the Cauchy-Schwarz inequality, and the last relation is from
(6)
∎
Proof of Lemma D.3.
Denote by . It follows that the number of bad epochs is at most . Moreover, the regret within a bad epoch of length can be upper bounded as by applying the elliptical potential lemma for each client and the communication condition, where the extra in the upper bound is due to the fact that at most clients trigger the communication condition at the end of the -th epoch. We thus have
∎
Proof of Lemma D.4.
Recall . Note that if , we have . Otherwise , which implies . ∎
Appendix E Variance-adaptive Async-FedSupLinUCB
The variance-adaptive SupLinUCB subroutine is presented in Alg. 5, while the complete variance-adaptive Async-FedSupLinUCB is given in Alg. 6.
E.1 Algorithm
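As a rough illustration of the variance-adaptive ingredient analyzed in Section E.2, following the Zhou and Gu, (2022) weighting: each sample is down-weighted by its clipped noise level, so low-variance observations sharpen the estimate faster. The class name, the clipping parameter `alpha`, and the exact update form are assumptions of this sketch, not the paper's algorithm.

```python
import numpy as np

class VarianceWeightedRidge:
    """Sketch of a variance-weighted ridge regression: each sample's feature
    and reward contributions are scaled by the inverse of its (clipped)
    variance, in the spirit of Zhou and Gu (2022)."""

    def __init__(self, d, lam=1.0, alpha=0.1):
        self.V = lam * np.eye(d)   # weighted Gram matrix
        self.b = np.zeros(d)       # weighted reward-feature sum
        self.alpha = alpha         # lower clip on the per-round noise level

    def update(self, x, reward, sigma):
        sigma_bar = max(sigma, self.alpha)          # clip small variances
        self.V += np.outer(x, x) / sigma_bar ** 2
        self.b += reward * x / sigma_bar ** 2

    def estimate(self):
        return np.linalg.solve(self.V, self.b)
```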
E.2 Supporting Lemmas and Proofs
Theorem E.1.
(Theorem 4.3 in Zhou and Gu, (2022)) Let be a filtration, and be a stochastic process such that is -measurable and is -measurable. Let . For , let and suppose that also satisfy
For , let , and
Then, for any , we have with probability at least that,
Lemma E.1.
(Adapted from Lemma B.1 in Zhou and Gu, (2022)). Let be a sequence of non-negative numbers, and . Let and be recursively defined as follows:
Let . Then we have
Following a similar proof structure to Async-FedSupLinUCB, we employ a novel Bernstein-type self-normalized martingale inequality, proposed by Zhou and Gu, (2022), for layer to manage the variance information. We define as specified in Theorem E.1, and establish the following lemma, analogous to Lemma B.3.
Lemma E.2.
For any round , if client is active in round and arm is chosen in layer , with probability at least , with we have for any :
We define the good event as . In a manner similar to the proof of Lemma B.4, we have that .
Lemma E.3.
Conditioned on the event , the regret in layer can be bounded by
Proof of Lemma E.3.
We set to provide a tighter bound for the size of . Mirroring the proof methodology in Lemma C.4, we establish the following:
The first inequality results from the arm selection rule of layer 0, the second is derived from Lemma E.1, and the third arises due to the constraint . Consequently, we infer that . We can then bound the regret in layer as follows:
∎
Lemma E.4.
Conditioned on the event , the regret of each layer can be bounded by
Proof of Lemma E.4.
For , the rewards in each layer are mutually independent, as proven in Lemma B.1. We deduce:
The first inequality arises from Lemma B.6, the second is a result of the arm selection rule in Line 13, the third derives from Lemma C.3, and the final inequality is attributable to Lemma E.1. ∎
For the final layer , applying Lemma B.6 and setting , we have
Proof of the communication bound in Theorem 7.1.
Having established the bound for regret in each layer, we have demonstrated that . Given that we set and , it requires layers to achieve the desired accuracy. The number of communications triggered by layer can be upper bounded by (Lemma C.1). Consequently, we are able to constrain the overall communication cost to .
Appendix F Corruption Robust Async-FedSupLinUCB
The corruption robust SupLinUCB subroutine is presented in Alg. 7, while the complete corruption robust Async-FedSupLinUCB is given in Alg. 8.
F.1 Algorithm
F.2 Supporting Lemmas and Proof
When confronted with adversarial corruption, we utilize a weighted ridge regression in which the weight assigned to each selected action depends on its confidence. Further, we expand the confidence width to accommodate this corruption, with and as proposed in He et al., 2022b . In our analysis of layer , we adapt Lemma B.1 from He et al., 2022b to fit a federated scenario, yielding the following lemma:
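A sketch of such a confidence-weighted ridge regression, in the style of He et al., 2022b : each sample's weight is clipped so that highly uncertain directions cannot be over-exploited by an adversary. The class name and the choice of `alpha` (which the paper ties to the corruption budget) are assumptions of this sketch.

```python
import numpy as np

class CorruptionWeightedRidge:
    """Sketch of confidence-weighted ridge regression: a sample taken in a
    direction where the current estimate is uncertain (large elliptical norm)
    receives a proportionally smaller weight, limiting the damage any single
    corrupted reward can do."""

    def __init__(self, d, lam=1.0, alpha=0.5):
        self.V = lam * np.eye(d)
        self.b = np.zeros(d)
        self.alpha = alpha

    def update(self, x, reward):
        width = np.sqrt(x @ np.linalg.solve(self.V, x))  # ||x|| in V^{-1} norm
        w = min(1.0, self.alpha / width)                 # confidence-based weight
        self.V += w * np.outer(x, x)
        self.b += w * reward * x

    def estimate(self):
        return np.linalg.solve(self.V, self.b)
```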
Lemma F.1.
(Adapted from Lemma B.1 in He et al., 2022b ) Under the setting of Theorem 5.1, in the layer , with probability at least , the following event happens:
For each layer , the rewards are mutually independent, analogous to the proof of Lemma B.1. We can restate the lemma as follows:
Lemma F.2.
Suppose the time index set is constructed so that for fixed with , the rewards are independent random variables with means . For any round , if client is active and chooses arm in layer , with probability at least , we have for any :
After combining the aforementioned events, we redefine the good event in the presence of corruption as follows:
Similar to the proof of Lemma B.4, we have that .
Lemma F.3.
Conditioned on the good event , the regret of layer can be bounded as follows: .
Proof of Lemma F.3.
Under the condition of the good event , we adopt a similar approach to the regret decomposition analysis presented in He et al., 2022b to bound the regret in each layer .
(7)
(8)
The first inequality is derived from Lemma B.6, while inequality (8) follows from the definition of . For the term , we consider the rounds with , assuming these rounds can be listed as . To analyze this, we construct the auxiliary matrix . Using the definition of , we can establish the inequality .
Then we have
where the first inequality follows from Lemma C.3, and the second inequality is obtained by noting that the size of is bounded by , as stated in Lemma C.4 particularly for layer .
For the term , using the property , we can express as , which implies:
where the first inequality is derived from the definition of , the second inequality is obtained from the elliptical potential lemma, as referenced in Lemma A.1, and the third inequality stems from the definition of .
By combining and , we can ultimately bound the regret in each layer as . ∎
For the regret that occurs in the last layer , we can derive the following bound:
The first inequality is from Lemma B.6, and the last inequality follows from .
Proof of the communication bound in Theorem 7.2.
By combining the regret in each layer, we can conclude that . Note that, based on the definition of and Lemma A.1, it follows that . Additionally, by following a similar proof as in Lemma C.1, we can bound the number of communication rounds in layer by . Considering that the FedSupLinUCB algorithm has layers, the total communication cost is therefore upper bounded by .