Byzantine-Robust Online and Offline Distributed Reinforcement Learning
Abstract
We consider a distributed reinforcement learning setting where multiple agents separately explore the environment and communicate their experiences through a central server. However, a fraction of the agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude, and their fake data can be of any size. We aim to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is Weighted-Clique, a novel algorithm for the problem of robust mean estimation from batches that can handle arbitrary batch sizes. Building upon this new estimator, in the offline setting we design a Byzantine-robust distributed pessimistic value iteration algorithm, and in the online setting we design a Byzantine-robust distributed optimistic value iteration algorithm. Both algorithms obtain near-optimal sample complexities and achieve stronger robustness guarantees than prior works.
1 Introduction
Distributed learning systems have been one of the main driving forces behind recent successes of deep learning [Verbraeken et al., 2020, Goyal et al., 2017, Abadi et al., 2016]. Advances in designing efficient distributed optimization algorithms [Horgan et al., 2018] and deep learning infrastructures [Espeholt et al., 2018] have enabled training powerful models with hundreds of billions of parameters [Brown et al., 2020]. However, with the outsourcing of computation and data collection, new challenges emerge. In particular, distributed systems have been found vulnerable to Byzantine failures [Lamport et al., 1982], meaning there could be faulty agents that send arbitrary information to the central server. Even a small number of Byzantine machines sending moderately perturbed values can lead to a significant loss in performance [Yin et al., 2018, Ma et al., 2019, Zhang et al., 2020a], which raises security concerns in real-world applications such as chatbots [Neff and Nagy, 2016] and autonomous vehicles [Eykholt et al., 2018, Ma et al., 2021]. In addition, other properties are desirable, such as protecting the data privacy of individual data contributors [Sakuma et al., 2008, Liu et al., 2019] and reducing communication cost [Dubey and Pentland, 2021]. These challenges require new algorithmic designs on the server side, which is the main focus of this paper.
When it comes to reinforcement learning (RL), distributed learning has been prevalent in many large-scale decision-making problems even before the deep learning era, such as cooperative learning in robotic systems [Ding et al., 2020], power grid optimization [Yu et al., 2014], and automatic traffic control [Bazzan, 2009]. Different from supervised learning, where the data distribution of interest is often fixed a priori, reinforcement learning requires active exploration on the agent's side to discover the optimal policy for the current task, thus creating new challenges in achieving the above desiderata while exploring an unknown environment.
This paper studies this precise problem:
Can we design a distributed RL algorithm that is sample efficient and robust to Byzantine agents, while incurring a small communication cost and preserving data privacy?
We study Byzantine-robust RL in both the online and offline reinforcement learning settings. In the online setting, the central server outsources the exploration task to the agents; the agents collect experiences and send them back to the server, and the server uses the data to update its policy. In the offline setting, a central server collects logged data from agents and uses the data to identify a good policy, without additional interaction with the environment. Importantly, a fraction of the agents are Byzantine, meaning they are allowed to send arbitrary data in both the online and offline settings. We summarize our contributions as follows:
-
1.
We design Weighted-Clique, a robust mean estimation algorithm for learning from batches. By utilizing the batch structure, the estimation error of our algorithm vanishes with more data. Compared to prior works [Qiao and Valiant, 2017, Chen et al., 2020, Jain and Orlitsky, 2021, Yin et al., 2018], our algorithm adapts to arbitrary batch sizes, which is desired in many applications of interest.
-
2.
We design Byzan-UCBVI, a Byzantine-robust variant of optimistic value iteration for online RL that calls Weighted-Clique as a subroutine. We show that Byzan-UCBVI achieves near-optimal regret even when a fraction of the agents are Byzantine. Meanwhile, Byzan-UCBVI enjoys logarithmic communication and switching costs [Bai et al., 2019, Zhang et al., 2020b, Gao et al., 2021] and preserves the data privacy of individual agents.
-
3.
We design Byzan-PEVI, a Byzantine-robust variant of pessimistic value iteration for offline RL, again utilizing Weighted-Clique as a subroutine. Despite the presence of Byzantine agents, we show that Byzan-PEVI can learn a near-optimal policy with a polynomial number of samples when certain good coverage properties are satisfied [Zhang et al., 2021a].
2 Related Work
Reinforcement learning:
Reinforcement learning studies the optimal strategy in a Markov Decision Process (MDP) [Sutton and Barto, 2018]. [Azar et al., 2017, Dann et al., 2017] show that UCB-style algorithms achieve minimax regret bounds in tabular MDPs. Recent works extend the theoretical understanding to RL with function approximation [Jin et al., 2020, Yang and Wang, 2020]. [Jin et al., 2021, Rashidinejad et al., 2021] use the pessimism strategy to efficiently learn a nearly optimal policy in the offline setting. Recently, [Bai et al., 2019, Zhang et al., 2020b, Gao et al., 2021] study low-switching-cost RL algorithms, where the learning agent has a small budget for policy changes.
Distributed reinforcement learning:
Parallel RL deploys large-scale models in distributed systems [Kretchmar, 2002]. [Horgan et al., 2018, Espeholt et al., 2018] provide distributed architectures for deep reinforcement learning by parallelizing the data-generating process. [Dubey and Pentland, 2021, Agarwal et al., 2021, Chen et al., 2021] provide the first sets of theoretical guarantees on performance and communication cost in parallel RL.
Robust statistics:
Robust statistics, which studies learning from corrupted datasets, has a long history [Huber, 1992, Tukey, 1960]. In modern machine learning, models are high dimensional, and recent works provide sample- and computationally-efficient algorithms for robust mean and covariance estimation in high dimensions [Diakonikolas and Kane, 2019, Lai et al., 2016]. These robust mean estimators have since been applied to robust supervised learning [Diakonikolas et al., 2019, Prasad et al., 2018] and RL [Zhang et al., 2021a, b].
Robust learning from batches:
Another line of work studies robust learning from batches [Qiao and Valiant, 2017, Chen et al., 2020, Jain and Orlitsky, 2021, Yin et al., 2018]: a collection of data is generated by multiple data sources, while a fraction of the sources are corrupted. By exploiting the batch structure of the data, these algorithms achieve significantly higher accuracy than in the non-batch setting [Diakonikolas and Kane, 2019]. To the best of our knowledge, all of these works study batches of equal size, an assumption that may not hold in many crowd-sourcing applications. Our paper is the first to generalize to the setting where the batch sizes vary.
Byzantine-robust distributed learning:
Byzantine-robust learning studies learning under Byzantine failures [Lamport et al., 1982]. [Chen et al., 2017] provide a Byzantine-robust gradient descent algorithm based on a geometric median-of-means estimate of the gradients. [Yin et al., 2018] provide robust distributed gradient descent algorithms with optimal statistical rates.
Corruption robust RL and Byzantine-robust RL:
There is a line of work studying adversarial attacks against reinforcement learning [Ma et al., 2019, Zhang et al., 2020a, Huang et al., 2017], and corruption-robust reinforcement learning in the online [Zhang et al., 2021b, Lykouris et al., 2021] and offline [Zhang et al., 2021a] settings. [Jadbabaie et al., 2022] study Byzantine-robust linear bandits in the federated setting. Unlike our setting, they allow different agents to be subject to Byzantine attacks in different episodes. Our algorithm enjoys a better regret bound and communication cost. [Fan et al., 2021] provide a Byzantine-robust policy gradient algorithm that is guaranteed to converge to an approximate stationary point, whereas we focus on the regret of the algorithm. [Dubey and Pentland, 2020] study Byzantine-robust multi-armed bandits, where the corruption can only come from a fixed distribution. We study the more difficult MDP setting and allow the corruption to be arbitrary.
3 Robust Mean Estimation from Untruthful Batches
We first present our novel algorithm, Weighted-Clique, for the robust mean estimation from batches problem, which we define below. Weighted-Clique will be the main workhorse later in our algorithms for both offline and online Byzantine-robust RL problems.
Definition 3.1 (Robust mean estimation from batches).
There are $m$ data providers indexed by $[m] = \{1, \dots, m\}$. Among these providers, we denote the set of indices of uncorrupted providers by $G$ and the set of indices of corrupted providers by $B$, where $G \cup B = [m]$, $G \cap B = \emptyset$, and $|B| \le \alpha m$ for a corruption level $\alpha$. Every uncorrupted provider has access to a sub-Gaussian distribution $\mathcal{D}$ with mean $\mu$ and variance proxy $\sigma^2$ (i.e. $\mathbb{E}_{x \sim \mathcal{D}}[x] = \mu$ and $\mathbb{E}_{x \sim \mathcal{D}}[e^{\lambda (x - \mu)}] \le e^{\lambda^2 \sigma^2 / 2}$ for all $\lambda \in \mathbb{R}$). For each $i \in [m]$, a data batch $\tilde{X}_i = \{x_{i,1}, \dots, x_{i, n_i}\}$ is sent to the learner, where the batch size $n_i$ can be arbitrary. For $i \in G$, the $x_{i,j}$'s are i.i.d. samples drawn from $\mathcal{D}$; for $i \in B$, $\tilde{X}_i$ can be arbitrary.
Definition 3.1 considers a robust learning problem from batches where we allow arbitrarily different batch sizes. In contrast, prior works [Qiao and Valiant, 2017, Chen et al., 2020, Jain and Orlitsky, 2021] have only studied the setting with (roughly) equal batch sizes, which is much more restrictive. For this problem, we propose the Weighted-Clique algorithm (Algorithm 1). Given the batch datasets, the parameter $\sigma$ of the sub-Gaussian distribution, the corruption level $\alpha$, and the confidence level $\delta$, Weighted-Clique first performs a clipping step (Line 4) that clips the sizes of the largest batches down to the size of the $(\lceil \alpha m \rceil + 1)$-th largest batch. This reduces the impact of corrupted batches on the weighted average in Line 7. Next, a set of confidence intervals for the true mean is constructed in Line 5 based on the data of each batch. In order to remove outliers, the algorithm finds the largest set of batches whose confidence intervals all intersect. This can be formulated as a maximum-clique problem on an interval graph, hence the name Weighted-Clique; the largest clique can be found efficiently by sorting and scanning the endpoints of all the intervals. The algorithm returns the weighted average of the empirical means over the maximum clique, where the weights are given by the clipped sample sizes.
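To make the steps above concrete, the following is a minimal Python sketch of the estimator, written under the notation of Definition 3.1. The function name, the exact clipping index, and the confidence-interval constants are illustrative assumptions rather than the paper's exact Algorithm 1.

    import numpy as np

    def weighted_clique(batches, sigma, alpha, delta):
        """Robust mean estimation from batches of arbitrary sizes (sketch)."""
        batches = [np.asarray(b, dtype=float) for b in batches if len(b) > 0]
        m = len(batches)
        if m == 0:
            raise ValueError("need at least one non-empty batch")
        sizes = np.array([len(b) for b in batches])
        k = int(np.ceil(alpha * m))          # number of possibly corrupted batches

        # Clipping step: cap the largest batch sizes at a threshold order statistic,
        # so that no single (possibly corrupted) batch dominates the weighted average.
        threshold = np.sort(sizes)[::-1][min(k, m - 1)]
        clipped = np.minimum(sizes, threshold)

        # Per-batch empirical means and sub-Gaussian confidence intervals.
        means = np.array([b.mean() for b in batches])
        widths = sigma * np.sqrt(2.0 * np.log(2.0 * m / delta) / sizes)
        lo, hi = means - widths, means + widths

        # Maximum clique of the interval-intersection graph: intervals pairwise
        # intersect iff they share a common point (Helly in one dimension), so scan
        # all endpoints and keep the point covered by the heaviest set of intervals
        # (weighting by clipped size; maximizing cardinality behaves similarly).
        best_members, best_weight = None, -1.0
        for p in np.concatenate([lo, hi]):
            members = (lo <= p) & (p <= hi)
            weight = clipped[members].sum()
            if weight > best_weight:
                best_weight, best_members = weight, members

        # Weighted average of the surviving empirical means, weights = clipped sizes.
        return float(np.dot(clipped[best_members], means[best_members]) / best_weight)

For instance, when a few batches report wildly shifted values, their intervals either fail to join the clique or enter it with only clipped weight, so the returned estimate stays close to the clean mean.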
Intuitively, by choosing the maximum clique in Line 6, Algorithm 1 finds a cluster of good data batches and drops extreme batches. We show that Algorithm 1 achieves the following guarantee.
Theorem 3.2.
Under Definition 3.1, if , , with probability at least , returned by Algorithm 1 satisfies:
(1) |
where and ’s are defined in Line 2 and Line 4 in Algorithm 1.
A number of immediate remarks are in order.
Remark 3.3.
Note that, compared to prior works [Qiao and Valiant, 2017, Chen et al., 2020, Jain and Orlitsky, 2021], we allow arbitrary batch sizes. Even if some agents report empty batches, as long as enough agents report non-zero batch sizes, our estimator has a well-behaved error bound. This means that the breakdown point (in terms of the fraction of bad agents) of our algorithm is optimal.
Remark 3.4 (Equal batch size case).
Remark 3.5 (Robust mean estimation v.s. robust mean estimation from batches).
In the classical robust mean estimation setting [Huber, 1992, Diakonikolas et al., 2017], the optimal error rate is determined by the total number of samples and the fraction of corrupted samples. In contrast, because the learner observes data-source IDs, i.e. the batch indices, the adversary is much more restricted. To see this, notice that the equal-batch setting can be viewed as robust mean estimation over the batch-wise empirical means. When the batch size becomes larger, each batch mean has a smaller variance, and thus the error of robust mean estimation shrinks accordingly, which matches the above rate (up to logarithmic factors).
Remark 3.6 (Impossibility result).
Our bound in (1) does not depend on the largest batch sizes. This means that even if some of the clean agents have infinitely many samples, the algorithm cannot achieve an arbitrarily small error. This might not look ideal at first glance, but we show that it is information-theoretically inevitable; interested readers are referred to Theorem A.1.
Remark 3.7 (Perturbation stability of the estimator and adaptation to the distributed setting).
When the good data batches are subject to point-wise perturbations of bounded magnitude, a variant of Algorithm 1 (Algorithm 4, Pert-Weighted-Clique; see Section A.2) suffers only an additional perturbation term in the error upper bound on top of (1). Moreover, Algorithm 1 does not need the exact datasets as input, but only the empirical mean and size of each data batch. As we will see later, this property is essential for achieving low communication cost and preserving data privacy.
4 Byzantine-Robust Learning in Parallel MDP
We study the problem of Byzantine-robust reinforcement learning in the parallel Markov Decision Process (MDP) setting with one central server and multiple agents, a fraction of which may suffer Byzantine failures. We postpone the precise interaction protocols between the server and the agents to Section 5 and Section 6.
In both the online and offline settings, we consider a finite-horizon episodic tabular Markov Decision Process (MDP) $M = (\mathcal{S}, \mathcal{A}, P, R, H, \mu_1)$, where $\mathcal{S}$ is the finite state space with $|\mathcal{S}| = S$; $\mathcal{A}$ is the finite action space with $|\mathcal{A}| = A$; $P = \{P_h\}_{h=1}^{H}$ is the sequence of transition probability matrices, i.e. $P_h(\cdot \mid s, a)$ specifies the state distribution at step $h+1$ if action $a$ is taken from state $s$ at step $h$; $R = \{R_h\}_{h=1}^{H}$ is the sequence of bounded stochastic reward functions, i.e. $R_h(s, a)$ is the stochastic reward, bounded in $[0, 1]$, associated with taking action $a$ in state $s$ at step $h$; $H$ is the time horizon; and $\mu_1$ is the initial state distribution. For simplicity, we assume $\mu_1$ is deterministic and puts all probability mass on a single state $s_1$.
Within each episode, the MDP starts at state $s_1$. At each step $h$, the agent observes the current state $s_h$, takes an action $a_h$, and receives a stochastic reward $R_h(s_h, a_h)$. After that, the MDP transitions to the next state $s_{h+1}$, drawn from $P_h(\cdot \mid s_h, a_h)$. The episode terminates after the agent takes its action in state $s_H$ and receives the reward at step $H$.
A policy $\pi = \{\pi_h\}_{h=1}^{H}$ is a sequence of functions, each mapping from the state space to the action space. The value function $V_h^{\pi}$ is the expected sum of future rewards obtained by acting according to policy $\pi$, i.e. $V_h^{\pi}(s) = \mathbb{E}\big[\sum_{h'=h}^{H} R_{h'}(s_{h'}, \pi_{h'}(s_{h'})) \mid s_h = s\big]$, where the expectation is w.r.t. the stochasticity of the state transitions and rewards in the MDP. Similarly, we define the state-action value function $Q_h^{\pi}(s, a) = \mathbb{E}\big[R_h(s, a) + \sum_{h'=h+1}^{H} R_{h'}(s_{h'}, \pi_{h'}(s_{h'})) \mid s_h = s, a_h = a\big]$. Let $\pi^{*}$ be an optimal policy, and let $V_h^{*} := V_h^{\pi^{*}}$, $Q_h^{*} := Q_h^{\pi^{*}}$, and $V_{H+1}^{\pi} := 0$ for every policy $\pi$.
For any function $f : \mathcal{S} \to \mathbb{R}$, we define the Bellman operator $\mathbb{T}_h$ by $(\mathbb{T}_h f)(s, a) := \mathbb{E}[R_h(s, a)] + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}[f(s')]$. Then the Bellman equation is given by:
$Q_h^{\pi}(s, a) = (\mathbb{T}_h V_{h+1}^{\pi})(s, a), \qquad V_h^{\pi}(s) = Q_h^{\pi}(s, \pi_h(s)).$   (2)
The Bellman optimality equation is given by:
$Q_h^{*}(s, a) = (\mathbb{T}_h V_{h+1}^{*})(s, a), \qquad V_h^{*}(s) = \max_{a \in \mathcal{A}} Q_h^{*}(s, a).$   (3)
We denote by $d_h^{\pi}$ the state distribution at step $h$ induced by following policy $\pi$, and define the state trajectory distribution of $\pi$ accordingly. The goal is to find a policy that maximizes the expected cumulative reward, i.e. a policy $\pi$ maximizing $V_1^{\pi}(s_1)$. To measure the performance of our RL algorithms, we use the suboptimality as the performance metric in the offline setting and the regret as the performance metric in the online setting. We formalize these two measures in their corresponding sections below.
5 Byzantine-Robust Online RL
In the online setting, a central server and the agents aim to collaboratively minimize their total regret. The agents and the server collaborate by following a communication protocol that decides when to synchronize and what information to communicate. Unlike the standard distributed RL setting, we assume that a fraction of the agents are Byzantine:
Definition 5.1 (Distributed online RL with Byzantine corruption).
The agents consist of two types:
-
•
good agents, denoted by $\mathcal{G}$: each good agent interacts with its own copy of the MDP $M$ and communicates its observations to the server following the interaction protocol;
-
•
bad agents, denoted by $\mathcal{B}$: the bad agents are allowed to send arbitrary observations to the server at the end of each episode.
Because the server has no control over the bad agents, we only seek to minimize the error incurred by the good agents. Formally, we use the following regret as the performance measure for the online RL algorithm:
$\mathrm{Regret}(K) = \sum_{k=1}^{K} \sum_{i \in \mathcal{G}} \big( V_1^{*}(s_1) - V_1^{\pi_k^i}(s_1) \big),$   (4)
where $\pi_k^i$ is the policy used by agent $i$ in episode $k$. At the same time, because of the distributed nature of our problem, we want to synchronize between the server and the agents only when necessary, in order to reduce the communication cost.
Based on these considerations, we propose the Byzan-UCBVI algorithm (Algorithm 2). We highlight the following key features of Byzan-UCBVI:
-
1.
Low-switching-cost algorithm design: the server checks the synchronization criterion in Line 6 when receiving requests from agents. Each good agent requests synchronization if and only if any of its own counts doubles (Line 21). Importantly, our agents do not need to know other agents' counts to decide whether synchronization is necessary. This design choice reduces the number of policy switches, the number of synchronization rounds, and the communication cost from linear to logarithmic in the number of episodes. Our number of communication steps is much lower than that of [Jadbabaie et al., 2022]. Unlike [Dubey and Pentland, 2021], our agents do not need to know other agents' transition counts in order to decide whether to synchronize.
-
2.
Homogeneous policy execution: in every episode, our algorithm ensures that all good agents run the same policy. This ensures that the robust mean estimation achieves the smallest estimation error; recall that samples in overly large batches are wasted when the batch sizes are severely imbalanced (cf. Section 3).
-
3.
Robust UCBVI updates: during synchronization, the central server performs a policy update using a variant of the UCBVI algorithm [Azar et al., 2017]: for each step $h = H, \dots, 1$, compute:
(5) (6) We replace the empirical mean estimation with our Pert-Weighted-Clique (PWC) algorithm (Algorithm 4) and design a new confidence bonus accordingly. Instead of estimating the transition matrix and the reward function, we directly estimate the Bellman operator applied to an estimated value function. The server gathers the required sufficient statistics from the agents in Line 13; a sketch of this update appears right after this list.
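To make the update concrete, the following is a minimal sketch of the server-side backup. The interface is hypothetical: agent_stats(h, s, a, V_next) is assumed to return one (empirical mean of the Bellman backup, visit count) pair per agent, and pwc_estimate plays the role of Pert-Weighted-Clique, returning a robust estimate together with the error bound that serves as the optimism bonus; names and signatures are illustrative, not the paper's code.

    import numpy as np

    def robust_optimistic_backup(agent_stats, pwc_estimate, H, S, A, sigma, alpha, delta):
        """Backward value iteration with a robust Bellman-operator estimate and an
        optimism bonus (sketch)."""
        V = np.zeros((H + 2, S))                 # V[H + 1] = 0 by convention
        Q = np.zeros((H + 1, S, A))
        pi = np.zeros((H + 1, S), dtype=int)
        for h in range(H, 0, -1):                # backward induction over steps
            for s in range(S):
                for a in range(A):
                    summaries = agent_stats(h, s, a, V[h + 1])
                    est, bonus = pwc_estimate(summaries, sigma, alpha, delta)
                    # Optimistic estimate, truncated at the maximum remaining return.
                    Q[h, s, a] = min(est + bonus, H - h + 1)
                pi[h, s] = int(np.argmax(Q[h, s]))
                V[h, s] = Q[h, s, pi[h, s]]
        return Q, V, pi

The offline algorithm of Section 6 reuses the same structure, but subtracts the bonus instead of adding it.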
We are now ready to present the following regret bound for Byzan-UCBVI.
Theorem 5.2 (Regret bound).
Under Definition 5.1, if , for all , with probability at least , the total regret of Algorithm 2 is at most
(7) |
Remark 5.3 (Understanding the regret bound).
In Algorithm 2, the good agents all use the same policy, and thus $\pi_k^i = \pi_k$ for every good agent $i$, where $\pi_k$ is the policy computed by the server for the $k$-th episode. By utilizing the batch structure, Algorithm 2 achieves a regret sublinear in the number of episodes, even under Byzantine attacks. Our regret bound also improves upon the one in [Jadbabaie et al., 2022]. Under mild conditions, the dominating term is optimal even in the clean setting [Azar et al., 2017].
Remark 5.4 (Communication cost).
Because each agent runs $K$ episodes in total, each of its counts doubles at most $O(\log K)$ times during training; thus each good agent sends at most $O(HSA \log K)$ synchronization requests. The bad agents can only trigger a logarithmic number of effective requests because of the checking step in Line 6. As a result, the total number of synchronization episodes is at most logarithmic in $K$. The communication inside one synchronization episode is as follows: at least one agent sends a sync request; inside the value iteration, the server sends the estimated value functions at all steps to each agent; and each good agent sends the estimated Bellman operator for every state-action pair at every step, together with its counts, to the server. Importantly, the agents only need to send summary statistics, instead of the raw dataset, to the server, which preserves the data privacy of individual agents [Sakuma et al., 2008, Liu et al., 2019].
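As an illustration of the last point, the content of a good agent's message in one synchronization round could look like the following; the class and field names are hypothetical and only meant to show that raw trajectories never leave the agent.

    from dataclasses import dataclass
    from typing import Dict, Tuple

    @dataclass
    class SyncMessage:
        """Hypothetical per-synchronization message from a good agent: only summary
        statistics are sent -- one (empirical Bellman backup, visit count) pair per
        (h, s, a) tuple -- never raw trajectories."""
        agent_id: int
        episode: int
        # key: (h, s, a); value: (mean of r + V_{h+1}(s') over the agent's visits, count)
        bellman_stats: Dict[Tuple[int, int, int], Tuple[float, int]]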
Remark 5.5 (Switching cost).
Switching cost measures the number of policy changes; algorithms with low switching cost are favorable in real-world applications [Bai et al., 2019, Zhang et al., 2020b, Gao et al., 2021]. Since Algorithm 2 only updates its policy in synchronization episodes, its switching cost is also at most logarithmic in $K$.
6 Byzantine-Robust Offline RL
In the offline setting, we assume the server has access to a set of data batches, some of which are corrupted. The goal of the server is to find a nearly optimal policy without further interaction with the environment. Specifically:
Definition 6.1 (Distributed offline RL with Byzantine corruption).
The server has access to an offline dataset consisting of data batches, including good batches and bad batches, where the bad batches form a bounded fraction of the total. We make an assumption on the data-generating process similar to [Wang et al., 2020]. Precisely, for every good batch, the state-action pairs are drawn from an unknown distribution, and the associated rewards and next states are instantiations of the MDP's reward and transition at those pairs. For any bad batch, the data can be arbitrary.
The performance is measured by the suboptimality w.r.t. a deterministic comparator policy $\tilde{\pi}$ (not necessarily an optimal policy):
$\mathrm{SubOpt}(\widehat{\pi}; \tilde{\pi}) = V_1^{\tilde{\pi}}(s_1) - V_1^{\widehat{\pi}}(s_1).$   (8)
In the offline setting, the server cannot interact with the MDP, so our result relies heavily on the quality of the dataset. As we will see in the analysis, the suboptimality gap (8) can be upper bounded by the estimation error of the Bellman operator along the trajectories of $\tilde{\pi}$. As a result, we do not need full coverage of the whole state-action space; instead, we only need the offline dataset to properly cover $d_h^{\tilde{\pi}}$, the state distribution of the comparator policy at each step $h$. To characterize the data coverage, we define, for each agent, the counts of $(h, s, a)$ tuples by:
(9) |
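Written out explicitly under the tabular notation assumed in Section 4, such a count takes the form below; the trajectory index and symbols are illustrative and may differ from those in (9):
\[
n_h^i(s, a) \;=\; \sum_{k} \mathbb{1}\big\{ (s_h^{i,k}, a_h^{i,k}) = (s, a) \big\},
\]
i.e. the number of times agent $i$'s logged batch visits the pair $(s, a)$ at step $h$.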
When calling Algorithm 1, the largest data batches might be clipped in Line 4. By definition, the clipping threshold is bounded between two order statistics of the batch counts. We define three quantities to characterize the quality of the offline dataset. The first quantity describes the probability mass of trajectories that are not properly covered by the offline dataset:
Definition 6.2 (Measure of insufficient coverage).
We define the following quantity as the probability of visiting an $(h, s, a)$ tuple that is insufficiently covered by the logged data, namely
(10) |
Recall that Algorithm 1 requires a minimum number of non-empty data batches to make an informed decision. The quantity above measures an upper bound on the total probability, under $\tilde{\pi}$, of encountering an $(h, s, a)$ tuple on which Weighted-Clique cannot return a good mean estimate.
We now introduce the density ratio between $d^{\tilde{\pi}}$ and the empirical distribution of the uncorrupted offline dataset. It quantifies the portion of useful data in the whole dataset and is commonly used in the offline RL literature [Rashidinejad et al., 2021, Zhang et al., 2021a]. We focus only on the tuples that are excluded from the insufficiently covered set of Definition 6.2:
Definition 6.3 (density ratio).
We use the following set to denote the states (in the support of the comparator's state distribution) that have proper coverage from the clean agents:
(11) |
We use the following quantity to denote the density ratio between the state distribution of the comparator policy and the empirical distribution over the uncorrupted offline dataset:
(12) |
As we can see from Theorem 3.2, the accuracy of Algorithm 1 depends heavily on the evenness of the batches. Even if some good batches contain a large amount of data, the extra data are not useful (cf. Remark 3.6). We define the following quantity to measure the information loss in the clipping step (Line 4 in Algorithm 1):
Definition 6.4 (Unevenness of good agents coverage).
(13) |
where
Intuitively, this quantity describes the evenness of the good agents' coverage. It measures both how much data in large batches is cut off by the clipping step and how uneven the batches remain after clipping. We use the two order statistics above, instead of the true clipping threshold, so that the quantity serves as an upper bound on the actual unevenness incurred by running the algorithm. For example, if all good batches have the same size, the quantity is minimal; if one good batch is much larger than the others, the quantity grows, meaning it increases as the batches become less even.
Remarkably, all three quantities defined above depend only on the counts of the good data batches.
Given the above setup, we now present our second algorithm, Byzan-PEVI, a Byzantine-robust variant of pessimistic value iteration [Jin et al., 2021]. Similar to the online setting, we use our Weighted-Clique algorithm (without perturbation) to approximate the Bellman operator and use its estimation error to design a pessimistic bonus for value iteration. Byzan-PEVI (Algorithm 3) runs pessimistic value iteration ((14)-(15)) and calls Weighted-Clique as a subroutine to robustly estimate the Bellman operator from the offline dataset:
(14) | |||
(15) |
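For intuition, a standard form of such a pessimistic backup, written in the notation assumed in Section 4 with $b_h$ denoting the bonus built from Weighted-Clique's error bound (the exact truncation and bonus of (14)-(15) are specified in Algorithm 3), is:
\[
\widehat{Q}_h(s, a) = \max\Big\{ \big(\widehat{\mathbb{T}}_h \widehat{V}_{h+1}\big)(s, a) - b_h(s, a),\; 0 \Big\}, \qquad
\widehat{V}_h(s) = \max_{a \in \mathcal{A}} \widehat{Q}_h(s, a), \qquad
\widehat{\pi}_h(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} \widehat{Q}_h(s, a),
\]
with $\widehat{V}_{H+1} \equiv 0$ and $\widehat{\mathbb{T}}_h$ the robust estimate of the Bellman operator returned by Weighted-Clique.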
Theorem 6.5.
Given any deterministic comparator policy $\tilde{\pi}$, under Definitions 6.1-6.4, with probability at least $1 - \delta$, Algorithm 3 outputs a policy $\widehat{\pi}$ such that:
(16) |
Remark 6.6.
Compared to [Zhang et al., 2021a], our bound has no non-diminishing term, meaning the suboptimality gap vanishes as the good agents collect more data. To the best of our knowledge, this is the first result for Byzantine-robust offline RL.
Remark 6.7 (Offline vs. online RL).
Our offline RL results are more involved and notation-heavy due to the nature of the problem. In the offline setting, the learner has no control over the data-generating process, and each data source can be arbitrarily different. The learner can only passively rely on the robust mean estimator we designed, together with the pessimism principle, to learn as well as the data permits. In contrast, in the online setting, the learner has complete control over the clean agents' data collection process. Our algorithm Byzan-UCBVI lets the server exploit this control and obtain a tighter and cleaner sample complexity guarantee.
7 Conclusion
To summarize, in this work we first presented Weighted-Clique, a robust mean estimation algorithm for learning from uneven batches, which may be of independent interest. Building upon Weighted-Clique, we proposed a Byzantine-robust online RL algorithm (Byzan-UCBVI) and the first Byzantine-robust offline RL algorithm (Byzan-PEVI) in the distributed setting. Several questions remain open: (1) Can we provide a complete characterization of the information-theoretic lower bound for robust mean estimation from uneven batches? (2) Can we extend our RL algorithms to the function approximation setting?
References
- Abadi et al. [2016] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
- Agarwal et al. [2021] Mridul Agarwal, Bhargav Ganguly, and Vaneet Aggarwal. Communication efficient parallel reinforcement learning. In Uncertainty in Artificial Intelligence, pages 247–256. PMLR, 2021.
- Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
- Bai et al. [2019] Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient q-learning with low switching cost. Advances in Neural Information Processing Systems, 32, 2019.
- Bazzan [2009] Ana LC Bazzan. Opportunities for multiagent systems and multiagent reinforcement learning in traffic control. Autonomous Agents and Multi-Agent Systems, 18(3):342–375, 2009.
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chan et al. [2014] Siu-On Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1193–1203. SIAM, 2014.
- Chen et al. [2020] Sitan Chen, Jerry Li, and Ankur Moitra. Efficiently learning structured distributions from untrusted batches. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 960–973, 2020.
- Chen et al. [2021] Tianyi Chen, Kaiqing Zhang, Georgios B Giannakis, and Tamer Basar. Communication-efficient policy gradient methods for distributed reinforcement learning. IEEE Transactions on Control of Network Systems, 2021.
- Chen et al. [2017] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1–25, 2017.
- Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
- Diakonikolas and Kane [2019] Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
- Diakonikolas et al. [2017] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008. PMLR, 2017.
- Diakonikolas et al. [2019] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596–1606. PMLR, 2019.
- Ding et al. [2020] Guohui Ding, Joewie J Koh, Kelly Merckaert, Bram Vanderborght, Marco M Nicotra, Christoffer Heckman, Alessandro Roncone, and Lijun Chen. Distributed reinforcement learning for cooperative multi-robot object manipulation. arXiv preprint arXiv:2003.09540, 2020.
- Dubey and Pentland [2020] Abhimanyu Dubey and Alex Pentland. Private and byzantine-proof cooperative decision-making. In AAMAS, pages 357–365, 2020.
- Dubey and Pentland [2021] Abhimanyu Dubey and Alex Pentland. Provably efficient cooperative multi-agent reinforcement learning with function approximation. arXiv preprint arXiv:2103.04972, 2021.
- Espeholt et al. [2018] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pages 1407–1416. PMLR, 2018.
- Eykholt et al. [2018] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1625–1634, 2018.
- Fan et al. [2021] Xiaofeng Fan, Yining Ma, Zhongxiang Dai, Wei Jing, Cheston Tan, and Bryan Kian Hsiang Low. Fault-tolerant federated reinforcement learning with theoretical guarantee. Advances in Neural Information Processing Systems, 34, 2021.
- Freedman [1975] David A Freedman. On tail probabilities for martingales. the Annals of Probability, pages 100–118, 1975.
- Gao et al. [2021] Minbo Gao, Tianle Xie, Simon S Du, and Lin F Yang. A provably efficient algorithm for linear markov decision process with low switching cost. arXiv preprint arXiv:2101.00494, 2021.
- Goyal et al. [2017] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- Horgan et al. [2018] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
- Huang et al. [2017] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
- Huber [1992] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992.
- Jadbabaie et al. [2022] Ali Jadbabaie, Haochuan Li, Jian Qian, and Yi Tian. Byzantine-robust federated linear bandits. arXiv preprint arXiv:2204.01155, 2022.
- Jain and Orlitsky [2021] Ayush Jain and Alon Orlitsky. Robust density estimation from batches: The best things in life are (nearly) free. In International Conference on Machine Learning, pages 4698–4708. PMLR, 2021.
- Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018.
- Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- Jin et al. [2021] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
- Kretchmar [2002] R Matthew Kretchmar. Parallel reinforcement learning. In The 6th World Conference on Systemics, Cybernetics, and Informatics. Citeseer, 2002.
- Lai et al. [2016] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.
- Lamport et al. [1982] Leslie Lamport, Robert Shostak, and Marshall Pease. The byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, 1982.
- Liu et al. [2019] Ximeng Liu, Robert H Deng, Kim-Kwang Raymond Choo, and Yang Yang. Privacy-preserving reinforcement learning design for patient-centric dynamic treatment regimes. IEEE Transactions on Emerging Topics in Computing, 9(1):456–470, 2019.
- Lykouris et al. [2021] Thodoris Lykouris, Max Simchowitz, Alex Slivkins, and Wen Sun. Corruption-robust exploration in episodic reinforcement learning. In Conference on Learning Theory, pages 3242–3245. PMLR, 2021.
- Ma et al. [2019] Yuzhe Ma, Xuezhou Zhang, Wen Sun, and Jerry Zhu. Policy poisoning in batch reinforcement learning and control. Advances in Neural Information Processing Systems, 32, 2019.
- Ma et al. [2021] Yuzhe Ma, J Sharp, Ruizhe Wang, Earlence Fernandes, and Xiaojin Zhu. Adversarial attacks on kalman filter-based forward collision warning systems. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021.
- Neff and Nagy [2016] Gina Neff and Peter Nagy. Automation, algorithms, and politics| talking to bots: Symbiotic agency and the case of tay. International Journal of Communication, 10:17, 2016.
- Paninski [2008] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.
- Prasad et al. [2018] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
- Qiao and Valiant [2017] Mingda Qiao and Gregory Valiant. Learning discrete distributions from untrusted batches. arXiv preprint arXiv:1711.08113, 2017.
- Rashidinejad et al. [2021] Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34, 2021.
- Sakuma et al. [2008] Jun Sakuma, Shigenobu Kobayashi, and Rebecca N Wright. Privacy-preserving reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pages 864–871, 2008.
- Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Tukey [1960] John W Tukey. A survey of sampling from contaminated distributions. Contributions to probability and statistics, pages 448–485, 1960.
- Verbraeken et al. [2020] Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S Rellermeyer. A survey on distributed machine learning. ACM Computing Surveys (CSUR), 53(2):1–33, 2020.
- Wang et al. [2020] Ruosong Wang, Dean P Foster, and Sham M Kakade. What are the statistical limits of offline rl with linear function approximation? arXiv preprint arXiv:2010.11895, 2020.
- Yang and Wang [2020] Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
- Yin et al. [2018] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5650–5659. PMLR, 2018.
- Yu et al. [2014] Tao Yu, HZ Wang, Bin Zhou, Ka Wing Chan, and J Tang. Multi-agent correlated equilibrium Q(λ) learning for coordinated smart generation control of interconnected power grids. IEEE Transactions on Power Systems, 30(4):1669–1679, 2014.
- Zhang et al. [2020a] Xuezhou Zhang, Yuzhe Ma, Adish Singla, and Xiaojin Zhu. Adaptive reward-poisoning attacks against reinforcement learning. In International Conference on Machine Learning, pages 11225–11234. PMLR, 2020a.
- Zhang et al. [2021a] Xuezhou Zhang, Yiding Chen, Jerry Zhu, and Wen Sun. Corruption-robust offline reinforcement learning. arXiv preprint arXiv:2106.06630, 2021a.
- Zhang et al. [2021b] Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, and Wen Sun. Robust policy gradient against strong data corruption. In International Conference on Machine Learning, pages 12391–12401. PMLR, 2021b.
- Zhang et al. [2020b] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33:15198–15207, 2020b.
Appendix A More discussion on Algorithm 1:Weighted-Clique
A.1 Impossibility result
Theorem A.1 (impossibility result).
There exists a distribution , s.t. given data batches generated under Definition 3.1, every robust mean estimation algorithm suffers an error at least
(17) |
even if it knows that some of the batches are clean, where is the sum of the sizes of the smallest good batches.
Proof of Theorem A.1.
Let be a Bernoulli distribution with parameter . W.l.o.g., assume , and . We assume the algorithm knows that is a subset of the good batches.
Let . Let the bad batches be i.i.d. samples from , a Bernoulli distribution with parameter . By Theorem 4 of [Paninski, 2008, Chan et al., 2014], no algorithm can distinguish whether the batches are sampled from or . I.e. no algorithm can distinguish whether are the good batches or are the good batches. This means that, given the data batches , every robust mean estimation algorithm suffers an error of at least . ∎
A.2 Adaption to good batch perturbation and distributed learning
Compared to Algorithm 1, Algorithm 4 enlarges the confidence interval on both endpoints to account for the perturbation, and only requires sufficient statistics from the batches instead of the whole dataset. When there are enough non-empty batches, Algorithm 4 runs a modified Weighted-Clique algorithm to compute the mean estimate and the error upper bound (with an additional term as an adjustment for the covering argument in the proof of Theorem 5.2). Otherwise, Algorithm 4 returns a default value and a trivial error upper bound.
Appendix B Proof of Theorem 3.2
To prove Theorem 3.2, we show that (1) holds under a certain concentration event, and that this event happens with high probability. We consider a slightly more general setting where there could be perturbations even to the good batches:
Definition B.1 (Robust mean estimation from batches).
There are $m$ data providers indexed by $[m] = \{1, \dots, m\}$. Among these providers, we denote the set of indices of uncorrupted providers by $G$ and the set of indices of corrupted providers by $B$, where $G \cup B = [m]$, $G \cap B = \emptyset$, and $|B| \le \alpha m$. Every uncorrupted provider has access to perturbed samples from a sub-Gaussian distribution $\mathcal{D}$ with mean $\mu$ and variance proxy $\sigma^2$. For each $i \in G$, a data batch $X_i = \{x_{i,1}, \dots, x_{i,n_i}\}$ is drawn i.i.d. from $\mathcal{D}$, while a perturbed version $\tilde{X}_i = \{\tilde{x}_{i,1}, \dots, \tilde{x}_{i,n_i}\}$ is sent to the learner, where the batch size $n_i$ can be arbitrary and $|\tilde{x}_{i,j} - x_{i,j}| \le \epsilon_p$ for some perturbation level $\epsilon_p$. For $i \in B$, $\tilde{X}_i$ can be arbitrary.
One can easily recover Definition 3.1 by letting the perturbation level $\epsilon_p$ be zero. We first define the concentration event as follows:
Definition B.2 (Concentration event).
For all , define the event that the empirical mean of clean batches is close to the population mean as:
(18) |
Define the event that the weighted average of empirical means of clean batches is close to the population mean as:
(19) |
Let be the event that the events above happen together:
(20) |
We can show that this event happens with high probability using Hoeffding's inequality; the generic sub-Gaussian form of the bound is recalled below.
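For reference, the standard tail bound for averages of sub-Gaussian samples (a generic statement in the notation of Definition B.1; the paper's display may use different constants): if $x_1, \dots, x_n$ are i.i.d. sub-Gaussian with mean $\mu$ and variance proxy $\sigma^2$, then for any $t > 0$,
\[
\Pr\left( \Big| \frac{1}{n} \sum_{j=1}^{n} x_j - \mu \Big| \ge t \right) \le 2 \exp\left( - \frac{n t^2}{2 \sigma^2} \right),
\]
so each clean batch mean concentrates within $\sigma \sqrt{2 \log(2/\delta') / n}$ of $\mu$ with probability at least $1 - \delta'$.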
Lemma B.3.
.
Proof.
See proof in Section B.1. ∎
Under event , we can give an upper bound on the estimation error:
Lemma B.4.
Under event , if , Algorithm 4 outputs a with
(21) |
Proof.
See proof in Section B.2. ∎
Proof of Theorem 3.2.
B.1 Proof of Lemma B.3
To prove Lemma B.3,
-
1.
we first show that the perturbation changes the empirical mean of batches by at most ;
-
2.
we can show the concentration bound of empirical means and weighted means for the unperturbed samples;
-
3.
we conclude by combining the two results above with the triangle inequality.
the probability of event :
For all , let be the empirical mean of unperturbed samples in batch :
(23) |
By the triangle inequality:
(24) |
Since is a sub-Gaussian distribution, we can show the concentration of the unperturbed samples : for every good batch ,
(25) |
By union bound, with probability at least , ,
(26) |
By the triangle inequality, with probability at least , ,
(27) |
I.e. .
the probability of event :
We first show that the weighted average of the empirical means of the unperturbed samples, i.e. , is a sub-Gaussian random variable. Firstly, note that the mean of the weighted average is , i.e. . By definition, we know that for each good batch , the are i.i.d. sub-Gaussian random variables with mean and variance proxy , i.e.
(28) |
Since : for all ,
(29) | ||||
(30) | ||||
(31) |
This means is a sub-Gaussian random variable with variance proxy . Thus ,
(32) |
Thus with probability at least :
(33) |
This means:
(34) | ||||
(35) | ||||
(36) |
I.e. .
By union bound .
B.2 Proof of Lemma B.4
By Lemma B.3, we know that the weighted average of the empirical means of the good batches is a proper estimate of the population mean. Compared to , the returned in Line 6 of Algorithm 4 may remove some good batches and include some bad batches. Even so, as long as we can show that:
-
1.
Line 6 does not remove too many good batches and does not include too many bad batches;
-
2.
the bad batches included in will not be significant
then we can show that the returned in Line 7 is a reasonable estimate of .
The structure of :
is the largest subset of batches whose confidence intervals intersect. The confidence intervals of all the good batches intersect under event , thus it must be at least as large as , and thus it is not possible to remove too many good batches. Furthermore, we can also show that we do not lose too much information, meaning we do not significantly reduce the total number of samples, and thus later on we can show that the statistical rate is not affected by too much. We make these ideas precise below.
Because maximizes
(38) |
we know . Furthermore, can include at most batches, this means includes at least good batches. Formally:
(39) |
Now we show that does not lose too much information, i.e. . By definition of , there are at least batches in such that . Because removes at most batches, there are at least batches in such that . I.e.
(40) | ||||
(41) | ||||
(42) |
This means the information loss can be bounded by , formally:
(43) | ||||
(44) |
Thus we have:
(45) |
Bad batches in :
In order for a bad batch to survive in , its confidence interval must intersect each good batch's confidence interval in . In particular, it must intersect with the good batch in with the largest . By definition, there are at least good batches with . Because excludes at most good batches, there is at least one good batch (denoted by ), s.t. .
Thus , . This means there exists some point , s.t. , thus
(46) | ||||
(47) | ||||
(48) |
Furthermore, under event ,
(49) |
By the triangle inequality, it cannot be too far away from :
(50) |
Error decomposition:
As mentioned earlier, we can decompose the estimation error of the estimate returned by Algorithm 4 into: the statistical error (with potential information loss), term in (55); the error coming from including bad batches, term in (55); and the error coming from removing good batches, term in (55). Specifically:
(51) | ||||
(52) | ||||
(53) | ||||
(this is by the triangle inequality) | (54) |
(55) |
We can bound the first term by (45) under event :
(56) | ||||
(57) | ||||
(58) | ||||
(59) | ||||
(60) | ||||
(61) | ||||
(62) |
By (50), we can bound the second term by:
(63) | ||||
(64) | ||||
(65) | ||||
(66) | ||||
(67) |
We can bound the third term by:
(68) | ||||
(69) | ||||
(70) | ||||
(71) | ||||
(72) |
Note that the above upper bounds for and are still valid even if some of the ’s are zero.
In conclusion, we can bound the estimation error by:
(73) | ||||
(74) | ||||
(75) | ||||
(76) | ||||
(77) | ||||
(78) | ||||
(79) | ||||
(80) | ||||
(81) | ||||
(82) | ||||
(83) |
Appendix C Proof of Theorem 5.2
By following the standard regret decomposition for UCB-type algorithms (see [Jin et al., 2020]), under the event that the estimation error of the Bellman operator is bounded by the bonus terms, we can decompose the regret into:
-
1.
the cumulative bonus terms incurred along the trajectories of each good agent,
-
2.
a term that can be bounded more easily by the Azuma-Hoeffding inequality.
By Lemma B.4, and by replacing Lemma B.3 with a martingale variant, we can show that the event mentioned above happens with high probability. Unlike the standard regret analysis in the tabular setting, we cannot directly use a telescoping series to bound the cumulative bonuses. Instead, we first need to show that, because every good agent runs the same policy in every episode, their trajectories overlap substantially, meaning the counts of the good agents do not differ by too much. Given that, we can simplify the bound in Lemma B.4 and apply the telescoping argument.
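For reference, the form of the Azuma-Hoeffding inequality used for such martingale terms is the following (a standard statement, in generic notation): if $M_0, M_1, \dots, M_n$ is a martingale with $|M_t - M_{t-1}| \le c_t$ almost surely, then for any $\epsilon > 0$,
\[
\Pr\big( |M_n - M_0| \ge \epsilon \big) \le 2 \exp\left( - \frac{\epsilon^2}{2 \sum_{t=1}^{n} c_t^2} \right).
\]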
We start by restating Theorem 5.2:
Theorem C.1 (Regret bound, Theorem 5.2).
If , for all , with probability at least :
(84) |
We first give the high-level idea of our proof:
-
1.
We give an analysis under the intersection of three “good events”:
-
•
event : the estimation error of Bellman operator is upper-bounded by bonus (See Section C.1, Lemma C.3);
-
•
event : if the total count on some tuple is large, then the counts of the individual agents differ by at most a bounded factor (see Section C.3, Lemma C.9);
-
•
event : an error term in the regret decomposition is bounded by an Azuma-Hoeffding bound.
-
•
-
2.
Under event , we can decompose the regret into two terms (see Section C.2, Lemma C.8):
-
•
a martingale with bounded differences, which is controlled by a Hoeffding bound under event ;
-
•
the cumulative bonus terms, which can be bounded by a telescoping series under event .
-
•
We index the relevant variables by the episode in which they are used. When a synchronization happens in an episode, these variables are the updated ones after the synchronization; when there is no synchronization in an episode, they remain the same as in the previous episode. Let the counts of tuples in an episode be the ones after the counts update. Formally, we restate the data collection process and the counts of each good agent: during the data collection process, we allow all of the agents to collect data together; in each episode, every agent collects a multi-set of transition tuples using the current policy.
(85) |
is given by:
(86) |
We give the formal definitions of the good events below:
Definition C.2.
(87) | ||||
(88) |
(89) |
For any , we define the following event:
(90) |
And define
(91) |
Proof of Theorem 5.2.
By Azuma-Hoeffding inequality:
(92) |
Then, by the union bound, Lemma C.3 and Lemma C.9 together imply that for all :
(93) |
which means happens with probability at least .
We now upper bound the regret under event . By Lemma C.8, we can decompose the regret as:
(94) | ||||
(95) | ||||
(96) | ||||
(97) | ||||
(98) | ||||
(99) |
We only need to upper bound the cumulative bonus. Suppose the policy is updated at the beginning of -th episodes, with the data collected in the first -th episodes, with . To simplify the notation, we define , .
For convenience, in the following, we use to denote the total count on tuples up to episode over all good agents:
(100) |
where . We can rearrange the cumulative bonus by summing over pairs:
(101) |
When fewer than the required number of agents have coverage of some tuple, the bonus term is trivially set to be . In the following, we show that under event , in (101), for each tuple there are at most bonus terms such that , where
(102) |
For any , let be such that:
(103) |
This means when running the policy update at episode , the total counts for , i.e. , is larger than . For any , we have
(104) |
By definition of , for any
(105) |
this means for any , , meaning all of the good agents have coverage on , this means there are at least agents have coverage, and thus:
-
•
A trivial bonus can only happen at , i.e.
(106) Furthermore, in the algorithm, the agents synchronize and update their policy when or before any count of a good agent doubles, i.e. for all :
(107) This means
(108) Thus for each tuple, there are at most bonus term such that .
-
•
for any
(110) where is the -th largest among and
(111) For any , implies , and .
This means for any
(112) (113) (114) Thus
(115)
We are now ready to bound the cumulative bonus:
(116) | ||||
(117) | ||||
(118) | ||||
(119) | ||||
(120) | ||||
(121) | ||||
(122) | ||||
(123) |
(124) |
Because ,
(125) |
By (107),
(126) | ||||
(127) |
By the Cauchy–Schwarz inequality,
(128) |
Thus
(129) | ||||
(130) |
Thus
(131) | ||||
(132) |
In conclusion:
(133) | ||||
(134) |
∎
C.1 The good event
We first show that our bonus is a valid upper confidence bound for the estimated Bellman operator. Recall that the bonus term used in the -th episode is calculated based on the data collected in the first episodes. The bonus is given by:
-
•
If
(135) -
•
If
(136) (137) where is the -th largest among and
(138)
To be precise:
Lemma C.3 (Valid bonus).
Let be the following event:
(139) |
Then, we have
(140) |
To show that this is a high-probability event, we would like to utilize the result of Theorem 3.2. However, there are two obstacles, so we need to make some modifications:
- 1.
-
2.
The event requires the concentration property of to hold uniformly over infinitely many 's, so a direct union bound does not apply. Instead, we need to use a covering-number argument over all possible 's, which is standard (see [Jin et al., 2020]).
Proof of Lemma C.3.
Let be the following event:
(141) |
In the following, we decompose by:
(142) |
and bound by law of total probability.
If , because and , with probability , ,
(143) |
This means
(144) |
If , we use a covering number argument and union bound to bound the probability of event .
Consider , an -cover of in the sense of the -norm. We can bound the covering number by . This means that for any , we can find an s.t. . In other words,
(145) |
Importantly, unlike model-based methods without bad agents, our estimator is not a linear operator, meaning we cannot trivially upper bound it in the covering-number argument. Instead, we need to use the continuity of the error bound of our robust mean estimation Algorithm 4: as long as each data point collected by each agent is not perturbed by too much, the estimation error bound does not increase by too much.
Recall that in Algorithm 2, at episode , if the agents decide to synchronize, then at each step , given any function , the clean agents will calculate empirical mean for
(146) |
Let be an element in , s.t. , this means set (146) is a perturbed version (by at most ) of
(147) |
This means given an , for any , s.t. , Algorithm 4 can be used to robustly estimate , given set (146). Furthermore, choosing , by Lemma C.4, Lemma C.5 and Lemma B.4, given any , and any , s.t. , with probability at least ,
(148) |
We can bound the by:
(149) | ||||
(150) |
Then
(151) | ||||
(152) | ||||
(153) | ||||
(154) | ||||
(155) |
This means
(156) |
In conclusion,
(157) |
∎
C.1.1 Concentration of estimation from good agents
Lemma C.4.
Let:
(158) |
where we define . For any , and for any with probability at least , happens, where
(159) |
and
(160) |
Proof of Lemma C.4.
We use the martingale stopping-time argument from Lemma 4.3 of [Jin et al., 2018].
For each fixed : for all , define
(161) |
Let
(162) |
Then is a martingale. One observation is that the corresponding increment vanishes if agent did not visit in the -th episode. Thus we can use the stopping-time idea to shorten the martingale sequence.
Define the following sequence of ’s: ,
(163) |
Intuitively, is the episode in which is visited by agent for the -th time. If agent visits fewer than times, then . By definition, is a stopping time w.r.t. .
By optional sampling theorem, is a martingale.
By Azuma-Hoeffding’s inequality: for any
(164) |
Let , we get: for any , for any , with probability at least :
(165) |
By union bound, for any , with probability at least , for any :
(166) |
This means for any and any
(167) | ||||
(168) | ||||
(169) |
Thus
(170) | ||||
(171) |
By union bound
(172) |
∎
Lemma C.5.
Let:
(173) |
(174) |
where we define . For any , with probability at least , happens, where
(175) |
Proof of Lemma C.5.
During the data collection process, the agents are allowed to collect data simultaneously. For analysis purposes, we artificially order the data in the following sequence:
(176) |
where . Let
(177) |
Then forms a valid filtration. Let be a fixed set of scalar, s.t. , for all .
For each fixed : for all , Let
(178) |
Then is a martingale. As we can see, if good agent did not visit in episode , then a.s. Thus we can use the stopping time idea to shorten the martingale sequence.
Define the following functions to map from sequence index to agent index and episode index:
(179) |
For any , define the following sequence of ’s: ,
(180) | ||||
(181) |
Intuitively, is the episode in which is visited in sequence (176) for the -th time, and for all , agent has not yet collected tuples. If is visited fewer than times, or if some agent visits more than times, then . By definition, is a stopping time w.r.t. .
In particular, let be the th-largest of all ’s and . We choose .
By optional sampling theorem, is a martingale.
By Azuma-Hoeffding’s inequality: for any
(182) |
Let , we get: for any , for any , with probability at least :
(183) |
By union bound, for any , with probability at least , for any :
(184) |
This means for any and any
(185) | ||||
(186) | ||||
(187) | ||||
(188) | ||||
(189) | ||||
(190) |
Thus
(191) | ||||
(192) | ||||
(193) |
∎
C.2 The regret decomposition for UCB-style algorithms
We follow the regret decomposition strategy of [Jin et al., 2020] under the event that the estimation error of the Bellman operator is bounded by the bonus terms.
The estimated Bellman operator can be used to approximate the Q function:
Lemma C.6.
Under event , for any , and any policy
(194) |
Proof of Lemma C.6.
(195) | ||||
(196) | ||||
(197) | ||||
(198) | ||||
(199) | ||||
(200) | ||||
(201) |
∎
Under event we can upper bound the value function and Q function of the optimal policy by the estimated value function and Q function of policy :
Lemma C.7 (Optimism).
Under event , :
(202) |
Proof of Lemma C.7.
We prove this by induction on . Before that, note that, for any , if
(203) |
then because is chosen by maximizing , we know
(204) |
This means for any :
(205) |
We now begin our induction:
- •
-
•
Suppose for any , the statement holds for step , i.e.
(209) our goal is to show :
(210) (211) (212) (213) (214) (215) (216) (217) By the definition of the Q function, . We conclude these two statements by:
(218) By (205), .
∎
We are now ready to prove the regret decomposition lemma:
Lemma C.8.
Under good event :
(219) | ||||
(220) | ||||
(221) |
Proof of Lemma C.8.
We start by showing the decomposition of regret after step in one episode of a single agent: by Lemma C.6 and Lemma C.7, under event , for any
(222) | ||||
(223) | ||||
(224) | ||||
(225) | ||||
(226) | ||||
(227) | ||||
(228) | ||||
(229) | ||||
(230) | ||||
(231) | ||||
(232) | ||||
(233) |
This indeed gives a recursive formula: for any trajectory
(234) | ||||
(235) | ||||
(236) | ||||
(237) |
Then, we can show the regret decomposition in one episode of a single agent by recursion:
for any trajectory collected by a clean agent under policy :
(238)
(239)
(240)
(241)
(242)
(243)
(244)
(245)
Now we are ready to show the total regret decomposition. For each episode, we can decompose the regret w.r.t. any trajectory collected by a clean agent following policy . For convenience, we specialize the trajectories to be exactly the ones that are collected by the good agents and used to calculate the bonus terms. The reason is that, when we later bound the regret, we need to bound the cumulative bonus along the trajectory. By decomposing the regret w.r.t. the trajectories actually collected by the algorithm, it is automatically guaranteed that tuples visited frequently by the good agents carry lower bonuses: with more data collected, we can narrow the confidence interval and design smaller, yet still valid, bonus terms (see the sketch below).
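As a rough numerical illustration of the last point (this is not the algorithm's actual bonus; all names and constants are hypothetical), a generic count-based bonus shrinks as the visit count of a tuple grows:

import math

def generic_bonus(num_visits, horizon, delta):
    # Generic Hoeffding-style bonus ~ H * sqrt(log(1/delta) / N); the bonus in
    # the algorithm has a different, robustness-aware form.
    if num_visits == 0:
        return float("inf")
    return horizon * math.sqrt(math.log(1.0 / delta) / num_visits)

for n in (1, 10, 100, 1000):
    print(n, generic_bonus(n, horizon=10, delta=0.01))  # decreasing in n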
Because the MDP in our setting has a deterministic initial distribution, the good agents always have the same starting state:
(246)
(247)
(248)
∎
C.3 Evenness of clean agents
We need at least -agents to cover in order to learn the Bellman operator properly. In this section, we show that the agents have “even” coverage of the visited tuples in all but a relatively small number of the episodes. In the following, we use to denote the number of good agents.
Formally, we have:
Lemma C.9 (Even coverage of good agent).
For any , we define the following event:
(249)
Then, we have: for all ,
(250)
Remark C.10 (Intuition of the good event).
The event states that if, in any episode , a tuple gets enough coverage from the clean agents, then the coverage counts of the individual agents are very close to each other.
See proof of Lemma C.9 in Section C.3.1.
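A small simulation conveys the same intuition (purely illustrative; all parameters below are hypothetical): when the good agents follow the same policy, their per-agent visit counts of a fixed tuple concentrate around a common mean, so the max/min ratio across agents is close to one whenever the total count is large.

import random

def per_agent_counts(num_agents, episodes, visit_prob, seed=0):
    # Each good agent visits the fixed tuple independently with the same
    # probability in every episode (a stand-in for "following the same policy").
    rng = random.Random(seed)
    return [sum(rng.random() < visit_prob for _ in range(episodes))
            for _ in range(num_agents)]

counts = per_agent_counts(num_agents=20, episodes=500, visit_prob=0.3)
print(max(counts) / max(min(counts), 1))  # ratio close to 1 with large counts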
C.3.1 Proof of Lemma C.9
The proof of Lemma C.9 depends on the concentration of :
Lemma C.11 (Concentration of counts around empirical mean).
For all
(251)
(252)
Proof of Lemma C.11.
See Section C.3.2. ∎
Proof of Lemma C.9.
Let
(253)
For any , define events:
(254)
(255)
Recall:
(256)
Then we can rewrite event as:
(257)
We first show that if there are two ’s whose ratio exceeds , then there must be some that deviates a lot from the empirical mean of the ’s:
(258)
(259)
(260)
(261)
(262)
Next, we show that happens w.h.p.:
(263)
(264)
(265)
(266)
(267) (by (262))
(268)
(269)
(270)
(271)
(272)
∎
C.3.2 Proof of Lemma C.11
The high-level ideas are:
1. For each ,
  • for each , define centered as a martingale;
  • define centered as a martingale;
  •
2. apply a modified Bernstein-type martingale concentration bound to both the centered ’s and the centered (see Lemma C.12 and Lemma C.13);
3. because and have the same mean, we can use the triangle inequality to show that these two terms are close, and the distance is bounded by the variance term in the Bernstein inequality;
4. the Bernstein bound on also allows us to bound its variance in terms of itself;
5.
Lemma C.12 (Concentration of each ).
For all , with probability at least , for all :
(273)
Proof of Lemma C.12.
See Section C.3.3. ∎
Lemma C.13 (Concentration of each ).
For all , with probability at least , for all :
(274)
Proof of Lemma C.13.
See Section C.3.4. ∎
Proof of Lemma C.11.
Let be the intersection of the events in Lemma C.12 and Lemma C.13. Then, by Lemma C.12 and Lemma C.13, happens with probability at least . By (274),
(275)
(276)
(277)
(278)
(279)
(280)
∎
C.3.3 Proof of Lemma C.12
Proof of Lemma C.12.
For each fixed : for all , define
(281)
Let
(282)
(283)
Then is a martingale. Since depends on , which is calculated using data from the first episodes, we have . By Corollary E.3,
(284)
(285)
By union bound, with probability at least , for all :
(286)
∎
C.3.4 Proof of Lemma C.13
Proof of Lemma C.13.
During the data collection process, the agents are allowed to collect data simultaneously. For analysis purposes, we artificially order the data in the following sequence:
(287)
where . Let
(288)
Then forms a valid filtration. Define the following functions to map from sequence index to agent index and episode index:
(289)
For each fixed , for all , we define as the (centered) total count of collected by all good agents up to time . The -th term in (287) could fall in the middle of an episode, meaning some agents have not collected their trajectories yet, so we need to treat the agents differently. Let
(290)
(291)
Then is a martingale. Similar to Lemma C.12, define
(292)
(293)
Then by Corollary E.3,
(294)
(295)
(296)
(297)
(298)
(299)
By union bound, with probability at least , for all :
(300)
∎
Appendix D Proof of Theorem 6.5
By the following lemma, we can upper bound the suboptimality by the cumulative bonuses:
Lemma D.1.
Recall that for all ,
(302)
and is the -th largest among . is the -th largest of and is the -th largest of . The bonuses are given by:
• If
(303)
• If
(304)
(305)
where
(306)
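As a schematic sketch only (the threshold, constants, and count variables below are hypothetical placeholders, not the quantities in (303)–(306)), the case split above can be read as: fall back to the trivial bound when a tuple has too little clean coverage, and otherwise use a robust-estimation-style bonus that combines a corruption term with a statistical term.

import math

def schematic_bonus(count, horizon, alpha, delta, threshold):
    # Hypothetical case split mirroring the structure of Lemma D.1.
    if count < threshold:
        return horizon  # too little clean coverage: trivial bound on the value
    # enough coverage: corruption term (alpha) plus a shrinking statistical term
    return horizon * (alpha + math.sqrt(math.log(1.0 / delta) / count))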
Proof of Theorem 6.5.
We first show that with probability at least ,
(307)
where is defined in (301).
• If , by definition, . By definition of and , , thus (307) holds;
• If , for any fixed , , . Because is bounded and thus sub-Gaussian, we can use Theorem 3.2 to upper bound :
(308)
Thus
(309)
(310)
(311)
By union bound, with probability at least ,
(312)
Then, by Lemma D.1, with probability at least ,
(313)
(314)
(315)
(316)
By definition of in Definition 6.2,
(317)
(318)
(319)
(320)
By the definition of in Definition 6.4: for ,
(321)
(322)
(323)
and
(324)
(325)
(326)
Thus
(327)
(328)
(329)
(330)
Recall that . By the Cauchy–Schwarz inequality and the definition of in Definition 6.3,
(331)
(332)
(333)
(334)
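In placeholder notation, a Cauchy–Schwarz step of this kind has the generic shape
$$\sum_{i=1}^{n} \sqrt{\frac{1}{N_i}} \;=\; \sum_{i=1}^{n} 1 \cdot \sqrt{\frac{1}{N_i}} \;\le\; \sqrt{n}\,\sqrt{\sum_{i=1}^{n} \frac{1}{N_i}},$$
which is the kind of trade used here between the sum over tuples and the coverage-type quantity of Definition 6.3.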
In conclusion,
(335)
(336)
(337)
∎
Appendix E Useful inequalities
Theorem E.1 (Bernstein type of bound for martingale, Theorem 1.6 of [Freedman, 1975]).
Let be a probability triple. Let be an increasing sequence of sub--fields of . Let be random variables on , such that is measurable. Let . Assume and = 0. Let
(338)
(339)
where . Then, for any , ,
(340)
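For reference, a standard statement of Freedman's inequality reads as follows (the constants and exact form in (340) may differ slightly): if $\mathbb{E}[X_k \mid \mathcal{F}_{k-1}] = 0$ and $X_k \le M$ almost surely, then for any $\epsilon > 0$ and $\sigma^2 > 0$,
$$\Pr\left(\sum_{k=1}^{n} X_k \ge \epsilon \ \text{ and } \ \sum_{k=1}^{n} \mathbb{E}\big[X_k^2 \mid \mathcal{F}_{k-1}\big] \le \sigma^2\right) \;\le\; \exp\!\left(-\frac{\epsilon^2}{2\big(\sigma^2 + M\epsilon/3\big)}\right).$$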
By a union bound and a partition argument, we can obtain a more useful version of Theorem E.1.
We first present a result which shows that, given
(341)
we can bound up to some error.
Lemma E.2.
Let and be two sequences of random variables. We do not make any independence assumptions. Assume
• , almost surely;
• , , monotonically increasing.
If for all ,
(342)
Then for any ,
(343)
Proof.
See proof in Section E.1. ∎
Corollary E.3.
Under the assumptions of Theorem E.1, suppose terminates at . Then, for all ,
(344)
Proof of Corollary E.3.
Let . Then
(345)
By Theorem E.1, for all ,
(346)
In Lemma E.2, let:
• , ,
•
•
Because , . Then we get:
(347)
(348)
(349)
∎
E.1 Proof for Lemma E.2
Proof of Lemma E.2.
For a discrete random variable, we could simply condition on each possible value of and use a union bound. Here, because can be a continuous random variable, we instead divide the range of into intervals and upper bound the target probability by the law of total probability.
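In generic placeholder notation (random variables $A_n$, $B_n$, a monotonically increasing $f$, and intervals $I_j = [b_j, b_{j+1}]$ partitioning the range of $B_n$), the argument has the shape
$$\Pr\big(|A_n| \ge f(B_n)\big) \;=\; \sum_{j} \Pr\big(|A_n| \ge f(B_n),\, B_n \in I_j\big) \;\le\; \sum_{j} \Pr\big(|A_n| \ge f(b_j),\, B_n \le b_{j+1}\big),$$
and the mismatch between the endpoints $b_j$ and $b_{j+1}$ in each summand is what produces the extra error term in the conclusion (343).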
For all , let:
(350)
be a partition of the interval . Let be the set of intervals. Note that .
Then
(351)
(352)
(353)
(354)
(355)
Thus
(356)
(357)
∎