Byzantine-Robust Online and Offline Distributed Reinforcement Learning
Abstract
We consider a distributed reinforcement learning setting where multiple agents separately explore the environment and communicate their experiences through a central server. However, a fraction of the agents are adversarial and can report arbitrary fake information. Critically, these adversarial agents can collude, and their fake data can be of any size. We aim to robustly identify a near-optimal policy for the underlying Markov decision process in the presence of these adversarial agents. Our main technical contribution is Weighted-Clique, a novel algorithm for the problem of robust mean estimation from batches that can handle arbitrary batch sizes. Building upon this new estimator, in the offline setting we design a Byzantine-robust distributed pessimistic value iteration algorithm, and in the online setting we design a Byzantine-robust distributed optimistic value iteration algorithm. Both algorithms obtain near-optimal sample complexities and achieve stronger robustness guarantees than prior works.
1 Introduction
Distributed learning systems have been one of the main driving forces behind recent successes of deep learning [Verbraeken et al., 2020, Goyal et al., 2017, Abadi et al., 2016]. Advances in designing efficient distributed optimization algorithms [Horgan et al., 2018] and deep learning infrastructures [Espeholt et al., 2018] have enabled training powerful models with hundreds of billions of parameters [Brown et al., 2020]. However, with the outsourcing of computation and data collection, new challenges emerge. In particular, distributed systems have been found vulnerable to Byzantine failures [Lamport et al., 1982], meaning there could be faulty agents that send arbitrary information to the central server. Even a small number of Byzantine machines sending moderately perturbed values can lead to a significant loss in performance [Yin et al., 2018, Ma et al., 2019, Zhang et al., 2020a], which raises security concerns in real-world applications such as chatbots [Neff and Nagy, 2016] and autonomous vehicles [Eykholt et al., 2018, Ma et al., 2021]. In addition, other properties are desirable, such as protecting the data privacy of individual data contributors [Sakuma et al., 2008, Liu et al., 2019] and reducing communication cost [Dubey and Pentland, 2021]. These challenges require new algorithmic designs on the server side, which is the main focus of this paper.
When it comes to reinforcement learning (RL), distributed learning has been prevalent in many large-scale decision-making problems even before the deep learning era, such as cooperative learning in robotic systems [Ding et al., 2020], power grid optimization [Yu et al., 2014], and automatic traffic control [Bazzan, 2009]. Different from supervised learning, where the data distribution of interest is often fixed a priori, reinforcement learning requires active exploration on the agent's side to discover the optimal policy for the current task, thus creating new challenges in achieving the above desiderata while exploring an unknown environment.
This paper studies this precise problem:
Can we design a distributed RL algorithm that is sample efficient and robust to Byzantine agents, while incurring a small communication cost and preserving data privacy?
We study Byzantine-robust RL in both the online and offline reinforcement learning settings. In the online setting, the central server outsources the exploration task to the agents; the agents collect experiences and send them back to the server, and the server uses the data to update its policy. In the offline setting, a central server collects logged data from agents and uses the data to identify a good policy, without additional interaction with the environment. Importantly, a fraction of the agents are Byzantine, meaning they are allowed to send arbitrary data in both the online and offline settings. We summarize our contributions as follows:
-
1.
We design Weighted-Clique, a robust mean estimation algorithm for learning from batches. By utilizing the batch structure, the estimation error of our algorithm vanishes with more data. Compared to prior works [Qiao and Valiant, 2017, Chen et al., 2020, Jain and Orlitsky, 2021, Yin et al., 2018], our algorithm adapts to arbitrary batch sizes, which is desired in many applications of interest.
-
2.
We design Byzan-UCBVI, a Byzantine-robust variant of optimistic value iteration for online RL that calls Weighted-Clique as a subroutine. We show that Byzan-UCBVI achieves near-optimal regret even when a fraction of the agents are Byzantine. Meanwhile, Byzan-UCBVI enjoys logarithmic communication and switching costs [Bai et al., 2019, Zhang et al., 2020b, Gao et al., 2021] and preserves the data privacy of individual agents.
-
3.
We design Byzan-PEVI, a Byzantine-robust variant of pessimistic value iteration for offline RL, again utilizing Weighted-Clique as a subroutine. Despite the presence of Byzantine agents, we show that Byzan-PEVI can learn a near-optimal policy with a polynomial number of samples when certain good coverage properties are satisfied [Zhang et al., 2021a].
2 Related Work
Reinforcement learning:
Reinforcement learning studies the optimal strategy in a Markov Decision Process (MDP) [Sutton and Barto, 2018]. [Azar et al., 2017, Dann et al., 2017] show that UCB-style algorithms achieve minimax regret bounds in tabular MDPs. Recent works extend the theoretical understanding to RL with function approximation [Jin et al., 2020, Yang and Wang, 2020]. [Jin et al., 2021, Rashidinejad et al., 2021] use the pessimism strategy to efficiently learn a nearly optimal policy in the offline setting. Recently, [Bai et al., 2019, Zhang et al., 2020b, Gao et al., 2021] study low-switching-cost RL algorithms, where the learning agent has a small budget for policy changes.
Distributed reinforcement learning:
Parallel RL deploys large-scale models in distributed systems [Kretchmar, 2002]. [Horgan et al., 2018, Espeholt et al., 2018] provide distributed architectures for deep reinforcement learning by parallelizing the data-generating process. [Dubey and Pentland, 2021, Agarwal et al., 2021, Chen et al., 2021] provide the first sets of theoretical guarantees on performance and communication cost in parallel RL.
Robust statistics:
Robust statistics, which studies learning from corrupted datasets, has a long history [Huber, 1992, Tukey, 1960]. In modern machine learning, models are high dimensional, and recent works provide sample- and computationally-efficient algorithms for robust mean and covariance estimation in high dimensions [Diakonikolas and Kane, 2019, Lai et al., 2016]. These robust mean estimators have since been applied to robust supervised learning [Diakonikolas et al., 2019, Prasad et al., 2018] and RL [Zhang et al., 2021a, b].
Robust learning from batches:
Another line of work studies robust learning from batches [Qiao and Valiant, 2017, Chen et al., 2020, Jain and Orlitsky, 2021, Yin et al., 2018]: a collection of data is generated by multiple data sources, while a fraction of the sources are corrupted. By exploiting the batch structure of the data, these algorithms achieve significantly higher accuracy than in the non-batch setting [Diakonikolas and Kane, 2019]. To the best of our knowledge, all of these works study batches of equal size, an assumption that may not hold in many crowd-sourcing applications. Our paper is the first to generalize to the setting where the batch sizes vary.
Byzantine-robust distributed learning:
Byzantine-robust learning studies learning under Byzantine failures [Lamport et al., 1982]. [Chen et al., 2017] provide a Byzantine-robust gradient descent algorithm based on a geometric median-of-means estimate of the gradients. [Yin et al., 2018] provide robust distributed gradient descent algorithms with optimal statistical rates.
Corruption robust RL and Byzantine-robust RL:
There is a line of work studying adversarial attacks against reinforcement learning [Ma et al., 2019, Zhang et al., 2020a, Huang et al., 2017], and corruption-robust reinforcement learning in the online [Zhang et al., 2021b, Lykouris et al., 2021] and offline [Zhang et al., 2021a] settings. [Jadbabaie et al., 2022] study Byzantine-robust linear bandits in the federated setting. Unlike our setting, they allow different agents to be subject to Byzantine attacks in different episodes. Our algorithm enjoys a better regret bound and communication cost. [Fan et al., 2021] provide a Byzantine-robust policy gradient algorithm that is guaranteed to converge to an approximate stationary point, whereas we focus on the regret of the algorithm. [Dubey and Pentland, 2020] study Byzantine-robust multi-armed bandits, where the corruption can only come from a fixed distribution. We study the more difficult MDP setting and allow the corruption to be arbitrary.
3 Robust Mean Estimation from Untruthful Batches
We first present our novel algorithm, Weighted-Clique, for the robust mean estimation from batches problem, which we define below. Weighted-Clique will be the main workhorse later in our algorithms for both offline and online Byzantine-robust RL problems.
Definition 3.1 (Robust mean estimation from batches).
There are $m$ data providers indexed by $[m] = \{1, \dots, m\}$. Among these providers, we denote the set of indices of uncorrupted providers by $G$ and the set of indices of corrupted providers by $B$, where $G \cup B = [m]$, $G \cap B = \emptyset$, and $|B| \le \alpha m$ for a corruption level $\alpha$. Every uncorrupted provider has access to a sub-Gaussian distribution $\mathcal{D}$ with mean $\mu$ and variance proxy $\sigma^2$ (i.e. $\mathbb{E}_{x \sim \mathcal{D}}[x] = \mu$ and $\mathbb{E}_{x \sim \mathcal{D}}[e^{\lambda (x - \mu)}] \le e^{\lambda^2 \sigma^2 / 2}$ for all $\lambda \in \mathbb{R}$). For each $i \in [m]$, a data batch $\tilde{X}_i = \{x_{i,1}, \dots, x_{i, n_i}\}$ is sent to the learner, where the batch size $n_i$ can be arbitrary. For $i \in G$, the $x_{i,j}$'s are i.i.d. samples drawn from $\mathcal{D}$; for $i \in B$, $\tilde{X}_i$ can be arbitrary.
Definition 3.1 considers a robust learning problem from batches where we allow arbitrarily different batch sizes. In contrast, prior works [Qiao and Valiant, 2017, Chen et al., 2020, Jain and Orlitsky, 2021] have only studied the setting with (roughly) equal batch sizes, which is much more restrictive. For this problem, we propose the Weighted-Clique algorithm (Algorithm 1). Given the batch datasets, the parameter $\sigma$ of the sub-Gaussian distribution, the corruption level $\alpha$, and the confidence level $\delta$, Weighted-Clique first performs a clipping step (Line 4) that clips the sizes of the largest batches down to the size of the $(\lceil \alpha m \rceil + 1)$-th largest batch. This reduces the impact of corrupted batches on the weighted average in Line 7. Next, a set of confidence intervals for the true mean is constructed in Line 5 based on the data of each batch. In order to remove outliers, the algorithm finds the largest set of batches whose confidence intervals all intersect. This can be formulated as a maximum-clique problem on an interval graph, hence the name Weighted-Clique; the largest clique can be found efficiently by sorting and scanning the endpoints of all the intervals. The algorithm returns the weighted average of the empirical means over the maximum clique, where the weights are given by the clipped sample sizes.
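To make the steps above concrete, the following is a minimal Python sketch of the estimator, written under the notation of Definition 3.1. The function name, the exact clipping index, and the confidence-interval constants are illustrative assumptions rather than the paper's exact Algorithm 1.

    import numpy as np

    def weighted_clique(batches, sigma, alpha, delta):
        """Robust mean estimation from batches of arbitrary sizes (sketch)."""
        batches = [np.asarray(b, dtype=float) for b in batches if len(b) > 0]
        m = len(batches)
        if m == 0:
            raise ValueError("need at least one non-empty batch")
        sizes = np.array([len(b) for b in batches])
        k = int(np.ceil(alpha * m))          # number of possibly corrupted batches

        # Clipping step: cap the largest batch sizes at a threshold order statistic,
        # so that no single (possibly corrupted) batch dominates the weighted average.
        threshold = np.sort(sizes)[::-1][min(k, m - 1)]
        clipped = np.minimum(sizes, threshold)

        # Per-batch empirical means and sub-Gaussian confidence intervals.
        means = np.array([b.mean() for b in batches])
        widths = sigma * np.sqrt(2.0 * np.log(2.0 * m / delta) / sizes)
        lo, hi = means - widths, means + widths

        # Maximum clique of the interval-intersection graph: intervals pairwise
        # intersect iff they share a common point (Helly in one dimension), so scan
        # all endpoints and keep the point covered by the heaviest set of intervals
        # (weighting by clipped size; maximizing cardinality behaves similarly).
        best_members, best_weight = None, -1.0
        for p in np.concatenate([lo, hi]):
            members = (lo <= p) & (p <= hi)
            weight = clipped[members].sum()
            if weight > best_weight:
                best_weight, best_members = weight, members

        # Weighted average of the surviving empirical means, weights = clipped sizes.
        return float(np.dot(clipped[best_members], means[best_members]) / best_weight)

For instance, when a few batches report wildly shifted values, their intervals either fail to join the clique or enter it with only clipped weight, so the returned estimate stays close to the clean mean.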
Intuitively, by choosing the maximum clique in Line 6, Algorithm 1 finds a cluster of good data batches and drops extreme batches. We show that Algorithm 1 achieves the following guarantee.
Theorem 3.2.
Under Definition 3.1, if , , with probability at least , returned by Algorithm 1 satisfies:
(1) |
where and ’s are defined in Line 2 and Line 4 in Algorithm 1.
A number of immediate remarks are in order.
Remark 3.3.
Note that, compared to prior works [Qiao and Valiant, 2017, Chen et al., 2020, Jain and Orlitsky, 2021], we allow arbitrary batch sizes. Even if some agents report empty batches, as long as enough agents report non-zero batch sizes, our estimator has a well-behaved error bound. This means that the breakdown point (in terms of the fraction of bad agents) of our algorithm is optimal.
Remark 3.4 (Equal batch size case).
Remark 3.5 (Robust mean estimation v.s. robust mean estimation from batches).
In the classical robust mean estimation setting [Huber, 1992, Diakonikolas et al., 2017], the optimal error rate is determined by the total number of samples and the fraction of corrupted samples. In contrast, because the learner observes data-source IDs, i.e. the batch indices, the adversary is much more restricted. To see this, notice that the equal-batch setting can be viewed as robust mean estimation over the batch-wise empirical means. When the batch size becomes larger, each batch mean has a smaller variance, and thus the error of robust mean estimation shrinks accordingly, which matches the above rate (up to logarithmic factors).
Remark 3.6 (Impossibility result).
Our bound in (1) does not depend on the largest batch sizes. This means that even if some of the clean agents have infinitely many samples, the algorithm cannot achieve an arbitrarily small error. This might not look ideal at first glance, but we show that it is information-theoretically inevitable; interested readers are referred to Theorem A.1.
Remark 3.7 (Perturbation stability of the estimator and adaptation to the distributed setting).
When the good data batches are subject to point-wise perturbations of bounded magnitude, a variant of Algorithm 1 (Algorithm 4, Pert-Weighted-Clique; see Section A.2) suffers only an additional perturbation term in the error upper bound on top of (1). Moreover, Algorithm 1 does not need the exact datasets as input, but only the empirical mean and size of each data batch. As we will see later, this property is essential for achieving low communication cost and preserving data privacy.
4 Byzantine-Robust Learning in Parallel MDP
We study the problem of Byzantine-robust reinforcement learning in the parallel Markov Decision Process (MDP) setting with one central server and multiple agents, a fraction of which may suffer Byzantine failures. We postpone the precise interaction protocols between the server and the agents to Section 5 and Section 6.
In both the online and offline settings, we consider a finite-horizon episodic tabular Markov Decision Process (MDP) $M = (\mathcal{S}, \mathcal{A}, P, R, H, \mu_1)$, where $\mathcal{S}$ is the finite state space with $|\mathcal{S}| = S$; $\mathcal{A}$ is the finite action space with $|\mathcal{A}| = A$; $P = \{P_h\}_{h=1}^{H}$ is the sequence of transition probability matrices, i.e. $P_h(\cdot \mid s, a)$ specifies the state distribution at step $h+1$ if action $a$ is taken from state $s$ at step $h$; $R = \{R_h\}_{h=1}^{H}$ is the sequence of bounded stochastic reward functions, i.e. $R_h(s, a)$ is the stochastic reward, bounded in $[0, 1]$, associated with taking action $a$ in state $s$ at step $h$; $H$ is the time horizon; and $\mu_1$ is the initial state distribution. For simplicity, we assume $\mu_1$ is deterministic and puts all probability mass on a single state $s_1$.
Within each episode, the MDP starts at state $s_1$. At each step $h$, the agent observes the current state $s_h$, takes an action $a_h$, and receives a stochastic reward $R_h(s_h, a_h)$. After that, the MDP transitions to the next state $s_{h+1}$, drawn from $P_h(\cdot \mid s_h, a_h)$. The episode terminates after the agent takes its action in state $s_H$ and receives the reward at step $H$.
A policy $\pi = \{\pi_h\}_{h=1}^{H}$ is a sequence of functions, each mapping from the state space to the action space. The value function $V_h^{\pi}$ is the expected sum of future rewards obtained by acting according to policy $\pi$, i.e. $V_h^{\pi}(s) = \mathbb{E}\big[\sum_{h'=h}^{H} R_{h'}(s_{h'}, \pi_{h'}(s_{h'})) \mid s_h = s\big]$, where the expectation is w.r.t. the stochasticity of the state transitions and rewards in the MDP. Similarly, we define the state-action value function $Q_h^{\pi}(s, a) = \mathbb{E}\big[R_h(s, a) + \sum_{h'=h+1}^{H} R_{h'}(s_{h'}, \pi_{h'}(s_{h'})) \mid s_h = s, a_h = a\big]$. Let $\pi^{*}$ be an optimal policy, and let $V_h^{*} := V_h^{\pi^{*}}$, $Q_h^{*} := Q_h^{\pi^{*}}$, and $V_{H+1}^{\pi} := 0$ for every policy $\pi$.
For any function $f : \mathcal{S} \to \mathbb{R}$, we define the Bellman operator $\mathbb{T}_h$ by $(\mathbb{T}_h f)(s, a) := \mathbb{E}[R_h(s, a)] + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}[f(s')]$. Then the Bellman equation is given by:
$Q_h^{\pi}(s, a) = (\mathbb{T}_h V_{h+1}^{\pi})(s, a), \qquad V_h^{\pi}(s) = Q_h^{\pi}(s, \pi_h(s)).$   (2)
The Bellman optimality equation is given by:
$Q_h^{*}(s, a) = (\mathbb{T}_h V_{h+1}^{*})(s, a), \qquad V_h^{*}(s) = \max_{a \in \mathcal{A}} Q_h^{*}(s, a).$   (3)
We denote by $d_h^{\pi}$ the state distribution at step $h$ induced by following policy $\pi$, and define the state trajectory distribution of $\pi$ accordingly. The goal is to find a policy that maximizes the expected cumulative reward, i.e. a policy $\pi$ maximizing $V_1^{\pi}(s_1)$. To measure the performance of our RL algorithms, we use the suboptimality as the performance metric in the offline setting and the regret as the performance metric in the online setting. We formalize these two measures in their corresponding sections below.
5 Byzantine-Robust Online RL
In the online setting, a central server and the agents aim to collaboratively minimize their total regret. The agents and the server collaborate by following a communication protocol that decides when to synchronize and what information to communicate. Unlike the standard distributed RL setting, we assume that a fraction of the agents are Byzantine:
Definition 5.1 (Distributed online RL with Byzantine corruption).
The agents consist of two types:
-
•
good agents, denoted by $\mathcal{G}$: each good agent interacts with its own copy of the MDP $M$ and communicates its observations to the server following the interaction protocol;
-
•
bad agents, denoted by $\mathcal{B}$: the bad agents are allowed to send arbitrary observations to the server at the end of each episode.
Because the server has no control over the bad agents, we only seek to minimize the error incurred by the good agents. Formally, we use the following regret as the performance measure for the online RL algorithm:
$\mathrm{Regret}(K) = \sum_{k=1}^{K} \sum_{i \in \mathcal{G}} \big( V_1^{*}(s_1) - V_1^{\pi_k^i}(s_1) \big),$   (4)
where $\pi_k^i$ is the policy used by agent $i$ in episode $k$. At the same time, because of the distributed nature of our problem, we want to synchronize between the server and the agents only when necessary, in order to reduce the communication cost.
Based on these considerations, we propose the Byzan-UCBVI algorithm (Algorithm 2). We highlight the following key features of Byzan-UCBVI:
-
1.
Low-switching-cost algorithm design: the server checks the synchronization criterion in Line 6 when receiving requests from agents. Each good agent requests synchronization if and only if any of its own counts doubles (Line 21). Importantly, our agents do not need to know other agents' counts to decide whether synchronization is necessary. This design choice reduces the number of policy switches, the number of synchronization rounds, and the communication cost from linear to logarithmic in the number of episodes. Our number of communication steps is much lower than that of [Jadbabaie et al., 2022]. Unlike [Dubey and Pentland, 2021], our agents do not need to know other agents' transition counts in order to decide whether to synchronize.
-
2.
Homogeneous policy execution: in every episode, our algorithm ensures that all good agents run the same policy. This ensures that the robust mean estimation achieves the smallest estimation error; recall that samples in overly large batches are wasted when the batch sizes are severely imbalanced (cf. Section 3).
-
3.
Robust UCBVI updates: during synchronization, the central server performs a policy update using a variant of the UCBVI algorithm [Azar et al., 2017]: for each step $h = H, \dots, 1$, compute:
(5) (6) We replace the empirical mean estimation with our Pert-Weighted-Clique (PWC) algorithm (Algorithm 4) and design a new confidence bonus accordingly. Instead of estimating the transition matrix and the reward function, we directly estimate the Bellman operator applied to an estimated value function. The server gathers the required sufficient statistics from the agents in Line 13; a sketch of this update appears right after this list.
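To make the update concrete, the following is a minimal sketch of the server-side backup. The interface is hypothetical: agent_stats(h, s, a, V_next) is assumed to return one (empirical mean of the Bellman backup, visit count) pair per agent, and pwc_estimate plays the role of Pert-Weighted-Clique, returning a robust estimate together with the error bound that serves as the optimism bonus; names and signatures are illustrative, not the paper's code.

    import numpy as np

    def robust_optimistic_backup(agent_stats, pwc_estimate, H, S, A, sigma, alpha, delta):
        """Backward value iteration with a robust Bellman-operator estimate and an
        optimism bonus (sketch)."""
        V = np.zeros((H + 2, S))                 # V[H + 1] = 0 by convention
        Q = np.zeros((H + 1, S, A))
        pi = np.zeros((H + 1, S), dtype=int)
        for h in range(H, 0, -1):                # backward induction over steps
            for s in range(S):
                for a in range(A):
                    summaries = agent_stats(h, s, a, V[h + 1])
                    est, bonus = pwc_estimate(summaries, sigma, alpha, delta)
                    # Optimistic estimate, truncated at the maximum remaining return.
                    Q[h, s, a] = min(est + bonus, H - h + 1)
                pi[h, s] = int(np.argmax(Q[h, s]))
                V[h, s] = Q[h, s, pi[h, s]]
        return Q, V, pi

The offline algorithm of Section 6 reuses the same structure, but subtracts the bonus instead of adding it.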
We are now ready to present the following regret bound for Byzan-UCBVI.
Theorem 5.2 (Regret bound).
Under Definition 5.1, if , for all , with probability at least , the total regret of Algorithm 2 is at most
(7) |
Remark 5.3 (Understanding the regret bound).
In Algorithm 2, the good agents all use the same policy, and thus $\pi_k^i = \pi_k$ for every good agent $i$, where $\pi_k$ is the policy computed by the server for the $k$-th episode. By utilizing the batch structure, Algorithm 2 achieves a regret sublinear in the number of episodes, even under Byzantine attacks. Our regret bound also improves upon the one in [Jadbabaie et al., 2022]. Under mild conditions, the dominating term is optimal even in the clean setting [Azar et al., 2017].
Remark 5.4 (Communication cost).
Because each agent runs $K$ episodes in total, each of its counts doubles at most $O(\log K)$ times during training; thus each good agent sends at most $O(HSA \log K)$ synchronization requests. The bad agents can only trigger a logarithmic number of effective requests because of the checking step in Line 6. As a result, the total number of synchronization episodes is at most logarithmic in $K$. The communication inside one synchronization episode is as follows: at least one agent sends a sync request; inside the value iteration, the server sends the estimated value functions at all steps to each agent; and each good agent sends the estimated Bellman operator for every state-action pair at every step, together with its counts, to the server. Importantly, the agents only need to send summary statistics, instead of the raw dataset, to the server, which preserves the data privacy of individual agents [Sakuma et al., 2008, Liu et al., 2019].
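As an illustration of the last point, the content of a good agent's message in one synchronization round could look like the following; the class and field names are hypothetical and only meant to show that raw trajectories never leave the agent.

    from dataclasses import dataclass
    from typing import Dict, Tuple

    @dataclass
    class SyncMessage:
        """Hypothetical per-synchronization message from a good agent: only summary
        statistics are sent -- one (empirical Bellman backup, visit count) pair per
        (h, s, a) tuple -- never raw trajectories."""
        agent_id: int
        episode: int
        # key: (h, s, a); value: (mean of r + V_{h+1}(s') over the agent's visits, count)
        bellman_stats: Dict[Tuple[int, int, int], Tuple[float, int]]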
Remark 5.5 (Switching cost).
Switching cost measures the number of policy changes; algorithms with low switching cost are favorable in real-world applications [Bai et al., 2019, Zhang et al., 2020b, Gao et al., 2021]. Since Algorithm 2 only updates its policy in synchronization episodes, its switching cost is also at most logarithmic in $K$.
6 Byzantine-Robust Offline RL
In the offline setting, we assume the server has access to a set of data batches, some of which are corrupted. The goal of the server is to find a nearly optimal policy without further interaction with the environment. Specifically:
Definition 6.1 (Distributed offline RL with Byzantine corruption).
The server has access to an offline dataset consisting of data batches, including good batches and bad batches, where the bad batches form a bounded fraction of the total. We make an assumption on the data-generating process similar to [Wang et al., 2020]. Precisely, for every good batch, the state-action pairs are drawn from an unknown distribution, and the associated rewards and next states are instantiations of the MDP's reward and transition at those pairs. For any bad batch, the data can be arbitrary.
The performance is measured by the suboptimality w.r.t. a deterministic comparator policy $\tilde{\pi}$ (not necessarily an optimal policy):
$\mathrm{SubOpt}(\widehat{\pi}; \tilde{\pi}) = V_1^{\tilde{\pi}}(s_1) - V_1^{\widehat{\pi}}(s_1).$   (8)
In the offline setting, the server cannot interact with the MDP, so our result relies heavily on the quality of the dataset. As we will see in the analysis, the suboptimality gap (8) can be upper bounded by the estimation error of the Bellman operator along the trajectories of $\tilde{\pi}$. As a result, we do not need full coverage of the whole state-action space; instead, we only need the offline dataset to properly cover $d_h^{\tilde{\pi}}$, the state distribution of the comparator policy at each step $h$. To characterize the data coverage, we define, for each agent, the counts of $(h, s, a)$ tuples by:
(9) |
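Written out explicitly under the tabular notation assumed in Section 4, such a count takes the form below; the trajectory index and symbols are illustrative and may differ from those in (9):
\[
n_h^i(s, a) \;=\; \sum_{k} \mathbb{1}\big\{ (s_h^{i,k}, a_h^{i,k}) = (s, a) \big\},
\]
i.e. the number of times agent $i$'s logged batch visits the pair $(s, a)$ at step $h$.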
When calling Algorithm 1, the largest data batches might be clipped in Line 4. By definition, the clipping threshold is bounded between two order statistics of the batch counts. We define three quantities to characterize the quality of the offline dataset. The first quantity describes the probability mass of trajectories that are not properly covered by the offline dataset:
Definition 6.2 (Measure of insufficient coverage).
We define the following quantity as the probability of visiting an $(h, s, a)$ tuple that is insufficiently covered by the logged data, namely
(10) |
Recall that Algorithm 1 requires a minimum number of non-empty data batches to make an informed decision. The quantity above measures an upper bound on the total probability, under $\tilde{\pi}$, of encountering an $(h, s, a)$ tuple on which Weighted-Clique cannot return a good mean estimate.
We now introduce the density ratio between $d^{\tilde{\pi}}$ and the empirical distribution of the uncorrupted offline dataset. It quantifies the portion of useful data in the whole dataset and is commonly used in the offline RL literature [Rashidinejad et al., 2021, Zhang et al., 2021a]. We focus only on the tuples that are excluded from the insufficiently covered set of Definition 6.2:
Definition 6.3 (density ratio).
We use the following set to denote the states (in the support of the comparator's state distribution) that have proper coverage from the clean agents:
(11) |
We use the following quantity to denote the density ratio between the state distribution of the comparator policy and the empirical distribution over the uncorrupted offline dataset:
(12) |
As we can see from Theorem 3.2, the accuracy of Algorithm 1 depends heavily on the evenness of the batches. Even if some good batches contain a large amount of data, the extra data are not useful (cf. Remark 3.6). We define the following quantity to measure the information loss in the clipping step (Line 4 in Algorithm 1):
Definition 6.4 (Unevenness of good agents coverage).
(13) |
where
Intuitively, this quantity describes the evenness of the good agents' coverage. It measures both how much data in large batches is cut off by the clipping step and how uneven the batches remain after clipping. We use the two order statistics above, instead of the true clipping threshold, so that the quantity serves as an upper bound on the actual unevenness incurred by running the algorithm. For example, if all good batches have the same size, the quantity is minimal; if one good batch is much larger than the others, the quantity grows, meaning it increases as the batches become less even.
Remarkably, all three quantities defined above depend only on the counts of the good data batches.
Given the above setup, we now present our second algorithm, Byzan-PEVI, a Byzantine-robust variant of pessimistic value iteration [Jin et al., 2021]. Similar to the online setting, we use our Weighted-Clique algorithm (without perturbation) to approximate the Bellman operator and use its estimation error to design a pessimistic bonus for value iteration. Byzan-PEVI (Algorithm 3) runs pessimistic value iteration ((14)-(15)) and calls Weighted-Clique as a subroutine to robustly estimate the Bellman operator from the offline dataset:
(14) | |||
(15) |
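For intuition, a standard form of such a pessimistic backup, written in the notation assumed in Section 4 with $b_h$ denoting the bonus built from Weighted-Clique's error bound (the exact truncation and bonus of (14)-(15) are specified in Algorithm 3), is:
\[
\widehat{Q}_h(s, a) = \max\Big\{ \big(\widehat{\mathbb{T}}_h \widehat{V}_{h+1}\big)(s, a) - b_h(s, a),\; 0 \Big\}, \qquad
\widehat{V}_h(s) = \max_{a \in \mathcal{A}} \widehat{Q}_h(s, a), \qquad
\widehat{\pi}_h(s) = \operatorname*{arg\,max}_{a \in \mathcal{A}} \widehat{Q}_h(s, a),
\]
with $\widehat{V}_{H+1} \equiv 0$ and $\widehat{\mathbb{T}}_h$ the robust estimate of the Bellman operator returned by Weighted-Clique.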
Theorem 6.5.
Given any deterministic comparator policy $\tilde{\pi}$, under Definitions 6.1-6.4, with probability at least $1 - \delta$, Algorithm 3 outputs a policy $\widehat{\pi}$ such that:
(16) |
Remark 6.6.
Compared to [Zhang et al., 2021a], our bound has no non-diminishing term, meaning the suboptimality gap vanishes as the good agents collect more data. To the best of our knowledge, this is the first result for Byzantine-robust offline RL.
Remark 6.7 (Offline vs. online RL).
Our offline RL results are more involved and notation-heavy due to the nature of the problem. In the offline setting, the learner has no control over the data-generating process, and each data source can be arbitrarily different. The learner can only passively rely on the robust mean estimator we designed, together with the pessimism principle, to learn as well as the data permits. In contrast, in the online setting, the learner has complete control over the clean agents' data collection process. Our algorithm Byzan-UCBVI lets the server exploit this control and obtain a tighter and cleaner sample complexity guarantee.
7 Conclusion
To summarize, in this work we first presented Weighted-Clique, a robust mean estimation algorithm for learning from uneven batches, which may be of independent interest. Building upon Weighted-Clique, we proposed a Byzantine-robust online RL algorithm (Byzan-UCBVI) and the first Byzantine-robust offline RL algorithm (Byzan-PEVI) in the distributed setting. Several questions remain open: (1) Can we provide a complete characterization of the information-theoretic lower bound for robust mean estimation from uneven batches? (2) Can we extend our RL algorithms to the function approximation setting?
References
- Abadi et al. [2016] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
- Agarwal et al. [2021] Mridul Agarwal, Bhargav Ganguly, and Vaneet Aggarwal. Communication efficient parallel reinforcement learning. In Uncertainty in Artificial Intelligence, pages 247–256. PMLR, 2021.
- Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pages 263–272. PMLR, 2017.
- Bai et al. [2019] Yu Bai, Tengyang Xie, Nan Jiang, and Yu-Xiang Wang. Provably efficient q-learning with low switching cost. Advances in Neural Information Processing Systems, 32, 2019.
- Bazzan [2009] Ana LC Bazzan. Opportunities for multiagent systems and multiagent reinforcement learning in traffic control. Autonomous Agents and Multi-Agent Systems, 18(3):342–375, 2009.
- Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chan et al. [2014] Siu-On Chan, Ilias Diakonikolas, Paul Valiant, and Gregory Valiant. Optimal algorithms for testing closeness of discrete distributions. In Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pages 1193–1203. SIAM, 2014.
- Chen et al. [2020] Sitan Chen, Jerry Li, and Ankur Moitra. Efficiently learning structured distributions from untrusted batches. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 960–973, 2020.
- Chen et al. [2021] Tianyi Chen, Kaiqing Zhang, Georgios B Giannakis, and Tamer Basar. Communication-efficient policy gradient methods for distributed reinforcement learning. IEEE Transactions on Control of Network Systems, 2021.
- Chen et al. [2017] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 1(2):1–25, 2017.
- Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. Advances in Neural Information Processing Systems, 30, 2017.
- Diakonikolas and Kane [2019] Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional robust statistics. arXiv preprint arXiv:1911.05911, 2019.
- Diakonikolas et al. [2017] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair Stewart. Being robust (in high dimensions) can be practical. In International Conference on Machine Learning, pages 999–1008. PMLR, 2017.
- Diakonikolas et al. [2019] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Conference on Machine Learning, pages 1596–1606. PMLR, 2019.
- Ding et al. [2020] Guohui Ding, Joewie J Koh, Kelly Merckaert, Bram Vanderborght, Marco M Nicotra, Christoffer Heckman, Alessandro Roncone, and Lijun Chen. Distributed reinforcement learning for cooperative multi-robot object manipulation. arXiv preprint arXiv:2003.09540, 2020.
- Dubey and Pentland [2020] Abhimanyu Dubey and Alex Pentland. Private and byzantine-proof cooperative decision-making. In AAMAS, pages 357–365, 2020.
- Dubey and Pentland [2021] Abhimanyu Dubey and Alex Pentland. Provably efficient cooperative multi-agent reinforcement learning with function approximation. arXiv preprint arXiv:2103.04972, 2021.
- Espeholt et al. [2018] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pages 1407–1416. PMLR, 2018.
- Eykholt et al. [2018] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1625–1634, 2018.
- Fan et al. [2021] Xiaofeng Fan, Yining Ma, Zhongxiang Dai, Wei Jing, Cheston Tan, and Bryan Kian Hsiang Low. Fault-tolerant federated reinforcement learning with theoretical guarantee. Advances in Neural Information Processing Systems, 34, 2021.
- Freedman [1975] David A Freedman. On tail probabilities for martingales. the Annals of Probability, pages 100–118, 1975.
- Gao et al. [2021] Minbo Gao, Tianle Xie, Simon S Du, and Lin F Yang. A provably efficient algorithm for linear markov decision process with low switching cost. arXiv preprint arXiv:2101.00494, 2021.
- Goyal et al. [2017] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
- Horgan et al. [2018] Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado Van Hasselt, and David Silver. Distributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
- Huang et al. [2017] Sandy Huang, Nicolas Papernot, Ian Goodfellow, Yan Duan, and Pieter Abbeel. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284, 2017.
- Huber [1992] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. Springer, 1992.
- Jadbabaie et al. [2022] Ali Jadbabaie, Haochuan Li, Jian Qian, and Yi Tian. Byzantine-robust federated linear bandits. arXiv preprint arXiv:2204.01155, 2022.
- Jain and Orlitsky [2021] Ayush Jain and Alon Orlitsky. Robust density estimation from batches: The best things in life are (nearly) free. In International Conference on Machine Learning, pages 4698–4708. PMLR, 2021.
- Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? Advances in neural information processing systems, 31, 2018.
- Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- Jin et al. [2021] Ying Jin, Zhuoran Yang, and Zhaoran Wang. Is pessimism provably efficient for offline rl? In International Conference on Machine Learning, pages 5084–5096. PMLR, 2021.
- Kretchmar [2002] R Matthew Kretchmar. Parallel reinforcement learning. In The 6th World Conference on Systemics, Cybernetics, and Informatics. Citeseer, 2002.
- Lai et al. [2016] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance. In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674. IEEE, 2016.
- Lamport et al. [1982] Leslie Lamport, Robert Shostak, and Marshall Pease. The byzantine generals problem. ACM Transactions on Programming Languages and Systems, 4(3):382–401, 1982.
- Liu et al. [2019] Ximeng Liu, Robert H Deng, Kim-Kwang Raymond Choo, and Yang Yang. Privacy-preserving reinforcement learning design for patient-centric dynamic treatment regimes. IEEE Transactions on Emerging Topics in Computing, 9(1):456–470, 2019.
- Lykouris et al. [2021] Thodoris Lykouris, Max Simchowitz, Alex Slivkins, and Wen Sun. Corruption-robust exploration in episodic reinforcement learning. In Conference on Learning Theory, pages 3242–3245. PMLR, 2021.
- Ma et al. [2019] Yuzhe Ma, Xuezhou Zhang, Wen Sun, and Jerry Zhu. Policy poisoning in batch reinforcement learning and control. Advances in Neural Information Processing Systems, 32, 2019.
- Ma et al. [2021] Yuzhe Ma, J Sharp, Ruizhe Wang, Earlence Fernandes, and Xiaojin Zhu. Adversarial attacks on kalman filter-based forward collision warning systems. In The Thirty-Fifth AAAI Conference on Artificial Intelligence, 2021.
- Neff and Nagy [2016] Gina Neff and Peter Nagy. Automation, algorithms, and politics| talking to bots: Symbiotic agency and the case of tay. International Journal of Communication, 10:17, 2016.
- Paninski [2008] Liam Paninski. A coincidence-based test for uniformity given very sparsely sampled discrete data. IEEE Transactions on Information Theory, 54(10):4750–4755, 2008.
- Prasad et al. [2018] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.
- Qiao and Valiant [2017] Mingda Qiao and Gregory Valiant. Learning discrete distributions from untrusted batches. arXiv preprint arXiv:1711.08113, 2017.
- Rashidinejad et al. [2021] Paria Rashidinejad, Banghua Zhu, Cong Ma, Jiantao Jiao, and Stuart Russell. Bridging offline reinforcement learning and imitation learning: A tale of pessimism. Advances in Neural Information Processing Systems, 34, 2021.
- Sakuma et al. [2008] Jun Sakuma, Shigenobu Kobayashi, and Rebecca N Wright. Privacy-preserving reinforcement learning. In Proceedings of the 25th international conference on Machine learning, pages 864–871, 2008.
- Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Tukey [1960] John W Tukey. A survey of sampling from contaminated distributions. Contributions to probability and statistics, pages 448–485, 1960.
- Verbraeken et al. [2020] Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S Rellermeyer. A survey on distributed machine learning. ACM Computing Surveys (CSUR), 53(2):1–33, 2020.
- Wang et al. [2020] Ruosong Wang, Dean P Foster, and Sham M Kakade. What are the statistical limits of offline rl with linear function approximation? arXiv preprint arXiv:2010.11895, 2020.
- Yang and Wang [2020] Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
- Yin et al. [2018] Dong Yin, Yudong Chen, Ramchandran Kannan, and Peter Bartlett. Byzantine-robust distributed learning: Towards optimal statistical rates. In International Conference on Machine Learning, pages 5650–5659. PMLR, 2018.
- Yu et al. [2014] Tao Yu, HZ Wang, Bin Zhou, Ka Wing Chan, and J Tang. Multi-agent correlated equilibrium Q(λ) learning for coordinated smart generation control of interconnected power grids. IEEE Transactions on Power Systems, 30(4):1669–1679, 2014.
- Zhang et al. [2020a] Xuezhou Zhang, Yuzhe Ma, Adish Singla, and Xiaojin Zhu. Adaptive reward-poisoning attacks against reinforcement learning. In International Conference on Machine Learning, pages 11225–11234. PMLR, 2020a.
- Zhang et al. [2021a] Xuezhou Zhang, Yiding Chen, Jerry Zhu, and Wen Sun. Corruption-robust offline reinforcement learning. arXiv preprint arXiv:2106.06630, 2021a.
- Zhang et al. [2021b] Xuezhou Zhang, Yiding Chen, Xiaojin Zhu, and Wen Sun. Robust policy gradient against strong data corruption. In International Conference on Machine Learning, pages 12391–12401. PMLR, 2021b.
- Zhang et al. [2020b] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33:15198–15207, 2020b.
Appendix A More discussion on Algorithm 1:Weighted-Clique
A.1 Impossibility result
Theorem A.1 (impossibility result).
There exists a distribution , s.t. given data batches generated under Definition 3.1, every robust mean estimation algorithm suffers an error at least
(17) |
even if it knows that some of the batches are clean, where is the sum of the sizes of the smallest good batches.
Proof of Theorem A.1.
Let be a Bernoulli distribution with parameter . W.l.o.g., assume , and . We assume the algorithm knows that is a subset of the good batches.
Let . Let the bad batches be i.i.d. samples from , a Bernoulli distribution with parameter . By Theorem 4 of [Paninski, 2008, Chan et al., 2014], no algorithm can distinguish whether the batches are sampled from or . I.e. no algorithm can distinguish whether are the good batches or are the good batches. This means that, given the data batches , every robust mean estimation algorithm suffers an error of at least . ∎
A.2 Adaption to good batch perturbation and distributed learning
Compared to Algorithm 1, Algorithm 4 enlarges the confidence interval on both endpoints to account for the perturbation, and only requires sufficient statistics from the batches instead of the whole dataset. When there are enough non-empty batches, Algorithm 4 runs a modified Weighted-Clique algorithm to compute the mean estimate and the error upper bound (with an additional term as an adjustment for the covering argument in the proof of Theorem 5.2). Otherwise, Algorithm 4 returns a default value and a trivial error upper bound.
Appendix B Proof of Theorem 3.2
To prove Theorem 3.2, we show that (1) holds under a certain concentration event, and that this event happens with high probability. We consider a slightly more general setting where there could be perturbations even to the good batches:
Definition B.1 (Robust mean estimation from batches).
There are $m$ data providers indexed by $[m] = \{1, \dots, m\}$. Among these providers, we denote the set of indices of uncorrupted providers by $G$ and the set of indices of corrupted providers by $B$, where $G \cup B = [m]$, $G \cap B = \emptyset$, and $|B| \le \alpha m$. Every uncorrupted provider has access to perturbed samples from a sub-Gaussian distribution $\mathcal{D}$ with mean $\mu$ and variance proxy $\sigma^2$. For each $i \in G$, a data batch $X_i = \{x_{i,1}, \dots, x_{i,n_i}\}$ is drawn i.i.d. from $\mathcal{D}$, while a perturbed version $\tilde{X}_i = \{\tilde{x}_{i,1}, \dots, \tilde{x}_{i,n_i}\}$ is sent to the learner, where the batch size $n_i$ can be arbitrary and $|\tilde{x}_{i,j} - x_{i,j}| \le \epsilon_p$ for some perturbation level $\epsilon_p$. For $i \in B$, $\tilde{X}_i$ can be arbitrary.
One can easily recover Definition 3.1 by letting the perturbation level $\epsilon_p$ be zero. We first define the concentration event as follows:
Definition B.2 (Concentration event).
For all , define the event that the empirical mean of clean batches is close to the population mean as:
(18) |
Define the event that the weighted average of empirical means of clean batches is close to the population mean as:
(19) |
Let be the event that the events above happen together:
(20) |
We can show that this event happens with high probability using Hoeffding's inequality; the generic sub-Gaussian form of the bound is recalled below.
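For reference, the standard tail bound for averages of sub-Gaussian samples (a generic statement in the notation of Definition B.1; the paper's display may use different constants): if $x_1, \dots, x_n$ are i.i.d. sub-Gaussian with mean $\mu$ and variance proxy $\sigma^2$, then for any $t > 0$,
\[
\Pr\left( \Big| \frac{1}{n} \sum_{j=1}^{n} x_j - \mu \Big| \ge t \right) \le 2 \exp\left( - \frac{n t^2}{2 \sigma^2} \right),
\]
so each clean batch mean concentrates within $\sigma \sqrt{2 \log(2/\delta') / n}$ of $\mu$ with probability at least $1 - \delta'$.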
Lemma B.3.
.
Proof.
See proof in Section B.1. ∎
Under event , we can give an upper bound on the estimation error:
Lemma B.4.
Under event , if , Algorithm 4 outputs a with
(21) |
Proof.
See proof in Section B.2. ∎
Proof of Theorem 3.2.
B.1 Proof of Lemma B.3
To prove Lemma B.3,
-
1.
we first show that the perturbation changes the empirical mean of batches by at most ;
-
2.
we can show the concentration bound of empirical means and weighted means for the unperturbed samples;
-
3.
we conclude by combining the two results above with the triangle inequality.
the probability of event :
For all , let be the empirical mean of unperturbed samples in batch :
(23) |
By the triangle inequality:
(24) |
Since is a sub-Gaussian distribution, we can show the concentration of the unperturbed samples : for every good batch ,
(25) |
By union bound, with probability at least , ,
(26) |
By the triangle inequality, with probability at least , ,
(27) |
I.e. .
the probability of event :
We first show that the weighted average of the empirical means of the unperturbed samples, i.e. , is a sub-Gaussian random variable. Firstly, note that the mean of the weighted average is , i.e. . By definition, we know that for each good batch , the are i.i.d. sub-Gaussian random variables with mean and variance proxy , i.e.
(28) |
Since : for all ,
(29) | ||||
(30) | ||||
(31) |
This means is a sub-Gaussian random variable with variance proxy . Thus ,
(32) |
Thus with probability at least :
(33) |
This means:
(34) | ||||
(35) | ||||
(36) |
I.e. .
By union bound .
B.2 Proof of Lemma B.4
By Lemma B.3, we know that the weighted average of the empirical means of the good batches is a proper estimate of the population mean. Compared to , the returned in Line 6 of Algorithm 4 may remove some good batches and include some bad batches. Even so, as long as we can show that:
-
1.
Line 6 does not remove too many good batches and does not include too many bad batches;
-
2.
the bad batches included in will not be significant
then we can show that the returned in Line 7 is a reasonable estimate of .
The structure of :
is the largest subset of batches whose confidence intervals intersect. The confidence intervals of all the good batches intersect under event , thus it must be at least as large as , and thus it is not possible to remove too many good batches. Furthermore, we can also show that we do not lose too much information, meaning we do not significantly reduce the total number of samples, and thus later on we can show that the statistical rate is not affected by too much. We make these ideas precise below.
Because maximizes
(38) |
we know . Furthermore, can include at most batches, this means includes at least good batches. Formally:
(39) |
Now we show that does not lose too much information, i.e. . By definition of , there are at least batches in such that . Because removes at most batches, there are at least batches in such that . I.e.
(40) | ||||
(41) | ||||
(42) |
This means the information loss can be bounded by , formally:
(43) | ||||
(44) |
Thus we have:
(45) |
Bad batches in :
In order for a bad batch to survive in , its confidence interval must intersect each good batch's confidence interval in . In particular, it must intersect with the good batch in with the largest . By definition, there are at least good batches with . Because excludes at most good batches, there is at least one good batch (denoted by ), s.t. .
Thus , . This means there exists some point , s.t. , thus
(46) | ||||
(47) | ||||
(48) |
Furthermore, under event ,
(49) |
By the triangle inequality, it cannot be too far away from :
(50) |
Error decomposition:
As mentioned earlier, we can decompose the estimation error of the estimate returned by Algorithm 4 into: the statistical error (with potential information loss), term in (55); the error coming from including bad batches, term in (55); and the error coming from removing good batches, term in (55). Specifically:
(51) | ||||
(52) | ||||
(53) | ||||
(this is by the triangle inequality) | (54) |
(55) |
We can bound the first term by (45) under event :
(56) | ||||
(57) | ||||
(58) | ||||
(59) | ||||
(60) | ||||
(61) | ||||
(62) |
By (50), we can bound the second term by:
(63) | ||||
(64) | ||||
(65) | ||||
(66) | ||||
(67) |
We can bound the third term by:
(68) | ||||
(69) | ||||
(70) | ||||
(71) | ||||
(72) |
Note that the above upper bounds for and are still valid even if some of the ’s are zero.
In conclusion, we can bound the estimation error by:
(73) | ||||
(74) | ||||
(75) | ||||
(76) | ||||
(77) | ||||
(78) | ||||
(79) | ||||
(80) | ||||
(81) | ||||
(82) | ||||
(83) |
Appendix C Proof of Theorem 5.2
By following the standard regret decomposition for UCB-type algorithms (see [Jin et al., 2020]), under the event that the estimation error of the Bellman operator is bounded by the bonus terms, we can decompose the regret into:
-
1.
the cumulative bonus terms incurred along the trajectories of each good agent,
-
2.
a term that can be bounded more easily by the Azuma-Hoeffding inequality.
By Lemma B.4, and by replacing Lemma B.3 with a martingale variant, we can show that the event mentioned above happens with high probability. Unlike the standard regret analysis in the tabular setting, we cannot directly use a telescoping series to bound the cumulative bonuses. Instead, we first need to show that, because every good agent runs the same policy in every episode, their trajectories overlap substantially, meaning the counts of the good agents do not differ by too much. Given that, we can simplify the bound in Lemma B.4 and apply the telescoping argument.
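For reference, the form of the Azuma-Hoeffding inequality used for such martingale terms is the following (a standard statement, in generic notation): if $M_0, M_1, \dots, M_n$ is a martingale with $|M_t - M_{t-1}| \le c_t$ almost surely, then for any $\epsilon > 0$,
\[
\Pr\big( |M_n - M_0| \ge \epsilon \big) \le 2 \exp\left( - \frac{\epsilon^2}{2 \sum_{t=1}^{n} c_t^2} \right).
\]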
We start by restating Theorem 5.2:
Theorem C.1 (Regret bound, Theorem 5.2).
If , for all , with probability at least :
(84) |
We first give the high-level idea of our proof:
-
1.
We give an analysis under the intersection of three “good events”:
-
•
event : the estimation error of Bellman operator is upper-bounded by bonus (See Section C.1, Lemma C.3);
-
•
event : if the total count on some tuple is large, then the counts of the individual agents differ by at most a bounded factor (see Section C.3, Lemma C.9);
-
•
event : an error term in the regret decomposition is bounded by an Azuma-Hoeffding bound.
-
•
-
2.
Under event , we can decompose the regret into two terms (see Section C.2, Lemma C.8):
-
•
a martingale with bounded differences, which is controlled by a Hoeffding bound under event ;
-
•
the cumulative bonus terms, which can be bounded by a telescoping series under event .
-
•
We index the relevant variables by the episode in which they are used. When a synchronization happens in an episode, these variables are the updated ones after the synchronization; when there is no synchronization in an episode, they remain the same as in the previous episode. Let the counts of tuples in an episode be the ones after the counts update. Formally, we restate the data collection process and the counts of each good agent: during the data collection process, we allow all of the agents to collect data together; in each episode, every agent collects a multi-set of transition tuples using the current policy.
(85) |
is given by:
(86) |
We give the formal definitions of the good events below:
Definition C.2.
(87) | ||||
(88) |
(89) |
For any , we define the following event:
(90) |
And define
(91) |
Proof of Theorem 5.2.
By Azuma-Hoeffding inequality:
(92) |
Then, by the union bound, Lemma C.3 and Lemma C.9 together imply that for all :
(93) |
which means happens with probability at least .
We now upper bound the regret under event . By Lemma C.8, we can decompose the regret as:
(94) | ||||
(95) | ||||
(96) | ||||
(97) | ||||
(98) | ||||
(99) |
We only need to upper bound the cumulative bonus. Suppose the policy is updated at the beginning of -th episodes, with the data collected in the first -th episodes, with . To simplify the notation, we define , .
For convenience, in the following, we use to denote the total count on tuples up to episode over all good agents:
(100) |
where . We can rearrange the cumulative bonus by summing over pairs:
(101) |
When fewer than the required number of agents have coverage of some tuple, the bonus term is trivially set to be . In the following, we show that under event , in (101), for each tuple there are at most bonus terms such that , where
(102) |
For any , let be such that:
(103) |
This means when running the policy update at episode , the total counts for , i.e. , is larger than . For any , we have
(104) |
By definition of , for any
(105) |
this means for any , , meaning all of the good agents have coverage on , this means there are at least agents have coverage, and thus:
-
•
A trivial bonus can only happen at , i.e.
(106) Furthermore, in the algorithm, the agents synchronize and update their policy when or before any count of a good agent doubles, i.e. for all :
(107) This means
(108) Thus for each tuple, there are at most bonus term such that .
-
•
for any
(110) where is the -th largest among and
(111) For any , implies , and .
This means for any
(112) (113) (114) Thus
(115)
We are now ready to bound the cumulative bonus:
(116) | ||||
(117) | ||||
(118) | ||||
(119) | ||||
(120) | ||||
(121) | ||||
(122) | ||||
(123) |
(124) |
Because ,
(125) |
By (107),
(126) | ||||
(127) |
By the Cauchy–Schwarz inequality,
(128) |
Thus
(129) | ||||
(130) |
Thus
(131) | ||||
(132) |
In conclusion:
(133) | ||||
(134) |
∎
C.1 The good event
We first show that our bonus is a valid upper confidence bound for the estimated Bellman operator. Recall that the bonus term used in the -th episode is calculated based on the data collected in the first episodes. The bonus is given by:
-
•
If
(135) -
•
If
(136) (137) where is the -th largest among and
(138)
To be precise:
Lemma C.3 (Valid bonus).
Let be the following event:
(139) |
Then, we have
(140) |
To show that this is a high-probability event, we would like to utilize the result of Theorem 3.2. However, there are two obstacles, so we need to make some modifications:
- 1.
-
2.
The event requires the concentration property of to hold uniformly over infinitely many 's, so a direct union bound does not apply. Instead, we need to use a covering-number argument over all possible 's, which is standard (see [Jin et al., 2020]).
Proof of Lemma C.3.
Let be the following event:
(141) |
In the following, we decompose by:
(142) |
and bound by law of total probability.
If , because and , with probability , ,
(143) |
This means
(144) |
If , we use a covering number argument and union bound to bound the probability of event .
Consider , an -cover of in the sense of the -norm. We can bound the covering number by . This means that for any , we can find an s.t. . In other words,
(145) |
Importantly, unlike model-based methods without bad agents, our estimator is not a linear operator, meaning we cannot trivially upper bound it in the covering-number argument. Instead, we need to use the continuity of the error bound of our robust mean estimation Algorithm 4: as long as each data point collected by each agent is not perturbed by too much, the estimation error bound does not increase by too much.
Recall that in Algorithm 2, at episode , if the agents decide to synchronize, then at each step , given any function , the clean agents will calculate empirical mean for
(146) |
Let be an element in , s.t. , this means set (146) is a perturbed version (by at most ) of
(147) |
This means given an , for any , s.t. , Algorithm 4 can be used to robustly estimate , given set (146). Furthermore, choosing , by Lemma C.4, Lemma C.5 and Lemma B.4, given any , and any , s.t. , with probability at least ,
(148) |
We can bound the by:
(149) | ||||
(150) |
Then
(151) | ||||
(152) | ||||
(153) | ||||
(154) | ||||
(155) |
This means
(156) |
In conclusion,
(157) |
∎
C.1.1 Concentration of estimation from good agents
Lemma C.4.
Let:
(158) |
where we define . For any , and for any with probability at least , happens, where
(159) |
and
(160) |
Proof of Lemma C.4.
We use the martingale stopping-time argument from Lemma 4.3 of [Jin et al., 2018].
For each fixed : for all , define
(161) |
Let
(162) |
Then is a martingale. One observation is that the corresponding increment vanishes if agent did not visit in the -th episode. Thus we can use the stopping-time idea to shorten the martingale sequence.
Define the following sequence of ’s: ,
(163) |
Intuitively, is the episode in which is visited by agent for the -th time. If agent visits fewer than times, then . By definition, is a stopping time w.r.t. .
By optional sampling theorem, is a martingale.
By Azuma-Hoeffding’s inequality: for any
(164) |
Let , we get: for any , for any , with probability at least :
(165) |
By union bound, for any , with probability at least , for any :
(166) |
This means for any and any
(167) | ||||
(168) | ||||
(169) |
Thus
(170) | ||||
(171) |
By union bound
(172) |
∎
Lemma C.5.
Let:
(173) |
(174) |
where we define . For any , with probability at least , happens, where
(175) |
Proof of Lemma C.5.
During the data collection process, the agents are allowed to collect data simultaneously. For analysis purposes, we artificially order the data in the following sequence:
(176) |
where . Let
(177) |
Then forms a valid filtration. Let be a fixed set of scalar, s.t. , for all .
For each fixed : for all , Let
(178) |
Then is a martingale. As we can see, if good agent did not visit in episode , then a.s. Thus we can use the stopping time idea to shorten the martingale sequence.
Define the following functions to map from sequence index to agent index and episode index:
(179) |
For any , define the following sequence of ’s: ,
(180) | ||||
(181) |
Intuitively, is the episode in which is visited in sequence (176) for the -th time, and for all , agent has not yet collected tuples. If is visited fewer than times, or if some agent visits more than times, then . By definition, is a stopping time w.r.t. .
In particular, let be the th-largest of all ’s and . We choose .
By optional sampling theorem, is a martingale.
By Azuma-Hoeffding’s inequality: for any
(182) |
Let , we get: for any , for any , with probability at least :
(183) |
By union bound, for any , with probability at least , for any :
(184) |
This means for any and any
(185) | ||||
(186) | ||||
(187) | ||||
(188) | ||||
(189) | ||||
(190) |
Thus
(191) | ||||
(192) | ||||
(193) |
∎
C.2 The regret decomposition for UCB-style algorithms
We follow the regret decomposition strategy of [Jin et al., 2020] under the event that the estimation error of the Bellman operator is bounded by the bonus terms.
The estimated Bellman operator can be used to approximate the Q function:
Lemma C.6.
Under event , for any , and any policy
(194) |
Proof of Lemma C.6.
(195) | ||||
(196) | ||||
(197) | ||||
(198) | ||||
(199) | ||||
(200) | ||||
(201) |
∎
Under event we can upper bound the value function and Q function of the optimal policy by the estimated value function and Q function of policy :
Lemma C.7 (Optimism).
Under event , :
(202) |
Proof of Lemma C.7.
We prove this by induction on . Before that, note that, for any , if
(203) |
then because is chosen by maximizing , we know
(204) |
This means for any :
(205) |
We now begin our induction:
- •
-
•
Suppose for any , the statement holds for step , i.e.
(209) our goal is to show :
(210) (211) (212) (213) (214) (215) (216) (217) By the definition of the Q function, . We conclude these two statements by:
(218) By (205), .
∎
We are now ready to prove the regret decomposition lemma:
Lemma C.8.
Under good event :
(219) | ||||
(220) | ||||
(221) |
Proof of Lemma C.8.
We start by showing the decomposition of regret after step in one episode of a single agent: by Lemma C.6 and Lemma C.7, under event , for any
(222) | ||||
(223) | ||||
(224) | ||||
(225) | ||||
(226) | ||||
(227) | ||||
(228) | ||||
(229) | ||||
(230) | ||||
(231) | ||||
(232) | ||||
(233) |
This indeed gives a recursive formula: for any trajectory
(234) | ||||
(235) | ||||
(236) | ||||
(237) |
Then, we can show the regret decomposition in one episode of a single agent by recursion:
for any trajectory collected by a clean agent under policy :
(238)
(239)
(240)
(241)
(242)
(243)
(244)
(245)
Now we are ready to show the total regret decomposition. For each episode, we can decompose the regret w.r.t. any trajectory collected by a clean agent following policy . For convenience, we specialize the trajectories to be exactly the ones that are collected by the good agents and used to calculate the bonus terms. The reason is that, when we later bound the regret, we need to bound the cumulative bonus along the trajectory. By decomposing the regret w.r.t. the trajectories actually collected by the algorithm, it is automatically guaranteed that tuples visited frequently by the good agents carry lower bonuses: with more data collected, we can narrow the confidence interval and design smaller, yet still valid, bonus terms (see the sketch below).
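As a rough numerical illustration of the last point (this is not the algorithm's actual bonus; all names and constants are hypothetical), a generic count-based bonus shrinks as the visit count of a tuple grows:

import math

def generic_bonus(num_visits, horizon, delta):
    # Generic Hoeffding-style bonus ~ H * sqrt(log(1/delta) / N); the bonus in
    # the algorithm has a different, robustness-aware form.
    if num_visits == 0:
        return float("inf")
    return horizon * math.sqrt(math.log(1.0 / delta) / num_visits)

for n in (1, 10, 100, 1000):
    print(n, generic_bonus(n, horizon=10, delta=0.01))  # decreasing in n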
Because the MDP in our setting has a deterministic initial distribution, the good agents always have the same starting state:
(246)
(247)
(248)
∎
C.3 Evenness of clean agents
We need at least -agents to cover in order to learn the Bellman operator properly. In this section, we show that the agents have “even” coverage of the visited tuples in all but a relatively small number of the episodes. In the following, we use to denote the number of good agents.
Formally, we have:
Lemma C.9 (Even coverage of good agent).
For any , we define the following event:
(249)
Then, we have: for all ,
(250)
Remark C.10 (Intuition of the good event).
The event states that if, in any episode , a tuple gets enough coverage from the clean agents, then the coverage counts of the individual agents are very close to each other.
See proof of Lemma C.9 in Section C.3.1.
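A small simulation conveys the same intuition (purely illustrative; all parameters below are hypothetical): when the good agents follow the same policy, their per-agent visit counts of a fixed tuple concentrate around a common mean, so the max/min ratio across agents is close to one whenever the total count is large.

import random

def per_agent_counts(num_agents, episodes, visit_prob, seed=0):
    # Each good agent visits the fixed tuple independently with the same
    # probability in every episode (a stand-in for "following the same policy").
    rng = random.Random(seed)
    return [sum(rng.random() < visit_prob for _ in range(episodes))
            for _ in range(num_agents)]

counts = per_agent_counts(num_agents=20, episodes=500, visit_prob=0.3)
print(max(counts) / max(min(counts), 1))  # ratio close to 1 with large counts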
C.3.1 Proof of Lemma C.9
The proof of Lemma C.9 depends on the concentration of :
Lemma C.11 (Concentration of counts around empirical mean).
For all
(251)
(252)
Proof of Lemma C.11.
See Section C.3.2. ∎
Proof of Lemma C.9.
Let
(253)
For any , define events:
(254)
(255)
Recall:
(256)
Then we can rewrite event as:
(257)
We first show that if there are two ’s whose ratio exceeds , then there must be some that deviates a lot from the empirical mean of the ’s:
(258)
(259)
(260)
(261)
(262)
Next, we show that happens w.h.p.:
(263)
(264)
(265)
(266)
(267) (by (262))
(268)
(269)
(270)
(271)
(272)
∎
C.3.2 Proof of Lemma C.11
The high-level ideas are:
1. For each ,
  • for each , define centered as a martingale;
  • define centered as a martingale;
  •
2. apply a modified Bernstein-type martingale concentration bound to both the centered ’s and the centered (see Lemma C.12 and Lemma C.13);
3. because and have the same mean, we can use the triangle inequality to show that these two terms are close, and the distance is bounded by the variance term in the Bernstein inequality;
4. the Bernstein bound on also allows us to bound its variance in terms of itself;
5.
Lemma C.12 (Concentration of each ).
For all , with probability at least , for all :
(273)
Proof of Lemma C.12.
See Section C.3.3. ∎
Lemma C.13 (Concentration of each ).
For all , with probability at least , for all :
(274)
Proof of Lemma C.13.
See Section C.3.4. ∎
Proof of Lemma C.11.
Let be the intersection of the events in Lemma C.12 and Lemma C.13. Then, by Lemma C.12 and Lemma C.13, happens with probability at least . By (274),
(275)
(276)
(277)
(278)
(279)
(280)
∎
C.3.3 Proof of Lemma C.12
Proof of Lemma C.12.
For each fixed : for all , define
(281)
Let
(282)
(283)
Then is a martingale. Since depends on , which is calculated using data from the first episodes, we have . By Corollary E.3,
(284)
(285)
By union bound, with probability at least , for all :
(286)
∎
C.3.4 Proof of Lemma C.13
Proof of Lemma C.13.
During the data collection process, the agents are allowed to collect data simultaneously. For analysis purposes, we artificially order the data in the following sequence:
(287)
where . Let
(288)
Then forms a valid filtration. Define the following functions to map from sequence index to agent index and episode index:
(289)
For each fixed , for all , we define as the (centered) total count of collected by all good agents up to time . The -th term in (287) could fall in the middle of an episode, meaning some agents have not collected their trajectories yet, so we need to treat the agents differently. Let
(290)
(291)
Then is a martingale. Similar to Lemma C.12, define
(292)
(293)
Then by Corollary E.3,
(294)
(295)
(296)
(297)
(298)
(299)
By union bound, with probability at least , for all :
(300)
∎
Appendix D Proof of Theorem 6.5
By the following lemma, we can upper bound the suboptimality by the cumulative bonuses:
Lemma D.1.
Recall that for all ,
(302)
and is the -th largest among . is the -th largest of and is the -th largest of . The bonuses are given by:
• If
(303)
• If
(304)
(305)
where
(306)
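As a schematic sketch only (the threshold, constants, and count variables below are hypothetical placeholders, not the quantities in (303)–(306)), the case split above can be read as: fall back to the trivial bound when a tuple has too little clean coverage, and otherwise use a robust-estimation-style bonus that combines a corruption term with a statistical term.

import math

def schematic_bonus(count, horizon, alpha, delta, threshold):
    # Hypothetical case split mirroring the structure of Lemma D.1.
    if count < threshold:
        return horizon  # too little clean coverage: trivial bound on the value
    # enough coverage: corruption term (alpha) plus a shrinking statistical term
    return horizon * (alpha + math.sqrt(math.log(1.0 / delta) / count))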
Proof of Theorem 6.5.
We first show that with probability at least ,
(307)
where is defined in (301).
• If , by definition, . By definition of and , , thus (307) holds;
• If , for any fixed , , . Because is bounded and thus sub-Gaussian, we can use Theorem 3.2 to upper bound :
(308)
Thus
(309)
(310)
(311)
By union bound, with probability at least ,
(312)
Then, by Lemma D.1, with probability at least ,
(313)
(314)
(315)
(316)
By definition of in Definition 6.2,
(317)
(318)
(319)
(320)
By the definition of in Definition 6.4: for ,
(321)
(322)
(323)
and
(324)
(325)
(326)
Thus
(327)
(328)
(329)
(330)
Recall that . By the Cauchy–Schwarz inequality and the definition of in Definition 6.3,
(331)
(332)
(333)
(334)
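In placeholder notation, a Cauchy–Schwarz step of this kind has the generic shape
$$\sum_{i=1}^{n} \sqrt{\frac{1}{N_i}} \;=\; \sum_{i=1}^{n} 1 \cdot \sqrt{\frac{1}{N_i}} \;\le\; \sqrt{n}\,\sqrt{\sum_{i=1}^{n} \frac{1}{N_i}},$$
which is the kind of trade used here between the sum over tuples and the coverage-type quantity of Definition 6.3.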
In conclusion,
(335)
(336)
(337)
∎
Appendix E Useful inequalities
Theorem E.1 (Bernstein type of bound for martingale, Theorem 1.6 of [Freedman, 1975]).
Let be a probability triple. Let be an increasing sequence of sub--fields of . Let be random variables on , such that is measurable. Let . Assume and = 0. Let
(338)
(339)
where . Then, for any , ,
(340)
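For reference, a standard statement of Freedman's inequality reads as follows (the constants and exact form in (340) may differ slightly): if $\mathbb{E}[X_k \mid \mathcal{F}_{k-1}] = 0$ and $X_k \le M$ almost surely, then for any $\epsilon > 0$ and $\sigma^2 > 0$,
$$\Pr\left(\sum_{k=1}^{n} X_k \ge \epsilon \ \text{ and } \ \sum_{k=1}^{n} \mathbb{E}\big[X_k^2 \mid \mathcal{F}_{k-1}\big] \le \sigma^2\right) \;\le\; \exp\!\left(-\frac{\epsilon^2}{2\big(\sigma^2 + M\epsilon/3\big)}\right).$$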
By a union bound and a partition argument, we can obtain a more useful version of Theorem E.1.
We first present a result which shows that, given
(341)
we can bound up to some error.
Lemma E.2.
Let and be two sequences of random variables. We do not make any independence assumptions. Assume
• , almost surely;
• , , monotonically increasing.
If for all ,
(342)
Then for any ,
(343)
Proof.
See proof in Section E.1. ∎
Corollary E.3.
Under the assumptions of Theorem E.1, suppose terminates at . Then, for all ,
(344)
Proof of Corollary E.3.
Let . Then
(345)
By Theorem E.1, for all ,
(346)
In Lemma E.2, let:
• , ,
•
•
Because , . Then we get:
(347)
(348)
(349)
∎
E.1 Proof for Lemma E.2
Proof of Lemma E.2.
For a discrete random variable, we could simply condition on each possible value of and use a union bound. Here, because can be a continuous random variable, we instead divide the range of into intervals and upper bound the target probability by the law of total probability.
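In generic placeholder notation (random variables $A_n$, $B_n$, a monotonically increasing $f$, and intervals $I_j = [b_j, b_{j+1}]$ partitioning the range of $B_n$), the argument has the shape
$$\Pr\big(|A_n| \ge f(B_n)\big) \;=\; \sum_{j} \Pr\big(|A_n| \ge f(B_n),\, B_n \in I_j\big) \;\le\; \sum_{j} \Pr\big(|A_n| \ge f(b_j),\, B_n \le b_{j+1}\big),$$
and the mismatch between the endpoints $b_j$ and $b_{j+1}$ in each summand is what produces the extra error term in the conclusion (343).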
For all , let:
(350)
be a partition of the interval . Let be the set of intervals. Note that .
Then
(351)
(352)
(353)
(354)
(355)
Thus
(356)
(357)
∎