
Gap-Dependent Bounds for Federated $Q$-Learning

Haochen Zhang and Zhong Zheng are co-first authors who contributed equally to this paper. Lingzhou Xue is the corresponding author (Email: [email protected]).

Haochen Zhang, Zhong Zheng and Lingzhou Xue
Department of Statistics, The Pennsylvania State University
Abstract

We present the first gap-dependent analysis of regret and communication cost for on-policy federated $Q$-Learning in tabular episodic finite-horizon Markov decision processes (MDPs). Existing FRL methods focus on worst-case scenarios, leading to $\sqrt{T}$-type regret bounds and communication cost bounds with a $\log T$ term scaling with the number of agents $M$, states $S$, and actions $A$, where $T$ is the average total number of steps per agent. In contrast, our novel framework leverages the benign structures of MDPs, such as a strictly positive suboptimality gap, to achieve a $\log T$-type regret bound and a refined communication cost bound that disentangles exploration and exploitation. Our gap-dependent regret bound reveals a distinct multi-agent speedup pattern, and our gap-dependent communication cost bound removes the dependence on $MSA$ from the $\log T$ term. Notably, our gap-dependent communication cost bound also yields a better global switching cost when $M=1$, removing $SA$ from the $\log T$ term.

1 Introduction

Federated reinforcement learning (FRL) is a distributed learning framework that combines the principles of reinforcement learning (RL) (Sutton & Barto, 2018) and federated learning (FL) (McMahan et al., 2017). Focusing on sequential decision-making, FRL aims to learn an optimal policy through parallel exploration by multiple agents under the coordination of a central server. With the environment often modeled as a Markov decision process (MDP), multiple agents independently interact with an initially unknown environment and collaboratively train their decision-making models with limited information exchange between the agents. This approach accelerates the learning process with low communication costs. In this paper, we focus on on-policy FRL tailored for episodic tabular MDPs with inhomogeneous transition kernels. Specifically, we assume the presence of a central server and $M$ local agents in the system. Each agent interacts independently with an episodic MDP consisting of $S$ states, $A$ actions, and $H$ steps per episode.

Multiple recent works studied on-policy FRL for tabular MDPs. Zheng et al. (2024) proposed the model-free algorithms FedQ-Hoeffding and FedQ-Bernstein, which achieve regret bounds of $\tilde{O}(\sqrt{MH^{4}SAT})$ and $\tilde{O}(\sqrt{MH^{3}SAT})$, respectively, under $O(MH^{3}SA\log T)$ rounds of communication. Here, $T$ is the average total number of steps for each agent, and $\tilde{O}$ hides logarithmic factors. Zheng et al. (2025a) proposed FedQ-Advantage, which improves the regret to $\tilde{O}(\sqrt{MH^{2}SAT})$ under a reduced $O(f_{M}H^{2}SA(\log H)\log T)$ rounds of communication, where $f_{M}\in\{1,M\}$ reflects the optional forced synchronization scheme. Chen et al. (2022) and Labbi et al. (2024) proposed model-based algorithms that extend the single-agent algorithm UCBVI (Azar et al., 2017). Byzan-UCBVI (Chen et al., 2022) reaches regret $\tilde{O}(\sqrt{MH^{3}S^{2}AT})$ under $O(MHSA\log T)$ rounds of communication. Fed-UCBVI (Labbi et al., 2024) reaches regret $\tilde{O}(\sqrt{MH^{2}SAT})$ under $O(HSA\log T+MHSA\log\log T)$ rounds of communication. Model-based methods require estimating the transition kernel, so their memory requirements scale quadratically with the number of states $S$. Model-free methods, also called $Q$-Learning methods (Watkins, 1989), directly learn the action-value function, and their memory requirements scale only linearly with $S$. The regret $\tilde{O}(\sqrt{MH^{2}SAT})$ achieved by both FedQ-Advantage and Fed-UCBVI is almost optimal compared to the regret lower bound $\tilde{O}(\sqrt{MH^{2}SAT})$ (Jin et al., 2018; Domingues et al., 2021). In summary, all the works above provide worst-case guarantees for all possible MDPs and prove $\sqrt{T}$-type regret bounds and communication cost bounds that depend linearly on $MSA\log T$ or $SA\log T$.

In practice, RL algorithms often perform better than their worst-case guarantees, as they can be significantly improved for MDPs with benign structures (Zanette & Brunskill, 2019). This motivates problem-dependent analysis that exploits benign MDPs (see, e.g., Wagenmaker et al. (2022a); Zhou et al. (2023); Zhang et al. (2024b)). One such benign structure is a strictly positive suboptimality gap: for every state, the best actions outperform the others by a margin. It is important because nearly all non-degenerate environments with finite action sets satisfy some suboptimality gap condition (Yang et al., 2021). For single-agent algorithms, Simchowitz & Jamieson (2019) and Dann et al. (2021) analyzed gap-dependent regret for model-based methods, while Yang et al. (2021), Xu et al. (2021), and Zheng et al. (2025b) analyzed model-free methods. Here, Yang et al. (2021) focused on UCB-Hoeffding proposed by Jin et al. (2018), while Xu et al. (2021) proposed an algorithm that does not use upper confidence bounds (UCB). Zheng et al. (2025b) analyzed UCB-Advantage (Zhang et al., 2020) and Q-EarlySettled-Advantage (Li et al., 2021), which use variance reduction techniques. All of these works achieve regrets that depend logarithmically on $T$, which is much better than the worst-case $\sqrt{T}$-type regrets. However, no existing work studies gap-dependent regret for on-policy FRL. This motivates the following open question:

Is it possible to establish gap-dependent regret bounds for on-policy FRL algorithms that are logarithmic in $T$?

Meanwhile, recent works have proposed FRL algorithms for tabular MDPs in the off-policy setting (Woo et al., 2023), the offline setting (Woo et al., 2024), and the setting with a simulator (Woo et al., 2023; Salgia & Chi, 2024). Different from on-policy methods, state-of-the-art algorithms for these settings do not update the implemented policies (exploration) and reach $MSA$-free logarithmic bounds on communication rounds, whereas the worst-case communication cost bounds for on-policy FRL methods require a dependence on $M$, $S$, and $A$ in the $\log T$ term (e.g., $O(MH^{3}SA\log T)$ in Zheng et al. (2024)). While increased communication for exploration is reasonable, existing on-policy FRL methods cannot quantify the communication cost paid for exploring non-optimal actions or for exploiting optimal policies under worst-case MDPs, since the suboptimality gaps can be arbitrarily close to 0. This leads to the dependence on $M$, $S$, and $A$ in the $\log T$ term, which motivates the following open question:

Is it possible to establish gap-dependent communication cost bounds for on-policy FRL algorithms that disentangle exploration and exploitation and remove the dependence on $MSA$ from the $\log T$ term?

A closely related evaluation criterion for on-policy RL is the global switching cost, defined as the number of policy switches. It is important in applications with restrictions on policy switching, such as compiler optimization (Ashouri et al., 2018), hardware placement (Mirhoseini et al., 2017), database optimization (Krishnan et al., 2018), and material discovery (Nguyen et al., 2019). Next, we review related literature on single-agent model-free RL algorithms. Under worst-case MDPs, Bai et al. (2019) modified the algorithms in Jin et al. (2018), achieving a cost of $O(H^{3}SA\log T)$, and UCB-Advantage (Zhang et al., 2020) reached an improved cost of $O(H^{2}SA\log T)$, with both algorithms depending on $SA\log T$. In gap-dependent analysis, Zheng et al. (2025b) proved that UCB-Advantage enjoys a cost that depends linearly on $S\log T$. Whether single-agent model-free RL algorithms can avoid the dependence on $SA$ in the $\log T$ term remains an open question.

In addition, multiple technical challenges arise when trying to establish gap-dependent bounds and improve the existing worst-case ones. First, gap-dependent regret analysis often relies on controlling the error of the value function estimates. However, the techniques for model-free methods (Yang et al., 2021; Xu et al., 2021; Zheng et al., 2025b) only adapt to the instant policy updates of single-agent methods, while FRL often uses delayed policy updates to keep the communication cost low. Second, proving low communication costs for FRL algorithms often requires actively estimating the number of visits to each state-action-step triple (see, e.g., Woo et al. (2023)). However, this is challenging for on-policy algorithms because the implemented policy is actively updated, and a universal stationary visiting probability is unavailable. Existing on-policy FRL methods achieve logarithmic communication costs by controlling the visits and synchronization through event-triggered synchronization conditions, which guarantee a sufficient increase in the number of visits to one state-action-step triple between synchronizations. However, this analysis is insufficient for estimating visiting numbers and results in the dependence on $SA$ in the $\log T$ term.

Summary of Our Contributions. We give an affirmative answer to these important open questions by proving the first gap-dependent bounds on both regret and communication cost for on-policy FRL in the literature. We analyze those bounds for FedQ-Hoeffding (Zheng et al., 2024), an on-policy FRL algorithm for tabular episodic finite-horizon MDPs. Our contributions are summarized as follows.

• Gap-Dependent Regret (Theorem 4.1). Denote $\Delta_{\min}$ as the minimum nonzero suboptimality gap over all state-action-step triples. We prove that FedQ-Hoeffding guarantees a gap-dependent expected regret of

$$O\bigg(\frac{H^{6}SA\log(MSAT)}{\Delta_{\min}}+C_{f}\bigg) \qquad (1)$$

where $C_{f}=M\sqrt{H^{7}}SA\sqrt{\log(MSAT)}+MH^{5}SA$ is the gap-free part. The bound is logarithmic in $T$ and improves on the worst-case $\sqrt{T}$-type regret discussed above when $T$ is large enough. When $M=1$, (1) reduces to the single-agent gap-dependent regret bound (Yang et al., 2021) for UCB-Hoeffding (Jin et al., 2018), the single-agent counterpart of FedQ-Hoeffding. When $T$ is large enough and $\Delta_{\min}$ is small enough, (1) shows a better multi-agent speedup over UCB-Hoeffding than the $\sqrt{T}$-type worst-case regrets in Zheng et al. (2024) do. Our numerical experiments in Section 5.1 also demonstrate the $\log T$ pattern of the regret for any given MDP.

• Gap-Dependent Communication Cost (Theorem 4.2). We prove that under a general uniqueness condition on optimal policies, for any $p\in(0,1)$, with probability at least $1-p$, both the number of communication rounds and the number of different implemented policies required by FedQ-Hoeffding are upper bounded by

$$O\bigg(MH^{3}SA\log(MH^{2}\iota_{0})+H^{3}SA\log\Big(\frac{H^{5}SA}{\Delta^{2}_{\min}}\Big)+H^{3}S\log\Big(\frac{MH^{9}SA\iota_{0}}{\Delta^{2}_{\min}C_{st}}\Big)+H^{2}\log\Big(\frac{T}{HSA}\Big)\bigg). \qquad (2)$$

Here, $C_{st}\in(0,1]$ represents the minimum of the nonzero visiting probabilities to all state-step pairs under optimal policies, and $\iota_{0}=\log(SAT/p)$. Since the communication cost of each round is $O(MHS)$, the total communication cost is (2) multiplied by $O(MHS)$. Compared to the existing worst-case communication rounds that depend on $MSA\log T$ (Zheng et al., 2024, 2025a; Qiao et al., 2022) or $SA\log T$ (Zheng et al., 2025a; Labbi et al., 2024), the first three terms in (2) depend only logarithmically on $1/\Delta_{\min}$ and $\log T$, and the last term removes the dependence on $MSA$ from the $\log T$ term. This improvement is significant since $M$ represents the number of collaborating agents and $SA$ represents the complexity of the state-action space, which is often the bottleneck of RL methods (Jin et al., 2018). Compared to the $SA$-free communication rounds for FRL methods that do not update policies, (2) quantifies the cost of the different components in on-policy FRL: the first two terms represent the cost of exploration, and the last two terms show the cost of implementing the optimal policy (exploitation). Our numerical experiments, presented in Section 5.2, demonstrate that the $\log T$ term in the communication cost is independent of $M$, $S$, and $A$.

When $M=1$, FedQ-Hoeffding becomes a single-agent algorithm with the low global switching cost shown in (2) (Corollary 4.1). It removes the dependence on $SA$ from the $\log T$ term compared to existing model-free methods (Bai et al., 2019; Zhang et al., 2020; Zheng et al., 2025b).

• Technical Novelty and Contributions. We develop a new theoretical framework for the gap-dependent analysis of on-policy FRL with delayed policy updates. It provides two features simultaneously: controlling the error of the estimated value functions and estimating the number of visits. The first feature helps prove the gap-dependent regret bound (1), and the second is key to proving the bound (2) on communication rounds. To overcome the difficulty of estimating visiting numbers, we develop a new technical tool: concentrations on visiting numbers under varying policies. We establish concentration inequalities that relate the visits to the stationary visiting probabilities of the optimal policies via error recursion over episode steps. This step relies on the number of visits with suboptimal actions rather than on the algorithm settling on an optimal policy, and it provides better estimates of visiting numbers. With this tool and the nonzero minimum suboptimality gap, we also establish the following techniques: (a) exploring visiting discrepancies between optimal and suboptimal actions, which validates the concentration above; (b) showing an agent-wise simultaneous sufficient increase of visits, which helps remove the linear dependence on $M$ from the last three terms of (2); and (c) showing a state-wise simultaneous sufficient increase of visits for states with unique optimal actions, which helps remove the linear dependence on $SA$ from the last term of (2).

    To the best of our knowledge, these techniques are new to the literature for on-policy model-free FRL methods. They will be of independent interest in the gap-dependent analysis of other on-policy RL and FRL methods in controlling or estimating the number of visits.

2 Related Work

On-policy RL for single agent finite-horizon tabular MDPs with worst-case regret. There are mainly two types of algorithms for reinforcement learning: model-based and model-free learning. Model-based algorithms learn a model from past experience and make decisions based on this model, while model-free algorithms only maintain a group of value functions and take the induced optimal actions. Due to these differences, model-free algorithms are usually more space-efficient and time-efficient compared to model-based algorithms. However, model-based algorithms may achieve better learning performance by leveraging the learned model.

Next, we discuss the literature on model-based and model-free algorithms for finite-horizon tabular MDPs with worst-case regret. Auer et al. (2008), Agrawal & Jia (2017), Azar et al. (2017), Kakade et al. (2018), Agarwal et al. (2020), Dann et al. (2019), Zanette & Brunskill (2019), Zhang et al. (2021), Zhou et al. (2023), and Zhang et al. (2024b) worked on model-based algorithms. Notably, Zhang et al. (2024b) provided an algorithm that achieves a regret of $\tilde{O}(\min\{\sqrt{SAH^{2}T},T\})$, which matches the information-theoretic lower bound. Jin et al. (2018), Yang et al. (2021), Zhang et al. (2020), Li et al. (2021), and Ménard et al. (2021) worked on model-free algorithms. The latter three introduced algorithms that achieve the minimax regret of $\tilde{O}(\sqrt{SAH^{2}T})$.

Suboptimality Gap. When there is a strictly positive suboptimality gap, it is possible to achieve logarithmic regret bounds. In RL, earlier work obtained asymptotic logarithmic regret bounds (Auer & Ortner, 2007; Tewari & Bartlett, 2008). More recently, non-asymptotic logarithmic regret bounds were obtained (Jaksch et al., 2010; Ok et al., 2018; Simchowitz & Jamieson, 2019; He et al., 2021). Specifically, Jaksch et al. (2010) developed a model-based algorithm whose bound depends on the policy gap instead of the action gap studied in this paper. Ok et al. (2018) derived problem-specific logarithmic lower bounds for both structured and unstructured MDPs. Simchowitz & Jamieson (2019) extended the model-based algorithm of Zanette & Brunskill (2019) and obtained logarithmic regret bounds. Logarithmic regret bounds have also been obtained in linear function approximation settings (He et al., 2021). Nguyen-Tang et al. (2023) also provides gap-dependent regret bounds for offline RL with linear function approximation.

Specifically, for model-free algorithms, Yang et al. (2021) showed that the optimistic $Q$-learning algorithm of Jin et al. (2018) enjoys a logarithmic regret of $O\big(\frac{H^{6}SA\log(SAT)}{\Delta_{\min}}\big)$, which was subsequently refined by Xu et al. (2021), who introduced the Adaptive Multi-step Bootstrap (AMB) algorithm. Zheng et al. (2025b) further improved the logarithmic regret bound by leveraging the analysis of the UCB-Advantage algorithm (Zhang et al., 2020) and the Q-EarlySettled-Advantage algorithm (Li et al., 2021).

There are also some other works focusing on gap-dependent sample complexity bounds (Jonsson et al., 2020; Marjani & Proutiere, 2020; Al Marjani et al., 2021; Tirinzoni et al., 2022; Wagenmaker et al., 2022b; Wagenmaker & Jamieson, 2022; Wang et al., 2022; Tirinzoni et al., 2023).

Variance reduction in RL. The reference-advantage decomposition used in Zhang et al. (2020) and Li et al. (2021) is a technique of variance reduction that was originally proposed for finite-sum stochastic optimization (Gower et al., 2020; Johnson & Zhang, 2013; Nguyen et al., 2017). Later on, model-free RL algorithms also used variance reduction to improve the sample efficiency. For example, it was used in learning with generative models (Sidford et al., 2018, 2023; Wainwright, 2019), policy evaluation (Du et al., 2017; Khamaru et al., 2021; Wai et al., 2019; Xu et al., 2020), offline RL (Shi et al., 2022; Yin et al., 2021), and $Q$-learning (Li et al., 2020; Zhang et al., 2020; Li et al., 2021; Yan et al., 2023).

RL with low switching cost and batched RL. Research on RL with low switching cost aims to minimize the number of policy switches while maintaining regret bounds comparable to fully adaptive counterparts, and it can be applied to federated RL. In batched RL (Perchet et al., 2016; Gao et al., 2019), the agent sets the number of batches and the length of each batch upfront, implementing an unchanged policy within each batch and aiming for fewer batches and lower regret. Bai et al. (2019) first introduced the problem of RL with low switching cost and proposed a $Q$-learning algorithm with lazy updates, achieving $\tilde{O}(SAH^{3}\log T)$ switching cost. This work was advanced by Zhang et al. (2020), which improved both the regret upper bound and the switching cost. Additionally, Wang et al. (2021) studied RL under an adaptivity constraint. Recently, Qiao et al. (2022) proposed a model-based algorithm with $\tilde{O}(\log\log T)$ switching cost. Zhang et al. (2022) proposed a batched RL algorithm that is well suited to the federated setting.

Multi-agent RL (MARL) with event-triggered communications. We review a few recent works for on-policy MARL with linear function approximations. Dubey & Pentland (2021) introduced Coop-LSVI for cooperative MARL. Min et al. (2023) proposed an asynchronous version of LSVI-UCB that originates from Jin et al. (2020), matching the same regret bound with improved communication complexity compared to Dubey & Pentland (2021). Hsu et al. (2024) developed two algorithms that incorporate randomized exploration, achieving the same regret and communication complexity as Min et al. (2023). Dubey & Pentland (2021), Min et al. (2023) and Hsu et al. (2024) employed event-triggered communication conditions based on determinants of certain quantities. Different from our federated algorithm, during the synchronization in Dubey & Pentland (2021) and Min et al. (2023), local agents share original rewards or trajectories with the server. On the other hand, Hsu et al. (2024) reduces communication cost by sharing compressed statistics in the non-tabular setting with linear function approximation.

Federated and distributed RL. Existing literature on federated and distributed RL algorithms highlights various aspects. For value-based algorithms, Guo & Brunskill (2015), Zheng et al. (2024), and Woo et al. (2023) focused on linear speedup. Agarwal et al. (2021) proposed a parallel RL algorithm with low communication cost. Woo et al. (2023) and Woo et al. (2024) discussed the improved covering power of heterogeneity. Wu et al. (2021) and Chen et al. (2023) worked on robustness. In particular, Chen et al. (2023) proposed algorithms in both offline and online settings, obtaining near-optimal sample complexities and superior robustness guarantees. In addition, several works have investigated value-based algorithms such as $Q$-learning in different settings, including Beikmohammadi et al. (2024), Jin et al. (2022), Khodadadian et al. (2022), Fan et al. (2023), Woo et al. (2023), Woo et al. (2024), Anwar & Raychowdhury (2021), Zhao et al. (2023), Yang et al. (2023), and Zhang et al. (2024a). The convergence of decentralized temporal difference algorithms has been analyzed by Doan et al. (2019), Doan et al. (2021), Chen et al. (2021b), Sun et al. (2020), Wai (2020), Wang et al. (2020), Zeng et al. (2021), and Liu & Olshevsky (2023).

Some other works focus on policy-gradient-based algorithms. Communication-efficient policy gradient algorithms have been studied by Fan et al. (2021) and Chen et al. (2021a). Lan et al. (2023) further highlights their simplicity compared to other RL methods and demonstrates a linear speedup in the synchronous setting. Optimal sample complexity for global optimality in federated RL, even in the presence of adversaries, is studied in Ganesh et al. (2024). Lan et al. (2024) propose an algorithm to address the challenge of lagged policies in asynchronous settings.

The convergence of distributed actor-critic algorithms has been analyzed by Shen et al. (2023) and Chen et al. (2022). Federated actor-learner architectures have been explored by Assran et al. (2019), Espeholt et al. (2018), and Mnih et al. (2016). Distributed inverse reinforcement learning has been examined by Banerjee et al. (2021), Gong et al. (2023) and Liu & Zhu (2022, 2023, 2024, 2025).

3 Background and Problem Formulation

3.1 Preliminaries

We begin by introducing the mathematical framework of Markov decision processes. In this paper, we adopt the convention $0/0=0$. For any $C\in\mathbb{N}$, we use $[C]$ to denote the set $\{1,2,\ldots,C\}$. We use $\mathbb{I}[x]$ to denote the indicator function, which equals 1 when the event $x$ is true and 0 otherwise.

Tabular episodic Markov decision process (MDP). A tabular episodic MDP is denoted as $\mathcal{M}:=(\mathcal{S},\mathcal{A},H,\mathbb{P},r)$, where $\mathcal{S}$ is the set of states with $|\mathcal{S}|=S$, $\mathcal{A}$ is the set of actions with $|\mathcal{A}|=A$, $H$ is the number of steps in each episode, $\mathbb{P}:=\{\mathbb{P}_{h}\}_{h=1}^{H}$ is the transition kernel so that $\mathbb{P}_{h}(\cdot\mid s,a)$ characterizes the distribution over the next state given the state-action pair $(s,a)$ at step $h$, and $r:=\{r_{h}\}_{h=1}^{H}$ is the collection of reward functions. We assume that $r_{h}(s,a)\in[0,1]$ is a deterministic function of $(s,a)$; the results can be easily extended to the case where $r_{h}$ is random.

In each episode, an initial state $s_{1}$ is selected arbitrarily by an adversary. Then, at each step $h\in[H]$, an agent observes a state $s_{h}\in\mathcal{S}$, picks an action $a_{h}\in\mathcal{A}$, receives the reward $r_{h}=r_{h}(s_{h},a_{h})$, and then transits to the next state $s_{h+1}$. The episode ends when an absorbing state $s_{H+1}$ is reached.

Policies, state value functions, and action value functions. A policy $\pi$ is a collection of $H$ functions $\{\pi_{h}:\mathcal{S}\rightarrow\Delta^{\mathcal{A}}\}_{h\in[H]}$, where $\Delta^{\mathcal{A}}$ is the set of probability distributions over $\mathcal{A}$. A policy is deterministic if for any $s\in\mathcal{S}$, $\pi_{h}(s)$ concentrates all the probability mass on an action $a\in\mathcal{A}$. In this case, we write $\pi_{h}(s)=a$.

Let $V_{h}^{\pi}:\mathcal{S}\rightarrow\mathbb{R}$ and $Q_{h}^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ denote the state value function and the state-action value function at step $h$ under policy $\pi$. Mathematically, $V_{h}^{\pi}(s):=\sum_{t=h}^{H}\mathbb{E}_{(s_{t},a_{t})\sim(\mathbb{P},\pi)}\left[r_{t}(s_{t},a_{t})\mid s_{h}=s\right]$ and $Q_{h}^{\pi}(s,a):=\sum_{t=h}^{H}\mathbb{E}_{(s_{t},a_{t})\sim(\mathbb{P},\pi)}\left[r_{t}(s_{t},a_{t})\mid(s_{h},a_{h})=(s,a)\right]$.

Since the state and action spaces and the horizon are all finite, there exists an optimal policy $\pi^{\star}$ that achieves the optimal value $V_{h}^{\star}(s)=\sup_{\pi}V_{h}^{\pi}(s)=V_{h}^{\pi^{\star}}(s)$ for all $(s,h)\in\mathcal{S}\times[H]$ (Azar et al., 2017). The Bellman equation and the Bellman optimality equation can be expressed as

$$\left\{\begin{array}{l}V_{h}^{\pi}(s)=\mathbb{E}_{a^{\prime}\sim\pi_{h}(s)}[Q_{h}^{\pi}(s,a^{\prime})],\\ Q_{h}^{\pi}(s,a):=r_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim\mathbb{P}_{h}(\cdot|s,a)}V_{h+1}^{\pi}(s^{\prime}),\\ V_{H+1}^{\pi}(s)=0,\ \forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H],\end{array}\right.\qquad\left\{\begin{array}{l}V_{h}^{\star}(s)=\max_{a^{\prime}\in\mathcal{A}}Q_{h}^{\star}(s,a^{\prime}),\\ Q_{h}^{\star}(s,a):=r_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim\mathbb{P}_{h}(\cdot|s,a)}V_{h+1}^{\star}(s^{\prime}),\\ V_{H+1}^{\star}(s)=0,\ \forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H].\end{array}\right.\qquad(3)$$
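To make the backward recursion in (3) concrete, the following minimal sketch computes $Q^{\star}$ and $V^{\star}$ by dynamic programming for a small tabular MDP stored as NumPy arrays; the array names (`P`, `r`) and shapes are our own illustrative conventions, not part of any algorithm in this paper.

```python
import numpy as np

def optimal_values(P, r):
    """Backward induction for the Bellman optimality equation (3).

    P: transition kernel, shape (H, S, A, S); P[h, s, a] is a distribution over next states.
    r: rewards, shape (H, S, A), values in [0, 1].
    Returns Q_star (H, S, A) and V_star (H + 1, S), with V_star[H] = 0.
    """
    H, S, A, _ = P.shape
    Q_star = np.zeros((H, S, A))
    V_star = np.zeros((H + 1, S))               # V_star[H] is the terminal value 0
    for h in range(H - 1, -1, -1):              # steps H, H-1, ..., 1 (0-indexed)
        Q_star[h] = r[h] + P[h] @ V_star[h + 1]     # r_h(s,a) + E_{s'}[V*_{h+1}(s')]
        V_star[h] = Q_star[h].max(axis=1)           # V*_h(s) = max_a Q*_h(s,a)
    return Q_star, V_star
```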

Suboptimality Gap. For any given MDP, we can provide the following formal definition.

Definition 3.1.

For any $(s,a,h)$, the suboptimality gap is defined as $\Delta_{h}(s,a):=V_{h}^{\star}(s)-Q_{h}^{\star}(s,a)$.

(3) implies that $\Delta_{h}(s,a)\geq 0$ for any $(s,a,h)$. It is then natural to define the minimum gap, the minimum nonzero suboptimality gap over all $(s,a,h)$.

Definition 3.2.

We define the minimum gap as $\Delta_{\min}:=\inf\{\Delta_{h}(s,a)\mid\Delta_{h}(s,a)>0,(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]\}$.

We remark that if $\{\Delta_{h}(s,a)\mid\Delta_{h}(s,a)>0,(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]\}=\emptyset$, then all actions are optimal, leading to a degenerate MDP. Therefore, we assume that this set is nonempty and $\Delta_{\min}>0$ in the rest of this paper. Definitions 3.1 and 3.2 and the non-degeneracy assumption are standard in the literature on gap-dependent analysis (Simchowitz & Jamieson, 2019; Yang et al., 2021; Xu et al., 2020).
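As an illustration of Definitions 3.1 and 3.2, the following sketch computes all gaps $\Delta_{h}(s,a)=V_{h}^{\star}(s)-Q_{h}^{\star}(s,a)$ and $\Delta_{\min}$ from the arrays returned by the backward-induction sketch above; it assumes a non-degenerate MDP so that at least one gap is strictly positive, and the numerical tolerance is our own choice.

```python
def suboptimality_gaps(Q_star, V_star):
    """Delta_h(s, a) = V*_h(s) - Q*_h(s, a) for all (s, a, h); returns gaps and Delta_min."""
    gaps = V_star[:-1][:, :, None] - Q_star    # shape (H, S, A); broadcast V*_h(s) over actions
    positive = gaps[gaps > 1e-12]              # nonzero gaps, up to numerical tolerance
    delta_min = positive.min() if positive.size > 0 else None  # None only for a degenerate MDP
    return gaps, delta_min
```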

Global Switching Cost. We provide the following definition for any algorithm with $K>1$ episodes, which is also used in Bai et al. (2019) and Qiao et al. (2022).

Definition 3.3.

The global switching cost for $K$ episodes is defined as $N_{\textnormal{switch}}:=\sum_{k=1}^{K-1}\mathbb{I}[\pi^{k+1}\neq\pi^{k}]$.
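For concreteness, the global switching cost of Definition 3.3 can be computed from a list of deterministic policies, one per episode, each stored as an integer array of shape $(H, S)$; this is a sketch with our own data layout.

```python
import numpy as np

def global_switching_cost(policies):
    """policies: list of K arrays of shape (H, S), where policies[k][h, s] = pi^{k+1}_h(s).
    Counts the episodes k at which the policy differs from the previous one."""
    return sum(
        not np.array_equal(policies[k], policies[k + 1])
        for k in range(len(policies) - 1)
    )
```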

3.2 The Federated RL Framework

We consider an FRL setting with a central server and $M$ agents, each interacting with an independent copy of $\mathcal{M}$. The agents communicate with the server periodically: after receiving local information, the central server aggregates it and broadcasts certain information to the agents to coordinate their exploration.

For agent $m$, let $U_{m}$ be the number of generated episodes, $\pi^{m,u}$ be the policy in the $u$-th episode of agent $m$, and $s_{1}^{m,u}$ be the corresponding initial state. The regret of $M$ agents over $\hat{T}=H\sum_{m=1}^{M}U_{m}$ total steps is

$$\textnormal{Regret}(T)=\sum_{m\in[M]}\sum_{u=1}^{U_{m}}\left(V_{1}^{\star}(s_{1}^{m,u})-V_{1}^{\pi^{m,u}}(s_{1}^{m,u})\right).$$

Here, $T:=\hat{T}/M$ is the average total number of steps per agent.
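In a simulation, the regret above can be evaluated exactly once $V_{1}^{\star}$ and the value of each executed policy are known. A sketch under the same array conventions as before is given below; the helper names and the representation of episodes as (policy, initial state) pairs are our own.

```python
import numpy as np

def policy_value(P, r, pi):
    """V^pi by backward induction; pi has shape (H, S) with deterministic actions."""
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    s_idx = np.arange(S)
    for h in range(H - 1, -1, -1):
        Q_h = r[h] + P[h] @ V[h + 1]          # shape (S, A)
        V[h] = Q_h[s_idx, pi[h]]              # follow the policy's action at each state
    return V

def cumulative_regret(P, r, V1_star, episodes):
    """episodes: list of (policy, initial_state) pairs over all agents and episodes."""
    regret = 0.0
    for pi, s1 in episodes:
        regret += V1_star[s1] - policy_value(P, r, pi)[0, s1]
    return regret
```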

We also define the communication cost of an algorithm as the number of scalars (integers or real numbers) communicated between the server and agents.

4 Performance Guarantees

4.1 FedQ-Hoeffding Algorithm

In this subsection, we briefly review FedQ-Hoeffding. Details are provided in Algorithm 1 and Algorithm 2 in Section A.1. FedQ-Hoeffding proceeds in rounds, indexed by $k\in[K]$. Round $k$ consists of $n^{m,k}$ episodes for agent $m$, where the specific value of $n^{m,k}$ is determined later.

Notations. For the $j$-th ($j\in[n^{m,k}]$) episode of agent $m$ in the $k$-th round, we denote the corresponding trajectory by $\{(s_{h}^{k,j,m},a_{h}^{k,j,m},r_{h}^{k,j,m})\}_{h=1}^{H}$. Denote $n_{h}^{m,k}(s,a)$ as the number of times that $(s,a,h)$ has been visited by agent $m$ in round $k$, $n_{h}^{k}(s,a):=\sum_{m=1}^{M}n_{h}^{m,k}(s,a)$ as the total number of visits in round $k$ across all agents, and $N_{h}^{k}(s,a)$ as the total number of visits to $(s,a,h)$ among all agents before the start of round $k$. We also use $\{V_{h}^{k}:\mathcal{S}\rightarrow\mathbb{R}\}_{h=1}^{H}$ and $\{Q_{h}^{k}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}\}_{h=1}^{H}$ to denote the global estimates of the state value function and state-action value function at the beginning of round $k$. Before the first round, both estimates are initialized as $H$.

Coordinated Exploration. At the beginning of round $k$, the server decides a deterministic policy $\pi^{k}=\{\pi_{h}^{k}\}_{h=1}^{H}$ and broadcasts it, along with $\{N_{h}^{k}(s,\pi_{h}^{k}(s))\}_{s,h}$ and $\{V_{h}^{k}(s)\}_{s,h}$, to the agents. Here, $\pi^{1}$ can be chosen arbitrarily. The agents then execute $\pi^{k}$ and start collecting trajectories. During the exploration in round $k$, every agent $m$ monitors its number of visits to each $(s,a,h)$. For any agent $m$, at the end of each episode, if any $(s,a,h)$ has been visited by

$$c_{h}^{k}(s,a)=\max\left\{1,\left\lfloor\frac{N_{h}^{k}(s,a)}{MH(H+1)}\right\rfloor\right\} \qquad (4)$$

times by agent $m$, the agent sends a signal to the server, which then aborts all agents' exploration. In this case, we say that $(s,a,h)$ satisfies the trigger condition in round $k$. During the exploration, for all $(s,a,h)$, agent $m$ adaptively calculates $n_{h}^{m,k}(s,a)$ and the local estimate of the next-step return $v_{h+1}^{m,k}(s,a):=\sum_{j=1}^{n^{m,k}}V_{h+1}^{k}(s_{h+1}^{k,j,m})\mathbb{I}[(s_{h}^{k,j,m},a_{h}^{k,j,m})=(s,a)]$. At the end of round $k$, each agent sends $\{r_{h}(s,\pi_{h}^{k}(s))\}_{s,h}$, $\{n_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h}$, and $\{v_{h+1}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h}$ to the central server for aggregation.
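A minimal sketch of the agent-side visit counting and the event-triggered condition (4) is given below; variable names such as `N_hk` and `local_visits` are our own, and the actual synchronization with the server is abstracted into a boolean return value.

```python
import numpy as np

def round_threshold(N_hk, M, H):
    """Trigger threshold (4): c_h^k(s, a) = max{1, floor(N_h^k(s, a) / (M H (H + 1)))}."""
    return np.maximum(1, N_hk // (M * H * (H + 1)))

def episode_triggers_sync(local_visits, N_hk, M, H):
    """local_visits[h, s, a]: this agent's visits to (s, a, h) so far in the current round.
    Returns True if some triple has reached its per-agent threshold, i.e. the agent
    should signal the server to end the round."""
    return bool(np.any(local_visits >= round_threshold(N_hk, M, H)))
```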

Updates of estimated value functions. The central server calculates $n_{h}^{k}(s,a)$ and $N_{h}^{k+1}(s,a)$ for all triples. It sets $Q_{h}^{k+1}(s,a)=Q_{h}^{k}(s,a)$ for triples with $n_{h}^{k}(s,a)=0$ and updates the estimated value functions for each triple with positive $n_{h}^{k}(s,a)$ as follows.

Case 1: $N_{h}^{k}(s,a)<2MH(H+1)=:i_{0}$. This case implies that each agent visits each $(s,a)$ pair at step $h$ at most once during the round. Let $Q=Q_{h}^{k}(s,a)$. The server then iteratively updates $Q$ using the following assignment:

$$Q\overset{+}{\leftarrow}\eta_{t}\left(r_{h}+V_{h+1}^{k,t}+b_{t}-Q\right),\quad t=N_{h}^{k}+1,\ldots,N_{h}^{k+1}, \qquad (5)$$

and then assigns $Q$ to $Q_{h}^{k+1}(s,a)$. Here, $r_{h},N_{h}^{k},N_{h}^{k+1}$ are short for their respective values at $(s,a)$, $\eta_{t}\in(0,1]$ is the learning rate, $b_{t}>0$ is a bonus, and $V_{h+1}^{k,t}$ represents the $(t-N_{h}^{k})$-th nonzero value in $\{v_{h+1}^{m,k}(s,a)\}_{m=1}^{M}$.

Case 2: $N_{h}^{k}(s,a)\geq i_{0}$. In this case, the central server calculates the global estimate of the expected return $v_{h+1}^{k}(s,a)=\sum_{m=1}^{M}v_{h+1}^{m,k}(s,a)/n_{h}^{k}(s,a)$ and updates the $Q$-estimate as

$$Q_{h}^{k+1}=(1-\eta_{s,a}^{h,k})Q_{h}^{k}+\eta_{s,a}^{h,k}(r_{h}+v_{h+1}^{k})+\beta^{k}_{s,a,h}. \qquad (6)$$

Here, $r_{h},Q_{h}^{k},Q_{h}^{k+1},v_{h+1}^{k}$ are short for their respective values at $(s,a)$, $\eta_{s,a}^{h,k}\in(0,1]$ is the learning rate, and $\beta^{k}_{s,a,h}>0$ represents the bonus.

After updating the estimated $Q$-function, the central server updates the estimated value function and the policy as $V_{h}^{k+1}(s)=\min\{H,\max_{a^{\prime}\in\mathcal{A}}Q_{h}^{k+1}(s,a^{\prime})\}$ and $\pi_{h}^{k+1}(s)=\arg\max_{a^{\prime}\in\mathcal{A}}Q_{h}^{k+1}(s,a^{\prime})$. This update implies that FedQ-Hoeffding is an optimism-based method. The algorithm then proceeds to round $k+1$.
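The sketch below illustrates the two server-side updates (5) and (6) for a single triple. The learning rate $\eta_{t}=(H+1)/(H+t)$ and the bonus $b_{t}=c\sqrt{H^{3}\iota/t}$ shown here are the standard Hoeffding-style placeholders; the precise aggregated rate $\eta_{s,a}^{h,k}$ and bonus $\beta_{s,a,h}^{k}$ are specified in Algorithm 2 of Zheng et al. (2024) and are taken as inputs rather than reimplemented.

```python
import numpy as np

def eta(t, H):
    """Hoeffding-style step size (placeholder choice; see Algorithm 2 of Zheng et al., 2024)."""
    return (H + 1) / (H + t)

def bonus(t, H, c=2.0, iota=1.0):
    """Hoeffding-style exploration bonus b_t (placeholder choice)."""
    return c * np.sqrt(H ** 3 * iota / t)

def server_update_case1(Q, r_h, V_next_values, N_old, H):
    """Case 1 (N_h^k < i_0): sequential updates (5), one per new visit of the triple."""
    for i, v_next in enumerate(V_next_values, start=1):   # the (t - N_h^k)-th nonzero local return
        t = N_old + i
        Q += eta(t, H) * (r_h + v_next + bonus(t, H) - Q)
    return Q

def server_update_case2(Q, r_h, v_sum, n_new, eta_agg, beta):
    """Case 2 (N_h^k >= i_0): one aggregated update (6) with the averaged return v_{h+1}^k."""
    v_bar = v_sum / n_new                                  # sum_m v_{h+1}^{m,k} / n_h^k
    return (1 - eta_agg) * Q + eta_agg * (r_h + v_bar) + beta
```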

In FedQ-Hoeffding, agents send only local estimates, rather than raw trajectories, to the central server. This keeps the communication cost of each round low, at $O(MHS)$. In addition, the event-triggered termination condition with threshold (4) limits the number of new visits in each round, which Zheng et al. (2024) used to prove a linear regret speedup under worst-case MDPs. Moreover, it guarantees that the number of visits to the triple that satisfies the trigger condition increases sufficiently after the round. This is the key to proving the worst-case logarithmic communication cost in Zheng et al. (2024).

4.2 Gap-Dependent Regret

Next, we provide a new gap-dependent regret upper bound for FedQ-Hoeffding as follows.

Theorem 4.1.

For FedQ-Hoeffding (Algorithm 1 and Algorithm 2), $\mathbb{E}\left(\textnormal{Regret}(T)\right)$ can be bounded by (1).

Theorem 4.1 shows that the regret is logarithmic in $T$ for MDPs with a nonzero $\Delta_{\min}$. When $T$ is sufficiently large, this improves on the $\sqrt{T}$-type worst-case regrets in the literature for on-policy FRL.

When $M=1$, the bound reduces to $O\big(\frac{H^{6}SA\log(SAT)}{\Delta_{\min}}\big)$, which is the gap-dependent bound in Yang et al. (2021) for our single-agent counterpart, the UCB-Hoeffding algorithm. Thus, when $T$ is sufficiently large, for the average regret over all episodes, defined as $\textnormal{Regret}(T)/(MT)$, the ratio between the gap-dependent bounds for FedQ-Hoeffding and UCB-Hoeffding is $\tilde{O}(1/M)$, which serves as our error reduction rate. This is better than the rates under worst-case MDPs for on-policy FRL methods in the literature, which are $\tilde{O}(1/\sqrt{M})$ because of their linear dependence on $\sqrt{MT}$. We illustrate this pattern in the numerical experiments in Section 5.1.

Key Ideas of the Proof. Define $\mathrm{clip}[x\mid y]:=x\cdot\mathbb{I}[x\geq y]$. Note that

$$\mathbb{E}\left(\textnormal{Regret}(T)\right)=\mathbb{E}\left[\sum_{k,j,m}\sum_{h=1}^{H}\Delta_{h}(s_{h}^{k,j,m},a_{h}^{k,j,m})\right]\leq\mathbb{E}\left[\sum_{h=1}^{H}\sum_{k,j,m}\mathrm{clip}\left[\left(Q_{h}^{k}-Q_{h}^{*}\right)(s_{h}^{k,j,m},a_{h}^{k,j,m})\mid\Delta_{\min}\right]\right].$$

Now we only need to bound the term

$$\sum_{h=1}^{H}\sum_{k,j,m}\mathrm{clip}\left[\left(Q_{h}^{k}-Q_{h}^{*}\right)(s_{h}^{k,j,m},a_{h}^{k,j,m})\mid\Delta_{\min}\right].$$

To continue, we need to bound the weighted sum of estimation errors

$$\sum_{k,j,m}\omega_{h,k}\left(Q_{h}^{k}-Q_{h}^{*}\right)(s_{h}^{k,j,m},a_{h}^{k,j,m})$$

for any non-negative weights $\{\omega_{h,k}\}_{h,k}$. This can be done by recursion on $h$ with $Q_{H+1}^{k}=Q_{H+1}^{*}=0$.

4.3 Gap-Dependent Communication Cost

We first introduce two additional assumptions:
(I) Full synchronization. Similar to Zheng et al. (2024), we assume that there is no latency during communication and that the agents and server are fully synchronized (McMahan et al., 2017). This means $n^{m,k}=n^{k}$ for each agent $m$.
(II) Random initializations. We assume that the initial states $\{s_{1}^{k,j,m}\}_{k,j,m}$ are randomly generated following some distribution on $\mathcal{S}$, and that the generation is not affected by any result of the learning process.

Next, we introduce a new concept: G-MDPs.

Definition 4.1.

A G-MDP satisfies two conditions:
(a) The stationary visiting probabilities under optimal policies are unique: if both $\pi^{*,1}$ and $\pi^{*,2}$ are optimal policies, then $\mathbb{P}\left(s_{h}=s\mid\pi^{*,1}\right)=\mathbb{P}\left(s_{h}=s\mid\pi^{*,2}\right)=:\mathbb{P}_{s,h}^{*}$.
(b) Let $\mathcal{A}_{h}^{*}(s)=\{a\mid a=\arg\max_{a^{\prime}}Q_{h}^{*}(s,a^{\prime})\}$. For any $(s,h)\in\mathcal{S}\times[H]$, if $\mathbb{P}_{s,h}^{*}>0$, then $|\mathcal{A}_{h}^{*}(s)|=1$, which means that the optimal action is unique.

G-MDPs represent MDPs with generally unique optimal policies: conditions (a) and (b) above characterize this general uniqueness, and any MDP with a unique optimal policy is a G-MDP. Compared to requiring a unique optimal policy, G-MDPs allow the optimal actions to vary outside the support of the optimal policies, i.e., on the state-step pairs with $\mathbb{P}_{s,h}^{*}=0$.

For a G-MDP, we define $C_{st}=\min\{\mathbb{P}_{s,h}^{*}\mid s\in\mathcal{S},h\in[H],\mathbb{P}_{s,h}^{*}>0\}$. Thus, $0<C_{st}\leq 1$ reflects the minimum visiting probability on the support of the optimal policies. Next, we provide the gap-dependent communication cost bound.
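The quantities $\mathbb{P}_{s,h}^{*}$ and $C_{st}$ can be computed by propagating the initial-state distribution forward under an optimal policy; the following sketch assumes a fixed initial distribution `mu1` over states (consistent with the random initialization assumption (II)) and a deterministic optimal policy `pi_star`, both of which are our own illustrative inputs.

```python
import numpy as np

def visiting_probabilities(P, pi_star, mu1):
    """P: (H, S, A, S) transition kernel; pi_star: (H, S) deterministic optimal policy;
    mu1: (S,) initial-state distribution. Returns P*_{s,h} as an array of shape (H, S)."""
    H, S, _, _ = P.shape
    probs = np.zeros((H, S))
    probs[0] = mu1
    for h in range(H - 1):
        # Transition matrix under pi_star at step h: row s -> P_h(. | s, pi_star[h, s]).
        P_pi = P[h, np.arange(S), pi_star[h]]        # shape (S, S)
        probs[h + 1] = probs[h] @ P_pi
    return probs

def c_st(probs, tol=1e-12):
    """C_st: minimum nonzero visiting probability over all state-step pairs."""
    support = probs[probs > tol]
    return support.min()
```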

Theorem 4.2.

Let $\iota_{0}=\log(SAT/p)$. For FedQ-Hoeffding (Algorithm 1 and Algorithm 2) under the full synchronization and random initialization assumptions, for any given G-MDP and any $p\in(0,1)$, with probability at least $1-p$, the number of rounds $K$ is upper bounded by (2).

We can get the upper bound on the total communication cost by multiplying the upper bound in (2) by $O(MHS)$, the communication cost of each round in FedQ-Hoeffding.

Compared to existing worst-case costs whose $\log T$ terms depend on $SA$ (Zheng et al., 2025a; Labbi et al., 2024) or $MSA$ (Zheng et al., 2024, 2025a; Qiao et al., 2022), (2) is better when $T$ is sufficiently large, since the first three terms depend only logarithmically on $1/\Delta_{\min}$ and $\log T$, and the last term, which is logarithmic in $T$, removes the dependence on $MSA$. Moreover, (2) highlights the cost of the different procedures in FedQ-Hoeffding: the first two terms represent the cost of exploration, and the last two terms show the cost of exploiting the optimal policies. Our numerical experiments in Section 5.2 also demonstrate that the $\log T$ term in the communication cost is independent of $M$, $S$, and $A$.

Key Ideas of the Proof. For a triple $(s,a,h)$ that satisfies the trigger condition in round $k$ with $N_{h}^{k}(s,a)>i_{1}$ for some threshold $i_{1}$, we can demonstrate an agent-wise simultaneous sufficient increase in visits, which allows us to eliminate the linear dependence on $M$ from the last three terms.

In the exploitation period, for a triple $(s,a,h)$ with $\mathbb{P}_{s,h}^{*}>0$ and $a\in\mathcal{A}_{h}^{*}(s)$ that satisfies the trigger condition in round $k$ with $N_{h}^{k}(s,a)>i_{2}$ for some threshold $i_{2}>i_{1}$, we can show a state-wise simultaneous sufficient increase in visits, which helps remove the linear dependence on $S$ and $A$ from the last term.

Since FedQ-Hoeffding implements a fixed policy in each round, when $M=1$, the algorithm reduces to a single-agent algorithm with a low global switching cost, as shown in Corollary 4.1.

Corollary 4.1.

For the FedQ-Hoeffding algorithm (Algorithm 1 and Algorithm 2) under full synchronization, random initialization, and $M=1$, for any given G-MDP and any $p\in(0,1)$, with probability at least $1-p$, the global switching cost can be bounded by (2).

Given that the costs of existing single-agent model-free algorithms depend on $SA$ (Bai et al., 2019; Zhang et al., 2020) or $S$ (Zheng et al., 2025b) in the $\log T$ term (in the literature, these bounds are stated for the local switching cost, which counts the state-step pairs at which the policy switches; the local switching cost is greater than or equal to our global switching cost, but these works did not establish tighter bounds for the global switching cost; see Bai et al. (2019) for more information), our $\log T$ dependence is better, removing the factor $SA$.

At the end of this section, we also remark on FedQ-Bernstein, another on-policy FRL algorithm in Zheng et al. (2024). It differs from FedQ-Hoeffding in the choice of bonuses ($b_{t}$ and $\beta_{s,a,h}^{k}$) and in its use of variance estimators. We can show that FedQ-Bernstein also reaches the gap-dependent bounds in (1) and (2). In Zheng et al. (2024), the worst-case regret bound of FedQ-Bernstein is better than that of FedQ-Hoeffding by a factor of $\sqrt{H}$, and the two methods share the same worst-case communication cost bound. Whether FedQ-Bernstein can reach better gap-dependent bounds remains an open question.

5 Numerical Experiments

In this section, we conduct numerical experiments. (All experiments are run on a server with an Intel Xeon E5-2650v4 (2.2 GHz) CPU and 100 cores; each replication is limited to a single core and 50 GB of RAM.) All experiments use a synthetic environment to demonstrate the $\log T$-type regret of the FedQ-Hoeffding algorithm (Zheng et al., 2024) and the reduced communication cost bound, whose main $O(\log T)$ term has a coefficient independent of $M$, $S$, and $A$. Following Zheng et al. (2024), we generate a synthetic tabular episodic MDP as follows. After setting $H$, $S$, and $A$, the reward $r_{h}(s,a)$ for each $(s,a,h)$ is generated independently and uniformly at random from $[0,1]$, and $\mathbb{P}_{h}(\cdot\mid s,a)$ is generated independently and uniformly at random on the $S$-dimensional simplex for each $(s,a,h)$. We set the constant $c$ in the bonus term $b_{t}$ to 2 and $\iota=1$. We first demonstrate the $\log T$-type regret of the FedQ-Hoeffding algorithm.
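The synthetic environment described above can be generated as in the sketch below; sampling each transition distribution uniformly from the $S$-dimensional simplex is done with a symmetric Dirichlet(1, ..., 1) draw, and the seed handling is our own.

```python
import numpy as np

def generate_synthetic_mdp(H, S, A, seed=0):
    """Random tabular MDP as in Section 5: rewards ~ Uniform[0, 1],
    each P_h(. | s, a) drawn uniformly from the S-dimensional simplex."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(0.0, 1.0, size=(H, S, A))
    P = rng.dirichlet(np.ones(S), size=(H, S, A))   # shape (H, S, A, S); rows sum to 1
    return P, r
```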

5.1 $\log T$-type Regret and Speedup

In this section, we show that the regret for any given MDP follows a $\log T$ pattern. We consider two values of the triple $(H,S,A)$: $(2,2,2)$ and $(5,3,2)$. For the FedQ-Hoeffding algorithm, we set the number of agents to $M=10$ and generate $T/H=10^{7}$ episodes for each agent, resulting in a total of $10^{8}$ episodes. Additionally, to show the linear speedup effect, we conduct experiments with its single-agent version, the UCB-Hoeffding algorithm (Jin et al., 2018), where all conditions except $M=1$ remain the same. To show error bars, we collect 10 sample paths for each algorithm under the same MDP environment. The regret results are shown in Figure 1 and Figure 2. For both algorithms, the solid line represents the median of the 10 sample paths, while the shaded area shows the 10th and 90th percentiles.

Figure 1: Regret results for $H=2$, $S=2$, and $A=2$. The left panel shows $\textnormal{Regret}(T)$ versus $T/H$, while the right panel shows $\textnormal{Regret}(T)/\log(T/H+1)$ versus $T/H$. In both plots, the yellow line represents the regret of the FedQ-Hoeffding algorithm, the red line represents the regret of the UCB-Hoeffding algorithm, and the blue line denotes the adjusted regret of the FedQ-Hoeffding algorithm, obtained by dividing the regret of the yellow line by $\sqrt{M}$.
Figure 2: Regret results for $H=5$, $S=3$, and $A=2$. The left panel shows $\textnormal{Regret}(T)$ versus $T/H$, while the right panel shows $\textnormal{Regret}(T)/\log(T/H+1)$ versus $T/H$. In both plots, the yellow line represents the regret of the FedQ-Hoeffding algorithm, the red line represents the regret of the UCB-Hoeffding algorithm, and the blue line denotes the adjusted regret of the FedQ-Hoeffding algorithm, obtained by dividing the regret of the yellow line by $\sqrt{M}$.

From the two groups of plots, we observe that the yellow lines in the right panels of Figure 1 and Figure 2 approach horizontal lines as $T/H$ becomes sufficiently large. Since the y-axis in these panels is $\textnormal{Regret}(T)/\log(T/H+1)$, we conclude that the regret of the FedQ-Hoeffding algorithm follows a $\log T$-type pattern for any given MDP, rather than the $\sqrt{MT}$ pattern shown in Theorem 4.1 of Zheng et al. (2024). This is consistent with the logarithmic regret result presented in our Theorem 4.1. Furthermore, as $T/H$ becomes sufficiently large, the adjusted regret of FedQ-Hoeffding (blue lines) for both choices of $(H,S,A)$ is significantly lower than the corresponding regret of the single-agent version, UCB-Hoeffding (red lines). This further supports the conclusion that the regret of FedQ-Hoeffding does not follow a $\sqrt{MT}$ pattern; otherwise, the blue and red lines would be close to each other. Finally, as $T/H$ grows larger, the yellow and red lines become close, confirming that the regret of FedQ-Hoeffding approaches that of UCB-Hoeffding as $T$ becomes sufficiently large. This also supports the error reduction rate $\tilde{O}(1/M)$ for the gap-dependent regret.

5.2 Dependency of Communication Cost on $M$, $S$, and $A$

In this section, we demonstrate that the coefficient of the $\log T$ term in the communication cost is independent of $M$, $S$, and $A$. To eliminate the influence of lower-order terms such as $\log(\log T)$ and $\sqrt{\log T}$ in Theorem 4.2, we focus exclusively on the communication cost for sufficiently large values of $T$.

5.2.1 Dependency on $M$

To explore the dependency of communication cost on $M$, we set $(H,S,A)=(2,2,2)$ and let $M$ take values in $\{2,4,6,8\}$. We generate $10^{7}$ episodes for each agent and only consider the communication cost after $5\times 10^{5}$ episodes. Figure 3 shows the results for each $M$ after $5\times 10^{5}$ episodes.

Figure 3: Number of communication rounds vs. log-number of episodes for different values of $M$ with $H=2$, $S=2$, and $A=2$. Each solid line represents the number of communication rounds for each value of $M\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes, while the dashed line represents the corresponding fitted line for each $M$.

In Figure 3, each solid line represents the communication cost for each value of $M\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes, while the dashed line represents the corresponding fitted line. Since the x-axis is the log-number of episodes, $\log(T/H)$, the slope of the fitted line closely approximates the coefficient of the $\log T$ term in the communication cost when $\log T$ is sufficiently large. We observe that the slopes of these fitted lines are very similar, which indicates that for any given MDP, the coefficient of the $\log T$ term in the communication cost is independent of $M$. If the coefficient were linearly dependent on $M$, as in the worst-case bound of Zheng et al. (2024), the slope of the fitted line for $M=8$ would be nearly four times that for $M=2$.
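The slopes reported in Figures 3 through 5 can be obtained with an ordinary least-squares fit of the round counts against the log-number of episodes; the sketch below uses hypothetical input arrays and a burn-in that discards the first $5\times 10^{5}$ episodes, as in our experiments.

```python
import numpy as np

def fitted_slope(episode_counts, comm_rounds, burn_in=5 * 10**5):
    """Fit comm_rounds ~ slope * log(T / H) + intercept on episodes after the burn-in,
    so that the slope estimates the coefficient of the log T term."""
    episode_counts = np.asarray(episode_counts, dtype=float)
    comm_rounds = np.asarray(comm_rounds, dtype=float)
    mask = episode_counts >= burn_in
    slope, intercept = np.polyfit(np.log(episode_counts[mask]), comm_rounds[mask], deg=1)
    return slope, intercept
```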

5.2.2 Dependency on $S$

To explore the dependency of communication cost on $S$, we set $(H,A,M)=(2,2,2)$ and let $S$ take values in $\{2,4,6,8\}$. We generate $10^{7}$ episodes for each agent and only consider the communication cost after $5\times 10^{5}$ episodes. Figure 4 shows the communication cost results for each $S$ after $5\times 10^{5}$ episodes.

Figure 4: Number of communication rounds vs. log-number of episodes for different values of $S$ with $H=2$, $A=2$, and $M=2$. Each solid line represents the number of communication rounds for each value of $S\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes. The dashed line represents the fitted line for each $S$.

In Figure 4, each solid line represents the communication cost for each value of $S\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes, while the dashed line represents the corresponding fitted line. Since the x-axis is the log-number of episodes, $\log(T/H)$, the slope of the fitted line closely approximates the coefficient of the $\log T$ term in the communication cost when $\log T$ is sufficiently large. We observe that the slopes of these fitted lines are very similar, which indicates that for any given MDP, the coefficient of the $\log T$ term in the communication cost is independent of $S$.

5.2.3 Dependency on $A$

To explore the dependency of communication cost on $A$, we set $(H,S,M)=(2,2,2)$ and let $A$ take values in $\{2,4,6,8\}$. We generate $10^{7}$ episodes for each agent and only consider the communication cost after $5\times 10^{5}$ episodes. Figure 5 shows the communication cost results for each $A$ after $5\times 10^{5}$ episodes.

In Figure 5, each solid line represents the communication cost for each value of $A\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes, while the dashed line represents the corresponding fitted line. Since the x-axis is the log-number of episodes, $\log(T/H)$, the slope of the fitted line closely approximates the coefficient of the $\log T$ term in the communication cost when $\log T$ is sufficiently large. We observe that the slopes of these fitted lines are very similar, which indicates that for any given MDP, the coefficient of the $\log T$ term in the communication cost is independent of $A$.

Figure 5: Number of communication rounds vs. log-number of episodes for different values of $A$ with $H=2$, $S=2$, and $M=2$. Each solid line represents the number of communication rounds for each value of $A\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes. The dashed line represents the fitted line for each $A$.

6 Conclusion

In this paper, we establish the first gap-dependent bounds on regret and communication cost for on-policy federated $Q$-Learning in tabular episodic finite-horizon MDPs, addressing two important open questions in the literature. While existing FRL methods focus on worst-case MDPs, we show that when MDPs exhibit benign structures, such as a strictly positive suboptimality gap, these bounds can be significantly improved. Specifically, we prove that both FedQ-Hoeffding and FedQ-Bernstein achieve $\log T$-type regret. Additionally, we derive a gap-dependent communication cost bound that disentangles exploration and exploitation, with the $\log T$ term in the communication cost being independent of $M$, $S$, and $A$. This makes our work the first in the on-policy FRL literature to achieve such a low communication cost. When $M=1$, our gap-dependent communication cost bound also yields a better global switching cost, removing the dependence on $SA$ from the $\log T$ term.

References

  • Agarwal et al. (2020) Agarwal, A., Kakade, S., and Yang, L. F. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pp.  67–83. PMLR, 2020.
  • Agarwal et al. (2021) Agarwal, M., Ganguly, B., and Aggarwal, V. Communication efficient parallel reinforcement learning. In Uncertainty in Artificial Intelligence, pp.  247–256. PMLR, 2021.
  • Agrawal & Jia (2017) Agrawal, S. and Jia, R. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. Advances in Neural Information Processing Systems, 30, 2017.
  • Al Marjani et al. (2021) Al Marjani, A., Garivier, A., and Proutiere, A. Navigating to the best policy in markov decision processes. In Advances in Neural Information Processing Systems, pp.  25852–25864, 2021.
  • Anwar & Raychowdhury (2021) Anwar, A. and Raychowdhury, A. Multi-task federated reinforcement learning with adversaries. arXiv preprint arXiv:2103.06473, 2021.
  • Ashouri et al. (2018) Ashouri, A. H., Killian, W., Cavazos, J., Palermo, G., and Silvano, C. A survey on compiler autotuning using machine learning. ACM Computing Surveys (CSUR), 51(5):1–42, 2018.
  • Assran et al. (2019) Assran, M., Romoff, J., Ballas, N., Pineau, J., and Rabbat, M. Gossip-based actor-learner architectures for deep reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Auer & Ortner (2007) Auer, P. and Ortner, R. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pp.  49–56. MIT Press, 2007.
  • Auer et al. (2008) Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21, 2008.
  • Azar et al. (2017) Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp.  263–272. PMLR, 2017.
  • Bai et al. (2019) Bai, Y., Xie, T., Jiang, N., and Wang, Y.-X. Provably efficient q-learning with low switching cost. Advances in Neural Information Processing Systems, 32, 2019.
  • Banerjee et al. (2021) Banerjee, S., Bouzefrane, S., and Abane, A. Identity management with hybrid blockchain approach: A deliberate extension with federated-inverse-reinforcement learning. In 2021 IEEE 22nd International Conference on High Performance Switching and Routing (HPSR), pp.  1–6. IEEE, 2021.
  • Beikmohammadi et al. (2024) Beikmohammadi, A., Khirirat, S., and Magnússon, S. Compressed federated reinforcement learning with a generative model. arXiv preprint arXiv:2404.10635, 2024.
  • Chen et al. (2021a) Chen, T., Zhang, K., Giannakis, G. B., and Başar, T. Communication-efficient policy gradient methods for distributed reinforcement learning. IEEE Transactions on Control of Network Systems, 9(2):917–929, 2021a.
  • Chen et al. (2023) Chen, Y., Zhang, X., Zhang, K., Wang, M., and Zhu, X. Byzantine-robust online and offline distributed reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp.  3230–3269. PMLR, 2023.
  • Chen et al. (2021b) Chen, Z., Zhou, Y., and Chen, R. Multi-agent off-policy tdc with near-optimal sample and communication complexity. In 2021 55th Asilomar Conference on Signals, Systems, and Computers, pp.  504–508. IEEE, 2021b.
  • Chen et al. (2022) Chen, Z., Zhou, Y., Chen, R.-R., and Zou, S. Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis. In International Conference on Machine Learning, pp.  3794–3834. PMLR, 2022.
  • Dann et al. (2019) Dann, C., Li, L., Wei, W., and Brunskill, E. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pp.  1507–1516. PMLR, 2019.
  • Dann et al. (2021) Dann, C., Marinov, T. V., Mohri, M., and Zimmert, J. Beyond value-function gaps: Improved instance-dependent regret bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp.  1–12, 2021.
  • Doan et al. (2019) Doan, T., Maguluri, S., and Romberg, J. Finite-time analysis of distributed td (0) with linear function approximation on multi-agent reinforcement learning. In International Conference on Machine Learning, pp.  1626–1635. PMLR, 2019.
  • Doan et al. (2021) Doan, T. T., Maguluri, S. T., and Romberg, J. Finite-time performance of distributed temporal-difference learning with linear function approximation. SIAM Journal on Mathematics of Data Science, 3(1):298–320, 2021.
  • Domingues et al. (2021) Domingues, O. D., Ménard, P., Kaufmann, E., and Valko, M. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pp.  578–598. PMLR, 2021.
  • Du et al. (2017) Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning, pp.  1049–1058. PMLR, 2017.
  • Dubey & Pentland (2021) Dubey, A. and Pentland, A. Provably efficient cooperative multi-agent reinforcement learning with function approximation. arXiv preprint arXiv:2103.04972, 2021.
  • Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp.  1407–1416. PMLR, 2018.
  • Fan et al. (2021) Fan, F. X., Ma, Y., Dai, Z., Jing, W., Tan, C., and Low, B. K. H. Fault-tolerant federated reinforcement learning with theoretical guarantee. Advances in Neural Information Processing Systems, 34:1007–1021, 2021.
  • Fan et al. (2023) Fan, F. X., Ma, Y., Dai, Z., Tan, C., and Low, B. K. H. Fedhql: Federated heterogeneous q-learning. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, pp.  2810–2812, 2023.
  • Ganesh et al. (2024) Ganesh, S., Chen, J., Thoppe, G., and Aggarwal, V. Global convergence guarantees for federated policy gradient methods with adversaries. arXiv preprint arXiv:2403.09940, 2024.
  • Gao et al. (2019) Gao, Z., Han, Y., Ren, Z., and Zhou, Z. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019.
  • Gong et al. (2023) Gong, W., Cao, L., Zhu, Y., Zuo, F., He, X., and Zhou, H. Federated inverse reinforcement learning for smart icus with differential privacy. IEEE Internet of Things Journal, 10(21):19117–19124, 2023.
  • Gower et al. (2020) Gower, R. M., Schmidt, M., Bach, F., and Richtárik, P. Variance-reduced methods for machine learning. Proceedings of the IEEE, 108(11):1968–1983, 2020.
  • Guo & Brunskill (2015) Guo, Z. and Brunskill, E. Concurrent pac rl. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, pp.  2624–2630, 2015.
  • He et al. (2021) He, J., Zhou, D., and Gu, Q. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pp.  4171–4180. PMLR, 2021.
  • Hsu et al. (2024) Hsu, H.-L., Wang, W., Pajic, M., and Xu, P. Randomized exploration in cooperative multi-agent reinforcement learning. arXiv preprint arXiv:2404.10728, 2024.
  • Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
  • Jin et al. (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018.
  • Jin et al. (2020) Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on learning theory, pp.  2137–2143. PMLR, 2020.
  • Jin et al. (2022) Jin, H., Peng, Y., Yang, W., Wang, S., and Zhang, Z. Federated reinforcement learning with environment heterogeneity. In International Conference on Artificial Intelligence and Statistics, pp.  18–37. PMLR, 2022.
  • Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.
  • Jonsson et al. (2020) Jonsson, A., Kaufmann, E., Ménard, P., Darwiche Domingues, O., Leurent, E., and Valko, M. Planning in markov decision processes with gap-dependent sample complexity. In Advances in Neural Information Processing Systems, pp.  1253–1263, 2020.
  • Kakade et al. (2018) Kakade, S., Wang, M., and Yang, L. F. Variance reduction methods for sublinear reinforcement learning. arXiv preprint arXiv:1802.09184, 2018.
  • Khamaru et al. (2021) Khamaru, K., Pananjady, A., Ruan, F., Wainwright, M. J., and Jordan, M. I. Is temporal difference learning optimal? an instance-dependent analysis. SIAM Journal on Mathematics of Data Science, 3(4):1013–1040, 2021.
  • Khodadadian et al. (2022) Khodadadian, S., Sharma, P., Joshi, G., and Maguluri, S. T. Federated reinforcement learning: Linear speedup under markovian sampling. In International Conference on Machine Learning, pp.  10997–11057. PMLR, 2022.
  • Krishnan et al. (2018) Krishnan, S., Yang, Z., Goldberg, K., Hellerstein, J., and Stoica, I. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196, 2018.
  • Labbi et al. (2024) Labbi, S., Tiapkin, D., Mancini, L., Mangold, P., and Moulines, E. Federated ucbvi: Communication-efficient federated regret minimization with heterogeneous agents. arXiv preprint arXiv:2410.22908, 2024.
  • Lan et al. (2023) Lan, G., Wang, H., Anderson, J., Brinton, C., and Aggarwal, V. Improved communication efficiency in federated natural policy gradient via admm-based gradient updates. arXiv preprint arXiv:2310.19807, 2023.
  • Lan et al. (2024) Lan, G., Han, D.-J., Hashemi, A., Aggarwal, V., and Brinton, C. G. Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis. arXiv preprint arXiv:2404.08003, 2024.
  • Li et al. (2020) Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction. Advances in Neural Information Processing Systems, 33:7031–7043, 2020.
  • Li et al. (2021) Li, G., Shi, L., Chen, Y., Gu, Y., and Chi, Y. Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Advances in Neural Information Processing Systems, 34:17762–17776, 2021.
  • Liu & Olshevsky (2023) Liu, R. and Olshevsky, A. Distributed td (0) with almost no communication. IEEE Control Systems Letters, 2023.
  • Liu & Zhu (2022) Liu, S. and Zhu, M. Distributed inverse constrained reinforcement learning for multi-agent systems. Advances in Neural Information Processing Systems, 35:33444–33456, 2022.
  • Liu & Zhu (2023) Liu, S. and Zhu, M. Meta inverse constrained reinforcement learning: Convergence guarantee and generalization analysis. In The Twelfth International Conference on Learning Representations, 2023.
  • Liu & Zhu (2024) Liu, S. and Zhu, M. Learning multi-agent behaviors from distributed and streaming demonstrations. Advances in Neural Information Processing Systems, 36, 2024.
  • Liu & Zhu (2025) Liu, S. and Zhu, M. In-trajectory inverse reinforcement learning: Learn incrementally before an ongoing trajectory terminates. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2025.
  • Marjani & Proutiere (2020) Marjani, A. and Proutiere, A. Best policy identification in discounted mdps: Problem-specific sample complexity. arXiv preprint arXiv:2009.13405, 2020.
  • McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pp.  1273–1282. PMLR, 2017.
  • Ménard et al. (2021) Ménard, P., Domingues, O. D., Shang, X., and Valko, M. Ucb momentum q-learning: Correcting the bias without forgetting. In International Conference on Machine Learning, pp.  7609–7618. PMLR, 2021.
  • Min et al. (2023) Min, Y., He, J., Wang, T., and Gu, Q. Cooperative multi-agent reinforcement learning: Asynchronous communication and linear function approximation. In International Conference on Machine Learning, pp.  24785–24811. PMLR, 2023.
  • Mirhoseini et al. (2017) Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean, J. Device placement optimization with reinforcement learning. In International Conference on Machine Learning, pp.  2430–2439. PMLR, 2017.
  • Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp.  1928–1937. PMLR, 2016.
  • Nguyen et al. (2017) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pp.  2613–2621. PMLR, 2017.
  • Nguyen et al. (2019) Nguyen, P., Tran, T., Gupta, S., Rana, S., Barnett, M., and Venkatesh, S. Incomplete conditional density estimation for fast materials discovery. In Proceedings of the 2019 SIAM International Conference on Data Mining, pp.  549–557. SIAM, 2019.
  • Nguyen-Tang et al. (2023) Nguyen-Tang, T., Yin, M., Gupta, S., Venkatesh, S., and Arora, R. On instance-dependent bounds for offline reinforcement learning with linear function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  9310–9318, 2023.
  • Ok et al. (2018) Ok, J., Proutiere, A., and Tranos, D. Exploration in structured reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.
  • Perchet et al. (2016) Perchet, V., Rigollet, P., Chassang, S., and Snowberg, E. Batched bandit problems. The Annals of Statistics, 44(2):660 – 681, 2016.
  • Qiao et al. (2022) Qiao, D., Yin, M., Min, M., and Wang, Y.-X. Sample-efficient reinforcement learning with loglog (t) switching cost. In International Conference on Machine Learning, pp.  18031–18061. PMLR, 2022.
  • Salgia & Chi (2024) Salgia, S. and Chi, Y. The sample-communication complexity trade-off in federated q-learning. In Advances in Neural Information Processing Systems, 2024.
  • Shen et al. (2023) Shen, H., Zhang, K., Hong, M., and Chen, T. Towards understanding asynchronous advantage actor-critic: Convergence and linear speedup. IEEE Transactions on Signal Processing, 2023.
  • Shi et al. (2022) Shi, L., Li, G., Wei, Y., Chen, Y., and Chi, Y. Pessimistic q-learning for offline reinforcement learning: Towards optimal sample complexity. In International Conference on Machine Learning, pp.  19967–20025. PMLR, 2022.
  • Sidford et al. (2018) Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. Near-optimal time and sample complexities for solving markov decision processes with a generative model. Advances in Neural Information Processing Systems, 31, 2018.
  • Sidford et al. (2023) Sidford, A., Wang, M., Wu, X., and Ye, Y. Variance reduced value iteration and faster algorithms for solving markov decision processes. Naval Research Logistics (NRL), 70(5):423–442, 2023.
  • Simchowitz & Jamieson (2019) Simchowitz, M. and Jamieson, K. G. Non-asymptotic gap-dependent regret bounds for tabular mdps. In Advances in Neural Information Processing Systems, 2019.
  • Sun et al. (2020) Sun, J., Wang, G., Giannakis, G. B., Yang, Q., and Yang, Z. Finite-time analysis of decentralized temporal-difference learning with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp.  4485–4495. PMLR, 2020.
  • Sutton & Barto (2018) Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, 2018.
  • Tewari & Bartlett (2008) Tewari, A. and Bartlett, P. Optimistic linear programming gives logarithmic regret for irreducible mdps. In Advances in Neural Information Processing Systems, pp.  1505–1512, 2008.
  • Tirinzoni et al. (2022) Tirinzoni, A., Al Marjani, A., and Kaufmann, E. Near instance-optimal pac reinforcement learning for deterministic mdps. In Advances in Neural Information Processing Systems, pp.  8785–8798, 2022.
  • Tirinzoni et al. (2023) Tirinzoni, A., Al-Marjani, A., and Kaufmann, E. Optimistic pac reinforcement learning: the instance-dependent view. In International Conference on Algorithmic Learning Theory, pp.  1460–1480. PMLR, 2023.
  • Wagenmaker & Jamieson (2022) Wagenmaker, A. and Jamieson, K. G. Instance-dependent near-optimal policy identification in linear mdps via online experiment design. Advances in Neural Information Processing Systems, 35:5968–5981, 2022.
  • Wagenmaker et al. (2022a) Wagenmaker, A. J., Chen, Y., Simchowitz, M., Du, S., and Jamieson, K. First-order regret in reinforcement learning with linear function approximation: A robust estimation approach. In International Conference on Machine Learning, pp.  22384–22429. PMLR, 2022a.
  • Wagenmaker et al. (2022b) Wagenmaker, A. J., Simchowitz, M., and Jamieson, K. Beyond no regret: Instance-dependent pac reinforcement learning. In Conference on Learning Theory, pp.  358–418. PMLR, 2022b.
  • Wai (2020) Wai, H.-T. On the convergence of consensus algorithms with markovian noise and gradient bias. In 2020 59th IEEE Conference on Decision and Control (CDC), pp.  4897–4902. IEEE, 2020.
  • Wai et al. (2019) Wai, H.-T., Hong, M., Yang, Z., Wang, Z., and Tang, K. Variance reduced policy evaluation with smooth function approximation. Advances in Neural Information Processing Systems, 32, 2019.
  • Wainwright (2019) Wainwright, M. J. Variance-reduced qq-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
  • Wang et al. (2020) Wang, G., Lu, S., Giannakis, G., Tesauro, G., and Sun, J. Decentralized td tracking with linear function approximation and its finite-time analysis. Advances in Neural Information Processing Systems, 33:13762–13772, 2020.
  • Wang et al. (2021) Wang, T., Zhou, D., and Gu, Q. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. Advances in Neural Information Processing Systems, 34:13524–13536, 2021.
  • Wang et al. (2022) Wang, X., Cui, Q., and Du, S. S. On gap-dependent bounds for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:14865–14877, 2022.
  • Watkins (1989) Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, 1989.
  • Woo et al. (2023) Woo, J., Joshi, G., and Chi, Y. The blessing of heterogeneity in federated q-learning: Linear speedup and beyond. In International Conference on Machine Learning, pp.  37157–37216, 2023.
  • Woo et al. (2024) Woo, J., Shi, L., Joshi, G., and Chi, Y. Federated offline reinforcement learning: Collaborative single-policy coverage suffices. In International Conference on Machine Learning, pp.  53165–53201, 2024.
  • Wu et al. (2021) Wu, Z., Shen, H., Chen, T., and Ling, Q. Byzantine-resilient decentralized policy evaluation with linear function approximation. IEEE Transactions on Signal Processing, 69:3839–3853, 2021.
  • Xu et al. (2021) Xu, H., Ma, T., and Du, S. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap. In Conference on Learning Theory, pp.  4438–4472. PMLR, 2021.
  • Xu et al. (2020) Xu, T., Wang, Z., Zhou, Y., and Liang, Y. Reanalysis of variance reduced temporal difference learning. In International Conference on Learning Representations, 2020.
  • Yan et al. (2023) Yan, Y., Li, G., Chen, Y., and Fan, J. The efficacy of pessimism in asynchronous q-learning. IEEE Transactions on Information Theory, 2023.
  • Yang et al. (2021) Yang, K., Yang, L., and Du, S. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pp.  1576–1584. PMLR, 2021.
  • Yang et al. (2023) Yang, T., Cen, S., Wei, Y., Chen, Y., and Chi, Y. Federated natural policy gradient methods for multi-task reinforcement learning. arXiv preprint arXiv:2311.00201, 2023.
  • Yin et al. (2021) Yin, M., Bai, Y., and Wang, Y.-X. Near-optimal offline reinforcement learning via double variance reduction. In Advances in Neural Information Processing Systems, pp.  7677–7688, 2021.
  • Zanette & Brunskill (2019) Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp.  7304–7312. PMLR, 2019.
  • Zeng et al. (2021) Zeng, S., Doan, T. T., and Romberg, J. Finite-time analysis of decentralized stochastic approximation with applications in multi-agent and multi-task learning. In 2021 60th IEEE Conference on Decision and Control (CDC), pp.  2641–2646. IEEE, 2021.
  • Zhang et al. (2024a) Zhang, C., Wang, H., Mitra, A., and Anderson, J. Finite-time analysis of on-policy heterogeneous federated reinforcement learning. arXiv preprint arXiv:2401.15273, 2024a.
  • Zhang et al. (2020) Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33:15198–15207, 2020.
  • Zhang et al. (2021) Zhang, Z., Ji, X., and Du, S. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. In Conference on Learning Theory, pp.  4528–4531. PMLR, 2021.
  • Zhang et al. (2022) Zhang, Z., Jiang, Y., Zhou, Y., and Ji, X. Near-optimal regret bounds for multi-batch reinforcement learning. Advances in Neural Information Processing Systems, 35:24586–24596, 2022.
  • Zhang et al. (2024b) Zhang, Z., Chen, Y., Lee, J. D., and Du, S. S. Settling the sample complexity of online reinforcement learning. In Conference on Learning Theory, pp.  5213–5219. PMLR, 2024b.
  • Zhao et al. (2023) Zhao, F., Ren, X., Yang, S., Zhao, P., Zhang, R., and Xu, X. Federated multi-objective reinforcement learning. Information Sciences, 624:811–832, 2023.
  • Zheng et al. (2024) Zheng, Z., Gao, F., Xue, L., and Yang, J. Federated q-learning: Linear regret speedup with low communication cost. In The Twelfth International Conference on Learning Representations, 2024.
  • Zheng et al. (2025a) Zheng, Z., Zhang, H., and Xue, L. Federated q-learning with reference-advantage decomposition: Almost optimal regret and logarithmic communication cost. In The Thirteenth International Conference on Learning Representations, 2025a.
  • Zheng et al. (2025b) Zheng, Z., Zhang, H., and Xue, L. Gap-dependent bounds for q-learning using reference-advantage decomposition. In The Thirteenth International Conference on Learning Representations, 2025b.
  • Zhou et al. (2023) Zhou, R., Zihan, Z., and Du, S. S. Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments. In International Conference on Machine Learning, pp.  42878–42914. PMLR, 2023.

In this appendix, we provide further details of the FedQ-Hoeffding and FedQ-Bernstein algorithms.

Appendix A Algorithm Review

A.1 FedQ-Hoeffding Algorithm

In this section, we present more details for Section 4.1. Denote \eta_{t}=\frac{H+1}{H+t}, \eta_{0}^{0}=1, \eta^{t}_{0}=0 for t\geq 1, and \eta^{t}_{i}=\eta_{i}\prod_{i^{\prime}=i+1}^{t}(1-\eta_{i^{\prime}}) for all 1\leq i\leq t. We also denote \eta^{c}(t_{1},t_{2})=\prod_{t=t_{1}}^{t_{2}}(1-\eta_{t}) for any positive integers t_{1}\leq t_{2}. After receiving the information from each agent m, for each triple (s,a,h) visited by the agents, the server sets \eta^{h,k}_{s,a}=1-\eta^{c}\big(N_{h}^{k}(s,a)+1,N_{h}^{k+1}(s,a)\big) and \beta^{k}_{s,a,h}=\sum_{t=t^{k-1}+1}^{t^{k}}\eta^{t^{k}}_{t}b_{t}, where the per-visit confidence bound is b_{t}=c\sqrt{H^{3}\iota/t} for some sufficiently large constant c>0.
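To make these definitions concrete, the following minimal Python sketch (helper names are ours and purely illustrative) computes the step sizes \eta_t, the weights \eta^t_i, the products \eta^c(t_1,t_2), and the aggregated bonus \beta^k_{s,a,h}, under the assumption that t^{k-1} and t^k stand for N_h^k(s,a) and N_h^{k+1}(s,a).

import math

def eta(t, H):
    # Step size eta_t = (H + 1) / (H + t).
    return (H + 1) / (H + t)

def eta_c(t1, t2, H):
    # eta^c(t1, t2) = prod_{t = t1}^{t2} (1 - eta_t); empty product (= 1) if t1 > t2.
    prod = 1.0
    for t in range(t1, t2 + 1):
        prod *= 1.0 - eta(t, H)
    return prod

def eta_weight(i, t, H):
    # eta_i^t = eta_i * prod_{i' = i+1}^{t} (1 - eta_{i'}); eta_0^0 = 1 and eta_0^t = 0 for t >= 1.
    if i == 0:
        return 1.0 if t == 0 else 0.0
    return eta(i, H) * eta_c(i + 1, t, H)

def hoeffding_bonus(t, H, iota, c=1.0):
    # Per-visit bonus b_t = c * sqrt(H^3 * iota / t).
    return c * math.sqrt(H ** 3 * iota / t)

def aggregated_bonus(N_prev, N_new, H, iota, c=1.0):
    # beta_{s,a,h}^k = sum_{t = N_prev + 1}^{N_new} eta_t^{N_new} * b_t,
    # with N_prev = N_h^k(s,a) and N_new = N_h^{k+1}(s,a).
    return sum(eta_weight(t, N_new, H) * hoeffding_bonus(t, H, iota, c)
               for t in range(N_prev + 1, N_new + 1))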

Case 1: N_{h}^{k}(s,a)<2MH(H+1)=:i_{0}. In this case, each agent visits (s,a) at step h at most once during the k-th round. We denote 1\leq m_{1}<m_{2}<\ldots<m_{N_{h}^{k+1}-N_{h}^{k}}\leq M as the indices of the agents with n_{h}^{m,k}(s,a)>0. The server then updates the global estimate of the action values sequentially as follows:

Q_{h}^{k+1}(s,a)=(1-\eta_{t})Q_{h}^{k}(s,a)+\eta_{t}\big(r_{h}(s,a)+v_{h+1}^{m_{t-N_{h}^{k}(s,a)},k}(s,a)+b_{t}\big),\quad t=N_{h}^{k}(s,a)+1,\ldots,N_{h}^{k+1}(s,a). (7)

Case 2: Nhk(s,a)i0N_{h}^{k}(s,a)\geq i_{0}. In this case, the central server calculates

vh+1k(s,a)=m=1Mvh+1m,k(s,a)/nhk(s,a)v_{h+1}^{k}(s,a)=\sum_{m=1}^{M}v_{h+1}^{m,k}(s,a)/n_{h}^{k}(s,a)

and updates

Qhk+1(s,a)=(1ηs,ah,k)Qhk(s,a)+ηs,ah,k(rh(s,a)+vh+1k(s,a))+βs,a,hk.Q_{h}^{k+1}(s,a)=(1-\eta^{h,k}_{s,a})Q_{h}^{k}(s,a)+\eta^{h,k}_{s,a}\left(r_{h}(s,a)+v_{h+1}^{k}(s,a)\right)+\beta^{k}_{s,a,h}. (8)

After finishing updating the estimated QQ function, the server updates the estimated value function and the policy as follows:

Vhk+1(s)=min{H,maxa𝒜Qhk+1(s,a)},πhk+1(s)=argmaxa𝒜Qhk+1(s,a),(s,h)𝒮×[H].\displaystyle V_{h}^{k+1}(s)=\min\Big{\{}H,\max_{a^{\prime}\in\mathcal{A}}Q_{h}^{k+1}(s,a^{\prime})\Big{\}},\ \pi_{h}^{k+1}(s)=\arg\max_{a^{\prime}\in\mathcal{A}}Q_{h}^{k+1}\left(s,a^{\prime}\right),\forall(s,h)\in\mathcal{S}\times[H]. (9)
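Purely as an illustration (the precise procedure is given in Algorithms 1 and 2 below), the following Python sketch mirrors the two update cases in Equations 7 and 8 and the value/policy update in Equation 9 for a single (s,a,h) triple, reusing the helper functions from the previous sketch; variable names and array shapes are our own assumptions.

import numpy as np

def server_q_update(Q, N_prev, N_new, r, v_agents, n_agents, H, iota, c=1.0):
    # Q: current estimate Q_h^k(s,a); N_prev = N_h^k(s,a); N_new = N_h^{k+1}(s,a).
    # v_agents[m] = v_{h+1}^{m,k}(s,a) and n_agents[m] = n_h^{m,k}(s,a) for agents m = 0, ..., M-1.
    # Assumes (s,a,h) was visited in round k, i.e., sum(n_agents) = N_new - N_prev > 0.
    M = len(n_agents)
    i0 = 2 * M * H * (H + 1)
    if N_prev < i0:
        # Case 1 (Equation 7): one sequential update per visiting agent, in increasing agent order.
        visiting = [m for m in range(M) if n_agents[m] > 0]
        for offset, m in enumerate(visiting):
            t = N_prev + offset + 1
            lr = eta(t, H)
            Q = (1 - lr) * Q + lr * (r + v_agents[m] + hoeffding_bonus(t, H, iota, c))
    else:
        # Case 2 (Equation 8): a single aggregated update with weight eta_{s,a}^{h,k}.
        n_total = sum(n_agents)                      # n_h^k(s,a)
        v_bar = sum(v_agents) / n_total              # v_{h+1}^k(s,a)
        lr = 1.0 - eta_c(N_prev + 1, N_new, H)       # eta_{s,a}^{h,k}
        Q = (1 - lr) * Q + lr * (r + v_bar) + aggregated_bonus(N_prev, N_new, H, iota, c)
    return Q

def value_and_policy(Q_row, H):
    # Equation 9: truncated value estimate and greedy action from the updated row Q_h^{k+1}(s, .).
    return min(H, float(np.max(Q_row))), int(np.argmax(Q_row))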

The details of the FedQ-Hoeffding algorithm are presented below.

Algorithm 1 FedQ-Hoeffding (Central Server)
1:Input: T0+T_{0}\in\mathbb{N}_{+}.
2:Initialization: k=1k=1, Nh1(s,a)=0N_{h}^{1}(s,a)=0, Qh1(s,a)=Vh1(s)=HQ_{h}^{1}(s,a)=V_{h}^{1}(s)=H, (s,a,h)𝒮×𝒜×[H]\forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] and π1={πh1:𝒮𝒜}h[H]\pi^{1}=\left\{\pi_{h}^{1}:\mathcal{S}\rightarrow\mathcal{A}\right\}_{h\in[H]} is an arbitrary deterministic policy.
3:while Hk=1k1Mnk<T0H\sum_{k^{\prime}=1}^{k-1}Mn^{k^{\prime}}<T_{0} do
4:     Broadcast πk,\pi^{k}, {Nhk(s,πhk(s))}s,h\{N_{h}^{k}(s,\pi_{h}^{k}(s))\}_{s,h} and {Vhk(s)}s,h\{V_{h}^{k}(s)\}_{s,h} to all clients.
5:     Wait until receiving an abortion signal and send the signal to all agents.
6:     Receive {rh(s,πhk(s))}s,h\{r_{h}(s,\pi_{h}^{k}(s))\}_{s,h},{nhm,k(s,πhk(s))}s,h,m\{n_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h,m} and {vh+1m,k(s,πhk(s))}s,h,m\{v_{h+1}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h,m} from clients.
7:     Calculate Nhk+1(s,a),nhk(s,a),vh+1k(s,a),(s,h)𝒮×[H]N_{h}^{k+1}(s,a),n_{h}^{k}(s,a),v_{h+1}^{k}(s,a),\forall(s,h)\in\mathcal{S}\times[H] with a=πhk(s)a=\pi_{h}^{k}(s).
8:     for (s,a,h)𝒮×𝒜×[H](s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] do
9:         if aπhk(s)a\neq\pi_{h}^{k}(s) or nhk(s,a)=0n_{h}^{k}(s,a)=0 then
10:              Qhk+1(s,a)Qhk(s,a)Q_{h}^{k+1}(s,a)\leftarrow Q_{h}^{k}(s,a).
11:         else if Nhk(s,a)<i0N_{h}^{k}(s,a)<i_{0} then
12:              Update Qhk+1(s,a)Q_{h}^{k+1}(s,a) according to Equation 7.
13:         else
14:              Update Qhk+1(s,a)Q_{h}^{k+1}(s,a) according to Equation 8.
15:         end if
16:     end for
17:     Update Vhk+1V_{h}^{k+1} and πk+1\pi^{k+1} by eq. 9.
18:     kk+1k\leftarrow k+1.
19:end while
Algorithm 2 FedQ-Hoeffding (Agent mm in round kk)
1:nhm(s,a)=vh+1m(s,a)=rh(s,a)=0,(s,a,h)𝒮×𝒜×[H]n_{h}^{m}(s,a)=v_{h+1}^{m}(s,a)=r_{h}(s,a)=0,\forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H].
2:Receive πk,\pi^{k}, {Nhk(s,πhk(s))}s,h\{N_{h}^{k}(s,\pi_{h}^{k}(s))\}_{s,h} and {Vhk(s)}s,h\{V_{h}^{k}(s)\}_{s,h} from the central server.
3:while no abortion signal from the central server do
4:     while n_{h}^{m}(s,a)<\max\left\{1,\lfloor\frac{1}{MH(H+1)}N_{h}^{k}(s,a)\rfloor\right\},\ \forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] do
5:         Collect a new trajectory {(sh,ah,rh)}h=1H\{(s_{h},a_{h},r_{h})\}_{h=1}^{H} with ah=πhk(sh)a_{h}=\pi_{h}^{k}(s_{h}).
6:         nhm(sh,ah)nhm(sh,ah)+1,n_{h}^{m}(s_{h},a_{h})\leftarrow n_{h}^{m}(s_{h},a_{h})+1, vh+1m(sh,ah)vh+1m(sh,ah)+Vh+1k(sh+1)v_{h+1}^{m}(s_{h},a_{h})\leftarrow v_{h+1}^{m}(s_{h},a_{h})+V_{h+1}^{k}(s_{h+1}), and rh(sh,ah)rh,h[H].{r}_{h}(s_{h},a_{h})\leftarrow r_{h},\forall h\in[H].
7:     end while
8:     Send an abortion signal to the central server.
9:end while
10:nhm,k(s,a)nhm(s,a),vh+1m,k(s,a)vh+1m(s,a),(s,h)𝒮×[H]n_{h}^{m,k}(s,a)\leftarrow n_{h}^{m}(s,a),v_{h+1}^{m,k}(s,a)\leftarrow v_{h+1}^{m}(s,a),\forall(s,h)\in\mathcal{S}\times[H] with a=πhk(s)a=\pi_{h}^{k}(s).
11:Send {rh(s,πhk(s))}s,h\{r_{h}(s,\pi_{h}^{k}(s))\}_{s,h},{nhm,k(s,πhk(s))}s,h\{n_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h} and {vh+1m,k(s,πhk(s))}s,h\{v_{h+1}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h} to the central server.

A.2 FedQ-Bernstein Algorithm

The Bernstein-type algorithm differs from the Hoeffding-type algorithm (Algorithms 1 and 2) in that it selects the upper confidence bound based on a variance estimator of the next-step value estimates, akin to the approach used in the Bernstein-type algorithm of Jin et al. (2018). In this subsection, we first review the algorithm design.

To facilitate understanding, we introduce additional notation specific to the Bernstein-type algorithm, supplementing the notation already introduced for the Hoeffding-type algorithm.

μhm,k(s,a)=1nhm,k(s,a)j=1nm,k[Vh+1k(sh+1k,j,m)]2𝕀[(shk,j,m,ahk,j,m)=(s,a)].\mu_{h}^{m,k}(s,a)=\frac{1}{n_{h}^{m,k}(s,a)}\sum_{j=1}^{n^{m,k}}\left[V_{h+1}^{k}\left(s_{h+1}^{k,j,m}\right)\right]^{2}\mathbb{I}[(s_{h}^{k,j,m},a_{h}^{k,j,m})=(s,a)].
μhk(s,a)=1Nhk+1(s,a)Nhk(s,a)m=1Mμhm,k(s,a)nhm,k(s,a).\mu_{h}^{k}(s,a)=\frac{1}{N_{h}^{k+1}(s,a)-N_{h}^{k}(s,a)}\sum_{m=1}^{M}\mu_{h}^{m,k}(s,a)n_{h}^{m,k}(s,a).

Here, μhm,k(s,a)\mu_{h}^{m,k}(s,a) is the sample mean of [Vh+1k(sh+1k,j,m)]2[V_{h+1}^{k}(s_{h+1}^{k,j,m})]^{2} for all the visits of (s,a,h)(s,a,h) for the mm-th agent during the kk-th round and μhk(s,a)\mu_{h}^{k}(s,a) corresponds to the mean for all the visits during the kk-th round. We define Wk(s,a,h)W_{k}(s,a,h) to denote the sample variance of all the visits before the kk-th round, i.e.

W_{k}(s,a,h)=\frac{1}{N_{h}^{k}(s,a)}\sum_{i=1}^{N_{h}^{k}(s,a)}\left(V_{h+1}^{k^{i}}(s_{h+1}^{k^{i},j^{i},m^{i}})-\frac{1}{N_{h}^{k}(s,a)}\sum_{i^{\prime}=1}^{N_{h}^{k}(s,a)}V_{h+1}^{k^{i^{\prime}}}(s_{h+1}^{k^{i^{\prime}},j^{i^{\prime}},m^{i^{\prime}}})\right)^{2}.

We can find that

Wk(s,a,h)=1Nhk(s,a)k=1k1μhk(s,a)nhk(s,a)[1Nhk(s,a)k=1k1vh+1k(s,a)nhk(s,a)]2,W_{k}(s,a,h)=\frac{1}{N_{h}^{k}(s,a)}\sum_{k^{\prime}=1}^{k-1}\mu_{h}^{k^{\prime}}(s,a)n_{h}^{k^{\prime}}(s,a)-\left[\frac{1}{N_{h}^{k}(s,a)}\sum_{k^{\prime}=1}^{k-1}v_{h+1}^{k^{\prime}}(s,a)n_{h}^{k^{\prime}}(s,a)\right]^{2},

which means that this quantity can be calculated efficiently in practice in the following way. Define

W1,k(s,a,h)=k=1k1μhk(s,a)nhk(s,a),W2,k(s,a,h)=k=1k1vh+1k(s,a)nhk(s,a),W_{1,k}(s,a,h)=\sum_{k^{\prime}=1}^{k-1}\mu_{h}^{k^{\prime}}(s,a)n_{h}^{k^{\prime}}(s,a),\ W_{2,k}(s,a,h)=\sum_{k^{\prime}=1}^{k-1}v_{h+1}^{k^{\prime}}(s,a)n_{h}^{k^{\prime}}(s,a), (10)

then we have

W1,k+1(s,a,h)=W1,k(s,a,h)+μhk(s,a)nhk(s,a),W2,k+1(s,a,h)=W2,k(s,a,h)+vh+1k(s,a)nhk(s,a)W_{1,k+1}(s,a,h)=W_{1,k}(s,a,h)+\mu_{h}^{k}(s,a)n_{h}^{k}(s,a),\ W_{2,k+1}(s,a,h)=W_{2,k}(s,a,h)+v_{h+1}^{k}(s,a)n_{h}^{k}(s,a) (11)

and

Wk+1(s,a,h)=W1,k+1(s,a,h)Nhk+1(s,a)[W2,k+1(s,a,h)Nhk+1(s,a)]2.W_{k+1}(s,a,h)=\frac{W_{1,k+1}(s,a,h)}{N_{h}^{k+1}(s,a)}-\left[\frac{W_{2,k+1}(s,a,h)}{N_{h}^{k+1}(s,a)}\right]^{2}. (12)

This indicates that the central server, by actively maintaining and updating the quantities W1,kW_{1,k} and W2,kW_{2,k} and systematically collecting nhm,kn_{h}^{m,k}s, μhm,k\mu_{h}^{m,k}s and vh+1m,kv_{h+1}^{m,k}s, is able to compute Wk+1W_{k+1}.
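As an illustration of this bookkeeping, the following minimal Python sketch (variable names are ours) applies Equations 10, 11, and 12; starting from W1 = W2 = 0, repeated calls reproduce W_{k+1}(s,a,h) without the server ever storing individual visits.

def update_variance_stats(W1, W2, mu_k, v_k, n_k, N_next):
    # W1, W2: running sums W_{1,k}(s,a,h) and W_{2,k}(s,a,h) from Equation 10.
    # mu_k, v_k: round-k aggregates mu_h^k(s,a) and v_{h+1}^k(s,a); n_k = n_h^k(s,a).
    # N_next: cumulative visit count N_h^{k+1}(s,a) = N_h^k(s,a) + n_k.
    W1_next = W1 + mu_k * n_k                              # Equation 11
    W2_next = W2 + v_k * n_k                               # Equation 11
    W_next = W1_next / N_next - (W2_next / N_next) ** 2    # Equation 12
    return W1_next, W2_next, W_next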

Next, we define

βtB(s,a,h)=c(min{Hιt(Wkt+1(s,a,h)+H)+ιH7SA+MSAH6t,H3ιt}),\beta_{t}^{\textnormal{B}}(s,a,h)=c^{\prime}\left(\min\left\{\sqrt{\frac{H\iota}{t}(W_{k^{t}+1}(s,a,h)+H)}+\iota\frac{\sqrt{H^{7}SA}+\sqrt{MSAH^{6}}}{t},\sqrt{\frac{H^{3}\iota}{t}}\right\}\right), (13)

in which c^{\prime}>0 is a positive constant. With this, the per-visit upper confidence bound b_{t}(s,a,h) is determined by the relation \beta_{t}^{\textnormal{B}}(s,a,h)=2\sum_{i=1}^{t}\eta^{t}_{i}b_{i}(s,a,h), which can be solved recursively as follows:

b1(s,a,h):=β1B(s,a,h)2,bt(s,a,h):=βtB(s,a,h)(1ηt)βt1B(s,a,h)2ηt.b_{1}(s,a,h):=\frac{\beta_{1}^{\textnormal{B}}(s,a,h)}{2},\ b_{t}(s,a,h):=\frac{\beta_{t}^{\textnormal{B}}(s,a,h)-\left(1-\eta_{t}\right)\beta_{t-1}^{\textnormal{B}}(s,a,h)}{2\eta_{t}}.
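The following short Python sketch (function and constant names are ours) evaluates the Bernstein bonus in Equation 13 and recovers the per-visit bonus from the recursion above; the argument W is assumed to be the variance estimate W_{k^{t}+1}(s,a,h) from Equation 12.

import math

def bernstein_beta(t, W, H, S, A, M, iota, c_prime=1.0):
    # Equation 13: beta_t^B(s,a,h), capped by the Hoeffding-style term sqrt(H^3 * iota / t).
    bernstein_term = (math.sqrt(H * iota / t * (W + H))
                      + iota * (math.sqrt(H ** 7 * S * A) + math.sqrt(M * S * A * H ** 6)) / t)
    return c_prime * min(bernstein_term, math.sqrt(H ** 3 * iota / t))

def bernstein_b(t, beta_t, beta_prev, H):
    # Recover b_t from beta_t^B = 2 * sum_{i <= t} eta_i^t * b_i:
    # b_1 = beta_1^B / 2 and b_t = (beta_t^B - (1 - eta_t) * beta_{t-1}^B) / (2 * eta_t) for t >= 2.
    if t == 1:
        return beta_t / 2.0
    lr = (H + 1) / (H + t)  # eta_t
    return (beta_t - (1.0 - lr) * beta_prev) / (2.0 * lr)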

When there is no ambiguity, we use the simplified notations \tilde{b}_{t}=b_{t}(s,a,h) and \tilde{\beta}_{t}=\beta_{t}^{\textnormal{B}}(s,a,h). In the FedQ-Bernstein algorithm, let \tilde{\beta}=\beta_{t^{k}}^{\textnormal{B}}(s,a,h)-\eta^{c}(t^{k-1}+1,t^{k})\beta_{t^{k-1}}^{\textnormal{B}}(s,a,h). Then, similarly to FedQ-Hoeffding, the server updates the global estimates of the action values according to the following two cases.

  • Case 1: N_{h}^{k}(s,a)<i_{0}. In this case, each agent visits (s,a) at step h at most once during the k-th round. We denote 1\leq m_{1}<m_{2}<\ldots<m_{t^{k}-t^{k-1}}\leq M as the indices of the agents with n_{h}^{m,k}(s,a)>0. The server then updates the global estimate of the action values as follows:

    Q_{h}^{k+1}(s,a)=(1-\eta_{t})Q_{h}^{k}(s,a)+\eta_{t}\left(r_{h}(s,a)+v_{h+1}^{m_{t-t^{k-1}},k}(s,a)+\tilde{b}_{t}\right),\quad t=t^{k-1}+1,\ldots,t^{k}. (14)
  • Case 2: Nhk(s,a)i0N_{h}^{k}(s,a)\geq i_{0}. In this case, the central server calculates

    vh+1k(s,a)=m=1Mvh+1m,k(s,a)/nhk(s,a)v_{h+1}^{k}(s,a)=\sum_{m=1}^{M}v_{h+1}^{m,k}(s,a)/n_{h}^{k}(s,a)

    and updates the Q-estimate as follows (a short code sketch of this case is given after the list):

    Qhk+1(s,a)=(1ηs,ah,k)Qhk(s,a)+ηs,ah,k(rh(s,a)+vh+1k(s,a))+β~/2.Q_{h}^{k+1}(s,a)=(1-\eta^{h,k}_{s,a})Q_{h}^{k}(s,a)+\eta^{h,k}_{s,a}\left(r_{h}(s,a)+v_{h+1}^{k}(s,a)\right)+\tilde{\beta}/2. (15)
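As referenced in Case 2 above, the following sketch (reusing eta_c and bernstein_beta from the earlier sketches; the cached value beta_prev is our own bookkeeping device) illustrates the aggregated Bernstein update in Equation 15, under the assumption that t^{k-1}=N_h^k(s,a) and t^k=N_h^{k+1}(s,a).

def bernstein_case2_update(Q, r, v_bar, N_prev, N_new, W_new, beta_prev, H, S, A, M, iota, c_prime=1.0):
    # Equation 15: aggregated update with weight eta_{s,a}^{h,k} and bonus beta_tilde / 2.
    # beta_prev caches beta_{t^{k-1}}^B(s,a,h), computed in the last round in which (s,a,h) was updated.
    shrink = eta_c(N_prev + 1, N_new, H)          # eta^c(t^{k-1}+1, t^k)
    lr = 1.0 - shrink                             # eta_{s,a}^{h,k}
    beta_new = bernstein_beta(N_new, W_new, H, S, A, M, iota, c_prime)
    beta_tilde = beta_new - shrink * beta_prev
    Q_new = (1 - lr) * Q + lr * (r + v_bar) + beta_tilde / 2.0
    return Q_new, beta_new  # beta_new is cached for the next update of this (s,a,h)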

We now present the FedQ-Bernstein algorithm of Zheng et al. (2024).

Algorithm 3 FedQ-Bernstein (Central Server)
1:Input: T0+T_{0}\in\mathbb{N}_{+}.
2:Initialization: k=1k=1, Nh1(s,a)=W1,k(s,a,h)=W2,k(s,a,h)=0,Qh1(s,a)=Vh1(s)=H,(s,a,h)𝒮×𝒜×[H]N_{h}^{1}(s,a)=W_{1,k}(s,a,h)=W_{2,k}(s,a,h)=0,Q_{h}^{1}(s,a)=V_{h}^{1}(s)=H,\forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] and π1={πh1:𝒮𝒜}h[H]\pi^{1}=\left\{\pi_{h}^{1}:\mathcal{S}\rightarrow\mathcal{A}\right\}_{h\in[H]} is an arbitrary deterministic policy.
3:while Hk=1k1Mnk<T0H\sum_{k^{\prime}=1}^{k-1}Mn^{k^{\prime}}<T_{0} do
4:     Broadcast πk,\pi^{k}, {Nhk(s,πhk(s))}s,h\{N_{h}^{k}(s,\pi_{h}^{k}(s))\}_{s,h} and {Vhk(s)}s,h\{V_{h}^{k}(s)\}_{s,h} to all clients.
5:     Wait until receiving an abortion signal and send the signal to all agents.
6:     Receive {rh(s,πhk(s))}s,h\{r_{h}(s,\pi_{h}^{k}(s))\}_{s,h}, {nhm,k(s,πhk(s))}s,h,m\{n_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h,m}, {vh+1m,k(s,πhk(s))}s,h,m\{v_{h+1}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h,m}, {μhm,k(s,πhk(s))}s,h,m\{\mu_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h,m}.
7:     Calculate Nhk+1(s,a),nhk(s,a),vh+1k(s,a),(s,h)𝒮×[H]N_{h}^{k+1}(s,a),n_{h}^{k}(s,a),v_{h+1}^{k}(s,a),\forall(s,h)\in\mathcal{S}\times[H] with a=πhk(s)a=\pi_{h}^{k}(s).
8:     Calculate Wk(s,a,h),Wk+1(s,a,h),W1,k+1(s,a,h),W2,k+1(s,a,h),W_{k}(s,a,h),W_{k+1}(s,a,h),W_{1,k+1}(s,a,h),W_{2,k+1}(s,a,h), (s,h)𝒮×[H]\forall(s,h)\in\mathcal{S}\times[H] with a=πhk(s)a=\pi_{h}^{k}(s) based on Equation 10, Equation 11 and Equation 12.
9:     for (s,a,h)𝒮×𝒜×[H](s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] do
10:         if aπhk(s)a\neq\pi_{h}^{k}(s) or nhk(s,a)=0n_{h}^{k}(s,a)=0 then
11:              Qhk+1(s,a)Qhk(s,a)Q_{h}^{k+1}(s,a)\leftarrow Q_{h}^{k}(s,a).
12:         else
13:              Update Q_{h}^{k+1}(s,a) according to Equation 14 if N_{h}^{k}(s,a)<i_{0}, and Equation 15 otherwise.
14:         end if
15:     end for
16:     Update Vhk+1V_{h}^{k+1} and πk+1\pi^{k+1} by Equation 9.
17:     kk+1k\leftarrow k+1.
18:end while
Algorithm 4 FedQ-Bernstein (Agent mm in round kk)
1:nhm(s,a)=vh+1m(s,a)=rh(s,a)=μhm(s,a)=0,(s,a,h)𝒮×𝒜×[H]n_{h}^{m}(s,a)=v_{h+1}^{m}(s,a)=r_{h}(s,a)=\mu_{h}^{m}(s,a)=0,\forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H].
2:Receive πk\pi^{k}, {Nhk(s,πhk(s))}s,h\{N_{h}^{k}(s,\pi_{h}^{k}(s))\}_{s,h} and {Vhk(s)}s,h\{V_{h}^{k}(s)\}_{s,h} from the central server.
3:while no abortion signal from the central server do
4:     while n_{h}^{m}(s,a)<\max\left\{1,\lfloor\frac{1}{MH(H+1)}N_{h}^{k}(s,a)\rfloor\right\},\ \forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] do
5:         Collect a new trajectory \{(s_{h},a_{h},r_{h})\}_{h=1}^{H} with a_{h}=\pi_{h}^{k}(s_{h}).
6:         n_{h}^{m}(s_{h},a_{h})\leftarrow n_{h}^{m}(s_{h},a_{h})+1, v_{h+1}^{m}(s_{h},a_{h})\leftarrow v_{h+1}^{m}(s_{h},a_{h})+V_{h+1}^{k}(s_{h+1}), \mu_{h}^{m}(s_{h},a_{h})\leftarrow\mu_{h}^{m}(s_{h},a_{h})+\left[V_{h+1}^{k}(s_{h+1})\right]^{2}, and r_{h}(s_{h},a_{h})\leftarrow r_{h},\forall h\in[H].
7:     end while
8:     Send an abortion signal to the central server.
9:end while
10:nhm,k(s,a)nhm(s,a),vh+1m,k(s,a)vh+1m(s,a)n_{h}^{m,k}(s,a)\leftarrow n_{h}^{m}(s,a),v_{h+1}^{m,k}(s,a)\leftarrow v_{h+1}^{m}(s,a) and μhm,k(s,a)μhm(s,a)/nhm(s,a),(s,h)𝒮×[H]\mu_{h}^{m,k}(s,a)\leftarrow\mu_{h}^{m}(s,a)/n_{h}^{m}(s,a),\forall(s,h)\in\mathcal{S}\times[H] with a=πhk(s)a=\pi_{h}^{k}(s).
11:Send {rh(s,πhk(s))}s,h\{r_{h}(s,\pi_{h}^{k}(s))\}_{s,h}, {nhm,k(s,πhk(s))}s,h\{n_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h}, {μhm,k(s,πhk(s))}s,h\{\mu_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h} and {vh+1m,k(s,πhk(s))}s,h\{v_{h+1}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h} to the central server.