
Gap-Dependent Bounds for Federated $Q$-Learning

Haochen Zhang and Zhong Zheng are co-first authors who contributed equally to this paper. Lingzhou Xue is the corresponding author (Email: [email protected]).

Haochen Zhang, Zhong Zheng and Lingzhou Xue
Department of Statistics, The Pennsylvania State University
Abstract

We present the first gap-dependent analysis of regret and communication cost for on-policy federated $Q$-Learning in tabular episodic finite-horizon Markov decision processes (MDPs). Existing FRL methods focus on worst-case scenarios, leading to $\sqrt{T}$-type regret bounds and communication cost bounds with a $\log T$ term scaling with the number of agents $M$, states $S$, and actions $A$, where $T$ is the average total number of steps per agent. In contrast, our novel framework leverages the benign structures of MDPs, such as a strictly positive suboptimality gap, to achieve a $\log T$-type regret bound and a refined communication cost bound that disentangles exploration and exploitation. Our gap-dependent regret bound reveals a distinct multi-agent speedup pattern, and our gap-dependent communication cost bound removes the dependence on $MSA$ from the $\log T$ term. Notably, our gap-dependent communication cost bound also yields a better global switching cost when $M=1$, removing $SA$ from the $\log T$ term.

1 Introduction

Federated reinforcement learning (FRL) is a distributed learning framework that combines the principles of reinforcement learning (RL) (Sutton & Barto, 2018) and federated learning (FL) (McMahan et al., 2017). Focusing on sequential decision-making, FRL aims to learn an optimal policy through parallel exploration by multiple agents under the coordination of a central server. With the environment often modeled as a Markov decision process (MDP), multiple agents independently interact with an initially unknown environment and collaboratively train their decision-making models with limited information exchange between the agents. This approach accelerates the learning process with low communication costs. In this paper, we focus on on-policy FRL tailored for episodic tabular MDPs with inhomogeneous transition kernels. Specifically, we assume the presence of a central server and $M$ local agents in the system. Each agent interacts independently with an episodic MDP consisting of $S$ states, $A$ actions, and $H$ steps per episode.

Multiple recent works studied on-policy FRL for tabular MDPs. Zheng et al. (2024) proposed the model-free algorithms FedQ-Hoeffding and FedQ-Bernstein, which achieve regret bounds of $\tilde{O}(\sqrt{MH^{4}SAT})$ and $\tilde{O}(\sqrt{MH^{3}SAT})$, respectively, under $O(MH^{3}SA\log T)$ rounds of communication. Here, $T$ is the average total number of steps for each agent, and $\tilde{O}$ hides logarithmic factors. Zheng et al. (2025a) proposed FedQ-Advantage, which improves the regret to $\tilde{O}(\sqrt{MH^{2}SAT})$ under a reduced $O(f_{M}H^{2}SA(\log H)\log T)$ rounds of communication, where $f_{M}\in\{1,M\}$ reflects the optional forced synchronization scheme. Chen et al. (2022) and Labbi et al. (2024) proposed model-based algorithms that extend the single-agent algorithm UCBVI (Azar et al., 2017). Byzan-UCBVI (Chen et al., 2022) reaches regret $\tilde{O}(\sqrt{MH^{3}S^{2}AT})$ under $O(MHSA\log T)$ rounds of communication. Fed-UCBVI (Labbi et al., 2024) reaches regret $\tilde{O}(\sqrt{MH^{2}SAT})$ under $O(HSA\log T+MHSA\log\log T)$ rounds of communication. Model-based methods require estimating the transition kernel, so their memory requirements scale quadratically with the number of states $S$. Model-free methods, also called $Q$-Learning methods (Watkins, 1989), directly learn the action-value function, and their memory requirements scale only linearly with $S$. The regret $\tilde{O}(\sqrt{MH^{2}SAT})$ achieved by both FedQ-Advantage and Fed-UCBVI is almost optimal compared to the regret lower bound $\tilde{O}(\sqrt{MH^{2}SAT})$ (Jin et al., 2018; Domingues et al., 2021). In summary, all the works above provide worst-case guarantees for all possible MDPs and prove $\sqrt{T}$-type regret bounds and communication cost bounds that depend linearly on $MSA\log T$ or $SA\log T$.

In practice, RL algorithms often perform better than their worst-case guarantees, as they can be significantly improved for MDPs with benign structures (Zanette & Brunskill, 2019). This motivates problem-dependent analysis that exploits benign MDPs (see, e.g., Wagenmaker et al. (2022a); Zhou et al. (2023); Zhang et al. (2024b)). One such benign structure is a strictly positive suboptimality gap: for every state, the best actions outperform the others by a margin. It is important because nearly all non-degenerate environments with finite action sets satisfy some suboptimality gap condition (Yang et al., 2021). For single-agent algorithms, Simchowitz & Jamieson (2019) and Dann et al. (2021) analyzed gap-dependent regret for model-based methods, while Yang et al. (2021), Xu et al. (2021), and Zheng et al. (2025b) analyzed model-free methods. Here, Yang et al. (2021) focused on UCB-Hoeffding proposed by Jin et al. (2018), while Xu et al. (2021) proposed an algorithm that does not use upper confidence bounds (UCB). Zheng et al. (2025b) analyzed UCB-Advantage (Zhang et al., 2020) and Q-EarlySettled-Advantage (Li et al., 2021), which use variance reduction techniques. All of these works achieve regrets that depend logarithmically on $T$, which is much better than the worst-case $\sqrt{T}$-type regrets. However, no existing work studies gap-dependent regret for on-policy FRL. This motivates the following open question:

Is it possible to establish gap-dependent regret bounds for on-policy FRL algorithms that are logarithmic in $T$?

Meanwhile, recent works have proposed FRL algorithms for tabular MDPs in the off-policy setting (Woo et al., 2023), the offline setting (Woo et al., 2024), and the setting with a simulator (Woo et al., 2023; Salgia & Chi, 2024). Different from on-policy methods, state-of-the-art algorithms for these settings do not update the implemented policies (exploration) and reach $MSA$-free logarithmic bounds on communication rounds, whereas the worst-case communication cost bounds for on-policy FRL methods require a dependence on $M$, $S$, and $A$ in the $\log T$ term (e.g., $O(MH^{3}SA\log T)$ in Zheng et al. (2024)). While increased communication for exploration is reasonable, existing on-policy FRL methods cannot quantify the communication cost paid for exploring non-optimal actions or for exploiting optimal policies under worst-case MDPs, since the suboptimality gaps can be arbitrarily close to 0. This leads to the dependence on $M$, $S$, and $A$ in the $\log T$ term, which motivates the following open question:

Is it possible to establish gap-dependent communication cost bounds for on-policy FRL algorithms that disentangle exploration and exploitation and remove the dependence on $MSA$ from the $\log T$ term?

A closely related evaluation criterion for on-policy RL is the global switching cost, defined as the number of policy switches. It is important in applications with restrictions on policy switching, such as compiler optimization (Ashouri et al., 2018), hardware placement (Mirhoseini et al., 2017), database optimization (Krishnan et al., 2018), and material discovery (Nguyen et al., 2019). Next, we review related literature on single-agent model-free RL algorithms. Under worst-case MDPs, Bai et al. (2019) modified the algorithms in Jin et al. (2018), achieving a cost of $O(H^{3}SA\log T)$, and UCB-Advantage (Zhang et al., 2020) reached an improved cost of $O(H^{2}SA\log T)$, with both algorithms depending on $SA\log T$. In gap-dependent analysis, Zheng et al. (2025b) proved that UCB-Advantage enjoys a cost that depends linearly on $S\log T$. Whether single-agent model-free RL algorithms can avoid the dependence on $SA$ in the $\log T$ term remains an open question.

In addition, multiple technical challenges arise when trying to establish gap-dependent bounds and improve the existing worst-case ones. First, gap-dependent regret analysis often relies on controlling the error of the value function estimates. However, the techniques for model-free methods (Yang et al., 2021; Xu et al., 2021; Zheng et al., 2025b) only adapt to the instant policy updates of single-agent methods, while FRL often uses delayed policy updates to keep the communication cost low. Second, proving low communication costs for FRL algorithms often requires actively estimating the number of visits to each state-action-step triple (see, e.g., Woo et al. (2023)). However, this is challenging for on-policy algorithms because the implemented policy is actively updated, and a universal stationary visiting probability is unavailable. Existing on-policy FRL methods achieve logarithmic communication costs by controlling the visits and synchronization through event-triggered synchronization conditions, which guarantee a sufficient increase in the number of visits to one state-action-step triple between synchronizations. However, this analysis is insufficient for estimating visiting numbers and results in the dependence on $SA$ in the $\log T$ term.

Summary of Our Contributions. We give an affirmative answer to these important open questions by proving the first gap-dependent bounds on both regret and communication cost for on-policy FRL in the literature. We analyze those bounds for FedQ-Hoeffding (Zheng et al., 2024), an on-policy FRL algorithm for tabular episodic finite-horizon MDPs. Our contributions are summarized as follows.

• Gap-Dependent Regret (Theorem 4.1). Denote $\Delta_{\min}$ as the minimum nonzero suboptimality gap over all state-action-step triples. We prove that FedQ-Hoeffding guarantees a gap-dependent expected regret of

$$O\bigg(\frac{H^{6}SA\log(MSAT)}{\Delta_{\min}}+C_{f}\bigg) \qquad (1)$$

where $C_{f}=M\sqrt{H^{7}}SA\sqrt{\log(MSAT)}+MH^{5}SA$ is the gap-free part. The bound is logarithmic in $T$ and improves on the worst-case $\sqrt{T}$-type regret discussed above when $T$ is large enough. When $M=1$, (1) reduces to the single-agent gap-dependent regret bound (Yang et al., 2021) for UCB-Hoeffding (Jin et al., 2018), the single-agent counterpart of FedQ-Hoeffding. When $T$ is large enough and $\Delta_{\min}$ is small enough, (1) shows a better multi-agent speedup over UCB-Hoeffding than the $\sqrt{T}$-type worst-case regrets in Zheng et al. (2024) do. Our numerical experiments in Section 5.1 also demonstrate the $\log T$ pattern of the regret for any given MDP.

• Gap-Dependent Communication Cost (Theorem 4.2). We prove that under a general uniqueness condition on optimal policies, for any $p\in(0,1)$, with probability at least $1-p$, both the number of communication rounds and the number of different implemented policies required by FedQ-Hoeffding are upper bounded by

$$O\bigg(MH^{3}SA\log(MH^{2}\iota_{0})+H^{3}SA\log\Big(\frac{H^{5}SA}{\Delta^{2}_{\min}}\Big)+H^{3}S\log\Big(\frac{MH^{9}SA\iota_{0}}{\Delta^{2}_{\min}C_{st}}\Big)+H^{2}\log\Big(\frac{T}{HSA}\Big)\bigg). \qquad (2)$$

Here, $C_{st}\in(0,1]$ represents the minimum of the nonzero visiting probabilities to all state-step pairs under optimal policies, and $\iota_{0}=\log(SAT/p)$. Since the communication cost of each round is $O(MHS)$, the total communication cost is (2) multiplied by $O(MHS)$. Compared to the existing worst-case communication rounds that depend on $MSA\log T$ (Zheng et al., 2024, 2025a; Qiao et al., 2022) or $SA\log T$ (Zheng et al., 2025a; Labbi et al., 2024), the first three terms in (2) depend only logarithmically on $1/\Delta_{\min}$ and $\log T$, and the last term removes the dependence on $MSA$ from the $\log T$ term. This improvement is significant since $M$ represents the number of collaborating agents and $SA$ represents the complexity of the state-action space, which is often the bottleneck of RL methods (Jin et al., 2018). Compared to the $SA$-free communication rounds for FRL methods that do not update policies, (2) quantifies the cost of the different components in on-policy FRL: the first two terms represent the cost of exploration, and the last two terms show the cost of implementing the optimal policy (exploitation). Our numerical experiments, presented in Section 5.2, demonstrate that the $\log T$ term in the communication cost is independent of $M$, $S$, and $A$.

When $M=1$, FedQ-Hoeffding becomes a single-agent algorithm with the low global switching cost shown in (2) (Corollary 4.1). It removes the dependence on $SA$ from the $\log T$ term compared to existing model-free methods (Bai et al., 2019; Zhang et al., 2020; Zheng et al., 2025b).

• Technical Novelty and Contributions. We develop a new theoretical framework for the gap-dependent analysis of on-policy FRL with delayed policy updates. It provides two features simultaneously: controlling the error of the estimated value functions and estimating the number of visits. The first feature helps prove the gap-dependent regret bound (1), and the second is key to proving the bound (2) on communication rounds. To overcome the difficulty of estimating visiting numbers, we develop a new technical tool: concentrations on visiting numbers under varying policies. We establish concentration inequalities that relate the visits to the stationary visiting probabilities of the optimal policies via error recursion over episode steps. This step relies on the number of visits with suboptimal actions rather than on the algorithm settling on an optimal policy, and it provides better estimates of visiting numbers. With this tool and the nonzero minimum suboptimality gap, we also establish the following techniques: (a) exploring visiting discrepancies between optimal and suboptimal actions, which validates the concentration above; (b) showing an agent-wise simultaneous sufficient increase of visits, which helps remove the linear dependence on $M$ from the last three terms of (2); and (c) showing a state-wise simultaneous sufficient increase of visits for states with unique optimal actions, which helps remove the linear dependence on $SA$ from the last term of (2).

    To the best of our knowledge, these techniques are new to the literature for on-policy model-free FRL methods. They will be of independent interest in the gap-dependent analysis of other on-policy RL and FRL methods in controlling or estimating the number of visits.

2 Related Work

On-policy RL for single agent finite-horizon tabular MDPs with worst-case regret. There are mainly two types of algorithms for reinforcement learning: model-based and model-free learning. Model-based algorithms learn a model from past experience and make decisions based on this model, while model-free algorithms only maintain a group of value functions and take the induced optimal actions. Due to these differences, model-free algorithms are usually more space-efficient and time-efficient compared to model-based algorithms. However, model-based algorithms may achieve better learning performance by leveraging the learned model.

Next, we discuss the literature on model-based and model-free algorithms for finite-horizon tabular MDPs with worst-case regret. Auer et al. (2008), Agrawal & Jia (2017), Azar et al. (2017), Kakade et al. (2018), Agarwal et al. (2020), Dann et al. (2019), Zanette & Brunskill (2019), Zhang et al. (2021), Zhou et al. (2023), and Zhang et al. (2024b) worked on model-based algorithms. Notably, Zhang et al. (2024b) provided an algorithm that achieves a regret of $\tilde{O}(\min\{\sqrt{SAH^{2}T},T\})$, which matches the information-theoretic lower bound. Jin et al. (2018), Yang et al. (2021), Zhang et al. (2020), Li et al. (2021), and Ménard et al. (2021) worked on model-free algorithms. The latter three introduced algorithms that achieve the minimax regret of $\tilde{O}(\sqrt{SAH^{2}T})$.

Suboptimality Gap. When there is a strictly positive suboptimality gap, it is possible to achieve logarithmic regret bounds. In RL, earlier work obtained asymptotic logarithmic regret bounds (Auer & Ortner, 2007; Tewari & Bartlett, 2008). More recently, non-asymptotic logarithmic regret bounds were obtained (Jaksch et al., 2010; Ok et al., 2018; Simchowitz & Jamieson, 2019; He et al., 2021). Specifically, Jaksch et al. (2010) developed a model-based algorithm whose bound depends on the policy gap instead of the action gap studied in this paper. Ok et al. (2018) derived problem-specific logarithmic lower bounds for both structured and unstructured MDPs. Simchowitz & Jamieson (2019) extended the model-based algorithm of Zanette & Brunskill (2019) and obtained logarithmic regret bounds. Logarithmic regret bounds have also been obtained in linear function approximation settings (He et al., 2021). Nguyen-Tang et al. (2023) also provides gap-dependent regret bounds for offline RL with linear function approximation.

Specifically, for model-free algorithms, Yang et al. (2021) showed that the optimistic $Q$-learning algorithm of Jin et al. (2018) enjoys a logarithmic regret of $O\big(\frac{H^{6}SA\log(SAT)}{\Delta_{\min}}\big)$, which was subsequently refined by Xu et al. (2021), who introduced the Adaptive Multi-step Bootstrap (AMB) algorithm. Zheng et al. (2025b) further improved the logarithmic regret bound by leveraging the analysis of the UCB-Advantage algorithm (Zhang et al., 2020) and the Q-EarlySettled-Advantage algorithm (Li et al., 2021).

There are also some other works focusing on gap-dependent sample complexity bounds (Jonsson et al., 2020; Marjani & Proutiere, 2020; Al Marjani et al., 2021; Tirinzoni et al., 2022; Wagenmaker et al., 2022b; Wagenmaker & Jamieson, 2022; Wang et al., 2022; Tirinzoni et al., 2023).

Variance reduction in RL. The reference-advantage decomposition used in Zhang et al. (2020) and Li et al. (2021) is a technique of variance reduction that was originally proposed for finite-sum stochastic optimization (Gower et al., 2020; Johnson & Zhang, 2013; Nguyen et al., 2017). Later on, model-free RL algorithms also used variance reduction to improve the sample efficiency. For example, it was used in learning with generative models (Sidford et al., 2018, 2023; Wainwright, 2019), policy evaluation (Du et al., 2017; Khamaru et al., 2021; Wai et al., 2019; Xu et al., 2020), offline RL (Shi et al., 2022; Yin et al., 2021), and $Q$-learning (Li et al., 2020; Zhang et al., 2020; Li et al., 2021; Yan et al., 2023).

RL with low switching cost and batched RL. Research on RL with low switching cost aims to minimize the number of policy switches while maintaining regret bounds comparable to fully adaptive counterparts, and it can be applied to federated RL. In batched RL (Perchet et al., 2016; Gao et al., 2019), the agent sets the number of batches and the length of each batch upfront, implementing an unchanged policy within each batch and aiming for fewer batches and lower regret. Bai et al. (2019) first introduced the problem of RL with low switching cost and proposed a $Q$-learning algorithm with lazy updates, achieving $\tilde{O}(SAH^{3}\log T)$ switching cost. This work was advanced by Zhang et al. (2020), which improved both the regret upper bound and the switching cost. Additionally, Wang et al. (2021) studied RL under an adaptivity constraint. Recently, Qiao et al. (2022) proposed a model-based algorithm with $\tilde{O}(\log\log T)$ switching cost. Zhang et al. (2022) proposed a batched RL algorithm that is well suited to the federated setting.

Multi-agent RL (MARL) with event-triggered communications. We review a few recent works for on-policy MARL with linear function approximations. Dubey & Pentland (2021) introduced Coop-LSVI for cooperative MARL. Min et al. (2023) proposed an asynchronous version of LSVI-UCB that originates from Jin et al. (2020), matching the same regret bound with improved communication complexity compared to Dubey & Pentland (2021). Hsu et al. (2024) developed two algorithms that incorporate randomized exploration, achieving the same regret and communication complexity as Min et al. (2023). Dubey & Pentland (2021), Min et al. (2023) and Hsu et al. (2024) employed event-triggered communication conditions based on determinants of certain quantities. Different from our federated algorithm, during the synchronization in Dubey & Pentland (2021) and Min et al. (2023), local agents share original rewards or trajectories with the server. On the other hand, Hsu et al. (2024) reduces communication cost by sharing compressed statistics in the non-tabular setting with linear function approximation.

Federated and distributed RL. Existing literature on federated and distributed RL algorithms highlights various aspects. For value-based algorithms, Guo & Brunskill (2015), Zheng et al. (2024), and Woo et al. (2023) focused on linear speedup. Agarwal et al. (2021) proposed a parallel RL algorithm with low communication cost. Woo et al. (2023) and Woo et al. (2024) discussed the improved covering power of heterogeneity. Wu et al. (2021) and Chen et al. (2023) worked on robustness. In particular, Chen et al. (2023) proposed algorithms in both offline and online settings, obtaining near-optimal sample complexities and superior robustness guarantees. In addition, several works have investigated value-based algorithms such as $Q$-learning in different settings, including Beikmohammadi et al. (2024), Jin et al. (2022), Khodadadian et al. (2022), Fan et al. (2023), Woo et al. (2023), Woo et al. (2024), Anwar & Raychowdhury (2021), Zhao et al. (2023), Yang et al. (2023), and Zhang et al. (2024a). The convergence of decentralized temporal difference algorithms has been analyzed by Doan et al. (2019), Doan et al. (2021), Chen et al. (2021b), Sun et al. (2020), Wai (2020), Wang et al. (2020), Zeng et al. (2021), and Liu & Olshevsky (2023).

Some other works focus on policy-gradient-based algorithms. Communication-efficient policy gradient algorithms have been studied by Fan et al. (2021) and Chen et al. (2021a). Lan et al. (2023) further highlights their simplicity compared to other RL methods and demonstrates a linear speedup in the synchronous setting. Optimal sample complexity for global optimality in federated RL, even in the presence of adversaries, is studied in Ganesh et al. (2024). Lan et al. (2024) propose an algorithm to address the challenge of lagged policies in asynchronous settings.

The convergence of distributed actor-critic algorithms has been analyzed by Shen et al. (2023) and Chen et al. (2022). Federated actor-learner architectures have been explored by Assran et al. (2019), Espeholt et al. (2018), and Mnih et al. (2016). Distributed inverse reinforcement learning has been examined by Banerjee et al. (2021), Gong et al. (2023) and Liu & Zhu (2022, 2023, 2024, 2025).

3 Background and Problem Formulation

3.1 Preliminaries

We begin by introducing the mathematical framework of Markov decision processes. In this paper, we adopt the convention $0/0=0$. For any $C\in\mathbb{N}$, we use $[C]$ to denote the set $\{1,2,\ldots,C\}$. We use $\mathbb{I}[x]$ to denote the indicator function, which equals 1 when the event $x$ is true and 0 otherwise.

Tabular episodic Markov decision process (MDP). A tabular episodic MDP is denoted as $\mathcal{M}:=(\mathcal{S},\mathcal{A},H,\mathbb{P},r)$, where $\mathcal{S}$ is the set of states with $|\mathcal{S}|=S$, $\mathcal{A}$ is the set of actions with $|\mathcal{A}|=A$, $H$ is the number of steps in each episode, $\mathbb{P}:=\{\mathbb{P}_{h}\}_{h=1}^{H}$ is the transition kernel so that $\mathbb{P}_{h}(\cdot\mid s,a)$ characterizes the distribution over the next state given the state-action pair $(s,a)$ at step $h$, and $r:=\{r_{h}\}_{h=1}^{H}$ is the collection of reward functions. We assume that $r_{h}(s,a)\in[0,1]$ is a deterministic function of $(s,a)$; the results can be easily extended to the case where $r_{h}$ is random.

In each episode, an initial state $s_{1}$ is selected arbitrarily by an adversary. Then, at each step $h\in[H]$, an agent observes a state $s_{h}\in\mathcal{S}$, picks an action $a_{h}\in\mathcal{A}$, receives the reward $r_{h}=r_{h}(s_{h},a_{h})$, and then transits to the next state $s_{h+1}$. The episode ends when an absorbing state $s_{H+1}$ is reached.

Policies, state value functions, and action value functions. A policy $\pi$ is a collection of $H$ functions $\{\pi_{h}:\mathcal{S}\rightarrow\Delta^{\mathcal{A}}\}_{h\in[H]}$, where $\Delta^{\mathcal{A}}$ is the set of probability distributions over $\mathcal{A}$. A policy is deterministic if for any $s\in\mathcal{S}$, $\pi_{h}(s)$ concentrates all the probability mass on an action $a\in\mathcal{A}$. In this case, we write $\pi_{h}(s)=a$.

Let $V_{h}^{\pi}:\mathcal{S}\rightarrow\mathbb{R}$ and $Q_{h}^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ denote the state value function and the state-action value function at step $h$ under policy $\pi$. Mathematically, $V_{h}^{\pi}(s):=\sum_{t=h}^{H}\mathbb{E}_{(s_{t},a_{t})\sim(\mathbb{P},\pi)}\left[r_{t}(s_{t},a_{t})\mid s_{h}=s\right]$ and $Q_{h}^{\pi}(s,a):=\sum_{t=h}^{H}\mathbb{E}_{(s_{t},a_{t})\sim(\mathbb{P},\pi)}\left[r_{t}(s_{t},a_{t})\mid(s_{h},a_{h})=(s,a)\right]$.

Since the state and action spaces and the horizon are all finite, there exists an optimal policy $\pi^{\star}$ that achieves the optimal value $V_{h}^{\star}(s)=\sup_{\pi}V_{h}^{\pi}(s)=V_{h}^{\pi^{\star}}(s)$ for all $(s,h)\in\mathcal{S}\times[H]$ (Azar et al., 2017). The Bellman equation and the Bellman optimality equation can be expressed as

$$\left\{\begin{array}{l}V_{h}^{\pi}(s)=\mathbb{E}_{a^{\prime}\sim\pi_{h}(s)}[Q_{h}^{\pi}(s,a^{\prime})],\\ Q_{h}^{\pi}(s,a):=r_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim\mathbb{P}_{h}(\cdot|s,a)}V_{h+1}^{\pi}(s^{\prime}),\\ V_{H+1}^{\pi}(s)=0,\ \forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H],\end{array}\right.\qquad\left\{\begin{array}{l}V_{h}^{\star}(s)=\max_{a^{\prime}\in\mathcal{A}}Q_{h}^{\star}(s,a^{\prime}),\\ Q_{h}^{\star}(s,a):=r_{h}(s,a)+\mathbb{E}_{s^{\prime}\sim\mathbb{P}_{h}(\cdot|s,a)}V_{h+1}^{\star}(s^{\prime}),\\ V_{H+1}^{\star}(s)=0,\ \forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H].\end{array}\right.\qquad(3)$$
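To make the backward recursion in (3) concrete, the following minimal sketch computes $Q^{\star}$ and $V^{\star}$ by dynamic programming for a small tabular MDP stored as NumPy arrays; the array names (`P`, `r`) and shapes are our own illustrative conventions, not part of any algorithm in this paper.

```python
import numpy as np

def optimal_values(P, r):
    """Backward induction for the Bellman optimality equation (3).

    P: transition kernel, shape (H, S, A, S); P[h, s, a] is a distribution over next states.
    r: rewards, shape (H, S, A), values in [0, 1].
    Returns Q_star (H, S, A) and V_star (H + 1, S), with V_star[H] = 0.
    """
    H, S, A, _ = P.shape
    Q_star = np.zeros((H, S, A))
    V_star = np.zeros((H + 1, S))               # V_star[H] is the terminal value 0
    for h in range(H - 1, -1, -1):              # steps H, H-1, ..., 1 (0-indexed)
        Q_star[h] = r[h] + P[h] @ V_star[h + 1]     # r_h(s,a) + E_{s'}[V*_{h+1}(s')]
        V_star[h] = Q_star[h].max(axis=1)           # V*_h(s) = max_a Q*_h(s,a)
    return Q_star, V_star
```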

Suboptimality Gap. For any given MDP, we can provide the following formal definition.

Definition 3.1.

For any $(s,a,h)$, the suboptimality gap is defined as $\Delta_{h}(s,a):=V_{h}^{\star}(s)-Q_{h}^{\star}(s,a)$.

(3) implies that $\Delta_{h}(s,a)\geq 0$ for any $(s,a,h)$. It is then natural to define the minimum gap, the minimum nonzero suboptimality gap over all $(s,a,h)$.

Definition 3.2.

We define the minimum gap as $\Delta_{\min}:=\inf\{\Delta_{h}(s,a)\mid\Delta_{h}(s,a)>0,(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]\}$.

We remark that if $\{\Delta_{h}(s,a)\mid\Delta_{h}(s,a)>0,(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]\}=\emptyset$, then all actions are optimal, leading to a degenerate MDP. Therefore, we assume that this set is nonempty and $\Delta_{\min}>0$ in the rest of this paper. Definitions 3.1 and 3.2 and the non-degeneracy assumption are standard in the literature on gap-dependent analysis (Simchowitz & Jamieson, 2019; Yang et al., 2021; Xu et al., 2020).
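As an illustration of Definitions 3.1 and 3.2, the following sketch computes all gaps $\Delta_{h}(s,a)=V_{h}^{\star}(s)-Q_{h}^{\star}(s,a)$ and $\Delta_{\min}$ from the arrays returned by the backward-induction sketch above; it assumes a non-degenerate MDP so that at least one gap is strictly positive, and the numerical tolerance is our own choice.

```python
def suboptimality_gaps(Q_star, V_star):
    """Delta_h(s, a) = V*_h(s) - Q*_h(s, a) for all (s, a, h); returns gaps and Delta_min."""
    gaps = V_star[:-1][:, :, None] - Q_star    # shape (H, S, A); broadcast V*_h(s) over actions
    positive = gaps[gaps > 1e-12]              # nonzero gaps, up to numerical tolerance
    delta_min = positive.min() if positive.size > 0 else None  # None only for a degenerate MDP
    return gaps, delta_min
```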

Global Switching Cost. We provide the following definition for any algorithm with $K>1$ episodes, which is also used in Bai et al. (2019) and Qiao et al. (2022).

Definition 3.3.

The global switching cost for $K$ episodes is defined as $N_{\textnormal{switch}}:=\sum_{k=1}^{K-1}\mathbb{I}[\pi^{k+1}\neq\pi^{k}]$.
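For concreteness, the global switching cost of Definition 3.3 can be computed from a list of deterministic policies, one per episode, each stored as an integer array of shape $(H, S)$; this is a sketch with our own data layout.

```python
import numpy as np

def global_switching_cost(policies):
    """policies: list of K arrays of shape (H, S), where policies[k][h, s] = pi^{k+1}_h(s).
    Counts the episodes k at which the policy differs from the previous one."""
    return sum(
        not np.array_equal(policies[k], policies[k + 1])
        for k in range(len(policies) - 1)
    )
```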

3.2 The Federated RL Framework

We consider an FRL setting with a central server and $M$ agents, each interacting with an independent copy of $\mathcal{M}$. The agents communicate with the server periodically: after receiving local information, the central server aggregates it and broadcasts certain information to the agents to coordinate their exploration.

For agent $m$, let $U_{m}$ be the number of generated episodes, $\pi^{m,u}$ be the policy in the $u$-th episode of agent $m$, and $s_{1}^{m,u}$ be the corresponding initial state. The regret of $M$ agents over $\hat{T}=H\sum_{m=1}^{M}U_{m}$ total steps is

$$\textnormal{Regret}(T)=\sum_{m\in[M]}\sum_{u=1}^{U_{m}}\left(V_{1}^{\star}(s_{1}^{m,u})-V_{1}^{\pi^{m,u}}(s_{1}^{m,u})\right).$$

Here, $T:=\hat{T}/M$ is the average total number of steps per agent.
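In a simulation, the regret above can be evaluated exactly once $V_{1}^{\star}$ and the value of each executed policy are known. A sketch under the same array conventions as before is given below; the helper names and the representation of episodes as (policy, initial state) pairs are our own.

```python
import numpy as np

def policy_value(P, r, pi):
    """V^pi by backward induction; pi has shape (H, S) with deterministic actions."""
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))
    s_idx = np.arange(S)
    for h in range(H - 1, -1, -1):
        Q_h = r[h] + P[h] @ V[h + 1]          # shape (S, A)
        V[h] = Q_h[s_idx, pi[h]]              # follow the policy's action at each state
    return V

def cumulative_regret(P, r, V1_star, episodes):
    """episodes: list of (policy, initial_state) pairs over all agents and episodes."""
    regret = 0.0
    for pi, s1 in episodes:
        regret += V1_star[s1] - policy_value(P, r, pi)[0, s1]
    return regret
```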

We also define the communication cost of an algorithm as the number of scalars (integers or real numbers) communicated between the server and agents.

4 Performance Guarantees

4.1 FedQ-Hoeffding Algorithm

In this subsection, we briefly review FedQ-Hoeffding. Details are provided in Algorithm 1 and Algorithm 2 in Section A.1. FedQ-Hoeffding proceeds in rounds, indexed by $k\in[K]$. Round $k$ consists of $n^{m,k}$ episodes for agent $m$, where the specific value of $n^{m,k}$ is determined later.

Notations. For the $j$-th ($j\in[n^{m,k}]$) episode of agent $m$ in the $k$-th round, we denote the corresponding trajectory by $\{(s_{h}^{k,j,m},a_{h}^{k,j,m},r_{h}^{k,j,m})\}_{h=1}^{H}$. Denote $n_{h}^{m,k}(s,a)$ as the number of times that $(s,a,h)$ has been visited by agent $m$ in round $k$, $n_{h}^{k}(s,a):=\sum_{m=1}^{M}n_{h}^{m,k}(s,a)$ as the total number of visits in round $k$ across all agents, and $N_{h}^{k}(s,a)$ as the total number of visits to $(s,a,h)$ among all agents before the start of round $k$. We also use $\{V_{h}^{k}:\mathcal{S}\rightarrow\mathbb{R}\}_{h=1}^{H}$ and $\{Q_{h}^{k}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}\}_{h=1}^{H}$ to denote the global estimates of the state value function and state-action value function at the beginning of round $k$. Before the first round, both estimates are initialized as $H$.

Coordinated Exploration. At the beginning of round $k$, the server decides a deterministic policy $\pi^{k}=\{\pi_{h}^{k}\}_{h=1}^{H}$ and broadcasts it, along with $\{N_{h}^{k}(s,\pi_{h}^{k}(s))\}_{s,h}$ and $\{V_{h}^{k}(s)\}_{s,h}$, to the agents. Here, $\pi^{1}$ can be chosen arbitrarily. The agents then execute $\pi^{k}$ and start collecting trajectories. During the exploration in round $k$, every agent $m$ monitors its number of visits to each $(s,a,h)$. For any agent $m$, at the end of each episode, if any $(s,a,h)$ has been visited by

$$c_{h}^{k}(s,a)=\max\left\{1,\left\lfloor\frac{N_{h}^{k}(s,a)}{MH(H+1)}\right\rfloor\right\} \qquad (4)$$

times by agent $m$, the agent sends a signal to the server, which then aborts all agents' exploration. In this case, we say that $(s,a,h)$ satisfies the trigger condition in round $k$. During the exploration, for all $(s,a,h)$, agent $m$ adaptively calculates $n_{h}^{m,k}(s,a)$ and the local estimate of the next-step return $v_{h+1}^{m,k}(s,a):=\sum_{j=1}^{n^{m,k}}V_{h+1}^{k}(s_{h+1}^{k,j,m})\mathbb{I}[(s_{h}^{k,j,m},a_{h}^{k,j,m})=(s,a)]$. At the end of round $k$, each agent sends $\{r_{h}(s,\pi_{h}^{k}(s))\}_{s,h}$, $\{n_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h}$, and $\{v_{h+1}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h}$ to the central server for aggregation.
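A minimal sketch of the agent-side visit counting and the event-triggered condition (4) is given below; variable names such as `N_hk` and `local_visits` are our own, and the actual synchronization with the server is abstracted into a boolean return value.

```python
import numpy as np

def round_threshold(N_hk, M, H):
    """Trigger threshold (4): c_h^k(s, a) = max{1, floor(N_h^k(s, a) / (M H (H + 1)))}."""
    return np.maximum(1, N_hk // (M * H * (H + 1)))

def episode_triggers_sync(local_visits, N_hk, M, H):
    """local_visits[h, s, a]: this agent's visits to (s, a, h) so far in the current round.
    Returns True if some triple has reached its per-agent threshold, i.e. the agent
    should signal the server to end the round."""
    return bool(np.any(local_visits >= round_threshold(N_hk, M, H)))
```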

Updates of estimated value functions. The central server calculates $n_{h}^{k}(s,a)$ and $N_{h}^{k+1}(s,a)$ for all triples. It sets $Q_{h}^{k+1}(s,a)=Q_{h}^{k}(s,a)$ for triples with $n_{h}^{k}(s,a)=0$ and updates the estimated value functions for each triple with positive $n_{h}^{k}(s,a)$ as follows.

Case 1: $N_{h}^{k}(s,a)<2MH(H+1)=:i_{0}$. This case implies that each agent visits each $(s,a)$ pair at step $h$ at most once during the round. Let $Q=Q_{h}^{k}(s,a)$. The server then iteratively updates $Q$ using the following assignment:

$$Q\overset{+}{\leftarrow}\eta_{t}\left(r_{h}+V_{h+1}^{k,t}+b_{t}-Q\right),\quad t=N_{h}^{k}+1,\ldots,N_{h}^{k+1}, \qquad (5)$$

and then assigns $Q$ to $Q_{h}^{k+1}(s,a)$. Here, $r_{h},N_{h}^{k},N_{h}^{k+1}$ are short for their respective values at $(s,a)$, $\eta_{t}\in(0,1]$ is the learning rate, $b_{t}>0$ is a bonus, and $V_{h+1}^{k,t}$ represents the $(t-N_{h}^{k})$-th nonzero value in $\{v_{h+1}^{m,k}(s,a)\}_{m=1}^{M}$.

Case 2: $N_{h}^{k}(s,a)\geq i_{0}$. In this case, the central server calculates the global estimate of the expected return $v_{h+1}^{k}(s,a)=\sum_{m=1}^{M}v_{h+1}^{m,k}(s,a)/n_{h}^{k}(s,a)$ and updates the $Q$-estimate as

$$Q_{h}^{k+1}=(1-\eta_{s,a}^{h,k})Q_{h}^{k}+\eta_{s,a}^{h,k}(r_{h}+v_{h+1}^{k})+\beta^{k}_{s,a,h}. \qquad (6)$$

Here, $r_{h},Q_{h}^{k},Q_{h}^{k+1},v_{h+1}^{k}$ are short for their respective values at $(s,a)$, $\eta_{s,a}^{h,k}\in(0,1]$ is the learning rate, and $\beta^{k}_{s,a,h}>0$ represents the bonus.

After updating the estimated $Q$-function, the central server updates the estimated value function and the policy as $V_{h}^{k+1}(s)=\min\{H,\max_{a^{\prime}\in\mathcal{A}}Q_{h}^{k+1}(s,a^{\prime})\}$ and $\pi_{h}^{k+1}(s)=\arg\max_{a^{\prime}\in\mathcal{A}}Q_{h}^{k+1}(s,a^{\prime})$. This update implies that FedQ-Hoeffding is an optimism-based method. The algorithm then proceeds to round $k+1$.
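The sketch below illustrates the two server-side updates (5) and (6) for a single triple. The learning rate $\eta_{t}=(H+1)/(H+t)$ and the bonus $b_{t}=c\sqrt{H^{3}\iota/t}$ shown here are the standard Hoeffding-style placeholders; the precise aggregated rate $\eta_{s,a}^{h,k}$ and bonus $\beta_{s,a,h}^{k}$ are specified in Algorithm 2 of Zheng et al. (2024) and are taken as inputs rather than reimplemented.

```python
import numpy as np

def eta(t, H):
    """Hoeffding-style step size (placeholder choice; see Algorithm 2 of Zheng et al., 2024)."""
    return (H + 1) / (H + t)

def bonus(t, H, c=2.0, iota=1.0):
    """Hoeffding-style exploration bonus b_t (placeholder choice)."""
    return c * np.sqrt(H ** 3 * iota / t)

def server_update_case1(Q, r_h, V_next_values, N_old, H):
    """Case 1 (N_h^k < i_0): sequential updates (5), one per new visit of the triple."""
    for i, v_next in enumerate(V_next_values, start=1):   # the (t - N_h^k)-th nonzero local return
        t = N_old + i
        Q += eta(t, H) * (r_h + v_next + bonus(t, H) - Q)
    return Q

def server_update_case2(Q, r_h, v_sum, n_new, eta_agg, beta):
    """Case 2 (N_h^k >= i_0): one aggregated update (6) with the averaged return v_{h+1}^k."""
    v_bar = v_sum / n_new                                  # sum_m v_{h+1}^{m,k} / n_h^k
    return (1 - eta_agg) * Q + eta_agg * (r_h + v_bar) + beta
```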

In FedQ-Hoeffding, agents send only local estimates, rather than raw trajectories, to the central server. This keeps the communication cost of each round low, at $O(MHS)$. In addition, the event-triggered termination condition with threshold (4) limits the number of new visits in each round, which Zheng et al. (2024) used to prove a linear regret speedup under worst-case MDPs. Moreover, it guarantees that the number of visits to the triple that satisfies the trigger condition increases sufficiently after the round. This is the key to proving the worst-case logarithmic communication cost in Zheng et al. (2024).

4.2 Gap-Dependent Regret

Next, we provide a new gap-dependent regret upper bound for FedQ-Hoeffding as follows.

Theorem 4.1.

For FedQ-Hoeffding (Algorithm 1 and Algorithm 2), $\mathbb{E}\left(\textnormal{Regret}(T)\right)$ can be bounded by (1).

Theorem 4.1 shows that the regret is logarithmic in $T$ for MDPs with a nonzero $\Delta_{\min}$. When $T$ is sufficiently large, this improves on the $\sqrt{T}$-type worst-case regrets in the literature for on-policy FRL.

When $M=1$, the bound reduces to $O\big(\frac{H^{6}SA\log(SAT)}{\Delta_{\min}}\big)$, which is the gap-dependent bound in Yang et al. (2021) for our single-agent counterpart, the UCB-Hoeffding algorithm. Thus, when $T$ is sufficiently large, for the average regret over all episodes, defined as $\textnormal{Regret}(T)/(MT)$, the ratio between the gap-dependent bounds for FedQ-Hoeffding and UCB-Hoeffding is $\tilde{O}(1/M)$, which serves as our error reduction rate. This is better than the rates under worst-case MDPs for on-policy FRL methods in the literature, which are $\tilde{O}(1/\sqrt{M})$ because of their linear dependence on $\sqrt{MT}$. We illustrate this pattern in the numerical experiments in Section 5.1.

Key Ideas of the Proof. Define $\mathrm{clip}[x\mid y]:=x\cdot\mathbb{I}[x\geq y]$. Note that

$$\mathbb{E}\left(\textnormal{Regret}(T)\right)=\mathbb{E}\left[\sum_{k,j,m}\sum_{h=1}^{H}\Delta_{h}(s_{h}^{k,j,m},a_{h}^{k,j,m})\right]\leq\mathbb{E}\left[\sum_{h=1}^{H}\sum_{k,j,m}\mathrm{clip}\left[\left(Q_{h}^{k}-Q_{h}^{*}\right)(s_{h}^{k,j,m},a_{h}^{k,j,m})\mid\Delta_{\min}\right]\right].$$

Now we only need to bound the term

$$\sum_{h=1}^{H}\sum_{k,j,m}\mathrm{clip}\left[\left(Q_{h}^{k}-Q_{h}^{*}\right)(s_{h}^{k,j,m},a_{h}^{k,j,m})\mid\Delta_{\min}\right].$$

To continue, we need to bound the weighted sum of estimation errors

$$\sum_{k,j,m}\omega_{h,k}\left(Q_{h}^{k}-Q_{h}^{*}\right)(s_{h}^{k,j,m},a_{h}^{k,j,m})$$

for any non-negative weights $\{\omega_{h,k}\}_{h,k}$. This can be done by recursion on $h$ with $Q_{H+1}^{k}=Q_{H+1}^{*}=0$.

4.3 Gap-Dependent Communication Cost

We first introduce two additional assumptions:
(I) Full synchronization. Similar to Zheng et al. (2024), we assume that there is no latency during communication and that the agents and server are fully synchronized (McMahan et al., 2017). This means $n^{m,k}=n^{k}$ for each agent $m$.
(II) Random initializations. We assume that the initial states $\{s_{1}^{k,j,m}\}_{k,j,m}$ are randomly generated following some distribution on $\mathcal{S}$, and that the generation is not affected by any result of the learning process.

Next, we introduce a new concept: G-MDPs.

Definition 4.1.

A G-MDP satisfies two conditions:
(a) The stationary visiting probabilities under optimal policies are unique: if both $\pi^{*,1}$ and $\pi^{*,2}$ are optimal policies, then $\mathbb{P}\left(s_{h}=s\mid\pi^{*,1}\right)=\mathbb{P}\left(s_{h}=s\mid\pi^{*,2}\right)=:\mathbb{P}_{s,h}^{*}$.
(b) Let $\mathcal{A}_{h}^{*}(s)=\{a\mid a=\arg\max_{a^{\prime}}Q_{h}^{*}(s,a^{\prime})\}$. For any $(s,h)\in\mathcal{S}\times[H]$, if $\mathbb{P}_{s,h}^{*}>0$, then $|\mathcal{A}_{h}^{*}(s)|=1$, which means that the optimal action is unique.

G-MDPs represent MDPs with generally unique optimal policies: conditions (a) and (b) above characterize this general uniqueness, and any MDP with a unique optimal policy is a G-MDP. Compared to requiring a unique optimal policy, G-MDPs allow the optimal actions to vary outside the support of the optimal policies, i.e., on the state-step pairs with $\mathbb{P}_{s,h}^{*}=0$.

For a G-MDP, we define $C_{st}=\min\{\mathbb{P}_{s,h}^{*}\mid s\in\mathcal{S},h\in[H],\mathbb{P}_{s,h}^{*}>0\}$. Thus, $0<C_{st}\leq 1$ reflects the minimum visiting probability on the support of the optimal policies. Next, we provide the gap-dependent communication cost bound.
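The quantities $\mathbb{P}_{s,h}^{*}$ and $C_{st}$ can be computed by propagating the initial-state distribution forward under an optimal policy; the following sketch assumes a fixed initial distribution `mu1` over states (consistent with the random initialization assumption (II)) and a deterministic optimal policy `pi_star`, both of which are our own illustrative inputs.

```python
import numpy as np

def visiting_probabilities(P, pi_star, mu1):
    """P: (H, S, A, S) transition kernel; pi_star: (H, S) deterministic optimal policy;
    mu1: (S,) initial-state distribution. Returns P*_{s,h} as an array of shape (H, S)."""
    H, S, _, _ = P.shape
    probs = np.zeros((H, S))
    probs[0] = mu1
    for h in range(H - 1):
        # Transition matrix under pi_star at step h: row s -> P_h(. | s, pi_star[h, s]).
        P_pi = P[h, np.arange(S), pi_star[h]]        # shape (S, S)
        probs[h + 1] = probs[h] @ P_pi
    return probs

def c_st(probs, tol=1e-12):
    """C_st: minimum nonzero visiting probability over all state-step pairs."""
    support = probs[probs > tol]
    return support.min()
```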

Theorem 4.2.

Let $\iota_{0}=\log(SAT/p)$. For FedQ-Hoeffding (Algorithm 1 and Algorithm 2) under the full synchronization and random initialization assumptions, for any given G-MDP and any $p\in(0,1)$, with probability at least $1-p$, the number of rounds $K$ is upper bounded by (2).

We can get the upper bound on the total communication cost by multiplying the upper bound in (2) by $O(MHS)$, the communication cost of each round in FedQ-Hoeffding.

Compared to existing worst-case costs whose $\log T$ terms depend on $SA$ (Zheng et al., 2025a; Labbi et al., 2024) or $MSA$ (Zheng et al., 2024, 2025a; Qiao et al., 2022), (2) is better when $T$ is sufficiently large, since the first three terms depend only logarithmically on $1/\Delta_{\min}$ and $\log T$, and the last term, which is logarithmic in $T$, removes the dependence on $MSA$. Moreover, (2) highlights the cost of the different procedures in FedQ-Hoeffding: the first two terms represent the cost of exploration, and the last two terms show the cost of exploiting the optimal policies. Our numerical experiments in Section 5.2 also demonstrate that the $\log T$ term in the communication cost is independent of $M$, $S$, and $A$.

Key Ideas of the Proof. For a triple $(s,a,h)$ that satisfies the trigger condition in round $k$ with $N_{h}^{k}(s,a)>i_{1}$ for some threshold $i_{1}$, we can demonstrate an agent-wise simultaneous sufficient increase in visits, which allows us to eliminate the linear dependence on $M$ from the last three terms.

In the exploitation period, for a triple $(s,a,h)$ with $\mathbb{P}_{s,h}^{*}>0$ and $a\in\mathcal{A}_{h}^{*}(s)$ that satisfies the trigger condition in round $k$ with $N_{h}^{k}(s,a)>i_{2}$ for some threshold $i_{2}>i_{1}$, we can show a state-wise simultaneous sufficient increase in visits, which helps remove the linear dependence on $S$ and $A$ from the last term.

Since FedQ-Hoeffding implements a fixed policy in each round, when $M=1$, the algorithm reduces to a single-agent algorithm with a low global switching cost, as shown in Corollary 4.1.

Corollary 4.1.

For the FedQ-Hoeffding algorithm (Algorithm 1 and Algorithm 2) under full synchronization, random initialization, and $M=1$, for any given G-MDP and any $p\in(0,1)$, with probability at least $1-p$, the global switching cost can be bounded by (2).

Given that the costs of existing single-agent model-free algorithms depend on $SA$ (Bai et al., 2019; Zhang et al., 2020) or $S$ (Zheng et al., 2025b) in the $\log T$ term (in the literature, these bounds are stated for the local switching cost, which counts the state-step pairs at which the policy switches; the local switching cost is greater than or equal to our global switching cost, but these works did not establish tighter bounds for the global switching cost; see Bai et al. (2019) for more information), our $\log T$ dependence is better, removing the factor $SA$.

At the end of this section, we also remark on FedQ-Bernstein, another on-policy FRL algorithm in Zheng et al. (2024). It differs from FedQ-Hoeffding in the choice of bonuses ($b_{t}$ and $\beta_{s,a,h}^{k}$) and in its use of variance estimators. We can show that FedQ-Bernstein also reaches the gap-dependent bounds in (1) and (2). In Zheng et al. (2024), the worst-case regret bound of FedQ-Bernstein is better than that of FedQ-Hoeffding by a factor of $\sqrt{H}$, and the two methods share the same worst-case communication cost bound. Whether FedQ-Bernstein can reach better gap-dependent bounds remains an open question.

5 Numerical Experiments

In this section, we conduct numerical experiments. (All experiments are run on a server with an Intel Xeon E5-2650v4 (2.2 GHz) CPU and 100 cores; each replication is limited to a single core and 50 GB of RAM.) All experiments use a synthetic environment to demonstrate the $\log T$-type regret of the FedQ-Hoeffding algorithm (Zheng et al., 2024) and the reduced communication cost bound, whose main $O(\log T)$ term has a coefficient independent of $M$, $S$, and $A$. Following Zheng et al. (2024), we generate a synthetic tabular episodic MDP as follows. After setting $H$, $S$, and $A$, the reward $r_{h}(s,a)$ for each $(s,a,h)$ is generated independently and uniformly at random from $[0,1]$, and $\mathbb{P}_{h}(\cdot\mid s,a)$ is generated independently and uniformly at random on the $S$-dimensional simplex for each $(s,a,h)$. We set the constant $c$ in the bonus term $b_{t}$ to 2 and $\iota=1$. We first demonstrate the $\log T$-type regret of the FedQ-Hoeffding algorithm.
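The synthetic environment described above can be generated as in the sketch below; sampling each transition distribution uniformly from the $S$-dimensional simplex is done with a symmetric Dirichlet(1, ..., 1) draw, and the seed handling is our own.

```python
import numpy as np

def generate_synthetic_mdp(H, S, A, seed=0):
    """Random tabular MDP as in Section 5: rewards ~ Uniform[0, 1],
    each P_h(. | s, a) drawn uniformly from the S-dimensional simplex."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(0.0, 1.0, size=(H, S, A))
    P = rng.dirichlet(np.ones(S), size=(H, S, A))   # shape (H, S, A, S); rows sum to 1
    return P, r
```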

5.1 $\log T$-type Regret and Speedup

In this section, we show that the regret for any given MDP follows a $\log T$ pattern. We consider two values of the triple $(H,S,A)$: $(2,2,2)$ and $(5,3,2)$. For the FedQ-Hoeffding algorithm, we set the number of agents to $M=10$ and generate $T/H=10^{7}$ episodes for each agent, resulting in a total of $10^{8}$ episodes. Additionally, to show the linear speedup effect, we conduct experiments with its single-agent version, the UCB-Hoeffding algorithm (Jin et al., 2018), where all conditions except $M=1$ remain the same. To show error bars, we collect 10 sample paths for each algorithm under the same MDP environment. The regret results are shown in Figure 1 and Figure 2. For both algorithms, the solid line represents the median of the 10 sample paths, while the shaded area shows the 10th and 90th percentiles.

Figure 1: Regret results for $H=2$, $S=2$, and $A=2$. The left panel shows $\textnormal{Regret}(T)$ versus $T/H$, while the right panel shows $\textnormal{Regret}(T)/\log(T/H+1)$ versus $T/H$. In both plots, the yellow line represents the regret of the FedQ-Hoeffding algorithm, the red line represents the regret of the UCB-Hoeffding algorithm, and the blue line denotes the adjusted regret of the FedQ-Hoeffding algorithm, obtained by dividing the regret of the yellow line by $\sqrt{M}$.
Figure 2: Regret results for $H=5$, $S=3$, and $A=2$. The left panel shows $\textnormal{Regret}(T)$ versus $T/H$, while the right panel shows $\textnormal{Regret}(T)/\log(T/H+1)$ versus $T/H$. In both plots, the yellow line represents the regret of the FedQ-Hoeffding algorithm, the red line represents the regret of the UCB-Hoeffding algorithm, and the blue line denotes the adjusted regret of the FedQ-Hoeffding algorithm, obtained by dividing the regret of the yellow line by $\sqrt{M}$.

From the two groups of plots, we observe that the yellow lines in the right panels of Figure 1 and Figure 2 approach horizontal lines as $T/H$ becomes sufficiently large. Since the y-axis in these panels is $\textnormal{Regret}(T)/\log(T/H+1)$, we conclude that the regret of the FedQ-Hoeffding algorithm follows a $\log T$-type pattern for any given MDP, rather than the $\sqrt{MT}$ pattern shown in Theorem 4.1 of Zheng et al. (2024). This is consistent with the logarithmic regret result presented in our Theorem 4.1. Furthermore, as $T/H$ becomes sufficiently large, the adjusted regret of FedQ-Hoeffding (blue lines) for both choices of $(H,S,A)$ is significantly lower than the corresponding regret of the single-agent version, UCB-Hoeffding (red lines). This further supports the conclusion that the regret of FedQ-Hoeffding does not follow a $\sqrt{MT}$ pattern; otherwise, the blue and red lines would be close to each other. Finally, as $T/H$ grows larger, the yellow and red lines become close, confirming that the regret of FedQ-Hoeffding approaches that of UCB-Hoeffding as $T$ becomes sufficiently large. This also supports the error reduction rate $\tilde{O}(1/M)$ for the gap-dependent regret.

5.2 Dependency of Communication Cost on $M$, $S$, and $A$

In this section, we demonstrate that the coefficient of the $\log T$ term in the communication cost is independent of $M$, $S$, and $A$. To eliminate the influence of lower-order terms such as $\log(\log T)$ and $\sqrt{\log T}$ in Theorem 4.2, we focus exclusively on the communication cost for sufficiently large values of $T$.

5.2.1 Dependency on $M$

To explore the dependency of communication cost on $M$, we set $(H,S,A)=(2,2,2)$ and let $M$ take values in $\{2,4,6,8\}$. We generate $10^{7}$ episodes for each agent and only consider the communication cost after $5\times 10^{5}$ episodes. Figure 3 shows the results for each $M$ after $5\times 10^{5}$ episodes.

Figure 3: Number of communication rounds vs. log-number of episodes for different values of $M$ with $H=2$, $S=2$, and $A=2$. Each solid line represents the number of communication rounds for each value of $M\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes, while the dashed line represents the corresponding fitted line for each $M$.

In Figure 3, each solid line represents the communication cost for each value of $M\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes, while the dashed line represents the corresponding fitted line. Since the x-axis is the log-number of episodes, $\log(T/H)$, the slope of the fitted line closely approximates the coefficient of the $\log T$ term in the communication cost when $\log T$ is sufficiently large. We observe that the slopes of these fitted lines are very similar, which indicates that for any given MDP, the coefficient of the $\log T$ term in the communication cost is independent of $M$. If the coefficient were linearly dependent on $M$, as in the worst-case bound of Zheng et al. (2024), the slope of the fitted line for $M=8$ would be nearly four times that for $M=2$.
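The slopes reported in Figures 3 through 5 can be obtained with an ordinary least-squares fit of the round counts against the log-number of episodes; the sketch below uses hypothetical input arrays and a burn-in that discards the first $5\times 10^{5}$ episodes, as in our experiments.

```python
import numpy as np

def fitted_slope(episode_counts, comm_rounds, burn_in=5 * 10**5):
    """Fit comm_rounds ~ slope * log(T / H) + intercept on episodes after the burn-in,
    so that the slope estimates the coefficient of the log T term."""
    episode_counts = np.asarray(episode_counts, dtype=float)
    comm_rounds = np.asarray(comm_rounds, dtype=float)
    mask = episode_counts >= burn_in
    slope, intercept = np.polyfit(np.log(episode_counts[mask]), comm_rounds[mask], deg=1)
    return slope, intercept
```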

5.2.2 Dependency on $S$

To explore the dependency of communication cost on $S$, we set $(H,A,M)=(2,2,2)$ and let $S$ take values in $\{2,4,6,8\}$. We generate $10^{7}$ episodes for each agent and only consider the communication cost after $5\times 10^{5}$ episodes. Figure 4 shows the communication cost results for each $S$ after $5\times 10^{5}$ episodes.

Figure 4: Number of communication rounds vs. log-number of episodes for different values of $S$ with $H=2$, $A=2$, and $M=2$. Each solid line represents the number of communication rounds for each value of $S\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes. The dashed line represents the fitted line for each $S$.

In Figure 4, each solid line represents the communication cost for each value of $S\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes, while the dashed line represents the corresponding fitted line. Since the x-axis is the log-number of episodes, $\log(T/H)$, the slope of the fitted line closely approximates the coefficient of the $\log T$ term in the communication cost when $\log T$ is sufficiently large. We observe that the slopes of these fitted lines are very similar, which indicates that for any given MDP, the coefficient of the $\log T$ term in the communication cost is independent of $S$.

5.2.3 Dependency on $A$

To explore the dependency of communication cost on $A$, we set $(H,S,M)=(2,2,2)$ and let $A$ take values in $\{2,4,6,8\}$. We generate $10^{7}$ episodes for each agent and only consider the communication cost after $5\times 10^{5}$ episodes. Figure 5 shows the communication cost results for each $A$ after $5\times 10^{5}$ episodes.

In Figure 5, each solid line represents the communication cost for each value of $A\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes, while the dashed line represents the corresponding fitted line. Since the x-axis is the log-number of episodes, $\log(T/H)$, the slope of the fitted line closely approximates the coefficient of the $\log T$ term in the communication cost when $\log T$ is sufficiently large. We observe that the slopes of these fitted lines are very similar, which indicates that for any given MDP, the coefficient of the $\log T$ term in the communication cost is independent of $A$.

Figure 5: Number of communication rounds vs. log-number of episodes for different values of $A$ with $H=2$, $S=2$, and $M=2$. Each solid line represents the number of communication rounds for each value of $A\in\{2,4,6,8\}$ after $5\times 10^{5}$ episodes. The dashed line represents the fitted line for each $A$.

6 Conclusion

In this paper, we establish the first gap-dependent bounds on regret and communication cost for on-policy federated $Q$-Learning in tabular episodic finite-horizon MDPs, addressing two important open questions in the literature. While existing FRL methods focus on worst-case MDPs, we show that when MDPs exhibit benign structures, such as a strictly positive suboptimality gap, these bounds can be significantly improved. Specifically, we prove that both FedQ-Hoeffding and FedQ-Bernstein achieve $\log T$-type regret. Additionally, we derive a gap-dependent communication cost bound that disentangles exploration and exploitation, with the $\log T$ term in the communication cost being independent of $M$, $S$, and $A$. This makes our work the first in the on-policy FRL literature to achieve such a low communication cost. When $M=1$, our gap-dependent communication cost bound also yields a better global switching cost, removing the dependence on $SA$ from the $\log T$ term.

References

  • Agarwal et al. (2020) Agarwal, A., Kakade, S., and Yang, L. F. Model-based reinforcement learning with a generative model is minimax optimal. In Conference on Learning Theory, pp.  67–83. PMLR, 2020.
  • Agarwal et al. (2021) Agarwal, M., Ganguly, B., and Aggarwal, V. Communication efficient parallel reinforcement learning. In Uncertainty in Artificial Intelligence, pp.  247–256. PMLR, 2021.
  • Agrawal & Jia (2017) Agrawal, S. and Jia, R. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. Advances in Neural Information Processing Systems, 30, 2017.
  • Al Marjani et al. (2021) Al Marjani, A., Garivier, A., and Proutiere, A. Navigating to the best policy in markov decision processes. In Advances in Neural Information Processing Systems, pp.  25852–25864, 2021.
  • Anwar & Raychowdhury (2021) Anwar, A. and Raychowdhury, A. Multi-task federated reinforcement learning with adversaries. arXiv preprint arXiv:2103.06473, 2021.
  • Ashouri et al. (2018) Ashouri, A. H., Killian, W., Cavazos, J., Palermo, G., and Silvano, C. A survey on compiler autotuning using machine learning. ACM Computing Surveys (CSUR), 51(5):1–42, 2018.
  • Assran et al. (2019) Assran, M., Romoff, J., Ballas, N., Pineau, J., and Rabbat, M. Gossip-based actor-learner architectures for deep reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Auer & Ortner (2007) Auer, P. and Ortner, R. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, pp.  49–56. MIT Press, 2007.
  • Auer et al. (2008) Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21, 2008.
  • Azar et al. (2017) Azar, M. G., Osband, I., and Munos, R. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp.  263–272. PMLR, 2017.
  • Bai et al. (2019) Bai, Y., Xie, T., Jiang, N., and Wang, Y.-X. Provably efficient q-learning with low switching cost. Advances in Neural Information Processing Systems, 32, 2019.
  • Banerjee et al. (2021) Banerjee, S., Bouzefrane, S., and Abane, A. Identity management with hybrid blockchain approach: A deliberate extension with federated-inverse-reinforcement learning. In 2021 IEEE 22nd International Conference on High Performance Switching and Routing (HPSR), pp.  1–6. IEEE, 2021.
  • Beikmohammadi et al. (2024) Beikmohammadi, A., Khirirat, S., and Magnússon, S. Compressed federated reinforcement learning with a generative model. arXiv preprint arXiv:2404.10635, 2024.
  • Chen et al. (2021a) Chen, T., Zhang, K., Giannakis, G. B., and Başar, T. Communication-efficient policy gradient methods for distributed reinforcement learning. IEEE Transactions on Control of Network Systems, 9(2):917–929, 2021a.
  • Chen et al. (2023) Chen, Y., Zhang, X., Zhang, K., Wang, M., and Zhu, X. Byzantine-robust online and offline distributed reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pp.  3230–3269. PMLR, 2023.
  • Chen et al. (2021b) Chen, Z., Zhou, Y., and Chen, R. Multi-agent off-policy tdc with near-optimal sample and communication complexity. In 2021 55th Asilomar Conference on Signals, Systems, and Computers, pp.  504–508. IEEE, 2021b.
  • Chen et al. (2022) Chen, Z., Zhou, Y., Chen, R.-R., and Zou, S. Sample and communication-efficient decentralized actor-critic algorithms with finite-time analysis. In International Conference on Machine Learning, pp.  3794–3834. PMLR, 2022.
  • Dann et al. (2019) Dann, C., Li, L., Wei, W., and Brunskill, E. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pp.  1507–1516. PMLR, 2019.
  • Dann et al. (2021) Dann, C., Marinov, T. V., Mohri, M., and Zimmert, J. Beyond value-function gaps: Improved instance-dependent regret bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pp.  1–12, 2021.
  • Doan et al. (2019) Doan, T., Maguluri, S., and Romberg, J. Finite-time analysis of distributed td (0) with linear function approximation on multi-agent reinforcement learning. In International Conference on Machine Learning, pp.  1626–1635. PMLR, 2019.
  • Doan et al. (2021) Doan, T. T., Maguluri, S. T., and Romberg, J. Finite-time performance of distributed temporal-difference learning with linear function approximation. SIAM Journal on Mathematics of Data Science, 3(1):298–320, 2021.
  • Domingues et al. (2021) Domingues, O. D., Ménard, P., Kaufmann, E., and Valko, M. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pp.  578–598. PMLR, 2021.
  • Du et al. (2017) Du, S. S., Chen, J., Li, L., Xiao, L., and Zhou, D. Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning, pp.  1049–1058. PMLR, 2017.
  • Dubey & Pentland (2021) Dubey, A. and Pentland, A. Provably efficient cooperative multi-agent reinforcement learning with function approximation. arXiv preprint arXiv:2103.04972, 2021.
  • Espeholt et al. (2018) Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In International Conference on Machine Learning, pp.  1407–1416. PMLR, 2018.
  • Fan et al. (2021) Fan, F. X., Ma, Y., Dai, Z., Jing, W., Tan, C., and Low, B. K. H. Fault-tolerant federated reinforcement learning with theoretical guarantee. Advances in Neural Information Processing Systems, 34:1007–1021, 2021.
  • Fan et al. (2023) Fan, F. X., Ma, Y., Dai, Z., Tan, C., and Low, B. K. H. Fedhql: Federated heterogeneous q-learning. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, pp.  2810–2812, 2023.
  • Ganesh et al. (2024) Ganesh, S., Chen, J., Thoppe, G., and Aggarwal, V. Global convergence guarantees for federated policy gradient methods with adversaries. arXiv preprint arXiv:2403.09940, 2024.
  • Gao et al. (2019) Gao, Z., Han, Y., Ren, Z., and Zhou, Z. Batched multi-armed bandits problem. Advances in Neural Information Processing Systems, 32, 2019.
  • Gong et al. (2023) Gong, W., Cao, L., Zhu, Y., Zuo, F., He, X., and Zhou, H. Federated inverse reinforcement learning for smart icus with differential privacy. IEEE Internet of Things Journal, 10(21):19117–19124, 2023.
  • Gower et al. (2020) Gower, R. M., Schmidt, M., Bach, F., and Richtárik, P. Variance-reduced methods for machine learning. Proceedings of the IEEE, 108(11):1968–1983, 2020.
  • Guo & Brunskill (2015) Guo, Z. and Brunskill, E. Concurrent pac rl. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, pp.  2624–2630, 2015.
  • He et al. (2021) He, J., Zhou, D., and Gu, Q. Logarithmic regret for reinforcement learning with linear function approximation. In International Conference on Machine Learning, pp.  4171–4180. PMLR, 2021.
  • Hsu et al. (2024) Hsu, H.-L., Wang, W., Pajic, M., and Xu, P. Randomized exploration in cooperative multi-agent reinforcement learning. arXiv preprint arXiv:2404.10728, 2024.
  • Jaksch et al. (2010) Jaksch, T., Ortner, R., and Auer, P. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.
  • Jin et al. (2018) Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is q-learning provably efficient? Advances in Neural Information Processing Systems, 31, 2018.
  • Jin et al. (2020) Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. Provably efficient reinforcement learning with linear function approximation. In Conference on learning theory, pp.  2137–2143. PMLR, 2020.
  • Jin et al. (2022) Jin, H., Peng, Y., Yang, W., Wang, S., and Zhang, Z. Federated reinforcement learning with environment heterogeneity. In International Conference on Artificial Intelligence and Statistics, pp.  18–37. PMLR, 2022.
  • Johnson & Zhang (2013) Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.
  • Jonsson et al. (2020) Jonsson, A., Kaufmann, E., Ménard, P., Darwiche Domingues, O., Leurent, E., and Valko, M. Planning in markov decision processes with gap-dependent sample complexity. In Advances in Neural Information Processing Systems, pp.  1253–1263, 2020.
  • Kakade et al. (2018) Kakade, S., Wang, M., and Yang, L. F. Variance reduction methods for sublinear reinforcement learning. arXiv preprint arXiv:1802.09184, 2018.
  • Khamaru et al. (2021) Khamaru, K., Pananjady, A., Ruan, F., Wainwright, M. J., and Jordan, M. I. Is temporal difference learning optimal? an instance-dependent analysis. SIAM Journal on Mathematics of Data Science, 3(4):1013–1040, 2021.
  • Khodadadian et al. (2022) Khodadadian, S., Sharma, P., Joshi, G., and Maguluri, S. T. Federated reinforcement learning: Linear speedup under markovian sampling. In International Conference on Machine Learning, pp.  10997–11057. PMLR, 2022.
  • Krishnan et al. (2018) Krishnan, S., Yang, Z., Goldberg, K., Hellerstein, J., and Stoica, I. Learning to optimize join queries with deep reinforcement learning. arXiv preprint arXiv:1808.03196, 2018.
  • Labbi et al. (2024) Labbi, S., Tiapkin, D., Mancini, L., Mangold, P., and Moulines, E. Federated ucbvi: Communication-efficient federated regret minimization with heterogeneous agents. arXiv preprint arXiv:2410.22908, 2024.
  • Lan et al. (2023) Lan, G., Wang, H., Anderson, J., Brinton, C., and Aggarwal, V. Improved communication efficiency in federated natural policy gradient via admm-based gradient updates. arXiv preprint arXiv:2310.19807, 2023.
  • Lan et al. (2024) Lan, G., Han, D.-J., Hashemi, A., Aggarwal, V., and Brinton, C. G. Asynchronous federated reinforcement learning with policy gradient updates: Algorithm design and convergence analysis. arXiv preprint arXiv:2404.08003, 2024.
  • Li et al. (2020) Li, G., Wei, Y., Chi, Y., Gu, Y., and Chen, Y. Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction. Advances in Neural Information Processing Systems, 33:7031–7043, 2020.
  • Li et al. (2021) Li, G., Shi, L., Chen, Y., Gu, Y., and Chi, Y. Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning. Advances in Neural Information Processing Systems, 34:17762–17776, 2021.
  • Liu & Olshevsky (2023) Liu, R. and Olshevsky, A. Distributed td (0) with almost no communication. IEEE Control Systems Letters, 2023.
  • Liu & Zhu (2022) Liu, S. and Zhu, M. Distributed inverse constrained reinforcement learning for multi-agent systems. Advances in Neural Information Processing Systems, 35:33444–33456, 2022.
  • Liu & Zhu (2023) Liu, S. and Zhu, M. Meta inverse constrained reinforcement learning: Convergence guarantee and generalization analysis. In The Twelfth International Conference on Learning Representations, 2023.
  • Liu & Zhu (2024) Liu, S. and Zhu, M. Learning multi-agent behaviors from distributed and streaming demonstrations. Advances in Neural Information Processing Systems, 36, 2024.
  • Liu & Zhu (2025) Liu, S. and Zhu, M. In-trajectory inverse reinforcement learning: Learn incrementally before an ongoing trajectory terminates. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2025.
  • Marjani & Proutiere (2020) Marjani, A. and Proutiere, A. Best policy identification in discounted mdps: Problem-specific sample complexity. arXiv preprint arXiv:2009.13405, 2020.
  • McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, pp.  1273–1282. PMLR, 2017.
  • Ménard et al. (2021) Ménard, P., Domingues, O. D., Shang, X., and Valko, M. Ucb momentum q-learning: Correcting the bias without forgetting. In International Conference on Machine Learning, pp.  7609–7618. PMLR, 2021.
  • Min et al. (2023) Min, Y., He, J., Wang, T., and Gu, Q. Cooperative multi-agent reinforcement learning: Asynchronous communication and linear function approximation. In International Conference on Machine Learning, pp.  24785–24811. PMLR, 2023.
  • Mirhoseini et al. (2017) Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., Kumar, N., Norouzi, M., Bengio, S., and Dean, J. Device placement optimization with reinforcement learning. In International Conference on Machine Learning, pp.  2430–2439. PMLR, 2017.
  • Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp.  1928–1937. PMLR, 2016.
  • Nguyen et al. (2017) Nguyen, L. M., Liu, J., Scheinberg, K., and Takáč, M. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International Conference on Machine Learning, pp.  2613–2621. PMLR, 2017.
  • Nguyen et al. (2019) Nguyen, P., Tran, T., Gupta, S., Rana, S., Barnett, M., and Venkatesh, S. Incomplete conditional density estimation for fast materials discovery. In Proceedings of the 2019 SIAM International Conference on Data Mining, pp.  549–557. SIAM, 2019.
  • Nguyen-Tang et al. (2023) Nguyen-Tang, T., Yin, M., Gupta, S., Venkatesh, S., and Arora, R. On instance-dependent bounds for offline reinforcement learning with linear function approximation. In Proceedings of the AAAI Conference on Artificial Intelligence, pp.  9310–9318, 2023.
  • Ok et al. (2018) Ok, J., Proutiere, A., and Tranos, D. Exploration in structured reinforcement learning. Advances in Neural Information Processing Systems, 31, 2018.
  • Perchet et al. (2016) Perchet, V., Rigollet, P., Chassang, S., and Snowberg, E. Batched bandit problems. The Annals of Statistics, 44(2):660 – 681, 2016.
  • Qiao et al. (2022) Qiao, D., Yin, M., Min, M., and Wang, Y.-X. Sample-efficient reinforcement learning with loglog (t) switching cost. In International Conference on Machine Learning, pp.  18031–18061. PMLR, 2022.
  • Salgia & Chi (2024) Salgia, S. and Chi, Y. The sample-communication complexity trade-off in federated q-learning. In Advances in Neural Information Processing Systems, 2024.
  • Shen et al. (2023) Shen, H., Zhang, K., Hong, M., and Chen, T. Towards understanding asynchronous advantage actor-critic: Convergence and linear speedup. IEEE Transactions on Signal Processing, 2023.
  • Shi et al. (2022) Shi, L., Li, G., Wei, Y., Chen, Y., and Chi, Y. Pessimistic q-learning for offline reinforcement learning: Towards optimal sample complexity. In International Conference on Machine Learning, pp.  19967–20025. PMLR, 2022.
  • Sidford et al. (2018) Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. Near-optimal time and sample complexities for solving markov decision processes with a generative model. Advances in Neural Information Processing Systems, 31, 2018.
  • Sidford et al. (2023) Sidford, A., Wang, M., Wu, X., and Ye, Y. Variance reduced value iteration and faster algorithms for solving markov decision processes. Naval Research Logistics (NRL), 70(5):423–442, 2023.
  • Simchowitz & Jamieson (2019) Simchowitz, M. and Jamieson, K. G. Non-asymptotic gap-dependent regret bounds for tabular mdps. In Advances in Neural Information Processing Systems, 2019.
  • Sun et al. (2020) Sun, J., Wang, G., Giannakis, G. B., Yang, Q., and Yang, Z. Finite-time analysis of decentralized temporal-difference learning with linear function approximation. In International Conference on Artificial Intelligence and Statistics, pp.  4485–4495. PMLR, 2020.
  • Sutton & Barto (2018) Sutton, R. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, 2018.
  • Tewari & Bartlett (2008) Tewari, A. and Bartlett, P. Optimistic linear programming gives logarithmic regret for irreducible mdps. In Advances in Neural Information Processing Systems, pp.  1505–1512, 2008.
  • Tirinzoni et al. (2022) Tirinzoni, A., Al Marjani, A., and Kaufmann, E. Near instance-optimal pac reinforcement learning for deterministic mdps. In Advances in Neural Information Processing Systems, pp.  8785–8798, 2022.
  • Tirinzoni et al. (2023) Tirinzoni, A., Al-Marjani, A., and Kaufmann, E. Optimistic pac reinforcement learning: the instance-dependent view. In International Conference on Algorithmic Learning Theory, pp.  1460–1480. PMLR, 2023.
  • Wagenmaker & Jamieson (2022) Wagenmaker, A. and Jamieson, K. G. Instance-dependent near-optimal policy identification in linear mdps via online experiment design. Advances in Neural Information Processing Systems, 35:5968–5981, 2022.
  • Wagenmaker et al. (2022a) Wagenmaker, A. J., Chen, Y., Simchowitz, M., Du, S., and Jamieson, K. First-order regret in reinforcement learning with linear function approximation: A robust estimation approach. In International Conference on Machine Learning, pp.  22384–22429. PMLR, 2022a.
  • Wagenmaker et al. (2022b) Wagenmaker, A. J., Simchowitz, M., and Jamieson, K. Beyond no regret: Instance-dependent pac reinforcement learning. In Conference on Learning Theory, pp.  358–418. PMLR, 2022b.
  • Wai (2020) Wai, H.-T. On the convergence of consensus algorithms with markovian noise and gradient bias. In 2020 59th IEEE Conference on Decision and Control (CDC), pp.  4897–4902. IEEE, 2020.
  • Wai et al. (2019) Wai, H.-T., Hong, M., Yang, Z., Wang, Z., and Tang, K. Variance reduced policy evaluation with smooth function approximation. Advances in Neural Information Processing Systems, 32, 2019.
  • Wainwright (2019) Wainwright, M. J. Variance-reduced qq-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.
  • Wang et al. (2020) Wang, G., Lu, S., Giannakis, G., Tesauro, G., and Sun, J. Decentralized td tracking with linear function approximation and its finite-time analysis. Advances in Neural Information Processing Systems, 33:13762–13772, 2020.
  • Wang et al. (2021) Wang, T., Zhou, D., and Gu, Q. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. Advances in Neural Information Processing Systems, 34:13524–13536, 2021.
  • Wang et al. (2022) Wang, X., Cui, Q., and Du, S. S. On gap-dependent bounds for offline reinforcement learning. Advances in Neural Information Processing Systems, 35:14865–14877, 2022.
  • Watkins (1989) Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, 1989.
  • Woo et al. (2023) Woo, J., Joshi, G., and Chi, Y. The blessing of heterogeneity in federated q-learning: Linear speedup and beyond. In International Conference on Machine Learning, pp.  37157–37216, 2023.
  • Woo et al. (2024) Woo, J., Shi, L., Joshi, G., and Chi, Y. Federated offline reinforcement learning: Collaborative single-policy coverage suffices. In International Conference on Machine Learning, pp.  53165–53201, 2024.
  • Wu et al. (2021) Wu, Z., Shen, H., Chen, T., and Ling, Q. Byzantine-resilient decentralized policy evaluation with linear function approximation. IEEE Transactions on Signal Processing, 69:3839–3853, 2021.
  • Xu et al. (2021) Xu, H., Ma, T., and Du, S. Fine-grained gap-dependent bounds for tabular mdps via adaptive multi-step bootstrap. In Conference on Learning Theory, pp.  4438–4472. PMLR, 2021.
  • Xu et al. (2020) Xu, T., Wang, Z., Zhou, Y., and Liang, Y. Reanalysis of variance reduced temporal difference learning. In International Conference on Learning Representations, 2020.
  • Yan et al. (2023) Yan, Y., Li, G., Chen, Y., and Fan, J. The efficacy of pessimism in asynchronous q-learning. IEEE Transactions on Information Theory, 2023.
  • Yang et al. (2021) Yang, K., Yang, L., and Du, S. Q-learning with logarithmic regret. In International Conference on Artificial Intelligence and Statistics, pp.  1576–1584. PMLR, 2021.
  • Yang et al. (2023) Yang, T., Cen, S., Wei, Y., Chen, Y., and Chi, Y. Federated natural policy gradient methods for multi-task reinforcement learning. arXiv preprint arXiv:2311.00201, 2023.
  • Yin et al. (2021) Yin, M., Bai, Y., and Wang, Y.-X. Near-optimal offline reinforcement learning via double variance reduction. In Advances in Neural Information Processing Systems, pp.  7677–7688, 2021.
  • Zanette & Brunskill (2019) Zanette, A. and Brunskill, E. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pp.  7304–7312. PMLR, 2019.
  • Zeng et al. (2021) Zeng, S., Doan, T. T., and Romberg, J. Finite-time analysis of decentralized stochastic approximation with applications in multi-agent and multi-task learning. In 2021 60th IEEE Conference on Decision and Control (CDC), pp.  2641–2646. IEEE, 2021.
  • Zhang et al. (2024a) Zhang, C., Wang, H., Mitra, A., and Anderson, J. Finite-time analysis of on-policy heterogeneous federated reinforcement learning. arXiv preprint arXiv:2401.15273, 2024a.
  • Zhang et al. (2020) Zhang, Z., Zhou, Y., and Ji, X. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33:15198–15207, 2020.
  • Zhang et al. (2021) Zhang, Z., Ji, X., and Du, S. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. In Conference on Learning Theory, pp.  4528–4531. PMLR, 2021.
  • Zhang et al. (2022) Zhang, Z., Jiang, Y., Zhou, Y., and Ji, X. Near-optimal regret bounds for multi-batch reinforcement learning. Advances in Neural Information Processing Systems, 35:24586–24596, 2022.
  • Zhang et al. (2024b) Zhang, Z., Chen, Y., Lee, J. D., and Du, S. S. Settling the sample complexity of online reinforcement learning. In Conference on Learning Theory, pp.  5213–5219. PMLR, 2024b.
  • Zhao et al. (2023) Zhao, F., Ren, X., Yang, S., Zhao, P., Zhang, R., and Xu, X. Federated multi-objective reinforcement learning. Information Sciences, 624:811–832, 2023.
  • Zheng et al. (2024) Zheng, Z., Gao, F., Xue, L., and Yang, J. Federated q-learning: Linear regret speedup with low communication cost. In The Twelfth International Conference on Learning Representations, 2024.
  • Zheng et al. (2025a) Zheng, Z., Zhang, H., and Xue, L. Federated q-learning with reference-advantage decomposition: Almost optimal regret and logarithmic communication cost. In The Thirteenth International Conference on Learning Representations, 2025a.
  • Zheng et al. (2025b) Zheng, Z., Zhang, H., and Xue, L. Gap-dependent bounds for q-learning using reference-advantage decomposition. In The Thirteenth International Conference on Learning Representations, 2025b.
  • Zhou et al. (2023) Zhou, R., Zihan, Z., and Du, S. S. Sharp variance-dependent bounds in reinforcement learning: Best of both worlds in stochastic and deterministic environments. In International Conference on Machine Learning, pp.  42878–42914. PMLR, 2023.

In this appendix, we provide further details of the FedQ-Hoeffding and FedQ-Bernstein algorithms.

Appendix A Algorithm Review

A.1 FedQ-Hoeffding Algorithm

In this section, we present more details for Section 4.1. Denote \eta_{t}=\frac{H+1}{H+t}, \eta_{0}^{0}=1, \eta^{t}_{0}=0 for t\geq 1, and \eta^{t}_{i}=\eta_{i}\prod_{i^{\prime}=i+1}^{t}(1-\eta_{i^{\prime}}) for all 1\leq i\leq t. We also denote \eta^{c}(t_{1},t_{2})=\prod_{t=t_{1}}^{t_{2}}(1-\eta_{t}) for any positive integers t_{1}\leq t_{2}. After receiving the information from each agent m, for each triple (s,a,h) visited by the agents, the server sets \eta^{h,k}_{s,a}=1-\eta^{c}\big(N_{h}^{k}(s,a)+1,N_{h}^{k+1}(s,a)\big) and \beta^{k}_{s,a,h}=\sum_{t=t^{k-1}+1}^{t^{k}}\eta^{t^{k}}_{t}b_{t}, where the per-visit confidence bound is b_{t}=c\sqrt{H^{3}\iota/t} for some sufficiently large constant c>0.
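To make these definitions concrete, the following minimal Python sketch (helper names are ours and purely illustrative) computes the step sizes \eta_t, the weights \eta^t_i, the products \eta^c(t_1,t_2), and the aggregated bonus \beta^k_{s,a,h}, under the assumption that t^{k-1} and t^k stand for N_h^k(s,a) and N_h^{k+1}(s,a).

import math

def eta(t, H):
    # Step size eta_t = (H + 1) / (H + t).
    return (H + 1) / (H + t)

def eta_c(t1, t2, H):
    # eta^c(t1, t2) = prod_{t = t1}^{t2} (1 - eta_t); empty product (= 1) if t1 > t2.
    prod = 1.0
    for t in range(t1, t2 + 1):
        prod *= 1.0 - eta(t, H)
    return prod

def eta_weight(i, t, H):
    # eta_i^t = eta_i * prod_{i' = i+1}^{t} (1 - eta_{i'}); eta_0^0 = 1 and eta_0^t = 0 for t >= 1.
    if i == 0:
        return 1.0 if t == 0 else 0.0
    return eta(i, H) * eta_c(i + 1, t, H)

def hoeffding_bonus(t, H, iota, c=1.0):
    # Per-visit bonus b_t = c * sqrt(H^3 * iota / t).
    return c * math.sqrt(H ** 3 * iota / t)

def aggregated_bonus(N_prev, N_new, H, iota, c=1.0):
    # beta_{s,a,h}^k = sum_{t = N_prev + 1}^{N_new} eta_t^{N_new} * b_t,
    # with N_prev = N_h^k(s,a) and N_new = N_h^{k+1}(s,a).
    return sum(eta_weight(t, N_new, H) * hoeffding_bonus(t, H, iota, c)
               for t in range(N_prev + 1, N_new + 1))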

Case 1: N_{h}^{k}(s,a)<2MH(H+1)=:i_{0}. In this case, each agent visits (s,a) at step h at most once during the k-th round. We denote 1\leq m_{1}<m_{2}<\ldots<m_{N_{h}^{k+1}-N_{h}^{k}}\leq M as the indices of the agents with n_{h}^{m,k}(s,a)>0. The server then updates the global estimate of the action values sequentially as follows:

Q_{h}^{k+1}(s,a)=(1-\eta_{t})Q_{h}^{k}(s,a)+\eta_{t}\big(r_{h}(s,a)+v_{h+1}^{m_{t-N_{h}^{k}(s,a)},k}(s,a)+b_{t}\big),\quad t=N_{h}^{k}(s,a)+1,\ldots,N_{h}^{k+1}(s,a). (7)

Case 2: Nhk(s,a)i0N_{h}^{k}(s,a)\geq i_{0}. In this case, the central server calculates

vh+1k(s,a)=m=1Mvh+1m,k(s,a)/nhk(s,a)v_{h+1}^{k}(s,a)=\sum_{m=1}^{M}v_{h+1}^{m,k}(s,a)/n_{h}^{k}(s,a)

and updates

Qhk+1(s,a)=(1ηs,ah,k)Qhk(s,a)+ηs,ah,k(rh(s,a)+vh+1k(s,a))+βs,a,hk.Q_{h}^{k+1}(s,a)=(1-\eta^{h,k}_{s,a})Q_{h}^{k}(s,a)+\eta^{h,k}_{s,a}\left(r_{h}(s,a)+v_{h+1}^{k}(s,a)\right)+\beta^{k}_{s,a,h}. (8)

After finishing updating the estimated QQ function, the server updates the estimated value function and the policy as follows:

Vhk+1(s)=min{H,maxa𝒜Qhk+1(s,a)},πhk+1(s)=argmaxa𝒜Qhk+1(s,a),(s,h)𝒮×[H].\displaystyle V_{h}^{k+1}(s)=\min\Big{\{}H,\max_{a^{\prime}\in\mathcal{A}}Q_{h}^{k+1}(s,a^{\prime})\Big{\}},\ \pi_{h}^{k+1}(s)=\arg\max_{a^{\prime}\in\mathcal{A}}Q_{h}^{k+1}\left(s,a^{\prime}\right),\forall(s,h)\in\mathcal{S}\times[H]. (9)
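Purely as an illustration (the precise procedure is given in Algorithms 1 and 2 below), the following Python sketch mirrors the two update cases in Equations 7 and 8 and the value/policy update in Equation 9 for a single (s,a,h) triple, reusing the helper functions from the previous sketch; variable names and array shapes are our own assumptions.

import numpy as np

def server_q_update(Q, N_prev, N_new, r, v_agents, n_agents, H, iota, c=1.0):
    # Q: current estimate Q_h^k(s,a); N_prev = N_h^k(s,a); N_new = N_h^{k+1}(s,a).
    # v_agents[m] = v_{h+1}^{m,k}(s,a) and n_agents[m] = n_h^{m,k}(s,a) for agents m = 0, ..., M-1.
    # Assumes (s,a,h) was visited in round k, i.e., sum(n_agents) = N_new - N_prev > 0.
    M = len(n_agents)
    i0 = 2 * M * H * (H + 1)
    if N_prev < i0:
        # Case 1 (Equation 7): one sequential update per visiting agent, in increasing agent order.
        visiting = [m for m in range(M) if n_agents[m] > 0]
        for offset, m in enumerate(visiting):
            t = N_prev + offset + 1
            lr = eta(t, H)
            Q = (1 - lr) * Q + lr * (r + v_agents[m] + hoeffding_bonus(t, H, iota, c))
    else:
        # Case 2 (Equation 8): a single aggregated update with weight eta_{s,a}^{h,k}.
        n_total = sum(n_agents)                      # n_h^k(s,a)
        v_bar = sum(v_agents) / n_total              # v_{h+1}^k(s,a)
        lr = 1.0 - eta_c(N_prev + 1, N_new, H)       # eta_{s,a}^{h,k}
        Q = (1 - lr) * Q + lr * (r + v_bar) + aggregated_bonus(N_prev, N_new, H, iota, c)
    return Q

def value_and_policy(Q_row, H):
    # Equation 9: truncated value estimate and greedy action from the updated row Q_h^{k+1}(s, .).
    return min(H, float(np.max(Q_row))), int(np.argmax(Q_row))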

The details of the FedQ-Hoeffding algorithm are presented below.

Algorithm 1 FedQ-Hoeffding (Central Server)
1:Input: T0+T_{0}\in\mathbb{N}_{+}.
2:Initialization: k=1k=1, Nh1(s,a)=0N_{h}^{1}(s,a)=0, Qh1(s,a)=Vh1(s)=HQ_{h}^{1}(s,a)=V_{h}^{1}(s)=H, (s,a,h)𝒮×𝒜×[H]\forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] and π1={πh1:𝒮𝒜}h[H]\pi^{1}=\left\{\pi_{h}^{1}:\mathcal{S}\rightarrow\mathcal{A}\right\}_{h\in[H]} is an arbitrary deterministic policy.
3:while Hk=1k1Mnk<T0H\sum_{k^{\prime}=1}^{k-1}Mn^{k^{\prime}}<T_{0} do
4:     Broadcast πk,\pi^{k}, {Nhk(s,πhk(s))}s,h\{N_{h}^{k}(s,\pi_{h}^{k}(s))\}_{s,h} and {Vhk(s)}s,h\{V_{h}^{k}(s)\}_{s,h} to all clients.
5:     Wait until receiving an abortion signal and send the signal to all agents.
6:     Receive {rh(s,πhk(s))}s,h\{r_{h}(s,\pi_{h}^{k}(s))\}_{s,h},{nhm,k(s,πhk(s))}s,h,m\{n_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h,m} and {vh+1m,k(s,πhk(s))}s,h,m\{v_{h+1}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h,m} from clients.
7:     Calculate Nhk+1(s,a),nhk(s,a),vh+1k(s,a),(s,h)𝒮×[H]N_{h}^{k+1}(s,a),n_{h}^{k}(s,a),v_{h+1}^{k}(s,a),\forall(s,h)\in\mathcal{S}\times[H] with a=πhk(s)a=\pi_{h}^{k}(s).
8:     for (s,a,h)𝒮×𝒜×[H](s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] do
9:         if aπhk(s)a\neq\pi_{h}^{k}(s) or nhk(s,a)=0n_{h}^{k}(s,a)=0 then
10:              Qhk+1(s,a)Qhk(s,a)Q_{h}^{k+1}(s,a)\leftarrow Q_{h}^{k}(s,a).
11:         else if Nhk(s,a)<i0N_{h}^{k}(s,a)<i_{0} then
12:              Update Qhk+1(s,a)Q_{h}^{k+1}(s,a) according to Equation 7.
13:         else
14:              Update Qhk+1(s,a)Q_{h}^{k+1}(s,a) according to Equation 8.
15:         end if
16:     end for
17:     Update Vhk+1V_{h}^{k+1} and πk+1\pi^{k+1} by eq. 9.
18:     kk+1k\leftarrow k+1.
19:end while
Algorithm 2 FedQ-Hoeffding (Agent mm in round kk)
1:nhm(s,a)=vh+1m(s,a)=rh(s,a)=0,(s,a,h)𝒮×𝒜×[H]n_{h}^{m}(s,a)=v_{h+1}^{m}(s,a)=r_{h}(s,a)=0,\forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H].
2:Receive πk,\pi^{k}, {Nhk(s,πhk(s))}s,h\{N_{h}^{k}(s,\pi_{h}^{k}(s))\}_{s,h} and {Vhk(s)}s,h\{V_{h}^{k}(s)\}_{s,h} from the central server.
3:while no abortion signal from the central server do
4:     while n_{h}^{m}(s,a)<\max\left\{1,\lfloor\frac{1}{MH(H+1)}N_{h}^{k}(s,a)\rfloor\right\},\ \forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] do
5:         Collect a new trajectory {(sh,ah,rh)}h=1H\{(s_{h},a_{h},r_{h})\}_{h=1}^{H} with ah=πhk(sh)a_{h}=\pi_{h}^{k}(s_{h}).
6:         nhm(sh,ah)nhm(sh,ah)+1,n_{h}^{m}(s_{h},a_{h})\leftarrow n_{h}^{m}(s_{h},a_{h})+1, vh+1m(sh,ah)vh+1m(sh,ah)+Vh+1k(sh+1)v_{h+1}^{m}(s_{h},a_{h})\leftarrow v_{h+1}^{m}(s_{h},a_{h})+V_{h+1}^{k}(s_{h+1}), and rh(sh,ah)rh,h[H].{r}_{h}(s_{h},a_{h})\leftarrow r_{h},\forall h\in[H].
7:     end while
8:     Send an abortion signal to the central server.
9:end while
10:nhm,k(s,a)nhm(s,a),vh+1m,k(s,a)vh+1m(s,a),(s,h)𝒮×[H]n_{h}^{m,k}(s,a)\leftarrow n_{h}^{m}(s,a),v_{h+1}^{m,k}(s,a)\leftarrow v_{h+1}^{m}(s,a),\forall(s,h)\in\mathcal{S}\times[H] with a=πhk(s)a=\pi_{h}^{k}(s).
11:Send {rh(s,πhk(s))}s,h\{r_{h}(s,\pi_{h}^{k}(s))\}_{s,h},{nhm,k(s,πhk(s))}s,h\{n_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h} and {vh+1m,k(s,πhk(s))}s,h\{v_{h+1}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h} to the central server.

A.2 FedQ-Bernstein Algorithm

The Bernstein-type algorithm differs from the Hoeffding-type algorithm (Algorithms 1 and 2) in that it selects the upper confidence bound based on a variance estimator of the next-step value estimates, akin to the approach used in the Bernstein-type algorithm of Jin et al. (2018). In this subsection, we first review the algorithm design.

To facilitate understanding, we introduce additional notation specific to the Bernstein-type algorithm, supplementing the notation already introduced for the Hoeffding-type algorithm.

μhm,k(s,a)=1nhm,k(s,a)j=1nm,k[Vh+1k(sh+1k,j,m)]2𝕀[(shk,j,m,ahk,j,m)=(s,a)].\mu_{h}^{m,k}(s,a)=\frac{1}{n_{h}^{m,k}(s,a)}\sum_{j=1}^{n^{m,k}}\left[V_{h+1}^{k}\left(s_{h+1}^{k,j,m}\right)\right]^{2}\mathbb{I}[(s_{h}^{k,j,m},a_{h}^{k,j,m})=(s,a)].
μhk(s,a)=1Nhk+1(s,a)Nhk(s,a)m=1Mμhm,k(s,a)nhm,k(s,a).\mu_{h}^{k}(s,a)=\frac{1}{N_{h}^{k+1}(s,a)-N_{h}^{k}(s,a)}\sum_{m=1}^{M}\mu_{h}^{m,k}(s,a)n_{h}^{m,k}(s,a).

Here, μhm,k(s,a)\mu_{h}^{m,k}(s,a) is the sample mean of [Vh+1k(sh+1k,j,m)]2[V_{h+1}^{k}(s_{h+1}^{k,j,m})]^{2} for all the visits of (s,a,h)(s,a,h) for the mm-th agent during the kk-th round and μhk(s,a)\mu_{h}^{k}(s,a) corresponds to the mean for all the visits during the kk-th round. We define Wk(s,a,h)W_{k}(s,a,h) to denote the sample variance of all the visits before the kk-th round, i.e.

W_{k}(s,a,h)=\frac{1}{N_{h}^{k}(s,a)}\sum_{i=1}^{N_{h}^{k}(s,a)}\left(V_{h+1}^{k^{i}}(s_{h+1}^{k^{i},j^{i},m^{i}})-\frac{1}{N_{h}^{k}(s,a)}\sum_{i^{\prime}=1}^{N_{h}^{k}(s,a)}V_{h+1}^{k^{i^{\prime}}}(s_{h+1}^{k^{i^{\prime}},j^{i^{\prime}},m^{i^{\prime}}})\right)^{2}.

We can find that

Wk(s,a,h)=1Nhk(s,a)k=1k1μhk(s,a)nhk(s,a)[1Nhk(s,a)k=1k1vh+1k(s,a)nhk(s,a)]2,W_{k}(s,a,h)=\frac{1}{N_{h}^{k}(s,a)}\sum_{k^{\prime}=1}^{k-1}\mu_{h}^{k^{\prime}}(s,a)n_{h}^{k^{\prime}}(s,a)-\left[\frac{1}{N_{h}^{k}(s,a)}\sum_{k^{\prime}=1}^{k-1}v_{h+1}^{k^{\prime}}(s,a)n_{h}^{k^{\prime}}(s,a)\right]^{2},

which means that this quantity can be calculated efficiently in practice in the following way. Define

W1,k(s,a,h)=k=1k1μhk(s,a)nhk(s,a),W2,k(s,a,h)=k=1k1vh+1k(s,a)nhk(s,a),W_{1,k}(s,a,h)=\sum_{k^{\prime}=1}^{k-1}\mu_{h}^{k^{\prime}}(s,a)n_{h}^{k^{\prime}}(s,a),\ W_{2,k}(s,a,h)=\sum_{k^{\prime}=1}^{k-1}v_{h+1}^{k^{\prime}}(s,a)n_{h}^{k^{\prime}}(s,a), (10)

then we have

W1,k+1(s,a,h)=W1,k(s,a,h)+μhk(s,a)nhk(s,a),W2,k+1(s,a,h)=W2,k(s,a,h)+vh+1k(s,a)nhk(s,a)W_{1,k+1}(s,a,h)=W_{1,k}(s,a,h)+\mu_{h}^{k}(s,a)n_{h}^{k}(s,a),\ W_{2,k+1}(s,a,h)=W_{2,k}(s,a,h)+v_{h+1}^{k}(s,a)n_{h}^{k}(s,a) (11)

and

Wk+1(s,a,h)=W1,k+1(s,a,h)Nhk+1(s,a)[W2,k+1(s,a,h)Nhk+1(s,a)]2.W_{k+1}(s,a,h)=\frac{W_{1,k+1}(s,a,h)}{N_{h}^{k+1}(s,a)}-\left[\frac{W_{2,k+1}(s,a,h)}{N_{h}^{k+1}(s,a)}\right]^{2}. (12)

This indicates that the central server, by actively maintaining and updating the quantities W1,kW_{1,k} and W2,kW_{2,k} and systematically collecting nhm,kn_{h}^{m,k}s, μhm,k\mu_{h}^{m,k}s and vh+1m,kv_{h+1}^{m,k}s, is able to compute Wk+1W_{k+1}.
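As an illustration of this bookkeeping, the following minimal Python sketch (variable names are ours) applies Equations 10, 11, and 12; starting from W1 = W2 = 0, repeated calls reproduce W_{k+1}(s,a,h) without the server ever storing individual visits.

def update_variance_stats(W1, W2, mu_k, v_k, n_k, N_next):
    # W1, W2: running sums W_{1,k}(s,a,h) and W_{2,k}(s,a,h) from Equation 10.
    # mu_k, v_k: round-k aggregates mu_h^k(s,a) and v_{h+1}^k(s,a); n_k = n_h^k(s,a).
    # N_next: cumulative visit count N_h^{k+1}(s,a) = N_h^k(s,a) + n_k.
    W1_next = W1 + mu_k * n_k                              # Equation 11
    W2_next = W2 + v_k * n_k                               # Equation 11
    W_next = W1_next / N_next - (W2_next / N_next) ** 2    # Equation 12
    return W1_next, W2_next, W_next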

Next, we define

βtB(s,a,h)=c(min{Hιt(Wkt+1(s,a,h)+H)+ιH7SA+MSAH6t,H3ιt}),\beta_{t}^{\textnormal{B}}(s,a,h)=c^{\prime}\left(\min\left\{\sqrt{\frac{H\iota}{t}(W_{k^{t}+1}(s,a,h)+H)}+\iota\frac{\sqrt{H^{7}SA}+\sqrt{MSAH^{6}}}{t},\sqrt{\frac{H^{3}\iota}{t}}\right\}\right), (13)

in which c^{\prime}>0 is a positive constant. With this, the per-visit upper confidence bound b_{t}(s,a,h) is determined by the relation \beta_{t}^{\textnormal{B}}(s,a,h)=2\sum_{i=1}^{t}\eta^{t}_{i}b_{i}(s,a,h), which can be solved recursively as follows:

b1(s,a,h):=β1B(s,a,h)2,bt(s,a,h):=βtB(s,a,h)(1ηt)βt1B(s,a,h)2ηt.b_{1}(s,a,h):=\frac{\beta_{1}^{\textnormal{B}}(s,a,h)}{2},\ b_{t}(s,a,h):=\frac{\beta_{t}^{\textnormal{B}}(s,a,h)-\left(1-\eta_{t}\right)\beta_{t-1}^{\textnormal{B}}(s,a,h)}{2\eta_{t}}.
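The following short Python sketch (function and constant names are ours) evaluates the Bernstein bonus in Equation 13 and recovers the per-visit bonus from the recursion above; the argument W is assumed to be the variance estimate W_{k^{t}+1}(s,a,h) from Equation 12.

import math

def bernstein_beta(t, W, H, S, A, M, iota, c_prime=1.0):
    # Equation 13: beta_t^B(s,a,h), capped by the Hoeffding-style term sqrt(H^3 * iota / t).
    bernstein_term = (math.sqrt(H * iota / t * (W + H))
                      + iota * (math.sqrt(H ** 7 * S * A) + math.sqrt(M * S * A * H ** 6)) / t)
    return c_prime * min(bernstein_term, math.sqrt(H ** 3 * iota / t))

def bernstein_b(t, beta_t, beta_prev, H):
    # Recover b_t from beta_t^B = 2 * sum_{i <= t} eta_i^t * b_i:
    # b_1 = beta_1^B / 2 and b_t = (beta_t^B - (1 - eta_t) * beta_{t-1}^B) / (2 * eta_t) for t >= 2.
    if t == 1:
        return beta_t / 2.0
    lr = (H + 1) / (H + t)  # eta_t
    return (beta_t - (1.0 - lr) * beta_prev) / (2.0 * lr)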

When there is no ambiguity, we use the simplified notations \tilde{b}_{t}=b_{t}(s,a,h) and \tilde{\beta}_{t}=\beta_{t}^{\textnormal{B}}(s,a,h). In the FedQ-Bernstein algorithm, let \tilde{\beta}=\beta_{t^{k}}^{\textnormal{B}}(s,a,h)-\eta^{c}(t^{k-1}+1,t^{k})\beta_{t^{k-1}}^{\textnormal{B}}(s,a,h). Then, similarly to FedQ-Hoeffding, the server updates the global estimates of the action values according to the following two cases.

  • Case 1: N_{h}^{k}(s,a)<i_{0}. In this case, each agent visits (s,a) at step h at most once during the k-th round. We denote 1\leq m_{1}<m_{2}<\ldots<m_{t^{k}-t^{k-1}}\leq M as the indices of the agents with n_{h}^{m,k}(s,a)>0. The server then updates the global estimate of the action values as follows:

    Q_{h}^{k+1}(s,a)=(1-\eta_{t})Q_{h}^{k}(s,a)+\eta_{t}\left(r_{h}(s,a)+v_{h+1}^{m_{t-t^{k-1}},k}(s,a)+\tilde{b}_{t}\right),\quad t=t^{k-1}+1,\ldots,t^{k}. (14)
  • Case 2: Nhk(s,a)i0N_{h}^{k}(s,a)\geq i_{0}. In this case, the central server calculates

    vh+1k(s,a)=m=1Mvh+1m,k(s,a)/nhk(s,a)v_{h+1}^{k}(s,a)=\sum_{m=1}^{M}v_{h+1}^{m,k}(s,a)/n_{h}^{k}(s,a)

    and updates the Q-estimate as follows (a short code sketch of this case is given after the list):

    Qhk+1(s,a)=(1ηs,ah,k)Qhk(s,a)+ηs,ah,k(rh(s,a)+vh+1k(s,a))+β~/2.Q_{h}^{k+1}(s,a)=(1-\eta^{h,k}_{s,a})Q_{h}^{k}(s,a)+\eta^{h,k}_{s,a}\left(r_{h}(s,a)+v_{h+1}^{k}(s,a)\right)+\tilde{\beta}/2. (15)
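As referenced in Case 2 above, the following sketch (reusing eta_c and bernstein_beta from the earlier sketches; the cached value beta_prev is our own bookkeeping device) illustrates the aggregated Bernstein update in Equation 15, under the assumption that t^{k-1}=N_h^k(s,a) and t^k=N_h^{k+1}(s,a).

def bernstein_case2_update(Q, r, v_bar, N_prev, N_new, W_new, beta_prev, H, S, A, M, iota, c_prime=1.0):
    # Equation 15: aggregated update with weight eta_{s,a}^{h,k} and bonus beta_tilde / 2.
    # beta_prev caches beta_{t^{k-1}}^B(s,a,h), computed in the last round in which (s,a,h) was updated.
    shrink = eta_c(N_prev + 1, N_new, H)          # eta^c(t^{k-1}+1, t^k)
    lr = 1.0 - shrink                             # eta_{s,a}^{h,k}
    beta_new = bernstein_beta(N_new, W_new, H, S, A, M, iota, c_prime)
    beta_tilde = beta_new - shrink * beta_prev
    Q_new = (1 - lr) * Q + lr * (r + v_bar) + beta_tilde / 2.0
    return Q_new, beta_new  # beta_new is cached for the next update of this (s,a,h)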

We now present the FedQ-Bernstein algorithm of Zheng et al. (2024).

Algorithm 3 FedQ-Bernstein (Central Server)
1:Input: T0+T_{0}\in\mathbb{N}_{+}.
2:Initialization: k=1k=1, Nh1(s,a)=W1,k(s,a,h)=W2,k(s,a,h)=0,Qh1(s,a)=Vh1(s)=H,(s,a,h)𝒮×𝒜×[H]N_{h}^{1}(s,a)=W_{1,k}(s,a,h)=W_{2,k}(s,a,h)=0,Q_{h}^{1}(s,a)=V_{h}^{1}(s)=H,\forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] and π1={πh1:𝒮𝒜}h[H]\pi^{1}=\left\{\pi_{h}^{1}:\mathcal{S}\rightarrow\mathcal{A}\right\}_{h\in[H]} is an arbitrary deterministic policy.
3:while Hk=1k1Mnk<T0H\sum_{k^{\prime}=1}^{k-1}Mn^{k^{\prime}}<T_{0} do
4:     Broadcast πk,\pi^{k}, {Nhk(s,πhk(s))}s,h\{N_{h}^{k}(s,\pi_{h}^{k}(s))\}_{s,h} and {Vhk(s)}s,h\{V_{h}^{k}(s)\}_{s,h} to all clients.
5:     Wait until receiving an abortion signal and send the signal to all agents.
6:     Receive {rh(s,πhk(s))}s,h\{r_{h}(s,\pi_{h}^{k}(s))\}_{s,h}, {nhm,k(s,πhk(s))}s,h,m\{n_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h,m}, {vh+1m,k(s,πhk(s))}s,h,m\{v_{h+1}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h,m}, {μhm,k(s,πhk(s))}s,h,m\{\mu_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h,m}.
7:     Calculate Nhk+1(s,a),nhk(s,a),vh+1k(s,a),(s,h)𝒮×[H]N_{h}^{k+1}(s,a),n_{h}^{k}(s,a),v_{h+1}^{k}(s,a),\forall(s,h)\in\mathcal{S}\times[H] with a=πhk(s)a=\pi_{h}^{k}(s).
8:     Calculate Wk(s,a,h),Wk+1(s,a,h),W1,k+1(s,a,h),W2,k+1(s,a,h),W_{k}(s,a,h),W_{k+1}(s,a,h),W_{1,k+1}(s,a,h),W_{2,k+1}(s,a,h), (s,h)𝒮×[H]\forall(s,h)\in\mathcal{S}\times[H] with a=πhk(s)a=\pi_{h}^{k}(s) based on Equation 10, Equation 11 and Equation 12.
9:     for (s,a,h)𝒮×𝒜×[H](s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] do
10:         if aπhk(s)a\neq\pi_{h}^{k}(s) or nhk(s,a)=0n_{h}^{k}(s,a)=0 then
11:              Qhk+1(s,a)Qhk(s,a)Q_{h}^{k+1}(s,a)\leftarrow Q_{h}^{k}(s,a).
12:         else
13:              Update Q_{h}^{k+1}(s,a) according to Equation 14 if N_{h}^{k}(s,a)<i_{0}, and Equation 15 otherwise.
14:         end if
15:     end for
16:     Update Vhk+1V_{h}^{k+1} and πk+1\pi^{k+1} by Equation 9.
17:     kk+1k\leftarrow k+1.
18:end while
Algorithm 4 FedQ-Bernstein (Agent mm in round kk)
1:nhm(s,a)=vh+1m(s,a)=rh(s,a)=μhm(s,a)=0,(s,a,h)𝒮×𝒜×[H]n_{h}^{m}(s,a)=v_{h+1}^{m}(s,a)=r_{h}(s,a)=\mu_{h}^{m}(s,a)=0,\forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H].
2:Receive πk\pi^{k}, {Nhk(s,πhk(s))}s,h\{N_{h}^{k}(s,\pi_{h}^{k}(s))\}_{s,h} and {Vhk(s)}s,h\{V_{h}^{k}(s)\}_{s,h} from the central server.
3:while no abortion signal from the central server do
4:     while n_{h}^{m}(s,a)<\max\left\{1,\lfloor\frac{1}{MH(H+1)}N_{h}^{k}(s,a)\rfloor\right\},\ \forall(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H] do
5:         Collect a new trajectory \{(s_{h},a_{h},r_{h})\}_{h=1}^{H} with a_{h}=\pi_{h}^{k}(s_{h}).
6:         n_{h}^{m}(s_{h},a_{h})\leftarrow n_{h}^{m}(s_{h},a_{h})+1, v_{h+1}^{m}(s_{h},a_{h})\leftarrow v_{h+1}^{m}(s_{h},a_{h})+V_{h+1}^{k}(s_{h+1}), \mu_{h}^{m}(s_{h},a_{h})\leftarrow\mu_{h}^{m}(s_{h},a_{h})+\left[V_{h+1}^{k}(s_{h+1})\right]^{2}, and r_{h}(s_{h},a_{h})\leftarrow r_{h},\forall h\in[H].
7:     end while
8:     Send an abortion signal to the central server.
9:end while
10:nhm,k(s,a)nhm(s,a),vh+1m,k(s,a)vh+1m(s,a)n_{h}^{m,k}(s,a)\leftarrow n_{h}^{m}(s,a),v_{h+1}^{m,k}(s,a)\leftarrow v_{h+1}^{m}(s,a) and μhm,k(s,a)μhm(s,a)/nhm(s,a),(s,h)𝒮×[H]\mu_{h}^{m,k}(s,a)\leftarrow\mu_{h}^{m}(s,a)/n_{h}^{m}(s,a),\forall(s,h)\in\mathcal{S}\times[H] with a=πhk(s)a=\pi_{h}^{k}(s).
11:Send {rh(s,πhk(s))}s,h\{r_{h}(s,\pi_{h}^{k}(s))\}_{s,h}, {nhm,k(s,πhk(s))}s,h\{n_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h}, {μhm,k(s,πhk(s))}s,h\{\mu_{h}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h} and {vh+1m,k(s,πhk(s))}s,h\{v_{h+1}^{m,k}(s,\pi_{h}^{k}(s))\}_{s,h} to the central server.