
The Sample-Communication Complexity Trade-off in Federated Q-Learning

Sudeep Salgia
Yuejie Chi
Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA 15213, USA.
(April 2024)

We consider the problem of federated Q-learning, where $M$ agents aim to collaboratively learn the optimal Q-function of an unknown infinite-horizon Markov decision process with finite state and action spaces. We investigate the trade-off between sample and communication complexities for the widely used class of intermittent communication algorithms. We first establish the converse result, showing that a federated Q-learning algorithm that offers any speedup with respect to the number of agents in the per-agent sample complexity needs to incur a communication cost of at least an order of $\frac{1}{1-\gamma}$, up to logarithmic factors, where $\gamma$ is the discount factor. We also propose a new algorithm, called Fed-DVR-Q, which is the first federated Q-learning algorithm to simultaneously achieve order-optimal sample and communication complexities. Together, these results provide a complete characterization of the sample-communication complexity trade-off in federated Q-learning.

1 Introduction

Reinforcement Learning (RL) (Sutton2018RLBook) refers to a paradigm of sequential decision making where an agent aims to learn an optimal policy, i.e., a policy that maximizes the long-term total reward, through repeated interactions with an unknown environment. RL finds applications across a diverse array of fields including, but not limited to, autonomous driving, games, recommendation systems, robotics and Internet of Things (IoT) (Kober2013RoboticsSurvey; Yurtsever2020SelfDriving; Huang2016Go; Lim2020IoT).

The primary hurdle in RL applications is often the high-dimensional nature of the decision space, which necessitates that the learning agent have access to an enormous amount of data in order to have any hope of learning the optimal policy. Moreover, the sequential collection of such an enormous amount of data through a single agent is extremely time-consuming and often infeasible in practice (mnih2016a3c). Consequently, practical implementations of RL involve deploying multiple agents to collect data in parallel. This decentralized approach to data collection has fueled the design and development of distributed or federated RL algorithms that can collaboratively learn the optimal policy without actually transferring the collected data to a centralized server, while achieving a linear speedup in terms of the number of agents. Such a federated approach to RL, which does not require the transfer of local data, is gaining interest due to lower bandwidth requirements and lower security and privacy risks. In this work, we focus on federated variants of the vastly popular Q-learning algorithm (Watkins1992QL), where the agents collaborate to directly learn the optimal Q-function without forming an estimate of the underlying unknown environment.

A particularly important aspect of designing federated RL algorithms, including federated Q-learning algorithms, is to address the natural tension between sample and communication complexities. At one end of the spectrum lies the naïve approach of running a centralized algorithm with an optimal sample complexity after transferring and combining all the collected data at a central server. Such an approach trivially achieves the optimal sample complexity while suffering from a very high and prohibitive communication complexity. On the other hand, several recently proposed algorithms (Khodadadian2022FederatedQL; Woo2023FedSynQ; Woo2024FedOffline; zheng2024federated) operate in more practical regimes, offering significantly lower communication complexities than the naïve approach at the cost of sub-optimal sample complexities. These results suggest the existence of an underlying trade-off between the sample and communication complexities of federated RL algorithms. The primary goal of this work is to better understand this trade-off in the context of federated Q-learning by investigating the following fundamental questions.

  • Fundamental limit of communication: What is the minimum amount of communication required by a federated Q-learning algorithm to achieve any statistical benefit of collaboration?

  • Optimal algorithm design: How does one design a federated Q-learning algorithm that simultaneously offers optimal sample and communication complexity guarantees, i.e., operates on the optimal frontier of the sample-communication complexity trade-off?

1.1 Main results

In this work, we consider a setup where $M$ distributed agents, each with access to a local generative model (Kearns1998FiniteSampleQL), collaborate to learn the optimal Q-function of an infinite-horizon Markov decision process (MDP) (Puterman2014MDPBook), which is defined over a finite state space $\mathcal{S}$ and a finite action space $\mathcal{A}$, and has a discount factor of $\gamma\in(0,1)$. To probe the communication complexity, we consider a common setup in federated learning called the intermittent communication setting (Woodworth2021IntermittentComm), where the agents intermittently share information among themselves with the help of a central server. We provide a complete characterization of the trade-off between sample and communication complexities under the aforementioned setting by answering both questions. Summarized below, the main result of this work is twofold.

  • Fundamental lower bounds on the communication complexity of federated Q-learning. We establish lower bounds on the communication complexity of federated Q-learning, both in terms of the number of communication rounds and the overall number of bits that need to be transmitted, in order to achieve any speedup in the convergence rate with respect to the number of agents. Specifically, we show that in order for an intermittent communication algorithm to obtain any benefit of collaboration, i.e., any order of speedup with respect to the number of agents, the number of communication rounds must be at least $\Omega\left(\frac{1}{(1-\gamma)\log^{2}N}\right)$ and the number of bits sent by each agent to the server must be at least $\Omega\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\log^{2}N}\right)$, where $N$ denotes the number of samples taken by the algorithm for each state-action pair.

  • Achieving the optimal sample-communication complexity trade-off of federated Q-learning. We propose a new federated Q-learning algorithm, called Federated Doubly Variance Reduced Q-Learning (dubbed Fed-DVR-Q), that simultaneously achieves order-optimal sample complexity and communication complexity as dictated by the lower bound. We show that Fed-DVR-Q learns an $\varepsilon$-accurate optimal Q-function in the $\ell_{\infty}$ sense with $\tilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{M\varepsilon^{2}(1-\gamma)^{3}}\right)$ i.i.d. samples from the generative model at each agent, while incurring a total communication cost of $\tilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\right)$ bits per agent across $\tilde{\mathcal{O}}\left(\frac{1}{1-\gamma}\right)$ rounds of communication. Thus, Fed-DVR-Q not only improves upon both the sample and communication complexities of existing algorithms, but is also the first algorithm to achieve order-optimal sample and communication complexities (see Table 1 for a comparison).

Algorithm/Reference | Number of Agents | Sample Complexity | Communication Complexity
Q-learning (Li2023QLMinimax) | $1$ | $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{4}\varepsilon^{2}}$ | N/A
Variance Reduced Q-learning (Wainwright2019VarianceReduced) | $1$ | $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^{3}\varepsilon^{2}}$ | N/A
Fed-SynQ (Woo2023FedSynQ) | $M$ | $\frac{|\mathcal{S}||\mathcal{A}|}{M(1-\gamma)^{5}\varepsilon^{2}}$ | $\frac{M}{1-\gamma}$
Fed-DVR-Q (This work) | $M$ | $\frac{|\mathcal{S}||\mathcal{A}|}{M(1-\gamma)^{3}\varepsilon^{2}}$ | $\frac{1}{1-\gamma}$
Lower bound (Azar2013MinimaxPAC; This work) | $M$ | $\frac{|\mathcal{S}||\mathcal{A}|}{M(1-\gamma)^{3}\varepsilon^{2}}$ | $\frac{1}{1-\gamma}$
Table 1: Comparison of sample and communication complexities (per-agent) of various single-agent and federated Q-learning algorithms for learning an $\varepsilon$-optimal Q-function under the synchronous setting. We hide logarithmic factors and burn-in costs for all results for simplicity of presentation. Here, $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces respectively and $\gamma$ denotes the discount factor. We report the communication complexity only in terms of the number of rounds, as existing algorithms assume transmission of real numbers and hence do not report bit-level costs. For the lower bound, Azar2013MinimaxPAC and this work establish the lower bounds for the sample and communication complexities, respectively.

1.2 Related work

Single-agent Q-learning.

Q-learning has been extensively studied in the single-agent setting in terms of its asymptotic convergence (Jaakkola1993StochasticIterativeDP; Tsitsiklis1994AsynchronousQL; Szepesvari1997AsymptoticQL; Borkar2000ODE-RL) and its finite-time sample complexity in synchronous (EvenDar2004LearningRates; Kearns1998FiniteSampleQL; Beck2012ConstantStepSize; sidford2018near; Wainwright2019ConeContractive; Wainwright2019VarianceReduced; Chen2020FiniteSampleAnalysis; Li2023QLMinimax) and asynchronous settings (Chen2021Lyapunov; Li2023QLMinimax; Li2021SampleComplexityAsynchronousQL; Qu2020FiniteAsychronousQL; xia2024instance), in terms of convergence in the $\ell_{\infty}$ sense. On the other hand, regret analysis of Q-learning has been carried out in both online settings (jin2018q; bai2019provably; menard2021ucb; li2021breaking) and offline settings (shi2022pessimistic; yan2022efficacy), to name a few.

Federated and distributed RL.

There has also been a considerable effort towards developing distributed and federated RL algorithms. The distributed variants of the classical temporal difference (TD) learning algorithm have been investigated in a series of studies (Chen2021OffPolicyTDC; Doan2019TD0LinearFunctionMARL; Doan2021DistributedTD0; Sun2020FiniteDecentralizedTDLinear; Wai2020ConvergenceConsensus; liu2023distributed; Wang2020DecentralizedTDTracking; Zeng2021FiniteDecentralizedSA). The impact of environmental heterogeneity in federated RL was studied in Wang2023FederatedTDHeterogeneity for TD learning, and in Jin2022FederatedRLHeterogeneity when the local environments are known. A distributed version of the actor-critic algorithm was studied by Shen2023AsynchronousAdvantageActorCritic, where the authors established convergence of their algorithm and demonstrated a linear speedup in the number of agents in their sample complexity bound. Chen2022DecentralizedActorCritic proposed a new distributed actor-critic algorithm which improved the dependence of the sample complexity on the error $\varepsilon$ with a communication cost of $\tilde{\mathcal{O}}(\varepsilon^{-1})$. Chen2021CommEfficientPolicyGrad proposed a communication-efficient distributed policy gradient algorithm with convergence analysis and established a communication complexity of $\mathcal{O}(1/(M\varepsilon))$. Xie2023FedKL adopts a distributed policy optimization perspective, which is different from the Q-learning paradigm considered in this work; moreover, the algorithm in Xie2023FedKL incurs a linear communication cost, which is worse than that obtained in our work. Similarly, Zhang2024FiniteTimeOnPolicy focuses on on-policy learning and incurs a communication cost that depends polynomially on the required error $\varepsilon$. Several additional studies (Yang2023FederatedNPG; Zeng2021DecentralizedNPG; Lan2024AsynchronousPG) have also developed and analyzed other distributed/federated variants of the classical natural policy gradient method (Kakade2001PolicyGradient). Assran2019GossipActorCritic; Espeholt2018Impala; Mnih2016AsynchronousDeepRL developed distributed algorithms to train deep RL networks more efficiently.

Federated Q-learning.

Federated Q-learning has been explored relatively recently with theoretical sample and communication complexity guarantees. Khodadadian2022FederatedQL proposed and analyzed a federated Q-learning algorithm in the asynchronous setting; however, its sample complexity guarantee exhibits pessimistic dependencies with respect to salient problem-dependent parameters. Woo2023FedSynQ provided improved analyses for federated Q-learning under both synchronous and asynchronous settings, and introduced importance averaging to tame the heterogeneity of local behavior policies in the asynchronous setting to further improve the sample complexity, showing that a collaborative coverage of the entire state-action space suffices for federated Q-learning. Moving to the offline setting, Woo2024FedOffline proposed a federated Q-learning algorithm for offline RL in the finite-horizon setting and established sample and communication complexities that only require a collaborative coverage of the state-action pairs visited by the optimal policy. zheng2024federated, on the other hand, established a linear speedup for federated Q-learning in the online setting from the regret minimization perspective.

Accuracy-communication trade-off in federated learning.

The trade-off between communication complexity and accuracy (or equivalently, sample complexity) has been studied in various federated and distributed learning problems, including stochastic approximation algorithms for convex optimization. Duchi2014DistributedEstimation; Braverman2016CommunicationLowerBounds established the celebrated inverse linear relationship between the error and the communication cost for the problem of distributed mean estimation. Similar trade-offs for distributed stochastic optimization, multi-armed bandits and linear bandits have been studied and established across numerous studies, e.g., (Woodworth2018GraphOracle; Woodworth2021IntermittentComm; Tsitsiklis1987; Shi2021FMAB; Salgia2023LinearBandits).

2 Background and Problem Formulation

In this section, we provide a brief background of MDPs, outline the performance measures for federated Q-learning algorithms and describe the class of intermittent communication algorithms considered in this work.

2.1 Markov decision processes

We focus on an infinite-horizon MDP, denoted by $\mathcal{M}$, over a state space $\mathcal{S}$ and an action space $\mathcal{A}$ with a discount factor $\gamma\in(0,1)$. Both the state and action spaces are assumed to be finite sets. In an MDP, the state $s$ evolves dynamically under the influence of actions based on a probability transition kernel $P:(\mathcal{S}\times\mathcal{A})\times\mathcal{S}\mapsto[0,1]$. The entry $P(s^{\prime}|s,a)$ denotes the probability of moving to state $s^{\prime}$ when an action $a$ is taken in state $s$. An MDP is also associated with a deterministic reward function $r:\mathcal{S}\times\mathcal{A}\mapsto[0,1]$, where $r(s,a)$ denotes the immediate reward obtained for taking action $a$ in state $s$. Thus, the transition kernel $P$ along with the reward function $r$ completely characterizes an MDP.

A policy $\pi:\mathcal{S}\mapsto\Delta(\mathcal{A})$ is a rule for selecting actions across different states, where $\Delta(\mathcal{A})$ denotes the simplex over $\mathcal{A}$ and $\pi(a|s)$ denotes the probability of choosing action $a$ in state $s$. Each policy $\pi$ is associated with a state value function and a state-action value function, or Q-function, denoted by $V^{\pi}$ and $Q^{\pi}$ respectively. Specifically, $V^{\pi}$ and $Q^{\pi}$ measure the expected discounted cumulative reward attained by policy $\pi$ starting from a given state $s$ and state-action pair $(s,a)$ respectively. Mathematically, $V^{\pi}$ and $Q^{\pi}$ are given as

V^{\pi}(s):=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\bigg|\,s_{0}=s\right];\quad Q^{\pi}(s,a):=\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})\,\bigg|\,s_{0}=s,a_{0}=a\right],\qquad(1)

where $a_{t}\sim\pi(\cdot|s_{t})$ and $s_{t+1}\sim P(\cdot|s_{t},a_{t})$ for all $t\geq 0$, and the expectation is taken w.r.t. the randomness in the trajectory $\{s_{t},a_{t}\}_{t=1}^{\infty}$. Since the rewards lie in $[0,1]$, it follows immediately that both the value function and the Q-function lie in the range $[0,\frac{1}{1-\gamma}]$.

An optimal policy $\pi^{\star}$ is a policy that maximizes the value function uniformly over all states, and it is well known that such an optimal policy $\pi^{\star}$ always exists (Puterman2014MDPBook). The optimal value function and Q-function are those corresponding to an optimal policy $\pi^{\star}$, denoted as $V^{\star}:=V^{\pi^{\star}}$ and $Q^{\star}:=Q^{\pi^{\star}}$ respectively. The optimal Q-function $Q^{\star}$ is also the unique fixed point of the Bellman optimality operator $\mathcal{T}:\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}\mapsto\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}$, given by

\mathcal{T}Q^{\star}=Q^{\star},\qquad\text{where}\quad(\mathcal{T}Q)(s,a)=r(s,a)+\gamma\cdot\mathbb{E}_{s^{\prime}\sim P(\cdot|s,a)}\left[\max_{a^{\prime}\in\mathcal{A}}Q(s^{\prime},a^{\prime})\right].\qquad(2)

The popular Q-learning algorithm (Watkins1992QL) aims to learn the optimal policy by first learning $Q^{\star}$ as the solution to the above fixed-point equation, via stochastic approximation when $\mathcal{T}$ is only accessed through samples, and then obtaining a deterministic optimal policy via greedy action selection, i.e., $\pi^{\star}(s)=\operatorname*{arg\,max}_{a}Q^{\star}(s,a)$.
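
As a concrete illustration of the operator in Eqn. (2) and the greedy policy extraction, the following minimal Python sketch applies the Bellman optimality operator to a tabular Q-function. The array layout (a kernel P of shape (S, A, S) and a reward table r of shape (S, A)) and the function names are illustrative choices, not part of the paper.

```python
import numpy as np

def bellman_optimality_operator(Q, P, r, gamma):
    """Apply (T Q)(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s, a)}[max_{a'} Q(s', a')].

    P has shape (S, A, S) with P[s, a, s'] = P(s'|s, a); r has shape (S, A).
    """
    V = Q.max(axis=1)         # V(s') = max_{a'} Q(s', a')
    return r + gamma * P @ V  # (S, A, S) @ (S,) -> (S, A)

def greedy_policy(Q):
    """Deterministic greedy policy pi(s) = argmax_a Q(s, a)."""
    return Q.argmax(axis=1)

# Since T is a gamma-contraction, iterating it from Q = 0 converges to Q*:
#   Q = np.zeros((S, A))
#   for _ in range(num_iters):
#       Q = bellman_optimality_operator(Q, P, r, gamma)
```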

2.2 Performance measures in federated Q-learning

We consider a federated learning setup consisting of $M$ agents, where all the agents face a common, unknown MDP, i.e., the transition kernel and the reward function are the same across agents. In addition, we consider the synchronous setting (Wainwright2019ConeContractive), where each agent has access to an independent generative model or simulator (Kearns1998FiniteSampleQL) from which it can simultaneously draw independent samples from the unknown underlying distribution $P(\cdot|s,a)$ for every state-action pair $(s,a)$. Let $Z\in\mathcal{S}^{|\mathcal{S}||\mathcal{A}|}$ be a corresponding random vector whose $(s,a)$-th coordinate is drawn from the distribution $P(\cdot|s,a)$, independently of all other coordinates. We define the random operator $\hat{\mathcal{T}}_{Z}:\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}\mapsto\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}$ as

(\hat{\mathcal{T}}_{Z}Q)(s,a)=r(s,a)+\gamma V(Z(s,a)),\qquad(3)

where $V(s^{\prime})=\max_{a^{\prime}\in\mathcal{A}}Q(s^{\prime},a^{\prime})$. The operator $\hat{\mathcal{T}}_{Z}$ can be interpreted as the sample Bellman operator, and we have the relation $\mathcal{T}Q=\mathbb{E}_{Z}\big[\hat{\mathcal{T}}_{Z}Q\big]$ for all Q-functions. For a given value of $\varepsilon\in(0,\frac{1}{1-\gamma})$, the objective of the agents is to collaboratively learn an $\varepsilon$-optimal estimate (in the $\ell_{\infty}$ sense) of the optimal Q-function of the unknown MDP.
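
The sample Bellman operator in Eqn. (3) admits an equally short sketch: drawing one next state per $(s,a)$ pair mirrors the generative-model access described above. The helper names below are hypothetical and the representation matches the earlier sketch.

```python
import numpy as np

def draw_Z(P, rng):
    """Draw one next state Z(s, a) ~ P(.|s, a) for every (s, a) pair; P has shape (S, A, S)."""
    S, A, _ = P.shape
    Z = np.empty((S, A), dtype=int)
    for s in range(S):
        for a in range(A):
            Z[s, a] = rng.choice(S, p=P[s, a])
    return Z

def sample_bellman_operator(Q, Z, r, gamma):
    """Apply (T_hat_Z Q)(s, a) = r(s, a) + gamma * V(Z(s, a)), where V(s') = max_{a'} Q(s', a')."""
    V = Q.max(axis=1)
    return r + gamma * V[Z]  # fancy indexing: V[Z] has shape (S, A)

# Averaging sample_bellman_operator over many independent draws of Z recovers the exact
# operator, reflecting the relation T Q = E_Z[T_hat_Z Q].
```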

We measure the performance of a federated Q-learning algorithm $\mathscr{A}$ using two metrics: sample complexity and communication complexity. For a given MDP $\mathcal{M}$, let $\widehat{Q}_{\mathcal{M}}(\mathscr{A},N,M)$ denote the estimate of $Q^{\star}_{\mathcal{M}}$, the optimal Q-function of the MDP $\mathcal{M}$, returned by an algorithm $\mathscr{A}$ when given access to $N$ i.i.d. samples from the generative model for each $(s,a)$ pair at each of the $M$ agents. The error rate of the algorithm $\mathscr{A}$, denoted by $\textsf{ER}(\mathscr{A};N,M)$, is defined as

\textsf{ER}(\mathscr{A};N,M):=\sup_{\mathcal{M}}\mathbb{E}\left[\big\|\widehat{Q}_{\mathcal{M}}(\mathscr{A},N,M)-Q^{\star}_{\mathcal{M}}\big\|_{\infty}\right],\qquad(4)

where the expectation is taken over the samples and any randomness in the algorithm. Given a value of $\varepsilon\in(0,\frac{1}{1-\gamma})$, the sample complexity of $\mathscr{A}$, denoted by $\textsf{SC}(\mathscr{A};\varepsilon,M)$, is given by

\textsf{SC}(\mathscr{A};\varepsilon,M):=|\mathcal{S}||\mathcal{A}|\cdot\min\left\{N\in\mathbb{N}:\textsf{ER}(\mathscr{A};N,M)\leq\varepsilon\right\}.\qquad(5)

Similarly, we can also define a high-probability version for any $\delta\in(0,1)$ as follows:

\textsf{SC}(\mathscr{A};\varepsilon,M,\delta):=|\mathcal{S}||\mathcal{A}|\cdot\min\left\{N\in\mathbb{N}:\Pr\left(\sup_{\mathcal{M}}\|\widehat{Q}_{\mathcal{M}}(\mathscr{A},N,M)-Q^{\star}_{\mathcal{M}}\|_{\infty}\leq\varepsilon\right)\geq 1-\delta\right\}.

We measure the communication complexity of any federated learning algorithm both in terms of the frequency of information exchange and the total number of bits uploaded by the agents. For each agent $m$, let $C_{\textsf{round}}^{m}(\mathscr{A};N)$ and $C_{\textsf{bit}}^{m}(\mathscr{A};N)$ respectively denote the number of times agent $m$ sends a message and the total number of bits uploaded by agent $m$ to the server, when an algorithm $\mathscr{A}$ is run with $N$ i.i.d. samples from the generative model for each $(s,a)$ pair at each of the $M$ agents. The communication complexity of $\mathscr{A}$, when measured in terms of the frequency of communication and the total number of bits exchanged, is given by

\textsf{CC}_{\textsf{round}}(\mathscr{A};N):=\frac{1}{M}\sum_{m=1}^{M}C_{\textsf{round}}^{m}(\mathscr{A};N);\quad\textsf{CC}_{\textsf{bit}}(\mathscr{A};N):=\frac{1}{M}\sum_{m=1}^{M}C_{\textsf{bit}}^{m}(\mathscr{A};N),\qquad(6)

respectively. Similarly, for a given value of $\varepsilon\in(0,\frac{1}{1-\gamma})$, we can also define $\textsf{CC}_{\textsf{round}}(\mathscr{A};\varepsilon)$ and $\textsf{CC}_{\textsf{bit}}(\mathscr{A};\varepsilon)$ when $\mathscr{A}$ is run to guarantee an error of at most $\varepsilon$, as well as the high-probability versions $\textsf{CC}_{\textsf{round}}(\mathscr{A};\varepsilon,\delta)$ and $\textsf{CC}_{\textsf{bit}}(\mathscr{A};\varepsilon,\delta)$ for any $\delta\in(0,1)$.

2.3 Intermittent-communication algorithm protocols

We consider a popular class of federated learning algorithms with intermittent communication. The intermittent communication setting provides a natural framework to extend single-agent Q-learning algorithms to the distributed setting. As the name suggests, under this setting the agents intermittently communicate with each other or a central server, sharing their updated beliefs about $Q^{\star}$. Between two communication rounds, each agent updates its belief about $Q^{\star}$ using stochastic approximation iterations based on the locally available data, similar to a single-agent setup. Such intermittent communication algorithms have been extensively studied and used to establish lower bounds on the communication complexity of distributed stochastic convex optimization (Woodworth2018GraphOracle; Woodworth2021IntermittentComm).

1:  Input: $T,R,\{\eta_{t}\}_{t=1}^{T},\mathcal{C}=\{t_{r}\}_{r=1}^{R},B$
2:  Set $Q_{0}^{m}=0$ for all agents $m$
3:  for $t=1,2,\dots,T$ do
4:     for $m=1,2,\dots,M$ do
5:        Compute $Q_{t-\frac{1}{2}}^{m}$ according to Eqn. (7)
6:        Compute $Q_{t}^{m}$ according to Eqn. (8)
7:     end for
8:  end for
9:  return $Q_{T}$
Algorithm 1 A generic federated Q-learning algorithm $\mathscr{A}$

A generic federated Q-learning algorithm with intermittent communication is outlined in Algorithm 1. It is characterized by the following five parameters: (i) the total number of updates $T$; (ii) the number of communication rounds $R$; (iii) a step size schedule $\{\eta_{t}\}_{t=1}^{T}$; (iv) a communication schedule $\mathcal{C}=\{t_{r}\}_{r=1}^{R}$; (v) a batch size $B$. During the $t$-th iteration, each agent $m$ computes $\{\widehat{\mathcal{T}}_{Z_{b}}(Q_{t-1}^{m})\}_{b=1}^{B}$, a minibatch of sample Bellman operators evaluated at the current estimate $Q_{t-1}^{m}$, using $B$ samples from the generative model for each $(s,a)$ pair, and obtains an intermediate local estimate using the Q-learning update rule as follows:

Q_{t-\frac{1}{2}}^{m}=(1-\eta_{t})Q_{t-1}^{m}+\frac{\eta_{t}}{B}\sum_{b=1}^{B}\widehat{\mathcal{T}}_{Z_{b}}(Q_{t-1}^{m}).\qquad(7)

Here, $\eta_{t}\in(0,1]$ is the step size chosen for the $t$-th time step. The intermediate estimates are averaged based on a communication schedule $\mathcal{C}=\{t_{r}\}_{r=1}^{R}$ consisting of $R$ rounds, i.e.,

Q_{t}^{m}=\begin{cases}\frac{1}{M}\sum_{j=1}^{M}Q_{t-\frac{1}{2}}^{j}&\text{ if }t\in\mathcal{C},\\ Q_{t-\frac{1}{2}}^{m}&\text{otherwise.}\end{cases}\qquad(8)

In the above equation, the averaging step can also be replaced with any distributed mean estimation routine that includes compression to control the bit-level costs. Without loss of generality, we assume that $Q^{m}_{0}=0$ for all agents and $t_{R}=T$, i.e., the last iterates are always averaged. It is straightforward to note that the number of samples taken per agent by an intermittent communication algorithm is $BT$, i.e., $N=BT$, and the communication complexity is $\textsf{CC}_{\textsf{round}}(\mathscr{A};N)=R$.
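
To make the update rules (7)–(8) concrete, here is a minimal simulation sketch of Algorithm 1 in Python. It assumes all agents query a shared kernel P (each draw being an independent generative-model sample) and performs exact averaging; the function name and array layout are illustrative, not part of the paper.

```python
import numpy as np

def federated_q_learning(P, r, gamma, M, T, B, etas, comm_schedule, rng):
    """Generic intermittent-communication federated Q-learning (Algorithm 1).

    P: (S, A, S) transition kernel shared by all agents; r: (S, A) reward table;
    etas: step sizes eta_1, ..., eta_T; comm_schedule: rounds t at which iterates are averaged.
    """
    S, A, _ = P.shape
    Q = np.zeros((M, S, A))  # Q_0^m = 0 for every agent
    for t in range(1, T + 1):
        for m in range(M):
            # Minibatch of B sample Bellman operators at the current local iterate (Eqn. (7))
            V = Q[m].max(axis=1)
            target = np.zeros((S, A))
            for _ in range(B):
                Z = np.array([[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)])
                target += r + gamma * V[Z]
            Q[m] = (1 - etas[t - 1]) * Q[m] + etas[t - 1] * target / B
        if t in comm_schedule:   # averaging step (Eqn. (8))
            Q[:] = Q.mean(axis=0)
    return Q[0]                  # with t_R = T, all agents hold the same final iterate
```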

3 Communication Complexity Lower Bound

In this section, we investigate the first of the two questions regarding the lower bound on communication complexity. The following theorem establishes a lower bound on the communication complexity of a federated Q-learning algorithm with intermittent communication as described in Algorithm 1.

Theorem 1.

Assume that $\gamma\in[5/6,1)$ and the state and action spaces satisfy $|\mathcal{S}|\geq 4$ and $|\mathcal{A}|\geq 2$. Let $\mathscr{A}$ be a federated Q-learning algorithm with intermittent communication (as described in Algorithm 1) that is run for $T\geq\max\{16,\frac{1}{1-\gamma}\}$ steps with a step size schedule of either $\eta_{t}:=\frac{1}{1+c_{\eta}(1-\gamma)t}$ or $\eta_{t}:=\eta$ for all $1\leq t\leq T$. If

R=\textsf{CC}_{\textsf{round}}(\mathscr{A};N)\leq\frac{c_{0}}{(1-\gamma)\log^{2}N}\quad\text{ or }\quad\textsf{CC}_{\textsf{bit}}(\mathscr{A};N)\leq\frac{c_{1}|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\log^{2}N}

for some universal constants $c_{0},c_{1}>0$, then for all choices of the communication schedule, batch size $B$, $c_{\eta}>0$ and $\eta\in(0,1)$, the error of $\mathscr{A}$ satisfies

\textsf{ER}(\mathscr{A};N,M)\geq\frac{C_{\gamma}}{\sqrt{N}\log^{3}N}

for all $M\geq 2$ and $N=BT$. Here $C_{\gamma}>0$ is a constant that depends only on $\gamma$.

The above theorem states that for any federated Q-learning algorithm with intermittent communication to obtain any benefit of collaboration, i.e., for the error rate $\textsf{ER}(\mathscr{A};N,M)$ to decrease w.r.t. the number of agents, it must have at least $\Omega\left(\frac{1}{(1-\gamma)\log^{2}N}\right)$ rounds of communication and transmit $\Omega\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\log^{2}N}\right)$ bits of information per agent, both of which scale proportionally to the effective horizon $\frac{1}{1-\gamma}$ of the MDP. The above lower bound on the communication complexity also immediately applies to federated Q-learning algorithms that offer order-optimal sample complexity, and thereby a linear speedup with respect to the number of agents. Therefore, this characterizes the converse relation for the sample-communication trade-off in federated Q-learning.

The above lower bound on the communication complexity of federated Q-learning is a consequence of the bias-variance trade-off that governs the convergence of the Q-learning algorithm. While a careful choice of step sizes alone is sufficient to balance this trade-off in the centralized setting, the choice of communication schedule also plays an important role in balancing this trade-off in the federated setting. The local steps between two communication rounds induce a positive estimation bias that depends on the standard deviation of the iterates and is a well-documented issue of “over-estimation” in Q-learning (Hasselt2010DoubleQL). Since such a bias is driven by local updates, it does not reflect any benefit of collaboration. During a communication round, the averaging of iterates across agents allows the algorithm an opportunity to counter this bias by reducing the effective variance of the updates through averaging. In our analysis, we show that if the communication is infrequent, the local bias becomes the dominant term and averaging of iterates is insufficient to counter the impact of the positive bias induced by the local steps. As a result, we do not observe any statistical gains when the communication is infrequent. Our argument is inspired by the analysis of Q-learning in Li2023QLMinimax and is based on analyzing the convergence of an intermittent communication algorithm on a specifically chosen “hard” MDP instance. The detailed proof is deferred to Appendix A.

Remark 1 (Communication complexity of policy evaluation).

Several recent studies (liu2023distributed; tian2024one) established that a single round of communication is sufficient to achieve linear speedup of TD learning for policy evaluation; this does not contradict our results, which focus on Q-learning for policy learning. The latter is more involved due to the nonlinearity of the Bellman optimality operator. Specifically, if the operator whose fixed point is to be found is linear in the decision variable (e.g., the value function in TD learning), then the fixed-point update only induces a variance term corresponding to the noise. However, if the operator is nonlinear, then in addition to the variance term, we also obtain a bias term in the fixed-point update. While the variance term can be controlled with one-shot averaging, more frequent communication is necessary to ensure that the bias term is small enough.

Remark 2 (Extension to asynchronous Q-learning).

We would like to point out that our lower bound extends to the asynchronous setting (Li2023QLMinimax) as the assumption of i.i.d. noise corresponding to a generative model is a special case of Markovian noise observed in the asynchronous setting.

4 The Fed-DVR-Q  Algorithm

Having characterized the lower bound on the communication complexity of federated Q-learning, we explore our second question of interest — designing a federated Q-learning algorithm that achieves this lower bound while simultaneously offering an optimal order of sample complexity (up to logarithmic factors).

We propose a new federated Q-learning algorithm, Fed-DVR-Q, that achieves not only a communication complexity of $\textsf{CC}_{\textsf{round}}=\tilde{\mathcal{O}}\left(\frac{1}{1-\gamma}\right)$ and $\textsf{CC}_{\textsf{bit}}=\tilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\right)$ but also order-optimal sample complexity (up to logarithmic factors), thereby providing a tight characterization of the achievability frontier that matches the converse result derived in the previous section.

4.1 Algorithm description

1:  Input: Error bound $\varepsilon>0$, failure probability $\delta>0$
2:  $k\leftarrow 1$, $Q^{(0)}\leftarrow\mathbf{0}$
3:  // Set parameters as described in Section 4.1.3
4:  for $k=1,2,\dots,K$ do
5:     $Q^{(k)}\leftarrow\textsc{RefineEstimate}(Q^{(k-1)},B,I,L_{k},D_{k},J)$
6:     $k\leftarrow k+1$
7:  end for
8:  return $Q^{(K)}$
Algorithm 2 Fed-DVR-Q

Fed-DVR-Q proceeds in epochs. During each epoch $k\geq 1$, the agents collaboratively update $Q^{(k-1)}$, the estimate of $Q^{\star}$ obtained at the end of the previous epoch, to a new estimate $Q^{(k)}$ with the aid of a sub-routine called RefineEstimate. The sub-routine RefineEstimate is designed to ensure that the suboptimality gap $\|Q^{(k)}-Q^{\star}\|_{\infty}$ reduces by a factor of $2$ at the end of every epoch. Thus, at the end of $K=\mathcal{O}(\log(1/\varepsilon))$ epochs, Fed-DVR-Q obtains an $\varepsilon$-optimal estimate of $Q^{\star}$, which is then set to be the output of the algorithm. Please refer to Algorithm 2 for the pseudocode.

4.1.1 The RefineEstimate sub-routine

RefineEstimate, starting from an initial estimate $\overline{Q}$ of $Q^{\star}$, uses variance-reduced Q-learning updates (Wainwright2019VarianceReduced; sidford2018near) to obtain an improved estimate of $Q^{\star}$. It is characterized by four parameters: the initial estimate $\overline{Q}$, the number of local iterations $I$, the re-centering sample size $L$, and the batch size $B$, which can be appropriately tuned to control the quality of the returned estimate. Additionally, it takes as input two parameters $D$ and $J$ required by the compressor $\mathscr{C}(\cdot;D,J)$, whose description is deferred to Section 4.1.2.

The first step in RefineEstimate is to collaboratively approximate $\mathcal{T}\overline{Q}$ for the variance-reduced updates. To this effect, each agent $m$ builds an approximation of $\mathcal{T}\overline{Q}$ as follows:

\widetilde{\mathcal{T}}^{(m)}_{L}(\overline{Q}):=\frac{1}{\lceil L/M\rceil}\sum_{l=1}^{\lceil L/M\rceil}\widehat{\mathcal{T}}_{Z_{l}^{(m)}}(\overline{Q}),\qquad(9)

where $\{Z_{1}^{(m)},Z_{2}^{(m)},\ldots,Z_{\lceil L/M\rceil}^{(m)}\}$ are $\lceil L/M\rceil$ i.i.d. samples with $Z_{1}^{(m)}\sim Z$. Each agent then sends a compressed version, $\mathscr{C}\left(\widetilde{\mathcal{T}}^{(m)}_{L}(\overline{Q})-\overline{Q};D,J\right)$, of the difference $\widetilde{\mathcal{T}}^{(m)}_{L}(\overline{Q})-\overline{Q}$ to the server, which collects all the estimates from the agents and constructs the estimate

\widetilde{\mathcal{T}}_{L}(\overline{Q})=\overline{Q}+\frac{1}{M}\sum_{m=1}^{M}\mathscr{C}\left(\widetilde{\mathcal{T}}^{(m)}_{L}(\overline{Q})-\overline{Q};D,J\right)\qquad(10)

and sends it back to the agents for the variance-reduced updates. Equipped with the estimate $\widetilde{\mathcal{T}}_{L}(\overline{Q})$, RefineEstimate constructs a sequence $\{Q_{i}\}_{i=1}^{I}$ using the following iterative update scheme, initialized with $Q_{0}=\overline{Q}$. During the $i$-th iteration, each agent $m$ carries out the update

Q_{i-\frac{1}{2}}^{m}=(1-\eta)Q_{i-1}+\eta\left[\widehat{\mathcal{T}}_{i}^{(m)}Q_{i-1}-\widehat{\mathcal{T}}_{i}^{(m)}\overline{Q}+\widetilde{\mathcal{T}}_{L}(\overline{Q})\right].\qquad(11)

In the above equation, $\eta\in(0,1)$ is the step size and $\widehat{\mathcal{T}}_{i}^{(m)}Q:=\frac{1}{B}\sum_{z\in\mathcal{Z}_{i}^{(m)}}\widehat{\mathcal{T}}_{z}Q$, where $\mathcal{Z}_{i}^{(m)}$ is a minibatch of $B$ i.i.d. random variables drawn according to $Z$, independently at each agent $m$ for each iteration $i$. Each agent then sends a compressed version of its update, i.e., $\mathscr{C}\left(Q_{i-\frac{1}{2}}^{m}-Q_{i-1};D,J\right)$, to the server, which uses them to compute the next iterate

Q_{i}=Q_{i-1}+\frac{1}{M}\sum_{m=1}^{M}\mathscr{C}\left(Q_{i-\frac{1}{2}}^{m}-Q_{i-1};D,J\right),\qquad(12)

and broadcasts it to the agents. After $I$ such updates, the obtained iterate $Q_{I}$ is returned by the sub-routine. A pseudocode of RefineEstimate is given in Algorithm 3.

1:  Input: Initial estimate $\overline{Q}$, batch size $B$, number of iterations $I$, re-centering sample size $L$, quantization bound $D$, message size $J$
2:  // Build an approximation of $\mathcal{T}\overline{Q}$ to be used for the variance-reduced updates
3:  for $m=1,2,\dots,M$ do
4:     Draw $\lceil L/M\rceil$ i.i.d. samples from the generative model for each $(s,a)$ pair and evaluate $\widetilde{\mathcal{T}}_{L}^{(m)}(\overline{Q})$ according to Eqn. (9)
5:     Send $\mathscr{C}(\widetilde{\mathcal{T}}_{L}^{(m)}(\overline{Q})-\overline{Q};D,J)$ to the server
6:     Receive $\frac{1}{M}\sum_{m=1}^{M}\mathscr{C}(\widetilde{\mathcal{T}}_{L}^{(m)}(\overline{Q})-\overline{Q};D,J)$ from the server and compute $\widetilde{\mathcal{T}}_{L}(\overline{Q})$ according to Eqn. (10)
7:  end for
8:  $Q_{0}\leftarrow\overline{Q}$
9:  // Variance-reduced updates with minibatching
10:  for $i=1,2,\dots,I$ do
11:     for $m=1,2,\dots,M$ do
12:        Draw $B$ i.i.d. samples from the generative model for each $(s,a)$ pair
13:        Compute $Q_{i-\frac{1}{2}}^{m}$ according to Eqn. (11)
14:        Send $\mathscr{C}(Q_{i-\frac{1}{2}}^{m}-Q_{i-1};D,J)$ to the server
15:        Receive $\frac{1}{M}\sum_{m=1}^{M}\mathscr{C}(Q_{i-\frac{1}{2}}^{m}-Q_{i-1};D,J)$ from the server and compute $Q_{i}$ according to Eqn. (12)
16:     end for
17:  end for
18:  return $Q_{I}$
Algorithm 3 RefineEstimate$(\overline{Q},B,I,L,D,J)$
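
The following Python sketch mirrors Eqns. (9)–(12) and Algorithm 3. The argument `compress` stands in for the quantizer $\mathscr{C}(\cdot;D,J)$ (a sketch of which is given in the next subsection); pass `lambda x: x` for exact averaging. Function names and the shared-kernel representation are illustrative assumptions, not part of the paper.

```python
import numpy as np

def refine_estimate(Q_bar, P, r, gamma, M, B, I, L, eta, compress, rng):
    """Sketch of RefineEstimate: federated variance-reduced Q-learning updates."""
    S, A, _ = P.shape

    def draw_Z():
        """One generative-model draw Z(s, a) ~ P(.|s, a) for every (s, a) pair."""
        return np.array([[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)])

    # Step 1: collaboratively approximate T(Q_bar) using ~L samples in total (Eqns. (9)-(10))
    V_bar = Q_bar.max(axis=1)
    per_agent = int(np.ceil(L / M))
    local_ops = []
    for _ in range(M):
        acc = np.zeros((S, A))
        for _ in range(per_agent):
            acc += r + gamma * V_bar[draw_Z()]
        local_ops.append(compress(acc / per_agent - Q_bar))
    T_tilde = Q_bar + np.mean(local_ops, axis=0)

    # Step 2: I variance-reduced iterations with minibatch size B (Eqns. (11)-(12))
    Q = Q_bar.copy()
    for _ in range(I):
        updates = []
        for _ in range(M):
            V = Q.max(axis=1)
            diff = np.zeros((S, A))
            for _ in range(B):
                Z = draw_Z()                       # same minibatch for both operator evaluations
                diff += gamma * (V[Z] - V_bar[Z])  # r(s, a) cancels in the difference
            local = (1 - eta) * Q + eta * (diff / B + T_tilde)
            updates.append(compress(local - Q))
        Q = Q + np.mean(updates, axis=0)
    return Q
```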

4.1.2 The compression operator

The compressor $\mathscr{C}(\cdot;D,J)$ used in the proposed algorithm Fed-DVR-Q is based on the popular stochastic quantization scheme. In addition to the input vector $Q$ to be quantized, the compressor or quantizer $\mathscr{C}$ takes two input parameters $D$ and $J$: (i) $D$ corresponds to an upper bound on the $\ell_{\infty}$ norm of $Q$, i.e., $\|Q\|_{\infty}\leq D$; (ii) $J$ corresponds to the resolution of the compressor, i.e., the number of bits used by the compressor to represent each coordinate of the output vector.

The compressor first splits the interval $[0,D]$ into $2^{J}-1$ intervals of equal length, where $0=d_{1}<d_{2}<\ldots<d_{2^{J}}=D$ are the end points of the intervals. Each coordinate of $Q$ is then quantized separately as follows. The value of the $n$-th coordinate, $\mathscr{C}(Q)[n]$, is set to $d_{j_{n}-1}$ with probability $\frac{d_{j_{n}}-Q[n]}{d_{j_{n}}-d_{j_{n}-1}}$ and to $d_{j_{n}}$ with the remaining probability, where $j_{n}:=\min\{j:d_{j-1}<Q[n]\leq d_{j}\}$. It is straightforward to note that each coordinate of $\mathscr{C}(Q)$ can be represented using $J$ bits.
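
The following Python sketch implements the stochastic rounding described above for coordinates lying in $[0,D]$: a uniform grid of $2^{J}$ points and an unbiased two-point randomization. The function name and the clipping of out-of-range inputs are illustrative choices.

```python
import numpy as np

def stochastic_quantize(Q, D, J, rng):
    """Stochastic quantizer C(Q; D, J): round each coordinate of Q onto a uniform grid of
    2^J points on [0, D], choosing between the two neighbouring grid points with
    probabilities that make the output unbiased, i.e., E[C(Q)] = Q for Q in [0, D].
    Each quantized coordinate can be described with J bits (its grid index)."""
    Q = np.asarray(Q, dtype=float)
    num_levels = 2 ** J                  # grid points d_1 = 0 < ... < d_{2^J} = D
    delta = D / (num_levels - 1)         # width of each of the 2^J - 1 intervals
    scaled = np.clip(Q, 0.0, D) / delta  # position on the grid in units of delta
    lower = np.floor(scaled)
    prob_up = scaled - lower             # round up with probability equal to the offset
    rounded = lower + (rng.random(Q.shape) < prob_up)
    return rounded * delta

# Example: with D = 1 and J = 2 the grid is {0, 1/3, 2/3, 1}; quantizing 0.5 returns
# 1/3 or 2/3, each with probability 1/2, so the output is unbiased.
```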

4.1.3 Setting the parameters

The desired convergence of the iterates $\{Q^{(k)}\}$ is obtained by carefully choosing the parameters of the sub-routine RefineEstimate and the compression operator $\mathscr{C}$. Given a target accuracy $\varepsilon\in(0,1]$ and $\delta\in(0,1)$, the total number of epochs is set to

K=\left\lceil\frac{1}{2}\log_{2}\left(\frac{1}{1-\gamma}\right)\right\rceil+\left\lceil\frac{1}{2}\log_{2}\left(\frac{1}{(1-\gamma)\varepsilon^{2}}\right)\right\rceil.\qquad(13)

For all epochs $k\geq 1$, we set the number of iterations $I$, the batch size $B$, and the number of bits $J$ of the compressor $\mathscr{C}$ to be

I:=\left\lceil\frac{2}{\eta(1-\gamma)}\right\rceil,\qquad(14a)
B:=\left\lceil\frac{2}{M}\left(\frac{12\gamma}{1-\gamma}\right)^{2}\log\left(\frac{8KI|\mathcal{S}||\mathcal{A}|}{\delta}\right)\right\rceil,\qquad(14b)
J:=\left\lceil\log_{2}\left(\frac{70}{\eta(1-\gamma)}\sqrt{\frac{4}{M}\log\Big(\frac{8KI|\mathcal{S}||\mathcal{A}|}{\delta}\Big)}\right)\right\rceil,\qquad(14c)

respectively. The re-centering sample sizes $L_{k}$ and the bounds $D_{k}$ of the compressor $\mathscr{C}$ are set as the following functions of the epoch index $k$:

L_{k}:=\frac{39200}{(1-\gamma)^{2}}\log\left(\frac{8KI|\mathcal{S}||\mathcal{A}|}{\delta}\right)\cdot\begin{cases}4^{k}&\text{ if }k\leq K_{0}\\ 4^{k-K_{0}}&\text{ if }k>K_{0}\end{cases};\qquad D_{k}:=16\cdot\frac{2^{-k}}{1-\gamma},\qquad(15)

where $K_{0}=\lceil\frac{1}{2}\log_{2}(\frac{1}{1-\gamma})\rceil$. The piecewise definition of $L_{k}$ is crucial to obtain the optimal dependence with respect to $\frac{1}{1-\gamma}$, similar to the two-step procedure outlined in Wainwright2019VarianceReduced.
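
For concreteness, the parameter choices in Eqns. (13)–(15) can be computed as below; this is a direct transcription of the formulas, with `S_size` and `A_size` denoting $|\mathcal{S}|$ and $|\mathcal{A}|$ (illustrative names).

```python
import numpy as np

def fed_dvr_q_parameters(eps, delta, gamma, M, S_size, A_size, eta):
    """Parameter schedule of Fed-DVR-Q as in Eqns. (13)-(15)."""
    K = int(np.ceil(0.5 * np.log2(1.0 / (1 - gamma)))) \
        + int(np.ceil(0.5 * np.log2(1.0 / ((1 - gamma) * eps ** 2))))         # Eqn. (13)
    I = int(np.ceil(2.0 / (eta * (1 - gamma))))                               # Eqn. (14a)
    log_term = np.log(8 * K * I * S_size * A_size / delta)
    B = int(np.ceil((2.0 / M) * (12 * gamma / (1 - gamma)) ** 2 * log_term))  # Eqn. (14b)
    J = int(np.ceil(np.log2(70.0 / (eta * (1 - gamma))
                            * np.sqrt(4.0 * log_term / M))))                  # Eqn. (14c)
    K0 = int(np.ceil(0.5 * np.log2(1.0 / (1 - gamma))))
    L = [39200.0 / (1 - gamma) ** 2 * log_term
         * (4 ** k if k <= K0 else 4 ** (k - K0)) for k in range(1, K + 1)]   # Eqn. (15)
    D = [16.0 * 2 ** (-k) / (1 - gamma) for k in range(1, K + 1)]             # Eqn. (15)
    return K, I, B, J, L, D
```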

4.2 Performance guarantees

The following theorem characterizes the sample and communication complexities of Fed-DVR-Q .

Theorem 2.

Consider any $\delta\in(0,1)$ and $\varepsilon\in(0,1]$. The sample and communication complexities of the Fed-DVR-Q algorithm, when run with the choice of parameters described in Section 4.1.3 and a learning rate $\eta\in(0,1)$, satisfy the following relations for some universal constant $C_{1}>0$:

\textsf{SC}(\textsf{Fed-DVR-Q};\varepsilon,M,\delta)\leq\frac{C_{1}}{\eta M(1-\gamma)^{3}\varepsilon^{2}}\log_{2}\left(\frac{1}{(1-\gamma)\varepsilon}\right)\log\left(\frac{8KI|\mathcal{S}||\mathcal{A}|}{\delta}\right),
\textsf{CC}_{\textsf{round}}(\textsf{Fed-DVR-Q};\varepsilon,\delta)\leq\frac{16}{\eta(1-\gamma)}\log_{2}\left(\frac{1}{(1-\gamma)\varepsilon}\right),
\textsf{CC}_{\textsf{bit}}(\textsf{Fed-DVR-Q};\varepsilon,\delta)\leq\frac{32|\mathcal{S}||\mathcal{A}|}{\eta(1-\gamma)}\log_{2}\left(\frac{1}{(1-\gamma)\varepsilon}\right)\log_{2}\left(\frac{70}{\eta(1-\gamma)}\sqrt{\frac{4}{M}\log\left(\frac{8KI|\mathcal{S}||\mathcal{A}|}{\delta}\right)}\right).

A proof of Theorem 2 can be found in the appendix. A few implications of the theorem are in order.

Optimal sample-communication complexity trade-off.

As shown by the above theorem, Fed-DVR-Q offers a linear speedup in the sample complexity with respect to the number of agents while simultaneously achieving the same order of communication complexity as dictated by the lower bound derived in Theorem 1, both in terms of frequency and bit-level complexity. Moreover, Fed-DVR-Q is the first federated Q-learning algorithm that achieves a sample complexity with optimal dependence on all the salient parameters, i.e., $|\mathcal{S}|$, $|\mathcal{A}|$ and $\frac{1}{1-\gamma}$, in addition to linear speedup w.r.t. the number of agents, thereby bridging the existing gap between upper and lower bounds on the sample complexity of federated Q-learning. Thus, Theorems 1 and 2 together characterize the optimal operating point of the sample-communication complexity trade-off in federated Q-learning.

Role of minibatching.

The commonly adopted approach in intermittent communication algorithms is to use a local update scheme that takes multiple small (i.e., $B=\mathcal{O}(1)$), noisy updates between communication rounds, as evident from the algorithm design in Khodadadian2022FederatedQL; Woo2023FedSynQ and numerous FL algorithms for stochastic optimization (Mcmahan2017FedAvg; Haddadpour2019LUPASGD; Khaled2020LocalSGDHeterogenous). In Fed-DVR-Q, we replace the local update scheme of taking multiple small, noisy updates with a single, large update with smaller variance, obtained by averaging the noisy updates over a minibatch of samples. The use of updates with smaller variance in variance-reduced Q-learning gives the algorithm its name. While both approaches result in similar sample complexity guarantees, the local update scheme requires more frequent averaging across clients to ensure that the bias of the estimate, also commonly referred to as "client drift", is not too large. On the other hand, the minibatching approach does not encounter the problem of bias accumulation from local updates and hence can afford more infrequent averaging, allowing Fed-DVR-Q to achieve an optimal order of communication complexity.

Compression.

Fed-DVR-Q is the first federated Q-learning algorithm to analyze and establish communication complexity at the bit level. While all existing studies on federated Q-learning focus only on the frequency of communication and assume transmission of real numbers with infinite bit precision, our analysis provides a more holistic viewpoint of communication complexity at the bit level, which is of great practical significance. While some other recent studies (Wang2023FederatedTDHeterogeneity) also considered quantization in federated RL, their objective is to understand the impact of message size on the convergence with no constraint on the frequency of communication, unlike the holistic viewpoint adopted in this work.

5 Conclusion

We presented a complete characterization of the sample-communication trade-off for federated Q-learning algorithms with intermittent communication. We showed that no federated Q-learning algorithm with intermittent communication can achieve any speedup with respect to the number of agents if its number of communication rounds is sublinear in $\frac{1}{1-\gamma}$. We also proposed a new federated Q-learning algorithm called Fed-DVR-Q that uses variance reduction along with minibatching to achieve order-optimal sample and communication complexities. In particular, we showed that Fed-DVR-Q has a per-agent sample complexity of $\tilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{M(1-\gamma)^{3}\varepsilon^{2}}\right)$, which is order-optimal in all salient problem parameters, and a communication complexity of $\tilde{\mathcal{O}}\left(\frac{1}{1-\gamma}\right)$ rounds and $\tilde{\mathcal{O}}\left(\frac{|\mathcal{S}||\mathcal{A}|}{1-\gamma}\right)$ bits.

The results in this work raise several interesting questions that are worth exploring. While we focus on the tabular setting in this work, it is of great interest to investigate the trade-off in other settings where function approximation is used to model the $Q^{\star}$ and $V^{\star}$ functions. Moreover, it is interesting to explore the trade-off in the finite-horizon setting, where there is no discount factor. Furthermore, it is also worthwhile to explore whether the communication complexity can be further reduced by going beyond the class of intermittent communication algorithms.

Appendix A Proof of Theorem 1

In this section, we prove the main result of the paper, the lower bound on the communication complexity of federated Q-learning algorithms. At a high level, the proof consists of the following three steps.

Introducing the “hard” MDP instance.

The proof builds upon analyzing the behavior of a generic algorithm $\mathscr{A}$, outlined in Algorithm 1, over a particular MDP instance. The particular choice of MDP is inspired by, and borrowed from, other lower bound proofs in the single-agent setting (Li2023QLMinimax) and helps highlight the core issues that lie at the heart of the sample-communication complexity trade-off. Following Li2023QLMinimax, the construction is first carried out over a small state-action space, which allows us to focus on a simpler problem before generalizing it to larger state-action spaces.

Establishing the performance of intermittent communication algorithms.

In the second step, we analyze the error of the iterates generated by an intermittent communication algorithm $\mathscr{A}$. The analysis is inspired by the single-agent analysis in Li2023QLMinimax, which highlights the underlying bias-variance trade-off. Through a careful analysis of the algorithm dynamics in the federated setting, we uncover the impact of communication on the bias-variance trade-off and the resulting error of the iterates to obtain the lower bound on the communication complexity.

Generalization to larger MDPs.

As the last step, we generalize our construction of the "hard" instance to more general state-action spaces and extend our insights to obtain the statement of the theorem.

A.1 Introducing the “hard” instance

We first introduce an MDP instance $\mathcal{M}_{h}$ that we will use throughout the proof to establish the result. Note that this MDP is identical to the one considered in Li2023QLMinimax to establish lower bounds on the performance of single-agent Q-learning. It consists of four states $\mathcal{S}=\{0,1,2,3\}$. Let $\mathcal{A}_{s}$ denote the action set associated with state $s$. The probability transition kernel and the reward function of $\mathcal{M}_{h}$ are given as follows:

\mathcal{A}_{0}=\{1\}:\quad P(0|0,1)=1,\quad r(0,1)=0,\qquad(16a)
\mathcal{A}_{1}=\{1,2\}:\quad P(1|1,1)=p,\quad P(0|1,1)=1-p,\quad r(1,1)=1,\qquad(16b)
\quad\phantom{\mathcal{A}_{1}=\{1,2\}:}\ P(1|1,2)=p,\quad P(0|1,2)=1-p,\quad r(1,2)=1,\qquad(16c)
\mathcal{A}_{2}=\{1\}:\quad P(2|2,1)=p,\quad P(0|2,1)=1-p,\quad r(2,1)=1,\qquad(16d)
\mathcal{A}_{3}=\{1\}:\quad P(3|3,1)=1,\quad r(3,1)=1,\qquad(16e)

where the parameter $p=\frac{4\gamma-1}{3\gamma}$. We have the following result about the optimal Q-function and value function of this hard MDP instance.

Lemma 1 (Li2023QLMinimax, Lemma 3).

Consider the MDP $\mathcal{M}_{h}$ constructed in Eqn. (16). We have

V^{\star}(0)=Q^{\star}(0,1)=0,
V^{\star}(1)=Q^{\star}(1,1)=Q^{\star}(1,2)=V^{\star}(2)=Q^{\star}(2,1)=\frac{1}{1-\gamma p}=\frac{3}{4(1-\gamma)},
V^{\star}(3)=Q^{\star}(3,1)=\frac{1}{1-\gamma}.
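
As a quick sanity check of Lemma 1, the following sketch builds $\mathcal{M}_{h}$ for an illustrative value $\gamma=0.9$ and runs value iteration to the fixed point of $\mathcal{T}$; the padding of single-action states with a duplicate action is purely for array convenience and is not part of the construction.

```python
import numpy as np

gamma = 0.9
p = (4 * gamma - 1) / (3 * gamma)

# States {0, 1, 2, 3}; single-action states are padded with a duplicate action so Q is 4 x 2.
P = np.zeros((4, 2, 4))
r = np.zeros((4, 2))
P[0, :, 0] = 1.0;                   r[0, :] = 0.0  # state 0: absorbing, zero reward
P[1, :, 1] = p; P[1, :, 0] = 1 - p; r[1, :] = 1.0  # state 1: both actions behave identically
P[2, :, 2] = p; P[2, :, 0] = 1 - p; r[2, :] = 1.0  # state 2 (single action, duplicated)
P[3, :, 3] = 1.0;                   r[3, :] = 1.0  # state 3: absorbing, unit reward

Q = np.zeros((4, 2))
for _ in range(5000):                # value iteration on Q (fixed point of T)
    Q = r + gamma * P @ Q.max(axis=1)

assert np.allclose(Q[1, 0], 3 / (4 * (1 - gamma)))  # V*(1) = 3 / (4(1 - gamma))
assert np.allclose(Q[3, 0], 1 / (1 - gamma))        # V*(3) = 1 / (1 - gamma)
```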

Throughout the next section of the proof, we focus on this MDP with four states and two actions. In Appendix A.4, we generalize the proof to larger state-action spaces.

A.2 Notation and preliminary results

For convenience, we first define some notation that will be used throughout the proof.

Useful relations of the learning rates.

We consider two kinds of step size sequences that are commonly used in Q-learning: the constant step size schedule, i.e., $\eta_{t}=\eta$ for all $t\in\{1,2,\dots,T\}$, and the rescaled linear step size schedule, i.e., $\eta_{t}=\frac{1}{1+c_{\eta}(1-\gamma)t}$, where $c_{\eta}>0$ is a universal constant that is independent of the problem parameters.

We define the following quantities:

\eta_{k}^{(t)}=\eta_{k}\prod_{i=k+1}^{t}(1-\eta_{i}(1-\gamma p))\qquad\text{ for all }0\leq k\leq t,\qquad(17)

where we take $\eta_{0}=1$ and use the convention throughout the proof that if a product operation does not have a valid index, the value of that product is taken to be $1$. For any integer $0\leq\tau<t$, we have the following relation, which will be proved at the end of this subsection for completeness:

\prod_{k=\tau+1}^{t}(1-\eta_{k}(1-\gamma p))+(1-\gamma p)\sum_{k=\tau+1}^{t}\eta_{k}^{(t)}=1.\qquad(18)

Similarly, we also define,

\widetilde{\eta}_{k}^{(t)}=\eta_{k}\prod_{i=k+1}^{t}(1-\eta_{i})\qquad\text{ for all }0\leq k\leq t,\qquad(19)

which satisfies the relation

\prod_{k=\tau+1}^{t}(1-\eta_{k})+\sum_{k=\tau+1}^{t}\widetilde{\eta}_{k}^{(t)}=1\qquad(20)

for any integer $0\leq\tau<t$. This claim follows immediately by plugging $p=0$ into (18). Note that for the constant step size, the sequence $\widetilde{\eta}_{k}^{(t)}$ is clearly increasing in $k$. For the rescaled linear step size, we have

\frac{\widetilde{\eta}_{k-1}^{(t)}}{\widetilde{\eta}_{k}^{(t)}}=\frac{\eta_{k}}{\eta_{k-1}(1-\eta_{k})}=1-\frac{(1-c_{\eta}(1-\gamma))\eta_{k}}{1-c_{\eta}(1-\gamma)\eta_{k}}\leq 1\qquad(21)

whenever $c_{\eta}\leq\frac{1}{1-\gamma}$. Thus, $\widetilde{\eta}_{k}^{(t)}$ is an increasing sequence as long as $c_{\eta}\leq\frac{1}{1-\gamma}$. Similarly, $\eta_{k}^{(t)}$ is also clearly increasing for the constant step size schedule. For the rescaled linear step size schedule, we have

\frac{\eta_{k-1}^{(t)}}{\eta_{k}^{(t)}}=\frac{\eta_{k}}{\eta_{k-1}(1-\eta_{k}(1-\gamma p))}\leq\frac{\eta_{k}}{\eta_{k-1}(1-\eta_{k})}\leq 1,

whenever $c_{\eta}\leq\frac{1}{1-\gamma}$. The last bound follows from Eqn. (21).

Proof of (18).

We can show the claim using backward induction. For the base case, note that,

(1-\gamma p)\eta_{t}^{(t)}+(1-\gamma p)\eta_{t-1}^{(t)}=(1-\gamma p)\eta_{t}+(1-\gamma p)\eta_{t-1}(1-(1-\gamma p)\eta_{t})
=1-(1-\eta_{t}(1-\gamma p))(1-\eta_{t-1}(1-\gamma p))=1-\prod_{k=t-1}^{t}(1-\eta_{k}(1-\gamma p)),

as required. Assume (18) is true for some $\tau$. We have

(1-\gamma p)\sum_{k=\tau}^{t}\eta_{k}^{(t)}=(1-\gamma p)\eta_{\tau}^{(t)}+(1-\gamma p)\sum_{k=\tau+1}^{t}\eta_{k}^{(t)}
=(1-\gamma p)\eta_{\tau}\prod_{k=\tau+1}^{t}(1-\eta_{k}(1-\gamma p))+1-\prod_{k=\tau+1}^{t}(1-\eta_{k}(1-\gamma p))
=1-\prod_{k=\tau}^{t}(1-\eta_{k}(1-\gamma p)),

thus completing the induction step.
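
Identity (18) can also be checked numerically for the rescaled linear step sizes and a few arbitrary pairs $(\tau,t)$; the parameter values below are arbitrary and the helper name is illustrative.

```python
import numpy as np

def eta_k_t(etas, k, t, one_minus_gp):
    """eta_k^(t) = eta_k * prod_{i=k+1}^{t} (1 - eta_i * (1 - gamma * p)); etas[0] is eta_0 = 1."""
    return etas[k] * np.prod(1 - etas[k + 1 : t + 1] * one_minus_gp)

gamma, p, T, c_eta = 0.95, 0.9, 200, 1.0
etas = np.concatenate(([1.0], 1.0 / (1 + c_eta * (1 - gamma) * np.arange(1, T + 1))))
one_minus_gp = 1 - gamma * p

for tau, t in [(0, 10), (5, 50), (37, 200)]:
    lhs = np.prod(1 - etas[tau + 1 : t + 1] * one_minus_gp) \
          + one_minus_gp * sum(eta_k_t(etas, k, t, one_minus_gp) for k in range(tau + 1, t + 1))
    assert np.isclose(lhs, 1.0)  # identity (18) holds for every 0 <= tau < t
```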

Sample transition matrix.

Recall that $Z\in\mathcal{S}^{|\mathcal{S}||\mathcal{A}|}$ is a random vector whose $(s,a)$-th coordinate is drawn from the distribution $P(\cdot|s,a)$. We use $\widehat{P}_{t}^{m}$ to denote the sample transition at time $t$ and agent $m$, obtained by averaging $B$ i.i.d. samples from the generative model. Specifically, let $\{Z_{t,b}^{m}\}_{b=1}^{B}$ denote a collection of $B$ i.i.d. copies of $Z$ collected at time $t$ at agent $m$. Then, for all $s,a,s^{\prime}$,

\widehat{P}_{t}^{m}(s^{\prime}|s,a)=\frac{1}{B}\sum_{b=1}^{B}P_{t,b}^{m}(s^{\prime}|s,a),\qquad(22)

where $P_{t,b}^{m}(s^{\prime}|s,a)=\mathbbm{1}\{Z_{t,b}^{m}(s,a)=s^{\prime}\}$ for $s^{\prime}\in\mathcal{S}$.
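
A small sketch of the empirical transition matrix in Eqn. (22): each entry is the fraction of the $B$ sampled next states that land in $s^{\prime}$. The function name and array layout are illustrative.

```python
import numpy as np

def empirical_transition(P, B, rng):
    """P_hat_t^m: average of B indicator transitions drawn from the generative model (Eqn. (22))."""
    S, A, _ = P.shape
    P_hat = np.zeros_like(P)
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=B, p=P[s, a])           # B i.i.d. copies of Z(s, a)
            P_hat[s, a] = np.bincount(next_states, minlength=S) / B  # empirical distribution
    return P_hat
```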

Preliminary relations of the iterates.

We state some preliminary relations regarding the evolution of the Q-function and the value function across different agents that will be helpful for the analysis later.

We begin with state $0$, where we have $Q_{t}^{m}(0,1)=V_{t}^{m}(0)=0$ for all agents $m\in[M]$ and $t\in[T]$. This follows almost immediately from the fact that state $0$ is an absorbing state with zero reward. Note that $Q_{0}^{m}(0,1)=V_{0}^{m}(0)=0$ holds for all agents $m\in[M]$. Assuming that $Q_{t-1}^{m}(0,1)=V_{t-1}^{m}(0)=0$ holds for all agents at some time instant $t-1$, by induction we have

Q_{t-1/2}^{m}(0,1)=(1-\eta_{t})Q_{t-1}^{m}(0,1)+\eta_{t}(\gamma V_{t-1}^{m}(0))=0.

Consequently, $Q_{t}^{m}(0,1)=0$ and $V_{t}^{m}(0)=0$ for all agents $m$, irrespective of whether averaging takes place.

For state $3$, the iterates satisfy the following relation:

Q_{t-1/2}^{m}(3,1)=(1-\eta_{t})Q_{t-1}^{m}(3,1)+\eta_{t}(1+\gamma V_{t-1}^{m}(3))
=(1-\eta_{t})Q_{t-1}^{m}(3,1)+\eta_{t}(1+\gamma Q_{t-1}^{m}(3,1))
=(1-\eta_{t}(1-\gamma))Q_{t-1}^{m}(3,1)+\eta_{t},

where the second step follows by noting V_{t}^{m}(3)=Q_{t}^{m}(3,1). Once again, the averaging step does not affect this update rule, implying that the following holds for all m\in[M] and t\in[T]:

Vtm(3)=Qtm(3,1)=k=1tηk(i=k+1t(1ηi(1γ)))=11γ[1i=1t(1ηi(1γ))],\displaystyle V_{t}^{m}(3)=Q_{t}^{m}(3,1)=\sum_{k=1}^{t}\eta_{k}\left(\prod_{i=k+1}^{t}(1-\eta_{i}(1-\gamma))\right)=\frac{1}{1-\gamma}\left[1-\prod_{i=1}^{t}(1-\eta_{i}(1-\gamma))\right], (23)

where the last step follows from Eqn. (18) with p=1p=1.

Similarly, for states 1 and 2, we have,

Qt1/2m(1,1)\displaystyle Q_{t-1/2}^{m}(1,1) =(1ηt)Qt1m(1,1)+ηt(1+γP^tm(1|1,1)Vt1m(1)),\displaystyle=(1-\eta_{t})Q_{t-1}^{m}(1,1)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,1)V_{t-1}^{m}(1)), (24)
Qt1/2m(1,2)\displaystyle Q_{t-1/2}^{m}(1,2) =(1ηt)Qt1m(1,2)+ηt(1+γP^tm(1|1,2)Vt1m(1)),\displaystyle=(1-\eta_{t})Q_{t-1}^{m}(1,2)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,2)V_{t-1}^{m}(1)), (25)
Qt1/2m(2,1)\displaystyle Q_{t-1/2}^{m}(2,1) =(1ηt)Qt1m(2,1)+ηt(1+γP^tm(2|2,1)Vt1m(2)).\displaystyle=(1-\eta_{t})Q_{t-1}^{m}(2,1)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(2|2,1)V_{t-1}^{m}(2)). (26)

Since the averaging step does affect these updates, we analyze them in further detail later in the proof, where required.

A.3 Main analysis

We first focus on establishing a lower bound on the number of communication rounds, i.e., \textsf{CC}_{\textsf{round}}(\mathscr{A}) (where we drop the dependence on the other parameters for notational simplicity), and then use this lower bound to establish the bound on the bit-level communication complexity \textsf{CC}_{\textsf{bit}}(\mathscr{A}).

To establish the lower bound on \textsf{CC}_{\textsf{round}}(\mathscr{A}) for any intermittent communication algorithm \mathscr{A}, we analyze the convergence behavior of \mathscr{A} on the MDP \mathcal{M}_{h}. We assume that the averaging step in line 6 of Algorithm 1 is carried out exactly; since the use of compression only makes the problem harder, it is sufficient to consider the case where there is no loss of information in the averaging step when establishing a lower bound. Lastly, throughout the proof, without loss of generality we assume that

logN11γ,\displaystyle\log N\leq\frac{1}{1-\gamma}, (27)

otherwise, the lower bound in Theorem 1 reduces to the trivial lower bound.

We divide the proof into the following three parts based on the choice of learning rates and batch sizes:

  1. 1.

    Small learning rates: For constant learning rates, 0η<1(1γ)T0\leq\eta<\frac{1}{(1-\gamma)T} and for rescaled linear learning rates, the constant cηc_{\eta} satisfies cηlogTc_{\eta}\geq\log T.

  2. 2.

    Large learning rates with small ηT/(BM)\eta_{T}/(BM): For constant learning rates, η1(1γ)T\eta\geq\frac{1}{(1-\gamma)T} and for rescaled linear learning rates, the constant cηc_{\eta} satisfies 0cηlogT11γ0\leq c_{\eta}\leq\log T\leq\frac{1}{1-\gamma} (c.f. (27)). Additionally, the ratio ηTBM\frac{\eta_{T}}{BM} satisfies ηTBM1γ100\frac{\eta_{T}}{BM}\leq\frac{1-\gamma}{100}.

  3. 3.

    Large learning rates with large ηT/(BM)\eta_{T}/(BM): We have the same condition on the learning rates as above. However, in this case the ratio ηTBM\frac{\eta_{T}}{BM} satisfies ηTBM>1γ100\frac{\eta_{T}}{BM}>\frac{1-\gamma}{100}.

We consider each of the cases separately in the following three subsections.

A.3.1 Small learning rates

In this subsection, we prove the lower bound for small learning rates, which follows from arguments similar to those in Li2023QLMinimax.

For this case, we focus on the dynamics of state 2. We claim that the following relation, established in Li2023QLMinimax, continues to hold in the federated setting; its proof is deferred to the end of this subsection:

\displaystyle\mathbb{E}[V_{T}^{m}(2)]=\left(\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}[V_{T}^{j}(2)]\right)=\sum_{k=1}^{T}\eta_{k}^{(T)}=\frac{1-\eta_{0}^{(T)}}{1-\gamma p}. (28)

Consequently, for all m[M]m\in[M], we have

V(2)𝔼[VTm(2)]=η0(T)1γp.\displaystyle V^{\star}(2)-\mathbb{E}[V_{T}^{m}(2)]=\frac{\eta_{0}^{(T)}}{1-\gamma p}. (29)

To lower bound V^{\star}(2)-\mathbb{E}[V_{T}^{m}(2)], it therefore suffices to obtain a lower bound on \eta_{0}^{(T)}. From (Li2023QLMinimax, Eqn. (120)), we have

log(η0(T))1.5t=1Tη(1γp)2t=1T1tlogT2η0(T)e2\displaystyle\log(\eta_{0}^{(T)})\geq-1.5\sum_{t=1}^{T}\eta(1-\gamma p)\geq-2\sum_{t=1}^{T}\frac{1}{t\log T}\geq-2\qquad\Longrightarrow\qquad\eta_{0}^{(T)}\geq e^{-2}

when T16T\geq 16 for both choices of learning rates. On plugging this bound in (29), we obtain,

𝔼[QTmQ]𝔼[|Q(2)QTm(2)|]V(2)𝔼[VTm(2)]34e2(1γ)N\displaystyle\mathbb{E}[\|Q_{T}^{m}-Q^{\star}\|_{\infty}]\geq\mathbb{E}[|Q^{\star}(2)-Q_{T}^{m}(2)|]\geq V^{\star}(2)-\mathbb{E}[V_{T}^{m}(2)]\geq\frac{3}{4e^{2}(1-\gamma)\sqrt{N}} (30)

holds for all m\in[M], N\geq 1 and M\geq 2. Consequently, the error rate \textsf{ER}(\mathscr{A};N,M) is lower bounded by a quantity that does not improve with the number of agents or the number of communication rounds. Thus, even with \textsf{CC}_{\textsf{round}}=\Omega(T), we will not observe any collaborative gain if the step size is too small.
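The bound \eta_{0}^{(T)}\geq e^{-2} can be checked numerically, as in the sketch below, where we interpret \eta_{0}^{(T)}=\prod_{t=1}^{T}(1-\eta_{t}(1-\gamma p)), which is how the residual term enters (28) and (29). The constant schedule is taken at the boundary value \frac{1}{(1-\gamma)T}, and the explicit form \eta_{t}=\frac{1}{1+c_{\eta}(1-\gamma)t} of the rescaled linear schedule (with c_{\eta}=\log T) is an assumption made for illustration, consistent with how \eta_{\tau} is bounded later in the proof.

import numpy as np

gamma = 0.95
p = (4 * gamma - 1) / (3 * gamma)
T = 10_000
t = np.arange(1, T + 1)

schedules = {
    "constant, eta = 1/((1-gamma)T)": np.full(T, 1.0 / ((1.0 - gamma) * T)),
    "rescaled linear, c_eta = log T": 1.0 / (1.0 + np.log(T) * (1.0 - gamma) * t),
}
for name, eta in schedules.items():
    eta0_T = np.prod(1.0 - eta * (1.0 - gamma * p))    # eta_0^{(T)}
    print(name, ":", eta0_T, "  >= exp(-2):", bool(eta0_T >= np.exp(-2)))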

Proof of (28).

Recall that from (26), we have,

Qt1/2m(2,1)\displaystyle Q_{t-1/2}^{m}(2,1) =(1ηt)Vt1m(2)+ηt(1+γP^tm(2|2,1)Vt1m(2)).\displaystyle=(1-\eta_{t})V_{t-1}^{m}(2)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(2|2,1)V_{t-1}^{m}(2)).

Here, Qt1m(2,1)=Vt1m(2)Q_{t-1}^{m}(2,1)=V_{t-1}^{m}(2) as the second state has only a single action.

  • When tt is not an averaging instant, we have,

    Vtm(2)=Qtm(2,1)=(1ηt)Vt1m(2)+ηt(1+γP^tm(2|2,1)Vt1m(2)).\displaystyle V_{t}^{m}(2)=Q_{t}^{m}(2,1)=(1-\eta_{t})V_{t-1}^{m}(2)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(2|2,1)V_{t-1}^{m}(2)). (31)

    On taking expectation on both sides of the equation, we obtain,

    𝔼[Vtm(2)]\displaystyle\mathbb{E}[V_{t}^{m}(2)] =(1ηt)𝔼[Vt1m(2)]+ηt(1+γ𝔼[P^tm(2|2,1)Vt1m(2)])\displaystyle=(1-\eta_{t})\mathbb{E}[V_{t-1}^{m}(2)]+\eta_{t}(1+\gamma\mathbb{E}[\widehat{P}_{t}^{m}(2|2,1)V_{t-1}^{m}(2)])
    =(1ηt)𝔼[Vt1m(2)]+ηt(1+γ𝔼[P^tm(2|2,1)]𝔼[Vt1m(2)])\displaystyle=(1-\eta_{t})\mathbb{E}[V_{t-1}^{m}(2)]+\eta_{t}\left(1+\gamma\mathbb{E}[\widehat{P}_{t}^{m}(2|2,1)]\mathbb{E}[V_{t-1}^{m}(2)]\right)
    =(1ηt)𝔼[Vt1m(2)]+ηt(1+γp𝔼[Vt1m(2)])\displaystyle=(1-\eta_{t})\mathbb{E}[V_{t-1}^{m}(2)]+\eta_{t}\left(1+\gamma p\mathbb{E}[V_{t-1}^{m}(2)]\right)
    =(1ηt(1γp))𝔼[Vt1m(2)]+ηt.\displaystyle=(1-\eta_{t}(1-\gamma p))\mathbb{E}[V_{t-1}^{m}(2)]+\eta_{t}. (32)

    In the second step, we used the fact that P^tm(2|2,1)\widehat{P}_{t}^{m}(2|2,1) is independent of Vt1m(2)V_{t-1}^{m}(2).

  • Similarly, if tt is an averaging instant, we have,

    Vtm(2)\displaystyle V_{t}^{m}(2) =Qtm(2,1)=1Mj=1MQt1/2j(2,1)\displaystyle=Q_{t}^{m}(2,1)=\frac{1}{M}\sum_{j=1}^{M}Q_{t-1/2}^{j}(2,1)
    =(1ηt)1Mj=1MVt1j(2)+1Mj=1Mηt(1+γP^tj(2|2,1)Vt1j(2)).\displaystyle=(1-\eta_{t})\frac{1}{M}\sum_{j=1}^{M}V_{t-1}^{j}(2)+\frac{1}{M}\sum_{j=1}^{M}\eta_{t}(1+\gamma\widehat{P}_{t}^{j}(2|2,1)V_{t-1}^{j}(2)). (33)

    Once again, upon taking expectation we obtain,

    𝔼[Vtm(2)]\displaystyle\mathbb{E}[V_{t}^{m}(2)] =(1ηt)1Mj=1M𝔼[Vt1j(2)]+1Mj=1Mηt(1+γ𝔼[P^tj(2|2,1)Vt1j(2)])\displaystyle=(1-\eta_{t})\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}[V_{t-1}^{j}(2)]+\frac{1}{M}\sum_{j=1}^{M}\eta_{t}(1+\gamma\mathbb{E}[\widehat{P}_{t}^{j}(2|2,1)V_{t-1}^{j}(2)])
    =(1ηt)1Mj=1M𝔼[Vt1j(2)]+1Mj=1Mηt(1+γp𝔼[Vt1j(2)])\displaystyle=(1-\eta_{t})\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}[V_{t-1}^{j}(2)]+\frac{1}{M}\sum_{j=1}^{M}\eta_{t}(1+\gamma p\mathbb{E}[V_{t-1}^{j}(2)])
    =(1ηt(1γp))(1Mj=1M𝔼[Vt1j(2)])+ηt.\displaystyle=(1-\eta_{t}(1-\gamma p))\left(\frac{1}{M}\sum_{j=1}^{M}\mathbb{E}[V_{t-1}^{j}(2)]\right)+\eta_{t}. (34)

Eqns. (32) and (34) together imply that for all t[T]t\in[T],

(1Mm=1M𝔼[Vtm(2)])=(1ηt(1γp))(1Mm=1M𝔼[Vt1m(2)])+ηt.\displaystyle\left(\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}[V_{t}^{m}(2)]\right)=(1-\eta_{t}(1-\gamma p))\left(\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}[V_{t-1}^{m}(2)]\right)+\eta_{t}. (35)

On unrolling the above recursion with V0m=0V_{0}^{m}=0 for all m[M]m\in[M], we obtain the desired claim (28).
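The claim (28) can also be illustrated by a small Monte Carlo simulation of the state-2 dynamics: the recursion (35) for the population average of V_{t}^{m}(2) is unaffected by where the averaging rounds are placed. The step-size schedule, communication instants, and Bernoulli generative model in the sketch below are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
gamma = 0.9
p = (4 * gamma - 1) / (3 * gamma)
T, M, B = 200, 4, 5
eta = np.full(T + 1, 0.1)              # constant step size (illustrative); eta[0] unused
comm = {50, 120, 200}                  # arbitrary averaging instants

n_runs = 2000
V_final = np.zeros(n_runs)
for run in range(n_runs):
    V = np.zeros(M)                                     # V_0^m(2) = 0 for all agents
    for t in range(1, T + 1):
        P_hat = rng.binomial(B, p, size=M) / B          # averaged estimate of P(2|2,1), cf. (22)
        V = (1.0 - eta[t]) * V + eta[t] * (1.0 + gamma * P_hat * V)   # update (31)
        if t in comm:
            V[:] = V.mean()                             # exact averaging across agents
    V_final[run] = V.mean()

eta0_T = np.prod(1.0 - eta[1:] * (1.0 - gamma * p))
print("Monte Carlo mean:", V_final.mean(), "  prediction from (28):", (1.0 - eta0_T) / (1.0 - gamma * p))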

A.3.2 Large learning rates with small ηTBM\frac{\eta_{T}}{BM}

In this subsection, we prove the lower bound for the case of large learning rates when the ratio \frac{\eta_{T}}{BM} is small. For the analysis in this part, we focus on the dynamics of state 1. Unless otherwise specified, throughout this subsection we implicitly assume that the state is 1.

We further define a parameter that will play a key role in the analysis:

τ:=min{k:tk,ηtηk3ηt}.\displaystyle\tau:=\min\{k\in\mathbb{N}:\forall\ t\geq k,\eta_{t}\leq\eta_{k}\leq 3\eta_{t}\}. (36)

It can be noted that \tau=1 for the constant step size sequence and \tau=T/3 for the rescaled linear step size sequence.
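This claim about \tau can be checked numerically, as in the sketch below; the explicit form \eta_{t}=\frac{1}{1+c_{\eta}(1-\gamma)t} of the rescaled linear schedule is an assumption made for illustration, and the constant schedule uses an arbitrary fixed value.

import numpy as np

gamma, c_eta, T = 0.95, 2.0, 30_000
t = np.arange(1, T + 1)

for name, eta in [
    ("constant", np.full(T, 0.3)),
    ("rescaled linear", 1.0 / (1.0 + c_eta * (1.0 - gamma) * t)),
]:
    suffix_min = np.minimum.accumulate(eta[::-1])[::-1]       # min_{t >= k} eta_t
    suffix_max = np.maximum.accumulate(eta[::-1])[::-1]       # max_{t >= k} eta_t
    valid = (suffix_max <= eta) & (eta <= 3.0 * suffix_min)   # condition in (36) at index k
    tau = int(np.argmax(valid)) + 1                           # smallest valid k (1-indexed)
    print(name, ": tau =", tau, "  T/3 =", T // 3)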

Step 1: introducing an auxiliary sequence.

To aid our analysis, we define an auxiliary sequence \widehat{Q}^{m}_{t}(a) for a\in\{1,2\} and all t=1,2,\dots,T, where we drop the dependence on the state s=1 for simplicity. The evolution of the sequence \widehat{Q}_{t}^{m} is defined in Algorithm 4, where \widehat{V}_{t}^{m}=\max_{a\in\{1,2\}}\widehat{Q}_{t}^{m}(a). In other words, the iterates \{\widehat{Q}_{t}^{m}\} evolve exactly as the iterates of Algorithm 1, except that the sequence \{\widehat{Q}_{t}^{m}\} is initialized at the optimal Q-function of the MDP. We emphasize that the underlying stochasticity is identical, in the sense that the evolution of both Q_{t}^{m} and \widehat{Q}_{t}^{m} is governed by the same \widehat{P}_{t}^{m} matrices. The following lemma controls the error between the iterates Q_{t}^{m} and \widehat{Q}_{t}^{m}, allowing us to focus only on \widehat{Q}_{t}^{m}.

1:  Input : T,R,{ηt}t=1T,𝒞={tr}r=1R,BT,R,\{\eta_{t}\}_{t=1}^{T},\mathcal{C}=\{t_{r}\}_{r=1}^{R},B
2:  Set Q^0m(a)Q(1,a)\widehat{Q}_{0}^{m}(a)\leftarrow Q^{\star}(1,a) for a{1,2}a\in\{1,2\} and all agents mm     // Different initialization
3:  for t=1,2,,Tt=1,2,\dots,T do
4:     for m=1,2,,Mm=1,2,\dots,M do
5:        Compute Q^t12m\widehat{Q}_{t-\frac{1}{2}}^{m} according to Eqn. (7)
6:        Compute Q^tm\widehat{Q}_{t}^{m} according to Eqn. (8)
7:     end for
8:  end for
Algorithm 4 Evolution of Q^\widehat{Q}
Lemma 2.

The following relation holds for all agents m[M]m\in[M], all t[T]t\in[T] and a{1,2}a\in\{1,2\}:

Qtm(1,a)Q^tm(a)11γi=1t(1ηi(1γ)).\displaystyle Q_{t}^{m}(1,a)-\widehat{Q}_{t}^{m}(a)\geq-\frac{1}{1-\gamma}\prod_{i=1}^{t}(1-\eta_{i}(1-\gamma)).
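Before proceeding, the coupling underlying Lemma 2 can be illustrated with a short simulation: the two sequences are driven by the same sample transitions \widehat{P}_{t}^{m} and differ only in their initialization. In the sketch below we assume that Eqn. (7) is the local update (24)-(25) for state 1 and that Eqn. (8) averages the iterates across agents at the communication instants (and is the identity otherwise); the step sizes and communication instants are illustrative.

import numpy as np

rng = np.random.default_rng(3)
gamma = 0.9
p = (4 * gamma - 1) / (3 * gamma)
T, M, B = 300, 4, 5
eta = np.full(T + 1, 0.05)                    # illustrative step sizes; eta[0] unused
comm = {100, 200, 300}                        # illustrative communication instants
V_star = 1.0 / (1.0 - gamma * p)              # Q*(1,1) = Q*(1,2) = V*(1)

Q = np.zeros((M, 2))                          # Q_0^m(1,a) = 0
Q_hat = np.full((M, 2), V_star)               # Qhat_0^m(a) = Q*(1,a)  (Algorithm 4, line 2)
for t in range(1, T + 1):
    P_hat = rng.binomial(B, p, size=(M, 2)) / B          # the same P_hat_t^m drives both sequences
    V = Q.max(axis=1)
    V_hat = Q_hat.max(axis=1)
    Q = (1.0 - eta[t]) * Q + eta[t] * (1.0 + gamma * P_hat * V[:, None])        # cf. (24)-(25)
    Q_hat = (1.0 - eta[t]) * Q_hat + eta[t] * (1.0 + gamma * P_hat * V_hat[:, None])
    if t in comm:
        Q = np.tile(Q.mean(axis=0), (M, 1))              # exact averaging across agents
        Q_hat = np.tile(Q_hat.mean(axis=0), (M, 1))

# Pathwise check of the lower bound in Lemma 2 at time T
bound = -np.prod(1.0 - eta[1:] * (1.0 - gamma)) / (1.0 - gamma)
print("min (Q - Q_hat):", (Q - Q_hat).min(), "  lower bound:", bound,
      bool((Q - Q_hat).min() >= bound))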

By Lemma 2, bounding the error of the sequence Q^tm\widehat{Q}_{t}^{m} allows us to obtain a bound on the error of QtmQ_{t}^{m}. To that effect, we define the following terms for any tTt\leq T and all m[M]m\in[M]:

Δtm(a)\displaystyle\Delta_{t}^{m}(a) :=Q^tm(a)Q(1,a);a=1,2;\displaystyle:=\widehat{Q}_{t}^{m}(a)-Q^{\star}(1,a);\quad a=1,2;
Δt,maxm\displaystyle\Delta_{t,\max}^{m} =maxa{1,2}Δtm(a).\displaystyle=\max_{a\in\{1,2\}}\Delta_{t}^{m}(a).

In addition, we use Δ¯t=1Mm=1MΔtm\overline{\Delta}_{t}=\frac{1}{M}\sum_{m=1}^{M}\Delta_{t}^{m} to denote the error of the averaged iterate111We use this different notation in appendix as opposed to the half-time indices used in the main text to improve readability of the proof., and similarly,

Δ¯t,max:=maxa{1,2}Δ¯t(a).\displaystyle\overline{\Delta}_{t,\max}:=\max_{a\in\{1,2\}}\overline{\Delta}_{t}(a). (37)

We first derive a basic recursion regarding Δtm(a)\Delta_{t}^{m}(a). From the iterative update rule in Algorithm 4, we have,

Δtm(a)\displaystyle\Delta_{t}^{m}(a) =(1ηt)Δt1m(a)+ηt(1+γP^tm(1|1,a)V^t1mQ(1,a))\displaystyle=(1-\eta_{t})\Delta_{t-1}^{m}(a)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,a)\widehat{V}_{t-1}^{m}-Q^{\star}(1,a))
=(1ηt)Δt1m(a)+ηtγ(P^tm(1|1,a)V^t1mpV(1))\displaystyle=(1-\eta_{t})\Delta_{t-1}^{m}(a)+\eta_{t}\gamma(\widehat{P}_{t}^{m}(1|1,a)\widehat{V}_{t-1}^{m}-pV^{\star}(1))
=(1ηt)Δt1m(a)+ηtγ(p(V^t1mV(1))+(P^tm(1|1,a)p)V^t1)\displaystyle=(1-\eta_{t})\Delta_{t-1}^{m}(a)+\eta_{t}\gamma(p(\widehat{V}_{t-1}^{m}-V^{\star}(1))+(\widehat{P}_{t}^{m}(1|1,a)-p)\widehat{V}_{t-1})
=(1ηt)Δt1m(a)+ηtγ(pΔt1,maxm+(P^tm(1|1,a)p)V^t1m).\displaystyle=(1-\eta_{t})\Delta_{t-1}^{m}(a)+\eta_{t}\gamma(p\Delta_{t-1,\max}^{m}+(\widehat{P}_{t}^{m}(1|1,a)-p)\widehat{V}_{t-1}^{m}).

Here, in the last line, we used the following relation (applied with t-1 in place of t):

\displaystyle\Delta_{t,\max}^{m}=\max_{a\in\{1,2\}}(\widehat{Q}_{t}^{m}(a)-Q^{\star}(1,a))=\max_{a\in\{1,2\}}\widehat{Q}_{t}^{m}(a)-V^{\star}(1)=\widehat{V}_{t}^{m}-V^{\star}(1),

which holds as Q^{\star}(1,1)=Q^{\star}(1,2)=V^{\star}(1).

Recursively unrolling the above expression, and using the expression (19), we obtain the following relation: for any t<tt^{\prime}<t when there is no averaging during the interval (t,t)(t^{\prime},t)

Δtm(a)\displaystyle\Delta_{t}^{m}(a) =(k=t+1t(1ηk))Δtm(a)+k=t+1tη~k(t)γ(pΔk1,maxm+(P^km(1|1,a)p)V^k1m).\displaystyle=\left(\prod_{k=t^{\prime}+1}^{t}(1-\eta_{k})\right)\Delta_{t^{\prime}}^{m}(a)+\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma(p\Delta_{k-1,\max}^{m}+(\widehat{P}_{k}^{m}(1|1,a)-p)\widehat{V}_{k-1}^{m}). (38)

For any t,tt^{\prime},t with t<tt^{\prime}<t, we define the notation

φt,t\displaystyle\varphi_{t^{\prime},t} :=k=t+1t(1ηk),\displaystyle:=\prod_{k=t^{\prime}+1}^{t}(1-\eta_{k}), (39)
ξt,tm(a)\displaystyle\xi_{t^{\prime},t}^{m}(a) :=k=t+1tη~k(t)γ(P^km(1|1,a)p)V^k1m,a=1,2;\displaystyle:=\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma(\widehat{P}_{k}^{m}(1|1,a)-p)\widehat{V}_{k-1}^{m},\quad a=1,2; (40)
ξt,t,maxm\displaystyle\xi_{t^{\prime},t,\max}^{m} :=maxa{1,2}ξt,tm(a).\displaystyle:=\max_{a\in\{1,2\}}\xi_{t^{\prime},t}^{m}(a). (41)

Note that by definition, 𝔼[ξt,tm(a)]=0\mathbb{E}[\xi_{t^{\prime},t}^{m}(a)]=0 for a{1,2}a\in\{1,2\} and all mm, tt^{\prime} and tt. Plugging them into the previous expression leads to the simplified expression

Δtm(a)\displaystyle\Delta_{t}^{m}(a) =φt,tΔtm(a)+[k=t+1tη~k(t)γpΔk1,maxm]+ξt,tm(a).\displaystyle=\varphi_{t^{\prime},t}\Delta_{t^{\prime}}^{m}(a)+\left[\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma p\Delta_{k-1,\max}^{m}\right]+\xi_{t^{\prime},t}^{m}(a).

We specifically choose tt^{\prime} and tt to be the consecutive averaging instants to analyze the behaviour of Δtm\Delta_{t}^{m} across two averaging instants. Consequently, we can rewrite the above equation as

Δtm(a)\displaystyle\Delta_{t}^{m}(a) =φt,tΔ¯t(a)+[k=t+1tη~k(t)γpΔk1,maxm]+ξt,tm(a).\displaystyle=\varphi_{t^{\prime},t}\overline{\Delta}_{t^{\prime}}(a)+\left[\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma p\Delta_{k-1,\max}^{m}\right]+\xi_{t^{\prime},t}^{m}(a). (42)

Furthermore, after averaging, we obtain,

Δ¯t(a)\displaystyle\overline{\Delta}_{t}(a) =φt,tΔ¯t(a)+1Mm=1M[k=t+1tη~k(t)γpΔk1,maxm]+1Mm=1Mξt,tm(a).\displaystyle=\varphi_{t^{\prime},t}\overline{\Delta}_{t^{\prime}}(a)+\frac{1}{M}\sum_{m=1}^{M}\left[\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma p\Delta_{k-1,\max}^{m}\right]+\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(a). (43)
Step 2: deriving a recursive bound for 𝔼[Δ¯t,max]\mathbb{E}[\overline{\Delta}_{t,\max}].

Taking the maximum over a\in\{1,2\} on both sides of (42), we obtain,

Δt,maxm\displaystyle\Delta_{t,\max}^{m} φt,tΔ¯t,max+[k=t+1tη~k(t)γpΔk1,maxm]+ξt,t,maxmφt,t|Δ¯t(1)Δ¯t(2)|,\displaystyle\geq\varphi_{t^{\prime},t}\overline{\Delta}_{t^{\prime},\max}+\left[\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma p\Delta_{k-1,\max}^{m}\right]+\xi_{t^{\prime},t,\max}^{m}-\varphi_{t^{\prime},t}|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|, (44a)
Δt,maxm\displaystyle\Delta_{t,\max}^{m} φt,tΔ¯t,max+[k=t+1tη~k(t)γpΔk1,maxm]+ξt,t,maxm,\displaystyle\leq\varphi_{t^{\prime},t}\overline{\Delta}_{t^{\prime},\max}+\left[\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma p\Delta_{k-1,\max}^{m}\right]+\xi_{t^{\prime},t,\max}^{m}, (44b)

where in the first step we used the fact that

max{a1+b1,a2+b2}min{a1,a2}+max{b1,b2}=max{a1,a2}+max{b1,b2}|a1a2|.\displaystyle\max\{a_{1}+b_{1},a_{2}+b_{2}\}\geq\min\{a_{1},a_{2}\}+\max\{b_{1},b_{2}\}=\max\{a_{1},a_{2}\}+\max\{b_{1},b_{2}\}-|a_{1}-a_{2}|. (45)

On taking expectation, we obtain,

𝔼[Δt,maxm]\displaystyle\mathbb{E}[\Delta_{t,\max}^{m}] φt,t𝔼[Δ¯t,max]+[k=t+1tη~k(t)γp𝔼[Δk1,maxm]]+𝔼[ξt,t,maxm]φt,t𝔼[|Δ¯t(1)Δ¯t(2)|],\displaystyle\geq\varphi_{t^{\prime},t}\mathbb{E}[\overline{\Delta}_{t^{\prime},\max}]+\left[\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma p\mathbb{E}[\Delta_{k-1,\max}^{m}]\right]+\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]-\varphi_{t^{\prime},t}\mathbb{E}[|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|], (46a)
𝔼[Δt,maxm]\displaystyle\mathbb{E}[\Delta_{t,\max}^{m}] φt,t𝔼[Δ¯t,max]+[k=t+1tη~k(t)γp𝔼[Δk1,maxm]]+𝔼[ξt,t,maxm].\displaystyle\leq\varphi_{t^{\prime},t}\mathbb{E}[\overline{\Delta}_{t^{\prime},\max}]+\left[\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma p\mathbb{E}[\Delta_{k-1,\max}^{m}]\right]+\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]. (46b)

Similarly, using (43) and (45) we can write,

Δ¯t,max\displaystyle\overline{\Delta}_{t,\max} φt,tΔ¯t,max+1Mm=1M[k=t+1tη~k(t)γpΔk1,maxm]φt,t|Δ¯t(1)Δ¯t(2)|\displaystyle\geq\varphi_{t^{\prime},t}\overline{\Delta}_{t^{\prime},\max}+\frac{1}{M}\sum_{m=1}^{M}\left[\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma p\Delta_{k-1,\max}^{m}\right]-\varphi_{t^{\prime},t}|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|
+max{1Mm=1Mξt,tm(1),1Mm=1Mξt,tm(2)}\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\max\left\{\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(1),\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(2)\right\} (47a)
𝔼[Δ¯t,max]\displaystyle\implies\mathbb{E}[\overline{\Delta}_{t,\max}] φt,t𝔼[Δ¯t,max]+1Mm=1M[k=t+1tη~k(t)γp𝔼[Δk1,maxm]]φt,t𝔼[|Δ¯t(1)Δ¯t(2)|]\displaystyle\geq\varphi_{t^{\prime},t}\mathbb{E}[\overline{\Delta}_{t^{\prime},\max}]+\frac{1}{M}\sum_{m=1}^{M}\left[\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma p\mathbb{E}[\Delta_{k-1,\max}^{m}]\right]-\varphi_{t^{\prime},t}\mathbb{E}[|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|]
+𝔼[max{1Mm=1Mξt,tm(1),1Mm=1Mξt,tm(2)}].\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\mathbb{E}\left[\max\left\{\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(1),\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(2)\right\}\right]. (47b)

On combining (46b) and (47b), we obtain,

𝔼[Δ¯t,max]\displaystyle\mathbb{E}[\overline{\Delta}_{t,\max}] 1Mm=1M[𝔼[Δt,maxm]𝔼[ξt,t,maxm]]φt,t𝔼[|Δ¯t(1)Δ¯t(2)|]\displaystyle\geq\frac{1}{M}\sum_{m=1}^{M}\left[\mathbb{E}[\Delta_{t,\max}^{m}]-\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]\right]-\varphi_{t^{\prime},t}\mathbb{E}[|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|]
+𝔼[max{1Mm=1Mξt,tm(1),1Mm=1Mξt,tm(2)}].\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\mathbb{E}\left[\max\left\{\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(1),\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(2)\right\}\right]. (48)

In order to simplify (48), we make use of the following lemmas.

Lemma 3.

Let t<tt^{\prime}<t be two consecutive averaging instants. Then for all m[M]m\in[M],

𝔼[Δt,maxm]𝔼[ξt,t,maxm]\displaystyle\mathbb{E}[\Delta_{t,\max}^{m}]-\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}] (k=t+1t(1ηk(1γp)))𝔼[Δ¯t,max]+𝔼[ξt,t,maxm][k=t+1tηk(t)1]+\displaystyle\geq\left(\prod_{k=t^{\prime}+1}^{t}(1-\eta_{k}(1-\gamma p))\right)\mathbb{E}[\overline{\Delta}_{t^{\prime},\max}]+\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]\left[\sum_{k=t^{\prime}+1}^{t}\eta_{k}^{(t)}-1\right]_{+}
φt,t𝔼[|Δ¯t(1)Δ¯t(2)|],\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\varphi_{t^{\prime},t}\mathbb{E}[|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|],

where [x]+=max{x,0}[x]_{+}=\max\{x,0\}.

Lemma 4.

For all consecutive averaging instants t,tt^{\prime},t satisfying tmax{t,τ}1/ητt-\max\{t^{\prime},\tau\}\geq 1/\eta_{\tau} and all m[M]m\in[M], we have,

𝔼[ξt,t,maxm]\displaystyle\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}] 1240log(180BηT(1γ))νν+1,\displaystyle\geq\frac{1}{240\log\left(\frac{180B}{\eta_{T}(1-\gamma)}\right)}\cdot\frac{\nu}{\nu+1},
𝔼[max{1Mm=1Mξt,tm(1),1Mm=1Mξt,tm(2)}]\displaystyle\mathbb{E}\left[\max\left\{\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(1),\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(2)\right\}\right] 1240log(180BMηT(1γ))νν+M,\displaystyle\geq\frac{1}{240\log\left(\frac{180BM}{\eta_{T}(1-\gamma)}\right)}\cdot\frac{\nu}{\nu+\sqrt{M}},

where ν:=20ηTB(1γ)\nu:=\sqrt{\dfrac{20\eta_{T}}{B(1-\gamma)}}.

Lemma 5.

For all t{tr}r=1Rt\in\{t_{r}\}_{r=1}^{R}, we have

𝔼[|Δ¯t(1)Δ¯t(2)|]8ηT3BM(1γ).\displaystyle\mathbb{E}[|\overline{\Delta}_{t}(1)-\overline{\Delta}_{t}(2)|]\leq\sqrt{\frac{8\eta_{T}}{3BM(1-\gamma)}}.

Thus, on combining the results from Lemmas 3, 4, and 5 and plugging them into (48), we obtain the following relation for t,t^{\prime}\geq\tau:

𝔼[Δ¯t,max]\displaystyle\mathbb{E}[\overline{\Delta}_{t,\max}] (k=t+1t(1ηk(1γp)))𝔼[Δ¯t,max]+𝔼[ξt,t,maxm][k=t+1tηk(t)1]+2φt,t𝔼[|Δ¯t(1)Δ¯t(2)|]\displaystyle\geq\left(\prod_{k=t^{\prime}+1}^{t}(1-\eta_{k}(1-\gamma p))\right)\mathbb{E}[\overline{\Delta}_{t^{\prime},\max}]+\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]\left[\sum_{k=t^{\prime}+1}^{t}\eta_{k}^{(t)}-1\right]_{+}-2\varphi_{t^{\prime},t}\mathbb{E}[|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|]
+𝔼[max{1Mm=1Mξt,tm(1),1Mm=1Mξt,tm(2)}]\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+\mathbb{E}\left[\max\left\{\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(1),\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(2)\right\}\right]
(1ητ(1γp))tt𝔼[Δ¯t,max]+(1(1ητ(1γp))tt5760log(180BηT(1γ))(1γp))νν+1𝟙{tt8ητ}\displaystyle\geq(1-\eta_{\tau}(1-\gamma p))^{t-t^{\prime}}\mathbb{E}[\overline{\Delta}_{t^{\prime},\max}]+\left(\frac{1-(1-\eta_{\tau}(1-\gamma p))^{t-t^{\prime}}}{5760\log\left(\frac{180B}{\eta_{T}(1-\gamma)}\right)(1-\gamma p)}\right)\cdot\frac{\nu}{\nu+1}\cdot\mathbbm{1}\left\{t-t^{\prime}\geq\frac{8}{\eta_{\tau}}\right\}
2(1ηT)tt8ηT3BM(1γ)+1240log(180BMηT(1γ))νν+M𝟙{tt8ητ},\displaystyle~{}~{}~{}~{}~{}~{}~{}-2(1-\eta_{T})^{t-t^{\prime}}\sqrt{\frac{8\eta_{T}}{3BM(1-\gamma)}}+\frac{1}{240\log\left(\frac{180BM}{\eta_{T}(1-\gamma)}\right)}\cdot\frac{\nu}{\nu+\sqrt{M}}\cdot\mathbbm{1}\left\{t-t^{\prime}\geq\frac{8}{\eta_{\tau}}\right\}, (49)

where we used the relation φt,t(1ηT)tt\varphi_{t^{\prime},t}\leq(1-\eta_{T})^{t-t^{\prime}}, as well as the value of ν\nu as defined in Lemma 4 along with the fact

k=t+1tηk(t)11(1ητ(1γp))tt24(1γp)\displaystyle\sum_{k=t^{\prime}+1}^{t}\eta_{k}^{(t)}-1\geq\frac{1-(1-\eta_{\tau}(1-\gamma p))^{t-t^{\prime}}}{24(1-\gamma p)} (50)

for all t,tτt,t^{\prime}\geq\tau such that tt8/ητt-t^{\prime}\geq 8/\eta_{\tau}.

Proof of (50).

We have,

k=t+1tηk(t)1\displaystyle\sum_{k=t^{\prime}+1}^{t}\eta_{k}^{(t)}-1 =k=t+1t(ηki=k+1t(1ηi(1γp)))1\displaystyle=\sum_{k=t^{\prime}+1}^{t}\left(\eta_{k}\prod_{i=k+1}^{t}(1-\eta_{i}(1-\gamma p))\right)-1
k=t+1t(ηti=k+1t(1ητ(1γp)))1\displaystyle\geq\sum_{k=t^{\prime}+1}^{t}\left(\eta_{t}\prod_{i=k+1}^{t}(1-\eta_{\tau}(1-\gamma p))\right)-1
ηtk=t+1t(1ητ(1γp))tk1\displaystyle\geq\eta_{t}\sum_{k=t^{\prime}+1}^{t}(1-\eta_{\tau}(1-\gamma p))^{t-k}-1
ηt(1(1ητ(1γp))ttητ(1γp))1\displaystyle\geq\eta_{t}\cdot\left(\frac{1-(1-\eta_{\tau}(1-\gamma p))^{t-t^{\prime}}}{\eta_{\tau}(1-\gamma p)}\right)-1
1(1ητ(1γp))tt3(1γp)1.\displaystyle\geq\frac{1-(1-\eta_{\tau}(1-\gamma p))^{t-t^{\prime}}}{3(1-\gamma p)}-1. (51)

To show (50), it is sufficient to show that 1(1ητ(1γp))tt3(1γp)87\displaystyle\frac{1-(1-\eta_{\tau}(1-\gamma p))^{t-t^{\prime}}}{3(1-\gamma p)}\geq\frac{8}{7} for tt8/ητt-t^{\prime}\geq 8/\eta_{\tau}. Thus, for tt8/ητt-t^{\prime}\geq 8/\eta_{\tau} we have,

1(1ητ(1γp))tt3(1γp)\displaystyle\frac{1-(1-\eta_{\tau}(1-\gamma p))^{t-t^{\prime}}}{3(1-\gamma p)} 1exp(ητ(1γp)(tt))3(1γp)\displaystyle\geq\frac{1-\exp(-\eta_{\tau}(1-\gamma p)\cdot(t-t^{\prime}))}{3(1-\gamma p)}
1exp(8(1γp))3(1γp).\displaystyle\geq\frac{1-\exp(-8(1-\gamma p))}{3(1-\gamma p)}. (52)

Since \gamma\geq 5/6 and 1-\gamma p=\frac{4(1-\gamma)}{3}, we have 1-\gamma p\leq 2/9. For x\leq 2/9, the function f(x)=\frac{1-e^{-8x}}{3x} satisfies f(x)\geq 8/7, proving the claim.
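This last claim admits a quick numerical confirmation:

import numpy as np

x = np.linspace(1e-6, 2.0 / 9.0, 100_000)
print(bool(((1.0 - np.exp(-8.0 * x)) / (3.0 * x)).min() >= 8.0 / 7.0))   # prints True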

Step 3: lower bounding 𝔼[Δ¯T,max]\mathbb{E}[\overline{\Delta}_{T,\max}].

We are now interested in evaluating 𝔼[Δ¯T,max]\mathbb{E}[\overline{\Delta}_{T,\max}] based on the recursion (49). To this effect, we introduce some notation to simplify the presentation. Let

Rτ\displaystyle R_{\tau} :=min{r:trτ}.\displaystyle:=\min\{r:t_{r}\geq\tau\}. (53)

For r=Rτ,,Rr=R_{\tau},\dots,R, we define the following terms:

xr\displaystyle x_{r} :=𝔼[Δ¯tr,max],\displaystyle:=\mathbb{E}[\overline{\Delta}_{t_{r},\max}],
αr\displaystyle\alpha_{r} :=(1ητ(1γp))trtr1,\displaystyle:=(1-\eta_{\tau}(1-\gamma p))^{t_{r}-t_{r-1}},
βr\displaystyle\beta_{r} :=(1ηT)trtr1,\displaystyle:=(1-\eta_{T})^{t_{r}-t_{r-1}},
r\displaystyle\mathcal{I}_{r} :={rr>Rτ:trtr18/ητ},\displaystyle:=\{r\geq r^{\prime}>R_{\tau}:t_{r^{\prime}}-t_{r^{\prime}-1}\geq 8/\eta_{\tau}\},
C1\displaystyle C_{1} :=15760log(180BηT(1γ))(1γp)νν+1,\displaystyle:=\frac{1}{5760\log\left(\frac{180B}{\eta_{T}(1-\gamma)}\right)(1-\gamma p)}\cdot\frac{\nu}{\nu+1},
C2\displaystyle C_{2} :=32ηT3BM(1γ),\displaystyle:=\sqrt{\frac{32\eta_{T}}{3BM(1-\gamma)}},
C3\displaystyle C_{3} :=1240log(180BMηT(1γ))νν+M.\displaystyle:=\frac{1}{240\log\left(\frac{180BM}{\eta_{T}(1-\gamma)}\right)}\cdot\frac{\nu}{\nu+\sqrt{M}}.

With these notations in place, the recursion in (49) can be rewritten as

xrαrxr1βrC2+C3𝟙{rr}+(1αr)C1𝟙{rr},\displaystyle x_{r}\geq\alpha_{r}x_{r-1}-\beta_{r}C_{2}+C_{3}\mathbbm{1}\{r\in\mathcal{I}_{r}\}+(1-\alpha_{r})C_{1}\mathbbm{1}\{r\in\mathcal{I}_{r}\}, (54)

for all rRτr\geq R_{\tau}. We claim that xrx_{r} satisfies the following relation for all rRτ+1r\geq R_{\tau}+1 (whose proof is deferred to the end of this step):

xr\displaystyle x_{r} (i=Rτ+1rαi)xRτk=Rτ+1rβk(i=k+1rαi)C2+k=Rτ+1r(i=k+1rαi)𝟙{kk}C3\displaystyle\geq\left(\prod_{i=R_{\tau}+1}^{r}\alpha_{i}\right)x_{R_{\tau}}-\sum_{k=R_{\tau}+1}^{r}\beta_{k}\left(\prod_{i=k+1}^{r}\alpha_{i}\right)C_{2}+\sum_{k=R_{\tau}+1}^{r}\left(\prod_{i=k+1}^{r}\alpha_{i}\right)\mathbbm{1}\{k\in\mathcal{I}_{k}\}C_{3}
+C1(irαi)(1irαi),\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}+C_{1}\left(\prod_{i\notin\mathcal{I}_{r}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{r}}\alpha_{i}\right), (55)

where we recall that if there is no valid index for a product, its value is taken to be 1.
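The unrolled bound (55) can also be checked numerically by iterating the one-step recursion (54) with equality and comparing against the closed-form expression; the contraction factors, constants, and indicator flags in the sketch below are arbitrary values used purely for illustration.

import numpy as np

rng = np.random.default_rng(4)
R_tau, R = 0, 40
alpha = rng.uniform(0.5, 0.99, size=R + 1)     # alpha_r for r = R_tau + 1, ..., R
beta = rng.uniform(0.0, 0.5, size=R + 1)       # beta_r
flag = rng.random(R + 1) < 0.5                 # indicator 1{r in I_r}
C1, C2, C3 = 2.0, 0.3, 0.7                     # arbitrary nonnegative constants
x = np.zeros(R + 1)
x[R_tau] = 1.5                                 # any x_{R_tau} >= 0

for r in range(R_tau + 1, R + 1):              # recursion (54), taken with equality
    x[r] = alpha[r] * x[r - 1] - beta[r] * C2 + (C3 + (1.0 - alpha[r]) * C1) * flag[r]

for r in range(R_tau + 1, R + 1):              # closed-form lower bound (55)
    idx = np.arange(R_tau + 1, r + 1)
    prod_all = np.prod(alpha[idx])
    term_C2 = sum(beta[k] * np.prod(alpha[k + 1:r + 1]) * C2 for k in idx)
    term_C3 = sum(np.prod(alpha[k + 1:r + 1]) * C3 for k in idx if flag[k])
    prod_out = np.prod(alpha[idx[~flag[idx]]])  # product over i not in I_r
    prod_in = np.prod(alpha[idx[flag[idx]]])    # product over i in I_r
    rhs = prod_all * x[R_tau] - term_C2 + term_C3 + C1 * prod_out * (1.0 - prod_in)
    assert x[r] >= rhs - 1e-12, (r, x[r], rhs)
print("bound (55) verified for all rounds")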

Invoking (55) for r=Rr=R and using the relation xRτ10x_{R_{\tau}-1}\geq 0, we obtain,

xR\displaystyle x_{R} k=RτRβk(i=k+1Rαi)C2+k=RτR(i=k+1Rαi)C3𝟙{kk}+C1(iRαi)(1iRαi)\displaystyle\geq-\sum_{k=R_{\tau}}^{R}\beta_{k}\left(\prod_{i=k+1}^{R}\alpha_{i}\right)C_{2}+\sum_{k=R_{\tau}}^{R}\left(\prod_{i=k+1}^{R}\alpha_{i}\right)C_{3}\mathbbm{1}\{k\in\mathcal{I}_{k}\}+C_{1}\left(\prod_{i\notin\mathcal{I}_{R}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{R}}\alpha_{i}\right)
RC2+C1(iRαi)(1iRαi)\displaystyle\geq-RC_{2}+C_{1}\left(\prod_{i\notin\mathcal{I}_{R}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{R}}\alpha_{i}\right)
R32ηT3BM(1γ)+(iRαi)(1iRαi)15760log(180BηT(1γ))(1γp)νν+1,\displaystyle\geq-R\cdot\sqrt{\frac{32\eta_{T}}{3BM(1-\gamma)}}+\left(\prod_{i\notin\mathcal{I}_{R}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{R}}\alpha_{i}\right)\cdot\frac{1}{5760\log\left(\frac{180B}{\eta_{T}(1-\gamma)}\right)(1-\gamma p)}\cdot\frac{\nu}{\nu+1}, (56)

where we used the fact βk(i=k+1Rαi)1\beta_{k}\left(\prod_{i=k+1}^{R}\alpha_{i}\right)\leq 1 and that C30C_{3}\geq 0. Consider the expression

iRαi\displaystyle\prod_{i\notin\mathcal{I}_{R}}\alpha_{i} =iR(1ητ(1γp))titi11ητ(1γp)iR(titi1)=:T1.\displaystyle=\prod_{i\notin\mathcal{I}_{R}}(1-\eta_{\tau}(1-\gamma p))^{t_{i}-t_{i-1}}\geq 1-\eta_{\tau}(1-\gamma p)\cdot\underbrace{\sum_{i\notin\mathcal{I}_{R}}(t_{i}-t_{i-1})}_{=:T_{1}}. (57)

Consequently,

(1iRαi)=1(1ητ(1γp))TτT11exp(ητ(1γp)(TτT1)).\displaystyle\left(1-\prod_{i\in\mathcal{I}_{R}}\alpha_{i}\right)=1-(1-\eta_{\tau}(1-\gamma p))^{T-\tau-T_{1}}\geq 1-\exp\left(-\eta_{\tau}(1-\gamma p)\left(T-\tau-T_{1}\right)\right). (58)

Note that T1T_{1} satisfies the following bound

T1:=iR(titi1)(R|R|)8ητ8Rητ.\displaystyle T_{1}:=\sum_{i\notin\mathcal{I}_{R}}(t_{i}-t_{i-1})\leq(R-|\mathcal{I}_{R}|)\cdot\frac{8}{\eta_{\tau}}\leq\frac{8R}{\eta_{\tau}}. (59)

We split the remainder of the analysis based on the step size schedule.

  • For the constant step size schedule, i.e., ηt=η1(1γ)T\eta_{t}=\eta\geq\frac{1}{(1-\gamma)T}, we have, Rτ=0R_{\tau}=0, with τ=0\tau=0 and t0=0t_{0}=0 (as all agents start at the same point). If R196000(1γ)log(180Bη(1γ))R\leq\frac{1}{96000(1-\gamma)\log\left(\frac{180B}{\eta(1-\gamma)}\right)}, then, (57), (58) and (59) yield the following relations:

    T1\displaystyle T_{1} 8RηT12000log(180N),\displaystyle\leq\frac{8R}{\eta}\leq\frac{T}{12000\log(180N)},
    iRαi\displaystyle\prod_{i\notin\mathcal{I}_{R}}\alpha_{i} 1η(1γp)T1132R(1γ)3119000log(180N),\displaystyle\geq 1-\eta(1-\gamma p)\cdot T_{1}\geq 1-\frac{32R(1-\gamma)}{3}\geq 1-\frac{1}{9000\log(180N)},
    (1iRαi)\displaystyle\left(1-\prod_{i\in\mathcal{I}_{R}}\alpha_{i}\right) 1exp(η(1γp)(TT1))1exp(43(119000log(180N))).\displaystyle\geq 1-\exp\left(-\eta(1-\gamma p)\left(T-T_{1}\right)\right)\geq 1-\exp\left(-\frac{4}{3}\left(1-\frac{1}{9000\log(180N)}\right)\right).

    On plugging the above relations into (56), we obtain

    xR\displaystyle x_{R} 4096000log(180Bη(1γ))(1γ)(νν+1ν5M)\displaystyle\geq\frac{\sqrt{40}}{96000\log\left(\frac{180B}{\eta(1-\gamma)}\right)(1-\gamma)}\cdot\left(\frac{\nu}{\nu+1}-\frac{\nu}{5\sqrt{M}}\right) (60)

    where recall that ν:=20η3B(1γ)\nu:=\sqrt{\dfrac{20\eta}{3B(1-\gamma)}}. Consider the function f(x)=xx+1x5Mf(x)=\frac{x}{x+1}-\frac{x}{5\sqrt{M}} . We claim that for x[0,M]x\in[0,\sqrt{M}] and all M2M\geq 2,

    f(x)720min{x,1}.\displaystyle f(x)\geq\frac{7}{20}\min\{x,1\}. (61)

    The proof of the above claim is deferred to the end of the section. In light of the above claim, we have,

    xR\displaystyle x_{R} 4096000log(180Bη(1γ))(1γ)720min{1,20η3B(1γ)}\displaystyle\geq\frac{\sqrt{40}}{96000\log\left(\frac{180B}{\eta(1-\gamma)}\right)(1-\gamma)}\cdot\frac{7}{20}\cdot\min\left\{1,\sqrt{\frac{20\eta}{3B(1-\gamma)}}\right\}
    4096000log(180N)720min{11γ,203(1γ)4N},\displaystyle\geq\frac{\sqrt{40}}{96000\log\left(180N\right)}\cdot\frac{7}{20}\cdot\min\left\{\frac{1}{1-\gamma},\sqrt{\frac{20}{3(1-\gamma)^{4}N}}\right\}, (62)

    where we used the fact that M2M\geq 2, xlog(1/x)\frac{\sqrt{x}}{\log(1/x)} is an increasing function and the relation νM=20η3BM(1γ)1151\dfrac{\nu}{M}=\dfrac{20\eta}{3BM(1-\gamma)}\leq\dfrac{1}{15}\leq 1.

  • Next, we consider the rescaled linear step size schedule, where τ=T/3\tau=T/3 (cf. (36)). To begin, we assume tRτmax{3T4,T16ητ(1γp)}t_{R_{\tau}}\leq\max\{\frac{3T}{4},T-\frac{1}{6\eta_{\tau}(1-\gamma p)}\}. It is straightforward to note that

    max{3T4,T16ητ(1γp)}={3T4 if cη3T16ητ(1γp) if cη<3.\displaystyle\max\left\{\frac{3T}{4},T-\frac{1}{6\eta_{\tau}(1-\gamma p)}\right\}=\begin{cases}\frac{3T}{4}&\text{ if }c_{\eta}\geq 3\\ T-\frac{1}{6\eta_{\tau}(1-\gamma p)}&\text{ if }c_{\eta}<3.\end{cases}

    If R1384000(1γ)log(180BηT(1γ))(5+cη)R\leq\frac{1}{384000(1-\gamma)\log\left(\frac{180B}{\eta_{T}(1-\gamma)}\right)\cdot(5+c_{\eta})} then, (57), (58) and (59) yield the following relations:

    T1\displaystyle T_{1} 8Rητ,iRαi1ητ(1γp)T1132R(1γ)31136000.\displaystyle\leq\frac{8R}{\eta_{\tau}},\qquad\prod_{i\notin\mathcal{I}_{R}}\alpha_{i}\geq 1-\eta_{\tau}(1-\gamma p)\cdot T_{1}\geq 1-\frac{32R(1-\gamma)}{3}\geq 1-\frac{1}{36000}.

    For cη3c_{\eta}\geq 3, we have,

    (1iRαi)\displaystyle\left(1-\prod_{i\in\mathcal{I}_{R}}\alpha_{i}\right) 1exp(ητ(1γp)(TtRτT1))\displaystyle\geq 1-\exp\left(-\eta_{\tau}(1-\gamma p)\left(T-t_{R_{\tau}}-T_{1}\right)\right)
    1exp((1γ)T(3+cη(1γ)T)+32R(1γ)3)\displaystyle\geq 1-\exp\left(-\frac{(1-\gamma)T}{(3+c_{\eta}(1-\gamma)T)}+\frac{32R(1-\gamma)}{3}\right)
    12(3+cη),\displaystyle\geq\frac{1}{2(3+c_{\eta})},

    where we used T11γT\geq\frac{1}{1-\gamma} in the second step. Similarly, for cη<3c_{\eta}<3, we have,

    (1iRαi)\displaystyle\left(1-\prod_{i\in\mathcal{I}_{R}}\alpha_{i}\right) 1exp(ητ(1γp)(TtRτT1))\displaystyle\geq 1-\exp\left(-\eta_{\tau}(1-\gamma p)\left(T-t_{R_{\tau}}-T_{1}\right)\right)
    1exp(16+32R(1γ)3)\displaystyle\geq 1-\exp\left(-\frac{1}{6}+\frac{32R(1-\gamma)}{3}\right)
    110.\displaystyle\geq\frac{1}{10}.

    On plugging the above relations into (56), we obtain

    xR\displaystyle x_{R} 181.6384000log(180BηT(1γ))(1γ)(5+cη)(νν+1ν18M)\displaystyle\geq\frac{18\sqrt{1.6}}{384000\log\left(\frac{180B}{\eta_{T}(1-\gamma)}\right)(1-\gamma)(5+c_{\eta})}\cdot\left(\frac{\nu}{\nu+1}-\frac{\nu}{18{\sqrt{M}}}\right)
    181.6384000log(180BηT(1γ))(1γ)(5+cη)720min{1,20ηT3B(1γ)}\displaystyle\geq\frac{18\sqrt{1.6}}{384000\log\left(\frac{180B}{\eta_{T}(1-\gamma)}\right)(1-\gamma)(5+c_{\eta})}\cdot\frac{7}{20}\cdot\min\left\{1,\sqrt{\frac{20\eta_{T}}{3B(1-\gamma)}}\right\}
    181.6384000log(180BηT(1γ))(5+cη)720min{11γ,20ηT3B(1γ)3}\displaystyle\geq\frac{18\sqrt{1.6}}{384000\log\left(\frac{180B}{\eta_{T}(1-\gamma)}\right)(5+c_{\eta})}\cdot\frac{7}{20}\cdot\min\left\{\frac{1}{1-\gamma},\sqrt{\frac{20\eta_{T}}{3B(1-\gamma)^{3}}}\right\}
    181.6384000log(180N(1+logN))(5+logN)720min{11γ,203B(1+logN)(1γ)4N},\displaystyle\geq\frac{18\sqrt{1.6}}{384000\log\left(180N(1+\log N)\right)(5+\log N)}\cdot\frac{7}{20}\cdot\min\left\{\frac{1}{1-\gamma},\sqrt{\frac{20}{3B(1+\log N)(1-\gamma)^{4}N}}\right\}, (63)

    where we again used the facts that M2M\geq 2, cηlogNc_{\eta}\leq\log N, xlog(1/x)\frac{\sqrt{x}}{\log(1/x)} is an increasing function and the relation νM=20ηT3BM(1γ)1\dfrac{\nu}{M}=\dfrac{20\eta_{T}}{3BM(1-\gamma)}\leq 1.

  • Last but not least, let us consider the rescaled linear step size schedule case when tRτ>max{3T4,T16ητ(1γp)}t_{R_{\tau}}>\max\{\frac{3T}{4},T-\frac{1}{6\eta_{\tau}(1-\gamma p)}\}. The condition implies that the time between the communication rounds Rτ1R_{\tau}-1 and RτR_{\tau} is at least T0:=max{5T12,2T316ητ(1γp)}T_{0}:=\max\{\frac{5T}{12},\frac{2T}{3}-\frac{1}{6\eta_{\tau}(1-\gamma p)}\}. Thus, (49) yields that

    𝔼[Δ¯tRτ]\displaystyle\mathbb{E}[\overline{\Delta}_{t_{R_{\tau}}}] (1(1ητ(1γp))T05760log(180BηT(1γ))(1γp))νν+12(1ηT)T08ηT3BM(1γ).\displaystyle\geq\left(\frac{1-(1-\eta_{\tau}(1-\gamma p))^{T_{0}}}{5760\log\left(\frac{180}{B\eta_{T}(1-\gamma)}\right)(1-\gamma p)}\right)\cdot\frac{\nu}{\nu+1}-2(1-\eta_{T})^{T_{0}}\sqrt{\frac{8\eta_{T}}{3BM(1-\gamma)}}. (64)

    Using the above relation along with (55), we can conclude that

    xR\displaystyle x_{R} (1ητ(1γp))TtRτ(1(1ητ(1γp))T05760log(180BηT(1γ))(1γp))νν+1\displaystyle\geq(1-\eta_{\tau}(1-\gamma p))^{T-t_{R_{\tau}}}\left(\frac{1-(1-\eta_{\tau}(1-\gamma p))^{T_{0}}}{5760\log\left(\frac{180}{B\eta_{T}(1-\gamma)}\right)(1-\gamma p)}\right)\cdot\frac{\nu}{\nu+1}
    2(1ηT)T0(1ητ(1γp))TtRτ8ηT3BM(1γ)RC2.\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-2(1-\eta_{T})^{T_{0}}\cdot(1-\eta_{\tau}(1-\gamma p))^{T-t_{R_{\tau}}}\sqrt{\frac{8\eta_{T}}{3BM(1-\gamma)}}-RC_{2}. (65)

    In the above relation, we used the trivial bounds C1,C30C_{1},C_{3}\geq 0 and a crude bound on the term corresponding to C2C_{2}, similar to (56). Let us first consider the case of cη3c_{\eta}\geq 3. We have,

    1(1ητ(1γp))T0\displaystyle 1-(1-\eta_{\tau}(1-\gamma p))^{T_{0}} 1exp(ητ(1γp)5T/12)1exp(5(1γ)T3(3+cη(1γ)T))13+cη,\displaystyle\geq 1-\exp\left(-\eta_{\tau}(1-\gamma p)5T/12\right)\geq 1-\exp\left(-\frac{5(1-\gamma)T}{3(3+c_{\eta}(1-\gamma)T)}\right)\geq\frac{1}{3+c_{\eta}},
    (1ητ(1γp))TtRτ\displaystyle(1-\eta_{\tau}(1-\gamma p))^{T-t_{R_{\tau}}} 1ητ(1γp)T41(1γ)T(3+cη(1γ)T)11cη23.\displaystyle\geq 1-\eta_{\tau}(1-\gamma p)\frac{T}{4}\geq 1-\frac{(1-\gamma)T}{(3+c_{\eta}(1-\gamma)T)}\geq 1-\frac{1}{c_{\eta}}\geq\frac{2}{3}.

    Similarly, for cη<3c_{\eta}<3, we have,

    1(1ητ(1γp))T0\displaystyle 1-(1-\eta_{\tau}(1-\gamma p))^{T_{0}} 1exp(ητ(1γp)2T3+16)\displaystyle\geq 1-\exp\left(-\eta_{\tau}(1-\gamma p)\frac{2T}{3}+\frac{1}{6}\right)
    1exp(8(1γ)T3(3+cη(1γ)T)+16)1e5/18,\displaystyle\geq 1-\exp\left(-\frac{8(1-\gamma)T}{3(3+c_{\eta}(1-\gamma)T)}+\frac{1}{6}\right)\geq 1-e^{-5/18},
    (1ητ(1γp))TtRτ\displaystyle(1-\eta_{\tau}(1-\gamma p))^{T-t_{R_{\tau}}} 1ητ(1γp)6ητ(1γp)56.\displaystyle\geq 1-\frac{\eta_{\tau}(1-\gamma p)}{6\eta_{\tau}(1-\gamma p)}\geq\frac{5}{6}.

    The above relations imply that (1-\eta_{\tau}(1-\gamma p))^{T-t_{R_{\tau}}}(1-(1-\eta_{\tau}(1-\gamma p))^{T_{0}})\geq c for some constant c, which only depends on c_{\eta}. On plugging this into (65), we obtain a relation that is identical to that in (56) up to leading constants. Thus, by using a sequence of arguments similar to that used to obtain (63), we arrive at the same conclusion as for the case of t_{R_{\tau}}\leq\max\{\frac{3T}{4},T-\frac{1}{6\eta_{\tau}(1-\gamma p)}\}.

Step 4: finishing up the proof.

Thus, (62), (63) along with the above conclusion together imply that there exists a numerical constant c0>0c_{0}>0 such that

𝔼[|V^Tm(1)V(1)|]𝔼[Δ¯T,max]c0log3Nmin{11γ,1(1γ)4N}.\displaystyle\mathbb{E}[|\widehat{V}_{T}^{m}(1)-V^{\star}(1)|]\geq\mathbb{E}[\overline{\Delta}_{T,\max}]\geq\frac{c_{0}}{\log^{3}N}\cdot\min\left\{\frac{1}{1-\gamma},\sqrt{\frac{1}{(1-\gamma)^{4}N}}\right\}. (66)

The above equation along with Lemma 2 implies

𝔼[|VTmV(1)|]c0log3Nmin{11γ,1(1γ)4N}11γi=1T(1ηi(1γ)).\displaystyle\mathbb{E}[|V_{T}^{m}-V^{\star}(1)|]\geq\frac{c_{0}}{\log^{3}N}\cdot\min\left\{\frac{1}{1-\gamma},\sqrt{\frac{1}{(1-\gamma)^{4}N}}\right\}-\frac{1}{1-\gamma}\prod_{i=1}^{T}(1-\eta_{i}(1-\gamma)). (67)

On the other hand, from (23) we know that

𝔼[|VTm(3)V(3)|]11γi=1T(1ηi(1γ)).\displaystyle\mathbb{E}[|V_{T}^{m}(3)-V^{\star}(3)|]\geq\frac{1}{1-\gamma}\prod_{i=1}^{T}(1-\eta_{i}(1-\gamma)). (68)

Hence,

𝔼[QTmQ]\displaystyle\mathbb{E}[\|Q_{T}^{m}-Q^{\star}\|_{\infty}] 𝔼[max{|VTm(3)V(3)|,|VTm(1)V(1)|}]\displaystyle\geq\mathbb{E}\left[\max\left\{|V_{T}^{m}(3)-V^{\star}(3)|,|V_{T}^{m}(1)-V^{\star}(1)|\right\}\right]
max{𝔼[|VTm(3)V(3)|],𝔼[|VTm(1)V(1)|]}\displaystyle\geq\max\left\{\mathbb{E}\left[|V_{T}^{m}(3)-V^{\star}(3)|\right],\mathbb{E}\left[|V_{T}^{m}(1)-V^{\star}(1)|\right]\right\}
max{11γi=1T(1ηi(1γ)),min{11γ,1(1γ)4N}11γi=1T(1ηi(1γ))}\displaystyle\geq\max\left\{\frac{1}{1-\gamma}\prod_{i=1}^{T}(1-\eta_{i}(1-\gamma)),\min\left\{\frac{1}{1-\gamma},\sqrt{\frac{1}{(1-\gamma)^{4}N}}\right\}-\frac{1}{1-\gamma}\prod_{i=1}^{T}(1-\eta_{i}(1-\gamma))\right\}
12min{11γ,1(1γ)4N},\displaystyle\geq\frac{1}{2}\min\left\{\frac{1}{1-\gamma},\sqrt{\frac{1}{(1-\gamma)^{4}N}}\right\}, (69)

where the third step follows from (67) and (68) and the fourth step uses max{a,b}(a+b)/2\max\{a,b\}\geq(a+b)/2.

Thus, from (30) and (69) we can conclude that whenever CCround=𝒪(1(1γ)log2N)\textsf{CC}_{\textsf{round}}=\mathcal{O}\left(\frac{1}{(1-\gamma)\log^{2}N}\right), ER(𝒜;N,M)=Ω(1log3NN)\textsf{ER}(\mathscr{A};N,M)=\Omega\left(\frac{1}{\log^{3}N\sqrt{N}}\right) for all values of M2M\geq 2. In other words, for any algorithm to achieve any collaborative gain, its communication complexity should satisfy CCround=Ω(1(1γ)log2N)\textsf{CC}_{\textsf{round}}=\Omega\left(\frac{1}{(1-\gamma)\log^{2}N}\right), as required.

Proof of (55).

We now return to establish (55) using induction. For the base case, (54) yields

xRτ+1\displaystyle x_{R_{\tau}+1} αRτ+1xRτβRτ+1C2+C3𝟙{Rτ+1Rτ+1}+(1αRτ+1)C1𝟙{Rτ+1Rτ+1}.\displaystyle\geq\alpha_{R_{\tau}+1}x_{R_{\tau}}-\beta_{R_{\tau}+1}C_{2}+C_{3}\mathbbm{1}\{R_{\tau}+1\in\mathcal{I}_{R_{\tau}+1}\}+(1-\alpha_{R_{\tau}+1})C_{1}\mathbbm{1}\{{R_{\tau}+1}\in\mathcal{I}_{R_{\tau}+1}\}. (70)

Note that this is identical to the expression in (55) for r=Rτ+1r=R_{\tau}+1 as

(iRτ+1αi)(1iRτ+1αi)=(1αRτ+1)𝟙{Rτ+1Rτ+1}\displaystyle\left(\prod_{i\notin\mathcal{I}_{R_{\tau}+1}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{R_{\tau}+1}}\alpha_{i}\right)=(1-\alpha_{R_{\tau}+1})\mathbbm{1}\{{R_{\tau}+1}\in\mathcal{I}_{R_{\tau}+1}\}

based on the adopted convention for products with no valid indices. For the induction step, assume (55) holds for some rRτ+1r\geq R_{\tau}+1. On combining (54) and (55), we obtain,

xr+1\displaystyle x_{r+1} αr+1xrβr+1C2+C3𝟙{(r+1)r+1}+(1αr+1)C1𝟙{r+1r+1}\displaystyle\geq\alpha_{r+1}x_{r}-\beta_{r+1}C_{2}+C_{3}\mathbbm{1}\{(r+1)\in\mathcal{I}_{r+1}\}+(1-\alpha_{r+1})C_{1}\mathbbm{1}\{r+1\in\mathcal{I}_{r+1}\}
αr+1(i=Rτ+1rαi)xRταr+1k=Rτ+1rβk(i=k+1rαi)C2+αr+1k=Rτ+1r(i=k+1rαi)C3𝟙{kk}\displaystyle\geq\alpha_{r+1}\left(\prod_{i=R_{\tau}+1}^{r}\alpha_{i}\right)x_{R_{\tau}}-\alpha_{r+1}\sum_{k=R_{\tau}+1}^{r}\beta_{k}\left(\prod_{i=k+1}^{r}\alpha_{i}\right)C_{2}+\alpha_{r+1}\sum_{k=R_{\tau}+1}^{r}\left(\prod_{i=k+1}^{r}\alpha_{i}\right)C_{3}\mathbbm{1}\{k\in\mathcal{I}_{k}\}
+αr+1C1(irαi)(1irαi)βr+1C2+C3𝟙{(r+1)r+1}+(1αr+1)C1𝟙{(r+1)r+1}\displaystyle+\alpha_{r+1}C_{1}\left(\prod_{i\notin\mathcal{I}_{r}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{r}}\alpha_{i}\right)-\beta_{r+1}C_{2}+C_{3}\mathbbm{1}\{(r+1)\in\mathcal{I}_{r+1}\}+(1-\alpha_{r+1})C_{1}\mathbbm{1}\{(r+1)\in\mathcal{I}_{r+1}\}
(i=Rτ+1r+1αi)xRτk=Rτ+1r+1βk(i=k+1r+1αi)C2+k=Rτ+1r+1(i=k+1r+1αi)C3𝟙{kk}\displaystyle\geq\left(\prod_{i=R_{\tau}+1}^{r+1}\alpha_{i}\right)x_{R_{\tau}}-\sum_{k=R_{\tau}+1}^{r+1}\beta_{k}\left(\prod_{i=k+1}^{r+1}\alpha_{i}\right)C_{2}+\sum_{k=R_{\tau}+1}^{r+1}\left(\prod_{i=k+1}^{r+1}\alpha_{i}\right)C_{3}\mathbbm{1}\{k\in\mathcal{I}_{k}\}
+αr+1C1(irαi)(1irαi)+(1αr+1)C1𝟙{(r+1)r+1}.\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}+\alpha_{r+1}C_{1}\left(\prod_{i\notin\mathcal{I}_{r}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{r}}\alpha_{i}\right)+(1-\alpha_{r+1})C_{1}\mathbbm{1}\{(r+1)\in\mathcal{I}_{r+1}\}. (71)

If (r+1)r+1(r+1)\notin\mathcal{I}_{r+1}, then (1irαi)=(1ir+1αi)\left(1-\prod_{i\in\mathcal{I}_{r}}\alpha_{i}\right)=\left(1-\prod_{i\in\mathcal{I}_{r+1}}\alpha_{i}\right) and αr+1(irαi)=(ir+1αi)\alpha_{r+1}\left(\prod_{i\notin\mathcal{I}_{r}}\alpha_{i}\right)=\left(\prod_{i\notin\mathcal{I}_{r+1}}\alpha_{i}\right). Consequently,

αr+1C1(irαi)(1irαi)+(1αr+1)C1𝟙{(r+1)r+1}=C1(ir+1αi)(1ir+1αi).\displaystyle\alpha_{r+1}C_{1}\left(\prod_{i\notin\mathcal{I}_{r}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{r}}\alpha_{i}\right)+(1-\alpha_{r+1})C_{1}\mathbbm{1}\{(r+1)\in\mathcal{I}_{r+1}\}=C_{1}\left(\prod_{i\notin\mathcal{I}_{r+1}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{r+1}}\alpha_{i}\right). (72)

On the other hand, if (r+1)r+1(r+1)\in\mathcal{I}_{r+1}, then (irαi)=(ir+1αi)\left(\prod_{i\notin\mathcal{I}_{r}}\alpha_{i}\right)=\left(\prod_{i\notin\mathcal{I}_{r+1}}\alpha_{i}\right). Consequently, we have,

αr+1C1(irαi)(1irαi)+(1αr+1)C1𝟙{(r+1)r+1}\displaystyle\alpha_{r+1}C_{1}\left(\prod_{i\notin\mathcal{I}_{r}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{r}}\alpha_{i}\right)+(1-\alpha_{r+1})C_{1}\mathbbm{1}\{(r+1)\in\mathcal{I}_{r+1}\}
=αr+1C1(ir+1αi)(1irαi)+(1αr+1)C1\displaystyle=\alpha_{r+1}C_{1}\left(\prod_{i\notin\mathcal{I}_{r+1}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{r}}\alpha_{i}\right)+(1-\alpha_{r+1})C_{1}
C1(ir+1αi)[αr+1(1irαi)+(1αr+1)]\displaystyle\geq C_{1}\left(\prod_{i\notin\mathcal{I}_{r+1}}\alpha_{i}\right)\left[\alpha_{r+1}\left(1-\prod_{i\in\mathcal{I}_{r}}\alpha_{i}\right)+(1-\alpha_{r+1})\right]
C1(ir+1αi)(1ir+1αi).\displaystyle\geq C_{1}\left(\prod_{i\notin\mathcal{I}_{r+1}}\alpha_{i}\right)\left(1-\prod_{i\in\mathcal{I}_{r+1}}\alpha_{i}\right). (73)

Combining (71), (72) and (73) proves the claim.

Proof of (61).

To establish this result, we separately consider the cases x1x\leq 1 and x1x\geq 1.

  • When x1x\leq 1, we have

    \displaystyle f(x)=\frac{x}{x+1}-\frac{x}{5\sqrt{M}}\geq x\cdot\left(\frac{1}{2}-\frac{1}{5\sqrt{M}}\right)\geq\frac{7x}{20}, (74)

    where the first step uses x/(x+1)\geq x/2 for x\leq 1, and the last step uses M\geq 2.

  • Let us now consider the case x\geq 1. The second derivative of f is given by f^{\prime\prime}(x)=-\frac{2}{(x+1)^{3}}. Clearly, f^{\prime\prime}<0 for all x\geq 1, implying that f is a concave function. It is well known that a continuous, concave function attains its minimum over a compact interval at the endpoints of the interval (Bauer's minimum principle). For all M\geq 2, we have,

    f(1)=1215M720;f(M)=MM+115720.\displaystyle f(1)=\frac{1}{2}-\frac{1}{5\sqrt{M}}\geq\frac{7}{20};\quad f(\sqrt{M})=\frac{\sqrt{M}}{\sqrt{M}+1}-\frac{1}{5}\geq\frac{7}{20}.

    Consequently, we can conclude that for all x[1,M]x\in[1,\sqrt{M}],

    f(x)720.\displaystyle f(x)\geq\frac{7}{20}. (75)

    Combining (74) and (75) proves the claim.
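The claim (61) can also be confirmed by a direct grid check:

import numpy as np

ok = True
for M in range(2, 201):
    x = np.linspace(0.0, np.sqrt(M), 10_000)
    f = x / (x + 1.0) - x / (5.0 * np.sqrt(M))
    ok &= bool(np.all(f >= (7.0 / 20.0) * np.minimum(x, 1.0) - 1e-12))
print(ok)   # prints True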

A.3.3 Large learning rates with large ηTBM\frac{\eta_{T}}{BM}

In order to bound the error in this scenario, note that \frac{\eta_{T}}{BM} controls the variance of the stochastic updates in the fixed-point iteration. Thus, when \frac{\eta_{T}}{BM} is large, the variance of the iterates is large, resulting in a large error. To demonstrate this effect, we focus on the dynamics of state 2. This part of the proof is similar to the large learning rate case of Li2023QLMinimax. For all t\in[T], define:

V¯t(2):=1Mm=1MVtm(2).\displaystyle\overline{V}_{t}(2):=\frac{1}{M}\sum_{m=1}^{M}V_{t}^{m}(2). (76)

Thus, from (35), we know that 𝔼[V¯t(2)]\mathbb{E}[\overline{V}_{t}(2)] obeys the following recursion:

𝔼[V¯t(2)]=(1ηt(1γp))𝔼[V¯t1(2)]+ηt.\displaystyle\mathbb{E}[\overline{V}_{t}(2)]=(1-\eta_{t}(1-\gamma p))\mathbb{E}[\overline{V}_{t-1}(2)]+\eta_{t}.

Upon unrolling the recursion, we obtain,

𝔼[V¯T(2)]=(k=t+1T(1ηk(1γp)))𝔼[V¯t(2)]+k=t+1Tηk(T).\displaystyle\mathbb{E}[\overline{V}_{T}(2)]=\left(\prod_{k=t+1}^{T}(1-\eta_{k}(1-\gamma p))\right)\mathbb{E}[\overline{V}_{t}(2)]+\sum_{k=t+1}^{T}\eta_{k}^{(T)}.

Thus, the above relation along with (18) and the value of V^{\star}(2) yields,

V(2)𝔼[V¯T(2)]=k=t+1T(1ηk(1γp))(11γp𝔼[V¯t(2)]).\displaystyle V^{\star}(2)-\mathbb{E}[\overline{V}_{T}(2)]=\prod_{k=t+1}^{T}(1-\eta_{k}(1-\gamma p))\left(\frac{1}{1-\gamma p}-\mathbb{E}[\overline{V}_{t}(2)]\right). (77)

Similar to Li2023QLMinimax, we define

τ:=min{0tT2|𝔼[(V¯t)2]14(1γ)2 for all t+1tT}.\displaystyle\tau^{\prime}:=\min\left\{0\leq t^{\prime}\leq T-2\ \bigg{|}\ \mathbb{E}[(\overline{V}_{t})^{2}]\geq\frac{1}{4(1-\gamma)^{2}}\text{ for all }t^{\prime}+1\leq t\leq T\right\}.

If such a τ\tau^{\prime} does not exist, it implies that either 𝔼[(V¯T)2]<14(1γ)2\mathbb{E}[(\overline{V}_{T})^{2}]<\frac{1}{4(1-\gamma)^{2}} or 𝔼[(V¯T1)2]<14(1γ)2\mathbb{E}[(\overline{V}_{T-1})^{2}]<\frac{1}{4(1-\gamma)^{2}}. If the former is true, then,

\displaystyle V^{\star}(2)-\mathbb{E}[\overline{V}_{T}(2)]\geq\frac{3}{4(1-\gamma)}-\sqrt{\mathbb{E}[(\overline{V}_{T})^{2}]}>\frac{1}{4(1-\gamma)}, (78)

where the first inequality uses \mathbb{E}[\overline{V}_{T}(2)]\leq\sqrt{\mathbb{E}[(\overline{V}_{T})^{2}]}, i.e., Jensen's inequality.

Similarly, if 𝔼[(V¯T1)2]<14(1γ)2\mathbb{E}[(\overline{V}_{T-1})^{2}]<\frac{1}{4(1-\gamma)^{2}}, it implies 𝔼[V¯T1]<12(1γ)\mathbb{E}[\overline{V}_{T-1}]<\frac{1}{2(1-\gamma)}. Using (35), we have,

𝔼[V¯T(2)]=(1ηT(1γp))𝔼[V¯T1(2)]+ηT𝔼[V¯T1(2)]+1<12(1γ)+16(1γ)=23(1γ).\displaystyle\mathbb{E}[\overline{V}_{T}(2)]=(1-\eta_{T}(1-\gamma p))\mathbb{E}[\overline{V}_{T-1}(2)]+\eta_{T}\leq\mathbb{E}[\overline{V}_{T-1}(2)]+1<\frac{1}{2(1-\gamma)}+\frac{1}{6(1-\gamma)}=\frac{2}{3(1-\gamma)}.

where the last step additionally uses 1\leq\frac{1}{6(1-\gamma)}, which holds since \gamma\geq 5/6. Consequently,

V(2)𝔼[V¯T(2)]>34(1γ)23(1γ)>112(1γ).\displaystyle V^{\star}(2)-\mathbb{E}[\overline{V}_{T}(2)]>\frac{3}{4(1-\gamma)}-\frac{2}{3(1-\gamma)}>\frac{1}{12(1-\gamma)}. (79)

For the case when τ\tau^{\prime} exists, we divide the proof into two cases.

  • We first consider the case when the learning rates satisfy:

    k=τ+1T(1ηk(1γp))12.\displaystyle\prod_{k=\tau^{\prime}+1}^{T}(1-\eta_{k}(1-\gamma p))\geq\frac{1}{2}. (80)

    The analysis for this case is identical to that considered in Li2023QLMinimax. We explicitly write the steps for completeness. Specifically,

    V(2)𝔼[V¯T(2)]\displaystyle V^{\star}(2)-\mathbb{E}[\overline{V}_{T}(2)] =(k=τ+1T(1ηk(1γp)))(11γp𝔼[V¯τ(2)])\displaystyle=\left(\prod_{k=\tau^{\prime}+1}^{T}(1-\eta_{k}(1-\gamma p))\right)\left(\frac{1}{1-\gamma p}-\mathbb{E}[\overline{V}_{\tau^{\prime}}(2)]\right)
    12(34(1γ)𝔼[(V¯τ(2))2])\displaystyle\geq\frac{1}{2}\cdot\left(\frac{3}{4(1-\gamma)}-\sqrt{\mathbb{E}[(\overline{V}_{\tau^{\prime}}(2))^{2}]}\right)
    12(34(1γ)12(1γ))18(1γ),\displaystyle\geq\frac{1}{2}\cdot\left(\frac{3}{4(1-\gamma)}-\frac{1}{2(1-\gamma)}\right)\geq\frac{1}{8(1-\gamma)}, (81)

    where the first line follows from (77), the second line from the condition on step sizes and the third line from the definition of τ\tau^{\prime}.

  • We now consider the other case where,

    0k=τ+1T(1ηk(1γp))<12.\displaystyle 0\leq\prod_{k=\tau^{\prime}+1}^{T}(1-\eta_{k}(1-\gamma p))<\frac{1}{2}. (82)

    Using (Li2023QLMinimax, Eqn.(134)), for any t<tt^{\prime}<t and all agents mm, we have the relation

    Vtm(2)=11γpk=t+1t(1ηk(1γp))(11γpVtm(2))+k=t+1ηk(t)γ(P^km(2|2)p)Vk1m(2).\displaystyle V_{t}^{m}(2)=\frac{1}{1-\gamma p}-\prod_{k=t^{\prime}+1}^{t}(1-\eta_{k}(1-\gamma p))\left(\frac{1}{1-\gamma p}-V_{t^{\prime}}^{m}(2)\right)+\sum_{k=t^{\prime}+1}\eta_{k}^{(t)}\gamma(\hat{P}_{k}^{m}(2|2)-p)V_{k-1}^{m}(2).

    The above equation is directly obtained by unrolling the recursion in (26) along with noting that Qt(2,1)=Vt(2)Q_{t}(2,1)=V_{t}(2) for all tt. Consequently, we have,

    V¯T(2)=11γpk=t+1T(1ηk(1γp))(11γpV¯t(2))+1Mm=1Mk=t+1Tηk(T)γ(P^km(2|2)p)Vk1m(2).\displaystyle\overline{V}_{T}(2)=\frac{1}{1-\gamma p}-\prod_{k=t^{\prime}+1}^{T}(1-\eta_{k}(1-\gamma p))\left(\frac{1}{1-\gamma p}-\overline{V}_{t^{\prime}}(2)\right)+\frac{1}{M}\sum_{m=1}^{M}\sum_{k=t^{\prime}+1}^{T}\eta_{k}^{(T)}\gamma(\hat{P}_{k}^{m}(2|2)-p)V_{k-1}^{m}(2). (83)

    Let \{\mathscr{F}_{t}\}_{t=0}^{T} be a filtration such that \mathscr{F}_{t} is the \sigma-algebra generated by \{\{\hat{P}^{m}_{s}(2|2)\}_{m=1}^{M}\}_{s=1}^{t}. It is straightforward to note that \left\{\frac{1}{M}\sum_{m=1}^{M}\eta_{k}^{(T)}\gamma(\hat{P}_{k}^{m}(2|2)-p)V_{k-1}^{m}(2)\right\}_{k} is a martingale difference sequence adapted to the filtration \{\mathscr{F}_{k}\}. Thus, using the result from (Li2023QLMinimax, Eqn. (139)), we can conclude that

    Var(V¯T(2))𝔼[k=τ+2TVar(1Mm=1Mηk(T)γ(P^km(2|2)p)Vk1m(2)|k1)].\displaystyle\text{Var}(\overline{V}_{T}(2))\geq\mathbb{E}\left[\sum_{k=\tau^{\prime}+2}^{T}\text{Var}\left(\frac{1}{M}\sum_{m=1}^{M}\eta_{k}^{(T)}\gamma(\hat{P}_{k}^{m}(2|2)-p)V_{k-1}^{m}(2)\ \bigg{|}\mathscr{F}_{k-1}\right)\right]. (84)

    We have,

    Var(1Mm=1Mηk(T)γ(P^km(2|2)p)Vk1m(2)|k1)\displaystyle\text{Var}\left(\frac{1}{M}\sum_{m=1}^{M}\eta_{k}^{(T)}\gamma(\hat{P}_{k}^{m}(2|2)-p)V_{k-1}^{m}(2)\ \bigg{|}\mathscr{F}_{k-1}\right) =1M2m=1MVar(ηk(T)γ(P^km(2|2)p)Vk1m(2)|k1)\displaystyle=\frac{1}{M^{2}}\sum_{m=1}^{M}\text{Var}\left(\eta_{k}^{(T)}\gamma(\hat{P}_{k}^{m}(2|2)-p)V_{k-1}^{m}(2)\ \bigg{|}\mathscr{F}_{k-1}\right)
    =(ηk(T))2BMγ2p(1p)(1Mm=1M(Vk1m(2))2)\displaystyle=\frac{(\eta_{k}^{(T)})^{2}}{BM}\gamma^{2}p(1-p)\left(\frac{1}{M}\sum_{m=1}^{M}(V_{k-1}^{m}(2))^{2}\right)
    (1γ)(4γ1)9BM(ηk(T))2(V¯k1(2))2,\displaystyle\geq\frac{(1-\gamma)(4\gamma-1)}{9BM}\cdot(\eta_{k}^{(T)})^{2}\cdot(\overline{V}_{k-1}(2))^{2}, (85)

    where the first line follows from the fact that the variance of a sum of independent random variables is the sum of their variances, the second line from the variance of the (rescaled) Binomial random variable B\hat{P}_{k}^{m}(2|2), and the third line from Jensen's inequality. Thus, (84) and (85) together yield,

    Var(V¯T(2))(1γ)(4γ1)9BMk=τ+2T(ηk(T))2𝔼[(V¯k1(2))2]\displaystyle\text{Var}(\overline{V}_{T}(2))\geq\frac{(1-\gamma)(4\gamma-1)}{9BM}\cdot\sum_{k=\tau^{\prime}+2}^{T}(\eta_{k}^{(T)})^{2}\cdot\mathbb{E}[(\overline{V}_{k-1}(2))^{2}]
    (1γ)(4γ1)9BM14(1γ)2k=max{τ,τ}+2T(ηk(T))2,\displaystyle\geq\frac{(1-\gamma)(4\gamma-1)}{9BM}\cdot\frac{1}{4(1-\gamma)^{2}}\cdot\sum_{k=\max\{\tau,\tau^{\prime}\}+2}^{T}(\eta_{k}^{(T)})^{2}, (86)

    where the second line follows from the definition of \tau^{\prime}. We now focus on lower bounding the sum of squared step sizes appearing in the above relation. We have,

    \displaystyle\sum_{k=\max\{\tau^{\prime},\tau\}+2}^{T}\left(\eta_{k}^{(T)}\right)^{2} \displaystyle\geq\sum_{k=\max\{\tau^{\prime},\tau\}+2}^{T}\left(\eta_{k}\prod_{i=k+1}^{T}(1-\eta_{i}(1-\gamma p))\right)^{2}
    \displaystyle\geq\sum_{k=\max\{\tau^{\prime},\tau\}+2}^{T}\left(\eta_{T}\prod_{i=k+1}^{T}(1-\eta_{\tau}(1-\gamma p))\right)^{2}
    \displaystyle=\eta_{T}^{2}\sum_{k=\max\{\tau^{\prime},\tau\}+2}^{T}(1-\eta_{\tau}(1-\gamma p))^{2(T-k)}
    \displaystyle\geq\eta_{T}^{2}\cdot\frac{1-(1-\eta_{\tau}(1-\gamma p))^{2(T-\max\{\tau^{\prime},\tau\}-1)}}{\eta_{\tau}(1-\gamma p)(2-\eta_{\tau}(1-\gamma p))}
    \displaystyle\geq\eta_{T}\cdot\frac{1}{4(1-\gamma)}\cdot c^{\prime}, (87)

    where the second line follows from monotonicity of ηt\eta_{t} and the numerical constant cc^{\prime} in the fifth step is given by the following claim whose proof is deferred to the end of the section:

    1(1ητ(1γp))2(Tmax{τ,τ}1){1e8/9 for constant step sizes,1exp(83max{1,cη}) for linearly rescaled step sizes.\displaystyle 1-(1-\eta_{\tau}(1-\gamma p))^{2(T-\max\{\tau^{\prime},\tau\}-1)}\geq\begin{cases}1-e^{-8/9}&\text{ for constant step sizes},\\ 1-\exp\left(-\frac{8}{3\max\{1,c_{\eta}\}}\right)&\text{ for linearly rescaled step sizes}\end{cases}. (88)

    Thus, (86) and (87) together imply

    Var(V¯T(2))\displaystyle\text{Var}(\overline{V}_{T}(2)) (4γ1)36BM(1γ)k=τ+2T(ηk(T))2\displaystyle\geq\frac{(4\gamma-1)}{36BM(1-\gamma)}\cdot\sum_{k=\tau^{\prime}+2}^{T}(\eta_{k}^{(T)})^{2}
    c(4γ1)144(1γ)ηTBM(1γ)c(4γ1)144(1γ)1100,\displaystyle\geq\frac{c^{\prime}(4\gamma-1)}{144(1-\gamma)}\cdot\frac{\eta_{T}}{BM(1-\gamma)}\geq\frac{c^{\prime}(4\gamma-1)}{144(1-\gamma)}\cdot\frac{1}{100}, (89)

    where the last inequality follows from the bound on ηTBM\frac{\eta_{T}}{BM}.

Thus, for all N1N\geq 1, we have,

𝔼[(V(2)V¯T(2))2]=𝔼[(V(2)𝔼[V¯T(2)])2]+Var(V¯T(2))c′′(1γ)N,\displaystyle\mathbb{E}[(V^{\star}(2)-\overline{V}_{T}(2))^{2}]=\mathbb{E}[(V^{\star}(2)-\mathbb{E}[\overline{V}_{T}(2)])^{2}]+\text{Var}(\overline{V}_{T}(2))\geq\frac{c^{\prime\prime}}{(1-\gamma)N},

for some numerical constant c^{\prime\prime}. As in the small learning rate case, the error is thus lower bounded by a quantity that does not improve with the number of agents or the number of communication rounds. Thus, even with \textsf{CC}_{\textsf{round}}=\Omega(T), we will not observe any collaborative gain in this scenario.

Proof of (88).

To establish the claim, we consider two cases:

  • ττ\tau^{\prime}\geq\tau: Under this case, we have,

    (1ητ(1γp))2(Tmax{τ,τ}1)\displaystyle(1-\eta_{\tau}(1-\gamma p))^{2(T-\max\{\tau^{\prime},\tau\}-1)} =(1ητ(1γp))2(Tτ1)\displaystyle=(1-\eta_{\tau}(1-\gamma p))^{2(T-\tau^{\prime}-1)}
    (1ητ(1γp))Tτk=τ+1T(1ηk(1γp))12,\displaystyle\leq(1-\eta_{\tau}(1-\gamma p))^{T-\tau^{\prime}}\leq\prod_{k=\tau^{\prime}+1}^{T}(1-\eta_{k}(1-\gamma p))\leq\frac{1}{2}, (90)

    where the last inequality follows from (82).

  • ττ\tau\geq\tau^{\prime}: For this case, we have

    (1ητ(1γp))2(Tmax{τ,τ}1)\displaystyle(1-\eta_{\tau}(1-\gamma p))^{2(T-\max\{\tau^{\prime},\tau\}-1)} =(1ητ(1γp))2(Tτ1)\displaystyle=(1-\eta_{\tau}(1-\gamma p))^{2(T-\tau-1)}
    (1ητ(1γp))Tτexp(2Tητ(1γp)3).\displaystyle\leq(1-\eta_{\tau}(1-\gamma p))^{T-\tau}\leq\exp\left(-\frac{2T\eta_{\tau}(1-\gamma p)}{3}\right). (91)

    For the constant stepsize schedule, we have,

    exp(2Tητ(1γp)3)exp(2T31(1γ)T4(1γ)3)=exp(89)\displaystyle\exp\left(-\frac{2T\eta_{\tau}(1-\gamma p)}{3}\right)\leq\exp\left(-\frac{2T}{3}\cdot\frac{1}{(1-\gamma)T}\cdot\frac{4(1-\gamma)}{3}\right)=\exp\left(-\frac{8}{9}\right) (92)

    For linearly rescaled stepsize schedule, we have,

    exp(2Tητ(1γp)3)exp(2T311+cη(1γ)T/34(1γ)3)=exp(83max{1,cη})\displaystyle\exp\left(-\frac{2T\eta_{\tau}(1-\gamma p)}{3}\right)\leq\exp\left(-\frac{2T}{3}\cdot\frac{1}{1+c_{\eta}(1-\gamma)T/3}\cdot\frac{4(1-\gamma)}{3}\right)=\exp\left(-\frac{8}{3\max\{1,c_{\eta}\}}\right) (93)

    On combining (90), (91), (92) and (93), we arrive at the claim.

A.4 Generalizing to larger state-action spaces

We now elaborate on how the result can be extended to general state-action spaces, along with obtaining the lower bound on the bit-level communication complexity. For the general case, we instead consider the following MDP. For the first four states \{0,1,2,3\}, the probability transition kernel and reward function are given as follows.

𝒜0={1}\displaystyle\mathcal{A}_{0}=\{1\} P(0|0,1)=1\displaystyle\quad\quad P(0|0,1)=1 r(0,1)=0,\displaystyle\quad r(0,1)=0, (94a)
𝒜1={1,2,,|𝒜|}\displaystyle\mathcal{A}_{1}=\{1,2,\dots,|\mathcal{A}|\} P(1|1,a)=p\displaystyle\quad\quad P(1|1,a)=p P(0|1,a)=1p\displaystyle\quad P(0|1,a)=1-p r(1,a)=1,a𝒜\displaystyle\quad r(1,a)=1,\forall\ a\in\mathcal{A} (94b)
𝒜2={1}\displaystyle\mathcal{A}_{2}=\{1\} P(2|2,1)=p\displaystyle\quad\quad P(2|2,1)=p P(0|2,1)=1p\displaystyle\quad P(0|2,1)=1-p r(2,1)=1,\displaystyle\quad r(2,1)=1, (94c)
𝒜3={1}\displaystyle\mathcal{A}_{3}=\{1\} P(3|3,1)=1\displaystyle\quad\quad P(3|3,1)=1 r(3,1)=1,\displaystyle\quad r(3,1)=1, (94d)

where the parameter p=\dfrac{4\gamma-1}{3\gamma}. The overall MDP is obtained by creating |\mathcal{S}|/4 copies of the above MDP, one for each set of states of the form \{4r,4r+1,4r+2,4r+3\} with 0\leq r\leq|\mathcal{S}|/4-1. It is straightforward to note that the lower bound on the number of communication rounds immediately transfers to the general case as well. Moreover, note that the bound on \textsf{CC}_{\textsf{round}} implies the bound \textsf{CC}_{\textsf{bit}}=\Omega\left(\frac{1}{(1-\gamma)\log^{2}N}\right), since every communication round entails sending \Omega(1) bits.
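To make the construction concrete, the following sketch builds the four-state block in (94a)-(94d) and checks via value iteration that Q^{\star}(1,a)=1/(1-\gamma p) for both actions in state 1; the choice \gamma=0.9, the number of value-iteration steps, and the padding of single-action states with a duplicate action are arbitrary implementation choices, not part of the construction. The overall MDP is then just |\mathcal{S}|/4 disjoint copies of this block.

import numpy as np

gamma = 0.9
p = (4 * gamma - 1) / (3 * gamma)

# Four-state block of (94a)-(94d); P[s, a, s'] is the transition kernel, r[s, a] the reward.
# Single-action states are padded with a duplicate action, which does not change Q*.
P = np.zeros((4, 2, 4))
r = np.zeros((4, 2))
P[0, :, 0] = 1.0;                      r[0, :] = 0.0   # state 0: absorbing, zero reward
P[1, :, 1] = p;   P[1, :, 0] = 1 - p;  r[1, :] = 1.0   # state 1: all actions behave identically
P[2, :, 2] = p;   P[2, :, 0] = 1 - p;  r[2, :] = 1.0   # state 2
P[3, :, 3] = 1.0;                      r[3, :] = 1.0   # state 3: absorbing, unit reward
assert np.allclose(P.sum(axis=2), 1.0)                 # valid transition kernel

Q = np.zeros((4, 2))
for _ in range(5000):                  # plain value iteration
    Q = r + gamma * P @ Q.max(axis=1)

print(Q[1], 1.0 / (1.0 - gamma * p))   # both actions in state 1 attain 1/(1 - gamma*p)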

To obtain the general lower bound on the bit-level communication complexity, note that we can carry out the analysis from the previous section for each of the |\mathcal{A}|/2 pairs of actions in state 1 corresponding to the set of states \{0,1,2,3\}. Moreover, the algorithm \mathscr{A} needs to ensure that the error is low across all the |\mathcal{A}|/2 pairs. Since the errors are independent across these pairs, each of them requires \Omega\left(\frac{1}{(1-\gamma)\log^{2}N}\right) bits of information to be transmitted during the learning horizon, leading to a lower bound of \Omega\left(\frac{|\mathcal{A}|}{(1-\gamma)\log^{2}N}\right). Indeed, since we require a low \ell_{\infty} error, \mathscr{A} needs to ensure that the error is low across all the pairs, resulting in a communication cost that grows linearly with |\mathcal{A}|. Upon repeating the argument across all |\mathcal{S}|/4 copies of the MDP, we arrive at the lower bound \textsf{CC}_{\textsf{bit}}=\Omega\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)\log^{2}N}\right).

A.5 Proofs of auxiliary lemmas

A.5.1 Proof of Lemma 2

Note that a similar relationship is also derived in Li2023QLMinimax; however, since we need to account for the averaging over multiple agents, we present the entire argument for completeness. We prove the claim using induction over t. It is straightforward to note that the claim is true for t=0 and all agents m\in\{1,2,\dots,M\}. For the inductive step, we assume that the claim holds at time t-1 for all agents. Using the induction hypothesis, we have the following relation between V_{t-1}^{m}(1) and \widehat{V}_{t-1}^{m}:

Vt1m(1)=maxa{1,2}Qt1m(1,a)maxa{1,2}Q^t1m(a)11γi=1t1(1ηi(1γ))=V^t1m11γi=1t1(1ηi(1γ)).\displaystyle V_{t-1}^{m}(1)=\max_{a\in\{1,2\}}Q_{t-1}^{m}(1,a)\geq\max_{a\in\{1,2\}}\widehat{Q}_{t-1}^{m}(a)-\frac{1}{1-\gamma}\prod_{i=1}^{t-1}(1-\eta_{i}(1-\gamma))=\widehat{V}_{t-1}^{m}-\frac{1}{1-\gamma}\prod_{i=1}^{t-1}(1-\eta_{i}(1-\gamma)). (95)

For t{tr}r=1Rt\notin\{t_{r}\}_{r=1}^{R} and a{1,2}a\in\{1,2\}, we have,

Qtm(1,a)Q^tm(a)\displaystyle Q_{t}^{m}(1,a)-\widehat{Q}_{t}^{m}(a) =Qt1/2m(1,a)Q^t1/2m(a)\displaystyle=Q_{t-1/2}^{m}(1,a)-\widehat{Q}_{t-1/2}^{m}(a)
=(1ηt)Qt1m(1,a)+ηt(1+γP^tm(1|1,a)Vt1m(1))\displaystyle=(1-\eta_{t})Q_{t-1}^{m}(1,a)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,a)V_{t-1}^{m}(1))
[(1ηt)Q^t1m(a)+ηt(1+γP^tm(1|1,a)V^t1m)]\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\left[(1-\eta_{t})\widehat{Q}_{t-1}^{m}(a)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,a)\widehat{V}_{t-1}^{m})\right]
\displaystyle=(1-\eta_{t})(Q_{t-1}^{m}(1,a)-\widehat{Q}_{t-1}^{m}(a))+\eta_{t}\gamma\widehat{P}_{t}^{m}(1|1,a)(V_{t-1}^{m}(1)-\widehat{V}_{t-1}^{m})
(1ηt)1γi=1t1(1ηi(1γ))P^tm(1|1,a)ηtγ1γi=1t1(1ηi(1γ))\displaystyle\geq-\frac{(1-\eta_{t})}{1-\gamma}\prod_{i=1}^{t-1}(1-\eta_{i}(1-\gamma))-\widehat{P}_{t}^{m}(1|1,a)\cdot\frac{\eta_{t}\gamma}{1-\gamma}\prod_{i=1}^{t-1}(1-\eta_{i}(1-\gamma))
(1ηt)1γi=1t1(1ηi(1γ))ηtγ1γi=1t1(1ηi(1γ))\displaystyle\geq-\frac{(1-\eta_{t})}{1-\gamma}\prod_{i=1}^{t-1}(1-\eta_{i}(1-\gamma))-\frac{\eta_{t}\gamma}{1-\gamma}\prod_{i=1}^{t-1}(1-\eta_{i}(1-\gamma))
11γi=1t(1ηi(1γ)).\displaystyle\geq-\frac{1}{1-\gamma}\prod_{i=1}^{t}(1-\eta_{i}(1-\gamma)). (96)

For t{tr}r=1Rt\in\{t_{r}\}_{r=1}^{R} and a{1,2}a\in\{1,2\}, we have,

Qtm(1,a)Q^tm(a)\displaystyle Q_{t}^{m}(1,a)-\widehat{Q}_{t}^{m}(a) =1Mm=1MQt1/2m(1,a)1Mm=1MQ^t1/2m(a)\displaystyle=\frac{1}{M}\sum_{m=1}^{M}Q_{t-1/2}^{m}(1,a)-\frac{1}{M}\sum_{m=1}^{M}\widehat{Q}_{t-1/2}^{m}(a)
=1Mm=1M[(1ηt)Qt1m(1,a)+ηt(1+γP^tm(1|1,a)Vt1m(1))]\displaystyle=\frac{1}{M}\sum_{m=1}^{M}\left[(1-\eta_{t})Q_{t-1}^{m}(1,a)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,a)V_{t-1}^{m}(1))\right]
1Mm=1M[(1ηt)Q^t1m(a)+ηt(1+γP^tm(1|1,a)V^t1m)]\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\frac{1}{M}\sum_{m=1}^{M}\left[(1-\eta_{t})\widehat{Q}_{t-1}^{m}(a)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,a)\widehat{V}_{t-1}^{m})\right]
=1Mm=1M[(1ηt)(Qt1m(1,a)Q^t1m(a))+ηtγP^tm(1|1,a)(Vt1m(1)V^t1m)]\displaystyle=\frac{1}{M}\sum_{m=1}^{M}\left[(1-\eta_{t})(Q_{t-1}^{m}(1,a)-\widehat{Q}_{t-1}^{m}(a))+\eta_{t}\gamma\widehat{P}_{t}^{m}(1|1,a)(V_{t-1}^{m}(1)-\widehat{V}_{t-1}^{m})\right]
11γi=1t(1ηi(1γ)),\displaystyle\geq-\frac{1}{1-\gamma}\prod_{i=1}^{t}(1-\eta_{i}(1-\gamma)), (97)

where the last step follows using the same set of arguments as used in (96). The inductive step follows from (96) and (97).
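The last two steps of (96), and the corresponding step in (97), rest on the elementary identity (1-\eta_{t})+\eta_{t}\gamma=1-\eta_{t}(1-\gamma), which is what makes the error envelope telescope across iterations. The snippet below is a minimal numerical illustration of this telescoping step; the step-size sequence in it is arbitrary and is not the schedule used in the proof.

import numpy as np

rng = np.random.default_rng(1)
gamma = 0.95
etas = rng.uniform(0.01, 0.5, size=50)   # arbitrary step sizes in (0, 1)

e = 1.0 / (1.0 - gamma)                  # envelope (1/(1-gamma)) * prod_i (1 - eta_i(1-gamma))
for eta in etas:
    stepped = (1 - eta) * e + eta * gamma * e   # one step of the recursion in (96)
    e_next = e * (1 - eta * (1 - gamma))        # telescoped form of the same quantity
    assert np.isclose(stepped, e_next)
    e = e_next
print("final envelope:", e)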

A.5.2 Proof of Lemma 3

In order to bound the term 𝔼[Δt,maxm]𝔼[ξt,t,maxm]\mathbb{E}[\Delta_{t,\max}^{m}]-\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}], we make use of the relation in (46a), which we recall

𝔼[Δt,maxm]\displaystyle\mathbb{E}[\Delta_{t,\max}^{m}] φt,t𝔼[Δ¯t,max]+[k=t+1tη~k(t)γp𝔼[Δk1,maxm]]+𝔼[ξt,t,maxm]φt,t𝔼[|Δ¯t(1)Δ¯t(2)|].\displaystyle\geq\varphi_{t^{\prime},t}\mathbb{E}[\overline{\Delta}_{t^{\prime},\max}]+\left[\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma p\mathbb{E}[\Delta_{k-1,\max}^{m}]\right]+\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]-\varphi_{t^{\prime},t}\mathbb{E}[|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|].
  • To aid the analysis, we consider the following recursive relation for any fixed agent mm:

    yt=(1ηt)yt1+ηt(γpyt1+𝔼[ξt,t,maxm]).\displaystyle y_{t}=(1-\eta_{t})y_{t-1}+\eta_{t}(\gamma py_{t-1}+\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]). (98)

    Upon unrolling the recursion, we obtain,

    yt\displaystyle y_{t} =(k=t+1t(1ηk))yt+k=t+1t(ηki=k+1t(1ηi))γpyk1+k=t+1t(ηki=k+1t(1ηi))𝔼[ξt,t,maxm]\displaystyle=\left(\prod_{k=t^{\prime}+1}^{t}(1-\eta_{k})\right)y_{t^{\prime}}+\sum_{k=t^{\prime}+1}^{t}\left(\eta_{k}\prod_{i=k+1}^{t}(1-\eta_{i})\right)\gamma py_{k-1}+\sum_{k=t^{\prime}+1}^{t}\left(\eta_{k}\prod_{i=k+1}^{t}(1-\eta_{i})\right)\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]
    =φt,tyt+k=t+1tη~k(t)γpyk1+k=t+1tη~k(t)𝔼[ξt,t,maxm].\displaystyle=\varphi_{t^{\prime},t}y_{t^{\prime}}+\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma py_{k-1}+\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]. (99)

    Initializing yt=𝔼[Δ¯t,max]y_{t^{\prime}}=\mathbb{E}[\overline{\Delta}_{t^{\prime},\max}] in (99) and plugging this into (46a), we have

    𝔼[Δt,maxm]ytφt,t𝔼[|Δ¯t(1)Δ¯t(2)|],\mathbb{E}[\Delta_{t,\max}^{m}]\geq y_{t}-\varphi_{t^{\prime},t}\mathbb{E}[|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|],

where we used \sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\leq 1 (cf. (20)). We now further simplify the expression of y_{t}. By rewriting (98) as

    yt=(1ηt(1γp))yt1+ηt𝔼[ξt,t,maxm],\displaystyle y_{t}=(1-\eta_{t}(1-\gamma p))y_{t-1}+\eta_{t}\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}],

it is straightforward to note that y_{t} is given as

    yt=(k=t+1t(1ηk(1γp)))yt+𝔼[ξt,t,maxm][k=t+1tηk(t)].\displaystyle y_{t}=\left(\prod_{k=t^{\prime}+1}^{t}(1-\eta_{k}(1-\gamma p))\right)y_{t^{\prime}}+\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]\left[\sum_{k=t^{\prime}+1}^{t}\eta_{k}^{(t)}\right]. (100)

    Consequently, we have,

    𝔼[Δt,maxm]𝔼[ξt,t,maxm]\displaystyle\mathbb{E}[\Delta_{t,\max}^{m}]-\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}] (k=t+1t(1ηk(1γp)))𝔼[Δ¯t,max]+𝔼[ξt,t,maxm][k=t+1tηk(t)1]\displaystyle\geq\left(\prod_{k=t^{\prime}+1}^{t}(1-\eta_{k}(1-\gamma p))\right)\mathbb{E}[\overline{\Delta}_{t^{\prime},\max}]+\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]\left[\sum_{k=t^{\prime}+1}^{t}\eta_{k}^{(t)}-1\right]
    φt,t𝔼[|Δ¯t(1)Δ¯t(2)|].\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}~{}-\varphi_{t^{\prime},t}\mathbb{E}[|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|]. (101)
  • We can consider a slightly different recursive sequence defined as

    wt=(1ηt)wt1+ηt(γpwt1).\displaystyle w_{t}=(1-\eta_{t})w_{t-1}+\eta_{t}(\gamma pw_{t-1}). (102)

    Using a similar sequence of arguments as outlined in (98)-(100), we can conclude that if wt=𝔼[Δ¯t,max]w_{t^{\prime}}=\mathbb{E}[\overline{\Delta}_{t^{\prime},\max}], then 𝔼[Δt,maxm]wt+𝔼[ξt,t,maxm]φt,t𝔼[|Δ¯t(1)Δ¯t(2)|]\mathbb{E}[\Delta_{t,\max}^{m}]\geq w_{t}+\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]-\varphi_{t^{\prime},t}\mathbb{E}[|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|] and consequently,

    𝔼[Δt,maxm]\displaystyle\mathbb{E}[\Delta_{t,\max}^{m}] (k=t+1t(1ηk(1γp)))𝔼[Δ¯t,max]+𝔼[ξt,t,maxm]φt,t𝔼[|Δ¯t(1)Δ¯t(2)|].\displaystyle\geq\left(\prod_{k=t^{\prime}+1}^{t}(1-\eta_{k}(1-\gamma p))\right)\mathbb{E}[\overline{\Delta}_{t^{\prime},\max}]+\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]-\varphi_{t^{\prime},t}\mathbb{E}[|\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2)|]. (103)

On combining (101) and (103), we arrive at the claim.
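As a sanity check on the unrolled expression in (100), the following snippet runs the recursion y_{t}=(1-\eta_{t}(1-\gamma p))y_{t-1}+\eta_{t}\,\xi and compares it against its closed-form unrolling; the values of \gamma p, the driving term \xi, the initial condition, and the step sizes are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(2)
gamma_p = 0.9                      # stands in for gamma * p
xi = 0.3                           # stands in for E[xi^m_{t',t,max}]
y0 = 1.7                           # stands in for E[Delta_bar_{t',max}]
etas = rng.uniform(0.01, 0.3, size=40)

y = y0
for eta in etas:                   # recursion (98) rewritten with the contraction factor
    y = (1 - eta * (1 - gamma_p)) * y + eta * xi

# Closed-form unrolling as in (100): product term times y0 plus xi times the weighted step-size sum.
prod = np.prod(1 - etas * (1 - gamma_p))
weights = [etas[k] * np.prod(1 - etas[k + 1:] * (1 - gamma_p)) for k in range(len(etas))]
closed = prod * y0 + xi * sum(weights)
assert np.isclose(y, closed)
print(y, closed)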

A.5.3 Proof of Lemma 4

We begin by bounding the first term \mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]; the second bound follows from an almost identical derivation.

Step 1: applying Freedman’s inequality.

Using the relation max{a,b}=a+b+|ab|2\max\{a,b\}=\frac{a+b+|a-b|}{2}, we can rewrite 𝔼[ξt,t,maxm]\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}] as

𝔼[ξt,t,maxm]\displaystyle\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}] =𝔼[ξt,tm(1)+ξt,tm(2)2+|ξt,tm(1)ξt,tm(2)2|]\displaystyle=\mathbb{E}\left[\frac{\xi_{t^{\prime},t}^{m}(1)+\xi_{t^{\prime},t}^{m}(2)}{2}+\left|\frac{\xi_{t^{\prime},t}^{m}(1)-\xi_{t^{\prime},t}^{m}(2)}{2}\right|\right]
\displaystyle=\mathbb{E}\left[\left|\frac{\xi_{t^{\prime},t}^{m}(1)-\xi_{t^{\prime},t}^{m}(2)}{2}\right|\right]
=12𝔼[|k=t+1tη~k(t)γ(P^km(1|1,1)P^km(1|1,2))V^k1m=:ζt,tm|],\displaystyle=\frac{1}{2}\mathbb{E}\Bigg{[}\Bigg{|}\underbrace{\sum_{k=t^{\prime}+1}^{t}\widetilde{\eta}_{k}^{(t)}\gamma(\widehat{P}_{k}^{m}(1|1,1)-\widehat{P}_{k}^{m}(1|1,2))\widehat{V}_{k-1}^{m}}_{=:\zeta_{t^{\prime},t}^{m}}\Bigg{|}\Bigg{]}, (104)

where we used the definition in (40) and the fact that 𝔼[ξt,tm(1)]=𝔼[ξt,tm(2)]=0\mathbb{E}[\xi_{t^{\prime},t}^{m}(1)]=\mathbb{E}[\xi_{t^{\prime},t}^{m}(2)]=0. Decompose ζt,tm\zeta_{t^{\prime},t}^{m} as

ζt,tm\displaystyle\zeta_{t^{\prime},t}^{m} =k=t+1tb=1Bη~k(t)γB(Pk,bm(1|1,1)Pk,bm(1|1,2))V^k1m=:l=1Lzl,\displaystyle=\sum_{k=t^{\prime}+1}^{t}\sum_{b=1}^{B}\widetilde{\eta}_{k}^{(t)}\frac{\gamma}{B}(P_{k,b}^{m}(1|1,1)-P_{k,b}^{m}(1|1,2))\widehat{V}_{k-1}^{m}=:\sum_{l=1}^{L}z_{l}, (105)

where for all 1lL1\leq l\leq L

\displaystyle z_{l}:=\widetilde{\eta}_{k(l)}^{(t)}\cdot\frac{\gamma}{B}(P_{k(l),b(l)}^{m}(1|1,1)-P_{k(l),b(l)}^{m}(1|1,2))\widehat{V}_{k(l)-1}^{m}

with

\displaystyle k(l):=\lfloor(l-1)/B\rfloor+t^{\prime}+1;\quad b(l)=((l-1)\mod B)+1;\quad L=(t-t^{\prime})B.
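A quick check that this index map enumerates each pair (k,b)\in\{t^{\prime}+1,\dots,t\}\times\{1,\dots,B\} exactly once (the specific values of t^{\prime}, t and B below are arbitrary):

t_prime, t, B = 3, 9, 5
L = (t - t_prime) * B
pairs = [((l - 1) // B + t_prime + 1, (l - 1) % B + 1) for l in range(1, L + 1)]
assert len(set(pairs)) == L
assert set(pairs) == {(k, b) for k in range(t_prime + 1, t + 1) for b in range(1, B + 1)}
print("index map covers all", L, "(k, b) pairs exactly once")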

Let \{\mathscr{F}_{l}\}_{l=1}^{L} be a filtration such that \mathscr{F}_{l} is the \sigma-algebra corresponding to \{P_{k(j),b(j)}^{m}(1|1,1),P_{k(j),b(j)}^{m}(1|1,2)\}_{j=1}^{l}. It is straightforward to note that \{z_{l}\}_{l=1}^{L} is a martingale difference sequence adapted to the filtration \{\mathscr{F}_{l}\}_{l=1}^{L}. We will use Freedman's inequality (Freedman1975Martingale; Li2023QLMinimax) to obtain a high-probability bound on |\zeta_{t^{\prime},t}^{m}|.

  • To this end, note that

    supl|zl|\displaystyle\sup_{l}|z_{l}| supl|η~k(l)(t)γB(Pk(l),b(l)m(1|1,1)Pk(l),b(l)m(1|1,2))V^k(l)1m|\displaystyle\leq\sup_{l}\left|\widetilde{\eta}_{k(l)}^{(t)}\cdot\frac{\gamma}{B}\cdot(P_{k(l),b(l)}^{m}(1|1,1)-P_{k(l),b(l)}^{m}(1|1,2))\cdot\widehat{V}_{k(l)-1}^{m}\right|
    η~k(l)(t)γB(1γ)\displaystyle\leq\widetilde{\eta}_{k(l)}^{(t)}\cdot\frac{\gamma}{B(1-\gamma)}
    ηtB(1γ),\displaystyle\leq\frac{\eta_{t}}{B(1-\gamma)}, (106)

    where the second step follows from the bounds |(P_{k(l),b(l)}^{m}(1|1,1)-P_{k(l),b(l)}^{m}(1|1,2))|\leq 1 and \widehat{V}_{k(l)-1}^{m}\leq\frac{1}{1-\gamma}, and the third step uses c_{\eta}\leq\frac{1}{1-\gamma} and the fact that \widetilde{\eta}_{k}^{(T)} is increasing in k in this regime (cf. (21)).

  • Similarly,

    Var(zl|l1)\displaystyle\text{Var}(z_{l}|\mathscr{F}_{l-1}) (η~k(l)(t))2γ2B2(V^k(l)1m)2Var(Pk(l),b(l)m(1|1,1)Pk(l),b(l)m(1|1,2))\displaystyle\leq\left(\widetilde{\eta}_{k(l)}^{(t)}\right)^{2}\frac{\gamma^{2}}{B^{2}}\cdot\left(\widehat{V}_{k(l)-1}^{m}\right)^{2}\cdot\text{Var}(P_{k(l),b(l)}^{m}(1|1,1)-P_{k(l),b(l)}^{m}(1|1,2))
    (η~k(l)(t))2γ2B2(1γ)22p(1p)\displaystyle\leq\left(\widetilde{\eta}_{k(l)}^{(t)}\right)^{2}\frac{\gamma^{2}}{B^{2}(1-\gamma)^{2}}\cdot 2p(1-p)
    2(η~k(l)(t))23B2(1γ).\displaystyle\leq\frac{2\left(\widetilde{\eta}_{k(l)}^{(t)}\right)^{2}}{3B^{2}(1-\gamma)}. (107)

Using the above bounds (106) and (107) along with Freedman’s inequality yield that

Pr(|ζt,tm|8log(2/δ)3B2(1γ)l=1L(η~k(l)(t))2+4ηtlog(2/δ)3B(1γ))δ.\displaystyle\Pr\left(|\zeta_{t^{\prime},t}^{m}|\geq\sqrt{\frac{8\log(2/\delta)}{3B^{2}(1-\gamma)}\sum_{l=1}^{L}\left(\widetilde{\eta}_{k(l)}^{(t)}\right)^{2}}+\frac{4\eta_{t}\log(2/\delta)}{3B(1-\gamma)}\right)\leq\delta. (108)

Setting δ0=(1γ)22𝔼[|ζt,tm|2]\delta_{0}=\frac{(1-\gamma)^{2}}{2}\cdot\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}], with probability at least 1δ01-\delta_{0}, it holds

\displaystyle|\zeta_{t^{\prime},t}^{m}|\leq\sqrt{\frac{8\log(2/\delta_{0})}{3B(1-\gamma)}\sum_{k=t^{\prime}+1}^{t}\left(\widetilde{\eta}_{k}^{(t)}\right)^{2}}+\frac{4\eta_{t}\log(2/\delta_{0})}{3B(1-\gamma)}=:D. (109)

Consequently, plugging this back to (104), we obtain

𝔼[ξt,t,maxm]\displaystyle\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}] =12𝔼[|ζt,tm|]\displaystyle=\frac{1}{2}\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|]
12𝔼[|ζt,tm|𝟙{|ζt,tm|D}]\displaystyle\geq\frac{1}{2}\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|\mathbbm{1}\{|\zeta_{t^{\prime},t}^{m}|\leq D\}]
12D𝔼[|ζt,tm|2𝟙{|ζt,tm|D}]\displaystyle\geq\frac{1}{2D}\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}\mathbbm{1}\{|\zeta_{t^{\prime},t}^{m}|\leq D\}]
12D(𝔼[|ζt,tm|2]𝔼[|ζt,tm|2𝟙{|ζt,tm|>D}])\displaystyle\geq\frac{1}{2D}\left(\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}]-\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}\mathbbm{1}\{|\zeta_{t^{\prime},t}^{m}|>D\}]\right)
12D(𝔼[|ζt,tm|2]Pr(|ζt,tm|>D)(1γ)2)14D𝔼[|ζt,tm|2].\displaystyle\geq\frac{1}{2D}\left(\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}]-\frac{\Pr(|\zeta_{t^{\prime},t}^{m}|>D)}{(1-\gamma)^{2}}\right)\geq\frac{1}{4D}\cdot\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}]. (110)

Here, the penultimate step used the fact that |\zeta_{t^{\prime},t}^{m}|\leq\sum_{k=t^{\prime}+1}^{t}\frac{\widetilde{\eta}_{k}^{(t)}}{1-\gamma}\leq\frac{1}{1-\gamma}, and the last step used the definition of \delta_{0}. Thus, it is sufficient to obtain a lower bound on \mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}] in order to obtain a lower bound for \mathbb{E}[\xi_{t^{\prime},t,\max}^{m}].
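The chain in (110) only uses that |\zeta_{t^{\prime},t}^{m}| is bounded and that the event \{|\zeta_{t^{\prime},t}^{m}|>D\} has small probability. The following generic Monte Carlo check illustrates the truncation argument; the clipped Gaussian standing in for \zeta_{t^{\prime},t}^{m}, the bound z_max, and the threshold D are arbitrary choices.

import numpy as np

rng = np.random.default_rng(3)
z_max = 2.0                                              # stands in for the bound 1/(1-gamma)
Z = np.clip(rng.standard_normal(10**6), -z_max, z_max)   # arbitrary bounded proxy for zeta
D = 1.5                                                  # arbitrary truncation threshold

lhs = np.mean(np.abs(Z))                                      # E[|Z|]
trunc = np.mean(Z**2 * (np.abs(Z) <= D)) / D                  # E[Z^2 1{|Z| <= D}] / D
lower = (np.mean(Z**2) - z_max**2 * np.mean(np.abs(Z) > D)) / D
assert lhs >= trunc >= lower                                  # the inequalities used in (110)
print(lhs, trunc, lower)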

Step 2: lower bounding 𝔼[|ζt,tm|2]\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}].

To proceed, we introduce the following lemma, which provides a lower bound on the second moment of \widehat{V}_{t}^{m} and will be useful later.

Lemma 6.

For all time instants t\in[T] and all agents m\in[M]:

𝔼[(V^tm)2]12(1γ)2.\displaystyle\mathbb{E}\left[\left(\widehat{V}_{t}^{m}\right)^{2}\right]\geq\frac{1}{2(1-\gamma)^{2}}.

We have,

𝔼[|ζt,tm|2]\displaystyle\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}] =𝔼[l=1LVar(zl|l1)]=𝔼[l=1L𝔼[zl2|l1]]\displaystyle=\mathbb{E}\left[\sum_{l=1}^{L}\text{Var}\left(z_{l}|\mathscr{F}_{l-1}\right)\right]=\mathbb{E}\left[\sum_{l=1}^{L}\mathbb{E}\left[z_{l}^{2}|\mathscr{F}_{l-1}\right]\right]
l=1L(η~k(l)(t))2γ2B22p(1p)𝔼[(V^k(l)1m)2]\displaystyle\geq\sum_{l=1}^{L}\left(\widetilde{\eta}_{k(l)}^{(t)}\right)^{2}\frac{\gamma^{2}}{B^{2}}\cdot 2p(1-p)\cdot\mathbb{E}\left[\left(\hat{V}_{k(l)-1}^{m}\right)^{2}\right]
l=1L(η~k(l)(t))2γ2B22p(1p)12(1γ)2\displaystyle\geq\sum_{l=1}^{L}\left(\widetilde{\eta}_{k(l)}^{(t)}\right)^{2}\frac{\gamma^{2}}{B^{2}}\cdot 2p(1-p)\cdot\frac{1}{2(1-\gamma)^{2}}
29B(1γ)k=max{t,τ}+1t(η~k(t))2,\displaystyle\geq\frac{2}{9B(1-\gamma)}\cdot\sum_{k=\max\{t^{\prime},\tau\}+1}^{t}\left(\widetilde{\eta}_{k}^{(t)}\right)^{2}, (111)

where the third line follows from Lemma 6 and the fourth line uses γ5/6\gamma\geq 5/6.

Step 3: finishing up.

We finish up the proof by bounding k=max{t,τ}+1t(η~k(t))2\sum_{k=\max\{t^{\prime},\tau\}+1}^{t}\left(\widetilde{\eta}_{k}^{(t)}\right)^{2} for tmax{t,τ}1/ητt-\max\{t^{\prime},\tau\}\geq 1/\eta_{\tau}. We have

k=max{t,τ}+1t(η~k(t))2\displaystyle\sum_{k=\max\{t^{\prime},\tau\}+1}^{t}\left(\widetilde{\eta}_{k}^{(t)}\right)^{2} k=max{t,τ}+1t(ηki=k+1t(1ηi))2\displaystyle\geq\sum_{k=\max\{t^{\prime},\tau\}+1}^{t}\left(\eta_{k}\prod_{i=k+1}^{t}(1-\eta_{i})\right)^{2}
(i)k=max{t,τ}+1t(ηti=k+1t(1ητ))2\displaystyle\overset{\mathrm{(i)}}{\geq}\sum_{k=\max\{t^{\prime},\tau\}+1}^{t}\left(\eta_{t}\prod_{i=k+1}^{t}(1-\eta_{\tau})\right)^{2}
=ηt2k=max{t,τ}+1t(1ητ)2(tk)\displaystyle=\eta_{t}^{2}\sum_{k=\max\{t^{\prime},\tau\}+1}^{t}(1-\eta_{\tau})^{2(t-k)}
ηt21(1ητ)2(tmax{t,τ})ητ(2ητ)\displaystyle\geq\eta_{t}^{2}\cdot\frac{1-(1-\eta_{\tau})^{2(t-\max\{t^{\prime},\tau\})}}{\eta_{\tau}(2-\eta_{\tau})}
ηt1exp(2)6ηt10ηT10,\displaystyle\geq\eta_{t}\cdot\frac{1-\exp(-2)}{6}\geq\frac{\eta_{t}}{10}\geq\frac{\eta_{T}}{10}, (112)

where (i) follows from the monotonicity of \eta_{k}. Plugging (112) into the expression of D (cf. (109)), we have

D=\displaystyle D= 8log(2/δ0)3B(1γ)k=t+1t(η~k(t))2+4ηtlog(2/δ0)3B(1γ)\displaystyle\sqrt{\frac{8\log(2/\delta_{0})}{3B(1-\gamma)}\sum_{k=t^{\prime}+1}^{t}\left(\widetilde{\eta}_{k}^{(t)}\right)^{2}}+\frac{4\eta_{t}\log(2/\delta_{0})}{3B(1-\gamma)}
92𝔼[|ζt,tm|2]8log(2/δ0)3(1B(1γ)k=t+1t(η~k(t))2)1/2+60𝔼[|ζt,tm|2]log(2/δ0)\displaystyle\leq\frac{9}{2}\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}]\cdot\sqrt{\frac{8\log(2/\delta_{0})}{3}}\left(\frac{1}{B(1-\gamma)}\sum_{k=t^{\prime}+1}^{t}\left(\widetilde{\eta}_{k}^{(t)}\right)^{2}\right)^{-1/2}+60\cdot\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}]\cdot\log(2/\delta_{0})
3𝔼[|ζt,tm|2]log(2/δ0)[60B(1γ)ηt+20]\displaystyle\leq 3\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}]\cdot\log(2/\delta_{0})\left[\sqrt{\frac{60B(1-\gamma)}{\eta_{t}}}+20\right]
60𝔼[|ζt,tm|2]log(2/δ0)[3B(1γ)20ηT+1],\displaystyle\leq 60\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}]\cdot\log(2/\delta_{0})\left[\sqrt{\frac{3B(1-\gamma)}{20\eta_{T}}}+1\right],

where the second line follows from (111) and (112), and the third line follows from (112). On combining the above bound with (110), we obtain,

𝔼[ξt,t,maxm]1240log(2/δ0)νν+1,\displaystyle\mathbb{E}[\xi_{t^{\prime},t,\max}^{m}]\geq\frac{1}{240\log(2/\delta_{0})}\cdot\frac{\nu}{\nu+1}, (113)

where ν:=20ηT3B(1γ)\nu:=\sqrt{\dfrac{20\eta_{T}}{3B(1-\gamma)}}. Note that we have,

δ0\displaystyle\delta_{0} =(1γ)22𝔼[|ζt,tm|2](1γ)9Bk=t+1t(η~k(t))2ηT(1γ)90B.\displaystyle=\frac{(1-\gamma)^{2}}{2}\cdot\mathbb{E}[|\zeta_{t^{\prime},t}^{m}|^{2}]\geq\frac{(1-\gamma)}{9B}\cdot\sum_{k=t^{\prime}+1}^{t}\left(\widetilde{\eta}_{k}^{(t)}\right)^{2}\geq\frac{\eta_{T}(1-\gamma)}{90B}.

Combining the above bound with (113) yields us the required bound.
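For the constant step-size case, the geometric-sum bound underlying (112) can be verified numerically as follows; the snippet assumes \eta_{k}\equiv\eta and t-\max\{t^{\prime},\tau\}\geq 1/\eta as in the statement, with arbitrary values of \eta and of the gap.

import numpy as np

eta = 0.05
gap = int(np.ceil(1 / eta)) + 5                          # t - max{t', tau} >= 1/eta
total = sum(eta**2 * (1 - eta) ** (2 * j) for j in range(gap))
closed = eta**2 * (1 - (1 - eta) ** (2 * gap)) / (eta * (2 - eta))
assert np.isclose(total, closed)                         # geometric series identity
assert closed >= eta * (1 - np.exp(-2)) / 6 >= eta / 10  # the lower bound in (112)
print(total, eta / 10)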

Step 4: repeating the argument for the second claim.

We note that the second claim in the lemma, i.e., the lower bound on \mathbb{E}\left[\max\left\{\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(1),\frac{1}{M}\sum_{m=1}^{M}\xi_{t^{\prime},t}^{m}(2)\right\}\right], follows through an identical series of arguments, where the bounds in Eqns. (106) and (107) contain an additional factor of M in the denominator (effectively replacing B with BM), which is carried through in all the subsequent steps.

A.5.4 Proof of Lemma 5

Using Eqns. (43) and (40), we can write

Δ¯t(1)Δ¯t(2)\displaystyle\overline{\Delta}_{t}(1)-\overline{\Delta}_{t}(2) =(k=t+1t(1ηk))(Δ¯t(1)Δ¯t(2))\displaystyle=\left(\prod_{k=t^{\prime}+1}^{t}(1-\eta_{k})\right)(\overline{\Delta}_{t^{\prime}}(1)-\overline{\Delta}_{t^{\prime}}(2))
+1Mm=1Mk=t+1t(ηki=k+1t(1ηi))γ(P^km(1|1,1)P^km(1|1,2))V^k1m.\displaystyle~{}~{}~{}~{}~{}~{}~{}~{}~{}+\frac{1}{M}\sum_{m=1}^{M}\sum_{k=t^{\prime}+1}^{t}\left(\eta_{k}\prod_{i=k+1}^{t}(1-\eta_{i})\right)\gamma(\widehat{P}_{k}^{m}(1|1,1)-\widehat{P}_{k}^{m}(1|1,2))\widehat{V}_{k-1}^{m}.

Upon unrolling the recursion, we obtain,

Δ¯t(1)Δ¯t(2)=k=1tm=1M(ηki=k+1t(1ηi))γM(P^km(1|1,1)P^km(1|1,2))V^k1m.\displaystyle\overline{\Delta}_{t}(1)-\overline{\Delta}_{t}(2)=\sum_{k=1}^{t}\sum_{m=1}^{M}\left(\eta_{k}\prod_{i=k+1}^{t}(1-\eta_{i})\right)\frac{\gamma}{M}(\widehat{P}_{k}^{m}(1|1,1)-\widehat{P}_{k}^{m}(1|1,2))\widehat{V}_{k-1}^{m}.

If we define a filtration k\mathscr{F}_{k} as the σ\sigma-algebra corresponding to {P^l1(1|1,1),P^l1(1|1,2),,P^lM(1|1,1),P^lM(1|1,2)}l=1k\{\widehat{P}_{l}^{1}(1|1,1),\widehat{P}_{l}^{1}(1|1,2),\dots,\widehat{P}_{l}^{M}(1|1,1),\widehat{P}_{l}^{M}(1|1,2)\}_{l=1}^{k}, then it is straightforward to note that {Δ¯t(1)Δ¯t(2)}t\{\overline{\Delta}_{t}(1)-\overline{\Delta}_{t}(2)\}_{t} is a martingale sequence adapted to the filtration {t}t\{\mathscr{F}_{t}\}_{t}. Using Jensen’s inequality, we know that if {Zt}t\{Z_{t}\}_{t} is a martingale adapted to a filtration {𝒢t}t\{\mathscr{G}_{t}\}_{t}, then for a convex function ff such that f(Zt)f(Z_{t}) is integrable for all tt, {f(Zt)}t\{f(Z_{t})\}_{t} is a sub-martingale adapted to {𝒢t}t\{\mathscr{G}_{t}\}_{t}. Since f(x)=|x|f(x)=|x| is a convex function, {|Δ¯t(1)Δ¯t(2)|}t\{|\overline{\Delta}_{t}(1)-\overline{\Delta}_{t}(2)|\}_{t} is a submartingale adapted to the filtration {t}t\{\mathscr{F}_{t}\}_{t}. As a result,

sup1tT𝔼[|Δ¯t(1)Δ¯t(2)|]\displaystyle\sup_{1\leq t\leq T}\mathbb{E}[|\overline{\Delta}_{t}(1)-\overline{\Delta}_{t}(2)|] 𝔼[|Δ¯T(1)Δ¯T(2)|](𝔼[(Δ¯T(1)Δ¯T(2))2])1/2.\displaystyle\leq\mathbb{E}[|\overline{\Delta}_{T}(1)-\overline{\Delta}_{T}(2)|]\leq\left(\mathbb{E}[(\overline{\Delta}_{T}(1)-\overline{\Delta}_{T}(2))^{2}]\right)^{1/2}. (114)

We use the following observation about a martingale sequence {Xi}i=1t\{X_{i}\}_{i=1}^{t} adapted to a filtration {𝒢i}i=1t\{\mathscr{G}_{i}\}_{i=1}^{t} to evaluate the above expression. We have,

𝔼[(i=1tXi)2]\displaystyle\mathbb{E}\left[\left(\sum_{i=1}^{t}X_{i}\right)^{2}\right] =𝔼[𝔼[(i=1tXi)2|𝒢t1]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\left(\sum_{i=1}^{t}X_{i}\right)^{2}\bigg{|}\mathscr{G}_{t-1}\right]\right]
=𝔼[𝔼[Xt2+2Xt(i=1t1Xi)+(i=1t1Xi)2|𝒢t1]]\displaystyle=\mathbb{E}\left[\mathbb{E}\left[X_{t}^{2}+2X_{t}\left(\sum_{i=1}^{t-1}X_{i}\right)+\left(\sum_{i=1}^{t-1}X_{i}\right)^{2}\bigg{|}\mathscr{G}_{t-1}\right]\right]
=𝔼[Xt2]+𝔼[(i=1t1Xi)2]\displaystyle=\mathbb{E}\left[X_{t}^{2}\right]+\mathbb{E}\left[\left(\sum_{i=1}^{t-1}X_{i}\right)^{2}\right]
=i=1t𝔼[Xi2],\displaystyle=\sum_{i=1}^{t}\mathbb{E}\left[X_{i}^{2}\right], (115)

where the third step uses the facts that \left(\sum_{i=1}^{t-1}X_{i}\right) is \mathscr{G}_{t-1}-measurable and \mathbb{E}[X_{t}|\mathscr{G}_{t-1}]=0, and the fourth step is obtained by recursively applying the second and third steps. Using the relation in Eqn. (115) in Eqn. (114), we obtain,

sup1tT𝔼[|Δ¯t(1)Δ¯t(2)|]\displaystyle\sup_{1\leq t\leq T}\mathbb{E}[|\overline{\Delta}_{t}(1)-\overline{\Delta}_{t}(2)|] (𝔼[(Δ¯T(1)Δ¯T(2))2])1/2\displaystyle\leq\left(\mathbb{E}[(\overline{\Delta}_{T}(1)-\overline{\Delta}_{T}(2))^{2}]\right)^{1/2}
(k=1T𝔼[(m=1Mη~k(T)γM(P^km(1|1,1)P^km(1|1,2))V^k1m)2])1/2\displaystyle\leq\left(\sum_{k=1}^{T}\mathbb{E}\left[\left(\sum_{m=1}^{M}\widetilde{\eta}_{k}^{(T)}\cdot\frac{\gamma}{M}\cdot(\widehat{P}_{k}^{m}(1|1,1)-\widehat{P}_{k}^{m}(1|1,2))\hat{V}_{k-1}^{m}\right)^{2}\right]\right)^{1/2}
(k=1T(η~k(T))22γ2p(1p)BM2m=1M𝔼[(V^k1m)2])1/2\displaystyle\leq\left(\sum_{k=1}^{T}\left(\widetilde{\eta}_{k}^{(T)}\right)^{2}\cdot\frac{2\gamma^{2}p(1-p)}{BM^{2}}\cdot\sum_{m=1}^{M}\mathbb{E}\left[\left(\hat{V}_{k-1}^{m}\right)^{2}\right]\right)^{1/2}
(k=1T(η~k(T))22γ2p(1p)BM(1γ)2)1/2.\displaystyle\leq\left(\sum_{k=1}^{T}\left(\widetilde{\eta}_{k}^{(T)}\right)^{2}\cdot\frac{2\gamma^{2}p(1-p)}{BM(1-\gamma)^{2}}\right)^{1/2}. (116)
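The identity (115) says that martingale differences are uncorrelated, so the second moment of their sum is the sum of their second moments. The following Monte Carlo check uses a synthetic martingale difference sequence whose increments are dependent but conditionally mean-zero; the construction is an arbitrary illustration.

import numpy as np

rng = np.random.default_rng(4)
n_paths, t = 200000, 20
X = np.zeros((n_paths, t))
scale = np.ones(n_paths)
for i in range(t):
    B = rng.integers(0, 2, size=n_paths)   # fair coin flips
    X[:, i] = scale * (B - 0.5)            # conditionally mean-zero given the past
    scale = 1.0 + 0.5 * B                  # the next scale depends on the history

lhs = np.mean(X.sum(axis=1) ** 2)          # E[(sum_i X_i)^2]
rhs = np.sum(np.mean(X**2, axis=0))        # sum_i E[X_i^2]
print(lhs, rhs)                            # agree up to Monte Carlo error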

Let us focus on the term involving the step sizes. We separately consider the scenarios of constant step sizes and linearly rescaled step sizes. For constant step sizes, we have,

k=1T(η~k(T))2\displaystyle\sum_{k=1}^{T}\left(\widetilde{\eta}_{k}^{(T)}\right)^{2} =k=1T(ηki=k+1T(1ηi))2=k=1Tη2(1η)2(Tk)η21(1η)2η.\displaystyle=\sum_{k=1}^{T}\left(\eta_{k}\prod_{i=k+1}^{T}(1-\eta_{i})\right)^{2}=\sum_{k=1}^{T}\eta^{2}(1-\eta)^{2(T-k)}\leq\frac{\eta^{2}}{1-(1-\eta)^{2}}\leq\eta. (117)

Similarly, for linearly rescaled step sizes, we have,

k=1T(η~k(T))2\displaystyle\sum_{k=1}^{T}\left(\widetilde{\eta}_{k}^{(T)}\right)^{2} =k=1τ(η~k(T))2+k=τ+1T(ηki=k+1T(1ηi))2\displaystyle=\sum_{k=1}^{\tau}\left(\widetilde{\eta}_{k}^{(T)}\right)^{2}+\sum_{k=\tau+1}^{T}\left(\eta_{k}\prod_{i=k+1}^{T}(1-\eta_{i})\right)^{2}
k=1τ(η~τ(T))2+k=τ+1Tηk2(1ηT)2(Tk)\displaystyle\leq\sum_{k=1}^{\tau}\left(\widetilde{\eta}_{\tau}^{(T)}\right)^{2}+\sum_{k=\tau+1}^{T}\eta_{k}^{2}(1-\eta_{T})^{2(T-k)}
ητ2(1ηT)2(Tτ)τ+ητ21ηT(2ηT)\displaystyle\leq\eta_{\tau}^{2}(1-\eta_{T})^{2(T-\tau)}\cdot\tau+\eta_{\tau}^{2}\cdot\frac{1}{\eta_{T}(2-\eta_{T})}
3ηTηTTexp(4TηT3)+3ηT\displaystyle\leq 3\eta_{T}\cdot\eta_{T}\cdot T\cdot\exp\left(-\frac{4T\eta_{T}}{3}\right)+3\eta_{T}
94eηT+3ηT\displaystyle\leq\frac{9}{4e}\eta_{T}+3\eta_{T}
4ηT,\displaystyle\leq 4\eta_{T}, (118)

where the second step uses c_{\eta}\leq\log N\leq\frac{1}{1-\gamma} and the fact that \widetilde{\eta}_{k}^{(T)} is increasing in k in this regime (see Eqn. (21)), and the fifth step uses xe^{-4x/3}\leq 3/(4e). On plugging the results from Eqns. (117) and (118) into Eqn. (116), along with the value of p, we obtain,

sup1tT𝔼[|Δ¯t(1)Δ¯t(2)|]\displaystyle\sup_{1\leq t\leq T}\mathbb{E}[|\overline{\Delta}_{t}(1)-\overline{\Delta}_{t}(2)|] 8ηT3BM(1γ),\displaystyle\leq\sqrt{\frac{8\eta_{T}}{3BM(1-\gamma)}}, (119)

as required.
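Both step-size bounds can be checked numerically. The snippet below verifies the constant step-size bound (117) exactly, and, assuming a linearly rescaled schedule of the form \eta_{k}=\frac{1}{1+c_{\eta}(1-\gamma)k} as an illustrative stand-in for the schedule in Eqn. (21), reports whether the corresponding weighted sum is within the factor 4\eta_{T} claimed in (118); the values of \gamma, T and c_{\eta} are arbitrary.

import numpy as np

gamma, T = 0.95, 2000

def weighted_sq_sum(etas):
    # sum_k ( eta_k * prod_{i>k} (1 - eta_i) )^2, i.e., the sum of squared effective step sizes
    tail = np.cumprod((1 - etas)[::-1])[::-1]   # tail[k] = prod_{i>=k} (1 - eta_i)
    tail = np.append(tail[1:], 1.0)             # shift so that tail[k] = prod_{i>k} (1 - eta_i)
    return np.sum((etas * tail) ** 2)

eta = 1.0 / ((1 - gamma) * T)                   # constant step size
const = weighted_sq_sum(np.full(T, eta))
print(const, "<=", eta, const <= eta)           # the bound (117)

c_eta = 2.0                                     # assumed linearly rescaled schedule (illustrative)
etas = 1.0 / (1.0 + c_eta * (1 - gamma) * np.arange(1, T + 1))
resc = weighted_sq_sum(etas)
print(resc, "<=", 4 * etas[-1], resc <= 4 * etas[-1])   # cf. (118)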

A.5.5 Proof of Lemma 6

For the proof, we fix an agent mm. In order to obtain the required lower bound on V^tm\widehat{V}_{t}^{m}, we define an auxiliary sequence Q¯tm\overline{Q}_{t}^{m} that evolves as described in Algorithm 5. Essentially, Q¯tm\overline{Q}_{t}^{m} evolves in a manner almost identical to Q^tm\widehat{Q}_{t}^{m} except for the fact that there is only one action and hence there is no maximization step in the update rule.

1:  r1r\leftarrow 1, Q¯0m=Q(1,1)\overline{Q}^{m}_{0}=Q^{\star}(1,1) for all m{1,2,,M}m\in\{1,2,\dots,M\}
2:  for t=1,2,,Tt=1,2,\dots,T do
3:     for m=1,2,,Mm=1,2,\dots,M do
4:        \overline{Q}^{m}_{t-1/2}\leftarrow(1-\eta_{t})\overline{Q}^{m}_{t-1}+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,1)\overline{Q}_{t-1}^{m})
5:        Compute Q¯tm\overline{Q}_{t}^{m} according to Eqn. (8)
6:     end for
7:  end for
Algorithm 5 Evolution of Q¯\overline{Q}

It is straightforward to note that Q^tm(1)Q¯tm\widehat{Q}_{t}^{m}(1)\geq\overline{Q}_{t}^{m}, which can be shown using induction. From the initialization, it follows that Q^0m(1)Q¯0m\widehat{Q}_{0}^{m}(1)\geq\overline{Q}_{0}^{m}. Assuming the relation holds for t1t-1, we have,

Q^t1/2m(1)\displaystyle\widehat{Q}_{t-1/2}^{m}(1) =(1ηt)Q^t1m(1)+ηt(1+γP^tm(1|1,1)V^t1m)\displaystyle=(1-\eta_{t})\widehat{Q}_{t-1}^{m}(1)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,1)\widehat{V}_{t-1}^{m})
(1ηt)Q^t1m(1)+ηt(1+γP^tm(1|1,1)Q^t1m(1))\displaystyle\geq(1-\eta_{t})\widehat{Q}_{t-1}^{m}(1)+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,1)\widehat{Q}_{t-1}^{m}(1))
(1ηt)Q¯t1m+ηt(1+γP^tm(1|1,1)Q¯t1m)\displaystyle\geq(1-\eta_{t})\overline{Q}_{t-1}^{m}+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,1)\overline{Q}_{t-1}^{m})
=Q¯t1/2m.\displaystyle=\overline{Q}_{t-1/2}^{m}.

Since Q^tm\widehat{Q}_{t}^{m} and Q¯tm\overline{Q}_{t}^{m} follow the same averaging schedule, it immediately follows from the above relation that Q^tm(1)Q¯tm\widehat{Q}_{t}^{m}(1)\geq\overline{Q}_{t}^{m}. Since V^tmQ^tm(1)Q¯tm\widehat{V}_{t}^{m}\geq\widehat{Q}_{t}^{m}(1)\geq\overline{Q}_{t}^{m}, we will use the sequence Q¯tm\overline{Q}_{t}^{m} to establish the required lower bound on V^tm\widehat{V}_{t}^{m}.

We claim that for all time instants tt and all agents mm,

𝔼[Q¯tm]=11γp.\displaystyle\mathbb{E}[\overline{Q}_{t}^{m}]=\frac{1}{1-\gamma p}. (120)

Assuming (120) holds, we have

𝔼[(V^tm)2](𝔼[V^tm])2(𝔼[Q¯tm])2(11γp)212(1γ)2,\displaystyle\mathbb{E}[(\widehat{V}_{t}^{m})^{2}]\geq\left(\mathbb{E}[\widehat{V}_{t}^{m}]\right)^{2}\geq\left(\mathbb{E}[\overline{Q}_{t}^{m}]\right)^{2}\geq\left(\frac{1}{1-\gamma p}\right)^{2}\geq\frac{1}{2(1-\gamma)^{2}},

as required. In the above expression, the first inequality follows from Jensen’s inequality, the second from the relation V^tmQ¯tm0\widehat{V}_{t}^{m}\geq\overline{Q}_{t}^{m}\geq 0 and the third from (120).

We now move on to proving the claim (120) using induction. For the base case, \mathbb{E}[\overline{Q}_{0}^{m}]=\frac{1}{1-\gamma p} holds by the choice of initialization. Assume that \mathbb{E}[\overline{Q}_{t-1}^{m}]=\frac{1}{1-\gamma p} holds at time t-1 for all m.

  • If tt is not an averaging instant, then for any client mm,

    Q¯tm\displaystyle\overline{Q}_{t}^{m} =(1ηt)Q¯t1m+ηt(1+γP^tm(1|1,1)Q¯t1m)\displaystyle=(1-\eta_{t})\overline{Q}_{t-1}^{m}+\eta_{t}(1+\gamma\widehat{P}_{t}^{m}(1|1,1)\overline{Q}_{t-1}^{m})
    𝔼[Q¯tm]\displaystyle\implies\mathbb{E}[\overline{Q}_{t}^{m}] =(1ηt)𝔼[Q¯t1m]+ηt(1+γ𝔼[P^tm(1|1,1)Q¯t1m])\displaystyle=(1-\eta_{t})\mathbb{E}[\overline{Q}_{t-1}^{m}]+\eta_{t}(1+\gamma\mathbb{E}[\widehat{P}_{t}^{m}(1|1,1)\overline{Q}_{t-1}^{m}])
    =(1ηt)𝔼[Q¯t1m]+ηt(1+γp𝔼[Q¯t1m])\displaystyle=(1-\eta_{t})\mathbb{E}[\overline{Q}_{t-1}^{m}]+\eta_{t}(1+\gamma p\mathbb{E}[\overline{Q}_{t-1}^{m}])
    =(1ηt)1γp+ηt(1+γp1γp)=11γp.\displaystyle=\frac{(1-\eta_{t})}{1-\gamma p}+\eta_{t}\left(1+\frac{\gamma p}{1-\gamma p}\right)=\frac{1}{1-\gamma p}. (121)

    The third line follows from the independence of P^tm(1|1,1)\widehat{P}_{t}^{m}(1|1,1) and Q¯t1m\overline{Q}_{t-1}^{m} and the fourth line uses the inductive hypothesis.

  • If tt is an averaging instant, then for all clients mm,

    Q¯tm\displaystyle\overline{Q}_{t}^{m} =(1ηt)Mj=1MQ¯t1j+ηt1Mj=1M(1+γP^tj(1|1,1)Q¯t1j)\displaystyle=\frac{(1-\eta_{t})}{M}\sum_{j=1}^{M}\overline{Q}_{t-1}^{j}+\eta_{t}\frac{1}{M}\sum_{j=1}^{M}(1+\gamma\widehat{P}_{t}^{j}(1|1,1)\overline{Q}_{t-1}^{j})
    𝔼[Q¯tm]\displaystyle\implies\mathbb{E}[\overline{Q}_{t}^{m}] =(1ηt)Mj=1M𝔼[Q¯t1j]+ηt1Mj=1M(1+γ𝔼[P^tj(1|1,1)Q¯t1j])\displaystyle=\frac{(1-\eta_{t})}{M}\sum_{j=1}^{M}\mathbb{E}[\overline{Q}_{t-1}^{j}]+\eta_{t}\frac{1}{M}\sum_{j=1}^{M}(1+\gamma\mathbb{E}[\widehat{P}_{t}^{j}(1|1,1)\overline{Q}_{t-1}^{j}])
    =(1ηt)Mj=1M11γp+ηt1Mj=1M(1+γp1γp)=11γp,\displaystyle=\frac{(1-\eta_{t})}{M}\sum_{j=1}^{M}\frac{1}{1-\gamma p}+\eta_{t}\frac{1}{M}\sum_{j=1}^{M}\left(1+\frac{\gamma p}{1-\gamma p}\right)=\frac{1}{1-\gamma p}, (122)

    where we again make use of independence and the inductive hypothesis.

Thus, (121) and (122) taken together complete the inductive step.
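As an empirical sanity check of the claim (120), the sketch below simulates the update of Algorithm 5 across multiple independent runs and confirms that the average of \overline{Q}_{t}^{m} stays close to \frac{1}{1-\gamma p}. The values of \gamma, B, M, T, the step-size sequence, and the averaging instants are arbitrary illustrative choices, and the in-code averaging step stands in for the aggregation in Eqn. (8).

import numpy as np

rng = np.random.default_rng(5)
gamma = 0.9
p = (4 * gamma - 1) / (3 * gamma)
M, B, T, runs = 4, 10, 200, 2000
etas = 1.0 / (1.0 + 0.5 * (1 - gamma) * np.arange(1, T + 1))   # arbitrary decreasing step sizes
avg_instants = set(range(20, T + 1, 20))                       # arbitrary averaging schedule

Q = np.full((runs, M), 1.0 / (1.0 - gamma * p))                # initialization of Algorithm 5
for t in range(1, T + 1):
    P_hat = rng.binomial(B, p, size=(runs, M)) / B             # empirical estimate of P(1|1,1) from B draws
    Q = (1 - etas[t - 1]) * Q + etas[t - 1] * (1 + gamma * P_hat * Q)
    if t in avg_instants:
        Q = np.repeat(Q.mean(axis=1, keepdims=True), M, axis=1)  # average across the M agents

print(Q.mean(), 1.0 / (1.0 - gamma * p))   # the empirical mean matches 1/(1 - gamma*p)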