
Refined Sample Complexity for Markov Games with
Independent Linear Function Approximation

Yan Dai IIIS, Tsinghua University. Email: [email protected].    Qiwen Cui University of Washington. Email: [email protected].    Simon S. Du University of Washington. Email: [email protected].
Abstract

Markov Games (MG) is an important model for Multi-Agent Reinforcement Learning (MARL). It was long believed that the “curse of multi-agents” (i.e., the algorithmic performance drops exponentially with the number of agents) is unavoidable, until several recent works (Daskalakis et al., 2023, Cui et al., 2023, Wang et al., 2023). While these works resolved the curse of multi-agents, when the state spaces are prohibitively large and (linear) function approximations are deployed, they either had a slower convergence rate of \operatorname{\mathcal{O}}(T^{-1/4}) or brought a polynomial dependency on the number of actions A_{\max} – which is avoidable in single-agent cases even when the loss functions can arbitrarily vary with time. This paper first refines the AVLPR framework by Wang et al. (2023), with the insight of designing data-dependent (i.e., stochastic) pessimistic estimations of the sub-optimality gap, allowing a broader choice of plug-in algorithms. When specialized to MGs with independent linear function approximations, we propose novel action-dependent bonuses to cover occasionally extreme estimation errors. With the help of state-of-the-art techniques from the single-agent RL literature, we give the first algorithm that tackles the curse of multi-agents, attains the optimal \operatorname{\mathcal{O}}(T^{-1/2}) convergence rate, and avoids \text{poly}(A_{\max}) dependency simultaneously. (Footnote 1: Accepted for presentation at the Conference on Learning Theory (COLT) 2024.)

1 Introduction

Multi-Agent Reinforcement Learning (MARL) studies decision-making under uncertainty in a multi-agent system. Many practical MARL systems demonstrate impressive performance in games like Go (Silver et al., 2017), Poker (Brown and Sandholm, 2019), Starcraft II (Vinyals et al., 2019), Hide-and-Seek (Baker et al., 2020), and Autonomous Driving (Shalev-Shwartz et al., 2016). While MARL exhibits huge success in practice, it is still far from being understood theoretically.

A core challenge in theoretically studying MARL is the “curse of multi-agents”: When many agents are involved in the game, the joint state and action space is prohibitively large. Thus, early algorithms for multi-agent games (Liu et al., 2021) usually have a sample complexity (the number of samples the algorithm requires to attain a given accuracy) exponentially depending on the number of agents m, for example, scaling with \prod_{i\in[m]}\lvert\mathcal{A}_{i}\rvert where \mathcal{A}_{i} is the action space of the i-th agent.

Later, many efforts were made to resolve this issue. Jin et al. (2021) gave the first algorithm avoiding the curse of multi-agents, and Daskalakis et al. (2023) made the output policy Markovian (i.e., non-history-dependent, which is what we focus on in this paper). Such results only depend on A_{\max}\triangleq\max_{i\in[m]}\lvert\mathcal{A}_{i}\rvert but not \prod_{i\in[m]}\lvert\mathcal{A}_{i}\rvert. While these algorithms work well in tabular Markov Games (Shapley, 1953) (Footnote 2: Tabular Markov Games refer to Markov Games where the numbers of states and actions are finite and small.), they cannot handle the case where the state space is prohibitively large.

However, in many real-world applications, tabular models are insufficient. For example, Go has 3^{361} possible states. In single-agent RL, people use function approximation to model the state space (Jiang et al., 2017, Wen and Van Roy, 2017). While this idea naturally generalizes to MARL (Xie et al., 2020, Chen et al., 2022b), unfortunately, it usually induces the curse of multi-agents.

Two recent works by Cui et al. (2023) and Wang et al. (2023) investigated this issue. They concurrently proposed that instead of the global function approximations previously used in the literature, independent function approximations should be developed. By assuming independent linear function approximations, they designed the first algorithms that can avoid the curse of multi-agents when the state spaces are prohibitively large and function approximations are deployed.

While their algorithms succeeded in yielding polynomial dependencies on m, they were sub-optimal in other terms. The sample complexity of Cui et al. (2023) was \operatorname{\widetilde{\mathcal{O}}}(\mathrm{poly}(m,H,d)\epsilon^{-4}), while that of Wang et al. (2023) was \operatorname{\widetilde{\mathcal{O}}}(\mathrm{poly}(m,H,d)A_{\max}^{5}\epsilon^{-2}). (Footnote 3: We use \operatorname{\widetilde{\mathcal{O}}} to hide any logarithmic factors.) Here, m is the number of agents, H is the length of each episode, d is the dimension of the feature space, A_{\max}=\max_{i\in[m]}\lvert\mathcal{A}_{i}\rvert is the largest size of the action sets, and \epsilon is the desired accuracy. The former has a sub-optimal convergence rate of \epsilon^{-4}, whereas the latter has a polynomial dependency on the number of actions. However, no polynomial dependency on the number of actions is necessary for single-agent RL with linear function approximations, even when the losses can arbitrarily vary with time, i.e., in the so-called adversarial regime (Dai et al., 2023).

As Linear MGs are generalizations of Linear MDPs, we aim to generalize such a property to Linear MGs. In this paper, we propose an algorithm for multi-player general-sum Markov Games with independent linear function approximations that i) retains a polynomial dependency on m, ii) ensures the optimal \epsilon^{-2} convergence rate, and iii) only has a logarithmic dependency on A_{\max}.

1.1 Key Insights and Technical Overview of This Paper

The key insight of our paper is to develop data-dependent (i.e., stochastic) pessimistic sub-optimality gap estimators instead of deterministic ones. For more context, the AVLPR framework designed and used by Wang et al. (2023) required a deterministic gap estimation regarding the current policy \widetilde{\pi} during execution – so that the agents can collaborate to further improve \widetilde{\pi}. Unfortunately, yielding such a deterministic estimation corresponds to an open problem called “high-probability regret bounds for adversarial contextual linear bandits” in the literature (Olkhovskaya et al., 2023). Thus, Wang et al. (2023) used a uniform exploration strategy to avoid proving regret bounds; however, this approach unavoidably brings \mathrm{poly}(A) factors. On the other hand, the framework by Cui et al. (2023) is intrinsically incapable of an \epsilon^{-2} convergence rate as it uses the epoching technique (i.e., fixing the policy for many episodes so that the environment is almost stationary).

Hence, existing ideas in the literature cannot give favorable sample complexities, and we thus propose using stochastic sub-optimality gap estimators. To fully deploy our insight, we make the following technical contributions:

  1.

    Based on the AVLPR framework by Wang et al. (2023), which required deterministic sub-optimality gap estimations, we propose a refined framework that accommodates data-dependent (i.e., non-deterministic) estimators in Algorithm 1. As we show in Theorem 3.2, a stochastic gap estimation with bounded expectation already suffices for an \epsilon-CCE. This innovation, as we shall see shortly, gives more flexibility in choosing algorithms and allows techniques from the expected-regret-minimization literature.

    Slightly more formally, suppose that we would like to evaluate a joint policy \widetilde{\pi}_{t}. Its actual sub-optimality gap, denoted by {\textsc{Gap}}_{\widetilde{\pi}_{t}}, cannot be accurately calculated during runtime since the “optimal” policy is unknown. The original approach requires a deterministic constant G_{t} such that {\textsc{Gap}}_{\widetilde{\pi}_{t}}\leq G_{t} w.h.p. However, the approach we use to generate \widetilde{\pi}_{t} (which uses the famous regret-to-sample-complexity reduction and translates the problem into regret minimization in an adversarial contextual linear bandit; see Section 4 for more) does not allow such a deterministic G. Instead of crafting \widetilde{\pi} in another way like Wang et al. (2023), we propose calculating a random variable {\textsc{Gap}}_{t} such that i) {\textsc{Gap}}_{\widetilde{\pi}_{t}}\leq{\textsc{Gap}}_{t} w.h.p., and ii) \operatornamewithlimits{\mathbb{E}}[{\textsc{Gap}}_{t}]\leq G_{t} for some deterministic constant G_{t}. More details can be found in Theorem 3.2.

  2.

    Existing expected-regret-minimization algorithms cannot be directly used, as they only guarantee the second condition of \operatornamewithlimits{\mathbb{E}}[{\textsc{Gap}}_{t}]\leq G_{t} – but in addition to this, we also want {\textsc{Gap}}_{t} to be a high-probability pessimistic estimation of {\textsc{Gap}}_{\widetilde{\pi}_{t}}. Meanwhile, previous algorithms in high-probability RL also do not directly work due to the aforementioned open problem. Technically, this is because existing bonus-design mechanisms cannot cover estimation errors that occasionally have extreme magnitudes albeit with well-bounded expectations; see Section 4.2 for a more formal description.

    To tackle this issue, we propose a novel technique called action-dependent bonuses, partially inspired by the Adaptive Freedman Inequality proposed by Lee et al. (2020) and improved by Zimmert and Lattimore (2022). As we detail in Section 4.2, this technique applies when a) we want to use the bonus technique to cancel some error of the form \sum_{t}v_{t}(a^{\ast}), where a^{\ast} is the optimal action in hindsight; but b) \sup_{t,a}\lvert v_{t}(a)\rvert can be prohibitively large so Freedman fails, while c) \operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{t}}[\sup_{t}\lvert v_{t}(a)\rvert] is small, where \pi_{t} is the player’s policy. Going beyond this paper, we expect this action-dependent bonus technique to also be applicable elsewhere when high-probability bounds are desired but the error to be covered can sometimes be prohibitively large – though its expectation w.r.t. the player’s policy can be well controlled.

  3.

    Finally, because of the unknown transitions and multiple agents, it is also non-trivial to attain an \operatorname{\widetilde{\mathcal{O}}}(\sqrt{K})-style expected regret guarantee. Towards this, we incorporate several state-of-the-art techniques from the recent adversarial RL literature, e.g., the Magnitude-Reduced Estimators proposed by Dai et al. (2023), the Adaptive Freedman Inequality by Zimmert and Lattimore (2022), and a new covariance matrix estimation technique introduced by Liu et al. (2023a).

1.2 Related Work

Tabular Markov Games.

Markov Games date back to Shapley (1953). For tabular Markov Games, where the numbers of states and actions are finite and small, Bai and Jin (2020) gave the first sample-efficient algorithm for two-player zero-sum games. For the harder multi-player general-sum case, the first provably sample-efficient algorithm was developed by Liu et al. (2021), albeit depending on \prod_{i\in[m]}\lvert\mathcal{A}_{i}\rvert (i.e., the “curse of multi-agents”). When non-Markovian policies are allowed, various algorithms based on the V-learning algorithm (Bai et al., 2020) were proposed (Jin et al., 2021, Song et al., 2022, Mao and Başar, 2023); otherwise, the first algorithm for multi-player general-sum games avoiding the curse of multi-agents appeared only recently (Daskalakis et al., 2023), although with a sub-optimal \epsilon^{-3} convergence rate. Cui et al. (2023) and Wang et al. (2023) recently attained the optimal \epsilon^{-2} convergence rate, but their dependencies on S remain improvable.

Markov Games with Function Approximation.

When the state space can be prohibitively large, as in the single-agent case, it is often modeled via function approximation. Early works in this line (Xie et al., 2020, Chen et al., 2022b) considered global function approximations where the function class captures the joint value functions of all the agents, making it hard to avoid the curse of multi-agents. The idea of independent function approximation, i.e., that the function class of each agent only encodes its own value function, was concurrently proposed by Cui et al. (2023) and Wang et al. (2023). However, as mentioned, their sample complexities were sub-optimal in either \epsilon or A, while ours is optimal in both \epsilon and A. Notably, while this paper only focuses on linear function approximations, more general approximation schemes have already been studied in the literature, both globally (see, e.g., (Huang et al., 2022, Jin et al., 2022, Xiong et al., 2022, Chen et al., 2022a, Ni et al., 2022, Zhan et al., 2023)) and independently (Wang et al., 2023).

Markov Decision Processes with Linear Function Approximation.

With only one agent, Markov Games reduce to Markov Decision Processes (MDPs), whose linear function approximation schemes have been extensively studied. When the losses are fixed across episodes, the problem was solved by Jin et al. (2020) and Yang and Wang (2020). When the losses are adversarial (i.e., can arbitrarily vary with time), some types of linear approximation were recently tackled, e.g., linear mixture MDPs (Zhao et al., 2023) or linear-Q MDPs equipped with simulators (Dai et al., 2023). In contrast, other approximation schemes, e.g., linear MDPs, remain open. More detailed discussions can be found in recent papers like (Dai et al., 2023, Kong et al., 2023, Sherman et al., 2023b; a, Liu et al., 2023b).

Concurrent Work by Fan et al. (2024).

After the submission of this paper, we became aware of a concurrent and independent work by Fan et al. (2024), which also studies the sample complexity of finding a CCE in Markov Games with Independent Linear Function Approximations. Different from the online model studied in this paper and in the previous works by Cui et al. (2023) and Wang et al. (2023), they assume a local access model (i.e., there exists a simulator that the learner can query for samples s^{\prime}\sim\operatorname{\mathbb{P}}(\cdot\mid s,\bm{a}) whenever s is a previously visited state). Under this stronger assumption, Fan et al. (2024) achieved an \operatorname{\widetilde{\mathcal{O}}}(m^{2}d^{2}H^{6}\epsilon^{-2}) sample complexity, which resolves the curse of multi-agents, has the optimal dependency on \epsilon, and avoids polynomial dependency on A_{\max}. Technically, by maintaining a core set of well-covered state-action pairs, each agent can independently perform policy learning (via an FTRL-based subroutine), thus avoiding the curse of multi-agents. Moreover, as core sets have sizes independent of A (Yin et al., 2022), their approach avoids \text{poly}(A_{\max}) factors as well. Making our results and those of Fan et al. (2024) completely independent of S, or improving the dependency on m, d, and H, remains a valuable direction for future research. (Footnote 4: Throughout this paper, we omit the \operatorname{\mathcal{O}}(\log S) term (due to Lemma A.2) in the sample complexity bound as it is only logarithmic. However, as discussed by Fan et al. (2024), it would be more favorable to have the \log S factor removed.)

2 Preliminaries

Markov Games.

In (multi-agent general-sum) Markov Games, there are m agents sharing a common state space \mathcal{S}, but each agent i\in[m] has its own action space \mathcal{A}^{i}. The game repeats for several episodes, each of length H. Without loss of generality, assume that \mathcal{S} is layered as \mathcal{S}_{1},\mathcal{S}_{2},\ldots,\mathcal{S}_{H+1} such that transitions only occur from one layer to the next. At the beginning of each episode, the state resets to an initial state s_{1}\in\mathcal{S}_{1}. At the h-th step, each agent i\in[m] observes the current state s_{h} and takes its action a_{h}^{i}\in\mathcal{A}^{i}.

Let the joint action be \bm{a}_{h}=(a_{h}^{1},a_{h}^{2},\ldots,a_{h}^{m})\in\mathcal{A}^{1}\times\mathcal{A}^{2}\times\cdots\times\mathcal{A}^{m}\triangleq\mathcal{A}. The new state s_{h+1}\in\mathcal{S}_{h+1} is independently sampled from a distribution \operatorname{\mathbb{P}}(\cdot\mid s_{h},\bm{a}_{h}) (hidden from the agents). Meanwhile, each agent i\in[m] observes and suffers a loss \ell^{i}(s_{h},\bm{a}_{h})\in[0,1]. (Footnote 5: Adopting notations from the adversarial MDP literature, this paper focuses on losses instead of rewards (i.e., agents minimize total loss instead of maximizing total reward).) Following the convention (Cui et al., 2023, Wang et al., 2023), we assume the loss functions \ell^{i} are deterministic (though kept secret from the agents). The objective of each agent is to minimize the expectation of its total loss, i.e., agent i\in[m] minimizes \operatornamewithlimits{\mathbb{E}}\big{[}\sum_{h=1}^{H}\ell^{i}(s_{h},\bm{a}_{h})\big{]}.

Policies and Value Functions.

A (Markov joint) policy \pi is a joint strategy of all agents, formally defined as \pi\colon\mathcal{S}\to\triangle(\mathcal{A}). Note that this allows the policies of different agents to be correlated. For a Markov joint policy \pi, let \pi^{i} be the policy induced by agent i, and let \pi^{-i} be the joint policy induced by all agents except i. Define \Pi=\{\pi\colon\mathcal{S}\to\triangle(\mathcal{A})\} as the set of all Markov joint policies. Similarly, define \Pi^{i}=\{\pi^{i}\mid\pi\in\Pi\} and \Pi^{-i}=\{\pi^{-i}\mid\pi\in\Pi\}.

Given a joint policy \pi\in\Pi, one can define the following state-value function (V-function in short) induced by \pi for each agent i\in[m] and state s\in\mathcal{S} (where h\in[H] is the layer in which s lies):

V^{i}_{\pi}(s)=\operatornamewithlimits{\mathbb{E}}_{(s_{1},\bm{a}_{1},s_{2},\bm{a}_{2},\ldots,s_{H},\bm{a}_{H})\sim\pi}\left[\sum_{\mathfrak{h}=h}^{H}\ell^{i}(s_{\mathfrak{h}},\bm{a}_{\mathfrak{h}})\middle|s_{h}=s\right],\quad\forall i\in[m],s\in\mathcal{S}_{h}.

Here, (s_{1},\bm{a}_{1},s_{2},\bm{a}_{2},\ldots,s_{H},\bm{a}_{H})\sim\pi denotes a trajectory generated by following \pi, i.e., s_{1} is the initial state, \bm{a}_{h}\sim\pi_{h}(\cdot\mid s_{h}), and s_{h+1}\sim\operatorname{\mathbb{P}}(\cdot\mid s_{h},\bm{a}_{h}). If we are only interested in the h-th state in such a trajectory, we write s\sim_{h}\pi to denote an s_{h}\in\mathcal{S}_{h} generated by following \pi.

Similar to the V-function, the state-action value function (Q-function) can be defined as follows:

Q^{i}_{\pi}(s,a^{i})=\operatornamewithlimits{\mathbb{E}}_{(s_{1},\bm{a}_{1},s_{2},\bm{a}_{2},\ldots,s_{H},\bm{a}_{H})\sim\pi}\left[\sum_{\mathfrak{h}=h}^{H}\ell^{i}(s_{\mathfrak{h}},\bm{a}_{\mathfrak{h}})\middle|(s_{h},\bm{a}_{h}^{i})=(s,a^{i})\right],\quad\forall i\in[m],s\in\mathcal{S}_{h},a^{i}\in\mathcal{A}^{i}.

Slightly abusing the notation \operatorname{\mathbb{P}}, we define an operator \operatorname{\mathbb{P}} as [\operatorname{\mathbb{P}}V](s,\bm{a})=\operatornamewithlimits{\mathbb{E}}_{s^{\prime}\sim\operatorname{\mathbb{P}}(\cdot\mid s,\bm{a})}[V(s^{\prime})]. Then we can simply rewrite the Q-function as Q_{\pi}^{i}(s,a^{i})=\operatornamewithlimits{\mathbb{E}}_{\bm{a}^{-i}\sim\pi^{-i}(s)}\left[(\ell^{i}+\operatorname{\mathbb{P}}V_{\pi}^{i})(s,\bm{a})\right].

Coarse Correlated Equilibrium.

As the agents have different objectives, we can only hope to find an equilibrium policy from which no agent can gain much by unilaterally deviating. In general, calculating the well-known Nash equilibrium is intractable even in normal-form general-sum games, i.e., MGs with H=1 and S=1 (Daskalakis et al., 2009). Hence, people usually consider the (Markov) Coarse Correlated Equilibrium (Daskalakis et al., 2023, Cui et al., 2023, Wang et al., 2023) instead.

Formally, for each agent i\in[m], we fix the strategy of all the remaining agents as \pi^{-i}\in\Pi^{-i} and consider agent i's best response. We define its best-response V-function against \pi^{-i} as

V^{i}_{\dagger,\pi^{-i}}(s)=\min_{\pi^{i}\in\Pi^{i}}V^{i}_{\pi^{i},\pi^{-i}}(s),\quad\forall i\in[m],s\in\mathcal{S}.

A Markov joint policy \pi is a (Markov) \epsilon-CCE if \max_{i\in[m]}\left\{V^{i}_{\pi}(s_{1})-V^{i}_{\dagger,\pi^{-i}}(s_{1})\right\}\leq\epsilon. We measure an algorithm’s performance by the number of samples needed to learn an \epsilon-CCE, namely its sample complexity. When the state space \mathcal{S} is finite and small (which we call a tabular MG), the best-known sample complexity for finding an \epsilon-CCE is \operatorname{\widetilde{\mathcal{O}}}(H^{6}S^{2}A_{\max}\epsilon^{-2}) (Wang et al., 2023, Cui et al., 2023).
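To make this notion concrete, the following minimal sketch (Python with NumPy; the payoff tensors and the joint policy are hypothetical, not from the paper) computes the CCE gap \max_{i}\{V^{i}_{\pi}(s_{1})-V^{i}_{\dagger,\pi^{-i}}(s_{1})\} of a correlated joint distribution in a two-player normal-form game, i.e., an MG with H=1 and S=1; the policy is an \epsilon-CCE exactly when the printed quantity is at most \epsilon.

import numpy as np

# Hypothetical two-player general-sum normal-form game (an MG with H = 1, S = 1):
# loss[i][a1, a2] is the loss of agent i under the joint action (a1, a2).
rng = np.random.default_rng(0)
A1, A2 = 3, 4
loss = [rng.uniform(size=(A1, A2)), rng.uniform(size=(A1, A2))]

# A (possibly correlated) joint policy: a distribution over joint actions.
pi = rng.uniform(size=(A1, A2))
pi /= pi.sum()

def cce_gap(pi, loss):
    # Agent 1: expected loss under pi vs. best response to the marginal of agent 2.
    v1 = float(np.sum(pi * loss[0]))
    best1 = float(np.min(loss[0] @ pi.sum(axis=0)))
    # Agent 2: symmetric computation against the marginal of agent 1.
    v2 = float(np.sum(pi * loss[1]))
    best2 = float(np.min(pi.sum(axis=1) @ loss[1]))
    return max(v1 - best1, v2 - best2)

print("CCE gap of pi:", cce_gap(pi, loss))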

MGs with Independent Linear Function Approximation.

When \mathcal{S} is infinite, Cui et al. (2023) and Wang et al. (2023) concurrently propose to use independent linear function approximation (though with different Bellman completeness assumptions; see Appendix D of Wang et al. (2023)). Their results are \operatorname{\widetilde{\mathcal{O}}}(d^{4}H^{6}m^{2}A_{\max}^{5}\epsilon^{-2}) (Wang et al., 2023) and \operatorname{\widetilde{\mathcal{O}}}(d^{4}H^{10}m^{4}\epsilon^{-4}) (Cui et al., 2023).

Our paper also considers Markov Games with independent linear function approximations. Inspired by linear MDPs in single-agent RL (Jin et al., 2020), we assume the transitions and losses to be linear. For a detailed discussion on the connection between linear MDP and Bellman completeness, we refer the readers to Section 5 of (Jin et al., 2020). Formally, we assume the following:

Assumption 2.1.

For any agent i\in[m], there exists a known d-dimensional feature mapping {\bm{\phi}}\colon\mathcal{S}\times\mathcal{A}^{i}\to\mathbb{R}^{d} such that for any state s\in\mathcal{S}, action a^{i}\in\mathcal{A}^{i}, and any policy \pi\in\Pi^{\text{est}} (Footnote 6: \Pi^{\text{est}} refers to the set of all policies that the algorithm may output, similar to (Cui et al., 2023, Wang et al., 2023).),

\operatornamewithlimits{\mathbb{E}}\big{[}\operatorname{\mathbb{P}}(s^{\prime}\mid s,\bm{a})\,\big{|}\,\bm{a}^{-i}\sim\pi^{-i}(\cdot\mid s),\bm{a}^{i}=a^{i}\big{]}={\bm{\phi}}(s,a^{i})^{\mathsf{T}}{\bm{\mu}}_{\pi^{-i}}^{i}(s^{\prime}),
\operatornamewithlimits{\mathbb{E}}\big{[}\ell^{i}(s,\bm{a})\,\big{|}\,\bm{a}^{-i}\sim\pi^{-i}(\cdot\mid s),\bm{a}^{i}=a^{i}\big{]}={\bm{\phi}}(s,a^{i})^{\mathsf{T}}{\bm{\nu}}_{\pi^{-i}}^{i}(h),

where {\bm{\mu}}^{i}\colon\Pi^{-i}\times\mathcal{S}\to\mathbb{R}^{d} and {\bm{\nu}}^{i}\colon\Pi^{-i}\times[H]\to\mathbb{R}^{d} are both unknown to the agent. Following the convention (Jin et al., 2020, Luo et al., 2021), we assume \lVert{\bm{\phi}}(s,a^{i})\rVert_{2}\leq 1 for all s\in\mathcal{S}, a^{i}\in\mathcal{A}^{i}, i\in[m], and that \max\{\lVert{\bm{\mu}}_{\pi^{-i}}^{i}(\mathcal{S}_{h})\rVert_{2},\lVert{\bm{\nu}}_{\pi^{-i}}^{i}(h)\rVert_{2}\}\leq\sqrt{d} for all h\in[H], \pi^{-i}\in\Pi^{-i}, and i\in[m].

This assumption also implies that the Q-functions of all agents are linear, i.e., there exists some unknown d-dimensional mapping {\bm{\theta}}^{i}\colon\Pi^{\text{est}}\times[H]\to\mathbb{R}^{d} with \lVert\bm{\theta}_{\pi}^{i}(h)\rVert_{2}\leq\sqrt{d}H such that

Q_{\pi}^{i}(s,a^{i})={\bm{\phi}}(s,a^{i})^{\mathsf{T}}{\bm{\theta}}_{\pi}^{i}(h),\quad\forall s\in\mathcal{S},a^{i}\in\mathcal{A}^{i},i\in[m],\pi\in\Pi.
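As a quick illustration of how this linear structure is used downstream, the following sketch (hypothetical feature map and parameter vector, in NumPy) evaluates Q_{\pi}^{i}(s,\cdot)={\bm{\phi}}(s,\cdot)^{\mathsf{T}}{\bm{\theta}}_{\pi}^{i}(h) on one state and forms an EXP3-style exponential-weights policy from it; only the linear form itself comes from Assumption 2.1.

import numpy as np

rng = np.random.default_rng(1)
d, A = 4, 5                                   # feature dimension, |A^i| for one agent

# Hypothetical feature map: one row phi(s, a) per action a, each with norm <= 1.
phi_s = rng.uniform(-1.0, 1.0, size=(A, d))
phi_s /= np.maximum(np.linalg.norm(phi_s, axis=1, keepdims=True), 1.0)

theta = rng.uniform(-0.5, 0.5, size=d)        # stands in for the unknown theta_pi^i(h)

q_values = phi_s @ theta                      # Q_pi^i(s, a) = phi(s, a)^T theta_pi^i(h)
eta = 0.5                                     # learning rate of an exponential-weights update
policy = np.exp(-eta * q_values)              # smaller loss-to-go -> larger probability
policy /= policy.sum()
print("Q-values:", q_values)
print("policy:  ", policy)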

3 Improved AVLPR Framework

Algorithm 1 Improved AVLPR Framework
Input:  #epochs T, potentials \{\Psi_{t,h}^{i}\}_{t\in[T],h\in[H],i\in[m]}, subroutines CCE-Approx and V-Approx.
1:  Set policy \widetilde{\pi}_{0} as the uniform policy for all i\in[m], i.e., \widetilde{\pi}_{0}^{i}(s)\leftarrow\mathrm{Unif}(\mathcal{A}^{i}). Set t_{0}\leftarrow 0.
2:  for t=1,2,\ldots,T do
3:     All agents play according to \widetilde{\pi}_{t-1}. Each agent i\in[m] records the trajectory \{(\widetilde{s}_{t,h},\widetilde{a}_{t,h}^{i})\}_{h=1}^{H}.
4:     Each agent i\in[m] calculates its potential function \Psi_{t,h}^{i} according to the definition.
5:     if t_{0}\neq 0 and \Psi_{t,h}^{i}\leq\Psi_{t_{0},h}^{i}+1 for all t\in[T], h\in[H], and i\in[m] then
6:        Do a “lazy” update by directly setting \widetilde{\pi}_{t}\leftarrow\widetilde{\pi}_{t-1} and {\textsc{Gap}}_{t}\leftarrow{\textsc{Gap}}_{t-1}; continue.
7:     end if
8:     Define \overline{\pi}_{t}\colon\mathcal{S}\to\triangle(\mathcal{A}) as the uniform mixture of all previous policies \widetilde{\pi}_{0},\widetilde{\pi}_{1},\ldots,\widetilde{\pi}_{t-1}, denoted by \overline{\pi}_{t}\leftarrow\frac{1}{t}\sum_{s=0}^{t-1}\widetilde{\pi}_{s} for simplicity.
9:     We will then fill \widetilde{\pi}_{t}\colon\mathcal{S}\to\triangle(\mathcal{A}), {\textsc{Gap}}\colon\mathcal{S}\to\mathbb{R}^{m}, and \overline{V}_{t}\colon\mathcal{S}\to\mathbb{R}^{m} layer-by-layer, in the order of \mathcal{S}_{H},\mathcal{S}_{H-1},\ldots,\mathcal{S}_{1}. Before that, we first initialize \overline{V}_{t}^{i}(s_{H+1})=0 for all i\in[m].
10:     for h=H,H-1,\ldots,1 do
11:        Execute the subroutine {\textsc{CCE-Approx}}_{h}(\overline{\pi}_{t},\overline{V}_{t},t) independently for R\triangleq\operatorname{\mathcal{O}}(\log\frac{1}{\delta}) times. For the r-th execution, record the return value (\widetilde{\pi}_{r}\colon\mathcal{S}_{h}\to\triangle(\mathcal{A}),{\textsc{Gap}}_{r}\colon\mathcal{S}_{h}\to\mathbb{R}^{m}).
12:        For each current-layer state s\in\mathcal{S}_{h}, set (\widetilde{\pi}_{t}(s),{\textsc{Gap}}_{t}(s))\leftarrow(\widetilde{\pi}_{r^{\ast}(s)}(s),{\textsc{Gap}}_{r^{\ast}(s)}(s)) where
r^{\ast}(s)\triangleq\operatornamewithlimits{\mathrm{argmin}}_{r\in[R]}\sum_{i=1}^{m}{\textsc{Gap}}_{r}^{i}(s). (3.1)
13:        Update the current-layer V-function \{\overline{V}_{t}(s)\}_{s\in\mathcal{S}_{h}} (abbreviated as \overline{V}_{t}(\mathcal{S}_{h})) from the next-layer \{\overline{V}_{t}(s)\}_{s\in\mathcal{S}_{h+1}} (or simply \overline{V}_{t}(\mathcal{S}_{h+1})) by calling \overline{V}_{t}(\mathcal{S}_{h})\leftarrow{\textsc{V-Approx}}_{h}(\overline{\pi}_{t},\widetilde{\pi}_{t},\overline{V}_{t}(\mathcal{S}_{h+1}),{\textsc{Gap}}_{t},t).
14:     end for
15:     Update the “last update time” t_{0}\leftarrow t.
16:  end for
Output:  The uniform mixture of all policies \widetilde{\pi}_{0},\widetilde{\pi}_{1},\widetilde{\pi}_{2},\ldots,\widetilde{\pi}_{T}, i.e., \pi_{\text{out}}\leftarrow\frac{1}{T+1}\sum_{t=0}^{T}\widetilde{\pi}_{t}.

Our framework, presented in Algorithm 1, is based on the AVLPR framework proposed by Wang et al. (2023). The main differences are marked in violet. Before introducing these differences, we first overview the original AVLPR framework, which is almost the same as Algorithm 1 except that R=1 — one of the most crucial innovations of our framework, which we will describe later.

Overview of the AVLPR Framework by Wang et al. (2023).

The original framework starts from an arbitrary policy \widetilde{\pi}_{0} and then gradually improves it: In the t-th epoch, all agents together make the next policy \widetilde{\pi}_{t} an \operatorname{\widetilde{\mathcal{O}}}(1/\sqrt{t})-CCE. Thus, the number of epochs needed for an \epsilon-CCE is T=\operatorname{\widetilde{\mathcal{O}}}(\epsilon^{-2}).

In each epoch t\in[T], the agents determine their new policies \widetilde{\pi}_{t} layer-by-layer in reversed order, i.e., h=H,H-1,\ldots,1. Suppose that we are at layer h\in[H] and want to find \{\widetilde{\pi}_{t}(s)\}_{s\in\mathcal{S}_{h}} (abbreviated as \widetilde{\pi}_{t}(\mathcal{S}_{h}) for simplicity). As \{\widetilde{\pi}_{t}(\mathcal{S}_{\mathfrak{h}})\}_{\mathfrak{h}=h+1}^{H} are already calculated, the next-layer V-functions V_{\widetilde{\pi}_{t}}^{i}(\mathcal{S}_{h+1}) can be estimated (denoted by \overline{V}_{t}^{i} in Algorithm 1). Thus, the problem of deciding \widetilde{\pi}_{t}^{i}(\mathcal{S}_{h}) becomes a contextual bandit problem: The context s is sampled from a fixed policy \overline{\pi}_{t} (which only depends on \widetilde{\pi}_{1},\widetilde{\pi}_{2},\ldots,\widetilde{\pi}_{t-1}), and the loss of every (s,a^{i})\in\mathcal{S}_{h}\times\mathcal{A}^{i} is the Q-function induced by the current policy, namely Q_{\widetilde{\pi}_{t}}^{i}(s,a^{i}), which can be estimated via \ell^{i}(s,a^{i}) and \overline{V}^{i}(\mathcal{S}_{h+1}).

Hence, Wang et al. (2023) propose to deploy a contextual bandit algorithm on this layer h. This is abstracted as a plug-in subroutine \textsc{CCE-Approx}_{h} in Algorithm 1. (Footnote 7: In the CCE-Approx subroutine, an iterative approach is also used, which means that \widetilde{\pi}_{t}(\mathcal{S}_{h}) is the average of a few policies \pi_{1},\pi_{2},\ldots,\pi_{K}. Thus, as \pi_{k}^{-i} varies with k, each Q-function Q_{\pi_{k}}^{i}(s,a^{i}) also varies with time, which means that each agent actually faces an adversarial (i.e., non-stationary) contextual bandit problem. We shall see more details in Section 4.) As we briefly mention in the Technical Overview, we should ensure that i) the joint policy \widetilde{\pi}\colon\mathcal{S}_{h}\to\triangle(\mathcal{A}) has a calculable sub-optimality gap on all s\in\mathcal{S}_{h} for all i\in[m] w.h.p. (see Equation 3.2), and ii) this sub-optimality gap estimation {\textsc{Gap}}\colon\mathcal{S}_{h}\to\mathbb{R}^{m} has a bounded expectation w.r.t. \widetilde{\pi}_{t} (see Equation 3.4).

Afterward, Wang et al. (2023) estimate the V-function of the current layer (i.e., V_{\widetilde{\pi}}^{i}(\mathcal{S}_{h})), which is needed when we move on to the previous layer and invoke \textsc{CCE-Approx}_{h-1}. This is done by another subroutine called \textsc{V-Approx}_{h}, which must ensure an “optimistic” estimation (see Equation 3.3).

By repeating this process for all h, the current epoch t terminates. It can be inferred from Equations 3.2, 3.3 and 3.4 that \widetilde{\pi}_{t} is an \operatorname{\widetilde{\mathcal{O}}}(1/\sqrt{t})-CCE. To further reduce the sample complexity, Wang et al. (2023) propose another trick to “lazily” update the policies. Informally, if the Gap functions remain similar (measured by the increments of the potential function \Psi_{t,h}^{i}; cf. Line 5 of Algorithm 1), then we directly reuse the previous policy \widetilde{\pi}_{t-1}. By properly choosing \Psi^{i} so that such updates happen only \operatorname{\mathcal{O}}(\log T) times, an \operatorname{\widetilde{\mathcal{O}}}(\epsilon^{-2}) sample complexity is obtained. The main theorem of the original AVLPR framework can be summarized as follows.

Theorem 3.1 (Main Theorem of AVLPR (Wang et al., 2023, Theorem 18); Informal).

Suppose that

  1.

    (Per-state no-regret) If we call the subroutine {\textsc{CCE-Approx}}_{h}(\overline{\pi},\overline{V},K), then the returned policy \widetilde{\pi}\colon\mathcal{S}_{h}\to\triangle(\mathcal{A}) and gap estimation {\textsc{Gap}}\colon\mathcal{S}_{h}\to\mathbb{R}^{m} shall ensure the following w.p. 1-\delta:

    \max_{\pi_{\ast}^{i}\in\Pi_{h}^{i}}\left\{\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}}-\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{\ast}^{i}\times\widetilde{\pi}^{-i}}\right)\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}_{h+1}\overline{V}^{i}\big{)}(s,a)\right]\right\}\leq{\textsc{Gap}}^{i}(s),\quad\forall i\in[m],s\in\mathcal{S}_{h}, (3.2)

    where {\textsc{Gap}}^{i}(s) must be a deterministic (i.e., non-stochastic) function of the form G_{h}^{i}(s,\overline{\pi},K,\delta).

  2.

    (Optimistic V-function) If we call the subroutine {\textsc{V-Approx}}_{h}(\overline{\pi},\widetilde{\pi},\overline{V},{\textsc{Gap}},K), then the returned V-function estimation \widehat{V}\colon\mathcal{S}_{h}\to\mathbb{R}^{m} should ensure the following w.p. 1-\delta:

    \widehat{V}^{i}(s)\in\bigg{[}\min\bigg{\{}\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}}\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}_{h+1}\overline{V}^{i}\big{)}(s,a)\right]+{\textsc{Gap}}^{i}(s),\,H-h+1\bigg{\}},\ \operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}}\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}_{h+1}\overline{V}^{i}\big{)}(s,a)\right]+2{\textsc{Gap}}^{i}(s)\bigg{]},\quad\forall i\in[m],s\in\mathcal{S}_{h}. (3.3)
  3.

    (Pigeon-hole condition) There exists a deterministic complexity measure L such that w.p. 1-\delta,

    \sum_{t=1}^{T}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[G_{h}^{i}(s,\mathrm{Unif}(\{\widetilde{\pi}_{\tau}\}_{\tau\in[t]}),t,\delta)\right]\leq\sqrt{LT\log^{2}\frac{T}{\delta}},\quad\forall i\in[m],h\in[H], (3.4)

    where G_{h}^{i} is the deterministic Gap function defined in Equation 3.2. Informally, if we execute the whole \{\widetilde{\pi}_{t}(\mathcal{S}_{h})\}_{h\in[H]}, it must have an expected sub-optimality gap of order \operatorname{\widetilde{\mathcal{O}}}(\sqrt{L/t}) in each layer.

  4.

    (Potential function) The potential functions \{\Psi_{t,h}^{i}\}_{t\in[T],h\in[H],i\in[m]} are chosen s.t. there exists a constant d_{\text{replay}} ensuring that the condition on Line 5 is violated at most d_{\text{replay}}\log T times. Meanwhile,

    \Psi_{t,h}^{i}\leq\Psi_{t_{0},h}^{i}+1\Longrightarrow G_{h}^{i}(s,\overline{\pi}_{t_{0}},t_{0},\delta)\leq 8G_{h}^{i}(s,\overline{\pi}_{t},t,\delta),\quad\forall i\in[m],h\in[H]. (3.5)

Then, by setting T=\operatorname{\widetilde{\mathcal{O}}}(H^{2}L\epsilon^{-2}), an \epsilon-CCE can be obtained within \operatorname{\widetilde{\mathcal{O}}}(H^{3}Ld_{\text{replay}}\epsilon^{-2}) samples.

One can see that Wang et al. (2023) require the Gap function to be deterministic (highlighted in Theorem 3.1). However, as we mentioned in Footnote 7, this corresponds to crafting high-probability regret bounds for adversarial linear contextual bandits, which is still open in the literature (Olkhovskaya et al., 2023). Hence, when facing MGs with (independent) linear function approximation, it is highly non-trivial to construct a non-stochastic high-probability upper bound like Equation 3.2 by deploying a regret-minimization algorithm at each state s\in\mathcal{S}_{h}. Consequently, Wang et al. (2023) adopt a pure exploration algorithm (i.e., using uniform policies in the subroutine CCE-Approx), which brings undesirable \mathrm{poly}(A_{\max}) dependencies.

Loosened High-Probability Bound Requirement.

To bypass this issue, instead of forcing each agent to do pure exploration like (Wang et al., 2023), our Improved AVLPR framework allows {\textsc{Gap}}(s) to be a stochastic (i.e., data-dependent) upper bound on the actual sub-optimality gap, as shown in Equation 3.6. The differences between Theorems 3.1 and 3.2 are highlighted in violet.

Theorem 3.2 (Main Theorem of Improved AVLPR; Informal).

Suppose that

  1.

    (Per-state no-regret) If we call the subroutine {\textsc{CCE-Approx}}_{h}(\overline{\pi},\overline{V},K), then the returned policy \widetilde{\pi}\colon\mathcal{S}_{h}\to\triangle(\mathcal{A}) and gap estimation {\textsc{Gap}}\colon\mathcal{S}_{h}\to\mathbb{R}^{m} shall ensure the following w.p. 1-\delta:

    \max_{\pi_{\ast}^{i}\in\Pi_{h}^{i}}\left\{\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}}-\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{\ast}^{i}\times\widetilde{\pi}^{-i}}\right)\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}_{h+1}\overline{V}^{i}\big{)}(s,a)\right]\right\}\leq{\textsc{Gap}}^{i}(s),\quad\forall i\in[m],s\in\mathcal{S}_{h}, (3.6)

    where {\textsc{Gap}}^{i}(s) is a random variable whose randomness comes from the environment (when generating the trajectories), the agents (when playing their policies), and the algorithm's internal randomness.

  2.

    (Optimistic V-function) If we call the subroutine {\textsc{V-Approx}}_{h}(\overline{\pi},\widetilde{\pi},\overline{V},{\textsc{Gap}},K), then the returned V-function estimation \widehat{V}\colon\mathcal{S}_{h}\to\mathbb{R}^{m} should ensure Equation 3.3 w.p. 1-\delta.

  3.

    (Pigeon-hole condition & Potential function) The potential functions \{\Psi_{t,h}^{i}\}_{t\in[T],h\in[H],i\in[m]} are chosen s.t. there exists a constant d_{\text{replay}} ensuring that the condition on Line 5 is violated at most d_{\text{replay}}\log T times. Meanwhile, there also exists a deterministic L ensuring the following w.p. 1-\delta:

    \sum_{t=1}^{T}\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[{\color[rgb]{.5,0,.5}\operatornamewithlimits{\mathbb{E}}_{{\textsc{Gap}}}[{\textsc{Gap}}_{t}^{i}(s)]}\right]\leq m\sqrt{LT\log^{2}\frac{T}{\delta}},\quad\forall h\in[H], (3.7)

    where the expectation is taken w.r.t. the randomness in calculating the random variable Gap.

Then we have a conclusion similar to Theorem 3.1: \operatorname{\widetilde{\mathcal{O}}}(m^{2}H^{3}Ld_{\text{replay}}\epsilon^{-2}) samples give an \epsilon-CCE.

We defer the formal proof of this theorem to Appendix A and only sketch the idea here.

Proof Sketch of Theorem 3.2.

The idea is to show that our picked policy-gap pair nearly satisfies the conditions in Theorem 3.1. We can pick different r’s for different states s\in\mathcal{S}_{h} (i.e., r^{\ast}(s)) because they are in the same layer. However, on a single state s\in\mathcal{S}_{h}, all agents i\in[m] must share the same r^{\ast}(s). This is because the expectation in Equation 3.6 is w.r.t. the opponents’ policies: if any agent deviates from the current r, all other agents will observe a different sequence of losses and thus break Equation 3.2.

Now we focus on a single state s\in\mathcal{S}_{h}. By Equation 3.6, any construction r\in[R] ensures w.h.p. that V_{\widetilde{\pi}_{r}}^{i}(s)\geq V_{\ast}^{i}(s)-{\textsc{Gap}}_{r}^{i}(s). Moreover, Equation 3.3 ensures \widehat{V}_{r}^{i}(s)\in[V_{\ast}^{i}(s),V_{\widetilde{\pi}_{r}}^{i}(s)+2{\textsc{Gap}}_{r}^{i}(s)]. So we only need to find an r whose {\textsc{Gap}}_{r}^{i}(s) is “small” for all i\in[m] – more precisely, we want an r^{\ast}(s) such that

\sum_{i=1}^{m}{\textsc{Gap}}_{r^{\ast}(s)}^{i}(s)\leq 2\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}[{\textsc{Gap}}^{i}(s)],\quad\forall s\in\mathcal{S}_{h},i\in[m].

By Markov's inequality, for each r\in[R], \Pr\left\{\sum_{i=1}^{m}{\textsc{Gap}}_{r}^{i}(s)>2\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}[{\textsc{Gap}}^{i}(s)]\right\}\leq\frac{1}{2}. Thus, by the choice of R=\operatorname{\mathcal{O}}(\log\frac{1}{\delta}), the r minimizing \sum_{i=1}^{m}{\textsc{Gap}}_{r}^{i}(s) (i.e., the r^{\ast}(s) defined in Equation 3.1) ensures that \sum_{i=1}^{m}{\textsc{Gap}}_{r^{\ast}(s)}^{i}(s)\leq 2\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}[{\textsc{Gap}}^{i}(s)] with probability 1-\delta.

As states in the same layer are independent, in the sense that the policy on s^{\prime} does not affect the Q-function of s when s,s^{\prime}\in\mathcal{S}_{h}, we can combine all the \{(\widetilde{\pi}_{r^{\ast}(s)}(s),{\textsc{Gap}}_{r^{\ast}(s)}(s))\}_{s\in\mathcal{S}_{h}} into (\widetilde{\pi},\widetilde{\textsc{Gap}}). Equation 3.6 then ensures Condition 1 of Theorem 3.1, and Condition 3 is also closely related to our Equation 3.7. Besides the different Gap, the potential-function part is also slightly different from Equation 3.5. This is because the original version is mainly tailored to uniform exploration policies – as we will see in Section 5.2, we design a potential function similar to that of Cui et al. (2023), as we both use \overline{\pi}_{t} as the exploration policy in CCE-Approx. ∎
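The following minimal sketch (NumPy; the gap distribution is hypothetical and only chosen to be heavy-tailed) illustrates the boosting step behind Equation 3.1: running CCE-Approx R=\operatorname{\mathcal{O}}(\log\frac{1}{\delta}) times and keeping the run with the smallest total gap makes the kept total at most twice its expectation, except with probability roughly 2^{-R}.

import numpy as np

rng = np.random.default_rng(2)
m, delta = 3, 1e-3
R = int(np.ceil(np.log2(1.0 / delta)))        # R = O(log(1/delta)) independent runs

def run_cce_approx_once():
    # Stand-in for one call of CCE-Approx on a fixed state s: a nonnegative,
    # heavy-tailed stochastic Gap^i(s) for each agent i (squared Exp(1), mean 2).
    return rng.exponential(scale=1.0, size=m) ** 2

gaps = np.stack([run_cce_approx_once() for _ in range(R)])    # shape (R, m)
r_star = int(np.argmin(gaps.sum(axis=1)))                     # the rule in Equation (3.1)

mean_total = 2.0 * m                          # E[sum_i Gap^i(s)] for this toy distribution
print("kept total gap:", gaps[r_star].sum())
print("2 * expectation:", 2.0 * mean_total)   # kept total <= this, w.p. >= 1 - 2^{-R}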

4 Improved CCE-Approx Subroutine

Algorithm 2 Improved CCE-Approx Subroutine for Independent Linear Markov Games
Input:  Policy mixture \overline{\pi}, next-layer V-function \overline{V}, epoch length K, failure probability \delta.
Output:  A mixed policy \widetilde{\pi}\colon\mathcal{S}_{h}\to\triangle(\mathcal{A}) and a data-dependent gap estimation {\textsc{Gap}}\colon\mathcal{S}_{h}\to\mathbb{R}^{m}.
1:  Set learning rate \eta, bonus parameters \beta_{1},\beta_{2}, and regularization parameter \gamma=\frac{5d}{K}\log\frac{6d}{\delta}.
2:  All agents play \overline{\pi} for K times. The i-th agent memorizes the state-action pairs in the h-th layer as \{(s_{k}^{\text{cov}},a_{k,i}^{\text{cov}})\}_{k=1}^{K} and calculates the estimated (inverse) covariance matrix as
\widehat{\Sigma}_{t,i}^{\dagger}=\left(\frac{1}{K}\sum_{\kappa=1}^{K}\bm{\phi}(s_{\kappa}^{\text{cov}},a_{\kappa,i}^{\text{cov}})\bm{\phi}(s_{\kappa}^{\text{cov}},a_{\kappa,i}^{\text{cov}})^{\mathsf{T}}+\gamma I\right)^{-1}, (4.1)
3:  All agents play \overline{\pi} for K times. The i-th agent memorizes the state-action pairs in the h-th layer as \{(s_{k}^{\text{mag}},a_{k,i}^{\text{mag}})\}_{k=1}^{K} and calculates the magnitude-reduced estimator (Dai et al., 2023) as
\widehat{m}_{k}^{i}(s,a)=\frac{H}{K}\sum_{\kappa=1}^{K}\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{\kappa}^{\text{mag}},\widetilde{a}_{\kappa,i}^{\text{mag}})\right)_{-},\quad\forall s\in\mathcal{S}_{h},a\in\mathcal{A}^{i}. (4.2)
4:  for k=1,2,\ldots,K do
5:     Each agent i\in[m] uses EXP3 to calculate its policy \pi_{k}^{i}(a\mid s) for all s\in\mathcal{S}_{h} and a\in\mathcal{A}^{i}:
\pi_{k}^{i}(a\mid s)\propto\exp\left(-\eta\left(\sum_{\kappa=1}^{k-1}(\widehat{Q}_{\kappa}^{i}(s,a)-B_{\kappa}^{i}(s,a))\right)\right), (4.3)
where B_{k}^{i}(s,a)=\beta_{1}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\beta_{2}\sum_{j=1}^{d}\bm{\phi}(s,a)[j]\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}(\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j], with \cdot[j] denoting the j-th coordinate. Roughly, \eta\approx 1/\sqrt{K}, \beta_{1}\approx\frac{dH}{\sqrt{K}}, and \beta_{2}\approx\frac{H}{K}.
6:     for i=1,2,\ldots,m do
7:        The i-th agent plays \overline{\pi}, and every other agent j\neq i plays \overline{\pi} for the first h-1 steps and \pi_{k}^{j} for the h-th one. Agent i records its observed states, actions, and losses as (s_{k,\mathfrak{h}}^{i},a_{k,\mathfrak{h}}^{i},\ell_{k,\mathfrak{h}}^{i})_{\mathfrak{h}=1}^{h+1}.
8:     end for
9:     Each agent i\in[m] estimates the kernel of the Q-function induced by \overline{V}^{i} and \pi_{k} as
\widehat{\bm{\theta}}_{k}^{i}=\widehat{\Sigma}_{t,i}^{\dagger}\,{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})\left(\ell_{k,h}^{i}+\overline{V}^{i}(s_{k,h+1}^{i})\right). (4.4)
10:     Each agent i\in[m] adopts magnitude-reduced estimators (Dai et al., 2023) to calculate
\widehat{Q}_{k}^{i}(s,a)={\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}-H\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})\right)_{-}+\widehat{m}_{k}^{i}(s,a). (4.5)
11:  end for
12:  return  (\widetilde{\pi},{\textsc{Gap}}) where \widetilde{\pi}=\frac{1}{K}\sum_{k=1}^{K}\pi_{k} and the data-dependent Gap is defined in Equation B.2.

To improve the sample complexity for MGs with independent linear function approximation, changing the framework alone is insufficient. This section introduces our improved implementation of the CCE-Approx subroutine, presented in Algorithm 2. While our framework remains mostly the same as AVLPR, the subroutine CCE-Approx is very different from that of Wang et al. (2023). Here, we briefly overview the main technical innovations in Algorithm 2.

4.1 Magnitude-Reduction Loss Estimators

In this regret-minimization task, we deploy EXP3 on each state, similar to Cui et al. (2023). Motivated by the recent progress on linear MDPs (Dai et al., 2023), we aggressively set the regularization parameter in the covariance estimation (the \gamma in Equation 4.1) to \operatorname{\widetilde{\mathcal{O}}}(K^{-1}), instead of the usual choice of \gamma=\operatorname{\widetilde{\mathcal{O}}}(K^{-1/2}) (Cui et al., 2023). This is because the regret analysis exhibits factors like \operatorname{\widetilde{\mathcal{O}}}(\frac{\gamma}{\beta_{1}}K+\frac{\beta_{1}}{\gamma}+\beta_{1}K) – to get \operatorname{\widetilde{\mathcal{O}}}(\sqrt{K}) regret (which is necessary for \epsilon^{-2} sample complexity), we must set \beta_{1}=\operatorname{\widetilde{\mathcal{O}}}(K^{-1/2}) and \gamma=\operatorname{\widetilde{\mathcal{O}}}(K^{-1}).

However, the downside of this aggressive tuning of \gamma is that the estimated Q-function, namely \bm{\phi}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}=\bm{\phi}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s_{k,h}^{i},a_{k,h}^{i})(\ell_{k,h}^{i}+\overline{V}^{i}(s_{k,h+1}^{i})), can lie anywhere in [-\gamma^{-1},\gamma^{-1}]=[-\operatorname{\mathcal{O}}(K),\operatorname{\mathcal{O}}(K)]. To comply with the EXP3 requirement that all loss estimators are at least -\operatorname{\mathcal{O}}(1/\eta) (cf. Lemma E.10) while still setting the learning rate \eta as \operatorname{\mathcal{O}}(1/\sqrt{K}), we adopt the Magnitude-Reduced Estimator proposed by Dai et al. (2023) to “move” the Q-estimators into the range [-\operatorname{\mathcal{O}}(\sqrt{K}),\operatorname{\mathcal{O}}(K)] by setting \widehat{Q}_{k}^{i}(s,a)={\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}-H({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i}))_{-}+\widehat{m}_{k}^{i}(s,a) (as we did in Equation 4.5). More details can be found in Lemma E.1.
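As a concrete illustration, the sketch below (NumPy; the feature map and sampled data are hypothetical stand-ins) assembles the pieces of Equations 4.1, 4.2, 4.4 and 4.5 for one state-action query: the regularized inverse covariance with the aggressive \gamma\approx\operatorname{\widetilde{\mathcal{O}}}(K^{-1}), the plain linear estimate {\bm{\phi}}^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}, and the magnitude-reduced correction that subtracts the negative part and adds back its empirical mean \widehat{m}_{k}^{i}.

import numpy as np

rng = np.random.default_rng(3)
d, K, H, A, S = 4, 200, 5, 6, 10
delta = 1e-2
gamma = 5 * d / K * np.log(6 * d / delta)     # aggressive regularization, roughly O~(1/K)

def feat(s, a):
    # Hypothetical feature map phi(s, a) with ||phi||_2 <= 1 (the real one comes from the game).
    v = np.cos(np.arange(1, d + 1) * (a + 1) + s)
    return v / np.linalg.norm(v)

# Cover phase (Equation 4.1): inverse covariance from K on-policy samples.
Phi_cov = np.stack([feat(rng.integers(S), rng.integers(A)) for _ in range(K)])
Sigma_dag = np.linalg.inv(Phi_cov.T @ Phi_cov / K + gamma * np.eye(d))

# Magnitude phase (Equation 4.2): another K samples used only for the correction term.
Phi_mag = np.stack([feat(rng.integers(S), rng.integers(A)) for _ in range(K)])

def q_hat(s, a, s_kh, a_kh, loss_plus_V):
    """Magnitude-reduced estimator of Equation 4.5 for one round k."""
    phi_sa = feat(s, a)
    theta_hat = Sigma_dag @ feat(s_kh, a_kh) * loss_plus_V              # Equation 4.4
    neg_part = min(phi_sa @ Sigma_dag @ feat(s_kh, a_kh), 0.0)          # (x)_- = min(x, 0)
    m_hat = (H / K) * np.minimum(Phi_mag @ Sigma_dag @ phi_sa, 0.0).sum()
    return phi_sa @ theta_hat - H * neg_part + m_hat

print(q_hat(s=1, a=2, s_kh=4, a_kh=3, loss_plus_V=0.7 + 2.3))           # ell + next-layer V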

4.2 Action-Dependent Bonuses

As we are using \widehat{\bm{\theta}}_{k}^{i} instead of the actual \bm{\theta}_{k}^{i} when defining \widehat{Q}_{k}^{i} in Equation 4.5, an estimation error is incurred. Focusing on a single state s\in\mathcal{S}_{h}, this estimation error occurs twice in the analysis, as \operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[\text{Est Err}(s,a)] and \operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{\ast}^{i}(\cdot\mid s)}[\text{Est Err}(s,a)]. These two terms are called Bias-1 and Bias-2 in our analysis (see Equation 5.2).

The former term, Bias-1, is relatively easy to control: Because \pi_{k}^{i} is known, as we elaborate in Section 5.1, we can bound it using variants of the Freedman Inequality. For the latter term, Bias-2, the aforementioned approach no longer works due to the unknown \pi_{\ast}^{i}. One common technique in the literature is to design bonuses: Suppose that we would like to cover \sum_{k=1}^{K}v_{k}^{i}(s,a_{\ast}^{i}), where a_{\ast}^{i} is the optimal action on s in hindsight and v_{k}^{i} is an abstract quantity (e.g., the Est Err). We then design bonuses B_{k}^{i}(s,a) such that

\sum_{k=1}^{K}B_{k}^{i}(s,a_{\ast}^{i})\gtrsim\sum_{k=1}^{K}v_{k}^{i}(s,a_{\ast}^{i})\ \textit{w.h.p.},\quad\forall s\in\mathcal{S}_{h},i\in[m]. (4.6)

In this way, if we feed \widehat{Q}_{k}^{i}-B_{k}^{i} instead of \widehat{Q}_{k}^{i} into the EXP3 algorithm deployed at s\in\mathcal{S}_{h}, we can roughly get (see Lemma E.10 for the original EXP3 guarantee and Theorem B.2 for this form)

\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}(\pi_{k}^{i}(a\mid s)-\pi_{\ast}^{i}(a\mid s))(\widehat{Q}_{k}^{i}(s,a)-B_{k}^{i}(s,a))\lesssim\frac{\log\lvert\mathcal{A}^{i}\rvert}{\eta}+\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[\eta\widehat{Q}_{k}^{i}(s,a)^{2}+B_{k}^{i}(s,a)\right],

and conclude that (the second term on the LHS also appears in the regret decomposition in Equation 5.2)

\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{\ast}^{i}(\cdot\mid s)}[v_{k}^{i}(s,a)]+\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}(\pi_{k}^{i}(a\mid s)-\pi_{\ast}^{i}(a\mid s))\widehat{Q}_{k}^{i}(s,a)\lesssim\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[B_{k}^{i}(s,a)\right]+\frac{\log\lvert\mathcal{A}^{i}\rvert}{\eta}+\eta\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[\widehat{Q}_{k}^{i}(s,a)^{2}\right], (4.7)

which means that we replace the unknown \sum_{k=1}^{K}v_{k}^{i}(s,a_{\ast}^{i}) with the known \sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[B_{k}^{i}(s,a)].

A traditional way of designing B_{k}^{i}(s,a) is via the classical Freedman Inequality (Freedman, 1975), which roughly claims that if \lvert v_{k}^{i}(s,a)\rvert\leq V^{i} a.s. for all s\in\mathcal{S}_{h},a\in\mathcal{A}^{i},k\in[K], we have (Footnote 8: The expectation of (v_{k}^{i}(s,a_{\ast}^{i}))^{2} here should actually condition on the filtration \mathcal{F}_{k-1} as v_{k}^{i} is a stochastic process.)

\sum_{k=1}^{K}v_{k}^{i}(s,a_{\ast}^{i})\lesssim\sqrt{\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}[(v_{k}^{i}(s,a_{\ast}^{i}))^{2}]}+V^{i}\log\frac{1}{\delta}\ \textit{w.h.p.},\quad\forall s\in\mathcal{S}_{h},i\in[m].

In problems like linear bandits, V^{i} is typically as small as \operatorname{\widetilde{\mathcal{O}}}(\sqrt{K}) (see, e.g., (Zimmert and Lattimore, 2022)). Therefore, directly picking B_{k}^{i}(s,a)\gtrsim\sqrt{\operatornamewithlimits{\mathbb{E}}[(v_{k}^{i}(s,a))^{2}]} can ensure Equation 4.6, as the V^{i}\log\frac{1}{\delta} part can directly go into the regret bound. However, in our case, due to the aggressive choice of \gamma=\operatorname{\widetilde{\mathcal{O}}}(K^{-1}), V^{i} can be as huge as \operatorname{\widetilde{\mathcal{O}}}(K) and such an approach fails.

To tackle this, we observe that if we find a function V^{i}\colon\mathcal{S}_{h}\times\mathcal{A}^{i}\to\mathbb{R} such that

\lvert v_{k}^{i}(s,a)\rvert\leq V^{i}(s,a)\ \textit{a.s.},\ \forall s\in\mathcal{S}_{h},a\in\mathcal{A}^{i},k\in[K],\quad\text{and}\quad\frac{1}{K}\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[V^{i}(s,a)]\text{ is well-bounded},

we can set an action-dependent bonus of

B_{k}^{i}(s,a)\gtrsim\sqrt{\operatornamewithlimits{\mathbb{E}}[(v_{k}^{i}(s,a))^{2}\mid\mathcal{F}_{k-1}]}+\frac{1}{K}V^{i}(s,a),\quad\forall s\in\mathcal{S}_{h},a\in\mathcal{A}^{i},k\in[K]. (4.8)

Using the Adaptive Freedman Inequality (see Lemma E.7) given by Zimmert and Lattimore (2022),

\sum_{k=1}^{K}v_{k}^{i}(s,a_{\ast}^{i})\lesssim V^{i}(s,a_{\ast}^{i})+\sqrt{\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}[(v_{k}^{i}(s,a_{\ast}^{i}))^{2}]}\ \textit{w.h.p.},

which implies the bonus condition \sum_{k=1}^{K}B_{k}^{i}(s,a_{\ast}^{i})\geq\sum_{k=1}^{K}v_{k}^{i}(s,a_{\ast}^{i}) in Equation 4.6. Meanwhile, the extra cost of \sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[B_{k}^{i}(s,a)] (see the RHS of Equation 4.7) remains bounded because we assume \frac{1}{K}\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[V^{i}(s,a)] can be controlled. Finally, even when V^{i}(s,a)=\operatorname{\widetilde{\mathcal{O}}}(K), the bonus function B_{k}^{i}(s,a) remains small since \frac{1}{K}V^{i}(s,a)=\operatorname{\widetilde{\mathcal{O}}}(1), which is necessary as the EXP3 regret guarantee requires \widehat{Q}_{k}^{i}-B_{k}^{i}\gtrsim-\operatorname{\mathcal{O}}(1/\eta).

In a nutshell, our action-dependent bonus technique allows the estimation errors v_{k}^{i}(s,a) to take more extreme values on rarely-visited state-action pairs (s,a): compared to the classical approach, which requires \lvert v_{k}^{i}(s,a)\rvert\leq V^{i}=\operatorname{\widetilde{\mathcal{O}}}(\sqrt{K}) uniformly for all s\in\mathcal{S}_{h},a\in\mathcal{A}^{i},k\in[K], our V^{i}(s,a) can occasionally be of order \operatorname{\widetilde{\mathcal{O}}}(K) on some state-action pairs, as long as \frac{1}{K}\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[V^{i}(s,a)] remains small. We expect this technique to be useful in other problems where high-probability bounds are required.
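The numerical sketch below (hypothetical error model, constants dropped) mirrors Equation 4.8: one rarely-spiking action has a per-action magnitude bound V^{i}(s,a) of order K while the others stay \operatorname{\mathcal{O}}(1), and the accumulated action-dependent bonus still dominates the accumulated error on the comparator action, as Equation 4.6 requires.

import numpy as np

rng = np.random.default_rng(4)
K, A = 2000, 5
V = np.full(A, 4.0)                           # per-action magnitude bounds V^i(s, a)
V[0] = 0.5 * K                                # action 0 may occasionally see huge errors

# Zero-mean errors v_k(s, a): action 0 is hit by a +-V[0] spike w.p. 1/K per round,
# every other entry stays in [-1, 1].
v = rng.uniform(-1.0, 1.0, size=(K, A))
spikes = rng.random(K) < 1.0 / K
v[spikes, 0] = V[0] * rng.choice([-1.0, 1.0], size=int(spikes.sum()))

second_moment = v.var(axis=0) + 1e-12         # stand-in for E[(v_k(s,a))^2 | F_{k-1}]
B = np.sqrt(second_moment)[None, :] + V[None, :] / K   # Equation (4.8), constants dropped

a_star = 0                                    # comparator action in hindsight
print("sum_k v_k(s, a*) =", v[:, a_star].sum())
print("sum_k B_k(s, a*) =", B[:, a_star].sum())        # should dominate the line above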

4.3 Covariance Matrix Estimation

When analyzing linear regression, one critical step is to investigate the quality of the “covariance matrix estimation”. Typically, we would like to estimate the covariance matrix \Sigma of a d-dimensional distribution \mathcal{D} by some \widehat{\Sigma}. Various approaches try to control either the additive error \lVert\widehat{\Sigma}-\Sigma\rVert_{2} (Neu and Olkhovskaya, 2020, Luo et al., 2021) or the multiplicative error \lVert\widehat{\Sigma}(\gamma I+\Sigma)^{-1}\rVert_{2} (Dai et al., 2023, Sherman et al., 2023b), but the resulting convergence rate is at most \operatorname{\widetilde{\mathcal{O}}}(n^{-1/4}), where n is the number of samples from \mathcal{D}. Recently, Liu et al. (2023a) bypassed this limitation by considering \text{Tr}\big{(}\widehat{\Sigma}^{-1/2}(\widehat{\Sigma}-\Sigma)\big{)} and gave an \operatorname{\widetilde{\mathcal{O}}}(n^{-1/2}) convergence rate; this technique is also adopted in our analysis to obtain an \operatorname{\widetilde{\mathcal{O}}}(\sqrt{K}) regret. See Section E.2 for more details.
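For concreteness, the sketch below (hypothetical feature distribution) forms the plug-in estimate \widehat{\Sigma}=\frac{1}{n}\sum_{k}\bm{\phi}_{k}\bm{\phi}_{k}^{\mathsf{T}}+\gamma I used in Equation 4.1 and reports both the additive error and the trace-style quantity \text{Tr}\big{(}\widehat{\Sigma}^{-1/2}(\widehat{\Sigma}-\Sigma)\big{)} considered by Liu et al. (2023a); only the construction of \widehat{\Sigma} mirrors the paper, while the distribution and the handling of the \gamma offset are illustrative.

import numpy as np

rng = np.random.default_rng(5)
d, n = 4, 4096
gamma = 5 * d / n * np.log(6 * d / 1e-2)      # the regularization level of Equation (4.1)

# Hypothetical feature distribution: phi drawn uniformly from a fixed set of unit vectors.
atoms = rng.normal(size=(16, d))
atoms /= np.linalg.norm(atoms, axis=1, keepdims=True)
Sigma_true = atoms.T @ atoms / len(atoms)     # the population covariance E[phi phi^T]

Phi = atoms[rng.integers(len(atoms), size=n)] # n i.i.d. feature samples
Sigma_hat = Phi.T @ Phi / n + gamma * np.eye(d)

diff = Sigma_hat - gamma * np.eye(d) - Sigma_true      # estimation error (gamma removed)
additive_err = np.linalg.norm(diff, 2)

# Symmetric inverse square root of Sigma_hat via its eigendecomposition.
w, U = np.linalg.eigh(Sigma_hat)
Sigma_hat_inv_sqrt = U @ np.diag(w ** -0.5) @ U.T
trace_err = np.trace(Sigma_hat_inv_sqrt @ diff)

print("additive error   :", additive_err)
print("trace-style error:", trace_err)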

4.4 Main Guarantee of Algorithm 2

Incorporating all these techniques, our Algorithm 2 ensures Equations 3.6 and 3.7. We make the following claims, whose formal statements are presented in Theorems B.1 and C.1.

Theorem 4.1 (Gap is w.h.p. Pessimistic; Informal).

When Algorithm 2 is configured properly, for each execution of CCE-Approxh\textsc{CCE-Approx}_{h}, the condition in Equation 3.6 holds with probability 1𝒪~(δ)1-\operatorname{\widetilde{\mathcal{O}}}(\delta), i.e.,

maxπiΠhi{(𝔼aπ~𝔼aπi×π~i)[(i+h+1V¯i)(s,a)]}Gapi(s),i[m],s𝒮h with probability 1𝒪~(δ).\max_{\pi_{\ast}^{i}\in\Pi_{h}^{i}}\left\{\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}}-\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{\ast}^{i}\times\widetilde{\pi}^{-i}}\right)\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}_{h+1}\overline{V}^{i}\big{)}(s,a)\right]\right\}\leq{\textsc{Gap}}^{i}(s),\quad\forall i\in[m],s\in\mathcal{S}_{h}\text{ with probability }1-\operatorname{\widetilde{\mathcal{O}}}(\delta).
Theorem 4.2 (Algorithm 2 Allows a Potential Function; Informal).

Consider the following potential:

Ψt,hi=(τ=1tϕ(s~τ,h,a~τ,hi)Σ^τ,i2)/(64log8mHTδ),t[T],h[H],i[m],\Psi_{t,h}^{i}=\left.\left(\sum_{\tau=1}^{t}\lVert\bm{\phi}(\widetilde{s}_{\tau,h},\widetilde{a}_{\tau,h}^{i})\rVert_{\widehat{\Sigma}_{\tau,i}^{\dagger}}^{2}\right)\middle/\left(64\log\frac{8mHT}{\delta}\right)\right.,\quad\forall t\in[T],h\in[H],i\in[m], (4.9)

where Σ^t,i\widehat{\Sigma}_{t,i}^{\dagger} is defined as in Equation 4.1 if Line 5 of Algorithm 1 is violated in epoch tt, and as Σ^t,i=Σ^t1,i\widehat{\Sigma}_{t,i}^{\dagger}=\widehat{\Sigma}_{t-1,i}^{\dagger} otherwise (a recursive definition, since Line 5 may not be violated in epoch (t1)(t-1) either). Then, when Algorithm 2 is configured properly, the condition in Equation 3.7 holds with L=𝒪~(d4H2)L=\operatorname{\widetilde{\mathcal{O}}}(d^{4}H^{2}), i.e.,

t=1Ti=1m𝔼shπ~t[𝔼Gap[Gapti(s)]]=𝒪~(md2HT),with probability 1δ.\sum_{t=1}^{T}\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}[{\textsc{Gap}}_{t}^{i}(s)]\right]=\operatorname{\widetilde{\mathcal{O}}}(md^{2}H\sqrt{T}),\quad\text{with probability }1-\delta.
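As a concrete reading of Equation 4.9, the following Python sketch (a standalone illustration with a synthetic feature stream and a fixed regularized inverse covariance; in Algorithm 2 these quantities are maintained online and only recomputed when Line 5 of Algorithm 1 is violated) simply accumulates the potential over epochs.

import numpy as np

rng = np.random.default_rng(2)
d, T, m, H, delta = 4, 100, 3, 5, 0.01
normalizer = 64.0 * np.log(8 * m * H * T / delta)

# A fixed stand-in for the regularized inverse covariance Sigma_hat_dagger
cov = np.eye(d) + 0.1 * np.ones((d, d))
Sigma_dagger = np.linalg.inv(cov)

psi = 0.0
for t in range(T):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)                      # a feature with unit Euclidean norm
    psi += phi @ Sigma_dagger @ phi / normalizer    # running potential Psi of Equation 4.9
print("Psi after T epochs:", psi)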

Putting our improved CCE-Approx subroutine in Algorithm 2 together with the original V-Approx subroutine (Wang et al., 2023, Algorithm 3), we obtain our final algorithm for multi-player general-sum Markov Games with independent linear function approximations. As claimed, this algorithm has only polynomial dependency on mm, attains the optimal convergence rate ϵ2\epsilon^{-2}, and avoids poly(Amax)\text{poly}(A_{\max}) factors. Formally, we give the following theorem.

Theorem 4.3 (Main Theorem of the Overall Algorithm).

Under proper configuration of parameters, by picking CCE-Approxh\textsc{CCE-Approx}_{h} as Algorithm 2 and V-Approxh\textsc{V-Approx}_{h} as Algorithm 3 (Wang et al., 2023, Algorithm 3), under the Markov Games with Independent Linear Function Approximation assumption (Assumption 2.1), Algorithm 1 enjoys a sample complexity of 𝒪~(m4d5H6ϵ2)=𝒪~(poly(m,d,H)ϵ2)\operatorname{\widetilde{\mathcal{O}}}(m^{4}d^{5}H^{6}\epsilon^{-2})=\operatorname{\widetilde{\mathcal{O}}}(\text{poly}(m,d,H)\epsilon^{-2}).

The formal proofs of Theorems 4.1, 4.2 and 4.3 are in Appendices B, C and D, respectively. In the following section, we give an overview of the high-level idea of proving Theorems 4.1 and 4.2.

5 Analysis Outline of the Improved CCE-Approx Subroutine

Analyzing the improved CCE-Approx subroutine is highly technical, and the full proofs of Theorems 4.1 and 4.2 are deferred to Appendices B and C. In this section, we highlight several key steps in verifying Equations 3.6 and 3.7 in Theorems 4.1 and 4.2.

5.1 Gap is w.h.p. Pessimistic (Proof Sketch of Theorem 4.1)

In this section, we fix an agent i[m]i\in[m] and a state s𝒮hs\in\mathcal{S}_{h}. We need to verify Equation 3.6 for this specific agent-state pair. By the construction of π~\widetilde{\pi} in Line 12, Equation 3.6 is equivalent to

k=1Kπki(s)πi(s),ϕ(s,a)𝖳𝜽kiKGapi(s),πiΠi,\sum_{k=1}^{K}\left\langle\pi_{k}^{i}(\cdot\mid s)-\pi_{\ast}^{i}(\cdot\mid s),{\bm{\phi}}(s,a)^{\mathsf{T}}{\bm{\theta}}_{k}^{i}\right\rangle\leq K\cdot{\textsc{Gap}}^{i}(s),\quad\forall\pi_{\ast}^{i}\in\Pi^{i}, (5.1)

where 𝜽ki{\bm{\theta}}_{k}^{i} is the “true” kernel induced by πk\pi_{k} and the next-layer V-function V¯i\overline{V}^{i}, i.e.,

ϕ(s,a)𝖳𝜽ki=𝔼𝒂iπki(s)[(i+V¯i)(s,𝒂)],s𝒮h,a𝒜i.{\bm{\phi}}(s,a)^{\mathsf{T}}{\bm{\theta}}_{k}^{i}=\operatornamewithlimits{\mathbb{E}}_{\bm{a}^{-i}\sim\pi_{k}^{-i}(\cdot\mid s)}\left[(\ell^{i}+\operatorname{\mathbb{P}}\overline{V}^{i})(s,\bm{a})\right],\quad\forall s\in\mathcal{S}_{h},a\in\mathcal{A}^{i}.

By (almost) standard regret decomposition, the LHS of Equation 5.1 can be rewritten as

k=1Kπki(s)πi(s),ϕ(s,a)𝖳𝜽ki=k=1Kπki(s)πi(s),Q^ki(s,)Bki(s,)Reg-Term+\displaystyle\quad\sum_{k=1}^{K}\left\langle\pi_{k}^{i}(\cdot\mid s)-\pi_{\ast}^{i}(\cdot\mid s),{\bm{\phi}}(s,a)^{\mathsf{T}}{\bm{\theta}}_{k}^{i}\right\rangle=\underbrace{\sum_{k=1}^{K}\left\langle\pi_{k}^{i}(\cdot\mid s)-\pi_{\ast}^{i}(\cdot\mid s),\widehat{Q}_{k}^{i}(s,\cdot)-B_{k}^{i}(s,\cdot)\right\rangle}_{\textsc{Reg-Term}}+
k=1Kπki(s)πi(s),ϕ(s,a)𝖳𝜽^kiQ^ki(s,a)Mag-Reduce+k=1Kπki(s),ϕ(s,)𝖳(𝜽ki𝜽^ki)Bias-1+\displaystyle\quad\underbrace{\sum_{k=1}^{K}\left\langle\pi_{k}^{i}(\cdot\mid s)-\pi_{\ast}^{i}(\cdot\mid s),{\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}-\widehat{Q}_{k}^{i}(s,a)\right\rangle}_{\textsc{Mag-Reduce}}+\underbrace{\sum_{k=1}^{K}\left\langle\pi_{k}^{i}(\cdot\mid s),{\bm{\phi}}(s,\cdot)^{\mathsf{T}}({\bm{\theta}}_{k}^{i}-\widehat{\bm{\theta}}_{k}^{i})\right\rangle}_{\textsc{Bias-1}}+
k=1Kπi(s),ϕ(s,)𝖳(𝜽^ki𝜽ki)Bias-2+k=1Kπki(s)πi(s),Bki(s,)Bonus-1(πki)Bonus-2(πi).\displaystyle\quad\underbrace{\sum_{k=1}^{K}\left\langle\pi_{\ast}^{i}(\cdot\mid s),{\bm{\phi}}(s,\cdot)^{\mathsf{T}}(\widehat{\bm{\theta}}_{k}^{i}-{\bm{\theta}}_{k}^{i})\right\rangle}_{\textsc{Bias-2}}+\underbrace{\sum_{k=1}^{K}\left\langle\pi_{k}^{i}(\cdot\mid s)-\pi_{\ast}^{i}(\cdot\mid s),B_{k}^{i}(s,\cdot)\right\rangle}_{\textsc{Bonus-1}\,(\pi_{k}^{i})\leavevmode\nobreak\ -\leavevmode\nobreak\ \textsc{Bonus-2}\,(\pi_{\ast}^{i})}. (5.2)

In the remaining part of this section, we overview the main steps in controlling these terms.

5.1.1 Controlling Reg-Term via Magnitude-Reduced Estimator

The Reg-Term is the regret w.r.t. estimated losses fed into the EXP3 instance deployed on each state ss, stated as follows:

Reg-Term=k=1Ka𝒜i(πki(as)πi(as))(Q^ki(s,a)Bki(s,a)).\textsc{Reg-Term}=\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}(\pi_{k}^{i}(a\mid s)-\pi_{\ast}^{i}(a\mid s))\big{(}\widehat{Q}_{k}^{i}(s,a)-B_{k}^{i}(s,a)\big{)}.

As we feed Q^ki(s,a)Bki(s,a)\widehat{Q}_{k}^{i}(s,a)-B_{k}^{i}(s,a) into the EXP3 procedure on each s𝒮hs\in\mathcal{S}_{h} (see Equation 4.3), this term can usually be controlled directly using the EXP3 regret guarantee stated in Lemma E.10, which roughly says

Reg-Termlog|𝒜i|η+ηk=1Ka𝒜iπki(as)(Q^ki(s,a)Bki(s,a))2,\displaystyle\textsc{Reg-Term}\leq\frac{\log\lvert\mathcal{A}^{i}\rvert}{\eta}+\eta\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)\big{(}\widehat{Q}_{k}^{i}(s,a)-B_{k}^{i}(s,a)\big{)}^{2},
if Q^ki(s,a)Bki(s,a)1η,k[K],a𝒜i.\displaystyle\text{if }\widehat{Q}_{k}^{i}(s,a)-B_{k}^{i}(s,a)\geq-\frac{1}{\eta},\leavevmode\nobreak\ \forall k\in[K],a\in\mathcal{A}^{i}.

To verify Q^ki(s,a)Bki(s,a)1η\widehat{Q}_{k}^{i}(s,a)-B_{k}^{i}(s,a)\geq-\frac{1}{\eta}, prior works usually control the estimated Q^ki\widehat{Q}_{k}^{i} via (Luo et al., 2021)

|ϕ(s,a)𝖳𝜽^ki|=ϕ(s,a)𝖳Σ^t,iϕ(sk,hi,ak,hi)(k,hi+V¯i(sk,h+1i))HΣ^t,i2Hγ1.\lvert\bm{\phi}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}\rvert=\bm{\phi}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s_{k,h}^{i},a_{k,h}^{i})(\ell_{k,h}^{i}+\overline{V}^{i}(s_{k,h+1}^{i}))\lesssim H\lVert\widehat{\Sigma}_{t,i}^{\dagger}\rVert_{2}\leq H\gamma^{-1}. (5.3)

While this works when γ=𝒪~(K1/2)\gamma=\operatorname{\widetilde{\mathcal{O}}}(K^{-1/2}), it becomes prohibitively large when setting γ=𝒪~(K1)\gamma=\operatorname{\widetilde{\mathcal{O}}}(K^{-1}). Fortunately, thanks to the Magnitude-Reduced Estimator by Dai et al. (2023), we can ensure that Q^kiBkim^kiBki\widehat{Q}_{k}^{i}-B_{k}^{i}\geq\widehat{m}_{k}^{i}-B_{k}^{i}, where m^ki\widehat{m}_{k}^{i} is bounded from below by 𝒪~(K)-\operatorname{\widetilde{\mathcal{O}}}(\sqrt{K}) (Dai et al., 2023, Theorem 4.1). As BkiB_{k}^{i} is also small, we can apply the standard EXP3 guarantee in Lemma E.10 with η=𝒪~(K1/2)\eta=\operatorname{\widetilde{\mathcal{O}}}(K^{-1/2}).
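For readers who prefer code, below is a minimal, generic EXP3 sketch in Python (a textbook exponential-weights update, not Algorithm 2 itself) whose loss estimates are kept above -1/eta; this is exactly the condition under which the regret guarantee of Lemma E.10 applies.

import numpy as np

rng = np.random.default_rng(3)
A, K = 4, 2000
eta = 1.0 / np.sqrt(K)                    # the "standard" learning rate of order K^{-1/2}

logw = np.zeros(A)                        # log-weights, for numerical stability
for k in range(K):
    pi = np.exp(logw - logw.max())
    pi /= pi.sum()
    # Loss estimates (playing the role of hat{Q} - B); possibly negative, never below -1/eta
    y = rng.normal(scale=2.0, size=A)
    y = np.maximum(y, -1.0 / eta)         # enforce y_k(a) >= -1/eta as required by Lemma E.10
    logw -= eta * y                       # exponential-weights update on losses
pi = np.exp(logw - logw.max())
pi /= pi.sum()
print("final policy:", np.round(pi, 3))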

5.1.2 Controlling Bias-1 Term via Adaptive Freedman Inequality

The Bias-1 term kπki(s),ϕ(s,)𝖳(𝜽ki𝜽^ki)\sum_{k}\langle\pi_{k}^{i}(\cdot\mid s),{\bm{\phi}}(s,\cdot)^{\mathsf{T}}({\bm{\theta}}_{k}^{i}-\widehat{\bm{\theta}}_{k}^{i})\rangle corresponds to the cost of only knowing 𝜽^ki\widehat{\bm{\theta}}_{k}^{i} instead of 𝜽ki{\bm{\theta}}_{k}^{i} when estimating QkiQ_{k}^{i}. As Σ^t,i\widehat{\Sigma}_{t,i}^{\dagger} is biased because of the γI\gamma I regularizer in Equation 4.1, (𝜽ki𝜽^ki)({\bm{\theta}}_{k}^{i}-\widehat{\bm{\theta}}_{k}^{i}) can be further decomposed into (𝜽ki𝔼[𝜽^ki])({\bm{\theta}}_{k}^{i}-\operatornamewithlimits{\mathbb{E}}[\widehat{\bm{\theta}}_{k}^{i}]) and (𝔼[𝜽^ki]𝜽^ki)(\operatornamewithlimits{\mathbb{E}}[\widehat{\bm{\theta}}_{k}^{i}]-\widehat{\bm{\theta}}_{k}^{i}), namely the intrinsic bias and estimation error of 𝜽^ki\widehat{\bm{\theta}}_{k}^{i}:

Bias-1=k=1K𝔼aπki(s)[ϕ(s,a)𝖳](𝜽ki𝜽^ki)=k=1K(𝔼aπki(s)[ϕ(s,a)𝖳]𝜽kiμk)Intrinsic Bias+k=1K(μkXk)Estimation Error,\textsc{Bias-1}=\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]({\bm{\theta}}_{k}^{i}-\widehat{\bm{\theta}}_{k}^{i})=\underbrace{\sum_{k=1}^{K}\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]{\bm{\theta}}_{k}^{i}-\mu_{k}\right)}_{\text{Intrinsic Bias}}+\underbrace{\sum_{k=1}^{K}(\mu_{k}-X_{k})}_{\text{Estimation Error}},

where Xk=𝔼aπki(s)[ϕ(s,a)𝖳]𝜽^kiX_{k}=\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\widehat{\bm{\theta}}_{k}^{i} and μk=𝔼[Xkk1]\mu_{k}=\operatornamewithlimits{\mathbb{E}}[X_{k}\mid\mathcal{F}_{k-1}]; here, (k)k=0K(\mathcal{F}_{k})_{k=0}^{K} is the natural filtration.

The intrinsic bias term is standard in analyzing (expected-)regret-minimization algorithms for linear MDPs – see, e.g., Lemma D.2 of Luo et al. (2021) – and can thus be controlled analogously.

The estimation error term is a martingale. Consequently, regret-minimization papers in single-agent RL usually omit it by its zero-mean nature. However, it plays an important role here since Gap must be a high-probability upper bound. Existing high-probability results for adversarial linear bandits usually apply the famous Freedman inequality (Freedman, 1975): Suppose that |Xkμk|X\lvert X_{k}-\mu_{k}\rvert\leq X a.s., then

k=1K(Xkμk)k=1K𝔼[(Xkμk)2k1]+Xlog1δ,with probability 1δ.\sum_{k=1}^{K}(X_{k}-\mu_{k})\lesssim\sqrt{\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}[(X_{k}-\mu_{k})^{2}\mid\mathcal{F}_{k-1}]}+X\log\frac{1}{\delta},\quad\text{with probability }1-\delta.

However, as we calculated in Equation 5.3, due to our choice of γ=𝒪~(K1)\gamma=\operatorname{\widetilde{\mathcal{O}}}(K^{-1}), XX must be of order 𝒪~(HK)\operatorname{\widetilde{\mathcal{O}}}(HK). Hence, a direct application of the traditional Freedman inequality results in a factor of order 𝒪~(K)\operatorname{\widetilde{\mathcal{O}}}(K), which is unacceptable.

Fortunately, as our strengthened Theorem 3.2 only requires a stochastic Gap with bounded expectation, we are allowed to have a data-dependent variant of the Xlog1δX\log\frac{1}{\delta} term. Inspired by the Adaptive Freedman Inequality proposed by Lee et al. (2020) and improved by Zimmert and Lattimore (2022), we prove a variant of Freedman Inequality in Lemma E.8 which roughly reads

k=1K(Xkμk)k=1K(𝔼[Xk2k1]+Xk2)log1δ,with probability 1δ.\sum_{k=1}^{K}(X_{k}-\mu_{k})\lesssim\sqrt{\sum_{k=1}^{K}(\operatornamewithlimits{\mathbb{E}}[X_{k}^{2}\mid\mathcal{F}_{k-1}]+X_{k}^{2})}\log\frac{1}{\delta},\quad\text{with probability }1-\delta. (5.4)

As both 𝔼[Xk2k1]\operatornamewithlimits{\mathbb{E}}[X_{k}^{2}\mid\mathcal{F}_{k-1}] and Xk2X_{k}^{2} are known to the learner during execution, we can directly put them into Gapki(s)\textsc{Gap}_{k}^{i}(s) and ensure Equation 3.6. A more detailed calculation can be found in the proof of Theorem B.3.
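As a quick sanity check of the shape of Equation 5.4 (with an arbitrary absolute constant set to 3 here, and a synthetic martingale; this is an illustration, not a proof), one can run the following Python snippet.

import numpy as np

rng = np.random.default_rng(4)
K, delta, runs = 5000, 0.01, 500
C = 3.0                                            # ad-hoc absolute constant for the sketch

def one_run():
    scale = rng.uniform(0.1, 2.0, size=K)          # predictable conditional std of X_k (so mu_k = 0)
    X = scale * rng.normal(size=K)
    lhs = X.sum()                                  # sum_k (X_k - mu_k)
    rhs = C * np.sqrt(np.sum(scale ** 2 + X ** 2)) * np.log(1.0 / delta)
    return lhs <= rhs

print("empirical coverage:", np.mean([one_run() for _ in range(runs)]))  # should be at least 1 - delta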

5.1.3 Cancelling Bias-2 Using Bonus-2

While the Bias-2 term kπi(s),ϕ(s,)𝖳(𝜽^ki𝜽ki)\sum_{k}\langle\pi_{\ast}^{i}(\cdot\mid s),{\bm{\phi}}(s,\cdot)^{\mathsf{T}}(\widehat{\bm{\theta}}_{k}^{i}-{\bm{\theta}}_{k}^{i})\rangle looks almost the same as Bias-1, we cannot directly put the RHS of Equation 5.4 into Gapki(s){\textsc{Gap}}_{k}^{i}(s) as πi\pi_{\ast}^{i} is unknown. Following the classical idea of designing bonuses, we can make use of the Bonus-2 term, i.e., Bonus-2=k=1Kπi(s),Bki(s,)\textsc{Bonus-2}=-\sum_{k=1}^{K}\langle\pi_{\ast}^{i}(\cdot\mid s),B_{k}^{i}(s,\cdot)\rangle, to cancel Bias-2.

Although bonuses are standard in the literature, we shall remark again that our action-dependent bonus Bki(s,a)B_{k}^{i}(s,a) is novel. Following the notations in Section 4.2, we would like to cover

vki(s,a)=ϕ(s,a)𝖳(𝜽^ki𝔼[𝜽^ki])v_{k}^{i}(s,a)=\bm{\phi}(s,a)^{\mathsf{T}}(\widehat{\bm{\theta}}_{k}^{i}-\operatornamewithlimits{\mathbb{E}}[\widehat{\bm{\theta}}_{k}^{i}])

using Bki(s,a)B_{k}^{i}(s,a) such that Equation 4.7 happens. Because γK1\gamma\approx K^{-1}, each martingale difference term vki(s,a)v_{k}^{i}(s,a) can be of order 𝒪(K)\operatorname{\mathcal{O}}(K) – thus, directly picking Vi=maxk,s,a|vki(s,a)|V^{i}=\max_{k,s,a}\lvert v_{k}^{i}(s,a)\rvert would make Gap too large, which means the classical Freedman inequality approach described in Section 4.2 again fails.

To tackle this issue, as motivated in Section 4.2, we introduce an action-dependent Vi(s,a)=maxk|vki(s,a)|V^{i}(s,a)=\max_{k}\lvert v_{k}^{i}(s,a)\rvert and average it into the KK episodes, resulting in the bonus definition of (the β1\beta_{1}-related term corresponds to the 𝔼[(vki(s,a))2k1]\sqrt{\operatornamewithlimits{\mathbb{E}}[(v_{k}^{i}(s,a))^{2}\mid\mathcal{F}_{k-1}]} part, while the β2\beta_{2}-related term corresponds to the 1KVi(s,a)\frac{1}{K}V^{i}(s,a) part in Equation 4.8)

Bki(s,a)=β1ϕ(s,a)Σ^t,i2+β2j=1dϕ(s,a)[j]×sup(s,a)𝒮h×𝒜i(Σ^t,iϕ(s,a))[j].B_{k}^{i}(s,a)=\beta_{1}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\beta_{2}\sum_{j=1}^{d}\bm{\phi}(s,a)[j]\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}(\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j].

Thus, the Bias-2 term is indeed cancelled by Bonus-2. Moreover, let us briefly discuss why this approach is more favorable when proving Theorem 4.2: for the k=1K𝔼aπki(s)[Bki(s,a)]\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[B_{k}^{i}(s,a)] term in Equation 4.7, we no longer pay Vi=𝒪~(K)V^{i}=\operatorname{\widetilde{\mathcal{O}}}(K), but only 1Kk𝔼πki[Vi(s,a)]\frac{1}{K}\sum_{k}\operatornamewithlimits{\mathbb{E}}_{\pi_{k}^{i}}[V^{i}(s,a)] instead. As (see Lemma C.6)

Vi(s,a)=j=1dϕ(s,a)[j]×sup(s,a)𝒮h×𝒜i(Σ^t,iϕ(s,a))[j]ϕ(s,a)Σ^t,i×Σ^t,i2,V^{i}(s,a)=\sum_{j=1}^{d}\bm{\phi}(s,a)[j]\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}(\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j]\leq\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}\times\sqrt{\lVert\widehat{\Sigma}_{t,i}^{\dagger}\rVert_{2}},

we have 1Kk𝔼πki[Vi(s,a)]𝔼1Kkπki[ϕϕ𝖳],Σ^t,i×Σ^t,i2\frac{1}{K}\sum_{k}\operatornamewithlimits{\mathbb{E}}_{\pi_{k}^{i}}[V^{i}(s,a)]\lesssim\sqrt{\langle\operatornamewithlimits{\mathbb{E}}_{\frac{1}{K}\sum_{k}\pi_{k}^{i}}[\bm{\phi}\bm{\phi}^{\mathsf{T}}],\widehat{\Sigma}_{t,i}^{\dagger}\rangle\times\lVert\widehat{\Sigma}_{t,i}^{\dagger}\rVert_{2}}. Using the arguments presented shortly in Section 5.2, we can see that this term is indeed nicely bounded.

To conclude, action-dependent bonuses allow us to cancel a random variable that can exhibit extreme values but has small expectations. We expect this technique to be of independent interest.
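To illustrate how the bonus displayed earlier in this subsection could be evaluated, here is a short Python sketch over a finite set of candidate feature vectors (the feature matrix, the two coefficients, and the regularized inverse covariance below are synthetic placeholders rather than the quantities maintained by Algorithm 2).

import numpy as np

rng = np.random.default_rng(5)
d, n_pairs = 4, 50                                 # candidate (s', a') pairs in S_h x A^i
beta1, beta2 = 0.5, 0.05                           # placeholder bonus coefficients

Phi = rng.dirichlet(np.ones(d), size=n_pairs)      # rows phi(s', a'): nonnegative, summing to one
Sigma_dagger = np.linalg.inv(Phi.T @ Phi / n_pairs + 1e-2 * np.eye(d))

# Coordinate-wise sup over (s', a') of (Sigma_dagger phi(s', a'))[j]
sup_vec = (Phi @ Sigma_dagger).max(axis=0)

def bonus(phi):
    quad = phi @ Sigma_dagger @ phi                # beta_1-related term: squared norm of phi under Sigma_dagger
    action_dep = phi @ sup_vec                     # beta_2-related, action-dependent term of the display
    return beta1 * quad + beta2 * action_dep

print("bonus of the first candidate pair:", bonus(Phi[0]))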

5.1.4 Directly Putting Bonus-1 into Gap

The Bonus-1 term is easy to handle in designing Gapki\textsc{Gap}_{k}^{i} as πki\pi_{k}^{i} and BkiB_{k}^{i} are both known. Still, to ensure Equation 3.7, we need to control 1Kk𝔼πki[Vi(s,a)]\frac{1}{K}\sum_{k}\operatornamewithlimits{\mathbb{E}}_{\pi_{k}^{i}}[V^{i}(s,a)] – the same arguments from the previous paragraph then work.

5.1.5 Controlling Mag-Reduce Like Bias-1 and Bias-2

This term is a challenge unique to our paper, arising from the Magnitude-Reduced Estimator of Dai et al. (2023): since the Q^ki\widehat{Q}_{k}^{i} in Equation 4.5 differs from ϕ𝖳θ^ki\bm{\phi}^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}, the following term shows up:

Mag-Reduce=k=1Ka𝒜i(πki(as)πi(as))(ϕ(s,a)𝖳𝜽^kiQ^ki(s,a)).\textsc{Mag-Reduce}=\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\left(\pi_{k}^{i}(a\mid s)-\pi_{\ast}^{i}(a\mid s)\right)\left(\bm{\phi}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}-\widehat{Q}_{k}^{i}(s,a)\right).

However, this term vanishes in the original paper because Q^ki\widehat{Q}_{k}^{i} is unbiased (see Lemma E.1) and Dai et al. (2023) studied expected-regret minimization. Fortunately, because (sκmag,aκ,imag)(s_{\kappa}^{\text{mag}},a_{\kappa,i}^{\text{mag}}) in m^ki(s,a)\widehat{m}_{k}^{i}(s,a) (see Equation 4.2) and (sk,h,ak,hi)(s_{k,h},a_{k,h}^{i}) in Q^ki(s,a)\widehat{Q}_{k}^{i}(s,a) (see Equation 4.5) are i.i.d., we can decompose Mag-Reduce into the sum of a few martingales and apply the Adaptive Freedman Inequality to each of them. Informally, we observe that the resulting concentration terms are smaller than those of Bias-1 and Bias-2, and thus Mag-Reduce is automatically controlled – more detailed arguments are presented in the proof of Theorem B.9.

5.2 Controlling the Sum of 𝔼[Gap]\operatornamewithlimits{\mathbb{E}}[{\textsc{Gap}}]’s via Potential Function

The potential function Ψt,hi\Psi_{t,h}^{i} is defined in Equation 4.9, presented below for the ease of readers:

Ψt,hi=(τ=1tϕ(s~τ,h,a~τ,hi)Σ^τ,i2)/(64log8mHTδ).\Psi_{t,h}^{i}=\left.\left(\sum_{\tau=1}^{t}\lVert\bm{\phi}(\widetilde{s}_{\tau,h},\widetilde{a}_{\tau,h}^{i})\rVert_{\widehat{\Sigma}_{\tau,i}^{\dagger}}^{2}\right)\middle/\left(64\log\frac{8mHT}{\delta}\right)\right..

Below we briefly explain why it ensures Theorem 4.2. After taking the expectation over the randomness in Gapti(s){\textsc{Gap}}_{t}^{i}(s), we get (omitting all dependencies on dd and HH; see Theorem C.2 for formal statements and calculations)

𝔼Gap[Gapti(s)]1t𝔼aπ~ti(s)[ϕ(s,a)Σ^t,i2],s𝒮h,i[m],t[T].\operatornamewithlimits{\mathbb{E}}_{{\textsc{Gap}}}[{\textsc{Gap}}_{t}^{i}(s)]\lesssim\frac{1}{\sqrt{t}}\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(s)}\big{[}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\big{]},\quad\forall s\in\mathcal{S}_{h},i\in[m],t\in[T].

Therefore, the definition of Ψt,hi\Psi_{t,h}^{i} ensures that if Ψt,hi\Psi_{t,h}^{i} is not too different from some Ψt0,hi\Psi_{t_{0},h}^{i}, we can pretend that 𝔼Gap[Gapt0i(s)]\operatornamewithlimits{\mathbb{E}}_{{\textsc{Gap}}}[{\textsc{Gap}}_{t_{0}}^{i}(s)] is also similar to 𝔼Gap[Gapti(s)]\operatornamewithlimits{\mathbb{E}}_{{\textsc{Gap}}}[{\textsc{Gap}}_{t}^{i}(s)], i.e., setting GaptGapt0{\textsc{Gap}}_{t}\leftarrow{\textsc{Gap}}_{t_{0}} will not violate the above inequality by much. Hence, suppose that the “lazy” update mechanism re-uses π~t\widetilde{\pi}_{t} for ntn_{t} epochs (i.e., Line 5 is not violated from epoch (t+1)(t+1) until (t+nt1)(t+n_{t}-1), so the π~t\widetilde{\pi}_{t}’s and Gapt{\textsc{Gap}}_{t}’s stay the same). We then roughly have

t=1Ti=1m𝔼shπ¯t[𝔼Gap[Gapti(s)]]Ti=1mt=1T𝔼shπ~t[𝔼aπ~ti(s)[ϕ(s,a)Σ^t,i2]]\displaystyle\sum_{t=1}^{T}\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\overline{\pi}_{t}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[{\textsc{Gap}}_{t}^{i}(s)\right]\right]\lesssim\sqrt{T}\sum_{i=1}^{m}\sum_{t=1}^{T}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]
=Ti=1mt=1TCovhi(π~t),Σ^t,i=Ti=1mt=1TntCovhi(π~t),Σ^t,i,\displaystyle=\sqrt{T}\sum_{i=1}^{m}\sum_{t=1}^{T}\langle\text{Cov}_{h}^{i}(\widetilde{\pi}_{t}),\widehat{\Sigma}_{t,i}^{\dagger}\rangle=\sqrt{T}\sum_{i=1}^{m}\sum_{t=1}^{T}n_{t}\langle\text{Cov}_{h}^{i}(\widetilde{\pi}_{t}),\widehat{\Sigma}_{t,i}^{\dagger}\rangle, (5.5)

where Covhi(π~t)\text{Cov}_{h}^{i}(\widetilde{\pi}_{t}) is the covariance matrix of agent ii in layer hh when following π~t\widetilde{\pi}_{t}, i.e.,

Covhi(π~t)=𝔼shπ~t[𝔼aπ~ti(s)[ϕ(s,a)ϕ(s,a)𝖳]].\text{Cov}_{h}^{i}(\widetilde{\pi}_{t})=\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(s)}[\bm{\phi}(s,a)\bm{\phi}(s,a)^{\mathsf{T}}]\right].

By the definition of π¯t=1ts=0t1π~s\overline{\pi}_{t}=\frac{1}{t}\sum_{s=0}^{t-1}\widetilde{\pi}_{s}, we would roughly have Σ^t,i(s=0t1nsCovhi(π~s)+γI)1\widehat{\Sigma}_{t,i}^{\dagger}\approx(\sum_{s=0}^{t-1}n_{s}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s})+\gamma I)^{-1}. Therefore,

t=1TntCovhi(π~t),Σ^t,it=1TntCovhi(π~t),(s<tnsCovhi(π~s))1.\sum_{t=1}^{T}n_{t}\langle\text{Cov}_{h}^{i}(\widetilde{\pi}_{t}),\widehat{\Sigma}_{t,i}^{\dagger}\rangle\approx\sum_{t=1}^{T}\left\langle n_{t}\text{Cov}_{h}^{i}(\widetilde{\pi}_{t}),\left(\sum_{s<t}n_{s}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s})\right)^{-1}\right\rangle.

Recalling that the scalar version of this sum gives txt/(s<txs)=𝒪(logtxt)\sum_{t}x_{t}/(\sum_{s<t}x_{s})=\operatorname{\mathcal{O}}(\log\sum_{t}x_{t}) when each xtx_{t} is bounded, one can expect this sum to be small as well. Indeed, from Lemma 11 of Zanette and Wainwright (2022), we can conclude that

t=1TntCovhi(π~t),Σ^t,i𝒪~(logdet(tntCovhi(π~t)))=𝒪~(d),\sum_{t=1}^{T}n_{t}\langle\text{Cov}_{h}^{i}(\widetilde{\pi}_{t}),\widehat{\Sigma}_{t,i}^{\dagger}\rangle\lesssim\operatorname{\widetilde{\mathcal{O}}}\left(\log\det\left(\sum_{t}n_{t}\text{Cov}_{h}^{i}(\widetilde{\pi}_{t})\right)\right)=\operatorname{\widetilde{\mathcal{O}}}(d),

and thus the LHS of Equation 3.7, which equals Equation 5.5, is at most 𝒪~(mT)\operatorname{\widetilde{\mathcal{O}}}(m\sqrt{T}) (again, omitting all dependencies on dd and HH). Theorem 4.2 then follows.
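Numerically, the matrix summation above behaves like its scalar analogue; the following Python sketch (with rank-one synthetic covariances and every policy used exactly once) accumulates the sum of inner products against the running regularized inverse and compares it with a log-det quantity, in the spirit of Lemma 11 of Zanette and Wainwright (2022).

import numpy as np

rng = np.random.default_rng(6)
d, T, gamma = 5, 500, 1.0

running = gamma * np.eye(d)           # sum_{s < t} Cov_s + gamma * I
lhs = 0.0
for t in range(T):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)
    cov = np.outer(phi, phi)          # rank-one "covariance" of epoch t
    lhs += np.trace(cov @ np.linalg.inv(running))
    running += cov

logdet = np.linalg.slogdet(running)[1] - d * np.log(gamma)
print(f"sum of inner products: {lhs:.2f}   log-det quantity (up to constants): {logdet:.2f}")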

6 Conclusion

In this paper, we consider multi-player general-sum Markov Games with independent linear function approximations. By enhancing the AVLPR framework recently proposed by Wang et al. (2023) with data-dependent pessimistic gap estimation, proposing novel action-dependent bonuses, and incorporating several state-of-the-art techniques from recent advances in the adversarial RL literature (Zimmert and Lattimore, 2022, Dai et al., 2023, Liu et al., 2023a), we give the first algorithm that i) bypasses the curse of multi-agents, ii) attains the optimal convergence rate, and iii) avoids polynomial dependencies on the number of actions. Our a) data-dependent pessimistic sub-optimality gap estimations (Section 3) and b) action-dependent bonus technique for covering extreme estimation errors (Section 4.2) can be of independent interest.

Acknowledgement

We thank Haipeng Luo (University of Southern California), Chen-Yu Wei (University of Virginia), and Zihan Zhang (University of Washington) for their insightful discussions. We thank Fan et al. (2024) for pointing out a mistake in the original proof of Lemma A.2. We gratefully acknowledge the anonymous reviewers for their comments. SSD acknowledges the support of NSF IIS 2110170, NSF DMS 2134106, NSF CCF 2212261, NSF IIS 2143493, NSF CCF 2019844, NSF IIS 2229881.

References

  • Bai and Jin (2020) Yu Bai and Chi Jin. Provable self-play algorithms for competitive reinforcement learning. In International Conference on Machine Learning, pages 551–560. PMLR, 2020.
  • Bai et al. (2020) Yu Bai, Chi Jin, and Tiancheng Yu. Near-optimal reinforcement learning with self-play. Advances in Neural Information Processing Systems, 33:2159–2170, 2020.
  • Baker et al. (2020) Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. In International Conference on Learning Representations, 2020.
  • Brown and Sandholm (2019) Noam Brown and Tuomas Sandholm. Superhuman AI for multiplayer poker. Science, 365(6456):885–890, 2019.
  • Chen et al. (2022a) Fan Chen, Song Mei, and Yu Bai. Unified algorithms for rl with decision-estimation coefficients: No-regret, pac, and reward-free learning. arXiv preprint arXiv:2209.11745, 2022a.
  • Chen et al. (2022b) Zixiang Chen, Dongruo Zhou, and Quanquan Gu. Almost optimal algorithms for two-player zero-sum linear mixture markov games. In International Conference on Algorithmic Learning Theory, pages 227–261. PMLR, 2022b.
  • Cui et al. (2023) Qiwen Cui, Kaiqing Zhang, and Simon Du. Breaking the curse of multiagents in a large state space: Rl in markov games with independent linear function approximation. In Proceedings of Thirty Sixth Conference on Learning Theory, volume 195, pages 2651–2652. PMLR, 2023.
  • Dai et al. (2023) Yan Dai, Haipeng Luo, Chen-Yu Wei, and Julian Zimmert. Refined regret for adversarial mdps with linear function approximation. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 6726–6759. PMLR, 2023.
  • Daskalakis et al. (2009) Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a nash equilibrium. Communications of the ACM, 52(2):89–97, 2009.
  • Daskalakis et al. (2023) Constantinos Daskalakis, Noah Golowich, and Kaiqing Zhang. The complexity of markov equilibrium in stochastic games. In Proceedings of Thirty Sixth Conference on Learning Theory, volume 195, pages 4180–4234. PMLR, 2023.
  • Fan et al. (2024) Junyi Fan, Yuxuan Han, Jialin Zeng, Jian-Feng Cai, Yang Wang, Yang Xiang, and Jiheng Zhang. Rl in markov games with independent function approximation: Improved sample complexity bound under the local access model. In International Conference on Artificial Intelligence and Statistics, pages 2035–2043. PMLR, 2024.
  • Freedman (1975) David A Freedman. On tail probabilities for martingales. The Annals of Probability, pages 100–118, 1975.
  • Huang et al. (2022) Baihe Huang, Jason D. Lee, Zhaoran Wang, and Zhuoran Yang. Towards general function approximation in zero-sum markov games. In International Conference on Learning Representations, 2022.
  • Jiang et al. (2017) Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual decision processes with low bellman rank are pac-learnable. In International Conference on Machine Learning, pages 1704–1713. PMLR, 2017.
  • Jin et al. (2020) Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • Jin et al. (2021) Chi Jin, Qinghua Liu, Yuanhao Wang, and Tiancheng Yu. V-learning–a simple, efficient, decentralized algorithm for multiagent rl. arXiv preprint arXiv:2110.14555, 2021.
  • Jin et al. (2022) Chi Jin, Qinghua Liu, and Tiancheng Yu. The power of exploiter: Provable multi-agent rl in large state spaces. In International Conference on Machine Learning, pages 10251–10279. PMLR, 2022.
  • Kong et al. (2023) Fang Kong, Xiangcheng Zhang, Baoxiang Wang, and Shuai Li. Improved regret bounds for linear adversarial mdps via linear optimization. arXiv preprint arXiv:2302.06834, 2023.
  • Lee et al. (2020) Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei, and Mengxiao Zhang. Bias no more: high-probability data-dependent regret bounds for adversarial bandits and mdps. Advances in Neural Information Processing Systems, 33:15522–15533, 2020.
  • Liu et al. (2023a) Haolin Liu, Chen-Yu Wei, and Julian Zimmert. Bypassing the simulator: Near-optimal adversarial linear contextual bandits. arXiv preprint arXiv:2309.00814, 2023a.
  • Liu et al. (2023b) Haolin Liu, Chen-Yu Wei, and Julian Zimmert. Towards optimal regret in adversarial linear mdps with bandit feedback. arXiv preprint arXiv:2310.11550, 2023b.
  • Liu et al. (2021) Qinghua Liu, Tiancheng Yu, Yu Bai, and Chi Jin. A sharp analysis of model-based reinforcement learning with self-play. In International Conference on Machine Learning, pages 7001–7010. PMLR, 2021.
  • Luo et al. (2021) Haipeng Luo, Chen-Yu Wei, and Chung-Wei Lee. Policy optimization in adversarial mdps: Improved exploration via dilated bonuses. Advances in Neural Information Processing Systems, 34:22931–22942, 2021.
  • Mao and Başar (2023) Weichao Mao and Tamer Başar. Provably efficient reinforcement learning in decentralized general-sum markov games. Dynamic Games and Applications, 13(1):165–186, 2023.
  • Neu and Olkhovskaya (2020) Gergely Neu and Julia Olkhovskaya. Efficient and robust algorithms for adversarial linear contextual bandits. In Conference on Learning Theory, pages 3049–3068. PMLR, 2020.
  • Ni et al. (2022) Chengzhuo Ni, Yuda Song, Xuezhou Zhang, Chi Jin, and Mengdi Wang. Representation learning for general-sum low-rank markov games. arXiv preprint arXiv:2210.16976, 2022.
  • Olkhovskaya et al. (2023) Julia Olkhovskaya, Jack Mayo, Tim van Erven, Gergely Neu, and Chen-Yu Wei. First-and second-order bounds for adversarial linear contextual bandits. arXiv preprint arXiv:2305.00832, 2023.
  • Shalev-Shwartz et al. (2016) Shai Shalev-Shwartz, Shaked Shammah, and Amnon Shashua. Safe, multi-agent, reinforcement learning for autonomous driving. arXiv preprint arXiv:1610.03295, 2016.
  • Shapley (1953) Lloyd S Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10):1095–1100, 1953.
  • Sherman et al. (2023a) Uri Sherman, Alon Cohen, Tomer Koren, and Yishay Mansour. Rate-optimal policy optimization for linear markov decision processes. arXiv preprint arXiv:2308.14642, 2023a.
  • Sherman et al. (2023b) Uri Sherman, Tomer Koren, and Yishay Mansour. Improved regret for efficient online reinforcement learning with linear function approximation. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 31117–31150. PMLR, 2023b.
  • Silver et al. (2017) David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  • Song et al. (2022) Ziang Song, Song Mei, and Yu Bai. When can we learn general-sum markov games with a large number of players sample-efficiently? In International Conference on Learning Representations, 2022.
  • Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Wang et al. (2023) Yuanhao Wang, Qinghua Liu, Yu Bai, and Chi Jin. Breaking the curse of multiagency: Provably efficient decentralized multi-agent rl with function approximation. In Proceedings of Thirty Sixth Conference on Learning Theory, volume 195, pages 2793–2848. PMLR, 2023.
  • Wen and Van Roy (2017) Zheng Wen and Benjamin Van Roy. Efficient reinforcement learning in deterministic systems with value function generalization. Mathematics of Operations Research, 42(3):762–782, 2017.
  • Xie et al. (2020) Qiaomin Xie, Yudong Chen, Zhaoran Wang, and Zhuoran Yang. Learning zero-sum simultaneous-move markov games using function approximation and correlated equilibrium. In Conference on Learning Theory, pages 3674–3682. PMLR, 2020.
  • Xiong et al. (2022) Wei Xiong, Han Zhong, Chengshuai Shi, Cong Shen, and Tong Zhang. A self-play posterior sampling algorithm for zero-sum markov games. In International Conference on Machine Learning, pages 24496–24523. PMLR, 2022.
  • Yang and Wang (2020) Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pages 10746–10756. PMLR, 2020.
  • Yin et al. (2022) Dong Yin, Botao Hao, Yasin Abbasi-Yadkori, Nevena Lazić, and Csaba Szepesvári. Efficient local planning with linear function approximation. In International Conference on Algorithmic Learning Theory, pages 1165–1192. PMLR, 2022.
  • Zanette and Wainwright (2022) Andrea Zanette and Martin Wainwright. Stabilizing q-learning with linear architectures for provable efficient learning. In International Conference on Machine Learning, pages 25920–25954. PMLR, 2022.
  • Zhan et al. (2023) Wenhao Zhan, Jason D. Lee, and Zhuoran Yang. Decentralized optimistic hyperpolicy mirror descent: Provably no-regret learning in markov games. In The Eleventh International Conference on Learning Representations, 2023.
  • Zhao et al. (2023) Canzhe Zhao, Ruofeng Yang, Baoxiang Wang, and Shuai Li. Learning adversarial linear mixture markov decision processes with bandit feedback and unknown transition. In The Eleventh International Conference on Learning Representations, 2023.
  • Zimmert and Lattimore (2022) Julian Zimmert and Tor Lattimore. Return of the bias: Almost minimax optimal high probability bounds for adversarial linear bandits. In Conference on Learning Theory, pages 3285–3312. PMLR, 2022.

Appendix A Analysis of the Improved AVLPR Framework

This section proves the main theorem of our Improved AVLPR framework, restated as follows.

Theorem A.1 (Main Theorem of Improved AVLPR; Restatement of Theorem 3.2).

Suppose that

  1.

    (Per-state no-regret) (π~,Gap)=CCE-Approxh(π¯,V¯,K)(\widetilde{\pi},{\textsc{Gap}})={\textsc{CCE-Approx}}_{h}(\overline{\pi},\overline{V},K) ensures the following w.p. 1δ1-\delta:

    maxπiΠhi{(𝔼aπ~𝔼aπi×π~i)[(i+h+1V¯i)(s,a)]}Gapi(s),i[m],s𝒮h,\displaystyle\max_{\pi_{\ast}^{i}\in\Pi_{h}^{i}}\left\{\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}}-\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{\ast}^{i}\times\widetilde{\pi}^{-i}}\right)\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}_{h+1}\overline{V}^{i}\big{)}(s,a)\right]\right\}\leq{\textsc{Gap}}^{i}(s),\quad\forall i\in[m],s\in\mathcal{S}_{h}, (A.1)

    where Gapi(s){\textsc{Gap}}^{i}(s) is a random variable whose randomness comes from the environment (when generating the trajectories), the agents (when playing the policies), and internal randomness.

  2.

    (Optimistic V-function) V^=V-Approxh(π¯,π~,V¯,Gap,K)\widehat{V}={\textsc{V-Approx}}_{h}(\overline{\pi},\widetilde{\pi},\overline{V},{\textsc{Gap}},K) ensures the following w.p. 1δ1-\delta:

    V^i(s)[min{\displaystyle\widehat{V}^{i}(s)\in\bigg{[}\min\bigg{\{} 𝔼aπ~[(i+V¯i)(s,a)]+Gapi(s),Hh+1},\displaystyle\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}}\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}\overline{V}^{i}\big{)}(s,a)\right]+\phantom{2}{\textsc{Gap}}^{i}(s),H-h+1\bigg{\}},
    𝔼aπ~[(i+V¯i)(s,a)]+2Gapi(s)],i[m],s𝒮h.\displaystyle\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}}\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}\overline{V}^{i}\big{)}(s,a)\right]+2{\textsc{Gap}}^{i}(s)\bigg{]},\quad\forall i\in[m],s\in\mathcal{S}_{h}. (A.2)
  3.

    (Pigeon-hole condition & Potential Function) The potential function Ψt,hi\Psi_{t,h}^{i} ensures that Line 5 in Algorithm 1 is violated at most dreplaylogTd_{\text{replay}}\log T times. Moreover, there exists a deterministic LL ensuring the following w.p. 1δ1-\delta:

    t=1Ti=1m𝔼shπ~t[𝔼Gap[Gapti(s)]]mLTlog2Tδ,h[H],\sum_{t=1}^{T}\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[{\color[rgb]{.5,0,.5}\operatornamewithlimits{\mathbb{E}}_{{\textsc{Gap}}}[{\textsc{Gap}}_{t}^{i}(s)]}\right]\leq m\sqrt{LT\log^{2}\frac{T}{\delta}},\quad\forall h\in[H], (A.3)

    where the expectation is taken w.r.t. the randomness in Gap.

Then with probability 1𝒪~(δ)1-\operatorname{\widetilde{\mathcal{O}}}(\delta), the sum of CCE-Regret over all agents satisfies

i=1mt=1T(Vπ~ti(s1)V,π~tii(s1))=𝒪~(mHLT),\sum_{i=1}^{m}\sum_{t=1}^{T}\left(V_{\widetilde{\pi}_{t}}^{i}(s_{1})-V_{\dagger,\widetilde{\pi}_{t}^{-i}}^{i}(s_{1})\right)=\operatorname{\widetilde{\mathcal{O}}}(mH\sqrt{LT}),

where π~t\widetilde{\pi}_{t} is the policy for the tt-th epoch (which is the policy with smallest Gap; see Equation 3.1).

Suppose that one call to CCE-Approxh\textsc{CCE-Approx}_{h} will cost Γ1K\Gamma_{1}K samples (KK is the last parameter to the subroutine), and, similarly, one call to V-Approxh\textsc{V-Approx}_{h} will cost Γ2\Gamma_{2} samples. Then by picking T=𝒪~(m2H2L/ϵ2)T=\operatorname{\widetilde{\mathcal{O}}}(m^{2}H^{2}L/\epsilon^{2}), the roll-out policy is an ϵ\epsilon-CCE and the sample complexity is bounded by

𝒪~(max{Γ1,Γ2}×m2H3Lϵ2×dreplaylogT).\operatorname{\widetilde{\mathcal{O}}}\left(\max\{\Gamma_{1},\Gamma_{2}\}\times m^{2}H^{3}L\epsilon^{-2}\times d_{\text{replay}}\log T\right).
Proof.

The proof mostly follows that of Theorem 18 of Wang et al. (2023); all the main differences are sketched in the proof sketch of Theorem 3.2 (in the main text).

Let \mathcal{I} be the set of epochs in which Line 5 of Algorithm 1 is violated. For any tt\in\mathcal{I} and h=H,H1,,1h=H,H-1,\ldots,1, we first pass a “next-layer” V-function V¯i\overline{V}^{i} to CCE-Approxh\textsc{CCE-Approx}_{h} and then calculate the “current-layer” V-function via V-Approxh\textsc{V-Approx}_{h}. By Equation A.1 of Condition 1 and Equation A.2 of Condition 2, for any “next-layer” V¯i\overline{V}^{i},

i=1mminπi(𝒜i){𝔼𝒂πi×π~r(s)i(s)[(i+V¯i)(s,𝒂)]}\displaystyle\quad\sum_{i=1}^{m}\min_{\pi_{\ast}^{i}\in\triangle(\mathcal{A}^{i})}\left\{\operatornamewithlimits{\mathbb{E}}_{\bm{a}\sim\pi_{\ast}^{i}\times\widetilde{\pi}_{r^{\ast}(s)}^{-i}(\cdot\mid s)}\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}\overline{V}^{i}\big{)}(s,\bm{a})\right]\right\}
i=1m𝔼𝒂π~r(s)(s)[(i+V¯i)(s,𝒂)]i=1mGapr(s)i(s)\displaystyle\geq\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{\bm{a}\sim\widetilde{\pi}_{r^{\ast}(s)}(\cdot\mid s)}\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}\overline{V}^{i}\big{)}(s,\bm{a})\right]-\sum_{i=1}^{m}{\textsc{Gap}}_{r^{\ast}(s)}^{i}(s)
i=1mV^i(s),s𝒮h.\displaystyle\geq\sum_{i=1}^{m}\widehat{V}^{i}(s),\quad\forall s\in\mathcal{S}_{h}.

Thus, by induction over all h=H,H1,,1h=H,H-1,\ldots,1, we know that for all tt\in\mathcal{I} and h[H]h\in[H],

i=1mV^ti(s)i=1mV,πtii(s),s𝒮.\sum_{i=1}^{m}\widehat{V}_{t}^{i}(s)\leq\sum_{i=1}^{m}V_{\dagger,\pi_{t}^{-i}}^{i}(s),\quad\forall s\in\mathcal{S}.

In words, this indicates that our V^ti\widehat{V}_{t}^{i} is indeed an optimistic estimation of the best-response V-function. Again applying induction by invoking the other part of Equation A.2, we know for all tt\in\mathcal{I},

i=1mV^ti(s)i=1mVπ~ti(s)2i=1m𝔥=hH𝔼s𝔥π~tsh=s[Gapti(s)],s𝒮.\sum_{i=1}^{m}\widehat{V}_{t}^{i}(s)\geq\sum_{i=1}^{m}V_{\widetilde{\pi}_{t}}^{i}(s)-2\sum_{i=1}^{m}\sum_{\mathfrak{h}=h}^{H}\operatornamewithlimits{\mathbb{E}}_{s^{\prime}\sim_{\mathfrak{h}}\widetilde{\pi}_{t}\mid s_{h}=s}\left[{\textsc{Gap}}_{t}^{i}(s^{\prime})\right],\quad\forall s\in\mathcal{S}.

Putting these two inequalities together, the following holds for all tt\in\mathcal{I}:

i=1m(Vπ~ti(s)V,π~tii(s))2i=1m𝔥=hH𝔼s𝔥π~tsh=s[Gapti(s)],s𝒮.\sum_{i=1}^{m}\left(V_{\widetilde{\pi}_{t}}^{i}(s)-V_{\dagger,\widetilde{\pi}_{t}^{-i}}^{i}(s)\right)\leq 2\sum_{i=1}^{m}\sum_{\mathfrak{h}=h}^{H}\operatornamewithlimits{\mathbb{E}}_{s^{\prime}\sim_{\mathfrak{h}}\widetilde{\pi}_{t}\mid s_{h}=s}\left[{\textsc{Gap}}_{t}^{i}(s^{\prime})\right],\quad\forall s\in\mathcal{S}.

Hence, denote by ItI_{t}\in\mathcal{I} the epoch whose policy the tt-th epoch uses (i.e., the last epoch up to tt in which Line 5 is violated). By the above calculation, we can bound the sum of CCE-Regret over all agents as

i=1mt=1T(Vπ~ti(s1)V,π~tii(s1))=i=1mt=1T(Vπ~Iti(s1)V,π~Itii(s1))2i=1mt=1T𝔥=1H𝔼s𝔥π~It[GapIti(s)].\displaystyle\sum_{i=1}^{m}\sum_{t=1}^{T}\left(V_{\widetilde{\pi}_{t}}^{i}(s_{1})-V_{\dagger,\widetilde{\pi}_{t}^{-i}}^{i}(s_{1})\right)=\sum_{i=1}^{m}\sum_{t=1}^{T}\left(V_{\widetilde{\pi}_{I_{t}}}^{i}(s_{1})-V_{\dagger,\widetilde{\pi}_{I_{t}}^{-i}}^{i}(s_{1})\right)\leq 2\sum_{i=1}^{m}\sum_{t=1}^{T}\sum_{\mathfrak{h}=1}^{H}\operatornamewithlimits{\mathbb{E}}_{s^{\prime}\sim_{\mathfrak{h}}\widetilde{\pi}_{I_{t}}}\left[{\textsc{Gap}}_{I_{t}}^{i}(s^{\prime})\right].

From Lemma A.2, we conclude that the following holds w.p. 1𝒪~(δTH)1-\operatorname{\widetilde{\mathcal{O}}}(\delta TH):

i=1mGapti(s)2𝔼Gap[i=1mGapti],t,h[H],s𝒮h.\sum_{i=1}^{m}{\textsc{Gap}}_{t}^{i}(s)\leq 2\operatornamewithlimits{\mathbb{E}}_{{\textsc{Gap}}}\left[\sum_{i=1}^{m}{\textsc{Gap}}_{t}^{i}\right],\quad\forall t\in\mathcal{I},h\in[H],s\in\mathcal{S}_{h}.

Moreover, recall from Equation A.3 that

t=1Ti=1m𝔼shπ~t[𝔼Gap[Gapti(s)]]mLTlog2Tδ,h[H].\sum_{t=1}^{T}\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[{\operatornamewithlimits{\mathbb{E}}_{{\textsc{Gap}}}[{\textsc{Gap}}_{t}^{i}(s)]}\right]\leq m\sqrt{LT\log^{2}\frac{T}{\delta}},\quad\forall h\in[H].

We then have

i=1mt=1T(Vπ~ti(s1)V,π~tii(s1))4mHLTlog2Tδ,\sum_{i=1}^{m}\sum_{t=1}^{T}\left(V_{\widetilde{\pi}_{t}}^{i}(s_{1})-V_{\dagger,\widetilde{\pi}_{t}^{-i}}^{i}(s_{1})\right)\leq 4mH\sqrt{LT\log^{2}\frac{T}{\delta}},

which means that our choice of T=𝒪~(m2H2L/ϵ2)T=\operatorname{\widetilde{\mathcal{O}}}(m^{2}H^{2}L/\epsilon^{2}) indeed yields an ϵ\epsilon-CCE roll-out policy.

It only remains to calculate the total sample complexity. From Condition 3, ||dreplaylogT\lvert\mathcal{I}\rvert\leq d_{\text{replay}}\log T. Thus, the total sample complexity is no more than

T+||×H𝒪~(Γ1T+Γ2T)=𝒪~(max{Γ1,Γ2}×m2H3Lϵ2×dreplaylogT),T+\lvert\mathcal{I}\rvert\times H\operatorname{\widetilde{\mathcal{O}}}(\Gamma_{1}T+\Gamma_{2}T)=\operatorname{\widetilde{\mathcal{O}}}(\max\{\Gamma_{1},\Gamma_{2}\}\times m^{2}H^{3}L\epsilon^{-2}\times d_{\text{replay}}\log T),

as claimed. ∎

Lemma A.2.

For any tt\in\mathcal{I}, w.p. 1δ1-\delta, we have i=1mGapti(s)2𝔼[i=1mGapti(s)]\sum_{i=1}^{m}{\textsc{Gap}}_{t}^{i}(s)\leq 2\operatornamewithlimits{\mathbb{E}}[\sum_{i=1}^{m}{\textsc{Gap}}_{t}^{i}(s)], s𝒮h\forall s\in\mathcal{S}_{h}.

Proof.

Denote the RR policy-Gap pairs generated in this epoch by (π~1,Gap1),(π~2,Gap2),,(π~R,GapR)(\widetilde{\pi}_{1},{\textsc{Gap}}_{1}),(\widetilde{\pi}_{2},{\textsc{Gap}}_{2}),\ldots,(\widetilde{\pi}_{R},{\textsc{Gap}}_{R}). By definition, (π~t(s),Gapt(s))=(π~r(s)(s),Gapr(s)(s))(\widetilde{\pi}_{t}(s),{\textsc{Gap}}_{t}(s))=(\widetilde{\pi}_{r^{\ast}(s)}(s),{\textsc{Gap}}_{r^{\ast}(s)}(s)), where r(s)r^{\ast}(s) is defined as (in Equation 3.1)

r(s)=argminr[R]i=1mGapri(s).r^{\ast}(s)=\operatornamewithlimits{\mathrm{argmin}}_{r\in[R]}\sum_{i=1}^{m}{\textsc{Gap}}_{r}^{i}(s).

By Markov's inequality, for each r[R]r\in[R], Pr{i=1mGapri(s)>2i=1m𝔼[Gapi(s)]}12\Pr\left\{\sum_{i=1}^{m}{\textsc{Gap}}_{r}^{i}(s)>2\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}[{\textsc{Gap}}^{i}(s)]\right\}\leq\frac{1}{2}. Thus, as the RR repetitions are independent, R=log2SδR=\log_{2}\frac{S}{\delta}, and r(s)r^{\ast}(s) is defined in Equation 3.1 as the rr with the smallest i=1mGapri(s)\sum_{i=1}^{m}{\textsc{Gap}}_{r}^{i}(s), we have i=1mGapr(s)i(s)2i=1m𝔼[Gapi(s)]\sum_{i=1}^{m}{\textsc{Gap}}_{r^{\ast}(s)}^{i}(s)\leq 2\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}[{\textsc{Gap}}^{i}(s)] with probability 1δS1-\frac{\delta}{S}. Taking a union bound over all s𝒮hs\in\mathcal{S}_{h} gives our conclusion that i=1mGapr(s)i(s)2i=1m𝔼[Gapi(s)]\sum_{i=1}^{m}{\textsc{Gap}}_{r^{\ast}(s)}^{i}(s)\leq 2\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}[{\textsc{Gap}}^{i}(s)] for all s𝒮hs\in\mathcal{S}_{h} w.p. 1δ1-\delta. ∎
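The repeat-and-take-the-minimum argument above is easy to verify numerically; the Python snippet below (with an exponential surrogate for Gap, purely for illustration) draws R = ceil(log2(S/delta)) independent repetitions per state and checks that the minimum is at most twice the expectation simultaneously over all states with frequency well above 1 - delta.

import numpy as np

rng = np.random.default_rng(7)
S, delta, runs = 100, 0.05, 2000
R = int(np.ceil(np.log2(S / delta)))               # number of independent repetitions

mean_gap = 3.0                                     # E[Gap] of the exponential surrogate below

def one_run():
    gaps = rng.exponential(mean_gap, size=(R, S))  # Gap_r(s), independent across r and s
    best = gaps.min(axis=0)                        # Gap_{r*(s)}(s): the minimum over repetitions
    return np.all(best <= 2 * mean_gap)

print("frequency of simultaneous success:", np.mean([one_run() for _ in range(runs)]))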

Appendix B Analysis of the Improved CCE-Approx Subroutine

Theorem B.1 (Gap is w.h.p. Pessimistic; Formal Version of Theorem 4.1).

Suppose that η=𝒪(min{γH,γβ1+β2})\eta=\operatorname{\mathcal{O}}(\min\{\frac{\sqrt{\gamma}}{H},\frac{\gamma}{\beta_{1}+\beta_{2}}\}), β1=Ω~(dHK){\beta_{1}}=\widetilde{\Omega}(\frac{dH}{\sqrt{K}}), β2=Ω~(HK)\beta_{2}=\widetilde{\Omega}(\frac{H}{K}), and γ=𝒪~(dK)\gamma=\operatorname{\widetilde{\mathcal{O}}}(\frac{d}{K}). For each execution of (π~,Gap)=CCE-Approxh(π¯,V¯,K)(\widetilde{\pi},{\textsc{Gap}})=\textsc{CCE-Approx}_{h}(\overline{\pi},\overline{V},K),

maxπiΠhi{(𝔼aπ~𝔼aπi×π~i)[(i+h+1V¯i)(s,a)]}Gapi(s),i[m],s𝒮h,\max_{\pi_{\ast}^{i}\in\Pi_{h}^{i}}\left\{\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}}-\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{\ast}^{i}\times\widetilde{\pi}^{-i}}\right)\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}_{h+1}\overline{V}^{i}\big{)}(s,a)\right]\right\}\leq{\textsc{Gap}}^{i}(s),\quad\forall i\in[m],s\in\mathcal{S}_{h}, (B.1)

w.p. 1𝒪(δ)1-\operatorname{\mathcal{O}}(\delta), where Gapi(s)\textsc{Gap}^{i}(s) is defined as follows (the names correspond to those in Equation 5.2)

KGapi(s)log|𝒜i|η+2ηk=1Ka𝒜iπki(as)Q^ki(s,a)2Reg-Term+\displaystyle\quad K\cdot{\textsc{Gap}}^{i}(s)\triangleq\underbrace{\frac{\log\lvert\mathcal{A}^{i}\rvert}{\eta}+2\eta\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)\widehat{Q}_{k}^{i}(s,a)^{2}}_{\textsc{Reg-Term}}+
822dH2k=1K𝔼aπki(s)[ϕ(s,a)]Σ^t,i2+k=1K(𝔼aπki(s)[ϕ(s,a)𝖳]𝜽^ki)2log4KHγδBias-1+\displaystyle\quad\underbrace{8\sqrt{2}\sqrt{2dH^{2}\sum_{k=1}^{K}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)\right]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\sum_{k=1}^{K}\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[{\bm{\phi}}(s,a)^{\mathsf{T}}]\widehat{\bm{\theta}}_{k}^{i}\right)^{2}}\log\frac{4KH}{\gamma\delta}}_{\textsc{Bias-1}}+
𝒪(dβ1logdKδ)Bias-2+Bonus-2+k=1K𝔼aπki(s)[Bki(s,a)]Bonus-1,i[m],s𝒮h.\displaystyle\quad\underbrace{\operatorname{\mathcal{O}}\left(\frac{d}{\beta_{1}}\log\frac{dK}{\delta}\right)}_{\textsc{Bias-2}+\textsc{Bonus-2}}+\underbrace{\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[B_{k}^{i}(s,a)]}_{\textsc{Bonus-1}},\quad\forall i\in[m],s\in\mathcal{S}_{h}. (B.2)
Proof.

To make the proof easy to read, we first restate Equation B.1 using the notations of CCE-Approx. Let 𝜽ki{\bm{\theta}}_{k}^{i} be the Q-function kernel induced by the next-layer V-function V¯i\overline{V}^{i} and πki\pi_{k}^{-i}, i.e.,

ϕ(s,ai)𝖳𝜽ki=Qki(s,a)𝔼𝒂iπki(s)[(i+V¯i)(s,𝒂)],s𝒮h,ai𝒜i.{\bm{\phi}}(s,a^{i})^{\mathsf{T}}{\bm{\theta}}_{k}^{i}=Q_{k}^{i}(s,a)\triangleq\operatornamewithlimits{\mathbb{E}}_{\bm{a}^{-i}\sim\pi_{k}^{-i}(\cdot\mid s)}\left[(\ell^{i}+\operatorname{\mathbb{P}}\overline{V}^{i})(s,\bm{a})\right],\quad\forall s\in\mathcal{S}_{h},a^{i}\in\mathcal{A}^{i}. (B.3)

Then k,hi+V¯i(sk,h+1i)\ell_{k,h}^{i}+\overline{V}^{i}(s_{k,h+1}^{i}) is a sample with mean ϕ(s,ai)𝖳𝜽ki\bm{\phi}(s,a^{i})^{\mathsf{T}}\bm{\theta}_{k}^{i}.

Let πi\pi_{\ast}^{i} be the best-response policy of agent ii when facing π~i\widetilde{\pi}^{-i} (which is the average policy for KK episodes, i.e., π~i=1Kk=1Kπki\widetilde{\pi}^{-i}=\frac{1}{K}\sum_{k=1}^{K}\pi_{k}^{-i}). We need to ensure the following with probability 1δ1-\delta:

Gapi(s)1Kk=1Ka𝒜i(πki(as)πi(as))(ϕ(s,a)𝖳θki),i[m],s𝒮h,{\textsc{Gap}}^{i}(s)\geq\frac{1}{K}\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}(\pi_{k}^{i}(a\mid s)-\pi_{\ast}^{i}(a\mid s))\left({\bm{\phi}}(s,a)^{\mathsf{T}}{\bm{\theta}}_{k}^{i}\right),\quad\forall i\in[m],s\in\mathcal{S}_{h},

while Gap is allowed to involve data-dependent quantities that are available during run-time. Multiplying both sides by KK and further decomposing the right-hand side, this inequality becomes

KGapi(s)\displaystyle K\cdot{\textsc{Gap}}^{i}(s) k=1Ka𝒜i(πki(as)πi(as))(ϕ(s,a)𝖳𝜽ki)\displaystyle\geq\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}(\pi_{k}^{i}(a\mid s)-\pi_{\ast}^{i}(a\mid s))\left({\bm{\phi}}(s,a)^{\mathsf{T}}{\bm{\theta}}_{k}^{i}\right)
=k=1Ka𝒜i(πki(as)πi(as))(Q^ki(s,a)Bki(s,a))Reg-Term+\displaystyle=\underbrace{\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}(\pi_{k}^{i}(a\mid s)-\pi_{\ast}^{i}(a\mid s))\left(\widehat{Q}_{k}^{i}(s,a)-B_{k}^{i}(s,a)\right)}_{{\textsc{Reg-Term}}}+
k=1Ka𝒜iπki(as)ϕ(s,a)𝖳(𝜽ki𝜽^ki)Bias-1+k=1Ka𝒜iπi(as)ϕ(s,a)𝖳(𝜽^ki𝜽ki)Bias-2+\displaystyle\quad\underbrace{\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s){\bm{\phi}}(s,a)^{\mathsf{T}}({\bm{\theta}}_{k}^{i}-\widehat{\bm{\theta}}_{k}^{i})}_{{\textsc{Bias-1}}}+\underbrace{\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{\ast}^{i}(a\mid s){\bm{\phi}}(s,a)^{\mathsf{T}}(\widehat{\bm{\theta}}_{k}^{i}-{\bm{\theta}}_{k}^{i})}_{{\textsc{Bias-2}}}+
k=1Ka𝒜iπki(as)Bki(s,a)Bonus-1k=1Ka𝒜iπi(as)Bki(s,a)Bonus-2+\displaystyle\quad\underbrace{\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)B_{k}^{i}(s,a)}_{{\textsc{Bonus-1}}}-\underbrace{\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{\ast}^{i}(a\mid s)B_{k}^{i}(s,a)}_{{\textsc{Bonus-2}}}+
k=1Ka𝒜i(πki(as)πi(as))(ϕ(s,a)𝖳𝜽^kiQ^ki(s,a))Mag-Reduce,\displaystyle\quad\underbrace{\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}(\pi_{k}^{i}(a\mid s)-\pi_{\ast}^{i}(a\mid s))\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}-\widehat{Q}_{k}^{i}(s,a)\right)}_{{\textsc{Mag-Reduce}}}, (B.4)

for all players i[m]i\in[m] and states s𝒮hs\in\mathcal{S}_{h}, with probability 1δ1-\delta.

In Theorems B.2, B.3, B.6, B.8 and B.9, we control each term in the RHS of Equation B.4 and show that Equation B.4 indeed holds with probability 1𝒪(δ)1-\operatorname{\mathcal{O}}(\delta) when Gap is defined in Equation B.2. ∎

B.1 Bounding Reg-Term via EXP3 Regret Guarantee

In EXP3, one typical requirement is that the loss vector y^k\widehat{y}_{k} fed into EXP3 should satisfy y^k(a)1/η\widehat{y}_{k}(a)\geq-1/\eta (see Lemma E.10). To comply with this condition, prior works usually control the Q-estimate via |ϕ(s,a)𝖳θ^ki|Σ^t,i2γ1\lvert{\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}\rvert\lesssim\lVert\widehat{\Sigma}_{t,i}^{\dagger}\rVert_{2}\leq\gamma^{-1} (Luo et al., 2021) and set ηγ\eta\approx\gamma, suffering a cost of order 𝒪~(γ1)\operatorname{\widetilde{\mathcal{O}}}(\gamma^{-1}).

However, γ1\gamma^{-1} can be prohibitively large when setting γ=𝒪~(K1)\gamma=\operatorname{\widetilde{\mathcal{O}}}(K^{-1}), which Theorem E.5 requires. Fortunately, thanks to the Magnitude-Reduced Estimator by Dai et al. (2023), Q^kim^ki\widehat{Q}_{k}^{i}\geq\widehat{m}_{k}^{i} (defined in Equation 4.5) can be bounded from below by 𝒪~(K)-\operatorname{\widetilde{\mathcal{O}}}(\sqrt{K}), and thus we can pick the standard learning rate of η=𝒪~(K1/2)\eta=\operatorname{\widetilde{\mathcal{O}}}(K^{-1/2}). The other component of y^k\widehat{y}_{k} is the bonus, which is Bki(s,a)β1ϕ(s,a)Σ^t,i2+β2j=1dϕ(s,a)[j]×(sup(s,a)𝒮h×𝒜iΣ^t,iϕ(s,a))[j]B_{k}^{i}(s,a)\triangleq\beta_{1}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\beta_{2}\sum_{j=1}^{d}\bm{\phi}(s,a)[j]\times(\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j] for all kk.

Theorem B.2.

When η=𝒪(min{γH,γβ1+β2})\eta=\operatorname{\mathcal{O}}(\min\{\frac{\sqrt{\gamma}}{H},\frac{\gamma}{\beta_{1}+\beta_{2}}\}), with probability 1δ1-\delta, for all i[m]i\in[m] and s𝒮hs\in\mathcal{S}_{h}:

Reg-Termlog|𝒜i|η+2ηk=1Ka𝒜iπki(as)Q^ki(s,a)2+2ηk=1Ka𝒜iπki(as)Bki(s,a)2.\textsc{Reg-Term}\leq\frac{\log\lvert\mathcal{A}^{i}\rvert}{\eta}+2\eta\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)\widehat{Q}_{k}^{i}(s,a)^{2}+2\eta\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)B_{k}^{i}(s,a)^{2}.
Proof.

The only thing we need to verify before invoking Lemma E.10 is that Q^ki(s,a)Bki(s,a)1/η\widehat{Q}_{k}^{i}(s,a)-B_{k}^{i}(s,a)\geq-1/\eta. We first show Q^ki(s,a)1/η\widehat{Q}_{k}^{i}(s,a)\geq-1/\eta. Recall the definition of Q^ki(s,a)\widehat{Q}_{k}^{i}(s,a) in Equation 4.5:

Q^ki(s,a)=ϕ(s,a)𝖳𝜽^kiH(ϕ(s,a)𝖳Σ^t,iϕ(sk,hi,ak,hi))+m^ki(s,a)\displaystyle\quad\widehat{Q}_{k}^{i}(s,a)={\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}-H\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})\right)_{-}+\widehat{m}_{k}^{i}(s,a)
=ϕ(s,a)𝖳Σ^t,iϕ(sk,hi,ak,hi)(k,hi+V¯i(sk,h+1i))H(ϕ(s,a)𝖳Σ^t,iϕ(sk,hi,ak,hi))+m^ki(s,a).\displaystyle={\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}\leavevmode\nobreak\ {\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})\leavevmode\nobreak\ \left(\ell_{k,h}^{i}+\overline{V}^{i}(s_{k,h+1}^{i})\right)-H\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})\right)_{-}+\widehat{m}_{k}^{i}(s,a).

As |k,hi+V¯i(sk,h+1i)|H\lvert\ell_{k,h}^{i}+\overline{V}^{i}(s_{k,h+1}^{i})\rvert\leq H, we know Q^ki(s,a)m^ki(s,a)\widehat{Q}_{k}^{i}(s,a)\geq\widehat{m}_{k}^{i}(s,a) as x(x)0x-(x)_{-}\geq 0 holds for any xx\in\mathbb{R}. Thus, to lower bound Q^ki(s,a)\widehat{Q}_{k}^{i}(s,a), it suffices to lower bound m^ki(s,a)\widehat{m}_{k}^{i}(s,a). Notice that

(m^ki(s,a))2\displaystyle\left(\widehat{m}_{k}^{i}(s,a)\right)^{2} =(HKκ=1K(ϕ(s,a)𝖳Σ^t,iϕ(sκmag,a~κ,imag)))2\displaystyle=\left(\frac{H}{K}\sum_{\kappa=1}^{K}\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{\kappa}^{\text{mag}},\widetilde{a}_{\kappa,i}^{\text{mag}})\right)_{-}\right)^{2}
(a)H2Kκ=1K(ϕ(s,a)𝖳Σ^t,iϕ(sκmag,a~κ,imag))2\displaystyle\overset{(a)}{\leq}\frac{H^{2}}{K}\sum_{\kappa=1}^{K}\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{\kappa}^{\text{mag}},\widetilde{a}_{\kappa,i}^{\text{mag}})\right)_{-}^{2}
(b)H2Kκ=1K(ϕ(s,a)𝖳Σ^t,iϕ(sκmag,a~κ,imag))2\displaystyle\overset{(b)}{\leq}\frac{H^{2}}{K}\sum_{\kappa=1}^{K}\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{\kappa}^{\text{mag}},\widetilde{a}_{\kappa,i}^{\text{mag}})\right)^{2}
=H2ϕ(s,a)𝖳Σ^t,i(1Kκ=1Kϕ(sκmag,a~κ,imag)ϕ(sκmag,a~κ,imag)𝖳)Σ^t,iϕ(s,a).\displaystyle=H^{2}{\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}\left(\frac{1}{K}\sum_{\kappa=1}^{K}{\bm{\phi}}(s_{\kappa}^{\text{mag}},\widetilde{a}_{\kappa,i}^{\text{mag}}){\bm{\phi}}(s_{\kappa}^{\text{mag}},\widetilde{a}_{\kappa,i}^{\text{mag}})^{\mathsf{T}}\right)\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s,a).

where (a) uses the Cauchy–Schwarz (or Jensen) inequality and (b) uses the fact that (x)2x2(x)_{-}^{2}\leq x^{2} for all xx\in\mathbb{R}. Let the true covariance matrix under π¯\overline{\pi} at layer h[H]h\in[H] for agent i[m]i\in[m] be

Σt,i=𝔼(s,a)hπ¯[ϕ(s,a)ϕ(s,a)𝖳].\Sigma_{t,i}=\operatornamewithlimits{\mathbb{E}}_{(s,a)\sim_{h}\overline{\pi}}\left[{\bm{\phi}}(s,a){\bm{\phi}}(s,a)^{\mathsf{T}}\right]. (B.5)

The average matrix in the middle, namely

Σ~t,imag1Kκ=1Kϕ(sκmag,a~κ,imag)ϕ(sκmag,a~κ,imag)𝖳\widetilde{\Sigma}_{t,i}^{\text{mag}}\triangleq\frac{1}{K}\sum_{\kappa=1}^{K}{\bm{\phi}}(s_{\kappa}^{\text{mag}},\widetilde{a}_{\kappa,i}^{\text{mag}}){\bm{\phi}}(s_{\kappa}^{\text{mag}},\widetilde{a}_{\kappa,i}^{\text{mag}})^{\mathsf{T}}

is an empirical estimation of the true covariance Σt,i\Sigma_{t,i}. Hence, by stochastic matrix concentration results stated as Corollary E.4 of Section E.2, we know

Σ~t,imag32Σt,i+32dKlog(dKδ)Iwith probability 1δK.\widetilde{\Sigma}_{t,i}^{\text{mag}}\preceq\frac{3}{2}\Sigma_{t,i}+\frac{3}{2}\frac{d}{K}\log\left(\frac{dK}{\delta}\right)I\quad\text{with probability }1-\frac{\delta}{K}.

Meanwhile, the following matrix from Equation 4.1 (where we defined Σ^t,i\widehat{\Sigma}_{t,i}^{\dagger}) is yet another empirical estimation of Σt,i\Sigma_{t,i} that is independent of Σ~t,imag\widetilde{\Sigma}_{t,i}^{\text{mag}}:

Σ~t,icov1Kκ=1Kϕ(sκcov,a~κ,icov)ϕ(sκcov,a~κ,icov)𝖳.\widetilde{\Sigma}_{t,i}^{\text{cov}}\triangleq\frac{1}{K}\sum_{\kappa=1}^{K}{\bm{\phi}}(s_{\kappa}^{\text{cov}},\widetilde{a}_{\kappa,i}^{\text{cov}}){\bm{\phi}}(s_{\kappa}^{\text{cov}},\widetilde{a}_{\kappa,i}^{\text{cov}})^{\mathsf{T}}. (B.6)

By similar arguments stated as Corollary E.3 of Section E.2, we have

Σt,i2Σ~t,icov+3dKlog(dKδ)Iwith probability 1δK.\Sigma_{t,i}\preceq 2\widetilde{\Sigma}_{t,i}^{\text{cov}}+3\frac{d}{K}\log\left(\frac{dK}{\delta}\right)I\quad\text{with probability }1-\frac{\delta}{K}.

Taking a union bound over all k[K]k\in[K], the following holds for all kk with probability 12δ1-2\delta:

(m^ki(s,a))2\displaystyle\left(\widehat{m}_{k}^{i}(s,a)\right)^{2} H2ϕ(s,a)𝖳Σ^t,iΣ~t,imagΣ^t,iϕ(s,a)\displaystyle\leq H^{2}{\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}\widetilde{\Sigma}_{t,i}^{\text{mag}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s,a)
3H2ϕ(s,a)𝖳Σ^t,i(Σ~t,icov+γI)Σ^t,iϕ(s,a)\displaystyle\leq 3H^{2}{\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}\left(\widetilde{\Sigma}_{t,i}^{\text{cov}}+\gamma I\right)\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s,a)
=3H2ϕ(s,a)𝖳Σ^t,iϕ(s,a)3H2/γ.\displaystyle=3H^{2}{\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s,a)\leq 3H^{2}/\gamma.

Therefore, on the above event (which holds with probability 1-2\delta), we have \widehat{Q}_{k}^{i}(s,a)\geq-\widehat{m}_{k}^{i}(s,a)\geq-2H/\sqrt{\gamma}, which is at least -1/\eta with our choice of \eta.

For the bonus term B_{k}^{i}, we consider the two parts related to \beta_{1} and \beta_{2} separately. For the \beta_{1}-term, we have the following upper bound since \widehat{\Sigma}_{t,i}^{\dagger}=(\widetilde{\Sigma}_{t,i}^{\text{cov}}+\gamma I)^{-1}\preceq\gamma^{-1}I:

\displaystyle{\beta_{1}}\lVert{\bm{\phi}}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\leq{\beta_{1}}\left\lVert\widehat{\Sigma}_{t,i}^{\dagger}\right\rVert_{2}\leq\frac{{\beta_{1}}}{\gamma}.

For the β2\beta_{2}-related term, notice that for any j[d]j\in[d], we have

(ϕ(s,a))[j]×sup(s,a)𝒮h×𝒜i(Σ^t,iϕ(s,a))[j]=sup(s,a)𝒮h×𝒜i((ϕ(s,a))𝖳𝒆j𝒆j𝖳Σ^t,iϕ(s,a)),(\bm{\phi}(s,a))[j]\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}(\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j]=\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}\left((\bm{\phi}(s,a))^{\mathsf{T}}\bm{e}_{j}\bm{e}_{j}^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime})\right),

where 𝒆jd\bm{e}_{j}\in\mathbb{R}^{d} is the one-hot vector at the jj-th coordinate. By Cauchy-Schwartz, this is further bounded by (ϕ(s,a))𝖳𝒆j𝒆j𝖳Σ^t,i×sup(s,a)𝒮h×𝒜iϕ(s,a)Σ^t,i\lVert(\bm{\phi}(s,a))^{\mathsf{T}}\bm{e}_{j}\bm{e}_{j}^{\mathsf{T}}\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}\lVert\bm{\phi}(s^{\prime},a^{\prime})\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}. We also have ϕ(s,a)Σ^t,iϕ(s,a)2×Σ^t,i2γ1\lVert\bm{\phi}(s^{\prime},a^{\prime})\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}\leq\lVert\bm{\phi}(s^{\prime},a^{\prime})\rVert_{2}\times\sqrt{\lVert\widehat{\Sigma}_{t,i}^{\dagger}\rVert_{2}}\leq\sqrt{\gamma^{-1}} for any (s,a)𝒮h×𝒜i(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}. Thus

j=1d(ϕ(s,a))[j]×sup(s,a)𝒮h×𝒜i(Σ^t,iϕ(s,a))[j]γ1j=1d(ϕ(s,a))𝖳𝒆j𝒆j𝖳Σ^t,i\displaystyle\quad\sum_{j=1}^{d}(\bm{\phi}(s,a))[j]\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}(\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j]\leq\sqrt{\gamma^{-1}}\sum_{j=1}^{d}\lVert(\bm{\phi}(s,a))^{\mathsf{T}}\bm{e}_{j}\bm{e}_{j}^{\mathsf{T}}\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}
=γ1j=1d(ϕ(s,a))𝖳𝒆j𝒆j𝖳Σ^t,i=γ1ϕ(s,a)Σ^t,i.\displaystyle=\sqrt{\gamma^{-1}}\left\lVert\sum_{j=1}^{d}(\bm{\phi}(s,a))^{\mathsf{T}}\bm{e}_{j}\bm{e}_{j}^{\mathsf{T}}\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}=\sqrt{\gamma^{-1}}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}. (B.7)

So the β2\beta_{2}-related term is controlled by β2γ1\beta_{2}\gamma^{-1}, which means Bki(s,a)(β1+β2)γ1B_{k}^{i}(s,a)\leq(\beta_{1}+\beta_{2})\gamma^{-1}.

Thus the condition that \widehat{y}_{k}(a)\geq-1/\eta holds once \eta^{-1}\geq\sqrt{\frac{3H^{2}}{\gamma}}+\frac{\beta_{1}+\beta_{2}}{\gamma}, i.e., \eta=\Omega(\max\{\frac{\sqrt{\gamma}}{H},\frac{\gamma}{\beta_{1}+\beta_{2}}\}). Applying the EXP3 regret bound (Lemma E.10) gives

Reg-Term \displaystyle\leq\frac{\log\lvert\mathcal{A}^{i}\rvert}{\eta}+\eta\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)\left(\widehat{Q}_{k}^{i}(s,a)-B_{k}^{i}(s,a)\right)^{2}
log|𝒜i|η+2ηk=1Ka𝒜iπki(as)Q^ki(s,a)2+2ηk=1Ka𝒜iπki(as)Bki(s,a)2\displaystyle\leq\frac{\log\lvert\mathcal{A}^{i}\rvert}{\eta}+2\eta\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)\widehat{Q}_{k}^{i}(s,a)^{2}+2\eta\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)B_{k}^{i}(s,a)^{2}
log|𝒜i|η+2ηk=1Ka𝒜iπki(as)Q^ki(s,a)2+2k=1Ka𝒜iπki(as)Bki(s,a),\displaystyle\leq\frac{\log\lvert\mathcal{A}^{i}\rvert}{\eta}+2\eta\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)\widehat{Q}_{k}^{i}(s,a)^{2}+2\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)B_{k}^{i}(s,a),

where the last step uses ηBki(s,a)1\eta B_{k}^{i}(s,a)\leq 1. All these terms are available during run-time, so the algorithm can include them into Gapti(s){\textsc{Gap}}_{t}^{i}(s). ∎
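To make the preceding computation concrete, the following minimal numpy sketch (not part of the analysis) assembles, for a single epoch, layer, state, and agent, the run-time quantities used above: the regularized inverse covariance \widehat{\Sigma}_{t,i}^{\dagger}, the action-dependent bonus B_{k}^{i}(s,a), and the Reg-Term surrogate \log\lvert\mathcal{A}^{i}\rvert/\eta+2\eta\sum_{k}\sum_{a}\pi_{k}^{i}(a\mid s)\widehat{Q}_{k}^{i}(s,a)^{2}+2\sum_{k}\sum_{a}\pi_{k}^{i}(a\mid s)B_{k}^{i}(s,a). The feature distribution, the placeholder policies and Q-estimates, and the use of sampled pairs in place of the supremum over \mathcal{S}_{h}\times\mathcal{A}^{i} are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
d, K, H, A, delta = 5, 200, 10, 4, 0.1
gamma = 5 * d / K * np.log(6 * d / delta)        # gamma = (5d/K) log(6d/delta)
beta1, beta2 = d * H / np.sqrt(K), H / K         # orders required by Theorem B.6
eta = np.sqrt(gamma) / H                         # order required by the clipping condition above

def features(n):                                 # stand-ins for phi(s, a) with ||phi||_2 <= 1
    x = rng.normal(size=(n, d))
    return x / np.maximum(1.0, np.linalg.norm(x, axis=1, keepdims=True))

phi_cov = features(K)                            # "cov" samples defining Sigma-hat-dagger
Sigma_dag = np.linalg.inv(phi_cov.T @ phi_cov / K + gamma * np.eye(d))

phi_sa = features(A)                             # phi(s, a) for the A actions at state s
phi_obs = features(K)                            # sampled pairs replacing the sup over S_h x A^i

# Action-dependent bonus: beta1 ||phi(s,a)||^2_{Sigma-dag} + beta2 sum_j phi(s,a)[j] sup_j (Sigma-dag phi')[j].
quad = np.einsum("ad,de,ae->a", phi_sa, Sigma_dag, phi_sa)
sup_cols = (phi_obs @ Sigma_dag).max(axis=0)     # entrywise sup of Sigma-dag phi' over the sampled pairs
bonus = beta1 * quad + beta2 * phi_sa @ sup_cols
print("bonus per action:", bonus, " cap (beta1+beta2)/gamma:", (beta1 + beta2) / gamma)

# Run-time Reg-Term surrogate entering Gap_t^i(s), with placeholder policies and Q-estimates.
pi = np.full((K, A), 1.0 / A)
Q_hat = rng.uniform(0.0, H, size=(K, A))
reg_term = np.log(A) / eta + 2 * eta * np.sum(pi * Q_hat**2) + 2 * np.sum(pi * bonus[None, :])
print("Reg-Term surrogate:", reg_term, "  clipping level -1/eta:", -1.0 / eta)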

B.2 Bounding Bias-1 via Adaptive Freedman Inequality

Theorem B.3.

With probability 12δ1-2\delta, we have the following for all i[m]i\in[m] and s𝒮hs\in\mathcal{S}_{h}:

Bias-1 k=1Kβ14𝔼aπki(s)[ϕ(s,a)𝖳]Σ^t,i2+𝒪(dβ1logdKδ)+\displaystyle\leq\sum_{k=1}^{K}\frac{\beta_{1}}{4}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\operatorname{\mathcal{O}}\left(\frac{d}{\beta_{1}}\log\frac{dK}{\delta}\right)+
822dH2k=1K𝔼aπki(s)[ϕ(s,a)]Σ^t,i2+k=1K(𝔼aπki(s)[ϕ(s,a)𝖳]𝜽^ki)2log4KHγδ.\displaystyle\quad 8\sqrt{2}\sqrt{2dH^{2}\sum_{k=1}^{K}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)\right]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\sum_{k=1}^{K}\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[{\bm{\phi}}(s,a)^{\mathsf{T}}]\widehat{\bm{\theta}}_{k}^{i}\right)^{2}}\log\frac{4KH}{\gamma\delta}.
Proof.

Let {Xk}k=1K\{X_{k}\}_{k=1}^{K} be a sequence of random variables adapted to filtration (k)k=0K(\mathcal{F}_{k})_{k=0}^{K} where

Xk=𝔼aπki(s)[ϕ(s,a)𝖳]𝜽^ki,1kK.X_{k}=\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\widehat{\bm{\theta}}_{k}^{i},\quad\forall 1\leq k\leq K.

Let μk=𝔼[Xkk1]\mu_{k}=\operatornamewithlimits{\mathbb{E}}[X_{k}\mid\mathcal{F}_{k-1}] be the conditional expectations. Then {Xkμk}k=1K\{X_{k}-\mu_{k}\}_{k=1}^{K} forms a martingale difference sequence. We divide Bias-1 into two parts, one for the intrinsic bias of 𝜽^ki\widehat{\bm{\theta}}_{k}^{i} (how 𝔼[𝜽^ki]\operatornamewithlimits{\mathbb{E}}[\widehat{\bm{\theta}}_{k}^{i}] differs from 𝜽ki{\bm{\theta}}_{k}^{i}) and the other for the estimation error (how 𝜽^ki\widehat{\bm{\theta}}_{k}^{i} differs from 𝔼[𝜽^ki]\operatornamewithlimits{\mathbb{E}}[\widehat{\bm{\theta}}_{k}^{i}]). Namely,

Bias-1=k=1K𝔼aπki(s)[ϕ(s,a)𝖳](𝜽ki𝜽^ki)\displaystyle\quad{\textsc{Bias-1}}=\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]({\bm{\theta}}_{k}^{i}-\widehat{\bm{\theta}}_{k}^{i})
=k=1K(𝔼aπki(s)[ϕ(s,a)𝖳]𝜽kiμk)+k=1K(μkXk).\displaystyle=\sum_{k=1}^{K}\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]{\bm{\theta}}_{k}^{i}-\mu_{k}\right)+\sum_{k=1}^{K}(\mu_{k}-X_{k}). (B.8)

The first term is a standard term appearing in regret-minimization analyses of single-agent RL. In Lemma B.4, we control it analogously to Lemma D.2 of Luo et al. (2021), but invoke the new covariance estimation analysis of Liu et al. (2023a) (restated in Section E.2).

The second term is the main obstacle to obtaining high-probability regret bounds for adversarial contextual linear bandits. While we are also unable to provide a deterministic high-probability upper bound, our Improved AVLPR framework (see the discussions after Theorem 3.2) allows data-dependent high-probability bounds. Such a bound is derived in Lemma B.5 by developing a variant of the Adaptive Freedman Inequality proposed by Lee et al. (2020) and improved by Zimmert and Lattimore (2022) (the variant can be found in Section E.3). ∎
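To illustrate what "data-dependent" means here, the sketch below evaluates the right-hand side of Theorem B.3 purely from quantities available at run-time (the policies \pi_{k}^{i}, the estimators \widehat{\bm{\theta}}_{k}^{i}, and \widehat{\Sigma}_{t,i}^{\dagger}). The feature vectors, policies, and estimators are random placeholders, and the additive \operatorname{\mathcal{O}}((d/\beta_{1})\log(dK/\delta)) term is omitted from the printout; this is a sketch under those assumptions, not an implementation of the algorithm.

import numpy as np

rng = np.random.default_rng(3)
d, K, H, A, delta = 5, 200, 10, 4, 0.1
gamma = 5 * d / K * np.log(6 * d / delta)
beta1 = d * H / np.sqrt(K)

phi_cov = rng.normal(size=(K, d))
phi_cov /= np.maximum(1.0, np.linalg.norm(phi_cov, axis=1, keepdims=True))
Sigma_dag = np.linalg.inv(phi_cov.T @ phi_cov / K + gamma * np.eye(d))

phi_sa = rng.normal(size=(A, d))
phi_sa /= np.maximum(1.0, np.linalg.norm(phi_sa, axis=1, keepdims=True))
pi = rng.dirichlet(np.ones(A), size=K)           # placeholder policies pi_k^i(.|s), one row per k
theta_hat = rng.normal(size=(K, d))              # placeholder estimators theta-hat_k^i

mean_phi = pi @ phi_sa                           # E_{a ~ pi_k^i(.|s)}[phi(s, a)], one row per k
quad = np.einsum("kd,de,ke->k", mean_phi, Sigma_dag, mean_phi)   # ||E[phi]||^2_{Sigma-dag}
inner = np.einsum("kd,kd->k", mean_phi, theta_hat)               # E[phi]^T theta-hat_k

bias1_bound = ((beta1 / 4) * quad.sum()
               + 8 * np.sqrt(2)
               * np.sqrt(2 * d * H**2 * quad.sum() + np.sum(inner**2))
               * np.log(4 * K * H / (gamma * delta)))
print("data-dependent part of the Bias-1 bound:", bias1_bound)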

B.2.1 Controlling Intrinsic Bias

Lemma B.4.

With probability 1δ1-\delta, for any i[m]i\in[m] and s𝒮hs\in\mathcal{S}_{h}, we have

\sum_{k=1}^{K}\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]{\bm{\theta}}_{k}^{i}-\mu_{k}\right)\leq\sum_{k=1}^{K}\frac{\beta_{1}}{4}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\operatorname{\mathcal{O}}\left(\frac{d}{\beta_{1}}\log\frac{dK}{\delta}\right).
Proof.

The conditional expectation μk\mu_{k} can be directly calculated as

μk=𝔼Gap[Xkk1]\displaystyle\mu_{k}=\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[X_{k}\mid\mathcal{F}_{k-1}\right] =𝔼aπki(s)[ϕ(s,a)𝖳]𝔼Gap[Σ^t,iϕ(sk,hi,ak,hi)(k,hi+V¯i(sk,h+1i))]\displaystyle=\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\widehat{\Sigma}_{t,i}^{\dagger}\leavevmode\nobreak\ {\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})\leavevmode\nobreak\ \left(\ell_{k,h}^{i}+\overline{V}^{i}(s_{k,h+1}^{i})\right)\right]
=(a)𝔼aπki(s)[ϕ(s,a)𝖳]Σ^t,i𝔼Gap[ϕ(sk,hi,ak,hi)ϕ(sk,hi,ak,hi)𝖳𝜽ki]\displaystyle\overset{(a)}{=}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\widehat{\Sigma}_{t,i}^{\dagger}\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i}){\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})^{\mathsf{T}}{\bm{\theta}}_{k}^{i}\right]
=(b)𝔼aπki(s)[ϕ(s,a)𝖳]Σ^t,iΣt,i𝜽ki,\displaystyle\overset{(b)}{=}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\widehat{\Sigma}_{t,i}^{\dagger}\Sigma_{t,i}{\bm{\theta}}_{k}^{i},

where (a) uses the independence between Σ^t,i\widehat{\Sigma}_{t,i}^{\dagger} and the trajectory (sk,hi,ak,hi)(s_{k,h}^{i},a_{k,h}^{i}), and (b) uses the definition of Σt,i\Sigma_{t,i} in Equation B.5.

To handle the first term of Equation B.8, we use the Cauchy-Schwartz, triangle, and AM-GM inequalities (the calculation follows Lemma D.2 of Luo et al. (2021)):

𝔼aπki(s)[ϕ(s,a)𝖳]𝜽kiμk\displaystyle\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]{\bm{\theta}}_{k}^{i}-\mu_{k} =𝔼aπki(s)[ϕ(s,a)𝖳](IΣ^t,iΣt,i)𝜽ki\displaystyle=\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right](I-\widehat{\Sigma}_{t,i}^{\dagger}\Sigma_{t,i}){\bm{\theta}}_{k}^{i}
=𝔼aπki(s)[ϕ(s,a)𝖳]Σ^t,i(γI+Σ~t,icovΣt,i)𝜽ki\displaystyle=\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\widehat{\Sigma}_{t,i}^{\dagger}(\gamma I+\widetilde{\Sigma}_{t,i}^{\text{cov}}-\Sigma_{t,i}){\bm{\theta}}_{k}^{i}
𝔼aπki(s)[ϕ(s,a)𝖳]Σ^t,i×(γI+Σ~t,icovΣt,i)𝜽kiΣ^t,i\displaystyle\leq\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}\times\left\lVert(\gamma I+\widetilde{\Sigma}_{t,i}^{\text{cov}}-\Sigma_{t,i}){\bm{\theta}}_{k}^{i}\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}
β14𝔼aπki(s)[ϕ(s,a)𝖳]Σ^t,i2+2β1(γI+Σ~t,icovΣt,i)𝜽kiΣ^t,i2,\displaystyle\leq\frac{\beta_{1}}{4}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\frac{2}{\beta_{1}}\left\lVert(\gamma I+\widetilde{\Sigma}_{t,i}^{\text{cov}}-\Sigma_{t,i}){\bm{\theta}}_{k}^{i}\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2},

where Σ~t,icov\widetilde{\Sigma}_{t,i}^{\text{cov}} is defined in Equation B.6 such that Σ^t,i=(Σ~t,icov+γI)1\widehat{\Sigma}_{t,i}^{\dagger}=(\widetilde{\Sigma}_{t,i}^{\text{cov}}+\gamma I)^{-1}. The first term directly goes to Gap as it is available during run-time. The second term can be controlled by the following inequality (Liu et al., 2023a, Lemma 14) which we include as Theorem E.5:

2β1(γI+Σ~t,icovΣt,i)𝜽kiΣ^t,i2=𝒪(1β1dKlogdKδ),with probability 1δK,\frac{2}{\beta_{1}}\left\lVert(\gamma I+\widetilde{\Sigma}_{t,i}^{\text{cov}}-\Sigma_{t,i}){\bm{\theta}}_{k}^{i}\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}=\operatorname{\mathcal{O}}\left(\frac{1}{\beta_{1}}\frac{d}{K}\log\frac{dK}{\delta}\right),\quad\text{with probability }1-\frac{\delta}{K}, (B.9)

where we plugged in the definition of γ\gamma that γ=5dKlog6dδ\gamma=\frac{5d}{K}\log\frac{6d}{\delta}. Conditioning on the good events in Equation B.9 and taking a union bound over all k[K]k\in[K], our conclusion follows. ∎

B.2.2 Controlling Estimation Error

Lemma B.5.

With probability 12δ1-2\delta, for all i[m]i\in[m] and s𝒮hs\in\mathcal{S}_{h}, we have

|k=1K(Xkμk)|822dH2k=1K𝔼aπki(s)[ϕ(s,a)]Σ^t,i2+k=1K(𝔼aπki(s)[ϕ(s,a)𝖳]𝜽^ki)2log4KHγδ.\left\lvert\sum_{k=1}^{K}(X_{k}-\mu_{k})\right\rvert\leq 8\sqrt{2}\sqrt{2dH^{2}\sum_{k=1}^{K}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)\right]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\sum_{k=1}^{K}\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[{\bm{\phi}}(s,a)^{\mathsf{T}}]\widehat{\bm{\theta}}_{k}^{i}\right)^{2}}\log\frac{4KH}{\gamma\delta}.
Proof.

From the Adaptive Freedman Inequality (Lemma E.8 in Section E.3), we have

|i=1n(Xiμi)|42i=1n𝔼[Xi2i1]+i=1nXi2logCδ,with probability 12δ,\left\lvert\sum_{i=1}^{n}(X_{i}-\mu_{i})\right\rvert\leq 4\sqrt{2}\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]+\sum_{i=1}^{n}X_{i}^{2}}\log\frac{C}{\delta},\quad\text{with probability }1-2\delta, (B.10)

where C=2\sqrt{2}\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]+\sum_{i=1}^{n}X_{i}^{2}}. Specializing to n=K and our sequence X_{k}=\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[\bm{\phi}(s,a)^{\mathsf{T}}]\widehat{\bm{\theta}}_{k}^{i}, we have

X_{k}^{2}\leq\lVert\bm{\phi}(s,a)\rVert_{2}^{2}\times\lVert\widehat{\bm{\theta}}_{k}^{i}\rVert_{2}^{2}\leq\lVert\widehat{\Sigma}_{t,i}^{\dagger}\rVert_{2}^{2}\times\lVert\bm{\phi}(s_{k,h}^{i},a_{k,h}^{i})\rVert_{2}^{2}\times\left\lvert\ell_{k,h}^{i}+\overline{V}^{i}(s_{k,h+1}^{i})\right\rvert^{2}\leq\gamma^{-2}H^{2},

where we used Σ^t,iγ1I\widehat{\Sigma}_{t,i}^{\dagger}\preceq\gamma^{-1}I. Hence, C4KHγ1C\leq 4KH\gamma^{-1}.

As Xk2X_{k}^{2} is available during run-time, it only remains to control 𝔼[Xk2k1]\operatornamewithlimits{\mathbb{E}}[X_{k}^{2}\mid\mathcal{F}_{k-1}] to make Equation B.10 calculable. By definition of Xk=𝔼aπki(s)[ϕ(s,a)𝖳]𝜽^kiX_{k}=\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\widehat{\bm{\theta}}_{k}^{i}, we have

𝔼Gap[Xk2k1]=𝔼Gap[(𝔼aπki(s)[ϕ(s,a)𝖳]𝜽^ki)2]\displaystyle\quad\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[X_{k}^{2}\mid\mathcal{F}_{k-1}\right]=\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\widehat{\bm{\theta}}_{k}^{i}\right)^{2}\right]
=𝔼aπki(s)[ϕ(s,a)𝖳]𝔼Gap[𝜽^ki(𝜽^ki)𝖳]𝔼aπki(s)[ϕ(s,a)].\displaystyle=\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\widehat{\bm{\theta}}_{k}^{i}(\widehat{\bm{\theta}}_{k}^{i})^{\mathsf{T}}\right]\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)\right].

We focus on the expectation in the middle, i.e., 𝔼Gap[𝜽^ki(𝜽^ki)𝖳]\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\widehat{\bm{\theta}}_{k}^{i}(\widehat{\bm{\theta}}_{k}^{i})^{\mathsf{T}}\right]. Plugging in the definition of 𝜽^ki\widehat{\bm{\theta}}_{k}^{i}:

𝔼Gap[𝜽^ki(𝜽^ki)𝖳]=Σ^t,i𝔼Gap[ϕ(sk,hi,ak,hi)𝜽ki(𝜽ki)𝖳ϕ(sk,hi,ak,hi)𝖳]Σ^t,idH2Σ^t,iΣt,iΣ^t,i.\displaystyle\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\widehat{\bm{\theta}}_{k}^{i}(\widehat{\bm{\theta}}_{k}^{i})^{\mathsf{T}}\right]=\widehat{\Sigma}_{t,i}^{\dagger}\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i}){\bm{\theta}}_{k}^{i}({\bm{\theta}}_{k}^{i})^{\mathsf{T}}{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})^{\mathsf{T}}\right]\widehat{\Sigma}_{t,i}^{\dagger}\preceq dH^{2}\widehat{\Sigma}_{t,i}^{\dagger}\Sigma_{t,i}\widehat{\Sigma}_{t,i}^{\dagger}.

From Corollary E.3, \Sigma_{t,i}\preceq 2(\widetilde{\Sigma}_{t,i}^{\text{cov}}+\gamma I) w.p. 1-\frac{\delta}{K} when \gamma\geq\frac{3d}{2K}\log\left(\frac{dK}{\delta}\right). Hence,

𝔼Gap[𝜽^ki(𝜽^ki)𝖳]dH2Σ^t,iΣt,iΣ^t,i2dH2Σ^t,iwith probability 1δK.\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\widehat{\bm{\theta}}_{k}^{i}(\widehat{\bm{\theta}}_{k}^{i})^{\mathsf{T}}\right]\preceq dH^{2}\widehat{\Sigma}_{t,i}^{\dagger}\Sigma_{t,i}\widehat{\Sigma}_{t,i}^{\dagger}\preceq 2dH^{2}\widehat{\Sigma}_{t,i}^{\dagger}\quad\text{with probability }1-\frac{\delta}{K}.

Putting this into 𝔼Gap[Xk2k1]=𝔼aπki(s)[ϕ(s,a)𝖳]𝔼Gap[𝜽^ki(𝜽^ki)𝖳]𝔼aπki(s)[ϕ(s,a)]\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[X_{k}^{2}\mid\mathcal{F}_{k-1}\right]=\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\right]\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\widehat{\bm{\theta}}_{k}^{i}(\widehat{\bm{\theta}}_{k}^{i})^{\mathsf{T}}\right]\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)\right] gives

𝔼Gap[Xk2k1]2dH2𝔼aπki(s)[ϕ(s,a)]Σ^t,i2.\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[X_{k}^{2}\mid\mathcal{F}_{k-1}\right]\leq 2dH^{2}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)\right]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}. (B.11)

Our conclusion follows by combining Equations B.11 and B.10. ∎
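As a sanity check (not part of the proof), the following simulation evaluates the data-dependent bound of Equation B.10 on a synthetic martingale difference sequence whose conditional second moment is available in closed form; the particular distribution of X_{k} is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(2)
K, delta, trials = 1000, 0.05, 200
violations = 0
for _ in range(trials):
    sigma = rng.uniform(0.1, 1.0, size=K)        # predictable scales, measurable w.r.t. F_{k-1}
    X = sigma * rng.uniform(-1.0, 1.0, size=K)   # X_k | F_{k-1} ~ Unif[-sigma_k, sigma_k], so mu_k = 0
    cond_second_moment = sigma**2 / 3.0          # E[X_k^2 | F_{k-1}], known here in closed form
    S = np.sqrt(cond_second_moment.sum() + np.sum(X**2))
    C = 2 * np.sqrt(2) * S                       # the constant C of Equation B.10
    bound = 4 * np.sqrt(2) * S * np.log(C / delta)
    violations += int(abs(X.sum()) > bound)      # mu_k = 0, so the martingale sum is sum_k X_k
print(f"empirical violation rate: {violations / trials:.3f} (Equation B.10 allows at most {2 * delta})")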

B.3 Cancelling Bias-2 Using Bonus-2

Bias-2 is similar to Bias-1, except that we now have \operatornamewithlimits{\mathbb{E}}_{a\sim\pi^{i}_{\color[rgb]{0,0,1}\ast}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}\right] instead of \operatornamewithlimits{\mathbb{E}}_{a\sim\pi^{i}_{\color[rgb]{0,0,1}k}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}\right]. This subtle difference prevents us from handling Bias-2 analogously to Bias-1, since \pi_{\ast}^{i} is unknown to the agent. As we sketched in the main text, we also adopt the classical idea of using bonuses to cancel biases. However, as the maximum of \operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{\ast}^{i}(\cdot\mid s)}[\bm{\phi}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}] can be as large as \lVert\widehat{\Sigma}_{t,i}^{\dagger}\rVert_{2}\leq\gamma^{-1}\approx\operatorname{\mathcal{O}}(K), it can no longer be neglected as in previous papers.

As mentioned in the main text, we use a state-action-wise bonus to cancel the maximum martingale difference term induced by the Adaptive Freedman Inequality. As Equation B.4 is linear in \pi_{\ast}^{i}, we only need to consider the \pi_{\ast}^{i}(\cdot\mid s)'s that are one-hot on some action a_{\ast}^{i}\in\mathcal{A}^{i}. For notational simplicity, we abbreviate \bm{\phi}_{\ast}^{i}=\bm{\phi}(s,a_{\ast}^{i}) when s is clear from the context.

Theorem B.6.

When β1=Ω~(dHK){\beta_{1}}=\widetilde{\Omega}\left(\frac{dH}{\sqrt{K}}\right) and β2=Ω~(HK)\beta_{2}=\widetilde{\Omega}(\frac{H}{K}), w.p. 12δ1-2\delta, for all i[m]i\in[m] and s𝒮hs\in\mathcal{S}_{h},

Bias-2+Bonus-2\displaystyle\textsc{Bias-2}+\textsc{Bonus-2} =𝒪(dβ1logdKδ).\displaystyle=\operatorname{\mathcal{O}}\left(\frac{d}{\beta_{1}}\log\frac{dK}{\delta}\right).
Proof.

Imitating the analysis in Section B.2 but applying the original Adaptive Freedman Inequality (Lemma E.7) instead of our Lemma E.8 gives Lemma B.7, i.e.,

{\textsc{Bias-2}}\leq\frac{\beta_{1}}{4}\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{K\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\operatorname{\mathcal{O}}\left(\frac{d}{\beta_{1}}\log\frac{dK}{\delta}\right)+3\sqrt{2dH^{2}\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{K\widehat{\Sigma}_{t,i}^{\dagger}}^{2}}\log\frac{4KH}{\gamma\delta}+2\max_{k\in[K]}(\bm{\phi}_{\ast}^{i})^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}\log\frac{4KH}{\gamma\delta}. (B.12)

By definition of Bonus-2, we have

Bonus-2 =a𝒜iπi(as)(k=1Kβ1ϕ(s,a)Σ^t,i2+β2sup(s,a)𝒮h×𝒜iϕ(s,a)𝖳Σ^t,iϕ(s,a))\displaystyle=\sum_{a\in\mathcal{A}^{i}}\pi_{\ast}^{i}(a\mid s)\left(\sum_{k=1}^{K}{\beta_{1}}\lVert{\bm{\phi}}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\beta_{2}\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}\bm{\phi}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime})\right)
=β1ϕiKΣ^t,i2+Kβ2j=1dϕi[j]×sup(s,a)𝒮h×𝒜i(Σ^t,iϕ(s,a))[j].\displaystyle={\beta_{1}}\lVert\bm{\phi}_{\ast}^{i}\rVert_{K\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+K\beta_{2}\sum_{j=1}^{d}\bm{\phi}_{\ast}^{i}[j]\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}(\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j]. (B.13)

So we only need to control Equation B.12 using Equation B.13. The first term in Equation B.12 is already contained in Equation B.13, while the second term is a constant. For the third term, we would like to control it using the remaining \frac{3}{4}{\beta_{1}}\lVert\bm{\phi}_{\ast}^{i}\rVert_{K\widehat{\Sigma}_{t,i}^{\dagger}}^{2}, i.e., we show

422dH2ϕiKΣ^t,i2log4KHγδ34β1ϕiKΣ^t,i2.4\sqrt{2}\sqrt{2dH^{2}\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{K\widehat{\Sigma}_{t,i}^{\dagger}}^{2}}\log\frac{4KH}{\gamma\delta}\leq\frac{3}{4}{\beta_{1}}\lVert\bm{\phi}_{\ast}^{i}\rVert_{K\widehat{\Sigma}_{t,i}^{\dagger}}^{2}.

In other words, we would like to control \frac{1024}{9}dH^{2}\log^{2}\frac{4KH}{\gamma\delta}\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{K\widehat{\Sigma}_{t,i}^{\dagger}}^{2} by {\beta_{1}}^{2}\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{K\widehat{\Sigma}_{t,i}^{\dagger}}^{4}. Equivalently,

\frac{1024}{9}dH^{2}\log^{2}\frac{4KH}{\gamma\delta}{\beta_{1}}^{-2}\leq\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{K\widehat{\Sigma}_{t,i}^{\dagger}}^{2}=K({\bm{\phi}}_{\ast}^{i})^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}_{\ast}^{i}.

As Σ^t,i=(Σ~t,icov+γI)1(1+γ)1I\widehat{\Sigma}_{t,i}^{\dagger}=(\widetilde{\Sigma}_{t,i}^{\text{cov}}+\gamma I)^{-1}\succeq(1+\gamma)^{-1}I, ϕiKΣ^t,i2K1+γϕi22K1+γ1d\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{K\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\geq\frac{K}{1+\gamma}\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{2}^{2}\geq\frac{K}{1+\gamma}\frac{1}{\sqrt{d}} (recall our assumption that ϕ21d\lVert\bm{\phi}\rVert_{2}\geq\frac{1}{\sqrt{d}}). As γ1\gamma\leq 1, this inequality is ensured so long as

\frac{1024}{9}dH^{2}\log^{2}\frac{4KH}{\gamma\delta}\times 2\sqrt{d}\times{\beta_{1}}^{-2}\leq K\Longleftarrow{\beta_{1}}\geq\frac{64dH\log\frac{4KH}{\gamma\delta}}{3\sqrt{K}}=\widetilde{\Omega}\left(\frac{dH}{\sqrt{K}}\right).

For the last term, by the definition of \widehat{\bm{\theta}}_{k}^{i}, we have (\bm{\phi}_{\ast}^{i})^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}\leq H(\bm{\phi}_{\ast}^{i})^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s_{k,h},a_{k,h}^{i}) with (s_{k,h},a_{k,h}^{i})\in\mathcal{S}_{h}\times\mathcal{A}^{i}, so the term \sum_{j=1}^{d}\bm{\phi}_{\ast}^{i}[j]\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}(\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j] in Equation B.13 covers (\bm{\phi}_{\ast}^{i})^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s_{k,h},a_{k,h}^{i}). Hence the last term of Equation B.12 is absorbed by the second part of Equation B.13 once K\beta_{2}\geq 2H\log\frac{4KH}{\gamma\delta}, and our conclusion follows given that \beta_{1}=\widetilde{\Omega}\left(\frac{dH}{\sqrt{K}}\right) and \beta_{2}=\widetilde{\Omega}(\frac{H}{K}). ∎
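For concreteness, the following sketch computes the smallest \beta_{1} and \beta_{2} satisfying the two sufficient conditions derived in this proof, for one illustrative choice of (d,H,K,\delta) and the \gamma used here; it only restates the displayed thresholds and is not a tuning recommendation.

import math

def min_bonus_coefficients(d, H, K, gamma, delta):
    log_term = math.log(4 * K * H / (gamma * delta))
    beta1 = 64 * d * H * log_term / (3 * math.sqrt(K))  # beta_1 >= 64 d H log(4KH/(gamma delta)) / (3 sqrt(K))
    beta2 = 2 * H * log_term / K                        # from K beta_2 >= 2 H log(4KH/(gamma delta))
    return beta1, beta2

d, H, K, delta = 10, 20, 10_000, 0.01
gamma = 5 * d / K * math.log(6 * d / delta)             # gamma = (5d/K) log(6d/delta)
b1, b2 = min_bonus_coefficients(d, H, K, gamma, delta)
print(f"beta_1 >= {b1:.2f}   (order dH/sqrt(K) = {d * H / math.sqrt(K):.2f} up to logs)")
print(f"beta_2 >= {b2:.4f}  (order H/K = {H / K:.4f} up to logs)")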

Lemma B.7.

With probability 12δ1-2\delta, for all i[m]i\in[m] and s𝒮hs\in\mathcal{S}_{h}, we have

Bias-2 β14k=1KϕiΣ^t,i2+𝒪(dβ1logdKδ)+\displaystyle\leq\frac{\beta_{1}}{4}\sum_{k=1}^{K}\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\operatorname{\mathcal{O}}\left(\frac{d}{\beta_{1}}\log\frac{dK}{\delta}\right)+
32dH2k=1KϕiΣ^t,i2log4KHγδ+2maxk[K](ϕi)𝖳𝜽^kilog4KHγδ.\displaystyle\quad 3\sqrt{2dH^{2}\sum_{k=1}^{K}\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}}\log\frac{4KH}{\gamma\delta}+2\max_{k\in[K]}(\bm{\phi}_{\ast}^{i})^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}\log\frac{4KH}{\gamma\delta}.
Proof.

Imitating Section B.2, we also decompose Bias-2 into intrinsic bias and estimation error. Letting X_{k}=(\bm{\phi}_{\ast}^{i})^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i} and \mu_{k}=\operatornamewithlimits{\mathbb{E}}[X_{k}\mid\mathcal{F}_{k-1}], we have

Bias-2=k=1K(ϕi)𝖳(𝜽^ki𝜽ki)=k=1K(μk(ϕi)𝖳𝜽ki)+k=1K(Xkμk).\displaystyle\textsc{Bias-2}=\sum_{k=1}^{K}(\bm{\phi}_{\ast}^{i})^{\mathsf{T}}(\widehat{\bm{\theta}}_{k}^{i}-{\bm{\theta}}_{k}^{i})=\sum_{k=1}^{K}\left(\mu_{k}-(\bm{\phi}_{\ast}^{i})^{\mathsf{T}}{\bm{\theta}}_{k}^{i}\right)+\sum_{k=1}^{K}(X_{k}-\mu_{k}).

The first term is handled in the same way as in Lemma B.4: concluding that \mu_{k}=({\bm{\phi}}_{\ast}^{i})^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}\Sigma_{t,i}{\bm{\theta}_{k}^{i}} and then applying the Cauchy-Schwartz, triangle, and AM-GM inequalities gives

k=1K(μk(ϕi)𝖳𝜽ki)β14k=1KϕiΣ^t,i2+𝒪(dβ1logdKδ),with probability 1δ.\sum_{k=1}^{K}\left(\mu_{k}-(\bm{\phi}_{\ast}^{i})^{\mathsf{T}}{\bm{\theta}}_{k}^{i}\right)\leq\frac{\beta_{1}}{4}\sum_{k=1}^{K}\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\operatorname{\mathcal{O}}\left(\frac{d}{\beta_{1}}\log\frac{dK}{\delta}\right),\quad\text{with probability }1-\delta.

For the second term, the proof is also similar to Lemma B.5. The only difference is that instead of our Lemma E.8, we now apply the original Adaptive Freedman Inequality (in Lemma E.7). We get

k=1K(Xkμk)3k=1K𝔼[Xk2k1]logCδ+2maxk[K]XklogCδ,with probability 1δ,\sum_{k=1}^{K}(X_{k}-\mu_{k})\leq 3\sqrt{\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}[X_{k}^{2}\mid\mathcal{F}_{k-1}]}\log\frac{C}{\delta}+2\max_{k\in[K]}X_{k}\log\frac{C}{\delta},\quad\text{with probability }1-\delta,

where C=2\max\{1,\sqrt{\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}[X_{k}^{2}\mid\mathcal{F}_{k-1}]},\max_{k\in[K]}X_{k}\}. Following the calculations in Lemma B.5, C is bounded by 4KH\gamma^{-1} and \operatornamewithlimits{\mathbb{E}}[X_{k}^{2}\mid\mathcal{F}_{k-1}]\leq 2dH^{2}\lVert{\bm{\phi}}_{\ast}^{i}\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}. The maximum part is directly contained in our conclusion by noticing that X_{k}=(\bm{\phi}_{\ast}^{i})^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}. ∎

B.4 Putting Bonus-1 into Gap Directly

The two components in the Bonus-1 term, namely πki\pi_{k}^{i} and BkiB_{k}^{i}, are both known during run-time. So we trivially have the following theorem:

Theorem B.8.

For all i[m]i\in[m] and s𝒮hs\in\mathcal{S}_{h}, we have

Bonus-1k=1Ka𝒜iπki(as)Bki(s,a).\textsc{Bonus-1}\leq\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)B_{k}^{i}(s,a).
Proof.

This is the definition of Bonus-1. ∎

B.5 Bounding Mag-Reduce via Martingale Properties

Theorem B.9.

Mag-Reduce is bounded by the sum of the right-hand sides of Lemmas B.5 and B.7 w.p. 1-\operatorname{\widetilde{\mathcal{O}}}(\delta).

Proof.

By definition of Q^ki(s,a)\widehat{Q}_{k}^{i}(s,a) in Equation 4.5, we have

Mag-Reduce=k=1Ka𝒜i(πki(as)πi(as))(ϕ(s,a)𝖳𝜽^kiQ^ki(s,a))\displaystyle\quad\textsc{Mag-Reduce}=\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}(\pi_{k}^{i}(a\mid s)-\pi_{\ast}^{i}(a\mid s))\bigg{(}{\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}-\widehat{Q}_{k}^{i}(s,a)\bigg{)}
=k=1Ka𝒜i(πki(as)πi(as))(H(ϕ(s,a)𝖳Σ^t,iϕ(sk,h,ak,hi))HKκ=1K(ϕ(s,a)𝖳Σ^t,iϕ(sκmag,aκ,imag)))\displaystyle=\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}(\pi_{k}^{i}(a\mid s)-\pi_{\ast}^{i}(a\mid s))\bigg{(}H\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h},a_{k,h}^{i})\right)_{-}-\frac{H}{K}\sum_{\kappa=1}^{K}\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{\kappa}^{\text{mag}},a_{\kappa,i}^{\text{mag}})\right)_{-}\bigg{)}

As (s_{k,h},a_{k,h}^{i}) and (s_{\kappa}^{\text{mag}},a_{\kappa,i}^{\text{mag}}) are both sampled from \overline{\pi}, all these (\cdot)_{-} terms share a common mean. Thus, by telescoping, we can decompose Mag-Reduce into the sum of K+1 martingales.

It suffices to consider only one of them, for example,

k=1Ka𝒜i(πki(as)πi(as))H((ϕ(s,a)𝖳Σ^t,iϕ(sk,h,ak,hi))𝔼(s,a)hiπ¯[(ϕ(s,a)𝖳Σ^t,iϕ(s,a))]),\displaystyle\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}(\pi_{k}^{i}(a\mid s)-\pi_{\ast}^{i}(a\mid s))H\left(\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h},a_{k,h}^{i})\right)_{-}-\operatornamewithlimits{\mathbb{E}}_{(s^{\prime},a^{\prime})\sim_{h}^{i}\overline{\pi}}\bigg{[}\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s^{\prime},a^{\prime})\right)_{-}\bigg{]}\right), (B.14)

where (s,a)hiπ(s,a)\sim_{h}^{i}\pi means the hh-th layer state and the hh-th layer ii-th agent action sampled from π\pi.

Again, there are two components in Equation B.14, one related to \pi_{k}^{i} and the other related to \pi_{\ast}^{i}. Fortunately, they can be handled similarly to Theorems B.3 and B.6: For the \pi_{k}^{i} part, applying Lemma E.8 as in Section B.2.2, we have

k=1Ka𝒜iπki(as)H((ϕ(s,a)𝖳Σ^t,iϕ(sk,h,ak,hi))𝔼(s,a)hiπ¯[(ϕ(s,a)𝖳Σ^t,iϕ(s,a))])\displaystyle\quad\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)H\left(\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h},a_{k,h}^{i})\right)_{-}-\operatornamewithlimits{\mathbb{E}}_{(s^{\prime},a^{\prime})\sim_{h}^{i}\overline{\pi}}\bigg{[}\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s^{\prime},a^{\prime})\right)_{-}\bigg{]}\right)
𝒪~(Hk=1K𝔼Gap[(𝔼aπki(s)[(ϕ(s,a)𝖳Σ^t,iϕ(sk,hi,ak,hi))])2]+\displaystyle\leq\operatorname{\widetilde{\mathcal{O}}}\left(H\sqrt{\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})\right)_{-}\right]\right)^{2}\right]}+\right.
Hk=1K(𝔼aπki(s)[(ϕ(s,a)𝖳Σ^t,iϕ(sk,hi,ak,hi))])2).\displaystyle\quad\qquad\left.H\sqrt{\sum_{k=1}^{K}\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})\right)_{-}\right]\right)^{2}}\right).

Noting that \operatornamewithlimits{\mathbb{E}}[(\cdot)_{-}]^{2}\leq\operatornamewithlimits{\mathbb{E}}[(\cdot)_{-}^{2}]\leq\operatornamewithlimits{\mathbb{E}}[(\cdot)^{2}], this bound becomes identical to the conclusion of Lemma B.5. Thus, this component only causes an \operatorname{\widetilde{\mathcal{O}}}(1) contribution to the final Gap and can be neglected.

Similarly, for πi\pi_{\ast}^{i}, we apply Lemma E.7 like we did in Lemma B.7. We have

k=1Ka𝒜iπi(as)H((ϕ(s,a)𝖳Σ^t,iϕ(sk,h,ak,hi))𝔼(s,a)hiπ¯[(ϕ(s,a)𝖳Σ^t,iϕ(s,a))])\displaystyle\quad-\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{\ast}^{i}(a\mid s)H\left(\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h},a_{k,h}^{i})\right)_{-}-\operatornamewithlimits{\mathbb{E}}_{(s^{\prime},a^{\prime})\sim_{h}^{i}\overline{\pi}}\bigg{[}\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s^{\prime},a^{\prime})\right)_{-}\bigg{]}\right)
𝒪~(Hk=1K𝔼Gap[(𝔼aπi(s)[(ϕ(s,a)𝖳Σ^t,iϕ(sk,hi,ak,hi))])2]+\displaystyle\leq\operatorname{\widetilde{\mathcal{O}}}\left(H\sqrt{\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{\ast}^{i}(\cdot\mid s)}\left[\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})\right)_{-}\right]\right)^{2}\right]}+\right.
Hmaxk[K](𝔼aπi(s)[(ϕ(s,a)𝖳Σ^t,iϕ(sk,hi,ak,hi))])).\displaystyle\quad\qquad\left.H\max_{k\in[K]}\left(-\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{\ast}^{i}(\cdot\mid s)}\left[\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})\right)_{-}\right]\right)\right).

Again, we have 𝔼[()]2𝔼[()2]𝔼[()2]\operatornamewithlimits{\mathbb{E}}[(\cdot)_{-}]^{2}\leq\operatornamewithlimits{\mathbb{E}}[(\cdot)_{-}^{2}]\leq\operatornamewithlimits{\mathbb{E}}[(\cdot)^{2}] and also |𝔼[()]|𝔼[()]\lvert-\operatornamewithlimits{\mathbb{E}}[(\cdot)_{-}]\rvert\leq\operatornamewithlimits{\mathbb{E}}[(\cdot)]. Thus, this part also produces the same result as Lemma B.7. ∎

Appendix C Controlling the Expectation of Gap Using Potentials

In this section, we verify Equation A.3, i.e., prove Theorem 4.2.

Theorem C.1 (Algorithm 2 Allows a Potential Function; Formal Version of Theorem 4.2).

Under the conditions of Theorem B.1, it is possible to tune Algorithm 2 such that

t=1Ti=1m𝔼shπ~t[𝔼Gap[Gapti(s)]]=𝒪~(md2HT),with probability 1𝒪~(δ).\sum_{t=1}^{T}\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}[{\textsc{Gap}}_{t}^{i}(s)]\right]=\operatorname{\widetilde{\mathcal{O}}}(md^{2}H\sqrt{T}),\quad\text{with probability }1-\operatorname{\widetilde{\mathcal{O}}}(\delta).

In other words, Equation A.3 is ensured by picking L=d4H2L=d^{4}H^{2}.

Proof.

The proof is divided into two parts. In Theorem C.2, we calculate \operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}[\operatornamewithlimits{\mathbb{E}}_{{\textsc{Gap}}}[{\textsc{Gap}}_{t}^{i}(s)]] for any roll-in policy \widetilde{\pi} (although we only use \widetilde{\pi}_{t} as the roll-in policy, it is unknown at the point when \textsc{CCE-Approx}_{h} is executed, because we iterate over h=H,H-1,\ldots,1). Then, in Theorem C.7, we control their summation using the definition of potential functions in Equation 4.9, borrowing techniques from Zanette and Wainwright (2022) and Cui et al. (2023). ∎

C.1 Calculating the Expectation of Gap w.r.t. Any π~\widetilde{\pi}

Theorem C.2.

Consider a single agent i\in[m], epoch t\in[T], layer h\in[H], and any outcome of \textsc{CCE-Approx}_{h} executed with K set to t. Then, for any “roll-in” policy \widetilde{\pi} (which is chosen as \widetilde{\pi}_{t} in Equation A.3), under the conditions in Theorem B.1, i.e., setting \eta=\Omega(\max\{\frac{\sqrt{\gamma}}{H},\frac{\gamma}{\beta_{1}+\beta_{2}}\}), \beta_{1}=\widetilde{\Omega}(\frac{dH}{\sqrt{K}}), \beta_{2}=\widetilde{\Omega}(\frac{H}{K}), and \gamma=\operatorname{\widetilde{\mathcal{O}}}(\frac{d}{K}) for the execution of \textsc{CCE-Approx} with K\leftarrow t, we have

𝔼shπ~[𝔼Gap[t×Gapti(s)]]\displaystyle\quad\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{{\textsc{Gap}}}[t\times{\textsc{Gap}}_{t}^{i}(s)]\right]
\displaystyle=\operatorname{\widetilde{\mathcal{O}}}\Bigg{(}\eta^{-1}+\eta H^{2}t\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]+\sqrt{d}H\sqrt{t\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]}+
\displaystyle\qquad\qquad\frac{d}{\beta_{1}}+\beta_{1}t\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]+\beta_{2}t\sqrt{\gamma^{-1}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]}\Bigg{)}.

Note that, although in Algorithm 1 we mixed up Gap’s from different rr’s, all Gap’s are i.i.d. samples from the same distribution and thus 𝔼[Gapi(s)]\operatornamewithlimits{\mathbb{E}}[{\textsc{Gap}}^{i}(s)] does not depend on the choice of r(s)r^{\ast}(s) in Equation 3.1.

Proof.

Recalling the definition of {\textsc{Gap}}_{t}^{i}(s) from Equation B.2, we have the following decomposition:

𝔼shπ~[𝔼Gap[K×Gapti(s)]]=log|𝒜i|η+2η𝔼shπ~[𝔼Gap[k=1Ka𝒜iπki(as)Q^ki(s,a)2]]Reg-Term+\displaystyle\quad\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[K\times{\textsc{Gap}}_{t}^{i}(s)\right]\right]=\underbrace{\frac{\log\lvert\mathcal{A}^{i}\rvert}{\eta}+2\eta\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)\widehat{Q}_{k}^{i}(s,a)^{2}\right]\right]}_{\textsc{Reg-Term}}+
82𝔼shπ~[𝔼Gap[2dH2k=1K𝔼aπki(s)[ϕ(s,a)]Σ^t,i2+k=1K(𝔼aπki(s)[ϕ(s,a)𝖳]𝜽^ki)2log4KHγδ]]Bias-1+\displaystyle\quad\underbrace{8\sqrt{2}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sqrt{2dH^{2}\sum_{k=1}^{K}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)\right]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\sum_{k=1}^{K}\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[{\bm{\phi}}(s,a)^{\mathsf{T}}]\widehat{\bm{\theta}}_{k}^{i}\right)^{2}}\log\frac{4KH}{\gamma\delta}\right]\right]}_{\textsc{Bias-1}}+
𝒪(dβ1logdKδ)Bias-2+Bonus-2+\displaystyle\quad\underbrace{\operatorname{\mathcal{O}}\left(\frac{d}{\beta_{1}}\log\frac{dK}{\delta}\right)}_{\textsc{Bias-2}+\textsc{Bonus-2}}+
𝔼shπ~[𝔼Gap[k=1K𝔼aπki(s)[β1ϕ(s,a)Σ^t,i2+β2j=1dϕ(s,a)[j]×sup(s,a)𝒮h×𝒜i(Σ^t,iϕ(s,a))[j]]]]Bonus-1.\displaystyle\quad\underbrace{\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\beta_{1}}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\beta_{2}\sum_{j=1}^{d}\bm{\phi}(s,a)[j]\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}(\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j]\right]\right]\right]}_{\textsc{Bonus-1}}. (C.1)

We then bound these terms one by one in Lemmas C.3, C.4, C.5, and C.6, which together give the conclusion. ∎

Lemma C.3.

For the Reg-Term part in Equation C.1, we have

log|𝒜i|η+2η𝔼shπ~[𝔼Gap[k=1Ka𝒜iπki(as)Q^ki(s,a)2]]𝒪~(η1+ηH2t𝔼(s,a)hiπ~[ϕ(s,a)Σ^t,i2]),\frac{\log\lvert\mathcal{A}^{i}\rvert}{\eta}+2\eta\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)\widehat{Q}_{k}^{i}(s,a)^{2}\right]\right]\leq\operatorname{\widetilde{\mathcal{O}}}\left(\eta^{-1}+\eta H^{2}t\operatornamewithlimits{\mathbb{E}}_{(s,a)\sim_{h}^{i}\widetilde{\pi}}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right),

where hi\sim_{h}^{i} stands for the state-action pair that agent ii observes in layer hh.

Proof.

The first component is already a constant and only contributes \operatorname{\widetilde{\mathcal{O}}}(\eta^{-1}) to the final bound. For the second component, we invoke the property of the Magnitude-Reduced Estimator \widehat{Q}_{k}^{i}(s,a) by Dai et al. (2023) (which we summarize as Lemma E.1) and conclude \operatornamewithlimits{\mathbb{E}}[\widehat{Q}_{k}^{i}(s,a)^{2}]=\operatorname{\mathcal{O}}(\operatornamewithlimits{\mathbb{E}}[(\bm{\phi}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i})^{2}]), where the expectation is only taken w.r.t. the randomness in Equation 4.5. Thus

2η𝔼shπ~[𝔼Gap[k=1Ka𝒜iπki(as)Q^ki(s,a)2]]=2η𝔼shπ~[𝔼Gap[k=1Ka𝒜iπki(as)(ϕ(s,a)𝖳𝜽^ki)2]]\displaystyle\quad 2\eta\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)\widehat{Q}_{k}^{i}(s,a)^{2}\right]\right]=2\eta\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sum_{k=1}^{K}\sum_{a\in\mathcal{A}^{i}}\pi_{k}^{i}(a\mid s)\left({\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}\right)^{2}\right]\right]
2ηH2k=1K𝔼shπ~[𝔼aπki(s)[𝔼Gap[ϕ(sk,hi,ak,hi)𝖳Σ^t,iϕ(s,a)ϕ(s,a)𝖳Σ^t,iϕ(sk,hi,ak,hi)]]]\displaystyle\leq 2\eta H^{2}\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s,a){\bm{\phi}}(s,a)^{\mathsf{T}}\widehat{\Sigma}_{t,i}^{\dagger}{\bm{\phi}}(s_{k,h}^{i},a_{k,h}^{i})\right]\right]\right]
=(a)2ηH2k=1K𝔼shπ~[𝔼aπki(s)[Σ^t,iΣt,iΣ^t,i,ϕ(s,a)ϕ(s,a)𝖳]]\displaystyle\overset{(a)}{=}2\eta H^{2}\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[\langle\widehat{\Sigma}_{t,i}^{\dagger}\Sigma_{t,i}\widehat{\Sigma}_{t,i}^{\dagger},{\bm{\phi}}(s,a){\bm{\phi}}(s,a)^{\mathsf{T}}\rangle\right]\right]
(b)2ηH2K𝔼shπ~[𝔼aπ~ti(s)[ϕ(s,a)Σ^t,i2]],\displaystyle\overset{(b)}{\leq}2\eta H^{2}K\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right],

where (a) uses the fact that (s_{k,h}^{i},a_{k,h}^{i}) are sampled from \overline{\pi}_{t} (recall the definition of \widehat{\Sigma}_{t,i}^{\dagger} in Equation 4.1), and (b) uses Corollary E.3 from Liu et al. (2023a) (which gives \Sigma_{t,i}\preceq(\widehat{\Sigma}_{t,i}^{\dagger})^{-1}) together with the fact that \widetilde{\pi}_{t}\triangleq\frac{1}{K}\sum_{k=1}^{K}\pi_{k} (for those states in \mathcal{S}_{h}). Recalling the configuration K=t in \textsc{CCE-Approx}, the above quantity is \operatorname{\widetilde{\mathcal{O}}}\left(\eta H^{2}t\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]\right). ∎

Lemma C.4.

For the Bias-1 part in Equation C.1, we have

82𝔼shπ~[𝔼Gap[2dH2k=1K𝔼aπki(s)[ϕ(s,a)]Σ^t,i2+k=1K(𝔼aπki(s)[ϕ(s,a)𝖳]𝜽^ki)2log4KHγδ]]\displaystyle\quad 8\sqrt{2}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sqrt{2dH^{2}\sum_{k=1}^{K}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\bm{\phi}}(s,a)\right]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\sum_{k=1}^{K}\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[{\bm{\phi}}(s,a)^{\mathsf{T}}]\widehat{\bm{\theta}}_{k}^{i}\right)^{2}}\log\frac{4KH}{\gamma\delta}\right]\right]
=𝒪~(dHt𝔼shπ~[𝔼aπ~ti(s)[ϕ(s,a)Σ^t,i2]]).\displaystyle=\operatorname{\widetilde{\mathcal{O}}}\left(\sqrt{d}H\sqrt{t\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]}\right).
Proof.

As X+YX+Y\sqrt{X+Y}\leq\sqrt{X}+\sqrt{Y}, we can write (ignoring constants and logarithmic factors)

Bias-1=𝒪~(\displaystyle\textsc{Bias-1}=\operatorname{\widetilde{\mathcal{O}}}\Bigg{(} dH𝔼shπ~[𝔼Gap[k=1K𝔼aπki(s)[ϕ(s,a)]Σ^t,i2]]+\displaystyle\sqrt{d}H\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sqrt{\sum_{k=1}^{K}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[\bm{\phi}(s,a)]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}}\right]\right]+
𝔼shπ~[𝔼Gap[k=1K(𝔼aπki(s)[ϕ(s,a)]𝖳𝜽^ki)2]]).\displaystyle\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sqrt{\sum_{k=1}^{K}\left(\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[\bm{\phi}(s,a)]^{\mathsf{T}}\widehat{\bm{\theta}}_{k}^{i}\right)^{2}}\right]\right]\Bigg{)}.

According to the calculations on \operatornamewithlimits{\mathbb{E}}[X_{k}^{2}\mid\mathcal{F}_{k-1}] in Lemma B.5, the second term is bounded by the first one. Utilizing the fact that \operatornamewithlimits{\mathbb{E}}[\sqrt{X}]\leq\sqrt{\operatornamewithlimits{\mathbb{E}}[X]}, we have

Bias-1 𝒪~(dH𝔼shπ~[𝔼Gap[k=1K𝔼aπki(s)[ϕ(s,a)]Σ^t,i2]])\displaystyle\leq\operatorname{\widetilde{\mathcal{O}}}\left(\sqrt{d}H\sqrt{\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sum_{k=1}^{K}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[\bm{\phi}(s,a)]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]}\right)
=𝒪~(dHk=1K𝔼shπ~[𝔼aπki(s)[ϕ(s,a)ϕ(s,a)𝖳]],Σ^t,i)\displaystyle=\operatorname{\widetilde{\mathcal{O}}}\left(\sqrt{d}H\sqrt{\sum_{k=1}^{K}\left\langle\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[\bm{\phi}(s,a)\bm{\phi}(s,a)^{\mathsf{T}}]\right],\widehat{\Sigma}_{t,i}^{\dagger}\right\rangle}\right)
=𝒪~(dHt𝔼shπ~[𝔼aπ~ti(s)[ϕ(s,a)Σ^t,i2]]),\displaystyle=\operatorname{\widetilde{\mathcal{O}}}\left(\sqrt{d}H\sqrt{t\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]}\right),

where the last line again uses the configuration K=tK=t and the definition of π~t\widetilde{\pi}_{t}. ∎

Lemma C.5.

The Bias-2 + Bonus-2 part in Equation C.1 is of order 𝒪~(dβ11)\operatorname{\widetilde{\mathcal{O}}}(d{\beta_{1}}^{-1}).

Proof.

This part is already a constant in Equation A.3. ∎

Lemma C.6.

The Bonus-1 term in Equation C.1 is bounded by

𝔼shπ~[𝔼Gap[k=1K𝔼aπki(s)[β1ϕ(s,a)Σ^t,i2+β2j=1dϕ(s,a)[j]×sup(s,a)𝒮h×𝒜i(Σ^t,iϕ(s,a))[j]]]]\displaystyle\quad\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\beta_{1}}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}+\beta_{2}\sum_{j=1}^{d}\bm{\phi}(s,a)[j]\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}(\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j]\right]\right]\right]
\displaystyle\leq\beta_{1}t\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]+\beta_{2}t\sqrt{\gamma^{-1}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]}.
Proof.

For the β1\beta_{1}-part, we have

\displaystyle\quad\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[{\beta_{1}}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]\right]
\displaystyle={\beta_{1}}\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sum_{k=1}^{K}\left\langle\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[\bm{\phi}(s,a)\bm{\phi}(s,a)^{\mathsf{T}}\right]\right],\widehat{\Sigma}_{t,i}^{\dagger}\right\rangle\right]
=β1t𝔼shπ~[𝔼aπ~ti(s)[ϕ(s,a)Σ^t,i2]],\displaystyle=\beta_{1}t\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right],

which directly becomes the first part in the conclusion (where, again, we used the configuration that K=tK=t and also the definition of π~t\widetilde{\pi}_{t}).

For the β2\beta_{2}-part, from the calculations in Equation B.7, we know

j=1d(ϕ(s,a))[j]×sup(s,a)𝒮h×𝒜i(Σ^t,iϕ(s,a))[j]γ1ϕ(s,a)Σ^t,i.\sum_{j=1}^{d}(\bm{\phi}(s,a))[j]\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}(\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j]\leq\sqrt{\gamma^{-1}}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}.

Utilizing the Cauchy-Schwartz inequality and the fact that \operatornamewithlimits{\mathbb{E}}[\sqrt{X}]\leq\sqrt{\operatornamewithlimits{\mathbb{E}}[X]}, we can get

𝔼shπ~[𝔼Gap[k=1K𝔼aπki(s)[β2j=1dϕ(s,a)[j]×sup(s,a)𝒮h×𝒜i(Σ^t,iϕ(s,a))[j]]]]\displaystyle\quad\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[\beta_{2}\sum_{j=1}^{d}\bm{\phi}(s,a)[j]\times\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}_{h}\times\mathcal{A}^{i}}(\widehat{\Sigma}_{t,i}^{\dagger}\bm{\phi}(s^{\prime},a^{\prime}))[j]\right]\right]\right]
β2γ1K𝔼shπ~[𝔼Gap[k=1K𝔼aπki(s)[ϕ(s,a)Σ^t,i2]]]\displaystyle\leq\beta_{2}\sqrt{\gamma^{-1}}\sqrt{K\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}\left[\sum_{k=1}^{K}\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]\right]}
\displaystyle=\beta_{2}t\sqrt{\gamma^{-1}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{t}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right]}.

Putting two parts together gives our conclusion. ∎

C.2 Summing Up 𝔼[Gap]\operatornamewithlimits{\mathbb{E}}[{\textsc{Gap}}]’s Using Potentials

Theorem C.7.

Under the conditions of Theorem B.1, i.e., setting \eta=\Omega(\max\{\frac{\sqrt{\gamma}}{H},\frac{\gamma}{\beta_{1}+\beta_{2}}\}), \beta_{1}=\widetilde{\Omega}(\frac{dH}{\sqrt{K}}), \beta_{2}=\widetilde{\Omega}(\frac{H}{K}), and \gamma=\operatorname{\widetilde{\mathcal{O}}}(\frac{d}{K}) for \textsc{CCE-Approx} executions with parameter K (which is set to t in each epoch t where Line 5 is violated), we have

t=1Ti=1m𝔼shπ~t[𝔼Gap[Gapti(s)]]=𝒪~(md2HT),with probability 1𝒪~(δ).\sum_{t=1}^{T}\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}[{\textsc{Gap}}_{t}^{i}(s)]\right]=\operatorname{\widetilde{\mathcal{O}}}(md^{2}H\sqrt{T}),\quad\text{with probability }1-\operatorname{\widetilde{\mathcal{O}}}(\delta).
Proof.

For any t, let I_{t} be the last epoch in which Line 5 was violated. Then \widetilde{\pi}_{t}=\widetilde{\pi}_{I_{t}} and \widehat{\Sigma}_{t,i}^{\dagger}=\widehat{\Sigma}_{I_{t},i}^{\dagger} by Line 5. For any \tau, we denote by n_{\tau} the number of indices t such that I_{t}=\tau. Throughout the proof, we use \eta_{t},\beta_{1,t},\beta_{2,t},\gamma_{t} to denote the \eta,\beta_{1},\beta_{2},\gamma used by \textsc{CCE-Approx} in the t-th epoch, respectively. Recall the conclusion from Theorem C.2 (note that the LHS of Theorem C.2 is \operatornamewithlimits{\mathbb{E}}[t\times{\textsc{Gap}}_{t}^{i}(s)]),

t=1Ti=1m𝔼shπ~t[𝔼Gap[Gapti(s)]]=t=1Ti=1m𝔼shπ~It[𝔼Gap[GapIti(s)]]\displaystyle\quad\sum_{t=1}^{T}\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}[{\textsc{Gap}}_{t}^{i}(s)]\right]=\sum_{t=1}^{T}\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{I_{t}}}\left[\operatornamewithlimits{\mathbb{E}}_{\textsc{Gap}}[{\textsc{Gap}}_{I_{t}}^{i}(s)]\right]
=t=1Ti=1m𝒪~(ηIt1It+dβ1,It1It+(ηItH2+β1,It)𝔼shπ~It[𝔼aπ~Iti(s)[ϕ(s,a)Σ^It,i2]]+\displaystyle=\sum_{t=1}^{T}\sum_{i=1}^{m}\operatorname{\widetilde{\mathcal{O}}}\Bigg{(}\frac{\eta_{I_{t}}^{-1}}{I_{t}}+\frac{d\beta_{1,I_{t}}^{-1}}{I_{t}}+\left(\eta_{I_{t}}H^{2}+\beta_{1,I_{t}}\right)\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{I_{t}}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{I_{t}}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{I_{t},i}^{\dagger}}^{2}\right]\right]+
(HdIt+β2,ItγIt1)𝔼shπ~It[𝔼aπ~Iti(s)[ϕ(s,a)Σ^It,i2]]).\displaystyle\qquad\qquad\qquad\left(H\sqrt{\frac{d}{I_{t}}}+\beta_{2,I_{t}}\sqrt{\gamma_{I_{t}}^{-1}}\right)\sqrt{\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{I_{t}}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{I_{t}}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{I_{t},i}^{\dagger}}^{2}\right]\right]}\Bigg{)}.

We focus on a single agent i[m]i\in[m]. For notational simplicity, we denote (s,a)hiπ(s,a)\sim_{h}^{i}\pi as the state-action pair that agent i[m]i\in[m] observes in layer h[H]h\in[H]. The first two terms are bounded by 𝒪~(t=1T(ηt1/t+dβ1,t1/t))\operatorname{\widetilde{\mathcal{O}}}(\sum_{t=1}^{T}(\eta_{t}^{-1}/t+d\beta_{1,t}^{-1}/t)). For the third term, we replace the coefficients with a sup:

t=1T𝒪~(ηItH2+β1,It)𝔼shπ~It[𝔼aπ~Iti(s)[ϕ(s,a)Σ^It,i2]]\displaystyle\quad\sum_{t=1}^{T}\operatorname{\widetilde{\mathcal{O}}}\left(\eta_{I_{t}}H^{2}+\beta_{1,I_{t}}\right)\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{I_{t}}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{I_{t}}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{{I_{t}},i}^{\dagger}}^{2}\right]\right]
supt[T](tηtH2+tβ1,t)𝒪~(t=1Tnt𝔼(s,a)hiπ~t[1tϕ(s,a)Σ^t,i2]),\displaystyle\leq\sup_{t\in[T]}\left(t\eta_{t}H^{2}+t\beta_{1,t}\right)\operatorname{\widetilde{\mathcal{O}}}\left(\sum_{t=1}^{T}n_{t}\operatornamewithlimits{\mathbb{E}}_{(s,a)\sim_{h}^{i}\widetilde{\pi}_{t}}\left[\frac{1}{t}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\right),

where, for simplicity, we define n_{t} as zero for those t where Line 5 is not violated. Similarly, for the last term, we replace the coefficients with a sup and then apply Cauchy-Schwartz. We get

\displaystyle\quad\sum_{t=1}^{T}\operatorname{\widetilde{\mathcal{O}}}\left(H\sqrt{\frac{d}{I_{t}}}+\beta_{2,I_{t}}\sqrt{\gamma_{I_{t}}^{-1}}\right)\sqrt{\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{I_{t}}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{I_{t}}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{I_{t},i}^{\dagger}}^{2}\right]\right]}
\displaystyle\leq\sup_{t\in[T]}\left(\sqrt{d}H+\beta_{2,t}\sqrt{\gamma_{t}^{-1}t}\right)\sum_{t=1}^{T}\sqrt{\frac{1}{I_{t}}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{I_{t}}}\left[\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}_{I_{t}}^{i}(\cdot\mid s)}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{I_{t},i}^{\dagger}}^{2}\right]\right]}
\displaystyle\leq\sup_{t\in[T]}\left(\sqrt{d}H+\beta_{2,t}\sqrt{\gamma_{t}^{-1}t}\right)\sqrt{T}\sqrt{\sum_{t=1}^{T}n_{t}\operatornamewithlimits{\mathbb{E}}_{(s,a)\sim_{h}^{i}\widetilde{\pi}_{t}}\left[\frac{1}{t}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]}.

Thus the only thing we need to do before concluding the proof is to show that

t=1Tnt𝔼(s,a)hiπ~t[1tϕ(s,a)Σ^t,i2] is small.\sum_{t=1}^{T}n_{t}\operatornamewithlimits{\mathbb{E}}_{(s,a)\sim_{h}^{i}\widetilde{\pi}_{t}}\left[\frac{1}{t}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]\text{ is small}. (C.2)

Indeed, this quantity is of order 𝒪~(d)\operatorname{\widetilde{\mathcal{O}}}(d) for all i[m]i\in[m] and h[H]h\in[H] w.p. 1𝒪~(δ)1-\operatorname{\widetilde{\mathcal{O}}}(\delta): In Lemma C.8, we imitate Lemma 10 of Cui et al. (2023) and conclude that Equation C.2 is of order 𝒪~(d)\operatorname{\widetilde{\mathcal{O}}}(d) for any fixed i[m]i\in[m] and h[H]h\in[H] w.p. 1δ/(2mH)1-\delta/(2mH); taking a union bound then gives the aforementioned fact.

To summarize, we have

t=1Ti=1m𝔼shπ~t[𝔼Gap[Gapti(s)]]\displaystyle\quad\sum_{t=1}^{T}\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[\operatornamewithlimits{\mathbb{E}}_{{\textsc{Gap}}}[{\textsc{Gap}}_{t}^{i}(s)]\right]
\displaystyle=m\times\operatorname{\widetilde{\mathcal{O}}}\left(\sum_{t=1}^{T}\left(\frac{\eta_{t}^{-1}}{t}+d\frac{\beta_{1,t}^{-1}}{t}\right)+\sup_{t\in[T]}\left(t\eta_{t}H^{2}+t\beta_{1,t}\right)d+\sup_{t\in[T]}\left(\sqrt{d}H+\beta_{2,t}\sqrt{\gamma_{t}^{-1}t}\right)\sqrt{dT}\right),

and we need to ensure that $\eta_{t}=\Omega(\max\{\frac{\sqrt{\gamma_{t}}}{H},\frac{\gamma_{t}}{\beta_{1,t}+\beta_{2,t}}\})$, $\beta_{1,t}=\widetilde{\Omega}(\frac{dH}{\sqrt{t}})$, $\beta_{2,t}=\widetilde{\Omega}(\frac{H}{t})$, and $\gamma_{t}=\operatorname{\widetilde{\mathcal{O}}}(\frac{d}{t})$. Setting $\eta_{t}=\widetilde{\Theta}(\frac{\sqrt{d}}{\sqrt{t}H})$, $\beta_{1,t}=\widetilde{\Theta}(\frac{dH}{\sqrt{t}})$, $\beta_{2,t}=\widetilde{\Theta}(\frac{H}{t})$, and $\gamma_{t}=\widetilde{\Theta}(\frac{d}{t})$ gives

t=1Ti=1m𝔼shπ~t[𝔼Gap[Gapti(s)]]\displaystyle\quad\sum_{t=1}^{T}\sum_{i=1}^{m}\operatornamewithlimits{\mathbb{E}}_{s\sim_{h}\widetilde{\pi}_{t}}\left[\operatornamewithlimits{\mathbb{E}}_{{\textsc{Gap}}}[{\textsc{Gap}}_{t}^{i}(s)]\right]
m×𝒪~(t=1T(Hdt+1Ht)+(dHT+dHT)d+(dH+H/d)dT)\displaystyle\leq m\times\operatorname{\widetilde{\mathcal{O}}}\left(\sum_{t=1}^{T}\left(\frac{H}{\sqrt{dt}}+\frac{1}{H\sqrt{t}}\right)+(\sqrt{d}H\sqrt{T}+dH\sqrt{T})d+(\sqrt{d}H+H/\sqrt{d})\sqrt{dT}\right)
=𝒪~(md2HT),\displaystyle=\operatorname{\widetilde{\mathcal{O}}}(md^{2}H\sqrt{T}),

as claimed. ∎
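As a quick numerical sanity check of the parameter choices above (not part of the formal proof), the following sketch plugs $\eta_t=\sqrt{d}/(\sqrt{t}H)$, $\beta_{1,t}=dH/\sqrt{t}$, $\beta_{2,t}=H/t$, and $\gamma_t=d/t$ into the three terms of the per-agent bound, dropping all logarithmic factors and absolute constants, and confirms that their sum stays within a constant factor of $d^2H\sqrt{T}$; the helper name and the specific values of $d$, $H$, and $T$ are illustrative.

```python
import math

def per_agent_bound(d, H, T):
    # First term: sum_t (eta_t^{-1} / t + d * beta_{1,t}^{-1} / t)
    term1 = sum((math.sqrt(t) * H / math.sqrt(d)) / t + d * (math.sqrt(t) / (d * H)) / t
                for t in range(1, T + 1))
    # Second term: sup_t (t * eta_t * H^2 + t * beta_{1,t}) * d
    term2 = max(math.sqrt(d * t) * H + d * H * math.sqrt(t) for t in range(1, T + 1)) * d
    # Third term: sup_t (sqrt(d) * H + beta_{2,t} * sqrt(t / gamma_t)) * sqrt(d * T)
    term3 = max(math.sqrt(d) * H + (H / t) * math.sqrt(t * t / d)
                for t in range(1, T + 1)) * math.sqrt(d * T)
    return term1 + term2 + term3

for T in (10**3, 10**4, 10**5):
    d, H = 8, 5
    ratio = per_agent_bound(d, H, T) / (d**2 * H * math.sqrt(T))
    print(f"T={T}: bound / (d^2 H sqrt(T)) = {ratio:.3f}")  # stays O(1) as T grows
```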

Lemma C.8.

For any agent i[m]i\in[m] and layer h[H]h\in[H], w.p. 1δ/(2mH)1-\delta/(2mH), we have

t=1Tnt𝔼(s,a)hiπ~t[1tϕ(s,a)Σ^t,i2]=𝒪~(d).\sum_{t=1}^{T}n_{t}\operatornamewithlimits{\mathbb{E}}_{(s,a)\sim_{h}^{i}\widetilde{\pi}_{t}}\left[\frac{1}{t}\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]=\operatorname{\widetilde{\mathcal{O}}}(d).
Proof.

For a $t$ where the condition in Line 5 is violated, recalling the definition of $\widehat{\Sigma}_{t,i}^{\dagger}$ in Equation 4.1 and that of $\Sigma_{t,i}^{\text{cov}}$ in Equation B.6, we know $\widehat{\Sigma}_{t,i}^{\dagger}=(\Sigma_{t,i}^{\text{cov}}+\gamma I)^{-1}$. Also recalling the choice $\gamma=\widetilde{\Theta}(\frac{d}{t})$ in Algorithm 2 and Corollary E.3 (adapted from Liu et al. (2023a)), which gives $\Sigma_{t,i}\preceq 2\Sigma_{t,i}^{\text{cov}}+\widetilde{\Theta}(\frac{d}{t})I$, we have (up to an absolute constant factor that we omit)

1tΣ^t,i=(tΣt,icov+tγI)1(tΣt,i+Θ~(d)I)1=(s=0t1𝔼(s,a)hiπ~s[ϕ(s,a)ϕ(s,a)𝖳]+Θ~(d)I)1.\frac{1}{t}\widehat{\Sigma}_{t,i}^{\dagger}=\left(t\Sigma_{t,i}^{\text{cov}}+t\gamma I\right)^{-1}\preceq\left(t\Sigma_{t,i}+\widetilde{\Theta}(d)I\right)^{-1}=\left(\sum_{s=0}^{t-1}\operatornamewithlimits{\mathbb{E}}_{(s,a)\sim_{h}^{i}\widetilde{\pi}_{s}}[\bm{\phi}(s,a)\bm{\phi}(s,a)^{\mathsf{T}}]+\widetilde{\Theta}(d)I\right)^{-1}.

Let $\text{Cov}_{h}^{i}(\pi)$ be the true covariance of $\pi$, i.e., $\operatornamewithlimits{\mathbb{E}}_{(s,a)\sim_{h}^{i}\pi}[\bm{\phi}(s,a)\bm{\phi}(s,a)^{\mathsf{T}}]$. Equation C.2 can then be upper-bounded as

Equation C.2t=1TntCovhi(π~t),(s=0t1Covhi(π~s)+Θ~(d)I)1.\text{\lx@cref{creftypecap~refnum}{eq:matrix version summation lemma}}\leq\sum_{t=1}^{T}\left\langle n_{t}\text{Cov}_{h}^{i}(\widetilde{\pi}_{t}),\left(\sum_{s=0}^{t-1}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s})+\widetilde{\Theta}(d)I\right)^{-1}\right\rangle.

Note that $\sum_{s=0}^{t-1}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s})=\sum_{s=0}^{t-1}n_{s}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s})$ for any $t$ where the condition in Line 5 is violated. Thus Equation C.2 can be viewed as a matrix version of $\sum_{t\colon t=I_{t}}X_{t}/(\sum_{s=0}^{t-1}X_{s})$ with $X_{t}=n_{t}\text{Cov}_{h}^{i}(\widetilde{\pi}_{t})$. We use the following lemma by Zanette and Wainwright (2022, Lemma 11):

Lemma C.9 (Lemma 11 by Zanette and Wainwright (2022)).

For a random vector ϕd\bm{\phi}\in\mathbb{R}^{d}, scalar α>0\alpha>0, and PSD matrix Σ\Sigma, suppose that α𝔼[ϕΣ12]L\alpha\operatornamewithlimits{\mathbb{E}}[\lVert\bm{\phi}\rVert_{\Sigma^{-1}}^{2}]\leq L where Le1L\geq e-1, then

α𝔼[ϕΣ12]Llogdet(Σ+α𝔼[ϕϕ𝖳])det(Σ)αL𝔼[ϕΣ12].\alpha\operatornamewithlimits{\mathbb{E}}[\lVert\bm{\phi}\rVert_{\Sigma^{-1}}^{2}]\leq L\log\frac{\det(\Sigma+\alpha\operatornamewithlimits{\mathbb{E}}[\bm{\phi}\bm{\phi}^{\mathsf{T}}])}{\det(\Sigma)}\leq\alpha L\operatornamewithlimits{\mathbb{E}}[\lVert\bm{\phi}\rVert_{\Sigma^{-1}}^{2}].

In our case, we apply Lemma C.9 to all $t=I_{t}$ with $\alpha=n_{t}$, $\Sigma=\sum_{s=0}^{t-1}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s})+\widetilde{\Theta}(d)I$, and the distribution being that of $\bm{\phi}(s,a)$ with $(s,a)\sim_{h}^{i}\widetilde{\pi}_{t}$. We first calculate $\alpha\operatornamewithlimits{\mathbb{E}}[\lVert\bm{\phi}\rVert_{\Sigma^{-1}}^{2}]$, which is bounded thanks to our potential construction and is analogous to Lemma 4 of Cui et al. (2023):

Lemma C.10.

For any $t\in[T]$ where the condition in Line 5 is violated, we have the following w.p. $1-\delta/(2mHT)$:

ntCovhi(π~t),Σ^t,i=nt𝔼(s,a)hiπ~t[ϕ(s,a)Σ^t,i2]=𝒪~(1).n_{t}\langle\mathrm{Cov}_{h}^{i}(\widetilde{\pi}_{t}),\widehat{\Sigma}_{t,i}^{\dagger}\rangle=n_{t}\operatornamewithlimits{\mathbb{E}}_{(s,a)\sim_{h}^{i}\widetilde{\pi}_{t}}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]=\operatorname{\widetilde{\mathcal{O}}}(1).

The proof of Lemma C.10 is presented shortly after.

Therefore, when defining L=maxt[T]ntCovhi(π~t),Σ^t,iL=\max_{t\in[T]}n_{t}\langle\mathrm{Cov}_{h}^{i}(\widetilde{\pi}_{t}),\widehat{\Sigma}_{t,i}^{\dagger}\rangle, we can conclude from Lemma C.9:

ntCovhi(π~t),(s=0t1Covhi(π~s)+Θ~(d)I)1Llogdet(s=0t1Covhi(π~s)+Θ~(d)I+ntCovhi(π~t))det(s=0t1Covhi(π~s)+Θ~(d)I).\left\langle n_{t}\text{Cov}_{h}^{i}(\widetilde{\pi}_{t}),\left(\sum_{s=0}^{t-1}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s})+\widetilde{\Theta}(d)I\right)^{-1}\right\rangle\leq L\log\frac{\det(\sum_{s=0}^{t-1}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s})+\widetilde{\Theta}(d)I+n_{t}\text{Cov}_{h}^{i}(\widetilde{\pi}_{t}))}{\det(\sum_{s=0}^{t-1}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s})+\widetilde{\Theta}(d)I)}.

By telescoping and the fact that s=0t1Covhi(π~s)=s<t,s=IsnsCovhi(π~s)\sum_{s=0}^{t-1}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s})=\sum_{s<t,s=I_{s}}n_{s}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s}), we have

t=1TntCovhi(π~t),(s=0t1Covhi(π~s)+Θ~(d)I)1Llogdet(t=0TCovhi(π~t)+Θ~(d)I)det(Θ~(d)I),\sum_{t=1}^{T}\left\langle n_{t}\text{Cov}_{h}^{i}(\widetilde{\pi}_{t}),\left(\sum_{s=0}^{t-1}\text{Cov}_{h}^{i}(\widetilde{\pi}_{s})+\widetilde{\Theta}(d)I\right)^{-1}\right\rangle\leq L\log\frac{\det(\sum_{t=0}^{T}\text{Cov}_{h}^{i}(\widetilde{\pi}_{t})+\widetilde{\Theta}(d)I)}{\det(\widetilde{\Theta}(d)I)},

which is bounded by 𝒪~(d)\operatorname{\widetilde{\mathcal{O}}}(d) as L=𝒪~(1)L=\operatorname{\widetilde{\mathcal{O}}}(1) from Lemma C.10, logdet(t=0TCovhi(π~t)+Θ~(d)I)dlogTr(t=0TCovhi(π~t)+Θ~(d)I)d=𝒪~(dlog(d+Td))=𝒪~(d)\log\det(\sum_{t=0}^{T}\text{Cov}_{h}^{i}(\widetilde{\pi}_{t})+\widetilde{\Theta}(d)I)\leq d\log\frac{\text{Tr}(\sum_{t=0}^{T}\text{Cov}_{h}^{i}(\widetilde{\pi}_{t})+\widetilde{\Theta}(d)I)}{d}=\operatorname{\widetilde{\mathcal{O}}}(d\log(d+\frac{T}{d}))=\operatorname{\widetilde{\mathcal{O}}}(d), and logdet(Θ~(d)I)0\log\det(\widetilde{\Theta}(d)I)\geq 0. ∎
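To illustrate the telescoping log-det argument in the last display, the following minimal sketch (random PSD matrices of arbitrary size, not tied to the Markov-game setting; all parameter values are mine) numerically checks that $\sum_{t}\langle X_{t},(\sum_{s<t}X_{s}+cI)^{-1}\rangle\leq L\log\frac{\det(\sum_{t}X_{t}+cI)}{\det(cI)}$ once $L$ upper-bounds every individual term and $L\geq e-1$, as required by Lemma C.9.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, reg = 5, 200, 5.0                 # dimension, horizon, and regularization are arbitrary
X = []
for _ in range(T):
    A = rng.normal(size=(d, d))
    X.append(A @ A.T / d)               # random PSD "covariance" matrices X_t

lhs = 0.0
L = np.e - 1.0                          # Lemma C.9 requires L >= e - 1
S = reg * np.eye(d)                     # running sum_{s < t} X_s + reg * I
for Xt in X:
    term = np.trace(Xt @ np.linalg.inv(S))
    lhs += term
    L = max(L, term)                    # L upper-bounds every individual term
    S += Xt

rhs = L * (np.linalg.slogdet(S)[1] - d * np.log(reg))
print(f"LHS = {lhs:.3f} <= RHS = {rhs:.3f}")
```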

Proof of Lemma C.10.

Recall the potential function definition in Equation 4.9, which says $64\log\frac{8mHT}{\delta}\cdot\Psi_{t,h}^{i}=\sum_{\tau=1}^{t}\lVert\bm{\phi}(\widetilde{s}_{\tau,h},\widetilde{a}_{\tau,h}^{i})\rVert_{\widehat{\Sigma}_{\tau,i}^{\dagger}}^{2}$. Suppose that the first epoch after $t$ in which the condition in Line 5 is violated again is $t_{1}$; then $\Psi_{t_{1},h}^{i}$ is the first potential to reach $\Psi_{t,h}^{i}+1$, which means

τ=tt11ϕ(s~τ,h,a~τ,hi)Σ^τ,i264log8mHT2δ,h[H].\sum_{\tau=t}^{t_{1}-1}\lVert\bm{\phi}(\widetilde{s}_{\tau,h},\widetilde{a}_{\tau,h}^{i})\rVert_{\widehat{\Sigma}_{\tau,i}^{\dagger}}^{2}\leq 64\log\frac{8mHT^{2}}{\delta},\quad\forall h\in[H].

Fix a single $h\in[H]$. Recall that for all epochs where the condition in Line 5 is not violated, we directly adopt the previous $\widetilde{\pi}_{t}$ and $\widehat{\Sigma}_{t,i}^{\dagger}$. Hence, $\widehat{\Sigma}_{\tau,i}^{\dagger}\equiv\widehat{\Sigma}_{t,i}^{\dagger}$ for all $\tau\in[t,t_{1})$, and all such $(\widetilde{s}_{\tau,h},\widetilde{a}_{\tau,h}^{i})$'s are also drawn from the same $\widetilde{\pi}_{t}$. From Lemma E.9 (which is Lemma 48 of Cui et al. (2023)), we know that

Covhi(π~t),Σ^t,i=𝔼(s,a)hiπ~t[ϕ(s,a)Σ^t,i2]\displaystyle\quad\langle\mathrm{Cov}_{h}^{i}(\widetilde{\pi}_{t}),\widehat{\Sigma}_{t,i}^{\dagger}\rangle=\operatornamewithlimits{\mathbb{E}}_{(s,a)\sim_{h}^{i}\widetilde{\pi}_{t}}\left[\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}\right]
\displaystyle\leq 2\frac{1}{t_{1}-t}\sum_{\tau=t}^{t_{1}-1}\lVert\bm{\phi}(\widetilde{s}_{\tau,h},\widetilde{a}_{\tau,h}^{i})\rVert_{\widehat{\Sigma}_{\tau,i}^{\dagger}}^{2}\leq\frac{128}{t_{1}-t}\log\frac{8mHT^{2}}{\delta},

w.p. $1-\delta/(2mHT)$. Since $n_{t}=t_{1}-t$, multiplying by $n_{t}$ and hiding all logarithmic factors gives our conclusion. ∎

Appendix D Proof of the Resulting Sample Complexity (Theorem 4.3)

Proof of Theorem 4.3.

We verify the conditions in Theorem 3.2. The first condition, i.e., Equation A.1, is formalized as Theorem B.1, whose proof is in Appendix B. The second condition, i.e., Equation A.2, is postponed to Section D.1. The third condition, i.e., Equation A.3, is justified in Theorem C.1.

Then we can invoke the conclusion from Theorem 3.2 to conclude a sample complexity of

𝒪~(max{Γ1,Γ2}×m2H3Lϵ2×dreplaylogT)\displaystyle\quad\operatorname{\widetilde{\mathcal{O}}}\left(\max\{\Gamma_{1},\Gamma_{2}\}\times m^{2}H^{3}L\epsilon^{-2}\times d_{\text{replay}}\log T\right)
=𝒪~(m×m2H3×d4H2ϵ2×dmH)=𝒪~(m4d5H6ϵ2),\displaystyle=\operatorname{\widetilde{\mathcal{O}}}(m\times m^{2}H^{3}\times d^{4}H^{2}\epsilon^{-2}\times dmH)=\operatorname{\widetilde{\mathcal{O}}}(m^{4}d^{5}H^{6}\epsilon^{-2}),

where Γ1=Γ2=m\Gamma_{1}=\Gamma_{2}=m (see Algorithms 2 and 3), L=d4H2L=d^{4}H^{2} (from Theorem C.1), and dreplay=dmHd_{\text{replay}}=dmH (from Section E.5 of Wang et al. (2023)). ∎

D.1 V-Approx Procedure by Wang et al. (2023) and Its Guarantee

In this section, we introduce the V-Approx procedure by Wang et al. (2023). In Algorithm 3, we present the algorithm for linear Markov Games (i.e., the Optimistic-Regress procedure in their Algorithm 3 is replaced by that in Section E.1).

Algorithm 3 V-Approxh\textsc{V-Approx}_{h} Subroutine for Independent Linear Markov Games (Wang et al., 2023)
0:  Previous-layer policy π¯\overline{\pi}, current-layer policy π~\widetilde{\pi}, next-layer V-function V¯\overline{V}, sub-optimality gap upper-bound Gap, epoch length KK. Set regularization factor λ=𝒪(dKlogdKδ)\lambda=\operatorname{\mathcal{O}}(\frac{d}{K}\log\frac{dK}{\delta}).
1:  for $i=1,2,\ldots,m$ do
2:     for $k=1,2,\ldots,K$ do
3:        For each agent $j\neq i$, execute $\overline{\pi}\circ\widetilde{\pi}$; for agent $i$, execute $\overline{\pi}$ (i.e., the same as Algorithm 2). Record the trajectory as $\{(s_{k,\mathfrak{h}},a_{k,\mathfrak{h}}^{i},\ell_{k,\mathfrak{h}}^{i})\}_{\mathfrak{h}=1}^{H}$ and set $\widehat{\ell}_{k,h}^{i}\triangleq\ell_{k,h}^{i}+\overline{V}^{i}(s_{k,h+1})$.
4:     end for
5:     Calculate $\widehat{\bm{\theta}}_{h}^{i}=\operatornamewithlimits{\mathrm{argmin}}_{\bm{\theta}}\frac{1}{K}\sum_{k=1}^{K}(\langle\bm{\phi}(s_{k,h},a_{k,h}^{i}),\bm{\theta}\rangle-\widehat{\ell}_{k,h}^{i})^{2}+\lambda\lVert\bm{\theta}\rVert_{2}^{2}$.
6:     For each $s\in\mathcal{S}_{h}$ and $a\in\mathcal{A}^{i}$, set $\overline{Q}^{i}(s,a)=\min\{\langle\bm{\phi}(s,a),\widehat{\bm{\theta}}_{h}^{i}\rangle+\frac{3}{2}\textsc{Gap}^{i}(s),H-h+1\}$.
7:     For each $s\in\mathcal{S}_{h}$, set $\overline{V}^{i}(s)=\sum_{a\in\mathcal{A}^{i}}\widetilde{\pi}^{i}(a\mid s)\overline{Q}^{i}(s,a)$.
8:  end for
9:  return  $\{\overline{V}^{i}(\mathcal{S}_{h})\}_{i\in[m]}$.
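As an illustration of the regularized least-squares step of Algorithm 3, the following sketch (synthetic data; the feature distribution, noise model, and all names such as theta_true and Phi are my own assumptions) computes the ridge estimator in closed form, $\widehat{\bm{\theta}}=(\frac{1}{K}\Phi^{\mathsf{T}}\Phi+\lambda I)^{-1}\frac{1}{K}\Phi^{\mathsf{T}}\widehat{\ell}$, which minimizes $\frac{1}{K}\sum_{k}(\langle\bm{\phi}_{k},\bm{\theta}\rangle-\widehat{\ell}_{k})^{2}+\lambda\lVert\bm{\theta}\rVert_{2}^{2}$.

```python
import numpy as np

rng = np.random.default_rng(5)
d, K, delta = 8, 4096, 0.01
lam = (d / K) * np.log(d * K / delta)                    # lambda = O(d/K * log(dK/delta)), matching the input line

theta_true = rng.normal(size=d) / np.sqrt(d)             # hypothetical ground-truth parameter
Phi = rng.normal(size=(K, d)) / np.sqrt(d)               # features phi(s_{k,h}, a^i_{k,h}) of the K samples
targets = Phi @ theta_true + rng.uniform(-1, 1, size=K)  # noisy regression targets hat-ell^i_{k,h}

# Closed-form minimizer of (1/K) * sum_k (<phi_k, theta> - target_k)^2 + lam * ||theta||_2^2
theta_hat = np.linalg.solve(Phi.T @ Phi / K + lam * np.eye(d), Phi.T @ targets / K)
print("||theta_hat - theta_true||_2 =", np.linalg.norm(theta_hat - theta_true))
```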

Similar to Section E.3 of Wang et al. (2023), we have the following lemma:

Lemma D.1.

Consider using Algorithm 3 with some roll-in policy $\overline{\pi}$ and epoch length $K$. Let $(\widetilde{\pi},{\textsc{Gap}})=\textsc{CCE-Approx}_{h}(\overline{\pi},\overline{V},K)$. Then Algorithm 3 with parameters $(\overline{\pi},\widetilde{\pi},\overline{V},{\textsc{Gap}},K)$ ensures that its output $\overline{V}(\mathcal{S}_{h})$ satisfies the following with probability $1-\delta$:

V¯i(s)[min{\displaystyle\overline{V}^{i}(s)\in\bigg{[}\min\bigg{\{} 𝔼aπ~[(i+h+1V¯i)(s,a)]+Gapi(s),Hh+1},\displaystyle\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}}\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}_{h+1}\overline{V}^{i}\big{)}(s,a)\right]+\phantom{2}{\textsc{Gap}}^{i}(s),H-h+1\bigg{\}},
𝔼aπ~[(i+h+1V¯i)(s,a)]+2Gapi(s)],i[m],s𝒮h.\displaystyle\operatornamewithlimits{\mathbb{E}}_{a\sim\widetilde{\pi}}\left[\big{(}\ell^{i}+\operatorname{\mathbb{P}}_{h+1}\overline{V}^{i}\big{)}(s,a)\right]+2{\textsc{Gap}}^{i}(s)\bigg{]},\quad\forall i\in[m],s\in\mathcal{S}_{h}.
Proof.

Consider a fixed i[m]i\in[m] and h[H]h\in[H]. By Assumption 2.1, there exists 𝜽hi\bm{\theta}_{h}^{i} such that

𝔼[^k,hisk,h,ak,hi]=ϕ(sk,h,ak,hi),𝜽hi,k[K].\operatornamewithlimits{\mathbb{E}}[\widehat{\ell}_{k,h}^{i}\mid s_{k,h},a_{k,h}^{i}]=\langle\bm{\phi}(s_{k,h},a_{k,h}^{i}),\bm{\theta}_{h}^{i}\rangle,\quad\forall k\in[K].

Let Σ~i,hreg=1Kk=1Kϕ(sk,h,ak,hi)ϕ(sk,h,ak,hi)𝖳+λI\widetilde{\Sigma}_{i,h}^{\text{reg}}=\frac{1}{K}\sum_{k=1}^{K}\bm{\phi}(s_{k,h},a_{k,h}^{i})\bm{\phi}(s_{k,h},a_{k,h}^{i})^{\mathsf{T}}+\lambda I and ξk,hi=^k,hi𝔼[^k,hisk,h,ak,hi]\xi_{k,h}^{i}=\widehat{\ell}_{k,h}^{i}-\operatornamewithlimits{\mathbb{E}}[\widehat{\ell}_{k,h}^{i}\mid s_{k,h},a_{k,h}^{i}]. From Lemma 21 of Wang et al. (2023), we know when λ=𝒪(dKlogdKδ)\lambda=\operatorname{\mathcal{O}}(\frac{d}{K}\log\frac{dK}{\delta}), with probability 1δ1-\delta,

k=1Kϕ(sk,h,ak,hi)ξk,hi(Σ¯i,h)1=𝒪~(dH2K),\left\lVert\sum_{k=1}^{K}\bm{\phi}(s_{k,h},a_{k,h}^{i})\xi_{k,h}^{i}\right\rVert_{(\overline{\Sigma}_{i,h})^{-1}}=\operatorname{\widetilde{\mathcal{O}}}(\sqrt{dH^{2}K}), (D.1)

where Σ¯i,h=𝔼(s,a)hiπ¯[ϕ(s,a)ϕ(s,a)𝖳]\overline{\Sigma}_{i,h}=\operatornamewithlimits{\mathbb{E}}_{(s,a)\sim_{h}^{i}\overline{\pi}}[\bm{\phi}(s,a)\bm{\phi}(s,a)^{\mathsf{T}}] (which is also the expectation of ϕ(sk,h,ak,hi)ϕ(sk,h,ak,hi)𝖳\bm{\phi}(s_{k,h},a_{k,h}^{i})\bm{\phi}(s_{k,h},a_{k,h}^{i})^{\mathsf{T}}). We also conclude from Lemma 22 of Wang et al. (2023) that Σ~i,hreg12(Σ¯i,h+λI)\widetilde{\Sigma}_{i,h}^{\text{reg}}\succeq\frac{1}{2}(\overline{\Sigma}_{i,h}+\lambda I). Therefore, for any s𝒮hs\in\mathcal{S}_{h} and a𝒜ia\in\mathcal{A}^{i}, we have

|ϕ(s,a)𝖳𝜽^i𝔼[^k,his,a]|\displaystyle\quad\left\lvert\bm{\phi}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}^{i}-\operatornamewithlimits{\mathbb{E}}[\widehat{\ell}_{k,h}^{i}\mid s,a]\right\rvert
\displaystyle\leq\left\lvert\bm{\phi}(s,a)^{\mathsf{T}}(\widetilde{\Sigma}_{i,h}^{\text{reg}})^{-1}\left(\frac{1}{K}\sum_{k=1}^{K}\bm{\phi}(s_{k,h},a_{k,h}^{i})\bm{\phi}(s_{k,h},a_{k,h}^{i})^{\mathsf{T}}-\widetilde{\Sigma}_{i,h}^{\text{reg}}\right)\bm{\theta}_{h}^{i}\right\rvert+
1K|ϕ(s,a)𝖳(Σ~i,hreg)1k=1Kϕ(sk,h,ak,hi)ξk,hi|\displaystyle\quad\frac{1}{K}\left\lvert\bm{\phi}(s,a)^{\mathsf{T}}(\widetilde{\Sigma}_{i,h}^{\text{reg}})^{-1}\sum_{k=1}^{K}\bm{\phi}(s_{k,h},a_{k,h}^{i})\xi_{k,h}^{i}\right\rvert
ϕ(s,a)(Σ~i,hreg)1×(𝒪~(dK)+𝒪~(dH2K)),\displaystyle\leq\lVert\bm{\phi}(s,a)\rVert_{(\widetilde{\Sigma}_{i,h}^{\text{reg}})^{-1}}\times\left(\operatorname{\widetilde{\mathcal{O}}}\left(\sqrt{\frac{d}{K}}\right)+\operatorname{\widetilde{\mathcal{O}}}\left(\sqrt{\frac{dH^{2}}{K}}\right)\right),

where the last step uses both Theorem E.5 and Equation D.1. Noticing that $\widehat{\Sigma}_{t,i}^{\dagger}$ is also the regularized inverse of an empirical covariance of policy $\overline{\pi}$, we can conclude that $\lVert\bm{\phi}(s,a)\rVert_{(\widetilde{\Sigma}_{i,h}^{\text{reg}})^{-1}}=\operatorname{\mathcal{O}}(\lVert\bm{\phi}(s,a)\rVert_{(\overline{\Sigma}_{i,h}+\lambda I)^{-1}})=\operatorname{\mathcal{O}}(\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}})$ from Corollaries E.3 and E.4. Consequently, the error in $\overline{V}^{i}(s)$ is no more than

a𝒜iπ~i(as)|ϕ(s,a)𝖳𝜽^i𝔼[^k,his,a]|=𝒪~(dH2Ka𝒜iπ~i(as)ϕ(s,a)Σ^t,i)\displaystyle\quad\sum_{a\in\mathcal{A}^{i}}\widetilde{\pi}^{i}(a\mid s)\left\lvert\bm{\phi}(s,a)^{\mathsf{T}}\widehat{\bm{\theta}}^{i}-\operatornamewithlimits{\mathbb{E}}[\widehat{\ell}_{k,h}^{i}\mid s,a]\right\rvert=\operatorname{\widetilde{\mathcal{O}}}\left(\sqrt{\frac{dH^{2}}{K}}\sum_{a\in\mathcal{A}^{i}}\widetilde{\pi}^{i}(a\mid s)\lVert\bm{\phi}(s,a)\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}\right)
=𝒪~(dH2K1Kk=1K𝔼aπki(s)[ϕ(s,a)]Σ^t,i2)𝒪~(1KdH2k=1K𝔼aπki(s)[ϕ(s,a)]Σ^t,i2),\displaystyle=\operatorname{\widetilde{\mathcal{O}}}\left(\sqrt{\frac{dH^{2}}{K}}\frac{1}{K}\sum_{k=1}^{K}\sqrt{\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[\bm{\phi}(s,a)]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}}\right)\leq\operatorname{\widetilde{\mathcal{O}}}\left(\frac{1}{K}\sqrt{dH^{2}\sum_{k=1}^{K}\left\lVert\operatornamewithlimits{\mathbb{E}}_{a\sim\pi_{k}^{i}(\cdot\mid s)}[\bm{\phi}(s,a)]\right\rVert_{\widehat{\Sigma}_{t,i}^{\dagger}}^{2}}\right),

where the last step used the Cauchy–Schwarz inequality. Recalling the definition of Gap from Equation B.2, we conclude that the total estimation error of V-Approx is no more than $\frac{1}{2}{\textsc{Gap}}_{t}^{i}(s)$. Our claim then follows from imitating the remaining arguments of Wang et al. (2023, Section E.3). ∎

Appendix E Auxiliary Lemmas

E.1 Magnitude-Reduced Estimators

The following lemma characterizes the Magnitude-Reduced Estimator (Dai et al., 2023).

Lemma E.1 (Magnitude-Reduced Estimators (Dai et al., 2023)).

For a random variable ZZ, its magnitude-reduced estimator Z^Z(Z)+𝔼[(Z)]\widehat{Z}\triangleq Z-(Z)_{-}+\operatornamewithlimits{\mathbb{E}}[(Z)_{-}] where (Z)min{Z,0}(Z)_{-}\triangleq\min\{Z,0\} satisfies

𝔼[Z^]=𝔼[Z],𝔼[(Z^)2]6𝔼[Z2],Z^𝔼[(Z)].\operatornamewithlimits{\mathbb{E}}[\widehat{Z}]=\operatornamewithlimits{\mathbb{E}}[Z],\quad\operatornamewithlimits{\mathbb{E}}[(\widehat{Z})^{2}]\leq 6\operatornamewithlimits{\mathbb{E}}[Z^{2}],\quad\widehat{Z}\geq\operatornamewithlimits{\mathbb{E}}[(Z)_{-}].
Proof.

The first conclusion follows from 𝔼[Z^]=𝔼[Z]𝔼[(Z)]+𝔼[𝔼[(Z)]]=𝔼[Z]\operatornamewithlimits{\mathbb{E}}[\widehat{Z}]=\operatornamewithlimits{\mathbb{E}}[Z]-\operatornamewithlimits{\mathbb{E}}[(Z)_{-}]+\operatornamewithlimits{\mathbb{E}}[\operatornamewithlimits{\mathbb{E}}[(Z)_{-}]]=\operatornamewithlimits{\mathbb{E}}[Z].

For the second conclusion, write $\widehat{Z}=(Z-(Z)_{-})+\operatornamewithlimits{\mathbb{E}}[(Z)_{-}]$. By $(a+b)^{2}\leq 2(a^{2}+b^{2})$, we have $\operatornamewithlimits{\mathbb{E}}[(\widehat{Z})^{2}]\leq 2(\operatornamewithlimits{\mathbb{E}}[(Z-(Z)_{-})^{2}]+\operatornamewithlimits{\mathbb{E}}[(Z)_{-}]^{2})$. As $(Z-(Z)_{-})^{2}=\max\{Z,0\}^{2}\leq Z^{2}$ and, by Jensen's inequality, $\operatornamewithlimits{\mathbb{E}}[(Z)_{-}]^{2}\leq\operatornamewithlimits{\mathbb{E}}[(Z)_{-}^{2}]\leq\operatornamewithlimits{\mathbb{E}}[Z^{2}]$, we arrive at the conclusion that $\operatornamewithlimits{\mathbb{E}}[(\widehat{Z})^{2}]\leq 4\operatornamewithlimits{\mathbb{E}}[Z^{2}]\leq 6\operatornamewithlimits{\mathbb{E}}[Z^{2}]$.

The last inequality follows from the fact that Z(Z)Z-(Z)_{-} is 0 if Z<0Z<0 and ZZ if Z0Z\geq 0. Therefore, Z^=Z(Z)+𝔼[(Z)]𝔼[(Z)]\widehat{Z}=Z-(Z)_{-}+\operatornamewithlimits{\mathbb{E}}[(Z)_{-}]\geq\operatornamewithlimits{\mathbb{E}}[(Z)_{-}], as desired. ∎
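The following minimal Monte-Carlo sketch (the distribution of $Z$ is arbitrary, and $\operatornamewithlimits{\mathbb{E}}[(Z)_{-}]$ is replaced by its empirical mean) illustrates the three properties of Lemma E.1 on simulated data.

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.normal(loc=-0.3, scale=2.0, size=1_000_000)   # any distribution works here

Z_minus = np.minimum(Z, 0.0)                          # (Z)_-
Z_hat = Z - Z_minus + Z_minus.mean()                  # magnitude-reduced estimator (empirical E[(Z)_-])

print("E[Z]      ≈", Z.mean())
print("E[Zhat]   ≈", Z_hat.mean())                    # matches E[Z] (unbiasedness)
print("E[Zhat^2] / E[Z^2] =", (Z_hat**2).mean() / (Z**2).mean(), "(should be <= 6)")
print("min Zhat =", Z_hat.min(), ">= E[(Z)_-] =", Z_minus.mean())
```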

E.2 Stochastic Matrix Concentration

We then present some stochastic matrix concentration results.

Lemma E.2 (Lemma A.4 by Dai et al. (2023)).

If H1,H2,,HnH_{1},H_{2},\ldots,H_{n} are i.i.d. dd-dimensional PSD matrices such that for all ii: i) 𝔼[Hi]=H\operatornamewithlimits{\mathbb{E}}[H_{i}]=H, ii) HiIH_{i}\preceq I a.s., and iii) H1dn(logdδ)IH\succeq\frac{1}{dn}\left(\log\frac{d}{\delta}\right)I, then

dnlogdδH1/21ni=1nHiHdnlogdδH1/2with probability 12δ.-\sqrt{\frac{d}{n}\log\frac{d}{\delta}}H^{1/2}\preceq\frac{1}{n}\sum_{i=1}^{n}H_{i}-H\preceq\sqrt{\frac{d}{n}\log\frac{d}{\delta}}H^{1/2}\quad\text{with probability }1-2\delta.

The first corollary of Lemma E.2 is derived by Liu et al. (2023a, Corollary 10).

Corollary E.3.

If H1,H2,,HnH_{1},H_{2},\ldots,H_{n} are i.i.d. dd-dimensional PSD matrices such that for all ii: i) 𝔼[Hi]=H\operatornamewithlimits{\mathbb{E}}[H_{i}]=H, and ii) HicIH_{i}\preceq cI a.s. for some positive constant c>0c>0, then

H2ni=1nHi+3cdnlog(dδ)Iwith probability 1δ.H\preceq\frac{2}{n}\sum_{i=1}^{n}H_{i}+3c\cdot\frac{d}{n}\log\left(\frac{d}{\delta}\right)I\quad\text{with probability }1-\delta.

The second corollary is the opposite direction of Corollary E.3.

Corollary E.4.

For H1,H2,,HnH_{1},H_{2},\ldots,H_{n} i.i.d. PSD with expectation HH such that HicIH_{i}\preceq cI a.s. for all ii,

1ni=1nHi32H+3cd2nlog(dδ)Iwith probability 1δ.\frac{1}{n}\sum_{i=1}^{n}H_{i}\preceq\frac{3}{2}H+3c\cdot\frac{d}{2n}\log\left(\frac{d}{\delta}\right)I\quad\text{with probability }1-\delta.
Proof.

The proof mostly follows Corollary 10 of Liu et al. (2023a). Using the fact that $H^{1/2}\preceq\frac{k}{2}H+\frac{1}{2k}I$ for any $k>0$, we know from Lemma E.2 that, under the same conditions as Lemma E.2,

1ni=1nHiHdnlogdδH1/212H+d2n(logdδ)Iwith probability 1δ.\frac{1}{n}\sum_{i=1}^{n}H_{i}-H\preceq\sqrt{\frac{d}{n}\log\frac{d}{\delta}}H^{1/2}\preceq\frac{1}{2}H+\frac{d}{2n}\left(\log\frac{d}{\delta}\right)I\quad\text{with probability }1-\delta.

In other words,

1ni=1nHi32H+d2n(logdδ)Iwith probability 1δ.\frac{1}{n}\sum_{i=1}^{n}H_{i}\preceq\frac{3}{2}H+\frac{d}{2n}\left(\log\frac{d}{\delta}\right)I\quad\text{with probability }1-\delta. (E.1)

Now we show Corollary E.4. For the case where dnlogdδ1\frac{d}{n}\log\frac{d}{\delta}\leq 1, define H~i=12cHi+d2n(logdδ)I\widetilde{H}_{i}=\frac{1}{2c}H_{i}+\frac{d}{2n}\left(\log\frac{d}{\delta}\right)I. Then H~i12ccI+d2n(logdδ)II\widetilde{H}_{i}\preceq\frac{1}{2c}cI+\frac{d}{2n}\left(\log\frac{d}{\delta}\right)I\preceq I. Moreover, H~=𝔼[H~i]=12cH+d2n(logdδ)I\widetilde{H}=\operatornamewithlimits{\mathbb{E}}[\widetilde{H}_{i}]=\frac{1}{2c}H+\frac{d}{2n}\left(\log\frac{d}{\delta}\right)I also ensures H~1dn(logdδ)I\widetilde{H}\succeq\frac{1}{dn}\left(\log\frac{d}{\delta}\right)I. Hence, applying Equation E.1 to H~1,H~2,,H~n\widetilde{H}_{1},\widetilde{H}_{2},\ldots,\widetilde{H}_{n} gives

1ni=1nH~i32H~+d2n(logdδ)Iwith probability 1δ.\frac{1}{n}\sum_{i=1}^{n}\widetilde{H}_{i}\preceq\frac{3}{2}\widetilde{H}+\frac{d}{2n}\left(\log\frac{d}{\delta}\right)I\quad\text{with probability }1-\delta.

By the definitions of H~i\widetilde{H}_{i} and H~\widetilde{H}, we further have the following, which shows our claim:

\frac{1}{n}\sum_{i=1}^{n}H_{i}\preceq\frac{3}{2}H+3c\cdot\frac{d}{2n}\log\left(\frac{d}{\delta}\right)I\quad\text{with probability }1-\delta.

The case of dnlogdδ>1\frac{d}{n}\log\frac{d}{\delta}>1 is trivial because HicI32c(dnlogdδ)IH_{i}\preceq cI\preceq\frac{3}{2}c\left(\frac{d}{n}\log\frac{d}{\delta}\right)I. ∎
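As a quick illustration of Corollaries E.3 and E.4 (rank-one matrices built from a simple synthetic feature distribution; the dimension, sample size, and $\delta$ are arbitrary choices of mine), the following sketch checks both matrix inequalities on simulated data by verifying that the corresponding differences are PSD.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, delta, c = 6, 2000, 0.01, 1.0
phi = rng.uniform(-1, 1, size=(n, d)) / np.sqrt(d)    # ||phi||_2 <= 1, so phi phi^T <= c*I with c = 1
H_i = phi[:, :, None] * phi[:, None, :]               # i.i.d. rank-one PSD matrices H_i = phi_i phi_i^T
H = np.eye(d) / (3 * d)                               # population covariance E[H_i] of this distribution

emp = H_i.mean(axis=0)                                # (1/n) sum_i H_i
slack = 3 * c * (d / n) * np.log(d / delta)
gap_E3 = 2 * emp + slack * np.eye(d) - H              # Corollary E.3: should be PSD
gap_E4 = 1.5 * H + (slack / 2) * np.eye(d) - emp      # Corollary E.4: should be PSD
print("Corollary E.3 holds:", np.linalg.eigvalsh(gap_E3).min() >= 0)
print("Corollary E.4 holds:", np.linalg.eigvalsh(gap_E4).min() >= 0)
```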

A key consequence of the above corollaries is Lemma 14 of Liu et al. (2023a), which we include below.

Theorem E.5 (Lemma 14 of Liu et al. (2023a)).

For a dd-dimensional distribution 𝒟\mathcal{D}, let ϕ1,ϕ2,,ϕn\bm{\phi}_{1},\bm{\phi}_{2},\ldots,\bm{\phi}_{n} be i.i.d. samples from 𝒟\mathcal{D}. Define Σ=𝔼ϕ𝒟[ϕϕ𝖳]\Sigma=\operatornamewithlimits{\mathbb{E}}_{\bm{\phi}\sim\mathcal{D}}[\bm{\phi}\bm{\phi}^{\mathsf{T}}] and Σ~=1ni=1nϕiϕi𝖳\widetilde{\Sigma}=\frac{1}{n}\sum_{i=1}^{n}\bm{\phi}_{i}\bm{\phi}_{i}^{\mathsf{T}}. If Σ^=(Σ~+γI)1\widehat{\Sigma}^{\dagger}=(\widetilde{\Sigma}+\gamma I)^{-1} where γ=5dnlog6dδ\gamma=5\frac{d}{n}\log\frac{6d}{\delta}, then with probability 1δ1-\delta, we have

(γI+Σ~Σ)𝜽Σ^2=𝒪(dnlogdδ),𝜽21.\left\lVert(\gamma I+\widetilde{\Sigma}-\Sigma)\bm{\theta}\right\rVert_{\widehat{\Sigma}^{\dagger}}^{2}=\operatorname{\mathcal{O}}\left(\frac{d}{n}\log\frac{d}{\delta}\right),\quad\forall\lVert\bm{\theta}\rVert_{2}\leq 1.

E.3 Adaptive Freedman Inequality

In this paper, we will make use of the Adaptive Freedman Inequality proposed by Lee et al. (2020, Theorem 2.2) and improved by Zimmert and Lattimore (2022, Theorem 9), stated as follows.

Lemma E.6 (Theorem 9 of Zimmert and Lattimore (2022)).

For a sequence of martingale differences {Xi}i=1n\{X_{i}\}_{i=1}^{n} adapted to the filtration (i)i=0n(\mathcal{F}_{i})_{i=0}^{n}, suppose that 𝔼[|Xi|i1]<\operatornamewithlimits{\mathbb{E}}[\lvert X_{i}\rvert\mid\mathcal{F}_{i-1}]<\infty a.s. Then

i=1nXi3i=1n𝔼[Xi2i1]logCδ+2maxi=1,2,,nXilogCδ,with probability 1δ,\sum_{i=1}^{n}X_{i}\leq 3\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]\log\frac{C}{\delta}}+2\max_{i=1,2,\ldots,n}X_{i}\log\frac{C}{\delta},\quad\text{with probability }1-\delta,

where

C=2max{1,i=1n𝔼[Xi2i1],maxi=1,2,,nXi}.C=2\max\left\{1,\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]},\max_{i=1,2,\ldots,n}X_{i}\right\}.

A direct corollary of Lemma E.6 is the following lemma:

Lemma E.7.

For a sequence of random variables {Xi}i=1n\{X_{i}\}_{i=1}^{n} adapted to the filtration (i)i=0n(\mathcal{F}_{i})_{i=0}^{n}, let the conditional expectation of XiX_{i} be μi𝔼[Xii1]\mu_{i}\triangleq\operatornamewithlimits{\mathbb{E}}[X_{i}\mid\mathcal{F}_{i-1}]. Suppose that 𝔼[|Xi|i1]<\operatornamewithlimits{\mathbb{E}}[\lvert X_{i}\rvert\mid\mathcal{F}_{i-1}]<\infty a.s. Then

i=1n(Xiμi)3i=1n𝔼[Xi2i1]logCδ+2maxi=1,2,,nXilogCδ,with probability 1δ,\sum_{i=1}^{n}(X_{i}-\mu_{i})\leq 3\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]}\log\frac{C}{\delta}+2\max_{i=1,2,\ldots,n}X_{i}\log\frac{C}{\delta},\quad\text{with probability }1-\delta,

where

C=2max{1,i=1n𝔼[Xi2i1],maxi=1,2,,nXi}.C=2\max\left\{1,\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]},\max_{i=1,2,\ldots,n}X_{i}\right\}.
Proof.

We apply Lemma E.6 to the martingale difference sequence {Xiμi}i=1n\{X_{i}-\mu_{i}\}_{i=1}^{n}, giving

\sum_{i=1}^{n}(X_{i}-\mu_{i})\leq 3\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[(X_{i}-\mu_{i})^{2}\mid\mathcal{F}_{i-1}]\log\frac{C}{\delta}}+2\max_{i=1,2,\ldots,n}(X_{i}-\mu_{i})\log\frac{C}{\delta},\quad\text{with probability }1-\delta.

It is clear that max(Xiμi)maxXi+i=1nμi2\max(X_{i}-\mu_{i})\leq\max X_{i}+\sqrt{\sum_{i=1}^{n}\mu_{i}^{2}}. Furthermore, we know that 𝔼[(Xiμi)2i1]=𝔼[Xi2i1]μi2\operatornamewithlimits{\mathbb{E}}[(X_{i}-\mu_{i})^{2}\mid\mathcal{F}_{i-1}]=\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]-\mu_{i}^{2}. Putting these two parts together gives our conclusion. ∎

We also give the following variant of Lemma E.7:

Lemma E.8.

For a sequence of random variables {Xi}i=1n\{X_{i}\}_{i=1}^{n} adapted to the filtration (i)i=0n(\mathcal{F}_{i})_{i=0}^{n}, let the conditional expectation of XiX_{i} be μi𝔼[Xii1]\mu_{i}\triangleq\operatornamewithlimits{\mathbb{E}}[X_{i}\mid\mathcal{F}_{i-1}]. Suppose that 𝔼[|Xi|i1]<\operatornamewithlimits{\mathbb{E}}[\lvert X_{i}\rvert\mid\mathcal{F}_{i-1}]<\infty a.s. Then

|i=1n(Xiμi)|82i=1n𝔼[Xi2i1]+i=1nXi2logCδ,with probability 1δ,\left\lvert\sum_{i=1}^{n}(X_{i}-\mu_{i})\right\rvert\leq 8\sqrt{2}\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]+\sum_{i=1}^{n}X_{i}^{2}}\log\frac{C}{\delta},\quad\text{with probability }1-\delta,

where

C=22i=1n𝔼[Xi2i1]+i=1nXi2.C=2\sqrt{2}\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]+\sum_{i=1}^{n}X_{i}^{2}}.
Proof.

Applying Lemma E.6 to the martingale difference sequence {Xiμi}i=1n\{X_{i}-\mu_{i}\}_{i=1}^{n}, the following inequality holds with probability 1δ1-\delta:

i=1n(Xiμi)3i=1n𝔼[(Xiμi)2i1]logCδ+2maxi=1,2,,n{Xiμi}logCδ,\sum_{i=1}^{n}(X_{i}-\mu_{i})\leq 3\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[(X_{i}-\mu_{i})^{2}\mid\mathcal{F}_{i-1}]\log\frac{C^{\prime}}{\delta}}+2\max_{i=1,2,\ldots,n}\{X_{i}-\mu_{i}\}\log\frac{C^{\prime}}{\delta},

where

C=2max{1,i=1n𝔼[(Xiμi)2i1],maxi=1,2,,n{Xiμi}}.C^{\prime}=2\max\left\{1,\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[(X_{i}-\mu_{i})^{2}\mid\mathcal{F}_{i-1}]},\max_{i=1,2,\ldots,n}\{X_{i}-\mu_{i}\}\right\}.

Utilizing the fact that (Xiμi)22(Xi2+μi2)(X_{i}-\mu_{i})^{2}\leq 2(X_{i}^{2}+\mu_{i}^{2}) for all ii, we have

maxi=1,2,,n{Xiμi}2i=1n(Xi2+μi2).\max_{i=1,2,\ldots,n}\{X_{i}-\mu_{i}\}\leq\sqrt{2\sum_{i=1}^{n}(X_{i}^{2}+\mu_{i}^{2})}.

Hence, we can write

i=1n(Xiμi)\displaystyle\sum_{i=1}^{n}(X_{i}-\mu_{i}) 42i=1n𝔼[(Xiμi)2i1]+i=1n(Xi2+μi2)logCδ\displaystyle\leq 4\sqrt{2}\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[(X_{i}-\mu_{i})^{2}\mid\mathcal{F}_{i-1}]+\sum_{i=1}^{n}(X_{i}^{2}+\mu_{i}^{2})}\log\frac{C^{\prime}}{\delta}
=42i=1n(𝔼[Xi2i1]μi2)+i=1n(Xi2+μi2)logCδ\displaystyle=4\sqrt{2}\sqrt{\sum_{i=1}^{n}\left(\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]-\mu_{i}^{2}\right)+\sum_{i=1}^{n}(X_{i}^{2}+\mu_{i}^{2})}\log\frac{C^{\prime}}{\delta}
=42i=1n𝔼[Xi2i1]+i=1nXi2logCδ.\displaystyle=4\sqrt{2}\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]+\sum_{i=1}^{n}X_{i}^{2}}\log\frac{C^{\prime}}{\delta}.

Similarly, $C^{\prime}\leq 2\sqrt{2}\sqrt{\sum_{i=1}^{n}\operatornamewithlimits{\mathbb{E}}[X_{i}^{2}\mid\mathcal{F}_{i-1}]+\sum_{i=1}^{n}X_{i}^{2}}=C$. By exactly the same arguments, the same inequality also holds for $\sum_{i=1}^{n}(\mu_{i}-X_{i})$. Taking a union bound over the two events (i.e., applying the above with $\delta/2$, which at most doubles the logarithmic factor) then yields the claimed two-sided bound with the constant $8\sqrt{2}$. ∎
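For intuition, the following sketch (a bounded i.i.d. sequence, which is a special case of the adapted sequences above; all sample sizes are arbitrary) empirically checks the one-sided bound of Lemma E.7 and reports how often it fails.

```python
import numpy as np

rng = np.random.default_rng(3)
n, delta, trials = 500, 0.05, 2000
failures = 0
for _ in range(trials):
    X = rng.uniform(0.0, 1.0, size=n)     # X_i i.i.d. in [0,1]: mu_i = 1/2, E[X_i^2 | F_{i-1}] = 1/3
    second_moment = n / 3.0
    C = 2 * max(1.0, np.sqrt(second_moment), X.max())
    bound = 3 * np.sqrt(second_moment) * np.log(C / delta) + 2 * X.max() * np.log(C / delta)
    failures += np.sum(X - 0.5) > bound
print(f"empirical failure rate = {failures / trials:.4f} (should be well below delta = {delta})")
```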

E.4 Relative Concentration Bounds

Lemma E.9 (Lemma 48 by Cui et al. (2023)).

Let $X_{1},X_{2},\ldots$ be i.i.d. random variables supported in $[0,1]$ and let $\widehat{S}_{n}=\frac{1}{n}\sum_{i=1}^{n}X_{i}$. Let $n$ be the stopping time $n=\min\{k\mid\sum_{i=1}^{k}X_{i}\geq 64\log\frac{4n_{\max}}{\delta}\}$. If $n\leq n_{\max}$, then w.p. $1-\delta$, $\frac{1}{2}\widehat{S}_{n}\leq\operatornamewithlimits{\mathbb{E}}[X]\leq\frac{3}{2}\widehat{S}_{n}$.

E.5 EXP3 Regret Guarantee

The following lemma is a classical result for the EXP3 algorithm. For completeness, we also include the proof by Dai et al. (2023, Lemma C.1) here.

Lemma E.10.

Let x0,x1,x2,,xTAx_{0},x_{1},x_{2},\ldots,x_{T}\in\mathbb{R}^{A} be defined as

xt+1,i=(xt,iexp(ηct,i))/(i=1Axt,iexp(ηct,i)),0t<T,x_{t+1,i}=\left.\left(x_{t,i}\exp(-\eta c_{t,i})\right)\middle/\left(\sum_{i^{\prime}=1}^{A}x_{t,i^{\prime}}\exp(-\eta c_{t,i^{\prime}})\right)\right.,\quad\forall 0\leq t<T,

where ctAc_{t}\in\mathbb{R}^{A} is the loss corresponding to the tt-th iteration. Suppose that ηct,i1\eta c_{t,i}\geq-1 for all t[T]t\in[T] and i[A]i\in[A]. Then

t=1Txty,ctlogAη+ηt=1Ti=1Axt,ict,i2\sum_{t=1}^{T}\langle x_{t}-y,c_{t}\rangle\leq\frac{\log A}{\eta}+\eta\sum_{t=1}^{T}\sum_{i=1}^{A}x_{t,i}c_{t,i}^{2}

holds for any distribution y([A])y\in\triangle([A]) when x0=(1A,1A,,1A)x_{0}=(\frac{1}{A},\frac{1}{A},\ldots,\frac{1}{A}).

Proof.

By linearity, it suffices to prove the inequality for all one-hot yy’s. Without loss of generality, let y=𝟏iy=\bm{1}_{i^{\ast}} where i[A]i^{\ast}\in[A]. Define Ct,i=t=1tct,iC_{t,i}=\sum_{t^{\prime}=1}^{t}c_{t^{\prime},i} as the prefix sum of ct,ic_{t,i}. Let

Φt=1ηln(i=1Aexp(ηCt,i)),\Phi_{t}=\frac{1}{\eta}\ln\left(\sum_{i=1}^{A}\exp\left(-\eta C_{t,i}\right)\right),

then by definition of xtx_{t}, we have

ΦtΦt1\displaystyle\Phi_{t}-\Phi_{t-1} =1ηln(i=1Aexp(ηCt,i)i=1Aexp(ηCt1,i))=1ηln(i=1Axt,iexp(ηct,i))\displaystyle=\frac{1}{\eta}\ln\left(\frac{\sum_{i=1}^{A}\exp(-\eta C_{t,i})}{\sum_{i=1}^{A}\exp(-\eta C_{t-1,i})}\right)=\frac{1}{\eta}\ln\left(\sum_{i=1}^{A}x_{t,i}\exp(-\eta c_{t,i})\right)
(a)1ηln(i=1Axt,i(1ηct,i+η2ct,i2))=1ηln(1ηxt,ct+η2i=1Axt,ict,i2)\displaystyle\overset{(a)}{\leq}\frac{1}{\eta}\ln\left(\sum_{i=1}^{A}x_{t,i}(1-\eta c_{t,i}+\eta^{2}c_{t,i}^{2})\right)=\frac{1}{\eta}\ln\left(1-\eta\langle x_{t},c_{t}\rangle+\eta^{2}\sum_{i=1}^{A}x_{t,i}c_{t,i}^{2}\right)
(b)xt,ct+ηi=1Axt,ict,i2,\displaystyle\overset{(b)}{\leq}-\langle x_{t},c_{t}\rangle+\eta\sum_{i=1}^{A}x_{t,i}c_{t,i}^{2},

where (a) used exp(x)1x+x2\exp(-x)\leq 1-x+x^{2} for all x1x\geq-1 and (b) used ln(1+x)x\ln(1+x)\leq x (again for all x1x\geq-1). Therefore, summing over t=1,2,,Tt=1,2,\ldots,T gives

t=1Txt,ct\displaystyle\sum_{t=1}^{T}\langle x_{t},c_{t}\rangle Φ0ΦT+ηt=1Ti=1Axt,ict,i2\displaystyle\leq\Phi_{0}-\Phi_{T}+\eta\sum_{t=1}^{T}\sum_{i=1}^{A}x_{t,i}c_{t,i}^{2}
\displaystyle\leq\frac{\ln A}{\eta}-\frac{1}{\eta}\ln\left(\exp(-\eta C_{T,i^{\ast}})\right)+\eta\sum_{t=1}^{T}\sum_{i=1}^{A}x_{t,i}c_{t,i}^{2}
\displaystyle=\frac{\ln A}{\eta}+C_{T,i^{\ast}}+\eta\sum_{t=1}^{T}\sum_{i=1}^{A}x_{t,i}c_{t,i}^{2}.

Moving $C_{T,i^{\ast}}$ to the LHS (noting that $C_{T,i^{\ast}}=\sum_{t=1}^{T}\langle\bm{1}_{i^{\ast}},c_{t}\rangle$) then shows the inequality for $y=\bm{1}_{i^{\ast}}$. The result extends to all $y\in\triangle([A])$ by linearity. ∎
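To accompany the proof, here is a minimal sketch of the update in Lemma E.10 (full-information exponential weights on losses in $[0,1]$; the step size is a standard tuning of my choosing, not prescribed by the lemma) that compares the realized regret against the bound $\frac{\log A}{\eta}+\eta\sum_{t}\sum_{i}x_{t,i}c_{t,i}^{2}$.

```python
import numpy as np

rng = np.random.default_rng(4)
A, T = 10, 5000
eta = np.sqrt(np.log(A) / T)              # assumed tuning; any eta with eta*c_{t,i} >= -1 works
x = np.full(A, 1.0 / A)                   # x_0 is uniform, as required by the lemma
alg_loss, second_order, cum_loss = 0.0, 0.0, np.zeros(A)

for _ in range(T):
    c = rng.uniform(0.0, 1.0, size=A)     # losses in [0,1], so eta*c_{t,i} >= -1 trivially holds
    alg_loss += x @ c
    second_order += eta * np.sum(x * c**2)
    cum_loss += c
    x = x * np.exp(-eta * c)              # exponential-weights update x_{t+1,i} ∝ x_{t,i} exp(-eta*c_{t,i})
    x /= x.sum()

regret = alg_loss - cum_loss.min()        # regret against the best fixed action y = 1_{i*}
print(f"regret = {regret:.2f} <= bound = {np.log(A) / eta + second_order:.2f}")
```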