
Adaptively Perturbed Mirror Descent for Learning in Games

Kenshi Abe    Kaito Ariu    Mitsuki Sakamoto    Atsushi Iwasaki
Abstract

This paper proposes a payoff perturbation technique for the Mirror Descent (MD) algorithm in games where the gradient of the payoff functions is monotone in the strategy profile space, potentially containing additive noise. The optimistic family of learning algorithms, exemplified by optimistic MD, successfully achieves last-iterate convergence in scenarios devoid of noise, leading the dynamics to a Nash equilibrium. A recent re-emerging trend underscores the promise of the perturbation approach, where payoff functions are perturbed based on the distance from an anchoring, or slingshot, strategy. In response, we propose Adaptively Perturbed MD (APMD), which adjusts the magnitude of the perturbation by repeatedly updating the slingshot strategy at a predefined interval. This innovation empowers us to find a Nash equilibrium of the underlying game with guaranteed rates. Empirical demonstrations affirm that our algorithm exhibits significantly accelerated convergence.


1 Introduction

This study delves into a variant of Mirror Descent (MD) (Nemirovskij & Yudin, 1983; Beck & Teboulle, 2003) in the context of monotone games, i.e., games in which the gradient of the payoff functions is monotone on the strategy profile space. This class encompasses diverse games, including Cournot competition (Bravo et al., 2018), $\lambda$-cocoercive games (Lin et al., 2020), concave-convex games, and zero-sum polymatrix games (Cai & Daskalakis, 2011; Cai et al., 2016). Due to their extensive applicability, various learning algorithms have been developed and scrutinized to compute a Nash equilibrium efficiently.

Agents prescribed to play according to MD or its variants are more likely to choose strategies with higher expected payoffs, while regularization keeps them from moving far away from their current strategies. The dynamics are known to converge to an equilibrium in an average sense, referred to as average-iterate convergence: the strategy profile averaged over iterations converges to an equilibrium. Nevertheless, research has shown that the actual trajectory of updated strategy profiles fails to converge even in two-player zero-sum games and a specific class within monotone games (Mertikopoulos et al., 2018; Bailey & Piliouras, 2018). On the contrary, optimistic learning algorithms, which incorporate recency bias, have shown success: the updated strategy profile itself converges to a Nash equilibrium (Daskalakis et al., 2018; Daskalakis & Panageas, 2019; Mertikopoulos et al., 2019; Wei et al., 2021), termed last-iterate convergence.

However, the optimistic approach faces challenges with feedback contaminated by some noise. Typically, each agent updates his or her strategy according to the perfect gradient feedback of the payoff function at each iteration, denoted as full feedback. In a more realistic scenario, noise might distort this feedback. With noisy feedback, optimistic learning algorithms perform suboptimally. For instance, Abe et al. (2023) empirically demonstrated that optimistic Multiplicative Weights Update (OMWU) fails to converge to an equilibrium, orbiting around it.

Alternatively, perturbation of payoffs has re-emerged as a pivotal concept for achieving last-iterate convergence, even under noise (Abe et al., 2023). Payoff perturbation is a classical technique, as seen in Facchinei & Pang (2003), that introduces strongly convex penalties to the players' payoff functions to stabilize learning. This leads to convergence to approximate equilibria, not only in the full feedback setting but also in the noisy feedback setting. However, to ensure convergence toward a Nash equilibrium of the underlying game, the magnitude of the perturbation, calculated as the product of a strongly convex penalty function and a perturbation strength parameter, requires careful adjustment. In fact, Liu et al. (2023) shrink the perturbation strength based on the current strategy profile's proximity to an underlying equilibrium. Similarly, iterative Tikhonov regularization methods (Koshal et al., 2010; Tatarenko & Kamgarpour, 2019) adjust the magnitude of the perturbation by using a sequence of perturbation strengths that satisfy certain conditions, such as diminishing as the iteration increases. Although these algorithms admit last-iterate convergence, choosing an appropriate learning rate for the shrinking perturbation strength is challenging, which often prevents these algorithms and their variants from converging rapidly.

In response to this, we adaptively determine the amount of the penalty from the distance $G(\cdot,\cdot)$ between the current strategy $\pi$ and an anchoring, or slingshot, strategy $\sigma$, while keeping the perturbation strength parameter $\mu$ constant. Instead of carefully decaying the perturbation strength, the slingshot strategies $\sigma$ are re-initialized at a predefined interval $T_{\sigma}$ by the current strategies, and thus the magnitude of the perturbation $\mu G(\cdot,\cdot)$ is adjusted. To the best of our knowledge, Perolat et al. (2021) were the first to employ this idea, and it enabled Abe et al. (2023) to modify MWU to achieve last-iterate convergence. However, both works established convergence only in an asymptotic manner. The significance of our work, in part, lies in extending these two studies and establishing non-asymptotic convergence results.

Our contributions are manifold. First, we propose our algorithm, Adaptively Perturbed MD (APMD); an implementation of our method is available at https://github.com/CyberAgentAILab/adaptively-perturbed-md. Second, we analyze the case where both the perturbation function and the proximal regularizer are the squared $\ell^{2}$-distance and provide last-iterate convergence rates to a Nash equilibrium of $\mathcal{O}(\ln T/\sqrt{T})$ and $\mathcal{O}(\ln T/T^{\frac{1}{10}})$ with full and noisy feedback, respectively. We also discuss the case where distances other than the squared $\ell^{2}$-distance are utilized. Finally, we empirically reveal that our proposed APMD significantly outperforms MWU and OMWU in two zero-sum polymatrix games, regardless of the feedback type.

2 Preliminaries

Monotone games.

This paper focuses on a continuous game, denoted by $([N],(\mathcal{X}_{i})_{i\in[N]},(v_{i})_{i\in[N]})$. Here, $[N]=\{1,2,\cdots,N\}$ represents the set of $N$ players, $\mathcal{X}_{i}\subseteq\mathbb{R}^{d_{i}}$ represents the $d_{i}$-dimensional compact convex strategy space for player $i\in[N]$, and we write $\mathcal{X}=\prod_{i\in[N]}\mathcal{X}_{i}$. Each player $i$ chooses a strategy $\pi_{i}$ from $\mathcal{X}_{i}$ and aims to maximize her differentiable payoff function $v_{i}:\mathcal{X}\to\mathbb{R}$. We write $\pi_{-i}\in\prod_{j\neq i}\mathcal{X}_{j}$ for the strategies of all players except player $i$, and denote the strategy profile by $\pi=(\pi_{i})_{i\in[N]}\in\mathcal{X}$. This study particularly considers a smooth monotone game, where the gradient $(\nabla_{\pi_{i}}v_{i})_{i\in[N]}$ of the payoff functions is monotone: $\forall\pi,\pi^{\prime}\in\mathcal{X}$,

\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i},\pi_{-i})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\prime},\pi_{-i}^{\prime}),\pi_{i}-\pi^{\prime}_{i}\rangle\leq 0, \qquad (1)

and $L$-Lipschitz:

\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi_{i},\pi_{-i})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\prime},\pi_{-i}^{\prime})\|^{2}\leq L^{2}\|\pi-\pi^{\prime}\|^{2}, \qquad (2)

where $\|\cdot\|$ is the $\ell^{2}$-norm.

Monotone games include many common and well-studied classes of games, such as concave-convex games, zero-sum polymatrix games, and Cournot competition.

Example 2.1 (Concave-Convex Games).

Let us consider a max-min game $(\{1,2\},(\mathcal{X}_{1},\mathcal{X}_{2}),(v,-v))$, where $v:\mathcal{X}_{1}\times\mathcal{X}_{2}\to\mathbb{R}$. Player 1 aims to maximize $v$, while player 2 aims to minimize $v$. If $v$ is concave in $x_{1}\in\mathcal{X}_{1}$ and convex in $x_{2}\in\mathcal{X}_{2}$, the game is called a concave-convex game or minimax optimization problem, and it is easy to confirm that the game is monotone.

Example 2.2 (Zero-Sum Polymatrix Games).

In a zero-sum polymatrix game, each player's payoff function can be decomposed as $v_{i}(\pi)=\sum_{j\neq i}u_{i}(\pi_{i},\pi_{j})$, where $u_{i}:\mathcal{X}_{i}\times\mathcal{X}_{j}\to\mathbb{R}$ is represented by $u_{i}(\pi_{i},\pi_{j})=\pi_{i}^{\top}\mathrm{M}^{(i,j)}\pi_{j}$ with some matrix $\mathrm{M}^{(i,j)}\in\mathbb{R}^{d_{i}\times d_{j}}$, and satisfies $u_{i}(\pi_{i},\pi_{j})=-u_{j}(\pi_{j},\pi_{i})$. In this game, each player $i$ can be interpreted as playing a two-player zero-sum game with each other player $j\neq i$. From the linearity and zero-sum property of $u_{i}$, we can easily show that $\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i},\pi_{-i})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\prime},\pi_{-i}^{\prime}),\pi_{i}-\pi^{\prime}_{i}\rangle=0$. Thus, the zero-sum polymatrix game is a special case of monotone games.
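To make the last step explicit, note that $\nabla_{\pi_{i}}v_{i}(\pi)=\sum_{j\neq i}\mathrm{M}^{(i,j)}\pi_{j}$ and that $u_{i}(\pi_{i},\pi_{j})=-u_{j}(\pi_{j},\pi_{i})$ forces $\mathrm{M}^{(j,i)}=-(\mathrm{M}^{(i,j)})^{\top}$. Hence

\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi)-\nabla_{\pi_{i}}v_{i}(\pi^{\prime}),\pi_{i}-\pi^{\prime}_{i}\rangle=\sum_{i=1}^{N}\sum_{j\neq i}(\pi_{i}-\pi^{\prime}_{i})^{\top}\mathrm{M}^{(i,j)}(\pi_{j}-\pi^{\prime}_{j}),

and pairing the $(i,j)$ and $(j,i)$ terms gives $(\pi_{i}-\pi^{\prime}_{i})^{\top}\mathrm{M}^{(i,j)}(\pi_{j}-\pi^{\prime}_{j})+(\pi_{j}-\pi^{\prime}_{j})^{\top}\mathrm{M}^{(j,i)}(\pi_{i}-\pi^{\prime}_{i})=0$, so the whole sum vanishes.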

Nash equilibrium and gap function.

A Nash equilibrium (Nash, 1951) is a common solution concept of a game, which is a strategy profile where no player can improve her payoff by deviating from her specified strategy. Formally, a Nash equilibrium $\pi^{\ast}\in\mathcal{X}$ satisfies the following condition:

\forall i\in[N],\ \forall\pi_{i}\in\mathcal{X}_{i},\quad v_{i}(\pi_{i}^{\ast},\pi_{-i}^{\ast})\geq v_{i}(\pi_{i},\pi_{-i}^{\ast}).

We denote the set of Nash equilibria by $\Pi^{\ast}$. Note that a Nash equilibrium always exists for any smooth monotone game (Debreu, 1952). Furthermore, we measure the proximity to Nash equilibrium for a given strategy profile $\pi$ by its gap function, which is defined as:

\mathrm{GAP}(\pi):=\max_{\tilde{\pi}\in\mathcal{X}}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i},\pi_{-i}),\tilde{\pi}_{i}-\pi_{i}\rangle.

The gap function is a standard metric of proximity to Nash equilibrium for a given strategy profile $\pi$ (Cai & Zheng, 2023). From the definition, $\mathrm{GAP}(\pi)\geq 0$ for any $\pi\in\mathcal{X}$, and the equality holds if and only if $\pi$ is a Nash equilibrium.
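When each strategy space is a probability simplex (as in the zero-sum polymatrix games considered later), the inner maximization is attained at a vertex, so the gap function has a closed form. The following is a minimal NumPy sketch under that simplex assumption; the function name and the way gradients are supplied are illustrative, not part of the paper.

```python
import numpy as np

def gap(pi, grads):
    """Gap function for simplex strategy spaces.

    pi    : list of N arrays, pi[i] is player i's mixed strategy.
    grads : list of N arrays, grads[i] = grad_{pi_i} v_i(pi).
    The maximizer of a linear function over a simplex is a vertex, so the
    per-player contribution is max_j grads[i][j] - <grads[i], pi[i]>.
    """
    return sum(float(np.max(g) - g @ p) for p, g in zip(pi, grads))
```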

Problem setting.

In this study, we consider the online learning setting where the following process is repeated for $T$ iterations: 1) At each iteration $t\geq 0$, each player $i\in[N]$ chooses her strategy $\pi_{i}^{t}\in\mathcal{X}_{i}$ based on the previously observed feedback; 2) Each player $i$ then receives the gradient feedback $\widehat{\nabla}_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})$. This study considers two feedback models: full feedback and noisy feedback. In the full feedback setting, each player receives the perfect gradient vector, i.e., $\widehat{\nabla}_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})=\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})$. In the noisy feedback setting, each player's feedback is given by $\widehat{\nabla}_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})=\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})+\xi_{i}^{t}$, where $\xi_{i}^{t}\in\mathbb{R}^{d_{i}}$ is a noise vector. Specifically, we focus on zero-mean, bounded-variance noise vectors.

Mirror Descent.

Mirror Descent (MD) is a widely used algorithm for learning equilibria in games. Let us define $\psi:\mathbb{R}^{d_{i}}\to\mathbb{R}\cup\{\infty\}$ as the strictly convex regularization function and $D_{\psi}(\pi_{i},\pi_{i}^{\prime})=\psi(\pi_{i})-\psi(\pi_{i}^{\prime})-\langle\nabla\psi(\pi_{i}^{\prime}),\pi_{i}-\pi_{i}^{\prime}\rangle$ as the associated Bregman divergence. Then, MD updates each player $i$'s strategy $\pi_{i}^{t}$ at iteration $t$ as follows:

\pi_{i}^{t+1}=\operatorname*{arg\,max}_{x\in\mathcal{X}_{i}}\left\{\eta_{t}\left\langle\widehat{\nabla}_{\pi_{i}}v_{i}(\pi^{t}),x\right\rangle-D_{\psi}(x,\pi_{i}^{t})\right\},

where $\eta_{t}\in(0,\infty)$ is the learning rate at iteration $t$.
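As a concrete instance, when $\mathcal{X}_{i}=\Delta^{d_{i}}$ and $\psi$ is the entropy $\psi(\pi_{i})=\sum_{j}\pi_{ij}\ln\pi_{ij}$, the MD update has the familiar multiplicative-weights closed form. Below is a minimal NumPy sketch under those assumptions; the function name is illustrative and the strategy is assumed to lie in the interior of the simplex.

```python
import numpy as np

def md_step_entropic(pi_i, grad_i, eta):
    """One MD step on the simplex with the entropic regularizer (MWU form).

    pi_i   : current mixed strategy of player i (interior of the simplex).
    grad_i : (possibly noisy) gradient feedback for player i.
    eta    : learning rate eta_t.
    """
    logits = np.log(pi_i) + eta * grad_i   # unnormalized log-weights
    logits -= logits.max()                 # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()
```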

Algorithm 1 APMD for player $i$.
0:  Learning rate sequence $\{\eta_{t}\}_{t\geq 0}$, divergence function for perturbation $G$, perturbation strength $\mu$, update interval $T_{\sigma}$, initial strategy $\pi_{i}^{0}$
1:  $k\leftarrow 0,~\tau\leftarrow 0$
2:  $\sigma_{i}^{0}\leftarrow\pi_{i}^{0}$
3:  for $t=0,1,2,\cdots,T$ do
4:     Receive the gradient feedback $\widehat{\nabla}_{\pi_{i}}v_{i}(\pi^{t})$
5:     Update the strategy by
       $\pi_{i}^{t+1}=\operatorname*{arg\,max}_{x\in\mathcal{X}_{i}}\left\{\eta_{t}\left\langle\widehat{\nabla}_{\pi_{i}}v_{i}(\pi^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}),x\right\rangle-D_{\psi}(x,\pi_{i}^{t})\right\}$
6:     $\tau\leftarrow\tau+1$
7:     if $\tau=T_{\sigma}$ then
8:        $k\leftarrow k+1,~\tau\leftarrow 0$
9:        $\sigma_{i}^{k}\leftarrow\pi_{i}^{t+1}$
10:     end if
11:  end for

Other notations.

We denote a $d$-dimensional probability simplex by $\Delta^{d}=\{p\in[0,1]^{d}\mid\sum_{j=1}^{d}p_{j}=1\}$. We define $\mathrm{diam}(\mathcal{X}):=\sup_{\pi,\pi^{\prime}\in\mathcal{X}}\|\pi-\pi^{\prime}\|$ as the diameter of $\mathcal{X}$. The Kullback-Leibler (KL) divergence is defined by $\mathrm{KL}(\pi_{i},\pi_{i}^{\prime})=\sum_{j=1}^{d_{i}}\pi_{ij}\ln\frac{\pi_{ij}}{\pi_{ij}^{\prime}}$. Besides, with a slight abuse of notation, we represent the sum of Bregman divergences and the sum of KL divergences by $D_{\psi}(\pi,\pi^{\prime})=\sum_{i=1}^{N}D_{\psi}(\pi_{i},\pi_{i}^{\prime})$ and $\mathrm{KL}(\pi,\pi^{\prime})=\sum_{i=1}^{N}\mathrm{KL}(\pi_{i},\pi_{i}^{\prime})$, respectively. We finally denote the domain of $\psi$ by $\mathrm{dom}~\psi:=\{x:\psi(x)<\infty\}$ and the corresponding interior by $\mathrm{int}(\mathrm{dom}~\psi)$.

3 Adaptively Perturbed Mirror Descent

In this section, we present Adaptively Perturbed Mirror Descent (APMD), an extension of the standard MD algorithm. Algorithm 1 describes the pseudo-code. APMD employs two pivotal techniques: slingshot payoff perturbation and slingshot strategy update, which correspond to lines 5 and 9 in Algorithm 1, respectively.

Figure 1: Illustration of the impact of the slingshot strategy updates on the gap function for $\pi^{t}$ updated by APMD.

3.1 Slingshot Payoff Perturbation

We define a differentiable divergence function $G(\cdot,\cdot):\mathbb{R}^{d_{i}}\times\mathbb{R}^{d_{i}}\to\mathbb{R}\cup\{\infty\}$ and a slingshot strategy $\sigma_{i}\in\mathcal{X}_{i}$. APMD perturbs each player's payoff by the divergence between the current strategy $\pi_{i}^{t}$ and the slingshot strategy $\sigma_{i}$, i.e., $G(\pi_{i}^{t},\sigma_{i})$. Specifically, APMD updates each player's strategy according to

\pi_{i}^{t+1}=\operatorname*{arg\,max}_{x\in\mathcal{X}_{i}}\left\{\eta_{t}\left\langle\widehat{\nabla}_{\pi_{i}}v_{i}(\pi^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}),x\right\rangle-D_{\psi}(x,\pi_{i}^{t})\right\}, \qquad (3)

where $\mu\in(0,\infty)$ is the perturbation strength and $\nabla_{\pi_{i}}G$ denotes differentiation with respect to the first argument. We assume that $G(\cdot,\sigma_{i})$ is strictly convex for every $\sigma_{i}\in\mathcal{X}_{i}$ and takes a minimum value of $0$ at $\sigma_{i}$. Furthermore, we assume that $\psi$ is differentiable and $\rho$-strongly convex on $\mathcal{X}_{i}$ with $\rho\in(0,\infty)$.

The conventional MD updates its strategy based on the gradient feedback of the payoff function and the regularization term. The regularization term keeps the next strategy from deviating significantly from the current strategy. APMD additionally perturbs the gradient payoff vector by the divergence between the current strategy $\pi_{i}^{t}$ and a predefined slingshot strategy $\sigma_{i}$. If the current strategy is far away from the slingshot strategy, the magnitude of the perturbation increases. Note that if both strategies coincide, no perturbation occurs, and APMD simply seeks a strategy with a higher expected payoff. As Figure 1 illustrates, the divergence fluctuates at first, and then the current strategy profile approaches a stationary point where the gradient of the expected payoff equals the gradient of the perturbation term; in this way, the perturbation via the slingshot stabilizes the learning dynamics. Indeed, Mutant Follow the Regularized Leader (Mutant FTRL), instantiated in Example 5.3, encompasses replicator-mutator dynamics, which is guaranteed to converge to an approximate equilibrium in two-player zero-sum games (Abe et al., 2022). We can argue that APMD inherits this nice feature.

3.2 Slingshot Strategy Update

The perturbation via the slingshot enables $\pi^{t}$ to converge quickly to a stationary point (Lemmas 5.6 and 5.7). Different slingshot strategy profiles induce different stationary points. Of course, when the slingshot strategy profile is set to be an equilibrium, the corresponding stationary point is also an equilibrium. However, it is virtually impossible to identify such an ideal slingshot strategy profile beforehand. Instead, APMD adjusts the slingshot strategy profile $\sigma$ by replacing it with the (nearly) stationary point that is reached after a predefined number of iterations $T_{\sigma}$.

Figure 1 illustrates how our slingshot strategy update brings the corresponding stationary points closer to an equilibrium. The x- and y-axes indicate the number of iterations and the logarithm of the gap function of the last-iterate strategy profile, respectively. We here set the learning rate and the perturbation strength to $\eta=0.1$ and $\mu=1$, respectively. The initial slingshot strategy $\sigma_{i}$ is given as a uniform distribution over the action space. After the first interval of $T_{\sigma}=200$ iterations, APMD finds a stationary point. The slingshot strategy profile for the second interval is then replaced with this stationary point. Figure 1 clearly shows that the stationary point, i.e., the last-iterate strategy profile, comes closer to an equilibrium every time the slingshot strategy profile is updated. We also theoretically justify our slingshot strategy update in Theorem D.1: when the slingshot strategy profile is close to an equilibrium $\pi^{\ast}\in\Pi^{\ast}$, the stationary point is close to $\pi^{\ast}$.
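Putting the two techniques together, a minimal NumPy sketch of Algorithm 1 on simplex strategy spaces is given below. It instantiates both $D_{\psi}$ and $G$ as the KL divergence (one of the configurations used in the experiments of Section 6); the function names and the gradient-feedback callback are illustrative assumptions rather than part of the paper.

```python
import numpy as np

def apmd(grad_feedback, pi0, eta=0.1, mu=0.1, T=10_000, T_sigma=200):
    """APMD (Algorithm 1) with entropic mirror map and KL perturbation.

    grad_feedback(pi) -> list of per-player (possibly noisy) payoff gradients.
    pi0               -> list of per-player initial mixed strategies (interior).
    """
    pi = [p.copy() for p in pi0]
    sigma = [p.copy() for p in pi0]   # slingshot strategies (line 2)
    tau = 0
    for t in range(T):
        grads = grad_feedback(pi)
        for i, (p, g, s) in enumerate(zip(pi, grads, sigma)):
            # Perturbed gradient: grad v_i - mu * grad_pi KL(pi, sigma);
            # the constant term of the KL gradient cancels under normalization.
            perturbed = g - mu * (np.log(p) - np.log(s))
            logits = np.log(p) + eta * perturbed   # closed-form MD step (line 5)
            logits -= logits.max()
            new_p = np.exp(logits)
            pi[i] = new_p / new_p.sum()
        tau += 1
        if tau == T_sigma:            # slingshot strategy update (line 9)
            tau = 0
            sigma = [p.copy() for p in pi]
    return pi
```

With the squared $\ell^{2}$-distance for both $D_{\psi}$ and $G$ (the setting analyzed in Section 4), the inner step would instead be a projected perturbed-gradient step.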

4 Last-Iterate Convergence Rates

In this section, we establish the last-iterate convergence rates of APMD. More specifically, we examine the setting where both $D_{\psi}$ and $G$ are set to the squared $\ell^{2}$-distance, i.e., $D_{\psi}(\pi_{i},\pi_{i}^{\prime})=G(\pi_{i},\pi_{i}^{\prime})=\frac{1}{2}\|\pi_{i}-\pi_{i}^{\prime}\|^{2}$. This instance can be viewed as an extended version of Gradient Descent that incorporates our techniques of payoff perturbation and slingshot strategy update. We also assume that the gradient vector of $v_{i}$ is bounded. We emphasize that we obtain the overall last-iterate convergence rates of APMD for the entire $T$ iterations in both full and noisy feedback settings.

4.1 Full Feedback Setting

First, we demonstrate the last-iterate convergence rate of APMD with full feedback, where each player receives the perfect gradient vector $\widehat{\nabla}_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})=\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})$ at each iteration $t$. Theorem 4.1 provides APMD's convergence rate of $\mathcal{O}(\ln T/\sqrt{T})$ in terms of the gap function. Note that the learning rate is constant, and its upper bound is specified by the perturbation strength $\mu$ and the smoothness parameter $L$.

Theorem 4.1.

If we use the constant learning rate $\eta_{t}=\eta\in(0,\frac{2\mu}{3\mu^{2}+8L^{2}})$, set $D_{\psi}$ and $G$ as the squared $\ell^{2}$-distance $D_{\psi}(\pi_{i},\pi_{i}^{\prime})=G(\pi_{i},\pi_{i}^{\prime})=\|\pi_{i}-\pi_{i}^{\prime}\|^{2}/2$, and set $T_{\sigma}=\Theta(\ln T)$, then the strategy profile $\pi^{T}$ updated by APMD satisfies:

\mathrm{GAP}(\pi^{T})=\mathcal{O}\left(\frac{\ln T}{\sqrt{T}}\right).

The obtained rate is competitive with optimistic gradient and extra-gradient methods (Cai et al., 2022a), whose rates are $\mathcal{O}(1/\sqrt{T})$. Although it is open whether our convergence rate matches a lower bound, it closely aligns with the lower bound for the class of algorithms that includes $1$-SCLI algorithms (Golowich et al., 2020b), which is different from our APMD. To the best of our knowledge, the fastest rate of $\mathcal{O}(1/T)$ is achieved by Accelerated Optimistic Gradient (AOG) (Cai & Zheng, 2023), which is an optimistic variant of the Halpern iteration (Halpern, 1967). At first sight, the update rule of AOG looks as if the perturbation strength were linearly decayed. However, it does not perturb the payoff functions; instead, it adjusts the regularization term by using the convex combination of the current strategy and the anchoring strategy as the proximal point in MD. Unlike our APMD, the anchoring strategy is never updated through the iterations. (Regarding the rate of $\mathcal{O}(1/T)$, a companion paper is in preparation.)

4.2 Proof Sketch of Theorem 4.1

This section sketches the proof for Theorem 4.1. We present the complete proofs for the theorem and associated lemmas in Appendix E.

(1) Convergence rates to a stationary point with the $k$-th slingshot strategy profile.

We denote by $\sigma^{k}$ the slingshot strategy profile after $k$ updates. Since the slingshot strategy profile is overwritten by the current strategy profile $\pi^{t}$ every $T_{\sigma}$ iterations, we can write $\sigma^{k}=\pi^{kT_{\sigma}}$. We first prove that, as $T_{\sigma}$ increases, the next, $(k+1)$-th slingshot strategy profile $\sigma^{k+1}$ approaches the stationary point $\pi^{\mu,\sigma^{k}}$ under the slingshot strategy profile $\sigma^{k}$, which satisfies the following condition: $\forall i\in[N]$,

\pi_{i}^{\mu,\sigma^{k}}=\operatorname*{arg\,max}_{\pi_{i}\in\mathcal{X}_{i}}\left\{v_{i}(\pi_{i},\pi_{-i}^{\mu,\sigma^{k}})-\mu G(\pi_{i},\sigma_{i}^{k})\right\}. \qquad (4)

Note that $\pi^{\mu,\sigma^{k}}$ always exists since the perturbed game is still monotone. Using the strong convexity of $G(\pi_{i},\sigma_{i}^{k})=\frac{1}{2}\|\pi_{i}-\sigma_{i}^{k}\|^{2}$, we show that $\sigma^{k+1}\to\pi^{\mu,\sigma^{k}}$ as $T_{\sigma}\to\infty$:

Lemma 4.2.

Assume that $D_{\psi}$ and $G$ are set as the squared $\ell^{2}$-distance. If we use the constant learning rate $\eta_{t}=\eta\in(0,\frac{2\mu}{3\mu^{2}+8L^{2}})$, the $(k+1)$-th slingshot strategy profile $\sigma^{k+1}$ of APMD under the full feedback setting satisfies:

\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|^{2}=\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\cdot\exp\left(-\mathcal{O}(T_{\sigma})\right).

(2) Upper bound on the gap function.

Next, we derive the upper bound on the gap function for $\sigma^{k+1}$. From the bounding technique of the gap function using the tangent residuals by Cai et al. (2022a) and the first-order optimality condition for $\pi^{\mu,\sigma^{k}}$, the gap function for $\sigma^{k+1}$ can be upper bounded by the distances from the stationary point $\pi^{\mu,\sigma^{k}}$ to the slingshot strategy profiles $\sigma^{k}$ and $\sigma^{k+1}$:

Lemma 4.3.

If $G$ is set as the squared $\ell^{2}$-distance, the gap function for $\sigma^{k+1}$ of APMD satisfies, for $k\geq 0$:

\mathrm{GAP}(\sigma^{k+1})=\mathcal{O}\left(\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|+\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|\right).

By Lemma 4.2, if we set $D_{\psi}$ as the squared $\ell^{2}$-distance, the first term in this lemma can be bounded as:

\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|^{2}=\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\exp\left(-\mathcal{O}(T_{\sigma})\right). \qquad (5)

Therefore, it is enough to derive the convergence rate on the $\ell^{2}$-distance between $\pi^{\mu,\sigma^{k}}$ and $\sigma^{k}$.

(3) Last-iterate convergence results for the slingshot strategy profile.

Let us denote by $K:=\lfloor T/T_{\sigma}\rfloor$ the total number of slingshot strategy updates over the entire $T$ iterations. Then, by choosing $T_{\sigma}=\Omega(\ln T)$, we show that the $\ell^{2}$-distance between the $(K-1)$-th slingshot strategy profile $\sigma^{K-1}$ and the corresponding stationary point $\pi^{\mu,\sigma^{K-1}}$ decreases as $K$ increases:

Lemma 4.4.

In the same setup as Theorem 4.1, the $(K-1)$-th slingshot strategy profile $\sigma^{K-1}$ of APMD satisfies:

\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|=\mathcal{O}(1/\sqrt{K}).

Lemma 4.4 implies that as $K$ increases, the variation of $\sigma^{K}$ becomes negligible, signifying convergence in its behavior.

By combining (5) and Lemmas 4.3-4.4, we can derive the last-iterate convergence rate of the slingshot strategy profile $\sigma^{K}$:

\mathrm{GAP}(\sigma^{K})=\mathcal{O}(1/\sqrt{K}).
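Concretely, suppressing constants, the combination unfolds as

\mathrm{GAP}(\sigma^{K})=\mathcal{O}\left(\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K}\|+\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|\right)=\mathcal{O}\left(\left(1+e^{-\mathcal{O}(T_{\sigma})}\right)\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|\right)=\mathcal{O}(1/\sqrt{K}),

where the first step is Lemma 4.3 with $k=K-1$, the second uses (5), and the last uses Lemma 4.4.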

Thus, since $\pi^{T}=\sigma^{K}$ and $K=\lfloor T/T_{\sigma}\rfloor=\Theta(T/\ln T)$, the statement of the theorem is concluded. ∎

4.3 Noisy Feedback Setting

Next, we consider the noisy feedback setting, where each player $i$ receives a gradient vector with additive noise, $\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})+\xi^{t}_{i}$. Define the sigma-algebra generated by the history of observations as $\mathcal{F}_{t}=\sigma\left((\widehat{\nabla}_{\pi_{i}}v_{i}(\pi_{i}^{0},\pi_{-i}^{0}))_{i\in[N]},\ldots,(\widehat{\nabla}_{\pi_{i}}v_{i}(\pi_{i}^{t-1},\pi_{-i}^{t-1}))_{i\in[N]}\right)$ for all $t\geq 1$. We assume that the noise vectors $(\xi_{i}^{t})_{t\geq 1}$ have zero mean and bounded variances, and that they are independent over $t$. In this setting, the last-iterate convergence rate is achieved by APMD using a decreasing learning rate sequence $\eta_{t}$. The convergence rate obtained by APMD is $\mathcal{O}(\ln T/T^{\frac{1}{10}})$:

Theorem 4.5.

Let $\theta=\frac{3\mu^{2}+8L^{2}}{2\mu}$ and $\kappa=\frac{\mu}{2}$. Assume that $D_{\psi}$ and $G$ are set as the squared $\ell^{2}$-distance $D_{\psi}(\pi_{i},\pi_{i}^{\prime})=G(\pi_{i},\pi_{i}^{\prime})=\frac{1}{2}\|\pi_{i}-\pi_{i}^{\prime}\|^{2}$, and $T_{\sigma}=\Theta(T^{4/5})$. If we choose the learning rate sequence of the form $\eta_{t}=1/(\kappa(t-T_{\sigma}\cdot\lfloor t/T_{\sigma}\rfloor)+2\theta)$, then the strategy profile $\pi^{T}$ updated by APMD satisfies:

\mathbb{E}\left[\mathrm{GAP}(\pi^{T})\right]=\mathcal{O}\left(\frac{\ln T}{T^{\frac{1}{10}}}\right).
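The schedule $\eta_{t}=1/(\kappa(t-T_{\sigma}\cdot\lfloor t/T_{\sigma}\rfloor)+2\theta)$ restarts at every slingshot update: within each interval it decays roughly as $1/t$ and resets to $1/(2\theta)$ whenever the slingshot strategy is refreshed. A small sketch of this schedule (the helper name is illustrative):

```python
def eta_schedule(t, T_sigma, mu, L):
    """Decreasing-and-restarting learning rate of Theorem 4.5."""
    theta = (3 * mu**2 + 8 * L**2) / (2 * mu)
    kappa = mu / 2
    tau = t - T_sigma * (t // T_sigma)   # position within the current slingshot interval
    return 1.0 / (kappa * tau + 2 * theta)
```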

It should be noted that Theorem 4.5 provides a non-asymptotic convergence guarantee with a rate. This is a significant departure from the existing convergence results (Koshal et al., 2010, 2013; Tatarenko & Kamgarpour, 2019), which focus on the asymptotic convergence of iterative Tikhonov regularization methods in the noisy or bandit feedback settings.

4.4 Proof Sketch of Theorem 4.5

As in Theorem 4.1, we first derive the convergence rate of $\sigma^{k+1}$ for the noisy feedback setting:

Lemma 4.6.

Let $\theta=\frac{3\mu^{2}+8L^{2}}{2\mu}$ and $\kappa=\frac{\mu}{2}$. Suppose that both $D_{\psi}$ and $G$ are defined as the squared $\ell^{2}$-distance. Under the noisy feedback setting, if we use the learning rate sequence of the form $\eta_{t}=1/(\kappa(t-T_{\sigma}\cdot\lfloor t/T_{\sigma}\rfloor)+2\theta)$, the $(k+1)$-th slingshot strategy profile $\sigma^{k+1}$ of APMD satisfies, for each $k\geq 0$:

\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|^{2}\right]=\mathcal{O}\left(\frac{\ln T_{\sigma}}{T_{\sigma}}\right).

The proof is given in Appendix E.6 and is based on the standard argument of stochastic optimization, e.g., Nedić & Lee (2014). However, the proof is made possible by taking into account the monotonicity of the game and the relative (strong and smooth) convexity of the divergence function.

Next, in a similar manner to Lemma 4.4, we show the upper bound on the expected distance between $\pi^{\mu,\sigma^{K-1}}$ and $\sigma^{K-1}$ under the noisy feedback setting:

Lemma 4.7.

In the same setup as Theorem 4.5, the $(K-1)$-th slingshot strategy profile $\sigma^{K-1}$ satisfies:

\mathbb{E}\left[\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|\right]=\mathcal{O}\left(\frac{\ln K}{\sqrt{K}}\right).

We note that Lemma 4.3 holds for any combination of $\sigma^{k}$ and $\sigma^{k+1}$, regardless of the presence of noise. By combining Lemmas 4.3, 4.6, and 4.7, we can derive the following last-iterate convergence rate of $\sigma^{K}$ in terms of the gap function:

\mathbb{E}\left[\mathrm{GAP}(\sigma^{K})\right]=\mathcal{O}\left(\frac{\ln K}{\sqrt{K}}\right).

By setting $K=\lfloor T/T_{\sigma}\rfloor=\Theta(T^{1/5})$, we conclude the statement of the theorem. ∎

5 Beyond Squared $\ell^{2}$-Payoff Perturbation

Section 4 assumes that both $G$ and $D_{\psi}$ are the squared $\ell^{2}$-distance. This section considers more general choices of $G$ and $D_{\psi}$.

5.1 Instantiation of Payoff-Perturbed Algorithms

First, we would like to emphasize that choosing appropriate combinations of $G$ and $D_{\psi}$ enables APMD to reproduce some existing learning algorithms that incorporate payoff perturbation. For example, the following learning algorithms can be viewed as instantiations of APMD.

Example 5.1 (Boltzmann Q-Learning (Tuyls et al., 2006)).

Assume that $\mathcal{X}_{i}=\Delta^{d_{i}}$ and the regularizer is the entropy: $\psi(\pi_{i})=\sum_{j=1}^{d_{i}}\pi_{ij}\ln\pi_{ij}$. Let us set $G$ as the KL divergence and the slingshot strategy $\sigma_{i}$ as a uniform distribution, i.e., $G(\pi_{i},\sigma_{i})=\mathrm{KL}(\pi_{i},\sigma_{i})$ and $\sigma_{i}=(1/d_{i})_{j\in[d_{i}]}$. Then, the corresponding continuous-time APMD dynamics can be expressed as:

\frac{d}{dt}\pi_{ij}^{t}=\pi_{ij}^{t}\left(q_{ij}^{\pi^{t}}-\sum_{k=1}^{d_{i}}\pi_{ik}^{t}q_{ik}^{\pi^{t}}\right)-\mu\pi_{ij}^{t}\left(\ln\pi_{ij}^{t}-\sum_{k=1}^{d_{i}}\pi_{ik}^{t}\ln\pi_{ik}^{t}\right),

which is equivalent to Boltzmann Q-learning (Tuyls et al., 2006; Bloembergen et al., 2015).
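For illustration, a forward-Euler discretization of these dynamics can be written as follows; this is a hypothetical simulation sketch, with the step size and the supplied payoff vector $q_{i}^{\pi^{t}}$ being illustrative choices rather than part of the paper.

```python
import numpy as np

def boltzmann_q_step(pi_i, q_i, mu, dt=0.01):
    """One forward-Euler step of the Boltzmann Q-learning dynamics."""
    avg_q = pi_i @ q_i                   # expected payoff under pi_i
    avg_logp = pi_i @ np.log(pi_i)       # negative entropy of pi_i
    dpi = pi_i * (q_i - avg_q) - mu * pi_i * (np.log(pi_i) - avg_logp)
    pi_new = pi_i + dt * dpi
    return pi_new / pi_new.sum()         # renormalize against numerical drift
```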

Example 5.2 (Reward transformed FTRL (Perolat et al., 2021)).

Consider the continuous-time APMD dynamics where $N=2$, $\mathcal{X}_{i}=\Delta^{d_{i}}$, $\psi$ is Legendre (Rockafellar, 1997; Lattimore & Szepesvári, 2020) with $\mathrm{dom}~\psi\subseteq\mathcal{X}_{i}$, and $G(\pi_{i},\sigma_{i})=\mathrm{KL}(\pi_{i},\sigma_{i})$. Then, the APMD dynamics can be described as follows:

\pi_{i}^{t}=\operatorname*{arg\,max}_{x\in\Delta^{d_{i}}}\left\{\int_{0}^{t}\left(\sum_{k=1}^{d_{i}}x_{k}q_{ik}^{\pi^{s}}-\mu\sum_{k=1}^{d_{i}}x_{k}\ln\frac{\pi_{ik}^{s}}{\sigma_{ik}}+\mu\sum_{k=1}^{d_{-i}}\pi_{-ik}^{s}\ln\frac{\pi_{-ik}^{s}}{\sigma_{-ik}}\right)ds-\psi(x)\right\}.

This algorithm is equivalent to FTRL with reward transformation (Perolat et al., 2021).

Example 5.3 (Mutant FTRL (Abe et al., 2022)).

Let us define $\mathcal{X}_{i}=\Delta^{d_{i}}$, and assume that the regularizer $\psi$ is Legendre with $\mathrm{dom}~\psi\subseteq\mathcal{X}_{i}$. If we set $G$ as the reverse KL divergence, i.e., $G(\pi_{i},\sigma_{i})=\mathrm{KL}(\sigma_{i},\pi_{i})=\sum_{j=1}^{d_{i}}\sigma_{ij}\ln\frac{\sigma_{ij}}{\pi_{ij}}$, we can rewrite (3) as:

\pi_{i}^{t+1}=\operatorname*{arg\,max}_{x\in\Delta^{d_{i}}}\left\{\sum_{s=0}^{t}\eta\sum_{j=1}^{d_{i}}x_{j}\left(q_{ij}^{\pi^{s}}+\frac{\mu}{\pi_{ij}^{s}}\left(\sigma_{ij}-\pi_{ij}^{s}\right)\right)-\psi(x)\right\},

where $q_{ij}^{\pi^{s}}=(\widehat{\nabla}_{\pi_{i}}v_{i}(\pi_{i}^{s},\pi_{-i}^{s}))_{j}$. This algorithm is equivalent to Mutant FTRL (Abe et al., 2022).

5.2 Convergence Results with General $G$ and $D_{\psi}$

Next, we establish the convergence results for APMD with general combinations of $G$ and $D_{\psi}$. For theoretical analysis, we assume a specific condition on $G$:

Assumption 5.4.

$G(\cdot,\sigma_{i})$ is differentiable over $\mathrm{int}(\mathrm{dom}~\psi)$. Moreover, $G(\cdot,\sigma_{i})$ is $\beta$-smooth and $\gamma$-strongly convex relative to $\psi$, i.e., for any $\pi_{i},\pi_{i}^{\prime}\in\mathrm{int}(\mathrm{dom}~\psi)$, $\gamma D_{\psi}(\pi_{i}^{\prime},\pi_{i})\leq G(\pi_{i}^{\prime},\sigma_{i})-G(\pi_{i},\sigma_{i})-\langle\nabla_{\pi_{i}}G(\pi_{i},\sigma_{i}),\pi_{i}^{\prime}-\pi_{i}\rangle\leq\beta D_{\psi}(\pi_{i}^{\prime},\pi_{i})$ holds.

Note that these assumptions are always satisfied with $\beta=\gamma=1$ whenever $G$ is identical to $D_{\psi}$; thus, these are not strong assumptions. We also assume that $\pi^{t}$ is well-defined over iterations:

Assumption 5.5.

$\pi^{t}$ updated by APMD satisfies $\pi^{t}\in\mathrm{int}(\mathrm{dom}~\psi)$ for any $t\in\{0,1,\cdots,T\}$.

Using Assumptions 5.4 and 5.5, we derive the convergence rate to $\pi^{\mu,\sigma^{k}}$ in (4):

Lemma 5.6.

Suppose that Assumption 5.4 with $\beta,\gamma\in(0,\infty)$ and Assumption 5.5 hold. If we use the constant learning rate $\eta_{t}=\eta\in(0,\frac{2\mu\gamma\rho^{2}}{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}})$, the $(k+1)$-th slingshot strategy profile $\sigma^{k+1}$ of APMD under the full feedback setting satisfies:

D_{\psi}(\pi^{\mu,\sigma^{k}},\sigma^{k+1})=D_{\psi}(\pi^{\mu,\sigma^{k}},\sigma^{k})\exp\left(-\mathcal{O}(T_{\sigma})\right).
Lemma 5.7.

Let $\theta=\frac{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}}{2\mu\gamma\rho^{2}}$ and $\kappa=\frac{\mu\gamma}{2}$. Suppose that Assumption 5.4 with $\beta,\gamma\in(0,\infty)$ and Assumption 5.5 hold, and the learning rate sequence of the form $\eta_{t}=1/(\kappa(t-T_{\sigma}\cdot\lfloor t/T_{\sigma}\rfloor)+2\theta)$ is used. Then, the $(k+1)$-th slingshot strategy profile $\sigma^{k+1}$ of APMD under the noisy feedback setting satisfies:

\mathbb{E}\left[D_{\psi}(\pi^{\mu,\sigma^{k}},\sigma^{k+1})\right]=\mathcal{O}\left(\frac{\ln T_{\sigma}}{T_{\sigma}}\right).
Figure 2: The gap function for $\pi^{t}$ for APMD, MWU, and OMWU with full feedback. The shaded area represents the standard errors. Note that the KL divergence, reverse KL divergence, and squared $\ell^{2}$-distance are abbreviated to KL, RKL, and L2, respectively.
Figure 3: The gap function for $\pi^{t}$ for APMD, MWU, and OMWU with noisy feedback.

These lemmas imply that $\sigma^{k+1}\to\pi^{\mu,\sigma^{k}}$ as $T_{\sigma}\to\infty$. Therefore, when $T_{\sigma}$ is sufficiently large, the $(k+1)$-th slingshot strategy profile becomes almost equivalent to the stationary point $\pi^{\mu,\sigma^{k}}$. From this, we anticipate that $\mathbb{E}[\mathrm{GAP}(\sigma^{k})]\to 0$ as $k\to\infty$, even in the noisy feedback setting. The subsequent theorems provide asymptotic last-iterate convergence results for this ideal scenario. In particular, we show that the slingshot strategy profile $\sigma^{k+1}$ converges to an equilibrium when using the following divergence functions as $G$: 1) Bregman divergence; 2) $\alpha$-divergence; 3) Rényi-divergence; 4) reverse KL divergence.

Theorem 5.8.

Assume that $G$ is a Bregman divergence $D_{\psi^{\prime}}$ for some strongly convex function $\psi^{\prime}$, and $\sigma^{k+1}=\pi^{\mu,\sigma^{k}}$ for $k\geq 0$. Then, there exists $\pi^{\ast}\in\Pi^{\ast}$ such that $\sigma^{k}\to\pi^{\ast}$ as $k\to\infty$.

Theorem 5.9.

Let us define $\mathcal{X}_{i}=\Delta^{d_{i}}$. Assume that $\sigma^{k+1}=\pi^{\mu,\sigma^{k}}$ for $k\geq 0$, and $G$ is one of the following divergences: 1) $\alpha$-divergence with $\alpha\in(0,1)$; 2) Rényi-divergence with $\alpha\in(0,1)$; 3) reverse KL divergence. If the initial slingshot strategy profile $\sigma^{0}$ is in the interior of $\mathcal{X}$, the sequence $\{\sigma^{k}\}_{k\geq 1}$ converges to the set of Nash equilibria $\Pi^{\ast}$ of the underlying game.

We remark that these results cover the algorithms in Examples 5.1, 5.2, and 5.3. Furthermore, we can incorporate our payoff perturbation techniques into FTRL, as detailed in Appendix C.

6 Experiments

This section empirically compares the representative instance of MD, namely Multiplicative Weights Update (MWU), and its optimistic version (OMWU) with our framework. Specifically, we consider the following three instances of APMD: (i) both $G$ and $D_{\psi}$ are the squared $\ell^{2}$-distance; (ii) both $G$ and $D_{\psi}$ are the KL divergence, which is also an instance of Reward transformed FTRL in Example 5.2 (note that if the slingshot strategy is fixed to a uniformly random strategy, this algorithm corresponds to Boltzmann Q-Learning in Example 5.1); (iii) the divergence function $G$ is the reverse KL divergence and the Bregman divergence $D_{\psi}$ is the KL divergence, which matches Mutant FTRL in Example 5.3.

We focus on two zero-sum polymatrix games: Three-Player Biased Rock-Paper-Scissors (3BRPS) and three-player random payoff games with $10$ and $50$ actions. For the 3BRPS game, each player simultaneously participates in two instances of the game in Table 2 in Appendix I with the two other players. For the random payoff games, each player $i$ simultaneously participates in two instances of the game with the two other players $j$. The payoff matrix for each instance is denoted as $M^{(i,j)}$, and each entry of $M^{(i,j)}$ is drawn independently from the uniform distribution on the interval $[-1,1]$.
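A sketch of how such a random payoff instance can be generated is given below. We assume, to keep the game zero-sum in the sense of Example 2.2, that only the matrices $M^{(i,j)}$ for $i<j$ are sampled and the remaining ones are set to $M^{(j,i)}=-(M^{(i,j)})^{\top}$; this construction is our reading of the setup, not stated explicitly in the paper, and the function names are illustrative.

```python
import numpy as np

def random_polymatrix_game(n_players=3, n_actions=10, rng=None):
    """Random zero-sum polymatrix game: dict of payoff matrices M[(i, j)]."""
    rng = np.random.default_rng(rng)
    M = {}
    for i in range(n_players):
        for j in range(i + 1, n_players):
            M[(i, j)] = rng.uniform(-1.0, 1.0, size=(n_actions, n_actions))
            M[(j, i)] = -M[(i, j)].T   # zero-sum: u_i(pi_i, pi_j) = -u_j(pi_j, pi_i)
    return M

def payoff_gradients(M, pi):
    """grad_{pi_i} v_i(pi) = sum_{j != i} M[(i, j)] @ pi_j."""
    n = len(pi)
    return [sum(M[(i, j)] @ pi[j] for j in range(n) if j != i) for i in range(n)]
```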

Figures 2 and 3 illustrate the logarithm of the gap function averaged over $100$ instances with different random seeds. For 3BRPS, the initial slingshot strategy profile $\pi^{0}$ is chosen uniformly at random in the interior of the strategy space $\mathcal{X}=\prod_{i=1}^{3}\Delta^{d_{i}}$ in each instance, while for the random payoff games, $\pi^{0}$ is set to $(1/d_{i})_{j\in[d_{i}]}$ for $i\in[3]$ in every instance.

First, Figure 2 depicts the case of full feedback. Unless otherwise specified, we use a constant learning rate $\eta=0.1$ and a perturbation strength $\mu=0.1$ for APMD. Further details and additional experiments can be found in Appendix I. Figure 2 shows that APMD outperforms MWU and OMWU in all three games. Notably, APMD exhibits the fastest convergence in terms of the gap function when using the squared $\ell^{2}$-distance as both $G$ and $D_{\psi}$. Next, Figure 3 depicts the case of noisy feedback. We assume that the noise vector $\xi_{i}^{t}$ is generated from the multivariate Gaussian distribution $\mathcal{N}(0,0.1^{2}\mathbf{I})$ in an i.i.d. manner. To account for the noise, we use a lower learning rate, $\eta=0.01$, than in the full feedback case. In OMWU, we use the noisy gradient vector $\widehat{\nabla}_{\pi_{i}}v_{i}(\pi_{i}^{t-1},\pi_{-i}^{t-1})$ at the previous step $t-1$ as the prediction vector for the current iteration $t$. We observe the same trends as with full feedback: while MWU and OMWU exhibit worse performance, APMD maintains its fast convergence, as predicted by the theoretical analysis.

7 Related Literature

Recent progress in achieving no-regret learning with full feedback has been driven by optimistic learning (Rakhlin & Sridharan, 2013a, b). Optimistic versions of well-known algorithms like Follow the Regularized Leader (Shalev-Shwartz & Singer, 2006) and Mirror Descent (Zhou et al., 2017; Hsieh et al., 2021) have been proposed to admit last-iterate convergence in a wide range of game settings. These optimistic algorithms have been successfully applied to various classes of games, including bilinear games (Daskalakis et al., 2018; Daskalakis & Panageas, 2019; Liang & Stokes, 2019; de Montbrun & Renault, 2022), cocoercive games (Lin et al., 2020), and saddle point problems (Daskalakis & Panageas, 2018; Mertikopoulos et al., 2019; Golowich et al., 2020b; Wei et al., 2021; Lei et al., 2021; Yoon & Ryu, 2021; Lee & Kim, 2021; Cevher et al., 2023). The advancements have provided solutions to monotone games and have established convergence rates (Golowich et al., 2020a; Cai et al., 2022a, b; Gorbunov et al., 2022; Cai & Zheng, 2023).

Learning with noisy feedback poses significant challenges, in contrast to the full feedback setting. In situations where feedback is imprecise or limited, algorithms must estimate action values at each iteration. There have been two trends in achieving last-iterate convergence: restricting the class of games and perturbing the payoff functions. On the one hand, particularly noticeable works lie in potential games (Cohen et al., 2017), normal-form games with strict Nash equilibria (Giannou et al., 2021b, a), and two-player zero-sum games (Abe et al., 2023). Noisy feedback has also been handled in games whose payoff functions are assumed to be strictly (or strongly) monotone (Bravo et al., 2018; Kannan & Shanbhag, 2019; Hsieh et al., 2019; Anagnostides & Panageas, 2022) or strictly variationally stable (Mertikopoulos & Zhou, 2019; Mertikopoulos et al., 2019, 2022; Azizian et al., 2021). Note that variationally stable games, often referred to in control theory, form a slightly broader class than monotone games. These studies require the payoff functions to be strictly or strongly convex. When these restrictions are not imposed, convergence is primarily guaranteed only in an asymptotic sense, and the rate is not quantified (Hsieh et al., 2020, 2022).

On the other hand, payoff-perturbed algorithms have recently regained attention for their ability to demonstrate convergence in unrestricted games when noise is present. As described in Section 1, payoff perturbation is a textbook technique (Facchinei & Pang, 2003) that has been extensively studied (Koshal et al., 2010, 2013; Yousefian et al., 2017; Tatarenko & Kamgarpour, 2019; Abe et al., 2023; Cen et al., 2021, 2023; Cai et al., 2023; Pattathil et al., 2023). It is known that carefully adjusting the magnitude of perturbation ensures convergence to a Nash equilibrium. This magnitude is computed as the product of a strongly convex penalty and a perturbation strength parameter. Liu et al. (2023) shrink the perturbation strength based on a predefined hyper-parameter and the gap function of the current strategy. Likewise, Koshal et al. (2010) and Tatarenko & Kamgarpour (2019) have identified somewhat complex conditions that the sequence of the perturbation strength parameters and learning rates should satisfy. Roughly speaking, as we have implied in Lemma 4.2, a smaller strength would require a lower learning rate. This potentially decelerates the convergence rate and complicates the task of finding an appropriate learning rate. For practicality, we have opted to keep the perturbation strength constant, independent of the iteration in APMD. Moreover, it must be emphasized that the existing literature has primarily provided asymptotic convergence results, while we have successfully provided the non-asymptotic convergence rate.

Finally, the idea of the slingshot strategy update was initiated by Perolat et al. (2021) and later extended by Abe et al. (2023). Part of our contribution lies in quantifying the convergence rate of this approach for the first time. We must also mention that Sokota et al. (2023) have proposed a very similar, but essentially different, update rule to ours. It adds an additional regularization term based on an anchoring strategy, which they call a magnetic strategy, meaning that it directly perturbs the (expected) payoff functions. In contrast, our APMD indirectly perturbs the payoff functions, i.e., it perturbs the gradient vector. Furthermore, we have established non-asymptotic convergence results toward a Nash equilibrium, while Sokota et al. (2023) have only shown convergence toward a quantal response equilibrium (McKelvey & Palfrey, 1995, 1998), which is equivalent only to an approximate equilibrium. Similar results have been obtained with the Boltzmann Q-learning dynamics (Tuyls et al., 2006) and penalty-regularized dynamics (Coucheney et al., 2015) in continuous-time settings (Leslie & Collins, 2005; Hussain et al., 2023).

8 Conclusion

This paper proposes a novel variant of MD that achieves last-iterate convergence even when noise is present, by adaptively adjusting the magnitude of the perturbation. This research could lead to several intriguing future studies, such as finding the best perturbation strength for the optimal convergence rate and achieving convergence with more limited feedback, for example, bandit feedback (Bravo et al., 2018; Drusvyatskiy et al., 2022).

Acknowledgments

Kaito Ariu is supported by JSPS KAKENHI Grant Number 23K19986. Atsushi Iwasaki is supported by JSPS KAKENHI Grant Numbers 21H04890 and 23K17547.

Impact Statement

This paper focuses on discussing the problem of computing equilibria in games. There are potential social implications associated with our work, but none that we believe need to be particularly emphasized here.

References

  • Abe et al. (2022) Abe, K., Sakamoto, M., and Iwasaki, A. Mutation-driven follow the regularized leader for last-iterate convergence in zero-sum games. In UAI, pp.  1–10, 2022.
  • Abe et al. (2023) Abe, K., Ariu, K., Sakamoto, M., Toyoshima, K., and Iwasaki, A. Last-iterate convergence with full and noisy feedback in two-player zero-sum games. In AISTATS, pp.  7999–8028, 2023.
  • Anagnostides & Panageas (2022) Anagnostides, I. and Panageas, I. Frequency-domain representation of first-order methods: A simple and robust framework of analysis. In SOSA, pp.  131–160, 2022.
  • Azizian et al. (2021) Azizian, W., Iutzeler, F., Malick, J., and Mertikopoulos, P. The last-iterate convergence rate of optimistic mirror descent in stochastic variational inequalities. In COLT, pp.  326–358, 2021.
  • Bailey & Piliouras (2018) Bailey, J. P. and Piliouras, G. Multiplicative weights update in zero-sum games. In Economics and Computation, pp.  321–338, 2018.
  • Beck & Teboulle (2003) Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
  • Bloembergen et al. (2015) Bloembergen, D., Tuyls, K., Hennes, D., and Kaisers, M. Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research, 53:659–697, 2015.
  • Bravo et al. (2018) Bravo, M., Leslie, D., and Mertikopoulos, P. Bandit learning in concave N-person games. In NeurIPS, pp.  5666–5676, 2018.
  • Cai & Daskalakis (2011) Cai, Y. and Daskalakis, C. On minmax theorems for multiplayer games. In SODA, pp.  217–234, 2011.
  • Cai & Zheng (2023) Cai, Y. and Zheng, W. Doubly optimal no-regret learning in monotone games. arXiv preprint arXiv:2301.13120, 2023.
  • Cai et al. (2016) Cai, Y., Candogan, O., Daskalakis, C., and Papadimitriou, C. Zero-sum polymatrix games: A generalization of minmax. Mathematics of Operations Research, 41(2):648–655, 2016.
  • Cai et al. (2022a) Cai, Y., Oikonomou, A., and Zheng, W. Finite-time last-iterate convergence for learning in multi-player games. In NeurIPS, pp.  33904–33919, 2022a.
  • Cai et al. (2022b) Cai, Y., Oikonomou, A., and Zheng, W. Tight last-iterate convergence of the extragradient method for constrained monotone variational inequalities. arXiv preprint arXiv:2204.09228, 2022b.
  • Cai et al. (2023) Cai, Y., Luo, H., Wei, C.-Y., and Zheng, W. Uncoupled and convergent learning in two-player zero-sum Markov games. arXiv preprint arXiv:2303.02738, 2023.
  • Cen et al. (2021) Cen, S., Wei, Y., and Chi, Y. Fast policy extragradient methods for competitive games with entropy regularization. In NeurIPS, pp.  27952–27964, 2021.
  • Cen et al. (2023) Cen, S., Chi, Y., Du, S. S., and Xiao, L. Faster last-iterate convergence of policy optimization in zero-sum Markov games. In ICLR, 2023.
  • Cevher et al. (2023) Cevher, V., Piliouras, G., Sim, R., and Skoulakis, S. Min-max optimization made simple: Approximating the proximal point method via contraction maps. In Symposium on Simplicity in Algorithms (SOSA), pp. 192–206, 2023.
  • Cohen et al. (2017) Cohen, J., Héliou, A., and Mertikopoulos, P. Learning with bandit feedback in potential games. In NeurIPS, pp.  6372–6381, 2017.
  • Coucheney et al. (2015) Coucheney, P., Gaujal, B., and Mertikopoulos, P. Penalty-regulated dynamics and robust learning procedures in games. Mathematics of Operations Research, 40(3):611–633, 2015.
  • Daskalakis & Panageas (2018) Daskalakis, C. and Panageas, I. The limit points of (optimistic) gradient descent in min-max optimization. In NeurIPS, pp.  9256–9266, 2018.
  • Daskalakis & Panageas (2019) Daskalakis, C. and Panageas, I. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In ITCS, pp.  27:1–27:18, 2019.
  • Daskalakis et al. (2018) Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. In ICLR, 2018.
  • de Montbrun & Renault (2022) de Montbrun, É. and Renault, J. Convergence of optimistic gradient descent ascent in bilinear games. arXiv preprint arXiv:2208.03085, 2022.
  • Debreu (1952) Debreu, G. A social equilibrium existence theorem. Proceedings of the National Academy of Sciences, 38(10):886–893, 1952.
  • Drusvyatskiy et al. (2022) Drusvyatskiy, D., Fazel, M., and Ratliff, L. J. Improved rates for derivative free gradient play in strongly monotone games. In CDC, pp.  3403–3408. IEEE, 2022.
  • Facchinei & Pang (2003) Facchinei, F. and Pang, J.-S. Finite-dimensional variational inequalities and complementarity problems. Springer, 2003.
  • Giannou et al. (2021a) Giannou, A., Vlatakis-Gkaragkounis, E. V., and Mertikopoulos, P. Survival of the strictest: Stable and unstable equilibria under regularized learning with partial information. In COLT, pp.  2147–2148, 2021a.
  • Giannou et al. (2021b) Giannou, A., Vlatakis-Gkaragkounis, E.-V., and Mertikopoulos, P. On the rate of convergence of regularized learning in games: From bandits and uncertainty to optimism and beyond. In NeurIPS, pp.  22655–22666, 2021b.
  • Golowich et al. (2020a) Golowich, N., Pattathil, S., and Daskalakis, C. Tight last-iterate convergence rates for no-regret learning in multi-player games. In NeurIPS, pp.  20766–20778, 2020a.
  • Golowich et al. (2020b) Golowich, N., Pattathil, S., Daskalakis, C., and Ozdaglar, A. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. In COLT, pp.  1758–1784, 2020b.
  • Gorbunov et al. (2022) Gorbunov, E., Taylor, A., and Gidel, G. Last-iterate convergence of optimistic gradient method for monotone variational inequalities. In NeurIPS, pp.  21858–21870, 2022.
  • Halpern (1967) Halpern, B. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73(6):957 – 961, 1967.
  • Hart & Mas-Colell (2000) Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
  • Hsieh et al. (2019) Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos, P. On the convergence of single-call stochastic extra-gradient methods. In NeurIPS, pp.  6938–6948, 2019.
  • Hsieh et al. (2020) Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos, P. Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. Advances in Neural Information Processing Systems, 33:16223–16234, 2020.
  • Hsieh et al. (2021) Hsieh, Y.-G., Antonakopoulos, K., and Mertikopoulos, P. Adaptive learning in continuous games: Optimal regret bounds and convergence to Nash equilibrium. In COLT, pp.  2388–2422, 2021.
  • Hsieh et al. (2022) Hsieh, Y.-G., Antonakopoulos, K., Cevher, V., and Mertikopoulos, P. No-regret learning in games with noisy feedback: Faster rates and adaptivity via learning rate separation. In NeurIPS, pp.  6544–6556, 2022.
  • Hussain et al. (2023) Hussain, A. A., Belardinelli, F., and Piliouras, G. Asymptotic convergence and performance of multi-agent Q-learning dynamics. arXiv preprint arXiv:2301.09619, 2023.
  • Kannan & Shanbhag (2019) Kannan, A. and Shanbhag, U. V. Optimal stochastic extragradient schemes for pseudomonotone stochastic variational inequality problems and their variants. Computational Optimization and Applications, 74(3):779–820, 2019.
  • Koshal et al. (2010) Koshal, J., Nedić, A., and Shanbhag, U. V. Single timescale regularized stochastic approximation schemes for monotone nash games under uncertainty. In CDC, pp.  231–236. IEEE, 2010.
  • Koshal et al. (2013) Koshal, J., Nedic, A., and Shanbhag, U. V. Regularized iterative stochastic approximation methods for stochastic variational inequality problems. IEEE Transactions on Automatic Control, 58(3):594–609, 2013.
  • Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
  • Lee & Kim (2021) Lee, S. and Kim, D. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. In NeurIPS, pp.  22588–22600, 2021.
  • Lei et al. (2021) Lei, Q., Nagarajan, S. G., Panageas, I., et al. Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes. In AISTATS, pp.  1441–1449, 2021.
  • Leslie & Collins (2005) Leslie, D. S. and Collins, E. J. Individual q-learning in normal form games. SIAM Journal on Control and Optimization, 44(2):495–514, 2005.
  • Liang & Stokes (2019) Liang, T. and Stokes, J. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In AISTATS, pp.  907–915, 2019.
  • Lin et al. (2020) Lin, T., Zhou, Z., Mertikopoulos, P., and Jordan, M. Finite-time last-iterate convergence for multi-agent learning in games. In ICML, pp.  6161–6171, 2020.
  • Liu et al. (2023) Liu, M., Ozdaglar, A., Yu, T., and Zhang, K. The power of regularization in solving extensive-form games. In ICLR, 2023.
  • McKelvey & Palfrey (1995) McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6–38, 1995.
  • McKelvey & Palfrey (1998) McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for extensive form games. Experimental economics, 1:9–41, 1998.
  • Mertikopoulos & Zhou (2019) Mertikopoulos, P. and Zhou, Z. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1):465–507, 2019.
  • Mertikopoulos et al. (2018) Mertikopoulos, P., Papadimitriou, C., and Piliouras, G. Cycles in adversarial regularized learning. In SODA, pp.  2703–2717, 2018.
  • Mertikopoulos et al. (2019) Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR, 2019.
  • Mertikopoulos et al. (2022) Mertikopoulos, P., Hsieh, Y.-P., and Cevher, V. Learning in games from a stochastic approximation viewpoint. arXiv preprint arXiv:2206.03922, 2022.
  • Nash (1951) Nash, J. Non-cooperative games. Annals of mathematics, pp.  286–295, 1951.
  • Nedić & Lee (2014) Nedić, A. and Lee, S. On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM Journal on Optimization, 24(1):84–107, 2014.
  • Nemirovski et al. (2009) Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  • Nemirovskij & Yudin (1983) Nemirovskij, A. S. and Yudin, D. B. Problem complexity and method efficiency in optimization. Wiley, 1983.
  • Pattathil et al. (2023) Pattathil, S., Zhang, K., and Ozdaglar, A. Symmetric (optimistic) natural policy gradient for multi-agent learning with parameter convergence. In AISTATS, pp.  5641–5685, 2023.
  • Perolat et al. (2021) Perolat, J., Munos, R., Lespiau, J.-B., Omidshafiei, S., Rowland, M., Ortega, P., Burch, N., Anthony, T., Balduzzi, D., De Vylder, B., et al. From Poincaré recurrence to convergence in imperfect information games: Finding equilibrium via regularization. In ICML, pp.  8525–8535, 2021.
  • Rakhlin & Sridharan (2013a) Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In COLT, pp.  993–1019, 2013a.
  • Rakhlin & Sridharan (2013b) Rakhlin, S. and Sridharan, K. Optimization, learning, and games with predictable sequences. In NeurIPS, pp.  3066–3074, 2013b.
  • Rockafellar (1997) Rockafellar, R. T. Convex analysis, volume 11. Princeton university press, 1997.
  • Shalev-Shwartz & Singer (2006) Shalev-Shwartz, S. and Singer, Y. Convex repeated games and Fenchel duality. Advances in Neural Information Processing Systems, 19, 2006.
  • Sokota et al. (2023) Sokota, S., D’Orazio, R., Kolter, J. Z., Loizou, N., Lanctot, M., Mitliagkas, I., Brown, N., and Kroer, C. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In ICLR, 2023.
  • Tammelin (2014) Tammelin, O. Solving large imperfect information games using CFR+. arXiv preprint arXiv:1407.5042, 2014.
  • Tatarenko & Kamgarpour (2019) Tatarenko, T. and Kamgarpour, M. Learning Nash equilibria in monotone games. In CDC, pp.  3104–3109. IEEE, 2019.
  • Tuyls et al. (2006) Tuyls, K., Hoen, P. J., and Vanschoenwinkel, B. An evolutionary dynamical analysis of multi-agent learning in iterated games. Autonomous Agents and Multi-Agent Systems, 12(1):115–153, 2006.
  • Wei et al. (2021) Wei, C.-Y., Lee, C.-W., Zhang, M., and Luo, H. Linear last-iterate convergence in constrained saddle-point optimization. In ICLR, 2021.
  • Yoon & Ryu (2021) Yoon, T. and Ryu, E. K. Accelerated algorithms for smooth convex-concave minimax problems with 𝒪(1/k2)\mathcal{O}(1/k^{2}) rate on squared gradient norm. In ICML, pp.  12098–12109, 2021.
  • Yousefian et al. (2017) Yousefian, F., Nedić, A., and Shanbhag, U. V. On smoothing, regularization, and averaging in stochastic approximation methods for stochastic variational inequality problems. Mathematical Programming, 165:391–431, 2017.
  • Zhou et al. (2017) Zhou, Z., Mertikopoulos, P., Moustakas, A. L., Bambos, N., and Glynn, P. Mirror descent learning in continuous games. In CDC, pp.  5776–5783. IEEE, 2017.

Appendix A Notations

In this section, we summarize in Table 1 the notation used throughout the paper; a small numerical illustration of the gap function follows the table.

Table 1: Notations
Symbol Description
NN Number of players
𝒳i\mathcal{X}_{i} Strategy space for player ii
𝒳\mathcal{X} Joint strategy space: 𝒳=i=1N𝒳i\mathcal{X}=\prod_{i=1}^{N}\mathcal{X}_{i}
viv_{i} Payoff function for player ii
πi\pi_{i} Strategy for player ii
π\pi Strategy profile: π=(πi)i[N]\pi=(\pi_{i})_{i\in[N]}
ξit\xi_{i}^{t} Noise vector for player ii at iteration tt
π\pi^{\ast} Nash equilibrium
Π\Pi^{\ast} Set of Nash equilibria
GAP(π)\mathrm{GAP}(\pi) Gap function of π\pi: GAP(π)=maxπ~𝒳i=1Nπivi(π),π~iπi\mathrm{GAP}(\pi)=\max_{\tilde{\pi}\in\mathcal{X}}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi),\tilde{\pi}_{i}-\pi_{i}\rangle
Δd\Delta^{d} dd-dimensional probability simplex: Δd={p[0,1]d|j=1dpj=1}\Delta^{d}=\{p\in[0,1]^{d}~{}|~{}\sum_{j=1}^{d}p_{j}=1\}
diam(𝒳)\mathrm{diam}(\mathcal{X}) Diameter of 𝒳\mathcal{X}: diam(𝒳)=supπ,π𝒳ππ\mathrm{diam}(\mathcal{X})=\sup_{\pi,\pi^{\prime}\in\mathcal{X}}\|\pi-\pi^{\prime}\|
KL(,)\mathrm{KL}(\cdot,\cdot) Kullback-Leibler divergence
Dψ(,)D_{\psi}(\cdot,\cdot) Bregman divergence associated with ψ\psi
πivi\nabla_{\pi_{i}}v_{i} Gradient of viv_{i} with respect to πi\pi_{i}
ηt\eta_{t} Learning rate at iteration tt
μ\mu Perturbation strength
σ\sigma Slingshot strategy profile
G(,)G(\cdot,\cdot) Divergence function for payoff perturbation
πiG\nabla_{\pi_{i}}G Gradient of GG with respect to first argument
TσT_{\sigma} Update interval for the slingshot strategy
KK Total number of the slingshot strategy updates
πμ,σ\pi^{\mu,\sigma} Stationary point satisfying (4) for given μ\mu and σ\sigma
πt\pi^{t} Strategy profile at iteration tt
σk\sigma^{k} Slingshot strategy profile after kk updates
LL Smoothness parameter of (vi)i[N](v_{i})_{i\in[N]}
ρ\rho Strongly convex parameter of ψ\psi
β\beta Smoothness parameter of G(,σi)G(\cdot,\sigma_{i}) relative to ψ\psi
γ\gamma Strongly convex parameter of G(,σi)G(\cdot,\sigma_{i}) relative to ψ\psi
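
For concreteness, when each strategy space is a probability simplex, the maximum in the definition of GAP decomposes across players, so the gap function in Table 1 can be evaluated in closed form. The following minimal Python sketch illustrates this for a two-player zero-sum matrix game with v_1(pi) = pi_1^T A pi_2 and v_2 = -v_1; the payoff matrix and the strategy profiles below are chosen purely for illustration and are not part of the analysis.

import numpy as np

def gap_zero_sum(A, x, y):
    # GAP(pi) for the zero-sum game v_1 = x^T A y, v_2 = -v_1, with simplex strategy spaces.
    # The maximum over pi~ decomposes across players into per-player best responses.
    grad_x = A @ y          # gradient of v_1 with respect to x
    grad_y = -A.T @ x       # gradient of v_2 with respect to y
    return (grad_x.max() - grad_x @ x) + (grad_y.max() - grad_y @ y)

# Matching pennies: the uniform profile is the unique Nash equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
print(gap_zero_sum(A, np.array([0.5, 0.5]), np.array([0.5, 0.5])))  # 0.0 at the equilibrium
print(gap_zero_sum(A, np.array([1.0, 0.0]), np.array([0.5, 0.5])))  # 1.0 away from it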

Appendix B Formal Theorems and Lemmas

B.1 Full Feedback Setting

Theorem B.1 (Formal version of Theorem 4.1).

Assume that i=1Nπivi(π)2ζ\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi)\|^{2}}\leq\zeta for any π𝒳\pi\in\mathcal{X}. If we use the constant learning rate ηt=η(0,2μρ23μ2ρ2+8L2)\eta_{t}=\eta\in(0,\frac{2\mu\rho^{2}}{3\mu^{2}\rho^{2}+8L^{2}}), and set DψD_{\psi} and GG as squared 2\ell^{2}-distance Dψ(πi,πi)=G(πi,πi)=12πiπi2D_{\psi}(\pi_{i},\pi_{i}^{\prime})=G(\pi_{i},\pi_{i}^{\prime})=\frac{1}{2}\|\pi_{i}-\pi_{i}^{\prime}\|^{2}, and set Tσ=cmax(6ln2ln(2ημ)lnT+2ln64ln2ln(2ημ),1)T_{\sigma}=c\cdot\max(\frac{6}{\ln 2-\ln(2-\eta\mu)}\ln T+\frac{2\ln 64}{\ln 2-\ln(2-\eta\mu)},1) for some constant c1c\geq 1, then the strategy profile πT\pi^{T} updated by APMD satisfies:

GAP(πT)\displaystyle\mathrm{GAP}(\pi^{T})
22c((μ+L)diam(𝒳)+ζ)(6ln2ln(2ημ)lnT+2ln64ln2ln(2ημ)+1)Tdiam(𝒳)(8diam(𝒳)+ζμ).\displaystyle\leq\frac{2\sqrt{2}c\left((\mu+L)\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\left(\frac{6}{\ln 2-\ln(2-\eta\mu)}\ln T+\frac{2\ln 64}{\ln 2-\ln(2-\eta\mu)}+1\right)}{\sqrt{T}}\sqrt{\mathrm{diam}(\mathcal{X})\left(8\cdot\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)}.
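
To make the setup of Theorem B.1 concrete, the following Python sketch runs APMD with DψD_{\psi} and GG set to the squared ℓ2-distance on a zero-sum matrix game. In this case the mirror step reduces to projected gradient ascent on the perturbed gradient, and the slingshot strategy is reset to the current iterate every TσT_{\sigma} iterations. This is only an illustrative sketch: the game (rock–paper–scissors), the step size, μ, and TσT_{\sigma} are placeholder choices and are not derived from the theorem's conditions.

import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex (sorting-based algorithm).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def apmd_full_feedback(A, T=20_000, T_sigma=200, eta=0.04, mu=0.5, seed=0):
    # APMD with D_psi = G = squared l2-distance on the zero-sum game v_1 = x^T A y.
    # The mirror step becomes projected gradient ascent on the perturbed gradient, and
    # the slingshot (sigma_x, sigma_y) is reset to the current iterate every T_sigma steps.
    rng = np.random.default_rng(seed)
    n, m = A.shape
    x = project_simplex(rng.random(n))
    y = project_simplex(rng.random(m))
    sigma_x, sigma_y = x.copy(), y.copy()
    for t in range(T):
        if t > 0 and t % T_sigma == 0:       # slingshot update
            sigma_x, sigma_y = x.copy(), y.copy()
        gx = A @ y - mu * (x - sigma_x)      # perturbed gradient for player 1
        gy = -A.T @ x - mu * (y - sigma_y)   # perturbed gradient for player 2
        x = project_simplex(x + eta * gx)
        y = project_simplex(y + eta * gy)
    return x, y

A = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])  # rock-paper-scissors
x, y = apmd_full_feedback(A)
print(np.round(x, 3), np.round(y, 3))  # both iterates should approach the uniform equilibrium
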
Lemma B.2 (Formal version of Lemma 4.2).

Assume that DψD_{\psi} and GG are set as the squared 2\ell^{2}-distance. If we use the constant learning rate ηt=η(0,2μρ23μ2ρ2+8L2)\eta_{t}=\eta\in(0,\frac{2\mu\rho^{2}}{3\mu^{2}\rho^{2}+8L^{2}}), the k+1k+1-th slingshot strategy profile σk+1\sigma^{k+1} of APMD under the full feedback setting satisfies that:

πμ,σkσk+12πμ,σkσk2(1ημ2)Tσ.\displaystyle\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|^{2}\leq\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\left(1-\frac{\eta\mu}{2}\right)^{T_{\sigma}}.
Lemma B.3 (Formal version of Lemma 4.3).

Assume that i=1Nπivi(π)2ζ\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi)\|^{2}}\leq\zeta for any π𝒳\pi\in\mathcal{X}. If GG is set as the squared 2\ell^{2}-distance, the gap function for σk+1\sigma^{k+1} of APMD satisfies for k0k\geq 0:

GAP(σk+1)μdiam(𝒳)πμ,σkσk+(Ldiam(𝒳)+ζ)πμ,σkσk+1.\displaystyle\mathrm{GAP}(\sigma^{k+1})\leq\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|.
Lemma B.4 (Formal version of Lemma 4.4).

Assume that Tσmax(6ln2ln(2ημ)lnT+2ln64ln2ln(2ημ),1)T_{\sigma}\geq\max(\frac{6}{\ln 2-\ln(2-\eta\mu)}\ln T+\frac{2\ln 64}{\ln 2-\ln(2-\eta\mu)},1). In the same setup as Theorem 4.1, the K1K-1-th slingshot strategy profile σK1\sigma^{K-1} of APMD satisfies:

πμ,σK1σK122Kdiam(𝒳)(8diam(𝒳)+ζμ).\displaystyle\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|\leq\frac{2\sqrt{2}}{\sqrt{K}}\sqrt{\mathrm{diam}(\mathcal{X})\left(8\cdot\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)}.

B.2 Noisy Feedback Setting

For the noisy feedback setting, we assume that ξitdi\xi^{t}_{i}\in\mathbb{R}^{d_{i}} is a zero-mean independent random vector with bounded variance.

Assumption B.5.

For all t1t\geq 1 and i[N]i\in[N], the noise vector ξit\xi^{t}_{i} satisfies the following properties: (a) Zero-mean: 𝔼[ξit|t]=(0,,0)\mathbb{E}[\xi^{t}_{i}|\mathcal{F}_{t}]=(0,\cdots,0)^{\top}; (b) Bounded variance: 𝔼[ξit2|t]C2\mathbb{E}[\|\xi^{t}_{i}\|^{2}|\mathcal{F}_{t}]\leq C^{2} with some constant C>0C>0.

This is a standard assumption in learning in games with noisy feedback (Mertikopoulos & Zhou, 2019; Hsieh et al., 2019) and stochastic optimization (Nemirovski et al., 2009; Nedić & Lee, 2014). Under Assumption B.5, we obtain the following convergence results for APMD in the noisy feedback setting.

Theorem B.6 (Formal version of Theorem 4.5).

Let θ=3μ2ρ2+8L22μρ2\theta=\frac{3\mu^{2}\rho^{2}+8L^{2}}{2\mu\rho^{2}} and κ=μ2\kappa=\frac{\mu}{2}. Suppose that Assumption B.5 holds and i=1Nπivi(π)2ζ\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi)\|^{2}}\leq\zeta for any π𝒳\pi\in\mathcal{X}. We also assume that DψD_{\psi} and GG are set as squared 2\ell^{2}-distance Dψ(πi,πi)=G(πi,πi)=12πiπi2D_{\psi}(\pi_{i},\pi_{i}^{\prime})=G(\pi_{i},\pi_{i}^{\prime})=\frac{1}{2}\|\pi_{i}-\pi_{i}^{\prime}\|^{2}, and Tσ=cmax(T4/5+2,3)T_{\sigma}=c\cdot\max(T^{4/5}+2,3) for some constant c1c\geq 1. If we choose the learning rate sequence of the form ηt=1/(κ(tTσt/Tσ)+2θ)\eta_{t}=1/(\kappa(t-T_{\sigma}\cdot\lfloor t/T_{\sigma}\rfloor)+2\theta), then the strategy profile πT\pi^{T} updated by APMD satisfies:

𝔼[GAP(πT)]6cμdiam(𝒳)2T1/10\displaystyle\mathbb{E}\left[\mathrm{GAP}(\pi^{T})\right]\leq\frac{\sqrt{6c}\mu\cdot\mathrm{diam}(\mathcal{X})^{2}}{T^{1/10}}
+Ldiam(𝒳)+ζ+μdiam(𝒳)18c(diam(𝒳)+ζμ+1)T1/10ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρκ.\displaystyle+\frac{L\cdot\mathrm{diam}(\mathcal{X})+\zeta+\mu\cdot\mathrm{diam}(\mathcal{X})\sqrt{18c\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}+1\right)}}{T^{1/10}}\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho\kappa}}.
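
The learning rate in Theorem B.6 restarts whenever the slingshot strategy is updated, since t − Tσ⌊t/Tσ⌋ counts the iterations elapsed since the last update. The following Python sketch computes ηt in the squared ℓ2 case (ρ = 1) and shows one way to simulate noisy feedback consistent with Assumption B.5 using Gaussian noise; the smoothness constant L, μ, Tσ, and the noise scale are placeholder values for illustration only.

import numpy as np

def apmd_learning_rate(t, T_sigma, mu, L, rho=1.0):
    # eta_t = 1 / (kappa * (t - T_sigma * floor(t / T_sigma)) + 2 * theta) with
    # theta = (3 mu^2 rho^2 + 8 L^2) / (2 mu rho^2) and kappa = mu / 2.
    theta = (3 * mu**2 * rho**2 + 8 * L**2) / (2 * mu * rho**2)
    kappa = mu / 2
    return 1.0 / (kappa * (t - T_sigma * (t // T_sigma)) + 2 * theta)

def noisy_gradient(grad, rng, scale=0.1):
    # Zero-mean Gaussian noise: Assumption B.5(a) holds exactly, and B.5(b) holds
    # with C^2 = scale^2 * grad.size.
    return grad + rng.normal(0.0, scale, size=grad.shape)

mu, L, T_sigma = 0.1, 1.0, 100
print([round(apmd_learning_rate(t, T_sigma, mu, L), 5) for t in (0, 1, 99, 100, 101)])
# The step size is largest right after each slingshot update (t = 0, 100, ...) and
# decays until the next one.
rng = np.random.default_rng(0)
print(noisy_gradient(np.zeros(3), rng))  # a single draw of the noise vector xi_i^t
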
Lemma B.7 (Formal version of Lemma 4.6).

Let θ=3μ2ρ2+8L22μρ2\theta=\frac{3\mu^{2}\rho^{2}+8L^{2}}{2\mu\rho^{2}} and κ=μ2\kappa=\frac{\mu}{2}. Suppose that Assumption B.5 holds, and both DψD_{\psi} and GG are defined as the squared 2\ell^{2}-distance. Under the noisy feedback setting, if we use the learning rate sequence of the form ηt=1/(κ(tTσt/Tσ)+2θ)\eta_{t}=1/(\kappa(t-T_{\sigma}\cdot\lfloor t/T_{\sigma}\rfloor)+2\theta), the k+1k+1-th slingshot strategy profile σk+1\sigma^{k+1} of APMD for each k0k\geq 0 satisfies that:

𝔼[πμ,σkσk+12]2θκκ(Tσ1)+2θ𝔼[πμ,σkσk2]+NC2ρ(κ(Tσ1)+2θ)(1κln(κ2θ(Tσ1)+1)+12θ).\displaystyle\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|^{2}\right]\leq\frac{2\theta-\kappa}{\kappa(T_{\sigma}-1)+2\theta}\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\right]+\frac{NC^{2}}{\rho(\kappa(T_{\sigma}-1)+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-1)+1\right)+\frac{1}{2\theta}\right).
Lemma B.8 (Formal version of Lemma 4.7).

Assume that Tσmax(T4/5+2,3)T_{\sigma}\geq\max(T^{4/5}+2,3). In the same setup as Theorem 4.5, the K1K-1-th slingshot strategy profile σK1\sigma^{K-1} satisfies:

𝔼[πμ,σK1σK1]\displaystyle\mathbb{E}\left[\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|\right]
πσ02+3ρκ(diam(𝒳)+ζμ+1)(ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ))K.\displaystyle\leq\sqrt{\frac{\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{3}{\rho\kappa}\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}+1\right)\left(\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)\right)}{K}}.

B.3 Convergence Results with General GG and DψD_{\psi}

Lemma B.9 (Formal version of Lemma 5.6).

Suppose that Assumptions 5.4 with β,γ(0,)\beta,\gamma\in(0,\infty) and 5.5 hold. If we use the constant learning rate ηt=η(0,2μγρ2μ2γρ2(γ+2β)+8L2)\eta_{t}=\eta\in(0,\frac{2\mu\gamma\rho^{2}}{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}}), the k+1k+1-th slingshot strategy profile σk+1\sigma^{k+1} of APMD under the full feedback setting satisfies that:

Dψ(πμ,σk,σk+1)Dψ(πμ,σk,σk)(1ημγ2)Tσ.\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\sigma^{k+1})\leq D_{\psi}(\pi^{\mu,\sigma^{k}},\sigma^{k})\left(1-\frac{\eta\mu\gamma}{2}\right)^{T_{\sigma}}.
Lemma B.10 (Formal version of Lemma 5.7).

Let θ=μ2γρ2(γ+2β)+8L22μγρ2\theta=\frac{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}}{2\mu\gamma\rho^{2}} and κ=μγ2\kappa=\frac{\mu\gamma}{2}. Suppose that Assumptions 5.4, 5.5, and B.5 hold, and the learning rate sequence of the form ηt=1/(κ(tTσt/Tσ)+2θ)\eta_{t}=1/(\kappa(t-T_{\sigma}\cdot\lfloor t/T_{\sigma}\rfloor)+2\theta) is used. Then, the k+1k+1-th slingshot strategy profile σk+1\sigma^{k+1} of APMD under the noisy feedback setting satisfies that:

𝔼[Dψ(πμ,σk,σk+1)]2θκκ(Tσ1)+2θ𝔼[Dψ(πμ,σk,σk)]+NC2ρ(κ(Tσ1)+2θ)(1κln(κ2θ(Tσ1)+1)+12θ).\displaystyle\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\sigma^{k+1})]\leq\frac{2\theta-\kappa}{\kappa(T_{\sigma}-1)+2\theta}\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\sigma^{k})]+\frac{NC^{2}}{\rho(\kappa(T_{\sigma}-1)+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-1)+1\right)+\frac{1}{2\theta}\right).

Appendix C Extension to Follow the Regularized Leader

In Sections 4 and 5, we introduced and analyzed APMD, which extends the standard MD approach. The Follow the Regularized Leader (FTRL) approach can be extended in the same manner. In this section, we present Adaptively Perturbed Follow the Regularized Leader (APFTRL), which incorporates the perturbation term μG(,σi)\mu G(\cdot,\sigma_{i}) into the conventional FTRL update:

πit+1=argmaxx𝒳i{s=0tηs^πivi(πs)μπiG(πis,σi),xψ(x)}.\displaystyle\pi_{i}^{t+1}=\mathop{\rm arg~{}max}\limits_{x\in\mathcal{X}_{i}}\left\{\sum_{s=0}^{t}\eta_{s}\left\langle\widehat{\nabla}_{\pi_{i}}v_{i}(\pi^{s})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{s},\sigma_{i}),x\right\rangle-\psi(x)\right\}.
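
As an informal illustration of this update: when the strategy space is the probability simplex, ψ is the negative entropy, and G is the Kullback–Leibler divergence from the slingshot strategy, the arg max over the accumulated perturbed gradients has a closed-form softmax solution, so APFTRL becomes a perturbed multiplicative-weights-style update. The Python sketch below implements this special case on rock–paper–scissors; the constants are placeholders chosen only for illustration.

import numpy as np

def softmax(w):
    z = w - w.max()
    e = np.exp(z)
    return e / e.sum()

def apftrl_zero_sum(A, T=10_000, eta=0.1, mu=0.1, T_sigma=500):
    # APFTRL on the zero-sum game v_1 = x^T A y with psi = negative entropy and
    # G = KL(. , sigma_i): the arg max over accumulated perturbed gradients is a softmax,
    # and the slingshot is reset to the current iterate every T_sigma iterations.
    n, m = A.shape
    x, y = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    sigma_x, sigma_y = x.copy(), y.copy()
    wx, wy = np.zeros(n), np.zeros(m)        # accumulated perturbed gradients
    for t in range(T):
        if t > 0 and t % T_sigma == 0:        # slingshot update
            sigma_x, sigma_y = x.copy(), y.copy()
        gx = A @ y - mu * (np.log(x) - np.log(sigma_x))    # grad of KL(x, sigma_x); the +1 cancels in the softmax
        gy = -A.T @ x - mu * (np.log(y) - np.log(sigma_y))
        wx += eta * gx
        wy += eta * gy
        x, y = softmax(wx), softmax(wy)
    return x, y

A = np.array([[0.0, -1.0, 1.0], [1.0, 0.0, -1.0], [-1.0, 1.0, 0.0]])  # rock-paper-scissors
x, y = apftrl_zero_sum(A)
print(np.round(x, 3), np.round(y, 3))  # should approach the uniform equilibrium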

In this section, we assume that 𝒳i\mathcal{X}_{i} is an affine subset, meaning that there exist a matrix Aki×diA\in\mathbb{R}^{k_{i}\times d_{i}} and a vector bkib\in\mathbb{R}^{k_{i}} such that Aπi=bA\pi_{i}=b for all πi𝒳i\pi_{i}\in\mathcal{X}_{i}. We also assume that ψ\psi is a Legendre function, as described in (Rockafellar, 1997; Lattimore & Szepesvári, 2020), and that πt\pi^{t} is well-defined over all iterations:

Assumption C.1.

πt\pi^{t}, updated by APFTRL, satisfies the condition πtint(domψ)\pi^{t}\in\mathrm{int}(\mathrm{dom}~{}\psi) for every t{0,1,,T}t\in\{0,1,\cdots,T\}.

Then, by proving the following lemmas, we can show that APFTRL enjoys last-iterate convergence of πT\pi^{T} in both the full and noisy feedback settings:

Lemma C.2.

Suppose that Assumptions 5.4 with β,γ(0,)\beta,\gamma\in(0,\infty) and C.1 hold. If we use the constant learning rate ηt=η(0,2μγρ2μ2γρ2(γ+2β)+8L2)\eta_{t}=\eta\in(0,\frac{2\mu\gamma\rho^{2}}{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}}), the k+1k+1-th slingshot strategy profile σk+1\sigma^{k+1} of APFTRL under the full feedback setting satisfies that:

Dψ(πμ,σk,σk+1)Dψ(πμ,σk,σk)(1ημγ2)Tσ.\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\sigma^{k+1})\leq D_{\psi}(\pi^{\mu,\sigma^{k}},\sigma^{k})\left(1-\frac{\eta\mu\gamma}{2}\right)^{T_{\sigma}}.
Lemma C.3.

Let θ=μ2γρ2(γ+2β)+8L22μγρ2\theta=\frac{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}}{2\mu\gamma\rho^{2}} and κ=μγ2\kappa=\frac{\mu\gamma}{2}. Suppose that Assumptions 5.4, B.5, and C.1 hold, and the learning rate sequence of the form ηt=1/(κ(tTσt/Tσ)+2θ)\eta_{t}=1/(\kappa(t-T_{\sigma}\cdot\lfloor t/T_{\sigma}\rfloor)+2\theta) is used. Then, the k+1k+1-th slingshot strategy profile σk+1\sigma^{k+1} of APFTRL under the noisy feedback setting satisfies that:

𝔼[Dψ(πμ,σk,σk+1)]2θκκ(Tσ1)+2θ𝔼[Dψ(πμ,σk,σk)]+NC2ρ(κ(Tσ1)+2θ)(1κln(κ2θ(Tσ1)+1)+12θ).\displaystyle\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\sigma^{k+1})]\leq\frac{2\theta-\kappa}{\kappa(T_{\sigma}-1)+2\theta}\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\sigma^{k})]+\frac{NC^{2}}{\rho(\kappa(T_{\sigma}-1)+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-1)+1\right)+\frac{1}{2\theta}\right).

The proofs of these lemmas can be found in Appendices F.5 and F.6.

Appendix D Additional Theorems

Based on the convergence analysis in Section 5, we can show that the gap function for πt\pi^{t} converges to a value of 𝒪(μ)\mathcal{O}(\mu).

Theorem D.1.

In the same setup as Lemma 5.6, the gap function for APMD is bounded as:

GAP(πt)\displaystyle\mathrm{GAP}(\pi^{t}) μdiam(𝒳)i=1NπiG(πiμ,σ,σi)2+𝒪((1ημγ2)t2).\displaystyle\leq\mu\cdot\mathrm{diam}(\mathcal{X})\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\sigma_{i})\|^{2}}+\mathcal{O}\left(\left(1-\frac{\eta\mu\gamma}{2}\right)^{\frac{t}{2}}\right).

We note that a smaller μ\mu reduces the gap function for πμ,σ\pi^{\mu,\sigma} (the first term of Theorem D.1), whereas a larger μ\mu makes πt\pi^{t} converge faster (the second term of Theorem D.1). That is, μ\mu controls a trade-off between the speed of convergence and the size of the residual gap.

Appendix E Proofs for Section 4

E.1 Proof of Theorem B.1 (Formal Version of Theorem 4.1)

Proof of Theorem B.1.

First, from Lemma B.3, we have for any k0k\geq 0:

GAP(σk+1)μdiam(𝒳)πμ,σkσk+(Ldiam(𝒳)+ζ)πμ,σkσk+1.\displaystyle\mathrm{GAP}(\sigma^{k+1})\leq\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|.

Using Lemma B.2, we can upper bound the term of πμ,σkσk+1\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\| as follows:

πμ,σkσk+12πμ,σkσk2(1ημ2)Tσ.\displaystyle\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|^{2}\leq\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\left(1-\frac{\eta\mu}{2}\right)^{T_{\sigma}}.

Combining these inequalities, we have for any k0k\geq 0:

GAP(σk+1)\displaystyle\mathrm{GAP}(\sigma^{k+1}) μdiam(𝒳)πμ,σkσk+(1ημ2)Tσ2(Ldiam(𝒳)+ζ)πμ,σkσk\displaystyle\leq\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|+\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}\cdot\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|
((μ+L)diam(𝒳)+ζ)πμ,σkσk,\displaystyle\leq\left((\mu+L)\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|,

where the second inequality follows from Tσ1T_{\sigma}\geq 1. Let us denote K:=T/TσK:=\lfloor T/T_{\sigma}\rfloor as the total number of the slingshot strategy updates over the entire TT iterations. By letting k=K1k=K-1 in the above inequality, we get:

GAP(σK)((μ+L)diam(𝒳)+ζ)πμ,σK1σK1.\displaystyle\mathrm{GAP}(\sigma^{K})\leq\left((\mu+L)\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|. (6)

Next, we derive the following upper bound on πμ,σK1σK1\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\| from Lemma B.4:

πμ,σK1σK122Kdiam(𝒳)(8diam(𝒳)+ζμ).\displaystyle\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|\leq\frac{2\sqrt{2}}{\sqrt{K}}\sqrt{\mathrm{diam}(\mathcal{X})\left(8\cdot\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)}. (7)

By combining (6) and (7), we get:

GAP(σK)\displaystyle\mathrm{GAP}(\sigma^{K}) 22((μ+L)diam(𝒳)+ζ)Kdiam(𝒳)(8diam(𝒳)+ζμ).\displaystyle\leq\frac{2\sqrt{2}\left((\mu+L)\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)}{\sqrt{K}}\sqrt{\mathrm{diam}(\mathcal{X})\left(8\cdot\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)}.

Finally, since πT=σK\pi^{T}=\sigma^{K}, K=T/TσK=\lfloor T/T_{\sigma}\rfloor, and Tσ=cmax(6ln2ln(2ημ)lnT+2ln64ln2ln(2ημ),1)T_{\sigma}=c\cdot\max(\frac{6}{\ln 2-\ln(2-\eta\mu)}\ln T+\frac{2\ln 64}{\ln 2-\ln(2-\eta\mu)},1), we have:

GAP(πT)\displaystyle\mathrm{GAP}(\pi^{T})
22((μ+L)diam(𝒳)+ζ)T/Tσdiam(𝒳)(8diam(𝒳)+ζμ)\displaystyle\leq\frac{2\sqrt{2}\left((\mu+L)\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)}{\sqrt{T/T_{\sigma}}}\sqrt{\mathrm{diam}(\mathcal{X})\left(8\cdot\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)}
22c((μ+L)diam(𝒳)+ζ)(6ln2ln(2ημ)lnT+2ln64ln2ln(2ημ)+1)Tdiam(𝒳)(8diam(𝒳)+ζμ).\displaystyle\leq\frac{2\sqrt{2}c\left((\mu+L)\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\left(\frac{6}{\ln 2-\ln(2-\eta\mu)}\ln T+\frac{2\ln 64}{\ln 2-\ln(2-\eta\mu)}+1\right)}{\sqrt{T}}\sqrt{\mathrm{diam}(\mathcal{X})\left(8\cdot\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)}.

This concludes the statement of the theorem. ∎

E.2 Proof of Lemma B.2 (Formal Version of Lemma 4.2)

Proof of Lemma B.2.

From the definition of the Bregman divergence, we have for all πi,πi𝒳i\pi_{i},\pi_{i}^{\prime}\in\mathcal{X}_{i}:

Dψ(πi,σi)Dψ(πi,σi)πiDψ(πi,σi),πiπi\displaystyle D_{\psi}(\pi_{i}^{\prime},\sigma_{i})-D_{\psi}(\pi_{i},\sigma_{i})-\langle\nabla_{\pi_{i}}D_{\psi}(\pi_{i},\sigma_{i}),\pi_{i}^{\prime}-\pi_{i}\rangle
=ψ(πi)ψ(σi)ψ(σi),πiσiψ(πi)+ψ(σi)+ψ(σi),πiσiψ(πi)ψ(σi),πiπi\displaystyle=\psi(\pi_{i}^{\prime})-\psi(\sigma_{i})-\langle\nabla\psi(\sigma_{i}),\pi_{i}^{\prime}-\sigma_{i}\rangle-\psi(\pi_{i})+\psi(\sigma_{i})+\langle\nabla\psi(\sigma_{i}),\pi_{i}-\sigma_{i}\rangle-\langle\nabla\psi(\pi_{i})-\nabla\psi(\sigma_{i}),\pi_{i}^{\prime}-\pi_{i}\rangle
=ψ(πi)ψ(πi)ψ(πi),πiπi\displaystyle=\psi(\pi_{i}^{\prime})-\psi(\pi_{i})-\langle\nabla\psi(\pi_{i}),\pi_{i}^{\prime}-\pi_{i}\rangle
=Dψ(πi,πi).\displaystyle=D_{\psi}(\pi_{i}^{\prime},\pi_{i}).

Hence, assuming that GG is identical to DψD_{\psi}, Assumption 5.4 is satisfied with β=γ=1\beta=\gamma=1. Furthermore, since ψ(x)=12x2\psi(x)=\frac{1}{2}\left\|x\right\|^{2}, both ρ=1\rho=1 and int(domψ)=di\mathrm{int}(\mathrm{dom}~{}\psi)=\mathbb{R}^{d_{i}} hold. Therefore, Assumption 5.5 is also satisfied. Consequently, we can obtain the following convergence result from Lemma B.9 with β=γ=ρ=1\beta=\gamma=\rho=1:

πμ,σkσk+12πμ,σkσk2(1ημ2)Tσ.\displaystyle\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|^{2}\leq\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\left(1-\frac{\eta\mu}{2}\right)^{T_{\sigma}}.

E.3 Proof of Lemma B.3 (Formal Version of Lemma 4.3)

Proof of Lemma B.3.

First, we have for any π,π𝒳\pi,\pi^{\prime}\in\mathcal{X}:

GAP(π)\displaystyle\mathrm{GAP}(\pi) =maxπ~i𝒳ii=1Nπivi(π),π~iπi\displaystyle=\max_{\tilde{\pi}_{i}\in\mathcal{X}_{i}}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi),\tilde{\pi}_{i}-\pi_{i}\rangle
=maxπ~𝒳i=1N(πivi(π),π~iπiπivi(π),πiπi+πivi(π)πivi(π),π~iπi).\displaystyle=\max_{\tilde{\pi}\in\mathcal{X}}\sum_{i=1}^{N}\left(\langle\nabla_{\pi_{i}}v_{i}(\pi^{\prime}),\tilde{\pi}_{i}-\pi_{i}^{\prime}\rangle-\langle\nabla_{\pi_{i}}v_{i}(\pi^{\prime}),\pi_{i}-\pi_{i}^{\prime}\rangle+\langle\nabla_{\pi_{i}}v_{i}(\pi)-\nabla_{\pi_{i}}v_{i}(\pi^{\prime}),\tilde{\pi}_{i}-\pi_{i}\rangle\right). (8)

Here, we introduce the following lemma from Cai et al. (2022a):

Lemma E.1 (Lemma 2 of Cai et al. (2022a)).

For any π𝒳\pi\in\mathcal{X}, we have:

maxπ~𝒳i=1Nπivi(π),π~iπidiam(𝒳)min(ai)N𝒳(π)i=1Nπivi(π)+ai2,\displaystyle\max_{\tilde{\pi}\in\mathcal{X}}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi),\tilde{\pi}_{i}-\pi_{i}\rangle\leq\mathrm{diam}(\mathcal{X})\cdot\min_{(a_{i})\in N_{\mathcal{X}}(\pi)}\sqrt{\sum_{i=1}^{N}\|-\nabla_{\pi_{i}}v_{i}(\pi)+a_{i}\|^{2}},

where N𝒳(π)={(ai)i[N]i=1Ndi|i=1Nai,πiπi0,π𝒳}N_{\mathcal{X}}(\pi)=\{(a_{i})_{i\in[N]}\in\prod_{i=1}^{N}\mathbb{R}^{d_{i}}~{}|~{}\sum_{i=1}^{N}\langle a_{i},\pi_{i}^{\prime}-\pi_{i}\rangle\leq 0,~{}\forall\pi^{\prime}\in\mathcal{X}\}.

From Lemma E.1, the first term of (8) can be upper bounded as:

maxπ~𝒳i=1Nπivi(π),π~iπidiam(𝒳)min(ai)N𝒳(π)i=1Nπivi(π)+ai2\displaystyle\max_{\tilde{\pi}\in\mathcal{X}}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi^{\prime}),\tilde{\pi}_{i}-\pi_{i}^{\prime}\rangle\leq\mathrm{diam}(\mathcal{X})\cdot\min_{(a_{i})\in N_{\mathcal{X}}(\pi^{\prime})}\sqrt{\sum_{i=1}^{N}\|-\nabla_{\pi_{i}}v_{i}(\pi^{\prime})+a_{i}\|^{2}} (9)

Next, from Cauchy-Schwarz inequality, the second term of (8) can be upper bounded as:

i=1Nπivi(π),πiπi\displaystyle-\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi^{\prime}),\pi_{i}-\pi_{i}^{\prime}\rangle ππi=1Nπivi(π)2\displaystyle\leq\|\pi-\pi^{\prime}\|\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi^{\prime})\|^{2}}
ζππ.\displaystyle\leq\zeta\|\pi-\pi^{\prime}\|. (10)

Again from Cauchy-Schwarz inequality, the third term of (8) can be upper bounded as:

i=1Nπivi(π)πivi(π),π~iπi\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi)-\nabla_{\pi_{i}}v_{i}(\pi^{\prime}),\tilde{\pi}_{i}-\pi_{i}\rangle π~πi=1Nπivi(π)πivi(π)2\displaystyle\leq\|\tilde{\pi}-\pi\|\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi)-\nabla_{\pi_{i}}v_{i}(\pi^{\prime})\|^{2}}
diam(𝒳)i=1Nπivi(π)πivi(π)2\displaystyle\leq\mathrm{diam}(\mathcal{X})\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi)-\nabla_{\pi_{i}}v_{i}(\pi^{\prime})\|^{2}}
Ldiam(𝒳)ππ.\displaystyle\leq L\cdot\mathrm{diam}(\mathcal{X})\cdot\|\pi-\pi^{\prime}\|. (11)

By combining (8), (9), (10), and (11), we get for any π,π𝒳\pi,\pi^{\prime}\in\mathcal{X}:

GAP(π)\displaystyle\mathrm{GAP}(\pi) diam(𝒳)min(ai)N𝒳(π)i=1Nπivi(π)+ai2+(Ldiam(𝒳)+ζ)ππ.\displaystyle\leq\mathrm{diam}(\mathcal{X})\cdot\min_{(a_{i})\in N_{\mathcal{X}}(\pi^{\prime})}\sqrt{\sum_{i=1}^{N}\|-\nabla_{\pi_{i}}v_{i}(\pi^{\prime})+a_{i}\|^{2}}+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\|\pi-\pi^{\prime}\|.

Thus, letting π=σk+1\pi=\sigma^{k+1} and π=πμ,σk\pi^{\prime}=\pi^{\mu,\sigma^{k}}, we have:

GAP(σk+1)diam(𝒳)min(ai)N𝒳(πμ,σk)i=1Nπivi(πμ,σk)+ai2+(Ldiam(𝒳)+ζ)πμ,σkσk+1.\displaystyle\mathrm{GAP}(\sigma^{k+1})\leq\mathrm{diam}(\mathcal{X})\cdot\min_{(a_{i})\in N_{\mathcal{X}}(\pi^{\mu,\sigma^{k}})}\sqrt{\sum_{i=1}^{N}\|-\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma^{k}})+a_{i}\|^{2}}+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|. (12)

On the other hand, from the first-order optimality condition for πμ,σk\pi^{\mu,\sigma^{k}}, we have for any π𝒳\pi\in\mathcal{X}:

i=1Nπivi(πμ,σk)μ(πiμ,σkσik),πiπiμ,σk0,\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma^{k}})-\mu\left(\pi_{i}^{\mu,\sigma^{k}}-\sigma_{i}^{k}\right),\pi_{i}-\pi_{i}^{\mu,\sigma^{k}}\rangle\leq 0,

and then (πivi(πμ,σk)μ(πiμ,σkσik))i[N]N𝒳(πμ,σk)\left(\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma^{k}})-\mu\left(\pi_{i}^{\mu,\sigma^{k}}-\sigma_{i}^{k}\right)\right)_{i\in[N]}\in N_{\mathcal{X}}(\pi^{\mu,\sigma^{k}}). Thus, the first term of (12) can be bounded as:

min(ai)N𝒳(πμ,σk)i=1Nπivi(πμ,σk)+ai2\displaystyle\min_{(a_{i})\in N_{\mathcal{X}}(\pi^{\mu,\sigma^{k}})}\sqrt{\sum_{i=1}^{N}\|-\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma^{k}})+a_{i}\|^{2}}
i=1Nπivi(πμ,σk)+πivi(πμ,σk)μ(πiμ,σkσik)2\displaystyle\leq\sqrt{\sum_{i=1}^{N}\|-\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma^{k}})+\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma^{k}})-\mu\left(\pi_{i}^{\mu,\sigma^{k}}-\sigma_{i}^{k}\right)\|^{2}}
=μπμ,σkσk.\displaystyle=\mu\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|. (13)

Combining (12) and (13), we have:

GAP(σk+1)μdiam(𝒳)πμ,σkσk+(Ldiam(𝒳)+ζ)πμ,σkσk+1\displaystyle\mathrm{GAP}(\sigma^{k+1})\leq\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|

E.4 Proof of Lemma B.4 (Formal Version of Lemma 4.4)

Proof of Lemma B.4.

First, we prove the following lemma:

Lemma E.2.

Assume that i=1Nπivi(π)2ζ\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi)\|^{2}}\leq\zeta for any π𝒳\pi\in\mathcal{X}. If GG is set as the squared 2\ell^{2}-distance, we have for any k[K]k\in[K]:

πμ,σkσk2\displaystyle\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2} σkσk12+2ζμπμ,σk1σk.\displaystyle\leq\|\sigma^{k}-\sigma^{k-1}\|^{2}+\frac{2\zeta}{\mu}\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k}\|.

From Lemma E.2, we can bound πμ,σkσk2\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2} as:

πμ,σkσk2\displaystyle\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}
σkσk12+2ζμπμ,σk1σk\displaystyle\leq\|\sigma^{k}-\sigma^{k-1}\|^{2}+\frac{2\zeta}{\mu}\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k}\|
πμ,σk1σk12+σkπμ,σk12+2σkπμ,σk1πμ,σk1σk1+2ζμπμ,σk1σk.\displaystyle\leq\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|^{2}+\|\sigma^{k}-\pi^{\mu,\sigma^{k-1}}\|^{2}+2\|\sigma^{k}-\pi^{\mu,\sigma^{k-1}}\|\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|+\frac{2\zeta}{\mu}\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k}\|. (14)

Next, we upper bound πμ,σkσk2\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2} using the following lemma:

Lemma E.3.

Assume that Tσmax(6ln2ln(2ημ)lnT+2ln64ln2ln(2ημ),1)T_{\sigma}\geq\max(\frac{6}{\ln 2-\ln(2-\eta\mu)}\ln T+\frac{2\ln 64}{\ln 2-\ln(2-\eta\mu)},1). In the same setup as Theorem 4.1, we have for any Nash equilibrium πΠ\pi^{\ast}\in\Pi^{\ast}:

k=0K1πμ,σkσk2\displaystyle\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2} 16πσ02.\displaystyle\leq 16\|\pi^{\ast}-\sigma^{0}\|^{2}.

Using Lemma E.3, we have:

πμ,σkσk2k=0K1πμ,σkσk216πσ02,\displaystyle\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\leq\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\leq 16\|\pi^{\ast}-\sigma^{0}\|^{2},

and then from Lemma B.2, we get:

πμ,σkσk+1πμ,σkσk(1ημ2)Tσ24πσ0(1ημ2)Tσ2.\displaystyle\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|\leq\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}\leq 4\|\pi^{\ast}-\sigma^{0}\|\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}. (15)

By combining (14) and (15), we have:

πμ,σkσk2\displaystyle\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}
πμ,σk1σk12+16πσ02(1ημ2)Tσ+32πσ02(1ημ2)Tσ2+8ζμπσ0(1ημ2)Tσ2\displaystyle\leq\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|^{2}+16\|\pi^{\ast}-\sigma^{0}\|^{2}\left(1-\frac{\eta\mu}{2}\right)^{T_{\sigma}}+32\|\pi^{\ast}-\sigma^{0}\|^{2}\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}+\frac{8\zeta}{\mu}\|\pi^{\ast}-\sigma^{0}\|\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}
πμ,σk1σk12+48πσ02(1ημ2)Tσ2+8ζμπσ0(1ημ2)Tσ2\displaystyle\leq\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|^{2}+48\|\pi^{\ast}-\sigma^{0}\|^{2}\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}+\frac{8\zeta}{\mu}\|\pi^{\ast}-\sigma^{0}\|\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}
πμ,σk1σk12+8πσ0(1ημ2)Tσ2(6πσ0+ζμ).\displaystyle\leq\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|^{2}+8\|\pi^{\ast}-\sigma^{0}\|\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}\left(6\|\pi^{\ast}-\sigma^{0}\|+\frac{\zeta}{\mu}\right).

Therefore, we get:

πμ,σK1σK12\displaystyle\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|^{2} πμ,σK2σK22+8πσ0(1ημ2)Tσ2(6πσ0+ζμ)\displaystyle\leq\|\pi^{\mu,\sigma^{K-2}}-\sigma^{K-2}\|^{2}+8\|\pi^{\ast}-\sigma^{0}\|\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}\left(6\|\pi^{\ast}-\sigma^{0}\|+\frac{\zeta}{\mu}\right)
πμ,σkσk2+8Kπσ0(1ημ2)Tσ2(6πσ0+ζμ).\displaystyle\leq\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}+8K\|\pi^{\ast}-\sigma^{0}\|\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}\left(6\|\pi^{\ast}-\sigma^{0}\|+\frac{\zeta}{\mu}\right). (16)

By combining (16) and Lemma E.3, we have:

Kπμ,σK1σK12\displaystyle K\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|^{2} k=0K1πμ,σkσk2+8K2πσ0(1ημ2)Tσ2(6πσ0+ζμ)\displaystyle\leq\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}+8K^{2}\|\pi^{\ast}-\sigma^{0}\|\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}\left(6\|\pi^{\ast}-\sigma^{0}\|+\frac{\zeta}{\mu}\right)
16πσ02+8K2πσ0(1ημ2)Tσ2(6πσ0+ζμ)\displaystyle\leq 16\|\pi^{\ast}-\sigma^{0}\|^{2}+8K^{2}\|\pi^{\ast}-\sigma^{0}\|\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}\left(6\|\pi^{\ast}-\sigma^{0}\|+\frac{\zeta}{\mu}\right)
16πσ02+8T2πσ0(1ημ2)Tσ2(6πσ0+ζμ).\displaystyle\leq 16\|\pi^{\ast}-\sigma^{0}\|^{2}+8T^{2}\|\pi^{\ast}-\sigma^{0}\|\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}\left(6\|\pi^{\ast}-\sigma^{0}\|+\frac{\zeta}{\mu}\right).

Under the assumption that Tσmax(6ln2ln(2ημ)lnT+2ln64ln2ln(2ημ),1)T_{\sigma}\geq\max(\frac{6}{\ln 2-\ln(2-\eta\mu)}\ln T+\frac{2\ln 64}{\ln 2-\ln(2-\eta\mu)},1), we get:

(1ημ2)Tσ2(1ημ2)3ln2ln(2ημ)lnT(1ημ2)ln64ln(1ημ2)=64(1ημ2)lnT3ln(2ημ)ln2=64T3.\displaystyle\left(1-\frac{\eta\mu}{2}\right)^{-\frac{T_{\sigma}}{2}}\geq\left(1-\frac{\eta\mu}{2}\right)^{-\frac{3}{\ln 2-\ln(2-\eta\mu)}\ln T}\left(1-\frac{\eta\mu}{2}\right)^{\frac{\ln 64}{\ln\left(1-\frac{\eta\mu}{2}\right)}}=64\left(1-\frac{\eta\mu}{2}\right)^{\frac{\ln T^{3}}{\ln(2-\eta\mu)-\ln 2}}=64T^{3}.

Thus, we have:

Kπμ,σK1σK12\displaystyle K\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|^{2} 16πσ02+8πσ0(6πσ0+ζμ)\displaystyle\leq 16\|\pi^{\ast}-\sigma^{0}\|^{2}+8\|\pi^{\ast}-\sigma^{0}\|\left(6\|\pi^{\ast}-\sigma^{0}\|+\frac{\zeta}{\mu}\right)
=8πσ0(8πσ0+ζμ)\displaystyle=8\|\pi^{\ast}-\sigma^{0}\|\left(8\|\pi^{\ast}-\sigma^{0}\|+\frac{\zeta}{\mu}\right)
8diam(𝒳)(8diam(𝒳)+ζμ).\displaystyle\leq 8\cdot\mathrm{diam}(\mathcal{X})\left(8\cdot\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right).

Dividing both sides by KK and taking the square root concludes the proof. ∎
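
As a numerical sanity check of the exponential bound used above (not part of the proof): at the threshold value Tσ = (6/(ln 2 − ln(2 − ημ))) ln T + 2 ln 64/(ln 2 − ln(2 − ημ)), the quantity (1 − ημ/2)^{−Tσ/2} equals 64T³ exactly, so the inequality holds for any larger Tσ. The values of η and μ in the Python snippet below are placeholders.

import numpy as np

eta, mu = 0.05, 0.1
for T in (10, 100, 1000):
    denom = np.log(2.0) - np.log(2.0 - eta * mu)
    T_sigma = 6.0 / denom * np.log(T) + 2.0 * np.log(64.0) / denom
    lhs = (1.0 - eta * mu / 2.0) ** (-T_sigma / 2.0)
    print(T, lhs / (64.0 * T**3))  # ratio is 1.0 up to floating-point error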

E.5 Proof of Theorem B.6 (Formal Version of Theorem 4.5)

Proof of Theorem B.6.

First, from Lemma B.3, we have for any k0k\geq 0:

𝔼[GAP(σk+1)]μdiam(𝒳)𝔼[πμ,σkσk]+(Ldiam(𝒳)+ζ)𝔼[πμ,σkσk+1].\displaystyle\mathbb{E}\left[\mathrm{GAP}(\sigma^{k+1})\right]\leq\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|\right]+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|\right].

Using Lemma B.7, we can upper bound the term of 𝔼[πμ,σkσk+12]\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|^{2}\right] as follows:

𝔼[πμ,σkσk+12]2θκκ(Tσ1)+2θ𝔼[πμ,σkσk2]+NC2ρ(κ(Tσ1)+2θ)(1κln(κ2θ(Tσ1)+1)+12θ).\displaystyle\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|^{2}\right]\leq\frac{2\theta-\kappa}{\kappa(T_{\sigma}-1)+2\theta}\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\right]+\frac{NC^{2}}{\rho(\kappa(T_{\sigma}-1)+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-1)+1\right)+\frac{1}{2\theta}\right).

Combining these inequalities, we have for any k0k\geq 0:

𝔼[GAP(σk+1)]\displaystyle\mathbb{E}\left[\mathrm{GAP}(\sigma^{k+1})\right]
μdiam(𝒳)𝔼[πμ,σkσk]+(Ldiam(𝒳)+ζ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρκT4/5,\displaystyle\leq\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|\right]+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho\kappa T^{4/5}}},

where we use Jensen's inequality, the bound \mathbb{E}[\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}]\leq\mathrm{diam}(\mathcal{X})^{2}, and Tσmax(T4/5+2,3)T_{\sigma}\geq\max(T^{4/5}+2,3). Let us denote K:=T/TσK:=\lfloor T/T_{\sigma}\rfloor as the total number of the slingshot strategy updates over the entire TT iterations. By letting k=K1k=K-1 in the above inequality, we get:

𝔼[GAP(σK)]\displaystyle\mathbb{E}\left[\mathrm{GAP}(\sigma^{K})\right] μdiam(𝒳)𝔼[πμ,σK1σK1]\displaystyle\leq\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\mathbb{E}\left[\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|\right]
+(Ldiam(𝒳)+ζ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρκT4/5,\displaystyle+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho\kappa T^{4/5}}}, (17)

Next, we derive the following upper bound on 𝔼[πμ,σK1σK1]\mathbb{E}\left[\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|\right] from Lemma B.8:

𝔼[πμ,σK1σK1]\displaystyle\mathbb{E}\left[\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|\right]
πσ02+3ρκ(diam(𝒳)+ζμ+1)(ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ))K.\displaystyle\leq\sqrt{\frac{\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{3}{\rho\kappa}\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}+1\right)\left(\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)\right)}{K}}. (18)

By combining (17) and (18), we get:

𝔼[GAP(σK)]\displaystyle\mathbb{E}\left[\mathrm{GAP}(\sigma^{K})\right]
μdiam(𝒳)πσ02+3ρκ(diam(𝒳)+ζμ+1)(ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ))K\displaystyle\leq\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{\frac{\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{3}{\rho\kappa}\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}+1\right)\left(\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)\right)}{K}}
+(Ldiam(𝒳)+ζ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρκT4/5.\displaystyle+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho\kappa T^{4/5}}}.

Finally, since πT=σK\pi^{T}=\sigma^{K}, K=T/TσK=\lfloor T/T_{\sigma}\rfloor, and Tσ=cmax(T4/5+2,3)T_{\sigma}=c\cdot\max(T^{4/5}+2,3), we have:

𝔼[GAP(πT)]\displaystyle\mathbb{E}\left[\mathrm{GAP}(\pi^{T})\right]
μdiam(𝒳)πσ02+3ρκ(diam(𝒳)+ζμ+1)(ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ))T/Tσ\displaystyle\leq\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{\frac{\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{3}{\rho\kappa}\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}+1\right)\left(\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)\right)}{T/T_{\sigma}}}
+(Ldiam(𝒳)+ζ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρκT4/5\displaystyle+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho\kappa T^{4/5}}}
μdiam(𝒳)c(T4/5+5)\displaystyle\leq\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{c(T^{4/5}+5)}
πσ02+3ρκ(diam(𝒳)+ζμ+1)(ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ))T\displaystyle\cdot\sqrt{\frac{\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{3}{\rho\kappa}\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}+1\right)\left(\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)\right)}{T}}
+(Ldiam(𝒳)+ζ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρκT4/5\displaystyle+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho\kappa T^{4/5}}}
μdiam(𝒳)6cπσ02+18cρκ(diam(𝒳)+ζμ+1)(ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ))T1/5\displaystyle\leq\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{\frac{6c\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{18c}{\rho\kappa}\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}+1\right)\left(\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)\right)}{T^{1/5}}}
+(Ldiam(𝒳)+ζ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρκT4/5\displaystyle+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho\kappa T^{4/5}}}
μdiam(𝒳)26cT1/5\displaystyle\leq\mu\cdot\mathrm{diam}(\mathcal{X})^{2}\cdot\sqrt{\frac{6c}{T^{1/5}}}
+μdiam(𝒳)18c(diam(𝒳)+ζμ+1)(ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ))ρκT1/5\displaystyle+\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{\frac{18c\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}+1\right)\left(\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)\right)}{\rho\kappa T^{1/5}}}
+(Ldiam(𝒳)+ζ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρκT1/5\displaystyle+\left(L\cdot\mathrm{diam}(\mathcal{X})+\zeta\right)\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho\kappa T^{1/5}}}
6cμdiam(𝒳)2T1/10\displaystyle\leq\frac{\sqrt{6c}\mu\cdot\mathrm{diam}(\mathcal{X})^{2}}{T^{1/10}}
+Ldiam(𝒳)+ζ+μdiam(𝒳)18c(diam(𝒳)+ζμ+1)T1/10ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρκ.\displaystyle+\frac{L\cdot\mathrm{diam}(\mathcal{X})+\zeta+\mu\cdot\mathrm{diam}(\mathcal{X})\sqrt{18c\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}+1\right)}}{T^{1/10}}\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho\kappa}}.

This concludes the statement of the theorem. ∎

E.6 Proof of Lemma B.7 (Formal Version of Lemma 4.6)

Proof of Lemma B.7.

Assuming that GG is identical to DψD_{\psi}, Assumption 5.4 is satisfied with β=γ=1\beta=\gamma=1. Furthermore, since ψ(x)=12x2\psi(x)=\frac{1}{2}\left\|x\right\|^{2}, both ρ=1\rho=1 and int(domψ)=di\mathrm{int}(\mathrm{dom}~{}\psi)=\mathbb{R}^{d_{i}} hold. Therefore, Assumption 5.5 is also satisfied. Consequently, we can apply Lemma B.10 with β=γ=ρ=1\beta=\gamma=\rho=1 and obtain:

𝔼[πμ,σkσk+12]2θκκ(Tσ1)+2θ𝔼[πμ,σkσk2]+NC2ρ(κ(Tσ1)+2θ)(1κln(κ2θ(Tσ1)+1)+12θ).\displaystyle\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k+1}\|^{2}\right]\leq\frac{2\theta-\kappa}{\kappa(T_{\sigma}-1)+2\theta}\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\right]+\frac{NC^{2}}{\rho(\kappa(T_{\sigma}-1)+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-1)+1\right)+\frac{1}{2\theta}\right).

E.7 Proof of Lemma B.8 (Formal Version of Lemma 4.7)

Proof of Lemma B.8.

From Lemmas E.2 and B.7, we have:

𝔼[πμ,σkσk2]\displaystyle\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\right]
𝔼[σkσk12]+2ζμ𝔼[πμ,σk1σk]\displaystyle\leq\mathbb{E}\left[\|\sigma^{k}-\sigma^{k-1}\|^{2}\right]+\frac{2\zeta}{\mu}\mathbb{E}\left[\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k}\|\right]
𝔼[σkπμ,σk12]+𝔼[πμ,σk1σk12]+𝔼[2σkπμ,σk1πμ,σk1σk1+2ζμπμ,σk1σk]\displaystyle\leq\mathbb{E}\left[\|\sigma^{k}-\pi^{\mu,\sigma^{k-1}}\|^{2}\right]+\mathbb{E}\left[\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|^{2}\right]+\mathbb{E}\left[2\|\sigma^{k}-\pi^{\mu,\sigma^{k-1}}\|\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|+\frac{2\zeta}{\mu}\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k}\|\right]
2θκκ(Tσ2)+2θ𝔼[πμ,σkσk2]+NC2ρ(κ(Tσ2)+2θ)(1κln(κ2θ(Tσ2)+1)+12θ)\displaystyle\leq\frac{2\theta-\kappa}{\kappa(T_{\sigma}-2)+2\theta}\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\right]+\frac{NC^{2}}{\rho(\kappa(T_{\sigma}-2)+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-2)+1\right)+\frac{1}{2\theta}\right)
+2(diam(𝒳)+ζμ)2θκκ(Tσ2)+2θ𝔼[πμ,σk1σk12]+NC2ρ(κ(Tσ2)+2θ)(1κln(κ2θ(Tσ2)+1)+12θ)\displaystyle+2\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)\sqrt{\frac{2\theta-\kappa}{\kappa(T_{\sigma}-2)+2\theta}\mathbb{E}\left[\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|^{2}\right]+\frac{NC^{2}}{\rho(\kappa(T_{\sigma}-2)+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-2)+1\right)+\frac{1}{2\theta}\right)}
+𝔼[πμ,σk1σk12]\displaystyle+\mathbb{E}\left[\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|^{2}\right]
𝔼[πμ,σk1σk12]+ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θ(Tσ2)+1)+12θ)ρ(κ(Tσ2)+2θ)\displaystyle\leq\mathbb{E}\left[\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|^{2}\right]+\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-2)+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa(T_{\sigma}-2)+2\theta)}
+2(diam(𝒳)+ζμ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θ(Tσ2)+1)+12θ)ρ(κ(Tσ2)+2θ).\displaystyle+2\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-2)+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa(T_{\sigma}-2)+2\theta)}}.

Under the assumption that Tσmax(T4/5+2,3)T_{\sigma}\geq\max(T^{4/5}+2,3), we get:

𝔼[πμ,σkσk2]\displaystyle\mathbb{E}\left[\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\right] 𝔼[πμ,σk1σk12]+ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ)\displaystyle\leq\mathbb{E}\left[\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|^{2}\right]+\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}
+2(diam(𝒳)+ζμ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ).\displaystyle+2\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}.

Therefore, we get:

𝔼[πμ,σK1σK12]\displaystyle\mathbb{E}[\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|^{2}] 𝔼[πμ,σK2σK22]+ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ)\displaystyle\leq\mathbb{E}[\|\pi^{\mu,\sigma^{K-2}}-\sigma^{K-2}\|^{2}]+\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}
+2(diam(𝒳)+ζμ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ)\displaystyle+2\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}
𝔼[πμ,σkσk2]+Kρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ)\displaystyle\leq\mathbb{E}[\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}]+K\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}
+2K(diam(𝒳)+ζμ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ).\displaystyle+2K\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}. (19)

Here, we derive the following upper bound in terms of 𝔼[πμ,σkσk2]\mathbb{E}[\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}]:

Lemma E.4.

Assume that Tσmax(T4/5+2,3)T_{\sigma}\geq\max(T^{4/5}+2,3). In the same setup as Theorem 4.5, we have for any Nash equilibrium πΠ\pi^{\ast}\in\Pi^{\ast}:

𝔼[k=0K1πμ,σkσk2]πσ02+Kdiam(𝒳)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ).\displaystyle\mathbb{E}\left[\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\right]\leq\|\pi^{\ast}-\sigma^{0}\|^{2}+K\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}.

By combining (19), Lemma E.4, and the assumption that Tσmax(T4/5+2,3)T_{\sigma}\geq\max(T^{4/5}+2,3), we have:

K𝔼[πμ,σK1σK12]\displaystyle K\mathbb{E}[\|\pi^{\mu,\sigma^{K-1}}-\sigma^{K-1}\|^{2}]
𝔼[k=0K1πμ,σkσk2]+K2ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ)\displaystyle\leq\mathbb{E}\left[\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\right]+K^{2}\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}
+2K2(diam(𝒳)+ζμ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ)\displaystyle+2K^{2}\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}
πσ02+Kdiam(𝒳)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θ(T1)+1)+12θ)ρ(κT4/5+2θ)\displaystyle\leq\|\pi^{\ast}-\sigma^{0}\|^{2}+K\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T-1)+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}
+K2ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ)\displaystyle+K^{2}\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}
+2K2(diam(𝒳)+ζμ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ)\displaystyle+2K^{2}\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}
πσ02+T1/5diam(𝒳)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θ(T1)+1)+12θ)ρ(κT4/5+2θ)\displaystyle\leq\|\pi^{\ast}-\sigma^{0}\|^{2}+T^{1/5}\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T-1)+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}
+T2/5ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ)\displaystyle+T^{2/5}\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}
+2T2/5(diam(𝒳)+ζμ)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ)\displaystyle+2T^{2/5}\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}\right)\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}
πσ02+3(diam(𝒳)+ζμ+1)(ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρκ).\displaystyle\leq\|\pi^{\ast}-\sigma^{0}\|^{2}+3\left(\mathrm{diam}(\mathcal{X})+\frac{\zeta}{\mu}+1\right)\left(\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho\kappa}\right).

Dividing both sides by KK and applying Jensen's inequality \mathbb{E}[X]\leq\sqrt{\mathbb{E}[X^{2}]} concludes the proof. ∎

Appendix F Proofs for Section 5

F.1 Proof of Lemma B.9 (Formal Version of Lemma 5.6)

Proof of Lemma B.9.

From the definition of the Bregman divergence, we have for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\}:

Dψ(πiμ,σk,πit+1)Dψ(πiμ,σk,πit)+Dψ(πit+1,πit)\displaystyle D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t+1})-D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})
=ψ(πiμ,σk)ψ(πit+1)ψ(πit+1),πiμ,σkπit+1\displaystyle=\psi(\pi_{i}^{\mu,\sigma^{k}})-\psi(\pi_{i}^{t+1})-\langle\nabla\psi(\pi_{i}^{t+1}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle
ψ(πiμ,σk)+ψ(πit)+ψ(πit),πiμ,σkπit\displaystyle-\psi(\pi_{i}^{\mu,\sigma^{k}})+\psi(\pi_{i}^{t})+\langle\nabla\psi(\pi_{i}^{t}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t}\rangle
+ψ(πit+1)ψ(πit)ψ(πit),πit+1πit\displaystyle+\psi(\pi_{i}^{t+1})-\psi(\pi_{i}^{t})-\langle\nabla\psi(\pi_{i}^{t}),\pi_{i}^{t+1}-\pi_{i}^{t}\rangle
=ψ(πit)ψ(πit+1),πiμ,σkπit+1.\displaystyle=\langle\nabla\psi(\pi_{i}^{t})-\nabla\psi(\pi_{i}^{t+1}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle. (20)

From the first-order optimality condition for πit+1\pi_{i}^{t+1}, we get for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\}:

η(πivi(πit,πit)μπiG(πit,σik))ψ(πit+1)+ψ(πit),πiμ,σkπit+10.\displaystyle\langle\eta(\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}))-\nabla\psi(\pi_{i}^{t+1})+\nabla\psi(\pi_{i}^{t}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle\leq 0. (21)

Note that πiG(πit,σik)\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}) and πiψ(πit)\nabla_{\pi_{i}}\psi(\pi_{i}^{t}) are well-defined because of Assumptions 5.4 and 5.5. By combining (20) and (21), we have:

Dψ(πiμ,σk,πit+1)Dψ(πiμ,σk,πit)+Dψ(πit+1,πit)ηπivi(πit,πit)μπiG(πit,σik),πit+1πiμ,σk.\displaystyle D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t+1})-D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})\leq\eta\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle. (22)

Next, we derive the following convergence result for πt\pi^{t}:

Lemma F.1.

Suppose that Assumption 5.4 holds with β,γ(0,)\beta,\gamma\in(0,\infty), and the updated strategy profile πt\pi^{t} satisfies the following condition: for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\},

Dψ(πμ,σk,πt+1)Dψ(πμ,σk,πt)+Dψ(πt+1,πt)ηi=1Nπivi(πit,πit)μπiG(πit,σik),πit+1πiμ,σk.\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})-D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+D_{\psi}(\pi^{t+1},\pi^{t})\leq\eta\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle.

Then, for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\}:

Dψ(πμ,σk,πt+1)Dψ(πμ,σk,πkTσ)(1ημγ2)tkTσ+1.\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})\leq D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{kT_{\sigma}})\left(1-\frac{\eta\mu\gamma}{2}\right)^{t-kT_{\sigma}+1}.

It is easy to confirm that (22) satisfies the assumption in Lemma F.1. Thus, taking t=(k+1)Tσ1t=(k+1)T_{\sigma}-1, we have:

Dψ(πμ,σk,π(k+1)Tσ)Dψ(πμ,σk,πkTσ)(1ημγ2)Tσ.\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{(k+1)T_{\sigma}})\leq D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{kT_{\sigma}})\left(1-\frac{\eta\mu\gamma}{2}\right)^{T_{\sigma}}.

Since σk=πkTσ\sigma^{k}=\pi^{kT_{\sigma}} and σk+1=π(k+1)Tσ\sigma^{k+1}=\pi^{(k+1)T_{\sigma}}, we conclude the statement. ∎

F.2 Proof of Lemma B.10 (Formal Version of Lemma 5.7)

Proof of Lemma B.10.

Writing git=πivi(πit,πit)μπiG(πit,σik){g}_{i}^{t}=\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}), from the first-order optimality condition for πit+1\pi_{i}^{t+1}, we get for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\}:

ηt(git+ξit)ψ(πit+1)+ψ(πit),πiμ,σkπit+10.\displaystyle\langle\eta_{t}({g}_{i}^{t}+\xi_{i}^{t})-\nabla\psi(\pi_{i}^{t+1})+\nabla\psi(\pi_{i}^{t}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle\leq 0. (23)

Note that πiG(πit,σik)\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}) and πiψ(πit)\nabla_{\pi_{i}}\psi(\pi_{i}^{t}) are well-defined because of Assumptions 5.4 and 5.5. By combining (20) and (23), we have for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\}:

Dψ(πiμ,σk,πit+1)Dψ(πiμ,σk,πit)+Dψ(πit+1,πit)ηtgit+ξit,πit+1πiμ,σk.\displaystyle D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t+1})-D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})\leq\eta_{t}\langle g_{i}^{t}+\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle. (24)

We have the following Lemma that replaces the gradient with Bregman divergences:

Lemma F.2.

Under the noisy feedback setting, suppose that Assumption 5.4 holds with β,γ(0,)\beta,\gamma\in(0,\infty), and the updated strategy profile πt\pi^{t} satisfies the following condition: for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\},

Dψ(πμ,σk,πt+1)Dψ(πμ,σk,πt)+Dψ(πt+1,πt)\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})-D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+D_{\psi}(\pi^{t+1},\pi^{t})
ηti=1Nπivi(πit,πit)μπiG(πit,σik)+ξit,πit+1πiμ,σk.\displaystyle\quad\leq\eta_{t}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k})+\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle.

Then, for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\}:

Dψ(πμ,σk,πt+1)Dψ(πμ,σk,πt)+Dψ(πt+1,πt)\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})-D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+D_{\psi}(\pi^{t+1},\pi^{t})
ηt((μ2γρ2(γ+2β)+8L22μγρ2)Dψ(πt+1,πt)μγ2Dψ(πμ,σk,πt))+ηti=1Nξit,πit+1πiμ,σk.\displaystyle\leq\eta_{t}\left(\left(\frac{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}}{2\mu\gamma\rho^{2}}\right)D_{\psi}(\pi^{t+1},\pi^{t})-\frac{\mu\gamma}{2}D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})\right)+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle.

It is easy to confirm that (24) satisfies the assumption in Lemma F.2. Thus, for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\}:

Dψ(πμ,σk,πt+1)Dψ(πμ,σk,πt)+Dψ(πt+1,πt)\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})-D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+D_{\psi}(\pi^{t+1},\pi^{t})
ηt((μ2γρ2(γ+2β)+8L22μγρ2)Dψ(πt+1,πt)μγ2Dψ(πμ,σk,πt))+ηti=1Nξit,πit+1πiμ,σk\displaystyle\leq\eta_{t}\left(\left(\frac{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}}{2\mu\gamma\rho^{2}}\right)D_{\psi}(\pi^{t+1},\pi^{t})-\frac{\mu\gamma}{2}D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})\right)+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
=ηt(θDψ(πt+1,πt)κDψ(πμ,σk,πt))+ηti=1Nξit,πit+1πiμ,σk.\displaystyle=\eta_{t}(\theta D_{\psi}(\pi^{t+1},\pi^{t})-\kappa D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t}))+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle.

Then, using this inequality, we can bound the expected value of Dψ(πμ,σk,πt+1)D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1}) as follows:

Lemma F.3.

Suppose that with some constants θ>κ>0\theta>\kappa>0, for all t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\}, the following inequality holds:

Dψ(πμ,σk,πt+1)Dψ(πμ,σk,πt)+Dψ(πt+1,πt)\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})-D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+D_{\psi}(\pi^{t+1},\pi^{t})
ηt(θDψ(πt+1,πt)κDψ(πμ,σk,πt))+ηti=1Nξit,πit+1πiμ,σk.\displaystyle\quad\leq\eta_{t}(\theta D_{\psi}(\pi^{t+1},\pi^{t})-\kappa D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t}))+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle.

Then, under Assumption B.5, for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\},

𝔼[Dψ(πμ,σk,πt+1)]\displaystyle\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})]
2θκκ(tkTσ)+2θ𝔼[Dψ(πμ,σk,πkTσ)]+NC2ρ(κ(tkTσ)+2θ)(1κln(κ2θ(tkTσ)+1)+12θ).\displaystyle\leq\frac{2\theta-\kappa}{\kappa(t-kT_{\sigma})+2\theta}\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{kT_{\sigma}})]+\frac{NC^{2}}{\rho(\kappa(t-kT_{\sigma})+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(t-kT_{\sigma})+1\right)+\frac{1}{2\theta}\right).

Taking t=(k+1)Tσ1t=(k+1)T_{\sigma}-1, we have:

𝔼[Dψ(πμ,σk,π(k+1)Tσ)]\displaystyle\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{(k+1)T_{\sigma}})]
2θκκ(Tσ1)+2θ𝔼[Dψ(πμ,σk,πkTσ)]+NC2ρ(κ(Tσ1)+2θ)(1κln(κ2θ(Tσ1)+1)+12θ).\displaystyle\leq\frac{2\theta-\kappa}{\kappa(T_{\sigma}-1)+2\theta}\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{kT_{\sigma}})]+\frac{NC^{2}}{\rho(\kappa(T_{\sigma}-1)+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-1)+1\right)+\frac{1}{2\theta}\right).

Since σk=πkTσ\sigma^{k}=\pi^{kT_{\sigma}} and σk+1=π(k+1)Tσ\sigma^{k+1}=\pi^{(k+1)T_{\sigma}}, we conclude the statement. ∎

F.3 Proof of Theorem 5.8

Proof of Theorem 5.8.

When G(πi,πi)=Dψ(πi,πi)G(\pi_{i},\pi_{i}^{\prime})=D_{\psi^{\prime}}(\pi_{i},\pi_{i}^{\prime}) for all i[N]i\in[N] and πi,πi𝒳i\pi_{i},\pi_{i}^{\prime}\in\mathcal{X}_{i}, we can show that the Bregman divergence from a Nash equilibrium πΠ\pi^{\ast}\in\Pi^{\ast} to σk+1\sigma^{k+1} monotonically decreases:

Lemma F.4.

Assume that GG is a Bregman divergence DψD_{\psi^{\prime}} for some strongly convex function ψ\psi^{\prime}. Then, for any Nash equilibrium πΠ\pi^{\ast}\in\Pi^{\ast} of the underlying game, we have for any k0k\geq 0:

Dψ(π,σk+1)Dψ(π,σk)\displaystyle D_{\psi^{\prime}}(\pi^{\ast},\sigma^{k+1})-D_{\psi^{\prime}}(\pi^{\ast},\sigma^{k}) Dψ(σk+1,σk).\displaystyle\leq-D_{\psi^{\prime}}(\sigma^{k+1},\sigma^{k}).

By summing the inequality in Lemma F.4 from k=0k=0 to KK, we have:

Dψ(π,σ0)k=0KDψ(σk+1,σk)ρ2k=0Kσk+1σk2,\displaystyle D_{\psi^{\prime}}(\pi^{\ast},\sigma^{0})\geq\sum_{k=0}^{K}D_{\psi^{\prime}}(\sigma^{k+1},\sigma^{k})\geq\frac{\rho}{2}\sum_{k=0}^{K}\|\sigma^{k+1}-\sigma^{k}\|^{2},

where the second inequality follows from the strong convexity of ψ\psi^{\prime}. Therefore, k=0σk+1σk2<\sum_{k=0}^{\infty}\|\sigma^{k+1}-\sigma^{k}\|^{2}<\infty, which implies that σk+1σk0\|\sigma^{k+1}-\sigma^{k}\|\to 0 as kk\to\infty.

By the compactness of 𝒳\mathcal{X} and the Bolzano–Weierstrass theorem, there exists a subsequence knk_{n} and a limit point σ^𝒳\hat{\sigma}\in\mathcal{X} such that σknσ^\sigma^{k_{n}}\to\hat{\sigma} as nn\to\infty. Since σkn+1σkn0\|\sigma^{k_{n}+1}-\sigma^{k_{n}}\|\to 0 as nn\to\infty, we also have σkn+1σ^\sigma^{k_{n}+1}\to\hat{\sigma} as nn\to\infty. Thus, the limit point σ^\hat{\sigma} is a fixed point of the updating rule. The following lemma shows that this fixed point σ^\hat{\sigma} is a Nash equilibrium of the underlying game:

Lemma F.5.

Assume that GG is a Bregman divergence DψD_{\psi^{\prime}} for some strongly convex function ψ\psi^{\prime}, and σk+1=πμ,σk\sigma^{k+1}=\pi^{\mu,\sigma^{k}} for k0k\geq 0. If σk+1=σk\sigma^{k+1}=\sigma^{k}, then σk\sigma^{k} is a Nash equilibrium of the underlying game.

On the other hand, by summing the inequality in Lemma F.4 from k=knk=k_{n} to k=K1k=K-1 for Kkn+1K\geq k_{n}+1, we have:

0Dψ(σ^,σK)\displaystyle 0\leq D_{\psi^{\prime}}(\hat{\sigma},\sigma^{K}) Dψ(σ^,σkn).\displaystyle\leq D_{\psi^{\prime}}(\hat{\sigma},\sigma^{k_{n}}).

Since σknσ^\sigma^{k_{n}}\to\hat{\sigma} as nn\to\infty, the right-hand side converges to zero, and hence σKσ^\sigma^{K}\to\hat{\sigma} as KK\to\infty. Since σ^\hat{\sigma} is a Nash equilibrium of the underlying game, we conclude the first statement of the theorem. ∎
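
As an illustrative numerical sketch of the update σk+1=πμ,σk\sigma^{k+1}=\pi^{\mu,\sigma^{k}} analyzed above (not part of the formal argument), the following Python snippet instantiates the slingshot update on a small zero-sum matrix game, assuming that GG is the squared Euclidean distance and approximating each stationary point πμ,σk\pi^{\mu,\sigma^{k}} by a long run of the inner projected-gradient (Euclidean mirror descent) loop; the game, constants, and iteration counts are arbitrary choices for illustration. Up to the inner-loop approximation and floating-point error, the printed distance to the Nash equilibrium is non-increasing in kk, as Lemma F.4 predicts, and tends toward zero, as the theorem states.

import numpy as np

def proj_simplex(v):
    # Euclidean projection onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + tau, 0.0)

# Zero-sum matrix game v_1(x, y) = x^T A y, v_2 = -v_1 (rock-paper-scissors);
# the unique Nash equilibrium is the uniform profile.
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])
pi_star = np.full((2, 3), 1.0 / 3.0)

mu, eta, T_sigma, K = 0.5, 0.05, 2000, 20
sigma = np.array([[0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])  # initial slingshot strategies (already on the simplex)

for k in range(K):
    x, y = sigma[0].copy(), sigma[1].copy()
    for t in range(T_sigma):
        # Payoff gradients perturbed toward the current slingshot (squared-Euclidean G).
        gx = A @ y - mu * (x - sigma[0])
        gy = -A.T @ x - mu * (y - sigma[1])
        x = proj_simplex(x + eta * gx)
        y = proj_simplex(y + eta * gy)
    sigma = np.stack([x, y])  # sigma^{k+1} approximates pi^{mu, sigma^k}
    print(k, np.linalg.norm(sigma - pi_star))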

F.4 Proof of Theorem 5.9

Proof of Theorem 5.9.

We first show that the divergence between Π\Pi^{\ast} and σk\sigma^{k} decreases monotonically as kk increases:

Lemma F.6.

Suppose that the same assumptions in Theorem 5.9 hold. For any k0k\geq 0, if σk𝒳Π\sigma^{k}\in\mathcal{X}\setminus\Pi^{\ast}, then:

minπΠKL(π,σk+1)<minπΠKL(π,σk).\displaystyle\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k+1})<\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k}).

Otherwise, if σkΠ\sigma^{k}\in\Pi^{\ast}, then σk+1=σkΠ\sigma^{k+1}=\sigma^{k}\in\Pi^{\ast}.

From Lemma F.6, the sequence {minπΠKL(π,σk)}k0\{\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k})\}_{k\geq 0} is a monotonically decreasing sequence and is bounded from below by zero. Thus, {minπΠKL(π,σk)}k0\{\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k})\}_{k\geq 0} converges to some constant b0b\geq 0. We show that b=0b=0 by a contradiction argument.

Suppose b>0b>0 and let us define B=minπΠKL(π,σ0)B=\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{0}). Since minπΠKL(π,σk)\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k}) monotonically decreases, σk\sigma^{k} is in the set Ωb,B={σ𝒳|bminπΠKL(π,σ)B}\Omega_{b,B}=\{\sigma\in\mathcal{X}~{}|~{}b\leq\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma)\leq B\} for all k0k\geq 0. Since minπΠKL(π,)\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\cdot) is a continuous function on 𝒳\mathcal{X}, the preimage Ωb,B\Omega_{b,B} of the closed set [b,B][b,B] is also closed. Furthermore, since 𝒳\mathcal{X} is compact, and hence bounded, Ωb,B\Omega_{b,B} is a bounded set. Thus, Ωb,B\Omega_{b,B} is a compact set.

Next, we show that the function which maps the slingshot strategies σ\sigma to the associated stationary point πμ,σ\pi^{\mu,\sigma} is continuous:

Lemma F.7.

Let F(σ):𝒳𝒳F(\sigma):\mathcal{X}\to\mathcal{X} be a function that maps the slingshot strategies σ\sigma to the stationary point πμ,σ\pi^{\mu,\sigma} defined by (4). In the same setup of Theorem 5.9, F()F(\cdot) is a continuous function on 𝒳\mathcal{X}.

From Lemma F.7, minπΠKL(π,F(σ))minπΠKL(π,σ)\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},F(\sigma))-\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma) is also a continuous function. Since a continuous function attains its maximum over the compact set Ωb,B\Omega_{b,B}, the maximum δ=maxσΩb,B{minπΠKL(π,F(σ))minπΠKL(π,σ)}\delta=\max_{\sigma\in\Omega_{b,B}}\left\{\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},F(\sigma))-\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma)\right\} exists. From Lemma F.6 and the assumption that b>0b>0, we have δ<0\delta<0. It follows that:

minπΠKL(π,σk)\displaystyle\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k}) =minπΠKL(π,σ0)+l=0k1(minπΠKL(π,σl+1)minπΠKL(π,σl))\displaystyle=\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{0})+\sum_{l=0}^{k-1}\left(\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{l+1})-\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{l})\right)
B+l=0k1δ=B+kδ.\displaystyle\leq B+\sum_{l=0}^{k-1}\delta=B+k\delta.

This implies that minπΠKL(π,σk)<0\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k})<0 for k>Bδk>\frac{B}{-\delta}, which contradicts minπΠKL(π,σ)0\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma)\geq 0. Therefore, the sequence {minπΠKL(π,σk)}k0\{\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k})\}_{k\geq 0} converges to 0, and σk\sigma^{k} converges to Π\Pi^{\ast}. ∎

F.5 Proof of Lemma C.2

Proof of Lemma C.2.

First, we introduce the following lemma:

Lemma F.8.

Let us define T(yi)=argmaxx𝒳i{yi,xψ(x)}T(y_{i})=\mathop{\rm arg~{}max}\limits_{x\in\mathcal{X}_{i}}\{\langle y_{i},x\rangle-\psi(x)\}. Assuming ψ:𝒳i\psi:\mathcal{X}_{i}\to\mathbb{R} be a convex function of the Legendre type, we have for any πi𝒳i\pi_{i}\in\mathcal{X}_{i}:

Dψ(πi,T(yi))=ψ(πi)ψ(T(yi))yi,πiT(yi).\displaystyle D_{\psi}(\pi_{i},T(y_{i}))=\psi(\pi_{i})-\psi(T(y_{i}))-\langle y_{i},\pi_{i}-T(y_{i})\rangle.

Defining yit=ηs=0t(πivi(πis,πis)μπiG(πis,σik))y_{i}^{t}=\eta\sum_{s=0}^{t}\left(\nabla_{\pi_{i}}v_{i}(\pi_{i}^{s},\pi_{-i}^{s})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{s},\sigma_{i}^{k})\right) and letting πi=πiμ,σk\pi_{i}=\pi_{i}^{\mu,\sigma^{k}}, yi=yity_{i}=y_{i}^{t} in Lemma F.8, we have:

Dψ(πiμ,σk,πit+1)=ψ(πiμ,σk)ψ(πit+1)yit,πiμ,σkπit+1.\displaystyle D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t+1})=\psi(\pi_{i}^{\mu,\sigma^{k}})-\psi(\pi_{i}^{t+1})-\langle y_{i}^{t},\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle.

Note that πiG(πis,σik)\nabla_{\pi_{i}}G(\pi_{i}^{s},\sigma_{i}^{k}) is well-defined because of Assumptions 5.4 and C.1. Using this equation, we get for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\}:

Dψ(πiμ,σk,πit+1)Dψ(πiμ,σk,πit)+Dψ(πit+1,πit)\displaystyle D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t+1})-D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})
=ψ(πiμ,σk)ψ(πit+1)yit,πiμ,σkπit+1ψ(πiμ,σk)+ψ(πit)+yit1,πiμ,σkπit\displaystyle=\psi(\pi_{i}^{\mu,\sigma^{k}})-\psi(\pi_{i}^{t+1})-\langle y_{i}^{t},\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle-\psi(\pi_{i}^{\mu,\sigma^{k}})+\psi(\pi_{i}^{t})+\langle y_{i}^{t-1},\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t}\rangle
+ψ(πit+1)ψ(πit)yit1,πit+1πit\displaystyle+\psi(\pi_{i}^{t+1})-\psi(\pi_{i}^{t})-\langle y_{i}^{t-1},\pi_{i}^{t+1}-\pi_{i}^{t}\rangle
=yit,πiμ,σkπit+1+yit1,πiμ,σkπit+1\displaystyle=-\langle y_{i}^{t},\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle+\langle y_{i}^{t-1},\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle
=yityit1,πit+1πiμ,σk\displaystyle=\langle y_{i}^{t}-y_{i}^{t-1},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
=ηπivi(πit,πit)μπiG(πit,σik),πit+1πiμ,σk.\displaystyle=\eta\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle. (25)

It is easy to confirm that (25) satisfies the assumption in Lemma F.1. Thus, taking t=(k+1)Tσ1t=(k+1)T_{\sigma}-1 in Lemma F.1, we have:

Dψ(πμ,σk,π(k+1)Tσ)Dψ(πμ,σk,πkTσ)(1ημγ2)Tσ.\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{(k+1)T_{\sigma}})\leq D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{kT_{\sigma}})\left(1-\frac{\eta\mu\gamma}{2}\right)^{T_{\sigma}}.

Since σk=πkTσ\sigma^{k}=\pi^{kT_{\sigma}} and σk+1=π(k+1)Tσ\sigma^{k+1}=\pi^{(k+1)T_{\sigma}}, we conclude the statement. ∎

F.6 Proof of Lemma C.3

Proof of Lemma C.3.

Writing yit=s=0tηs(πivi(πis,πis)+ξisμπiG(πis,σik))y_{i}^{t}=\sum_{s=0}^{t}\eta_{s}(\nabla_{\pi_{i}}v_{i}(\pi_{i}^{s},\pi_{-i}^{s})+\xi^{s}_{i}-\mu\nabla_{\pi_{i}}G(\pi_{i}^{s},\sigma_{i}^{k})) and using Lemma F.8 in Appendix F.5, we have:

Dψ(πiμ,σk,πit+1)Dψ(πiμ,σk,πit)+Dψ(πit+1,πit)\displaystyle D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t+1})-D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})
=yityit1,πit+1πiμ,σk\displaystyle=\langle y_{i}^{t}-y_{i}^{t-1},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
=ηtπivi(πit,πit)μπiG(πit,σik)+ξit,πit+1πiμ,σk.\displaystyle=\eta_{t}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k})+\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle.

Note that πiG(πit,σik)\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}) is well-defined because of Assumptions 5.4 and C.1. Thus, we can apply Lemma F.2, and we have for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\}:

Dψ(πμ,σk,πt+1)Dψ(πμ,σk,πt)+Dψ(πt+1,πt)\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})-D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+D_{\psi}(\pi^{t+1},\pi^{t})
ηt((μ2γρ2(γ+2β)+8L22μγρ2)Dψ(πt+1,πt)μγ2Dψ(πμ,σk,πt))+ηti=1Nξit,πit+1πiμ,σk\displaystyle\leq\eta_{t}\left(\left(\frac{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}}{2\mu\gamma\rho^{2}}\right)D_{\psi}(\pi^{t+1},\pi^{t})-\frac{\mu\gamma}{2}D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})\right)+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
=ηt(θDψ(πt+1,πt)κDψ(πμ,σk,πt))+ηti=1Nξit,πit+1πiμ,σk.\displaystyle=\eta_{t}(\theta D_{\psi}(\pi^{t+1},\pi^{t})-\kappa D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t}))+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle.

Therefore, the assumption in Lemma F.3 is satisfied, and we get:

𝔼[Dψ(πμ,σk,πt+1)]\displaystyle\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})]
2θκκ(tkTσ)+2θ𝔼[Dψ(πμ,σk,πkTσ)]+NC2ρ(κ(tkTσ)+2θ)(1κln(κ2θ(tkTσ)+1)+12θ).\displaystyle\leq\frac{2\theta-\kappa}{\kappa(t-kT_{\sigma})+2\theta}\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{kT_{\sigma}})]+\frac{NC^{2}}{\rho(\kappa(t-kT_{\sigma})+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(t-kT_{\sigma})+1\right)+\frac{1}{2\theta}\right).

Taking t=(k+1)Tσ1t=(k+1)T_{\sigma}-1, we have:

𝔼[Dψ(πμ,σk,π(k+1)Tσ)]\displaystyle\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{(k+1)T_{\sigma}})]
2θκκ(Tσ1)+2θ𝔼[Dψ(πμ,σk,πkTσ)]+NC2ρ(κ(Tσ1)+2θ)(1κln(κ2θ(Tσ1)+1)+12θ).\displaystyle\leq\frac{2\theta-\kappa}{\kappa(T_{\sigma}-1)+2\theta}\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{kT_{\sigma}})]+\frac{NC^{2}}{\rho(\kappa(T_{\sigma}-1)+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-1)+1\right)+\frac{1}{2\theta}\right).

Since σk=πkTσ\sigma^{k}=\pi^{kT_{\sigma}} and σk+1=π(k+1)Tσ\sigma^{k+1}=\pi^{(k+1)T_{\sigma}}, we conclude the statement. ∎

Appendix G Proofs for Section D

G.1 Proof of Theorem D.1

Proof of Theorem D.1.

Since vi(,πit)v_{i}(\cdot,\pi_{-i}^{t}) is concave, we can upper bound the gap function for πt\pi^{t} as:

GAP(πt)\displaystyle\mathrm{GAP}(\pi^{t})
=i=1Nmaxπ~i𝒳iπivi(πt),π~iπit\displaystyle=\sum_{i=1}^{N}\max_{\tilde{\pi}_{i}\in\mathcal{X}_{i}}\langle\nabla_{\pi_{i}}v_{i}(\pi^{t}),\tilde{\pi}_{i}-\pi_{i}^{t}\rangle
=maxπ~𝒳i=1N(πivi(πμ,σ),π~iπiμ,σπivi(πμ,σ),πitπiμ,σ+πivi(πt)πivi(πμ,σ),π~iπit).\displaystyle=\max_{\tilde{\pi}\in\mathcal{X}}\sum_{i=1}^{N}\left(\langle\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma}),\tilde{\pi}_{i}-\pi_{i}^{\mu,\sigma}\rangle-\langle\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma}\rangle+\langle\nabla_{\pi_{i}}v_{i}(\pi^{t})-\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma}),\tilde{\pi}_{i}-\pi_{i}^{t}\rangle\right). (26)

From Lemma E.1, the first term of (26) can be upper bounded as:

maxπ~𝒳i=1Nπivi(πμ,σ),π~iπiμ,σdiam(𝒳)min(ai)N𝒳(πμ,σ)i=1Nπivi(πμ,σ)+ai2.\displaystyle\max_{\tilde{\pi}\in\mathcal{X}}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma}),\tilde{\pi}_{i}-\pi_{i}^{\mu,\sigma}\rangle\leq\mathrm{diam}(\mathcal{X})\cdot\min_{(a_{i})\in N_{\mathcal{X}}(\pi^{\mu,\sigma})}\sqrt{\sum_{i=1}^{N}\|-\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma})+a_{i}\|^{2}}.

From the first-order optimality condition for πμ,σ\pi^{\mu,\sigma}, we have for any π𝒳\pi\in\mathcal{X}:

i=1Nπivi(πμ,σ)μπiG(πiμ,σ,σi),πiπiμ,σ0,\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\sigma_{i}),\pi_{i}-\pi_{i}^{\mu,\sigma}\rangle\leq 0,

and then (πivi(πμ,σ)μπiG(πiμ,σ,σi))i[N]N𝒳(πμ,σ)(\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\sigma_{i}))_{i\in[N]}\in N_{\mathcal{X}}(\pi^{\mu,\sigma}). Thus,

maxπ~𝒳i=1Nπivi(πμ,σ),π~iπiμ,σ\displaystyle\max_{\tilde{\pi}\in\mathcal{X}}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma}),\tilde{\pi}_{i}-\pi_{i}^{\mu,\sigma}\rangle
diam(𝒳)i=1Nπivi(πμ,σ)+πivi(πμ,σ)μπiG(πiμ,σ,σi)2\displaystyle\leq\mathrm{diam}(\mathcal{X})\sqrt{\sum_{i=1}^{N}\|-\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma})+\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\sigma_{i})\|^{2}}
=μdiam(𝒳)i=1NπiG(πiμ,σ,σi)2.\displaystyle=\mu\cdot\mathrm{diam}(\mathcal{X})\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\sigma_{i})\|^{2}}. (27)

Next, from Cauchy–Schwarz inequality, the second term of (26) can be bounded as:

i=1Nπivi(πμ,σ),πitπiμ,σπtπμ,σi=1Nπivi(πμ,σ)2.\displaystyle-\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma}\rangle\leq\|\pi^{t}-\pi^{\mu,\sigma}\|\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma})\|^{2}}. (28)

Again from Cauchy–Schwarz inequality, the third term of (26) is bounded by:

i=1Nπivi(πt)πivi(πμ,σ),π~iπit\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi^{t})-\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma}),\tilde{\pi}_{i}-\pi_{i}^{t}\rangle π~πti=1Nπivi(πt)πivi(πμ,σ)2\displaystyle\leq\|\tilde{\pi}-\pi^{t}\|\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi^{t})-\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma})\|^{2}}
diam(𝒳)i=1Nπivi(πt)πivi(πμ,σ)2\displaystyle\leq\mathrm{diam}(\mathcal{X})\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi^{t})-\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma})\|^{2}}
Ldiam(𝒳)πtπμ,σ,\displaystyle\leq L\cdot\mathrm{diam}(\mathcal{X})\|\pi^{t}-\pi^{\mu,\sigma}\|, (29)

where the third inequality follows from (2).

By combining (26), (27), (28), and (29), we get:

GAP(πt)μdiam(𝒳)i=1NπiG(πiμ,σ,σi)2+(Ldiam(𝒳)+i=1Nπivi(πμ,σ)2)πtπμ,σ.\displaystyle\mathrm{GAP}(\pi^{t})\leq\mu\cdot\mathrm{diam}(\mathcal{X})\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\sigma_{i})\|^{2}}+\left(L\cdot\mathrm{diam}(\mathcal{X})+\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma})\|^{2}}\right)\|\pi^{t}-\pi^{\mu,\sigma}\|.

Thus, from Lemma 5.6 and the strong convexity of ψ\psi, we have:

GAP(πt)\displaystyle\mathrm{GAP}(\pi^{t}) μdiam(𝒳)i=1NπiG(πiμ,σ,σi)2\displaystyle\leq\mu\cdot\mathrm{diam}(\mathcal{X})\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\sigma_{i})\|^{2}}
+(Ldiam(𝒳)+i=1Nπivi(πμ,σ)2)2Dψ(πμ,σ,π0)ρ(1ημγ2)t.\displaystyle+\left(L\cdot\mathrm{diam}(\mathcal{X})+\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi^{\mu,\sigma})\|^{2}}\right)\sqrt{\frac{2D_{\psi}(\pi^{\mu,\sigma},\pi^{0})}{\rho}\left(1-\frac{\eta\mu\gamma}{2}\right)^{t}}.

Appendix H Proofs for Additional Lemmas

H.1 Proof of Lemma E.2

Proof of Lemma E.2.

From the first-order optimality condition for πμ,σk\pi^{\mu,\sigma^{k}} and πμ,σk1\pi^{\mu,\sigma^{k-1}}, we have for any k1k\geq 1:

πivi(πiμ,σk,πiμ,σk)μ(πiμ,σkσik),σikπiμ,σk0,\displaystyle\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}})-\mu\left(\pi_{i}^{\mu,\sigma^{k}}-\sigma_{i}^{k}\right),\sigma_{i}^{k}-\pi_{i}^{\mu,\sigma^{k}}\rangle\leq 0,
πivi(πiμ,σk1,πiμ,σk1)μ(πiμ,σk1σik1),πiμ,σkπiμ,σk10.\displaystyle\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k-1}},\pi_{-i}^{\mu,\sigma^{k-1}})-\mu\left(\pi_{i}^{\mu,\sigma^{k-1}}-\sigma_{i}^{k-1}\right),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle\leq 0.

Summing up these inequalities yields:

0\displaystyle 0 i=1Nπivi(πiμ,σk,πiμ,σk)μ(πiμ,σkσik),σikπiμ,σk\displaystyle\geq\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}})-\mu\left(\pi_{i}^{\mu,\sigma^{k}}-\sigma_{i}^{k}\right),\sigma_{i}^{k}-\pi_{i}^{\mu,\sigma^{k}}\rangle
+i=1Nπivi(πiμ,σk1,πiμ,σk1)μ(πiμ,σk1σik1),πiμ,σkπiμ,σk1\displaystyle+\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k-1}},\pi_{-i}^{\mu,\sigma^{k-1}})-\mu\left(\pi_{i}^{\mu,\sigma^{k-1}}-\sigma_{i}^{k-1}\right),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle
=i=1Nπivi(πiμ,σk,πiμ,σk),πiμ,σk1πiμ,σk+i=1Nπivi(πiμ,σk,πiμ,σk),σikπiμ,σk1+μπμ,σkσk2\displaystyle=\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\pi_{i}^{\mu,\sigma^{k-1}}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\sigma_{i}^{k}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle+\mu\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}
+i=1Nπivi(πiμ,σk1,πiμ,σk1),πiμ,σkπiμ,σk1+μi=1Nσik1πiμ,σk1,πiμ,σkπiμ,σk1\displaystyle+\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k-1}},\pi_{-i}^{\mu,\sigma^{k-1}}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle+\mu\sum_{i=1}^{N}\langle\sigma_{i}^{k-1}-\pi_{i}^{\mu,\sigma^{k-1}},\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle
=i=1Nπivi(πiμ,σk,πiμ,σk)πivi(πiμ,σk1,πiμ,σk1),πiμ,σk1πiμ,σk\displaystyle=\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k-1}},\pi_{-i}^{\mu,\sigma^{k-1}}),\pi_{i}^{\mu,\sigma^{k-1}}-\pi_{i}^{\mu,\sigma^{k}}\rangle
+i=1Nπivi(πiμ,σk,πiμ,σk),σikπiμ,σk1+μπμ,σkσk2+μi=1Nσik1πiμ,σk1,πiμ,σkπiμ,σk1\displaystyle+\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\sigma_{i}^{k}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle+\mu\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}+\mu\sum_{i=1}^{N}\langle\sigma_{i}^{k-1}-\pi_{i}^{\mu,\sigma^{k-1}},\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle
i=1Nπivi(πiμ,σk,πiμ,σk),σikπiμ,σk1+μπμ,σkσk2+μi=1Nσik1πiμ,σk1,πiμ,σkπiμ,σk1,\displaystyle\geq\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\sigma_{i}^{k}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle+\mu\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}+\mu\sum_{i=1}^{N}\langle\sigma_{i}^{k-1}-\pi_{i}^{\mu,\sigma^{k-1}},\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle,

where the last inequality follows from (1). Then, since

σik1πiμ,σk1,πiμ,σk1πiμ,σk=πiμ,σkσik122πiμ,σkπiμ,σk122πiμ,σk1σik122,\displaystyle\langle\sigma_{i}^{k-1}-\pi_{i}^{\mu,\sigma^{k-1}},\pi_{i}^{\mu,\sigma^{k-1}}-\pi_{i}^{\mu,\sigma^{k}}\rangle=\frac{\|\pi_{i}^{\mu,\sigma^{k}}-\sigma_{i}^{k-1}\|^{2}}{2}-\frac{\|\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{\mu,\sigma^{k-1}}\|^{2}}{2}-\frac{\|\pi_{i}^{\mu,\sigma^{k-1}}-\sigma_{i}^{k-1}\|^{2}}{2},

we have:

0\displaystyle 0 i=1Nπivi(πiμ,σk,πiμ,σk),σikπiμ,σk1+μπμ,σkσk2\displaystyle\geq\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\sigma_{i}^{k}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle+\mu\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}
+μ2i=1Nσik1πiμ,σk1,πiμ,σkπiμ,σk1+μ2i=1Nσik1πiμ,σk1,πiμ,σkπiμ,σk1\displaystyle+\frac{\mu}{2}\sum_{i=1}^{N}\langle\sigma_{i}^{k-1}-\pi_{i}^{\mu,\sigma^{k-1}},\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle+\frac{\mu}{2}\sum_{i=1}^{N}\langle\sigma_{i}^{k-1}-\pi_{i}^{\mu,\sigma^{k-1}},\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle
i=1Nπivi(πiμ,σk,πiμ,σk),σikπiμ,σk1+μπμ,σkσk2\displaystyle\geq\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\sigma_{i}^{k}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle+\mu\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}
μ4(σk1πμ,σk12+πμ,σkπμ,σk12πμ,σkπμ,σk1+σk1πμ,σk12)\displaystyle-\frac{\mu}{4}\left(\|\sigma^{k-1}-\pi^{\mu,\sigma^{k-1}}\|^{2}+\|\pi^{\mu,\sigma^{k}}-\pi^{\mu,\sigma^{k-1}}\|^{2}-\|\pi^{\mu,\sigma^{k}}-\pi^{\mu,\sigma^{k-1}}+\sigma^{k-1}-\pi^{\mu,\sigma^{k-1}}\|^{2}\right)
μ4(πμ,σkσk12πμ,σkπμ,σk12πμ,σk1σk12)\displaystyle-\frac{\mu}{4}\left(\|\pi^{\mu,\sigma^{k}}-\sigma^{k-1}\|^{2}-\|\pi^{\mu,\sigma^{k}}-\pi^{\mu,\sigma^{k-1}}\|^{2}-\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k-1}\|^{2}\right)
i=1Nπivi(πiμ,σk,πiμ,σk),σikπiμ,σk1+μπμ,σkσk2μ4πμ,σkσk12\displaystyle\geq\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\sigma_{i}^{k}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle+\mu\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{4}\|\pi^{\mu,\sigma^{k}}-\sigma^{k-1}\|^{2}
i=1Nπivi(πiμ,σk,πiμ,σk),σikπiμ,σk1+μπμ,σkσk2μ2πμ,σkσk2μ2σkσk12\displaystyle\geq\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\sigma_{i}^{k}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle+\mu\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\sigma^{k}-\sigma^{k-1}\|^{2}
=i=1Nπivi(πiμ,σk,πiμ,σk),σikπiμ,σk1+μ2πμ,σkσk2μ2σkσk12,\displaystyle=\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\sigma_{i}^{k}-\pi_{i}^{\mu,\sigma^{k-1}}\rangle+\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\sigma^{k}-\sigma^{k-1}\|^{2},

where the third inequality follows from (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}) for a,ba,b\in\mathbb{R}. Thus,

πμ,σkσk2\displaystyle\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2} σkσk12+2μi=1Nπivi(πiμ,σk,πiμ,σk),πiμ,σk1σik\displaystyle\leq\|\sigma^{k}-\sigma^{k-1}\|^{2}+\frac{2}{\mu}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\pi_{i}^{\mu,\sigma^{k-1}}-\sigma_{i}^{k}\rangle
σkσk12+2μπμ,σk1σki=1Nπivi(πiμ,σk,πiμ,σk)2\displaystyle\leq\|\sigma^{k}-\sigma^{k-1}\|^{2}+\frac{2}{\mu}\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k}\|\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}})\|^{2}}
σkσk12+2ζμπμ,σk1σk.\displaystyle\leq\|\sigma^{k}-\sigma^{k-1}\|^{2}+\frac{2\zeta}{\mu}\|\pi^{\mu,\sigma^{k-1}}-\sigma^{k}\|.

H.2 Proof of Lemma E.3

Proof of Lemma E.3.

From the first-order optimality condition for πμ,σk\pi^{\mu,\sigma^{k}}, we have:

πivi(πiμ,σk,πiμ,σk)μ(πiμ,σkσik),πiπiμ,σk0.\displaystyle\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}})-\mu\left(\pi_{i}^{\mu,\sigma^{k}}-\sigma_{i}^{k}\right),\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle\leq 0.

Thus, from the three-point identity 2xy,zx=yz2xy2xz22\langle x-y,z-x\rangle=\|y-z\|^{2}-\|x-y\|^{2}-\|x-z\|^{2} and Young’s inequality:

i=1Nπivi(πiμ,σk,πiμ,σk),πiπiμ,σkμi=1Nπiμ,σkσik,πiπiμ,σk\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle\leq\mu\sum_{i=1}^{N}\langle\pi_{i}^{\mu,\sigma^{k}}-\sigma_{i}^{k},\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle
=μ2πσk2μ2πμ,σkσk2μ2ππμ,σk2\displaystyle=\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\pi^{\mu,\sigma^{k}}\|^{2}
=μ2πσk2μ2πμ,σkσk2μ2πσk+1+σk+1πμ,σk2\displaystyle=\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}+\sigma^{k+1}-\pi^{\mu,\sigma^{k}}\|^{2}
=μ2πσk2μ2πμ,σkσk2μ2πσk+12μ2σk+1πμ,σk2μπσk+1,σk+1πμ,σk\displaystyle=\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}\|^{2}-\frac{\mu}{2}\|\sigma^{k+1}-\pi^{\mu,\sigma^{k}}\|^{2}-\mu\langle\pi^{\ast}-\sigma^{k+1},\sigma^{k+1}-\pi^{\mu,\sigma^{k}}\rangle
μ2πσk2μ2πμ,σkσk2μ2πσk+12+μπσk+1σk+1πμ,σk\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}\|^{2}+\mu\|\pi^{\ast}-\sigma^{k+1}\|\|\sigma^{k+1}-\pi^{\mu,\sigma^{k}}\|
μ2πσk2μ2πμ,σkσk2μ2πσk+12+μ64T3πσk+12+32μT32σk+1πμ,σk2.\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}\|^{2}+\frac{\mu}{64T^{3}}\|\pi^{\ast}-\sigma^{k+1}\|^{2}+\frac{32\mu T^{3}}{2}\|\sigma^{k+1}-\pi^{\mu,\sigma^{k}}\|^{2}.

Here, since Tσmax(6ln2ln(2ημ)lnT+2ln64ln2ln(2ημ),1)T_{\sigma}\geq\max(\frac{6}{\ln 2-\ln(2-\eta\mu)}\ln T+\frac{2\ln 64}{\ln 2-\ln(2-\eta\mu)},1), we have:

(1ημ2)Tσ(1ημ2)3ln2ln(2ημ)lnT(1ημ2)ln64ln(1ημ2)=64(1ημ2)lnT3ln(2ημ)ln2=64T3.\displaystyle\left(1-\frac{\eta\mu}{2}\right)^{-T_{\sigma}}\geq\left(1-\frac{\eta\mu}{2}\right)^{-\frac{3}{\ln 2-\ln(2-\eta\mu)}\ln T}\left(1-\frac{\eta\mu}{2}\right)^{\frac{\ln 64}{\ln\left(1-\frac{\eta\mu}{2}\right)}}=64\left(1-\frac{\eta\mu}{2}\right)^{\frac{\ln T^{3}}{\ln(2-\eta\mu)-\ln 2}}=64T^{3}.

Therefore, we get from Lemma B.2:

i=1Nπivi(πiμ,σk,πiμ,σk),πiπiμ,σk\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle
μ2πσk2μ2πμ,σkσk2μ2πσk+12+μ64T3πσk+12+32μT32σkπμ,σk2(1ημ2)Tσ\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}\|^{2}+\frac{\mu}{64T^{3}}\|\pi^{\ast}-\sigma^{k+1}\|^{2}+\frac{32\mu T^{3}}{2}\|\sigma^{k}-\pi^{\mu,\sigma^{k}}\|^{2}\left(1-\frac{\eta\mu}{2}\right)^{T_{\sigma}}
μ2πσk2μ2πμ,σkσk2μ2πσk+12+μ64T3πσk+12+μ4σkπμ,σk2\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}\|^{2}+\frac{\mu}{64T^{3}}\|\pi^{\ast}-\sigma^{k+1}\|^{2}+\frac{\mu}{4}\|\sigma^{k}-\pi^{\mu,\sigma^{k}}\|^{2}
=μ2πσk2μ2πσk+12μ4πμ,σkσk2+μ64T3πσk+12.\displaystyle=\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}\|^{2}-\frac{\mu}{4}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}+\frac{\mu}{64T^{3}}\|\pi^{\ast}-\sigma^{k+1}\|^{2}.

Summing up this inequality from k=0k=0 to K1K-1 yields:

μ2πσ02+μ64T3k=0K1πσk+12μ4k=0K1πμ,σkσk2\displaystyle\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{64T^{3}}\sum_{k=0}^{K-1}\|\pi^{\ast}-\sigma^{k+1}\|^{2}-\frac{\mu}{4}\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}
k=0K1i=1Nπivi(πiμ,σk,πiμ,σk),πiπiμ,σk\displaystyle\geq\sum_{k=0}^{K-1}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle
k=0K1i=1Nπivi(πi,πi),πiπiμ,σk\displaystyle\geq\sum_{k=0}^{K-1}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\ast},\pi_{-i}^{\ast}),\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle
0.\displaystyle\geq 0.

Then, from Cauchy–Schwarz inequality, we have:

μ4k=0K1πμ,σkσk2\displaystyle\frac{\mu}{4}\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2} μ2πσ02+μ64T3k=0K1πσk+12\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{64T^{3}}\sum_{k=0}^{K-1}\|\pi^{\ast}-\sigma^{k+1}\|^{2}
μ2πσ02+μ64T3k=0K1(τ=0kστ+1στ+πσ0)2\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{64T^{3}}\sum_{k=0}^{K-1}\left(\sum_{\tau=0}^{k}\|\sigma^{\tau+1}-\sigma^{\tau}\|+\|\pi^{\ast}-\sigma^{0}\|\right)^{2}
μ2πσ02+μ32T3k=0K1((τ=0kστ+1στ)2+πσ02)\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{32T^{3}}\sum_{k=0}^{K-1}\left(\left(\sum_{\tau=0}^{k}\|\sigma^{\tau+1}-\sigma^{\tau}\|\right)^{2}+\|\pi^{\ast}-\sigma^{0}\|^{2}\right)
μ2πσ02+μ32T3k=0K1(Kτ=0kστ+1στ2+πσ02)\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{32T^{3}}\sum_{k=0}^{K-1}\left(K\sum_{\tau=0}^{k}\|\sigma^{\tau+1}-\sigma^{\tau}\|^{2}+\|\pi^{\ast}-\sigma^{0}\|^{2}\right)
μ2πσ02+μ32T3(K2k=0K1σk+1σk2+Kπσ02)\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{32T^{3}}\left(K^{2}\sum_{k=0}^{K-1}\|\sigma^{k+1}-\sigma^{k}\|^{2}+K\|\pi^{\ast}-\sigma^{0}\|^{2}\right)
μ2πσ02+μ32T3(K2k=0K1(σk+1πμ,σk+πμ,σkσk)2+Kπσ02).\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{32T^{3}}\left(K^{2}\sum_{k=0}^{K-1}\left(\|\sigma^{k+1}-\pi^{\mu,\sigma^{k}}\|+\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|\right)^{2}+K\|\pi^{\ast}-\sigma^{0}\|^{2}\right).

By applying Lemma B.2 to the above inequality, we get:

μ4k=0K1πμ,σkσk2\displaystyle\frac{\mu}{4}\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}
μ2πσ02+μ32T3(K2k=0K1(σkπμ,σk(1ημ2)Tσ2+πμ,σkσk)2+Kπσ02)\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{32T^{3}}\left(K^{2}\sum_{k=0}^{K-1}\left(\|\sigma^{k}-\pi^{\mu,\sigma^{k}}\|\left(1-\frac{\eta\mu}{2}\right)^{\frac{T_{\sigma}}{2}}+\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|\right)^{2}+K\|\pi^{\ast}-\sigma^{0}\|^{2}\right)
μ2πσ02+μ32T3(K2k=0K1(2πμ,σkσk)2+Kπσ02)\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{32T^{3}}\left(K^{2}\sum_{k=0}^{K-1}\left(2\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|\right)^{2}+K\|\pi^{\ast}-\sigma^{0}\|^{2}\right)
=μ2πσ02+μ32T3(4K2k=0K1πμ,σkσk2+Kπσ02)\displaystyle=\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{32T^{3}}\left(4K^{2}\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}+K\|\pi^{\ast}-\sigma^{0}\|^{2}\right)
μ2πσ02+μ32T3(4T2k=0K1πμ,σkσk2+Tπσ02)\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{32T^{3}}\left(4T^{2}\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}+T\|\pi^{\ast}-\sigma^{0}\|^{2}\right)
μ(12+1T2)πσ02+μ8Tk=0K1πμ,σkσk2\displaystyle\leq\mu\left(\frac{1}{2}+\frac{1}{T^{2}}\right)\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{8T}\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}
2μπσ02+μ8k=0K1πμ,σkσk2.\displaystyle\leq 2\mu\|\pi^{\ast}-\sigma^{0}\|^{2}+\frac{\mu}{8}\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}.

Therefore, for K1K\geq 1, we get:

k=0K1πμ,σkσk2\displaystyle\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2} 16πσ02.\displaystyle\leq 16\|\pi^{\ast}-\sigma^{0}\|^{2}.
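
As a quick numerical sanity check (not part of the proof) of the requirement on TσT_{\sigma} invoked earlier in this proof, the following Python snippet verifies over a small grid of parameters that the lower bound on TσT_{\sigma} indeed implies (1ημ2)Tσ64T3\left(1-\frac{\eta\mu}{2}\right)^{-T_{\sigma}}\geq 64T^{3}; the parameter values below are arbitrary.

import numpy as np

for eta in (0.01, 0.05, 0.1):
    for mu in (0.1, 0.5, 1.0):
        c = np.log(2.0) - np.log(2.0 - eta * mu)  # equals -log(1 - eta*mu/2) > 0
        for T in (10, 100, 1000, 10000):
            T_sigma = max(6.0 * np.log(T) / c + 2.0 * np.log(64.0) / c, 1.0)
            lhs = (1.0 - eta * mu / 2.0) ** (-T_sigma)
            assert lhs >= 64.0 * T ** 3, (eta, mu, T)
print("T_sigma threshold check passed")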

H.3 Proof of Lemma E.4

Proof of Lemma E.4.

From the first-order optimality condition for πμ,σk\pi^{\mu,\sigma^{k}}, we have:

πivi(πiμ,σk,πiμ,σk)μ(πiμ,σkσik),πiπiμ,σk0.\displaystyle\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}})-\mu\left(\pi_{i}^{\mu,\sigma^{k}}-\sigma_{i}^{k}\right),\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle\leq 0.

Thus, from the three-point identity 2ab,ca=bc2ab2ac22\langle a-b,c-a\rangle=\|b-c\|^{2}-\|a-b\|^{2}-\|a-c\|^{2} and Young’s inequality:

i=1Nπivi(πiμ,σk,πiμ,σk),πiπiμ,σkμi=1Nπiμ,σkσik,πiπiμ,σk\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle\leq\mu\sum_{i=1}^{N}\langle\pi_{i}^{\mu,\sigma^{k}}-\sigma_{i}^{k},\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle
=μ2πσk2μ2πμ,σkσk2μ2ππμ,σk2\displaystyle=\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\pi^{\mu,\sigma^{k}}\|^{2}
=μ2πσk2μ2πμ,σkσk2μ2πσk+1+σk+1πμ,σk2\displaystyle=\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}+\sigma^{k+1}-\pi^{\mu,\sigma^{k}}\|^{2}
=μ2πσk2μ2πμ,σkσk2μ2πσk+12μ2σk+1πμ,σk2μπσk+1,σk+1πμ,σk\displaystyle=\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}\|^{2}-\frac{\mu}{2}\|\sigma^{k+1}-\pi^{\mu,\sigma^{k}}\|^{2}-\mu\langle\pi^{\ast}-\sigma^{k+1},\sigma^{k+1}-\pi^{\mu,\sigma^{k}}\rangle
μ2πσk2μ2πμ,σkσk2μ2πσk+12+μπσk+1σk+1πμ,σk\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}\|^{2}+\mu\|\pi^{\ast}-\sigma^{k+1}\|\|\sigma^{k+1}-\pi^{\mu,\sigma^{k}}\|
μ2πσk2μ2πμ,σkσk2μ2πσk+12+μdiam(𝒳)σk+1πμ,σk\displaystyle\leq\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}\|^{2}+\mu\cdot\mathrm{diam}(\mathcal{X})\|\sigma^{k+1}-\pi^{\mu,\sigma^{k}}\|

Since Tσmax(T4/5+2,3)T_{\sigma}\geq\max(T^{4/5}+2,3), we get from Lemma B.7:

𝔼[i=1Nπivi(πiμ,σk,πiμ,σk),πiπiμ,σk]\displaystyle\mathbb{E}\left[\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle\right]
𝔼[μ2πσk2μ2πμ,σkσk2μ2πσk+12]\displaystyle\leq\mathbb{E}\left[\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}\|^{2}\right]
+μdiam(𝒳)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θ(Tσ1)+1)+12θ)ρ(κ(Tσ1)+2θ)\displaystyle+\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(T_{\sigma}-1)+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa(T_{\sigma}-1)+2\theta)}}
𝔼[μ2πσk2μ2πμ,σkσk2μ2πσk+12]\displaystyle\leq\mathbb{E}\left[\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}-\frac{\mu}{2}\|\pi^{\ast}-\sigma^{k+1}\|^{2}\right]
+μdiam(𝒳)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ).\displaystyle+\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}.

Summing up this inequality from k=0k=0 to K1K-1 and taking its expectation yields:

μ2πσ02𝔼[μ2k=0K1πμ,σkσk2]\displaystyle\frac{\mu}{2}\|\pi^{\ast}-\sigma^{0}\|^{2}-\mathbb{E}\left[\frac{\mu}{2}\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\right]
+Kμdiam(𝒳)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ)\displaystyle+K\mu\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}
𝔼[k=0K1i=1Nπivi(πiμ,σk,πiμ,σk),πiπiμ,σk]\displaystyle\geq\mathbb{E}\left[\sum_{k=0}^{K-1}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}}),\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle\right]
𝔼[k=0K1i=1Nπivi(πi,πi),πiπiμ,σk]\displaystyle\geq\mathbb{E}\left[\sum_{k=0}^{K-1}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\ast},\pi_{-i}^{\ast}),\pi_{i}^{\ast}-\pi_{i}^{\mu,\sigma^{k}}\rangle\right]
0.\displaystyle\geq 0.

Therefore, for K1K\geq 1, we get:

𝔼[k=0K1πμ,σkσk2]πσ02+Kdiam(𝒳)ρ(2θκ)diam(𝒳)2+NC2(1κln(κ2θT+1)+12θ)ρ(κT4/5+2θ).\displaystyle\mathbb{E}\left[\sum_{k=0}^{K-1}\|\pi^{\mu,\sigma^{k}}-\sigma^{k}\|^{2}\right]\leq\|\pi^{\ast}-\sigma^{0}\|^{2}+K\cdot\mathrm{diam}(\mathcal{X})\cdot\sqrt{\frac{\rho(2\theta-\kappa)\mathrm{diam}(\mathcal{X})^{2}+NC^{2}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}T+1\right)+\frac{1}{2\theta}\right)}{\rho(\kappa T^{4/5}+2\theta)}}.

H.4 Proof of Lemma F.1

Proof of Lemma F.1.

We first decompose the inequality in the assumption as follows:

Dψ(πiμ,σk,πit+1)Dψ(πiμ,σk,πit)+Dψ(πit+1,πit)\displaystyle D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t+1})-D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})
ηπivi(πit,πit)μπiG(πit,σik),πit+1πiμ,σk\displaystyle\leq\eta\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
=ηπivi(πit,πit),πit+1πiμ,σk+ημπiG(πit,σik),πitπit+1+ημπiG(πit,σik),πiμ,σkπit.\displaystyle=\eta\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta\mu\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}),\pi_{i}^{t}-\pi_{i}^{t+1}\rangle+\eta\mu\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t}\rangle. (30)

From the relative smoothness in Assumption 5.4 and the convexity of G(,σik)G(\cdot,\sigma_{i}^{k}):

πiG(πit,σik),πitπit+1\displaystyle\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}),\pi_{i}^{t}-\pi_{i}^{t+1}\rangle
G(πit,σik)G(πit+1,σik)+βDψ(πit+1,πit)\displaystyle\leq G(\pi_{i}^{t},\sigma_{i}^{k})-G(\pi_{i}^{t+1},\sigma_{i}^{k})+\beta D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})
G(πit,σik)G(πiμ,σk,σik)+πiG(πiμ,σk,σik),πiμ,σkπit+1+βDψ(πit+1,πit).\displaystyle\leq G(\pi_{i}^{t},\sigma_{i}^{k})-G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k})+\langle\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle+\beta D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t}). (31)

Also, from the relative strong convexity in Assumption 5.4:

G(πit,σik)G(πiμ,σk,σik)πiG(πit,σik),πitπiμ,σkγDψ(πiμ,σk,πit).\displaystyle G(\pi_{i}^{t},\sigma_{i}^{k})-G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k})\leq\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{k}}\rangle-\gamma D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t}). (32)

By combining (30), (31), and (32), we have:

Dψ(πiμ,σk,πit+1)Dψ(πiμ,σk,πit)+Dψ(πit+1,πit)\displaystyle D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t+1})-D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})
ηπivi(πit,πit),πit+1πiμ,σk+ημπiG(πiμ,σk,σik),πiμ,σkπit+1\displaystyle\leq\eta\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta\mu\langle\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle
ημγDψ(πiμ,σk,πit)+ημβDψ(πit+1,πit),\displaystyle-\eta\mu\gamma D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+\eta\mu\beta D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t}),

and then:

Dψ(πiμ,σk,πit+1)(1ημγ)Dψ(πiμ,σk,πit)+(1ημβ)Dψ(πit+1,πit)\displaystyle D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t+1})-(1-\eta\mu\gamma)D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+(1-\eta\mu\beta)D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})
ηπivi(πit,πit),πit+1πiμ,σk+ημπiG(πiμ,σk,σik),πiμ,σkπit+1\displaystyle\leq\eta\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta\mu\langle\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle
=ηπivi(πit+1,πit+1),πit+1πiμ,σk+ημπiG(πiμ,σk,σik),πiμ,σkπit+1\displaystyle=\eta\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta\mu\langle\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle
+ηπivi(πit,πit)πivi(πit+1,πit+1),πit+1πiμ,σk.\displaystyle+\eta\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle.

Summing this inequality from i=1i=1 to NN implies that:

Dψ(πμ,σk,πt+1)(1ημγ)Dψ(πμ,σk,πt)+(1ημβ)Dψ(πt+1,πt)\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})-(1-\eta\mu\gamma)D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+(1-\eta\mu\beta)D_{\psi}(\pi^{t+1},\pi^{t})
ηi=1Nπivi(πit+1,πit+1),πit+1πiμ,σk+ημi=1NπiG(πiμ,σk,σik),πiμ,σkπit+1\displaystyle\leq\eta\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta\mu\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle
+ηi=1Nπivi(πit,πit)πivi(πit+1,πit+1),πit+1πiμ,σk\displaystyle+\eta\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
ηi=1Nπivi(πiμ,σk,πiμ,σk)μπiG(πiμ,σk,σik),πit+1πiμ,σk\displaystyle\leq\eta\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
+ηi=1Nπivi(πit,πit)πivi(πit+1,πit+1),πit+1πiμ,σk\displaystyle+\eta\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
ηi=1Nπivi(πit,πit)πivi(πit+1,πit+1),πit+1πiμ,σk,\displaystyle\leq\eta\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle, (33)

where the second inequality follows from (1), and the third inequality follows from the first-order optimality condition for πμ,σk\pi^{\mu,\sigma^{k}}.

Here, from Young’s inequality, we have for any λ>0\lambda>0:

i=1Nπivi(πit,πit)πivi(πit+1,πit+1),πit+1πiμ,σk\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
=i=1Nπivi(πit,πit)πivi(πit+1,πit+1),πit+1πit+i=1Nπivi(πit,πit)πivi(πit+1,πit+1),πitπiμ,σk\displaystyle=\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t+1}-\pi_{i}^{t}\rangle+\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{k}}\rangle
λi=1Nπivi(πit+1,πit+1)πivi(πit,πit)2+12λπt+1πt2+12λπtπμ,σk2\displaystyle\leq\lambda\sum_{i=1}^{N}\|\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})\|^{2}+\frac{1}{2\lambda}\|\pi^{t+1}-\pi^{t}\|^{2}+\frac{1}{2\lambda}\|\pi^{t}-\pi^{\mu,\sigma^{k}}\|^{2}
(L2λ+12λ)πt+1πt2+12λπtπμ,σk2\displaystyle\leq\left(L^{2}\lambda+\frac{1}{2\lambda}\right)\|\pi^{t+1}-\pi^{t}\|^{2}+\frac{1}{2\lambda}\|\pi^{t}-\pi^{\mu,\sigma^{k}}\|^{2}
1ρ(2L2λ+1λ)Dψ(πt+1,πt)+1ρλDψ(πμ,σk,πt).\displaystyle\leq\frac{1}{\rho}\left(2L^{2}\lambda+\frac{1}{\lambda}\right)D_{\psi}(\pi^{t+1},\pi^{t})+\frac{1}{\rho\lambda}D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t}). (34)

where the second inequality follows from (2), and the third inequality follows from the strong convexity of ψ\psi.

By combining (33) and (34), we get:

Dψ(πμ,σk,πt+1)(1η(μγ1ρλ))Dψ(πμ,σk,πt)(1η(μβ+2L2λρ+1ρλ))Dψ(πt+1,πt).\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})\leq\left(1-\eta\left(\mu\gamma-\frac{1}{\rho\lambda}\right)\right)D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})-\left(1-\eta\left(\mu\beta+\frac{2L^{2}\lambda}{\rho}+\frac{1}{\rho\lambda}\right)\right)D_{\psi}(\pi^{t+1},\pi^{t}).

By setting λ=2μγρ\lambda=\frac{2}{\mu\gamma\rho},

Dψ(πμ,σk,πt+1)\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1}) (1ημγ2)Dψ(πμ,σk,πt)(1η(μ(γ+2β)2+4L2μγρ2))Dψ(πt+1,πt)\displaystyle\leq\left(1-\frac{\eta\mu\gamma}{2}\right)D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})-\left(1-\eta\left(\frac{\mu(\gamma+2\beta)}{2}+\frac{4L^{2}}{\mu\gamma\rho^{2}}\right)\right)D_{\psi}(\pi^{t+1},\pi^{t})
=(1ημγ2)Dψ(πμ,σk,πt)(1η(μ2γρ2(γ+2β)+8L22μγρ2))Dψ(πt+1,πt).\displaystyle=\left(1-\frac{\eta\mu\gamma}{2}\right)D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})-\left(1-\eta\left(\frac{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}}{2\mu\gamma\rho^{2}}\right)\right)D_{\psi}(\pi^{t+1},\pi^{t}).

Thus, when η2μγρ2μ2γρ2(γ+2β)+8L2<2μγ\eta\leq\frac{2\mu\gamma\rho^{2}}{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}}<\frac{2}{\mu\gamma}, we have for any t{kTσ,kTσ+1,,(k+1)Tσ1}t\in\{kT_{\sigma},kT_{\sigma}+1,\cdots,(k+1)T_{\sigma}-1\}:

Dψ(πμ,σk,πt+1)Dψ(πμ,σk,πt)(1ημγ2)Dψ(πμ,σk,πkTσ)(1ημγ2)tkTσ+1.\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})\leq D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})\left(1-\frac{\eta\mu\gamma}{2}\right)\leq D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{kT_{\sigma}})\left(1-\frac{\eta\mu\gamma}{2}\right)^{t-kT_{\sigma}+1}.
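
To illustrate the geometric decay just derived, the following minimal Python sketch (an illustration, not part of the formal argument) instantiates the bound in a setting where πμ,σ\pi^{\mu,\sigma} is available in closed form: a bilinear zero-sum game v1(x,y)=xyv_{1}(x,y)=xy, v2=v1v_{2}=-v_{1} on [1,1]2[-1,1]^{2}, with both ψ\psi and GG taken to be squared-Euclidean (so β=γ=ρ=1\beta=\gamma=\rho=1 and L=1L=1), a fixed slingshot σ\sigma, and a step size satisfying the condition above. All numerical constants are arbitrary choices for the example.

import numpy as np

mu, eta = 0.5, 0.1              # eta <= 2*mu*gamma*rho^2 / (mu^2*gamma*rho^2*(gamma+2*beta) + 8*L^2)
sigma = np.array([0.9, -0.8])   # fixed slingshot strategy (sigma_1, sigma_2)

# Perturbed stationary point pi^{mu,sigma} of the bilinear game, in closed form (interior solution).
x_star = (mu**2 * sigma[0] + mu * sigma[1]) / (1.0 + mu**2)
y_star = (mu**2 * sigma[1] - mu * sigma[0]) / (1.0 + mu**2)
pi_mu_sigma = np.array([x_star, y_star])

pi = sigma.copy()
D0 = 0.5 * np.sum((pi_mu_sigma - pi) ** 2)   # D_psi(pi^{mu,sigma}, pi^0) with psi = (1/2)||.||^2
for t in range(1, 201):
    gx = pi[1] - mu * (pi[0] - sigma[0])     # perturbed gradient for player 1
    gy = -pi[0] - mu * (pi[1] - sigma[1])    # perturbed gradient for player 2
    pi = np.clip(pi + eta * np.array([gx, gy]), -1.0, 1.0)
    D = 0.5 * np.sum((pi_mu_sigma - pi) ** 2)
    assert D <= D0 * (1.0 - eta * mu / 2.0) ** t + 1e-12  # geometric rate from the display above
print("geometric convergence check passed")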

H.5 Proof of Lemma F.2

Proof of Lemma F.2.

We first decompose the inequality in the assumption as follows:

Dψ(πiμ,σk,πit+1)Dψ(πiμ,σk,πit)+Dψ(πit+1,πit)\displaystyle D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t+1})-D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})
ηtπivi(πit,πit)μπiG(πit,σik)+ξit,πit+1πiμ,σk\displaystyle\leq\eta_{t}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k})+\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
=ηtπivi(πit,πit),πit+1πiμ,σk+ηtμπiG(πit,σik),πitπit+1\displaystyle=\eta_{t}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta_{t}\mu\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}),\pi_{i}^{t}-\pi_{i}^{t+1}\rangle
+ηtμπiG(πit,σik),πiμ,σkπit+ηtξit,πit+1πiμ,σk.\displaystyle\quad+\eta_{t}\mu\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{k}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t}\rangle+\eta_{t}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle. (35)

By combining (31), (32) in Appendix H.4, and (35),

Dψ(πiμ,σk,πit+1)Dψ(πiμ,σk,πit)+Dψ(πit+1,πit)\displaystyle D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t+1})-D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})
ηtπivi(πit,πit),πit+1πiμ,σk+ηtμπiG(πiμ,σk,σik),πiμ,σkπit+1+ηtμβDψ(πit+1,πit)\displaystyle\leq\eta_{t}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta_{t}\mu\langle\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k}),\pi_{i}^{\mu,\sigma^{k}}-\pi_{i}^{t+1}\rangle+\eta_{t}\mu\beta D_{\psi}(\pi_{i}^{t+1},\pi_{i}^{t})
ηtμγDψ(πiμ,σk,πit)+ηtξit,πit+1πiμ,σk.\displaystyle\quad-\eta_{t}\mu\gamma D_{\psi}(\pi_{i}^{\mu,\sigma^{k}},\pi_{i}^{t})+\eta_{t}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle.

Summing up these inequalities with respect to the player index,

Dψ(πμ,σk,πt+1)Dψ(πμ,σk,πt)+Dψ(πt+1,πt)\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})-D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+D_{\psi}(\pi^{t+1},\pi^{t})
ηti=1Nπivi(πit,πit)μπiG(πiμ,σk,σik),πit+1πiμ,σk+ηtμβDψ(πt+1,πt)\displaystyle\leq\eta_{t}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta_{t}\mu\beta D_{\psi}(\pi^{t+1},\pi^{t})
ηtμγDψ(πμ,σk,πt)+ηti=1Nξit,πit+1πiμ,σk\displaystyle\quad-\eta_{t}\mu\gamma D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
=i=1Nηtπivi(πit+1,πit+1)μπiG(πiμ,σk,σik),πit+1πiμ,σkηtμγDψ(πμ,σk,πt)+ηtμβDψ(πt+1,πt)\displaystyle=\sum_{i=1}^{N}\eta_{t}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle-\eta_{t}\mu\gamma D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+\eta_{t}\mu\beta D_{\psi}(\pi^{t+1},\pi^{t})
+ηti=1Nπivi(πit,πit)πivi(πit+1,πit+1),πit+1πiμ,σk+ηti=1Nξit,πit+1πiμ,σk\displaystyle+\eta_{t}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
i=1Nηtπivi(πiμ,σk,πiμ,σk)μπiG(πiμ,σk,σik),πit+1πiμ,σkηtμγDψ(πμ,σk,πt)+ηtμβDψ(πt+1,πt)\displaystyle\leq\sum_{i=1}^{N}\eta_{t}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{k}},\pi_{-i}^{\mu,\sigma^{k}})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{k}},\sigma_{i}^{k}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle-\eta_{t}\mu\gamma D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+\eta_{t}\mu\beta D_{\psi}(\pi^{t+1},\pi^{t})
+ηti=1Nπivi(πit,πit)πivi(πit+1,πit+1),πit+1πiμ,σk+ηti=1Nξit,πit+1πiμ,σk\displaystyle+\eta_{t}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
ηtμγDψ(πμ,σk,πt)+ηtμβDψ(πt+1,πt)\displaystyle\leq-\eta_{t}\mu\gamma D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+\eta_{t}\mu\beta D_{\psi}(\pi^{t+1},\pi^{t})
+ηti=1Nπivi(πit,πit)πivi(πit+1,πit+1),πit+1πiμ,σk+ηti=1Nξit,πit+1πiμ,σk,\displaystyle+\eta_{t}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t+1},\pi_{-i}^{t+1}),\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle, (36)

where the second inequality follows from (1), and the third inequality follows from the first-order optimality condition for πμ,σk\pi^{\mu,\sigma^{k}}.

By combining (34) in Appendix H.4 and (36), we have for any λ>0\lambda>0:

Dψ(πμ,σk,πt+1)Dψ(πμ,σk,πt)+Dψ(πt+1,πt)\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})-D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+D_{\psi}(\pi^{t+1},\pi^{t})
ηtμγDψ(πμ,σk,πt)+ηtμβDψ(πt+1,πt)\displaystyle\leq-\eta_{t}\mu\gamma D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+\eta_{t}\mu\beta D_{\psi}(\pi^{t+1},\pi^{t})
+ηtρ(2L2λ+1λ)Dψ(πt+1,πt)+ηtρλDψ(πμ,σk,πt)+ηti=1Nξit,πit+1πiμ,σk.\displaystyle+\frac{\eta_{t}}{\rho}\left(2L^{2}\lambda+\frac{1}{\lambda}\right)D_{\psi}(\pi^{t+1},\pi^{t})+\frac{\eta_{t}}{\rho\lambda}D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle.

By setting λ=2μγρ\lambda=\frac{2}{\mu\gamma\rho},

Dψ(πμ,σk,πt+1)Dψ(πμ,σk,πt)+Dψ(πt+1,πt)\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})-D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+D_{\psi}(\pi^{t+1},\pi^{t})
ηtμγ2Dψ(πμ,σk,πt)+ηt(μ2γρ2(γ+2β)+8L22μγρ2)Dψ(πt+1,πt)+ηti=1Nξit,πit+1πiμ,σk.\displaystyle\leq-\frac{\eta_{t}\mu\gamma}{2}D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+\eta_{t}\left(\frac{\mu^{2}\gamma\rho^{2}(\gamma+2\beta)+8L^{2}}{2\mu\gamma\rho^{2}}\right)D_{\psi}(\pi^{t+1},\pi^{t})+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle.

This concludes the proof. ∎

H.6 Proof of Lemma F.3

Proof of Lemma F.3.

Rearranging the inequality in the assumption,

Dψ(πμ,σk,πt+1)\displaystyle D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1}) (1κηt)Dψ(πμ,σk,πt)(1ηtθ)Dψ(πt+1,πt)+ηti=1Nξit,πit+1πiμ,σk\displaystyle\leq(1-\kappa\eta_{t})D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})-(1-\eta_{t}\theta)D_{\psi}(\pi^{t+1},\pi^{t})+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{\mu,\sigma^{k}}\rangle
=(1κηt)Dψ(πμ,σk,πt)(1ηtθ)Dψ(πt+1,πt)\displaystyle=(1-\kappa\eta_{t})D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})-(1-\eta_{t}\theta)D_{\psi}(\pi^{t+1},\pi^{t})
+ηti=1Nξit,πitπiμ,σk+ηti=1Nξit,πit+1πit.\displaystyle\phantom{=}+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{k}}\rangle+\eta_{t}\sum_{i=1}^{N}\langle\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{t}\rangle.

By taking the expectation conditioned on t\mathcal{F}_{t} for both sides and using Assumption B.5 (a),

𝔼[Dψ(πμ,σk,πt+1)|t]\displaystyle\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})|\mathcal{F}_{t}] (1κηt)Dψ(πμ,σk,πt)(1ηtθ)𝔼[Dψ(πt+1,πt)|t]+i=1N𝔼[ηtξit,πit+1πit|t]\displaystyle\leq(1-\kappa\eta_{t})D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})-(1-\eta_{t}\theta)\mathbb{E}[D_{\psi}(\pi^{t+1},\pi^{t})|\mathcal{F}_{t}]+\sum_{i=1}^{N}\mathbb{E}[\langle\eta_{t}\xi_{i}^{t},\pi_{i}^{t+1}-\pi_{i}^{t}\rangle|\mathcal{F}_{t}]
=(1κηt)Dψ(πμ,σk,πt)(1ηtθ)𝔼[Dψ(πt+1,πt)|t]\displaystyle=(1-\kappa\eta_{t})D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})-(1-\eta_{t}\theta)\mathbb{E}[D_{\psi}(\pi^{t+1},\pi^{t})|\mathcal{F}_{t}]
+i=1N𝔼[ηtξitρ(1ηtθ),ρ(1ηtθ)(πit+1πit)|t]\displaystyle\phantom{=}+\sum_{i=1}^{N}\mathbb{E}\left[\left\langle\frac{\eta_{t}\xi_{i}^{t}}{\sqrt{\rho(1-\eta_{t}\theta)}},\sqrt{\rho(1-\eta_{t}\theta)}(\pi_{i}^{t+1}-\pi_{i}^{t})\right\rangle|\mathcal{F}_{t}\right]
(1κηt)Dψ(πμ,σk,πt)(1ηtθ)𝔼[Dψ(πt+1,πt)|t]\displaystyle\leq(1-\kappa\eta_{t})D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})-(1-\eta_{t}\theta)\mathbb{E}[D_{\psi}(\pi^{t+1},\pi^{t})|\mathcal{F}_{t}]
+ηt22ρ(1ηtθ)i=1N𝔼[ξit2|t]+ρ(1ηtθ)2𝔼[πt+1πt2|t]\displaystyle\phantom{=}+\frac{\eta_{t}^{2}}{2\rho(1-\eta_{t}\theta)}\sum_{i=1}^{N}\mathbb{E}[\|\xi_{i}^{t}\|^{2}|\mathcal{F}_{t}]+\frac{\rho(1-\eta_{t}\theta)}{2}\mathbb{E}[\|\pi^{t+1}-\pi^{t}\|^{2}|\mathcal{F}_{t}]
(1κηt)Dψ(πμ,σk,πt)+ηt22ρ(1ηtθ)i=1N𝔼[ξit2|t]\displaystyle\leq(1-\kappa\eta_{t})D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+\frac{\eta_{t}^{2}}{2\rho(1-\eta_{t}\theta)}\sum_{i=1}^{N}\mathbb{E}[\|\xi_{i}^{t}\|^{2}|\mathcal{F}_{t}]
(11tkTσ+2θ/κ)Dψ(πμ,σk,πt)+ηt2ρi=1N𝔼[ξit2|t].\displaystyle\leq\left(1-\frac{1}{t-kT_{\sigma}+2\theta/\kappa}\right)D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})+\frac{\eta_{t}^{2}}{\rho}\sum_{i=1}^{N}\mathbb{E}[\|\xi_{i}^{t}\|^{2}|\mathcal{F}_{t}].

Therefore, rearranging and taking the expectations,

(tkTσ+2θ/κ)𝔼[Dψ(πμ,σk,πt+1)](tkTσ1+2θ/κ)𝔼[Dψ(πμ,σk,πt)]+NC2ρκ(κ(tkTσ)+2θ).\displaystyle(t-kT_{\sigma}+2\theta/\kappa)\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})]\leq(t-kT_{\sigma}-1+2\theta/\kappa)\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t})]+\frac{NC^{2}}{\rho\kappa(\kappa(t-kT_{\sigma})+2\theta)}.

Telescoping the sum,

(tkTσ+2θ/κ)𝔼[Dψ(πμ,σk,πt+1)](2θ/κ1)𝔼[Dψ(πμ,σk,πkTσ)]+NC2ρκs=kTσt1κ(skTσ)+2θ\displaystyle(t-kT_{\sigma}+2\theta/\kappa)\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})]\leq(2\theta/\kappa-1)\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{kT_{\sigma}})]+\frac{NC^{2}}{\rho\kappa}\sum_{s=kT_{\sigma}}^{t}\frac{1}{\kappa(s-kT_{\sigma})+2\theta}
\displaystyle\Longleftrightarrow\quad 𝔼[Dψ(πμ,σk,πt+1)]2θκκ(tkTσ)+2θ𝔼[Dψ(πμ,σk,πkTσ)]+NC2ρ(κ(tkTσ)+2θ)s=0tkTσ1κs+2θ.\displaystyle\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})]\leq\frac{2\theta-\kappa}{\kappa(t-kT_{\sigma})+2\theta}\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{kT_{\sigma}})]+\frac{NC^{2}}{\rho(\kappa(t-kT_{\sigma})+2\theta)}\sum_{s=0}^{t-kT_{\sigma}}\frac{1}{\kappa s+2\theta}.

Here, we introduce the following lemma, whose proof is given in Appendix H.12, for the evaluation of the sum.

Lemma H.1.

For any κ,θ0\kappa,\theta\geq 0 and t0t\geq 0,

s=0t1κs+2θ12θ+1κln(κ2θt+1).\displaystyle\sum_{s=0}^{t}\frac{1}{\kappa s+2\theta}\leq\frac{1}{2\theta}+\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}t+1\right).

In summary, we obtain the following inequality:

𝔼[Dψ(πμ,σk,πt+1)]\displaystyle\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{t+1})]
2θκκ(tkTσ)+2θ𝔼[Dψ(πμ,σk,πkTσ)]+NC2ρ(κ(tkTσ)+2θ)(1κln(κ2θ(tkTσ)+1)+12θ).\displaystyle\leq\frac{2\theta-\kappa}{\kappa(t-kT_{\sigma})+2\theta}\mathbb{E}[D_{\psi}(\pi^{\mu,\sigma^{k}},\pi^{kT_{\sigma}})]+\frac{NC^{2}}{\rho(\kappa(t-kT_{\sigma})+2\theta)}\left(\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}(t-kT_{\sigma})+1\right)+\frac{1}{2\theta}\right).

This concludes the proof. ∎

H.7 Proof of Lemma F.4

Proof of Lemma F.4.

Recall that G(πi,πi)=Dψ(πi,πi)G(\pi_{i},\pi_{i}^{\prime})=D_{\psi^{\prime}}(\pi_{i},\pi_{i}^{\prime}) for any i[N]i\in[N] and πi,πi𝒳i\pi_{i},\pi_{i}^{\prime}\in\mathcal{X}_{i}. By the first-order optimality condition for σk+1\sigma^{k+1}, we have for all πΠ\pi^{\ast}\in\Pi^{\ast}:

i=1Nπivi(σik+1,σik+1)μ(ψ(σik+1)ψ(σik)),πiσik+10.\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\sigma_{i}^{k+1},\sigma_{-i}^{k+1})-\mu(\nabla\psi^{\prime}(\sigma_{i}^{k+1})-\nabla\psi^{\prime}(\sigma_{i}^{k})),\pi_{i}^{\ast}-\sigma_{i}^{k+1}\rangle\leq 0.

Then,

i=1Nψ(σik+1)ψ(σik),σik+1πi1μi=1Nσik+1πi,πivi(σik+1,σik+1).\displaystyle\sum_{i=1}^{N}\langle\nabla\psi^{\prime}(\sigma_{i}^{k+1})-\nabla\psi^{\prime}(\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle\leq\frac{1}{\mu}\sum_{i=1}^{N}\langle\sigma_{i}^{k+1}-\pi_{i}^{\ast},\nabla_{\pi_{i}}v_{i}(\sigma_{i}^{k+1},\sigma_{-i}^{k+1})\rangle.

Moreover, we have for any πΠ\pi^{\ast}\in\Pi^{\ast}:

Dψ(π,σk+1)Dψ(π,σk)\displaystyle D_{\psi^{\prime}}(\pi^{\ast},\sigma^{k+1})-D_{\psi^{\prime}}(\pi^{\ast},\sigma^{k})
=i=1N(ψ(πi)ψ(σik+1)ψ(σik+1),πiσik+1ψ(πi)+ψ(σik)+ψ(σik),πiσik)\displaystyle=\sum_{i=1}^{N}\left(\psi^{\prime}(\pi_{i}^{\ast})-\psi^{\prime}(\sigma_{i}^{k+1})-\langle\nabla\psi^{\prime}(\sigma_{i}^{k+1}),\pi_{i}^{\ast}-\sigma_{i}^{k+1}\rangle-\psi^{\prime}(\pi_{i}^{\ast})+\psi^{\prime}(\sigma_{i}^{k})+\langle\nabla\psi^{\prime}(\sigma_{i}^{k}),\pi_{i}^{\ast}-\sigma_{i}^{k}\rangle\right)
=i=1N(ψ(σik+1)+ψ(σik)+ψ(σik),σik+1σikψ(σik+1)ψ(σik),πiσik+1)\displaystyle=\sum_{i=1}^{N}\left(-\psi^{\prime}(\sigma_{i}^{k+1})+\psi^{\prime}(\sigma_{i}^{k})+\langle\nabla\psi^{\prime}(\sigma_{i}^{k}),\sigma_{i}^{k+1}-\sigma_{i}^{k}\rangle-\langle\nabla\psi^{\prime}(\sigma_{i}^{k+1})-\nabla\psi^{\prime}(\sigma_{i}^{k}),\pi_{i}^{\ast}-\sigma_{i}^{k+1}\rangle\right)
=Dψ(σk+1,σk)+i=1Nψ(σik+1)ψ(σik),σik+1πi.\displaystyle=-D_{\psi^{\prime}}(\sigma^{k+1},\sigma^{k})+\sum_{i=1}^{N}\langle\nabla\psi^{\prime}(\sigma_{i}^{k+1})-\nabla\psi^{\prime}(\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle.

By combining these inequalities, we get for any πΠ\pi^{\ast}\in\Pi^{\ast}:

Dψ(π,σk+1)Dψ(π,σk)\displaystyle D_{\psi^{\prime}}(\pi^{\ast},\sigma^{k+1})-D_{\psi^{\prime}}(\pi^{\ast},\sigma^{k}) Dψ(σk+1,σk)+1μi=1Nσik+1πi,πivi(σik+1,σik+1)\displaystyle\leq-D_{\psi^{\prime}}(\sigma^{k+1},\sigma^{k})+\frac{1}{\mu}\sum_{i=1}^{N}\langle\sigma_{i}^{k+1}-\pi_{i}^{\ast},\nabla_{\pi_{i}}v_{i}(\sigma_{i}^{k+1},\sigma_{-i}^{k+1})\rangle
Dψ(σk+1,σk)+1μi=1Nσik+1πi,πivi(πi,πi),\displaystyle\leq-D_{\psi^{\prime}}(\sigma^{k+1},\sigma^{k})+\frac{1}{\mu}\sum_{i=1}^{N}\langle\sigma_{i}^{k+1}-\pi_{i}^{\ast},\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\ast},\pi_{-i}^{\ast})\rangle,

where the second inequality follows from (1). Since π∗ is a Nash equilibrium, from the first-order optimality condition, we get:

i=1Nσik+1πi,πivi(πi,πi)0.\displaystyle\sum_{i=1}^{N}\langle\sigma_{i}^{k+1}-\pi_{i}^{\ast},\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\ast},\pi_{-i}^{\ast})\rangle\leq 0.

Thus, we have for πΠ\pi^{\ast}\in\Pi^{\ast}:

Dψ(π,σk+1)Dψ(π,σk)\displaystyle D_{\psi^{\prime}}(\pi^{\ast},\sigma^{k+1})-D_{\psi^{\prime}}(\pi^{\ast},\sigma^{k}) Dψ(σk+1,σk).\displaystyle\leq-D_{\psi^{\prime}}(\sigma^{k+1},\sigma^{k}).

H.8 Proof of Lemma F.5

Proof of Lemma F.5.

By using the first-order optimality condition for σik+1\sigma_{i}^{k+1}, we have for all π𝒳\pi\in\mathcal{X}:

i=1Nπivi(σik+1,σik+1)μ(πiψ(σik+1)πiψ(σik)),πiσik+10,\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\sigma_{i}^{k+1},\sigma_{-i}^{k+1})-\mu(\nabla_{\pi_{i}}\psi^{\prime}(\sigma_{i}^{k+1})-\nabla_{\pi_{i}}\psi^{\prime}(\sigma_{i}^{k})),\pi_{i}-\sigma_{i}^{k+1}\rangle\leq 0,

and then

i=1Nπivi(σik+1,σik+1),πiσik+1μi=1Nπiψ(σik+1)πiψ(σik),πiσik+1.\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\sigma_{i}^{k+1},\sigma_{-i}^{k+1}),\pi_{i}-\sigma_{i}^{k+1}\rangle\leq\mu\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}\psi^{\prime}(\sigma_{i}^{k+1})-\nabla_{\pi_{i}}\psi^{\prime}(\sigma_{i}^{k}),\pi_{i}-\sigma_{i}^{k+1}\rangle.

Under the assumption that σk+1=σk\sigma^{k+1}=\sigma^{k}, we have for all π𝒳\pi\in\mathcal{X}:

i=1Nπivi(σik+1,σik+1),πiσik+10.\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\sigma_{i}^{k+1},\sigma_{-i}^{k+1}),\pi_{i}-\sigma_{i}^{k+1}\rangle\leq 0.

This is exactly the first-order optimality condition that characterizes a Nash equilibrium π∗ ∈ Π∗. Therefore, σ^{k+1} = σ^k is a Nash equilibrium of the underlying game. ∎

H.9 Proof of Lemma F.6

Proof of Lemma F.6.

First, we prove the first statement of the lemma by using the following lemmas:

Lemma H.2.

Assume that σ^{k+1} = π^{μ,σ^k} for k ≥ 0, and G is one of the following divergences: 1) α-divergence with α ∈ (0,1); 2) Rényi-divergence with α ∈ (0,1); 3) reverse KL divergence. If σ^{k+1} = σ^k, then σ^k is a Nash equilibrium of the underlying game.

Lemma H.3.

Assume that σ^{k+1} = π^{μ,σ^k} for k ≥ 0, and G is one of the following divergences: 1) α-divergence with α ∈ (0,1); 2) Rényi-divergence with α ∈ (0,1); 3) reverse KL divergence. Then, if σ^{k+1} ≠ σ^k, we have for any π∗ ∈ Π∗ and k ≥ 0:

KL(π,σk+1)KL(π,σk)<0.\displaystyle\mathrm{KL}(\pi^{\ast},\sigma^{k+1})-\mathrm{KL}(\pi^{\ast},\sigma^{k})<0.

By the contrapositive of Lemma H.2, if σ^k ∈ 𝒳∖Π∗, then σ^{k+1} ≠ σ^k. Let us define π⋆ = argmin_{π∗∈Π∗} KL(π∗, σ^k). Since σ^{k+1} ≠ σ^k, from Lemma H.3, we have:

minπΠKL(π,σk)=KL(π,σk)>KL(π,σk+1)minπΠKL(π,σk+1).\displaystyle\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k})=\mathrm{KL}(\pi^{\star},\sigma^{k})>\mathrm{KL}(\pi^{\star},\sigma^{k+1})\geq\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k+1}).

Therefore, if σk𝒳Π\sigma^{k}\in\mathcal{X}\setminus\Pi^{\ast} then minπΠKL(π,σk+1)<minπΠKL(π,σk)\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k+1})<\min_{\pi^{\ast}\in\Pi^{\ast}}\mathrm{KL}(\pi^{\ast},\sigma^{k}).

Next, we prove the second statement of the lemma. Assume that there exists σkΠ\sigma^{k}\in\Pi^{\ast} such that σk+1σk\sigma^{k+1}\neq\sigma^{k}. In this case, we can apply Lemma H.3, hence we have KL(π,σk+1)<KL(π,σk)\mathrm{KL}(\pi^{\ast},\sigma^{k+1})<\mathrm{KL}(\pi^{\ast},\sigma^{k}) for all πΠ\pi^{\ast}\in\Pi^{\ast}. On the other hand, since σkΠ\sigma^{k}\in\Pi^{\ast}, there exists a Nash equilibrium πΠ\pi^{\star}\in\Pi^{\ast} such that KL(π,σk)=0\mathrm{KL}(\pi^{\star},\sigma^{k})=0. Therefore, we have KL(π,σk+1)<KL(π,σk)=0\mathrm{KL}(\pi^{\star},\sigma^{k+1})<\mathrm{KL}(\pi^{\star},\sigma^{k})=0, which contradicts KL(π,σk+1)0\mathrm{KL}(\pi^{\star},\sigma^{k+1})\geq 0. Thus, if σkΠ\sigma^{k}\in\Pi^{\ast} then σk+1=σk\sigma^{k+1}=\sigma^{k}. ∎

H.10 Proof of Lemma F.7

Proof of Lemma F.7.

For a given σ𝒳\sigma\in\mathcal{X}, let us consider that πt\pi^{t} follows the following continuous-time dynamics:

πit\displaystyle\pi_{i}^{t} =argmaxπi𝒳i{yit,πiψ(πi)},\displaystyle=\mathop{\rm arg~{}max}\limits_{\pi_{i}\in\mathcal{X}_{i}}\left\{\langle y_{i}^{t},\pi_{i}\rangle-\psi(\pi_{i})\right\}, (37)
\displaystyle y_{ij}^{t}=\int_{0}^{t}\left(\frac{\partial}{\partial\pi_{ij}}v_{i}(\pi_{i}^{s},\pi_{-i}^{s})-\mu\frac{\partial}{\partial\pi_{ij}}G(\pi_{i}^{s},\sigma_{i})\right)ds.

We assume that ψ(πi)=j=1diπijlnπij\psi(\pi_{i})=\sum_{j=1}^{d_{i}}\pi_{ij}\ln\pi_{ij}. Note that this dynamics is the continuous-time version of APFTRL, so clearly πμ,σ\pi^{\mu,\sigma} defined by (4) is the stationary point of (37). We have for a given σ𝒳\sigma^{\prime}\in\mathcal{X} and the associated stationary point πμ,σ=F(σ)\pi^{\mu,\sigma^{\prime}}=F(\sigma^{\prime}):

ddtKL(πμ,σ,πt)\displaystyle\frac{d}{dt}\mathrm{KL}(\pi^{\mu,\sigma^{\prime}},\pi^{t}) =ddtDψ(πμ,σ,πt)\displaystyle=\frac{d}{dt}D_{\psi}(\pi^{\mu,\sigma^{\prime}},\pi^{t})
=i=1Nddt(ψ(πiμ,σ)ψ(πit)yit,πiμ,σπit)\displaystyle=\sum_{i=1}^{N}\frac{d}{dt}\left(\psi(\pi_{i}^{\mu,\sigma^{\prime}})-\psi(\pi_{i}^{t})-\langle y_{i}^{t},\pi_{i}^{\mu,\sigma^{\prime}}-\pi_{i}^{t}\rangle\right)
=i=1Nddt(yit,πitψ(πit)yit,πiμ,σ+ψ(πiμ,σ))\displaystyle=\sum_{i=1}^{N}\frac{d}{dt}\left(\langle y_{i}^{t},\pi_{i}^{t}\rangle-\psi(\pi_{i}^{t})-\langle y_{i}^{t},\pi_{i}^{\mu,\sigma^{\prime}}\rangle+\psi(\pi_{i}^{\mu,\sigma^{\prime}})\right)
=i=1Nddt(ψ(yit)yit,πiμ,σ)\displaystyle=\sum_{i=1}^{N}\frac{d}{dt}\left(\psi^{\ast}(y_{i}^{t})-\langle y_{i}^{t},\pi_{i}^{\mu,\sigma^{\prime}}\rangle\right)
=i=1N(ddtyit,ψ(yit)ddtyit,πiμ,σ)\displaystyle=\sum_{i=1}^{N}\left(\left\langle\frac{d}{dt}y_{i}^{t},\nabla\psi^{\ast}(y_{i}^{t})\right\rangle-\left\langle\frac{d}{dt}y_{i}^{t},\pi_{i}^{\mu,\sigma^{\prime}}\right\rangle\right)
=i=1Nddtyit,ψ(yit)πiμ,σ,\displaystyle=\sum_{i=1}^{N}\left\langle\frac{d}{dt}y_{i}^{t},\nabla\psi^{\ast}(y_{i}^{t})-\pi_{i}^{\mu,\sigma^{\prime}}\right\rangle,

where ψ(yit)=maxπi𝒳i{yit,πiψ(πi)}\psi^{\ast}(y_{i}^{t})=\max_{\pi_{i}\in\mathcal{X}_{i}}\left\{\langle y_{i}^{t},\pi_{i}\rangle-\psi(\pi_{i})\right\}. When ψ(πi)=j=1diπijlnπij\psi(\pi_{i})=\sum_{j=1}^{d_{i}}\pi_{ij}\ln\pi_{ij}, we have

ψ(yit)\displaystyle\psi^{\ast}(y_{i}^{t}) =j=1diyijtexp(yijt)j=1diexp(yijt)j=1diexp(yijt)j=1diexp(yijt)lnexp(yijt)j=1diexp(yijt)\displaystyle=\sum_{j=1}^{d_{i}}y_{ij}^{t}\frac{\exp(y_{ij}^{t})}{\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t})}-\sum_{j=1}^{d_{i}}\frac{\exp(y_{ij}^{t})}{\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t})}\ln\frac{\exp(y_{ij}^{t})}{\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t})}
=lnj=1diexp(yijt)j=1diexp(yijt)j=1diexp(yijt),\displaystyle=\frac{\ln\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t})}{\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t})}\sum_{j=1}^{d_{i}}\exp(y_{ij}^{t}),

and then,

yijψ(yit)\displaystyle\frac{\partial}{\partial y_{ij}}\psi^{\ast}(y_{i}^{t}) =exp(yijt)(j=1diexp(yijt))2j=1diexp(yijt)lnj=1diexp(yijt)(j=1diexp(yijt))2exp(yijt)j=1diexp(yijt)\displaystyle=\frac{\exp(y_{ij}^{t})}{(\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t}))^{2}}\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t})-\frac{\ln\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t})}{(\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t}))^{2}}\exp(y_{ij}^{t})\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t})
+lnj=1diexp(yijt)j=1diexp(yijt)exp(yijt)\displaystyle+\frac{\ln\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t})}{\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t})}\exp(y_{ij}^{t})
=exp(yijt)j=1diexp(yijt)=πijt.\displaystyle=\frac{\exp(y_{ij}^{t})}{\sum_{j^{\prime}=1}^{d_{i}}\exp(y_{ij^{\prime}}^{t})}=\pi_{ij}^{t}.

Therefore, we get ψ(yit)=πit\nabla\psi^{\ast}(y_{i}^{t})=\pi_{i}^{t}. Hence,

ddtKL(πμ,σ,πt)\displaystyle\frac{d}{dt}\mathrm{KL}(\pi^{\mu,\sigma^{\prime}},\pi^{t}) =i=1Nddtyit,πitπiμ,σ\displaystyle=\sum_{i=1}^{N}\left\langle\frac{d}{dt}y_{i}^{t},\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\right\rangle
=i=1Nπivi(πit,πit)μπiG(πit,σi),πitπiμ,σ\displaystyle=\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle
=i=1Nπivi(πit,πit)μπiG(πit,σi),πitπiμ,σ\displaystyle=\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle
+μi=1NπiG(πit,σi)πiG(πit,σi),πitπiμ,σ.\displaystyle+\mu\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime})-\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle. (38)

The first term of (38) can be upper bounded as follows:

i=1Nπivi(πit,πit)μπiG(πit,σi),πitπiμ,σ\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle
i=1Nπivi(πiμ,σ,πiμ,σ)μπiG(πit,σi),πitπiμ,σ\displaystyle\leq\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{\prime}},\pi_{-i}^{\mu,\sigma^{\prime}})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle
=i=1Nπivi(πiμ,σ,πiμ,σ),πitπiμ,σμi=1NπiG(πit,σi),πitπiμ,σ\displaystyle=\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{\prime}},\pi_{-i}^{\mu,\sigma^{\prime}}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle-\mu\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle
=i=1Nπivi(πiμ,σ,πiμ,σ)μπiG(πiμ,σ,σi),πitπiμ,σ\displaystyle=\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\mu,\sigma^{\prime}},\pi_{-i}^{\mu,\sigma^{\prime}})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{\prime}},\sigma_{i}^{\prime}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle
μi=1NπiG(πit,σi)πiG(πiμ,σ,σi),πitπiμ,σ\displaystyle-\mu\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime})-\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{\prime}},\sigma_{i}^{\prime}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle
μi=1NπiG(πit,σi)πiG(πiμ,σ,σi),πitπiμ,σ.\displaystyle\leq-\mu\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime})-\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{\prime}},\sigma_{i}^{\prime}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle.

where the first inequality follows from (1), and the second inequality follows from the first-order optimality condition for π^{μ,σ′}. When G is the α-divergence, its Hessian with respect to the first argument is diagonal and given by:

\displaystyle\nabla^{2}G(\pi_{i},\sigma_{i}^{\prime})=\begin{bmatrix}\frac{(\sigma_{i1}^{\prime})^{1-\alpha}}{(\pi_{i1})^{2-\alpha}}&&\\ &\ddots&\\ &&\frac{(\sigma_{id_{i}}^{\prime})^{1-\alpha}}{(\pi_{id_{i}})^{2-\alpha}}\end{bmatrix},

and thus, since π_{ij}, σ′_{ij} ∈ (0,1] and α ∈ (0,1), its smallest eigenvalue is lower bounded by min_{j∈[d_i]} σ′_{ij}. Therefore,

i=1Nπivi(πit,πit)μπiG(πit,σi),πitπiμ,σ\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{t},\pi_{-i}^{t})-\mu\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle
μi=1NπiG(πit,σi)πiG(πiμ,σ,σi),πitπiμ,σ\displaystyle\leq-\mu\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime})-\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma^{\prime}},\sigma_{i}^{\prime}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle
μ(mini[N],j[di]σij)πtπμ,σ2.\displaystyle\leq-\mu\left(\min_{i\in[N],~{}j\in[d_{i}]}\sigma_{ij}^{\prime}\right)\|\pi^{t}-\pi^{\mu,\sigma^{\prime}}\|^{2}. (39)

On the other hand, by the Cauchy–Schwarz inequality and the compactness of 𝒳_i, the second term of (38) can be upper bounded as:

μi=1NπiG(πit,σi)πiG(πit,σi),πitπiμ,σ\displaystyle\mu\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime})-\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}),\pi_{i}^{t}-\pi_{i}^{\mu,\sigma^{\prime}}\rangle
μdiam(𝒳)i=1NπiG(πit,σi)πiG(πit,σi)2\displaystyle\leq\mu\cdot\mathrm{diam}(\mathcal{X})\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime})-\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i})\|^{2}} (40)

By combining (38), (39), and (40), we get:

ddtKL(πμ,σ,πt)\displaystyle\frac{d}{dt}\mathrm{KL}(\pi^{\mu,\sigma^{\prime}},\pi^{t})
μ(mini[N],j[di]σij)πtπμ,σ2+μdiam(𝒳)i=1NπiG(πit,σi)πiG(πit,σi)2.\displaystyle\leq-\mu\left(\min_{i\in[N],~{}j\in[d_{i}]}\sigma_{ij}^{\prime}\right)\|\pi^{t}-\pi^{\mu,\sigma^{\prime}}\|^{2}+\mu\cdot\mathrm{diam}(\mathcal{X})\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime})-\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i})\|^{2}}.

Recall that π^{μ,σ} is the stationary point of (37). Therefore, by choosing the initial point π^0 = π^{μ,σ}, we have π^t = π^{μ,σ} for all t ≥ 0. In this case, d/dt KL(π^{μ,σ′}, π^t) = 0 for all t ≥ 0, and then:

(mini[N],j[di]σij)πμ,σπμ,σ2diam(𝒳)i=1NπiG(πit,σi)πiG(πit,σi)2.\displaystyle\left(\min_{i\in[N],~{}j\in[d_{i}]}\sigma_{ij}^{\prime}\right)\|\pi^{\mu,\sigma^{\prime}}-\pi^{\mu,\sigma}\|^{2}\leq\mathrm{diam}(\mathcal{X})\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i}^{\prime})-\nabla_{\pi_{i}}G(\pi_{i}^{t},\sigma_{i})\|^{2}}.

For a given ε>0\varepsilon>0, let us define ε=(mini[N],j[di]σij)diam(𝒳)ε2\varepsilon^{\prime}=\frac{\left(\min_{i\in[N],~{}j\in[d_{i}]}\sigma_{ij}^{\prime}\right)}{\mathrm{diam}(\mathcal{X})}\varepsilon^{2}. Since πiG(πiμ,σ,)\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\cdot) is continuous on 𝒳i\mathcal{X}_{i}, for ε\varepsilon^{\prime}, there exists δ>0\delta>0 such that σσ<δi=1NπiG(πiμ,σ,σi)πiG(πiμ,σ,σi)2<ε=(mini[N],j[di]σij)diam(𝒳)ε2\|\sigma^{\prime}-\sigma\|<\delta\Rightarrow\sqrt{\sum_{i=1}^{N}\|\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\sigma_{i}^{\prime})-\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\sigma_{i})\|^{2}}<\varepsilon^{\prime}=\frac{\left(\min_{i\in[N],~{}j\in[d_{i}]}\sigma_{ij}^{\prime}\right)}{\mathrm{diam}(\mathcal{X})}\varepsilon^{2}. Thus, for every ε>0\varepsilon>0, there exists δ>0\delta>0 such that

σσ<δ\displaystyle\|\sigma^{\prime}-\sigma\|<\delta
πμ,σπμ,σdiam(𝒳)(mini[N],j[di]σij)(i=1NπiG(πiμ,σ,σi)πiG(πiμ,σ,σi)2)14<ε.\displaystyle\Rightarrow\|\pi^{\mu,\sigma^{\prime}}-\pi^{\mu,\sigma}\|\leq\sqrt{\frac{\mathrm{diam}(\mathcal{X})}{\left(\min_{i\in[N],~{}j\in[d_{i}]}\sigma_{ij}^{\prime}\right)}}\left(\sum_{i=1}^{N}\|\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\sigma_{i}^{\prime})-\nabla_{\pi_{i}}G(\pi_{i}^{\mu,\sigma},\sigma_{i})\|^{2}\right)^{\frac{1}{4}}<\varepsilon.

This implies that F()F(\cdot) is a continuous function on 𝒳\mathcal{X} when GG is α\alpha-divergence. A similar argument can be applied to Rényi-divergence and reverse KL divergence. ∎
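As a quick numerical sanity check (our own illustration, not part of the proof), the following snippet verifies by finite differences that the gradient of the conjugate ψ*(y) = ln ∑_j exp(y_j) computed above is indeed the softmax of y, i.e., π_i^t:

```python
import numpy as np

def psi_star(y):
    # convex conjugate of the negative-entropy regularizer: log-sum-exp
    return np.log(np.sum(np.exp(y)))

def softmax(y):
    z = np.exp(y - np.max(y))
    return z / np.sum(z)

rng = np.random.default_rng(0)
y = rng.normal(size=5)
eps = 1e-6
grad_fd = np.array([(psi_star(y + eps * e) - psi_star(y - eps * e)) / (2 * eps)
                    for e in np.eye(5)])
assert np.allclose(grad_fd, softmax(y), atol=1e-5)  # grad psi*(y) = softmax(y)
```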

H.11 Proof of Lemma F.8

Proof of Lemma F.8.

First, from the definition of the Bregman divergence, for any πi𝒳i\pi_{i}\in\mathcal{X}_{i}:

\displaystyle D_{\psi}(\pi_{i},T(y_{i}))=\psi(\pi_{i})-\psi(T(y_{i}))-\langle\nabla\psi(T(y_{i})),\pi_{i}-T(y_{i})\rangle. (41)

Recall that 𝒳_i satisfies Aπ_i = b for all π_i ∈ 𝒳_i for some matrix A ∈ ℝ^{k_i×d_i} and b ∈ ℝ^{k_i}. From the assumption on ψ and the first-order optimality condition for the optimization problem argmax_{x∈𝒳_i}{⟨y_i, x⟩ − ψ(x)}, there exists ν ∈ ℝ^{k_i} such that

yiψ(T(yi))=Aν.\displaystyle y_{i}-\nabla\psi(T(y_{i}))=A^{\top}\nu.

Thus, we get:

yi,πiT(yi)\displaystyle\langle y_{i},\pi_{i}-T(y_{i})\rangle =Aν+ψ(T(yi)),πiT(yi)\displaystyle=\langle A^{\top}\nu+\nabla\psi(T(y_{i})),\pi_{i}-T(y_{i})\rangle
=ψ(T(yi)),πiT(yi)+νAπiνAT(yi)\displaystyle=\langle\nabla\psi(T(y_{i})),\pi_{i}-T(y_{i})\rangle+\nu^{\top}A\pi_{i}-\nu^{\top}AT(y_{i})
=ψ(T(yi)),πiT(yi)+νbνb\displaystyle=\langle\nabla\psi(T(y_{i})),\pi_{i}-T(y_{i})\rangle+\nu^{\top}b-\nu^{\top}b
=ψ(T(yi)),πiT(yi).\displaystyle=\langle\nabla\psi(T(y_{i})),\pi_{i}-T(y_{i})\rangle. (42)

By combining (41) and (42), we have:

\displaystyle D_{\psi}(\pi_{i},T(y_{i}))=\psi(\pi_{i})-\psi(T(y_{i}))-\langle y_{i},\pi_{i}-T(y_{i})\rangle.
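As a concrete instance (a minimal illustration under the assumption that 𝒳_i is the probability simplex and ψ is the negative entropy, so that T(y_i) is the softmax map), the identity of Lemma F.8 can be checked numerically:

```python
import numpy as np

def softmax(y):
    z = np.exp(y - np.max(y))
    return z / np.sum(z)

def neg_entropy(p):
    return np.sum(p * np.log(p))

def bregman(p, q):
    # Bregman divergence of negative entropy = KL(p, q)
    return np.sum(p * np.log(p / q))

rng = np.random.default_rng(0)
y = rng.normal(size=4)
pi = rng.dirichlet(np.ones(4))
T_y = softmax(y)
lhs = bregman(pi, T_y)
rhs = neg_entropy(pi) - neg_entropy(T_y) - np.dot(y, pi - T_y)
assert np.isclose(lhs, rhs)  # D_psi(pi, T(y)) = psi(pi) - psi(T(y)) - <y, pi - T(y)>
```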

H.12 Proof of Lemma H.1

Proof of Lemma H.1.

Since 1κs+2θ\frac{1}{\kappa s+2\theta} is a decreasing function for s0s\geq 0, for all s1s\geq 1,

1κs+2θs1s1κx+2θdx.\frac{1}{\kappa s+2\theta}\leq\int_{s-1}^{s}\frac{1}{\kappa x+2\theta}dx.

Using this inequality, we can upper bound the sum as follows.

s=0t1κs+2θ\displaystyle\sum_{s=0}^{t}\frac{1}{\kappa s+2\theta} =12θ+s=1t1κs+2θ\displaystyle=\frac{1}{2\theta}+\sum_{s=1}^{t}\frac{1}{\kappa s+2\theta}
12θ+s=1ts1s1κx+2θdx\displaystyle\leq\frac{1}{2\theta}+\sum_{s=1}^{t}\int_{s-1}^{s}\frac{1}{\kappa x+2\theta}dx
=12θ+0t1κx+2θdx\displaystyle=\frac{1}{2\theta}+\int_{0}^{t}\frac{1}{\kappa x+2\theta}dx
=12θ+1κ0t1x+2θκdx\displaystyle=\frac{1}{2\theta}+\frac{1}{\kappa}\int_{0}^{t}\frac{1}{x+\frac{2\theta}{\kappa}}dx
=12θ+1κ2θκt+2θκ1udu(u=x+2θκ)\displaystyle=\frac{1}{2\theta}+\frac{1}{\kappa}\int_{\frac{2\theta}{\kappa}}^{t+\frac{2\theta}{\kappa}}\frac{1}{u}du\qquad\left(u=x+\frac{2\theta}{\kappa}\right)
=12θ+1κln(κ2θt+1).\displaystyle=\frac{1}{2\theta}+\frac{1}{\kappa}\ln\left(\frac{\kappa}{2\theta}t+1\right).

This concludes the proof. ∎
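The bound of Lemma H.1 can also be checked numerically; the following snippet (not part of the proof) compares both sides for a few parameter choices:

```python
import numpy as np

# Check: sum_{s=0}^t 1/(kappa*s + 2*theta)
#        <= 1/(2*theta) + (1/kappa) * ln(kappa*t/(2*theta) + 1)
for kappa, theta, t in [(0.5, 1.0, 100), (2.0, 0.1, 1000), (1.0, 3.0, 10)]:
    s = np.arange(t + 1)
    lhs = np.sum(1.0 / (kappa * s + 2 * theta))
    rhs = 1.0 / (2 * theta) + np.log(kappa * t / (2 * theta) + 1) / kappa
    assert lhs <= rhs + 1e-12, (kappa, theta, t, lhs, rhs)
```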

H.13 Proof of Lemma H.2

Proof of Lemma H.2.

By using the first-order optimality condition for σik+1\sigma_{i}^{k+1}, we have for all π𝒳\pi\in\mathcal{X}:

i=1Nπivi(σik+1,σik+1)μπiG(σik+1,σik),πiσik+10,\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\sigma_{i}^{k+1},\sigma_{-i}^{k+1})-\mu\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\pi_{i}-\sigma_{i}^{k+1}\rangle\leq 0,

and then

i=1Nπivi(σik+1,σik+1),πiσik+1μi=1NπiG(σik+1,σik),πiσik+1.\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\sigma_{i}^{k+1},\sigma_{-i}^{k+1}),\pi_{i}-\sigma_{i}^{k+1}\rangle\leq\mu\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\pi_{i}-\sigma_{i}^{k+1}\rangle.

When GG is α\alpha-divergence, we have for all π𝒳\pi\in\mathcal{X}:

i=1NπiG(σik+1,σik),πiσik+1\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\pi_{i}-\sigma_{i}^{k+1}\rangle =11αi=1Nj=1di(σijk+1πij)(σijkσijk+1)1α\displaystyle=\frac{1}{1-\alpha}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1}-\pi_{ij})\left(\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)^{1-\alpha}
=11αi=1Nj=1di(σijk+1πij)=0,\displaystyle=\frac{1}{1-\alpha}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1}-\pi_{ij})=0,

where we use the assumption that σk+1=σk\sigma^{k+1}=\sigma^{k} and 𝒳i=Δdi\mathcal{X}_{i}=\Delta^{d_{i}}. Similarly, when GG is Rényi-divergence, we have for all π𝒳\pi\in\mathcal{X}:

i=1NπiG(σik+1,σik),πiσik+1\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\pi_{i}-\sigma_{i}^{k+1}\rangle =α1αi=1N1j=1di(σijk+1)α(σijk)1αj=1di(σijk+1πij)(σijkσijk+1)1α\displaystyle=\frac{\alpha}{1-\alpha}\sum_{i=1}^{N}\frac{1}{\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1})^{\alpha}(\sigma_{ij}^{k})^{1-\alpha}}\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1}-\pi_{ij})\left(\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)^{1-\alpha}
=α1αi=1N1j=1di(σijk+1)α(σijk)1αj=1di(σijk+1πij)=0.\displaystyle=\frac{\alpha}{1-\alpha}\sum_{i=1}^{N}\frac{1}{\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1})^{\alpha}(\sigma_{ij}^{k})^{1-\alpha}}\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1}-\pi_{ij})=0.

Furthermore, if GG is reverse KL divergence, we have for all π𝒳\pi\in\mathcal{X}:

i=1NπiG(σik+1,σik),πiσik+1\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\pi_{i}-\sigma_{i}^{k+1}\rangle =i=1Nj=1di(σijk+1πij)σijkσijk+1\displaystyle=\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1}-\pi_{ij})\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}
=i=1Nj=1di(σijk+1πij)=0,\displaystyle=\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1}-\pi_{ij})=0,

Thus, we have for all π𝒳\pi\in\mathcal{X}:

i=1Nπivi(σik+1,σik+1),πiσik+10.\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\sigma_{i}^{k+1},\sigma_{-i}^{k+1}),\pi_{i}-\sigma_{i}^{k+1}\rangle\leq 0.

This is exactly the first-order optimality condition that characterizes a Nash equilibrium π∗ ∈ Π∗. Therefore, σ^{k+1} = σ^k is a Nash equilibrium of the underlying game. ∎
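The key step above is that the perturbation term vanishes on the simplex when σ^{k+1} = σ^k. The snippet below is a small numerical illustration of this fact for the α-divergence gradient used in the proof (our own sketch, not the authors' code):

```python
import numpy as np

def grad_alpha_div(pi, sigma, alpha):
    # d/dpi_j of (1/(alpha(1-alpha))) * (1 - sum_j pi_j^alpha sigma_j^(1-alpha))
    return -((sigma / pi) ** (1 - alpha)) / (1 - alpha)

rng = np.random.default_rng(1)
alpha = 0.5
sigma = rng.dirichlet(np.ones(4))  # plays the role of sigma^{k+1} = sigma^k
for _ in range(5):
    pi = rng.dirichlet(np.ones(4))
    # <grad_pi G(sigma, sigma), pi - sigma> = 0 for every simplex point pi
    assert abs(np.dot(grad_alpha_div(sigma, sigma, alpha), pi - sigma)) < 1e-10
```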

H.14 Proof of Lemma H.3

Proof of Lemma H.3.

First, we prove the statement for α\alpha-divergence: G(σik+1,σik)=1α(1α)(1j=1di(σijk+1)α(σijk)1α)G(\sigma_{i}^{k+1},\sigma_{i}^{k})=\frac{1}{\alpha(1-\alpha)}\left(1-\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1})^{\alpha}(\sigma_{ij}^{k})^{1-\alpha}\right). From the definition of α\alpha-divergence, we have for all πΠ\pi^{\ast}\in\Pi^{\ast}:

i=1NπiG(σik+1,σik),σik+1πi\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle =11αi=1Nj=1di(πijσijk+1)(σijkσijk+1)1α\displaystyle=\frac{1}{1-\alpha}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}(\pi_{ij}^{\ast}-\sigma_{ij}^{k+1})\left(\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)^{1-\alpha}
=11αi=1Nj=1diπij(σijkσijk+1)1α11αi=1Nj=1di(σijk+1)α(σijk)1α.\displaystyle=\frac{1}{1-\alpha}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}\pi_{ij}^{\ast}\left(\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)^{1-\alpha}-\frac{1}{1-\alpha}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1})^{\alpha}(\sigma_{ij}^{k})^{1-\alpha}.

Here, when α ∈ (0,1), the weighted AM–GM inequality (x^α y^{1−α} ≤ αx + (1−α)y) gives ∑_{j=1}^{d_i} (σ_{ij}^{k+1})^α (σ_{ij}^k)^{1−α} ≤ 1. Thus,

i=1NπiG(σik+1,σik),σik+1πi\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle 11αi=1Nj=1diπij(σijkσijk+1)1αN1α\displaystyle\geq\frac{1}{1-\alpha}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}\pi_{ij}^{\ast}\left(\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)^{1-\alpha}-\frac{N}{1-\alpha}
=N1αexp(ln(1Ni=1Nj=1diπij(σijkσijk+1)1α))N1α\displaystyle=\frac{N}{1-\alpha}\exp\left(\ln\left(\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}\pi_{ij}^{\ast}\left(\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)^{1-\alpha}\right)\right)-\frac{N}{1-\alpha}
N1αexp(1αNi=1Nj=1diπijlnσijkσijk+1)N1α\displaystyle\geq\frac{N}{1-\alpha}\exp\left(\frac{1-\alpha}{N}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}\pi_{ij}^{\ast}\ln\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)-\frac{N}{1-\alpha}
=N1αexp(1αN(KL(π,σk+1)KL(π,σk)))N1α,\displaystyle=\frac{N}{1-\alpha}\exp\left(\frac{1-\alpha}{N}\left(\mathrm{KL}(\pi^{\ast},\sigma^{k+1})-\mathrm{KL}(\pi^{\ast},\sigma^{k})\right)\right)-\frac{N}{1-\alpha},

where the second inequality follows from the concavity of the ln()\ln(\cdot) function and Jensen’s inequality for concave functions. Since ln()\ln(\cdot) is strictly concave, the equality holds if and only if σk+1=σk\sigma^{k+1}=\sigma^{k}. Therefore, under the assumption that σk+1σk\sigma^{k+1}\neq\sigma^{k}, we get:

KL(π,σk+1)KL(π,σk)\displaystyle\mathrm{KL}(\pi^{\ast},\sigma^{k+1})-\mathrm{KL}(\pi^{\ast},\sigma^{k}) <N1αln(1+1αNi=1NπiG(σik+1,σik),σik+1πi)\displaystyle<\frac{N}{1-\alpha}\ln\left(1+\frac{1-\alpha}{N}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle\right)
i=1NπiG(σik+1,σik),σik+1πi,\displaystyle\leq\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle, (43)

where the second inequality follows from ln(1+x)x\ln(1+x)\leq x for x>1x>-1. From the first-order optimality condition for σik+1\sigma_{i}^{k+1}, we have for all πΠ\pi^{\ast}\in\Pi^{\ast}:

i=1Nπivi(σik+1,σik+1)μπiG(σik+1,σik),πiσik+10.\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\sigma_{i}^{k+1},\sigma_{-i}^{k+1})-\mu\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\pi_{i}^{\ast}-\sigma_{i}^{k+1}\rangle\leq 0.

Then,

i=1NπiG(σik+1,σik),σik+1πi\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle 1μi=1Nπivi(σik+1,σik+1),σik+1πi\displaystyle\leq\frac{1}{\mu}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\sigma_{i}^{k+1},\sigma_{-i}^{k+1}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle
1μi=1Nπivi(πi,πi),σik+1πi,\displaystyle\leq\frac{1}{\mu}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\ast},\pi_{-i}^{\ast}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle, (44)

where the second inequality follows from (1). Moreover, since π∗ is a Nash equilibrium, from the first-order optimality condition, we get:

i=1Nσik+1πi,πivi(πi,πi)0.\displaystyle\sum_{i=1}^{N}\langle\sigma_{i}^{k+1}-\pi_{i}^{\ast},\nabla_{\pi_{i}}v_{i}(\pi_{i}^{\ast},\pi_{-i}^{\ast})\rangle\leq 0. (45)

By combining (43), (44), and (45), if σ^{k+1} ≠ σ^k, we have for any π∗ ∈ Π∗:

KL(π,σk+1)KL(π,σk)<0.\displaystyle\mathrm{KL}(\pi^{\ast},\sigma^{k+1})-\mathrm{KL}(\pi^{\ast},\sigma^{k})<0.

Next, we prove the statement for Rényi-divergence: G(σik+1,σik)=1α1ln(j=1di(σijk+1)α(σijk)1α)G(\sigma_{i}^{k+1},\sigma_{i}^{k})=\frac{1}{\alpha-1}\ln\left(\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1})^{\alpha}(\sigma_{ij}^{k})^{1-\alpha}\right). We have for all πΠ\pi^{\ast}\in\Pi^{\ast}:

i=1NπiG(σik+1,σik),σik+1πi\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle =α1αi=1N1j=1di(σijk+1)α(σijk)1αj=1di(πijσijk+1)(σijkσijk+1)1α\displaystyle=\frac{\alpha}{1-\alpha}\sum_{i=1}^{N}\frac{1}{\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1})^{\alpha}(\sigma_{ij}^{k})^{1-\alpha}}\sum_{j=1}^{d_{i}}(\pi_{ij}^{\ast}-\sigma_{ij}^{k+1})\left(\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)^{1-\alpha}
=α1αi=1N1j=1di(σijk+1)α(σijk)1αj=1diπij(σijkσijk+1)1αNα1α.\displaystyle=\frac{\alpha}{1-\alpha}\sum_{i=1}^{N}\frac{1}{\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1})^{\alpha}(\sigma_{ij}^{k})^{1-\alpha}}\sum_{j=1}^{d_{i}}\pi_{ij}^{\ast}\left(\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)^{1-\alpha}-\frac{N\alpha}{1-\alpha}.

Again, by using j=1di(σijk+1)α(σijk)1α1\sum_{j=1}^{d_{i}}(\sigma_{ij}^{k+1})^{\alpha}(\sigma_{ij}^{k})^{1-\alpha}\leq 1 when α(0,1)\alpha\in(0,1), we get:

i=1NπiG(σik+1,σik),σik+1πi\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle α1αi=1Nj=1diπij(σijkσijk+1)1αNα1α\displaystyle\geq\frac{\alpha}{1-\alpha}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}\pi_{ij}^{\ast}\left(\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)^{1-\alpha}-\frac{N\alpha}{1-\alpha}
=Nα1αexp(ln(1Ni=1Nj=1diπij(σijkσijk+1)1α))Nα1α\displaystyle=\frac{N\alpha}{1-\alpha}\exp\left(\ln\left(\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}\pi_{ij}^{\ast}\left(\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)^{1-\alpha}\right)\right)-\frac{N\alpha}{1-\alpha}
Nα1αexp(1αNi=1Nj=1diπijlnσijkσijk+1)Nα1α\displaystyle\geq\frac{N\alpha}{1-\alpha}\exp\left(\frac{1-\alpha}{N}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}\pi_{ij}^{\ast}\ln\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)-\frac{N\alpha}{1-\alpha}
=Nα1αexp(1αN(KL(π,σk+1)KL(π,σk)))Nα1α,\displaystyle=\frac{N\alpha}{1-\alpha}\exp\left(\frac{1-\alpha}{N}\left(\mathrm{KL}(\pi^{\ast},\sigma^{k+1})-\mathrm{KL}(\pi^{\ast},\sigma^{k})\right)\right)-\frac{N\alpha}{1-\alpha},

where the second inequality follows from Jensen's inequality for the ln(⋅) function. Since ln(⋅) is strictly concave, the equality holds if and only if σ^{k+1} = σ^k. Therefore, under the assumption that σ^{k+1} ≠ σ^k, we get:

KL(π,σk+1)KL(π,σk)\displaystyle\mathrm{KL}(\pi^{\ast},\sigma^{k+1})-\mathrm{KL}(\pi^{\ast},\sigma^{k}) <Nα1αln(1+1αNαi=1NπiG(σik+1,σik),σik+1πi)\displaystyle<\frac{N\alpha}{1-\alpha}\ln\left(1+\frac{1-\alpha}{N\alpha}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle\right)
i=1NπiG(σik+1,σik),σik+1πi,\displaystyle\leq\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle, (46)

where the second inequality follows from ln(1+x) ≤ x for x > −1. Thus, by combining (44), (45), and (46), if σ^{k+1} ≠ σ^k, we have for any π∗ ∈ Π∗:

KL(π,σk+1)KL(π,σk)<0.\displaystyle\mathrm{KL}(\pi^{\ast},\sigma^{k+1})-\mathrm{KL}(\pi^{\ast},\sigma^{k})<0.

Finally, we prove the statement for reverse KL divergence: G(σik+1,σik)=j=1diσijklnσijkσijk+1G(\sigma_{i}^{k+1},\sigma_{i}^{k})=\sum_{j=1}^{d_{i}}\sigma_{ij}^{k}\ln\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}. We have for all πΠ\pi^{\ast}\in\Pi^{\ast}:

i=1NπiG(σik+1,σik),σik+1πi\displaystyle\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle =i=1Nj=1di(πijσijk+1)σijkσijk+1\displaystyle=\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}(\pi_{ij}^{\ast}-\sigma_{ij}^{k+1})\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}
=i=1Nj=1diπijσijkσijk+1N\displaystyle=\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}\pi_{ij}^{\ast}\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}-N
=Nexp(ln(1Ni=1Nj=1diπijσijkσijk+1))N\displaystyle=N\exp\left(\ln\left(\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}\pi_{ij}^{\ast}\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)\right)-N
Nexp(1Ni=1Nj=1diπijlnσijkσijk+1)N\displaystyle\geq N\exp\left(\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{d_{i}}\pi_{ij}^{\ast}\ln\frac{\sigma_{ij}^{k}}{\sigma_{ij}^{k+1}}\right)-N
=Nexp(1N(KL(π,σk+1)KL(π,σk)))N,\displaystyle=N\exp\left(\frac{1}{N}\left(\mathrm{KL}(\pi^{\ast},\sigma^{k+1})-\mathrm{KL}(\pi^{\ast},\sigma^{k})\right)\right)-N,

where the inequality follows from Jensen's inequality for the ln(⋅) function. Since ln(⋅) is strictly concave, the inequality is strict whenever σ^{k+1} ≠ σ^k. Thus, under the assumption that σ^{k+1} ≠ σ^k, we get:

KL(π,σk+1)KL(π,σk)\displaystyle\mathrm{KL}(\pi^{\ast},\sigma^{k+1})-\mathrm{KL}(\pi^{\ast},\sigma^{k}) <Nln(1+1Ni=1NπiG(σik+1,σik),σik+1πi)\displaystyle<N\ln\left(1+\frac{1}{N}\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle\right)
i=1NπiG(σik+1,σik),σik+1πi,\displaystyle\leq\sum_{i=1}^{N}\langle\nabla_{\pi_{i}}G(\sigma_{i}^{k+1},\sigma_{i}^{k}),\sigma_{i}^{k+1}-\pi_{i}^{\ast}\rangle, (47)

where the second inequality follows from ln(1+x) ≤ x for x > −1. Thus, by combining (44), (45), and (47), if σ^{k+1} ≠ σ^k, we have for any π∗ ∈ Π∗:

KL(π,σk+1)KL(π,σk)<0.\displaystyle\mathrm{KL}(\pi^{\ast},\sigma^{k+1})-\mathrm{KL}(\pi^{\ast},\sigma^{k})<0.
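The bound ∑_j p_j^α q_j^{1−α} ≤ 1 for probability vectors p, q and α ∈ (0,1), used repeatedly above, can be verified numerically as follows (a sanity check consistent with the weighted AM–GM argument, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha = 0.3
for _ in range(100):
    p = rng.dirichlet(np.ones(6))
    q = rng.dirichlet(np.ones(6))
    # sum_j p_j^alpha * q_j^(1-alpha) <= 1 for probability vectors p, q
    assert np.sum(p ** alpha * q ** (1 - alpha)) <= 1 + 1e-12
```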

Appendix I Additional Experimental Results and Details

I.1 Payoff Matrix in Three-Player Biased RPS Game

Table 2: Three-Player Biased RPS game matrix.
       R       P       S
R      0     -1/3      1
P     1/3      0     -1/3
S     -1      1/3      0
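For reference, the payoff matrix of Table 2 can be encoded as follows; the check only confirms that the matrix is antisymmetric (zero-sum), which is all that can be read off the table itself:

```python
import numpy as np

# Biased RPS payoff matrix from Table 2 (rows/columns ordered R, P, S)
A = np.array([
    [ 0.0, -1/3,  1.0],   # R
    [ 1/3,  0.0, -1/3],   # P
    [-1.0,  1/3,  0.0],   # S
])
assert np.allclose(A, -A.T)  # antisymmetric, i.e., zero-sum
```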

I.2 Experimental Setting for Section 6

The experiments in Section 6 are conducted on Ubuntu 20.04.2 LTS with an Intel(R) Core(TM) i9-10850K CPU @ 3.60GHz and 64GB RAM.

In the full feedback setting, we use a constant learning rate η=0.1 for MWU, OMWU, and APMD in all three games. For APMD, we set μ=0.1 and Tσ=100 for the KL and reverse KL divergence perturbations, and μ=0.1 and Tσ=20 for the squared ℓ²-distance perturbation. As an exception, η=0.01, μ=1.0, and Tσ=200 are used for APMD with squared ℓ²-distance perturbation in the random payoff games with 50 actions.

In the noisy feedback setting, we use a lower learning rate η=0.01 for all algorithms, except for APMD with squared ℓ²-distance perturbation in the random payoff games with 50 actions. We update the slingshot strategy profile σ^k every Tσ=1000 iterations in APMD with KL and reverse KL divergence perturbation, and every Tσ=200 iterations in APMD with squared ℓ²-distance perturbation. For APMD with squared ℓ²-distance perturbation in the random payoff games with 50 actions, we set η=0.001 and Tσ=2000.
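For concreteness, the following is a minimal sketch of the APMD loop with squared ℓ²-distance perturbation as configured above, assuming the entropy mirror map (so that the mirror descent step takes a multiplicative-weights form) and treating `payoff_gradient` as a hypothetical placeholder for the game's (possibly noisy) gradient feedback. It is an illustrative sketch under these assumptions, not the authors' implementation.

```python
import numpy as np

def apmd_l2(payoff_gradient, dims, eta=0.1, mu=0.1, T_sigma=20, n_iters=10_000):
    """Sketch of APMD with squared L2-distance perturbation and entropy mirror map."""
    pi = [np.ones(d) / d for d in dims]      # uniform initial strategy profile
    sigma = [p.copy() for p in pi]           # initial slingshot profile
    for t in range(n_iters):
        grads = payoff_gradient(pi)          # per-player (possibly noisy) gradients
        for i in range(len(dims)):
            # perturbed gradient: grad v_i(pi) - mu * grad_{pi_i} (1/2)||pi_i - sigma_i||^2
            q = grads[i] - mu * (pi[i] - sigma[i])
            logits = np.log(pi[i]) + eta * q  # entropy mirror descent (MWU-like) step
            pi[i] = np.exp(logits - logits.max())
            pi[i] /= pi[i].sum()
        if (t + 1) % T_sigma == 0:           # periodically reset the slingshot strategy
            sigma = [p.copy() for p in pi]
    return pi

# Example usage on a two-player zero-sum game with a payoff matrix A (hypothetical):
# A = np.array([[0.0, -1/3, 1.0], [1/3, 0.0, -1/3], [-1.0, 1/3, 0.0]])
# grad = lambda pi: [A @ pi[1], -A.T @ pi[0]]
# pi = apmd_l2(grad, dims=[3, 3])
```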

I.3 Additional Experiments

In this section, we compare the performance of APMD and APFTRL with that of MWU, OMWU, and optimistic gradient descent (OGD) (Daskalakis et al., 2018; Wei et al., 2021) in the full/noisy feedback setting. The parameter settings for MWU, OMWU, and APMD are the same as in Section 6. For APFTRL, we use the squared ℓ²-distance, with the same parameters as APMD with squared ℓ²-distance perturbation. For OGD, we use the same learning rate as APMD with squared ℓ²-distance perturbation.

Figure 4 shows the logarithm of the gap function for π^t averaged over 100 instances with full feedback. We observe that APMD and APFTRL with squared ℓ²-distance perturbation exhibit performance competitive with OGD. The experimental results in the noisy feedback setting are presented in Figure 5. Surprisingly, in the noisy feedback setting, all APMD-based algorithms and the APFTRL-based algorithm perform substantially better than OGD in all three games.

Figure 4: The gap function for πt\pi^{t} for APMD, APFTRL, MWU, OMWU, and OGD with full feedback. The shaded area represents the standard errors. Note that the KL divergence, reverse KL divergence, and squared 2\ell^{2}-distance are abbreviated to KL, RKL, and L2, respectively.
Figure 5: The gap function for πt\pi^{t} for APMD, APFTRL, MWU, OMWU, and OGD with noisy feedback. The shaded area represents the standard errors.

I.4 Comparison with the Averaged Strategies of No-Regret Learning Algorithms

This section compares the last-iterate strategies π^t of APMD and APFTRL with the averaged strategies (1/t)∑_{τ=1}^{t} π^τ of MWU, regret matching (RM) (Hart & Mas-Colell, 2000), and regret matching plus (RM+) (Tammelin, 2014). The parameter settings for MWU, APMD, and APFTRL are the same as in Section I.3. Figure 6 illustrates the logarithm of the gap function averaged over 100 instances with full feedback. The results show that the last-iterate strategies of APMD and APFTRL with squared ℓ²-distance perturbation exhibit a lower gap than the averaged strategies of MWU, RM, and RM+.

Figure 6: Comparison between the gap function of the last-iterate strategy profile of APMD, APFTRL, and the averaged strategy profile of MWU, RM, and RM+ with full feedback. The shaded area represents the standard errors.

I.5 Sensitivity Analysis of Update Interval for the Slingshot Strategy

In this section, we investigate the performance of APMD when changing the update interval Tσ of the slingshot strategy. We vary Tσ for APMD with squared ℓ²-distance (L2) perturbation in 3BRPS over Tσ ∈ {10, 100, 1000, 10000}, under both full and noisy feedback. All other parameters are the same as in Section 6. Figure 7 shows the logarithm of the gap function for π^t averaged over 100 instances in 3BRPS with full/noisy feedback. We observe that the smaller Tσ is, the faster π^t converges. However, if Tσ is too small, π^t fails to converge (see Tσ=10 with full feedback and Tσ=100 with noisy feedback in Figure 7).

(a) Full feedback
(b) Noisy feedback
Figure 7: The gap function for πt\pi^{t} for APMD with varying TσT_{\sigma} in 3BRPS with full/noisy feedback. The shaded area represents the standard errors.