Adaptively Perturbed Mirror Descent for Learning in Games
Abstract
This paper proposes a payoff perturbation technique for the Mirror Descent (MD) algorithm in games where the gradient of the payoff functions is monotone in the strategy profile space and the gradient feedback may contain additive noise. The optimistic family of learning algorithms, exemplified by optimistic MD, successfully achieves last-iterate convergence in scenarios devoid of noise, leading the dynamics to a Nash equilibrium. A recent re-emerging trend underscores the promise of the perturbation approach, where payoff functions are perturbed based on the distance from an anchoring, or slingshot, strategy. In response, we propose Adaptively Perturbed MD (APMD), which adjusts the magnitude of the perturbation by repeatedly updating the slingshot strategy at a predefined interval. This innovation enables us to find a Nash equilibrium of the underlying game with guaranteed rates. Empirical demonstrations affirm that our algorithm exhibits significantly accelerated convergence.
1 Introduction
This study delves into a variant of Mirror Descent (MD) (Nemirovskij & Yudin, 1983; Beck & Teboulle, 2003) in the context of monotone games, where the gradient of the payoff functions is monotone over the strategy profile space. This class encompasses diverse games, including Cournot competition (Bravo et al., 2018), $\lambda$-cocoercive games (Lin et al., 2020), concave-convex games, and zero-sum polymatrix games (Cai & Daskalakis, 2011; Cai et al., 2016). Due to their extensive applicability, various learning algorithms have been developed and scrutinized to compute a Nash equilibrium efficiently.
Agents prescribed to play according to MD or its variants are more likely to choose strategies with higher expected payoffs, while regularization keeps them from moving far away from their current strategies. The dynamics are known to converge to an equilibrium in an average sense, referred to as average-iterate convergence: the strategy profile averaged over iterations converges to an equilibrium. Nevertheless, research has shown that the actual trajectory of the updated strategy profiles fails to converge even in two-player zero-sum games, a special class of monotone games (Mertikopoulos et al., 2018; Bailey & Piliouras, 2018). On the contrary, optimistic learning algorithms, which incorporate recency bias, have shown success: the updated strategy profile itself converges to a Nash equilibrium (Daskalakis et al., 2018; Daskalakis & Panageas, 2019; Mertikopoulos et al., 2019; Wei et al., 2021), termed last-iterate convergence.
However, the optimistic approach faces challenges when feedback is contaminated by noise. Typically, each agent updates his or her strategy according to the perfect gradient of the payoff function at each iteration, referred to as full feedback. In a more realistic scenario, noise distorts this feedback. With noisy feedback, optimistic learning algorithms perform suboptimally. For instance, Abe et al. (2023) empirically demonstrated that Optimistic Multiplicative Weights Update (OMWU) fails to converge to an equilibrium, orbiting around it instead.
Alternatively, perturbation of payoffs has re-emerged as a pivotal concept for achieving last-iterate convergence, even under noise (Abe et al., 2023). Payoff perturbation is a classical technique, as seen in Facchinei & Pang (2003), that introduces strongly convex penalties to the players' payoff functions to stabilize learning. This leads to convergence to approximate equilibria, not only in the full feedback setting but also in the noisy feedback setting. However, to ensure convergence toward a Nash equilibrium of the underlying game, the magnitude of the perturbation, calculated as the product of a strongly convex penalty function and a perturbation strength parameter, requires careful adjustment. In fact, Liu et al. (2023) shrink the perturbation strength based on the current strategy profile's proximity to an underlying equilibrium. Similarly, iterative Tikhonov regularization methods (Koshal et al., 2010; Tatarenko & Kamgarpour, 2019) adjust the magnitude of the perturbation via a sequence of perturbation strengths that satisfy certain conditions, such as diminishing as the iteration count increases. Although these algorithms admit last-iterate convergence, it becomes challenging to choose a learning rate appropriate for the shrinking perturbation strength, which often prevents these algorithms and their variants from achieving rapid convergence.
In response to this, we adaptively determine the amount of the penalty from the distance between the current strategy and an anchoring, or slingshot, strategy, while keeping the perturbation strength parameter constant. Instead of carefully decaying the perturbation strength, the slingshot strategies are re-initialized at a predefined interval by the current strategies, and thus the magnitude of the perturbation is adjusted. To the best of our knowledge, Perolat et al. (2021) were the first to employ this idea, which enabled Abe et al. (2023) to modify MWU so that it achieves last-iterate convergence. However, these works establish convergence only in an asymptotic manner. The significance of our work, in part, lies in extending these two studies and establishing non-asymptotic convergence results.
Our contributions are manifold. First, we introduce our algorithm, Adaptively Perturbed MD (APMD); an implementation is available at https://github.com/CyberAgentAILab/adaptively-perturbed-md. Second, we analyze the case where both the perturbation function and the proximal regularizer are the squared $\ell^2$-distance and provide last-iterate convergence rates to a Nash equilibrium under full and noisy feedback, respectively. We also discuss the case where divergences other than the squared $\ell^2$-distance are utilized. Finally, we empirically reveal that our proposed APMD significantly outperforms MWU and OMWU in zero-sum polymatrix games, regardless of the feedback type.
2 Preliminaries
Monotone games.
This paper focuses on a continuous game, denoted by $([N], (\mathcal{X}_i)_{i\in[N]}, (v_i)_{i\in[N]})$, where $[N]=\{1,\dots,N\}$ represents the set of players, $\mathcal{X}_i \subseteq \mathbb{R}^{d_i}$ represents the $d_i$-dimensional compact convex strategy space for player $i\in[N]$, and we write $\mathcal{X} = \prod_{i\in[N]}\mathcal{X}_i$. Each player $i$ chooses a strategy $\pi_i$ from $\mathcal{X}_i$ and aims to maximize her differentiable payoff function $v_i:\mathcal{X}\to\mathbb{R}$. We write $\pi_{-i}$ as the strategies of all players except player $i$, and denote the strategy profile by $\pi=(\pi_i)_{i\in[N]}\in\mathcal{X}$. This study particularly considers a smooth monotone game, where the gradient $(\nabla_{\pi_i}v_i)_{i\in[N]}$ of the payoff functions is monotone:
$$\sum_{i=1}^{N}\big\langle \nabla_{\pi_i}v_i(\pi) - \nabla_{\pi_i}v_i(\pi'),\ \pi_i - \pi'_i\big\rangle \le 0, \quad \forall \pi,\pi'\in\mathcal{X}, \qquad (1)$$
and $L$-Lipschitz:
$$\big\|\big(\nabla_{\pi_i}v_i(\pi)\big)_{i\in[N]} - \big(\nabla_{\pi_i}v_i(\pi')\big)_{i\in[N]}\big\| \le L\,\|\pi-\pi'\|, \quad \forall \pi,\pi'\in\mathcal{X}, \qquad (2)$$
where $\|\cdot\|$ is the $\ell^2$-norm.
Monotone games include many common and well-studied classes of games, such as concave-convex games, zero-sum polymatrix games, and Cournot competition.
Example 2.1 (Concave-Convex Games).
Let us consider a two-player max-min game $(\{1,2\}, (\mathcal{X}_1,\mathcal{X}_2), (v, -v))$: player 1 aims to maximize $v(\pi_1,\pi_2)$, while player 2 aims to minimize it. If $v$ is concave in $\pi_1$ and convex in $\pi_2$, the game is called a concave-convex game, or minimax optimization problem, and it is easy to confirm that the game is monotone.
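To make this concrete, the following minimal sketch (ours, not the paper's; the payoff matrix is a hypothetical example) builds the joint gradient field of a bilinear concave-convex game, for which the monotonicity inequality (1) holds with equality:

```python
import numpy as np

def bilinear_minmax_gradients(A):
    """Gradient field of the concave-convex game f(x, y) = x^T A y, where the
    max player controls x and the min player controls y. The min player
    maximizes -f, so her payoff gradient is -(df/dy) = -A^T x."""
    def payoff_gradients(strategies):
        x, y = strategies
        return [A @ y, -(A.T @ x)]
    return payoff_gradients

# Example: a 3x3 bilinear game (hypothetical payoff matrix).
grads = bilinear_minmax_gradients(
    np.array([[0.0, 1.0, -1.0], [-1.0, 0.0, 1.0], [1.0, -1.0, 0.0]]))
print(grads([np.ones(3) / 3, np.ones(3) / 3]))
```

For this gradient field, the sum in (1) evaluates to $\langle A(y-y'), x-x'\rangle - \langle A^{\top}(x-x'), y-y'\rangle = 0$, confirming monotonicity.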
Example 2.2 (Zero-Sum Polymatrix Games).
In a zero-sum polymatrix game, each player's payoff function can be decomposed as $v_i(\pi) = \sum_{j\neq i} u_i(\pi_i,\pi_j)$, where each pairwise payoff is bilinear, $u_i(\pi_i,\pi_j) = \pi_i^{\top} M^{(i,j)} \pi_j$ for some matrix $M^{(i,j)}$, and the pairwise payoffs satisfy $u_i(\pi_i,\pi_j) + u_j(\pi_j,\pi_i) = 0$. In this game, each player $i$ can be interpreted as playing a two-player zero-sum game with each other player $j$. From the linearity and zero-sum property of the pairwise payoffs, we can easily show that $\sum_{i=1}^{N}\langle \nabla_{\pi_i}v_i(\pi) - \nabla_{\pi_i}v_i(\pi'),\ \pi_i-\pi'_i\rangle = 0$. Thus, the zero-sum polymatrix game is a special case of monotone games.
Nash equilibrium and gap function.
A Nash equilibrium (Nash, 1951) is a common solution concept of a game: a strategy profile where no player can improve her payoff by deviating from her specified strategy. Formally, a Nash equilibrium $\pi^\ast\in\mathcal{X}$ satisfies the following condition:
$$\forall i\in[N],\ \forall \pi_i\in\mathcal{X}_i:\quad v_i(\pi_i^\ast, \pi_{-i}^\ast) \ \ge\ v_i(\pi_i, \pi_{-i}^\ast).$$
We denote the set of Nash equilibria by $\Pi^\ast$. Note that a Nash equilibrium always exists for any smooth monotone game (Debreu, 1952). Furthermore, we measure the proximity of a given strategy profile $\pi$ to a Nash equilibrium by its gap function, which is defined as:
$$\mathrm{GAP}(\pi) := \max_{\tilde{\pi}\in\mathcal{X}}\ \sum_{i\in[N]}\big\langle \nabla_{\pi_i} v_i(\pi),\ \tilde{\pi}_i - \pi_i\big\rangle.$$
The gap function is a standard metric of proximity to a Nash equilibrium for a given strategy profile (Cai & Zheng, 2023). From the definition, $\mathrm{GAP}(\pi)\ge 0$ for any $\pi\in\mathcal{X}$, and the equality holds if and only if $\pi$ is a Nash equilibrium.
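As an illustration (a minimal sketch under the assumption that every strategy space is a probability simplex; `payoff_gradients` is a hypothetical gradient oracle, not part of the paper), the maximization in the gap function decomposes across players, and each inner maximum over a simplex is attained at a vertex:

```python
import numpy as np

def gap_function(strategies, payoff_gradients):
    """Gap of a strategy profile on a product of probability simplices.

    Because the joint space is a product, the max in the gap function splits
    into one max per player, and the max of a linear function over a simplex
    is its best coordinate.
    """
    grads = payoff_gradients(strategies)  # list of nabla_{pi_i} v_i(pi)
    return sum(float(np.max(g) - np.dot(g, x))
               for x, g in zip(strategies, grads))
```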
Problem setting.
In this study, we consider an online learning setting where the following process is repeated for $T$ iterations: 1) at each iteration $t\ge 0$, each player $i\in[N]$ chooses her strategy $\pi_i^t\in\mathcal{X}_i$ based on the previously observed feedback; 2) each player $i$ receives the gradient feedback $\widehat{\nabla}_{\pi_i} v_i(\pi^t)$. This study considers two feedback models: full feedback and noisy feedback. In the full feedback setting, each player receives the perfect gradient vector as feedback, i.e., $\widehat{\nabla}_{\pi_i} v_i(\pi^t) = \nabla_{\pi_i} v_i(\pi^t)$. In the noisy feedback setting, each player's feedback is given by $\widehat{\nabla}_{\pi_i} v_i(\pi^t) = \nabla_{\pi_i} v_i(\pi^t) + \xi_i^t$, where $\xi_i^t$ is a noise vector. Specifically, we focus on zero-mean and bounded-variance noise vectors.
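The two feedback models can be mocked up as follows (a sketch; the Gaussian noise is purely an illustrative assumption, since the analysis only requires zero mean and bounded variance):

```python
import numpy as np

def full_feedback(strategies, payoff_gradients):
    """Full feedback: each player observes her exact gradient vector."""
    return payoff_gradients(strategies)

def noisy_feedback(strategies, payoff_gradients, noise_std=0.1, rng=None):
    """Noisy feedback: exact gradient plus zero-mean, bounded-variance noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    return [g + noise_std * rng.standard_normal(g.shape)
            for g in payoff_gradients(strategies)]
```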
Mirror Descent.
Mirror Descent (MD) is a widely used algorithm for learning equilibria in games. Let us define $\psi$ as a strictly convex regularization function and $D_\psi(\pi_i,\pi_i') = \psi(\pi_i) - \psi(\pi_i') - \langle \nabla\psi(\pi_i'),\ \pi_i - \pi_i'\rangle$ as the associated Bregman divergence. Then, MD updates each player $i$'s strategy at iteration $t$ as follows:
$$\pi_i^{t+1} = \arg\max_{x\in\mathcal{X}_i}\ \Big\{ \eta_t\,\big\langle \widehat{\nabla}_{\pi_i} v_i(\pi^t),\ x\big\rangle \;-\; D_\psi\big(x,\ \pi_i^t\big) \Big\},$$
where $\eta_t$ is the learning rate at iteration $t$.
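For instance, with the entropic regularizer on the probability simplex, the MD update reduces to the Multiplicative Weights Update. A minimal sketch (ours, for illustration; it assumes the current strategy has strictly positive entries):

```python
import numpy as np

def md_entropy_step(x, grad_feedback, eta):
    """One MD step on the simplex with the entropic regularizer, i.e., MWU:
    x_{t+1} is proportional to x_t * exp(eta * grad_feedback)."""
    logits = np.log(x) + eta * grad_feedback
    logits -= logits.max()              # numerical stability
    y = np.exp(logits)
    return y / y.sum()

# One step from the uniform strategy with a toy gradient (illustration only).
print(md_entropy_step(np.ones(3) / 3, np.array([1.0, 0.0, -1.0]), eta=0.1))
```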
Other notations.
We denote the $d$-dimensional probability simplex by $\Delta^d = \{p\in[0,1]^d : \sum_{a=1}^{d} p_a = 1\}$. We define $\mathrm{diam}(\mathcal{X}) := \sup_{\pi,\pi'\in\mathcal{X}}\|\pi-\pi'\|$ as the diameter of $\mathcal{X}$. The Kullback-Leibler (KL) divergence is defined by $\mathrm{KL}(\pi_i,\pi_i') = \sum_a \pi_{ia}\ln\frac{\pi_{ia}}{\pi'_{ia}}$. Besides, with a slight abuse of notation, we represent the sum of Bregman divergences and the sum of KL divergences by $D_\psi(\pi,\pi') = \sum_{i\in[N]} D_\psi(\pi_i,\pi_i')$ and $\mathrm{KL}(\pi,\pi') = \sum_{i\in[N]} \mathrm{KL}(\pi_i,\pi_i')$, respectively. We finally denote the domain of $\psi$ by $\mathrm{dom}(\psi)$, and the corresponding interior by $\mathrm{int}(\mathrm{dom}(\psi))$.
3 Adaptively Perturbed Mirror Descent
In this section, we present Adaptively Perturbed Mirror Descent (APMD), an extension of the standard MD algorithm. Algorithm 1 describes the pseudo-code. APMD employs two pivotal techniques: slingshot payoff perturbation and slingshot strategy update, corresponding to line 5 and line 9 of Algorithm 1, respectively.

3.1 Slingshot Payoff Perturbation
Let us define the differentiable divergence function $G(\cdot,\cdot)$ and the slingshot strategy $\sigma_i\in\mathcal{X}_i$. APMD perturbs each player's payoff by the divergence between the current strategy $\pi_i^t$ and the slingshot strategy $\sigma_i$, i.e., $G(\pi_i^t,\sigma_i)$. Specifically, APMD updates each player $i$'s strategy according to
$$\pi_i^{t+1} = \arg\max_{x\in\mathcal{X}_i}\ \Big\{ \eta_t\,\big\langle \widehat{\nabla}_{\pi_i} v_i(\pi^t) - \mu\,\nabla_{\pi_i} G\big(\pi_i^t,\sigma_i\big),\ x\big\rangle \;-\; D_\psi\big(x,\ \pi_i^t\big) \Big\}, \qquad (3)$$
where $\mu>0$ is the perturbation strength and $\nabla_{\pi_i}G$ denotes differentiation with respect to the first argument. We assume that $G(\cdot,\sigma_i)$ is strictly convex for every $\sigma_i\in\mathcal{X}_i$ and takes its minimum value of $0$ at $\sigma_i$. Furthermore, we assume that $\psi$ is differentiable and strongly convex with a positive modulus.
The conventional MD updates its strategy based on the gradient feedback of the payoff function and the regularization term. The regularization term adjusts the next strategy so that it does not deviate significantly from the current strategy. APMD additionally perturbs the gradient payoff vector by the divergence between the current strategy and a predefined slingshot strategy $\sigma_i$. If the current strategy is far away from the slingshot strategy, the magnitude of the perturbation increases. Note that if both strategies coincide, no perturbation occurs, and APMD simply seeks a strategy with a higher expected payoff. As Figure 1 illustrates, the divergence first fluctuates, and then the current strategy profile approaches a stationary point where the gradient of the expected payoff vector equals the gradient of the perturbation, so that the perturbation via the slingshot stabilizes the learning dynamics. Indeed, Mutant Follow the Regularized Leader (Mutant FTRL), instantiated in Example 5.3, encompasses replicator-mutator dynamics, which is guaranteed to converge to an approximate equilibrium in two-player zero-sum games (Abe et al., 2022). We can argue that APMD inherits this nice feature.
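A minimal sketch of one APMD inner step for the squared $\ell^2$ instance analyzed in Section 4 (Euclidean regularizer and Euclidean perturbation on a simplex); the function names and the projection routine are ours, not the paper's implementation:

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u + (1.0 - css) / idx > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(v + tau, 0.0)

def apmd_step(x, grad_feedback, slingshot, eta, mu):
    """One APMD inner step in the squared l2 case: the gradient feedback is
    perturbed toward the slingshot strategy, then a projected gradient
    ascent step is taken."""
    # -mu * d/dx [ 0.5 * ||x - slingshot||^2 ] = -mu * (x - slingshot)
    perturbed = grad_feedback - mu * (x - slingshot)
    return project_simplex(x + eta * perturbed)
```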
3.2 Slingshot Strategy Update
The perturbation via the slingshot enables fast convergence to a stationary point (Lemmas 5.6 and 5.7). Different slingshot strategy profiles induce different stationary points. Of course, when the slingshot strategy profile is an equilibrium, the corresponding stationary point is also an equilibrium. However, it is virtually impossible to identify such an ideal slingshot strategy profile beforehand. To this end, APMD adjusts the slingshot strategy profile by replacing it with the (nearly) stationary point reached after a predefined number of iterations $T_\sigma$.
Figure 1 illustrates how our slingshot strategy update brings the corresponding stationary points closer to an equilibrium. The x- and y-axes indicate the number of iterations and the logarithm of the gap function of the last-iterate strategy profile, respectively. We here use a constant learning rate and a constant perturbation strength, and the initial slingshot strategy is the uniform distribution over the action space. After the first interval of $T_\sigma$ iterations, APMD finds a stationary point. The slingshot strategy profile for the second interval is then replaced with this stationary point. Figure 1 clearly shows that the stationary point, i.e., the last-iterate strategy profile, comes closer to an equilibrium every time the slingshot strategy profile is updated. We also theoretically justify our slingshot strategy update in Theorem D.1: when the slingshot strategy profile is close to an equilibrium, the corresponding stationary point is also close to that equilibrium.
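Putting the two techniques together, the outer loop below sketches how the slingshot profile is overwritten every fixed number of inner iterations. It reuses `apmd_step` from the sketch above, and all hyper-parameter values are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def run_apmd(payoff_gradients, dims, eta=0.01, mu=0.1,
             update_interval=100, total_iters=10_000):
    """Outer loop of APMD (sketch): run perturbed inner updates and overwrite
    the slingshot profile with the current profile every `update_interval`
    iterations."""
    strategies = [np.full(d, 1.0 / d) for d in dims]   # uniform initialization
    slingshots = [x.copy() for x in strategies]        # initial slingshot profile
    for t in range(total_iters):
        grads = payoff_gradients(strategies)           # full feedback; plug in a noisy oracle if desired
        strategies = [apmd_step(x, g, s, eta, mu)
                      for x, g, s in zip(strategies, grads, slingshots)]
        if (t + 1) % update_interval == 0:             # slingshot strategy update
            slingshots = [x.copy() for x in strategies]
    return strategies, slingshots
```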
4 Last-Iterate Convergence Rates
In this section, we establish the last-iterate convergence rates of APMD. More specifically, we examine the setting where both $G$ and $D_\psi$ are set to the squared $\ell^2$-distance, i.e., $G(\pi_i,\sigma_i)=D_\psi(\pi_i,\sigma_i)=\frac{1}{2}\|\pi_i-\sigma_i\|^2$. This instance can be considered an extended version of Gradient Descent that incorporates our techniques of payoff perturbation and slingshot strategy update. We also assume that the gradient vector of each $v_i$ is bounded. We emphasize that we obtain overall last-iterate convergence rates of APMD over the entire run of iterations, in both the full and noisy feedback settings.
4.1 Full Feedback Setting
First, we demonstrate the last-iterate convergence rate of APMD with full feedback, where each player receives the perfect gradient vector $\nabla_{\pi_i} v_i(\pi^t)$ at each iteration $t$. Theorem 4.1 provides APMD's convergence rate in terms of the gap function. Note that the learning rate is constant, and its upper bound is specified by the perturbation strength $\mu$ and the smoothness parameter $L$.
Theorem 4.1.
If we use a constant learning rate, set both $G$ and $\psi$ to the squared $\ell^2$-distance, and choose the update interval $T_\sigma$ appropriately (see Theorem B.1 for the precise constants), then the strategy profile updated by APMD satisfies:
The obtained rate is competitive with those of optimistic gradient and extra-gradient methods (Cai et al., 2022a). Although it is open whether our convergence rate matches a lower bound, it closely aligns with the lower bound known for a class of algorithms that includes 1-SCLI algorithms (Golowich et al., 2020b), a class different from our APMD. To the best of our knowledge, the fastest known rate is achieved by Accelerated Optimistic Gradient (AOG) (Cai & Zheng, 2023), an optimistic variant of the Halpern iteration (Halpern, 1967). At first sight, the update rule of AOG looks as if the perturbation strength were linearly decayed. However, it does not perturb the payoff functions; instead, it adjusts the regularization term by using a convex combination of the current strategy and an anchoring strategy as the proximal point in MD. Unlike our APMD, the anchoring strategy is never updated through the iterations. (Regarding this faster rate, a companion paper is in preparation.)
4.2 Proof Sketch of Theorem 4.1
This section sketches the proof for Theorem 4.1. We present the complete proofs for the theorem and associated lemmas in Appendix E.
(1) Convergence rates to a stationary point for the $k$-th slingshot strategy profile.
We denote by $\sigma^k$ the slingshot strategy profile after $k$ updates. Since the slingshot strategy profile is overwritten by the current strategy profile every $T_\sigma$ iterations, $\sigma^{k+1}$ is the strategy profile reached at the end of the $(k+1)$-th interval. We first prove that, as the inner iterations proceed, the next slingshot strategy profile $\sigma^{k+1}$ approaches the stationary point $\pi^{\mu,\sigma^k}$ associated with the current slingshot strategy profile $\sigma^k$, where, for a given slingshot profile $\sigma$, the stationary point $\pi^{\mu,\sigma}$ satisfies the following condition for all $i\in[N]$ and $\pi_i\in\mathcal{X}_i$:
$$\big\langle \nabla_{\pi_i} v_i(\pi^{\mu,\sigma}) - \mu\,\nabla_{\pi_i} G\big(\pi_i^{\mu,\sigma},\ \sigma_i\big),\ \pi_i - \pi_i^{\mu,\sigma}\big\rangle \ \le\ 0. \qquad (4)$$
Note that $\pi^{\mu,\sigma^k}$ always exists since the perturbed game remains monotone. Using the strong convexity of the perturbation term, we show that the updated strategy profile approaches $\pi^{\mu,\sigma^k}$ within each interval:
Lemma 4.2.
Assume that and are set as the squared -distance. If we use the constant learning rate , the -th slingshot strategy profile of APMD under the full feedback setting satisfies that:
(2) Upper bound on the gap function.
Next, we derive an upper bound on the gap function of the slingshot strategy profile. From the bounding technique for the gap function using the tangent residuals by Cai et al. (2022a) and the first-order optimality condition for the stationary point, the gap function can be upper bounded by the distance between the slingshot strategy profile and the associated stationary point:
Lemma 4.3.
If is set as the squared -distance, the gap function for of APMD satisfies for :
By Lemma 4.2, if we set as the squared -distance, the first term in this lemma can be bounded as:
(5) |
Therefore, it is enough to derive the convergence rate of the $\ell^2$-distance between the slingshot strategy profile and the corresponding stationary point.
(3) Last-iterate convergence results for the slingshot strategy profile.
Let us denote by $K$ the total number of slingshot strategy updates over the entire run of iterations. Then, by adjusting $T_\sigma$, we show that the $\ell^2$-distance between the $k$-th slingshot strategy profile and the corresponding stationary point decreases as $k$ increases:
Lemma 4.4.
In the same setup of Theorem 4.1, the -th slingshot strategy profile of APMD satisfies:
Lemma 4.4 implies that as $k$ increases, the variation of the slingshot strategy profile becomes negligible, signifying convergence of its behavior.
4.3 Noisy Feedback Setting
Next, we consider the noisy feedback setting, where each player $i$ receives a gradient vector with additive noise, $\nabla_{\pi_i}v_i(\pi^t) + \xi_i^t$. Define $\mathcal{F}_t$ as the sigma-algebra generated by the history of the observations up to iteration $t$. We assume that the noise vectors are zero-mean with bounded variances and are independent over iterations. In this setting, last-iterate convergence is achieved by APMD using a decreasing learning rate sequence $(\eta_t)_{t\ge 0}$. The resulting convergence rate is stated in the following theorem:
Theorem 4.5.
Assume that $G$ and $\psi$ are set to the squared $\ell^2$-distance and that the update interval is chosen appropriately (see Theorem B.6 for the precise constants). If we choose a suitably decreasing learning rate sequence $(\eta_t)_{t\ge 0}$, then the strategy profile updated by APMD satisfies:
It should be noted that Theorem 4.5 provides a non-asymptotic convergence guarantee. This is a significant departure from existing convergence results (Koshal et al., 2010, 2013; Tatarenko & Kamgarpour, 2019), which focus on the asymptotic convergence of iterative Tikhonov regularization methods in the noisy or bandit feedback settings.
4.4 Proof Sketch of Theorem 4.5
As in Theorem 4.1, we first derive the convergence rate toward the stationary point for the noisy feedback setting:
Lemma 4.6.
Let and . Suppose that both and are defined as the squared -distance. Under the noisy feedback setting, if we use the learning rate sequence of the form , the -th slingshot strategy profile of APMD for each satisfies that:
The proof is given in Appendix E.6 and is based on the standard argument of stochastic optimization, e.g., Nedić & Lee (2014). However, the proof is made possible by taking into account the monotonicity of the game and the relative (strong and smooth) convexity of the divergence function.
Next, in a manner similar to Lemma 4.4, we show an upper bound on the expected distance between the slingshot strategy profile and the corresponding stationary point under the noisy feedback setting:
Lemma 4.7.
In the same setup of Theorem 4.5, the -th slingshot strategy profile satisfies:
5 Beyond Squared $\ell^2$-Payoff Perturbation
Section 4 assumes that both $G$ and $\psi$ are the squared $\ell^2$-distance. This section considers more general choices of $G$ and $\psi$.
5.1 Instantiation of Payoff-Perturbed Algorithms
First, we would like to emphasize that choosing appropriate combinations of $G$ and $\psi$ enables APMD to reproduce some existing learning algorithms that incorporate payoff perturbation. For example, the following learning algorithms can be viewed as instantiations of APMD.
Example 5.1 (Boltzmann Q-Learning (Tuyls et al., 2006)).
Assume that each player's strategy space is a probability simplex and the regularizer is the (negative) entropy, $\psi(\pi_i)=\sum_{a}\pi_{ia}\ln\pi_{ia}$. Let us set $G$ as the KL divergence and the slingshot strategy as the uniform distribution, i.e., $G(\pi_i,\sigma_i)=\mathrm{KL}(\pi_i,\sigma_i)$ and $\sigma_i$ uniform over player $i$'s actions. Then, the corresponding continuous-time APMD dynamics can be expressed as:
which is equivalent to Boltzmann Q-learning (Tuyls et al., 2006; Bloembergen et al., 2015).
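For intuition only, the following Euler discretization sketches dynamics of this Boltzmann Q-learning form; the exact continuous-time expression should be taken from Tuyls et al. (2006), and the form below is our assumed reconstruction, not the paper's equation:

```python
import numpy as np

def boltzmann_q_euler_step(x, q, temperature=1.0, dt=0.01):
    """Euler step of (assumed) Boltzmann Q-learning dynamics:
    dx_a/dt = x_a * [ (q_a - <x, q>) / temperature - (log x_a - <x, log x>) ],
    i.e., a replicator (exploitation) term plus an entropy-driven mutation
    (exploration) term."""
    exploit = (q - np.dot(x, q)) / temperature
    explore = -(np.log(x) - np.dot(x, np.log(x)))
    x_next = x + dt * x * (exploit + explore)
    x_next = np.clip(x_next, 1e-12, None)   # stay strictly inside the simplex
    return x_next / x_next.sum()
```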
Example 5.2 (Reward transformed FTRL (Perolat et al., 2021)).
5.2 Convergence Results with General $G$ and $\psi$
Next, we establish convergence results for APMD with general combinations of $G$ and $\psi$. For the theoretical analysis, we impose the following condition on $G$:
Assumption 5.4.
$G(\cdot,\sigma_i)$ is differentiable over $\mathrm{int}(\mathrm{dom}(\psi))$. Moreover, $G(\cdot,\sigma_i)$ is $\beta$-smooth and $\gamma$-strongly convex relative to $\psi$, i.e., for any $\pi_i,\pi_i'\in\mathrm{int}(\mathrm{dom}(\psi))$, $\gamma\, D_\psi(\pi_i',\pi_i) \le G(\pi_i',\sigma_i) - G(\pi_i,\sigma_i) - \langle \nabla_{\pi_i}G(\pi_i,\sigma_i),\ \pi_i'-\pi_i\rangle \le \beta\, D_\psi(\pi_i',\pi_i)$ holds.
Note that these assumptions are always satisfied with $\beta=\gamma=1$ whenever $G$ is identical to $D_\psi$; thus, these are not strong assumptions. We also assume that the updated strategy profile is well-defined over the iterations:
Assumption 5.5.
The strategy profile updated by APMD lies in $\mathrm{int}(\mathrm{dom}(\psi))$ for any iteration.
Lemma 5.6.
Lemma 5.7.


These lemmas imply that the strategy profile approaches the stationary point as the inner iterations proceed. Therefore, when $T_\sigma$ is sufficiently large, the next slingshot strategy profile becomes almost equivalent to the stationary point $\pi^{\mu,\sigma^k}$. From this, we anticipate that the slingshot strategy profile converges to an equilibrium as $k\to\infty$, even in the noisy feedback setting. The subsequent theorems provide asymptotic last-iterate convergence results for this ideal scenario. In particular, we show that the slingshot strategy profile converges to an equilibrium when using the following divergence functions as $G$: 1) Bregman divergence; 2) $\alpha$-divergence; 3) Rényi-divergence; 4) reverse KL divergence.
Theorem 5.8.
Assume that $G$ is a Bregman divergence associated with some strongly convex function, and that $\sigma^{k+1}=\pi^{\mu,\sigma^k}$ for every $k\ge 0$. Then, there exists a Nash equilibrium $\pi^\ast\in\Pi^\ast$ such that $\sigma^k\to\pi^\ast$ as $k\to\infty$.
Theorem 5.9.
Assume that $\sigma^{k+1}=\pi^{\mu,\sigma^k}$ for every $k\ge 0$, and that $G$ is one of the following divergences: 1) $\alpha$-divergence; 2) Rényi-divergence (each with an appropriate parameter range); 3) reverse KL divergence. If the initial slingshot strategy profile $\sigma^0$ is in the interior of $\mathcal{X}$, the sequence $\{\sigma^k\}_{k\ge 0}$ converges to the set of Nash equilibria of the underlying game.
6 Experiments
This section empirically compares the representative instance of MD, namely Multiplicative Weights Update (MWU), and its optimistic version (OMWU), with our framework. Specifically, we consider the following three instances of APMD: (i) both $G$ and $D_\psi$ are the squared $\ell^2$-distance; (ii) both $G$ and $D_\psi$ are the KL divergence, which is also an instance of Reward transformed FTRL in Example 5.2 (note that if the slingshot strategy is fixed to the uniform distribution, this algorithm corresponds to Boltzmann Q-Learning in Example 5.1); (iii) the divergence function $G$ is the reverse KL divergence and the Bregman divergence $D_\psi$ is the KL divergence, which matches Mutant FTRL in Example 5.3.
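The gradients of these three perturbation choices, which are what actually enter the APMD update (3), can be written down directly. The sketch below is illustrative only (the function names are ours, not the paper's):

```python
import numpy as np

def grad_squared_l2(x, sigma):
    """d/dx of 0.5 * ||x - sigma||^2  (instance (i))."""
    return x - sigma

def grad_kl(x, sigma):
    """d/dx of KL(x, sigma) = sum_a x_a * log(x_a / sigma_a)  (instance (ii))."""
    return np.log(x) - np.log(sigma) + 1.0

def grad_reverse_kl(x, sigma):
    """d/dx of KL(sigma, x) = sum_a sigma_a * log(sigma_a / x_a)  (instance (iii))."""
    return -sigma / x
```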
We focus on two classes of zero-sum polymatrix games: Three-Player Biased Rock-Paper-Scissors (3BRPS) and three-player random payoff games with two different numbers of actions. For 3BRPS, each player simultaneously participates in two instances of the game in Table 2 in Appendix I, one with each of the two other players. For the random payoff games, each player likewise participates in two instances of the game, one with each of the two other players. Each entry of the payoff matrix for each instance is drawn independently from a uniform distribution.
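A minimal sketch of a gradient oracle for such a random zero-sum polymatrix game is shown below; the sampling interval [-1, 1] and the default sizes are assumptions for illustration, not the paper's settings:

```python
import numpy as np

def random_zero_sum_polymatrix(num_players=3, num_actions=10, rng=None):
    """Gradient oracle of a random zero-sum polymatrix game. Each pairwise
    payoff matrix is drawn uniformly and the opponent's matrix is its negated
    transpose, so every pairwise game is zero-sum."""
    rng = np.random.default_rng(0) if rng is None else rng
    M = {}
    for i in range(num_players):
        for j in range(i + 1, num_players):
            A = rng.uniform(-1.0, 1.0, size=(num_actions, num_actions))
            M[(i, j)], M[(j, i)] = A, -A.T
    def payoff_gradients(strategies):
        # nabla_{pi_i} v_i(pi) = sum_{j != i} M[(i, j)] @ pi_j
        return [sum(M[(i, j)] @ strategies[j]
                    for j in range(num_players) if j != i)
                for i in range(num_players)]
    return payoff_gradients
```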
Figures 2 and 3 illustrate the logarithm of the gap function, averaged over instances with different random seeds. For 3BRPS, the initial slingshot strategy profile is chosen uniformly at random in the interior of the strategy space in each instance, while for the random payoff games it is set to the uniform distribution over the actions in every instance.
First, Figure 2 depicts the case of full feedback. Unless otherwise specified, we use a constant learning rate and a constant perturbation strength for APMD. Further details and additional experiments can be found in Appendix I. Figure 2 shows that APMD outperforms MWU and OMWU in all three games. Notably, APMD exhibits the fastest convergence in terms of the gap function when using the squared $\ell^2$-distance as both $G$ and $D_\psi$. Next, Figure 3 depicts the case of noisy feedback. We assume that the noise vector is generated from a multivariate Gaussian distribution in an i.i.d. manner. To account for the noise, we use a lower learning rate than in the full feedback case. In OMWU, we use the noisy gradient vector at the previous step as the prediction vector for the current iteration. We observe the same trends as with full feedback: while MWU and OMWU exhibit worse performance, APMD maintains its fast convergence, as predicted by the theoretical analysis.
7 Related Literature
Recent progress in achieving no-regret learning with full feedback has been driven by optimistic learning (Rakhlin & Sridharan, 2013a, b). Optimistic versions of well-known algorithms like Follow the Regularized Leader (Shalev-Shwartz & Singer, 2006) and Mirror Descent (Zhou et al., 2017; Hsieh et al., 2021) have been proposed to admit last-iterate convergence in a wide range of game settings. These optimistic algorithms have been successfully applied to various classes of games, including bilinear games (Daskalakis et al., 2018; Daskalakis & Panageas, 2019; Liang & Stokes, 2019; de Montbrun & Renault, 2022), cocoercive games (Lin et al., 2020), and saddle point problems (Daskalakis & Panageas, 2018; Mertikopoulos et al., 2019; Golowich et al., 2020b; Wei et al., 2021; Lei et al., 2021; Yoon & Ryu, 2021; Lee & Kim, 2021; Cevher et al., 2023). The advancements have provided solutions to monotone games and have established convergence rates (Golowich et al., 2020a; Cai et al., 2022a, b; Gorbunov et al., 2022; Cai & Zheng, 2023).
Learning with noisy feedback poses significant challenges, in contrast to the full feedback setting. In situations where feedback is imprecise or limited, algorithms must estimate action values at each iteration. There have been two trends in achieving last-iterate convergence: restricting the class of games and perturbing the payoff functions. On the one hand, notable works address potential games (Cohen et al., 2017), normal-form games with strict Nash equilibria (Giannou et al., 2021b, a), and two-player zero-sum games (Abe et al., 2023). Noisy feedback has also been handled in games assumed to be strictly (or strongly) monotone (Bravo et al., 2018; Kannan & Shanbhag, 2019; Hsieh et al., 2019; Anagnostides & Panageas, 2022), or strictly variationally stable (Mertikopoulos & Zhou, 2019; Mertikopoulos et al., 2019, 2022; Azizian et al., 2021). Note that variationally stable games, often referred to in control theory, are a slightly broader class than monotone games. These studies thus impose strict or strong monotonicity (or strict variational stability) on the underlying game. When these restrictions are not imposed, convergence is primarily guaranteed only in an asymptotic sense, and the rate is not quantified (Hsieh et al., 2020, 2022).
On the other hand, payoff-perturbed algorithms have recently regained attention for their ability to demonstrate convergence in unrestricted games when noise is present. As described in Section 1, payoff perturbation is a textbook technique (Facchinei & Pang, 2003) that has been extensively studied (Koshal et al., 2010, 2013; Yousefian et al., 2017; Tatarenko & Kamgarpour, 2019; Abe et al., 2023; Cen et al., 2021, 2023; Cai et al., 2023; Pattathil et al., 2023). It is known that carefully adjusting the magnitude of perturbation ensures convergence to a Nash equilibrium. This magnitude is computed as the product of a strongly convex penalty and a perturbation strength parameter. Liu et al. (2023) shrink the perturbation strength based on a predefined hyper-parameter and the gap function of the current strategy. Likewise, Koshal et al. (2010) and Tatarenko & Kamgarpour (2019) have identified somewhat complex conditions that the sequence of the perturbation strength parameters and learning rates should satisfy. Roughly speaking, as we have implied in Lemma 4.2, a smaller strength would require a lower learning rate. This potentially decelerates the convergence rate and complicates the task of finding an appropriate learning rate. For practicality, we have opted to keep the perturbation strength constant, independent of the iteration in APMD. Moreover, it must be emphasized that the existing literature has primarily provided asymptotic convergence results, while we have successfully provided the non-asymptotic convergence rate.
Finally, the idea of the slingshot strategy update was initiated by Perolat et al. (2021) and later extended by Abe et al. (2023). Part of our contribution lies in quantifying the convergence rate of this approach for the first time. We must also mention that Sokota et al. (2023) proposed a very similar, but essentially different, update rule to ours. Their method adds an additional regularization term based on an anchoring strategy, which they call a magnetic strategy, meaning that it directly perturbs the (expected) payoff functions. In contrast, our APMD indirectly perturbs the payoff functions, i.e., it perturbs the gradient vector. Furthermore, we have established non-asymptotic convergence results toward a Nash equilibrium, while Sokota et al. (2023) have only shown convergence toward a quantal response equilibrium (McKelvey & Palfrey, 1995, 1998), which corresponds merely to an approximate equilibrium. Similar results have been obtained with the Boltzmann Q-learning dynamics (Tuyls et al., 2006) and penalty-regularized dynamics (Coucheney et al., 2015) in continuous-time settings (Leslie & Collins, 2005; Hussain et al., 2023).
8 Conclusion
This paper proposes a novel variant of MD that achieves last-iterate convergence even when the noise is present, by adaptively adjusting the magnitude of the perturbation. This research could lead to several intriguing future studies, such as finding the best perturbation strength for the optimal convergence rate and achieving convergence with more limited feedback, for example, using bandit feedback (Bravo et al., 2018; Drusvyatskiy et al., 2022).
Acknowledgments
Kaito Ariu is supported by JSPS KAKENHI Grant Number 23K19986. Atsushi Iwasaki is supported by JSPS KAKENHI Grant Numbers 21H04890 and 23K17547.
Impact Statement
This paper focuses on discussing the problem of computing equilibria in games. There are potential social implications associated with our work, but none that we believe need to be particularly emphasized here.
References
- Abe et al. (2022) Abe, K., Sakamoto, M., and Iwasaki, A. Mutation-driven follow the regularized leader for last-iterate convergence in zero-sum games. In UAI, pp. 1–10, 2022.
- Abe et al. (2023) Abe, K., Ariu, K., Sakamoto, M., Toyoshima, K., and Iwasaki, A. Last-iterate convergence with full and noisy feedback in two-player zero-sum games. In AISTATS, pp. 7999–8028, 2023.
- Anagnostides & Panageas (2022) Anagnostides, I. and Panageas, I. Frequency-domain representation of first-order methods: A simple and robust framework of analysis. In SOSA, pp. 131–160, 2022.
- Azizian et al. (2021) Azizian, W., Iutzeler, F., Malick, J., and Mertikopoulos, P. The last-iterate convergence rate of optimistic mirror descent in stochastic variational inequalities. In COLT, pp. 326–358, 2021.
- Bailey & Piliouras (2018) Bailey, J. P. and Piliouras, G. Multiplicative weights update in zero-sum games. In Economics and Computation, pp. 321–338, 2018.
- Beck & Teboulle (2003) Beck, A. and Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, 2003.
- Bloembergen et al. (2015) Bloembergen, D., Tuyls, K., Hennes, D., and Kaisers, M. Evolutionary dynamics of multi-agent learning: A survey. Journal of Artificial Intelligence Research, 53:659–697, 2015.
- Bravo et al. (2018) Bravo, M., Leslie, D., and Mertikopoulos, P. Bandit learning in concave N-person games. In NeurIPS, pp. 5666–5676, 2018.
- Cai & Daskalakis (2011) Cai, Y. and Daskalakis, C. On minmax theorems for multiplayer games. In SODA, pp. 217–234, 2011.
- Cai & Zheng (2023) Cai, Y. and Zheng, W. Doubly optimal no-regret learning in monotone games. arXiv preprint arXiv:2301.13120, 2023.
- Cai et al. (2016) Cai, Y., Candogan, O., Daskalakis, C., and Papadimitriou, C. Zero-sum polymatrix games: A generalization of minmax. Mathematics of Operations Research, 41(2):648–655, 2016.
- Cai et al. (2022a) Cai, Y., Oikonomou, A., and Zheng, W. Finite-time last-iterate convergence for learning in multi-player games. In NeurIPS, pp. 33904–33919, 2022a.
- Cai et al. (2022b) Cai, Y., Oikonomou, A., and Zheng, W. Tight last-iterate convergence of the extragradient method for constrained monotone variational inequalities. arXiv preprint arXiv:2204.09228, 2022b.
- Cai et al. (2023) Cai, Y., Luo, H., Wei, C.-Y., and Zheng, W. Uncoupled and convergent learning in two-player zero-sum Markov games. arXiv preprint arXiv:2303.02738, 2023.
- Cen et al. (2021) Cen, S., Wei, Y., and Chi, Y. Fast policy extragradient methods for competitive games with entropy regularization. In NeurIPS, pp. 27952–27964, 2021.
- Cen et al. (2023) Cen, S., Chi, Y., Du, S. S., and Xiao, L. Faster last-iterate convergence of policy optimization in zero-sum Markov games. In ICLR, 2023.
- Cevher et al. (2023) Cevher, V., Piliouras, G., Sim, R., and Skoulakis, S. Min-max optimization made simple: Approximating the proximal point method via contraction maps. In Symposium on Simplicity in Algorithms (SOSA), pp. 192–206, 2023.
- Cohen et al. (2017) Cohen, J., Héliou, A., and Mertikopoulos, P. Learning with bandit feedback in potential games. In NeurIPS, pp. 6372–6381, 2017.
- Coucheney et al. (2015) Coucheney, P., Gaujal, B., and Mertikopoulos, P. Penalty-regulated dynamics and robust learning procedures in games. Mathematics of Operations Research, 40(3):611–633, 2015.
- Daskalakis & Panageas (2018) Daskalakis, C. and Panageas, I. The limit points of (optimistic) gradient descent in min-max optimization. In NeurIPS, pp. 9256–9266, 2018.
- Daskalakis & Panageas (2019) Daskalakis, C. and Panageas, I. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In ITCS, pp. 27:1–27:18, 2019.
- Daskalakis et al. (2018) Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. In ICLR, 2018.
- de Montbrun & Renault (2022) de Montbrun, É. and Renault, J. Convergence of optimistic gradient descent ascent in bilinear games. arXiv preprint arXiv:2208.03085, 2022.
- Debreu (1952) Debreu, G. A social equilibrium existence theorem. Proceedings of the National Academy of Sciences, 38(10):886–893, 1952.
- Drusvyatskiy et al. (2022) Drusvyatskiy, D., Fazel, M., and Ratliff, L. J. Improved rates for derivative free gradient play in strongly monotone games. In CDC, pp. 3403–3408. IEEE, 2022.
- Facchinei & Pang (2003) Facchinei, F. and Pang, J.-S. Finite-dimensional variational inequalities and complementarity problems. Springer, 2003.
- Giannou et al. (2021a) Giannou, A., Vlatakis-Gkaragkounis, E. V., and Mertikopoulos, P. Survival of the strictest: Stable and unstable equilibria under regularized learning with partial information. In COLT, pp. 2147–2148, 2021a.
- Giannou et al. (2021b) Giannou, A., Vlatakis-Gkaragkounis, E.-V., and Mertikopoulos, P. On the rate of convergence of regularized learning in games: From bandits and uncertainty to optimism and beyond. In NeurIPS, pp. 22655–22666, 2021b.
- Golowich et al. (2020a) Golowich, N., Pattathil, S., and Daskalakis, C. Tight last-iterate convergence rates for no-regret learning in multi-player games. In NeurIPS, pp. 20766–20778, 2020a.
- Golowich et al. (2020b) Golowich, N., Pattathil, S., Daskalakis, C., and Ozdaglar, A. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. In COLT, pp. 1758–1784, 2020b.
- Gorbunov et al. (2022) Gorbunov, E., Taylor, A., and Gidel, G. Last-iterate convergence of optimistic gradient method for monotone variational inequalities. In NeurIPS, pp. 21858–21870, 2022.
- Halpern (1967) Halpern, B. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73(6):957 – 961, 1967.
- Hart & Mas-Colell (2000) Hart, S. and Mas-Colell, A. A simple adaptive procedure leading to correlated equilibrium. Econometrica, 68(5):1127–1150, 2000.
- Hsieh et al. (2019) Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos, P. On the convergence of single-call stochastic extra-gradient methods. In NeurIPS, pp. 6938–6948, 2019.
- Hsieh et al. (2020) Hsieh, Y.-G., Iutzeler, F., Malick, J., and Mertikopoulos, P. Explore aggressively, update conservatively: Stochastic extragradient methods with variable stepsize scaling. Advances in Neural Information Processing Systems, 33:16223–16234, 2020.
- Hsieh et al. (2021) Hsieh, Y.-G., Antonakopoulos, K., and Mertikopoulos, P. Adaptive learning in continuous games: Optimal regret bounds and convergence to Nash equilibrium. In COLT, pp. 2388–2422, 2021.
- Hsieh et al. (2022) Hsieh, Y.-G., Antonakopoulos, K., Cevher, V., and Mertikopoulos, P. No-regret learning in games with noisy feedback: Faster rates and adaptivity via learning rate separation. In NeurIPS, pp. 6544–6556, 2022.
- Hussain et al. (2023) Hussain, A. A., Belardinelli, F., and Piliouras, G. Asymptotic convergence and performance of multi-agent Q-learning dynamics. arXiv preprint arXiv:2301.09619, 2023.
- Kannan & Shanbhag (2019) Kannan, A. and Shanbhag, U. V. Optimal stochastic extragradient schemes for pseudomonotone stochastic variational inequality problems and their variants. Computational Optimization and Applications, 74(3):779–820, 2019.
- Koshal et al. (2010) Koshal, J., Nedić, A., and Shanbhag, U. V. Single timescale regularized stochastic approximation schemes for monotone nash games under uncertainty. In CDC, pp. 231–236. IEEE, 2010.
- Koshal et al. (2013) Koshal, J., Nedic, A., and Shanbhag, U. V. Regularized iterative stochastic approximation methods for stochastic variational inequality problems. IEEE Transactions on Automatic Control, 58(3):594–609, 2013.
- Lattimore & Szepesvári (2020) Lattimore, T. and Szepesvári, C. Bandit algorithms. Cambridge University Press, 2020.
- Lee & Kim (2021) Lee, S. and Kim, D. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. In NeurIPS, pp. 22588–22600, 2021.
- Lei et al. (2021) Lei, Q., Nagarajan, S. G., Panageas, I., et al. Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes. In AISTATS, pp. 1441–1449, 2021.
- Leslie & Collins (2005) Leslie, D. S. and Collins, E. J. Individual q-learning in normal form games. SIAM Journal on Control and Optimization, 44(2):495–514, 2005.
- Liang & Stokes (2019) Liang, T. and Stokes, J. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In AISTATS, pp. 907–915, 2019.
- Lin et al. (2020) Lin, T., Zhou, Z., Mertikopoulos, P., and Jordan, M. Finite-time last-iterate convergence for multi-agent learning in games. In ICML, pp. 6161–6171, 2020.
- Liu et al. (2023) Liu, M., Ozdaglar, A., Yu, T., and Zhang, K. The power of regularization in solving extensive-form games. In ICLR, 2023.
- McKelvey & Palfrey (1995) McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for normal form games. Games and economic behavior, 10(1):6–38, 1995.
- McKelvey & Palfrey (1998) McKelvey, R. D. and Palfrey, T. R. Quantal response equilibria for extensive form games. Experimental economics, 1:9–41, 1998.
- Mertikopoulos & Zhou (2019) Mertikopoulos, P. and Zhou, Z. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming, 173(1):465–507, 2019.
- Mertikopoulos et al. (2018) Mertikopoulos, P., Papadimitriou, C., and Piliouras, G. Cycles in adversarial regularized learning. In SODA, pp. 2703–2717, 2018.
- Mertikopoulos et al. (2019) Mertikopoulos, P., Lecouat, B., Zenati, H., Foo, C.-S., Chandrasekhar, V., and Piliouras, G. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In ICLR, 2019.
- Mertikopoulos et al. (2022) Mertikopoulos, P., Hsieh, Y.-P., and Cevher, V. Learning in games from a stochastic approximation viewpoint. arXiv preprint arXiv:2206.03922, 2022.
- Nash (1951) Nash, J. Non-cooperative games. Annals of mathematics, pp. 286–295, 1951.
- Nedić & Lee (2014) Nedić, A. and Lee, S. On stochastic subgradient mirror-descent algorithm with weighted averaging. SIAM Journal on Optimization, 24(1):84–107, 2014.
- Nemirovski et al. (2009) Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
- Nemirovskij & Yudin (1983) Nemirovskij, A. S. and Yudin, D. B. Problem complexity and method efficiency in optimization. Wiley, 1983.
- Pattathil et al. (2023) Pattathil, S., Zhang, K., and Ozdaglar, A. Symmetric (optimistic) natural policy gradient for multi-agent learning with parameter convergence. In AISTATS, pp. 5641–5685, 2023.
- Perolat et al. (2021) Perolat, J., Munos, R., Lespiau, J.-B., Omidshafiei, S., Rowland, M., Ortega, P., Burch, N., Anthony, T., Balduzzi, D., De Vylder, B., et al. From Poincaré recurrence to convergence in imperfect information games: Finding equilibrium via regularization. In ICML, pp. 8525–8535, 2021.
- Rakhlin & Sridharan (2013a) Rakhlin, A. and Sridharan, K. Online learning with predictable sequences. In COLT, pp. 993–1019, 2013a.
- Rakhlin & Sridharan (2013b) Rakhlin, S. and Sridharan, K. Optimization, learning, and games with predictable sequences. In NeurIPS, pp. 3066–3074, 2013b.
- Rockafellar (1997) Rockafellar, R. T. Convex analysis, volume 11. Princeton university press, 1997.
- Shalev-Shwartz & Singer (2006) Shalev-Shwartz, S. and Singer, Y. Convex repeated games and fenchel duality. Advances in neural information processing systems, 19, 2006.
- Sokota et al. (2023) Sokota, S., D’Orazio, R., Kolter, J. Z., Loizou, N., Lanctot, M., Mitliagkas, I., Brown, N., and Kroer, C. A unified approach to reinforcement learning, quantal response equilibria, and two-player zero-sum games. In ICLR, 2023.
- Tammelin (2014) Tammelin, O. Solving large imperfect information games using CFR+. arXiv preprint arXiv:1407.5042, 2014.
- Tatarenko & Kamgarpour (2019) Tatarenko, T. and Kamgarpour, M. Learning Nash equilibria in monotone games. In CDC, pp. 3104–3109. IEEE, 2019.
- Tuyls et al. (2006) Tuyls, K., Hoen, P. J., and Vanschoenwinkel, B. An evolutionary dynamical analysis of multi-agent learning in iterated games. Autonomous Agents and Multi-Agent Systems, 12(1):115–153, 2006.
- Wei et al. (2021) Wei, C.-Y., Lee, C.-W., Zhang, M., and Luo, H. Linear last-iterate convergence in constrained saddle-point optimization. In ICLR, 2021.
- Yoon & Ryu (2021) Yoon, T. and Ryu, E. K. Accelerated algorithms for smooth convex-concave minimax problems with rate on squared gradient norm. In ICML, pp. 12098–12109, 2021.
- Yousefian et al. (2017) Yousefian, F., Nedić, A., and Shanbhag, U. V. On smoothing, regularization, and averaging in stochastic approximation methods for stochastic variational inequality problems. Mathematical Programming, 165:391–431, 2017.
- Zhou et al. (2017) Zhou, Z., Mertikopoulos, P., Moustakas, A. L., Bambos, N., and Glynn, P. Mirror descent learning in continuous games. In CDC, pp. 5776–5783. IEEE, 2017.
Appendix A Notations
In this section, we summarize the notations we use in Table 1.
Symbol | Description
---|---
$N$ | Number of players
$\mathcal{X}_i$ | Strategy space for player $i$
$\mathcal{X}$ | Joint strategy space: $\mathcal{X}=\prod_{i\in[N]}\mathcal{X}_i$
$v_i$ | Payoff function for player $i$
$\pi_i$ | Strategy for player $i$
$\pi$ | Strategy profile: $\pi=(\pi_i)_{i\in[N]}$
$\xi_i^t$ | Noise vector for player $i$ at iteration $t$
$\pi^\ast$ | Nash equilibrium
$\Pi^\ast$ | Set of Nash equilibria
$\mathrm{GAP}(\pi)$ | Gap function of $\pi$: $\mathrm{GAP}(\pi)=\max_{\tilde{\pi}\in\mathcal{X}}\sum_{i\in[N]}\langle\nabla_{\pi_i}v_i(\pi),\tilde{\pi}_i-\pi_i\rangle$
$\Delta^d$ | $d$-dimensional probability simplex: $\Delta^d=\{p\in[0,1]^d:\sum_{a=1}^{d}p_a=1\}$
$\mathrm{diam}(\mathcal{X})$ | Diameter of $\mathcal{X}$: $\mathrm{diam}(\mathcal{X})=\sup_{\pi,\pi'\in\mathcal{X}}\|\pi-\pi'\|$
$\mathrm{KL}(\cdot,\cdot)$ | Kullback-Leibler divergence
$D_\psi(\cdot,\cdot)$ | Bregman divergence associated with $\psi$
$\nabla_{\pi_i}v_i$ | Gradient of $v_i$ with respect to $\pi_i$
$\eta_t$ | Learning rate at iteration $t$
$\mu$ | Perturbation strength
$\sigma$ | Slingshot strategy profile
$G(\cdot,\cdot)$ | Divergence function for payoff perturbation
$\nabla_{\pi_i}G(\pi_i,\sigma_i)$ | Gradient of $G$ with respect to first argument
$T_\sigma$ | Update interval for the slingshot strategy
$K$ | Total number of the slingshot strategy updates
$\pi^{\mu,\sigma}$ | Stationary point satisfying (4) for given $\mu$ and $\sigma$
$\pi^t$ | Strategy profile at iteration $t$
$\sigma^k$ | Slingshot strategy profile after $k$ updates
$L$ | Smoothness parameter of the gradient of the payoff functions
$\rho$ | Strong convexity parameter of $\psi$
$\beta$ | Smoothness parameter of $G(\cdot,\sigma_i)$ relative to $\psi$
$\gamma$ | Strong convexity parameter of $G(\cdot,\sigma_i)$ relative to $\psi$
Appendix B Formal Theorems and Lemmas
B.1 Full Feedback Setting
Theorem B.1 (Formal version of Theorem 4.1).
Assume that for any . If we use the constant learning rate , and set and as squared -distance , and set for some constant , then the strategy profile updated by APMD satisfies:
Lemma B.2 (Formal version of Lemma 4.2).
Assume that and are set as the squared -distance. If we use the constant learning rate , the -th slingshot strategy profile of APMD under the full feedback setting satisfies that:
Lemma B.3 (Formal version of Lemma 4.3).
Assume that for any . If is set as the squared -distance, the gap function for of APMD satisfies for :
B.2 Noisy Feedback Setting
For the noisy feedback setting, we assume that the noise vector $\xi_i^t$ is a zero-mean independent random vector with bounded variance.
Assumption B.5.
For all $t\ge 0$ and $i\in[N]$, the noise vector $\xi_i^t$ satisfies the following properties: (a) zero-mean: $\mathbb{E}[\xi_i^t\mid\mathcal{F}_t]=0$; (b) bounded variance: $\mathbb{E}[\|\xi_i^t\|^2\mid\mathcal{F}_t]\le C^2$ for some constant $C>0$.
This is a standard assumption in learning in games with noisy feedback (Mertikopoulos & Zhou, 2019; Hsieh et al., 2019) and stochastic optimization (Nemirovski et al., 2009; Nedić & Lee, 2014). Under Assumption B.5, we can obtain the following convergence results for APMD under the noisy feedback setting.
Theorem B.6 (Formal version of Theorem 4.5).
Let and . Suppose that Assumption B.5 holds and for any . We also assume that and are set as squared -distance , and for some constant . If we choose the learning rate sequence of the form , then the strategy profile updated by APMD satisfies:
B.3 Convergence Results with General and
Lemma B.9 (Formal version of Lemma 5.6).
Appendix C Extension to Follow the Regularized Leader
In Sections 4 and 5, we introduced and analyzed APMD, which extends the standard MD approach. Similarly, it is possible to extend the FTRL approach as well. In this section, we present Adaptively Perturbed Follow the Regularized Leader (APFTRL), which incorporates the perturbation term into the conventional FTRL algorithm:
In this section, we assume that each $\mathcal{X}_i$ is an affine subset, which means there exist a matrix $A$ and a vector $b$ such that $A\pi_i = b$ for all $\pi_i\in\mathcal{X}_i$. Additionally, we assume that $\psi$ is a Legendre function, as described in Rockafellar (1997) and Lattimore & Szepesvári (2020). Another assumption we make is that the updated strategy profile is consistently well-defined over the iterations:
Assumption C.1.
, updated by APFTRL, satisfies the condition for every .
Then, APFTRL enjoys last-iterate convergence in both the full and noisy feedback settings, which is established by proving the following lemmas:
Lemma C.2.
Lemma C.3.
Appendix D Additional Theorems
Based on this theorem, we can show that the gap function for converges to the value of .
Theorem D.1.
In the same setup of Lemma 5.6, the gap function for APMD is bounded as:
Appendix E Proofs for Section 4
E.1 Proof of Theorem B.1 (Formal Version of Theorem 4.1)
Proof of Theorem B.1.
First, from Lemma B.3, we have for any :
Using Lemma B.2, we can upper bound the term of as follows:
Combining these inequalities, we have for any :
where the second inequality follows from . Let us denote as the total number of the slingshot strategy updates over the entire iterations. By letting in the above inequality, we get:
(6) |
Next, we derive the following upper bound on from Lemma B.4:
(7) |
Finally, since , , and , we have:
This concludes the statement of the theorem. ∎
E.2 Proof of Lemma B.2 (Formal Version of Lemma 4.2)
Proof of Lemma B.2.
From the definition of the Bregman divergence, we have for all :
Hence, assuming that is identical to , Assumption 5.4 is satisfied with . Furthermore, since , both and hold. Therefore, Assumption 5.5 is also satisfied. Consequently, we can obtain the following convergence result from Lemma B.9 with :
∎
E.3 Proof of Lemma B.3 (Formal Version of Lemma 4.3)
Proof of Lemma B.3.
First, we have for any :
(8) |
Here, we introduce the following lemma from Cai et al. (2022a):
Lemma E.1 (Lemma 2 of Cai et al. (2022a)).
For any , we have:
where .
Next, from Cauchy-Schwarz inequality, the second term of (8) can be upper bounded as:
(10) |
Again from Cauchy-Schwarz inequality, the third term of (8) can be upper bounded as:
(11) |
On the other hand, from the first-order optimality condition for , we have for any :
and then . Thus, the first term of (12) can be bounded as:
(13) |
E.4 Proof of Lemma B.4 (Formal Version of Lemma 4.4)
Proof of Lemma B.4.
First, we prove the following lemma:
Lemma E.2.
Assume that for any . If is set as the squared -distance, we have for any :
From Lemma E.2, we can bound as:
(14) |
Next, we upper bound using the following lemma:
Lemma E.3.
Assume that . In the same setup of Theorem 4.1, we have for any Nash equilibrium :
E.5 Proof of Theorem B.6 (Formal Version of Theorem 4.5)
Proof of Theorem B.6.
First, from Lemma B.3, we have for any :
Using Lemma B.7, we can upper bound the term of as follows:
Combining these inequalities, we have for any :
where we use . Let us denote as the total number of the slingshot strategy updates over the entire iterations. By letting in the above inequality, we get:
(17) |
Next, we derive the following upper bound on from Lemma B.8:
(18) |
Finally, since , , and , we have:
This concludes the statement of the theorem. ∎
E.6 Proof of Lemma B.7 (Formal Version of Lemma 4.6)
E.7 Proof of Lemma B.8 (Formal Version of Lemma 4.7)
Appendix F Proofs for Section 5
F.1 Proof of Lemma B.9 (Formal Version of Lemma 5.6)
Proof of Lemma B.9.
From the definition of the Bregman divergence, we have for any :
(20) |
From the first-order optimality condition for , we get for any :
(21) |
Note that and are well-defined because of Assumptions 5.4 and 5.5. By combining (20) and (21), we have:
(22) |
Next, we derive the following convergence result for :
Lemma F.1.
Suppose that Assumption 5.4 holds with , and the updated strategy profile satisfies the following condition: for any ,
Then, for any :
F.2 Proof of Lemma B.10 (Formal Version of Lemma 5.7)
Proof of Lemma B.10.
Writing , from the first-order optimality condition for , we get for any :
(23) |
Note that and are well-defined because of Assumptions 5.4 and 5.5. By combining (20) and (23), we have for any :
(24) |
We have the following Lemma that replaces the gradient with Bregman divergences:
Lemma F.2.
Under the noisy feedback setting, suppose that Assumption 5.4 holds with , and the updated strategy profile satisfies the following condition: for any ,
Then, for any :
Then, using this inequality, we can bound the expected value of as follows:
Lemma F.3.
Suppose that with some constants , for all , the following inequality holds:
Then, under Assumption B.5, for any ,
Taking , we have:
Since and , we conclude the statement. ∎
F.3 Proof of Theorem 5.8
Proof of Theorem 5.8.
When for all and , we can show that the Bregman divergence from a Nash equilibrium to monotonically decreases:
Lemma F.4.
Assume that is a Bregman divergence for some strongly convex function . Then, for any Nash equilibrium of the underlying game, we have for any :
By summing the inequality in Lemma F.4 from to , we have:
where the second inequality follows from the strong convexity of . Therefore, , which implies that as .
By the compactness of and Bolzano–Weierstrass theorem, there exists a subsequence and a limit point such that as . Since as , we have as . Thus, the limit point is the fixed point of the updating rule. From the following lemma, we show that the fixed point is a Nash equilibrium of the underlying game:
Lemma F.5.
Assume that is a Bregman divergence for some strongly convex function , and for . If , then is a Nash equilibrium of the underlying game.
On the other hand, by summing the inequality in Lemma F.4 from to for , we have:
Since as , we have as . Since is a Nash equilibrium of the underlying game, we conclude the first statement of the theorem.
∎
F.4 Proof of Theorem 5.9
Proof of Theorem 5.9.
We first show that the divergence between and decreases monotonically as increases:
Lemma F.6.
From Lemma F.6, the sequence is a monotonically decreasing sequence and is bounded from below by zero. Thus, converges to some constant . We show that by a contradiction argument.
Suppose and let us define . Since monotonically decreases, is in the set for all . Since is a continuous function on , the preimage of the closed set is also closed. Furthermore, since is compact and then bounded, is a bounded set. Thus, is a compact set.
Next, we show that the function which maps the slingshot strategies to the associated stationary point is continuous:
Lemma F.7.
From Lemma F.7, is also a continuous function. Since a continuous function has a maximum over a compact set , the maximum exists. From Lemma F.6 and the assumption that , we have . It follows that:
This implies that for , which contradicts . Therefore, the sequence converges to , and converges to . ∎
F.5 Proof of Lemma C.2
Proof of Lemma C.2.
First, we introduce the following lemma:
Lemma F.8.
Let us define . Assuming be a convex function of the Legendre type, we have for any :
F.6 Proof of Lemma C.3
Proof of Lemma C.3.
Writing and using Lemma F.8 in Appendix F.5, we have:
Note that is well-defined because of Assumptions 5.4 and C.1. Thus, we can apply Lemma F.2, and we have for any :
Therefore, the assumption in Lemma F.3 is satisfied, and we get:
Taking , we have:
Since and , we conclude the statement. ∎
Appendix G Proofs for Section D
G.1 Proof of Theorem D.1
Proof of Theorem D.1.
Since is concave, we can upper bound the gap function for as:
(26) |
From Lemma E.1, the first term of (26) can be upper bounded as:
From the first-order optimality condition for , we have for any :
and then . Thus,
(27) |
Next, from Cauchy–Schwarz inequality, the second term of (26) can be bounded as:
(28) |
Appendix H Proofs for Additional Lemmas
H.1 Proof of Lemma E.2
H.2 Proof of Lemma E.3
Proof of Lemma E.3.
From the first-order optimality condition for , we have:
Thus, from the three-point identity and Young’s inequality:
Here, since , we have:
Therefore, we get from Lemma B.2:
Summing up this inequality from to yields:
Then, from Cauchy–Schwarz inequality, we have:
By applying Lemma B.2 to the above inequality, we get:
Therefore, for , we get:
∎
H.3 Proof of Lemma E.4
H.4 Proof of Lemma F.1
Proof of Lemma F.1.
We first decompose the inequality in the assumption as follows:
(30) |
From the relative smoothness in Assumption 5.4 and the convexity of :
(31) |
Also, from the relative strong convexity in Assumption 5.4:
(32) |
By combining (30), (31), and (32), we have:
and then:
Summing this inequality from to implies that:
(33) |
where the second inequality follows from (1), and the third inequality follows from the first-order optimality condition for .
Here, from Young’s inequality, we have for any :
(34) |
where the second inequality follows from (2), and the fourth inequality follows from the strong convexity of .
H.5 Proof of Lemma F.2
Proof of Lemma F.2.
We first decompose the inequality in the assumption as follows:
(35) |
By combining (31), (32) in Appendix H.4, and (35),
Summing up these inequalities with respect to the player index,
(36) |
where the second inequality follows (1), and the third inequality follows from the first-order optimality condition for .
H.6 Proof of Lemma F.3
Proof of Lemma F.3.
Reforming the inequality in the assumption,
By taking the expectation conditioned on for both sides and using Assumption B.5 (a),
Therefore, rearranging and taking the expectations,
Telescoping the sum,
Here, we introduce the following lemma, whose proof is given in Appendix H.12, for the evaluation of the sum.
Lemma H.1.
For any and ,
In summary, we obtain the following inequality:
This concludes the proof. ∎
H.7 Proof of Lemma F.4
Proof of Lemma F.4.
Recall that for any and . By the first-order optimality condition for , we have for all :
Then,
Moreover, we have for any :
By combining these inequalities, we get for any :
where the second inequality follows from (1). Since is the Nash equilibrium, from the first-order optimality condition, we get:
Thus, we have for :
∎
H.8 Proof of Lemma F.5
Proof of Lemma F.5.
By using the first-order optimality condition for , we have for all :
and then
Under the assumption that , we have for all :
This is equivalent to the first-order optimality condition for . Therefore, is a Nash equilibrium of the underlying game. ∎
H.9 Proof of Lemma F.6
Proof of Lemma F.6.
First, we prove the first statement of the lemma by using the following lemmas:
Lemma H.2.
Assume that for , and is one of the following divergence: 1) -divergence with ; 2) Rényi-divergence with ; 3) reverse KL divergence. If , then is a Nash equilibrium of the underlying game.
Lemma H.3.
Assume that for , and is one of the following divergences: 1) -divergence with ; 2) Rényi-divergence with ; 3) reverse KL divergence. Then, if , we have for any and :
From Lemma H.2, when , always holds. Let us define . Since , from Lemma H.3, we have:
Therefore, if then .
Next, we prove the second statement of the lemma. Assume that there exists such that . In this case, we can apply Lemma H.3, hence we have for all . On the other hand, since , there exists a Nash equilibrium such that . Therefore, we have , which contradicts . Thus, if then . ∎
H.10 Proof of Lemma F.7
Proof of Lemma F.7.
For a given , let us consider that follows the following continuous-time dynamics:
(37)
We assume that . Note that this dynamics is the continuous-time version of APFTRL, so clearly defined by (4) is the stationary point of (37). We have for a given and the associated stationary point :
where . When , we have
and then,
Therefore, we get . Hence,
(38)
The first term of (38) can be written as:
where the first inequality follows from (1), and the second inequality follows from the first-order optimality condition for . When is -divergence, has a diagonal Hessian given as:
and thus, its smallest eigenvalue is lower bounded by . Therefore,
(39)
On the other hand, by compactness of , the second term of (38) is written as:
(40)
By combining (38), (39), and (40), we get:
Recall that is the stationary point of (37). Therefore, by setting the starting point as , we have for all . In this case, for all , and then:
For a given , let us define . Since is continuous on , for , there exists such that . Thus, for every , there exists such that
This implies that is a continuous function on when is -divergence. A similar argument can be applied to Rényi-divergence and reverse KL divergence. ∎
H.11 Proof of Lemma F.8
Proof of Lemma F.8.
H.12 Proof of Lemma H.1
Proof of Lemma H.1.
Since is a decreasing function for , for all ,
Using this inequality, we can upper bound the sum as follows.
This concludes the proof. ∎
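As a generic illustration of the sum–integral comparison underlying Lemma H.1 (the specific function in the lemma may differ), for any nonincreasing $f$ on $[1,\infty)$:
\[
\sum_{k=1}^{T} f(k) \;\le\; f(1) + \int_{1}^{T} f(x)\,dx,
\qquad \text{e.g.,}\quad \sum_{k=1}^{T}\frac{1}{k} \;\le\; 1 + \ln T.
\]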
H.13 Proof of Lemma H.2
Proof of Lemma H.2.
By using the first-order optimality condition for , we have for all :
and then
When is -divergence, we have for all :
where we use the assumption that and . Similarly, when is Rényi-divergence, we have for all :
Furthermore, if is reverse KL divergence, we have for all :
Thus, we have for all :
This is equivalent to the first-order optimality condition for . Therefore, is a Nash equilibrium of the underlying game. ∎
H.14 Proof of Lemma H.3
Proof of Lemma H.3.
First, we prove the statement for -divergence: . From the definition of -divergence, we have for all :
Here, when , we get . Thus,
where the second inequality follows from the concavity of the function and Jensen’s inequality for concave functions. Since is strictly concave, the equality holds if and only if . Therefore, under the assumption that , we get:
(43)
where the second inequality follows from for . From the first-order optimality condition for , we have for all :
Then,
(44)
where the second inequality follows from (1). Moreover, since is the Nash equilibrium, from the first-order optimality condition, we get:
(45)
By combining (43), (44), and (45), if , we have for any :
Next, we prove the statement for Rényi-divergence: . We have for all :
Again, by using when , we get:
where the second inequality follows from Jensen’s inequality for function. Since is strictly concave, the equality holds if and only if . Therefore, under the assumption that , we get:
(46)
where the second inequality follows from for . Thus, by combining (44), (45), and (46), if , we have for any :
Finally, we prove the statement for reverse KL divergence: . We have for all :
where the inequality follows from Jensen’s inequality for function. Thus, under the assumption that , we get:
(47)
where the second inequality follows from for . Thus, by combining (44), (45), and (47), if , we have for any :
∎
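To make the three divergences appearing in Lemmas H.2 and H.3 concrete, the following sketch evaluates them for strategies in the interior of the probability simplex under one common parameterization; the exact normalization and the orientation of the arguments used in the paper may differ, so treat the definitions below as illustrative assumptions.

```python
import numpy as np

def alpha_divergence(pi, sigma, alpha):
    # alpha-divergence with alpha in (0, 1), under one standard parameterization;
    # the normalization in the paper may differ.
    return (1.0 - np.sum(pi ** alpha * sigma ** (1.0 - alpha))) / (alpha * (1.0 - alpha))

def renyi_divergence(pi, sigma, beta):
    # Renyi divergence of order beta in (0, 1).
    return np.log(np.sum(pi ** beta * sigma ** (1.0 - beta))) / (beta - 1.0)

def reverse_kl(pi, sigma):
    # Reverse KL divergence, taken here as KL(sigma || pi); the argument order
    # relative to (strategy, slingshot) is an assumption.
    return np.sum(sigma * np.log(sigma / pi))

# Example on the 3-action simplex (interior points, so all quantities are finite).
pi = np.array([0.5, 0.3, 0.2])
sigma = np.array([1.0, 1.0, 1.0]) / 3.0
print(alpha_divergence(pi, sigma, alpha=0.5),
      renyi_divergence(pi, sigma, beta=0.5),
      reverse_kl(pi, sigma))
```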
Appendix I Additional Experimental Results and Details
I.1 Payoff Matrix in Three-Player Biased RPS Game
| | R | P | S |
|---|---|---|---|
| R | | | |
| P | | | |
| S | | | |
I.2 Experimental Setting for Section 6
The experiments in Section 6 are conducted in Ubuntu 20.04.2 LTS with Intel(R) Core(TM) i9-10850K CPU @ 3.60GHz and 64GB RAM.
In the full feedback setting, we use a constant learning rate for MWU, OMWU, and APMD in all three games. For APMD, we set and for KL and reverse KL divergence perturbation, and set and for squared -distance perturbation. As an exception, , , and are used for APMD with squared -distance perturbation in the random payoff games with actions.
For the noisy feedback setting, we use a lower learning rate for all algorithms, except APMD with squared -distance perturbation for the random payoff games with actions. We update the slingshot strategy profile every iterations in APMD with KL and reverse KL divergence perturbation, and update it every iterations in APMD with squared -distance perturbation. For APMD with -distance perturbation in the random payoff games with actions, we set and .
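For orientation, the sketch below only illustrates how such a configuration might be organized in code; every key name and numeric value is a placeholder, since the concrete learning rates, perturbation strengths, and update intervals are not reproduced here.

```python
# Illustrative layout only; all values below are placeholders, not the settings
# actually used in the experiments.
full_feedback = {
    "learning_rate": 0.1,              # hypothetical constant learning rate
    "perturbation": "squared_l2",      # alternatives: "kl", "reverse_kl"
    "perturbation_strength": 0.1,      # hypothetical strength parameter
    "slingshot_update_interval": 100,  # hypothetical number of iterations between updates
}
noisy_feedback = {**full_feedback, "learning_rate": 0.01}  # lower rate under noise
```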
I.3 Additional Experiments
In this section, we compare the performance of APMD and APFTRL to MWU, OMWU, and optimistic gradient descent (OGD) (Daskalakis et al., 2018; Wei et al., 2021) in the full/noisy feedback setting. The parameter settings for MWU, OMWU, and APMD are the same as in Section 6. For APFTRL, we use the squared -distance and the same parameters as APMD with squared -distance perturbation. For OGD, we use the same learning rate as APMD with squared -distance perturbation.
Figure 4 shows the logarithm of the gap function for averaged over instances with full feedback. We observe that APMD and APFTRL with squared -distance perturbation exhibit performance competitive with OGD. The experimental results in the noisy feedback setting are presented in Figure 5. Surprisingly, in the noisy feedback setting, all APMD-based algorithms and the APFTRL-based algorithm markedly outperform OGD in all three games.
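As a concrete reference for the quantity being plotted, the following sketch computes one common instance of the gap function, namely the exploitability of a strategy profile in a two-player zero-sum matrix game; the gap function used in the paper for general monotone games may be defined more generally, and the payoff matrix below is hypothetical.

```python
import numpy as np

def exploitability(A, x, y):
    # Exploitability (duality gap) of the profile (x, y) in the two-player
    # zero-sum matrix game where the row player's payoff is x^T A y.
    best_row_payoff = np.max(A @ y)   # row player's best-response value against y
    best_col_payoff = np.min(x @ A)   # column player's best response drives x^T A y down
    return best_row_payoff - best_col_payoff

# Hypothetical biased RPS-style payoff matrix (illustrative values only).
A = np.array([[0.0, -1.0, 3.0],
              [1.0, 0.0, -3.0],
              [-3.0, 3.0, 0.0]])
uniform = np.ones(3) / 3.0
print(exploitability(A, uniform, uniform))  # a value of 0 would indicate a Nash equilibrium
```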
[Figure 4: gap function under full feedback.]
[Figure 5: gap function under noisy feedback.]
I.4 Comparison with the Averaged Strategies of No-Regret Learning Algorithms
This section compares the last-iterate strategies of APMD and APFTRL with the averaged strategies of MWU, regret matching (RM) (Hart & Mas-Colell, 2000), and regret matching plus (RM+) (Tammelin, 2014). The parameter settings for MWU, APMD, and APFTRL, as used in Section I.3, are maintained. Figure 6 illustrates the logarithm of the gap function averaged over instances with full feedback. The results show that the last-iterate strategies of APMD and APFTRL with squared -distance perturbation exhibit a lower gap than the averaged strategies of MWU, RM, and RM+.
[Figure 6: gap function of last-iterate strategies (APMD, APFTRL) vs. averaged strategies (MWU, RM, RM+) under full feedback.]
I.5 Sensitivity Analysis of Update Interval for the Slingshot Strategy
In this section, we investigate the performance when changing the update interval of the slingshot strategy. We vary the update interval of APMD with L2 perturbation in 3BRPS with full feedback to be , and with noisy feedback to be . All other parameters are the same as in Section 6. Figure 7 shows the logarithm of the gap function for averaged over instances in 3BRPS with full/noisy feedback. We observe that the smaller the update interval, the faster converges. However, if the interval is too small, does not converge (see with full feedback, and with noisy feedback in Figure 7).
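To illustrate where the update interval enters, the sketch below runs a projected-gradient update on payoffs perturbed by a squared L2-distance toward a slingshot strategy, refreshing the slingshot every `update_interval` iterations. This is a minimal sketch under simplifying assumptions (a two-player zero-sum matrix game, hypothetical parameter names and values), not the authors' implementation of APMD.

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto the probability simplex.
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def perturbed_gd_with_slingshot(A, num_iters, eta, mu, update_interval):
    # Schematic sketch: projected gradient ascent with a squared L2-distance
    # perturbation toward a slingshot strategy that is refreshed at a fixed interval.
    # Row player's payoff is x^T A y; the column player's is its negative.
    n, m = A.shape
    x, y = np.ones(n) / n, np.ones(m) / m
    slingshot_x, slingshot_y = x.copy(), y.copy()
    for t in range(num_iters):
        grad_x = A @ y + mu * (slingshot_x - x)      # perturbed gradient, row player
        grad_y = -(x @ A) + mu * (slingshot_y - y)   # perturbed gradient, column player
        x = project_simplex(x + eta * grad_x)
        y = project_simplex(y + eta * grad_y)
        if (t + 1) % update_interval == 0:           # refresh the slingshot periodically
            slingshot_x, slingshot_y = x.copy(), y.copy()
    return x, y

# Example run on a small matrix game with hypothetical parameters.
x, y = perturbed_gd_with_slingshot(np.array([[0.0, -1.0], [1.0, 0.0]]),
                                   num_iters=1000, eta=0.05, mu=0.1, update_interval=100)
```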
[Figure 7: gap function in 3BRPS under full and noisy feedback for varying slingshot update intervals.]