Recursive Reasoning in Minimax Games: A Level $k$ Gradient Play Method

Zichu Liu
University of Toronto
[email protected]
Lacra Pavel
University of Toronto
[email protected]
Abstract

Despite the success of generative adversarial networks (GANs) in generating visually appealing images, they are notoriously challenging to train. In order to stabilize the learning dynamics in minimax games, we propose a novel recursive reasoning algorithm: the Level $k$ Gradient Play (Lv.$k$ GP) algorithm. In contrast to many existing algorithms, our algorithm does not require sophisticated heuristics or curvature information. We show that as $k$ increases, Lv.$k$ GP converges asymptotically towards an accurate estimation of players' future strategy. Moreover, we justify that Lv.$\infty$ GP naturally generalizes a line of provably convergent game dynamics which rely on predictive updates. Furthermore, we provide its local convergence property in nonconvex-nonconcave zero-sum games and global convergence in bilinear and quadratic games. By combining Lv.$k$ GP with the Adam optimizer, our algorithm shows a clear advantage in terms of performance and computational overhead compared to other methods. Using a single Nvidia RTX3090 GPU and 30 times fewer parameters than BigGAN on CIFAR-10, we achieve an FID of 10.17 for unconditional image generation within 30 hours, allowing GAN training on common computational resources to reach state-of-the-art performance.

1 Introduction

In recent years, there has been a surge of powerful models that require simultaneous optimization of several objectives. This increasing interest in multi-objective optimization arises in various domains - such as generative adversarial networks [21, 27], adversarial attacks and robust optimization [37, 7], and multi-agent reinforcement learning [36, 55] - where several agents aim at minimizing their objectives simultaneously. Games generalize this optimization framework by introducing different objectives for different learning agents, known as players.


Figure 1: Illustration of predictive algorithms on $\min_{x}\max_{y}10xy$. Left: extra-gradient algorithm. Right: Level $k$ gradient play algorithm ($k=6$). The solution, the trajectory $\{x_{t},y_{t}\}_{t=1}^{T}$ and the anticipated future state are shown with a red star, a black line and an orange arrow, respectively. The subplot in the right figure depicts how Lv.$k$ GP predicts future states by showing its reasoning procedure with blue arrows. More steps in the reasoning process lead to better anticipation and faster convergence.

The generative adversarial network is a widely-used method of this type, which has demonstrated state-of-the-art performance in a variety of applications, including image generation [28, 52], image super-resolution [32], and image-to-image translation [10]. Despite their success at generating visually appealing images, GANs are notoriously challenging to train [40, 39]. Naive application of gradient-based algorithms to GANs often leads to poor image diversity (sometimes manifesting as "mode collapse") [40], Poincaré recurrence [39], and subtle dependency on hyperparameters [18]. An immense corpus of work is devoted to exploring and enhancing the stability of GANs, including techniques as diverse as the use of optimal transport distance [1, 22], critic gradient penalties [54], different neural network architectures [26, 6], feature matching [50], pre-trained feature space [51], and minibatch discrimination [50]. Nevertheless, architectural modifications (e.g., StyleGANs [27]) require extensive computational resources, and many theoretically appealing methods (Follow-the-ridge [56], CGD [53]) require Hessian inverse operations, which are infeasible for most GAN applications.

To stabilize the learning dynamics in GANs, many recent efforts rely on sophisticated heuristics that allow the agents to anticipate each other's next move [16, 53, 23]. This anticipation is an example of a recursive reasoning procedure in cognitive science [12]. Similar to how humans think, recursive reasoning represents the belief reasoning process where each agent considers the reasoning processes of other agents, and based on this expects to make better decisions. Importantly, it enables the use of opponents that reason about the learning agent, rather than assuming fixed opponents; the process can therefore be nested in the form 'I believe that you believe that I believe…'. Based on this intuition, we introduce a novel recursive reasoning algorithm that utilizes only gradient information to optimize GANs. Our contributions can be summarized as follows:

(i) We propose a novel algorithm: Level $k$ Gradient Play (Lv.$k$ GP), which is capable of reasoning about players' future strategies. In a game, agents at Lv.$k$ adjust their strategies in accordance with the strategies of Lv.$k-1$ agents. We justify that, while typical GAN optimizers, such as Learning with Opponent Learning Awareness (LOLA) and Symplectic Gradient Adjustment (SGA), approximate Lv.$2$ and Lv.$3$ GP, our algorithm permits higher levels of strategic reasoning. In addition, the proposed algorithm is amenable to neural network optimizers like Adam [29].

(ii) We show that, in smooth games, Lv.$k$ GP converges asymptotically towards an accurate prediction of agents' next move. Under mutual opponent shaping, two Lv.$\infty$ agents will naturally have a consistent view of one another if Lv.$k$ GP converges as $k$ increases. Based on this idea, we provide a closed-form solution for Lv.$\infty$ GP: the Semi-Proximal Point Method (SPPM).

(iii) We prove the local convergence property of Lv.$\infty$ GP in nonconvex-nonconcave zero-sum games and its global convergence in bilinear and quadratic games. The theoretical analysis we present indicates that strong interactions between competing agents can increase the convergence rate of Lv.$k$ GP agents in a zero-sum game.

(iv) By combining Lv.$k$ GP with the Adam optimizer, our algorithm shows a clear advantage in terms of performance and computational overhead compared to other methods. Using a single RTX3090 GPU with 30 times fewer parameters and 16 times smaller mini-batches than BigGAN [6] on CIFAR-10, we achieve an FID score [24] of 10.17 for unconditional image generation within 30 hours, allowing GAN training on common computing resources to reach state-of-the-art performance.

2 Related Works

In recent years, minimax problems have attracted considerable interest in machine learning in light of their connection with GANs. Gradient descent ascent (GDA), a generalization of gradient descent for minimax games, is the principal approach for training GANs in applications. GDA alternates between a gradient descent step for the min-player and a gradient ascent step for the max-player. The convergence of GDA in games is far from as well understood as that of gradient descent in single-objective problems. Despite the impressive image quality generated by GANs, GDA fails to converge even in bilinear zero-sum games. Recent research on GDA has established a unified picture of its behavior in bilinear games in continuous and discrete time [39, 46, 13, 41, 20]. First, [39] revealed that continuous-time GDA dynamics in zero-sum games result in Poincaré recurrence, where agents return arbitrarily close to their initial state infinitely many times. Second, [3, 58] examined the discrete-time GDA dynamics, showing that simultaneous updates of the two players result in divergence, while the agents' strategies remain bounded and cycle when agents take turns to update their strategies.

The majority of existing approaches to stabilizing GDA follow one of three lines of research. The essence of the first is that the discriminator is trained until convergence while the generator parameters are frozen. As long as the generator changes slowly enough, the discriminator still converges in the presence of small generator perturbations. The two-timescale update rule proposed by [21, 24, 42] aims to maintain the discriminator's optimality while updating the generator at an appropriate step size. [25, 14] proved that this two-timescale GDA with finite timescale separation converges towards the strict local minimax/Stackelberg equilibrium in differentiable games. [15, 56] explicitly find the local minimax equilibrium in games with second-order optimization algorithms.

The second line of research overcomes the failure of GDA in games with predictive updates. The extra-gradient method (EG) [30] and optimistic gradient descent (OGD) [13] use the predictability of the agents' strategy to achieve better convergence properties. Their variants are developed to improve the training performance of GANs [8, 19, 43]. [45] provided a unified analysis of EG and OGD, showing that they approximate the classical proximal point method. Competitive gradient descent [53] models the agents' next move by solving a regularized bilinear approximation of the underlying game. Learning with opponent learning awareness (LOLA) and consistent opponent learning awareness (COLA) [16, 57] introduced opponent shaping to this problem by explicitly modeling the learning strategy of other agents in the game. LOLA models opponents as naive learners rather than LOLA agents, while COLA utilizes neural networks to predict opponents' next move. Lookahead-minimax [9] stabilizes GAN training by 'looking ahead' at the sequence of future states generated by an inner optimizer. In the game theory literature, recent work has proposed Clairvoyant Multiplicative Weights Update (CMWU) for regret minimization in general games [47]. Although CMWU is proposed to solve finite normal-form games, which are different from the unconstrained continuous games that Lv.$k$ GP aims to solve, both CMWU and Lv.$k$ GP share the same motivation of enabling the learning agent to update its strategy based on the opponent's future strategy. From this aspect, Lv.$k$ GP can be viewed as a specialized variant of CMWU adapted to two-player zero-sum unconstrained continuous kernel games.

Other methods directly modify the GDA algorithm through ad-hoc changes to the game dynamics or the introduction of additional regularizers. Consensus optimization (CO) [40] and gradient penalties [22, 54] improve convergence by directly minimizing the magnitude of players' gradients. Symplectic gradient adjustment (SGA) [33, 4, 17] improves convergence by disentangling convergent potential components from rotational Hamiltonian components of the vector field.

3 Preliminaries

3.1 Notation

In this paper, vectors are lower-case bold letters (e.g., $\bm{\theta}$) and matrices are upper-case bold letters (e.g., $\bm{A}$). For a function $f:\mathbb{R}^{d}\to\mathbb{R}$, we denote its gradient by $\nabla f$. For functions $f({\bm{x}},{\bm{y}}):\mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{2}}\to\mathbb{R}$ of two vector arguments we use $\nabla_{x}f,\nabla_{y}f$ to denote its partial gradients and $\nabla_{xx}f,\nabla_{yy}f,\nabla_{xy}f$ to denote its Hessian blocks. A stationary point of $f$ denotes a point where $\nabla_{x}f=\nabla_{y}f=\bm{0}$. We use $\lVert\bm{v}\rVert$ to denote the Euclidean norm of a vector $\bm{v}$. We refer to the largest and smallest eigenvalues of a matrix $\bm{A}$ by $\lambda_{\max}(\bm{A})$ and $\lambda_{\min}(\bm{A})$, respectively. Moreover, we denote the spectral radius of a matrix $\bm{A}$ by $\rho(\bm{A})=\max\{\lvert\lambda_{1}\rvert,\dots,\lvert\lambda_{n}\rvert\}$, i.e., the eigenvalue with the largest absolute value.

3.2 Problem Definition

In order to justify the effectiveness of the recursive reasoning procedure, in this paper we consider the problem of training Generative Adversarial Networks (GANs) [21]. The GAN training strategy defines a two-player game between a generative neural network $G_{\bm{\theta}}(\cdot)$ and a discriminative neural network $D_{\bm{\phi}}(\cdot)$. The generator takes as input random noise ${\bm{z}}$ sampled from a known distribution $\mathbf{P}_{{\bm{z}}}$, e.g., ${\bm{z}}\sim\mathbf{P}_{{\bm{z}}}$, and outputs a sample $G_{\bm{\theta}}({\bm{z}})$. The discriminator takes as input a sample ${\bm{x}}$ (either sampled from the true distribution $\mathbf{P}_{{\bm{x}}}$ or from the generator) and attempts to classify it as real or fake. The goal of the generator is to fool the discriminator. The optimization of GANs is formulated as a two-player differentiable game where the generator $G_{\bm{\theta}}$, with parameters $\bm{\theta}$, and the discriminator $D_{\bm{\phi}}$, with parameters $\bm{\phi}$, aim at minimizing their own cost functions $f(\bm{\theta},\bm{\phi})$ and $g(\bm{\theta},\bm{\phi})$ respectively, as follows:

\min_{\bm{\theta}\in\mathbb{R}^{m}}f(\bm{\theta},\bm{\phi})\ \text{ and }\ \min_{\bm{\phi}\in\mathbb{R}^{n}}g(\bm{\theta},\bm{\phi}), \qquad (1)

where $f,g:\mathbb{R}^{m}\times\mathbb{R}^{n}\to\mathbb{R}$. When $f=-g$, the corresponding optimization problem is called a two-player zero-sum game and it becomes a minimax problem:

\min_{\bm{\theta}\in\mathbb{R}^{m}}\max_{\bm{\phi}\in\mathbb{R}^{n}}f(\bm{\theta},\bm{\phi}). \qquad \text{(Minimax)}
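For concreteness, the original GAN objective of [21] (a standard fact, stated here for illustration) is an instance of (Minimax) with

f(\bm{\theta},\bm{\phi})=\mathbb{E}_{{\bm{x}}\sim\mathbf{P}_{{\bm{x}}}}\left[\log D_{\bm{\phi}}({\bm{x}})\right]+\mathbb{E}_{{\bm{z}}\sim\mathbf{P}_{{\bm{z}}}}\left[\log\left(1-D_{\bm{\phi}}(G_{\bm{\theta}}({\bm{z}}))\right)\right],

where the generator minimizes $f$ over $\bm{\theta}$ and the discriminator maximizes it over $\bm{\phi}$.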

In this work, we assume the cost functions have Lipschitz continuous gradients with respect to all model parameters $(\bm{\theta},\bm{\phi})$:

Assumption 3.1.

The gradient $\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\phi})$ is $L_{\bm{\theta}\bm{\theta}}$-Lipschitz with respect to $\bm{\theta}$ and $L_{\bm{\theta}\bm{\phi}}$-Lipschitz with respect to $\bm{\phi}$, and the gradient $\nabla_{\bm{\phi}}g(\bm{\theta},\bm{\phi})$ is $L_{\bm{\phi}\bm{\phi}}$-Lipschitz with respect to $\bm{\phi}$ and $L_{\bm{\phi}\bm{\theta}}$-Lipschitz with respect to $\bm{\theta}$, i.e.,

\lVert\nabla_{\bm{\theta}}f(\bm{\theta}_{1},\bm{\phi})-\nabla_{\bm{\theta}}f(\bm{\theta}_{2},\bm{\phi})\rVert \leq L_{\bm{\theta}\bm{\theta}}\lVert\bm{\theta}_{1}-\bm{\theta}_{2}\rVert \text{ for all } \bm{\phi},
\lVert\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\phi}_{1})-\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\phi}_{2})\rVert \leq L_{\bm{\theta}\bm{\phi}}\lVert\bm{\phi}_{1}-\bm{\phi}_{2}\rVert \text{ for all } \bm{\theta},
\lVert\nabla_{\bm{\phi}}g(\bm{\theta}_{1},\bm{\phi})-\nabla_{\bm{\phi}}g(\bm{\theta}_{2},\bm{\phi})\rVert \leq L_{\bm{\phi}\bm{\theta}}\lVert\bm{\theta}_{1}-\bm{\theta}_{2}\rVert \text{ for all } \bm{\phi},
\lVert\nabla_{\bm{\phi}}g(\bm{\theta},\bm{\phi}_{1})-\nabla_{\bm{\phi}}g(\bm{\theta},\bm{\phi}_{2})\rVert \leq L_{\bm{\phi}\bm{\phi}}\lVert\bm{\phi}_{1}-\bm{\phi}_{2}\rVert \text{ for all } \bm{\theta}.

We define $L:=\max\{L_{\bm{\theta}\bm{\theta}},L_{\bm{\theta}\bm{\phi}},L_{\bm{\phi}\bm{\theta}},L_{\bm{\phi}\bm{\phi}}\}$.

4 Level $k$ Gradient Play

In this section, we propose a novel recursive reasoning algorithm, Level $k$ Gradient Play (Lv.$k$ GP), that allows the agents to discover self-interested strategies while taking into account other agents' reasoning processes. In Lv.$k$ GP, $k$ steps of recursive reasoning are applied to obtain an anticipated future state $(\bm{\theta}^{(k)}_{t},\bm{\phi}^{(k)}_{t})$, and the current state $(\bm{\theta}_{t},\bm{\phi}_{t})$ is then updated as follows:

\text{Reasoning:}\begin{cases}\bm{\theta}^{(n)}_{t}=\bm{\theta}_{t}-\eta\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\phi}_{t}^{(n-1)})\\ \bm{\phi}^{(n)}_{t}=\bm{\phi}_{t}-\eta\nabla_{\bm{\phi}}g(\bm{\theta}_{t}^{(n-1)},\bm{\phi}_{t})\end{cases}\qquad\text{Update:}\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}^{(k)}\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}^{(k)}\end{cases} \qquad \text{(Lv.$k$ GP)}

We define the current state $(\bm{\theta}_{t},\bm{\phi}_{t})$ to be the starting point $(\bm{\theta}_{t}^{(0)},\bm{\phi}_{t}^{(0)})$ of the reasoning process. Learning agents that adopt the Lv.$k$ GP strategy are called Lv.$k$ agents. Lv.$1$ agents act naively in response to the current state using GDA dynamics, and Lv.$2$ agents act in response to Lv.$1$ agents by assuming their opponents are naive learners. Therefore, Lv.$k$ GP allows for higher levels of strategic reasoning. The inspiration comes from how humans collaborate: humans are good at anticipating how their actions will affect others, so they frequently figure out how to collaborate with other people to reach a "win-win" solution. The key to human collaboration is the ability to understand how other humans think, which helps them develop strategies that benefit their collaborators. One of our main theoretical results is the following theorem, which demonstrates that agents adopting Lv.$k$ GP can precisely predict other players' next move and reach a consensus on their future strategies:
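To make the reasoning loop concrete, the following minimal NumPy sketch (our illustration, not the authors' released code) runs Lv.$k$ GP on the toy objective $f(x,y)=10xy$ from Figure 1, with $g=-f$ so that the second player ascends:

```python
import numpy as np

def lvk_gp_step(x, y, grad_x, grad_y, eta=0.04, k=6):
    """One Lv.k GP update: k reasoning steps, each reacting to the
    opponent's level-(n-1) strategy, always starting from (x, y)."""
    x_prev, y_prev = x, y                     # Lv.0 strategies: the current state
    for _ in range(k):
        x_next = x - eta * grad_x(x, y_prev)  # min player descends
        y_next = y + eta * grad_y(x_prev, y)  # max player ascends (g = -f)
        x_prev, y_prev = x_next, y_next
    return x_prev, y_prev                     # the Lv.k anticipated state

# Toy minimax game f(x, y) = 10 * x * y (Figure 1).
grad_x = lambda x, y: 10.0 * y                # partial derivative of f in x
grad_y = lambda x, y: 10.0 * x                # partial derivative of f in y

x, y = 1.0, 1.0
for t in range(200):
    x, y = lvk_gp_step(x, y, grad_x, grad_y)
print(x, y)  # both coordinates approach the equilibrium (0, 0)
```

Here $\eta=0.04$ satisfies $\eta<(2L)^{-1}$ with $L=10$, the regime in which Theorem 4.1 below guarantees that the reasoning steps converge.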

Theorem 4.1.

Suppose Assumption 3.1 holds. Let us define $\bm{\omega}_{t}=[\bm{\theta}_{t},\bm{\phi}_{t}]^{T}\in\mathbb{R}^{m+n}$, $\bm{\omega}_{t}^{(k)}=[\bm{\theta}_{t}^{(k)},\bm{\phi}_{t}^{(k)}]^{T}\in\mathbb{R}^{m+n}$ and $\Delta_{\max}=2\max(\lVert\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\phi}_{t})\rVert,\lVert\nabla_{\bm{\phi}}g(\bm{\theta}_{t},\bm{\phi}_{t})\rVert)$. Assume $\bm{\omega}_{t}^{(k)}$ lies in a complete subspace of $\mathbb{R}^{m+n}$. Then for Lv.$k$ GP we have:

\lVert\bm{\omega}_{t}^{(k)}-\bm{\omega}_{t}^{(k-1)}\rVert\leq\eta\,(\eta L)^{k-1}\Delta_{\max}, \qquad (2)

Suppose the learning rate satisfies $\eta<(2L)^{-1}$; then the sequence $\{\bm{\omega}_{t}^{(k)}\}_{k=0}^{\infty}$ is a Cauchy sequence. That is, given $\epsilon>0$, there exists $N$ such that, if $a>b>N$, then:

\lVert\bm{\omega}_{t}^{(a)}-\bm{\omega}_{t}^{(b)}\rVert<\mathcal{O}(\eta^{b})<\epsilon \qquad (3)

Moreover, the sequence $\{\bm{\omega}_{t}^{(k)}\}_{k=0}^{\infty}$ converges to a limit $\bm{\omega}_{t}^{*}$: $\lim_{k\to\infty}\bm{\omega}_{t}^{(k)}=\bm{\omega}_{t}^{*}$.

In accordance with Theorem 4.1, if we define $\bm{\omega}_{t+1}=\bm{\omega}_{t}^{*}$, then Lv.$\infty$ GP is equivalent to the following implicit algorithm, which we call the Semi-Proximal Point Method:

\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\phi}_{t+1})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}-\eta\nabla_{\bm{\phi}}g(\bm{\theta}_{t+1},\bm{\phi}_{t})\end{cases} \qquad \text{(SPPM)}
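As a sanity check (a worked example of ours, not taken from the original text), consider the toy game $f(x,y)=10xy$ with $g=-f$. SPPM reads $x_{t+1}=x_{t}-10\eta y_{t+1}$ and $y_{t+1}=y_{t}+10\eta x_{t+1}$, which can be solved in closed form:

x_{t+1}=\frac{x_{t}-10\eta\,y_{t}}{1+100\eta^{2}},\qquad y_{t+1}=\frac{y_{t}+10\eta\,x_{t}}{1+100\eta^{2}},

so that $x_{t+1}^{2}+y_{t+1}^{2}=(x_{t}^{2}+y_{t}^{2})/(1+100\eta^{2})$: the implicit update contracts towards the origin for every $\eta>0$, consistent with Theorem 5.2 below.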

4.1 Algorithms as an Approximation of SPPM

SPPM players arrive at a consensus by knowing precisely what their opponents' future strategies will be. Existing algorithms are not able to offer this kind of agreement. For instance, consensus optimization [40] forces the learning agents to cooperate regardless of their own benefits. Agents employing the extra-gradient method [30], SGA [4], and LOLA [16] consider their opponents to be naive learners, ignoring their strategic reasoning ability. CGD [53] takes into account the reasoning process of learning agents; however, it leads to an inaccurate prediction of agents in games that have cost functions with non-zero higher-order derivatives ($n\geq 3$) [57]. In this section, we consider a subset of provably convergent variants of GDA in the (Minimax) setting, showing that, for specific choices of hyperparameters, the mentioned algorithms either approximate SPPM or approximate the approximations of SPPM:

Table 1: The update rules for the first player of SGA, LOLA, Lv.$2$ GP, LEAD, CGD, Lv.$3$ GP and SPPM in a (Minimax) problem, and their precision as approximations of SPPM. The usage of zero or negative momentum has been suggested in recent works [20, 18]. For the sake of comparison, we assume no momentum factor in LEAD's update, which corresponds to $\beta=0$ in Equation 10 of [23].
Algorithm | Update Rule | Precision
SGA [4] | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\gamma\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})$ | ——
LOLA [16] | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\delta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})$ | ——
Lv.2 GP | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t}))$ | $\mathcal{O}(\eta^{2})$
LEAD [23] | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\alpha\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\phi}_{t}-\bm{\phi}_{t-1})$ | ——
CGD [53] | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\phi}_{t+1}-\bm{\phi}_{t})$ | $\mathcal{O}(\eta^{3})$
Lv.3 GP | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}),\bm{\phi}_{t}))$ | $\mathcal{O}(\eta^{3})$
SPPM (Lv.$\infty$ GP) | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t+1})$ | 0

In Table 1, we compare the orders of precision of different algorithms as approximations of SPPM in (Minimax) games with infinitely differentiable objective functions. In accordance with Equation (3) of Theorem 4.1, Lv.$k$ GP is an $\mathcal{O}(\eta^{k})$ approximation of SPPM. In order to analyze how well existing algorithms approximate SPPM, we consider the first-order Taylor approximation of SPPM:

\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t})\\ \underbrace{\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})}_{\text{$1^{st}$ order approximation of Lv.2 GP}}-\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\phi}_{t+1}-\bm{\phi}_{t})\end{cases}

The under-braced terms correspond to the first-order Taylor approximation of Lv.$2$ GP. For an appropriate choice of hyperparameters, SGA ($\gamma=\eta$) and LOLA ($\delta=\eta$) are identical to the first-order Taylor approximation of Lv.$2$ GP, where each agent models its opponent as a naive learner. Hence, we list them above Lv.$2$ GP, which approximates SPPM up to $\mathcal{O}(\eta^{2})$. CGD exactly recovers the first-order Taylor approximation of SPPM. (The CGD update for the max player $\bm{\phi}$ is $\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t})$. If we substitute $(\bm{\phi}_{t+1}-\bm{\phi}_{t})$ into $\bm{\theta}$'s update, and substitute $(\bm{\theta}_{t+1}-\bm{\theta}_{t})$ into $\bm{\phi}$'s update, respectively, we arrive at the first-order approximation of SPPM; see A.5 for derivation details.) In games with cost functions that have non-zero higher-order derivatives ($n\geq 3$), the remaining term in SPPM's first-order approximation is an error of magnitude $\mathcal{O}(\eta^{3})$, which means that CGD's accuracy is in the same range as that of Lv.$3$ GP. In bilinear and quadratic games, where the objective function is at most twice differentiable, CGD is equivalent to SPPM. The distinction between LEAD ($\alpha=\eta$) and CGD can be understood by considering their update rules. LEAD is an explicit method where opponents' potential next strategies are anticipated based on their most recent move $(\bm{\phi}_{t}-\bm{\phi}_{t-1})$. On the contrary, CGD accounts for this anticipation in an implicit manner, $(\bm{\phi}_{t+1}-\bm{\phi}_{t})$, where the future states appear in the current states' update rules. Therefore, the computation of CGD updates requires solving an equation involving additional Hessian inverse operations. A numerical justification is also provided in Table 2, showing that the approximation accuracy of Lv.$k$ GP improves as $k$ increases.
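The precision orders above can also be checked numerically. The short sketch below (our illustration; function names are ours) compares one step of Lv.$k$ GP against the closed-form SPPM step for the bilinear toy game $f(x,y)=10xy$ and shows the gap shrinking geometrically with $k$, mirroring Equation (3):

```python
import numpy as np

A = 10.0  # interaction strength in f(x, y) = A*x*y

def sppm_step(x, y, eta):
    """Closed-form SPPM update for f(x, y) = A*x*y (see Theorem 5.2)."""
    d = 1.0 + (A * eta) ** 2
    return (x - A * eta * y) / d, (y + A * eta * x) / d

def lvk_step(x, y, eta, k):
    """Lv.k GP update for f(x, y) = A*x*y with g = -f."""
    xp, yp = x, y
    for _ in range(k):
        xp, yp = x - eta * A * yp, y + eta * A * xp  # simultaneous reasoning step
    return xp, yp

x0, y0, eta = 1.0, 1.0, 0.02
xs, ys = sppm_step(x0, y0, eta)
for k in range(1, 7):
    xk, yk = lvk_step(x0, y0, eta, k)
    print(f"k={k}: |Lv.{k} GP - SPPM| = {np.hypot(xk - xs, yk - ys):.2e}")
# the printed gap decays roughly like (eta * A)^k = 0.2^k
```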

5 Convergence Property

In Theorem 4.1, we analytically proved that Lv.$k$ GP converges asymptotically towards SPPM; we now use this result to study the convergence property of Lv.$k$ GP in games based on our analysis of SPPM. The local convergence of SPPM in a nonconvex-nonconcave game can be analyzed via the spectral radius of the game Jacobian around a stationary point:

Theorem 5.1.

Consider the (Minimax) problem under Assumption 3.1 and Lv.$k$ GP. Let $(\bm{\theta}^{*},\bm{\phi}^{*})$ be a stationary point. Suppose $\bm{\theta}_{t}-\bm{\theta}^{*}$ is not in the kernel of $\nabla_{\phi\theta}f(\bm{\theta}^{*},\bm{\phi}^{*})$, $\bm{\phi}_{t}-\bm{\phi}^{*}$ is not in the kernel of $\nabla_{\theta\phi}f(\bm{\theta}^{*},\bm{\phi}^{*})$, and $\eta<L^{-1}$. There exists a neighborhood $\mathcal{U}$ of $(\bm{\theta}^{*},\bm{\phi}^{*})$ such that if SPPM is started at $(\bm{\theta}_{0},\bm{\phi}_{0})\in\mathcal{U}$, the iterates $\{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0}$ generated by SPPM satisfy:

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}\leq\frac{\rho^{2}(\bm{I}-\eta\nabla_{\theta\theta}f^{*})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\rho^{2}(\bm{I}+\eta\nabla_{\phi\phi}f^{*})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*})}

where $f^{*}=f(\bm{\theta}^{*},\bm{\phi}^{*})$. Moreover, for any $\eta$ satisfying:

\frac{\max(\rho^{2}(\bm{I}-\eta\nabla_{\theta\theta}f^{*}),\rho^{2}(\bm{I}+\eta\nabla_{\phi\phi}f^{*}))}{1+\eta^{2}\lambda_{\min}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*})}<1, \qquad (4)

SPPM converges asymptotically to $(\bm{\theta}^{*},\bm{\phi}^{*})$.

Remark 5.1.

Under the same conditions as in Theorem 5.1, the iterates generated by Lv.$2k$ GP satisfy:

\lVert\bm{\theta}_{t}^{(2k)}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}^{(2k)}-\bm{\phi}^{*}\rVert^{2}\leq a\left(\frac{\rho^{2}(\bm{I}-\eta\nabla_{\theta\theta}f^{*})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\rho^{2}(\bm{I}+\eta\nabla_{\phi\phi}f^{*})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*})}\right)+b\left(\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}\right)

where

a=\frac{(1+(\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k})^{2}}{1-(\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k}}\ \text{for odd } k,\quad\text{or}\quad a=\frac{(1-(\eta^{2}\lambda_{\min}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k})^{2}}{1-(\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k}}\ \text{for even } k,
b=\frac{(\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k}\,(1-(\eta^{2}\lambda_{\min}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k})}{1-(\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k}}.
Figure 2: Top: max step size against $k$. Right: distance to equilibrium against $k$. In both figures the red dashed line represents the value for SPPM.

For simplicity, we assume that the difference between the trajectory $\{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0}$ and the stationary point $(\bm{\theta}^{*},\bm{\phi}^{*})$ is not in the kernel of $\nabla_{\phi\theta}f(\bm{\theta}^{*},\bm{\phi}^{*})$ and $\nabla_{\theta\phi}f(\bm{\theta}^{*},\bm{\phi}^{*})$. Detailed proofs without this assumption are provided in Appendix A.4. Following the same setting as Theorem 4.1, we have $\eta<L^{-1}$, which ensures that $\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*})<1$. Therefore, as $k\to\infty$, $a\to 1$ and $b\to 0$ in Remark 5.1, and as such Lv.$k$ GP has similar local convergence properties to SPPM. Figure 2 illustrates this property in a quadratic game. Lv.$k$ GP may behave differently from SPPM at lower values of $k$, but as $k$ increases, both the max step size $\eta_{\max,k}$ and the distance to the equilibrium after a fixed number of iterations under $\eta_{\max,\infty}$ converge to those of SPPM. Thus, we observe that Lv.$k$ GP is empirically similar to SPPM at higher values of $k$.
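For intuition, the factors $a$ and $b$ in Remark 5.1 are easy to tabulate. The sketch below (our own; the spectrum of $\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}$ is chosen by hand) shows $a\to 1$ and $b\to 0$ as $k$ grows, so Lv.$2k$ GP inherits SPPM's local rate:

```python
lam_min, lam_max, eta = 1.0, 4.0, 0.4   # assumed spectrum of the interaction term
assert eta ** 2 * lam_max < 1           # the regime eta < 1/L of Remark 5.1

for k in range(1, 11):
    p_max = (eta ** 2 * lam_max) ** k
    p_min = (eta ** 2 * lam_min) ** k
    if k % 2 == 1:                       # odd k
        a = (1 + p_max) ** 2 / (1 - p_max)
    else:                                # even k
        a = (1 - p_min) ** 2 / (1 - p_max)
    b = p_max * (1 - p_min) / (1 - p_max)
    print(f"k={k:2d}: a={a:.4f}, b={b:.6f}")   # a -> 1, b -> 0
```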

We study the global convergence property of SPPM by analyzing its behavior in bilinear and quadratic games. Consider the following bilinear game:

\min_{\bm{\theta}\in\mathbb{R}^{n}}\max_{\bm{\phi}\in\mathbb{R}^{n}}\bm{\theta}^{T}\bm{M}\bm{\phi} \qquad \text{(Bilinear game)}

where $\bm{M}$ is a full-rank matrix. The following theorem summarizes SPPM's convergence property:

Theorem 5.2.

Consider the Bilinear game and the SPPM method. Further, we define $r_{t}=\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}$. Then, for any $\eta>0$, the iterates $\{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0}$ generated by SPPM satisfy

r_{t+1}\leq\frac{1}{1+\eta^{2}\lambda_{\min}(\bm{M}^{T}\bm{M})}\,r_{t}. \qquad (5)
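Since SPPM is implicit, each update in the bilinear game amounts to solving a small linear system: substituting one equation of (SPPM) into the other gives $(\bm{I}+\eta^{2}\bm{M}\bm{M}^{T})\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\bm{M}\bm{\phi}_{t}$ and $(\bm{I}+\eta^{2}\bm{M}^{T}\bm{M})\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\bm{M}^{T}\bm{\theta}_{t}$. The sketch below (our illustration) implements this and checks the contraction rate (5) along the iterates:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta = 5, 0.5
M = rng.standard_normal((n, n))            # full rank with probability one
theta, phi = rng.standard_normal(n), rng.standard_normal(n)

S_theta = np.eye(n) + eta ** 2 * M @ M.T   # system matrix for the theta solve
S_phi = np.eye(n) + eta ** 2 * M.T @ M     # system matrix for the phi solve
rate = 1.0 / (1.0 + eta ** 2 * np.linalg.eigvalsh(M.T @ M).min())

for t in range(50):
    r_prev = theta @ theta + phi @ phi
    theta, phi = (np.linalg.solve(S_theta, theta - eta * M @ phi),
                  np.linalg.solve(S_phi, phi + eta * M.T @ theta))
    r = theta @ theta + phi @ phi
    assert r <= rate * r_prev + 1e-9       # contraction rate (5), equilibrium at 0
print(r)                                    # decays towards zero
```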

It is worth noting that the convergence property of Lv.$k$ GP in bilinear games has been studied in [2]. Furthermore, we study the convergence property of SPPM in the Quadratic game:

\min_{\bm{\theta}\in\mathbb{R}^{n}}\max_{\bm{\phi}\in\mathbb{R}^{n}}\bm{\theta}^{T}\bm{A}\bm{\theta}+\bm{\phi}^{T}\bm{B}\bm{\phi}+\bm{\theta}^{T}\bm{C}\bm{\phi} \qquad \text{(Quadratic game)}

where $\bm{A}\in\mathbb{R}^{n\times n}$ is symmetric and positive definite, $\bm{B}\in\mathbb{R}^{n\times n}$ is symmetric and negative definite, and the interaction term $\bm{C}\in\mathbb{R}^{n\times n}$ is full rank. SPPM in quadratic games converges at the following rate:

Theorem 5.3.

Consider the Quadratic game and SPPM. Then, for any $\eta>0$, the iterates $\{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0}$ generated by SPPM satisfy

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}\leq\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\rho^{2}(\bm{I}+\eta\bm{B})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{C}^{T}\bm{C})} \qquad (6)


Figure 3: A grid of experiments for different algorithms with different values of the interaction $c$ and learning rate $\eta$. The color in each cell indicates the distance to the equilibrium after 100 iterations. Note that the CGD update is equivalent to SPPM in quadratic games.

Theorems 5.1, 5.2 and 5.3 indicate that a stronger interaction term $\nabla_{\theta\phi}f(\bm{\theta}^{*},\bm{\phi}^{*})$ improves the convergence rate towards stationary points. By contrast, in existing modifications of GDA, the step size is chosen inversely proportional to the interaction term $\nabla_{\theta\phi}f(\bm{\theta}^{*},\bm{\phi}^{*})$ [40, 13, 34]. In Figure 3, we showcase the effect of interaction on different algorithms in the Quadratic game setup with dimension $n=5$, where the interaction matrix is defined as $\bm{C}=c\bm{I}$. Stronger interaction corresponds to higher values of $c$. A key difference between SPPM and Lv.$k$ GP in the experiments is that SPPM converges with any step size, and so arbitrarily fast, while this is not the case for Lv.$k$ GP. In order to approximate SPPM, one needs Lv.$k$ GP to be a contraction, and so $\eta<L^{-1}$. This implies an additional bound on step sizes for Lv.$k$ GP. As long as this constraint on $\eta$ is satisfied, stronger interaction only improves convergence for higher-order Lv.$k$ GP, while all other algorithms quickly diverge as $\eta$ and $c$ increase.
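A quick way to see the role of interaction in Theorem 5.3 is to evaluate the contraction factor of the bound (6) as $c$ grows. The sketch below (our illustration; the particular instance $\bm{A}=\bm{I}$, $\bm{B}=-\bm{I}$ is our choice) prints a factor that shrinks monotonically with $c$:

```python
import numpy as np

n, eta = 5, 0.1
A, B = np.eye(n), -np.eye(n)              # positive/negative definite quadratic terms
for c in [0.5, 1.0, 2.0, 5.0, 10.0]:
    C = c * np.eye(n)                     # interaction matrix from the Figure 3 setup
    num = max(np.max(np.abs(np.linalg.eigvals(np.eye(n) - eta * A))) ** 2,
              np.max(np.abs(np.linalg.eigvals(np.eye(n) + eta * B))) ** 2)
    den = 1.0 + eta ** 2 * np.linalg.eigvalsh(C.T @ C).min()
    print(f"c={c:5.1f}: per-step contraction factor <= {num / den:.4f}")
```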

6 Experimental Results

In this section, we discuss our implementation of the Lv.$k$ GP algorithm for training GANs. Its performance is evaluated on the 8-Gaussians task and on two representative datasets, CIFAR-10 and STL-10.

6.1 Level $k$ Adam

We propose to combine Lv.$k$ GP with the Adam optimizer [29]. Preliminary experiments find that Lv.$k$ Adam converges much faster than Lv.$k$ GP; see A.8 for experimental results. Detailed pseudo-code for Level $k$ Adam (Lv.$k$ Adam) on GAN training with loss functions $\mathcal{L}_{G}$ and $\mathcal{L}_{D}$ is given in Algorithm 1. For the Adam optimizer, there are several possible choices of how to update the moments, and this choice can lead to different algorithms in practice. Unlike [19], where the moments are updated on the fly, in Algorithm 1 we keep the moments fixed during the reasoning steps and update them together with the model parameters. In Table 2, our experimental results suggest that the proposed Lv.$k$ Adam algorithm converges asymptotically as the number of reasoning steps $k$ increases.

Input: Stopping time $T$, reasoning steps $k$, learning rates $\eta_{\bm{\theta}},\eta_{\bm{\phi}}$, decay rates for momentum estimates $\beta_{1},\beta_{2}$, initial weights $(\bm{\theta}_{0},\bm{\phi}_{0})$, real and noise-data distributions $\bm{P}_{{\bm{x}}}$ and $\bm{P}_{{\bm{z}}}$, losses $\mathcal{L}_{G}(\bm{\theta},\bm{\phi},{\bm{x}},{\bm{z}})$ and $\mathcal{L}_{D}(\bm{\theta},\bm{\phi},{\bm{x}},{\bm{z}})$, $\epsilon=10^{-8}$.
Parameters: Initial parameters $\bm{\theta}_{0},\bm{\phi}_{0}$
Initialize first moments: $\bm{m}_{\theta,0}\leftarrow 0,\ \bm{m}_{\phi,0}\leftarrow 0$
Initialize second moments: $\bm{v}_{\theta,0}\leftarrow 0,\ \bm{v}_{\phi,0}\leftarrow 0$
for t = 0, …, T-1 do
       Sample new mini-batch: ${\bm{x}},{\bm{z}}\sim\bm{P}_{{\bm{x}}},\bm{P}_{{\bm{z}}}$
       $\bm{\theta}_{t}^{(0)}\leftarrow\bm{\theta}_{t},\ \bm{\phi}_{t}^{(0)}\leftarrow\bm{\phi}_{t}$
       for n = 1, …, k do
             Compute stochastic gradients: $\bm{g}_{\bm{\theta},t}^{(n)}=\nabla_{\theta}\mathcal{L}_{G}(\bm{\theta}_{t},\bm{\phi}^{(n-1)}_{t},{\bm{x}},{\bm{z}});\ \bm{g}_{\bm{\phi},t}^{(n)}=\nabla_{\phi}\mathcal{L}_{D}(\bm{\theta}^{(n-1)}_{t},\bm{\phi}_{t},{\bm{x}},{\bm{z}})$
             Update estimates of the first moment: $\bm{m}_{\theta,t}^{(n)}=\beta_{1}\bm{m}_{\theta,t-1}+(1-\beta_{1})\bm{g}^{(n)}_{\theta,t};\ \bm{m}_{\phi,t}^{(n)}=\beta_{1}\bm{m}_{\phi,t-1}+(1-\beta_{1})\bm{g}^{(n)}_{\phi,t}$
             Update estimates of the second moment: $\bm{v}_{\theta,t}^{(n)}=\beta_{2}\bm{v}_{\theta,t-1}+(1-\beta_{2})(\bm{g}^{(n)}_{\theta,t})^{2};\ \bm{v}_{\phi,t}^{(n)}=\beta_{2}\bm{v}_{\phi,t-1}+(1-\beta_{2})(\bm{g}^{(n)}_{\phi,t})^{2}$
             Correct the bias of the moments: $\bm{\hat{m}}_{\theta,t}^{(n)}=\frac{\bm{m}^{(n)}_{\theta,t}}{1-\beta_{1}^{t}},\ \bm{\hat{m}}_{\phi,t}^{(n)}=\frac{\bm{m}^{(n)}_{\phi,t}}{1-\beta_{1}^{t}};\ \bm{\hat{v}}_{\theta,t}^{(n)}=\frac{\bm{v}^{(n)}_{\theta,t}}{1-\beta_{2}^{t}},\ \bm{\hat{v}}_{\phi,t}^{(n)}=\frac{\bm{v}^{(n)}_{\phi,t}}{1-\beta_{2}^{t}}$
             Perform Adam update: $\bm{\theta}^{(n)}_{t}=\bm{\theta}_{t}-\eta_{\theta}\frac{\bm{\hat{m}}_{\theta,t}^{(n)}}{\sqrt{\bm{\hat{v}}_{\theta,t}^{(n)}}+\epsilon};\ \bm{\phi}^{(n)}_{t}=\bm{\phi}_{t}-\eta_{\phi}\frac{\bm{\hat{m}}_{\phi,t}^{(n)}}{\sqrt{\bm{\hat{v}}_{\phi,t}^{(n)}}+\epsilon}$
       end for
       $\bm{\theta}_{t+1}\leftarrow\bm{\theta}_{t}^{(k)},\ \bm{\phi}_{t+1}\leftarrow\bm{\phi}_{t}^{(k)}$
       $\bm{m}_{\theta,t}\leftarrow\bm{m}_{\theta,t}^{(k)},\ \bm{m}_{\phi,t}\leftarrow\bm{m}_{\phi,t}^{(k)}$
       $\bm{v}_{\theta,t}\leftarrow\bm{v}_{\theta,t}^{(k)},\ \bm{v}_{\phi,t}\leftarrow\bm{v}_{\phi,t}^{(k)}$
end for
Algorithm 1: Level $k$ Adam: proposed Adam with recursive reasoning steps
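To connect Algorithm 1 to Section 4, the following self-contained NumPy sketch (our simplification: deterministic gradients of the toy zero-sum game $f(\theta,\phi)=10\theta\phi$ in place of mini-batch GAN losses; all names are ours) implements the same moment handling, with moments frozen during reasoning and committed together with the parameters:

```python
import numpy as np

def grad_G(theta, phi):    # stand-in for grad_theta L_G; here f(theta, phi) = 10*theta*phi
    return 10.0 * phi

def grad_D(theta, phi):    # stand-in for grad_phi L_D = -f in the zero-sum case
    return -10.0 * theta

def lvk_adam(theta, phi, T=500, k=6, eta=1e-2, b1=0.0, b2=0.9, eps=1e-8):
    m_th = m_ph = v_th = v_ph = 0.0        # moments, kept fixed while reasoning
    for t in range(1, T + 1):
        th_n, ph_n = theta, phi            # Lv.0 strategies: the current state
        for _ in range(k):                 # reasoning steps
            g_th, g_ph = grad_G(theta, ph_n), grad_D(th_n, phi)
            m_th_n = b1 * m_th + (1 - b1) * g_th
            m_ph_n = b1 * m_ph + (1 - b1) * g_ph
            v_th_n = b2 * v_th + (1 - b2) * g_th ** 2
            v_ph_n = b2 * v_ph + (1 - b2) * g_ph ** 2
            # bias-corrected moments, then an Adam step from the *current* state
            th_n = theta - eta * (m_th_n / (1 - b1 ** t)) / (np.sqrt(v_th_n / (1 - b2 ** t)) + eps)
            ph_n = phi - eta * (m_ph_n / (1 - b1 ** t)) / (np.sqrt(v_ph_n / (1 - b2 ** t)) + eps)
        theta, phi = th_n, ph_n            # commit parameters ...
        m_th, m_ph, v_th, v_ph = m_th_n, m_ph_n, v_th_n, v_ph_n  # ... and moments together
    return theta, phi

print(lvk_adam(1.0, 1.0))  # with k >= 2 reasoning steps the iterates should approach (0, 0)
```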

6.2 8-Gaussians

In our first experiment, we evaluate Lv.$k$ Adam on generating a mixture of 8 Gaussians with standard deviations equal to $0.05$ and modes uniformly distributed around the unit circle. We use a two-layer multi-layer perceptron with ReLU activations, a latent dimension of 64, and a batch size of 128. The generated distribution is presented in Figure 4.

Table 2: The difference between two states generated by Lv.$k$ Adam and Lv.$k$ GP, averaged over 100 steps. $^{\dagger}$The difference is smaller than machine precision.
$\frac{1}{100}\sum_{t=1}^{100}r_{t}^{(k)}$ | $k=2$ | $k=4$ | $k=6$ | $k=8$ | $k=10$
Lv.$k$ Adam | $1.04\times 10^{-1}$ | $1.60\times 10^{-2}$ | $2.91\times 10^{-3}$ | $6.08\times 10^{-4}$ | $1.68\times 10^{-4}$
Lv.$k$ GP | $9.79\times 10^{-9}$ | $3.84\times 10^{-15}$ | $1.31\times 10^{-17}$ | $1.68\times 10^{-19}$ | $\approx 0^{\dagger}$
Figure 4: Left: real distribution. Right: generated distribution.

In addition to presenting the mode coverage of the generated distribution after training, we also study the convergence of the reasoning steps of Lv.$k$ Adam and Lv.$k$ GP. Let us define the difference between the states of Lv.$k$ and Lv.$k-1$ agents at time $t$ as:

r^{(k)}_{t}=\lVert\bm{\theta}^{(k)}_{t}-\bm{\theta}^{(k-1)}_{t}\rVert^{2}+\lVert\bm{\phi}^{(k)}_{t}-\bm{\phi}^{(k-1)}_{t}\rVert^{2}. \qquad (7)

We measure this difference averaged over the first 100 iterations, $\frac{1}{100}\sum_{t=1}^{100}r_{t}^{(k)}$, for Lv.$10$ Adam ($\eta=10^{-4}$) and Lv.$10$ GP ($\eta=10^{-2}$). The results presented in Table 2 demonstrate that both Lv.$k$ GP and Lv.$k$ Adam converge as $k$ increases. Moreover, the estimation precision of Lv.$k$ GP improves rapidly and converges to 0 within finitely many steps, making it an accurate estimation of SPPM.

Table 3: FID and Inception scores of different algorithms and architectures on CIFAR-10. Results are averaged over 3 runs. We re-evaluate performance with the official implementation of FID.
Algorithm | Architecture | FID $\downarrow$ | IS $\uparrow$
Adam [29] | BigGAN [6] | 14.73 | $\bm{9.22}$
Adam [29] | StyleGAN2 [59] | 11.07 | 9.18
Adam [29] | SN-GAN [44] | $21.70\pm 0.21$ | $7.60\pm 0.06$
Unrolled GAN [42, 9] | SN-GAN [44] | $17.51\pm 1.08$ | ——
Extra-Adam [19, 38, 9] | SN-GAN [44] | $15.47\pm 1.82$ | ——
LEAD [23] | SN-GAN [44] | $14.45\pm 0.45$ | ——
LA-AltGAN [9] | SN-GAN [44] | $12.67\pm 0.57$ | $8.55\pm 0.04$
ODE-GAN (RK4) [48] | SN-GAN [44] | $11.85\pm 0.21$ | $8.61\pm 0.06$
Lv.$6$ Adam | SN-GAN [44] | $\bm{10.17\pm 0.16}$ | $\bm{8.78\pm 0.06}$


Figure 5: Top left: change of FID scores over 1.2 million gradient computations for Adam, LEAD, Extra-Adam, Lv.$2$, Lv.$4$ and Lv.$6$ Adam on CIFAR-10 with SN-GAN. Note that for Lv.$4$ and Lv.$6$ Adam, we use the accelerated implementation introduced in Appendix A.10. Bottom left: average computational cost per iteration on the 8-Gaussians experiment for MLPs with different widths. Right: average computational cost per iteration on CIFAR-10 for Lv.$k$ Adam with different $k$. The values for LEAD (2.88) and Adam (14.65) are highlighted by dashed lines.

6.3 Image Generation Experiments

We evaluate the effectiveness of our Lv.$k$ Adam algorithm on unconditional generation of CIFAR-10 [31]. We use the Inception score (the higher the better) [50] and the Fréchet Inception distance (the lower the better) [24] as performance metrics for image synthesis. We use the SN-GAN architecture based on [44]. For baselines, we compare the performance of Lv.$6$ Adam to that of other first-order and second-order optimization algorithms with the same SN-GAN architecture, and to state-of-the-art models trained with Adam. For Lv.$k$ Adam, we use $\beta_{1}=0$ and $\beta_{2}=0.9$ for all experiments. We use different learning rates for the generator ($\eta_{\theta}=4\times 10^{-5}$) and the discriminator ($\eta_{\phi}=2\times 10^{-4}$). We train Lv.$k$ Adam with batch size 128 for 600 epochs. For testing, we use an exponential moving average of the generator's parameters with averaging factor $\beta=0.999$.
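The exponential moving average used at test time is a standard trick; a minimal PyTorch-style sketch (ours, with an assumed generator module G) looks like:

```python
import copy
import torch

def update_ema(ema_model, model, beta=0.999):
    """In-place EMA of the generator's weights, used only for evaluation."""
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(beta).add_(p, alpha=1.0 - beta)

# usage: ema_G = copy.deepcopy(G); call update_ema(ema_G, G) after every optimizer step
```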

In Table 3, we present the performance of our method and the baselines. The best FID and Inception scores on SN-GAN, $10.17\pm 0.16$ and $8.78\pm 0.06$, are obtained with our Lv.$6$ Adam. We also outperform BigGAN and StyleGAN2 in terms of FID score. Notably, our model has 5.1M parameters in total and is trained with a small batch size of 128, whereas BigGAN uses 158.3M parameters and a batch size of 2048.

The effect of $k$ and losses: We evaluate values of $k=\{2,4,6\}$ on the non-saturating loss [21] (non-zero-sum formulation) and the hinge loss [35] (zero-sum formulation). The results are presented in Table 4.

Table 4: FID scores for the different loss functions with $k=\{2,4,6\}$.
Loss | Lv.$2$ Adam | Lv.$4$ Adam | Lv.$6$ Adam
Non-saturating loss | $11.33\pm 0.18$ | $11.62\pm 0.25$ | $10.93\pm 0.24$
Hinge loss | $10.68\pm 0.20$ | $10.33\pm 0.22$ | $10.17\pm 0.16$

Remarkably, our experiments demonstrate that a few steps of recursive reasoning can result in significant performance gains compared to existing GAN optimizers. The gradual improvements in the FID scores support the idea that better estimation of opponents' next move improves performance. Moreover, we observe performance gains in both zero-sum and non-zero-sum formulations, which supplements our theoretical convergence guarantees in zero-sum games.

Experimental results on STL-10: To test whether the proposed Lv.$k$ Adam optimizer works on higher-resolution images, we evaluate its performance on the STL-10 dataset [11] at $3\times 48\times 48$ resolution. In our experiments, Lv.$6$ Adam obtained an average FID of $25.43\pm 0.18$, which outperforms that of the Adam optimizer, $30.25\pm 0.26$, using the same SN-GAN architecture.

Memory and computation cost: Compared to SGD, Lv.$k$ Adam requires the same extra memory as the EG method (one additional set of parameters per player). The relative cost of one iteration versus SGD is a factor of $k$, and the computational cost increases linearly as $k$ increases; we illustrate this relationship in Figure 5 (right). We provide an accelerated version of Lv.$k$ Adam in A.8 which reduces the computation cost by half for $k>2$. In Figure 5 (top left), we compare the FID scores obtained by Lv.$k$ Adam, Adam, and LEAD on CIFAR-10 over the same number of gradient computations. LEAD, Lv.$4$ Adam, and Lv.$6$ Adam all outperform Adam in this experiment. Lv.$4$ Adam outperforms LEAD after $2\times 10^{5}$ gradient computations. Our method is also compared with different algorithms on the 8-Gaussians problem in terms of computational cost. On the same architecture with different widths, Figure 5 (bottom left) illustrates the wall-clock time per iteration for different algorithms. We observe that the computational cost of Lv.$k$ Adam, while being much lower than that of CGD, is similar to that of LEAD, SGA and CO, which involve JVP operations. Each run on the CIFAR-10 dataset takes 30-33 hours on an Nvidia RTX3090 GPU. Each experiment on STL-10 takes 48-60 hours on an Nvidia RTX3090 GPU.

7 Conclusion and Future Work

This paper proposes a novel algorithm, Level $k$ Gradient Play, capable of reasoning about players' future strategies. We achieve an average FID score of 10.17 for unconditional image generation on the CIFAR-10 dataset, allowing GAN training on common computing resources to reach state-of-the-art performance. Our results suggest that Lv.$k$ GP is a flexible add-on that can be easily attached to existing GAN optimizers (e.g., Adam), providing noticeable gains in performance and stability. In future work, we will examine the effectiveness of our approach on more complicated GAN designs, such as Progressive GANs [26] and StyleGANs [27], where optimization plays a more significant role. Additionally, we intend to examine the convergence property of Lv.$k$ GP in games with more than two players.

Broader Impact

Our work introduces a novel optimizer that improves the performance of GANs and may reduce the amount of hyperparameter tuning required by practitioners of generative modeling. Generative models have been used to create illegal content (e.g., deepfakes [5]). There is a risk of negative social impact resulting from malicious use of the proposed methods.

Acknowledgement

This work was supported by funding from an NSERC Alliance grant and Huawei Technologies Canada. ZL would like to thank Tianshi Cao for insightful discussions on algorithm and experiment design. We are grateful to Bolin Gao and Dian Gadjov for their support.

References

  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • Azizian et al. [2020] Waïss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games. In International Conference on Artificial Intelligence and Statistics, pages 2863–2873. PMLR, 2020.
  • Bailey et al. [2020] James P Bailey, Gauthier Gidel, and Georgios Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. In Conference on Learning Theory, pages 391–407. PMLR, 2020.
  • Balduzzi et al. [2018] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pages 354–363. PMLR, 2018.
  • Brandon [2018] John Brandon. Terrifying high-tech porn: Creepy ’deepfake’ videos are on the rise. Fox News, 2018.
  • Brock et al. [2018] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  • Carlini and Wagner [2017] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pages 39–57. IEEE, 2017.
  • Chavdarova et al. [2019] Tatjana Chavdarova, Gauthier Gidel, François Fleuret, and Simon Lacoste-Julien. Reducing noise in GAN training with variance reduced extragradient. Advances in Neural Information Processing Systems, 32, 2019.
  • Chavdarova et al. [2020] Tatjana Chavdarova, Matteo Pagliardini, Sebastian U Stich, François Fleuret, and Martin Jaggi. Taming GANs with lookahead-minmax. arXiv preprint arXiv:2006.14567, 2020.
  • Choi et al. [2018] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797, 2018.
  • Coates et al. [2011] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
  • Corballis [2007] Michael C Corballis. The uniqueness of human recursive thinking: the ability to think about thinking may be the critical attribute that distinguishes us from all other species. American Scientist, 95(3):240–248, 2007.
  • Daskalakis et al. [2017] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.
  • Fiez and Ratliff [2021] Tanner Fiez and Lillian J Ratliff. Local convergence analysis of gradient descent ascent with finite timescale separation. In Proceedings of the International Conference on Learning Representation, 2021.
  • Fiez et al. [2019] Tanner Fiez, Benjamin Chasnov, and Lillian J Ratliff. Convergence of learning dynamics in Stackelberg games. arXiv preprint arXiv:1906.01217, 2019.
  • Foerster et al. [2017] Jakob N Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017.
  • Gemp and Mahadevan [2018] Ian Gemp and Sridhar Mahadevan. Global convergence to the equilibrium of gans using variational inequalities. arXiv preprint arXiv:1808.01531, 2018.
  • Gemp and McWilliams [2019] Ian Gemp and Brian McWilliams. The Unreasonable Effectiveness of Adam on Cycles. In NeurIPS Workshop on Bridging Game Theory and Deep Learning, 2019.
  • Gidel et al. [2018] Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551, 2018.
  • Gidel et al. [2019] Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1802–1811. PMLR, 2019.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. Advances in neural information processing systems, 30, 2017.
  • Hemmat et al. [2020] Reyhane Askari Hemmat, Amartya Mitra, Guillaume Lajoie, and Ioannis Mitliagkas. Lead: Least-action dynamics for min-max optimization. arXiv preprint arXiv:2010.13846, 2020.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Jin et al. [2020] Chi Jin, Praneeth Netrapalli, and Michael Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pages 4880–4889. PMLR, 2020.
  • Karras et al. [2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Korpelevich [1976] Galina M Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Ledig et al. [2017] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
  • Letcher et al. [2019] Alistair Letcher, David Balduzzi, Sébastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. Differentiable game mechanics. The Journal of Machine Learning Research, 20(1):3032–3071, 2019.
  • Liang and Stokes [2019] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 907–915. PMLR, 2019.
  • Lim and Ye [2017] Jae Hyun Lim and Jong Chul Ye. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.
  • Lowe et al. [2017] Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017.
  • Madry et al. [2017] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Mertikopoulos et al. [2018a] Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018a.
  • Mertikopoulos et al. [2018b] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2703–2717. SIAM, 2018b.
  • Mescheder et al. [2017] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. Advances in neural information processing systems, 30, 2017.
  • Mescheder et al. [2018] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International conference on machine learning, pages 3481–3490. PMLR, 2018.
  • Metz et al. [2016] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
  • Mishchenko et al. [2020] Konstantin Mishchenko, Dmitry Kovalev, Egor Shulgin, Peter Richtárik, and Yura Malitsky. Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, pages 4573–4582. PMLR, 2020.
  • Miyato et al. [2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
  • Mokhtari et al. [2020] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pages 1497–1507. PMLR, 2020.
  • Papadimitriou and Piliouras [2018] Christos Papadimitriou and Georgios Piliouras. From Nash equilibria to chain recurrent sets: An algorithmic solution concept for game theory. Entropy, 20(10):782, 2018.
  • Piliouras et al. [2021] Georgios Piliouras, Ryann Sim, and Stratis Skoulakis. Optimal no-regret learning in general games: Bounded regret with unbounded step-sizes via Clairvoyant MWU. arXiv preprint arXiv:2111.14737, 2021.
  • Qin et al. [2020] Chongli Qin, Yan Wu, Jost Tobias Springenberg, Andy Brock, Jeff Donahue, Timothy Lillicrap, and Pushmeet Kohli. Training generative adversarial networks by solving ordinary differential equations. Advances in Neural Information Processing Systems, 33:5599–5609, 2020.
  • Rockafellar [1976] R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM journal on control and optimization, 14(5):877–898, 1976.
  • Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in neural information processing systems, 29, 2016.
  • Sauer et al. [2021] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected GANs converge faster. Advances in Neural Information Processing Systems, 34, 2021.
  • Sauer et al. [2022] Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets. arXiv preprint arXiv:2202.00273, 2022.
  • Schäfer and Anandkumar [2019] Florian Schäfer and Anima Anandkumar. Competitive gradient descent. Advances in Neural Information Processing Systems, 32, 2019.
  • Thanh-Tung et al. [2019] Hoang Thanh-Tung, Truyen Tran, and Svetha Venkatesh. Improving generalization and stability of generative adversarial networks. arXiv preprint arXiv:1902.03984, 2019.
  • Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Wang et al. [2019] Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. arXiv preprint arXiv:1910.07512, 2019.
  • Willi et al. [2022] Timon Willi, Johannes Treutlein, Alistair Letcher, and Jakob Foerster. COLA: Consistent Learning with Opponent-Learning Awareness. arXiv preprint arXiv:2203.04098, 2022.
  • Zhang et al. [2021] Guodong Zhang, Yuanhao Wang, Laurent Lessard, and Roger Grosse. Near-optimal Local Convergence of Alternating Gradient Descent-Ascent for Minimax Optimization. arXiv preprint arXiv:2102.09468, 2021.
  • Zhao et al. [2020] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient GAN training. Advances in Neural Information Processing Systems, 33:7559–7570, 2020.

Appendix A Appendix

A.1 Proof of Theorem 4.1

Proof.

To begin with, let us consider Lv.1 GP:

\theta_t^{(1)} = \theta_t - \eta \nabla_{\theta} f(\theta_t, \phi_t^{(0)}) = \theta_t - \eta \nabla_{\theta} f(\theta_t, \phi_t)
\phi_t^{(1)} = \phi_t - \eta \nabla_{\phi} g(\theta_t^{(0)}, \phi_t) = \phi_t - \eta \nabla_{\phi} g(\theta_t, \phi_t)

The differences between (\theta_t^{(1)}, \phi_t^{(1)}) and (\theta_t^{(0)}, \phi_t^{(0)}) are:

\|\theta_t^{(1)} - \theta_t^{(0)}\| = \eta \|\nabla_{\theta} f(\theta_t, \phi_t)\|,
\|\phi_t^{(1)} - \phi_t^{(0)}\| = \eta \|\nabla_{\phi} f(\theta_t, \phi_t)\|.

Recalling our definition of \Delta_{\max}, we have:

\|\theta_t^{(1)} - \theta_t^{(0)}\| + \|\phi_t^{(1)} - \phi_t^{(0)}\| \leq \eta \Delta_{\max}    (8)

Then, with \omega_t = [\theta_t, \phi_t]^T and \omega_t^{(k)} = [\theta_t^{(k)}, \phi_t^{(k)}]^T, the differences between the Lv.2 and Lv.1 agents are:

\|\theta_t^{(2)} - \theta_t^{(1)}\| = \eta \|\nabla_{\theta} f(\theta_t, \phi_t^{(1)}) - \nabla_{\theta} f(\theta_t, \phi_t^{(0)})\| \leq \eta L_{\theta\phi} \|\phi_t^{(1)} - \phi_t^{(0)}\|,
\|\phi_t^{(2)} - \phi_t^{(1)}\| = \eta \|\nabla_{\phi} f(\theta_t^{(1)}, \phi_t) - \nabla_{\phi} f(\theta_t^{(0)}, \phi_t)\| \leq \eta L_{\phi\theta} \|\theta_t^{(1)} - \theta_t^{(0)}\|.

Recalling that L := \max\{L_{\theta\theta}, L_{\theta\phi}, L_{\phi\theta}, L_{\phi\phi}\} and using Equation (8), we have:

\|\theta_t^{(2)} - \theta_t^{(1)}\| + \|\phi_t^{(2)} - \phi_t^{(1)}\| \leq \eta^2 L \Delta_{\max}    (9)

Similarly, we can derive the differences between the Lv.3 and Lv.2 agents:

\|\theta_t^{(3)} - \theta_t^{(2)}\| = \eta \|\nabla_{\theta} f(\theta_t, \phi_t^{(2)}) - \nabla_{\theta} f(\theta_t, \phi_t^{(1)})\| \leq \eta L_{\theta\phi} \|\phi_t^{(2)} - \phi_t^{(1)}\|,
\|\phi_t^{(3)} - \phi_t^{(2)}\| = \eta \|\nabla_{\phi} f(\theta_t^{(2)}, \phi_t) - \nabla_{\phi} f(\theta_t^{(1)}, \phi_t)\| \leq \eta L_{\phi\theta} \|\theta_t^{(2)} - \theta_t^{(1)}\|,
\|\theta_t^{(3)} - \theta_t^{(2)}\| + \|\phi_t^{(3)} - \phi_t^{(2)}\| \leq \eta^3 L^2 \Delta_{\max}    (10)

Consequently, the difference between any two consecutive reasoning states k and k-1 is upper bounded by:

\|\theta_t^{(k)} - \theta_t^{(k-1)}\| = \eta \|\nabla_{\theta} f(\theta_t, \phi_t^{(k-1)}) - \nabla_{\theta} f(\theta_t, \phi_t^{(k-2)})\| \leq \eta L_{\theta\phi} \|\phi_t^{(k-1)} - \phi_t^{(k-2)}\|,
\|\phi_t^{(k)} - \phi_t^{(k-1)}\| = \eta \|\nabla_{\phi} f(\theta_t^{(k-1)}, \phi_t) - \nabla_{\phi} f(\theta_t^{(k-2)}, \phi_t)\| \leq \eta L_{\phi\theta} \|\theta_t^{(k-1)} - \theta_t^{(k-2)}\|,
\|\theta_t^{(k)} - \theta_t^{(k-1)}\| + \|\phi_t^{(k)} - \phi_t^{(k-1)}\| \leq \eta (\eta L)^{k-1} \Delta_{\max}    (11)

Since \|\omega_t^{(k)} - \omega_t^{(k-1)}\| \leq \|\theta_t^{(k)} - \theta_t^{(k-1)}\| + \|\phi_t^{(k)} - \phi_t^{(k-1)}\|, we have:

\|\omega_t^{(k)} - \omega_t^{(k-1)}\| \leq \eta (\eta L)^{k-1} \Delta_{\max}    (12)

Suppose \eta < (2L)^{-1}, so that the map between any two consecutive reasoning states is a contraction. We then consider the difference \|\omega_t^{(a)} - \omega_t^{(b)}\| for a > b > 0, which we can rewrite as:

\|\omega_t^{(a)} - \omega_t^{(b)}\| = \| \sum_{i=b+1}^{a} (\omega_t^{(i)} - \omega_t^{(i-1)}) \|
\leq \sum_{i=b+1}^{a} \|\omega_t^{(i)} - \omega_t^{(i-1)}\|
\leq \sum_{i=b+1}^{a} \eta (\eta L)^{i-1} \Delta_{\max}
= \eta \Delta_{\max} [(\eta L)^{b} + \dots + (\eta L)^{a-1}]
\leq \eta \Delta_{\max} (\eta L)^{b-1}
= \eta^{b} L^{b-1} \Delta_{\max} = \mathcal{O}(\eta^{b}).    (13)

Here the penultimate inequality uses the geometric-series bound (\eta L)^{b} + \dots + (\eta L)^{a-1} \leq (\eta L)^{b}/(1 - \eta L) \leq 2(\eta L)^{b} \leq (\eta L)^{b-1}, which holds because \eta L < 1/2. Since \eta L < 1, for any \epsilon > 0 we can solve for b such that \eta^{b} L^{b-1} \Delta_{\max} < \epsilon. Therefore the sequence \{\omega_t^{(k)}\}_{k=0}^{\infty} is a Cauchy sequence. Moreover, in a complete space every Cauchy sequence has a limit: \lim_{k\to\infty} \omega_t^{(k)} = \omega_t^{*}. ∎
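As an illustrative numerical check of this argument (our own Python sketch, not code from the paper; the bilinear game and all constants are assumed for illustration), the gap between consecutive Lv.k anticipations indeed decays geometrically at rate \eta L:

```python
import numpy as np

# Minimal numerical illustration of Theorem 4.1 (our own sketch, not the paper's code):
# on the bilinear game f(theta, phi) = theta^T B phi, the Lv.k anticipations form a
# Cauchy sequence, with consecutive gaps shrinking like (eta * L)^(k - 1).
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
L = np.linalg.norm(B, 2)          # here the only curvature is the coupling block B
eta = 0.4 / L                     # step size below (2L)^{-1}, as the proof assumes

theta, phi = rng.standard_normal(3), rng.standard_normal(3)
th_prev, ph_prev = theta.copy(), phi.copy()       # Lv.0 anticipation: current strategies
for k in range(1, 11):
    # Lv.k reasoning: each player responds to the opponent's Lv.(k-1) anticipation.
    th_next = theta - eta * (B @ ph_prev)         # grad_theta f(theta_t, phi_t^(k-1))
    ph_next = phi + eta * (B.T @ th_prev)         # the max player ascends f
    gap = np.linalg.norm(th_next - th_prev) + np.linalg.norm(ph_next - ph_prev)
    print(f"k={k:2d}  gap={gap:.3e}")             # geometric decay at rate eta*L = 0.4
    th_prev, ph_prev = th_next, ph_next
```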

A.2 Proof of Theorem 5.1

Theorem A.1.

Consider the (Minimax) problem under Assumption 3.1 and Lv.k GP. Let (\theta^*, \phi^*) be a stationary point. Suppose that \theta_t - \theta^* is not in the kernel of \nabla_{\phi\theta} f(\theta^*, \phi^*), that \phi_t - \phi^* is not in the kernel of \nabla_{\theta\phi} f(\theta^*, \phi^*), and that \eta < L^{-1}. Then there exists a neighborhood \mathcal{U} of (\theta^*, \phi^*) such that if SPPM is started at (\theta_0, \phi_0) \in \mathcal{U}, the iterates \{\theta_t, \phi_t\}_{t \geq 0} generated by SPPM satisfy:

\|\theta_{t+1} - \theta^*\|^2 + \|\phi_{t+1} - \phi^*\|^2 \leq \frac{\rho^2(I - \eta \nabla_{\theta\theta} f^*) \|\theta_t - \theta^*\|^2 + \rho^2(I + \eta \nabla_{\phi\phi} f^*) \|\phi_t - \phi^*\|^2}{1 + \eta^2 \lambda_{\min}(\nabla_{\theta\phi} f^* \nabla_{\phi\theta} f^*)}

where f^* = f(\theta^*, \phi^*) and \rho(\cdot) denotes the spectral radius. Moreover, for any \eta satisfying:

\frac{\max(\rho^2(I - \eta \nabla_{\theta\theta} f^*), \rho^2(I + \eta \nabla_{\phi\phi} f^*))}{1 + \eta^2 \lambda_{\min}(\nabla_{\theta\phi} f^* \nabla_{\phi\theta} f^*)} < 1,    (14)

SPPM converges asymptotically to (\theta^*, \phi^*).

Proof.

Consider the learning dynamics:

\theta_{t+1} = \theta_t - \eta \nabla_{\theta} f(\theta_t, \phi_{t+1})
\phi_{t+1} = \phi_t + \eta \nabla_{\phi} f(\theta_{t+1}, \phi_t)

Let us define

\hat{\theta}_t = \theta_t - \theta^*
\hat{\phi}_t = \phi_t - \phi^*

It follows immediately by linearizing the system about the stationary point (\theta^*, \phi^*) that

\begin{bmatrix} \hat{\theta}_{t+1} \\ \hat{\phi}_{t+1} \end{bmatrix} \simeq \begin{bmatrix} I - \eta \nabla^2_{\theta\theta} f(\theta^*, \phi^*) & 0 \\ 0 & I + \eta \nabla^2_{\phi\phi} f(\theta^*, \phi^*) \end{bmatrix} \begin{bmatrix} \hat{\theta}_t \\ \hat{\phi}_t \end{bmatrix} + \begin{bmatrix} 0 & -\eta \nabla^2_{\theta\phi} f(\theta^*, \phi^*) \\ \eta \nabla^2_{\phi\theta} f(\theta^*, \phi^*) & 0 \end{bmatrix} \begin{bmatrix} \hat{\theta}_{t+1} \\ \hat{\phi}_{t+1} \end{bmatrix}

Let us denote the Jacobian by

\begin{bmatrix} -\nabla^2_{\theta\theta} f(\theta^*, \phi^*) & -\nabla^2_{\theta\phi} f(\theta^*, \phi^*) \\ \nabla^2_{\phi\theta} f(\theta^*, \phi^*) & \nabla^2_{\phi\phi} f(\theta^*, \phi^*) \end{bmatrix} = \begin{bmatrix} -A & -B \\ B^T & C \end{bmatrix}    (15)

Then we can rewrite the dynamics around the stationary point as

\hat{\theta}_{t+1} = \hat{\theta}_t - \eta A \hat{\theta}_t - \eta B \hat{\phi}_{t+1}
\hat{\theta}_{t+1} = \hat{\theta}_t - \eta A \hat{\theta}_t - \eta B (\hat{\phi}_t + \eta B^T \hat{\theta}_{t+1} + \eta C \hat{\phi}_t)
(I + \eta^2 B B^T) \hat{\theta}_{t+1} = (I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t
\hat{\theta}_{t+1} = (I + \eta^2 B B^T)^{-1} [(I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t]    (16)

Similarly, for the other player we have

\hat{\phi}_{t+1} = \hat{\phi}_t + \eta B^T \hat{\theta}_{t+1} + \eta C \hat{\phi}_t
\hat{\phi}_{t+1} = \hat{\phi}_t + \eta B^T (\hat{\theta}_t - \eta A \hat{\theta}_t - \eta B \hat{\phi}_{t+1}) + \eta C \hat{\phi}_t
(I + \eta^2 B^T B) \hat{\phi}_{t+1} = \eta B^T (I - \eta A) \hat{\theta}_t + (I + \eta C) \hat{\phi}_t
\hat{\phi}_{t+1} = (I + \eta^2 B^T B)^{-1} [\eta B^T (I - \eta A) \hat{\theta}_t + (I + \eta C) \hat{\phi}_t]    (17)

Let us define the symmetric matrices Q_\theta = (I + \eta^2 B B^T)^{-1}, Q_\phi = (I + \eta^2 B^T B)^{-1} and P_\theta = (I - \eta A), P_\phi = (I + \eta C). Further, we define r_t = \|\hat{\theta}_t\|^2 + \|\hat{\phi}_t\|^2. Based on these definitions and the expressions in (16) and (17), we have

\|\hat{\theta}_{t+1}\|^2 + \|\hat{\phi}_{t+1}\|^2 = \|Q_\theta P_\theta \hat{\theta}_t\|^2 + \eta^2 \|Q_\theta B P_\phi \hat{\phi}_t\|^2 + \eta^2 \|Q_\phi B^T P_\theta \hat{\theta}_t\|^2 + \|Q_\phi P_\phi \hat{\phi}_t\|^2
- 2\eta \hat{\theta}_t^T P_\theta^T Q_\theta^T Q_\theta B P_\phi \hat{\phi}_t + 2\eta \hat{\phi}_t^T P_\phi^T Q_\phi^T Q_\phi B^T P_\theta \hat{\theta}_t    (18)

To simplify the expression in (18), we use the following lemma:

Lemma A.1.

The matrices Q_\theta = (I + \eta^2 B B^T)^{-1} and Q_\phi = (I + \eta^2 B^T B)^{-1} satisfy the following properties:

Q_\theta B = B Q_\phi    (19)
Q_\phi B^T = B^T Q_\theta    (20)

Using this lemma, we can show that

\hat{\theta}_t^T P_\theta^T Q_\theta^T Q_\theta B P_\phi \hat{\phi}_t = \hat{\theta}_t^T P_\theta^T Q_\theta^T B Q_\phi P_\phi \hat{\phi}_t = \hat{\phi}_t^T P_\phi^T Q_\phi^T B^T Q_\theta P_\theta \hat{\theta}_t = \hat{\phi}_t^T P_\phi^T Q_\phi^T Q_\phi B^T P_\theta \hat{\theta}_t

where the intermediate equality holds since a^T b = b^T a. Hence, the cross terms in (18) cancel and the expression can be simplified as

\|\hat{\theta}_{t+1}\|^2 + \|\hat{\phi}_{t+1}\|^2 = \|Q_\theta P_\theta \hat{\theta}_t\|^2 + \eta^2 \|Q_\theta B P_\phi \hat{\phi}_t\|^2 + \eta^2 \|Q_\phi B^T P_\theta \hat{\theta}_t\|^2 + \|Q_\phi P_\phi \hat{\phi}_t\|^2    (21)

We simplify Equation (21) as follows. Consider the terms involving \hat{\theta}_t. We have

\|Q_\theta P_\theta \hat{\theta}_t\|^2 + \eta^2 \|Q_\phi B^T P_\theta \hat{\theta}_t\|^2 = \hat{\theta}_t^T P_\theta^T Q_\theta^2 P_\theta \hat{\theta}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi^2 B^T P_\theta \hat{\theta}_t
= \hat{\theta}_t^T P_\theta^T (Q_\theta^2 + \eta^2 B Q_\phi^2 B^T) P_\theta \hat{\theta}_t
= \hat{\theta}_t^T P_\theta^T (Q_\theta^2 + \eta^2 B Q_\phi B^T Q_\theta) P_\theta \hat{\theta}_t
= \hat{\theta}_t^T P_\theta^T (Q_\theta^2 + \eta^2 B B^T Q_\theta Q_\theta) P_\theta \hat{\theta}_t
= \hat{\theta}_t^T P_\theta^T (I + \eta^2 B B^T) Q_\theta^2 P_\theta \hat{\theta}_t
= \hat{\theta}_t^T P_\theta^T (I + \eta^2 B B^T)^{-1} P_\theta \hat{\theta}_t    (22)

where the last equality follows by replacing Q_\theta with its definition. The same procedure applies to the terms involving \hat{\phi}_t, which leads to the expression

\|Q_\phi P_\phi \hat{\phi}_t\|^2 + \eta^2 \|Q_\theta B P_\phi \hat{\phi}_t\|^2 = \hat{\phi}_t^T P_\phi^T (I + \eta^2 B^T B)^{-1} P_\phi \hat{\phi}_t.    (23)

Substituting the expressions in (22) and (23) into (21), we obtain

\|\hat{\theta}_{t+1}\|^2 + \|\hat{\phi}_{t+1}\|^2 = \hat{\theta}_t^T P_\theta^T (I + \eta^2 B B^T)^{-1} P_\theta \hat{\theta}_t + \hat{\phi}_t^T P_\phi^T (I + \eta^2 B^T B)^{-1} P_\phi \hat{\phi}_t.    (24)

Note that we assume the trajectory \{\hat{\theta}_t, \hat{\phi}_t\}_{t\geq 0} is not in the kernels of B B^T and B^T B, so that B B^T \hat{\theta}_t \neq 0 and B^T B \hat{\phi}_t \neq 0. Now, using the expression in (24), the fact that P_\theta = P_\theta^T and P_\phi = P_\phi^T, and the fact that B B^T and B^T B have the same set of non-zero eigenvalues, whose minimum we denote by \lambda_{\min}(B B^T) = \lambda_{\min}(B^T B), we can write

\|\theta_{t+1} - \theta^*\|^2 + \|\phi_{t+1} - \phi^*\|^2 \leq \frac{\rho^2(I - \eta A) \|\theta_t - \theta^*\|^2 + \rho^2(I + \eta C) \|\phi_t - \phi^*\|^2}{1 + \eta^2 \lambda_{\min}(B^T B)}.

Replacing \|\theta_{t+1} - \theta^*\|^2 + \|\phi_{t+1} - \phi^*\|^2 and \|\theta_t - \theta^*\|^2 + \|\phi_t - \phi^*\|^2 with r_{t+1} and r_t, we have:

r_{t+1} \leq \frac{\max(\rho^2(I - \eta A), \rho^2(I + \eta C))}{1 + \eta^2 \lambda_{\min}(B^T B)} r_t.

Recall that A = \nabla^2_{\theta\theta} f(\theta^*, \phi^*), B = \nabla^2_{\theta\phi} f(\theta^*, \phi^*) and C = \nabla^2_{\phi\phi} f(\theta^*, \phi^*); therefore, for any \eta satisfying

\frac{\max(\rho^2(I - \eta \nabla_{\theta\theta} f(\theta^*, \phi^*)), \rho^2(I + \eta \nabla_{\phi\phi} f(\theta^*, \phi^*)))}{1 + \eta^2 \lambda_{\min}(\nabla_{\theta\phi} f(\theta^*, \phi^*) \nabla_{\phi\theta} f(\theta^*, \phi^*))} < 1,    (25)

we have r_{t+1} < r_t. Since we linearized the system about the stationary point (\theta^*, \phi^*), there exists a neighborhood \mathcal{U} around the stationary point such that SPPM started at (\theta_0, \phi_0) \in \mathcal{U} converges asymptotically to (\theta^*, \phi^*). ∎
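The contraction above is easy to probe numerically. The following sketch (our own illustration; the quadratic game, matrices, and step size are assumed, not taken from the paper's experiments) iterates the closed forms (16)-(17) and asserts that the squared distance shrinks at least by the factor in (25):

```python
import numpy as np

# Numerical check (our own illustrative example) of the SPPM contraction in Eq. (25):
# iterate the closed forms (16)-(17) for a quadratic game with symmetric blocks A, C
# and a square coupling B, and assert r_{t+1} <= factor * r_t at every step.
rng = np.random.default_rng(1)
n, eta = 4, 0.1
A = 0.5 * np.eye(n)                       # curvature of the min player (symmetric)
C = -0.5 * np.eye(n)                      # curvature of the max player (symmetric)
B = rng.standard_normal((n, n))           # full-rank coupling, so the kernel assumption holds

I = np.eye(n)
Qt = np.linalg.inv(I + eta**2 * B @ B.T)  # Q_theta
Qp = np.linalg.inv(I + eta**2 * B.T @ B)  # Q_phi
Pt, Pp = I - eta * A, I + eta * C         # P_theta, P_phi

rho = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
factor = max(rho(Pt)**2, rho(Pp)**2) / (1 + eta**2 * np.linalg.eigvalsh(B.T @ B).min())

th, ph = rng.standard_normal(n), rng.standard_normal(n)
for t in range(50):
    r_old = th @ th + ph @ ph
    th, ph = Qt @ (Pt @ th - eta * B @ (Pp @ ph)), Qp @ (eta * B.T @ (Pt @ th) + Pp @ ph)
    assert th @ th + ph @ ph <= factor * r_old + 1e-12
print("contraction factor:", factor)      # < 1, so the iterates converge to (0, 0)
```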

A.3 Proof of Remark 5.1

Proof.

To prove the local convergence of Lv.k GP in nonconvex-nonconcave games, we first consider the update rule of Lv.k GP:

Reasoning: { \theta_t^{(k)} = \theta_t - \eta \nabla_{\theta} f(\theta_t, \phi_t^{(k-1)});  \phi_t^{(k)} = \phi_t - \eta \nabla_{\phi} g(\theta_t^{(k-1)}, \phi_t) }    Update: { \theta_{t+1} = \theta_t^{(k)};  \phi_{t+1} = \phi_t^{(k)} }
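To make the recursion concrete, here is a minimal sketch of the full Lv.k GP loop (k reasoning rounds followed by the update) on an assumed quadratic game with g = -f; the matrices, step size, and horizon are illustrative choices only:

```python
import numpy as np

# A compact sketch of the Lv.k GP loop (reasoning + update) on the quadratic game
# f(theta, phi) = 0.5 theta^T A theta + theta^T B phi - 0.5 phi^T C phi, with g = -f.
# All matrices and hyperparameters below are illustrative assumptions.
rng = np.random.default_rng(4)
n, k, T = 2, 6, 300
A = C = 0.1 * np.eye(n)
B = rng.standard_normal((n, n))
L = max(np.linalg.norm(M, 2) for M in (A, B, C))
eta = 0.4 / L                                    # below (2L)^{-1}, so reasoning converges

grad_th = lambda th, ph: A @ th + B @ ph         # nabla_theta f
grad_ph = lambda th, ph: B.T @ th - C @ ph       # nabla_phi f (the max player ascends)

th, ph = rng.standard_normal(n), rng.standard_normal(n)
for t in range(T):
    th_k, ph_k = th.copy(), ph.copy()            # Lv.0 anticipations
    for _ in range(k):                           # k rounds of recursive reasoning
        th_k, ph_k = th - eta * grad_th(th, ph_k), ph + eta * grad_ph(th_k, ph)
    th, ph = th_k, ph_k                          # Lv.k GP update
print("distance to the stationary point (0, 0):", np.linalg.norm(th) + np.linalg.norm(ph))
```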

Similar to Appendix A.2, let us denote

\begin{bmatrix} -\nabla^2_{\theta\theta} f(\theta^*, \phi^*) & -\nabla^2_{\theta\phi} f(\theta^*, \phi^*) \\ \nabla^2_{\phi\theta} f(\theta^*, \phi^*) & \nabla^2_{\phi\phi} f(\theta^*, \phi^*) \end{bmatrix} = \begin{bmatrix} -A & -B \\ B^T & C \end{bmatrix}

and define the differences between the reasoning states and the stationary point as

\hat{\theta}_t^{(k)} = \theta_t^{(k)} - \theta^*  and  \hat{\theta}_t = \theta_t - \theta^*,
\hat{\phi}_t^{(k)} = \phi_t^{(k)} - \phi^*  and  \hat{\phi}_t = \phi_t - \phi^*.

Linearizing the dynamical system induced by Lv.k GP about the stationary point (\theta^*, \phi^*), we get:

\hat{\theta}_{t+1} = \hat{\theta}_t^{(k)} = (I - \eta A) \hat{\theta}_t - \eta B \hat{\phi}_t^{(k-1)}
\hat{\phi}_{t+1} = \hat{\phi}_t^{(k)} = \eta B^T \hat{\theta}_t^{(k-1)} + (I + \eta C) \hat{\phi}_t

Note that in Lv.k GP we define \theta_t^{(0)} = \theta_t and \phi_t^{(0)} = \phi_t; thus for Lv.1 GP we have:

\hat{\theta}_t^{(1)} = (I - \eta A) \hat{\theta}_t - \eta B \hat{\phi}_t
\hat{\phi}_t^{(1)} = \eta B^T \hat{\theta}_t + (I + \eta C) \hat{\phi}_t

For Lv.2 GP, we have:

\hat{\theta}_t^{(2)} = (I - \eta A) \hat{\theta}_t - \eta B \hat{\phi}_t^{(1)}
\hat{\phi}_t^{(2)} = \eta B^T \hat{\theta}_t^{(1)} + (I + \eta C) \hat{\phi}_t

Substituting \hat{\theta}_t^{(1)} and \hat{\phi}_t^{(1)} into the update rule above, we get:

\hat{\theta}_t^{(2)} = (I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t - \eta^2 B B^T \hat{\theta}_t
\hat{\phi}_t^{(2)} = \eta B^T (I - \eta A) \hat{\theta}_t + (I + \eta C) \hat{\phi}_t - \eta^2 B^T B \hat{\phi}_t

Similarly, for Lv.3 and Lv.4 GP we have:

\hat{\theta}_t^{(3)} = (I - \eta^2 B B^T)(I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t + \eta^3 B B^T B \hat{\phi}_t
\hat{\phi}_t^{(3)} = \eta B^T (I - \eta A) \hat{\theta}_t + (I - \eta^2 B^T B)(I + \eta C) \hat{\phi}_t - \eta^3 B^T B B^T \hat{\theta}_t

and

\hat{\theta}_t^{(4)} = (I - \eta^2 B B^T)[(I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t] + \eta^4 B B^T B B^T \hat{\theta}_t
\hat{\phi}_t^{(4)} = (I - \eta^2 B^T B)[\eta B^T (I - \eta A) \hat{\theta}_t + (I + \eta C) \hat{\phi}_t] + \eta^4 B^T B B^T B \hat{\phi}_t

Summarizing the equations above, the Lv.2k GP update can be written as:

\hat{\theta}_t^{(2k)} = (\sum_{i=0}^{k-1} (-\eta^2 B B^T)^i) [(I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t] + (-\eta^2 B B^T)^k \hat{\theta}_t
\hat{\phi}_t^{(2k)} = (\sum_{i=0}^{k-1} (-\eta^2 B^T B)^i) [\eta B^T (I - \eta A) \hat{\theta}_t + (I + \eta C) \hat{\phi}_t] + (-\eta^2 B^T B)^k \hat{\phi}_t
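This closed form can be checked mechanically. The sketch below (an assumed random instance, for illustration only) unrolls the linearized recursion 2k times and compares the result against the summarized expression:

```python
import numpy as np

# Verify (on an assumed random instance) that unrolling the linearized Lv.k GP
# recursion 2k times matches the closed form for theta_hat^(2k) and phi_hat^(2k).
rng = np.random.default_rng(2)
n, k, eta = 3, 4, 0.05
A, C = rng.standard_normal((n, n)), rng.standard_normal((n, n))
A, C = (A + A.T) / 2, (C + C.T) / 2                # symmetric Hessian blocks
B = rng.standard_normal((n, n))
I = np.eye(n)
th0, ph0 = rng.standard_normal(n), rng.standard_normal(n)

# Unroll: th^(j) = (I - eta A) th0 - eta B ph^(j-1);  ph^(j) = eta B^T th^(j-1) + (I + eta C) ph0
th, ph = th0.copy(), ph0.copy()
for _ in range(2 * k):
    th, ph = (I - eta * A) @ th0 - eta * B @ ph, eta * B.T @ th + (I + eta * C) @ ph0

X, Y = -eta**2 * B @ B.T, -eta**2 * B.T @ B
R = sum(np.linalg.matrix_power(X, i) for i in range(k))      # sum_{i=0}^{k-1} X^i
Rp = sum(np.linalg.matrix_power(Y, i) for i in range(k))
th_closed = R @ ((I - eta * A) @ th0 - eta * B @ (I + eta * C) @ ph0) \
            + np.linalg.matrix_power(X, k) @ th0
ph_closed = Rp @ (eta * B.T @ (I - eta * A) @ th0 + (I + eta * C) @ ph0) \
            + np.linalg.matrix_power(Y, k) @ ph0
print(np.allclose(th, th_closed), np.allclose(ph, ph_closed))  # expect: True True
```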

Similar to Appendix A.2, let us define Q_\theta = (I + \eta^2 B B^T)^{-1}, Q_\phi = (I + \eta^2 B^T B)^{-1} and P_\theta = (I - \eta A), P_\phi = (I + \eta C). Further, we define R_\theta^{(k)} = \sum_{i=0}^{k-1} (-\eta^2 B B^T)^i, R_\phi^{(k)} = \sum_{i=0}^{k-1} (-\eta^2 B^T B)^i and the error terms E_\theta^{(k)} = R_\theta^{(k)} - Q_\theta, E_\phi^{(k)} = R_\phi^{(k)} - Q_\phi.

Since \eta < L^{-1}, the corresponding Neumann series converge and we have:

Q_\theta = \sum_{i=0}^{\infty} (-\eta^2 B B^T)^i
Q_\phi = \sum_{i=0}^{\infty} (-\eta^2 B^T B)^i

and

E_\theta^{(k)} = R_\theta^{(k)} - Q_\theta = -\sum_{i=k}^{\infty} (-\eta^2 B B^T)^i = -(I + \eta^2 B B^T)^{-1} (-\eta^2 B B^T)^k    (26)
E_\phi^{(k)} = R_\phi^{(k)} - Q_\phi = -\sum_{i=k}^{\infty} (-\eta^2 B^T B)^i = -(I + \eta^2 B^T B)^{-1} (-\eta^2 B^T B)^k    (27)

Also, from Lemma A.1 and the definition of the error terms, it can be verified that

E_\theta^{(k)} B = B E_\phi^{(k)}    (28)
E_\phi^{(k)} B^T = B^T E_\theta^{(k)}    (29)

Then we can rewrite the update rule of Lv.2k GP as:

\hat{\theta}_t^{(2k)} = (Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t] + (-\eta^2 B B^T)^k \hat{\theta}_t
\hat{\phi}_t^{(2k)} = (Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t] + (-\eta^2 B^T B)^k \hat{\phi}_t

Let us consider the following sum:

\|\hat{\theta}_t^{(2k)} - (-\eta^2 B B^T)^k \hat{\theta}_t\|^2 + \|\hat{\phi}_t^{(2k)} - (-\eta^2 B^T B)^k \hat{\phi}_t\|^2
= \|(Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t]\|^2 + \|(Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t]\|^2    (30)

The R.H.S. of Eq. (30) can be written as:

\|(Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t]\|^2 + \|(Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t]\|^2
= \hat{\theta}_t^T P_\theta^T Q_\theta^2 P_\theta \hat{\theta}_t - 2\eta \hat{\theta}_t^T P_\theta^T Q_\theta^2 B P_\phi \hat{\phi}_t + \eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta^2 B P_\phi \hat{\phi}_t
+ 2 \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} P_\theta \hat{\theta}_t - 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t + 2\eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t
+ \hat{\theta}_t^T P_\theta^T [E_\theta^{(k)}]^2 P_\theta \hat{\theta}_t - 2\eta \hat{\theta}_t^T P_\theta^T [E_\theta^{(k)}]^2 B P_\phi \hat{\phi}_t + \eta^2 \hat{\phi}_t^T P_\phi^T B^T [E_\theta^{(k)}]^2 B P_\phi \hat{\phi}_t
+ \hat{\phi}_t^T P_\phi^T Q_\phi^2 P_\phi \hat{\phi}_t + 2\eta \hat{\theta}_t^T P_\theta^T B Q_\phi^2 P_\phi \hat{\phi}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi^2 B^T P_\theta \hat{\theta}_t
+ 2 \hat{\phi}_t^T P_\phi^T Q_\phi E_\phi^{(k)} P_\phi \hat{\phi}_t + 4\eta \hat{\theta}_t^T P_\theta^T B Q_\phi E_\phi^{(k)} P_\phi \hat{\phi}_t + 2\eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi E_\phi^{(k)} B^T P_\theta \hat{\theta}_t
+ \hat{\phi}_t^T P_\phi^T [E_\phi^{(k)}]^2 P_\phi \hat{\phi}_t + 2\eta \hat{\theta}_t^T P_\theta^T B [E_\phi^{(k)}]^2 P_\phi \hat{\phi}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B [E_\phi^{(k)}]^2 B^T P_\theta \hat{\theta}_t    (31)

Before collecting all terms in Eq. (31), note that all of the cross terms cancel out. For instance, using Lemma A.1 together with Eq. (28) and Eq. (29), we can show that

4\eta \hat{\theta}_t^T P_\theta^T B Q_\phi E_\phi^{(k)} P_\phi \hat{\phi}_t - 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t
= 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta B E_\phi^{(k)} P_\phi \hat{\phi}_t - 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t
= 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t - 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t
= 0

By similar arguments, the remaining terms in Eq. (31) lead to:

\|(Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t]\|^2 + \|(Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t]\|^2
= \hat{\theta}_t^T P_\theta^T Q_\theta^2 P_\theta \hat{\theta}_t + \eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta^2 B P_\phi \hat{\phi}_t
+ 2 \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} P_\theta \hat{\theta}_t + 2\eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t
+ \hat{\theta}_t^T P_\theta^T [E_\theta^{(k)}]^2 P_\theta \hat{\theta}_t + \eta^2 \hat{\phi}_t^T P_\phi^T B^T [E_\theta^{(k)}]^2 B P_\phi \hat{\phi}_t
+ \hat{\phi}_t^T P_\phi^T Q_\phi^2 P_\phi \hat{\phi}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi^2 B^T P_\theta \hat{\theta}_t
+ 2 \hat{\phi}_t^T P_\phi^T Q_\phi E_\phi^{(k)} P_\phi \hat{\phi}_t + 2\eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi E_\phi^{(k)} B^T P_\theta \hat{\theta}_t
+ \hat{\phi}_t^T P_\phi^T [E_\phi^{(k)}]^2 P_\phi \hat{\phi}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B [E_\phi^{(k)}]^2 B^T P_\theta \hat{\theta}_t    (32)

Similar to Eq. (22), we have the following simplifications:

\hat{\theta}_t^T P_\theta^T Q_\theta^2 P_\theta \hat{\theta}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi^2 B^T P_\theta \hat{\theta}_t = \hat{\theta}_t^T P_\theta^T Q_\theta P_\theta \hat{\theta}_t
\hat{\phi}_t^T P_\phi^T Q_\phi^2 P_\phi \hat{\phi}_t + \eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta^2 B P_\phi \hat{\phi}_t = \hat{\phi}_t^T P_\phi^T Q_\phi P_\phi \hat{\phi}_t
2 \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} P_\theta \hat{\theta}_t + 2\eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi E_\phi^{(k)} B^T P_\theta \hat{\theta}_t = 2 \hat{\theta}_t^T P_\theta^T E_\theta^{(k)} P_\theta \hat{\theta}_t
2 \hat{\phi}_t^T P_\phi^T Q_\phi E_\phi^{(k)} P_\phi \hat{\phi}_t + 2\eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t = 2 \hat{\phi}_t^T P_\phi^T E_\phi^{(k)} P_\phi \hat{\phi}_t

Now we can further simplify Eq. (31) as:

\|(Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t]\|^2 + \|(Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t]\|^2
= (P_\theta \hat{\theta}_t)^T [Q_\theta + 2 E_\theta^{(k)} + [E_\theta^{(k)}]^2 + \eta^2 B [E_\phi^{(k)}]^2 B^T] (P_\theta \hat{\theta}_t)
+ (P_\phi \hat{\phi}_t)^T [Q_\phi + 2 E_\phi^{(k)} + [E_\phi^{(k)}]^2 + \eta^2 B^T [E_\theta^{(k)}]^2 B] (P_\phi \hat{\phi}_t)

Using Eq. (26), Eq. (27) and the definitions of Q_\theta and Q_\phi, we have:

(P_\theta \hat{\theta}_t)^T [Q_\theta + 2 E_\theta^{(k)} + [E_\theta^{(k)}]^2 + \eta^2 B [E_\phi^{(k)}]^2 B^T] (P_\theta \hat{\theta}_t)
= (P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} (P_\theta \hat{\theta}_t) - 2 (P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} (-\eta^2 B B^T)^k (P_\theta \hat{\theta}_t)
+ (P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)(I + \eta^2 B B^T)^{-2} (-\eta^2 B B^T)^{2k} (P_\theta \hat{\theta}_t)
= (P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} (I - 2(-\eta^2 B B^T)^k + (-\eta^2 B B^T)^{2k}) (P_\theta \hat{\theta}_t)
= ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)

Similarly, we have that

(P_\phi \hat{\phi}_t)^T [Q_\phi + 2 E_\phi^{(k)} + [E_\phi^{(k)}]^2 + \eta^2 B^T [E_\theta^{(k)}]^2 B] (P_\phi \hat{\phi}_t)
= ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)^T (I + \eta^2 B^T B)^{-1} ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)

Thus we simplify the R.H.S. of Eq. (30) as

\|(Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t]\|^2 + \|(Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t]\|^2
= ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)
+ ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)^T (I + \eta^2 B^T B)^{-1} ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)    (33)

Let us consider the L.H.S. of Eq. (30):

\|\hat{\theta}_t^{(2k)} - (-\eta^2 B B^T)^k \hat{\theta}_t\|^2 + \|\hat{\phi}_t^{(2k)} - (-\eta^2 B^T B)^k \hat{\phi}_t\|^2
= \|\hat{\theta}_t^{(2k)}\|^2 - 2\langle \hat{\theta}_t^{(2k)}, (-\eta^2 B B^T)^k \hat{\theta}_t \rangle + \|(-\eta^2 B B^T)^k \hat{\theta}_t\|^2
+ \|\hat{\phi}_t^{(2k)}\|^2 - 2\langle \hat{\phi}_t^{(2k)}, (-\eta^2 B^T B)^k \hat{\phi}_t \rangle + \|(-\eta^2 B^T B)^k \hat{\phi}_t\|^2    (34)

Substituting Eq. (34) and Eq. (33) into the L.H.S. and R.H.S. of Eq. (30), respectively, we get:

\|\hat{\theta}_t^{(2k)}\|^2 - 2\langle \hat{\theta}_t^{(2k)}, (-\eta^2 B B^T)^k \hat{\theta}_t \rangle + \|(-\eta^2 B B^T)^k \hat{\theta}_t\|^2
+ \|\hat{\phi}_t^{(2k)}\|^2 - 2\langle \hat{\phi}_t^{(2k)}, (-\eta^2 B^T B)^k \hat{\phi}_t \rangle + \|(-\eta^2 B^T B)^k \hat{\phi}_t\|^2
= ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)
+ ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)^T (I + \eta^2 B^T B)^{-1} ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)

Now we have the following equation:

\|\hat{\theta}_t^{(2k)}\|^2 + \|\hat{\phi}_t^{(2k)}\|^2
= ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)
+ 2\langle \hat{\theta}_t^{(2k)}, (-\eta^2 B B^T)^k \hat{\theta}_t \rangle - \|(-\eta^2 B B^T)^k \hat{\theta}_t\|^2
+ ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)^T (I + \eta^2 B^T B)^{-1} ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)
+ 2\langle \hat{\phi}_t^{(2k)}, (-\eta^2 B^T B)^k \hat{\phi}_t \rangle - \|(-\eta^2 B^T B)^k \hat{\phi}_t\|^2

Note that, by the Cauchy-Schwarz and Young inequalities,

2\langle \hat{\theta}_t^{(2k)}, (-\eta^2 B B^T)^k \hat{\theta}_t \rangle \leq 2 |\langle (\eta^2 B B^T)^{k/2} \hat{\theta}_t^{(2k)}, (\eta^2 B B^T)^{k/2} \hat{\theta}_t \rangle|    (35)
\leq \eta^{2k} (\hat{\theta}_t^{(2k)})^T (B B^T)^k \hat{\theta}_t^{(2k)} + \eta^{2k} \hat{\theta}_t^T (B B^T)^k \hat{\theta}_t    (36)

Similarly,

2\langle \hat{\phi}_t^{(2k)}, (-\eta^2 B^T B)^k \hat{\phi}_t \rangle \leq 2 |\langle (\eta^2 B^T B)^{k/2} \hat{\phi}_t^{(2k)}, (\eta^2 B^T B)^{k/2} \hat{\phi}_t \rangle|    (37)
\leq \eta^{2k} (\hat{\phi}_t^{(2k)})^T (B^T B)^k \hat{\phi}_t^{(2k)} + \eta^{2k} \hat{\phi}_t^T (B^T B)^k \hat{\phi}_t    (38)

Summing everything together, we have:

(\hat{\theta}_t^{(2k)})^T (I - (\eta^2 B B^T)^k) \hat{\theta}_t^{(2k)} + (\hat{\phi}_t^{(2k)})^T (I - (\eta^2 B^T B)^k) \hat{\phi}_t^{(2k)}
\leq ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)
+ \hat{\theta}_t^T (\eta^2 B B^T)^k \hat{\theta}_t - \|(-\eta^2 B B^T)^k \hat{\theta}_t\|^2
+ ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)^T (I + \eta^2 B^T B)^{-1} ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)
+ \hat{\phi}_t^T (\eta^2 B^T B)^k \hat{\phi}_t - \|(-\eta^2 B^T B)^k \hat{\phi}_t\|^2

Recall that the trajectory \{\hat{\theta}_t, \hat{\phi}_t\}_{t\geq 0} is assumed not to lie in the kernels of B B^T and B^T B, so B B^T \hat{\theta}_t \neq 0 and B^T B \hat{\phi}_t \neq 0. Using the bound above, the symmetry P_\theta = P_\theta^T and P_\phi = P_\phi^T, and the fact that B B^T and B^T B share the same set of non-zero eigenvalues (below, \lambda(\cdot) denotes the worst-case eigenvalue in each term, with extreme non-zero values \lambda_{\min}(\cdot) and \lambda_{\max}(\cdot)), we can write

(1 - (\eta^2 \lambda_{\max}(B B^T))^k) \|\hat{\theta}_t^{(2k)}\|^2 + (1 - (\eta^2 \lambda_{\max}(B^T B))^k) \|\hat{\phi}_t^{(2k)}\|^2
\leq \frac{(1 - (-\eta^2 \lambda(B B^T))^k)^2 \rho^2(I - \eta A) \|\hat{\theta}_t\|^2}{1 + \eta^2 \lambda_{\min}(B B^T)} + (\eta^2 \lambda(B B^T))^k (1 - (\eta^2 \lambda(B B^T))^k) \|\hat{\theta}_t\|^2
+ \frac{(1 - (-\eta^2 \lambda(B^T B))^k)^2 \rho^2(I + \eta C) \|\hat{\phi}_t\|^2}{1 + \eta^2 \lambda_{\min}(B^T B)} + (\eta^2 \lambda(B^T B))^k (1 - (\eta^2 \lambda(B^T B))^k) \|\hat{\phi}_t\|^2

Dividing through by the prefactors on the left-hand side then yields

\|\hat{\theta}_t^{(2k)}\|^2 + \|\hat{\phi}_t^{(2k)}\|^2 \leq \frac{(1 - (-\eta^2 \lambda(B B^T))^k)^2 \rho^2(I - \eta A) \|\hat{\theta}_t\|^2}{(1 + \eta^2 \lambda_{\min}(B B^T))(1 - (\eta^2 \lambda_{\max}(B B^T))^k)} + \frac{(\eta^2 \lambda(B B^T))^k (1 - (\eta^2 \lambda(B B^T))^k) \|\hat{\theta}_t\|^2}{1 - (\eta^2 \lambda_{\max}(B B^T))^k}
+ \frac{(1 - (-\eta^2 \lambda(B^T B))^k)^2 \rho^2(I + \eta C) \|\hat{\phi}_t\|^2}{(1 + \eta^2 \lambda_{\min}(B^T B))(1 - (\eta^2 \lambda_{\max}(B^T B))^k)} + \frac{(\eta^2 \lambda(B^T B))^k (1 - (\eta^2 \lambda(B^T B))^k) \|\hat{\phi}_t\|^2}{1 - (\eta^2 \lambda_{\max}(B^T B))^k}

Let us define the distances as:

\|r_t^{(k)}\|^2 = \|\hat{\theta}_t^{(k)}\|^2 + \|\hat{\phi}_t^{(k)}\|^2    (40)
\|r_t\|^2 = \|\hat{\theta}_t\|^2 + \|\hat{\phi}_t\|^2    (41)

Then we have

\|r_t^{(2k)}\|^2 \leq \frac{(1 - (-\eta^2 \lambda(B B^T))^k)^2 (\rho^2(I - \eta A) \|\hat{\theta}_t\|^2 + \rho^2(I + \eta C) \|\hat{\phi}_t\|^2)}{(1 + \eta^2 \lambda_{\min}(B B^T))(1 - (\eta^2 \lambda_{\max}(B B^T))^k)} + \frac{(\eta^2 \lambda(B B^T))^k (1 - (\eta^2 \lambda(B B^T))^k)}{1 - (\eta^2 \lambda_{\max}(B B^T))^k} \|r_t\|^2
= a \left( \frac{\rho^2(I - \eta A) \|\hat{\theta}_t\|^2 + \rho^2(I + \eta C) \|\hat{\phi}_t\|^2}{1 + \eta^2 \lambda_{\min}(B B^T)} \right) + b \|r_t\|^2    (42)

where

a = \begin{cases} \frac{(1 + (\eta^2 \lambda_{\max}(B B^T))^k)^2}{1 - (\eta^2 \lambda_{\max}(B B^T))^k} & \text{for odd } k \\ \frac{(1 - (\eta^2 \lambda_{\min}(B B^T))^k)^2}{1 - (\eta^2 \lambda_{\max}(B B^T))^k} & \text{for even } k \end{cases}

and

b = \frac{(\eta^2 \lambda_{\max}(B B^T))^k (1 - (\eta^2 \lambda_{\min}(B B^T))^k)}{1 - (\eta^2 \lambda_{\max}(B B^T))^k}    (43)

Since \eta < L^{-1} implies \eta^2 \lambda_{\max}(B B^T) \leq (\eta L)^2 < 1, we have a \to 1 and b \to 0 geometrically as k \to \infty. Hence, for k large enough, the bound (42) approaches the SPPM bound of Theorem A.1, and whenever condition (25) holds we obtain \|r_t^{(2k)}\|^2 < \|r_t\|^2 in a neighborhood of the stationary point, which establishes the claimed local convergence. ∎
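As a quick numerical illustration (the spectral values below are assumed, not taken from the paper), a and b indeed approach 1 and 0 geometrically as the reasoning depth k grows:

```python
# Behavior of the coefficients a and b from Eq. (43) as k grows, for assumed
# spectral values of B B^T satisfying eta^2 * lambda_max < 1.
eta, lam_min, lam_max = 0.1, 0.5, 4.0
for k in [1, 2, 4, 8, 16]:
    x_min, x_max = (eta**2 * lam_min)**k, (eta**2 * lam_max)**k
    a = ((1 + x_max)**2 if k % 2 else (1 - x_min)**2) / (1 - x_max)
    b = x_max * (1 - x_min) / (1 - x_max)
    print(f"k={k:2d}  a={a:.6f}  b={b:.2e}")     # a -> 1 and b -> 0, recovering SPPM
```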

A.3.1 Proof of Lemma A.1

Let 𝑩=𝑼𝚺𝑽T\bm{B}=\bm{U}\bm{\Sigma}\bm{V}^{T} be the singular value decomposition of 𝑩\bm{B}. Here 𝑼\bm{U} and 𝑽\bm{V} are orthonormal matrices and 𝚺\bm{\Sigma} is a rectangular diagonal matrix. Then we have:

𝑸𝜽𝑩\displaystyle\bm{Q}_{\bm{\theta}}\bm{B} =(𝑰+η2𝑼𝚺𝑽T𝑽𝚺T𝑼T)1𝑼𝚺𝑽T\displaystyle=(\bm{I}+\eta^{2}\bm{U}\bm{\Sigma}\bm{V}^{T}\bm{V}\bm{\Sigma}^{T}\bm{U}^{T})^{-1}\bm{U}\bm{\Sigma}\bm{V}^{T}
=(𝑼(η2𝚺𝚺T+𝑰)𝑼T)1𝑼𝚺𝑽T\displaystyle=(\bm{U}(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})\bm{U}^{T})^{-1}\bm{U}\bm{\Sigma}\bm{V}^{T}
=𝑼(η2𝚺𝚺T+𝑰)1𝑼T𝑼𝚺𝑽T\displaystyle=\bm{U}(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1}\bm{U}^{T}\bm{U}\bm{\Sigma}\bm{V}^{T}
=𝑼(η2𝚺𝚺T+𝑰)1𝚺𝑽T\displaystyle=\bm{U}(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1}\bm{\Sigma}\bm{V}^{T}

Here we used the fact that 𝑼\bm{U} and 𝑽\bm{V} are orthonormal matrices. Now, we simplify the other side to get:

𝑩𝑸ϕ\displaystyle\bm{B}\bm{Q}_{\bm{\phi}} =𝑼𝚺𝑽T(𝑰+η2𝑽𝚺T𝑼T𝑼𝚺𝑽T)1\displaystyle=\bm{U}\bm{\Sigma}\bm{V}^{T}(\bm{I}+\eta^{2}\bm{V}\bm{\Sigma}^{T}\bm{U}^{T}\bm{U}\bm{\Sigma}\bm{V}^{T})^{-1}
=𝑼𝚺𝑽T(𝑽(η2𝚺T𝚺+𝑰)𝑽T)1\displaystyle=\bm{U}\bm{\Sigma}\bm{V}^{T}(\bm{V}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})\bm{V}^{T})^{-1}
=𝑼𝚺𝑽T𝑽(η2𝚺T𝚺+𝑰)1𝑽T\displaystyle=\bm{U}\bm{\Sigma}\bm{V}^{T}\bm{V}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1}\bm{V}^{T}
=𝑼𝚺(η2𝚺T𝚺+𝑰)1𝑽T\displaystyle=\bm{U}\bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1}\bm{V}^{T}

Now we consider the following equation:

η2𝚺𝚺T𝚺+𝚺=𝚺(η2𝚺T𝚺+𝑰)=(η2𝚺𝚺T+𝑰)𝚺\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{\Sigma}=\bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})=(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})\bm{\Sigma} (44)

which indicates that \bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})=(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})\bm{\Sigma}. Multiplying both sides of this equation on the left by (\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1} and on the right by (\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1}, we have:

(η2𝚺𝚺T+𝑰)1𝚺(η2𝚺T𝚺+𝑰)(η2𝚺T𝚺+𝑰)1\displaystyle(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1}\bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1} =(η2𝚺𝚺T+𝑰)1(η2𝚺𝚺T+𝑰)𝚺(η2𝚺T𝚺+𝑰)1\displaystyle=(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1}(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})\bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1}
(η2𝚺𝚺T+𝑰)1𝚺\displaystyle(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1}\bm{\Sigma} =𝚺(η2𝚺T𝚺+𝑰)1\displaystyle=\bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1}

Therefore, we have 𝑸𝜽𝑩=𝑩𝑸ϕ\bm{Q}_{\bm{\theta}}\bm{B}=\bm{B}\bm{Q}_{\bm{\phi}}. Using a similar argument, we can also prove the equality in Equation (20).
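The identity in Lemma A.1 can also be checked numerically. The following minimal numpy sketch (ours, not part of the original derivation) verifies (\bm{I}+\eta^{2}\bm{B}\bm{B}^{T})^{-1}\bm{B}=\bm{B}(\bm{I}+\eta^{2}\bm{B}^{T}\bm{B})^{-1} for a random rectangular \bm{B}; all dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eta = 4, 6, 0.3
B = rng.standard_normal((m, n))

# Q_theta and Q_phi as defined in the lemma.
Q_theta = np.linalg.inv(np.eye(m) + eta**2 * B @ B.T)
Q_phi = np.linalg.inv(np.eye(n) + eta**2 * B.T @ B)

# The two products agree up to floating-point error.
assert np.allclose(Q_theta @ B, B @ Q_phi)
```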

A.4 Theorem 5.1 without kernel assumption

Theorem A.2.

Consider the (Minimax) problem under Assumption 3.1 and Lv.k GP. Let (\bm{\theta}^{*},\bm{\phi}^{*}) be a stationary point. Suppose \eta<L^{-1}. There exists a neighborhood \mathcal{U} of (\bm{\theta}^{*},\bm{\phi}^{*}) such that if SPPM is started at (\bm{\theta}_{0},\bm{\phi}_{0})\in\mathcal{U}, the iterates \{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0} generated by SPPM satisfy:

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}\leq\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}\bm{B}^{T})}+\frac{\rho^{2}(1+\eta\bm{C})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}^{T}\bm{B})}.

Moreover, for any \eta satisfying:

\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}\bm{B}^{T})}+\frac{\rho^{2}(1+\eta\bm{C})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}^{T}\bm{B})}<\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2} (45)

SPPM converges asymptotically to (𝛉,ϕ)(\bm{\theta}^{*},\bm{\phi}^{*}).

Proof.

Following the same setting and procedure as in Appendix A.2, we have that

𝜽^t+12+ϕ^t+12=𝜽^tT𝑷𝜽T(𝑰+η2𝑩𝑩T)1𝑷𝜽𝜽^t+ϕ^tT𝑷ϕT(𝑰+η2𝑩T𝑩)1𝑷ϕϕ^t\lVert\hat{\bm{\theta}}_{t+1}\rVert^{2}+\lVert\hat{\bm{\phi}}_{t+1}\rVert^{2}=\hat{\bm{\theta}}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{I}+\eta^{2}\bm{B}\bm{B}^{T})^{-1}\bm{P}_{\bm{\theta}}\hat{\bm{\theta}}_{t}+\hat{\bm{\phi}}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}(\bm{I}+\eta^{2}\bm{B}^{T}\bm{B})^{-1}\bm{P}_{\bm{\phi}}\hat{\bm{\phi}}_{t} (46)

Now using the fact that 𝑷𝜽=𝑷𝜽T\bm{P}_{\bm{\theta}}=\bm{P}_{\bm{\theta}}^{T}, 𝑷ϕ=𝑷ϕT\bm{P}_{\bm{\phi}}=\bm{P}_{\bm{\phi}}^{T}, we can write

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}\leq\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}\bm{B}^{T})}+\frac{\rho^{2}(1+\eta\bm{C})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}^{T}\bm{B})}.

For any \eta satisfying

\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}\bm{B}^{T})}+\frac{\rho^{2}(1+\eta\bm{C})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}^{T}\bm{B})}<\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2} (47)

we have that

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}<\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2} (48)

i.e., SPPM converges asymptotically towards (𝜽,ϕ)(\bm{\theta}^{*},\bm{\phi}^{*}). ∎

A.5 Proof of Theorem 5.2

Proof.

In order to prove the convergence of SPPM in bilinear games, we first show that the SPPM update rule is equivalent to that of the following Proximal Point Method (PPM):

{𝜽t+1=𝜽tηθf(𝜽t+1,ϕt+1)ϕt+1=ϕt+ηϕf(𝜽t+1,ϕt+1)\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t+1},\bm{\phi}_{t+1})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t+1},\bm{\phi}_{t+1})\end{cases} (49)

In the Bilinear game, the SPPM update is:

\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t+1})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t+1},\bm{\phi}_{t})\end{cases}
=\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\bm{M}\bm{\phi}_{t+1}\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\bm{M}^{T}\bm{\theta}_{t+1}\end{cases}

and the PPM update is:

{𝜽t+1=𝜽tηθf(𝜽t+1,ϕt+1)ϕt+1=ϕt+ηϕf(𝜽t+1,ϕt+1)\displaystyle\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t+1},\bm{\phi}_{t+1})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t+1},\bm{\phi}_{t+1})\end{cases}
={𝜽t+1=𝜽tη𝑴ϕt+1ϕt+1=ϕt+η𝑴T𝜽t+1\displaystyle=\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\bm{M}\bm{\phi}_{t+1}\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\bm{M}^{T}\bm{\theta}_{t+1}\end{cases}

Thus SPPM and PPM are equivalent in the Bilinear game. The convergence result of PPM in bilinear games has been proved in Theorem 2 of [49]:

Theorem A.3.

Consider the Bilinear game and the PPM method. Further, we define r_{t}=\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}. Then, for any \eta>0, the iterates \{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0} generated by PPM satisfy

rt+111+η2λmin(𝑴T𝑴)rt.r_{t+1}\leq\frac{1}{1+\eta^{2}\lambda_{\min}(\bm{M}^{T}\bm{M})}r_{t}. (50)

Therefore, SPPM and PPM have the same convergence property in bilinear games. ∎
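As a numerical sanity check, the following numpy sketch (ours) iterates the closed-form PPM/SPPM update for a random bilinear game f(\bm{\theta},\bm{\phi})=\bm{\theta}^{T}\bm{M}\bm{\phi} and asserts the per-step rate of Theorem A.3. The implicit update reduces to two linear solves; the matrix \bm{M}, step size, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
eta = 0.5
# Per-step contraction factor from Theorem A.3.
rate = 1.0 / (1.0 + eta**2 * np.linalg.eigvalsh(M.T @ M).min())

theta = rng.standard_normal(n)
phi = rng.standard_normal(n)
I = np.eye(n)
for _ in range(100):
    r = theta @ theta + phi @ phi
    # Implicit PPM update, eliminated into two explicit linear solves.
    theta, phi = (np.linalg.solve(I + eta**2 * M @ M.T, theta - eta * M @ phi),
                  np.linalg.solve(I + eta**2 * M.T @ M, eta * M.T @ theta + phi))
    assert theta @ theta + phi @ phi <= rate * r * (1.0 + 1e-9)
```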

A.6 Proof of Theorem 5.3

Proof.

Consider the learning dynamics:

\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t+1})
\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t+1},\bm{\phi}_{t})

In the Quadratic game, the SPPM update rule can be written as:

\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\bm{A}\bm{\theta}_{t}-\eta\bm{C}\bm{\phi}_{t+1}\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\bm{B}\bm{\phi}_{t}+\eta\bm{C}^{T}\bm{\theta}_{t+1}\end{cases} (51)

Then we can rewrite the learning dynamics:

𝜽t+1\displaystyle\bm{\theta}_{t+1} =𝜽tη𝑨𝜽tη𝑪ϕt+1\displaystyle=\bm{\theta}_{t}-\eta\bm{A}\bm{\theta}_{t}-\eta\bm{C}\bm{\phi}_{t+1}
𝜽t+1\displaystyle\bm{\theta}_{t+1} =𝜽tη𝑨𝜽tη𝑪(ϕt+η𝑪T𝜽t+1+η𝑩ϕt)\displaystyle=\bm{\theta}_{t}-\eta\bm{A}\bm{\theta}_{t}-\eta\bm{C}(\bm{\phi}_{t}+\eta\bm{C}^{T}\bm{\theta}_{t+1}+\eta\bm{B}\bm{\phi}_{t})
(𝑰+η2𝑪𝑪T)𝜽t+1\displaystyle(\bm{I}+\eta^{2}\bm{CC}^{T})\bm{\theta}_{t+1} =(𝑰η𝑨)𝜽tη𝑪(𝑰+η𝑩)ϕt\displaystyle=(\bm{I}-\eta\bm{A})\bm{\theta}_{t}-\eta\bm{C}(\bm{I}+\eta\bm{B})\bm{\phi}_{t}
𝜽t+1\displaystyle\bm{\theta}_{t+1} =(𝑰+η2𝑪𝑪T)1[(𝑰η𝑨)𝜽tη𝑪(𝑰+η𝑩)ϕt]\displaystyle=(\bm{I}+\eta^{2}\bm{CC}^{T})^{-1}\left[(\bm{I}-\eta\bm{A})\bm{\theta}_{t}-\eta\bm{C}(\bm{I}+\eta\bm{B})\bm{\phi}_{t}\right] (52)

Similarly, for the other player we have

ϕt+1\displaystyle\bm{\phi}_{t+1} =ϕt+η𝑪T𝜽t+1+η𝑩ϕt\displaystyle=\bm{\phi}_{t}+\eta\bm{C}^{T}\bm{\theta}_{t+1}+\eta\bm{B}\bm{\phi}_{t}
ϕt+1\displaystyle\bm{\phi}_{t+1} =ϕt+η𝑪T(𝜽tη𝑨𝜽tη𝑪ϕt+1)+η𝑩ϕt\displaystyle=\bm{\phi}_{t}+\eta\bm{C}^{T}(\bm{\theta}_{t}-\eta\bm{A}\bm{\theta}_{t}-\eta\bm{C}\bm{\phi}_{t+1})+\eta\bm{B}\bm{\phi}_{t}
(𝑰+η2𝑪T𝑪)ϕt+1\displaystyle(\bm{I}+\eta^{2}\bm{C}^{T}\bm{C})\bm{\phi}_{t+1} =η𝑪T(𝑰η𝑨)𝜽t+(𝑰+η𝑩)ϕt\displaystyle=\eta\bm{C}^{T}(\bm{I}-\eta\bm{A})\bm{\theta}_{t}+(\bm{I}+\eta\bm{B})\bm{\phi}_{t}
ϕt+1\displaystyle\bm{\phi}_{t+1} =(𝑰+η2𝑪T𝑪)1[η𝑪T(𝑰η𝑨)𝜽t+(𝑰+η𝑩)ϕt]\displaystyle=(\bm{I}+\eta^{2}\bm{C}^{T}\bm{C})^{-1}\left[\eta\bm{C}^{T}(\bm{I}-\eta\bm{A})\bm{\theta}_{t}+(\bm{I}+\eta\bm{B})\bm{\phi}_{t}\right] (53)

Let us define the symmetric matrices \bm{Q}_{\bm{\theta}}=(\bm{I}+\eta^{2}\bm{CC}^{T})^{-1}, \bm{Q}_{\bm{\phi}}=(\bm{I}+\eta^{2}\bm{C}^{T}\bm{C})^{-1} and \bm{P}_{\bm{\theta}}=(\bm{I}-\eta\bm{A}), \bm{P}_{\bm{\phi}}=(\bm{I}+\eta\bm{B}). Further, we define r_{t}=\lVert\bm{\theta}_{t}\rVert^{2}+\lVert\bm{\phi}_{t}\rVert^{2}. Based on these definitions and the expressions in (52) and (53), we have

\lVert\bm{\theta}_{t+1}\rVert^{2}+\lVert\bm{\phi}_{t+1}\rVert^{2}=\lVert\bm{Q}_{\bm{\theta}}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\lVert\bm{Q}_{\bm{\phi}}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}-2\eta\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}+2\eta\bm{\phi}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}\bm{Q}_{\bm{\phi}}^{T}\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t} (54)

To simplify the expression in (54) we use Lemma A.1 to obtain the following equations:

𝑸𝜽𝑪\displaystyle\bm{Q}_{\bm{\theta}}\bm{C} =𝑪𝑸ϕ\displaystyle=\bm{C}\bm{Q}_{\bm{\phi}} (55)
𝑸ϕ𝑪T\displaystyle\bm{Q}_{\bm{\phi}}\bm{C}^{T} =𝑪T𝑸𝜽\displaystyle=\bm{C}^{T}\bm{Q}_{\bm{\theta}} (56)

Using this lemma, we can show that

𝜽tT𝑷𝜽T𝑸𝜽T𝑸𝜽𝑪𝑷ϕϕt=𝜽tT𝑷𝜽T𝑸𝜽T𝑪𝑸ϕ𝑷ϕϕt=ϕtT𝑷ϕT𝑸ϕT𝑪T𝑸𝜽𝑷𝜽𝜽t=ϕtT𝑷ϕT𝑸ϕT𝑸ϕ𝑪T𝑷𝜽𝜽t\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}^{T}\bm{C}\bm{Q}_{\bm{\phi}}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}=\bm{\phi}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}\bm{Q}_{\bm{\phi}}^{T}\bm{C}^{T}\bm{Q}_{\bm{\theta}}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}=\bm{\phi}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}\bm{Q}_{\bm{\phi}}^{T}\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}

where the intermediate equality holds because the expression is a scalar and therefore equals its own transpose. Hence, the expression in (54) can be simplified as

\lVert\bm{\theta}_{t+1}\rVert^{2}+\lVert\bm{\phi}_{t+1}\rVert^{2}=\lVert\bm{Q}_{\bm{\theta}}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\lVert\bm{Q}_{\bm{\phi}}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2} (57)

We simplify equation (57) as follows. Consider the term involving 𝜽t\bm{\theta}_{t}. We have

𝑸𝜽𝑷𝜽𝜽t2+η2𝑸ϕ𝑪T𝑷𝜽𝜽t2\displaystyle\lVert\bm{Q}_{\bm{\theta}}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2} =𝜽tT𝑷𝜽T𝑸𝜽2𝑷𝜽𝜽t+η2𝜽tT𝑷𝜽T𝑪𝑸ϕ2𝑪T𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}^{2}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}+\eta^{2}\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}\bm{C}\bm{Q}_{\bm{\phi}}^{2}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}
=𝜽tT𝑷𝜽T(𝑸𝜽2+η2𝑪𝑸ϕ2𝑪T)𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{Q}_{\bm{\theta}}^{2}+\eta^{2}\bm{C}\bm{Q}_{\bm{\phi}}^{2}\bm{C}^{T})\bm{P}_{\bm{\theta}}\bm{\theta}_{t}
=𝜽tT𝑷𝜽T(𝑸𝜽2+η2𝑪𝑸ϕ𝑪T𝑸𝜽)𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{Q}_{\bm{\theta}}^{2}+\eta^{2}\bm{C}\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{Q}_{\bm{\theta}})\bm{P}_{\bm{\theta}}\bm{\theta}_{t}
=𝜽tT𝑷𝜽T(𝑸𝜽2+η2𝑪𝑪T𝑸𝜽𝑸𝜽)𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{Q}_{\bm{\theta}}^{2}+\eta^{2}\bm{C}\bm{C}^{T}\bm{Q}_{\bm{\theta}}\bm{Q}_{\bm{\theta}})\bm{P}_{\bm{\theta}}\bm{\theta}_{t}
=𝜽tT𝑷𝜽T(𝑰+η2𝑪𝑪T)𝑸𝜽2𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{I}+\eta^{2}\bm{C}\bm{C}^{T})\bm{Q}_{\bm{\theta}}^{2}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}
=𝜽tT𝑷𝜽T(𝑰+η2𝑪𝑪T)1𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{I}+\eta^{2}\bm{C}\bm{C}^{T})^{-1}\bm{P}_{\bm{\theta}}\bm{\theta}_{t} (58)

where the last equality follows by replacing 𝑸𝜽\bm{Q}_{\bm{\theta}} by its definition. The same procedure follows for the term involving ϕt\bm{\phi}_{t} which leads to the expression

𝑸ϕ𝑷ϕϕt2+η2𝑸𝜽𝑪𝑷ϕϕt2=ϕtT𝑷ϕT(𝑰+η2𝑪T𝑪)1𝑷ϕϕt.\lVert\bm{Q}_{\bm{\phi}}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}=\bm{\phi}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}(\bm{I}+\eta^{2}\bm{C}^{T}\bm{C})^{-1}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}. (59)

Substitute 𝑸𝜽𝑷𝜽𝜽t2+η2𝑸ϕ𝑪T𝑷𝜽𝜽t2\lVert\bm{Q}_{\bm{\theta}}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2} and 𝑸ϕ𝑷ϕϕt2+η2𝑸𝜽𝑪𝑷ϕϕt2\lVert\bm{Q}_{\bm{\phi}}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2} in (57) with the expressions in (58) and (59), respectively, to obtain

𝜽t+12+ϕt+12=𝜽tT𝑷𝜽T(𝑰+η2𝑪𝑪T)1𝑷𝜽𝜽t+ϕtT𝑷ϕT(𝑰+η2𝑪T𝑪)1𝑷ϕϕt.\lVert\bm{\theta}_{t+1}\rVert^{2}+\lVert\bm{\phi}_{t+1}\rVert^{2}=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{I}+\eta^{2}\bm{C}\bm{C}^{T})^{-1}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}+\bm{\phi}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}(\bm{I}+\eta^{2}\bm{C}^{T}\bm{C})^{-1}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}. (60)

Now using the expression in (60) and the fact that 𝑷𝜽=𝑷𝜽T\bm{P}_{\bm{\theta}}=\bm{P}_{\bm{\theta}}^{T}, 𝑷ϕ=𝑷ϕT\bm{P}_{\bm{\phi}}=\bm{P}_{\bm{\phi}}^{T} and λmin(𝑪T𝑪)=λmin(𝑪𝑪T)\lambda_{\min}(\bm{C}^{T}\bm{C})=\lambda_{\min}(\bm{C}\bm{C}^{T}), we can write

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}\leq\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\rho^{2}(1+\eta\bm{B})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{C}^{T}\bm{C})}. ∎
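The closed-form updates (52) and (53) can be exercised directly. The numpy sketch below (ours) builds a toy instance with positive definite \bm{A}, negative definite \bm{B}, and \bm{C}=c\bm{I}, in the spirit of the experiments in Appendix A.8; it checks that the closed form solves the implicit SPPM equations and that the iterates contract toward the equilibrium at the origin.

```python
import numpy as np

rng = np.random.default_rng(3)
n, eta, c = 5, 0.05, 1.0
G = rng.standard_normal((n, n))
A = G @ G.T + np.eye(n)              # symmetric positive definite
H = rng.standard_normal((n, n))
B = -(H @ H.T + np.eye(n))           # symmetric negative definite
C = c * np.eye(n)                    # interaction matrix
I = np.eye(n)

theta = rng.standard_normal(n)
phi = rng.standard_normal(n)
for _ in range(200):
    r = theta @ theta + phi @ phi
    theta_next = np.linalg.solve(I + eta**2 * C @ C.T,
                                 (I - eta * A) @ theta
                                 - eta * C @ (I + eta * B) @ phi)
    phi_next = np.linalg.solve(I + eta**2 * C.T @ C,
                               eta * C.T @ (I - eta * A) @ theta
                               + (I + eta * B) @ phi)
    # The closed form satisfies the implicit SPPM equations.
    assert np.allclose(theta_next, theta - eta * A @ theta - eta * C @ phi_next)
    assert np.allclose(phi_next, phi + eta * B @ phi + eta * C.T @ theta_next)
    theta, phi = theta_next, phi_next
    assert theta @ theta + phi @ phi < r   # monotone contraction
```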

A.7 Competitive Gradient Descent as an Approximation of SPPM

In this section, we justify the claim in Section 4.1 that Competitive Gradient Descent is a first-order Taylor approximation of SPPM. First, we consider the standard definition of CGD:

{𝜽t+1=𝜽tη(𝑰+η2θϕf(𝜽t,ϕt)ϕθf(𝜽t,ϕt))1(θf(𝜽t,ϕt)+ηθϕf(𝜽t,ϕt)ϕf(𝜽t,ϕt))ϕt+1=ϕt+η(𝑰+η2ϕθf(𝜽t,ϕt)θϕf(𝜽t,ϕt))1(ϕf(𝜽t,ϕt)ηϕθf(𝜽t,ϕt)θf(𝜽t,ϕt))\displaystyle\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta(\bm{I}+\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))^{-1}(\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t}))\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta(\bm{I}+\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t}))^{-1}(\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))\end{cases}

Rewriting the update rules we can get:

(𝑰+η2θϕf(𝜽t,ϕt)ϕθf(𝜽t,ϕt))(𝜽t+1𝜽t)=η(θf(𝜽t,ϕt)+ηθϕf(𝜽t,ϕt)ϕf(𝜽t,ϕt))\displaystyle(\bm{I}+\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))(\bm{\theta}_{t+1}-\bm{\theta}_{t})=-\eta(\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t}))
𝜽t+1=𝜽tηθf(𝜽t,ϕt)η2θϕf(𝜽t,ϕt)ϕf(𝜽t,ϕt)η2θϕf(𝜽t,ϕt)ϕθf(𝜽t,ϕt)(𝜽t+1𝜽t)\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t})

Similarly, we have:

\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\phi}_{t+1}-\bm{\phi}_{t})

Therefore, CGD is a first-order approximation of SPPM. Next, we show that the standard definition of CGD is equivalent to the update rule in Table 1. Considering the update rule in Table 1 and its footnote, we have:

{𝜽t+1=𝜽tηθf(𝜽t,ϕt)ηθϕf(𝜽t,ϕt)(ϕt+1ϕt)ϕt+1=ϕt+ηϕf(𝜽t,ϕt)+ηϕθf(𝜽t,ϕt)(𝜽t+1𝜽t)\displaystyle\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\phi}_{t+1}-\bm{\phi}_{t})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\phi\theta}f(\bm{\theta}_{t},\phi_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t})\end{cases} (61)

Substituting the expression for (\bm{\phi}_{t+1}-\bm{\phi}_{t}) from the second equation of (61) into the first, we get:

𝜽t+1=𝜽tηθf(𝜽t,ϕt)ηθϕf(𝜽t,ϕt)(ηϕf(𝜽t,ϕt)+ηϕθf(𝜽t,ϕt)(𝜽t+1𝜽t))\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\phi\theta}f(\bm{\theta}_{t},\phi_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t}))
𝜽t+1=𝜽tηθf(𝜽t,ϕt)η2θϕf(𝜽t,ϕt)ϕf(𝜽t,ϕt)η2θϕf(𝜽t,ϕt)ϕθf(𝜽t,ϕt)(𝜽t+1𝜽t)\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\phi_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t})
(\bm{I}+\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))(\bm{\theta}_{t+1}-\bm{\theta}_{t})=-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})
𝜽t+1=𝜽tη(𝑰+η2θϕf(𝜽t,ϕt)ϕθf(𝜽t,ϕt))1(θf(𝜽t,ϕt)+ηθϕf(𝜽t,ϕt)ϕf(𝜽t,ϕt)).\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta(\bm{I}+\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))^{-1}(\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})).

Substituting the expression for (\bm{\theta}_{t+1}-\bm{\theta}_{t}) into the second equation of (61) and applying a similar argument, we get:

ϕt+1=ϕt+η(𝑰+η2ϕθf(𝜽t,ϕt)θϕf(𝜽t,ϕt))1(ϕf(𝜽t,ϕt)ηϕθf(𝜽t,ϕt)θf(𝜽t,ϕt))\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta(\bm{I}+\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t}))^{-1}(\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))

Thus the update rule in Table 1 is equivalent to the standard definition of CGD, which in turn is a first-order Taylor approximation of SPPM.
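The equivalence shown above is easy to verify numerically. In the numpy sketch below (ours), g_theta, g_phi, and Cmat stand in for the gradients and the mixed second derivative at (\bm{\theta}_{t},\bm{\phi}_{t}); the coupled update (61) is solved as one linear system and compared against the standard closed-form CGD step.

```python
import numpy as np

rng = np.random.default_rng(4)
n, eta = 5, 0.1
g_theta = rng.standard_normal(n)        # stand-in for grad_theta f
g_phi = rng.standard_normal(n)          # stand-in for grad_phi f
Cmat = rng.standard_normal((n, n))      # stand-in for the cross Hessian
I = np.eye(n)

# Table-1 form: solve the coupled system for the joint step (d_theta, d_phi).
K = np.block([[I, eta * Cmat], [-eta * Cmat.T, I]])
rhs = np.concatenate([-eta * g_theta, eta * g_phi])
d = np.linalg.solve(K, rhs)
d_theta, d_phi = d[:n], d[n:]

# Standard closed-form CGD step.
d_theta_std = -eta * np.linalg.solve(I + eta**2 * Cmat @ Cmat.T,
                                     g_theta + eta * Cmat @ g_phi)
d_phi_std = eta * np.linalg.solve(I + eta**2 * Cmat.T @ Cmat,
                                  g_phi - eta * Cmat.T @ g_theta)
assert np.allclose(d_theta, d_theta_std) and np.allclose(d_phi, d_phi_std)
```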

A.8 Experiments on Bilinear and Quadratic Games

Bilinear Game

Consider the following bilinear game:

minθmaxϕaθϕ\min_{\theta\in\mathbb{R}}\max_{\phi\in\mathbb{R}}a\theta\phi (62)

The example presented in Figure 1 uses different algorithms to solve the bilinear game above with coefficient a=10. For completeness, we also provide a grid of experimental results for different algorithms with different coefficients a and learning rates \eta, starting from the same point (\theta_{0},\phi_{0})=(-12,10). The results are presented in Figure 6.


Figure 6: A grid of experiments on the bilinear game for different algorithms with different values of coefficient aa and learning rates η\eta. The color in each cell indicates the distance to the equilibrium after 50 iterations.

The experiment demonstrates that, in a bilinear game, Lv.2 GP is equivalent to the extra-gradient method, and that higher-level Lv.k GP performs better as the coefficient a and the learning rate \eta increase, as long as the reasoning process remains a contraction (i.e., \eta<a^{-1}).
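For reference, a minimal numpy sketch (ours) of Lv.k GP on this bilinear game is given below; the step size \eta=0.09 is one choice from the stable region \eta<a^{-1}, and the starting point matches Figure 6.

```python
import numpy as np

def lvk_gp_step(theta, phi, a, eta, k):
    """One Lv.k GP step: k rounds of recursive reasoning, then commit."""
    th, ph = theta, phi                       # level-0 anticipations
    for _ in range(k):
        # Each player responds to the opponent's previous-level anticipation.
        th, ph = theta - eta * a * ph, phi + eta * a * th
    return th, ph

a, eta, k = 10.0, 0.09, 6                     # eta < 1/a keeps it a contraction
theta, phi = -12.0, 10.0                      # starting point of Figure 6
for _ in range(50):
    theta, phi = lvk_gp_step(theta, phi, a, eta, k)
print(np.hypot(theta, phi))                   # distance to the equilibrium
```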

Quadratic Game

For the quadratic game presented in Figure 3, we randomly initialize the matrices 𝑨\bm{A} and 𝑩\bm{B}:

𝑨=[1.83980.51951.25371.74701.27690.51950.65860.44760.88981.13091.25370.44761.44401.39230.88771.74700.88981.39232.12491.76641.27691.13090.88771.76642.1553]𝑩=[1.08211.24271.00931.33350.67611.24272.20311.32361.85660.93941.00931.32361.23931.36750.90651.33351.85661.36751.90810.96930.67610.93940.90650.96930.7141]\displaystyle\bm{A}=\begin{bmatrix}1.8398&0.5195&1.2537&1.7470&1.2769\\ 0.5195&0.6586&0.4476&0.8898&1.1309\\ 1.2537&0.4476&1.4440&1.3923&0.8877\\ 1.7470&0.8898&1.3923&2.1249&1.7664\\ 1.2769&1.1309&0.8877&1.7664&2.1553\end{bmatrix}\bm{B}=-\begin{bmatrix}1.0821&1.2427&1.0093&1.3335&0.6761\\ 1.2427&2.2031&1.3236&1.8566&0.9394\\ 1.0093&1.3236&1.2393&1.3675&0.9065\\ 1.3335&1.8566&1.3675&1.9081&0.9693\\ 0.6761&0.9394&0.9065&0.9693&0.7141\end{bmatrix}

where 𝑨\bm{A} is symmetric and positive definite and 𝑩\bm{B} is symmetric and negative definite. The interaction matrix is defined as:

𝑪=[c00000c00000c00000c00000c]\displaystyle\bm{C}=\begin{bmatrix}c&0&0&0&0\\ 0&c&0&0&0\\ 0&0&c&0&0\\ 0&0&0&c&0\\ 0&0&0&0&c\end{bmatrix}

where c represents the strength of the interaction between the two players. The starting points \bm{\theta}_{0} and \bm{\phi}_{0} are [0.1270,0.9667,0.2605,0.8972,0.3767]^{T} and [0.3362,0.4514,0.8403,0.1231,0.5430]^{T}, respectively.


Figure 7: A grid of experiments on the quadratic game for different algorithms with different values of the interaction strength c and learning rates \eta. The color in each cell indicates the distance to the equilibrium after 50 iterations.

A.9 Experiments on 8-Gaussians

Dataset

The target distribution is a mixture of 8-Gaussians with standard deviation equal to 0.050.05 and modes uniformly distributed around a unit circle.

Experiment

For our experiments, we used the PyTorch framework with a batch size of 128.
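For concreteness, a PyTorch sketch (ours) of the target sampler described above is given below; the function name and batching are illustrative.

```python
import torch

def sample_8gaussians(batch_size=128, std=0.05):
    """Mixture of 8 Gaussians with modes uniformly spaced on the unit circle."""
    angles = 2 * torch.pi * torch.randint(0, 8, (batch_size,)) / 8.0
    modes = torch.stack([torch.cos(angles), torch.sin(angles)], dim=1)
    return modes + std * torch.randn(batch_size, 2)

x_real = sample_8gaussians()   # one (128, 2) batch of real samples
```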

A.10 Experiments on CIFAR-10 and STL-10

For our experiments, we used the PyTorch (https://pytorch.org/) framework. For the experiments on CIFAR-10 and STL-10, we compute the FID and IS metrics using the provided implementations in TensorFlow (https://tensorflow.org/) for consistency with related works.

Lv.kk GP vs Lv.kk Adam
Figure 8: Comparison between Lv.kk GP and Lv.kk Adam on generating CIFAR-10 images. We can see significant improvements in FID when using the Lv.kk Adam algorithm we proposed.

In our experiments, we compare the performance of Lv.k GP and Lv.k Adam on the task of CIFAR-10 image generation. The experimental results are presented in Figure 8. The Lv.k GP and Lv.k Adam runs use the same initialization and hyperparameters. According to our experiments, Lv.k Adam converges much faster than Lv.k GP for the same choice of k and learning rate.

Adam vs Lv.kk Adam
Figure 9: Comparison between Adam and Lv.kk Adam on generating STL-10 images. We can see that Lv.kk Adam consistently outperforms the Adam optimizer in terms of FID score.

We also compare the performance of the Adam and Lv.k Adam optimizers on the task of STL-10 image generation. The experimental results are presented in Figure 9. Under the same choice of hyperparameters and identical model parameter initialization, Lv.k Adam consistently outperforms the Adam optimizer in terms of FID score.

Accelerated Lv.kk Adam

In this section, we propose an accelerated version of Lv.k Adam. The intuition is to update the min player \bm{\theta} and the max player \bm{\phi} in an alternating order. The corresponding Lv.k GP algorithm can be written as:

Reasoning:{𝜽t(k)=𝜽tη𝜽f(𝜽t,ϕt(k1))ϕt(k)=ϕtηϕg(𝜽t(k),ϕt)Update:{𝜽t+1=𝜽tη𝜽f(𝜽t,ϕt(k))ϕt+1=ϕtηϕg(𝜽t(k),ϕt)\mathmakebox[0.8]{\text{Reasoning:}\begin{cases}\bm{\theta}^{(k)}_{t}=\bm{\theta}_{t}-\eta\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\phi}_{t}^{(k-1)})\\ \bm{\phi}^{(k)}_{t}=\bm{\phi}_{t}-\eta\nabla_{\bm{\phi}}g(\bm{\theta}_{t}^{(k)},\bm{\phi}_{t})\end{cases}\text{Update:}\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\phi}_{t}^{(k)})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}-\eta\nabla_{\bm{\phi}}g(\bm{\theta}_{t}^{(k)},\bm{\phi}_{t})\end{cases}} (Alt-Lv.kk GP)

Instead of responding to \bm{\theta}^{(k-1)}_{t}, in Alt-Lv.k GP the max player \bm{\phi}_{t}^{(k)} acts in response to the min player's current action \bm{\theta}^{(k)}_{t}. A Lv.k min player in Alt-Lv.k GP is thus equivalent to a Lv.(2k-1) player in the original Lv.k GP, and a Lv.k max player in Alt-Lv.k GP is equivalent to a Lv.2k player. Therefore, Alt-Lv.k GP converges twice as fast as Lv.k GP; the corresponding Alt-Lv.k Adam algorithm is provided in Algorithm 2, and a code sketch follows the algorithm.

Input: Stopping time TT, reasoning steps kk, learning rate η𝜽,ηϕ\eta_{\bm{\theta}},\eta_{\bm{\phi}}, decay rates for momentum estimates β1,β2\beta_{1},\beta_{2}, initial weight (𝜽0,ϕ0)(\bm{\theta}_{0},\bm{\phi}_{0}), 𝑷𝒙\bm{P}_{{\bm{x}}} and 𝑷𝒛\bm{P}_{{\bm{z}}} real and noise-data distributions, losses G(𝜽,ϕ,𝒙,𝒛)\mathcal{L}_{G}(\bm{\theta},\bm{\phi},{\bm{x}},{\bm{z}}) and D(𝜽,ϕ,𝒙,𝒛)\mathcal{L}_{D}(\bm{\theta},\bm{\phi},{\bm{x}},{\bm{z}}), ϵ=1e8\epsilon=1e-8.
Parameters : Initial parameters: 𝜽0,ϕ0\bm{\theta}_{0},\bm{\phi}_{0}
Initialize first moments: \bm{m}_{\theta,0}\xleftarrow{}0,\bm{m}_{\phi,0}\xleftarrow{}0
Initialize second moments: \bm{v}_{\theta,0}\xleftarrow{}0,\bm{v}_{\phi,0}\xleftarrow{}0
for t=0,…,T-1 do
       Sample new mini-batch: 𝒙,𝒛𝑷𝒙,𝑷𝒛{\bm{x}},{\bm{z}}\sim\bm{P}_{{\bm{x}}},\bm{P}_{{\bm{z}}},
       𝜽t(0)𝜽t,ϕt(0)ϕt\bm{\theta}_{t}^{(0)}\xleftarrow{}\bm{\theta}_{t},\bm{\phi}_{t}^{(0)}\xleftarrow{}\bm{\phi}_{t},
       for n=1,…,k do
             Compute stochastic gradient: 𝒈𝜽,t(n)=θG(𝜽t,ϕt(n1),𝒙,𝒛)\bm{g}_{\bm{\theta},t}^{(n)}=\nabla_{\theta}\mathcal{L}_{G}(\bm{\theta}_{t},\bm{\phi}^{(n-1)}_{t},{\bm{x}},{\bm{z}});
             Update estimate of first moment: 𝒎θ,t(n)=β1𝒎θ,t1+(1β1)𝒈θ,t(n)\bm{m}_{\theta,t}^{(n)}=\beta_{1}\bm{m}_{\theta,t-1}+(1-\beta_{1})\bm{g}^{(n)}_{\theta,t};
             Update estimate of second moment: 𝒗θ,t(n)=β2𝒗θ,t1+(1β2)(𝒈θ,t(n))2\bm{v}_{\theta,t}^{(n)}=\beta_{2}\bm{v}_{\theta,t-1}+(1-\beta_{2})(\bm{g}^{(n)}_{\theta,t})^{2};
             Correct the bias for the moments: 𝒎^θ,t(n)=𝒎θ,t(n)(1β1t),𝒗^θ,t(n)=𝒗θ,t(n)(1β2t)\bm{\hat{m}}_{\theta,t}^{(n)}=\frac{\bm{m}^{(n)}_{\theta,t}}{(1-\beta_{1}^{t})},\bm{\hat{v}}_{\theta,t}^{(n)}=\frac{\bm{v}^{(n)}_{\theta,t}}{(1-\beta_{2}^{t})};
             Perform Adam update: 𝜽t(n)=𝜽tηθ𝒎^θ,t(n)𝒗^θ,t(n)+ϵ\bm{\theta}^{(n)}_{t}=\bm{\theta}_{t}-\eta_{\theta}\frac{\bm{\hat{m}}_{\theta,t}^{(n)}}{\sqrt{\bm{\hat{v}}_{\theta,t}^{(n)}}+\epsilon};
             Compute stochastic gradient: 𝒈ϕ,t(n)=ϕD(𝜽t(n),ϕt,𝒙,𝒛)\bm{g}_{\bm{\phi},t}^{(n)}=\nabla_{\phi}\mathcal{L}_{D}(\bm{\theta}^{(n)}_{t},\bm{\phi}_{t},{\bm{x}},{\bm{z}});
             Update estimate of first moment: \bm{m}_{\phi,t}^{(n)}=\beta_{1}\bm{m}_{\phi,t-1}+(1-\beta_{1})\bm{g}^{(n)}_{\phi,t};
             Update estimate of second moment: \bm{v}_{\phi,t}^{(n)}=\beta_{2}\bm{v}_{\phi,t-1}+(1-\beta_{2})(\bm{g}^{(n)}_{\phi,t})^{2};
             Correct the bias for the moments: \bm{\hat{m}}_{\phi,t}^{(n)}=\frac{\bm{m}^{(n)}_{\phi,t}}{(1-\beta_{1}^{t})},\bm{\hat{v}}_{\phi,t}^{(n)}=\frac{\bm{v}^{(n)}_{\phi,t}}{(1-\beta_{2}^{t})};
             Perform Adam update: \bm{\phi}^{(n)}_{t}=\bm{\phi}_{t}-\eta_{\phi}\frac{\bm{\hat{m}}_{\phi,t}^{(n)}}{\sqrt{\bm{\hat{v}}_{\phi,t}^{(n)}}+\epsilon};
            
      𝜽t+1𝜽t(k),ϕt+1ϕt(k)\bm{\theta}_{t+1}\xleftarrow{}\bm{\theta}_{t}^{(k)},\bm{\phi}_{t+1}\xleftarrow{}\bm{\phi}_{t}^{(k)};
       𝒎θ,t𝒎θ,t(k),𝒎ϕ,t𝒎ϕ,t(k)\bm{m}_{\theta,t}\xleftarrow{}\bm{m}_{\theta,t}^{(k)},\bm{m}_{\phi,t}\xleftarrow{}\bm{m}_{\phi,t}^{(k)};
       𝒗θ,t𝒗θ,t(k),𝒗ϕ,t𝒗ϕ,t(k)\bm{v}_{\theta,t}\xleftarrow{}\bm{v}_{\theta,t}^{(k)},\bm{v}_{\phi,t}\xleftarrow{}\bm{v}_{\phi,t}^{(k)}
      
Algorithm 2 Accelerated Level kk Adam: proposed Adam with recursive reasoning steps
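To make the control flow of Algorithm 2 concrete, here is a compact PyTorch sketch (ours, simplified to single parameter tensors rather than network weights). loss_G and loss_D are placeholder closures that evaluate the two players' losses on the current minibatch; the bias-correction index t starts at 1, and the moment estimates are carried over between outer iterations exactly as in Algorithm 2.

```python
import torch

def alt_lvk_adam_step(theta, phi, loss_G, loss_D, state, k=6,
                      eta=2e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    """One outer iteration; theta and phi are tensors with requires_grad=True."""
    m_th, v_th, m_ph, v_ph, t = state   # moments carried over from step t-1
    ph_k = phi                          # level-0 anticipation of the opponent
    for _ in range(k):
        # Min player reasons against the opponent's level-(n-1) anticipation.
        g_th = torch.autograd.grad(loss_G(theta, ph_k), theta)[0]
        m_th_k = beta1 * m_th + (1 - beta1) * g_th
        v_th_k = beta2 * v_th + (1 - beta2) * g_th ** 2
        th_k = theta - eta * (m_th_k / (1 - beta1 ** t)) / (
            (v_th_k / (1 - beta2 ** t)).sqrt() + eps)
        # Max player responds to the min player's *current* level-n action.
        g_ph = torch.autograd.grad(loss_D(th_k, phi), phi)[0]
        m_ph_k = beta1 * m_ph + (1 - beta1) * g_ph
        v_ph_k = beta2 * v_ph + (1 - beta2) * g_ph ** 2
        ph_k = phi - eta * (m_ph_k / (1 - beta1 ** t)) / (
            (v_ph_k / (1 - beta2 ** t)).sqrt() + eps)
    # Commit the level-k states and the level-k moment estimates.
    return (th_k.detach().requires_grad_(True),
            ph_k.detach().requires_grad_(True),
            (m_th_k, v_th_k, m_ph_k, v_ph_k, t + 1))
```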
Architecture

In this section, we describe the models we used to evaluate the performance of Lv.k Adam for generating the CIFAR-10 (released under the MIT license) and STL-10 datasets. With 'conv' we denote a convolutional layer and with 'transposed conv' a transposed convolution layer. The models use Batch Normalization and Spectral Normalization. The models' parameters are initialized with Xavier initialization.

Table 5: ResNet blocks used for the SN-GAN architectures on CIFAR-10 image generation, for the generator (left) and the discriminator (right).
G-ResBlock
Shortcut:
Upsample(×2\times 2)
Residual:
Batch Normalization
ReLU
Upsample(×2\times 2)
conv (ker:3×33\times 3, 256256256\to 256; stride: 11; pad: 11)
Batch Normalization
ReLU
conv (ker:3×33\times 3, 256256256\to 256; stride: 11; pad: 11)
D-ResBlock (ll-th block)
Shortcut:
[[AvgPool (ker: 2×22\times 2)]], if l=1l=1
conv (ker: 1×11\times 1, 3l=1/128l11283_{l=1}/128_{l\neq 1}\to 128; stride: 11)
Spectral Normalization
[[AvgPool (ker: 2×22\times 2, stride: 22)]], if l=1l=1
Residual:
[[ReLU]], if l1l\neq 1
conv (ker: 3×33\times 3, 3l=1/128l11283_{l=1}/128_{l\neq 1}\to 128; stride: 11; pad: 11)
Spectral Normalization
ReLU
conv (ker: 1×11\times 1, 128128128\to 128; stride: 11)
Spectral Normalization
AvgPool (ker: 2×22\times 2)
Table 6: SN-GAN architectures for experiments on CIFAR-10

Generator:
Input: \bm{z}\in\mathbb{R}^{128}\sim\mathcal{N}(0,\bm{I})
Linear(128\to 4096)
G-ResBlock
G-ResBlock
G-ResBlock
Batch Normalization
ReLU
conv (ker: 3\times 3, 256\to 3; stride: 1; pad: 1)
Tanh

Discriminator:
Input: \bm{x}\in\mathbb{R}^{3\times 32\times 32}
D-ResBlock
D-ResBlock
D-ResBlock
D-ResBlock
ReLU
AvgPool (ker: 8\times 8)
Linear(128\to 1)
Spectral Normalization
Table 7: SN-GAN architectures for experiments on STL-10

Generator:
Input: \bm{z}\in\mathbb{R}^{128}\sim\mathcal{N}(0,\bm{I})
Linear(128\to 6\times 6\times 512)
G-ResBlock up 512\to 256
G-ResBlock up 256\to 128
G-ResBlock up 128\to 64
Batch Normalization
ReLU
conv (ker: 3\times 3, 64\to 3; stride: 1; pad: 1)
Tanh

Discriminator:
Input: \bm{x}\in\mathbb{R}^{3\times 48\times 48}
D-ResBlock down 64\to 128
D-ResBlock down 3\to 128
D-ResBlock down 128\to 256
D-ResBlock down 256\to 512
D-ResBlock 512\to 1024
ReLU, AvgPool (ker: 8\times 8)
Linear(128\to 1)
Spectral Normalization
Images generated on CIFAR-10 and STL-10

In this section, we present sample images generated by the best performing trained generators on CIFAR-10 and STL-10.

Figure 10: The presented samples are generated by the best performing trained generator on CIFAR-10, using Lv.6 Adam, which achieves an FID score of 10.12.
Figure 11: The presented samples are generated by the best performing trained generator on STL-10, using Lv.6 Adam, which achieves an FID score of 25.43.