Recursive Reasoning in Minimax Games: A Level $k$ Gradient Play Method

Zichu Liu
University of Toronto
[email protected]
Lacra Pavel
University of Toronto
[email protected]
Abstract

Despite the success of generative adversarial networks (GANs) in generating visually appealing images, they are notoriously challenging to train. In order to stabilize the learning dynamics in minimax games, we propose a novel recursive reasoning algorithm: the Level $k$ Gradient Play (Lv.$k$ GP) algorithm. In contrast to many existing algorithms, our algorithm does not require sophisticated heuristics or curvature information. We show that as $k$ increases, Lv.$k$ GP converges asymptotically towards an accurate estimation of players' future strategy. Moreover, we justify that Lv.$\infty$ GP naturally generalizes a line of provably convergent game dynamics which rely on predictive updates. Furthermore, we provide its local convergence property in nonconvex-nonconcave zero-sum games and global convergence in bilinear and quadratic games. By combining Lv.$k$ GP with the Adam optimizer, our algorithm shows a clear advantage in terms of performance and computational overhead compared to other methods. Using a single Nvidia RTX3090 GPU and 30 times fewer parameters than BigGAN on CIFAR-10, we achieve an FID of 10.17 for unconditional image generation within 30 hours, allowing GAN training on common computational resources to reach state-of-the-art performance.

1 Introduction

In recent years, there has been a surge of powerful models that require simultaneous optimization of several objectives. This increasing interest in multi-objective optimization arises in various domains - such as generative adversarial networks [21, 27], adversarial attacks and robust optimization [37, 7], and multi-agent reinforcement learning [36, 55] - where several agents aim at minimizing their objectives simultaneously. Games generalize this optimization framework by introducing different objectives for different learning agents, known as players.


Figure 1: Illustration of predictive algorithms on $\min_{x}\max_{y}10xy$. Left: extra-gradient algorithm. Right: Level $k$ gradient play algorithm ($k=6$). The solution, the trajectory $\{x_{t},y_{t}\}_{t=1}^{T}$ and the anticipated future state are shown with a red star, a black line and an orange arrow, respectively. The subplot in the right figure depicts how Lv.$k$ GP predicts future states by showing its reasoning procedure with blue arrows. More steps in the reasoning process lead to better anticipation and faster convergence.

The generative adversarial network is a widely-used method of this type, which has demonstrated state-of-the-art performance in a variety of applications, including image generation [28, 52], image super-resolution [32], and image-to-image translation [10]. Despite their success at generating visually appealing images, GANs are notoriously challenging to train [40, 39]. Naive application of gradient-based algorithms to GANs often leads to poor image diversity (sometimes manifesting as "mode collapse") [40], Poincaré recurrence [39], and subtle dependency on hyperparameters [18]. An immense corpus of work is devoted to exploring and enhancing the stability of GANs, including techniques as diverse as the use of optimal transport distance [1, 22], critic gradient penalties [54], different neural network architectures [26, 6], feature matching [50], pre-trained feature space [51], and minibatch discrimination [50]. Nevertheless, architectural modifications (e.g., StyleGANs [27]) require extensive computational resources, and many theoretically appealing methods (Follow-the-ridge [56], CGD [53]) require Hessian inverse operations, which are infeasible for most GAN applications.

To stabilize the learning dynamics in GANs, many recent efforts rely on sophisticated heuristics that allow the agents to anticipate each other's next move [16, 53, 23]. This anticipation is an example of a recursive reasoning procedure in cognitive science [12]. Similar to how humans think, recursive reasoning represents the belief reasoning process where each agent considers the reasoning processes of other agents, and based on this expects to make better decisions. Importantly, it enables the use of opponents that reason about the learning agent, rather than assuming fixed opponents; the process can therefore be nested in the form 'I believe that you believe that I believe…'. Based on this intuition, we introduce a novel recursive reasoning algorithm that utilizes only gradient information to optimize GANs. Our contributions can be summarized as follows:

(i) We propose a novel algorithm: Level $k$ Gradient Play (Lv.$k$ GP), which is capable of reasoning about players' future strategies. In a game, agents at Lv.$k$ adjust their strategies in accordance with the strategies of Lv.$k-1$ agents. We justify that, while typical GAN optimizers, such as Learning with Opponent Learning Awareness (LOLA) and Symplectic Gradient Adjustment (SGA), approximate Lv.$2$ and Lv.$3$ GP, our algorithm permits higher levels of strategic reasoning. In addition, the proposed algorithm is amenable to neural network optimizers like Adam [29].

(ii) We show that, in smooth games, Lv.$k$ GP converges asymptotically towards an accurate prediction of agents' next move. Under mutual opponent shaping, two Lv.$\infty$ agents will naturally have a consistent view of one another if Lv.$k$ GP converges as $k$ increases. Based on this idea, we provide a closed-form solution for Lv.$\infty$ GP: the Semi-Proximal Point Method (SPPM).

(iii) We prove the local convergence property of Lv.$\infty$ GP in nonconvex-nonconcave zero-sum games and its global convergence in bilinear and quadratic games. The theoretical analysis we present indicates that strong interactions between competing agents can increase the convergence rate of Lv.$k$ GP agents in a zero-sum game.

(iv) By combining Lv.$k$ GP with the Adam optimizer, our algorithm shows a clear advantage in terms of performance and computational overhead compared to other methods. Using a single RTX3090 GPU with 30 times fewer parameters and 16 times smaller mini-batches than BigGAN [6] on CIFAR-10, we achieve an FID score [24] of 10.17 for unconditional image generation within 30 hours, allowing GAN training on common computing resources to reach state-of-the-art performance.

2 Related Works

In recent years, minimax problems have attracted considerable interest in machine learning in light of their connection with GANs. Gradient descent ascent (GDA), a generalization of gradient descent for minimax games, is the principal approach for training GANs in applications. GDA alternates between a gradient descent step for the min-player and a gradient ascent step for the max-player. The convergence of GDA in games is far from as well understood as that of gradient descent in single-objective problems. Despite the impressive image quality generated by GANs, GDA fails to converge even in bilinear zero-sum games. Recent research on GDA has established a unified picture of its behavior in bilinear games in continuous and discrete time [39, 46, 13, 41, 20]. First, [39] revealed that continuous-time GDA dynamics in zero-sum games result in Poincaré recurrence, where agents return arbitrarily close to their initial state infinitely many times. Second, [3, 58] examined the discrete-time GDA dynamics, showing that simultaneous updates of the two players result in divergence, while the agents' strategies remain bounded and cycle when agents take turns to update their strategies.

The majority of existing approaches to stabilizing GDA follow one of three lines of research. The essence of the first is that the discriminator is trained until convergence while the generator parameters are frozen. As long as the generator changes slowly enough, the discriminator still converges in the presence of small generator perturbations. The two-timescale update rule proposed by [21, 24, 42] aims to maintain the discriminator's optimality while updating the generator at an appropriate step size. [25, 14] proved that this two-timescale GDA with finite timescale separation converges towards the strict local minimax/Stackelberg equilibrium in differentiable games. [15, 56] explicitly find the local minimax equilibrium in games with second-order optimization algorithms.

The second line of research overcomes the failure of GDA in games with predictive updates. The extra-gradient method (EG) [30] and optimistic gradient descent (OGD) [13] use the predictability of the agents' strategy to achieve better convergence properties. Their variants are developed to improve the training performance of GANs [8, 19, 43]. [45] provided a unified analysis of EG and OGD, showing that they approximate the classical proximal point method. Competitive gradient descent [53] models the agents' next move by solving a regularized bilinear approximation of the underlying game. Learning with opponent learning awareness (LOLA) and consistent opponent learning awareness (COLA) [16, 57] introduced opponent shaping to this problem by explicitly modeling the learning strategy of other agents in the game. LOLA models opponents as naive learners rather than LOLA agents, while COLA utilizes neural networks to predict opponents' next move. Lookahead-minimax [9] stabilizes GAN training by 'looking ahead' at the sequence of future states generated by an inner optimizer. In the game theory literature, recent work has proposed Clairvoyant Multiplicative Weights Update (CMWU) for regret minimization in general games [47]. Although CMWU is proposed to solve finite normal-form games, which are different from the unconstrained continuous games that Lv.$k$ GP aims to solve, both CMWU and Lv.$k$ GP share the same motivation of enabling the learning agent to update its strategy based on the opponent's future strategy. From this aspect, Lv.$k$ GP can be viewed as a specialized variant of CMWU adapted to two-player zero-sum unconstrained continuous kernel games.

Other methods directly modify the GDA algorithm through ad-hoc changes to the game dynamics or the introduction of additional regularizers. Consensus optimization (CO) [40] and gradient penalties [22, 54] improve convergence by directly minimizing the magnitude of players' gradients. Symplectic gradient adjustment (SGA) [33, 4, 17] improves convergence by disentangling convergent potential components from rotational Hamiltonian components of the vector field.

3 Preliminaries

3.1 Notation

In this paper, vectors are lower-case bold letters (e.g., $\bm{\theta}$) and matrices are upper-case bold letters (e.g., $\bm{A}$). For a function $f:\mathbb{R}^{d}\to\mathbb{R}$, we denote its gradient by $\nabla f$. For functions $f({\bm{x}},{\bm{y}}):\mathbb{R}^{d_{1}}\times\mathbb{R}^{d_{2}}\to\mathbb{R}$ of two vector arguments we use $\nabla_{x}f,\nabla_{y}f$ to denote its partial gradients and $\nabla_{xx}f,\nabla_{yy}f,\nabla_{xy}f$ to denote its Hessian blocks. A stationary point of $f$ denotes a point where $\nabla_{x}f=\nabla_{y}f=\bm{0}$. We use $\lVert\bm{v}\rVert$ to denote the Euclidean norm of a vector $\bm{v}$. We refer to the largest and smallest eigenvalues of a matrix $\bm{A}$ by $\lambda_{\max}(\bm{A})$ and $\lambda_{\min}(\bm{A})$, respectively. Moreover, we denote the spectral radius of a matrix $\bm{A}$ by $\rho(\bm{A})=\max\{\lvert\lambda_{1}\rvert,\dots,\lvert\lambda_{n}\rvert\}$, i.e., the eigenvalue with the largest absolute value.

3.2 Problem Definition

In order to justify the effectiveness of the recursive reasoning procedure, in this paper we consider the problem of training Generative Adversarial Networks (GANs) [21]. The GAN training strategy defines a two-player game between a generative neural network $G_{\bm{\theta}}(\cdot)$ and a discriminative neural network $D_{\bm{\phi}}(\cdot)$. The generator takes as input random noise ${\bm{z}}$ sampled from a known distribution $\mathbf{P}_{{\bm{z}}}$, e.g., ${\bm{z}}\sim\mathbf{P}_{{\bm{z}}}$, and outputs a sample $G_{\bm{\theta}}({\bm{z}})$. The discriminator takes as input a sample ${\bm{x}}$ (either sampled from the true distribution $\mathbf{P}_{{\bm{x}}}$ or from the generator) and attempts to classify it as real or fake. The goal of the generator is to fool the discriminator. The optimization of GANs is formulated as a two-player differentiable game where the generator $G_{\bm{\theta}}$, with parameters $\bm{\theta}$, and the discriminator $D_{\bm{\phi}}$, with parameters $\bm{\phi}$, aim at minimizing their own cost functions $f(\bm{\theta},\bm{\phi})$ and $g(\bm{\theta},\bm{\phi})$ respectively, as follows:

\min_{\bm{\theta}\in\mathbb{R}^{m}}f(\bm{\theta},\bm{\phi})\ \text{ and }\ \min_{\bm{\phi}\in\mathbb{R}^{n}}g(\bm{\theta},\bm{\phi}), \qquad (1)

where $f,g:\mathbb{R}^{m}\times\mathbb{R}^{n}\to\mathbb{R}$. When $f=-g$, the corresponding optimization problem is called a two-player zero-sum game and it becomes a minimax problem:

\min_{\bm{\theta}\in\mathbb{R}^{m}}\max_{\bm{\phi}\in\mathbb{R}^{n}}f(\bm{\theta},\bm{\phi}). \qquad \text{(Minimax)}
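For concreteness, the original GAN objective of [21] (a standard fact, stated here for illustration) is an instance of (Minimax) with

f(\bm{\theta},\bm{\phi})=\mathbb{E}_{{\bm{x}}\sim\mathbf{P}_{{\bm{x}}}}\left[\log D_{\bm{\phi}}({\bm{x}})\right]+\mathbb{E}_{{\bm{z}}\sim\mathbf{P}_{{\bm{z}}}}\left[\log\left(1-D_{\bm{\phi}}(G_{\bm{\theta}}({\bm{z}}))\right)\right],

where the generator minimizes $f$ over $\bm{\theta}$ and the discriminator maximizes it over $\bm{\phi}$.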

In this work, we assume the cost functions have Lipschitz continuous gradients with respect to all model parameters $(\bm{\theta},\bm{\phi})$:

Assumption 3.1.

The gradient $\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\phi})$ is $L_{\bm{\theta}\bm{\theta}}$-Lipschitz with respect to $\bm{\theta}$ and $L_{\bm{\theta}\bm{\phi}}$-Lipschitz with respect to $\bm{\phi}$, and the gradient $\nabla_{\bm{\phi}}g(\bm{\theta},\bm{\phi})$ is $L_{\bm{\phi}\bm{\phi}}$-Lipschitz with respect to $\bm{\phi}$ and $L_{\bm{\phi}\bm{\theta}}$-Lipschitz with respect to $\bm{\theta}$, i.e.,

\lVert\nabla_{\bm{\theta}}f(\bm{\theta}_{1},\bm{\phi})-\nabla_{\bm{\theta}}f(\bm{\theta}_{2},\bm{\phi})\rVert \leq L_{\bm{\theta}\bm{\theta}}\lVert\bm{\theta}_{1}-\bm{\theta}_{2}\rVert \text{ for all } \bm{\phi},
\lVert\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\phi}_{1})-\nabla_{\bm{\theta}}f(\bm{\theta},\bm{\phi}_{2})\rVert \leq L_{\bm{\theta}\bm{\phi}}\lVert\bm{\phi}_{1}-\bm{\phi}_{2}\rVert \text{ for all } \bm{\theta},
\lVert\nabla_{\bm{\phi}}g(\bm{\theta}_{1},\bm{\phi})-\nabla_{\bm{\phi}}g(\bm{\theta}_{2},\bm{\phi})\rVert \leq L_{\bm{\phi}\bm{\theta}}\lVert\bm{\theta}_{1}-\bm{\theta}_{2}\rVert \text{ for all } \bm{\phi},
\lVert\nabla_{\bm{\phi}}g(\bm{\theta},\bm{\phi}_{1})-\nabla_{\bm{\phi}}g(\bm{\theta},\bm{\phi}_{2})\rVert \leq L_{\bm{\phi}\bm{\phi}}\lVert\bm{\phi}_{1}-\bm{\phi}_{2}\rVert \text{ for all } \bm{\theta}.

We define $L:=\max\{L_{\bm{\theta}\bm{\theta}},L_{\bm{\theta}\bm{\phi}},L_{\bm{\phi}\bm{\theta}},L_{\bm{\phi}\bm{\phi}}\}$.

4 Level $k$ Gradient Play

In this section, we propose a novel recursive reasoning algorithm, Level $k$ Gradient Play (Lv.$k$ GP), that allows the agents to discover self-interested strategies while taking into account other agents' reasoning processes. In Lv.$k$ GP, $k$ steps of recursive reasoning are applied to obtain an anticipated future state $(\bm{\theta}^{(k)}_{t},\bm{\phi}^{(k)}_{t})$, and the current state $(\bm{\theta}_{t},\bm{\phi}_{t})$ is then updated as follows:

\text{Reasoning:}\begin{cases}\bm{\theta}^{(n)}_{t}=\bm{\theta}_{t}-\eta\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\phi}_{t}^{(n-1)})\\ \bm{\phi}^{(n)}_{t}=\bm{\phi}_{t}-\eta\nabla_{\bm{\phi}}g(\bm{\theta}_{t}^{(n-1)},\bm{\phi}_{t})\end{cases}\qquad\text{Update:}\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}^{(k)}\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}^{(k)}\end{cases} \qquad \text{(Lv.$k$ GP)}

We define the current state $(\bm{\theta}_{t},\bm{\phi}_{t})$ to be the starting point $(\bm{\theta}_{t}^{(0)},\bm{\phi}_{t}^{(0)})$ of the reasoning process. Learning agents that adopt the Lv.$k$ GP strategy are called Lv.$k$ agents. Lv.$1$ agents act naively in response to the current state using GDA dynamics, and Lv.$2$ agents act in response to Lv.$1$ agents by assuming their opponents are naive learners. Therefore, Lv.$k$ GP allows for higher levels of strategic reasoning. The inspiration comes from how humans collaborate: humans are good at anticipating how their actions will affect others, so they frequently figure out how to collaborate with other people to reach a "win-win" solution. The key to human collaboration is the ability to understand how other humans think, which helps them develop strategies that benefit their collaborators. One of our main theoretical results is the following theorem, which demonstrates that agents adopting Lv.$k$ GP can precisely predict other players' next move and reach a consensus on their future strategies:
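To make the reasoning loop concrete, the following minimal NumPy sketch (our illustration, not the authors' released code) runs Lv.$k$ GP on the toy objective $f(x,y)=10xy$ from Figure 1, with $g=-f$ so that the second player ascends:

```python
import numpy as np

def lvk_gp_step(x, y, grad_x, grad_y, eta=0.04, k=6):
    """One Lv.k GP update: k reasoning steps, each reacting to the
    opponent's level-(n-1) strategy, always starting from (x, y)."""
    x_prev, y_prev = x, y                     # Lv.0 strategies: the current state
    for _ in range(k):
        x_next = x - eta * grad_x(x, y_prev)  # min player descends
        y_next = y + eta * grad_y(x_prev, y)  # max player ascends (g = -f)
        x_prev, y_prev = x_next, y_next
    return x_prev, y_prev                     # the Lv.k anticipated state

# Toy minimax game f(x, y) = 10 * x * y (Figure 1).
grad_x = lambda x, y: 10.0 * y                # partial derivative of f in x
grad_y = lambda x, y: 10.0 * x                # partial derivative of f in y

x, y = 1.0, 1.0
for t in range(200):
    x, y = lvk_gp_step(x, y, grad_x, grad_y)
print(x, y)  # both coordinates approach the equilibrium (0, 0)
```

Here $\eta=0.04$ satisfies $\eta<(2L)^{-1}$ with $L=10$, the regime in which Theorem 4.1 below guarantees that the reasoning steps converge.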

Theorem 4.1.

Suppose Assumption 3.1 holds. Let us define $\bm{\omega}_{t}=[\bm{\theta}_{t},\bm{\phi}_{t}]^{T}\in\mathbb{R}^{m+n}$, $\bm{\omega}_{t}^{(k)}=[\bm{\theta}_{t}^{(k)},\bm{\phi}_{t}^{(k)}]^{T}\in\mathbb{R}^{m+n}$ and $\Delta_{\max}=2\max(\lVert\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\phi}_{t})\rVert,\lVert\nabla_{\bm{\phi}}g(\bm{\theta}_{t},\bm{\phi}_{t})\rVert)$. Assume $\bm{\omega}_{t}^{(k)}$ lies in a complete subspace of $\mathbb{R}^{m+n}$. Then for Lv.$k$ GP we have:

\lVert\bm{\omega}_{t}^{(k)}-\bm{\omega}_{t}^{(k-1)}\rVert\leq\eta\,(\eta L)^{k-1}\Delta_{\max}, \qquad (2)

Suppose the learning rate satisfies $\eta<(2L)^{-1}$; then the sequence $\{\bm{\omega}_{t}^{(k)}\}_{k=0}^{\infty}$ is a Cauchy sequence. That is, given $\epsilon>0$, there exists $N$ such that, if $a>b>N$, then:

\lVert\bm{\omega}_{t}^{(a)}-\bm{\omega}_{t}^{(b)}\rVert<\mathcal{O}(\eta^{b})<\epsilon \qquad (3)

Moreover, the sequence $\{\bm{\omega}_{t}^{(k)}\}_{k=0}^{\infty}$ converges to a limit $\bm{\omega}_{t}^{*}$: $\lim_{k\to\infty}\bm{\omega}_{t}^{(k)}=\bm{\omega}_{t}^{*}$.

In accordance with Theorem 4.1, if we define $\bm{\omega}_{t+1}=\bm{\omega}_{t}^{*}$, then Lv.$\infty$ GP is equivalent to the following implicit algorithm, which we call the Semi-Proximal Point Method:

\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\phi}_{t+1})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}-\eta\nabla_{\bm{\phi}}g(\bm{\theta}_{t+1},\bm{\phi}_{t})\end{cases} \qquad \text{(SPPM)}
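As a sanity check (a worked example of ours, not taken from the original text), consider the toy game $f(x,y)=10xy$ with $g=-f$. SPPM reads $x_{t+1}=x_{t}-10\eta y_{t+1}$ and $y_{t+1}=y_{t}+10\eta x_{t+1}$, which can be solved in closed form:

x_{t+1}=\frac{x_{t}-10\eta\,y_{t}}{1+100\eta^{2}},\qquad y_{t+1}=\frac{y_{t}+10\eta\,x_{t}}{1+100\eta^{2}},

so that $x_{t+1}^{2}+y_{t+1}^{2}=(x_{t}^{2}+y_{t}^{2})/(1+100\eta^{2})$: the implicit update contracts towards the origin for every $\eta>0$, consistent with Theorem 5.2 below.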

4.1 Algorithms as an Approximation of SPPM

SPPM players arrive at a consensus by knowing precisely what their opponents' future strategies will be. Existing algorithms are not able to offer this kind of agreement. For instance, consensus optimization [40] forces the learning agents to cooperate regardless of their own benefits. Agents employing the extra-gradient method [30], SGA [4], and LOLA [16] consider their opponents to be naive learners, ignoring their strategic reasoning ability. CGD [53] takes into account the reasoning process of learning agents; however, it leads to an inaccurate prediction of agents in games that have cost functions with non-zero higher-order derivatives ($n\geq 3$) [57]. In this section, we consider a subset of provably convergent variants of GDA in the (Minimax) setting, showing that, for specific choices of hyperparameters, the mentioned algorithms either approximate SPPM or approximate the approximations of SPPM:

Table 1: The update rules for the first player of SGA, LOLA, Lv.$2$ GP, LEAD, CGD, Lv.$3$ GP and SPPM in a (Minimax) problem, and their precision as approximations of SPPM. The usage of zero or negative momentum has been suggested in recent works [20, 18]. For the sake of comparison, we assume no momentum factor in LEAD's update, which corresponds to $\beta=0$ in Equation 10 of [23].
Algorithm | Update Rule | Precision
SGA [4] | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\gamma\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})$ | ——
LOLA [16] | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\delta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})$ | ——
Lv.2 GP | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t}))$ | $\mathcal{O}(\eta^{2})$
LEAD [23] | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\alpha\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\phi}_{t}-\bm{\phi}_{t-1})$ | ——
CGD [53] | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\phi}_{t+1}-\bm{\phi}_{t})$ | $\mathcal{O}(\eta^{3})$
Lv.3 GP | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}),\bm{\phi}_{t}))$ | $\mathcal{O}(\eta^{3})$
SPPM (Lv.$\infty$ GP) | $\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t+1})$ | 0

In Table 1, we compare the orders of precision of different algorithms as approximations of SPPM in (Minimax) games with infinitely differentiable objective functions. In accordance with Equation (3) of Theorem 4.1, Lv.$k$ GP is an $\mathcal{O}(\eta^{k})$ approximation of SPPM. In order to analyze how well existing algorithms approximate SPPM, we consider the first-order Taylor approximation of SPPM:

\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t})\\ \underbrace{\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})}_{\text{$1^{st}$ order approximation of Lv.2 GP}}-\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\phi}_{t+1}-\bm{\phi}_{t})\end{cases}

The under-braced terms correspond to the first-order Taylor approximation of Lv.$2$ GP. For an appropriate choice of hyperparameters, SGA ($\gamma=\eta$) and LOLA ($\delta=\eta$) are identical to the first-order Taylor approximation of Lv.$2$ GP, where each agent models its opponent as a naive learner. Hence, we list them above Lv.$2$ GP, which approximates SPPM up to $\mathcal{O}(\eta^{2})$. CGD exactly recovers the first-order Taylor approximation of SPPM. (The CGD update for the max player $\bm{\phi}$ is $\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t})$. If we substitute $(\bm{\phi}_{t+1}-\bm{\phi}_{t})$ into $\bm{\theta}$'s update, and substitute $(\bm{\theta}_{t+1}-\bm{\theta}_{t})$ into $\bm{\phi}$'s update, respectively, we arrive at the first-order approximation of SPPM; see A.5 for derivation details.) In games with cost functions that have non-zero higher-order derivatives ($n\geq 3$), the remaining term in SPPM's first-order approximation is an error of magnitude $\mathcal{O}(\eta^{3})$, which means that CGD's accuracy is in the same range as that of Lv.$3$ GP. In bilinear and quadratic games, where the objective function is at most twice differentiable, CGD is equivalent to SPPM. The distinction between LEAD ($\alpha=\eta$) and CGD can be understood by considering their update rules. LEAD is an explicit method where opponents' potential next strategies are anticipated based on their most recent move $(\bm{\phi}_{t}-\bm{\phi}_{t-1})$. On the contrary, CGD accounts for this anticipation in an implicit manner, $(\bm{\phi}_{t+1}-\bm{\phi}_{t})$, where the future states appear in the current states' update rules. Therefore, the computation of CGD updates requires solving an equation involving additional Hessian inverse operations. A numerical justification is also provided in Table 2, showing that the approximation accuracy of Lv.$k$ GP improves as $k$ increases.
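The precision orders above can also be checked numerically. The short sketch below (our illustration; function names are ours) compares one step of Lv.$k$ GP against the closed-form SPPM step for the bilinear toy game $f(x,y)=10xy$ and shows the gap shrinking geometrically with $k$, mirroring Equation (3):

```python
import numpy as np

A = 10.0  # interaction strength in f(x, y) = A*x*y

def sppm_step(x, y, eta):
    """Closed-form SPPM update for f(x, y) = A*x*y (see Theorem 5.2)."""
    d = 1.0 + (A * eta) ** 2
    return (x - A * eta * y) / d, (y + A * eta * x) / d

def lvk_step(x, y, eta, k):
    """Lv.k GP update for f(x, y) = A*x*y with g = -f."""
    xp, yp = x, y
    for _ in range(k):
        xp, yp = x - eta * A * yp, y + eta * A * xp  # simultaneous reasoning step
    return xp, yp

x0, y0, eta = 1.0, 1.0, 0.02
xs, ys = sppm_step(x0, y0, eta)
for k in range(1, 7):
    xk, yk = lvk_step(x0, y0, eta, k)
    print(f"k={k}: |Lv.{k} GP - SPPM| = {np.hypot(xk - xs, yk - ys):.2e}")
# the printed gap decays roughly like (eta * A)^k = 0.2^k
```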

5 Convergence Property

In Theorem 4.1, we analytically proved that Lv.$k$ GP converges asymptotically towards SPPM; we now use this result to study the convergence property of Lv.$k$ GP in games based on our analysis of SPPM. The local convergence of SPPM in a nonconvex-nonconcave game can be analyzed via the spectral radius of the game Jacobian around a stationary point:

Theorem 5.1.

Consider the (Minimax) problem under Assumption 3.1 and Lv.$k$ GP. Let $(\bm{\theta}^{*},\bm{\phi}^{*})$ be a stationary point. Suppose $\bm{\theta}_{t}-\bm{\theta}^{*}$ is not in the kernel of $\nabla_{\phi\theta}f(\bm{\theta}^{*},\bm{\phi}^{*})$, $\bm{\phi}_{t}-\bm{\phi}^{*}$ is not in the kernel of $\nabla_{\theta\phi}f(\bm{\theta}^{*},\bm{\phi}^{*})$, and $\eta<L^{-1}$. There exists a neighborhood $\mathcal{U}$ of $(\bm{\theta}^{*},\bm{\phi}^{*})$ such that if SPPM is started at $(\bm{\theta}_{0},\bm{\phi}_{0})\in\mathcal{U}$, the iterates $\{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0}$ generated by SPPM satisfy:

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}\leq\frac{\rho^{2}(\bm{I}-\eta\nabla_{\theta\theta}f^{*})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\rho^{2}(\bm{I}+\eta\nabla_{\phi\phi}f^{*})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*})}

where $f^{*}=f(\bm{\theta}^{*},\bm{\phi}^{*})$. Moreover, for any $\eta$ satisfying:

\frac{\max(\rho^{2}(\bm{I}-\eta\nabla_{\theta\theta}f^{*}),\rho^{2}(\bm{I}+\eta\nabla_{\phi\phi}f^{*}))}{1+\eta^{2}\lambda_{\min}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*})}<1, \qquad (4)

SPPM converges asymptotically to $(\bm{\theta}^{*},\bm{\phi}^{*})$.

Remark 5.1.

Under the same conditions as in Theorem 5.1, the iterates generated by Lv.$2k$ GP satisfy:

\lVert\bm{\theta}_{t}^{(2k)}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}^{(2k)}-\bm{\phi}^{*}\rVert^{2}\leq a\left(\frac{\rho^{2}(\bm{I}-\eta\nabla_{\theta\theta}f^{*})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\rho^{2}(\bm{I}+\eta\nabla_{\phi\phi}f^{*})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*})}\right)+b\left(\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}\right)

where

a=\frac{(1+(\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k})^{2}}{1-(\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k}}\ \text{for odd } k,\quad\text{or}\quad a=\frac{(1-(\eta^{2}\lambda_{\min}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k})^{2}}{1-(\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k}}\ \text{for even } k,
b=\frac{(\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k}\,(1-(\eta^{2}\lambda_{\min}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k})}{1-(\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}))^{k}}.
Figure 2: Top: max step size against $k$. Right: distance to equilibrium against $k$. In both figures the red dashed line represents the value for SPPM.

For simplicity, we assume that the difference between the trajectory $\{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0}$ and the stationary point $(\bm{\theta}^{*},\bm{\phi}^{*})$ is not in the kernel of $\nabla_{\phi\theta}f(\bm{\theta}^{*},\bm{\phi}^{*})$ and $\nabla_{\theta\phi}f(\bm{\theta}^{*},\bm{\phi}^{*})$. Detailed proofs without this assumption are provided in Appendix A.4. Following the same setting as Theorem 4.1, we have $\eta<L^{-1}$, which ensures that $\eta^{2}\lambda_{\max}(\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*})<1$. Therefore, as $k\to\infty$, $a\to 1$ and $b\to 0$ in Remark 5.1, and as such Lv.$k$ GP has similar local convergence properties to SPPM. Figure 2 illustrates this property in a quadratic game. Lv.$k$ GP may behave differently from SPPM at lower values of $k$, but as $k$ increases, both the max step size $\eta_{\max,k}$ and the distance to the equilibrium after a fixed number of iterations under $\eta_{\max,\infty}$ converge to those of SPPM. Thus, we observe that Lv.$k$ GP is empirically similar to SPPM at higher values of $k$.
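For intuition, the factors $a$ and $b$ in Remark 5.1 are easy to tabulate. The sketch below (our own; the spectrum of $\nabla_{\theta\phi}f^{*}\nabla_{\phi\theta}f^{*}$ is chosen by hand) shows $a\to 1$ and $b\to 0$ as $k$ grows, so Lv.$2k$ GP inherits SPPM's local rate:

```python
lam_min, lam_max, eta = 1.0, 4.0, 0.4   # assumed spectrum of the interaction term
assert eta ** 2 * lam_max < 1           # the regime eta < 1/L of Remark 5.1

for k in range(1, 11):
    p_max = (eta ** 2 * lam_max) ** k
    p_min = (eta ** 2 * lam_min) ** k
    if k % 2 == 1:                       # odd k
        a = (1 + p_max) ** 2 / (1 - p_max)
    else:                                # even k
        a = (1 - p_min) ** 2 / (1 - p_max)
    b = p_max * (1 - p_min) / (1 - p_max)
    print(f"k={k:2d}: a={a:.4f}, b={b:.6f}")   # a -> 1, b -> 0
```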

We study the global convergence property of SPPM by analyzing its behavior in bilinear and quadratic games. Consider the following bilinear game:

\min_{\bm{\theta}\in\mathbb{R}^{n}}\max_{\bm{\phi}\in\mathbb{R}^{n}}\bm{\theta}^{T}\bm{M}\bm{\phi} \qquad \text{(Bilinear game)}

where $\bm{M}$ is a full-rank matrix. The following theorem summarizes SPPM's convergence property:

Theorem 5.2.

Consider the Bilinear game and the SPPM method. Further, we define $r_{t}=\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}$. Then, for any $\eta>0$, the iterates $\{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0}$ generated by SPPM satisfy

r_{t+1}\leq\frac{1}{1+\eta^{2}\lambda_{\min}(\bm{M}^{T}\bm{M})}\,r_{t}. \qquad (5)
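Since SPPM is implicit, each update in the bilinear game amounts to solving a small linear system: substituting one equation of (SPPM) into the other gives $(\bm{I}+\eta^{2}\bm{M}\bm{M}^{T})\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\bm{M}\bm{\phi}_{t}$ and $(\bm{I}+\eta^{2}\bm{M}^{T}\bm{M})\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\bm{M}^{T}\bm{\theta}_{t}$. The sketch below (our illustration) implements this and checks the contraction rate (5) along the iterates:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eta = 5, 0.5
M = rng.standard_normal((n, n))            # full rank with probability one
theta, phi = rng.standard_normal(n), rng.standard_normal(n)

S_theta = np.eye(n) + eta ** 2 * M @ M.T   # system matrix for the theta solve
S_phi = np.eye(n) + eta ** 2 * M.T @ M     # system matrix for the phi solve
rate = 1.0 / (1.0 + eta ** 2 * np.linalg.eigvalsh(M.T @ M).min())

for t in range(50):
    r_prev = theta @ theta + phi @ phi
    theta, phi = (np.linalg.solve(S_theta, theta - eta * M @ phi),
                  np.linalg.solve(S_phi, phi + eta * M.T @ theta))
    r = theta @ theta + phi @ phi
    assert r <= rate * r_prev + 1e-9       # contraction rate (5), equilibrium at 0
print(r)                                    # decays towards zero
```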

It is worth noting that the convergence property of Lv.$k$ GP in bilinear games has been studied in [2]. Furthermore, we study the convergence property of SPPM in the Quadratic game:

\min_{\bm{\theta}\in\mathbb{R}^{n}}\max_{\bm{\phi}\in\mathbb{R}^{n}}\bm{\theta}^{T}\bm{A}\bm{\theta}+\bm{\phi}^{T}\bm{B}\bm{\phi}+\bm{\theta}^{T}\bm{C}\bm{\phi} \qquad \text{(Quadratic game)}

where $\bm{A}\in\mathbb{R}^{n\times n}$ is symmetric and positive definite, $\bm{B}\in\mathbb{R}^{n\times n}$ is symmetric and negative definite, and the interaction term $\bm{C}\in\mathbb{R}^{n\times n}$ is full rank. SPPM in quadratic games converges at the following rate:

Theorem 5.3.

Consider the Quadratic game and SPPM. Then, for any $\eta>0$, the iterates $\{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0}$ generated by SPPM satisfy

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}\leq\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\rho^{2}(\bm{I}+\eta\bm{B})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{C}^{T}\bm{C})} \qquad (6)


Figure 3: A grid of experiments for different algorithms with different values of the interaction $c$ and learning rate $\eta$. The color in each cell indicates the distance to the equilibrium after 100 iterations. Note that the CGD update is equivalent to SPPM in quadratic games.

Theorems 5.1, 5.2 and 5.3 indicate that a stronger interaction term $\nabla_{\theta\phi}f(\bm{\theta}^{*},\bm{\phi}^{*})$ improves the convergence rate towards stationary points. By contrast, in existing modifications of GDA, the step size is chosen inversely proportional to the interaction term $\nabla_{\theta\phi}f(\bm{\theta}^{*},\bm{\phi}^{*})$ [40, 13, 34]. In Figure 3, we showcase the effect of interaction on different algorithms in the Quadratic game setup with dimension $n=5$, where the interaction matrix is defined as $\bm{C}=c\bm{I}$. Stronger interaction corresponds to higher values of $c$. A key difference between SPPM and Lv.$k$ GP in the experiments is that SPPM converges with any step size, and so arbitrarily fast, while this is not the case for Lv.$k$ GP. In order to approximate SPPM, one needs Lv.$k$ GP to be a contraction, and so $\eta<L^{-1}$. This implies an additional bound on step sizes for Lv.$k$ GP. As long as this constraint on $\eta$ is satisfied, stronger interaction only improves convergence for higher-order Lv.$k$ GP, while all other algorithms quickly diverge as $\eta$ and $c$ increase.
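A quick way to see the role of interaction in Theorem 5.3 is to evaluate the contraction factor of the bound (6) as $c$ grows. The sketch below (our illustration; the particular instance $\bm{A}=\bm{I}$, $\bm{B}=-\bm{I}$ is our choice) prints a factor that shrinks monotonically with $c$:

```python
import numpy as np

n, eta = 5, 0.1
A, B = np.eye(n), -np.eye(n)              # positive/negative definite quadratic terms
for c in [0.5, 1.0, 2.0, 5.0, 10.0]:
    C = c * np.eye(n)                     # interaction matrix from the Figure 3 setup
    num = max(np.max(np.abs(np.linalg.eigvals(np.eye(n) - eta * A))) ** 2,
              np.max(np.abs(np.linalg.eigvals(np.eye(n) + eta * B))) ** 2)
    den = 1.0 + eta ** 2 * np.linalg.eigvalsh(C.T @ C).min()
    print(f"c={c:5.1f}: per-step contraction factor <= {num / den:.4f}")
```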

6 Experimental Results

In this section, we discuss our implementation of the Lv.$k$ GP algorithm for training GANs. Its performance is evaluated on the 8-Gaussians task and on two representative datasets, CIFAR-10 and STL-10.

6.1 Level $k$ Adam

We propose to combine Lv.$k$ GP with the Adam optimizer [29]. Preliminary experiments find that Lv.$k$ Adam converges much faster than Lv.$k$ GP; see A.8 for experimental results. Detailed pseudo-code for Level $k$ Adam (Lv.$k$ Adam) on GAN training with loss functions $\mathcal{L}_{G}$ and $\mathcal{L}_{D}$ is given in Algorithm 1. For the Adam optimizer, there are several possible choices of how to update the moments, and this choice can lead to different algorithms in practice. Unlike [19], where the moments are updated on the fly, in Algorithm 1 we keep the moments fixed during the reasoning steps and update them together with the model parameters. In Table 2, our experimental results suggest that the proposed Lv.$k$ Adam algorithm converges asymptotically as the number of reasoning steps $k$ increases.

Input: Stopping time $T$, reasoning steps $k$, learning rates $\eta_{\bm{\theta}},\eta_{\bm{\phi}}$, decay rates for momentum estimates $\beta_{1},\beta_{2}$, initial weights $(\bm{\theta}_{0},\bm{\phi}_{0})$, real and noise-data distributions $\bm{P}_{{\bm{x}}}$ and $\bm{P}_{{\bm{z}}}$, losses $\mathcal{L}_{G}(\bm{\theta},\bm{\phi},{\bm{x}},{\bm{z}})$ and $\mathcal{L}_{D}(\bm{\theta},\bm{\phi},{\bm{x}},{\bm{z}})$, $\epsilon=10^{-8}$.
Parameters: Initial parameters $\bm{\theta}_{0},\bm{\phi}_{0}$
Initialize first moments: $\bm{m}_{\theta,0}\leftarrow 0,\ \bm{m}_{\phi,0}\leftarrow 0$
Initialize second moments: $\bm{v}_{\theta,0}\leftarrow 0,\ \bm{v}_{\phi,0}\leftarrow 0$
for t = 0, …, T-1 do
       Sample new mini-batch: ${\bm{x}},{\bm{z}}\sim\bm{P}_{{\bm{x}}},\bm{P}_{{\bm{z}}}$
       $\bm{\theta}_{t}^{(0)}\leftarrow\bm{\theta}_{t},\ \bm{\phi}_{t}^{(0)}\leftarrow\bm{\phi}_{t}$
       for n = 1, …, k do
             Compute stochastic gradients: $\bm{g}_{\bm{\theta},t}^{(n)}=\nabla_{\theta}\mathcal{L}_{G}(\bm{\theta}_{t},\bm{\phi}^{(n-1)}_{t},{\bm{x}},{\bm{z}});\ \bm{g}_{\bm{\phi},t}^{(n)}=\nabla_{\phi}\mathcal{L}_{D}(\bm{\theta}^{(n-1)}_{t},\bm{\phi}_{t},{\bm{x}},{\bm{z}})$
             Update estimates of the first moment: $\bm{m}_{\theta,t}^{(n)}=\beta_{1}\bm{m}_{\theta,t-1}+(1-\beta_{1})\bm{g}^{(n)}_{\theta,t};\ \bm{m}_{\phi,t}^{(n)}=\beta_{1}\bm{m}_{\phi,t-1}+(1-\beta_{1})\bm{g}^{(n)}_{\phi,t}$
             Update estimates of the second moment: $\bm{v}_{\theta,t}^{(n)}=\beta_{2}\bm{v}_{\theta,t-1}+(1-\beta_{2})(\bm{g}^{(n)}_{\theta,t})^{2};\ \bm{v}_{\phi,t}^{(n)}=\beta_{2}\bm{v}_{\phi,t-1}+(1-\beta_{2})(\bm{g}^{(n)}_{\phi,t})^{2}$
             Correct the bias of the moments: $\bm{\hat{m}}_{\theta,t}^{(n)}=\frac{\bm{m}^{(n)}_{\theta,t}}{1-\beta_{1}^{t}},\ \bm{\hat{m}}_{\phi,t}^{(n)}=\frac{\bm{m}^{(n)}_{\phi,t}}{1-\beta_{1}^{t}};\ \bm{\hat{v}}_{\theta,t}^{(n)}=\frac{\bm{v}^{(n)}_{\theta,t}}{1-\beta_{2}^{t}},\ \bm{\hat{v}}_{\phi,t}^{(n)}=\frac{\bm{v}^{(n)}_{\phi,t}}{1-\beta_{2}^{t}}$
             Perform Adam update: $\bm{\theta}^{(n)}_{t}=\bm{\theta}_{t}-\eta_{\theta}\frac{\bm{\hat{m}}_{\theta,t}^{(n)}}{\sqrt{\bm{\hat{v}}_{\theta,t}^{(n)}}+\epsilon};\ \bm{\phi}^{(n)}_{t}=\bm{\phi}_{t}-\eta_{\phi}\frac{\bm{\hat{m}}_{\phi,t}^{(n)}}{\sqrt{\bm{\hat{v}}_{\phi,t}^{(n)}}+\epsilon}$
       end for
       $\bm{\theta}_{t+1}\leftarrow\bm{\theta}_{t}^{(k)},\ \bm{\phi}_{t+1}\leftarrow\bm{\phi}_{t}^{(k)}$
       $\bm{m}_{\theta,t}\leftarrow\bm{m}_{\theta,t}^{(k)},\ \bm{m}_{\phi,t}\leftarrow\bm{m}_{\phi,t}^{(k)}$
       $\bm{v}_{\theta,t}\leftarrow\bm{v}_{\theta,t}^{(k)},\ \bm{v}_{\phi,t}\leftarrow\bm{v}_{\phi,t}^{(k)}$
end for
Algorithm 1: Level $k$ Adam: proposed Adam with recursive reasoning steps
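To connect Algorithm 1 to Section 4, the following self-contained NumPy sketch (our simplification: deterministic gradients of the toy zero-sum game $f(\theta,\phi)=10\theta\phi$ in place of mini-batch GAN losses; all names are ours) implements the same moment handling, with moments frozen during reasoning and committed together with the parameters:

```python
import numpy as np

def grad_G(theta, phi):    # stand-in for grad_theta L_G; here f(theta, phi) = 10*theta*phi
    return 10.0 * phi

def grad_D(theta, phi):    # stand-in for grad_phi L_D = -f in the zero-sum case
    return -10.0 * theta

def lvk_adam(theta, phi, T=500, k=6, eta=1e-2, b1=0.0, b2=0.9, eps=1e-8):
    m_th = m_ph = v_th = v_ph = 0.0        # moments, kept fixed while reasoning
    for t in range(1, T + 1):
        th_n, ph_n = theta, phi            # Lv.0 strategies: the current state
        for _ in range(k):                 # reasoning steps
            g_th, g_ph = grad_G(theta, ph_n), grad_D(th_n, phi)
            m_th_n = b1 * m_th + (1 - b1) * g_th
            m_ph_n = b1 * m_ph + (1 - b1) * g_ph
            v_th_n = b2 * v_th + (1 - b2) * g_th ** 2
            v_ph_n = b2 * v_ph + (1 - b2) * g_ph ** 2
            # bias-corrected moments, then an Adam step from the *current* state
            th_n = theta - eta * (m_th_n / (1 - b1 ** t)) / (np.sqrt(v_th_n / (1 - b2 ** t)) + eps)
            ph_n = phi - eta * (m_ph_n / (1 - b1 ** t)) / (np.sqrt(v_ph_n / (1 - b2 ** t)) + eps)
        theta, phi = th_n, ph_n            # commit parameters ...
        m_th, m_ph, v_th, v_ph = m_th_n, m_ph_n, v_th_n, v_ph_n  # ... and moments together
    return theta, phi

print(lvk_adam(1.0, 1.0))  # with k >= 2 reasoning steps the iterates should approach (0, 0)
```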

6.2 8-Gaussians

In our first experiment, we evaluate Lv.$k$ Adam on generating a mixture of 8 Gaussians with standard deviations equal to $0.05$ and modes uniformly distributed around the unit circle. We use a two-layer multi-layer perceptron with ReLU activations, a latent dimension of 64, and a batch size of 128. The generated distribution is presented in Figure 4.

Table 2: The difference between two states generated by Lv.$k$ Adam and Lv.$k$ GP, averaged over 100 steps. $^{\dagger}$The difference is smaller than machine precision.
$\frac{1}{100}\sum_{t=1}^{100}r_{t}^{(k)}$ | $k=2$ | $k=4$ | $k=6$ | $k=8$ | $k=10$
Lv.$k$ Adam | $1.04\times 10^{-1}$ | $1.60\times 10^{-2}$ | $2.91\times 10^{-3}$ | $6.08\times 10^{-4}$ | $1.68\times 10^{-4}$
Lv.$k$ GP | $9.79\times 10^{-9}$ | $3.84\times 10^{-15}$ | $1.31\times 10^{-17}$ | $1.68\times 10^{-19}$ | $\approx 0^{\dagger}$
Figure 4: Left: real distribution. Right: generated distribution.

In addition to presenting the mode coverage of the generated distribution after training, we also study the convergence of the reasoning steps of Lv.$k$ Adam and Lv.$k$ GP. Let us define the difference between the states of Lv.$k$ and Lv.$k-1$ agents at time $t$ as:

r^{(k)}_{t}=\lVert\bm{\theta}^{(k)}_{t}-\bm{\theta}^{(k-1)}_{t}\rVert^{2}+\lVert\bm{\phi}^{(k)}_{t}-\bm{\phi}^{(k-1)}_{t}\rVert^{2}. \qquad (7)

We measure this difference averaged over the first 100 iterations, $\frac{1}{100}\sum_{t=1}^{100}r_{t}^{(k)}$, for Lv.$10$ Adam ($\eta=10^{-4}$) and Lv.$10$ GP ($\eta=10^{-2}$). The results presented in Table 2 demonstrate that both Lv.$k$ GP and Lv.$k$ Adam converge as $k$ increases. Moreover, the estimation precision of Lv.$k$ GP improves rapidly and converges to 0 within finitely many steps, making it an accurate estimation of SPPM.

Table 3: FID and Inception scores of different algorithms and architectures on CIFAR-10. Results are averaged over 3 runs. We re-evaluate performance with the official implementation of FID.
Algorithm | Architecture | FID $\downarrow$ | IS $\uparrow$
Adam [29] | BigGAN [6] | 14.73 | $\bm{9.22}$
Adam [29] | StyleGAN2 [59] | 11.07 | 9.18
Adam [29] | SN-GAN [44] | $21.70\pm 0.21$ | $7.60\pm 0.06$
Unrolled GAN [42, 9] | SN-GAN [44] | $17.51\pm 1.08$ | ——
Extra-Adam [19, 38, 9] | SN-GAN [44] | $15.47\pm 1.82$ | ——
LEAD [23] | SN-GAN [44] | $14.45\pm 0.45$ | ——
LA-AltGAN [9] | SN-GAN [44] | $12.67\pm 0.57$ | $8.55\pm 0.04$
ODE-GAN (RK4) [48] | SN-GAN [44] | $11.85\pm 0.21$ | $8.61\pm 0.06$
Lv.$6$ Adam | SN-GAN [44] | $\bm{10.17\pm 0.16}$ | $\bm{8.78\pm 0.06}$


Figure 5: Top left: change of FID scores over 1.2 million gradient computations for Adam, LEAD, Extra-Adam, Lv.$2$, Lv.$4$ and Lv.$6$ Adam on CIFAR-10 with SN-GAN. Note that for Lv.$4$ and Lv.$6$ Adam, we use the accelerated implementation introduced in Appendix A.10. Bottom left: average computational cost per iteration on the 8-Gaussians experiment for MLPs with different widths. Right: average computational cost per iteration on CIFAR-10 for Lv.$k$ Adam with different $k$. The values for LEAD (2.88) and Adam (14.65) are highlighted by dashed lines.

6.3 Image Generation Experiments

We evaluate the effectiveness of our Lv.$k$ Adam algorithm on unconditional generation of CIFAR-10 [31]. We use the Inception score (the higher the better) [50] and the Fréchet Inception distance (the lower the better) [24] as performance metrics for image synthesis. We use the SN-GAN architecture based on [44]. For baselines, we compare the performance of Lv.$6$ Adam to that of other first-order and second-order optimization algorithms with the same SN-GAN architecture, and to state-of-the-art models trained with Adam. For Lv.$k$ Adam, we use $\beta_{1}=0$ and $\beta_{2}=0.9$ for all experiments. We use different learning rates for the generator ($\eta_{\theta}=4\times 10^{-5}$) and the discriminator ($\eta_{\phi}=2\times 10^{-4}$). We train Lv.$k$ Adam with batch size 128 for 600 epochs. For testing, we use an exponential moving average of the generator's parameters with averaging factor $\beta=0.999$.
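The exponential moving average used at test time is a standard trick; a minimal PyTorch-style sketch (ours, with an assumed generator module G) looks like:

```python
import copy
import torch

def update_ema(ema_model, model, beta=0.999):
    """In-place EMA of the generator's weights, used only for evaluation."""
    with torch.no_grad():
        for p_ema, p in zip(ema_model.parameters(), model.parameters()):
            p_ema.mul_(beta).add_(p, alpha=1.0 - beta)

# usage: ema_G = copy.deepcopy(G); call update_ema(ema_G, G) after every optimizer step
```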

In Table 3, we present the performance of our method and the baselines. The best FID and Inception scores on SN-GAN, $10.17\pm 0.16$ and $8.78\pm 0.06$, are obtained with our Lv.$6$ Adam. We also outperform BigGAN and StyleGAN2 in terms of FID score. Notably, our model has 5.1M parameters in total and is trained with a small batch size of 128, whereas BigGAN uses 158.3M parameters and a batch size of 2048.

The effect of $k$ and losses: We evaluate values of $k=\{2,4,6\}$ on the non-saturating loss [21] (non-zero-sum formulation) and the hinge loss [35] (zero-sum formulation). The results are presented in Table 4.

Table 4: FID scores for the different loss functions with $k=\{2,4,6\}$.
Loss | Lv.$2$ Adam | Lv.$4$ Adam | Lv.$6$ Adam
Non-saturating loss | $11.33\pm 0.18$ | $11.62\pm 0.25$ | $10.93\pm 0.24$
Hinge loss | $10.68\pm 0.20$ | $10.33\pm 0.22$ | $10.17\pm 0.16$

Remarkably, our experiments demonstrate that a few steps of recursive reasoning can result in significant performance gains compared to existing GAN optimizers. The gradual improvements in the FID scores support the idea that better estimation of opponents' next move improves performance. Moreover, we observe performance gains in both zero-sum and non-zero-sum formulations, which supplements our theoretical convergence guarantees in zero-sum games.

Experimental results on STL-10: To test whether the proposed Lv.$k$ Adam optimizer works on higher-resolution images, we evaluate its performance on the STL-10 dataset [11] at $3\times 48\times 48$ resolution. In our experiments, Lv.$6$ Adam obtained an average FID of $25.43\pm 0.18$, which outperforms that of the Adam optimizer, $30.25\pm 0.26$, using the same SN-GAN architecture.

Memory and computation cost: Compared to SGD, Lv.$k$ Adam requires the same extra memory as the EG method (one additional set of parameters per player). The relative cost of one iteration versus SGD is a factor of $k$, and the computational cost increases linearly as $k$ increases; we illustrate this relationship in Figure 5 (right). We provide an accelerated version of Lv.$k$ Adam in A.8 which reduces the computation cost by half for $k>2$. In Figure 5 (top left), we compare the FID scores obtained by Lv.$k$ Adam, Adam, and LEAD on CIFAR-10 over the same number of gradient computations. LEAD, Lv.$4$ Adam, and Lv.$6$ Adam all outperform Adam in this experiment. Lv.$4$ Adam outperforms LEAD after $2\times 10^{5}$ gradient computations. Our method is also compared with different algorithms on the 8-Gaussians problem in terms of computational cost. On the same architecture with different widths, Figure 5 (bottom left) illustrates the wall-clock time per iteration for different algorithms. We observe that the computational cost of Lv.$k$ Adam, while being much lower than that of CGD, is similar to that of LEAD, SGA and CO, which involve JVP operations. Each run on the CIFAR-10 dataset takes 30-33 hours on an Nvidia RTX3090 GPU. Each experiment on STL-10 takes 48-60 hours on an Nvidia RTX3090 GPU.

7 Conclusion and Future Work

This paper proposes a novel algorithm, Level $k$ Gradient Play, capable of reasoning about players' future strategies. We achieve an average FID score of 10.17 for unconditional image generation on the CIFAR-10 dataset, allowing GAN training on common computing resources to reach state-of-the-art performance. Our results suggest that Lv.$k$ GP is a flexible add-on that can be easily attached to existing GAN optimizers (e.g., Adam), providing noticeable gains in performance and stability. In future work, we will examine the effectiveness of our approach on more complicated GAN designs, such as Progressive GANs [26] and StyleGANs [27], where optimization plays a more significant role. Additionally, we intend to examine the convergence property of Lv.$k$ GP in games with more than two players.

Broader Impact

Our work introduces a novel optimizer that improves the performance of GANs and may reduce the amount of hyperparameter tuning required by practitioners of generative modeling. Generative models have been used to create illegal content (e.g., deepfakes [5]). There is a risk of negative social impact resulting from malicious use of the proposed methods.

Acknowledgement

This work was supported by funding from an NSERC Alliance grant and Huawei Technologies Canada. ZL would like to thank Tianshi Cao for insightful discussions on algorithm and experiment design. We are grateful to Bolin Gao and Dian Gadjov for their support.

References

  • Arjovsky et al. [2017] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, pages 214–223. PMLR, 2017.
  • Azizian et al. [2020] Waïss Azizian, Ioannis Mitliagkas, Simon Lacoste-Julien, and Gauthier Gidel. A tight and unified analysis of gradient-based methods for a whole spectrum of differentiable games. In International Conference on Artificial Intelligence and Statistics, pages 2863–2873. PMLR, 2020.
  • Bailey et al. [2020] James P Bailey, Gauthier Gidel, and Georgios Piliouras. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. In Conference on Learning Theory, pages 391–407. PMLR, 2020.
  • Balduzzi et al. [2018] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pages 354–363. PMLR, 2018.
  • Brandon [2018] John Brandon. Terrifying high-tech porn: Creepy ’deepfake’ videos are on the rise. Fox News, 2018.
  • Brock et al. [2018] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  • Carlini and Wagner [2017] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), pages 39–57. IEEE, 2017.
  • Chavdarova et al. [2019] Tatjana Chavdarova, Gauthier Gidel, François Fleuret, and Simon Lacoste-Julien. Reducing noise in GAN training with variance reduced extragradient. Advances in Neural Information Processing Systems, 32, 2019.
  • Chavdarova et al. [2020] Tatjana Chavdarova, Matteo Pagliardini, Sebastian U Stich, François Fleuret, and Martin Jaggi. Taming GANs with lookahead-minmax. arXiv preprint arXiv:2006.14567, 2020.
  • Choi et al. [2018] Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. StarGAN: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8789–8797, 2018.
  • Coates et al. [2011] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011.
  • Corballis [2007] Michael C Corballis. The uniqueness of human recursive thinking: the ability to think about thinking may be the critical attribute that distinguishes us from all other species. American Scientist, 95(3):240–248, 2007.
  • Daskalakis et al. [2017] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.
  • Fiez and Ratliff [2021] Tanner Fiez and Lillian J Ratliff. Local convergence analysis of gradient descent ascent with finite timescale separation. In Proceedings of the International Conference on Learning Representation, 2021.
  • Fiez et al. [2019] Tanner Fiez, Benjamin Chasnov, and Lillian J Ratliff. Convergence of learning dynamics in Stackelberg games. arXiv preprint arXiv:1906.01217, 2019.
  • Foerster et al. [2017] Jakob N Foerster, Richard Y Chen, Maruan Al-Shedivat, Shimon Whiteson, Pieter Abbeel, and Igor Mordatch. Learning with opponent-learning awareness. arXiv preprint arXiv:1709.04326, 2017.
  • Gemp and Mahadevan [2018] Ian Gemp and Sridhar Mahadevan. Global convergence to the equilibrium of gans using variational inequalities. arXiv preprint arXiv:1808.01531, 2018.
  • Gemp and McWilliams [2019] Ian Gemp and Brian McWilliams. The Unreasonable Effectiveness of Adam on Cycles. In NeurIPS Workshop on Bridging Game Theory and Deep Learning, 2019.
  • Gidel et al. [2018] Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. arXiv preprint arXiv:1802.10551, 2018.
  • Gidel et al. [2019] Gauthier Gidel, Reyhane Askari Hemmat, Mohammad Pezeshki, Rémi Le Priol, Gabriel Huang, Simon Lacoste-Julien, and Ioannis Mitliagkas. Negative momentum for improved game dynamics. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1802–1811. PMLR, 2019.
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Gulrajani et al. [2017] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. Advances in neural information processing systems, 30, 2017.
  • Hemmat et al. [2020] Reyhane Askari Hemmat, Amartya Mitra, Guillaume Lajoie, and Ioannis Mitliagkas. Lead: Least-action dynamics for min-max optimization. arXiv preprint arXiv:2010.13846, 2020.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • Jin et al. [2020] Chi Jin, Praneeth Netrapalli, and Michael Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International Conference on Machine Learning, pages 4880–4889. PMLR, 2020.
  • Karras et al. [2017] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.
  • Karras et al. [2019] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4401–4410, 2019.
  • Karras et al. [2020] Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8110–8119, 2020.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Korpelevich [1976] Galina M Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Ledig et al. [2017] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690, 2017.
  • Letcher et al. [2019] Alistair Letcher, David Balduzzi, Sébastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. Differentiable game mechanics. The Journal of Machine Learning Research, 20(1):3032–3071, 2019.
  • Liang and Stokes [2019] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 907–915. PMLR, 2019.
  • Lim and Ye [2017] Jae Hyun Lim and Jong Chul Ye. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.
  • Lowe et al. [2017] Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017.
  • Madry et al. [2017] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Mertikopoulos et al. [2018a] Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018a.
  • Mertikopoulos et al. [2018b] Panayotis Mertikopoulos, Christos Papadimitriou, and Georgios Piliouras. Cycles in adversarial regularized learning. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2703–2717. SIAM, 2018b.
  • Mescheder et al. [2017] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. The numerics of GANs. Advances in neural information processing systems, 30, 2017.
  • Mescheder et al. [2018] Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. Which training methods for GANs do actually converge? In International conference on machine learning, pages 3481–3490. PMLR, 2018.
  • Metz et al. [2016] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
  • Mishchenko et al. [2020] Konstantin Mishchenko, Dmitry Kovalev, Egor Shulgin, Peter Richtárik, and Yura Malitsky. Revisiting stochastic extragradient. In International Conference on Artificial Intelligence and Statistics, pages 4573–4582. PMLR, 2020.
  • Miyato et al. [2018] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
  • Mokhtari et al. [2020] Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient and optimistic gradient methods for saddle point problems: Proximal point approach. In International Conference on Artificial Intelligence and Statistics, pages 1497–1507. PMLR, 2020.
  • Papadimitriou and Piliouras [2018] Christos Papadimitriou and Georgios Piliouras. From Nash equilibria to chain recurrent sets: An algorithmic solution concept for game theory. Entropy, 20(10):782, 2018.
  • Piliouras et al. [2021] Georgios Piliouras, Ryann Sim, and Stratis Skoulakis. Optimal no-regret learning in general games: Bounded regret with unbounded step-sizes via Clairvoyant MWU. arXiv preprint arXiv:2111.14737, 2021.
  • Qin et al. [2020] Chongli Qin, Yan Wu, Jost Tobias Springenberg, Andy Brock, Jeff Donahue, Timothy Lillicrap, and Pushmeet Kohli. Training generative adversarial networks by solving ordinary differential equations. Advances in Neural Information Processing Systems, 33:5599–5609, 2020.
  • Rockafellar [1976] R Tyrrell Rockafellar. Monotone operators and the proximal point algorithm. SIAM journal on control and optimization, 14(5):877–898, 1976.
  • Salimans et al. [2016] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in neural information processing systems, 29, 2016.
  • Sauer et al. [2021] Axel Sauer, Kashyap Chitta, Jens Müller, and Andreas Geiger. Projected GANs converge faster. Advances in Neural Information Processing Systems, 34, 2021.
  • Sauer et al. [2022] Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets. arXiv preprint arXiv:2202.00273, 2022.
  • Schäfer and Anandkumar [2019] Florian Schäfer and Anima Anandkumar. Competitive gradient descent. Advances in Neural Information Processing Systems, 32, 2019.
  • Thanh-Tung et al. [2019] Hoang Thanh-Tung, Truyen Tran, and Svetha Venkatesh. Improving generalization and stability of generative adversarial networks. arXiv preprint arXiv:1902.03984, 2019.
  • Vinyals et al. [2019] Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
  • Wang et al. [2019] Yuanhao Wang, Guodong Zhang, and Jimmy Ba. On solving minimax optimization locally: A follow-the-ridge approach. arXiv preprint arXiv:1910.07512, 2019.
  • Willi et al. [2022] Timon Willi, Johannes Treutlein, Alistair Letcher, and Jakob Foerster. COLA: Consistent Learning with Opponent-Learning Awareness. arXiv preprint arXiv:2203.04098, 2022.
  • Zhang et al. [2021] Guodong Zhang, Yuanhao Wang, Laurent Lessard, and Roger Grosse. Near-optimal Local Convergence of Alternating Gradient Descent-Ascent for Minimax Optimization. arXiv preprint arXiv:2102.09468, 2021.
  • Zhao et al. [2020] Shengyu Zhao, Zhijian Liu, Ji Lin, Jun-Yan Zhu, and Song Han. Differentiable augmentation for data-efficient GAN training. Advances in Neural Information Processing Systems, 33:7559–7570, 2020.

Appendix A Appendix

A.1 Proof of Theorem 4.1

Proof.

To begin with, let us consider Lv.1 GP:

\theta_t^{(1)} = \theta_t - \eta \nabla_{\theta} f(\theta_t, \phi_t^{(0)}) = \theta_t - \eta \nabla_{\theta} f(\theta_t, \phi_t)
\phi_t^{(1)} = \phi_t - \eta \nabla_{\phi} g(\theta_t^{(0)}, \phi_t) = \phi_t - \eta \nabla_{\phi} g(\theta_t, \phi_t)

The differences between (\theta_t^{(1)}, \phi_t^{(1)}) and (\theta_t^{(0)}, \phi_t^{(0)}) are:

\|\theta_t^{(1)} - \theta_t^{(0)}\| = \eta \|\nabla_{\theta} f(\theta_t, \phi_t)\|,
\|\phi_t^{(1)} - \phi_t^{(0)}\| = \eta \|\nabla_{\phi} f(\theta_t, \phi_t)\|.

Recalling our definition of \Delta_{\max}, we have:

\|\theta_t^{(1)} - \theta_t^{(0)}\| + \|\phi_t^{(1)} - \phi_t^{(0)}\| \leq \eta \Delta_{\max}    (8)

Then, with \omega_t = [\theta_t, \phi_t]^T and \omega_t^{(k)} = [\theta_t^{(k)}, \phi_t^{(k)}]^T, the differences between the Lv.2 and Lv.1 agents are:

\|\theta_t^{(2)} - \theta_t^{(1)}\| = \eta \|\nabla_{\theta} f(\theta_t, \phi_t^{(1)}) - \nabla_{\theta} f(\theta_t, \phi_t^{(0)})\| \leq \eta L_{\theta\phi} \|\phi_t^{(1)} - \phi_t^{(0)}\|,
\|\phi_t^{(2)} - \phi_t^{(1)}\| = \eta \|\nabla_{\phi} f(\theta_t^{(1)}, \phi_t) - \nabla_{\phi} f(\theta_t^{(0)}, \phi_t)\| \leq \eta L_{\phi\theta} \|\theta_t^{(1)} - \theta_t^{(0)}\|.

Recalling that L := \max\{L_{\theta\theta}, L_{\theta\phi}, L_{\phi\theta}, L_{\phi\phi}\} and using Equation (8), we have:

\|\theta_t^{(2)} - \theta_t^{(1)}\| + \|\phi_t^{(2)} - \phi_t^{(1)}\| \leq \eta^2 L \Delta_{\max}    (9)

Similarly, we can derive the differences between the Lv.3 and Lv.2 agents:

\|\theta_t^{(3)} - \theta_t^{(2)}\| = \eta \|\nabla_{\theta} f(\theta_t, \phi_t^{(2)}) - \nabla_{\theta} f(\theta_t, \phi_t^{(1)})\| \leq \eta L_{\theta\phi} \|\phi_t^{(2)} - \phi_t^{(1)}\|,
\|\phi_t^{(3)} - \phi_t^{(2)}\| = \eta \|\nabla_{\phi} f(\theta_t^{(2)}, \phi_t) - \nabla_{\phi} f(\theta_t^{(1)}, \phi_t)\| \leq \eta L_{\phi\theta} \|\theta_t^{(2)} - \theta_t^{(1)}\|,
\|\theta_t^{(3)} - \theta_t^{(2)}\| + \|\phi_t^{(3)} - \phi_t^{(2)}\| \leq \eta^3 L^2 \Delta_{\max}    (10)

Consequently, the difference between any two consecutive reasoning states k and k-1 is upper bounded by:

\|\theta_t^{(k)} - \theta_t^{(k-1)}\| = \eta \|\nabla_{\theta} f(\theta_t, \phi_t^{(k-1)}) - \nabla_{\theta} f(\theta_t, \phi_t^{(k-2)})\| \leq \eta L_{\theta\phi} \|\phi_t^{(k-1)} - \phi_t^{(k-2)}\|,
\|\phi_t^{(k)} - \phi_t^{(k-1)}\| = \eta \|\nabla_{\phi} f(\theta_t^{(k-1)}, \phi_t) - \nabla_{\phi} f(\theta_t^{(k-2)}, \phi_t)\| \leq \eta L_{\phi\theta} \|\theta_t^{(k-1)} - \theta_t^{(k-2)}\|,
\|\theta_t^{(k)} - \theta_t^{(k-1)}\| + \|\phi_t^{(k)} - \phi_t^{(k-1)}\| \leq \eta (\eta L)^{k-1} \Delta_{\max}    (11)

Since \|\omega_t^{(k)} - \omega_t^{(k-1)}\| \leq \|\theta_t^{(k)} - \theta_t^{(k-1)}\| + \|\phi_t^{(k)} - \phi_t^{(k-1)}\|, we have:

\|\omega_t^{(k)} - \omega_t^{(k-1)}\| \leq \eta (\eta L)^{k-1} \Delta_{\max}    (12)

Suppose \eta < (2L)^{-1}, so that the map between any two consecutive reasoning states is a contraction. We then consider the difference \|\omega_t^{(a)} - \omega_t^{(b)}\| for a > b > 0, which we can rewrite as:

\|\omega_t^{(a)} - \omega_t^{(b)}\| = \| \sum_{i=b+1}^{a} (\omega_t^{(i)} - \omega_t^{(i-1)}) \|
\leq \sum_{i=b+1}^{a} \|\omega_t^{(i)} - \omega_t^{(i-1)}\|
\leq \sum_{i=b+1}^{a} \eta (\eta L)^{i-1} \Delta_{\max}
= \eta \Delta_{\max} [(\eta L)^{b} + \dots + (\eta L)^{a-1}]
\leq \eta \Delta_{\max} (\eta L)^{b-1}
= \eta^{b} L^{b-1} \Delta_{\max} = \mathcal{O}(\eta^{b}).    (13)

Here the penultimate inequality uses the geometric-series bound (\eta L)^{b} + \dots + (\eta L)^{a-1} \leq (\eta L)^{b}/(1 - \eta L) \leq 2(\eta L)^{b} \leq (\eta L)^{b-1}, which holds because \eta L < 1/2. Since \eta L < 1, for any \epsilon > 0 we can solve for b such that \eta^{b} L^{b-1} \Delta_{\max} < \epsilon. Therefore the sequence \{\omega_t^{(k)}\}_{k=0}^{\infty} is a Cauchy sequence. Moreover, in a complete space every Cauchy sequence has a limit: \lim_{k\to\infty} \omega_t^{(k)} = \omega_t^{*}. ∎
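As an illustrative numerical check of this argument (our own Python sketch, not code from the paper; the bilinear game and all constants are assumed for illustration), the gap between consecutive Lv.k anticipations indeed decays geometrically at rate \eta L:

```python
import numpy as np

# Minimal numerical illustration of Theorem 4.1 (our own sketch, not the paper's code):
# on the bilinear game f(theta, phi) = theta^T B phi, the Lv.k anticipations form a
# Cauchy sequence, with consecutive gaps shrinking like (eta * L)^(k - 1).
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
L = np.linalg.norm(B, 2)          # here the only curvature is the coupling block B
eta = 0.4 / L                     # step size below (2L)^{-1}, as the proof assumes

theta, phi = rng.standard_normal(3), rng.standard_normal(3)
th_prev, ph_prev = theta.copy(), phi.copy()       # Lv.0 anticipation: current strategies
for k in range(1, 11):
    # Lv.k reasoning: each player responds to the opponent's Lv.(k-1) anticipation.
    th_next = theta - eta * (B @ ph_prev)         # grad_theta f(theta_t, phi_t^(k-1))
    ph_next = phi + eta * (B.T @ th_prev)         # the max player ascends f
    gap = np.linalg.norm(th_next - th_prev) + np.linalg.norm(ph_next - ph_prev)
    print(f"k={k:2d}  gap={gap:.3e}")             # geometric decay at rate eta*L = 0.4
    th_prev, ph_prev = th_next, ph_next
```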

A.2 Proof of Theorem 5.1

Theorem A.1.

Consider the (Minimax) problem under Assumption 3.1 and Lv.k GP. Let (\theta^*, \phi^*) be a stationary point. Suppose that \theta_t - \theta^* is not in the kernel of \nabla_{\phi\theta} f(\theta^*, \phi^*), that \phi_t - \phi^* is not in the kernel of \nabla_{\theta\phi} f(\theta^*, \phi^*), and that \eta < L^{-1}. Then there exists a neighborhood \mathcal{U} of (\theta^*, \phi^*) such that if SPPM is started at (\theta_0, \phi_0) \in \mathcal{U}, the iterates \{\theta_t, \phi_t\}_{t \geq 0} generated by SPPM satisfy:

\|\theta_{t+1} - \theta^*\|^2 + \|\phi_{t+1} - \phi^*\|^2 \leq \frac{\rho^2(I - \eta \nabla_{\theta\theta} f^*) \|\theta_t - \theta^*\|^2 + \rho^2(I + \eta \nabla_{\phi\phi} f^*) \|\phi_t - \phi^*\|^2}{1 + \eta^2 \lambda_{\min}(\nabla_{\theta\phi} f^* \nabla_{\phi\theta} f^*)}

where f^* = f(\theta^*, \phi^*) and \rho(\cdot) denotes the spectral radius. Moreover, for any \eta satisfying:

\frac{\max(\rho^2(I - \eta \nabla_{\theta\theta} f^*), \rho^2(I + \eta \nabla_{\phi\phi} f^*))}{1 + \eta^2 \lambda_{\min}(\nabla_{\theta\phi} f^* \nabla_{\phi\theta} f^*)} < 1,    (14)

SPPM converges asymptotically to (\theta^*, \phi^*).

Proof.

Consider the learning dynamics:

\theta_{t+1} = \theta_t - \eta \nabla_{\theta} f(\theta_t, \phi_{t+1})
\phi_{t+1} = \phi_t + \eta \nabla_{\phi} f(\theta_{t+1}, \phi_t)

Let us define

\hat{\theta}_t = \theta_t - \theta^*
\hat{\phi}_t = \phi_t - \phi^*

It follows immediately by linearizing the system about the stationary point (\theta^*, \phi^*) that

\begin{bmatrix} \hat{\theta}_{t+1} \\ \hat{\phi}_{t+1} \end{bmatrix} \simeq \begin{bmatrix} I - \eta \nabla^2_{\theta\theta} f(\theta^*, \phi^*) & 0 \\ 0 & I + \eta \nabla^2_{\phi\phi} f(\theta^*, \phi^*) \end{bmatrix} \begin{bmatrix} \hat{\theta}_t \\ \hat{\phi}_t \end{bmatrix} + \begin{bmatrix} 0 & -\eta \nabla^2_{\theta\phi} f(\theta^*, \phi^*) \\ \eta \nabla^2_{\phi\theta} f(\theta^*, \phi^*) & 0 \end{bmatrix} \begin{bmatrix} \hat{\theta}_{t+1} \\ \hat{\phi}_{t+1} \end{bmatrix}

Let us denote the Jacobian by

\begin{bmatrix} -\nabla^2_{\theta\theta} f(\theta^*, \phi^*) & -\nabla^2_{\theta\phi} f(\theta^*, \phi^*) \\ \nabla^2_{\phi\theta} f(\theta^*, \phi^*) & \nabla^2_{\phi\phi} f(\theta^*, \phi^*) \end{bmatrix} = \begin{bmatrix} -A & -B \\ B^T & C \end{bmatrix}    (15)

Then we can rewrite the dynamics around the stationary point as

\hat{\theta}_{t+1} = \hat{\theta}_t - \eta A \hat{\theta}_t - \eta B \hat{\phi}_{t+1}
\hat{\theta}_{t+1} = \hat{\theta}_t - \eta A \hat{\theta}_t - \eta B (\hat{\phi}_t + \eta B^T \hat{\theta}_{t+1} + \eta C \hat{\phi}_t)
(I + \eta^2 B B^T) \hat{\theta}_{t+1} = (I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t
\hat{\theta}_{t+1} = (I + \eta^2 B B^T)^{-1} [(I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t]    (16)

Similarly, for the other player we have

\hat{\phi}_{t+1} = \hat{\phi}_t + \eta B^T \hat{\theta}_{t+1} + \eta C \hat{\phi}_t
\hat{\phi}_{t+1} = \hat{\phi}_t + \eta B^T (\hat{\theta}_t - \eta A \hat{\theta}_t - \eta B \hat{\phi}_{t+1}) + \eta C \hat{\phi}_t
(I + \eta^2 B^T B) \hat{\phi}_{t+1} = \eta B^T (I - \eta A) \hat{\theta}_t + (I + \eta C) \hat{\phi}_t
\hat{\phi}_{t+1} = (I + \eta^2 B^T B)^{-1} [\eta B^T (I - \eta A) \hat{\theta}_t + (I + \eta C) \hat{\phi}_t]    (17)

Let us define the symmetric matrices Q_\theta = (I + \eta^2 B B^T)^{-1}, Q_\phi = (I + \eta^2 B^T B)^{-1} and P_\theta = (I - \eta A), P_\phi = (I + \eta C). Further, we define r_t = \|\hat{\theta}_t\|^2 + \|\hat{\phi}_t\|^2. Based on these definitions and the expressions in (16) and (17), we have

\|\hat{\theta}_{t+1}\|^2 + \|\hat{\phi}_{t+1}\|^2 = \|Q_\theta P_\theta \hat{\theta}_t\|^2 + \eta^2 \|Q_\theta B P_\phi \hat{\phi}_t\|^2 + \eta^2 \|Q_\phi B^T P_\theta \hat{\theta}_t\|^2 + \|Q_\phi P_\phi \hat{\phi}_t\|^2
- 2\eta \hat{\theta}_t^T P_\theta^T Q_\theta^T Q_\theta B P_\phi \hat{\phi}_t + 2\eta \hat{\phi}_t^T P_\phi^T Q_\phi^T Q_\phi B^T P_\theta \hat{\theta}_t    (18)

To simplify the expression in (18), we use the following lemma:

Lemma A.1.

The matrices Q_\theta = (I + \eta^2 B B^T)^{-1} and Q_\phi = (I + \eta^2 B^T B)^{-1} satisfy the following properties:

Q_\theta B = B Q_\phi    (19)
Q_\phi B^T = B^T Q_\theta    (20)

Using this lemma, we can show that

\hat{\theta}_t^T P_\theta^T Q_\theta^T Q_\theta B P_\phi \hat{\phi}_t = \hat{\theta}_t^T P_\theta^T Q_\theta^T B Q_\phi P_\phi \hat{\phi}_t = \hat{\phi}_t^T P_\phi^T Q_\phi^T B^T Q_\theta P_\theta \hat{\theta}_t = \hat{\phi}_t^T P_\phi^T Q_\phi^T Q_\phi B^T P_\theta \hat{\theta}_t

where the intermediate equality holds since a^T b = b^T a. Hence, the cross terms in (18) cancel and the expression can be simplified as

\|\hat{\theta}_{t+1}\|^2 + \|\hat{\phi}_{t+1}\|^2 = \|Q_\theta P_\theta \hat{\theta}_t\|^2 + \eta^2 \|Q_\theta B P_\phi \hat{\phi}_t\|^2 + \eta^2 \|Q_\phi B^T P_\theta \hat{\theta}_t\|^2 + \|Q_\phi P_\phi \hat{\phi}_t\|^2    (21)

We simplify Equation (21) as follows. Consider the terms involving \hat{\theta}_t. We have

\|Q_\theta P_\theta \hat{\theta}_t\|^2 + \eta^2 \|Q_\phi B^T P_\theta \hat{\theta}_t\|^2 = \hat{\theta}_t^T P_\theta^T Q_\theta^2 P_\theta \hat{\theta}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi^2 B^T P_\theta \hat{\theta}_t
= \hat{\theta}_t^T P_\theta^T (Q_\theta^2 + \eta^2 B Q_\phi^2 B^T) P_\theta \hat{\theta}_t
= \hat{\theta}_t^T P_\theta^T (Q_\theta^2 + \eta^2 B Q_\phi B^T Q_\theta) P_\theta \hat{\theta}_t
= \hat{\theta}_t^T P_\theta^T (Q_\theta^2 + \eta^2 B B^T Q_\theta Q_\theta) P_\theta \hat{\theta}_t
= \hat{\theta}_t^T P_\theta^T (I + \eta^2 B B^T) Q_\theta^2 P_\theta \hat{\theta}_t
= \hat{\theta}_t^T P_\theta^T (I + \eta^2 B B^T)^{-1} P_\theta \hat{\theta}_t    (22)

where the last equality follows by replacing Q_\theta with its definition. The same procedure applies to the terms involving \hat{\phi}_t, which leads to the expression

\|Q_\phi P_\phi \hat{\phi}_t\|^2 + \eta^2 \|Q_\theta B P_\phi \hat{\phi}_t\|^2 = \hat{\phi}_t^T P_\phi^T (I + \eta^2 B^T B)^{-1} P_\phi \hat{\phi}_t.    (23)

Substituting the expressions in (22) and (23) into (21), we obtain

\|\hat{\theta}_{t+1}\|^2 + \|\hat{\phi}_{t+1}\|^2 = \hat{\theta}_t^T P_\theta^T (I + \eta^2 B B^T)^{-1} P_\theta \hat{\theta}_t + \hat{\phi}_t^T P_\phi^T (I + \eta^2 B^T B)^{-1} P_\phi \hat{\phi}_t.    (24)

Note that we assume the trajectory \{\hat{\theta}_t, \hat{\phi}_t\}_{t\geq 0} is not in the kernels of B B^T and B^T B, so that B B^T \hat{\theta}_t \neq 0 and B^T B \hat{\phi}_t \neq 0. Now, using the expression in (24), the fact that P_\theta = P_\theta^T and P_\phi = P_\phi^T, and the fact that B B^T and B^T B have the same set of non-zero eigenvalues, whose minimum we denote by \lambda_{\min}(B B^T) = \lambda_{\min}(B^T B), we can write

\|\theta_{t+1} - \theta^*\|^2 + \|\phi_{t+1} - \phi^*\|^2 \leq \frac{\rho^2(I - \eta A) \|\theta_t - \theta^*\|^2 + \rho^2(I + \eta C) \|\phi_t - \phi^*\|^2}{1 + \eta^2 \lambda_{\min}(B^T B)}.

Replacing \|\theta_{t+1} - \theta^*\|^2 + \|\phi_{t+1} - \phi^*\|^2 and \|\theta_t - \theta^*\|^2 + \|\phi_t - \phi^*\|^2 with r_{t+1} and r_t, we have:

r_{t+1} \leq \frac{\max(\rho^2(I - \eta A), \rho^2(I + \eta C))}{1 + \eta^2 \lambda_{\min}(B^T B)} r_t.

Recall that A = \nabla^2_{\theta\theta} f(\theta^*, \phi^*), B = \nabla^2_{\theta\phi} f(\theta^*, \phi^*) and C = \nabla^2_{\phi\phi} f(\theta^*, \phi^*); therefore, for any \eta satisfying

\frac{\max(\rho^2(I - \eta \nabla_{\theta\theta} f(\theta^*, \phi^*)), \rho^2(I + \eta \nabla_{\phi\phi} f(\theta^*, \phi^*)))}{1 + \eta^2 \lambda_{\min}(\nabla_{\theta\phi} f(\theta^*, \phi^*) \nabla_{\phi\theta} f(\theta^*, \phi^*))} < 1,    (25)

we have r_{t+1} < r_t. Since we linearized the system about the stationary point (\theta^*, \phi^*), there exists a neighborhood \mathcal{U} around the stationary point such that SPPM started at (\theta_0, \phi_0) \in \mathcal{U} converges asymptotically to (\theta^*, \phi^*). ∎
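The contraction above is easy to probe numerically. The following sketch (our own illustration; the quadratic game, matrices, and step size are assumed, not taken from the paper's experiments) iterates the closed forms (16)-(17) and asserts that the squared distance shrinks at least by the factor in (25):

```python
import numpy as np

# Numerical check (our own illustrative example) of the SPPM contraction in Eq. (25):
# iterate the closed forms (16)-(17) for a quadratic game with symmetric blocks A, C
# and a square coupling B, and assert r_{t+1} <= factor * r_t at every step.
rng = np.random.default_rng(1)
n, eta = 4, 0.1
A = 0.5 * np.eye(n)                       # curvature of the min player (symmetric)
C = -0.5 * np.eye(n)                      # curvature of the max player (symmetric)
B = rng.standard_normal((n, n))           # full-rank coupling, so the kernel assumption holds

I = np.eye(n)
Qt = np.linalg.inv(I + eta**2 * B @ B.T)  # Q_theta
Qp = np.linalg.inv(I + eta**2 * B.T @ B)  # Q_phi
Pt, Pp = I - eta * A, I + eta * C         # P_theta, P_phi

rho = lambda M: np.max(np.abs(np.linalg.eigvals(M)))
factor = max(rho(Pt)**2, rho(Pp)**2) / (1 + eta**2 * np.linalg.eigvalsh(B.T @ B).min())

th, ph = rng.standard_normal(n), rng.standard_normal(n)
for t in range(50):
    r_old = th @ th + ph @ ph
    th, ph = Qt @ (Pt @ th - eta * B @ (Pp @ ph)), Qp @ (eta * B.T @ (Pt @ th) + Pp @ ph)
    assert th @ th + ph @ ph <= factor * r_old + 1e-12
print("contraction factor:", factor)      # < 1, so the iterates converge to (0, 0)
```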

A.3 Proof of Remark 5.1

Proof.

To prove the local convergence of Lv.k GP in nonconvex-nonconcave games, we first consider the update rule of Lv.k GP:

Reasoning: { \theta_t^{(k)} = \theta_t - \eta \nabla_{\theta} f(\theta_t, \phi_t^{(k-1)});  \phi_t^{(k)} = \phi_t - \eta \nabla_{\phi} g(\theta_t^{(k-1)}, \phi_t) }    Update: { \theta_{t+1} = \theta_t^{(k)};  \phi_{t+1} = \phi_t^{(k)} }
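To make the recursion concrete, here is a minimal sketch of the full Lv.k GP loop (k reasoning rounds followed by the update) on an assumed quadratic game with g = -f; the matrices, step size, and horizon are illustrative choices only:

```python
import numpy as np

# A compact sketch of the Lv.k GP loop (reasoning + update) on the quadratic game
# f(theta, phi) = 0.5 theta^T A theta + theta^T B phi - 0.5 phi^T C phi, with g = -f.
# All matrices and hyperparameters below are illustrative assumptions.
rng = np.random.default_rng(4)
n, k, T = 2, 6, 300
A = C = 0.1 * np.eye(n)
B = rng.standard_normal((n, n))
L = max(np.linalg.norm(M, 2) for M in (A, B, C))
eta = 0.4 / L                                    # below (2L)^{-1}, so reasoning converges

grad_th = lambda th, ph: A @ th + B @ ph         # nabla_theta f
grad_ph = lambda th, ph: B.T @ th - C @ ph       # nabla_phi f (the max player ascends)

th, ph = rng.standard_normal(n), rng.standard_normal(n)
for t in range(T):
    th_k, ph_k = th.copy(), ph.copy()            # Lv.0 anticipations
    for _ in range(k):                           # k rounds of recursive reasoning
        th_k, ph_k = th - eta * grad_th(th, ph_k), ph + eta * grad_ph(th_k, ph)
    th, ph = th_k, ph_k                          # Lv.k GP update
print("distance to the stationary point (0, 0):", np.linalg.norm(th) + np.linalg.norm(ph))
```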

Similar to Appendix A.2, let us denote

\begin{bmatrix} -\nabla^2_{\theta\theta} f(\theta^*, \phi^*) & -\nabla^2_{\theta\phi} f(\theta^*, \phi^*) \\ \nabla^2_{\phi\theta} f(\theta^*, \phi^*) & \nabla^2_{\phi\phi} f(\theta^*, \phi^*) \end{bmatrix} = \begin{bmatrix} -A & -B \\ B^T & C \end{bmatrix}

and define the differences between the reasoning states and the stationary point as

\hat{\theta}_t^{(k)} = \theta_t^{(k)} - \theta^*  and  \hat{\theta}_t = \theta_t - \theta^*,
\hat{\phi}_t^{(k)} = \phi_t^{(k)} - \phi^*  and  \hat{\phi}_t = \phi_t - \phi^*.

Linearizing the dynamical system induced by Lv.k GP about the stationary point (\theta^*, \phi^*), we get:

\hat{\theta}_{t+1} = \hat{\theta}_t^{(k)} = (I - \eta A) \hat{\theta}_t - \eta B \hat{\phi}_t^{(k-1)}
\hat{\phi}_{t+1} = \hat{\phi}_t^{(k)} = \eta B^T \hat{\theta}_t^{(k-1)} + (I + \eta C) \hat{\phi}_t

Note that in Lv.k GP we define \theta_t^{(0)} = \theta_t and \phi_t^{(0)} = \phi_t; thus for Lv.1 GP we have:

\hat{\theta}_t^{(1)} = (I - \eta A) \hat{\theta}_t - \eta B \hat{\phi}_t
\hat{\phi}_t^{(1)} = \eta B^T \hat{\theta}_t + (I + \eta C) \hat{\phi}_t

For Lv.2 GP, we have:

\hat{\theta}_t^{(2)} = (I - \eta A) \hat{\theta}_t - \eta B \hat{\phi}_t^{(1)}
\hat{\phi}_t^{(2)} = \eta B^T \hat{\theta}_t^{(1)} + (I + \eta C) \hat{\phi}_t

Substituting \hat{\theta}_t^{(1)} and \hat{\phi}_t^{(1)} into the update rule above, we get:

\hat{\theta}_t^{(2)} = (I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t - \eta^2 B B^T \hat{\theta}_t
\hat{\phi}_t^{(2)} = \eta B^T (I - \eta A) \hat{\theta}_t + (I + \eta C) \hat{\phi}_t - \eta^2 B^T B \hat{\phi}_t

Similarly, for Lv.3 and Lv.4 GP we have:

\hat{\theta}_t^{(3)} = (I - \eta^2 B B^T)(I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t + \eta^3 B B^T B \hat{\phi}_t
\hat{\phi}_t^{(3)} = \eta B^T (I - \eta A) \hat{\theta}_t + (I - \eta^2 B^T B)(I + \eta C) \hat{\phi}_t - \eta^3 B^T B B^T \hat{\theta}_t

and

\hat{\theta}_t^{(4)} = (I - \eta^2 B B^T)[(I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t] + \eta^4 B B^T B B^T \hat{\theta}_t
\hat{\phi}_t^{(4)} = (I - \eta^2 B^T B)[\eta B^T (I - \eta A) \hat{\theta}_t + (I + \eta C) \hat{\phi}_t] + \eta^4 B^T B B^T B \hat{\phi}_t

Summarizing the equations above, the Lv.2k GP update can be written as:

\hat{\theta}_t^{(2k)} = (\sum_{i=0}^{k-1} (-\eta^2 B B^T)^i) [(I - \eta A) \hat{\theta}_t - \eta B (I + \eta C) \hat{\phi}_t] + (-\eta^2 B B^T)^k \hat{\theta}_t
\hat{\phi}_t^{(2k)} = (\sum_{i=0}^{k-1} (-\eta^2 B^T B)^i) [\eta B^T (I - \eta A) \hat{\theta}_t + (I + \eta C) \hat{\phi}_t] + (-\eta^2 B^T B)^k \hat{\phi}_t
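This closed form can be checked mechanically. The sketch below (an assumed random instance, for illustration only) unrolls the linearized recursion 2k times and compares the result against the summarized expression:

```python
import numpy as np

# Verify (on an assumed random instance) that unrolling the linearized Lv.k GP
# recursion 2k times matches the closed form for theta_hat^(2k) and phi_hat^(2k).
rng = np.random.default_rng(2)
n, k, eta = 3, 4, 0.05
A, C = rng.standard_normal((n, n)), rng.standard_normal((n, n))
A, C = (A + A.T) / 2, (C + C.T) / 2                # symmetric Hessian blocks
B = rng.standard_normal((n, n))
I = np.eye(n)
th0, ph0 = rng.standard_normal(n), rng.standard_normal(n)

# Unroll: th^(j) = (I - eta A) th0 - eta B ph^(j-1);  ph^(j) = eta B^T th^(j-1) + (I + eta C) ph0
th, ph = th0.copy(), ph0.copy()
for _ in range(2 * k):
    th, ph = (I - eta * A) @ th0 - eta * B @ ph, eta * B.T @ th + (I + eta * C) @ ph0

X, Y = -eta**2 * B @ B.T, -eta**2 * B.T @ B
R = sum(np.linalg.matrix_power(X, i) for i in range(k))      # sum_{i=0}^{k-1} X^i
Rp = sum(np.linalg.matrix_power(Y, i) for i in range(k))
th_closed = R @ ((I - eta * A) @ th0 - eta * B @ (I + eta * C) @ ph0) \
            + np.linalg.matrix_power(X, k) @ th0
ph_closed = Rp @ (eta * B.T @ (I - eta * A) @ th0 + (I + eta * C) @ ph0) \
            + np.linalg.matrix_power(Y, k) @ ph0
print(np.allclose(th, th_closed), np.allclose(ph, ph_closed))  # expect: True True
```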

Similar to Appendix A.2, let us define Q_\theta = (I + \eta^2 B B^T)^{-1}, Q_\phi = (I + \eta^2 B^T B)^{-1} and P_\theta = (I - \eta A), P_\phi = (I + \eta C). Further, we define R_\theta^{(k)} = \sum_{i=0}^{k-1} (-\eta^2 B B^T)^i, R_\phi^{(k)} = \sum_{i=0}^{k-1} (-\eta^2 B^T B)^i and the error terms E_\theta^{(k)} = R_\theta^{(k)} - Q_\theta, E_\phi^{(k)} = R_\phi^{(k)} - Q_\phi.

Since \eta < L^{-1}, the corresponding Neumann series converge and we have:

Q_\theta = \sum_{i=0}^{\infty} (-\eta^2 B B^T)^i
Q_\phi = \sum_{i=0}^{\infty} (-\eta^2 B^T B)^i

and

E_\theta^{(k)} = R_\theta^{(k)} - Q_\theta = -\sum_{i=k}^{\infty} (-\eta^2 B B^T)^i = -(I + \eta^2 B B^T)^{-1} (-\eta^2 B B^T)^k    (26)
E_\phi^{(k)} = R_\phi^{(k)} - Q_\phi = -\sum_{i=k}^{\infty} (-\eta^2 B^T B)^i = -(I + \eta^2 B^T B)^{-1} (-\eta^2 B^T B)^k    (27)

Also, from Lemma A.1 and the definition of the error terms, it can be verified that

E_\theta^{(k)} B = B E_\phi^{(k)}    (28)
E_\phi^{(k)} B^T = B^T E_\theta^{(k)}    (29)

Then we can rewrite the update rule of Lv.2k GP as:

\hat{\theta}_t^{(2k)} = (Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t] + (-\eta^2 B B^T)^k \hat{\theta}_t
\hat{\phi}_t^{(2k)} = (Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t] + (-\eta^2 B^T B)^k \hat{\phi}_t

Let us consider the following sum:

\|\hat{\theta}_t^{(2k)} - (-\eta^2 B B^T)^k \hat{\theta}_t\|^2 + \|\hat{\phi}_t^{(2k)} - (-\eta^2 B^T B)^k \hat{\phi}_t\|^2
= \|(Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t]\|^2 + \|(Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t]\|^2    (30)

The R.H.S. of Eq. (30) can be written as:

\|(Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t]\|^2 + \|(Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t]\|^2
= \hat{\theta}_t^T P_\theta^T Q_\theta^2 P_\theta \hat{\theta}_t - 2\eta \hat{\theta}_t^T P_\theta^T Q_\theta^2 B P_\phi \hat{\phi}_t + \eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta^2 B P_\phi \hat{\phi}_t
+ 2 \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} P_\theta \hat{\theta}_t - 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t + 2\eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t
+ \hat{\theta}_t^T P_\theta^T [E_\theta^{(k)}]^2 P_\theta \hat{\theta}_t - 2\eta \hat{\theta}_t^T P_\theta^T [E_\theta^{(k)}]^2 B P_\phi \hat{\phi}_t + \eta^2 \hat{\phi}_t^T P_\phi^T B^T [E_\theta^{(k)}]^2 B P_\phi \hat{\phi}_t
+ \hat{\phi}_t^T P_\phi^T Q_\phi^2 P_\phi \hat{\phi}_t + 2\eta \hat{\theta}_t^T P_\theta^T B Q_\phi^2 P_\phi \hat{\phi}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi^2 B^T P_\theta \hat{\theta}_t
+ 2 \hat{\phi}_t^T P_\phi^T Q_\phi E_\phi^{(k)} P_\phi \hat{\phi}_t + 4\eta \hat{\theta}_t^T P_\theta^T B Q_\phi E_\phi^{(k)} P_\phi \hat{\phi}_t + 2\eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi E_\phi^{(k)} B^T P_\theta \hat{\theta}_t
+ \hat{\phi}_t^T P_\phi^T [E_\phi^{(k)}]^2 P_\phi \hat{\phi}_t + 2\eta \hat{\theta}_t^T P_\theta^T B [E_\phi^{(k)}]^2 P_\phi \hat{\phi}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B [E_\phi^{(k)}]^2 B^T P_\theta \hat{\theta}_t    (31)

Before collecting all terms in Eq. (31), note that all of the cross terms cancel out. For instance, using Lemma A.1 together with Eq. (28) and Eq. (29), we can show that

4\eta \hat{\theta}_t^T P_\theta^T B Q_\phi E_\phi^{(k)} P_\phi \hat{\phi}_t - 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t
= 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta B E_\phi^{(k)} P_\phi \hat{\phi}_t - 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t
= 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t - 4\eta \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t
= 0

By similar arguments, the remaining terms in Eq. (31) lead to:

\|(Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t]\|^2 + \|(Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t]\|^2
= \hat{\theta}_t^T P_\theta^T Q_\theta^2 P_\theta \hat{\theta}_t + \eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta^2 B P_\phi \hat{\phi}_t
+ 2 \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} P_\theta \hat{\theta}_t + 2\eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t
+ \hat{\theta}_t^T P_\theta^T [E_\theta^{(k)}]^2 P_\theta \hat{\theta}_t + \eta^2 \hat{\phi}_t^T P_\phi^T B^T [E_\theta^{(k)}]^2 B P_\phi \hat{\phi}_t
+ \hat{\phi}_t^T P_\phi^T Q_\phi^2 P_\phi \hat{\phi}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi^2 B^T P_\theta \hat{\theta}_t
+ 2 \hat{\phi}_t^T P_\phi^T Q_\phi E_\phi^{(k)} P_\phi \hat{\phi}_t + 2\eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi E_\phi^{(k)} B^T P_\theta \hat{\theta}_t
+ \hat{\phi}_t^T P_\phi^T [E_\phi^{(k)}]^2 P_\phi \hat{\phi}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B [E_\phi^{(k)}]^2 B^T P_\theta \hat{\theta}_t    (32)

Similar to Eq. (22), we have the following simplifications:

\hat{\theta}_t^T P_\theta^T Q_\theta^2 P_\theta \hat{\theta}_t + \eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi^2 B^T P_\theta \hat{\theta}_t = \hat{\theta}_t^T P_\theta^T Q_\theta P_\theta \hat{\theta}_t
\hat{\phi}_t^T P_\phi^T Q_\phi^2 P_\phi \hat{\phi}_t + \eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta^2 B P_\phi \hat{\phi}_t = \hat{\phi}_t^T P_\phi^T Q_\phi P_\phi \hat{\phi}_t
2 \hat{\theta}_t^T P_\theta^T Q_\theta E_\theta^{(k)} P_\theta \hat{\theta}_t + 2\eta^2 \hat{\theta}_t^T P_\theta^T B Q_\phi E_\phi^{(k)} B^T P_\theta \hat{\theta}_t = 2 \hat{\theta}_t^T P_\theta^T E_\theta^{(k)} P_\theta \hat{\theta}_t
2 \hat{\phi}_t^T P_\phi^T Q_\phi E_\phi^{(k)} P_\phi \hat{\phi}_t + 2\eta^2 \hat{\phi}_t^T P_\phi^T B^T Q_\theta E_\theta^{(k)} B P_\phi \hat{\phi}_t = 2 \hat{\phi}_t^T P_\phi^T E_\phi^{(k)} P_\phi \hat{\phi}_t

Now we can further simplify Eq. (31) as:

\|(Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t]\|^2 + \|(Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t]\|^2
= (P_\theta \hat{\theta}_t)^T [Q_\theta + 2 E_\theta^{(k)} + [E_\theta^{(k)}]^2 + \eta^2 B [E_\phi^{(k)}]^2 B^T] (P_\theta \hat{\theta}_t)
+ (P_\phi \hat{\phi}_t)^T [Q_\phi + 2 E_\phi^{(k)} + [E_\phi^{(k)}]^2 + \eta^2 B^T [E_\theta^{(k)}]^2 B] (P_\phi \hat{\phi}_t)

Using Eq. (26), Eq. (27) and the definitions of Q_\theta and Q_\phi, we have:

(P_\theta \hat{\theta}_t)^T [Q_\theta + 2 E_\theta^{(k)} + [E_\theta^{(k)}]^2 + \eta^2 B [E_\phi^{(k)}]^2 B^T] (P_\theta \hat{\theta}_t)
= (P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} (P_\theta \hat{\theta}_t) - 2 (P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} (-\eta^2 B B^T)^k (P_\theta \hat{\theta}_t)
+ (P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)(I + \eta^2 B B^T)^{-2} (-\eta^2 B B^T)^{2k} (P_\theta \hat{\theta}_t)
= (P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} (I - 2(-\eta^2 B B^T)^k + (-\eta^2 B B^T)^{2k}) (P_\theta \hat{\theta}_t)
= ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)

Similarly, we have that

(P_\phi \hat{\phi}_t)^T [Q_\phi + 2 E_\phi^{(k)} + [E_\phi^{(k)}]^2 + \eta^2 B^T [E_\theta^{(k)}]^2 B] (P_\phi \hat{\phi}_t)
= ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)^T (I + \eta^2 B^T B)^{-1} ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)

Thus we simplify the R.H.S. of Eq. (30) as

\|(Q_\theta + E_\theta^{(k)}) [P_\theta \hat{\theta}_t - \eta B P_\phi \hat{\phi}_t]\|^2 + \|(Q_\phi + E_\phi^{(k)}) [\eta B^T P_\theta \hat{\theta}_t + P_\phi \hat{\phi}_t]\|^2
= ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)
+ ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)^T (I + \eta^2 B^T B)^{-1} ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)    (33)

Let us consider the L.H.S. of Eq. (30):

\|\hat{\theta}_t^{(2k)} - (-\eta^2 B B^T)^k \hat{\theta}_t\|^2 + \|\hat{\phi}_t^{(2k)} - (-\eta^2 B^T B)^k \hat{\phi}_t\|^2
= \|\hat{\theta}_t^{(2k)}\|^2 - 2\langle \hat{\theta}_t^{(2k)}, (-\eta^2 B B^T)^k \hat{\theta}_t \rangle + \|(-\eta^2 B B^T)^k \hat{\theta}_t\|^2
+ \|\hat{\phi}_t^{(2k)}\|^2 - 2\langle \hat{\phi}_t^{(2k)}, (-\eta^2 B^T B)^k \hat{\phi}_t \rangle + \|(-\eta^2 B^T B)^k \hat{\phi}_t\|^2    (34)

Substituting Eq. (34) and Eq. (33) into the L.H.S. and R.H.S. of Eq. (30), respectively, we get:

\|\hat{\theta}_t^{(2k)}\|^2 - 2\langle \hat{\theta}_t^{(2k)}, (-\eta^2 B B^T)^k \hat{\theta}_t \rangle + \|(-\eta^2 B B^T)^k \hat{\theta}_t\|^2
+ \|\hat{\phi}_t^{(2k)}\|^2 - 2\langle \hat{\phi}_t^{(2k)}, (-\eta^2 B^T B)^k \hat{\phi}_t \rangle + \|(-\eta^2 B^T B)^k \hat{\phi}_t\|^2
= ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)
+ ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)^T (I + \eta^2 B^T B)^{-1} ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)

Now we have the following equation:

\|\hat{\theta}_t^{(2k)}\|^2 + \|\hat{\phi}_t^{(2k)}\|^2
= ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)
+ 2\langle \hat{\theta}_t^{(2k)}, (-\eta^2 B B^T)^k \hat{\theta}_t \rangle - \|(-\eta^2 B B^T)^k \hat{\theta}_t\|^2
+ ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)^T (I + \eta^2 B^T B)^{-1} ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)
+ 2\langle \hat{\phi}_t^{(2k)}, (-\eta^2 B^T B)^k \hat{\phi}_t \rangle - \|(-\eta^2 B^T B)^k \hat{\phi}_t\|^2

Note that, by the Cauchy-Schwarz and Young inequalities,

2\langle \hat{\theta}_t^{(2k)}, (-\eta^2 B B^T)^k \hat{\theta}_t \rangle \leq 2 |\langle (\eta^2 B B^T)^{k/2} \hat{\theta}_t^{(2k)}, (\eta^2 B B^T)^{k/2} \hat{\theta}_t \rangle|    (35)
\leq \eta^{2k} (\hat{\theta}_t^{(2k)})^T (B B^T)^k \hat{\theta}_t^{(2k)} + \eta^{2k} \hat{\theta}_t^T (B B^T)^k \hat{\theta}_t    (36)

Similarly,

2\langle \hat{\phi}_t^{(2k)}, (-\eta^2 B^T B)^k \hat{\phi}_t \rangle \leq 2 |\langle (\eta^2 B^T B)^{k/2} \hat{\phi}_t^{(2k)}, (\eta^2 B^T B)^{k/2} \hat{\phi}_t \rangle|    (37)
\leq \eta^{2k} (\hat{\phi}_t^{(2k)})^T (B^T B)^k \hat{\phi}_t^{(2k)} + \eta^{2k} \hat{\phi}_t^T (B^T B)^k \hat{\phi}_t    (38)

Summing everything together, we have:

(\hat{\theta}_t^{(2k)})^T (I - (\eta^2 B B^T)^k) \hat{\theta}_t^{(2k)} + (\hat{\phi}_t^{(2k)})^T (I - (\eta^2 B^T B)^k) \hat{\phi}_t^{(2k)}
\leq ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)^T (I + \eta^2 B B^T)^{-1} ((I - (-\eta^2 B B^T)^k) P_\theta \hat{\theta}_t)
+ \hat{\theta}_t^T (\eta^2 B B^T)^k \hat{\theta}_t - \|(-\eta^2 B B^T)^k \hat{\theta}_t\|^2
+ ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)^T (I + \eta^2 B^T B)^{-1} ((I - (-\eta^2 B^T B)^k) P_\phi \hat{\phi}_t)
+ \hat{\phi}_t^T (\eta^2 B^T B)^k \hat{\phi}_t - \|(-\eta^2 B^T B)^k \hat{\phi}_t\|^2

Recall that the trajectory \{\hat{\theta}_t, \hat{\phi}_t\}_{t\geq 0} is assumed not to lie in the kernels of B B^T and B^T B, so B B^T \hat{\theta}_t \neq 0 and B^T B \hat{\phi}_t \neq 0. Using the bound above, the symmetry P_\theta = P_\theta^T and P_\phi = P_\phi^T, and the fact that B B^T and B^T B share the same set of non-zero eigenvalues (below, \lambda(\cdot) denotes the worst-case eigenvalue in each term, with extreme non-zero values \lambda_{\min}(\cdot) and \lambda_{\max}(\cdot)), we can write

(1 - (\eta^2 \lambda_{\max}(B B^T))^k) \|\hat{\theta}_t^{(2k)}\|^2 + (1 - (\eta^2 \lambda_{\max}(B^T B))^k) \|\hat{\phi}_t^{(2k)}\|^2
\leq \frac{(1 - (-\eta^2 \lambda(B B^T))^k)^2 \rho^2(I - \eta A) \|\hat{\theta}_t\|^2}{1 + \eta^2 \lambda_{\min}(B B^T)} + (\eta^2 \lambda(B B^T))^k (1 - (\eta^2 \lambda(B B^T))^k) \|\hat{\theta}_t\|^2
+ \frac{(1 - (-\eta^2 \lambda(B^T B))^k)^2 \rho^2(I + \eta C) \|\hat{\phi}_t\|^2}{1 + \eta^2 \lambda_{\min}(B^T B)} + (\eta^2 \lambda(B^T B))^k (1 - (\eta^2 \lambda(B^T B))^k) \|\hat{\phi}_t\|^2

Dividing through by the prefactors on the left-hand side then yields

\|\hat{\theta}_t^{(2k)}\|^2 + \|\hat{\phi}_t^{(2k)}\|^2 \leq \frac{(1 - (-\eta^2 \lambda(B B^T))^k)^2 \rho^2(I - \eta A) \|\hat{\theta}_t\|^2}{(1 + \eta^2 \lambda_{\min}(B B^T))(1 - (\eta^2 \lambda_{\max}(B B^T))^k)} + \frac{(\eta^2 \lambda(B B^T))^k (1 - (\eta^2 \lambda(B B^T))^k) \|\hat{\theta}_t\|^2}{1 - (\eta^2 \lambda_{\max}(B B^T))^k}
+ \frac{(1 - (-\eta^2 \lambda(B^T B))^k)^2 \rho^2(I + \eta C) \|\hat{\phi}_t\|^2}{(1 + \eta^2 \lambda_{\min}(B^T B))(1 - (\eta^2 \lambda_{\max}(B^T B))^k)} + \frac{(\eta^2 \lambda(B^T B))^k (1 - (\eta^2 \lambda(B^T B))^k) \|\hat{\phi}_t\|^2}{1 - (\eta^2 \lambda_{\max}(B^T B))^k}

Let us define the distances as:

\|r_t^{(k)}\|^2 = \|\hat{\theta}_t^{(k)}\|^2 + \|\hat{\phi}_t^{(k)}\|^2    (40)
\|r_t\|^2 = \|\hat{\theta}_t\|^2 + \|\hat{\phi}_t\|^2    (41)

Then we have

\|r_t^{(2k)}\|^2 \leq \frac{(1 - (-\eta^2 \lambda(B B^T))^k)^2 (\rho^2(I - \eta A) \|\hat{\theta}_t\|^2 + \rho^2(I + \eta C) \|\hat{\phi}_t\|^2)}{(1 + \eta^2 \lambda_{\min}(B B^T))(1 - (\eta^2 \lambda_{\max}(B B^T))^k)} + \frac{(\eta^2 \lambda(B B^T))^k (1 - (\eta^2 \lambda(B B^T))^k)}{1 - (\eta^2 \lambda_{\max}(B B^T))^k} \|r_t\|^2
= a \left( \frac{\rho^2(I - \eta A) \|\hat{\theta}_t\|^2 + \rho^2(I + \eta C) \|\hat{\phi}_t\|^2}{1 + \eta^2 \lambda_{\min}(B B^T)} \right) + b \|r_t\|^2    (42)

where

a = \begin{cases} \frac{(1 + (\eta^2 \lambda_{\max}(B B^T))^k)^2}{1 - (\eta^2 \lambda_{\max}(B B^T))^k} & \text{for odd } k \\ \frac{(1 - (\eta^2 \lambda_{\min}(B B^T))^k)^2}{1 - (\eta^2 \lambda_{\max}(B B^T))^k} & \text{for even } k \end{cases}

and

b = \frac{(\eta^2 \lambda_{\max}(B B^T))^k (1 - (\eta^2 \lambda_{\min}(B B^T))^k)}{1 - (\eta^2 \lambda_{\max}(B B^T))^k}    (43)

Since \eta < L^{-1} implies \eta^2 \lambda_{\max}(B B^T) \leq (\eta L)^2 < 1, we have a \to 1 and b \to 0 geometrically as k \to \infty. Hence, for k large enough, the bound (42) approaches the SPPM bound of Theorem A.1, and whenever condition (25) holds we obtain \|r_t^{(2k)}\|^2 < \|r_t\|^2 in a neighborhood of the stationary point, which establishes the claimed local convergence. ∎
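As a quick numerical illustration (the spectral values below are assumed, not taken from the paper), a and b indeed approach 1 and 0 geometrically as the reasoning depth k grows:

```python
# Behavior of the coefficients a and b from Eq. (43) as k grows, for assumed
# spectral values of B B^T satisfying eta^2 * lambda_max < 1.
eta, lam_min, lam_max = 0.1, 0.5, 4.0
for k in [1, 2, 4, 8, 16]:
    x_min, x_max = (eta**2 * lam_min)**k, (eta**2 * lam_max)**k
    a = ((1 + x_max)**2 if k % 2 else (1 - x_min)**2) / (1 - x_max)
    b = x_max * (1 - x_min) / (1 - x_max)
    print(f"k={k:2d}  a={a:.6f}  b={b:.2e}")     # a -> 1 and b -> 0, recovering SPPM
```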

A.3.1 Proof of Lemma A.1

Let 𝑩=𝑼𝚺𝑽T\bm{B}=\bm{U}\bm{\Sigma}\bm{V}^{T} be the singular value decomposition of 𝑩\bm{B}. Here 𝑼\bm{U} and 𝑽\bm{V} are orthonormal matrices and 𝚺\bm{\Sigma} is a rectangular diagonal matrix. Then we have:

𝑸𝜽𝑩\displaystyle\bm{Q}_{\bm{\theta}}\bm{B} =(𝑰+η2𝑼𝚺𝑽T𝑽𝚺T𝑼T)1𝑼𝚺𝑽T\displaystyle=(\bm{I}+\eta^{2}\bm{U}\bm{\Sigma}\bm{V}^{T}\bm{V}\bm{\Sigma}^{T}\bm{U}^{T})^{-1}\bm{U}\bm{\Sigma}\bm{V}^{T}
=(𝑼(η2𝚺𝚺T+𝑰)𝑼T)1𝑼𝚺𝑽T\displaystyle=(\bm{U}(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})\bm{U}^{T})^{-1}\bm{U}\bm{\Sigma}\bm{V}^{T}
=𝑼(η2𝚺𝚺T+𝑰)1𝑼T𝑼𝚺𝑽T\displaystyle=\bm{U}(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1}\bm{U}^{T}\bm{U}\bm{\Sigma}\bm{V}^{T}
=𝑼(η2𝚺𝚺T+𝑰)1𝚺𝑽T\displaystyle=\bm{U}(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1}\bm{\Sigma}\bm{V}^{T}

Here we used the fact that 𝑼\bm{U} and 𝑽\bm{V} are orthonormal matrices. Now, we simplify the other side to get:

𝑩𝑸ϕ\displaystyle\bm{B}\bm{Q}_{\bm{\phi}} =𝑼𝚺𝑽T(𝑰+η2𝑽𝚺T𝑼T𝑼𝚺𝑽T)1\displaystyle=\bm{U}\bm{\Sigma}\bm{V}^{T}(\bm{I}+\eta^{2}\bm{V}\bm{\Sigma}^{T}\bm{U}^{T}\bm{U}\bm{\Sigma}\bm{V}^{T})^{-1}
=𝑼𝚺𝑽T(𝑽(η2𝚺T𝚺+𝑰)𝑽T)1\displaystyle=\bm{U}\bm{\Sigma}\bm{V}^{T}(\bm{V}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})\bm{V}^{T})^{-1}
=𝑼𝚺𝑽T𝑽(η2𝚺T𝚺+𝑰)1𝑽T\displaystyle=\bm{U}\bm{\Sigma}\bm{V}^{T}\bm{V}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1}\bm{V}^{T}
=𝑼𝚺(η2𝚺T𝚺+𝑰)1𝑽T\displaystyle=\bm{U}\bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1}\bm{V}^{T}

Now we consider the following equation:

η2𝚺𝚺T𝚺+𝚺=𝚺(η2𝚺T𝚺+𝑰)=(η2𝚺𝚺T+𝑰)𝚺\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{\Sigma}=\bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})=(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})\bm{\Sigma} (44)

which indicates that \bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})=(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})\bm{\Sigma}. Multiplying both sides of this equation on the left by (\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1} and on the right by (\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1}, we have:

(η2𝚺𝚺T+𝑰)1𝚺(η2𝚺T𝚺+𝑰)(η2𝚺T𝚺+𝑰)1\displaystyle(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1}\bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1} =(η2𝚺𝚺T+𝑰)1(η2𝚺𝚺T+𝑰)𝚺(η2𝚺T𝚺+𝑰)1\displaystyle=(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1}(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})\bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1}
(η2𝚺𝚺T+𝑰)1𝚺\displaystyle(\eta^{2}\bm{\Sigma}\bm{\Sigma}^{T}+\bm{I})^{-1}\bm{\Sigma} =𝚺(η2𝚺T𝚺+𝑰)1\displaystyle=\bm{\Sigma}(\eta^{2}\bm{\Sigma}^{T}\bm{\Sigma}+\bm{I})^{-1}

Therefore, we have 𝑸𝜽𝑩=𝑩𝑸ϕ\bm{Q}_{\bm{\theta}}\bm{B}=\bm{B}\bm{Q}_{\bm{\phi}}. Using a similar argument, we can also prove the equality in Equation (20).
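The identity in Lemma A.1 can also be checked numerically. The following minimal numpy sketch (ours, not part of the original derivation) verifies (\bm{I}+\eta^{2}\bm{B}\bm{B}^{T})^{-1}\bm{B}=\bm{B}(\bm{I}+\eta^{2}\bm{B}^{T}\bm{B})^{-1} for a random rectangular \bm{B}; all dimensions and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, eta = 4, 6, 0.3
B = rng.standard_normal((m, n))

# Q_theta and Q_phi as defined in the lemma.
Q_theta = np.linalg.inv(np.eye(m) + eta**2 * B @ B.T)
Q_phi = np.linalg.inv(np.eye(n) + eta**2 * B.T @ B)

# The two products agree up to floating-point error.
assert np.allclose(Q_theta @ B, B @ Q_phi)
```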

A.4 Theorem 5.1 without kernel assumption

Theorem A.2.

Consider the (Minimax) problem under Assumption 3.1 and Lv.k GP. Let (\bm{\theta}^{*},\bm{\phi}^{*}) be a stationary point. Suppose \eta<L^{-1}. There exists a neighborhood \mathcal{U} of (\bm{\theta}^{*},\bm{\phi}^{*}) such that if SPPM is started at (\bm{\theta}_{0},\bm{\phi}_{0})\in\mathcal{U}, the iterates \{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0} generated by SPPM satisfy:

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}\leq\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}\bm{B}^{T})}+\frac{\rho^{2}(1+\eta\bm{C})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}^{T}\bm{B})}.

Moreover, for any \eta satisfying:

\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}\bm{B}^{T})}+\frac{\rho^{2}(1+\eta\bm{C})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}^{T}\bm{B})}<\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2} (45)

SPPM converges asymptotically to (𝛉,ϕ)(\bm{\theta}^{*},\bm{\phi}^{*}).

Proof.

Following the same setting and procedure as in Appendix A.2, we have that

𝜽^t+12+ϕ^t+12=𝜽^tT𝑷𝜽T(𝑰+η2𝑩𝑩T)1𝑷𝜽𝜽^t+ϕ^tT𝑷ϕT(𝑰+η2𝑩T𝑩)1𝑷ϕϕ^t\lVert\hat{\bm{\theta}}_{t+1}\rVert^{2}+\lVert\hat{\bm{\phi}}_{t+1}\rVert^{2}=\hat{\bm{\theta}}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{I}+\eta^{2}\bm{B}\bm{B}^{T})^{-1}\bm{P}_{\bm{\theta}}\hat{\bm{\theta}}_{t}+\hat{\bm{\phi}}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}(\bm{I}+\eta^{2}\bm{B}^{T}\bm{B})^{-1}\bm{P}_{\bm{\phi}}\hat{\bm{\phi}}_{t} (46)

Now using the fact that 𝑷𝜽=𝑷𝜽T\bm{P}_{\bm{\theta}}=\bm{P}_{\bm{\theta}}^{T}, 𝑷ϕ=𝑷ϕT\bm{P}_{\bm{\phi}}=\bm{P}_{\bm{\phi}}^{T}, we can write

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}\leq\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}\bm{B}^{T})}+\frac{\rho^{2}(1+\eta\bm{C})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}^{T}\bm{B})}.

For any \eta satisfying

\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}\bm{B}^{T})}+\frac{\rho^{2}(1+\eta\bm{C})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{B}^{T}\bm{B})}<\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2} (47)

we have that

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}<\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2} (48)

i.e., SPPM converges asymptotically towards (𝜽,ϕ)(\bm{\theta}^{*},\bm{\phi}^{*}). ∎

A.5 Proof of Theorem 5.2

Proof.

In order to prove the convergence of SPPM in bilinear games, we first show that the SPPM update rule is equivalent to that of the following Proximal Point Method (PPM):

{𝜽t+1=𝜽tηθf(𝜽t+1,ϕt+1)ϕt+1=ϕt+ηϕf(𝜽t+1,ϕt+1)\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t+1},\bm{\phi}_{t+1})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t+1},\bm{\phi}_{t+1})\end{cases} (49)

In the Bilinear game, the SPPM update is:

\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t+1})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t+1},\bm{\phi}_{t})\end{cases}
=\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\bm{M}\bm{\phi}_{t+1}\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\bm{M}^{T}\bm{\theta}_{t+1}\end{cases}

and the PPM update is:

{𝜽t+1=𝜽tηθf(𝜽t+1,ϕt+1)ϕt+1=ϕt+ηϕf(𝜽t+1,ϕt+1)\displaystyle\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t+1},\bm{\phi}_{t+1})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t+1},\bm{\phi}_{t+1})\end{cases}
={𝜽t+1=𝜽tη𝑴ϕt+1ϕt+1=ϕt+η𝑴T𝜽t+1\displaystyle=\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\bm{M}\bm{\phi}_{t+1}\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\bm{M}^{T}\bm{\theta}_{t+1}\end{cases}

Thus SPPM and PPM are equivalent in the Bilinear game. The convergence result of PPM in bilinear games has been proved in Theorem 2 of [49]:

Theorem A.3.

Consider the Bilinear game and the PPM method. Further, we define r_{t}=\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}. Then, for any \eta>0, the iterates \{\bm{\theta}_{t},\bm{\phi}_{t}\}_{t\geq 0} generated by PPM satisfy

rt+111+η2λmin(𝑴T𝑴)rt.r_{t+1}\leq\frac{1}{1+\eta^{2}\lambda_{\min}(\bm{M}^{T}\bm{M})}r_{t}. (50)

Therefore, SPPM and PPM have the same convergence property in bilinear games. ∎
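As a numerical sanity check, the following numpy sketch (ours) iterates the closed-form PPM/SPPM update for a random bilinear game f(\bm{\theta},\bm{\phi})=\bm{\theta}^{T}\bm{M}\bm{\phi} and asserts the per-step rate of Theorem A.3. The implicit update reduces to two linear solves; the matrix \bm{M}, step size, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
M = rng.standard_normal((n, n))
eta = 0.5
# Per-step contraction factor from Theorem A.3.
rate = 1.0 / (1.0 + eta**2 * np.linalg.eigvalsh(M.T @ M).min())

theta = rng.standard_normal(n)
phi = rng.standard_normal(n)
I = np.eye(n)
for _ in range(100):
    r = theta @ theta + phi @ phi
    # Implicit PPM update, eliminated into two explicit linear solves.
    theta, phi = (np.linalg.solve(I + eta**2 * M @ M.T, theta - eta * M @ phi),
                  np.linalg.solve(I + eta**2 * M.T @ M, eta * M.T @ theta + phi))
    assert theta @ theta + phi @ phi <= rate * r * (1.0 + 1e-9)
```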

A.6 Proof of Theorem 5.3

Proof.

Consider the learning dynamics:

\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t+1})
\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t+1},\bm{\phi}_{t})

In the Quadratic game, the SPPM update rule can be written as:

\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\bm{A}\bm{\theta}_{t}-\eta\bm{C}\bm{\phi}_{t+1}\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\bm{B}\bm{\phi}_{t}+\eta\bm{C}^{T}\bm{\theta}_{t+1}\end{cases} (51)

Then we can rewrite the learning dynamics:

𝜽t+1\displaystyle\bm{\theta}_{t+1} =𝜽tη𝑨𝜽tη𝑪ϕt+1\displaystyle=\bm{\theta}_{t}-\eta\bm{A}\bm{\theta}_{t}-\eta\bm{C}\bm{\phi}_{t+1}
𝜽t+1\displaystyle\bm{\theta}_{t+1} =𝜽tη𝑨𝜽tη𝑪(ϕt+η𝑪T𝜽t+1+η𝑩ϕt)\displaystyle=\bm{\theta}_{t}-\eta\bm{A}\bm{\theta}_{t}-\eta\bm{C}(\bm{\phi}_{t}+\eta\bm{C}^{T}\bm{\theta}_{t+1}+\eta\bm{B}\bm{\phi}_{t})
(𝑰+η2𝑪𝑪T)𝜽t+1\displaystyle(\bm{I}+\eta^{2}\bm{CC}^{T})\bm{\theta}_{t+1} =(𝑰η𝑨)𝜽tη𝑪(𝑰+η𝑩)ϕt\displaystyle=(\bm{I}-\eta\bm{A})\bm{\theta}_{t}-\eta\bm{C}(\bm{I}+\eta\bm{B})\bm{\phi}_{t}
𝜽t+1\displaystyle\bm{\theta}_{t+1} =(𝑰+η2𝑪𝑪T)1[(𝑰η𝑨)𝜽tη𝑪(𝑰+η𝑩)ϕt]\displaystyle=(\bm{I}+\eta^{2}\bm{CC}^{T})^{-1}\left[(\bm{I}-\eta\bm{A})\bm{\theta}_{t}-\eta\bm{C}(\bm{I}+\eta\bm{B})\bm{\phi}_{t}\right] (52)

Similarly, for the other player we have

ϕt+1\displaystyle\bm{\phi}_{t+1} =ϕt+η𝑪T𝜽t+1+η𝑩ϕt\displaystyle=\bm{\phi}_{t}+\eta\bm{C}^{T}\bm{\theta}_{t+1}+\eta\bm{B}\bm{\phi}_{t}
ϕt+1\displaystyle\bm{\phi}_{t+1} =ϕt+η𝑪T(𝜽tη𝑨𝜽tη𝑪ϕt+1)+η𝑩ϕt\displaystyle=\bm{\phi}_{t}+\eta\bm{C}^{T}(\bm{\theta}_{t}-\eta\bm{A}\bm{\theta}_{t}-\eta\bm{C}\bm{\phi}_{t+1})+\eta\bm{B}\bm{\phi}_{t}
(𝑰+η2𝑪T𝑪)ϕt+1\displaystyle(\bm{I}+\eta^{2}\bm{C}^{T}\bm{C})\bm{\phi}_{t+1} =η𝑪T(𝑰η𝑨)𝜽t+(𝑰+η𝑩)ϕt\displaystyle=\eta\bm{C}^{T}(\bm{I}-\eta\bm{A})\bm{\theta}_{t}+(\bm{I}+\eta\bm{B})\bm{\phi}_{t}
ϕt+1\displaystyle\bm{\phi}_{t+1} =(𝑰+η2𝑪T𝑪)1[η𝑪T(𝑰η𝑨)𝜽t+(𝑰+η𝑩)ϕt]\displaystyle=(\bm{I}+\eta^{2}\bm{C}^{T}\bm{C})^{-1}\left[\eta\bm{C}^{T}(\bm{I}-\eta\bm{A})\bm{\theta}_{t}+(\bm{I}+\eta\bm{B})\bm{\phi}_{t}\right] (53)

Let us define the symmetric matrices \bm{Q}_{\bm{\theta}}=(\bm{I}+\eta^{2}\bm{CC}^{T})^{-1}, \bm{Q}_{\bm{\phi}}=(\bm{I}+\eta^{2}\bm{C}^{T}\bm{C})^{-1} and \bm{P}_{\bm{\theta}}=(\bm{I}-\eta\bm{A}), \bm{P}_{\bm{\phi}}=(\bm{I}+\eta\bm{B}). Further, we define r_{t}=\lVert\bm{\theta}_{t}\rVert^{2}+\lVert\bm{\phi}_{t}\rVert^{2}. Based on these definitions and the expressions in (52) and (53), we have

\lVert\bm{\theta}_{t+1}\rVert^{2}+\lVert\bm{\phi}_{t+1}\rVert^{2}=\lVert\bm{Q}_{\bm{\theta}}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\lVert\bm{Q}_{\bm{\phi}}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}-2\eta\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}+2\eta\bm{\phi}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}\bm{Q}_{\bm{\phi}}^{T}\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t} (54)

To simplify the expression in (54) we use Lemma A.1 to obtain the following equations:

𝑸𝜽𝑪\displaystyle\bm{Q}_{\bm{\theta}}\bm{C} =𝑪𝑸ϕ\displaystyle=\bm{C}\bm{Q}_{\bm{\phi}} (55)
𝑸ϕ𝑪T\displaystyle\bm{Q}_{\bm{\phi}}\bm{C}^{T} =𝑪T𝑸𝜽\displaystyle=\bm{C}^{T}\bm{Q}_{\bm{\theta}} (56)

Using this lemma, we can show that

𝜽tT𝑷𝜽T𝑸𝜽T𝑸𝜽𝑪𝑷ϕϕt=𝜽tT𝑷𝜽T𝑸𝜽T𝑪𝑸ϕ𝑷ϕϕt=ϕtT𝑷ϕT𝑸ϕT𝑪T𝑸𝜽𝑷𝜽𝜽t=ϕtT𝑷ϕT𝑸ϕT𝑸ϕ𝑪T𝑷𝜽𝜽t\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}^{T}\bm{C}\bm{Q}_{\bm{\phi}}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}=\bm{\phi}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}\bm{Q}_{\bm{\phi}}^{T}\bm{C}^{T}\bm{Q}_{\bm{\theta}}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}=\bm{\phi}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}\bm{Q}_{\bm{\phi}}^{T}\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}

where the intermediate equality holds because the expression is a scalar and therefore equals its own transpose. Hence, the expression in (54) can be simplified as

\lVert\bm{\theta}_{t+1}\rVert^{2}+\lVert\bm{\phi}_{t+1}\rVert^{2}=\lVert\bm{Q}_{\bm{\theta}}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\lVert\bm{Q}_{\bm{\phi}}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2} (57)

We simplify equation (57) as follows. Consider the term involving 𝜽t\bm{\theta}_{t}. We have

𝑸𝜽𝑷𝜽𝜽t2+η2𝑸ϕ𝑪T𝑷𝜽𝜽t2\displaystyle\lVert\bm{Q}_{\bm{\theta}}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2} =𝜽tT𝑷𝜽T𝑸𝜽2𝑷𝜽𝜽t+η2𝜽tT𝑷𝜽T𝑪𝑸ϕ2𝑪T𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}\bm{Q}_{\bm{\theta}}^{2}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}+\eta^{2}\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}\bm{C}\bm{Q}_{\bm{\phi}}^{2}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}
=𝜽tT𝑷𝜽T(𝑸𝜽2+η2𝑪𝑸ϕ2𝑪T)𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{Q}_{\bm{\theta}}^{2}+\eta^{2}\bm{C}\bm{Q}_{\bm{\phi}}^{2}\bm{C}^{T})\bm{P}_{\bm{\theta}}\bm{\theta}_{t}
=𝜽tT𝑷𝜽T(𝑸𝜽2+η2𝑪𝑸ϕ𝑪T𝑸𝜽)𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{Q}_{\bm{\theta}}^{2}+\eta^{2}\bm{C}\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{Q}_{\bm{\theta}})\bm{P}_{\bm{\theta}}\bm{\theta}_{t}
=𝜽tT𝑷𝜽T(𝑸𝜽2+η2𝑪𝑪T𝑸𝜽𝑸𝜽)𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{Q}_{\bm{\theta}}^{2}+\eta^{2}\bm{C}\bm{C}^{T}\bm{Q}_{\bm{\theta}}\bm{Q}_{\bm{\theta}})\bm{P}_{\bm{\theta}}\bm{\theta}_{t}
=𝜽tT𝑷𝜽T(𝑰+η2𝑪𝑪T)𝑸𝜽2𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{I}+\eta^{2}\bm{C}\bm{C}^{T})\bm{Q}_{\bm{\theta}}^{2}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}
=𝜽tT𝑷𝜽T(𝑰+η2𝑪𝑪T)1𝑷𝜽𝜽t\displaystyle=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{I}+\eta^{2}\bm{C}\bm{C}^{T})^{-1}\bm{P}_{\bm{\theta}}\bm{\theta}_{t} (58)

where the last equality follows by replacing 𝑸𝜽\bm{Q}_{\bm{\theta}} by its definition. The same procedure follows for the term involving ϕt\bm{\phi}_{t} which leads to the expression

𝑸ϕ𝑷ϕϕt2+η2𝑸𝜽𝑪𝑷ϕϕt2=ϕtT𝑷ϕT(𝑰+η2𝑪T𝑪)1𝑷ϕϕt.\lVert\bm{Q}_{\bm{\phi}}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}=\bm{\phi}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}(\bm{I}+\eta^{2}\bm{C}^{T}\bm{C})^{-1}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}. (59)

Substitute 𝑸𝜽𝑷𝜽𝜽t2+η2𝑸ϕ𝑪T𝑷𝜽𝜽t2\lVert\bm{Q}_{\bm{\theta}}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\phi}}\bm{C}^{T}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}\rVert^{2} and 𝑸ϕ𝑷ϕϕt2+η2𝑸𝜽𝑪𝑷ϕϕt2\lVert\bm{Q}_{\bm{\phi}}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2}+\eta^{2}\lVert\bm{Q}_{\bm{\theta}}\bm{C}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}\rVert^{2} in (57) with the expressions in (58) and (59), respectively, to obtain

𝜽t+12+ϕt+12=𝜽tT𝑷𝜽T(𝑰+η2𝑪𝑪T)1𝑷𝜽𝜽t+ϕtT𝑷ϕT(𝑰+η2𝑪T𝑪)1𝑷ϕϕt.\lVert\bm{\theta}_{t+1}\rVert^{2}+\lVert\bm{\phi}_{t+1}\rVert^{2}=\bm{\theta}_{t}^{T}\bm{P}_{\bm{\theta}}^{T}(\bm{I}+\eta^{2}\bm{C}\bm{C}^{T})^{-1}\bm{P}_{\bm{\theta}}\bm{\theta}_{t}+\bm{\phi}_{t}^{T}\bm{P}_{\bm{\phi}}^{T}(\bm{I}+\eta^{2}\bm{C}^{T}\bm{C})^{-1}\bm{P}_{\bm{\phi}}\bm{\phi}_{t}. (60)

Now using the expression in (60) and the fact that 𝑷𝜽=𝑷𝜽T\bm{P}_{\bm{\theta}}=\bm{P}_{\bm{\theta}}^{T}, 𝑷ϕ=𝑷ϕT\bm{P}_{\bm{\phi}}=\bm{P}_{\bm{\phi}}^{T} and λmin(𝑪T𝑪)=λmin(𝑪𝑪T)\lambda_{\min}(\bm{C}^{T}\bm{C})=\lambda_{\min}(\bm{C}\bm{C}^{T}), we can write

\lVert\bm{\theta}_{t+1}-\bm{\theta}^{*}\rVert^{2}+\lVert\bm{\phi}_{t+1}-\bm{\phi}^{*}\rVert^{2}\leq\frac{\rho^{2}(\bm{I}-\eta\bm{A})\lVert\bm{\theta}_{t}-\bm{\theta}^{*}\rVert^{2}+\rho^{2}(1+\eta\bm{B})\lVert\bm{\phi}_{t}-\bm{\phi}^{*}\rVert^{2}}{1+\eta^{2}\lambda_{\min}(\bm{C}^{T}\bm{C})}. ∎
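The closed-form updates (52) and (53) can be exercised directly. The numpy sketch below (ours) builds a toy instance with positive definite \bm{A}, negative definite \bm{B}, and \bm{C}=c\bm{I}, in the spirit of the experiments in Appendix A.8; it checks that the closed form solves the implicit SPPM equations and that the iterates contract toward the equilibrium at the origin.

```python
import numpy as np

rng = np.random.default_rng(3)
n, eta, c = 5, 0.05, 1.0
G = rng.standard_normal((n, n))
A = G @ G.T + np.eye(n)              # symmetric positive definite
H = rng.standard_normal((n, n))
B = -(H @ H.T + np.eye(n))           # symmetric negative definite
C = c * np.eye(n)                    # interaction matrix
I = np.eye(n)

theta = rng.standard_normal(n)
phi = rng.standard_normal(n)
for _ in range(200):
    r = theta @ theta + phi @ phi
    theta_next = np.linalg.solve(I + eta**2 * C @ C.T,
                                 (I - eta * A) @ theta
                                 - eta * C @ (I + eta * B) @ phi)
    phi_next = np.linalg.solve(I + eta**2 * C.T @ C,
                               eta * C.T @ (I - eta * A) @ theta
                               + (I + eta * B) @ phi)
    # The closed form satisfies the implicit SPPM equations.
    assert np.allclose(theta_next, theta - eta * A @ theta - eta * C @ phi_next)
    assert np.allclose(phi_next, phi + eta * B @ phi + eta * C.T @ theta_next)
    theta, phi = theta_next, phi_next
    assert theta @ theta + phi @ phi < r   # monotone contraction
```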

A.7 Competitive Gradient Descent as an Approximation of SPPM

In this section, we justify the claim in Section 4.1 that Competitive Gradient Descent is a first-order Taylor approximation of SPPM. First, we consider the standard definition of CGD:

{𝜽t+1=𝜽tη(𝑰+η2θϕf(𝜽t,ϕt)ϕθf(𝜽t,ϕt))1(θf(𝜽t,ϕt)+ηθϕf(𝜽t,ϕt)ϕf(𝜽t,ϕt))ϕt+1=ϕt+η(𝑰+η2ϕθf(𝜽t,ϕt)θϕf(𝜽t,ϕt))1(ϕf(𝜽t,ϕt)ηϕθf(𝜽t,ϕt)θf(𝜽t,ϕt))\displaystyle\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta(\bm{I}+\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))^{-1}(\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t}))\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta(\bm{I}+\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t}))^{-1}(\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))\end{cases}

Rewriting the update rules we can get:

(𝑰+η2θϕf(𝜽t,ϕt)ϕθf(𝜽t,ϕt))(𝜽t+1𝜽t)=η(θf(𝜽t,ϕt)+ηθϕf(𝜽t,ϕt)ϕf(𝜽t,ϕt))\displaystyle(\bm{I}+\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))(\bm{\theta}_{t+1}-\bm{\theta}_{t})=-\eta(\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t}))
𝜽t+1=𝜽tηθf(𝜽t,ϕt)η2θϕf(𝜽t,ϕt)ϕf(𝜽t,ϕt)η2θϕf(𝜽t,ϕt)ϕθf(𝜽t,ϕt)(𝜽t+1𝜽t)\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t})

Similarly, we have:

\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\phi}_{t+1}-\bm{\phi}_{t})

Therefore, CGD is a first-order approximation of SPPM. Next, we show that the standard definition of CGD is equivalent to the update rule in Table 1. Considering the update rule in Table 1 and its footnote, we have:

{𝜽t+1=𝜽tηθf(𝜽t,ϕt)ηθϕf(𝜽t,ϕt)(ϕt+1ϕt)ϕt+1=ϕt+ηϕf(𝜽t,ϕt)+ηϕθf(𝜽t,ϕt)(𝜽t+1𝜽t)\displaystyle\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\bm{\phi}_{t+1}-\bm{\phi}_{t})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\phi\theta}f(\bm{\theta}_{t},\phi_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t})\end{cases} (61)

Substituting the expression for (\bm{\phi}_{t+1}-\bm{\phi}_{t}) from the second equation of (61) into the first, we get:

𝜽t+1=𝜽tηθf(𝜽t,ϕt)ηθϕf(𝜽t,ϕt)(ηϕf(𝜽t,ϕt)+ηϕθf(𝜽t,ϕt)(𝜽t+1𝜽t))\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})(\eta\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\phi\theta}f(\bm{\theta}_{t},\phi_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t}))
𝜽t+1=𝜽tηθf(𝜽t,ϕt)η2θϕf(𝜽t,ϕt)ϕf(𝜽t,ϕt)η2θϕf(𝜽t,ϕt)ϕθf(𝜽t,ϕt)(𝜽t+1𝜽t)\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\phi_{t})(\bm{\theta}_{t+1}-\bm{\theta}_{t})
(\bm{I}+\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))(\bm{\theta}_{t+1}-\bm{\theta}_{t})=-\eta\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})
𝜽t+1=𝜽tη(𝑰+η2θϕf(𝜽t,ϕt)ϕθf(𝜽t,ϕt))1(θf(𝜽t,ϕt)+ηθϕf(𝜽t,ϕt)ϕf(𝜽t,ϕt)).\displaystyle\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta(\bm{I}+\eta^{2}\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))^{-1}(\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})+\eta\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})).

Substituting the expression for (\bm{\theta}_{t+1}-\bm{\theta}_{t}) into the second equation of (61) and applying a similar argument, we get:

ϕt+1=ϕt+η(𝑰+η2ϕθf(𝜽t,ϕt)θϕf(𝜽t,ϕt))1(ϕf(𝜽t,ϕt)ηϕθf(𝜽t,ϕt)θf(𝜽t,ϕt))\bm{\phi}_{t+1}=\bm{\phi}_{t}+\eta(\bm{I}+\eta^{2}\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta\phi}f(\bm{\theta}_{t},\bm{\phi}_{t}))^{-1}(\nabla_{\phi}f(\bm{\theta}_{t},\bm{\phi}_{t})-\eta\nabla_{\phi\theta}f(\bm{\theta}_{t},\bm{\phi}_{t})\nabla_{\theta}f(\bm{\theta}_{t},\bm{\phi}_{t}))

Thus the update rule in Table 1 is equivalent to the standard definition of CGD, which in turn is a first-order Taylor approximation of SPPM.
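The equivalence shown above is easy to verify numerically. In the numpy sketch below (ours), g_theta, g_phi, and Cmat stand in for the gradients and the mixed second derivative at (\bm{\theta}_{t},\bm{\phi}_{t}); the coupled update (61) is solved as one linear system and compared against the standard closed-form CGD step.

```python
import numpy as np

rng = np.random.default_rng(4)
n, eta = 5, 0.1
g_theta = rng.standard_normal(n)        # stand-in for grad_theta f
g_phi = rng.standard_normal(n)          # stand-in for grad_phi f
Cmat = rng.standard_normal((n, n))      # stand-in for the cross Hessian
I = np.eye(n)

# Table-1 form: solve the coupled system for the joint step (d_theta, d_phi).
K = np.block([[I, eta * Cmat], [-eta * Cmat.T, I]])
rhs = np.concatenate([-eta * g_theta, eta * g_phi])
d = np.linalg.solve(K, rhs)
d_theta, d_phi = d[:n], d[n:]

# Standard closed-form CGD step.
d_theta_std = -eta * np.linalg.solve(I + eta**2 * Cmat @ Cmat.T,
                                     g_theta + eta * Cmat @ g_phi)
d_phi_std = eta * np.linalg.solve(I + eta**2 * Cmat.T @ Cmat,
                                  g_phi - eta * Cmat.T @ g_theta)
assert np.allclose(d_theta, d_theta_std) and np.allclose(d_phi, d_phi_std)
```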

A.8 Experiments on Bilinear and Quadratic Games

Bilinear Game

Consider the following bilinear game:

minθmaxϕaθϕ\min_{\theta\in\mathbb{R}}\max_{\phi\in\mathbb{R}}a\theta\phi (62)

The example presented in Figure 1 uses different algorithms to solve the bilinear game above with coefficient a=10. For completeness, we also provide a grid of experimental results for different algorithms with different coefficients a and learning rates \eta, starting from the same point (\theta_{0},\phi_{0})=(-12,10). The results are presented in Figure 6.


Figure 6: A grid of experiments on the bilinear game for different algorithms with different values of coefficient aa and learning rates η\eta. The color in each cell indicates the distance to the equilibrium after 50 iterations.

The experiment demonstrates that, in a bilinear game, Lv.2 GP is equivalent to the extra-gradient method, and that higher-level Lv.k GP performs better as the coefficient a and the learning rate \eta increase, as long as the reasoning process remains a contraction (i.e., \eta<a^{-1}).
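For reference, a minimal numpy sketch (ours) of Lv.k GP on this bilinear game is given below; the step size \eta=0.09 is one choice from the stable region \eta<a^{-1}, and the starting point matches Figure 6.

```python
import numpy as np

def lvk_gp_step(theta, phi, a, eta, k):
    """One Lv.k GP step: k rounds of recursive reasoning, then commit."""
    th, ph = theta, phi                       # level-0 anticipations
    for _ in range(k):
        # Each player responds to the opponent's previous-level anticipation.
        th, ph = theta - eta * a * ph, phi + eta * a * th
    return th, ph

a, eta, k = 10.0, 0.09, 6                     # eta < 1/a keeps it a contraction
theta, phi = -12.0, 10.0                      # starting point of Figure 6
for _ in range(50):
    theta, phi = lvk_gp_step(theta, phi, a, eta, k)
print(np.hypot(theta, phi))                   # distance to the equilibrium
```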

Quadratic Game

For the quadratic game presented in Figure 3, we randomly initialize the matrices 𝑨\bm{A} and 𝑩\bm{B}:

𝑨=[1.83980.51951.25371.74701.27690.51950.65860.44760.88981.13091.25370.44761.44401.39230.88771.74700.88981.39232.12491.76641.27691.13090.88771.76642.1553]𝑩=[1.08211.24271.00931.33350.67611.24272.20311.32361.85660.93941.00931.32361.23931.36750.90651.33351.85661.36751.90810.96930.67610.93940.90650.96930.7141]\displaystyle\bm{A}=\begin{bmatrix}1.8398&0.5195&1.2537&1.7470&1.2769\\ 0.5195&0.6586&0.4476&0.8898&1.1309\\ 1.2537&0.4476&1.4440&1.3923&0.8877\\ 1.7470&0.8898&1.3923&2.1249&1.7664\\ 1.2769&1.1309&0.8877&1.7664&2.1553\end{bmatrix}\bm{B}=-\begin{bmatrix}1.0821&1.2427&1.0093&1.3335&0.6761\\ 1.2427&2.2031&1.3236&1.8566&0.9394\\ 1.0093&1.3236&1.2393&1.3675&0.9065\\ 1.3335&1.8566&1.3675&1.9081&0.9693\\ 0.6761&0.9394&0.9065&0.9693&0.7141\end{bmatrix}

where 𝑨\bm{A} is symmetric and positive definite and 𝑩\bm{B} is symmetric and negative definite. The interaction matrix is defined as:

𝑪=[c00000c00000c00000c00000c]\displaystyle\bm{C}=\begin{bmatrix}c&0&0&0&0\\ 0&c&0&0&0\\ 0&0&c&0&0\\ 0&0&0&c&0\\ 0&0&0&0&c\end{bmatrix}

where c represents the strength of the interaction between the two players. The starting points \bm{\theta}_{0} and \bm{\phi}_{0} are [0.1270,0.9667,0.2605,0.8972,0.3767]^{T} and [0.3362,0.4514,0.8403,0.1231,0.5430]^{T}, respectively.


Figure 7: A grid of experiments on the quadratic game for different algorithms with different values of the interaction strength c and learning rates \eta. The color in each cell indicates the distance to the equilibrium after 50 iterations.

A.9 Experiments on 8-Gaussians

Dataset

The target distribution is a mixture of 8-Gaussians with standard deviation equal to 0.050.05 and modes uniformly distributed around a unit circle.

Experiment

For our experiments, we used the PyTorch framework with a batch size of 128.
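For concreteness, a PyTorch sketch (ours) of the target sampler described above is given below; the function name and batching are illustrative.

```python
import torch

def sample_8gaussians(batch_size=128, std=0.05):
    """Mixture of 8 Gaussians with modes uniformly spaced on the unit circle."""
    angles = 2 * torch.pi * torch.randint(0, 8, (batch_size,)) / 8.0
    modes = torch.stack([torch.cos(angles), torch.sin(angles)], dim=1)
    return modes + std * torch.randn(batch_size, 2)

x_real = sample_8gaussians()   # one (128, 2) batch of real samples
```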

A.10 Experiments on CIFAR-10 and STL-10

For our experiments, we used the PyTorch (https://pytorch.org/) framework. For the experiments on CIFAR-10 and STL-10, we compute the FID and IS metrics using the provided implementations in TensorFlow (https://tensorflow.org/) for consistency with related works.

Lv.kk GP vs Lv.kk Adam
Figure 8: Comparison between Lv.kk GP and Lv.kk Adam on generating CIFAR-10 images. We can see significant improvements in FID when using the Lv.kk Adam algorithm we proposed.

In our experiments, we compare the performance of Lv.k GP and Lv.k Adam on the task of CIFAR-10 image generation. The experimental results are presented in Figure 8. The Lv.k GP and Lv.k Adam runs use the same initialization and hyperparameters. According to our experiments, Lv.k Adam converges much faster than Lv.k GP for the same choice of k and learning rate.

Adam vs Lv.kk Adam
Figure 9: Comparison between Adam and Lv.kk Adam on generating STL-10 images. We can see that Lv.kk Adam consistently outperforms the Adam optimizer in terms of FID score.

We also compare the performance of the Adam and Lv.k Adam optimizers on the task of STL-10 image generation. The experimental results are presented in Figure 9. Under the same choice of hyperparameters and identical model parameter initialization, Lv.k Adam consistently outperforms the Adam optimizer in terms of FID score.

Accelerated Lv.kk Adam

In this section, we propose an accelerated version of Lv.k Adam. The intuition is to update the min player \bm{\theta} and the max player \bm{\phi} in an alternating order. The corresponding Lv.k GP algorithm can be written as:

Reasoning:{𝜽t(k)=𝜽tη𝜽f(𝜽t,ϕt(k1))ϕt(k)=ϕtηϕg(𝜽t(k),ϕt)Update:{𝜽t+1=𝜽tη𝜽f(𝜽t,ϕt(k))ϕt+1=ϕtηϕg(𝜽t(k),ϕt)\mathmakebox[0.8]{\text{Reasoning:}\begin{cases}\bm{\theta}^{(k)}_{t}=\bm{\theta}_{t}-\eta\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\phi}_{t}^{(k-1)})\\ \bm{\phi}^{(k)}_{t}=\bm{\phi}_{t}-\eta\nabla_{\bm{\phi}}g(\bm{\theta}_{t}^{(k)},\bm{\phi}_{t})\end{cases}\text{Update:}\begin{cases}\bm{\theta}_{t+1}=\bm{\theta}_{t}-\eta\nabla_{\bm{\theta}}f(\bm{\theta}_{t},\bm{\phi}_{t}^{(k)})\\ \bm{\phi}_{t+1}=\bm{\phi}_{t}-\eta\nabla_{\bm{\phi}}g(\bm{\theta}_{t}^{(k)},\bm{\phi}_{t})\end{cases}} (Alt-Lv.kk GP)

Instead of responding to \bm{\theta}^{(k-1)}_{t}, in Alt-Lv.k GP the max player \bm{\phi}_{t}^{(k)} acts in response to the min player's current action \bm{\theta}^{(k)}_{t}. A Lv.k min player in Alt-Lv.k GP is thus equivalent to a Lv.(2k-1) player in the original Lv.k GP, and a Lv.k max player in Alt-Lv.k GP is equivalent to a Lv.2k player. Therefore, Alt-Lv.k GP converges twice as fast as Lv.k GP; the corresponding Alt-Lv.k Adam algorithm is provided in Algorithm 2, and a code sketch follows the algorithm.

Input: Stopping time TT, reasoning steps kk, learning rate η𝜽,ηϕ\eta_{\bm{\theta}},\eta_{\bm{\phi}}, decay rates for momentum estimates β1,β2\beta_{1},\beta_{2}, initial weight (𝜽0,ϕ0)(\bm{\theta}_{0},\bm{\phi}_{0}), 𝑷𝒙\bm{P}_{{\bm{x}}} and 𝑷𝒛\bm{P}_{{\bm{z}}} real and noise-data distributions, losses G(𝜽,ϕ,𝒙,𝒛)\mathcal{L}_{G}(\bm{\theta},\bm{\phi},{\bm{x}},{\bm{z}}) and D(𝜽,ϕ,𝒙,𝒛)\mathcal{L}_{D}(\bm{\theta},\bm{\phi},{\bm{x}},{\bm{z}}), ϵ=1e8\epsilon=1e-8.
Parameters : Initial parameters: 𝜽0,ϕ0\bm{\theta}_{0},\bm{\phi}_{0}
Initialize first moments: \bm{m}_{\theta,0}\xleftarrow{}0,\bm{m}_{\phi,0}\xleftarrow{}0
Initialize second moments: \bm{v}_{\theta,0}\xleftarrow{}0,\bm{v}_{\phi,0}\xleftarrow{}0
for t=0,…,T-1 do
       Sample new mini-batch: 𝒙,𝒛𝑷𝒙,𝑷𝒛{\bm{x}},{\bm{z}}\sim\bm{P}_{{\bm{x}}},\bm{P}_{{\bm{z}}},
       𝜽t(0)𝜽t,ϕt(0)ϕt\bm{\theta}_{t}^{(0)}\xleftarrow{}\bm{\theta}_{t},\bm{\phi}_{t}^{(0)}\xleftarrow{}\bm{\phi}_{t},
       for n=1,…,k do
             Compute stochastic gradient: 𝒈𝜽,t(n)=θG(𝜽t,ϕt(n1),𝒙,𝒛)\bm{g}_{\bm{\theta},t}^{(n)}=\nabla_{\theta}\mathcal{L}_{G}(\bm{\theta}_{t},\bm{\phi}^{(n-1)}_{t},{\bm{x}},{\bm{z}});
             Update estimate of first moment: 𝒎θ,t(n)=β1𝒎θ,t1+(1β1)𝒈θ,t(n)\bm{m}_{\theta,t}^{(n)}=\beta_{1}\bm{m}_{\theta,t-1}+(1-\beta_{1})\bm{g}^{(n)}_{\theta,t};
             Update estimate of second moment: 𝒗θ,t(n)=β2𝒗θ,t1+(1β2)(𝒈θ,t(n))2\bm{v}_{\theta,t}^{(n)}=\beta_{2}\bm{v}_{\theta,t-1}+(1-\beta_{2})(\bm{g}^{(n)}_{\theta,t})^{2};
             Correct the bias for the moments: 𝒎^θ,t(n)=𝒎θ,t(n)(1β1t),𝒗^θ,t(n)=𝒗θ,t(n)(1β2t)\bm{\hat{m}}_{\theta,t}^{(n)}=\frac{\bm{m}^{(n)}_{\theta,t}}{(1-\beta_{1}^{t})},\bm{\hat{v}}_{\theta,t}^{(n)}=\frac{\bm{v}^{(n)}_{\theta,t}}{(1-\beta_{2}^{t})};
             Perform Adam update: 𝜽t(n)=𝜽tηθ𝒎^θ,t(n)𝒗^θ,t(n)+ϵ\bm{\theta}^{(n)}_{t}=\bm{\theta}_{t}-\eta_{\theta}\frac{\bm{\hat{m}}_{\theta,t}^{(n)}}{\sqrt{\bm{\hat{v}}_{\theta,t}^{(n)}}+\epsilon};
             Compute stochastic gradient: 𝒈ϕ,t(n)=ϕD(𝜽t(n),ϕt,𝒙,𝒛)\bm{g}_{\bm{\phi},t}^{(n)}=\nabla_{\phi}\mathcal{L}_{D}(\bm{\theta}^{(n)}_{t},\bm{\phi}_{t},{\bm{x}},{\bm{z}});
             Update estimate of first moment: \bm{m}_{\phi,t}^{(n)}=\beta_{1}\bm{m}_{\phi,t-1}+(1-\beta_{1})\bm{g}^{(n)}_{\phi,t};
             Update estimate of second moment: \bm{v}_{\phi,t}^{(n)}=\beta_{2}\bm{v}_{\phi,t-1}+(1-\beta_{2})(\bm{g}^{(n)}_{\phi,t})^{2};
             Correct the bias for the moments: \bm{\hat{m}}_{\phi,t}^{(n)}=\frac{\bm{m}^{(n)}_{\phi,t}}{(1-\beta_{1}^{t})},\bm{\hat{v}}_{\phi,t}^{(n)}=\frac{\bm{v}^{(n)}_{\phi,t}}{(1-\beta_{2}^{t})};
             Perform Adam update: \bm{\phi}^{(n)}_{t}=\bm{\phi}_{t}-\eta_{\phi}\frac{\bm{\hat{m}}_{\phi,t}^{(n)}}{\sqrt{\bm{\hat{v}}_{\phi,t}^{(n)}}+\epsilon};
            
      𝜽t+1𝜽t(k),ϕt+1ϕt(k)\bm{\theta}_{t+1}\xleftarrow{}\bm{\theta}_{t}^{(k)},\bm{\phi}_{t+1}\xleftarrow{}\bm{\phi}_{t}^{(k)};
       𝒎θ,t𝒎θ,t(k),𝒎ϕ,t𝒎ϕ,t(k)\bm{m}_{\theta,t}\xleftarrow{}\bm{m}_{\theta,t}^{(k)},\bm{m}_{\phi,t}\xleftarrow{}\bm{m}_{\phi,t}^{(k)};
       𝒗θ,t𝒗θ,t(k),𝒗ϕ,t𝒗ϕ,t(k)\bm{v}_{\theta,t}\xleftarrow{}\bm{v}_{\theta,t}^{(k)},\bm{v}_{\phi,t}\xleftarrow{}\bm{v}_{\phi,t}^{(k)}
      
Algorithm 2 Accelerated Level kk Adam: proposed Adam with recursive reasoning steps
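To make the control flow of Algorithm 2 concrete, here is a compact PyTorch sketch (ours, simplified to single parameter tensors rather than network weights). loss_G and loss_D are placeholder closures that evaluate the two players' losses on the current minibatch; the bias-correction index t starts at 1, and the moment estimates are carried over between outer iterations exactly as in Algorithm 2.

```python
import torch

def alt_lvk_adam_step(theta, phi, loss_G, loss_D, state, k=6,
                      eta=2e-4, beta1=0.5, beta2=0.999, eps=1e-8):
    """One outer iteration; theta and phi are tensors with requires_grad=True."""
    m_th, v_th, m_ph, v_ph, t = state   # moments carried over from step t-1
    ph_k = phi                          # level-0 anticipation of the opponent
    for _ in range(k):
        # Min player reasons against the opponent's level-(n-1) anticipation.
        g_th = torch.autograd.grad(loss_G(theta, ph_k), theta)[0]
        m_th_k = beta1 * m_th + (1 - beta1) * g_th
        v_th_k = beta2 * v_th + (1 - beta2) * g_th ** 2
        th_k = theta - eta * (m_th_k / (1 - beta1 ** t)) / (
            (v_th_k / (1 - beta2 ** t)).sqrt() + eps)
        # Max player responds to the min player's *current* level-n action.
        g_ph = torch.autograd.grad(loss_D(th_k, phi), phi)[0]
        m_ph_k = beta1 * m_ph + (1 - beta1) * g_ph
        v_ph_k = beta2 * v_ph + (1 - beta2) * g_ph ** 2
        ph_k = phi - eta * (m_ph_k / (1 - beta1 ** t)) / (
            (v_ph_k / (1 - beta2 ** t)).sqrt() + eps)
    # Commit the level-k states and the level-k moment estimates.
    return (th_k.detach().requires_grad_(True),
            ph_k.detach().requires_grad_(True),
            (m_th_k, v_th_k, m_ph_k, v_ph_k, t + 1))
```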
Architecture

In this section, we describe the models we used to evaluate the performance of Lv.k Adam for generating the CIFAR-10 (released under the MIT license) and STL-10 datasets. With 'conv' we denote a convolutional layer and with 'transposed conv' a transposed convolution layer. The models use Batch Normalization and Spectral Normalization. The models' parameters are initialized with Xavier initialization.

Table 5: ResNet blocks used for the SN-GAN architectures on CIFAR-10 image generation, for the generator (left) and the discriminator (right).
G-ResBlock
Shortcut:
Upsample(×2\times 2)
Residual:
Batch Normalization
ReLU
Upsample(×2\times 2)
conv (ker:3×33\times 3, 256256256\to 256; stride: 11; pad: 11)
Batch Normalization
ReLU
conv (ker:3×33\times 3, 256256256\to 256; stride: 11; pad: 11)
D-ResBlock (ll-th block)
Shortcut:
[[AvgPool (ker: 2×22\times 2)]], if l=1l=1
conv (ker: 1×11\times 1, 3l=1/128l11283_{l=1}/128_{l\neq 1}\to 128; stride: 11)
Spectral Normalization
[[AvgPool (ker: 2×22\times 2, stride: 22)]], if l=1l=1
Residual:
[[ReLU]], if l1l\neq 1
conv (ker: 3×33\times 3, 3l=1/128l11283_{l=1}/128_{l\neq 1}\to 128; stride: 11; pad: 11)
Spectral Normalization
ReLU
conv (ker: 1×11\times 1, 128128128\to 128; stride: 11)
Spectral Normalization
AvgPool (ker: 2×22\times 2)
Table 6: SN-GAN architectures for experiments on CIFAR-10

Generator:
Input: \bm{z}\in\mathbb{R}^{128}\sim\mathcal{N}(0,\bm{I})
Linear(128\to 4096)
G-ResBlock
G-ResBlock
G-ResBlock
Batch Normalization
ReLU
conv (ker: 3\times 3, 256\to 3; stride: 1; pad: 1)
Tanh

Discriminator:
Input: \bm{x}\in\mathbb{R}^{3\times 32\times 32}
D-ResBlock
D-ResBlock
D-ResBlock
D-ResBlock
ReLU
AvgPool (ker: 8\times 8)
Linear(128\to 1)
Spectral Normalization
Table 7: SN-GAN architectures for experiments on STL-10

Generator:
Input: \bm{z}\in\mathbb{R}^{128}\sim\mathcal{N}(0,\bm{I})
Linear(128\to 6\times 6\times 512)
G-ResBlock up 512\to 256
G-ResBlock up 256\to 128
G-ResBlock up 128\to 64
Batch Normalization
ReLU
conv (ker: 3\times 3, 64\to 3; stride: 1; pad: 1)
Tanh

Discriminator:
Input: \bm{x}\in\mathbb{R}^{3\times 48\times 48}
D-ResBlock down 64\to 128
D-ResBlock down 3\to 128
D-ResBlock down 128\to 256
D-ResBlock down 256\to 512
D-ResBlock 512\to 1024
ReLU, AvgPool (ker: 8\times 8)
Linear(128\to 1)
Spectral Normalization
Images generated on CIFAR-10 and STL-10

In this section, we present sample images generated by the best performing trained generators on CIFAR-10 and STL-10.

Figure 10: The presented samples are generated by the best performing trained generator on CIFAR-10, using Lv.6 Adam, which achieves an FID score of 10.12.
Figure 11: The presented samples are generated by the best performing trained generator on STL-10, using Lv.6 Adam, which achieves an FID score of 25.43.