
Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes

Qi Lei
UT Austin
[email protected]
   Sai Ganesh Nagarajan
SUTD
[email protected]
   Ioannis Panageas
SUTD
[email protected]
   Xiao Wang
SUTD
[email protected]
Abstract

In a recent series of papers it has been established that variants of Gradient Descent/Ascent and Mirror Descent exhibit last iterate convergence in convex-concave zero-sum games. Specifically, [6, 10] show last iterate convergence of the so-called “Optimistic Gradient Descent/Ascent” for the case of unconstrained min-max optimization. Moreover, in [11] the authors show that Mirror Descent with an extra gradient step displays last iterate convergence for convex-concave problems (both constrained and unconstrained), though their algorithm does not follow the online learning framework: it uses extra information, rather than only the history, to compute the next iteration. In this work, we show that “Optimistic Multiplicative-Weights Update (OMWU)”, which does follow the no-regret online learning framework, exhibits last iterate convergence locally for convex-concave games, generalizing the results of [8], where last iterate convergence of OMWU was shown only for the bilinear case. We complement our results with experiments that indicate fast convergence of the method.

1 Introduction

In classic (normal form) zero-sum games, one has to compute two probability vectors $\mathbf{x}^{*}\in\Delta_{n},\mathbf{y}^{*}\in\Delta_{m}$ (here $\Delta_{n}$ denotes the simplex of size $n$) that constitute an equilibrium of the following problem

\min_{\mathbf{x}\in\Delta_{n}}\max_{\mathbf{y}\in\Delta_{m}}\mathbf{x}^{\top}A\mathbf{y}, \qquad (1)

where $A$ is an $n\times m$ real matrix (called the payoff matrix). Here $\mathbf{x}^{\top}A\mathbf{y}$ represents the payment of the $\mathbf{x}$ player to the $\mathbf{y}$ player under the strategies chosen by the two players, and is a bilinear function.

Arguably one of the most celebrated theorems, and a cornerstone of Game Theory, is the minimax theorem of Von Neumann [16]. It states that

\min_{\mathbf{x}\in\Delta_{n}}\max_{\mathbf{y}\in\Delta_{m}}f(\mathbf{x},\mathbf{y})=\max_{\mathbf{y}\in\Delta_{m}}\min_{\mathbf{x}\in\Delta_{n}}f(\mathbf{x},\mathbf{y}), \qquad (2)

where $f:\Delta_{n}\times\Delta_{m}\to\mathbb{R}$ is convex in $\mathbf{x}$ and concave in $\mathbf{y}$. The aforementioned result holds for any convex compact sets $\mathcal{X}\subset\mathbb{R}^{n}$ and $\mathcal{Y}\subset\mathbb{R}^{m}$. The min-max theorem reassures us that an equilibrium always exists in the bilinear game (1) or its convex-concave analogue (again $f(\mathbf{x},\mathbf{y})$ is interpreted as the payment of the $\mathbf{x}$ player to the $\mathbf{y}$ player). An equilibrium is a pair of randomized strategies $(\mathbf{x}^{*},\mathbf{y}^{*})$ such that neither player can improve their payoff by unilaterally changing their distribution.

Soon after the appearance of the minimax theorem, research focused on dynamics for solving min-max optimization problems by having the min and max players of (1) run a simple online learning procedure. In the online learning framework, at time $t$, each player chooses a probability distribution ($\mathbf{x}^{t},\mathbf{y}^{t}$ respectively) simultaneously, depending only on the past choices of both players (i.e., $\mathbf{x}^{1},...,\mathbf{x}^{t-1},\mathbf{y}^{1},...,\mathbf{y}^{t-1}$), and experiences a payoff that depends on the choices $\mathbf{x}^{t},\mathbf{y}^{t}$.

An early method, proposed by Brown [4] and analyzed by Robinson [14], was fictitious play. Later on, researchers discovered several robust learning algorithms that converge to the minimax equilibrium at faster rates; see [5]. This class of learning algorithms, the so-called “no-regret” algorithms, includes the Multiplicative Weights Update method [2] and Follow the Regularized Leader.

1.1 Average Iterate Convergence

Despite the rich literature on no-regret learning, most of the known results have the feature that the min-max equilibrium is shown to be attained only by the time average. This means that the trajectory $(\mathbf{x}^{t},\mathbf{y}^{t})$ of a no-regret learning method has the property that $\frac{1}{t}\sum_{\tau\leq t}{\mathbf{x}^{\tau}}^{\top}A\mathbf{y}^{\tau}$ converges to the value of (1) as $t\rightarrow\infty$. Unfortunately, this does not mean that the last iterate $(\mathbf{x}^{t},\mathbf{y}^{t})$ converges to an equilibrium; it commonly diverges or cycles. One such example is the well-known Multiplicative Weights Update algorithm, whose time average is known to converge to an equilibrium, but whose actual trajectory cycles towards the boundary of the simplex [3]. The same is true even for vanilla Gradient Descent/Ascent, where one can show that the last iterate fails to converge already for bilinear landscapes (unconstrained case) [6].

Motivated by the training of Generative Adversarial Networks (GANs), in the last couple of years researchers have focused on designing and analyzing procedures that exhibit last iterate convergence (or pointwise convergence) for zero-sum games. This is crucial for training GANs, whose landscapes are typically non-convex non-concave and where averaging no longer gives meaningful guarantees (e.g., note that Jensen’s inequality is not applicable anymore). In [6, 10] the authors show that a variant of Gradient Descent/Ascent, called Optimistic Gradient Descent/Ascent, has last iterate convergence for the case of bilinear functions $\mathbf{x}^{\top}A\mathbf{y}$ with $\mathbf{x}\in\mathbb{R}^{n}$ and $\mathbf{y}\in\mathbb{R}^{m}$ (this is called the unconstrained case, since there are no restrictions on the vectors). Later on, [8] generalized the above result to simplex constraints, where the online method the authors analyzed was Optimistic Multiplicative Weights Update. In [11], it is shown that Mirror Descent with an extra gradient computation converges pointwise for a class of zero-sum games that includes the convex-concave setting (with arbitrary convex constraints), though their algorithm does not fit in the online no-regret framework since it uses the payoff information twice before it iterates. Last but not least, other works have shown pointwise convergence in related settings to stationary points (but not local equilibrium solutions); see [13, 7], [1] and references therein.

1.2 Main Results

In this paper, we focus on the min-max optimization problem

\min_{\mathbf{x}\in\Delta_{n}}\max_{\mathbf{y}\in\Delta_{m}}f(\mathbf{x},\mathbf{y}), \qquad (3)

where ff is a convex-concave function (convex in 𝐱\mathbf{x}, concave in 𝐲\mathbf{y}). We analyze the no-regret online algorithm Optimistic Multiplicative Weights Update (OMWU). OMWU is an instantiation of the Optimistic Follow the Regularized Leader (OFTRL) method with entropy as a regularizer (for both players, see Preliminaries section for the definition of OMWU).

We prove that OMWU exhibits local last iterate convergence, generalizing the result of [8] and answering an open question of [15] (for convex-concave games). Formally, our main theorem is stated below:

Theorem 1.1 (Last iterate convergence of OMWU).

Let $f:\Delta_{n}\times\Delta_{m}\to\mathbb{R}$ be a twice differentiable function that is convex in $\mathbf{x}$ and concave in $\mathbf{y}$. Assume that there exists an equilibrium $(\mathbf{x}^{*},\mathbf{y}^{*})$ that satisfies the KKT conditions with strict inequalities (see (4)). Then, for a sufficiently small stepsize, there exists a neighborhood $U\subseteq\Delta_{n}\times\Delta_{m}$ of $(\mathbf{x}^{*},\mathbf{y}^{*})$ such that for all initial conditions $(\mathbf{x}^{0},\mathbf{y}^{0}),(\mathbf{x}^{1},\mathbf{y}^{1})\in U$, OMWU exhibits last iterate (pointwise) convergence, i.e.,

\lim_{t\to\infty}(\mathbf{x}^{t},\mathbf{y}^{t})=(\mathbf{x}^{*},\mathbf{y}^{*}),

where $(\mathbf{x}^{t},\mathbf{y}^{t})$ denotes the $t$-th iterate of OMWU.

Moreover, we complement our theoretical findings with an experimental analysis of the procedure. The experiments, which track the KL divergence to the equilibrium, indicate that the result should hold globally.

1.3 Structure and Technical Overview

We present the structure of the paper and a brief technical overview.

Section 2 provides necessary definitions, the explicit form of OMWU derived from OFTRL with entropy regularizer, and some existing results on dynamical systems.

Section 3 is the main technical part, i.e., the computation and spectral analysis of the Jacobian matrix of the OMWU dynamics. The stability analysis, the understanding of the local behavior, and the local convergence guarantees of OMWU rely on the spectral analysis of the computed Jacobian matrix. The techniques for bilinear games (as in [8]) are no longer valid in convex-concave games. Allow us to explain the differences from [8]. In general, one cannot expect a trivial generalization from linear to non-linear scenarios. The properties of bilinear games are fundamentally different from those of convex-concave games, and this makes the analysis much more challenging in the latter. The key result of the spectral analysis in [8] is a lemma (Lemma B.6) which states that a skew-symmetric matrix ($A$ is skew-symmetric if $A^{\top}=-A$) has purely imaginary eigenvalues. Skew-symmetric matrices appear because in the bilinear case there are terms that are linear in $\mathbf{x}$ and linear in $\mathbf{y}$ but no higher order terms in $\mathbf{x}$ or $\mathbf{y}$. However, skew symmetry has no place in the case of convex-concave landscapes and the Jacobian matrix of OMWU is far more complicated. One key technique to overcome the lack of skew symmetry is the use of the Ky Fan inequality [12], which states that the sequence of eigenvalues of $\frac{1}{2}(W+W^{\top})$ majorizes the real part of the sequence of eigenvalues of $W$ for any square matrix $W$ (see Lemma 3.1).

Section 4 focuses on numerical experiments that probe how the problem size and the choice of learning rate affect the performance of our algorithm. We observe that our algorithm achieves global convergence irrespective of the choice of learning rate, random initialization or problem size. In comparison, the recently popularized (projected) optimistic gradient descent ascent is much more sensitive to the choice of hyperparameters. Due to space constraints, the detailed calculations of the Jacobian matrix of OMWU (the general form and at the fixed point) are deferred to the Appendix.

Notation

The boldface $\mathbf{x}$ and $\mathbf{y}$ denote vectors in $\Delta_{n}$ and $\Delta_{m}$. $\mathbf{x}^{t}$ denotes the $t$-th iterate of the dynamical system. The letter $J$ denotes the Jacobian matrix. $\mathbf{I}$, $\mathbf{0}$ and $\mathbf{1}$ are reserved for the identity matrix, the zero matrix and the all-ones vector. The support of $\mathbf{x}$, denoted $\textrm{Supp}(\mathbf{x})$, is the set of indices $i$ such that $x_{i}\neq 0$. $(\mathbf{x}^{*},\mathbf{y}^{*})$ denotes the optimal solution of the min-max problem. $[n]$ denotes the set of integers $\{1,...,n\}$.

2 Preliminaries

In this section, we present some background that will be used later.

2.1 Equilibria for Constrained Minimax

From Von Neumann’s minimax theorem, one can conclude that the problem $\min_{\mathbf{x}\in\Delta_{n}}\max_{\mathbf{y}\in\Delta_{m}}f(\mathbf{x},\mathbf{y})$ always has an equilibrium $(\mathbf{x}^{*},\mathbf{y}^{*})$, with the value $f(\mathbf{x}^{*},\mathbf{y}^{*})$ being unique. Moreover, by the KKT conditions (as long as $f$ is twice differentiable), such an equilibrium must satisfy the following ($\mathbf{x}^{*}$ is a local minimum for fixed $\mathbf{y}=\mathbf{y}^{*}$ and $\mathbf{y}^{*}$ is a local maximum for fixed $\mathbf{x}=\mathbf{x}^{*}$):

Definition 2.1 (KKT conditions).

Formally, it holds

\begin{array}{l}\mathbf{x}^{*}\in\Delta_{n}\\ x_{i}^{*}>0\Rightarrow\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})=\sum_{j=1}^{n}x_{j}^{*}\frac{\partial f}{\partial x_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})\\ x_{i}^{*}=0\Rightarrow\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})\geq\sum_{j=1}^{n}x_{j}^{*}\frac{\partial f}{\partial x_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})\\ \textrm{for player }\mathbf{x},\\ \mathbf{y}^{*}\in\Delta_{m}\\ y_{i}^{*}>0\Rightarrow\frac{\partial f}{\partial y_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})=\sum_{j=1}^{m}y_{j}^{*}\frac{\partial f}{\partial y_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})\\ y_{i}^{*}=0\Rightarrow\frac{\partial f}{\partial y_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})\leq\sum_{j=1}^{m}y_{j}^{*}\frac{\partial f}{\partial y_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})\\ \textrm{for player }\mathbf{y}.\end{array} \qquad (4)
Remark 2.2 (No degeneracies).

For the rest of the paper we assume no degeneracies, i.e., the last inequalities in (4) hold strictly (in the case a strategy is played with zero probability by a player). Moreover, it is easy to see that since $f$ is convex-concave and twice differentiable, $\nabla^{2}_{\mathbf{x}\mathbf{x}}f$ (the part of the Hessian that involves the $\mathbf{x}$ variables) is positive semi-definite and $\nabla^{2}_{\mathbf{y}\mathbf{y}}f$ (the part of the Hessian that involves the $\mathbf{y}$ variables) is negative semi-definite.
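The conditions in (4), together with the non-degeneracy assumption, are easy to check numerically for a candidate equilibrium. Below is a minimal sketch (ours, not part of the paper), assuming access to the two gradients of $f$ at the candidate point; the helper name `check_kkt` and the matching-pennies example are purely illustrative.

```python
import numpy as np

def check_kkt(grad_x, grad_y, x_star, y_star, tol=1e-8):
    """Numerically check the KKT conditions (4) at a candidate equilibrium.

    grad_x, grad_y: gradients of f w.r.t. x and y, evaluated at (x_star, y_star).
    """
    gx, gy = np.asarray(grad_x, float), np.asarray(grad_y, float)
    avg_x = float(x_star @ gx)       # sum_j x_j^* df/dx_j
    avg_y = float(y_star @ gy)       # sum_j y_j^* df/dy_j
    ok = True
    for i, xi in enumerate(x_star):  # min player: equality on the support, ">=" off-support
        ok &= abs(gx[i] - avg_x) <= tol if xi > tol else gx[i] >= avg_x - tol
    for i, yi in enumerate(y_star):  # max player: equality on the support, "<=" off-support
        ok &= abs(gy[i] - avg_y) <= tol if yi > tol else gy[i] <= avg_y + tol
    return bool(ok)

# Illustration: matching pennies, f(x, y) = x^T A y, uniform equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x_star = y_star = np.array([0.5, 0.5])
print(check_kkt(A @ y_star, A.T @ x_star, x_star, y_star))   # True
```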

2.2 Optimistic Multiplicative Weights Update

The equations of Optimistic Follow-the-Regularized-Leader (OFTRL) applied to a problem $\min_{\mathbf{x}\in\mathcal{X}}\max_{\mathbf{y}\in\mathcal{Y}}f(\mathbf{x},\mathbf{y})$ with regularizers (strongly convex functions) $h_{1}(\mathbf{x}),h_{2}(\mathbf{y})$ (for players $\mathbf{x},\mathbf{y}$ respectively) and $\mathcal{X}\subset\mathbb{R}^{n},\mathcal{Y}\subset\mathbb{R}^{m}$ are given below (see [6]):

\mathbf{x}^{t+1} =\mathop{\mathbf{argmin}}_{\mathbf{x}\in\mathcal{X}}\Big\{\eta\sum_{s=1}^{t}\mathbf{x}^{\top}\nabla_{\mathbf{x}}f(\mathbf{x}^{s},\mathbf{y}^{s})+\underbrace{\eta\mathbf{x}^{\top}\nabla_{\mathbf{x}}f(\mathbf{x}^{t},\mathbf{y}^{t})}_{\textrm{optimistic term}}+h_{1}(\mathbf{x})\Big\}
\mathbf{y}^{t+1} =\mathop{\mathbf{argmax}}_{\mathbf{y}\in\mathcal{Y}}\Big\{\eta\sum_{s=1}^{t}\mathbf{y}^{\top}\nabla_{\mathbf{y}}f(\mathbf{x}^{s},\mathbf{y}^{s})+\underbrace{\eta\mathbf{y}^{\top}\nabla_{\mathbf{y}}f(\mathbf{x}^{t},\mathbf{y}^{t})}_{\textrm{optimistic term}}-h_{2}(\mathbf{y})\Big\}.

$\eta$ is called the stepsize of the online algorithm. OFTRL is uniquely defined if $f$ is convex-concave and the domains $\mathcal{X}$ and $\mathcal{Y}$ are convex. For simplex constraints and entropy regularizers, i.e., $h_{1}(\mathbf{x})=\sum_{i}x_{i}\ln x_{i},h_{2}(\mathbf{y})=\sum_{i}y_{i}\ln y_{i}$, we can solve for the explicit form of OFTRL using the KKT conditions; the resulting update rule is the Optimistic Multiplicative Weights Update (OMWU), described as follows:

x_{i}^{t+1} =x_{i}^{t}\frac{e^{-2\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{t},\mathbf{y}^{t})+\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{t-1},\mathbf{y}^{t-1})}}{\sum_{k}x_{k}^{t}e^{-2\eta\frac{\partial f}{\partial x_{k}}(\mathbf{x}^{t},\mathbf{y}^{t})+\eta\frac{\partial f}{\partial x_{k}}(\mathbf{x}^{t-1},\mathbf{y}^{t-1})}} \quad \text{for all }i\in[n],
y_{i}^{t+1} =y_{i}^{t}\frac{e^{2\eta\frac{\partial f}{\partial y_{i}}(\mathbf{x}^{t},\mathbf{y}^{t})-\eta\frac{\partial f}{\partial y_{i}}(\mathbf{x}^{t-1},\mathbf{y}^{t-1})}}{\sum_{k}y_{k}^{t}e^{2\eta\frac{\partial f}{\partial y_{k}}(\mathbf{x}^{t},\mathbf{y}^{t})-\eta\frac{\partial f}{\partial y_{k}}(\mathbf{x}^{t-1},\mathbf{y}^{t-1})}} \quad \text{for all }i\in[m].
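For concreteness, one OMWU step can be implemented in a few lines. The sketch below is ours (not the authors' code) and assumes gradient oracles `grad_x`, `grad_y` for $f$; the bilinear matching-pennies example at the end is only an illustration.

```python
import numpy as np

def omwu_step(x, y, x_prev, y_prev, grad_x, grad_y, eta):
    """One OMWU update: weight by twice the current gradient minus the previous one."""
    gx, gx_prev = grad_x(x, y), grad_x(x_prev, y_prev)
    gy, gy_prev = grad_y(x, y), grad_y(x_prev, y_prev)
    x_new = x * np.exp(-2.0 * eta * gx + eta * gx_prev)   # min player
    y_new = y * np.exp(2.0 * eta * gy - eta * gy_prev)    # max player
    return x_new / x_new.sum(), y_new / y_new.sum()

# Illustration on f(x, y) = x^T A y (matching pennies); the equilibrium is uniform.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
grad_x = lambda x, y: A @ y
grad_y = lambda x, y: A.T @ x
x = x_prev = np.array([0.9, 0.1])
y = y_prev = np.array([0.2, 0.8])
for _ in range(2000):
    x, y, x_prev, y_prev = *omwu_step(x, y, x_prev, y_prev, grad_x, grad_y, 0.1), x, y
print(x, y)   # both approach [0.5, 0.5]
```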

2.3 Fundamentals of Dynamical Systems

We conclude the Preliminaries section with some basic facts about dynamical systems.

Definition 2.3.

A recurrence relation of the form $\mathbf{x}^{t+1}=w(\mathbf{x}^{t})$ is a discrete time dynamical system, with update rule $w:\mathcal{S}\rightarrow\mathcal{S}$ where $\mathcal{S}$ is a subset of $\mathbb{R}^{k}$ for some positive integer $k$. The point $\mathbf{z}\in\mathcal{S}$ is called a fixed point if $w(\mathbf{z})=\mathbf{z}$.

Remark 2.4.

Using KKT conditions (4), it is not hard to observe that an equilibrium point $(\mathbf{x}^{*},\mathbf{y}^{*})$ must be a fixed point of the OMWU algorithm, i.e., if $(\mathbf{x}^{t},\mathbf{y}^{t})=(\mathbf{x}^{t-1},\mathbf{y}^{t-1})=(\mathbf{x}^{*},\mathbf{y}^{*})$ then $(\mathbf{x}^{t+1},\mathbf{y}^{t+1})=(\mathbf{x}^{*},\mathbf{y}^{*})$.

Proposition 2.5 ([9]).

Assume that $w$ is a differentiable function and that the Jacobian of the update rule $w$ at a fixed point $\mathbf{z}^{*}$ has spectral radius less than one. Then there exists a neighborhood $U$ around $\mathbf{z}^{*}$ such that for all $\mathbf{z}^{0}\in U$, the dynamics $\mathbf{z}^{t+1}=w(\mathbf{z}^{t})$ converges to $\mathbf{z}^{*}$, i.e., $\lim_{n\rightarrow\infty}w^{n}(\mathbf{z}^{0})=\mathbf{z}^{*}$ (here $w^{n}$ denotes the composition of $w$ with itself $n$ times). In this case, $w$ is called a contraction mapping in $U$.

Note that we will make use of Proposition 2.5 to prove our Theorem 1.1 (by proving that the Jacobian of the update rule of OMWU has spectral radius less than one).

3 Last iterate convergence of OMWU

In this section, we prove that OMWU converges pointwise (exhibits last iterate convergence) if the initializations $(\mathbf{x}^{0},\mathbf{y}^{0}),(\mathbf{x}^{1},\mathbf{y}^{1})$ belong to a neighborhood $U$ of the equilibrium $(\mathbf{x}^{*},\mathbf{y}^{*})$.

3.1 Dynamical System of OMWU

We first express the OMWU algorithm as a dynamical system so that we can use Proposition 2.5. The idea (similar to [8]) is to lift the space to four components $(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w})$, in such a way that we can include the history (the current and the previous step; see Section 2.2 for the equations). The update rule $g:\Delta_{n}\times\Delta_{m}\times\Delta_{n}\times\Delta_{m}\to\Delta_{n}\times\Delta_{m}\times\Delta_{n}\times\Delta_{m}$ of the lifted dynamical system is given by

g(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w})=(g_{1},g_{2},g_{3},g_{4})

where $g_{i}=g_{i}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w})$ for $i\in[4]$ are defined as follows:

g_{1,i}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w}) =x_{i}\frac{e^{-2\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x},\mathbf{y})+\eta\frac{\partial f}{\partial z_{i}}(\mathbf{z},\mathbf{w})}}{\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{k}}(\mathbf{x},\mathbf{y})+\eta\frac{\partial f}{\partial z_{k}}(\mathbf{z},\mathbf{w})}},\ i\in[n] \qquad (5)
g_{2,i}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w}) =y_{i}\frac{e^{2\eta\frac{\partial f}{\partial y_{i}}(\mathbf{x},\mathbf{y})-\eta\frac{\partial f}{\partial w_{i}}(\mathbf{z},\mathbf{w})}}{\sum_{k}y_{k}e^{2\eta\frac{\partial f}{\partial y_{k}}(\mathbf{x},\mathbf{y})-\eta\frac{\partial f}{\partial w_{k}}(\mathbf{z},\mathbf{w})}},\ i\in[m] \qquad (6)
g_{3}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w}) =\mathbf{x}\ \ \text{or}\ \ g_{3,i}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w})=x_{i},\ i\in[n] \qquad (7)
g_{4}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w}) =\mathbf{y}\ \ \text{or}\ \ g_{4,i}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w})=y_{i},\ i\in[m]. \qquad (8)

Then the dynamical system of OMWU can be written in compact form as

(\mathbf{x}^{t+1},\mathbf{y}^{t+1},\mathbf{x}^{t},\mathbf{y}^{t})=g(\mathbf{x}^{t},\mathbf{y}^{t},\mathbf{x}^{t-1},\mathbf{y}^{t-1}).

In what follows, we will perform a spectral analysis of the Jacobian of the function $g$, computed at the fixed point $(\mathbf{x}^{*},\mathbf{y}^{*})$. Since $g$ has been lifted, the fixed point we analyze is $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ (see Remark 2.4). By showing that the spectral radius is less than one, our Theorem 1.1 follows from Proposition 2.5. The computations of the Jacobian of $g$ are deferred to the supplementary material.
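Proposition 2.5 reduces Theorem 1.1 to a bound on the spectral radius, and that reduction can also be checked numerically on small instances. The sketch below (ours, for illustration only) builds the lifted map $g$ from the `omwu_step` function given after Section 2.2 and estimates the spectral radius of its Jacobian at the lifted fixed point with finite differences; the equilibrium and gradient oracles are those of the matching-pennies example and are assumptions of the sketch.

```python
import numpy as np

def lifted_map(state, n, m, grad_x, grad_y, eta):
    """g(x, y, z, w) = (OMWU step from (x, y) with history (z, w), x, y), flattened."""
    x, y = state[:n], state[n:n + m]
    z, w = state[n + m:2 * n + m], state[2 * n + m:]
    x_new, y_new = omwu_step(x, y, z, w, grad_x, grad_y, eta)
    return np.concatenate([x_new, y_new, x, y])

def spectral_radius(state, n, m, grad_x, grad_y, eta, h=1e-6):
    """Finite-difference Jacobian of the lifted map at `state`, then its spectral radius."""
    d = state.size
    J = np.zeros((d, d))
    base = lifted_map(state, n, m, grad_x, grad_y, eta)
    for j in range(d):
        e = np.zeros(d); e[j] = h
        J[:, j] = (lifted_map(state + e, n, m, grad_x, grad_y, eta) - base) / h
    return np.max(np.abs(np.linalg.eigvals(J)))

A = np.array([[1.0, -1.0], [-1.0, 1.0]])
grad_x = lambda x, y: A @ y
grad_y = lambda x, y: A.T @ x
x_star = y_star = np.array([0.5, 0.5])
state = np.concatenate([x_star, y_star, x_star, y_star])
print(spectral_radius(state, 2, 2, grad_x, grad_y, eta=0.1))   # expected to be slightly below 1
```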

3.2 Spectral Analysis

Let $(\mathbf{x}^{*},\mathbf{y}^{*})$ be the equilibrium of the min-max problem (3). Assume $i\notin\textrm{Supp}(\mathbf{x}^{*})$, i.e., $x_{i}^{*}=0$; then (see the equations in the supplementary material, Section A)

\frac{\partial g_{1,i}}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})=\frac{e^{-\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})}}{\sum_{t=1}^{n}x^{*}_{t}e^{-\eta\frac{\partial f}{\partial x_{t}}(\mathbf{x}^{*},\mathbf{y}^{*})}}

and all other partial derivatives of $g_{1,i}$ are zero; thus $\frac{e^{-\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})}}{\sum_{t=1}^{n}x^{*}_{t}e^{-\eta\frac{\partial f}{\partial x_{t}}(\mathbf{x}^{*},\mathbf{y}^{*})}}$ is an eigenvalue of the Jacobian computed at $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$. This is true because the row of the Jacobian that corresponds to $g_{1,i}$ has zeros everywhere except the diagonal entry. Moreover, because of the non-degeneracy assumption on the KKT conditions (see Remark 2.2), it holds that

\frac{e^{-\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})}}{\sum_{t=1}^{n}x^{*}_{t}e^{-\eta\frac{\partial f}{\partial x_{t}}(\mathbf{x}^{*},\mathbf{y}^{*})}}<1.

Similarly, it holds for $j\notin\textrm{Supp}(\mathbf{y}^{*})$ that

\frac{\partial g_{2,j}}{\partial y_{j}}(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})=\frac{e^{\eta\frac{\partial f}{\partial y_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})}}{\sum_{t=1}^{m}y^{*}_{t}e^{\eta\frac{\partial f}{\partial y_{t}}(\mathbf{x}^{*},\mathbf{y}^{*})}}<1

(again by Remark 2.2) and all other partial derivatives of $g_{2,j}$ are zero, therefore $\frac{e^{\eta\frac{\partial f}{\partial y_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})}}{\sum_{t=1}^{m}y^{*}_{t}e^{\eta\frac{\partial f}{\partial y_{t}}(\mathbf{x}^{*},\mathbf{y}^{*})}}$ is an eigenvalue of the Jacobian computed at $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$.

We focus on the submatrix of the Jacobian of $g$ computed at $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ that corresponds to the non-zero probabilities of $\mathbf{x}^{*}$ and $\mathbf{y}^{*}$. We denote by $D_{\mathbf{x}^{*}}$ the diagonal matrix of size $|\textrm{Supp}(\mathbf{x}^{*})|\times|\textrm{Supp}(\mathbf{x}^{*})|$ that has on its diagonal the nonzero entries of $\mathbf{x}^{*}$, and similarly we define $D_{\mathbf{y}^{*}}$ of size $|\textrm{Supp}(\mathbf{y}^{*})|\times|\textrm{Supp}(\mathbf{y}^{*})|$. For convenience, let us denote $k_{x}:=|\textrm{Supp}(\mathbf{x}^{*})|$ and $k_{y}:=|\textrm{Supp}(\mathbf{y}^{*})|$. The Jacobian submatrix is the following:

J=\left[\begin{array}{cccc}A_{11}&A_{12}&A_{13}&A_{14}\\ A_{21}&A_{22}&A_{23}&A_{24}\\ \mathbf{I}_{k_{x}\times k_{x}}&\mathbf{0}_{k_{x}\times k_{y}}&\mathbf{0}_{k_{x}\times k_{x}}&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&\mathbf{I}_{k_{y}\times k_{y}}&\mathbf{0}_{k_{y}\times k_{x}}&\mathbf{0}_{k_{y}\times k_{y}}\end{array}\right]

where

\begin{split}A_{11}&=\mathbf{I}_{k_{x}\times k_{x}}-D_{\mathbf{x}^{*}}\mathbf{1}_{k_{x}}\mathbf{1}_{k_{x}}^{\top}-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}_{k_{x}\times k_{x}}-\mathbf{1}_{k_{x}}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f\\ A_{12}&=-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}_{k_{x}\times k_{x}}-\mathbf{1}_{k_{x}}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ A_{13}&=\eta D_{\mathbf{x}^{*}}(\mathbf{I}_{k_{x}\times k_{x}}-\mathbf{1}_{k_{x}}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f\\ A_{14}&=\eta D_{\mathbf{x}^{*}}(\mathbf{I}_{k_{x}\times k_{x}}-\mathbf{1}_{k_{x}}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ A_{21}&=2\eta D_{\mathbf{y}^{*}}(\mathbf{I}_{k_{y}\times k_{y}}-\mathbf{1}_{k_{y}}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f\\ A_{22}&=\mathbf{I}_{k_{y}\times k_{y}}-D_{\mathbf{y}^{*}}\mathbf{1}_{k_{y}}\mathbf{1}_{k_{y}}^{\top}+2\eta D_{\mathbf{y}^{*}}(\mathbf{I}_{k_{y}\times k_{y}}-\mathbf{1}_{k_{y}}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f\\ A_{23}&=-\eta D_{\mathbf{y}^{*}}(\mathbf{I}_{k_{y}\times k_{y}}-\mathbf{1}_{k_{y}}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f\\ A_{24}&=-\eta D_{\mathbf{y}^{*}}(\mathbf{I}_{k_{y}\times k_{y}}-\mathbf{1}_{k_{y}}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f.\end{split} \qquad (9)

We note that $\mathbf{I},\mathbf{0}$ denote the identity matrix and the all-zeros matrix respectively (the appropriate size is indicated as a subscript). The vectors $(\mathbf{1}_{k_{x}},\mathbf{0}_{k_{y}},\mathbf{0}_{k_{x}},\mathbf{0}_{k_{y}})$ and $(\mathbf{0}_{k_{x}},\mathbf{1}_{k_{y}},\mathbf{0}_{k_{x}},\mathbf{0}_{k_{y}})$ are left eigenvectors of the above matrix with eigenvalue zero. Hence, any right eigenvector $(\mathbf{v_{x}},\mathbf{v_{y}},\mathbf{v_{z}},\mathbf{v_{w}})$ corresponding to a non-zero eigenvalue must satisfy $\mathbf{1}^{\top}\mathbf{v_{x}}=0$ and $\mathbf{1}^{\top}\mathbf{v_{y}}=0$. Thus, every non-zero eigenvalue of the above matrix is also a non-zero eigenvalue of the matrix below:

J_{\text{new}}=\left[\begin{array}{cccc}B_{11}&A_{12}&A_{13}&A_{14}\\ A_{21}&B_{22}&A_{23}&A_{24}\\ \mathbf{I}_{k_{x}\times k_{x}}&\mathbf{0}_{k_{x}\times k_{y}}&\mathbf{0}_{k_{x}\times k_{x}}&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&\mathbf{I}_{k_{y}\times k_{y}}&\mathbf{0}_{k_{y}\times k_{x}}&\mathbf{0}_{k_{y}\times k_{y}}\end{array}\right]

where

B_{11}=\mathbf{I}_{k_{x}\times k_{x}}-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}_{k_{x}\times k_{x}}-\mathbf{1}_{k_{x}}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f,
B_{22}=\mathbf{I}_{k_{y}\times k_{y}}+2\eta D_{\mathbf{y}^{*}}(\mathbf{I}_{k_{y}\times k_{y}}-\mathbf{1}_{k_{y}}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f.

The characteristic polynomial of $J_{\text{new}}$ is obtained by finding $\det(J_{\text{new}}-\lambda\mathbf{I})$. One can perform row/column operations on $J_{\text{new}}$ to calculate this determinant, which gives us the following relation:

\det(J_{\text{new}}-\lambda\mathbf{I})=\left(1-2\lambda\right)^{k_{x}+k_{y}}q\left(\frac{\lambda(\lambda-1)}{2\lambda-1}\right)

where $q(\lambda)$ is the characteristic polynomial of the following matrix

J_{\text{small}}=\left[\begin{array}{cc}B_{11}-\mathbf{I}_{k_{x}\times k_{x}}&A_{12}\\ A_{21}&B_{22}-\mathbf{I}_{k_{y}\times k_{y}}\end{array}\right],

where $B_{11},B_{22},A_{12},A_{21}$ are the aforementioned sub-matrices. Notice that $J_{\text{small}}$ can be written as

J_{\text{small}}=2\eta\left[\begin{array}{cc}-(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]H

where,

H=\left[\begin{array}{cc}\nabla_{\mathbf{x}\mathbf{x}}^{2}f&\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ \nabla_{\mathbf{y}\mathbf{x}}^{2}f&\nabla_{\mathbf{y}\mathbf{y}}^{2}f\end{array}\right].

Notice here that $H$ is the Hessian matrix of $f$ evaluated at the fixed point $(\mathbf{x}^{*},\mathbf{y}^{*})$, restricted to the rows and columns indexed by $\textrm{Supp}(\mathbf{x}^{*})$ and $\textrm{Supp}(\mathbf{y}^{*})$. Although the Hessian matrix is symmetric, we would like to work with the following representation of $J_{\text{small}}$:

J_{\text{small}}=2\eta\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]H^{-}

where,

H^{-}=\left[\begin{array}{cc}-\nabla_{\mathbf{x}\mathbf{x}}^{2}f&-\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ \nabla_{\mathbf{y}\mathbf{x}}^{2}f&\nabla_{\mathbf{y}\mathbf{y}}^{2}f\end{array}\right].

Let us denote by $\epsilon$ any non-zero eigenvalue of $J_{\text{small}}$; $\epsilon$ may be a complex number. Thus $\epsilon$ is a point where $q(\cdot)$ vanishes, and hence the corresponding eigenvalues of $J_{\text{new}}$ must satisfy the relation

\frac{\lambda(\lambda-1)}{2\lambda-1}=\epsilon.

We now show that the magnitude of any eigenvalue of $J_{\text{new}}$ is strictly less than 1, i.e., $|\lambda|<1$. Trivially, $\lambda=\frac{1}{2}$ (the root of the factor $(1-2\lambda)^{k_{x}+k_{y}}$) satisfies $|\lambda|<1$. Thus we need to show that the magnitude of any $\lambda$ satisfying the relation above, where $\epsilon$ is a root of $q(\cdot)$, is strictly less than 1. The remainder of the proof proceeds by showing the following two lemmas:

Lemma 3.1 (Real part non-positive).

Let $\lambda$ be an eigenvalue of the matrix $J_{\text{small}}$. It holds that $\textrm{Re}(\lambda)\leq 0$.

Figure 1: Convergence of OMWU for different problem sizes. (a) Number of iterations vs. the size $n$; (b) $\ell_{1}$ error vs. number of iterations. For (a), the $x$-axis is $n$ and the $y$-axis is the number of iterations needed to reach convergence for Eqn. (14). In (b) we choose four values of $n$ to illustrate how the $\ell_{1}$ error decreases with the number of iterations.
Proof.

Assume that $\lambda\neq 0$. All the non-zero eigenvalues of the matrix $J_{\text{small}}$ coincide, up to the positive factor $2\eta$ (which does not affect the sign of the real part), with the eigenvalues of the matrix

R:=\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]^{1/2}\times H^{-}\times\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]^{1/2}.

This is well-defined since

\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]

is positive semi-definite. Moreover, we use the Ky Fan inequality, which states that the sequence (in decreasing order) of the eigenvalues of $\frac{1}{2}(W+W^{\top})$ majorizes the real part of the sequence of the eigenvalues of $W$ for any square matrix $W$ (see [12], page 4). We conclude that for any eigenvalue $\lambda$ of $R$, $\textrm{Re}(\lambda)$ is at most the maximum eigenvalue of $\frac{1}{2}(R+R^{\top})$. Observe now that

R+R^{\top}=\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]^{1/2}\times(H^{-}+H^{-\top})\times\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]^{1/2}.

Since

H^{-}+H^{-\top}=\left[\begin{array}{cc}-2\nabla_{\mathbf{x}\mathbf{x}}^{2}f&0\\ 0&2\nabla_{\mathbf{y}\mathbf{y}}^{2}f\end{array}\right]

by the convex-concave assumption on $f$ it follows that the matrix above is negative semi-definite (see Remark 2.2), and so is $R+R^{\top}$. We conclude that the maximum eigenvalue of $R+R^{\top}$ is non-positive. Therefore any eigenvalue of $R$ has non-positive real part, and the same is true for $J_{\text{small}}$. ∎
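Lemma 3.1 is easy to test numerically. The following sketch (ours, not from the paper) builds $J_{\text{small}}$ for a randomly generated convex-concave quadratic $f(\mathbf{x},\mathbf{y})=\mathbf{x}^{\top}P\mathbf{x}-\mathbf{y}^{\top}Q\mathbf{y}+\mathbf{x}^{\top}A\mathbf{y}$, using an interior point in place of a fully supported equilibrium (an assumption of the sketch), and checks that every eigenvalue has non-positive real part.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eta = 4, 3, 0.1

# Convex-concave quadratic: Hessian blocks Hxx = 2P (PSD), Hyy = -2Q (NSD), Hxy = A.
P = rng.standard_normal((n, n)); P = P @ P.T
Q = rng.standard_normal((m, m)); Q = Q @ Q.T
A = rng.standard_normal((n, m))
Hxx, Hyy, Hxy = 2 * P, -2 * Q, A

# Interior point standing in for a fully supported (x*, y*).
x = np.full(n, 1.0 / n)
y = np.full(m, 1.0 / m)

# J_small = 2*eta * blockdiag(D_x - x x^T, D_y - y y^T) @ H^-.
Bx = np.diag(x) - np.outer(x, x)
By = np.diag(y) - np.outer(y, y)
B = np.block([[Bx, np.zeros((n, m))], [np.zeros((m, n)), By]])
H_minus = np.block([[-Hxx, -Hxy], [Hxy.T, Hyy]])
J_small = 2 * eta * B @ H_minus

print(np.linalg.eigvals(J_small).real.max() <= 1e-10)   # Lemma 3.1: non-positive real parts
```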

Lemma 3.2.

If $\epsilon$ is a non-zero eigenvalue of $J_{\text{small}}$ then $\textrm{Re}(\epsilon)\leq 0$ and $|\epsilon|\downarrow 0$ as the stepsize $\eta\to 0$.

The first claim follows from Lemma 3.1. For the second, observe that the learning rate $\eta$ multiplies every entry of $J_{\text{small}}$ and hence every eigenvalue; thus, while $\eta$ remains positive, it can be chosen sufficiently small so that the magnitude of any eigenvalue satisfies $|\epsilon|\downarrow 0$.

Remark 3.3.

The equation $\epsilon=\frac{\lambda(\lambda-1)}{2\lambda-1}$ determines two complex roots for each fixed $\epsilon$, say $\lambda_{1}$ and $\lambda_{2}$. The relation between $|\epsilon|$, $|\lambda_{1}|$ and $|\lambda_{2}|$ is illustrated in Figure 2, where the $x$-axis is taken to be proportional to $\exp(1/|\epsilon|)$. Specifically, we choose $\epsilon=-1/\log(x)+\sqrt{-1}/\log(x)$, which satisfies $|\epsilon|\downarrow 0$ as $x\rightarrow\infty$ (the $x$-axis of Figure 2 takes $x$ from 3 to $10^{3}$).
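The relation of Remark 3.3 is just the quadratic $\lambda^{2}-(1+2\epsilon)\lambda+\epsilon=0$, so it can be probed directly. The sketch below (ours) samples values of $\epsilon$ with non-positive real part and small modulus, as guaranteed by Lemmas 3.1 and 3.2, and confirms that both roots lie strictly inside the unit disk; the sampling box is an arbitrary choice for illustration.

```python
import numpy as np

def lambdas(eps):
    """Roots of lambda^2 - (1 + 2*eps)*lambda + eps = 0, i.e. lambda*(lambda-1)/(2*lambda-1) = eps."""
    return np.roots([1.0, -(1.0 + 2.0 * eps), eps])

rng = np.random.default_rng(1)
worst = 0.0
for _ in range(10000):
    eps = complex(-rng.uniform(0.0, 0.2), rng.uniform(-0.2, 0.2))  # Re(eps) <= 0, |eps| small
    worst = max(worst, float(np.abs(lambdas(eps)).max()))
print(worst < 1.0, worst)   # both roots stay strictly inside the unit disk
```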

Figure 2: $|\lambda_{1}|$ and $|\lambda_{2}|$ are less than 1 when $|\epsilon|$ is small.
Proof.

Let $\lambda=x+\sqrt{-1}\,y$ and $\epsilon=a+\sqrt{-1}\,b$. The relation $\frac{\lambda(\lambda-1)}{2\lambda-1}=\epsilon$ gives two equations based on the equality of the real and imaginary parts, as follows:

x^{2}-x-y^{2} =2ax-a-2by \qquad (10)
2xy-y =2bx+2ay-b. \qquad (11)

Notice that the above equations can be transformed to the following forms:

\left(x-\frac{2a+1}{2}\right)^{2}-(y-b)^{2} =-a-b^{2}+\frac{(2a+1)^{2}}{4} \qquad (12)
\left(x-\frac{2a+1}{2}\right)(y-b) =ab. \qquad (13)

For each $\epsilon=a+\sqrt{-1}b$, there exist two points $(x_{1},y_{1})$ and $(x_{2},y_{2})$ that are the intersections of the above two hyperbolas, as illustrated in Figure 4. Recall the condition that $a<0$. As $|\epsilon|\rightarrow 0$, the hyperbolas can be obtained by translating by $(\frac{2a+1}{2},b)$ the hyperbolas

x^{2}-y^{2} =-a-b^{2}+\frac{(2a+1)^{2}}{4}
xy =ab

where the translated center of symmetry is close to $(\frac{1}{2},0)$ since $(a,b)$ is close to $(0,0)$. So the two intersections of the above hyperbolas, $(x_{1},y_{1})$ and $(x_{2},y_{2})$, satisfy the property that $x_{1}^{2}+y_{1}^{2}$ is small and $x_{2}>\frac{1}{2}$, since the two intersections are on the two sides of the axis $x=\frac{2a+1}{2}$, as shown in Figure 3.

Figure 3: (a) $ab<0$; (b) $ab>0$. The intersections of the four branches of the hyperbolas are the two solutions of the system (10)–(11) (equivalently (12)–(13)). The intersections are on the two sides of the line defined by $x=\frac{2a+1}{2}$, provided $|b|$ is small and $a<0$; this occurs in both cases $ab>0$ and $ab<0$.

On the other hand, we have

\frac{\lambda(\lambda-1)}{2\lambda-1}=\frac{(x+\sqrt{-1}y)(x-1+\sqrt{-1}y)}{2x-1+\sqrt{-1}\,2y}=\epsilon=a+\sqrt{-1}b

and then the condition $a<0$ gives the inequality

\text{Re}(\epsilon)=\frac{(x^{2}-x+y^{2})(2x-1)}{(2x-1)^{2}+4y^{2}}<0

that is equivalent to

x>\frac{1}{2}\ \ \text{and}\ \ x^{2}-x+y^{2}<0

where only the case $x>\frac{1}{2}$ needs to be considered, since the intersection whose $x$-component satisfies $x<\frac{1}{2}$ has the property that $x^{2}+y^{2}$ is small and hence less than 1 (see Figure 4). Thus, to prove that $|\lambda|<1$, it suffices to consider $x>\frac{1}{2}$. It is clear that $x^{2}-x+y^{2}=(x-\frac{1}{2})^{2}+y^{2}-\frac{1}{4}<0$ implies $x^{2}+y^{2}<1$. This completes the proof. ∎

Figure 4: The case $a=-0.1$, $b=0.1$.
Figure 5: Time comparisons of OMWU and projected OGDA for different choices of learning rate. Panels: (a) OMWU, (b) OGDA, (c) convergence time comparisons, (d) OMWU trajectories with different learning rates. For (a), (b), (c), the $x$-axis is the number of iterations and the $y$-axis is the $\ell_{1}$ error to the stationary point for Eqn. (14) with $n=100$. We observe that OMWU (as in (a)) always converges, while projected OGDA (as in (b)) diverges for large learning rates. In (c) we remove the divergent case and compare the efficiency of the two algorithms measured in CPU time. In (d) we visually present the trajectories for the min-max game $\min_{\mathbf{x}\in\Delta_{2}}\max_{\mathbf{y}\in\Delta_{2}}\{x_{1}^{2}-y_{1}^{2}+2x_{1}y_{1}\}$ with learning rates $0.1$, $1.0$ and $10$. Here the $x$-axis is the value of $x_{1}$ and the $y$-axis is the value of $y_{1}$. The equilibrium point the algorithm converges to is $\mathbf{x}=[0,1],\mathbf{y}=[0,1]$.
Figure 6: KL divergence decreases with the number of iterations under different settings. Panels: (a) KL divergence vs. number of iterations for different $n$; (b) KL divergence vs. number of iterations for different $\eta$. For both images, the $x$-axis is the number of iterations and the $y$-axis is the KL divergence. Panel (a) is OMWU on the bilinear function Eqn. (14) with $n\in\{25,100,175,250\}$. Panel (b) is OMWU on the quadratic function $f(\mathbf{x},\mathbf{y})=x_{1}^{2}-y_{1}^{2}+2x_{1}y_{1}$ with learning rate $\eta\in\{0.01,0.1,1.0,10.0\}$. The shaded area indicates the standard deviation over 10 runs with random initializations. OMWU with a smaller learning rate tends to have higher variance.

4 Experiments

In this section, we conduct empirical studies to verify the theoretical results of our paper. We primarily aim to understand two factors that influence the convergence speed of OMWU: the problem size and the learning rate. We also compare our algorithm with Optimistic Gradient Descent Ascent (OGDA) with projection, and demonstrate the advantages of OMWU over it.

We start with a simple bilinear min-max game:

\min_{\mathbf{x}\in\Delta_{n}}\max_{\mathbf{y}\in\Delta_{n}}\mathbf{x}^{\top}A\mathbf{y}. \qquad (14)

We first vary the value of $n$ to study how the learning speed scales with the size of the problem. The learning rate is fixed at $1.0$, and we run OMWU with $n\in\{25,50,75,\cdots,250\}$, where the matrix $A\in\mathbb{R}^{n\times n}$ is generated with i.i.d. random Gaussian entries. We report the number of iterations for OMWU to reach convergence, i.e., for the $\ell_{1}$ error to the optimal solution to become at most $10^{-5}$. The results are averaged over 10 runs with different random initializations. As reported in Figure 1, a larger problem size generally requires more iterations to reach convergence. We also provide four specific values of $n$ to show the convergence in $\ell_{1}$ distance in Figure 1(b). The shaded area shows the standard deviation across the runs.
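A version of this experiment can be reproduced along the following lines (a sketch under our assumptions, reusing the `omwu_step` function from Section 2.2's snippet; it is not the authors' experiment code). For simplicity it tracks the duality gap of the last iterate rather than the $\ell_{1}$ distance to a precomputed optimal solution, and the seed and iteration budget are illustrative.

```python
import numpy as np

def bilinear_experiment(n, eta=1.0, iters=20000, seed=0):
    """Run OMWU on min_x max_y x^T A y with a random Gaussian payoff matrix A."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    grad_x = lambda x, y: A @ y
    grad_y = lambda x, y: A.T @ x
    x = x_prev = np.full(n, 1.0 / n)
    y = y_prev = np.full(n, 1.0 / n)
    gaps = []
    for _ in range(iters):
        x, y, x_prev, y_prev = *omwu_step(x, y, x_prev, y_prev, grad_x, grad_y, eta), x, y
        gaps.append((A.T @ x).max() - (A @ y).min())   # duality gap of the last iterate
    return np.array(gaps)

gaps = bilinear_experiment(n=25)
print(gaps[::5000])   # the gap shrinks as the last iterate approaches the equilibrium
```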

To understand how the learning rate affects the speed of convergence, we conduct similar experiments on Eqn. (14) and plot the $\ell_{1}$ error for different step sizes in Figure 5(a)-(c). For this experiment the matrix size is fixed at $n=100$. We also include a comparison with Optimistic Gradient Descent Ascent [7]. Notice that the original proposal was for unconstrained problems, so we apply a projection in each step in order to keep the iterates inside the simplex. For the setting we consider, we observe that a larger learning rate effectively speeds up the learning process, and that our algorithm is relatively stable to the choice of step-size. In comparison, OGDA is quite sensitive to the choice of step-size. As shown in Figure 5(b), a larger step-size makes the algorithm diverge, while a smaller step-size makes very little progress. Furthermore, we also run our algorithm on a convex-concave but not bilinear function $f(\mathbf{x},\mathbf{y})=x_{1}^{2}-y_{1}^{2}+2x_{1}y_{1}$, where $\mathbf{x},\mathbf{y}\in\Delta_{2}$ and $x_{1}$ and $y_{1}$ are the first coordinates of $\mathbf{x}$ and $\mathbf{y}$. With this low dimensional function, we can visually show the convergence procedure, as in Figure 5(d), where each arrow indicates an OMWU step. This figure demonstrates that, at least in this case, a larger step size usually makes bigger progress towards the optimal solution.

Finally, we show how the KL divergence $D_{KL}((\mathbf{x}^{*},\mathbf{y}^{*})\parallel(\mathbf{x}^{t},\mathbf{y}^{t}))$ decreases under different circumstances. Figure 6 again considers the bilinear problem (Eqn. (14)) for multiple dimensions $n$, as well as the simple convex-concave function $f(\mathbf{x},\mathbf{y})=x_{1}^{2}-y_{1}^{2}+2x_{1}y_{1}$ with different learning rates. In all circumstances we consider, we observe that OMWU is very stable and achieves global convergence, invariant to the problem size, random initialization, and learning rate.
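For completeness, the divergence plotted in Figure 6 can be computed with a small helper. The sketch below is ours; it reads the pair $(\mathbf{x}^{*},\mathbf{y}^{*})$ as a product distribution, so the divergence splits into a sum of two KL terms, and it uses the convention $0\log(0/q)=0$.

```python
import numpy as np

def kl_pair(x_star, y_star, x_t, y_t, eps=1e-12):
    """D_KL((x*, y*) || (x^t, y^t)) = D_KL(x* || x^t) + D_KL(y* || y^t)."""
    def kl(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0                                   # 0 * log(0/q) = 0
        return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))
    return kl(x_star, x_t) + kl(y_star, y_t)

# Example with the equilibrium x* = y* = [0, 1] of the quadratic game in Figure 6(b).
print(kl_pair([0.0, 1.0], [0.0, 1.0], [0.2, 0.8], [0.1, 0.9]))   # about 0.33
```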

5 Conclusion

In this paper we analyze the last iterate behavior of a no-regret learning algorithm, Optimistic Multiplicative Weights Update, for convex-concave landscapes. We prove that OMWU exhibits last iterate convergence in a neighborhood of its fixed point, generalizing previous results that showed last iterate convergence only for bilinear functions. Our experiments explore how the problem size and the choice of learning rate affect the performance of the algorithm. We find that OMWU achieves global convergence and is less sensitive to the choice of hyperparameters than projected optimistic gradient descent ascent.

References

  • [1] Jacob D. Abernethy, Kevin A. Lai, and Andre Wibisono. Last-iterate convergence rates for min-max optimization. CoRR, abs/1906.02027, 2019.
  • [2] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
  • [3] James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM Conference on Economics and Computation, Ithaca, NY, USA, June 18-22, 2018, pages 321–338, 2018.
  • [4] G.W Brown. Iterative solutions of games by fictitious play. In Activity Analysis of Production and Allocation, 1951.
  • [5] Nikolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • [6] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with Optimism. In Proceedings of ICLR, 2018.
  • [7] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 9256–9266, 2018.
  • [8] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, pages 27:1–27:18, 2019.
  • [9] Oded Galor. Discrete Dynamical Systems. Springer, 2007.
  • [10] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint:1802.06132, 2018.
  • [11] Panayotis Mertikopoulos, Houssam Zenati, Bruno Lecouat, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. CoRR, abs/1807.02629, 2018.
  • [12] Mohammad Sal Moslehian. Ky fan inequalities. CoRR, abs/1108.1467, 2011.
  • [13] Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5874–5884, 2017.
  • [14] J. Robinson. An iterative method of solving a game. In Annals of Mathematics, pages 296–301, 1951.
  • [15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Annual Conference on Neural Information Processing Systems 2015, pages 2989–2997, 2015.
  • [16] J Von Neumann. Zur theorie der gesellschaftsspiele. In Math. Ann., pages 295–320, 1928.

Appendix A Equations of the Jacobian of OMWU

g1,ixi\displaystyle\frac{\partial g_{1,i}}{\partial x_{i}} =e2ηfxi+ηfziSx+xi1Sx2(e2ηfxi+ηfzi(2η2fxi2)Sxe2ηfxi+ηfziSxxi)\displaystyle=\frac{e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}}{S_{x}}+x_{i}\frac{1}{S_{x}^{2}}\left(e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}(-2\eta\frac{\partial^{2}f}{\partial x_{i}^{2}})S_{x}-e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\frac{\partial S_{x}}{\partial x_{i}}\right) (15)
whereSxxi=e2ηfxi+ηfzi2ηkxke2ηfxk+ηfzk2fxi2\displaystyle\text{where}\ \ \frac{\partial S_{x}}{\partial x_{i}}=e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}-2\eta\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{k}}+\eta\frac{\partial f}{\partial z_{k}}}\frac{\partial^{2}f}{\partial x_{i}^{2}} (16)
g1,ixj\displaystyle\frac{\partial g_{1,i}}{\partial x_{j}} =xi1Sx2(e2ηfxi+ηfzi(2η2fxixj)Sxe2ηfxi+ηfziSxxj)\displaystyle=x_{i}\frac{1}{S_{x}^{2}}\left(e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}(-2\eta\frac{\partial^{2}f}{\partial x_{i}\partial x_{j}})S_{x}-e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\frac{\partial S_{x}}{\partial x_{j}}\right) (17)
whereSxxj=e2ηfxj+ηfzj2ηkxke2ηfxk+ηfzk2fxjxk\displaystyle\text{where}\ \ \frac{\partial S_{x}}{\partial x_{j}}=e^{-2\eta\frac{\partial f}{\partial x_{j}}+\eta\frac{\partial f}{\partial z_{j}}}-2\eta\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{k}}+\eta\frac{\partial f}{\partial z_{k}}}\frac{\partial^{2}f}{\partial x_{j}\partial x_{k}} (18)
g1,iyj\displaystyle\frac{\partial g_{1,i}}{\partial y_{j}} =xi1Sx2(e2ηfxi+ηfzi(2η2fxiyj)Sxe2ηfxi+ηfziSxyj)\displaystyle=x_{i}\frac{1}{S_{x}^{2}}\left(e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}(-2\eta\frac{\partial^{2}f}{\partial x_{i}\partial y_{j}})S_{x}-e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\frac{\partial S_{x}}{\partial y_{j}}\right) (19)
whereSxyj=kxke2ηfxi+ηfzi(2η2fxkyj)\displaystyle\text{where}\ \ \frac{\partial S_{x}}{\partial y_{j}}=\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}(-2\eta\frac{\partial^{2}f}{\partial x_{k}\partial y_{j}}) (20)
g1,izj\displaystyle\frac{\partial g_{1,i}}{\partial z_{j}} =xi1Sx2(e2ηfxi+ηfzi(η2fzjxi)Sxe2ηfxi+ηfziSxzj)\displaystyle=x_{i}\frac{1}{S_{x}^{2}}\left(e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}(\eta\frac{\partial^{2}f}{\partial z_{j}\partial x_{i}})S_{x}-e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\frac{\partial S_{x}}{\partial z_{j}}\right) (21)
whereSxzj=ηkxke2ηfxk+ηfzk2fzkzj\displaystyle\text{where}\ \ \frac{\partial S_{x}}{\partial z_{j}}=\eta\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{k}}+\eta\frac{\partial f}{\partial z_{k}}}\frac{\partial^{2}f}{\partial z_{k}\partial z_{j}} (22)
g1,iwj\displaystyle\frac{\partial g_{1,i}}{\partial w_{j}} =xi1Sx2(e2ηfxi+ηfziη2fziwjSxe2ηfxi+ηfziSxwj)\displaystyle=x_{i}\frac{1}{S_{x}^{2}}\left(e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\eta\frac{\partial^{2}f}{\partial z_{i}\partial w_{j}}S_{x}-e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\frac{\partial S_{x}}{\partial w_{j}}\right) (23)
whereSxwj=kxke2ηfxk+ηfzkηfzkwj\displaystyle\text{where}\ \ \frac{\partial S_{x}}{\partial w_{j}}=\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{k}}+\eta\frac{\partial f}{\partial z_{k}}}\eta\frac{\partial f}{\partial z_{k}\partial w_{j}} (24)
g2,ixj\displaystyle\frac{\partial g_{2,i}}{\partial x_{j}} =yi1Sy2(e2ηfyiηfwi(2η2fxjyi)Sye2ηfyiηfwiSyxj)\displaystyle=y_{i}\frac{1}{S_{y}^{2}}\left(e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}(2\eta\frac{\partial^{2}f}{\partial x_{j}\partial y_{i}})S_{y}-e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}\frac{\partial S_{y}}{\partial x_{j}}\right) (25)
whereSyxj=kyke2ηfyiηfwi2η2fxjyk\displaystyle\text{where}\ \ \frac{\partial S_{y}}{\partial x_{j}}=\sum_{k}y_{k}e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}2\eta\frac{\partial^{2}f}{\partial x_{j}\partial y_{k}} (26)
g2,iyi\displaystyle\frac{\partial g_{2,i}}{\partial y_{i}} =e2ηfyiηfwiSy+yi1Sy2(e2ηfyiηfwi2η2fyi2Sye2ηfyiηfwiSyyi)\displaystyle=\frac{e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}}{S_{y}}+y_{i}\frac{1}{S_{y}^{2}}\left(e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}2\eta\frac{\partial^{2}f}{\partial y_{i}^{2}}S_{y}-e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}\frac{\partial S_{y}}{\partial y_{i}}\right) (27)
whereSyyi=e2ηfyiηfwi+2ηkyke2ηfyiηfwi2fyiyk\displaystyle\text{where}\ \ \frac{\partial S_{y}}{\partial y_{i}}=e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}+2\eta\sum_{k}y_{k}e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}\frac{\partial^{2}f}{\partial y_{i}\partial y_{k}} (28)
g2,izj\displaystyle\frac{\partial g_{2,i}}{\partial z_{j}} =yi1Sy2(e2ηfyiηfwi(η2fwizj)Sye2ηfyiηfwiSyzj)\displaystyle=y_{i}\frac{1}{S_{y}^{2}}\left(e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}(-\eta\frac{\partial^{2}f}{\partial w_{i}\partial z_{j}})S_{y}-e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}\frac{\partial S_{y}}{\partial z_{j}}\right) (29)
whereSyzj=kyke2ηfyiηfwi(η2fwkzj)\displaystyle\text{where}\ \ \frac{\partial S_{y}}{\partial z_{j}}=\sum_{k}y_{k}e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}(-\eta\frac{\partial^{2}f}{\partial w_{k}\partial z_{j}}) (30)
g2,iwj\displaystyle\frac{\partial g_{2,i}}{\partial w_{j}} =yi1Sy2(e2ηfyiηfwi(η2fwiwj)e2ηfyiηfwiSywj)\displaystyle=y_{i}\frac{1}{S_{y}^{2}}\left(e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}(-\eta\frac{\partial^{2}f}{\partial w_{i}\partial w_{j}})-e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}\frac{\partial S_{y}}{\partial w_{j}}\right) (31)
whereSywj=kyke2ηfyiηfwi(η2fwkwj)\displaystyle\text{where}\ \ \frac{\partial S_{y}}{\partial w_{j}}=\sum_{k}y_{k}e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}(-\eta\frac{\partial^{2}f}{\partial w_{k}\partial w_{j}}) (32)

Appendix B Equations of the Jacobian of OMWU at the fixed point $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{z}^{*},\mathbf{w}^{*})$

In this section, we compute the equations of the Jacobian at the fixed point $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{z}^{*},\mathbf{w}^{*})$. The fact that $(\mathbf{x}^{*},\mathbf{y}^{*})=(\mathbf{z}^{*},\mathbf{w}^{*})$, and that $(\mathbf{z},\mathbf{w})$ takes the position of $(\mathbf{x},\mathbf{y})$ in computing the partial derivatives, gives the following equations.

g1,ixi\displaystyle\frac{\partial g_{1,i}}{\partial x_{i}} =1xi2ηxi(2fxikxk2fxixk),i[n],\displaystyle=1-x_{i}^{*}-2\eta x_{i}^{*}(\frac{\partial^{2}f}{\partial x_{i}^{*}}-\sum_{k}x_{k}^{*}\frac{\partial^{2}f}{\partial x_{i}\partial x_{k}}),i\in[n], (33)
g1,ixj\displaystyle\frac{\partial g_{1,i}}{\partial x_{j}} =xi2ηxi(2fxixjkxk2fxjxk),j[n],ji\displaystyle=-x_{i}^{*}-2\eta x_{i}^{*}(\frac{\partial^{2}f}{\partial x_{i}\partial x_{j}}-\sum_{k}x_{k}^{*}\frac{\partial^{2}f}{\partial x_{j}\partial x_{k}}),j\in[n],j\neq i (34)
g1,iyj\displaystyle\frac{\partial g_{1,i}}{\partial y_{j}} =2ηxi(2fxiyjkxk2fxkyj),j[m]\displaystyle=-2\eta x_{i}^{*}(\frac{\partial^{2}f}{\partial x_{i}\partial y_{j}}-\sum_{k}x_{k}^{*}\frac{\partial^{2}f}{\partial x_{k}\partial y_{j}}),j\in[m] (35)
g1,izj\displaystyle\frac{\partial g_{1},i}{\partial z_{j}} =ηxi(2fxixjkxk2fxkxj),j[n]\displaystyle=\eta x_{i}^{*}(\frac{\partial^{2}f}{\partial x_{i}\partial x_{j}}-\sum_{k}x_{k}^{*}\frac{\partial^{2}f}{\partial x_{k}\partial x_{j}}),j\in[n] (36)
g1,iwj\displaystyle\frac{\partial g_{1,i}}{\partial w_{j}} =ηxi(2fxiyjkxk2fxkyj),j[m]\displaystyle=\eta x_{i}^{*}(\frac{\partial^{2}f}{\partial x_{i}\partial y_{j}}-\sum_{k}x_{k}^{*}\frac{\partial^{2}f}{\partial x_{k}\partial y_{j}}),j\in[m] (37)
g2,ixj\displaystyle\frac{\partial g_{2,i}}{\partial x_{j}} =2ηyi(2fxjyikyk2fxjyk),j[n]\displaystyle=2\eta y_{i}^{*}(\frac{\partial^{2}f}{\partial x_{j}\partial y_{i}}-\sum_{k}y_{k}^{*}\frac{\partial^{2}f}{\partial x_{j}\partial y_{k}}),j\in[n] (38)
g2,iyi\displaystyle\frac{\partial g_{2,i}}{\partial y_{i}} =1yi+2η(2fyi2kyk2fyiyk),i[m]\displaystyle=1-y_{i}^{*}+2\eta(\frac{\partial^{2}f}{\partial y_{i}^{2}}-\sum_{k}y_{k}^{*}\frac{\partial^{2}f}{\partial y_{i}\partial y_{k}}),i\in[m] (39)
g2,iyj\displaystyle\frac{\partial g_{2,i}}{\partial y_{j}} =yi+2η(2fyiyjkyk2fyjyk),j[m]\displaystyle=-y_{i}^{*}+2\eta(\frac{\partial^{2}f}{\partial y_{i}\partial y_{j}}-\sum_{k}y_{k}^{*}\frac{\partial^{2}f}{\partial y_{j}\partial y_{k}}),j\in[m] (40)
g2,izj\displaystyle\frac{\partial g_{2,i}}{\partial z_{j}} =ηyi(2fxjyi+kyk2fxjyk),j[n]\displaystyle=\eta y_{i}^{*}(-\frac{\partial^{2}f}{\partial x_{j}\partial y_{i}}+\sum_{k}y_{k}^{*}\frac{\partial^{2}f}{\partial x_{j}\partial y_{k}}),j\in[n] (41)
g2,iwj\displaystyle\frac{\partial g_{2,i}}{\partial w_{j}} =ηyi(2fyiyj+kyk2fykyj),j[m]\displaystyle=\eta y_{i}^{*}(-\frac{\partial^{2}f}{\partial y_{i}\partial y_{j}}+\sum_{k}y_{k}^{*}\frac{\partial^{2}f}{\partial y_{k}\partial y_{j}}),j\in[m] (42)
g3,ixi\displaystyle\frac{\partial g_{3,i}}{\partial x_{i}} =1for alli[n]and zerofor all the other partial derivatives ofg3,i\displaystyle=1\ \ \text{for all}\ \ i\in[n]\ \ \text{and zero}\text{for all the other partial derivatives of}\ \ g_{3,i} (43)
g4,iyi\displaystyle\frac{\partial g_{4,i}}{\partial y_{i}} =1for alli[m]and zero for all the other partial derivatives ofg4,i.\displaystyle=1\ \text{for all}\ \ i\in[m]\ \ \text{and zero for all the other partial derivatives of}\ \ \ g_{4,i}. (44)

Appendix C Jacobian matrix at $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{z}^{*},\mathbf{w}^{*})$

This section supports the spectral analysis of Section 3. The Jacobian matrix of $g$ at the fixed point is obtained from the calculations above. We refer to the main article for the subscripts indicating the size of each block matrix.

J=[𝐈D𝐱𝟏𝟏2ηD𝐱(𝐈𝟏𝐱)𝐱𝐱2f2ηD𝐱(𝐈𝟏𝐱)𝐱𝐲2fηD𝐱(𝐈𝟏𝐱)𝐱𝐱2fηD𝐱(𝐈𝟏𝐱)𝐱𝐲2f2ηD𝐲(𝐈𝟏𝐲)𝐲𝐱2f𝐈D𝐲𝟏𝟏+2ηD𝐲(𝐈𝟏𝐲)𝐲𝐲2fηD𝐲(𝐈𝟏𝐲)𝐲𝐱2fηD𝐲(𝐈𝟏𝐲)𝐲𝐲2f𝐈𝟎𝟎𝟎𝟎𝐈𝟎𝟎]J=\left[\begin{array}[]{cccc}\mathbf{I}-D_{\mathbf{x}^{*}}\mathbf{1}\mathbf{1}^{\top}-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f&-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f&\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f&\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ 2\eta D_{\mathbf{y}^{*\top}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*})\nabla_{\mathbf{y}\mathbf{x}}^{2}f&\mathbf{I}-D_{\mathbf{y}^{*}}\mathbf{1}\mathbf{1}^{\top}+2\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f&-\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f&-\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f\\ \mathbf{I}&\mathbf{0}&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\mathbf{I}&\mathbf{0}&\mathbf{0}\end{array}\right]

By acting on the tangent space of each simplex, we observe that $D_{\mathbf{x}^{*}}\mathbf{1}\mathbf{1}^{\top}\mathbf{v}=0$ for $\sum_{k}v_{k}=0$, so each eigenvalue of the matrix $J$ is an eigenvalue of the following matrix

Jnew=[𝐈2ηD𝐱(𝐈𝟏𝐱)𝐱𝐱2f2ηD𝐱(𝐈𝟏𝐱)𝐱𝐲2fηD𝐱(𝐈𝟏𝐱)𝐱𝐱2fηD𝐱(𝐈𝟏𝐱)𝐱𝐲2f2ηD𝐲(𝐈𝟏𝐲)𝐲𝐱2f𝐈+2ηD𝐲(𝐈𝟏𝐲)𝐲𝐲2fηD𝐲(𝐈𝟏𝐲)𝐲𝐱2fηD𝐲(𝐈𝟏𝐲)𝐲𝐲2f𝐈𝟎𝟎𝟎𝟎𝐈𝟎𝟎]J_{\text{new}}=\left[\begin{array}[]{cccc}\mathbf{I}-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f&-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f&\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f&\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ 2\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f&\mathbf{I}+2\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f&-\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f&-\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f\\ \mathbf{I}&\mathbf{0}&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\mathbf{I}&\mathbf{0}&\mathbf{0}\end{array}\right]

The characteristic polynomial of $J_{\text{new}}$ is $\det(J_{\text{new}}-\lambda I)$, which can be computed as the determinant of the following matrix:

[(1λ)𝐈+(1λ2)ηD𝐱(𝐈𝟏𝐱)𝐱𝐱2f(1λ2)ηD𝐱(𝐈𝟏𝐱)𝐱𝐲2f(21λ)ηD𝐲(𝐈𝟏𝐲)𝐲𝐱2f(1λ)𝐈+(21λ)ηD𝐲(𝐈𝟏𝐲)𝐲𝐲2f]\displaystyle\left[\begin{array}[]{cccc}(1-\lambda)\mathbf{I}+(\frac{1}{\lambda}-2)\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f&(\frac{1}{\lambda}-2)\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ (2-\frac{1}{\lambda})\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f&(1-\lambda)\mathbf{I}+(2-\frac{1}{\lambda})\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f\end{array}\right] (47)