
Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes

Qi Lei
UT Austin
[email protected]
   Sai Ganesh Nagarajan
SUTD
[email protected]
   Ioannis Panageas
SUTD
[email protected]
   Xiao Wang
SUTD
[email protected]
Abstract

In a recent series of papers it has been established that variants of Gradient Descent/Ascent and Mirror Descent exhibit last iterate convergence in convex-concave zero-sum games. Specifically, [6, 10] show last iterate convergence of the so-called “Optimistic Gradient Descent/Ascent” for the case of unconstrained min-max optimization. Moreover, in [11] the authors show that Mirror Descent with an extra gradient step displays last iterate convergence for convex-concave problems (both constrained and unconstrained), though their algorithm does not follow the online learning framework: it uses extra information, rather than only the history, to compute the next iteration. In this work, we show that “Optimistic Multiplicative-Weights Update (OMWU)”, which does follow the no-regret online learning framework, exhibits last iterate convergence locally for convex-concave games, generalizing the results of [8], where last iterate convergence of OMWU was shown only for the bilinear case. We complement our results with experiments that indicate fast convergence of the method.

1 Introduction

In classic (normal form) zero-sum games, one has to compute two probability vectors $\mathbf{x}^{*}\in\Delta_{n},\mathbf{y}^{*}\in\Delta_{m}$ (here $\Delta_{n}$ denotes the simplex of size $n$) that constitute an equilibrium of the following problem

\min_{\mathbf{x}\in\Delta_{n}}\max_{\mathbf{y}\in\Delta_{m}}\mathbf{x}^{\top}A\mathbf{y}, \qquad (1)

where $A$ is an $n\times m$ real matrix (called the payoff matrix). Here $\mathbf{x}^{\top}A\mathbf{y}$ represents the payment of the $\mathbf{x}$ player to the $\mathbf{y}$ player under the strategies chosen by the two players, and is a bilinear function.

Arguably one of the most celebrated theorems, and a cornerstone of Game Theory, is the minimax theorem of Von Neumann [16]. It states that

\min_{\mathbf{x}\in\Delta_{n}}\max_{\mathbf{y}\in\Delta_{m}}f(\mathbf{x},\mathbf{y})=\max_{\mathbf{y}\in\Delta_{m}}\min_{\mathbf{x}\in\Delta_{n}}f(\mathbf{x},\mathbf{y}), \qquad (2)

where $f:\Delta_{n}\times\Delta_{m}\to\mathbb{R}$ is convex in $\mathbf{x}$ and concave in $\mathbf{y}$. The aforementioned result holds for any convex compact sets $\mathcal{X}\subset\mathbb{R}^{n}$ and $\mathcal{Y}\subset\mathbb{R}^{m}$. The min-max theorem reassures us that an equilibrium always exists in the bilinear game (1) or its convex-concave analogue (again $f(\mathbf{x},\mathbf{y})$ is interpreted as the payment of the $\mathbf{x}$ player to the $\mathbf{y}$ player). An equilibrium is a pair of randomized strategies $(\mathbf{x}^{*},\mathbf{y}^{*})$ such that neither player can improve their payoff by unilaterally changing their distribution.

Soon after the appearance of the minimax theorem, research focused on dynamics for solving min-max optimization problems by having the min and max players of (1) run a simple online learning procedure. In the online learning framework, at time $t$, each player chooses a probability distribution ($\mathbf{x}^{t},\mathbf{y}^{t}$ respectively) simultaneously, depending only on the past choices of both players (i.e., $\mathbf{x}^{1},...,\mathbf{x}^{t-1},\mathbf{y}^{1},...,\mathbf{y}^{t-1}$), and experiences a payoff that depends on the choices $\mathbf{x}^{t},\mathbf{y}^{t}$.

An early method, proposed by Brown [4] and analyzed by Robinson [14], was fictitious play. Later on, researchers discovered several robust learning algorithms that converge to the minimax equilibrium at faster rates; see [5]. This class of learning algorithms, the so-called “no-regret” algorithms, includes the Multiplicative Weights Update method [2] and Follow the Regularized Leader.

1.1 Average Iterate Convergence

Despite the rich literature on no-regret learning, most of the known results have the feature that the min-max equilibrium is shown to be attained only by the time average. This means that the trajectory $(\mathbf{x}^{t},\mathbf{y}^{t})$ of a no-regret learning method has the property that $\frac{1}{t}\sum_{\tau\leq t}{\mathbf{x}^{\tau}}^{\top}A\mathbf{y}^{\tau}$ converges to the value of (1) as $t\rightarrow\infty$. Unfortunately, this does not mean that the last iterate $(\mathbf{x}^{t},\mathbf{y}^{t})$ converges to an equilibrium; it commonly diverges or cycles. One such example is the well-known Multiplicative Weights Update algorithm, whose time average is known to converge to an equilibrium, but whose actual trajectory cycles towards the boundary of the simplex [3]. The same is true even for vanilla Gradient Descent/Ascent, where one can show that the last iterate fails to converge already for bilinear landscapes (unconstrained case) [6].

Motivated by the training of Generative Adversarial Networks (GANs), in the last couple of years researchers have focused on designing and analyzing procedures that exhibit last iterate convergence (or pointwise convergence) for zero-sum games. This is crucial for training GANs, whose landscapes are typically non-convex non-concave and where averaging no longer gives meaningful guarantees (e.g., note that Jensen’s inequality is not applicable anymore). In [6, 10] the authors show that a variant of Gradient Descent/Ascent, called Optimistic Gradient Descent/Ascent, has last iterate convergence for the case of bilinear functions $\mathbf{x}^{\top}A\mathbf{y}$ with $\mathbf{x}\in\mathbb{R}^{n}$ and $\mathbf{y}\in\mathbb{R}^{m}$ (this is called the unconstrained case, since there are no restrictions on the vectors). Later on, [8] generalized the above result to simplex constraints, where the online method the authors analyzed was Optimistic Multiplicative Weights Update. In [11], it is shown that Mirror Descent with an extra gradient computation converges pointwise for a class of zero-sum games that includes the convex-concave setting (with arbitrary convex constraints), though their algorithm does not fit in the online no-regret framework since it uses the payoff information twice before it iterates. Last but not least, other works have shown pointwise convergence in related settings to stationary points (but not local equilibrium solutions); see [13, 7], [1] and references therein.

1.2 Main Results

In this paper, we focus on the min-max optimization problem

\min_{\mathbf{x}\in\Delta_{n}}\max_{\mathbf{y}\in\Delta_{m}}f(\mathbf{x},\mathbf{y}), \qquad (3)

where ff is a convex-concave function (convex in 𝐱\mathbf{x}, concave in 𝐲\mathbf{y}). We analyze the no-regret online algorithm Optimistic Multiplicative Weights Update (OMWU). OMWU is an instantiation of the Optimistic Follow the Regularized Leader (OFTRL) method with entropy as a regularizer (for both players, see Preliminaries section for the definition of OMWU).

We prove that OMWU exhibits local last iterate convergence, generalizing the result of [8] and answering an open question of [15] (for convex-concave games). Formally, our main theorem is stated below:

Theorem 1.1 (Last iterate convergence of OMWU).

Let $f:\Delta_{n}\times\Delta_{m}\to\mathbb{R}$ be a twice differentiable function that is convex in $\mathbf{x}$ and concave in $\mathbf{y}$. Assume that there exists an equilibrium $(\mathbf{x}^{*},\mathbf{y}^{*})$ that satisfies the KKT conditions with strict inequalities (see (4)). Then, for a sufficiently small stepsize, there exists a neighborhood $U\subseteq\Delta_{n}\times\Delta_{m}$ of $(\mathbf{x}^{*},\mathbf{y}^{*})$ such that for all initial conditions $(\mathbf{x}^{0},\mathbf{y}^{0}),(\mathbf{x}^{1},\mathbf{y}^{1})\in U$, OMWU exhibits last iterate (pointwise) convergence, i.e.,

\lim_{t\to\infty}(\mathbf{x}^{t},\mathbf{y}^{t})=(\mathbf{x}^{*},\mathbf{y}^{*}),

where $(\mathbf{x}^{t},\mathbf{y}^{t})$ denotes the $t$-th iterate of OMWU.

Moreover, we complement our theoretical findings with an experimental analysis of the procedure. The experiments, which track the KL divergence to the equilibrium, indicate that the result should hold globally.

1.3 Structure and Technical Overview

We present the structure of the paper and a brief technical overview.

Section 2 provides necessary definitions, the explicit form of OMWU derived from OFTRL with entropy regularizer, and some existing results on dynamical systems.

Section 3 is the main technical part, i.e., the computation and spectral analysis of the Jacobian matrix of the OMWU dynamics. The stability analysis, the understanding of the local behavior, and the local convergence guarantees of OMWU rely on the spectral analysis of the computed Jacobian matrix. The techniques for bilinear games (as in [8]) are no longer valid in convex-concave games. Allow us to explain the differences from [8]. In general, one cannot expect a trivial generalization from linear to non-linear scenarios. The properties of bilinear games are fundamentally different from those of convex-concave games, and this makes the analysis much more challenging in the latter. The key result of the spectral analysis in [8] is a lemma (Lemma B.6) which states that a skew-symmetric matrix ($A$ is skew-symmetric if $A^{\top}=-A$) has purely imaginary eigenvalues. Skew-symmetric matrices appear because in the bilinear case there are terms that are linear in $\mathbf{x}$ and linear in $\mathbf{y}$ but no higher order terms in $\mathbf{x}$ or $\mathbf{y}$. However, skew symmetry has no place in the case of convex-concave landscapes and the Jacobian matrix of OMWU is far more complicated. One key technique to overcome the lack of skew symmetry is the use of the Ky Fan inequality [12], which states that the sequence of eigenvalues of $\frac{1}{2}(W+W^{\top})$ majorizes the real part of the sequence of eigenvalues of $W$ for any square matrix $W$ (see Lemma 3.1).

Section 4 focuses on numerical experiments that probe how the problem size and the choice of learning rate affect the performance of our algorithm. We observe that our algorithm achieves global convergence irrespective of the choice of learning rate, random initialization or problem size. In comparison, the recently popularized (projected) optimistic gradient descent ascent is much more sensitive to the choice of hyperparameters. Due to space constraints, the detailed calculations of the Jacobian matrix of OMWU (the general form and at the fixed point) are deferred to the Appendix.

Notation

The boldface $\mathbf{x}$ and $\mathbf{y}$ denote vectors in $\Delta_{n}$ and $\Delta_{m}$. $\mathbf{x}^{t}$ denotes the $t$-th iterate of the dynamical system. The letter $J$ denotes the Jacobian matrix. $\mathbf{I}$, $\mathbf{0}$ and $\mathbf{1}$ are reserved for the identity matrix, the zero matrix and the all-ones vector. The support of $\mathbf{x}$, denoted $\textrm{Supp}(\mathbf{x})$, is the set of indices $i$ such that $x_{i}\neq 0$. $(\mathbf{x}^{*},\mathbf{y}^{*})$ denotes the optimal solution of the min-max problem. $[n]$ denotes the set of integers $\{1,...,n\}$.

2 Preliminaries

In this section, we present some background that will be used later.

2.1 Equilibria for Constrained Minimax

From Von Neumann’s minimax theorem, one can conclude that the problem $\min_{\mathbf{x}\in\Delta_{n}}\max_{\mathbf{y}\in\Delta_{m}}f(\mathbf{x},\mathbf{y})$ always has an equilibrium $(\mathbf{x}^{*},\mathbf{y}^{*})$, with the value $f(\mathbf{x}^{*},\mathbf{y}^{*})$ being unique. Moreover, by the KKT conditions (as long as $f$ is twice differentiable), such an equilibrium must satisfy the following ($\mathbf{x}^{*}$ is a local minimum for fixed $\mathbf{y}=\mathbf{y}^{*}$ and $\mathbf{y}^{*}$ is a local maximum for fixed $\mathbf{x}=\mathbf{x}^{*}$):

Definition 2.1 (KKT conditions).

Formally, it holds

\begin{array}{l}\mathbf{x}^{*}\in\Delta_{n}\\ x_{i}^{*}>0\Rightarrow\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})=\sum_{j=1}^{n}x_{j}^{*}\frac{\partial f}{\partial x_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})\\ x_{i}^{*}=0\Rightarrow\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})\geq\sum_{j=1}^{n}x_{j}^{*}\frac{\partial f}{\partial x_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})\\ \textrm{for player }\mathbf{x},\\ \mathbf{y}^{*}\in\Delta_{m}\\ y_{i}^{*}>0\Rightarrow\frac{\partial f}{\partial y_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})=\sum_{j=1}^{m}y_{j}^{*}\frac{\partial f}{\partial y_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})\\ y_{i}^{*}=0\Rightarrow\frac{\partial f}{\partial y_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})\leq\sum_{j=1}^{m}y_{j}^{*}\frac{\partial f}{\partial y_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})\\ \textrm{for player }\mathbf{y}.\end{array} \qquad (4)
Remark 2.2 (No degeneracies).

For the rest of the paper we assume no degeneracies, i.e., the last inequalities in (4) hold strictly (in the case a strategy is played with zero probability by a player). Moreover, it is easy to see that since $f$ is convex-concave and twice differentiable, $\nabla^{2}_{\mathbf{x}\mathbf{x}}f$ (the part of the Hessian that involves the $\mathbf{x}$ variables) is positive semi-definite and $\nabla^{2}_{\mathbf{y}\mathbf{y}}f$ (the part of the Hessian that involves the $\mathbf{y}$ variables) is negative semi-definite.
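The conditions in (4), together with the non-degeneracy assumption, are easy to check numerically for a candidate equilibrium. Below is a minimal sketch (ours, not part of the paper), assuming access to the two gradients of $f$ at the candidate point; the helper name `check_kkt` and the matching-pennies example are purely illustrative.

```python
import numpy as np

def check_kkt(grad_x, grad_y, x_star, y_star, tol=1e-8):
    """Numerically check the KKT conditions (4) at a candidate equilibrium.

    grad_x, grad_y: gradients of f w.r.t. x and y, evaluated at (x_star, y_star).
    """
    gx, gy = np.asarray(grad_x, float), np.asarray(grad_y, float)
    avg_x = float(x_star @ gx)       # sum_j x_j^* df/dx_j
    avg_y = float(y_star @ gy)       # sum_j y_j^* df/dy_j
    ok = True
    for i, xi in enumerate(x_star):  # min player: equality on the support, ">=" off-support
        ok &= abs(gx[i] - avg_x) <= tol if xi > tol else gx[i] >= avg_x - tol
    for i, yi in enumerate(y_star):  # max player: equality on the support, "<=" off-support
        ok &= abs(gy[i] - avg_y) <= tol if yi > tol else gy[i] <= avg_y + tol
    return bool(ok)

# Illustration: matching pennies, f(x, y) = x^T A y, uniform equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
x_star = y_star = np.array([0.5, 0.5])
print(check_kkt(A @ y_star, A.T @ x_star, x_star, y_star))   # True
```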

2.2 Optimistic Multiplicative Weights Update

The equations of Optimistic Follow-the-Regularized-Leader (OFTRL) applied to a problem $\min_{\mathbf{x}\in\mathcal{X}}\max_{\mathbf{y}\in\mathcal{Y}}f(\mathbf{x},\mathbf{y})$ with regularizers (strongly convex functions) $h_{1}(\mathbf{x}),h_{2}(\mathbf{y})$ (for players $\mathbf{x},\mathbf{y}$ respectively) and $\mathcal{X}\subset\mathbb{R}^{n},\mathcal{Y}\subset\mathbb{R}^{m}$ are given below (see [6]):

\mathbf{x}^{t+1} =\mathop{\mathbf{argmin}}_{\mathbf{x}\in\mathcal{X}}\Big\{\eta\sum_{s=1}^{t}\mathbf{x}^{\top}\nabla_{\mathbf{x}}f(\mathbf{x}^{s},\mathbf{y}^{s})+\underbrace{\eta\mathbf{x}^{\top}\nabla_{\mathbf{x}}f(\mathbf{x}^{t},\mathbf{y}^{t})}_{\textrm{optimistic term}}+h_{1}(\mathbf{x})\Big\}
\mathbf{y}^{t+1} =\mathop{\mathbf{argmax}}_{\mathbf{y}\in\mathcal{Y}}\Big\{\eta\sum_{s=1}^{t}\mathbf{y}^{\top}\nabla_{\mathbf{y}}f(\mathbf{x}^{s},\mathbf{y}^{s})+\underbrace{\eta\mathbf{y}^{\top}\nabla_{\mathbf{y}}f(\mathbf{x}^{t},\mathbf{y}^{t})}_{\textrm{optimistic term}}-h_{2}(\mathbf{y})\Big\}.

$\eta$ is called the stepsize of the online algorithm. OFTRL is uniquely defined if $f$ is convex-concave and the domains $\mathcal{X}$ and $\mathcal{Y}$ are convex. For simplex constraints and entropy regularizers, i.e., $h_{1}(\mathbf{x})=\sum_{i}x_{i}\ln x_{i},h_{2}(\mathbf{y})=\sum_{i}y_{i}\ln y_{i}$, we can solve for the explicit form of OFTRL using the KKT conditions; the resulting update rule is the Optimistic Multiplicative Weights Update (OMWU), described as follows:

x_{i}^{t+1} =x_{i}^{t}\frac{e^{-2\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{t},\mathbf{y}^{t})+\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{t-1},\mathbf{y}^{t-1})}}{\sum_{k}x_{k}^{t}e^{-2\eta\frac{\partial f}{\partial x_{k}}(\mathbf{x}^{t},\mathbf{y}^{t})+\eta\frac{\partial f}{\partial x_{k}}(\mathbf{x}^{t-1},\mathbf{y}^{t-1})}} \quad \text{for all }i\in[n],
y_{i}^{t+1} =y_{i}^{t}\frac{e^{2\eta\frac{\partial f}{\partial y_{i}}(\mathbf{x}^{t},\mathbf{y}^{t})-\eta\frac{\partial f}{\partial y_{i}}(\mathbf{x}^{t-1},\mathbf{y}^{t-1})}}{\sum_{k}y_{k}^{t}e^{2\eta\frac{\partial f}{\partial y_{k}}(\mathbf{x}^{t},\mathbf{y}^{t})-\eta\frac{\partial f}{\partial y_{k}}(\mathbf{x}^{t-1},\mathbf{y}^{t-1})}} \quad \text{for all }i\in[m].
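For concreteness, one OMWU step can be implemented in a few lines. The sketch below is ours (not the authors' code) and assumes gradient oracles `grad_x`, `grad_y` for $f$; the bilinear matching-pennies example at the end is only an illustration.

```python
import numpy as np

def omwu_step(x, y, x_prev, y_prev, grad_x, grad_y, eta):
    """One OMWU update: weight by twice the current gradient minus the previous one."""
    gx, gx_prev = grad_x(x, y), grad_x(x_prev, y_prev)
    gy, gy_prev = grad_y(x, y), grad_y(x_prev, y_prev)
    x_new = x * np.exp(-2.0 * eta * gx + eta * gx_prev)   # min player
    y_new = y * np.exp(2.0 * eta * gy - eta * gy_prev)    # max player
    return x_new / x_new.sum(), y_new / y_new.sum()

# Illustration on f(x, y) = x^T A y (matching pennies); the equilibrium is uniform.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
grad_x = lambda x, y: A @ y
grad_y = lambda x, y: A.T @ x
x = x_prev = np.array([0.9, 0.1])
y = y_prev = np.array([0.2, 0.8])
for _ in range(2000):
    x, y, x_prev, y_prev = *omwu_step(x, y, x_prev, y_prev, grad_x, grad_y, 0.1), x, y
print(x, y)   # both approach [0.5, 0.5]
```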

2.3 Fundamentals of Dynamical Systems

We conclude the Preliminaries section with some basic facts about dynamical systems.

Definition 2.3.

A recurrence relation of the form $\mathbf{x}^{t+1}=w(\mathbf{x}^{t})$ is a discrete time dynamical system, with update rule $w:\mathcal{S}\rightarrow\mathcal{S}$ where $\mathcal{S}$ is a subset of $\mathbb{R}^{k}$ for some positive integer $k$. The point $\mathbf{z}\in\mathcal{S}$ is called a fixed point if $w(\mathbf{z})=\mathbf{z}$.

Remark 2.4.

Using KKT conditions (4), it is not hard to observe that an equilibrium point $(\mathbf{x}^{*},\mathbf{y}^{*})$ must be a fixed point of the OMWU algorithm, i.e., if $(\mathbf{x}^{t},\mathbf{y}^{t})=(\mathbf{x}^{t-1},\mathbf{y}^{t-1})=(\mathbf{x}^{*},\mathbf{y}^{*})$ then $(\mathbf{x}^{t+1},\mathbf{y}^{t+1})=(\mathbf{x}^{*},\mathbf{y}^{*})$.

Proposition 2.5 ([9]).

Assume that $w$ is a differentiable function and that the Jacobian of the update rule $w$ at a fixed point $\mathbf{z}^{*}$ has spectral radius less than one. Then there exists a neighborhood $U$ around $\mathbf{z}^{*}$ such that for all $\mathbf{z}^{0}\in U$, the dynamics $\mathbf{z}^{t+1}=w(\mathbf{z}^{t})$ converges to $\mathbf{z}^{*}$, i.e., $\lim_{n\rightarrow\infty}w^{n}(\mathbf{z}^{0})=\mathbf{z}^{*}$ (here $w^{n}$ denotes the composition of $w$ with itself $n$ times). In this case, $w$ is called a contraction mapping in $U$.

Note that we will make use of Proposition 2.5 to prove our Theorem 1.1 (by proving that the Jacobian of the update rule of OMWU has spectral radius less than one).

3 Last iterate convergence of OMWU

In this section, we prove that OMWU converges pointwise (exhibits last iterate convergence) if the initializations $(\mathbf{x}^{0},\mathbf{y}^{0}),(\mathbf{x}^{1},\mathbf{y}^{1})$ belong to a neighborhood $U$ of the equilibrium $(\mathbf{x}^{*},\mathbf{y}^{*})$.

3.1 Dynamical System of OMWU

We first express the OMWU algorithm as a dynamical system so that we can use Proposition 2.5. The idea (similar to [8]) is to lift the space to four components $(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w})$, in such a way that we can include the history (the current and the previous step; see Section 2.2 for the equations). The update rule $g:\Delta_{n}\times\Delta_{m}\times\Delta_{n}\times\Delta_{m}\to\Delta_{n}\times\Delta_{m}\times\Delta_{n}\times\Delta_{m}$ of the lifted dynamical system is given by

g(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w})=(g_{1},g_{2},g_{3},g_{4})

where $g_{i}=g_{i}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w})$ for $i\in[4]$ are defined as follows:

g_{1,i}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w}) =x_{i}\frac{e^{-2\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x},\mathbf{y})+\eta\frac{\partial f}{\partial z_{i}}(\mathbf{z},\mathbf{w})}}{\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{k}}(\mathbf{x},\mathbf{y})+\eta\frac{\partial f}{\partial z_{k}}(\mathbf{z},\mathbf{w})}},\ i\in[n] \qquad (5)
g_{2,i}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w}) =y_{i}\frac{e^{2\eta\frac{\partial f}{\partial y_{i}}(\mathbf{x},\mathbf{y})-\eta\frac{\partial f}{\partial w_{i}}(\mathbf{z},\mathbf{w})}}{\sum_{k}y_{k}e^{2\eta\frac{\partial f}{\partial y_{k}}(\mathbf{x},\mathbf{y})-\eta\frac{\partial f}{\partial w_{k}}(\mathbf{z},\mathbf{w})}},\ i\in[m] \qquad (6)
g_{3}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w}) =\mathbf{x}\ \ \text{or}\ \ g_{3,i}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w})=x_{i},\ i\in[n] \qquad (7)
g_{4}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w}) =\mathbf{y}\ \ \text{or}\ \ g_{4,i}(\mathbf{x},\mathbf{y},\mathbf{z},\mathbf{w})=y_{i},\ i\in[m]. \qquad (8)

Then the dynamical system of OMWU can be written in compact form as

(\mathbf{x}^{t+1},\mathbf{y}^{t+1},\mathbf{x}^{t},\mathbf{y}^{t})=g(\mathbf{x}^{t},\mathbf{y}^{t},\mathbf{x}^{t-1},\mathbf{y}^{t-1}).

In what follows, we will perform a spectral analysis of the Jacobian of the function $g$, computed at the fixed point $(\mathbf{x}^{*},\mathbf{y}^{*})$. Since $g$ has been lifted, the fixed point we analyze is $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ (see Remark 2.4). By showing that the spectral radius is less than one, our Theorem 1.1 follows from Proposition 2.5. The computations of the Jacobian of $g$ are deferred to the supplementary material.
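Proposition 2.5 reduces Theorem 1.1 to a bound on the spectral radius, and that reduction can also be checked numerically on small instances. The sketch below (ours, for illustration only) builds the lifted map $g$ from the `omwu_step` function given after Section 2.2 and estimates the spectral radius of its Jacobian at the lifted fixed point with finite differences; the equilibrium and gradient oracles are those of the matching-pennies example and are assumptions of the sketch.

```python
import numpy as np

def lifted_map(state, n, m, grad_x, grad_y, eta):
    """g(x, y, z, w) = (OMWU step from (x, y) with history (z, w), x, y), flattened."""
    x, y = state[:n], state[n:n + m]
    z, w = state[n + m:2 * n + m], state[2 * n + m:]
    x_new, y_new = omwu_step(x, y, z, w, grad_x, grad_y, eta)
    return np.concatenate([x_new, y_new, x, y])

def spectral_radius(state, n, m, grad_x, grad_y, eta, h=1e-6):
    """Finite-difference Jacobian of the lifted map at `state`, then its spectral radius."""
    d = state.size
    J = np.zeros((d, d))
    base = lifted_map(state, n, m, grad_x, grad_y, eta)
    for j in range(d):
        e = np.zeros(d); e[j] = h
        J[:, j] = (lifted_map(state + e, n, m, grad_x, grad_y, eta) - base) / h
    return np.max(np.abs(np.linalg.eigvals(J)))

A = np.array([[1.0, -1.0], [-1.0, 1.0]])
grad_x = lambda x, y: A @ y
grad_y = lambda x, y: A.T @ x
x_star = y_star = np.array([0.5, 0.5])
state = np.concatenate([x_star, y_star, x_star, y_star])
print(spectral_radius(state, 2, 2, grad_x, grad_y, eta=0.1))   # expected to be slightly below 1
```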

3.2 Spectral Analysis

Let $(\mathbf{x}^{*},\mathbf{y}^{*})$ be the equilibrium of the min-max problem (3). Assume $i\notin\textrm{Supp}(\mathbf{x}^{*})$, i.e., $x_{i}^{*}=0$; then (see the equations in the supplementary material, Section A)

\frac{\partial g_{1,i}}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})=\frac{e^{-\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})}}{\sum_{t=1}^{n}x^{*}_{t}e^{-\eta\frac{\partial f}{\partial x_{t}}(\mathbf{x}^{*},\mathbf{y}^{*})}}

and all other partial derivatives of $g_{1,i}$ are zero; thus $\frac{e^{-\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})}}{\sum_{t=1}^{n}x^{*}_{t}e^{-\eta\frac{\partial f}{\partial x_{t}}(\mathbf{x}^{*},\mathbf{y}^{*})}}$ is an eigenvalue of the Jacobian computed at $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$. This is true because the row of the Jacobian that corresponds to $g_{1,i}$ has zeros everywhere except the diagonal entry. Moreover, because of the non-degeneracy assumption on the KKT conditions (see Remark 2.2), it holds that

\frac{e^{-\eta\frac{\partial f}{\partial x_{i}}(\mathbf{x}^{*},\mathbf{y}^{*})}}{\sum_{t=1}^{n}x^{*}_{t}e^{-\eta\frac{\partial f}{\partial x_{t}}(\mathbf{x}^{*},\mathbf{y}^{*})}}<1.

Similarly, it holds for $j\notin\textrm{Supp}(\mathbf{y}^{*})$ that

\frac{\partial g_{2,j}}{\partial y_{j}}(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})=\frac{e^{\eta\frac{\partial f}{\partial y_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})}}{\sum_{t=1}^{m}y^{*}_{t}e^{\eta\frac{\partial f}{\partial y_{t}}(\mathbf{x}^{*},\mathbf{y}^{*})}}<1

(again by Remark 2.2) and all other partial derivatives of $g_{2,j}$ are zero, therefore $\frac{e^{\eta\frac{\partial f}{\partial y_{j}}(\mathbf{x}^{*},\mathbf{y}^{*})}}{\sum_{t=1}^{m}y^{*}_{t}e^{\eta\frac{\partial f}{\partial y_{t}}(\mathbf{x}^{*},\mathbf{y}^{*})}}$ is an eigenvalue of the Jacobian computed at $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$.

We focus on the submatrix of the Jacobian of $g$ computed at $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{x}^{*},\mathbf{y}^{*})$ that corresponds to the non-zero probabilities of $\mathbf{x}^{*}$ and $\mathbf{y}^{*}$. We denote by $D_{\mathbf{x}^{*}}$ the diagonal matrix of size $|\textrm{Supp}(\mathbf{x}^{*})|\times|\textrm{Supp}(\mathbf{x}^{*})|$ that has on its diagonal the nonzero entries of $\mathbf{x}^{*}$, and similarly we define $D_{\mathbf{y}^{*}}$ of size $|\textrm{Supp}(\mathbf{y}^{*})|\times|\textrm{Supp}(\mathbf{y}^{*})|$. For convenience, let us denote $k_{x}:=|\textrm{Supp}(\mathbf{x}^{*})|$ and $k_{y}:=|\textrm{Supp}(\mathbf{y}^{*})|$. The Jacobian submatrix is the following:

J=\left[\begin{array}{cccc}A_{11}&A_{12}&A_{13}&A_{14}\\ A_{21}&A_{22}&A_{23}&A_{24}\\ \mathbf{I}_{k_{x}\times k_{x}}&\mathbf{0}_{k_{x}\times k_{y}}&\mathbf{0}_{k_{x}\times k_{x}}&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&\mathbf{I}_{k_{y}\times k_{y}}&\mathbf{0}_{k_{y}\times k_{x}}&\mathbf{0}_{k_{y}\times k_{y}}\end{array}\right]

where

\begin{split}A_{11}&=\mathbf{I}_{k_{x}\times k_{x}}-D_{\mathbf{x}^{*}}\mathbf{1}_{k_{x}}\mathbf{1}_{k_{x}}^{\top}-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}_{k_{x}\times k_{x}}-\mathbf{1}_{k_{x}}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f\\ A_{12}&=-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}_{k_{x}\times k_{x}}-\mathbf{1}_{k_{x}}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ A_{13}&=\eta D_{\mathbf{x}^{*}}(\mathbf{I}_{k_{x}\times k_{x}}-\mathbf{1}_{k_{x}}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f\\ A_{14}&=\eta D_{\mathbf{x}^{*}}(\mathbf{I}_{k_{x}\times k_{x}}-\mathbf{1}_{k_{x}}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ A_{21}&=2\eta D_{\mathbf{y}^{*}}(\mathbf{I}_{k_{y}\times k_{y}}-\mathbf{1}_{k_{y}}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f\\ A_{22}&=\mathbf{I}_{k_{y}\times k_{y}}-D_{\mathbf{y}^{*}}\mathbf{1}_{k_{y}}\mathbf{1}_{k_{y}}^{\top}+2\eta D_{\mathbf{y}^{*}}(\mathbf{I}_{k_{y}\times k_{y}}-\mathbf{1}_{k_{y}}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f\\ A_{23}&=-\eta D_{\mathbf{y}^{*}}(\mathbf{I}_{k_{y}\times k_{y}}-\mathbf{1}_{k_{y}}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f\\ A_{24}&=-\eta D_{\mathbf{y}^{*}}(\mathbf{I}_{k_{y}\times k_{y}}-\mathbf{1}_{k_{y}}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f.\end{split} \qquad (9)

We note that $\mathbf{I},\mathbf{0}$ denote the identity matrix and the all-zeros matrix respectively (the appropriate size is indicated as a subscript). The vectors $(\mathbf{1}_{k_{x}},\mathbf{0}_{k_{y}},\mathbf{0}_{k_{x}},\mathbf{0}_{k_{y}})$ and $(\mathbf{0}_{k_{x}},\mathbf{1}_{k_{y}},\mathbf{0}_{k_{x}},\mathbf{0}_{k_{y}})$ are left eigenvectors of the above matrix with eigenvalue zero. Hence, any right eigenvector $(\mathbf{v_{x}},\mathbf{v_{y}},\mathbf{v_{z}},\mathbf{v_{w}})$ corresponding to a non-zero eigenvalue must satisfy $\mathbf{1}^{\top}\mathbf{v_{x}}=0$ and $\mathbf{1}^{\top}\mathbf{v_{y}}=0$. Thus, every non-zero eigenvalue of the above matrix is also a non-zero eigenvalue of the matrix below:

J_{\text{new}}=\left[\begin{array}{cccc}B_{11}&A_{12}&A_{13}&A_{14}\\ A_{21}&B_{22}&A_{23}&A_{24}\\ \mathbf{I}_{k_{x}\times k_{x}}&\mathbf{0}_{k_{x}\times k_{y}}&\mathbf{0}_{k_{x}\times k_{x}}&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&\mathbf{I}_{k_{y}\times k_{y}}&\mathbf{0}_{k_{y}\times k_{x}}&\mathbf{0}_{k_{y}\times k_{y}}\end{array}\right]

where

B_{11}=\mathbf{I}_{k_{x}\times k_{x}}-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}_{k_{x}\times k_{x}}-\mathbf{1}_{k_{x}}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f,
B_{22}=\mathbf{I}_{k_{y}\times k_{y}}+2\eta D_{\mathbf{y}^{*}}(\mathbf{I}_{k_{y}\times k_{y}}-\mathbf{1}_{k_{y}}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f.

The characteristic polynomial of $J_{\text{new}}$ is obtained by finding $\det(J_{\text{new}}-\lambda\mathbf{I})$. One can perform row/column operations on $J_{\text{new}}$ to calculate this determinant, which gives us the following relation:

\det(J_{\text{new}}-\lambda\mathbf{I})=\left(1-2\lambda\right)^{k_{x}+k_{y}}q\left(\frac{\lambda(\lambda-1)}{2\lambda-1}\right)

where $q(\lambda)$ is the characteristic polynomial of the following matrix

J_{\text{small}}=\left[\begin{array}{cc}B_{11}-\mathbf{I}_{k_{x}\times k_{x}}&A_{12}\\ A_{21}&B_{22}-\mathbf{I}_{k_{y}\times k_{y}}\end{array}\right],

where $B_{11},B_{22},A_{12},A_{21}$ are the aforementioned sub-matrices. Notice that $J_{\text{small}}$ can be written as

J_{\text{small}}=2\eta\left[\begin{array}{cc}-(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]H

where,

H=\left[\begin{array}{cc}\nabla_{\mathbf{x}\mathbf{x}}^{2}f&\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ \nabla_{\mathbf{y}\mathbf{x}}^{2}f&\nabla_{\mathbf{y}\mathbf{y}}^{2}f\end{array}\right].

Notice here that $H$ is the Hessian matrix of $f$ evaluated at the fixed point $(\mathbf{x}^{*},\mathbf{y}^{*})$, restricted to the rows and columns indexed by $\textrm{Supp}(\mathbf{x}^{*})$ and $\textrm{Supp}(\mathbf{y}^{*})$. Although the Hessian matrix is symmetric, we would like to work with the following representation of $J_{\text{small}}$:

J_{\text{small}}=2\eta\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]H^{-}

where,

H^{-}=\left[\begin{array}{cc}-\nabla_{\mathbf{x}\mathbf{x}}^{2}f&-\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ \nabla_{\mathbf{y}\mathbf{x}}^{2}f&\nabla_{\mathbf{y}\mathbf{y}}^{2}f\end{array}\right].

Let us denote by $\epsilon$ any non-zero eigenvalue of $J_{\text{small}}$; $\epsilon$ may be a complex number. Thus $\epsilon$ is a point where $q(\cdot)$ vanishes, and hence the corresponding eigenvalues of $J_{\text{new}}$ must satisfy the relation

\frac{\lambda(\lambda-1)}{2\lambda-1}=\epsilon.

We now show that the magnitude of any eigenvalue of $J_{\text{new}}$ is strictly less than 1, i.e., $|\lambda|<1$. Trivially, $\lambda=\frac{1}{2}$ (the root of the factor $(1-2\lambda)^{k_{x}+k_{y}}$) satisfies $|\lambda|<1$. Thus we need to show that the magnitude of any $\lambda$ satisfying the relation above, where $\epsilon$ is a root of $q(\cdot)$, is strictly less than 1. The remainder of the proof proceeds by showing the following two lemmas:

Lemma 3.1 (Real part non-positive).

Let $\lambda$ be an eigenvalue of the matrix $J_{\text{small}}$. It holds that $\textrm{Re}(\lambda)\leq 0$.

Figure 1: Convergence of OMWU for different problem sizes. (a) Number of iterations vs. the size $n$; (b) $\ell_{1}$ error vs. number of iterations. For (a), the $x$-axis is $n$ and the $y$-axis is the number of iterations needed to reach convergence for Eqn. (14). In (b) we choose four values of $n$ to illustrate how the $\ell_{1}$ error decreases with the number of iterations.
Proof.

Assume that $\lambda\neq 0$. All the non-zero eigenvalues of the matrix $J_{\text{small}}$ coincide, up to the positive factor $2\eta$ (which does not affect the sign of the real part), with the eigenvalues of the matrix

R:=\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]^{1/2}\times H^{-}\times\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]^{1/2}.

This is well-defined since

\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]

is positive semi-definite. Moreover, we use the Ky Fan inequality, which states that the sequence (in decreasing order) of the eigenvalues of $\frac{1}{2}(W+W^{\top})$ majorizes the real part of the sequence of the eigenvalues of $W$ for any square matrix $W$ (see [12], page 4). We conclude that for any eigenvalue $\lambda$ of $R$, $\textrm{Re}(\lambda)$ is at most the maximum eigenvalue of $\frac{1}{2}(R+R^{\top})$. Observe now that

R+R^{\top}=\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]^{1/2}\times(H^{-}+H^{-\top})\times\left[\begin{array}{cc}(D_{\mathbf{x}^{*}}-\mathbf{x}^{*}\mathbf{x}^{*\top})&\mathbf{0}_{k_{x}\times k_{y}}\\ \mathbf{0}_{k_{y}\times k_{x}}&(D_{\mathbf{y}^{*}}-\mathbf{y}^{*}\mathbf{y}^{*\top})\end{array}\right]^{1/2}.

Since

H^{-}+H^{-\top}=\left[\begin{array}{cc}-2\nabla_{\mathbf{x}\mathbf{x}}^{2}f&0\\ 0&2\nabla_{\mathbf{y}\mathbf{y}}^{2}f\end{array}\right]

by the convex-concave assumption on $f$ it follows that the matrix above is negative semi-definite (see Remark 2.2), and so is $R+R^{\top}$. We conclude that the maximum eigenvalue of $R+R^{\top}$ is non-positive. Therefore any eigenvalue of $R$ has non-positive real part, and the same is true for $J_{\text{small}}$. ∎
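Lemma 3.1 is easy to test numerically. The following sketch (ours, not from the paper) builds $J_{\text{small}}$ for a randomly generated convex-concave quadratic $f(\mathbf{x},\mathbf{y})=\mathbf{x}^{\top}P\mathbf{x}-\mathbf{y}^{\top}Q\mathbf{y}+\mathbf{x}^{\top}A\mathbf{y}$, using an interior point in place of a fully supported equilibrium (an assumption of the sketch), and checks that every eigenvalue has non-positive real part.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, eta = 4, 3, 0.1

# Convex-concave quadratic: Hessian blocks Hxx = 2P (PSD), Hyy = -2Q (NSD), Hxy = A.
P = rng.standard_normal((n, n)); P = P @ P.T
Q = rng.standard_normal((m, m)); Q = Q @ Q.T
A = rng.standard_normal((n, m))
Hxx, Hyy, Hxy = 2 * P, -2 * Q, A

# Interior point standing in for a fully supported (x*, y*).
x = np.full(n, 1.0 / n)
y = np.full(m, 1.0 / m)

# J_small = 2*eta * blockdiag(D_x - x x^T, D_y - y y^T) @ H^-.
Bx = np.diag(x) - np.outer(x, x)
By = np.diag(y) - np.outer(y, y)
B = np.block([[Bx, np.zeros((n, m))], [np.zeros((m, n)), By]])
H_minus = np.block([[-Hxx, -Hxy], [Hxy.T, Hyy]])
J_small = 2 * eta * B @ H_minus

print(np.linalg.eigvals(J_small).real.max() <= 1e-10)   # Lemma 3.1: non-positive real parts
```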

Lemma 3.2.

If $\epsilon$ is a non-zero eigenvalue of $J_{\text{small}}$ then $\textrm{Re}(\epsilon)\leq 0$ and $|\epsilon|\downarrow 0$ as the stepsize $\eta\to 0$.

The first claim follows from Lemma 3.1. For the second, observe that the learning rate $\eta$ multiplies every entry of $J_{\text{small}}$ and hence every eigenvalue; thus, while $\eta$ remains positive, it can be chosen sufficiently small so that the magnitude of any eigenvalue satisfies $|\epsilon|\downarrow 0$.

Remark 3.3.

The equation $\epsilon=\frac{\lambda(\lambda-1)}{2\lambda-1}$ determines two complex roots for each fixed $\epsilon$, say $\lambda_{1}$ and $\lambda_{2}$. The relation between $|\epsilon|$, $|\lambda_{1}|$ and $|\lambda_{2}|$ is illustrated in Figure 2, where the $x$-axis is taken to be proportional to $\exp(1/|\epsilon|)$. Specifically, we choose $\epsilon=-1/\log(x)+\sqrt{-1}/\log(x)$, which satisfies $|\epsilon|\downarrow 0$ as $x\rightarrow\infty$ (the $x$-axis of Figure 2 takes $x$ from 3 to $10^{3}$).
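The relation of Remark 3.3 is just the quadratic $\lambda^{2}-(1+2\epsilon)\lambda+\epsilon=0$, so it can be probed directly. The sketch below (ours) samples values of $\epsilon$ with non-positive real part and small modulus, as guaranteed by Lemmas 3.1 and 3.2, and confirms that both roots lie strictly inside the unit disk; the sampling box is an arbitrary choice for illustration.

```python
import numpy as np

def lambdas(eps):
    """Roots of lambda^2 - (1 + 2*eps)*lambda + eps = 0, i.e. lambda*(lambda-1)/(2*lambda-1) = eps."""
    return np.roots([1.0, -(1.0 + 2.0 * eps), eps])

rng = np.random.default_rng(1)
worst = 0.0
for _ in range(10000):
    eps = complex(-rng.uniform(0.0, 0.2), rng.uniform(-0.2, 0.2))  # Re(eps) <= 0, |eps| small
    worst = max(worst, float(np.abs(lambdas(eps)).max()))
print(worst < 1.0, worst)   # both roots stay strictly inside the unit disk
```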

Figure 2: $|\lambda_{1}|$ and $|\lambda_{2}|$ are less than 1 when $|\epsilon|$ is small.
Proof.

Let $\lambda=x+\sqrt{-1}\,y$ and $\epsilon=a+\sqrt{-1}\,b$. The relation $\frac{\lambda(\lambda-1)}{2\lambda-1}=\epsilon$ gives two equations based on the equality of the real and imaginary parts, as follows:

x^{2}-x-y^{2} =2ax-a-2by \qquad (10)
2xy-y =2bx+2ay-b. \qquad (11)

Notice that the above equations can be transformed to the following forms:

\left(x-\frac{2a+1}{2}\right)^{2}-(y-b)^{2} =-a-b^{2}+\frac{(2a+1)^{2}}{4} \qquad (12)
\left(x-\frac{2a+1}{2}\right)(y-b) =ab. \qquad (13)

For each $\epsilon=a+\sqrt{-1}b$, there exist two points $(x_{1},y_{1})$ and $(x_{2},y_{2})$ that are the intersections of the above two hyperbolas, as illustrated in Figure 4. Recall the condition that $a<0$. As $|\epsilon|\rightarrow 0$, the hyperbolas can be obtained by translating by $(\frac{2a+1}{2},b)$ the hyperbolas

x^{2}-y^{2} =-a-b^{2}+\frac{(2a+1)^{2}}{4}
xy =ab

where the translated center of symmetry is close to $(\frac{1}{2},0)$ since $(a,b)$ is close to $(0,0)$. So the two intersections of the above hyperbolas, $(x_{1},y_{1})$ and $(x_{2},y_{2})$, satisfy the property that $x_{1}^{2}+y_{1}^{2}$ is small and $x_{2}>\frac{1}{2}$, since the two intersections are on the two sides of the axis $x=\frac{2a+1}{2}$, as shown in Figure 3.

Figure 3: (a) $ab<0$; (b) $ab>0$. The intersections of the four branches of the hyperbolas are the two solutions of the system (10)–(11) (equivalently (12)–(13)). The intersections are on the two sides of the line defined by $x=\frac{2a+1}{2}$, provided $|b|$ is small and $a<0$; this occurs in both cases $ab>0$ and $ab<0$.

On the other hand, we have

\frac{\lambda(\lambda-1)}{2\lambda-1}=\frac{(x+\sqrt{-1}y)(x-1+\sqrt{-1}y)}{2x-1+\sqrt{-1}\,2y}=\epsilon=a+\sqrt{-1}b

and then the condition $a<0$ gives the inequality

\text{Re}(\epsilon)=\frac{(x^{2}-x+y^{2})(2x-1)}{(2x-1)^{2}+4y^{2}}<0

that is equivalent to

x>\frac{1}{2}\ \ \text{and}\ \ x^{2}-x+y^{2}<0

where only the case $x>\frac{1}{2}$ needs to be considered, since the intersection whose $x$-component satisfies $x<\frac{1}{2}$ has the property that $x^{2}+y^{2}$ is small and hence less than 1 (see Figure 4). Thus, to prove that $|\lambda|<1$, it suffices to consider $x>\frac{1}{2}$. It is clear that $x^{2}-x+y^{2}=(x-\frac{1}{2})^{2}+y^{2}-\frac{1}{4}<0$ implies $x^{2}+y^{2}<1$. This completes the proof. ∎

Figure 4: The case $a=-0.1$, $b=0.1$.
Figure 5: Time comparisons of OMWU and projected OGDA for different choices of learning rate. Panels: (a) OMWU, (b) OGDA, (c) convergence time comparisons, (d) OMWU trajectories with different learning rates. For (a), (b), (c), the $x$-axis is the number of iterations and the $y$-axis is the $\ell_{1}$ error to the stationary point for Eqn. (14) with $n=100$. We observe that OMWU (as in (a)) always converges, while projected OGDA (as in (b)) diverges for large learning rates. In (c) we remove the divergent case and compare the efficiency of the two algorithms measured in CPU time. In (d) we visually present the trajectories for the min-max game $\min_{\mathbf{x}\in\Delta_{2}}\max_{\mathbf{y}\in\Delta_{2}}\{x_{1}^{2}-y_{1}^{2}+2x_{1}y_{1}\}$ with learning rates $0.1$, $1.0$ and $10$. Here the $x$-axis is the value of $x_{1}$ and the $y$-axis is the value of $y_{1}$. The equilibrium point the algorithm converges to is $\mathbf{x}=[0,1],\mathbf{y}=[0,1]$.
Figure 6: KL divergence decreases with the number of iterations under different settings. Panels: (a) KL divergence vs. number of iterations for different $n$; (b) KL divergence vs. number of iterations for different $\eta$. For both images, the $x$-axis is the number of iterations and the $y$-axis is the KL divergence. Panel (a) is OMWU on the bilinear function Eqn. (14) with $n\in\{25,100,175,250\}$. Panel (b) is OMWU on the quadratic function $f(\mathbf{x},\mathbf{y})=x_{1}^{2}-y_{1}^{2}+2x_{1}y_{1}$ with learning rate $\eta\in\{0.01,0.1,1.0,10.0\}$. The shaded area indicates the standard deviation over 10 runs with random initializations. OMWU with a smaller learning rate tends to have higher variance.

4 Experiments

In this section, we conduct empirical studies to verify the theoretical results of our paper. We primarily aim to understand two factors that influence the convergence speed of OMWU: the problem size and the learning rate. We also compare our algorithm with Optimistic Gradient Descent Ascent (OGDA) with projection, and demonstrate the advantages of OMWU over it.

We start with a simple bilinear min-max game:

\min_{\mathbf{x}\in\Delta_{n}}\max_{\mathbf{y}\in\Delta_{n}}\mathbf{x}^{\top}A\mathbf{y}. \qquad (14)

We first vary the value of $n$ to study how the learning speed scales with the size of the problem. The learning rate is fixed at $1.0$, and we run OMWU with $n\in\{25,50,75,\cdots,250\}$, where the matrix $A\in\mathbb{R}^{n\times n}$ is generated with i.i.d. random Gaussian entries. We report the number of iterations for OMWU to reach convergence, i.e., for the $\ell_{1}$ error to the optimal solution to become at most $10^{-5}$. The results are averaged over 10 runs with different random initializations. As reported in Figure 1, a larger problem size generally requires more iterations to reach convergence. We also provide four specific values of $n$ to show the convergence in $\ell_{1}$ distance in Figure 1(b). The shaded area shows the standard deviation across the runs.
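A version of this experiment can be reproduced along the following lines (a sketch under our assumptions, reusing the `omwu_step` function from Section 2.2's snippet; it is not the authors' experiment code). For simplicity it tracks the duality gap of the last iterate rather than the $\ell_{1}$ distance to a precomputed optimal solution, and the seed and iteration budget are illustrative.

```python
import numpy as np

def bilinear_experiment(n, eta=1.0, iters=20000, seed=0):
    """Run OMWU on min_x max_y x^T A y with a random Gaussian payoff matrix A."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    grad_x = lambda x, y: A @ y
    grad_y = lambda x, y: A.T @ x
    x = x_prev = np.full(n, 1.0 / n)
    y = y_prev = np.full(n, 1.0 / n)
    gaps = []
    for _ in range(iters):
        x, y, x_prev, y_prev = *omwu_step(x, y, x_prev, y_prev, grad_x, grad_y, eta), x, y
        gaps.append((A.T @ x).max() - (A @ y).min())   # duality gap of the last iterate
    return np.array(gaps)

gaps = bilinear_experiment(n=25)
print(gaps[::5000])   # the gap shrinks as the last iterate approaches the equilibrium
```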

To understand how the learning rate affects the speed of convergence, we conduct similar experiments on Eqn. (14) and plot the $\ell_{1}$ error for different step sizes in Figure 5(a)-(c). For this experiment the matrix size is fixed at $n=100$. We also include a comparison with Optimistic Gradient Descent Ascent [7]. Notice that the original proposal was for unconstrained problems, so we apply a projection in each step in order to keep the iterates inside the simplex. For the setting we consider, we observe that a larger learning rate effectively speeds up the learning process, and that our algorithm is relatively stable to the choice of step-size. In comparison, OGDA is quite sensitive to the choice of step-size. As shown in Figure 5(b), a larger step-size makes the algorithm diverge, while a smaller step-size makes very little progress. Furthermore, we also run our algorithm on a convex-concave but not bilinear function $f(\mathbf{x},\mathbf{y})=x_{1}^{2}-y_{1}^{2}+2x_{1}y_{1}$, where $\mathbf{x},\mathbf{y}\in\Delta_{2}$ and $x_{1}$ and $y_{1}$ are the first coordinates of $\mathbf{x}$ and $\mathbf{y}$. With this low dimensional function, we can visually show the convergence procedure, as in Figure 5(d), where each arrow indicates an OMWU step. This figure demonstrates that, at least in this case, a larger step size usually makes bigger progress towards the optimal solution.

Finally, we show how the KL divergence $D_{KL}((\mathbf{x}^{*},\mathbf{y}^{*})\parallel(\mathbf{x}^{t},\mathbf{y}^{t}))$ decreases under different circumstances. Figure 6 again considers the bilinear problem (Eqn. (14)) for multiple dimensions $n$, as well as the simple convex-concave function $f(\mathbf{x},\mathbf{y})=x_{1}^{2}-y_{1}^{2}+2x_{1}y_{1}$ with different learning rates. In all circumstances we consider, we observe that OMWU is very stable and achieves global convergence, invariant to the problem size, random initialization, and learning rate.
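For completeness, the divergence plotted in Figure 6 can be computed with a small helper. The sketch below is ours; it reads the pair $(\mathbf{x}^{*},\mathbf{y}^{*})$ as a product distribution, so the divergence splits into a sum of two KL terms, and it uses the convention $0\log(0/q)=0$.

```python
import numpy as np

def kl_pair(x_star, y_star, x_t, y_t, eps=1e-12):
    """D_KL((x*, y*) || (x^t, y^t)) = D_KL(x* || x^t) + D_KL(y* || y^t)."""
    def kl(p, q):
        p, q = np.asarray(p, float), np.asarray(q, float)
        mask = p > 0                                   # 0 * log(0/q) = 0
        return float(np.sum(p[mask] * np.log(p[mask] / np.maximum(q[mask], eps))))
    return kl(x_star, x_t) + kl(y_star, y_t)

# Example with the equilibrium x* = y* = [0, 1] of the quadratic game in Figure 6(b).
print(kl_pair([0.0, 1.0], [0.0, 1.0], [0.2, 0.8], [0.1, 0.9]))   # about 0.33
```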

5 Conclusion

In this paper we analyze the last iterate behavior of a no-regret learning algorithm, Optimistic Multiplicative Weights Update, for convex-concave landscapes. We prove that OMWU exhibits last iterate convergence in a neighborhood of its fixed point, generalizing previous results that showed last iterate convergence only for bilinear functions. Our experiments explore how the problem size and the choice of learning rate affect the performance of the algorithm. We find that OMWU achieves global convergence and is less sensitive to the choice of hyperparameters than projected optimistic gradient descent ascent.

References

  • [1] Jacob D. Abernethy, Kevin A. Lai, and Andre Wibisono. Last-iterate convergence rates for min-max optimization. CoRR, abs/1906.02027, 2019.
  • [2] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
  • [3] James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM Conference on Economics and Computation, Ithaca, NY, USA, June 18-22, 2018, pages 321–338, 2018.
  • [4] G.W Brown. Iterative solutions of games by fictitious play. In Activity Analysis of Production and Allocation, 1951.
  • [5] Nikolo Cesa-Bianchi and Gabor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
  • [6] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with Optimism. In Proceedings of ICLR, 2018.
  • [7] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 9256–9266, 2018.
  • [8] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, pages 27:1–27:18, 2019.
  • [9] Oded Galor. Discrete Dynamical Systems. Springer, 2007.
  • [10] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint:1802.06132, 2018.
  • [11] Panayotis Mertikopoulos, Houssam Zenati, Bruno Lecouat, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. CoRR, abs/1807.02629, 2018.
  • [12] Mohammad Sal Moslehian. Ky fan inequalities. CoRR, abs/1108.1467, 2011.
  • [13] Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5874–5884, 2017.
  • [14] J. Robinson. An iterative method of solving a game. In Annals of Mathematics, pages 296–301, 1951.
  • [15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Annual Conference on Neural Information Processing Systems 2015, pages 2989–2997, 2015.
  • [16] J Von Neumann. Zur theorie der gesellschaftsspiele. In Math. Ann., pages 295–320, 1928.

Appendix A Equations of the Jacobian of OMWU

g1,ixi\displaystyle\frac{\partial g_{1,i}}{\partial x_{i}} =e2ηfxi+ηfziSx+xi1Sx2(e2ηfxi+ηfzi(2η2fxi2)Sxe2ηfxi+ηfziSxxi)\displaystyle=\frac{e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}}{S_{x}}+x_{i}\frac{1}{S_{x}^{2}}\left(e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}(-2\eta\frac{\partial^{2}f}{\partial x_{i}^{2}})S_{x}-e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\frac{\partial S_{x}}{\partial x_{i}}\right) (15)
whereSxxi=e2ηfxi+ηfzi2ηkxke2ηfxk+ηfzk2fxi2\displaystyle\text{where}\ \ \frac{\partial S_{x}}{\partial x_{i}}=e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}-2\eta\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{k}}+\eta\frac{\partial f}{\partial z_{k}}}\frac{\partial^{2}f}{\partial x_{i}^{2}} (16)
g1,ixj\displaystyle\frac{\partial g_{1,i}}{\partial x_{j}} =xi1Sx2(e2ηfxi+ηfzi(2η2fxixj)Sxe2ηfxi+ηfziSxxj)\displaystyle=x_{i}\frac{1}{S_{x}^{2}}\left(e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}(-2\eta\frac{\partial^{2}f}{\partial x_{i}\partial x_{j}})S_{x}-e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\frac{\partial S_{x}}{\partial x_{j}}\right) (17)
whereSxxj=e2ηfxj+ηfzj2ηkxke2ηfxk+ηfzk2fxjxk\displaystyle\text{where}\ \ \frac{\partial S_{x}}{\partial x_{j}}=e^{-2\eta\frac{\partial f}{\partial x_{j}}+\eta\frac{\partial f}{\partial z_{j}}}-2\eta\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{k}}+\eta\frac{\partial f}{\partial z_{k}}}\frac{\partial^{2}f}{\partial x_{j}\partial x_{k}} (18)
g1,iyj\displaystyle\frac{\partial g_{1,i}}{\partial y_{j}} =xi1Sx2(e2ηfxi+ηfzi(2η2fxiyj)Sxe2ηfxi+ηfziSxyj)\displaystyle=x_{i}\frac{1}{S_{x}^{2}}\left(e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}(-2\eta\frac{\partial^{2}f}{\partial x_{i}\partial y_{j}})S_{x}-e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\frac{\partial S_{x}}{\partial y_{j}}\right) (19)
whereSxyj=kxke2ηfxi+ηfzi(2η2fxkyj)\displaystyle\text{where}\ \ \frac{\partial S_{x}}{\partial y_{j}}=\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}(-2\eta\frac{\partial^{2}f}{\partial x_{k}\partial y_{j}}) (20)
g1,izj\displaystyle\frac{\partial g_{1,i}}{\partial z_{j}} =xi1Sx2(e2ηfxi+ηfzi(η2fzjxi)Sxe2ηfxi+ηfziSxzj)\displaystyle=x_{i}\frac{1}{S_{x}^{2}}\left(e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}(\eta\frac{\partial^{2}f}{\partial z_{j}\partial x_{i}})S_{x}-e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\frac{\partial S_{x}}{\partial z_{j}}\right) (21)
whereSxzj=ηkxke2ηfxk+ηfzk2fzkzj\displaystyle\text{where}\ \ \frac{\partial S_{x}}{\partial z_{j}}=\eta\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{k}}+\eta\frac{\partial f}{\partial z_{k}}}\frac{\partial^{2}f}{\partial z_{k}\partial z_{j}} (22)
g1,iwj\displaystyle\frac{\partial g_{1,i}}{\partial w_{j}} =xi1Sx2(e2ηfxi+ηfziη2fziwjSxe2ηfxi+ηfziSxwj)\displaystyle=x_{i}\frac{1}{S_{x}^{2}}\left(e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\eta\frac{\partial^{2}f}{\partial z_{i}\partial w_{j}}S_{x}-e^{-2\eta\frac{\partial f}{\partial x_{i}}+\eta\frac{\partial f}{\partial z_{i}}}\frac{\partial S_{x}}{\partial w_{j}}\right) (23)
whereSxwj=kxke2ηfxk+ηfzkηfzkwj\displaystyle\text{where}\ \ \frac{\partial S_{x}}{\partial w_{j}}=\sum_{k}x_{k}e^{-2\eta\frac{\partial f}{\partial x_{k}}+\eta\frac{\partial f}{\partial z_{k}}}\eta\frac{\partial f}{\partial z_{k}\partial w_{j}} (24)
g2,ixj\displaystyle\frac{\partial g_{2,i}}{\partial x_{j}} =yi1Sy2(e2ηfyiηfwi(2η2fxjyi)Sye2ηfyiηfwiSyxj)\displaystyle=y_{i}\frac{1}{S_{y}^{2}}\left(e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}(2\eta\frac{\partial^{2}f}{\partial x_{j}\partial y_{i}})S_{y}-e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}\frac{\partial S_{y}}{\partial x_{j}}\right) (25)
whereSyxj=kyke2ηfyiηfwi2η2fxjyk\displaystyle\text{where}\ \ \frac{\partial S_{y}}{\partial x_{j}}=\sum_{k}y_{k}e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}2\eta\frac{\partial^{2}f}{\partial x_{j}\partial y_{k}} (26)
g2,iyi\displaystyle\frac{\partial g_{2,i}}{\partial y_{i}} =e2ηfyiηfwiSy+yi1Sy2(e2ηfyiηfwi2η2fyi2Sye2ηfyiηfwiSyyi)\displaystyle=\frac{e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}}{S_{y}}+y_{i}\frac{1}{S_{y}^{2}}\left(e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}2\eta\frac{\partial^{2}f}{\partial y_{i}^{2}}S_{y}-e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}\frac{\partial S_{y}}{\partial y_{i}}\right) (27)
whereSyyi=e2ηfyiηfwi+2ηkyke2ηfyiηfwi2fyiyk\displaystyle\text{where}\ \ \frac{\partial S_{y}}{\partial y_{i}}=e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}+2\eta\sum_{k}y_{k}e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}\frac{\partial^{2}f}{\partial y_{i}\partial y_{k}} (28)
g2,izj\displaystyle\frac{\partial g_{2,i}}{\partial z_{j}} =yi1Sy2(e2ηfyiηfwi(η2fwizj)Sye2ηfyiηfwiSyzj)\displaystyle=y_{i}\frac{1}{S_{y}^{2}}\left(e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}(-\eta\frac{\partial^{2}f}{\partial w_{i}\partial z_{j}})S_{y}-e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}\frac{\partial S_{y}}{\partial z_{j}}\right) (29)
whereSyzj=kyke2ηfyiηfwi(η2fwkzj)\displaystyle\text{where}\ \ \frac{\partial S_{y}}{\partial z_{j}}=\sum_{k}y_{k}e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}(-\eta\frac{\partial^{2}f}{\partial w_{k}\partial z_{j}}) (30)
g2,iwj\displaystyle\frac{\partial g_{2,i}}{\partial w_{j}} =yi1Sy2(e2ηfyiηfwi(η2fwiwj)e2ηfyiηfwiSywj)\displaystyle=y_{i}\frac{1}{S_{y}^{2}}\left(e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}(-\eta\frac{\partial^{2}f}{\partial w_{i}\partial w_{j}})-e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}\frac{\partial S_{y}}{\partial w_{j}}\right) (31)
whereSywj=kyke2ηfyiηfwi(η2fwkwj)\displaystyle\text{where}\ \ \frac{\partial S_{y}}{\partial w_{j}}=\sum_{k}y_{k}e^{2\eta\frac{\partial f}{\partial y_{i}}-\eta\frac{\partial f}{\partial w_{i}}}(-\eta\frac{\partial^{2}f}{\partial w_{k}\partial w_{j}}) (32)

Appendix B Equations of the Jacobian of OMWU at the fixed point $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{z}^{*},\mathbf{w}^{*})$

In this section, we compute the equations of the Jacobian at the fixed point $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{z}^{*},\mathbf{w}^{*})$. The fact that $(\mathbf{x}^{*},\mathbf{y}^{*})=(\mathbf{z}^{*},\mathbf{w}^{*})$, and that $(\mathbf{z},\mathbf{w})$ takes the position of $(\mathbf{x},\mathbf{y})$ in computing the partial derivatives, gives the following equations.

g1,ixi\displaystyle\frac{\partial g_{1,i}}{\partial x_{i}} =1xi2ηxi(2fxikxk2fxixk),i[n],\displaystyle=1-x_{i}^{*}-2\eta x_{i}^{*}(\frac{\partial^{2}f}{\partial x_{i}^{*}}-\sum_{k}x_{k}^{*}\frac{\partial^{2}f}{\partial x_{i}\partial x_{k}}),i\in[n], (33)
g1,ixj\displaystyle\frac{\partial g_{1,i}}{\partial x_{j}} =xi2ηxi(2fxixjkxk2fxjxk),j[n],ji\displaystyle=-x_{i}^{*}-2\eta x_{i}^{*}(\frac{\partial^{2}f}{\partial x_{i}\partial x_{j}}-\sum_{k}x_{k}^{*}\frac{\partial^{2}f}{\partial x_{j}\partial x_{k}}),j\in[n],j\neq i (34)
g1,iyj\displaystyle\frac{\partial g_{1,i}}{\partial y_{j}} =2ηxi(2fxiyjkxk2fxkyj),j[m]\displaystyle=-2\eta x_{i}^{*}(\frac{\partial^{2}f}{\partial x_{i}\partial y_{j}}-\sum_{k}x_{k}^{*}\frac{\partial^{2}f}{\partial x_{k}\partial y_{j}}),j\in[m] (35)
g1,izj\displaystyle\frac{\partial g_{1},i}{\partial z_{j}} =ηxi(2fxixjkxk2fxkxj),j[n]\displaystyle=\eta x_{i}^{*}(\frac{\partial^{2}f}{\partial x_{i}\partial x_{j}}-\sum_{k}x_{k}^{*}\frac{\partial^{2}f}{\partial x_{k}\partial x_{j}}),j\in[n] (36)
g1,iwj\displaystyle\frac{\partial g_{1,i}}{\partial w_{j}} =ηxi(2fxiyjkxk2fxkyj),j[m]\displaystyle=\eta x_{i}^{*}(\frac{\partial^{2}f}{\partial x_{i}\partial y_{j}}-\sum_{k}x_{k}^{*}\frac{\partial^{2}f}{\partial x_{k}\partial y_{j}}),j\in[m] (37)
g2,ixj\displaystyle\frac{\partial g_{2,i}}{\partial x_{j}} =2ηyi(2fxjyikyk2fxjyk),j[n]\displaystyle=2\eta y_{i}^{*}(\frac{\partial^{2}f}{\partial x_{j}\partial y_{i}}-\sum_{k}y_{k}^{*}\frac{\partial^{2}f}{\partial x_{j}\partial y_{k}}),j\in[n] (38)
g2,iyi\displaystyle\frac{\partial g_{2,i}}{\partial y_{i}} =1yi+2η(2fyi2kyk2fyiyk),i[m]\displaystyle=1-y_{i}^{*}+2\eta(\frac{\partial^{2}f}{\partial y_{i}^{2}}-\sum_{k}y_{k}^{*}\frac{\partial^{2}f}{\partial y_{i}\partial y_{k}}),i\in[m] (39)
g2,iyj\displaystyle\frac{\partial g_{2,i}}{\partial y_{j}} =yi+2η(2fyiyjkyk2fyjyk),j[m]\displaystyle=-y_{i}^{*}+2\eta(\frac{\partial^{2}f}{\partial y_{i}\partial y_{j}}-\sum_{k}y_{k}^{*}\frac{\partial^{2}f}{\partial y_{j}\partial y_{k}}),j\in[m] (40)
g2,izj\displaystyle\frac{\partial g_{2,i}}{\partial z_{j}} =ηyi(2fxjyi+kyk2fxjyk),j[n]\displaystyle=\eta y_{i}^{*}(-\frac{\partial^{2}f}{\partial x_{j}\partial y_{i}}+\sum_{k}y_{k}^{*}\frac{\partial^{2}f}{\partial x_{j}\partial y_{k}}),j\in[n] (41)
g2,iwj\displaystyle\frac{\partial g_{2,i}}{\partial w_{j}} =ηyi(2fyiyj+kyk2fykyj),j[m]\displaystyle=\eta y_{i}^{*}(-\frac{\partial^{2}f}{\partial y_{i}\partial y_{j}}+\sum_{k}y_{k}^{*}\frac{\partial^{2}f}{\partial y_{k}\partial y_{j}}),j\in[m] (42)
g3,ixi\displaystyle\frac{\partial g_{3,i}}{\partial x_{i}} =1for alli[n]and zerofor all the other partial derivatives ofg3,i\displaystyle=1\ \ \text{for all}\ \ i\in[n]\ \ \text{and zero}\text{for all the other partial derivatives of}\ \ g_{3,i} (43)
g4,iyi\displaystyle\frac{\partial g_{4,i}}{\partial y_{i}} =1for alli[m]and zero for all the other partial derivatives ofg4,i.\displaystyle=1\ \text{for all}\ \ i\in[m]\ \ \text{and zero for all the other partial derivatives of}\ \ \ g_{4,i}. (44)

Appendix C Jacobian matrix at $(\mathbf{x}^{*},\mathbf{y}^{*},\mathbf{z}^{*},\mathbf{w}^{*})$

This section supports the spectral analysis of Section 3. The Jacobian matrix of $g$ at the fixed point is obtained from the calculations above. We refer to the main article for the subscripts indicating the size of each block matrix.

J=[𝐈D𝐱𝟏𝟏2ηD𝐱(𝐈𝟏𝐱)𝐱𝐱2f2ηD𝐱(𝐈𝟏𝐱)𝐱𝐲2fηD𝐱(𝐈𝟏𝐱)𝐱𝐱2fηD𝐱(𝐈𝟏𝐱)𝐱𝐲2f2ηD𝐲(𝐈𝟏𝐲)𝐲𝐱2f𝐈D𝐲𝟏𝟏+2ηD𝐲(𝐈𝟏𝐲)𝐲𝐲2fηD𝐲(𝐈𝟏𝐲)𝐲𝐱2fηD𝐲(𝐈𝟏𝐲)𝐲𝐲2f𝐈𝟎𝟎𝟎𝟎𝐈𝟎𝟎]J=\left[\begin{array}[]{cccc}\mathbf{I}-D_{\mathbf{x}^{*}}\mathbf{1}\mathbf{1}^{\top}-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f&-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f&\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f&\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ 2\eta D_{\mathbf{y}^{*\top}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*})\nabla_{\mathbf{y}\mathbf{x}}^{2}f&\mathbf{I}-D_{\mathbf{y}^{*}}\mathbf{1}\mathbf{1}^{\top}+2\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f&-\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f&-\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f\\ \mathbf{I}&\mathbf{0}&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\mathbf{I}&\mathbf{0}&\mathbf{0}\end{array}\right]

By acting on the tangent space of each simplex, we observe that $D_{\mathbf{x}^{*}}\mathbf{1}\mathbf{1}^{\top}\mathbf{v}=0$ for $\sum_{k}v_{k}=0$, so each eigenvalue of the matrix $J$ is an eigenvalue of the following matrix

Jnew=[𝐈2ηD𝐱(𝐈𝟏𝐱)𝐱𝐱2f2ηD𝐱(𝐈𝟏𝐱)𝐱𝐲2fηD𝐱(𝐈𝟏𝐱)𝐱𝐱2fηD𝐱(𝐈𝟏𝐱)𝐱𝐲2f2ηD𝐲(𝐈𝟏𝐲)𝐲𝐱2f𝐈+2ηD𝐲(𝐈𝟏𝐲)𝐲𝐲2fηD𝐲(𝐈𝟏𝐲)𝐲𝐱2fηD𝐲(𝐈𝟏𝐲)𝐲𝐲2f𝐈𝟎𝟎𝟎𝟎𝐈𝟎𝟎]J_{\text{new}}=\left[\begin{array}[]{cccc}\mathbf{I}-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f&-2\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f&\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f&\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ 2\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f&\mathbf{I}+2\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f&-\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f&-\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f\\ \mathbf{I}&\mathbf{0}&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\mathbf{I}&\mathbf{0}&\mathbf{0}\end{array}\right]

The characteristic polynomial of $J_{\text{new}}$ is $\det(J_{\text{new}}-\lambda I)$, which can be computed as the determinant of the following matrix:

[(1λ)𝐈+(1λ2)ηD𝐱(𝐈𝟏𝐱)𝐱𝐱2f(1λ2)ηD𝐱(𝐈𝟏𝐱)𝐱𝐲2f(21λ)ηD𝐲(𝐈𝟏𝐲)𝐲𝐱2f(1λ)𝐈+(21λ)ηD𝐲(𝐈𝟏𝐲)𝐲𝐲2f]\displaystyle\left[\begin{array}[]{cccc}(1-\lambda)\mathbf{I}+(\frac{1}{\lambda}-2)\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{x}}^{2}f&(\frac{1}{\lambda}-2)\eta D_{\mathbf{x}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{x}^{*\top})\nabla_{\mathbf{x}\mathbf{y}}^{2}f\\ (2-\frac{1}{\lambda})\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{x}}^{2}f&(1-\lambda)\mathbf{I}+(2-\frac{1}{\lambda})\eta D_{\mathbf{y}^{*}}(\mathbf{I}-\mathbf{1}\mathbf{y}^{*\top})\nabla_{\mathbf{y}\mathbf{y}}^{2}f\end{array}\right] (47)