Last iterate convergence in no-regret learning: constrained min-max optimization for convex-concave landscapes
Abstract
In a recent series of papers it has been established that variants of Gradient Descent/Ascent and Mirror Descent exhibit last iterate convergence in convex-concave zero-sum games. Specifically, [6, 10] show last iterate convergence of the so-called “Optimistic Gradient Descent/Ascent” for the case of unconstrained min-max optimization. Moreover, in [11] the authors show that Mirror Descent with an extra gradient step displays last iterate convergence for convex-concave problems (both constrained and unconstrained), though their algorithm does not follow the online learning framework; it uses extra information rather than only the history to compute the next iteration. In this work, we show that “Optimistic Multiplicative-Weights Update (OMWU)”, which follows the no-regret online learning framework, exhibits last iterate convergence locally for convex-concave games, generalizing the results of [8], where last iterate convergence of OMWU was shown only for the bilinear case. We complement our results with experiments that indicate fast convergence of the method.
1 Introduction
In classic (normal form) zero-sum games, one has to compute two probability vectors $x \in \Delta_n$, $y \in \Delta_m$ ($\Delta_k$ denotes the simplex of size $k$) that constitute an equilibrium of the following problem
$$\min_{x \in \Delta_n} \max_{y \in \Delta_m} x^{\top} A y \qquad (1)$$
where $A$ is an $n \times m$ real matrix (called the payoff matrix). Here $x^{\top} A y$ represents the payment of the $x$ player to the $y$ player under the choices of strategies $x, y$ by the two players, and is a bilinear function.
Arguably one of the most celebrated theorems, and a cornerstone of Game Theory, is the minimax theorem of Von Neumann [16]. It states that
$$\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y) = \max_{y \in \mathcal{Y}} \min_{x \in \mathcal{X}} f(x, y), \qquad (2)$$
where $f(x, y)$ is convex in $x$ and concave in $y$. The aforementioned result holds for any convex compact sets $\mathcal{X} \subset \mathbb{R}^n$ and $\mathcal{Y} \subset \mathbb{R}^m$. The min-max theorem reassures us that an equilibrium always exists in the bilinear game (1) or its convex-concave analogue (again $f(x, y)$ is interpreted as the payment of the $x$ player to the $y$ player). An equilibrium is a pair of randomized strategies such that neither player can improve their payoff by unilaterally changing their distribution.
Soon after the appearance of the minimax theorem, research focused on dynamics for solving min-max optimization problems by having the min and max players of (1) run a simple online learning procedure. In the online learning framework, at time $t$, each player chooses a probability distribution ($x_t$, $y_t$ respectively) simultaneously, depending only on the past choices of both players (i.e., $x_1, \dots, x_{t-1}$ and $y_1, \dots, y_{t-1}$), and experiences a payoff that depends on the choices $x_t, y_t$.
An early method, proposed by Brown [4] and analyzed by Robinson [14], was fictitious play. Later on, researchers discovered several robust learning algorithms that converge to the minimax equilibrium at faster rates, see [5]. This class of learning algorithms is the so-called “no-regret” class and includes the Multiplicative Weights Update method [2] and Follow the Regularized Leader.
1.1 Average Iterate Convergence
Despite the rich literature on no-regret learning, most known results share the feature that the min-max equilibrium is attained only by the time average. This means that the trajectory $(x_t, y_t)$ of a no-regret learning method has the property that the average $\frac{1}{T}\sum_{t=1}^{T}(x_t, y_t)$ converges to an equilibrium of (1) as $T \to \infty$. Unfortunately, this does not mean that the last iterate converges to an equilibrium; it commonly diverges or cycles. One such example is the well-known Multiplicative Weights Update algorithm, the time average of which is known to converge to an equilibrium, but the actual trajectory cycles towards the boundary of the simplex ([3]). The same is true even for vanilla Gradient Descent/Ascent, where one can show that even for bilinear landscapes (unconstrained case) the last iterate fails to converge [6].
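To make the contrast concrete, here is a minimal Python sketch (our own toy instance, not an experiment from this paper) of vanilla MWU on matching pennies: the time average approaches the uniform equilibrium, while the last iterate drifts towards the boundary of the simplex. The stepsize and horizon are arbitrary illustrative choices.

```python
import numpy as np

# Vanilla MWU for both players of min_x max_y x^T A y on matching pennies.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # payoff of the min player to the max player
eta, T = 0.05, 5000
x, y = np.array([0.6, 0.4]), np.array([0.6, 0.4])
sum_x, sum_y = np.zeros(2), np.zeros(2)
for _ in range(T):
    gx, gy = A @ y, A.T @ x
    x = x * np.exp(-eta * gx); x /= x.sum()   # min player: shrink weight on costly actions
    y = y * np.exp(eta * gy);  y /= y.sum()   # max player: grow weight on rewarding actions
    sum_x += x; sum_y += y
xbar, ybar = sum_x / T, sum_y / T

# Deviation from the unique equilibrium (1/2, 1/2) for both players.
dev_last = max(abs(x[0] - 0.5), abs(y[0] - 0.5))
dev_avg = max(abs(xbar[0] - 0.5), abs(ybar[0] - 0.5))
# dev_avg is small while dev_last is not: the average converges, the trajectory cycles out.
```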
Motivated by the training of Generative Adversarial Networks (GANs), in the last couple of years researchers have focused on designing and analyzing procedures that exhibit last iterate convergence (or pointwise convergence) for zero-sum games. This is crucial for training GANs, the landscapes of which are typically non-convex non-concave, and averaging no longer gives the guarantees it did before (e.g., note that Jensen’s inequality is not applicable anymore). In [6, 10] the authors show that a variant of Gradient Descent/Ascent, called Optimistic Gradient Descent/Ascent, has last iterate convergence for the case of bilinear functions $f(x, y) = x^{\top} A y$ where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^m$ (this is called the unconstrained case, since there are no restrictions on the vectors). Later on, [8] generalized the above result to simplex constraints, where the online method the authors analyzed was Optimistic Multiplicative Weights Update. In [11], it is shown that Mirror Descent with an extra gradient computation converges pointwise for a class of zero-sum games that includes the convex-concave setting (with arbitrary constraints), though their algorithm does not fit the online no-regret framework, since it uses gradient information twice before each iterate. Last but not least, other works have shown pointwise convergence in other settings (see [13, 7] and [1] and references therein) to stationary points (but not local equilibrium solutions).
1.2 Main Results
In this paper, we focus on the min-max optimization problem
$$\min_{x \in \Delta_n} \max_{y \in \Delta_m} f(x, y), \qquad (3)$$
where $f(x, y)$ is a convex-concave function (convex in $x$, concave in $y$). We analyze the no-regret online algorithm Optimistic Multiplicative Weights Update (OMWU). OMWU is an instantiation of the Optimistic Follow the Regularized Leader (OFTRL) method with entropy as a regularizer (for both players, see the Preliminaries section for the definition of OMWU).
We prove that OMWU exhibits local last iterate convergence, generalizing the result of [8] and resolving an open question of [15] (for convex-concave games). Formally, our main theorem is stated below:
Theorem 1.1 (Last iterate convergence of OMWU).
Let $f : \Delta_n \times \Delta_m \to \mathbb{R}$ be a twice differentiable function that is convex in $x$ and concave in $y$. Assume that there exists an equilibrium $(x^*, y^*)$ that satisfies the KKT conditions with strict inequalities (see (4)). Then for a sufficiently small stepsize, there exists a neighborhood $U$ of $(x^*, y^*)$ such that for all initial conditions $(x_0, y_0) \in U$, OMWU exhibits last iterate (pointwise) convergence, i.e., $\lim_{t \to \infty} (x_t, y_t) = (x^*, y^*)$,
where $(x_t, y_t)$ denotes the $t$-th iterate of OMWU.
Moreover, we complement our theoretical findings with an experimental analysis of the procedure. The experiments, which track the KL divergence to the equilibrium, indicate that the results should hold globally.
1.3 Structure and Technical Overview
We present the structure of the paper and a brief technical overview.
Section 2 provides necessary definitions, the explicit form of OMWU derived from OFTRL with entropy regularizer, and some existing results on dynamical systems.
Section 3 is the main technical part, i.e., the computation and spectral analysis of the Jacobian matrix of the OMWU dynamics. The stability analysis, the understanding of the local behavior, and the local convergence guarantees of OMWU all rely on the spectral analysis of the computed Jacobian matrix. The techniques for bilinear games (as in [8]) are no longer valid for convex-concave games. Allow us to explain the differences from [8]. In general, one cannot expect a trivial generalization from linear to non-linear scenarios. The properties of bilinear games are fundamentally different from those of convex-concave games, and this makes the analysis much more challenging in the latter. The key result of the spectral analysis in [8] is a lemma (Lemma B.6) which states that a skew-symmetric matrix ($A$ is skew-symmetric if $A^{\top} = -A$) has purely imaginary eigenvalues. Skew-symmetric matrices appear since in the bilinear case there are terms that are linear in $x$ and linear in $y$ but no higher order terms in $x$ or $y$. However, skew symmetry has no analogue in the case of convex-concave landscapes, and the Jacobian matrix of OMWU is far more complicated. One key technique to overcome the lack of skew symmetry is the use of the Ky Fan inequality [12], which states that the sequence of eigenvalues of $\frac{1}{2}(W + W^{\top})$ majorizes the real parts of the eigenvalues of $W$ for any square matrix $W$ (see Lemma 3.1).
Section 4 focuses on numerical experiments to understand how the problem size and the choice of learning rate affect the performance of our algorithm. We observe that our algorithm achieves global convergence regardless of the choice of learning rate, random initialization, or problem size. In comparison, the recently popularized (projected) optimistic gradient descent ascent is much more sensitive to the choice of hyperparameters. Due to space constraints, the detailed calculations of the Jacobian matrix of OMWU (general form and at the fixed point) are left to the Appendix.
Notation
The boldface $\mathbf{x}$ and $\mathbf{y}$ denote vectors in $\Delta_n$ and $\Delta_m$. $(x_t, y_t)$ denotes the $t$-th iterate of the dynamical system. The letter $J$ denotes the Jacobian matrix. $I$, $0$ and $\mathbf{1}$ are reserved for the identity matrix, the zero matrix and the vector with all entries equal to 1. The support of $x$ is the set of indices $i$ such that $x_i > 0$, denoted by $\mathrm{Supp}(x)$. $(x^*, y^*)$ denotes the optimal solution of the minimax problem. $[n]$ denotes the set of integers $\{1, \dots, n\}$.
2 Preliminaries
In this section, we present some background that will be used later.
2.1 Equilibria for Constrained Minimax
From Von Neumann’s minimax theorem, one can conclude that problem (3) always has an equilibrium and that the optimal value is unique. Moreover, from the KKT conditions (as long as $f$ is twice differentiable), such an equilibrium $(x^*, y^*)$ must satisfy the following ($x^*$ is a local minimum of $f(\cdot, y^*)$ for fixed $y^*$ and $y^*$ is a local maximum of $f(x^*, \cdot)$ for fixed $x^*$):
Definition 2.1 (KKT conditions).
Formally, there exist multipliers $\lambda, \mu \in \mathbb{R}$ such that
$$\frac{\partial f}{\partial x_i}(x^*, y^*) = \lambda \ \text{ for all } i \in \mathrm{Supp}(x^*), \qquad \frac{\partial f}{\partial x_i}(x^*, y^*) \geq \lambda \ \text{ for all } i \notin \mathrm{Supp}(x^*),$$
$$\frac{\partial f}{\partial y_j}(x^*, y^*) = \mu \ \text{ for all } j \in \mathrm{Supp}(y^*), \qquad \frac{\partial f}{\partial y_j}(x^*, y^*) \leq \mu \ \text{ for all } j \notin \mathrm{Supp}(y^*). \qquad (4)$$
Remark 2.2 (No degeneracies).
For the rest of the paper we assume no degeneracies, i.e., the last inequalities hold strictly (in the case where a strategy is played with zero probability by either player). Moreover, it is easy to see that since $f$ is convex-concave and twice differentiable, $\nabla^2_{xx} f$ (the part of the Hessian that involves the $x$ variables) is positive semi-definite and $\nabla^2_{yy} f$ (the part of the Hessian that involves the $y$ variables) is negative semi-definite.
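These signs are easy to check numerically. The sketch below (our own example function, not one used in the paper) estimates the two Hessian blocks of a sample convex-concave $f(x, y) = x^{\top} A y + \|x\|^2 - \|y\|^2$ by central finite differences and confirms that $\nabla^2_{xx} f$ is PSD and $\nabla^2_{yy} f$ is NSD.

```python
import numpy as np

# A sample convex-concave function (our choice): convex in x, concave in y.
A = np.array([[1.0, 2.0], [0.0, -1.0]])

def f(x, y):
    return x @ A @ y + x @ x - y @ y

def hess(g, z, eps=1e-4):
    """Numerical Hessian of a scalar function g at point z (central differences)."""
    n = len(z)
    H = np.zeros((n, n))
    E = np.eye(n) * eps
    for i in range(n):
        for j in range(n):
            H[i, j] = (g(z + E[i] + E[j]) - g(z + E[i] - E[j])
                       - g(z - E[i] + E[j]) + g(z - E[i] - E[j])) / (4 * eps**2)
    return H

x0, y0 = np.array([0.3, 0.7]), np.array([0.6, 0.4])
H_xx = hess(lambda x: f(x, y0), x0)   # should be positive semi-definite
H_yy = hess(lambda y: f(x0, y), y0)   # should be negative semi-definite
```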
2.2 Optimistic Multiplicative Weights Update
The equations of Optimistic Follow-the-Regularized-Leader (OFTRL) applied to the problem $\min_{x \in \mathcal{X}} \max_{y \in \mathcal{Y}} f(x, y)$
with regularizers (strongly convex functions) $h_x, h_y$ (for the $x$ and $y$ player respectively) are given below (see [6]):
$$x_{t+1} = \mathop{\mathrm{argmin}}_{x \in \mathcal{X}} \; \eta \Big\langle x, \sum_{s=1}^{t} \nabla_x f(x_s, y_s) + \nabla_x f(x_t, y_t) \Big\rangle + h_x(x),$$
$$y_{t+1} = \mathop{\mathrm{argmax}}_{y \in \mathcal{Y}} \; \eta \Big\langle y, \sum_{s=1}^{t} \nabla_y f(x_s, y_s) + \nabla_y f(x_t, y_t) \Big\rangle - h_y(y).$$
$\eta$ is called the stepsize of the online algorithm. OFTRL is uniquely defined if $f$ is convex-concave and the domains $\mathcal{X}$ and $\mathcal{Y}$ are convex. For simplex constraints and entropy regularizers, i.e., $h(x) = \sum_i x_i \log x_i$, we can solve for the explicit form of OFTRL using the KKT conditions; the resulting update rule is the Optimistic Multiplicative Weights Update (OMWU) and is described as follows:
$$x_{t+1, i} = x_{t, i} \, \frac{\exp\!\big(-\eta \big(2 \tfrac{\partial f}{\partial x_i}(x_t, y_t) - \tfrac{\partial f}{\partial x_i}(x_{t-1}, y_{t-1})\big)\big)}{\sum_j x_{t, j} \exp\!\big(-\eta \big(2 \tfrac{\partial f}{\partial x_j}(x_t, y_t) - \tfrac{\partial f}{\partial x_j}(x_{t-1}, y_{t-1})\big)\big)}, \quad i \in [n],$$
$$y_{t+1, i} = y_{t, i} \, \frac{\exp\!\big(\eta \big(2 \tfrac{\partial f}{\partial y_i}(x_t, y_t) - \tfrac{\partial f}{\partial y_i}(x_{t-1}, y_{t-1})\big)\big)}{\sum_j y_{t, j} \exp\!\big(\eta \big(2 \tfrac{\partial f}{\partial y_j}(x_t, y_t) - \tfrac{\partial f}{\partial y_j}(x_{t-1}, y_{t-1})\big)\big)}, \quad i \in [m].$$
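A minimal Python sketch of this update (function and variable names are ours; the gradients are supplied as callables). As a sanity check we run it on matching pennies, a bilinear game whose unique equilibrium is uniform for both players, and observe that the last iterate itself converges.

```python
import numpy as np

def omwu_step(x, y, x_prev, y_prev, grad_x, grad_y, eta):
    """One OMWU step for min_x max_y f(x, y) over a product of simplices.

    Each player takes a multiplicative step along twice the current gradient
    minus the previous one (the "optimistic" correction), then renormalizes.
    """
    gx = 2 * grad_x(x, y) - grad_x(x_prev, y_prev)
    gy = 2 * grad_y(x, y) - grad_y(x_prev, y_prev)
    x_new = x * np.exp(-eta * gx)   # min player
    y_new = y * np.exp(eta * gy)    # max player
    return x_new / x_new.sum(), y_new / y_new.sum()

# Bilinear sanity check: f(x, y) = x^T A y on matching pennies.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
grad_x = lambda x, y: A @ y
grad_y = lambda x, y: A.T @ x
x = x_prev = np.array([0.7, 0.3])
y = y_prev = np.array([0.3, 0.7])
for _ in range(8000):
    x, y, x_prev, y_prev = *omwu_step(x, y, x_prev, y_prev, grad_x, grad_y, 0.05), x, y
# the last iterate approaches (1/2, 1/2) for both players, no averaging needed
```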
2.3 Fundamentals of Dynamical Systems
We conclude the Preliminaries section with some basic facts from dynamical systems.
Definition 2.3.
A recurrence relation of the form $z_{t+1} = g(z_t)$ is a discrete time dynamical system with update rule $g : \mathcal{S} \to \mathcal{S}$, where $\mathcal{S}$ is a subset of $\mathbb{R}^k$ for some positive integer $k$. The point $z^*$ is called a fixed point of $g$ if $g(z^*) = z^*$.
Remark 2.4.
Using the KKT conditions (4), it is not hard to observe that an equilibrium point $(x^*, y^*)$ must be a fixed point of the OMWU algorithm, i.e., if $(x_t, y_t) = (x_{t-1}, y_{t-1}) = (x^*, y^*)$ then $(x_{t+1}, y_{t+1}) = (x^*, y^*)$.
Proposition 2.5 ([9]).
Assume that $g$ is a differentiable function and that the Jacobian of the update rule $g$ at a fixed point $z^*$ has spectral radius less than one. Then there exists a neighborhood $U$ around $z^*$ such that for all $z_0 \in U$, the dynamics converge to $z^*$, i.e., $\lim_{t \to \infty} g^t(z_0) = z^*$ ($g^t$ denotes the composition of $g$ with itself $t$ times). $g$ is then called a contraction mapping in $U$.
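A toy illustration of Proposition 2.5 (our own example map, unrelated to OMWU): a two-dimensional map whose Jacobian at the fixed point $(0, 0)$ has spectral radius $0.5 < 1$, so nearby iterates contract to the fixed point.

```python
import numpy as np

# g has fixed point (0, 0): g(0, 0) = (0, 0).
def g(z):
    x, y = z
    return np.array([0.5 * x - 0.2 * np.sin(y), 0.3 * y + 0.1 * x**2])

# Jacobian of g at the fixed point, computed by hand:
# d/dx [0.5x - 0.2 sin y] = 0.5,  d/dy = -0.2 cos(0) = -0.2
# d/dx [0.3y + 0.1 x^2]   = 0.2x = 0 at x=0,  d/dy = 0.3
J = np.array([[0.5, -0.2], [0.0, 0.3]])
rho = max(abs(np.linalg.eigvals(J)))   # spectral radius = 0.5

z = np.array([0.4, -0.3])
for _ in range(100):
    z = g(z)
# iterates contract geometrically to the fixed point
```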
3 Last iterate convergence of OMWU
In this section, we prove that OMWU converges pointwise (exhibits last iterate convergence) if the initialization belongs to a neighborhood of the equilibrium $(x^*, y^*)$.
3.1 Dynamical System of OMWU
We first express the OMWU algorithm as a dynamical system so that we can use Proposition 2.5. The idea (similar to [8]) is to lift the space to consist of four components $(x_t, y_t, x_{t-1}, y_{t-1})$, so that we can include the history (current and previous step, see Section 2.2 for the equations). First, we provide the update rule of the lifted dynamical system, a map $g = (g_1, g_2, g_3, g_4)$,
where $g_1, \dots, g_4$ are defined as follows (writing $(x, y, z, w)$ for the lifted state, with $(z, w)$ the previous iterate):
$$g_1(x, y, z, w)_i = x_i \, \frac{\exp\!\big(-\eta \big(2 \tfrac{\partial f}{\partial x_i}(x, y) - \tfrac{\partial f}{\partial x_i}(z, w)\big)\big)}{\sum_j x_j \exp\!\big(-\eta \big(2 \tfrac{\partial f}{\partial x_j}(x, y) - \tfrac{\partial f}{\partial x_j}(z, w)\big)\big)}, \qquad (5)$$
$$g_2(x, y, z, w)_i = y_i \, \frac{\exp\!\big(\eta \big(2 \tfrac{\partial f}{\partial y_i}(x, y) - \tfrac{\partial f}{\partial y_i}(z, w)\big)\big)}{\sum_j y_j \exp\!\big(\eta \big(2 \tfrac{\partial f}{\partial y_j}(x, y) - \tfrac{\partial f}{\partial y_j}(z, w)\big)\big)}, \qquad (6)$$
$$g_3(x, y, z, w) = x, \qquad (7)$$
$$g_4(x, y, z, w) = y. \qquad (8)$$
Then the dynamical system of OMWU can be written in the compact form $(x_{t+1}, y_{t+1}, x_t, y_t) = g(x_t, y_t, x_{t-1}, y_{t-1})$.
In what follows, we perform spectral analysis on the Jacobian of the function $g$, computed at the fixed point. Since the space has been lifted, the fixed point we analyze is $(x^*, y^*, x^*, y^*)$ (see Remark 2.4). By showing that the spectral radius is less than one, our Theorem 1.1 follows by Proposition 2.5. The computations of the Jacobian of $g$ are deferred to the supplementary material.
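The lifting can also be checked numerically. The sketch below (our own instance: matching pennies with a small stepsize, not a computation from the paper) builds the lifted map $g$, estimates its Jacobian at the lifted fixed point by central finite differences, and observes that the spectral radius lies strictly below one, which is exactly the condition Proposition 2.5 needs.

```python
import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])   # matching pennies, equilibrium (1/2, 1/2)
eta = 0.05

def lifted_map(z):
    """The lifted OMWU map g on states (x, y, x_prev, y_prev)."""
    x, y, xp, yp = z[0:2], z[2:4], z[4:6], z[6:8]
    gx = 2 * (A @ y) - A @ yp          # optimistic gradient for the min player
    gy = 2 * (A.T @ x) - A.T @ xp      # optimistic gradient for the max player
    xn = x * np.exp(-eta * gx); xn /= xn.sum()
    yn = y * np.exp(eta * gy);  yn /= yn.sum()
    return np.concatenate([xn, yn, x, y])

zstar = np.full(8, 0.5)                # lifted fixed point (x*, y*, x*, y*)
eps = 1e-6
J = np.column_stack([                  # Jacobian by central finite differences
    (lifted_map(zstar + eps * e) - lifted_map(zstar - eps * e)) / (2 * eps)
    for e in np.eye(8)
])
rho = max(abs(np.linalg.eigvals(J)))   # spectral radius, just below 1 here
```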
3.2 Spectral Analysis
Let $(x^*, y^*)$ be the equilibrium of the min-max problem (3). Assume $x^*_i = 0$; then (see the equations in the supplementary material, Section A)
$$\frac{\partial (g_1)_i}{\partial x_i}\Big|_{(x^*, y^*, x^*, y^*)} = \exp\!\Big(-\eta \Big(\tfrac{\partial f}{\partial x_i}(x^*, y^*) - \lambda\Big)\Big),$$
where $\lambda$ is the multiplier of the $x$ player from the KKT conditions (4), and all other partial derivatives of $(g_1)_i$ are zero; thus $\frac{\partial (g_1)_i}{\partial x_i}$ is an eigenvalue of the Jacobian computed at $(x^*, y^*, x^*, y^*)$. This is true because the row of the Jacobian that corresponds to $(g_1)_i$ has zeros everywhere but the diagonal entry. Moreover, because of the non-degeneracy assumption on the KKT conditions (see Remark 2.2), it holds that this eigenvalue is strictly less than one.
Similarly, for $y^*_i = 0$ it holds that
$$\frac{\partial (g_2)_i}{\partial y_i}\Big|_{(x^*, y^*, x^*, y^*)} = \exp\!\Big(\eta \Big(\tfrac{\partial f}{\partial y_i}(x^*, y^*) - \mu\Big)\Big) < 1$$
(again by Remark 2.2, since $\tfrac{\partial f}{\partial y_i}(x^*, y^*) < \mu$ strictly), and all other partial derivatives of $(g_2)_i$ are zero; therefore $\frac{\partial (g_2)_i}{\partial y_i}$ is an eigenvalue of the Jacobian computed at $(x^*, y^*, x^*, y^*)$.
We now focus on the submatrix of the Jacobian of $g$ computed at $(x^*, y^*, x^*, y^*)$ that corresponds to the non-zero probabilities of $x^*$ and $y^*$. We denote by $D_{x^*}$ the diagonal matrix of size $|\mathrm{Supp}(x^*)| \times |\mathrm{Supp}(x^*)|$ that has on its diagonal the nonzero entries of $x^*$, and similarly we define $D_{y^*}$ of size $|\mathrm{Supp}(y^*)| \times |\mathrm{Supp}(y^*)|$. For convenience, let us denote $k_x = |\mathrm{Supp}(x^*)|$ and $k_y = |\mathrm{Supp}(y^*)|$. The Jacobian submatrix is the following
where
(9) |
We note that $I$ and $0$ denote the identity matrix and the all-zeros matrix respectively (the appropriate size is indicated as a subscript). The vectors and are left eigenvectors with eigenvalue zero for the above matrix. Hence, any right eigenvector should satisfy the conditions and . Thus, every non-zero eigenvalue of the above matrix is also a non-zero eigenvalue of the matrix below:
where
The characteristic polynomial is obtained by computing the determinant of the matrix minus $\lambda$ times the identity. One can perform row/column operations to calculate this determinant, which gives us the following relation:
where is the characteristic polynomial of the following matrix
and are the aforementioned sub-matrices. Notice that can be written as
where,
Notice here that is the Hessian matrix evaluated at the fixed point, and is the appropriate sub-matrix restricted to the support of $x^*$ and $y^*$. Although the Hessian matrix is symmetric, we would like to work with the following representation of :
where,
Let us denote by $\lambda$ any non-zero eigenvalue of the above matrix, which may be a complex number. Thus $\lambda$ is a point where the characteristic polynomial vanishes, and hence any such eigenvalue must satisfy the relation
We now show that the magnitude of any eigenvalue is strictly less than 1, i.e., $|\lambda| < 1$. Trivially, $\lambda = 0$ satisfies this condition. Thus we need to show that the magnitude of any $\lambda$ at which the polynomial vanishes is strictly less than 1. The remainder of the proof proceeds by showing the following two lemmas:
Lemma 3.1 (Real part non-positive).
Let be an eigenvalue of matrix . It holds that .
Figure 1: (a) #iterations vs size of the problem; (b) error vs #iterations.
Proof.
Assume that . All the non-zero eigenvalues of matrix coincide with the eigenvalues of the matrix
This is well-defined since
is positive semi-definite. Moreover, we use the Ky Fan inequalities, which state that the sequence (in decreasing order) of the eigenvalues of $\frac{1}{2}(W + W^{\top})$ majorizes the sequence of real parts of the eigenvalues of $W$ for any square matrix $W$ (see [12], page 4). We conclude that for any eigenvalue of the matrix, its real part is at most the maximum eigenvalue of the symmetric part. Observe now that
Since
by the convex-concave assumption on it follows that the matrix above is negative semi-definite (see Remark 2.2) and so is . We conclude that the maximum eigenvalue of is non-positive. Therefore any eigenvalue of has real part non-positive and the same is true for . ∎
Lemma 3.2.
If $\lambda$ is a non-zero eigenvalue of the matrix above, then $|\lambda| < 1$ when the stepsize $\eta$ is chosen sufficiently small.
We first note that the learning rate $\eta$ multiplies every eigenvalue; whilst $\eta$ is positive, it may be chosen sufficiently small, and hence the magnitude of any eigenvalue can be controlled.
Remark 3.3.
Proof.
Let and . The relation gives two equations based on the equality of real and imaginary parts as follows,
(10) | ||||
(11) |
Notice that the above equations can be transformed to the following forms:
(12) | ||||
(13) |
For each , there exist two pairs of points and that are the intersections of the above two hyperbolas, as illustrated in Figure 4. Recall the condition that . As , the hyperbola can be obtained from the translation by of the hyperbola
where the translated center of symmetry is close to since is close to . So the two intersections of the above hyperbolas, and , satisfy the property that is small, since the two intersections are on the two sides of the axis , as shown in Figure 3.
Figures 3 and 4: the two hyperbolas and their intersections.
On the other hand, we have
and then the condition gives the inequality
that is equivalent to
where only the case is considered, since the intersection whose -component satisfies has the property that is small and hence less than 1 (Figure 4). Thus, to prove that , it suffices to assume . It is obvious that implies . This completes the proof. ∎
Figure 5: (a) OMWU; (b) OGDA; (c) convergence time comparisons; (d) OMWU trajectories with different learning rates.
Figure 6: KL divergence vs #iterations, (a) with different problem sizes and (b) with different learning rates.
4 Experiments
In this section, we conduct empirical studies to verify the theoretical results of our paper. We primarily aim to understand two factors that influence the convergence speed of OMWU: the problem size and the learning rate. We also compare our algorithm with Optimistic Gradient Descent Ascent (OGDA) with projection, and demonstrate the advantages of OMWU over it.
We start with a simple bilinear min-max game:
$$\min_{x \in \Delta_n} \max_{y \in \Delta_n} x^{\top} A y. \qquad (14)$$
We first vary the problem size to study how the learning speed scales with the size of the problem. The learning rate is fixed, and we run OMWU with the matrix $A$ generated with i.i.d. random Gaussian entries. We output the number of iterations for OMWU to reach convergence, i.e., with error to the optimal solution less than or equal to a fixed tolerance. The results are averaged from 10 runs with different random initializations. As reported in Figure 1, generally a larger problem size requires more iterations to reach convergence. We also provide four specific cases to show the convergence in distance in Figure 1(b). The shaded area demonstrates the standard deviation from the 50 runs.
To understand how the learning rate affects the speed of convergence, we conduct similar experiments on Eqn. (14) and plot the error for different step sizes in Figure 5(a)-(c). For this experiment the matrix size is fixed. We also include a comparison with Optimistic Gradient Descent Ascent [7]. Notice that the original proposal was for unconstrained problems, so we apply a projection in each step to keep the iterates inside the simplex. For the setting we consider, we observe that a larger learning rate effectively speeds up the learning process, and our algorithm is relatively more stable to the choice of step size. In comparison, OGDA is quite sensitive to the choice of step size. As shown in Figure 5(b), a larger step size makes the algorithm diverge, while a smaller step size makes very little progress. Furthermore, we also run our algorithm on a convex-concave but not bilinear function, defined on the first coefficients of $x$ and $y$. With this low dimensional function, we can visualize the convergence process as in Figure 5(d), where each arrow indicates an OMWU step. This figure demonstrates that, at least in this case, a larger step size usually ensures bigger progress towards the optimal solution.
Finally, we show how the KL divergence decreases under different circumstances. Figure 6 again considers the bilinear problem (Eqn. (14)) with multiple dimensions, as well as a simple convex-concave function with different learning rates. In all the circumstances we considered, we observe that OMWU is very stable and achieves global convergence regardless of the problem size, random initialization, and learning rate.
5 Conclusion
In this paper we analyze the last iterate behavior of a no-regret learning algorithm called Optimistic Multiplicative Weights Update for convex-concave landscapes. We prove that OMWU exhibits last iterate convergence in a neighborhood of the fixed point of the OMWU algorithm, generalizing previous results that showed last iterate convergence for bilinear functions. The experiments explore how the problem size and the choice of learning rate affect the performance of our algorithm. We find that OMWU achieves global convergence and is less sensitive to the choice of hyperparameters, compared to projected optimistic gradient descent ascent.
References
- [1] Jacob D. Abernethy, Kevin A. Lai, and Andre Wibisono. Last-iterate convergence rates for min-max optimization. CoRR, abs/1906.02027, 2019.
- [2] Sanjeev Arora, Elad Hazan, and Satyen Kale. The multiplicative weights update method: a meta-algorithm and applications. Theory of Computing, 8(1):121–164, 2012.
- [3] James P. Bailey and Georgios Piliouras. Multiplicative weights update in zero-sum games. In Proceedings of the 2018 ACM Conference on Economics and Computation, Ithaca, NY, USA, June 18-22, 2018, pages 321–338, 2018.
- [4] G. W. Brown. Iterative solutions of games by fictitious play. In Activity Analysis of Production and Allocation, 1951.
- [5] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
- [6] Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training GANs with Optimism. In Proceedings of ICLR, 2018.
- [7] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada., pages 9256–9266, 2018.
- [8] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. In 10th Innovations in Theoretical Computer Science Conference, ITCS 2019, January 10-12, 2019, San Diego, California, USA, pages 27:1–27:18, 2019.
- [9] Oded Galor. Discrete Dynamical Systems. Springer, 2007.
- [10] Tengyuan Liang and James Stokes. Interaction matters: A note on non-asymptotic local convergence of generative adversarial networks. arXiv preprint:1802.06132, 2018.
- [11] Panayotis Mertikopoulos, Houssam Zenati, Bruno Lecouat, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. CoRR, abs/1807.02629, 2018.
- [12] Mohammad Sal Moslehian. Ky fan inequalities. CoRR, abs/1108.1467, 2011.
- [13] Gerasimos Palaiopanos, Ioannis Panageas, and Georgios Piliouras. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5874–5884, 2017.
- [14] J. Robinson. An iterative method of solving a game. In Annals of Mathematics, pages 296–301, 1951.
- [15] Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Annual Conference on Neural Information Processing Systems 2015, pages 2989–2997, 2015.
- [16] J Von Neumann. Zur theorie der gesellschaftsspiele. In Math. Ann., pages 295–320, 1928.
Appendix A Equations of the Jacobian of OMWU
(15) | ||||
(16) | ||||
(17) | ||||
(18) | ||||
(19) | ||||
(20) | ||||
(21) | ||||
(22) | ||||
(23) | ||||
(24) | ||||
(25) | ||||
(26) | ||||
(27) | ||||
(28) | ||||
(29) | ||||
(30) | ||||
(31) | ||||
(32) |
Appendix B Equations of the Jacobian of OMWU at the fixed point
In this section, we compute the equations of the Jacobian at the fixed point $(x^*, y^*, x^*, y^*)$. The fact that at the fixed point the current and previous iterates both coincide with $(x^*, y^*)$ in computing the partial derivatives gives the following equations.
(33) | ||||
(34) | ||||
(35) | ||||
(36) | ||||
(37) | ||||
(38) | ||||
(39) | ||||
(40) | ||||
(41) | ||||
(42) | ||||
(43) | ||||
(44) |
Appendix C Jacobian matrix at
This section supports the “Spectral Analysis” of Section 3. The Jacobian matrix of $g$ at the fixed point $(x^*, y^*, x^*, y^*)$ is obtained from the calculations above. We refer to the main article for the subscripts indicating the size of each block matrix.
By acting on the tangent space of each simplex, we observe that for , so each eigenvalue of matrix is an eigenvalue of the following matrix
The characteristic polynomial of is that can be computed as the determinant of the following matrix:
(47) |