Two-Scale Gradient Descent Ascent Dynamics Finds Mixed Nash Equilibria of Continuous Games: A Mean-Field Perspective
Abstract.
Finding the mixed Nash equilibria (MNE) of a two-player zero-sum continuous game is an important and challenging problem in machine learning. A canonical algorithm for finding the MNE is the noisy gradient descent ascent method, which in the infinite-particle limit gives rise to the Mean-Field Gradient Descent Ascent (GDA) dynamics on the space of probability measures. In this paper, we first study the convergence of a two-scale Mean-Field GDA dynamics for finding the MNE of the entropy-regularized objective. More precisely, we show that for each finite temperature (or regularization parameter), the two-scale Mean-Field GDA with a suitable finite scale ratio converges exponentially to the unique MNE without assuming convexity or concavity of the interaction potential. The key ingredient of our proof lies in the construction of new Lyapunov functions that dissipate exponentially along the Mean-Field GDA. We further study the simulated annealing of the Mean-Field GDA dynamics. We show that with a temperature schedule that decays logarithmically in time, the annealed Mean-Field GDA converges to the MNE of the original unregularized objective.
Keywords. Nash equilibrium, mixed Nash equilibrium, gradient descent ascent, mean field, simulated annealing.
1. Introduction
Minmax learning underpins numerous machine learning methods, including Generative Adversarial Networks (GANs) [20], adversarial training [34] and reinforcement learning [6]. Minmax learning is often formulated as a zero-sum game between a min player and a max player, and hence can be formalized as a minmax optimization problem of the form
where the function is the game objective, and represent the player strategies. When is nonconvex in or nonconcave in , finding the pure Nash [38] equilibria of is difficult and sometimes impossible since the pure Nash equilibrium may not even exist. On the other hand, the mixed Nash equilibria (MNE) [19] where the pure strategies are replaced by mixed strategies modeled by a probability distribution over the set of strategies, do exist for more general objective functions. Formally the mixed Nash equilibria consist of pairs of probability distributions and that solve
(1.1) |
Here and denote the spaces of probability measures (or the sets of all mixed strategies) on the state spaces and respectively. Thanks to Glicksberg’s theorem [19], an MNE exists if and are finite or if and are compact and is continuous. While MNE exist in great generality, it is still difficult to find them. Much progress has been made recently on finding MNE of high-dimensional game problems with applications in GANs. For instance, the work [22] proposed a mirror-descent algorithm for finding MNE of (1.1) and applied it to Wasserstein GANs with provable convergence guarantees. The recent works [16, 33] proposed and analyzed an entropy-regularized version of (1.1):
(1.2) |
Here is the entropy functional of the probability measure , and the parameter is the entropy regularization parameter (or temperature). Observe that the objective energy functional is strongly convex in and strongly concave in . As a result of von Neumann’s celebrated minmax theorem [52, 47, 41, 40], one has that
(1.3) |
Under very mild assumptions on , Problem (1.2) has a unique MNE (see a proof in [16]) in the sense that for any ,
Furthermore, the MNE is given by the unique solution of the fixed point equations
(1.4) |
In the setting where both players play finite mixtures of strategies, i.e. and , the following noisy gradient descent ascent dynamics is perhaps one of the most natural algorithms for finding the MNE of (1.2):
(1.5) |
where are independent Brownian motions. The parameter in (1.5) represents the ratio of the time scales at which the gradient descent and ascent dynamics evolve. When , there is no time-scale separation. However, GDA with different time scales is commonly used in minmax optimization [24, 28] and sometimes leads to better convergence properties [21].
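For readers who wish to experiment, the particle system (1.5) is straightforward to simulate. The following is a minimal sketch (not part of the paper's analysis) of an Euler-Maruyama discretization for the illustrative potential V(x, y) = sin(x − y) on the one-dimensional torus; the step size, particle number, scale ratio and temperature below are arbitrary illustrative choices.

```python
import numpy as np

def two_scale_gda(n=100, steps=500, dt=1e-2, tau=5.0, beta=5.0, seed=0):
    """Euler-Maruyama discretization of the noisy two-scale GDA particle
    system for the toy potential V(x, y) = sin(x - y) on the torus
    [0, 2*pi): the min-player particles x descend, while the max-player
    particles y ascend at the faster time scale tau."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 2 * np.pi, n)  # min-player strategies
    y = rng.uniform(0.0, 2 * np.pi, n)  # max-player strategies
    for _ in range(steps):
        # mean-field gradients: partial derivatives of V averaged over the
        # opponent's empirical measure
        gx = np.cos(x[:, None] - y[None, :]).mean(axis=1)   # d/dx sin(x - y)
        gy = -np.cos(y[:, None] - x[None, :]).mean(axis=1)  # d/dy sin(x - y)
        x = (x - dt * gx
             + np.sqrt(2 * dt / beta) * rng.standard_normal(n)) % (2 * np.pi)
        y = (y + tau * dt * gy
             + np.sqrt(2 * tau * dt / beta) * rng.standard_normal(n)) % (2 * np.pi)
    return x, y
```

For a scale ratio tau > 1 the ascent particles equilibrate faster than the descent particles, mimicking the fast ascent regime studied below.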
In the limit of a large number of strategies (), the empirical measures and of the interacting particles, with and sampled independently and identically from the initial distributions and , converge weakly on finite time intervals to the solutions and of the Mean-Field GDA dynamics (also named the Interacting Wasserstein Gradient Flow in [16]):
(1.6) | ||||
The dynamics (1.6) can be viewed as the minmax analogue of the Mean-Field Langevin [42, 7, 23] dynamics that arises naturally from the mean field analysis of optimization of two-layer neural networks.
Despite its simplicity and popularity, the long-time convergence of the Mean-Field GDA (1.6) is still not well understood, except in the extreme quasi-static regime [33] where the ascent dynamics is infinitely faster or slower than the descent dynamics (). This motivates us to study the first question, which, to the best of our knowledge, remains open:
Question 1: Does the Mean-Field GDA (1.6) with a finite scale ratio converge to the unique MNE? If so, what is the rate of convergence?
We provide an affirmative answer to Question 1 and establish quantitative exponential convergence of (1.6) to the MNE for any fixed . Furthermore, motivated by the recent simulated annealing results for Langevin dynamics [51, 44] in the context of global optimization and for Mean-Field Langevin dynamics [7] in the setting of training two-layer neural networks, it is natural to ask what happens to the dynamics (1.6) in the annealed regime where as . In particular:
Question 2: Does the annealed Mean-Field GDA (1.6) with a decreasing temperature converge to a MNE of the unregularized objective defined in (1.1)?
Summary of contributions.
In this work we first address Question 1 by providing the first convergence analysis of the two-scale continuous-time Mean-Field GDA dynamics (1.6) with a finite scale ratio. This improves substantially on the earlier convergence results of [33, 15] for the Mean-Field GDA in the quasi-static setting where the scale ratio either vanishes or explodes. The key ingredient of the proof is the construction of a novel Lyapunov function which decreases exponentially along the dynamics (1.6). We then address Question 2 by proving that the annealed Mean-Field GDA converges to a global MNE of the original unregularized objective . The latter result is, to the best of our knowledge, the first rigorous justification of the global convergence of GDA to an MNE in the mean-field regime.
We highlight the major contributions as follows.
• For any fixed temperature , we show that in the fast ascent/descent regime (i.e. when the scale ratio is either larger or smaller than a certain threshold), the Mean-Field GDA dynamics (1.6) converges exponentially to the MNE of the entropy-regularized objective with respect to certain Lyapunov functions; see Theorem 2.1.
• The convergence in Lyapunov functions further implies the global exponential convergence of the dynamics with respect to the relative entropy (see Theorem 2.2). Our (non-asymptotic) convergence results hold for any positive temperature (regularization parameter) and do not assume convexity or concavity of the interaction potential . The convergence rate is characterized explicitly in terms of , bounds on and the diameters of the state spaces.
• We also study the simulated annealing of the Mean-Field GDA dynamics (1.6). We prove that with a cooling schedule decaying at a logarithmic rate, the Mean-Field GDA dynamics (1.6) converges to the global mixed Nash equilibrium of with respect to the Nikaidò-Isoda error (defined in (1.7)); see Theorem 2.3 for the precise statements.
1.1. Related work.
Minmax optimization
Most previous works on minmax optimization focused on the analysis of discrete-time optimization schemes in finite-dimensional Euclidean spaces under various assumptions on the objective function. In the convex-concave setting, global convergence guarantees were obtained for various optimization schemes, such as GDA [12], mirror descent [36] and Hamiltonian gradient descent [1]. In the non-convex and/or non-concave setting, local convergence to several notions of stationary points was studied in [11, 28]. In the special case where the objective satisfies a two-sided Polyak-Łojasiewicz (PL) condition, convergence to global Nash equilibria was recently obtained in [55, 56, 13] for two-scale GDA.
Mean-field analysis
Our study of the Mean-Field GDA dynamics for minmax optimization is strongly motivated by a growing body of work on the mean-field analysis of gradient descent algorithms for training neural networks. The study of the latter originates from work on two-layer networks [35, 48, 46, 8] and extends to residual networks [29, 54] and general deep neural networks [2, 49, 39]. These results show that the mean-field dynamics can be realized as Wasserstein gradient flows of certain nonlinear energy functionals on the space of probability measures. Global convergence of the Wasserstein gradient flows was obtained for the mean-field Langevin dynamics in [35, 7, 42, 23] and for the noiseless mean-field dynamics in [8]. We note that the mean-field Langevin dynamics can be viewed as a special McKean-Vlasov dynamics, whose long-time convergence is often established by assuming either a small interaction strength or a large noise level; see e.g. [17, 23].
Among the existing work on the mean-field Langevin dynamics, [7] is closest to ours: the author proved the exponential convergence of the Mean-Field Langevin dynamics under a certain uniform log-Sobolev inequality assumption, and also proved the convergence of the annealed Mean-Field Langevin dynamics to the global minimizers of the objective function. Our results are parallel to those of [7], but require new proof strategies. A key ingredient of our proof is the construction of new Lyapunov functions that decay exponentially along the Mean-Field GDA (1.6).
In the context of minmax optimization, [16] first proposed and analyzed the noisy gradient descent ascent flow (1.5) without scale separation () for finding the MNE. Specifically, the authors therein proved the convergence of (1.5) to the Mean-Field GDA (1.6) in the limit of a large number of agents, and characterized the equilibrium of (1.6) as the unique solution of the fixed point equation (1.4). However, [16] did not provide a proof of the long-time convergence of (1.6). In [33], the authors obtained convergence guarantees for the Mean-Field GDA (1.6) in the quasi-static regime where the ascent dynamics is infinitely faster or slower than the descent dynamics (i.e. ). One of the major contributions of the present work is to go beyond the quasi-static regime and provide the first proof of exponential convergence of (1.6) with a finite scale ratio.
Wasserstein-Fisher-Rao gradient flows
As an alternative to Wasserstein gradient flows, gradient flows with respect to the Wasserstein-Fisher-Rao (WFR) metric [9, 25, 27] have recently sparked a large amount of research on PDEs [5, 26], neural network training [45] and statistical sampling [30, 31]. In particular, WFR gradient flows give rise to birth-death interacting particle dynamics that enable “teleportation” of mass and locations of particles, and hence lead to better convergence properties than the Langevin dynamics in certain scenarios [45, 30, 31]. More recently, [16] adopted a WFR gradient flow for two-player zero-sum continuous games and proved that the time average of the WFR dynamics converges to the MNE in the regime where the Fisher-Rao dynamics dominates its Wasserstein counterpart. In [53], a new particle algorithm motivated by the implicit time-discretization of the WFR gradient flow was proposed for finding the MNE in the space of atomic measures. The authors proved local exponential convergence of the proposed algorithm under certain non-degeneracy assumptions on the game objective and the assumption that the MNE is unique.
1.2. Notations.
For a measurable space , we use to denote the space of probability measures on . Given a canonical Borel measure on , such as the uniform measure considered in this paper, if is absolutely continuous with respect to , we denote the Radon-Nikodym derivative by . We define the Shannon entropy of by . The relative entropy (or Kullback-Leibler divergence) and the relative Fisher information between two probability measures and are defined by
if is absolutely continuous with respect to and otherwise. Given an approximate mixed Nash equilibrium , we define the Nikaidò-Isoda [41] error by
(1.7) | ||||
Note that by definition for any and and if and only if .
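The Nikaidò-Isoda error is easy to evaluate for games on finite strategy grids, where the bilinear objective E(μ, ν) = μᵀVν attains its best responses at pure strategies. The following minimal sketch (illustrative, not from the paper) makes this concrete.

```python
import numpy as np

def nikaido_isoda(V, mu, nu):
    """Nikaido-Isoda error for the bilinear grid game E(mu, nu) = mu^T V nu:
    NI = max_{nu'} E(mu, nu') - min_{mu'} E(mu', nu).  Over the probability
    simplex the extrema are attained at pure strategies (vertices)."""
    best_ascent = (V.T @ mu).max()   # best pure response of the max player
    best_descent = (V @ nu).min()    # best pure response of the min player
    return best_ascent - best_descent
```

For the matching-pennies matrix V = [[1, -1], [-1, 1]], uniform play gives NI error zero (the MNE), while joint pure play at the first strategies gives NI error 2.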
1.3. Organization
The remainder of the paper is organized as follows. In Section 2 we state the key assumptions and present the main results of the paper. In Section 3 we discuss applications of our results to the training of GANs and the adversarial learning of PDEs. Section 4 is devoted to the proofs of the main results. We finish with several concluding remarks and future directions in Section 5. Further proof details are provided in the Appendix.
2. Assumptions and Main Results
Let us first recall the minmax problem of the entropy-regularized continuous game:
(2.1) |
When the regularization parameter (or temperature) , we write
We are interested in the global convergence of the Mean-Field GDA dynamics to the MNE of (2.1):
In what follows, we may suppress the dependence of on and write to simplify notation when is fixed; we will indicate such dependence in later discussions of the annealed dynamics, where shrinks to zero in time.
2.1. Assumptions
Assumption 2.1.
The state spaces and are (i) either smooth compact manifolds without boundary such that the eigenvalues of the Ricci curvature are strictly positive everywhere or (ii) Euclidean tori (with possibly unequal dimensions).
We also make the following smoothness assumption on the potential .
Assumption 2.2.
The function and there exists such that
It will be very useful to introduce operators and where
(2.2) |
Here and are the corresponding partition functions defined by
(2.3) |
It is easy to verify by the definition of that
(2.4) |
With the notations above, the MNE of (2.1) is characterized as the unique solution of the following fixed point equations (see [16, Theorem 1]):
(2.5) |
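On finite strategy grids, the fixed point system (2.5) reduces to two coupled softmax (Gibbs) equations, which a damped fixed-point iteration can solve numerically. The sketch below is purely illustrative (it is not the paper's algorithm, and the naive iteration is only expected to converge for weak interaction, i.e. small β times the oscillation of V).

```python
import numpy as np

def gibbs(u):
    """Normalized Gibbs measure exp(u) / Z on a grid, computed stably."""
    w = np.exp(u - u.max())
    return w / w.sum()

def solve_mne(V, beta=0.5, damping=0.5, iters=5000, tol=1e-12):
    """Damped fixed-point iteration for the grid analogue of (2.5):
    mu ∝ exp(-beta * V @ nu),  nu ∝ exp(beta * V.T @ mu)."""
    n, m = V.shape
    mu = np.full(n, 1.0 / n)
    nu = np.full(m, 1.0 / m)
    for _ in range(iters):
        mu_new = gibbs(-beta * V @ nu)
        nu_new = gibbs(beta * V.T @ mu)
        err = np.abs(mu_new - mu).sum() + np.abs(nu_new - nu).sum()
        mu = (1 - damping) * mu + damping * mu_new
        nu = (1 - damping) * nu + damping * nu_new
        if err < tol:
            break
    return mu, nu
```

At a fixed point the pair (mu, nu) satisfies both Gibbs best-response equations simultaneously, i.e. it is the (discretized) MNE of the entropy-regularized game.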
The lemma below shows that under Assumption 2.2 and Assumption 2.1, the measures and satisfy the logarithmic Sobolev inequalities (LSI) uniformly for any and .
Lemma 2.1.
There exists a constant such that for any and ,
(2.6) |
Lemma 2.1 follows from the classical Holley-Stroock argument and the fact that the uniform measures on and satisfy the LSI thanks to Assumption 2.1; see e.g. [7, Proposition 5.2] and [15, Proposition 6]. It is worth noting that in the regime where , the log-Sobolev constant satisfies
(2.7) |
2.2. Exponential convergence of Mean-Field GDA with a fixed temperature
The goal of this section is to present the global convergence of the GDA dynamics (1.6) to the MNE . To this end, it will be useful to define the following functionals
(2.8) | ||||
Note that by definition the functionals defined above are non-negative. For a fixed constant , we also define the Lyapunov functions
(2.9) |
Let be the effective condition number defined by
Our first main theorem characterizes the exponential convergence of (1.6) in terms of the Lyapunov functions and .
Theorem 2.1.
Let Assumption 2.1 and Assumption 2.2 hold. For a fixed , let be the solution to the GDA dynamics (1.6) with an initial condition satisfying and , and with a finite time-scale ratio .
(i) Fast ascent regime: set Then for all ,
with
(ii) Fast descent regime: set Then for all ,
with
Remark 2.1.
Theorem 2.1 states that the two-scale GDA dynamics (1.6) with a suitable finite scale ratio converges exponentially to the equilibrium. This improves substantially on the earlier result of [33] on GDA in the quasi-static setting where the scale ratio or . We emphasize that we chose the specific scale ratios merely to simplify the expression of the convergence rate . In fact, by tracking the proof of Theorem 2.1, one can obtain that the convergence rate
for some constants . Moreover, the dependence of the convergence rates on the diameters of the state spaces via and can be avoided, and especially the latter two quantities can be replaced by ; see discussions around (A.3).
Observe also that in the low temperature regime (), the log-Sobolev constant . Hence Theorem 2.1 requires (and ) to guarantee an exponential convergence rate (and ) in the fast ascent (descent) regime. Notice that the quadratic dependence of the convergence rate on in the fast descent regime, in contrast to the linear dependence on in the fast ascent regime, is due to the fact that the ascent dynamics itself is running in the slow time-scale ; one would recover the same rate of convergence as in the fast ascent regime if one used the time scales for the descent dynamics and the ascent dynamics.
Remark 2.2.
Our construction of Lyapunov functions (2.9) is strongly motivated by the recent study [55, 13] of two-scale GDA for minmax optimization on Euclidean spaces. In the finite dimensional setting, a two-sided PL condition is sufficient to guarantee the exponential convergence of two-scale GDA in both continuous-time [13] and discrete-time [55, 56] cases. In our infinite dimensional setting of the Mean-Field GDA, the uniform log-Sobolev inequality plays the same role as the two-sided PL condition, which however, needs to be combined with the entropy sandwich inequalities in Lemma 4.1 to obtain the exponential dissipation of Lyapunov functions.
Our next theorem provides an exponential convergence of (1.6) with respect to the relative entropy.
2.3. Convergence of annealed Mean-Field GDA
In this section, we proceed to present the convergence of the “annealed” Mean-Field GDA dynamics
(2.12) | ||||
where is now a time-dependent temperature which shrinks in time. Given any initial condition , the existence and uniqueness of the global solution to (2.12) follow directly from the classical well-posedness theory of nonlinear McKean-Vlasov-type PDEs [18, 50]. Our goal is to show that by carefully choosing the cooling schedule and the time-scale ratio , the solution to the annealed dynamics (2.12) converges to the MNE of .
Let be the solution of (2.5) corresponding to temperature . Recall the Nikaidò-Isoda error defined in (1.7).
Theorem 2.3.
Let Assumption 2.2 and Assumption 2.1 hold. Let be the solution to (2.12) with an initial condition . Assume that the log-Sobolev constant for some .
(i) Fast ascent regime: Assume further that is smooth, decreasing in and for some , for large values of . Set for some large . Then for every , there exists such that for sufficiently large,
(2.13) |
and that
(2.14) |
(ii) Fast descent regime: Assume further that is smooth, decreasing in and for some , for large values of . Set for some large M. Then for every , there exists such that for sufficiently large,
(2.15) |
and that
(2.16) |
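To make the cooling schedules in Theorem 2.3 concrete, the following is a minimal sketch of a logarithmic schedule; the constants c and t0 are hypothetical placeholders, not the values appearing in the theorem.

```python
import numpy as np

def inverse_temperature(t, c=1.0, t0=10.0):
    """Logarithmically growing inverse temperature beta_t = c * log(t0 + t),
    so that the temperature 1/beta_t decays like 1/log(t).  The constants c
    and t0 are illustrative choices, not values from the theorem."""
    return c * np.log(t0 + t)

def temperature(t, c=1.0, t0=10.0):
    # the cooling schedule used by the annealed dynamics
    return 1.0 / inverse_temperature(t, c, t0)
```

Such a slow decay balances two competing effects: the regularized equilibria approach the unregularized MNE only as the temperature vanishes, while the log-Sobolev constant, and hence the convergence rate, typically degrades rapidly as the inverse temperature grows (cf. (2.7)).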
3. Applications
In this section, we briefly discuss applications of our theoretical results to the training of GANs and the adversarial learning of PDEs.
3.1. Training of GANs
Let be the empirical measure associated with the i.i.d. samples from a target measure . Let be an integral probability metric (IPM) on the space of probability measures parameterized by a set of discriminators, i.e.
IPM-based GAN learns an optimal that minimizes over :
(3.1) |
Consider the witness function class given by the unit ball of the Barron space , which consists of functions admitting the representation
where and is a probability measure on the parameter space . Observe that Barron functions arise as the natural infinite-width limits of two-layer neural networks and can be approximated by such networks at a dimension-free rate [3, 32, 4]. When the activation function satisfies for some nonnegative function and for all , the Barron norm is defined by
(3.2) |
Setting in (3.1) leads to
(3.3) |
Here we adopted the shorthand with . Assume that the activation function , and that the input space and the parameter space satisfy Assumption 2.1. Then it is straightforward to see that and that there exists such that for any multi-indices and with ,
Therefore the convergence results in Theorems 2.1-2.2 for the Mean-Field GDA hold for the entropy-regularized version of the GAN objective defined in (3.3). Moreover, Theorem 2.3 implies that the annealed GDA dynamics finds the MNE of the unregularized GAN objective.
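At the particle level, the GAN objective of this subsection can be sketched as follows: a Barron discriminator is evaluated by averaging over particles drawn from its parameter measure, and the game objective is the resulting empirical IPM gap. This is an illustrative sketch, not the paper's construction; the tanh activation and all sampling choices are assumptions.

```python
import numpy as np

def barron_eval(X, A, W):
    """Monte Carlo evaluation of a Barron function
    f(x) = E_{(a, w) ~ rho}[a * tanh(w . x)] using discriminator
    particles (A[k], W[k]) ~ rho.  X has shape (n, d), W shape (m, d)."""
    return (A[None, :] * np.tanh(X @ W.T)).mean(axis=1)

def ipm_objective(X_data, X_gen, A, W):
    """Empirical game objective: the discriminator's mean separation
    between data samples and generator samples."""
    return barron_eval(X_data, A, W).mean() - barron_eval(X_gen, A, W).mean()
```

Note that the objective is linear in the outer weights a, reflecting the linearity of the game objective in the discriminator's parameter measure.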
3.2. Adversarial learning of PDEs
We provide another application of our results in the adversarial learning of PDEs. To demonstrate the idea, we focus on a simple linear elliptic PDE on a bounded Lipschitz domain , equipped with the Neumann boundary condition
Assume that and . The weak solution satisfies that
(3.4) |
We seek an approximate solution to (3.4) in the Petrov-Galerkin framework [43, 37], where we choose the spaces of trial functions and test functions to be two different Barron spaces. More precisely, consider a trial function and a test function , where are the Barron spaces defined in Section 3.1 with activation functions and Barron norms defined in (3.2) with replaced by nonnegative weight functions . We look for a solution parameterized by some probability measure ,
with satisfying equation (3.4) for any with parameterized by such that
Putting these into the weak formulation (3.4) leads to
where the potential is given for by
Assume that the activation functions and the parameter spaces and satisfy Assumption 2.1. Assume also that and . Then it is easy to verify that and that for some constant depending on and . Hence the convergence results on Mean-Field GDA and its annealed version established in Section 2 apply to this problem.
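As a hedged illustration of how the potential above could be assembled numerically, the sketch below Monte Carlo-estimates the weak-form pairing between a single trial neuron and a single test neuron on the unit cube. The tanh activations, the unit zeroth-order coefficient, the domain and the source term f are all hypothetical choices, not the paper's setup.

```python
import numpy as np

def neuron_pair_potential(theta, omega, f, n_mc=5000, d=2, seed=0):
    """Monte Carlo estimate of the weak-form pairing
    K(theta, omega) = E_x[ grad u . grad v + u v - f(x) v ]
    for single neurons u(x) = tanh(theta . x), v(x) = tanh(omega . x),
    with x ~ Unif([0, 1]^d).  All modeling choices are illustrative."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, (n_mc, d))
    u, v = np.tanh(x @ theta), np.tanh(x @ omega)
    du = (1 - u**2)[:, None] * theta[None, :]   # grad_x u(x)
    dv = (1 - v**2)[:, None] * omega[None, :]   # grad_x v(x)
    return ((du * dv).sum(axis=1) + u * v - f(x) * v).mean()
```

Averaging this pairing over particle approximations of the trial measure and the test measure yields a Monte Carlo estimate of the full minmax objective.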
4. Proofs of Main Results
4.1. Proof of convergence for Mean-Field GDA with a fixed temperature
We only present the proofs of our main convergence results in the fast ascent regime. The proofs for the fast descent regime can be carried out in a similar manner and are provided in Appendix B for completeness.
We first state a lemma summarizing some important properties of the functionals defined in (2.8).
Lemma 4.1.
For any and , the following hold
(4.1) | ||||
(4.2) | ||||
(4.3) | ||||
(4.4) |
Remark 4.1.
The sandwich inequalities (4.3) and (4.4) play an essential role: they control terms in the entropy production of the Mean-Field GDA dynamics via the Lyapunov functions, which closes the Grönwall argument and yields the dissipation of the latter along the dynamics. A similar sandwich inequality appeared in [7, Lemma 3.4] in the proof of convergence of the Mean-Field Langevin dynamics.
Next, we keep track of the time-derivatives of and in the next proposition whose proof can be found in Appendix A.
Proposition 4.1.
Proof of Theorem 2.1.
Proof of Theorem 2.2.
First, thanks to Theorem 2.1, . In particular, for any ,
(4.7) |
and
(4.8) |
As a result of (4.7) and (4.3), one obtains that
(4.9) |
Next, to obtain the exponential decay of , notice first that
Since , we have
It follows from the last two displays that
(4.10) |
The last term on the right side above can be upper bounded by
(4.11) | ||||
where we have used the Pinsker’s inequality and Young’s inequality in the last two lines above. Finally combining the last two estimates leads to
where we have used inequalities (4.8) and (4.9) in the last inequality above.
∎
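For the reader's convenience, the Pinsker-Young combination invoked in the proof above takes, schematically, the following form (with a bounded function $f$ and a free parameter $\lambda > 0$; the exact constants used in (4.11) may differ):

```latex
\Bigl|\int f \,\mathrm{d}(\mu - \nu)\Bigr|
  \le \|f\|_{\infty}\,\|\mu - \nu\|_{\mathrm{TV}}
  \le \|f\|_{\infty}\sqrt{2\,\mathrm{KL}(\mu \,\|\, \nu)}
  \le \frac{\lambda}{2}\,\mathrm{KL}(\mu \,\|\, \nu)
      + \frac{\|f\|_{\infty}^{2}}{\lambda},
```

where the second inequality is Pinsker's inequality and the third is Young's inequality $ab \le \tfrac{\varepsilon}{2}a^2 + \tfrac{1}{2\varepsilon}b^2$ with $\varepsilon = \lambda/2$; the KL term is then absorbed into the entropy dissipation.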
4.2. Proof of convergence for the annealed dynamics
Recall that the MNE of the entropy-regularized objective is given by , characterized by the fixed point equations (2.5). Note that we emphasize the dependence of the objective and the optimizer on the temperature since the latter will be time-dependent throughout this section.
It will be useful to define energies as follows.
(4.12) | ||||
It is clear that .
Proof of Theorem 2.3.
Similar to the last section, we only present the proof for the fast ascent regime. The proof for the other regime can be carried out analogously and is provided in Appendix C for completeness. The proof of the entropy decay in (2.13) largely follows the proof of Theorem 2.1 and is also inspired by the proof of [7, Theorem 4.1].
Step 1. Bounding . First, let us compute the time-derivative . Note that since
we have for ,
Moreover,
Combining the last two identities leads to
Consequently, we have
(4.13) | ||||
Next, a similar calculation as above applied to gives
(4.14) | ||||
As a result of (4.14) and (4.13),
(4.15) | ||||
Next, we have from the ascent dynamics for ,
(4.16) | ||||
The third term on the RHS above is
Combining the last two displays yields
(4.17) | ||||
It then follows from (4.15) (4.17) that
where the time-dependent rate satisfies for large that
Note that in the above we have used that is decreasing for large . By choosing for some large , we have
As a result, we have obtained that for any and for large enough,
Define . Then it is straightforward to show that for large enough,
which implies that
(4.18) |
since and is finite. By the definition of and the fact that for any , the last estimate further implies that for any ,
(4.19) |
In addition, using the same arguments as in the proof of the second bound in (2.10) from Theorem 2.2, one can obtain that
Step 2: Bounding . Let us first claim that the difference satisfies
(4.20) |
In fact, by definition,
To bound , let us define and . Then by the optimality of and the boundedness of ,
The same bound holds for after applying a similar argument as above, which completes the proof of (4.20). Finally, applying Lemma 4.2 with , we have for large enough,
(4.21) |
∎
We restate [16, Theorem 5] in the following lemma which is used to control .
Lemma 4.2.
[16, Theorem 5] Let and let be a lower bound on the volume of a ball of radius in and . Let be the solution of (2.5). Then is an -Nash equilibrium of if
In particular, when and are Riemannian manifolds or Euclidean tori of dimensions and respectively so that and , the bounds above become
for some constant depending only on . Alternatively, if is sufficiently small, then is an -Nash equilibrium with
(4.22) |
5. Conclusion and future work
In this paper, we proved the global exponential convergence of the two-scale Mean-Field GDA dynamics for finding the unique MNE of the entropy-regularized game objective on the space of probability measures. We also proved the convergence of the annealed GDA dynamics to the MNE of the unregularized game objective with respect to the Nikaidò-Isoda error. The key ingredients of our proofs are new Lyapunov functions which capture the dissipation of the two-scale GDA dynamics in different scaling regimes.
We would like to mention several open problems and future research directions. First, although we have proved the global convergence of the Mean-Field GDA in the fast ascent/descent regime (with a finite but large/small time-scale ratio), it remains an open question to show the convergence or non-convergence of the Mean-Field GDA in the intermediate time-scale regime, including in particular the case where no time-scale separation occurs (i.e. in (1.6)). Also, our results currently only hold on bounded state spaces, and the convergence rate depends on the uniform bounds of the potential function on the state spaces. It remains an important question to extend the results to minmax optimization on unbounded state spaces. In practice, the Mean-Field GDA needs to be implemented by an interacting particle algorithm such as (1.5) with a large number of particles. Existing results on the mean-field limit of (1.5) (e.g. [16, Theorem 3]) only hold on finite time intervals, and the error bound can potentially grow exponentially in time. To obtain a quantitative error analysis of the GDA for minmax optimization, it is an interesting open question to derive a uniform-in-time quantitative error bound between the particle system and the mean-field dynamics. Finally, we anticipate that our results can be exploited to prove theoretical convergence guarantees for a variety of minmax learning problems, including the training of GANs [20], adversarial learning problems [34], dual training of energy-based models [10, 14] and weak adversarial networks for PDEs [57].
Acknowledgments
The author thanks Lexing Ying for suggesting the question and Jianfeng Lu, Yiping Lu and Chao Ma for helpful discussions at the early stage of the problem setup. The author also thanks Fei Cao and Lénaïc Chizat for valuable feedback on an early version of the paper. This work is supported by the National Science Foundation through award DMS-2107934.
Appendix
Appendix A Proofs of preliminary lemmas
Proof of Lemma 4.1.
First we have from the definition of and (2.4) that
which proves (4.1). To see (4.2), notice again from (2.4) that
Next, we prove (4.3). In fact, replacing by in the last display leads to
which proves the lower bound part of (4.3). To prove the upper bound part, we first compute the first-variation of the log partition function as
(A.1) |
Then an application of the convexity of shown in Lemma A.1 yields
Consequently, one obtains that
Finally, the inequality (4.4) follows from the same reasoning above by exploiting the convexity of .
∎
The following lemma establishes the convexity of the log partition functions and . It was proved in [33, Proposition 3], but we include its proof here for completeness.
Lemma A.1.
Both the functional and the functional are convex.
Proof.
We only present the proof for the convexity of the map since the other case can be proved in the same manner. For any and ,
∎
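The convexity proved above admits a quick numerical sanity check: on a finite grid with the uniform reference measure, the log partition functional reduces to a scaled log-sum-exp of an affine function of the measure, and one can test midpoint convexity directly. The grid size, inverse temperature and random measures below are arbitrary illustrative choices.

```python
import numpy as np

def log_partition(V, nu, beta=1.0):
    """Grid analogue of the log partition functional:
    F(nu) = beta^{-1} * log sum_x exp(beta * (V @ nu)_x),
    computed stably (uniform reference measure, normalization dropped).
    F is convex in nu as a log-sum-exp of an affine map."""
    u = beta * (V @ nu)
    return (u.max() + np.log(np.exp(u - u.max()).sum())) / beta

# midpoint-convexity check for random grid measures
rng = np.random.default_rng(0)
V = rng.standard_normal((6, 5))
nu1, nu2 = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
mid = log_partition(V, 0.5 * (nu1 + nu2))
assert mid <= 0.5 * (log_partition(V, nu1) + log_partition(V, nu2)) + 1e-12
```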
Proof of Proposition 4.1.
First let us compute the functional gradient In fact, thanks to the fact that and (4.12), one has that
where in the second identity we have used (A.1). Therefore it follows from (1.6) and the Cauchy-Schwarz inequality that
Furthermore, using the fact that
one derives that
(A.2) |
Moreover, using the mean value theorem, one has for any that
Inserting the last identity into (A.2) leads to
(A.3) | ||||
where we have used the Pinsker’s inequality in the last inequality above. Combining the last two estimates proves (4.5).
Next we proceed with the proof of (4.6). Recall from (4.1) that
Hence
(A.4) |
Using the second equation of (1.6) that solves, one sees that the first term on the right side above becomes
(A.5) |
To compute the second term on the right side of (A.4), observe that
As a result of the above and (A.1), one obtains that
where we have used (A.3) in the last inequality above. The estimate (4.6) then follows from above, (A.5) and (A.4). This completes the proof of the proposition. ∎
Proof of Lemma 4.2.
The proof of the lemma can be found in [16, Appendix C.4]. We would like to elaborate on the proof of (4.22). In fact, by tracking the proof of [16, Theorem 5], one sees that is an -Nash equilibrium provided that
where we recall that . Thanks to the lower bound on the volume of small balls in and , the inequality above holds if
for some depending only on . Now with the choice (4.22) for with , one has for sufficiently small that
∎
Appendix B Proofs of convergence for Mean-Field GDA in the fast descent regime
Let us first state a proposition which bounds the time-derivative of and .
Proposition B.1.
Proof.
The proof follows essentially the same reasoning as the proof of Proposition 4.1. In fact, it can be shown by a straightforward calculation that
Similar to (A.3), one can obtain that
Combining the last two inequalities yields (B.1).
As for time-derivative of , one has that
The second term on the right side above is bounded by
The estimate (B.2) follows directly from the last two inequalities. ∎
Proof of Part (ii) of Theorem 2.1.
Proof of Part (ii) of Theorem 2.2.
Thanks to Part (ii) of Theorem 2.1 and the bound (4.4), we have that
(B.3) |
and that
(B.4) |
Next to obtain the exponential decay of , observe that
(B.5) | ||||
Since , we can write
Hence the second term on the RHS of the last line of (B.5) can be bounded in a similar manner to (4.11). Namely,
Combining the last inequality with (B.5) leads to
where we have used (B.3) and (B.4) in the last inequality. ∎
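The exponential decay asserted in the proof above is extracted by the standard Grönwall argument; schematically, in generic notation (the Lyapunov functional L and rate c are placeholders, not the paper's symbols):

```latex
% Gronwall's lemma (differential form): if a nonnegative function L satisfies
\frac{d}{dt} L(t) \;\le\; -c\, L(t) \qquad \text{for some } c > 0,
% then integrating the differential inequality yields exponential decay:
L(t) \;\le\; L(0)\, e^{-c t}, \qquad t \ge 0 .
```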
Appendix C Proof of convergence for the annealed dynamics in the fast descent regime
Proof of Part (ii) of Theorem 2.3.
By a direct calculation, one has
(C.1)
and
(C.2)
Combining the last two displays, one can obtain the time-derivative of as follows
where
Setting for some in the above yields that for every and large enough
Therefore we have obtained that
Similar to the proof of (4.18), one can obtain from above that for large enough
This directly implies the entropy decay bounds in (2.15). Finally, the estimate (2.16) follows from (2.15) and the arguments used in Step 2 of the proof of Theorem 2.3, Part (i). ∎
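The annealing mechanism in this proof follows the classical simulated-annealing template; in generic (hypothetical) notation, a logarithmic cooling schedule of the following form lets the cooling proceed slowly enough for the dynamics to track the shifting Gibbs equilibrium:

```latex
% A logarithmic cooling schedule: the temperature decays like 1/\log t,
% equivalently the inverse temperature grows logarithmically,
\tau_t = \frac{c}{\log(e + t)},
\qquad
\beta_t = \tau_t^{-1} = \frac{\log(e + t)}{c},
% so the Gibbs measures \pi_{\beta_t} \propto e^{-\beta_t V} concentrate on
% the minimizers of V as t \to \infty, while \beta_t grows slowly enough
% that the dynamics can keep pace with the moving equilibrium.
```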
References
- [1] Jacob Abernethy, Kevin A Lai, and Andre Wibisono. Last-iterate convergence rates for min-max optimization. arXiv preprint arXiv:1906.02027, 2019.
- [2] Dyego Araújo, Roberto I Oliveira, and Daniel Yukimura. A mean-field limit for certain deep neural networks. arXiv preprint arXiv:1906.00193, 2019.
- [3] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
- [4] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, 1993.
- [5] Yann Brenier and Dmitry Vorotnikov. On optimal transport of matrix-valued measures. SIAM Journal on Mathematical Analysis, 52(3):2849–2873, 2020.
- [6] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.
- [7] Lénaïc Chizat. Mean-field Langevin dynamics: Exponential convergence and annealing. arXiv preprint arXiv:2202.01009, 2022.
- [8] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.
- [9] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. An interpolating distance between optimal transport and Fisher–Rao metrics. Foundations of Computational Mathematics, 18(1):1–44, 2018.
- [10] Bo Dai, Zhen Liu, Hanjun Dai, Niao He, Arthur Gretton, Le Song, and Dale Schuurmans. Exponential family estimation via adversarial dynamics embedding. Advances in Neural Information Processing Systems, 32, 2019.
- [11] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. Advances in neural information processing systems, 31, 2018.
- [12] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. 10th Innovations in Theoretical Computer Science, 2019.
- [13] Thinh Doan. Convergence rates of two-time-scale gradient descent-ascent dynamics for solving nonconvex min-max problems. In Learning for Dynamics and Control Conference, pages 192–206. PMLR, 2022.
- [14] Carles Domingo-Enrich, Alberto Bietti, Marylou Gabrié, Joan Bruna, and Eric Vanden-Eijnden. Dual training of energy-based models with overparametrized shallow neural networks. arXiv preprint arXiv:2107.05134, 2021.
- [15] Carles Domingo-Enrich and Joan Bruna. Simultaneous transport evolution for minimax equilibria on measures. arXiv preprint arXiv:2202.06460, 2022.
- [16] Carles Domingo-Enrich, Samy Jelassi, Arthur Mensch, Grant Rotskoff, and Joan Bruna. A mean-field analysis of two-player zero-sum games. Advances in neural information processing systems, 33:20215–20226, 2020.
- [17] Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Quantitative Harris-type theorems for diffusions and McKean–Vlasov processes. Transactions of the American Mathematical Society, 371(10):7135–7173, 2019.
- [18] Tadahisa Funaki. A certain class of diffusion processes associated with nonlinear parabolic equations. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 67(3):331–348, 1984.
- [19] Irving L Glicksberg. A further generalization of the Kakutani fixed point theorem, with application to Nash equilibrium points. Proceedings of the American Mathematical Society, 3(1):170–174, 1952.
- [20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
- [21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in neural information processing systems, 30, 2017.
- [22] Ya-Ping Hsieh, Chen Liu, and Volkan Cevher. Finding mixed Nash equilibria of generative adversarial networks. In International Conference on Machine Learning, pages 2810–2819. PMLR, 2019.
- [23] Kaitong Hu, Zhenjie Ren, David Šiška, and Łukasz Szpruch. Mean-field Langevin dynamics and energy landscape of neural networks. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 57, pages 2043–2065. Institut Henri Poincaré, 2021.
- [24] Chi Jin, Praneeth Netrapalli, and Michael Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International conference on machine learning, pages 4880–4889. PMLR, 2020.
- [25] Stanislav Kondratyev, Léonard Monsaingeon, and Dmitry Vorotnikov. A new optimal transport distance on the space of finite radon measures. Advances in Differential Equations, 21(11/12):1117–1164, 2016.
- [26] Stanislav Kondratyev and Dmitry Vorotnikov. Spherical Hellinger–Kantorovich gradient flows. SIAM Journal on Mathematical Analysis, 51(3):2053–2084, 2019.
- [27] Vaios Laschos and Alexander Mielke. Geometric properties of cones with applications on the Hellinger–Kantorovich space, and a new distance on the space of probability measures. Journal of Functional Analysis, 276(11):3529–3576, 2019.
- [28] Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pages 6083–6093. PMLR, 2020.
- [29] Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, and Lexing Ying. A mean field analysis of deep ResNet and beyond: Towards provably optimization via overparameterization from depth. In International Conference on Machine Learning, pages 6426–6436. PMLR, 2020.
- [30] Yulong Lu, Jianfeng Lu, and James Nolen. Accelerating Langevin sampling with birth-death. arXiv preprint arXiv:1905.09863, 2019.
- [31] Yulong Lu, Dejan Slepčev, and Lihan Wang. Birth-death dynamics for sampling: Global convergence, approximations and their asymptotics. arXiv preprint arXiv:2211.00450, 2022.
- [32] Chao Ma, Lei Wu, et al. The barron space and the flow-induced function spaces for neural network models. Constructive Approximation, 55(1):369–406, 2022.
- [33] Chao Ma and Lexing Ying. Provably convergent quasistatic dynamics for mean-field two-player zero-sum games. In International Conference on Learning Representations, 2021.
- [34] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
- [35] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
- [36] Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018.
- [37] Andrew Ronald Mitchell and David Francis Griffiths. The finite difference method in partial differential equations. A Wiley-Interscience Publication, 1980.
- [38] John Nash. Non-cooperative games. Annals of Mathematics, 54:286–295, 1951.
- [39] Phan-Minh Nguyen. Mean field limit of the learning dynamics of multilayer neural networks. arXiv preprint arXiv:1902.02880, 2019.
- [40] Hukukane Nikaido. On von Neumann’s minimax theorem. Pacific Journal of Mathematics, 4(1):65–72, 1954.
- [41] Hukukane Nikaido and Kazuo Isoda. Note on non-cooperative convex games. Pacific Journal of Mathematics, 5(S1):807–815, 1955.
- [42] Atsushi Nitanda, Denny Wu, and Taiji Suzuki. Convex analysis of the mean field Langevin dynamics. In International Conference on Artificial Intelligence and Statistics, pages 9741–9757. PMLR, 2022.
- [43] Georgii Ivanovich Petrov. Application of the method of Galerkin to a problem involving the stationary flow of a viscous fluid. Prikl. Matem. Mekh., 4(3):3–12, 1940.
- [44] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703. PMLR, 2017.
- [45] Grant Rotskoff, Samy Jelassi, Joan Bruna, and Eric Vanden-Eijnden. Global convergence of neuron birth-death dynamics. 2019.
- [46] Grant Rotskoff and Eric Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach. Communications on Pure and Applied Mathematics, 75(9):1889–1935, 2022.
- [47] Maurice Sion. On general minimax theorems. Pacific Journal of Mathematics, 8(1):171–176, 1958.
- [48] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725–752, 2020.
- [49] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of deep neural networks. Mathematics of Operations Research, 47(1):120–152, 2022.
- [50] Alain-Sol Sznitman. Topics in propagation of chaos. In École d’Été de Probabilités de Saint-Flour XIX–1989, pages 165–251. Springer, 1991.
- [51] Wenpin Tang and Xun Yu Zhou. Simulated annealing from continuum to discretization: a convergence analysis via the Eyring–Kramers law. arXiv preprint arXiv:2102.02339, 2021.
- [52] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
- [53] Guillaume Wang and Lénaïc Chizat. An exponentially converging particle method for the mixed Nash equilibrium of continuous games. arXiv preprint arXiv:2211.01280, 2022.
- [54] Stephan Wojtowytsch et al. On the Banach spaces associated with multi-layer ReLU networks: Function representation, approximation theory and gradient descent dynamics. arXiv preprint arXiv:2007.15623, 2020.
- [55] Junchi Yang, Negar Kiyavash, and Niao He. Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 33:1153–1165, 2020.
- [56] Junchi Yang, Antonio Orvieto, Aurelien Lucchi, and Niao He. Faster single-loop algorithms for minimax optimization without strong concavity. In International Conference on Artificial Intelligence and Statistics, pages 5485–5517. PMLR, 2022.
- [57] Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411:109409, 2020.