
Two-Scale Gradient Descent Ascent Dynamics Finds Mixed Nash Equilibria of Continuous Games: A Mean-Field Perspective

Yulong Lu (YL) Department of Mathematics and Statistics, Lederle Graduate Research Tower, University of Massachusetts, 710 N. Pleasant Street, Amherst, MA 01003. [email protected]
Abstract.

Finding the mixed Nash equilibria (MNE) of a two-player zero-sum continuous game is an important and challenging problem in machine learning. A canonical algorithm for finding the MNE is the noisy gradient descent ascent method, which in the infinite-particle limit gives rise to the Mean-Field Gradient Descent Ascent (GDA) dynamics on the space of probability measures. In this paper, we first study the convergence of the two-scale Mean-Field GDA dynamics for finding the MNE of the entropy-regularized objective. More precisely, we show that for each finite temperature (or regularization parameter), the two-scale Mean-Field GDA with a suitable finite scale ratio converges exponentially to the unique MNE without assuming convexity or concavity of the interaction potential. The key ingredient of our proof lies in the construction of new Lyapunov functions that dissipate exponentially along the Mean-Field GDA. We further study the simulated annealing of the Mean-Field GDA dynamics. We show that with a temperature schedule that decays logarithmically in time, the annealed Mean-Field GDA converges to the MNE of the original unregularized objective.

Keywords. Nash equilibrium, mixed Nash equilibrium, gradient descent ascent, mean field, simulated annealing.

1. Introduction

Minmax learning underpins numerous machine learning methods, including Generative Adversarial Networks (GANs) [20], adversarial training [34] and reinforcement learning [6]. Minmax learning is often formulated as a zero-sum game between a min player and a max player and hence can be cast as a minmax optimization problem of the form

\min_{x\in{\mathcal{X}}}\max_{y\in{\mathcal{Y}}}K(x,y),

where the function K is the game objective and x and y represent the player strategies. When K is nonconvex in x or nonconcave in y, finding the pure Nash equilibria [38] of K is difficult and sometimes impossible, since a pure Nash equilibrium may not even exist. On the other hand, the mixed Nash equilibria (MNE) [19], where pure strategies are replaced by mixed strategies modeled by probability distributions over the sets of strategies, do exist for much more general objective functions. Formally, the mixed Nash equilibria consist of pairs of probability distributions μ and ν that solve

(1.1) \min_{\mu\in{\mathcal{P}}({\mathcal{X}})}\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E_{0}(\mu,\nu),\text{ where }E_{0}(\mu,\nu):=\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}K(x,y)\mu(dx)\nu(dy).

Here 𝒫(𝒳) and 𝒫(𝒴) denote the spaces of probability measures (or the sets of all mixed strategies) on the state spaces 𝒳 and 𝒴 respectively. Thanks to Glicksberg’s theorem [19], an MNE exists if 𝒳 and 𝒴 are finite or if 𝒳 and 𝒴 are compact and K is continuous. While MNE exist in great generality, it is still difficult to find them. Considerable progress has been made recently on finding MNE of high-dimensional game problems with applications in GANs. For instance, the work [22] proposed a mirror-descent algorithm for finding MNE of (1.1) and applied it to Wasserstein GANs with provable convergence guarantees. The recent works [16, 33] proposed and analyzed an entropy-regularized version of (1.1):

(1.2) \min_{\mu\in{\mathcal{P}}({\mathcal{X}})}\max_{\nu\in{\mathcal{P}}(\mathcal{Y})}E_{\tau}(\mu,\nu),\text{ where }E_{\tau}(\mu,\nu):=\int_{{\mathcal{X}}}\int_{\mathcal{Y}}K(x,y)d\nu(y)d\mu(x)-\tau{\mathcal{H}}(\mu)+\tau{\mathcal{H}}(\nu).

Here ℋ(μ) = −∫ log(dμ/dx) dμ is the entropy functional of the probability measure μ, and the parameter τ > 0 is the entropy regularization parameter (or temperature). Observe that the objective functional E_τ is strongly convex in μ and strongly concave in ν. As a result of the famous von Neumann minmax theorem [52, 47, 41, 40], one has that

(1.3) \min_{\mu\in{\mathcal{P}}({\mathcal{X}})}\max_{\nu\in{\mathcal{P}}(\mathcal{Y})}E_{\tau}(\mu,\nu)=\max_{\nu\in{\mathcal{P}}(\mathcal{Y})}\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}E_{\tau}(\mu,\nu).

Under very mild assumptions on KK, Problem (1.2) has a unique MNE (μ,ν)(\mu^{\ast},\nu^{\ast}) (see a proof in [16]) in the sense that for any μ𝒫(𝒳),ν𝒫(𝒴)\mu\in{\mathcal{P}}({\mathcal{X}}),\nu\in{\mathcal{P}}({\mathcal{Y}}),

E_{\tau}(\mu^{\ast},\nu)\leq E_{\tau}(\mu^{\ast},\nu^{\ast})\leq E_{\tau}(\mu,\nu^{\ast}).

Furthermore, the MNE (μ,ν)(\mu^{\ast},\nu^{\ast}) is given by the unique solution of the fixed point equations

(1.4) \mu^{\ast}(dx)\propto\exp\Big{(}-\int_{{\mathcal{Y}}}\tau^{-1}K(x,y)d\nu^{\ast}(y)\Big{)},\quad\nu^{\ast}(dy)\propto\exp\Big{(}\int_{{\mathcal{X}}}\tau^{-1}K(x,y)d\mu^{\ast}(x)\Big{)}.

In the setting where both players play finite mixtures of N strategies, i.e. μ = N^{-1}∑_{j=1}^{N} δ_{X^j} and ν = N^{-1}∑_{j=1}^{N} δ_{Y^j}, the following noisy gradient descent-ascent dynamics is perhaps one of the most natural algorithms for finding the MNE of (1.2):

(1.5) dX^{i}_{t}=-\frac{1}{N}\sum_{j=1}^{N}\nabla_{x}K(X^{i}_{t},Y^{j}_{t})dt+\sqrt{2\tau}dW^{i}_{t},\quad dY^{i}_{t}=\frac{\eta}{N}\sum_{j=1}^{N}\nabla_{y}K(X^{j}_{t},Y^{i}_{t})dt+\sqrt{2\tau\eta}dB^{i}_{t},

where W^i_t, B^i_t, i = 1, ..., N are independent Brownian motions. The parameter η > 0 in (1.5) represents the ratio of the timescales at which the gradient descent and ascent dynamics evolve. When η = 1, there is no timescale separation. However, GDA with different time scales is commonly used in minmax optimization [24, 28] and sometimes leads to better convergence properties [21].
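
To make the two-scale dynamics (1.5) concrete, the following is a minimal Euler-Maruyama sketch of the particle system on one-dimensional tori 𝒳 = 𝒴 = [0, 2π); the toy potential K(x, y) = sin(x − y), the step size, the particle number and the scale ratio are illustrative choices and not part of the paper.

```python
import numpy as np

def noisy_gda(grad_x_K, grad_y_K, N=512, tau=0.1, eta=5.0, dt=1e-3, steps=20000, seed=0):
    """Euler-Maruyama discretization of the two-scale noisy GDA (1.5) on the torus [0, 2*pi)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 2 * np.pi, N)   # strategies of the descent (min) player
    Y = rng.uniform(0, 2 * np.pi, N)   # strategies of the ascent (max) player
    for _ in range(steps):
        # mean-field forces: empirical averages over the opponent's particles
        Fx = grad_x_K(X[:, None], Y[None, :]).mean(axis=1)   # (1/N) sum_j dK/dx(X_i, Y_j)
        Fy = grad_y_K(X[None, :], Y[:, None]).mean(axis=1)   # (1/N) sum_j dK/dy(X_j, Y_i)
        X = (X - dt * Fx + np.sqrt(2 * tau * dt) * rng.standard_normal(N)) % (2 * np.pi)
        Y = (Y + eta * dt * Fy + np.sqrt(2 * tau * eta * dt) * rng.standard_normal(N)) % (2 * np.pi)
    return X, Y

# toy example: K(x, y) = sin(x - y); its exact partial derivatives are passed below
X, Y = noisy_gda(lambda x, y: np.cos(x - y), lambda x, y: -np.cos(x - y))
```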

In the limit of a large number of strategies (N → ∞), the empirical measures μ^N_t = N^{-1}∑_{i=1}^{N} δ_{X^i_t} and ν^N_t = N^{-1}∑_{i=1}^{N} δ_{Y^i_t} of the interacting particles, with {X^i_0} and {Y^i_0} sampled independently from the initial distributions μ_0 and ν_0, converge weakly on finite time intervals to the solutions μ_t and ν_t of the Mean-Field GDA dynamics (also named the Interacting Wasserstein Gradient Flow in [16]):

(1.6) \partial_{t}\mu_{t}=\nabla\cdot\big(\mu_{t}\int_{{\mathcal{Y}}}\nabla_{x}K(x,y)d\nu_{t}(y)\big)+\tau\Delta\mu_{t},
\partial_{t}\nu_{t}=\eta\Big(-\nabla\cdot\big(\nu_{t}\int_{{\mathcal{X}}}\nabla_{y}K(x,y)d\mu_{t}(x)\big)+\tau\Delta\nu_{t}\Big).

The dynamics (1.6) can be viewed as the minmax analogue of the Mean-Field Langevin [42, 7, 23] dynamics that arises naturally from the mean field analysis of optimization of two-layer neural networks.

Despite its simplicity and popularity, the long-time convergence of the Mean-Field GDA (1.6) is still not well understood, except in the extreme quasi-static regime [33] where the ascent dynamics is infinitely faster or slower than the descent dynamics (η = 0 or +∞). This motivates us to study the first question, which to the best of our knowledge remains open:

Question 1: Does the Mean-Field GDA (1.6) with a finite scale ratio η > 0 converge to the unique MNE? If so, what is the rate of convergence?

We provide an affirmative answer to Question 1 and establish quantitative exponential convergence of (1.6) to the MNE for any fixed τ > 0. Furthermore, motivated by the recent simulated annealing results for Langevin dynamics [51, 44] in the context of global optimization, and for Mean-Field Langevin dynamics [7] in the setting of training two-layer neural networks, it is natural to ask what happens to the dynamics (1.6) in the annealed regime where τ → 0 as t → ∞. In particular, it is natural to ask

Question 2: Does the annealed Mean-Field GDA (1.6) with a decreasing temperature τ = τ_t converge to an MNE of the unregularized objective E_0 defined in (1.1)?

Summary of contributions.

In this work we first address Question 1 by providing the first convergence analysis of the two-scale continuous-time Mean-Field GDA dynamics (1.6) with a finite scale ratio. This improves substantially on the earlier convergence results of [33, 15] for the Mean-Field GDA in the quasistatic setting, where the scale ratio either vanishes or explodes. The key ingredient of the proof is the construction of a novel Lyapunov function which decreases exponentially along the dynamics (1.6). We then address Question 2 by proving that the annealed Mean-Field GDA converges to a global MNE of the original unregularized objective E_0. The latter result is, to the best of our knowledge, the first rigorous justification of the global convergence of GDA to an MNE in the mean field regime.

We highlight the major contributions as follows.

  • For any fixed temperature τ > 0, we show that in the fast ascent/descent regime (i.e. when the scale ratio η is larger or smaller than a certain threshold), the Mean-Field GDA dynamics (1.6) converges exponentially to the MNE of the entropy-regularized objective E_τ with respect to certain Lyapunov functions; see Theorem 2.1.

  • The convergence in Lyapunov functions further implies global exponential convergence of the dynamics with respect to the relative entropy (see Theorem 2.2). Our (non-asymptotic) convergence results hold for any positive temperature (regularization parameter) and do not assume convexity or concavity of the interaction potential K. The convergence rate is characterized explicitly in terms of τ, bounds on K and the diameters of the state spaces.

  • We also study the simulated annealing of the Mean-Field GDA dynamics (1.6). We prove that with a cooling schedule τ_t decaying at a logarithmic rate, the Mean-Field GDA dynamics (1.6) converges to the global mixed Nash equilibrium of E_0 with respect to the Nikaidò-Isoda error (defined in (1.7)). See Theorem 2.3 for the precise statements.

1.1. Related work.

Minmax optimization

Most previous works on minmax optimization focus on the analysis of discrete-time optimization schemes in finite dimensional Euclidean spaces under various assumptions on the objective function. In the convex-concave setting, global convergence guarantees were obtained for various optimization schemes, such as GDA [12], mirror descent [36] and Hamiltonian gradient descent [1]. In the non-convex and/or non-concave setting, local convergence to several notions of stationary points was studied in [11, 28]. In the special case where the objective satisfies a two-sided Polyak-Łojasiewicz (PL) condition, convergence to global Nash equilibria was recently obtained in [55, 56, 13] for two-scale GDA.

Mean-field analysis

Our study of the Mean-Field GDA dynamics for minmax optimization is strongly motivated by a growing body of work on the mean-field analysis of gradient descent algorithms for training neural networks. The latter ranges from work on two-layer networks [35, 48, 46, 8] and residual networks [29, 54] to general deep neural networks [2, 49, 39]. These results show that the mean field dynamics can be realized as Wasserstein gradient flows of certain nonlinear energy functionals on the space of probability measures. Global convergence of the Wasserstein gradient flows was obtained for the mean-field Langevin dynamics in [35, 7, 42, 23] and for the noiseless mean-field dynamics in [8]. We note that the mean-field Langevin dynamics can be viewed as a special McKean-Vlasov dynamics whose long-time convergence is often established by assuming either a small interaction strength or a large noise level; see e.g. [17, 23].

Among the existing work on mean field Langevin dynamics, [7] is closest to ours: there the author proved the exponential convergence of the Mean-Field Langevin dynamics under a certain uniform log-Sobolev inequality assumption, and also proved the convergence of the annealed Mean-Field Langevin dynamics to the global minimizers of the objective function. Our results are parallel to those of [7], but require new proof strategies. A key ingredient of our proof is the construction of new Lyapunov functions that decay exponentially along the Mean-Field GDA (1.6).

In the context of minmax optimization, [16] first proposed and analyzed the noisy gradient descent ascent flow (1.5) without scale separation (η = 1) for finding the MNE. Specifically, the authors therein proved the convergence of (1.5) to the mean field GDA (1.6) in the limit of a large number of particles, and characterized the equilibrium of (1.6) as the unique solution of the fixed point equation (1.4). However, [16] did not provide a proof of the long-time convergence of (1.6). In [33], the authors obtained convergence guarantees for the mean field GDA (1.6) in the quasi-static regime where the ascent dynamics is infinitely faster or slower than the descent dynamics (i.e. η = +∞ or 0). One of the major contributions of the present work is to go beyond the quasi-static regime and provide the first proof of exponential convergence of (1.6) with a finite scale ratio.

Wasserstein-Fisher-Rao gradient flows

As an alternative to Wasserstein gradient flows, gradient flows with respect to the Wasserstein-Fisher-Rao (WFR) metric [9, 25, 27] have recently sparked a large amount of research on PDEs [5, 26], neural network training [45] and statistical sampling [30, 31]. In particular, WFR gradient flows give rise to birth-death interacting particle dynamics that enable “teleportation” of the mass and locations of particles and hence lead to better convergence properties than the Langevin dynamics in certain scenarios [45, 30, 31]. More recently, [16] adopted a WFR gradient flow for two-player zero-sum continuous games and proved that the time average of the WFR dynamics converges to the MNE in the regime where the Fisher-Rao dynamics dominates its Wasserstein counterpart. In [53], a new particle algorithm motivated by the implicit time-discretization of the WFR gradient flow was proposed for finding the MNE in the space of atomic measures. The authors proved local exponential convergence of the proposed algorithm under certain non-degeneracy assumptions on the game objective and the assumption that the MNE is unique.

1.2. Notations.

For a measurable space 𝒳, we use 𝒫(𝒳) to denote the space of probability measures on 𝒳. Given a canonical Borel measure dx on 𝒳, such as the uniform measure considered in this paper, if μ is absolutely continuous with respect to dx, we denote the Radon-Nikodym derivative by dμ/dx. We define the Shannon entropy of μ by ℋ(μ) = −∫_𝒳 log(dμ/dx) dμ(x). The relative entropy (or Kullback-Leibler divergence) and the relative Fisher information between two probability measures μ_1 and μ_2 are defined by

{\mathcal{H}}(\mu_{1}|\mu_{2})=\int\log\Big{(}\frac{d\mu_{1}}{d\mu_{2}}\Big{)}d\mu_{1},\quad{\mathcal{I}}(\mu_{1}|\mu_{2})=\int\Big{|}\nabla\log\Big{(}\frac{d\mu_{1}}{d\mu_{2}}\Big{)}\Big{|}^{2}d\mu_{1}

if μ_1 is absolutely continuous with respect to μ_2, and +∞ otherwise. Given an approximate mixed Nash equilibrium (μ, ν), we define the Nikaidò-Isoda [41] error NI(μ, ν) by

(1.7) {\text{NI}}(\mu,\nu):=\sup_{\nu^{\prime}\in{\mathcal{P}}({\mathcal{Y}})}E_{0}(\mu,\nu^{\prime})-\inf_{\mu^{\prime}\in{\mathcal{P}}({\mathcal{X}})}E_{0}(\mu^{\prime},\nu)
=E_{0}(\mu,\nu)-\inf_{\mu^{\prime}\in{\mathcal{P}}({\mathcal{X}})}E_{0}(\mu^{\prime},\nu)+\sup_{\nu^{\prime}\in{\mathcal{P}}({\mathcal{Y}})}E_{0}(\mu,\nu^{\prime})-E_{0}(\mu,\nu).

Note that by definition NI(μ, ν) ≥ 0 for any μ ∈ 𝒫(𝒳) and ν ∈ 𝒫(𝒴), and NI(μ, ν) = 0 if and only if (μ, ν) = (μ*, ν*).
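
On low-dimensional state spaces the Nikaidò-Isoda error can be evaluated directly, since the inner optimizations in (1.7) are attained at Dirac masses placed at a maximizer of ∫K(x, ·)dμ(x) and a minimizer of ∫K(·, y)dν(y). The sketch below, with an illustrative potential and grid sizes, is only meant to make this concrete.

```python
import numpy as np

def ni_error(K, mu, nu, xs, ys):
    """Nikaido-Isoda error (1.7) for mixed strategies given as weights mu, nu on grids xs, ys."""
    Kmat = K(xs[:, None], ys[None, :])     # matrix of payoffs K(x_i, y_j)
    sup_term = (mu @ Kmat).max()           # sup_{nu'} E_0(mu, nu') = max over y of the mu-averaged payoff
    inf_term = (Kmat @ nu).min()           # inf_{mu'} E_0(mu', nu) = min over x of the nu-averaged payoff
    return sup_term - inf_term

# uniform mixed strategies on the torus form an MNE of K(x, y) = sin(x - y), so the error is ~0
n = 200
xs = np.linspace(0, 2 * np.pi, n, endpoint=False)
ys = np.linspace(0, 2 * np.pi, n, endpoint=False)
mu = np.full(n, 1.0 / n)
nu = np.full(n, 1.0 / n)
print(ni_error(lambda x, y: np.sin(x - y), mu, nu, xs, ys))
```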

1.3. Organization

The remainder of the paper is organized as follows. In Section 2 we state the key assumptions and present the main results of the paper. In Section 3 we discuss applications of our results to the training of GANs and the adversarial learning of PDEs. Section 4 is devoted to the proofs of the main results. We finish with several concluding remarks and future directions in Section 5. Further proof details are provided in the Appendix.

2. Assumptions and Main Results

Let us first recall the minmax problem of the entropy-regularized continuous game:

(2.1) \min_{\mu\in{\mathcal{P}}({\mathcal{X}})}\max_{\nu\in{\mathcal{P}}(\mathcal{Y})}E_{\tau}(\mu,\nu),\text{ where }E_{\tau}(\mu,\nu):=\int_{{\mathcal{X}}}\int_{\mathcal{Y}}K(x,y)d\nu(y)d\mu(x)-\tau{\mathcal{H}}(\mu)+\tau{\mathcal{H}}(\nu).

When the regularization parameter (or temperature) τ=0\tau=0, we write

E_{0}(\mu,\nu)=\int_{{\mathcal{X}}}\int_{\mathcal{Y}}K(x,y)d\nu(y)d\mu(x).

We are interested in the global convergence of the Mean-Field GDA dynamics to the MNE of (2.1):

\partial_{t}\mu_{t}=\nabla\cdot\big(\mu_{t}\int_{{\mathcal{Y}}}\nabla_{x}K(x,y)d\nu_{t}(y)\big)+\tau\Delta\mu_{t},
\partial_{t}\nu_{t}=\eta\Big(-\nabla\cdot\big(\nu_{t}\int_{{\mathcal{X}}}\nabla_{y}K(x,y)d\mu_{t}(x)\big)+\tau\Delta\nu_{t}\Big).

In what follows, we may suppress the dependence of E_τ on τ and write E = E_τ to simplify notation when τ > 0 is fixed; we will indicate this dependence in the later discussion of the annealed dynamics, where τ = τ_t shrinks to zero in time.

2.1. Assumptions

Following [15] and [53], we make the following assumptions on the state spaces 𝒳{\mathcal{X}} and 𝒴{\mathcal{Y}}.

Assumption 2.1.

The state spaces 𝒳 and 𝒴 are (i) either smooth compact manifolds without boundary such that the eigenvalues of the Ricci curvature are strictly positive everywhere or (ii) Euclidean tori (with possibly unequal dimensions).

We also make the following smoothness assumption on the potential KK.

Assumption 2.2.

The function K ∈ C²(𝒳 × 𝒴) and there exists K_xy > 0 such that

\|\nabla_{xy}^{2}K\|_{\infty}\leq K_{xy}.

It will be very useful to introduce the operators 𝒦⁺ : 𝒫(𝒳) → 𝒫(𝒴) and 𝒦⁻ : 𝒫(𝒴) → 𝒫(𝒳), where

(2.2) {\mathcal{K}}^{+}\mu(dy)=\frac{1}{Z^{+}(\mu)}\exp\Big{(}\int_{{\mathcal{X}}}\tau^{-1}K(x,y)d\mu(x)\Big{)},\quad{\mathcal{K}}^{-}\nu(dx)=\frac{1}{Z^{-}(\nu)}\exp\Big{(}-\int_{{\mathcal{Y}}}\tau^{-1}K(x,y)d\nu(y)\Big{)}.

Here Z(ν)Z^{-}(\nu) and Z+(μ)Z^{+}(\mu) are the corresponding partition functions defined by

(2.3) Z^{-}(\nu)=\int_{{\mathcal{X}}}\exp\Big{(}-\int_{{\mathcal{Y}}}\tau^{-1}K(x,y)d\nu(y)\Big{)}dx,\quad Z^{+}(\mu)=\int_{{\mathcal{Y}}}\exp\Big{(}\int_{{\mathcal{X}}}\tau^{-1}K(x,y)d\mu(x)\Big{)}dy.

It is easy to verify by the definition of E(μ,ν)E(\mu,\nu) that

(2.4) \operatorname*{arg\,max}_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E(\mu,\nu)={\mathcal{K}}^{+}(\mu),\quad\operatorname*{arg\,min}_{\mu\in{\mathcal{P}}({\mathcal{X}})}E(\mu,\nu)={\mathcal{K}}^{-}(\nu).

With the notation above, the MNE (μ*, ν*) of (2.1) is characterized as the unique solution of the following fixed point equations (see [16, Theorem 1]):

(2.5) \mu^{\ast}={\mathcal{K}}^{-}(\nu^{\ast}),\quad\nu^{\ast}={\mathcal{K}}^{+}(\mu^{\ast}).
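
Since 𝒦⁺ and 𝒦⁻ are explicit Gibbs maps, the fixed point (2.5) can be approximated numerically on low-dimensional grids by a damped fixed-point iteration. The sketch below illustrates this on a one-dimensional torus; the grid discretization, the damping factor and the convergence of this naive iteration are assumptions made for illustration, not results of the paper.

```python
import numpy as np

def regularized_mne(K, xs, ys, tau=0.5, damping=0.5, iters=500):
    """Damped fixed-point iteration mu <- K^-(nu), nu <- K^+(mu) of (2.5) on grids xs, ys."""
    dx, dy = xs[1] - xs[0], ys[1] - ys[0]
    Kmat = K(xs[:, None], ys[None, :])
    mu = np.full(len(xs), 1.0 / (len(xs) * dx))     # densities w.r.t. dx and dy
    nu = np.full(len(ys), 1.0 / (len(ys) * dy))
    for _ in range(iters):
        new_mu = np.exp(-(Kmat @ nu) * dy / tau)    # K^-(nu): Gibbs measure of the nu-averaged payoff
        new_mu /= new_mu.sum() * dx
        new_nu = np.exp((mu @ Kmat) * dx / tau)     # K^+(mu): Gibbs measure of the mu-averaged payoff
        new_nu /= new_nu.sum() * dy
        mu = (1 - damping) * mu + damping * new_mu
        nu = (1 - damping) * nu + damping * new_nu
    return mu, nu
```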

The lemma below shows that under Assumption 2.2 and Assumption 2.1, the measures 𝒦⁺(μ) and 𝒦⁻(ν) satisfy logarithmic Sobolev inequalities (LSI) uniformly for any μ ∈ 𝒫(𝒳) and ν ∈ 𝒫(𝒴).

Lemma 2.1.

There exists a constant λ_LS > 0 such that for any μ ∈ 𝒫(𝒳) and ν ∈ 𝒫(𝒴),

(2.6) \lambda_{LS}{\mathcal{H}}(\mu|{\mathcal{K}}^{-}{\nu})\leq{\mathcal{I}}(\mu|{\mathcal{K}}^{-}{\nu}),\quad\lambda_{LS}{\mathcal{H}}(\nu|{\mathcal{K}}^{+}{\mu})\leq{\mathcal{I}}(\nu|{\mathcal{K}}^{+}{\mu}).

Lemma 2.1 follows from the classical Holley-Stroock argument and the fact that the uniform measures on 𝒳 and 𝒴 satisfy the LSI thanks to Assumption 2.1; see e.g. [7, Proposition 5.2] and [15, Proposition 6]. It is worthwhile to note that in the regime where τ⁻¹‖K‖_∞ ≫ 1, the log-Sobolev constant λ_LS = λ_LS(τ) satisfies

(2.7) \lambda_{LS}(\tau)=\mathcal{O}(\exp(-\tau^{-1}\|K\|_{\infty})).
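
For the reader's convenience, we record the standard perturbation bound behind these facts (only a sketch; the constant in the exponent is not optimized here). Writing 𝒦⁻(ν) as a Gibbs perturbation of the uniform measure π on 𝒳 with potential V(x) = τ⁻¹∫_𝒴 K(x, y)dν(y), so that osc(V) ≤ 2τ⁻¹‖K‖_∞, the Holley-Stroock argument yields, with λ_0 the log-Sobolev constant of π,

\lambda_{LS}(\tau)\geq\lambda_{0}\,e^{-\operatorname{osc}(V)}\geq\lambda_{0}\,e^{-2\tau^{-1}\|K\|_{\infty}},\qquad\text{uniformly in }\nu\in{\mathcal{P}}({\mathcal{Y}}),

and similarly for 𝒦⁺(μ). This is an exponential-in-1/τ lower bound of the form λ_LS(τ) ≥ C_0 e^{−ξ*/τ} assumed in Theorem 2.3 below.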

2.2. Exponential convergence of Mean-Field GDA with a fixed temperature

The goal of this section is to present the global convergence of the GDA dynamics (1.6) to the MNE (μ*, ν*). To this end, it will be useful to define the following functionals:

(2.8) {\mathcal{L}}_{1}(\mu):=\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E(\mu,\nu)-\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E(\mu,\nu),
{\mathcal{L}}_{2}(\mu,\nu):=\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E(\mu,\nu)-E(\mu,\nu),
{\mathcal{L}}_{3}(\nu):=\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}E(\mu,\nu)-\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}E(\mu,\nu),
{\mathcal{L}}_{4}(\mu,\nu):=E(\mu,\nu)-\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}E(\mu,\nu).

Note that by definition the functionals ℒ_i, i = 1, ..., 4 defined above are non-negative. For a fixed constant γ > 0, we also define the Lyapunov functions

(2.9) {\mathcal{L}}(\mu,\nu)={\mathcal{L}}_{1}(\mu)+\gamma{\mathcal{L}}_{2}(\mu,\nu),\quad\widetilde{{\mathcal{L}}}(\mu,\nu)={\mathcal{L}}_{3}(\nu)+\gamma{\mathcal{L}}_{4}(\mu,\nu).

Let κ\kappa be the effective condition number defined by

\kappa=\frac{K_{xy}}{\tau\lambda_{LS}}.

Our first main theorem characterizes the exponential convergence of (1.6) in terms of the Lyapunov functions ℒ and ℒ̃.

Theorem 2.1.

Let Assumption 2.1 and Assumption 2.2 hold. For a fixed γ < 1, let (μ_t, ν_t) be the solution to the GDA dynamics (1.6) with an initial condition (μ_0, ν_0) satisfying ℒ(μ_0, ν_0) < ∞ and ℒ̃(μ_0, ν_0) < ∞, and with a finite time-scale ratio η > 0.

(i) Fast ascent regime: set \eta=\frac{2\lambda_{LS}\kappa^{2}\mathrm{Diam}({\mathcal{Y}})^{2}}{\gamma}. Then for all t > 0,

{\mathcal{L}}(\mu_{t},\nu_{t})\leq e^{-\alpha_{1}t}{\mathcal{L}}(\mu_{0},\nu_{0})

with

\alpha_{1}=\tau\lambda_{LS}\Big{(}\frac{1-\gamma}{2}\wedge\frac{\kappa^{2}\mathrm{Diam}({\mathcal{Y}})^{2}(1+3\gamma)}{\gamma}\Big{)};

(ii) Fast descent regime: set \eta=\frac{\gamma}{2\lambda_{LS}\kappa^{2}\mathrm{Diam}({\mathcal{X}})^{2}(1+3\gamma)}. Then for all t > 0,

\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t})\leq e^{-\alpha_{2}t}\widetilde{{\mathcal{L}}}(\mu_{0},\nu_{0})

with

\alpha_{2}=\frac{\tau\lambda_{LS}}{2}\Big{(}1\wedge\eta(1-\gamma)\Big{)}.
Remark 2.1.

Theorem 2.1 states that the two-scale GDA dynamics (1.6) with a suitable finite scale ratio converges exponentially to the equilibrium. This improves substantially on the earlier result of [33] on GDA in the quasistatic setting, where the scale ratio is η = 0 or η = ∞. We emphasize that we chose the specific scale ratios merely to simplify the expression of the convergence rate α. In fact, by tracking the proof of Theorem 2.1, one can obtain the convergence rate

\alpha=\begin{cases}C_{1}\tau\lambda_{LS},&\text{ if }\eta>c_{1}\lambda_{LS}\kappa^{2}\mathrm{Diam}({\mathcal{Y}})^{2},\\ C_{2}\tau\lambda_{LS}^{2},&\text{ if }\eta<\frac{c_{2}\gamma}{\lambda_{LS}\kappa^{2}\mathrm{Diam}({\mathcal{Y}})^{2}(1+3\gamma)},\end{cases}

for some constants C_i, c_i > 0, i = 1, 2. Moreover, the dependence of the convergence rates on the diameters of the state spaces via K_xy Diam(𝒴) and K_xy Diam(𝒳) can be avoided; in particular, the latter two quantities can be replaced by ‖∇K‖_∞; see the discussion around (A.3).

Observe also that in the low temperature regime (τ ≪ 1), the log-Sobolev constant λ_LS = 𝒪(exp(−τ⁻¹‖K‖_∞)). Hence Theorem 2.1 requires η = Ω((τ⁻¹K_xy)²/λ_LS) (respectively η = o(λ_LS τ²/K_xy²)) to guarantee an exponential convergence rate α = 𝒪(τλ_LS) (respectively α = 𝒪(λ_LS² τ³)) in the fast ascent (respectively fast descent) regime. The quadratic dependence of the convergence rate α on λ_LS in the fast descent regime, in contrast to the linear dependence in the fast ascent regime, is due to the fact that the ascent dynamics itself runs on the slow time scale η = o(λ_LS τ²/K_xy²); one would recover the same rate of convergence as in the fast ascent regime by using the time scale 1/η for the descent dynamics and 1 for the ascent dynamics.

Remark 2.2.

Our construction of the Lyapunov functions (2.9) is strongly motivated by the recent study [55, 13] of two-scale GDA for minmax optimization on Euclidean spaces. In the finite dimensional setting, a two-sided PL condition is sufficient to guarantee the exponential convergence of two-scale GDA in both the continuous-time [13] and discrete-time [55, 56] cases. In our infinite dimensional setting of the Mean-Field GDA, the uniform log-Sobolev inequality plays the same role as the two-sided PL condition; it needs, however, to be combined with the entropy sandwich inequalities in Lemma 4.1 to obtain the exponential dissipation of the Lyapunov functions.

Our next theorem provides an exponential convergence of (1.6) with respect to the relative entropy.

Theorem 2.2.

Let the assumptions of Theorem 2.1 hold. Let the rates α_i, i = 1, 2 be defined as in Theorem 2.1.

(i) Fast ascent regime: set \eta=\frac{2\lambda_{LS}\kappa^{2}\mathrm{Diam}({\mathcal{Y}})^{2}}{\gamma}. Then for all t > 0,

(2.10) \tau{\mathcal{H}}(\mu_{t}|\mu^{\ast})\leq\frac{e^{-\alpha_{1}t}}{\gamma}{\mathcal{L}}(\mu_{0},\nu_{0})\text{ and }\tau{\mathcal{H}}(\nu_{t}|\nu^{\ast})\leq\Big{(}\frac{2}{\gamma}+\frac{4\|K\|_{\infty}^{2}}{\tau^{2}}\Big{)}e^{-\alpha_{1}t}{\mathcal{L}}(\mu_{0},\nu_{0}).

(ii) Fast descent regime: set \eta=\frac{\gamma}{2\lambda_{LS}\kappa^{2}\mathrm{Diam}({\mathcal{X}})^{2}(1+3\gamma)} as in Theorem 2.1(ii). Then for all t > 0,

(2.11) \tau{\mathcal{H}}(\mu_{t}|\mu^{\ast})\leq\frac{e^{-\alpha_{2}t}}{\gamma}{\mathcal{L}}(\mu_{0},\nu_{0})\text{ and }\tau{\mathcal{H}}(\nu_{t}|\nu^{\ast})\leq\Big{(}\frac{2}{\gamma}+\frac{4\|K\|_{\infty}^{2}}{\tau^{2}}\Big{)}e^{-\alpha_{2}t}{\mathcal{L}}(\mu_{0},\nu_{0}).

2.3. Convergence of annealed Mean-Field GDA

In this section, we proceed to presenting the convergence of the “annealed” Mean-Field GDA dynamics

(2.12) \partial_{t}\mu_{t}=\nabla\cdot\left(\mu_{t}\int_{{\mathcal{Y}}}\nabla_{x}K(x,y)d\nu_{t}(y)\right)+\tau_{t}\Delta\mu_{t},
\partial_{t}\nu_{t}=\eta_{t}\left(-\nabla\cdot\left(\nu_{t}\int_{{\mathcal{X}}}\nabla_{y}K(x,y)d\mu_{t}(x)\right)+\tau_{t}\Delta\nu_{t}\right),

where τ_t > 0 is now a time-dependent temperature which shrinks in time. Given any initial condition (μ_0, ν_0) ∈ 𝒫(𝒳) × 𝒫(𝒴), the existence and uniqueness of the global solution to (2.12) follow directly from the classical well-posedness theory for nonlinear McKean-Vlasov-type PDEs [18, 50]. Our goal is to show that, by carefully choosing the cooling schedule τ_t and the time-scale ratio η_t, the solution (μ_t, ν_t) of the annealed dynamics (2.12) converges to the MNE of E_0.
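
As an illustration of the schedules appearing in Theorem 2.3(i) below, the following sketch implements τ_t = ξ/log t and η_t = M t^{ξ*/ξ}/(log t)². The numerical values of ξ, ξ*, M and the burn-in time t0 are placeholders; inside a particle simulation these would replace the constant τ and η of the fixed-temperature sketch in Section 1.

```python
import numpy as np

def cooling_schedules(t, xi=2.0, xi_star=1.0, M=10.0, t0=np.exp(2.0)):
    """Annealing schedules of Theorem 2.3(i): tau_t = xi / log t, eta_t = M t^{xi*/xi} / (log t)^2.

    xi_star plays the role of the constant in the log-Sobolev lower bound and must satisfy
    xi > xi_star; all numerical values here are illustrative.
    """
    t = max(t, t0)                              # keep log t bounded away from zero at early times
    tau_t = xi / np.log(t)
    eta_t = M * t ** (xi_star / xi) / np.log(t) ** 2
    return tau_t, eta_t
```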

Let (μ*_τ, ν*_τ) be the solution of (2.5) corresponding to temperature τ, and recall the Nikaidò-Isoda error defined in (1.7).

Theorem 2.3.

Let Assumption 2.2 and Assumption 2.1 hold. Let (μ_t, ν_t) be the solution to (2.12) with an initial condition (μ_0, ν_0) ∈ 𝒫(𝒳) × 𝒫(𝒴). Assume that the log-Sobolev constant satisfies λ_LS = λ_LS(τ) ≥ C_0 e^{−ξ*/τ} for some ξ*, C_0 > 0.

(i) Fast ascent regime: Assume further that τ_t is smooth, decreasing in t and, for some ξ > ξ*, τ_t = ξ/log t for large values of t. Set \eta_{t}=\frac{M}{(\log t)^{2}}t^{\xi^{\ast}/\xi} for some large M > 0. Then for every 0 < ϵ < 1 − ξ*/ξ, there exist C, C′ > 0 such that for t sufficiently large,

(2.13) {\mathcal{H}}(\mu_{t}|\mu_{\tau_{t}}^{\ast})\leq Ct^{-(1-\xi^{\ast}/\xi-\epsilon)},\qquad{\mathcal{H}}(\nu_{t}|\nu_{\tau_{t}}^{\ast})\leq Ct^{-(1-\xi^{\ast}/\xi-\epsilon)},

and that

(2.14) 0\leq{\text{NI}}(\mu_{t},\nu_{t})\leq\frac{C^{\prime}\log\log t}{\log t}.

(ii) Fast descent regime: Assume further that τ_t is smooth, decreasing in t and, for some ξ > 2ξ*, τ_t = ξ/log t for large values of t. Set \eta_{t}=\frac{\log t}{Mt} for some large M > 0. Then for every 0 < ϵ < 1 − 2ξ*/ξ, there exist C, C′ > 0 such that for t sufficiently large,

(2.15) {\mathcal{H}}(\mu_{t}|\mu_{\tau_{t}}^{\ast})\leq Ct^{-(1-2\xi^{\ast}/\xi-\epsilon)},\qquad{\mathcal{H}}(\nu_{t}|\nu_{\tau_{t}}^{\ast})\leq Ct^{-(1-2\xi^{\ast}/\xi-\epsilon)},

and that

(2.16) 0\leq{\text{NI}}(\mu_{t},\nu_{t})\leq\frac{C^{\prime}\log\log t}{\log t}.

3. Applications

In this section, we discuss briefly applications of our theoretical results in training of GANs and adversarial learning of PDEs.

3.1. Training of GANs

Let μ_m be the empirical measure associated with the i.i.d. samples {x_i}_{i=1}^m ⊂ 𝒳 drawn from a target measure μ ∈ 𝒫(𝒳). Let D_ℱ be an integral probability metric (IPM) on the space of probability measures 𝒫(𝒳), parameterized by a set ℱ of discriminators, i.e.

D_{{\mathcal{F}}}(\mu,\nu):=\sup_{f\in{\mathcal{F}}}\int fd\mu-\int fd\nu.

IPM-based GAN learns an optimal μ\mu that minimizes D(μ,μm)D_{{\mathcal{F}}}(\mu,\mu_{m}) over 𝒫(𝒳){\mathcal{P}}({\mathcal{X}}) :

(3.1) \inf_{\mu\in{\mathcal{P}}({\mathcal{X}})}D_{{\mathcal{F}}}(\mu,\mu_{m})=\inf_{\mu\in{\mathcal{P}}({\mathcal{X}})}\sup_{f\in{\mathcal{F}}}\int fd\mu-\int fd\mu_{m}.

Consider the witness function class ℱ given by the unit ball of the Barron space ℬ, which consists of functions admitting the representation

f(x)=\int_{{\mathcal{Y}}}a\sigma(b\cdot x+c)d\nu(y),

where y = (a, b, c) ∈ 𝒴 and ν is a probability measure on the parameter space 𝒴. Observe that Barron functions arise as the natural infinite-width limit of two-layer neural networks and can be approximated by such networks at a dimension-free rate [3, 32, 4]. When the activation function satisfies sup_x |aσ(b·x + c)| ≤ ϕ(y) for some nonnegative function ϕ and for all y ∈ 𝒴, the Barron norm ‖f‖_ℬ is defined by

(3.2) \|f\|_{{\mathcal{B}}}:=\inf_{\nu}\Big{\{}\int_{{\mathcal{Y}}}\phi(y)\nu(dy)\ \Big{|}\ f(x)=\int_{{\mathcal{Y}}}a\sigma(b\cdot x+c)d\nu(y)\Big{\}}.

Setting ℱ = {f ∈ ℬ | ‖f‖_ℬ ≤ 1} in (3.1) leads to

(3.3) \inf_{\mu\in{\mathcal{P}}({\mathcal{X}})}\sup_{\nu\in{\mathcal{P}}({\mathcal{Y}})}\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}K(x,y)\mu(dx)\nu(dy),\text{ where }K(x,y)=\Sigma(x,y)-\int_{{\mathcal{X}}}\Sigma(x,y)\mu_{m}(dx)+\phi(y).

Here we adopted the shorthand Σ(x, y) = aσ(b·x + c) with y = (a, b, c) ∈ 𝒴. Assume that the activation function σ ∈ C²(ℝ), and that the input space 𝒳 and the parameter space 𝒴 satisfy Assumption 2.1. Then it is straightforward to see that K ∈ C²(𝒳 × 𝒴) and that there exists C_σ < ∞ such that for any multi-indices 𝐢 and 𝐣 with 0 ≤ |𝐢| + |𝐣| ≤ 2,

\|\nabla_{x}^{\mathbf{i}}\nabla^{\mathbf{j}}_{y}K(x,y)\|_{\infty}\leq 2\|\nabla_{x}^{\mathbf{i}}\nabla^{\mathbf{j}}_{y}\Sigma\|_{\infty}+\|\nabla^{\mathbf{j}}_{y}\phi\|_{\infty}\leq C_{\sigma}.

Therefore the convergence results of Theorems 2.1 and 2.2 for the Mean-Field GDA hold for the entropy regularization of the GAN objective defined in (3.3). Moreover, Theorem 2.3 implies that the annealed GDA dynamics finds the MNE of the unregularized GAN objective.
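
As a minimal illustration of the payoff in (3.3), the sketch below evaluates K(x, y) for a single discriminator neuron with σ = tanh and the admissible choice ϕ(y) = |a| (so that sup_x |Σ(x, y)| ≤ ϕ(y)); the dimensions, sample size and parameter values are placeholders, not choices made in the paper.

```python
import numpy as np

def gan_payoff(x, y, samples):
    """Payoff K(x, y) = Sigma(x, y) - (1/m) sum_i Sigma(x_i, y) + phi(y) from (3.3).

    y = (a, b, c) parameterizes one neuron Sigma(x, y) = a * tanh(b . x + c), phi(y) = |a|,
    and `samples` holds the data points x_1, ..., x_m. Illustrative choices only.
    """
    a, b, c = y[0], y[1:-1], y[-1]
    Sigma = lambda z: a * np.tanh(z @ b + c)
    return Sigma(x) - Sigma(samples).mean() + np.abs(a)

rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 2))        # m = 100 samples from a 2-d target measure
x = rng.normal(size=2)                     # one generator strategy: a point of the sample space X
y = np.array([0.5, 1.0, -1.0, 0.2])        # one discriminator neuron (a, b_1, b_2, c)
print(gan_payoff(x, y, samples))
```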

3.2. Adversarial learning of PDEs

We provide another application of our results in the adversarial learning of PDEs. To demonstrate the idea, we focus on a simple linear elliptic PDE on a bounded Lipschitz domain 𝒵 ⊂ ℝ^d equipped with the Neumann boundary condition

-\Delta u(z)+Vu(z)=f(z),\quad z\in{\mathcal{Z}},
\partial_{\nu}u(z)=0,\quad z\in\partial{\mathcal{Z}}.

Assume that 0 < V_min ≤ V ≤ V_max < ∞ and f ∈ (H¹(𝒵))*. The weak solution u ∈ H¹(𝒵) satisfies

(3.4) \int_{{\mathcal{Z}}}\nabla u\cdot\nabla v+Vuv\,dz=\int_{{\mathcal{Z}}}fv\,dz,\quad\forall v\in H^{1}({\mathcal{Z}}).

We seek an approximate solution to (3.4) in the Petrov-Galerkin framework [43, 37], where we choose the spaces of trial functions and test functions as the unit balls of two (possibly different) Barron spaces. More precisely, consider a trial function u ∈ 𝒰 := {u ∈ ℬ_1 | ‖u‖_{ℬ_1} ≤ 1} and a test function v ∈ 𝒱 := {v ∈ ℬ_2 | ‖v‖_{ℬ_2} ≤ 1}, where ℬ_i, i = 1, 2 are Barron spaces defined as in Section 3.1 with activation functions σ_i and Barron norms ‖·‖_{ℬ_i} defined in (3.2) with ϕ replaced by nonnegative weight functions ϕ_i. We look for a solution u ∈ 𝒰 parameterized by a probability measure μ ∈ 𝒫(𝒳),

u(z)=\int_{{\mathcal{X}}}a_{1}\sigma_{1}(b_{1}\cdot z+c_{1})\mu(dx)

with x=(a1,b1,c1)𝒳x=(a_{1},b_{1},c_{1})\in{\mathcal{X}} satisfying equation (3.4) for any v𝒱v\in{\mathcal{V}} with v21\|v\|_{{\mathcal{B}}_{2}}\leq 1 parameterized by ν𝒫(𝒴)\nu\in{\mathcal{P}}({\mathcal{Y}}) such that

v(z)=\int_{{\mathcal{Y}}}a_{2}\sigma_{2}(b_{2}\cdot z+c_{2})\nu(dy).

Putting these into the weak formulation (3.4) leads to

\inf_{\mu\in{\mathcal{P}}({\mathcal{X}})}\sup_{\nu\in{\mathcal{P}}({\mathcal{Y}})}\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}K(x,y)\mu(dx)\nu(dy),

where the potential K(x,y)K(x,y) is given for x=(a1,b1,c1),y=(a2,b2,c2)x=(a_{1},b_{1},c_{1}),y=(a_{2},b_{2},c_{2}) by

K(x,y)=\int_{{\mathcal{Z}}}\Big{(}a_{1}a_{2}b_{1}\cdot b_{2}\sigma^{\prime}_{1}(b_{1}\cdot z+c_{1})\sigma^{\prime}_{2}(b_{2}\cdot z+c_{2})+V(z)a_{1}a_{2}\sigma_{1}(b_{1}\cdot z+c_{1})\sigma_{2}(b_{2}\cdot z+c_{2})-f(z)a_{2}\sigma_{2}(b_{2}\cdot z+c_{2})\Big{)}dz-\phi_{1}(x)+\phi_{2}(y).

Assume that the activation functions σ_i ∈ C²(ℝ) and the parameter spaces 𝒳 and 𝒴 satisfy Assumption 2.1. Assume also that ϕ_1 ∈ C²(𝒳) and ϕ_2 ∈ C²(𝒴). Then it is easy to verify that K ∈ C²(𝒳 × 𝒴) and that ‖K‖_{C²} ≤ C for some constant C > 0 depending on σ_i, ϕ_i, 𝒳, 𝒴, V and f. Hence the convergence results on the Mean-Field GDA and its annealed version established in Section 2 apply to this problem.
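
For concreteness, the following sketch assembles the adversarial payoff K(x, y) above by Monte Carlo quadrature on 𝒵 = [0, 1]³ for the illustrative choices σ_1 = σ_2 = tanh, V ≡ 1, f ≡ 1 and ϕ_1 = ϕ_2 = 0; none of these choices are prescribed by the paper.

```python
import numpy as np

def pde_payoff(x, y, z_nodes):
    """K(x, y) of Section 3.2 via Monte Carlo quadrature over z_nodes ~ Unif([0,1]^d).

    x = (a1, b1, c1) parameterizes the trial neuron, y = (a2, b2, c2) the test neuron;
    sigma_1 = sigma_2 = tanh, V = 1, f = 1 and phi_1 = phi_2 = 0 are illustrative choices.
    """
    a1, b1, c1 = x[0], x[1:-1], x[-1]
    a2, b2, c2 = y[0], y[1:-1], y[-1]
    s1, s2 = np.tanh(z_nodes @ b1 + c1), np.tanh(z_nodes @ b2 + c2)
    ds1, ds2 = 1 - s1 ** 2, 1 - s2 ** 2                # tanh' = 1 - tanh^2
    integrand = (a1 * a2 * (b1 @ b2) * ds1 * ds2       # grad u . grad v term
                 + a1 * a2 * s1 * s2                   # V u v term with V = 1
                 - a2 * s2)                            # f v term with f = 1
    return integrand.mean()                            # |Z| = 1, so the mean estimates the integral

rng = np.random.default_rng(0)
z_nodes = rng.uniform(size=(2000, 3))                  # quadrature nodes in Z = [0, 1]^3
x = np.array([1.0, 0.3, -0.2, 0.1, 0.0])               # trial neuron (a1, b1 in R^3, c1)
y = np.array([0.5, 0.1, 0.4, -0.3, 0.2])               # test neuron  (a2, b2 in R^3, c2)
print(pde_payoff(x, y, z_nodes))
```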

4. Proofs of Main Results

4.1. Proof of convergence for Mean-Field GDA with a fixed temperature

We only present the proof of our main convergence results in the fast ascent regime. The proofs for the fast descent regime can be carried out in a similar manner and are provided in Appendix B for completeness.

We first state a lemma summarizing some important properties of the functionals ℒ_i, i = 1, ..., 4 defined in (2.8).

Lemma 4.1.

For any μ ∈ 𝒫(𝒳) and ν ∈ 𝒫(𝒴), the following hold:

(4.1) {\mathcal{L}}_{2}(\mu,\nu)=\tau{\mathcal{H}}(\nu|{\mathcal{K}}^{+}(\mu)),
(4.2) {\mathcal{L}}_{4}(\mu,\nu)=\tau{\mathcal{H}}(\mu|{\mathcal{K}}^{-}(\nu)),
(4.3) \tau{\mathcal{H}}(\mu|\mu^{\ast})\leq{\mathcal{L}}_{1}(\mu)\leq\tau{\mathcal{H}}(\mu|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu))),
(4.4) \tau{\mathcal{H}}(\nu|\nu^{\ast})\leq{\mathcal{L}}_{3}(\nu)\leq\tau{\mathcal{H}}(\nu|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu))).
Remark 4.1.

The sandwich inequalities (4.3) and (4.4) play an essential role in controlling terms in the entropy production of the Mean-Field GDA dynamics via the Lyapunov functions in order to close the Grönwall argument to obtain the dissipation of the latter along the dynamics. A similar sandwich inequality appeared in [7, Lemma 3.4] in the proof of convergence for the Mean-Field Langevin dynamics.

Next, we track the time-derivatives of ℒ_1(μ_t) and ℒ_2(μ_t, ν_t) in the following proposition, whose proof can be found in Appendix A.

Proposition 4.1.

Let (μ_t, ν_t) be the solution to the GDA dynamics (1.6). Then

(4.5) \frac{d}{dt}{\mathcal{L}}_{1}(\mu_{t})\leq-\frac{\tau^{2}}{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t})))+K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}\cdot{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))

and

(4.6) \frac{d}{dt}{\mathcal{L}}_{2}(\mu_{t},\nu_{t})\leq-\tau^{2}\eta{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))+\frac{\tau^{2}}{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t})))+3K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}\cdot{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t})).

With Proposition 4.1, we are ready to present the proof of Theorem 2.1 and Theorem 2.2.

Proof of Theorem 2.1.

Thanks to Proposition 4.1 and identity (4.1), we have

ddt(μt,νt)\displaystyle\frac{d}{dt}{\mathcal{L}}(\mu_{t},\nu_{t}) ητ2γ(νt|𝒦+(μt))τ22(1γ)(μt|𝒦(𝒦+(μt)))\displaystyle\leq-\eta\tau^{2}\gamma{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))-\frac{\tau^{2}}{2}(1-\gamma){\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t})))
+Kxy2Diam(𝒴)2(1+3γ)τ2(μt,νt).\displaystyle\qquad+\frac{K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}(1+3\gamma)}{\tau}{\mathcal{L}}_{2}(\mu_{t},\nu_{t}).

Observe also from Lemma 2.1 and sandwich inequality (4.3) that

τ(μt|𝒦(𝒦+(μt)))\displaystyle\tau{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))) τλLS(μt|𝒦(𝒦+(μt)))\displaystyle\geq\tau\lambda_{LS}{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t})))
λLS1(μt).\displaystyle\geq\lambda_{LS}{\mathcal{L}}_{1}(\mu_{t}).

Combining the last two displays leads to

ddt(μt,νt)\displaystyle\frac{d}{dt}{\mathcal{L}}(\mu_{t},\nu_{t}) ητ2γ(νt|𝒦+(μt))τ2(1γ)λLS1(μt)+Kxy2Diam(𝒴)2(1+3γ)τ2(μt,νt)\displaystyle\leq-\eta\tau^{2}\gamma{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))-\frac{\tau}{2}(1-\gamma)\lambda_{LS}{\mathcal{L}}_{1}(\mu_{t})+\frac{K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}(1+3\gamma)}{\tau}{\mathcal{L}}_{2}(\mu_{t},\nu_{t})
ητ2γλLS(νt|𝒦+(μt))τ2(1γ)λLS1(μt)+Kxy2Diam(𝒴)2(1+3γ)τ2(μt,νt)\displaystyle\leq-\eta\tau^{2}\gamma\lambda_{LS}{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))-\frac{\tau}{2}(1-\gamma)\lambda_{LS}{\mathcal{L}}_{1}(\mu_{t})+\frac{K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}(1+3\gamma)}{\tau}{\mathcal{L}}_{2}(\mu_{t},\nu_{t})
=τ2(1γ)λLS1(μt)τ(ηγλLSKxy2Diam(𝒴)2(1+3γ)τ2)2(μt,νt).\displaystyle=-\frac{\tau}{2}(1-\gamma)\lambda_{LS}{\mathcal{L}}_{1}(\mu_{t})-\tau\Big{(}\eta\gamma\lambda_{LS}-\frac{K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}(1+3\gamma)}{\tau^{2}}\Big{)}{\mathcal{L}}_{2}(\mu_{t},\nu_{t}).

Now for any fixed γ<1\gamma<1, we set

η=2Kxy2Diam(𝒴)2(1+3γ)τ2γλLS=2λLSκ2Diam(𝒴)2γ.\eta=\frac{2K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}(1+3\gamma)}{\tau^{2}\gamma\lambda_{LS}}=\frac{2\lambda_{LS}\kappa^{2}\mathrm{Diam}({\mathcal{Y}})^{2}}{\gamma}.

Then it follows from the last inequality that

ddt(μt,νt)α(μt,νt)\frac{d}{dt}{\mathcal{L}}(\mu_{t},\nu_{t})\leq-\alpha{\mathcal{L}}(\mu_{t},\nu_{t})

with

α=τλLS(1γ2κ2Diam(𝒴)2(1+3γ)γ).\displaystyle\alpha=\tau\lambda_{LS}\Big{(}\frac{1-\gamma}{2}\wedge\frac{\kappa^{2}\mathrm{Diam}({\mathcal{Y}})^{2}(1+3\gamma)}{\gamma}\Big{)}.

Proof of Theorem 2.2.

First, thanks to Theorem 2.1, (μt,νt)eαt(μ0,ν0){\mathcal{L}}(\mu_{t},\nu_{t})\leq e^{-\alpha t}{\mathcal{L}}(\mu_{0},\nu_{0}). In particular, for any t>0t>0,

(4.7) {\mathcal{L}}_{1}(\mu_{t})\leq e^{-\alpha t}{\mathcal{L}}(\mu_{0},\nu_{0})

and

(4.8) \tau{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))={\mathcal{L}}_{2}(\mu_{t},\nu_{t})\leq\frac{e^{-\alpha t}}{\gamma}{\mathcal{L}}(\mu_{0},\nu_{0}).

As a result of (4.7) and (4.3), one obtains that

(4.9) \tau{\mathcal{H}}(\mu_{t}|\mu^{\ast})\leq e^{-\alpha t}{\mathcal{L}}(\mu_{0},\nu_{0}).

Next, to obtain the exponential decay of (νt|ν){\mathcal{H}}(\nu_{t}|\nu^{\ast}), notice first that

\tau{\mathcal{H}}(\nu_{t}|\nu^{\ast})=\tau{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))+\tau\int_{{\mathcal{Y}}}(\log({\mathcal{K}}^{+}(\mu_{t}))-\log(\nu^{\ast}))d\nu_{t}
=\tau{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))+\tau\int_{{\mathcal{Y}}}(\log({\mathcal{K}}^{+}(\mu_{t}))-\log(\nu^{\ast}))d(\nu_{t}-\nu^{\ast})-\tau{\mathcal{H}}(\nu^{\ast}|{\mathcal{K}}^{+}(\mu_{t}))
\leq\tau{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))+\tau\int_{{\mathcal{Y}}}(\log({\mathcal{K}}^{+}(\mu_{t}))-\log(\nu^{\ast}))d(\nu_{t}-\nu^{\ast}).

Since ν=𝒦+(μ)\nu^{\ast}={\mathcal{K}}^{+}(\mu^{\ast}), we have

(log(𝒦+(μt))log(ν))(y)\displaystyle(\log({\mathcal{K}}^{+}(\mu_{t}))-\log(\nu^{\ast}))(y) =τ1𝒳K(x,y)(μt(dx)μ(dx))(logZ+(μt)logZ+(μ)).\displaystyle=\tau^{-1}\int_{{\mathcal{X}}}K(x,y)(\mu_{t}(dx)-\mu^{\ast}(dx))-(\log Z^{+}(\mu_{t})-\log Z^{+}(\mu^{\ast})).

It follows from the last two displays that

(4.10) \tau{\mathcal{H}}(\nu_{t}|\nu^{\ast})\leq\tau{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))+\int_{{\mathcal{Y}}}\int_{{\mathcal{X}}}K(x,y)(\mu_{t}(dx)-\mu^{\ast}(dx))(\nu_{t}(dy)-\nu^{\ast}(dy)).

The last term on the right side above can be upper bounded by

(4.11) \int_{{\mathcal{Y}}}\int_{{\mathcal{X}}}K(x,y)(\mu_{t}(dx)-\mu^{\ast}(dx))(\nu_{t}(dy)-\nu^{\ast}(dy))\leq\|K\|_{\infty}\cdot\text{TV}(\mu_{t},\mu^{\ast})\cdot\text{TV}(\nu_{t},\nu^{\ast})
\leq 2\|K\|_{\infty}\sqrt{{\mathcal{H}}(\mu_{t}|\mu^{\ast})}\cdot\sqrt{{\mathcal{H}}(\nu_{t}|\nu^{\ast})}
\leq\frac{\tau}{2}{\mathcal{H}}(\nu_{t}|\nu^{\ast})+\frac{2\|K\|_{\infty}^{2}}{\tau}{\mathcal{H}}(\mu_{t}|\mu^{\ast}),

where we have used Pinsker's inequality and Young's inequality in the last two lines. Finally, combining the last two estimates leads to

\tau{\mathcal{H}}(\nu_{t}|\nu^{\ast})\leq 2\tau{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))+\frac{4\|K\|_{\infty}^{2}}{\tau}{\mathcal{H}}(\mu_{t}|\mu^{\ast})\leq\Big{(}\frac{2}{\gamma}+\frac{4\|K\|_{\infty}^{2}}{\tau^{2}}\Big{)}e^{-\alpha t}{\mathcal{L}}(\mu_{0},\nu_{0}),

where we have used inequalities (4.8) and (4.9) in the last inequality above.

4.2. Proof of convergence for the annealed dynamics

Recall that the MNE of the entropy-regularized objective E_τ is given by (μ*_τ, ν*_τ), characterized by the fixed point equations (2.5). Note that we emphasize the dependence of the objective and the optimizer on the temperature τ, since the latter will be time-dependent throughout this section.

It will be useful to define energies i,i=1,2{\mathcal{E}}_{i},i=1,2 as follows.

(4.12) 1(μ)\displaystyle{\mathcal{E}}_{1}(\mu) :=maxν𝒫(𝒴)Eτ(μ,ν)=τ(μ)+τlogZ+(μ),\displaystyle:=\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E_{\tau}(\mu,\nu)=-\tau{\mathcal{H}}(\mu)+\tau\log Z^{+}(\mu),
2(ν)\displaystyle{\mathcal{E}}_{2}(\nu) :=minμ𝒫(𝒳)Eτ(μ,ν)=τ(ν)τlogZ(ν).\displaystyle:=\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}E_{\tau}(\mu,\nu)=\tau{\mathcal{H}}(\nu)-\tau\log Z^{-}(\nu).

It is clear that 1(μ)=1(μ)1(μ){\mathcal{L}}_{1}(\mu)={\mathcal{E}}_{1}(\mu)-{\mathcal{E}}_{1}(\mu^{\ast}).

Proof of Theorem 2.3.

As in the last section, we only present the proof for the fast ascent regime; the proof for the other regime can be carried out analogously and is provided in Appendix C for completeness. The proof of the entropy decay in (2.13) largely follows that of Theorem 2.1 and is also inspired by the proof of [7, Theorem 4.1].

Step 1. Bounding ℒ_1(μ_t) = ℰ_1(μ_t) − ℰ_1(μ*_{τ_t}). First, let us compute the time-derivative of ℰ_1(μ*_{τ_t}). Note that since

1(μ)=τ(μ)+τlogZ+(μ),{\mathcal{E}}_{1}(\mu)=-\tau{\mathcal{H}}(\mu)+\tau\log Z^{+}(\mu),

we have for τ>0\tau>0,

ddτ1(μτ)\displaystyle\frac{d}{d\tau}{\mathcal{E}}_{1}(\mu_{\tau}^{\ast}) =(μτ)+logZ+(μτ)\displaystyle=-{\mathcal{H}}(\mu_{\tau}^{\ast})+\log Z^{+}(\mu_{\tau}^{\ast})
+τ(logμτ,τμτ+𝒳𝒴𝒦+(μτ)(y)K(x,y)ddτ(τ1μτ(x))𝑑y𝑑x).\displaystyle\qquad+\tau\Big{(}\langle\log\mu_{\tau}^{\ast},\partial_{\tau}\mu_{\tau}^{\ast}\rangle+\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu^{\ast}_{\tau})(y)K(x,y)\frac{d}{d\tau}(\tau^{-1}\mu^{\ast}_{\tau}(x))dydx\Big{)}.

Moreover,

τ𝒳𝒴𝒦+(μτ)(y)K(x,y)ddτ(τ1μτ(x))𝑑y𝑑x=𝒳𝒴𝒦+(μτ)(y)τ1K(x,y)𝑑yμτ(x)𝑑x\displaystyle\tau\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu^{\ast}_{\tau})(y)K(x,y)\frac{d}{d\tau}(\tau^{-1}\mu^{\ast}_{\tau}(x))dydx=-\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu^{\ast}_{\tau})(y)\tau^{-1}K(x,y)dy\mu^{\ast}_{\tau}(x)dx
+τ𝒳𝒴𝒦+(μτ)(y)τ1K(x,y)𝑑yτμτ(dx)\displaystyle\qquad+\tau\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu^{\ast}_{\tau})(y)\tau^{-1}K(x,y)dy\partial_{\tau}\mu^{\ast}_{\tau}(dx)
=𝒳𝒴𝒦+(μτ)(y)τ1K(x,y)𝑑yμτ(x)𝑑xτlog𝒦(𝒦+(μτ)),τμτ\displaystyle=-\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu^{\ast}_{\tau})(y)\tau^{-1}K(x,y)dy\mu^{\ast}_{\tau}(x)dx-\tau\langle\log{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{\tau}^{\ast})),\partial_{\tau}\mu^{\ast}_{\tau}\rangle

Combining the last two identities leads to

ddτ1(μτ)\displaystyle\frac{d}{d\tau}{\mathcal{E}}_{1}(\mu_{\tau}^{\ast}) =(μτ)+logZ+(μτ)𝒳𝒴𝒦+(μτ)(y)τ1K(x,y)𝑑yμτ(x)𝑑x\displaystyle=-{\mathcal{H}}(\mu_{\tau}^{\ast})+\log Z^{+}(\mu_{\tau}^{\ast})-\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu^{\ast}_{\tau})(y)\tau^{-1}K(x,y)dy\mu^{\ast}_{\tau}(x)dx
+τlogμτlog𝒦(𝒦+(μτ)),τμτ=u1τ(u)|u=μτ,τμτ=0\displaystyle+\underbrace{\tau\langle\log\mu_{\tau}^{\ast}-\log{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{\tau}^{\ast})),\partial_{\tau}\mu^{\ast}_{\tau}\rangle}_{=\langle\partial_{u}{\mathcal{E}}_{1}^{\tau}(u)|_{u=\mu_{\tau}^{\ast}},\ \partial_{\tau}\mu^{\ast}_{\tau}\rangle=0}
=τ11(μτ)𝒳𝒴𝒦+(μτ)(y)τ1K(x,y)𝑑yμτ(x)𝑑x.\displaystyle=\tau^{-1}{\mathcal{E}}_{1}(\mu^{\ast}_{\tau})-\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu^{\ast}_{\tau})(y)\tau^{-1}K(x,y)dy\mu^{\ast}_{\tau}(x)dx.

Consequently, we have

(4.13) ddt1(μτt)\displaystyle\frac{d}{dt}{\mathcal{E}}_{1}(\mu_{\tau_{t}}^{\ast}) =τtddτ1(μτ)|τ=τt\displaystyle=\tau^{\prime}_{t}\frac{d}{d\tau}{\mathcal{E}}_{1}(\mu_{\tau}^{\ast})\Big{|}_{\tau=\tau_{t}}
=τt(τt11(μτt)𝒳𝒴𝒦+(μτt)(y)τt1K(x,y)𝑑yμτt(x)𝑑x).\displaystyle=\tau^{\prime}_{t}\Big{(}\tau_{t}^{-1}{\mathcal{E}}_{1}(\mu^{\ast}_{\tau_{t}})-\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu^{\ast}_{\tau_{t}})(y)\tau_{t}^{-1}K(x,y)dy\mu^{\ast}_{\tau_{t}}(x)dx\Big{)}.

Next, a similar calculation as above applied to 1(μt){\mathcal{E}}_{1}(\mu_{t}) gives

(4.14) ddt1(μt)\displaystyle\frac{d}{dt}{\mathcal{E}}_{1}(\mu_{t}) =τt(τt11(μt)𝒳𝒴𝒦+(μt)(y)τt1K(x,y)𝑑yμt(x)𝑑x)\displaystyle=\tau^{\prime}_{t}\Big{(}\tau_{t}^{-1}{\mathcal{E}}_{1}(\mu_{t})-\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu_{t})(y)\tau_{t}^{-1}K(x,y)dy\mu_{t}(x)dx\Big{)}
+τt𝒳log(dμtd𝒦(𝒦+(μt)))tμt(dx)\displaystyle\qquad+\tau_{t}\int_{{\mathcal{X}}}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}\Big{)}\cdot\partial_{t}\mu_{t}(dx)
=τt(τt11(μt)𝒳𝒴𝒦+(μt)(y)τt1K(x,y)𝑑yμt(x)𝑑x)\displaystyle=\tau^{\prime}_{t}\Big{(}\tau_{t}^{-1}{\mathcal{E}}_{1}(\mu_{t})-\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu_{t})(y)\tau_{t}^{-1}K(x,y)dy\mu_{t}(x)dx\Big{)}
τt2𝒳xlog(dμtd𝒦(𝒦+(μt)))xlog(dμtd𝒦(νt))μt(dx).\displaystyle\qquad-\tau_{t}^{2}\int_{{\mathcal{X}}}\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}\Big{)}\cdot\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\mu_{t}(dx).

As a result of (4.14) and (4.13),

(4.15) ddt1(μt)=ddt(1(μt)1(μτt))\displaystyle\frac{d}{dt}{\mathcal{L}}_{1}(\mu_{t})=\frac{d}{dt}({\mathcal{E}}_{1}(\mu_{t})-{\mathcal{E}}_{1}(\mu^{\ast}_{\tau_{t}}))
=τtτt[1(μt)(𝒳𝒴𝒦+(μt)(y)K(x,y)𝑑yμt(x)𝑑x𝒳𝒴𝒦+(μτt)(y)K(x,y)𝑑yμτt(x)𝑑x)]\displaystyle=\frac{\tau_{t}^{\prime}}{\tau_{t}}\Big{[}{\mathcal{L}}_{1}(\mu_{t})-\Big{(}\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu_{t})(y)K(x,y)dy\mu_{t}(x)dx-\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}{\mathcal{K}}^{+}(\mu^{\ast}_{\tau_{t}})(y)K(x,y)dy\mu^{\ast}_{\tau_{t}}(x)dx\Big{)}\Big{]}
τt2𝒳xlog(dμtd𝒦(𝒦+(μt)))xlog(dμtd𝒦(νt))μt(dx)\displaystyle\qquad-\tau_{t}^{2}\int_{{\mathcal{X}}}\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}\Big{)}\cdot\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\mu_{t}(dx)
τtτt(1(μt)+2K)τt22(μt|𝒦(𝒦+(μt)))+Kxy2Diam(𝒴)2(νt|𝒦+(μt)).\displaystyle\leq\frac{\tau_{t}^{\prime}}{\tau_{t}}\Big{(}{\mathcal{L}}_{1}(\mu_{t})+2\|K\|_{\infty}\Big{)}-\frac{\tau_{t}^{2}}{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t})))+K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}\cdot{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t})).

Next, we have from the ascent dynamics for νt\nu_{t},

(4.16) ddt2(μt,νt)\displaystyle\frac{d}{dt}{\mathcal{L}}_{2}(\mu_{t},\nu_{t}) =τt(νt|𝒦+(μt))+τt𝒴log(dνtd𝒦+(μt))tνt(dy)τt𝒴ddt(log𝒦+(μt))νt(dy)\displaystyle=\tau^{\prime}_{t}{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))+\tau_{t}\int_{{\mathcal{Y}}}\log\Big{(}\frac{d\nu_{t}}{d{\mathcal{K}}^{+}(\mu_{t})}\Big{)}\partial_{t}\nu_{t}(dy)-\tau_{t}\int_{{\mathcal{Y}}}\frac{d}{dt}(\log{\mathcal{K}}^{+}(\mu_{t}))\nu_{t}(dy)
τt(νt|𝒦+(μt))ηtτt2(νt|𝒦+(μt))τt𝒴ddt(log𝒦+(μt))νt(dy).\displaystyle\leq\tau^{\prime}_{t}{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))-\eta_{t}\tau_{t}^{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))-\tau_{t}\int_{{\mathcal{Y}}}\frac{d}{dt}(\log{\mathcal{K}}^{+}(\mu_{t}))\nu_{t}(dy).

The third term on the RHS above is

τt(𝒴𝒳ddt(τt1μt(x))K(x,y)𝑑xνt(y)𝑑yddtlog(Z+(μt)))\displaystyle-\tau_{t}\Big{(}\int_{{\mathcal{Y}}}\int_{{\mathcal{X}}}\frac{d}{dt}(\tau_{t}^{-1}\mu_{t}(x))K(x,y)dx\nu_{t}(y)dy-\frac{d}{dt}\log(Z^{+}(\mu_{t}))\Big{)}
=τt(𝒴𝒳ddt(τt1μt(x))K(x,y)𝑑x(νt(y)𝒦+(μt)(y))𝑑y)\displaystyle=-\tau_{t}\Big{(}\int_{{\mathcal{Y}}}\int_{{\mathcal{X}}}\frac{d}{dt}(\tau_{t}^{-1}\mu_{t}(x))K(x,y)dx(\nu_{t}(y)-{\mathcal{K}}^{+}(\mu_{t})(y))dy\Big{)}
=τtτt𝒴𝒳μt(x)K(x,y)𝑑x(νt(y)𝒦+(μt)(y))𝑑y\displaystyle=-\frac{\tau^{\prime}_{t}}{\tau_{t}}\int_{{\mathcal{Y}}}\int_{{\mathcal{X}}}\mu_{t}(x)K(x,y)dx(\nu_{t}(y)-{\mathcal{K}}^{+}(\mu_{t})(y))dy
𝒴𝒳tμt(x)K(x,y)dx(νt(y)𝒦+(μt)(y))dy\displaystyle\qquad\qquad-\int_{{\mathcal{Y}}}\int_{{\mathcal{X}}}\partial_{t}\mu_{t}(x)K(x,y)dx(\nu_{t}(y)-{\mathcal{K}}^{+}(\mu_{t})(y))dy
=τtτt𝒴𝒳μt(x)K(x,y)𝑑x(νt(y)𝒦+(μt)(y))𝑑y\displaystyle=-\frac{\tau^{\prime}_{t}}{\tau_{t}}\int_{{\mathcal{Y}}}\int_{{\mathcal{X}}}\mu_{t}(x)K(x,y)dx(\nu_{t}(y)-{\mathcal{K}}^{+}(\mu_{t})(y))dy
+τt2𝒳xlog(d𝒦(𝒦+(μt))d𝒦(νt))xlog(dμtd𝒦(νt))𝑑μt(x)\displaystyle\qquad+\tau_{t}^{2}\int_{{\mathcal{X}}}\nabla_{x}\log\Big{(}\frac{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\cdot\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}d\mu_{t}(x)
2K|τt|τt+τt22(μt|𝒦(𝒦+(μt)))+3Kxy2Diam(𝒴)2(νt|𝒦+(μt)).\displaystyle\leq\frac{2\|K\|_{\infty}|\tau^{\prime}_{t}|}{\tau_{t}}+\frac{\tau_{t}^{2}}{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t})))+3K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}\cdot{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t})).

Combining the last two displays yields

(4.17) ddt2(μt,νt)(τt+3Kxy2Diam(𝒴)2)(νt|𝒦+(μt))ηtτt2(νt|𝒦+(μt))\displaystyle\frac{d}{dt}{\mathcal{L}}_{2}(\mu_{t},\nu_{t})\leq\Big{(}\tau^{\prime}_{t}+3K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}\Big{)}{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))-\eta_{t}\tau_{t}^{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))
+2K|τt|τt+τt22(μt|𝒦(𝒦+(μt))).\displaystyle\qquad+\frac{2\|K\|_{\infty}|\tau^{\prime}_{t}|}{\tau_{t}}+\frac{\tau_{t}^{2}}{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))).

It then follows from (4.15) and (4.17) that

ddt(μt,νt)τt2(1γ)λLS(τt)2(μt|𝒦(𝒦+(μt)))+(γτt+(1+3γ)Kxy2Diam(𝒴)2)(νt|𝒦+(μt))\displaystyle\frac{d}{dt}{\mathcal{L}}(\mu_{t},\nu_{t})\leq-\frac{\tau_{t}^{2}(1-\gamma)\lambda_{LS}(\tau_{t})}{2}{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t})))+\Big{(}\gamma\tau^{\prime}_{t}+(1+3\gamma)K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}\Big{)}{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))
+2(1+γ)K|τt|τt+τtτt1(μt)γηtτt2(νt|𝒦+(μt))\displaystyle+\frac{2(1+\gamma)\|K\|_{\infty}|\tau^{\prime}_{t}|}{\tau_{t}}+\frac{\tau^{\prime}_{t}}{\tau_{t}}{\mathcal{L}}_{1}(\mu_{t})-\gamma\eta_{t}\tau_{t}^{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t}))
((1γ)τtλLS(τt)2τtτt)1(μt)(γηtτtλLS(τt)γτtτt(1+3γ)Kxy2Diam(𝒴)2τt)2(μt,νt)\displaystyle\leq-\Big{(}\frac{(1-\gamma)\tau_{t}\lambda_{LS}(\tau_{t})}{2}-\frac{\tau^{\prime}_{t}}{\tau_{t}}\Big{)}{\mathcal{L}}_{1}(\mu_{t})-\Big{(}\gamma\eta_{t}\tau_{t}\lambda_{LS}(\tau_{t})-\frac{\gamma\tau^{\prime}_{t}}{\tau_{t}}-\frac{(1+3\gamma)K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}}{\tau_{t}}\Big{)}{\mathcal{L}}_{2}(\mu_{t},\nu_{t})
+2(1+γ)K|τt|τt\displaystyle+\frac{2(1+\gamma)\|K\|_{\infty}|\tau^{\prime}_{t}|}{\tau_{t}}
αt(μt,νt)+2(1+γ)Ktlogt,\displaystyle\leq-\alpha_{t}{\mathcal{L}}(\mu_{t},\nu_{t})+\frac{2(1+\gamma)\|K\|_{\infty}}{t\log t},

where the time-dependent rate satisfies for large tt that

αt\displaystyle\alpha_{t} =(((1γ)τtλLS(τt)2τtτt)(ηtτtλLS(τt)τtτt(1+3γ)Kxy2Diam(𝒴)2γτt))\displaystyle=\Big{(}\Big{(}\frac{(1-\gamma)\tau_{t}\lambda_{LS}(\tau_{t})}{2}-\frac{\tau^{\prime}_{t}}{\tau_{t}}\Big{)}\wedge\Big{(}\eta_{t}\tau_{t}\lambda_{LS}(\tau_{t})-\frac{\tau^{\prime}_{t}}{\tau_{t}}-\frac{(1+3\gamma)K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}}{\gamma\tau_{t}}\Big{)}\Big{)}
ξlogt(C1tξ/ξ(ηttξ/ξ+C2tlogtC3(logt)2)).\displaystyle\geq\frac{\xi}{\log t}\Big{(}C_{1}t^{-\xi_{\ast}/\xi}\wedge\Big{(}\eta_{t}t^{-\xi^{\ast}/\xi}+\frac{C_{2}}{t\log t}-\frac{C_{3}}{(\log t)^{2}}\Big{)}\Big{)}.

Note that in the above we have used that \tau_{t} is decreasing for large t. By choosing \eta_{t}\geq M\frac{t^{\xi^{\ast}/\xi}}{(\log t)^{2}} for some large M>0, we have

αtCtξ/ξϵlogt.\alpha_{t}\geq\frac{Ct^{-\xi^{\ast}/\xi-\epsilon}}{\log t}.
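Indeed, with this choice of \eta_{t} and assuming M>C_{3} (which the largeness of M guarantees), the second branch of the minimum above satisfies

\eta_{t}t^{-\xi^{\ast}/\xi}+\frac{C_{2}}{t\log t}-\frac{C_{3}}{(\log t)^{2}}\geq\frac{M-C_{3}}{(\log t)^{2}},

and since (\log t)^{-2}\geq t^{-\epsilon} for any \epsilon>0 and t large enough, both branches of the minimum are bounded below by a constant multiple of t^{-\xi^{\ast}/\xi-\epsilon}; multiplying by the prefactor \xi/\log t gives the displayed lower bound on \alpha_{t}.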

As a result, we have obtained that for any ϵ<1ξ/ξ\epsilon^{\prime}<1-\xi^{\ast}/\xi and for tt large enough,

ddt(μt,νt)Ctξ/ξϵ(μt,νt)+Ctlogt.\frac{d}{dt}{\mathcal{L}}(\mu_{t},\nu_{t})\leq-Ct^{-\xi^{\ast}/\xi-\epsilon^{\prime}}{\mathcal{L}}(\mu_{t},\nu_{t})+\frac{C^{\prime}}{t\log t}.

Define Q(t)=(μt,νt)CCt1+ξ/ξ+ϵQ(t)={\mathcal{L}}(\mu_{t},\nu_{t})-\frac{C^{\prime}}{C}t^{-1+\xi^{\ast}/\xi+\epsilon^{\prime}}. Then it is straightforward to show that for t>tt>t_{\ast} large enough,

ddtQ(t)Ctξ/ξϵQ(t),\frac{d}{dt}Q(t)\leq-Ct^{-\xi^{\ast}/\xi-\epsilon^{\prime}}Q(t),
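Indeed, abbreviating a:=\xi^{\ast}/\xi+\epsilon^{\prime} and using the differential inequality for {\mathcal{L}}(\mu_{t},\nu_{t}) above, one computes

Q^{\prime}(t)=\frac{d}{dt}{\mathcal{L}}(\mu_{t},\nu_{t})+\frac{C^{\prime}}{C}(1-a)t^{-2+a}\leq-Ct^{-a}Q(t)-\frac{C^{\prime}}{t}\Big{(}1-\frac{1}{\log t}\Big{)}+\frac{C^{\prime}}{C}(1-a)t^{-2+a},

and since a<1 the last two terms sum to a nonpositive quantity once t is large enough, which gives the claimed differential inequality for Q.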

which implies that

(4.18) {\mathcal{L}}(\mu_{t},\nu_{t})\leq Q(t_{\ast})e^{-\frac{C}{1-\xi^{\ast}/\xi-\epsilon^{\prime}}(t^{1-\xi^{\ast}/\xi-\epsilon^{\prime}}-t_{\ast}^{1-\xi^{\ast}/\xi-\epsilon^{\prime}})}+\frac{C^{\prime}}{C}t^{-1+\xi^{\ast}/\xi+\epsilon^{\prime}}\leq C^{\prime\prime}t^{-1+\xi^{\ast}/\xi+\epsilon^{\prime}}

since ξ<ξ\xi^{\ast}<\xi and Q(t)Q(t_{\ast}) is finite. By the definition of {\mathcal{L}} and the fact that logttϵ\log t\ll t^{\epsilon} for any ϵ>0\epsilon>0, the last estimate further implies that for any 0<ϵ<1ξ/ξ0<\epsilon<1-\xi^{\ast}/\xi,

(4.19) (μt|μτt)C′′t1+ξ/ξ+ϵ, for t>t.{\mathcal{H}}(\mu_{t}|\mu_{\tau_{t}}^{\ast})\leq C^{\prime\prime}t^{-1+\xi^{\ast}/\xi+\epsilon},\text{ for }t>t_{\ast}.

In addition, using the same arguments as in the proof of the second bound in (2.10) of Theorem 2.2, one can obtain that

(νt|ντt)C′′t1+ξ/ξ+ϵ, for t>t.{\mathcal{H}}(\nu_{t}|\nu_{\tau_{t}}^{\ast})\leq C^{\prime\prime}t^{-1+\xi^{\ast}/\xi+\epsilon},\text{ for }t>t_{\ast}.

Step 2: Bounding NI(μt,νt){\text{NI}}(\mu_{t},\nu_{t}). Let us first claim that the difference NI(μt,νt)NI(μτt,ντt){\text{NI}}(\mu_{t},\nu_{t})-{\text{NI}}(\mu_{\tau_{t}}^{\ast},\nu_{\tau_{t}}^{\ast}) satisfies

(4.20) NI(μt,νt)NI(μτt,ντt)Ct1ξ/ξϵ2.{\text{NI}}(\mu_{t},\nu_{t})-{\text{NI}}(\mu_{\tau_{t}}^{\ast},\nu_{\tau_{t}}^{\ast})\leq Ct^{-\frac{1-\xi^{\ast}/\xi-\epsilon}{2}}.

In fact, by definition,

\displaystyle{\text{NI}}(\mu_{t},\nu_{t})-{\text{NI}}(\mu_{\tau_{t}}^{\ast},\nu_{\tau_{t}}^{\ast})=\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E_{0}(\mu_{t},\nu)-\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E_{0}(\mu_{\tau_{t}}^{\ast},\nu)+\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}E_{0}(\mu,\nu_{\tau_{t}}^{\ast})-\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}E_{0}(\mu,\nu_{t})
=maxy𝒴𝒳K(x,y)μt(dx)maxy𝒴𝒳K(x,y)μτt(dx)\displaystyle=\max_{y\in{\mathcal{Y}}}\int_{{\mathcal{X}}}K(x,y)\mu_{t}(dx)-\max_{y\in{\mathcal{Y}}}\int_{{\mathcal{X}}}K(x,y)\mu_{\tau_{t}}^{\ast}(dx)
+minx𝒳𝒴K(x,y)ντt(dy)minx𝒳𝒴K(x,y)νt(dy)\displaystyle\qquad+\min_{x\in{\mathcal{X}}}\int_{{\mathcal{Y}}}K(x,y)\nu_{\tau_{t}}^{\ast}(dy)-\min_{x\in{\mathcal{X}}}\int_{{\mathcal{Y}}}K(x,y)\nu_{t}(dy)
=:J1+J2.\displaystyle=:J_{1}+J_{2}.

To bound J1J_{1}, let us define ytargmaxy𝒴𝒳K(x,y)μt(dx)y_{t}\in\operatorname*{arg\,max}_{y\in{\mathcal{Y}}}\int_{{\mathcal{X}}}K(x,y)\mu_{t}(dx) and ytargmaxy𝒴𝒳K(x,y)μτt(dx)y^{\ast}_{t}\in\operatorname*{arg\,max}_{y\in{\mathcal{Y}}}\int_{{\mathcal{X}}}K(x,y)\mu_{\tau_{t}}^{\ast}(dx). Then by the optimality of yty^{\ast}_{t} and the boundedness of KK,

J1\displaystyle J_{1} =𝒳K(x,yt)μt(dx)𝒳K(x,yt)μτt(dx)\displaystyle=\int_{{\mathcal{X}}}K(x,y_{t})\mu_{t}(dx)-\int_{{\mathcal{X}}}K(x,y^{\ast}_{t})\mu_{\tau_{t}}^{\ast}(dx)
𝒳K(x,yt)μt(dx)𝒳K(x,yt)μτt(dx)\displaystyle\leq\int_{{\mathcal{X}}}K(x,y_{t})\mu_{t}(dx)-\int_{{\mathcal{X}}}K(x,y_{t})\mu_{\tau_{t}}^{\ast}(dx)
KTV(μt,μτt)\displaystyle\leq\|K\|_{\infty}\text{TV}(\mu_{t},\mu_{\tau_{t}}^{\ast})
2K(μt|μτt)\displaystyle\leq\sqrt{2}\|K\|_{\infty}\sqrt{{\mathcal{H}}(\mu_{t}|\mu_{\tau_{t}}^{\ast})}
Ct1ξ/ξϵ2.\displaystyle\leq Ct^{-\frac{1-\xi^{\ast}/\xi-\epsilon}{2}}.

The same bound holds for J_{2} by a similar argument, which completes the proof of (4.20). Finally, applying Lemma 4.2 with \tau=\tau_{t}=\xi/\log t, we have for t large enough,

(4.21) NI(μτt,ντt)Clog(logt)logt.{\text{NI}}(\mu_{\tau_{t}}^{\ast},\nu_{\tau_{t}}^{\ast})\leq\frac{C\log(\log t)}{\log t}.
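Indeed, by the last part of Lemma 4.2, the pair (\mu_{\tau_{t}}^{\ast},\nu_{\tau_{t}}^{\ast}) is an \epsilon_{t}-Nash equilibrium with, for \tau_{t}=\xi/\log t,

\epsilon_{t}=\beta\tau_{t}\log(1/\tau_{t})=\frac{\beta\xi}{\log t}\log\Big{(}\frac{\log t}{\xi}\Big{)}\leq\frac{C\log(\log t)}{\log t}

for t large enough, which is exactly the bound (4.21).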

Hence the estimate (2.14) follows from (4.20) and (4.21).

We restate [16, Theorem 5] in the following lemma which is used to control NI(μτ,ντ){\text{NI}}(\mu_{\tau}^{\ast},\nu_{\tau}^{\ast}).

Lemma 4.2.

[16, Theorem 5] Let \epsilon>0, \delta:=\epsilon/(2\text{Lip}(K)) and let V_{\delta} be a lower bound on the volume of a ball of radius \delta in {\mathcal{X}} and {\mathcal{Y}}. Let (\mu_{\tau}^{\ast},\nu_{\tau}^{\ast}) be the solution of (2.5). Then (\mu_{\tau}^{\ast},\nu_{\tau}^{\ast}) is an \epsilon-Nash equilibrium of E_{0} if

τϵ4log(2(1Vδ)Vδ(4K/ϵ1)).\tau\leq\frac{\epsilon}{4\log\Big{(}\frac{2(1-V_{\delta})}{V_{\delta}}(4\|K\|_{\infty}/\epsilon-1)\Big{)}}.

In particular, when {\mathcal{X}} and {\mathcal{Y}} are Riemannian manifolds or Euclidean tori of dimensions d_{{\mathcal{X}}} and d_{{\mathcal{Y}}} respectively, so that \text{Vol}(B_{x}^{\delta})\geq C\delta^{d_{{\mathcal{X}}}} and \text{Vol}(B_{y}^{\delta})\geq C\delta^{d_{{\mathcal{Y}}}}, the bound above becomes

τCϵlogϵ1\tau\leq\frac{C\epsilon}{\log\epsilon^{-1}}

for some constant C>0C>0 depending only on K,d𝒳,d𝒴K,d_{{\mathcal{X}}},d_{{\mathcal{Y}}}. Alternatively, if τ\tau is sufficiently small, then (μτ,ντ)(\mu_{\tau}^{\ast},\nu_{\tau}^{\ast}) is an ϵ\epsilon-Nash equilibrium with

(4.22) ϵ=βτ(log(1/τ)) for β>d𝒳d𝒴+1.\epsilon=\beta\tau(\log(1/\tau))\text{ for }\beta>d_{{\mathcal{X}}}\vee d_{{\mathcal{Y}}}+1.

5. Conclusion and future work

In this paper, we proved the global exponential convergence of the two-scale Mean-Field GDA dynamics for finding the unique MNE of the entropy-regularized game objective on the space of probability measures. We also proved the convergence of the annealed GDA dynamics to the MNE of the unregularized game objective with respect to the Nikaidò-Isoda error. The key ingredients of our proofs are new Lyapunov functions which capture the dissipation of the two-scale GDA dynamics in different scaling regimes.

We would like to mention several open problems and future research directions. First, although we have proved the global convergence of the Mean-Field GDA in the fast ascent/descent regime (with a finite but large/small time-scale ratio), it remains an open question to show the convergence or nonconvergence of the Mean-Field GDA in the intermediate time-scale regime, including in particular the case where no time-scale separation occurs (i.e. \eta=1 in (1.6)). Also, our results currently only hold on bounded state spaces, and the convergence rate depends on the uniform bounds of the potential function K on the state spaces. It remains an important question to extend the results to minmax optimization on unbounded state spaces. In practice, the Mean-Field GDA needs to be implemented by an interacting particle algorithm such as (1.5) with a large number of particles; a schematic particle discretization is sketched at the end of this section. Existing results on the mean-field limit of (1.5) (e.g. [16, Theorem 3]) only hold on a finite time interval, and the error bound can potentially grow exponentially in time. To obtain a quantitative error analysis of the GDA for minmax optimization, it is an interesting open question to derive a uniform-in-time quantitative error bound between the particle system and the mean-field dynamics. Finally, we anticipate that our results can be exploited to prove theoretical convergence guarantees for a variety of minmax learning problems, including in particular the training of GANs [20], adversarial learning problems [34], dual training of energy-based models [10, 14] and weak adversarial networks for PDEs [57].
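To make the preceding point concrete, the following minimal Python sketch implements a plain Euler–Maruyama discretization of a two-scale noisy GDA particle system on the torus. It is only an illustration and not a restatement of the scheme (1.5): the payoff K(x,y)=\sum_{i}\sin(2\pi(x_{i}-y_{i})), the periodic state space, the step size, and the convention that both the drift and the noise of the ascent particles are scaled by the ratio \eta are assumptions made purely for the sake of the example.

import numpy as np

def grad_x_K(x, y):
    # gradient in x of the illustrative payoff K(x, y) = sum_i sin(2*pi*(x_i - y_i))
    return 2.0 * np.pi * np.cos(2.0 * np.pi * (x - y))

def grad_y_K(x, y):
    # gradient in y of the same illustrative payoff
    return -2.0 * np.pi * np.cos(2.0 * np.pi * (x - y))

def two_scale_noisy_gda(n=200, d=1, tau=0.2, eta=10.0, h=1e-3, steps=2000, seed=0):
    # Euler-Maruyama discretization of a two-scale noisy GDA particle system on the
    # torus [0, 1)^d: noisy descent for the x-particles (approximating mu_t) and
    # noisy ascent, accelerated by the scale ratio eta, for the y-particles (nu_t).
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))
    Y = rng.random((n, d))
    for _ in range(steps):
        # mean-field forces: each particle feels the empirical measure of the other player
        Fx = grad_x_K(X[:, None, :], Y[None, :, :]).mean(axis=1)  # shape (n, d)
        Fy = grad_y_K(X[:, None, :], Y[None, :, :]).mean(axis=0)  # shape (n, d)
        X = (X - h * Fx + np.sqrt(2.0 * tau * h) * rng.standard_normal((n, d))) % 1.0
        Y = (Y + eta * h * Fy + np.sqrt(2.0 * eta * tau * h) * rng.standard_normal((n, d))) % 1.0
    return X, Y

if __name__ == "__main__":
    X, Y = two_scale_noisy_gda()
    print(X.mean(axis=0), Y.mean(axis=0))

At a fixed temperature tau, taking the scale ratio eta large corresponds to the fast ascent regime and taking it small to the fast descent regime studied above.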

Acknowledgments

The author thanks Lexing Ying for suggesting the question and thanks Jianfeng Lu, Yiping Lu and Chao Ma for helpful discussions at the early stage of the problem setup. The author also thanks Fei Cao and Lénaïc Chizat for their valuable feedback on an early version of the paper. This work is supported by the National Science Foundation through the award DMS-2107934.

Appendix

Appendix A Proofs of preliminary lemmas

Proof of Lemma 4.1.

First we have from the definition of EE and (2.4) that

2(μ,ν)\displaystyle{\mathcal{L}}_{2}(\mu,\nu) =maxν𝒫(𝒴)E(μ,ν)E(μ,ν)\displaystyle=\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E(\mu,\nu)-E(\mu,\nu)
=τlogZ+(μ)𝒳𝒴K(x,y)μ(dx)ν(dy)τ(ν)\displaystyle=\tau\log Z^{+}(\mu)-\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}K(x,y)\mu(dx)\nu(dy)-\tau{\mathcal{H}}(\nu)
=τ(ν|𝒦+(μ))\displaystyle=\tau{\mathcal{H}}(\nu|{\mathcal{K}}^{+}(\mu))

which proves (4.1). To see (4.2), notice again from (2.4) that

4(μ,ν)\displaystyle{\mathcal{L}}_{4}(\mu,\nu) =E(μ,ν)minμ𝒫(𝒳)E(μ,ν)\displaystyle=E(\mu,\nu)-\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}E(\mu,\nu)
=𝒳𝒴K(x,y)μ(dx)ν(dy)τ(μ)+τlogZ(ν)\displaystyle=\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}K(x,y)\mu(dx)\nu(dy)-\tau{\mathcal{H}}(\mu)+\tau\log Z^{-}(\nu)
=τ(μ|𝒦(ν)).\displaystyle=\tau{\mathcal{H}}(\mu|{\mathcal{K}}^{-}(\nu)).

Next, we prove (4.3). In fact, replacing \nu by {\mathcal{K}}^{+}(\mu) in the last display leads to

τ(μ|𝒦(𝒦+(μ)))\displaystyle\tau{\mathcal{H}}(\mu|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu))) =4(μ,𝒦+(μ))\displaystyle={\mathcal{L}}_{4}(\mu,{\mathcal{K}}^{+}(\mu))
=E(μ,𝒦+(μ))minμ𝒫(𝒳)E(μ,𝒦+(μ))\displaystyle=E(\mu,{\mathcal{K}}^{+}(\mu))-\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}E(\mu,{\mathcal{K}}^{+}(\mu))
=maxν𝒫(𝒴)E(μ,ν)minμ𝒫(𝒳)E(μ,𝒦+(μ))\displaystyle=\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E(\mu,\nu)-\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}E(\mu,{\mathcal{K}}^{+}(\mu))
maxν𝒫(𝒴)E(μ,ν)minμ𝒫(𝒳)maxν𝒫(𝒴)E(μ,ν)\displaystyle\geq\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E(\mu,\nu)-\min_{\mu\in{\mathcal{P}}({\mathcal{X}})}\max_{\nu\in{\mathcal{P}}({\mathcal{Y}})}E(\mu,\nu)
=1(μ),\displaystyle={\mathcal{L}}_{1}(\mu),

which proves the lower bound part of (4.3). To prove the upper bound part, we first compute the first-variation of the log partition function logZ+(μ)\log Z^{+}(\mu) as

(A.1) μlogZ+(μ)=τ1𝒴K(x,y)𝑑𝒦+(μ)(y)=log(𝒦(𝒦+(μ)))logZ(𝒦+(μ)).\frac{\partial}{\partial\mu}\log Z^{+}(\mu)=\tau^{-1}\int_{{\mathcal{Y}}}K(x,y)d{\mathcal{K}}^{+}(\mu)(y)=-\log({\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu)))-\log Z^{-}({\mathcal{K}}^{+}(\mu)).

Then an application of the convexity of μlogZ+(μ)\mu{\rightarrow}\log Z^{+}(\mu) shown in Lemma A.1 yields

logZ+(μ)logZ+(μ)\displaystyle\log Z^{+}(\mu)-\log Z^{+}(\mu^{\ast}) 𝒳logZ+(μ)μ|μ=μ(dμdμ)\displaystyle\geq\int_{{\mathcal{X}}}\frac{\partial\log Z^{+}(\mu)}{\partial\mu}\Big{|}_{\mu=\mu^{\ast}}(d\mu-d\mu^{\ast})
=𝒳log(𝒦(𝒦+(μ)))(dμdμ)\displaystyle=-\int_{{\mathcal{X}}}\log({\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu^{\ast})))(d\mu-d\mu^{\ast})
=𝒳log(μ)(dμdμ).\displaystyle=-\int_{{\mathcal{X}}}\log(\mu^{\ast})(d\mu-d\mu^{\ast}).

Consequently, one obtains that

1(μ)\displaystyle{\mathcal{L}}_{1}(\mu) =1(μ)1(μ)\displaystyle={\mathcal{E}}_{1}(\mu)-{\mathcal{E}}_{1}(\mu^{\ast})
=τ((μ)(μ))+τ(logZ+(μ)logZ+(μ))\displaystyle=-\tau({\mathcal{H}}(\mu)-{\mathcal{H}}(\mu^{\ast}))+\tau(\log Z^{+}(\mu)-\log Z^{+}(\mu^{\ast}))
\displaystyle\geq\tau\Big{(}\int\log\mu\,d\mu-\int\log\mu^{\ast}d\mu^{\ast}-\int_{{\mathcal{X}}}\log(\mu^{\ast})(d\mu-d\mu^{\ast})\Big{)}
=τ(μ|μ).\displaystyle=\tau{\mathcal{H}}(\mu|\mu^{\ast}).

Finally, the inequality (4.4) follows from the same reasoning above by exploiting the convexity of νlogZ(ν)\nu{\rightarrow}\log Z^{-}(\nu).

The following lemma establishes the convexity of the log partition functions logZ+(μ)\log Z^{+}(\mu) and logZ(ν)\log Z^{-}(\nu). It was proved in [33, Proposition 3], but for completeness we include its proof here.

Lemma A.1.

Both the functional μlogZ+(μ)\mu{\rightarrow}\log Z^{+}(\mu) and the functional νlogZ(ν)\nu{\rightarrow}\log Z^{-}(\nu) are convex.

Proof.

We only present the proof for the convexity of the map μlogZ+(μ)\mu{\rightarrow}\log Z^{+}(\mu) since the other case can be proved in the same manner. For any μ1,μ2𝒫(𝒳)\mu_{1},\mu_{2}\in{\mathcal{P}}({\mathcal{X}}) and α(0,1)\alpha\in(0,1),

\displaystyle\log Z^{+}(\alpha\mu_{1}+(1-\alpha)\mu_{2})
=log(𝒴exp(𝒳τ1K(x,y)(αμ1(dx)+(1α)μ2(dx)))𝑑y)\displaystyle=\log\Big{(}\int_{{\mathcal{Y}}}\exp\Big{(}\int_{{\mathcal{X}}}\tau^{-1}K(x,y)(\alpha\mu_{1}(dx)+(1-\alpha)\mu_{2}(dx))\Big{)}dy\Big{)}
=log(𝒴exp(α𝒳τ1K(x,y)μ1(dx))exp((1α)𝒳τ1K(x,y)μ2(dx))𝑑y)\displaystyle=\log\Big{(}\int_{{\mathcal{Y}}}\exp\Big{(}\alpha\int_{{\mathcal{X}}}\tau^{-1}K(x,y)\mu_{1}(dx)\Big{)}\cdot\exp\Big{(}(1-\alpha)\int_{{\mathcal{X}}}\tau^{-1}K(x,y)\mu_{2}(dx)\Big{)}dy\Big{)}
log((𝒴exp(𝒳τ1K(x,y)μ1(dx))𝑑y)α(𝒴exp(𝒳τ1K(x,y)μ2(dx))𝑑y)1α)\displaystyle\leq\log\Big{(}\Big{(}\int_{{\mathcal{Y}}}\exp\Big{(}\int_{{\mathcal{X}}}\tau^{-1}K(x,y)\mu_{1}(dx)\Big{)}dy\Big{)}^{\alpha}\cdot\Big{(}\int_{{\mathcal{Y}}}\exp\Big{(}\int_{{\mathcal{X}}}\tau^{-1}K(x,y)\mu_{2}(dx)\Big{)}dy\Big{)}^{1-\alpha}\Big{)}
\displaystyle=\alpha\log Z^{+}(\mu_{1})+(1-\alpha)\log Z^{+}(\mu_{2}).
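The inequality above is Hölder's inequality with exponents 1/\alpha and 1/(1-\alpha): for nonnegative measurable functions f and g on {\mathcal{Y}},

\int_{{\mathcal{Y}}}f(y)^{\alpha}g(y)^{1-\alpha}dy\leq\Big{(}\int_{{\mathcal{Y}}}f(y)dy\Big{)}^{\alpha}\Big{(}\int_{{\mathcal{Y}}}g(y)dy\Big{)}^{1-\alpha},

applied here with f=\exp(\int_{{\mathcal{X}}}\tau^{-1}K(x,\cdot)\mu_{1}(dx)) and g=\exp(\int_{{\mathcal{X}}}\tau^{-1}K(x,\cdot)\mu_{2}(dx)). ∎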

Proof of Proposition 4.1.

First let us compute the functional gradient 1u(μt).\frac{\partial{\mathcal{L}}_{1}}{\partial u}(\mu_{t}). In fact, thanks to the fact that 1(μ)=1(μ)1(μ){\mathcal{L}}_{1}(\mu)={\mathcal{E}}_{1}(\mu)-{\mathcal{E}}_{1}(\mu^{\ast}) and (4.12), one has that

1u|μ=μt\displaystyle\frac{\partial{\mathcal{L}}_{1}}{\partial u}\Big{|}_{\mu=\mu_{t}} =τu((μ)logZ+(μ))|μ=μt\displaystyle=-\tau\frac{\partial}{\partial u}\Big{(}{\mathcal{H}}(\mu)-\log Z^{+}(\mu)\Big{)}\Big{|}_{\mu=\mu_{t}}
=τ(logμ+τ1𝒴K(x,y)𝑑𝒦+(μ)(y))|μ=μt\displaystyle=\tau\Big{(}\log\mu+\tau^{-1}\int_{{\mathcal{Y}}}K(x,y)d{\mathcal{K}}^{+}(\mu)(y)\Big{)}\Big{|}_{\mu=\mu_{t}}
=τ(log(dμtd𝒦(𝒦+(μt)))logZ(𝒦+(μt))),\displaystyle=\tau\Big{(}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}\Big{)}-\log Z^{-}({\mathcal{K}}^{+}(\mu_{t}))\Big{)},

where in the second identity we have used (A.1). Therefore it follows from (1.6) and the Cauchy-Schwarz inequality that

ddt1(μt)\displaystyle\frac{d}{dt}{\mathcal{L}}_{1}(\mu_{t}) =𝒳1u(μt)d(tμt)\displaystyle=\int_{{\mathcal{X}}}\frac{\partial{\mathcal{L}}_{1}}{\partial u}(\mu_{t})d(\partial_{t}\mu_{t})
\displaystyle=\tau^{2}\int_{{\mathcal{X}}}\Big{(}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}\Big{)}-\log Z^{-}({\mathcal{K}}^{+}(\mu_{t}))\Big{)}\nabla_{x}\cdot\Big{(}\mu_{t}\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\Big{)}dx
=τ2𝒳xlog(dμtd𝒦(𝒦+(μt)))xlog(dμtd𝒦(νt))𝑑μt(x)\displaystyle=-\tau^{2}\int_{{\mathcal{X}}}\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}\Big{)}\cdot\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}d\mu_{t}(x)
=τ2𝒳|xlog(dμtd𝒦(𝒦+(μt)))|2𝑑μt(x)\displaystyle=-\tau^{2}\int_{{\mathcal{X}}}\Big{|}\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}\Big{)}\Big{|}^{2}d\mu_{t}(x)
τ2𝒳xlog(d𝒦(𝒦+(μt))d𝒦(νt))xlog(dμtd𝒦(𝒦+(μt)))𝑑μt(x)\displaystyle\qquad-\tau^{2}\int_{{\mathcal{X}}}\nabla_{x}\log\Big{(}\frac{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\cdot\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}\Big{)}d\mu_{t}(x)
τ22(μt|𝒦(𝒦+(μt)))+τ22𝒳|xlog(d𝒦(𝒦+(μt))d𝒦(νt))|2𝑑μt(x).\displaystyle\leq-\frac{\tau^{2}}{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t})))+\frac{\tau^{2}}{2}\int_{{\mathcal{X}}}\Big{|}\nabla_{x}\log\Big{(}\frac{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\Big{|}^{2}d\mu_{t}(x).

Furthermore, using the fact that

𝒦(𝒦+(μt))exp(𝒴τ1K(x,y)𝑑𝒦+(μt)(y)),𝒦(νt)exp(𝒴τ1K(x,y)𝑑νt(y)),{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))\propto\exp\Big{(}-\int_{{\mathcal{Y}}}\tau^{-1}K(x,y)d{\mathcal{K}}^{+}(\mu_{t})(y)\Big{)},\quad{\mathcal{K}}^{-}(\nu_{t})\propto\exp\Big{(}-\int_{{\mathcal{Y}}}\tau^{-1}K(x,y)d\nu_{t}(y)\Big{)},

one derives that

(A.2) 12𝒳|xlog(d𝒦(𝒦+(μt))d𝒦(νt))|2𝑑μt(x)\displaystyle\frac{1}{2}\int_{{\mathcal{X}}}\Big{|}\nabla_{x}\log\Big{(}\frac{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\Big{|}^{2}d\mu_{t}(x) =12τ2𝒳|𝒴xK(x,y)(d𝒦+(μt)(y)dνt(y))|2𝑑μt(x).\displaystyle=\frac{1}{2\tau^{2}}\int_{{\mathcal{X}}}\Big{|}\int_{{\mathcal{Y}}}\nabla_{x}K(x,y)(d{\mathcal{K}}^{+}(\mu_{t})(y)-d\nu_{t}(y))\Big{|}^{2}d\mu_{t}(x).

Moreover, using the mean value theorem, one has for any y0𝒴y_{0}\in{\mathcal{Y}} that

\nabla_{x}K(x,y)-\nabla_{x}K(x,y_{0})=\Big{[}\int_{0}^{1}\nabla_{xy}^{2}K(x,y_{0}+s(y-y_{0}))\,ds\Big{]}(y-y_{0}).

Inserting the last identity into (A.2) leads to

(A.3) 12𝒳|xlog(d𝒦(𝒦+(μt))d𝒦(νt))|2𝑑μt(x)\displaystyle\frac{1}{2}\int_{{\mathcal{X}}}\Big{|}\nabla_{x}\log\Big{(}\frac{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\Big{|}^{2}d\mu_{t}(x) Kxy2Diam(𝒴)22τ2TV2(𝒦+(μt),νt)\displaystyle\leq\frac{K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}}{2\tau^{2}}\text{TV}^{2}({\mathcal{K}}^{+}(\mu_{t}),\nu_{t})
Kxy2Diam(𝒴)2τ2(νt|𝒦+(μt)),\displaystyle\leq\frac{K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}}{\tau^{2}}{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t})),

where we have used Pinsker's inequality in the last inequality above. Combining the last two estimates proves (4.5).

Next we proceed with the proof of (4.6). Recall from (4.1) that

2(μ,ν)=τ(ν|𝒦+(μ))=τ𝒴log(ν𝒦+(μ))𝑑ν.{\mathcal{L}}_{2}(\mu,\nu)=\tau{\mathcal{H}}(\nu|{\mathcal{K}}^{+}(\mu))=\tau\int_{{\mathcal{Y}}}\log\Big{(}\frac{\nu}{{\mathcal{K}}^{+}(\mu)}\Big{)}d\nu.

Hence

(A.4) ddt2(μt,νt)=τ(𝒴log(νt𝒦+(μt))d(tνt)𝒴ddtlog(𝒦+(μt))𝑑νt)\displaystyle\frac{d}{dt}{\mathcal{L}}_{2}(\mu_{t},\nu_{t})=\tau\Big{(}\int_{{\mathcal{Y}}}\log\Big{(}\frac{\nu_{t}}{{\mathcal{K}}^{+}(\mu_{t})}\Big{)}d(\partial_{t}\nu_{t})-\int_{{\mathcal{Y}}}\frac{d}{dt}\log({\mathcal{K}}^{+}(\mu_{t}))d\nu_{t}\Big{)}

Using the second equation of (1.6) that νt\nu_{t} solves, one sees that the first term on the right side above becomes

(A.5) \tau\int_{{\mathcal{Y}}}\log\Big{(}\frac{\nu_{t}}{{\mathcal{K}}^{+}(\mu_{t})}\Big{)}d(\partial_{t}\nu_{t})=-\eta\tau^{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t})).

To compute the second term on the right side of (A.4), observe that

log𝒦+(μt)=τ1𝒳K(x,y)𝑑μt(x)log(Z+(μt)).\log{\mathcal{K}}^{+}(\mu_{t})=\tau^{-1}\int_{{\mathcal{X}}}K(x,y)d\mu_{t}(x)-\log(Z^{+}(\mu_{t})).

As a result of the above and (A.1), one obtains that

τ𝒴ddtlog(𝒦+(μt))𝑑νt\displaystyle-\tau\int_{{\mathcal{Y}}}\frac{d}{dt}\log({\mathcal{K}}^{+}(\mu_{t}))d\nu_{t}
=τ𝒴𝒳(τ1K(x,y)log(Z+(μ))μ(μt))d(tμt(x))𝑑νt(y)\displaystyle=-\tau\int_{{\mathcal{Y}}}\int_{{\mathcal{X}}}\Big{(}\tau^{-1}K(x,y)-\frac{\partial\log(Z^{+}(\mu))}{\partial\mu}(\mu_{t})\Big{)}d(\partial_{t}\mu_{t}(x))d\nu_{t}(y)
=τ𝒳(𝒴τ1K(x,y)𝑑νt(y)𝒴τ1K(x,y)𝑑𝒦+(μt)(y))d(tμt(x))\displaystyle=-\tau\int_{{\mathcal{X}}}\Big{(}\int_{{\mathcal{Y}}}\tau^{-1}K(x,y)d\nu_{t}(y)-\int_{{\mathcal{Y}}}\tau^{-1}K(x,y)d{\mathcal{K}}^{+}(\mu_{t})(y)\Big{)}d(\partial_{t}\mu_{t}(x))
=τ𝒳(log(𝒵(𝒦+(μt))𝒦(𝒦+(μt)))log(𝒵(νt)𝒦(νt)))d(tμt(x))\displaystyle=-\tau\int_{{\mathcal{X}}}\Big{(}\log({\mathcal{Z}}^{-}({\mathcal{K}}^{+}(\mu_{t})){\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t})))-\log({\mathcal{Z}}^{-}(\nu_{t}){\mathcal{K}}^{-}(\nu_{t}))\Big{)}d(\partial_{t}\mu_{t}(x))
=τ2𝒳xlog(d𝒦(𝒦+(μt))d𝒦(νt))xlog(dμtd𝒦(νt))𝑑μt(x)\displaystyle=\tau^{2}\int_{{\mathcal{X}}}\nabla_{x}\log\Big{(}\frac{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\cdot\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}d\mu_{t}(x)
=τ2(𝒳|xlog(d𝒦(𝒦+(μt))d𝒦(νt))|2dμt(x)\displaystyle=\tau^{2}\Big{(}\int_{{\mathcal{X}}}\Big{|}\nabla_{x}\log\Big{(}\frac{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\Big{|}^{2}d\mu_{t}(x)
+𝒳xlog(d𝒦(𝒦+(μt))d𝒦(νt))xlog(dμtd𝒦(𝒦+(μt)))dμt(x))\displaystyle\qquad+\int_{{\mathcal{X}}}\nabla_{x}\log\Big{(}\frac{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\cdot\nabla_{x}\log\Big{(}\frac{d\mu_{t}}{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}\Big{)}d\mu_{t}(x)\Big{)}
τ22(μt|𝒦(𝒦+(μt)))+3τ22𝒳|xlog(d𝒦(𝒦+(μt))d𝒦(νt))|2𝑑μt(x)\displaystyle\leq\frac{\tau^{2}}{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t})))+\frac{3\tau^{2}}{2}\int_{{\mathcal{X}}}\Big{|}\nabla_{x}\log\Big{(}\frac{d{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t}))}{d{\mathcal{K}}^{-}(\nu_{t})}\Big{)}\Big{|}^{2}d\mu_{t}(x)
τ22(μt|𝒦(𝒦+(μt)))+3Kxy2Diam(𝒴)2(νt|𝒦+(μt)),\displaystyle\leq\frac{\tau^{2}}{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}({\mathcal{K}}^{+}(\mu_{t})))+3K_{xy}^{2}\mathrm{Diam}({\mathcal{Y}})^{2}{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}(\mu_{t})),

where we have used (A.3) in the last inequality above. The estimate (4.6) then follows from above, (A.5) and (A.4). This completes the proof of the proposition. ∎

Proof of Lemma 4.2.

The proof of the lemma can be found in [16, Appendix C.4]. We would like to elaborate on the proof of (4.22). In fact, by tracking the proof of [16, Theorem 5], one sees that (μτ,ντ)(\mu_{\tau}^{\ast},\nu_{\tau}^{\ast}) is an ϵ\epsilon-Nash equilibrium provided that

eε2τ(Vol(Bxδ)1Vol(Bxδ)Vol(Byδ)1Vol(Byδ))2(4K/ϵ1),e^{\frac{{\varepsilon}}{2\tau}}\Big{(}\frac{\text{Vol}(B_{x}^{\delta})}{1-\text{Vol}(B_{x}^{\delta})}\vee\frac{\text{Vol}(B_{y}^{\delta})}{1-\text{Vol}(B_{y}^{\delta})}\Big{)}\geq 2(4\|K\|_{\infty}/\epsilon-1),

where we recall that δ=ϵ/(2Lip(K))\delta=\epsilon/(2\text{Lip}(K)). Thanks to the lower bound on the volume of small balls in 𝒳{\mathcal{X}} and 𝒴{\mathcal{Y}}, the inequality above holds if

eϵ2τC1ϵ(d𝒳d𝒴+1)e^{\frac{\epsilon}{2\tau}}\geq C_{1}\epsilon^{-(d_{{\mathcal{X}}}\vee d_{{\mathcal{Y}}}+1)}

for some C1>0C_{1}>0 depending only on K,d𝒳,d𝒴K,d_{{\mathcal{X}}},d_{{\mathcal{Y}}}. Now with the choice (4.22) for ϵ\epsilon with β>d𝒳d𝒴+1\beta>d_{{\mathcal{X}}}\vee d_{{\mathcal{Y}}}+1, one has for τ\tau sufficiently small that

eε2τ1τβ>C1βd𝒳d𝒴+1(1τlog(1/τ))d𝒳d𝒴+1=C1ϵ(d𝒳d𝒴+1).\displaystyle e^{\frac{{\varepsilon}}{2\tau}}\geq\frac{1}{\tau^{\beta}}>\frac{C_{1}}{\beta^{d_{{\mathcal{X}}}\vee d_{{\mathcal{Y}}}+1}}\Big{(}\frac{1}{\tau\log(1/\tau)}\Big{)}^{d_{{\mathcal{X}}}\vee d_{{\mathcal{Y}}}+1}=C_{1}\epsilon^{-(d_{{\mathcal{X}}}\vee d_{{\mathcal{Y}}}+1)}.

Appendix B Proofs of convergence for Mean-Field GDA in the fast descent regime

Let us first state a proposition which bounds the time-derivatives of {\mathcal{L}}_{3}(\nu_{t}) and {\mathcal{L}}_{4}(\mu_{t},\nu_{t}).

Proposition B.1.

Let (\mu_{t},\nu_{t}) be the solution to the GDA dynamics (1.6). Then

(B.1) ddt3(νt)ητ22(νt|𝒦+(𝒦(νt)))+ηKxy2Diam(𝒳)2(μt|𝒦(νt))\frac{d}{dt}{\mathcal{L}}_{3}(\nu_{t})\leq-\frac{\eta\tau^{2}}{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t})))+\eta K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}\cdot{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))

and

(B.2) ddt4(μt,νt)τ2(μt|𝒦(νt))+ητ22(νt|𝒦+(𝒦(νt)))+3ηKxy2Diam(𝒳)2(μt|𝒦(νt)).\frac{d}{dt}{\mathcal{L}}_{4}(\mu_{t},\nu_{t})\leq-\tau^{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))+\frac{\eta\tau^{2}}{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t})))+3\eta K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}\cdot{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t})).
Proof.

The proof follows essentially the same reasoning as the proof of Proposition 4.1. In fact, it can be shown by a straightforward calculation that

\displaystyle\frac{d}{dt}{\mathcal{L}}_{3}(\nu_{t}) =-\eta\tau^{2}\int_{{\mathcal{Y}}}\nabla_{y}\log\Big{(}\frac{d\nu_{t}}{d{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t}))}\Big{)}\cdot\nabla_{y}\log\Big{(}\frac{d\nu_{t}}{d{\mathcal{K}}^{+}(\mu_{t})}\Big{)}d\nu_{t}(y)
ητ22(νt|𝒦+(𝒦(νt)))+ητ22𝒴|ylog(d𝒦+(𝒦(νt))d𝒦+(μt))|2𝑑νt(y)\displaystyle\leq-\frac{\eta\tau^{2}}{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t})))+\frac{\eta\tau^{2}}{2}\int_{{\mathcal{Y}}}\Big{|}\nabla_{y}\log\Big{(}\frac{d{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t}))}{d{\mathcal{K}}^{+}(\mu_{t})}\Big{)}\Big{|}^{2}d\nu_{t}(y)

Similar to (A.3), one can obtain that

ητ22𝒴|ylog(d𝒦+(𝒦(νt))d𝒦+(μt))|2𝑑νt(y)ηKxy2Diam(𝒳)2(μt|𝒦(νt)).\displaystyle\frac{\eta\tau^{2}}{2}\int_{{\mathcal{Y}}}\Big{|}\nabla_{y}\log\Big{(}\frac{d{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t}))}{d{\mathcal{K}}^{+}(\mu_{t})}\Big{)}\Big{|}^{2}d\nu_{t}(y)\leq\eta K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}\cdot{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t})).

Combining the last two inequalities yields (B.1).

As for the time-derivative of {\mathcal{L}}_{4}(\mu_{t},\nu_{t}), one has that

ddt4(μt,νt)=τ2(μt|𝒦(νt))ητ2𝒴ylog(d𝒦+(μt)d𝒦+(𝒦(νt)))ylog(dνtd𝒦+(μt))𝑑νt(y).\displaystyle\frac{d}{dt}{\mathcal{L}}_{4}(\mu_{t},\nu_{t})=-\tau^{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))-\eta\tau^{2}\int_{{\mathcal{Y}}}\nabla_{y}\log\Big{(}\frac{d{\mathcal{K}}^{+}(\mu_{t})}{d{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t}))}\Big{)}\cdot\nabla_{y}\log\Big{(}\frac{d\nu_{t}}{d{\mathcal{K}}^{+}(\mu_{t})}\Big{)}d\nu_{t}(y).

The second term on the right side above is bounded by

\displaystyle\frac{\eta\tau^{2}}{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t})))+\frac{3\eta\tau^{2}}{2}\int_{{\mathcal{Y}}}\Big{|}\nabla_{y}\log\Big{(}\frac{d{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t}))}{d{\mathcal{K}}^{+}(\mu_{t})}\Big{)}\Big{|}^{2}d\nu_{t}(y)
ητ22(νt|𝒦+(𝒦(νt)))+3ηKxy2Diam(𝒳)2(μt|𝒦(νt)).\displaystyle\leq\frac{\eta\tau^{2}}{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t})))+3\eta K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}\cdot{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t})).

The estimate (B.2) follows directly from the last two inequalities. ∎

Proof of Part (ii) of Theorem 2.1.

It follows from Proposition B.1 that

ddt~(μt,νt)\displaystyle\frac{d}{dt}\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t}) ητ22(1γ)(νt|𝒦+(𝒦(νt)))γτ2(μt|𝒦(νt))\displaystyle\leq-\frac{\eta\tau^{2}}{2}(1-\gamma){\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t})))-\gamma\tau^{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))
+ηKxy2Diam(𝒳)2(1+3γ)(μt|𝒦(νt)).\displaystyle+\eta K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}(1+3\gamma)\cdot{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t})).

Thanks to Lemma 2.1 and (4.4),

τ(νt|𝒦+(𝒦(νt)))\displaystyle\tau{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t}))) τλLS(νt|𝒦+(𝒦(νt)))\displaystyle\geq\tau\lambda_{LS}{\mathcal{H}}(\nu_{t}|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t})))
λLS3(νt).\displaystyle\geq\lambda_{LS}{\mathcal{L}}_{3}(\nu_{t}).

As a result of the above and (4.2),

\displaystyle\frac{d}{dt}\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t}) \leq-\frac{\eta\tau}{2}(1-\gamma)\lambda_{LS}{\mathcal{L}}_{3}(\nu_{t})-\tau\Big{(}\gamma\lambda_{LS}-\frac{\eta K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}(1+3\gamma)}{\tau^{2}}\Big{)}{\mathcal{L}}_{4}(\mu_{t},\nu_{t}).

Therefore setting

η=γτ2λLS2Kxy2Diam(𝒳)2(1+3γ)\eta=\frac{\gamma\tau^{2}\lambda_{LS}}{2K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}(1+3\gamma)}

and applying Grönwall's inequality to the above leads to

~(μt,νt)eαt~(μ0,ν0)\displaystyle\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t})\leq e^{-\alpha t}\widetilde{{\mathcal{L}}}(\mu_{0},\nu_{0})

with

α=τλLS2(1η(1γ)).\alpha=\frac{\tau\lambda_{LS}}{2}\Big{(}1\wedge\eta(1-\gamma)\Big{)}.
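For concreteness, this choice of \eta balances the two terms: it makes

\frac{\eta K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}(1+3\gamma)}{\tau^{2}}=\frac{\gamma\lambda_{LS}}{2},

so that, recalling that \widetilde{{\mathcal{L}}}(\mu,\nu)={\mathcal{L}}_{3}(\nu)+\gamma{\mathcal{L}}_{4}(\mu,\nu),

\frac{d}{dt}\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t})\leq-\frac{\eta\tau(1-\gamma)\lambda_{LS}}{2}{\mathcal{L}}_{3}(\nu_{t})-\frac{\tau\lambda_{LS}}{2}\gamma{\mathcal{L}}_{4}(\mu_{t},\nu_{t})\leq-\alpha\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t}),

which is the differential inequality to which Grönwall's inequality was applied. ∎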

Proof of Part (ii) of Theorem 2.2.

Thanks to Part (ii) of Theorem 2.1 and the bound (4.4), we have that

(B.3) \tau{\mathcal{H}}(\nu_{t}|\nu^{\ast})\leq\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t})\leq e^{-\alpha_{1}t}\widetilde{{\mathcal{L}}}(\mu_{0},\nu_{0})

and that

(B.4) \tau{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))={\mathcal{L}}_{4}(\mu_{t},\nu_{t})\leq\frac{e^{-\alpha_{1}t}}{\gamma}\widetilde{{\mathcal{L}}}(\mu_{0},\nu_{0}).

Next to obtain the exponential decay of τ(μt|μ)\tau{\mathcal{H}}(\mu_{t}|\mu^{\ast}), observe that

(B.5) τ(μt|μ)\displaystyle\tau{\mathcal{H}}(\mu_{t}|\mu^{\ast}) =τ(μt|𝒦(νt))+τ𝒳(log𝒦(νt)logμ)𝑑μt\displaystyle=\tau{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))+\tau\int_{{\mathcal{X}}}(\log{\mathcal{K}}^{-}(\nu_{t})-\log\mu^{\ast})d\mu_{t}
=τ(μt|𝒦(νt))+τ𝒳(log𝒦(νt)logμ)d(μtμ)τ(μ|𝒦(νt))\displaystyle=\tau{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))+\tau\int_{{\mathcal{X}}}(\log{\mathcal{K}}^{-}(\nu_{t})-\log\mu^{\ast})d(\mu_{t}-\mu^{\ast})-\tau{\mathcal{H}}(\mu^{\ast}|{\mathcal{K}}^{-}(\nu_{t}))
τ(μt|𝒦(νt))+τ𝒳(log𝒦(νt)logμ)d(μtμ).\displaystyle\leq\tau{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))+\tau\int_{{\mathcal{X}}}(\log{\mathcal{K}}^{-}(\nu_{t})-\log\mu^{\ast})d(\mu_{t}-\mu^{\ast}).

Since μ=𝒦(ν)\mu^{\ast}={\mathcal{K}}^{-}(\nu^{\ast}), we can write

log𝒦(νt)logμ=τ1𝒴K(x,y)(νtν)(dy)(logZ(νt)logZ(ν)).\displaystyle\log{\mathcal{K}}^{-}(\nu_{t})-\log\mu^{\ast}=-\tau^{-1}\int_{{\mathcal{Y}}}K(x,y)(\nu_{t}-\nu^{\ast})(dy)-(\log Z^{-}(\nu_{t})-\log Z^{-}(\nu^{\ast})).

Hence the second term on the RHS of the last line of (B.5) can be bounded in a similar manner to (4.11). Namely,

τ𝒳(log𝒦(νt)logμ)d(μtμ)\displaystyle\tau\int_{{\mathcal{X}}}(\log{\mathcal{K}}^{-}(\nu_{t})-\log\mu^{\ast})d(\mu_{t}-\mu^{\ast}) =𝒳𝒴K(x,y)(νtν)(dy)(μtμ)(dx)\displaystyle=\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}K(x,y)(\nu_{t}-\nu^{\ast})(dy)(\mu_{t}-\mu^{\ast})(dx)
τ2(μt|μ)+2K2τ(νt|ν).\displaystyle\leq\frac{\tau}{2}{\mathcal{H}}(\mu_{t}|\mu^{\ast})+\frac{2\|K\|_{\infty}^{2}}{\tau}{\mathcal{H}}(\nu_{t}|\nu^{\ast}).
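In more detail, the displayed bound follows from the boundedness of K, Pinsker's inequality and Young's inequality:

\Big{|}\int_{{\mathcal{X}}}\int_{{\mathcal{Y}}}K(x,y)(\nu_{t}-\nu^{\ast})(dy)(\mu_{t}-\mu^{\ast})(dx)\Big{|}\leq\|K\|_{\infty}\text{TV}(\mu_{t},\mu^{\ast})\,\text{TV}(\nu_{t},\nu^{\ast})\leq 2\|K\|_{\infty}\sqrt{{\mathcal{H}}(\mu_{t}|\mu^{\ast})}\sqrt{{\mathcal{H}}(\nu_{t}|\nu^{\ast})}\leq\frac{\tau}{2}{\mathcal{H}}(\mu_{t}|\mu^{\ast})+\frac{2\|K\|_{\infty}^{2}}{\tau}{\mathcal{H}}(\nu_{t}|\nu^{\ast}).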

Combining the last inequality with (B.5) leads to

τ(μt|μ)\displaystyle\tau{\mathcal{H}}(\mu_{t}|\mu^{\ast}) 2τ(μt|𝒦(νt))+4K2τ(νt|ν)\displaystyle\leq 2\tau{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))+\frac{4\|K\|_{\infty}^{2}}{\tau}{\mathcal{H}}(\nu_{t}|\nu^{\ast})
\displaystyle\leq\Big{(}\frac{2}{\gamma}+\frac{4\|K\|_{\infty}^{2}}{\tau^{2}}\Big{)}e^{-\alpha_{2}t}\widetilde{{\mathcal{L}}}(\mu_{0},\nu_{0}),

where we have used (B.3) and (B.4) in the last inequality. ∎

Appendix C Proof of convergence for the annealed dynamics in the fast descent regime

Proof of Part (ii) of Theorem 2.3.

By a direct calculation, one has

(C.1) ddt3(νt)τtτt(3(νt)+2K)ηtτt22(νt|𝒦+(𝒦(νt)))+ηtKxy2Diam(𝒳)2(μt|𝒦(νt))\displaystyle\frac{d}{dt}{\mathcal{L}}_{3}(\nu_{t})\leq\frac{\tau_{t}^{\prime}}{\tau_{t}}({\mathcal{L}}_{3}(\nu_{t})+2\|K\|_{\infty})-\frac{\eta_{t}\tau_{t}^{2}}{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t})))+\eta_{t}K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))

and

(C.2) \displaystyle\frac{d}{dt}{\mathcal{L}}_{4}(\mu_{t},\nu_{t}) \leq\frac{\tau_{t}^{\prime}}{\tau_{t}}({\mathcal{L}}_{4}(\mu_{t},\nu_{t})+2\|K\|_{\infty})-\tau_{t}^{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))+\frac{\tau_{t}^{2}\eta_{t}}{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t})))
+3ηtKxy2Diam(𝒳)2(μt|𝒦(νt)).\displaystyle+3\eta_{t}K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t})).

Combining the last two displays, one can bound the time-derivative of \widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t}) as follows

ddt~(μt,νt)\displaystyle\frac{d}{dt}\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t}) τtτt~(μt,νt)+τtτt2(1+γ)Kγτt2(μt|𝒦(νt))\displaystyle\leq\frac{\tau_{t}^{\prime}}{\tau_{t}}\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t})+\frac{\tau_{t}^{\prime}}{\tau_{t}}2(1+\gamma)\|K\|_{\infty}-\gamma\tau_{t}^{2}{\mathcal{I}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))
(1γ)ηtτt22(νt|𝒦+(𝒦(νt)))+ηt(1+3γ)Kxy2Diam(𝒳)2(μt|𝒦(νt))\displaystyle-\frac{(1-\gamma)\eta_{t}\tau_{t}^{2}}{2}{\mathcal{I}}(\nu_{t}|{\mathcal{K}}^{+}({\mathcal{K}}^{-}(\nu_{t})))+\eta_{t}(1+3\gamma)K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}{\mathcal{H}}(\mu_{t}|{\mathcal{K}}^{-}(\nu_{t}))
τtτt~(μt,νt)(1γ)ηtτtλLS(τt)23(νt)τt(γλLS(τt)\displaystyle\leq\frac{\tau_{t}^{\prime}}{\tau_{t}}\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t})-\frac{(1-\gamma)\eta_{t}\tau_{t}\lambda_{LS}(\tau_{t})}{2}{\mathcal{L}}_{3}(\nu_{t})-\tau_{t}\Big{(}\gamma\lambda_{LS}(\tau_{t})
ηt(1+3γ)Kxy2Diam(𝒳)2τt2)4(μt,νt)+2τtτt(1+γ)K\displaystyle\qquad-\frac{\eta_{t}(1+3\gamma)K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}}{\tau_{t}^{2}}\Big{)}{\mathcal{L}}_{4}(\mu_{t},\nu_{t})+\frac{2\tau_{t}^{\prime}}{\tau_{t}}(1+\gamma)\|K\|_{\infty}
α2(t)~(μt,νt)+2τtτt(1+γ)K,\displaystyle\leq-\alpha_{2}(t)\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t})+\frac{2\tau_{t}^{\prime}}{\tau_{t}}(1+\gamma)\|K\|_{\infty},

where

α2(t)\displaystyle\alpha_{2}(t) =(1γ)ηtτtλLS(τt)2τt(λLS(τt)ηt(1+3γ)Kxy2Diam(𝒳)2γτt2τtτt)\displaystyle=\frac{(1-\gamma)\eta_{t}\tau_{t}\lambda_{LS}(\tau_{t})}{2}\wedge\tau_{t}\Big{(}\lambda_{LS}(\tau_{t})-\frac{\eta_{t}(1+3\gamma)K_{xy}^{2}\mathrm{Diam}({\mathcal{X}})^{2}}{\gamma\tau_{t}^{2}}-\frac{\tau_{t}^{\prime}}{\tau_{t}}\Big{)}
ξlogtt(ξ/ξ)(C1ηt(t(ξ/ξ)+C2tlogtC3ηtlog(t)2)).\displaystyle\geq\frac{\xi}{\log t}t^{-(\xi^{\ast}/\xi)}\Big{(}C_{1}\eta_{t}\wedge\Big{(}t^{-(\xi^{\ast}/\xi)}+\frac{C_{2}}{t\log t}-\frac{C_{3}\eta_{t}}{\log(t)^{2}}\Big{)}\Big{)}.

Setting ηt=clogt/t\eta_{t}=c\log t/t for some c<C2/C3c<C_{2}/C_{3} in the above yields that for every 0<ϵ<1ξ/ξ0<\epsilon<1-\xi^{\ast}/\xi and tt large enough

α2(t)Ct2ξ/ξϵ.\alpha_{2}(t)\geq Ct^{-2\xi^{\ast}/\xi-\epsilon}.

Therefore we have obtained that

ddt~(μt,νt)Ct2ξ/ξϵ~(μt,νt)+Ctlogt.\displaystyle\frac{d}{dt}\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t})\leq-Ct^{-2\xi^{\ast}/\xi-\epsilon}\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t})+\frac{C^{\prime}}{t\log t}.

Similar to the proof of (4.18), one can obtain from above that for tt large enough

~(μt,νt)C′′t1+2ξ/ξ+ϵ.\widetilde{{\mathcal{L}}}(\mu_{t},\nu_{t})\leq C^{\prime\prime}t^{-1+2\xi^{\ast}/\xi+\epsilon}.

This directly implies the entropy decay bounds in (2.15). Finally, the estimate (2.16) follows from (2.15) and the arguments used in Step 2 of the proof of Part (i) of Theorem 2.3. ∎

References

  • [1] Jacob Abernethy, Kevin A Lai, and Andre Wibisono. Last-iterate convergence rates for min-max optimization. arXiv preprint arXiv:1906.02027, 2019.
  • [2] Dyego Araújo, Roberto I Oliveira, and Daniel Yukimura. A mean-field limit for certain deep neural networks. arXiv preprint arXiv:1906.00193, 2019.
  • [3] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
  • [4] Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information theory, 39(3):930–945, 1993.
  • [5] Yann Brenier and Dmitry Vorotnikov. On optimal transport of matrix-valued measures. SIAM Journal on Mathematical Analysis, 52(3):2849–2873, 2020.
  • [6] Lucian Busoniu, Robert Babuska, and Bart De Schutter. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 38(2):156–172, 2008.
  • [7] Lénaïc Chizat. Mean-field langevin dynamics: Exponential convergence and annealing. arXiv preprint arXiv:2202.01009, 2022.
  • [8] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.
  • [9] Lenaic Chizat, Gabriel Peyré, Bernhard Schmitzer, and François-Xavier Vialard. An interpolating distance between optimal transport and fisher–rao metrics. Foundations of Computational Mathematics, 18(1):1–44, 2018.
  • [10] Bo Dai, Zhen Liu, Hanjun Dai, Niao He, Arthur Gretton, Le Song, and Dale Schuurmans. Exponential family estimation via adversarial dynamics embedding. Advances in Neural Information Processing Systems, 32, 2019.
  • [11] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. Advances in neural information processing systems, 31, 2018.
  • [12] Constantinos Daskalakis and Ioannis Panageas. Last-iterate convergence: Zero-sum games and constrained min-max optimization. 10th Innovations in Theoretical Computer Science, 2019.
  • [13] Thinh Doan. Convergence rates of two-time-scale gradient descent-ascent dynamics for solving nonconvex min-max problems. In Learning for Dynamics and Control Conference, pages 192–206. PMLR, 2022.
  • [14] Carles Domingo-Enrich, Alberto Bietti, Marylou Gabrié, Joan Bruna, and Eric Vanden-Eijnden. Dual training of energy-based models with overparametrized shallow neural networks. arXiv preprint arXiv:2107.05134, 2021.
  • [15] Carles Domingo-Enrich and Joan Bruna. Simultaneous transport evolution for minimax equilibria on measures. arXiv preprint arXiv:2202.06460, 2022.
  • [16] Carles Domingo-Enrich, Samy Jelassi, Arthur Mensch, Grant Rotskoff, and Joan Bruna. A mean-field analysis of two-player zero-sum games. Advances in neural information processing systems, 33:20215–20226, 2020.
  • [17] Andreas Eberle, Arnaud Guillin, and Raphael Zimmer. Quantitative harris-type theorems for diffusions and mckean–vlasov processes. Transactions of the American Mathematical Society, 371(10):7135–7173, 2019.
  • [18] Tadahisa Funaki. A certain class of diffusion processes associated with nonlinear parabolic equations. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 67(3):331–348, 1984.
  • [19] Irving L Glicksberg. A further generalization of the kakutani fixed point theorem, with application to nash equilibrium points. Proceedings of the American Mathematical Society, 3(1):170–174, 1952.
  • [20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • [21] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  • [22] Ya-Ping Hsieh, Chen Liu, and Volkan Cevher. Finding mixed nash equilibria of generative adversarial networks. In International Conference on Machine Learning, pages 2810–2819. PMLR, 2019.
  • [23] Kaitong Hu, Zhenjie Ren, David Šiška, and Łukasz Szpruch. Mean-field langevin dynamics and energy landscape of neural networks. In Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, volume 57, pages 2043–2065. Institut Henri Poincaré, 2021.
  • [24] Chi Jin, Praneeth Netrapalli, and Michael Jordan. What is local optimality in nonconvex-nonconcave minimax optimization? In International conference on machine learning, pages 4880–4889. PMLR, 2020.
  • [25] Stanislav Kondratyev, Léonard Monsaingeon, and Dmitry Vorotnikov. A new optimal transport distance on the space of finite radon measures. Advances in Differential Equations, 21(11/12):1117–1164, 2016.
  • [26] Stanislav Kondratyev and Dmitry Vorotnikov. Spherical hellinger–kantorovich gradient flows. SIAM Journal on Mathematical Analysis, 51(3):2053–2084, 2019.
  • [27] Vaios Laschos and Alexander Mielke. Geometric properties of cones with applications on the hellinger–kantorovich space, and a new distance on the space of probability measures. Journal of Functional Analysis, 276(11):3529–3576, 2019.
  • [28] Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, pages 6083–6093. PMLR, 2020.
  • [29] Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, and Lexing Ying. A mean field analysis of deep resnet and beyond: Towards provably optimization via overparameterization from depth. In International Conference on Machine Learning, pages 6426–6436. PMLR, 2020.
  • [30] Yulong Lu, Jianfeng Lu, and James Nolen. Accelerating langevin sampling with birth-death. arXiv preprint arXiv:1905.09863, 2019.
  • [31] Yulong Lu, Dejan Slepčev, and Lihan Wang. Birth-death dynamics for sampling: Global convergence, approximations and their asymptotics. arXiv preprint arXiv:2211.00450, 2022.
  • [32] Chao Ma, Lei Wu, et al. The barron space and the flow-induced function spaces for neural network models. Constructive Approximation, 55(1):369–406, 2022.
  • [33] Chao Ma and Lexing Ying. Provably convergent quasistatic dynamics for mean-field two-player zero-sum games. In International Conference on Learning Representations, 2021.
  • [34] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.
  • [35] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
  • [36] Panayotis Mertikopoulos, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. arXiv preprint arXiv:1807.02629, 2018.
  • [37] Andrew Ronald Mitchell and David Francis Griffiths. The finite difference method in partial differential equations. A Wiley-Interscience Publication, 1980.
  • [38] John Nash. Non-cooperative games. Annals of Mathematics, 54:286–295, 1951.
  • [39] Phan-Minh Nguyen. Mean field limit of the learning dynamics of multilayer neural networks. arXiv preprint arXiv:1902.02880, 2019.
  • [40] Hukukane Nikaido. On von neumann’s minimax theorem. Pacific J. Math, 4(1):65–72, 1954.
  • [41] Hukukane Nikaido and Kazuo Isoda. Note on non-cooperative convex games. Pacific Journal of Mathematics, 5(S1):807–815, 1955.
  • [42] Atsushi Nitanda, Denny Wu, and Taiji Suzuki. Convex analysis of the mean field langevin dynamics. In International Conference on Artificial Intelligence and Statistics, pages 9741–9757. PMLR, 2022.
  • [43] Georgii Ivanovich Petrov. Application of the method of galerkin to a problem involving the stationary flow of a viscous fluid. Prikl. Matem. Mekh., 4(3):3–12, 1940.
  • [44] Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703. PMLR, 2017.
  • [45] Grant Rotskoff, S Jelassi, Joan Bruna, and Eric Vanden-Eijnden. Global convergence of neuron birth-death dynamics. 2019.
  • [46] Grant Rotskoff and Eric Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach. Communications on Pure and Applied Mathematics, 75(9):1889–1935, 2022.
  • [47] Maurice Sion. On general minimax theorems. Pacific Journal of mathematics, 8(1):171–176, 1958.
  • [48] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725–752, 2020.
  • [49] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of deep neural networks. Mathematics of Operations Research, 47(1):120–152, 2022.
  • [50] Alain-Sol Sznitman. Topics in propagation of chaos. In École d'Été de Probabilités de Saint-Flour XIX-1989, pages 165–251. Springer, 1991.
  • [51] Wenpin Tang and Xun Yu Zhou. Simulated annealing from continuum to discretization: a convergence analysis via the eyring–kramers law. arXiv preprint arXiv:2102.02339, 2021.
  • [52] John von Neumann. Zur Theorie der Gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
  • [53] Guillaume Wang and Lénaïc Chizat. An exponentially converging particle method for the mixed nash equilibrium of continuous games. arXiv preprint arXiv:2211.01280, 2022.
  • [54] Stephan Wojtowytsch et al. On the banach spaces associated with multi-layer relu networks: Function representation, approximation theory and gradient descent dynamics. arXiv preprint arXiv:2007.15623, 2020.
  • [55] Junchi Yang, Negar Kiyavash, and Niao He. Global convergence and variance reduction for a class of nonconvex-nonconcave minimax problems. Advances in Neural Information Processing Systems, 33:1153–1165, 2020.
  • [56] Junchi Yang, Antonio Orvieto, Aurelien Lucchi, and Niao He. Faster single-loop algorithms for minimax optimization without strong concavity. In International Conference on Artificial Intelligence and Statistics, pages 5485–5517. PMLR, 2022.
  • [57] Yaohua Zang, Gang Bao, Xiaojing Ye, and Haomin Zhou. Weak adversarial networks for high-dimensional partial differential equations. Journal of Computational Physics, 411:109409, 2020.