
MF-OMO: An Optimization Formulation of Mean-Field Games

Xin Guo (University of California, Berkeley; Amazon.com)    Anran Hu (University of Oxford)    Junzi Zhang (Citadel Securities)
Abstract

This paper proposes a new mathematical paradigm to analyze discrete-time mean-field games. It is shown that finding Nash equilibrium solutions for a general class of discrete-time mean-field games is equivalent to solving an optimization problem with bounded variables and simple convex constraints, called MF-OMO. This equivalence framework enables finding multiple (and possibly all) Nash equilibrium solutions of mean-field games by standard algorithms. For instance, projected gradient descent is shown to be capable of retrieving all possible Nash equilibrium solutions when there are finitely many of them, with proper initializations. Moreover, analyzing mean-field games with linear rewards and mean-field independent dynamics is reduced to solving a finite number of linear programs, hence solvable in finite time. This framework does not rely on contractivity or monotonicity assumptions, nor on the uniqueness of the Nash equilibrium.

1 Introduction

The theory of mean-field games, pioneered by the seminal works of [39] and [46], provides an ingenious way of analyzing and finding the $\epsilon$-Nash equilibrium solution to the otherwise notoriously hard $N$-player stochastic games. Ever since its inception, the literature on mean-field games has grown exponentially both in theory and in applications (see the standard reference [15]).

There are two key criteria for finding Nash equilibrium solutions of a mean-field game. The first is the optimality condition for the representative player, and the second is the consistency condition ensuring that the solution of the representative player is consistent with the mean-field flow of all players. Consequently, mean-field games with continuous-time stochastic dynamics have been studied under three approaches. The first is the original fixed-point [39] and PDE [46] approach that analyzes the joint system of a backward Hamilton-Jacobi-Bellman equation for a control problem and a forward Fokker-Planck equation for the controlled dynamics. This naturally leads to the second, probabilistic approach, which focuses on the associated forward-backward stochastic differential equations [12, 14]. The third and latest approach uses the master equation to analyze mean-field games with common noise and the propagation of chaos for optimal trajectories [9, 13, 23]. All existing methodologies are based on either contractivity or monotonicity conditions, the uniqueness of the Nash equilibrium solution, or some particular game structures; see [69, 1] for a detailed summary. Accordingly, finding Nash equilibrium solutions of mean-field games relies on similar assumptions and structures; see for instance [47].

It is well known, however, that non-zero-sum stochastic games and mean-field games in general have multiple Nash equilibrium solutions and that uniqueness of the Nash equilibrium solution, contractivity, and monotonicity are more of an exception than a rule. Yet, systematic and efficient approaches for finding multiple or all Nash equilibrium solutions of general mean-field games remain elusive.

Our work and approach.

In this work, we show that finding Nash equilibrium solutions for a general class of discrete-time mean-field games is equivalent to solving an optimization problem with bounded variables and simple convex constraints, called MF-OMO (Mean-Field Occupation Measure Optimization) (Section 3). This equivalence is established without assuming either the uniqueness of the Nash equilibrium solution or the contractivity or monotonicity conditions.

The equivalence is built in several steps. The first step, inspired by the classical linear program formulation of the single-agent Markov decision process (MDP) problem [61], is to introduce the occupation measure of the representative player and to transform the problem of finding Nash equilibrium solutions into a constrained optimization problem (Theorem 4), using the strong duality of linear programming to bridge the two criteria of Nash equilibrium solutions. The second step is to utilize a solution modification technique (Proposition 6) to deal with potentially unbounded variables in the optimization problem and obtain the MF-OMO formulation (Theorem 5). Further analysis of this equivalent optimization formulation establishes the connection between the suboptimality of MF-OMO and the exploitability of any policy for mean-field games (Section 4). This connection is obtained by a perturbation analysis of MDPs (Lemma 13) and by the strong duality of linear programs.

These equivalence and connections enable adopting families of optimization algorithms to find multiple (and possibly all) Nash equilibrium solutions of mean-field games. In particular, detailed studies on the convergence of MF-OMO with the projected gradient descent (PGD) algorithm to Nash equilibrium solutions of mean-field games are presented (Theorem 11). For instance, PGD is shown to be capable of retrieving all possible Nash equilibrium solutions when there are finitely many of them, with proper initializations. Moreover, analyzing mean-field games with linear rewards and mean-field independent dynamics is shown to be reduced to solving a finite number of linear programs, hence solvable in finite time (Proposition 12).

Finally, this MF-OMO framework can be extended to other variants of mean-field games, as illustrated in the case of personalized mean-field games in Section 7.

Related work.

Our approach is inspired by the classical linear program framework for MDPs, originally proposed in [51] for discrete-time MDPs with finite state-action spaces and then extended to semi-MDPs [56] and to approximate dynamic programming [63, 21], as well as to continuous-time stochastic control [66, 10, 45] and singular control [67, 44] with continuous state-action spaces, among others [72, 25, 24, 38]. In a similar spirit, our work builds an optimization framework for discrete-time mean-field games.

There are several existing computational algorithms for finding Nash equilibrium solutions of discrete-time mean-field games. In particular, [31, 58] assume strict monotonicity, while [59] assumes weak monotonicity, all of which also assume the dynamics to be mean-field independent. In addition, [34, 32, 3] focus on contractive mean-field games, [6] studies mean-field games with unique Nash equilibrium solutions, and [20, 4] analyze mean-field games that are sufficiently close to some contractive regularized mean-field games. Under these assumptions, global convergence is obtained for some of these algorithms. In contrast, we focus on presenting an equivalent optimization framework for mean-field games, and establishing local convergence without these assumptions.

The works most relevant to ours are the earlier works [11, 29, 30] on continuous-time mean-field games, which replace the best-response/optimality condition in Nash equilibrium solutions by the linear program optimality condition. Their reformulation approach leads to new existence results for relaxed Nash equilibrium solutions and to computational benefits. In contrast, our work formulates the problem of finding Nash equilibrium solutions as a single optimization problem which captures both the best-response condition and the Fokker-Planck/consistency condition. Our formulation shows that finding Nash equilibrium solutions of discrete-time mean-field games with finite state-action spaces is generally no harder than solving a smooth, time-decoupled optimization problem with bounded variables and trivial convex constraints. Such a connection allows for directly utilizing various optimization tools, algorithms and solvers to find Nash equilibrium solutions of mean-field games. Moreover, our work provides, to the best of our knowledge, the first provably convergent algorithms for mean-field games with possibly multiple Nash equilibrium solutions and without contractivity or monotonicity assumptions.

2 Problem setup

We consider an extended [16] or general [34] mean-field game in a finite-time horizon TT with a state space 𝒮\mathcal{S} and an action space 𝒜\mathcal{A}, where |𝒮|=S<|\mathcal{S}|=S<\infty and |𝒜|=A<|\mathcal{A}|=A<\infty. In this game, a representative player starts from a state s0μ0s_{0}\sim\mu_{0}, with μ0\mu_{0} the (common) initial state distribution of all players of an infinite population. At each time t𝒯={0,1,,T}t\in\mathcal{T}=\{0,1,\dots,T\} and when at state sts_{t}, she chooses an action at𝒜a_{t}\in\mathcal{A} from some randomized/mixed policy πt:𝒮Δ(𝒜)\pi_{t}:\mathcal{S}\rightarrow\Delta(\mathcal{A}), where Δ(𝒜)\Delta(\mathcal{A}) denotes the set of probability vectors on 𝒜\mathcal{A}. Note that mixed policies are also known as relaxed controls. She will then move to a new state st+1s_{t+1} according to a transition probability Pt(|st,at,Lt)P_{t}(\cdot|s_{t},a_{t},L_{t}) and receive a reward rt(st,at,Lt)r_{t}(s_{t},a_{t},L_{t}), where LtΔ(𝒮×𝒜)L_{t}\in\Delta(\mathcal{S}\times\mathcal{A}) is the joint state-action distribution among all players at time tt, referred to as the mean-field information hereafter, and Δ(𝒮×𝒜)\Delta(\mathcal{S}\times\mathcal{A}) is the set of probability vectors on 𝒮×𝒜\mathcal{S}\times\mathcal{A}. Denote rmax=supt𝒯,s𝒮,a𝒜,LΔ(𝒮×𝒜)|rt(s,a,L)|r_{\max}=\sup_{t\in\mathcal{T},s\in\mathcal{S},a\in\mathcal{A},L\in\Delta(\mathcal{S}\times\mathcal{A})}|r_{t}(s,a,L)|, which is assumed throughout the paper as finite.

Given the mean-field flow {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}}, the objective of this representative player is to maximize her accumulated rewards, i.e., to solve the following MDP problem:

\begin{split}\text{maximize}_{\{\pi_{t}\}_{t\in\mathcal{T}}}\quad&\mathbb{E}\left[\sum_{t=0}^{T}r_{t}(s_{t},a_{t},L_{t})\Big{|}s_{0}\sim\mu_{0}\right]\\ \text{subject to}\quad&s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t})\,\,(t\in\mathcal{T}\backslash\{T\}),\quad a_{t}\sim\pi_{t}(\cdot|s_{t})\,\,(t\in\mathcal{T}).\end{split} (1)

Throughout the paper, for a given mean-field flow L={Lt}t𝒯L=\{L_{t}\}_{t\in\mathcal{T}}, we will denote the mean-field induced MDP (1) as (L)\mathcal{M}(L), Vt(L)SV_{t}^{\star}(L)\in\mathbb{R}^{S} as its optimal total expected reward, i.e., value function, starting from time tt, with the ss-th entry [Vt(L)]s[V_{t}^{\star}(L)]_{s} being the optimal expected reward starting from state ss at time tt, and Vμ0(L)=s𝒮μ0(s)[V0(L)]sV_{\mu_{0}}^{\star}(L)=\sum_{s\in\mathcal{S}}\mu_{0}(s)[V_{0}^{\star}(L)]_{s} as its optimal expected total reward starting from μ0\mu_{0}. Correspondingly, we denote respectively Vtπ(L)SV_{t}^{\pi}(L)\in\mathbb{R}^{S}, [Vtπ(L)]s[V_{t}^{\pi}(L)]_{s}, as well as Vμ0π(L)=s𝒮μ0(s)[V0π(L)]sV_{\mu_{0}}^{\pi}(L)=\sum_{s\in\mathcal{S}}\mu_{0}(s)[V_{0}^{\pi}(L)]_{s}, under a given policy π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} for (L)\mathcal{M}(L).

To analyze such a mean-field game, the most widely adopted solution concept is the Nash equilibrium. A policy sequence {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} and a mean-field flow {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}} constitute a Nash equilibrium solution of this finite-time horizon mean-field game, if the following conditions are satisfied.

Definition 2.1 (Nash equilibrium solution).
  • 1)

    (Optimality/Best Response) Fixing {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}}, {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} solves the optimization problem (1), i.e., {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} is optimal for the representative agent given the mean-field flow {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}};

  • 2)

    (Consistency) Fixing {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}}, the consistency of the mean-field flow holds, namely

    Lt=st,at, where st+1Pt(|st,at,Lt),atπt(|st),s0μ0,t𝒯\{T}.L_{t}=\mathbb{P}_{s_{t},a_{t}},\text{ where }s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t}),\,a_{t}\sim\pi_{t}(\cdot|s_{t}),\,s_{0}\sim\mu_{0},\,t\in\mathcal{T}\backslash\{T\}. (2)

    Here x\mathbb{P}_{x} denotes the probability distribution of a random variable/vector xx.

Note that (1) requires that the policy sequence $\{\pi_{t}\}_{t\in\mathcal{T}}$ is the best response to the flow $\{L_{t}\}_{t\in\mathcal{T}}$, while (2) requires that the flow $\{L_{t}\}_{t\in\mathcal{T}}$ is the corresponding mean-field flow induced when all players adopt the policy sequence $\{\pi_{t}\}_{t\in\mathcal{T}}$. (Condition 2 is also known as the Fokker-Planck equation in the continuous-time mean-field game literature.) Also note that (2) can be written more explicitly as follows:

L0(s,a)=μ0(s)π0(a|s),Lt+1(s,a)=πt+1(a|s)s𝒮a𝒜Lt(s,a)Pt(s|s,a,Lt),t=0,,T1.\begin{split}L_{0}(s,a)&=\mu_{0}(s)\pi_{0}(a|s),\\ L_{t+1}(s^{\prime},a^{\prime})&=\pi_{t+1}(a^{\prime}|s^{\prime})\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}L_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t}),\quad\forall t=0,\dots,T-1.\end{split} (3)
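For concreteness, the forward recursion (3) can be implemented directly. The following sketch (in Python/NumPy) assumes the transition kernel is available as a callable P(t, L_t) returning an $S\times A\times S$ array and the policy as a $(T+1)\times S\times A$ array; these interfaces are illustrative assumptions, not part of the model above.

```python
import numpy as np

def induced_mean_field_flow(mu0, pi, P, T):
    """Forward recursion (3): compute L_0, ..., L_T.

    mu0: (S,) initial state distribution.
    pi:  (T+1, S, A) array with pi[t, s, a] = pi_t(a|s).
    P:   callable, P(t, L_t) -> (S, A, S) array with entries P_t(s'|s, a, L_t).
    Returns L of shape (T+1, S, A).
    """
    L = np.zeros((T + 1,) + pi.shape[1:])
    L[0] = mu0[:, None] * pi[0]                            # L_0(s,a) = mu_0(s) pi_0(a|s)
    for t in range(T):
        mu_next = np.einsum('sa,sap->p', L[t], P(t, L[t])) # next-step state marginal
        L[t + 1] = mu_next[:, None] * pi[t + 1]            # multiply by pi_{t+1}(a'|s')
    return L
```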

The following existence result for Nash equilibrium solutions holds as long as the transitions and rewards are continuous in LtL_{t}. The proof is based on the Kakutani fixed-point theorem; it is almost identical to those in [62, 20], except for replacing the state mean-field flow with the state-action joint mean-field flow.

Proposition 1.

Suppose that Pt(s|s,a,Lt)P_{t}(s^{\prime}|s,a,L_{t}) and rt(s,a,Lt)r_{t}(s,a,L_{t}) are both continuous in LtL_{t} for any s,s𝒮s,s^{\prime}\in\mathcal{S}, a𝒜a\in\mathcal{A} and t𝒯t\in\mathcal{T}. Then a Nash equilibrium solution exists.

Having established the existence of a Nash equilibrium solution for the mean-field game ((1) and (2)), the rest of the paper focuses on proposing a new analytical approach to find Nash equilibrium solution(s) of the mean-field game. This is motivated by the well-known fact that most mean-field games have multiple Nash equilibrium solutions and are neither contractive nor (weakly) monotone, as the following simple example illustrates.

Example of multiple Nash equilibrium solutions.

Take 𝒮=𝒜={1,,n}\mathcal{S}=\mathcal{A}=\{1,\dots,n\} and an arbitrary initial distribution μ0\mu_{0}. At time 0, for any L0Δ(𝒮×𝒜)L_{0}\in\Delta(\mathcal{S}\times\mathcal{A}) and any i,j=1,,ni,j=1,\dots,n, the rewards r0(i,j,L0)=0r_{0}(i,j,L_{0})=0, and the transitions are deterministic such that P0(i|i,j,L0)=𝟏{i=j}P_{0}(i^{\prime}|i,j,L_{0})=\mathbf{1}_{\{i^{\prime}=j\}}. When t>0t>0, the rewards and the transition probabilities may depend on LtL_{t}. Specifically, if the population congregates at the same state ii and chooses to stay, i.e., Lt(s,a)=𝟏{s=i,a=i}L_{t}(s,a)=\mathbf{1}_{\{s=i,a=i\}}, then the rewards rt(i,j,Lt)=ri𝟏{i=j}r_{t}(i,j,L_{t})=r^{i}\mathbf{1}_{\{i=j\}} for some ri>0r^{i}>0, and the transition probabilities will be deterministic Pt(i|i,j,Lt)=𝟏{i=j}P_{t}(i^{\prime}|i,j,L_{t})=\mathbf{1}_{\{i^{\prime}=j\}}. If LtL_{t} deviates from state ii and action ii, the rewards and transition probabilities will be adjusted as follows:

  • $r_{t}(i,j,L_{t})=\mathbf{1}_{\{i=j\}}(r^{i}-r^{i}\|L_{t}-\mathbf{1}_{\{(s,a)\,|\,s=i,a=i\}}\|_{2}^{2}/2)$, for any $i,j=1,\dots,n$ and $t=1,\dots,T$. (Here, for a totally ordered finite set $\mathcal{X}$, we use $\mathbf{1}_{\mathcal{X}}$ to denote the binary vector in $\mathbb{R}^{|\mathcal{X}|}$ which has entry value $1$ for the indices $x\in\mathcal{X}$ and entry value $0$ otherwise.)

  • Pt(i|i,j,Lt)=𝟏{i=j}+CtLt𝟏{(s,a)|s=i,a=i}221+nCtLt𝟏{(s,a)|s=i,a=i}22P_{t}(i^{\prime}|i,j,L_{t})=\dfrac{\mathbf{1}_{\{i^{\prime}=j\}}+C_{t}\|L_{t}-\mathbf{1}_{\{(s,a)|s=i,a=i\}}\|_{2}^{2}}{1+nC_{t}\|L_{t}-\mathbf{1}_{\{(s,a)|s=i,a=i\}}\|_{2}^{2}} for any i,i,j=1,,ni^{\prime},i,j=1,\dots,n and t=1,,Tt=1,\dots,T.

Here CtC_{t} (t=1,,T)(t=1,\dots,T) are some non-negative constants.

It is easy to verify that for any jargmaxi=1,,nrij^{\star}\in\text{argmax}_{i=1,\dots,n}r^{i}, if we define

  • πt(a|s)=𝟏{a=j}\pi_{t}(a|s)=\mathbf{1}_{\{a=j^{\star}\}} for any s,a=1,,ns,a=1,\dots,n and t=0,,Tt=0,\dots,T;

  • L0(s,a)=μ0(s)π0(a|s)L_{0}(s,a)=\mu_{0}(s)\pi_{0}(a|s) and Lt(s,a)=𝟏{s=j,a=j}L_{t}(s,a)=\mathbf{1}_{\{s=j^{\star},a=j^{\star}\}}, for any s,a=1,,ns,a=1,\dots,n and t=1,,Tt=1,\dots,T,

then $\pi=\{\pi_{t}\}_{t\in\mathcal{T}}$ and $L=\{L_{t}\}_{t\in\mathcal{T}}$ constitute a Nash equilibrium solution. Thus, whenever $\text{argmax}_{i=1,\dots,n}r^{i}$ contains more than one index (e.g., when several $r^{i}$ attain the maximum), there are several distinct Nash equilibrium solutions with both $\pi$ and $L$ being different, indexed by $j^{\star}\in\text{argmax}_{i=1,\dots,n}r^{i}$ and denoted by $\pi^{(j^{\star})}$ and $L^{(j^{\star})}$. As contractivity and strict monotonicity imply the uniqueness of the Nash equilibrium solution, the mean-field game above is then neither contractive nor strictly monotone. To see that it is not weakly monotone either, consider any two mean-field flows $L_{t}^{(1)}=L_{t}^{(j_{1}^{\star})}$ and $L_{t}^{(2)}=L_{t}^{(j_{2}^{\star})}$ with $j_{1}^{\star}\neq j_{2}^{\star}\in\text{argmax}_{i=1,\dots,n}r^{i}$. Then

\begin{split}&\sum_{t=0}^{T}\sum_{s=1}^{n}\sum_{a=1}^{n}(L_{t}^{(1)}(s,a)-L_{t}^{(2)}(s,a))(r_{t}(s,a,L_{t}^{(1)})-r_{t}(s,a,L_{t}^{(2)}))\\ &=T(r^{j_{1}^{\star}}+r^{j_{2}^{\star}})=2T\max\nolimits_{i=1,\dots,n}r^{i}>0,\end{split}

violating the requirement of weak monotonicity.

We now propose a new analytical framework that enables finding multiple Nash equilibrium solutions, without the restrictive uniqueness assumption on the Nash equilibrium solution or the contractivity [34, 20] and monotonicity [59, 58] conditions.

3 MF-OMO: Optimization formulation of mean-field games

To overcome the restrictive assumptions of contractivity and monotonicity for analyzing mean-field games, we establish a new optimization framework for mean-field games via two steps. The first is an optimization reformulation of mean-field games with potentially unbounded variables and nonconvex constraints, which relies on the equivalence between an MDP problem and a linear program over occupation measures. The second is to utilize a solution modification technique to obtain the final form of the optimization formulation with bounded variables and simple convex constraints. We call this framework Mean-Field Occupation Measure Optimization (MF-OMO).

Occupation measure.

To start, let us introduce a new variable dt(s,a)d_{t}(s,a) for any t𝒯t\in\mathcal{T}, s𝒮,a𝒜s\in\mathcal{S},a\in\mathcal{A}, which represents the occupation measure of the representative agent under some policy sequence π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} in (L)\mathcal{M}(L), the mean-field induced MDP, i.e., dt(s,a)=(st=s,at=a)d_{t}(s,a)=\mathbb{P}(s_{t}=s,a_{t}=a), with s0μ0s_{0}\sim\mu_{0}, st+1Pt(|st,at,Lt)s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t}), atπt(|st)a_{t}\sim\pi_{t}(\cdot|s_{t}) for t=0,,T1t=0,\dots,T-1. Given the occupation measure d={dt}t𝒯d=\{d_{t}\}_{t\in\mathcal{T}}, define a mapping Π\Pi that retrieves the policy from the occupation measure. This set-valued mapping Π\Pi maps from a sequence {dt}t𝒯Δ(𝒮×𝒜)\{d_{t}\}_{t\in\mathcal{T}}\subseteq\Delta(\mathcal{S}\times\mathcal{A}) to a set of policy sequences {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}}: for any {dt}t𝒯Δ(𝒮×𝒜)\{d_{t}\}_{t\in\mathcal{T}}\subseteq\Delta(\mathcal{S}\times\mathcal{A}), πΠ(d)\pi\in\Pi(d) if and only if

πt(a|s)=dt(s,a)a𝒜dt(s,a)\pi_{t}(a|s)=\frac{d_{t}(s,a)}{\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s,a^{\prime})} (4)

when a𝒜dt(s,a)>0\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s,a^{\prime})>0, and πt(|s)\pi_{t}(\cdot|s) is an arbitrary probability vector in Δ(𝒜)\Delta(\mathcal{A}) when a𝒜dt(s,a)=0\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s,a^{\prime})=0.
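As a minimal illustration, the retrieval map $\Pi$ in (4) can be implemented as follows, assuming the occupation measure is stored as a $(T+1)\times S\times A$ NumPy array; at states with zero visitation mass the sketch assigns the uniform policy, which is one arbitrary choice among the allowed ones.

```python
import numpy as np

def retrieve_policy(d):
    """One element of Pi(d): pi_t(a|s) = d_t(s,a) / sum_a' d_t(s,a'),
    with a uniform policy at states of zero visitation mass."""
    T1, S, A = d.shape
    pi = np.full((T1, S, A), 1.0 / A)           # default: uniform (arbitrary but valid)
    mass = d.sum(axis=2, keepdims=True)         # (T+1, S, 1) state visitation mass
    visited = mass.squeeze(-1) > 0              # boolean (T+1, S)
    pi[visited] = (d / np.where(mass > 0, mass, 1.0))[visited]
    return pi
```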

Linear program formulation of MDPs.

Next, consider any finite-horizon MDP $\mathcal{M}$ (e.g., the mean-field induced MDP $\mathcal{M}(L)$) with a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, an initial state distribution $\mu_{0}$, a reward function $r_{t}(s,a)$, and dynamics $P_{t}(s^{\prime}|s,a)$ for $t=0,\dots,T$, $s,s^{\prime}\in\mathcal{S}$ and $a\in\mathcal{A}$. For brevity, we write $\mathbb{P}^{\pi}(\cdot)$ for the state and/or action distribution generated by $s_{0}\sim\mu_{0}$, $s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t})$, $a_{t}\sim\pi_{t}(\cdot|s_{t})$ for $t=0,\dots,T-1$.

Then, our first component of MF-OMO relies on a linear program formulation of an MDP problem, as follows.

Lemma 2.

Suppose that π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} is an ϵ\epsilon-suboptimal policy for the MDP \mathcal{M}. Define μt(s):=π(st=s)\mu_{t}(s):=\mathbb{P}^{\pi}(s_{t}=s) and dt(s,a):=μt(s)πt(a|s)d_{t}(s,a):=\mu_{t}(s)\pi_{t}(a|s) for s𝒮s\in\mathcal{S} and a𝒜a\in\mathcal{A}. Then πΠ(d)\pi\in\Pi(d) and d={dt}t𝒯d=\{d_{t}\}_{t\in\mathcal{T}} is a feasible ϵ\epsilon-suboptimal solution to the following linear program:

maximized\displaystyle\text{maximize}_{d}\quad t𝒯s𝒮a𝒜dt(s,a)rt(s,a)\displaystyle\sum_{t\in\mathcal{T}}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{t}(s,a)r_{t}(s,a)
subject to s𝒮a𝒜dt(s,a)Pt(s|s,a)=a𝒜dt+1(s,a),s𝒮,t𝒯\{T},\displaystyle\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{t}(s,a)P_{t}(s^{\prime}|s,a)=\sum_{a\in\mathcal{A}}d_{t+1}(s^{\prime},a),\quad\forall s^{\prime}\in\mathcal{S},t\in\mathcal{T}\backslash\{T\},
a𝒜d0(s,a)=μ0(s),s𝒮,\displaystyle\sum_{a\in\mathcal{A}}d_{0}(s,a)=\mu_{0}(s),\quad\forall s\in\mathcal{S},
dt(s,a)0,s𝒮,a𝒜,t𝒯.\displaystyle d_{t}(s,a)\geq 0,\quad\forall s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}.

Conversely, suppose that d={dt}t𝒯d=\{d_{t}\}_{t\in\mathcal{T}} is a feasible ϵ\epsilon-suboptimal solution to the above linear program. Then for any πΠ(d)\pi\in\Pi(d) defined by (4), π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} is an ϵ\epsilon-suboptimal policy for the MDP \mathcal{M}. In addition, the value of the above linear program is equal to the value of the MDP \mathcal{M}.

The proof of Lemma 2 is a suitable modification of its well-known counterparts in the discounted and average-reward settings [21, 40, 70].

A preliminary version of MF-OMO.

We now show that π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} is a Nash equilibrium solution if and only if there exist d={dt(s,a)}t𝒯d^{\star}=\{d_{t}^{\star}(s,a)\}_{t\in\mathcal{T}} and L={Lt(s,a)}t𝒯L=\{L_{t}(s,a)\}_{t\in\mathcal{T}}, such that πΠ(d)\pi\in\Pi(d^{\star}) and the following two conditions hold.

  • (A)

    Optimality of the representative agent: d={dt(s,a)}t𝒯d^{\star}=\{d_{t}^{\star}(s,a)\}_{t\in\mathcal{T}} solves the following LP problem

    maximized\displaystyle\text{maximize}_{d}\quad t𝒯s𝒮a𝒜dt(s,a)rt(s,a,Lt)\displaystyle\sum_{t\in\mathcal{T}}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{t}(s,a)r_{t}(s,a,L_{t}) (5)
    subject to s𝒮a𝒜dt(s,a)Pt(s|s,a,Lt)=a𝒜dt+1(s,a),\displaystyle\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t})=\sum_{a\in\mathcal{A}}d_{t+1}(s^{\prime},a), (6)
    s𝒮,t𝒯\{T},\displaystyle\hskip 119.50148pt\forall s^{\prime}\in\mathcal{S},t\in\mathcal{T}\backslash\{T\},
    a𝒜d0(s,a)=μ0(s),s𝒮,\displaystyle\sum_{a\in\mathcal{A}}d_{0}(s,a)=\mu_{0}(s),\quad\forall s\in\mathcal{S},
    dt(s,a)0,s𝒮,a𝒜,t𝒯.\displaystyle d_{t}(s,a)\geq 0,\quad\forall s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}.
  • (B)

    Consistency between the agent and the mean-field: dt(s,a)=Lt(s,a)d_{t}^{\star}(s,a)=L_{t}(s,a) for any s𝒮,a𝒜,t𝒯s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}.

Here condition (A) corresponds to the optimality condition 1) in Definition 2.1 according to Lemma 2. Condition (B) indicates that the occupation measure of the single agent matches the mean-field flow of the population, so that the agent is indeed representative. Condition (B), when combined with condition (A), in fact implies the consistency condition 2) in Definition 2.1. Formally, we have the following lemma.

Lemma 3.

If $(\pi,L)$ is a Nash equilibrium solution of the mean-field game ((1) and (2)), then there exists $d^{\star}$ with $d^{\star}=L$ such that $\pi\in\Pi(d^{\star})$ and $(d^{\star},L)$ satisfies conditions (A) and (B). On the other hand, if $(d^{\star},L)$ satisfies conditions (A) and (B), then for any $\pi\in\Pi(L)$, $(\pi,L)$ is a Nash equilibrium solution of the mean-field game ((1) and (2)).

To get the preliminary version of MF-OMO, notice that the linear program in condition (A) can be rewritten as

minimizedcLdsubject toALd=b,d0,\begin{array}[]{ll}\text{minimize}_{d}&c_{L}^{\top}d\\ \text{subject to}&A_{L}d=b,\quad d\geq 0,\end{array} (7)

where cLSA(T+1)c_{L}\in\mathbb{R}^{SA(T+1)} and bST+Sb\in\mathbb{R}^{ST+S} are

cL=[r0(,,L0)rT(,,LT)],b=[00μ0],c_{L}=\left[\begin{array}[]{c}-r_{0}(\cdot,\cdot,L_{0})\\ \vdots\\ -r_{T}(\cdot,\cdot,L_{T})\end{array}\right],\quad b=\left[\begin{array}[]{c}0\\ \vdots\\ 0\\ \mu_{0}\end{array}\right], (8)

rt(,,Lt)SAr_{t}(\cdot,\cdot,L_{t})\in\mathbb{R}^{SA} is a flattened vector (with column-major order), and the matrix ALS(T+1)×SA(T+1)A_{L}\in\mathbb{R}^{S(T+1)\times SA(T+1)} is

AL=[W0(L0)Z00000W1(L1)Z00000W2(L2)Z000000WT1(LT1)ZZ00000].A_{L}=\left[\begin{array}[]{ccccccc}W_{0}(L_{0})&-Z&0&0&\cdots&0&0\\ 0&W_{1}(L_{1})&-Z&0&\cdots&0&0\\ 0&0&W_{2}(L_{2})&-Z&\cdots&0&0\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&0&\cdots&W_{T-1}(L_{T-1})&-Z\\ Z&0&0&0&\cdots&0&0\end{array}\right]. (9)

Here Wt(Lt)S×SAW_{t}(L_{t})\in\mathbb{R}^{S\times SA} is the matrix with the ll-th row (l=1,,Sl=1,\dots,S) being the flattened vector [Pt(l|,,Lt)]SA[P_{t}(l|\cdot,\cdot,L_{t})]\in\mathbb{R}^{SA} (with column-major order), and the matrix ZZ is defined as

Z:=[\overbrace{I_{S},\dots,I_{S}}^{A}]\in\mathbb{R}^{S\times SA}, (10)

where $I_{S}$ is the identity matrix of dimension $S$. In addition, the variable $d$ is also flattened/vectorized in the same order as $c_{L}$. Accordingly, hereafter both $d^{\star}=\{d^{\star}_{t}\}_{t\in\mathcal{T}}$ and $L=\{L_{t}\}_{t\in\mathcal{T}}$ are viewed either as flattened vectors (with column-major order) in $\mathbb{R}^{SA(T+1)}$ or as sequences of $(T+1)$ flattened vectors (with column-major order) in $\mathbb{R}^{SA}$, depending on the context, and we use $L_{t}(s,a)$ and $L_{s,a,t}$ (resp. $d_{t}(s,a)$ and $d_{s,a,t}$) interchangeably.
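To make the block structure in (8)-(10) concrete, the following sketch assembles $A_L$, $b$ and $c_L$ as dense NumPy arrays for a fixed flow $L$, assuming the rewards and transitions evaluated at $L$ are supplied as arrays r[t, s, a] and P[t, s, a, s'] (illustrative inputs); all flattenings use column-major (Fortran) order, matching the convention above.

```python
import numpy as np

def build_mfomo_data(r, P, mu0):
    """Assemble A_L in R^{S(T+1) x SA(T+1)}, b in R^{S(T+1)}, c_L in R^{SA(T+1)}, cf. (8)-(10).

    r:   (T+1, S, A) rewards r_t(s, a, L_t) evaluated at the given flow L.
    P:   (T+1, S, A, S) transitions P_t(s'|s, a, L_t) evaluated at L.
    mu0: (S,) initial state distribution.
    """
    T1, S, A = r.shape
    Z = np.tile(np.eye(S), (1, A))                     # Z = [I_S, ..., I_S] in R^{S x SA}
    def W(t):                                          # W_t(L_t): row s' holds P_t(s'|.,.,L_t) flattened
        return P[t].reshape(S * A, S, order='F').T     # S x SA, column-major over (s, a)
    A_L = np.zeros((S * T1, S * A * T1))
    for t in range(T1 - 1):
        A_L[t * S:(t + 1) * S, t * S * A:(t + 1) * S * A] = W(t)        # W_t(L_t) block
        A_L[t * S:(t + 1) * S, (t + 1) * S * A:(t + 2) * S * A] = -Z    # -Z block
    A_L[(T1 - 1) * S:, :S * A] = Z                     # last block row: [Z, 0, ..., 0]
    b = np.concatenate([np.zeros(S * (T1 - 1)), mu0])
    c_L = np.concatenate([-r[t].reshape(-1, order='F') for t in range(T1)])
    return A_L, b, c_L
```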

By the strong duality of linear programming [50], $\{d_{t}^{\star}(s,a)\}$ is an optimal solution to the above linear program (7) if and only if the following KKT conditions hold for $d^{\star}$ and some $y$ and $z$:

ALd=b,ALy+z=cL,zd=0,d0,z0.\begin{split}&A_{L}d^{\star}=b,\quad A_{L}^{\top}y+z=c_{L},\\ &z^{\top}d^{\star}=0,\quad d^{\star}\geq 0,\quad z\geq 0.\end{split} (11)

Now, combining condition (A) (in the form of equation (11)) with condition (B), we have the following preliminary version of MF-OMO.

Theorem 4.

Finding a Nash equilibrium solution of the mean-field game ((1) and (2)) is equivalent to solving the following constrained optimization problem:

minimizey,z,L0subject toALL=b,ALy+z=cL,zL=0,L0,z0,\begin{array}[]{ll}\text{\rm minimize}_{y,z,L}&0\\ \text{\rm subject to}&A_{L}L=b,\quad A_{L}^{\top}y+z=c_{L},\\ &z^{\top}L=0,\quad L\geq 0,\quad z\geq 0,\end{array} (12)

where AL,cL,bA_{L},c_{L},b are set as (8) and (9). Specifically, if (π,L)(\pi,L) is a Nash equilibrium solution of the mean-field game ((1) and (2)), then there exist some y,zy,z such that (y,z,L)(y,z,L) solves the constrained optimization problem (12). On the other hand, if (y,z,L)(y,z,L) solves the constrained optimization problem (12), then for any πΠ(L)\pi\in\Pi(L), (π,L)(\pi,L) is a Nash equilibrium solution of the mean-field game ((1) and (2)).

In the above derivation, we have used LL to represent the population mean-field flows and dd to denote the occupation measure of the representative agent. The optimal occupation measure (i.e., the solution to the linear program) associated with the representative agent is denoted by dd^{\star}.

MF-OMO.

The above constrained optimization problem (12) in Theorem 4 can be rewritten in a more computationally efficient form by adding bound constraints on the auxiliary variables $y$ and $z$ and moving the potentially nonconvex constraints into the objective. Note that the resulting nonconvex objective is in general much easier to handle than nonconvex constraints (cf. (12)). In addition, the boundedness of all variables is essential to the later analysis of the optimization framework in Sections 4 and 5.

Theorem 5.

Finding a Nash equilibrium solution of the mean-field game ((1) and (2)) is equivalent to solving the following optimization problem with bounded variables and simple convex constraints:

\begin{array}[]{ll}\text{minimize}_{y,z,L}&\|A_{L}L-b\|_{2}^{2}+\|A_{L}^{\top}y+z-c_{L}\|_{2}^{2}+z^{\top}L\\ \text{subject to}&L\geq 0,\quad\mathbf{1}^{\top}L_{t}=1,\,t\in\mathcal{T},\\ &\mathbf{1}^{\top}z\leq SA(T^{2}+T+2)r_{\max},\quad z\geq 0,\\ &\|y\|_{2}\leq\frac{S(T+1)(T+2)}{2}r_{\max},\end{array} (MF-OMO)

where AL,cL,bA_{L},c_{L},b are set as (8) and (9), and 𝟏{\bf 1} is the all-one vector (with appropriate dimensions). Specifically, if (π,L)(\pi,L) is a Nash equilibrium solution of the mean-field game ((1) and (2)), then there exist some y,zy,z such that (y,z,L)(y,z,L) solves (MF-OMO). On the other hand, if (y,z,L)(y,z,L) solves (MF-OMO) with the value of the objective function being 0, then for any πΠ(L)\pi\in\Pi(L), (π,L)(\pi,L) is a Nash equilibrium solution of the mean-field game ((1) and (2)).

Theorem 5 is immediate from the following two results, which show how the bounds on the auxiliary variables $y$ and $z$ in (MF-OMO) are obtained. Note that only $L$ is needed to construct the Nash equilibrium solution, hence there is no loss in adding the additional bounds on $y$ and $z$ given by Corollary 7.

The interpretation of $y$ and $z$ is given by the following proposition.

Proposition 6 (Solution modification/selection).

Suppose that (y,z,L)(y,z,L) is a solution to (12). Define y^\hat{y} as

y^=[V1(L),V2(L),,VT(L),V0(L)]S(T+1),\hat{y}=[V_{1}^{\star}(L),V_{2}^{\star}(L),\dots,V_{T}^{\star}(L),-V_{0}^{\star}(L)]\in\mathbb{R}^{S(T+1)},

where $V_{t}^{\star}(L)\in\mathbb{R}^{S}$ is the vector of value functions starting from step $t$ for the induced MDP $\mathcal{M}(L)$ (cf. Section 2). In addition, define $\hat{z}=[\hat{z}_{0},\hat{z}_{1},\dots,\hat{z}_{T}]\in\mathbb{R}^{SA(T+1)}$, where $\hat{z}_{t}$ is the Bellman residual/gap (also known as the (optimal) advantage function in the reinforcement learning literature), i.e., $\hat{z}_{t}=[\hat{z}_{t}^{1},\dots,\hat{z}_{t}^{A}]$ ($t=0,\dots,T$) with

z^ta=Vt(L)rt(,a,Lt)Pta(Lt)Vt+1(L)S,t=0,,T1,a𝒜\hat{z}_{t}^{a}=V_{t}^{\star}(L)-r_{t}(\cdot,a,L_{t})-P_{t}^{a}(L_{t})V_{t+1}^{\star}(L)\in\mathbb{R}^{S},\quad t=0,\dots,T-1,\,a\in\mathcal{A}

and z^Ta=VT(L)rT(,a,LT)S\hat{z}_{T}^{a}=V_{T}^{\star}(L)-r_{T}(\cdot,a,L_{T})\in\mathbb{R}^{S} (a𝒜a\in\mathcal{A}), where Pta(Lt)S×SP_{t}^{a}(L_{t})\in\mathbb{R}^{S\times S} is defined by [Pta(Lt)]s,s=Pt(s|s,a,Lt)[P_{t}^{a}(L_{t})]_{s,s^{\prime}}=P_{t}(s^{\prime}|s,a,L_{t}). Then (y^,z^,L)(\hat{y},\hat{z},L) is also a solution to (12).
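A sketch of the construction in Proposition 6: one backward induction computes $V_t^{\star}(L)$ and the nonnegative Bellman residuals $\hat{z}_t$, from which $\hat{y}$ is obtained by stacking $[V_1^{\star},\dots,V_T^{\star},-V_0^{\star}]$. As before, the rewards and transitions evaluated at the given flow are assumed to be available as arrays r[t, s, a] and P[t, s, a, s'].

```python
import numpy as np

def value_and_residuals(r, P):
    """Backward induction for V_t^*(L) and the residuals z_hat_t of Proposition 6.

    r: (T+1, S, A) rewards r_t(s, a, L_t);  P: (T+1, S, A, S) transitions P_t(s'|s, a, L_t).
    Returns V of shape (T+1, S) and z_hat of shape (T+1, S, A) (entrywise >= 0).
    """
    T1, S, A = r.shape
    V = np.zeros((T1, S))
    z_hat = np.zeros((T1, S, A))
    Q = r[T1 - 1]                                       # Q_T(s, a) = r_T(s, a, L_T)
    V[T1 - 1] = Q.max(axis=1)
    z_hat[T1 - 1] = V[T1 - 1][:, None] - Q              # z_hat_T^a = V_T^* - r_T(., a, L_T)
    for t in range(T1 - 2, -1, -1):
        Q = r[t] + np.einsum('sap,p->sa', P[t], V[t + 1])   # r_t + P_t^a V_{t+1}^*
        V[t] = Q.max(axis=1)
        z_hat[t] = V[t][:, None] - Q
    return V, z_hat
```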

The following corollary follows immediately from the definitions of $\hat{y}$ and $\hat{z}$ in Proposition 6.

Corollary 7.

Suppose that (y,z,L)(y,z,L) is a solution to (12). Then there exists a modified solution (y^,z^,L)(\hat{y},\hat{z},L), such that the following bounds hold:

y^2y^1S(T+1)(T+2)2rmax,z^2z^1SA(T2+T+2)rmax.\|\hat{y}\|_{2}\leq\|\hat{y}\|_{1}\leq\frac{S(T+1)(T+2)}{2}r_{\max},\quad\|\hat{z}\|_{2}\leq\|\hat{z}\|_{1}\leq SA(T^{2}+T+2)r_{\max}.

The key is to notice that |[Vt(L)]s|(Tt+1)rmax|[V_{t}^{\star}(L)]_{s}|\leq(T-t+1)r_{\max}, |z^ta(s)|(Tt+1)rmax+rmax+(Tt)rmax|\hat{z}_{t}^{a}(s)|\leq(T-t+1)r_{\max}+r_{\max}+(T-t)r_{\max} and |z^Ta(s)|2rmax|\hat{z}_{T}^{a}(s)|\leq 2r_{\max}, for any s𝒮,a𝒜,t=0,,T1s\in\mathcal{S},\,a\in\mathcal{A},\,t=0,\dots,T-1.

Remark 1.

Note that the constraint $\mathbf{1}^{\top}L_{t}=1$, $t\in\mathcal{T}$ is added to ensure the well-definedness of $P_{t}$ and $r_{t}$ for non-optimal but feasible variables, as otherwise $P_{t}(s,a,\cdot)$ and $r_{t}(s,a,\cdot)$ are undefined outside of the simplex $\Delta(\mathcal{S}\times\mathcal{A})$. The choice of norms on $y$ and $z$ is to facilitate projection evaluation and reparametrization, where the first (projection evaluation) will be clear in Section 5 and the second (reparametrization) is explained in Appendix A.3. In particular, the choice of the $\ell_{1}$-norm for $z$ is to obtain a simplex-like specialized polyhedral constraint paired with $z\geq 0$, and the $\ell_{2}$-norm for $y$ is to obtain closed-form projections (by normalization) and simple reparametrizations. Nevertheless, all our subsequent results in Sections 4 and 5 hold (up to changes of constants) for other choices, including the $\ell_{2}$-norm for $z$ and the $\ell_{1}$-norm for $y$, as long as either efficient projections or reparametrizations are feasible for $y$ and $z$.

Remark 2.

Note that whenever a Nash equilibrium solution exists for the mean-field game ((1) and (2)), the optimal value of the objective function of (MF-OMO) is obviously 0, and vice versa. To see more explicitly the relationship between (MF-OMO) and Nash equilibrium solution(s) of mean-field game ((1) and (2)), denote the feasible set of (MF-OMO) as Θ\Theta, i.e.,

\begin{split}\Theta:=\{(y,z,L)\,|\,&L\geq 0,\,\mathbf{1}^{\top}L_{t}=1,\,t\in\mathcal{T},\\ &\|y\|_{2}\leq S(T+1)(T+2)r_{\max}/2,\\ &\mathbf{1}^{\top}z\leq SA(T^{2}+T+2)r_{\max},\,z\geq 0\}.\end{split} (13)

Then (MF-OMO) can be rewritten as minimizing the following objective function over (y,z,L)Θ(y,z,L)\in\Theta:

fMF-OMO(y,z,L)=s𝒮(a𝒜L0(s,a)μ0(s))2+s𝒮t=0T1(a𝒜Lt+1(s,a)s𝒮,a𝒜Lt(s,a)Pt(s|s,a,Lt))2+s𝒮,a𝒜(yT1(s)rT(s,a,LT)zT(s,a))2+s𝒮,a𝒜t=0T2(yt(s)rt+1(s,a,Lt+1)s𝒮Pt+1(s|s,a,Lt+1)yt+1(s)zt+1(s,a))2+s𝒮,a𝒜(yT(s)+r0(s,a,L0)+s𝒮P0(s|s,a,L0)y0(s)+z0(s,a))2+s𝒮,a𝒜,t𝒯zt(s,a)Lt(s,a).\begin{array}[]{ll}f^{\text{\rm MF-OMO}}(y,z,L)=&\sum\limits_{s\in\mathcal{S}}\left(\sum\limits_{a\in\mathcal{A}}L_{0}(s,a)-\mu_{0}(s)\right)^{2}\\ &+\sum\limits_{s^{\prime}\in\mathcal{S}}\sum\limits_{t=0}^{T-1}\left(\sum\limits_{a\in\mathcal{A}}L_{t+1}(s^{\prime},a)-\sum\limits_{s\in\mathcal{S},a\in\mathcal{A}}L_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t})\right)^{2}\\ &+\sum\limits_{s\in\mathcal{S},a\in\mathcal{A}}\left(y_{T-1}(s)-r_{T}(s,a,L_{T})-z_{T}(s,a)\right)^{2}\\ &+\sum\limits_{s\in\mathcal{S},a\in\mathcal{A}}\sum\limits_{t=0}^{T-2}\left(y_{t}(s)-r_{t+1}(s,a,L_{t+1})\right.\\ &\qquad\qquad\qquad\left.-\sum_{s^{\prime}\in\mathcal{S}}P_{t+1}(s^{\prime}|s,a,L_{t+1})y_{t+1}(s^{\prime})-z_{t+1}(s,a)\right)^{2}\\ &+\sum\limits_{s\in\mathcal{S},a\in\mathcal{A}}\left(y_{T}(s)+r_{0}(s,a,L_{0})+\sum\limits_{s^{\prime}\in\mathcal{S}}P_{0}(s^{\prime}|s,a,L_{0})y_{0}(s^{\prime})+z_{0}(s,a)\right)^{2}\\ &+\sum\limits_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}z_{t}(s,a)L_{t}(s,a).\end{array} (14)

Here the first two rows expand the term ALLb22\|A_{L}L-b\|_{2}^{2}, corresponding to the consistency condition 2) in Definition 2.1. The next four rows expand the term ALy+zcL22\|A_{L}^{\top}y+z-c_{L}\|_{2}^{2}, corresponding to the Bellman residual by the interpretations of yy and zz in Proposition 6 and hence the optimality condition 1) in Definition 2.1. The last row expands the complementarity term zLz^{\top}L, connecting the two conditions into a single optimization problem.
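In matrix form, the objective (14) is a few lines of linear algebra. The helper below is only a sketch; it takes $A_L$, $b$ and $c_L$ as already-assembled arrays (e.g., as in the earlier sketch of this section) together with the flattened variables $y$, $z$ and $L$.

```python
import numpy as np

def mfomo_objective(A_L, b, c_L, y, z, L):
    """f^{MF-OMO}(y, z, L) = ||A_L L - b||_2^2 + ||A_L^T y + z - c_L||_2^2 + z^T L,
    with y, z, L flattened consistently with A_L, b and c_L."""
    consistency = A_L @ L - b          # Fokker-Planck / consistency residual
    optimality = A_L.T @ y + z - c_L   # dual feasibility (Bellman-type) residual
    return consistency @ consistency + optimality @ optimality + z @ L
```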

This MF-OMO formulation is in sharp contrast to most existing approaches in the literature of discrete-time mean-field games such as [34, 59, 58], which alternate between a full policy optimization/evaluation given a fixed mean-field flow and a full mean-field flow forward propagation given a fixed policy sequence. Moreover, both the policy optimization/evaluation and the mean-field propagation lead to coupling over the full time horizon 𝒯\mathcal{T}. As a result, the convergence of these algorithms requires assumptions of contractivity and monotonicity. In contrast, (14) shows that MF-OMO is fully decoupled over the time horizon, which better facilitates parallel and distributed optimization algorithms and enables more efficient sub-samplings for stochastic optimization algorithms, especially for large time-horizon problems.

4 Connecting exploitability with MF-OMO suboptimality

Having established the equivalence between minimizers of (MF-OMO) and Nash equilibrium solutions of mean-field games, we now connect the exploitability of mean-field games with suboptimal solutions to (MF-OMO).

The concept of exploitability in game theory is used to characterize the difference between any policy and a Nash equilibrium solution. More precisely, define a mapping Γ\Gamma that maps any policy sequence {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} to {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}} when all players take such a policy sequence. Following the consistency condition (3), such Γ\Gamma can be defined recursively, starting with the initialization

Γ(π)0(s,a):=μ0(s)π0(a|s),\Gamma(\pi)_{0}(s,a):=\mu_{0}(s)\pi_{0}(a|s), (15)

such that

Γ(π)t+1(s,a):=πt+1(a|s)s𝒮a𝒜Γ(π)t(s,a)Pt(s|s,a,Γ(π)t),t=0,,T1.\Gamma(\pi)_{t+1}(s,a):=\pi_{t+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\Gamma(\pi)_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},\Gamma(\pi)_{t}),\quad\forall t=0,\dots,T-1. (16)

For any given policy sequence π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}}, let L={Lt}t𝒯=Γ(π)L=\{L_{t}\}_{t\in\mathcal{T}}=\Gamma(\pi) be the induced mean-field flow. Then exploitability characterizes the sub-optimality of the policy π\pi under LL as follows,

Expl(π):=Vμ0(Γ(π))Vμ0π(Γ(π))=maxπVμ0π(Γ(π))Vμ0π(Γ(π)).\text{Expl}(\pi):=V_{\mu_{0}}^{\star}(\Gamma(\pi))-V_{\mu_{0}}^{\pi}(\Gamma(\pi))=\max_{\pi^{\prime}}V_{\mu_{0}}^{\pi^{\prime}}(\Gamma(\pi))-V_{\mu_{0}}^{\pi}(\Gamma(\pi)). (17)

In particular, (π,L)(\pi,L) is a Nash equilibrium solution if and only if L=Γ(π)L=\Gamma(\pi) and Expl(π)=0\text{Expl}(\pi)=0; and a policy π\pi is an ϵ\epsilon-Nash equilibrium solution if Expl(π)ϵ\text{Expl}(\pi)\leq\epsilon.
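For completeness, exploitability can be evaluated with one backward pass for the optimal value and one for the policy value, given the induced flow $L=\Gamma(\pi)$ (computed, e.g., by the forward-recursion sketch in Section 2); this already entails the full policy optimization and evaluation discussed next. The interface below, with arrays r[t, s, a] and P[t, s, a, s'] evaluated at $L$ and policy pi[t, s, a], is again only an assumed one.

```python
import numpy as np

def exploitability(r, P, pi, mu0):
    """Expl(pi) = V*_{mu0}(Gamma(pi)) - V^pi_{mu0}(Gamma(pi)), cf. (17).

    r, P are already evaluated at the induced flow L = Gamma(pi):
    r: (T+1, S, A), P: (T+1, S, A, S), pi: (T+1, S, A), mu0: (S,).
    """
    T1 = r.shape[0]
    V_opt = r[T1 - 1].max(axis=1)                    # V_T^*
    V_pol = (pi[T1 - 1] * r[T1 - 1]).sum(axis=1)     # V_T^pi
    for t in range(T1 - 2, -1, -1):
        Q_opt = r[t] + np.einsum('sap,p->sa', P[t], V_opt)
        Q_pol = r[t] + np.einsum('sap,p->sa', P[t], V_pol)
        V_opt = Q_opt.max(axis=1)                    # best-response backward pass
        V_pol = (pi[t] * Q_pol).sum(axis=1)          # policy-evaluation backward pass
    return mu0 @ (V_opt - V_pol)
```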

However, computing and optimizing exploitability directly is not easy. First, Expl(π)\text{Expl}(\pi) is generally nonconvex and nonsmooth, even if rt(s,a,)r_{t}(s,a,\cdot) and Pt(s,a,)P_{t}(s,a,\cdot) are differentiable in LtL_{t}. Secondly, a single evaluation of Expl(π)\text{Expl}(\pi) requires a full policy optimization when calculating maxπVμ0π(Γ(π))\max_{\pi^{\prime}}V_{\mu_{0}}^{\pi^{\prime}}(\Gamma(\pi)) and a full policy evaluation when calculating Vμ0π(Γ(π))V_{\mu_{0}}^{\pi}(\Gamma(\pi)). In this section, we will show that in order to find an ϵ\epsilon-Nash equilibrium solution, it is sufficient to solve (MF-OMO) to O(ϵ2)O(\epsilon^{2}) sub-optimality.

We will need additional notation here before presenting the precise statement. Denote Pt(|,,Lt)S×S×AP_{t}(\cdot|\cdot,\cdot,L_{t})\in\mathbb{R}^{S\times S\times A} as a tensor of dimensions S×S×AS\times S\times A. For a sequence of T+1T+1 tensors p={pt(|,)}t𝒯S×S×Ap=\{p_{t}(\cdot|\cdot,\cdot)\}_{t\in\mathcal{T}}\in\mathbb{R}^{S\times S\times A} with dimensions S×S×AS\times S\times A (e.g., pt(s|s,a)p_{t}(s^{\prime}|s,a) can be Pt(s|s,a,Lt)P_{t}(s^{\prime}|s,a,L_{t})) and a sequence of T+1T+1 vectors x={xt(,)}t𝒯SAx=\{x_{t}(\cdot,\cdot)\}_{t\in\mathcal{T}}\in\mathbb{R}^{SA} with dimensions S×AS\times A (e.g., xt(s,a)x_{t}(s,a) can be rt(s,a,Lt)r_{t}(s,a,L_{t}), dt(s,a)d_{t}(s,a) or Lt(s,a)L_{t}(s,a), etc.), define

p,1=maxs𝒮,a𝒜,t=0,,T1s𝒮|pt(s|s,a)|,x1,=t𝒯maxs𝒮,a𝒜|xt(s,a)|.\begin{split}&\|p\|_{\infty,1}=\max_{s\in\mathcal{S},a\in\mathcal{A},t=0,\dots,T-1}\sum_{s^{\prime}\in\mathcal{S}}|p_{t}(s^{\prime}|s,a)|,\quad\|x\|_{1,\infty}=\sum_{t\in\mathcal{T}}\max_{s\in\mathcal{S},a\in\mathcal{A}}|x_{t}(s,a)|.\end{split}

Similarly, for any tensor pt(|,)S×S×Ap_{t}(\cdot|\cdot,\cdot)\in\mathbb{R}^{S\times S\times A} with dimensions S×S×AS\times S\times A, define

pt,1=maxs𝒮,a𝒜s𝒮|pt(s|s,a)|.\begin{split}&\|p_{t}\|_{\infty,1}=\max_{s\in\mathcal{S},a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_{t}(s^{\prime}|s,a)|.\end{split}

Denote also 1\|\cdot\|_{1}, 2\|\cdot\|_{2} and \|\cdot\|_{\infty} respectively for the standard 1\ell_{1}, 2\ell_{2} and \ell_{\infty} vector and matrix norms. Then we will see that a near-optimal solution to (MF-OMO) will be close to a Nash equilibrium solution, assuming the dynamics and rewards are Lipschitz continuous in LtL_{t}.

Theorem 8.

Suppose that for any t𝒯t\in\mathcal{T}, Pt(|,,Lt)P_{t}(\cdot|\cdot,\cdot,L_{t}) and rt(,,Lt)r_{t}(\cdot,\cdot,L_{t}) are CP(>0)C_{P}(>0) and Cr(>0)C_{r}(>0) Lipschitz continuous in LtL_{t}, respectively, i.e.,

PtL1PtL2,1CPLt1Lt21,rtL1rtL2CrLt1Lt21,\|P_{t}^{L^{1}}-P_{t}^{L^{2}}\|_{\infty,1}\leq C_{P}\|L_{t}^{1}-L_{t}^{2}\|_{1},\quad\|r_{t}^{L^{1}}-r_{t}^{L^{2}}\|_{\infty}\leq C_{r}\|L_{t}^{1}-L_{t}^{2}\|_{1},

where PtL:=Pt(|,,Lt)P_{t}^{L}:=P_{t}(\cdot|\cdot,\cdot,L_{t}) and rtL:=rt(,,Lt)r_{t}^{L}:=r_{t}(\cdot,\cdot,L_{t}), and Li:={Lti}t𝒯L^{i}:=\{L_{t}^{i}\}_{t\in\mathcal{T}} (i=1,2i=1,2) are two arbitrary mean-field flows. Let y,z,Ly,z,L be a feasible solution to (MF-OMO), with the value of its objective function being fMF-OMO(y,z,L)ϵ2f^{\text{\rm MF-OMO}}(y,z,L)\leq\epsilon^{2} for some ϵ0\epsilon\geq 0. Then for any πΠ(L)\pi\in\Pi(L), we have

Expl(π)f(S,A,T,CP,Cr,rmax)ϵ+ϵ2,\text{\rm Expl}(\pi)\leq f(S,A,T,C_{P},C_{r},r_{\max})\epsilon+\epsilon^{2},

where

f(S,A,T,CP,Cr,rmax):=T(T+1)rmax((CP+1)T+11)S+2Cr(CP+1)T+2(T+2)CP1CP2S+S32A(T+2)3rmax+SA(T+1)+T.\begin{split}f(S,A,T,C_{P},C_{r},r_{\max}):=&\,\,T(T+1)r_{\max}((C_{P}+1)^{T+1}-1)\sqrt{S}\\ &\,+2C_{r}\dfrac{(C_{P}+1)^{T+2}-(T+2)C_{P}-1}{C_{P}^{2}}\sqrt{S}\\ &\,+S^{\frac{3}{2}}A(T+2)^{3}r_{\max}+\sqrt{SA(T+1)}+\sqrt{T}.\end{split}
Remark 3.

In the special case when the transition dynamics Pt(s|s,a,Lt)P_{t}(s^{\prime}|s,a,L_{t}) are independent of LtL_{t}, as generally assumed in the literature of discrete-time mean-field games [31, 59, 58], we have CP=0C_{P}=0 and a much tighter constant

f(S,A,T,CP,Cr,rmax)=Cr(T+1)(T+2)S+S32A(T+2)3rmax+SA(T+1)+T.f(S,A,T,C_{P},C_{r},r_{\max})=C_{r}(T+1)(T+2)\sqrt{S}+S^{\frac{3}{2}}A(T+2)^{3}r_{\max}+\sqrt{SA(T+1)}+\sqrt{T}.

Note that this tighter constant is polynomial in all the problem parameters.

Conversely, one can characterize an ϵ\epsilon-Nash equilibrium solution in terms of the sub-optimality of (MF-OMO), without the additional assumption of the Lipschitz continuity in Theorem 8.

Theorem 9.

Let $\pi=\{\pi_{t}\}_{t\in\mathcal{T}}$ be an $\epsilon$-Nash equilibrium solution, i.e., $\text{Expl}(\pi)\leq\epsilon$. Define $L=\Gamma(\pi)$,

y=[V1(L),V2(L),,VT(L),V0(L)]S(T+1),y=[V_{1}^{\star}(L),V_{2}^{\star}(L),\dots,V_{T}^{\star}(L),-V_{0}^{\star}(L)]\in\mathbb{R}^{S(T+1)},

and z=[z0,z1,,zT]SA(T+1)z=[z_{0},z_{1},\dots,z_{T}]\in\mathbb{R}^{SA(T+1)} with zt=[zt1,,ztA]z_{t}=[z_{t}^{1},\dots,z_{t}^{A}] (t=0,,Tt=0,\dots,T), where

zta=Vt(L)rt(,a,Lt)Pta(Lt)Vt+1(L)S,t=0,,T1,a𝒜z_{t}^{a}=V_{t}^{\star}(L)-r_{t}(\cdot,a,L_{t})-P_{t}^{a}(L_{t})V_{t+1}^{\star}(L)\in\mathbb{R}^{S},\quad t=0,\dots,T-1,\,a\in\mathcal{A}

and zTa=VT(L)rT(,a,LT)Sz_{T}^{a}=V_{T}^{\star}(L)-r_{T}(\cdot,a,L_{T})\in\mathbb{R}^{S} (a𝒜a\in\mathcal{A}). Then y,z,Ly,z,L is a feasible solution to (MF-OMO) and fMF-OMO(y,z,L)ϵf^{\text{\rm MF-OMO}}(y,z,L)\leq\epsilon.

5 Finding Nash equilibrium solutions via solving MF-OMO

The optimization formulation (MF-OMO) enables applications of families of optimization algorithms to find multiple Nash equilibrium solutions. In this section, we first present the projected gradient descent algorithm and establish its convergence guarantees to the Nash equilibrium solutions of general mean-field games. We then present a finite-time convergent algorithm for solving a special class of mean-field games with linear rewards and mean-field independent dynamics. More discussion on stochastic algorithms, convergence to stationary points as well as practical tricks such as reparametrization, acceleration and variance reduction can be found in Appendix A.

To start, we will need the following assumption.

Assumption 1.

Pt(s,a,Lt)P_{t}(s,a,L_{t}) and rt(s,a,Lt)r_{t}(s,a,L_{t}) are both second-order continuously differentiable in LtL_{t} within some open set containing Δ(𝒮×𝒜)\Delta(\mathcal{S}\times\mathcal{A}) for any s𝒮,a𝒜s\in\mathcal{S},a\in\mathcal{A}.

This assumption is easily verifiable and holds for numerical examples in many existing works including [35, 20] and for theoretical studies of [59, 58].

For notational simplicity, denote the concatenation of $y,z,L$ as $\theta$, and recall the notation $\Theta$ for the feasible set of (MF-OMO) (cf. (13)). Then we see that Assumption 1, together with the compactness of the probability simplex for $L_{t}$ and the norm bounds on $y$ and $z$ in (MF-OMO), immediately leads to the following proposition.

Proposition 10.

Under Assumption 1, fMF-OMO(θ)f^{\text{\rm MF-OMO}}(\theta) is MM-strongly smooth for some positive constant M>0M>0, for any θΘ\theta\in\Theta. That is, for any θ,θΘ\theta,\theta^{\prime}\in\Theta, we have θfMF-OMO(θ)θfMF-OMO(θ)2Mθθ2\|\nabla_{\theta}f^{\text{\rm MF-OMO}}(\theta)-\nabla_{\theta}f^{\text{\rm MF-OMO}}(\theta^{\prime})\|_{2}\leq M\|\theta-\theta^{\prime}\|_{2}.

In addition, Assumption 1, together with the compactness of Θ\Theta, implies the continuity condition in Proposition 1 and the Lipschitz continuity condition in Theorem 8. Moreover, optimal solution(s) exist for (MF-OMO) under Assumption 1. We will denote the optimal solution set of (MF-OMO) as Θ\Theta^{\star} hereafter.

5.1 PGD and convergence to Nash equilibrium solutions

Projected gradient descent (PGD).

The algorithm of projected gradient descent updates θ\theta with the following iterative process:

θk+1=θkηkGηk(θk)=ProjΘ(θkηkθfMF-OMO(θk)).\theta_{k+1}=\theta_{k}-\eta_{k}G_{\eta_{k}}(\theta_{k})=\textbf{Proj}_{\Theta}(\theta_{k}-\eta_{k}\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{k})). (18)

Here $\eta_{k}>0$ is the step-size of the $k$-th iteration, for which appropriate ranges are specified below, and the projection operator $\textbf{Proj}_{\Theta}$ projects $L$ to the probability simplex, $z$ to the specialized polyhedron $\{z\,|\,z\geq 0,\,\|z\|_{1}\leq SA(T^{2}+T+2)r_{\max}\}$, and $y$ to the $\ell_{2}$-norm ball of radius $S(T+1)(T+2)r_{\max}/2$. All these projections can be efficiently evaluated in linear time and in (almost) closed form (cf. [28, 49, 64, 17] for projections onto the probability simplex and the specialized polyhedron, and [57] for the projection onto $\ell_{2}$-norm balls). For simplicity, we always assume that the initialization $\theta_{0}$ is feasible, i.e., $\theta_{0}\in\Theta$.
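A sketch of one PGD run (18) is given below. The gradient oracle grad_f (returning the gradients with respect to $y$, $z$ and the flattened $L_t$'s) is assumed to be supplied, e.g., by automatic differentiation of (14); only the three projections are spelled out, using the standard sorting-based routine for the simplex and the specialized polyhedron.

```python
import numpy as np

def project_simplex(v, radius=1.0):
    """Euclidean projection of v onto {x >= 0, sum(x) = radius} (sorting-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - radius
    rho = np.nonzero(u - css / np.arange(1, v.size + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def project_z(z, R):
    """Projection onto {z >= 0, 1^T z <= R}."""
    z = np.maximum(z, 0.0)
    return z if z.sum() <= R else project_simplex(z, R)

def project_y(y, R):
    """Projection onto the l2 ball of radius R."""
    n = np.linalg.norm(y)
    return y if n <= R else (R / n) * y

def pgd(y, z, L, grad_f, eta, num_iters, S, A, T, r_max):
    """Projected gradient descent (18); L has shape (T+1, S*A), y and z are flat vectors."""
    Rz = S * A * (T**2 + T + 2) * r_max
    Ry = S * (T + 1) * (T + 2) * r_max / 2.0
    for _ in range(num_iters):
        gy, gz, gL = grad_f(y, z, L)                  # assumed gradient oracle
        y = project_y(y - eta * gy, Ry)
        z = project_z(z - eta * gz, Rz)
        L = np.array([project_simplex(Lt - eta * gLt) for Lt, gLt in zip(L, gL)])
    return y, z, L
```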

Convergence to Nash equilibrium solutions.

To find a Nash equilibrium solution of the mean-field game ((1) and (2)) by the PGD algorithm, we will need an additional definability assumption stated below, which is one of the most commonly adopted assumptions in the existing convergence theory of nonconvex optimization.

Assumption 2.

For any $s\in\mathcal{S},a\in\mathcal{A}$, $P_{t}(s,a,L_{t})$ and $r_{t}(s,a,L_{t})$ (as functions of $L_{t}$) are both restrictions to $\Delta(\mathcal{S}\times\mathcal{A})$ of definable functions on the log-exp structure. (For simplicity, below we say that a function is definable if it is definable on the log-exp structure.)

Definable functions cover a broad class of (both smooth and nonsmooth) functions, including all semialgebraic functions, all analytic functions on definable compact sets (a set is definable if it can be defined as the image/range of a definable function), and the exponential and logarithm functions. Moreover, any finite combination of definable functions via summation, subtraction, product, affine mapping, composition, (generalized) inversion, max and min, partial supremum and partial infimum, as well as reciprocal (restricted to a compact domain), is definable inside its domain of definition. Generally speaking, definable functions include all functions that are “programmable”, such as those that can be defined in NumPy, SciPy and PyTorch. The precise definition of definability for both sets and functions, as well as of the log-exp structure, can be found in Appendix B.1 (see also [7, Section 4.3] and [19]).

Under Assumptions 1 and 2, we show that the PGD algorithm converges to Nash equilibrium solution(s) when initialized sufficiently close to one. Moreover, if the initialization is close to some isolated Nash equilibrium solution (e.g., when there are finitely many Nash equilibrium solutions), the iterates converge to that specific Nash equilibrium solution. As a result, different initializations enable us to retrieve all possible Nash equilibrium solutions when there are finitely many of them. The proof is left to Appendix B.2.

Theorem 11 (Convergence to Nash equilibrium solutions).

Under Assumptions 1 and 2, let θk=(yk,zk,Lk)\theta_{k}=(y^{k},z^{k},L^{k}) (k0k\geq 0) be the sequence generated by PGD (18) with ηk=η(0,2/M)\eta_{k}=\eta\in(0,2/M). Then for any Nash equilibrium solution (π,L)(\pi^{\star},L^{\star}), there exists ϵ0>0\epsilon_{0}>0, such that for any L0Δ(𝒮×𝒜)L^{0}\in\Delta(\mathcal{S}\times\mathcal{A}) with L0L1ϵ0\|L^{0}-L^{\star}\|_{1}\leq\epsilon_{0}, if we initialize PGD with θ0=(y0,z0,L0)\theta_{0}=(y^{0},z^{0},L^{0}), where y0=[V1(L0),V2(L0),,VT(L0),V0(L0)]S(T+1)y^{0}=[V_{1}^{\star}(L^{0}),V_{2}^{\star}(L^{0}),\dots,V_{T}^{\star}(L^{0}),-V_{0}^{\star}(L^{0})]\in\mathbb{R}^{S(T+1)}, z0=[z0,0,,z0,T]SA(T+1)z^{0}=[z_{0,0},\dots,z_{0,T}]\in\mathbb{R}^{SA(T+1)}, z0,t=[z0,t1,,z0,tA]z_{0,t}=[z_{0,t}^{1},\dots,z_{0,t}^{A}] (t𝒯t\in\mathcal{T}), with

z0,ta=Vt(L0)rt(,a,Lt0)Pta(Lt0)Vt+1(L0)S,t=0,,T1,a𝒜z_{0,t}^{a}=V_{t}^{\star}(L^{0})-r_{t}(\cdot,a,L_{t}^{0})-P_{t}^{a}(L_{t}^{0})V_{t+1}^{\star}(L^{0})\in\mathbb{R}^{S},\quad t=0,\dots,T-1,\,a\in\mathcal{A}

and z0,Ta=VT(L0)rT(,a,LT0)Sz_{0,T}^{a}=V_{T}^{\star}(L^{0})-r_{T}(\cdot,a,L_{T}^{0})\in\mathbb{R}^{S} (a𝒜a\in\mathcal{A}), then we have the following:

  • $\lim_{k\rightarrow\infty}{\bf dist}(\theta_{k},\Theta^{\star})=0$, and $\|L^{k}-L^{\star}\|_{2}\leq g(\epsilon_{0})$ for any $k\geq 0$. Here $g:\mathbb{R}_{+}\rightarrow\mathbb{R}_{+}$ is a function such that $\lim_{\epsilon\rightarrow 0}g(\epsilon)=0$ (the closed-form definition of $g$ is left to Appendix B.2 and can be found in (46) there), and ${\bf dist}$ is the point-set distance defined as ${\bf dist}(\theta_{k},\Theta^{\star})=\inf_{\theta\in\Theta^{\star}}\|\theta_{k}-\theta\|_{2}$.

  • The limit point set of {θk}k0\{\theta_{k}\}_{k\geq 0} is nonempty, and for any limit point θ¯=(y¯,z¯,L¯)\bar{\theta}=(\bar{y},\bar{z},\bar{L}) of the iteration sequence {θk}k0\{\theta_{k}\}_{k\geq 0}, any π¯Π(L¯)\bar{\pi}\in\Pi(\bar{L}) is a Nash equilibrium solution of the mean-field game ((1) and (2)).

  • For any sequence of πkΠ(Lk)\pi^{k}\in\Pi(L^{k}), limkExpl(πk)=0\lim_{k\rightarrow\infty}\text{\rm Expl}(\pi^{k})=0.

In addition, for any isolated Nash equilibrium solution $(\pi^{\star},L^{\star})$ (i.e., there exists $\epsilon>0$ such that any $(\pi,L)$ with $L\neq L^{\star}$ and $\|L-L^{\star}\|_{2}\leq\epsilon$ is not a Nash equilibrium solution), if we initialize PGD with $\theta_{0}=(y^{0},z^{0},L^{0})$ in the same way as above, then $\lim_{k\rightarrow\infty}L^{k}=L^{\star}$ and $\lim_{k\rightarrow\infty}\text{Expl}(\pi^{k})=0$ for any sequence $\pi^{k}\in\Pi(L^{k})$.

5.2 Finite-time solvability for a class of mean-field games

For mean-field games with linear rewards and mean-field independent dynamics, we show that a Nash equilibrium solution can be found in finite time under the MF-OMO framework. This class of mean-field games includes the LR (left-right) and RPS (rock-paper-scissors) problems in [20].

More precisely, consider mean-field games with rt(s,a,Lt)=r¯s,a,t+LtR¯s,a,tr_{t}(s,a,L_{t})=\bar{r}_{s,a,t}+L_{t}^{\top}\bar{R}_{s,a,t}, Pt(s|s,a,Lt)=P¯s,s,a,tP_{t}(s^{\prime}|s,a,L_{t})=\bar{P}_{s^{\prime},s,a,t} for some constants r¯s,a,t,P¯s,s,a,t\bar{r}_{s,a,t},\,\bar{P}_{s^{\prime},s,a,t}\in\mathbb{R} and constant vectors R¯s,a,tSA\bar{R}_{s,a,t}\in\mathbb{R}^{SA}. Here, ALA_{L} is independent of LL, hence denoted as A¯\bar{A} (to distinguish from the number of actions AA), and cLc_{L} is of the form cL=c¯+C¯Lc_{L}=\bar{c}+\bar{C}L, where c¯SA(T+1)\bar{c}\in\mathbb{R}^{SA(T+1)} is the vectorization of [r¯s,a,t]s𝒮,a𝒜,t𝒯S×A×(T+1)[-\bar{r}_{s,a,t}]_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}\in\mathbb{R}^{S\times A\times(T+1)} (with column-major order), and C¯SA(T+1)×SA(T+1)\bar{C}\in\mathbb{R}^{SA(T+1)\times SA(T+1)} is defined as

\bar{C}=\textbf{diag}\left(\left[\begin{matrix}-\bar{R}_{1,1,0}^{\top}\\ \vdots\\ -\bar{R}_{S,A,0}^{\top}\end{matrix}\right],\left[\begin{matrix}-\bar{R}_{1,1,1}^{\top}\\ \vdots\\ -\bar{R}_{S,A,1}^{\top}\end{matrix}\right],\dots,\left[\begin{matrix}-\bar{R}_{1,1,T}^{\top}\\ \vdots\\ -\bar{R}_{S,A,T}^{\top}\end{matrix}\right]\right).

By (12), one can see that finding Nash equilibrium solution(s) of this class of mean-field games is equivalent to solving the following linear complementarity problem (LCP):

\begin{array}[]{ll}\text{\rm minimize}_{y,z,L}&0\\ \text{\rm subject to}&\left[\begin{array}[]{ll}0&\bar{A}\\ \bar{A}^{\top}&-\bar{C}\end{array}\right]\left[\begin{array}[]{l}y\\ L\end{array}\right]+\left[\begin{array}[]{l}0\\ z\end{array}\right]=\left[\begin{array}[]{l}b\\ \bar{c}\end{array}\right],\\ &z^{\top}L=0,\quad L\geq 0,\quad z\geq 0.\end{array}

Notice that for any solution (y,z,L)(y,z,L), we have either zi=0z_{i}=0 or Li=0L_{i}=0 for any i=1,,SA(T+1)i=1,\dots,SA(T+1). Hence it suffices to consider linear programs induced by fixing z𝒟=0z_{\mathcal{D}}=0 and L𝒟c=0L_{\mathcal{D}^{c}}=0 for some subset 𝒟\mathcal{D} of {1,,SA(T+1)}\{1,\dots,SA(T+1)\}, which leads to a total of 2SA(T+1)2^{SA(T+1)} linear programs. Since a linear program can be solved to its global optimality in finite time (e.g., by simplex algorithms [53]), we have the following proposition.

Proposition 12.

Suppose that the mean-field game ((1) and (2)) has linear rewards and mean-field independent dynamics. Then its Nash equilibrium solution can be found in finite time.

In practice, to solve the LCP efficiently by exploiting the linear complementarity structure, one may resort to pivoting procedures [54] or Lemke’s algorithm [2].
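For very small instances, the enumeration argument above can be carried out literally. The brute-force sketch below checks each complementarity pattern with a linear-programming feasibility solve via SciPy's linprog; it is exponential in $SA(T+1)$ and meant only to illustrate the finite-time argument, not to compete with the pivoting or Lemke-type methods just mentioned.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def solve_lcp_bruteforce(A_bar, C_bar, b, c_bar):
    """Enumerate complementarity patterns of the LCP above and solve each induced LP.

    A_bar: (m, n), C_bar: (n, n), b: (m,), c_bar: (n,), with m = S(T+1), n = SA(T+1).
    Returns a feasible (y, z, L) or None.
    """
    m, n = A_bar.shape
    # Equality constraints in x = [y; L; z]:  A_bar L = b  and  A_bar^T y - C_bar L + z = c_bar.
    A_eq = np.block([[np.zeros((m, m)), A_bar, np.zeros((m, n))],
                     [A_bar.T, -C_bar, np.eye(n)]])
    b_eq = np.concatenate([b, c_bar])
    for pattern in itertools.product([0, 1], repeat=n):           # 1: L_i >= 0 free, z_i = 0
        bounds = [(None, None)] * m                               # y is unconstrained
        bounds += [(0, None) if p else (0, 0) for p in pattern]   # L
        bounds += [(0, 0) if p else (0, None) for p in pattern]   # z
        res = linprog(c=np.zeros(m + 2 * n), A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        if res.success:
            return res.x[:m], res.x[m + n:], res.x[m:m + n]       # (y, z, L)
    return None
```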

6 Proofs of main results

6.1 Proof of Lemma 3

For any Nash equilibrium solution $(\pi,L)$, let $d_{t}^{\star}(s,a)=\mathbb{P}^{\pi,L}(s_{t}=s,a_{t}=a)$ denote the state-action distribution of a representative agent taking policy $\pi$ under $L$. We first show that $\pi\in\Pi(d^{\star})$ and $(d^{\star},L)$ satisfies conditions (A) and (B). Since $(\pi,L)$ is a Nash equilibrium solution, it satisfies (1). By Lemma 2, applied to the MDP of the representative agent, $\pi\in\Pi(d^{\star})$ and condition (A) is satisfied for $d^{\star}$ and $L$. Condition (B) can be proved by induction. When $t=0$, $d_{0}^{\star}(s,a)=\mathbb{P}^{\pi,L}(s_{0}=s,a_{0}=a)=\mu_{0}(s)\pi_{0}(a|s)=L_{0}(s,a)$ for any $s\in\mathcal{S}$ and any $a\in\mathcal{A}$, since (15) holds for $L_{0}$. Suppose $d_{t}^{\star}(s,a)=L_{t}(s,a)$ holds for all $t\leq t^{\prime}$ and all $s\in\mathcal{S}$ and $a\in\mathcal{A}$. Then for any $s\in\mathcal{S}$ and any $a\in\mathcal{A}$,

dt+1(s,a)\displaystyle d_{t^{\prime}+1}^{\star}(s,a) =π,L(st+1=s,at+1=a)=π,L(st+1=s)πt+1(a|s)\displaystyle=\mathbb{P}^{\pi,L}(s_{t^{\prime}+1}=s,a_{t^{\prime}+1}=a)=\mathbb{P}^{\pi,L}(s_{t^{\prime}+1}=s)\pi_{t^{\prime}+1}(a|s)
=πt+1(a|s)s𝒮a𝒜π,L(st=s,at=a,st+1=s)\displaystyle=\pi_{t^{\prime}+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\mathbb{P}^{\pi,L}(s_{t^{\prime}}=s^{\prime},a_{t^{\prime}}=a^{\prime},s_{t^{\prime}+1}=s)
=πt+1(a|s)s𝒮a𝒜π,L(st=s,at=a)Pt(s|s,a,Lt)\displaystyle=\pi_{t^{\prime}+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\mathbb{P}^{\pi,L}(s_{t^{\prime}}=s^{\prime},a_{t^{\prime}}=a^{\prime})P_{t^{\prime}}(s|s^{\prime},a^{\prime},L_{t^{\prime}})
\displaystyle=\pi_{t^{\prime}+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d_{t^{\prime}}^{\star}(s^{\prime},a^{\prime})P_{t^{\prime}}(s|s^{\prime},a^{\prime},L_{t^{\prime}})
=πt+1(a|s)s𝒮a𝒜Lt(s,a)Pt(s|s,a,Lt)=Lt+1(s,a),\displaystyle=\pi_{t^{\prime}+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}L_{t^{\prime}}(s^{\prime},a^{\prime})P_{t^{\prime}}(s|s^{\prime},a^{\prime},L_{t^{\prime}})=L_{t^{\prime}+1}(s,a),

where the last equality holds by (16).

Conversely, for any (d^{\star},L) satisfying conditions (A) and (B), define \pi_t(a|s)=\frac{L_t(s,a)}{\sum_{a^{\prime}\in\mathcal{A}}L_t(s,a^{\prime})} when \sum_{a^{\prime}\in\mathcal{A}}L_t(s,a^{\prime})>0, and let \pi_t(\cdot|s) be an arbitrary probability vector when \sum_{a^{\prime}\in\mathcal{A}}L_t(s,a^{\prime})=0. We show that (\pi,L) is a Nash equilibrium solution of the mean-field game ((1) and (2)).

By Lemma 2, condition (1) holds. For condition (2), since d=Ld^{\star}=L, LL satisfies (6) by replacing dd with LL. Therefore, for any s𝒮,t=0,1,,T1s^{\prime}\in\mathcal{S},t=0,1,\dots,T-1,

s𝒮a𝒜Lt(s,a)Pt(s|s,a,Lt)=a𝒜Lt+1(s,a).\displaystyle\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}L_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t})=\sum_{a\in\mathcal{A}}L_{t+1}(s^{\prime},a).

Multiplying both sides by \pi_{t+1}(a^{\prime}|s^{\prime}), and by the definition of \pi,

\displaystyle\pi_{t+1}(a^{\prime}|s^{\prime})\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}L_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t})=L_{t+1}(s^{\prime},a^{\prime}).

Hence condition (2) holds. ∎

6.2 Proof of Proposition 6

Since (y,z,L) is a solution to (12), we have L\geq 0 and \mathbf{1}^{\top}L_t=1 for all t\in\mathcal{T} (by A_L L=b), i.e., L_t\in\Delta(\mathcal{S}\times\mathcal{A}) for each t\in\mathcal{T}. Note that we use \mathbf{1} to denote the all-one vector (with appropriate dimensions). To show that (\hat{y},\hat{z},L) solves (12), it suffices to show that A_L^{\top}\hat{y}+\hat{z}=c_L, \hat{z}\geq 0 and \hat{z}^{\top}L=0.

By writing y^\hat{y} as y^=[y^0,,y^T]\hat{y}=[\hat{y}_{0},\dots,\hat{y}_{T}] with y^tS\hat{y}_{t}\in\mathbb{R}^{S}, z^\hat{z} as z^=[z^0,,z^T]\hat{z}=[\hat{z}_{0},\dots,\hat{z}_{T}] with z^tSA\hat{z}_{t}\in\mathbb{R}^{SA}, we have

ALy^+z^=[W0(L0)y^0+Zy^T+z^0,W1(L1)y^1Zy^0+z^1,WT1(LT1)y^T1Zy^T2+z^T1,Zy^T1+z^T.],A_{L}^{\top}\hat{y}+\hat{z}=\left[\begin{array}[]{c}W_{0}(L_{0})^{\top}\hat{y}_{0}+Z^{\top}\hat{y}_{T}+\hat{z}_{0},\\ W_{1}(L_{1})^{\top}\hat{y}_{1}-Z^{\top}\hat{y}_{0}+\hat{z}_{1},\\ \vdots\\ W_{T-1}(L_{T-1})^{\top}\hat{y}_{T-1}-Z^{\top}\hat{y}_{T-2}+\hat{z}_{T-1},\\ -Z^{\top}\hat{y}_{T-1}+\hat{z}_{T}.\end{array}\right],

which, by the definition of Wt(Lt)W_{t}(L_{t}) and ZZ, becomes

ALy^+z^=[P01(L0)y^0+y^T+z^01P0A(L0)y^0+y^T+z^0AP11(L1)y^1y^0+z^11P1A(L1)y^1y^0+z^1APT11(LT1)y^T1y^T2+z^T11PT1A(LT1)y^T1y^T2+z^T1Ay^T1+z^T1y^T1+z^TA].A_{L}^{\top}\hat{y}+\hat{z}=\left[\begin{array}[]{c}P_{0}^{1}(L_{0})\hat{y}_{0}+\hat{y}_{T}+\hat{z}_{0}^{1}\\ \vdots\\ P_{0}^{A}(L_{0})\hat{y}_{0}+\hat{y}_{T}+\hat{z}_{0}^{A}\\ P_{1}^{1}(L_{1})\hat{y}_{1}-\hat{y}_{0}+\hat{z}_{1}^{1}\\ \vdots\\ P_{1}^{A}(L_{1})\hat{y}_{1}-\hat{y}_{0}+\hat{z}_{1}^{A}\\ \vdots\\ P_{T-1}^{1}(L_{T-1})\hat{y}_{T-1}-\hat{y}_{T-2}+\hat{z}_{T-1}^{1}\\ \vdots\\ P_{T-1}^{A}(L_{T-1})\hat{y}_{T-1}-\hat{y}_{T-2}+\hat{z}_{T-1}^{A}\\ -\hat{y}_{T-1}+\hat{z}_{T}^{1}\\ \vdots\\ -\hat{y}_{T-1}+\hat{z}_{T}^{A}\end{array}\right].

Now noticing that \hat{y}_T=-V_0^{\star}(L), \hat{y}_t=V_{t+1}^{\star}(L) for t=0,\dots,T-1, \hat{z}_T^a=V_T^{\star}(L)-r_T(\cdot,a,L_T), and \hat{z}_t^a=V_t^{\star}(L)-r_t(\cdot,a,L_t)-P_t^a(L_t)V_{t+1}^{\star}(L) for a\in\mathcal{A}, t=0,\dots,T-1, we see

ALy^+z^=[r0(,,L0),,rT(,,LT)]=cL.A_{L}^{\top}\hat{y}+\hat{z}=[-r_{0}(\cdot,\cdot,L_{0}),\dots,-r_{T}(\cdot,\cdot,L_{T})]=c_{L}.

Moreover, we have VT(L)=maxa𝒜rT(,a,LT)V_{T}^{\star}(L)=\max_{a\in\mathcal{A}}r_{T}(\cdot,a,L_{T}), and that

Vt(L)=maxa𝒜{rt(,a,Lt)+Pta(Lt)Vt+1(L)},V_{t}^{\star}(L)=\max_{a\in\mathcal{A}}\left\{r_{t}(\cdot,a,L_{t})+P_{t}^{a}(L_{t})V_{t+1}^{\star}(L)\right\},

implying that z^0\hat{z}\geq 0 by its definition. Note that these are the Bellman equations for the mean-field induced MDP (L)\mathcal{M}(L).
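
For concreteness, the backward recursion that defines V^{\star}(L), \hat{y} and \hat{z} can be written in a few lines. The following is an illustrative numpy sketch; the function name and the array layout (rewards r[t] of shape (S, A) and transitions P[t] of shape (S, A, S), both already evaluated at the fixed flow L) are assumptions made only for this example.

```python
import numpy as np

def bellman_multipliers(r, P):
    """Backward induction for the MDP M(L) with the flow L held fixed.

    r: list of T+1 arrays of shape (S, A), r[t][s, a] = r_t(s, a, L_t)
    P: list of T arrays of shape (S, A, S), P[t][s, a, s'] = P_t(s'|s, a, L_t)
    Returns the value functions V_t^*(L) and the multipliers (y_hat, z_hat).
    """
    T = len(r) - 1
    V = [None] * (T + 1)
    V[T] = r[T].max(axis=1)
    for t in range(T - 1, -1, -1):
        Q = r[t] + np.einsum("sap,p->sa", P[t], V[t + 1])
        V[t] = Q.max(axis=1)
    # y_hat_t = V_{t+1}^*(L) for t = 0, ..., T-1, and y_hat_T = -V_0^*(L).
    y_hat = [V[t + 1] for t in range(T)] + [-V[0]]
    # z_hat_t(., a) = V_t^*(L) - r_t(., a, L_t) - P_t^a(L_t) V_{t+1}^*(L); z_hat_T = V_T^* - r_T.
    z_hat = [V[t][:, None] - r[t] - np.einsum("sap,p->sa", P[t], V[t + 1]) for t in range(T)]
    z_hat.append(V[T][:, None] - r[T])
    return V, y_hat, z_hat
```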

It remains to prove that z^L=0\hat{z}^{\top}L=0. To see this, we first show that y^t1yt1\hat{y}_{t-1}\leq y_{t-1} for t=1,,Tt=1,\dots,T and y^1y1\hat{y}_{-1}\geq y_{-1} by backward induction on tt from TT to 0. Here y^1\hat{y}_{-1} and y1y_{-1} are defined as y^T\hat{y}_{T} and yTy_{T}, respectively.

Firstly, note that ALy+z=cLA_{L}^{\top}y+z=c_{L} and z0z\geq 0. Hence similar to the expansions above, we have

z_T^a=y_{T-1}-r_T(\cdot,a,L_T),\quad z_0^a=-y_T-r_0(\cdot,a,L_0)-P_0^a(L_0)y_0,\quad\forall a\in\mathcal{A},

and

z_t^a=y_{t-1}-r_t(\cdot,a,L_t)-P_t^a(L_t)y_t,\quad\forall a\in\mathcal{A},\,t=1,\dots,T-1.

Now for the base step, since zTa0z_{T}^{a}\geq 0 for all a𝒜a\in\mathcal{A}, we have

y_{T-1}\geq\max_{a\in\mathcal{A}}r_T(\cdot,a,L_T)=V_T^{\star}(L)=\hat{y}_{T-1}.

For the induction step, suppose that y^tyt\hat{y}_{t}\leq y_{t} for some 0<tT10<t\leq T-1. Then since zta0z_{t}^{a}\geq 0 for all a𝒜a\in\mathcal{A}, we have

yt1maxa𝒜{rt(,a,Lt)+Pta(Lt)yt}maxa𝒜{rt(,a,Lt)+Pta(Lt)y^t}=maxa𝒜{rt(,a,Lt)+Pta(Lt)Vt+1(L)}=Vt(L)=y^t1.\begin{split}y_{t-1}&\geq\max_{a\in\mathcal{A}}\left\{r_{t}(\cdot,a,L_{t})+P_{t}^{a}(L_{t})y_{t}\right\}\geq\max_{a\in\mathcal{A}}\left\{r_{t}(\cdot,a,L_{t})+P_{t}^{a}(L_{t})\hat{y}_{t}\right\}\\ &=\max_{a\in\mathcal{A}}\left\{r_{t}(\cdot,a,L_{t})+P_{t}^{a}(L_{t})V_{t+1}^{\star}(L)\right\}=V_{t}^{\star}(L)=\hat{y}_{t-1}.\end{split}

Finally, since \hat{y}_0\leq y_0 by the induction above, and since z_0^a\geq 0 for all a\in\mathcal{A},

y1=yTmina𝒜{r0(,a,L0)P0a(L0)y0}mina𝒜{r0(,a,L0)P0a(L0)y^0}=maxa𝒜{r0(,a,L0)+P0a(L0)V1(L)}=V0(L)=y^T=y^1.\begin{split}y_{-1}=y_{T}&\leq\min_{a\in\mathcal{A}}\left\{-r_{0}(\cdot,a,L_{0})-P_{0}^{a}(L_{0})y_{0}\right\}\leq\min_{a\in\mathcal{A}}\left\{-r_{0}(\cdot,a,L_{0})-P_{0}^{a}(L_{0})\hat{y}_{0}\right\}\\ &=-\max_{a\in\mathcal{A}}\left\{r_{0}(\cdot,a,L_{0})+P_{0}^{a}(L_{0})V_{1}^{\star}(L)\right\}=-V_{0}^{\star}(L)=\hat{y}_{T}=\hat{y}_{-1}.\end{split}

Therefore y^TyT\hat{y}_{T}\geq y_{T}.

Now, since y,z,Ly,z,L is a solution to problem (12), we have

by=(ALL)y+Lz=L(ALy+z)=LcL=cLL.b^{\top}y=(A_{L}L)^{\top}y+L^{\top}z=L^{\top}(A_{L}^{\top}y+z)=L^{\top}c_{L}=c_{L}^{\top}L.

And for any y,zy^{\prime},z^{\prime} satisfying ALy+z=cLA_{L}^{\top}y^{\prime}+z^{\prime}=c_{L} and z0z^{\prime}\geq 0, since ALL=bA_{L}L=b and L0L\geq 0, we have

cLLby=(ALy+z)LLALy=Lz0.c_{L}^{\top}L-b^{\top}y^{\prime}=(A_{L}^{\top}y^{\prime}+z^{\prime})^{\top}L-L^{\top}A_{L}^{\top}y^{\prime}=L^{\top}z^{\prime}\geq 0. (19)

In particular, b^{\top}y=c_L^{\top}L\geq b^{\top}y^{\prime} for any y^{\prime} such that A_L^{\top}y^{\prime}+z^{\prime}=c_L for some z^{\prime}\geq 0. That is, y (together with z) is an optimal solution to the following linear optimization problem for fixed L:

maximizey,zbysubject toALy+z=cL,z0.\begin{array}[]{ll}\text{maximize}_{y^{\prime},z^{\prime}}&b^{\top}y^{\prime}\\ \text{subject to}&A_{L}^{\top}y^{\prime}+z^{\prime}=c_{L},\quad z^{\prime}\geq 0.\end{array} (20)

Now taking y=y^y^{\prime}=\hat{y} and z=z^z^{\prime}=\hat{z}, we see that ALy^+z^=cLA_{L}^{\top}\hat{y}+\hat{z}=c_{L} and z^0\hat{z}\geq 0 as proved above. Hence y^,z^\hat{y},\hat{z} is a feasible solution to problem (20), and byby^b^{\top}y\geq b^{\top}\hat{y}. Since y^TyT\hat{y}_{T}\geq y_{T},

by^=μ0y^Tμ0yT=by.b^{\top}\hat{y}=\mu_{0}^{\top}\hat{y}_{T}\geq\mu_{0}^{\top}y_{T}=b^{\top}y.

Combined, by=by^b^{\top}y=b^{\top}\hat{y}, hence by (19),

Lz^=cLLby^=cLLby=0.L^{\top}\hat{z}=c_{L}^{\top}L-b^{\top}\hat{y}=c_{L}^{\top}L-b^{\top}y=0.

6.3 Proof of Theorem 8

Theorem 8 is established via the following perturbation lemma for general MDPs.

Lemma 13.

Consider two finite-horizon MDPs \mathcal{M}^1 and \mathcal{M}^2 with finite state and action spaces, with respective transition probabilities \{p_t^1\}_{t\in\mathcal{T}} and \{p_t^2\}_{t\in\mathcal{T}}, rewards \{r_t^1\}_{t\in\mathcal{T}} and \{r_t^2\}_{t\in\mathcal{T}}, and expected total rewards V^{1,\pi} and V^{2,\pi} for any policy \pi. In addition, denote their corresponding optimal values by V^{1,\star} and V^{2,\star}. Then we have

|V1,πV2,π|T(T+1)rmax2p1p2,1+r1r21,,|V^{1,\pi}-V^{2,\pi}|\leq\dfrac{T(T+1)r_{\max}}{2}\|p^{1}-p^{2}\|_{\infty,1}+\|r^{1}-r^{2}\|_{1,\infty},

and

|V1,V2,|T(T+1)rmax2p1p2,1+r1r21,.|V^{1,\star}-V^{2,\star}|\leq\dfrac{T(T+1)r_{\max}}{2}\|p^{1}-p^{2}\|_{\infty,1}+\|r^{1}-r^{2}\|_{1,\infty}.

Here r_{\max} is such that |r_t^1(s,a)|\leq r_{\max} and |r_t^2(s,a)|\leq r_{\max} for any s\in\mathcal{S}, a\in\mathcal{A}, t\in\mathcal{T}.

Proof of Lemma 13.

The proof consists of two parts.

Part I: Values for fixed policy π\pi. Let Vti,π(s)V_{t}^{i,\pi}(s) be the total expected reward starting from state ss at time tt under policy π\pi in the MDP i\mathcal{M}^{i} (i=1,2i=1,2). Clearly Vi,π=s𝒮μ0(s)V0i,π(s)V^{i,\pi}=\sum_{s\in\mathcal{S}}\mu_{0}(s)V_{0}^{i,\pi}(s) (i=1,2i=1,2).

Now by the dynamic programming principle, we have for i=1,2i=1,2, t=0,,T1t=0,\dots,T-1 and s𝒮s\in\mathcal{S},

Vti,π(s)=a𝒜πt(a|s)rti(s,a)+a𝒜s𝒮πt(a|s)pti(s|s,a)Vt+1i,π(s),V_{t}^{i,\pi}(s)=\sum_{a\in\mathcal{A}}\pi_{t}(a|s)r_{t}^{i}(s,a)+\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}\pi_{t}(a|s)p_{t}^{i}(s^{\prime}|s,a)V_{t+1}^{i,\pi}(s^{\prime}),

and V_T^{i,\pi}(s)=\sum_{a\in\mathcal{A}}\pi_T(a|s)r_T^i(s,a), implying

\begin{split}|V_t^{1,\pi}(s)-V_t^{2,\pi}(s)|\leq&\,\sum_{a\in\mathcal{A}}|r_t^1(s,a)-r_t^2(s,a)|\pi_t(a|s)\\ &+\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}\pi_t(a|s)\left|p_t^1(s^{\prime}|s,a)V_{t+1}^{1,\pi}(s^{\prime})-p_t^2(s^{\prime}|s,a)V_{t+1}^{2,\pi}(s^{\prime})\right|\\ \leq&\,\max_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_t^1(s^{\prime}|s,a)-p_t^2(s^{\prime}|s,a)||V_{t+1}^{1,\pi}(s^{\prime})|\\ &+\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}\pi_t(a|s)p_t^2(s^{\prime}|s,a)|V_{t+1}^{1,\pi}(s^{\prime})-V_{t+1}^{2,\pi}(s^{\prime})|\\ &+\max_{a\in\mathcal{A}}|r_t^1(s,a)-r_t^2(s,a)|\\ \overset{(a)}{\leq}&\,(T-t)r_{\max}\max_{s\in\mathcal{S},a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_t^1(s^{\prime}|s,a)-p_t^2(s^{\prime}|s,a)|+\|V_{t+1}^{1,\pi}-V_{t+1}^{2,\pi}\|_{\infty}\\ &+\max_{s\in\mathcal{S},a\in\mathcal{A}}|r_t^1(s,a)-r_t^2(s,a)|.\end{split} (21)

Here (a) uses the fact that |V_{t+1}^{i,\pi}(s)|\leq(T-t)r_{\max}, and V_{t+1}^{i,\pi} is viewed as a vector in \mathbb{R}^S when taking the \ell_{\infty} norm.

By taking max over s𝒮s\in\mathcal{S} on the left-hand side of (21) and telescoping the inequalities, we have

V01,πV02,πt=0T1(Tt)rmaxmaxs𝒮,a𝒜s𝒮|pt1(s|s,a)pt2(s|s,a)|+t=0Tmaxs𝒮,a𝒜|rt1(s,a)rt2(s,a)|T(T+1)rmax2p1p2,1+r1r21,.\begin{split}\|V_{0}^{1,\pi}-V_{0}^{2,\pi}\|_{\infty}\leq&\sum_{t=0}^{T-1}(T-t)r_{\max}\max_{s\in\mathcal{S},a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_{t}^{1}(s^{\prime}|s,a)-p_{t}^{2}(s^{\prime}|s,a)|\\ &+\sum_{t=0}^{T}\max_{s\in\mathcal{S},a\in\mathcal{A}}|r_{t}^{1}(s,a)-r_{t}^{2}(s,a)|\\ \leq&\dfrac{T(T+1)r_{\max}}{2}\|p^{1}-p^{2}\|_{\infty,1}+\|r^{1}-r^{2}\|_{1,\infty}.\end{split}

Here we use the fact that

VT1,π(s)VT2,π(s)=a𝒜rT1(s,a)πT(a|s)a𝒜rT2(s,a)πT(a|s).V_{T}^{1,\pi}(s)-V_{T}^{2,\pi}(s)=\sum_{a\in\mathcal{A}}r_{T}^{1}(s,a)\pi_{T}(a|s)-\sum_{a\in\mathcal{A}}r_{T}^{2}(s,a)\pi_{T}(a|s).

Part II: Optimal values. Let V_t^{i,\star}(s) be the optimal value function for the MDP \mathcal{M}^i (i=1,2) starting at time t\in\mathcal{T}. Then we have V^{i,\star}=\sum_{s\in\mathcal{S}}\mu_0(s)V_0^{i,\star}(s) (i=1,2).

Now by the dynamic programming principle, we have for i=1,2i=1,2, t=0,,T1t=0,\dots,T-1 and s𝒮s\in\mathcal{S},

Vti,(s)=maxa𝒜(rti(s,a)+s𝒮pti(s|s,a)Vt+1i,(s)),V_{t}^{i,\star}(s)=\max_{a\in\mathcal{A}}\left(r_{t}^{i}(s,a)+\sum_{s^{\prime}\in\mathcal{S}}p_{t}^{i}(s^{\prime}|s,a)V_{t+1}^{i,\star}(s^{\prime})\right),

which implies that

\begin{split}|V_t^{1,\star}(s)&-V_t^{2,\star}(s)|\\ &\leq\max_{a\in\mathcal{A}}\left\{|r_t^1(s,a)-r_t^2(s,a)|+\sum_{s^{\prime}\in\mathcal{S}}\left|p_t^1(s^{\prime}|s,a)V_{t+1}^{1,\star}(s^{\prime})-p_t^2(s^{\prime}|s,a)V_{t+1}^{2,\star}(s^{\prime})\right|\right\}\\ &\leq\max_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_t^1(s^{\prime}|s,a)-p_t^2(s^{\prime}|s,a)|V_{t+1}^{1,\star}(s^{\prime})\\ &\quad+\max_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}p_t^2(s^{\prime}|s,a)|V_{t+1}^{1,\star}(s^{\prime})-V_{t+1}^{2,\star}(s^{\prime})|+\max_{a\in\mathcal{A}}|r_t^1(s,a)-r_t^2(s,a)|\\ &\overset{(a)}{\leq}(T-t)r_{\max}\max_{s\in\mathcal{S},a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_t^1(s^{\prime}|s,a)-p_t^2(s^{\prime}|s,a)|+\|V_{t+1}^{1,\star}-V_{t+1}^{2,\star}\|_{\infty}\\ &\quad+\max_{s\in\mathcal{S},a\in\mathcal{A}}|r_t^1(s,a)-r_t^2(s,a)|.\end{split} (22)

Here in (a) we use the fact that V_{t+1}^{i,\star}(s)\in[0,(T-t)r_{\max}], and V_{t+1}^{i,\star} is viewed as a vector in \mathbb{R}^S when taking the \ell_{\infty} norm.

By taking max over s𝒮s\in\mathcal{S} on the left-hand side of (22) and telescoping the inequalities, we have

V01,V02,t=0T1(Tt)rmaxmaxs𝒮,a𝒜s𝒮|pt1(s|s,a)pt2(s|s,a)|+t=0Tmaxs𝒮,a𝒜|rt1(s,a)rt2(s,a)|T(T+1)rmax2p1p2,1+r1r21,.\begin{split}\|V_{0}^{1,\star}-V_{0}^{2,\star}\|_{\infty}\leq&\sum_{t=0}^{T-1}(T-t)r_{\max}\max_{s\in\mathcal{S},a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_{t}^{1}(s^{\prime}|s,a)-p_{t}^{2}(s^{\prime}|s,a)|\\ &+\sum_{t=0}^{T}\max_{s\in\mathcal{S},a\in\mathcal{A}}|r_{t}^{1}(s,a)-r_{t}^{2}(s,a)|\\ \leq&\dfrac{T(T+1)r_{\max}}{2}\|p^{1}-p^{2}\|_{\infty,1}+\|r^{1}-r^{2}\|_{1,\infty}.\end{split}

Here we use the fact that |V_T^{1,\star}(s)-V_T^{2,\star}(s)|=|\max_{a\in\mathcal{A}}r_T^1(s,a)-\max_{a\in\mathcal{A}}r_T^2(s,a)|\leq\max_{a\in\mathcal{A}}|r_T^1(s,a)-r_T^2(s,a)|. ∎
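
As an aside, the optimal-value part of Lemma 13 can be sanity-checked numerically on random finite-horizon MDPs. The script below is a small illustrative sketch (all names are ours); it assumes rewards lying in [0, 1], so that r_{\max}=1 and the nonnegativity used in step (a) holds.

```python
import numpy as np

def optimal_value(r, P, mu0):
    """Optimal value of a finite-horizon MDP by backward induction.
    r[t]: (S, A) rewards, P[t]: (S, A, S) transitions, mu0: initial distribution."""
    T = len(r) - 1
    V = r[T].max(axis=1)
    for t in range(T - 1, -1, -1):
        V = (r[t] + np.einsum("sap,p->sa", P[t], V)).max(axis=1)
    return mu0 @ V

rng = np.random.default_rng(0)
S, A, T, delta = 4, 3, 6, 1e-2
mu0 = np.ones(S) / S
P1 = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(T)]
r1 = [rng.uniform(0.0, 1.0, size=(S, A)) for _ in range(T + 1)]
# Perturb rewards (clipped to [0, 1]) and transitions (re-normalized) by at most delta.
r2 = [np.clip(r + rng.uniform(-delta, delta, size=(S, A)), 0.0, 1.0) for r in r1]
P2 = []
for P in P1:
    Q = np.clip(P + rng.uniform(-delta, delta, size=P.shape), 1e-8, None)
    P2.append(Q / Q.sum(axis=-1, keepdims=True))
gap = abs(optimal_value(r1, P1, mu0) - optimal_value(r2, P2, mu0))
r_max = 1.0
p_gap = max(np.abs(a - b).sum(axis=-1).max() for a, b in zip(P1, P2))  # ||p1 - p2||_{inf,1}
r_gap = sum(np.abs(a - b).max() for a, b in zip(r1, r2))               # ||r1 - r2||_{1,inf}
bound = T * (T + 1) * r_max / 2 * p_gap + r_gap
print(gap, "<=", bound)  # the lemma guarantees gap <= bound
```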

Proof of Theorem 8.

We first define d=\tilde{\Gamma}(\pi,L) for the given \pi\in\Pi(L), where \tilde{\Gamma}(\pi,L) is defined recursively by \tilde{\Gamma}(\pi,L)_0(s,a):=\mu_0(s)\pi_0(a|s) (for any s\in\mathcal{S}, a\in\mathcal{A}) and

Γ~(π,L)t+1(s,a):=πt+1(a|s)s𝒮a𝒜Γ~(π,L)t(s,a)Pt(s|s,a,Lt),s𝒮,a𝒜.\tilde{\Gamma}(\pi,L)_{t+1}(s,a):=\pi_{t+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\tilde{\Gamma}(\pi,L)_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},L_{t}),\quad\forall s\in\mathcal{S},a\in\mathcal{A}.

Then dd is the occupation measure of policy π\pi for MDP (L)\mathcal{M}(L).
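
The recursion for \tilde{\Gamma}(\pi,L) is straightforward to implement. The following numpy sketch (the function name and array layout are assumptions made for illustration) computes d given \pi and the transition kernels already evaluated at the fixed flow L.

```python
import numpy as np

def occupation_measure(pi, P, mu0):
    """Compute d = Gamma~(pi, L) for the MDP M(L) with the flow L frozen.

    pi:  list of T+1 arrays of shape (S, A), pi[t][s, a] = pi_t(a|s)
    P:   list of T arrays of shape (S, A, S), P[t][s, a, s'] = P_t(s'|s, a, L_t)
    mu0: initial state distribution of shape (S,)
    """
    T = len(pi) - 1
    d = [mu0[:, None] * pi[0]]                        # d_0(s, a) = mu0(s) pi_0(a|s)
    for t in range(T):
        nu_next = np.einsum("sa,sap->p", d[t], P[t])  # state marginal at time t+1
        d.append(nu_next[:, None] * pi[t + 1])        # d_{t+1}(s, a) = nu_{t+1}(s) pi_{t+1}(a|s)
    return d
```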

Step 1: Closeness between dd and LL. By definition, ALd=bA_{L}d=b. We first prove by induction that for all tt,

s𝒮a𝒜|dt(s,a)Lt(s,a)|S(t+1)ϵ.\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}|d_{t}(s,a)-L_{t}(s,a)|\leq\sqrt{S}(t+1)\epsilon. (23)

When t=0t=0,

s𝒮a𝒜|d0(s,a)L0(s,a)|=s𝒮a𝒜π0(a|s)|μ0(s)a𝒜L0(s,a)|=s𝒮|μ0(s)a𝒜L0(s,a)|Sϵ,\begin{split}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}|d_{0}(s,a)-L_{0}(s,a)|&=\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\pi_{0}(a|s)|\mu_{0}(s)-\sum_{a^{\prime}\in\mathcal{A}}L_{0}(s,a^{\prime})|\\ &=\sum_{s\in\mathcal{S}}|\mu_{0}(s)-\sum_{a^{\prime}\in\mathcal{A}}L_{0}(s,a^{\prime})|\leq\sqrt{S}\epsilon,\end{split} (24)

where the last inequality is by ALLb2ϵ\|A_{L}L-b\|_{2}\leq\epsilon. Now suppose (23) holds for tt. Then by the construction of π\pi and dd,

dt+1(s,a)=πt+1(a|s)s𝒮a𝒜dt(s,a)Pt(s|s,a,Lt),d_{t+1}(s,a)=\pi_{t+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},L_{t}), (25)
Lt+1(s,a)=πt+1(a|s)a𝒜Lt+1(s,a).L_{t+1}(s,a)=\pi_{t+1}(a|s)\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime}). (26)

Therefore

s𝒮a𝒜|dt+1(s,a)Lt+1(s,a)|=s𝒮a𝒜πt+1(a|s)|s𝒮a𝒜dt(s,a)Pt(s|s,a,Lt)a𝒜Lt+1(s,a)|=s𝒮|s𝒮a𝒜dt(s,a)Pt(s|s,a,Lt)a𝒜Lt+1(s,a)|s𝒮|s𝒮a𝒜(dt(s,a)Lt(s,a))Pt(s|s,a,Lt)|+s𝒮|s𝒮a𝒜Lt(s,a)Pt(s|s,a,Lt)a𝒜Lt+1(s,a)|S(t+1)ϵ+Sϵ=S(t+2)ϵ,\begin{split}\sum_{s\in\mathcal{S}}&\sum_{a\in\mathcal{A}}|d_{t+1}(s,a)-L_{t+1}(s,a)|\\ &=\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\pi_{t+1}(a|s)|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},L_{t})-\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime})|\\ &=\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},L_{t})-\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime})|\\ &\leq\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\left(d_{t}(s^{\prime},a^{\prime})-L_{t}(s^{\prime},a^{\prime})\right)P_{t}(s|s^{\prime},a^{\prime},L_{t})|\\ &\qquad\qquad+\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}L_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},L_{t})-\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime})|\\ &\leq\sqrt{S}(t+1)\epsilon+\sqrt{S}\epsilon=\sqrt{S}(t+2)\epsilon,\end{split} (27)

where the last inequality uses the induction hypothesis for the first term and the constraint \|A_L L-b\|_2\leq\epsilon for the second term. This completes the induction.

The above bound (27), together with the fact that 0zLϵ20\leq z^{\top}L\leq\epsilon^{2}, immediately implies that

|zd||z(Ld)|+|zL|z1Ld+ϵ2SA(T2+T+2)rmaxS(T+1)ϵ+ϵ2.\begin{split}|z^{\top}d|&\leq|z^{\top}(L-d)|+|z^{\top}L|\leq\|z\|_{1}\|L-d\|_{\infty}+\epsilon^{2}\\ &\leq SA(T^{2}+T+2)r_{\max}\sqrt{S}(T+1)\epsilon+\epsilon^{2}.\end{split}

Hence by ALd=bA_{L}d=b and ALy+zcL22ϵ2\|A_{L}^{\top}y+z-c_{L}\|_{2}^{2}\leq\epsilon^{2}, we have

|bycLd|=|dALycLd||dz|+|d(ALy+zcL)|SA(T2+T+2)rmaxS(T+1)ϵ+ϵ2+d2ALy+zcL2SA(T2+T+2)rmaxS(T+1)ϵ+ϵ2+Tϵ.\begin{split}|b^{\top}y-c_{L}^{\top}d|&=|d^{\top}A_{L}^{\top}y-c_{L}^{\top}d|\leq|d^{\top}z|+|d^{\top}(A_{L}^{\top}y+z-c_{L})|\\ &\leq SA(T^{2}+T+2)r_{\max}\sqrt{S}(T+1)\epsilon+\epsilon^{2}+\|d\|_{2}\|A_{L}^{\top}y+z-c_{L}\|_{2}\\ &\leq SA(T^{2}+T+2)r_{\max}\sqrt{S}(T+1)\epsilon+\epsilon^{2}+\sqrt{T}\epsilon.\end{split}

Here we use dt1=1\|d_{t}\|_{1}=1 for any t𝒯t\in\mathcal{T} and dt2dt1\|d_{t}\|_{2}\leq\|d_{t}\|_{1}.

Step 2: Near-optimality of dd (and π\pi) for (L)\mathcal{M}(L). Now we show that dd is near-optimal for the MDP (L)\mathcal{M}(L). To see this, define Δ:=ALy+zcL\Delta:=A_{L}^{\top}y+z-c_{L}. Here Δ\Delta can be viewed both as a vector in SA(T+1)\mathbb{R}^{SA(T+1)} and a sequence of T+1T+1 matrices {Δt}t𝒯\{\Delta_{t}\}_{t\in\mathcal{T}} with ΔtS×A\Delta_{t}\in\mathbb{R}^{S\times A}. Then we have Δ22ϵ2\|\Delta\|_{2}^{2}\leq\epsilon^{2}.

Consider the following linear program with variables y^,z^\hat{y},\hat{z}:

maximizey^,z^by^subject toALy^+z^=cL+Δ,z^0,\begin{array}[]{ll}\text{maximize}_{\hat{y},\hat{z}}&b^{\top}\hat{y}\\ \text{subject to}&A_{L}^{\top}\hat{y}+\hat{z}=c_{L}+\Delta,\quad\hat{z}\geq 0,\end{array} (28)

and denote y^,z^\hat{y}^{\star},\hat{z}^{\star} as its optimal solution. Then since y,zy,z is its feasible solution, we have byby^b^{\top}y\leq b^{\top}\hat{y}^{\star}.

Note that the dual problem of (28) can be written as

minimized^(cL+Δ)d^subject toALd^=b,d^0.\begin{array}[]{ll}\text{minimize}_{\hat{d}}&(c_{L}+\Delta)^{\top}\hat{d}\\ \text{subject to}&A_{L}\hat{d}=b,\quad\hat{d}\geq 0.\end{array} (29)

By Lemma 2 and (7), if d^\hat{d}^{\star} solves (29), then any π^Π(d^)\hat{\pi}^{\star}\in\Pi(\hat{d}^{\star}) is an optimal policy for the MDP ^(L)\hat{\mathcal{M}}(L) with transitions Pt(s|s,a,Lt)P_{t}(s^{\prime}|s,a,L_{t}) and rewards rt(s,a,Lt)Δt(s,a)r_{t}(s,a,L_{t})-\Delta_{t}(s,a) (with L={Lt}t𝒯L=\{L_{t}\}_{t\in\mathcal{T}} fixed), and the optimal value of the objective function of (29) is equal to the negative optimal value of the MDP ^(L)\hat{\mathcal{M}}(L). By the strong duality of linear programs, we see that by^=(cL+Δ)d^b^{\top}\hat{y}^{\star}=(c_{L}+\Delta)^{\top}\hat{d}^{\star} is equal to the negative optimal value of the MDP ^(L)\hat{\mathcal{M}}(L). Note that the components of cLc_{L} are negative rewards, and that the existence of the optimal solution d^\hat{d}^{\star} follows from the existence of optimal policies for any finite MDP with finite horizon and Lemma 2 and (7).

Now, let dd^{\star} be an optimal solution to the following linear program:

minimizedcLdsubject toALd=b,d0.\begin{array}[]{ll}\text{minimize}_{d}&c_{L}^{\top}d\\ \text{subject to}&A_{L}d=b,\quad d\geq 0.\end{array} (30)

Then again by Lemma 2 and (7), cLdc_{L}^{\top}d^{\star} is equal to the negative optimal value of the MDP (L)\mathcal{M}(L). Hence by applying Lemma 13 to the MDPs ^(L)\hat{\mathcal{M}}(L) and (L)\mathcal{M}(L),

|cLd(cL+Δ)d^|Δ1,=t=0Tmaxs𝒮,a𝒜|Δt(s,a)|t𝒯,s𝒮,a𝒜|Δt(s,a)|SA(T+1)ϵ.\begin{split}|c_{L}^{\top}d^{\star}-(c_{L}+\Delta)^{\top}\hat{d}^{\star}|&\leq\|\Delta\|_{1,\infty}=\sum_{t=0}^{T}\max_{s\in\mathcal{S},a\in\mathcal{A}}|\Delta_{t}(s,a)|\leq\sum_{t\in\mathcal{T},s\in\mathcal{S},a\in\mathcal{A}}|\Delta_{t}(s,a)|\\ &\leq\sqrt{SA(T+1)}\epsilon.\end{split}

Putting the above bounds together, we conclude that

cLdcLd=cLdby+bycLdcLdby+by^cLd=cLdby+(cL+Δ)d^cLdSA(T2+T+2)rmaxS(T+1)ϵ+ϵ2+Tϵ+SA(T+1)ϵ.\begin{split}c_{L}^{\top}d-c_{L}^{\top}d^{\star}&=c_{L}^{\top}d-b^{\top}y+b^{\top}y-c_{L}^{\top}d^{\star}\\ &\leq c_{L}^{\top}d-b^{\top}y+b^{\top}\hat{y}^{\star}-c_{L}^{\top}d^{\star}\\ &=c_{L}^{\top}d-b^{\top}y+(c_{L}+\Delta)^{\top}\hat{d}^{\star}-c_{L}^{\top}d^{\star}\\ &\leq SA(T^{2}+T+2)r_{\max}\sqrt{S}(T+1)\epsilon+\epsilon^{2}+\sqrt{T}\epsilon+\sqrt{SA(T+1)}\epsilon.\end{split} (31)

Since dd is a feasible solution to (30), again by Lemma 2, (31) implies that π\pi, by definition satisfying πΠ(d)\pi\in\Pi(d) from (25), is a (CS,A,Tϵ+ϵ2)(C_{S,A,T}\epsilon+\epsilon^{2})-suboptimal solution to the MDP (L)\mathcal{M}(L), where CS,A,T=S32A(T+2)3rmax+SA(T+1)+TC_{S,A,T}=S^{\frac{3}{2}}A(T+2)^{3}r_{\max}+\sqrt{SA(T+1)}+\sqrt{T}.

Step 3: Bounding the gap between LL and Γ(π)\Gamma(\pi). We then bound the difference between LL and Γ(π)\Gamma(\pi) by proving the following for all tt:

s𝒮a𝒜|Γ(π)t(s,a)Lt(s,a)|(CP+1)t+11CPSϵ.\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}|\Gamma(\pi)_{t}(s,a)-L_{t}(s,a)|\leq\frac{(C_{P}+1)^{t+1}-1}{C_{P}}\sqrt{S}\epsilon. (32)

This is again proved by induction. When t=0t=0, it holds by noticing Γ(π)0(s,a)=d0(s,a)\Gamma(\pi)_{0}(s,a)=d_{0}(s,a) and (24). Now suppose (32) holds for tt, consider when t+1t+1, by the definition of Γ(π)\Gamma(\pi), we have

Γ(π)t+1(s,a)=πt+1(a|s)s𝒮a𝒜Γ(π)t(s,a)Pt(s|s,a,Γ(π)t).\Gamma(\pi)_{t+1}(s,a)=\pi_{t+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\Gamma(\pi)_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},\Gamma(\pi)_{t}). (33)

Then

\begin{split}\sum_{s\in\mathcal{S}}&\sum_{a\in\mathcal{A}}|\Gamma(\pi)_{t+1}(s,a)-L_{t+1}(s,a)|\\ &=\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\Gamma(\pi)_t(s^{\prime},a^{\prime})P_t(s|s^{\prime},a^{\prime},\Gamma(\pi)_t)-\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime})|\\ &\leq\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\Gamma(\pi)_t(s^{\prime},a^{\prime})\left(P_t(s|s^{\prime},a^{\prime},\Gamma(\pi)_t)-P_t(s|s^{\prime},a^{\prime},L_t)\right)|\\ &\qquad\qquad+\sum_{s\in\mathcal{S}}\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}|\Gamma(\pi)_t(s^{\prime},a^{\prime})-L_t(s^{\prime},a^{\prime})|P_t(s|s^{\prime},a^{\prime},L_t)\\ &\qquad\qquad+\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}L_t(s^{\prime},a^{\prime})P_t(s|s^{\prime},a^{\prime},L_t)-\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime})|\\ &\leq\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\Gamma(\pi)_t(s^{\prime},a^{\prime})\sum_{s\in\mathcal{S}}|P_t(s|s^{\prime},a^{\prime},\Gamma(\pi)_t)-P_t(s|s^{\prime},a^{\prime},L_t)|\\ &\qquad\qquad+\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}|\Gamma(\pi)_t(s^{\prime},a^{\prime})-L_t(s^{\prime},a^{\prime})|+\sqrt{S}\epsilon\\ &\leq(C_P+1)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}|\Gamma(\pi)_t(s^{\prime},a^{\prime})-L_t(s^{\prime},a^{\prime})|+\sqrt{S}\epsilon\\ &\leq(C_P+1)\frac{(C_P+1)^{t+1}-1}{C_P}\sqrt{S}\epsilon+\sqrt{S}\epsilon=\frac{(C_P+1)^{t+2}-1}{C_P}\sqrt{S}\epsilon.\end{split} (34)

This concludes the induction. Consequently,

PΓ(π)PL,1=maxt𝒯PtΓ(π)PtL,1maxt𝒯CPΓ(π)tLt1((CP+1)T+11)Sϵ,\begin{split}\|P^{\Gamma(\pi)}-P^{L}\|_{\infty,1}&=\max_{t\in\mathcal{T}}\|P_{t}^{\Gamma(\pi)}-P_{t}^{L}\|_{\infty,1}\leq\max_{t\in\mathcal{T}}C_{P}\|\Gamma(\pi)_{t}-L_{t}\|_{1}\\ &\leq((C_{P}+1)^{T+1}-1)\sqrt{S}\epsilon,\end{split}

and similarly,

rΓ(π)rL1,=t=0TrtΓ(π)rtLCrt=0TΓ(π)tLt1Cr(CP+1)T+2(T+2)CP1CP2Sϵ.\begin{split}\|r^{\Gamma(\pi)}-r^{L}\|_{1,\infty}&=\sum_{t=0}^{T}\|r_{t}^{\Gamma(\pi)}-r_{t}^{L}\|_{\infty}\leq C_{r}\sum_{t=0}^{T}\|\Gamma(\pi)_{t}-L_{t}\|_{1}\\ &\leq C_{r}\dfrac{(C_{P}+1)^{T+2}-(T+2)C_{P}-1}{C_{P}^{2}}\sqrt{S}\epsilon.\end{split}

Finally, by applying Lemma 13 to the MDPs (L)\mathcal{M}(L) and (Γ(π))\mathcal{M}(\Gamma(\pi)), and utilizing the fact that π\pi is (CS,A,Tϵ+ϵ2)(C_{S,A,T}\epsilon+\epsilon^{2})-suboptimal for the MDP (L)\mathcal{M}(L), we see that

Expl(π)=Vμ0(Γ(π))Vμ0π(Γ(π))=Vμ0(Γ(π))Vμ0(L)+Vμ0(L)Vμ0π(L)+Vμ0π(L)Vμ0π(Γ(π))T(T+1)rmax((CP+1)T+11)Sϵ+2Cr(CP+1)T+2(T+2)CP1CP2Sϵ+CS,A,Tϵ+ϵ2\begin{split}\text{Expl}(\pi)&=V_{\mu_{0}}^{\star}(\Gamma(\pi))-V_{\mu_{0}}^{\pi}(\Gamma(\pi))\\ &=V_{\mu_{0}}^{\star}(\Gamma(\pi))-V_{\mu_{0}}^{\star}(L)+V_{\mu_{0}}^{\star}(L)-V_{\mu_{0}}^{\pi}(L)+V_{\mu_{0}}^{\pi}(L)-V_{\mu_{0}}^{\pi}(\Gamma(\pi))\\ &\leq T(T+1)r_{\max}((C_{P}+1)^{T+1}-1)\sqrt{S}\epsilon\\ &\quad+2C_{r}\dfrac{(C_{P}+1)^{T+2}-(T+2)C_{P}-1}{C_{P}^{2}}\sqrt{S}\epsilon+C_{S,A,T}\epsilon+\epsilon^{2}\end{split}

Here Vμ0(L)=maxπVμ0π(L)=s𝒮μ0(s)[V0(L)]sV_{\mu_{0}}^{\star}(L)=\max_{\pi^{\prime}}V_{\mu_{0}}^{\pi^{\prime}}(L)=\sum_{s\in\mathcal{S}}\mu_{0}(s)[V_{0}^{\star}(L)]_{s} is the optimal value of the MDP (L)\mathcal{M}(L) for any mean-field flow LL. ∎

6.4 Proof of Theorem 9

The feasibility of (y,z,L) is immediate from the definition. In particular, L=\Gamma(\pi) implies that L\geq 0 and \mathbf{1}^{\top}L_t=1 for t\in\mathcal{T}; the same argument leading to Corollary 7 implies that \|y\|_2\leq S(T+1)(T+2)r_{\max}/2 and \mathbf{1}^{\top}z\leq SA(T^2+T+2)r_{\max}; and the proof of Proposition 6 (in particular, the Bellman optimality equations) shows that z\geq 0.

Furthermore, again by the construction and the proof of Proposition 6, taking L=\Gamma(\pi) we also have A_L L=b and A_L^{\top}y+z=c_L. Hence

fMF-OMO(y,z,L)=ALLb22+ALy+zcL22+zL=zL.f^{\text{MF-OMO}}(y,z,L)=\|A_{L}L-b\|_{2}^{2}+\|A_{L}^{\top}y+z-c_{L}\|_{2}^{2}+z^{\top}L=z^{\top}L.

It remains to show that zLϵz^{\top}L\leq\epsilon. To see this, notice that

zL=(cLALy)L=cLLyALL=cLLby.z^{\top}L=(c_{L}-A_{L}^{\top}y)^{\top}L=c_{L}^{\top}L-y^{\top}A_{L}L=c_{L}^{\top}L-b^{\top}y.

Now recall that L=Γ(π)L=\Gamma(\pi) implies that Lt(s,a)=π,L(st=s,at=a)L_{t}(s,a)=\mathbb{P}^{\pi,L}(s_{t}=s,a_{t}=a) for t𝒯t\in\mathcal{T} by the proof of Lemma 3. Hence

cLL=s𝒮,a𝒜,t𝒯rt(s,a,Lt)Lt(s,a)=s𝒮,a𝒜,t𝒯rt(s,a,Lt)π,L(st=s,at=a)=Vμ0π(L).\begin{split}c_{L}^{\top}L&=-\sum_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}r_{t}(s,a,L_{t})L_{t}(s,a)\\ &=-\sum_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}r_{t}(s,a,L_{t})\mathbb{P}^{\pi,L}(s_{t}=s,a_{t}=a)=-V_{\mu_{0}}^{\pi}(L).\end{split}

Combined with the fact that by=s𝒮μ0(s)[V0(L)]s=Vμ0(L)b^{\top}y=-\sum_{s\in\mathcal{S}}\mu_{0}(s)[V_{0}^{\star}(L)]_{s}=-V_{\mu_{0}}^{\star}(L), we see that

zL=cLLby=Vμ0(L)Vμ0π(L)=Vμ0(Γ(π))Vμ0π(Γ(π))=Expl(π)ϵ.z^{\top}L=c_{L}^{\top}L-b^{\top}y=V_{\mu_{0}}^{\star}(L)-V_{\mu_{0}}^{\pi}(L)=V_{\mu_{0}}^{\star}(\Gamma(\pi))-V_{\mu_{0}}^{\pi}(\Gamma(\pi))=\text{Expl}(\pi)\leq\epsilon.
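
Since the identity z^{\top}L=\text{Expl}(\pi) is the crux of this argument, the following illustrative sketch (with hypothetical names; r_fn(t, L_t) and P_fn(t, L_t) are assumed to return the reward table of shape (S, A) and the transition tensor of shape (S, A, S)) computes Expl(\pi) directly from its definition: induce L=\Gamma(\pi), freeze it, and compare the optimal value of \mathcal{M}(L) with the value of \pi.

```python
import numpy as np

def exploitability(pi, r_fn, P_fn, mu0):
    """Expl(pi) = V*_{mu0}(Gamma(pi)) - V^pi_{mu0}(Gamma(pi)) for a finite-horizon MFG.
    pi: list of T+1 arrays (S, A); r_fn(t, L_t) -> (S, A); P_fn(t, L_t) -> (S, A, S)."""
    T = len(pi) - 1
    # Mean-field flow L = Gamma(pi) induced by pi.
    L = [mu0[:, None] * pi[0]]
    for t in range(T):
        nu = np.einsum("sa,sap->p", L[t], P_fn(t, L[t]))
        L.append(nu[:, None] * pi[t + 1])
    # Freeze the flow and evaluate the induced MDP M(L).
    r = [r_fn(t, L[t]) for t in range(T + 1)]
    P = [P_fn(t, L[t]) for t in range(T)]
    # Optimal value V* and value of pi, both by backward induction.
    V_star = r[T].max(axis=1)
    V_pi = (pi[T] * r[T]).sum(axis=1)
    for t in range(T - 1, -1, -1):
        Q_star = r[t] + np.einsum("sap,p->sa", P[t], V_star)
        V_star = Q_star.max(axis=1)
        Q_pi = r[t] + np.einsum("sap,p->sa", P[t], V_pi)
        V_pi = (pi[t] * Q_pi).sum(axis=1)
    return mu0 @ V_star - mu0 @ V_pi
```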

7 Extensions

The MF-OMO framework can be extended to other variants of mean-field games. This section details its extension to personalized mean-field games; the extension to multi-population mean-field games is straightforward and omitted here.

Personalized mean-field games are mean-field games involving non-homogeneous players, which can be found in many applications. In such problems, every player is associated with some information (type) which characterizes the heterogeneity among players. Personalized mean-field games generalize mean-field games by incorporating the information (type) into the state of players, with heterogeneity in the initial distributions. To deal with the heterogeneity, one needs to define a “stronger” Nash equilibrium solution for these personalized mean-field games, where the policy sequence π\pi is optimal from any initial state given the mean-field flow. We say that the policy sequence π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} and a mean-field flow L={Lt}t𝒯L=\{L_{t}\}_{t\in\mathcal{T}} constitute a refined Nash equilibrium solution of the finite-time horizon personalized mean-field game, if the following conditions are satisfied.

Definition 7.1 (Refined Nash equilibrium solution).
  • 1)

    (Optimality) Fixing {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}}, for any initial state s𝒮s\in\mathcal{S}, {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} solves the following optimization problem:

    maximize{πt}t𝒯𝔼[t=0Trt(st,at,Lt)|s0=s]subject tost+1Pt(|st,at,Lt),atπt(|st),t𝒯\{T},\begin{array}[]{ll}\text{maximize}_{\{\pi_{t}\}_{t\in\mathcal{T}}}&\mathbb{E}[\sum_{t=0}^{T}r_{t}(s_{t},a_{t},L_{t})|s_{0}=s]\\ \text{subject to}&s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t}),\,a_{t}\sim\pi_{t}(\cdot|s_{t}),\,t\in\mathcal{T}\backslash\{T\},\end{array} (35)

    i.e., {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} is optimal for the representative agent given the mean-field flow {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}};

  • 2)

    (Consistency) Fixing {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}}, the consistency of mean-field flow holds, i.e.,

    Lt=st,at,where st+1Pt(|st,at,Lt),atπt(|st),s0μ0,t𝒯\{T}.\begin{split}&L_{t}=\mathbb{P}_{s_{t},a_{t}},\\ &\text{where }s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t}),\,a_{t}\sim\pi_{t}(\cdot|s_{t}),\,s_{0}\sim\mu_{0},\,t\in\mathcal{T}\backslash\{T\}.\end{split} (36)

For personalized mean-field games, we introduce a modified version of exploitability. Instead of simply averaging value functions over μ0\mu_{0} (cf. (17)), here the averaging is over a uniform initial distribution, which corresponds to the arbitrariness of the initial state in (35). The exploitability is defined as follows:

Explrefined(π):=1Smaxπs𝒮([V0π(Γ(π))]s[V0π(Γ(π))]s).\text{Expl}^{\text{refined}}(\pi):=\dfrac{1}{S}\max_{\pi^{\prime}}\sum_{s\in\mathcal{S}}\left([V_{0}^{\pi^{\prime}}(\Gamma(\pi))]_{s}-[V_{0}^{\pi}(\Gamma(\pi))]_{s}\right).

As in Section 3, π\pi is a refined Nash equilibrium solution if and only if Explrefined(π)=0\text{Expl}^{\text{refined}}(\pi)=0.

To find the refined Nash equilibrium solution, we need to find the “refined” optimal policy for the mean-field induced MDP as described in (35), which is optimal under any initial state s0=s𝒮s_{0}=s\in\mathcal{S} (instead of an initial state with distribution μ0\mu_{0} as in (1)). The proposition below characterizes the linear program formulation for (35), a counterpart of Lemma 2. For brevity, for any ν0Δ(𝒮)\nu_{0}\in\Delta(\mathcal{S}), we denote π,L,ν0()\mathbb{P}^{\pi,L,\nu_{0}}(\cdot) for the probability distribution of any representative agent taking policy π\pi under the mean-field flow LL, i.e., the state and/or action distribution generated by s0ν0s_{0}\sim\nu_{0}, st+1Pt(|st,at,Lt)s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t}), atπt(|st)a_{t}\sim\pi_{t}(\cdot|s_{t}) for t=0,,T1t=0,\dots,T-1.

Proposition 14.

Fix {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}}. Suppose that {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} is an ϵ\epsilon-suboptimal policy for (35) for any s𝒮s\in\mathcal{S}. Define ν0(s)=1S\nu_{0}(s)=\frac{1}{S} for all s𝒮s\in\mathcal{S}, νt(s):=π,L,ν0(st=s)\nu_{t}(s):=\mathbb{P}^{\pi,L,\nu_{0}}(s_{t}=s), and dt(s,a):=νt(s)πt(a|s)d_{t}(s,a):=\nu_{t}(s)\pi_{t}(a|s) for s𝒮s\in\mathcal{S} and a𝒜a\in\mathcal{A}. Then πΠ(d)\pi\in\Pi(d) and d={dt}t𝒯d=\{d_{t}\}_{t\in\mathcal{T}} is a feasible ϵ\epsilon-suboptimal solution to the following linear program:

maximizedt𝒯s𝒮a𝒜dt(s,a)rt(s,a,Lt)subject tos𝒮a𝒜dt(s,a)Pt(s|s,a,Lt)=a𝒜dt+1(s,a),s𝒮,t𝒯\{T},a𝒜d0(s,a)=1S,s𝒮,dt(s,a)0,s𝒮,a𝒜,t𝒯.\begin{array}[]{ll}\text{maximize}_{d}&\sum\limits_{t\in\mathcal{T}}\sum\limits_{s\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}d_{t}(s,a)r_{t}(s,a,L_{t})\\ \text{subject to}&\sum\limits_{s\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}d_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t})=\sum\limits_{a\in\mathcal{A}}d_{t+1}(s^{\prime},a),\\ &\hskip 115.23373pt\forall s^{\prime}\in\mathcal{S},t\in\mathcal{T}\backslash\{T\},\\ &\sum\limits_{a\in\mathcal{A}}d_{0}(s,a)=\frac{1}{S},\quad\forall s\in\mathcal{S},\\ &d_{t}(s,a)\geq 0,\quad\forall s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}.\end{array} (37)

Conversely, suppose that d={dt}t𝒯d=\{d_{t}\}_{t\in\mathcal{T}} is a feasible ϵ\epsilon-suboptimal solution to the above linear program (37). Then for any πΠ(d)\pi\in\Pi(d), {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} is an SϵS\epsilon-suboptimal policy for (35) for any s𝒮s\in\mathcal{S}. In addition, the optimal value of the objective function of the above linear program (37) is equal to the average optimal value of (35) over s𝒮s\in\mathcal{S}.
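
For illustration, the linear program (37) can be assembled and solved directly with a generic LP solver. The sketch below uses names introduced only for this example; it assumes the rewards r[t] of shape (S, A) and the transitions P[t] of shape (S, A, S) have already been evaluated at the fixed mean-field flow L, and nu0 is the initial state distribution (uniform, i.e., 1/S, in Proposition 14).

```python
import numpy as np
from scipy.optimize import linprog

def solve_occupation_lp(r, P, nu0):
    """Solve the occupation-measure LP (37) for the mean-field induced MDP with L frozen."""
    T = len(r) - 1
    S, A = r[0].shape
    n = (T + 1) * S * A
    idx = lambda t, s, a: t * S * A + s * A + a          # flattening convention
    c = np.concatenate([-r[t].ravel() for t in range(T + 1)])  # maximize -> minimize the negative
    A_eq = np.zeros((T * S + S, n))
    b_eq = np.zeros(T * S + S)
    row = 0
    for t in range(T):                                   # flow conservation constraints
        for sp in range(S):
            for s in range(S):
                for a in range(A):
                    A_eq[row, idx(t, s, a)] += P[t][s, a, sp]
            for a in range(A):
                A_eq[row, idx(t + 1, sp, a)] -= 1.0
            row += 1
    for s in range(S):                                   # initial distribution constraints
        for a in range(A):
            A_eq[row, idx(0, s, a)] = 1.0
        b_eq[row] = nu0[s]
        row += 1
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x.reshape(T + 1, S, A), -res.fun          # occupation measure and optimal value
```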

Note that instead of the specified initial distribution \mu_0 in Lemma 2, Proposition 14 chooses a special initial distribution in which all components are equal to \frac{1}{S}. The proof is a direct combination of Lemma 2 and the following classical result for MDPs.

Lemma 15.

If π\pi is an optimal policy for the MDP \mathcal{M} with initial distribution ν0\nu_{0} satisfying ν0(s)>0\nu_{0}(s)>0 for all s𝒮s\in\mathcal{S}, then π\pi is an optimal policy for the MDP \mathcal{M} with any initial state and any initial distribution.

Then similar to Theorem 4 one can establish the corresponding equivalence between refined Nash equilibrium solutions and optimal solutions to a feasibility optimization problem. This is derived by replacing the μ0\mu_{0} in bb in Theorem 4 with 1S𝟏S\frac{1}{S}{\bf 1}\in\mathbb{R}^{S}. More precisely, based on the above result, one can characterize the optimality conditions of π\pi given LL, the counterpart of condition (A) with μ0\mu_{0} replaced by the uniform distribution over 𝒮\mathcal{S}. Then combined with the consistency condition for the mean-field flow (36), one can establish the corresponding equivalence between refined Nash equilibrium solutions and optimal solutions to the following feasibility optimization problem:

minimizeπ,y,z,d,L0subject toALd=b,ALy+z=cL,zd=0,d0,z0,πt(a|s)a𝒜dt(s,a)=dt(s,a),s𝒮,a𝒜,t𝒯L0(s,a)=μ0(s)π0(a|s),s𝒮,a𝒜,Lt+1(s,a)=πt+1(a|s)s𝒮a𝒜Lt(s,a)Pt(s|s,a,Lt),s𝒮,t𝒯\{T},a𝒜L0(s,a)=μ0(s),s𝒮,L0,\begin{array}[]{ll}\text{\rm minimize}_{\pi,y,z,d,L}&0\\ \text{\rm subject to}&A_{L}d=b^{\prime},\quad A_{L}^{\top}y+z=c_{L},\\ &z^{\top}d=0,\quad d\geq 0,\quad z\geq 0,\\ &\pi_{t}(a|s)\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s,a^{\prime})=d_{t}(s,a),\,\forall s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}\\ &L_{0}(s,a)=\mu_{0}(s)\pi_{0}(a|s),\quad\forall s\in\mathcal{S},a\in\mathcal{A},\\ &L_{t+1}(s^{\prime},a^{\prime})=\pi_{t+1}(a^{\prime}|s^{\prime})\sum\limits_{s\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}L_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t}),\\ &\hskip 151.9376pt\forall s^{\prime}\in\mathcal{S},t\in\mathcal{T}\backslash\{T\},\\ &\sum_{a\in\mathcal{A}}L_{0}(s,a)=\mu_{0}(s),\quad\forall s\in\mathcal{S},\quad L\geq 0,\end{array} (38)

where AL,cLA_{L},c_{L} are set as (8) and (9), and b=[0,,0,1S𝟏]S(T+1)b^{\prime}=[0,\dots,0,\frac{1}{S}{\bf 1}]\in\mathbb{R}^{S(T+1)}. More precisely, (π,L)(\pi,L) is a refined Nash equilibrium solution if and only if (π,y,z,d,L)(\pi,y,z,d,L) is a solution to (38) for some d,y,zd,y,z.

Note that here condition (A) may not hold; therefore, the consistency condition on the mean-field flow cannot be implied by d=L. As a result, the policy \pi is directly included as part of the variables.

In addition, noticing that L_t(s,a)=\sum_{a^{\prime}\in\mathcal{A}}L_t(s,a^{\prime})\pi_t(a|s) (which follows by summing over a^{\prime} on both sides of the fifth row of the constraints in (38)), and denoting b=[0^{\top},\dots,0^{\top},\mu_0^{\top}]^{\top}\in\mathbb{R}^{S(T+1)}, the optimization problem (38) can be further simplified, leading to the following theorem.

Theorem 16.

Solving refined Nash equilibrium solution(s) of the personalized mean-field game ((35) and (36)) is equivalent to solving the following feasibility optimization problem:

minimizey,z,d,L0subject toALd=b,ALy+z=cL,zd=0,d0,z0,ALL=b,L0,Lt(s,a)a𝒜dt(s,a)=dt(s,a)a𝒜Lt(s,a),s𝒮,a𝒜,t𝒯.\begin{array}[]{ll}\text{\rm minimize}_{y,z,d,L}&0\\ \text{\rm subject to}&A_{L}d=b^{\prime},\quad A_{L}^{\top}y+z=c_{L},\\ &z^{\top}d=0,\quad d\geq 0,\quad z\geq 0,\\ &A_{L}L=b,\quad L\geq 0,\\ &L_{t}(s,a)\sum\nolimits_{a^{\prime}\in\mathcal{A}}d_{t}(s,a^{\prime})=d_{t}(s,a)\sum\nolimits_{a^{\prime}\in\mathcal{A}}L_{t}(s,a^{\prime}),\\ &\hskip 124.6232pt\forall s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}.\end{array} (39)

where AL,cLA_{L},c_{L} are set as (8) and (9), and b=[0,,0,1S𝟏]S(T+1)b^{\prime}=[0,\dots,0,\frac{1}{S}{\bf 1}]\in\mathbb{R}^{S(T+1)}. More precisely, if (π,L)(\pi,L) is a refined Nash equilibrium solution, then (y,z,d,L)(y,z,d,L) is a solution to (39) for some y,z,dy,z,d. Conversely, if (y,z,d,L)(y,z,d,L) is a solution to (39), then for any πΠ(d)\pi\in\Pi(d), (π,L)(\pi,L) constitutes a refined Nash equilibrium solution.

Note that the last row of constraints in (39) is equivalent to requiring that Π(d)=Π(L)\Pi(d)=\Pi(L), where the equality is in terms of sets.

8 Numerical experiments

In this section, we present numerical results of algorithms based on the MF-OMO optimization framework. In the first part, MF-OMO is compared with the existing state-of-the-art algorithms for solving MFGs on the SIS problem introduced in [20]. In the second part, it is shown that different initializations of MF-OMO algorithms converge to different NEs in the example introduced in Section 2.

8.1 Comparison with existing algorithms

In this section, we focus on the SIS problem introduced in [20]. There are a large number of agents who, at each time step, choose either to social distance or to go out. Susceptible (S) agents who choose to go out may get infected (I), with probability proportional to the number of infected agents, while susceptible agents who opt for social distancing stay healthy. Meanwhile, infected agents recover with a certain probability at each time step. Agents in the system aim to find the best strategy to minimize their costs induced by social distancing and infection. The parameters remain the same as in [20].

Figures 1 and 2 show the comparisons between MF-OMO and existing algorithms on this SIS problem with T=50 and T=100, respectively. Here, MF-OMO is tested with different optimization algorithms including PGD (cf. §5), Adam [43], and NAdam [26], against online mirror descent (OMD) [58], fictitious play (FP) [59], and prior descent (PD) [20] with different choices of hyperparameters. (In the figure legends, the numbers after the OMD and FP algorithms are the step-sizes/learning-rates of these algorithms, while the number pairs after the PD algorithms are the temperatures and inner iterations of PD, respectively.) To make fair comparisons, the maximum number of iterations for each algorithm is set differently so that the total runtimes are comparable and MF-OMO algorithms are given the smallest budget in terms of both the total number of iterations and the total runtime. With the same uniform initialization, it is clear from the figures that MF-OMO with Adam outperforms the remaining algorithms. The normalized exploitability (defined as the exploitability divided by the initial exploitability) of MF-OMO Adam quickly drops below 10^{-2} in terms of both the number of iterations and the runtime. All algorithms except for the three MF-OMO variants fail to achieve 10^{-2} for both T=50 and T=100.

Figure 1: Comparison of different algorithms on the SIS problem with T=50. (a) Convergence against iterations; (b) convergence against runtime.
Figure 2: Comparison of different algorithms on the SIS problem with T=100. (a) Convergence against iterations; (b) convergence against runtime.

8.2 Multiple NEs

In this section, we focus on the problem introduced in Section 2 and show how algorithms based on the MF-OMO framework converge with different initializations. We choose S=A=n=5, T=10, r^1=r^2=r^3=1.5; and r^4, r^5, and \{C_t\}_{t=1}^{10} are all independently sampled uniformly between 0 and 1.

From the discussions in Section 2, there are at least 3 different NEs, which correspond to agents staying in state 1, 2 and 3, respectively. Here, several different sets of random initializations are tested, denoted by RI(i,\epsilon), which represent initializations randomly generated in the \epsilon-neighborhood of the i-th NE (i=1,2,3). Each RI(i,\epsilon) set consists of 20 independent samples, and the convergence behavior of the MF-OMO NAdam algorithm is recorded in Table 1. (We choose NAdam as it converges significantly faster than other algorithms such as PGD and Adam for this specific problem, and our primary goal is to study the behavior after convergence.) For each neighborhood of size \epsilon and NE i, the convergence behavior in RI(i,\epsilon) is categorized using the following criteria: p_0 counts the proportion of samples whose normalized exploitabilities do not drop to 10^{-3} after 400 iterations; p_1 counts the proportion of samples that converge to an (approximate) NE (with normalized exploitability below 10^{-3}) closest to the i-th NE among the three NEs; p_2 counts the proportion of samples that converge to an (approximate) NE closer to one of the other two NEs.

ϵ | NE 1: p0 / p1 / p2 | NE 2: p0 / p1 / p2 | NE 3: p0 / p1 / p2
0.05 | 0% / 70% / 30% | 5% / 75% / 20% | 0% / 65% / 35%
0.2 | 0% / 50% / 50% | 5% / 65% / 30% | 5% / 60% / 35%
0.8 | 5% / 45% / 50% | 5% / 40% / 55% | 0% / 30% / 70%
1.0 | 0% / 40% / 60% | 0% / 45% / 55% | 0% / 25% / 75%
Table 1: Convergence behavior of different initializations.

From Table 1, it is clear that most of the samples converge to some approximate NE with exploitability below the target 10^{-3} tolerance, regardless of the initializations. In addition, different initializations lead to different solutions. Specifically, when initializations are close to some NE, MF-OMO NAdam is more likely to converge to that specific NE, which is consistent with our theoretical study. Finally, as the neighborhood size \epsilon increases, the convergence behavior becomes, as expected, more chaotic, i.e., less concentrated on the center NE.

References

  • [1] Yves Achdou, Pierre Cardaliaguet, François Delarue, Alessio Porretta, and Filippo Santambrogio. Mean Field Games: Cetraro, Italy 2019, volume 2281. Springer Nature, 2021.
  • [2] Ilan Adler and Sushil Verma. The linear complementarity problem, Lemke algorithm, perturbation, and the complexity class PPAD. https://optimization-online.org/wp-content/uploads/2011/03/2948.pdf, 2011. Last visited: 2022 September 6.
  • [3] Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Fitted Q-learning in mean-field games. arXiv preprint arXiv:1912.13309, 2019.
  • [4] Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Q-learning in regularized mean-field games. Dynamic Games and Applications, pages 1–29, 2022.
  • [5] Donald G Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM (JACM), 12(4):547–560, 1965.
  • [6] Andrea Angiuli, Jean-Pierre Fouque, and Mathieu Laurière. Unified reinforcement Q-learning for mean field game and control problems. Mathematics of Control, Signals, and Systems, pages 1–55, 2022.
  • [7] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
  • [8] Amir Beck. First-Order Methods in Optimization. SIAM, 2017.
  • [9] Alain Bensoussan, Jens Frehse, and Sheung Chi Phillip Yam. The master equation in mean field theory. Journal de Mathématiques Pures et Appliquées, 103(6):1441–1474, 2015.
  • [10] Abhay G Bhatt and Vivek S Borkar. Occupation measures for controlled Markov processes: Characterization and optimality. The Annals of Probability, pages 1531–1562, 1996.
  • [11] Géraldine Bouveret, Roxana Dumitrescu, and Peter Tankov. Mean-field games of optimal stopping: A relaxed solution approach. SIAM Journal on Control and Optimization, 58(4):1795–1821, 2020.
  • [12] Rainer Buckdahn, Boualem Djehiche, Juan Li, and Shige Peng. Mean-field backward stochastic differential equations: A limit approach. The Annals of Probability, 37(4):1524–1565, 2009.
  • [13] Pierre Cardaliaguet, François Delarue, Jean-Michel Lasry, and Pierre-Louis Lions. The Master Equation and the Convergence Problem in Mean Field Games. Princeton University Press, 2019.
  • [14] René Carmona and François Delarue. Mean field forward-backward stochastic differential equations. Electronic Communications in Probability, 18:1–15, 2013.
  • [15] René Carmona and François Delarue. Probabilistic Theory of Mean Field Games with Applications I-II. Springer, 2018.
  • [16] René Carmona and Peiqi Wang. A probabilistic approach to extended finite state mean field games. Mathematics of Operations Research, 46(2):471–502, 2021.
  • [17] Laurent Condat. Fast projection onto the simplex and the 1\ell_{1} ball. Mathematical Programming, 158(1):575–585, 2016.
  • [18] Michel Coste. An Introduction to Semialgebraic Geometry, 2000.
  • [19] Michel Coste. An Introduction to O-minimal Geometry. Istituti editoriali e poligrafici internazionali Pisa, 2000.
  • [20] Kai Cui and Heinz Koeppl. Approximately solving mean field games via entropy-regularized deep reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1909–1917. PMLR, 2021.
  • [21] Daniela Pucci De Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.
  • [22] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27, 2014.
  • [23] François Delarue, Daniel Lacker, and Kavita Ramanan. From the master equation to mean field game limit theory: A central limit theorem. Electronic Journal of Probability, 24:1–54, 2019.
  • [24] Eric V Denardo. On linear programming in a Markov decision problem. Management Science, 16(5):281–288, 1970.
  • [25] Cyrus Derman. On sequential decisions and Markov chains. Management Science, 9(1):16–24, 1962.
  • [26] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
  • [27] L. P. D. van den Dries. Tame Topology and O-minimal Structures. London Mathematical Society Lecture Note Series. Cambridge University Press, 1998.
  • [28] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the 1\ell_{1}-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
  • [29] Roxana Dumitrescu, Marcos Leutscher, and Peter Tankov. Control and optimal stopping mean field games: A linear programming approach. Electronic Journal of Probability, 26:1–49, 2021.
  • [30] Roxana Dumitrescu, Marcos Leutscher, and Peter Tankov. Linear programming fictitious play algorithm for mean field games with optimal stopping and absorption. arXiv preprint arXiv:2202.11428, 2022.
  • [31] Romuald Elie, Julien Perolat, Mathieu Laurière, Matthieu Geist, and Olivier Pietquin. On the convergence of model free learning in mean field games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7143–7150, 2020.
  • [32] Zuyue Fu, Zhuoran Yang, Yongxin Chen, and Zhaoran Wang. Actor-critic provably finds nash equilibria of linear-quadratic mean-field games. arXiv preprint arXiv:1910.07498, 2019.
  • [33] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, 2016.
  • [34] Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. Learning mean-field games. Advances in Neural Information Processing Systems, 32, 2019.
  • [35] Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. A general framework for learning mean-field games. arXiv preprint arXiv:2003.06069, 2020.
  • [36] Filip Hanzely, Konstantin Mishchenko, and Peter Richtárik. SEGA: Variance reduction via gradient sketching. Advances in Neural Information Processing Systems, 31, 2018.
  • [37] Nicholas C Henderson and Ravi Varadhan. Damped anderson acceleration with restarts and monotonicity control for accelerating em and em-like algorithms. Journal of Computational and Graphical Statistics, 28(4):834–846, 2019.
  • [38] Arie Hordijk and LCM Kallenberg. Linear programming and Markov decision chains. Management Science, 25(4):352–362, 1979.
  • [39] Minyi Huang, Roland P Malhamé, and Peter E Caines. Large population stochastic dynamic games: Closed-loop Mckean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252, 2006.
  • [40] Youqiang Huang and Masami Kurano. The LP approach in average reward MDPs with multiple cost constraints: The countable state case. Journal of Information and Optimization Sciences, 18(1):33–47, 1997.
  • [41] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.
  • [42] Tobias Kaiser and Andre Opris. Differentiability properties of log-analytic functions. arXiv preprint arXiv:2007.03332, 2020.
  • [43] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [44] Thomas Kurtz and Richard Stockbridge. Stationary solutions and forward equations for controlled and singular martingale problems. Electronic Journal of Probability, 6:1–52, 2001.
  • [45] Thomas G Kurtz and Richard H Stockbridge. Existence of Markov controls and characterization of optimal Markov controls. SIAM Journal on Control and Optimization, 36(2):609–653, 1998.
  • [46] Jean-Michel Lasry and Pierre-Louis Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.
  • [47] Mathieu Lauriere. Numerical methods for mean field games and mean field type control. arXiv preprint arXiv:2106.06231, 2021.
  • [48] Nicolas Le Roux, Mark W Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in Neural Information Processing Systems, 25, 2012.
  • [49] Jun Liu and Jieping Ye. Efficient euclidean projections in linear time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 657–664, 2009.
  • [50] David G Luenberger and Yinyu Ye. Linear and Nonlinear Programming, volume 2. Springer, 1984.
  • [51] Alan S Manne. Linear programming and sequential decisions. Management Science, 6(3):259–267, 1960.
  • [52] D Marker. Tame topology and o-minimal structures by Lou van den Dries. Bulletin of the American Mathematical Society, 37(3):351–358, 2000.
  • [53] Richard Kipp Martin. Large Scale Linear and Integer Optimization: A Unified Approach. Springer Science & Business Media, 2012.
  • [54] KG Murty. Linear complementarity problem, its geometry and applications. Linear Complementarity, Linear and Nonlinear Programming, pages 1–58, 1997.
  • [55] Yurii Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
  • [56] Shunji Osaki and Hisashi Mine. Linear programming algorithms for semi-Markovian decision processes. Journal of Mathematical Analysis and Applications, 22(2):356–381, 1968.
  • [57] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
  • [58] Julien Perolat, Sarah Perrin, Romuald Elie, Mathieu Laurière, Georgios Piliouras, Matthieu Geist, Karl Tuyls, and Olivier Pietquin. Scaling up mean field games with online mirror descent. arXiv preprint arXiv:2103.00623, 2021.
  • [59] Sarah Perrin, Julien Pérolat, Mathieu Laurière, Matthieu Geist, Romuald Elie, and Olivier Pietquin. Fictitious play for mean field games: Continuous time analysis and applications. Advances in Neural Information Processing Systems, 33:13199–13213, 2020.
  • [60] Boris T Polyak. Introduction to Optimization. New York : Optimization Software, Publications Division, 1987.
  • [61] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • [62] Naci Saldi, Tamer Basar, and Maxim Raginsky. Markov–Nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56(6):4256–4287, 2018.
  • [63] Paul J Schweitzer and Abraham Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of mathematical analysis and applications, 110(2):568–582, 1985.
  • [64] Jitkomut Songsiri. Projection onto an 1\ell_{1}-norm ball with application to identification of sparse autoregressive models. In Asean Symposium on Automatic Control (ASAC), 2011.
  • [65] Pantelis Sopasakis, Krina Menounou, and Panagiotis Patrinos. Superscs: fast and accurate large-scale conic optimization. In 2019 18th European Control Conference (ECC), pages 1500–1505. IEEE, 2019.
  • [66] Richard H Stockbridge. Time-average control of martingale problems: A linear programming formulation. The Annals of Probability, pages 206–217, 1990.
  • [67] Michael I Taksar. Infinite-dimensional linear programming approach to singular stochastic control. SIAM Journal on Control and Optimization, 35(2):604–625, 1997.
  • [68] Lou Van den Dries and Chris Miller. Geometric categories and o-minimal structures. Duke Mathematical Journal, 84(2):497–540, 1996.
  • [69] Athanasios Vasiliadis. An introduction to mean field games using probabilistic methods. arXiv preprint arXiv:1907.01411, 2019.
  • [70] Mengdi Wang and Yichen Chen. An online primal-dual method for discounted Markov decision processes. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 4516–4521. IEEE, 2016.
  • [71] Alex J Wilkie. Model completeness results for expansions of the ordered field of real numbers by restricted Pfaffian functions and the exponential function. Journal of the American Mathematical Society, 9(4):1051–1094, 1996.
  • [72] Philip Wolfe and George Bernard Dantzig. Linear programming in a Markov chain. Operations Research, 10(5):702–710, 1962.
  • [73] Stephen Wright and Jorge Nocedal. Numerical Optimization. Springer, 1999.
  • [74] Junzi Zhang, Jongho Kim, Brendan O’Donoghue, and Stephen Boyd. Sample efficient reinforcement learning with REINFORCE. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35(12), pages 10887–10895, 2021.
  • [75] Junzi Zhang, Brendan O’Donoghue, and Stephen Boyd. Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4):3170–3197, 2020.

Appendix

Appendix A Additional optimization algorithms and convergence to stationary points

A.1 Stochastic projected gradient descent (SPGD)

When the state space, action space and the time horizon are large, a single gradient evaluation in the PGD update (18) can be costly. To address this issue, we consider a stochastic variant of PGD (SPGD), which is suitable for problems with large S, A and T. To invoke SPGD, first recall the explicit expansion of (MF-OMO) as (14). The objective in (14) is a sum of n:=S+ST+SA+SA(T-1)+SA+SA(T+1)=S(T+1)(2A+1) terms, which we denote by f_{i}^{\text{MF-OMO}}(\theta) (i=1,\dots,n) for brevity. SPGD replaces the exact gradient \nabla_{\theta}f^{\text{MF-OMO}}(\theta_{k}) in PGD with a mini-batch estimator of the following form

\hat{g}_{k}=\frac{n}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}}\nabla_{\theta}f_{i}^{\text{MF-OMO}}(\theta_{k}), (40)

where the mini-batch k\mathcal{B}_{k} is a subset of {1,,n}\{1,\dots,n\} sampled uniformly at random (without replacement) and independently across kk (i.e., iterations). Such a sampling approach ensures that g^k\hat{g}_{k} is an unbiased estimator of θfMF-OMO(θk)\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{k}) (conditioned on θk\theta_{k}).
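
To make this claim concrete, the unbiasedness can be checked in one line (a routine computation, using only the fact that each index i belongs to the uniformly sampled batch \mathcal{B}_{k} with probability |\mathcal{B}_{k}|/n):

\mathbb{E}_{k}\hat{g}_{k}=\frac{n}{|\mathcal{B}_{k}|}\sum_{i=1}^{n}\mathbb{P}(i\in\mathcal{B}_{k})\nabla_{\theta}f_{i}^{\text{MF-OMO}}(\theta_{k})=\frac{n}{|\mathcal{B}_{k}|}\cdot\frac{|\mathcal{B}_{k}|}{n}\sum_{i=1}^{n}\nabla_{\theta}f_{i}^{\text{MF-OMO}}(\theta_{k})=\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{k}).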

SPGD then proceeds by updating θ\theta with the following iterative process:

θk+1=ProjΘ(θkηkg^k).\theta_{k+1}=\textbf{Proj}_{\Theta}(\theta_{k}-\eta_{k}\hat{g}_{k}). (41)

Again, we assume that θ0Θ\theta_{0}\in\Theta.
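
For illustration only, a minimal Python sketch of the SPGD iteration (40)-(41) is given below. The callables grad_f_i (the gradient of the i-th summand f_{i}^{\text{MF-OMO}}) and proj_Theta (the Euclidean projection onto \Theta) are hypothetical user-supplied routines and are not specified here.

import numpy as np

def spgd(theta0, grad_f_i, proj_Theta, n, batch_size, step_sizes, num_iters, seed=0):
    # Stochastic projected gradient descent sketch for (MF-OMO).
    # grad_f_i(i, theta): gradient of the i-th summand f_i^{MF-OMO} at theta.
    # proj_Theta(theta): Euclidean projection of theta onto the feasible set Theta.
    # step_sizes[k]: step size eta_k.
    rng = np.random.default_rng(seed)
    theta = proj_Theta(np.asarray(theta0, dtype=float))  # ensure theta_0 is feasible
    for k in range(num_iters):
        # sample a mini-batch uniformly at random, without replacement
        batch = rng.choice(n, size=batch_size, replace=False)
        # mini-batch estimator (40): unbiased for the full gradient
        g_hat = (n / batch_size) * sum(grad_f_i(i, theta) for i in batch)
        # projected stochastic gradient step (41)
        theta = proj_Theta(theta - step_sizes[k] * g_hat)
    return theta

For instance, the step size schedule of Proposition 19 below corresponds to step_sizes[k] = 1/((k+3)**0.5 * np.log2(k+3)).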

Proposition 17.

Under Assumption 1, there exist constants C1,C2>0C_{1},C_{2}>0 such that for any k0k\geq 0, g^k2C1\|\hat{g}_{k}\|_{2}\leq C_{1} almost surely, θfMF-OMO(θk)2C1\|\nabla_{\theta}f^{\text{\rm MF-OMO}}(\theta_{k})\|_{2}\leq C_{1}, 𝔼kg^k=θfMF-OMO(θk)\mathbb{E}_{k}\hat{g}_{k}=\nabla_{\theta}f^{\text{\rm MF-OMO}}(\theta_{k}) and 𝔼kg^k22C2\mathbb{E}_{k}\|\hat{g}_{k}\|_{2}^{2}\leq C_{2}. Here 𝔼k\mathbb{E}_{k} is the conditional expectation given the kk-th iteration θk\theta_{k}.

That is, the estimator (40) is unbiased and almost surely bounded, with a bounded second moment, and the exact gradient is also uniformly bounded. The proof is a direct application of the continuity of the rewards and dynamics and the compactness of \Theta, and is omitted here.

A.2 Global convergence to stationary points without definability

In this section, we show the global convergence of the iterates to stationary points when only the smoothness assumption holds. Let us first recall the concept of stationary points for constrained optimization problems and their relation to optimality.

Optimality conditions.

The KKT conditions state that a necessary condition for a point \theta\in\Theta to be an optimal solution to (MF-OMO) is -\nabla_{\theta}f^{\text{MF-OMO}}(\theta)\in\mathcal{N}_{\Theta}(\theta), where

\mathcal{N}_{\Theta}(\theta)=\left\{\begin{array}[]{ll}\emptyset,&\theta\notin\Theta,\\ \{\nu\,|\,\nu^{\top}(\theta^{\prime}-\theta)\leq 0,\,\forall\theta^{\prime}\in\Theta\},&\theta\in\Theta\end{array}\right.

is the normal cone of Θ\Theta at θ\theta. We say that θ\theta is a stationary point of (MF-OMO) if θfMF-OMO(θ)𝒩Θ(θ)-\nabla_{\theta}f^{\text{MF-OMO}}(\theta)\in\mathcal{N}_{\Theta}(\theta).

The typical goal of nonconvex optimization is to find a sequence of points that converges to a stationary point. Since the normal cone is set-valued, it is hard to directly evaluate closeness to stationarity with the above definition. A common surrogate metric is the norm of the projected gradient mapping G_{\eta}, defined on \Theta as

Gη(θ):=1η(θProjΘ(θηθfMF-OMO(θ))).G_{\eta}(\theta):=\dfrac{1}{\eta}\left(\theta-\textbf{Proj}_{\Theta}(\theta-\eta\nabla_{\theta}f^{\text{MF-OMO}}(\theta))\right).

Here Gη(θ)=0G_{\eta}(\theta)=0 if and only if θ\theta is a stationary point of (MF-OMO). Moreover, if Gη(θ)2ϵ\|G_{\eta}(\theta)\|_{2}\leq\epsilon, then according to [33, Lemma 3],

fMF-OMO(θ)𝒩Θ(θ)+ϵ(ηM+1)𝔹2,-\nabla f^{\text{MF-OMO}}(\theta)\in\mathcal{N}_{\Theta}(\theta)+\epsilon(\eta M+1)\mathbb{B}_{2},

where 𝔹2\mathbb{B}_{2} is the unit 2\ell_{2} ball.

When Θ\Theta is the full Euclidean space, the projected gradient mapping is simply the gradient of fMF-OMOf^{\text{MF-OMO}}. We refer interested readers to [8, Chapter 3] for a more detailed discussion on optimality conditions.
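
As a small illustration (using the same hypothetical grad_f and proj_Theta callables as in the SPGD sketch above), the projected gradient mapping and the corresponding approximate stationarity test can be computed as follows.

import numpy as np

def projected_gradient_mapping(theta, grad_f, proj_Theta, eta):
    # G_eta(theta) = (theta - Proj_Theta(theta - eta * grad_f(theta))) / eta
    return (theta - proj_Theta(theta - eta * grad_f(theta))) / eta

def is_approx_stationary(theta, grad_f, proj_Theta, eta, eps):
    # surrogate stationarity test: ||G_eta(theta)||_2 <= eps
    return np.linalg.norm(projected_gradient_mapping(theta, grad_f, proj_Theta, eta)) <= eps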

Convergence to stationary points.

Under Assumption 1, the following convergence result for PGD is standard given Proposition 10 and the existence of optimal solution(s) for (MF-OMO).

Proposition 18.

[8, Theorem 10.15] Under Assumption 1, let {θk}k0\{\theta_{k}\}_{k\geq 0} be the sequence generated by PGD (18) with ηk=η(0,2/M)\eta_{k}=\eta\in(0,2/M). Then

  • the objective value sequence \{f^{\text{\rm MF-OMO}}(\theta_{k})\}_{k\geq 0} is non-increasing, and the decrease is strict until a stationary point is found (i.e., G_{\eta}(\theta_{k})=0) and the iteration is terminated;

  • k=0KGη(θk)22fMF-OMO(θ0)ηMη2/2\sum_{k=0}^{K}\|G_{\eta}(\theta_{k})\|_{2}^{2}\leq\frac{f^{\text{\rm MF-OMO}}(\theta_{0})}{\eta-M\eta^{2}/2}, and in particular,

    mink=0,,KGη(θk)2fMF-OMO(θ0)(ηMη2/2)(K+1);\min_{k=0,\dots,K}\|G_{\eta}(\theta_{k})\|_{2}\leq\sqrt{\dfrac{f^{\text{\rm MF-OMO}}(\theta_{0})}{(\eta-M\eta^{2}/2)(K+1)}};
  • all limit points of {θk}k0\{\theta_{k}\}_{k\geq 0} are stationary points of (MF-OMO).
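
In particular, the second item yields an explicit iteration-complexity estimate: to guarantee \min_{k=0,\dots,K}\|G_{\eta}(\theta_{k})\|_{2}\leq\epsilon it suffices to take

K+1\geq\frac{f^{\text{\rm MF-OMO}}(\theta_{0})}{(\eta-M\eta^{2}/2)\epsilon^{2}},

i.e., an \epsilon-approximate stationary point is found within O(\epsilon^{-2}) PGD iterations.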

Similarly, the following proposition shows that the iterates of SPGD converge to stationary points with high probability. The uniform boundedness in Proposition 17 allows us to adapt the proof in [74, Theorem 9] for policy gradient methods of MDPs to our setting. The proof is omitted as it closely parallels that argument.

Proposition 19.

Under Assumptions 1 and 2, let {θk}k0\{\theta_{k}\}_{k\geq 0} be the sequence generated by SPGD (41) with ηk=1k+3log2(k+3)\eta_{k}=\frac{1}{\sqrt{k+3}\log_{2}(k+3)}. Then for any K0K\geq 0, with probability at least 1δ1-\delta, we have

mink=0,,KGη(θk)224log2(K+3)K+1(C2M+fMF-OMO(θ0)+C12(16+M)log(2δ)).\min_{k=0,\dots,K}\|G_{\eta}(\theta_{k})\|_{2}^{2}\leq\dfrac{4\log_{2}(K+3)}{\sqrt{K+1}}\left(C_{2}M+f^{\text{\rm MF-OMO}}(\theta_{0})+C_{1}^{2}\sqrt{(16+M)\log\left(\frac{2}{\delta}\right)}\right).

A.3 Reparametrization, acceleration and variance reduction

(MF-OMO) may be solved more efficiently with a few additional techniques. First, one can reparametrize the variables to remove the simple constraints in \Theta entirely. In particular, one can reparametrize L and z by

L_{s,a,t}=\frac{\exp(u_{s,a,t})}{\sum_{s^{\prime}\in\mathcal{S},a^{\prime}\in\mathcal{A}}\exp(u_{s^{\prime},a^{\prime},t})},\quad z_{s,a,t}=SA(T^{2}+T+2)r_{\max}\frac{\exp(v_{s,a,t})}{\sum_{s^{\prime}\in\mathcal{S},a^{\prime}\in\mathcal{A}}\exp(v_{s^{\prime},a^{\prime},t})+w_{0}}

for some u=[u_{s,a,t}]_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}\in\mathbb{R}^{SA(T+1)} and v=[v_{s,a,t}]_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}\in\mathbb{R}^{SA(T+1)}, and w_{0}\in\mathbb{R}. Similarly, y can be reparametrized using trigonometric functions such as y_{s,t}=\frac{S(T+1)(T+2)r_{\max}}{2\sqrt{S(T+1)}}\sin(w_{s,t}) for some w\in\mathbb{R}^{S(T+1)}. With the aforementioned reparametrization, (MF-OMO) becomes a smooth unconstrained optimization problem, which can be solved by additional optimization algorithms and solvers not involving projections. As mentioned in Remark 1, one can also change the norms of y and z and their bounds, and apply other reparametrizations.
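
As an illustration, a schematic Python implementation of these maps is sketched below. It follows the displayed formulas literally (in particular, the slack scalar w_{0} in the denominator of z), with u and v stored as arrays of shape (S, A, T+1) and w of shape (S, T+1); this layout is an assumption made here for readability, not the vectorized indexing used in the main text.

import numpy as np

def reparametrize(u, v, w0, w, S, A, T, r_max):
    # L_t: softmax over the (s, a) entries, separately for each time t
    exp_u = np.exp(u)
    L = exp_u / exp_u.sum(axis=(0, 1), keepdims=True)
    # z_t: scaled softmax-like map with the extra slack term w0 in the denominator
    exp_v = np.exp(v)
    z_scale = S * A * (T**2 + T + 2) * r_max
    z = z_scale * exp_v / (exp_v.sum(axis=(0, 1), keepdims=True) + w0)
    # y: componentwise sine, scaled so that the l2-norm bound on y is respected
    y_scale = S * (T + 1) * (T + 2) * r_max / (2.0 * np.sqrt(S * (T + 1)))
    y = y_scale * np.sin(w)
    return L, z, y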

Lastly, standard techniques such as momentum methods [60], Nesterov acceleration [55], quasi-Newton methods [73], Anderson acceleration [5], and Newton methods [50] can readily be applied to accelerate the convergence of PGD and SPGD; variance reduction approaches such as SAG [48], SVRG [41], SAGA [22], SEGA [36] can also be adopted to further accelerate and stabilize the convergence of SPGD. In particular, safeguarded Anderson acceleration methods [37, 75, 65] might be adopted to maintain the desired monotonicity property of vanilla PGD, which is central for the local convergence to global Nash equilibrium solutions (cf. Theorem 11).
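
As a simple illustration of such safeguarding (a sketch only, not the specific schemes of [37, 75, 65]), one can accept an accelerated candidate only when it does not increase the objective relative to the plain PGD step, and fall back to the PGD step otherwise; this keeps the objective values non-increasing, as used in Theorem 11. The callables f, grad_f and proj_Theta are again assumed to be supplied by the user.

def safeguarded_accelerated_pgd(theta0, f, grad_f, proj_Theta, eta, num_iters, momentum=0.9):
    # PGD with heavy-ball style extrapolation, safeguarded so that f(theta_k) stays non-increasing
    # (assuming eta is small enough that the plain PGD step is itself a descent step).
    theta = proj_Theta(theta0)
    theta_prev = theta
    for _ in range(num_iters):
        # accelerated candidate: extrapolate, then take a projected gradient step
        extrap = theta + momentum * (theta - theta_prev)
        candidate = proj_Theta(extrap - eta * grad_f(extrap))
        # plain PGD step as the monotone fallback
        fallback = proj_Theta(theta - eta * grad_f(theta))
        theta_prev = theta
        theta = candidate if f(candidate) <= f(fallback) else fallback
    return theta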

Appendix B Definability and local convergence to global Nash equilibrium solutions

B.1 Definability

In this section, we restate the formal definitions of definable sets and functions as in [7, Section 4.3], and summarize and derive necessary properties to prove Theorem 11. See [68, 71, 27, 52, 19, 42] for more complete expositions of the topic.

We first need the following concept of o-minimal structure over \mathbb{R}.

Definition B.1.

Suppose that \mathcal{O}:=\{\mathcal{O}_{n}\}_{n=0}^{\infty} is a sequence of collections of subsets \mathcal{O}_{n}\subseteq 2^{\mathbb{R}^{n}}. We say that \mathcal{O} is an o-minimal structure over \mathbb{R} if it satisfies the following conditions:

  • n0\forall n\geq 0, 𝒪n\emptyset\in\mathcal{O}_{n}. And A,B𝒪n\forall A,B\in\mathcal{O}_{n}, AB,AB,n\A𝒪nA\cap B,A\cup B,\mathbb{R}^{n}\backslash A\in\mathcal{O}_{n}.

  • A𝒪n\forall A\in\mathcal{O}_{n}, A×𝒪n+1A\times\mathbb{R}\in\mathcal{O}_{n+1} and ×A𝒪n+1\mathbb{R}\times A\in\mathcal{O}_{n+1}.

  • \forall A\in\mathcal{O}_{n+1}, its canonical projection \{(x_{1},\dots,x_{n})\in\mathbb{R}^{n}\,|\,(x_{1},\dots,x_{n},x_{n+1})\in A\text{ for some }x_{n+1}\in\mathbb{R}\}\in\mathcal{O}_{n}.

  • ij\forall i\neq j and i,j{1,,n}i,j\in\{1,\dots,n\}, the diagonals {(x1,,xn)|xi=xj}𝒪n\{(x_{1},\dots,x_{n})|x_{i}=x_{j}\}\in\mathcal{O}_{n}.

  • The set {(x1,x2)2|x1<x2}𝒪2\{(x_{1},x_{2})\in\mathbb{R}^{2}|x_{1}<x_{2}\}\in\mathcal{O}_{2}.

  • The elements of 𝒪1\mathcal{O}_{1} are exactly finite unions of intervals (including both open and closed, and in particular single points).

We then have the following definition of definability.

Definition B.2.

Let \mathcal{O}=\{\mathcal{O}_{n}\}_{n=0}^{\infty} be an o-minimal structure over \mathbb{R}. We say that a set A\subseteq\mathbb{R}^{n} is definable if A\in\mathcal{O}_{n}. A function/mapping f:A\subseteq\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is said to be definable if its graph \{(x,y)\in\mathbb{R}^{n}\times\mathbb{R}^{m}|y=f(x)\}\in\mathcal{O}_{n+m}.111111Note that a real extended value function f:\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\} can be equivalently seen as a standard function f:\textbf{dom}(f)\rightarrow\mathbb{R}. Here \textbf{dom}(f) denotes the domain of f. A set-valued mapping f:A\subseteq\mathbb{R}^{n}\rightrightarrows\mathbb{R}^{m} is said to be definable if its graph \{(x,y)\in A\times\mathbb{R}^{m}|y\in f(x)\}\in\mathcal{O}_{n+m}.

Below we provide three examples of the most widely used o-minimal structures, namely the real semialgebraic structure, the globally subanalytic structure and the log-exp structure used in our main text.

Example 1 (Real semialgebraic structure).

In this structure, each 𝒪n\mathcal{O}_{n} consists of real semialgebraic sets in n\mathbb{R}^{n}. A set in n\mathbb{R}^{n} is said to be real semialgebraic (or semialgebraic for short) if it can be written as a finite union of sets of the form {xn|pi(x)=0,qi(x)<0,i=1,,k}\{x\in\mathbb{R}^{n}|p_{i}(x)=0,\,q_{i}(x)<0,\,i=1,\dots,k\}, where pi,qip_{i},q_{i} are real polynomial functions and k1k\geq 1. A function/mapping f:Anmf:A\subseteq\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is called (real) semialgebraic if its graph is semialgebraic.

The following result is useful for proving Theorem 11 (see [18] for more properties of real semialgebraic sets).

Lemma 20.

Any set of the form \{x\in\mathbb{R}^{n}|p_{i}(x)=0,\,q_{j}(x)\leq 0,\,i=1,\dots,k,\,j=1,\dots,l\}, where p_{i},q_{j} are real polynomial functions and k,l\geq 1, is real semialgebraic.

Proof.

First, notice that {xn|q(x)<0}={xn|0=0,q(x)<0}\{x\in\mathbb{R}^{n}|q(x)<0\}=\{x\in\mathbb{R}^{n}|0=0,q(x)<0\} and {xn|p(x)=0}={xn|p(x)=0,1<0}\{x\in\mathbb{R}^{n}|p(x)=0\}=\{x\in\mathbb{R}^{n}|p(x)=0,-1<0\} are both real semialgebraic by definition. Hence {xn|p(x)0}={xn|p(x)<0}{xn|p(x)=0}\{x\in\mathbb{R}^{n}|p(x)\leq 0\}=\{x\in\mathbb{R}^{n}|p(x)<0\}\cup\{x\in\mathbb{R}^{n}|p(x)=0\} is real semialgebraic. Finally, by the intersection closure property of o-minimal structures,

{xn|pi(x)=0,qj(x)0,i=1,,k,j=1,,l}=(i=1k{xn|pi(x)=0})(j=1l{xn|qj(x)0})\begin{split}&\{x\in\mathbb{R}^{n}|p_{i}(x)=0,q_{j}(x)\leq 0,i=1,\dots,k,j=1,\dots,l\}\\ &=\left(\bigcap_{i=1}^{k}\{x\in\mathbb{R}^{n}|p_{i}(x)=0\}\right)\bigcap\left(\bigcap_{j=1}^{l}\{x\in\mathbb{R}^{n}|q_{j}(x)\leq 0\}\right)\end{split}

is also real semialgebraic. ∎
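
As a concrete instance of Lemma 20, the probability simplex appearing in the constraint set (e.g., each L_{t}\in\Delta(\mathcal{S}\times\mathcal{A})) can be written as

\Delta(\mathcal{S}\times\mathcal{A})=\left\{L\in\mathbb{R}^{SA}\,\middle|\,\sum_{s\in\mathcal{S},a\in\mathcal{A}}L_{s,a}-1=0,\ -L_{s,a}\leq 0,\ \forall s\in\mathcal{S},a\in\mathcal{A}\right\},

which is cut out by one linear equality and SA linear inequalities, and is hence real semialgebraic (and in particular definable).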

Example 2 (Globally subanalytic structure).

This structure is the smallest o-minimal structure in which each \mathcal{O}_{n+1} (n\geq 1) contains all sets of the form \{(x,t)\in[-1,1]^{n}\times\mathbb{R}|f(x)=t\}, where f:[-1,1]^{n}\rightarrow\mathbb{R} is an analytic function that can be extended analytically to a neighborhood containing [-1,1]^{n}. In particular, all semialgebraic sets, as well as all sets of the form \{(x,t)\in[a,b]^{n}\times\mathbb{R}|f(x)=t\} with f:\mathbb{R}^{n}\rightarrow\mathbb{R} analytic and a\leq b\in\mathbb{R}, belong to this structure.

Example 3 (Log-exp structure).

This structure is the smallest o-minimal structure containing the globally subanalytic structure and the graph of the exponential function \exp:\mathbb{R}\rightarrow\mathbb{R}.

As in the main text, we say that a function (or a set) is definable if it is definable on the log-exp structure. The following proposition formally summarizes the major facts about definable functions and sets. See [7] and [68] for more details.

Proposition 21.

The following facts hold for definable functions/mappings and sets.

  1. 1.

    Let the real extended value functions f,g:n{+}f,g:\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\} be definable, and let Gn×mG\in\mathbb{R}^{n\times m} be a constant matrix and hnh\in\mathbb{R}^{n} be a constant vector. Then f(x)+g(x)f(x)+g(x), f(x)g(x)f(x)-g(x), f(x)g(x)f(x)g(x), f(Gz+h)f(Gz+h), max{f(x),g(x)}\max\{f(x),g(x)\}, min{f(x),g(x)}\min\{f(x),g(x)\} and f1(y)f^{-1}(y) are definable. The same results hold for vector-valued mappings f,g:Anmf,g:A\subseteq\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}.

  2. 2.

    Let the functions/mappings f:A\subseteq\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} and g:B\subseteq\mathbb{R}^{m}\rightarrow\mathbb{R}^{l} be definable, and let f(A)\subseteq B. Then the composition h(x)=g(f(x)):A\rightarrow\mathbb{R}^{l} is also definable.

  3. 3.

    Let f:Anf:A\subseteq\mathbb{R}^{n}\rightarrow\mathbb{R} be definable and let BAB\subseteq A be a compact set on which f(x)0f(x)\neq 0 for all xBx\in B. Then g(x)=1/f(x):Bg(x)=1/f(x):B\rightarrow\mathbb{R} is also definable.

  4. 4.

    Let f(x,y):n×mf(x,y):\mathbb{R}^{n}\times\mathbb{R}^{m}\rightarrow\mathbb{R} and CmC\subseteq\mathbb{R}^{m} be definable. Then g(x)=supyCf(x,y):ng(x)=\sup_{y\in C}f(x,y):\mathbb{R}^{n}\rightarrow\mathbb{R} and h(x)=infyCf(x,y):nh(x)=\inf_{y\in C}f(x,y):\mathbb{R}^{n}\rightarrow\mathbb{R} are both definable.

  5. 5.

    All semialgebraic functions, all analytic functions restricted to definable compact sets, as well as the exponential function exp\exp and logarithm function log\log are definable.

  6. 6.

    Let the set Θ\Theta be definable. Then the indicator function 𝟏Θ(θ)\mathbf{1}_{\Theta}(\theta) (that takes the value 11 when θΘ\theta\in\Theta and 0 otherwise) and the characteristic function121212Here 𝒳(x)\mathcal{I}_{\mathcal{X}}(x) denotes the (real extended valued) characteristic function (sometimes also referred to as the indicator function in some literature) that takes value 0 when x𝒳x\in\mathcal{X} and ++\infty otherwise. Θ(θ)\mathcal{I}_{\Theta}(\theta) are also definable.

  7. 7.

    Let the sets A,BnA,B\subseteq\mathbb{R}^{n} be definable. Then the Cartesian product A×BA\times B is definable.

B.2 Local convergence to global Nash equilibrium solutions

Proof of Theorem 11.

The proof consists of four steps. We begin by verifying the definability around the initialization. We then specify ϵ0\epsilon_{0} and show that the iterates are uniformly bounded in a similar neighborhood of the initialization. Finally, we obtain convergence to Nash equilibrium solutions.

Step 1: Verification of definability around initialization.

We first verify that

hMF-OMO(θ):=fMF-OMO(θ)+Θ(θ)h^{\text{MF-OMO}}(\theta):=f^{\text{MF-OMO}}(\theta)+\mathcal{I}_{\Theta}(\theta)

is definable. Note that \theta is an optimal solution to (MF-OMO) if and only if \theta minimizes h^{\text{MF-OMO}}(\theta) over the entire Euclidean space of \theta. To see the definability, note that f^{\text{MF-OMO}}(\theta) is obtained from a finite number of summations, subtractions and products of definable functions (i.e., P_{t}(s,a,L_{t}), r_{t}(s,a,L_{t}), constant functions and identity functions), and is hence definable by Proposition 21. In addition, by rewriting \|y\|_{2}\leq S(T+1)(T+2)r_{\max}/2 as \sum_{i=1}^{S(T+1)}y_{i}^{2}\leq S^{2}(T+1)^{2}(T+2)^{2}r_{\max}^{2}/4 and applying Lemma 20, \Theta is semialgebraic and hence definable, so the characteristic function \mathcal{I}_{\Theta}(\theta) is also definable by Proposition 21. As a result, the sum of f^{\text{MF-OMO}}(\theta) and \mathcal{I}_{\Theta}(\theta) is definable.

Let (π,L)(\pi^{\star},L^{\star}) be some Nash equilibrium solution. Define yy^{\star} and zz^{\star} by

y:=[V1(L),V2(L),,VT(L),V0(L)]S(T+1)y^{\star}:=[V_{1}^{\star}(L^{\star}),V_{2}^{\star}(L^{\star}),\dots,V_{T}^{\star}(L^{\star}),-V_{0}^{\star}(L^{\star})]\in\mathbb{R}^{S(T+1)}

and z=[z,0,z,1,,z,T]SA(T+1)z^{\star}=[z_{\star,0},z_{\star,1},\dots,z_{\star,T}]\in\mathbb{R}^{SA(T+1)} with z,t=[z,t1,,z,tA]z_{\star,t}=[z_{\star,t}^{1},\dots,z_{\star,t}^{A}] (t=0,,Tt=0,\dots,T), where

z,ta=Vt(L)rt(,a,Lt)Pta(Lt)Vt+1(L)S,t=0,,T1,a𝒜z_{\star,t}^{a}=V_{t}^{\star}(L^{\star})-r_{t}(\cdot,a,L_{t}^{\star})-P_{t}^{a}(L_{t}^{\star})V_{t+1}^{\star}(L^{\star})\in\mathbb{R}^{S},\quad t=0,\dots,T-1,\,a\in\mathcal{A}

and z_{\star,T}^{a}=V_{T}^{\star}(L^{\star})-r_{T}(\cdot,a,L_{T}^{\star})\in\mathbb{R}^{S} (a\in\mathcal{A}), and set \theta^{\star}:=(y^{\star},z^{\star},L^{\star}). Then by Theorem 4, Proposition 6 and Corollary 7, \theta^{\star} is feasible (i.e., \theta^{\star}\in\Theta) and A_{L^{\star}}L^{\star}=b, A_{L^{\star}}^{\top}y^{\star}+z^{\star}=c_{L^{\star}} and (z^{\star})^{\top}L^{\star}=0. Hence f^{\text{MF-OMO}}(\theta^{\star})=0 and thus \theta^{\star} is optimal (i.e., \theta^{\star}\in\Theta^{\star}).

For this θΘ\theta^{\star}\in\Theta^{\star}, we have hMF-OMO(θ)=fMF-OMO(θ)=0h^{\text{MF-OMO}}(\theta^{\star})=f^{\text{MF-OMO}}(\theta^{\star})=0. By [7, Theorem 14] and the continuity of fMF-OMO(θ)f^{\text{MF-OMO}}(\theta), there exist ϵ,δ>0\epsilon,\delta>0 and a continuous, concave, strictly increasing and definable function ϕ:[0,δ)+\phi:[0,\delta)\rightarrow\mathbb{R}_{+}, with ϕ(0)=0\phi(0)=0 and C1C^{1} in (0,δ)(0,\delta), such that for any θΘ\theta\in\Theta with θθ2ϵ\|\theta-\theta^{\star}\|_{2}\leq\epsilon and 0<hMF-OMO(θ)<δ0<h^{\text{MF-OMO}}(\theta)<\delta, the following Kurdyka-Lojasiewicz (KL) condition holds:

ϕ(hMF-OMO(θ))dist(0,θfMF-OMO(θ)+𝒩Θ(θ))1.\phi^{\prime}(h^{\text{MF-OMO}}(\theta))\textbf{dist}(0,\nabla_{\theta}f^{\text{MF-OMO}}(\theta)+\mathcal{N}_{\Theta}(\theta))\geq 1. (42)

Here we use the fact that the limiting sub-differential hMF-OMO(θ)=θfMF-OMO(θ)+𝒩Θ(θ)\partial h^{\text{MF-OMO}}(\theta)=\nabla_{\theta}f^{\text{MF-OMO}}(\theta)+\mathcal{N}_{\Theta}(\theta) [7, Proposition 3].

Step 2: Specify ϵ0\epsilon_{0}.

Next, we specify the choice of ϵ0\epsilon_{0}. By the continuity of fMF-OMO(θ)f^{\text{MF-OMO}}(\theta), there exists ϵ(0,ϵ/3]\epsilon^{\prime}\in(0,\epsilon/3] such that for any θΘ\theta\in\Theta with θθ2ϵ\|\theta-\theta^{\star}\|_{2}\leq\epsilon^{\prime},

hMF-OMO(θ)=fMF-OMO(θ)min{ϕ1(ϵ3(M+1/η)),(2Mη)ϵ218η,δ/2}<δ.h^{\text{MF-OMO}}(\theta)=f^{\text{MF-OMO}}(\theta)\leq\min\left\{\phi^{-1}\left(\frac{\epsilon}{3(M+1/\eta)}\right),\frac{(2-M\eta)\epsilon^{2}}{18\eta},\delta/2\right\}<\delta.

Let ϵ0:=min{ϵ,ϵ}/C\epsilon_{0}:=\min\{\epsilon,\epsilon^{\prime}\}/C, where

C=SA(T+1)(CP(T+1)2rmax+Cr(2T+1))+S(T+1)(CPT(T+1)rmax2+CrT)+1.C=SA(T+1)(C_{P}(T+1)^{2}r_{\max}+C_{r}(2T+1))+S(T+1)\left(\dfrac{C_{P}T(T+1)r_{\max}}{2}+C_{r}T\right)+1.

We now show that for any L0Δ(𝒮×𝒜)L^{0}\in\Delta(\mathcal{S}\times\mathcal{A}) with L0L1ϵ0\|L^{0}-L^{\star}\|_{1}\leq\epsilon_{0}, θ0θ2Cϵ0=min{ϵ,ϵ}\|\theta_{0}-\theta^{\star}\|_{2}\leq C\epsilon_{0}=\min\{\epsilon,\epsilon^{\prime}\}, and hence hMF-OMO(θ0)min{ϕ1(ϵ3(M+1/η)),(2Mη)ϵ218η,δ/2}<δh^{\text{MF-OMO}}(\theta_{0})\leq\min\left\{\phi^{-1}(\frac{\epsilon}{3(M+1/\eta)}),\frac{(2-M\eta)\epsilon^{2}}{18\eta},\delta/2\right\}<\delta.

By Assumption 1 and the compactness of \Theta, the Lipschitz continuity assumptions in Theorem 8 hold for some constants C_{P},C_{r}>0. Since \|L^{0}-L^{\star}\|_{1}\leq\epsilon_{0}, by the construction of y^{0} and z^{0} in the proof of Proposition 13, together with the proof of Theorem 9, \theta_{0}=(y^{0},z^{0},L^{0})\in\Theta is feasible, and

y0y=maxt𝒯Vt(L0)Vt(L)T(T+1)rmax2maxt=0,,T1PtL0PtL,1+t𝒯rtL0rtL(CPT(T+1)rmax2+CrT)ϵ0.\begin{split}\|y^{0}-y^{\star}\|_{\infty}&=\max_{t\in\mathcal{T}}\|V_{t}^{\star}(L^{0})-V_{t}^{\star}(L^{\star})\|_{\infty}\\ &\leq\dfrac{T(T+1)r_{\max}}{2}\max_{t=0,\dots,T-1}\|P_{t}^{L^{0}}-P_{t}^{L^{\star}}\|_{\infty,1}+\sum_{t\in\mathcal{T}}\|r_{t}^{L^{0}}-r_{t}^{L^{\star}}\|_{\infty}\\ &\leq\left(\dfrac{C_{P}T(T+1)r_{\max}}{2}+C_{r}T\right)\epsilon_{0}.\end{split}

Similarly,

z0zmaxt𝒯Vt(L0)Vt(L)+maxa𝒜,t𝒯rt(,a,Lt0)rt(,a,Lt)+maxa𝒜,t𝒯Pta(Lt0)Vt+1(L0)Pta(Lt)Vt+1(L)(CPT(T+1)rmax2+CrT)ϵ0+Crϵ0+maxa𝒜,t𝒯Pta(Lt0)Vt+1(L0)Vt+1(L)+maxa𝒜,t𝒯Pta(Lt0)Pta(Lt)Vt+1(L)(CPT(T+1)rmax2+Cr(T+1))ϵ0+(CPT(T+1)rmax2+CrT)ϵ0+rmax(T+1)CPϵ0=(CP(T+1)2rmax+Cr(2T+1))ϵ0,\begin{split}\|z^{0}-z^{\star}\|_{\infty}&\leq\max_{t\in\mathcal{T}}\|V_{t}^{\star}(L^{0})-V_{t}^{\star}(L^{\star})\|_{\infty}+\max_{a\in\mathcal{A},t\in\mathcal{T}}\|r_{t}(\cdot,a,L_{t}^{0})-r_{t}(\cdot,a,L_{t}^{\star})\|_{\infty}\\ &\qquad+\max_{a\in\mathcal{A},t\in\mathcal{T}}\|P_{t}^{a}(L_{t}^{0})V_{t+1}^{\star}(L^{0})-P_{t}^{a}(L_{t}^{\star})V_{t+1}(L^{\star})\|_{\infty}\\ &\leq\left(\dfrac{C_{P}T(T+1)r_{\max}}{2}+C_{r}T\right)\epsilon_{0}+C_{r}\epsilon_{0}\\ &\qquad+\max_{a\in\mathcal{A},t\in\mathcal{T}}\|P_{t}^{a}(L_{t}^{0})\|_{\infty}\|V_{t+1}^{\star}(L^{0})-V_{t+1}^{\star}(L^{\star})\|_{\infty}\\ &\qquad+\max_{a\in\mathcal{A},t\in\mathcal{T}}\|P_{t}^{a}(L_{t}^{0})-P_{t}^{a}(L_{t}^{\star})\|_{\infty}\|V_{t+1}^{\star}(L^{\star})\|_{\infty}\\ &\leq\left(\dfrac{C_{P}T(T+1)r_{\max}}{2}+C_{r}(T+1)\right)\epsilon_{0}+\left(\dfrac{C_{P}T(T+1)r_{\max}}{2}+C_{r}T\right)\epsilon_{0}\\ &\qquad+r_{\max}(T+1)C_{P}\epsilon_{0}\\ &=(C_{P}(T+1)^{2}r_{\max}+C_{r}(2T+1))\epsilon_{0},\end{split}

where VT+1(L)=0V_{T+1}(L)=0 for any LΔ(𝒮×𝒜)L\in\Delta(\mathcal{S}\times\mathcal{A}). Here we use the fact that Pta(Lt0)=1\|P_{t}^{a}(L_{t}^{0})\|_{\infty}=1, maxa𝒜,t𝒯Pta(Lt0)Pta(Lt)=PL0PL,1\max_{a\in\mathcal{A},t\in\mathcal{T}}\|P_{t}^{a}(L_{t}^{0})-P_{t}^{a}(L_{t}^{\star})\|_{\infty}=\|P^{L^{0}}-P^{L^{\star}}\|_{\infty,1}, and maxt𝒯|Vt(L)(s)|rmax(T+1)\max_{t\in\mathcal{T}}|V_{t}^{\star}(L^{\star})(s)|\leq r_{\max}(T+1). Also note that here 1\|\cdot\|_{1} and \|\cdot\|_{\infty} are both vector norms (except for Pta(Lt)P_{t}^{a}(L_{t}), for which \|\cdot\|_{\infty} is the matrix \ell_{\infty}-norm).

Hence we have shown that

θ0θ2θ0θ1=z0z1+y0y1+L0L1Cϵ0=min{ϵ,ϵ},\|\theta_{0}-\theta^{\star}\|_{2}\leq\|\theta_{0}-\theta^{\star}\|_{1}=\|z^{0}-z^{\star}\|_{1}+\|y^{0}-y^{\star}\|_{1}+\|L^{0}-L^{\star}\|_{1}\leq C\epsilon_{0}=\min\{\epsilon,\epsilon^{\prime}\},

where CC is a problem dependent constant defined as above. Thus hMF-OMO(θ0)<δh^{\text{MF-OMO}}(\theta_{0})<\delta.

Now by Proposition 18 and the feasibility of {θk}k0\{\theta_{k}\}_{k\geq 0}, hMF-OMO(θk)=fMF-OMO(θk)h^{\text{MF-OMO}}(\theta_{k})=f^{\text{MF-OMO}}(\theta_{k}) is non-increasing, and hence

hMF-OMO(θk)hMF-OMO(θ0)min{ϕ1(ϵ3(M+1/η)),(2Mη)ϵ218η,δ/2}δ/2<δ.h^{\text{MF-OMO}}(\theta_{k})\leq h^{\text{MF-OMO}}(\theta_{0})\leq\min\left\{\phi^{-1}\left(\frac{\epsilon}{3(M+1/\eta)}\right),\frac{(2-M\eta)\epsilon^{2}}{18\eta},\delta/2\right\}\leq\delta/2<\delta.

Step 3: Show that θkθ2ϵ\|\theta_{k}-\theta^{\star}\|_{2}\leq\epsilon, θkθ2=O(g(ϵ0))\|\theta_{k}-\theta^{\star}\|_{2}=O(g(\epsilon_{0})) for some gg with limϵ~0g(ϵ~)=0\lim_{\tilde{\epsilon}\rightarrow 0}g(\tilde{\epsilon})=0.

Define fk=fMF-OMO(θk)f_{k}=f^{\text{MF-OMO}}(\theta_{k}). By [8, Lemma 10.4],

fkfk+1=fMF-OMO(θk)fMF-OMO(θk+1)(1ηM2)θk+1θk22.f_{k}-f_{k+1}=f^{\text{MF-OMO}}(\theta_{k})-f^{\text{MF-OMO}}(\theta_{k+1})\geq\left(\frac{1}{\eta}-\frac{M}{2}\right)\|\theta_{k+1}-\theta_{k}\|_{2}^{2}. (43)

Then since ϕ\phi (in the KL condition (42) above) is concave, strictly increasing and C1C^{1} in (0,δ)(0,\delta), we have ϕ(fMF-OMO(θk))>0\phi^{\prime}(f^{\text{MF-OMO}}(\theta_{k}))>0 as long as fMF-OMO(θk)>0f^{\text{MF-OMO}}(\theta_{k})>0 (since we already have fMF-OMO(θk)<δf^{\text{MF-OMO}}(\theta_{k})<\delta for all k0k\geq 0), and

\phi(f_{k})-\phi(f_{k+1})\geq\phi^{\prime}(f_{k})(f_{k}-f_{k+1})\geq\phi^{\prime}(f_{k})\left(\frac{1}{\eta}-\frac{M}{2}\right)\|\theta_{k+1}-\theta_{k}\|_{2}^{2}. (44)

Here the first inequality is by the concavity of ϕ\phi.

Now we prove that

θiθ2min{g(ϵ0),ϵ}\|\theta_{i}-\theta^{\star}\|_{2}\leq\min\{g(\epsilon_{0}),\epsilon\} (45)

for all i0i\geq 0 by induction, where

g(ϵ0)=(M+1/η)ϕ(Cfϵ0)+2η2MηCfϵ0+Cϵ0(M+1/η)ϕ(f0)+2η2Mηf0+Cϵ0,\begin{split}g(\epsilon_{0})&=(M+1/\eta)\phi(C_{f}\epsilon_{0})+\sqrt{\frac{2\eta}{2-M\eta}C_{f}\epsilon_{0}}+C\epsilon_{0}\\ &\geq(M+1/\eta)\phi(f_{0})+\sqrt{\frac{2\eta}{2-M\eta}f_{0}}+C\epsilon_{0},\end{split} (46)

where Cf>0C_{f}>0 is the Lipschitz constant of fMF-OMO(θ)f^{\text{MF-OMO}}(\theta) in the sense that |fMF-OMO(θ)fMF-OMO(θ)|Cfθθ2|f^{\text{MF-OMO}}(\theta)-f^{\text{MF-OMO}}(\theta^{\prime})|\leq C_{f}\|\theta-\theta^{\prime}\|_{2} for any θ,θΘ\theta,\theta^{\prime}\in\Theta. The existence of CfC_{f} is guaranteed by Assumption 1 and the compactness of Θ\Theta. Note that since limϵ00f0=0\lim_{\epsilon_{0}\rightarrow 0}f_{0}=0 due to the continuity of fMF-OMO()f^{\text{MF-OMO}}(\cdot), we have the desired property limϵ~0g(ϵ~)=0\lim_{\tilde{\epsilon}\rightarrow 0}g(\tilde{\epsilon})=0.

The base case for i=0i=0 is trivial. Without loss of generality, we now assume that fMF-OMO(θ0)>0f^{\text{MF-OMO}}(\theta_{0})>0, since otherwise the algorithm finds the exact Nash equilibrium solution at the initialization and the iterates terminate/stay unchanged thereafter, and hence all the claims in Theorem 11 obviously hold.

Then for i=1i=1, by (43),

θ1θ2θ0θ2+θ1θ02Cϵ0+2η2Mη(f0f1)Cϵ0+2η2Mηf0{g(ϵ0),ϵ3+ϵ3ϵ.\begin{split}\|\theta_{1}-\theta^{\star}\|_{2}&\leq\|\theta_{0}-\theta^{\star}\|_{2}+\|\theta_{1}-\theta_{0}\|_{2}\\ &\leq C\epsilon_{0}+\sqrt{\frac{2\eta}{2-M\eta}(f_{0}-f_{1})}\leq C\epsilon_{0}+\sqrt{\frac{2\eta}{2-M\eta}f_{0}}\\ &\leq\left\{\begin{array}[]{l}g(\epsilon_{0}),\\ \dfrac{\epsilon}{3}+\dfrac{\epsilon}{3}\leq\epsilon.\end{array}\right.\end{split}

Suppose that the claim (45) holds for i=0,,ki=0,\dots,k (k0k\geq 0). Without loss of generality, we also assume that fMF-OMO(θk),fMF-OMO(θk+1)>0f^{\text{MF-OMO}}(\theta_{k}),f^{\text{MF-OMO}}(\theta_{k+1})>0, since otherwise the algorithm finds the exact Nash equilibrium solution at iteration kk or k+1k+1 and the iterates terminate/stay unchanged thereafter, and hence all the claims in Theorem 11 obviously hold.

Now for any i{1,,k}i\in\{1,\dots,k\}, since

θiargminθΘ(θ)+12θ(θi1ηθfMF-OMO(θi1))22,\theta_{i}\in\text{argmin}_{\theta}\,\mathcal{I}_{\Theta}(\theta)+\frac{1}{2}\|\theta-(\theta_{i-1}-\eta\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{i-1}))\|_{2}^{2},

we have

0𝒩Θ(θi)+θiθi1+ηθfMF-OMO(θi1).0\in\mathcal{N}_{\Theta}(\theta_{i})+\theta_{i}-\theta_{i-1}+\eta\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{i-1}).

Hence 1η(θi1θi)θfMF-OMO(θi1)𝒩Θ(θi)\frac{1}{\eta}(\theta_{i-1}-\theta_{i})-\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{i-1})\in\mathcal{N}_{\Theta}(\theta_{i}) since 𝒩Θ(θi)\mathcal{N}_{\Theta}(\theta_{i}) is a cone, and by the KL condition at θiΘ\theta_{i}\in\Theta with θiθ2ϵ\|\theta_{i}-\theta^{\star}\|_{2}\leq\epsilon and hMF-OMO(θi)(0,δ)h^{\text{MF-OMO}}(\theta_{i})\in(0,\delta), we have

ϕ(fi)1η(θi1θi)θfMF-OMO(θi1)+θfMF-OMO(θi)21.\phi^{\prime}(f_{i})\left\|\frac{1}{\eta}(\theta_{i-1}-\theta_{i})-\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{i-1})+\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{i})\right\|_{2}\geq 1. (47)

(47), together with (44), immediately implies that

1ϕ(fi)(1ηθiθi12+Mθiθi12)(1η+M)(ϕ(fi)ϕ(fi+1))θiθi12θi+1θi22,\begin{split}1&\leq\phi^{\prime}(f_{i})\left(\frac{1}{\eta}\|\theta_{i}-\theta_{i-1}\|_{2}+M\|\theta_{i}-\theta_{i-1}\|_{2}\right)\\ &\leq\left(\frac{1}{\eta}+M\right)(\phi(f_{i})-\phi(f_{i+1}))\dfrac{\|\theta_{i}-\theta_{i-1}\|_{2}}{\|\theta_{i+1}-\theta_{i}\|_{2}^{2}},\end{split}

thus

\begin{split}2\|\theta_{i+1}-\theta_{i}\|_{2}&\leq 2\sqrt{(M+1/\eta)(\phi(f_{i})-\phi(f_{i+1}))}\sqrt{\|\theta_{i}-\theta_{i-1}\|_{2}}\\ &\leq(M+1/\eta)(\phi(f_{i})-\phi(f_{i+1}))+\|\theta_{i}-\theta_{i-1}\|_{2}.\end{split}

By telescoping the above inequality for i=1,,ki=1,\dots,k,

θk+1θk2+i=1kθi+1θi2(M+1/η)ϕ(f1)+θ1θ02,\|\theta_{k+1}-\theta_{k}\|_{2}+\sum_{i=1}^{k}\|\theta_{i+1}-\theta_{i}\|_{2}\leq(M+1/\eta)\phi(f_{1})+\|\theta_{1}-\theta_{0}\|_{2},

from which we deduce that

θk+1θ2i=1kθi+1θi2+θ1θ2(M+1/η)ϕ(f0)+θ1θ02+θ1θ2(M+1/η)ϕ(f0)+2η2Mηf0+Cϵ0.{g(ϵ0),ϵ3+ϵ3+ϵ3=ϵ.\begin{split}\|\theta_{k+1}-\theta^{\star}\|_{2}&\leq\sum_{i=1}^{k}\|\theta_{i+1}-\theta_{i}\|_{2}+\|\theta_{1}-\theta^{\star}\|_{2}\\ &\leq(M+1/\eta)\phi(f_{0})+\|\theta_{1}-\theta_{0}\|_{2}+\|\theta_{1}-\theta^{\star}\|_{2}\\ &\leq(M+1/\eta)\phi(f_{0})+\sqrt{\frac{2\eta}{2-M\eta}f_{0}}+C\epsilon_{0}.\\ &\leq\begin{cases}g(\epsilon_{0}),\\ \dfrac{\epsilon}{3}+\dfrac{\epsilon}{3}+\dfrac{\epsilon}{3}=\epsilon.\end{cases}\end{split}

Note that here we use (43) for k=0k=0 and we also use the fact that f1f0f_{1}\leq f_{0} and ϕ\phi is increasing. This completes the induction.

Final step: Convergence to Nash equilibrium solutions.

Let θ¯\bar{\theta} be an arbitrary limit point of {θk}k0\{\theta_{k}\}_{k\geq 0}. Then θ¯Θ\bar{\theta}\in\Theta, and by Proposition 18, θ¯\bar{\theta} is a stationary point of (MF-OMO), i.e., 0θfMF-OMO(θ¯)+𝒩Θ(θ¯)0\in\nabla_{\theta}f^{\text{MF-OMO}}(\bar{\theta})+\mathcal{N}_{\Theta}(\bar{\theta}). Moreover, again by Proposition 18 and the feasibility of {θk}k0\{\theta_{k}\}_{k\geq 0}, hMF-OMO(θk)=fMF-OMO(θk)h^{\text{MF-OMO}}(\theta_{k})=f^{\text{MF-OMO}}(\theta_{k}) is non-increasing, and hence by the continuity of fMF-OMO(θ)f^{\text{MF-OMO}}(\theta),

hMF-OMO(θ¯)=limkhMF-OMO(θk)=limkfMF-OMO(θk)=fMF-OMO(θ¯)fMF-OMO(θ0)<δ.h^{\text{MF-OMO}}(\bar{\theta})=\lim_{k\rightarrow\infty}h^{\text{MF-OMO}}(\theta_{k})=\lim_{k\rightarrow\infty}f^{\text{MF-OMO}}(\theta_{k})=f^{\text{MF-OMO}}(\bar{\theta})\leq f^{\text{MF-OMO}}(\theta_{0})<\delta.

In addition, by the first three steps above, we have θ¯θ2ϵ\|\bar{\theta}-\theta^{\star}\|_{2}\leq\epsilon. Hence if hMF-OMO(θ¯)>0h^{\text{MF-OMO}}(\bar{\theta})>0, then invoking the KL condition at θ¯\bar{\theta} yields

ϕ(hMF-OMO(θ¯))dist(0,θfMF-OMO(θ¯)+𝒩Θ(θ¯))1.\phi^{\prime}(h^{\text{MF-OMO}}(\bar{\theta}))\text{dist}(0,\nabla_{\theta}f^{\text{MF-OMO}}(\bar{\theta})+\mathcal{N}_{\Theta}(\bar{\theta}))\geq 1.

However, we also have \text{dist}(0,\nabla_{\theta}f^{\text{MF-OMO}}(\bar{\theta})+\mathcal{N}_{\Theta}(\bar{\theta}))=0, which is a contradiction. Hence \bar{\theta}\in\Theta satisfies f^{\text{MF-OMO}}(\bar{\theta})=0, which indicates that \bar{\theta}\in\Theta^{\star}. Therefore \lim_{k\rightarrow\infty}\textbf{dist}(\theta_{k},\Theta^{\star})=0, and by the equivalence result in Theorem 4, any \bar{\pi}\in\Pi(\bar{L}) with \bar{\theta}=(\bar{y},\bar{z},\bar{L}) is a Nash equilibrium solution of the mean-field game ((1) and (2)). In addition, we have \|L^{k}-L^{\star}\|_{2}\leq\|\theta_{k}-\theta^{\star}\|_{2}\leq g(\epsilon_{0}) for all k\geq 0. The fact that the limit point set is nonempty follows simply from the boundedness of \{\theta_{k}\}_{k\geq 0}. Moreover, the vanishing limit of the exploitability of \pi^{k}\in\Pi(L^{k}) is an immediate implication of Theorem 8 and the fact that f^{\text{MF-OMO}}(\theta_{k})\rightarrow 0 as k\rightarrow\infty. The claim about isolated Nash equilibrium solutions is immediate: simply choose \epsilon_{0} such that no other Nash equilibrium solutions exist in the g(\epsilon_{0})-neighborhood of L^{\star}. ∎