
MF-OMO: An Optimization Formulation of Mean-Field Games

Xin Guo (University of California, Berkeley; Amazon.com)    Anran Hu (University of Oxford)    Junzi Zhang (Citadel Securities)
Abstract

This paper proposes a new mathematical paradigm to analyze discrete-time mean-field games. It is shown that finding Nash equilibrium solutions for a general class of discrete-time mean-field games is equivalent to solving an optimization problem with bounded variables and simple convex constraints, called MF-OMO. This equivalence framework enables finding multiple (and possibly all) Nash equilibrium solutions of mean-field games by standard algorithms. For instance, projected gradient descent is shown to be capable of retrieving all possible Nash equilibrium solutions when there are finitely many of them, with proper initializations. Moreover, analyzing mean-field games with linear rewards and mean-field independent dynamics is reduced to solving a finite number of linear programs, hence solvable in finite time. This framework does not rely on contractivity or monotonicity assumptions, nor on the uniqueness of the Nash equilibrium.

1 Introduction

The theory of mean-field games, pioneered by the seminal works of [39] and [46], provides an ingenious way of analyzing and finding the $\epsilon$-Nash equilibrium solution to the otherwise notoriously hard $N$-player stochastic games. Ever since its inception, the literature on mean-field games has grown exponentially both in theory and in applications (see the standard reference [15]).

There are two key criteria for finding Nash equilibrium solutions of a mean-field game. The first is the optimality condition for the representative player, and the second is the consistency condition ensuring that the solution of the representative player is consistent with the mean-field flow of all players. Consequently, mean-field games with continuous-time stochastic dynamics have been studied under three approaches. The first is the original fixed-point [39] and PDE [46] approach that analyzes the joint system of a backward Hamilton-Jacobi-Bellman equation for a control problem and a forward Fokker-Planck equation for the controlled dynamics. This naturally leads to the second, probabilistic approach, which focuses on the associated forward-backward stochastic differential equations [12, 14]. The third and latest approach uses the master equation to analyze mean-field games with common noise and the propagation of chaos for optimal trajectories [9, 13, 23]. All existing methodologies are based on either contractivity or monotonicity conditions, the uniqueness of the Nash equilibrium solution, or some particular game structures; see [69, 1] for a detailed summary. Accordingly, finding Nash equilibrium solutions of mean-field games relies on similar assumptions and structures; see for instance [47].

It is well known, however, that non-zero-sum stochastic games and mean-field games in general have multiple Nash equilibrium solutions and that uniqueness of the Nash equilibrium solution, contractivity, and monotonicity are more of an exception than a rule. Yet, systematic and efficient approaches for finding multiple or all Nash equilibrium solutions of general mean-field games remain elusive.

Our work and approach.

In this work, we show that finding Nash equilibrium solutions for a general class of discrete-time mean-field games is equivalent to solving an optimization problem with bounded variables and simple convex constraints, called MF-OMO (Mean-Field Occupation Measure Optimization) (Section 3). This equivalence is established without assuming either the uniqueness of the Nash equilibrium solution or the contractivity or monotonicity conditions.

The equivalence is built in several steps. The first step, inspired by the classical linear program formulation of the single-agent Markov decision process (MDP) problem [61], is to introduce the occupation measure of the representative player and to transform the problem of finding Nash equilibrium solutions into a constrained optimization problem (Theorem 4), using the strong duality of linear programming to bridge the two criteria of Nash equilibrium solutions. The second step is to utilize a solution modification technique (Proposition 6) to deal with potentially unbounded variables in the optimization problem and obtain the MF-OMO formulation (Theorem 5). Further analysis of this equivalent optimization formulation establishes the connection between the suboptimality of MF-OMO and the exploitability of any policy for mean-field games (Section 4). This connection is obtained by a perturbation analysis of MDPs (Lemma 13) and by the strong duality of linear programs.

These equivalence and connections enable adopting families of optimization algorithms to find multiple (and possibly all) Nash equilibrium solutions of mean-field games. In particular, detailed studies on the convergence of MF-OMO with the projected gradient descent (PGD) algorithm to Nash equilibrium solutions of mean-field games are presented (Theorem 11). For instance, PGD is shown to be capable of retrieving all possible Nash equilibrium solutions when there are finitely many of them, with proper initializations. Moreover, analyzing mean-field games with linear rewards and mean-field independent dynamics is shown to be reduced to solving a finite number of linear programs, hence solvable in finite time (Proposition 12).

Finally, this MF-OMO framework can be extended to other variants of mean-field games, as illustrated in the case of personalized mean-field games in Section 7.

Related work.

Our approach is inspired by the classical linear program framework for MDPs, originally proposed in [51] for discrete-time MDPs with finite state-action spaces and then extended to semi-MDPs [56] and to approximate dynamic programming [63, 21], as well as to continuous-time stochastic control [66, 10, 45] and singular control [67, 44] with continuous state-action spaces, among others [72, 25, 24, 38]. In a similar spirit, our work builds an optimization framework for discrete-time mean-field games.

There are several existing computational algorithms for finding Nash equilibrium solutions of discrete-time mean-field games. In particular, [31, 58] assume strict monotonicity, while [59] assumes weak monotonicity, all of which also assume the dynamics to be mean-field independent. In addition, [34, 32, 3] focus on contractive mean-field games, [6] studies mean-field games with unique Nash equilibrium solutions, and [20, 4] analyze mean-field games that are sufficiently close to some contractive regularized mean-field games. Under these assumptions, global convergence is obtained for some of these algorithms. In contrast, we focus on presenting an equivalent optimization framework for mean-field games, and establishing local convergence without these assumptions.

The works most relevant to ours are the earlier works [11, 29, 30] on continuous-time mean-field games, which replace the best-response/optimality condition in Nash equilibrium solutions by the linear program optimality condition. Their reformulation approach leads to new existence results for relaxed Nash equilibrium solutions and to computational benefits. In contrast, our work formulates the problem of finding Nash equilibrium solutions as a single optimization problem which captures both the best-response condition and the Fokker-Planck/consistency condition. Our formulation shows that finding Nash equilibrium solutions of discrete-time mean-field games with finite state-action spaces is generally no harder than solving a smooth, time-decoupled optimization problem with bounded variables and trivial convex constraints. Such a connection allows for directly utilizing various optimization tools, algorithms and solvers to find Nash equilibrium solutions of mean-field games. Moreover, our work provides, to the best of our knowledge, the first provably convergent algorithms for mean-field games with possibly multiple Nash equilibrium solutions and without contractivity or monotonicity assumptions.

2 Problem setup

We consider an extended [16] or general [34] mean-field game in a finite-time horizon TT with a state space 𝒮\mathcal{S} and an action space 𝒜\mathcal{A}, where |𝒮|=S<|\mathcal{S}|=S<\infty and |𝒜|=A<|\mathcal{A}|=A<\infty. In this game, a representative player starts from a state s0μ0s_{0}\sim\mu_{0}, with μ0\mu_{0} the (common) initial state distribution of all players of an infinite population. At each time t𝒯={0,1,,T}t\in\mathcal{T}=\{0,1,\dots,T\} and when at state sts_{t}, she chooses an action at𝒜a_{t}\in\mathcal{A} from some randomized/mixed policy πt:𝒮Δ(𝒜)\pi_{t}:\mathcal{S}\rightarrow\Delta(\mathcal{A}), where Δ(𝒜)\Delta(\mathcal{A}) denotes the set of probability vectors on 𝒜\mathcal{A}. Note that mixed policies are also known as relaxed controls. She will then move to a new state st+1s_{t+1} according to a transition probability Pt(|st,at,Lt)P_{t}(\cdot|s_{t},a_{t},L_{t}) and receive a reward rt(st,at,Lt)r_{t}(s_{t},a_{t},L_{t}), where LtΔ(𝒮×𝒜)L_{t}\in\Delta(\mathcal{S}\times\mathcal{A}) is the joint state-action distribution among all players at time tt, referred to as the mean-field information hereafter, and Δ(𝒮×𝒜)\Delta(\mathcal{S}\times\mathcal{A}) is the set of probability vectors on 𝒮×𝒜\mathcal{S}\times\mathcal{A}. Denote rmax=supt𝒯,s𝒮,a𝒜,LΔ(𝒮×𝒜)|rt(s,a,L)|r_{\max}=\sup_{t\in\mathcal{T},s\in\mathcal{S},a\in\mathcal{A},L\in\Delta(\mathcal{S}\times\mathcal{A})}|r_{t}(s,a,L)|, which is assumed throughout the paper as finite.

Given the mean-field flow {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}}, the objective of this representative player is to maximize her accumulated rewards, i.e., to solve the following MDP problem:

\begin{split}\text{maximize}_{\{\pi_{t}\}_{t\in\mathcal{T}}}\quad&\mathbb{E}\left[\sum_{t=0}^{T}r_{t}(s_{t},a_{t},L_{t})\Big{|}s_{0}\sim\mu_{0}\right]\\ \text{subject to}\quad&s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t})\,\,(t\in\mathcal{T}\backslash\{T\}),\quad a_{t}\sim\pi_{t}(\cdot|s_{t})\,\,(t\in\mathcal{T}).\end{split} (1)

Throughout the paper, for a given mean-field flow L={Lt}t𝒯L=\{L_{t}\}_{t\in\mathcal{T}}, we will denote the mean-field induced MDP (1) as (L)\mathcal{M}(L), Vt(L)SV_{t}^{\star}(L)\in\mathbb{R}^{S} as its optimal total expected reward, i.e., value function, starting from time tt, with the ss-th entry [Vt(L)]s[V_{t}^{\star}(L)]_{s} being the optimal expected reward starting from state ss at time tt, and Vμ0(L)=s𝒮μ0(s)[V0(L)]sV_{\mu_{0}}^{\star}(L)=\sum_{s\in\mathcal{S}}\mu_{0}(s)[V_{0}^{\star}(L)]_{s} as its optimal expected total reward starting from μ0\mu_{0}. Correspondingly, we denote respectively Vtπ(L)SV_{t}^{\pi}(L)\in\mathbb{R}^{S}, [Vtπ(L)]s[V_{t}^{\pi}(L)]_{s}, as well as Vμ0π(L)=s𝒮μ0(s)[V0π(L)]sV_{\mu_{0}}^{\pi}(L)=\sum_{s\in\mathcal{S}}\mu_{0}(s)[V_{0}^{\pi}(L)]_{s}, under a given policy π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} for (L)\mathcal{M}(L).

To analyze such a mean-field game, the most widely adopted solution concept is the Nash equilibrium. A policy sequence {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} and a mean-field flow {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}} constitute a Nash equilibrium solution of this finite-time horizon mean-field game, if the following conditions are satisfied.

Definition 2.1 (Nash equilibrium solution).
  • 1)

    (Optimality/Best Response) Fixing {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}}, {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} solves the optimization problem (1), i.e., {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} is optimal for the representative agent given the mean-field flow {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}};

  • 2)

    (Consistency) Fixing {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}}, the consistency of the mean-field flow holds, namely

    Lt=st,at, where st+1Pt(|st,at,Lt),atπt(|st),s0μ0,t𝒯\{T}.L_{t}=\mathbb{P}_{s_{t},a_{t}},\text{ where }s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t}),\,a_{t}\sim\pi_{t}(\cdot|s_{t}),\,s_{0}\sim\mu_{0},\,t\in\mathcal{T}\backslash\{T\}. (2)

    Here x\mathbb{P}_{x} denotes the probability distribution of a random variable/vector xx.

Note that (1) requires that the policy sequence $\{\pi_{t}\}_{t\in\mathcal{T}}$ is the best response to the flow $\{L_{t}\}_{t\in\mathcal{T}}$, while (2) requires that the flow $\{L_{t}\}_{t\in\mathcal{T}}$ is the corresponding mean-field flow induced when all players adopt the policy sequence $\{\pi_{t}\}_{t\in\mathcal{T}}$. (Condition 2 is also known as the Fokker-Planck equation in the continuous-time mean-field game literature.) Also note that (2) can be written more explicitly as follows:

L0(s,a)=μ0(s)π0(a|s),Lt+1(s,a)=πt+1(a|s)s𝒮a𝒜Lt(s,a)Pt(s|s,a,Lt),t=0,,T1.\begin{split}L_{0}(s,a)&=\mu_{0}(s)\pi_{0}(a|s),\\ L_{t+1}(s^{\prime},a^{\prime})&=\pi_{t+1}(a^{\prime}|s^{\prime})\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}L_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t}),\quad\forall t=0,\dots,T-1.\end{split} (3)
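For concreteness, the forward recursion (3) can be implemented directly. The following sketch (in Python/NumPy) assumes the transition kernel is available as a callable P(t, L_t) returning an $S\times A\times S$ array and the policy as a $(T+1)\times S\times A$ array; these interfaces are illustrative assumptions, not part of the model above.

```python
import numpy as np

def induced_mean_field_flow(mu0, pi, P, T):
    """Forward recursion (3): compute L_0, ..., L_T.

    mu0: (S,) initial state distribution.
    pi:  (T+1, S, A) array with pi[t, s, a] = pi_t(a|s).
    P:   callable, P(t, L_t) -> (S, A, S) array with entries P_t(s'|s, a, L_t).
    Returns L of shape (T+1, S, A).
    """
    L = np.zeros((T + 1,) + pi.shape[1:])
    L[0] = mu0[:, None] * pi[0]                            # L_0(s,a) = mu_0(s) pi_0(a|s)
    for t in range(T):
        mu_next = np.einsum('sa,sap->p', L[t], P(t, L[t])) # next-step state marginal
        L[t + 1] = mu_next[:, None] * pi[t + 1]            # multiply by pi_{t+1}(a'|s')
    return L
```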

The following existence result for Nash equilibrium solutions holds as long as the transitions and rewards are continuous in LtL_{t}. The proof is based on the Kakutani fixed-point theorem; it is almost identical to those in [62, 20], except for replacing the state mean-field flow with the state-action joint mean-field flow.

Proposition 1.

Suppose that Pt(s|s,a,Lt)P_{t}(s^{\prime}|s,a,L_{t}) and rt(s,a,Lt)r_{t}(s,a,L_{t}) are both continuous in LtL_{t} for any s,s𝒮s,s^{\prime}\in\mathcal{S}, a𝒜a\in\mathcal{A} and t𝒯t\in\mathcal{T}. Then a Nash equilibrium solution exists.

Having established the existence of a Nash equilibrium solution for the mean-field game ((1) and (2)), the rest of the paper focuses on proposing a new analytical approach to find Nash equilibrium solution(s) of the mean-field game. This is motivated by the well-known fact that most mean-field games have multiple Nash equilibrium solutions and are neither contractive nor (weakly) monotone, as the following simple example illustrates.

Example of multiple Nash equilibrium solutions.

Take 𝒮=𝒜={1,,n}\mathcal{S}=\mathcal{A}=\{1,\dots,n\} and an arbitrary initial distribution μ0\mu_{0}. At time 0, for any L0Δ(𝒮×𝒜)L_{0}\in\Delta(\mathcal{S}\times\mathcal{A}) and any i,j=1,,ni,j=1,\dots,n, the rewards r0(i,j,L0)=0r_{0}(i,j,L_{0})=0, and the transitions are deterministic such that P0(i|i,j,L0)=𝟏{i=j}P_{0}(i^{\prime}|i,j,L_{0})=\mathbf{1}_{\{i^{\prime}=j\}}. When t>0t>0, the rewards and the transition probabilities may depend on LtL_{t}. Specifically, if the population congregates at the same state ii and chooses to stay, i.e., Lt(s,a)=𝟏{s=i,a=i}L_{t}(s,a)=\mathbf{1}_{\{s=i,a=i\}}, then the rewards rt(i,j,Lt)=ri𝟏{i=j}r_{t}(i,j,L_{t})=r^{i}\mathbf{1}_{\{i=j\}} for some ri>0r^{i}>0, and the transition probabilities will be deterministic Pt(i|i,j,Lt)=𝟏{i=j}P_{t}(i^{\prime}|i,j,L_{t})=\mathbf{1}_{\{i^{\prime}=j\}}. If LtL_{t} deviates from state ii and action ii, the rewards and transition probabilities will be adjusted as follows:

  • $r_{t}(i,j,L_{t})=\mathbf{1}_{\{i=j\}}(r^{i}-r^{i}\|L_{t}-\mathbf{1}_{\{(s,a)\,|\,s=i,a=i\}}\|_{2}^{2}/2)$, for any $i,j=1,\dots,n$ and $t=1,\dots,T$. (Here, for a totally ordered finite set $\mathcal{X}$, we use $\mathbf{1}_{\mathcal{X}}$ to denote the binary vector in $\mathbb{R}^{|\mathcal{X}|}$ which has entry value $1$ for the indices $x\in\mathcal{X}$ and entry value $0$ otherwise.)

  • Pt(i|i,j,Lt)=𝟏{i=j}+CtLt𝟏{(s,a)|s=i,a=i}221+nCtLt𝟏{(s,a)|s=i,a=i}22P_{t}(i^{\prime}|i,j,L_{t})=\dfrac{\mathbf{1}_{\{i^{\prime}=j\}}+C_{t}\|L_{t}-\mathbf{1}_{\{(s,a)|s=i,a=i\}}\|_{2}^{2}}{1+nC_{t}\|L_{t}-\mathbf{1}_{\{(s,a)|s=i,a=i\}}\|_{2}^{2}} for any i,i,j=1,,ni^{\prime},i,j=1,\dots,n and t=1,,Tt=1,\dots,T.

Here CtC_{t} (t=1,,T)(t=1,\dots,T) are some non-negative constants.

It is easy to verify that for any jargmaxi=1,,nrij^{\star}\in\text{argmax}_{i=1,\dots,n}r^{i}, if we define

  • πt(a|s)=𝟏{a=j}\pi_{t}(a|s)=\mathbf{1}_{\{a=j^{\star}\}} for any s,a=1,,ns,a=1,\dots,n and t=0,,Tt=0,\dots,T;

  • L0(s,a)=μ0(s)π0(a|s)L_{0}(s,a)=\mu_{0}(s)\pi_{0}(a|s) and Lt(s,a)=𝟏{s=j,a=j}L_{t}(s,a)=\mathbf{1}_{\{s=j^{\star},a=j^{\star}\}}, for any s,a=1,,ns,a=1,\dots,n and t=1,,Tt=1,\dots,T,

then $\pi=\{\pi_{t}\}_{t\in\mathcal{T}}$ and $L=\{L_{t}\}_{t\in\mathcal{T}}$ constitute a Nash equilibrium solution. Thus, whenever $\text{argmax}_{i=1,\dots,n}r^{i}$ contains more than one index (e.g., when several $r^{i}$ attain the maximum), there are several distinct Nash equilibrium solutions with both $\pi$ and $L$ being different, indexed by $j^{\star}\in\text{argmax}_{i=1,\dots,n}r^{i}$ and denoted by $\pi^{(j^{\star})}$ and $L^{(j^{\star})}$. As contractivity and strict monotonicity imply the uniqueness of the Nash equilibrium solution, the mean-field game above is then neither contractive nor strictly monotone. To see that it is not weakly monotone either, consider any two mean-field flows $L_{t}^{(1)}=L_{t}^{(j_{1}^{\star})}$ and $L_{t}^{(2)}=L_{t}^{(j_{2}^{\star})}$ with $j_{1}^{\star}\neq j_{2}^{\star}\in\text{argmax}_{i=1,\dots,n}r^{i}$. Then

\begin{split}&\sum_{t=0}^{T}\sum_{s=1}^{n}\sum_{a=1}^{n}(L_{t}^{(1)}(s,a)-L_{t}^{(2)}(s,a))(r_{t}(s,a,L_{t}^{(1)})-r_{t}(s,a,L_{t}^{(2)}))\\ &=T(r^{j_{1}^{\star}}+r^{j_{2}^{\star}})=2T\max\nolimits_{i=1,\dots,n}r^{i}>0,\end{split}

violating the requirement of weak monotonicity.

We now propose a new analytical framework that enables finding multiple Nash equilibrium solutions, without the restrictive uniqueness assumption on the Nash equilibrium solution or the contractivity [34, 20] and monotonicity [59, 58] conditions.

3 MF-OMO: Optimization formulation of mean-field games

To overcome the restrictive assumptions of contractivity and monotonicity for analyzing mean-field games, we establish a new optimization framework for mean-field games via two steps. The first is an optimization reformulation of mean-field games with potentially unbounded variables and nonconvex constraints, which relies on the equivalence between an MDP problem and a linear program over occupation measures. The second is to utilize a solution modification technique to obtain the final form of the optimization formulation with bounded variables and simple convex constraints. We call this framework Mean-Field Occupation Measure Optimization (MF-OMO).

Occupation measure.

To start, let us introduce a new variable dt(s,a)d_{t}(s,a) for any t𝒯t\in\mathcal{T}, s𝒮,a𝒜s\in\mathcal{S},a\in\mathcal{A}, which represents the occupation measure of the representative agent under some policy sequence π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} in (L)\mathcal{M}(L), the mean-field induced MDP, i.e., dt(s,a)=(st=s,at=a)d_{t}(s,a)=\mathbb{P}(s_{t}=s,a_{t}=a), with s0μ0s_{0}\sim\mu_{0}, st+1Pt(|st,at,Lt)s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t}), atπt(|st)a_{t}\sim\pi_{t}(\cdot|s_{t}) for t=0,,T1t=0,\dots,T-1. Given the occupation measure d={dt}t𝒯d=\{d_{t}\}_{t\in\mathcal{T}}, define a mapping Π\Pi that retrieves the policy from the occupation measure. This set-valued mapping Π\Pi maps from a sequence {dt}t𝒯Δ(𝒮×𝒜)\{d_{t}\}_{t\in\mathcal{T}}\subseteq\Delta(\mathcal{S}\times\mathcal{A}) to a set of policy sequences {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}}: for any {dt}t𝒯Δ(𝒮×𝒜)\{d_{t}\}_{t\in\mathcal{T}}\subseteq\Delta(\mathcal{S}\times\mathcal{A}), πΠ(d)\pi\in\Pi(d) if and only if

πt(a|s)=dt(s,a)a𝒜dt(s,a)\pi_{t}(a|s)=\frac{d_{t}(s,a)}{\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s,a^{\prime})} (4)

when a𝒜dt(s,a)>0\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s,a^{\prime})>0, and πt(|s)\pi_{t}(\cdot|s) is an arbitrary probability vector in Δ(𝒜)\Delta(\mathcal{A}) when a𝒜dt(s,a)=0\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s,a^{\prime})=0.
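As a minimal illustration, the retrieval map $\Pi$ in (4) can be implemented as follows, assuming the occupation measure is stored as a $(T+1)\times S\times A$ NumPy array; at states with zero visitation mass the sketch assigns the uniform policy, which is one arbitrary choice among the allowed ones.

```python
import numpy as np

def retrieve_policy(d):
    """One element of Pi(d): pi_t(a|s) = d_t(s,a) / sum_a' d_t(s,a'),
    with a uniform policy at states of zero visitation mass."""
    T1, S, A = d.shape
    pi = np.full((T1, S, A), 1.0 / A)           # default: uniform (arbitrary but valid)
    mass = d.sum(axis=2, keepdims=True)         # (T+1, S, 1) state visitation mass
    visited = mass.squeeze(-1) > 0              # boolean (T+1, S)
    pi[visited] = (d / np.where(mass > 0, mass, 1.0))[visited]
    return pi
```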

Linear program formulation of MDPs.

Next, consider any finite-horizon MDP $\mathcal{M}$ (e.g., the mean-field induced MDP $\mathcal{M}(L)$) with a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, an initial state distribution $\mu_{0}$, a reward function $r_{t}(s,a)$, and dynamics $P_{t}(s^{\prime}|s,a)$ for $t=0,\dots,T$, $s,s^{\prime}\in\mathcal{S}$ and $a\in\mathcal{A}$. For brevity, we write $\mathbb{P}^{\pi}(\cdot)$ for the state and/or action distribution generated by $s_{0}\sim\mu_{0}$, $s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t})$, $a_{t}\sim\pi_{t}(\cdot|s_{t})$ for $t=0,\dots,T-1$.

Then, our first component of MF-OMO relies on a linear program formulation of an MDP problem, as follows.

Lemma 2.

Suppose that π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} is an ϵ\epsilon-suboptimal policy for the MDP \mathcal{M}. Define μt(s):=π(st=s)\mu_{t}(s):=\mathbb{P}^{\pi}(s_{t}=s) and dt(s,a):=μt(s)πt(a|s)d_{t}(s,a):=\mu_{t}(s)\pi_{t}(a|s) for s𝒮s\in\mathcal{S} and a𝒜a\in\mathcal{A}. Then πΠ(d)\pi\in\Pi(d) and d={dt}t𝒯d=\{d_{t}\}_{t\in\mathcal{T}} is a feasible ϵ\epsilon-suboptimal solution to the following linear program:

maximized\displaystyle\text{maximize}_{d}\quad t𝒯s𝒮a𝒜dt(s,a)rt(s,a)\displaystyle\sum_{t\in\mathcal{T}}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{t}(s,a)r_{t}(s,a)
subject to s𝒮a𝒜dt(s,a)Pt(s|s,a)=a𝒜dt+1(s,a),s𝒮,t𝒯\{T},\displaystyle\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{t}(s,a)P_{t}(s^{\prime}|s,a)=\sum_{a\in\mathcal{A}}d_{t+1}(s^{\prime},a),\quad\forall s^{\prime}\in\mathcal{S},t\in\mathcal{T}\backslash\{T\},
a𝒜d0(s,a)=μ0(s),s𝒮,\displaystyle\sum_{a\in\mathcal{A}}d_{0}(s,a)=\mu_{0}(s),\quad\forall s\in\mathcal{S},
dt(s,a)0,s𝒮,a𝒜,t𝒯.\displaystyle d_{t}(s,a)\geq 0,\quad\forall s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}.

Conversely, suppose that d={dt}t𝒯d=\{d_{t}\}_{t\in\mathcal{T}} is a feasible ϵ\epsilon-suboptimal solution to the above linear program. Then for any πΠ(d)\pi\in\Pi(d) defined by (4), π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} is an ϵ\epsilon-suboptimal policy for the MDP \mathcal{M}. In addition, the value of the above linear program is equal to the value of the MDP \mathcal{M}.

The proof of Lemma 2 is a suitable modification of its well-known counterparts in the discounted and average-reward settings [21, 40, 70].

A preliminary version of MF-OMO.

We now show that π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} is a Nash equilibrium solution if and only if there exist d={dt(s,a)}t𝒯d^{\star}=\{d_{t}^{\star}(s,a)\}_{t\in\mathcal{T}} and L={Lt(s,a)}t𝒯L=\{L_{t}(s,a)\}_{t\in\mathcal{T}}, such that πΠ(d)\pi\in\Pi(d^{\star}) and the following two conditions hold.

  • (A)

    Optimality of the representative agent: d={dt(s,a)}t𝒯d^{\star}=\{d_{t}^{\star}(s,a)\}_{t\in\mathcal{T}} solves the following LP problem

    maximized\displaystyle\text{maximize}_{d}\quad t𝒯s𝒮a𝒜dt(s,a)rt(s,a,Lt)\displaystyle\sum_{t\in\mathcal{T}}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{t}(s,a)r_{t}(s,a,L_{t}) (5)
    subject to s𝒮a𝒜dt(s,a)Pt(s|s,a,Lt)=a𝒜dt+1(s,a),\displaystyle\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t})=\sum_{a\in\mathcal{A}}d_{t+1}(s^{\prime},a), (6)
    s𝒮,t𝒯\{T},\displaystyle\hskip 119.50148pt\forall s^{\prime}\in\mathcal{S},t\in\mathcal{T}\backslash\{T\},
    a𝒜d0(s,a)=μ0(s),s𝒮,\displaystyle\sum_{a\in\mathcal{A}}d_{0}(s,a)=\mu_{0}(s),\quad\forall s\in\mathcal{S},
    dt(s,a)0,s𝒮,a𝒜,t𝒯.\displaystyle d_{t}(s,a)\geq 0,\quad\forall s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}.
  • (B)

    Consistency between the agent and the mean-field: dt(s,a)=Lt(s,a)d_{t}^{\star}(s,a)=L_{t}(s,a) for any s𝒮,a𝒜,t𝒯s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}.

Here condition (A) corresponds to the optimality condition 1) in Definition 2.1 according to Lemma 2. Condition (B) indicates that the occupation measure of the single agent matches the mean-field flow of the population, so that the agent is indeed representative. Condition (B), when combined with condition (A), in fact implies the consistency condition 2) in Definition 2.1. Formally, we have the following lemma.

Lemma 3.

If $(\pi,L)$ is a Nash equilibrium solution of the mean-field game ((1) and (2)), then there exists $d^{\star}$ with $d^{\star}=L$ such that $\pi\in\Pi(d^{\star})$ and $(d^{\star},L)$ satisfies conditions (A) and (B). On the other hand, if $(d^{\star},L)$ satisfies conditions (A) and (B), then for any $\pi\in\Pi(L)$, $(\pi,L)$ is a Nash equilibrium solution of the mean-field game ((1) and (2)).

To get the preliminary version of MF-OMO, notice that the linear program in condition (A) can be rewritten as

minimizedcLdsubject toALd=b,d0,\begin{array}[]{ll}\text{minimize}_{d}&c_{L}^{\top}d\\ \text{subject to}&A_{L}d=b,\quad d\geq 0,\end{array} (7)

where cLSA(T+1)c_{L}\in\mathbb{R}^{SA(T+1)} and bST+Sb\in\mathbb{R}^{ST+S} are

cL=[r0(,,L0)rT(,,LT)],b=[00μ0],c_{L}=\left[\begin{array}[]{c}-r_{0}(\cdot,\cdot,L_{0})\\ \vdots\\ -r_{T}(\cdot,\cdot,L_{T})\end{array}\right],\quad b=\left[\begin{array}[]{c}0\\ \vdots\\ 0\\ \mu_{0}\end{array}\right], (8)

rt(,,Lt)SAr_{t}(\cdot,\cdot,L_{t})\in\mathbb{R}^{SA} is a flattened vector (with column-major order), and the matrix ALS(T+1)×SA(T+1)A_{L}\in\mathbb{R}^{S(T+1)\times SA(T+1)} is

AL=[W0(L0)Z00000W1(L1)Z00000W2(L2)Z000000WT1(LT1)ZZ00000].A_{L}=\left[\begin{array}[]{ccccccc}W_{0}(L_{0})&-Z&0&0&\cdots&0&0\\ 0&W_{1}(L_{1})&-Z&0&\cdots&0&0\\ 0&0&W_{2}(L_{2})&-Z&\cdots&0&0\\ \vdots&\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\ 0&0&0&0&\cdots&W_{T-1}(L_{T-1})&-Z\\ Z&0&0&0&\cdots&0&0\end{array}\right]. (9)

Here Wt(Lt)S×SAW_{t}(L_{t})\in\mathbb{R}^{S\times SA} is the matrix with the ll-th row (l=1,,Sl=1,\dots,S) being the flattened vector [Pt(l|,,Lt)]SA[P_{t}(l|\cdot,\cdot,L_{t})]\in\mathbb{R}^{SA} (with column-major order), and the matrix ZZ is defined as

Z:=[\overbrace{I_{S},\dots,I_{S}}^{A}]\in\mathbb{R}^{S\times SA}, (10)

where $I_{S}$ is the identity matrix of dimension $S$. In addition, the variable $d$ is also flattened/vectorized in the same order as $c_{L}$. Accordingly, hereafter both $d^{\star}=\{d^{\star}_{t}\}_{t\in\mathcal{T}}$ and $L=\{L_{t}\}_{t\in\mathcal{T}}$ are viewed either as flattened vectors (with column-major order) in $\mathbb{R}^{SA(T+1)}$ or as sequences of $(T+1)$ flattened vectors (with column-major order) in $\mathbb{R}^{SA}$, depending on the context, and we use $L_{t}(s,a)$ and $L_{s,a,t}$ (resp. $d_{t}(s,a)$ and $d_{s,a,t}$) interchangeably.
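To make the block structure in (8)-(10) concrete, the following sketch assembles $A_L$, $b$ and $c_L$ as dense NumPy arrays for a fixed flow $L$, assuming the rewards and transitions evaluated at $L$ are supplied as arrays r[t, s, a] and P[t, s, a, s'] (illustrative inputs); all flattenings use column-major (Fortran) order, matching the convention above.

```python
import numpy as np

def build_mfomo_data(r, P, mu0):
    """Assemble A_L in R^{S(T+1) x SA(T+1)}, b in R^{S(T+1)}, c_L in R^{SA(T+1)}, cf. (8)-(10).

    r:   (T+1, S, A) rewards r_t(s, a, L_t) evaluated at the given flow L.
    P:   (T+1, S, A, S) transitions P_t(s'|s, a, L_t) evaluated at L.
    mu0: (S,) initial state distribution.
    """
    T1, S, A = r.shape
    Z = np.tile(np.eye(S), (1, A))                     # Z = [I_S, ..., I_S] in R^{S x SA}
    def W(t):                                          # W_t(L_t): row s' holds P_t(s'|.,.,L_t) flattened
        return P[t].reshape(S * A, S, order='F').T     # S x SA, column-major over (s, a)
    A_L = np.zeros((S * T1, S * A * T1))
    for t in range(T1 - 1):
        A_L[t * S:(t + 1) * S, t * S * A:(t + 1) * S * A] = W(t)        # W_t(L_t) block
        A_L[t * S:(t + 1) * S, (t + 1) * S * A:(t + 2) * S * A] = -Z    # -Z block
    A_L[(T1 - 1) * S:, :S * A] = Z                     # last block row: [Z, 0, ..., 0]
    b = np.concatenate([np.zeros(S * (T1 - 1)), mu0])
    c_L = np.concatenate([-r[t].reshape(-1, order='F') for t in range(T1)])
    return A_L, b, c_L
```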

By the strong duality of linear programming [50], $\{d_{t}^{\star}(s,a)\}$ is an optimal solution to the above linear program (7) if and only if the following KKT conditions hold for $d^{\star}$ and some $y$ and $z$:

ALd=b,ALy+z=cL,zd=0,d0,z0.\begin{split}&A_{L}d^{\star}=b,\quad A_{L}^{\top}y+z=c_{L},\\ &z^{\top}d^{\star}=0,\quad d^{\star}\geq 0,\quad z\geq 0.\end{split} (11)

Now, combining condition (A) (in the form of equation (11)) with condition (B), we have the following preliminary version of MF-OMO.

Theorem 4.

Finding a Nash equilibrium solution of the mean-field game ((1) and (2)) is equivalent to solving the following constrained optimization problem:

minimizey,z,L0subject toALL=b,ALy+z=cL,zL=0,L0,z0,\begin{array}[]{ll}\text{\rm minimize}_{y,z,L}&0\\ \text{\rm subject to}&A_{L}L=b,\quad A_{L}^{\top}y+z=c_{L},\\ &z^{\top}L=0,\quad L\geq 0,\quad z\geq 0,\end{array} (12)

where AL,cL,bA_{L},c_{L},b are set as (8) and (9). Specifically, if (π,L)(\pi,L) is a Nash equilibrium solution of the mean-field game ((1) and (2)), then there exist some y,zy,z such that (y,z,L)(y,z,L) solves the constrained optimization problem (12). On the other hand, if (y,z,L)(y,z,L) solves the constrained optimization problem (12), then for any πΠ(L)\pi\in\Pi(L), (π,L)(\pi,L) is a Nash equilibrium solution of the mean-field game ((1) and (2)).

In the above derivation, we have used LL to represent the population mean-field flows and dd to denote the occupation measure of the representative agent. The optimal occupation measure (i.e., the solution to the linear program) associated with the representative agent is denoted by dd^{\star}.

MF-OMO.

The above constrained optimization problem (12) in Theorem 4 can be rewritten in a more computationally efficient form by adding bound constraints on the auxiliary variables $y$ and $z$ and moving the potentially nonconvex constraints into the objective. Note that the resulting nonconvex objective is in general much easier to handle than nonconvex constraints (cf. (12)). In addition, the boundedness of all variables is essential to the later analysis of the optimization framework in Sections 4 and 5.

Theorem 5.

Finding a Nash equilibrium solution of the mean-field game ((1) and (2)) is equivalent to solving the following optimization problem with bounded variables and simple convex constraints:

\begin{array}[]{ll}\text{minimize}_{y,z,L}&\|A_{L}L-b\|_{2}^{2}+\|A_{L}^{\top}y+z-c_{L}\|_{2}^{2}+z^{\top}L\\ \text{subject to}&L\geq 0,\quad\mathbf{1}^{\top}L_{t}=1,\,t\in\mathcal{T},\\ &\mathbf{1}^{\top}z\leq SA(T^{2}+T+2)r_{\max},\quad z\geq 0,\\ &\|y\|_{2}\leq\frac{S(T+1)(T+2)}{2}r_{\max},\end{array} (MF-OMO)

where AL,cL,bA_{L},c_{L},b are set as (8) and (9), and 𝟏{\bf 1} is the all-one vector (with appropriate dimensions). Specifically, if (π,L)(\pi,L) is a Nash equilibrium solution of the mean-field game ((1) and (2)), then there exist some y,zy,z such that (y,z,L)(y,z,L) solves (MF-OMO). On the other hand, if (y,z,L)(y,z,L) solves (MF-OMO) with the value of the objective function being 0, then for any πΠ(L)\pi\in\Pi(L), (π,L)(\pi,L) is a Nash equilibrium solution of the mean-field game ((1) and (2)).

Theorem 5 is immediate from the following two results, which show how the bounds on the auxiliary variables $y$ and $z$ in (MF-OMO) are obtained. Note that only $L$ is needed to construct the Nash equilibrium solution, hence there is no loss in adding the additional bounds on $y$ and $z$ given by Corollary 7.

The interpretation of $y$ and $z$ is given by the following proposition.

Proposition 6 (Solution modification/selection).

Suppose that (y,z,L)(y,z,L) is a solution to (12). Define y^\hat{y} as

y^=[V1(L),V2(L),,VT(L),V0(L)]S(T+1),\hat{y}=[V_{1}^{\star}(L),V_{2}^{\star}(L),\dots,V_{T}^{\star}(L),-V_{0}^{\star}(L)]\in\mathbb{R}^{S(T+1)},

where $V_{t}^{\star}(L)\in\mathbb{R}^{S}$ is the vector of value functions starting from step $t$ for the induced MDP $\mathcal{M}(L)$ (cf. Section 2). In addition, define $\hat{z}=[\hat{z}_{0},\hat{z}_{1},\dots,\hat{z}_{T}]\in\mathbb{R}^{SA(T+1)}$, where $\hat{z}_{t}$ is the Bellman residual/gap (also known as the (optimal) advantage function in the reinforcement learning literature), i.e., $\hat{z}_{t}=[\hat{z}_{t}^{1},\dots,\hat{z}_{t}^{A}]$ ($t=0,\dots,T$) with

z^ta=Vt(L)rt(,a,Lt)Pta(Lt)Vt+1(L)S,t=0,,T1,a𝒜\hat{z}_{t}^{a}=V_{t}^{\star}(L)-r_{t}(\cdot,a,L_{t})-P_{t}^{a}(L_{t})V_{t+1}^{\star}(L)\in\mathbb{R}^{S},\quad t=0,\dots,T-1,\,a\in\mathcal{A}

and z^Ta=VT(L)rT(,a,LT)S\hat{z}_{T}^{a}=V_{T}^{\star}(L)-r_{T}(\cdot,a,L_{T})\in\mathbb{R}^{S} (a𝒜a\in\mathcal{A}), where Pta(Lt)S×SP_{t}^{a}(L_{t})\in\mathbb{R}^{S\times S} is defined by [Pta(Lt)]s,s=Pt(s|s,a,Lt)[P_{t}^{a}(L_{t})]_{s,s^{\prime}}=P_{t}(s^{\prime}|s,a,L_{t}). Then (y^,z^,L)(\hat{y},\hat{z},L) is also a solution to (12).
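A sketch of the construction in Proposition 6: one backward induction computes $V_t^{\star}(L)$ and the nonnegative Bellman residuals $\hat{z}_t$, from which $\hat{y}$ is obtained by stacking $[V_1^{\star},\dots,V_T^{\star},-V_0^{\star}]$. As before, the rewards and transitions evaluated at the given flow are assumed to be available as arrays r[t, s, a] and P[t, s, a, s'].

```python
import numpy as np

def value_and_residuals(r, P):
    """Backward induction for V_t^*(L) and the residuals z_hat_t of Proposition 6.

    r: (T+1, S, A) rewards r_t(s, a, L_t);  P: (T+1, S, A, S) transitions P_t(s'|s, a, L_t).
    Returns V of shape (T+1, S) and z_hat of shape (T+1, S, A) (entrywise >= 0).
    """
    T1, S, A = r.shape
    V = np.zeros((T1, S))
    z_hat = np.zeros((T1, S, A))
    Q = r[T1 - 1]                                       # Q_T(s, a) = r_T(s, a, L_T)
    V[T1 - 1] = Q.max(axis=1)
    z_hat[T1 - 1] = V[T1 - 1][:, None] - Q              # z_hat_T^a = V_T^* - r_T(., a, L_T)
    for t in range(T1 - 2, -1, -1):
        Q = r[t] + np.einsum('sap,p->sa', P[t], V[t + 1])   # r_t + P_t^a V_{t+1}^*
        V[t] = Q.max(axis=1)
        z_hat[t] = V[t][:, None] - Q
    return V, z_hat
```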

The following corollary follows immediately from the definitions of $\hat{y}$ and $\hat{z}$ in Proposition 6.

Corollary 7.

Suppose that (y,z,L)(y,z,L) is a solution to (12). Then there exists a modified solution (y^,z^,L)(\hat{y},\hat{z},L), such that the following bounds hold:

y^2y^1S(T+1)(T+2)2rmax,z^2z^1SA(T2+T+2)rmax.\|\hat{y}\|_{2}\leq\|\hat{y}\|_{1}\leq\frac{S(T+1)(T+2)}{2}r_{\max},\quad\|\hat{z}\|_{2}\leq\|\hat{z}\|_{1}\leq SA(T^{2}+T+2)r_{\max}.

The key is to notice that |[Vt(L)]s|(Tt+1)rmax|[V_{t}^{\star}(L)]_{s}|\leq(T-t+1)r_{\max}, |z^ta(s)|(Tt+1)rmax+rmax+(Tt)rmax|\hat{z}_{t}^{a}(s)|\leq(T-t+1)r_{\max}+r_{\max}+(T-t)r_{\max} and |z^Ta(s)|2rmax|\hat{z}_{T}^{a}(s)|\leq 2r_{\max}, for any s𝒮,a𝒜,t=0,,T1s\in\mathcal{S},\,a\in\mathcal{A},\,t=0,\dots,T-1.

Remark 1.

Note that the constraint $\mathbf{1}^{\top}L_{t}=1$, $t\in\mathcal{T}$ is added to ensure the well-definedness of $P_{t}$ and $r_{t}$ for non-optimal but feasible variables, as otherwise $P_{t}(s,a,\cdot)$ and $r_{t}(s,a,\cdot)$ are undefined outside of the simplex $\Delta(\mathcal{S}\times\mathcal{A})$. The choice of norms on $y$ and $z$ is to facilitate projection evaluation and reparametrization, where the first (projection evaluation) will be clear in Section 5 and the second (reparametrization) is explained in Appendix A.3. In particular, the choice of the $\ell_{1}$-norm for $z$ is to obtain a simplex-like specialized polyhedral constraint paired with $z\geq 0$, and the $\ell_{2}$-norm for $y$ is to obtain closed-form projections (by normalization) and simple reparametrizations. Nevertheless, all our subsequent results in Sections 4 and 5 hold (up to changes of constants) for other choices, including the $\ell_{2}$-norm for $z$ and the $\ell_{1}$-norm for $y$, as long as either efficient projections or reparametrizations are feasible for $y$ and $z$.

Remark 2.

Note that whenever a Nash equilibrium solution exists for the mean-field game ((1) and (2)), the optimal value of the objective function of (MF-OMO) is obviously 0, and vice versa. To see more explicitly the relationship between (MF-OMO) and Nash equilibrium solution(s) of mean-field game ((1) and (2)), denote the feasible set of (MF-OMO) as Θ\Theta, i.e.,

\begin{split}\Theta:=\{(y,z,L)\,|\,&L\geq 0,\,\mathbf{1}^{\top}L_{t}=1,\,t\in\mathcal{T},\\ &\|y\|_{2}\leq S(T+1)(T+2)r_{\max}/2,\\ &\mathbf{1}^{\top}z\leq SA(T^{2}+T+2)r_{\max},\,z\geq 0\}.\end{split} (13)

Then (MF-OMO) can be rewritten as minimizing the following objective function over (y,z,L)Θ(y,z,L)\in\Theta:

fMF-OMO(y,z,L)=s𝒮(a𝒜L0(s,a)μ0(s))2+s𝒮t=0T1(a𝒜Lt+1(s,a)s𝒮,a𝒜Lt(s,a)Pt(s|s,a,Lt))2+s𝒮,a𝒜(yT1(s)rT(s,a,LT)zT(s,a))2+s𝒮,a𝒜t=0T2(yt(s)rt+1(s,a,Lt+1)s𝒮Pt+1(s|s,a,Lt+1)yt+1(s)zt+1(s,a))2+s𝒮,a𝒜(yT(s)+r0(s,a,L0)+s𝒮P0(s|s,a,L0)y0(s)+z0(s,a))2+s𝒮,a𝒜,t𝒯zt(s,a)Lt(s,a).\begin{array}[]{ll}f^{\text{\rm MF-OMO}}(y,z,L)=&\sum\limits_{s\in\mathcal{S}}\left(\sum\limits_{a\in\mathcal{A}}L_{0}(s,a)-\mu_{0}(s)\right)^{2}\\ &+\sum\limits_{s^{\prime}\in\mathcal{S}}\sum\limits_{t=0}^{T-1}\left(\sum\limits_{a\in\mathcal{A}}L_{t+1}(s^{\prime},a)-\sum\limits_{s\in\mathcal{S},a\in\mathcal{A}}L_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t})\right)^{2}\\ &+\sum\limits_{s\in\mathcal{S},a\in\mathcal{A}}\left(y_{T-1}(s)-r_{T}(s,a,L_{T})-z_{T}(s,a)\right)^{2}\\ &+\sum\limits_{s\in\mathcal{S},a\in\mathcal{A}}\sum\limits_{t=0}^{T-2}\left(y_{t}(s)-r_{t+1}(s,a,L_{t+1})\right.\\ &\qquad\qquad\qquad\left.-\sum_{s^{\prime}\in\mathcal{S}}P_{t+1}(s^{\prime}|s,a,L_{t+1})y_{t+1}(s^{\prime})-z_{t+1}(s,a)\right)^{2}\\ &+\sum\limits_{s\in\mathcal{S},a\in\mathcal{A}}\left(y_{T}(s)+r_{0}(s,a,L_{0})+\sum\limits_{s^{\prime}\in\mathcal{S}}P_{0}(s^{\prime}|s,a,L_{0})y_{0}(s^{\prime})+z_{0}(s,a)\right)^{2}\\ &+\sum\limits_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}z_{t}(s,a)L_{t}(s,a).\end{array} (14)

Here the first two rows expand the term ALLb22\|A_{L}L-b\|_{2}^{2}, corresponding to the consistency condition 2) in Definition 2.1. The next four rows expand the term ALy+zcL22\|A_{L}^{\top}y+z-c_{L}\|_{2}^{2}, corresponding to the Bellman residual by the interpretations of yy and zz in Proposition 6 and hence the optimality condition 1) in Definition 2.1. The last row expands the complementarity term zLz^{\top}L, connecting the two conditions into a single optimization problem.
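In matrix form, the objective (14) is a few lines of linear algebra. The helper below is only a sketch; it takes $A_L$, $b$ and $c_L$ as already-assembled arrays (e.g., as in the earlier sketch of this section) together with the flattened variables $y$, $z$ and $L$.

```python
import numpy as np

def mfomo_objective(A_L, b, c_L, y, z, L):
    """f^{MF-OMO}(y, z, L) = ||A_L L - b||_2^2 + ||A_L^T y + z - c_L||_2^2 + z^T L,
    with y, z, L flattened consistently with A_L, b and c_L."""
    consistency = A_L @ L - b          # Fokker-Planck / consistency residual
    optimality = A_L.T @ y + z - c_L   # dual feasibility (Bellman-type) residual
    return consistency @ consistency + optimality @ optimality + z @ L
```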

This MF-OMO formulation is in sharp contrast to most existing approaches in the literature of discrete-time mean-field games such as [34, 59, 58], which alternate between a full policy optimization/evaluation given a fixed mean-field flow and a full mean-field flow forward propagation given a fixed policy sequence. Moreover, both the policy optimization/evaluation and the mean-field propagation lead to coupling over the full time horizon 𝒯\mathcal{T}. As a result, the convergence of these algorithms requires assumptions of contractivity and monotonicity. In contrast, (14) shows that MF-OMO is fully decoupled over the time horizon, which better facilitates parallel and distributed optimization algorithms and enables more efficient sub-samplings for stochastic optimization algorithms, especially for large time-horizon problems.

4 Connecting exploitability with MF-OMO suboptimality

Having established the equivalence between minimizers of (MF-OMO) and Nash equilibrium solutions of mean-field games, we now connect the exploitability of mean-field games with suboptimal solutions to (MF-OMO).

The concept of exploitability in game theory is used to characterize the difference between any policy and a Nash equilibrium solution. More precisely, define a mapping Γ\Gamma that maps any policy sequence {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} to {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}} when all players take such a policy sequence. Following the consistency condition (3), such Γ\Gamma can be defined recursively, starting with the initialization

Γ(π)0(s,a):=μ0(s)π0(a|s),\Gamma(\pi)_{0}(s,a):=\mu_{0}(s)\pi_{0}(a|s), (15)

such that

Γ(π)t+1(s,a):=πt+1(a|s)s𝒮a𝒜Γ(π)t(s,a)Pt(s|s,a,Γ(π)t),t=0,,T1.\Gamma(\pi)_{t+1}(s,a):=\pi_{t+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\Gamma(\pi)_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},\Gamma(\pi)_{t}),\quad\forall t=0,\dots,T-1. (16)

For any given policy sequence π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}}, let L={Lt}t𝒯=Γ(π)L=\{L_{t}\}_{t\in\mathcal{T}}=\Gamma(\pi) be the induced mean-field flow. Then exploitability characterizes the sub-optimality of the policy π\pi under LL as follows,

Expl(π):=Vμ0(Γ(π))Vμ0π(Γ(π))=maxπVμ0π(Γ(π))Vμ0π(Γ(π)).\text{Expl}(\pi):=V_{\mu_{0}}^{\star}(\Gamma(\pi))-V_{\mu_{0}}^{\pi}(\Gamma(\pi))=\max_{\pi^{\prime}}V_{\mu_{0}}^{\pi^{\prime}}(\Gamma(\pi))-V_{\mu_{0}}^{\pi}(\Gamma(\pi)). (17)

In particular, (π,L)(\pi,L) is a Nash equilibrium solution if and only if L=Γ(π)L=\Gamma(\pi) and Expl(π)=0\text{Expl}(\pi)=0; and a policy π\pi is an ϵ\epsilon-Nash equilibrium solution if Expl(π)ϵ\text{Expl}(\pi)\leq\epsilon.
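For completeness, exploitability can be evaluated with one backward pass for the optimal value and one for the policy value, given the induced flow $L=\Gamma(\pi)$ (computed, e.g., by the forward-recursion sketch in Section 2); this already entails the full policy optimization and evaluation discussed next. The interface below, with arrays r[t, s, a] and P[t, s, a, s'] evaluated at $L$ and policy pi[t, s, a], is again only an assumed one.

```python
import numpy as np

def exploitability(r, P, pi, mu0):
    """Expl(pi) = V*_{mu0}(Gamma(pi)) - V^pi_{mu0}(Gamma(pi)), cf. (17).

    r, P are already evaluated at the induced flow L = Gamma(pi):
    r: (T+1, S, A), P: (T+1, S, A, S), pi: (T+1, S, A), mu0: (S,).
    """
    T1 = r.shape[0]
    V_opt = r[T1 - 1].max(axis=1)                    # V_T^*
    V_pol = (pi[T1 - 1] * r[T1 - 1]).sum(axis=1)     # V_T^pi
    for t in range(T1 - 2, -1, -1):
        Q_opt = r[t] + np.einsum('sap,p->sa', P[t], V_opt)
        Q_pol = r[t] + np.einsum('sap,p->sa', P[t], V_pol)
        V_opt = Q_opt.max(axis=1)                    # best-response backward pass
        V_pol = (pi[t] * Q_pol).sum(axis=1)          # policy-evaluation backward pass
    return mu0 @ (V_opt - V_pol)
```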

However, computing and optimizing exploitability directly is not easy. First, Expl(π)\text{Expl}(\pi) is generally nonconvex and nonsmooth, even if rt(s,a,)r_{t}(s,a,\cdot) and Pt(s,a,)P_{t}(s,a,\cdot) are differentiable in LtL_{t}. Secondly, a single evaluation of Expl(π)\text{Expl}(\pi) requires a full policy optimization when calculating maxπVμ0π(Γ(π))\max_{\pi^{\prime}}V_{\mu_{0}}^{\pi^{\prime}}(\Gamma(\pi)) and a full policy evaluation when calculating Vμ0π(Γ(π))V_{\mu_{0}}^{\pi}(\Gamma(\pi)). In this section, we will show that in order to find an ϵ\epsilon-Nash equilibrium solution, it is sufficient to solve (MF-OMO) to O(ϵ2)O(\epsilon^{2}) sub-optimality.

We will need additional notation here before presenting the precise statement. Denote Pt(|,,Lt)S×S×AP_{t}(\cdot|\cdot,\cdot,L_{t})\in\mathbb{R}^{S\times S\times A} as a tensor of dimensions S×S×AS\times S\times A. For a sequence of T+1T+1 tensors p={pt(|,)}t𝒯S×S×Ap=\{p_{t}(\cdot|\cdot,\cdot)\}_{t\in\mathcal{T}}\in\mathbb{R}^{S\times S\times A} with dimensions S×S×AS\times S\times A (e.g., pt(s|s,a)p_{t}(s^{\prime}|s,a) can be Pt(s|s,a,Lt)P_{t}(s^{\prime}|s,a,L_{t})) and a sequence of T+1T+1 vectors x={xt(,)}t𝒯SAx=\{x_{t}(\cdot,\cdot)\}_{t\in\mathcal{T}}\in\mathbb{R}^{SA} with dimensions S×AS\times A (e.g., xt(s,a)x_{t}(s,a) can be rt(s,a,Lt)r_{t}(s,a,L_{t}), dt(s,a)d_{t}(s,a) or Lt(s,a)L_{t}(s,a), etc.), define

p,1=maxs𝒮,a𝒜,t=0,,T1s𝒮|pt(s|s,a)|,x1,=t𝒯maxs𝒮,a𝒜|xt(s,a)|.\begin{split}&\|p\|_{\infty,1}=\max_{s\in\mathcal{S},a\in\mathcal{A},t=0,\dots,T-1}\sum_{s^{\prime}\in\mathcal{S}}|p_{t}(s^{\prime}|s,a)|,\quad\|x\|_{1,\infty}=\sum_{t\in\mathcal{T}}\max_{s\in\mathcal{S},a\in\mathcal{A}}|x_{t}(s,a)|.\end{split}

Similarly, for any tensor pt(|,)S×S×Ap_{t}(\cdot|\cdot,\cdot)\in\mathbb{R}^{S\times S\times A} with dimensions S×S×AS\times S\times A, define

pt,1=maxs𝒮,a𝒜s𝒮|pt(s|s,a)|.\begin{split}&\|p_{t}\|_{\infty,1}=\max_{s\in\mathcal{S},a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_{t}(s^{\prime}|s,a)|.\end{split}

Denote also 1\|\cdot\|_{1}, 2\|\cdot\|_{2} and \|\cdot\|_{\infty} respectively for the standard 1\ell_{1}, 2\ell_{2} and \ell_{\infty} vector and matrix norms. Then we will see that a near-optimal solution to (MF-OMO) will be close to a Nash equilibrium solution, assuming the dynamics and rewards are Lipschitz continuous in LtL_{t}.

Theorem 8.

Suppose that for any t𝒯t\in\mathcal{T}, Pt(|,,Lt)P_{t}(\cdot|\cdot,\cdot,L_{t}) and rt(,,Lt)r_{t}(\cdot,\cdot,L_{t}) are CP(>0)C_{P}(>0) and Cr(>0)C_{r}(>0) Lipschitz continuous in LtL_{t}, respectively, i.e.,

PtL1PtL2,1CPLt1Lt21,rtL1rtL2CrLt1Lt21,\|P_{t}^{L^{1}}-P_{t}^{L^{2}}\|_{\infty,1}\leq C_{P}\|L_{t}^{1}-L_{t}^{2}\|_{1},\quad\|r_{t}^{L^{1}}-r_{t}^{L^{2}}\|_{\infty}\leq C_{r}\|L_{t}^{1}-L_{t}^{2}\|_{1},

where PtL:=Pt(|,,Lt)P_{t}^{L}:=P_{t}(\cdot|\cdot,\cdot,L_{t}) and rtL:=rt(,,Lt)r_{t}^{L}:=r_{t}(\cdot,\cdot,L_{t}), and Li:={Lti}t𝒯L^{i}:=\{L_{t}^{i}\}_{t\in\mathcal{T}} (i=1,2i=1,2) are two arbitrary mean-field flows. Let y,z,Ly,z,L be a feasible solution to (MF-OMO), with the value of its objective function being fMF-OMO(y,z,L)ϵ2f^{\text{\rm MF-OMO}}(y,z,L)\leq\epsilon^{2} for some ϵ0\epsilon\geq 0. Then for any πΠ(L)\pi\in\Pi(L), we have

Expl(π)f(S,A,T,CP,Cr,rmax)ϵ+ϵ2,\text{\rm Expl}(\pi)\leq f(S,A,T,C_{P},C_{r},r_{\max})\epsilon+\epsilon^{2},

where

f(S,A,T,CP,Cr,rmax):=T(T+1)rmax((CP+1)T+11)S+2Cr(CP+1)T+2(T+2)CP1CP2S+S32A(T+2)3rmax+SA(T+1)+T.\begin{split}f(S,A,T,C_{P},C_{r},r_{\max}):=&\,\,T(T+1)r_{\max}((C_{P}+1)^{T+1}-1)\sqrt{S}\\ &\,+2C_{r}\dfrac{(C_{P}+1)^{T+2}-(T+2)C_{P}-1}{C_{P}^{2}}\sqrt{S}\\ &\,+S^{\frac{3}{2}}A(T+2)^{3}r_{\max}+\sqrt{SA(T+1)}+\sqrt{T}.\end{split}
Remark 3.

In the special case when the transition dynamics Pt(s|s,a,Lt)P_{t}(s^{\prime}|s,a,L_{t}) are independent of LtL_{t}, as generally assumed in the literature of discrete-time mean-field games [31, 59, 58], we have CP=0C_{P}=0 and a much tighter constant

f(S,A,T,CP,Cr,rmax)=Cr(T+1)(T+2)S+S32A(T+2)3rmax+SA(T+1)+T.f(S,A,T,C_{P},C_{r},r_{\max})=C_{r}(T+1)(T+2)\sqrt{S}+S^{\frac{3}{2}}A(T+2)^{3}r_{\max}+\sqrt{SA(T+1)}+\sqrt{T}.

Note that this tighter constant is polynomial in all the problem parameters.

Conversely, one can characterize an ϵ\epsilon-Nash equilibrium solution in terms of the sub-optimality of (MF-OMO), without the additional assumption of the Lipschitz continuity in Theorem 8.

Theorem 9.

Let $\pi=\{\pi_{t}\}_{t\in\mathcal{T}}$ be an $\epsilon$-Nash equilibrium solution, i.e., $\text{Expl}(\pi)\leq\epsilon$. Define $L=\Gamma(\pi)$,

y=[V1(L),V2(L),,VT(L),V0(L)]S(T+1),y=[V_{1}^{\star}(L),V_{2}^{\star}(L),\dots,V_{T}^{\star}(L),-V_{0}^{\star}(L)]\in\mathbb{R}^{S(T+1)},

and z=[z0,z1,,zT]SA(T+1)z=[z_{0},z_{1},\dots,z_{T}]\in\mathbb{R}^{SA(T+1)} with zt=[zt1,,ztA]z_{t}=[z_{t}^{1},\dots,z_{t}^{A}] (t=0,,Tt=0,\dots,T), where

zta=Vt(L)rt(,a,Lt)Pta(Lt)Vt+1(L)S,t=0,,T1,a𝒜z_{t}^{a}=V_{t}^{\star}(L)-r_{t}(\cdot,a,L_{t})-P_{t}^{a}(L_{t})V_{t+1}^{\star}(L)\in\mathbb{R}^{S},\quad t=0,\dots,T-1,\,a\in\mathcal{A}

and zTa=VT(L)rT(,a,LT)Sz_{T}^{a}=V_{T}^{\star}(L)-r_{T}(\cdot,a,L_{T})\in\mathbb{R}^{S} (a𝒜a\in\mathcal{A}). Then y,z,Ly,z,L is a feasible solution to (MF-OMO) and fMF-OMO(y,z,L)ϵf^{\text{\rm MF-OMO}}(y,z,L)\leq\epsilon.

5 Finding Nash equilibrium solutions via solving MF-OMO

The optimization formulation (MF-OMO) enables applications of families of optimization algorithms to find multiple Nash equilibrium solutions. In this section, we first present the projected gradient descent algorithm and establish its convergence guarantees to the Nash equilibrium solutions of general mean-field games. We then present a finite-time convergent algorithm for solving a special class of mean-field games with linear rewards and mean-field independent dynamics. More discussion on stochastic algorithms, convergence to stationary points as well as practical tricks such as reparametrization, acceleration and variance reduction can be found in Appendix A.

To start, we will need the following assumption.

Assumption 1.

Pt(s,a,Lt)P_{t}(s,a,L_{t}) and rt(s,a,Lt)r_{t}(s,a,L_{t}) are both second-order continuously differentiable in LtL_{t} within some open set containing Δ(𝒮×𝒜)\Delta(\mathcal{S}\times\mathcal{A}) for any s𝒮,a𝒜s\in\mathcal{S},a\in\mathcal{A}.

This assumption is easily verifiable and holds for numerical examples in many existing works including [35, 20] and for theoretical studies of [59, 58].

For notational simplicity, denote the concatenation of $y,z,L$ as $\theta$, and recall the notation $\Theta$ for the feasible set of (MF-OMO) (cf. (13)). Then we see that Assumption 1, together with the compactness of the probability simplex for $L_{t}$ and the norm bounds on $y$ and $z$ in (MF-OMO), immediately leads to the following proposition.

Proposition 10.

Under Assumption 1, fMF-OMO(θ)f^{\text{\rm MF-OMO}}(\theta) is MM-strongly smooth for some positive constant M>0M>0, for any θΘ\theta\in\Theta. That is, for any θ,θΘ\theta,\theta^{\prime}\in\Theta, we have θfMF-OMO(θ)θfMF-OMO(θ)2Mθθ2\|\nabla_{\theta}f^{\text{\rm MF-OMO}}(\theta)-\nabla_{\theta}f^{\text{\rm MF-OMO}}(\theta^{\prime})\|_{2}\leq M\|\theta-\theta^{\prime}\|_{2}.

In addition, Assumption 1, together with the compactness of Θ\Theta, implies the continuity condition in Proposition 1 and the Lipschitz continuity condition in Theorem 8. Moreover, optimal solution(s) exist for (MF-OMO) under Assumption 1. We will denote the optimal solution set of (MF-OMO) as Θ\Theta^{\star} hereafter.

5.1 PGD and convergence to Nash equilibrium solutions

Projected gradient descent (PGD).

The algorithm of projected gradient descent updates θ\theta with the following iterative process:

θk+1=θkηkGηk(θk)=ProjΘ(θkηkθfMF-OMO(θk)).\theta_{k+1}=\theta_{k}-\eta_{k}G_{\eta_{k}}(\theta_{k})=\textbf{Proj}_{\Theta}(\theta_{k}-\eta_{k}\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{k})). (18)

Here $\eta_{k}>0$ is the step-size of the $k$-th iteration, for which appropriate ranges are specified below, and the projection operator $\textbf{Proj}_{\Theta}$ projects $L$ to the probability simplex, $z$ to the specialized polyhedron $\{z\,|\,z\geq 0,\,\|z\|_{1}\leq SA(T^{2}+T+2)r_{\max}\}$, and $y$ to the $\ell_{2}$-norm ball of radius $S(T+1)(T+2)r_{\max}/2$. All these projections can be efficiently evaluated in linear time and in (almost) closed form (cf. [28, 49, 64, 17] for projections onto the probability simplex and the specialized polyhedron, and [57] for the projection onto $\ell_{2}$-norm balls). For simplicity, we always assume that the initialization $\theta_{0}$ is feasible, i.e., $\theta_{0}\in\Theta$.
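A sketch of one PGD run (18) is given below. The gradient oracle grad_f (returning the gradients with respect to $y$, $z$ and the flattened $L_t$'s) is assumed to be supplied, e.g., by automatic differentiation of (14); only the three projections are spelled out, using the standard sorting-based routine for the simplex and the specialized polyhedron.

```python
import numpy as np

def project_simplex(v, radius=1.0):
    """Euclidean projection of v onto {x >= 0, sum(x) = radius} (sorting-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - radius
    rho = np.nonzero(u - css / np.arange(1, v.size + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def project_z(z, R):
    """Projection onto {z >= 0, 1^T z <= R}."""
    z = np.maximum(z, 0.0)
    return z if z.sum() <= R else project_simplex(z, R)

def project_y(y, R):
    """Projection onto the l2 ball of radius R."""
    n = np.linalg.norm(y)
    return y if n <= R else (R / n) * y

def pgd(y, z, L, grad_f, eta, num_iters, S, A, T, r_max):
    """Projected gradient descent (18); L has shape (T+1, S*A), y and z are flat vectors."""
    Rz = S * A * (T**2 + T + 2) * r_max
    Ry = S * (T + 1) * (T + 2) * r_max / 2.0
    for _ in range(num_iters):
        gy, gz, gL = grad_f(y, z, L)                  # assumed gradient oracle
        y = project_y(y - eta * gy, Ry)
        z = project_z(z - eta * gz, Rz)
        L = np.array([project_simplex(Lt - eta * gLt) for Lt, gLt in zip(L, gL)])
    return y, z, L
```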

Convergence to Nash equilibrium solutions.

To find a Nash equilibrium solution of the mean-field game ((1) and (2)) by the PGD algorithm, we will need an additional definability assumption stated below, which is one of the most commonly adopted assumptions in the existing convergence theory of nonconvex optimization.

Assumption 2.

For any $s\in\mathcal{S},a\in\mathcal{A}$, $P_{t}(s,a,L_{t})$ and $r_{t}(s,a,L_{t})$ (as functions of $L_{t}$) are both restrictions to $\Delta(\mathcal{S}\times\mathcal{A})$ of definable functions on the log-exp structure. (For simplicity, below we say that a function is definable if it is definable on the log-exp structure.)

Definable functions cover a broad class of (both smooth and nonsmooth) functions, including all semialgebraic functions, all analytic functions on definable compact sets (a set is definable if it can be defined as the image/range of a definable function), and the exponential and logarithm functions. Moreover, any finite combination of definable functions via summation, subtraction, product, affine mapping, composition, (generalized) inversion, max and min, partial supremum and partial infimum, as well as reciprocal (restricted to a compact domain), is definable inside its domain of definition. Generally speaking, definable functions include all functions that are “programmable”, such as those that can be defined in NumPy, SciPy and PyTorch. The precise definition of definability for both sets and functions, as well as of the log-exp structure, can be found in Appendix B.1 (see also [7, Section 4.3] and [19]).

Under Assumptions 1 and 2, we show that the PGD algorithm converges to Nash equilibrium solution(s) when initialized sufficiently close to one. Moreover, if the initialization is close to some isolated Nash equilibrium solution (e.g., when there are finitely many Nash equilibrium solutions), the iterates converge to that specific Nash equilibrium solution. As a result, different initializations enable us to retrieve all possible Nash equilibrium solutions when there are finitely many of them. The proof is left to Appendix B.2.

Theorem 11 (Convergence to Nash equilibrium solutions).

Under Assumptions 1 and 2, let θk=(yk,zk,Lk)\theta_{k}=(y^{k},z^{k},L^{k}) (k0k\geq 0) be the sequence generated by PGD (18) with ηk=η(0,2/M)\eta_{k}=\eta\in(0,2/M). Then for any Nash equilibrium solution (π,L)(\pi^{\star},L^{\star}), there exists ϵ0>0\epsilon_{0}>0, such that for any L0Δ(𝒮×𝒜)L^{0}\in\Delta(\mathcal{S}\times\mathcal{A}) with L0L1ϵ0\|L^{0}-L^{\star}\|_{1}\leq\epsilon_{0}, if we initialize PGD with θ0=(y0,z0,L0)\theta_{0}=(y^{0},z^{0},L^{0}), where y0=[V1(L0),V2(L0),,VT(L0),V0(L0)]S(T+1)y^{0}=[V_{1}^{\star}(L^{0}),V_{2}^{\star}(L^{0}),\dots,V_{T}^{\star}(L^{0}),-V_{0}^{\star}(L^{0})]\in\mathbb{R}^{S(T+1)}, z0=[z0,0,,z0,T]SA(T+1)z^{0}=[z_{0,0},\dots,z_{0,T}]\in\mathbb{R}^{SA(T+1)}, z0,t=[z0,t1,,z0,tA]z_{0,t}=[z_{0,t}^{1},\dots,z_{0,t}^{A}] (t𝒯t\in\mathcal{T}), with

z0,ta=Vt(L0)rt(,a,Lt0)Pta(Lt0)Vt+1(L0)S,t=0,,T1,a𝒜z_{0,t}^{a}=V_{t}^{\star}(L^{0})-r_{t}(\cdot,a,L_{t}^{0})-P_{t}^{a}(L_{t}^{0})V_{t+1}^{\star}(L^{0})\in\mathbb{R}^{S},\quad t=0,\dots,T-1,\,a\in\mathcal{A}

and z0,Ta=VT(L0)rT(,a,LT0)Sz_{0,T}^{a}=V_{T}^{\star}(L^{0})-r_{T}(\cdot,a,L_{T}^{0})\in\mathbb{R}^{S} (a𝒜a\in\mathcal{A}), then we have the following:

  • $\lim_{k\rightarrow\infty}{\bf dist}(\theta_{k},\Theta^{\star})=0$, and $\|L^{k}-L^{\star}\|_{2}\leq g(\epsilon_{0})$ for any $k\geq 0$. Here $g:\mathbb{R}_{+}\rightarrow\mathbb{R}_{+}$ is a function such that $\lim_{\epsilon\rightarrow 0}g(\epsilon)=0$ (the closed-form definition of $g$ is left to Appendix B.2 and can be found in (46) there), and ${\bf dist}$ is the point-set distance defined as ${\bf dist}(\theta_{k},\Theta^{\star})=\inf_{\theta\in\Theta^{\star}}\|\theta_{k}-\theta\|_{2}$.

  • The limit point set of {θk}k0\{\theta_{k}\}_{k\geq 0} is nonempty, and for any limit point θ¯=(y¯,z¯,L¯)\bar{\theta}=(\bar{y},\bar{z},\bar{L}) of the iteration sequence {θk}k0\{\theta_{k}\}_{k\geq 0}, any π¯Π(L¯)\bar{\pi}\in\Pi(\bar{L}) is a Nash equilibrium solution of the mean-field game ((1) and (2)).

  • For any sequence of πkΠ(Lk)\pi^{k}\in\Pi(L^{k}), limkExpl(πk)=0\lim_{k\rightarrow\infty}\text{\rm Expl}(\pi^{k})=0.

In addition, for any isolated Nash equilibrium solution $(\pi^{\star},L^{\star})$ (i.e., there exists $\epsilon>0$ such that any $(\pi,L)$ with $L\neq L^{\star}$ and $\|L-L^{\star}\|_{2}\leq\epsilon$ is not a Nash equilibrium solution), if we initialize PGD with $\theta_{0}=(y^{0},z^{0},L^{0})$ in the same way as above, then $\lim_{k\rightarrow\infty}L^{k}=L^{\star}$ and $\lim_{k\rightarrow\infty}\text{Expl}(\pi^{k})=0$ for any sequence $\pi^{k}\in\Pi(L^{k})$.

5.2 Finite-time solvability for a class of mean-field games

For mean-field games with linear rewards and mean-field independent dynamics, we show that a Nash equilibrium solution can be found in finite time under the MF-OMO framework. This class of mean-field games includes the LR (left-right) and RPS (rock-paper-scissors) problems in [20].

More precisely, consider mean-field games with rt(s,a,Lt)=r¯s,a,t+LtR¯s,a,tr_{t}(s,a,L_{t})=\bar{r}_{s,a,t}+L_{t}^{\top}\bar{R}_{s,a,t}, Pt(s|s,a,Lt)=P¯s,s,a,tP_{t}(s^{\prime}|s,a,L_{t})=\bar{P}_{s^{\prime},s,a,t} for some constants r¯s,a,t,P¯s,s,a,t\bar{r}_{s,a,t},\,\bar{P}_{s^{\prime},s,a,t}\in\mathbb{R} and constant vectors R¯s,a,tSA\bar{R}_{s,a,t}\in\mathbb{R}^{SA}. Here, ALA_{L} is independent of LL, hence denoted as A¯\bar{A} (to distinguish from the number of actions AA), and cLc_{L} is of the form cL=c¯+C¯Lc_{L}=\bar{c}+\bar{C}L, where c¯SA(T+1)\bar{c}\in\mathbb{R}^{SA(T+1)} is the vectorization of [r¯s,a,t]s𝒮,a𝒜,t𝒯S×A×(T+1)[-\bar{r}_{s,a,t}]_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}\in\mathbb{R}^{S\times A\times(T+1)} (with column-major order), and C¯SA(T+1)×SA(T+1)\bar{C}\in\mathbb{R}^{SA(T+1)\times SA(T+1)} is defined as

\bar{C}=\textbf{diag}\left(\left[\begin{matrix}-\bar{R}_{1,1,0}^{\top}\\ \vdots\\ -\bar{R}_{S,A,0}^{\top}\end{matrix}\right],\left[\begin{matrix}-\bar{R}_{1,1,1}^{\top}\\ \vdots\\ -\bar{R}_{S,A,1}^{\top}\end{matrix}\right],\dots,\left[\begin{matrix}-\bar{R}_{1,1,T}^{\top}\\ \vdots\\ -\bar{R}_{S,A,T}^{\top}\end{matrix}\right]\right).

By (12), one can see that finding Nash equilibrium solution(s) of this class of mean-field games is equivalent to solving the following linear complementarity problem (LCP):

\begin{array}[]{ll}\text{\rm minimize}_{y,z,L}&0\\ \text{\rm subject to}&\left[\begin{array}[]{ll}0&\bar{A}\\ \bar{A}^{\top}&-\bar{C}\end{array}\right]\left[\begin{array}[]{l}y\\ L\end{array}\right]+\left[\begin{array}[]{l}0\\ z\end{array}\right]=\left[\begin{array}[]{l}b\\ \bar{c}\end{array}\right],\\ &z^{\top}L=0,\quad L\geq 0,\quad z\geq 0.\end{array}

Notice that for any solution (y,z,L)(y,z,L), we have either zi=0z_{i}=0 or Li=0L_{i}=0 for any i=1,,SA(T+1)i=1,\dots,SA(T+1). Hence it suffices to consider linear programs induced by fixing z𝒟=0z_{\mathcal{D}}=0 and L𝒟c=0L_{\mathcal{D}^{c}}=0 for some subset 𝒟\mathcal{D} of {1,,SA(T+1)}\{1,\dots,SA(T+1)\}, which leads to a total of 2SA(T+1)2^{SA(T+1)} linear programs. Since a linear program can be solved to its global optimality in finite time (e.g., by simplex algorithms [53]), we have the following proposition.

Proposition 12.

Suppose that the mean-field game ((1) and (2)) has linear rewards and mean-field independent dynamics. Then its Nash equilibrium solution can be found in finite time.

In practice, to solve the LCP efficiently by exploiting the linear complementarity structure, one may resort to pivoting procedures [54] or Lemke’s algorithm [2].
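For very small instances, the enumeration argument above can be carried out literally. The brute-force sketch below checks each complementarity pattern with a linear-programming feasibility solve via SciPy's linprog; it is exponential in $SA(T+1)$ and meant only to illustrate the finite-time argument, not to compete with the pivoting or Lemke-type methods just mentioned.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def solve_lcp_bruteforce(A_bar, C_bar, b, c_bar):
    """Enumerate complementarity patterns of the LCP above and solve each induced LP.

    A_bar: (m, n), C_bar: (n, n), b: (m,), c_bar: (n,), with m = S(T+1), n = SA(T+1).
    Returns a feasible (y, z, L) or None.
    """
    m, n = A_bar.shape
    # Equality constraints in x = [y; L; z]:  A_bar L = b  and  A_bar^T y - C_bar L + z = c_bar.
    A_eq = np.block([[np.zeros((m, m)), A_bar, np.zeros((m, n))],
                     [A_bar.T, -C_bar, np.eye(n)]])
    b_eq = np.concatenate([b, c_bar])
    for pattern in itertools.product([0, 1], repeat=n):           # 1: L_i >= 0 free, z_i = 0
        bounds = [(None, None)] * m                               # y is unconstrained
        bounds += [(0, None) if p else (0, 0) for p in pattern]   # L
        bounds += [(0, 0) if p else (0, None) for p in pattern]   # z
        res = linprog(c=np.zeros(m + 2 * n), A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        if res.success:
            return res.x[:m], res.x[m + n:], res.x[m:m + n]       # (y, z, L)
    return None
```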

6 Proofs of main results

6.1 Proof of Lemma 3

For any Nash equilibrium solution $(\pi,L)$, let $d_{t}^{\star}(s,a)=\mathbb{P}^{\pi,L}(s_{t}=s,a_{t}=a)$ denote the state-action distribution of a representative agent taking policy $\pi$ under $L$. We first show that $\pi\in\Pi(d^{\star})$ and $(d^{\star},L)$ satisfies conditions (A) and (B). Since $(\pi,L)$ is a Nash equilibrium solution, it satisfies (1). By Lemma 2, applied to the MDP of the representative agent, $\pi\in\Pi(d^{\star})$ and condition (A) is satisfied for $d^{\star}$ and $L$. Condition (B) can be proved by induction. When $t=0$, $d_{0}^{\star}(s,a)=\mathbb{P}^{\pi,L}(s_{0}=s,a_{0}=a)=\mu_{0}(s)\pi_{0}(a|s)=L_{0}(s,a)$ for any $s\in\mathcal{S}$ and any $a\in\mathcal{A}$, since (15) holds for $L_{0}$. Suppose $d_{t}^{\star}(s,a)=L_{t}(s,a)$ holds for all $t\leq t^{\prime}$ and all $s\in\mathcal{S}$ and $a\in\mathcal{A}$. Then for any $s\in\mathcal{S}$ and any $a\in\mathcal{A}$,

dt+1(s,a)\displaystyle d_{t^{\prime}+1}^{\star}(s,a) =π,L(st+1=s,at+1=a)=π,L(st+1=s)πt+1(a|s)\displaystyle=\mathbb{P}^{\pi,L}(s_{t^{\prime}+1}=s,a_{t^{\prime}+1}=a)=\mathbb{P}^{\pi,L}(s_{t^{\prime}+1}=s)\pi_{t^{\prime}+1}(a|s)
=πt+1(a|s)s𝒮a𝒜π,L(st=s,at=a,st+1=s)\displaystyle=\pi_{t^{\prime}+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\mathbb{P}^{\pi,L}(s_{t^{\prime}}=s^{\prime},a_{t^{\prime}}=a^{\prime},s_{t^{\prime}+1}=s)
=πt+1(a|s)s𝒮a𝒜π,L(st=s,at=a)Pt(s|s,a,Lt)\displaystyle=\pi_{t^{\prime}+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\mathbb{P}^{\pi,L}(s_{t^{\prime}}=s^{\prime},a_{t^{\prime}}=a^{\prime})P_{t^{\prime}}(s|s^{\prime},a^{\prime},L_{t^{\prime}})
\displaystyle=\pi_{t^{\prime}+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d_{t^{\prime}}^{\star}(s^{\prime},a^{\prime})P_{t^{\prime}}(s|s^{\prime},a^{\prime},L_{t^{\prime}})
=πt+1(a|s)s𝒮a𝒜Lt(s,a)Pt(s|s,a,Lt)=Lt+1(s,a),\displaystyle=\pi_{t^{\prime}+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}L_{t^{\prime}}(s^{\prime},a^{\prime})P_{t^{\prime}}(s|s^{\prime},a^{\prime},L_{t^{\prime}})=L_{t^{\prime}+1}(s,a),

where the last equality holds by (16).

Conversely, for any (d^{\star},L) satisfying conditions (A) and (B), define \pi_t(a|s)=\frac{L_t(s,a)}{\sum_{a^{\prime}\in\mathcal{A}}L_t(s,a^{\prime})} when \sum_{a^{\prime}\in\mathcal{A}}L_t(s,a^{\prime})>0, and let \pi_t(\cdot|s) be an arbitrary probability vector when \sum_{a^{\prime}\in\mathcal{A}}L_t(s,a^{\prime})=0. We show that (\pi,L) is a Nash equilibrium solution of the mean-field game ((1) and (2)).

By Lemma 2, condition (1) holds. For condition (2), since d=Ld^{\star}=L, LL satisfies (6) by replacing dd with LL. Therefore, for any s𝒮,t=0,1,,T1s^{\prime}\in\mathcal{S},t=0,1,\dots,T-1,

s𝒮a𝒜Lt(s,a)Pt(s|s,a,Lt)=a𝒜Lt+1(s,a).\displaystyle\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}L_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t})=\sum_{a\in\mathcal{A}}L_{t+1}(s^{\prime},a).

Multiplying both sides by \pi_{t+1}(a^{\prime}|s^{\prime}), and by the definition of \pi,

\displaystyle\pi_{t+1}(a^{\prime}|s^{\prime})\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}L_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t})=L_{t+1}(s^{\prime},a^{\prime}).

Hence condition (2) holds. ∎

6.2 Proof of Proposition 6

Since (y,z,L) is a solution to (12), we have L\geq 0 and \mathbf{1}^{\top}L_t=1 for all t\in\mathcal{T} (by A_L L=b), i.e., L_t\in\Delta(\mathcal{S}\times\mathcal{A}) for each t\in\mathcal{T}. Note that we use \mathbf{1} to denote the all-one vector (with appropriate dimensions). To show that (\hat{y},\hat{z},L) solves (12), it suffices to show that A_L^{\top}\hat{y}+\hat{z}=c_L, \hat{z}\geq 0 and \hat{z}^{\top}L=0.

By writing y^\hat{y} as y^=[y^0,,y^T]\hat{y}=[\hat{y}_{0},\dots,\hat{y}_{T}] with y^tS\hat{y}_{t}\in\mathbb{R}^{S}, z^\hat{z} as z^=[z^0,,z^T]\hat{z}=[\hat{z}_{0},\dots,\hat{z}_{T}] with z^tSA\hat{z}_{t}\in\mathbb{R}^{SA}, we have

ALy^+z^=[W0(L0)y^0+Zy^T+z^0,W1(L1)y^1Zy^0+z^1,WT1(LT1)y^T1Zy^T2+z^T1,Zy^T1+z^T.],A_{L}^{\top}\hat{y}+\hat{z}=\left[\begin{array}[]{c}W_{0}(L_{0})^{\top}\hat{y}_{0}+Z^{\top}\hat{y}_{T}+\hat{z}_{0},\\ W_{1}(L_{1})^{\top}\hat{y}_{1}-Z^{\top}\hat{y}_{0}+\hat{z}_{1},\\ \vdots\\ W_{T-1}(L_{T-1})^{\top}\hat{y}_{T-1}-Z^{\top}\hat{y}_{T-2}+\hat{z}_{T-1},\\ -Z^{\top}\hat{y}_{T-1}+\hat{z}_{T}.\end{array}\right],

which, by the definition of Wt(Lt)W_{t}(L_{t}) and ZZ, becomes

ALy^+z^=[P01(L0)y^0+y^T+z^01P0A(L0)y^0+y^T+z^0AP11(L1)y^1y^0+z^11P1A(L1)y^1y^0+z^1APT11(LT1)y^T1y^T2+z^T11PT1A(LT1)y^T1y^T2+z^T1Ay^T1+z^T1y^T1+z^TA].A_{L}^{\top}\hat{y}+\hat{z}=\left[\begin{array}[]{c}P_{0}^{1}(L_{0})\hat{y}_{0}+\hat{y}_{T}+\hat{z}_{0}^{1}\\ \vdots\\ P_{0}^{A}(L_{0})\hat{y}_{0}+\hat{y}_{T}+\hat{z}_{0}^{A}\\ P_{1}^{1}(L_{1})\hat{y}_{1}-\hat{y}_{0}+\hat{z}_{1}^{1}\\ \vdots\\ P_{1}^{A}(L_{1})\hat{y}_{1}-\hat{y}_{0}+\hat{z}_{1}^{A}\\ \vdots\\ P_{T-1}^{1}(L_{T-1})\hat{y}_{T-1}-\hat{y}_{T-2}+\hat{z}_{T-1}^{1}\\ \vdots\\ P_{T-1}^{A}(L_{T-1})\hat{y}_{T-1}-\hat{y}_{T-2}+\hat{z}_{T-1}^{A}\\ -\hat{y}_{T-1}+\hat{z}_{T}^{1}\\ \vdots\\ -\hat{y}_{T-1}+\hat{z}_{T}^{A}\end{array}\right].

Now noticing that \hat{y}_T=-V_0^{\star}(L), \hat{y}_t=V_{t+1}^{\star}(L) for t=0,\dots,T-1, \hat{z}_T^a=V_T^{\star}(L)-r_T(\cdot,a,L_T), and \hat{z}_t^a=V_t^{\star}(L)-r_t(\cdot,a,L_t)-P_t^a(L_t)V_{t+1}^{\star}(L) for a\in\mathcal{A}, t=0,\dots,T-1, we see

ALy^+z^=[r0(,,L0),,rT(,,LT)]=cL.A_{L}^{\top}\hat{y}+\hat{z}=[-r_{0}(\cdot,\cdot,L_{0}),\dots,-r_{T}(\cdot,\cdot,L_{T})]=c_{L}.

Moreover, we have VT(L)=maxa𝒜rT(,a,LT)V_{T}^{\star}(L)=\max_{a\in\mathcal{A}}r_{T}(\cdot,a,L_{T}), and that

Vt(L)=maxa𝒜{rt(,a,Lt)+Pta(Lt)Vt+1(L)},V_{t}^{\star}(L)=\max_{a\in\mathcal{A}}\left\{r_{t}(\cdot,a,L_{t})+P_{t}^{a}(L_{t})V_{t+1}^{\star}(L)\right\},

implying that z^0\hat{z}\geq 0 by its definition. Note that these are the Bellman equations for the mean-field induced MDP (L)\mathcal{M}(L).
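
For concreteness, the backward recursion that defines V^{\star}(L), \hat{y} and \hat{z} can be written in a few lines. The following is an illustrative numpy sketch; the function name and the array layout (rewards r[t] of shape (S, A) and transitions P[t] of shape (S, A, S), both already evaluated at the fixed flow L) are assumptions made only for this example.

```python
import numpy as np

def bellman_multipliers(r, P):
    """Backward induction for the MDP M(L) with the flow L held fixed.

    r: list of T+1 arrays of shape (S, A), r[t][s, a] = r_t(s, a, L_t)
    P: list of T arrays of shape (S, A, S), P[t][s, a, s'] = P_t(s'|s, a, L_t)
    Returns the value functions V_t^*(L) and the multipliers (y_hat, z_hat).
    """
    T = len(r) - 1
    V = [None] * (T + 1)
    V[T] = r[T].max(axis=1)
    for t in range(T - 1, -1, -1):
        Q = r[t] + np.einsum("sap,p->sa", P[t], V[t + 1])
        V[t] = Q.max(axis=1)
    # y_hat_t = V_{t+1}^*(L) for t = 0, ..., T-1, and y_hat_T = -V_0^*(L).
    y_hat = [V[t + 1] for t in range(T)] + [-V[0]]
    # z_hat_t(., a) = V_t^*(L) - r_t(., a, L_t) - P_t^a(L_t) V_{t+1}^*(L); z_hat_T = V_T^* - r_T.
    z_hat = [V[t][:, None] - r[t] - np.einsum("sap,p->sa", P[t], V[t + 1]) for t in range(T)]
    z_hat.append(V[T][:, None] - r[T])
    return V, y_hat, z_hat
```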

It remains to prove that z^L=0\hat{z}^{\top}L=0. To see this, we first show that y^t1yt1\hat{y}_{t-1}\leq y_{t-1} for t=1,,Tt=1,\dots,T and y^1y1\hat{y}_{-1}\geq y_{-1} by backward induction on tt from TT to 0. Here y^1\hat{y}_{-1} and y1y_{-1} are defined as y^T\hat{y}_{T} and yTy_{T}, respectively.

Firstly, note that ALy+z=cLA_{L}^{\top}y+z=c_{L} and z0z\geq 0. Hence similar to the expansions above, we have

z_T^a=y_{T-1}-r_T(\cdot,a,L_T),\quad z_0^a=-y_T-r_0(\cdot,a,L_0)-P_0^a(L_0)y_0,\quad\forall a\in\mathcal{A},

and

z_t^a=y_{t-1}-r_t(\cdot,a,L_t)-P_t^a(L_t)y_t,\quad\forall a\in\mathcal{A},\,t=1,\dots,T-1.

Now for the base step, since zTa0z_{T}^{a}\geq 0 for all a𝒜a\in\mathcal{A}, we have

y_{T-1}\geq\max_{a\in\mathcal{A}}r_T(\cdot,a,L_T)=V_T^{\star}(L)=\hat{y}_{T-1}.

For the induction step, suppose that y^tyt\hat{y}_{t}\leq y_{t} for some 0<tT10<t\leq T-1. Then since zta0z_{t}^{a}\geq 0 for all a𝒜a\in\mathcal{A}, we have

yt1maxa𝒜{rt(,a,Lt)+Pta(Lt)yt}maxa𝒜{rt(,a,Lt)+Pta(Lt)y^t}=maxa𝒜{rt(,a,Lt)+Pta(Lt)Vt+1(L)}=Vt(L)=y^t1.\begin{split}y_{t-1}&\geq\max_{a\in\mathcal{A}}\left\{r_{t}(\cdot,a,L_{t})+P_{t}^{a}(L_{t})y_{t}\right\}\geq\max_{a\in\mathcal{A}}\left\{r_{t}(\cdot,a,L_{t})+P_{t}^{a}(L_{t})\hat{y}_{t}\right\}\\ &=\max_{a\in\mathcal{A}}\left\{r_{t}(\cdot,a,L_{t})+P_{t}^{a}(L_{t})V_{t+1}^{\star}(L)\right\}=V_{t}^{\star}(L)=\hat{y}_{t-1}.\end{split}

Finally, since \hat{y}_0\leq y_0 by the induction above, and since z_0^a\geq 0 for all a\in\mathcal{A},

y1=yTmina𝒜{r0(,a,L0)P0a(L0)y0}mina𝒜{r0(,a,L0)P0a(L0)y^0}=maxa𝒜{r0(,a,L0)+P0a(L0)V1(L)}=V0(L)=y^T=y^1.\begin{split}y_{-1}=y_{T}&\leq\min_{a\in\mathcal{A}}\left\{-r_{0}(\cdot,a,L_{0})-P_{0}^{a}(L_{0})y_{0}\right\}\leq\min_{a\in\mathcal{A}}\left\{-r_{0}(\cdot,a,L_{0})-P_{0}^{a}(L_{0})\hat{y}_{0}\right\}\\ &=-\max_{a\in\mathcal{A}}\left\{r_{0}(\cdot,a,L_{0})+P_{0}^{a}(L_{0})V_{1}^{\star}(L)\right\}=-V_{0}^{\star}(L)=\hat{y}_{T}=\hat{y}_{-1}.\end{split}

Therefore y^TyT\hat{y}_{T}\geq y_{T}.

Now, since y,z,Ly,z,L is a solution to problem (12), we have

by=(ALL)y+Lz=L(ALy+z)=LcL=cLL.b^{\top}y=(A_{L}L)^{\top}y+L^{\top}z=L^{\top}(A_{L}^{\top}y+z)=L^{\top}c_{L}=c_{L}^{\top}L.

And for any y,zy^{\prime},z^{\prime} satisfying ALy+z=cLA_{L}^{\top}y^{\prime}+z^{\prime}=c_{L} and z0z^{\prime}\geq 0, since ALL=bA_{L}L=b and L0L\geq 0, we have

cLLby=(ALy+z)LLALy=Lz0.c_{L}^{\top}L-b^{\top}y^{\prime}=(A_{L}^{\top}y^{\prime}+z^{\prime})^{\top}L-L^{\top}A_{L}^{\top}y^{\prime}=L^{\top}z^{\prime}\geq 0. (19)

In particular, b^{\top}y=c_L^{\top}L\geq b^{\top}y^{\prime} for any y^{\prime} such that A_L^{\top}y^{\prime}+z^{\prime}=c_L for some z^{\prime}\geq 0. That is, y (together with z) is an optimal solution to the following linear optimization problem for fixed L:

maximizey,zbysubject toALy+z=cL,z0.\begin{array}[]{ll}\text{maximize}_{y^{\prime},z^{\prime}}&b^{\top}y^{\prime}\\ \text{subject to}&A_{L}^{\top}y^{\prime}+z^{\prime}=c_{L},\quad z^{\prime}\geq 0.\end{array} (20)

Now taking y=y^y^{\prime}=\hat{y} and z=z^z^{\prime}=\hat{z}, we see that ALy^+z^=cLA_{L}^{\top}\hat{y}+\hat{z}=c_{L} and z^0\hat{z}\geq 0 as proved above. Hence y^,z^\hat{y},\hat{z} is a feasible solution to problem (20), and byby^b^{\top}y\geq b^{\top}\hat{y}. Since y^TyT\hat{y}_{T}\geq y_{T},

by^=μ0y^Tμ0yT=by.b^{\top}\hat{y}=\mu_{0}^{\top}\hat{y}_{T}\geq\mu_{0}^{\top}y_{T}=b^{\top}y.

Combined, by=by^b^{\top}y=b^{\top}\hat{y}, hence by (19),

Lz^=cLLby^=cLLby=0.L^{\top}\hat{z}=c_{L}^{\top}L-b^{\top}\hat{y}=c_{L}^{\top}L-b^{\top}y=0.

6.3 Proof of Theorem 8

Theorem 8 is established via the following perturbation lemma for general MDPs.

Lemma 13.

Consider two finite-horizon MDPs \mathcal{M}^1 and \mathcal{M}^2 with finite state and action spaces, with respective transition probabilities \{p_t^1\}_{t\in\mathcal{T}} and \{p_t^2\}_{t\in\mathcal{T}}, rewards \{r_t^1\}_{t\in\mathcal{T}} and \{r_t^2\}_{t\in\mathcal{T}}, and expected total rewards V^{1,\pi} and V^{2,\pi} for any policy \pi. In addition, denote their corresponding optimal values by V^{1,\star} and V^{2,\star}. Then we have

|V1,πV2,π|T(T+1)rmax2p1p2,1+r1r21,,|V^{1,\pi}-V^{2,\pi}|\leq\dfrac{T(T+1)r_{\max}}{2}\|p^{1}-p^{2}\|_{\infty,1}+\|r^{1}-r^{2}\|_{1,\infty},

and

|V1,V2,|T(T+1)rmax2p1p2,1+r1r21,.|V^{1,\star}-V^{2,\star}|\leq\dfrac{T(T+1)r_{\max}}{2}\|p^{1}-p^{2}\|_{\infty,1}+\|r^{1}-r^{2}\|_{1,\infty}.

Here r_{\max} is such that |r_t^1(s,a)|\leq r_{\max} and |r_t^2(s,a)|\leq r_{\max} for any s\in\mathcal{S}, a\in\mathcal{A}, t\in\mathcal{T}.

Proof of Lemma 13.

The proof consists of two parts.

Part I: Values for fixed policy π\pi. Let Vti,π(s)V_{t}^{i,\pi}(s) be the total expected reward starting from state ss at time tt under policy π\pi in the MDP i\mathcal{M}^{i} (i=1,2i=1,2). Clearly Vi,π=s𝒮μ0(s)V0i,π(s)V^{i,\pi}=\sum_{s\in\mathcal{S}}\mu_{0}(s)V_{0}^{i,\pi}(s) (i=1,2i=1,2).

Now by the dynamic programming principle, we have for i=1,2i=1,2, t=0,,T1t=0,\dots,T-1 and s𝒮s\in\mathcal{S},

Vti,π(s)=a𝒜πt(a|s)rti(s,a)+a𝒜s𝒮πt(a|s)pti(s|s,a)Vt+1i,π(s),V_{t}^{i,\pi}(s)=\sum_{a\in\mathcal{A}}\pi_{t}(a|s)r_{t}^{i}(s,a)+\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}\pi_{t}(a|s)p_{t}^{i}(s^{\prime}|s,a)V_{t+1}^{i,\pi}(s^{\prime}),

and V_T^{i,\pi}(s)=\sum_{a\in\mathcal{A}}\pi_T(a|s)r_T^i(s,a), implying

\begin{split}|V_t^{1,\pi}(s)-V_t^{2,\pi}(s)|\leq&\,\sum_{a\in\mathcal{A}}|r_t^1(s,a)-r_t^2(s,a)|\pi_t(a|s)\\ &+\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}\pi_t(a|s)\left|p_t^1(s^{\prime}|s,a)V_{t+1}^{1,\pi}(s^{\prime})-p_t^2(s^{\prime}|s,a)V_{t+1}^{2,\pi}(s^{\prime})\right|\\ \leq&\,\max_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_t^1(s^{\prime}|s,a)-p_t^2(s^{\prime}|s,a)||V_{t+1}^{1,\pi}(s^{\prime})|\\ &+\sum_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}\pi_t(a|s)p_t^2(s^{\prime}|s,a)|V_{t+1}^{1,\pi}(s^{\prime})-V_{t+1}^{2,\pi}(s^{\prime})|\\ &+\max_{a\in\mathcal{A}}|r_t^1(s,a)-r_t^2(s,a)|\\ \overset{(a)}{\leq}&\,(T-t)r_{\max}\max_{s\in\mathcal{S},a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_t^1(s^{\prime}|s,a)-p_t^2(s^{\prime}|s,a)|+\|V_{t+1}^{1,\pi}-V_{t+1}^{2,\pi}\|_{\infty}\\ &+\max_{s\in\mathcal{S},a\in\mathcal{A}}|r_t^1(s,a)-r_t^2(s,a)|.\end{split} (21)

Here (a) uses the fact that |V_{t+1}^{i,\pi}(s)|\leq(T-t)r_{\max}, and V_{t+1}^{i,\pi} is viewed as a vector in \mathbb{R}^S when taking the \ell_{\infty} norm.

By taking max over s𝒮s\in\mathcal{S} on the left-hand side of (21) and telescoping the inequalities, we have

V01,πV02,πt=0T1(Tt)rmaxmaxs𝒮,a𝒜s𝒮|pt1(s|s,a)pt2(s|s,a)|+t=0Tmaxs𝒮,a𝒜|rt1(s,a)rt2(s,a)|T(T+1)rmax2p1p2,1+r1r21,.\begin{split}\|V_{0}^{1,\pi}-V_{0}^{2,\pi}\|_{\infty}\leq&\sum_{t=0}^{T-1}(T-t)r_{\max}\max_{s\in\mathcal{S},a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_{t}^{1}(s^{\prime}|s,a)-p_{t}^{2}(s^{\prime}|s,a)|\\ &+\sum_{t=0}^{T}\max_{s\in\mathcal{S},a\in\mathcal{A}}|r_{t}^{1}(s,a)-r_{t}^{2}(s,a)|\\ \leq&\dfrac{T(T+1)r_{\max}}{2}\|p^{1}-p^{2}\|_{\infty,1}+\|r^{1}-r^{2}\|_{1,\infty}.\end{split}

Here we use the fact that

VT1,π(s)VT2,π(s)=a𝒜rT1(s,a)πT(a|s)a𝒜rT2(s,a)πT(a|s).V_{T}^{1,\pi}(s)-V_{T}^{2,\pi}(s)=\sum_{a\in\mathcal{A}}r_{T}^{1}(s,a)\pi_{T}(a|s)-\sum_{a\in\mathcal{A}}r_{T}^{2}(s,a)\pi_{T}(a|s).

Part II: Optimal values. Let V_t^{i,\star}(s) be the optimal value function for the MDP \mathcal{M}^i (i=1,2) starting at time t\in\mathcal{T}. Then we have V^{i,\star}=\sum_{s\in\mathcal{S}}\mu_0(s)V_0^{i,\star}(s) (i=1,2).

Now by the dynamic programming principle, we have for i=1,2i=1,2, t=0,,T1t=0,\dots,T-1 and s𝒮s\in\mathcal{S},

Vti,(s)=maxa𝒜(rti(s,a)+s𝒮pti(s|s,a)Vt+1i,(s)),V_{t}^{i,\star}(s)=\max_{a\in\mathcal{A}}\left(r_{t}^{i}(s,a)+\sum_{s^{\prime}\in\mathcal{S}}p_{t}^{i}(s^{\prime}|s,a)V_{t+1}^{i,\star}(s^{\prime})\right),

which implies that

\begin{split}|V_t^{1,\star}(s)&-V_t^{2,\star}(s)|\\ &\leq\max_{a\in\mathcal{A}}\left\{|r_t^1(s,a)-r_t^2(s,a)|+\sum_{s^{\prime}\in\mathcal{S}}\left|p_t^1(s^{\prime}|s,a)V_{t+1}^{1,\star}(s^{\prime})-p_t^2(s^{\prime}|s,a)V_{t+1}^{2,\star}(s^{\prime})\right|\right\}\\ &\leq\max_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_t^1(s^{\prime}|s,a)-p_t^2(s^{\prime}|s,a)|V_{t+1}^{1,\star}(s^{\prime})\\ &\quad+\max_{a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}p_t^2(s^{\prime}|s,a)|V_{t+1}^{1,\star}(s^{\prime})-V_{t+1}^{2,\star}(s^{\prime})|+\max_{a\in\mathcal{A}}|r_t^1(s,a)-r_t^2(s,a)|\\ &\overset{(a)}{\leq}(T-t)r_{\max}\max_{s\in\mathcal{S},a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_t^1(s^{\prime}|s,a)-p_t^2(s^{\prime}|s,a)|+\|V_{t+1}^{1,\star}-V_{t+1}^{2,\star}\|_{\infty}\\ &\quad+\max_{s\in\mathcal{S},a\in\mathcal{A}}|r_t^1(s,a)-r_t^2(s,a)|.\end{split} (22)

Here in (a) we use the fact that V_{t+1}^{i,\star}(s)\in[0,(T-t)r_{\max}], and V_{t+1}^{i,\star} is viewed as a vector in \mathbb{R}^S when taking the \ell_{\infty} norm.

By taking max over s𝒮s\in\mathcal{S} on the left-hand side of (22) and telescoping the inequalities, we have

V01,V02,t=0T1(Tt)rmaxmaxs𝒮,a𝒜s𝒮|pt1(s|s,a)pt2(s|s,a)|+t=0Tmaxs𝒮,a𝒜|rt1(s,a)rt2(s,a)|T(T+1)rmax2p1p2,1+r1r21,.\begin{split}\|V_{0}^{1,\star}-V_{0}^{2,\star}\|_{\infty}\leq&\sum_{t=0}^{T-1}(T-t)r_{\max}\max_{s\in\mathcal{S},a\in\mathcal{A}}\sum_{s^{\prime}\in\mathcal{S}}|p_{t}^{1}(s^{\prime}|s,a)-p_{t}^{2}(s^{\prime}|s,a)|\\ &+\sum_{t=0}^{T}\max_{s\in\mathcal{S},a\in\mathcal{A}}|r_{t}^{1}(s,a)-r_{t}^{2}(s,a)|\\ \leq&\dfrac{T(T+1)r_{\max}}{2}\|p^{1}-p^{2}\|_{\infty,1}+\|r^{1}-r^{2}\|_{1,\infty}.\end{split}

Here we use the fact that |V_T^{1,\star}(s)-V_T^{2,\star}(s)|=|\max_{a\in\mathcal{A}}r_T^1(s,a)-\max_{a\in\mathcal{A}}r_T^2(s,a)|\leq\max_{a\in\mathcal{A}}|r_T^1(s,a)-r_T^2(s,a)|. ∎
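
As an aside, the optimal-value part of Lemma 13 can be sanity-checked numerically on random finite-horizon MDPs. The script below is a small illustrative sketch (all names are ours); it assumes rewards lying in [0, 1], so that r_{\max}=1 and the nonnegativity used in step (a) holds.

```python
import numpy as np

def optimal_value(r, P, mu0):
    """Optimal value of a finite-horizon MDP by backward induction.
    r[t]: (S, A) rewards, P[t]: (S, A, S) transitions, mu0: initial distribution."""
    T = len(r) - 1
    V = r[T].max(axis=1)
    for t in range(T - 1, -1, -1):
        V = (r[t] + np.einsum("sap,p->sa", P[t], V)).max(axis=1)
    return mu0 @ V

rng = np.random.default_rng(0)
S, A, T, delta = 4, 3, 6, 1e-2
mu0 = np.ones(S) / S
P1 = [rng.dirichlet(np.ones(S), size=(S, A)) for _ in range(T)]
r1 = [rng.uniform(0.0, 1.0, size=(S, A)) for _ in range(T + 1)]
# Perturb rewards (clipped to [0, 1]) and transitions (re-normalized) by at most delta.
r2 = [np.clip(r + rng.uniform(-delta, delta, size=(S, A)), 0.0, 1.0) for r in r1]
P2 = []
for P in P1:
    Q = np.clip(P + rng.uniform(-delta, delta, size=P.shape), 1e-8, None)
    P2.append(Q / Q.sum(axis=-1, keepdims=True))
gap = abs(optimal_value(r1, P1, mu0) - optimal_value(r2, P2, mu0))
r_max = 1.0
p_gap = max(np.abs(a - b).sum(axis=-1).max() for a, b in zip(P1, P2))  # ||p1 - p2||_{inf,1}
r_gap = sum(np.abs(a - b).max() for a, b in zip(r1, r2))               # ||r1 - r2||_{1,inf}
bound = T * (T + 1) * r_max / 2 * p_gap + r_gap
print(gap, "<=", bound)  # the lemma guarantees gap <= bound
```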

Proof of Theorem 8.

We first define d=\tilde{\Gamma}(\pi,L) for the given \pi\in\Pi(L), where \tilde{\Gamma}(\pi,L) is defined recursively by \tilde{\Gamma}(\pi,L)_0(s,a):=\mu_0(s)\pi_0(a|s) (for any s\in\mathcal{S}, a\in\mathcal{A}) and

Γ~(π,L)t+1(s,a):=πt+1(a|s)s𝒮a𝒜Γ~(π,L)t(s,a)Pt(s|s,a,Lt),s𝒮,a𝒜.\tilde{\Gamma}(\pi,L)_{t+1}(s,a):=\pi_{t+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\tilde{\Gamma}(\pi,L)_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},L_{t}),\quad\forall s\in\mathcal{S},a\in\mathcal{A}.

Then dd is the occupation measure of policy π\pi for MDP (L)\mathcal{M}(L).
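
The recursion for \tilde{\Gamma}(\pi,L) is straightforward to implement. The following numpy sketch (the function name and array layout are assumptions made for illustration) computes d given \pi and the transition kernels already evaluated at the fixed flow L.

```python
import numpy as np

def occupation_measure(pi, P, mu0):
    """Compute d = Gamma~(pi, L) for the MDP M(L) with the flow L frozen.

    pi:  list of T+1 arrays of shape (S, A), pi[t][s, a] = pi_t(a|s)
    P:   list of T arrays of shape (S, A, S), P[t][s, a, s'] = P_t(s'|s, a, L_t)
    mu0: initial state distribution of shape (S,)
    """
    T = len(pi) - 1
    d = [mu0[:, None] * pi[0]]                        # d_0(s, a) = mu0(s) pi_0(a|s)
    for t in range(T):
        nu_next = np.einsum("sa,sap->p", d[t], P[t])  # state marginal at time t+1
        d.append(nu_next[:, None] * pi[t + 1])        # d_{t+1}(s, a) = nu_{t+1}(s) pi_{t+1}(a|s)
    return d
```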

Step 1: Closeness between dd and LL. By definition, ALd=bA_{L}d=b. We first prove by induction that for all tt,

s𝒮a𝒜|dt(s,a)Lt(s,a)|S(t+1)ϵ.\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}|d_{t}(s,a)-L_{t}(s,a)|\leq\sqrt{S}(t+1)\epsilon. (23)

When t=0t=0,

s𝒮a𝒜|d0(s,a)L0(s,a)|=s𝒮a𝒜π0(a|s)|μ0(s)a𝒜L0(s,a)|=s𝒮|μ0(s)a𝒜L0(s,a)|Sϵ,\begin{split}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}|d_{0}(s,a)-L_{0}(s,a)|&=\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\pi_{0}(a|s)|\mu_{0}(s)-\sum_{a^{\prime}\in\mathcal{A}}L_{0}(s,a^{\prime})|\\ &=\sum_{s\in\mathcal{S}}|\mu_{0}(s)-\sum_{a^{\prime}\in\mathcal{A}}L_{0}(s,a^{\prime})|\leq\sqrt{S}\epsilon,\end{split} (24)

where the last inequality is by ALLb2ϵ\|A_{L}L-b\|_{2}\leq\epsilon. Now suppose (23) holds for tt. Then by the construction of π\pi and dd,

dt+1(s,a)=πt+1(a|s)s𝒮a𝒜dt(s,a)Pt(s|s,a,Lt),d_{t+1}(s,a)=\pi_{t+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},L_{t}), (25)
Lt+1(s,a)=πt+1(a|s)a𝒜Lt+1(s,a).L_{t+1}(s,a)=\pi_{t+1}(a|s)\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime}). (26)

Therefore

s𝒮a𝒜|dt+1(s,a)Lt+1(s,a)|=s𝒮a𝒜πt+1(a|s)|s𝒮a𝒜dt(s,a)Pt(s|s,a,Lt)a𝒜Lt+1(s,a)|=s𝒮|s𝒮a𝒜dt(s,a)Pt(s|s,a,Lt)a𝒜Lt+1(s,a)|s𝒮|s𝒮a𝒜(dt(s,a)Lt(s,a))Pt(s|s,a,Lt)|+s𝒮|s𝒮a𝒜Lt(s,a)Pt(s|s,a,Lt)a𝒜Lt+1(s,a)|S(t+1)ϵ+Sϵ=S(t+2)ϵ,\begin{split}\sum_{s\in\mathcal{S}}&\sum_{a\in\mathcal{A}}|d_{t+1}(s,a)-L_{t+1}(s,a)|\\ &=\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\pi_{t+1}(a|s)|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},L_{t})-\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime})|\\ &=\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},L_{t})-\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime})|\\ &\leq\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\left(d_{t}(s^{\prime},a^{\prime})-L_{t}(s^{\prime},a^{\prime})\right)P_{t}(s|s^{\prime},a^{\prime},L_{t})|\\ &\qquad\qquad+\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}L_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},L_{t})-\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime})|\\ &\leq\sqrt{S}(t+1)\epsilon+\sqrt{S}\epsilon=\sqrt{S}(t+2)\epsilon,\end{split} (27)

where the last inequality uses the induction hypothesis for the first term and the constraint \|A_L L-b\|_2\leq\epsilon for the second term. This completes the induction.

The above bound (27), together with the fact that 0zLϵ20\leq z^{\top}L\leq\epsilon^{2}, immediately implies that

|zd||z(Ld)|+|zL|z1Ld+ϵ2SA(T2+T+2)rmaxS(T+1)ϵ+ϵ2.\begin{split}|z^{\top}d|&\leq|z^{\top}(L-d)|+|z^{\top}L|\leq\|z\|_{1}\|L-d\|_{\infty}+\epsilon^{2}\\ &\leq SA(T^{2}+T+2)r_{\max}\sqrt{S}(T+1)\epsilon+\epsilon^{2}.\end{split}

Hence by ALd=bA_{L}d=b and ALy+zcL22ϵ2\|A_{L}^{\top}y+z-c_{L}\|_{2}^{2}\leq\epsilon^{2}, we have

|bycLd|=|dALycLd||dz|+|d(ALy+zcL)|SA(T2+T+2)rmaxS(T+1)ϵ+ϵ2+d2ALy+zcL2SA(T2+T+2)rmaxS(T+1)ϵ+ϵ2+Tϵ.\begin{split}|b^{\top}y-c_{L}^{\top}d|&=|d^{\top}A_{L}^{\top}y-c_{L}^{\top}d|\leq|d^{\top}z|+|d^{\top}(A_{L}^{\top}y+z-c_{L})|\\ &\leq SA(T^{2}+T+2)r_{\max}\sqrt{S}(T+1)\epsilon+\epsilon^{2}+\|d\|_{2}\|A_{L}^{\top}y+z-c_{L}\|_{2}\\ &\leq SA(T^{2}+T+2)r_{\max}\sqrt{S}(T+1)\epsilon+\epsilon^{2}+\sqrt{T}\epsilon.\end{split}

Here we use dt1=1\|d_{t}\|_{1}=1 for any t𝒯t\in\mathcal{T} and dt2dt1\|d_{t}\|_{2}\leq\|d_{t}\|_{1}.

Step 2: Near-optimality of dd (and π\pi) for (L)\mathcal{M}(L). Now we show that dd is near-optimal for the MDP (L)\mathcal{M}(L). To see this, define Δ:=ALy+zcL\Delta:=A_{L}^{\top}y+z-c_{L}. Here Δ\Delta can be viewed both as a vector in SA(T+1)\mathbb{R}^{SA(T+1)} and a sequence of T+1T+1 matrices {Δt}t𝒯\{\Delta_{t}\}_{t\in\mathcal{T}} with ΔtS×A\Delta_{t}\in\mathbb{R}^{S\times A}. Then we have Δ22ϵ2\|\Delta\|_{2}^{2}\leq\epsilon^{2}.

Consider the following linear program with variables y^,z^\hat{y},\hat{z}:

maximizey^,z^by^subject toALy^+z^=cL+Δ,z^0,\begin{array}[]{ll}\text{maximize}_{\hat{y},\hat{z}}&b^{\top}\hat{y}\\ \text{subject to}&A_{L}^{\top}\hat{y}+\hat{z}=c_{L}+\Delta,\quad\hat{z}\geq 0,\end{array} (28)

and denote y^,z^\hat{y}^{\star},\hat{z}^{\star} as its optimal solution. Then since y,zy,z is its feasible solution, we have byby^b^{\top}y\leq b^{\top}\hat{y}^{\star}.

Note that the dual problem of (28) can be written as

minimized^(cL+Δ)d^subject toALd^=b,d^0.\begin{array}[]{ll}\text{minimize}_{\hat{d}}&(c_{L}+\Delta)^{\top}\hat{d}\\ \text{subject to}&A_{L}\hat{d}=b,\quad\hat{d}\geq 0.\end{array} (29)

By Lemma 2 and (7), if d^\hat{d}^{\star} solves (29), then any π^Π(d^)\hat{\pi}^{\star}\in\Pi(\hat{d}^{\star}) is an optimal policy for the MDP ^(L)\hat{\mathcal{M}}(L) with transitions Pt(s|s,a,Lt)P_{t}(s^{\prime}|s,a,L_{t}) and rewards rt(s,a,Lt)Δt(s,a)r_{t}(s,a,L_{t})-\Delta_{t}(s,a) (with L={Lt}t𝒯L=\{L_{t}\}_{t\in\mathcal{T}} fixed), and the optimal value of the objective function of (29) is equal to the negative optimal value of the MDP ^(L)\hat{\mathcal{M}}(L). By the strong duality of linear programs, we see that by^=(cL+Δ)d^b^{\top}\hat{y}^{\star}=(c_{L}+\Delta)^{\top}\hat{d}^{\star} is equal to the negative optimal value of the MDP ^(L)\hat{\mathcal{M}}(L). Note that the components of cLc_{L} are negative rewards, and that the existence of the optimal solution d^\hat{d}^{\star} follows from the existence of optimal policies for any finite MDP with finite horizon and Lemma 2 and (7).

Now, let dd^{\star} be an optimal solution to the following linear program:

minimizedcLdsubject toALd=b,d0.\begin{array}[]{ll}\text{minimize}_{d}&c_{L}^{\top}d\\ \text{subject to}&A_{L}d=b,\quad d\geq 0.\end{array} (30)

Then again by Lemma 2 and (7), cLdc_{L}^{\top}d^{\star} is equal to the negative optimal value of the MDP (L)\mathcal{M}(L). Hence by applying Lemma 13 to the MDPs ^(L)\hat{\mathcal{M}}(L) and (L)\mathcal{M}(L),

|cLd(cL+Δ)d^|Δ1,=t=0Tmaxs𝒮,a𝒜|Δt(s,a)|t𝒯,s𝒮,a𝒜|Δt(s,a)|SA(T+1)ϵ.\begin{split}|c_{L}^{\top}d^{\star}-(c_{L}+\Delta)^{\top}\hat{d}^{\star}|&\leq\|\Delta\|_{1,\infty}=\sum_{t=0}^{T}\max_{s\in\mathcal{S},a\in\mathcal{A}}|\Delta_{t}(s,a)|\leq\sum_{t\in\mathcal{T},s\in\mathcal{S},a\in\mathcal{A}}|\Delta_{t}(s,a)|\\ &\leq\sqrt{SA(T+1)}\epsilon.\end{split}

Putting the above bounds together, we conclude that

cLdcLd=cLdby+bycLdcLdby+by^cLd=cLdby+(cL+Δ)d^cLdSA(T2+T+2)rmaxS(T+1)ϵ+ϵ2+Tϵ+SA(T+1)ϵ.\begin{split}c_{L}^{\top}d-c_{L}^{\top}d^{\star}&=c_{L}^{\top}d-b^{\top}y+b^{\top}y-c_{L}^{\top}d^{\star}\\ &\leq c_{L}^{\top}d-b^{\top}y+b^{\top}\hat{y}^{\star}-c_{L}^{\top}d^{\star}\\ &=c_{L}^{\top}d-b^{\top}y+(c_{L}+\Delta)^{\top}\hat{d}^{\star}-c_{L}^{\top}d^{\star}\\ &\leq SA(T^{2}+T+2)r_{\max}\sqrt{S}(T+1)\epsilon+\epsilon^{2}+\sqrt{T}\epsilon+\sqrt{SA(T+1)}\epsilon.\end{split} (31)

Since dd is a feasible solution to (30), again by Lemma 2, (31) implies that π\pi, by definition satisfying πΠ(d)\pi\in\Pi(d) from (25), is a (CS,A,Tϵ+ϵ2)(C_{S,A,T}\epsilon+\epsilon^{2})-suboptimal solution to the MDP (L)\mathcal{M}(L), where CS,A,T=S32A(T+2)3rmax+SA(T+1)+TC_{S,A,T}=S^{\frac{3}{2}}A(T+2)^{3}r_{\max}+\sqrt{SA(T+1)}+\sqrt{T}.

Step 3: Bounding the gap between LL and Γ(π)\Gamma(\pi). We then bound the difference between LL and Γ(π)\Gamma(\pi) by proving the following for all tt:

s𝒮a𝒜|Γ(π)t(s,a)Lt(s,a)|(CP+1)t+11CPSϵ.\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}|\Gamma(\pi)_{t}(s,a)-L_{t}(s,a)|\leq\frac{(C_{P}+1)^{t+1}-1}{C_{P}}\sqrt{S}\epsilon. (32)

This is again proved by induction. When t=0t=0, it holds by noticing Γ(π)0(s,a)=d0(s,a)\Gamma(\pi)_{0}(s,a)=d_{0}(s,a) and (24). Now suppose (32) holds for tt, consider when t+1t+1, by the definition of Γ(π)\Gamma(\pi), we have

Γ(π)t+1(s,a)=πt+1(a|s)s𝒮a𝒜Γ(π)t(s,a)Pt(s|s,a,Γ(π)t).\Gamma(\pi)_{t+1}(s,a)=\pi_{t+1}(a|s)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\Gamma(\pi)_{t}(s^{\prime},a^{\prime})P_{t}(s|s^{\prime},a^{\prime},\Gamma(\pi)_{t}). (33)

Then

\begin{split}\sum_{s\in\mathcal{S}}&\sum_{a\in\mathcal{A}}|\Gamma(\pi)_{t+1}(s,a)-L_{t+1}(s,a)|\\ &=\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\Gamma(\pi)_t(s^{\prime},a^{\prime})P_t(s|s^{\prime},a^{\prime},\Gamma(\pi)_t)-\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime})|\\ &\leq\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\Gamma(\pi)_t(s^{\prime},a^{\prime})\left(P_t(s|s^{\prime},a^{\prime},\Gamma(\pi)_t)-P_t(s|s^{\prime},a^{\prime},L_t)\right)|\\ &\qquad\qquad+\sum_{s\in\mathcal{S}}\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}|\Gamma(\pi)_t(s^{\prime},a^{\prime})-L_t(s^{\prime},a^{\prime})|P_t(s|s^{\prime},a^{\prime},L_t)\\ &\qquad\qquad+\sum_{s\in\mathcal{S}}|\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}L_t(s^{\prime},a^{\prime})P_t(s|s^{\prime},a^{\prime},L_t)-\sum_{a^{\prime}\in\mathcal{A}}L_{t+1}(s,a^{\prime})|\\ &\leq\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}\Gamma(\pi)_t(s^{\prime},a^{\prime})\sum_{s\in\mathcal{S}}|P_t(s|s^{\prime},a^{\prime},\Gamma(\pi)_t)-P_t(s|s^{\prime},a^{\prime},L_t)|\\ &\qquad\qquad+\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}|\Gamma(\pi)_t(s^{\prime},a^{\prime})-L_t(s^{\prime},a^{\prime})|+\sqrt{S}\epsilon\\ &\leq(C_P+1)\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}|\Gamma(\pi)_t(s^{\prime},a^{\prime})-L_t(s^{\prime},a^{\prime})|+\sqrt{S}\epsilon\\ &\leq(C_P+1)\frac{(C_P+1)^{t+1}-1}{C_P}\sqrt{S}\epsilon+\sqrt{S}\epsilon=\frac{(C_P+1)^{t+2}-1}{C_P}\sqrt{S}\epsilon.\end{split} (34)

This concludes the induction. Consequently,

PΓ(π)PL,1=maxt𝒯PtΓ(π)PtL,1maxt𝒯CPΓ(π)tLt1((CP+1)T+11)Sϵ,\begin{split}\|P^{\Gamma(\pi)}-P^{L}\|_{\infty,1}&=\max_{t\in\mathcal{T}}\|P_{t}^{\Gamma(\pi)}-P_{t}^{L}\|_{\infty,1}\leq\max_{t\in\mathcal{T}}C_{P}\|\Gamma(\pi)_{t}-L_{t}\|_{1}\\ &\leq((C_{P}+1)^{T+1}-1)\sqrt{S}\epsilon,\end{split}

and similarly,

rΓ(π)rL1,=t=0TrtΓ(π)rtLCrt=0TΓ(π)tLt1Cr(CP+1)T+2(T+2)CP1CP2Sϵ.\begin{split}\|r^{\Gamma(\pi)}-r^{L}\|_{1,\infty}&=\sum_{t=0}^{T}\|r_{t}^{\Gamma(\pi)}-r_{t}^{L}\|_{\infty}\leq C_{r}\sum_{t=0}^{T}\|\Gamma(\pi)_{t}-L_{t}\|_{1}\\ &\leq C_{r}\dfrac{(C_{P}+1)^{T+2}-(T+2)C_{P}-1}{C_{P}^{2}}\sqrt{S}\epsilon.\end{split}

Finally, by applying Lemma 13 to the MDPs (L)\mathcal{M}(L) and (Γ(π))\mathcal{M}(\Gamma(\pi)), and utilizing the fact that π\pi is (CS,A,Tϵ+ϵ2)(C_{S,A,T}\epsilon+\epsilon^{2})-suboptimal for the MDP (L)\mathcal{M}(L), we see that

Expl(π)=Vμ0(Γ(π))Vμ0π(Γ(π))=Vμ0(Γ(π))Vμ0(L)+Vμ0(L)Vμ0π(L)+Vμ0π(L)Vμ0π(Γ(π))T(T+1)rmax((CP+1)T+11)Sϵ+2Cr(CP+1)T+2(T+2)CP1CP2Sϵ+CS,A,Tϵ+ϵ2\begin{split}\text{Expl}(\pi)&=V_{\mu_{0}}^{\star}(\Gamma(\pi))-V_{\mu_{0}}^{\pi}(\Gamma(\pi))\\ &=V_{\mu_{0}}^{\star}(\Gamma(\pi))-V_{\mu_{0}}^{\star}(L)+V_{\mu_{0}}^{\star}(L)-V_{\mu_{0}}^{\pi}(L)+V_{\mu_{0}}^{\pi}(L)-V_{\mu_{0}}^{\pi}(\Gamma(\pi))\\ &\leq T(T+1)r_{\max}((C_{P}+1)^{T+1}-1)\sqrt{S}\epsilon\\ &\quad+2C_{r}\dfrac{(C_{P}+1)^{T+2}-(T+2)C_{P}-1}{C_{P}^{2}}\sqrt{S}\epsilon+C_{S,A,T}\epsilon+\epsilon^{2}\end{split}

Here Vμ0(L)=maxπVμ0π(L)=s𝒮μ0(s)[V0(L)]sV_{\mu_{0}}^{\star}(L)=\max_{\pi^{\prime}}V_{\mu_{0}}^{\pi^{\prime}}(L)=\sum_{s\in\mathcal{S}}\mu_{0}(s)[V_{0}^{\star}(L)]_{s} is the optimal value of the MDP (L)\mathcal{M}(L) for any mean-field flow LL. ∎

6.4 Proof of Theorem 9

The feasibility of (y,z,L) is immediate from the definition. In particular, L=\Gamma(\pi) implies that L\geq 0 and \mathbf{1}^{\top}L_t=1 for t\in\mathcal{T}; the same argument leading to Corollary 7 implies that \|y\|_2\leq S(T+1)(T+2)r_{\max}/2 and \mathbf{1}^{\top}z\leq SA(T^2+T+2)r_{\max}; and the proof of Proposition 6 (in particular, the Bellman optimality equations) shows that z\geq 0.

Furthermore, again by the construction and the proof of Proposition 6, taking L=\Gamma(\pi) we also have A_L L=b and A_L^{\top}y+z=c_L. Hence

fMF-OMO(y,z,L)=ALLb22+ALy+zcL22+zL=zL.f^{\text{MF-OMO}}(y,z,L)=\|A_{L}L-b\|_{2}^{2}+\|A_{L}^{\top}y+z-c_{L}\|_{2}^{2}+z^{\top}L=z^{\top}L.

It remains to show that zLϵz^{\top}L\leq\epsilon. To see this, notice that

zL=(cLALy)L=cLLyALL=cLLby.z^{\top}L=(c_{L}-A_{L}^{\top}y)^{\top}L=c_{L}^{\top}L-y^{\top}A_{L}L=c_{L}^{\top}L-b^{\top}y.

Now recall that L=Γ(π)L=\Gamma(\pi) implies that Lt(s,a)=π,L(st=s,at=a)L_{t}(s,a)=\mathbb{P}^{\pi,L}(s_{t}=s,a_{t}=a) for t𝒯t\in\mathcal{T} by the proof of Lemma 3. Hence

cLL=s𝒮,a𝒜,t𝒯rt(s,a,Lt)Lt(s,a)=s𝒮,a𝒜,t𝒯rt(s,a,Lt)π,L(st=s,at=a)=Vμ0π(L).\begin{split}c_{L}^{\top}L&=-\sum_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}r_{t}(s,a,L_{t})L_{t}(s,a)\\ &=-\sum_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}r_{t}(s,a,L_{t})\mathbb{P}^{\pi,L}(s_{t}=s,a_{t}=a)=-V_{\mu_{0}}^{\pi}(L).\end{split}

Combined with the fact that by=s𝒮μ0(s)[V0(L)]s=Vμ0(L)b^{\top}y=-\sum_{s\in\mathcal{S}}\mu_{0}(s)[V_{0}^{\star}(L)]_{s}=-V_{\mu_{0}}^{\star}(L), we see that

zL=cLLby=Vμ0(L)Vμ0π(L)=Vμ0(Γ(π))Vμ0π(Γ(π))=Expl(π)ϵ.z^{\top}L=c_{L}^{\top}L-b^{\top}y=V_{\mu_{0}}^{\star}(L)-V_{\mu_{0}}^{\pi}(L)=V_{\mu_{0}}^{\star}(\Gamma(\pi))-V_{\mu_{0}}^{\pi}(\Gamma(\pi))=\text{Expl}(\pi)\leq\epsilon.
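
Since the identity z^{\top}L=\text{Expl}(\pi) is the crux of this argument, the following illustrative sketch (with hypothetical names; r_fn(t, L_t) and P_fn(t, L_t) are assumed to return the reward table of shape (S, A) and the transition tensor of shape (S, A, S)) computes Expl(\pi) directly from its definition: induce L=\Gamma(\pi), freeze it, and compare the optimal value of \mathcal{M}(L) with the value of \pi.

```python
import numpy as np

def exploitability(pi, r_fn, P_fn, mu0):
    """Expl(pi) = V*_{mu0}(Gamma(pi)) - V^pi_{mu0}(Gamma(pi)) for a finite-horizon MFG.
    pi: list of T+1 arrays (S, A); r_fn(t, L_t) -> (S, A); P_fn(t, L_t) -> (S, A, S)."""
    T = len(pi) - 1
    # Mean-field flow L = Gamma(pi) induced by pi.
    L = [mu0[:, None] * pi[0]]
    for t in range(T):
        nu = np.einsum("sa,sap->p", L[t], P_fn(t, L[t]))
        L.append(nu[:, None] * pi[t + 1])
    # Freeze the flow and evaluate the induced MDP M(L).
    r = [r_fn(t, L[t]) for t in range(T + 1)]
    P = [P_fn(t, L[t]) for t in range(T)]
    # Optimal value V* and value of pi, both by backward induction.
    V_star = r[T].max(axis=1)
    V_pi = (pi[T] * r[T]).sum(axis=1)
    for t in range(T - 1, -1, -1):
        Q_star = r[t] + np.einsum("sap,p->sa", P[t], V_star)
        V_star = Q_star.max(axis=1)
        Q_pi = r[t] + np.einsum("sap,p->sa", P[t], V_pi)
        V_pi = (pi[t] * Q_pi).sum(axis=1)
    return mu0 @ V_star - mu0 @ V_pi
```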

7 Extensions

The MF-OMO framework can be extended to other variants of mean-field games. This section details its extension to personalized mean-field games; the extension to multi-population mean-field games is straightforward and omitted here.

Personalized mean-field games are mean-field games involving non-homogeneous players, which can be found in many applications. In such problems, every player is associated with some information (type) which characterizes the heterogeneity among players. Personalized mean-field games generalize mean-field games by incorporating the information (type) into the state of players, with heterogeneity in the initial distributions. To deal with the heterogeneity, one needs to define a “stronger” Nash equilibrium solution for these personalized mean-field games, where the policy sequence π\pi is optimal from any initial state given the mean-field flow. We say that the policy sequence π={πt}t𝒯\pi=\{\pi_{t}\}_{t\in\mathcal{T}} and a mean-field flow L={Lt}t𝒯L=\{L_{t}\}_{t\in\mathcal{T}} constitute a refined Nash equilibrium solution of the finite-time horizon personalized mean-field game, if the following conditions are satisfied.

Definition 7.1 (Refined Nash equilibrium solution).
  • 1)

    (Optimality) Fixing {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}}, for any initial state s𝒮s\in\mathcal{S}, {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} solves the following optimization problem:

    maximize{πt}t𝒯𝔼[t=0Trt(st,at,Lt)|s0=s]subject tost+1Pt(|st,at,Lt),atπt(|st),t𝒯\{T},\begin{array}[]{ll}\text{maximize}_{\{\pi_{t}\}_{t\in\mathcal{T}}}&\mathbb{E}[\sum_{t=0}^{T}r_{t}(s_{t},a_{t},L_{t})|s_{0}=s]\\ \text{subject to}&s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t}),\,a_{t}\sim\pi_{t}(\cdot|s_{t}),\,t\in\mathcal{T}\backslash\{T\},\end{array} (35)

    i.e., {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} is optimal for the representative agent given the mean-field flow {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}};

  • 2)

    (Consistency) Fixing {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}}, the consistency of mean-field flow holds, i.e.,

    Lt=st,at,where st+1Pt(|st,at,Lt),atπt(|st),s0μ0,t𝒯\{T}.\begin{split}&L_{t}=\mathbb{P}_{s_{t},a_{t}},\\ &\text{where }s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t}),\,a_{t}\sim\pi_{t}(\cdot|s_{t}),\,s_{0}\sim\mu_{0},\,t\in\mathcal{T}\backslash\{T\}.\end{split} (36)

For personalized mean-field games, we introduce a modified version of exploitability. Instead of simply averaging value functions over μ0\mu_{0} (cf. (17)), here the averaging is over a uniform initial distribution, which corresponds to the arbitrariness of the initial state in (35). The exploitability is defined as follows:

Explrefined(π):=1Smaxπs𝒮([V0π(Γ(π))]s[V0π(Γ(π))]s).\text{Expl}^{\text{refined}}(\pi):=\dfrac{1}{S}\max_{\pi^{\prime}}\sum_{s\in\mathcal{S}}\left([V_{0}^{\pi^{\prime}}(\Gamma(\pi))]_{s}-[V_{0}^{\pi}(\Gamma(\pi))]_{s}\right).

As in Section 3, π\pi is a refined Nash equilibrium solution if and only if Explrefined(π)=0\text{Expl}^{\text{refined}}(\pi)=0.

To find the refined Nash equilibrium solution, we need to find the “refined” optimal policy for the mean-field induced MDP as described in (35), which is optimal under any initial state s0=s𝒮s_{0}=s\in\mathcal{S} (instead of an initial state with distribution μ0\mu_{0} as in (1)). The proposition below characterizes the linear program formulation for (35), a counterpart of Lemma 2. For brevity, for any ν0Δ(𝒮)\nu_{0}\in\Delta(\mathcal{S}), we denote π,L,ν0()\mathbb{P}^{\pi,L,\nu_{0}}(\cdot) for the probability distribution of any representative agent taking policy π\pi under the mean-field flow LL, i.e., the state and/or action distribution generated by s0ν0s_{0}\sim\nu_{0}, st+1Pt(|st,at,Lt)s_{t+1}\sim P_{t}(\cdot|s_{t},a_{t},L_{t}), atπt(|st)a_{t}\sim\pi_{t}(\cdot|s_{t}) for t=0,,T1t=0,\dots,T-1.

Proposition 14.

Fix {Lt}t𝒯\{L_{t}\}_{t\in\mathcal{T}}. Suppose that {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} is an ϵ\epsilon-suboptimal policy for (35) for any s𝒮s\in\mathcal{S}. Define ν0(s)=1S\nu_{0}(s)=\frac{1}{S} for all s𝒮s\in\mathcal{S}, νt(s):=π,L,ν0(st=s)\nu_{t}(s):=\mathbb{P}^{\pi,L,\nu_{0}}(s_{t}=s), and dt(s,a):=νt(s)πt(a|s)d_{t}(s,a):=\nu_{t}(s)\pi_{t}(a|s) for s𝒮s\in\mathcal{S} and a𝒜a\in\mathcal{A}. Then πΠ(d)\pi\in\Pi(d) and d={dt}t𝒯d=\{d_{t}\}_{t\in\mathcal{T}} is a feasible ϵ\epsilon-suboptimal solution to the following linear program:

maximizedt𝒯s𝒮a𝒜dt(s,a)rt(s,a,Lt)subject tos𝒮a𝒜dt(s,a)Pt(s|s,a,Lt)=a𝒜dt+1(s,a),s𝒮,t𝒯\{T},a𝒜d0(s,a)=1S,s𝒮,dt(s,a)0,s𝒮,a𝒜,t𝒯.\begin{array}[]{ll}\text{maximize}_{d}&\sum\limits_{t\in\mathcal{T}}\sum\limits_{s\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}d_{t}(s,a)r_{t}(s,a,L_{t})\\ \text{subject to}&\sum\limits_{s\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}d_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t})=\sum\limits_{a\in\mathcal{A}}d_{t+1}(s^{\prime},a),\\ &\hskip 115.23373pt\forall s^{\prime}\in\mathcal{S},t\in\mathcal{T}\backslash\{T\},\\ &\sum\limits_{a\in\mathcal{A}}d_{0}(s,a)=\frac{1}{S},\quad\forall s\in\mathcal{S},\\ &d_{t}(s,a)\geq 0,\quad\forall s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}.\end{array} (37)

Conversely, suppose that d={dt}t𝒯d=\{d_{t}\}_{t\in\mathcal{T}} is a feasible ϵ\epsilon-suboptimal solution to the above linear program (37). Then for any πΠ(d)\pi\in\Pi(d), {πt}t𝒯\{\pi_{t}\}_{t\in\mathcal{T}} is an SϵS\epsilon-suboptimal policy for (35) for any s𝒮s\in\mathcal{S}. In addition, the optimal value of the objective function of the above linear program (37) is equal to the average optimal value of (35) over s𝒮s\in\mathcal{S}.
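
For illustration, the linear program (37) can be assembled and solved directly with a generic LP solver. The sketch below uses names introduced only for this example; it assumes the rewards r[t] of shape (S, A) and the transitions P[t] of shape (S, A, S) have already been evaluated at the fixed mean-field flow L, and nu0 is the initial state distribution (uniform, i.e., 1/S, in Proposition 14).

```python
import numpy as np
from scipy.optimize import linprog

def solve_occupation_lp(r, P, nu0):
    """Solve the occupation-measure LP (37) for the mean-field induced MDP with L frozen."""
    T = len(r) - 1
    S, A = r[0].shape
    n = (T + 1) * S * A
    idx = lambda t, s, a: t * S * A + s * A + a          # flattening convention
    c = np.concatenate([-r[t].ravel() for t in range(T + 1)])  # maximize -> minimize the negative
    A_eq = np.zeros((T * S + S, n))
    b_eq = np.zeros(T * S + S)
    row = 0
    for t in range(T):                                   # flow conservation constraints
        for sp in range(S):
            for s in range(S):
                for a in range(A):
                    A_eq[row, idx(t, s, a)] += P[t][s, a, sp]
            for a in range(A):
                A_eq[row, idx(t + 1, sp, a)] -= 1.0
            row += 1
    for s in range(S):                                   # initial distribution constraints
        for a in range(A):
            A_eq[row, idx(0, s, a)] = 1.0
        b_eq[row] = nu0[s]
        row += 1
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n, method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x.reshape(T + 1, S, A), -res.fun          # occupation measure and optimal value
```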

Note that instead of the specified initial distribution \mu_0 in Lemma 2, Proposition 14 chooses a special initial distribution in which all components are equal to \frac{1}{S}. The proof is a direct combination of Lemma 2 and the following classical result for MDPs.

Lemma 15.

If π\pi is an optimal policy for the MDP \mathcal{M} with initial distribution ν0\nu_{0} satisfying ν0(s)>0\nu_{0}(s)>0 for all s𝒮s\in\mathcal{S}, then π\pi is an optimal policy for the MDP \mathcal{M} with any initial state and any initial distribution.

Then similar to Theorem 4 one can establish the corresponding equivalence between refined Nash equilibrium solutions and optimal solutions to a feasibility optimization problem. This is derived by replacing the μ0\mu_{0} in bb in Theorem 4 with 1S𝟏S\frac{1}{S}{\bf 1}\in\mathbb{R}^{S}. More precisely, based on the above result, one can characterize the optimality conditions of π\pi given LL, the counterpart of condition (A) with μ0\mu_{0} replaced by the uniform distribution over 𝒮\mathcal{S}. Then combined with the consistency condition for the mean-field flow (36), one can establish the corresponding equivalence between refined Nash equilibrium solutions and optimal solutions to the following feasibility optimization problem:

minimizeπ,y,z,d,L0subject toALd=b,ALy+z=cL,zd=0,d0,z0,πt(a|s)a𝒜dt(s,a)=dt(s,a),s𝒮,a𝒜,t𝒯L0(s,a)=μ0(s)π0(a|s),s𝒮,a𝒜,Lt+1(s,a)=πt+1(a|s)s𝒮a𝒜Lt(s,a)Pt(s|s,a,Lt),s𝒮,t𝒯\{T},a𝒜L0(s,a)=μ0(s),s𝒮,L0,\begin{array}[]{ll}\text{\rm minimize}_{\pi,y,z,d,L}&0\\ \text{\rm subject to}&A_{L}d=b^{\prime},\quad A_{L}^{\top}y+z=c_{L},\\ &z^{\top}d=0,\quad d\geq 0,\quad z\geq 0,\\ &\pi_{t}(a|s)\sum_{a^{\prime}\in\mathcal{A}}d_{t}(s,a^{\prime})=d_{t}(s,a),\,\forall s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}\\ &L_{0}(s,a)=\mu_{0}(s)\pi_{0}(a|s),\quad\forall s\in\mathcal{S},a\in\mathcal{A},\\ &L_{t+1}(s^{\prime},a^{\prime})=\pi_{t+1}(a^{\prime}|s^{\prime})\sum\limits_{s\in\mathcal{S}}\sum\limits_{a\in\mathcal{A}}L_{t}(s,a)P_{t}(s^{\prime}|s,a,L_{t}),\\ &\hskip 151.9376pt\forall s^{\prime}\in\mathcal{S},t\in\mathcal{T}\backslash\{T\},\\ &\sum_{a\in\mathcal{A}}L_{0}(s,a)=\mu_{0}(s),\quad\forall s\in\mathcal{S},\quad L\geq 0,\end{array} (38)

where AL,cLA_{L},c_{L} are set as (8) and (9), and b=[0,,0,1S𝟏]S(T+1)b^{\prime}=[0,\dots,0,\frac{1}{S}{\bf 1}]\in\mathbb{R}^{S(T+1)}. More precisely, (π,L)(\pi,L) is a refined Nash equilibrium solution if and only if (π,y,z,d,L)(\pi,y,z,d,L) is a solution to (38) for some d,y,zd,y,z.

Note that here condition (A) may not hold; therefore, the consistency condition on the mean-field flow cannot be implied by d=L. As a result, the policy \pi is directly included as part of the variables.

In addition, noticing that L_t(s,a)=\sum_{a^{\prime}\in\mathcal{A}}L_t(s,a^{\prime})\pi_t(a|s) (which follows by summing over a^{\prime} on both sides of the fifth row of the constraints in (38)), and denoting b=[0^{\top},\dots,0^{\top},\mu_0^{\top}]^{\top}\in\mathbb{R}^{S(T+1)}, the optimization problem (38) can be further simplified, leading to the following theorem.

Theorem 16.

Solving refined Nash equilibrium solution(s) of the personalized mean-field game ((35) and (36)) is equivalent to solving the following feasibility optimization problem:

minimizey,z,d,L0subject toALd=b,ALy+z=cL,zd=0,d0,z0,ALL=b,L0,Lt(s,a)a𝒜dt(s,a)=dt(s,a)a𝒜Lt(s,a),s𝒮,a𝒜,t𝒯.\begin{array}[]{ll}\text{\rm minimize}_{y,z,d,L}&0\\ \text{\rm subject to}&A_{L}d=b^{\prime},\quad A_{L}^{\top}y+z=c_{L},\\ &z^{\top}d=0,\quad d\geq 0,\quad z\geq 0,\\ &A_{L}L=b,\quad L\geq 0,\\ &L_{t}(s,a)\sum\nolimits_{a^{\prime}\in\mathcal{A}}d_{t}(s,a^{\prime})=d_{t}(s,a)\sum\nolimits_{a^{\prime}\in\mathcal{A}}L_{t}(s,a^{\prime}),\\ &\hskip 124.6232pt\forall s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}.\end{array} (39)

where AL,cLA_{L},c_{L} are set as (8) and (9), and b=[0,,0,1S𝟏]S(T+1)b^{\prime}=[0,\dots,0,\frac{1}{S}{\bf 1}]\in\mathbb{R}^{S(T+1)}. More precisely, if (π,L)(\pi,L) is a refined Nash equilibrium solution, then (y,z,d,L)(y,z,d,L) is a solution to (39) for some y,z,dy,z,d. Conversely, if (y,z,d,L)(y,z,d,L) is a solution to (39), then for any πΠ(d)\pi\in\Pi(d), (π,L)(\pi,L) constitutes a refined Nash equilibrium solution.

Note that the last row of constraints in (39) is equivalent to requiring that Π(d)=Π(L)\Pi(d)=\Pi(L), where the equality is in terms of sets.

8 Numerical experiments

In this section, we present numerical results of algorithms based on the MF-OMO optimization framework. In the first part, MF-OMO is compared with the existing state-of-the-art algorithms for solving MFGs on the SIS problem introduced in [20]. In the second part, it is shown that different initializations of MF-OMO algorithms converge to different NEs in the example introduced in Section 2.

8.1 Comparison with existing algorithms

In this section, we focus on the SIS problem introduced in [20]. There are a large number of agents who, at each time step, choose either to social distance or to go out. Susceptible (S) agents who choose to go out may get infected (I), with probability proportional to the number of infected agents, while susceptible agents who opt for social distancing stay healthy. Meanwhile, infected agents recover with a certain probability at each time step. Agents in the system aim to find the best strategy to minimize their costs induced by social distancing and infection. The parameters remain the same as in [20].

Figures 1 and 2 show the comparisons between MF-OMO and existing algorithms on this SIS problem with T=50 and T=100, respectively. Here, MF-OMO is tested with different optimization algorithms including PGD (cf. §5), Adam [43], and NAdam [26], against online mirror descent (OMD) [58], fictitious play (FP) [59], and prior descent (PD) [20] with different choices of hyperparameters. (In the figure legends, the numbers after the OMD and FP algorithms are the step-sizes/learning-rates of these algorithms, while the number pairs after the PD algorithms are the temperatures and inner iterations of PD, respectively.) To make fair comparisons, the maximum number of iterations for each algorithm is set differently so that the total runtimes are comparable and MF-OMO algorithms are given the smallest budget in terms of both the total number of iterations and the total runtime. With the same uniform initialization, it is clear from the figures that MF-OMO with Adam outperforms the remaining algorithms. The normalized exploitability (defined as the exploitability divided by the initial exploitability) of MF-OMO Adam quickly drops below 10^{-2} in terms of both the number of iterations and the runtime. All algorithms except for the three MF-OMO variants fail to achieve 10^{-2} for both T=50 and T=100.

Figure 1: Comparison of different algorithms on the SIS problem with T=50. (a) Convergence against iterations; (b) convergence against runtime.
Figure 2: Comparison of different algorithms on the SIS problem with T=100. (a) Convergence against iterations; (b) convergence against runtime.

8.2 Multiple NEs

In this section, we focus on the problem introduced in Section 2 and show how algorithms based on the MF-OMO framework converge with different initializations. We choose S=A=n=5, T=10, r^1=r^2=r^3=1.5; and r^4, r^5, and \{C_t\}_{t=1}^{10} are all independently sampled uniformly between 0 and 1.

From the discussions in Section 2, there are at least 3 different NEs, which correspond to agents staying in state 1, 2 and 3, respectively. Here, several different sets of random initializations are tested, denoted by RI(i,\epsilon), which represent initializations randomly generated in the \epsilon-neighborhood of the i-th NE (i=1,2,3). Each RI(i,\epsilon) set consists of 20 independent samples, and the convergence behavior of the MF-OMO NAdam algorithm is recorded in Table 1. (We choose NAdam as it converges significantly faster than other algorithms such as PGD and Adam for this specific problem, and our primary goal is to study the behavior after convergence.) For each neighborhood of size \epsilon and NE i, the convergence behavior in RI(i,\epsilon) is categorized using the following criteria: p_0 counts the proportion of samples whose normalized exploitabilities do not drop to 10^{-3} after 400 iterations; p_1 counts the proportion of samples that converge to an (approximate) NE (with normalized exploitability below 10^{-3}) closest to the i-th NE among the three NEs; p_2 counts the proportion of samples that converge to an (approximate) NE closer to one of the other two NEs.

ϵ | NE 1: p0 / p1 / p2 | NE 2: p0 / p1 / p2 | NE 3: p0 / p1 / p2
0.05 | 0% / 70% / 30% | 5% / 75% / 20% | 0% / 65% / 35%
0.2 | 0% / 50% / 50% | 5% / 65% / 30% | 5% / 60% / 35%
0.8 | 5% / 45% / 50% | 5% / 40% / 55% | 0% / 30% / 70%
1.0 | 0% / 40% / 60% | 0% / 45% / 55% | 0% / 25% / 75%
Table 1: Convergence behavior of different initializations.

From Table 1, it is clear that most of the samples converge to some approximate NE with exploitability below the target 10^{-3} tolerance, regardless of the initializations. In addition, different initializations lead to different solutions. Specifically, when initializations are close to some NE, MF-OMO NAdam is more likely to converge to that specific NE, which is consistent with our theoretical study. Finally, as the neighborhood size \epsilon increases, the convergence behavior becomes, as expected, more chaotic, i.e., less concentrated on the center NE.

References

  • [1] Yves Achdou, Pierre Cardaliaguet, François Delarue, Alessio Porretta, and Filippo Santambrogio. Mean Field Games: Cetraro, Italy 2019, volume 2281. Springer Nature, 2021.
  • [2] Ilan Adler and Sushil Verma. The linear complementarity problem, Lemke algorithm, perturbation, and the complexity class PPAD. https://optimization-online.org/wp-content/uploads/2011/03/2948.pdf, 2011. Last visited: 2022 September 6.
  • [3] Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Fitted Q-learning in mean-field games. arXiv preprint arXiv:1912.13309, 2019.
  • [4] Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Q-learning in regularized mean-field games. Dynamic Games and Applications, pages 1–29, 2022.
  • [5] Donald G Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM (JACM), 12(4):547–560, 1965.
  • [6] Andrea Angiuli, Jean-Pierre Fouque, and Mathieu Laurière. Unified reinforcement Q-learning for mean field game and control problems. Mathematics of Control, Signals, and Systems, pages 1–55, 2022.
  • [7] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
  • [8] Amir Beck. First-Order Methods in Optimization. SIAM, 2017.
  • [9] Alain Bensoussan, Jens Frehse, and Sheung Chi Phillip Yam. The master equation in mean field theory. Journal de Mathématiques Pures et Appliquées, 103(6):1441–1474, 2015.
  • [10] Abhay G Bhatt and Vivek S Borkar. Occupation measures for controlled Markov processes: Characterization and optimality. The Annals of Probability, pages 1531–1562, 1996.
  • [11] Géraldine Bouveret, Roxana Dumitrescu, and Peter Tankov. Mean-field games of optimal stopping: A relaxed solution approach. SIAM Journal on Control and Optimization, 58(4):1795–1821, 2020.
  • [12] Rainer Buckdahn, Boualem Djehiche, Juan Li, and Shige Peng. Mean-field backward stochastic differential equations: A limit approach. The Annals of Probability, 37(4):1524–1565, 2009.
  • [13] Pierre Cardaliaguet, François Delarue, Jean-Michel Lasry, and Pierre-Louis Lions. The Master Equation and the Convergence Problem in Mean Field Games. Princeton University Press, 2019.
  • [14] René Carmona and François Delarue. Mean field forward-backward stochastic differential equations. Electronic Communications in Probability, 18:1–15, 2013.
  • [15] René Carmona and François Delarue. Probabilistic Theory of Mean Field Games with Applications I-II. Springer, 2018.
  • [16] René Carmona and Peiqi Wang. A probabilistic approach to extended finite state mean field games. Mathematics of Operations Research, 46(2):471–502, 2021.
  • [17] Laurent Condat. Fast projection onto the simplex and the 1\ell_{1} ball. Mathematical Programming, 158(1):575–585, 2016.
  • [18] Michel Coste. An Introduction to Semialgebraic Geometry, 2000.
  • [19] Michel Coste. An Introduction to O-minimal Geometry. Istituti editoriali e poligrafici internazionali Pisa, 2000.
  • [20] Kai Cui and Heinz Koeppl. Approximately solving mean field games via entropy-regularized deep reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1909–1917. PMLR, 2021.
  • [21] Daniela Pucci De Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.
  • [22] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27, 2014.
  • [23] François Delarue, Daniel Lacker, and Kavita Ramanan. From the master equation to mean field game limit theory: A central limit theorem. Electronic Journal of Probability, 24:1–54, 2019.
  • [24] Eric V Denardo. On linear programming in a Markov decision problem. Management Science, 16(5):281–288, 1970.
  • [25] Cyrus Derman. On sequential decisions and Markov chains. Management Science, 9(1):16–24, 1962.
  • [26] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
  • [27] L. P. D. van den Dries. Tame Topology and O-minimal Structures. London Mathematical Society Lecture Note Series. Cambridge University Press, 1998.
  • [28] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the 1\ell_{1}-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
  • [29] Roxana Dumitrescu, Marcos Leutscher, and Peter Tankov. Control and optimal stopping mean field games: A linear programming approach. Electronic Journal of Probability, 26:1–49, 2021.
  • [30] Roxana Dumitrescu, Marcos Leutscher, and Peter Tankov. Linear programming fictitious play algorithm for mean field games with optimal stopping and absorption. arXiv preprint arXiv:2202.11428, 2022.
  • [31] Romuald Elie, Julien Perolat, Mathieu Laurière, Matthieu Geist, and Olivier Pietquin. On the convergence of model free learning in mean field games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7143–7150, 2020.
  • [32] Zuyue Fu, Zhuoran Yang, Yongxin Chen, and Zhaoran Wang. Actor-critic provably finds nash equilibria of linear-quadratic mean-field games. arXiv preprint arXiv:1910.07498, 2019.
  • [33] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, 2016.
  • [34] Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. Learning mean-field games. Advances in Neural Information Processing Systems, 32, 2019.
  • [35] Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. A general framework for learning mean-field games. arXiv preprint arXiv:2003.06069, 2020.
  • [36] Filip Hanzely, Konstantin Mishchenko, and Peter Richtárik. SEGA: Variance reduction via gradient sketching. Advances in Neural Information Processing Systems, 31, 2018.
  • [37] Nicholas C Henderson and Ravi Varadhan. Damped anderson acceleration with restarts and monotonicity control for accelerating em and em-like algorithms. Journal of Computational and Graphical Statistics, 28(4):834–846, 2019.
  • [38] Arie Hordijk and LCM Kallenberg. Linear programming and Markov decision chains. Management Science, 25(4):352–362, 1979.
  • [39] Minyi Huang, Roland P Malhamé, and Peter E Caines. Large population stochastic dynamic games: Closed-loop Mckean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252, 2006.
  • [40] Youqiang Huang and Masami Kurano. The LP approach in average reward MDPs with multiple cost constraints: The countable state case. Journal of Information and Optimization Sciences, 18(1):33–47, 1997.
  • [41] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.
  • [42] Tobias Kaiser and Andre Opris. Differentiability properties of log-analytic functions. arXiv preprint arXiv:2007.03332, 2020.
  • [43] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [44] Thomas Kurtz and Richard Stockbridge. Stationary solutions and forward equations for controlled and singular martingale problems. Electronic Journal of Probability, 6:1–52, 2001.
  • [45] Thomas G Kurtz and Richard H Stockbridge. Existence of Markov controls and characterization of optimal Markov controls. SIAM Journal on Control and Optimization, 36(2):609–653, 1998.
  • [46] Jean-Michel Lasry and Pierre-Louis Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.
  • [47] Mathieu Lauriere. Numerical methods for mean field games and mean field type control. arXiv preprint arXiv:2106.06231, 2021.
  • [48] Nicolas Le Roux, Mark W Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in Neural Information Processing Systems, 25, 2012.
  • [49] Jun Liu and Jieping Ye. Efficient euclidean projections in linear time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 657–664, 2009.
  • [50] David G Luenberger and Yinyu Ye. Linear and Nonlinear Programming, volume 2. Springer, 1984.
  • [51] Alan S Manne. Linear programming and sequential decisions. Management Science, 6(3):259–267, 1960.
  • [52] D Marker. Tame topology and o-minimal structures by Lou van den Dries. Bulletin of the American Mathematical Society, 37(3):351–358, 2000.
  • [53] Richard Kipp Martin. Large Scale Linear and Integer Optimization: A Unified Approach. Springer Science & Business Media, 2012.
  • [54] KG Murty. Linear complementarity problem, its geometry and applications. Linear Complementarity, Linear and Nonlinear Programming, pages 1–58, 1997.
  • [55] Yurii Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
  • [56] Shunji Osaki and Hisashi Mine. Linear programming algorithms for semi-Markovian decision processes. Journal of Mathematical Analysis and Applications, 22(2):356–381, 1968.
  • [57] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
  • [58] Julien Perolat, Sarah Perrin, Romuald Elie, Mathieu Laurière, Georgios Piliouras, Matthieu Geist, Karl Tuyls, and Olivier Pietquin. Scaling up mean field games with online mirror descent. arXiv preprint arXiv:2103.00623, 2021.
  • [59] Sarah Perrin, Julien Pérolat, Mathieu Laurière, Matthieu Geist, Romuald Elie, and Olivier Pietquin. Fictitious play for mean field games: Continuous time analysis and applications. Advances in Neural Information Processing Systems, 33:13199–13213, 2020.
  • [60] Boris T Polyak. Introduction to Optimization. New York : Optimization Software, Publications Division, 1987.
  • [61] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
  • [62] Naci Saldi, Tamer Basar, and Maxim Raginsky. Markov–Nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56(6):4256–4287, 2018.
  • [63] Paul J Schweitzer and Abraham Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of mathematical analysis and applications, 110(2):568–582, 1985.
  • [64] Jitkomut Songsiri. Projection onto an 1\ell_{1}-norm ball with application to identification of sparse autoregressive models. In Asean Symposium on Automatic Control (ASAC), 2011.
  • [65] Pantelis Sopasakis, Krina Menounou, and Panagiotis Patrinos. Superscs: fast and accurate large-scale conic optimization. In 2019 18th European Control Conference (ECC), pages 1500–1505. IEEE, 2019.
  • [66] Richard H Stockbridge. Time-average control of martingale problems: A linear programming formulation. The Annals of Probability, pages 206–217, 1990.
  • [67] Michael I Taksar. Infinite-dimensional linear programming approach to singular stochastic control. SIAM Journal on Control and Optimization, 35(2):604–625, 1997.
  • [68] Lou Van den Dries and Chris Miller. Geometric categories and o-minimal structures. Duke Mathematical Journal, 84(2):497–540, 1996.
  • [69] Athanasios Vasiliadis. An introduction to mean field games using probabilistic methods. arXiv preprint arXiv:1907.01411, 2019.
  • [70] Mengdi Wang and Yichen Chen. An online primal-dual method for discounted Markov decision processes. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 4516–4521. IEEE, 2016.
  • [71] Alex J Wilkie. Model completeness results for expansions of the ordered field of real numbers by restricted Pfaffian functions and the exponential function. Journal of the American Mathematical Society, 9(4):1051–1094, 1996.
  • [72] Philip Wolfe and George Bernard Dantzig. Linear programming in a Markov chain. Operations Research, 10(5):702–710, 1962.
  • [73] Stephen Wright and Jorge Nocedal. Numerical Optimization. Springer, 1999.
  • [74] Junzi Zhang, Jongho Kim, Brendan O’Donoghue, and Stephen Boyd. Sample efficient reinforcement learning with REINFORCE. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35(12), pages 10887–10895, 2021.
  • [75] Junzi Zhang, Brendan O’Donoghue, and Stephen Boyd. Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4):3170–3197, 2020.

Appendix

Appendix A Additional optimization algorithms and convergence to stationary points

A.1 Stochastic projected gradient descent (SPGD)

When the state space, action space and the time horizon are large, a single gradient evaluation in the PGD update (18) can be costly. To address this issue, we consider a stochastic variant of PGD (SPGD), which is suitable for problems with large S, A and T. To invoke SPGD, first recall the explicit expansion of (MF-OMO) as (14). The objective in (14) is a sum of n:=S+ST+SA+SA(T-1)+SA+SA(T+1)=S(T+1)(2A+1) terms, which we denote by f_{i}^{\text{MF-OMO}}(\theta) (i=1,\dots,n) for brevity. SPGD replaces the exact gradient \nabla_{\theta}f^{\text{MF-OMO}}(\theta_{k}) in PGD with a mini-batch estimator of the following form

\hat{g}_{k}=\frac{n}{|\mathcal{B}_{k}|}\sum_{i\in\mathcal{B}_{k}}\nabla_{\theta}f_{i}^{\text{MF-OMO}}(\theta_{k}), (40)

where the mini-batch k\mathcal{B}_{k} is a subset of {1,,n}\{1,\dots,n\} sampled uniformly at random (without replacement) and independently across kk (i.e., iterations). Such a sampling approach ensures that g^k\hat{g}_{k} is an unbiased estimator of θfMF-OMO(θk)\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{k}) (conditioned on θk\theta_{k}).
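
To make this claim concrete, the unbiasedness can be checked in one line (a routine computation, using only the fact that each index i belongs to the uniformly sampled batch \mathcal{B}_{k} with probability |\mathcal{B}_{k}|/n):

\mathbb{E}_{k}\hat{g}_{k}=\frac{n}{|\mathcal{B}_{k}|}\sum_{i=1}^{n}\mathbb{P}(i\in\mathcal{B}_{k})\nabla_{\theta}f_{i}^{\text{MF-OMO}}(\theta_{k})=\frac{n}{|\mathcal{B}_{k}|}\cdot\frac{|\mathcal{B}_{k}|}{n}\sum_{i=1}^{n}\nabla_{\theta}f_{i}^{\text{MF-OMO}}(\theta_{k})=\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{k}).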

SPGD then proceeds by updating θ\theta with the following iterative process:

θk+1=ProjΘ(θkηkg^k).\theta_{k+1}=\textbf{Proj}_{\Theta}(\theta_{k}-\eta_{k}\hat{g}_{k}). (41)

Again, we assume that θ0Θ\theta_{0}\in\Theta.
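
For illustration only, a minimal Python sketch of the SPGD iteration (40)-(41) is given below. The callables grad_f_i (the gradient of the i-th summand f_{i}^{\text{MF-OMO}}) and proj_Theta (the Euclidean projection onto \Theta) are hypothetical user-supplied routines and are not specified here.

import numpy as np

def spgd(theta0, grad_f_i, proj_Theta, n, batch_size, step_sizes, num_iters, seed=0):
    # Stochastic projected gradient descent sketch for (MF-OMO).
    # grad_f_i(i, theta): gradient of the i-th summand f_i^{MF-OMO} at theta.
    # proj_Theta(theta): Euclidean projection of theta onto the feasible set Theta.
    # step_sizes[k]: step size eta_k.
    rng = np.random.default_rng(seed)
    theta = proj_Theta(np.asarray(theta0, dtype=float))  # ensure theta_0 is feasible
    for k in range(num_iters):
        # sample a mini-batch uniformly at random, without replacement
        batch = rng.choice(n, size=batch_size, replace=False)
        # mini-batch estimator (40): unbiased for the full gradient
        g_hat = (n / batch_size) * sum(grad_f_i(i, theta) for i in batch)
        # projected stochastic gradient step (41)
        theta = proj_Theta(theta - step_sizes[k] * g_hat)
    return theta

For instance, the step size schedule of Proposition 19 below corresponds to step_sizes[k] = 1/((k+3)**0.5 * np.log2(k+3)).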

Proposition 17.

Under Assumption 1, there exist constants C1,C2>0C_{1},C_{2}>0 such that for any k0k\geq 0, g^k2C1\|\hat{g}_{k}\|_{2}\leq C_{1} almost surely, θfMF-OMO(θk)2C1\|\nabla_{\theta}f^{\text{\rm MF-OMO}}(\theta_{k})\|_{2}\leq C_{1}, 𝔼kg^k=θfMF-OMO(θk)\mathbb{E}_{k}\hat{g}_{k}=\nabla_{\theta}f^{\text{\rm MF-OMO}}(\theta_{k}) and 𝔼kg^k22C2\mathbb{E}_{k}\|\hat{g}_{k}\|_{2}^{2}\leq C_{2}. Here 𝔼k\mathbb{E}_{k} is the conditional expectation given the kk-th iteration θk\theta_{k}.

That is, the estimator (40) is unbiased and almost surely bounded, with a bounded second moment, and the exact gradient is also uniformly bounded. The proof is a direct application of the continuity of the rewards and dynamics and the compactness of \Theta, and is omitted here.

A.2 Global convergence to stationary points without definability

In this section, we show the global convergence of the iterates to stationary points when only the smoothness assumption holds. Let us first recall the concept of stationary points for constrained optimization problems and their relation to optimality.

Optimality conditions.

The KKT conditions state that a necessary condition for a point \theta\in\Theta to be an optimal solution to (MF-OMO) is -\nabla_{\theta}f^{\text{MF-OMO}}(\theta)\in\mathcal{N}_{\Theta}(\theta), where

\mathcal{N}_{\Theta}(\theta)=\left\{\begin{array}[]{ll}\emptyset,&\theta\notin\Theta,\\ \{\nu\,|\,\nu^{\top}(\theta^{\prime}-\theta)\leq 0,\,\forall\theta^{\prime}\in\Theta\},&\theta\in\Theta\end{array}\right.

is the normal cone of Θ\Theta at θ\theta. We say that θ\theta is a stationary point of (MF-OMO) if θfMF-OMO(θ)𝒩Θ(θ)-\nabla_{\theta}f^{\text{MF-OMO}}(\theta)\in\mathcal{N}_{\Theta}(\theta).

The typical goal of nonconvex optimization is to find a sequence of points that converges to a stationary point. Since the normal cone is set-valued, it is hard to directly evaluate closeness to stationarity with the above definition. A common surrogate metric is the norm of the projected gradient mapping G_{\eta}, defined on \Theta as

Gη(θ):=1η(θProjΘ(θηθfMF-OMO(θ))).G_{\eta}(\theta):=\dfrac{1}{\eta}\left(\theta-\textbf{Proj}_{\Theta}(\theta-\eta\nabla_{\theta}f^{\text{MF-OMO}}(\theta))\right).

Here Gη(θ)=0G_{\eta}(\theta)=0 if and only if θ\theta is a stationary point of (MF-OMO). Moreover, if Gη(θ)2ϵ\|G_{\eta}(\theta)\|_{2}\leq\epsilon, then according to [33, Lemma 3],

fMF-OMO(θ)𝒩Θ(θ)+ϵ(ηM+1)𝔹2,-\nabla f^{\text{MF-OMO}}(\theta)\in\mathcal{N}_{\Theta}(\theta)+\epsilon(\eta M+1)\mathbb{B}_{2},

where 𝔹2\mathbb{B}_{2} is the unit 2\ell_{2} ball.

When Θ\Theta is the full Euclidean space, the projected gradient mapping is simply the gradient of fMF-OMOf^{\text{MF-OMO}}. We refer interested readers to [8, Chapter 3] for a more detailed discussion on optimality conditions.
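
As a small illustration (using the same hypothetical grad_f and proj_Theta callables as in the SPGD sketch above), the projected gradient mapping and the corresponding approximate stationarity test can be computed as follows.

import numpy as np

def projected_gradient_mapping(theta, grad_f, proj_Theta, eta):
    # G_eta(theta) = (theta - Proj_Theta(theta - eta * grad_f(theta))) / eta
    return (theta - proj_Theta(theta - eta * grad_f(theta))) / eta

def is_approx_stationary(theta, grad_f, proj_Theta, eta, eps):
    # surrogate stationarity test: ||G_eta(theta)||_2 <= eps
    return np.linalg.norm(projected_gradient_mapping(theta, grad_f, proj_Theta, eta)) <= eps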

Convergence to stationary points.

Under Assumption 1, the following convergence result for PGD is standard given Proposition 10 and the existence of optimal solution(s) for (MF-OMO).

Proposition 18.

[8, Theorem 10.15] Under Assumption 1, let {θk}k0\{\theta_{k}\}_{k\geq 0} be the sequence generated by PGD (18) with ηk=η(0,2/M)\eta_{k}=\eta\in(0,2/M). Then

  • the objective value sequence \{f^{\text{\rm MF-OMO}}(\theta_{k})\}_{k\geq 0} is non-increasing, and the decrease is strict until a stationary point is found (i.e., G_{\eta}(\theta_{k})=0) and the iteration is terminated;

  • k=0KGη(θk)22fMF-OMO(θ0)ηMη2/2\sum_{k=0}^{K}\|G_{\eta}(\theta_{k})\|_{2}^{2}\leq\frac{f^{\text{\rm MF-OMO}}(\theta_{0})}{\eta-M\eta^{2}/2}, and in particular,

    mink=0,,KGη(θk)2fMF-OMO(θ0)(ηMη2/2)(K+1);\min_{k=0,\dots,K}\|G_{\eta}(\theta_{k})\|_{2}\leq\sqrt{\dfrac{f^{\text{\rm MF-OMO}}(\theta_{0})}{(\eta-M\eta^{2}/2)(K+1)}};
  • all limit points of {θk}k0\{\theta_{k}\}_{k\geq 0} are stationary points of (MF-OMO).
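
In particular, the second item yields an explicit iteration-complexity estimate: to guarantee \min_{k=0,\dots,K}\|G_{\eta}(\theta_{k})\|_{2}\leq\epsilon it suffices to take

K+1\geq\frac{f^{\text{\rm MF-OMO}}(\theta_{0})}{(\eta-M\eta^{2}/2)\epsilon^{2}},

i.e., an \epsilon-approximate stationary point is found within O(\epsilon^{-2}) PGD iterations.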

Similarly, the following proposition shows that the iterates of SPGD converge to stationary points with high probability. The uniform boundedness in Proposition 17 allows us to adapt the proof in [74, Theorem 9] for policy gradient methods of MDPs to our setting. The proof is omitted as it closely parallels that argument.

Proposition 19.

Under Assumptions 1 and 2, let {θk}k0\{\theta_{k}\}_{k\geq 0} be the sequence generated by SPGD (41) with ηk=1k+3log2(k+3)\eta_{k}=\frac{1}{\sqrt{k+3}\log_{2}(k+3)}. Then for any K0K\geq 0, with probability at least 1δ1-\delta, we have

mink=0,,KGη(θk)224log2(K+3)K+1(C2M+fMF-OMO(θ0)+C12(16+M)log(2δ)).\min_{k=0,\dots,K}\|G_{\eta}(\theta_{k})\|_{2}^{2}\leq\dfrac{4\log_{2}(K+3)}{\sqrt{K+1}}\left(C_{2}M+f^{\text{\rm MF-OMO}}(\theta_{0})+C_{1}^{2}\sqrt{(16+M)\log\left(\frac{2}{\delta}\right)}\right).

A.3 Reparametrization, acceleration and variance reduction

(MF-OMO) may be solved more efficiently with a few additional techniques. First, one can reparametrize the variables to remove the simple constraints in \Theta entirely. In particular, one can reparametrize L and z by

L_{s,a,t}=\frac{\exp(u_{s,a,t})}{\sum_{s^{\prime}\in\mathcal{S},a^{\prime}\in\mathcal{A}}\exp(u_{s^{\prime},a^{\prime},t})},\quad z_{s,a,t}=SA(T^{2}+T+2)r_{\max}\frac{\exp(v_{s,a,t})}{\sum_{s^{\prime}\in\mathcal{S},a^{\prime}\in\mathcal{A}}\exp(v_{s^{\prime},a^{\prime},t})+w_{0}}

for some u=[u_{s,a,t}]_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}\in\mathbb{R}^{SA(T+1)} and v=[v_{s,a,t}]_{s\in\mathcal{S},a\in\mathcal{A},t\in\mathcal{T}}\in\mathbb{R}^{SA(T+1)}, and w_{0}\in\mathbb{R}. Similarly, y can be reparametrized using trigonometric functions such as y_{s,t}=\frac{S(T+1)(T+2)r_{\max}}{2\sqrt{S(T+1)}}\sin(w_{s,t}) for some w\in\mathbb{R}^{S(T+1)}. With the aforementioned reparametrization, (MF-OMO) becomes a smooth unconstrained optimization problem, which can be solved by additional optimization algorithms and solvers not involving projections. As mentioned in Remark 1, one can also change the norms of y and z and their bounds, and apply other reparametrizations.
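
As an illustration, a schematic Python implementation of these maps is sketched below. It follows the displayed formulas literally (in particular, the slack scalar w_{0} in the denominator of z), with u and v stored as arrays of shape (S, A, T+1) and w of shape (S, T+1); this layout is an assumption made here for readability, not the vectorized indexing used in the main text.

import numpy as np

def reparametrize(u, v, w0, w, S, A, T, r_max):
    # L_t: softmax over the (s, a) entries, separately for each time t
    exp_u = np.exp(u)
    L = exp_u / exp_u.sum(axis=(0, 1), keepdims=True)
    # z_t: scaled softmax-like map with the extra slack term w0 in the denominator
    exp_v = np.exp(v)
    z_scale = S * A * (T**2 + T + 2) * r_max
    z = z_scale * exp_v / (exp_v.sum(axis=(0, 1), keepdims=True) + w0)
    # y: componentwise sine, scaled so that the l2-norm bound on y is respected
    y_scale = S * (T + 1) * (T + 2) * r_max / (2.0 * np.sqrt(S * (T + 1)))
    y = y_scale * np.sin(w)
    return L, z, y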

Lastly, standard techniques such as momentum methods [60], Nesterov acceleration [55], quasi-Newton methods [73], Anderson acceleration [5], and Newton methods [50] can readily be applied to accelerate the convergence of PGD and SPGD; variance reduction approaches such as SAG [48], SVRG [41], SAGA [22], SEGA [36] can also be adopted to further accelerate and stabilize the convergence of SPGD. In particular, safeguarded Anderson acceleration methods [37, 75, 65] might be adopted to maintain the desired monotonicity property of vanilla PGD, which is central for the local convergence to global Nash equilibrium solutions (cf. Theorem 11).
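
As a simple illustration of such safeguarding (a sketch only, not the specific schemes of [37, 75, 65]), one can accept an accelerated candidate only when it does not increase the objective relative to the plain PGD step, and fall back to the PGD step otherwise; this keeps the objective values non-increasing, as used in Theorem 11. The callables f, grad_f and proj_Theta are again assumed to be supplied by the user.

def safeguarded_accelerated_pgd(theta0, f, grad_f, proj_Theta, eta, num_iters, momentum=0.9):
    # PGD with heavy-ball style extrapolation, safeguarded so that f(theta_k) stays non-increasing
    # (assuming eta is small enough that the plain PGD step is itself a descent step).
    theta = proj_Theta(theta0)
    theta_prev = theta
    for _ in range(num_iters):
        # accelerated candidate: extrapolate, then take a projected gradient step
        extrap = theta + momentum * (theta - theta_prev)
        candidate = proj_Theta(extrap - eta * grad_f(extrap))
        # plain PGD step as the monotone fallback
        fallback = proj_Theta(theta - eta * grad_f(theta))
        theta_prev = theta
        theta = candidate if f(candidate) <= f(fallback) else fallback
    return theta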

Appendix B Definability and local convergence to global Nash equilibrium solutions

B.1 Definability

In this section, we restate the formal definitions of definable sets and functions as in [7, Section 4.3], and summarize and derive necessary properties to prove Theorem 11. See [68, 71, 27, 52, 19, 42] for more complete expositions of the topic.

We first need the following concept of o-minimal structure over \mathbb{R}.

Definition B.1.

Suppose that \mathcal{O}:=\{\mathcal{O}_{n}\}_{n=0}^{\infty} is a sequence of collections of subsets \mathcal{O}_{n}\subseteq 2^{\mathbb{R}^{n}}. We say that \mathcal{O} is an o-minimal structure over \mathbb{R} if it satisfies the following conditions:

  • n0\forall n\geq 0, 𝒪n\emptyset\in\mathcal{O}_{n}. And A,B𝒪n\forall A,B\in\mathcal{O}_{n}, AB,AB,n\A𝒪nA\cap B,A\cup B,\mathbb{R}^{n}\backslash A\in\mathcal{O}_{n}.

  • A𝒪n\forall A\in\mathcal{O}_{n}, A×𝒪n+1A\times\mathbb{R}\in\mathcal{O}_{n+1} and ×A𝒪n+1\mathbb{R}\times A\in\mathcal{O}_{n+1}.

  • \forall A\in\mathcal{O}_{n+1}, its canonical projection \{(x_{1},\dots,x_{n})\in\mathbb{R}^{n}\,|\,(x_{1},\dots,x_{n},x_{n+1})\in A\text{ for some }x_{n+1}\in\mathbb{R}\}\in\mathcal{O}_{n}.

  • ij\forall i\neq j and i,j{1,,n}i,j\in\{1,\dots,n\}, the diagonals {(x1,,xn)|xi=xj}𝒪n\{(x_{1},\dots,x_{n})|x_{i}=x_{j}\}\in\mathcal{O}_{n}.

  • The set {(x1,x2)2|x1<x2}𝒪2\{(x_{1},x_{2})\in\mathbb{R}^{2}|x_{1}<x_{2}\}\in\mathcal{O}_{2}.

  • The elements of 𝒪1\mathcal{O}_{1} are exactly finite unions of intervals (including both open and closed, and in particular single points).

We then have the following definition of definability.

Definition B.2.

Let \mathcal{O}=\{\mathcal{O}_{n}\}_{n=0}^{\infty} be an o-minimal structure over \mathbb{R}. We say that a set A\subseteq\mathbb{R}^{n} is definable if A\in\mathcal{O}_{n}. A function/mapping f:A\subseteq\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is said to be definable if its graph \{(x,y)\in\mathbb{R}^{n}\times\mathbb{R}^{m}|y=f(x)\}\in\mathcal{O}_{n+m}.111111Note that a real extended value function f:\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\} can be equivalently seen as a standard function f:\textbf{dom}(f)\rightarrow\mathbb{R}. Here \textbf{dom}(f) denotes the domain of f. A set-valued mapping f:A\subseteq\mathbb{R}^{n}\rightrightarrows\mathbb{R}^{m} is said to be definable if its graph \{(x,y)\in A\times\mathbb{R}^{m}|y\in f(x)\}\in\mathcal{O}_{n+m}.

Below we provide three examples of the most widely used o-minimal structures, namely the real semialgebraic structure, the globally subanalytic structure and the log-exp structure used in our main text.

Example 1 (Real semialgebraic structure).

In this structure, each 𝒪n\mathcal{O}_{n} consists of real semialgebraic sets in n\mathbb{R}^{n}. A set in n\mathbb{R}^{n} is said to be real semialgebraic (or semialgebraic for short) if it can be written as a finite union of sets of the form {xn|pi(x)=0,qi(x)<0,i=1,,k}\{x\in\mathbb{R}^{n}|p_{i}(x)=0,\,q_{i}(x)<0,\,i=1,\dots,k\}, where pi,qip_{i},q_{i} are real polynomial functions and k1k\geq 1. A function/mapping f:Anmf:A\subseteq\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is called (real) semialgebraic if its graph is semialgebraic.

The following result is useful for proving Theorem 11 (see [18] for more properties of real semialgebraic sets).

Lemma 20.

Any set of the form \{x\in\mathbb{R}^{n}|p_{i}(x)=0,\,q_{j}(x)\leq 0,\,i=1,\dots,k,\,j=1,\dots,l\}, where p_{i},q_{j} are real polynomial functions and k,l\geq 1, is real semialgebraic.

Proof.

First, notice that {xn|q(x)<0}={xn|0=0,q(x)<0}\{x\in\mathbb{R}^{n}|q(x)<0\}=\{x\in\mathbb{R}^{n}|0=0,q(x)<0\} and {xn|p(x)=0}={xn|p(x)=0,1<0}\{x\in\mathbb{R}^{n}|p(x)=0\}=\{x\in\mathbb{R}^{n}|p(x)=0,-1<0\} are both real semialgebraic by definition. Hence {xn|p(x)0}={xn|p(x)<0}{xn|p(x)=0}\{x\in\mathbb{R}^{n}|p(x)\leq 0\}=\{x\in\mathbb{R}^{n}|p(x)<0\}\cup\{x\in\mathbb{R}^{n}|p(x)=0\} is real semialgebraic. Finally, by the intersection closure property of o-minimal structures,

{xn|pi(x)=0,qj(x)0,i=1,,k,j=1,,l}=(i=1k{xn|pi(x)=0})(j=1l{xn|qj(x)0})\begin{split}&\{x\in\mathbb{R}^{n}|p_{i}(x)=0,q_{j}(x)\leq 0,i=1,\dots,k,j=1,\dots,l\}\\ &=\left(\bigcap_{i=1}^{k}\{x\in\mathbb{R}^{n}|p_{i}(x)=0\}\right)\bigcap\left(\bigcap_{j=1}^{l}\{x\in\mathbb{R}^{n}|q_{j}(x)\leq 0\}\right)\end{split}

is also real semialgebraic. ∎
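
As a concrete instance of Lemma 20, the probability simplex appearing in the constraint set (e.g., each L_{t}\in\Delta(\mathcal{S}\times\mathcal{A})) can be written as

\Delta(\mathcal{S}\times\mathcal{A})=\left\{L\in\mathbb{R}^{SA}\,\middle|\,\sum_{s\in\mathcal{S},a\in\mathcal{A}}L_{s,a}-1=0,\ -L_{s,a}\leq 0,\ \forall s\in\mathcal{S},a\in\mathcal{A}\right\},

which is cut out by one linear equality and SA linear inequalities, and is hence real semialgebraic (and in particular definable).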

Example 2 (Globally subanalytic structure).

This structure is the smallest o-minimal structure in which each \mathcal{O}_{n+1} (n\geq 1) contains all sets of the form \{(x,t)\in[-1,1]^{n}\times\mathbb{R}|f(x)=t\}, where f:[-1,1]^{n}\rightarrow\mathbb{R} is an analytic function that can be extended analytically to a neighborhood containing [-1,1]^{n}. In particular, all semialgebraic sets, as well as all sets of the form \{(x,t)\in[a,b]^{n}\times\mathbb{R}|f(x)=t\} with f:\mathbb{R}^{n}\rightarrow\mathbb{R} analytic and a\leq b\in\mathbb{R}, belong to this structure.

Example 3 (Log-exp structure).

This structure is the smallest o-minimal structure containing the globally subanalytic structure and the graph of the exponential function \exp:\mathbb{R}\rightarrow\mathbb{R}.

As in the main text, we say that a function (or a set) is definable if it is definable on the log-exp structure. The following proposition formally summarizes the major facts about definable functions and sets. See [7] and [68] for more details.

Proposition 21.

The following facts hold for definable functions/mappings and sets.

  1. 1.

    Let the real extended value functions f,g:n{+}f,g:\mathbb{R}^{n}\rightarrow\mathbb{R}\cup\{+\infty\} be definable, and let Gn×mG\in\mathbb{R}^{n\times m} be a constant matrix and hnh\in\mathbb{R}^{n} be a constant vector. Then f(x)+g(x)f(x)+g(x), f(x)g(x)f(x)-g(x), f(x)g(x)f(x)g(x), f(Gz+h)f(Gz+h), max{f(x),g(x)}\max\{f(x),g(x)\}, min{f(x),g(x)}\min\{f(x),g(x)\} and f1(y)f^{-1}(y) are definable. The same results hold for vector-valued mappings f,g:Anmf,g:A\subseteq\mathbb{R}^{n}\rightarrow\mathbb{R}^{m}.

  2. 2.

    Let the functions/mappings f:A\subseteq\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} and g:B\subseteq\mathbb{R}^{m}\rightarrow\mathbb{R}^{l} be definable, and let f(A)\subseteq B. Then the composition h(x)=g(f(x)):A\rightarrow\mathbb{R}^{l} is also definable.

  3. 3.

    Let f:Anf:A\subseteq\mathbb{R}^{n}\rightarrow\mathbb{R} be definable and let BAB\subseteq A be a compact set on which f(x)0f(x)\neq 0 for all xBx\in B. Then g(x)=1/f(x):Bg(x)=1/f(x):B\rightarrow\mathbb{R} is also definable.

  4. 4.

    Let f(x,y):n×mf(x,y):\mathbb{R}^{n}\times\mathbb{R}^{m}\rightarrow\mathbb{R} and CmC\subseteq\mathbb{R}^{m} be definable. Then g(x)=supyCf(x,y):ng(x)=\sup_{y\in C}f(x,y):\mathbb{R}^{n}\rightarrow\mathbb{R} and h(x)=infyCf(x,y):nh(x)=\inf_{y\in C}f(x,y):\mathbb{R}^{n}\rightarrow\mathbb{R} are both definable.

  5. 5.

    All semialgebraic functions, all analytic functions restricted to definable compact sets, as well as the exponential function exp\exp and logarithm function log\log are definable.

  6. 6.

    Let the set Θ\Theta be definable. Then the indicator function 𝟏Θ(θ)\mathbf{1}_{\Theta}(\theta) (that takes the value 11 when θΘ\theta\in\Theta and 0 otherwise) and the characteristic function121212Here 𝒳(x)\mathcal{I}_{\mathcal{X}}(x) denotes the (real extended valued) characteristic function (sometimes also referred to as the indicator function in some literature) that takes value 0 when x𝒳x\in\mathcal{X} and ++\infty otherwise. Θ(θ)\mathcal{I}_{\Theta}(\theta) are also definable.

  7. 7.

    Let the sets A,BnA,B\subseteq\mathbb{R}^{n} be definable. Then the Cartesian product A×BA\times B is definable.

B.2 Local convergence to global Nash equilibrium solutions

Proof of Theorem 11.

The proof consists of four steps. We begin by verifying the definability around the initialization. We then specify ϵ0\epsilon_{0} and show that the iterates are uniformly bounded in a similar neighborhood of the initialization. Finally, we obtain convergence to Nash equilibrium solutions.

Step 1: Verification of definability around initialization.

We first verify that

hMF-OMO(θ):=fMF-OMO(θ)+Θ(θ)h^{\text{MF-OMO}}(\theta):=f^{\text{MF-OMO}}(\theta)+\mathcal{I}_{\Theta}(\theta)

is definable. Note that \theta is an optimal solution to (MF-OMO) if and only if \theta minimizes h^{\text{MF-OMO}}(\theta) over the entire Euclidean space of \theta. To see the definability, note that f^{\text{MF-OMO}}(\theta) is obtained from a finite number of summations, subtractions and products of definable functions (i.e., P_{t}(s,a,L_{t}), r_{t}(s,a,L_{t}), constant functions and identity functions), and is hence definable by Proposition 21. In addition, by rewriting \|y\|_{2}\leq S(T+1)(T+2)r_{\max}/2 as \sum_{i=1}^{S(T+1)}y_{i}^{2}\leq S^{2}(T+1)^{2}(T+2)^{2}r_{\max}^{2}/4 and applying Lemma 20, \Theta is semialgebraic and hence definable, so the characteristic function \mathcal{I}_{\Theta}(\theta) is also definable by Proposition 21. As a result, the sum of f^{\text{MF-OMO}}(\theta) and \mathcal{I}_{\Theta}(\theta) is definable.

Let (π,L)(\pi^{\star},L^{\star}) be some Nash equilibrium solution. Define yy^{\star} and zz^{\star} by

y:=[V1(L),V2(L),,VT(L),V0(L)]S(T+1)y^{\star}:=[V_{1}^{\star}(L^{\star}),V_{2}^{\star}(L^{\star}),\dots,V_{T}^{\star}(L^{\star}),-V_{0}^{\star}(L^{\star})]\in\mathbb{R}^{S(T+1)}

and z=[z,0,z,1,,z,T]SA(T+1)z^{\star}=[z_{\star,0},z_{\star,1},\dots,z_{\star,T}]\in\mathbb{R}^{SA(T+1)} with z,t=[z,t1,,z,tA]z_{\star,t}=[z_{\star,t}^{1},\dots,z_{\star,t}^{A}] (t=0,,Tt=0,\dots,T), where

z,ta=Vt(L)rt(,a,Lt)Pta(Lt)Vt+1(L)S,t=0,,T1,a𝒜z_{\star,t}^{a}=V_{t}^{\star}(L^{\star})-r_{t}(\cdot,a,L_{t}^{\star})-P_{t}^{a}(L_{t}^{\star})V_{t+1}^{\star}(L^{\star})\in\mathbb{R}^{S},\quad t=0,\dots,T-1,\,a\in\mathcal{A}

and z_{\star,T}^{a}=V_{T}^{\star}(L^{\star})-r_{T}(\cdot,a,L_{T}^{\star})\in\mathbb{R}^{S} (a\in\mathcal{A}), and set \theta^{\star}:=(y^{\star},z^{\star},L^{\star}). Then by Theorem 4, Proposition 6 and Corollary 7, \theta^{\star} is feasible (i.e., \theta^{\star}\in\Theta) and A_{L^{\star}}L^{\star}=b, A_{L^{\star}}^{\top}y^{\star}+z^{\star}=c_{L^{\star}} and (z^{\star})^{\top}L^{\star}=0. Hence f^{\text{MF-OMO}}(\theta^{\star})=0 and thus \theta^{\star} is optimal (i.e., \theta^{\star}\in\Theta^{\star}).

For this θΘ\theta^{\star}\in\Theta^{\star}, we have hMF-OMO(θ)=fMF-OMO(θ)=0h^{\text{MF-OMO}}(\theta^{\star})=f^{\text{MF-OMO}}(\theta^{\star})=0. By [7, Theorem 14] and the continuity of fMF-OMO(θ)f^{\text{MF-OMO}}(\theta), there exist ϵ,δ>0\epsilon,\delta>0 and a continuous, concave, strictly increasing and definable function ϕ:[0,δ)+\phi:[0,\delta)\rightarrow\mathbb{R}_{+}, with ϕ(0)=0\phi(0)=0 and C1C^{1} in (0,δ)(0,\delta), such that for any θΘ\theta\in\Theta with θθ2ϵ\|\theta-\theta^{\star}\|_{2}\leq\epsilon and 0<hMF-OMO(θ)<δ0<h^{\text{MF-OMO}}(\theta)<\delta, the following Kurdyka-Lojasiewicz (KL) condition holds:

ϕ(hMF-OMO(θ))dist(0,θfMF-OMO(θ)+𝒩Θ(θ))1.\phi^{\prime}(h^{\text{MF-OMO}}(\theta))\textbf{dist}(0,\nabla_{\theta}f^{\text{MF-OMO}}(\theta)+\mathcal{N}_{\Theta}(\theta))\geq 1. (42)

Here we use the fact that the limiting sub-differential hMF-OMO(θ)=θfMF-OMO(θ)+𝒩Θ(θ)\partial h^{\text{MF-OMO}}(\theta)=\nabla_{\theta}f^{\text{MF-OMO}}(\theta)+\mathcal{N}_{\Theta}(\theta) [7, Proposition 3].

Step 2: Specify ϵ0\epsilon_{0}.

Next, we specify the choice of ϵ0\epsilon_{0}. By the continuity of fMF-OMO(θ)f^{\text{MF-OMO}}(\theta), there exists ϵ(0,ϵ/3]\epsilon^{\prime}\in(0,\epsilon/3] such that for any θΘ\theta\in\Theta with θθ2ϵ\|\theta-\theta^{\star}\|_{2}\leq\epsilon^{\prime},

hMF-OMO(θ)=fMF-OMO(θ)min{ϕ1(ϵ3(M+1/η)),(2Mη)ϵ218η,δ/2}<δ.h^{\text{MF-OMO}}(\theta)=f^{\text{MF-OMO}}(\theta)\leq\min\left\{\phi^{-1}\left(\frac{\epsilon}{3(M+1/\eta)}\right),\frac{(2-M\eta)\epsilon^{2}}{18\eta},\delta/2\right\}<\delta.

Let ϵ0:=min{ϵ,ϵ}/C\epsilon_{0}:=\min\{\epsilon,\epsilon^{\prime}\}/C, where

C=SA(T+1)(CP(T+1)2rmax+Cr(2T+1))+S(T+1)(CPT(T+1)rmax2+CrT)+1.C=SA(T+1)(C_{P}(T+1)^{2}r_{\max}+C_{r}(2T+1))+S(T+1)\left(\dfrac{C_{P}T(T+1)r_{\max}}{2}+C_{r}T\right)+1.

We now show that for any L0Δ(𝒮×𝒜)L^{0}\in\Delta(\mathcal{S}\times\mathcal{A}) with L0L1ϵ0\|L^{0}-L^{\star}\|_{1}\leq\epsilon_{0}, θ0θ2Cϵ0=min{ϵ,ϵ}\|\theta_{0}-\theta^{\star}\|_{2}\leq C\epsilon_{0}=\min\{\epsilon,\epsilon^{\prime}\}, and hence hMF-OMO(θ0)min{ϕ1(ϵ3(M+1/η)),(2Mη)ϵ218η,δ/2}<δh^{\text{MF-OMO}}(\theta_{0})\leq\min\left\{\phi^{-1}(\frac{\epsilon}{3(M+1/\eta)}),\frac{(2-M\eta)\epsilon^{2}}{18\eta},\delta/2\right\}<\delta.

By Assumption 1 and the compactness of \Theta, the Lipschitz continuity assumptions in Theorem 8 hold for some constants C_{P},C_{r}>0. Since \|L^{0}-L^{\star}\|_{1}\leq\epsilon_{0}, by the construction of y^{0} and z^{0} in the proof of Proposition 13, together with the proof of Theorem 9, \theta_{0}=(y^{0},z^{0},L^{0})\in\Theta is feasible, and

y0y=maxt𝒯Vt(L0)Vt(L)T(T+1)rmax2maxt=0,,T1PtL0PtL,1+t𝒯rtL0rtL(CPT(T+1)rmax2+CrT)ϵ0.\begin{split}\|y^{0}-y^{\star}\|_{\infty}&=\max_{t\in\mathcal{T}}\|V_{t}^{\star}(L^{0})-V_{t}^{\star}(L^{\star})\|_{\infty}\\ &\leq\dfrac{T(T+1)r_{\max}}{2}\max_{t=0,\dots,T-1}\|P_{t}^{L^{0}}-P_{t}^{L^{\star}}\|_{\infty,1}+\sum_{t\in\mathcal{T}}\|r_{t}^{L^{0}}-r_{t}^{L^{\star}}\|_{\infty}\\ &\leq\left(\dfrac{C_{P}T(T+1)r_{\max}}{2}+C_{r}T\right)\epsilon_{0}.\end{split}

Similarly,

z0zmaxt𝒯Vt(L0)Vt(L)+maxa𝒜,t𝒯rt(,a,Lt0)rt(,a,Lt)+maxa𝒜,t𝒯Pta(Lt0)Vt+1(L0)Pta(Lt)Vt+1(L)(CPT(T+1)rmax2+CrT)ϵ0+Crϵ0+maxa𝒜,t𝒯Pta(Lt0)Vt+1(L0)Vt+1(L)+maxa𝒜,t𝒯Pta(Lt0)Pta(Lt)Vt+1(L)(CPT(T+1)rmax2+Cr(T+1))ϵ0+(CPT(T+1)rmax2+CrT)ϵ0+rmax(T+1)CPϵ0=(CP(T+1)2rmax+Cr(2T+1))ϵ0,\begin{split}\|z^{0}-z^{\star}\|_{\infty}&\leq\max_{t\in\mathcal{T}}\|V_{t}^{\star}(L^{0})-V_{t}^{\star}(L^{\star})\|_{\infty}+\max_{a\in\mathcal{A},t\in\mathcal{T}}\|r_{t}(\cdot,a,L_{t}^{0})-r_{t}(\cdot,a,L_{t}^{\star})\|_{\infty}\\ &\qquad+\max_{a\in\mathcal{A},t\in\mathcal{T}}\|P_{t}^{a}(L_{t}^{0})V_{t+1}^{\star}(L^{0})-P_{t}^{a}(L_{t}^{\star})V_{t+1}(L^{\star})\|_{\infty}\\ &\leq\left(\dfrac{C_{P}T(T+1)r_{\max}}{2}+C_{r}T\right)\epsilon_{0}+C_{r}\epsilon_{0}\\ &\qquad+\max_{a\in\mathcal{A},t\in\mathcal{T}}\|P_{t}^{a}(L_{t}^{0})\|_{\infty}\|V_{t+1}^{\star}(L^{0})-V_{t+1}^{\star}(L^{\star})\|_{\infty}\\ &\qquad+\max_{a\in\mathcal{A},t\in\mathcal{T}}\|P_{t}^{a}(L_{t}^{0})-P_{t}^{a}(L_{t}^{\star})\|_{\infty}\|V_{t+1}^{\star}(L^{\star})\|_{\infty}\\ &\leq\left(\dfrac{C_{P}T(T+1)r_{\max}}{2}+C_{r}(T+1)\right)\epsilon_{0}+\left(\dfrac{C_{P}T(T+1)r_{\max}}{2}+C_{r}T\right)\epsilon_{0}\\ &\qquad+r_{\max}(T+1)C_{P}\epsilon_{0}\\ &=(C_{P}(T+1)^{2}r_{\max}+C_{r}(2T+1))\epsilon_{0},\end{split}

where VT+1(L)=0V_{T+1}(L)=0 for any LΔ(𝒮×𝒜)L\in\Delta(\mathcal{S}\times\mathcal{A}). Here we use the fact that Pta(Lt0)=1\|P_{t}^{a}(L_{t}^{0})\|_{\infty}=1, maxa𝒜,t𝒯Pta(Lt0)Pta(Lt)=PL0PL,1\max_{a\in\mathcal{A},t\in\mathcal{T}}\|P_{t}^{a}(L_{t}^{0})-P_{t}^{a}(L_{t}^{\star})\|_{\infty}=\|P^{L^{0}}-P^{L^{\star}}\|_{\infty,1}, and maxt𝒯|Vt(L)(s)|rmax(T+1)\max_{t\in\mathcal{T}}|V_{t}^{\star}(L^{\star})(s)|\leq r_{\max}(T+1). Also note that here 1\|\cdot\|_{1} and \|\cdot\|_{\infty} are both vector norms (except for Pta(Lt)P_{t}^{a}(L_{t}), for which \|\cdot\|_{\infty} is the matrix \ell_{\infty}-norm).

Hence we have shown that

θ0θ2θ0θ1=z0z1+y0y1+L0L1Cϵ0=min{ϵ,ϵ},\|\theta_{0}-\theta^{\star}\|_{2}\leq\|\theta_{0}-\theta^{\star}\|_{1}=\|z^{0}-z^{\star}\|_{1}+\|y^{0}-y^{\star}\|_{1}+\|L^{0}-L^{\star}\|_{1}\leq C\epsilon_{0}=\min\{\epsilon,\epsilon^{\prime}\},

where CC is a problem dependent constant defined as above. Thus hMF-OMO(θ0)<δh^{\text{MF-OMO}}(\theta_{0})<\delta.

Now by Proposition 18 and the feasibility of {θk}k0\{\theta_{k}\}_{k\geq 0}, hMF-OMO(θk)=fMF-OMO(θk)h^{\text{MF-OMO}}(\theta_{k})=f^{\text{MF-OMO}}(\theta_{k}) is non-increasing, and hence

hMF-OMO(θk)hMF-OMO(θ0)min{ϕ1(ϵ3(M+1/η)),(2Mη)ϵ218η,δ/2}δ/2<δ.h^{\text{MF-OMO}}(\theta_{k})\leq h^{\text{MF-OMO}}(\theta_{0})\leq\min\left\{\phi^{-1}\left(\frac{\epsilon}{3(M+1/\eta)}\right),\frac{(2-M\eta)\epsilon^{2}}{18\eta},\delta/2\right\}\leq\delta/2<\delta.

Step 3: Show that θkθ2ϵ\|\theta_{k}-\theta^{\star}\|_{2}\leq\epsilon, θkθ2=O(g(ϵ0))\|\theta_{k}-\theta^{\star}\|_{2}=O(g(\epsilon_{0})) for some gg with limϵ~0g(ϵ~)=0\lim_{\tilde{\epsilon}\rightarrow 0}g(\tilde{\epsilon})=0.

Define fk=fMF-OMO(θk)f_{k}=f^{\text{MF-OMO}}(\theta_{k}). By [8, Lemma 10.4],

fkfk+1=fMF-OMO(θk)fMF-OMO(θk+1)(1ηM2)θk+1θk22.f_{k}-f_{k+1}=f^{\text{MF-OMO}}(\theta_{k})-f^{\text{MF-OMO}}(\theta_{k+1})\geq\left(\frac{1}{\eta}-\frac{M}{2}\right)\|\theta_{k+1}-\theta_{k}\|_{2}^{2}. (43)

Then since ϕ\phi (in the KL condition (42) above) is concave, strictly increasing and C1C^{1} in (0,δ)(0,\delta), we have ϕ(fMF-OMO(θk))>0\phi^{\prime}(f^{\text{MF-OMO}}(\theta_{k}))>0 as long as fMF-OMO(θk)>0f^{\text{MF-OMO}}(\theta_{k})>0 (since we already have fMF-OMO(θk)<δf^{\text{MF-OMO}}(\theta_{k})<\delta for all k0k\geq 0), and

\phi(f_{k})-\phi(f_{k+1})\geq\phi^{\prime}(f_{k})(f_{k}-f_{k+1})\geq\phi^{\prime}(f_{k})\left(\frac{1}{\eta}-\frac{M}{2}\right)\|\theta_{k+1}-\theta_{k}\|_{2}^{2}. (44)

Here the first inequality is by the concavity of ϕ\phi.

Now we prove that

θiθ2min{g(ϵ0),ϵ}\|\theta_{i}-\theta^{\star}\|_{2}\leq\min\{g(\epsilon_{0}),\epsilon\} (45)

for all i0i\geq 0 by induction, where

g(ϵ0)=(M+1/η)ϕ(Cfϵ0)+2η2MηCfϵ0+Cϵ0(M+1/η)ϕ(f0)+2η2Mηf0+Cϵ0,\begin{split}g(\epsilon_{0})&=(M+1/\eta)\phi(C_{f}\epsilon_{0})+\sqrt{\frac{2\eta}{2-M\eta}C_{f}\epsilon_{0}}+C\epsilon_{0}\\ &\geq(M+1/\eta)\phi(f_{0})+\sqrt{\frac{2\eta}{2-M\eta}f_{0}}+C\epsilon_{0},\end{split} (46)

where Cf>0C_{f}>0 is the Lipschitz constant of fMF-OMO(θ)f^{\text{MF-OMO}}(\theta) in the sense that |fMF-OMO(θ)fMF-OMO(θ)|Cfθθ2|f^{\text{MF-OMO}}(\theta)-f^{\text{MF-OMO}}(\theta^{\prime})|\leq C_{f}\|\theta-\theta^{\prime}\|_{2} for any θ,θΘ\theta,\theta^{\prime}\in\Theta. The existence of CfC_{f} is guaranteed by Assumption 1 and the compactness of Θ\Theta. Note that since limϵ00f0=0\lim_{\epsilon_{0}\rightarrow 0}f_{0}=0 due to the continuity of fMF-OMO()f^{\text{MF-OMO}}(\cdot), we have the desired property limϵ~0g(ϵ~)=0\lim_{\tilde{\epsilon}\rightarrow 0}g(\tilde{\epsilon})=0.

The base case for i=0i=0 is trivial. Without loss of generality, we now assume that fMF-OMO(θ0)>0f^{\text{MF-OMO}}(\theta_{0})>0, since otherwise the algorithm finds the exact Nash equilibrium solution at the initialization and the iterates terminate/stay unchanged thereafter, and hence all the claims in Theorem 11 obviously hold.

Then for i=1i=1, by (43),

θ1θ2θ0θ2+θ1θ02Cϵ0+2η2Mη(f0f1)Cϵ0+2η2Mηf0{g(ϵ0),ϵ3+ϵ3ϵ.\begin{split}\|\theta_{1}-\theta^{\star}\|_{2}&\leq\|\theta_{0}-\theta^{\star}\|_{2}+\|\theta_{1}-\theta_{0}\|_{2}\\ &\leq C\epsilon_{0}+\sqrt{\frac{2\eta}{2-M\eta}(f_{0}-f_{1})}\leq C\epsilon_{0}+\sqrt{\frac{2\eta}{2-M\eta}f_{0}}\\ &\leq\left\{\begin{array}[]{l}g(\epsilon_{0}),\\ \dfrac{\epsilon}{3}+\dfrac{\epsilon}{3}\leq\epsilon.\end{array}\right.\end{split}

Suppose that the claim (45) holds for i=0,,ki=0,\dots,k (k0k\geq 0). Without loss of generality, we also assume that fMF-OMO(θk),fMF-OMO(θk+1)>0f^{\text{MF-OMO}}(\theta_{k}),f^{\text{MF-OMO}}(\theta_{k+1})>0, since otherwise the algorithm finds the exact Nash equilibrium solution at iteration kk or k+1k+1 and the iterates terminate/stay unchanged thereafter, and hence all the claims in Theorem 11 obviously hold.

Now for any i{1,,k}i\in\{1,\dots,k\}, since

θiargminθΘ(θ)+12θ(θi1ηθfMF-OMO(θi1))22,\theta_{i}\in\text{argmin}_{\theta}\,\mathcal{I}_{\Theta}(\theta)+\frac{1}{2}\|\theta-(\theta_{i-1}-\eta\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{i-1}))\|_{2}^{2},

we have

0𝒩Θ(θi)+θiθi1+ηθfMF-OMO(θi1).0\in\mathcal{N}_{\Theta}(\theta_{i})+\theta_{i}-\theta_{i-1}+\eta\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{i-1}).

Hence 1η(θi1θi)θfMF-OMO(θi1)𝒩Θ(θi)\frac{1}{\eta}(\theta_{i-1}-\theta_{i})-\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{i-1})\in\mathcal{N}_{\Theta}(\theta_{i}) since 𝒩Θ(θi)\mathcal{N}_{\Theta}(\theta_{i}) is a cone, and by the KL condition at θiΘ\theta_{i}\in\Theta with θiθ2ϵ\|\theta_{i}-\theta^{\star}\|_{2}\leq\epsilon and hMF-OMO(θi)(0,δ)h^{\text{MF-OMO}}(\theta_{i})\in(0,\delta), we have

ϕ(fi)1η(θi1θi)θfMF-OMO(θi1)+θfMF-OMO(θi)21.\phi^{\prime}(f_{i})\left\|\frac{1}{\eta}(\theta_{i-1}-\theta_{i})-\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{i-1})+\nabla_{\theta}f^{\text{MF-OMO}}(\theta_{i})\right\|_{2}\geq 1. (47)

(47), together with (44), immediately implies that

1ϕ(fi)(1ηθiθi12+Mθiθi12)(1η+M)(ϕ(fi)ϕ(fi+1))θiθi12θi+1θi22,\begin{split}1&\leq\phi^{\prime}(f_{i})\left(\frac{1}{\eta}\|\theta_{i}-\theta_{i-1}\|_{2}+M\|\theta_{i}-\theta_{i-1}\|_{2}\right)\\ &\leq\left(\frac{1}{\eta}+M\right)(\phi(f_{i})-\phi(f_{i+1}))\dfrac{\|\theta_{i}-\theta_{i-1}\|_{2}}{\|\theta_{i+1}-\theta_{i}\|_{2}^{2}},\end{split}

thus

\begin{split}2\|\theta_{i+1}-\theta_{i}\|_{2}&\leq 2\sqrt{(M+1/\eta)(\phi(f_{i})-\phi(f_{i+1}))}\sqrt{\|\theta_{i}-\theta_{i-1}\|_{2}}\\ &\leq(M+1/\eta)(\phi(f_{i})-\phi(f_{i+1}))+\|\theta_{i}-\theta_{i-1}\|_{2}.\end{split}

By telescoping the above inequality for i=1,,ki=1,\dots,k,

θk+1θk2+i=1kθi+1θi2(M+1/η)ϕ(f1)+θ1θ02,\|\theta_{k+1}-\theta_{k}\|_{2}+\sum_{i=1}^{k}\|\theta_{i+1}-\theta_{i}\|_{2}\leq(M+1/\eta)\phi(f_{1})+\|\theta_{1}-\theta_{0}\|_{2},

from which we deduce that

θk+1θ2i=1kθi+1θi2+θ1θ2(M+1/η)ϕ(f0)+θ1θ02+θ1θ2(M+1/η)ϕ(f0)+2η2Mηf0+Cϵ0.{g(ϵ0),ϵ3+ϵ3+ϵ3=ϵ.\begin{split}\|\theta_{k+1}-\theta^{\star}\|_{2}&\leq\sum_{i=1}^{k}\|\theta_{i+1}-\theta_{i}\|_{2}+\|\theta_{1}-\theta^{\star}\|_{2}\\ &\leq(M+1/\eta)\phi(f_{0})+\|\theta_{1}-\theta_{0}\|_{2}+\|\theta_{1}-\theta^{\star}\|_{2}\\ &\leq(M+1/\eta)\phi(f_{0})+\sqrt{\frac{2\eta}{2-M\eta}f_{0}}+C\epsilon_{0}.\\ &\leq\begin{cases}g(\epsilon_{0}),\\ \dfrac{\epsilon}{3}+\dfrac{\epsilon}{3}+\dfrac{\epsilon}{3}=\epsilon.\end{cases}\end{split}

Note that here we use (43) for k=0k=0 and we also use the fact that f1f0f_{1}\leq f_{0} and ϕ\phi is increasing. This completes the induction.

Final step: Convergence to Nash equilibrium solutions.

Let θ¯\bar{\theta} be an arbitrary limit point of {θk}k0\{\theta_{k}\}_{k\geq 0}. Then θ¯Θ\bar{\theta}\in\Theta, and by Proposition 18, θ¯\bar{\theta} is a stationary point of (MF-OMO), i.e., 0θfMF-OMO(θ¯)+𝒩Θ(θ¯)0\in\nabla_{\theta}f^{\text{MF-OMO}}(\bar{\theta})+\mathcal{N}_{\Theta}(\bar{\theta}). Moreover, again by Proposition 18 and the feasibility of {θk}k0\{\theta_{k}\}_{k\geq 0}, hMF-OMO(θk)=fMF-OMO(θk)h^{\text{MF-OMO}}(\theta_{k})=f^{\text{MF-OMO}}(\theta_{k}) is non-increasing, and hence by the continuity of fMF-OMO(θ)f^{\text{MF-OMO}}(\theta),

hMF-OMO(θ¯)=limkhMF-OMO(θk)=limkfMF-OMO(θk)=fMF-OMO(θ¯)fMF-OMO(θ0)<δ.h^{\text{MF-OMO}}(\bar{\theta})=\lim_{k\rightarrow\infty}h^{\text{MF-OMO}}(\theta_{k})=\lim_{k\rightarrow\infty}f^{\text{MF-OMO}}(\theta_{k})=f^{\text{MF-OMO}}(\bar{\theta})\leq f^{\text{MF-OMO}}(\theta_{0})<\delta.

In addition, by the first three steps above, we have θ¯θ2ϵ\|\bar{\theta}-\theta^{\star}\|_{2}\leq\epsilon. Hence if hMF-OMO(θ¯)>0h^{\text{MF-OMO}}(\bar{\theta})>0, then invoking the KL condition at θ¯\bar{\theta} yields

ϕ(hMF-OMO(θ¯))dist(0,θfMF-OMO(θ¯)+𝒩Θ(θ¯))1.\phi^{\prime}(h^{\text{MF-OMO}}(\bar{\theta}))\text{dist}(0,\nabla_{\theta}f^{\text{MF-OMO}}(\bar{\theta})+\mathcal{N}_{\Theta}(\bar{\theta}))\geq 1.

However, we also have \text{dist}(0,\nabla_{\theta}f^{\text{MF-OMO}}(\bar{\theta})+\mathcal{N}_{\Theta}(\bar{\theta}))=0, which is a contradiction. Hence \bar{\theta}\in\Theta satisfies f^{\text{MF-OMO}}(\bar{\theta})=0, which indicates that \bar{\theta}\in\Theta^{\star}. Therefore \lim_{k\rightarrow\infty}\textbf{dist}(\theta_{k},\Theta^{\star})=0, and by the equivalence result in Theorem 4, any \bar{\pi}\in\Pi(\bar{L}) with \bar{\theta}=(\bar{y},\bar{z},\bar{L}) is a Nash equilibrium solution of the mean-field game ((1) and (2)). In addition, we have \|L^{k}-L^{\star}\|_{2}\leq\|\theta_{k}-\theta^{\star}\|_{2}\leq g(\epsilon_{0}) for all k\geq 0. The fact that the limit point set is nonempty follows simply from the boundedness of \{\theta_{k}\}_{k\geq 0}. Moreover, the vanishing limit of the exploitability of \pi^{k}\in\Pi(L^{k}) is an immediate implication of Theorem 8 and the fact that f^{\text{MF-OMO}}(\theta_{k})\rightarrow 0 as k\rightarrow\infty. The claim about isolated Nash equilibrium solutions is immediate: simply choose \epsilon_{0} such that no other Nash equilibrium solutions exist in the g(\epsilon_{0})-neighborhood of L^{\star}. ∎