MF-OMO: An Optimization Formulation of Mean-Field Games
Abstract
This paper proposes a new mathematical paradigm to analyze discrete-time mean-field games. It is shown that finding Nash equilibrium solutions for a general class of discrete-time mean-field games is equivalent to solving an optimization problem with bounded variables and simple convex constraints, called MF-OMO. This equivalence framework enables finding multiple (and possibly all) Nash equilibrium solutions of mean-field games by standard optimization algorithms. For instance, projected gradient descent is shown to be capable of retrieving all possible Nash equilibrium solutions when there are finitely many of them, with proper initializations. Moreover, analyzing mean-field games with linear rewards and mean-field independent dynamics is reduced to solving a finite number of linear programs, hence solvable in finite time. This framework does not rely on contractivity or monotonicity assumptions, nor on the uniqueness of the Nash equilibrium.
1 Introduction
The theory of mean-field games, pioneered by the seminal work of [39] and [46], provides an ingenious way of analyzing and finding $\epsilon$-Nash equilibrium solutions to the otherwise notoriously hard $N$-player stochastic games. Ever since its inception, the literature on mean-field games has grown exponentially both in theory and in applications (see the standard reference [15]).
There are two key criteria for finding Nash equilibrium solutions of a mean-field game. The first is the optimality condition for the representative player, and the second is the consistency condition ensuring that the solution of the representative player is consistent with the mean-field flow of all players. Consequently, mean-field games with continuous-time stochastic dynamics have been studied under three approaches. The first is the original fixed-point [39] and PDE [46] approach, which analyzes the joint system of a backward Hamilton-Jacobi-Bellman equation for a control problem and a forward Fokker-Planck equation for the controlled dynamics. This naturally leads to the second, probabilistic approach, which focuses on the associated forward-backward stochastic differential equations [12, 14]. The third and most recent effort in mean-field game theory is to use the master equation to analyze mean-field games with common noise and the propagation of chaos for optimal trajectories [9, 13, 23]. All existing methodologies are based on either contractivity or monotonicity conditions, on the uniqueness of the Nash equilibrium solution, or on particular game structures; see [69, 1] for a detailed summary. Accordingly, finding Nash equilibrium solutions of mean-field games relies on similar assumptions and structures; see for instance [47].
It is well known, however, that non-zero-sum stochastic games and mean-field games in general have multiple Nash equilibrium solutions and that uniqueness of the Nash equilibrium solution, contractivity, and monotonicity are more of an exception than a rule. Yet, systematic and efficient approaches for finding multiple or all Nash equilibrium solutions of general mean-field games remain elusive.
Our work and approach.
In this work, we show that finding Nash equilibrium solutions for a general class of discrete-time mean-field games is equivalent to solving an optimization problem with bounded variables and simple convex constraints, called MF-OMO (Mean-Field Occupation Measure Optimization) (Section 3). This equivalence is established without assuming either the uniqueness of the Nash equilibrium solution or the contractivity or monotonicity conditions.
The equivalence is built through several steps. The first step, inspired by the classical linear program formulation of the single-agent Markov decision process (MDP) problem [61], is to introduce the occupation measure of the representative player, and to transform the problem of finding Nash equilibrium solutions into a constrained optimization problem (Theorem 4), using the strong duality of linear programs to bridge the two criteria of Nash equilibrium solutions. The second step is to utilize a solution modification technique (Proposition 6) to deal with potentially unbounded variables in the optimization problem and obtain the MF-OMO formulation (Theorem 5). Further analysis of this equivalent optimization formulation establishes the connection between the suboptimality of MF-OMO and the exploitability of any policy for mean-field games (Section 4). This is obtained by a perturbation analysis of MDPs (Lemma 13) and by the strong duality of linear programs.
This equivalence and these connections enable adopting families of optimization algorithms to find multiple (and possibly all) Nash equilibrium solutions of mean-field games. In particular, detailed studies on the convergence of MF-OMO with the projected gradient descent (PGD) algorithm to Nash equilibrium solutions of mean-field games are presented (Theorem 11). For instance, PGD is shown to be capable of retrieving all possible Nash equilibrium solutions when there are finitely many of them, with proper initializations. Moreover, analyzing mean-field games with linear rewards and mean-field independent dynamics is shown to be reducible to solving a finite number of linear programs, hence solvable in finite time (Proposition 12).
Finally, this MF-OMO framework can be extended to other variants of mean-field games, as illustrated in the case of personalized mean-field games in Section 7.
Related work.
Our approach is inspired by the classical linear program framework for MDPs, originally proposed in [51] for discrete-time MDPs with finite state-action spaces and then extended to semi-MDPs [56], to approximate dynamic programming [63, 21], as well as to continuous-time stochastic control [66, 10, 45] and singular control [67, 44] with continuous state-action spaces, among others [72, 25, 24, 38]. In a similar spirit, our work builds an optimization framework for discrete-time mean-field games.
There are several existing computational algorithms for finding Nash equilibrium solutions of discrete-time mean-field games. In particular, [31, 58] assume strict monotonicity, while [59] assumes weak monotonicity, all of which also assume the dynamics to be mean-field independent. In addition, [34, 32, 3] focus on contractive mean-field games, [6] studies mean-field games with unique Nash equilibrium solutions, and [20, 4] analyze mean-field games that are sufficiently close to some contractive regularized mean-field games. Under these assumptions, global convergence is obtained for some of these algorithms. In contrast, we focus on presenting an equivalent optimization framework for mean-field games, and establishing local convergence without these assumptions.
Most relevant to our work are the earlier papers [11, 29, 30] on continuous-time mean-field games, which replace the best-response/optimality condition in Nash equilibrium solutions by a linear program optimality condition. Their reformulation approach leads to new existence results for relaxed Nash equilibrium solutions and to computational benefits. In contrast, our work formulates the problem of finding Nash equilibrium solutions as a single optimization problem which captures both the best-response condition and the Fokker-Planck/consistency condition. Our formulation shows that finding Nash equilibrium solutions of discrete-time mean-field games with finite state-action spaces is generally no harder than solving a smooth, time-decoupled optimization problem with bounded variables and trivial convex constraints. Such a connection allows for directly utilizing various optimization tools, algorithms and solvers to find Nash equilibrium solutions of mean-field games. Moreover, our work provides, for the first time to the best of our knowledge, provably convergent algorithms for mean-field games with possibly multiple Nash equilibrium solutions and without contractivity or monotonicity assumptions.
2 Problem setup
We consider an extended [16] or general [34] mean-field game over a finite time horizon $T$ with a state space $\mathcal{S}$ and an action space $\mathcal{A}$, where $|\mathcal{S}| = S$ and $|\mathcal{A}| = A$. In this game, a representative player starts from a state $s_0 \sim \mu_0$, with $\mu_0$ the (common) initial state distribution of all players of an infinite population. At each time $t = 0, 1, \dots, T$ and when at state $s_t$, she chooses an action $a_t$ from some randomized/mixed policy $\pi_t(\cdot|s_t) \in \Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ denotes the set of probability vectors on $\mathcal{A}$. Note that mixed policies are also known as relaxed controls. She will then move to a new state $s_{t+1} \sim P_t(\cdot|s_t, a_t, L_t)$ according to a transition probability $P_t$ and receive a reward $r_t(s_t, a_t, L_t)$, where $L_t \in \Delta(\mathcal{S} \times \mathcal{A})$ is the joint state-action distribution among all players at time $t$, referred to as the mean-field information hereafter, and $\Delta(\mathcal{S} \times \mathcal{A})$ is the set of probability vectors on $\mathcal{S} \times \mathcal{A}$. Denote $r_{\max} := \max_{t} \sup_{s, a, L_t} |r_t(s, a, L_t)|$, which is assumed throughout the paper to be finite.
Given the mean-field flow $L = \{L_t\}_{t=0}^{T}$, the objective of this representative player is to maximize her accumulated rewards, i.e., to solve the following MDP problem:
$$\max_{\boldsymbol{\pi} = \{\pi_t\}_{t=0}^{T}}\ \mathbb{E}\Big[\sum_{t=0}^{T} r_t(s_t, a_t, L_t)\Big] \quad \text{subject to} \quad s_0 \sim \mu_0,\ \ a_t \sim \pi_t(\cdot|s_t),\ \ s_{t+1} \sim P_t(\cdot|s_t, a_t, L_t). \tag{1}$$
Throughout the paper, for a given mean-field flow $L$, we will denote the mean-field induced MDP (1) as $\mathcal{M}_L$, denote $V_t^\star(L) \in \mathbb{R}^{S}$ as its optimal total expected reward, i.e., value function, starting from time $t$, with the $s$-th entry being the optimal expected reward starting from state $s$ at time $t$, and denote $V_{\mu_0}^\star(L)$ as its optimal expected total reward starting from $s_0 \sim \mu_0$. Correspondingly, we denote $V_t^{\boldsymbol{\pi}}(L)$ and $V_{\mu_0}^{\boldsymbol{\pi}}(L)$ for the same quantities under a given policy $\boldsymbol{\pi}$ for $\mathcal{M}_L$.
To analyze such a mean-field game, the most widely adopted solution concept is the Nash equilibrium. A policy sequence $\boldsymbol{\pi} = \{\pi_t\}_{t=0}^{T}$ and a mean-field flow $L = \{L_t\}_{t=0}^{T}$ constitute a Nash equilibrium solution of this finite-time horizon mean-field game if the following conditions are satisfied.
Definition 2.1 (Nash equilibrium solution).
- 1) (Optimality/Best Response) Fixing $L$, $\boldsymbol{\pi}$ solves the optimization problem (1), i.e., $\boldsymbol{\pi}$ is optimal for the representative agent given the mean-field flow $L$;
- 2) (Consistency) Fixing $\boldsymbol{\pi}$, the consistency of the mean-field flow holds, namely
$$L_t = \mathbb{P}_{(s_t, a_t)}, \quad t = 0, 1, \dots, T, \quad \text{where } s_0 \sim \mu_0,\ a_t \sim \pi_t(\cdot|s_t),\ s_{t+1} \sim P_t(\cdot|s_t, a_t, L_t). \tag{2}$$
Here $\mathbb{P}_X$ denotes the probability distribution of a random variable/vector $X$.
Note that (1) requires that the policy sequence $\boldsymbol{\pi}$ is the best response to the flow $L$, while (2) requires that the flow $L$ is the corresponding mean-field flow induced when all players adopt the policy sequence $\boldsymbol{\pi}$. (Condition 2) is also known as the Fokker-Planck equation in the continuous-time mean-field game literature.) Also note that (2) can be written more explicitly as follows:
$$L_0(s, a) = \mu_0(s)\,\pi_0(a|s), \qquad L_{t+1}(s', a') = \pi_{t+1}(a'|s') \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} P_t(s'|s, a, L_t)\, L_t(s, a), \quad t = 0, \dots, T-1. \tag{3}$$
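To make the consistency recursion concrete, the following minimal sketch (in Python/NumPy, with illustrative array conventions that are not the paper's notation) propagates a mean-field flow forward according to (3); the `transition` callable standing in for $P_t(\cdot|\cdot,\cdot,L_t)$ is an assumption of this sketch.

```python
import numpy as np

def propagate_mean_field(mu0, policy, transition, T):
    """Forward recursion (3): L_0(s,a) = mu0(s) * pi_0(a|s) and
    L_{t+1}(s',a') = pi_{t+1}(a'|s') * sum_{s,a} P_t(s'|s,a,L_t) * L_t(s,a).

    mu0        : (S,) initial state distribution
    policy     : list of (S, A) arrays, policy[t][s, a] = pi_t(a|s)
    transition : callable (t, L_t) -> (S, A, S) array with entries P_t(s'|s,a,L_t)
    """
    S, A = policy[0].shape
    L = np.zeros((T + 1, S, A))
    L[0] = mu0[:, None] * policy[0]
    for t in range(T):
        P = transition(t, L[t])                          # (S, A, S)
        next_state = np.einsum('sap,sa->p', P, L[t])     # distribution of s_{t+1}
        L[t + 1] = next_state[:, None] * policy[t + 1]   # multiply by pi_{t+1}(a'|s')
    return L

# Tiny usage example with a mean-field independent kernel (an assumption for illustration).
S, A, T = 3, 2, 4
rng = np.random.default_rng(0)
P_fixed = rng.random((S, A, S)); P_fixed /= P_fixed.sum(axis=2, keepdims=True)
uniform_policy = [np.full((S, A), 1.0 / A) for _ in range(T + 1)]
L = propagate_mean_field(np.full(S, 1.0 / S), uniform_policy, lambda t, Lt: P_fixed, T)
assert np.allclose(L.sum(axis=(1, 2)), 1.0)              # each L_t lies in the simplex on S x A
```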
The following existence result for Nash equilibrium solutions holds as long as the transitions and rewards are continuous in the mean-field information $L_t$. The proof is based on the Kakutani fixed-point theorem; it is almost identical to those in [62, 20], except for replacing the state mean-field flow with the state-action joint mean-field flow.
Proposition 1.
Suppose that $P_t(s'|s, a, \cdot)$ and $r_t(s, a, \cdot)$ are both continuous in $L_t$ for any $t$, $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$. Then a Nash equilibrium solution exists.
Having established the existence of Nash equilibrium solutions for the mean-field game ((1) and (2)), the rest of the paper focuses on proposing a new analytical approach to find Nash equilibrium solution(s) of the mean-field game. This is motivated by the well-known fact that most mean-field games have multiple Nash equilibrium solutions and are neither contractive nor (weakly) monotone, as the following simple example illustrates.
Example of multiple Nash equilibrium solutions.
Take and an arbitrary initial distribution . At time , for any and any , the rewards , and the transitions are deterministic such that . When , the rewards and the transition probabilities may depend on . Specifically, if the population congregates at the same state and chooses to stay, i.e., , then the rewards for some , and the transition probabilities will be deterministic . If deviates from state and action , the rewards and transition probabilities will be adjusted as follows:
-
•
, for any and . (Here, for a totally ordered finite set, we use binary-vector notation in which the entries corresponding to the given indices have value one and all other entries have value zero.)
-
•
for any and .
Here are some non-negative constants.
It is easy to verify that for any , if we define
-
•
for any and ;
-
•
and , for any and ,
then and constitute a Nash equilibrium solution. Thus there are several distinct Nash equilibrium solutions with both and being different, defined by the index as and . As contractivity and strict monotonicity imply the uniqueness of the Nash equilibrium solution, the mean-field game above is neither contractive nor strictly monotone. To see that it is not weakly monotone either, consider any two mean-field flows and with . Then
violating the requirement of weak monotonicity.
3 MF-OMO: Optimization formulation of mean-field games
To overcome the restrictive assumptions of contractivity and monotonicity for analyzing mean-field games, we first establish a new optimization framework to study mean-field games, via two steps. The first is an optimization reformulation of mean-field games with potentially unbounded variables and nonconvex constraints, which relies on the equivalence between an MDP problem and a linear program of occupation measures. The second is to utilize a solution modification technique to obtain the final form of the optimization formulation with bounded variables and simple convex constraints. We call this framework Mean-Field Occupation Measure Optimization (MF-OMO).
Occupation measure.
To start, let us introduce a new variable $d_t(s, a)$ for any $t = 0, \dots, T$, $s \in \mathcal{S}$, $a \in \mathcal{A}$, which represents the occupation measure of the representative agent under some policy sequence $\boldsymbol{\pi}$ in $\mathcal{M}_L$, the mean-field induced MDP, i.e., $d_t(s, a) = \mathbb{P}(s_t = s, a_t = a)$, with $s_0 \sim \mu_0$, $a_t \sim \pi_t(\cdot|s_t)$, $s_{t+1} \sim P_t(\cdot|s_t, a_t, L_t)$, for $t = 0, \dots, T$. Given the occupation measure $d = \{d_t\}_{t=0}^{T}$, define a mapping $\Pi$ that retrieves the policy from the occupation measure. This set-valued mapping maps a sequence $d$ to a set of policy sequences $\Pi(d)$: for any $d$, $\boldsymbol{\pi} \in \Pi(d)$ if and only if
$$\pi_t(a|s) = \frac{d_t(s, a)}{\sum_{a' \in \mathcal{A}} d_t(s, a')} \tag{4}$$
when $\sum_{a' \in \mathcal{A}} d_t(s, a') > 0$, and $\pi_t(\cdot|s)$ is an arbitrary probability vector in $\Delta(\mathcal{A})$ when $\sum_{a' \in \mathcal{A}} d_t(s, a') = 0$.
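As an illustration (reusing the array conventions of the sketch after (3)), the occupation measures of a fixed policy in the mean-field induced MDP follow the same forward recursion, and the retrieval map (4) is a row-wise normalization; the uniform tie-breaking at zero-mass states below is one arbitrary choice permitted by the definition.

```python
import numpy as np

def occupation_measures(mu0, policy, transition_L, T):
    """d_t(s,a) = P(s_t = s, a_t = a) under `policy` in the mean-field induced MDP;
    with d in place of L, this is the same recursion as the consistency map (3),
    except that the transition kernel has the mean-field flow already fixed."""
    S, A = policy[0].shape
    d = np.zeros((T + 1, S, A))
    d[0] = mu0[:, None] * policy[0]
    for t in range(T):
        P = transition_L(t)                              # (S, A, S) kernel of the induced MDP
        next_state = np.einsum('sap,sa->p', P, d[t])
        d[t + 1] = next_state[:, None] * policy[t + 1]
    return d

def retrieve_policy(d, tol=1e-12):
    """Map (4): pi_t(a|s) = d_t(s,a) / sum_a' d_t(s,a'); arbitrary (here uniform) at zero mass."""
    T1, S, A = d.shape
    policy = np.full((T1, S, A), 1.0 / A)
    for t in range(T1):
        mass = d[t].sum(axis=1)
        pos = mass > tol
        policy[t, pos] = d[t, pos] / mass[pos, None]
    return policy
```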
Linear program formulation of MDPs.
Next, consider any finite-horizon MDP (e.g., the mean-field induced MDP $\mathcal{M}_L$) with a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, an initial state distribution $\mu_0$, reward functions $r_t(s, a)$ for $t = 0, \dots, T$, and dynamics $P_t(s'|s, a)$ for $t = 0, \dots, T-1$, $s, s' \in \mathcal{S}$ and $a \in \mathcal{A}$. For brevity, we write $\mathbb{P}(s_t = s, a_t = a)$ for the state-action distribution at time $t$ generated by a policy sequence $\boldsymbol{\pi}$ in this MDP.
Then, our first component of MF-OMO relies on a linear program formulation of an MDP problem, as follows.
Lemma 2.
Suppose that $\boldsymbol{\pi}$ is an $\epsilon$-suboptimal policy for the MDP. Define $d_t(s, a) := \mathbb{P}(s_t = s, a_t = a)$ under $\boldsymbol{\pi}$ for $t = 0, \dots, T$, $s \in \mathcal{S}$ and $a \in \mathcal{A}$. Then $d = \{d_t\}_{t=0}^{T}$ is a feasible $\epsilon$-suboptimal solution to the following linear program:
$$\begin{array}{ll}
\underset{d \,\ge\, 0}{\text{maximize}} & \displaystyle\sum_{t=0}^{T} \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} r_t(s, a)\, d_t(s, a)\\[2mm]
\text{subject to} & \displaystyle\sum_{a \in \mathcal{A}} d_0(s, a) = \mu_0(s), \quad \forall\, s \in \mathcal{S},\\[2mm]
& \displaystyle\sum_{a' \in \mathcal{A}} d_{t+1}(s', a') = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} P_t(s'|s, a)\, d_t(s, a), \quad \forall\, s' \in \mathcal{S},\ t = 0, \dots, T-1.
\end{array}$$
Conversely, suppose that $d$ is a feasible $\epsilon$-suboptimal solution to the above linear program. Then any $\boldsymbol{\pi} \in \Pi(d)$ defined by (4) is an $\epsilon$-suboptimal policy for the MDP. In addition, the optimal value of the above linear program is equal to the optimal value of the MDP.
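The sketch below instantiates Lemma 2 on a small randomly generated finite-horizon MDP: it assembles the initial-distribution and flow-conservation constraints on the occupation measures, solves the linear program with `scipy.optimize.linprog`, and retrieves an optimal policy via (4). The flat variable ordering is a convenience of this sketch and need not match the vectorization used in (8)–(9).

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S, A, T = 4, 3, 5

# A small random finite-horizon MDP (illustration only).
P = rng.random((T, S, A, S)); P /= P.sum(axis=3, keepdims=True)   # P[t, s, a, s']
r = rng.random((T + 1, S, A))                                     # r[t, s, a]
mu0 = rng.random(S); mu0 /= mu0.sum()

n = (T + 1) * S * A
idx = lambda t, s, a: t * S * A + s * A + a        # flat index of d_t(s, a)

A_eq, b_eq = [], []
for s in range(S):                                 # sum_a d_0(s, a) = mu0(s)
    row = np.zeros(n); row[[idx(0, s, a) for a in range(A)]] = 1.0
    A_eq.append(row); b_eq.append(mu0[s])
for t in range(T):                                 # sum_a d_{t+1}(s', a) = sum_{s,a} P_t(s'|s,a) d_t(s,a)
    for sp in range(S):
        row = np.zeros(n)
        row[[idx(t + 1, sp, a) for a in range(A)]] = 1.0
        for s in range(S):
            for a in range(A):
                row[idx(t, s, a)] -= P[t, s, a, sp]
        A_eq.append(row); b_eq.append(0.0)

# linprog minimizes, so negate the rewards; d >= 0 is enforced through the bounds.
res = linprog(c=-r.reshape(-1), A_eq=np.array(A_eq), b_eq=np.array(b_eq), bounds=(0, None))
d = res.x.reshape(T + 1, S, A)
print("optimal value of the MDP:", -res.fun)

# Retrieve an optimal policy via (4); uniform where the occupation mass vanishes.
mass = d.sum(axis=2, keepdims=True)
pi = np.where(mass > 1e-12, d / np.maximum(mass, 1e-12), 1.0 / A)
```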
A preliminary version of MF-OMO.
We now show that $(\boldsymbol{\pi}, L)$ is a Nash equilibrium solution if and only if there exist an occupation measure $d = \{d_t\}_{t=0}^{T}$ and $\boldsymbol{\pi} \in \Pi(d)$, such that the following two conditions hold.
- (A) Optimality of the representative agent: $d$ solves the following LP problem
$$\underset{d \,\ge\, 0}{\text{maximize}} \quad \sum_{t=0}^{T} \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} r_t(s, a, L_t)\, d_t(s, a) \tag{5}$$
$$\text{subject to} \quad \sum_{a \in \mathcal{A}} d_0(s, a) = \mu_0(s), \qquad \sum_{a' \in \mathcal{A}} d_{t+1}(s', a') = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} P_t(s'|s, a, L_t)\, d_t(s, a). \tag{6}$$
- (B) Consistency between the agent and the mean-field: $L_t = d_t$ for any $t = 0, \dots, T$.
Here condition (A) corresponds to the optimality condition 1) in Definition 2.1 according to Lemma 2. Condition (B) indicates that the occupation measure of the single agent matches the mean-field flow of the population, so that the agent is indeed representative. Condition (B), when combined with condition (A), in fact implies the consistency condition 2) in Definition 2.1. Formally, we have the following lemma.
Lemma 3.
To get the preliminary version of MF-OMO, notice that the linear program in condition (A) can be rewritten as
$$\underset{d}{\text{minimize}} \quad c_L^\top d \qquad \text{subject to} \quad A_L d = b, \quad d \ge 0, \tag{7}$$
where and are
(8) |
is a flattened vector (with column-major order), and the matrix is
(9) |
Here is the matrix with the -th row () being the flattened vector (with column-major order), and the matrix is defined as
(10) |
where is the identity matrix with dimension . In addition, the variable is also flattened/vectorized in the same order as . Accordingly, hereafter both and are viewed as flattened vectors (with column-major order) in or a sequence of flattened vectors (with column-major order) in , depending on the context, and we use and (resp. and ) alternatively.
By the strong duality of linear programs [50], $d$ is an optimal solution to the above linear program (7) if and only if the following KKT conditions hold for $d$ and some $y$ and $z$:
$$A_L d = b, \quad d \ge 0, \qquad A_L^\top y + z = c_L, \quad z \ge 0, \qquad z^\top d = 0. \tag{11}$$
Now, combining condition (A) (in the form of equation (11)) with condition (B), we have the following preliminary version of MF-OMO.
Theorem 4.
Finding a Nash equilibrium solution of the mean-field game ((1) and (2)) is equivalent to solving the following constrained optimization problem:
$$\begin{array}{ll}
\underset{L,\, y,\, z}{\text{minimize}} & 0\\[1mm]
\text{subject to} & A_L L = b, \qquad A_L^\top y + z = c_L,\\[1mm]
& z^\top L = 0, \qquad L \ge 0, \qquad z \ge 0,
\end{array} \tag{12}$$
where $c_L$, $b$ and $A_L$ are set as in (8) and (9). Specifically, if $(\boldsymbol{\pi}, L)$ is a Nash equilibrium solution of the mean-field game ((1) and (2)), then there exist some $(y, z)$ such that $(L, y, z)$ solves the constrained optimization problem (12). On the other hand, if $(L, y, z)$ solves the constrained optimization problem (12), then for any $\boldsymbol{\pi} \in \Pi(L)$, $(\boldsymbol{\pi}, L)$ is a Nash equilibrium solution of the mean-field game ((1) and (2)).
In the above derivation, we have used $L$ to represent the population mean-field flow and $d$ to denote the occupation measure of the representative agent. The optimal occupation measure (i.e., the solution to the linear program) associated with the representative agent is, by condition (B), identified with $L$.
MF-OMO.
The above constrained optimization problem (12) in Theorem 4 can be rewritten in a more computationally efficient form by adding bound constraints on the auxiliary variables $y$ and $z$ and moving the potentially nonconvex constraints into the objective. Note that the resulting nonconvex objective is in general much easier to handle than nonconvex constraints (cf. (12)). In addition, the boundedness of all variables is essential to the later analysis of the optimization framework in Sections 4 and 5.
Theorem 5.
Finding a Nash equilibrium solution of the mean-field game ((1) and (2)) is equivalent to solving the following optimization problem with bounded variables and simple convex constraints:
(MF-OMO) |
where $c_L$, $b$ and $A_L$ are set as in (8) and (9), and $\mathbf{1}$ is the all-one vector (with appropriate dimensions). Specifically, if $(\boldsymbol{\pi}, L)$ is a Nash equilibrium solution of the mean-field game ((1) and (2)), then there exist some $(y, z)$ such that $(L, y, z)$ solves (MF-OMO). On the other hand, if $(L, y, z)$ solves (MF-OMO) with the value of the objective function being $0$, then for any $\boldsymbol{\pi} \in \Pi(L)$, $(\boldsymbol{\pi}, L)$ is a Nash equilibrium solution of the mean-field game ((1) and (2)).
Theorem 5 is immediate from the following two results, which show how the bounds on the auxiliary variables $y$ and $z$ in (MF-OMO) are obtained. Note that only $L$ is needed to construct the Nash equilibrium solution, hence there is no loss in adding the additional bounds on $y$ and $z$ according to Corollary 7.
The interpretation of $y$ and $z$ is motivated by the following result.
Proposition 6 (Solution modification/selection).
Suppose that is a solution to (12). Define as
where is the vector of value functions starting from step for the induced MDP (cf. Section 2). In addition, define , where is the Bellman residual/gap (also known as the (optimal) advantage function in the reinforcement learning literature), i.e., () with
and (), where is defined by . Then is also a solution to (12).
The following corollary follows immediately by the definitions of and in Proposition 6.
Corollary 7.
Suppose that is a solution to (12). Then there exists a modified solution , such that the following bounds hold:
The key is to notice that , and , for any .
Remark 1.
Note that the constraint $\mathbf{1}^\top L_t = 1$, $L_t \ge 0$ is added to ensure the well-definedness of $A_L$ and $c_L$ for non-optimal but feasible variables, as otherwise $P_t(s'|s, a, L_t)$ and $r_t(s, a, L_t)$ are undefined outside of the probability simplex $\Delta(\mathcal{S} \times \mathcal{A})$. The choice of norms on $y$ and $z$ is to facilitate projection evaluation and reparametrization, where the first (projection evaluation) will become clear in Section 5 and the second (reparametrization) is explained in Appendix A.3. In particular, the choice of the $\ell_1$-norm for $z$ is to obtain a simplex-like specialized polyhedral constraint paired with $z \ge 0$, and the $\ell_2$-norm for $y$ is to obtain closed-form projections (by normalization) and simple reparametrizations. Nevertheless, all our subsequent results in Sections 4 and 5 hold (up to changes of constants) for other choices of norms on $y$ and $z$, as long as either efficient projections or reparametrizations are feasible for $y$ and $z$.
Remark 2.
Note that whenever a Nash equilibrium solution exists for the mean-field game ((1) and (2)), the optimal value of the objective function of (MF-OMO) is obviously $0$, and vice versa. To see more explicitly the relationship between (MF-OMO) and Nash equilibrium solution(s) of the mean-field game ((1) and (2)), denote the feasible set of (MF-OMO) as $\Theta$, i.e.,
(13) |
Then (MF-OMO) can be rewritten as minimizing the following objective function over $\Theta$:
(14) |
Here the first two rows expand the term $\|A_L L - b\|_2^2$, corresponding to the consistency condition 2) in Definition 2.1. The next four rows expand the term $\|A_L^\top y + z - c_L\|_2^2$, corresponding to the Bellman residual by the interpretations of $y$ and $z$ in Proposition 6 and hence to the optimality condition 1) in Definition 2.1. The last row expands the complementarity term $z^\top L$, connecting the two conditions into a single optimization problem.
This MF-OMO formulation is in sharp contrast to most existing approaches in the literature of discrete-time mean-field games, such as [34, 59, 58], which alternate between a full policy optimization/evaluation given a fixed mean-field flow and a full mean-field flow forward propagation given a fixed policy sequence. Moreover, both the policy optimization/evaluation and the mean-field propagation lead to coupling over the full time horizon $T$. As a result, the convergence of these algorithms requires contractivity or monotonicity assumptions. In contrast, (14) shows that MF-OMO is fully decoupled over the time horizon, which better facilitates parallel and distributed optimization algorithms and enables more efficient sub-sampling for stochastic optimization algorithms, especially for problems with large time horizons.
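As a structural illustration of this decoupling, the sketch below evaluates, for a candidate flow and a value-function-like auxiliary variable, the three ingredients that (14) is built from — the consistency residual, the Bellman residual, and the complementarity term — each as a sum of per-time-step pieces. The value-function parametrization, the sign conventions, and the absence of the exact constants from (8)–(10) are simplifying assumptions of this sketch, not a literal transcription of (14).

```python
import numpy as np

def mf_omo_residuals(L, V, mu0, transition, reward):
    """Per-time-step ingredients behind the expansion (14), in value-function form.

    L          : (T+1, S, A) candidate mean-field flow / occupation measure
    V          : (T+1, S)    value-function-like auxiliary variable (playing the role of y)
    transition : callable (t, L_t) -> (S, A, S);  reward : callable (t, L_t) -> (S, A)
    Each returned quantity is a sum of pieces involving only (L_t, L_{t+1}, V_t, V_{t+1}),
    i.e., the computation decouples across time steps.
    """
    T1, S, A = L.shape
    T = T1 - 1
    consistency = [L[0].sum(axis=1) - mu0]               # sum_a L_0(s,a) - mu0(s)
    z = np.zeros((T1, S, A))
    for t in range(T1):
        r = reward(t, L[t])
        if t < T:
            P = transition(t, L[t])
            Q = r + np.einsum('sap,p->sa', P, V[t + 1])
            flow = np.einsum('sap,sa->p', P, L[t])        # pushforward of L_t under P_t
            consistency.append(L[t + 1].sum(axis=1) - flow)
        else:
            Q = r
        z[t] = V[t][:, None] - Q                          # Bellman residual V_t(s) - Q_t(s,a)
    cons_sq = sum(np.sum(c ** 2) for c in consistency)    # plays the role of ||A_L L - b||^2
    comp = float(np.sum(z * L))                           # plays the role of z^T L
    return cons_sq, z, comp
```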
4 Connecting exploitability with MF-OMO suboptimality
Having established the equivalence between minimizers of (MF-OMO) and Nash equilibrium solutions of mean-field games, we now connect the exploitability of mean-field games with suboptimal solutions to (MF-OMO).
The concept of exploitability in game theory is used to characterize the difference between any policy and a Nash equilibrium solution. More precisely, define a mapping $\Gamma$ that maps any policy sequence $\boldsymbol{\pi}$ to the mean-field flow $\Gamma(\boldsymbol{\pi})$ induced when all players take such a policy sequence. Following the consistency condition (3), such $\Gamma$ can be defined recursively, starting with the initialization
$$\Gamma(\boldsymbol{\pi})_0(s, a) = \mu_0(s)\,\pi_0(a|s), \tag{15}$$
such that
$$\Gamma(\boldsymbol{\pi})_{t+1}(s', a') = \pi_{t+1}(a'|s') \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} P_t\big(s'|s, a, \Gamma(\boldsymbol{\pi})_t\big)\, \Gamma(\boldsymbol{\pi})_t(s, a), \quad t = 0, \dots, T-1. \tag{16}$$
For any given policy sequence $\boldsymbol{\pi}$, let $L = \Gamma(\boldsymbol{\pi})$ be the induced mean-field flow. Then exploitability characterizes the sub-optimality of the policy $\boldsymbol{\pi}$ under $L$ as follows:
$$\mathrm{Expl}(\boldsymbol{\pi}) := V_{\mu_0}^\star\big(\Gamma(\boldsymbol{\pi})\big) - V_{\mu_0}^{\boldsymbol{\pi}}\big(\Gamma(\boldsymbol{\pi})\big). \tag{17}$$
In particular, $(\boldsymbol{\pi}, L)$ is a Nash equilibrium solution if and only if $\mathrm{Expl}(\boldsymbol{\pi}) = 0$ and $L = \Gamma(\boldsymbol{\pi})$; and a policy $\boldsymbol{\pi}$ is an $\epsilon$-Nash equilibrium solution if $\mathrm{Expl}(\boldsymbol{\pi}) \le \epsilon$.
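In this finite setting, exploitability can be evaluated exactly by propagating $\Gamma(\boldsymbol{\pi})$ (e.g., with the `propagate_mean_field` sketch after (3)), solving the induced MDP by backward induction, and evaluating the given policy in the same MDP; the callables for the mean-field dependent rewards and transitions are assumptions of this sketch.

```python
import numpy as np

def exploitability(policy, mu0, transition, reward, T):
    """Expl(pi) = (optimal value of the MDP induced by Gamma(pi)) - (value of pi in that MDP)."""
    L = propagate_mean_field(mu0, policy, transition, T)   # sketch after (3)
    S, A = policy[0].shape
    V_opt = np.zeros(S)                                    # optimal values V_t^*(s)
    V_pi = np.zeros(S)                                     # values V_t^pi(s) of the given policy
    for t in range(T, -1, -1):                             # backward induction
        r = reward(t, L[t])                                # (S, A)
        if t < T:
            P = transition(t, L[t])                        # (S, A, S)
            Q_opt = r + np.einsum('sap,p->sa', P, V_opt)
            Q_pi = r + np.einsum('sap,p->sa', P, V_pi)
        else:
            Q_opt, Q_pi = r, r
        V_opt = Q_opt.max(axis=1)                          # best-response step
        V_pi = (policy[t] * Q_pi).sum(axis=1)              # policy-evaluation step
    return float(mu0 @ (V_opt - V_pi))                     # nonnegative; (near) zero when pi
                                                           # best-responds to its own induced flow
```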
However, computing and optimizing exploitability directly is not easy. First, $\mathrm{Expl}(\boldsymbol{\pi})$ is generally nonconvex and nonsmooth in $\boldsymbol{\pi}$, even if $P_t$ and $r_t$ are differentiable in $L_t$. Secondly, a single evaluation of $\mathrm{Expl}(\boldsymbol{\pi})$ requires a full policy optimization when calculating $V_{\mu_0}^\star(\Gamma(\boldsymbol{\pi}))$ and a full policy evaluation when calculating $V_{\mu_0}^{\boldsymbol{\pi}}(\Gamma(\boldsymbol{\pi}))$. In this section, we will show that in order to find an $\epsilon$-Nash equilibrium solution, it is sufficient to solve (MF-OMO) to sufficient sub-optimality.
We will need additional notation here before presenting the precise statement. Denote as a tensor of dimensions . For a sequence of tensors with dimensions (e.g., can be ) and a sequence of vectors with dimensions (e.g., can be , or , etc.), define
Similarly, for any tensor with dimensions , define
Denote also , and respectively for the standard , and vector and matrix norms. Then we will see that a near-optimal solution to (MF-OMO) will be close to a Nash equilibrium solution, assuming the dynamics and rewards are Lipschitz continuous in .
Theorem 8.
Suppose that for any , and are and Lipschitz continuous in , respectively, i.e.,
where and , and () are two arbitrary mean-field flows. Let be a feasible solution to (MF-OMO), with the value of its objective function being for some . Then for any , we have
where
Remark 3.
Conversely, one can characterize an $\epsilon$-Nash equilibrium solution in terms of the sub-optimality of (MF-OMO), without the additional assumption of Lipschitz continuity in Theorem 8.
Theorem 9.
Let be an -Nash equilibrium solution, i.e., . Define ,
and with (), where
and (). Then is a feasible solution to (MF-OMO) and .
5 Finding Nash equilibrium solutions via solving MF-OMO
The optimization formulation (MF-OMO) enables the application of a variety of optimization algorithms to find multiple Nash equilibrium solutions. In this section, we first present the projected gradient descent algorithm and establish its convergence guarantees to Nash equilibrium solutions of general mean-field games. We then present a finite-time convergent algorithm for solving a special class of mean-field games with linear rewards and mean-field independent dynamics. More discussion on stochastic algorithms and convergence to stationary points, as well as practical techniques such as reparametrization, acceleration and variance reduction, can be found in Appendix A.
To start, we will need the following assumption.
Assumption 1.
$P_t(s'|s, a, \cdot)$ and $r_t(s, a, \cdot)$ are both second-order continuously differentiable in $L_t$ within some open set containing $\Delta(\mathcal{S} \times \mathcal{A})$, for any $t$, $s$, $s'$ and $a$.
This assumption is easily verifiable and holds for numerical examples in many existing works including [35, 20] and for theoretical studies of [59, 58].
For notational simplicity, denote the concatenation of $(L, y, z)$ as $\theta$, write $f(\theta)$ for the objective function of (MF-OMO), and recall the earlier notation $\Theta$ for the feasible set of (MF-OMO) (cf. (13)). Then we see that Assumption 1, together with the compactness of the probability simplex and the norm bounds on $y$ and $z$ in (MF-OMO), immediately leads to the following proposition.
Proposition 10.
Under Assumption 1, $f$ is $\ell$-strongly smooth on $\Theta$ for some positive constant $\ell$. That is, for any $\theta_1, \theta_2 \in \Theta$, we have $\|\nabla f(\theta_1) - \nabla f(\theta_2)\|_2 \le \ell \|\theta_1 - \theta_2\|_2$.
In addition, Assumption 1, together with the compactness of $\Theta$, implies the continuity condition in Proposition 1 and the Lipschitz continuity condition in Theorem 8. Moreover, optimal solution(s) exist for (MF-OMO) under Assumption 1. We will denote the optimal solution set of (MF-OMO) as $\Theta^\star$ hereafter.
5.1 PGD and convergence to Nash equilibrium solutions
Projected gradient descent (PGD).
The projected gradient descent algorithm updates $\theta^k = (L^k, y^k, z^k)$ with the following iterative process:
$$\theta^{k+1} = \mathrm{Proj}_{\Theta}\big(\theta^k - \eta_k \nabla f(\theta^k)\big). \tag{18}$$
Here $\eta_k$ is the step-size of the $k$-th iteration, for which appropriate ranges are specified below, and the projection operator $\mathrm{Proj}_{\Theta}$ projects each $L_t$ to the probability simplex, $z$ to a specialized polyhedron, and $y$ to the $\ell_2$-norm ball with the radius prescribed in (MF-OMO). All these projections can be efficiently evaluated in linear time and in (almost) closed form (cf. [28, 49, 64, 17] for projections onto the probability simplex and the specialized polyhedron, and [57] for the projection onto $\ell_2$-norm balls). For simplicity, we always assume that the initialization is feasible, i.e., $\theta^0 \in \Theta$.
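A hedged sketch of this iteration is given below. The radii `y_radius` and `z_radius` stand in for the explicit bounds in (MF-OMO) (whose constants are not reproduced here), the polyhedron for $z$ is taken to be $\{z \ge 0,\ \mathbf{1}^\top z \le \text{radius}\}$ as an assumption, and `grad_f` is a user-supplied gradient of the (MF-OMO) objective.

```python
import numpy as np

def project_simplex(v, radius=1.0):
    """Euclidean projection onto {x >= 0, sum(x) = radius} (sort-based, O(n log n))."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - radius
    rho = np.nonzero(u * np.arange(1, v.size + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def project_l2_ball(v, radius):
    nrm = np.linalg.norm(v)
    return v if nrm <= radius else v * (radius / nrm)

def project_z(z, radius):
    """Projection onto {z >= 0, sum(z) <= radius}: clip first; if the budget is exceeded,
    the sum constraint is active and the answer is the projection onto the scaled simplex."""
    w = np.maximum(z, 0.0)
    return w if w.sum() <= radius else project_simplex(z, radius)

def pgd(theta0, grad_f, eta, num_iters, z_radius, y_radius):
    """PGD iteration (18); theta is a dict with fields L (T+1, S, A), y and z (flat arrays)."""
    L, y, z = theta0['L'].copy(), theta0['y'].copy(), theta0['z'].copy()
    for _ in range(num_iters):
        gL, gy, gz = grad_f(L, y, z)                       # gradients of the (MF-OMO) objective
        L = L - eta * gL
        for t in range(L.shape[0]):                        # each L_t back onto the simplex on S x A
            L[t] = project_simplex(L[t].ravel()).reshape(L[t].shape)
        y = project_l2_ball(y - eta * gy, y_radius)
        z = project_z(z - eta * gz, z_radius)
    return {'L': L, 'y': y, 'z': z}
```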
Convergence to Nash equilibrium solutions.
To find a Nash equilibrium solution of the mean-field game ((1) and (2)) by the PGD algorithm, we will need an additional definability assumption stated below, which is one of the most commonly adopted assumptions in the existing convergence theory of nonconvex optimization.
Assumption 2.
For any $t$, $s$, $s'$ and $a$, $P_t(s'|s, a, \cdot)$ and $r_t(s, a, \cdot)$ (as functions of $L_t$) are both restrictions to $\Delta(\mathcal{S} \times \mathcal{A})$ of definable functions on the log-exp structure (for simplicity, below we say that a function is definable if it is definable on the log-exp structure).
Definable functions cover a broad class of (both smooth and nonsmooth) functions, including all semialgebraic functions, all analytic functions on definable compact sets (a set is definable if it can be defined as the image/range of a definable function), and the exponential and logarithm functions. Moreover, any finite combination of definable functions via summation, subtraction, product, affine mapping, composition, (generalized) inversion, max and min, partial supremum and partial infimum, as well as reciprocal (restricted to a compact domain) inside their domain of definition is definable. Generally speaking, definable functions include all functions that are “programmable”, such as those that can be defined in NumPy, SciPy and PyTorch. The precise definitions of definability for both sets and functions and of the log-exp structure can be found in Appendix B.1 (see also [7, Section 4.3] and [19]).
Under Assumptions 1 and 2, we will show that the PGD algorithm converges to Nash equilibrium solution(s) when the initialization is sufficiently close. Moreover, if the initialization is close to some isolated Nash equilibrium solution (e.g., when there is a finite number of Nash equilibrium solutions), the iterates converge to that specific Nash equilibrium solution. As a result, different initializations enable us to retrieve all possible Nash equilibrium solutions when there are finitely many of them (see also the multi-start sketch after Theorem 11). The proof is left to Appendix B.2.
Theorem 11 (Convergence to Nash equilibrium solutions).
Under Assumptions 1 and 2, let () be the sequence generated by PGD (18) with . Then for any Nash equilibrium solution , there exists , such that for any with , if we initialize PGD with , where , , (), with
and (), then we have the following:
- •
- •
-
•
For any sequence of , .
In addition, for any isolated Nash equilibrium solution (i.e., there exists such that any with is not a Nash equilibrium solution), if we initialize PGD with in the same way as above, then and for any sequence .
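Theorem 11 suggests a simple multi-start recipe for retrieving several (possibly all) Nash equilibrium solutions: run PGD from many feasible initializations, keep the runs whose objective value reaches (near) zero, and deduplicate the resulting flows. The sketch below reuses the `pgd` routine from the earlier sketch; the tolerances, radii and the initialization sampler `random_theta0` are illustrative assumptions.

```python
import numpy as np

def find_equilibria(random_theta0, grad_f, objective, num_starts=50, eta=1e-2,
                    num_iters=5000, tol=1e-6, dedup_tol=1e-3, z_radius=1.0, y_radius=1.0):
    """Multi-start local search; returns approximately distinct near-optimal flows L."""
    found = []
    for seed in range(num_starts):
        theta = pgd(random_theta0(seed), grad_f, eta, num_iters, z_radius, y_radius)
        if objective(theta['L'], theta['y'], theta['z']) > tol:
            continue                                       # run did not reach (near) zero objective
        L = theta['L']
        if all(np.max(np.abs(L - L_prev)) > dedup_tol for L_prev in found):
            found.append(L)                                # a new (approximate) equilibrium flow
    return found
```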
5.2 Finite-time solvability for a class of mean-field games
For mean-field games with linear rewards and mean-field independent dynamics, we will show that a Nash equilibrium solution can be found in finite time under the MF-OMO framework. This class of mean-field games includes the LR (left-right) and RPS (rock-paper-scissors) problems in [20].
More precisely, consider mean-field games with , for some constants and constant vectors . Here, is independent of , hence denoted as (to distinguish from the number of actions ), and is of the form , where is the vectorization of (with column-major order), and is defined as
By (12), one can see that finding Nash equilibrium solution(s) of this class of mean-field games is equivalent to solving the following linear complementarity problem (LCP):
Notice that for any solution, the complementarity condition forces either $z_i = 0$ or $L_i = 0$ for every index $i$. Hence it suffices to consider the linear programs induced by fixing $z_i = 0$ for $i$ in some subset $I$ of the indices and $L_i = 0$ for $i \notin I$, which leads to a finite (if exponential) number of linear programs. Since a linear program can be solved to its global optimality in finite time (e.g., by simplex algorithms [53]), we have the following proposition.
Proposition 12.
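To illustrate the enumeration argument behind Proposition 12, the generic sketch below solves a linear complementarity problem "find $v \ge 0$ with $w = Mv + q \ge 0$ and $w^\top v = 0$" by enumerating which complementarity constraints are tight and checking a feasibility linear program for each pattern; the specific matrices arising from (12) for this class of games are not reproduced here.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def solve_lcp_by_enumeration(M, q):
    """Find v >= 0 with w = M v + q >= 0 and w^T v = 0 by enumerating index patterns.

    For each subset I, impose w_i = 0 for i in I and v_i = 0 for i not in I, and check
    feasibility of the resulting linear program (constant objective). Exponential in the
    dimension, but each subproblem is a finite linear program."""
    n = len(q)
    for pattern in itertools.product([False, True], repeat=n):
        I = np.array(pattern)
        A_eq = np.vstack([M[I], np.eye(n)[~I]])            # (M v + q)_i = 0 on I, v_i = 0 off I
        b_eq = np.concatenate([-q[I], np.zeros(int((~I).sum()))])
        if (~I).any():                                     # (M v + q)_i >= 0 off I
            A_ub, b_ub = -M[~I], q[~I]
        else:
            A_ub, b_ub = None, None
        res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=(0, None))
        if res.success:
            return res.x
    return None
```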
6 Proofs of main results
6.1 Proof of Lemma 3
For any Nash equilibrium solution $(\boldsymbol{\pi}, L)$, denote by $\mathbb{P}(s_t = s, a_t = a)$ the probability distribution of any representative agent taking policy $\boldsymbol{\pi}$ under $L$. We will first show that conditions (A) and (B) are satisfied. Since $(\boldsymbol{\pi}, L)$ is a Nash equilibrium solution, it satisfies (1). By Lemma 2 and by considering the MDP $\mathcal{M}_L$ for the representative agent, condition (A) is satisfied. Condition (B) can be proved by induction. When $t = 0$, the claim holds for any $s$ and any $a$, since (15) holds. Suppose the claim holds for $t$ and all $s$ and $a$. Then for any $s'$ and any $a'$,
where the last equation holds by (16).
6.2 Proof of Proposition 6
Since is a solution to (12), , , (by ), i.e., . Note that we use to denote the all-one vector (with appropriate dimensions). To show that solves (12), it reduces to show that , and .
By writing as with , as with , we have
which, by the definition of and , becomes
Now noticing that , , , and for , , we see
Moreover, we have , and that
implying that by its definition. Note that these are the Bellman equations for the mean-field induced MDP .
It remains to prove that . To see this, we first show that for and by backward induction on from to . Here and are defined as and , respectively.
Firstly, note that and . Hence similar to the expansions above, we have
and
Now for the base step, since for all , we have
For the induction step, suppose that for some . Then since for all , we have
Finally, suppose that . Then since for all ,
Therefore .
Now, since is a solution to problem (12), we have
And for any satisfying and , since and , we have
(19) |
In particular, for any such that for some . That is, (together with ) is the solution to the following linear optimization problem for fixed :
(20) |
6.3 Proof of Theorem 8
It is established via the following perturbation lemma for general MDPs.
Lemma 13.
Consider two finite MDPs and with finite horizon (a finite MDP refers to an MDP with finite state and action spaces), with their respective transition probabilities and , rewards and , expected total rewards and for any policy . In addition, denote their corresponding optimal values as and . Then we have
and
Here is such that and for any .
Proof of Lemma 13.
The proof consists of two parts.
Part I: Values for fixed policy . Let be the total expected reward starting from state at time under policy in the MDP (). Clearly ().
Now by the dynamic programming principle, we have for , and ,
and , implying
(21) |
Here () uses the fact that , and is viewed as a vector in when taking the norm.
By taking max over on the left-hand side of (21) and telescoping the inequalities, we have
Here we use the fact that
Part II: Optimal values. Let be the value function for MDP () starting at time . Then we have ().
Now by the dynamic programming principle, we have for , and ,
which implies that
(22) |
Here in () we use the fact that , and is viewed as a vector in when taking the norm.
By taking max over on the left-hand side of (22) and telescoping the inequalities, we have
Here we use the fact that . ∎
Proof of Theorem 8.
We first define for the , where is recursively defined by (for any ) and
Then is the occupation measure of policy for MDP .
Step 1: Closeness between and . By definition, . We first prove by induction that for all ,
(23) |
When ,
(24) |
where the last inequality is by . Now suppose (23) holds for . Then by the construction of and ,
(25) |
(26) |
Therefore
(27) |
where the last inequality holds by using induction on the first term and the constraint on in . This completes the induction.
The above bound (27), together with the fact that , immediately implies that
Hence by and , we have
Here we use for any and .
Step 2: Near-optimality of (and ) for . Now we show that is near-optimal for the MDP . To see this, define . Here can be viewed both as a vector in and a sequence of matrices with . Then we have .
Consider the following linear program with variables :
(28) |
and denote as its optimal solution. Then since is its feasible solution, we have .
Note that the dual problem of (28) can be written as
(29) |
By Lemma 2 and (7), if solves (29), then any is an optimal policy for the MDP with transitions and rewards (with fixed), and the optimal value of the objective function of (29) is equal to the negative optimal value of the MDP . By the strong duality of linear programs, we see that is equal to the negative optimal value of the MDP . Note that the components of are negative rewards, and that the existence of the optimal solution follows from the existence of optimal policies for any finite MDP with finite horizon and Lemma 2 and (7).
Now, let be an optimal solution to the following linear program:
(30) |
Then again by Lemma 2 and (7), is equal to the negative optimal value of the MDP . Hence by applying Lemma 13 to the MDPs and ,
Putting the above bounds together, we conclude that
(31) |
Since is a feasible solution to (30), again by Lemma 2, (31) implies that , by definition satisfying from (25), is a -suboptimal solution to the MDP , where .
Step 3: Bounding the gap between and . We then bound the difference between and by proving the following for all :
(32) |
This is again proved by induction. When , it holds by noticing and (24). Now suppose (32) holds for , consider when , by the definition of , we have
(33) |
Then
(34) |
This concludes the induction. Consequently,
and similarly,
Finally, by applying Lemma 13 to the MDPs and , and utilizing the fact that is -suboptimal for the MDP , we see that
Here is the optimal value of the MDP for any mean-field flow . ∎
6.4 Proof of Theorem 9
7 Extensions
The MF-OMO framework can be extended to other variants of mean-field games. This section details its extension to personalized mean-field games; and its extension to multi-population mean-field games is straightforward and omitted here.
Personalized mean-field games are mean-field games involving non-homogeneous players, which arise in many applications. In such problems, every player is associated with some information (type) which characterizes the heterogeneity among players. Personalized mean-field games generalize mean-field games by incorporating the information (type) into the state of the players, with heterogeneity in the initial distributions. To deal with the heterogeneity, one needs to define a “stronger” Nash equilibrium solution for these personalized mean-field games, in which the policy sequence is optimal from any initial state given the mean-field flow. We say that a policy sequence $\boldsymbol{\pi}$ and a mean-field flow $L$ constitute a refined Nash equilibrium solution of the finite-time horizon personalized mean-field game if the following conditions are satisfied.
Definition 7.1 (Refined Nash equilibrium solution).
- 1) (Optimality) Fixing $L$, for any initial state $s \in \mathcal{S}$, $\boldsymbol{\pi}$ solves the following optimization problem:
$$\max_{\boldsymbol{\pi}}\ \mathbb{E}\Big[\sum_{t=0}^{T} r_t(s_t, a_t, L_t) \,\Big|\, s_0 = s\Big] \quad \text{subject to} \quad a_t \sim \pi_t(\cdot|s_t),\ \ s_{t+1} \sim P_t(\cdot|s_t, a_t, L_t), \tag{35}$$
i.e., $\boldsymbol{\pi}$ is optimal for the representative agent given the mean-field flow $L$;
- 2) (Consistency) Fixing $\boldsymbol{\pi}$, the consistency of the mean-field flow holds, i.e.,
$$L_t = \mathbb{P}_{(s_t, a_t)}, \quad t = 0, 1, \dots, T, \quad \text{where } s_0 \sim \mu_0,\ a_t \sim \pi_t(\cdot|s_t),\ s_{t+1} \sim P_t(\cdot|s_t, a_t, L_t). \tag{36}$$
For personalized mean-field games, we introduce a modified version of exploitability. Instead of simply averaging value functions over $\mu_0$ (cf. (17)), here the averaging is over a uniform initial distribution, which corresponds to the arbitrariness of the initial state in (35). The exploitability is defined as follows:
As in Section 3, is a refined Nash equilibrium solution if and only if .
To find the refined Nash equilibrium solution, we need to find the “refined” optimal policy for the mean-field induced MDP as described in (35), which is optimal under any initial state (instead of an initial state with distribution as in (1)). The proposition below characterizes the linear program formulation for (35), a counterpart of Lemma 2. For brevity, for any , we denote for the probability distribution of any representative agent taking policy under the mean-field flow , i.e., the state and/or action distribution generated by , , for .
Proposition 14.
Fix . Suppose that is an -suboptimal policy for (35) for any . Define for all , , and for and . Then and is a feasible -suboptimal solution to the following linear program:
(37) |
Conversely, suppose that is a feasible -suboptimal solution to the above linear program (37). Then for any , is an -suboptimal policy for (35) for any . In addition, the optimal value of the objective function of the above linear program (37) is equal to the average optimal value of (35) over .
Note that instead of the specified initial distribution $\mu_0$ in Lemma 2, Proposition 14 chooses a special initial distribution in which all components are equal to $1/S$. The proof is a direct combination of Lemma 2 and the following classical result for MDPs.
Lemma 15.
If $\boldsymbol{\pi}$ is an optimal policy for the MDP with an initial distribution $\mu$ satisfying $\mu(s) > 0$ for all $s \in \mathcal{S}$, then $\boldsymbol{\pi}$ is an optimal policy for the MDP with any initial state and any initial distribution.
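In terms of the Lemma 2 sketch from Section 3, Proposition 14 amounts to simply swapping in the uniform initial distribution (a usage note under that sketch's assumptions):

```python
import numpy as np

S = 4                                    # number of states, matching the Lemma 2 sketch
mu0_uniform = np.full(S, 1.0 / S)        # uniform initial distribution as in Proposition 14
# Re-running the linear program from the Lemma 2 sketch with mu0 = mu0_uniform corresponds
# (up to the choice of initial distribution) to (37); by Lemma 15, any optimal policy
# retrieved via (4) from its solution is then optimal from every initial state.
```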
Then, similar to Theorem 4, one can establish the corresponding equivalence between refined Nash equilibrium solutions and optimal solutions to a feasibility optimization problem. This is derived by replacing the initial distribution $\mu_0$ in Theorem 4 with the uniform distribution over $\mathcal{S}$. More precisely, based on the above result, one can characterize the optimality conditions of $\boldsymbol{\pi}$ given $L$, the counterpart of condition (A) with $\mu_0$ replaced by the uniform distribution over $\mathcal{S}$. Then, combined with the consistency condition for the mean-field flow (36), one can establish the corresponding equivalence between refined Nash equilibrium solutions and optimal solutions to the following feasibility optimization problem:
(38) |
where are set as (8) and (9), and . More precisely, is a refined Nash equilibrium solution if and only if is a solution to (38) for some .
Note that here condition (A) may not hold; therefore, the consistency condition on the mean-field flow cannot be implied as in Lemma 3. As a result, the policy $\boldsymbol{\pi}$ is directly included as part of the variables.
In addition, by summing over the appropriate index on both sides of the fifth row of the constraints in (38) and introducing the corresponding aggregated variable, the optimization problem (38) can be further simplified, leading to the following theorem.
Theorem 16.
Solving refined Nash equilibrium solution(s) of the personalized mean-field game ((35) and (36)) is equivalent to solving the following feasibility optimization problem:
(39) |
where are set as (8) and (9), and . More precisely, if is a refined Nash equilibrium solution, then is a solution to (39) for some . Conversely, if is a solution to (39), then for any , constitutes a refined Nash equilibrium solution.
Note that the last row of constraints in (39) is equivalent to requiring that , where the equality is in terms of sets.
8 Numerical experiments
In this section, we present numerical results of algorithms based on the MF-OMO optimization framework. In the first part, MF-OMO is compared with existing state-of-the-art algorithms for solving mean-field games (MFGs) on the SIS problem introduced in [20]. In the second part, it is shown that different initializations of MF-OMO algorithms converge to different Nash equilibria (NEs) in the example introduced in Section 2.
8.1 Comparison with existing algorithms
In this section, we focus on the SIS problem introduced in [20]. A large number of agents choose either to social distance or to go out at each time step. Susceptible (S) agents who choose to go out may get infected (I), with probability proportional to the number of infected agents, while susceptible agents who opt for social distancing stay healthy. In the meantime, infected agents recover with a certain probability at each time step. Agents in the system aim at finding the best strategy to minimize the costs induced by social distancing and by being infected. The parameters remain the same as in [20].
Figures 1 and 2 show the comparisons between MF-OMO and existing algorithms on this SIS problem under the two settings considered, respectively. Here, MF-OMO is tested with different optimization algorithms, including PGD (cf. §5), Adam [43], and NAdam [26], against online mirror descent (OMD) [58], fictitious play (FP) [59], and prior descent (PD) [20] with different choices of hyperparameters. (In the figure legends, the numbers after the OMD and FP algorithms are the step-sizes/learning-rates of these algorithms, while the number pairs after the PD algorithm are the temperatures and inner iterations of PD, respectively.) To make fair comparisons, the maximum number of iterations for each algorithm is set differently so that the total runtimes are comparable and MF-OMO algorithms are given the smallest budget in terms of both the total number of iterations and the total runtime. With the same uniform initialization, it is clear from the figures that MF-OMO with Adam outperforms the remaining algorithms. The normalized exploitability (defined as the exploitability divided by the initial exploitability) of MF-OMO Adam quickly drops below the target tolerance in terms of both the number of iterations and the runtime. All algorithms except for the three variants of MF-OMO fail to achieve this tolerance in both settings.
[Figures 1 and 2: normalized exploitability of MF-OMO (PGD, Adam, NAdam) and of the OMD, FP and PD baselines, plotted against the number of iterations and against runtime, for the two SIS settings.]
8.2 Multiple NEs
In this section, we focus on the problem introduced in Section 2 and show how algorithms based on the MF-OMO framework converge with different initializations. We choose , , ; and , and are all independently sampled uniformly between and .
From the discussion in Section 2, there are at least 3 different NEs, which correspond to agents staying in states 1, 2 and 3, respectively. Here, several different sets of random initializations are tested, denoted by RI$_i(\delta)$, which represents initializations randomly generated in a neighborhood of size $\delta$ of the $i$-th NE. Each RI set consists of 20 independent samples, and the convergence behavior of the MF-OMO NAdam algorithm is recorded in Table 1. (We choose NAdam as it converges significantly faster than other algorithms such as PGD and Adam for this specific problem, and our primary goal is to study the behavior after convergence.) For each neighborhood size $\delta$ and NE $i$, the convergence behavior in RI$_i(\delta)$ is categorized using the following criteria: “fail” counts the proportion of samples whose normalized exploitabilities do not drop below the target tolerance after 400 iterations; “nearest” counts the proportion of samples that converge to an (approximate) NE (with normalized exploitability below the target tolerance) closest to the $i$-th NE among the three NEs; “other” counts the proportion of samples that converge to an (approximate) NE closer to one of the other two NEs.
| $\delta$ | NE 1: fail | NE 1: nearest | NE 1: other | NE 2: fail | NE 2: nearest | NE 2: other | NE 3: fail | NE 3: nearest | NE 3: other |
|---|---|---|---|---|---|---|---|---|---|
| 0.05 | 0% | 70% | 30% | 5% | 75% | 20% | 0% | 65% | 35% |
| 0.2 | 0% | 50% | 50% | 5% | 65% | 30% | 5% | 60% | 35% |
| 0.8 | 5% | 45% | 50% | 5% | 40% | 55% | 0% | 30% | 70% |
| 1.0 | 0% | 40% | 60% | 0% | 45% | 55% | 0% | 25% | 75% |

Table 1: Convergence behavior of MF-OMO NAdam from 20 random initializations in a neighborhood of size $\delta$ of each NE.
From Table 1, it is clear that most of the samples converge to some approximate NE with exploitability below the target tolerance, regardless of the initialization. In addition, different initializations lead to different solutions. Specifically, when initializations are close to some NE, it is more likely for MF-OMO NAdam to converge to that specific NE, which is consistent with our theoretical study. Finally, as the neighborhood size increases, the convergence behavior becomes, as expected, more chaotic, i.e., less concentrated on the center NE.
References
- [1] Yves Achdou, Pierre Cardaliaguet, François Delarue, Alessio Porretta, and Filippo Santambrogio. Mean Field Games: Cetraro, Italy 2019, volume 2281. Springer Nature, 2021.
- [2] Ilan Adler and Sushil Verma. The linear complementarity problem, Lemke algorithm, perturbation, and the complexity class PPAD. https://optimization-online.org/wp-content/uploads/2011/03/2948.pdf, 2011. Last visited: 2022 September 6.
- [3] Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Fitted Q-learning in mean-field games. arXiv preprint arXiv:1912.13309, 2019.
- [4] Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Q-learning in regularized mean-field games. Dynamic Games and Applications, pages 1–29, 2022.
- [5] Donald G Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM (JACM), 12(4):547–560, 1965.
- [6] Andrea Angiuli, Jean-Pierre Fouque, and Mathieu Laurière. Unified reinforcement Q-learning for mean field game and control problems. Mathematics of Control, Signals, and Systems, pages 1–55, 2022.
- [7] Hédy Attouch, Jérôme Bolte, Patrick Redont, and Antoine Soubeyran. Proximal alternating minimization and projection methods for nonconvex problems: An approach based on the Kurdyka-Łojasiewicz inequality. Mathematics of Operations Research, 35(2):438–457, 2010.
- [8] Amir Beck. First-Order Methods in Optimization. SIAM, 2017.
- [9] Alain Bensoussan, Jens Frehse, and Sheung Chi Phillip Yam. The master equation in mean field theory. Journal de Mathématiques Pures et Appliquées, 103(6):1441–1474, 2015.
- [10] Abhay G Bhatt and Vivek S Borkar. Occupation measures for controlled Markov processes: Characterization and optimality. The Annals of Probability, pages 1531–1562, 1996.
- [11] Géraldine Bouveret, Roxana Dumitrescu, and Peter Tankov. Mean-field games of optimal stopping: A relaxed solution approach. SIAM Journal on Control and Optimization, 58(4):1795–1821, 2020.
- [12] Rainer Buckdahn, Boualem Djehiche, Juan Li, and Shige Peng. Mean-field backward stochastic differential equations: A limit approach. The Annals of Probability, 37(4):1524–1565, 2009.
- [13] Pierre Cardaliaguet, François Delarue, Jean-Michel Lasry, and Pierre-Louis Lions. The Master Equation and the Convergence Problem in Mean Field Games. Princeton University Press, 2019.
- [14] René Carmona and François Delarue. Mean field forward-backward stochastic differential equations. Electronic Communications in Probability, 18:1–15, 2013.
- [15] René Carmona and François Delarue. Probabilistic Theory of Mean Field Games with Applications I-II. Springer, 2018.
- [16] René Carmona and Peiqi Wang. A probabilistic approach to extended finite state mean field games. Mathematics of Operations Research, 46(2):471–502, 2021.
- [17] Laurent Condat. Fast projection onto the simplex and the ball. Mathematical Programming, 158(1):575–585, 2016.
- [18] Michel Coste. An Introduction to Semialgebraic Geometry, 2000.
- [19] Michel Coste. An Introduction to O-minimal Geometry. Istituti editoriali e poligrafici internazionali Pisa, 2000.
- [20] Kai Cui and Heinz Koeppl. Approximately solving mean field games via entropy-regularized deep reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1909–1917. PMLR, 2021.
- [21] Daniela Pucci De Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.
- [22] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27, 2014.
- [23] François Delarue, Daniel Lacker, and Kavita Ramanan. From the master equation to mean field game limit theory: A central limit theorem. Electronic Journal of Probability, 24:1–54, 2019.
- [24] Eric V Denardo. On linear programming in a Markov decision problem. Management Science, 16(5):281–288, 1970.
- [25] Cyrus Derman. On sequential decisions and Markov chains. Management Science, 9(1):16–24, 1962.
- [26] Timothy Dozat. Incorporating Nesterov momentum into Adam. 2016.
- [27] L. P. D. van den Dries. Tame Topology and O-minimal Structures. London Mathematical Society Lecture Note Series. Cambridge University Press, 1998.
- [28] John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto the -ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pages 272–279, 2008.
- [29] Roxana Dumitrescu, Marcos Leutscher, and Peter Tankov. Control and optimal stopping mean field games: A linear programming approach. Electronic Journal of Probability, 26:1–49, 2021.
- [30] Roxana Dumitrescu, Marcos Leutscher, and Peter Tankov. Linear programming fictitious play algorithm for mean field games with optimal stopping and absorption. arXiv preprint arXiv:2202.11428, 2022.
- [31] Romuald Elie, Julien Perolat, Mathieu Laurière, Matthieu Geist, and Olivier Pietquin. On the convergence of model free learning in mean field games. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7143–7150, 2020.
- [32] Zuyue Fu, Zhuoran Yang, Yongxin Chen, and Zhaoran Wang. Actor-critic provably finds nash equilibria of linear-quadratic mean-field games. arXiv preprint arXiv:1910.07498, 2019.
- [33] Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, 2016.
- [34] Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. Learning mean-field games. Advances in Neural Information Processing Systems, 32, 2019.
- [35] Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. A general framework for learning mean-field games. arXiv preprint arXiv:2003.06069, 2020.
- [36] Filip Hanzely, Konstantin Mishchenko, and Peter Richtárik. SEGA: Variance reduction via gradient sketching. Advances in Neural Information Processing Systems, 31, 2018.
- [37] Nicholas C Henderson and Ravi Varadhan. Damped anderson acceleration with restarts and monotonicity control for accelerating em and em-like algorithms. Journal of Computational and Graphical Statistics, 28(4):834–846, 2019.
- [38] Arie Hordijk and LCM Kallenberg. Linear programming and Markov decision chains. Management Science, 25(4):352–362, 1979.
- [39] Minyi Huang, Roland P Malhamé, and Peter E Caines. Large population stochastic dynamic games: Closed-loop Mckean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252, 2006.
- [40] Youqiang Huang and Masami Kurano. The LP approach in average reward MDPs with multiple cost constraints: The countable state case. Journal of Information and Optimization Sciences, 18(1):33–47, 1997.
- [41] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26, 2013.
- [42] Tobias Kaiser and Andre Opris. Differentiability properties of log-analytic functions. arXiv preprint arXiv:2007.03332, 2020.
- [43] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [44] Thomas Kurtz and Richard Stockbridge. Stationary solutions and forward equations for controlled and singular martingale problems. Electronic Journal of Probability, 6:1–52, 2001.
- [45] Thomas G Kurtz and Richard H Stockbridge. Existence of Markov controls and characterization of optimal Markov controls. SIAM Journal on Control and Optimization, 36(2):609–653, 1998.
- [46] Jean-Michel Lasry and Pierre-Louis Lions. Mean field games. Japanese Journal of Mathematics, 2(1):229–260, 2007.
- [47] Mathieu Lauriere. Numerical methods for mean field games and mean field type control. arXiv preprint arXiv:2106.06231, 2021.
- [48] Nicolas Le Roux, Mark W Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in Neural Information Processing Systems, 25, 2012.
- [49] Jun Liu and Jieping Ye. Efficient euclidean projections in linear time. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 657–664, 2009.
- [50] David G Luenberger and Yinyu Ye. Linear and Nonlinear Programming, volume 2. Springer, 1984.
- [51] Alan S Manne. Linear programming and sequential decisions. Management Science, 6(3):259–267, 1960.
- [52] D. Marker. Review of "Tame Topology and O-minimal Structures" by Lou van den Dries. Bulletin of the American Mathematical Society, 37(3):351–358, 2000.
- [53] Richard Kipp Martin. Large Scale Linear and Integer Optimization: A Unified Approach. Springer Science & Business Media, 2012.
- [54] KG Murty. Linear complementarity problem, its geometry and applications. Linear Complementarity, Linear and Nonlinear Programming, pages 1–58, 1997.
- [55] Yurii Nesterov. Lectures on Convex Optimization, volume 137. Springer, 2018.
- [56] Shunji Osaki and Hisashi Mine. Linear programming algorithms for semi-Markovian decision processes. Journal of Mathematical Analysis and Applications, 22(2):356–381, 1968.
- [57] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
- [58] Julien Perolat, Sarah Perrin, Romuald Elie, Mathieu Laurière, Georgios Piliouras, Matthieu Geist, Karl Tuyls, and Olivier Pietquin. Scaling up mean field games with online mirror descent. arXiv preprint arXiv:2103.00623, 2021.
- [59] Sarah Perrin, Julien Pérolat, Mathieu Laurière, Matthieu Geist, Romuald Elie, and Olivier Pietquin. Fictitious play for mean field games: Continuous time analysis and applications. Advances in Neural Information Processing Systems, 33:13199–13213, 2020.
- [60] Boris T Polyak. Introduction to Optimization. New York : Optimization Software, Publications Division, 1987.
- [61] Martin L Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, 2014.
- [62] Naci Saldi, Tamer Basar, and Maxim Raginsky. Markov–Nash equilibria in mean-field games with discounted cost. SIAM Journal on Control and Optimization, 56(6):4256–4287, 2018.
- [63] Paul J Schweitzer and Abraham Seidmann. Generalized polynomial approximations in Markovian decision processes. Journal of mathematical analysis and applications, 110(2):568–582, 1985.
- [64] Jitkomut Songsiri. Projection onto an -norm ball with application to identification of sparse autoregressive models. In Asean Symposium on Automatic Control (ASAC), 2011.
- [65] Pantelis Sopasakis, Krina Menounou, and Panagiotis Patrinos. Superscs: fast and accurate large-scale conic optimization. In 2019 18th European Control Conference (ECC), pages 1500–1505. IEEE, 2019.
- [66] Richard H Stockbridge. Time-average control of martingale problems: A linear programming formulation. The Annals of Probability, pages 206–217, 1990.
- [67] Michael I Taksar. Infinite-dimensional linear programming approach to singular stochastic control. SIAM Journal on Control and Optimization, 35(2):604–625, 1997.
- [68] Lou Van den Dries and Chris Miller. Geometric categories and o-minimal structures. Duke Mathematical Journal, 84(2):497–540, 1996.
- [69] Athanasios Vasiliadis. An introduction to mean field games using probabilistic methods. arXiv preprint arXiv:1907.01411, 2019.
- [70] Mengdi Wang and Yichen Chen. An online primal-dual method for discounted Markov decision processes. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 4516–4521. IEEE, 2016.
- [71] Alex J Wilkie. Model completeness results for expansions of the ordered field of real numbers by restricted pfaffian functions and the exponential function. Journal of the American Mathematical Society, 9(4):1051–1094, 1996.
- [72] Philip Wolfe and George Bernard Dantzig. Linear programming in a Markov chain. Operations Research, 10(5):702–710, 1962.
- [73] Jorge Nocedal and Stephen J Wright. Numerical Optimization. Springer, 1999.
- [74] Junzi Zhang, Jongho Kim, Brendan O’Donoghue, and Stephen Boyd. Sample efficient reinforcement learning with REINFORCE. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35(12), pages 10887–10895, 2021.
- [75] Junzi Zhang, Brendan O’Donoghue, and Stephen Boyd. Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4):3170–3197, 2020.
Appendix
Appendix A Additional optimization algorithms and convergence to stationary points
A.1 Stochastic projected gradient descent (SPGD)
When the state space, action space and the time horizon are large, a single gradient evaluation in the PGD update (18) can be costly. To address this issue, we consider a stochastic variant of PGD (SPGD), which is well suited to such large-scale problems. To invoke SPGD, first recall the explicit expansion of (MF-OMO) as (14). The objective in (14) is a finite sum of terms. SPGD replaces the exact gradient in PGD with a mini-batch estimator of the following form
(40)
where the mini-batch is a subset of the summand indices, sampled uniformly at random (without replacement) and independently across iterations. Such a sampling approach ensures that the mini-batch estimator is an unbiased estimator of the exact gradient (conditioned on the current iterate).
SPGD then proceeds with the following iterative update:
(41)
Again, we assume the same conditions as for the PGD update (18).
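To make the sampling scheme concrete, here is a minimal sketch of SPGD for a generic finite-sum objective over a convex feasible set; the names (`grad_fns`, `project`, `theta`) and the interface are illustrative assumptions and not the notation of the main text.

```python
import numpy as np

def spgd(grad_fns, project, theta0, step_size, batch_size, num_iters, seed=0):
    """Stochastic projected gradient descent for f(theta) = sum_i f_i(theta).

    grad_fns : list of callables; grad_fns[i](theta) returns the gradient of f_i at theta
    project  : callable returning the Euclidean projection onto the feasible set
    theta0   : feasible initial point
    """
    rng = np.random.default_rng(seed)
    n = len(grad_fns)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_iters):
        # Sample a mini-batch uniformly at random without replacement,
        # independently across iterations.
        batch = rng.choice(n, size=batch_size, replace=False)
        # Rescaling by n / |batch| makes the estimator conditionally unbiased
        # for the full gradient, as in (40).
        g_hat = (n / batch_size) * sum(grad_fns[i](theta) for i in batch)
        # Gradient step followed by projection back onto the feasible set, as in (41).
        theta = project(theta - step_size * g_hat)
    return theta
```

Taking `batch_size` equal to the number of summands recovers (deterministic) PGD.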
Proposition 17.
Under Assumption 1, there exist finite constants such that, for every iteration and almost surely, the estimator (40) is conditionally unbiased for the exact gradient given the current iterate, its norm and its second-order moment are bounded by these constants, and the exact gradient is uniformly bounded as well.
That is, the estimator (40) is unbiased, almost surely bounded, and has a bounded second-order moment, and the exact gradient is also uniformly bounded. The proof is a direct application of the continuity of the rewards and dynamics and the compactness of the constraint set, and is omitted here.
A.2 Global convergence to stationary points without definability
In this section, we show the global convergence of the iterates to stationary points when only the smoothness assumption holds. Let us first recap the concept of stationary points for constrained optimization problems and their relation to optimality.
Optimality conditions.
The KKT conditions state that a necessary condition for a feasible point to be an optimal solution of (MF-OMO) is that the negative gradient of the objective at that point lies in the normal cone of the constraint set at that point. A feasible point satisfying this condition is called a stationary point of (MF-OMO).
The typical goal of nonconvex optimization is to find a sequence of points that converges to a stationary point. Since the normal cone is set-valued, it is hard to directly evaluate closeness to stationarity with the above definition. A common surrogate metric is the norm of the projected gradient mapping, which measures the (scaled) displacement between a point and its projection onto the constraint set after a gradient step. This mapping vanishes at a point if and only if that point is a stationary point of (MF-OMO). Moreover, if its norm is small, then according to [33, Lemma 3], the negative gradient at the corresponding projected point lies in the normal cone inflated by a correspondingly small multiple of the unit ball.
When the constraint set is the full Euclidean space, the projected gradient mapping is simply the gradient of the objective. We refer interested readers to [8, Chapter 3] for a more detailed discussion of optimality conditions.
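For concreteness, in generic notation that need not match the main text, given a smooth objective $f$, a closed convex constraint set $\Theta$, and a parameter $\eta>0$, the projected gradient mapping is commonly defined as
\[
G_\eta(\theta) \;=\; \frac{1}{\eta}\Big(\theta - \Pi_\Theta\big(\theta - \eta\,\nabla f(\theta)\big)\Big),
\]
where $\Pi_\Theta$ denotes the Euclidean projection onto $\Theta$; then $G_\eta(\theta)=0$ exactly when $\theta$ is a fixed point of the projected gradient step, i.e., a stationary point.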
Convergence to stationary points.
Under Assumption 1, the following convergence result for PGD is standard given Proposition 10 and the existence of optimal solution(s) for (MF-OMO).
Proposition 18.
[8, Theorem 10.15] Under Assumption 1, let the iterates be generated by PGD (18) with a suitable constant step size. Then
- the objective value sequence is non-increasing, and the decrease is strict until a stationary point is found (i.e., the projected gradient mapping vanishes) and the iteration is terminated;
- the norms of the projected gradient mappings converge to zero, and in particular, the smallest such norm over the first $k$ iterations decays at a rate of $O(1/\sqrt{k})$;
- all limit points of the iterates are stationary points of (MF-OMO).
Similarly, the following proposition shows that the iterates of SPGD converge to stationary points with high probability. The uniform boundedness in Proposition 17 allows us to adapt the proof of [74, Theorem 9] for policy gradient methods for MDPs to our setting. The proof is omitted due to its similarity.
A.3 Reparametrization, acceleration and variance reduction
(MF-OMO) may be solved more efficiently with a few additional techniques. First, one can reparametrize the variables to completely remove the simple constraints. In particular, some of the variables can be written as smooth transformations of unconstrained variables, and the remaining variables can be reparametrized using trigonometric functions. With such a reparametrization, (MF-OMO) becomes a smooth unconstrained optimization problem, which can be solved by additional optimization algorithms and solvers not involving projections. As mentioned in Remark 1, one can also change the norms of the variables and their bounds and apply other reparametrizations.
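As a purely illustrative sketch (not the specific reparametrization used here), simplex-type constraints can be removed with a softmax transformation and norm bounds with a squared-sine scaling; the variable names below are hypothetical.

```python
import numpy as np

def softmax(u):
    # Maps an unconstrained vector onto the probability simplex.
    e = np.exp(u - u.max())
    return e / e.sum()

def reparametrize(u, v, radius):
    """Illustrative reparametrization with hypothetical variable names.

    u maps to a simplex-constrained variable (nonnegative entries summing to one);
    v maps to a variable whose Euclidean norm is at most `radius`.
    Both u and v are unconstrained.
    """
    d = softmax(u)
    norm_v = np.linalg.norm(v)
    if norm_v == 0.0:
        y = np.zeros_like(v)
    else:
        # sin(.)**2 lies in [0, 1], so the resulting norm is at most `radius`.
        y = radius * np.sin(norm_v) ** 2 * (v / norm_v)
    return d, y
```

Composing the objective with such a map yields a smooth unconstrained problem amenable to off-the-shelf solvers, though the choice of transformation can affect the optimization landscape.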
Lastly, standard techniques such as momentum methods [60], Nesterov acceleration [55], quasi-Newton methods [73], Anderson acceleration [5], and Newton methods [50] can readily be applied to accelerate the convergence of PGD and SPGD; variance reduction approaches such as SAG [48], SVRG [41], SAGA [22], SEGA [36] can also be adopted to further accelerate and stabilize the convergence of SPGD. In particular, safeguarded Anderson acceleration methods [37, 75, 65] might be adopted to maintain the desired monotonicity property of vanilla PGD, which is central for the local convergence to global Nash equilibrium solutions (cf. Theorem 11).
Appendix B Definability and local convergence to global Nash equilibrium solutions
B.1 Definability
In this section, we restate the formal definitions of definable sets and functions as in [7, Section 4.3], and summarize and derive necessary properties to prove Theorem 11. See [68, 71, 27, 52, 19, 42] for more complete expositions of the topic.
We first need the following concept of an o-minimal structure over $\mathbb{R}$.
Definition B.1.
Suppose that $\mathcal{O}=\{\mathcal{O}_n\}_{n\in\mathbb{N}}$ is a sequence of collections of sets, where each $\mathcal{O}_n$ is a collection of subsets of $\mathbb{R}^n$. We say that $\mathcal{O}$ is an o-minimal structure over $\mathbb{R}$ if it satisfies the following conditions:
- Each $\mathcal{O}_n$ contains $\emptyset$ and $\mathbb{R}^n$, and is closed under finite unions, finite intersections and complements.
- If $A\in\mathcal{O}_n$, then $A\times\mathbb{R}\in\mathcal{O}_{n+1}$ and $\mathbb{R}\times A\in\mathcal{O}_{n+1}$.
- If $A\in\mathcal{O}_{n+1}$, then its canonical projection onto the first $n$ coordinates belongs to $\mathcal{O}_n$.
- For any $n\in\mathbb{N}$ and $1\le i<j\le n$, the diagonals $\{(x_1,\dots,x_n)\in\mathbb{R}^n: x_i=x_j\}$ belong to $\mathcal{O}_n$.
- The set $\{(x_1,x_2)\in\mathbb{R}^2: x_1<x_2\}$ belongs to $\mathcal{O}_2$.
- The elements of $\mathcal{O}_1$ are exactly the finite unions of intervals (including both open and closed intervals, and in particular single points).
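As a brief illustration of the last condition, in any o-minimal structure the set of integers is not definable as a subset of $\mathbb{R}$: writing
\[
\mathbb{Z} \;=\; \bigcup_{k\in\mathbb{Z}}\{k\}
\]
exhibits it as an infinite union of singletons, not a finite union of intervals. This tameness requirement is what excludes such pathological sets.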
We then have the following definition of definability.
Definition B.2.
Let $\mathcal{O}$ be an o-minimal structure over $\mathbb{R}$. We say that a set $A\subseteq\mathbb{R}^n$ is definable if $A\in\mathcal{O}_n$. A function/mapping is said to be definable if its graph is a definable set. (Note that a real extended valued function can be equivalently seen as a standard real-valued function restricted to its domain.) A set-valued mapping is said to be definable if its graph is a definable set.
Below we provide three examples of the most widely used o-minimal structures, namely the real semialgebraic structure, the globally subanalytic structure, and the log-exp structure used in our main text.
Example 1 (Real semialgebraic structure).
In this structure, each $\mathcal{O}_n$ consists of the real semialgebraic subsets of $\mathbb{R}^n$. A set in $\mathbb{R}^n$ is said to be real semialgebraic (or semialgebraic for short) if it can be written as a finite union of sets, each defined by finitely many polynomial equations and strict polynomial inequalities. A function/mapping is called (real) semialgebraic if its graph is semialgebraic.
The following result is useful for proving Theorem 11 (see [18] for more properties of real semialgebraic sets).
Lemma 20.
Any set of the form $\{x\in\mathbb{R}^n: p_i(x)\ge 0,\ i=1,\dots,m\}$, where $p_1,\dots,p_m$ are real polynomial functions, is real semialgebraic.
Proof.
First, notice that $\{x\in\mathbb{R}^n: p_i(x)=0\}$ and $\{x\in\mathbb{R}^n: -p_i(x)<0\}$ are both real semialgebraic by definition. Hence $\{x\in\mathbb{R}^n: p_i(x)\ge 0\}=\{x\in\mathbb{R}^n: p_i(x)=0\}\cup\{x\in\mathbb{R}^n: -p_i(x)<0\}$ is real semialgebraic. Finally, by the intersection closure property of o-minimal structures,
\[
\{x\in\mathbb{R}^n: p_i(x)\ge 0,\ i=1,\dots,m\}\;=\;\bigcap_{i=1}^m\{x\in\mathbb{R}^n: p_i(x)\ge 0\}
\]
is also real semialgebraic. ∎
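As a generic illustration (not a statement about the specific constraint set of (MF-OMO)), the probability simplex
\[
\Delta_n \;=\; \Big\{x\in\mathbb{R}^n:\ x_i\ge 0 \text{ for } i=1,\dots,n,\ \ \textstyle\sum_{i=1}^n x_i-1\ge 0,\ \ 1-\sum_{i=1}^n x_i\ge 0\Big\}
\]
is cut out by affine (hence polynomial) inequalities, so it is real semialgebraic by Lemma 20; the same argument applies to any set defined by finitely many affine inequalities.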
Example 2 (Globally subanalytic structure).
This structure is the smallest o-minimal structure such that each $\mathcal{O}_n$ includes, but is not limited to, all sets of the form $\{x\in[-1,1]^n: f(x)=0\}$, where $f$ is an analytic function that can be extended analytically to a neighborhood containing $[-1,1]^n$. In particular, all semialgebraic sets, as well as all sets of the analogous form with $f$ analytic on such a compact box, belong to this structure.
Example 3 (Log-exp structure).
This structure is the smallest o-minimal structure containing the globally subanalytic structure and the graph of the exponential function $x\mapsto e^x$.
As in the main text, we say that a function (or a set) is definable if it is definable in the log-exp structure. The following proposition formally summarizes the major facts about definable functions and sets. See [7] and [68] for more details.
Proposition 21.
The following facts hold for definable functions/mappings and sets.
1. Let the real extended valued functions $f$ and $g$ be definable, and let $A$ be a constant matrix and $b$ be a constant vector. Then the sum $f+g$, the difference $f-g$, the product $fg$, the pointwise maximum $\max\{f,g\}$ and minimum $\min\{f,g\}$, and the affine composition $x\mapsto f(Ax+b)$ are definable. The same results hold for vector-valued mappings.
2. Let the functions/mappings $f$ and $g$ be definable, and suppose that the composition $f\circ g$ is well defined. Then the composition $f\circ g$ is also definable.
3. Let $f(x,y)$ be definable and let $C$ be a definable compact set on which $f(x,\cdot)$ is finite for all $x$. Then the partial minimization $x\mapsto\min_{y\in C}f(x,y)$ (and likewise the partial maximization) is also definable.
4. Let the function/mapping $f$ and the set $S$ be definable. Then the image $f(S)$ and the preimage $f^{-1}(S)$ are both definable.
5. All semialgebraic functions, all analytic functions restricted to definable compact sets, as well as the exponential function and logarithm function, are definable.
6. Let the set $S$ be definable. Then the indicator function $\mathbb{1}_S$ (which takes the value $1$ on $S$ and $0$ otherwise) and the characteristic function $\chi_S$ (the real extended valued function, sometimes also referred to as the indicator function in the literature, which takes the value $0$ on $S$ and $+\infty$ otherwise) are also definable.
7. Let the sets $S_1$ and $S_2$ be definable. Then the Cartesian product $S_1\times S_2$ is definable.
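To illustrate how these closure properties combine (in generic notation that is not meant to match the objective of (MF-OMO)), consider
\[
h(x) \;=\; \|Ax-b\|_2^2 + \chi_S(x) \;=\; \sum_i\Big(\textstyle\sum_j A_{ij}x_j-b_i\Big)^2 + \chi_S(x)
\]
with $S$ semialgebraic: the quadratic term is built from finitely many sums and products of affine functions and is hence definable by fact 1, $\chi_S$ is definable by fact 6, and therefore $h$ is definable by fact 1 again. The same pattern underlies Step 1 of the proof of Theorem 11 below.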
B.2 Local convergence to global Nash equilibrium solutions
Proof of Theorem 11.
The proof consists of four steps. We begin by verifying definability around the initialization. We then specify the radius of the initialization neighborhood, and show that the iterates remain uniformly within a comparable neighborhood of the initialization. Finally, we obtain convergence to Nash equilibrium solutions.
Step 1: Verification of definability around initialization.
We first verify that the sum of the objective of (MF-OMO) and the characteristic function of its constraint set is definable. Note that a point is an optimal solution to (MF-OMO) if and only if it minimizes this sum over the entire Euclidean space. To see this, note that the objective is obtained from a finite number of summations, subtractions and products of definable functions (i.e., the rewards, the dynamics, constant functions and identity functions), and is hence definable by Proposition 21. In addition, by rewriting the constraint set in the form required by Lemma 20, it is obviously semialgebraic and hence definable, and hence its characteristic function is also definable by Proposition 21. As a result, the sum of the objective and the characteristic function is also definable.
Let some Nash equilibrium solution be given, and define from it a candidate point by the constructions in Theorem 4, Proposition 6 and Corollary 7. Then, by Theorem 4, Proposition 6 and Corollary 7, this point is feasible for (MF-OMO) and its objective value is zero; since the objective is nonnegative, the point is optimal.
For this optimal point, the sum of the objective and the characteristic function attains its global minimum. By [7, Theorem 14] and the continuity of the objective, there exist positive constants and a continuous, concave, strictly increasing and definable function, vanishing at zero and continuously differentiable on the interior of its domain, such that for any point sufficiently close to this optimal point whose objective value exceeds the optimal value by less than the corresponding constant, the following Kurdyka-Lojasiewicz (KL) condition holds:
(42)
Here we use the fact that the limiting sub-differential of the sum of the (smooth) objective and the characteristic function of the constraint set equals the gradient of the objective plus the normal cone of the constraint set [7, Proposition 3].
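For reference, in one standard generic formulation (with symbols that are placeholders rather than the paper's notation), the KL condition at a point $\bar\theta$ for an extended valued function $F$ with desingularizing function $\varphi$ reads
\[
\varphi'\big(F(\theta)-F(\bar\theta)\big)\,\operatorname{dist}\big(0,\partial F(\theta)\big)\;\ge\;1
\]
for all $\theta$ in a neighborhood of $\bar\theta$ with $F(\bar\theta)<F(\theta)<F(\bar\theta)+\nu$ for some $\nu>0$.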
Step 2: Specify the radius of the initialization neighborhood.
Next, we specify the choice of the neighborhood radius. By the continuity of the objective, there exists a radius such that any point within this distance of the optimal point constructed above has objective value below the threshold appearing in the KL condition. The radius of the initialization neighborhood is then taken to be the minimum of this radius and a problem dependent constant determined by the Lipschitz constants below. We now show that any initialization within this neighborhood of the Nash equilibrium solution corresponds, through the construction of Step 1, to a point sufficiently close to the optimal point, so that the KL condition applies.
By Assumption 1 and compactness, the Lipschitz continuity assumptions in Theorem 8 hold for some constants. Since the initialization lies within the prescribed distance of the Nash equilibrium solution, by the proof of Proposition 13 and the construction in Step 1, the corresponding initial point is feasible by the proof of Theorem 9, and its distance to the optimal point constructed in Step 1, measured in the appropriate vector norms (and in the induced matrix norm where applicable), is bounded by a problem dependent constant multiple of the distance between the initialization and the Nash equilibrium solution. In particular, the initial point lies in the neighborhood where the KL condition holds, and its objective value is within the corresponding threshold of the optimal value. Now, by Proposition 18 and the feasibility of the initial point, the objective value sequence is non-increasing, and hence the objective values of all iterates remain within this threshold as well.
Step 3: Show that the iterates remain in a neighborhood of the initialization.
By [8, Lemma 10.4],
(43)
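For orientation, [8, Lemma 10.4] is a sufficient decrease property for projected/proximal gradient steps; in generic notation (placeholders, not the notation of (43)), it guarantees an inequality of the form
\[
F(\theta_k)-F(\theta_{k+1})\;\ge\;c\,\big\|G(\theta_k)\big\|^2
\]
for some constant $c>0$ depending on the step size and the smoothness constant, where $G$ denotes the projected gradient mapping.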
Then, since the desingularizing function in the KL condition (42) above is concave, strictly increasing and continuously differentiable, the KL inequality is applicable at an iterate as long as it stays close enough to the optimal point (since we already have the required bound on the objective values for all iterates), and
(44)
Here the first inequality is by the concavity of the desingularizing function.
Now we prove that
(45)
holds for all iterations by induction, where
(46)
and where the constant appearing in (46) is the Lipschitz constant of the gradient of the objective, in the sense that the gradients at any two feasible points differ in norm by at most this constant times their distance. The existence of this constant is guaranteed by Assumption 1 and the compactness of the constraint set. Note that, by the continuity of the objective and the choice of the neighborhood radius in Step 2, (45) yields the desired property that the iterates remain in the KL neighborhood of the optimal point.
The base case is trivial. Without loss of generality, we now assume that the initialization is not already optimal, since otherwise the algorithm finds the exact Nash equilibrium solution at the initialization, the iterates terminate/stay unchanged thereafter, and hence all the claims in Theorem 11 obviously hold.
Then the claim for the first iteration follows from (43).
Suppose that the claim (45) holds up to some iteration. Without loss of generality, we also assume that the algorithm has not found an exact Nash equilibrium solution at that iteration or the next one, since otherwise the iterates terminate/stay unchanged thereafter, and hence all the claims in Theorem 11 obviously hold.
Now, for each iterate covered by the induction hypothesis, since it remains within the prescribed neighborhood of the optimal point, its objective value lies strictly between the optimal value and the KL threshold. Hence, using the fact that the normal cone is a cone, and by the KL condition at such an iterate, we have
(47)
Inequality (47), together with (44), immediately implies a bound on the progress made at each iteration, and hence on the length of each step. Telescoping this bound over the iterations covered by the induction hypothesis, we deduce the claim (45) for the next iteration. Note that here we use (43) once more, together with the nonnegativity and monotonicity of the desingularizing function. This completes the induction.
Final step: Convergence to Nash equilibrium solutions.
Let an arbitrary limit point of the iterates be given. Then it is feasible, and by Proposition 18, it is a stationary point of (MF-OMO). Moreover, again by Proposition 18 and the feasibility of the iterates, the objective value sequence is non-increasing, and hence, by the continuity of the objective, the objective value at the limit point remains within the KL threshold of the optimal value.
In addition, by the first three steps above, the limit point lies within the prescribed neighborhood of the optimal point. Hence, if its objective value were strictly greater than the optimal value, then invoking the KL condition at the limit point would imply that the distance from zero to the sub-differential there is strictly positive.
However, the limit point is a stationary point, so this distance is zero, which is a contradiction. Hence the limit point attains the optimal value, which is zero, and is therefore an optimal solution of (MF-OMO). Consequently, by the equivalence result in Theorem 4, any policy induced by such a limit point is a Nash equilibrium solution of the mean-field game ((1) and (2)). In addition, the iterates remain within the prescribed neighborhood for all iterations by Step 3. The fact that the limit point set is nonempty comes simply from the boundedness of the iterates. Moreover, the vanishing limit of the exploitability of the induced policies is an immediate implication of Theorem 8 and the fact that the objective values converge to zero. The claim about isolated Nash equilibrium solutions is obvious: simply choose the neighborhood radius small enough that no other Nash equilibrium solutions exist in the corresponding neighborhood of the given one. ∎