
Graphon Mean-Field Control for Cooperative Multi-Agent Reinforcement Learning

Yuanquan Hu    Xiaoli Wei    Junji Yan    Hengxi Zhang
Abstract

The marriage between mean-field theory and reinforcement learning has shown a great capacity to solve large-scale control problems with homogeneous agents. To break the homogeneity restriction of mean-field theory, a recent interest is to introduce graphon theory into the mean-field paradigm. In this paper, we propose a graphon mean-field control (GMFC) framework to approximate cooperative multi-agent reinforcement learning (MARL) with nonuniform interactions and show that the approximation error is of order $\mathcal{O}(\frac{1}{\sqrt{N}})$, with $N$ the number of agents. By discretizing the graphon index of GMFC, we further introduce a smaller class of GMFC called block GMFC, which is shown to approximate cooperative MARL well. Our empirical studies on several examples demonstrate that our GMFC approach is comparable with state-of-the-art MARL algorithms while enjoying better scalability.

1 Introduction

Multi-agent reinforcement learning (MARL) has found various applications in transportation and simulation [50, 1], stock price analysis and trading [32, 31], wireless communication networks [12, 11, 13], and learning behaviors in social dilemmas [33, 28, 34]. MARL, however, becomes intractable as the number of agents increases, due to the complex interactions among agents.

A recent tractable approach is the mean-field approach, which considers MARL in the regime of a large number of homogeneous agents under weak interactions [20]. According to the number of agents and the learning goals, there are three different types of mean-field theories for MARL. The first one is called mean-field MARL (MF-MARL), which refers to the empirical average of the states or actions of a finite population. For example, [52] proposes to approximate interactions within the population of agents by averaging the actions of the overall population or of neighboring agents. [35] proposes a mean-field proximal policy optimization algorithm for a class of MARL with permutation invariance. The second one is called mean-field game (MFG), which describes the asymptotic limit of non-cooperative stochastic games as the number of agents goes to infinity [30, 27, 8]. Recently, a rapidly growing literature studies MFG for noncooperative MARL either in a model-based way [53, 6, 26] or by a model-free approach [25, 48, 18, 14, 44]. The third one is called mean-field control (MFC), which is closely related to MFG yet differs from MFG in terms of learning goals. For cooperative MFC, the Bellman equation for the value function is defined on an enlarged space of probability measures, and MFC is usually reformulated as a new Markov decision process (MDP) with continuous state-action space [43]. [9] shows the existence of optimal policies for MFC in the form of mean-field MDP and adapts classical reinforcement learning (RL) methods to the mean-field setup. [24] approximates MARL by an MFC approach and proposes a model-free kernel-based Q-learning algorithm (MFC-K-Q) that enjoys a linear convergence rate and is independent of the number of agents. [44] presents a model-based RL algorithm M3-UCRL for MFC with a general regret bound. [2] proposes a unified two-timescale learning framework for MFG and MFC by tuning the ratio of the learning rates of the $Q$ function and the population state distribution.

One restriction of mean-field theory is that it eliminates the differences among agents: interactions between agents are assumed to be uniform. However, in many real-world scenarios, strategic interactions between agents are not uniform and depend on the relative positions of agents. To develop scalable learning algorithms for multi-agent systems with heterogeneous agents, one approach is to exploit the local network structure of agents [45, 38]. Another approach is to consider mean-field systems on large graphs and their asymptotic limits, which leads to graphon mean-field theory [39]. So far, most existing works on graphon mean-field theory consider either diffusion processes without learning in continuous time or non-cooperative graphon mean-field games (GMFG) in discrete time. [3] considers uncontrolled graphon mean-field systems in continuous time. [17] studies MFG on an Erdős–Rényi graph. [19] studies the convergence of weighted empirical measures described by stochastic differential equations. [4] studies propagation of chaos of weakly interacting particles on general graph sequences. [5] considers general GMFG and studies $\varepsilon$-Nash equilibria of the multi-agent system by a PDE approach in continuous time. [29] studies stochastic games on large graphs and their graphon limits, and shows that GMFG can be viewed as a special case of MFG by treating the label of each agent as a component of the state process. [21, 22] study continuous-time cooperative graphon mean-field systems with linear dynamics. On the other hand, [7] studies static finite-agent network games and their associated graphon games. [49] provides a sequential decomposition algorithm to find Nash equilibria of discrete-time GMFG. [15] constructs a discrete-time learning GMFG framework to analyze approximate Nash equilibria for MARL with nonuniform interactions. However, little attention has been paid to learning cooperative graphon mean-field systems in discrete time, except for [41, 42], which consider particular forms of nonuniform interactions among agents. [42] proves that when the reward is affine in the state distribution and action distribution, MARL with nonuniform interactions can still be approximated by classic MFC. [41] considers multi-class MARL, where agents belonging to the same class are homogeneous. In contrast, we consider a general discrete-time GMFC framework under which agents are allowed to interact non-uniformly on any network captured by a graphon.

Our Work

In this work, we propose a general discrete-time GMFC framework to approximate cooperative MARL on large graphs by combining classic MFC and network games. Theoretically, we first show that GMFC can be reformulated as a new MDP with deterministic dynamics and infinite-dimensional state-action space, so that the Bellman equation for its Q function is established on the space of probability measure ensembles. We show that GMFC approximates cooperative MARL well in terms of both the value function and optimal policies, with approximation error of order $\mathcal{O}(1/\sqrt{N})$, where $N$ is the number of agents. Furthermore, instead of learning the infinite-dimensional GMFC directly, we introduce a smaller class called block GMFC by discretizing the graphon index, which can be recast as a new MDP with deterministic dynamics and finite-dimensional continuous state-action space. We show that the optimal policy ensemble learned from block GMFC is near-optimal for cooperative MARL. To deploy the policy ensemble in the finite-agent system, we directly sample from the action distributions in the blocks. Empirically, our experiments in Section 5 demonstrate that as the number of agents becomes large, the mean episode reward of MARL becomes increasingly close to that of block GMFC, which verifies our theoretical findings. Furthermore, our block GMFC approach achieves performance comparable to other popular existing MARL algorithms in the finite-agent setting.

Outline

The rest of the paper is organized as follows. Section 2 recalls basic notations of graphons and introduces the setup of cooperative MARL with nonuniform interactions and its asymptotic limit called GMFC. Section 3 connects cooperative MARL and GMFC, introduces block GMFC for efficient algorithm design, and builds its connection with cooperative MARL. The main theoretical proofs are presented in Section 4. Section 5 tests the performance of block GMFC experimentally.

2 Mean-Field MARL on Dense Graphs

2.1 Preliminary: Graphon Theory

In the following, we consider a cooperative multi-agent system and its associated mean-field limit. In this system, each agent is affected by all others, with different agents exerting different effects on her. This multi-agent system with $N$ agents can be described by a weighted graph $G_{N}=({\cal V}_{N},\mathcal{E}_{N})$, where the vertex set ${\cal V}_{N}=\{1,\ldots,N\}$ and the edge set $\mathcal{E}_{N}$ represent the agents and the interactions between agents, respectively. To study the limit of the multi-agent system as $N$ goes to infinity, we adopt the graphon theory introduced in [39], which characterizes the limit behavior of dense graph sequences. Therefore, throughout the paper we assume that the graph $G_{N}$ is dense and leave sparse graphs for future study.

In general, a graphon is represented by a bounded symmetric measurable function $W:\mathcal{I}\times\mathcal{I}\to\mathcal{I}$, with $\mathcal{I}=[0,1]$. We denote by ${\cal W}$ the space of all graphons and equip ${\cal W}$ with the cut norm $\|\cdot\|_{\square}$:

\|W\|_{\square}=\sup_{S,T\subset\mathcal{I}}\Big|\int_{S\times T}W(\alpha,\beta)\,d\alpha\,d\beta\Big|.

It is worth noting that each weighted graph $G_{N}=({\cal V}_{N},\mathcal{E}_{N})$ is uniquely determined by a step graphon $W_{N}$ satisfying

W_{N}(\alpha,\beta)=W_{N}\Big(\frac{\lceil N\alpha\rceil}{N},\frac{\lceil N\beta\rceil}{N}\Big).

We assume that the sequence of $W_{N}$ converges to a graphon $W$ in cut norm as the number of agents $N$ goes to infinity, which is crucial for the convergence analysis of cooperative MARL in Section 3.
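For concreteness, the step graphon induced by a finite weighted graph can be computed directly from its adjacency matrix. The following is a minimal Python sketch (our own illustration, not part of the paper's code); `adj` is an assumed $N\times N$ array of edge weights in $[0,1]$.

```python
import numpy as np

def step_graphon(adj):
    """Build the step graphon W_N induced by a weighted adjacency matrix:
    W_N(alpha, beta) is the weight between agents ceil(N*alpha) and ceil(N*beta)."""
    N = adj.shape[0]

    def W_N(alpha, beta):
        i = max(min(int(np.ceil(N * alpha)), N), 1)   # map alpha in [0,1] to agent index 1..N
        j = max(min(int(np.ceil(N * beta)), N), 1)
        return adj[i - 1, j - 1]                      # 0-based lookup of the edge weight

    return W_N
```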

Assumption 2.1

The sequence $(W_{N})_{N\in\mathbb{N}}$ converges in cut norm to some graphon $W\in{\cal W}$, i.e.,

\|W_{N}-W\|_{\square}\to 0.

Some common examples of graphons include the following; a small code sketch of these examples is given after the list.

1) Erdős–Rényi: $W(\alpha,\beta)=p$, $0\leq p\leq 1$, $\alpha,\beta\in\mathcal{I}$;

2) Stochastic block model:

   W(\alpha,\beta)=\begin{cases}p&\mbox{if}\;0\leq\alpha,\beta\leq 0.5\;\mbox{or}\;0.5\leq\alpha,\beta\leq 1,\\ q&\mbox{otherwise},\end{cases}

   where $p$ represents the intra-community interaction and $q$ the inter-community interaction;

3) Random geometric graphon: $W(\alpha,\beta)=f(\min(|\beta-\alpha|,1-|\beta-\alpha|))$, where $f:[0,0.5]\to[0,1]$ is a non-increasing function.
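The following minimal Python sketch (ours, for illustration only) implements the three graphon examples above; the parameters `p`, `q` and the profile `f` are placeholders.

```python
def erdos_renyi(p):
    # Constant graphon: every pair of labels interacts with weight p.
    return lambda a, b: p

def stochastic_block(p, q):
    # Two communities split at 0.5: weight p within a community, q across communities.
    return lambda a, b: p if (a <= 0.5) == (b <= 0.5) else q

def random_geometric(f):
    # f: non-increasing function from [0, 0.5] to [0, 1].
    return lambda a, b: f(min(abs(b - a), 1 - abs(b - a)))

W = stochastic_block(p=0.8, q=0.1)
print(W(0.2, 0.3), W(0.2, 0.7))   # 0.8 (same block), 0.1 (different blocks)
```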

2.2 Cooperative MARL with Nonuniform Interactions

In this section, we facilitate the analysis of MARL by considering a particular class of MARL with nonuniform interactions, where each agent interacts with all other agents via the aggregated weighted mean-field effect of the population of all agents.

Recall that we use the weighted graph $G_{N}=({\cal V}_{N},\mathcal{E}_{N})$ to represent the multi-agent system, in which agents are cooperative and coordinated by a central controller. They share a finite state space ${\cal S}$ and take actions from a finite action space ${\cal A}$. We denote by ${\cal P}({\cal S})$ and ${\cal P}({\cal A})$ the spaces of all probability measures on ${\cal S}$ and ${\cal A}$, respectively. Furthermore, we denote by ${\cal B}({\cal S})$ the space of all Borel measures on ${\cal S}$.

For each agent $i$, the neighborhood empirical measure is given by

\mu^{i,W_{N}}_{t}(\cdot):=\frac{1}{N}\sum_{j\in{\cal V}_{N}}\xi_{i,j}^{N}\delta_{s_{t}^{j}}(\cdot),  (2.2)

where $\delta_{s_{t}^{j}}$ denotes the Dirac measure at $s_{t}^{j}$, and $\xi_{i,j}^{N}$, describing the interaction between agents $i$ and $j$, is taken as either

\xi_{i,j}^{N}=W_{N}\big(\tfrac{i}{N},\tfrac{j}{N}\big)  (C1)

or

\xi_{i,j}^{N}\sim\mbox{Bernoulli}\big(W_{N}\big(\tfrac{i}{N},\tfrac{j}{N}\big)\big).  (C2)

At each step $t=0,1,\cdots$, if agent $i$, $i\in[N]$, at state $s^{i}_{t}\in{\cal S}$ takes an action $a^{i}_{t}\in{\cal A}$, then she receives a reward

r\big(s^{i}_{t},\mu_{t}^{i,W_{N}},a^{i}_{t}\big),\quad i\in[N],  (2.3)

where $r:{\cal S}\times{\cal B}({\cal S})\times{\cal A}\to\mathbb{R}$, and she moves to a new state $s^{i}_{t+1}$ according to the transition probability

s^{i}_{t+1}\sim P\big(\cdot\,|\,s^{i}_{t},\mu_{t}^{i,W_{N}},a^{i}_{t}\big),\quad i\in[N],\;s^{i}_{0}\sim\mu\in{\cal P}({\cal S}),  (2.4)

where $P:{\cal S}\times{\cal B}({\cal S})\times{\cal A}\to{\cal P}({\cal S})$.

(2.3)-(2.4) indicate that the reward and the transition probability of agent $i$ at time $t$ depend on both her individual information $(s_{t}^{i},a_{t}^{i})$ and the neighborhood empirical measure $\mu^{i,W_{N}}_{t}$.
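As an illustration of (2.2) together with (C1) and (C2), the following Python sketch (ours, with an assumed encoding of states as integers $0,\ldots,|{\cal S}|-1$) computes the neighborhood empirical measure of agent $i$.

```python
import numpy as np

def neighborhood_empirical_measure(states, W_N, i, n_states, rng=None):
    """Neighborhood empirical measure (2.2) of agent i.
    states: list of current agent states; W_N: step graphon; rng=None gives (C1),
    a numpy Generator gives the Bernoulli sampling of (C2)."""
    N = len(states)
    mu = np.zeros(n_states)
    for j in range(N):
        w = W_N((i + 1) / N, (j + 1) / N)              # graphon weight between agents i and j
        xi = w if rng is None else rng.binomial(1, w)  # (C1) deterministic or (C2) Bernoulli
        mu[states[j]] += xi / N                        # weighted Dirac mass at agent j's state
    return mu   # an element of B(S); it need not sum to one
```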

Furthermore, the policy is assumed to be stationary for simplicity and takes the Markovian form

a^{i}_{t}\sim\pi^{i}(\cdot|s^{i}_{t})\in{\cal P}({\cal A}),\quad i\in[N],  (2.5)

which maps agent $i$'s state to a randomized action. For each agent $i$, the space of all such policies is denoted by $\Pi$.

Remark 2.2

When $\xi_{i,j}^{N}\equiv 1$, $i,j\in[N]$, our setting corresponds to classical mean-field theory with uniform interactions [9, 24]. Furthermore, our framework is flexible enough to include nonuniform interactions of actions via $\nu_{t}^{i,W_{N}}=\frac{1}{N}\sum_{j\in{\cal V}_{N}}\xi_{i,j}^{N}\delta_{a_{t}^{j}}(\cdot)$, and also to include heterogeneity of agents by allowing $r$ and $P$ to depend on the agent type $i$.

The objective of the multi-agent system (2.2)-(2.5) is to maximize the expected discounted accumulated reward averaged over all agents, i.e.,

V_{N}(\mu) = \sup_{(\pi^{1},\ldots,\pi^{N})\in\Pi^{N}}J_{N}(\mu,\pi^{1},\ldots,\pi^{N})
:= \sup_{(\pi^{1},\ldots,\pi^{N})\in\Pi^{N}}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r\big(s^{i}_{t},\mu^{i,W_{N}}_{t},a^{i}_{t}\big)\,\Big|\,s_{0}^{i}\sim\mu,\,a_{t}^{i}\sim\pi^{i}(\cdot|s_{t}^{i})\right],  (2.6)

subject to (2.2)-(2.5), with a discount factor $\gamma\in(0,1)$.

Definition 2.3

An $\varepsilon$-Pareto optimum of cooperative MARL (2.2)-(2.6) for $\varepsilon>0$ is defined as a tuple $(\pi^{1,*},\ldots,\pi^{N,*})\in\Pi^{N}$ such that

J_{N}(\mu,\pi^{1,*},\ldots,\pi^{N,*})\geq\sup_{(\pi^{1},\ldots,\pi^{N})\in\Pi^{N}}J_{N}(\mu,\pi^{1},\ldots,\pi^{N})-\varepsilon.  (2.7)

2.3 Graphon Mean-Field Control

We expect the cooperative MARL (2.2)-(2.6) to become a GMFC problem as $N\to\infty$. In GMFC, there is a continuum of agents $\alpha\in\mathcal{I}$, and each agent with index/label $\alpha\in\mathcal{I}$ follows

s_{0}^{\alpha}\sim\mu^{\alpha},\;\;a_{t}^{\alpha}\sim\pi^{\alpha}(\cdot|s_{t}^{\alpha}),\;\;s_{t+1}^{\alpha}\sim P(\cdot|s_{t}^{\alpha},\mu_{t}^{\alpha,W},a_{t}^{\alpha}),  (2.8)

where $\mu_{t}^{\alpha}={\cal L}(s_{t}^{\alpha})$, $\alpha\in\mathcal{I}$, denotes the probability distribution of $s_{t}^{\alpha}$, and $\mu_{t}^{\alpha,W}$ is the neighborhood mean-field measure of agent $\alpha$:

\mu_{t}^{\alpha,W}=\int_{\mathcal{I}}W(\alpha,\beta)\mu_{t}^{\beta}\,d\beta\in{\cal B}({\cal S}),  (2.9)

with the graphon $W$ given in Assumption 2.1.

To ease the subsequent analysis, define the space of state distribution ensembles ${\boldsymbol{{\cal M}}}:={\cal P}({\cal S})^{\mathcal{I}}:=\{f:\mathcal{I}\to{\cal P}({\cal S})\}$ and the space of policy ensembles ${\boldsymbol{\Pi}}:={\cal P}({\cal A})^{{\cal S}\times\mathcal{I}}$. Then ${\boldsymbol{\mu}}:=(\mu^{\alpha})_{\alpha\in\mathcal{I}}$ and ${\boldsymbol{\pi}}:=(\pi^{\alpha})_{\alpha\in\mathcal{I}}$ are elements of ${\boldsymbol{{\cal M}}}$ and ${\boldsymbol{\Pi}}$, respectively.

The objective of GMFC is to maximize the expected discounted accumulated reward averaged over all agents $\alpha\in\mathcal{I}$:

V({\boldsymbol{\mu}}) := \sup_{{\boldsymbol{\pi}}\in{\boldsymbol{\Pi}}}J({\boldsymbol{\mu}},{\boldsymbol{\pi}})
= \sup_{{\boldsymbol{\pi}}\in{\boldsymbol{\Pi}}}\int_{\mathcal{I}}\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r\big(s^{\alpha}_{t},\mu^{\alpha,W}_{t},a^{\alpha}_{t}\big)\,\Big|\,s_{0}^{\alpha}\sim\mu^{\alpha},\,a_{t}^{\alpha}\sim\pi^{\alpha}(\cdot|s_{t}^{\alpha})\right]d\alpha.  (2.10)

3 Main Results

3.1 Reformulation of GMFC

In this section, we show that GMFC (2.8)-(2.10) can be reformulated as an MDP with deterministic dynamics and continuous state-action space ${\boldsymbol{{\cal M}}}\times{\boldsymbol{\Pi}}$.

Theorem 3.1

GMFC (2.8)-(2.10) can be reformulated as

V({\boldsymbol{\mu}})=\sup_{{\boldsymbol{\pi}}\in{\boldsymbol{\Pi}}}\sum_{t=0}^{\infty}\gamma^{t}R({\boldsymbol{\mu}}_{t},{\boldsymbol{\pi}}),  (3.1)

subject to

\mu_{t+1}^{\alpha}(\cdot)={\boldsymbol{\Phi}}^{\alpha}({\boldsymbol{\mu}}_{t},{\boldsymbol{\pi}})(\cdot),\;t\in\mathbb{N},\;\mu_{0}^{\alpha}=\mu^{\alpha},\;\alpha\in\mathcal{I},  (3.2)

where the aggregated reward $R:{\boldsymbol{{\cal M}}}\times{\boldsymbol{\Pi}}\to\mathbb{R}$ and the aggregated transition dynamics ${\boldsymbol{\Phi}}:{\boldsymbol{{\cal M}}}\times{\boldsymbol{\Pi}}\to{\boldsymbol{{\cal M}}}$ are given by

R({\boldsymbol{\mu}},{\boldsymbol{\pi}})=\int_{\mathcal{I}}\sum_{s\in{\cal S}}\sum_{a\in{\cal A}}r(s,\mu^{\alpha,W},a)\pi^{\alpha}(a|s)\mu^{\alpha}(s)\,d\alpha,  (3.3)
{\boldsymbol{\Phi}}^{\alpha}({\boldsymbol{\mu}},{\boldsymbol{\pi}})(\cdot)=\sum_{s\in{\cal S}}\sum_{a\in{\cal A}}P(\cdot|s,\mu^{\alpha,W},a)\pi^{\alpha}(a|s)\mu^{\alpha}(s).  (3.4)

The proof of Theorem 3.1 is similar to the proof of Lemma 2.2 in [23]. So we omit it here.

(3.2) and (3.4) describe the evolution of the state distribution ensemble ${\boldsymbol{\mu}}_{t}$ over time: under a fixed policy ensemble ${\boldsymbol{\pi}}$, the state distribution $\mu_{t+1}^{\alpha}$ of agent $\alpha$ at time $t+1$ is fully determined by ${\boldsymbol{\pi}}$ and the state distribution ensemble ${\boldsymbol{\mu}}_{t}$ at time $t$. Note that the state distribution of each agent $\alpha$ is fully coupled with the state distributions of the whole population via the graphon $W$.

With the reformulation in Theorem 3.1, the associated $Q$ function starting from $({\boldsymbol{\mu}},{\boldsymbol{\pi}})\in{\boldsymbol{{\cal M}}}\times{\boldsymbol{\Pi}}$ is defined as

Q({\boldsymbol{\mu}},{\boldsymbol{\pi}})=R({\boldsymbol{\mu}},{\boldsymbol{\pi}})+\sup_{{\boldsymbol{\pi}}^{\prime}\in{\boldsymbol{\Pi}}}\Big[\sum_{t=1}^{\infty}\gamma^{t}R\big({\boldsymbol{\mu}}_{t},{\boldsymbol{\pi}}^{\prime}\big)\,\Big|\,s^{\alpha}_{0}\sim\mu^{\alpha},\,a_{0}^{\alpha}\sim\pi^{\alpha}(\cdot|s_{0}^{\alpha})\Big].  (3.5)

Hence its Bellman equation is given by

Q({\boldsymbol{\mu}},{\boldsymbol{\pi}})=R({\boldsymbol{\mu}},{\boldsymbol{\pi}})+\gamma\sup_{{\boldsymbol{\pi}}^{\prime}\in{\boldsymbol{\Pi}}}Q({\boldsymbol{\Phi}}({\boldsymbol{\mu}},{\boldsymbol{\pi}}),{\boldsymbol{\pi}}^{\prime}).  (3.6)
Remark 3.2

(Label-state formulation) GMFC (2.8)-(2.10) can be viewed as a classical MFC with extended state space ${\cal S}\times\mathcal{I}$, action space ${\cal A}$, policy $\tilde{\pi}\in{\cal P}({\cal A})^{{\cal S}\times\mathcal{I}}$, mean-field information $\tilde{\mu}\in{\cal P}({\cal S}\times\mathcal{I})$, reward $\tilde{r}((s,\alpha),\tilde{\mu},a):=r\big(s,\int_{\mathcal{I}}W(\alpha,\beta)\tilde{\mu}(\cdot,\beta)d\beta,a\big)$, and transition dynamics of $(\tilde{s}_{t},\alpha_{t})$ given by

\tilde{s}_{t+1}\sim P\Big(\cdot\,\Big|\,\tilde{s}_{t},\int_{\mathcal{I}}W(\alpha_{t},\beta)\tilde{\mu}_{t}(\cdot,\beta)d\beta,\tilde{a}_{t}\Big),\;\alpha_{t+1}=\alpha_{t},\;\tilde{a}_{t}\sim\tilde{\pi}(\cdot|\tilde{s}_{t},\alpha_{t}),\;\tilde{s}_{0}\sim\mu_{0},\;\alpha_{0}\sim\mbox{Unif}(0,1).

It is worth pointing out that such a label-state formulation has also been studied in GMFG [29, 15].

3.2 Approximation

In this section, we show that GMFC (2.8)-(2.10) provides a good approximation of the cooperative multi-agent system (2.2)-(2.6) in terms of the value function and the optimal policy ensemble. To do this, the following assumptions on $W$, $P$, $r$, and ${\boldsymbol{\pi}}$ are needed.

Assumption 3.3 (graphon $W$)

There exists $L_{W}>0$ such that for all $\alpha,\alpha^{\prime},\beta,\beta^{\prime}\in\mathcal{I}$,

|W(\alpha,\beta)-W(\alpha^{\prime},\beta^{\prime})|\leq L_{W}\big(|\alpha-\alpha^{\prime}|+|\beta-\beta^{\prime}|\big).

Assumption 3.3 is common in graphon mean-field theory [21, 15, 29]. Indeed, the Lipschitz continuity of $W$ in Assumption 3.3 can be relaxed to piecewise Lipschitz continuity.

Assumption 3.4 (transition probability $P$)

There exists $L_{P}>0$ such that for all $s\in{\cal S}$, $a\in{\cal A}$, $\mu_{1},\mu_{2}\in{\cal B}({\cal S})$,

\|P(\cdot|s,\mu_{1},a)-P(\cdot|s,\mu_{2},a)\|_{1}\leq L_{P}\|\mu_{1}-\mu_{2}\|_{1},

where $\|\cdot\|_{1}$ denotes the $L^{1}$ norm here and throughout the paper.

Assumption 3.5 (reward $r$)

There exist $M_{r}>0$ and $L_{r}>0$ such that for all $s\in{\cal S}$, $a\in{\cal A}$, $\mu,\mu_{1},\mu_{2}\in{\cal B}({\cal S})$,

|r(s,\mu,a)|\leq M_{r},\quad|r(s,\mu_{1},a)-r(s,\mu_{2},a)|\leq L_{r}\|\mu_{1}-\mu_{2}\|_{1}.
Assumption 3.6 (policy ${\boldsymbol{\pi}}$)

There exists $L_{\boldsymbol{\Pi}}>0$ such that any policy ensemble ${\boldsymbol{\pi}}:=(\pi^{\alpha})_{\alpha\in\mathcal{I}}\in{\boldsymbol{\Pi}}$ is Lipschitz continuous in the label, i.e., for all $\alpha,\beta\in\mathcal{I}$,

\max_{s\in{\cal S}}\|\pi^{\alpha}(\cdot|s)-\pi^{\beta}(\cdot|s)\|_{1}\leq L_{\boldsymbol{\Pi}}|\alpha-\beta|.

Assumptions 3.4-3.6 are standard and commonly used to bridge the multi-agent system and mean-field theory.

To show the approximation properties of GMFC for the large-scale multi-agent system, we need to relate policy ensembles of GMFC to policies of the multi-agent system. On the one hand, any ${\boldsymbol{\pi}}\in{\boldsymbol{\Pi}}$ induces an $N$-agent policy tuple $(\pi^{1},\ldots,\pi^{N})\in\Pi^{N}$ via

\Gamma^{N}:{\boldsymbol{\Pi}}\ni{\boldsymbol{\pi}}\mapsto(\pi^{1},\ldots,\pi^{N})\in\Pi^{N},\quad\mbox{with}\;\pi^{i}:={\boldsymbol{\pi}}^{\frac{i}{N}}.  (3.7)

On the other hand, any $N$-agent policy tuple $(\pi^{1},\ldots,\pi^{N})\in\Pi^{N}$ can be seen as a step policy ensemble ${\boldsymbol{\pi}}^{N}$ in ${\boldsymbol{\Pi}}$:

{\boldsymbol{\pi}}^{N,\alpha}:=\sum_{i=1}^{N}\pi^{i}{\bf 1}_{\alpha\in(\frac{i-1}{N},\frac{i}{N}]}\in{\boldsymbol{\Pi}}.  (3.8)
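The two maps (3.7) and (3.8) are straightforward to implement; the sketch below (ours, assuming a policy ensemble is represented as a callable $\alpha\mapsto\pi^{\alpha}$) makes them concrete.

```python
import math

def gamma_N(policy_ensemble, N):
    """(3.7): sample the ensemble at labels i/N to obtain N agent policies."""
    return [policy_ensemble(i / N) for i in range(1, N + 1)]

def step_ensemble(agent_policies):
    """(3.8): view an N-agent policy tuple as a piecewise-constant policy ensemble."""
    N = len(agent_policies)
    def pi(alpha):
        i = min(max(math.ceil(N * alpha), 1), N)   # alpha in ((i-1)/N, i/N] maps to agent i
        return agent_policies[i - 1]
    return pi
```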
Theorem 3.7 (Approximate Pareto Property)

Suppose Assumptions 2.1, 3.3, 3.4, 3.5 and 3.6 hold. Then under either condition (C1) or (C2), we have for any initial distribution $\mu\in{\cal P}({\cal S})$,

|V_{N}(\mu)-V(\mu)|\to 0,\quad\text{as}\;N\to\infty.  (3.9)

Moreover, if the graphon convergence in Assumption 2.1 is at rate $\mathcal{O}(\frac{1}{\sqrt{N}})$, then $|V_{N}(\mu)-V(\mu)|=\mathcal{O}(\frac{1}{\sqrt{N}})$. As a consequence, for any $\varepsilon>0$, there exists an integer $N_{\varepsilon}$ such that for $N\geq N_{\varepsilon}$, the optimal policy ensemble of GMFC, denoted by ${\boldsymbol{\pi}}^{*}$ (if it exists), provides an $\varepsilon$-Pareto optimum $(\pi^{1,*},\ldots,\pi^{N,*}):=\Gamma^{N}({\boldsymbol{\pi}}^{*})$ for the multi-agent system (2.2)-(2.6), with $\Gamma^{N}$ defined in (3.7).

Directly learning the $Q$ function of GMFC in (3.6) would lead to high complexity. Instead, in the next section we introduce a smaller class of GMFC with a lower-dimensional state-action space, which enables a scalable algorithm.

3.3 Algorithm Design

This section shows that discretizing the graphon index $\alpha\in\mathcal{I}$ of GMFC makes it possible to approximate the $Q$ function in (3.6) by the $Q$ function in (3.10) below, which is defined on a smaller space and is critical for designing efficient learning algorithms.

Precisely, we choose uniform grids $\alpha_{m}\in\mathcal{I}_{M}:=\{\frac{m}{M},0\leq m\leq M\}$ for simplicity, and approximate each agent $\alpha\in\mathcal{I}$ by the nearest $\alpha_{m}\in\mathcal{I}_{M}$. Introduce $\widetilde{\boldsymbol{{\cal M}}}_{M}:={\cal P}({\cal S})^{\mathcal{I}_{M}}$ and $\widetilde{\boldsymbol{\Pi}}_{M}:={\cal P}({\cal A})^{{\cal S}\times{\cal I}_{M}}$. Meanwhile, $\tilde{\boldsymbol{\mu}}:=(\tilde{\mu}^{\alpha_{m}})_{m\in[M]}\in\widetilde{\boldsymbol{{\cal M}}}_{M}$ and $\tilde{\boldsymbol{\pi}}:=(\tilde{\pi}^{\alpha_{m}})_{m\in[M]}\in\widetilde{\boldsymbol{\Pi}}_{M}$ can be viewed as a piecewise constant state distribution ensemble in ${\boldsymbol{{\cal M}}}$ and a piecewise constant policy ensemble in ${\boldsymbol{\Pi}}$, respectively. Our arguments can easily be generalized to nonuniform grids.

Consequently, instead of working with (3.6) and a continuum of graphon labels directly, we work with GMFC with $M$ blocks, called block GMFC, in which agents in the same block are homogeneous. The Bellman equation for the $Q$ function of block GMFC is given by

\widetilde{Q}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})=\widetilde{R}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})+\gamma\sup_{\tilde{\boldsymbol{\pi}}^{\prime}\in\widetilde{\boldsymbol{\Pi}}_{M}}\widetilde{Q}(\widetilde{\boldsymbol{\Phi}}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}),\tilde{\boldsymbol{\pi}}^{\prime}),  (3.10)

where the neighborhood mean-field measure, the aggregated reward $\widetilde{R}:\widetilde{\boldsymbol{{\cal M}}}_{M}\times\widetilde{\boldsymbol{\Pi}}_{M}\to\mathbb{R}$ and the aggregated transition dynamics $\widetilde{\boldsymbol{\Phi}}:\widetilde{\boldsymbol{{\cal M}}}_{M}\times\widetilde{\boldsymbol{\Pi}}_{M}\to\widetilde{\boldsymbol{{\cal M}}}_{M}$ are given by

\tilde{\mu}^{\alpha_{m},W}=\frac{1}{M}\sum_{m^{\prime}=0}^{M-1}W(\alpha_{m},\alpha_{m^{\prime}})\tilde{\mu}^{\alpha_{m^{\prime}}},\quad m\in[M],  (3.11)
\widetilde{R}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})=\frac{1}{M}\sum_{m=0}^{M-1}\sum_{s\in{\cal S}}\sum_{a\in{\cal A}}r(s,\tilde{\mu}^{\alpha_{m},W},a)\tilde{\mu}^{\alpha_{m}}(s)\tilde{\pi}^{\alpha_{m}}(a|s),  (3.12)
\widetilde{\boldsymbol{\Phi}}^{\alpha_{m}}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})(\cdot)=\sum_{s\in{\cal S}}\sum_{a\in{\cal A}}P(\cdot|s,\tilde{\mu}^{\alpha_{m},W},a)\tilde{\mu}^{\alpha_{m}}(s)\tilde{\pi}^{\alpha_{m}}(a|s).  (3.13)
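A single step of block GMFC only involves finite sums, so (3.11)-(3.13) can be evaluated exactly. The following Python sketch (ours, with `P` and `r` assumed to be callables returning a next-state distribution and a scalar reward) computes the aggregated reward and the next state distribution ensemble.

```python
import numpy as np

def block_gmfc_step(mu, pi, W, P, r):
    """One step of block GMFC, following (3.11)-(3.13).
    mu: array (M, S) of per-block state distributions; pi: array (M, S, A) of per-block policies;
    W: graphon; P(s, mu_W, a) -> array (S,); r(s, mu_W, a) -> float."""
    M, S = mu.shape
    A = pi.shape[2]
    alphas = np.arange(M) / M
    # (3.11): neighborhood mean-field measure of each block
    mu_W = np.array([sum(W(alphas[m], alphas[k]) * mu[k] for k in range(M)) / M
                     for m in range(M)])
    R, mu_next = 0.0, np.zeros_like(mu)
    for m in range(M):
        for s in range(S):
            for a in range(A):
                w = mu[m, s] * pi[m, s, a]
                R += r(s, mu_W[m], a) * w / M           # (3.12): aggregated reward
                mu_next[m] += P(s, mu_W[m], a) * w      # (3.13): aggregated dynamics
    return R, mu_next
```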

We see from (3.10) that block GMFC is an MDP with deterministic dynamics $\widetilde{\boldsymbol{\Phi}}$ and continuous state-action space $\widetilde{\boldsymbol{{\cal M}}}_{M}\times\widetilde{\boldsymbol{\Pi}}_{M}$. The following theorem shows that there exists an optimal policy ensemble of block GMFC in $\widetilde{\boldsymbol{\Pi}}_{M}$.

Theorem 3.8 (Existence of Optimal Policy Ensemble)

Suppose Assumptions 3.4 and 3.5 hold and $\gamma(L_{P}+1)<1$. Then for any fixed integer $M>0$, there exists $\tilde{\boldsymbol{\pi}}^{*}\in\widetilde{\boldsymbol{\Pi}}_{M}$ that maximizes $\widetilde{Q}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})$ in (3.10) for any $\tilde{\boldsymbol{\mu}}\in\widetilde{\boldsymbol{{\cal M}}}_{M}$.

Furthermore, we show that with a sufficiently fine partition of the graphon index set $\mathcal{I}$, i.e., when $M$ is sufficiently large, block GMFC (3.10)-(3.13) approximates the multi-agent system in Section 2.2 well.

Theorem 3.9

Assume $\gamma(L_{P}+1)<1$ and Assumptions 2.1, 3.3, 3.4, 3.5 and 3.6. Under either (C1) or (C2), for any $\varepsilon>0$, there exist $N_{\varepsilon}$ and $M_{\varepsilon}$ such that for $N\geq N_{\varepsilon}$, the optimal policy ensemble $\tilde{\boldsymbol{\pi}}^{*}$ of block GMFC (3.10) with $M_{\varepsilon}$ blocks provides an $\varepsilon$-Pareto optimum $(\tilde{\pi}^{1,*},\ldots,\tilde{\pi}^{N,*}):=\Gamma^{N}(\tilde{\boldsymbol{\pi}}^{*})$ for the multi-agent system (2.2)-(2.6) with $N$ agents.

Theorem 3.9 shows that the optimal policy ensemble of block GMFC is near-optimal for all sufficiently large multi-agent systems, meaning that block GMFC provides a good approximation for the multi-agent system.

Recall that block GMFC can be viewed as an MDP with deterministic dynamics and continuous state-action space. To learn block GMFC, one can adopt the kernel-based Q-learning method in [24] for MFC, a uniform discretization method, or deep reinforcement learning algorithms such as DDPG [37] for MFC as in [9], with theoretical guarantees. Since block GMFC has a higher dimension than classical MFC, we choose to adapt the deep reinforcement learning algorithm Proximal Policy Optimization (PPO) [47] to block GMFC and then apply the learned policy ensemble of block GMFC to the multi-agent system to validate our theoretical findings. We describe the deployment of block GMFC in the multi-agent system in Algorithm 1, which we call N-agent GMFC.
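To make the learning step concrete, block GMFC can be wrapped as a single-agent environment with deterministic transitions, on which PPO (or any other policy-optimization method) is run. The sketch below is our own illustration under an assumed gym-style interface; it is not the paper's implementation, and it reuses the hypothetical `block_gmfc_step` from the sketch after (3.13).

```python
import numpy as np

class BlockGMFCEnv:
    """Block GMFC as a single-agent environment: the observation is the flattened state
    distribution ensemble (M, S), and the action encodes the policy ensemble (M, S, A)."""

    def __init__(self, mu0, W, P, r, horizon):
        self.mu0, self.W, self.P, self.r, self.horizon = mu0, W, P, r, horizon

    def reset(self):
        self.mu, self.t = self.mu0.copy(), 0
        return self.mu.flatten()

    def step(self, action):
        M, S = self.mu.shape
        pi = np.abs(action).reshape(M, S, -1) + 1e-8
        pi = pi / pi.sum(axis=-1, keepdims=True)          # project onto the action simplex
        R, self.mu = block_gmfc_step(self.mu, pi, self.W, self.P, self.r)
        self.t += 1
        return self.mu.flatten(), R, self.t >= self.horizon, {}
```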

Algorithm 1 N-agent GMFC
Input: initial state distribution $\mu_{0}$, number of agents $N$, episode length $T$, policy ensemble $\tilde{\boldsymbol{\pi}}\in\widetilde{\boldsymbol{\Pi}}_{M}$ learned by PPO
Initialize $s^{i}_{0}\sim\mu_{0}$, $i\in[N]$
for $t=1$ to $T$ do
     for $i=1$ to $N$ do
          Choose $m(i)=\arg\min_{m\in[M]}|\frac{i}{N}-\frac{m}{M}|$
          Sample action $a^{i}_{t}\sim\tilde{\pi}^{\alpha_{m(i)}}(\cdot|s_{t}^{i})$, observe reward $r_{t}^{i}$ and new state $s_{t+1}^{i}$
     end for
end for
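The deployment step of Algorithm 1 amounts to a nearest-block lookup followed by sampling; a minimal Python sketch (ours, with an assumed $N$-agent environment interface) is given below.

```python
import numpy as np

def deploy_n_agent_gmfc(pi_blocks, env, N, M, T, rng=np.random.default_rng()):
    """Algorithm 1: deploy the learned block-GMFC policy ensemble pi_blocks
    (array of shape (M+1, S, A), one policy per grid point m/M) in an N-agent system.
    env is assumed to provide reset() -> states and step(actions) -> (states, rewards, done)."""
    states = env.reset()
    mean_rewards = []
    for t in range(T):
        actions = []
        for i in range(N):
            m = int(np.argmin([abs((i + 1) / N - k / M) for k in range(M + 1)]))  # nearest block
            a = rng.choice(pi_blocks.shape[2], p=pi_blocks[m, states[i]])          # sample action
            actions.append(a)
        states, rewards, done = env.step(actions)
        mean_rewards.append(float(np.mean(rewards)))
        if done:
            break
    return mean_rewards
```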

4 Proofs of Main Results

In this section, we will provide proofs of Theorems 3.7-3.9.

4.1 Proof of Theorem 3.7

To prove Theorem 3.7, we need the following two lemmas. We start by defining the step state distribution ${\boldsymbol{\mu}}_{t}^{N}:=({\mu}_{t}^{N,\alpha})_{\alpha\in\mathcal{I}}$ for notational simplicity:

{\mu}_{t}^{N,\alpha}(\cdot)=\sum_{i\in{\cal V}_{N}}\delta_{s_{t}^{i}}(\cdot){\bf 1}_{\alpha\in(\frac{i-1}{N},\frac{i}{N}]}.  (4.1)

Lemma 4.1 shows the convergence of the neighborhood empirical measure to the neighborhood mean-field measure.

Lemma 4.1

Suppose Assumptions 2.1, 3.3, 3.4 and 3.6 hold. Under either condition (C1) or (C2), for any policy ensemble ${\boldsymbol{\pi}}\in{\boldsymbol{\Pi}}$, we have

\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\big[\|\mu_{t}^{i,W_{N}}-\mu_{t}^{\alpha,W}\|_{1}\big]d\alpha\to 0,\quad\mbox{as}\;N\to\infty,  (4.2)

where the initial distributions satisfy $\mu_{0}^{i}=\mu_{0}^{\alpha}\equiv\mu\in{\cal P}({\cal S})$.

Moreover, if the graphon convergence in Assumption 2.1 is at rate $\mathcal{O}(\frac{1}{\sqrt{N}})$, then

\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\big[\|\mu_{t}^{i,W_{N}}-\mu_{t}^{\alpha,W}\|_{1}\big]d\alpha=\mathcal{O}\Big(\frac{1}{\sqrt{N}}\Big).
Proof of Lemma 4.1.

We first prove (4.2) under condition (C1) and then show that (4.2) also holds under condition (C2).
Case 1: $\xi_{i,j}^{N}=W_{N}(\frac{i}{N},\frac{j}{N})$. Note that under condition (C1), $\mu_{t}^{i,W_{N}}=\int_{\mathcal{I}}W_{N}(\frac{i}{N},\beta){\mu}_{t}^{N,\beta}d\beta$ by the definition of $\mu_{t}^{N,\alpha}$ in (4.1). Then

\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\big[\|\mu_{t}^{i,W_{N}}-\mu_{t}^{\alpha,W}\|_{1}\big]d\alpha
=\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\Big[\Big\|\int_{\mathcal{I}}W_{N}(\tfrac{i}{N},\beta){\mu}_{t}^{N,\beta}d\beta-\int_{\mathcal{I}}W(\alpha,\beta){\mu}_{t}^{\beta}d\beta\Big\|_{1}\Big]d\alpha
\leq\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\Big[\Big\|\int_{\mathcal{I}}W_{N}(\tfrac{i}{N},\beta){\mu}_{t}^{N,\beta}d\beta-\int_{\mathcal{I}}W_{N}(\tfrac{i}{N},\beta){\mu}_{t}^{\beta}d\beta\Big\|_{1}\Big]d\alpha
\quad+\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\Big[\Big\|\int_{\mathcal{I}}W_{N}(\tfrac{i}{N},\beta){\mu}_{t}^{\beta}d\beta-\int_{\mathcal{I}}W(\alpha,\beta){\mu}_{t}^{\beta}d\beta\Big\|_{1}\Big]d\alpha
=:I_{1}+I_{2}.

For the term $I_{1}$, we apply Theorem 2 in [15] and obtain that under the policy ensemble ${\boldsymbol{\pi}}$ and the $N$-agent policy $(\pi^{1},\ldots,\pi^{N}):=\Gamma^{N}({\boldsymbol{\pi}})$, with $\Gamma^{N}$ defined in (3.7),

I_{1}=\mathbb{E}\Big[\Big\|\int_{\mathcal{I}}W_{N}(\tfrac{i}{N},\beta){\mu}_{t}^{N,\beta}d\beta-\int_{\mathcal{I}}W_{N}(\tfrac{i}{N},\beta){\mu}_{t}^{\beta}d\beta\Big\|_{1}\Big]\to 0,\quad\mbox{as}\;N\to\infty.

Moreover, if the graphon convergence in Assumption 2.1 is at rate $\mathcal{O}(\frac{1}{\sqrt{N}})$, then the term $I_{1}$ is also of order $\mathcal{O}(\frac{1}{\sqrt{N}})$.
For the term $I_{2}$, by noting that $W_{N}(\alpha,\beta)=W_{N}\big(\frac{\lceil N\alpha\rceil}{N},\frac{\lceil N\beta\rceil}{N}\big)$, we have

I_{2}=\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big\|\int_{\mathcal{I}}W_{N}\big(\tfrac{\lceil N\alpha\rceil}{N},\beta\big){\mu}_{t}^{\beta}d\beta-\int_{\mathcal{I}}W(\alpha,\beta){\mu}_{t}^{\beta}d\beta\Big\|_{1}d\alpha
=\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big\|\int_{\mathcal{I}}W_{N}(\alpha,\beta){\mu}_{t}^{\beta}d\beta-\int_{\mathcal{I}}W(\alpha,\beta){\mu}_{t}^{\beta}d\beta\Big\|_{1}d\alpha
=\int_{\mathcal{I}}\Big\|\int_{\mathcal{I}}W_{N}(\alpha,\beta){\mu}_{t}^{\beta}d\beta-\int_{\mathcal{I}}W(\alpha,\beta){\mu}_{t}^{\beta}d\beta\Big\|_{1}d\alpha
=\sum_{s\in{\cal S}}\int_{\mathcal{I}}\Big|\int_{\mathcal{I}}W_{N}(\alpha,\beta){\mu}_{t}^{\beta}(s)d\beta-\int_{\mathcal{I}}W(\alpha,\beta){\mu}_{t}^{\beta}(s)d\beta\Big|d\alpha
\to 0,

where the last convergence follows from the fact in [39] that $\|W_{N}-W\|_{\square}\to 0$ is equivalent to

\|W_{N}-W\|_{L_{\infty}\to L_{1}}:=\sup_{\|g\|_{\infty}\leq 1}\int_{\mathcal{I}}\Big|\int_{\mathcal{I}}\big(W_{N}(\alpha,\beta)-W(\alpha,\beta)\big)g(\beta)d\beta\Big|d\alpha\to 0.

Combining $I_{1}$ and $I_{2}$, we prove (4.2) under condition (C1).

Case 2: $\xi_{i,j}^{N}\sim\mbox{Bernoulli}\big(W_{N}(\frac{i}{N},\frac{j}{N})\big)$. We have

\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\|\mu^{i,W_{N}}_{t}-\mu^{\alpha,W}_{t}\|_{1}d\alpha
=\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\Big\|\frac{1}{N}\sum_{j=1}^{N}\xi_{i,j}^{N}\delta_{s_{t}^{j}}-\int_{\mathcal{I}}W(\alpha,\beta)\mu_{t}^{\beta}d\beta\Big\|_{1}d\alpha
\leq\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\Big\|\frac{1}{N}\sum_{j=1}^{N}\xi_{i,j}^{N}\delta_{s_{t}^{j}}-\int_{\mathcal{I}}W_{N}(\tfrac{i}{N},\beta)\mu_{t}^{N,\beta}d\beta\Big\|_{1}d\alpha
\quad+\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\Big\|\int_{\mathcal{I}}W_{N}(\tfrac{i}{N},\beta)\mu_{t}^{N,\beta}d\beta-\int_{\mathcal{I}}W(\alpha,\beta)\mu_{t}^{\beta}d\beta\Big\|_{1}d\alpha
=:I_{1}+I_{2}.

Note from Case 1 that $I_{2}\to 0$ as $N\to\infty$, and that $I_{2}=\mathcal{O}(\frac{1}{\sqrt{N}})$ if the graphon convergence in Assumption 2.1 is at rate $\mathcal{O}(\frac{1}{\sqrt{N}})$. Therefore, it is enough to estimate $I_{1}$:

I_{1}=\mathbb{E}\Big\|\frac{1}{N}\sum_{j=1}^{N}\xi_{i,j}^{N}\delta_{s_{t}^{j}}-\int_{\mathcal{I}}W_{N}(\tfrac{i}{N},\beta)\mu_{t}^{N,\beta}d\beta\Big\|_{1}
\leq\mathbb{E}\Big[\mathbb{E}\Big[\sup_{f:{\cal S}\to\{-1,1\}}\Big\{\frac{1}{N}\sum_{j=1}^{N}\xi_{i,j}^{N}f(s_{t}^{j})-\frac{1}{N}\sum_{j=1}^{N}W_{N}(\tfrac{i}{N},\tfrac{j}{N})f(s_{t}^{j})\Big\}\Big|s_{t}^{1},\ldots,s_{t}^{N}\Big]\Big].

We proceed with the same argument as in the proof of Theorem 6.3 in [24]. Precisely, conditioned on $s_{t}^{1},\ldots,s_{t}^{N}$, $\big\{\xi_{i,j}^{N}f(s_{t}^{j})-W_{N}(\frac{i}{N},\frac{j}{N})f(s_{t}^{j})\big\}_{j=1}^{N}$ is a sequence of independent mean-zero random variables bounded in $[-1,1]$, since $\mathbb{E}[\xi_{i,j}^{N}]=W_{N}(\frac{i}{N},\frac{j}{N})$. This implies that each $\xi_{i,j}^{N}f(s_{t}^{j})-W_{N}(\frac{i}{N},\frac{j}{N})f(s_{t}^{j})$ is sub-Gaussian with variance bounded by 4. As a result, conditioned on $s_{t}^{1},\ldots,s_{t}^{N}$, $\frac{1}{N}\sum_{j=1}^{N}\xi_{i,j}^{N}f(s_{t}^{j})-\frac{1}{N}\sum_{j=1}^{N}W_{N}(\frac{i}{N},\frac{j}{N})f(s_{t}^{j})$ is, for each $i$, a mean-zero sub-Gaussian random variable with variance $\frac{4}{N}$. By equation (2.66) in [51], we have

I_{1}\leq\mathbb{E}\Big[\mathbb{E}\Big[\sup_{f:{\cal S}\to\{-1,1\}}\Big\{\frac{1}{N}\sum_{j=1}^{N}\xi_{i,j}^{N}f(s_{t}^{j})-\frac{1}{N}\sum_{j=1}^{N}W_{N}(\tfrac{i}{N},\tfrac{j}{N})f(s_{t}^{j})\Big\}\Big|s_{t}^{1},\ldots,s_{t}^{N}\Big]\Big]
\leq\frac{\sqrt{8\ln(2)|{\cal S}|}}{\sqrt{N}}.

Therefore, combining $I_{1}$ and $I_{2}$ in Case 2, we conclude that (4.2) also holds under condition (C2). ∎

Lemma 4.2 shows the convergence of the state distributions of the $N$-agent game to the state distributions of GMFC.

Lemma 4.2

Suppose Assumptions 2.1, 3.3, 3.4 and 3.6 hold. For any uniformly bounded family ${\cal G}$ of functions $g:{\cal S}\to\mathbb{R}$, we have

\sup_{g\in{\cal G}}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big|\mathbb{E}[g(s_{t}^{i})-g(s_{t}^{\alpha})]\Big|d\alpha\to 0,  (4.3)

where $s_{0}^{i}\sim\mu_{0}$, $s_{0}^{\alpha}\sim\mu_{0}$. Moreover, if the graphon convergence in Assumption 2.1 is at rate $\mathcal{O}(\frac{1}{\sqrt{N}})$, then

\sup_{g\in{\cal G}}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big|\mathbb{E}[g(s_{t}^{i})-g(s_{t}^{\alpha})]\Big|d\alpha=\mathcal{O}\Big(\frac{1}{\sqrt{N}}\Big).
Proof of Lemma 4.2.

The proof is by induction on $t$. To this end, first introduce

l_{g}(s,\mu,\pi):=\sum_{a\in{\cal A}}\sum_{s^{\prime}\in{\cal S}}g(s^{\prime})P(s^{\prime}|s,\mu,a)\pi(a|s).

(4.3) obviously holds at $t=0$. Suppose that (4.3) holds at time $t$. Then for any uniformly bounded function $g$ with $|g|\leq M_{g}$, at time $t+1$,

\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big|\mathbb{E}[g(s_{t+1}^{i})-g(s_{t+1}^{\alpha})]\Big|d\alpha  (4.4)
=\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big|\mathbb{E}\big[l_{g}(s_{t}^{i},\mu_{t}^{i,W_{N}},\pi^{i})\big]-\mathbb{E}\big[l_{g}(s_{t}^{\alpha},\mu_{t}^{\alpha,W},\pi^{\alpha})\big]\Big|d\alpha
\leq\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big|\mathbb{E}\big[l_{g}(s_{t}^{i},\mu_{t}^{i,W_{N}},\pi^{i})\big]-\mathbb{E}\big[l_{g}(s_{t}^{i},\mu_{t}^{\alpha,W},\pi^{i})\big]\Big|d\alpha
\quad+\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big|\mathbb{E}\big[l_{g}(s_{t}^{i},\mu_{t}^{\alpha,W},\pi^{i})\big]-\mathbb{E}\big[l_{g}(s_{t}^{\alpha},\mu_{t}^{\alpha,W},\pi^{i})\big]\Big|d\alpha
\quad+\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big|\mathbb{E}\big[l_{g}(s_{t}^{\alpha},\mu_{t}^{\alpha,W},\pi^{i})\big]-\mathbb{E}\big[l_{g}(s_{t}^{\alpha},\mu_{t}^{\alpha,W},\pi^{\alpha})\big]\Big|d\alpha
=:I+II+III,

where the first equality is by the law of total expectation.

First term of (4.4)

I=\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big|\mathbb{E}\big[l_{g}(s_{t}^{i},\mu_{t}^{i,W_{N}},\pi^{i})\big]-\mathbb{E}\big[l_{g}(s_{t}^{i},\mu_{t}^{\alpha,W},\pi^{i})\big]\Big|d\alpha
\leq M_{g}L_{P}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\big[\|\mu_{t}^{i,W_{N}}-\mu_{t}^{\alpha,W}\|_{1}\big]d\alpha
\to 0,\quad\mbox{as}\;N\to\infty,

where the inequality follows from the Lipschitz continuity of $P$ in Assumption 3.4, and the convergence follows from Lemma 4.1.

Second term of (4.4)

One can view $l_{g}(s,\mu_{t}^{\alpha,W},\pi^{i})$ as a function of $s\in{\cal S}$ for any fixed $\mu_{t}^{\alpha,W}$ and $\pi^{i}$. Note that $|l_{g}(s,\mu_{t}^{\alpha,W},\pi^{i})|\leq M_{g}$, where $M_{g}$ is a constant independent of $\mu_{t}^{\alpha,W}$ and $\pi^{i}$. Since (4.3) holds at time $t$,

II=\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big|\mathbb{E}\big[l_{g}(s_{t}^{i},\mu_{t}^{\alpha,W},\pi^{i})\big]-\mathbb{E}\big[l_{g}(s_{t}^{\alpha},\mu_{t}^{\alpha,W},\pi^{i})\big]\Big|d\alpha\to 0,\quad\mbox{as}\;N\to\infty.

Third term of (4.4)

III=\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big|\mathbb{E}\big[l_{g}(s_{t}^{\alpha},\mu_{t}^{\alpha,W},\pi^{i})\big]-\mathbb{E}\big[l_{g}(s_{t}^{\alpha},\mu_{t}^{\alpha,W},\pi^{\alpha})\big]\Big|d\alpha
\leq M_{g}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\big[\|\pi^{i}(\cdot|s_{t}^{\alpha})-\pi^{\alpha}(\cdot|s_{t}^{\alpha})\|_{1}\big]d\alpha
\leq M_{g}L_{\Pi}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\max_{\alpha\in(\frac{i-1}{N},\frac{i}{N}]}\Big|\frac{i}{N}-\alpha\Big|d\alpha
=\mathcal{O}\Big(\frac{1}{N}\Big),

where the first inequality is by the uniform boundedness of $g$ and the second inequality is from Assumption 3.6. ∎

Now we are ready to prove Theorem 3.7. We start by defining $\widehat{r}$, the reward aggregated over all possible actions under the policy $\pi$:

\widehat{r}(s,\mu,\pi):=\sum_{a\in{\cal A}}r(s,\mu,a)\pi(a|s).
Proof of Theorem 3.7.
|V_{N}(\mu)-V(\mu)|
=\Big|\sup_{\Pi^{N}}\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r\big(s^{i}_{t},\mu^{i,W_{N}}_{t},a^{i}_{t}\big)\Big]-\sup_{{\boldsymbol{\pi}}\in{\boldsymbol{\Pi}}}\int_{\mathcal{I}}\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r\big(s^{\alpha}_{t},\mu^{\alpha,W}_{t},a^{\alpha}_{t}\big)\Big]d\alpha\Big|
\leq\sup_{{\boldsymbol{\pi}}\in{\boldsymbol{\Pi}}}\Big|\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r\big(s^{i}_{t},\mu^{i,W_{N}}_{t},a^{i}_{t}\big)\Big]-\int_{\mathcal{I}}\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r\big(s^{\alpha}_{t},\mu^{\alpha,W}_{t},a^{\alpha}_{t}\big)\Big]d\alpha\Big|
=\sup_{{\boldsymbol{\pi}}\in{\boldsymbol{\Pi}}}\Big|\sum_{t=0}^{\infty}\gamma^{t}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big(\mathbb{E}\big[\widehat{r}(s^{i}_{t},\mu^{i,W_{N}}_{t},\pi^{i})\big]-\mathbb{E}\big[\widehat{r}(s^{\alpha}_{t},\mu^{\alpha,W}_{t},\pi^{\alpha})\big]\Big)d\alpha\Big|
\leq\sup_{{\boldsymbol{\pi}}\in{\boldsymbol{\Pi}}}\Big|\sum_{t=0}^{\infty}\gamma^{t}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big(\mathbb{E}\big[\widehat{r}(s^{i}_{t},\mu^{i,W_{N}}_{t},\pi^{i})\big]-\mathbb{E}\big[\widehat{r}(s^{i}_{t},\mu^{\alpha,W}_{t},\pi^{i})\big]\Big)d\alpha\Big|
\quad+\sup_{{\boldsymbol{\pi}}\in{\boldsymbol{\Pi}}}\Big|\sum_{t=0}^{\infty}\gamma^{t}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big(\mathbb{E}\big[\widehat{r}(s^{i}_{t},\mu^{\alpha,W}_{t},\pi^{i})\big]-\mathbb{E}\big[\widehat{r}(s^{\alpha}_{t},\mu^{\alpha,W}_{t},\pi^{i})\big]\Big)d\alpha\Big|
\quad+\sup_{{\boldsymbol{\pi}}\in{\boldsymbol{\Pi}}}\Big|\sum_{t=0}^{\infty}\gamma^{t}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big(\mathbb{E}\big[\widehat{r}(s^{\alpha}_{t},\mu^{\alpha,W}_{t},\pi^{i})\big]-\mathbb{E}\big[\widehat{r}(s^{\alpha}_{t},\mu^{\alpha,W}_{t},\pi^{\alpha})\big]\Big)d\alpha\Big|
:=I+II+III,  (4.5)

where we use (3.8) in the second inequality.

First term of (4.5)

I\leq\sup_{{\boldsymbol{\pi}}}L_{r}\sum_{t=0}^{\infty}\gamma^{t}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\mathbb{E}\|\mu^{i,W_{N}}_{t}-\mu^{\alpha,W}_{t}\|_{1}d\alpha=\mathcal{O}\Big(\frac{1}{\sqrt{N}}\Big),  (4.6)

where the last equality follows from Lemma 4.1 when the convergence in Assumption 2.1 is at rate $\mathcal{O}(1/\sqrt{N})$.

Second term of (4.5)

From Lemma 4.2, we have $II=\mathcal{O}(\frac{1}{\sqrt{N}})$.

Third term of (4.5)

III\leq\sup_{{\boldsymbol{\pi}}}\sum_{t=0}^{\infty}\gamma^{t}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\max_{s\in{\cal S}}\|\pi^{i}(\cdot|s)-\pi^{\alpha}(\cdot|s)\|_{1}d\alpha
\leq L_{\Pi}\sup_{{\boldsymbol{\pi}}}\sum_{t=0}^{\infty}\gamma^{t}\sum_{i=1}^{N}\int_{(\frac{i-1}{N},\frac{i}{N}]}\Big|\frac{i}{N}-\alpha\Big|d\alpha
=\mathcal{O}\Big(\frac{1}{N}\Big).

Therefore, combining $I$, $II$ and $III$ yields the desired result. ∎

4.2 Proof of Theorem 3.8

First, we see that (3.10) corresponds to the following optimal control problem

\widetilde{V}^{M}(\tilde{\boldsymbol{\mu}}):=\sup_{\tilde{\boldsymbol{\pi}}\in\widetilde{\boldsymbol{\Pi}}_{M}}\tilde{J}^{M}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})
=\sup_{\tilde{\boldsymbol{\pi}}\in\widetilde{\boldsymbol{\Pi}}_{M}}\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r\big(\tilde{s}^{\alpha_{m}}_{t},\tilde{\mu}^{\alpha_{m},W}_{t},\tilde{a}^{\alpha_{m}}_{t}\big)\,\Big|\,\tilde{s}^{\alpha_{m}}_{0}\sim\tilde{\mu}^{\alpha_{m}},\,\tilde{a}_{t}^{\alpha_{m}}\sim\tilde{\pi}^{\alpha_{m}}(\cdot|\tilde{s}_{t}^{\alpha_{m}})\Big].  (4.7)

The associated Q function of (4.7) is defined as

\tilde{Q}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})=\sup_{\tilde{\boldsymbol{\pi}}^{\prime}}\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r\big(\tilde{s}^{\alpha_{m}}_{t},\tilde{\mu}^{\alpha_{m},W}_{t},\tilde{a}^{\alpha_{m}}_{t}\big)\,\Big|\,\tilde{s}^{\alpha_{m}}_{0}\sim\tilde{\mu}^{\alpha_{m}},\,\tilde{a}_{0}^{\alpha_{m}}\sim\tilde{\pi}^{\alpha_{m}}(\cdot|\tilde{s}_{0}^{\alpha_{m}})\Big]  (4.8)
=\widetilde{R}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})+\sup_{\tilde{\boldsymbol{\pi}}^{\prime}\in\widetilde{\boldsymbol{\Pi}}_{M}}\sum_{t=1}^{\infty}\gamma^{t}\widetilde{R}(\tilde{\boldsymbol{\mu}}_{t},\tilde{\boldsymbol{\pi}}^{\prime}),

subject to $\tilde{\boldsymbol{\mu}}_{t+1}=\widetilde{\boldsymbol{\Phi}}(\tilde{\boldsymbol{\mu}}_{t},\tilde{\boldsymbol{\pi}})$, $\tilde{\boldsymbol{\mu}}_{0}=\tilde{\boldsymbol{\mu}}$.

We first show a verification result and then prove the continuity of $\tilde{Q}$ in (4.8); together these yield Theorem 3.8.

Lemma 4.3 (Verification)

Assume Assumption 3.5. Then $\tilde{Q}$ in (4.8) is the unique function satisfying the Bellman equation (3.10). Furthermore, if there exists $\tilde{\boldsymbol{\pi}}^{*}\in\arg\max_{\widetilde{\boldsymbol{\Pi}}_{M}}\tilde{Q}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})$ for each $\tilde{\boldsymbol{\mu}}\in\widetilde{\boldsymbol{{\cal M}}}_{M}$, then $\tilde{\boldsymbol{\pi}}^{*}$ is an optimal stationary policy ensemble.

The proof of Lemma 4.3 is standard and very similar to the proof of Proposition 3.3 in [24].

Proof of Lemma 4.3.

First, define the $\frac{M_{r}}{1-\gamma}$-bounded function space $\mathcal{Q}:=\{f:\widetilde{\boldsymbol{{\cal M}}}_{M}\times\widetilde{\boldsymbol{\Pi}}_{M}\to[-\frac{M_{r}}{1-\gamma},\frac{M_{r}}{1-\gamma}]\}$, and define the Bellman operator $B:\mathcal{Q}\to\mathcal{Q}$ by

(Bq)(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}):=\widetilde{R}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})+\gamma\sup_{\tilde{\boldsymbol{\pi}}^{\prime}\in\widetilde{\boldsymbol{\Pi}}_{M}}q(\widetilde{\boldsymbol{\Phi}}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}),\tilde{\boldsymbol{\pi}}^{\prime}).

One can show that $B$ is a $\gamma$-contraction. By the Banach fixed point theorem, $B$ admits a unique fixed point. Since the $\tilde{Q}$ function in (4.8) satisfies $B\tilde{Q}=\tilde{Q}$, $\tilde{Q}$ is the unique solution of (3.10).
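For completeness, the contraction property follows from the elementary estimate (a standard step, spelled out here): for any $q_{1},q_{2}\in\mathcal{Q}$,

\big|(Bq_{1})(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})-(Bq_{2})(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})\big|=\gamma\Big|\sup_{\tilde{\boldsymbol{\pi}}^{\prime}}q_{1}(\widetilde{\boldsymbol{\Phi}}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}),\tilde{\boldsymbol{\pi}}^{\prime})-\sup_{\tilde{\boldsymbol{\pi}}^{\prime}}q_{2}(\widetilde{\boldsymbol{\Phi}}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}),\tilde{\boldsymbol{\pi}}^{\prime})\Big|\leq\gamma\sup_{\tilde{\boldsymbol{\pi}}^{\prime}}\big|q_{1}(\widetilde{\boldsymbol{\Phi}}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}),\tilde{\boldsymbol{\pi}}^{\prime})-q_{2}(\widetilde{\boldsymbol{\Phi}}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}),\tilde{\boldsymbol{\pi}}^{\prime})\big|\leq\gamma\|q_{1}-q_{2}\|_{\infty}.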

Next, we define $B^{\tilde{\boldsymbol{\pi}}^{\prime}}:\mathcal{Q}\to\mathcal{Q}$ under the policy ensemble $\tilde{\boldsymbol{\pi}}^{\prime}\in\widetilde{\boldsymbol{\Pi}}_{M}$ by

(B^{\tilde{\boldsymbol{\pi}}^{\prime}}q)(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}):=\widetilde{R}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})+\gamma q(\widetilde{\boldsymbol{\Phi}}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}),\tilde{\boldsymbol{\pi}}^{\prime}).

Similarly, one can show that $B^{\tilde{\boldsymbol{\pi}}^{\prime}}$ is a $\gamma$-contraction and thus admits a unique fixed point, denoted by $\tilde{Q}^{\tilde{\boldsymbol{\pi}}^{\prime}}$. From this, we have

\tilde{Q}^{\tilde{\boldsymbol{\pi}}^{*}}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})=\widetilde{R}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})+\gamma\tilde{Q}^{\tilde{\boldsymbol{\pi}}^{*}}(\widetilde{\boldsymbol{\Phi}}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}),\tilde{\boldsymbol{\pi}}^{*})
=\widetilde{R}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})+\gamma\sup_{\tilde{\boldsymbol{\pi}}^{\prime}\in\widetilde{\boldsymbol{\Pi}}_{M}}\tilde{Q}(\widetilde{\boldsymbol{\Phi}}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}),\tilde{\boldsymbol{\pi}}^{\prime})=\tilde{Q}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}),

which implies that $\tilde{\boldsymbol{\pi}}^{*}$ is an optimal policy ensemble. ∎

Lemma 4.4

Let Assumptions 3.4 and 3.5 hold, and assume further that $\gamma(1+L_{P})<1$. Then $\tilde{Q}$ in (4.8) is continuous.

Proof of Lemma 4.4.

We will show that, as $\tilde{\boldsymbol{\mu}}_{n}\to\tilde{\boldsymbol{\mu}}$ and $\tilde{\boldsymbol{\pi}}_{n}\to\tilde{\boldsymbol{\pi}}$ in the sense that $\int_{\mathcal{I}}\|\tilde{\mu}^{\alpha}-\tilde{\mu}^{\alpha}_{n}\|_{1}d\alpha+\int_{\mathcal{I}}\max_{s\in{\cal S}}\|\tilde{\pi}^{\alpha}-\tilde{\pi}^{\alpha}_{n}\|_{1}d\alpha\to 0$, we have

\tilde{Q}(\tilde{\boldsymbol{\mu}}_{n},\tilde{\boldsymbol{\pi}}_{n})\to\tilde{Q}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}}).

From (4.8),

|\tilde{Q}(\tilde{\boldsymbol{\mu}}_{n},\tilde{\boldsymbol{\pi}}_{n})-\tilde{Q}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})|
\leq\Big|\widetilde{R}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})+\sup_{\tilde{\boldsymbol{\pi}}^{\prime}\in\widetilde{\boldsymbol{\Pi}}_{M}}\sum_{t=1}^{\infty}\gamma^{t}\widetilde{R}(\tilde{\boldsymbol{\mu}}_{t},\tilde{\boldsymbol{\pi}}^{\prime})-\widetilde{R}(\tilde{\boldsymbol{\mu}}_{n},\tilde{\boldsymbol{\pi}}_{n})-\sup_{\tilde{\boldsymbol{\pi}}^{\prime}\in\widetilde{\boldsymbol{\Pi}}_{M}}\sum_{t=1}^{\infty}\gamma^{t}\widetilde{R}(\tilde{\boldsymbol{\mu}}_{n,t},\tilde{\boldsymbol{\pi}}^{\prime})\Big|
\leq\Big|\widetilde{R}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})-\widetilde{R}(\tilde{\boldsymbol{\mu}}_{n},\tilde{\boldsymbol{\pi}}_{n})\Big|+\sup_{\tilde{\boldsymbol{\pi}}^{\prime}\in\widetilde{\boldsymbol{\Pi}}_{M}}\sum_{t=1}^{\infty}\gamma^{t}\Big|\widetilde{R}(\tilde{\boldsymbol{\mu}}_{n,t},\tilde{\boldsymbol{\pi}}^{\prime})-\widetilde{R}(\tilde{\boldsymbol{\mu}}_{t},\tilde{\boldsymbol{\pi}}^{\prime})\Big|
\leq L_{r}\int_{\mathcal{I}}\|\tilde{\mu}^{\alpha,W}-\tilde{\mu}^{\alpha,W}_{n}\|_{1}d\alpha+M_{r}\int_{\mathcal{I}}\|\tilde{\mu}^{\alpha}-\tilde{\mu}^{\alpha}_{n}\|_{1}d\alpha+M_{r}\int_{\mathcal{I}}\max_{s\in{\cal S}}\|\tilde{\pi}^{\alpha}-\tilde{\pi}^{\alpha}_{n}\|_{1}d\alpha
\quad+\sup_{\tilde{\boldsymbol{\pi}}^{\prime}\in\widetilde{\boldsymbol{\Pi}}_{M}}\sum_{t=1}^{\infty}\gamma^{t}\Big(L_{r}\int_{\mathcal{I}}\|\tilde{\mu}^{\alpha,W}_{t}-\tilde{\mu}^{\alpha,W}_{n,t}\|_{1}d\alpha+M_{r}\int_{\mathcal{I}}\|\tilde{\mu}^{\alpha}_{t}-\tilde{\mu}^{\alpha}_{n,t}\|_{1}d\alpha\Big)
\leq(L_{r}+M_{r})\int_{\mathcal{I}}\|\tilde{\mu}^{\alpha}-\tilde{\mu}^{\alpha}_{n}\|_{1}d\alpha+M_{r}\int_{\mathcal{I}}\max_{s\in{\cal S}}\|\tilde{\pi}^{\alpha}-\tilde{\pi}^{\alpha}_{n}\|_{1}d\alpha
\quad+\sup_{\tilde{\boldsymbol{\pi}}^{\prime}\in\widetilde{\boldsymbol{\Pi}}_{M}}\sum_{t=1}^{\infty}\gamma^{t}(L_{r}+M_{r})\int_{\mathcal{I}}\|\tilde{\mu}^{\alpha}_{t}-\tilde{\mu}^{\alpha}_{n,t}\|_{1}d\alpha.

By induction, we obtain

\[
\int_{\mathcal{I}}\|\tilde{\mu}^{\alpha}_{t}-\tilde{\mu}^{\alpha}_{n,t}\|_{1}\,d\alpha\leq(L_{P}+1)\int_{\mathcal{I}}\|\tilde{\mu}^{\alpha}_{t-1}-\tilde{\mu}^{\alpha}_{n,t-1}\|_{1}\,d\alpha\leq\cdots\leq(L_{P}+1)^{t-1}\int_{\mathcal{I}}\|\tilde{\mu}^{\alpha}_{1}-\tilde{\mu}^{\alpha}_{n,1}\|_{1}\,d\alpha.
\]

Therefore, if $\gamma\cdot(1+L_{P})<1$, then

\[
|\tilde{Q}(\tilde{\boldsymbol{\mu}}_{n},\tilde{\boldsymbol{\pi}}_{n})-\tilde{Q}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})|\leq C\Big(\int_{\mathcal{I}}\|\tilde{\mu}^{\alpha}-\tilde{\mu}^{\alpha}_{n}\|_{1}\,d\alpha+\int_{\mathcal{I}}\max_{s\in\mathcal{S}}\|\tilde{\pi}^{\alpha}-\tilde{\pi}^{\alpha}_{n}\|_{1}\,d\alpha\Big),
\]

where $C$ is a constant depending on $L_{r}$, $M_{r}$, and $L_{P}$. ∎

Now we prove Theorem 3.8.

Proof of Theorem 3.8.

By Lemma 4.4, together with the compactness of $\widetilde{\boldsymbol{\Pi}}_{M}$, there exists $\tilde{\boldsymbol{\pi}}^{*}\in\widetilde{\boldsymbol{\Pi}}_{M}$ such that $\tilde{\boldsymbol{\pi}}^{*}\in\arg\max_{\tilde{\boldsymbol{\pi}}\in\widetilde{\boldsymbol{\Pi}}_{M}}\tilde{Q}(\tilde{\boldsymbol{\mu}},\tilde{\boldsymbol{\pi}})$. By Lemma 4.3, $\tilde{\boldsymbol{\pi}}^{*}$ is then an optimal policy ensemble in $\widetilde{\boldsymbol{\Pi}}_{M}$. ∎

4.3 Proof of Theorem 3.9

We first prove the following lemma, which shows that GMFC and block GMFC become increasingly close as the number of blocks grows.

Lemma 4.5

Under Assumptions 3.3, 3.4 and 3.6, we have

\begin{align*}
\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha,W}_{t}-\tilde{\mu}^{\alpha_{m},W}_{t}\|_{1}\,d\alpha&\leq\Big[(1+L_{P})^{t}-1\Big]\frac{L_{\Pi}+2L_{P}L_{W}+L_{W}}{M}+\frac{2L_{W}}{M},\\
\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha}_{t}-\tilde{\mu}^{\alpha_{m}}_{t}\|_{1}\,d\alpha&\leq\Big[(1+L_{P})^{t}-1\Big]\frac{L_{\Pi}+2L_{P}L_{W}+L_{W}}{M}.
\end{align*}
Proof of Lemma 4.5.

\begin{align*}
&\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha,W}_{t}-\tilde{\mu}^{\alpha_{m},W}_{t}\|_{1}\,d\alpha\\
\leq\;&\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha,W}_{t}-\mu^{\alpha_{m},W}_{t}\|_{1}\,d\alpha+\frac{1}{M}\sum_{m=1}^{M}\|\mu^{\alpha_{m},W}_{t}-\bar{\mu}^{\alpha_{m},W}_{t}\|_{1}+\frac{1}{M}\sum_{m=1}^{M}\|\bar{\mu}^{\alpha_{m},W}_{t}-\tilde{\mu}^{\alpha_{m},W}_{t}\|_{1},
\end{align*}

where $\bar{\mu}^{\alpha_{m},W}_{t}:=\frac{1}{M}\sum_{m'=1}^{M}W(\alpha_{m},\alpha_{m'})\mu^{\alpha_{m'}}_{t}$.
By the definitions of $\mu_{t}^{\alpha,W}$ and $\mu_{t}^{\alpha_{m},W}$ in (2.9), $\tilde{\mu}_{t}^{\alpha_{m},W}$ in (3.11), and $\bar{\mu}^{\alpha_{m},W}_{t}$, together with the Lipschitz continuity of $W$ in Assumption 3.3,

\begin{align*}
\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha,W}_{t}-\mu^{\alpha_{m},W}_{t}\|_{1}\,d\alpha&\leq\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha}_{t}-\mu^{\alpha_{m}}_{t}\|_{1}\,d\alpha+\frac{L_{W}}{M},\\
\frac{1}{M}\sum_{m=1}^{M}\|\mu^{\alpha_{m},W}_{t}-\bar{\mu}^{\alpha_{m},W}_{t}\|_{1}&\leq\frac{L_{W}}{M},\\
\frac{1}{M}\sum_{m=1}^{M}\|\bar{\mu}^{\alpha_{m},W}_{t}-\tilde{\mu}^{\alpha_{m},W}_{t}\|_{1}&\leq\frac{1}{M}\sum_{m=1}^{M}\|\mu_{t}^{\alpha_{m}}-\tilde{\mu}_{t}^{\alpha_{m}}\|_{1}.
\end{align*}

Plugging these estimates into the first inequality above,

\[
\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha,W}_{t}-\tilde{\mu}^{\alpha_{m},W}_{t}\|_{1}\,d\alpha\leq A_{t}+\frac{2L_{W}}{M},
\]

where $A_{t}:=\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha}_{t}-\mu^{\alpha_{m}}_{t}\|_{1}\,d\alpha+\frac{1}{M}\sum_{m=1}^{M}\|\mu_{t}^{\alpha_{m}}-\tilde{\mu}_{t}^{\alpha_{m}}\|_{1}$.

On the other hand,

\begin{align*}
&\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha}_{t}-\tilde{\mu}^{\alpha_{m}}_{t}\|_{1}\,d\alpha\\
\leq\;&\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha}_{t}-\mu^{\alpha_{m}}_{t}\|_{1}\,d\alpha+\frac{1}{M}\sum_{m=1}^{M}\|\mu_{t}^{\alpha_{m}}-\tilde{\mu}_{t}^{\alpha_{m}}\|_{1}=A_{t}.
\end{align*}

Therefore, it suffices to estimate $A_{t}$. We next bound $A_{t+1}$ inductively, noting that $A_{0}=0$.

\begin{align*}
&A_{t+1}\\
=\;&\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha}_{t+1}-\mu^{\alpha_{m}}_{t+1}\|_{1}\,d\alpha+\frac{1}{M}\sum_{m=1}^{M}\|\mu_{t+1}^{\alpha_{m}}-\tilde{\mu}_{t+1}^{\alpha_{m}}\|_{1}\\
=\;&\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\Big\|\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\Big(P(\cdot|s,a,\mu_{t}^{\alpha,W})\pi^{\alpha}(a|s)\mu_{t}^{\alpha}(s)-P(\cdot|s,a,\mu^{\alpha_{m},W}_{t})\pi^{\alpha_{m}}(a|s)\mu_{t}^{\alpha_{m}}(s)\Big)\Big\|_{1}\,d\alpha\\
&+\frac{1}{M}\sum_{m=1}^{M}\Big\|\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\Big(P(\cdot|s,a,\mu_{t}^{\alpha_{m},W})\pi^{\alpha_{m}}(a|s)\mu_{t}^{\alpha_{m}}(s)-P(\cdot|s,a,\tilde{\mu}^{\alpha_{m},W}_{t})\tilde{\pi}^{\alpha_{m}}(a|s)\tilde{\mu}_{t}^{\alpha_{m}}(s)\Big)\Big\|_{1}\\
\leq\;&\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\Big(L_{P}\|\mu_{t}^{\alpha,W}-\mu^{\alpha_{m},W}_{t}\|_{1}+\frac{L_{\Pi}}{M}+\|\mu_{t}^{\alpha}-\mu_{t}^{\alpha_{m}}\|_{1}\Big)\,d\alpha\\
&+\frac{1}{M}\sum_{m=1}^{M}\Big(L_{P}\|\mu_{t}^{\alpha_{m},W}-\tilde{\mu}^{\alpha_{m},W}_{t}\|_{1}+\|\mu_{t}^{\alpha_{m}}-\tilde{\mu}_{t}^{\alpha_{m}}\|_{1}\Big)\\
\leq\;&(1+L_{P})A_{t}+\frac{L_{\Pi}+2L_{P}L_{W}+L_{W}}{M},
\end{align*}

where the second equality follows from (3.4) and (3.13), and the inequalities use Assumptions 3.3, 3.4 and 3.6.
By induction, we have

\[
A_{t+1}\leq\Big[(1+L_{P})^{t}-1\Big]\frac{L_{\Pi}+2L_{P}L_{W}+L_{W}}{M}.
\]
∎

Based on Lemma 4.5, we have the following Proposition.

Proposition 4.6

Suppose Assumptions 3.3, 3.4, 3.5, and 3.6 hold and $\gamma\cdot(L_{P}+1)<1$. Then for any $\mu\in\mathcal{P}(\mathcal{S})$,

\[
\sup_{\boldsymbol{\pi}\in\boldsymbol{\Pi}}\big|\tilde{J}^{M}(\mu,\boldsymbol{\pi})-J(\mu,\boldsymbol{\pi})\big|\to 0,\quad\text{as }M\to+\infty,\tag{4.10}
\]

where $\tilde{J}^{M}$ and $J$ are given in (4.7) and (2.3), respectively.

Proof of Proposition 4.6.

Recall from (3.11) that

\[
\widetilde{J}^{M}(\mu,\tilde{\boldsymbol{\pi}})=\sum_{t=0}^{\infty}\gamma^{t}\widetilde{R}(\tilde{\boldsymbol{\mu}}_{t},\tilde{\boldsymbol{\pi}}),
\]

subject to $\tilde{\mu}_{t+1}^{\alpha_{m}}=\widetilde{\boldsymbol{\Phi}}^{\alpha_{m}}(\tilde{\mu}_{t}^{\alpha_{m}},\tilde{\pi}^{\alpha_{m}})$, $t\in\mathbb{N}_{+}$, $\tilde{\mu}_{0}^{\alpha}\equiv\mu$, with $\tilde{\mu}^{\alpha_{m},W}_{t}$ given in (3.11).

Similarly, recall from (2.3) that

\[
J(\mu,\boldsymbol{\pi})=\sum_{t=0}^{\infty}\gamma^{t}R(\boldsymbol{\mu}_{t},\boldsymbol{\pi}),
\]

subject to $\mu_{t+1}^{\alpha}=\boldsymbol{\Phi}^{\alpha}(\mu_{t}^{\alpha},\pi^{\alpha})$, $t\in\mathbb{N}_{+}$, $\mu_{0}^{\alpha}\equiv\mu$, with $\mu^{\alpha,W}_{t}$ given in (2.9). Note that $\tilde{\boldsymbol{\pi}}:=(\tilde{\pi}^{\alpha_{m}})_{m\in[M]}\in\widetilde{\boldsymbol{\Pi}}_{M}$ can be viewed as a piecewise-constant projection of $\boldsymbol{\pi}\in\boldsymbol{\Pi}$ onto $\widetilde{\boldsymbol{\Pi}}_{M}$. Then,

\begin{align*}
&\sup_{\boldsymbol{\pi}\in\boldsymbol{\Pi}}\big|\tilde{J}^{M}(\mu,\boldsymbol{\pi})-J(\mu,\boldsymbol{\pi})\big|\\
\leq\;&\sup_{\boldsymbol{\pi}\in\boldsymbol{\Pi}}\sum_{t=0}^{\infty}\gamma^{t}\Big|\widetilde{R}(\tilde{\boldsymbol{\mu}}_{t},\tilde{\boldsymbol{\pi}})-R(\boldsymbol{\mu}_{t},\boldsymbol{\pi})\Big|\\
\leq\;&\sup_{\boldsymbol{\pi}\in\boldsymbol{\Pi}}\sum_{t=0}^{\infty}\gamma^{t}\Big|\widetilde{R}(\tilde{\boldsymbol{\mu}}_{t},\tilde{\boldsymbol{\pi}})-R(\boldsymbol{\mu}_{t},\tilde{\boldsymbol{\pi}})\Big|+\sup_{\boldsymbol{\pi}\in\boldsymbol{\Pi}}\sum_{t=0}^{\infty}\gamma^{t}\Big|R(\boldsymbol{\mu}_{t},\tilde{\boldsymbol{\pi}})-R(\boldsymbol{\mu}_{t},\boldsymbol{\pi})\Big|\\
=:\;&I+II.
\end{align*}

For the term $I$, we first estimate $\big|\widetilde{R}(\tilde{\boldsymbol{\mu}}_{t},\tilde{\boldsymbol{\pi}})-R(\boldsymbol{\mu}_{t},\tilde{\boldsymbol{\pi}})\big|$:

\begin{align*}
&\Big|\widetilde{R}(\tilde{\boldsymbol{\mu}}_{t},\tilde{\boldsymbol{\pi}})-R(\boldsymbol{\mu}_{t},\tilde{\boldsymbol{\pi}})\Big|\\
=\;&\bigg|\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}r(s,a,\tilde{\mu}^{\alpha_{m},W}_{t})\tilde{\mu}^{\alpha_{m}}_{t}(s)\tilde{\pi}^{\alpha_{m}}(a|s)\,d\alpha-\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}r(s,a,\mu^{\alpha,W}_{t})\mu^{\alpha}_{t}(s)\tilde{\pi}^{\alpha_{m}}(a|s)\,d\alpha\bigg|\\
\leq\;&\bigg|\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}r(s,a,\tilde{\mu}^{\alpha_{m},W}_{t})\tilde{\mu}^{\alpha_{m}}_{t}(s)\tilde{\pi}^{\alpha_{m}}(a|s)\,d\alpha-\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}r(s,a,\mu^{\alpha,W}_{t})\tilde{\mu}^{\alpha_{m}}_{t}(s)\tilde{\pi}^{\alpha_{m}}(a|s)\,d\alpha\bigg|\\
&+\bigg|\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}r(s,a,\mu^{\alpha,W}_{t})\tilde{\mu}^{\alpha_{m}}_{t}(s)\tilde{\pi}^{\alpha_{m}}(a|s)\,d\alpha-\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}r(s,a,\mu^{\alpha,W}_{t})\mu^{\alpha}_{t}(s)\tilde{\pi}^{\alpha_{m}}(a|s)\,d\alpha\bigg|\\
\leq\;&L_{r}\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha,W}_{t}-\tilde{\mu}^{\alpha_{m},W}_{t}\|_{1}\,d\alpha+M_{r}\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\|\mu^{\alpha}_{t}-\tilde{\mu}^{\alpha_{m}}_{t}\|_{1}\,d\alpha.
\end{align*}

By Lemma 4.5,

\[
I\leq\frac{C(\gamma,L_{\Pi},L_{P},L_{W},L_{r},M_{r})}{M}.
\]

For the term $II$,

\begin{align*}
\sup_{\boldsymbol{\pi}\in\boldsymbol{\Pi}}\sum_{t=0}^{\infty}\gamma^{t}\Big|R(\boldsymbol{\mu}_{t},\tilde{\boldsymbol{\pi}})-R(\boldsymbol{\mu}_{t},\boldsymbol{\pi})\Big|&\leq\sup_{\boldsymbol{\pi}\in\boldsymbol{\Pi}}\sum_{t=0}^{\infty}\gamma^{t}M_{r}\sum_{m=1}^{M}\int_{(\frac{m-1}{M},\frac{m}{M}]}\max_{s\in\mathcal{S}}\|\pi^{\alpha}-\pi^{\alpha_{m}}\|_{1}\,d\alpha\\
&\leq\frac{L_{\Pi}M_{r}}{1-\gamma}\cdot\frac{1}{M}.
\end{align*}
∎

Proof of Theorem 3.9.

Suppose that $\tilde{\boldsymbol{\pi}}^{*}\in\widetilde{\boldsymbol{\Pi}}_{M}\subset\boldsymbol{\Pi}$ and $(\pi^{1,*},\ldots,\pi^{N,*})\in\Pi^{N}$ are optimal policies of problems (4.7) and (2.2), respectively. From Proposition 4.6, for any $\varepsilon>0$, there exists a sufficiently large $M_{\varepsilon}>0$ such that

\[
|\tilde{J}^{M_{\varepsilon}}(\mu,\tilde{\boldsymbol{\pi}}^{*})-J(\mu,\tilde{\boldsymbol{\pi}}^{*})|\leq\frac{\varepsilon}{3}.
\]

Moreover, by (3.8), define $\boldsymbol{\pi}^{N,*}:=\sum_{i=1}^{N}\pi^{i,*}{\bf 1}_{\alpha\in(\frac{i-1}{N},\frac{i}{N}]}$.
From Theorem 3.7, for any $\varepsilon>0$, there exists $N_{\varepsilon}$ such that for all $N\geq N_{\varepsilon}$,

\[
|J_{N}(\mu,\tilde{\pi}^{1,*},\ldots,\tilde{\pi}^{N,*})-J(\mu,\tilde{\boldsymbol{\pi}}^{*})|\leq\frac{\varepsilon}{3},\qquad|J_{N}(\mu,\pi^{1,*},\ldots,\pi^{N,*})-J(\mu,\boldsymbol{\pi}^{N,*})|\leq\frac{\varepsilon}{3}.
\]

Then we have

\begin{align*}
&J_{N}(\mu,\tilde{\pi}^{1,*},\ldots,\tilde{\pi}^{N,*})-J_{N}(\mu,\pi^{1,*},\ldots,\pi^{N,*})\\
\geq\;&\underbrace{J_{N}(\mu,\tilde{\pi}^{1,*},\ldots,\tilde{\pi}^{N,*})-J(\mu,\tilde{\boldsymbol{\pi}}^{*})}_{I_{1}}+\underbrace{J(\mu,\tilde{\boldsymbol{\pi}}^{*})-\tilde{J}^{M_{\varepsilon}}(\mu,\tilde{\boldsymbol{\pi}}^{*})}_{I_{2}}\\
&+\underbrace{\tilde{J}^{M_{\varepsilon}}(\mu,\tilde{\boldsymbol{\pi}}^{*})-\tilde{J}^{M_{\varepsilon}}(\mu,\boldsymbol{\pi}^{N,*})}_{I_{3}}+\underbrace{\tilde{J}^{M_{\varepsilon}}(\mu,\boldsymbol{\pi}^{N,*})-J_{N}(\mu,\pi^{1,*},\ldots,\pi^{N,*})}_{I_{4}}\\
\geq\;&-\frac{\varepsilon}{3}-\frac{\varepsilon}{3}-\frac{\varepsilon}{3}=-\varepsilon,
\end{align*}

where $I_{3}\geq 0$ follows from the optimality of $\tilde{\boldsymbol{\pi}}^{*}$ for $\tilde{V}^{M_{\varepsilon}}$. This means that the optimal policy of block GMFC provides an $\varepsilon$-optimal policy for the multi-agent system, with $(\tilde{\pi}^{1,*},\ldots,\tilde{\pi}^{N,*}):=\Gamma_{N}(\tilde{\boldsymbol{\pi}}^{*})$. ∎

5 Experiments

In this section, we provide an empirical verification of our theoretical results, with two examples adapted from existing works on learning MFGs [16, 10] and learning GMFGs [15].

5.1 SIS Graphon Model

We consider an SIS graphon model from [16] under a cooperative setting. In this model, each agent $\alpha\in\mathcal{I}$ shares the state space $\mathcal{S}=\{S,I\}$ and the action space $\mathcal{A}=\{C,NC\}$, where $S$ stands for susceptible, $I$ for infected, $C$ for keeping contact with others, and $NC$ for keeping social distance. The transition probabilities of each agent $\alpha$ are given as follows:

\begin{align*}
P(s_{t+1}=I\,|\,s_{t}=S,a_{t}=C,\mu_{t}^{\alpha,W})&=\beta_{1}\,\mu_{t}^{\alpha,W}(I),\\
P(s_{t+1}=I\,|\,s_{t}=S,a_{t}=NC,\mu_{t}^{\alpha,W})&=\beta_{2}\,\mu_{t}^{\alpha,W}(I),\\
P(s_{t+1}=S\,|\,s_{t}=I,\mu_{t}^{\alpha,W})&=\delta,
\end{align*}

where $\beta_{1}$ is the infection rate when keeping contact with others, $\beta_{2}$ is the infection rate under social distancing, and $\delta$ is the fixed recovery rate. We assume $0<\beta_{2}<\beta_{1}$, meaning that keeping social distance reduces the risk of being infected. The individual reward function is defined as

\[
r(s,\mu_{t}^{\alpha,W},a)=-c_{1}{\bf 1}_{\{I\}}(s)-c_{2}{\bf 1}_{\{NC\}}(a)-c_{3}{\bf 1}_{\{I\}}(s){\bf 1}_{\{C\}}(a),
\]

where $c_{1}$ is the cost of being infected (e.g., the cost of medical treatment), $c_{2}$ is the cost of keeping social distance, and $c_{3}$ is the penalty of going out while infected.

In our experiment, we set $\beta_{1}=0.8$, $\beta_{2}=0$, $\delta=0.3$ for the transition dynamics and $c_{1}=2$, $c_{2}=0.3$, $c_{3}=0.5$ for the reward function. The initial mean field $\mu_{0}$ is taken as the uniform distribution, and the episode length is set to 50.
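To make the dynamics concrete, below is a minimal NumPy sketch of one mean-field transition and the aggregated reward of the SIS model, with the graphon index discretized into $M$ blocks as in block GMFC. The function and variable names (e.g., `step`, `W_blocks`) are illustrative and not taken from our implementation.

```python
import numpy as np

# Sketch of the SIS graphon mean-field dynamics (Section 5.1), with the
# graphon index discretized into M blocks as in block GMFC.
beta1, beta2, delta = 0.8, 0.0, 0.3   # infection rates (contact / distancing), recovery rate
c1, c2, c3 = 2.0, 0.3, 0.5            # costs: infection, distancing, going out while infected
S, I = 0, 1                           # state indices
C, NC = 0, 1                          # action indices

def step(mu, pi, W_blocks):
    """One mean-field transition and the block-averaged reward.

    mu       : (M, 2) array, mu[m] = state distribution of block m over {S, I}
    pi       : (M, 2, 2) array, pi[m, s, a] = probability of action a in state s
    W_blocks : (M, M) array, W_blocks[m, m'] = W(alpha_m, alpha_m')
    """
    M = mu.shape[0]
    # neighborhood mean field: mu^{alpha_m, W}(I) = (1/M) sum_{m'} W(alpha_m, alpha_m') mu^{m'}(I)
    mu_W_I = W_blocks @ mu[:, I] / M

    # infection probability of a susceptible agent, averaged over its policy
    p_infect = pi[:, S, C] * beta1 * mu_W_I + pi[:, S, NC] * beta2 * mu_W_I

    mu_next = np.empty_like(mu)
    mu_next[:, I] = mu[:, S] * p_infect + mu[:, I] * (1.0 - delta)
    mu_next[:, S] = 1.0 - mu_next[:, I]

    # per-block expected reward, averaged over blocks
    # (in this model the reward does not depend on mu^W)
    r_block = (
        -c1 * mu[:, I]                                                # cost of being infected
        - c2 * (mu[:, S] * pi[:, S, NC] + mu[:, I] * pi[:, I, NC])    # cost of distancing
        - c3 * mu[:, I] * pi[:, I, C]                                 # penalty for going out infected
    )
    return mu_next, r_block.mean()
```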

5.2 Malware Spread Graphon Model

We consider a malware spread model from [10] under a cooperative setting. In this model, let $\mathcal{S}=\{0,1,\ldots,K-1\}$, $K\in\mathbb{N}$, denote the health levels of an agent, where $s_{t}=0$ and $s_{t}=K-1$ represent the best and the worst level, respectively. All agents can take two actions: $a_{t}=0$ means doing nothing, and $a_{t}=1$ means repairing. The state transition is given by

\[
s_{t+1}=\begin{cases}s_{t}+\lfloor(K-s_{t})\chi_{t}\rfloor,&\text{if }a_{t}=0,\\ 0,&\text{if }a_{t}=1,\end{cases}
\]

where $\chi_{t}$, $t\in\mathbb{N}$, are i.i.d. random variables with a given probability distribution. After taking action $a_{t}$, agent $\alpha$ receives an individual reward

\[
r(s_{t},\mu_{t}^{\alpha,W},a_{t})=-(c_{1}+\langle\mu_{t}^{\alpha,W}\rangle)\,s_{t}/K-c_{2}a_{t}.
\]

Here, to capture the heterogeneity of agents, $W(\alpha,\beta)$ denotes the importance of agent $\beta$ to agent $\alpha$, $\langle\mu_{t}^{\alpha,W}\rangle:=\int_{\beta\in\mathcal{I}}\sum_{s\in\mathcal{S}}sW(\alpha,\beta)\mu_{t}^{\beta}(s)\,d\beta$ measures the risk of being infected by other agents, and $c_{2}$ is the cost of taking action $a_{t}$.

In our experiment, we set $K=3$, $c_{1}=0.3$, and $c_{2}=0.5$. In addition, to stabilize the training of the RL agent, we fix $\chi_{t}$ to a constant value of 0.7. In this model, we set the episode length to 10.
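The following is a minimal agent-level sketch of the malware spread dynamics and rewards with the fixed $\chi_{t}=0.7$ used in our experiments; the weighted aggregate $\langle\mu_{t}^{\alpha,W}\rangle$ is approximated by its empirical counterpart over $N$ agents. All names are illustrative rather than taken from our implementation.

```python
import numpy as np

# Sketch of the malware spread model (Section 5.2) at the agent level,
# with chi_t fixed to 0.7 as in our experiments.
K, c1, c2, chi = 3, 0.3, 0.5, 0.7

def step_agents(states, actions, W, alphas):
    """One synchronous step for N agents.

    states  : (N,) int array in {0, ..., K-1}, health levels (0 = best)
    actions : (N,) int array in {0, 1}, 0 = do nothing, 1 = repair
    W       : graphon, a function on [0, 1]^2
    alphas  : (N,) array of graphon indices, e.g. alphas[i] = (i + 1) / N
    """
    N = states.shape[0]
    # empirical weighted aggregate <mu^{alpha_i, W}> = (1/N) sum_j W(alpha_i, alpha_j) * s_j
    W_mat = np.array([[W(a, b) for b in alphas] for a in alphas])
    agg = W_mat @ states / N

    rewards = -(c1 + agg) * states / K - c2 * actions

    next_states = np.where(
        actions == 1,
        0,                                                   # repairing resets to the best level
        states + np.floor((K - states) * chi).astype(int),   # otherwise health degrades
    )
    return next_states, rewards
```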

5.3 Performance of N-agent GMFC on Multi-Agent System

For both models, we use PPO [47] to train the block GMFC agent in the infinite-agent environment to obtain the policy ensembles, and then use Algorithm 1 to deploy them in the finite-agent environment; a sketch of this deployment step is given below. We test the performance of N-agent GMFC with 10 blocks on systems with different numbers of agents, ranging from 10 to 100. For each case, we run 1000 simulations and report the mean and standard deviation (green shadows in Figures 1 and 2) of the mean episode reward. In both scenarios and for different types of graphons, the mean episode reward of N-agent GMFC becomes increasingly close to that of block GMFC as the number of agents grows (see Figures 1 and 2), which empirically verifies our theoretical findings.
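A minimal sketch of the deployment step, in the spirit of Algorithm 1: each agent $i$ with graphon index $\alpha_{i}=i/N$ acts according to the policy of the block containing $\alpha_{i}$. The interface of `policy_ensemble` is an assumption made for illustration, not the actual API of our implementation.

```python
import numpy as np

def deploy(policy_ensemble, states, M):
    """Sample one action per agent from the learned block policy ensemble.

    policy_ensemble : length-M list; policy_ensemble[m](s) returns a probability
                      vector over actions for block m (hypothetical interface)
    states          : (N,) array of agent states
    M               : number of blocks
    """
    N = states.shape[0]
    actions = np.empty(N, dtype=int)
    for i in range(N):
        alpha_i = (i + 1) / N
        m = min(int(np.ceil(alpha_i * M)) - 1, M - 1)  # block ((m-1)/M, m/M] containing alpha_i
        probs = policy_ensemble[m](states[i])
        actions[i] = np.random.choice(len(probs), p=probs)
    return actions
```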

Figure 1: Experiments for different graphons in the SIS finite-agent environment.

Figure 2: Experiments for different graphons in the Malware Spread finite-agent environment.

5.4 Comparison with Other Algorithms

For different types of graphons, we compare N-agent GMFC with three existing MARL algorithms: two independent learning algorithms, independent DQN [40] and independent PPO [47], and a powerful centralized-training-and-decentralized-execution (CTDE) algorithm, QMIX [46]. We test these algorithms with different numbers of blocks, namely 2, 5, and 10, on multi-agent systems with 40 agents. The results are reported in Table 1 and Table 2.

In the SIS graphon model, N-agent GMFC dominates in most cases and outperforms the independent learning algorithms by a large margin; only QMIX reaches comparable results. In the malware spread graphon model, N-agent GMFC outperforms the other algorithms in more than half of the cases, and only independent DQN achieves comparable performance. In both environments, the performance gap between N-agent GMFC and the other MARL algorithms shrinks as the number of blocks grows. This is mainly because the action space of block GMFC grows faster than that of the MARL algorithms as the block number increases, and RL agents are hard to train when the action space is too large.

Beyond the results shown in Tables 1 and 2, as the number of agents $N$ grows, classic MARL methods become infeasible due to the curse of dimensionality and memory constraints, whereas N-agent GMFC is trained only once, independently of $N$, and is therefore easier to scale up to large-scale regimes while enjoying more stable performance. Indeed, N-agent GMFC shows more stable results as $N$ increases, as shown in Figures 1 and 2.

Table 1: Mean Episode Reward for SIS with 40 agents

Graphon Type       M    N-agent GMFC   I-DQN    I-PPO    QMIX
Erdős Rényi        2    -15.37         -17.58   -20.63   -20.51
Erdős Rényi        5    -15.74         -16.17   -20.42   -16.94
Erdős Rényi        10   -15.67         -17.55   -21.38   -14.45
Stochastic Block   2    -13.58         -16.05   -18.38   -17.69
Stochastic Block   5    -13.67         -15.91   -20.13   -13.79
Stochastic Block   10   -13.57         -15.52   -14.87   -13.86
Random Geometric   2    -12.45         -17.93   -14.82   -14.52
Random Geometric   5    -9.82          -12.81   -12.99   -10.84
Random Geometric   10   -10.52         -11.68   -12.66   -12.60
Table 2: Mean Episode Reward for Malware Spread with 40 agents

Graphon Type       M    N-agent GMFC   I-DQN    I-PPO    QMIX
Erdős Rényi        2    -5.21          -5.11    -5.31    -6.05
Erdős Rényi        5    -5.21          -5.30    -5.26    -6.13
Erdős Rényi        10   -5.21          -5.14    -5.27    -5.21
Stochastic Block   2    -5.16          -5.21    -5.37    -5.88
Stochastic Block   5    -5.10          -5.19    -5.31    -5.70
Stochastic Block   10   -5.09          -5.05    -5.28    -5.27
Random Geometric   2    -5.02          -5.21    -5.27    -5.35
Random Geometric   5    -4.85          -5.03    -5.04    -5.05
Random Geometric   10   -4.82          -4.83    -5.14    -4.83

5.5 Implementation Details

We use three graphons in our experiments: (1) Erdős Rényi: $W(\alpha,\beta)=0.8$; (2) Stochastic block model: $W(\alpha,\beta)=0.9$ if $0\leqslant\alpha,\beta\leqslant 0.5$ or $0.5\leqslant\alpha,\beta\leqslant 1$, and $W(\alpha,\beta)=0.4$ otherwise; (3) Random geometric graphon: $W(\alpha,\beta)=f(\min(|\beta-\alpha|,1-|\beta-\alpha|))$, where $f(x)=\mathrm{e}^{-\frac{x}{0.5-x}}$.
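For completeness, the three graphons can be written as plain Python functions on $[0,1]^2$; this is a direct transcription of the formulas above (with a guard at $x=0.5$, where the exponent of $f$ diverges).

```python
import numpy as np

def erdos_renyi(alpha, beta):
    # constant graphon
    return 0.8

def stochastic_block(alpha, beta):
    # 0.9 within each of the two blocks [0, 0.5] and [0.5, 1], 0.4 across blocks
    same_block = (alpha <= 0.5 and beta <= 0.5) or (alpha >= 0.5 and beta >= 0.5)
    return 0.9 if same_block else 0.4

def random_geometric(alpha, beta):
    # f(x) = exp(-x / (0.5 - x)) applied to the circular distance between alpha and beta
    x = min(abs(beta - alpha), 1.0 - abs(beta - alpha))
    return np.exp(-x / (0.5 - x)) if x < 0.5 else 0.0
```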

For the RL algorithms, we use the implementation of RLlib [36] (version 1.11.0, Apache-2.0 license). For the PPO agent used to learn an optimal policy ensemble in block GMFC, we use a 64-dimensional linear layer to encode the observation and 2-layer MLPs with 256 hidden units per layer for both the value network and the actor network. For independent DQN and independent PPO, we use the default weight-sharing model with 64-dimensional embedding layers. We train the GMFC PPO agent for 1000 iterations and the other three MARL agents for 200 iterations. The specific hyper-parameters are listed in Table 3.

Table 3: RL Algorithm Settings

Hyper-parameter                   GMFC PPO   I-DQN    I-PPO     QMIX
Learning rate                     0.0005     0.0005   0.0001    0.00005
Learning rate decay               True       True     True      False
Discount factor                   0.95       0.95     0.95      0.95
Batch size                        128        128      128       128
KL coefficient                    0.2        -        0.2       -
KL target                         0.01       -        0.01      -
Buffer size                       -          2000     -         2000
Target network update frequency   -          2000     -         1000
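For reference, the PPO settings in Table 3 can be expressed as an RLlib 1.x-style config dict, as sketched below. The environment name `BlockGMFCEnv` is a placeholder (it would need to be registered via `ray.tune.register_env`), and keys such as `rollout_fragment_length`, `num_workers`, and the learning-rate schedule endpoint are illustrative additions; this is a sketch, not a verbatim copy of our training script.

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

# Sketch of the block GMFC PPO settings from Table 3 in RLlib 1.x style.
config = {
    "env": "BlockGMFCEnv",                   # placeholder; must be registered beforehand
    "lr": 5e-4,                              # learning rate 0.0005
    "lr_schedule": [[0, 5e-4], [10_000_000, 0.0]],  # illustrative linear decay
    "gamma": 0.95,                           # discount factor
    "train_batch_size": 128,                 # batch size
    "sgd_minibatch_size": 128,
    "rollout_fragment_length": 128,          # illustrative, kept consistent with the batch size
    "num_workers": 0,                        # sample on the driver for this sketch
    "kl_coeff": 0.2,                         # KL coefficient
    "kl_target": 0.01,                       # KL target
    "model": {"fcnet_hiddens": [256, 256]},  # 2-layer MLP, 256 units per layer
}

if __name__ == "__main__":
    ray.init()
    trainer = PPOTrainer(config=config)
    for _ in range(1000):                    # 1000 training iterations for GMFC PPO
        trainer.train()
```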

6 Conclusion

In this work, we have proposed a discrete-time GMFC framework for MARL with nonuniform interactions on dense graphs. Theoretically, we have shown that under suitable assumptions, GMFC approximates MARL well, with an approximation error of order $\mathcal{O}(\frac{1}{\sqrt{N}})$. To reduce the dimension of GMFC, we have introduced block GMFC by discretizing the graphon index and shown that it also approximates MARL well. Empirical studies on several examples have verified the plausibility of the GMFC framework. For future research, we are interested in establishing theoretical guarantees for the PPO-based algorithm for block GMFC, learning the graph structure of MARL, and extending our framework to MARL with nonuniform interactions on sparse graphs.

References

  • Adler and Blue [2002] Adler, J.L., Blue, V.J., 2002. A cooperative multi-agent transportation management and route guidance system. Transportation Research Part C: Emerging Technologies 10, 433–454.
  • Angiuli et al. [2022] Angiuli, A., Fouque, J.P., Laurière, M., 2022. Unified reinforcement Q-learning for mean field game and control problems. Mathematics of Control, Signals, and Systems , 1–55.
  • Bayraktar et al. [2020] Bayraktar, E., Chakraborty, S., Wu, R., 2020. Graphon mean field systems. arXiv preprint arXiv:2003.13180 .
  • Bet et al. [2020] Bet, G., Coppini, F., Nardi, F.R., 2020. Weakly interacting oscillators on dense random graphs. arXiv preprint arXiv:2006.07670 .
  • Caines and Huang [2019] Caines, P.E., Huang, M., 2019. Graphon mean field games and the GMFG equations: $\varepsilon$-Nash equilibria, in: 2019 IEEE 58th conference on decision and control (CDC), IEEE. pp. 286–292.
  • Cardaliaguet and Lehalle [2018] Cardaliaguet, P., Lehalle, C.A., 2018. Mean field game of controls and an application to trade crowding. Mathematics and Financial Economics 12, 335–363.
  • Carmona et al. [2022] Carmona, R., Cooney, D.B., Graves, C.V., Lauriere, M., 2022. Stochastic graphon games: I. the static case. Mathematics of Operations Research 47, 750–778.
  • Carmona and Delarue [2013] Carmona, R., Delarue, F., 2013. Probabilistic analysis of mean-field games. SIAM Journal on Control and Optimization 51, 2705–2734.
  • Carmona et al. [2019] Carmona, R., Laurière, M., Tan, Z., 2019. Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. arXiv preprint arXiv:1910.12802 .
  • Chen et al. [2021] Chen, Y., Liu, J., Khoussainov, B., 2021. Agent-level maximum entropy inverse reinforcement learning for mean field games. arXiv preprint arXiv:2104.14654 .
  • Choi et al. [2009] Choi, J., Oh, S., Horowitz, R., 2009. Distributed learning and cooperative control for multi-agent systems. Automatica 45, 2802–2814.
  • Cortes et al. [2004] Cortes, J., Martinez, S., Karatas, T., Bullo, F., 2004. Coverage control for mobile sensing networks. IEEE Transactions on Robotics and Automation 20, 243–255.
  • Cui et al. [2019] Cui, J., Liu, Y., Nallanathan, A., 2019. Multi-agent reinforcement learning-based resource allocation for UAV networks. IEEE Transactions on Wireless Communications 19, 729–743.
  • Cui and Koeppl [2021] Cui, K., Koeppl, H., 2021. Approximately solving mean field games via entropy-regularized deep reinforcement learning, in: International Conference on Artificial Intelligence and Statistics, PMLR. pp. 1909–1917.
  • Cui and Koeppl [2022] Cui, K., Koeppl, H., 2022. Learning graphon mean field games and approximate Nash equilibria, in: International Conference on Learning Representations.
  • Cui et al. [2021] Cui, K., Tahir, A., Sinzger, M., Koeppl, H., 2021. Discrete-time mean field control with environment states, in: 2021 60th IEEE Conference on Decision and Control (CDC), IEEE. pp. 5239–5246.
  • Delarue [2017] Delarue, F., 2017. Mean field games: A toy model on an Erdös-Renyi graph. ESAIM: Proceedings and Surveys 60, 1–26.
  • Elie et al. [2020] Elie, R., Perolat, J., Laurière, M., Geist, M., Pietquin, O., 2020. On the convergence of model free learning in mean field games, in: Proceedings of the AAAI Conference on Artificial Intelligence, pp. 7143–7150.
  • Esunge and Wu [2014] Esunge, J.N., Wu, J., 2014. Convergence of weighted empirical measures. Stochastic Analysis and Applications 32, 802–819.
  • Flyvbjerg et al. [1993] Flyvbjerg, H., Sneppen, K., Bak, P., 1993. Mean field theory for a simple model of evolution. Physical review letters 71, 4087.
  • Gao and Caines [2019a] Gao, S., Caines, P.E., 2019a. Graphon control of large-scale networks of linear systems. IEEE Transactions on Automatic Control 65, 4090–4105.
  • Gao and Caines [2019b] Gao, S., Caines, P.E., 2019b. Spectral representations of graphons in very large network systems control, in: 2019 IEEE 58th conference on decision and Control (CDC), IEEE. pp. 5068–5075.
  • Gu et al. [2019] Gu, H., Guo, X., Wei, X., Xu, R., 2019. Dynamic programming principles for mean-field controls with learning. arXiv preprint arXiv:1911.07314 .
  • Gu et al. [2021] Gu, H., Guo, X., Wei, X., Xu, R., 2021. Mean-field controls with Q-learning for cooperative MARL: convergence and complexity analysis. SIAM Journal on Mathematics of Data Science 3, 1168–1196.
  • Guo et al. [2019] Guo, X., Hu, A., Xu, R., Zhang, J., 2019. Learning mean-field games. Advances in Neural Information Processing Systems 32.
  • Hadikhanloo and Silva [2019] Hadikhanloo, S., Silva, F.J., 2019. Finite mean field games: fictitious play and convergence to a first order continuous mean field game. Journal de Mathématiques Pures et Appliquées 132, 369–397.
  • Huang et al. [2006] Huang, M., Malhamé, R.P., Caines, P.E., 2006. Large population stochastic dynamic games: closed-loop McKean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems 6, 221–252.
  • Hughes et al. [2018] Hughes, E., Leibo, J.Z., Phillips, M., Tuyls, K., Dueñez-Guzman, E., García Castañeda, A., Dunning, I., Zhu, T., McKee, K., Koster, R., et al., 2018. Inequity aversion improves cooperation in intertemporal social dilemmas. Advances in neural information processing systems 31.
  • Lacker and Soret [2022] Lacker, D., Soret, A., 2022. A label-state formulation of stochastic graphon games and approximate equilibria on large networks. arXiv preprint arXiv:2204.09642 .
  • Lasry and Lions [2007] Lasry, J.M., Lions, P.L., 2007. Mean field games. Japanese Journal of Mathematics 2, 229–260.
  • Lee et al. [2007] Lee, J.W., Park, J., Jangmin, O., Lee, J., Hong, E., 2007. A multiagent approach to Q-learning for daily stock trading. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 37, 864–877.
  • Lee and Zhang [2002] Lee, J.W., Zhang, B.T., 2002. Stock trading system using reinforcement learning with cooperative agents, in: Proceedings of the Nineteenth International Conference on Machine Learning, pp. 451–458.
  • Leibo et al. [2017] Leibo, J.Z., Zambaldi, V., Lanctot, M., Marecki, J., Graepel, T., 2017. Multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:1702.03037 .
  • Lerer and Peysakhovich [2017] Lerer, A., Peysakhovich, A., 2017. Maintaining cooperation in complex social dilemmas using deep reinforcement learning. arXiv preprint arXiv:1707.01068 .
  • Li et al. [2021] Li, Y., Wang, L., Yang, J., Wang, E., Wang, Z., Zhao, T., Zha, H., 2021. Permutation invariant policy optimization for mean-field multi-agent reinforcement learning: A principled approach. arXiv preprint arXiv:2105.08268 .
  • Liang et al. [2018] Liang, E., Liaw, R., Nishihara, R., Moritz, P., Fox, R., Goldberg, K., Gonzalez, J.E., Jordan, M.I., Stoica, I., 2018. RLlib: Abstractions for distributed reinforcement learning, in: International Conference on Machine Learning (ICML).
  • Lillicrap et al. [2016] Lillicrap, T.P., Hunt, J.J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D., 2016. Continuous control with deep reinforcement learning, in: International Conference on Learning Representations.
  • Lin et al. [2021] Lin, Y., Qu, G., Huang, L., Wierman, A., 2021. Multi-agent reinforcement learning in stochastic networked systems. Advances in Neural Information Processing Systems 34, 7825–7837.
  • Lovász [2012] Lovász, L., 2012. Large networks and graph limits. volume 60. American Mathematical Soc.
  • Mnih et al. [2013] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., Riedmiller, M., 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 .
  • Mondal et al. [2022a] Mondal, W.U., Agarwal, M., Aggarwal, V., Ukkusuri, S.V., 2022a. On the approximation of cooperative heterogeneous multi-agent reinforcement learning (MARL) using mean field control (MFC). Journal of Machine Learning Research 23, 1–46.
  • Mondal et al. [2022b] Mondal, W.U., Aggarwal, V., Ukkusuri, S.V., 2022b. Can mean field control (MFC) approximate cooperative multi agent reinforcement learning (MARL) with non-uniform interaction? arXiv preprint arXiv:2203.00035 .
  • Motte and Pham [2022] Motte, M., Pham, H., 2022. Mean-field Markov decision processes with common noise and open-loop controls. The Annals of Applied Probability 32, 1421–1458.
  • Pasztor et al. [2021] Pasztor, B., Bogunovic, I., Krause, A., 2021. Efficient model-based multi-agent mean-field reinforcement learning. arXiv preprint arXiv:2107.04050 .
  • Qu et al. [2020] Qu, G., Wierman, A., Li, N., 2020. Scalable reinforcement learning of localized policies for multi-agent networked systems, in: Learning for Dynamics and Control, PMLR. pp. 256–266.
  • Rashid et al. [2018] Rashid, T., Samvelyan, M., Schroeder, C., Farquhar, G., Foerster, J., Whiteson, S., 2018. QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning, in: International Conference on Machine Learning, PMLR. pp. 4295–4304.
  • Schulman et al. [2017] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O., 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 .
  • Subramanian and Mahajan [2019] Subramanian, J., Mahajan, A., 2019. Reinforcement learning in stationary mean-field games, in: Proceedings of the 18th International Conference on Autonomous Agents and MultiAgent Systems, pp. 251–259.
  • Vasal et al. [2021] Vasal, D., Mishra, R., Vishwanath, S., 2021. Sequential decomposition of graphon mean field games, in: 2021 American Control Conference (ACC), IEEE. pp. 730–736.
  • W Axhausen et al. [2016] W Axhausen, K., Horni, A., Nagel, K., 2016. The multi-agent transport simulation MATSim. Ubiquity Press.
  • Wainwright [2019] Wainwright, M.J., 2019. High-dimensional statistics: A non-asymptotic viewpoint. volume 48. Cambridge University Press.
  • Yang et al. [2018] Yang, Y., Luo, R., Li, M., Zhou, M., Zhang, W., Wang, J., 2018. Mean field multi-agent reinforcement learning, in: International Conference on Machine Learning, PMLR. pp. 5571–5580.
  • Yin et al. [2010] Yin, H., Mehta, P.G., Meyn, S.P., Shanbhag, U.V., 2010. Learning in mean-field oscillator games, in: 49th IEEE Conference on Decision and Control (CDC), IEEE. pp. 3125–3132.