
Robust Cooperative Multi-Agent Reinforcement Learning:
A Mean-Field Type Game Perspective

Muhammad Aneeq uz Zaman ([email protected])
Coordinated Science Laboratory, University of Illinois at Urbana-Champaign

Mathieu Laurière ([email protected])
Shanghai Frontiers Science Center of Artificial Intelligence and Deep Learning; NYU-ECNU Institute of Mathematical Sciences at NYU Shanghai

Alec Koppel ([email protected])
Artificial Intelligence Research, JP Morgan Chase & Co

Tamer Başar ([email protected])
Coordinated Science Laboratory, University of Illinois at Urbana-Champaign
Abstract

In this paper, we study the problem of robust cooperative multi-agent reinforcement learning (RL), where a large number of cooperative agents with distributed information aim to learn policies in the presence of stochastic and non-stochastic uncertainties whose distributions are respectively known and unknown. Focusing on policy optimization that accounts for both types of uncertainties, we formulate the problem in a worst-case (minimax) framework, which is intractable in general. Thus, we focus on the Linear Quadratic setting to derive benchmark solutions. First, since no standard theory exists for this problem due to the distributed information structure, we utilize the Mean-Field Type Game (MFTG) paradigm to establish guarantees on the solution quality in the sense of the achieved Nash equilibrium of the MFTG. This in turn allows us to compare the performance against the corresponding original robust multi-agent control problem. Then, we propose a Receding-horizon Gradient Descent Ascent RL algorithm to find the MFTG Nash equilibrium and we prove a non-asymptotic rate of convergence. Finally, we provide numerical experiments to demonstrate the efficacy of our approach relative to a baseline algorithm.

1 Introduction

Reinforcement Learning (RL) has had many successes, such as autonomous driving (Sallab et al., 2017), robotics (Kober et al., 2013), and RL from human feedback (RLHF) (Ziegler et al., 2019), to name a few. These successes have largely been confined to single-agent settings, but many applications, e.g., financial markets, communication networks, and distributed robotics, involve multiple agents. Prevailing algorithms for Multi-Agent Reinforcement Learning (MARL) (Zhang et al., 2021b; Li et al., 2021), however, do not model the distinct effects of modeled and un-modeled uncertainties on the transition dynamics, which can result in practical instability in safety-critical applications (Riley et al., 2021).

In this paper we consider a large-population multi-agent setting with stochastic and non-stochastic (un-modeled, possibly adversarial) uncertainties. These types of formulations have been studied under the guise of robust control in the single-agent case (Başar and Bernhard, 2008). The uncertainties (modeled and un-modeled) affect the performance of the system and might even lead to instability. Robust control seeks the robust controller which guarantees a certain level of performance for the system under a worst-case hypothesis on these uncertainties. We employ here the popular Linear-Quadratic (LQ) setting in order to rigorously characterize and synthesize the solution to the robust multi-agent problem in a data-driven manner. The LQ setting entails a class of models in which the dynamics are linear and the costs are quadratic in the state and the action of the agent. This setting has been used extensively in the literature due to its tractability: the optimal decisions can be computed analytically or almost analytically, up to solving Riccati equations, when one has access to all system matrices. Instances of applications include permanent income theory (Sargent and Ljungqvist, 2000), portfolio management (Cardaliaguet and Lehalle, 2018), and wireless power control (Huang et al., 2003), among many others. In the absence of knowledge of system parameters, model-free RL methods have also been developed (Fazel et al., 2018; Malik et al., 2019) for single-agent LQ settings. We refer to (Recht, 2019) for an overview. When one goes from single to multiple agents, communicating local state and control information among agents raises scalability problems; in particular, practical algorithms require sharing state information that can scale exponentially with the number of agents. Instead, here we consider a distributed information structure where each agent has access only to its own state and the average of the states of the other agents. This distributed information structure makes the characterization of the solution very difficult, in that previous gradient dominance results from (Fazel et al., 2018) no longer hold. To overcome this difficulty, we utilize the mean-field game and control paradigm, first introduced in the purely non-cooperative setting in (Lasry and Lions, 2006; Huang et al., 2006), which replaces individual agents by a distribution over agent types and thereby enables characterization and computation of the solution. The approach has since been extended to the cooperative setting through the notion of mean-field control (Bensoussan et al., 2013; Carmona and Delarue, 2018). Building on this paradigm, this work is the first to develop scalable MARL algorithms that can handle model mis-specification or adversarial inputs, in the sense of robust control, in the regime of a very large or possibly infinite number of agents defined by the mean field.

We start Section 2 by formulating a robust multi-agent control problem with stochastic and non-stochastic (un-modeled) noises. The agents have distributed information: each agent has access only to its own state and the average behavior of all the agents. Solving this problem entails finding a noise attenuation level (noise-to-output gain) for the multi-agent system and the corresponding robust controller. As in the single-agent setting (Başar and Bernhard, 2008), the robust multi-agent control problem is reformulated into an equivalent zero-sum min-max game between the maximizing non-stochastic noise (which may be interpreted as an adversary) and the minimizing controller. Solving this problem is not possible in the finite-agent case due to the limited information available to each agent. Thus, in Section 3 we consider the mean-field (infinite-population) version of the problem, which we call the Robust Mean-Field Control (RMFC) problem. As in the finite-population setting, RMFC has an equivalent zero-sum min-max formulation, referred to as the 2-player Zero-Sum Mean-Field Type Game (ZS-MFTG) in (Carmona et al., 2020, 2021), where the controller is the minimizing player and the non-stochastic disturbance is the maximizing one.

In Section 4 we propose a bi-level RL algorithm to compute the Nash equilibrium of the ZS-MFTG (which equivalently yields the robust controller for the robust multi-agent problem) in the form of Receding-horizon Gradient Descent Ascent (RGDA) (Algorithm 1). The upper level of RGDA uses a receding-horizon approach, i.e., it finds the controller parameters starting from the last timestep $T-1$ and moving backwards-in-time (à la dynamic programming). The receding-horizon policy gradient approach was used in Kalman filtering (Zhang et al., 2023) and LQR problems (Zhang and Başar, 2023). The present work extends this approach to multi-agent problems, which helps in simplifying the complex nature of the cost landscape (known to be non-coercive (Zhang et al., 2021a)) and renders it convex-concave. The lower level employs gradient descent ascent to find the saddle point (Nash equilibrium) for each timestep $t$. The convex-concave nature of the cost (due to the receding-horizon approach) proves to be a key component in proving linear convergence of the gradient descent ascent to the saddle point (Theorem 4.1). Further analysis shows that the total accumulated error in RGDA is small, provided each lower-level run of RGDA converges well (Theorem 4.2). The gradient descent-ascent step requires computation of the stochastic gradient. We use a zero-order method (Fazel et al., 2018; Malik et al., 2019) which only requires access to the cost to compute stochastic gradients, and hence is truly model-free.

Literature Review: Robust control gained importance in the 1970s when control theorists realized the shortcomings of optimal control theory in dealing with model uncertainties (Athans et al., 1977; Harvey and Stein, 1978). The work of (Başar, 1989) was the first to formulate the robust control problem as a zero-sum dynamic game between the controller and the uncertainty. Robust RL, first introduced by (Morimoto and Doya, 2005), has recently seen increased interest in the single-agent setting, where its ability to process trajectory data without explicit knowledge of system parameters can be used to learn robust controllers that address worst-case uncertainty (Zhang et al., 2020a; Kos and Song, 2017; Zhang et al., 2021c). Some recent works consider RL in scenarios with reward uncertainties (Zhang et al., 2020b), state uncertainty (He et al., 2023), or uncertainty in other agents' policies (Sun et al., 2022). There have been some works at the intersection of RL for robust and multi-agent control (Li et al., 2019; He et al., 2023), yet there has not been any significant effort to provide (1) sufficient conditions for solvability of the multi-agent robust control problem, i.e., determining the noise attenuation level of a system, and (2) provable Robust Multi-Agent RL (RMARL) algorithms in the large-population setting, as proposed in this paper.

This is made possible by the mean-field game and control paradigm, which considers the limiting case as the number of agents approaches infinity. This paradigm was first introduced in the context of non-cooperative game theory as Mean-Field Games (MFGs), concurrently by (Lasry and Lions, 2006; Huang et al., 2006). Since then, the question of learning equilibria in MFGs has gained momentum, see (Laurière et al., 2022b). In particular, there have been several works dealing with RL for MFGs (Guo et al., 2019; Elie et al., 2020; Perrin et al., 2020; Zaman et al., 2020; Xie et al., 2021; Anahtarci et al., 2023), deep RL for MFGs (Perrin et al., 2021; Cui and Koeppl, 2021a; Laurière et al., 2022a), learning in multi-population MFGs (Pérolat et al., 2022; Zaman et al., 2021, 2023b), independent learning in MFGs (Yongacoglu et al., 2022; Yardim et al., 2023), oracle-free RL for MFGs (Angiuli et al., 2022; Zaman et al., 2023a), and RL for graphon games (Cui and Koeppl, 2021b; Fabian et al., 2023). There have also been several works on RL for MFC, which is the cooperative counterpart, see e.g. (Carmona et al., 2019a, b; Gu et al., 2021; Mondal et al., 2022; Angiuli et al., 2022). But these works require the ability to sample from the true transition model, and hence are inapplicable in the case of mis-specification or modeling errors. To address this setting, we introduce the Robust MFC problem. We connect this problem to MFTGs (Tembine, 2017), which contain mixed cooperative-competitive elements. Zero-sum MFTGs model a zero-sum competition between two infinitely large teams of agents. Prior work on the theoretical framework of zero-sum MFTGs includes (Choutri et al., 2019; Tembine, 2017; Cosso and Pham, 2019; Carmona et al., 2021; Guan et al., 2024). Related to RL, the works (Carmona et al., 2020, 2021) propose a data-driven RL algorithm based on Policy Gradient to compute the Nash equilibrium between the two coalitions in an LQ setting but do not provide a theoretical analysis of the algorithm.

2 Formulation

In this section we introduce the robust multi-agent control problem by first defining the dynamics of the multi-agent system along with its performance and noise indices. The performance and noise indices have been introduced in the literature (Başar and Bernhard, 2008) in order to quantify the effect of the accumulated noise (referred to as the noise index) on the performance of the system (called the performance index). The noise attenuation level is then defined as an upper bound on the ratio between the performance and noise indices given that the agents employ a robust controller. Hence the robust multi-agent problem is that of finding the robust controller under which a certain noise attenuation level is achieved. In order to solve this problem, we reformulate it as a min-max game problem as in the single-agent setting (Başar and Bernhard, 2008). Consider an $N$-agent system and let $[N]=\{1,\dots,N\}$. The $i^{th}$ agent has dynamics which are linear in its state $x^{i}_{t}\in\mathbb{R}^{m}$, its action $u^{1,i}_{t}\in\mathbb{R}^{p}$, and the mean-field counterparts $\bar{x}_{t}$ and $\bar{u}^{1}_{t}$. The disturbance $u^{2,i}_{t}$ is referred to as non-stochastic noise, since it is an un-modeled disturbance and can even be adversarial; it is assumed to have identity coefficient in the dynamics (1) for simplicity of analysis, but this can easily be changed to some other matrix of appropriate size. This is similar in spirit to the work of (Simchowitz et al., 2020). Let $T$ be a positive integer, interpreted as the horizon of the problem. The initial condition of agent $i$'s state, $i\in[N]$, is $x^{i}_{0}=\omega^{0,i}+\bar{\omega}^{0}$, where $\omega^{0,i}\sim\mathcal{N}(0,\Sigma^{0})$ and $\bar{\omega}^{0}\sim\mathcal{N}(0,\bar{\Sigma}^{0})$ are i.i.d. noises. For $t\in\{0,\ldots,T-1\}$,

x^{i}_{t+1}=A_{t}x^{i}_{t}+\bar{A}_{t}\bar{x}_{t}+B_{t}u^{1,i}_{t}+\bar{B}_{t}\bar{u}^{1}_{t}+u^{2,i}_{t}+\bar{u}^{2}_{t}+\omega^{i}_{t}+\bar{\omega}_{t},\quad \forall i\in[N], (1)

where $u^{1,i}_{t}$ is the control action of the $i^{th}$ agent, $\bar{x}_{t}:=\sum_{i=1}^{N}x^{i}_{t}/N$ is referred to as the state mean-field, and $\bar{u}^{j}_{t}:=\sum_{i=1}^{N}u^{j,i}_{t}/N$ for $j\in\{1,2\}$ are the control and noise mean-fields, respectively. Each agent's dynamics are perturbed by two types of noise: $\omega^{i}_{t}$ and $\bar{\omega}_{t}$ are referred to as stochastic noises since they are i.i.d. and their distributions are known ($\omega^{i}_{t}\sim\mathcal{N}(0,\Sigma)$ and $\bar{\omega}_{t}\sim\mathcal{N}(0,\bar{\Sigma})$). All of our results (excluding the finite-sample analysis of the RL algorithm) can be readily generalized to zero-mean non-Gaussian disturbances with finite variance.
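To make the information structure concrete, the following minimal sketch simulates the $N$-agent dynamics (1) under distributed linear policies of the form later used in Section 3, and tracks the state mean-field $\bar{x}_{t}$. All matrices, gains, and dimensions below are illustrative placeholders, not values from the paper.

```python
# Minimal sketch of the N-agent dynamics (1) under distributed linear policies.
# All matrices, gains, and dimensions are illustrative placeholders (time-invariant for brevity).
import numpy as np

rng = np.random.default_rng(0)
N, m, p, T = 100, 2, 2, 5                           # agents, state dim, action dim, horizon
A, A_bar = np.eye(m), 0.1 * np.eye(m)
B, B_bar = np.eye(m, p), 0.1 * np.eye(m, p)
K1, L1 = 0.5 * np.eye(p, m), 0.5 * np.eye(p, m)     # controller gains
K2, L2 = 0.1 * np.eye(m), 0.1 * np.eye(m)           # non-stochastic disturbance gains
Sigma, Sigma_bar = 0.01 * np.eye(m), 0.01 * np.eye(m)

# initial condition: x^i_0 = omega^{0,i} + bar(omega)^0
x = rng.multivariate_normal(np.zeros(m), Sigma, size=N) \
    + rng.multivariate_normal(np.zeros(m), Sigma_bar)

for t in range(T):
    x_bar = x.mean(axis=0)                          # state mean-field
    u1 = -(x - x_bar) @ K1.T - x_bar @ L1.T         # control actions, shape (N, p)
    u2 = (x - x_bar) @ K2.T + x_bar @ L2.T          # non-stochastic disturbances, shape (N, m)
    u1_bar, u2_bar = u1.mean(axis=0), u2.mean(axis=0)          # control and noise mean-fields
    w = rng.multivariate_normal(np.zeros(m), Sigma, size=N)    # stochastic noise omega^i_t
    w_bar = rng.multivariate_normal(np.zeros(m), Sigma_bar)    # common noise bar(omega)_t
    x = (x @ A.T + x_bar @ A_bar.T + u1 @ B.T + u1_bar @ B_bar.T
         + u2 + u2_bar + w + w_bar)                 # dynamics (1)

print("final state mean-field:", x.mean(axis=0))
```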

In order to define the robust control problem, we define the performance index of the population, which penalizes the deviation of the agents from their (state and control) mean-fields and also regulates the mean-fields:

J_{N}(\mathcal{u}^{1},\mathcal{u}^{2})=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\Big[\sum_{t=0}^{T-1}\big(\lVert x^{i}_{t}-\bar{x}_{t}\rVert^{2}_{Q_{t}}+\lVert\bar{x}_{t}\rVert^{2}_{\bar{Q}_{t}}+\lVert u^{1,i}_{t}-\bar{u}^{1}_{t}\rVert^{2}+\lVert\bar{u}^{1}_{t}\rVert^{2}\big)+\lVert x^{i}_{T}-\bar{x}_{T}\rVert^{2}_{Q_{T}}+\lVert\bar{x}_{T}\rVert^{2}_{\bar{Q}_{T}}\Big], (2)

where $Q_{t},\bar{Q}_{t}>0$ are symmetric matrices, $\mathcal{u}^{j}=(\mathcal{u}^{j,i})_{i\in[N]}$ where each $\mathcal{u}^{j,i}$ for $j\in\{1,2\}$ is adapted to the distributed information structure, i.e., the $\sigma$-algebra generated by $x^{i}_{t}$ and $\bar{x}_{t}$, and $\mathcal{U}^{1},\mathcal{U}^{2}$ represent the sets of all possible $\mathcal{u}^{1},\mathcal{u}^{2}$, respectively. We define the noise index of the population in a similar manner:

\varpi_{N}(\mathcal{u}^{1},\mathcal{u}^{2})=\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}\sum_{t=0}^{T-1}\big[\lVert u^{2,i}_{t}-\bar{u}^{2}_{t}\rVert^{2}+\lVert\bar{u}^{2}_{t}\rVert^{2}+\lVert\omega^{i}_{t}\rVert^{2}+\lVert\bar{\omega}_{t}\rVert^{2}\big]. (3)
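As an illustration of how these indices can be estimated in practice, the sketch below computes empirical versions of $J_{N}$ in (2) and $\varpi_{N}$ in (3) from one simulated rollout; the array names and shapes are hypothetical, and expectations would normally be approximated by averaging over many rollouts. The ratio of the two quantities is exactly what the attenuation level introduced next is meant to bound.

```python
# Empirical estimates of the performance index (2) and noise index (3) from one rollout.
# Array names/shapes are assumptions; trajectories would come from a simulator such as the
# sketch above.
import numpy as np

def empirical_indices(xs, u1s, u2s, ws, wbars, Q, Qbar, QT, QbarT):
    """xs: (T+1, N, m) states; u1s: (T, N, p) controls; u2s, ws: (T, N, m); wbars: (T, m)."""
    T = u1s.shape[0]
    J, varpi = 0.0, 0.0
    for t in range(T):
        xbar, u1bar, u2bar = xs[t].mean(0), u1s[t].mean(0), u2s[t].mean(0)
        dx, du1, du2 = xs[t] - xbar, u1s[t] - u1bar, u2s[t] - u2bar
        J += (np.mean(np.einsum('ij,jk,ik->i', dx, Q, dx)) + xbar @ Qbar @ xbar
              + np.mean(np.sum(du1 ** 2, axis=1)) + u1bar @ u1bar)
        varpi += (np.mean(np.sum(du2 ** 2, axis=1)) + u2bar @ u2bar
                  + np.mean(np.sum(ws[t] ** 2, axis=1)) + wbars[t] @ wbars[t])
    dxT, xbarT = xs[T] - xs[T].mean(0), xs[T].mean(0)
    J += np.mean(np.einsum('ij,jk,ik->i', dxT, QT, dxT)) + xbarT @ QbarT @ xbarT
    return J, varpi

# usage on random placeholder data; J_N / varpi_N is the noise-to-output gain to be bounded
rng = np.random.default_rng(1)
T, N, m, p, gamma = 5, 100, 2, 2, 2.0
J, varpi = empirical_indices(rng.standard_normal((T + 1, N, m)), rng.standard_normal((T, N, p)),
                             rng.standard_normal((T, N, m)), rng.standard_normal((T, N, m)),
                             rng.standard_normal((T, m)), np.eye(m), np.eye(m), np.eye(m), np.eye(m))
print("J_N =", round(J, 2), " varpi_N =", round(varpi, 2),
      " J_N <= gamma^2 varpi_N:", J <= gamma ** 2 * varpi)
```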

The robust control problem for this $N$-agent system is that of finding the range of noise attenuation levels $\gamma>0$ such that:

\exists\,\mathcal{u}^{1}\in\mathcal{U}^{1},\ \forall\,\mathcal{u}^{2}\in\mathcal{U}^{2},\qquad J_{N}(\mathcal{u}^{1},\mathcal{u}^{2})\leq\gamma^{2}\varpi_{N}(\mathcal{u}^{1},\mathcal{u}^{2}). (4)

Any $\gamma$ for which the above inequality is satisfied is referred to as a viable attenuation level, and the least among them is called the minimum attenuation level. The controller $\mathcal{u}^{1}$ which ensures a particular level $\gamma$ of noise attenuation is referred to as the robust controller corresponding to $\gamma$ (or robust controller in short). Since the inequality (4) can also be reformulated as $J_{N}(\cdot)/\varpi_{N}(\cdot)\leq\gamma^{2}$, a viable attenuation level also yields an upper bound $\gamma^{2}$ on the noise-to-output gain of the system. As outlined in (Başar and Bernhard, 2008) for the single-agent problem, condition (4) is equivalent to finding the range of values of $\gamma>0$ such that

\inf_{\mathcal{u}^{1}}\sup_{\mathcal{u}^{2}}\big(J_{N}(\mathcal{u}^{1},\mathcal{u}^{2})-\gamma^{2}\varpi_{N}(\mathcal{u}^{1},\mathcal{u}^{2})\big)\leq 0, (5)

where the infimizing controller $\mathcal{u}^{1}$ is the robust controller and the supremizing controller $\mathcal{u}^{2}$ is the worst-case non-stochastic noise. If we define the robust $N$-agent cost $J^{\gamma}_{N}$ as

J^{\gamma}_{N}(\mathcal{u}^{1},\mathcal{u}^{2})=J_{N}(\mathcal{u}^{1},\mathcal{u}^{2})-\gamma^{2}\,\mathbb{E}\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\big(\lVert u^{2,i}_{t}-\bar{u}^{2}_{t}\rVert^{2}+\lVert\bar{u}^{2}_{t}\rVert^{2}\big),

then using (2) and (3), the robust $N$-agent control problem (5) can be equivalently written as

\inf_{\mathcal{u}^{1}}\sup_{\mathcal{u}^{2}}\ J^{\gamma}_{N}(\mathcal{u}^{1},\mathcal{u}^{2})-\gamma^{2}\,\mathbb{E}\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T-1}\big(\lVert\omega^{i}_{t}\rVert^{2}+\lVert\bar{\omega}_{t}\rVert^{2}\big)\leq 0. (6)

Due to the distributed information structure of the agents, the standard theory of single-agent robust control does not apply in this setting. Hence we are unable to provide sufficient conditions for a given $\gamma>0$ to be a viable attenuation level for the finite-agent problem, and we resort instead to the mean-field limit as $N\to\infty$, which is also of independent interest. The next section formulates the Robust Mean-Field Control (RMFC) problem and its equivalent 2-player zero-sum Mean-Field Type Game (ZS-MFTG) representation, and provides sufficient conditions for solvability of both.

3 Robust Mean-Field Control

Consider a system with infinitely many agents, where the generic agent has linear dynamics for its state $x_{t}$ over the finite horizon $t\in\{0,\ldots,T-1\}$:

x_{t+1}=A_{t}x_{t}+\bar{A}_{t}\bar{x}_{t}+B_{t}u^{1}_{t}+\bar{B}_{t}\bar{u}^{1}_{t}+u^{2}_{t}+\bar{u}^{2}_{t}+\omega_{t}+\bar{\omega}_{t}, (7)

where $u^{1}_{t}$ is the control action of the generic agent, $\bar{x}_{t}:=\mathbb{E}[x_{t}\,|\,(\bar{\omega}_{s})_{0\leq s\leq t-1}]$ is referred to as the state mean-field, and $\bar{u}^{j}_{t}:=\mathbb{E}[u^{j}_{t}\,|\,(\bar{\omega}_{s})_{0\leq s\leq t-1}]$ for $j\in\{1,2\}$ are the control and noise mean-fields, respectively. The initial condition of the generic agent is $x_{0}=\omega^{0}+\bar{\omega}^{0}$, where $\omega^{0}\sim\mathcal{N}(0,\Sigma^{0})$ and $\bar{\omega}^{0}\sim\mathcal{N}(0,\bar{\Sigma}^{0})$ are i.i.d. noises. The stochastic noises $\omega_{t}$ and $\bar{\omega}_{t}$ are i.i.d. with $\omega_{t}\sim\mathcal{N}(0,\Sigma)$ and $\bar{\omega}_{t}\sim\mathcal{N}(0,\bar{\Sigma})$, whereas the non-stochastic noise $u^{2}_{t}$ is an un-modeled uncertainty. Similar to the $N$-agent case, we define the robust mean-field cost $J^{\gamma}$ as follows:

J^{\gamma}(\mathcal{u}^{1},\mathcal{u}^{2})=\mathbb{E}\Big[\sum_{t=0}^{T-1}\big(\lVert x_{t}-\bar{x}_{t}\rVert^{2}_{Q_{t}}+\lVert\bar{x}_{t}\rVert^{2}_{\bar{Q}_{t}}+\lVert u^{1}_{t}-\bar{u}^{1}_{t}\rVert^{2}+\lVert\bar{u}^{1}_{t}\rVert^{2}-\gamma^{2}\big(\lVert u^{2}_{t}-\bar{u}^{2}_{t}\rVert^{2}+\lVert\bar{u}^{2}_{t}\rVert^{2}\big)\big)+\lVert x_{T}-\bar{x}_{T}\rVert^{2}_{Q_{T}}+\lVert\bar{x}_{T}\rVert^{2}_{\bar{Q}_{T}}\Big]. (8)

The robust mean-field control problem, which is the mean-field analog of (6), is defined as follows.

Definition 3.1 (Robust Mean-Field Control problem).

If, for a given $\gamma>0$, the following inequality is satisfied, then $\gamma$ is a viable noise attenuation level for the robust mean-field control problem:

\inf_{\mathcal{u}^{1}}\sup_{\mathcal{u}^{2}}J^{\gamma}(\mathcal{u}^{1},\mathcal{u}^{2})-\gamma^{2}\,\mathbb{E}\sum_{t=0}^{T-1}\big(\lVert\omega_{t}\rVert^{2}+\lVert\bar{\omega}_{t}\rVert^{2}\big)\leq 0. (9)

Moreover, the infimizing controller $\mathcal{u}^{1}$ in (9) is a robust controller (corresponding to $\gamma$).

Under the condition of interchangeability of the inf and sup operations, the problem of finding $\inf_{\mathcal{u}^{1}}\sup_{\mathcal{u}^{2}}J^{\gamma}(\mathcal{u}^{1},\mathcal{u}^{2})$ is that of finding the Nash equilibrium (equivalently, in this case, the saddle point) of the 2-player Zero-Sum Mean-Field Type Game; see (Carmona et al., 2020, 2021) for a very similar LQ setting without the theoretical analysis of the RL algorithm. In what follows, we provide sufficient conditions for existence and uniqueness of a solution to this saddle-point problem, along with the value of $\inf_{\mathcal{u}^{1}}\sup_{\mathcal{u}^{2}}J^{\gamma}(\mathcal{u}^{1},\mathcal{u}^{2})$.

$2$-player Zero-Sum Mean-Field Type Games: Let us define $y_{t}=x_{t}-\bar{x}_{t}$ and $z_{t}=\bar{x}_{t}$. The dynamics of $y_{t}$ and $z_{t}$ can be written as

y_{t+1}=A_{t}y_{t}+B_{t}(u^{1}_{t}-\bar{u}^{1}_{t})+u^{2}_{t}-\bar{u}^{2}_{t}+\omega_{t}-\bar{\omega}_{t},\quad z_{t+1}=\tilde{A}_{t}z_{t}+\tilde{B}_{t}\bar{u}^{1}_{t}+2\bar{u}^{2}_{t}+2\bar{\omega}_{t},

where $\tilde{A}_{t}=A_{t}+\bar{A}_{t}$ and $\tilde{B}_{t}=B_{t}+\bar{B}_{t}$. The optimal controls are known to be linear (Carmona et al., 2020); hence we restrict our attention to the set of controls linear in $y_{t}$ and $z_{t}$,

u^{1}_{t}=\mathcal{u}^{1}_{t}(x_{t},\bar{x}_{t})=-K^{1}_{t}(x_{t}-\bar{x}_{t})-L^{1}_{t}\bar{x}_{t},\quad u^{2}_{t}=\mathcal{u}^{2}_{t}(x_{t},\bar{x}_{t})=K^{2}_{t}(x_{t}-\bar{x}_{t})+L^{2}_{t}\bar{x}_{t},

which implies that $\bar{u}^{1}_{t}=-L^{1}_{t}\bar{x}_{t}$ and $\bar{u}^{2}_{t}=L^{2}_{t}\bar{x}_{t}$. The dynamics of the processes $y_{t}$ and $z_{t}$ can then be re-written as

y_{t+1}=(A_{t}-B_{t}K^{1}_{t}+K^{2}_{t})y_{t}+\omega_{t}-\bar{\omega}_{t},\quad z_{t+1}=(\tilde{A}_{t}-\tilde{B}_{t}L^{1}_{t}+L^{2}_{t})z_{t}+2\bar{\omega}_{t}. (10)

Since the dynamics of $y_{t}$ and $z_{t}$ are decoupled, we can decompose the cost $J^{\gamma}$ into the following two parts:

J^{\gamma}(K,L)=J^{\gamma}_{y}(K)+J^{\gamma}_{z}(L),
J^{\gamma}_{y}(K)=\mathbb{E}\Big[\sum_{t=0}^{T-1}y_{t}^{\top}\big(Q_{t}+(K^{1}_{t})^{\top}K^{1}_{t}-\gamma^{2}(K^{2}_{t})^{\top}K^{2}_{t}\big)y_{t}+y_{T}^{\top}Q_{T}y_{T}\Big], (11)
J^{\gamma}_{z}(L)=\mathbb{E}\Big[\sum_{t=0}^{T-1}z_{t}^{\top}\big(\bar{Q}_{t}+(L^{1}_{t})^{\top}L^{1}_{t}-\gamma^{2}(L^{2}_{t})^{\top}L^{2}_{t}\big)z_{t}+z^{\top}_{T}\bar{Q}_{T}z_{T}\Big].

The 2-player MFTG (7)-(8) has thus been decoupled into two 2-player LQ dynamic game problems:

\min_{K^{1}_{t},L^{1}_{t}}\max_{K^{2}_{t},L^{2}_{t}}J^{\gamma}\big((K^{1}_{t},K^{2}_{t}),(L^{1}_{t},L^{2}_{t})\big)=\min_{K^{1}_{t}}\max_{K^{2}_{t}}J^{\gamma}_{y}(K)+\min_{L^{1}_{t}}\max_{L^{2}_{t}}J^{\gamma}_{z}(L),

where the dynamics of $y_{t}$ and $z_{t}$ are defined in (10). Next, using results in the literature, we specify sufficient conditions for existence and uniqueness of the Nash equilibrium of the 2-player MFTG and also present the value (Nash cost) of the game. Building on the techniques developed in (Başar and Olsder, 1998; Carmona et al., 2020), we can prove the following result.

Theorem 3.2.

Assume that, for a given $\gamma>0$,

\gamma^{2}I-M^{\gamma}_{t}>0\ \text{ and }\ \gamma^{2}I-\bar{M}^{\gamma}_{t}>0, (12)

where $M^{\gamma}_{t}$ and $\bar{M}^{\gamma}_{t}$ are positive semi-definite matrices which satisfy the Coupled Algebraic Riccati equations,

M^{\gamma}_{t}=Q_{t}+A^{\top}_{t}M^{\gamma}_{t+1}\Lambda^{-1}_{t}A_{t},\quad \Lambda_{t}=I+(B_{t}B^{\top}_{t}-\gamma^{-2}I)M^{\gamma}_{t+1},\quad M^{\gamma}_{T}=Q_{T},
\bar{M}^{\gamma}_{t}=\bar{Q}_{t}+\tilde{A}^{\top}_{t}\bar{M}^{\gamma}_{t+1}\bar{\Lambda}^{-1}_{t}\tilde{A}_{t},\quad \bar{\Lambda}_{t}=I+(\tilde{B}_{t}\tilde{B}^{\top}_{t}-\gamma^{-2}I)\bar{M}^{\gamma}_{t+1},\quad \bar{M}^{\gamma}_{T}=\bar{Q}_{T}, (13)
N^{\gamma}_{t}=N^{\gamma}_{t+1}+\operatorname{Tr}(M^{\gamma}_{t+1}\Sigma),\quad N^{\gamma}_{T}=0,\quad \bar{N}^{\gamma}_{t}=\bar{N}^{\gamma}_{t+1}+\operatorname{Tr}(\bar{M}^{\gamma}_{t+1}\bar{\Sigma}),\quad \bar{N}^{\gamma}_{T}=0.

Then, $u^{1*}_{t}=-K^{1*}_{t}(x_{t}-\bar{x}_{t})-L^{1*}_{t}\bar{x}_{t}$ and $u^{2*}_{t}=K^{2*}_{t}(x_{t}-\bar{x}_{t})+L^{2*}_{t}\bar{x}_{t}$ (complete expressions provided in the Supplementary Materials) are the unique Nash policies. Furthermore, the Nash equilibrium (equivalently, saddle-point) value is

\inf_{\mathcal{u}^{1}}\sup_{\mathcal{u}^{2}}J^{\gamma}(\mathcal{u}^{1},\mathcal{u}^{2})=\operatorname{Tr}(M^{\gamma}_{0}\Sigma^{0})+\operatorname{Tr}(\bar{M}^{\gamma}_{0}\bar{\Sigma}^{0})+N^{\gamma}_{0}+\bar{N}^{\gamma}_{0}. (14)

This result can be proved using the techniques in the proofs of Theorem 3.2 in (Başar and Bernhard, 2008) or Proposition 36 in (Carmona et al., 2021). We now use the Nash value of the game (14) to obtain a condition on the attenuation level $\gamma$ which solves the robust mean-field control problem (9). First, we simplify the expression in (9), $\mathbb{E}\sum_{t=0}^{T-1}\big(\lVert\omega_{t}\rVert^{2}+\lVert\bar{\omega}_{t}\rVert^{2}\big)=T\operatorname{Tr}(\Sigma+\bar{\Sigma})$, using the i.i.d. nature of the stochastic noise. Combining this fact with (14), we conclude that (9) is satisfied if and only if

\sum_{t=1}^{T}\operatorname{Tr}\big((M^{\gamma}_{t}-\gamma^{2}I)\Sigma+(\bar{M}^{\gamma}_{t}-\gamma^{2}I)\bar{\Sigma}\big)+\operatorname{Tr}(M^{\gamma}_{0}\Sigma^{0})+\operatorname{Tr}(\bar{M}^{\gamma}_{0}\bar{\Sigma}^{0})\leq 0. (15)

Notice that the conditions (12) and (15) are different, as the first one requires positive definiteness of matrices and the second one is a scalar inequality. Now we address the robust $N$-agent control problem by providing sufficient conditions under which a given attenuation level $\gamma$ satisfies (4).
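For a concrete candidate $\gamma$, conditions (12) and (15) can be checked numerically by running the backward recursions in (13) and evaluating the two conditions, as in the following hedged sketch; all model matrices below are illustrative placeholders (time-invariant for brevity), not values from the paper.

```python
# Hedged numerical check of conditions (12) and (15) for a candidate gamma, via the
# backward Riccati recursions (13). All model matrices are illustrative placeholders.
import numpy as np

def check_attenuation(A, A_bar, B, B_bar, Q, Qbar, QT, QbarT,
                      Sigma, Sigma_bar, Sigma0, Sigma0_bar, T, gamma):
    m = A.shape[0]
    A_til, B_til = A + A_bar, B + B_bar                    # tilde(A)_t, tilde(B)_t
    M, Mbar = [None] * (T + 1), [None] * (T + 1)
    M[T], Mbar[T] = QT, QbarT
    for t in range(T - 1, -1, -1):                         # recursion (13), backwards in time
        Lam = np.eye(m) + (B @ B.T - gamma ** -2 * np.eye(m)) @ M[t + 1]
        Lam_bar = np.eye(m) + (B_til @ B_til.T - gamma ** -2 * np.eye(m)) @ Mbar[t + 1]
        M[t] = Q + A.T @ M[t + 1] @ np.linalg.inv(Lam) @ A
        Mbar[t] = Qbar + A_til.T @ Mbar[t + 1] @ np.linalg.inv(Lam_bar) @ A_til
    cond12 = all(np.all(np.linalg.eigvalsh(gamma ** 2 * np.eye(m) - P) > 0)
                 for t in range(T + 1) for P in (M[t], Mbar[t]))      # condition (12)
    lhs = sum(np.trace((M[t] - gamma ** 2 * np.eye(m)) @ Sigma)
              + np.trace((Mbar[t] - gamma ** 2 * np.eye(m)) @ Sigma_bar)
              for t in range(1, T + 1))
    lhs += np.trace(M[0] @ Sigma0) + np.trace(Mbar[0] @ Sigma0_bar)
    return cond12, lhs <= 0                                # (12) holds, (15) holds

I = np.eye(2)
print(check_attenuation(I, 0.1 * I, I, 0.1 * I, I, I, I, I,
                        0.01 * I, 0.01 * I, 0.01 * I, 0.01 * I, T=5, gamma=3.0))
```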

Theorem 3.3.

Let $\gamma>0$. Assume that, in addition to (12), we also have

\sum_{t=1}^{T}\operatorname{Tr}\big((M^{\gamma}_{t}-\gamma^{2}I)\Sigma+(\bar{M}^{\gamma}_{t}-\gamma^{2}I)\bar{\Sigma}\big)+\operatorname{Tr}(M^{\gamma}_{0}\Sigma^{0})+\operatorname{Tr}(\bar{M}^{\gamma}_{0}\bar{\Sigma}^{0})\leq-\frac{CT}{N}, (16)

where $C$ is a constant which depends only on the model parameters and on $M^{\gamma}_{t}$ and $\bar{M}^{\gamma}_{t}$ in (13). Then $\gamma$ is a viable attenuation level for the robust $N$-agent control problem (4). Moreover, the robust controller for each agent $i$ is given by $u^{1,i*}_{t}=-K^{1*}_{t}(x^{i}_{t}-\bar{x}_{t})-L^{1*}_{t}\bar{x}_{t}$.

The proof of this result can be found in the Supplementary Materials. The above theorem states that if, for a given $\gamma$, conditions (12) and (16) are satisfied (with $M^{\gamma}_{t}$ and $\bar{M}^{\gamma}_{t}$ defined by (13)), then not only is $\gamma$ a viable attenuation level for the original robust multi-agent control problem (1)-(4), but the Nash equilibrium of the ZS-MFTG also yields the robust controller $u^{1,i*}_{t}=-K^{1*}_{t}(x^{i}_{t}-\bar{x}_{t})-L^{1*}_{t}\bar{x}_{t}$ for the original finite-agent problem. Condition (16) is strictly stronger than condition (15) but approaches it as $N\rightarrow\infty$.

4 Reinforcement Learning for Robust Mean-Field Control

In this section we present the Receding-horizon policy Gradient Descent Ascent (RGDA) algorithm to compute the Nash equilibrium (Theorem 3.2) of the 2-player MFTG (7)-(8), which also yields the robust controller for a fixed noise attenuation level $\gamma$. For this section, we assume access only to the finite-horizon costs of the agents under a given set of control policies, and not to the state trajectories. Under this setting the model of the agents cannot be reconstructed, hence our approach is truly model-free (Malik et al., 2019). Due to the non-convex non-concave (and non-coercive (Zhang et al., 2020b)) nature of the cost function $J^{\gamma}$ in (11), we instead solve the receding-horizon problem for each $t\in\{T-1,\ldots,1,0\}$ backwards-in-time. This entails solving $2\times T$ min-max problems, where each problem is convex-concave and aims at finding $(K_{t},L_{t})=\big((K^{1}_{t},K^{2}_{t}),(L^{1}_{t},L^{2}_{t})\big)$ at time step $t$, given that the set of future controllers (controllers for times greater than $t$), $\big((\tilde{K}_{t+1},\tilde{L}_{t+1}),\ldots,(\tilde{K}_{T},\tilde{L}_{T})\big)$, is held constant. But first we must approximate the mean-field term using finitely many agents.

Approximation of mean-field terms using $M$ agents: Since simulating infinitely many agents is impractical, we outline how to use a set of $2\leq M<\infty$ agents to approximately simulate the mean-field in an MFTG. Each of the $M$ agents has state $x^{i}_{t}$ at time $t$, where $i\in[M]$. The agents follow controllers linear in their private state $x^{i}_{t}$ and the empirical mean-field $\tilde{z}_{t}$:

u^{1}_{t}=-K^{1}_{t}(x^{i}_{t}-\tilde{z}_{t})-L^{1}_{t}\tilde{z}_{t},\quad u^{2}_{t}=K^{2}_{t}(x^{i}_{t}-\tilde{z}_{t})+L^{2}_{t}\tilde{z}_{t},

where the empirical mean-field is $\tilde{z}_{t}:=\frac{1}{M}\sum_{i=1}^{M}x^{i}_{t}$. Under these control laws, the dynamics of agent $i\in[M]$ are

x^{i}_{t+1}=(A_{t}-B_{t}K^{1}_{t}+K^{2}_{t})(x^{i}_{t}-\tilde{z}_{t})+(\tilde{A}_{t}-\tilde{B}_{t}L^{1}_{t}+L^{2}_{t})\tilde{z}_{t}+\omega^{i}_{t+1}+\bar{\omega}_{t},

and the dynamics of the empirical mean-field $\tilde{z}_{t}$ are

\tilde{z}_{t+1}=(\tilde{A}_{t}-\tilde{B}_{t}L^{1}_{t}+L^{2}_{t})\tilde{z}_{t}+\tilde{\omega}^{0}_{t+1},\quad\text{where}\quad\tilde{\omega}^{0}_{t+1}=\bar{\omega}_{t}+\frac{1}{M}\sum_{i=1}^{M}\omega^{i}_{t+1}.

The cost of each agent is

\tilde{J}^{i,\gamma}(u_{1},u_{2})=\mathbb{E}\Big[\sum_{t=0}^{T-1}\big((x^{i}_{t}-\tilde{z}_{t})^{\top}[Q_{t}+(K^{1}_{t})^{\top}K^{1}_{t}-\gamma^{2}(K^{2}_{t})^{\top}K^{2}_{t}](x^{i}_{t}-\tilde{z}_{t})+\tilde{z}_{t}^{\top}[\bar{Q}_{t}+(L^{1}_{t})^{\top}L^{1}_{t}-\gamma^{2}(L^{2}_{t})^{\top}L^{2}_{t}]\tilde{z}_{t}\big)+(x^{i}_{T}-\tilde{z}_{T})^{\top}Q_{T}(x^{i}_{T}-\tilde{z}_{T})+\tilde{z}_{T}^{\top}\bar{Q}_{T}\tilde{z}_{T}\Big].

Now, similarly to the previous section, we define $y^{i}_{t}=x^{i}_{t}-\tilde{z}_{t}$. The dynamics of $y^{i}_{t}$ are

y^{i}_{t+1}=(A_{t}-B_{t}K^{1}_{t}+K^{2}_{t})y^{i}_{t}+\tilde{\omega}^{i}_{t+1},\quad\text{where}\quad\tilde{\omega}^{i}_{t+1}=\frac{M-1}{M}\omega^{i}_{t+1}-\frac{1}{M}\sum_{j\neq i}\omega^{j}_{t+1}.

The cost can then be decomposed in a manner similar to (11):

\tilde{J}^{i,\gamma}\big((K^{1}_{t},K^{2}_{t}),(L^{1}_{t},L^{2}_{t})\big)=\tilde{J}^{i,\gamma}_{y}(K^{1}_{t},K^{2}_{t})+\tilde{J}^{i,\gamma}_{z}(L^{1}_{t},L^{2}_{t}),
\tilde{J}^{i,\gamma}_{y}(K^{1}_{t},K^{2}_{t})=\mathbb{E}\Big[\sum_{t=0}^{T-1}(y^{i}_{t})^{\top}[Q_{t}+(K^{1}_{t})^{\top}K^{1}_{t}-\gamma^{2}(K^{2}_{t})^{\top}K^{2}_{t}]y^{i}_{t}+(y^{i}_{T})^{\top}Q_{T}y^{i}_{T}\Big], (17)
\tilde{J}^{i,\gamma}_{z}(L^{1}_{t},L^{2}_{t})=\mathbb{E}\Big[\sum_{t=0}^{T-1}\tilde{z}_{t}^{\top}[\bar{Q}_{t}+(L^{1}_{t})^{\top}L^{1}_{t}-\gamma^{2}(L^{2}_{t})^{\top}L^{2}_{t}]\tilde{z}_{t}+\tilde{z}_{T}^{\top}\bar{Q}_{T}\tilde{z}_{T}\Big].
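Since the RL algorithm presented below only needs evaluations of such costs, the following sketch rolls out the $M$-agent system under fixed gains and returns single-rollout estimates of $\tilde{J}^{i,\gamma}_{y}$ and $\tilde{J}^{i,\gamma}_{z}$ in (17), averaged over agents to reduce variance; all matrices, gains, and hyper-parameters below are placeholder assumptions.

```python
# A single-rollout cost oracle for the M-agent approximation: simulate the agents under
# fixed gains (K1, K2, L1, L2) and return estimates of the decomposed costs in (17).
# Matrices and gains are placeholder assumptions (time-invariant for brevity).
import numpy as np

def cost_oracle(K1, K2, L1, L2, A, A_bar, B, B_bar, Q, Qbar, QT, QbarT,
                Sigma, Sigma_bar, gamma, M, T, rng):
    m = A.shape[0]
    A_til, B_til = A + A_bar, B + B_bar
    x = rng.multivariate_normal(np.zeros(m), Sigma, size=M) \
        + rng.multivariate_normal(np.zeros(m), Sigma_bar)        # x^i_0
    Jy, Jz = 0.0, 0.0
    for t in range(T):
        z = x.mean(axis=0)                                       # empirical mean-field z~_t
        y = x - z                                                # y^i_t
        Wy = Q + K1.T @ K1 - gamma ** 2 * K2.T @ K2              # running weight for y in (17)
        Wz = Qbar + L1.T @ L1 - gamma ** 2 * L2.T @ L2           # running weight for z~ in (17)
        Jy += np.mean(np.einsum('ij,jk,ik->i', y, Wy, y))        # averaged over agents i
        Jz += z @ Wz @ z
        w = rng.multivariate_normal(np.zeros(m), Sigma, size=M)
        w_bar = rng.multivariate_normal(np.zeros(m), Sigma_bar)
        x = y @ (A - B @ K1 + K2).T + z @ (A_til - B_til @ L1 + L2).T + w + w_bar
    z, y = x.mean(axis=0), x - x.mean(axis=0)
    Jy += np.mean(np.einsum('ij,jk,ik->i', y, QT, y))            # terminal terms
    Jz += z @ QbarT @ z
    return Jy, Jz

I = np.eye(2)
rng = np.random.default_rng(2)
print(cost_oracle(0.5 * I, 0.1 * I, 0.5 * I, 0.1 * I, I, 0.1 * I, I, 0.1 * I,
                  I, I, I, I, 0.01 * I, 0.01 * I, gamma=2.0, M=500, T=5, rng=rng))
```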

Receding-horizon approach: As outlined at the beginning of this section, instead of finding the optimal $K^{*}$ and $L^{*}$ which optimize $\tilde{J}$ in (17), we solve the receding-horizon problem for each $t\in\{T-1,\ldots,1,0\}$ backwards-in-time. This forms two decoupled min-max convex-concave problems of finding $(K_{t},L_{t})=\big((K^{1}_{t},K^{2}_{t}),(L^{1}_{t},L^{2}_{t})\big)$ at each time step $t$, given that the set of controllers for times greater than $t$, $\big((\tilde{K}_{t+1},\tilde{L}_{t+1}),\ldots,(\tilde{K}_{T},\tilde{L}_{T})\big)$, is held fixed:

\min_{(K^{1}_{t},L^{1}_{t})}\max_{(K^{2}_{t},L^{2}_{t})}\tilde{J}^{i,\gamma}_{t}(K_{t},L_{t})=
\underbrace{\mathbb{E}\Big[y_{t}^{\top}\big(Q_{t}+(K^{1}_{t})^{\top}K^{1}_{t}-\gamma^{2}(K^{2}_{t})^{\top}K^{2}_{t}\big)y_{t}+\sum_{k=t+1}^{T}y_{k}^{\top}\big(Q_{k}+(\tilde{K}^{1}_{k})^{\top}\tilde{K}^{1}_{k}-\gamma^{2}(\tilde{K}^{2}_{k})^{\top}\tilde{K}^{2}_{k}\big)y_{k}\Big]}_{\tilde{J}^{i,\gamma}_{y,t}} (18)
+\underbrace{\mathbb{E}\Big[z_{t}^{\top}\big(\bar{Q}_{t}+(L^{1}_{t})^{\top}L^{1}_{t}-\gamma^{2}(L^{2}_{t})^{\top}L^{2}_{t}\big)z_{t}+\sum_{k=t+1}^{T}z_{k}^{\top}\big(\bar{Q}_{k}+(\tilde{L}^{1}_{k})^{\top}\tilde{L}^{1}_{k}-\gamma^{2}(\tilde{L}^{2}_{k})^{\top}\tilde{L}^{2}_{k}\big)z_{k}\Big]}_{\tilde{J}^{i,\gamma}_{z,t}},

for any $i\in[M]$ and $y_{t}\sim\mathcal{N}(0,\Sigma_{y}),z_{t}\sim\mathcal{N}(0,\Sigma_{z})$. This receding-horizon problem is solved using Receding-horizon policy Gradient Descent Ascent (RGDA) (Algorithm 1), where at each time instant $t$ the Nash control is approached using gradient descent ascent. We anticipate a small approximation error between the optimal controller and its computed approximation $\tilde{K}_{t}$ (respectively $\tilde{L}_{t}$). However, this error is shown to be well-behaved (Theorem 4.2) as we progress backwards-in-time, given that the hyper-parameters of RGDA satisfy certain bounds.

Receding-horizon policy Gradient Descent Ascent (RGDA) Algorithm: The RGDA algorithm (Algorithm 1) is a bi-level optimization algorithm where the outer loop starts at time $t=T-1$ and moves backwards-in-time, and the inner loop is a gradient descent (for the controller parameters $(K^{1}_{t},L^{1}_{t})$) ascent (for the disturbance parameters $(K^{2}_{t},L^{2}_{t})$) update with learning rate $\eta_{k}$. The gradient descent ascent step entails computing an approximation of the exact gradients of the cost $\tilde{J}^{i,\gamma}_{t}$ with respect to the control variables $(K^{1}_{t},L^{1}_{t}),(K^{2}_{t},L^{2}_{t})$. To obtain this approximation in a data-driven manner, we utilize zero-order stochastic gradients $\tilde{\nabla}_{1}\tilde{J}^{i,\gamma}_{t}(K_{t},L_{t}),\tilde{\nabla}_{2}\tilde{J}^{i,\gamma}_{t}(K_{t},L_{t})$ (Fazel et al., 2018; Malik et al., 2019), which only require evaluations of the cost (18) under perturbed controllers, as shown below:

\tilde{\nabla}_{1}\tilde{J}^{i,\gamma}_{t}(K_{t},L_{t})=\frac{n}{N_{b}r^{2}}\sum_{j=1}^{N_{b}}\tilde{J}^{i,\gamma}_{t}\big((K^{j,1}_{t},K^{2}_{t}),(L^{j,1}_{t},L^{2}_{t})\big)e_{j},\quad\begin{pmatrix}K^{j,1}_{t}\\ L^{j,1}_{t}\end{pmatrix}=\begin{pmatrix}K^{1}_{t}\\ L^{1}_{t}\end{pmatrix}+e_{j},\quad e_{j}\sim\mathbb{S}^{n-1}(r),
\tilde{\nabla}_{2}\tilde{J}^{i,\gamma}_{t}(K_{t},L_{t})=\frac{n}{N_{b}r^{2}}\sum_{j=1}^{N_{b}}\tilde{J}^{i,\gamma}_{t}\big((K^{1}_{t},K^{j,2}_{t}),(L^{1}_{t},L^{j,2}_{t})\big)e_{j},\quad\begin{pmatrix}K^{j,2}_{t}\\ L^{j,2}_{t}\end{pmatrix}=\begin{pmatrix}K^{2}_{t}\\ L^{2}_{t}\end{pmatrix}+e_{j},\quad e_{j}\sim\mathbb{S}^{n-1}(r).

Stochastic gradient computation entails computing the cost of $N_{b}$ different perturbed controllers, with a perturbation of magnitude $r$, also called the smoothing radius. This stochastic gradient provides a biased approximation of the exact gradient, whose bias and variance can be controlled by tuning the values of $N_{b}$ and $r$. Finally, to ensure stability of the learning algorithm, we use a projection $\operatorname{Proj}_{D}$ onto a $D$-ball such that the norm of the gain matrices is bounded by $D$, i.e., $\lVert(K_{t},L_{t})\rVert^{2}\leq D$. The radius $D$ of the ball is chosen such that the Nash equilibrium controllers lie within it.

Algorithm 1 RGDA Algorithm for 2-player MFTG
1:  for $t=T-1,\ldots,1,0$ do
2:     Initialize $K_{t}=(K^{1}_{t},K^{2}_{t})=0$, $L_{t}=(L^{1}_{t},L^{2}_{t})=0$
3:     for $k=0,\ldots,K$ do
4:        Gradient Descent: $\begin{pmatrix}K^{1}_{t}\\ L^{1}_{t}\end{pmatrix}\leftarrow\operatorname{Proj}_{D}\Big(\begin{pmatrix}K^{1}_{t}\\ L^{1}_{t}\end{pmatrix}-\eta_{k}\tilde{\nabla}_{1}\tilde{J}^{i,\gamma}_{t}(K_{t},L_{t})\Big)$
5:        Gradient Ascent: $\begin{pmatrix}K^{2}_{t}\\ L^{2}_{t}\end{pmatrix}\leftarrow\operatorname{Proj}_{D}\Big(\begin{pmatrix}K^{2}_{t}\\ L^{2}_{t}\end{pmatrix}+\eta_{k}\tilde{\nabla}_{2}\tilde{J}^{i,\gamma}_{t}(K_{t},L_{t})\Big)$
6:     end for
7:  end for
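As a companion to Algorithm 1, the sketch below implements the one-point zeroth-order gradient estimate and the projection $\operatorname{Proj}_{D}$ for a generic black-box cost; the cost function, the gain shapes, and the hyper-parameters are illustrative assumptions, with a toy quadratic standing in for the receding-horizon cost (18).

```python
# Sketch of the zeroth-order gradient estimate and projection used in Algorithm 1.
# `cost` is any black-box evaluator of the cost at perturbed gains; here a toy quadratic
# stands in for (18), and all hyper-parameters are assumptions.
import numpy as np

def sample_sphere(shape, r, rng):
    """Uniform sample from the sphere of radius r, matched to the stacked-gain shape."""
    e = rng.standard_normal(shape)
    return r * e / np.linalg.norm(e)

def zo_gradient(cost, theta, r, Nb, rng):
    """One-point estimate: (n / (Nb r^2)) * sum_j cost(theta + e_j) e_j."""
    n, grad = theta.size, np.zeros_like(theta)
    for _ in range(Nb):
        e = sample_sphere(theta.shape, r, rng)
        grad += cost(theta + e) * e
    return n / (Nb * r ** 2) * grad

def proj_ball(theta, D):
    """Projection onto {theta : ||theta||^2 <= D}, used to keep the iterates bounded."""
    nrm2 = float(np.sum(theta ** 2))
    return theta if nrm2 <= D else theta * np.sqrt(D / nrm2)

rng = np.random.default_rng(3)
theta1 = np.zeros((4, 2))                                # stacked (K^1_t; L^1_t) with m = p = 2
toy_cost = lambda th: float(np.sum((th - 1.0) ** 2))     # placeholder for (18) as a function of theta1
for k in range(300):                                     # descent half of the inner loop
    g = zo_gradient(toy_cost, theta1, r=0.5, Nb=100, rng=rng)
    theta1 = proj_ball(theta1 - 0.05 * g, D=100.0)
print(theta1.round(2))                                   # noisy iterates settle near the all-ones minimizer
```

The ascent half (line 5 of Algorithm 1) is symmetric, applying a plus-sign step to the maximizing gains $(K^{2}_{t},L^{2}_{t})$ with the second gradient estimate.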

RGDA algorithm analysis: In this section we start by showing linear convergence of the inner loop gradient descent ascent (Theorem 4.1), which is made possible by the convex-concave property of the cost function under the receding horizon approach (18). Then we show that if the error accumulated in each inner loop computation is small enough, the total accumulated error is well behaved (Theorem 4.2).

We first define some relevant notation. For conciseness, we define the joint controllers for each timestep $t$ as $\bar{K}_{t}=[(K^{1}_{t})^{\top},(K^{2}_{t})^{\top}]^{\top}$ and $\bar{L}_{t}=[(L^{1}_{t})^{\top},(L^{2}_{t})^{\top}]^{\top}$. For each timestep $t\in\{T-1,\ldots,1,0\}$, let us also define the target joint controllers $\tilde{\bar{K}}^{*}_{t}=(\tilde{K}^{1*}_{t},\tilde{K}^{2*}_{t}),\tilde{\bar{L}}^{*}_{t}=(\tilde{L}^{1*}_{t},\tilde{L}^{2*}_{t})$ as the policies which exactly solve the receding-horizon min-max problem (18). Notice that the target controllers $\tilde{\bar{K}}^{*}_{t},\tilde{\bar{L}}^{*}_{t}$ are unique (due to the convex-concave nature of (18)) but do depend on the set of future joint controllers $(\bar{K}_{s},\bar{L}_{s})_{t<s<T}$. On the other hand, the Nash joint controllers are denoted by $\bar{K}^{*}_{t}=(K^{1*}_{t},K^{2*}_{t})$ and $\bar{L}^{*}_{t}=(L^{1*}_{t},L^{2*}_{t})$. Furthermore, the target joint controllers equal the Nash joint controllers, $(\tilde{\bar{K}}^{*}_{t},\tilde{\bar{L}}^{*}_{t})=(\bar{K}^{*}_{t},\bar{L}^{*}_{t})$, only if the future joint controllers are also Nash, $(\bar{K}_{s},\bar{L}_{s})_{t<s<T}=(\bar{K}^{*}_{s},\bar{L}^{*}_{s})_{t<s<T}$.

Theorem 4.1.

If the learning rate $\eta_{k}$ is smaller than a certain function of the model parameters, the number of inner-loop iterations $K=\Omega(\log(1/\epsilon))$, the mini-batch size $N_{b}=\Omega(1/\epsilon)$, and the smoothing radius $r=\mathcal{O}(\epsilon)$, then at each timestep $t\in\{T-1,\ldots,1,0\}$ the optimality gaps satisfy $\lVert\bar{K}_{t}-\tilde{\bar{K}}^{*}_{t}\rVert^{2}_{2}\leq\epsilon$ and $\lVert\bar{L}_{t}-\tilde{\bar{L}}^{*}_{t}\rVert^{2}_{2}\leq\epsilon$.

Closed-form expressions of the bounds can be found in the proof given in the Supplementary Materials. The linear rate of convergence is made possible by building upon the convergence analysis of gradient descent ascent in (Fallah et al., 2020), thanks to the convex-concave nature of the cost function (18). The proof generalizes the techniques used in (Fallah et al., 2020) to biased stochastic gradients by utilizing the fact that the bias in the stochastic gradients $\tilde{\nabla}_{j}\tilde{J}^{i,\gamma}_{t}$ for $j\in\{1,2\}$ can be reduced by decreasing the smoothing radius $r$. This in turn increases the variance of the stochastic gradient, which is controlled by increasing the mini-batch size $N_{b}$.

Now we present the non-asymptotic convergence guarantee of the paper, which states that even though each iteration of the outer loop (as timestep $t$ moves backwards-in-time) accumulates error, if the error in each outer-loop iteration is small enough, then the total accumulated error is also small. The proof can be found in the Supplementary Materials.

Theorem 4.2.

If all conditions in Theorem 4.1 are satisfied, then $\max_{j\in\{1,2\}}\lVert K^{j}_{t}-K^{j*}_{t}\rVert=\mathcal{O}(\epsilon)$ and $\max_{j\in\{1,2\}}\lVert L^{j}_{t}-L^{j*}_{t}\rVert=\mathcal{O}(\epsilon)$ for small $\epsilon>0$ and all $t\in\{T-1,\ldots,0\}$.

The Nash gaps at each time $t$, $\lVert K^{j}_{t}-K^{j*}_{t}\rVert$ and $\lVert L^{j}_{t}-L^{j*}_{t}\rVert$ for $j\in\{1,2\}$, are due to a combination of the optimality gap in the inner loop, $\lVert\bar{K}_{t}-\tilde{\bar{K}}^{*}_{t}\rVert^{2}_{2},\lVert\bar{L}_{t}-\tilde{\bar{L}}^{*}_{t}\rVert^{2}_{2}$, and the accumulated Nash gaps in the future joint controllers, $\lVert K^{j}_{s}-K^{j*}_{s}\rVert$ and $\lVert L^{j}_{s}-L^{j*}_{s}\rVert$ for $j\in\{1,2\}$ and $t<s<T$. The proof of Theorem 4.2 characterizes these two quantities and then shows that if the optimality gap at each timestep $t\in\{0,\ldots,T-1\}$ never exceeds some small $\epsilon$, then the Nash gap at any time $t$ never exceeds $\epsilon$ scaled by a constant.

5 Numerical Analysis

First, we simulate the RGDA algorithm for time horizon $T=3$, number of agents $M=1000$, and state and action space dimensions $m=p=2$. For each timestep $t\in\{2,1,0\}$, the number of inner-loop iterations is $K=1000$, the mini-batch size is $N_{b}=5\times 10^{4}$, and the learning rate is $\eta_{k}=0.001$. In Figure 1 we compare the RGDA algorithm (Algorithm 1) with its exact version (E-RGDA), which has access to the exact policy gradients $\nabla_{1}\tilde{J}^{i,\gamma}_{t}=\delta\tilde{J}^{i,\gamma}_{t}/\delta(K^{1}_{t},L^{1}_{t})$ and $\nabla_{2}\tilde{J}^{i,\gamma}_{t}=\delta\tilde{J}^{i,\gamma}_{t}/\delta(K^{2}_{t},L^{2}_{t})$ at each iteration $k\in[K]$. The error plots in Figure 1 show the mean (solid lines) and standard deviation (shaded regions) of the error, which is the norm of the difference between the iterates and the Nash controllers. In Figure 1, the blue plot shows the error convergence of the E-RGDA algorithm, which computes the Nash controllers for the last timestep $t=2$ (using gradient descent ascent with exact gradients) and moves backwards in time. Since at each timestep it converges well to the Nash policies, the convexity-concavity of the cost function at the next timestep is ensured, which results in linear convergence. The red plot in Figure 1 shows the error convergence of the RGDA algorithm, which uses stochastic gradients, resulting in a noisy but downward trend in error. Notice that RGDA imitates E-RGDA in a noisy fashion and at each timestep the iterates only approximate the Nash controllers. This approximation can be further sharpened by increasing the mini-batch size $N_{b}$ and decreasing the smoothing radius $r$. Figure 1 also shows the error convergence of E-RGDA for a ZS-MFTG with $T=15$ and state and action space dimensions $m=p=2$.

Figure 2 compares the E-RGDA algorithm with the exact 2-player zero-sum version of the MADDPG algorithm (referred to as E-DDPG) (Lowe et al., 2017), which serves as a baseline as it does not use the receding-horizon approach. The number of inner-loop iterations for E-RGDA is $K=70$ and the learning rate for both algorithms is $\eta=0.025$. The four panels show comparisons for $T\in\{2,3,4,5\}$, and the y-axis is scaled logarithmically to best show the behavior of the algorithms. For all $T>1$, E-DDPG first diverges until it reaches the projection threshold and only then eventually starts to converge. This is due to the fact that errors in later timesteps cause the convexity-concavity condition to fail, resulting in divergence at earlier timesteps. Over time the error decreases in the later timesteps, which causes the error in earlier timesteps to gradually decrease as well. But as seen in Figure 2, convergence for E-DDPG takes significantly longer as the time horizon increases.

Figure 2: Comparison between E-RGDA and E-DDPG. The time-horizon increases from left to right: $T=2$ (left-most), $T=3$ (center left), $T=4$ (center right), and $T=5$ (right-most).

6 Conclusion

In this paper, we solve an MARL problem with the objective of designing robust controllers in the presence of modeled and un-modeled uncertainties. We introduce the Robust Mean-Field Control (RMFC) problem as the limiting problem when the number of agents grows to infinity. We then establish a connection with Zero-Sum Mean-Field Type Games (ZS-MFTG). We resort to the Linear-Quadratic (LQ) structure which, combined with the mean-field approximation, yields a more tractable model and helps resolve the analytical difficulty induced by the distributed information structure. This allows us to obtain sufficient conditions for robustness of the problem as well as a characterization of the robust control policy. We design and provide a non-asymptotic analysis of a receding-horizon based RL algorithm which renders the otherwise non-coercive cost convex-concave. Through numerical analysis, the receding-horizon approach is shown to ameliorate the overshooting problem observed in the performance of the vanilla algorithm.

In future work we would like to explore this type of robust mean-field problem beyond the LQ setting and to develop RL algorithms which go beyond the gradient descent-ascent updates used in this paper. Furthermore, our work is a first step in the direction of using mean-field approximations to study robust MARL problems which occur in many real-world scenarios, but the study of concrete examples is left for future work.

Disclaimer: This paper was prepared for informational purposes in part by the Artificial Intelligence Research group of JP Morgan Chase & Co and its affiliates ("JP Morgan"), and is not a product of the Research Department of JP Morgan. JP Morgan makes no representation and warranty whatsoever and disclaims all liability for the completeness, accuracy or reliability of the information contained herein. This document is not intended as investment research or investment advice, or a recommendation, offer or solicitation for the purchase or sale of any security, financial instrument, financial product or service, or to be used in any way for evaluating the merits of participating in any transaction, and shall not constitute a solicitation under any jurisdiction or to any person, if such solicitation under such jurisdiction or to such person would be unlawful.

Acknowledgments

The authors would like to thank Xiangyuan Zhang (University of Illinois Urbana-Champaign) for useful discussions regarding the Receding-horizon Policy Gradient algorithm.

References

  • Anahtarci et al. (2023) Berkay Anahtarci, Can Deha Kariksiz, and Naci Saldi. Q-learning in regularized mean-field games. Dynamic Games and Applications, 13(1):89–117, 2023.
  • Angiuli et al. (2022) Andrea Angiuli, Jean-Pierre Fouque, and Mathieu Laurière. Unified reinforcement Q-learning for mean field game and control problems. Mathematics of Control, Signals, and Systems, pages 1–55, 2022.
  • Athans et al. (1977) Michael Athans, David Castanon, K-P Dunn, C Greene, Wing Lee, N Sandell, and A Willsky. The stochastic control of the F-8C aircraft using a multiple model adaptive control (MMAC) method–Part I: Equilibrium flight. IEEE Transactions on Automatic Control, 22(5):768–780, 1977.
  • Başar (1989) Tamer Başar. A dynamic games approach to controller design: Disturbance rejection in discrete time. In Proceedings of the 28th IEEE Conference on Decision and Control,, pages 407–414. IEEE, 1989.
  • Başar and Bernhard (2008) Tamer Başar and Pierre Bernhard. H-infinity optimal control and related minimax design problems: a dynamic game approach. Springer Science & Business Media, 2008.
  • Başar and Olsder (1998) Tamer Başar and Geert Jan Olsder. Dynamic noncooperative game theory. SIAM, 1998.
  • Bensoussan et al. (2013) Alain Bensoussan, Jens Frehse, Phillip Yam, et al. Mean field games and mean field type control theory, volume 101. Springer, 2013.
  • Cardaliaguet and Lehalle (2018) Pierre Cardaliaguet and Charles-Albert Lehalle. Mean field game of controls and an application to trade crowding. Mathematics and Financial Economics, 12(3):335–363, 2018.
  • Carmona and Delarue (2018) Rene Carmona and François Delarue. Probabilistic Theory of Mean Field Games with Applications I. Springer, Cham, 2018.
  • Carmona et al. (2019a) René Carmona, Mathieu Laurière, and Zongjun Tan. Linear-quadratic mean-field reinforcement learning: convergence of policy gradient methods. arXiv preprint arXiv:1910.04295, 2019a.
  • Carmona et al. (2019b) René Carmona, Mathieu Laurière, and Zongjun Tan. Model-free mean-field reinforcement learning: mean-field MDP and mean-field Q-learning. arXiv preprint arXiv:1910.12802, 2019b.
  • Carmona et al. (2020) René Carmona, Kenza Hamidouche, Mathieu Laurière, and Zongjun Tan. Policy optimization for linear-quadratic zero-sum mean-field type games. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 1038–1043. IEEE, 2020.
  • Carmona et al. (2021) René Carmona, Kenza Hamidouche, Mathieu Laurière, and Zongjun Tan. Linear-quadratic zero-sum mean-field type games: Optimality conditions and policy optimization. Journal of Dynamics & Games, 8(4), 2021.
  • Choutri et al. (2019) Salah Eddine Choutri, Boualem Djehiche, and Hamidou Tembine. Optimal control and zero-sum games for Markov chains of mean-field type. Mathematical Control and Related Fields, 9(3):571–605, 2019.
  • Cosso and Pham (2019) Andrea Cosso and Huyên Pham. Zero-sum stochastic differential games of generalized McKean–Vlasov type. Journal de Mathématiques Pures et Appliquées, 129:180–212, 2019.
  • Cui and Koeppl (2021a) Kai Cui and Heinz Koeppl. Approximately solving mean field games via entropy-regularized deep reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 1909–1917. PMLR, 2021a.
  • Cui and Koeppl (2021b) Kai Cui and Heinz Koeppl. Learning graphon mean field games and approximate Nash equilibria. 2021b.
  • Elie et al. (2020) Romuald Elie, Julien Perolat, Mathieu Laurière, Matthieu Geist, and Olivier Pietquin. On the convergence of model free learning in mean field games. 34(05):7143–7150, 2020.
  • Fabian et al. (2023) Christian Fabian, Kai Cui, and Heinz Koeppl. Learning sparse graphon mean field games. In International Conference on Artificial Intelligence and Statistics, pages 4486–4514. PMLR, 2023.
  • Fallah et al. (2020) Alireza Fallah, Asuman Ozdaglar, and Sarath Pattathil. An optimal multistage stochastic gradient method for minimax problems. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 3573–3579. IEEE, 2020.
  • Fazel et al. (2018) Maryam Fazel, Rong Ge, Sham M Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476, 2018.
  • Gu et al. (2021) Haotian Gu, Xin Guo, Xiaoli Wei, and Renyuan Xu. Mean-field controls with Q-learning for cooperative MARL: convergence and complexity analysis. SIAM Journal on Mathematics of Data Science, 3(4):1168–1196, 2021.
  • Guan et al. (2024) Yue Guan, Mohammad Afshari, and Panagiotis Tsiotras. Zero-sum games between mean-field teams: Reachability-based analysis under mean-field sharing. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 9731–9739, 2024.
  • Guo et al. (2019) Xin Guo, Anran Hu, Renyuan Xu, and Junzi Zhang. Learning mean-field games. In Advances in Neural Information Processing Systems, 2019.
  • Harvey and Stein (1978) Charles Harvey and Gunter Stein. Quadratic weights for asymptotic regulator properties. IEEE Transactions on Automatic Control, 23(3):378–387, 1978.
  • He et al. (2023) Sihong He, Songyang Han, Sanbao Su, Shuo Han, Shaofeng Zou, and Fei Miao. Robust multi-agent reinforcement learning with state uncertainty. Transactions on Machine Learning Research, 2023.
  • Huang et al. (2003) Minyi Huang, Peter E Caines, and Roland P Malhamé. Individual and mass behaviour in large population stochastic wireless power control problems: Centralized and Nash equilibrium solutions. In IEEE International Conference on Decision and Control, volume 1, pages 98–103. IEEE, 2003.
  • Huang et al. (2006) Minyi Huang, Roland P Malhamé, and Peter E Caines. Large population stochastic dynamic games: Closed-loop Mckean-Vlasov systems and the Nash certainty equivalence principle. Communications in Information & Systems, 6(3):221–252, 2006.
  • Kober et al. (2013) Jens Kober, J Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013.
  • Kos and Song (2017) Jernej Kos and Dawn Song. Delving into adversarial attacks on deep policies. arXiv preprint arXiv:1705.06452, 2017.
  • Lasry and Lions (2006) Jean-Michel Lasry and Pierre-Louis Lions. Jeux à champ moyen. i–le cas stationnaire. Comptes Rendus Mathématique, 343(9):619–625, 2006.
  • Laurière et al. (2022a) Mathieu Laurière, Sarah Perrin, Sertan Girgin, Paul Muller, Ayush Jain, Theophile Cabannes, Georgios Piliouras, Julien Pérolat, Romuald Elie, Olivier Pietquin, et al. Scalable deep reinforcement learning algorithms for mean field games. In International Conference on Machine Learning, pages 12078–12095. PMLR, 2022a.
  • Laurière et al. (2022b) Mathieu Laurière, Sarah Perrin, Julien Pérolat, Sertan Girgin, Paul Muller, Romuald Élie, Matthieu Geist, and Olivier Pietquin. Learning mean field games: A survey. arXiv preprint arXiv:2205.12944, 2022b.
  • Li et al. (2019) Shihui Li, Yi Wu, Xinyue Cui, Honghua Dong, Fei Fang, and Stuart Russell. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 4213–4220, 2019.
  • Li et al. (2021) Yingying Li, Yujie Tang, Runyu Zhang, and Na Li. Distributed reinforcement learning for decentralized linear quadratic control: A derivative-free policy optimization approach. IEEE Transactions on Automatic Control, 67(12):6429–6444, 2021.
  • Lowe et al. (2017) Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017.
  • Malik et al. (2019) Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter Bartlett, and Martin Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2916–2925. PMLR, 2019.
  • Mondal et al. (2022) Washim Uddin Mondal, Mridul Agarwal, Vaneet Aggarwal, and Satish V Ukkusuri. On the approximation of cooperative heterogeneous multi-agent reinforcement learning (marl) using mean field control (mfc). The Journal of Machine Learning Research, 23(1):5614–5659, 2022.
  • Morimoto and Doya (2005) Jun Morimoto and Kenji Doya. Robust reinforcement learning. Neural computation, 17(2):335–359, 2005.
  • Pérolat et al. (2022) Julien Pérolat, Sarah Perrin, Romuald Elie, Mathieu Laurière, Georgios Piliouras, Matthieu Geist, Karl Tuyls, and Olivier Pietquin. Scaling mean field games by online mirror descent. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 1028–1037, 2022.
  • Perrin et al. (2020) Sarah Perrin, Julien Pérolat, Mathieu Laurière, Matthieu Geist, Romuald Elie, and Olivier Pietquin. Fictitious play for mean field games: Continuous time analysis and applications. Advances in Neural Information Processing Systems, 33:13199–13213, 2020.
  • Perrin et al. (2021) Sarah Perrin, Mathieu Laurière, Julien Pérolat, Matthieu Geist, Romuald Élie, and Olivier Pietquin. Mean field games flock! The reinforcement learning way. In Proceedings of IJCAI, 2021.
  • Recht (2019) Benjamin Recht. A tour of reinforcement learning: The view from continuous control. Annual Review of Control, Robotics, and Autonomous Systems, 2:253–279, 2019.
  • Riley et al. (2021) Joshua Riley, Radu Calinescu, Colin Paterson, Daniel Kudenko, and Alec Banks. Utilising assured multi-agent reinforcement learning within safety-critical scenarios. Procedia Computer Science, 192:1061–1070, 2021.
  • Sallab et al. (2017) Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. arXiv preprint arXiv:1704.02532, 2017.
  • Sargent and Ljungqvist (2000) Thomas J Sargent and Lars Ljungqvist. Recursive macroeconomic theory. Massachusetts Institute of Technology, 2000.
  • Simchowitz et al. (2020) Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. In Conference on Learning Theory, pages 3320–3436. PMLR, 2020.
  • Sun et al. (2022) Chuangchuang Sun, Dong-Ki Kim, and Jonathan P How. Romax: Certifiably robust deep multiagent reinforcement learning via convex relaxation. In 2022 International Conference on Robotics and Automation (ICRA), pages 5503–5510. IEEE, 2022.
  • Tembine (2017) Hamidou Tembine. Mean-field-type games. AIMS Math, 2(4):706–735, 2017.
  • Xie et al. (2021) Qiaomin Xie, Zhuoran Yang, Zhaoran Wang, and Andreea Minca. Learning while playing in mean-field games: Convergence and optimality. In International Conference on Machine Learning, pages 11436–11447. PMLR, 2021.
  • Yardim et al. (2023) Batuhan Yardim, Semih Cayci, Matthieu Geist, and Niao He. Policy mirror ascent for efficient and independent learning in mean field games. In International Conference on Machine Learning, pages 39722–39754. PMLR, 2023.
  • Yongacoglu et al. (2022) Bora Yongacoglu, Gürdal Arslan, and Serdar Yüksel. Independent learning and subjectivity in mean-field games. In 2022 IEEE 61st Conference on Decision and Control (CDC), pages 2845–2850. IEEE, 2022.
  • Zaman et al. (2020) Muhammad Aneeq uz Zaman, Kaiqing Zhang, Erik Miehling, and Tamer Başar. Reinforcement learning in non-stationary discrete-time linear-quadratic mean-field games. In 2020 59th IEEE Conference on Decision and Control (CDC), pages 2278–2284. IEEE, 2020.
  • Zaman et al. (2021) Muhammad Aneeq Uz Zaman, Sujay Bhatt, and Tamer Başar. Adversarial linear-quadratic mean-field games over multigraphs. In 2021 60th IEEE Conference on Decision and Control (CDC), pages 209–214. IEEE, 2021.
  • Zaman et al. (2023a) Muhammad Aneeq uz Zaman, Alec Koppel, Sujay Bhatt, and Tamer Başar. Oracle-free reinforcement learning in mean-field games along a single sample path. In International Conference on Artificial Intelligence and Statistics, pages 10178–10206. PMLR, 2023a.
  • Zaman et al. (2023b) Muhammad Aneeq Uz Zaman, Erik Miehling, and Tamer Başar. Reinforcement learning for non-stationary discrete-time linear–quadratic mean-field games in multiple populations. Dynamic Games and Applications, 13(1):118–164, 2023b.
  • Zhang et al. (2020a) Huan Zhang, Hongge Chen, Chaowei Xiao, Bo Li, Mingyan Liu, Duane Boning, and Cho-Jui Hsieh. Robust deep reinforcement learning against adversarial perturbations on state observations. Advances in Neural Information Processing Systems, 33:21024–21037, 2020a.
  • Zhang et al. (2020b) Kaiqing Zhang, Tao Sun, Yunzhe Tao, Sahika Genc, Sunil Mallya, and Tamer Başar. Robust multi-agent reinforcement learning with model uncertainty. Advances in neural information processing systems, 33:10571–10583, 2020b.
  • Zhang et al. (2021a) Kaiqing Zhang, Bin Hu, and Tamer Başar. Policy optimization for $\mathcal{H}_2$ linear control with $\mathcal{H}_\infty$ robustness guarantee: Implicit regularization and global convergence. SIAM Journal on Control and Optimization, 59(6):4081–4109, 2021a.
  • Zhang et al. (2021b) Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pages 321–384, 2021b.
  • Zhang et al. (2021c) Kaiqing Zhang, Xiangyuan Zhang, Bin Hu, and Tamer Başar. Derivative-free policy optimization for linear risk-sensitive and robust control design: Implicit regularization and sample complexity. Advances in Neural Information Processing Systems, 34:2949–2964, 2021c.
  • Zhang and Başar (2023) Xiangyuan Zhang and Tamer Başar. Revisiting LQR control from the perspective of receding-horizon policy gradient. IEEE Control Systems Letters, 2023.
  • Zhang et al. (2023) Xiangyuan Zhang, Bin Hu, and Tamer Başar. Learning the kalman filter with fine-grained sample complexity. arXiv preprint arXiv:2301.12624, 2023.
  • Ziegler et al. (2019) Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

Supplementary Materials

Proof of Theorem 3.3

Proof 6.1.

Central to this analysis is the quantification of the difference between the finite and infinite population costs for a given set of control policies. First we express the state and mean-field processes in terms of the noise processes, for the finite and infinite population settings. This then allows us to write the costs (in both settings) as quadratic functions of the noise process, which simplifies quantification of the difference between these two costs.

Let us first consider the finite-agent setting with $M$ agents. Consider the dynamics of the state $x^{j}$ of agent $j\in[M]$ under the NE of the MFTG (Theorem 3.2),

x^{j*,M}_{t+1}=(A_{t}-B_{t}K^{1*}_{t}+K^{2*}_{t})x^{j*,M}_{t}+(\bar{A}_{t}-B_{t}(L^{1*}_{t}-K^{1*}_{t})-\bar{B}_{t}L^{1*}_{t}+L^{2*}_{t}-K^{2*}_{t})\bar{x}^{M*}_{t}+\omega^{j}_{t}+\bar{\omega}_{t} \qquad (19)

where the superscript $M$ denotes the dynamics in the finite population game (1) and $\bar{x}^{M*}_{t}=\frac{1}{M}\sum_{j\in[M]}x^{j*,M}_{t}$ is the empirical mean-field. We can also write the dynamics of the empirical mean-field as

\bar{x}^{M*}_{t+1}=\underbrace{\big(A_{t}+\bar{A}_{t}-B_{t}L^{1*}_{t}+L^{2*}_{t}\big)}_{\tilde{L}^{*}_{t}}\bar{x}^{M*}_{t}+\underbrace{\frac{1}{M}\sum_{j\in[M]}\omega^{j}_{t}+\bar{\omega}_{t}}_{\bar{\omega}^{M}_{t}}. \qquad (20)

For simplicity we assume that $x^{j*,M}_{0}=\omega^{j}_{0}+\bar{\omega}_{0}$, which also implies that $\bar{x}^{M*}_{0}=\bar{\omega}^{M}_{0}$. Using (20) we get the recursive definition of $\bar{x}^{M*}_{t}$ as

\bar{x}^{M*}_{t}=\sum_{s=0}^{t}\tilde{L}^{*}_{[t-1,s]}\bar{\omega}^{M}_{s},\quad \text{where } \tilde{L}^{*}_{[s,t]}:=\tilde{L}^{*}_{s}\tilde{L}^{*}_{s-1}\cdots\tilde{L}^{*}_{t} \text{ if } s\geq t, \text{ and } \tilde{L}^{*}_{[s,t]}=I \text{ otherwise}.

Hence $\bar{x}^{M*}_{t}$ can be characterized as a linear function of the noise process,

\bar{x}^{M*}_{t}=\big(\bar{\Psi}^{*}\bar{\omega}^{M}\big)_{t},\quad \text{where } \bar{\Psi}^{*}=\begin{pmatrix}I&0&0&\ldots\\ \tilde{L}^{*}_{[0,0]}&I&0&\ldots\\ \tilde{L}^{*}_{[1,0]}&\tilde{L}^{*}_{[1,1]}&I&\ldots\\ \tilde{L}^{*}_{[2,0]}&\tilde{L}^{*}_{[2,1]}&\tilde{L}^{*}_{[2,2]}&\ldots\\ \vdots&\vdots&\vdots&\ddots\end{pmatrix} \text{ and } \bar{\omega}^{M}=\begin{pmatrix}\bar{\omega}^{M}_{0}\\ \bar{\omega}^{M}_{1}\\ \bar{\omega}^{M}_{2}\\ \vdots\\ \bar{\omega}^{M}_{T}\end{pmatrix}

where $(M)_{t}$ denotes the $t$-th block of matrix $M$ and the covariance matrix of $\bar{\omega}^{M}$ is $\mathbb{E}[\bar{\omega}^{M}(\bar{\omega}^{M})^{\top}]=\operatorname{diag}((\Sigma/M+\Sigma^{0})_{0\leq t\leq T})$. Similarly we characterize the deviation process $x^{j*,M}_{t}-\bar{x}^{M*}_{t}$, using (19) and (20),

x^{j*,M}_{t+1}-\bar{x}^{M*}_{t+1}=\underbrace{(A_{t}-B_{t}K^{1*}_{t}+K^{2*}_{t})}_{L^{*}_{t}}(x^{j*,M}_{t}-\bar{x}^{M*}_{t})+\underbrace{\frac{M-1}{M}\omega^{j}_{t}+\frac{1}{M}\sum_{k\neq j}\omega^{k}_{t}}_{\bar{\omega}^{j,M}_{t}}.

Hence

x^{j*,M}_{t}-\bar{x}^{M*}_{t}=\sum_{s=0}^{t}L^{*}_{[t-1,s]}\bar{\omega}^{j,M}_{s}=\big(\Psi^{*}\bar{\omega}^{j,M}\big)_{t},\quad \text{where } \bar{\omega}^{j,M}=\frac{M-1}{M}\begin{pmatrix}\omega^{j}_{0}\\ \vdots\\ \omega^{j}_{T-1}\end{pmatrix}+\frac{1}{M}\begin{pmatrix}\sum_{k\neq j}\omega^{k}_{0}\\ \vdots\\ \sum_{k\neq j}\omega^{k}_{T-1}\end{pmatrix}

where the covariance matrix of $\bar{\omega}^{j,M}$ is $\mathbb{E}[\bar{\omega}^{j,M}(\bar{\omega}^{j,M})^{\top}]=\operatorname{diag}(((M-1)/M\times\Sigma)_{0\leq t\leq T})$. Similarly the infinite agent limit of this process is $x^{*}_{t}-\bar{x}^{*}_{t}=(\Psi^{*}\omega)_{t}$, where $\omega=(\omega^{\top}_{0},\ldots,\omega^{\top}_{T-1})^{\top}$, whose covariance is $\mathbb{E}[\omega\omega^{\top}]=\operatorname{diag}((\Sigma)_{0\leq t\leq T})$. Now we compute the finite agent cost in terms of the noise processes,

J^{\gamma}_{M}(\mathcal{u}^{*})=\mathbb{E}\bigg[\frac{1}{M}\sum_{j\in[M]}\sum_{t=0}^{T}\lVert x^{j*,M}_{t}-\bar{x}^{M*}_{t}\rVert^{2}_{Q_{t}}+\lVert\bar{x}^{M*}_{t}\rVert^{2}_{\bar{Q}_{t}}+\lVert u^{1,j*,M}_{t}-\bar{u}^{1,M*}_{t}\rVert^{2}+\lVert\bar{u}^{1,M*}_{t}\rVert^{2}-\gamma^{2}\big(\lVert u^{2,j*,M}_{t}-\bar{u}^{2,M*}_{t}\rVert^{2}+\lVert\bar{u}^{2,M*}_{t}\rVert^{2}\big)\bigg]

=\mathbb{E}\bigg[\frac{1}{M}\sum_{j\in[M]}\sum_{t=0}^{T}\lVert x^{j*,M}_{t}-\bar{x}^{M*}_{t}\rVert^{2}_{Q_{t}+(K^{1,*}_{t})^{\top}K^{1,*}_{t}-\gamma^{2}(K^{2,*}_{t})^{\top}K^{2,*}_{t}}+\lVert\bar{x}^{M*}_{t}\rVert^{2}_{\bar{Q}_{t}+(L^{1,*}_{t})^{\top}L^{1,*}_{t}-\gamma^{2}(L^{2,*}_{t})^{\top}L^{2,*}_{t}}\bigg]

=\mathbb{E}\bigg[\frac{1}{M}\sum_{j\in[M]}(\Psi^{*}\bar{\omega}^{j,M})^{\top}\big(Q+(K^{*})^{\top}RK^{*}\big)\Psi^{*}\bar{\omega}^{j,M}+(\bar{\Psi}^{*}\bar{\omega}^{M})^{\top}\big(\bar{Q}+(\bar{K}^{*})^{\top}\bar{R}\bar{K}^{*}\big)\bar{\Psi}^{*}\bar{\omega}^{M}\bigg]

=\operatorname{Tr}\big((\Psi^{*})^{\top}\big(Q+(K^{*})^{\top}RK^{*}\big)\Psi^{*}\mathbb{E}\big[\bar{\omega}^{j,M}(\bar{\omega}^{j,M})^{\top}\big]\big)+\operatorname{Tr}\big((\bar{\Psi}^{*})^{\top}\big(\bar{Q}+(\bar{K}^{*})^{\top}\bar{R}\bar{K}^{*}\big)\bar{\Psi}^{*}\mathbb{E}\big[\bar{\omega}^{M}(\bar{\omega}^{M})^{\top}\big]\big)

where $Q=\operatorname{diag}((Q_{t})_{0\leq t\leq T})$, $R=\operatorname{diag}((\operatorname{diag}(I,-\gamma^{2}I))_{0\leq t\leq T})$ and $K^{*}=\operatorname{diag}((K^{1,*}_{t},K^{2,*}_{t})_{0\leq t\leq T})$ with $K^{1,*}_{T}=0$ and $K^{2,*}_{T}=0$. Similarly $\bar{Q}=\operatorname{diag}((\bar{Q}_{t})_{0\leq t\leq T})$, $\bar{R}=\operatorname{diag}((\operatorname{diag}(I,-\gamma^{2}I))_{0\leq t\leq T})$ and $\bar{K}^{*}=\operatorname{diag}((L^{1,*}_{t},L^{2,*}_{t})_{0\leq t\leq T})$ with $L^{1,*}_{T}=0$ and $L^{2,*}_{T}=0$. Using a similar technique we can compute the infinite agent cost:

J^{\gamma}(\mathcal{u}^{*})=\mathbb{E}\bigg[\sum_{t=0}^{T}\lVert x^{*}_{t}-\bar{x}^{*}_{t}\rVert^{2}_{Q_{t}}+\lVert\bar{x}^{*}_{t}\rVert^{2}_{\bar{Q}_{t}}+\lVert u^{*}_{t}-\bar{u}^{*}_{t}\rVert^{2}+\lVert\bar{u}^{*}_{t}\rVert^{2}\bigg]=\operatorname{Tr}\big((\Psi^{*})^{\top}\big(Q+(K^{*})^{\top}RK^{*}\big)\Psi^{*}\mathbb{E}\big[\omega\omega^{\top}\big]\big)+\operatorname{Tr}\big((\bar{\Psi}^{*})^{\top}\big(\bar{Q}+(\bar{K}^{*})^{\top}\bar{R}\bar{K}^{*}\big)\bar{\Psi}^{*}\mathbb{E}\big[\bar{\omega}^{0}(\bar{\omega}^{0})^{\top}\big]\big).
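As a purely illustrative aside (not part of the formal argument), the following minimal NumPy sketch assembles the block lower-triangular matrix $\bar{\Psi}^{*}$ from a sequence of closed-loop matrices and evaluates a trace of the form appearing above against the stacked noise covariance $\operatorname{diag}(\Sigma/M+\Sigma^{0})$. All matrices, dimensions and constants in the sketch are arbitrary placeholders rather than the quantities defined in the paper.

import numpy as np

rng = np.random.default_rng(0)
T, n, M = 5, 3, 100                  # horizon, state dimension, number of agents (placeholders)
L_tilde = [0.3 * rng.standard_normal((n, n)) for _ in range(T)]   # placeholders for the closed-loop matrices \tilde{L}^*_t

def L_prod(s, t):
    # \tilde{L}^*_{[s,t]} = \tilde{L}^*_s \tilde{L}^*_{s-1} ... \tilde{L}^*_t if s >= t, and the identity otherwise
    out = np.eye(n)
    for k in range(t, s + 1):
        out = L_tilde[k] @ out
    return out

# Block lower-triangular \bar{\Psi}^*: block (t, s) equals \tilde{L}^*_{[t-1, s]} for s <= t, zero above the diagonal.
Psi_bar = np.zeros(((T + 1) * n, (T + 1) * n))
for t in range(T + 1):
    for s in range(t + 1):
        Psi_bar[t * n:(t + 1) * n, s * n:(s + 1) * n] = L_prod(t - 1, s)

# Placeholder block-diagonal weight standing in for \bar{Q} + (\bar{K}^*)^T \bar{R} \bar{K}^*.
W = np.zeros_like(Psi_bar)
for t in range(T + 1):
    G = rng.standard_normal((n, n))
    W[t * n:(t + 1) * n, t * n:(t + 1) * n] = G @ G.T

# Covariance of the stacked aggregate noise \bar{\omega}^M: diag(\Sigma / M + \Sigma^0).
Sigma, Sigma0 = np.eye(n), 0.5 * np.eye(n)
Cov = np.kron(np.eye(T + 1), Sigma / M + Sigma0)

# Mean-field contribution to the finite-population cost, evaluated as a trace as in the display above.
print(np.trace(Psi_bar.T @ W @ Psi_bar @ Cov))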

Now we evaluate the difference between the finite and infinite population costs:

\lvert J^{\gamma}_{M}(\mathcal{u}^{*})-J^{\gamma}(\mathcal{u}^{*})\rvert=\Big|\operatorname{Tr}\big((\Psi^{*})^{\top}\big(Q+(K^{*})^{\top}RK^{*}\big)\Psi^{*}\big(\mathbb{E}\big[\bar{\omega}^{j,M}(\bar{\omega}^{j,M})^{\top}\big]-\mathbb{E}\big[\omega\omega^{\top}\big]\big)\big)+\operatorname{Tr}\big((\bar{\Psi}^{*})^{\top}\big(\bar{Q}+(\bar{K}^{*})^{\top}\bar{R}\bar{K}^{*}\big)\bar{\Psi}^{*}\big(\mathbb{E}\big[\bar{\omega}^{M}(\bar{\omega}^{M})^{\top}\big]-\mathbb{E}\big[\bar{\omega}^{0}(\bar{\omega}^{0})^{\top}\big]\big)\big)\Big|

\leq\underbrace{\Big(\lVert(\Psi^{*})^{\top}\big(Q+(K^{*})^{\top}RK^{*}\big)\Psi^{*}\rVert_{F}+\lVert(\bar{\Psi}^{*})^{\top}\big(\bar{Q}+(\bar{K}^{*})^{\top}\bar{R}\bar{K}^{*}\big)\bar{\Psi}^{*}\rVert_{F}\Big)}_{C^{*}_{1}}\operatorname{Tr}\big(\operatorname{diag}((\Sigma/M)_{0\leq t\leq T})\big)

\leq C^{*}_{1}\frac{\sigma T}{M} \qquad (21)

where $\sigma=\lVert\Sigma\rVert_{F}$. Now let us consider the same dynamics but under an arbitrary (not necessarily NE) set of controls $\mathcal{u}$. The difference between the finite and infinite population costs under these controls is

\lvert J^{\gamma}_{M}(\mathcal{u})-J^{\gamma}(\mathcal{u})\rvert\leq\underbrace{\Big(\lVert\Psi^{\top}\big(Q+K^{\top}RK\big)\Psi\rVert_{F}+\lVert\bar{\Psi}^{\top}\big(\bar{Q}+\bar{K}^{\top}\bar{R}\bar{K}\big)\bar{\Psi}\rVert_{F}\Big)}_{C_{1}}\frac{\sigma T}{M}. \qquad (22)

Using this bound we can deduce that if

\inf_{\mathcal{u}^{1}}\sup_{\mathcal{u}^{2}}J^{\gamma}(\mathcal{u}^{1},\mathcal{u}^{2})+C_{1}\frac{\sigma T}{M}-\gamma^{2}\mathbb{E}\sum_{t=0}^{T-1}\big(\lVert\omega_{t}\rVert^{2}+\lVert\bar{\omega}_{t}\rVert^{2}\big)\leq 0, \qquad (23)

then the robustness condition (6) will be satisfied. Using results from Theorem 3.2 we can rewrite (23) as the following condition,

\sum_{t=1}^{T}\operatorname{Tr}\big((M^{\gamma}_{t}-\gamma^{2}I)\Sigma+(\bar{M}^{\gamma}_{t}-\gamma^{2}I)\bar{\Sigma}\big)+\operatorname{Tr}(M^{\gamma}_{0}\Sigma^{0})+\operatorname{Tr}(\bar{M}^{\gamma}_{0}\bar{\Sigma}^{0})\leq-\frac{CT}{N},

which concludes the proof.

Proof of Theorem 4.1

Proof 6.2.

We start by analyzing the problem associated with the process $y^{i}$ and the controller $\bar{K}_{t}=[(K^{1}_{t})^{\top},(K^{2}_{t})^{\top}]^{\top}$. Using the Lyapunov equation, the cost for the given set of future controllers $\big((\tilde{K}_{t+1},\tilde{L}_{t+1}),\ldots,(\tilde{K}_{T},\tilde{L}_{T})\big)$ can be expressed in terms of a symmetric matrix $\tilde{M}_{t+1}$ and a covariance term $\tilde{\Sigma}^{y}_{t}$,

\min_{K^{1}_{t}}\max_{K^{2}_{t}}\tilde{J}^{i,\gamma}_{y,t}(\bar{K}_{t})=\mathbb{E}_{y}\big[y^{\top}(Q_{t}+\bar{K}^{\top}_{t}\bar{R}\bar{K}_{t})y+y^{\top}(A-\bar{B}_{t}\bar{K}_{t})^{\top}\tilde{M}_{t+1}(A-\bar{B}_{t}\bar{K}_{t})y\big]+\tilde{\Sigma}^{y}_{t},

where $\bar{B}_{t}=[B_{t},I]$ and $\bar{R}=\operatorname{diag}(I,-\gamma^{2}I)$. The quantities $\tilde{M}_{t+1}$ and $\tilde{\Sigma}^{y}_{t}$ are fixed and can be computed offline using $\big((\tilde{K}_{t+1},\tilde{L}_{t+1}),\ldots,(\tilde{K}_{T},\tilde{L}_{T})\big)$ and $\mathbb{E}[\tilde{\omega}^{i}_{s}(\tilde{\omega}^{i}_{s})^{\top}]$:

\tilde{M}_{s}=Q_{s}+(\tilde{K}^{1}_{s})^{\top}\tilde{K}^{1}_{s}-\gamma^{2}(\tilde{K}^{2}_{s})^{\top}\tilde{K}^{2}_{s}+\gamma(A_{s}-B_{s}\tilde{K}^{1}_{s}+\tilde{K}^{2}_{s})^{\top}\tilde{M}_{s+1}(A_{s}-B_{s}\tilde{K}^{1}_{s}+\tilde{K}^{2}_{s}),

\tilde{\Sigma}^{y}_{s}=\operatorname{Tr}(\tilde{M}_{s+1}\mathbb{E}[\tilde{\omega}^{i}_{s}(\tilde{\omega}^{i}_{s})^{\top}])+\tilde{\Sigma}^{y}_{s+1},

for all $s\geq t+1$. For a given $t\in\{T-1,\ldots,1,0\}$ the exact policy gradient of the cost $\tilde{J}^{i,\gamma}_{y,t}$ with respect to $K^{1}_{t}$ and $K^{2}_{t}$ is

\frac{\delta\tilde{J}^{i,\gamma}_{y,t}}{\delta K^{1}_{t}}=2\big((I+B^{\top}_{t}\tilde{M}_{t+1}B_{t})K^{1}_{t}-B^{\top}_{t}\tilde{M}_{t+1}(A_{t}-K^{2}_{t})\big)\Sigma_{y}, \qquad (24)

-\frac{\delta\tilde{J}^{i,\gamma}_{y,t}}{\delta K^{2}_{t}}=-2\big((-\gamma^{2}I+\tilde{M}_{t+1})K^{2}_{t}+\tilde{M}_{t+1}(A_{t}-B_{t}K^{1}_{t})\big)\Sigma_{z}.
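As an illustrative aside, the following NumPy sketch evaluates the gradient expressions in (24) for placeholder system matrices and checks that they vanish at the solution of the corresponding first-order stationarity system. The dimensions, $\tilde{M}_{t+1}$ and $\gamma$ are assumed values, chosen only so that $\tilde{M}_{t+1}-\gamma^{2}I>0$ as required for (25).

import numpy as np

rng = np.random.default_rng(1)
n, d = 3, 2                                  # state and control dimensions (placeholders)
A = 0.5 * rng.standard_normal((n, n))        # stand-in for A_t
B = rng.standard_normal((n, d))              # stand-in for B_t
M_next = 2.0 * np.eye(n)                     # stand-in for \tilde{M}_{t+1}
gamma2 = 1.0                                 # \gamma^2, chosen so that \tilde{M}_{t+1} - \gamma^2 I > 0
Sigma_y, Sigma_z = np.eye(n), np.eye(n)

def grads(K1, K2):
    # Exact policy gradients (24) of \tilde{J}^{i,\gamma}_{y,t} with respect to K^1_t and K^2_t.
    g1 = 2 * ((np.eye(d) + B.T @ M_next @ B) @ K1 - B.T @ M_next @ (A - K2)) @ Sigma_y
    g2 = -2 * ((-gamma2 * np.eye(n) + M_next) @ K2 + M_next @ (A - B @ K1)) @ Sigma_z
    return g1, g2

# Stationary point: set both gradients to zero and solve the resulting coupled linear system.
top = np.hstack([np.eye(d) + B.T @ M_next @ B, B.T @ M_next])
bot = np.hstack([-M_next @ B, M_next - gamma2 * np.eye(n)])
rhs = np.vstack([B.T @ M_next @ A, -M_next @ A])
sol = np.linalg.solve(np.vstack([top, bot]), rhs)
K1_star, K2_star = sol[:d], sol[d:]

g1, g2 = grads(K1_star, K2_star)
print(np.linalg.norm(g1), np.linalg.norm(g2))   # both norms are numerically zero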

First we prove smoothness and the convex-concave property of the function $\tilde{J}^{i,\gamma}_{y,t}$. Due to the projection operation $\operatorname{Proj}_{D}$, the norms of $K^{1}_{t},K^{2}_{t}$ are bounded by the scalar $D$. Using Definition 2.1 in (Fallah et al., 2020) and (24), the function $\tilde{J}^{i,\gamma}_{y,t}$ is smooth with smoothness parameter

L=\big(\lVert I+B^{\top}_{t}\tilde{M}_{t+1}B_{t}\rVert+\lVert\tilde{M}_{t+1}-\gamma^{2}I\rVert\big)(\lVert\Sigma_{y}\rVert+\lVert\Sigma_{z}\rVert).

The function $\tilde{J}^{i,\gamma}_{y,t}$ is also convex-concave since, using (24), the convexity-concavity parameters can be obtained by direct calculation as

\mu_{x}=\sigma\big((I+B^{\top}_{t}\tilde{M}_{t+1}B_{t})\Sigma_{y}\big)>0\ \text{ and }\ \mu_{y}=\sigma\big((\tilde{M}_{t+1}-\gamma^{2}I)\Sigma_{z}\big)>0, \qquad (25)

where the second inequality is ensured when $\tilde{M}_{t+1}-\gamma^{2}I>0$. Moreover, satisfying Assumption 2.3 in (Fallah et al., 2020) requires component-wise smoothness, which is satisfied with $L_{x}=\lVert I+B^{\top}_{t}\tilde{M}_{t+1}B_{t}\rVert\lVert\Sigma_{y}\rVert$ and $L_{y}=\lVert\tilde{M}_{t+1}-\gamma^{2}I\rVert\lVert\Sigma_{z}\rVert$. Having satisfied Assumption 2.3 in (Fallah et al., 2020), we now generalize the proof of convergence of gradient descent-ascent to biased stochastic gradients, as is the case with $\tilde{\nabla}_{1}\tilde{J}^{i,\gamma}_{t}$ and $\tilde{\nabla}_{2}\tilde{J}^{i,\gamma}_{t}$.

Let us first introduce some notation. $\nabla_{K_{i,t}}\tilde{J}^{i,\gamma}(K_{t},L_{t})$ is the exact gradient of the cost $\tilde{J}^{i,\gamma}$ w.r.t. the controller $K_{i,t}$,

\nabla_{K_{i,t}}\tilde{J}^{i,\gamma}(K_{t},L_{t})=\frac{\delta\tilde{J}^{i,\gamma}(K_{t},L_{t})}{\delta K_{i,t}};

$\tilde{\nabla}_{K_{i,t}}\tilde{J}^{i,\gamma}(K_{t},L_{t})$ is the stochastic gradient of the cost $\tilde{J}^{i,\gamma}$ w.r.t. the controller $K_{i,t}$,

\tilde{\nabla}_{K_{i,t}}\tilde{J}^{i,\gamma}_{t}(K_{t},L_{t})=\frac{n}{Mr^{2}}\sum_{j=1}^{M}\tilde{J}^{i,\gamma}((K^{j,1}_{t},K^{2}_{t}),L_{t})e_{j},\quad K^{j,1}_{t}=K^{1}_{t}+e_{j},\quad e_{j}\sim\mathbb{S}^{n-1}(r);

and $\nabla^{r}_{K_{i,t}}\tilde{J}^{i,\gamma}(K_{t},L_{t})$ is the smoothed gradient of the cost $\tilde{J}^{i,\gamma}$ w.r.t. the controller $K_{i,t}$,

\nabla^{r}_{K_{i,t}}\tilde{J}^{i,\gamma}(K_{t},L_{t})=\frac{m}{r}\mathbb{E}_{e}\big[\tilde{J}^{i,\gamma}((K_{i,t}+e,K_{-i,t}),L_{t})e\big],\quad \text{where } e\sim\mathbb{S}^{m-1}(r).
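As an illustrative aside, the following self-contained NumPy sketch implements this one-point mini-batched estimator on a toy quadratic objective, for which the sphere-smoothed gradient coincides with the exact gradient. The controller shape, smoothing radius and batch size are assumed placeholder values, and the objective is not the LQ cost itself.

import numpy as np

rng = np.random.default_rng(2)
d1, d2 = 2, 3                        # controller shape (placeholder); its entry count plays the role of n above
dim = d1 * d2
r, M_batch = 0.5, 50000              # smoothing radius and mini-batch size (assumed values)

H = 2.0 * np.eye(dim)                # toy quadratic objective J(K) = 0.5 * vec(K)^T H vec(K)
K = rng.standard_normal((d1, d2))
v = K.reshape(-1)

# One-point mini-batched estimator (dim / (M r^2)) * sum_j J(K + e_j) e_j, with e_j uniform on the sphere of radius r.
E = rng.standard_normal((M_batch, dim))
E *= r / np.linalg.norm(E, axis=1, keepdims=True)
J_vals = 0.5 * np.einsum('bi,ij,bj->b', v + E, H, v + E)
est = (dim / (M_batch * r ** 2)) * (J_vals @ E)

true_grad = H @ v                    # exact (and, for a quadratic, also smoothed) gradient
print(np.linalg.norm(est - true_grad) / np.linalg.norm(true_grad))   # relative error shrinks as the batch size grows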

Now we introduce some results from the literature. The following result shows that the stochastic gradient is an unbiased estimator of the smoothed gradient, and that the bias between the smoothed gradient and the exact gradient is bounded by a linear function of the smoothing radius $r$.

Lemma 6.3 (Fazel et al., 2018).

\mathbb{E}[\tilde{\nabla}_{K_{i,t}}\tilde{J}^{i,\gamma}(K_{t},L_{t})]=\nabla^{r}_{K_{i,t}}\tilde{J}^{i,\gamma}(K_{t},L_{t}),

\lVert\nabla^{r}_{K_{i,t}}\tilde{J}^{i,\gamma}(K_{t},L_{t})-\nabla_{K_{i,t}}\tilde{J}^{i,\gamma}(K_{t},L_{t})\rVert_{2}\leq\phi_{i}r,\quad \text{where } \phi_{i}=\lVert I+B^{\top}_{i}M^{*}B_{i}\rVert_{2}.

Next we present a result showing that the difference between the mini-batched stochastic gradient and the smoothed gradient can be bounded with high probability.

Lemma 6.4 (Malik et al., 2019).

\lVert\tilde{\nabla}_{K_{i,t}}\tilde{J}^{i,\gamma}(K_{t},L_{t})-\nabla^{r}_{K_{i,t}}\tilde{J}^{i,\gamma}(K_{t},L_{t})\rVert_{2}\leq\frac{m}{Mr}(\tilde{J}^{i,\gamma}(K_{t},L_{t})+\lambda r)\sqrt{\log\Big(\frac{2m}{\delta}\Big)}

with probability $1-\delta$.

Now we compute finite sample guarantees for the Gradient Descent Ascent (GDA) update as given in Algorithm 1. For each $t\in[T]$, let us concatenate $K^{1}_{t}$ and $K^{2}_{t}$ as $\bar{K}_{t}=\begin{pmatrix}K^{1}_{t}\\ K^{2}_{t}\end{pmatrix}$. Let us also denote the optimal controller which optimizes $\tilde{J}^{i,\gamma}_{y,t}$ by $\bar{K}^{*}_{t}=\begin{pmatrix}K^{1*}_{t}\\ K^{2*}_{t}\end{pmatrix}$, such that $\tilde{J}^{i,\gamma}_{y,t}(K^{1*}_{t},K^{2}_{t})\leq\tilde{J}^{i,\gamma}_{y,t}(K^{1*}_{t},K^{2*}_{t})\leq\tilde{J}^{i,\gamma}_{y,t}(K^{1}_{t},K^{2*}_{t})$. Since the timestep $t$ is fixed inside the inner loop of the algorithm, we drop it and instead use the iteration index $k\in[K]$. The update rule in Algorithm 1 is given by

\bar{K}_{k+1}=\bar{K}_{k}-\eta_{k}\begin{pmatrix}-\tilde{\nabla}_{1}\tilde{J}^{i,\gamma}_{y}(\bar{K}_{k})\\ \tilde{\nabla}_{2}\tilde{J}^{i,\gamma}_{y}(\bar{K}_{k})\end{pmatrix}=\bar{K}_{k}-\eta\tilde{\nabla}\tilde{J}^{i,\gamma}_{y}(\bar{K}_{k})=X\bar{K}_{k}+Y\tilde{\nabla}\tilde{J}^{i,\gamma}_{y}(\bar{K}_{k})

where $\tilde{\nabla}_{1}\tilde{J}^{i,\gamma}_{y}(\bar{K}_{k})=\tilde{\nabla}_{K^{1}_{t}}\tilde{J}^{i,\gamma}_{y}(\bar{K}_{k})$, $\tilde{\nabla}_{2}\tilde{J}^{i,\gamma}_{y}(\bar{K}_{k})=\tilde{\nabla}_{K^{2}_{t}}\tilde{J}^{i,\gamma}_{y}(\bar{K}_{k})$, $X=I$ and $Y=\eta I$. We define the controller error as $\hat{K}_{k}=\bar{K}_{k}-\bar{K}^{*}_{k}$; the evolution of this error is

\hat{K}_{k+1}=X\hat{K}_{k}+Y\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k}).
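As an illustrative aside, the following NumPy sketch runs the projected gradient descent-ascent update with inexact gradients on a toy strongly convex-strongly concave quadratic saddle problem (not the LQ cost itself). The step size, gradient-error level and projection radius $D$ are assumed placeholder values.

import numpy as np

rng = np.random.default_rng(3)
dx, dy = 3, 3
mu = 1.0                                     # strong convexity/concavity parameter (placeholder for m below)
C = 0.3 * rng.standard_normal((dx, dy))      # coupling term
# Toy objective f(x, y) = (mu/2)||x||^2 - (mu/2)||y||^2 + x^T C y, whose unique saddle point is the origin.

eta, noise, D = 0.1, 0.01, 10.0              # step size, gradient-error level, projection radius (assumed values)
x, y = rng.standard_normal(dx), rng.standard_normal(dy)
for k in range(200):
    gx = mu * x + C @ y + noise * rng.standard_normal(dx)     # inexact gradient for the minimizing player
    gy = -mu * y + C.T @ x + noise * rng.standard_normal(dy)  # inexact gradient for the maximizing player
    x, y = x - eta * gx, y + eta * gy                         # simultaneous descent-ascent step
    x *= min(1.0, D / (np.linalg.norm(x) + 1e-12))            # projection onto the ball of radius D (Proj_D)
    y *= min(1.0, D / (np.linalg.norm(y) + 1e-12))

# The iterate error contracts geometrically up to a floor set by the gradient error, mirroring the recursion below.
print(np.linalg.norm(np.concatenate([x, y])))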

We define a Lyapunov function $V_{p}(\bar{K}_{k}):=\hat{K}^{\top}_{k}P\hat{K}_{k}=p\lVert\hat{K}_{k}\rVert_{2}^{2}$ for $P=pI$ and some $p>0$. Then

V_{p}(\bar{K}_{k+1})-\rho^{2}V_{p}(\bar{K}_{k})=(X\hat{K}_{k}+Y\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k}))^{\top}P(X\hat{K}_{k}+Y\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k}))-\rho^{2}\hat{K}^{\top}_{k}P\hat{K}_{k}

\leq(X\hat{K}_{k}+Y\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k}))^{\top}P(X\hat{K}_{k}+Y\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k}))-\rho^{2}\hat{K}^{\top}_{k}P\hat{K}_{k}+(Y(\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})))^{\top}PX\hat{K}_{k}+(X\hat{K}_{k})^{\top}P(Y(\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})))+(\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k}))^{\top}Y^{\top}PY(\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k}))

\leq\begin{pmatrix}\hat{K}_{k}\\ \nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\end{pmatrix}^{\top}\begin{bmatrix}X^{\top}PX-\rho^{2}P&X^{\top}PY\\ Y^{\top}PX&Y^{\top}PY\end{bmatrix}\begin{pmatrix}\hat{K}_{k}\\ \nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\end{pmatrix}+\eta^{2}p\lVert\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\rVert^{2}_{2}+\begin{pmatrix}\hat{K}_{k}\\ \tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\end{pmatrix}^{\top}\begin{bmatrix}0&X^{\top}PY\\ Y^{\top}PX&0\end{bmatrix}\begin{pmatrix}\hat{K}_{k}\\ \tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\end{pmatrix}.

Let us enforce $\rho^{2}=1-m\eta$; then we know due to (Fallah et al., 2020) that

\begin{pmatrix}\hat{K}_{k}\\ \nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\end{pmatrix}^{\top}\begin{bmatrix}X^{\top}PX-\rho^{2}P&X^{\top}PY\\ Y^{\top}PX&Y^{\top}PY\end{bmatrix}\begin{pmatrix}\hat{K}_{k}\\ \nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\end{pmatrix}\leq 0,

and hence

\hat{K}^{\top}_{k+1}\hat{K}_{k+1}-(1-m\eta)\hat{K}^{\top}_{k}\hat{K}_{k}\leq\eta\lVert\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\rVert_{2}\lVert\hat{K}_{k}\rVert_{2}+\eta^{2}\lVert\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\rVert_{2}^{2}

=\sqrt{m\eta}\lVert\hat{K}_{k}\rVert_{2}\frac{\sqrt{\eta}\lVert\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\rVert_{2}}{\sqrt{m}}+\eta^{2}\lVert\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\rVert_{2}^{2}.

Using the fact $2ab\leq a^{2}+b^{2}$ and $\eta<1$ we get

\lVert\hat{K}_{k+1}\rVert^{2}_{2}\leq(1-m\eta)\lVert\hat{K}_{k}\rVert^{2}_{2}+\frac{m\eta}{2}\lVert\hat{K}_{k}\rVert^{2}_{2}+\eta\Big(\frac{1}{m}+1\Big)\lVert\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\rVert^{2}_{2}

\leq\Big(1-\frac{m\eta}{2}\Big)\lVert\hat{K}_{k}\rVert^{2}_{2}+\eta\Big(\frac{1}{m}+1\Big)\lVert\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{k})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{k})\rVert^{2}_{2}.

Hence

\lVert\hat{K}_{k}\rVert^{2}_{2}\leq\Big(1-\frac{m\eta}{2}\Big)^{k}\lVert\hat{K}_{0}\rVert^{2}_{2}+\eta\Big(\frac{1}{m}+1\Big)\sum_{j=0}^{k}\Big(1-\frac{m\eta}{2}\Big)^{j}\lVert\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{j})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{j})\rVert^{2}_{2}

\leq\Big(1-\frac{m\eta}{2}\Big)^{k}\lVert\hat{K}_{0}\rVert^{2}_{2}+\eta\Big(\frac{1}{m}+1\Big)\sum_{j=0}^{\infty}\Big(1-\frac{m\eta}{2}\Big)^{j}\lVert\tilde{\nabla}\tilde{J}^{i,\gamma}(\bar{K}_{j})-\nabla\tilde{J}^{i,\gamma}(\bar{K}_{j})\rVert^{2}_{2}

\leq\Big(1-\frac{m\eta}{2}\Big)^{k}\lVert\hat{K}_{0}\rVert^{2}_{2}+\frac{2}{m}\Big(\frac{1}{m}+1\Big)\Big(\frac{m}{Mr}(\tilde{J}^{i,\gamma}(K_{t},L_{t})+\lambda r)\sqrt{\log\Big(\frac{2m}{\delta}\Big)}+\phi_{i}r\Big).

If $k=\mathcal{O}(\log(1/\epsilon))$, $M=\mathcal{O}(1/\epsilon)$ and $r=\Omega(\epsilon)$, then $\lVert\hat{K}_{k}\rVert^{2}_{2}\leq\epsilon$.

Proof of Theorem 4.2

Proof 6.5.

In this proof we show that $\max_{j\in\{1,2\}}\lVert K^{j}_{t}-K^{j*}_{t}\rVert=\mathcal{O}(\epsilon)$ for $t\in\{T-1,\ldots,0\}$; the result $\max_{j\in\{1,2\}}\lVert L^{j}_{t}-L^{j*}_{t}\rVert=\mathcal{O}(\epsilon)$ can be obtained in a similar manner. Throughout the proof we refer to the output of the inner loop of Algorithm 1 as the set of output controllers $(K^{i}_{t})_{i\in[2],t\in\{0,\ldots,T-1\}}$. We also use two other sets of controllers. The first set, $(K^{i*}_{t})_{i\in[2],t\in\{0,\ldots,T-1\}}$, denotes the NE as characterized in Theorem 3.2. The second set is called the local-NE (as in the proof of Theorem 4.1) and is denoted by $(\tilde{K}^{i*}_{t})_{i\in[2],t\in\{0,\ldots,T-1\}}$. The proof quantifies the error between the output controllers $(K^{i}_{t})_{i\in[2]}$ and the corresponding NE controllers $(K^{i*}_{t})_{i\in[2]}$ by utilizing the intermediate local-NE controllers $(\tilde{K}^{i*}_{t})_{i\in[2]}$ for each time $t\in\{T-1,\ldots,0\}$. For each $t$ the error is shown to depend on the error in the future controllers $(K^{i}_{s})_{s\geq t,i\in[2]}$ and the approximation error $\Delta_{t}$ introduced by the gradient descent-ascent update. If $\Delta_{t}=\mathcal{O}(\epsilon)$, then the error between the output and NE controllers is shown to be $\mathcal{O}(\epsilon)$.

Let us start by denoting by $M^{i*}_{t}$ the NE value function matrices for agent $i\in[2]$ at time $t\in\{0,1,\ldots,T-1\}$, under the NE control matrices $(K^{i*}_{s})_{i\in[2],s\in\{t+1,\ldots,T-1\}}$. Using results from the literature (Başar and Olsder, 1998), the local-NE control matrices can be characterized as

\tilde{K}^{1*}_{t}=-B^{\top}_{t}M_{t+1}\Lambda^{-1}_{t}A_{t},\qquad \tilde{K}^{2*}_{t}=M_{t+1}\Lambda^{-1}_{t}A_{t},

where

\tilde{M}_{t}=Q_{t}+A^{\top}_{t}M_{t+1}\Lambda^{-1}_{t}A_{t},\qquad \Lambda_{t}=I+(B_{t}B^{\top}_{t}-\gamma^{-2}I)M_{t+1}.

Similarly the NE control matrices can be characterized as

K^{1*}_{t}=-B^{\top}_{t}M^{*}_{t+1}{\Lambda^{*}_{t}}^{-1}A_{t},\qquad K^{2*}_{t}=M^{*}_{t+1}{\Lambda^{*}_{t}}^{-1}A_{t},

where

M^{*}_{t}=Q_{t}+A^{\top}_{t}M^{*}_{t+1}{\Lambda^{*}_{t}}^{-1}A_{t},\qquad \Lambda^{*}_{t}=I+(B_{t}B^{\top}_{t}-\gamma^{-2}I)M^{*}_{t+1}.
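As an illustrative aside, the following NumPy sketch implements the backward recursion displayed above for randomly generated placeholder system matrices, producing $(\Lambda^{*}_{t},K^{1*}_{t},K^{2*}_{t},M^{*}_{t})$. The horizon, dimensions and $\gamma$ are assumed values, with $\gamma$ taken large enough for the recursion to remain well posed.

import numpy as np

rng = np.random.default_rng(4)
T, n, d = 10, 3, 2                           # horizon and dimensions (placeholders)
gamma = 5.0                                  # attenuation level (assumed value)
A = [0.5 * rng.standard_normal((n, n)) for _ in range(T)]
B = [rng.standard_normal((n, d)) for _ in range(T)]
Q = [np.eye(n) for _ in range(T + 1)]

M = Q[T]                                     # terminal condition M^*_T = Q_T
K1, K2 = [None] * T, [None] * T
for t in reversed(range(T)):
    Lam = np.eye(n) + (B[t] @ B[t].T - gamma ** -2 * np.eye(n)) @ M      # \Lambda^*_t
    Lam_inv_A = np.linalg.solve(Lam, A[t])                               # {\Lambda^*_t}^{-1} A_t
    K1[t] = -B[t].T @ M @ Lam_inv_A                                      # K^{1*}_t
    K2[t] = M @ Lam_inv_A                                                # K^{2*}_t
    M = Q[t] + A[t].T @ M @ Lam_inv_A                                    # M^*_t

print(np.linalg.norm(K1[0]), np.linalg.norm(K2[0]), np.linalg.norm(M))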

The matrix $\Lambda^{*}_{t}$ is invertible (Başar and Olsder, 1998), and the matrix $\Lambda_{t}$ will also be invertible if $M_{t+1}$ is close enough to $M^{*}_{t+1}$. Now we characterize the difference between the local-NE and NE controllers:

K^{1*}_{t}-\tilde{K}^{1*}_{t}=B^{\top}_{t}(M^{*}_{t+1}{\Lambda^{*}_{t}}^{-1}-M_{t+1}\Lambda^{-1}_{t})A_{t}

=B^{\top}_{t}(M^{*}_{t+1}{\Lambda^{*}_{t}}^{-1}-M_{t+1}{\Lambda^{*}_{t}}^{-1}+M_{t+1}{\Lambda^{*}_{t}}^{-1}-M_{t+1}\Lambda^{-1}_{t})A_{t}

=B^{\top}_{t}\big((M^{*}_{t+1}-M_{t+1}){\Lambda^{*}_{t}}^{-1}+M_{t+1}({\Lambda^{*}_{t}}^{-1}-\Lambda^{-1}_{t})\big)A_{t}. \qquad (26)

To characterize the difference given above we must evaluate ${\Lambda^{*}_{t}}^{-1}-\Lambda^{-1}_{t}$. Using matrix manipulations we can write

\Lambda^{-1}_{t}-{\Lambda^{*}_{t}}^{-1}=\Lambda^{-1}_{t}\big(\Lambda^{*}_{t}-\Lambda_{t}\big){\Lambda^{*}_{t}}^{-1}=\Lambda^{-1}_{t}(B_{t}B^{\top}_{t}-\gamma^{-2}I)(M^{*}_{t+1}-M_{t+1}){\Lambda^{*}_{t}}^{-1}. \qquad (27)

Using (26) and (27) we can bound the difference between the local-NE and NE controllers

\lVert K^{1*}_{t}-\tilde{K}^{1*}_{t}\rVert=\lVert B^{\top}_{t}\big((M^{*}_{t+1}-M_{t+1}){\Lambda^{*}_{t}}^{-1}+M_{t+1}({\Lambda^{*}_{t}}^{-1}-\Lambda^{-1}_{t})\big)A_{t}\rVert

\leq c_{A}c_{B}\big(1+c_{M}\lVert\Lambda^{-1}_{t}\rVert\big)c_{\Lambda}\lVert M^{*}_{t+1}-M_{t+1}\rVert, \qquad (28)

assuming $\lVert M^{*}_{t+1}-M_{t+1}\rVert\leq 1$, where $c_{A}:=\max_{t}\lVert A_{t}\rVert$, $c_{B}:=\max_{t}\lVert B_{t}\rVert$, $c_{M}:=\max_{t}\lVert M^{*}_{t}\rVert$ and $c_{\Lambda}:=\max_{t}\lVert{\Lambda^{*}_{t}}^{-1}\rVert$. Now we bound $\lVert\Lambda^{-1}_{t}\rVert$ by first defining $\hat{\Lambda}_{t}:=\Lambda_{t}-\Lambda^{*}_{t}=(B_{t}B^{\top}_{t}-\gamma^{-2}I)(M_{t+1}-M^{*}_{t+1})$:

\lVert\Lambda^{-1}_{t}\rVert=\lVert(I+{\Lambda^{*}_{t}}^{-1}\hat{\Lambda}_{t})^{-1}{\Lambda^{*}_{t}}^{-1}\rVert\leq c_{\Lambda}\sum_{k=0}^{\infty}\big(c_{\Lambda}\lVert\hat{\Lambda}_{t}\rVert\big)^{k}\leq c_{\Lambda}\sum_{k=0}^{\infty}\big((c^{2}_{B}+\gamma^{-2})c_{\Lambda}\lVert M_{t+1}-M^{*}_{t+1}\rVert\big)^{k}\leq 2c_{\Lambda}, \qquad (29)

where the last inequality is possible due to $\lVert M_{t+1}-M^{*}_{t+1}\rVert\leq\frac{1}{2c_{\Lambda}(c^{2}_{B}+\gamma^{-2})}$. Using (28) and (29) we arrive at

\lVert K^{1*}_{t}-\tilde{K}^{1*}_{t}\rVert\leq c_{A}c_{B}\big(1+2c_{M}c_{\Lambda}\big)c_{\Lambda}\lVert M^{*}_{t+1}-M_{t+1}\rVert.

Similarly the difference $K^{2*}_{t}-\tilde{K}^{2*}_{t}$ can be bounded as

\lVert K^{2*}_{t}-\tilde{K}^{2*}_{t}\rVert\leq c_{A}\big(1+2c_{M}c_{\Lambda}\big)c_{\Lambda}\lVert M^{*}_{t+1}-M_{t+1}\rVert.

Hence we can equivalently write

\max_{i\in\{1,2\}}\lVert K^{i*}_{t}-\tilde{K}^{i*}_{t}\rVert\leq\underbrace{c_{A}\max(c_{B},1)\big(1+2c_{M}c_{\Lambda}\big)c_{\Lambda}}_{\bar{c}_{1}}\lVert M^{*}_{t+1}-M_{t+1}\rVert. \qquad (30)
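As an illustrative aside, the following NumPy sketch numerically checks the Lipschitz-type dependence in (30): it perturbs a placeholder value matrix and measures the induced change in the controllers obtained from the formulas above. All matrices and perturbation sizes are assumed placeholders.

import numpy as np

rng = np.random.default_rng(5)
n, d, gamma = 3, 2, 5.0                      # placeholder dimensions and attenuation level
A = 0.5 * rng.standard_normal((n, n))
B = rng.standard_normal((n, d))
M_star = np.eye(n)                           # stand-in for the NE value matrix M^*_{t+1}

def gains(M):
    # Controllers induced by a value matrix M through the formulas characterizing the (local-)NE above.
    Lam = np.eye(n) + (B @ B.T - gamma ** -2 * np.eye(n)) @ M
    Lam_inv_A = np.linalg.solve(Lam, A)
    return -B.T @ M @ Lam_inv_A, M @ Lam_inv_A

K1_star, K2_star = gains(M_star)
E = rng.standard_normal((n, n))
E = (E + E.T) / 2
E /= np.linalg.norm(E)                       # unit-Frobenius-norm symmetric perturbation direction
for eps in [1e-1, 1e-2, 1e-3]:
    K1, K2 = gains(M_star + eps * E)         # controllers from the perturbed value matrix, ||M - M^*||_F = eps
    err = max(np.linalg.norm(K1 - K1_star), np.linalg.norm(K2 - K2_star))
    print(eps, err / eps)                    # the ratio stays bounded, consistent with the bound (30)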

Having bounded the difference between the local-NE and the NE controllers at time $t$, we now turn to bounding the difference between the value matrices $M^{*}_{t}$ and $M_{t}$. First we introduce a change of notation to keep the analysis concise. Let

B^{1}_{t}=B_{t},\quad B^{2}_{t}=I,\quad Q^{1}_{t}=Q_{t},\quad Q^{2}_{t}=-Q_{t},\quad R^{1}_{t}=I,\quad R^{2}_{t}=-\gamma^{2}I,\quad M^{1}_{t}=M_{t},\quad M^{2}_{t}=-M_{t},

where the sequence $(M_{t})_{\forall t}$ will be defined in the following analysis. The NE value function matrices are defined recursively using the Lyapunov equation and the NE controllers as

M^{i*}_{t}=Q^{i}_{t}+(A^{i*}_{t})^{\top}M^{i*}_{t+1}A^{i*}_{t}-(A^{i*}_{t})^{\top}M^{i*}_{t+1}B^{i}_{t}(R^{i}_{t}+(B^{i}_{t})^{\top}M^{i*}_{t+1}B^{i}_{t})^{-1}(B^{i}_{t})^{\top}M^{i*}_{t+1}A^{i*}_{t}

=Q^{i}_{t}+(A^{i*}_{t})^{\top}M^{i*}_{t+1}(A^{i*}_{t}+B^{i}_{t}K^{i*}_{t}),

M^{i*}_{T}=Q^{i}_{T}, \qquad (31)

where $A^{i*}_{t}:=A_{t}+\sum_{j\neq i}B^{j}_{t}K^{j*}_{t}$. The sufficient condition for existence and uniqueness of the set of matrices $K^{i*}_{t}$ and $M^{i*}_{t}$ is given in Theorem 3.2. Let us now define the perturbed value function matrices $(\tilde{M}^{i}_{t})_{i\in[2]}$, resulting from the control matrices $(K^{i}_{s})_{i\in[2],s\in\{t+1,\ldots,T-1\}}$, using the Lyapunov equations as follows:

\tilde{M}^{i}_{t}=Q^{i}_{t}+(\tilde{K}^{i*}_{t})^{\top}R^{i}_{t}\tilde{K}^{i*}_{t}+\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)

=Q^{i}_{t}+(\tilde{A}^{i}_{t})^{\top}M^{i}_{t+1}\tilde{A}^{i}_{t}-(\tilde{A}^{i}_{t})^{\top}M^{i}_{t+1}B^{i}_{t}(R^{i}_{t}+(B^{i}_{t})^{\top}M^{i}_{t+1}B^{i}_{t})^{-1}(B^{i}_{t})^{\top}M^{i}_{t+1}\tilde{A}^{i}_{t}

=Q^{i}_{t}+(\tilde{A}^{i}_{t})^{\top}M^{i}_{t+1}(\tilde{A}^{i}_{t}+B^{i}_{t}\tilde{K}^{i*}_{t}),

\tilde{M}^{i}_{T}=Q^{i}_{T}, \qquad (32)

where $\tilde{K}^{j*}=(\tilde{K}^{j*}_{t},K^{j}_{t+1},\ldots,K^{j}_{T-1})$ and $\tilde{A}^{i}_{t}:=A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}$. Finally we define the perturbed value function matrices $M^{i}_{t}$, which result from the perturbed matrices $(K^{i}_{s})_{i\in[2],s\in\{t+1,\ldots,T-1\}}$ obtained using the gradient descent-ascent in Algorithm 1:

M^{i}_{t}=Q^{i}_{t}+(K^{i}_{t})^{\top}R^{i}_{t}K^{i}_{t}+\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}K^{j}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}K^{j}_{t}\bigg). \qquad (33)

Throughout this proof we assume that the output of the inner loop in Algorithm 1, i.e., the output matrices $K^{i}_{t}$, is at most $\Delta_{t}$ away from the target matrices $\tilde{K}^{i*}_{t}$, that is, $\lVert K^{i}_{t}-\tilde{K}^{i*}_{t}\rVert\leq\Delta_{t}$. We know that

\lVert K^{i}_{t}-K^{i*}_{t}\rVert\leq\lVert K^{i}_{t}-\tilde{K}^{i*}_{t}\rVert+\lVert\tilde{K}^{i*}_{t}-K^{i*}_{t}\rVert\leq\Delta_{t}+\lVert\tilde{K}^{i*}_{t}-K^{i*}_{t}\rVert.

Now we characterize the difference $M^{i}_{t}-M^{i*}_{t}=M^{i}_{t}-\tilde{M}^{i}_{t}+\tilde{M}^{i}_{t}-M^{i*}_{t}$. First we can characterize $\tilde{M}^{i}_{t}-M^{i*}_{t}$ using (31) and (32):

\tilde{M}^{i}_{t}-M^{i*}_{t}=(\tilde{A}^{i}_{t})^{\top}M^{i}_{t+1}(\tilde{A}^{i}_{t}+B^{i}_{t}\tilde{K}^{i*}_{t})-(A^{i*}_{t})^{\top}M^{i*}_{t+1}(A^{i*}_{t}+B^{i}_{t}K^{i*}_{t})

=(\tilde{A}^{i}_{t}-A^{i*}_{t})^{\top}M^{i}_{t+1}(\tilde{A}^{i}_{t}+B^{i}_{t}\tilde{K}^{i*}_{t})+(A^{i*}_{t})^{\top}(M^{i}_{t+1}-M^{i*}_{t+1})(\tilde{A}^{i}_{t}+B^{i}_{t}\tilde{K}^{i*}_{t})+(A^{i*}_{t})^{\top}M^{i*}_{t+1}(\tilde{A}^{i}_{t}+B^{i}_{t}\tilde{K}^{i*}_{t}-A^{i*}_{t}-B^{i}_{t}K^{i*}_{t})

=\bigg(\sum_{j\neq i}B^{j}_{t}(\tilde{K}^{j*}_{t}-K^{j*}_{t})\bigg)^{\top}M^{i}_{t+1}(\tilde{A}^{i}_{t}+B^{i}_{t}\tilde{K}^{i*}_{t})+(A^{i*}_{t})^{\top}(M^{i}_{t+1}-M^{i*}_{t+1})(\tilde{A}^{i}_{t}+B^{i}_{t}\tilde{K}^{i*}_{t})+(A^{i*}_{t})^{\top}M^{i*}_{t+1}\sum_{j=1}^{2}B^{j}_{t}(\tilde{K}^{j*}_{t}-K^{j*}_{t}). \qquad (34)

Using this characterization we bound $\lVert\tilde{M}^{i}_{t}-M^{i*}_{t}\rVert$ using the AM-GM inequality:

\lVert\tilde{M}^{i}_{t}-M^{i*}_{t}\rVert\leq 2\lVert A^{*}_{t}\rVert\lVert M^{i*}_{t+1}\rVert c_{B}\sum_{j=1}^{2}\lVert\tilde{K}^{j*}_{t}-K^{j*}_{t}\rVert+\big(\lVert A^{*}_{t}\rVert/2+\lVert M^{i*}_{t+1}\rVert c_{B}+\lVert A^{i*}_{t}\rVert/2\big)c_{B}\bigg(\sum_{j=1}^{2}\lVert\tilde{K}^{j*}_{t}-K^{j*}_{t}\rVert\bigg)^{2}+\frac{c^{4}_{B}}{2}\bigg(\sum_{j=1}^{2}\lVert\tilde{K}^{j*}_{t}-K^{j*}_{t}\rVert\bigg)^{4}+\big(\lVert A^{*}_{t}\rVert/2+1/2+\lVert A^{i*}_{t}\rVert/2\big)\lVert M^{i}_{t+1}-M^{i*}_{t+1}\rVert^{2}+\lVert A^{i*}_{t}\rVert\lVert A^{*}_{t}\rVert\lVert M^{i}_{t+1}-M^{i*}_{t+1}\rVert

\leq\underbrace{\big(2c^{*}_{A}c^{*}_{M}c_{B}+(c^{*}_{A}/2+c^{*}_{M}c_{B}+c^{i*}_{A}/2)c_{B}+c^{4}_{B}/2\big)}_{\bar{c}_{2}}\sum_{j=1}^{2}\lVert\tilde{K}^{j*}_{t}-K^{j*}_{t}\rVert+\underbrace{\big(c^{*}_{A}/2+1/2+c^{i*}_{A}/2+c^{i*}_{A}c^{*}_{A}\big)}_{\bar{c}_{3}}\lVert M^{i}_{t+1}-M^{i*}_{t+1}\rVert

\leq\bar{c}_{2}N\max_{j\in[2]}\lVert\tilde{K}^{j*}_{t}-K^{j*}_{t}\rVert+\bar{c}_{3}\lVert M^{i}_{t+1}-M^{i*}_{t+1}\rVert, \qquad (35)

where $A^{*}_{t}=A^{i*}_{t}+B^{i}_{t}K^{i*}_{t}$, $c^{*}_{A}:=\max_{t}\lVert A^{*}_{t}\rVert$, $c^{i*}_{A}:=\max_{t\in\{0,\ldots,T-1\}}\lVert A^{i*}_{t}\rVert$, $c^{*}_{M}:=\max_{i\in[2],t\in\{0,\ldots,T-1\}}\lVert M^{i*}_{t+1}\rVert$, and the last inequality is possible due to the fact that $\lVert M^{i}_{t+1}-M^{i*}_{t+1}\rVert,\lVert\tilde{K}^{j*}_{t}-K^{j*}_{t}\rVert\leq 1/N$. Similarly $M^{i}_{t}-\tilde{M}^{i}_{t}$ can be decomposed using (32) and (33):

M^{i}_{t}-\tilde{M}^{i}_{t}=(K^{i}_{t})^{\top}R^{i}_{t}K^{i}_{t}+\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}K^{j}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}K^{j}_{t}\bigg)-\bigg[(\tilde{K}^{i*}_{t})^{\top}R^{i}_{t}\tilde{K}^{i*}_{t}+\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)\bigg].

We start by analyzing the quadratic form $x^{\top}M^{i}_{t}x$:

x^{\top}M^{i}_{t}x=x^{\top}\bigg[Q^{i}_{t}+(K^{i}_{t})^{\top}R^{i}_{t}K^{i}_{t}+\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}K^{j}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}K^{j}_{t}\bigg)\bigg]x

=x^{\top}\bigg[(K^{i}_{t})^{\top}\big(R^{i}_{t}+(B^{i}_{t})^{\top}M^{i}_{t+1}B^{i}_{t}\big)K^{i}_{t}+2(B^{i}_{t}K^{i}_{t})^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}K^{j}_{t}\bigg)+Q^{i}_{t}+\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}K^{j}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}K^{j}_{t}\bigg)\bigg]x

=x^{\top}\bigg[(K^{i}_{t})^{\top}\big(R^{i}_{t}+(B^{i}_{t})^{\top}M^{i}_{t+1}B^{i}_{t}\big)K^{i}_{t}+2(B^{i}_{t}K^{i}_{t})^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)+Q^{i}_{t}+2(B^{i}_{t}K^{i}_{t})^{\top}M^{i}_{t+1}\sum_{j\neq i}B^{j}_{t}(K^{j}_{t}-\tilde{K}^{j*}_{t})+\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)+\bigg(\sum_{j\neq i}B^{j}_{t}(K^{j}_{t}-\tilde{K}^{j*}_{t})\bigg)^{\top}M^{i}_{t+1}\bigg(\sum_{j\neq i}B^{j}_{t}(K^{j}_{t}-\tilde{K}^{j*}_{t})\bigg)+2\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(\sum_{j\neq i}B^{j}_{t}(K^{j}_{t}-\tilde{K}^{j*}_{t})\bigg)\bigg]x.

Completing the square we get

x^{\top}M^{i}_{t}x=x^{\top}\bigg[(K^{i}_{t}-\tilde{K}^{i*}_{t})^{\top}\big(R^{i}_{t}+(B^{i}_{t})^{\top}M^{i}_{t+1}B^{i}_{t}\big)(K^{i}_{t}-\tilde{K}^{i*}_{t})+\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)+Q^{i}_{t}-\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)^{\top}M^{i}_{t+1}B^{i}_{t}\big(R^{i}_{t}+(B^{i}_{t})^{\top}M^{i}_{t+1}B^{i}_{t}\big)^{-1}(B^{i}_{t})^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)+\bigg(2\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)+\sum_{j=1}^{2}B^{j}_{t}(K^{j}_{t}-\tilde{K}^{j*}_{t})\bigg)^{\top}M^{i}_{t+1}\bigg(\sum_{j\neq i}B^{j}_{t}(K^{j}_{t}-\tilde{K}^{j*}_{t})\bigg)\bigg]x. \qquad (36)

Now we look at the quadratic form $x^{\top}\tilde{M}^{i}_{t}x$:

x^{\top}\tilde{M}^{i}_{t}x=x^{\top}\bigg[Q^{i}_{t}+(\tilde{K}^{i*}_{t})^{\top}R^{i}_{t}\tilde{K}^{i*}_{t}+\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)\bigg]x

=x^{\top}\bigg[(\tilde{K}^{i*}_{t})^{\top}\big(R^{i}_{t}+(B^{i}_{t})^{\top}M^{i}_{t+1}B^{i}_{t}\big)\tilde{K}^{i*}_{t}+2(B^{i}_{t}\tilde{K}^{i*}_{t})^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)+Q^{i}_{t}+\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)\bigg]x

=x^{\top}\bigg[\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)+Q^{i}_{t}-\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)^{\top}M^{i}_{t+1}B^{i}_{t}\big(R^{i}_{t}+(B^{i}_{t})^{\top}M^{i}_{t+1}B^{i}_{t}\big)^{-1}(B^{i}_{t})^{\top}M^{i}_{t+1}\bigg(A_{t}+\sum_{j\neq i}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)\bigg]x. \qquad (37)

Using (36) and (37), we get

x^{\top}(M^{i}_{t}-\tilde{M}^{i}_{t})x=x^{\top}\bigg[(K^{i}_{t}-\tilde{K}^{i*}_{t})^{\top}\big(R^{i}_{t}+(B^{i}_{t})^{\top}M^{i}_{t+1}B^{i}_{t}\big)(K^{i}_{t}-\tilde{K}^{i*}_{t})+\bigg(2\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}\tilde{K}^{j*}_{t}\bigg)+\sum_{j=1}^{2}B^{j}_{t}(K^{j}_{t}-\tilde{K}^{j*}_{t})\bigg)^{\top}M^{i}_{t+1}\bigg(\sum_{j\neq i}B^{j}_{t}(K^{j}_{t}-\tilde{K}^{j*}_{t})\bigg)\bigg]x

=x^{\top}\bigg[(K^{i}_{t}-\tilde{K}^{i*}_{t})^{\top}\big(R^{i}_{t}+(B^{i}_{t})^{\top}M^{i}_{t+1}B^{i}_{t}\big)(K^{i}_{t}-\tilde{K}^{i*}_{t})+\bigg(2\bigg(A_{t}+\sum_{j=1}^{2}B^{j}_{t}K^{j*}_{t}\bigg)+2\sum_{j=1}^{2}B^{j}_{t}(\tilde{K}^{j*}_{t}-K^{j*}_{t})+\sum_{j=1}^{2}B^{j}_{t}(K^{j}_{t}-\tilde{K}^{j*}_{t})\bigg)^{\top}M^{i}_{t+1}\bigg(\sum_{j\neq i}B^{j}_{t}(K^{j}_{t}-\tilde{K}^{j*}_{t})\bigg)\bigg]x.

Using this characterization we bound $\lVert M^{i}_{t}-\tilde{M}^{i}_{t}\rVert$:

\lVert M^{i}_{t}-\tilde{M}^{i}_{t}\rVert\leq\big(\lVert R^{i}_{t}+(B^{i}_{t})^{\top}M^{i*}_{t+1}B^{i}_{t}\rVert+\lVert(B^{i}_{t})^{\top}\big(M^{i}_{t+1}-M^{i*}_{t+1}\big)B^{i}_{t}\rVert\big)\lVert K^{i}_{t}-\tilde{K}^{i*}_{t}\rVert+\bigg(2c^{i*}_{A}+2c_{B}\sum_{j=1}^{2}\big(\lVert\tilde{K}^{j*}_{t}-K^{j*}_{t}\rVert+\lVert K^{j}_{t}-\tilde{K}^{j*}_{t}\rVert\big)\bigg)\big(\lVert M^{i*}_{t+1}\rVert+\lVert M^{i}_{t+1}-M^{i*}_{t+1}\rVert\big)c_{B}\sum_{j=1}^{2}\lVert K^{j}_{t}-\tilde{K}^{j*}_{t}\rVert.

As before, assuming $\lVert M^{i}_{t+1}-M^{i*}_{t+1}\rVert,\lVert\tilde{K}^{j*}_{t}-K^{j*}_{t}\rVert\leq 1/N$,

\lVert M^{i}_{t}-\tilde{M}^{i}_{t}\rVert\leq\underbrace{\big(\bar{c}^{*}+c^{2}_{B}/2+2c^{2}_{B}(c^{*}_{M}+1)+2c^{i*}_{A}\big)}_{\bar{c}_{4}}\sum_{j=1}^{2}\lVert K^{j}_{t}-\tilde{K}^{j*}_{t}\rVert+\underbrace{\big(c^{2}_{B}/2+c^{i*}_{A}+c_{B}\big)}_{\bar{c}_{5}}\lVert M^{i}_{t+1}-M^{i*}_{t+1}\rVert+\underbrace{c_{B}(c^{*}_{M}+1)}_{\bar{c}_{6}}\sum_{j=1}^{2}\lVert\tilde{K}^{j*}_{t}-K^{j*}_{t}\rVert

\leq\bar{c}_{4}N\Delta_{t}+\bar{c}_{5}\lVert M^{i}_{t+1}-M^{i*}_{t+1}\rVert+\bar{c}_{6}\sum_{j=1}^{2}\lVert\tilde{K}^{j*}_{t}-K^{j*}_{t}\rVert, \qquad (38)

where $\bar{c}^{*}:=\max_{i\in[2],t\in\{0,\ldots,T-1\}}\lVert R^{i}_{t}+(B^{i}_{t})^{\top}M^{i*}_{t+1}B^{i}_{t}\rVert$. Let us define $e^{K}_{t}:=\max_{j\in[2]}\lVert K^{j}_{t}-K^{j*}_{t}\rVert$ and $e^{P}_{t}:=\max_{j\in[2]}\lVert M^{j}_{t}-M^{j*}_{t}\rVert$. Using (30), (35) and (38) we get

e^{K}_{t}\leq\bar{c}_{1}e^{P}_{t+1}+\Delta_{t},

e^{P}_{t}\leq(\bar{c}_{2}+\bar{c}_{6})Ne^{K}_{t}+(\bar{c}_{3}+\bar{c}_{5})e^{P}_{t+1}+\bar{c}_{4}N\Delta_{t}\leq\underbrace{(\bar{c}_{1}(\bar{c}_{2}+\bar{c}_{6})N+\bar{c}_{3}+\bar{c}_{5})}_{\bar{c}_{7}}e^{P}_{t+1}+\underbrace{(\bar{c}_{2}+\bar{c}_{4}+\bar{c}_{6})N}_{\bar{c}_{8}}\Delta_{t}.

Using this recursive definition we deduce

e^{P}_{t}\leq\bar{c}^{T-t}_{7}e^{P}_{T}+\bar{c}_{8}\sum_{s=0}^{T-1}\bar{c}^{s}_{7}\Delta_{t+s}=\bar{c}_{8}\sum_{s=0}^{T-1}\bar{c}^{s}_{7}\Delta_{t+s}.

Hence if $\Delta_{t}=\mathcal{O}(\epsilon)$, in particular $\Delta_{t}\leq\epsilon/(2\bar{c}_{1}\bar{c}^{t}_{7}\bar{c}_{8}T)$, then $e^{P}_{t}\leq\epsilon/(2\bar{c}_{1})$ and $e^{K}_{t}\leq\epsilon$ for $t\in\{0,\ldots,T-1\}$.
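As a final illustrative aside, the following sketch propagates the scalar error recursion above backward in time under the stated schedule for $\Delta_{t}$ and confirms the claimed bounds numerically. The constants $\bar{c}_{1},\bar{c}_{7},\bar{c}_{8}$, the horizon and $\epsilon$ are arbitrary placeholder values.

import numpy as np

# Scalar illustration of the recursions e^K_t <= c1 e^P_{t+1} + Delta_t and
# e^P_t <= c7 e^P_{t+1} + c8 Delta_t, with the schedule Delta_t = eps / (2 c1 c7^t c8 T).
c1, c7, c8, T, eps = 1.5, 2.0, 3.0, 10, 1e-2      # placeholder constants
Delta = [eps / (2 * c1 * c7 ** t * c8 * T) for t in range(T)]

eP = [0.0] * (T + 1)                              # e^P_T = 0 since M_T = Q_T = M^*_T
eK = [0.0] * T
for t in reversed(range(T)):
    eP[t] = c7 * eP[t + 1] + c8 * Delta[t]
    eK[t] = c1 * eP[t + 1] + Delta[t]

print(max(eP) <= eps / (2 * c1), max(eK) <= eps)  # both True under this schedule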