
Semantic Communication in Multi-team Dynamic Games: A Mean Field Perspective

Shubham Aggarwal, Muhammad Aneeq uz Zaman, Melih Bastopcu, and Tamer Başar Research of the authors was supported in part by ARO MURI Grant AG285 and in part by AFOSR Grant FA9550-24-1-0152.Shubham Aggarwal and Muhammad Aneeq uz Zaman are with the Coordinated Science Laboratory and the Department of Mechanical Science and Engineering at the University of Illinois Urbana-Champaign (UIUC); Melih Bastopcu is with the Coordinated Science Laboratory at UIUC; Tamer Başar is with the Coordinated Science Laboratory and the Department of Electrical and Computer Engineering at UIUC. (Emails: {sa57,mazaman2,bastopcu,basar1}@illinois.edu)
Abstract

Coordinating communication and control is a key component of the stability and performance of networked multi-agent systems. While single-user networked control systems have received considerable attention in this domain, in this work we address the more challenging problem of large-population multi-team dynamic games. In particular, each team comprises two decision makers (namely, the sensor and the controller) who coordinate over a shared network to control a dynamically evolving state of interest under costs on both actuation and sensing/communication. Due to the shared nature of the wireless channel, the overall cost of each team depends on the other teams' policies, thereby leading to a noncooperative game setup. Owing to the presence of a large number of teams, we compute approximate decentralized Nash equilibrium policies for each team using the paradigm of (extended) mean-field games, which is governed by (1) the mean traffic flowing over the channel, and (2) the value of information at the sensor, which highlights the semantic nature of the ensuing communication. In the process, we compute optimal controller policies and approximately optimal sensor policies for each representative team of the mean-field system to alleviate the problem of general non-contractivity of the mean-field fixed-point operator associated with the finite cardinality of the sensor action space. Consequently, we also prove the $\epsilon$–Nash property of the mean-field equilibrium solution, which characterizes how well the solution derived from the mean-field analysis performs on the finite-team system. Finally, we provide extensive numerical simulations, which corroborate the theoretical findings and lead to additional insights on the properties of the results presented.

I Introduction

Games among teams have been an attractive topic of interest in the recent past. These systems consist of cooperating decision-makers (within a team) who aim to collectively optimize a shared team objective, while being affected by the actions of other competing teams. Applications include multi-player perimeter defense games [1], electricity markets [2], oligopolies [3], and networked robotics [4], to name a few.

In this paper, we address the problem of finding equilibrium policies for multiple dynamic teams competing to access a shared wireless channel. A prototype of such a system is shown in Fig. 1. Each team comprises a dynamically evolving plant controlled by a sensor-controller pair. The controller relies on state information from the sensor, which communicates over the shared wireless channel. Costs arise from (1) the sensor transmitting data over the network, and (2) the controller implementing control policies for plant stabilization. An effective control policy requires consistent state measurements, which increase communication costs. Conversely, reducing the number of communications lowers these costs but degrades control performance, leading to a classic communication-control trade-off for each team. Consequently, the sensor and controller act as two active decision-makers within a team, optimizing a common objective to find a team-optimal policy pair.


Figure 1: A multi-agent networked control system operating over a shared wireless resource serves use cases such as connected autonomy, path planning, Industry 4.0, and medical applications.

Furthermore, since the channel is shared, the communication cost is coupled with the transmission activity of other users. We model this coupling effect as the product of sensor action and the average number of communication instances of the other teams. Therefore, if the channel is highly congested, each sensor will seek to reduce its communication, whereas if the channel is underutilized, each sensor will increase its transmissions to improve control performance. This dynamic interaction suggests that a (Nash) equilibrium channel utilization may emerge for each team. However, with a large number of teams, calculating exact Nash policies is challenging because each team needs to be aware of other teams’ policies. To address this, we employ the mean-field game (MFG) framework to compute approximate decentralized equilibrium policies for each team, which depend only on the local information. This approach involves analyzing the limiting team game problem (with infinite teams), which we refer to as the mean-field team (MFT) system. In the MFT system, individual team deviations become insignificant due to the infinite number of teams, and hence, it can be characterized by a representative team interacting with a mass distribution. This allows us to determine the optimal policy for a generic team, referred to as the mean-field team equilibrium (MFTE) policy, which aligns with the population’s behavior. We provide conditions for the existence and uniqueness of an approximate MFTE and demonstrate when the derived equilibrium policies constitute approximate Nash policies for the finite-team system.

I-A Related Works

I-A1 Team optimal control & Information structures

Team decision problems, which were first investigated in [5, 6], involve multiple decision makers (DMs), each of whom has access to different information variables, and who jointly choose policies to incur a common cost/reward. Since each DM acts independently, and the DMs do not necessarily share the same information, the joint optimal policy design depends heavily on the information available to each DM [7, 8, 9], and its derivation can be challenging, particularly when the information is dynamic, as demonstrated by Witsenhausen [10], Feldbaum [11], and Başar [12]. The single-team communication-control trade-off problem is an instance of such a decision problem, where the sensor and the controller form a 2-DM team, each having access to different information signals within the system. Initial investigations for solving the single-team problem in this context were undertaken within the realm of event-driven control, where the primary objective is merely to stabilize the dynamical plant, without any optimality guarantees [13, 14, 15, 16]. Multi-team extensions have also been considered in [17, 18].

On the other hand, single team settings from the viewpoint of an optimal control problem have also been widely dealt with in literature [19, 20, 21, 22, 23, 24, 25] to study the impact of resource constraints on estimator and controller designs, where the resource is the frequency of usage of the transmission channels for estimation or control purposes. Specifically, papers [19] and [20] have obtained the optimal threshold-type (symmetric) policies for the sensor under a hard constraint on the number of transmissions, with extensions provided in [21]; [22] has placed resource constraints on the interaction between the controller and the actuator in a plant; and [23] has introduced a tradeoff between two options for a controller, which are transmission of control signals to the actuator and receiving measurements, where optimality is again based on threshold-type policies; [24] has provided jointly optimal policies for control and sensing; and [25] has derived suboptimal sensor and control policies based (only) on the statistics of the estimation error. The latest works in this direction are [26, 27], where the authors show that for a multi-dimensional Gauss-Markov process, the optimal estimator is linear and is unaffected by the no-communication events (as was derived for scalar systems earlier in [20] with its optimality proven in [21]), leading to a globally optimal policy design for a single team problem.

I-A2 Finite & Mean-field team games

As opposed to the above frameworks which consider a single team problem, there are only a handful of works considering games among multiple teams[28, 29, 30, 31, 32]. Specifically, the paper [28] has investigated the properties of the saddle-point equilibrium in a two-team zero-sum game; [29] has proposed a learning algorithm for zero-sum Markov games between teams of multiple agents, with extensions to general sum 2-team games in [30]; [31] has provided only the Nash controller policies for 2-team general sum games without explicit sensor policy computation, which becomes challenging even for a 2-team problem; and [32] has solved for best response sensor policies for one player with asymmetric sensing and a zero-sum game setup.

To alleviate the above issue of equilibrium policy computation, the framework of mean-field games (MFGs) (which forms a subset of the mean-field team games with each team constituting a single DM) was introduced simultaneously in [33] and [34]. The idea behind the same is to compute approximate decentralized Nash equilibrium policies for agents by passing to an infinite agent system (under suitable assumptions of anonymity between agents), in which case the effect of actions of any specific agent becomes insignificant on that of the other agents, thereby leading to an aggregation effect. Under this scenario, one can solve for a representative agent’s equilibrium policy within the infinite agent system which is consistent with that of the population and then characterize its approximation with respect to the centralized Nash equilibrium policies for the finite-agent system.

While many works have considered both theoretical and application aspects of standard MFGs [35, 36, 37, 38, 39, 40], only a few results exist pertaining to MFTGs [41, 2]. A similar paradigm also worth mentioning is that of mean-field type games [42, 43], where the number of teams is finite, but the number of agents within each team goes to infinity. As opposed to standard MFGs, an additional challenge within the MFTG setup is to handle the incoming interactions from other teams while solving for the team optimal solution for each team. While the work [2] considers a discrete state and action setup, the work [41] considers a linear-quadratic setting to compute an asymptotic mixed-equilibrium optimum for the team-game problem. In this work, however, we consider a mixed discrete-continuous action space where the controller action lies within the $m$–dimensional Euclidean space while the sensor action is binary-valued (either 0, denoting no transmission, or 1, denoting full state transmission). Further, as we will see, the cost function is non-quadratic and the estimation error dynamics, which forms the state of the estimation problem, is bilinear due to the presence of no-transmission events, which makes the problem significantly more challenging.

I-A3 Semantic communication

With the proliferation of the Internet-of-Things (IoT) technology, there has been an increasing interest in the efficient use of the shared wireless spectrum [37, 44, 45]. To enable the same, semantic communication is an emerging paradigm that aims to transmit only relevant information through shared resources rather than just the raw data, leading to more intelligent and context-aware communication systems, especially in bandwidth-constrained environments [46]. For instance, in [47], it has been argued that with semantic communication, it is possible to reduce the transmission rate without significant loss in the system performance; the paper [48] discusses the potential energy savings resulting from semantic interaction in connected autonomous vehicles; and the paper [49] demonstrates the effectiveness of semantics powered sampling for remote real-time source reconstruction. We refer the interested reader to [50, 51], and the references therein for additional details. In this work, we capture the semantics of information using the value of information (VoI) metric [26]. This metric essentially measures the (excess) payoff derived by a team from the information communicated by the sensor to the controller as opposed to no communication, thereby providing a quantitative basis for evaluating the effectiveness of semantic communication in our proposed approach.

Thus, we list the main contributions of our work as follows.

  1. We consider a multi-team dynamic networked game problem with a costly uplink, where each team operates in a noncooperative game setting due to interactions over a shared resource, while coordinating its optimal communication and control actions to ensure stabilization of the underlying dynamical system. The setups described in Subsection I-A1 constitute special cases of our work up to appropriate modifications.

  2. In view of the challenge of finding Nash equilibrium policies in the presence of a large number of teams, we use the framework of mean-field team games to compute approximate decentralized Nash equilibrium controller-sensor policies for each team. In the process, we first note the general non-contractivity of the mean-field operator due to the finite cardinality of the sensor action space, which precludes the use of contraction arguments to establish the existence of a mean-field team equilibrium (MFTE). To alleviate this issue, we employ probabilistic Boltzmann-type policies, dependent on the average utilization of the wireless channel (or the mean traffic flow over the network), to approximate the optimal sensor policy. These are then used to prove the existence and uniqueness of an approximate MFTE.

  3. Finally, we demonstrate that the MFTE obtained above constitutes an approximate Nash equilibrium for the finite-team system, where the approximation is induced by (1) the number of teams $(N)$ in the system, and (2) the Boltzmann parameter $(\alpha)$ characterizing the approximate sensor policy.

We would also like to note that our current work is closest in spirit to [26]. Here, however, we consider a multi-team problem as opposed to a single team problem considered in [26]. Nonetheless, we utilize the team optimal policy notions and heuristics employed in [26] in the current work as well. The added non-trivial challenge is to solve for the Nash equilibrium policies on top of the team-optimal policies for each team. This has required us to frame the problem as a MFTG and consequently compute approximate Nash policies between the finite number of teams.

I-B Organization of the paper

The paper is organized as follows. We formulate the finite-team game problem in Section II and the mean-field team game in Section III. The analysis of the existence and uniqueness of the MFTE is presented in Section IV, its approximation performance is studied in Section V, and the approximate Nash equilibrium analysis is included in Section VI. Numerical results are presented in Section VII, and the paper concludes with some major highlights and ensuing discussions in Section VIII. Proofs of main propositions and the supporting lemmas are presented in Appendices.

Notations: The set of agents is denoted by $[N]:=\{1,\cdots,N\}$ and the set of time instants by $[T]:=\{0,\cdots,T-1\}$. We denote the Euclidean norm by $\|\cdot\|$. For a positive definite (resp. semi-definite) matrix $Z$, we use the notation $Z\succ 0$ (resp. $Z\succeq 0$). The trace of a matrix $Z$ is denoted by $tr(Z)$, and for a vector $x$, we define $\|x\|^{2}_{Z}:=x^{\top}Zx$. For two real functions $g$ and $h$, $g(x)=\mathcal{O}(h(x))$ means that there exists a constant $C>0$ such that $\lim_{x\rightarrow\infty}|g(x)/h(x)|=C$. We define $Z_{0:m}:=\{Z_{0},\cdots,Z_{m}\}$, and $\mathbb{I}[A]$ as the indicator function of a measurable set $A$.

II Finite-Team Game Problem

Consider the networked multi-agent system shown in Fig. 2, comprised of $N$ teams. Each team $i$ is modeled by a triple $(P_{i},S_{i},C_{i})$, i.e., a dynamically evolving plant ($P_{i}$), a sensor ($S_{i}$), and a controller ($C_{i}$), where $S_{i}$ and $C_{i}$ can be viewed as the two active agents of team $i$. The plant $P_{i}$ evolves according to a stochastic linear difference equation as

X^{i}_{k+1}=A(\omega_{i})X^{i}_{k}+B(\omega_{i})U^{i}_{k}+W^{i}_{k}, \quad i\in[N], \qquad (1)

where $X^{i}_{k}\in\mathbb{R}^{n}$ denotes the state and $U^{i}_{k}\in\mathbb{R}^{m}$ denotes the control input, both for the $i^{th}$ team. The system noise $W^{i}_{k}$ is assumed to be Gaussian with zero mean and finite covariance $K_{W^{i}}$. Further, $A(\omega_{i})$ and $B(\omega_{i})$ are matrices of appropriate dimensions which depend on the type of the agent $\omega_{i}\in\Omega:=\{\omega_{1},\cdots,\omega_{p}\}$, chosen according to the empirical probability mass function (pmf) $\mathbb{P}_{N}(\omega=\omega_{i})$, for all $\omega\in\Omega$. The initial state $X_{0}^{i}$ has a symmetric density with mean $\nu(\omega_{i})$ and positive definite covariance $\Sigma_{0}(\omega_{i})$ for each $i$. We assume that $X_{0}^{i}$ is independent of the noise $W^{i}_{k}$ for all $k,i$.
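
To make the setup concrete, the following minimal Python sketch rolls out a scalar instance of the dynamics in (1) for a single team under a given control sequence; the function name and the numerical values (borrowed from the simulation study in Section VII) are illustrative assumptions, not part of the model specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_plant(A, B, K_W, x0, controls):
    """Roll out the linear dynamics (1) for one team under a given control sequence."""
    x, traj = x0, [x0]
    for u in controls:
        w = rng.normal(0.0, np.sqrt(K_W))   # zero-mean Gaussian noise W_k
        x = A * x + B * u + w               # X_{k+1} = A X_k + B U_k + W_k
        traj.append(x)
    return np.array(traj)

# Example: an uncontrolled, open-loop unstable scalar plant (A = 1.5, as in Section VII)
print(simulate_plant(A=1.5, B=2.0, K_W=6.0, x0=0.0, controls=np.zeros(10)))
```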


Figure 2: The figure shows a prototypical networked multi-agent system constituting $N$ teams of plant $P_{i}$, sensor $S_{i}$, and controller $C_{i}$. The shared wireless link between each sensor and its respective controller consists of two wireless hops, namely, the uplink and the downlink. The shared channel couples the sensing policies of the teams, leading to a noncooperative game setup between teams.

Next, for each team $i$, the full-state information ($X^{i}_{k}$) of the plant is relayed by $S_{i}$ to $C_{i}$ over a shared two-hop wireless channel via a central base station (BS) (as shown in Fig. 2), which coordinates the flow of information. The hop from the sensor to the BS is called the uplink and the one from the BS to the controller is called the downlink. The sensory information is relayed to the controller subject to a one-step delay, i.e., if $X^{i}_{k-1}$ and $y^{i}_{k}$ denote the uplink input and the downlink output, respectively, then we have that

y^{i}_{k+1}=\begin{cases}X^{i}_{k}, & \text{if } \gamma^{i}_{k}=1,\\ \varphi, & \text{if } \gamma^{i}_{k}=0,\end{cases} \qquad (4)

for $k\in[T]$ and $y_{0}=\varphi$. The quantity $\varphi$ denotes an empty symbol and $\gamma^{i}_{k}$ denotes the transmission request of $S_{i}$ as

\gamma^{i}_{k}=\begin{cases}1, & \text{if transmission is requested},\\ 0, & \text{otherwise}.\end{cases} \qquad (7)

We further assume that the downlink has enough bandwidth to accommodate all the sensor requests. (The case where the downlink suffers from a capacity constraint with an ideal uplink is dealt with in the works [37, 44], where the authors use a weighted age-of-information-based metric to devise optimal downlink transmission strategies for the base station. The case where both the uplink and the downlink suffer from non-idealities is a significantly more challenging problem, since one would need to jointly design each team's policies and the scheduling policy of the base station; we leave this as an open and promising research direction.) Thus, whenever a transmission request is generated, the sensor is connected to the controller by the base station. However, for each packet communicated by the sensor to the controller, team $i$ incurs a cost ($J^{S,N}_{i}$) which is proportional to the average number of users communicating on the uplink. More precisely, the associated time-average communication cost over a finite horizon of $T>0$ is given as

J_{i}^{S,N}(\xi^{i},\xi^{-i}):=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\gamma^{i}_{k}\left(\frac{1}{N}\sum_{j=1}^{N}\gamma^{j}_{k}\right)\right], \qquad (8)

where we refer to the term $\frac{1}{N}\sum_{j=1}^{N}\gamma^{j}_{k}$ as the average channel utilization, which models the mean traffic flow through the wireless channel. Further, $\xi^{i}:=(\xi^{i}_{0},\xi^{i}_{1},\xi^{i}_{2},\cdots,\xi^{i}_{T-1})$ denotes the transmission policy of sensor $S_{i}$ and $\xi^{-i}:=(\xi^{1},\cdots,\xi^{i-1},\xi^{i+1},\cdots,\xi^{N})$ denotes the transmission policies of the sensors other than $S_{i}$. In the absence of any communication, $C_{i}$ forms the best estimate of the state, which resets to the latest information in the event of a communication.
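
The coupling through the average channel utilization in (8) is easy to evaluate numerically; the sketch below is a minimal illustration that computes the per-team communication cost from a hypothetical matrix of binary transmission decisions.

```python
import numpy as np

def communication_costs(gamma):
    """Time-average communication cost (8) for each team.

    gamma: (N, T) binary array, gamma[i, k] = 1 if sensor i transmits at time k.
    Returns an array of length N with J_i^{S,N} for each team.
    """
    N, T = gamma.shape
    channel_util = gamma.mean(axis=0)            # (1/N) sum_j gamma^j_k, for each k
    return (gamma * channel_util).sum(axis=1) / T

# Example with 3 teams and 4 time steps (hypothetical transmission pattern)
gamma = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1]])
print(communication_costs(gamma))
```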

Let us define the information set of the $i^{th}$ sensor as

\mathcal{I}^{S,i}_{k}:=\{X^{i}_{0:k},\,y^{i}_{0:k},\,U^{i}_{0:k-1},\,\gamma^{\ell}_{0:k-1}\mid\ell\in[N]\}, \qquad (9)

with $\mathcal{I}^{S,i}_{0}=\{X^{i}_{0},y^{i}_{0}\}$. Then, we can define the space of admissible centralized sensor policies for $S_{i}$ as $\Xi^{cen,i}=\{\xi^{i}\mid\xi^{i}_{k}~\text{is adapted to}~\sigma(\mathcal{I}^{S,i}_{t},~t=0,1,\cdots,k)\}$, where $\sigma(\cdot)$ is the $\sigma$–algebra generated by its arguments (we use the term centralized to emphasize that the information set of sensor $i$ contains the transmission instants of the other agents as well). Finally, the expectation in (8) is taken with respect to the stochasticity induced by the (possibly) randomized policies, the system noise, and the initial state distribution.

Next, the controller $C_{i}$ observes the incoming message from $S_{i}$ as in (4). Between communication instants, $C_{i}$ computes the best estimate $Z^{i}_{k}$ of the state (calculated later) based on its information history given by

\mathcal{I}^{C,i}_{k}:=\{y^{i}_{0:k},\,U^{i}_{0:k-1}\}, \qquad (10)

with $\mathcal{I}^{C,i}_{0}=\{y^{i}_{0}\}$. For each control input applied by $C_{i}$, the $i^{th}$ team incurs a time-average control cost given as

J_{i}^{C,N}(\pi^{i},\xi^{i}):=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|X^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}\right], \qquad (11)

where $Q(\omega_{i})\succeq 0$, $R(\omega_{i})\succ 0$, and $\pi^{i}:=(\pi^{i}_{0},\pi^{i}_{1},\cdots,\pi^{i}_{T-1})$ denotes an admissible policy of the control agent of team $i$. More precisely, given $\mathcal{I}^{C,i}_{k}$, we define the space of admissible control policies for team $i$ as $\mathcal{M}^{i}=\{\pi^{i}\mid\pi^{i}_{k}~\text{is adapted to}~\sigma(\mathcal{I}^{C,i}_{s},~s=0,1,\cdots,k)\}$. Finally, the expectation in (11) is taken with respect to the noise statistics and the initial state distribution.

As a result of the cost functions in (8) and (11), each team is faced with the following stochastic team-game problem.

Problem 1 (Team-level game problem).
\inf_{(\pi^{i},\xi^{i})\in\mathcal{M}^{i}\times\Xi^{cen,i}} J^{N}_{i}(\pi^{i},\xi^{i},\xi^{-i}):=J_{i}^{C,N}+\lambda J_{i}^{S,N},
\quad\text{s.t. (1), (4), and (7) hold},

where $(\pi^{i},\xi^{i})\in\mathcal{M}^{i}\times\Xi^{cen,i}$ and $\lambda>0$ is the weighting parameter which trades off the cost of uplink communication against control performance.

Henceforth, we refer to $(\pi^{i},\xi^{i})$ as the $i^{th}$ team's policy. We note here that each team's cost depends on the strategies of the other teams through their sensor policies, which leads to a noncooperative game setup. Further, the works [26, 27] constitute a special case of our setup (with $N=1$). Thus, the objective of this work is to compute Nash equilibrium policies for the $N$ teams. That is, we are looking for a $2N$–tuple $(\pi^{*,i}\in\mathcal{M}^{i},\,\xi^{*,i}\in\Xi^{cen,i},\,i\in[N])$ such that

J^{N}_{i}(\pi^{*,i},\xi^{*,i},\xi^{*,-i})=\inf_{(\pi^{i},\xi^{i})\in\mathcal{M}^{i}\times\Xi^{cen,i}} J^{N}_{i}(\pi^{i},\xi^{i},\xi^{*,-i})

for all $i\in[N]$. We note that for team $i$, the above objective depends on the sensing policies $\xi^{*,-i}$ of the other teams. Due to the presence of a large number of such teams within a typical networked system, it becomes unrealistic for each team to have access to (and store) the sensing policies of all other teams in the population. Thus, we would like to compute team equilibrium policies based only on the local information accessible to each team. Toward this end, we will use the mean-field game framework [35, 52, 53] to characterize approximate decentralized Nash equilibrium solutions in the next three sections.

III Networked Mean-Field Team Game

In this section, we model the $N$–team game using the mean-field team game (MFTG) framework. In this regard, we first consider the limiting system (with $N=\infty$ teams). Under this setting, the average channel utilization ratio in (8) (and hence in Problem 1) can be approximated by a pre-computable deterministic mean-field trajectory using the Nash certainty equivalence principle [33]. This reduces Problem 1 to a team optimal control problem for a representative team, whose solution is consistent with that of the other teams in the population. Thus, we first compute a team optimal policy for the representative team. Consequently, we will observe that the finite cardinality of the sensing action set (given as $\{0,1\}$) leads to non-contractivity of the mean-field operator (as also noted in [54]). To alleviate this issue, we construct a Boltzmann-type sensor policy and prove the existence of a unique approximate mean-field team equilibrium (MFTE). Finally, we will show that the solution obtained constitutes an approximate Nash equilibrium for the finite $N$–team game problem, with the notion of "approximation" defined precisely.

III-A Mean-Field Team Game

Consider a representative team (within the infinite-team system) of type $\omega$, characterized by the tuple $(P(\omega),S(\omega),C(\omega))$ of a plant, a sensor, and a controller. The plant dynamics evolve according to the stochastic linear difference equation

X_{k+1}=A(\omega)X_{k}+B(\omega)U_{k}+W_{k}, \qquad (12)

where $(X_{k},U_{k})\in\mathbb{R}^{n}\times\mathbb{R}^{m}$ denotes the state-control input pair of the representative team. The noise $W_{k}$ is assumed to be Gaussian with zero mean and positive definite covariance $K_{W}(\omega)$. The sensor $S(\omega)$ generates transmission requests ($\gamma_{k}\in\{0,1\}$) over the uplink, which are then coordinated by the base station to update the state of the plant at the controller. As a result of network usage, the team incurs a communication cost $J^{S}$ given as

J^{S}(\zeta,g):=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}g_{k}\gamma_{k}\right], \qquad (13)

where $\zeta\in\Xi^{dec}$ denotes the sensor policy of the representative team, with $\Xi^{dec}$ denoting the space of admissible decentralized sensor policies defined as $\Xi^{dec}:=\{\zeta\mid\zeta~\text{is adapted to}~\sigma(\mathcal{I}^{S,\omega}_{t},~t=0,1,\cdots,k)\}$, where $\mathcal{I}^{S,\omega}_{k}=\{X_{0:k},y_{0:k-1},U_{0:k-1},\gamma_{0:k}\}$ defines the information structure of the representative team's sensor. We note here that the above set constitutes only the local information of the representative team, as opposed to the information set defined in (9). We refer to the term $g=(g_{1},g_{2},\cdots,g_{T-1})$ as the mean-field trajectory, which denotes the infinite-team approximation of the channel utilization term in (8). Finally, the controller $C(\omega)$ applies a control policy which incurs a cost of

J^{C}(\pi,\zeta):=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|X_{k}\|^{2}_{Q(\omega)}+\|U_{k}\|^{2}_{R(\omega)}\right]. \qquad (14)

The policy $\pi$ is chosen from the space $\mathcal{M}$ of admissible control policies, defined similarly to $\mathcal{M}^{i}$ as before. Thus, the representative team's optimization problem is defined as follows:

Problem 2 (Representative team problem).
\inf_{(\pi,\zeta)\in\mathcal{M}\times\Xi^{dec}} J(\pi,\zeta,g):=J^{C}(\pi,\zeta)+\lambda J^{S}(\zeta,g), \qquad (15)

where $J(\pi,\zeta,g)$ denotes the joint time-average overall cost incurred by the representative team, with $J^{S}(\zeta,g)$ and $J^{C}(\pi,\zeta)$ defined by (13) and (14), respectively.

In order to construct a solution to Problem 2, we invoke the following assumption [33, 55] on the mean-field system.

Assumption 1.
  1. The pair $(A(\omega),B(\omega))$ is controllable.

  2. The MF trajectory $g\in\ell_{\infty}^{T}$, where $\ell_{\infty}^{T}:=\{g_{k}\in\mathbb{R}\mid\sup_{k\in[T]}|g_{k}|<\infty\}$ is the set of bounded $T$–length sequences.

Next, we define the mean-field team equilibrium (MFTE) by introducing the following operators:

  1. $\Phi:\ell_{\infty}^{T}\rightarrow\mathcal{M}\times\Xi^{dec}$, which maps $g\mapsto\operatorname{argmin}_{(\pi,\zeta)\in\mathcal{M}\times\Xi^{dec}} J(\pi,\zeta,g)$ and outputs a team-optimal policy $(\pi,\zeta)$ for a given MF trajectory $g$, and

  2. $\Psi:\mathcal{M}\times\Xi^{dec}\rightarrow\ell_{\infty}^{T}$, which maps $(\pi,\zeta)\mapsto g$ and generates an MF trajectory $g$ from a policy $(\pi,\zeta)$.

Then, the MFTE can be defined as follows.

Definition 1 (Mean-Field Team Equilibrium).

A mean-field team equilibrium (MFTE) is a policy-trajectory pair $((\pi^{*},\zeta^{*}),g^{*})$, which is a fixed point of the composite map $\Psi\circ\Phi$.

In the sequel, we first compute the team optimal policy pair $(\pi^{*},\zeta^{*})$ solving Problem 2 for a given MF trajectory $g$, and then provide sufficient conditions for the existence of a unique approximate MFTE.

IV Mean-Field Team Equilibrium Analysis

We start this section by solving Problem 2. In this regard, let us begin by determining the optimal controller policy. Using completion of squares, we can equivalently write the cost function in (15) as

J(\pi,\zeta,g) = \mathbb{E}[\|X_{0}(\omega)\|^{2}_{P_{0}(\omega)}]+\frac{1}{T}\sum_{k=0}^{T-1}tr(P_{k}(\omega)K_{W}(\omega))
\quad +\frac{1}{T}\mathbb{E}\Big[\sum_{k=0}^{T-1}\|U_{k}+\hat{R}_{k}^{-1}(\omega)B^{\top}(\omega)P_{k+1}(\omega)A(\omega)X_{k}\|^{2}_{\hat{R}_{k}(\omega)}+\lambda g_{k}\gamma_{k}\Big], \qquad (16)

where $\hat{R}_{k}(\omega):=R(\omega)+B^{\top}(\omega)P_{k+1}(\omega)B(\omega)$ and $P\succeq 0$ is the unique solution to the Riccati equation

P_{k}(\omega) = Q(\omega)+A^{\top}(\omega)P_{k+1}(\omega)A(\omega)-A^{\top}(\omega)P_{k+1}(\omega)B(\omega)\hat{R}_{k}^{-1}(\omega)B^{\top}(\omega)P_{k+1}(\omega)A(\omega),

with $P_{T}(\omega)=0$. Thus, from (16), the optimal controller can be obtained as

U_{k}=\pi^{*}_{k}(\mathcal{I}^{C,\omega}_{k})=-\hat{R}^{-1}_{k}(\omega)B^{\top}(\omega)P_{k+1}(\omega)A(\omega)Z_{k}, \qquad (17)

where $Z_{k}:=\mathbb{E}[X_{k}\mid\mathcal{I}^{C,\omega}_{k}]$ is the conditional mean of the state given the information available to the controller. Then, we have the following result.

Proposition 1.

The conditional mean $Z_{k}$, given by the MMSE estimator at the controller, follows the dynamics

Z_{k+1} = A(\omega)\big((1-\gamma_{k})Z_{k}+\gamma_{k}X_{k}\big)+B(\omega)U_{k}, \qquad (18)

for all $k\in[T]$, with $Z_{0}=\mathbb{E}[X_{0}]$.

The proof of the above proposition readily follows by taking the expectation in (12) conditioned on the controller's information structure and is thus not included here.
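
To illustrate (17)–(18) for a scalar team, the sketch below runs the backward Riccati recursion, the resulting certainty-equivalent controller, and the estimator of Proposition 1 for a given transmission schedule; the helper names, the zero prior mean, and the parameter values are our own illustrative assumptions.

```python
import numpy as np

def riccati_gains(A, B, Q, R, T):
    """Backward Riccati recursion with P_T = 0; returns P_{0:T} and the gains in (17)."""
    P = np.zeros(T + 1)
    K = np.zeros(T)
    for k in range(T - 1, -1, -1):
        R_hat = R + B * P[k + 1] * B
        K[k] = (B * P[k + 1] * A) / R_hat                 # U_k = -K_k Z_k
        P[k] = Q + A * P[k + 1] * A - A * P[k + 1] * B * K[k]
    return P, K

def run_team(A, B, K_W, K_gain, schedule, x0, rng):
    """Certainty-equivalent control (17) combined with the estimator update (18)."""
    x, z = x0, 0.0                                        # Z_0 = E[X_0] (assumed zero here)
    for k, gamma in enumerate(schedule):
        u = -K_gain[k] * z                                # control based on the estimate
        w = rng.normal(0.0, np.sqrt(K_W))
        x_next = A * x + B * u + w                        # plant dynamics (12)
        z = A * ((1 - gamma) * z + gamma * x) + B * u     # estimator recursion (18)
        x = x_next
    return x, z

rng = np.random.default_rng(0)
P, K = riccati_gains(A=1.5, B=2.0, Q=2.0, R=1.0, T=50)
print(run_team(1.5, 2.0, 6.0, K, schedule=np.ones(50, dtype=int), x0=1.0, rng=rng))
```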

Let us define $e_{k}:=X_{k}-Z_{k}$. Then, by substituting the optimal controller (17) in (16), we arrive at

J(\zeta,g) = J^{*}+\frac{1}{T}\mathbb{E}\Big[\sum_{k=0}^{T-1}\|e_{k}\|^{2}_{\Gamma_{k}}+\lambda g_{k}\gamma_{k}\Big], \qquad (19)

where $J^{*}=\mathbb{E}[\|X_{0}(\omega)\|^{2}_{P_{0}(\omega)}]+\frac{1}{T}\sum_{k=0}^{T-1}tr(P_{k}(\omega)K_{W}(\omega))$,

$\Gamma_{k}:=A^{\top}(\omega)P_{k+1}^{\top}(\omega)B(\omega)\hat{R}^{-1}_{k}(\omega)B^{\top}(\omega)P_{k+1}(\omega)A(\omega)$, and $e_{k}$ satisfies the following stochastic difference equation:

e_{k+1}=(1-\gamma_{k})A(\omega)e_{k}+W_{k}. \qquad (20)

Thus, the sensor's objective is to find the optimal policy minimizing (19) subject to (20). To obtain it, let us first define the optimal cost-to-go $V_{\cdot}^{\omega}(\mathcal{I}^{S,\omega}_{\cdot},g)$ of the sensor as follows:

V_{k}^{\omega}(\mathcal{I}^{S,\omega}_{k},g):=\min_{\zeta\in\Xi^{dec}\mid\pi=\pi^{*}}\frac{1}{T}\mathbb{E}\Big[\sum_{t=k}^{T-1}\|e_{t}\|_{\Gamma_{t}}^{2}+\lambda g_{t}\gamma_{t}\mid\mathcal{I}^{S,\omega}_{k}\Big],

for all $k\in[T]$, and with the convention that $g_{-1}=0$. Then, we invoke the following definition [26] of the value of information (VoI).

Definition 2.

The VoI at instant $k$ for the sensor of type $\omega$ is defined as a function of $\mathcal{I}^{S,\omega}_{k}$ as

\operatorname{VoI}_{k}^{\omega}:=V_{k}^{\omega}(\mathcal{I}^{S,\omega}_{k},g)\big|_{\gamma_{k}=0}-V_{k}^{\omega}(\mathcal{I}^{S,\omega}_{k},g)\big|_{\gamma_{k}=1}, \qquad (21)

where $V^{\omega}(\cdot,\cdot)|_{\gamma}$ denotes the sensor's optimal cost-to-go under the sensor action $\gamma$.

We now have the following proposition, which characterizes the optimal sensor policy ($\zeta^{*}$).

Proposition 2.

The optimal sensor policy $\zeta^{*}$ for the team of type $\omega$ is given as

\gamma_{k}=\zeta^{*}(\mathcal{I}^{S,\omega}_{k},g)=\mathbb{I}[\operatorname{VoI}_{k}^{\omega}\geq 0], \qquad (22)

where

\operatorname{VoI}_{k}^{\omega} = e_{k}^{\top}A(\omega)^{\top}\Gamma_{k}(\omega)A(\omega)e_{k}-g_{k}+\varepsilon_{k}^{\omega},
\varepsilon_{k}^{\omega} = \mathbb{E}[V^{\omega}_{k+1}(\mathcal{I}^{S,\omega}_{k+1},g)\mid\mathcal{I}^{S,\omega}_{k},\gamma_{k}=0]-\mathbb{E}[V^{\omega}_{k+1}(\mathcal{I}^{S,\omega}_{k+1},g)\mid\mathcal{I}^{S,\omega}_{k},\gamma_{k}=1]. \qquad (23)
Proof.

We start by noting that Problem 2 is a single-team stochastic optimal control problem whose decision policies ($\pi$ and $\zeta$) depend only on the local information and the mean-field trajectory $g$, which is fixed. Hence, for each representative team of type $\omega\in\Omega$, the result follows from [26, Theorem 1], and the proof is thus complete. ∎

Let us now denote the optimal policy pair for the representative team of type $\omega$ by $(\pi^{*},\zeta^{*})$, which of course depends on the mean field $g$. Then, we proceed toward computing an MFTE using this policy. In this regard, we first make a crucial observation. The optimal sensor policy $\zeta^{*}$ computed in Proposition 2 is discontinuous at the point where $\operatorname{VoI}_{k}^{\omega}=0$. This makes it prohibitive to employ contraction-type arguments to establish the existence of a unique MFTE [33, 54]. Thus, our objective in what follows is to construct an approximation of the policy $\zeta^{*}$ by restricting it to a special class of policies, which then leads to a unique (approximate) MFTE within that class.

Let $\alpha>0$, and define a new policy $\hat{\zeta}:=\zeta^{\alpha}$ from which an action $\hat{\gamma}_{k}$ is chosen as follows:

\hat{\gamma}_{k} = \hat{\zeta}(\mathcal{I}^{S,\omega}_{k},g):=\begin{cases}1, & \text{w.p. } p:=\frac{1}{1+e^{-\alpha\operatorname{VoI}^{\omega}_{k}}},\\ 0, & \text{w.p. } 1-p.\end{cases} \qquad (24)

We note that the above policy is indeed an approximation of $\zeta^{*}$, since $\hat{\zeta}\rightarrow\zeta^{*}$ as $\alpha\rightarrow\infty$ under the convention that $\hat{\gamma}_{k}=1$ whenever $\operatorname{VoI}_{k}^{\omega}=0$. Further, it is measurable with respect to $\mathcal{I}^{S,\omega}_{k}$. Let the policy $\hat{\zeta}$ belong to the policy class $\Xi^{\alpha}$. We then prove the existence of a unique approximate MFTE in the above sense, which is defined precisely below.
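
The rule in (24) is straightforward to sample; the following minimal sketch contrasts it with the hard threshold policy (22), where `voi` stands for a numerical evaluation of $\operatorname{VoI}_{k}^{\omega}$ (assumed to be supplied, e.g., by the value-function approximation used in Section VII).

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_action(voi, alpha):
    """Sample gamma_hat_k from the softened sensor policy (24)."""
    p = 1.0 / (1.0 + np.exp(-alpha * voi))   # transmission probability
    return int(rng.random() < p)

def threshold_action(voi):
    """Hard VoI threshold rule (22), recovered from (24) as alpha -> infinity."""
    return int(voi >= 0)

# For a large alpha, the Boltzmann policy essentially reproduces the threshold rule.
for voi in (-2.0, -0.1, 0.1, 2.0):
    print(voi, boltzmann_action(voi, alpha=50.0), threshold_action(voi))
```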

Definition 3 ($\alpha$–Approximate MFTE).

An $\alpha$–approximate MFTE is a policy-trajectory pair $((\pi^{*},\hat{\zeta}^{*}),\hat{g}^{*})$, which is a fixed point of the composite map $\Psi^{\alpha}\circ\Phi^{\alpha}$, where

  1. $\Phi^{\alpha}:\ell_{\infty}^{T}\rightarrow\mathcal{M}\times\Xi^{\alpha}$, which is defined similarly to $\Phi$ using (24), and outputs the pair $(\pi^{*},\hat{\zeta}^{*})$ for a given MF trajectory $\hat{g}^{*}$, and

  2. $\Psi^{\alpha}:\mathcal{M}\times\Xi^{\alpha}\rightarrow\ell_{\infty}^{T}$, which maps $(\pi^{*},\hat{\zeta}^{*})$ to $\hat{g}^{*}$.

Also, let us define the mean-field operator $\mathcal{T}$ as

\hat{g}_{k+1} = [\mathcal{T}(\hat{g})]_{k}:=\mathbb{E}_{\hat{\zeta},\mathbb{P}(\cdot),\mathcal{N}(0,K_{W}(\cdot))}[\hat{\gamma}_{k}]=\mathbb{E}_{\mathbb{P}(\cdot),\mathcal{N}(0,K_{W}(\cdot))}\left[\frac{1}{1+e^{-\alpha\operatorname{VoI}^{\omega}_{k}}}\right]. \qquad (25)

We now first state and prove the following result on the existence of an approximate MFTE.

Proposition 3.

Under Assumption 1, there exists an $\alpha$–approximate MFTE for any $\alpha>0$.

Proof.

A detailed proof is provided in Appendix I. In summary, the proof relies on first showing that the mean-field operator $\mathcal{T}$, as defined in (25), is continuous in $\hat{g}$. Subsequently, Brouwer's fixed point theorem is applied to show the existence of at least one MFTE for any $\alpha>0$. ∎

In addition, we can guarantee contraction of the MF operator $\mathcal{T}$ for sufficiently small values of $\alpha$, which allows us to prove a stronger result on the uniqueness of the approximate MFTE, as follows.

Theorem 1.

Suppose Assumption 1 holds. Then, the mean-field operator $\mathcal{T}$ is a contraction if

2\alpha\lambda T<1. \qquad (26)

Furthermore, under condition (26), there exists a unique $\alpha$–approximate MFTE, which is the fixed point of the operator $\mathcal{T}$.

Proof.

The detailed proof is provided in Appendix I. Briefly, we use the contractivity of $\mathcal{T}$ under condition (26) together with Banach's fixed point theorem to show the existence of a unique $\alpha$–approximate MFTE. ∎

The above results show the existence of an approximate MFTE, which is unique if the parameter $\alpha$ is small enough. Further, we would also like to note from the proof of Theorem 1 that the contraction condition in (26) is only sufficient, and one may be able to obtain a more precise characterization of the 'smallness' of $\alpha$. Furthermore, in the numerical simulation section, we compute an approximate MFTE even when (26) is not satisfied.
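
Numerically, a fixed point of $\mathcal{T}$ can be sought by simple Picard iteration, as sketched below; here `mean_field_operator` is a placeholder for a Monte Carlo evaluation of (25) (e.g., by simulating the representative team under the Boltzmann policy), and the initial guess, tolerance, and iteration cap are arbitrary choices.

```python
import numpy as np

def picard_iteration(mean_field_operator, T, tol=1e-4, max_iter=200):
    """Iterate g <- T(g) until the sup-norm change falls below tol."""
    g = np.full(T, 0.5)                      # initial guess for the MF trajectory
    for _ in range(max_iter):
        g_next = mean_field_operator(g)      # one application of (25)
        if np.max(np.abs(g_next - g)) < tol:
            return g_next
        g = g_next
    return g                                 # may not have converged if (26) fails

# Example with a toy contraction standing in for (25): each entry relaxes toward 0.3.
print(picard_iteration(lambda g: 0.5 * g + 0.15, T=10))
```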

Now that we have characterized the approximate MFTE for the mean-field game, we next proceed towards proving that this MFTE satisfies the approximate Nash property (as formalized later in Section VI) for the original finite-team game problem.

V Performance of Mean-Field Approximation

In this and the subsequent section, we return to the finite-team setup of Section II and show that the $\alpha$–approximate MFTE computed in the previous section constitutes an $(\epsilon_{a},\epsilon_{b})$–Nash equilibrium for the finite-team game problem. The term $\epsilon_{a}$ arises as a result of the approximation parameter $\alpha$ in the sensor policy, and the term $\epsilon_{b}$ arises due to the infinite-team approximation.

Suppose now that team $i$ deviates to a sensor policy different from the $\alpha$–approximate MFTE policy $\hat{\zeta}^{*}$. We call this policy $\xi^{i}\in\Xi^{cen,i}$ and denote an action chosen from it at instant $k$ by $\tilde{\gamma}_{k}$. Also, let us define the quantity $\gamma^{N,av}_{k}:=\frac{1}{N}\sum_{j=1}^{N}\gamma^{j}_{k}$, which denotes the empirical average of the actions chosen by all the sensors. We start by proving the following lemma, which bounds the mean deviation between the empirical average $\gamma^{N,av}$ and the approximate equilibrium mean-field trajectory $\hat{g}^{*}$.

Lemma 1.

Suppose that Assumption 1 holds. Let team $i$ employ the sensor policy $\xi^{i}\in\Xi^{cen,i}$ while all the other teams employ the equilibrium policy $\hat{\zeta}^{*}$. Then, we have that

\frac{1}{T}\sum_{k=0}^{T-1}\mathbb{E}\Big[|\gamma^{N,av}_{k}-\hat{g}^{*}_{k}|\Big]=\mathcal{O}\Big(\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big), \qquad (27)

where $\epsilon_{P,N}=\sum_{\omega\in\Omega}|\mathbb{P}_{N}(\omega)-\mathbb{P}(\omega)|$.

Proof.

Consider the following inequality:

\sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j=1}^{N}\gamma^{j}_{k}-\hat{g}^{*}_{k}\Big|\Big] \leq \underbrace{\sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j=1}^{N}(\gamma^{j}_{k}-\hat{\gamma}^{j}_{k})\Big|\Big]}_{=:\,\text{Term 1}} + \underbrace{\sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j=1}^{N}\hat{\gamma}^{j}_{k}-\mathbb{E}[\hat{g}^{*,j}_{k}]\Big|\Big]}_{=:\,\text{Term 2}} + \underbrace{\sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j=1}^{N}\mathbb{E}[\hat{g}^{*,j}_{k}]-\hat{g}^{*}_{k}\Big|\Big]}_{=:\,\text{Term 3}}. \qquad (28)

We now bound each of the three terms in (28). To that end, let the action chosen by the sensor of team $i$ be given as $\tilde{\gamma}^{i}_{k}\sim\xi^{i}(\mathcal{I}^{S,i}_{k})$. Then, Term 1 in (28) can be bounded as follows:

\text{Term 1} = \sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j\in[N]\setminus\{i\}}(\hat{\gamma}^{j}_{k}-\hat{\gamma}^{j}_{k})+\frac{1}{N}(\tilde{\gamma}^{i}_{k}-\hat{\gamma}^{i}_{k})\Big|\Big]
= \sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j\in[N]\setminus\{i\}}(\hat{\zeta}^{j}_{k}(\mathcal{I}^{S,j}_{k})-\hat{\zeta}^{*,j}_{k}(\mathcal{I}^{S,j}_{k}))+\frac{1}{N}(\tilde{\gamma}^{i}_{k}-\hat{\gamma}^{i}_{k})\Big|\Big]
= \sum_{k=0}^{T-1}\frac{1}{N}\mathbb{E}[|\tilde{\gamma}^{i}_{k}-\hat{\gamma}^{i}_{k}|]\leq\frac{2T}{N}, \qquad (29)

where the second equality follows since the teams other than the $i^{th}$ one still follow the policy $\hat{\zeta}^{*}$.

We next bound Term 2 in (28). To that end, let us define the quantity $\bar{g}^{*,j}_{k}:=\hat{\gamma}^{j}_{k}-\mathbb{E}[\hat{g}^{*,j}_{k}]$. Then, we have that

\mathbb{E}\Big[\frac{1}{N}\sum_{j=1}^{N}\hat{\gamma}^{j}_{k}-\mathbb{E}[\hat{g}^{*,j}_{k}]\Big]^{2} = \frac{1}{N^{2}}\mathbb{E}\Big[\sum_{j\in[N]}\bar{g}^{*,j}_{k}\Big]^{2} = \frac{1}{N^{2}}\mathbb{E}\Big[\sum_{j,\ell\in[N]}\bar{g}^{*,j}_{k}\bar{g}^{*,\ell}_{k}\Big] = \frac{1}{N^{2}}\mathbb{E}\Big[\sum_{j\in[N]}(\bar{g}^{*,j}_{k})^{2}\Big] \leq \frac{1}{N}, \qquad (30)

where the third equality follows since $\bar{g}^{*,j}_{k}$ and $\bar{g}^{*,\ell}_{k}$ (for $j\neq\ell$) are uncorrelated, because the system noise $W_{k}^{i}$ is independent across different teams, which leads to decoupling between the corresponding sensor policies.

Thus, it follows from (30) that:

\text{Term 2}\leq\sum_{k=0}^{T-1}\frac{1}{\sqrt{N}}\leq\frac{T}{\sqrt{N}}. \qquad (31)

Finally, since $\hat{g}_{k}^{*,\omega}$ takes values in $[0,1]$, Term 3 in (28) can be bounded as follows:

\text{Term 3} = \sum_{k=0}^{T-1}\Big[\sum_{\omega\in\Omega}\hat{g}^{*,\omega}_{k}|\mathbb{P}_{N}(\omega)-\mathbb{P}(\omega)|\Big] \leq T\sum_{\omega\in\Omega}|\mathbb{P}_{N}(\omega)-\mathbb{P}(\omega)|. \qquad (32)

Using (29), (31), and (32) in (28), we arrive at the result in (27). The proof is thus complete. ∎

The above lemma demonstrates that the deviation of the empirical average from the approximate mean-field trajectory vanishes as the number of teams increases; hence, the latter 'reasonably' approximates the former when the number of teams is large.
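
As a toy sanity check of the $1/\sqrt{N}$ scaling in (27), one can draw i.i.d. Bernoulli transmission decisions around a fixed mean-field trajectory and measure the empirical gap; this sketch isolates only the concentration effect behind Term 2 and ignores the policy-deviation and type-mismatch terms.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_gap(N, T, g_star, runs=200):
    """Average |gamma^{N,av}_k - g*_k| when all sensors transmit i.i.d. with mean g*_k."""
    gaps = []
    for _ in range(runs):
        gamma = rng.random((N, T)) < g_star        # Bernoulli(g*_k) decisions
        gaps.append(np.mean(np.abs(gamma.mean(axis=0) - g_star)))
    return np.mean(gaps)

g_star = np.full(50, 0.4)                          # a constant mean-field trajectory
for N in (10, 100, 1000):
    print(N, empirical_gap(N, 50, g_star))         # gap shrinks roughly like 1/sqrt(N)
```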

VI Approximate Nash Equilibrium Analysis

In this section, we prove the approximate-Nash property of the MF solution for the finite-team game problem, which is made precise below.

Definition 4.

The set of team policies $\{(\pi^{*,i},\xi^{i})\in\mathcal{M}^{i}\times\Xi^{cen,i},~i\in[N]\}$ constitutes an $\epsilon$–Nash equilibrium for the cost functions $J^{N}_{i}$, $i\in[N]$, if there exists $\epsilon>0$ such that

J^{N}_{i}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})\leq\inf_{\xi^{i}\in\Xi^{cen,i}}J^{N}_{i}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})+\epsilon,

for all $i\in[N]$.

We also note that the infimum on the RHS of the above inequality is taken over the set of centralized sensor policies defined before, hence bringing in an element of conservatism.

Now, we define the following cost functions:

J_{i}^{N}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|\hat{X}^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}+\lambda\hat{\gamma}^{N,av}_{k}\hat{\gamma}^{i}_{k}\right], \qquad (33a)
J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|\hat{X}_{k}\|^{2}_{Q(\omega)}+\|U_{k}\|^{2}_{R(\omega)}+\lambda\hat{g}^{*}_{k}\hat{\gamma}_{k}\right], \qquad (33b)
J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|X^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}+\lambda\tilde{\gamma}^{N,av}_{k}\tilde{\gamma}^{i}_{k}\right], \qquad (33c)
J(\pi^{*},\zeta^{d},\hat{g}^{*})=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|X^{d}_{k}\|^{2}_{Q(\omega)}+\|U_{k}\|^{2}_{R(\omega)}+\lambda\hat{g}^{*}_{k}\gamma^{d}_{k}\right]. \qquad (33d)

The cost (33a) is the one incurred by team $i$ in the finite-team setting when all the teams follow the $\alpha$–approximate MFTE policy. The cost in (33b) is the generic team's cost under the equilibrium policy and the equilibrium mean field. The cost in (33c) is incurred by team $i$ when it deviates to a policy $\xi^{i}\in\Xi^{cen,i}$ while all the other teams still employ the MF policy. Finally, the cost in (33d) is incurred when a team follows a decentralized policy $\zeta^{d}\in\Xi^{dec}$ under the approximate equilibrium MF trajectory. The notation for the states and actions in the above costs is under the respective policies and trajectory distributions, and is self-explanatory.

We next state and prove another main result of the paper in the following theorem, which essentially characterizes how ‘well’ the MFTE policy performs.

Theorem 2.

Suppose that Assumption 1 holds. Suppose also that $X_{0}$ is given. Then, the set $\{\hat{\zeta}^{*,i},\,1\leq i\leq N\}$ of decentralized policies constitutes an $(\epsilon_{a},\epsilon_{b})$–Nash equilibrium for the finite-team uplink scheduling game, i.e.,

J_{i}^{N}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})\leq\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})+\epsilon_{a}(\alpha)+\epsilon_{b}(N),

where $\epsilon_{a}(\alpha)=\mathcal{O}\Big(\frac{1}{1+\min_{k\in[T]}\{e^{\alpha|\operatorname{VoI}_{k}|}\}}\Big)$ and $\epsilon_{b}(N)=\mathcal{O}\Big(\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big)$.

Proof.

To establish the inequality in Theorem 2, consider the following:

J_{i}^{N}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})-\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i}) = J^{N}_{i}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})-J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*}) \qquad (34a)
\quad + J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})-\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},\zeta^{d},\hat{g}^{*}) \qquad (34b)
\quad + \inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},\zeta^{d},\hat{g}^{*})-\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i}). \qquad (34c)

The proofs for bounding each of the above terms are provided in the Appendices. The term in (34a) is bounded by Lemma 3 in Appendix II. Next, to bound the term in (34b), we first prove the boundedness of the state-action value function in Lemma 5 in Appendix III and consequently use Proposition 5 in Appendix III. Finally, the term in (34c) is bounded by Lemma 4 in Appendix II.

Combining the results from Lemmas 3 and 4 and Proposition 5, we arrive at

J_{i}^{N}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})-\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})\leq\epsilon_{a}(\alpha)+\mathcal{O}\Big(\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big)=\epsilon_{a}(\alpha)+\epsilon_{b}(N). \qquad (35)

The proof is thus complete.

Theorem 2 states that the $\alpha$–approximate MFTE policy derived using the infinite-team system behaves as an approximate Nash equilibrium solution for the original finite-team game, where the approximation results from 1) using the policy $\hat{\zeta}^{*}$ instead of the optimal sensing policy $\zeta^{*}$, and 2) using the infinite-team system as a proxy for the $N$–team system.

VII Numerical Experiments

In this section, we provide simulation results for the finite-team game and demonstrate the performance of the approximate MFTE computed in Section IV. Throughout, we take the type set to be a singleton, i.e., all the teams are homogeneous. Further, we consider open-loop unstable scalar agents with the following parameters: $A=1.5$, $B=2$, $Q=2$, $R=1$, $\Sigma_{0}=2$, $K_{W}=6$, and a time horizon $T=50$. Next, we note that the approximate sensing policy in (24) depends on the value of information, which in turn depends on the conditional value function of the state. Since we do not have a closed-form expression for the latter, we use supervised learning with function approximation [56] to approximate it. In particular, we start by parametrizing the value function using a quadratic-in-state and quadratic-in-mean-field approximation as $w_{0}+w_{1}e+w_{2}e^{2}+w_{3}\hat{g}^{2}$, where $w_{0},w_{1},w_{2}$, and $w_{3}$ denote the weighting coefficients, $e$ denotes the estimation error, and $\hat{g}$ denotes the mean-field trajectory. Subsequently, we run value iteration using the parametrized form of the value function. Finally, we perform supervised learning with a mean-squared error loss to learn the weighting coefficients. The procedure terminates whenever the latter converge to within a $\delta$ range for a suitably chosen $\delta>0$. Later, we also study the effect of the degree of the approximating polynomial on the total average cost per team.
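
A possible implementation of the value-function fit described above is sketched below: samples of the estimation error and the mean field are regressed, via a least-squares (mean-squared-error) fit, onto the feature vector $(1, e, e^{2}, \hat{g}^{2})$. The sampling scheme and the synthetic targets are simplified placeholders for the value-iteration targets used in our experiments.

```python
import numpy as np

def features(e, g_hat):
    """Quadratic-in-state, quadratic-in-mean-field features: V ~ w0 + w1 e + w2 e^2 + w3 g^2."""
    return np.stack([np.ones_like(e), e, e**2, g_hat**2], axis=1)

def fit_value_weights(e_samples, g_samples, v_targets):
    """Least-squares (mean-squared-error) fit of the weighting coefficients w0..w3."""
    Phi = features(e_samples, g_samples)
    w, *_ = np.linalg.lstsq(Phi, v_targets, rcond=None)
    return w

# Toy usage: targets generated from a known quadratic with additive noise.
rng = np.random.default_rng(0)
e = rng.normal(0.0, 2.0, size=1000)
g = rng.uniform(0.0, 1.0, size=1000)
v = 1.0 + 0.5 * e + 2.0 * e**2 + 3.0 * g**2 + rng.normal(0.0, 0.1, size=1000)
print(fit_value_weights(e, g, v))   # should be close to [1.0, 0.5, 2.0, 3.0]
```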


Figure 3: The plot shows the increase in the average cost per team with the increase in the importance weight $\lambda$.

In the first study, we plot in Figure 3 the variation of the average cost per team versus the importance weight $\lambda$ for four different values $\lambda=0.3, 1.0, 5.0$, and $10$. The corresponding values of $\alpha$ were taken to be $1.8, 0.55, 0.11$, and $0.055$. The figure shows a box plot depicting the median (red line) and spread (box) of the average cost per team over 100 runs for each value of $\lambda$ and $N=50$ teams, under the equilibrium controller-sensor policy developed using the mean-field game paradigm. We observe that the average cost increases with $\lambda$. This is explained by an increase in the overall estimation error due to the increase in the price of communication.


Figure 4: The plot shows the variation of the approximate mean-field trajectory ($\hat{g}^{*}$) as a function of time for different values of $\lambda$ and $\alpha$.

For the same set of values of $\lambda$ and $\alpha$, we also plot in Figure 4 the approximate equilibrium mean-field trajectory corresponding to the $\alpha$–approximate MFTE computed in Section IV, which denotes the average utilization of the wireless channel. We see that the channel utilization decreases as the price of communication increases, in line with intuition. The corresponding state, estimation error, and control input trajectories under the approximate equilibrium policies for all the teams are shown in Figures 5, 6, and 7, respectively.


Figure 5: The plot shows the variation of the state trajectory with time.


Figure 6: The plot shows the variation of the estimation error trajectory with time.


Figure 7: The plot shows the variation of the control input trajectory with time.

Next, in Figure 8, we study the variation of the average cost per team under the equilibrium controller-sensor policy for a fixed value of $\lambda=5$ and varying values of $\alpha$. The figure shows a box plot of the average cost per team over 100 runs for each value of $\alpha$ and $N=50$ teams. We observe that the average cost shows a decreasing trend with increasing values of $\alpha$. This is explained by the fact that increasing $\alpha$ decreases the level of approximation in the construction of the policy $\hat{\zeta}$ from the optimal policy $\zeta^{*}$, which in turn leads to a lower cost.

Figure 8: The plot shows the variation of the average cost per team vs. $\alpha$ for a fixed $\lambda=5$.

Next, in Figure 9, we provide box plots of the average cost per team over 100 runs for varying numbers of teams. The values of $\alpha$ and $\lambda$ were chosen to be 0.055 and 10, respectively. We observe that the spread of the box plot decreases with increasing $N$, indicating that the mean-field approximation improves as $N$ increases.


Figure 9: The plot shows the variation of the average cost per team versus the number of teams in the population for fixed $\lambda=10$ and $\alpha=0.055$.

Finally, we study the effect of the degree of the polynomial used to approximate the value function on the average cost of the teams. For this, we fix the term $w_{3}\hat{g}^{2}$ and choose $\lambda=5$ and $\alpha=0.11$. We plot in Figure 10 the average cost per team as a function of the degree of the polynomial. As the degree increases from quadratic to cubic to quartic, we do not observe significant changes in the average cost per team; the median costs (red line) for the three cases are 3640, 3650, and 3666, respectively. This demonstrates that the quadratic polynomial approximates the value function reasonably well.

Figure 10: The plot shows the variation of the average cost per team versus the degree of the polynomial used to approximate the value function, for fixed λ=5\lambda=5 and α=0.11\alpha=0.11.

VIII Conclusion & Discussions

In this work, we have computed approximate Nash equilibrium team policies for a multi-team dynamic game in which each team coordinates its communication and control policies while utilizing a shared wireless channel. Specifically, we have formulated a noncooperative (general-sum) game between NN teams, each comprised of a local sensor and a local controller that communicate over a network and incur an overall cost which is coupled with the (sensing) policies of the other teams as a result of communication over the shared channel. Since the sensing policies of the other teams are unavailable (and become increasingly difficult to obtain as the population grows), we have employed the mean-field team game framework to compute approximate decentralized Nash equilibria between the teams. Toward this end, we have first observed that the finite cardinality of the sensing action space prohibits the use of a contraction argument to establish the existence of a mean-field equilibrium. To overcome this difficulty, we have constructed a Boltzmann-type sensing policy, which we then employed to explicitly prove the existence of a unique approximate mean-field team equilibrium. Subsequently, we have shown that the same constitutes an approximate Nash equilibrium for the finite-team game, where the approximation error vanishes as the number of teams, NN, and the Boltzmann parameter α\alpha increase. We have also validated the theoretical results with extensive numerical simulations.

We list here a few future research directions. First, for the numerical analysis, we have assumed that the price parameter λ\lambda was fixed, and we have provided an extensive numerical analysis of its effect on the average costs, estimation errors, and control inputs of the teams. One direction would thus be to formulate the finite-team game within a Stackelberg setting, where the leader (possibly a telecommunication/network operator) solves an optimization problem to distribute an ‘optimal’ λ\lambda to the follower teams. This necessitates substituting the policies of the followers into the leader’s optimization problem and subsequently deriving the leader’s optimal policy. However, this currently appears challenging since each team’s policy is not available in closed form; rather, it is known only as a function of the conditional value function. Thus, it would be interesting to see whether one can derive analytical results for the Stackelberg game setting.

Another direction worth mentioning is that, in the current work, we did not account for the no-transmission instants within the information structure of the controller. This is because the no-transmission instants make the estimator dynamics nonlinear and couple the design of the optimal estimator with that of the optimal sensing policy. Although it has been shown in [27] that the optimal estimator turns out to be linear and decoupled from the sensor design, this holds only at optimality. In the current work, however, we have employed a Boltzmann sensing policy (which is an approximation to the optimal policy) to establish the existence of a unique and tractable equilibrium, and hence the decoupling argument between the estimator and the scheduler no longer holds. Thus, it would be interesting to investigate whether results similar to those in this paper can be generalized to the case where the information set of the controller also includes the no-transmission instants.

References

  • [1] D. Shishika and V. Kumar, “A review of multi agent perimeter defense games,” in Decision and Game Theory for Security: 11th International Conference, GameSec 2020, College Park, MD, USA, October 28–30, 2020, Proceedings 11.   Springer, 2020, pp. 472–485.
  • [2] J. Subramanian, A. Kumar, and A. Mahajan, “Mean-field games among teams,” arXiv preprint arXiv:2310.12282, 2023.
  • [3] G. Y. Weintraub, C. L. Benkard, and B. Van Roy, “Markov perfect industry dynamics with many firms,” Econometrica, vol. 76, no. 6, pp. 1375–1411, 2008.
  • [4] T. Hatanaka, N. Chopra, M. Fujita, and M. W. Spong, Passivity-based Control and Estimation in Networked Robotics.   Springer, 2015.
  • [5] J. Marschak, “Elements for a theory of teams,” Management Science, vol. 1, no. 2, pp. 127–137, 1955.
  • [6] R. Radner, “Team decision problems,” The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 857–881, 1962.
  • [7] S. Yüksel and T. Başar, Stochastic networked control systems: Stabilization and optimization under information constraints.   Springer Science & Business Media, 2013.
  • [8] A. Dave and A. A. Malikopoulos, “Decentralized stochastic control in partially nested information structures,” IFAC-PapersOnLine, vol. 52, no. 20, pp. 97–102, 2019.
  • [9] A. Dave, N. Venkatesh, and A. A. Malikopoulos, “Decentralized control of two agents with nested accessible information,” in 2022 American Control Conference (ACC).   IEEE, 2022, pp. 3423–3430.
  • [10] H. S. Witsenhausen, “A counterexample in stochastic optimum control,” SIAM Journal on Control, vol. 6, no. 1, pp. 131–147, 1968.
  • [11] A. A. Feldbaum, “Dual control theory,” in Control Theory: Twenty-Five Seminal Papers (1932-1981), T. Başar, Ed.   Wiley-IEEE Press, 2001, ch. 10, pp. 874–880.
  • [12] T. Başar, “Variations on the theme of the Witsenhausen counterexample,” in Proceedings of the 47th IEEE Conference on Decision and Control, December 2008, pp. 1614–1619.
  • [13] K. J. Astrom and B. M. Bernhardsson, “Comparison of Riemann and Lebesgue sampling for first order stochastic systems,” in Proceedings of the 41st IEEE Conference on Decision and Control, 2002., vol. 2.   IEEE, 2002, pp. 2011–2016.
  • [14] P. Tabuada, “Event-triggered real-time scheduling of stabilizing control tasks,” IEEE Transactions on Automatic Control, vol. 52, no. 9, pp. 1680–1685, 2007.
  • [15] W. P. Heemels, K. H. Johansson, and P. Tabuada, “An introduction to event-triggered and self-triggered control,” in Proceedings of the 51st IEEE Conference on Decision and Control, 2012, pp. 3270–3285.
  • [16] K. Nar and T. Başar, “Sampling multidimensional Wiener processes,” in Proc. 53rd IEEE Conference on Decision and Control (CDC’14, Dec 15-17, 2014; Los Angeles, CA), pp. 3426–3431.
  • [17] D. V. Dimarogonas, E. Frazzoli, and K. H. Johansson, “Distributed event-triggered control for multi-agent systems,” IEEE Transactions on Automatic Control, vol. 57, no. 5, pp. 1291–1297, 2011.
  • [18] G. S. Seyboth, D. V. Dimarogonas, and K. H. Johansson, “Event-based broadcasting for multi-agent average consensus,” Automatica, vol. 49, no. 1, pp. 245–252, 2013.
  • [19] O. C. Imer and T. Başar, “Optimal estimation with limited measurements,” in Proceedings of the 44th IEEE Conference on Decision and Control.   IEEE, 2005, pp. 1029–1034.
  • [20] O. C. Imer and T. Başar, “Optimal estimation with limited measurements,” International Journal of Systems, Control and Communications, vol. 2, no. 1-3, pp. 5–29, 2010.
  • [21] G. M. Lipsa and N. C. Martins, “Remote state estimation with communication costs for first-order LTI systems,” IEEE Transactions on Automatic Control, vol. 56, no. 9, pp. 2013–2025, 2011.
  • [22] O. C. Imer and T. Başar, “Optimal control with limited controls,” in 2006 American Control Conference.   IEEE, 2006, pp. 298–303.
  • [23] O. C. Imer and T. Basar, “To measure or to control: optimal control with scheduled measurements and controls,” in 2006 American Control Conference.   IEEE, 2006, pp. 1003–1008.
  • [24] A. Molin and S. Hirche, “On LQG joint optimal scheduling and control under communication constraints,” in Proceedings of the 48th IEEE Conference on Decision and Control held jointly with 28th Chinese Control Conference.   IEEE, 2009, pp. 5832–5838.
  • [25] D. Maity and J. S. Baras, “Minimal feedback optimal control of linear-quadratic-Gaussian systems: No communication is also a communication,” IFAC-PapersOnLine, vol. 53, no. 2, pp. 2201–2207, 2020.
  • [26] T. Soleymani, J. S. Baras, and S. Hirche, “Value of information in feedback control: Quantification,” IEEE Transactions on Automatic Control, vol. 67, no. 7, pp. 3730–3737, 2022.
  • [27] T. Soleymani, J. S. Baras, S. Hirche, and K. H. Johansson, “Value of information in feedback control: Global optimality,” IEEE Transactions on Automatic Control, vol. 68, no. 6, pp. 3641–3647, 2023.
  • [28] I. Hogeboom-Burr and S. Yüksel, “Zero-sum games involving teams against teams: Existence of equilibria, and comparison and regularity in information,” Systems & Control Letters, vol. 172, p. 105454, 2023.
  • [29] M. G. Lagoudakis and R. Parr, “Learning in zero-sum team Markov games using factored value functions,” Advances in Neural Information Processing Systems, vol. 15, 2002.
  • [30] M. Ghimire, L. Zhang, W. Zhang, Y. Ren, and Z. Xu, “Solving two-player general-sum games between swarms,” available on arXiv:2310.01682, 2023.
  • [31] D. Maity and J. S. Baras, “Linear quadratic stochastic differential games under asymmetric value of information,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 8957–8962, 2017.
  • [32] S. Aggarwal, T. Başar, and D. Maity, “Linear quadratic zero-sum differential games with intermittent and costly sensing,” IEEE Control Systems Letters, 2024.
  • [33] M. Huang, P. E. Caines, and R. P. Malhamé, “Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε\varepsilon-Nash equilibria,” IEEE Transactions on Automatic Control, vol. 52, no. 9, pp. 1560–1571, 2007.
  • [34] J.-M. Lasry and P.-L. Lions, “Mean field games,” Japanese Journal of Mathematics, vol. 2, no. 1, pp. 229–260, 2007.
  • [35] M. Huang, R. P. Malhamé, P. E. Caines et al., “Large population stochastic dynamic games: closed-loop Mckean-Vlasov systems and the Nash Certainty Equivalence principle,” Communications in Information & Systems, vol. 6, no. 3, pp. 221–252, 2006.
  • [36] J. Moon and T. Başar, “Linear quadratic risk-sensitive and robust mean field games,” IEEE Transactions on Automatic Control, vol. 62, no. 3, pp. 1062–1077, 2016.
  • [37] S. Aggarwal, M. A. U. Zaman, M. Bastopcu, and T. Başar, “Weighted age of information based scheduling for large population games on networks,” IEEE Journal on Selected Areas in Information Theory, vol. 4, pp. 682–697, 2023.
  • [38] F. Bagagiolo and D. Bauso, “Mean-field games and dynamic demand management in power grids,” Dynamic Games and Applications, vol. 4, pp. 155–176, 2014.
  • [39] S. Y. Olmez, S. Aggarwal, J. W. Kim, E. Miehling, T. Başar, M. West, and P. G. Mehta, “Modeling presymptomatic spread in epidemics via mean-field games,” in 2022 American Control Conference (ACC).   IEEE, 2022, pp. 3648–3655.
  • [40] S. Aggarwal, M. A. uz Zaman, M. Bastopcu, S. Ulukus, and T. Başar, “A mean field game model for timely computation in edge computing systems,” available on arXiv:2404.02898, 2024.
  • [41] J. Huang, Z. Qiu, S. Wang, and Z. Wu, “Linear quadratic mean-field game-team analysis: A mixed coalition approach,” Automatica, vol. 159, p. 111358, 2024.
  • [42] B. Djehiche, A. Tcheukam, and H. Tembine, “Mean-field-type games in engineering,” available on arXiv:1605.03281, 2016.
  • [43] M. A. U. Zaman, A. Koppel, M. Laurière, and T. Başar, “Independent RL for cooperative-competitive agents: A mean-field perspective,” available on arXiv:2403.11345, 2024.
  • [44] S. Aggarwal, M. A. uz Zaman, M. Bastopcu, and T. Başar, “Large population games on constrained unreliable networks,” in 2023 62nd IEEE Conference on Decision and Control (CDC), 2023, pp. 3480–3485.
  • [45] F. Mason, F. Chiariotti, A. Zanella, and P. Popovski, “Multi-agent reinforcement learning for coordinating communication and control,” IEEE Transactions on Cognitive Communications and Networking, 2024.
  • [46] Q. Lan, D. Wen, Z. Zhang, Q. Zeng, X. Chen, P. Popovski, and K. Huang, “What is semantic communication? A view on conveying meaning in the era of machine intelligence,” Journal of Communications and Information Networks, vol. 6, no. 4, pp. 336–371, 2021.
  • [47] E. Uysal, O. Kaya, A. Ephremides, J. Gross, M. Codreanu, P. Popovski, M. Assaad, G. Liva, A. Munari, B. Soret et al., “Semantic communications in networked systems: A data significance perspective,” IEEE Network, vol. 36, no. 4, pp. 233–240, 2022.
  • [48] H. Du, J. Wang, D. Niyato, J. Kang, Z. Xiong, M. Guizani, and D. I. Kim, “Rethinking wireless communication security in semantic internet of things,” IEEE Wireless Communications, vol. 30, no. 3, pp. 36–43, 2023.
  • [49] M. Kountouris and N. Pappas, “Semantics-empowered communication for networked intelligent systems,” IEEE Communications Magazine, vol. 59, no. 6, pp. 96–102, 2021.
  • [50] O. Ayan, P. Kutsevol, H. Y. Özkan, and W. Kellerer, “Semantics-and task-oriented scheduling for networked control systems in practice,” IEEE Access, vol. 10, pp. 115 673–115 690, 2022.
  • [51] W. Yang, H. Du, Z. Q. Liew, W. Y. B. Lim, Z. Xiong, D. Niyato, X. Chi, X. Shen, and C. Miao, “Semantic communications for future Internet: Fundamentals, applications, and challenges,” IEEE Communications Surveys & Tutorials, vol. 25, no. 1, pp. 213–250, 2022.
  • [52] J.-M. Lasry and P.-L. Lions, “Jeux à champ moyen. i–le cas stationnaire,” Comptes Rendus Mathématique, vol. 343, no. 9, pp. 619–625, 2006.
  • [53] ——, “Jeux à champ moyen. ii–horizon fini et contrôle optimal,” Comptes Rendus Mathématique, vol. 343, no. 10, pp. 679–684, 2006.
  • [54] K. Cui and H. Koeppl, “Approximately solving mean field games via entropy-regularized deep reinforcement learning,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2021, pp. 1909–1917.
  • [55] J. Moon and T. Başar, “Discrete-time LQG mean field games with unreliable communication,” in Proceedings of the 53rd IEEE Conference on Decision and Control (CDC), December 2014, pp. 2697–2702.
  • [56] D. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming.   Athena Scientific, 1996.

Appendix I

Proof of Proposition 3.

We prove the proposition by first establishing the continuity of the operator 𝒯\mathcal{T} (via a backward induction on the state-action value function) and then employing Brouwer’s fixed point theorem.

To this end, let us define the following set

𝒢:={g^:={g^k}k=0T1g^k1,k[T]}.\displaystyle\mathcal{G}:=\{\hat{g}:=\{\hat{g}_{k}\}_{k=0}^{T-1}\mid\|\hat{g}_{k}\|\leq 1,~{}\forall k\in[T]\}. (36)

Then, we note from (25) that 𝒯:[0,1][0,1]\mathcal{T}:[0,1]^{\mathbb{N}}\rightarrow[0,1]^{\mathbb{N}}, and hence, 𝒯(𝒢)𝒢\mathcal{T}(\mathcal{G})\subseteq\mathcal{G}. Further, the interval [0,1][0,1] is compact and convex, and hence, 𝒢\mathcal{G} is also compact and convex. Thus, it only remains to prove the continuity of 𝒯\mathcal{T}, which then by using Brouwer’s fixed point theorem would prove the statement of the proposition. From now on, we will omit ω\omega wherever it clutters the notation.

To establish continuity, using (25), it suffices to show the continuity of VoIkω\operatorname{VoI}_{k}^{\omega} with respect to g^\hat{g}, for which we first show the continuity of the state-action value function, defined as:

Qk,x(kS,ak)\displaystyle\!\!{Q}^{*,x}_{k}(\mathcal{I}^{S}_{k},a_{k})\! =𝔼[λxkak+ek+1Γkek+1+mina{0,1}Qk+1,x(k+1S,a)],\displaystyle=\mathbb{E}[\lambda x_{k}a_{k}+e_{k+1}^{\top}\Gamma_{k}e_{k+1}+\min_{a\in\{0,1\}}{Q}^{*,x}_{k+1}(\mathcal{I}^{S}_{k+1},a)], (37)

for all k[T]k\in[T]. The above denotes the value received by the generic team when the action aka_{k} is chosen at the current instant and the future actions are chosen optimally from the admissible space Ξdec\Xi^{dec} for a given MF trajectory xx. Now, we will use backward induction on kk to prove the result. Fix g^𝒢\hat{g}\in\mathcal{G}. Consider the base case when k=T1k=T-1. Then, by recalling that the terminal cost is 0, we have that

QT1,g^(T1S,aT1)\displaystyle\!\!{Q}^{*,\hat{g}}_{T-1}(\mathcal{I}^{S}_{T-1},a_{T-1})\! =𝔼[λg^T1aT1+eTΓTeT].\displaystyle=\mathbb{E}[\lambda\hat{g}_{T-1}a_{T-1}+e_{T}^{\top}\Gamma_{T}e_{T}]. (38)

Then, we can compute the error covariance dynamics using Proposition 1 as

𝔼[ek+1ek+1]=𝔼[(1γk)A(ω)ekekA(ω)]+KW(ω).\displaystyle\!\mathbb{E}[e_{k+1}e_{k+1}^{\top}]=\mathbb{E}[(1-\gamma_{k})A(\omega)e_{k}e_{k}^{\top}A^{\top}(\omega)]\!+\!K_{W}(\omega).\! (39)

Substituting (39) into (38) for k=T1k=T-1, we arrive at the equation

QT1,g^\displaystyle{Q}^{*,\hat{g}}_{T-1} (T1S,aT1)=λg^T1aT1+(1aT1)2eT1AΓTAeT1+𝔼[WT1ΓTWT1],\displaystyle(\mathcal{I}^{S}_{T-1},a_{T-1})\!=\lambda\hat{g}_{T-1}a_{T-1}+(1-a_{T-1})^{2}e_{T-1}^{\top}A^{\top}\Gamma_{T}Ae_{T-1}+\mathbb{E}[W_{T-1}^{\top}\Gamma_{T}W_{T-1}],

which is linear (and thus, clearly continuous) in g^\hat{g}.

Next, we take the induction hypothesis to be the assertion that Qk+1,g^(k+1S,ak+1){Q}^{*,\hat{g}}_{k+1}(\mathcal{I}^{S}_{k+1},a_{k+1}) is continuous in g^\hat{g}. Then, we prove the assertion for instant kk. In this regard, consider the following:

Qk,g^(kS,ak)\displaystyle{Q}^{*,\hat{g}}_{k}(\mathcal{I}^{S}_{k},a_{k})\! =𝔼[λg^kak+ek+1Γkek+1+mina{0,1}Qk+1,g^(k+1S,a)].\displaystyle=\mathbb{E}[\lambda\hat{g}_{k}a_{k}+e_{k+1}^{\top}\Gamma_{k}e_{k+1}+\min_{a\in\{0,1\}}{Q}^{*,\hat{g}}_{k+1}(\mathcal{I}^{S}_{k+1},a)].

From the induction hypothesis, we have that mina{0,1}Qk+1,g^(k+1S,a)\min_{a\in\{0,1\}}{Q}^{*,\hat{g}}_{k+1}(\mathcal{I}^{S}_{k+1},a) is continuous in g^\hat{g}, since the pointwise minimum of finitely many continuous functions is continuous. Further, the running cost 𝔼[λg^kak+ek+1Γkek+1]\mathbb{E}[\lambda\hat{g}_{k}a_{k}+e_{k+1}^{\top}\Gamma_{k}e_{k+1}] is also continuous in g^\hat{g}. The desired continuity of Qk,g^(kS,ak){Q}^{*,\hat{g}}_{k}(\mathcal{I}^{S}_{k},a_{k}) thus follows from the fact that the sum of continuous functions is continuous. Consequently, the result is established for all k[T]k\in[T] and any chosen action a{0,1}a\in\{0,1\} by invoking the induction principle. The continuity of VoIkω\operatorname{VoI}_{k}^{\omega} then follows by noting that

VoIkω=Qk,g^,ω(kS,0)Qk,g^,ω(kS,1)\displaystyle\operatorname{VoI}_{k}^{\omega}={Q}^{*,\hat{g},\omega}_{k}(\mathcal{I}^{S}_{k},0)-{Q}^{*,\hat{g},\omega}_{k}(\mathcal{I}^{S}_{k},1)

is the difference of two continuous functions. The proof is thus complete. ∎
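
For completeness, the following is a small numerical sketch of the backward recursion (37) and of the resulting value of information VoIk=Qk,g^(,0)Qk,g^(,1)\operatorname{VoI}_{k}={Q}^{*,\hat{g}}_{k}(\cdot,0)-{Q}^{*,\hat{g}}_{k}(\cdot,1) for a scalar system. The parameters AA, Γ\Gamma, KWK_{W}, λ\lambda, the horizon, and the mean-field trajectory g^\hat{g} are illustrative placeholders; the value function is tabulated on a grid of the estimation error, and the expectation over the noise is replaced by a Monte Carlo average.

```python
import numpy as np

# Backward recursion for Q_k(e, a) and VoI_k(e) = Q_k(e, 0) - Q_k(e, 1) in a scalar
# system; all parameters are illustrative placeholders, and V_{k+1} is evaluated on
# an error grid via linear interpolation.
A, Gamma, Kw, lam, T = 1.05, 1.0, 1.0, 5.0, 20
g_hat = 0.5 * np.ones(T)                                  # assumed channel-traffic trajectory
e_grid = np.linspace(-10.0, 10.0, 401)
w_samples = np.sqrt(Kw) * np.random.default_rng(2).normal(size=2000)

V_next = np.zeros_like(e_grid)                            # terminal cost is zero
for k in range(T - 1, -1, -1):
    Q = np.empty((2, e_grid.size))
    for a in (0, 1):
        e_next = (1 - a) * A * e_grid[:, None] + w_samples[None, :]   # e_{k+1} under action a
        cont = np.interp(e_next, e_grid, V_next)                      # V_{k+1}(e_{k+1})
        Q[a] = lam * g_hat[k] * a + np.mean(Gamma * e_next**2 + cont, axis=1)
    voi_k = Q[0] - Q[1]                                   # value of information at instant k
    V_next = Q.min(axis=0)

print("VoI_0 at e = 0, 2, 4:", np.interp([0.0, 2.0, 4.0], e_grid, voi_k))
```

In this sketch, VoI0(0)=λg^0\operatorname{VoI}_{0}(0)=-\lambda\hat{g}_{0} (transmitting saves nothing when the error is zero), and the value of information turns positive once the error is large enough; this sign change is exactly what the Boltzmann policy smooths.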

Next, to prove Theorem 1, we first establish the following auxiliary lemma.

Lemma 2.

Let g^1,g^2𝒢\hat{g}^{1},\hat{g}^{2}\in\mathcal{G} where 𝒢\mathcal{G} is defined in (36). Then, the following is true for all k[T]k\in[T]:

Qk,g^1(kS,ak)\displaystyle{Q}^{*,\hat{g}^{1}}_{k}(\mathcal{I}^{S}_{k},a_{k}) Qk,g^2(kS,ak)λTg^1g^2.\displaystyle-{Q}^{*,\hat{g}^{2}}_{k}(\mathcal{I}^{S}_{k},a_{k})\leq\lambda T\|\hat{g}^{1}-\hat{g}^{2}\|.
Proof.

We will use backward induction on kk to prove the result. In this regard, first, consider the base case for k=T1k=T-1. Then, we have that:

QT1,g^1(T1S,aT1)\displaystyle{Q}^{*,\hat{g}^{1}}_{T-1}(\mathcal{I}^{S}_{T-1},a_{T-1}) QT1,g^2(T1S,aT1)\displaystyle-{Q}^{*,\hat{g}^{2}}_{T-1}(\mathcal{I}^{S}_{T-1},a_{T-1})
=λg^T11aT1+(1aT1)2eT1AΓTAeT1+𝔼[WT1ΓTWT1]\displaystyle=\lambda\hat{g}^{1}_{T-1}a_{T-1}+(1-a_{T-1})^{2}e_{T-1}^{\top}A^{\top}\Gamma_{T}Ae_{T-1}+\mathbb{E}[W_{T-1}^{\top}\Gamma_{T}W_{T-1}]
λg^T12aT1(1aT1)2eT1AΓTAeT1𝔼[W~T1ΓTW~T1]\displaystyle~{}~{}~{}~{}-\lambda\hat{g}^{2}_{T-1}a_{T-1}-(1-a_{T-1})^{2}e_{T-1}^{\top}A^{\top}\Gamma_{T}Ae_{T-1}-\mathbb{E}[\tilde{W}_{T-1}^{\top}\Gamma_{T}\tilde{W}_{T-1}]
=λaT1(g^T11g^T12)λg^1g^2.\displaystyle=\lambda a_{T-1}(\hat{g}^{1}_{T-1}-\hat{g}^{2}_{T-1})\leq\lambda\|\hat{g}^{1}-\hat{g}^{2}\|.

The base case is thus true. Assume next that

Qk+1,g^1(k+1S,ak+1)\displaystyle{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},a_{k+1}) Qk+1,g^2(k+1S,ak+1)λ(Tk1)g^1g^2.\displaystyle-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},a_{k+1})\leq\lambda(T-k-1)\|\hat{g}^{1}-\hat{g}^{2}\|. (40)

Then, consider the following:

Qk,g^1(kS,ak)Qk,g^2(kS,ak)\displaystyle{Q}^{*,\hat{g}^{1}}_{k}(\mathcal{I}^{S}_{k},a_{k})-{Q}^{*,\hat{g}^{2}}_{k}(\mathcal{I}^{S}_{k},a_{k})
=𝔼[λg^k1ak+ek+1Γkek+1+mina{0,1}Qk+1,g^1(k+1S,a)]𝔼[λg^k2ak+e~k+1Γke~k+1+mina{0,1}Qk+1,g^2(k+1S,a)]\displaystyle=\!\mathbb{E}[\lambda\hat{g}^{1}_{k}a_{k}+e_{k+1}^{\top}\Gamma_{k}e_{k+1}\!+\!\min_{a\in\{0,1\}}{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},a)]\!-\!\mathbb{E}[\lambda\hat{g}^{2}_{k}a_{k}\!+\!\tilde{e}_{k+1}^{\top}\Gamma_{k}\tilde{e}_{k+1}\!+\!\min_{a\in\{0,1\}}{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},a)]
=λak(g^k1g^k2)+𝔼[mina{0,1}Qk+1,g^1(k+1S,a)mina{0,1}Qk+1,g^2(k+1S,a)],\displaystyle=\lambda a_{k}(\hat{g}^{1}_{k}-\hat{g}^{2}_{k})+\mathbb{E}[\min_{a\in\{0,1\}}{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},a)\!-\!\min_{a\in\{0,1\}}{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},a)], (41)

where the errors ek+1e_{k+1} and e~k+1\tilde{e}_{k+1} are under noise realizations WkW_{k} and W~k\tilde{W}_{k}, respectively. Next, let us define a1:=argminaQk+1,g^1(k+1S,a)a_{1}^{*}:=\text{argmin}_{a}~{}{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},a) and a2:=argminaQk+1,g^2(k+1S,a)a_{2}^{*}:=\text{argmin}_{a}~{}{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},a). Then, we consider the following cases:

  • If a1=a2=1a_{1}^{*}=a_{2}^{*}=1 or a1=a2=0a_{1}^{*}=a_{2}^{*}=0, using (40), we have that

    Qk+1,g^1(k+1S,a1)\displaystyle{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},a_{1}^{*}) Qk+1,g^2(k+1S,a2)λ(Tk1)g^1g^2.\displaystyle-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},a_{2}^{*})\leq\lambda(T-k-1)\|\hat{g}^{1}-\hat{g}^{2}\|.
  • If a1=1a_{1}^{*}=1 and a2=0a_{2}^{*}=0, we have that

    Qk+1,g^1(k+1S,1)Qk+1,g^2(k+1S,0)Qk+1,g^1(k+1S,0)Qk+1,g^2(k+1S,0)λ(Tk1)g^1g^2,\displaystyle{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},1)-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},0)\leq{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},0)-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},0)\leq\lambda(T-k-1)\|\hat{g}^{1}-\hat{g}^{2}\|,

    where the last inequality follows by (40).

  • If a1=0a_{1}^{*}=0 and a2=1a_{2}^{*}=1, we have that

    Qk+1,g^1(k+1S,0)Qk+1,g^2(k+1S,1)Qk+1,g^1(k+1S,1)Qk+1,g^2(k+1S,1)λ(Tk1)g^1g^2,\displaystyle{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},0)-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},1)\leq{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},1)-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},1)\leq\lambda(T-k-1)\|\hat{g}^{1}-\hat{g}^{2}\|,

    where the last inequality follows by (40).

Thus, continuing from (41), we have that

Qk,g^1(kS,ak)Qk,g^2(kS,ak)\displaystyle{Q}^{*,\hat{g}^{1}}_{k}(\mathcal{I}^{S}_{k},a_{k})-{Q}^{*,\hat{g}^{2}}_{k}(\mathcal{I}^{S}_{k},a_{k}) λak(g^k1g^k2)+λ(Tk1)g^1g^2\displaystyle\leq\lambda a_{k}(\hat{g}^{1}_{k}-\hat{g}^{2}_{k})+\lambda(T-k-1)\|\hat{g}^{1}-\hat{g}^{2}\|
λ(Tk)g^1g^2λTg^1g^2.\displaystyle\leq\lambda(T-k)\|\hat{g}^{1}-\hat{g}^{2}\|\leq\lambda T\|\hat{g}^{1}-\hat{g}^{2}\|. (42)

The proof of the lemma is thus completed by invoking the induction principle. ∎

We are now ready to prove Theorem 1.

Proof of Theorem 1.

We will use Banach’s fixed point theorem to prove the result. In this regard, we start by noting from (25) that 𝒯:[0,1][0,1]\mathcal{T}:[0,1]^{\mathbb{N}}\rightarrow[0,1]^{\mathbb{N}}, and hence, 𝒯(𝒢)𝒢\mathcal{T}(\mathcal{G})\subseteq\mathcal{G}. Further, the interval [0,1][0,1] is a closed subset of the complete metric space \mathbb{R}, and hence complete, which further implies that [0,1][0,1]^{\mathbb{N}} is complete under the sup norm. Finally, we prove that 𝒯\mathcal{T} is a contraction.

Let g^1,g^2𝒢\hat{g}^{1},\hat{g}^{2}\in\mathcal{G} and further define VoIkω,1\operatorname{VoI}^{\omega,1}_{k} and VoIkω,2\operatorname{VoI}^{\omega,2}_{k} as the value of information terms at instant kk corresponding to g^1\hat{g}^{1} and g^2\hat{g}^{2}, respectively. Then, consider the following:

[𝒯(g^1)\displaystyle[\mathcal{T}(\hat{g}^{1}) 𝒯(g^2)]k=𝔼[11+eαVoIkω,111+eαVoIkω,2]=𝔼[eαVoIkω,2eαVoIkω,1(1+eαVoIkω,2)(1+eαVoIkω,1)]\displaystyle-\mathcal{T}(\hat{g}^{2})]_{k}=\mathbb{E}\left[\frac{1}{1+e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}}}\!-\!\frac{1}{1\!+\!e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}}}\right]=\mathbb{E}\left[\frac{e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}}-e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}}}{(1+e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}})(1+e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}})}\right]
=𝔼[eαVoIkω,2(1eαVoIkω,1+αVoIkω,2)(1+eαVoIkω,2)(1+eαVoIkω,1)]α𝔼[eαVoIkω,2(VoIkω,1VoIkω,2)(1+eαVoIkω,2)(1+eαVoIkω,1)].\displaystyle=\mathbb{E}\left[\frac{e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}}(1-e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}+\alpha\operatorname{VoI}^{\omega,2}_{k}})}{(1+e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}})(1+e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}})}\right]\leq\alpha\mathbb{E}\left[\frac{e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}}(\operatorname{VoI}^{\omega,1}_{k}-\operatorname{VoI}^{\omega,2}_{k})}{(1+e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}})(1+e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}})}\right]. (43)

Next, we prove that VoIω\operatorname{VoI}^{\omega} is Lipschitz continuous in g^\hat{g}. To this end, we invoke the definition of the value of information to arrive at:

VoIkω,1VoIkω,2\displaystyle\operatorname{VoI}^{\omega,1}_{k}-\operatorname{VoI}^{\omega,2}_{k} =Qk,g^1,ω(kS,ω,0)Qk,g^1,ω(kS,ω,1)Qk,g^2,ω(kS,ω,0)+Qk,g^2,ω(kS,ω,1)\displaystyle={Q}^{*,\hat{g}^{1},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},0)-{Q}^{*,\hat{g}^{1},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},1)-{Q}^{*,\hat{g}^{2},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},0)+{Q}^{*,\hat{g}^{2},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},1)
=Qk,g^1,ω(kS,ω,0)Qk,g^2,ω(kS,ω,0)+Qk,g^2,ω(kS,ω,1)Qk,g^1,ω(kS,ω,1)\displaystyle={Q}^{*,\hat{g}^{1},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},0)-{Q}^{*,\hat{g}^{2},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},0)+{Q}^{*,\hat{g}^{2},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},1)-{Q}^{*,\hat{g}^{1},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},1)
2λ(Tk)g^1g^2,\displaystyle\leq 2\lambda(T-k)\|\hat{g}^{1}-\hat{g}^{2}\|, (44)

where the last inequality follows by using Lemma 2 (applied with the roles of g^1\hat{g}^{1} and g^2\hat{g}^{2} interchanged for the last two terms). Then, by substituting (44) into (43), we arrive at

[𝒯(g^1)𝒯(g^2)]kLkg^1g^2whereLk:=2αλ(Tk)𝔼[eαVoIkω,2(1+eαVoIkω,2)(1+eαVoIkω,1)].\displaystyle[\mathcal{T}(\hat{g}^{1})-\mathcal{T}(\hat{g}^{2})]_{k}\leq L_{k}\|\hat{g}^{1}-\hat{g}^{2}\|~{}\text{where}~{}L_{k}:=2\alpha\lambda(T-k)\mathbb{E}\left[\frac{e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}}}{(1+e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}})(1+e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}})}\right].

A similar inequality can be proven for [𝒯(g^2)𝒯(g^1)]k[\mathcal{T}(\hat{g}^{2})-\mathcal{T}(\hat{g}^{1})]_{k}, which then yields the same bound for |[𝒯(g^1)𝒯(g^2)]k||[\mathcal{T}(\hat{g}^{1})-\mathcal{T}(\hat{g}^{2})]_{k}| as well. Thus, the operator 𝒯\mathcal{T} is a contraction if maxk[T]Lk<1\max_{k\in[T]}L_{k}<1, which is implied by the sufficient condition (26). The proof is then completed by invoking Banach’s fixed point theorem, which guarantees the existence of a unique fixed point g^𝒢\hat{g}^{*}\in\mathcal{G} of the operator 𝒯\mathcal{T}, i.e., g^=𝒯(g^)\hat{g}^{*}=\mathcal{T}(\hat{g}^{*}). ∎
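
The fixed point guaranteed by Theorem 1 can be approximated numerically by the Picard iteration g^(n+1)=𝒯(g^(n))\hat{g}^{(n+1)}=\mathcal{T}(\hat{g}^{(n)}), which converges geometrically whenever the sufficient condition (26) holds. The sketch below implements one such iteration for a scalar system: a backward pass tabulates VoIk\operatorname{VoI}_{k} on an error grid, a forward Monte Carlo pass (with common random numbers, so that the sampled operator is deterministic) averages the Boltzmann transmission probabilities, and the resulting traffic trajectory is fed back until the sup-norm gap is small. All parameters and grid sizes are illustrative placeholders.

```python
import numpy as np

# Picard iteration g <- T(g) for the mean-field operator of (25), sketched for a
# scalar system with illustrative parameters.
rng = np.random.default_rng(3)
A, Gamma, Kw, lam, alpha, T = 1.05, 1.0, 1.0, 5.0, 0.11, 20
e_grid = np.linspace(-10.0, 10.0, 201)
w_exp = np.sqrt(Kw) * rng.normal(size=1000)            # samples for the expectation over W
n_paths = 5000
w_fwd = np.sqrt(Kw) * rng.normal(size=(T, n_paths))    # common random numbers: the sampled
u_fwd = rng.random(size=(T, n_paths))                  # operator below is then deterministic

def mf_operator(g_hat):
    """One application of the mean-field operator: backward VoI pass, forward traffic pass."""
    voi = np.zeros((T, e_grid.size))
    V_next = np.zeros_like(e_grid)
    for k in range(T - 1, -1, -1):                     # backward pass: tabulate VoI_k(e)
        Q = np.empty((2, e_grid.size))
        for a in (0, 1):
            e_next = (1 - a) * A * e_grid[:, None] + w_exp[None, :]
            Q[a] = lam * g_hat[k] * a + np.mean(
                Gamma * e_next**2 + np.interp(e_next, e_grid, V_next), axis=1)
        voi[k] = Q[0] - Q[1]
        V_next = Q.min(axis=0)
    e = np.zeros(n_paths)                              # forward pass: average transmission prob.
    g_new = np.zeros(T)
    for k in range(T):
        p_tx = 1.0 / (1.0 + np.exp(-alpha * np.interp(e, e_grid, voi[k])))
        g_new[k] = p_tx.mean()
        gamma = (u_fwd[k] < p_tx).astype(float)
        e = (1.0 - gamma) * A * e + w_fwd[k]
    return g_new

g = 0.5 * np.ones(T)                                   # initial guess for the traffic trajectory
for it in range(50):
    g_next = mf_operator(g)
    gap = np.max(np.abs(g_next - g))
    g = g_next
    if gap < 1e-3:
        break
print(f"stopped after {it + 1} iterations with sup-norm gap {gap:.2e}")
```

A smaller α\alpha shrinks the Lipschitz constants LkL_{k} and hence speeds up the contraction, which is consistent with the role of α\alpha in the sufficient condition (26).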

Appendix II

Lemma 3.

It holds that JiN(π,i,ζ^,i,ζ^,i)J(π,ζ^,g^)𝒪(1N+ϵP,N).J^{N}_{i}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})-J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})\leq\mathcal{O}\Big{(}\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big{)}.

Proof.

Consider the following:

JiN(π,i,ζ^,i,ζ^,i)\displaystyle J^{N}_{i}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i}) =1T𝔼[k=0T1X^kiQ(ωi)2+UkiR(ωi)2+λγ^kN,avγ^kiλg^kγ^ki+λg^kγ^ki]\displaystyle=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|\hat{X}^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}+\lambda\hat{\gamma}^{N,av}_{k}\hat{\gamma}^{i}_{k}-\lambda\hat{g}^{*}_{k}\hat{\gamma}^{i}_{k}\!+\!\lambda\hat{g}^{*}_{k}\hat{\gamma}^{i}_{k}\right]
=J(π,ζ^,g^)+λT𝔼[k=0T1(γ^kN,avg^k)γ^ki]J(π,ζ^,g^)+𝒪(1N+ϵP,N),\displaystyle=J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})+\frac{\lambda}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}(\hat{\gamma}^{N,av}_{k}\!-\!\hat{g}^{*}_{k})\hat{\gamma}^{i}_{k}\right]\leq J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})\!+\!\mathcal{O}\Big{(}\frac{1}{\sqrt{N}}\!+\!\epsilon_{P,N}\Big{)},

where the equality follows since we use the approximate MF policy pair (π,ζ^\pi^{*},\hat{\zeta}^{*}) on the finite-agent system and the inequality follows as a result of Lemma 1. The proof is thus complete. ∎

Lemma 4.

The following inequality holds:

infζdΞdecJ(π,ζd,g^)infξiΞcen,iJiN(π,i,ξi,ζ^,i)𝒪(1N+ϵP,N).\displaystyle\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},{\zeta^{d}},\hat{g}^{*})-\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})\leq\mathcal{O}\Big{(}\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big{)}.
Proof.

Let us fix a policy ξiΞcen,i\xi^{i}\in\Xi^{cen,i} and γ~kiξi(kSi)\tilde{\gamma}^{i}_{k}\sim\xi^{i}(\mathcal{I}^{S_{i}}_{k}) while the other agents play using the policy ζ^\hat{\zeta}^{*}. Then, we have that:

JiN(π,i,ξi,ζ^,i)\displaystyle J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i}) =1T𝔼[k=0T1XkiQ(ωi)2+UkiR(ωi)2+λγ~kN,avγ~ki+λg^kγ~kiλg^kγ~ki]\displaystyle=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|{X}^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}+\lambda\tilde{\gamma}^{N,av}_{k}\tilde{\gamma}^{i}_{k}+\lambda\hat{g}^{*}_{k}\tilde{\gamma}^{i}_{k}-\lambda\hat{g}^{*}_{k}\tilde{\gamma}^{i}_{k}\right]
=1T𝔼[k=0T1XkiQ(ωi)2+UkiR(ωi)2+λg^kγ~ki+λγ~kN,avγ~kiλg^kγ~ki]\displaystyle=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|{X}^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}+\lambda\hat{g}^{*}_{k}\tilde{\gamma}^{i}_{k}+\lambda\tilde{\gamma}^{N,av}_{k}\tilde{\gamma}^{i}_{k}-\lambda\hat{g}^{*}_{k}\tilde{\gamma}^{i}_{k}\right]
infζdΞdecJ(π,ζd,g^)+1T𝔼[k=0T1λγ~ki(γ~kN,avg^k)],\displaystyle\geq\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},{\zeta^{d}},\hat{g}^{*})+\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\lambda\tilde{\gamma}^{i}_{k}(\tilde{\gamma}^{N,av}_{k}-\hat{g}^{*}_{k})\right],

which gives

infζdΞdecJ(π,ζd,g^)JiN(π,i,ξi,ζ^,i)λT𝔼[k=0T1(g^kγ~kN,av)γ~ki]𝒪(1N+ϵP,N)\displaystyle\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},{\zeta}^{d},\hat{g}^{*})-J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})\leq\frac{\lambda}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}(\hat{g}^{*}_{k}-\tilde{\gamma}^{N,av}_{k})\tilde{\gamma}^{i}_{k}\right]\leq\mathcal{O}\Big{(}\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big{)}

which, since the above holds for every ξiΞcen,i\xi^{i}\in\Xi^{cen,i}, implies upon taking the infimum over ξi\xi^{i} that

infζdΞdecJ(π,ζd,g^)infξiΞcen,iJiN(π,i,ξi,ζ^,i)𝒪(1N+ϵP,N).\displaystyle\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},{\zeta}^{d},\hat{g}^{*})-\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})\leq\mathcal{O}\Big{(}\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big{)}. (45)

The proof is thus complete. ∎

Appendix III

Lemma 5.

The following inequality holds:

Q^kg^(kS,γ^k)fk(ek2)+ck,k[T],\displaystyle\hat{Q}^{\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})\leq f_{k}(\|e_{k}\|^{2})+c_{k},~{}~{}~{}k\in[T], (46)

where Q^kx(kS,ak)\!\hat{Q}^{x}_{k}(\mathcal{I}^{S}_{k},\!a_{k})\! denotes the state-action value function and is defined recursively as:

Q^kx(kS,ak)\displaystyle\hat{Q}^{x}_{k}(\mathcal{I}^{S}_{k},a_{k})\! =𝔼[λxkak+ek+1Γkek+1+Q^k+1x(k+1S,ak+1)],\displaystyle=\!\mathbb{E}[\lambda x_{k}a_{k}\!+\!e_{k+1}^{\top}\Gamma_{k}e_{k+1}\!\!+\!\hat{Q}^{x}_{k+1}(\mathcal{I}^{S}_{k+1},a_{k+1})], (47)

with ak,ak+1ζ^(kS,x)a_{k},a_{k+1}\sim\hat{\zeta}^{*}(\mathcal{I}^{S}_{k},x). Further, fk()f_{k}(\cdot) are time-dependent functions, and ck>0c_{k}>0 are constants.

We note that the same inequality can also be established for Q,x(,)Q^{*,x}(\cdot,\cdot) using the same proof technique as presented below.

Proof.

We prove the above lemma by induction on kk. In this regard, consider the base case with k=T1k=T-1. Then, we have

Q^T1g^(T1S,γ^T1)\displaystyle\hat{Q}^{\hat{g}^{*}}_{T-1}(\mathcal{I}^{S}_{T-1},\hat{\gamma}_{T-1}) =𝔼[λg^T1γ^T1+eTΓTeT]λ+sup0tTΓt𝔼[eTeT]\displaystyle=\mathbb{E}[\lambda\hat{g}^{*}_{T-1}\hat{\gamma}_{T-1}+e_{T}^{\top}\Gamma_{T}e_{T}]\leq\lambda+\sup_{0\leq t\leq T}\|\Gamma_{t}\|\mathbb{E}[e_{T}^{\top}e_{T}]
λ+sup0tTΓt(eT12+KW)=:fT1(eT12)+cT1,\displaystyle\leq\lambda+\sup_{0\leq t\leq T}\|\Gamma_{t}\|(\|e_{T-1}\|^{2}+K_{W})=:f_{T-1}(\|e_{T-1}\|^{2})+c_{T-1}, (48)

where we have used the error dynamics in (20) to arrive at the last inequality. The base case is therefore proved. Assume now that the inequality in the statement of the lemma holds for instant k+1k+1, i.e.,

Q^k+1g^(k+1S,γ^k+1)fk+1(ek+12)+ck+1.\displaystyle\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},\hat{\gamma}_{k+1})\leq f_{k+1}(\|e_{k+1}\|^{2})+c_{k+1}. (49)

Then, we prove the above for instant kk. Consider the following:

Q^kg^(kS,γ^k)\displaystyle\hat{Q}^{\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k}) =𝔼[λg^kγ^k+ek+1Γk+1ek+1+Q^k+1g^(k+1S,ζ^)]\displaystyle=\mathbb{E}[\lambda\hat{g}^{*}_{k}\hat{\gamma}_{k}+e_{k+1}^{\top}\Gamma_{k+1}e_{k+1}+\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},\hat{\zeta}^{*})]
λ+sup0tTΓt(ek2+KW)+𝔼[fk+1(ek+12)+ck+1]\displaystyle\leq\lambda+\sup_{0\leq t\leq T}\|\Gamma_{t}\|(\|e_{k}\|^{2}+K_{W})+\mathbb{E}[f_{k+1}(\|e_{k+1}\|^{2})+c_{k+1}]
λ+sup0tTΓt(ek2+KW)+𝔼[fk+1((1γ^k)ek+Wk2)+ck+1]\displaystyle\leq\lambda+\sup_{0\leq t\leq T}\|\Gamma_{t}\|(\|e_{k}\|^{2}+K_{W})+\mathbb{E}[f_{k+1}(\|(1-\hat{\gamma}_{k})e_{k}+W_{k}\|^{2})+c_{k+1}]
=:fk(ek2)+ck.\displaystyle=:f_{k}(\|e_{k}\|^{2})+c_{k}. (50)

The proof now follows by the induction principle, and is thus complete. ∎

Proposition 4.

We have that

Q^kg^(kS,γ^k)Q,g^(kS,γ^k)ϵ(α).\displaystyle\hat{Q}^{\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})-{Q}^{*,\hat{g}^{*}}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})\leq\epsilon^{\prime}(\alpha). (51)

for all k[T]k\in[T], where ϵ(α):=𝒪(11+minktT1{eα|VoIt|})\epsilon^{\prime}(\alpha):=\mathcal{O}\Big{(}\frac{1}{1+\min_{k\leq t\leq T-1}\{e^{\alpha|\operatorname{VoI}_{t}|}\}}\Big{)}.

Proof.

Let us fix g^[0,1]T\hat{g}^{*}\in[0,1]^{T}. We will use induction on kk to prove the result.

Let us start with the base case of k=T1k=T-1. Let us define the estimation errors

e~k+1=(1γ^k)Ae^k+W~k,e¯k+1=(1γ^k)Ae^k+W¯k,\displaystyle\tilde{e}_{k+1}=(1-\hat{\gamma}_{k})A\hat{e}_{k}+\tilde{W}_{k},\qquad\bar{e}_{k+1}=(1-\hat{\gamma}_{k})A\hat{e}_{k}+\bar{W}_{k},

where e^k\hat{e}_{k} is the error at instant k under action γ^k1\hat{\gamma}_{k-1}.

We recall that VT(TS,g^)=0V_{T}(\mathcal{I}_{T}^{S},\hat{g}^{*})=0. Then, we have that

Q^T1g^(T1S,γ^T1)QT1,g^(T1S,γ^T1)=tr(ΓT(𝔼[e~Te~T]𝔼[e¯Te¯T]))\displaystyle\hat{Q}^{\hat{g}^{*}}_{T-1}(\mathcal{I}^{S}_{T-1},\hat{\gamma}_{T-1})-{Q}^{*,\hat{g}^{*}}_{T-1}(\mathcal{I}^{S}_{T-1},\hat{\gamma}_{T-1})=tr(\Gamma_{T}(\mathbb{E}[\tilde{e}_{T}^{\top}\tilde{e}_{T}]-\mathbb{E}[\bar{e}_{T}^{\top}\bar{e}_{T}]))
=tr(ΓT((1γ^T1)Ae^T1e^T1A)+KW)tr(ΓT((1γ^T1)Ae^T1e^T1A)+KW)\displaystyle=tr(\Gamma_{T}((1-\hat{\gamma}_{T-1})A\hat{e}_{T-1}\hat{e}_{T-1}^{\top}A^{\top})+K_{W})-tr(\Gamma_{T}((1-\hat{\gamma}_{T-1})A\hat{e}_{T-1}\hat{e}_{T-1}^{\top}A^{\top})+K_{W})
=0<ϵ(α).\displaystyle=0<\epsilon^{\prime}(\alpha). (52)

The base case is thus proved. Assume now that the following is true:

Q^k+1g^(k+1S,γ^k+1)\displaystyle\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},\hat{\gamma}_{k+1}) Qk+1,g^(k+1S,γ^k+1)ϵ1(α)/2:=𝒪(11+mink+1tT1{eα|VoIt|}).\displaystyle-{Q}^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},\hat{\gamma}_{k+1})\leq\epsilon_{1}(\alpha)/2:=\mathcal{O}\Big{(}\frac{1}{1+\min_{k+1\leq t\leq T-1}\{e^{\alpha|\operatorname{VoI}_{t}|}\}}\Big{)}. (53)

Then, we prove the inequality in the statement of the proposition for instant kk. In this regard, let us define the set 𝒜:={0,1}\mathcal{A}:=\{0,1\} and the set 𝒜optx(S)\mathcal{A}^{x}_{\text{opt}}(\mathcal{I}^{S}) to be the set of optimal actions for a given trajectory xx and the information set S\mathcal{I}^{S}. Further, let us denote a suboptimal action as asub𝒜𝒜optx(S)a_{\text{sub}}\in\mathcal{A}\setminus\mathcal{A}^{x}_{\text{opt}}(\mathcal{I}^{S}). Then, we have that

Q^kg^(kS,γ^k)Qk,g^(kS,γ^k)=a𝒜ζ^(akS,g^,k)Q^k+1g^(k+1S,a)mina𝒜Qk+1,g^(k+1S,a)\displaystyle\hat{Q}^{\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})-{Q}^{*,\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})=\sum_{a\in\mathcal{A}}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*,k})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})
=a𝒜optg^(kS)ζ^(akS,g^,k)Q^k+1g^(k+1S,a)mina𝒜Qk+1,g^(k+1S,a)+a𝒜𝒜optg^(kS)ζ^(akS,g^)Q^k+1g^(k+1S,a)\displaystyle=\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*,k})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})+\sum_{\hskip 8.53581pta\in\mathcal{A}\setminus\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)
=[a𝒜optg^(kS)ζ^(akS,g^)Q^k+1g^(k+1S,a)a𝒜optg^(kS)ζ^(akS,g^)mina𝒜Qk+1,g^(k+1S,a)]\displaystyle=\Big{[}\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)-\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})\Big{]}
+[a𝒜optg^(kS)ζ^(akS,g^)mina𝒜Qk+1,g^(k+1S,a)mina𝒜Qk+1,g^(k+1S,a)]\displaystyle\hskip 17.07164pt+\Big{[}\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})\Big{]}
+a𝒜𝒜optg^(kS)ζ^(akS,g^)Q^k+1g^(k+1S,a)\displaystyle\hskip 17.07164pt+\sum_{\hskip 8.53581pta\in\mathcal{A}\setminus\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)
2maxa𝒜optg^(kS)[Q^k+1g^(k+1S,a)mina𝒜Qk+1,g^(k+1S,a)]+[a𝒜optg^(kS)ζ^(akS,g^)mina𝒜Qk+1,g^(k+1S,a)\displaystyle\leq 2\max_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}[\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})]+\Big{[}\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})
mina𝒜Qk+1,g^(k+1S,a)]+a𝒜𝒜optg^(kS)ζ^(akS,g^)Q^g^k+1(k+1S,a)\displaystyle\hskip 28.45274pt-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})\Big{]}+\sum_{\hskip 8.53581pta\in\mathcal{A}\setminus\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)
ϵ1(α)+[a𝒜optg^(kS)ζ^(akS,g^)mina𝒜Qk+1,g^(k+1S,a)mina𝒜Qk+1,g^(k+1S,a)]\displaystyle\leq\epsilon_{1}(\alpha)+\Big{[}\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})\Big{]}
+a𝒜𝒜optg^(kS)ζ^(akS,g^)Q^k+1g^(k+1S,a)\displaystyle\hskip 17.07164pt+\sum_{\hskip 8.53581pta\in\mathcal{A}\setminus\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)
(a)ϵ1(α)+ζ^(asubkS,g^)mina𝒜Qk+1,g^(k+1S,a)+a𝒜𝒜optg^(kS)ζ^(akS,g^)Q^k+1g^(k+1S,a)\displaystyle\overset{\text{(a)}}{\leq}\epsilon_{1}(\alpha)+\hat{\zeta}^{*}(a_{\text{sub}}\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})+\sum_{\hskip 8.53581pta\in\mathcal{A}\setminus\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)
ϵ1(α)+4(fk(ek2)+ck)11+eα|VoIk|ϵ(α).\displaystyle\leq\epsilon_{1}(\alpha)+4(f_{k}(\|e_{k}\|^{2})+c_{k})\frac{1}{1+e^{\alpha|\operatorname{VoI}_{k}|}}\leq\epsilon^{\prime}(\alpha). (54)

We note that the above is valid for VoIk0\operatorname{VoI}_{k}\neq 0. If VoIk=0\operatorname{VoI}_{k}=0, then both actions 0 and 1 produce the same cost, and hence, there is no suboptimal action. Thus, from the inequality (a) above, we get that Q^kg^(kS,γ^k)Qk,g^(kS,γ^k)ϵ1(α)<ϵ(α)\hat{Q}^{\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})-{Q}^{*,\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})\leq\epsilon_{1}(\alpha)<\epsilon^{\prime}(\alpha).

The proof thus follows by the induction principle for all k[T]k\in[T], and is complete.

Proposition 5.

It holds that J(π,ζ^,g^)infζdΞdecJ(π,ζd,g^)ϵ(α).J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})-\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},{\zeta^{d}},\hat{g}^{*})\leq\epsilon(\alpha).

Proof.

Let ζ=arginfζdΞdecJ(π,ζd,g^)\zeta^{*}=\text{arginf}_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},\zeta^{d},\hat{g}^{*}) and γ\gamma^{*} be an action chosen from ζ\zeta^{*}.

J(π,ζ^,g^)J(π,ζ,g^)\displaystyle J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})-J(\pi^{*},{\zeta}^{*},\hat{g}^{*})
=𝔼[Q^0(e0,γ^0,g^)]𝔼[Q0(e0,γ0,g^)]\displaystyle=\mathbb{E}[\hat{Q}_{0}(e_{0},\hat{\gamma}_{0},\hat{g}^{*})]-\mathbb{E}[{Q}^{*}_{0}(e_{0},{\gamma}^{*}_{0},\hat{g}^{*})]
=𝔼[Q^0(e0,γ^0,g^)]𝔼[Q0(e0,γ^0,g^)]+𝔼[Q0(e0,γ^0,g^)]𝔼[Q0(e0,γ0,g^)]\displaystyle=\mathbb{E}[\hat{Q}_{0}(e_{0},\hat{\gamma}_{0},\hat{g}^{*})]-\mathbb{E}[{Q}^{*}_{0}(e_{0},\hat{\gamma}_{0},\hat{g}^{*})]+\mathbb{E}[{Q}^{*}_{0}(e_{0},\hat{\gamma}_{0},\hat{g}^{*})]-\mathbb{E}[{Q}^{*}_{0}(e_{0},{\gamma}^{*}_{0},\hat{g}^{*})]
ϵ(α)+𝔼[Q0(e0,γ^0,g^)]𝔼[Q0(e0,γ0,g^)]\displaystyle\leq\epsilon^{\prime}(\alpha)+\mathbb{E}[{Q}^{*}_{0}(e_{0},\hat{\gamma}_{0},\hat{g}^{*})]-\mathbb{E}[{Q}^{*}_{0}(e_{0},{\gamma}^{*}_{0},\hat{g}^{*})]
=ϵ(α)+𝔼[λg^0(γ^0γ0)+e^1Γ1e^1(e1)Γ1e1+Q1(e^1,γ1,g^)Q1(e1,γ1,g^)]\displaystyle=\epsilon^{\prime}(\alpha)+\mathbb{E}[\lambda\hat{g}^{*}_{0}(\hat{\gamma}_{0}-\gamma^{*}_{0})+\hat{e}_{1}^{\top}\Gamma_{1}\hat{e}_{1}-(e^{*}_{1})^{\top}\Gamma_{1}e^{*}_{1}+Q_{1}^{*}(\hat{e}_{1},\gamma_{1}^{*},\hat{g}^{*})-Q_{1}^{*}(e^{*}_{1},\gamma_{1}^{*},\hat{g}^{*})]
=ϵ(α)+𝔼[λg^0(γ^0γ0)]+tr(Γ1𝔼[(γ0γ^0)Ae0e0A])+𝔼[Q1((1γ^0)Ae0+W^0,γ1,g^)\displaystyle=\epsilon^{\prime}(\alpha)+\mathbb{E}[\lambda\hat{g}^{*}_{0}(\hat{\gamma}_{0}-\gamma^{*}_{0})]+tr(\Gamma_{1}\mathbb{E}[(\gamma^{*}_{0}-\hat{\gamma}_{0})Ae_{0}e_{0}^{\top}A^{\top}])+\mathbb{E}[Q_{1}^{*}((1-\hat{\gamma}_{0})Ae_{0}+\hat{W}_{0},\gamma^{*}_{1},\hat{g}^{*})
Q1((1γ0)Ae0+W~0,γ1,g^)]\displaystyle\hskip 28.45274pt-Q_{1}^{*}((1-\gamma^{*}_{0})Ae_{0}+\tilde{W}_{0},\gamma^{*}_{1},\hat{g}^{*})]
ϵ(α)+𝔼[λg^0(γ^0γ0)]+tr(Γ1Ae0e0A)𝔼[(γ0γ^0)]+2(f0(e02)+c0)(γ0γ^0)\displaystyle\leq\epsilon^{\prime}(\alpha)+\mathbb{E}[\lambda\hat{g}^{*}_{0}(\hat{\gamma}_{0}-\gamma^{*}_{0})]+tr(\Gamma_{1}Ae_{0}e_{0}^{\top}A^{\top})\mathbb{E}[(\gamma^{*}_{0}-\hat{\gamma}_{0})]+2(f_{0}(\|e_{0}\|^{2})+c_{0})\mathbb{P}(\gamma^{*}_{0}\neq\hat{\gamma}_{0})
+(γ0=γ^0)(𝔼[Q1((1γ^0)Ae0+W^0,γ1,g^)Q1((1γ^0)Ae0+Wˇ0,γ1,g^)])\displaystyle\hskip 28.45274pt+\mathbb{P}(\gamma^{*}_{0}=\hat{\gamma}_{0})(\mathbb{E}[Q_{1}^{*}((1-\hat{\gamma}_{0})Ae_{0}+\hat{W}_{0},\gamma^{*}_{1},\hat{g}^{*})-Q_{1}^{*}((1-\hat{\gamma}_{0})Ae_{0}+\check{W}_{0},\gamma^{*}_{1},\hat{g}^{*})])
ϵ(α)+(λ+tr(Γ1Ae0e0A))𝔼[(γ0γ^0)]+2(f0(e02)+c0)(γ0γ^0)\displaystyle\leq\epsilon^{\prime}(\alpha)+(\lambda+tr(\Gamma_{1}Ae_{0}e_{0}^{\top}A^{\top}))\mathbb{E}[(\gamma^{*}_{0}-\hat{\gamma}_{0})]+2(f_{0}(\|e_{0}\|^{2})+c_{0})\mathbb{P}(\gamma^{*}_{0}\neq\hat{\gamma}_{0})
ϵ(α)+(λ+tr(Γ1Ae0e0A)+2f0(e02)+2c0)×(11+eα|VoI0,1|+11+eα|VoI0,2|)\displaystyle\leq\epsilon^{\prime}(\alpha)+(\lambda+tr(\Gamma_{1}Ae_{0}e_{0}^{\top}A^{\top})+2f_{0}(\|e_{0}\|^{2})+2c_{0})\times\Big{(}\frac{1}{1+e^{\alpha|\operatorname{VoI}_{0,1}|}}+\frac{1}{1+e^{\alpha|\operatorname{VoI}_{0,2}|}}\Big{)}
=:ϵ(α)+ϵ′′(α)=:ϵ(α)=𝒪(11+mink[T]{eα|VoIk|}).\displaystyle=:\epsilon^{\prime}(\alpha)+\epsilon^{\prime\prime}(\alpha)=:\epsilon(\alpha)=\mathcal{O}\Big{(}\frac{1}{1+\min_{k\in[T]}\{e^{\alpha|\operatorname{VoI}_{k}|}\}}\Big{)}. (55)

The proof is thus complete. ∎