
Semantic Communication in Multi-team Dynamic Games: A Mean Field Perspective

Shubham Aggarwal, Muhammad Aneeq uz Zaman, Melih Bastopcu, and Tamer Başar Research of the authors was supported in part by ARO MURI Grant AG285 and in part by AFOSR Grant FA9550-24-1-0152.Shubham Aggarwal and Muhammad Aneeq uz Zaman are with the Coordinated Science Laboratory and the Department of Mechanical Science and Engineering at the University of Illinois Urbana-Champaign (UIUC); Melih Bastopcu is with the Coordinated Science Laboratory at UIUC; Tamer Başar is with the Coordinated Science Laboratory and the Department of Electrical and Computer Engineering at UIUC. (Emails: {sa57,mazaman2,bastopcu,basar1}@illinois.edu)
Abstract

Coordinating communication and control is a key component of the stability and performance of networked multi-agent systems. While single-user networked control systems have received considerable attention in this domain, in this work we address the more challenging problem of large-population multi-team dynamic games. In particular, each team comprises two decision makers (namely, the sensor and the controller) who coordinate over a shared network to control a dynamically evolving state of interest under costs on both actuation and sensing/communication. Due to the shared nature of the wireless channel, the overall cost of each team depends on the other teams' policies, thereby leading to a noncooperative game setup. Owing to the presence of a large number of teams, we compute approximate decentralized Nash equilibrium policies for each team using the paradigm of (extended) mean-field games, which is governed by (1) the mean traffic flowing over the channel, and (2) the value of information at the sensor, which highlights the semantic nature of the ensuing communication. In the process, we compute optimal controller policies and approximately optimal sensor policies for each representative team of the mean-field system to alleviate the problem of general non-contractivity of the mean-field fixed-point operator associated with the finite cardinality of the sensor action space. Consequently, we also prove the $\epsilon$–Nash property of the mean-field equilibrium solution, which characterizes how well the solution derived from the mean-field analysis performs on the finite-team system. Finally, we provide extensive numerical simulations, which corroborate the theoretical findings and lead to additional insights on the properties of the results presented.

I Introduction

Games among teams have been an attractive topic of interest in the recent past. These systems consist of cooperating decision-makers (within a team) who aim to collectively optimize a shared team objective, while being affected by the actions of other competing teams. Applications include multi-player perimeter defense games [1], electricity markets [2], oligopolies [3], and networked robotics [4], to name a few.

In this paper, we address the problem of finding equilibrium policies for multiple dynamic teams competing to access a shared wireless channel. A prototype of such a system is shown in Fig. 1. Each team comprises a dynamically evolving plant controlled by a sensor-controller pair. The controller relies on state information from the sensor, which communicates over the shared wireless channel. Costs arise from (1) the sensor transmitting data over the network, and (2) the controller implementing control policies for plant stabilization. An effective control policy requires consistent state measurements, which increase communication costs. Conversely, reducing the number of communications lowers these costs but degrades control performance, leading to a classic communication-control trade-off for each team. Consequently, the sensor and controller act as two active decision-makers within a team, optimizing a common objective to find a team-optimal policy pair.


Figure 1: A multi-agent networked control system operating over a shared wireless resource serves use cases such as connected autonomy, path planning, Industry 4.0, and medical applications.

Furthermore, since the channel is shared, the communication cost is coupled with the transmission activity of other users. We model this coupling effect as the product of sensor action and the average number of communication instances of the other teams. Therefore, if the channel is highly congested, each sensor will seek to reduce its communication, whereas if the channel is underutilized, each sensor will increase its transmissions to improve control performance. This dynamic interaction suggests that a (Nash) equilibrium channel utilization may emerge for each team. However, with a large number of teams, calculating exact Nash policies is challenging because each team needs to be aware of other teams’ policies. To address this, we employ the mean-field game (MFG) framework to compute approximate decentralized equilibrium policies for each team, which depend only on the local information. This approach involves analyzing the limiting team game problem (with infinite teams), which we refer to as the mean-field team (MFT) system. In the MFT system, individual team deviations become insignificant due to the infinite number of teams, and hence, it can be characterized by a representative team interacting with a mass distribution. This allows us to determine the optimal policy for a generic team, referred to as the mean-field team equilibrium (MFTE) policy, which aligns with the population’s behavior. We provide conditions for the existence and uniqueness of an approximate MFTE and demonstrate when the derived equilibrium policies constitute approximate Nash policies for the finite-team system.

I-A Related Works

I-A1 Team optimal control & Information structures

Team decision problems, which were first investigated in [5, 6], involve multiple decision makers (DMs), each of whom has access to different information variables, and who jointly choose policies to incur a common cost/reward. Since each DM acts independently, and the DMs do not necessarily share the same information, the joint optimal policy design depends heavily on the information available to each DM [7, 8, 9], and its derivation can be challenging, particularly when the information is dynamic, as demonstrated by Witsenhausen [10], Feldbaum [11], and Başar [12]. The single-team communication-control trade-off problem is an instance of such a decision problem, where the sensor and the controller form a 2-DM team, each having access to different information signals within the system. Initial investigations for solving the single-team problem in this context were undertaken within the realm of event-driven control, where the primary objective is merely to stabilize the dynamical plant, without any optimality guarantees [13, 14, 15, 16]. Multi-team extensions have also been considered in [17, 18].

On the other hand, single team settings from the viewpoint of an optimal control problem have also been widely dealt with in literature [19, 20, 21, 22, 23, 24, 25] to study the impact of resource constraints on estimator and controller designs, where the resource is the frequency of usage of the transmission channels for estimation or control purposes. Specifically, papers [19] and [20] have obtained the optimal threshold-type (symmetric) policies for the sensor under a hard constraint on the number of transmissions, with extensions provided in [21]; [22] has placed resource constraints on the interaction between the controller and the actuator in a plant; and [23] has introduced a tradeoff between two options for a controller, which are transmission of control signals to the actuator and receiving measurements, where optimality is again based on threshold-type policies; [24] has provided jointly optimal policies for control and sensing; and [25] has derived suboptimal sensor and control policies based (only) on the statistics of the estimation error. The latest works in this direction are [26, 27], where the authors show that for a multi-dimensional Gauss-Markov process, the optimal estimator is linear and is unaffected by the no-communication events (as was derived for scalar systems earlier in [20] with its optimality proven in [21]), leading to a globally optimal policy design for a single team problem.

I-A2 Finite & Mean-field team games

As opposed to the above frameworks which consider a single team problem, there are only a handful of works considering games among multiple teams[28, 29, 30, 31, 32]. Specifically, the paper [28] has investigated the properties of the saddle-point equilibrium in a two-team zero-sum game; [29] has proposed a learning algorithm for zero-sum Markov games between teams of multiple agents, with extensions to general sum 2-team games in [30]; [31] has provided only the Nash controller policies for 2-team general sum games without explicit sensor policy computation, which becomes challenging even for a 2-team problem; and [32] has solved for best response sensor policies for one player with asymmetric sensing and a zero-sum game setup.

To alleviate the above issue of equilibrium policy computation, the framework of mean-field games (MFGs) (which forms a subset of the mean-field team games with each team constituting a single DM) was introduced simultaneously in [33] and [34]. The idea behind the same is to compute approximate decentralized Nash equilibrium policies for agents by passing to an infinite agent system (under suitable assumptions of anonymity between agents), in which case the effect of actions of any specific agent becomes insignificant on that of the other agents, thereby leading to an aggregation effect. Under this scenario, one can solve for a representative agent’s equilibrium policy within the infinite agent system which is consistent with that of the population and then characterize its approximation with respect to the centralized Nash equilibrium policies for the finite-agent system.

While many works have considered both theoretical and application aspects of standard MFGs [35, 36, 37, 38, 39, 40], only a few results exist pertaining to MFTGs [41, 2]. A similar paradigm also worth mentioning is that of mean-field type games [42, 43], where the number of teams is finite, but the number of agents within each team goes to infinity. As opposed to standard MFGs, an additional challenge within the MFTG setup is to handle the incoming interactions from other teams while solving for the team optimal solution for each team. While the work [2] considers a discrete state and action setup, the work [41] considers a linear-quadratic setting to compute an asymptotic mixed-equilibrium optimum for the team-game problem. In this work, however, we consider a mixed discrete-continuous action space where the controller action lies within the $m$–dimensional Euclidean space while the sensor action is binary-valued (either 0, denoting no transmission, or 1, denoting full state transmission). Further, as we will see, the cost function is non-quadratic and the estimation error dynamics, which forms the state of the estimation problem, is bilinear due to the presence of no-transmission events, which makes the problem significantly more challenging.

I-A3 Semantic communication

With the proliferation of the Internet-of-Things (IoT) technology, there has been an increasing interest in the efficient use of the shared wireless spectrum [37, 44, 45]. To enable the same, semantic communication is an emerging paradigm that aims to transmit only relevant information through shared resources rather than just the raw data, leading to more intelligent and context-aware communication systems, especially in bandwidth-constrained environments [46]. For instance, in [47], it has been argued that with semantic communication, it is possible to reduce the transmission rate without significant loss in the system performance; the paper [48] discusses the potential energy savings resulting from semantic interaction in connected autonomous vehicles; and the paper [49] demonstrates the effectiveness of semantics powered sampling for remote real-time source reconstruction. We refer the interested reader to [50, 51], and the references therein for additional details. In this work, we capture the semantics of information using the value of information (VoI) metric [26]. This metric essentially measures the (excess) payoff derived by a team from the information communicated by the sensor to the controller as opposed to no communication, thereby providing a quantitative basis for evaluating the effectiveness of semantic communication in our proposed approach.

Thus, we list the main contributions of our work as follows.

  1. We consider a multi-team dynamic networked game problem with a costly uplink, where each team operates in a noncooperative game setting due to interactions over a shared resource, while coordinating its optimal communication and control actions to ensure stabilization of the underlying dynamical system. The setups described in Subsection I-A1 constitute special cases of our work up to appropriate modifications.

  2. In view of the challenge of finding Nash equilibrium policies in the presence of a large number of teams, we use the framework of mean-field team games to compute approximate decentralized Nash equilibrium controller-sensor policies for each team. In the process, we first note the general non-contractivity of the mean-field operator due to the finite cardinality of the sensor action space, which precludes the use of contraction arguments to establish the existence of a mean-field team equilibrium (MFTE). To alleviate this issue, we employ probabilistic Boltzmann-type policies, dependent on the average utilization of the wireless channel (or the mean traffic flow over the network), to approximate the optimal sensor policy. These are then used to prove the existence and uniqueness of an approximate MFTE.

  3. Finally, we demonstrate that the MFTE obtained above constitutes an approximate Nash equilibrium for the finite-team system, where the approximation is induced by (1) the number of teams $(N)$ in the system, and (2) the Boltzmann parameter $(\alpha)$ characterizing the approximate sensor policy.

We would also like to note that our current work is closest in spirit to [26]. Here, however, we consider a multi-team problem as opposed to a single team problem considered in [26]. Nonetheless, we utilize the team optimal policy notions and heuristics employed in [26] in the current work as well. The added non-trivial challenge is to solve for the Nash equilibrium policies on top of the team-optimal policies for each team. This has required us to frame the problem as a MFTG and consequently compute approximate Nash policies between the finite number of teams.

I-B Organization of the paper

The paper is organized as follows. We formulate the finite-team game problem in Section II and the mean-field team game in Section III. The analysis of the existence and uniqueness of the MFTE is presented in Section IV, its approximation performance is studied in Section V, and the approximate Nash equilibrium analysis is included in Section VI. Numerical results are presented in Section VII, and the paper concludes with some major highlights and ensuing discussions in Section VIII. Proofs of main propositions and the supporting lemmas are presented in Appendices.

Notations: The set of agents is denoted by $[N]:=\{1,\cdots,N\}$ and the set of time instants by $[T]:=\{0,\cdots,T-1\}$. We denote the Euclidean norm by $\|\cdot\|$. For a positive definite (resp. semi-definite) matrix $Z$, we use the notation $Z\succ 0$ (resp. $Z\succeq 0$). The trace of a matrix $Z$ is denoted by $tr(Z)$, and for a vector $x$, we define $\|x\|^{2}_{Z}:=x^{\top}Zx$. For two real functions $g$ and $h$, $g(x)=\mathcal{O}(h(x))$ means that there exists a constant $C>0$ such that $\lim_{x\rightarrow\infty}|g(x)/h(x)|=C$. We define $Z_{0:m}:=\{Z_{0},\cdots,Z_{m}\}$, and $\mathbb{I}[A]$ as the indicator function of a measurable set $A$.

II Finite-Team Game Problem

Consider the networked multi-agent system shown in Fig. 2, comprised of $N$ teams. Each team $i$ is modeled by a triple $(P_{i},S_{i},C_{i})$, i.e., a dynamically evolving plant ($P_{i}$), a sensor ($S_{i}$), and a controller ($C_{i}$), where $S_{i}$ and $C_{i}$ can be viewed as the two active agents of team $i$. The plant $P_{i}$ evolves according to a stochastic linear difference equation as

X^{i}_{k+1}=A(\omega_{i})X^{i}_{k}+B(\omega_{i})U^{i}_{k}+W^{i}_{k}, \quad i\in[N], \qquad (1)

where $X^{i}_{k}\in\mathbb{R}^{n}$ denotes the state and $U^{i}_{k}\in\mathbb{R}^{m}$ denotes the control input, both for the $i^{th}$ team. The system noise $W^{i}_{k}$ is assumed to be Gaussian with zero mean and finite covariance $K_{W^{i}}$. Further, $A(\omega_{i})$ and $B(\omega_{i})$ are matrices of appropriate dimensions which depend on the type of the agent $\omega_{i}\in\Omega:=\{\omega_{1},\cdots,\omega_{p}\}$, chosen according to the empirical probability mass function (pmf) $\mathbb{P}_{N}(\omega=\omega_{i})$, for all $\omega\in\Omega$. The initial state $X_{0}^{i}$ has a symmetric density with mean $\nu(\omega_{i})$ and positive definite covariance $\Sigma_{0}(\omega_{i})$ for each $i$. We assume that $X_{0}^{i}$ is independent of the noise $W^{i}_{k}$ for all $k,i$.
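
To make the setup concrete, the following minimal Python sketch rolls out a scalar instance of the dynamics in (1) for a single team under a given control sequence; the function name and the numerical values (borrowed from the simulation study in Section VII) are illustrative assumptions, not part of the model specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_plant(A, B, K_W, x0, controls):
    """Roll out the linear dynamics (1) for one team under a given control sequence."""
    x, traj = x0, [x0]
    for u in controls:
        w = rng.normal(0.0, np.sqrt(K_W))   # zero-mean Gaussian noise W_k
        x = A * x + B * u + w               # X_{k+1} = A X_k + B U_k + W_k
        traj.append(x)
    return np.array(traj)

# Example: an uncontrolled, open-loop unstable scalar plant (A = 1.5, as in Section VII)
print(simulate_plant(A=1.5, B=2.0, K_W=6.0, x0=0.0, controls=np.zeros(10)))
```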


Figure 2: The figure shows a prototypical networked multi-agent system constituting $N$ teams of plant $P_{i}$, sensor $S_{i}$, and controller $C_{i}$. The shared wireless link between each sensor and its respective controller consists of two wireless hops, namely, the uplink and the downlink. The shared channel couples the sensing policies of the teams, leading to a noncooperative game setup between teams.

Next, for each team $i$, the full-state information ($X^{i}_{k}$) of the plant is relayed by $S_{i}$ to $C_{i}$ over a shared two-hop wireless channel via a central base station (BS) (as shown in Fig. 2), which coordinates the flow of information. The hop from the sensor to the BS is called the uplink and the one from the BS to the controller is called the downlink. The sensory information is relayed to the controller subject to a one-step delay, i.e., if $X^{i}_{k-1}$ and $y^{i}_{k}$ denote the uplink input and the downlink output, respectively, then we have that

y^{i}_{k+1}=\begin{cases}X^{i}_{k}, & \text{if } \gamma^{i}_{k}=1,\\ \varphi, & \text{if } \gamma^{i}_{k}=0,\end{cases} \qquad (4)

for $k\in[T]$ and $y_{0}=\varphi$. The quantity $\varphi$ denotes an empty symbol and $\gamma^{i}_{k}$ denotes the transmission request of $S_{i}$ as

\gamma^{i}_{k}=\begin{cases}1, & \text{if transmission is requested},\\ 0, & \text{otherwise}.\end{cases} \qquad (7)

We further assume that the downlink has enough bandwidth to accommodate all the sensor requests. (The case where the downlink suffers from a capacity constraint with an ideal uplink is dealt with in the works [37, 44], where the authors use a weighted age-of-information-based metric to devise optimal downlink transmission strategies for the base station. The case where both the uplink and the downlink suffer from non-idealities is a significantly more challenging problem, since one would need to jointly design each team's policies and the scheduling policy of the base station; we leave this as an open and promising research direction.) Thus, whenever a transmission request is generated, the sensor is connected to the controller by the base station. However, for each packet communicated by the sensor to the controller, team $i$ incurs a cost ($J^{S,N}_{i}$) which is proportional to the average number of users communicating on the uplink. More precisely, the associated time-average communication cost over a finite horizon of $T>0$ is given as

J_{i}^{S,N}(\xi^{i},\xi^{-i}):=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\gamma^{i}_{k}\left(\frac{1}{N}\sum_{j=1}^{N}\gamma^{j}_{k}\right)\right], \qquad (8)

where we refer to the term $\frac{1}{N}\sum_{j=1}^{N}\gamma^{j}_{k}$ as the average channel utilization, which models the mean traffic flow through the wireless channel. Further, $\xi^{i}:=(\xi^{i}_{0},\xi^{i}_{1},\xi^{i}_{2},\cdots,\xi^{i}_{T-1})$ denotes the transmission policy of sensor $S_{i}$ and $\xi^{-i}:=(\xi^{1},\cdots,\xi^{i-1},\xi^{i+1},\cdots,\xi^{N})$ denotes the transmission policies of the sensors other than $S_{i}$. In the absence of any communication, $C_{i}$ forms the best estimate of the state, which resets to the latest information in the event of a communication.
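
The coupling through the average channel utilization in (8) is easy to evaluate numerically; the sketch below is a minimal illustration that computes the per-team communication cost from a hypothetical matrix of binary transmission decisions.

```python
import numpy as np

def communication_costs(gamma):
    """Time-average communication cost (8) for each team.

    gamma: (N, T) binary array, gamma[i, k] = 1 if sensor i transmits at time k.
    Returns an array of length N with J_i^{S,N} for each team.
    """
    N, T = gamma.shape
    channel_util = gamma.mean(axis=0)            # (1/N) sum_j gamma^j_k, for each k
    return (gamma * channel_util).sum(axis=1) / T

# Example with 3 teams and 4 time steps (hypothetical transmission pattern)
gamma = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1]])
print(communication_costs(gamma))
```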

Let us define the information set of the $i^{th}$ sensor as

\mathcal{I}^{S,i}_{k}:=\{X^{i}_{0:k},\,y^{i}_{0:k},\,U^{i}_{0:k-1},\,\gamma^{\ell}_{0:k-1}\mid\ell\in[N]\}, \qquad (9)

with $\mathcal{I}^{S,i}_{0}=\{X^{i}_{0},y^{i}_{0}\}$. Then, we can define the space of admissible centralized sensor policies for $S_{i}$ as $\Xi^{cen,i}=\{\xi^{i}\mid\xi^{i}_{k}~\text{is adapted to}~\sigma(\mathcal{I}^{S,i}_{t},~t=0,1,\cdots,k)\}$, where $\sigma(\cdot)$ is the $\sigma$–algebra generated by its arguments (we use the term centralized to emphasize that the information set of sensor $i$ contains the transmission instants of the other agents as well). Finally, the expectation in (8) is taken with respect to the stochasticity induced by the (possibly) randomized policies, the system noise, and the initial state distribution.

Next, the controller $C_{i}$ observes the incoming message from $S_{i}$ as in (4). Between communication instants, $C_{i}$ computes the best estimate $Z^{i}_{k}$ of the state (calculated later) based on its information history given by

\mathcal{I}^{C,i}_{k}:=\{y^{i}_{0:k},\,U^{i}_{0:k-1}\}, \qquad (10)

with $\mathcal{I}^{C,i}_{0}=\{y^{i}_{0}\}$. For each control input applied by $C_{i}$, the $i^{th}$ team incurs a time-average control cost given as

J_{i}^{C,N}(\pi^{i},\xi^{i}):=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|X^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}\right], \qquad (11)

where $Q(\omega_{i})\succeq 0$, $R(\omega_{i})\succ 0$, and $\pi^{i}:=(\pi^{i}_{0},\pi^{i}_{1},\cdots,\pi^{i}_{T-1})$ denotes an admissible policy of the control agent of team $i$. More precisely, given $\mathcal{I}^{C,i}_{k}$, we define the space of admissible control policies for team $i$ as $\mathcal{M}^{i}=\{\pi^{i}\mid\pi^{i}_{k}~\text{is adapted to}~\sigma(\mathcal{I}^{C,i}_{s},~s=0,1,\cdots,k)\}$. Finally, the expectation in (11) is taken with respect to the noise statistics and the initial state distribution.

As a result of the cost functions in (8) and (11), each team is faced with the following stochastic team-game problem.

Problem 1 (Team-level game problem).
\inf_{(\pi^{i},\xi^{i})\in\mathcal{M}^{i}\times\Xi^{cen,i}} J^{N}_{i}(\pi^{i},\xi^{i},\xi^{-i}):=J_{i}^{C,N}+\lambda J_{i}^{S,N},
\quad\text{s.t. (1), (4), and (7) hold},

where $(\pi^{i},\xi^{i})\in\mathcal{M}^{i}\times\Xi^{cen,i}$ and $\lambda>0$ is the weighting parameter which trades off the cost of uplink communication against control performance.

Henceforth, we refer to $(\pi^{i},\xi^{i})$ as the $i^{th}$ team's policy. We note here that each team's cost depends on the strategies of the other teams through their sensor policies, which leads to a noncooperative game setup. Further, the works [26, 27] constitute a special case of our setup (with $N=1$). Thus, the objective of this work is to compute Nash equilibrium policies for the $N$ teams. That is, we are looking for a $2N$–tuple $(\pi^{*,i}\in\mathcal{M}^{i},\,\xi^{*,i}\in\Xi^{cen,i},\,i\in[N])$ such that

J^{N}_{i}(\pi^{*,i},\xi^{*,i},\xi^{*,-i})=\inf_{(\pi^{i},\xi^{i})\in\mathcal{M}^{i}\times\Xi^{cen,i}} J^{N}_{i}(\pi^{i},\xi^{i},\xi^{*,-i})

for all $i\in[N]$. We note that for team $i$, the above objective depends on the sensing policies $\xi^{*,-i}$ of the other teams. Due to the presence of a large number of such teams within a typical networked system, it becomes unrealistic for each team to have access to (and store) the sensing policies of all other teams in the population. Thus, we would like to compute team equilibrium policies based only on the local information accessible to each team. Toward this end, we will use the mean-field game framework [35, 52, 53] to characterize approximate decentralized Nash equilibrium solutions in the next three sections.

III Networked Mean-Field Team Game

In this section, we model the $N$–team game using the mean-field team game (MFTG) framework. In this regard, we first consider the limiting system (with $N=\infty$ teams). Under this setting, the average channel utilization ratio in (8) (and hence in Problem 1) can be approximated by a pre-computable deterministic mean-field trajectory using the Nash certainty equivalence principle [33]. This reduces Problem 1 to a team optimal control problem for a representative team, whose solution is consistent with that of the other teams in the population. Thus, we first compute a team optimal policy for the representative team. Consequently, we will observe that the finite cardinality of the sensing action set (given as $\{0,1\}$) leads to non-contractivity of the mean-field operator (as also noted in [54]). To alleviate this issue, we construct a Boltzmann-type sensor policy and prove the existence of a unique approximate mean-field team equilibrium (MFTE). Finally, we will show that the solution obtained constitutes an approximate Nash equilibrium for the finite $N$–team game problem, with the notion of "approximation" defined precisely.

III-A Mean-Field Team Game

Consider a representative team (within the infinite-team system) of type $\omega$, characterized by the tuple $(P(\omega),S(\omega),C(\omega))$ of a plant, a sensor, and a controller. The plant dynamics evolve according to the stochastic linear difference equation

X_{k+1}=A(\omega)X_{k}+B(\omega)U_{k}+W_{k}, \qquad (12)

where $(X_{k},U_{k})\in\mathbb{R}^{n}\times\mathbb{R}^{m}$ denotes the state-control input pair of the representative team. The noise $W_{k}$ is assumed to be Gaussian with zero mean and positive definite covariance $K_{W}(\omega)$. The sensor $S(\omega)$ generates transmission requests ($\gamma_{k}\in\{0,1\}$) over the uplink, which are then coordinated by the base station to update the state of the plant at the controller. As a result of network usage, the team incurs a communication cost $J^{S}$ given as

J^{S}(\zeta,g):=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}g_{k}\gamma_{k}\right], \qquad (13)

where $\zeta\in\Xi^{dec}$ denotes the sensor policy of the representative team, with $\Xi^{dec}$ denoting the space of admissible decentralized sensor policies defined as $\Xi^{dec}:=\{\zeta\mid\zeta~\text{is adapted to}~\sigma(\mathcal{I}^{S,\omega}_{t},~t=0,1,\cdots,k)\}$, where $\mathcal{I}^{S,\omega}_{k}=\{X_{0:k},y_{0:k-1},U_{0:k-1},\gamma_{0:k}\}$ defines the information structure of the representative team's sensor. We note here that the above set constitutes only the local information of the representative team, as opposed to the information set defined in (9). We refer to the term $g=(g_{1},g_{2},\cdots,g_{T-1})$ as the mean-field trajectory, which denotes the infinite-team approximation of the channel utilization term in (8). Finally, the controller $C(\omega)$ applies a control policy which incurs a cost of

J^{C}(\pi,\zeta):=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|X_{k}\|^{2}_{Q(\omega)}+\|U_{k}\|^{2}_{R(\omega)}\right]. \qquad (14)

The policy $\pi$ is chosen from the space $\mathcal{M}$ of admissible control policies, defined similarly to $\mathcal{M}^{i}$ as before. Thus, the representative team's optimization problem is defined as follows:

Problem 2 (Representative team problem).
\inf_{(\pi,\zeta)\in\mathcal{M}\times\Xi^{dec}} J(\pi,\zeta,g):=J^{C}(\pi,\zeta)+\lambda J^{S}(\zeta,g), \qquad (15)

where $J(\pi,\zeta,g)$ denotes the joint time-average overall cost incurred by the representative team, with $J^{S}(\zeta,g)$ and $J^{C}(\pi,\zeta)$ defined by (13) and (14), respectively.

In order to construct a solution to Problem 2, we invoke the following assumption [33, 55] on the mean-field system.

Assumption 1.
  1. The pair $(A(\omega),B(\omega))$ is controllable.

  2. The MF trajectory $g\in\ell_{\infty}^{T}$, where $\ell_{\infty}^{T}:=\{g_{k}\in\mathbb{R}\mid\sup_{k\in[T]}|g_{k}|<\infty\}$ is the set of bounded $T$–length sequences.

Next, we define the mean-field team equilibrium (MFTE) by introducing the following operators:

  1. $\Phi:\ell_{\infty}^{T}\rightarrow\mathcal{M}\times\Xi^{dec}$, which maps $g\mapsto\operatorname{argmin}_{(\pi,\zeta)\in\mathcal{M}\times\Xi^{dec}} J(\pi,\zeta,g)$ and outputs a team-optimal policy $(\pi,\zeta)$ for a given MF trajectory $g$, and

  2. $\Psi:\mathcal{M}\times\Xi^{dec}\rightarrow\ell_{\infty}^{T}$, which maps $(\pi,\zeta)\mapsto g$ and generates an MF trajectory $g$ from a policy $(\pi,\zeta)$.

Then, the MFTE can be defined as follows.

Definition 1 (Mean-Field Team Equilibrium).

A mean-field team equilibrium (MFTE) is a policy-trajectory pair $((\pi^{*},\zeta^{*}),g^{*})$, which is a fixed point of the composite map $\Psi\circ\Phi$.

In the sequel, we first compute the team optimal policy pair $(\pi^{*},\zeta^{*})$ solving Problem 2 for a given MF trajectory $g$, and then provide sufficient conditions for the existence of a unique approximate MFTE.

IV Mean-Field Team Equilibrium Analysis

We start this section by solving Problem 2. In this regard, let us begin by determining the optimal controller policy. Using completion of squares, we can equivalently write the cost function in (15) as

J(\pi,\zeta,g) = \mathbb{E}[\|X_{0}(\omega)\|^{2}_{P_{0}(\omega)}]+\frac{1}{T}\sum_{k=0}^{T-1}tr(P_{k}(\omega)K_{W}(\omega))
\quad +\frac{1}{T}\mathbb{E}\Big[\sum_{k=0}^{T-1}\|U_{k}+\hat{R}_{k}^{-1}(\omega)B^{\top}(\omega)P_{k+1}(\omega)A(\omega)X_{k}\|^{2}_{\hat{R}_{k}(\omega)}+\lambda g_{k}\gamma_{k}\Big], \qquad (16)

where $\hat{R}_{k}(\omega):=R(\omega)+B^{\top}(\omega)P_{k+1}(\omega)B(\omega)$ and $P\succeq 0$ is the unique solution to the Riccati equation

P_{k}(\omega) = Q(\omega)+A^{\top}(\omega)P_{k+1}(\omega)A(\omega)-A^{\top}(\omega)P_{k+1}(\omega)B(\omega)\hat{R}_{k}^{-1}(\omega)B^{\top}(\omega)P_{k+1}(\omega)A(\omega),

with $P_{T}(\omega)=0$. Thus, from (16), the optimal controller can be obtained as

U_{k}=\pi^{*}_{k}(\mathcal{I}^{C,\omega}_{k})=-\hat{R}^{-1}_{k}(\omega)B^{\top}(\omega)P_{k+1}(\omega)A(\omega)Z_{k}, \qquad (17)

where $Z_{k}:=\mathbb{E}[X_{k}\mid\mathcal{I}^{C,\omega}_{k}]$ is the conditional mean of the state given the information available to the controller. Then, we have the following result.

Proposition 1.

The conditional mean $Z_{k}$, given by the MMSE estimator at the controller, follows the dynamics

Z_{k+1} = A(\omega)\big((1-\gamma_{k})Z_{k}+\gamma_{k}X_{k}\big)+B(\omega)U_{k}, \qquad (18)

for all $k\in[T]$, with $Z_{0}=\mathbb{E}[X_{0}]$.

The proof of the above proposition readily follows by taking the expectation in (12) conditioned on the controller's information structure and is thus not included here.
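
To illustrate (17)–(18) for a scalar team, the sketch below runs the backward Riccati recursion, the resulting certainty-equivalent controller, and the estimator of Proposition 1 for a given transmission schedule; the helper names, the zero prior mean, and the parameter values are our own illustrative assumptions.

```python
import numpy as np

def riccati_gains(A, B, Q, R, T):
    """Backward Riccati recursion with P_T = 0; returns P_{0:T} and the gains in (17)."""
    P = np.zeros(T + 1)
    K = np.zeros(T)
    for k in range(T - 1, -1, -1):
        R_hat = R + B * P[k + 1] * B
        K[k] = (B * P[k + 1] * A) / R_hat                 # U_k = -K_k Z_k
        P[k] = Q + A * P[k + 1] * A - A * P[k + 1] * B * K[k]
    return P, K

def run_team(A, B, K_W, K_gain, schedule, x0, rng):
    """Certainty-equivalent control (17) combined with the estimator update (18)."""
    x, z = x0, 0.0                                        # Z_0 = E[X_0] (assumed zero here)
    for k, gamma in enumerate(schedule):
        u = -K_gain[k] * z                                # control based on the estimate
        w = rng.normal(0.0, np.sqrt(K_W))
        x_next = A * x + B * u + w                        # plant dynamics (12)
        z = A * ((1 - gamma) * z + gamma * x) + B * u     # estimator recursion (18)
        x = x_next
    return x, z

rng = np.random.default_rng(0)
P, K = riccati_gains(A=1.5, B=2.0, Q=2.0, R=1.0, T=50)
print(run_team(1.5, 2.0, 6.0, K, schedule=np.ones(50, dtype=int), x0=1.0, rng=rng))
```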

Let us define $e_{k}:=X_{k}-Z_{k}$. Then, by substituting the optimal controller (17) in (16), we arrive at

J(\zeta,g) = J^{*}+\frac{1}{T}\mathbb{E}\Big[\sum_{k=0}^{T-1}\|e_{k}\|^{2}_{\Gamma_{k}}+\lambda g_{k}\gamma_{k}\Big], \qquad (19)

where $J^{*}=\mathbb{E}[\|X_{0}(\omega)\|^{2}_{P_{0}(\omega)}]+\frac{1}{T}\sum_{k=0}^{T-1}tr(P_{k}(\omega)K_{W}(\omega))$,

$\Gamma_{k}:=A^{\top}(\omega)P_{k+1}^{\top}(\omega)B(\omega)\hat{R}^{-1}_{k}(\omega)B^{\top}(\omega)P_{k+1}(\omega)A(\omega)$, and $e_{k}$ satisfies the following stochastic difference equation:

e_{k+1}=(1-\gamma_{k})A(\omega)e_{k}+W_{k}. \qquad (20)

Thus, the sensor's objective is to find the optimal policy minimizing (19) subject to (20). To obtain it, let us first define the optimal cost-to-go $V_{\cdot}^{\omega}(\mathcal{I}^{S,\omega}_{\cdot},g)$ of the sensor as follows:

V_{k}^{\omega}(\mathcal{I}^{S,\omega}_{k},g):=\min_{\zeta\in\Xi^{dec}\mid\pi=\pi^{*}}\frac{1}{T}\mathbb{E}\Big[\sum_{t=k}^{T-1}\|e_{t}\|_{\Gamma_{t}}^{2}+\lambda g_{t}\gamma_{t}\mid\mathcal{I}^{S,\omega}_{k}\Big],

for all $k\in[T]$, and with the convention that $g_{-1}=0$. Then, we invoke the following definition [26] of the value of information (VoI).

Definition 2.

The VoI at instant $k$ for the sensor of type $\omega$ is defined as a function of $\mathcal{I}^{S,\omega}_{k}$ as

\operatorname{VoI}_{k}^{\omega}:=V_{k}^{\omega}(\mathcal{I}^{S,\omega}_{k},g)\big|_{\gamma_{k}=0}-V_{k}^{\omega}(\mathcal{I}^{S,\omega}_{k},g)\big|_{\gamma_{k}=1}, \qquad (21)

where $V^{\omega}(\cdot,\cdot)|_{\gamma}$ denotes the sensor's optimal cost-to-go under the sensor action $\gamma$.

We now have the following proposition, which characterizes the optimal sensor policy ($\zeta^{*}$).

Proposition 2.

The optimal sensor policy $\zeta^{*}$ for the team of type $\omega$ is given as

\gamma_{k}=\zeta^{*}(\mathcal{I}^{S,\omega}_{k},g)=\mathbb{I}[\operatorname{VoI}_{k}^{\omega}\geq 0], \qquad (22)

where

\operatorname{VoI}_{k}^{\omega} = e_{k}^{\top}A(\omega)^{\top}\Gamma_{k}(\omega)A(\omega)e_{k}-g_{k}+\varepsilon_{k}^{\omega},
\varepsilon_{k}^{\omega} = \mathbb{E}[V^{\omega}_{k+1}(\mathcal{I}^{S,\omega}_{k+1},g)\mid\mathcal{I}^{S,\omega}_{k},\gamma_{k}=0]-\mathbb{E}[V^{\omega}_{k+1}(\mathcal{I}^{S,\omega}_{k+1},g)\mid\mathcal{I}^{S,\omega}_{k},\gamma_{k}=1]. \qquad (23)
Proof.

We start by noting that Problem 2 is a single-team stochastic optimal control problem whose decision policies ($\pi$ and $\zeta$) depend only on the local information and the mean-field trajectory $g$, which is fixed. Hence, for each representative team of type $\omega\in\Omega$, the result follows from [26, Theorem 1], and the proof is thus complete. ∎

Let us now denote the optimal policy pair for the representative team of type $\omega$ by $(\pi^{*},\zeta^{*})$, which of course depends on the mean field $g$. Then, we proceed toward computing an MFTE using this policy. In this regard, we first make a crucial observation. The optimal sensor policy $\zeta^{*}$ computed in Proposition 2 is discontinuous at the point where $\operatorname{VoI}_{k}^{\omega}=0$. This makes it prohibitive to employ contraction-type arguments to establish the existence of a unique MFTE [33, 54]. Thus, our objective in what follows is to construct an approximation of the policy $\zeta^{*}$ by restricting it to a special class of policies, which then leads to a unique (approximate) MFTE within that class.

Let $\alpha>0$, and define a new policy $\hat{\zeta}:=\zeta^{\alpha}$ from which an action $\hat{\gamma}_{k}$ is chosen as follows:

\hat{\gamma}_{k} = \hat{\zeta}(\mathcal{I}^{S,\omega}_{k},g):=\begin{cases}1, & \text{w.p. } p:=\frac{1}{1+e^{-\alpha\operatorname{VoI}^{\omega}_{k}}},\\ 0, & \text{w.p. } 1-p.\end{cases} \qquad (24)

We note that the above policy is indeed an approximation of $\zeta^{*}$, since $\hat{\zeta}\rightarrow\zeta^{*}$ as $\alpha\rightarrow\infty$ under the convention that $\hat{\gamma}_{k}=1$ whenever $\operatorname{VoI}_{k}^{\omega}=0$. Further, it is measurable with respect to $\mathcal{I}^{S,\omega}_{k}$. Let the policy $\hat{\zeta}$ belong to the policy class $\Xi^{\alpha}$. We then prove the existence of a unique approximate MFTE in the above sense, which is defined precisely below.
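
The rule in (24) is straightforward to sample; the following minimal sketch contrasts it with the hard threshold policy (22), where `voi` stands for a numerical evaluation of $\operatorname{VoI}_{k}^{\omega}$ (assumed to be supplied, e.g., by the value-function approximation used in Section VII).

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_action(voi, alpha):
    """Sample gamma_hat_k from the softened sensor policy (24)."""
    p = 1.0 / (1.0 + np.exp(-alpha * voi))   # transmission probability
    return int(rng.random() < p)

def threshold_action(voi):
    """Hard VoI threshold rule (22), recovered from (24) as alpha -> infinity."""
    return int(voi >= 0)

# For a large alpha, the Boltzmann policy essentially reproduces the threshold rule.
for voi in (-2.0, -0.1, 0.1, 2.0):
    print(voi, boltzmann_action(voi, alpha=50.0), threshold_action(voi))
```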

Definition 3 ($\alpha$–Approximate MFTE).

An $\alpha$–approximate MFTE is a policy-trajectory pair $((\pi^{*},\hat{\zeta}^{*}),\hat{g}^{*})$, which is a fixed point of the composite map $\Psi^{\alpha}\circ\Phi^{\alpha}$, where

  1. $\Phi^{\alpha}:\ell_{\infty}^{T}\rightarrow\mathcal{M}\times\Xi^{\alpha}$, which is defined similarly to $\Phi$ using (24), and outputs the pair $(\pi^{*},\hat{\zeta}^{*})$ for a given MF trajectory $\hat{g}^{*}$, and

  2. $\Psi^{\alpha}:\mathcal{M}\times\Xi^{\alpha}\rightarrow\ell_{\infty}^{T}$, which maps $(\pi^{*},\hat{\zeta}^{*})$ to $\hat{g}^{*}$.

Also, let us define the mean-field operator $\mathcal{T}$ as

\hat{g}_{k+1} = [\mathcal{T}(\hat{g})]_{k}:=\mathbb{E}_{\hat{\zeta},\mathbb{P}(\cdot),\mathcal{N}(0,K_{W}(\cdot))}[\hat{\gamma}_{k}]=\mathbb{E}_{\mathbb{P}(\cdot),\mathcal{N}(0,K_{W}(\cdot))}\left[\frac{1}{1+e^{-\alpha\operatorname{VoI}^{\omega}_{k}}}\right]. \qquad (25)

We now first state and prove the following result on the existence of an approximate MFTE.

Proposition 3.

Under Assumption 1, there exists an $\alpha$–approximate MFTE for any $\alpha>0$.

Proof.

A detailed proof is provided in Appendix I. In summary, the proof relies on first showing that the mean-field operator $\mathcal{T}$, as defined in (25), is continuous in $\hat{g}$. Subsequently, Brouwer's fixed point theorem is applied to show the existence of at least one MFTE for any $\alpha>0$. ∎

In addition, we can guarantee contraction of the MF operator $\mathcal{T}$ for sufficiently small values of $\alpha$, which allows us to prove a stronger result on the uniqueness of the approximate MFTE, as follows.

Theorem 1.

Suppose Assumption 1 holds. Then, the mean-field operator $\mathcal{T}$ is a contraction if

2\alpha\lambda T<1. \qquad (26)

Furthermore, under condition (26), there exists a unique $\alpha$–approximate MFTE, which is the fixed point of the operator $\mathcal{T}$.

Proof.

The detailed proof is provided in Appendix I. Briefly, we use the contractivity of $\mathcal{T}$ under condition (26) together with Banach's fixed point theorem to show the existence of a unique $\alpha$–approximate MFTE. ∎

The above results show the existence of an approximate MFTE, which is unique if the parameter $\alpha$ is small enough. Further, we would also like to note from the proof of Theorem 1 that the contraction condition in (26) is only sufficient, and one may be able to obtain a more precise characterization of the 'smallness' of $\alpha$. Furthermore, in the numerical simulation section, we compute an approximate MFTE even when (26) is not satisfied.
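
Numerically, a fixed point of $\mathcal{T}$ can be sought by simple Picard iteration, as sketched below; here `mean_field_operator` is a placeholder for a Monte Carlo evaluation of (25) (e.g., by simulating the representative team under the Boltzmann policy), and the initial guess, tolerance, and iteration cap are arbitrary choices.

```python
import numpy as np

def picard_iteration(mean_field_operator, T, tol=1e-4, max_iter=200):
    """Iterate g <- T(g) until the sup-norm change falls below tol."""
    g = np.full(T, 0.5)                      # initial guess for the MF trajectory
    for _ in range(max_iter):
        g_next = mean_field_operator(g)      # one application of (25)
        if np.max(np.abs(g_next - g)) < tol:
            return g_next
        g = g_next
    return g                                 # may not have converged if (26) fails

# Example with a toy contraction standing in for (25): each entry relaxes toward 0.3.
print(picard_iteration(lambda g: 0.5 * g + 0.15, T=10))
```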

Now that we have characterized the approximate MFTE for the mean-field game, we next proceed towards proving that this MFTE satisfies the approximate Nash property (as formalized later in Section VI) for the original finite-team game problem.

V Performance of Mean-Field Approximation

In this and the subsequent section, we return to the finite-team setup of Section II and show that the $\alpha$–approximate MFTE computed in the previous section constitutes an $(\epsilon_{a},\epsilon_{b})$–Nash equilibrium for the finite-team game problem. The term $\epsilon_{a}$ arises as a result of the approximation parameter $\alpha$ in the sensor policy, and the term $\epsilon_{b}$ arises due to the infinite-team approximation.

Suppose now that team $i$ deviates to a sensor policy different from the $\alpha$–approximate MFTE policy $\hat{\zeta}^{*}$. We call this policy $\xi^{i}\in\Xi^{cen,i}$ and denote an action chosen from it at instant $k$ by $\tilde{\gamma}_{k}$. Also, let us define the quantity $\gamma^{N,av}_{k}:=\frac{1}{N}\sum_{j=1}^{N}\gamma^{j}_{k}$, which denotes the empirical average of the actions chosen by all the sensors. We start by proving the following lemma, which bounds the mean deviation between the empirical average $\gamma^{N,av}$ and the approximate equilibrium mean-field trajectory $\hat{g}^{*}$.

Lemma 1.

Suppose that Assumption 1 holds. Let team $i$ employ the sensor policy $\xi^{i}\in\Xi^{cen,i}$ while all the other teams employ the equilibrium policy $\hat{\zeta}^{*}$. Then, we have that

\frac{1}{T}\sum_{k=0}^{T-1}\mathbb{E}\Big[|\gamma^{N,av}_{k}-\hat{g}^{*}_{k}|\Big]=\mathcal{O}\Big(\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big), \qquad (27)

where $\epsilon_{P,N}=\sum_{\omega\in\Omega}|\mathbb{P}_{N}(\omega)-\mathbb{P}(\omega)|$.

Proof.

Consider the following inequality:

\sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j=1}^{N}\gamma^{j}_{k}-\hat{g}^{*}_{k}\Big|\Big] \leq \underbrace{\sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j=1}^{N}(\gamma^{j}_{k}-\hat{\gamma}^{j}_{k})\Big|\Big]}_{=:\,\text{Term 1}} + \underbrace{\sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j=1}^{N}\hat{\gamma}^{j}_{k}-\mathbb{E}[\hat{g}^{*,j}_{k}]\Big|\Big]}_{=:\,\text{Term 2}} + \underbrace{\sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j=1}^{N}\mathbb{E}[\hat{g}^{*,j}_{k}]-\hat{g}^{*}_{k}\Big|\Big]}_{=:\,\text{Term 3}}. \qquad (28)

We now bound each of the three terms in (28). To that end, let the action chosen by the sensor of team $i$ be given as $\tilde{\gamma}^{i}_{k}\sim\xi^{i}(\mathcal{I}^{S,i}_{k})$. Then, Term 1 in (28) can be bounded as follows:

\text{Term 1} = \sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j\in[N]\setminus\{i\}}(\hat{\gamma}^{j}_{k}-\hat{\gamma}^{j}_{k})+\frac{1}{N}(\tilde{\gamma}^{i}_{k}-\hat{\gamma}^{i}_{k})\Big|\Big]
= \sum_{k=0}^{T-1}\mathbb{E}\Big[\Big|\frac{1}{N}\sum_{j\in[N]\setminus\{i\}}(\hat{\zeta}^{j}_{k}(\mathcal{I}^{S,j}_{k})-\hat{\zeta}^{*,j}_{k}(\mathcal{I}^{S,j}_{k}))+\frac{1}{N}(\tilde{\gamma}^{i}_{k}-\hat{\gamma}^{i}_{k})\Big|\Big]
= \sum_{k=0}^{T-1}\frac{1}{N}\mathbb{E}[|\tilde{\gamma}^{i}_{k}-\hat{\gamma}^{i}_{k}|]\leq\frac{2T}{N}, \qquad (29)

where the second equality follows since the teams other than the $i^{th}$ one still follow the policy $\hat{\zeta}^{*}$.

We next bound Term 2 in (28). To that end, let us define the quantity $\bar{g}^{*,j}_{k}:=\hat{\gamma}^{j}_{k}-\mathbb{E}[\hat{g}^{*,j}_{k}]$. Then, we have that

\mathbb{E}\Big[\frac{1}{N}\sum_{j=1}^{N}\hat{\gamma}^{j}_{k}-\mathbb{E}[\hat{g}^{*,j}_{k}]\Big]^{2} = \frac{1}{N^{2}}\mathbb{E}\Big[\sum_{j\in[N]}\bar{g}^{*,j}_{k}\Big]^{2} = \frac{1}{N^{2}}\mathbb{E}\Big[\sum_{j,\ell\in[N]}\bar{g}^{*,j}_{k}\bar{g}^{*,\ell}_{k}\Big] = \frac{1}{N^{2}}\mathbb{E}\Big[\sum_{j\in[N]}(\bar{g}^{*,j}_{k})^{2}\Big] \leq \frac{1}{N}, \qquad (30)

where the third equality follows since $\bar{g}^{*,j}_{k}$ and $\bar{g}^{*,\ell}_{k}$ (for $j\neq\ell$) are uncorrelated, because the system noise $W_{k}^{i}$ is independent across different teams, which leads to decoupling between the corresponding sensor policies.

Thus, it follows from (30) that:

\text{Term 2}\leq\sum_{k=0}^{T-1}\frac{1}{\sqrt{N}}\leq\frac{T}{\sqrt{N}}. \qquad (31)

Finally, since $\hat{g}_{k}^{*,\omega}$ takes values in $[0,1]$, Term 3 in (28) can be bounded as follows:

\text{Term 3} = \sum_{k=0}^{T-1}\Big[\sum_{\omega\in\Omega}\hat{g}^{*,\omega}_{k}|\mathbb{P}_{N}(\omega)-\mathbb{P}(\omega)|\Big] \leq T\sum_{\omega\in\Omega}|\mathbb{P}_{N}(\omega)-\mathbb{P}(\omega)|. \qquad (32)

Using (29), (31), and (32) in (28), we arrive at the result in (27). The proof is thus complete. ∎

The above lemma demonstrates that the deviation of the empirical average from the approximate mean-field trajectory vanishes as the number of teams increases; hence, the latter 'reasonably' approximates the former when the number of teams is large.
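
As a toy sanity check of the $1/\sqrt{N}$ scaling in (27), one can draw i.i.d. Bernoulli transmission decisions around a fixed mean-field trajectory and measure the empirical gap; this sketch isolates only the concentration effect behind Term 2 and ignores the policy-deviation and type-mismatch terms.

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_gap(N, T, g_star, runs=200):
    """Average |gamma^{N,av}_k - g*_k| when all sensors transmit i.i.d. with mean g*_k."""
    gaps = []
    for _ in range(runs):
        gamma = rng.random((N, T)) < g_star        # Bernoulli(g*_k) decisions
        gaps.append(np.mean(np.abs(gamma.mean(axis=0) - g_star)))
    return np.mean(gaps)

g_star = np.full(50, 0.4)                          # a constant mean-field trajectory
for N in (10, 100, 1000):
    print(N, empirical_gap(N, 50, g_star))         # gap shrinks roughly like 1/sqrt(N)
```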

VI Approximate Nash Equilibrium Analysis

In this section, we prove the approximate-Nash property of the MF solution for the finite-team game problem, which is made precise below.

Definition 4.

The set of team policies $\{(\pi^{*,i},\xi^{i})\in\mathcal{M}^{i}\times\Xi^{cen,i},~i\in[N]\}$ constitutes an $\epsilon$–Nash equilibrium for the cost functions $J^{N}_{i}$, $i\in[N]$, if there exists $\epsilon>0$ such that

J^{N}_{i}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})\leq\inf_{\xi^{i}\in\Xi^{cen,i}}J^{N}_{i}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})+\epsilon,

for all $i\in[N]$.

We also note that the infimum on the RHS of the above inequality is taken over the set of centralized sensor policies defined before, hence bringing in an element of conservatism.

Now, we define the following cost functions:

J_{i}^{N}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|\hat{X}^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}+\lambda\hat{\gamma}^{N,av}_{k}\hat{\gamma}^{i}_{k}\right], \qquad (33a)
J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|\hat{X}_{k}\|^{2}_{Q(\omega)}+\|U_{k}\|^{2}_{R(\omega)}+\lambda\hat{g}^{*}_{k}\hat{\gamma}_{k}\right], \qquad (33b)
J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|X^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}+\lambda\tilde{\gamma}^{N,av}_{k}\tilde{\gamma}^{i}_{k}\right], \qquad (33c)
J(\pi^{*},\zeta^{d},\hat{g}^{*})=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|X^{d}_{k}\|^{2}_{Q(\omega)}+\|U_{k}\|^{2}_{R(\omega)}+\lambda\hat{g}^{*}_{k}\gamma^{d}_{k}\right]. \qquad (33d)

The cost (33a) is the one incurred by team $i$ in the finite-team setting when all the teams follow the $\alpha$–approximate MFTE policy. The cost in (33b) is the generic team's cost under the equilibrium policy and the equilibrium mean field. The cost in (33c) is incurred by team $i$ when it deviates to a policy $\xi^{i}\in\Xi^{cen,i}$ while all the other teams still employ the MF policy. Finally, the cost in (33d) is incurred when a team follows a decentralized policy $\zeta^{d}\in\Xi^{dec}$ under the approximate equilibrium MF trajectory. The notation for the states and actions in the above costs is under the respective policies and trajectory distributions, and is self-explanatory.

We next state and prove another main result of the paper in the following theorem, which essentially characterizes how ‘well’ the MFTE policy performs.

Theorem 2.

Suppose that Assumption 1 holds. Suppose also that $X_{0}$ is given. Then, the set $\{\hat{\zeta}^{*,i},\,1\leq i\leq N\}$ of decentralized policies constitutes an $(\epsilon_{a},\epsilon_{b})$–Nash equilibrium for the finite-team uplink scheduling game, i.e.,

J_{i}^{N}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})\leq\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})+\epsilon_{a}(\alpha)+\epsilon_{b}(N),

where $\epsilon_{a}(\alpha)=\mathcal{O}\Big(\frac{1}{1+\min_{k\in[T]}\{e^{\alpha|\operatorname{VoI}_{k}|}\}}\Big)$ and $\epsilon_{b}(N)=\mathcal{O}\Big(\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big)$.

Proof.

To establish the inequality in Theorem 2, consider the following:

J_{i}^{N}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})-\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i}) = J^{N}_{i}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})-J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*}) \qquad (34a)
\quad + J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})-\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},\zeta^{d},\hat{g}^{*}) \qquad (34b)
\quad + \inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},\zeta^{d},\hat{g}^{*})-\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i}). \qquad (34c)

The proofs for bounding each of the above terms are provided in the Appendices. The term in (34a) is bounded by Lemma 3 in Appendix II. Next, to bound the term in (34b), we first prove the boundedness of the state-action value function in Lemma 5 in Appendix III and consequently use Proposition 5 in Appendix III. Finally, the term in (34c) is bounded by Lemma 4 in Appendix II.

Combining the results from Lemmas 3 and 4 and Proposition 5, we arrive at

J_{i}^{N}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})-\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})\leq\epsilon_{a}(\alpha)+\mathcal{O}\Big(\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big)=\epsilon_{a}(\alpha)+\epsilon_{b}(N). \qquad (35)

The proof is thus complete.

Theorem 2 states that the $\alpha$–approximate MFTE policy derived using the infinite-team system behaves as an approximate Nash equilibrium solution for the original finite-team game, where the approximation results from 1) using the policy $\hat{\zeta}^{*}$ instead of the optimal sensing policy $\zeta^{*}$, and 2) using the infinite-team system as a proxy for the $N$–team system.

VII Numerical Experiments

In this section, we provide simulation results for the finite-team game and demonstrate the performance of the approximate MFTE computed in Section IV. Throughout, we take the type set to be a singleton, i.e., all the teams are homogeneous. Further, we consider open-loop unstable scalar agents with the following parameters: $A=1.5$, $B=2$, $Q=2$, $R=1$, $\Sigma_{0}=2$, $K_{W}=6$, and a time horizon $T=50$. Next, we note that the approximate sensing policy in (24) depends on the value of information, which in turn depends on the conditional value function of the state. Since we do not have a closed-form expression for the latter, we use supervised learning with function approximation [56] to approximate it. In particular, we start by parametrizing the value function using a quadratic-in-state and quadratic-in-mean-field approximation as $w_{0}+w_{1}e+w_{2}e^{2}+w_{3}\hat{g}^{2}$, where $w_{0},w_{1},w_{2}$, and $w_{3}$ denote the weighting coefficients, $e$ denotes the estimation error, and $\hat{g}$ denotes the mean-field trajectory. Subsequently, we run value iteration using the parametrized form of the value function. Finally, we perform supervised learning with a mean-squared error loss to learn the weighting coefficients. The procedure terminates whenever the latter converge to within a $\delta$ range for a suitably chosen $\delta>0$. Later, we also study the effect of the degree of the approximating polynomial on the total average cost per team.
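
A possible implementation of the value-function fit described above is sketched below: samples of the estimation error and the mean field are regressed, via a least-squares (mean-squared-error) fit, onto the feature vector $(1, e, e^{2}, \hat{g}^{2})$. The sampling scheme and the synthetic targets are simplified placeholders for the value-iteration targets used in our experiments.

```python
import numpy as np

def features(e, g_hat):
    """Quadratic-in-state, quadratic-in-mean-field features: V ~ w0 + w1 e + w2 e^2 + w3 g^2."""
    return np.stack([np.ones_like(e), e, e**2, g_hat**2], axis=1)

def fit_value_weights(e_samples, g_samples, v_targets):
    """Least-squares (mean-squared-error) fit of the weighting coefficients w0..w3."""
    Phi = features(e_samples, g_samples)
    w, *_ = np.linalg.lstsq(Phi, v_targets, rcond=None)
    return w

# Toy usage: targets generated from a known quadratic with additive noise.
rng = np.random.default_rng(0)
e = rng.normal(0.0, 2.0, size=1000)
g = rng.uniform(0.0, 1.0, size=1000)
v = 1.0 + 0.5 * e + 2.0 * e**2 + 3.0 * g**2 + rng.normal(0.0, 0.1, size=1000)
print(fit_value_weights(e, g, v))   # should be close to [1.0, 0.5, 2.0, 3.0]
```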


Figure 3: The plot shows the increase in the average cost per team with the increase in the importance weight $\lambda$.

In the first study, we plot in Figure 3 the variation of the average cost per team versus the importance weight $\lambda$ for four different values $\lambda=0.3, 1.0, 5.0$, and $10$. The corresponding values of $\alpha$ were taken to be $1.8, 0.55, 0.11$, and $0.055$. The figure shows a box plot depicting the median (red line) and spread (box) of the average cost per team over 100 runs for each value of $\lambda$ and $N=50$ teams, under the equilibrium controller-sensor policy developed using the mean-field game paradigm. We observe that the average cost increases with $\lambda$. This is explained by an increase in the overall estimation error due to the increase in the price of communication.


Figure 4: The plot shows the variation of the approximate mean-field trajectory ($\hat{g}^{*}$) as a function of time for different values of $\lambda$ and $\alpha$.

For the same set of values of $\lambda$ and $\alpha$, we also plot in Figure 4 the approximate equilibrium mean-field trajectory corresponding to the $\alpha$–approximate MFTE computed in Section IV, which denotes the average utilization of the wireless channel. We see that the channel utilization decreases as the price of communication increases, in line with intuition. The corresponding state, estimation error, and control input trajectories under the approximate equilibrium policies for all the teams are shown in Figures 5, 6, and 7, respectively.


Figure 5: The plot shows the variation of the state trajectory with time.


Figure 6: The plot shows the variation of the estimation error trajectory with time.


Figure 7: The plot shows the variation of the control input trajectory with time.

Next, in Figure 8, we study the variation of the average cost per team under the equilibrium controller-sensor policy for a fixed value of $\lambda=5$ and varying values of $\alpha$. The figure shows a box plot of the average cost per team over 100 runs for each value of $\alpha$ and $N=50$ teams. We observe that the average cost shows a decreasing trend with increasing values of $\alpha$. This is explained by the fact that increasing $\alpha$ decreases the level of approximation in the construction of the policy $\hat{\zeta}$ from the optimal policy $\zeta^{*}$, which in turn leads to a lower cost.

Figure 8: The plot shows the variation of the average cost per team vs. $\alpha$ for a fixed $\lambda=5$.

Next, in Figure 9, we provide box plots of the average cost per team over 100 runs for varying numbers of teams. The values of $\alpha$ and $\lambda$ were chosen to be 0.055 and 10, respectively. We observe that the spread of the box plot decreases with increasing $N$, indicating that the mean-field approximation improves as $N$ increases.


Figure 9: The plot shows the variation of the average cost per team versus the number of teams in the population for fixed $\lambda=10$ and $\alpha=0.055$.

Finally, we study the effect of the degree of the polynomial used to approximate the value function on the average cost of the teams. For this, we fix the term $w_{3}\hat{g}^{2}$ and choose $\lambda=5$ and $\alpha=0.11$. We plot in Figure 10 the average cost per team as a function of the degree of the polynomial. As the degree increases from quadratic to cubic to quartic, we do not observe significant changes in the average cost per team; the median costs (red line) for the three cases are 3640, 3650, and 3666, respectively. This demonstrates that the quadratic polynomial approximates the value function reasonably well.

Figure 10: The plot shows the variation of the average cost per team versus the degree of the polynomial used to approximate the value function, for fixed λ=5\lambda=5 and α=0.11\alpha=0.11.

VIII Conclusion & Discussions

In this work, we have computed approximate Nash equilibrium team policies for a multi-team dynamic game in which each team coordinates its communication and control policies while utilizing a shared wireless channel. Specifically, we have formulated a noncooperative (general-sum) game between NN teams, each comprised of a local sensor and a local controller that communicate over a network and incur an overall cost which is coupled with the (sensing) policies of the other teams as a result of communication over the shared channel. Since the sensing policies of the other teams are unavailable (and become increasingly difficult to obtain as the population grows), we have employed the mean-field team game framework to compute approximate decentralized Nash equilibria between the teams. Toward this end, we have first observed that the finite cardinality of the sensing action space prohibits the use of a contraction argument to establish the existence of a mean-field equilibrium. To overcome this difficulty, we have constructed a Boltzmann-type sensing policy, which we then employed to explicitly prove the existence of a unique approximate mean-field team equilibrium. Subsequently, we have shown that the same constitutes an approximate Nash equilibrium for the finite-team game, where the approximation error vanishes as the number of teams, NN, and the Boltzmann parameter α\alpha increase. We have also validated the theoretical results with extensive numerical simulations.

We list here a few future research directions. First, for the numerical analysis, we have assumed that the price parameter λ\lambda was fixed, and we have provided an extensive numerical analysis of its effect on the average costs, estimation errors, and control inputs of the teams. One direction would thus be to formulate the finite-team game within a Stackelberg setting, where the leader (possibly a telecommunication/network operator) solves an optimization problem to distribute an ‘optimal’ λ\lambda to the follower teams. This necessitates substituting the policies of the followers into the leader’s optimization problem and subsequently deriving the leader’s optimal policy. However, this currently appears challenging since each team’s policy is not available in closed form; rather, it is known only as a function of the conditional value function. Thus, it would be interesting to see whether one can derive analytical results for the Stackelberg game setting.

Another direction worth mentioning is that, in the current work, we did not account for the no-transmission instants within the information structure of the controller. This is because the no-transmission instants make the estimator dynamics nonlinear and couple the design of the optimal estimator with that of the optimal sensing policy. Although it has been shown in [27] that the optimal estimator turns out to be linear and decoupled from the sensor design, this holds only at optimality. In the current work, however, we have employed a Boltzmann sensing policy (which is an approximation to the optimal policy) to establish the existence of a unique and tractable equilibrium, and hence the decoupling argument between the estimator and the scheduler no longer holds. Thus, it would be interesting to investigate whether results similar to those in this paper can be generalized to the case where the information set of the controller also includes the no-transmission instants.

References

  • [1] D. Shishika and V. Kumar, “A review of multi agent perimeter defense games,” in Decision and Game Theory for Security: 11th International Conference, GameSec 2020, College Park, MD, USA, October 28–30, 2020, Proceedings 11.   Springer, 2020, pp. 472–485.
  • [2] J. Subramanian, A. Kumar, and A. Mahajan, “Mean-field games among teams,” arXiv preprint arXiv:2310.12282, 2023.
  • [3] G. Y. Weintraub, C. L. Benkard, and B. Van Roy, “Markov perfect industry dynamics with many firms,” Econometrica, vol. 76, no. 6, pp. 1375–1411, 2008.
  • [4] T. Hatanaka, N. Chopra, M. Fujita, and M. W. Spong, Passivity-based Control and Estimation in Networked Robotics.   Springer, 2015.
  • [5] J. Marschak, “Elements for a theory of teams,” Management Science, vol. 1, no. 2, pp. 127–137, 1955.
  • [6] R. Radner, “Team decision problems,” The Annals of Mathematical Statistics, vol. 33, no. 3, pp. 857–881, 1962.
  • [7] S. Yüksel and T. Başar, Stochastic networked control systems: Stabilization and optimization under information constraints.   Springer Science & Business Media, 2013.
  • [8] A. Dave and A. A. Malikopoulos, “Decentralized stochastic control in partially nested information structures,” IFAC-PapersOnLine, vol. 52, no. 20, pp. 97–102, 2019.
  • [9] A. Dave, N. Venkatesh, and A. A. Malikopoulos, “Decentralized control of two agents with nested accessible information,” in 2022 American Control Conference (ACC).   IEEE, 2022, pp. 3423–3430.
  • [10] H. S. Witsenhausen, “A counterexample in stochastic optimum control,” SIAM Journal on Control, vol. 6, no. 1, pp. 131–147, 1968.
  • [11] A. A. Feldbaum, “Dual control theory,” in Control Theory: Twenty-Five Seminal Papers (1932-1981), T. Başar, Ed.   Wiley-IEEE Press, 2001, ch. 10, pp. 874–880.
  • [12] T. Başar, “Variations on the theme of the Witsenhausen counterexample,” in Proceedings of the 47th IEEE Conference on Decision and Control, December 2008, pp. 1614–1619.
  • [13] K. J. Astrom and B. M. Bernhardsson, “Comparison of Riemann and Lebesgue sampling for first order stochastic systems,” in Proceedings of the 41st IEEE Conference on Decision and Control, 2002., vol. 2.   IEEE, 2002, pp. 2011–2016.
  • [14] P. Tabuada, “Event-triggered real-time scheduling of stabilizing control tasks,” IEEE Transactions on Automatic Control, vol. 52, no. 9, pp. 1680–1685, 2007.
  • [15] W. P. Heemels, K. H. Johansson, and P. Tabuada, “An introduction to event-triggered and self-triggered control,” in Proceedings of the 51st IEEE Conference on Decision and Control, 2012, pp. 3270–3285.
  • [16] K. Nar and T. Başar, “Sampling multidimensional Wiener processes,” in Proc. 53rd IEEE Conference on Decision and Control (CDC’14, Dec 15-17, 2014; Los Angeles, CA), pp. 3426–3431.
  • [17] D. V. Dimarogonas, E. Frazzoli, and K. H. Johansson, “Distributed event-triggered control for multi-agent systems,” IEEE Transactions on Automatic Control, vol. 57, no. 5, pp. 1291–1297, 2011.
  • [18] G. S. Seyboth, D. V. Dimarogonas, and K. H. Johansson, “Event-based broadcasting for multi-agent average consensus,” Automatica, vol. 49, no. 1, pp. 245–252, 2013.
  • [19] O. C. Imer and T. Başar, “Optimal estimation with limited measurements,” in Proceedings of the 44th IEEE Conference on Decision and Control.   IEEE, 2005, pp. 1029–1034.
  • [20] O. C. Imer and T. Başar, “Optimal estimation with limited measurements,” International Journal of Systems, Control and Communications, vol. 2, no. 1-3, pp. 5–29, 2010.
  • [21] G. M. Lipsa and N. C. Martins, “Remote state estimation with communication costs for first-order LTI systems,” IEEE Transactions on Automatic Control, vol. 56, no. 9, pp. 2013–2025, 2011.
  • [22] O. C. Imer and T. Başar, “Optimal control with limited controls,” in 2006 American Control Conference.   IEEE, 2006, pp. 298–303.
  • [23] O. C. Imer and T. Basar, “To measure or to control: optimal control with scheduled measurements and controls,” in 2006 American Control Conference.   IEEE, 2006, pp. 1003–1008.
  • [24] A. Molin and S. Hirche, “On LQG joint optimal scheduling and control under communication constraints,” in Proceedings of the 48th IEEE Conference on Decision and Control held jointly with 28th Chinese Control Conference.   IEEE, 2009, pp. 5832–5838.
  • [25] D. Maity and J. S. Baras, “Minimal feedback optimal control of linear-quadratic-Gaussian systems: No communication is also a communication,” IFAC-PapersOnLine, vol. 53, no. 2, pp. 2201–2207, 2020.
  • [26] T. Soleymani, J. S. Baras, and S. Hirche, “Value of information in feedback control: Quantification,” IEEE Transactions on Automatic Control, vol. 67, no. 7, pp. 3730–3737, 2022.
  • [27] T. Soleymani, J. S. Baras, S. Hirche, and K. H. Johansson, “Value of information in feedback control: Global optimality,” IEEE Transactions on Automatic Control, vol. 68, no. 6, pp. 3641–3647, 2023.
  • [28] I. Hogeboom-Burr and S. Yüksel, “Zero-sum games involving teams against teams: Existence of equilibria, and comparison and regularity in information,” Systems & Control Letters, vol. 172, p. 105454, 2023.
  • [29] M. G. Lagoudakis and R. Parr, “Learning in zero-sum team Markov games using factored value functions,” Advances in Neural Information Processing Systems, vol. 15, 2002.
  • [30] M. Ghimire, L. Zhang, W. Zhang, Y. Ren, and Z. Xu, “Solving two-player general-sum games between swarms,” available on arXiv:2310.01682, 2023.
  • [31] D. Maity and J. S. Baras, “Linear quadratic stochastic differential games under asymmetric value of information,” IFAC-PapersOnLine, vol. 50, no. 1, pp. 8957–8962, 2017.
  • [32] S. Aggarwal, T. Başar, and D. Maity, “Linear quadratic zero-sum differential games with intermittent and costly sensing,” IEEE Control Systems Letters, 2024.
  • [33] M. Huang, P. E. Caines, and R. P. Malhamé, “Large-population cost-coupled LQG problems with nonuniform agents: individual-mass behavior and decentralized ε\varepsilon-Nash equilibria,” IEEE Transactions on Automatic Control, vol. 52, no. 9, pp. 1560–1571, 2007.
  • [34] J.-M. Lasry and P.-L. Lions, “Mean field games,” Japanese Journal of Mathematics, vol. 2, no. 1, pp. 229–260, 2007.
  • [35] M. Huang, R. P. Malhamé, P. E. Caines et al., “Large population stochastic dynamic games: closed-loop Mckean-Vlasov systems and the Nash Certainty Equivalence principle,” Communications in Information & Systems, vol. 6, no. 3, pp. 221–252, 2006.
  • [36] J. Moon and T. Başar, “Linear quadratic risk-sensitive and robust mean field games,” IEEE Transactions on Automatic Control, vol. 62, no. 3, pp. 1062–1077, 2016.
  • [37] S. Aggarwal, M. A. U. Zaman, M. Bastopcu, and T. Başar, “Weighted age of information based scheduling for large population games on networks,” IEEE Journal on Selected Areas in Information Theory, vol. 4, pp. 682–697, 2023.
  • [38] F. Bagagiolo and D. Bauso, “Mean-field games and dynamic demand management in power grids,” Dynamic Games and Applications, vol. 4, pp. 155–176, 2014.
  • [39] S. Y. Olmez, S. Aggarwal, J. W. Kim, E. Miehling, T. Başar, M. West, and P. G. Mehta, “Modeling presymptomatic spread in epidemics via mean-field games,” in 2022 American Control Conference (ACC).   IEEE, 2022, pp. 3648–3655.
  • [40] S. Aggarwal, M. A. uz Zaman, M. Bastopcu, S. Ulukus, and T. Başar, “A mean field game model for timely computation in edge computing systems,” available on arXiv:2404.02898, 2024.
  • [41] J. Huang, Z. Qiu, S. Wang, and Z. Wu, “Linear quadratic mean-field game-team analysis: A mixed coalition approach,” Automatica, vol. 159, p. 111358, 2024.
  • [42] B. Djehiche, A. Tcheukam, and H. Tembine, “Mean-field-type games in engineering,” available on arXiv:1605.03281, 2016.
  • [43] M. A. U. Zaman, A. Koppel, M. Laurière, and T. Başar, “Independent RL for cooperative-competitive agents: A mean-field perspective,” available on arXiv:2403.11345, 2024.
  • [44] S. Aggarwal, M. A. uz Zaman, M. Bastopcu, and T. Başar, “Large population games on constrained unreliable networks,” in 2023 62nd IEEE Conference on Decision and Control (CDC), 2023, pp. 3480–3485.
  • [45] F. Mason, F. Chiariotti, A. Zanella, and P. Popovski, “Multi-agent reinforcement learning for coordinating communication and control,” IEEE Transactions on Cognitive Communications and Networking, 2024.
  • [46] Q. Lan, D. Wen, Z. Zhang, Q. Zeng, X. Chen, P. Popovski, and K. Huang, “What is semantic communication? A view on conveying meaning in the era of machine intelligence,” Journal of Communications and Information Networks, vol. 6, no. 4, pp. 336–371, 2021.
  • [47] E. Uysal, O. Kaya, A. Ephremides, J. Gross, M. Codreanu, P. Popovski, M. Assaad, G. Liva, A. Munari, B. Soret et al., “Semantic communications in networked systems: A data significance perspective,” IEEE Network, vol. 36, no. 4, pp. 233–240, 2022.
  • [48] H. Du, J. Wang, D. Niyato, J. Kang, Z. Xiong, M. Guizani, and D. I. Kim, “Rethinking wireless communication security in semantic internet of things,” IEEE Wireless Communications, vol. 30, no. 3, pp. 36–43, 2023.
  • [49] M. Kountouris and N. Pappas, “Semantics-empowered communication for networked intelligent systems,” IEEE Communications Magazine, vol. 59, no. 6, pp. 96–102, 2021.
  • [50] O. Ayan, P. Kutsevol, H. Y. Özkan, and W. Kellerer, “Semantics-and task-oriented scheduling for networked control systems in practice,” IEEE Access, vol. 10, pp. 115 673–115 690, 2022.
  • [51] W. Yang, H. Du, Z. Q. Liew, W. Y. B. Lim, Z. Xiong, D. Niyato, X. Chi, X. Shen, and C. Miao, “Semantic communications for future Internet: Fundamentals, applications, and challenges,” IEEE Communications Surveys & Tutorials, vol. 25, no. 1, pp. 213–250, 2022.
  • [52] J.-M. Lasry and P.-L. Lions, “Jeux à champ moyen. i–le cas stationnaire,” Comptes Rendus Mathématique, vol. 343, no. 9, pp. 619–625, 2006.
  • [53] ——, “Jeux à champ moyen. ii–horizon fini et contrôle optimal,” Comptes Rendus Mathématique, vol. 343, no. 10, pp. 679–684, 2006.
  • [54] K. Cui and H. Koeppl, “Approximately solving mean field games via entropy-regularized deep reinforcement learning,” in International Conference on Artificial Intelligence and Statistics.   PMLR, 2021, pp. 1909–1917.
  • [55] J. Moon and T. Başar, “Discrete-time LQG mean field games with unreliable communication,” in Proceedings of the 53rd IEEE Conference on Decision and Control (CDC), December 2014, pp. 2697–2702.
  • [56] D. Bertsekas and J. N. Tsitsiklis, Neuro-Dynamic Programming.   Athena Scientific, 1996.

Appendix I

Proof of Proposition 3.

We prove the proposition by first establishing the continuity of the operator 𝒯\mathcal{T} (via a backward induction on the state-action value function) and then employing Brouwer’s fixed point theorem.

To this end, let us define the following set

𝒢:={g^:={g^k}k=0T1g^k1,k[T]}.\displaystyle\mathcal{G}:=\{\hat{g}:=\{\hat{g}_{k}\}_{k=0}^{T-1}\mid\|\hat{g}_{k}\|\leq 1,~{}\forall k\in[T]\}. (36)

Then, we note from (25) that 𝒯:[0,1][0,1]\mathcal{T}:[0,1]^{\mathbb{N}}\rightarrow[0,1]^{\mathbb{N}}, and hence, 𝒯(𝒢)𝒢\mathcal{T}(\mathcal{G})\subseteq\mathcal{G}. Further, the interval [0,1][0,1] is compact and convex, and hence, 𝒢\mathcal{G} is also compact and convex. Thus, it only remains to prove the continuity of 𝒯\mathcal{T}, which then by using Brouwer’s fixed point theorem would prove the statement of the proposition. From now on, we will omit ω\omega wherever it clutters the notation.

To establish continuity, using (25), it suffices to show the continuity of VoIkω\operatorname{VoI}_{k}^{\omega} with respect to g^\hat{g}, for which we first show the continuity of the state-action value function, defined as:

Qk,x(kS,ak)\displaystyle\!\!{Q}^{*,x}_{k}(\mathcal{I}^{S}_{k},a_{k})\! =𝔼[λxkak+ek+1Γkek+1+mina{0,1}Qk+1,x(k+1S,a)],\displaystyle=\mathbb{E}[\lambda x_{k}a_{k}+e_{k+1}^{\top}\Gamma_{k}e_{k+1}+\min_{a\in\{0,1\}}{Q}^{*,x}_{k+1}(\mathcal{I}^{S}_{k+1},a)], (37)

for all k[T]k\in[T]. The above denotes the value received by the generic team when the action aka_{k} is chosen at the current instant and the future actions are chosen optimally from the admissible space Ξdec\Xi^{dec} for a given MF trajectory xx. Now, we will use backward induction on kk to prove the result. Fix g^𝒢\hat{g}\in\mathcal{G}. Consider the base case when k=T1k=T-1. Then, by recalling that the terminal cost is 0, we have that

QT1,g^(T1S,aT1)\displaystyle\!\!{Q}^{*,\hat{g}}_{T-1}(\mathcal{I}^{S}_{T-1},a_{T-1})\! =𝔼[λg^T1aT1+eTΓTeT].\displaystyle=\mathbb{E}[\lambda\hat{g}_{T-1}a_{T-1}+e_{T}^{\top}\Gamma_{T}e_{T}]. (38)

Then, we can compute the error covariance dynamics using Proposition 1 as

𝔼[ek+1ek+1]=𝔼[(1γk)A(ω)ekekA(ω)]+KW(ω).\displaystyle\!\mathbb{E}[e_{k+1}e_{k+1}^{\top}]=\mathbb{E}[(1-\gamma_{k})A(\omega)e_{k}e_{k}^{\top}A^{\top}(\omega)]\!+\!K_{W}(\omega).\! (39)

Substituting (39) into (38) for k=T1k=T-1, we arrive at the equation

QT1,g^\displaystyle{Q}^{*,\hat{g}}_{T-1} (T1S,aT1)=λg^T1aT1+(1aT1)2eT1AΓTAeT1+𝔼[WT1ΓTWT1],\displaystyle(\mathcal{I}^{S}_{T-1},a_{T-1})\!=\lambda\hat{g}_{T-1}a_{T-1}+(1-a_{T-1})^{2}e_{T-1}^{\top}A^{\top}\Gamma_{T}Ae_{T-1}+\mathbb{E}[W_{T-1}^{\top}\Gamma_{T}W_{T-1}],

which is linear (and thus, clearly continuous) in g^\hat{g}.

Next, we take the induction hypothesis to be the assertion that Qk+1,g^(k+1S,ak+1){Q}^{*,\hat{g}}_{k+1}(\mathcal{I}^{S}_{k+1},a_{k+1}) is continuous in g^\hat{g}. Then, we prove the assertion for instant kk. In this regard, consider the following:

Qk,g^(kS,ak)\displaystyle{Q}^{*,\hat{g}}_{k}(\mathcal{I}^{S}_{k},a_{k})\! =𝔼[λg^kak+ek+1Γkek+1+mina{0,1}Qk+1,g^(k+1S,a)].\displaystyle=\mathbb{E}[\lambda\hat{g}_{k}a_{k}+e_{k+1}^{\top}\Gamma_{k}e_{k+1}+\min_{a\in\{0,1\}}{Q}^{*,\hat{g}}_{k+1}(\mathcal{I}^{S}_{k+1},a)].

From the induction hypothesis, we have that mina{0,1}Qk+1,g^(k+1S,a)\min_{a\in\{0,1\}}{Q}^{*,\hat{g}}_{k+1}(\mathcal{I}^{S}_{k+1},a) is continuous in g^\hat{g}, since the pointwise minimum of finitely many continuous functions is continuous. Further, the running cost 𝔼[λg^kak+ek+1Γkek+1]\mathbb{E}[\lambda\hat{g}_{k}a_{k}+e_{k+1}^{\top}\Gamma_{k}e_{k+1}] is also continuous in g^\hat{g}. The desired continuity of Qk,g^(kS,ak){Q}^{*,\hat{g}}_{k}(\mathcal{I}^{S}_{k},a_{k}) thus follows from the fact that the sum of continuous functions is continuous. Consequently, the result is established for all k[T]k\in[T] and any chosen action a{0,1}a\in\{0,1\} by invoking the induction principle. The continuity of VoIkω\operatorname{VoI}_{k}^{\omega} then follows by noting that

VoIkω=Qk,g^,ω(kS,0)Qk,g^,ω(kS,1)\displaystyle\operatorname{VoI}_{k}^{\omega}={Q}^{*,\hat{g},\omega}_{k}(\mathcal{I}^{S}_{k},0)-{Q}^{*,\hat{g},\omega}_{k}(\mathcal{I}^{S}_{k},1)

is the difference of two continuous functions. The proof is thus complete. ∎
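
For completeness, the following is a small numerical sketch of the backward recursion (37) and of the resulting value of information VoIk=Qk,g^(,0)Qk,g^(,1)\operatorname{VoI}_{k}={Q}^{*,\hat{g}}_{k}(\cdot,0)-{Q}^{*,\hat{g}}_{k}(\cdot,1) for a scalar system. The parameters AA, Γ\Gamma, KWK_{W}, λ\lambda, the horizon, and the mean-field trajectory g^\hat{g} are illustrative placeholders; the value function is tabulated on a grid of the estimation error, and the expectation over the noise is replaced by a Monte Carlo average.

```python
import numpy as np

# Backward recursion for Q_k(e, a) and VoI_k(e) = Q_k(e, 0) - Q_k(e, 1) in a scalar
# system; all parameters are illustrative placeholders, and V_{k+1} is evaluated on
# an error grid via linear interpolation.
A, Gamma, Kw, lam, T = 1.05, 1.0, 1.0, 5.0, 20
g_hat = 0.5 * np.ones(T)                                  # assumed channel-traffic trajectory
e_grid = np.linspace(-10.0, 10.0, 401)
w_samples = np.sqrt(Kw) * np.random.default_rng(2).normal(size=2000)

V_next = np.zeros_like(e_grid)                            # terminal cost is zero
for k in range(T - 1, -1, -1):
    Q = np.empty((2, e_grid.size))
    for a in (0, 1):
        e_next = (1 - a) * A * e_grid[:, None] + w_samples[None, :]   # e_{k+1} under action a
        cont = np.interp(e_next, e_grid, V_next)                      # V_{k+1}(e_{k+1})
        Q[a] = lam * g_hat[k] * a + np.mean(Gamma * e_next**2 + cont, axis=1)
    voi_k = Q[0] - Q[1]                                   # value of information at instant k
    V_next = Q.min(axis=0)

print("VoI_0 at e = 0, 2, 4:", np.interp([0.0, 2.0, 4.0], e_grid, voi_k))
```

In this sketch, VoI0(0)=λg^0\operatorname{VoI}_{0}(0)=-\lambda\hat{g}_{0} (transmitting saves nothing when the error is zero), and the value of information turns positive once the error is large enough; this sign change is exactly what the Boltzmann policy smooths.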

Next, to prove Theorem 1, we first establish the following auxiliary lemma.

Lemma 2.

Let g^1,g^2𝒢\hat{g}^{1},\hat{g}^{2}\in\mathcal{G} where 𝒢\mathcal{G} is defined in (36). Then, the following is true for all k[T]k\in[T]:

Qk,g^1(kS,ak)\displaystyle{Q}^{*,\hat{g}^{1}}_{k}(\mathcal{I}^{S}_{k},a_{k}) Qk,g^2(kS,ak)λTg^1g^2.\displaystyle-{Q}^{*,\hat{g}^{2}}_{k}(\mathcal{I}^{S}_{k},a_{k})\leq\lambda T\|\hat{g}^{1}-\hat{g}^{2}\|.
Proof.

We will use backward induction on kk to prove the result. In this regard, first, consider the base case for k=T1k=T-1. Then, we have that:

QT1,g^1(T1S,aT1)\displaystyle{Q}^{*,\hat{g}^{1}}_{T-1}(\mathcal{I}^{S}_{T-1},a_{T-1}) QT1,g^2(T1S,aT1)\displaystyle-{Q}^{*,\hat{g}^{2}}_{T-1}(\mathcal{I}^{S}_{T-1},a_{T-1})
=λg^T11aT1+(1aT1)2eT1AΓTAeT1+𝔼[WT1ΓTWT1]\displaystyle=\lambda\hat{g}^{1}_{T-1}a_{T-1}+(1-a_{T-1})^{2}e_{T-1}^{\top}A^{\top}\Gamma_{T}Ae_{T-1}+\mathbb{E}[W_{T-1}^{\top}\Gamma_{T}W_{T-1}]
λg^T12aT1(1aT1)2eT1AΓTAeT1𝔼[W~T1ΓTW~T1]\displaystyle~{}~{}~{}~{}-\lambda\hat{g}^{2}_{T-1}a_{T-1}-(1-a_{T-1})^{2}e_{T-1}^{\top}A^{\top}\Gamma_{T}Ae_{T-1}-\mathbb{E}[\tilde{W}_{T-1}^{\top}\Gamma_{T}\tilde{W}_{T-1}]
=λaT1(g^T11g^T12)λg^1g^2.\displaystyle=\lambda a_{T-1}(\hat{g}^{1}_{T-1}-\hat{g}^{2}_{T-1})\leq\lambda\|\hat{g}^{1}-\hat{g}^{2}\|.

The base case is thus true. Assume next that

Qk+1,g^1(k+1S,ak+1)\displaystyle{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},a_{k+1}) Qk+1,g^2(k+1S,ak+1)λ(Tk1)g^1g^2.\displaystyle-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},a_{k+1})\leq\lambda(T-k-1)\|\hat{g}^{1}-\hat{g}^{2}\|. (40)

Then, consider the following:

Qk,g^1(kS,ak)Qk,g^2(kS,ak)\displaystyle{Q}^{*,\hat{g}^{1}}_{k}(\mathcal{I}^{S}_{k},a_{k})-{Q}^{*,\hat{g}^{2}}_{k}(\mathcal{I}^{S}_{k},a_{k})
=𝔼[λg^k1ak+ek+1Γkek+1+mina{0,1}Qk+1,g^1(k+1S,a)]𝔼[λg^k2ak+e~k+1Γke~k+1+mina{0,1}Qk+1,g^2(k+1S,a)]\displaystyle=\!\mathbb{E}[\lambda\hat{g}^{1}_{k}a_{k}+e_{k+1}^{\top}\Gamma_{k}e_{k+1}\!+\!\min_{a\in\{0,1\}}{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},a)]\!-\!\mathbb{E}[\lambda\hat{g}^{2}_{k}a_{k}\!+\!\tilde{e}_{k+1}^{\top}\Gamma_{k}\tilde{e}_{k+1}\!+\!\min_{a\in\{0,1\}}{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},a)]
=λak(g^k1g^k2)+𝔼[mina{0,1}Qk+1,g^1(k+1S,a)mina{0,1}Qk+1,g^2(k+1S,a)],\displaystyle=\lambda a_{k}(\hat{g}^{1}_{k}-\hat{g}^{2}_{k})+\mathbb{E}[\min_{a\in\{0,1\}}{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},a)\!-\!\min_{a\in\{0,1\}}{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},a)], (41)

where the errors ek+1e_{k+1} and e~k+1\tilde{e}_{k+1} are under noise realizations WkW_{k} and W~k\tilde{W}_{k}, respectively. Next, let us define a1:=argminaQk+1,g^1(k+1S,a)a_{1}^{*}:=\text{argmin}_{a}~{}{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},a) and a2:=argminaQk+1,g^2(k+1S,a)a_{2}^{*}:=\text{argmin}_{a}~{}{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},a). Then, we consider the following cases:

  • If a1=a2=1a_{1}^{*}=a_{2}^{*}=1 or a1=a2=0a_{1}^{*}=a_{2}^{*}=0, using (40), we have that

    Qk+1,g^1(k+1S,a1)\displaystyle{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},a_{1}^{*}) Qk+1,g^2(k+1S,a2)λ(Tk1)g^1g^2.\displaystyle-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},a_{2}^{*})\leq\lambda(T-k-1)\|\hat{g}^{1}-\hat{g}^{2}\|.
  • If a1=1a_{1}^{*}=1 and a2=0a_{2}^{*}=0, we have that

    Qk+1,g^1(k+1S,1)Qk+1,g^2(k+1S,0)Qk+1,g^1(k+1S,0)Qk+1,g^2(k+1S,0)λ(Tk1)g^1g^2,\displaystyle{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},1)-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},0)\leq{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},0)-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},0)\leq\lambda(T-k-1)\|\hat{g}^{1}-\hat{g}^{2}\|,

    where the last inequality follows by (40).

  • If a1=0a_{1}^{*}=0 and a2=1a_{2}^{*}=1, we have that

    Qk+1,g^1(k+1S,0)Qk+1,g^2(k+1S,1)Qk+1,g^1(k+1S,1)Qk+1,g^2(k+1S,1)λ(Tk1)g^1g^2,\displaystyle{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},0)-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},1)\leq{Q}^{*,\hat{g}^{1}}_{k+1}(\mathcal{I}^{S}_{k+1},1)-{Q}^{*,\hat{g}^{2}}_{k+1}(\mathcal{I}^{S}_{k+1},1)\leq\lambda(T-k-1)\|\hat{g}^{1}-\hat{g}^{2}\|,

    where the last inequality follows by (40).

Thus, continuing from (41), we have that

Qk,g^1(kS,ak)Qk,g^2(kS,ak)\displaystyle{Q}^{*,\hat{g}^{1}}_{k}(\mathcal{I}^{S}_{k},a_{k})-{Q}^{*,\hat{g}^{2}}_{k}(\mathcal{I}^{S}_{k},a_{k}) λak(g^k1g^k2)+λ(Tk1)g^1g^2\displaystyle\leq\lambda a_{k}(\hat{g}^{1}_{k}-\hat{g}^{2}_{k})+\lambda(T-k-1)\|\hat{g}^{1}-\hat{g}^{2}\|
λ(Tk)g^1g^2λTg^1g^2.\displaystyle\leq\lambda(T-k)\|\hat{g}^{1}-\hat{g}^{2}\|\leq\lambda T\|\hat{g}^{1}-\hat{g}^{2}\|. (42)

The proof of the lemma is thus completed by invoking the induction principle. ∎

We are now ready to prove Theorem 1.

Proof of Theorem 1.

We will use Banach’s fixed point theorem to prove the result. In this regard, we start by noting from (25) that 𝒯:[0,1][0,1]\mathcal{T}:[0,1]^{\mathbb{N}}\rightarrow[0,1]^{\mathbb{N}}, and hence, 𝒯(𝒢)𝒢\mathcal{T}(\mathcal{G})\subseteq\mathcal{G}. Further, the interval [0,1][0,1] is a closed subset of the complete metric space \mathbb{R}, and hence complete, which further implies that [0,1][0,1]^{\mathbb{N}} is complete under the sup norm. Finally, we prove that 𝒯\mathcal{T} is a contraction.

Let g^1,g^2𝒢\hat{g}^{1},\hat{g}^{2}\in\mathcal{G} and further define VoIkω,1\operatorname{VoI}^{\omega,1}_{k} and VoIkω,2\operatorname{VoI}^{\omega,2}_{k} as the value of information terms at instant kk corresponding to g^1\hat{g}^{1} and g^2\hat{g}^{2}, respectively. Then, consider the following:

[𝒯(g^1)\displaystyle[\mathcal{T}(\hat{g}^{1}) 𝒯(g^2)]k=𝔼[11+eαVoIkω,111+eαVoIkω,2]=𝔼[eαVoIkω,2eαVoIkω,1(1+eαVoIkω,2)(1+eαVoIkω,1)]\displaystyle-\mathcal{T}(\hat{g}^{2})]_{k}=\mathbb{E}\left[\frac{1}{1+e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}}}\!-\!\frac{1}{1\!+\!e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}}}\right]=\mathbb{E}\left[\frac{e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}}-e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}}}{(1+e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}})(1+e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}})}\right]
=𝔼[eαVoIkω,2(1eαVoIkω,1+αVoIkω,2)(1+eαVoIkω,2)(1+eαVoIkω,1)]α𝔼[eαVoIkω,2(VoIkω,1VoIkω,2)(1+eαVoIkω,2)(1+eαVoIkω,1)].\displaystyle=\mathbb{E}\left[\frac{e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}}(1-e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}+\alpha\operatorname{VoI}^{\omega,2}_{k}})}{(1+e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}})(1+e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}})}\right]\leq\alpha\mathbb{E}\left[\frac{e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}}(\operatorname{VoI}^{\omega,1}_{k}-\operatorname{VoI}^{\omega,2}_{k})}{(1+e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}})(1+e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}})}\right]. (43)

Next, we prove that VoIω\operatorname{VoI}^{\omega} is Lipschitz continuous in g^\hat{g}. To this end, we invoke the definition of the value of information to arrive at:

VoIkω,1VoIkω,2\displaystyle\operatorname{VoI}^{\omega,1}_{k}-\operatorname{VoI}^{\omega,2}_{k} =Qk,g^1,ω(kS,ω,0)Qk,g^1,ω(kS,ω,1)Qk,g^2,ω(kS,ω,0)+Qk,g^2,ω(kS,ω,1)\displaystyle={Q}^{*,\hat{g}^{1},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},0)-{Q}^{*,\hat{g}^{1},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},1)-{Q}^{*,\hat{g}^{2},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},0)+{Q}^{*,\hat{g}^{2},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},1)
=Qk,g^1,ω(kS,ω,0)Qk,g^2,ω(kS,ω,0)+Qk,g^2,ω(kS,ω,1)Qk,g^1,ω(kS,ω,1)\displaystyle={Q}^{*,\hat{g}^{1},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},0)-{Q}^{*,\hat{g}^{2},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},0)+{Q}^{*,\hat{g}^{2},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},1)-{Q}^{*,\hat{g}^{1},\omega}_{k}(\mathcal{I}^{S,\omega}_{k},1)
2λ(Tk)g^1g^2,\displaystyle\leq 2\lambda(T-k)\|\hat{g}^{1}-\hat{g}^{2}\|, (44)

where the last inequality follows by using Lemma 2 (applied with the roles of g^1\hat{g}^{1} and g^2\hat{g}^{2} interchanged for the last two terms). Then, by substituting (44) into (43), we arrive at

[𝒯(g^1)𝒯(g^2)]kLkg^1g^2whereLk:=2αλ(Tk)𝔼[eαVoIkω,2(1+eαVoIkω,2)(1+eαVoIkω,1)].\displaystyle[\mathcal{T}(\hat{g}^{1})-\mathcal{T}(\hat{g}^{2})]_{k}\leq L_{k}\|\hat{g}^{1}-\hat{g}^{2}\|~{}\text{where}~{}L_{k}:=2\alpha\lambda(T-k)\mathbb{E}\left[\frac{e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}}}{(1+e^{-\alpha\operatorname{VoI}^{\omega,2}_{k}})(1+e^{-\alpha\operatorname{VoI}^{\omega,1}_{k}})}\right].

A similar inequality can be proven for [𝒯(g^2)𝒯(g^1)]k[\mathcal{T}(\hat{g}^{2})-\mathcal{T}(\hat{g}^{1})]_{k}, which then yields the same bound for |[𝒯(g^1)𝒯(g^2)]k||[\mathcal{T}(\hat{g}^{1})-\mathcal{T}(\hat{g}^{2})]_{k}| as well. Thus, the operator 𝒯\mathcal{T} is a contraction if maxk[T]Lk<1\max_{k\in[T]}L_{k}<1, which is implied by the sufficient condition (26). The proof is then completed by invoking Banach’s fixed point theorem, which guarantees the existence of a unique fixed point g^𝒢\hat{g}^{*}\in\mathcal{G} of the operator 𝒯\mathcal{T}, i.e., g^=𝒯(g^)\hat{g}^{*}=\mathcal{T}(\hat{g}^{*}). ∎
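
The fixed point guaranteed by Theorem 1 can be approximated numerically by the Picard iteration g^(n+1)=𝒯(g^(n))\hat{g}^{(n+1)}=\mathcal{T}(\hat{g}^{(n)}), which converges geometrically whenever the sufficient condition (26) holds. The sketch below implements one such iteration for a scalar system: a backward pass tabulates VoIk\operatorname{VoI}_{k} on an error grid, a forward Monte Carlo pass (with common random numbers, so that the sampled operator is deterministic) averages the Boltzmann transmission probabilities, and the resulting traffic trajectory is fed back until the sup-norm gap is small. All parameters and grid sizes are illustrative placeholders.

```python
import numpy as np

# Picard iteration g <- T(g) for the mean-field operator of (25), sketched for a
# scalar system with illustrative parameters.
rng = np.random.default_rng(3)
A, Gamma, Kw, lam, alpha, T = 1.05, 1.0, 1.0, 5.0, 0.11, 20
e_grid = np.linspace(-10.0, 10.0, 201)
w_exp = np.sqrt(Kw) * rng.normal(size=1000)            # samples for the expectation over W
n_paths = 5000
w_fwd = np.sqrt(Kw) * rng.normal(size=(T, n_paths))    # common random numbers: the sampled
u_fwd = rng.random(size=(T, n_paths))                  # operator below is then deterministic

def mf_operator(g_hat):
    """One application of the mean-field operator: backward VoI pass, forward traffic pass."""
    voi = np.zeros((T, e_grid.size))
    V_next = np.zeros_like(e_grid)
    for k in range(T - 1, -1, -1):                     # backward pass: tabulate VoI_k(e)
        Q = np.empty((2, e_grid.size))
        for a in (0, 1):
            e_next = (1 - a) * A * e_grid[:, None] + w_exp[None, :]
            Q[a] = lam * g_hat[k] * a + np.mean(
                Gamma * e_next**2 + np.interp(e_next, e_grid, V_next), axis=1)
        voi[k] = Q[0] - Q[1]
        V_next = Q.min(axis=0)
    e = np.zeros(n_paths)                              # forward pass: average transmission prob.
    g_new = np.zeros(T)
    for k in range(T):
        p_tx = 1.0 / (1.0 + np.exp(-alpha * np.interp(e, e_grid, voi[k])))
        g_new[k] = p_tx.mean()
        gamma = (u_fwd[k] < p_tx).astype(float)
        e = (1.0 - gamma) * A * e + w_fwd[k]
    return g_new

g = 0.5 * np.ones(T)                                   # initial guess for the traffic trajectory
for it in range(50):
    g_next = mf_operator(g)
    gap = np.max(np.abs(g_next - g))
    g = g_next
    if gap < 1e-3:
        break
print(f"stopped after {it + 1} iterations with sup-norm gap {gap:.2e}")
```

A smaller α\alpha shrinks the Lipschitz constants LkL_{k} and hence speeds up the contraction, which is consistent with the role of α\alpha in the sufficient condition (26).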

Appendix II

Lemma 3.

It holds that JiN(π,i,ζ^,i,ζ^,i)J(π,ζ^,g^)𝒪(1N+ϵP,N).J^{N}_{i}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i})-J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})\leq\mathcal{O}\Big{(}\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big{)}.

Proof.

Consider the following:

JiN(π,i,ζ^,i,ζ^,i)\displaystyle J^{N}_{i}(\pi^{*,i},\hat{\zeta}^{*,i},\hat{\zeta}^{*,-i}) =1T𝔼[k=0T1X^kiQ(ωi)2+UkiR(ωi)2+λγ^kN,avγ^kiλg^kγ^ki+λg^kγ^ki]\displaystyle=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|\hat{X}^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}+\lambda\hat{\gamma}^{N,av}_{k}\hat{\gamma}^{i}_{k}-\lambda\hat{g}^{*}_{k}\hat{\gamma}^{i}_{k}\!+\!\lambda\hat{g}^{*}_{k}\hat{\gamma}^{i}_{k}\right]
=J(π,ζ^,g^)+λT𝔼[k=0T1(γ^kN,avg^k)γ^ki]J(π,ζ^,g^)+𝒪(1N+ϵP,N),\displaystyle=J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})+\frac{\lambda}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}(\hat{\gamma}^{N,av}_{k}\!-\!\hat{g}^{*}_{k})\hat{\gamma}^{i}_{k}\right]\leq J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})\!+\!\mathcal{O}\Big{(}\frac{1}{\sqrt{N}}\!+\!\epsilon_{P,N}\Big{)},

where the equality follows since we use the approximate MF policy pair (π,ζ^\pi^{*},\hat{\zeta}^{*}) on the finite-agent system and the inequality follows as a result of Lemma 1. The proof is thus complete. ∎

Lemma 4.

The following inequality holds:

infζdΞdecJ(π,ζd,g^)infξiΞcen,iJiN(π,i,ξi,ζ^,i)𝒪(1N+ϵP,N).\displaystyle\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},{\zeta^{d}},\hat{g}^{*})-\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})\leq\mathcal{O}\Big{(}\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big{)}.
Proof.

Let us fix a policy ξiΞcen,i\xi^{i}\in\Xi^{cen,i} and γ~kiξi(kSi)\tilde{\gamma}^{i}_{k}\sim\xi^{i}(\mathcal{I}^{S_{i}}_{k}) while the other agents play using the policy ζ^\hat{\zeta}^{*}. Then, we have that:

JiN(π,i,ξi,ζ^,i)\displaystyle J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i}) =1T𝔼[k=0T1XkiQ(ωi)2+UkiR(ωi)2+λγ~kN,avγ~ki+λg^kγ~kiλg^kγ~ki]\displaystyle=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|{X}^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}+\lambda\tilde{\gamma}^{N,av}_{k}\tilde{\gamma}^{i}_{k}+\lambda\hat{g}^{*}_{k}\tilde{\gamma}^{i}_{k}-\lambda\hat{g}^{*}_{k}\tilde{\gamma}^{i}_{k}\right]
=1T𝔼[k=0T1XkiQ(ωi)2+UkiR(ωi)2+λg^kγ~ki+λγ~kN,avγ~kiλg^kγ~ki]\displaystyle=\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\|{X}^{i}_{k}\|^{2}_{Q(\omega_{i})}+\|U^{i}_{k}\|^{2}_{R(\omega_{i})}+\lambda\hat{g}^{*}_{k}\tilde{\gamma}^{i}_{k}+\lambda\tilde{\gamma}^{N,av}_{k}\tilde{\gamma}^{i}_{k}-\lambda\hat{g}^{*}_{k}\tilde{\gamma}^{i}_{k}\right]
infζdΞdecJ(π,ζd,g^)+1T𝔼[k=0T1λγ~ki(γ~kN,avg^k)],\displaystyle\geq\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},{\zeta^{d}},\hat{g}^{*})+\frac{1}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}\lambda\tilde{\gamma}^{i}_{k}(\tilde{\gamma}^{N,av}_{k}-\hat{g}^{*}_{k})\right],

which gives

infζdΞdecJ(π,ζd,g^)JiN(π,i,ξi,ζ^,i)λT𝔼[k=0T1(g^kγ~kN,av)γ~ki]𝒪(1N+ϵP,N)\displaystyle\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},{\zeta}^{d},\hat{g}^{*})-J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})\leq\frac{\lambda}{T}\mathbb{E}\left[\sum_{k=0}^{T-1}(\hat{g}^{*}_{k}-\tilde{\gamma}^{N,av}_{k})\tilde{\gamma}^{i}_{k}\right]\leq\mathcal{O}\Big{(}\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big{)}

which, since the above holds for every ξiΞcen,i\xi^{i}\in\Xi^{cen,i}, implies upon taking the infimum over ξi\xi^{i} that

infζdΞdecJ(π,ζd,g^)infξiΞcen,iJiN(π,i,ξi,ζ^,i)𝒪(1N+ϵP,N).\displaystyle\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},{\zeta}^{d},\hat{g}^{*})-\inf_{\xi^{i}\in\Xi^{cen,i}}J_{i}^{N}(\pi^{*,i},\xi^{i},\hat{\zeta}^{*,-i})\leq\mathcal{O}\Big{(}\frac{1}{\sqrt{N}}+\epsilon_{P,N}\Big{)}. (45)

The proof is thus complete. ∎

Appendix III

Lemma 5.

The following inequality holds:

Q^kg^(kS,γ^k)fk(ek2)+ck,k[T],\displaystyle\hat{Q}^{\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})\leq f_{k}(\|e_{k}\|^{2})+c_{k},~{}~{}~{}k\in[T], (46)

where Q^kx(kS,ak)\!\hat{Q}^{x}_{k}(\mathcal{I}^{S}_{k},\!a_{k})\! denotes the state-action value function and is defined recursively as:

Q^kx(kS,ak)\displaystyle\hat{Q}^{x}_{k}(\mathcal{I}^{S}_{k},a_{k})\! =𝔼[λxkak+ek+1Γkek+1+Q^k+1x(k+1S,ak+1)],\displaystyle=\!\mathbb{E}[\lambda x_{k}a_{k}\!+\!e_{k+1}^{\top}\Gamma_{k}e_{k+1}\!\!+\!\hat{Q}^{x}_{k+1}(\mathcal{I}^{S}_{k+1},a_{k+1})], (47)

with ak,ak+1ζ^(kS,x)a_{k},a_{k+1}\sim\hat{\zeta}^{*}(\mathcal{I}^{S}_{k},x). Further, fk()f_{k}(\cdot) are time-dependent functions, and ck>0c_{k}>0 are constants.

We note that the same inequality can also be established for Q,x(,)Q^{*,x}(\cdot,\cdot) using the same proof technique as presented below.

Proof.

We prove the above lemma by induction on kk. In this regard, consider the base case with k=T1k=T-1. Then, we have

Q^T1g^(T1S,γ^T1)\displaystyle\hat{Q}^{\hat{g}^{*}}_{T-1}(\mathcal{I}^{S}_{T-1},\hat{\gamma}_{T-1}) =𝔼[λg^T1γ^T1+eTΓTeT]λ+sup0tTΓt𝔼[eTeT]\displaystyle=\mathbb{E}[\lambda\hat{g}^{*}_{T-1}\hat{\gamma}_{T-1}+e_{T}^{\top}\Gamma_{T}e_{T}]\leq\lambda+\sup_{0\leq t\leq T}\|\Gamma_{t}\|\mathbb{E}[e_{T}^{\top}e_{T}]
λ+sup0tTΓt(eT12+KW)=:fT1(eT12)+cT1,\displaystyle\leq\lambda+\sup_{0\leq t\leq T}\|\Gamma_{t}\|(\|e_{T-1}\|^{2}+K_{W})=:f_{T-1}(\|e_{T-1}\|^{2})+c_{T-1}, (48)

where we have used the error dynamics in (20) to arrive at the last inequality. The base case is therefore proved. Assume now that the inequality in the statement of the lemma holds for instant k+1k+1, i.e.,

Q^k+1g^(k+1S,γ^k+1)fk+1(ek+12)+ck+1.\displaystyle\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},\hat{\gamma}_{k+1})\leq f_{k+1}(\|e_{k+1}\|^{2})+c_{k+1}. (49)

Then, we prove the above for instant kk. Consider the following:

Q^kg^(kS,γ^k)\displaystyle\hat{Q}^{\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k}) =𝔼[λg^kγ^k+ek+1Γk+1ek+1+Q^k+1g^(k+1S,ζ^)]\displaystyle=\mathbb{E}[\lambda\hat{g}^{*}_{k}\hat{\gamma}_{k}+e_{k+1}^{\top}\Gamma_{k+1}e_{k+1}+\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},\hat{\zeta}^{*})]
λ+sup0tTΓt(ek2+KW)+𝔼[fk+1(ek+12)+ck+1]\displaystyle\leq\lambda+\sup_{0\leq t\leq T}\|\Gamma_{t}\|(\|e_{k}\|^{2}+K_{W})+\mathbb{E}[f_{k+1}(\|e_{k+1}\|^{2})+c_{k+1}]
λ+sup0tTΓt(ek2+KW)+𝔼[fk+1((1γ^k)ek+Wk2)+ck+1]\displaystyle\leq\lambda+\sup_{0\leq t\leq T}\|\Gamma_{t}\|(\|e_{k}\|^{2}+K_{W})+\mathbb{E}[f_{k+1}(\|(1-\hat{\gamma}_{k})e_{k}+W_{k}\|^{2})+c_{k+1}]
=:fk(ek2)+ck.\displaystyle=:f_{k}(\|e_{k}\|^{2})+c_{k}. (50)

The proof now follows by the induction principle, and is thus complete. ∎

Proposition 4.

We have that

Q^kg^(kS,γ^k)Q,g^(kS,γ^k)ϵ(α).\displaystyle\hat{Q}^{\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})-{Q}^{*,\hat{g}^{*}}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})\leq\epsilon^{\prime}(\alpha). (51)

for all k[T]k\in[T], where ϵ(α):=𝒪(11+minktT1{eα|VoIt|})\epsilon^{\prime}(\alpha):=\mathcal{O}\Big{(}\frac{1}{1+\min_{k\leq t\leq T-1}\{e^{\alpha|\operatorname{VoI}_{t}|}\}}\Big{)}.

Proof.

Let us fix g^[0,1]T\hat{g}^{*}\in[0,1]^{T}. We will use induction on kk to prove the result.

Let us start with the base case of k=T1k=T-1. Let us define the estimation errors

e~k+1=(1γ^k)Ae^k+W~k,e¯k+1=(1γ^k)Ae^k+W¯k,\displaystyle\tilde{e}_{k+1}=(1-\hat{\gamma}_{k})A\hat{e}_{k}+\tilde{W}_{k},\qquad\bar{e}_{k+1}=(1-\hat{\gamma}_{k})A\hat{e}_{k}+\bar{W}_{k},

where e^k\hat{e}_{k} is the error at instant k under action γ^k1\hat{\gamma}_{k-1}.

We recall that VT(TS,g^)=0V_{T}(\mathcal{I}_{T}^{S},\hat{g}^{*})=0. Then, we have that

Q^T1g^(T1S,γ^T1)QT1,g^(T1S,γ^T1)=tr(ΓT(𝔼[e~Te~T]𝔼[e¯Te¯T]))\displaystyle\hat{Q}^{\hat{g}^{*}}_{T-1}(\mathcal{I}^{S}_{T-1},\hat{\gamma}_{T-1})-{Q}^{*,\hat{g}^{*}}_{T-1}(\mathcal{I}^{S}_{T-1},\hat{\gamma}_{T-1})=tr(\Gamma_{T}(\mathbb{E}[\tilde{e}_{T}^{\top}\tilde{e}_{T}]-\mathbb{E}[\bar{e}_{T}^{\top}\bar{e}_{T}]))
=tr(ΓT((1γ^T1)Ae^T1e^T1A)+KW)tr(ΓT((1γ^T1)Ae^T1e^T1A)+KW)\displaystyle=tr(\Gamma_{T}((1-\hat{\gamma}_{T-1})A\hat{e}_{T-1}\hat{e}_{T-1}^{\top}A^{\top})+K_{W})-tr(\Gamma_{T}((1-\hat{\gamma}_{T-1})A\hat{e}_{T-1}\hat{e}_{T-1}^{\top}A^{\top})+K_{W})
=0<ϵ(α).\displaystyle=0<\epsilon^{\prime}(\alpha). (52)

The base case is thus proved. Assume now that the following is true:

Q^k+1g^(k+1S,γ^k+1)\displaystyle\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},\hat{\gamma}_{k+1}) Qk+1,g^(k+1S,γ^k+1)ϵ1(α)/2:=𝒪(11+mink+1tT1{eα|VoIt|}).\displaystyle-{Q}^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},\hat{\gamma}_{k+1})\leq\epsilon_{1}(\alpha)/2:=\mathcal{O}\Big{(}\frac{1}{1+\min_{k+1\leq t\leq T-1}\{e^{\alpha|\operatorname{VoI}_{t}|}\}}\Big{)}. (53)

Then, we prove the inequality in the statement of the proposition for instant kk. In this regard, let us define the set 𝒜:={0,1}\mathcal{A}:=\{0,1\} and the set 𝒜optx(S)\mathcal{A}^{x}_{\text{opt}}(\mathcal{I}^{S}) to be the set of optimal actions for a given trajectory xx and the information set S\mathcal{I}^{S}. Further, let us denote a suboptimal action as asub𝒜𝒜optx(S)a_{\text{sub}}\in\mathcal{A}\setminus\mathcal{A}^{x}_{\text{opt}}(\mathcal{I}^{S}). Then, we have that

Q^kg^(kS,γ^k)Qk,g^(kS,γ^k)=a𝒜ζ^(akS,g^,k)Q^k+1g^(k+1S,a)mina𝒜Qk+1,g^(k+1S,a)\displaystyle\hat{Q}^{\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})-{Q}^{*,\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})=\sum_{a\in\mathcal{A}}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*,k})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})
=a𝒜optg^(kS)ζ^(akS,g^,k)Q^k+1g^(k+1S,a)mina𝒜Qk+1,g^(k+1S,a)+a𝒜𝒜optg^(kS)ζ^(akS,g^)Q^k+1g^(k+1S,a)\displaystyle=\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*,k})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})+\sum_{\hskip 8.53581pta\in\mathcal{A}\setminus\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)
=[a𝒜optg^(kS)ζ^(akS,g^)Q^k+1g^(k+1S,a)a𝒜optg^(kS)ζ^(akS,g^)mina𝒜Qk+1,g^(k+1S,a)]\displaystyle=\Big{[}\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)-\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})\Big{]}
+[a𝒜optg^(kS)ζ^(akS,g^)mina𝒜Qk+1,g^(k+1S,a)mina𝒜Qk+1,g^(k+1S,a)]\displaystyle\hskip 17.07164pt+\Big{[}\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})\Big{]}
+a𝒜𝒜optg^(kS)ζ^(akS,g^)Q^k+1g^(k+1S,a)\displaystyle\hskip 17.07164pt+\sum_{\hskip 8.53581pta\in\mathcal{A}\setminus\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)
2maxa𝒜optg^(kS)[Q^k+1g^(k+1S,a)mina𝒜Qk+1,g^(k+1S,a)]+[a𝒜optg^(kS)ζ^(akS,g^)mina𝒜Qk+1,g^(k+1S,a)\displaystyle\leq 2\max_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}[\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})]+\Big{[}\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})
mina𝒜Qk+1,g^(k+1S,a)]+a𝒜𝒜optg^(kS)ζ^(akS,g^)Q^g^k+1(k+1S,a)\displaystyle\hskip 28.45274pt-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})\Big{]}+\sum_{\hskip 8.53581pta\in\mathcal{A}\setminus\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)
ϵ1(α)+[a𝒜optg^(kS)ζ^(akS,g^)mina𝒜Qk+1,g^(k+1S,a)mina𝒜Qk+1,g^(k+1S,a)]\displaystyle\leq\epsilon_{1}(\alpha)+\Big{[}\sum_{a\in\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})-\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})\Big{]}
+a𝒜𝒜optg^(kS)ζ^(akS,g^)Q^k+1g^(k+1S,a)\displaystyle\hskip 17.07164pt+\sum_{\hskip 8.53581pta\in\mathcal{A}\setminus\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)
(a)ϵ1(α)+ζ^(asubkS,g^)mina𝒜Qk+1,g^(k+1S,a)+a𝒜𝒜optg^(kS)ζ^(akS,g^)Q^k+1g^(k+1S,a)\displaystyle\overset{\text{(a)}}{\leq}\epsilon_{1}(\alpha)+\hat{\zeta}^{*}(a_{\text{sub}}\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\min_{a^{\prime}\in\mathcal{A}}Q^{*,\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a^{\prime})+\sum_{\hskip 8.53581pta\in\mathcal{A}\setminus\mathcal{A}^{\hat{g}^{*}}_{\text{opt}}(\mathcal{I}^{S}_{k})}\hat{\zeta}^{*}(a\mid\mathcal{I}^{S}_{k},\hat{g}^{*})\hat{Q}^{\hat{g}^{*}}_{k+1}(\mathcal{I}^{S}_{k+1},a)
ϵ1(α)+4(fk(ek2)+ck)11+eα|VoIk|ϵ(α).\displaystyle\leq\epsilon_{1}(\alpha)+4(f_{k}(\|e_{k}\|^{2})+c_{k})\frac{1}{1+e^{\alpha|\operatorname{VoI}_{k}|}}\leq\epsilon^{\prime}(\alpha). (54)

We note that the above is valid for VoIk0\operatorname{VoI}_{k}\neq 0. If VoIk=0\operatorname{VoI}_{k}=0, then both actions 0 and 1 produce the same cost, and hence, there is no suboptimal action. Thus, from the inequality (a) above, we get that Q^kg^(kS,γ^k)Qk,g^(kS,γ^k)ϵ1(α)<ϵ(α)\hat{Q}^{\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})-{Q}^{*,\hat{g}^{*}}_{k}(\mathcal{I}^{S}_{k},\hat{\gamma}_{k})\leq\epsilon_{1}(\alpha)<\epsilon^{\prime}(\alpha).

The proof thus follows by the induction principle for all k[T]k\in[T], and is complete.

Proposition 5.

It holds that J(π,ζ^,g^)infζdΞdecJ(π,ζd,g^)ϵ(α).J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})-\inf_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},{\zeta^{d}},\hat{g}^{*})\leq\epsilon(\alpha).

Proof.

Let ζ=arginfζdΞdecJ(π,ζd,g^)\zeta^{*}=\text{arginf}_{\zeta^{d}\in\Xi^{dec}}J(\pi^{*},\zeta^{d},\hat{g}^{*}) and γ\gamma^{*} be an action chosen from ζ\zeta^{*}.

J(π,ζ^,g^)J(π,ζ,g^)\displaystyle J(\pi^{*},\hat{\zeta}^{*},\hat{g}^{*})-J(\pi^{*},{\zeta}^{*},\hat{g}^{*})
=𝔼[Q^0(e0,γ^0,g^)]𝔼[Q0(e0,γ0,g^)]\displaystyle=\mathbb{E}[\hat{Q}_{0}(e_{0},\hat{\gamma}_{0},\hat{g}^{*})]-\mathbb{E}[{Q}^{*}_{0}(e_{0},{\gamma}^{*}_{0},\hat{g}^{*})]
=𝔼[Q^0(e0,γ^0,g^)]𝔼[Q0(e0,γ^0,g^)]+𝔼[Q0(e0,γ^0,g^)]𝔼[Q0(e0,γ0,g^)]\displaystyle=\mathbb{E}[\hat{Q}_{0}(e_{0},\hat{\gamma}_{0},\hat{g}^{*})]-\mathbb{E}[{Q}^{*}_{0}(e_{0},\hat{\gamma}_{0},\hat{g}^{*})]+\mathbb{E}[{Q}^{*}_{0}(e_{0},\hat{\gamma}_{0},\hat{g}^{*})]-\mathbb{E}[{Q}^{*}_{0}(e_{0},{\gamma}^{*}_{0},\hat{g}^{*})]
ϵ(α)+𝔼[Q0(e0,γ^0,g^)]𝔼[Q0(e0,γ0,g^)]\displaystyle\leq\epsilon^{\prime}(\alpha)+\mathbb{E}[{Q}^{*}_{0}(e_{0},\hat{\gamma}_{0},\hat{g}^{*})]-\mathbb{E}[{Q}^{*}_{0}(e_{0},{\gamma}^{*}_{0},\hat{g}^{*})]
=ϵ(α)+𝔼[λg^0(γ^0γ0)+e^1Γ1e^1(e1)Γ1e1+Q1(e^1,γ1,g^)Q1(e1,γ1,g^)]\displaystyle=\epsilon^{\prime}(\alpha)+\mathbb{E}[\lambda\hat{g}^{*}_{0}(\hat{\gamma}_{0}-\gamma^{*}_{0})+\hat{e}_{1}^{\top}\Gamma_{1}\hat{e}_{1}-(e^{*}_{1})^{\top}\Gamma_{1}e^{*}_{1}+Q_{1}^{*}(\hat{e}_{1},\gamma_{1}^{*},\hat{g}^{*})-Q_{1}^{*}(e^{*}_{1},\gamma_{1}^{*},\hat{g}^{*})]
=ϵ(α)+𝔼[λg^0(γ^0γ0)]+tr(Γ1𝔼[(γ0γ^0)Ae0e0A])+𝔼[Q1((1γ^0)Ae0+W^0,γ1,g^)\displaystyle=\epsilon^{\prime}(\alpha)+\mathbb{E}[\lambda\hat{g}^{*}_{0}(\hat{\gamma}_{0}-\gamma^{*}_{0})]+tr(\Gamma_{1}\mathbb{E}[(\gamma^{*}_{0}-\hat{\gamma}_{0})Ae_{0}e_{0}^{\top}A^{\top}])+\mathbb{E}[Q_{1}^{*}((1-\hat{\gamma}_{0})Ae_{0}+\hat{W}_{0},\gamma^{*}_{1},\hat{g}^{*})
Q1((1γ0)Ae0+W~0,γ1,g^)]\displaystyle\hskip 28.45274pt-Q_{1}^{*}((1-\gamma^{*}_{0})Ae_{0}+\tilde{W}_{0},\gamma^{*}_{1},\hat{g}^{*})]
ϵ(α)+𝔼[λg^0(γ^0γ0)]+tr(Γ1Ae0e0A)𝔼[(γ0γ^0)]+2(f0(e02)+c0)(γ0γ^0)\displaystyle\leq\epsilon^{\prime}(\alpha)+\mathbb{E}[\lambda\hat{g}^{*}_{0}(\hat{\gamma}_{0}-\gamma^{*}_{0})]+tr(\Gamma_{1}Ae_{0}e_{0}^{\top}A^{\top})\mathbb{E}[(\gamma^{*}_{0}-\hat{\gamma}_{0})]+2(f_{0}(\|e_{0}\|^{2})+c_{0})\mathbb{P}(\gamma^{*}_{0}\neq\hat{\gamma}_{0})
+(γ0=γ^0)(𝔼[Q1((1γ^0)Ae0+W^0,γ1,g^)Q1((1γ^0)Ae0+Wˇ0,γ1,g^)])\displaystyle\hskip 28.45274pt+\mathbb{P}(\gamma^{*}_{0}=\hat{\gamma}_{0})(\mathbb{E}[Q_{1}^{*}((1-\hat{\gamma}_{0})Ae_{0}+\hat{W}_{0},\gamma^{*}_{1},\hat{g}^{*})-Q_{1}^{*}((1-\hat{\gamma}_{0})Ae_{0}+\check{W}_{0},\gamma^{*}_{1},\hat{g}^{*})])
ϵ(α)+(λ+tr(Γ1Ae0e0A))𝔼[(γ0γ^0)]+2(f0(e02)+c0)(γ0γ^0)\displaystyle\leq\epsilon^{\prime}(\alpha)+(\lambda+tr(\Gamma_{1}Ae_{0}e_{0}^{\top}A^{\top}))\mathbb{E}[(\gamma^{*}_{0}-\hat{\gamma}_{0})]+2(f_{0}(\|e_{0}\|^{2})+c_{0})\mathbb{P}(\gamma^{*}_{0}\neq\hat{\gamma}_{0})
ϵ(α)+(λ+tr(Γ1Ae0e0A)+2f0(e02)+2c0)×(11+eα|VoI0,1|+11+eα|VoI0,2|)\displaystyle\leq\epsilon^{\prime}(\alpha)+(\lambda+tr(\Gamma_{1}Ae_{0}e_{0}^{\top}A^{\top})+2f_{0}(\|e_{0}\|^{2})+2c_{0})\times\Big{(}\frac{1}{1+e^{\alpha|\operatorname{VoI}_{0,1}|}}+\frac{1}{1+e^{\alpha|\operatorname{VoI}_{0,2}|}}\Big{)}
=:ϵ(α)+ϵ′′(α)=:ϵ(α)=𝒪(11+mink[T]{eα|VoIk|}).\displaystyle=:\epsilon^{\prime}(\alpha)+\epsilon^{\prime\prime}(\alpha)=:\epsilon(\alpha)=\mathcal{O}\Big{(}\frac{1}{1+\min_{k\in[T]}\{e^{\alpha|\operatorname{VoI}_{k}|}\}}\Big{)}. (55)

The proof is thus complete. ∎