
Goal-oriented Tensor: Beyond Age of Information Towards Semantics-Empowered Goal-Oriented Communications

Aimin Li, Graduate Student Member, IEEE, Shaohua Wu, Member, IEEE,
Sumei Sun, Fellow, IEEE, and Jie Cao, Member, IEEE
Abstract

Optimizations premised on open-loop metrics such as Age of Information (AoI) enhance the system's decision-making utility only indirectly. We therefore propose a novel closed-loop metric named Goal-oriented Tensor (GoT) to directly quantify the impact of semantic mismatches on goal-oriented decision-making utility. Leveraging the GoT, we consider a sampler & decision-maker pair that works collaboratively and distributively to achieve a shared communication goal. We formulate a two-agent infinite-horizon Decentralized Partially Observable Markov Decision Process (Dec-POMDP) to conjointly deduce the optimal deterministic sampling policy and decision-making policy. To circumvent the curse of dimensionality inherent in obtaining an optimal deterministic joint policy through Brute-Force-Search, a sub-optimal yet computationally efficient algorithm is developed. This algorithm is predicated on the search for a Nash Equilibrium between the sampler and the decision-maker. Simulation results reveal that the proposed sampler & decision-maker co-design surpasses the current literature on AoI and its variants in terms of both goal achievement utility and sparse sampling rate, signifying progress in semantics-conscious, goal-driven sparse sampling design.

Index Terms:
Goal-oriented communications, Goal-oriented Tensor, Status updates, Age of Information, Age of Incorrect Information, Value of Information, Semantics-aware sampling.

I Introduction

The recent advancement of the emerging 5G and beyond has spawned the proliferation of both theoretical development and application instances for Internet of Things (IoT) networks. In such networks, timely status updates are becoming increasingly crucial for enabling real-time monitoring and actuation across a plethora of applications. To address the inadequacies of traditional throughput and delay metrics in such contexts, the Age of Information (AoI) has emerged as an innovative metric to capture the data freshness perceived by the receiver [1], defined as $\mathrm{AoI}(t)=t-U(t)$, where $U(t)$ denotes the generation time of the latest received packet before time $t$. Since its inception, AoI has garnered significant research attention and has been extensively analyzed in queuing systems [2, 3, 4, 5, 6, 7], physical-layer communications [8, 9, 10, 11, 12, 13, 14], MAC-layer communications [15, 16, 17, 18, 19], industrial IoT [20, 21], energy harvesting systems [22, 23, 24, 25], etc. (see [26] and the references therein for a more comprehensive review). These research efforts are driven by the consensus that a freshly received message typically contains more valuable information, thereby enhancing the utility of decision-making processes.

Though AoI has been proven efficient in many freshness-critical applications, it exhibits several critical shortcomings. Specifically, (a) AoI does not provide a direct measure of information value; (b) AoI does not consider the content dynamics of source data and ignores the effect of End-to-End (E2E) information mismatch on the decision-making process.

To address shortcoming (a), a typical approach is to impose a non-linear penalty on AoI [27, 28, 29, 30, 31]. In [27], the authors map the AoI to a non-linear and non-decreasing function $f(\mathrm{AoI}(t))$ to evaluate the degree of “discontent” resulting from stale information. Subsequently, the optimal sampling policy is deduced for an arbitrary non-decreasing penalty function. The authors in [28] introduce two penalty functions, namely the exponential penalty function $a^{\mathrm{AoI}(t)}-1$ and the logarithmic penalty function $\log_{a}\left(\mathrm{AoI}(t+1)\right)$, for evaluating the Cost of Update Delay (CoUD). In [29] and [30], the binary indicator function $\mathbbm{1}_{\left\{\mathrm{AoI}(t)>d\right\}}$ is applied to evaluate whether the most recently received message is up-to-date. Specifically, the penalty assumes a value of $1$ when the instantaneous AoI surpasses a predetermined threshold $d$; otherwise, the penalty reverts to $0$. The freshness of web crawling is evaluated through this AoI-threshold binary indicator function. In [31], an analogous binary indicator approach is implemented in caching systems to evaluate the freshness of information.

The above works tend to tailor a particular non-linear penalty function to evaluate the information value. However, the selection of penalty functions in the above research relies exclusively on empirical configurations, devoid of rigorous derivations. To this end, several information-theoretic techniques strive to explicitly derive the non-linear penalty function in terms of AoI [32, 33, 34, 35]. One such quintessential work is the auto-correlation function $\mathbb{E}\left[X_{t}^{*}X_{t-\mathrm{AoI}(t)}\right]$, which proves to be a non-linear function of AoI when the source is stationary [32]. Another methodology worth noting is the mutual information metric between the present source state $X_{t}$ and the vector consisting of an ensemble of successfully received updates $\mathbf{W}_{t}$ [33, 34]. In the context of a Markovian source, this metric can be reduced to $I(X_{t};X_{t-\mathrm{AoI}(t)})$, which demonstrates a non-linear dependency on AoI under both the Gaussian Markov source and the Binary Markov source [33]. In [34], an analogous approach is utilized to characterize the value of information (VoI) for the Ornstein-Uhlenbeck (OU) process, which likewise demonstrates a non-linear dependency on AoI. In [35], the conditional entropy $H(X_{t}|\mathbf{W}_{t})$ is further investigated to measure the uncertainty of the source for a remote estimator given the history of received updates $\mathbf{W}_{t}$. When applied to a Markov source, this conditional entropy simplifies to $H(X_{t}|X_{t-\mathrm{AoI}(t)})$, thus exemplifying a non-linear penalty associated with AoI.

To address shortcoming (b), substantial research efforts have been invested in the optimization of the Mean Squared Error (MSE), denoted by $(X_{t}-\hat{X}_{t})^{2}$, with the ultimate objective of constructing a remote estimation system for real-time reconstruction [36, 37, 38, 39]. In [36], a metric termed effective age is proposed to minimize the MSE for the remote estimation of a Markov source. In [37] and [38], two Markov sources of interest, the Wiener process and the OU process, are investigated to deduce the MSE-optimal sampling policy. Intriguingly, these policies are found to be threshold-based, activating sampling only when the instantaneous MSE exceeds a predefined threshold and otherwise remaining idle. The authors in [39] explore the trade-off between MSE and quantization over a noisy channel, and derive the MSE-optimal sampling strategy for the OU process.

Complementary to the above MSE-centered research, variants of AoI that address shortcoming (b) have also been conceptualized [40, 41, 42, 43]. In [40], the Age of Changed Information (AoCI) is proposed to address AoI's ignorance of content dynamics. In this regard, unchanged statuses do not necessarily provide new information and thus are not prioritized for transmission. In [41], the authors introduce a context-aware weighting coefficient to propose the Urgency of Information (UoI), a metric capable of measuring the weighted MSE in diverse urgency contexts. In [42], the authors propose a novel age penalty named Age of Synchronization (AoS), a metric quantifying the time elapsed since the most recent end-to-end synchronization. Moreover, considering that an E2E status mismatch may exert a detrimental effect on the overall system's performance over time, the authors of [43] propose a novel metric called Age of Incorrect Information (AoII). This metric quantifies the adverse effect stemming from the duration of the E2E mismatch, revealing that both the degree and duration of E2E semantic mismatches lead to a utility reduction for subsequent decision-making.

The above studies focused on the open-loop generation-to-reception process within a transmitter-receiver pair, neglecting the closed-loop perception-actuation timeliness. A notable development addressing this issue is the extension from Up/Down-Link (UL/DL) AoI to a closed-loop AoI metric, referred to as the Age of Loop (AoL) [44]. Unlike the traditional open-loop AoI, which diminishes upon successful reception of a unidirectional update, the AoL decreases solely when both the UL status and the DL command are successfully received. Another advanced metric in [45], called Age of Actuation (AoA), also encapsulates the actuation timeliness, which proves pertinent when the receiver employs the received update to execute timely actuation.

Figure 1: Interconnections of Age of Information and Beyond in the literature.

Notwithstanding the above advancements, the question of how the E2E mismatch affects the closed-loop utility of decision-making has yet to be addressed. To address this issue, [46, 47, 48, 49] introduce a metric termed Cost of Actuation Error to delve deeper into the cost resulting from erroneous actuation due to imprecise real-time estimations. Specifically, the Cost of Actuation Error is denoted by an asymmetric zero-diagonal matrix $\mathbf{C}$, with each value $C_{X_{t},\hat{X}_{t}}$ representing the instant cost under the E2E mismatch status $(X_{t},\hat{X}_{t})_{X_{t}\neq\hat{X}_{t}}$. In this regard, the authors reveal that the utility of decision-making bears a close relation to the E2E semantic mismatch category, as opposed to the mismatch duration (AoII) or mismatch degree (MSE). For example, an E2E semantic mismatch in which “Fire” is estimated as “No Fire” will result in a high cost, while in the opposite scenario the cost is low. Nonetheless, we notice that i) the method to obtain a Cost of Actuation Error remains unclear, as it implicitly necessitates a pre-established decision-making policy; ii) the Cost of Actuation Error does not consider context-varying factors, which may also affect the decision-making utility; iii) the zero-diagonal property of the matrix implies the supposition that if $X_{t}=\hat{X}_{t}$, then $C_{X_{t},\hat{X}_{t}}=0$, thereby signifying that error-less actuation necessitates no energy expenditure. Fig. 1 provides an overview of the existing metrics.

Against this background, the present authors have recently proposed a new metric referred to as GoT in [50], which, compared to the Cost of Actuation Error, introduces the new dimensions of the context $\Phi_{t}$ and the decision-making policy $\pi_{A}$ to describe the true utility of decision-making. Remarkably, we find that GoT offers a versatile degeneration to established metrics such as AoI, MSE, UoI, AoII, and Cost of Actuation Error. Additionally, it provides a balanced evaluation of the cost trade-off between sampling and decision-making. The contributions of this work can be summarized as follows:

\bullet We focus on the decision utility issue directly by employing the GoT. A controlled Markov source is observed, wherein the transition of the source depends on both the decision-making at the receiver and the contextual situation in which it is situated. In this case, the decision-making affects utility in three aspects: i) the future evolution of the source; ii) the instant cost at the source; iii) the energy and resources consumed by actuation.

\bullet We accomplish the goal-oriented sampler & decision-maker co-design, which, to the best of our knowledge, represents the first work that addresses the trade-off between semantics-aware sampling and goal-oriented decision-making. Specifically, we formulate this problem as a two-agent infinite-horizon Decentralized Partially Observable Markov Decision Process (Dec-POMDP) problem, with one agent embodying the semantics and context-aware sampler, and the other representing the goal-oriented decision-maker. Noting that the optimal solution of even a finite-horizon Dec-POMDP is known to be NEXP-hard [51], we develop the RVI-Brute-Force-Search Algorithm, which seeks to derive optimal deterministic joint policies for both sampling and decision-making. A thorough discussion on the optimality of our algorithm is also presented.

\bullet To further mitigate the “curse of dimensionality” intricately linked with the execution of the optimal RVI-Brute-Force-Search, we introduce a low-complexity yet efficient algorithm to solve the problem. The algorithm is designed by decoupling the problem into two distinct components: a Markov Decision Process (MDP) problem and a Partially Observable Markov Decision Process (POMDP) problem. Following this separation, the algorithm searches for a joint Nash Equilibrium between the sampler and the decision-maker, providing a sub-optimal solution to this goal-oriented communication-decision-making co-design.

II Goal-oriented Tensor

II-A Specific Examples of Goal-Oriented Communications

Consider a time-slotted communication system. Let $X_{t}\in\mathcal{S}$ represent the perceived semantic status at time slot $t$ at the source, and $\hat{X}_{t}\in\mathcal{S}$ denote the estimated semantic status constructed at time slot $t$. Real-time reconstruction-oriented communication is a special type of goal-oriented communication, whose goal is to achieve accurate real-time reconstruction:

\min \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}(X_{t}-\hat{X}_{t})^{2}. \qquad (1)

Although real-time reconstruction is important, it does not represent the ultimate goals of communications, such as implementing accurate actuation (as opposed to merely real-time reconstruction) and minimizing the system’s long-term Cost of Actuation Error, as in [46, 47, 48, 49]:

\min \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}C_{X_{t},\hat{X}_{t}}, \qquad (2)

where $C_{X_{t},\hat{X}_{t}}$ represents the instantaneous cost of the system at time slot $t$. This cost is derived from the erroneous actuation stemming from the status mismatch between transceivers.

Following the avenue of Cost of Actuation Error, we notice that the matrix-based metric could be further augmented to tensors to realize more flexible goal characterizations. For example, drawing parallels with the concept of Urgency of Information [41], we can introduce a context element $\Phi_{t}$ to incorporate context-aware attributes into this metric (note that the GoT could be expanded into higher dimensions by integrating additional components, including actuation policies, task-specific attributes, and other factors). Accordingly, we can define a three-dimensional GoT through a specified mapping: $(X_{t},\Phi_{t},\hat{X}_{t})\in\mathcal{S}\times\mathcal{V}\times\mathcal{S}\overset{\mathcal{L}}{\rightarrow}\mathrm{GoT}(t)\in\mathbb{R}$, which is visualized in Fig. 2. In this regard, the GoT, denoted by $\mathcal{L}(X_{t},\Phi_{t},\hat{X}_{t})$ or $\mathrm{GoT}(t)$, indicates the instantaneous cost of the system at time slot $t$, given the knowledge of $(X_{t},\Phi_{t},\hat{X}_{t})$. Consequently, the overarching goal of this system can be succinctly expressed as:

\min \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathrm{GoT}(t). \qquad (3)

II-B Degeneration to Existing Metrics

In this subsection, we demonstrate, through visualized examples and mathematical formulations, that a three-dimensional GoT can degenerate to existing metrics. Fig. 2 showcases a variety of instances of the GoT metric.

\bullet Degeneration to AoI: AoI is generally defined as $\mathrm{AoI}(t)\triangleq t-\max\left\{G_{i}:D_{i}<t\right\}$, where $G_{i}$ is the generation time stamp of the $i$-th status update and $D_{i}$ represents the corresponding delivery time slot. Since AoI is known to be semantics-agnostic [52, 46], the values in the tensor only depend on the freshness context $\Phi_{t}\triangleq\mathrm{AoI}(t)$. In this case, each tensor value given a determined context status $\Phi_{t}$ is a constant, and the GoT reduces to

\mathrm{GoT}(t)=\mathcal{L}(X_{t},\Phi_{t},\hat{X}_{t})\overset{(a)}{=}\Phi_{t}=\mathrm{AoI}(t), \qquad (4)

where (a) indicates that AoI is semantics-agnostic. In this case, the context is exactly the AoI. The process of reducing GoT to AoI is visually depicted in Fig. 2(a).

Figure 2: Visualization of the GoT.

\bullet Degeneration to AoII: AoII is defined as $\mathrm{AoII}(t)\triangleq f(\mathrm{AoS}(t))\cdot g(X_{t},\hat{X}_{t})$, where $\mathrm{AoS}(t)\triangleq t-\max\left\{\tau:\tau\leq t,X_{\tau}=\hat{X}_{\tau}\right\}$. AoII embraces the semantics-aware characteristics and is hence regarded as an enabler of semantic communications [53]. The inherent multiplicative characteristic of AoII guarantees the existence of a base layer of the tensor representation, from which other layers are derived by multiplying this base layer, as depicted in Fig. 2(b). Letting $\Phi_{t}\triangleq f(\mathrm{AoS}(t))$, the GoT reduces to

\mathrm{GoT}(t)=\mathcal{L}\left(X_{t},\Phi_{t},\hat{X}_{t}\right)\overset{(a)}{=}\Phi_{t}\cdot g\left(X_{t},\hat{X}_{t}\right)=f(\mathrm{AoS}(t))\cdot g(X_{t},\hat{X}_{t})=\mathrm{AoII}(t), \qquad (5)

where $g(X_{t},\hat{X}_{t})$, typically characterized by $\mathbbm{1}_{\left\{X_{t}\neq\hat{X}_{t}\right\}}$, represents the base layer in the tensor, and $(a)$ indicates the inherent multiplicative characteristic of AoII. The visual representation of the GoT reduction to AoII is depicted in Fig. 2(b).

\bullet Degeneration to MSE: MSE is defined as $\mathrm{MSE}(t)\triangleq\left(X_{t}-\hat{X}_{t}\right)^{2}$. MSE is intuitively context-agnostic. In the scenario where $g(X_{t},\hat{X}_{t})=(X_{t}-\hat{X}_{t})^{2}$, the GoT collapses to the MSE metric:

\mathrm{GoT}(t)=\mathcal{L}\left(X_{t},\Phi_{t},\hat{X}_{t}\right)\overset{(a)}{=}g\left(X_{t},\hat{X}_{t}\right)=\left(X_{t}-\hat{X}_{t}\right)^{2}=\mathrm{MSE}(t), \qquad (6)

where $(a)$ holds due to the context-agnostic nature of MSE. The visualization of the GoT reduction to MSE is shown in Fig. 2(c).

\bullet Degeneration to UoI: UoI is defined by $\mathrm{UoI}(t)\triangleq\Phi_{t}\cdot\left(X_{t}-\hat{X}_{t}\right)^{2}$, where the context-aware weighting coefficient $\Phi_{t}$ is further introduced [41]. When $g(X_{t},\hat{X}_{t})=(X_{t}-\hat{X}_{t})^{2}$, the GoT can be transformed into the UoI by

\mathrm{GoT}(t)=\mathcal{L}\left(X_{t},\Phi_{t},\hat{X}_{t}\right)\overset{(a)}{=}\Phi_{t}\cdot g\left(X_{t},\hat{X}_{t}\right)=\Phi_{t}\cdot\left(X_{t}-\hat{X}_{t}\right)^{2}=\mathrm{UoI}(t), \qquad (7)

where $(a)$ indicates the inherent multiplicative characteristic of UoI. The visualization of the GoT reduction to UoI is shown in Fig. 2(d).

\bullet Degeneration to Cost of Actuation Error: The Cost of Actuation Error is defined by $C_{X_{t},\hat{X}_{t}}$, which indicates the instantaneous system cost when the source status $X_{t}$ and the estimated status $\hat{X}_{t}$ mismatch [48]. Letting $g(X_{t},\hat{X}_{t})=C_{X_{t},\hat{X}_{t}}$, the GoT collapses to the Cost of Actuation Error:

\mathrm{GoT}(t)=\mathcal{L}\left(X_{t},\Phi_{t},\hat{X}_{t}\right)\overset{(a)}{=}g\left(X_{t},\hat{X}_{t}\right)=C_{X_{t},\hat{X}_{t}}, \qquad (8)

where $(a)$ holds due to the context-agnostic nature of the Cost of Actuation Error. The visualization of the GoT reduction to Cost of Actuation Error is shown in Fig. 2(e).
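To make the degenerations above concrete, the short Python sketch below builds a generic GoT lookup table and shows how particular choices of the context layer and the base layer recover AoII-, MSE-, and UoI-style costs. The alphabets and penalty choices are hypothetical illustrations, not values taken from this paper:

import numpy as np

# A minimal sketch: a GoT is a mapping (X_t, Phi_t, X_hat_t) -> instantaneous cost,
# realized here as a |S| x |V| x |S| array of per-slot costs.

S = np.array([0.0, 1.0, 2.0])          # hypothetical semantic alphabet (numeric for MSE/UoI)
V = np.array([1.0, 2.0, 3.0])          # hypothetical context alphabet (e.g., f(AoS) or weights)

def build_got(context_layer, base_layer):
    """GoT[i, k, j] = context_layer(V[k]) * base_layer(S[i], S[j])  (multiplicative structure)."""
    got = np.empty((len(S), len(V), len(S)))
    for i, x in enumerate(S):
        for k, phi in enumerate(V):
            for j, x_hat in enumerate(S):
                got[i, k, j] = context_layer(phi) * base_layer(x, x_hat)
    return got

# Degeneration to AoII: Phi_t = f(AoS(t)) and g = indicator of mismatch.
got_aoii = build_got(context_layer=lambda phi: phi,
                     base_layer=lambda x, x_hat: float(x != x_hat))

# Degeneration to MSE: context-agnostic (layer constantly 1) and g = squared error.
got_mse = build_got(context_layer=lambda phi: 1.0,
                    base_layer=lambda x, x_hat: (x - x_hat) ** 2)

# Degeneration to UoI: context-aware weight times squared error.
got_uoi = build_got(context_layer=lambda phi: phi,
                    base_layer=lambda x, x_hat: (x - x_hat) ** 2)

# Example lookup: cost when X_t = 2, Phi_t = 3, X_hat_t = 0.
print(got_aoii[2, 2, 0], got_mse[2, 2, 0], got_uoi[2, 2, 0])  # 3.0, 4.0, 12.0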

II-C Goal Characterization Through GoT

As shown in Fig. 2(f), a more general GoT exhibits neither symmetry, a zero-diagonal property, nor a base layer, offering considerable versatility contingent upon the specific goal. Here we propose a four-step method for constructing a specific GoT:

\bullet Step 1: Clarify the scenario and the communication goal. For instance, in a wireless accident monitoring and rescue system, the goal is to minimize the long-term cumulative average cost associated with accidents and their corresponding rescue interventions.

\bullet Step 2: Define the sets of semantic status, $\mathcal{S}$, and contextual status, $\mathcal{V}$. These sets can be modeled as collections of discrete components. For instance, the set $\mathcal{S}$ might encompass the gravity of industrial mishaps in intelligent factories, whereas the set $\mathcal{V}$ could encompass the meteorological circumstances, which potentially influence the source dynamics and the costs.

\bullet Step 3: The GoT can be decoupled into three factors (the definitions of costs are diverse, covering aspects such as financial loss, resource usage, or a normalized metric derived through scaling, depending on the specific objective they are designed to address):

i) The status inherent cost $C_{1}(X_{t},\Phi_{t})$, which quantifies the cost associated with different status pairs $(X_{t},\Phi_{t})$ in the absence of external influences;

ii) The actuation gain $C_{2}(\pi_{A}(\hat{X}_{t}))$, where $\pi_{A}$ is a deterministic decision policy contingent upon $\hat{X}_{t}$. This term quantifies the positive utility resulting from the actuation $\pi_{A}(\hat{X}_{t})$;

iii) The actuation resource expenditure $C_{3}(\pi_{A}(\hat{X}_{t}))$, which reflects the resources consumed by a particular actuation $\pi_{A}(\hat{X}_{t})$.

\bullet Step 4: Construct the GoT. Given a specific triple $(X_{t},\hat{X}_{t},\Phi_{t})$ and a certain decision strategy $\pi_{A}$, the GoT is calculated by

\mathrm{GoT}^{\pi_{A}}(t)=\mathcal{L}(X_{t},\Phi_{t},\hat{X}_{t},\pi_{A})=\left[C_{1}(X_{t},\Phi_{t})-aC_{2}(\pi_{A}(\hat{X}_{t}))\right]^{+}+bC_{3}(\pi_{A}(\hat{X}_{t})). \qquad (9)

The ramp function $\left[\cdot\right]^{+}$ ensures that any actuation $\pi_{A}(\hat{X}_{t})$ can reduce the cost at most to $0$. A visualization of a specific GoT construction is shown in Fig. 3, where the GoT is obtained through the following definition:

\begin{array}{c}C_{1}(X_{t},\Phi_{t})=\left(\begin{array}{c|ccc}{}&0&1&2\\ \hline 0&0&1&3\\ 1&0&2&5\end{array}\right),\quad\pi_{A}(\hat{X}_{t})=\left(\begin{array}{ccc}0&1&2\\ \hline a_{0}&a_{1}&a_{2}\end{array}\right),\\ C_{2}(a_{A}(t))=\left(\begin{array}{ccc}a_{0}&a_{1}&a_{2}\\ \hline 0&2&4\end{array}\right),\quad C_{3}(a_{A}(t))=\left(\begin{array}{ccc}a_{0}&a_{1}&a_{2}\\ \hline 0&1&2\end{array}\right).\end{array} \qquad (10)
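As a worked illustration of Step 4, the following sketch enumerates the GoT tensor of (9) using the example costs in (10). The weights $a$ and $b$ in (9) are not fixed by the example, so a = b = 1 is assumed purely for illustration:

import numpy as np

# A sketch of Step 4: building the GoT tensor of (9) from the example costs in (10).
# The scaling weights a and b in (9) are not specified here; we assume a = b = 1.

C1 = np.array([[0, 1, 3],      # row: context Phi_t = 0, columns: X_t = 0, 1, 2
               [0, 2, 5]])     # row: context Phi_t = 1
pi_A = {0: 'a0', 1: 'a1', 2: 'a2'}          # deterministic decision policy pi_A(X_hat_t)
C2 = {'a0': 0, 'a1': 2, 'a2': 4}            # actuation gain
C3 = {'a0': 0, 'a1': 1, 'a2': 2}            # actuation resource expenditure
a, b = 1, 1                                  # assumed weights

def got(x, phi, x_hat):
    """GoT^{pi_A}(t) = [C1(X_t, Phi_t) - a*C2(pi_A(X_hat_t))]^+ + b*C3(pi_A(X_hat_t))."""
    action = pi_A[x_hat]
    return max(C1[phi, x] - a * C2[action], 0) + b * C3[action]

# Enumerate the full |S| x |V| x |S| tensor.
tensor = np.array([[[got(x, phi, x_hat) for x_hat in range(3)]
                    for phi in range(2)]
                   for x in range(3)])
print(tensor[2, 1, 0])  # X_t = 2, Phi_t = 1, X_hat = 0: [5 - 0]^+ + 0 = 5
print(tensor[2, 1, 2])  # X_t = 2, Phi_t = 1, X_hat = 2: [5 - 4]^+ + 2 = 3
print(tensor[0, 0, 2])  # X_t = 0, Phi_t = 0, X_hat = 2: [0 - 4]^+ + 2 = 2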

III System Model

Figure 3: A visualized example for characterizing the GoT through (9), where $[\cdot]^{+}$ represents the ramp function.
Figure 4: Illustration of our considered system, where the transmitted semantic status arrives at a receiver for decision-making to achieve a certain goal.

We consider a time-slotted perception-actuation loop where both the perceived semantics $X_{t}\in\mathcal{S}=\left\{s_{1},\cdots,s_{|\mathcal{S}|}\right\}$ and the context $\Phi_{t}\in\mathcal{V}=\left\{v_{1},\cdots,v_{|\mathcal{V}|}\right\}$ are input into a semantic sampler, tasked with determining the significance of the present status $X_{t}$ and subsequently deciding whether it warrants transmission via an unreliable channel. The semantics and context are extracted and assumed to perfectly describe the status of the observed process. The binary indicator $a_{S}(t)=\pi_{S}(X_{t},\Phi_{t},\hat{X}_{t})\in\left\{0,1\right\}$ signifies the sampling and transmission action at time slot $t$, with the value $1$ representing the execution of sampling and transmission, and the value $0$ indicating the idleness of the sampler. Here $\pi_{S}$ represents the sampling policy. We consider a perfect and delay-free feedback channel [46, 47, 48, 49], with ACK representing a successful transmission and NACK representing a failed one. The decision-maker at the receiver makes decisions $a_{A}(t)\in\mathcal{A}_{A}=\left\{a_{1},\cdots,a_{|\mathcal{A}_{A}|}\right\}$ based on the estimate $\hat{X}_{t}$ (we consider a general abstract decision-making set $\mathcal{A}_{A}$ that exhibits adaptability to diverse applications; notably, this decision-making set can be customized and tailored to suit specific requirements), which will ultimately affect the utility of the system. An illustration of our considered model is shown in Fig. 4.

III-A Semantics and Context Dynamics

Thus far, a plethora of studies have delved into the analysis of various discrete Markov sources, encompassing the Birth-Death Markov Processes elucidated in [49] and the binary Markov sources investigated in [54], among others. In real-world situations, the context and the actuation also affect the source's evolution. Consequently, we consider a context-dependent controlled Discrete Markov source:

\Pr\left(X_{t+1}=s_{u}\left|X_{t}=s_{i},a_{A}(t)=a_{m},\Phi_{t}=v_{k}\right.\right)=p_{i,u}^{(k,m)}, \qquad (11)

where the dynamics of the source depend on both the decision-making $a_{A}(t)$ and the context $\Phi_{t}$. Furthermore, we take into account the variations in context $\Phi_{t}$, characterized by:

\Pr\left(\Phi_{t+1}=v_{r}\left|\Phi_{t}=v_{k}\right.\right)=p_{k,r}. \qquad (12)

Note that the semantic status $X_{t}$ and the context status $\Phi_{t}$ could be tailored according to the specific application scenario. In general, these two processes are independent of each other.
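The sketch below simulates one step of the dynamics in (11) and (12); the transition probabilities are random placeholders, used only to make the example runnable:

import numpy as np

# A sketch of the dynamics (11)-(12): a controlled, context-dependent Markov source
# and an independent Markov context. All transition probabilities below are
# hypothetical placeholders, chosen only to make the sketch runnable.

rng = np.random.default_rng(0)

n_S, n_V, n_A = 2, 2, 2                       # |S| semantic states, |V| contexts, |A_A| actions
# p_source[k, m, i, u] = Pr(X_{t+1}=s_u | X_t=s_i, a_A(t)=a_m, Phi_t=v_k)   (eq. (11))
p_source = rng.dirichlet(np.ones(n_S), size=(n_V, n_A, n_S))
# p_context[k, r] = Pr(Phi_{t+1}=v_r | Phi_t=v_k)                           (eq. (12))
p_context = rng.dirichlet(np.ones(n_V), size=n_V)

def step(x, phi, a_A):
    """Draw (X_{t+1}, Phi_{t+1}) given the current semantic state, context, and decision."""
    x_next = rng.choice(n_S, p=p_source[phi, a_A, x])
    phi_next = rng.choice(n_V, p=p_context[phi])
    return x_next, phi_next

x, phi = 0, 0
for t in range(5):
    x, phi = step(x, phi, a_A=0)   # keep the decision fixed just for illustration
    print(t, x, phi)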

III-B Unreliable Channel and Estimate Transition

Figure 5: An illustration of AoI, AoCI, AoII, and GoT in a time-slotted status update system. Here, the value of GoT is taken from the tensor shown on the right-hand side of Fig. 3.

We assume that the channel realizations are independent and identically distributed (i.i.d.) across time slots, following a Bernoulli distribution. Particularly, the channel realization $h_{t}$ assumes a value of $1$ in the event of successful transmission, and $0$ otherwise. Accordingly, we define the probability of successful transmission as $\Pr\left(h_{t}=1\right)=p_{S}$ and the failure probability as $\Pr\left(h_{t}=0\right)=1-p_{S}$. To characterize the dynamic process of $\hat{X}_{t}$, we consider two cases as described below:

\bullet $a_{S}(t)=0$. In this case, the sampler and transmitter remain idle, meaning that no new knowledge is given to the receiver, i.e., $\hat{X}_{t+1}=\hat{X}_{t}$. As such, we have:

\Pr\left(\hat{X}_{t+1}=x\left|\hat{X}_{t}=s_{j},a_{S}(t)=0\right.\right)=\mathbbm{1}_{\left\{x=s_{j}\right\}}. \qquad (13)

\bullet $a_{S}(t)=1$. In this case, the sampler and transmitter transmit the current semantic status $X_{t}$ through an unreliable channel. As the channel is unreliable, we differentiate between two distinct situations: $h_{t}=1$ and $h_{t}=0$:

(a) $h_{t}=1$. In this case, the transmission is successful. As such, the estimate at the receiver becomes $\hat{X}_{t+1}=X_{t}$, and the transition probability is

\Pr\left(\hat{X}_{t+1}=x\left|\hat{X}_{t}=s_{j},X_{t}=s_{i},a_{S}(t)=1,h_{t}=1\right.\right)=\mathbbm{1}_{\left\{x=s_{i}\right\}}. \qquad (14)

(b) $h_{t}=0$. In this case, the transmission is not successfully decoded by the receiver. As such, the estimate at the receiver remains unchanged, i.e., $\hat{X}_{t+1}=\hat{X}_{t}$. In this way, the transition probability is

\Pr\left(\hat{X}_{t+1}=x\left|\hat{X}_{t}=s_{j},X_{t}=s_{i},a_{S}(t)=1,h_{t}=0\right.\right)=\mathbbm{1}_{\left\{x=s_{j}\right\}}. \qquad (15)

As the channel realization $h_{t}$ is independent of the processes $X_{t}$, $\hat{X}_{t}$, and $a_{S}(t)$, we have that

\Pr\left(\hat{X}_{t+1}=x\left|\hat{X}_{t}=s_{j},X_{t}=s_{i},a_{S}(t)=1\right.\right)
=\sum_{h_{t}}p\left(h_{t}\right)\Pr\left(\hat{X}_{t+1}=x\left|\hat{X}_{t}=s_{j},X_{t}=s_{i},a_{S}(t)=1,h_{t}\right.\right)
=p_{S}\cdot\mathbbm{1}_{\left\{x=s_{i}\right\}}+(1-p_{S})\cdot\mathbbm{1}_{\left\{x=s_{j}\right\}}. \qquad (16)

Combining (13) with (16) yields the dynamics of the estimate.
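As a quick illustration of (13)-(16), the following sketch updates the receiver-side estimate for both sampling actions, assuming a hypothetical success probability p_S; averaging over many channel draws recovers the mixture in (16):

import numpy as np

# Sketch of the estimate dynamics (13)-(16): the estimate changes to X_t only if the
# sampler transmits (a_S = 1) and the Bernoulli(p_S) channel realization succeeds.
# p_S and the alphabet below are illustrative assumptions.

rng = np.random.default_rng(1)
p_S = 0.8

def estimate_update(x_hat, x, a_S):
    """Return X_hat_{t+1} given X_hat_t, X_t, and the sampling action a_S(t)."""
    if a_S == 0:
        return x_hat                      # (13): no update, estimate is frozen
    h = rng.random() < p_S                # channel realization h_t ~ Bernoulli(p_S)
    return x if h else x_hat              # (14) on success, (15) on failure

# Averaging over many channel realizations recovers the mixture in (16).
samples = [estimate_update(x_hat='s1', x='s2', a_S=1) for _ in range(10000)]
print(sum(s == 's2' for s in samples) / len(samples))  # approximately p_S = 0.8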

III-C Goal-oriented Decision-making and Actuation

We note that previous works primarily focus on minimizing the open-loop freshness-related or error-related penalty for a transmitter-receiver system. Nevertheless, irrespective of fresh delivery or accurate end-to-end timely reconstruction, the ultimate goal of such optimization efforts is to ensure precise and effective decision-making. To this end, we broaden the open-loop transmitter-receiver information flow to a perception-actuation closed-loop utility flow by incorporating the decision-making and actuation processes. As a result, decision-making and actuation enable the conversion of status updates into ultimate effectiveness. Here the decision-making at time slot $t$ follows $a_{A}(t)=\pi_{A}(\hat{X}_{t})$, with $\pi_{A}$ representing the deterministic decision-making policy.

IV Problem Formulation and Solution

Traditionally, the development of sampling strategies has been designed separately from the decision-making process. An archetypal illustration of this two-stage methodology involves first determining the optimal sampling policy based on AoI, MSE, or their derivatives, such as AoII, and subsequently accomplishing goal-oriented decision-making. This two-stage separate design arises from an inherent limitation of existing metrics: they fail to capture the closed-loop decision utility. Nevertheless, the GoT metric empowers us to undertake a co-design of sampling and decision-making.

We adopt team decision theory, wherein two agents, one embodying the sampler and the other the decision-maker, collaborate to achieve a shared goal. We aim at determining a joint deterministic policy $\boldsymbol{\pi}_{C}=(\pi_{S},\pi_{A})$ that minimizes the long-term average cost of the system. It is considered that the sampling and transmission of an update also consume energy, incurring a cost $C_{S}$. In this case, the instant cost of the system can be written as $\mathrm{GoT}^{\pi_{A}}(t)+C_{S}\cdot a_{S}(t)$, and the problem is characterized as:

\mathcal{P}1:\quad\mathop{\min}\limits_{\boldsymbol{\pi}_{C}\in\Upsilon}\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{\boldsymbol{\pi}_{C}}\left(\sum\limits_{t=0}^{T-1}\mathrm{GoT}^{\pi_{A}}(t)+C_{S}\cdot a_{S}(t)\right), \qquad (17)

where $\boldsymbol{\pi}_{C}=(\pi_{S},\pi_{A})$ denotes the joint sampling and decision-making policy, comprising $\pi_{S}=(a_{S}(0),a_{S}(1),\cdots)$ and $\pi_{A}=(a_{A}(0),a_{A}(1),\cdots)$, which correspond to the sampling action sequence and the actuation sequence, respectively. Note that $\mathrm{GoT}^{\pi_{A}}(t)$ is characterized by (9).

IV-A Dec-POMDP Formulation

The problem in (17) aims to find the optimal decentralized policy $\boldsymbol{\pi}_{C}$ so that the long-term average cost of the system is minimized. To solve the problem $\mathcal{P}1$, we formulate a Dec-POMDP problem, a framework initially introduced in [51] to solve cooperative sequential decision issues for distributed multi-agent systems. Within a Dec-POMDP framework, a team of agents cooperates to achieve a shared goal, relying solely on their localized knowledge. A typical Dec-POMDP is denoted by a tuple $\mathscr{M}_{\mathrm{Dec-POMDP}}\triangleq\left\langle n,\mathcal{I},\mathcal{A},\mathcal{T},\Omega,\mathcal{O},\mathcal{R}\right\rangle$:

\bullet $n$ denotes the number of agents. In this instance, we have $n=2$, signifying the presence of two agents within this context: one agent, $\mathcal{A}gent_{S}$, embodies the semantics-context-aware sampler and transmitter, while the other, denoted by $\mathcal{A}gent_{A}$, represents the $\hat{X}_{t}$-dependent decision-maker.

\bullet $\mathcal{I}$ is the finite set of global system statuses, characterized by $(X_{t},\hat{X}_{t},\Phi_{t})\in\mathcal{S}\times\mathcal{S}\times\mathcal{V}$. For the sake of brevity, we henceforth denote $\mathbf{W}_{t}=(X_{t},\hat{X}_{t},\Phi_{t})$ in the sequel.

\bullet $\mathcal{T}$ is the transition function defined by

\mathcal{T}\left(\mathbf{w},\mathbf{a},\mathbf{w}^{\prime}\right)\triangleq\Pr(\mathbf{W}_{t+1}=\mathbf{w}^{\prime}|\mathbf{W}_{t}=\mathbf{w},\mathbf{a}_{t}=\mathbf{a}), \qquad (18)

i.e., the transition probability from global status $\mathbf{W}_{t}=\mathbf{w}$ to status $\mathbf{W}_{t+1}=\mathbf{w}^{\prime}$ after the agents in the system take a joint action $\mathbf{a}_{t}=\mathbf{a}=(a_{S}(t),a_{A}(t))$. For the sake of concise notation, we let $p(\mathbf{w}^{\prime}|\mathbf{w},\mathbf{a})$ symbolize $\mathcal{T}\left(\mathbf{w},\mathbf{a},\mathbf{w}^{\prime}\right)$ in the subsequent discourse. Then, by taking into account the conditional independence among $X_{t+1}$, $\Phi_{t+1}$, and $\hat{X}_{t+1}$, given $(X_{t},\Phi_{t},\hat{X}_{t})$ and $\mathbf{a}(t)$, the transition functions can be calculated as in Lemma 1.

Lemma 1.

The transition functions of the Dec-POMDP:

p\left((s_{u},x,v_{r})\left|(s_{i},s_{j},v_{k}),(1,a_{m})\right.\right)=p_{i,u}^{(k,m)}\cdot p_{k,r}\cdot\left(p_{S}\cdot\mathbbm{1}_{\left\{x=s_{i}\right\}}+(1-p_{S})\cdot\mathbbm{1}_{\left\{x=s_{j}\right\}}\right), \qquad (19)
p\left((s_{u},x,v_{r})\left|(s_{i},s_{j},v_{k}),(0,a_{m})\right.\right)=p_{i,u}^{(k,m)}\cdot p_{k,r}\cdot\mathbbm{1}_{\left\{x=s_{j}\right\}}, \qquad (20)

for any $x\in\mathcal{S}$ and indexes $i$, $j$, $u\in\left\{1,2,\cdots,|\mathcal{S}|\right\}$, $k$, $r\in\left\{1,2,\cdots,|\mathcal{V}|\right\}$, and $m\in\left\{1,2,\cdots,|\mathcal{A}_{A}|\right\}$.

Proof.

The transition function can be derived by incorporating the dynamics in equations (11), (12), (13), and (16). A more comprehensive proof is presented in Appendix A. ∎

\bullet $\mathcal{A}=\mathcal{A}_{S}\times\mathcal{A}_{A}$, with $\mathcal{A}_{S}\triangleq\left\{0,1\right\}$ representing the set of binary sampling actions executed by the sampler, and $\mathcal{A}_{A}\triangleq\left\{a_{0},\cdots,a_{M-1}\right\}$ representing the set of decision actions undertaken by the actuator.

\bullet $\Omega=\Omega_{S}\times\Omega_{A}$ constitutes a finite set of joint observations. Generally, the observation made by a single agent regarding the system status is partially observable. $\Omega_{S}$ signifies the sampler's observation domain. In this instance, the sampler $\mathcal{A}gent_{S}$ is entirely observable, with $\Omega_{S}$ encompassing the comprehensive system state $o_{S}^{(t)}=\mathbf{W}_{t}$. $\Omega_{A}$ signifies the actuator's observation domain. In this case, the actuator (or decision-maker) $\mathcal{A}gent_{A}$ is partially observable, with $\Omega_{A}$ comprising $o_{A}^{(t)}=\hat{X}_{t}$. The joint observation at time instant $t$ is denoted by $\mathbf{o}_{t}=(o_{S}^{(t)},o_{A}^{(t)})$.

\bullet $\mathcal{O}=\mathcal{O}_{S}\times\mathcal{O}_{A}$ represents the observation function, where $\mathcal{O}_{S}$ and $\mathcal{O}_{A}$ denote the observation functions of the sampler $\mathcal{A}gent_{S}$ and the actuator $\mathcal{A}gent_{A}$, respectively, defined as:

\mathcal{O}(\mathbf{w},\mathbf{o})\triangleq\Pr(\mathbf{o}_{t}=\mathbf{o}|\mathbf{W}_{t}=\mathbf{w}),
\mathcal{O}_{S}(\mathbf{w},o_{S})\triangleq\Pr(o_{S}^{(t)}=o_{S}|\mathbf{W}_{t}=\mathbf{w}),
\mathcal{O}_{A}(\mathbf{w},o_{A})\triangleq\Pr(o_{A}^{(t)}=o_{A}|\mathbf{W}_{t}=\mathbf{w}). \qquad (21)

The observation function of an agent $\mathcal{A}gent_{i}$ signifies the conditional probability of agent $\mathcal{A}gent_{i}$ perceiving $o_{i}$, contingent upon the prevailing global system state being $\mathbf{W}_{t}=\mathbf{w}$. For the sake of brevity, we henceforth let $p_{A}(o_{A}|\mathbf{w})$ represent $\mathcal{O}_{A}(\mathbf{w},o_{A})$ and $p_{S}(o_{S}|\mathbf{w})$ represent $\mathcal{O}_{S}(\mathbf{w},o_{S})$ in the subsequent discourse. In our considered model, the observation functions are deterministic, as characterized by Lemma 2.

Lemma 2.

The observation functions of the Dec-POMDP:

p_{S}\left((s_{u},s_{r},v_{q})|(s_{i},s_{j},v_{k})\right)=\mathbbm{1}_{\left\{(s_{u},s_{r},v_{q})=(s_{i},s_{j},v_{k})\right\}},\quad p_{A}\left(s_{z}|(s_{i},s_{j},v_{k})\right)=\mathbbm{1}_{\left\{s_{z}=s_{j}\right\}}, \qquad (22)

for indexes $z$, $i$, $j$, $u$, $r\in\left\{1,2,\cdots,|\mathcal{S}|\right\}$, and $k$, $q\in\left\{1,2,\cdots,|\mathcal{V}|\right\}$.

\bullet $\mathcal{R}$ is the reward function, characterized as a mapping $\mathcal{I}\times\mathcal{A}\rightarrow\mathbb{R}$. In the long-term average reward maximizing setup, resolving a Dec-POMDP is equivalent to addressing the problem $\mathop{\min}\limits_{\boldsymbol{\pi}_{C}\in\Upsilon}\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{\boldsymbol{\pi}_{C}}\left(-\sum_{t=0}^{T-1}r(t)\right)$. Subsequently, to establish congruence with the problem in (17), the reward function is defined as:

r(t)=\mathcal{R}^{\pi_{A}}(\mathbf{w},a_{S})=-\mathrm{GoT}^{\pi_{A}}(t)-C_{S}\cdot a_{S}(t). \qquad (23)
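To see how Lemma 1 translates into an implementable object, the sketch below assembles the full Dec-POMDP transition function p(w'|w, a) from the source dynamics (11), the context dynamics (12), and the estimate dynamics (13)-(16); all dimensions and probabilities are illustrative assumptions:

import numpy as np

# A sketch of Lemma 1: assembling the Dec-POMDP transition function
# p(w' | w, a) for w = (X, X_hat, Phi) and a = (a_S, a_A). Dimensions and
# probabilities are illustrative placeholders, not values from this paper.

rng = np.random.default_rng(2)
n_S, n_V, n_A = 2, 2, 2
p_S = 0.8
p_source = rng.dirichlet(np.ones(n_S), size=(n_V, n_A, n_S))   # [k, m, i, u], eq. (11)
p_context = rng.dirichlet(np.ones(n_V), size=n_V)               # [k, r], eq. (12)

# T[i, j, k, aS, m, u, x, r] = p((s_u, x, v_r) | (s_i, s_j, v_k), (aS, a_m))
T = np.zeros((n_S, n_S, n_V, 2, n_A, n_S, n_S, n_V))
for i in range(n_S):
    for j in range(n_S):
        for k in range(n_V):
            for m in range(n_A):
                for u in range(n_S):
                    for x in range(n_S):
                        for r in range(n_V):
                            base = p_source[k, m, i, u] * p_context[k, r]
                            # (20): idle sampler, the estimate stays at s_j
                            T[i, j, k, 0, m, u, x, r] = base * (x == j)
                            # (19): transmission, the estimate moves to s_i w.p. p_S
                            T[i, j, k, 1, m, u, x, r] = base * (p_S * (x == i) + (1 - p_S) * (x == j))

# Each conditional distribution over (u, x, r) sums to one.
print(np.allclose(T.sum(axis=(5, 6, 7)), 1.0))  # True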

IV-B Solutions to the Infinite-Horizon Dec-POMDP

In general, solving a Dec-POMDP is known to be NEXP-complete for the finite-horizon setup [51], signifying that it necessitates formulating a conjecture about the solution non-deterministically, while each validation of a conjecture demands exponential time. For the infinite-horizon setup, finding an optimal policy for a Dec-POMDP is known to be undecidable. Nevertheless, within our considered model, both the sampling and decision-making processes are deterministic, given as $a_{S}(t)=\pi_{S}(\mathbf{w})$ and $a_{A}(t)=\pi_{A}(o_{A})$. In such a circumstance, it is feasible to determine a joint optimal deterministic policy via Brute-Force-Search across the decision-making policy space.

IV-B1 Optimal Solution

The idea is based on the finding that, given a deterministic decision-making policy $\pi_{A}$, the sampling problem can be formulated as a standard fully observed MDP problem denoted by $\mathscr{M}^{\pi_{A}}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{T}^{\pi_{A}},\mathcal{A}_{S},\mathcal{R}\rangle$.

Definition 1.

Given a deterministic decision-making policy $\pi_{A}$, the optimal sampling problem can be formulated as a typical fully observed MDP problem $\mathscr{M}^{\pi_{A}}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{S},\mathcal{T}_{\mathrm{MDP}}^{\pi_{A}},\mathcal{R}\rangle$, where the elements are given as follows:

  • \mathcal{I}: the same as the pre-defined Dec-POMDP tuple.

  • $\mathcal{A}_{S}=\left\{0,1\right\}$: the sampling and transmission action set.

  • $\mathcal{T}^{\pi_{A}}$: the transition function given a deterministic decision-making policy $\pi_{A}$, which is

    \mathcal{T}^{\pi_{A}}(\mathbf{w},a_{S},\mathbf{w}^{\prime})=p^{\pi_{A}}\left(\mathbf{w}^{\prime}|\mathbf{w},a_{S}\right)=\sum_{o_{A}\in\mathcal{O}_{A}}p\left(\mathbf{w}^{\prime}|\mathbf{w},(a_{S},\pi_{A}(o_{A}))\right)p_{A}(o_{A}|\mathbf{w}), \qquad (24)

    where $p\left(\mathbf{w}^{\prime}|\mathbf{w},(a_{S},\pi_{A}(o_{A}))\right)$ can be obtained by Lemma 1 and $p_{A}(o_{A}|\mathbf{w})$ can be obtained by Lemma 2.

  • \mathcal{R}: the same as the pre-defined Dec-POMDP tuple.

We now proceed to solve the MDP problem $\mathscr{M}^{\pi_{A}}_{\mathrm{MDP}}$, which is characterized by the tuple $\langle\mathcal{I},\mathcal{T}^{\pi_{A}},\mathcal{A}_{S},\mathcal{R}\rangle$. In order to deduce the optimal sampling policy under a deterministic decision-making policy $\pi_{A}$, it is imperative to resolve the Bellman equations [55]:

\theta^{*}_{\pi_{A}}+V_{\pi_{A}}(\mathbf{w})=\mathop{\max}\limits_{a_{S}\in\mathcal{A}_{S}}\left\{\mathcal{R}^{\pi_{A}}(\mathbf{w},a_{S})+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p(\mathbf{w}^{\prime}|\mathbf{w},a_{S})V_{\pi_{A}}(\mathbf{w}^{\prime})\right\}, \qquad (25)

where $V_{\pi_{A}}(\mathbf{w})$ is the value function and $\theta^{*}_{\pi_{A}}$ is the optimal long-term average reward given the decision-making policy $\pi_{A}$. We apply the relative value iteration (RVI) algorithm to solve this problem. The details are shown in Algorithm 1:

Input: The MDP tuple $\langle\mathcal{I},\mathcal{A}_{S},\mathcal{T}^{\pi_{A}},\mathcal{R}\rangle$, $\epsilon$, $\pi_{A}$;
1 Initialization: $\forall\mathbf{w}\in\mathcal{I}$, $\tilde{V}^{0}_{\pi_{A}}(\mathbf{w})=0$, $\tilde{V}^{-1}_{\pi_{A}}(\mathbf{w})=\infty$, $k=0$;
2 Choose $\mathbf{w}^{ref}$ arbitrarily;
3 while $||\tilde{V}_{\pi_{A}}^{k}(\mathbf{w})-\tilde{V}_{\pi_{A}}^{k-1}(\mathbf{w})||\geq\epsilon$ do
4     $k=k+1$;
5     for $\mathbf{w}\in\mathcal{I}-\mathbf{w}^{ref}$ do
6         $\tilde{V}_{\pi_{A}}^{k}(\mathbf{w})=-g_{k}+\mathop{\max}\limits_{a_{S}}\left\{\mathcal{R}(\mathbf{w},a_{S})+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}-\mathbf{w}^{ref}}p(\mathbf{w}^{\prime}|\mathbf{w},a_{S})\tilde{V}^{k-1}_{\pi_{A}}(\mathbf{w}^{\prime})\right\}$;
7 $\theta^{*}(\pi_{A},\pi_{S}^{*})=-\tilde{V}^{k}_{\pi_{A}}(\mathbf{w})+\mathop{\max}\limits_{a_{S}\in\mathcal{A}_{S}}\left\{\mathcal{R}(\mathbf{w},a_{S})+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p(\mathbf{w}^{\prime}|\mathbf{w},a_{S})\tilde{V}^{k}_{\pi_{A}}(\mathbf{w}^{\prime})\right\}$;
8 for $\mathbf{w}\in\mathcal{I}$ do
9     $\pi_{S}^{*}(\pi_{A},\mathbf{w})=\mathop{\arg\max}\limits_{a_{S}}\left\{\mathcal{R}(\mathbf{w},a_{S})+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p(\mathbf{w}^{\prime}|\mathbf{w},a_{S})\tilde{V}^{k}_{\pi_{A}}(\mathbf{w}^{\prime})\right\}$;
Output: $\pi_{S}^{*}(\pi_{A})$, $\theta^{*}(\pi_{A},\pi_{S}^{*})$
Algorithm 1 The RVI Algorithm to Solve the MDP Given $\pi_{A}$
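For concreteness, a minimal Python sketch of relative value iteration on a generic average-reward MDP is given below. The random transition tensor and rewards are placeholders standing in for the tuple of Definition 1, not quantities derived from this paper's model:

import numpy as np

# A sketch of Algorithm 1: relative value iteration (RVI) for an average-reward MDP
# with a transition tensor P[a, s, s'] and reward R[s, a]. The tiny random MDP below
# is an illustrative stand-in for the tuple <I, A_S, T^{pi_A}, R> of Definition 1.

rng = np.random.default_rng(3)
n_states, n_actions = 6, 2
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, :] sums to 1
R = rng.normal(size=(n_states, n_actions))                         # reward R(s, a)

def rvi(P, R, eps=1e-9, ref=0, max_iter=100000):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q(s, a) = R(s, a) + sum_{s'} P(s'|s, a) V(s')
        Q = R + np.einsum('asn,n->sa', P, V)
        V_new = Q.max(axis=1)
        gain = V_new[ref]                 # current estimate of the average reward
        V_new = V_new - gain              # keep values relative to the reference state
        if np.max(np.abs(V_new - V)) < eps:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)             # greedy (deterministic) policy
    return gain, V, policy

theta, V, pi_S = rvi(P, R)
print('average reward:', theta)
print('deterministic policy:', pi_S)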

With Definition 1 and Algorithm 1 in hand, we can then perform a Brute-Force-Search across the decision-making policy space $\Upsilon_{A}$, thereby acquiring the joint sampling and decision-making policy. The resulting algorithm is called the RVI-Brute-Force-Search Algorithm, which is elaborated in Algorithm 2. In the following theorem, we discuss the optimality of the RVI-Brute-Force-Search Algorithm.

Theorem 1.

The RVI-Brute-Force-Search Algorithm (Algorithm 2) achieves the optimal joint deterministic policies $(\pi_{S}^{*},\pi_{A}^{*})$, given that the transition function $\mathcal{T}^{\pi_{A}}$ is unichain.

Proof. If the transition function $\mathcal{T}^{\pi_{A}}$ is unichain, we obtain from [56, Theorem 8.4.5] that for any $\pi_{A}$, we can obtain the optimal deterministic policy $\pi_{S}^{*}$ such that $\theta^{*}({\pi_{A}},\pi_{S}^{*})\leq\theta^{*}({\pi_{A}},\pi_{S})$. Also, Algorithm 2 assures that for any $\pi_{A}$, $\theta^{*}({\pi^{*}_{A}},\pi_{S}^{*})\leq\theta^{*}({\pi_{A}},\pi_{S}^{*})$. This leads to the conclusion that for any $\boldsymbol{\pi}_{C}=(\pi_{S},\pi_{A})\in\Upsilon$, we have that

\theta^{*}({\pi^{*}_{A}},\pi_{S}^{*})\leq\theta^{*}({\pi_{A}},\pi_{S}^{*})\leq\theta^{*}({\pi_{A}},\pi_{S}). \qquad (26)

Nonetheless, the Brute-Force-Search across the decision-making policy space remains computationally expensive, as the size of the decision-making policy space $\Upsilon_{A}$ amounts to $|\Upsilon_{A}|=|\mathcal{A}_{A}|^{|\mathcal{O}_{A}|}$. This implies that executing the RVI algorithm $|\mathcal{A}_{A}|^{|\mathcal{O}_{A}|}$ times is necessary to attain the optimal policy. Consequently, although proven to be optimal, such an algorithm is ill-suited for scenarios where $|\mathcal{O}_{A}|$ and $|\mathcal{A}_{A}|$ are considerably large. To ameliorate this challenge, we propose a sub-optimal, yet computation-efficient alternative in the subsequent section.
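The following short sketch illustrates how quickly this policy space grows: a deterministic decision-making policy is simply a mapping from observations to actions, so the space can be enumerated with a Cartesian product (the alphabet sizes below are hypothetical):

from itertools import product

# Enumerating the deterministic decision-making policy space Upsilon_A: every mapping
# from the actuator's observation set (here |O_A| = 4 estimates) to actions in A_A.
# Sizes are illustrative; the count |A_A| ** |O_A| is what makes brute force expensive.

observations = ['s1', 's2', 's3', 's4']        # hypothetical O_A
actions = ['a0', 'a1', 'a2']                    # hypothetical A_A

policy_space = [dict(zip(observations, assignment))
                for assignment in product(actions, repeat=len(observations))]
print(len(policy_space), len(actions) ** len(observations))   # 81 81

# The RVI-Brute-Force-Search of Algorithm 2 would run Algorithm 1 once per entry here.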

IV-B2 A Sub-optimal Solution

The method here is to instead find a locally optimal algorithm to circumvent the high complexity of the Brute-Force-Search-based approach. We apply the Joint Equilibrium-Based Search for Policies (JESP) to find Nash equilibrium solutions [57]. Within this framework, the sampling policy is optimally responsive to the decision-making policy and vice versa, i.e., $\forall\pi_{S},\pi_{A}$: $\theta(\pi_{S}^{*},\pi_{A}^{*})\leq\theta(\pi_{S},\pi_{A}^{*})$ and $\theta(\pi_{S}^{*},\pi_{A}^{*})\leq\theta(\pi_{S}^{*},\pi_{A})$.

Input: The Dec-POMDP tuple $\mathscr{M}_{\mathrm{Dec-POMDP}}\triangleq\left\langle n,\mathcal{I},\mathcal{A},\mathcal{T},\Omega,\mathcal{O},\mathcal{R}\right\rangle$;
1 for $\pi_{A}\in\Upsilon_{A}$ do
2     Formulate the MDP problem $\mathscr{M}^{\pi_{A}}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{S},\mathcal{T}_{\mathrm{MDP}}^{\pi_{A}},\mathcal{R}\rangle$ as given in Definition 1;
3     Run Algorithm 1 to obtain $\pi_{S}^{*}(\pi_{A})$ and $\theta^{*}({\pi_{A}},\pi_{S}^{*})$;
4 Calculate the optimal joint policy: $\pi_{A}^{*}=\mathop{\arg\min}\nolimits_{\pi_{A}}\theta^{*}({\pi_{A}},\pi_{S}^{*})$ and $\pi_{S}^{*}=\pi_{S}^{*}(\pi^{*}_{A})$;
Output: $\pi_{S}^{*}$, $\pi_{A}^{*}$
Algorithm 2 The RVI-Brute-Force-Search Algorithm

To search for the Nash equilibrium, we first search for the optimal sampling policy given a prescribed decision-making policy. This problem can be formulated as a standard fully observed MDP problem denoted by $\mathscr{M}^{\pi_{A}}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{S},\mathcal{T}_{\mathrm{MDP}}^{\pi_{A}},\mathcal{R}\rangle$ (see Definition 1). Next, we alternately fix the sampling policy $\pi_{S}$ and solve for the optimal decision-making policy $\pi_{A}$. This problem can be modeled as a memoryless partially observable Markov decision process (POMDP), denoted by $\mathscr{M}^{\pi_{S}}_{\mathrm{POMDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{A},\Omega_{A},\mathcal{O}_{A},\mathcal{T}^{\pi_{S}}_{\mathrm{POMDP}},\mathcal{R}\rangle$ (see Definition 2). Then, by alternately iterating between $\mathcal{A}gent_{S}$ and $\mathcal{A}gent_{A}$, we can obtain a Nash equilibrium between the two agents.

Definition 2.

Given a deterministic sampling policy $\pi_{S}$, the optimal decision-making problem can be formulated as a memoryless POMDP problem $\mathscr{M}^{\pi_{S}}_{\mathrm{POMDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{A},\Omega_{A},\mathcal{O}_{A},\mathcal{T}^{\pi_{S}}_{\mathrm{POMDP}},\mathcal{R}\rangle$, where the elements are given as follows:

  • $\mathcal{I}$, $\Omega_{A}$, $\mathcal{A}_{A}$, and $\mathcal{O}_{A}$: the same as in the pre-defined Dec-POMDP tuple.

  • $\mathcal{T}^{\pi_{S}}_{\mathrm{POMDP}}$: the transition function given a deterministic sampling policy $\pi_{S}$, which is

    \mathcal{T}^{\pi_{S}}_{\mathrm{POMDP}}(\mathbf{w},a_{A},\mathbf{w}^{\prime})=p^{\pi_{S}}\left(\mathbf{w}^{\prime}|\mathbf{w},a_{A}\right)=p\left(\mathbf{w}^{\prime}|\mathbf{w},(\pi_{S}(\mathbf{w}),a_{A})\right), \qquad (27)

    where $p\left(\mathbf{w}^{\prime}|\mathbf{w},(\pi_{S}(\mathbf{w}),a_{A})\right)$ can be obtained by Lemma 1.

  • $\mathcal{R}$: the reward function is denoted as $\mathcal{R}^{\pi_{S}}(\mathbf{w},a_{A})$, which can be obtained by (9).

We then proceed to solve the memoryless POMDP problem discussed in Definition 2 to obtain the deterministic decision-making policy. Denoting $p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w})$ as the transition probability $\Pr\left\{\mathbf{W}_{t+1}=\mathbf{w}^{\prime}|\mathbf{W}_{t}=\mathbf{w}\right\}$ under the policies $\pi_{A}$ and $\pi_{S}$, we have that

p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w})=\sum_{o_{A}\in\mathcal{O}_{A}}p(o_{A}|\mathbf{w})\sum_{a_{A}\in\mathcal{A}_{A}}p^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w},a_{A})\pi_{A}(a_{A}|o_{A}), \qquad (28)

where $p(o_{A}|\mathbf{w})$ can be obtained by Lemma 2 and $p^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w},a_{A})$ is obtained by (27). By assuming ergodicity of $p^{\pi_{S}}_{\pi_{A}}(\mathbf{w}^{\prime}|\mathbf{w})$ and rewriting it as a matrix $\mathbf{P}_{\pi_{A}}^{\pi_{S}}$, we can then solve for the stationary distribution of the system status under the policies $\pi_{S}$ and $\pi_{A}$, denoted as $\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}$, by solving the balance equations:

\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}\mathbf{P}_{\pi_{A}}^{\pi_{S}}=\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}},\quad\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}\mathbf{e}=1, \qquad (29)

where $\mathbf{e}$ is the all-one vector $[1,\cdots,1]_{|\mathcal{I}|\times 1}$, and $\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}$ can be solved by Cramer's rule. Denote ${\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w})$ as the stationary probability of $\mathbf{w}$. Also, we denote $r_{\pi_{A}}^{\pi_{S}}(\mathbf{w})$ as the expected reward of global system status $\mathbf{w}$ under policies $\pi_{A}$ and $\pi_{S}$, which can be calculated as:

r_{\pi_{A}}^{\pi_{S}}(\mathbf{w})=\sum_{o_{A}\in\mathcal{O}_{A}}p(o_{A}|\mathbf{w})\mathcal{R}^{\pi_{S}}(\mathbf{w},\pi_{A}(o_{A})). \qquad (30)

The performance measure is the long-term average reward:

\eta_{\pi_{A}}^{\pi_{S}}=\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{(\pi_{S},\pi_{A})}\left(\sum\limits_{t=0}^{T-1}r(t)\right)=\sum_{\mathbf{w}\in\mathcal{I}}{\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w})\cdot r_{\pi_{A}}^{\pi_{S}}(\mathbf{w}). \qquad (31)

With $\eta_{\pi_{A}}^{\pi_{S}}$ in hand, we then introduce the relative reward $g_{\pi_{A}}^{\pi_{S}}(\mathbf{w})$, defined by

g_{\pi_{A}}^{\pi_{S}}(\mathbf{w})\triangleq\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{(\pi_{S},\pi_{A})}\left[\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)\left|\mathbf{W}_{0}=\mathbf{w}\right.\right], \qquad (32)

which satisfies the Poisson equations [58]:

\eta_{\pi_{A}}^{\pi_{S}}+g_{\pi_{A}}^{\pi_{S}}(\mathbf{w})=r_{\pi_{A}}^{\pi_{S}}(\mathbf{w})+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w})g_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}). \qquad (33)

Denote $\mathbf{g}^{\pi_{S}}_{\pi_{A}}$ as the vector consisting of $g_{\pi_{A}}^{\pi_{S}}(\mathbf{w})$ and $\mathbf{r}^{\pi_{S}}_{\pi_{A}}$ as the vector consisting of ${r}^{\pi_{S}}_{\pi_{A}}(\mathbf{w})$, $\mathbf{w}\in\mathcal{I}$. Then $\mathbf{g}^{\pi_{S}}_{\pi_{A}}$ can be solved by utilizing [56]:

\mathbf{g}^{\pi_{S}}_{\pi_{A}}=\left[(I-\mathbf{P}^{\pi_{S}}_{\pi_{A}}+\mathbf{e}\boldsymbol{\mu}^{\pi_{S}}_{\pi_{A}})^{-1}-\mathbf{e}\boldsymbol{\mu}^{\pi_{S}}_{\pi_{A}}\right]\mathbf{r}^{\pi_{S}}_{\pi_{A}}. \qquad (34)
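A compact numerical sketch of (29), (31), and (34) for a generic ergodic chain is given below; the transition matrix and per-state rewards are random placeholders rather than quantities derived from the model, and the Poisson equation (33) is used as a sanity check:

import numpy as np

# Sketch of (29), (31) and (34): stationary distribution mu, long-term average reward
# eta, and relative reward vector g for a fixed policy pair (pi_S, pi_A). The chain
# P and per-state expected rewards r below are random, illustrative placeholders.

rng = np.random.default_rng(4)
n = 5
P = rng.dirichlet(np.ones(n), size=n)        # P[w, w'] = p_{pi_A}^{pi_S}(w' | w), eq. (28)
r = rng.normal(size=n)                        # r(w), eq. (30)

# (29): mu P = mu, mu e = 1  ->  solve the linear system (append the normalization row).
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
mu, *_ = np.linalg.lstsq(A, b, rcond=None)

# (31): eta = sum_w mu(w) r(w)
eta = mu @ r

# (34): g = [(I - P + e mu)^{-1} - e mu] r, with e the all-one column vector.
e = np.ones((n, 1))
g = (np.linalg.inv(np.eye(n) - P + e @ mu[None, :]) - e @ mu[None, :]) @ r

# Sanity check: the Poisson equation (33), eta + g = r + P g, holds componentwise.
print(np.allclose(eta + g, r + P @ g))        # True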

With the relative reward $g_{\pi_{A}}^{\pi_{S}}$ in hand, we then introduce $Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A})$ and $Q_{\pi_{A}}^{\pi_{S}}(o_{A},a_{A})$ as follows:

Lemma 3.

$Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A})$ and $Q_{\pi_{A}}^{\pi_{S}}(o_{A},a_{A})$ are defined and calculated as:

Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A})\triangleq\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{(\pi_{S},\pi_{A})}\left[\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)\left|\mathbf{W}_{0}=\mathbf{w},a_{A}(0)=a_{A}\right.\right]=\mathcal{R}^{\pi_{S}}(\mathbf{w},a_{A})-\eta_{\pi_{A}}^{\pi_{S}}+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w},a_{A})g_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}), \qquad (35)
Q_{\pi_{A}}^{\pi_{S}}(o_{A},a_{A})\triangleq\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{(\pi_{S},\pi_{A})}\left[\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)\left|o_{A}^{(0)}=o_{A},a_{A}(0)=a_{A}\right.\right]=\sum_{\mathbf{w}\in\mathcal{I}}p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A})Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A}), \qquad (36)

where $p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w})$ can be obtained by (28) and $p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A})$ can be obtained by the Bayesian formula:

p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A})=\frac{{\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w})p(o_{A}|\mathbf{w})}{\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}{\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime})p(o_{A}|\mathbf{w}^{\prime})}. \qquad (37)
Proof.

Please refer to Appendix B. ∎

With $Q_{\pi_{A}}^{\pi_{S}}(o_{A},a_{A})$ in hand, it is then easy to conduct the Policy Iteration (PI) Algorithm with Step Sizes [59] to iteratively improve the deterministic memoryless decision-making policy $\pi_{A}$. The detailed steps are shown in Algorithm 3.

Input: The POMDP tuple POMDPπS,𝒜A,ΩA,𝒪A,𝒯POMDPπS,\mathscr{M}^{\pi_{S}}_{\mathrm{POMDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{A},\Omega_{A},\mathcal{O}_{A},\mathcal{T}^{\pi_{S}}_{\mathrm{POMDP}},\mathcal{R}\rangle, ϵ\epsilon, πS\pi_{S};
1 Initialization: randomly choose decision-making policy πA(1)\pi_{A}^{(1)}, ηπA(0)πS=0\eta_{{\pi_{A}}^{(0)}}^{\pi_{S}}=0, ηπA(1)πS=\eta_{{\pi_{A}}^{(-1)}}^{\pi_{S}}=\infty, k=0k=0;
2 while |ηπA(k)πSηπA(k1)πS|ϵ|\eta_{{\pi_{A}}^{(k)}}^{\pi_{S}}-\eta_{{\pi_{A}}^{(k-1)}}^{\pi_{S}}|\geq\epsilon do
3       k=k+1k=k+1;
4       Calculate the transition probability pπA(k)πS(𝐰|𝐰)p_{{\pi_{A}}^{(k)}}^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w}) by (28);
5       Solve the stationary distribution 𝝁πA(k)πS\boldsymbol{\mu}_{{\pi_{A}}^{(k)}}^{\pi_{S}} by the stationary equations (29);
6       Calculate the expectation reward rπA(k)πS(𝐰)r_{{\pi_{A}}^{(k)}}^{\pi_{S}}(\mathbf{w}) by (30) and the long-term average reward ηπA(k)πS\eta_{{\pi_{A}}^{(k)}}^{\pi_{S}} by (31);
7       Calculate the relative reward gπA(k)πS(𝐰)g_{{\pi_{A}}^{(k)}}^{\pi_{S}}(\mathbf{w}) by (34);
8       for oA𝒪Ao_{A}\in\mathcal{O}_{A} do
9             for aA𝒜Aa_{A}\in\mathcal{A}_{A} do
10                   Calculate QπA(k)πS(oA,aA)Q_{{\pi_{A}}^{(k)}}^{\pi_{S}}(o_{A},a_{A}) by Lemma 3;
11            
12      for oA𝒪Ao_{A}\in\mathcal{O}_{A} do
13             πA(|oA)=argmaxπA(|oA)QπA(k)πS(oA,aA)\pi_{A}(\cdot|o_{A})=\mathop{\arg\max}\limits_{\pi_{A}(\cdot|o_{A})}Q_{\pi_{A}^{(k)}}^{\pi_{S}}(o_{A},a_{A});
14             \pi_{A}^{(k)}(\cdot|o_{A})=\pi_{A}^{(k-1)}(\cdot|o_{A})+\delta_{k}*(\pi_{A}(\cdot|o_{A})-\pi_{A}^{(k-1)}(\cdot|o_{A}));
15      
16for oA𝒪Ao_{A}\in\mathcal{O}_{A} do
17       πA(oA)=argmaxaAπA(k)(aA|oA)\pi_{A}^{*}(o_{A})=\mathop{\arg\max}\limits_{a_{A}}\pi_{A}^{(k)}(a_{A}|o_{A});
18θ(πS,πA)=ηπAπS\theta^{*}(\pi_{S},\pi_{A}^{*})=\eta_{{\pi_{A}}^{*}}^{\pi_{S}};
Output: πA(πS)\pi_{A}^{*}(\pi_{S}), θ(πS,πA)\theta^{*}(\pi_{S},\pi_{A}^{*})
Algorithm 3 The PI Algorithm with Step Size to Solve the POMDP Given πS\pi_{S}
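For clarity, a compact Python sketch of the main loop of Algorithm 3 is given below. The helper routines pomdp.transition, pomdp.stationary, pomdp.avg_reward, and pomdp.q_values stand in for (28)–(31), (34), and Lemma 3; they, together with the diminishing step-size schedule, are assumptions made for illustration rather than the authors' implementation.

```python
import numpy as np

def policy_iteration_with_step_sizes(pomdp, pi_S, eps=1e-6, max_iter=500):
    """Sketch of Algorithm 3: memoryless PI with step sizes for the POMDP
    induced by a fixed sampling policy pi_S. All pomdp.* helpers are
    placeholders for the quantities derived in (28)-(31), (34), and Lemma 3."""
    num_obs, num_act = pomdp.num_obs, pomdp.num_act
    # stochastic decision-making policy pi_A(a | o), initialized at random
    pi_A = np.random.dirichlet(np.ones(num_act), size=num_obs)
    eta_prev, eta = np.inf, 0.0
    k = 0
    while abs(eta - eta_prev) >= eps and k < max_iter:
        k += 1
        P = pomdp.transition(pi_S, pi_A)                       # (28)
        mu = pomdp.stationary(P)                               # (29)
        eta_prev, eta = eta, pomdp.avg_reward(mu, pi_S, pi_A)  # (30)-(31)
        Q = pomdp.q_values(pi_S, pi_A, mu, eta)                # Lemma 3: Q(o_A, a_A)
        greedy = np.eye(num_act)[Q.argmax(axis=1)]             # greedy improvement per o_A
        delta = 1.0 / (k + 1)                                  # assumed diminishing step size
        pi_A = pi_A + delta * (greedy - pi_A)                  # move pi_A towards the greedy policy
    # final deterministic policy: most probable action per observation
    return pi_A.argmax(axis=1), eta
```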

Thus far, we have solved two problems: i) by capitalizing on Definition 1 and Algorithm 1, we have ascertained an optimal sampling policy \pi_{S}^{*} contingent upon a given decision-making policy \pi_{A}; ii) by harnessing Definition 2 and Algorithm 3, we have determined an optimal decision-making policy \pi_{A}^{*} predicated on a given sampling policy \pi_{S}. Consequently, we can iteratively employ Algorithm 1 and Algorithm 3 in an alternating fashion, whereby Algorithm 1 yields the optimal sampling policy \pi_{S}^{*(k)}(\pi_{A}^{(k-1)}), which subsequently serves as an input to Algorithm 3 to derive the decision-making policy \pi_{A}^{*(k)}(\pi_{S}^{*(k)}). The procedure persists until the average reward \theta^{*}(\pi_{S}^{*(k)},\pi_{A}^{*(k)}) converges, indicating that the solution reaches a Nash equilibrium between the sampler and the decision-maker. The intricacies of the procedure are delineated in Algorithm 4.

Remark 1.

Generally, the JESP algorithm needs to be restarted with multiple randomly chosen initial decision-making policies \pi_{A}^{*(1)} to ensure a good solution, since a poor initialization of \pi_{A}^{*(1)} often leads to a poor local optimum. We here investigate a heuristic initialization to find a good solution quickly and reliably. Specifically, we assume that the decision-maker can fully observe the state and solve the following MDP problem:

Definition 3.

\mathscr{M}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{A},\mathcal{T}_{\mathrm{MDP}},\mathcal{R}\rangle, where the elements are given as follows:

  • \mathcal{I}: the set of (Xt,Φt)𝒮×𝒱(X_{t},\Phi_{t})\in\mathcal{S}\times\mathcal{V}.

  • \mathcal{A}_{A}=\left\{a_{0},\cdots,a_{M-1}\right\}: the decision-making set.

  • 𝒯\mathcal{T}: the transition function, given as

    𝒯((Xt,Φt),aA,(Xt+1,Φt+1))\displaystyle\mathcal{T}((X_{t},\Phi_{t}),a_{A},(X_{t+1},\Phi_{t+1})) (38)
    =p(Xt+1|Xt,aA,Φt)p(Φt+1|Φt),\displaystyle=p(X_{t+1}|X_{t},a_{A},\Phi_{t})\cdot p(\Phi_{t+1}|\Phi_{t}),

    where p(Xt+1|Xt,aA,Φt)p(X_{t+1}|X_{t},a_{A},\Phi_{t}) and p(Φt+1|Φt)p(\Phi_{t+1}|\Phi_{t}) could be obtained by (11) and (12), respectively.

  • \mathcal{R}: the same as the pre-defined Dec-POMDP tuple.

By solving the above MDP problem, we can explicitly obtain the Q-function Q(X_{t},\Phi_{t},a_{A}). Defining Q(X_{t},a_{A}) as {\mathbb{E}}_{\Phi_{t}}[Q(X_{t},\Phi_{t},a_{A})], the initial decision-making policy \pi_{A}^{*(1)} is given as

πA(1)(X^t)=argminaAQ(X^t,aA).\pi_{A}^{*(1)}(\hat{X}_{t})={\arg\min}_{a_{A}}Q(\hat{X}_{t},a_{A}). (39)
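A possible realization of this heuristic initialization is sketched below, assuming the fully observable MDP of Definition 3 has already been solved (e.g., by relative value iteration) to yield Q(X_{t},\Phi_{t},a_{A}); the array names and the supplied marginal of \Phi_{t} are hypothetical.

```python
import numpy as np

def initial_decision_policy(Q, phi_marginal):
    """Heuristic initialization (39): pi_A^{*(1)}(x) = argmin_a E_Phi[Q(x, Phi, a)].

    Q            : (num_X, num_Phi, num_A) Q-function of the fully observable MDP
    phi_marginal : (num_Phi,) distribution of the context Phi_t (assumed given)
    """
    Q_bar = np.einsum('p,xpa->xa', phi_marginal, Q)  # Q(X_t, a_A) = E_Phi[Q(X_t, Phi_t, a_A)]
    return Q_bar.argmin(axis=1)                      # one action index per estimate X_hat
```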
Input: The Dec-POMDP tuple \mathscr{M}_{\mathrm{Dec-POMDP}}\triangleq\left\langle n,\mathcal{I},\mathcal{A},\mathcal{T},\Omega,\mathcal{O},\mathcal{R}\right\rangle, \epsilon;
1 Initialization: θ0=0\theta^{*}_{0}=0, θ1=\theta^{*}_{-1}=\infty, k=0k=0;
2 Initialize πA(1)\pi_{A}^{*(1)} by calculating (39);
3 while θkθk1ϵ||\theta^{*}_{k}-\theta^{*}_{k-1}||\geq\epsilon do
4      k=k+1k=k+1;
5       Formulate the MDP problem \mathscr{M}^{\pi_{A}^{*(k)}}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{S},\mathcal{T}_{\mathrm{MDP}}^{\pi_{A}^{*(k)}},\mathcal{R}\rangle as given in Definition 1;
6       Run Algorithm 1 to obtain πS(k)\pi_{S}^{*(k)};
7       Formulate the POMDP problem \mathscr{M}^{\pi_{S}^{*(k)}}_{\mathrm{POMDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{A},\Omega_{A},\mathcal{O}_{A},\mathcal{T}^{\pi_{S}^{*(k)}}_{\mathrm{POMDP}},\mathcal{R}\rangle as given in Definition 2;
8       Run Algorithm 3 to obtain θ(πS(k),πA(k))\theta^{*}(\pi_{S}^{*(k)},\pi_{A}^{*(k)}) and πA(k)\pi_{A}^{*(k)};
9       θk=θ(πS(k),πA(k))\theta^{*}_{k}=\theta^{*}(\pi_{S}^{*(k)},\pi_{A}^{*(k)});
10      
11The joint sub-optimal policy is: \pi_{S}^{*}=\pi_{S}^{*(k)}, \pi_{A}^{*}=\pi_{A}^{*(k)};
12 The sub-optimal average reward is: θ=θk\theta^{*}=\theta^{*}_{k};
Output: πS\pi_{S}^{*}, πA\pi_{A}^{*}, θ\theta^{*}
Algorithm 4 The Improved JESP Algorithm
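The alternating structure of Algorithm 4 can be summarized by the sketch below, where initial_decision_policy_from_mdp, solve_sampler_mdp, and solve_decision_pomdp stand in for the heuristic initialization (39), Algorithm 1, and Algorithm 3, respectively; these names and the convergence test are illustrative placeholders.

```python
def improved_jesp(dec_pomdp, eps=1e-6, max_iter=100):
    """Sketch of Algorithm 4 (improved JESP): alternate between the sampler
    MDP (Definition 1, Algorithm 1) and the decision-maker POMDP
    (Definition 2, Algorithm 3) until the average reward converges."""
    pi_A = initial_decision_policy_from_mdp(dec_pomdp)   # heuristic initialization (39)
    theta_prev, theta = float('inf'), 0.0
    k = 0
    while abs(theta - theta_prev) >= eps and k < max_iter:
        k += 1
        pi_S = solve_sampler_mdp(dec_pomdp, pi_A)                 # Algorithm 1 given pi_A
        pi_A, theta_new = solve_decision_pomdp(dec_pomdp, pi_S)   # Algorithm 3 given pi_S
        theta_prev, theta = theta, theta_new
    return pi_S, pi_A, theta
```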

V Simulation Results

Traditional metrics such as Age of Information have been developed under the assumption that a fresher or more accurate packet, capable of aiding in source reconstruction, holds a higher value for the receiver, thus promoting goal-oriented decision-making. Nevertheless, the manner in which a packet update impacts the system’s utility via decision-making remains an unexplored domain. Through the simulations, we endeavor to elucidate the following observations of interest:

\bullet GoT-optimal vs. State-of-the-art. In contrast with state-of-the-art sampling policies, the proposed goal-oriented sampler & decision-maker co-design is capable of concurrently maximizing goal attainment and conserving communication resources, accomplishing a closed-loop utility optimization via sparse sampling. (See Fig. 6 and Fig. 7)

\bullet Separate Design vs. Co-Design. Compared to the two-stage framework in which sampling and decision-making are designed separately, the co-design of sampling and decision-making not only achieves superior goal achievement but also alleviates the resource expenditure engendered by communication and actuation. (See Fig. 6 and Fig. 7)

\bullet Optimal Brute-Force-Search vs. Sub-optimal JESP. Under different successful transmission probabilities p_{S} and sampling costs C_{S}, the sub-optimal yet computation-efficient JESP algorithm converges to near-optimal solutions. (See Fig. 8)

\bullet Trade-off: Transmission vs. Actuation. There is a trade-off between transmission and actuation in terms of resource expenditure: under reliable channel conditions, it is apt to increase communication overhead to ensure effective decision-making; conversely, under poor channel conditions, it is advisable to curtail communication expenses and augment actuation resources to attain maximal system utility. (See Fig. 9)

V-A Comparing Benchmarks

Fig. 6 illustrates the simulation results, which characterize the utility by the average cost composed of the status inherent cost C_{1}(X_{t},\Phi_{t}), the actuation gain cost C_{2}(\pi_{A}(\hat{X}_{t})), the actuation inherent cost C_{3}(\pi_{A}(\hat{X}_{t})), and the sampling cost C_{S}. For the simulation setup, we set \mathcal{A}_{A}=\left\{a_{0},\cdots,a_{10}\right\}, \mathcal{S}=\left\{s_{0},s_{1},s_{2}\right\}, \mathcal{V}=\left\{v_{0},v_{1}\right\}, and the corresponding costs are given as:

𝐂1|𝒱|×|𝒮|=(0205001020),\mathbf{C}_{1}^{|\mathcal{V}|\times|\mathcal{S}|}=\begin{pmatrix}0&20&50\\ 0&10&20\end{pmatrix}, (40)

C_{2}(\pi_{A}(\hat{X}_{t})) and C_{3}(\pi_{A}(\hat{X}_{t})) are both linear in the actuation, with C_{2}(\pi_{A}(\hat{X}_{t}))=C_{g}\cdot\pi_{A}(\hat{X}_{t}) and C_{3}(\pi_{A}(\hat{X}_{t}))=C_{I}\cdot\pi_{A}(\hat{X}_{t}). The following comparison benchmarks of interest are considered:

\bullet Uniform. Sampling is triggered periodically in this policy. In this case, a_{S}(t)=\mathbbm{1}_{\left\{t=K\Delta\right\}}, where K=0,1,2,\cdots and \Delta\in\mathbb{N}^{+}. For each \Delta, we can calculate the sampling rate as 1/\Delta and explicitly obtain the long-term average cost through Markov chain simulations under the pre-defined decision-making policy \pi_{A}=[a_{0},a_{3},a_{7}]. The choice of \Delta represents a trade-off between utility and sampling frequency, as depicted in Fig. 6: if C_{S} is minimal, sampling and transmission contribute positively to the utility; if sampling is expensive, it may not yield adequate utility; if a single sample consumes moderate resources, the utility exhibits a U-shaped pattern in terms of the sampling rate.

\bullet Age-aware. Sampling is executed when the AoI attains a predetermined threshold, a principle that has been established as the threshold-based structure of AoI-optimal sampling [27]. In this case, a_{S}(t)=\mathbbm{1}_{\left\{\mathrm{AoI}(t)>\delta\right\}}, where the AoI-optimal threshold \delta can be ascertained using the bisection method delineated in Algorithm 1 of [27]. Here, rather than fixing the threshold that minimizes AoI, we dynamically shift the threshold to explore the balance between sampling and utility. As evidenced in Fig. 6, the utility derived from this sampling policy consistently surpasses that of the uniform sampling policy.

\bullet Change-aware. Sampling is triggered whenever the source status changes. In such a case, a_{S}(t)=\mathbbm{1}_{\left\{X_{t}\neq X_{t-1}\right\}}. The performance of this policy depends on the dynamics of the system: if the source status changes frequently, the sampling rate will be higher. The utility of this policy may be arbitrarily poor owing to its semantics-unaware nature. In our considered model, the Change-aware policy spends 89.5\% of the time in the unsynchronized status X_{t}=s_{0} while \hat{X}_{t}=s_{2}. In this case, the actuator implements the actuation a_{A}=a_{7} according to the estimate \hat{X}_{t}=s_{2}, which in turn drives the status X_{t} back towards s_{0}.

\bullet Optimal MSE. This is a type of E2E goal-oriented sampling policy when the goal is real-time reconstruction. Nevertheless, this policy disregards the semantics conveyed by the packet and the ensuing actuation updates precipitated by semantics updates. The problem can be formulated as a standard MDP and solved through the Relative Value Iteration (RVI) algorithm. The sampling rate and average cost are obtained under the MSE-optimal sampling policy and the pre-defined decision-making policy \pi_{A}=[a_{0},a_{3},a_{7}].

\bullet Optimal AoII (also optimal AoCI). It has been proven in [43] that the AoII-optimal sampling policy is a_{S}(t)=\mathbbm{1}_{\left\{X_{t}\neq\hat{X}_{t}\right\}}. From [40], the AoCI-optimal sampling policy is a_{S}(t)=\mathbbm{1}_{\left\{X_{t}\neq{X}_{t-\mathrm{AoI}(t)}\right\}}. Since \hat{X}_{t}={X}_{t-\mathrm{AoI}(t)}, these two sampling policies are equivalent. The sampling rate and average cost are obtained under this sampling policy and the greedy decision-making policy \pi_{A}=[a_{0},a_{3},a_{7}]. A minimal sketch of these benchmark sampling rules is given after this list.
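To make the benchmark definitions concrete, the following sketch implements the closed-form sampling rules as indicator functions of the instantaneous state; the MSE-optimal policy is omitted since it requires solving an MDP. The function and variable names are assumptions introduced for illustration only.

```python
def uniform_sampling(t, Delta):
    """Uniform: sample every Delta slots, a_S(t) = 1{t = K*Delta}."""
    return int(t % Delta == 0)

def age_aware_sampling(aoi, delta_th):
    """Age-aware: sample when the AoI exceeds the threshold delta_th."""
    return int(aoi > delta_th)

def change_aware_sampling(x_t, x_prev):
    """Change-aware: sample whenever the source status changes."""
    return int(x_t != x_prev)

def aoii_optimal_sampling(x_t, x_hat_t):
    """AoII-optimal (also AoCI-optimal): sample on any mismatch X_t != X_hat_t."""
    return int(x_t != x_hat_t)
```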

V-B Separate Design vs. Co-Design

Conventionally, the sampling and actuation policies are designed in a two-stage manner: the sampling policy is first designed to optimize an open-loop performance metric such as the average mean squared error (MSE) or the average Age of Information (AoI), and the decision-making policy \pi_{A} is then designed separately. Specifically, we consider that \pi_{A} is predetermined using a greedy methodology:

\pi_{A}(\hat{X}_{t})=\mathop{\arg\min}\limits_{a_{A}\in\mathcal{A}_{A}}\mathop{\mathbb{E}}\limits_{\Phi_{t}}\left\{\left[C_{1}(\hat{X}_{t},\Phi_{t})-C_{2}(a_{A})\right]^{+}+C_{3}(a_{A})\right\}. (41)

This greedy approach entails selecting the actuation that minimizes the expected cost in the current step, assuming the estimate \hat{X}_{t} is correct. By calculating (41), we obtain \pi_{A}=[a_{0},a_{3},a_{7}].
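A small sketch of the greedy rule (41), using the cost matrix C_{1} in (40) and the linear costs with C_{g}=8 and C_{I}=1, is given below. The marginal distribution of \Phi_{t} is not specified numerically in this section (it follows from the context dynamics (12)); under the hypothetical marginal Pr\{\Phi_{t}=v_{0}\}=0.6 used here for illustration, the rule recovers the stated policy [a_{0},a_{3},a_{7}].

```python
import numpy as np

# Cost setup from (40); Cg = 8, CI = 1 as in the simulations.
C1 = np.array([[0, 20, 50],    # context v0
               [0, 10, 20]])   # context v1
Cg, CI = 8.0, 1.0
actions = np.arange(11)        # actuations a0, ..., a10

def greedy_policy(phi_marginal):
    """Greedy rule (41): for each estimate X_hat, pick the actuation minimizing
    E_Phi[ [C1(X_hat, Phi) - Cg*a]^+ + CI*a ]. phi_marginal is an assumed
    marginal of Phi_t; in the paper it follows from the context dynamics (12)."""
    policy = []
    for x_hat in range(C1.shape[1]):
        cost = sum(p * (np.maximum(C1[v, x_hat] - Cg * actions, 0.0) + CI * actions)
                   for v, p in enumerate(phi_marginal))
        policy.append(int(np.argmin(cost)))
    return policy

# With the hypothetical context marginal Pr{v0} = 0.6, this returns [0, 3, 7],
# i.e., pi_A = [a0, a3, a7] as used for the benchmarks.
print(greedy_policy([0.6, 0.4]))
```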

Figure 6: Average Cost vs. Sampling Rate under different policies and parameter setups. Here we set C_{g}=8 and C_{I}=1.
Figure 7: Comparisons among GoT-optimal, AoII-optimal, and MSE-optimal policies. Here we set C_{g}=8 and C_{I}=1.
Figure 8: Distance between optimal solutions through the RVI-Brute-Force-Search algorithm and sub-optimal solutions through the JESP algorithm.
Figure 9: Trade-off between long-term average transmission cost and long-term average actuation cost.

However, we note that sampling and actuation are closely intertwined, highlighting the potential for further co-design. In this paper, we have proposed the RVI-Brute-Force-Search and the improved JESP algorithms for such optimal co-design. As shown in Fig. 6, the sampler & decision-maker co-design achieves the optimal utility through sparse sampling. Specifically, only semantically important information is sampled and transmitted, while non-essential data is excluded. This goal-oriented, semantics-aware, and sparse sampling design represents a significant advancement in sampling policy design. By incorporating a best-matching decision-making policy, the sparse sampling achieves superior performance compared to existing methods.

Fig. 7 presents a comparative analysis of the AoII (or AoCI)-optimal sampling policy, the MSE-optimal sampling policy, and our proposed GoT-optimal sampler & decision-maker co-design. It is verified that, under different p_{S} and C_{S}, the proposed GoT-optimal sampler & decision-maker co-design achieves the optimal goal-oriented utility. Importantly, under the condition of an extremely unreliable channel p_{S}=0.2 and a high sampling cost C_{S}=10, the proposed co-design facilitates a reduction in long-term average cost exceeding 60\%. This underscores the superiority of the GoT-optimal sampler & decision-maker co-design.

V-C Optimal vs. Sub-Optimal

Fig. 8 presents a comparative visualization between optimal and sub-optimal solutions over a wide range of C_{S} and p_{S} values. The negligible, near-zero values in Fig. 8 imply a trivial deviation between the optimal and sub-optimal solutions, indicating that the latter converges towards near-optimal outcomes. The minimal variance testifies to the sub-optimal algorithm’s consistent ability to closely approximate the optimal solutions. This observation underscores the practical advantage of employing the sub-optimal improved JESP algorithm, especially in scenarios with large \mathcal{A}_{A} and \mathcal{O}_{A}.

V-D Trade-off: Transmission vs. Actuation

Fig. 9 exemplifies the resource allocation trade-off between transmission and actuation when the long-term average cost is minimized. When p_{S} is low (signifying an unreliable channel) or C_{S} is high (indicating expensive sampling), it becomes prudent to decrease sampling and transmission while concurrently augmenting actuation resources for optimal system utility. In contrast, when the channel is reliable, sampling and transmission resources can be harmonized with actuation resources to better achieve the goal. This indicates that, through the optimal co-design of the sampler & decision-maker pair, a trade-off between transmission and actuation resources can be achieved.

VI Conclusion

In this paper, we have investigated the GoT metric to directly describe the goal-oriented decision-making utility of a system. Employing the proposed GoT, we have formulated an infinite-horizon Dec-POMDP problem to accomplish the co-design of sampling and actuation. To address this problem, we have developed two algorithms: the computationally intensive RVI-Brute-Force-Search, which is proven to be optimal, and the more efficient, albeit sub-optimal, improved JESP algorithm. Comparative analyses have substantiated that the proposed GoT-optimal sampler & decision-maker pair achieves sparse sampling while maximizing utility, signifying an initial realization of a sparse, goal-oriented, and semantics-aware sampler design.

References

  • [1] S. Kaul, M. Gruteser, V. Rai, and J. Kenney, “Minimizing age of information in vehicular networks,” in 2011 8th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad-Hoc Communications and Networks, 2011, pp. 350–358.
  • [2] Y. Sun, E. Uysal-Biyikoglu, R. D. Yates, C. E. Koksal, and N. B. Shroff, “Update or wait: How to keep your data fresh,” IEEE Transactions on Information Theory, vol. 63, no. 11, pp. 7492–7508, 2017.
  • [3] M. Costa, M. Codreanu, and A. Ephremides, “On the age of information in status update systems with packet management,” IEEE Transactions on Information Theory, vol. 62, no. 4, pp. 1897–1910, 2016.
  • [4] R. D. Yates and S. K. Kaul, “The age of information: Real-time status updating by multiple sources,” IEEE Transactions on Information Theory, vol. 65, no. 3, pp. 1807–1827, 2018.
  • [5] C. Kam, S. Kompella, G. D. Nguyen, and A. Ephremides, “Effect of message transmission path diversity on status age,” IEEE Transactions on Information Theory, vol. 62, no. 3, pp. 1360–1374, 2015.
  • [6] A. M. Bedewy, Y. Sun, and N. B. Shroff, “The age of information in multihop networks,” IEEE/ACM Transactions on Networking, vol. 27, no. 3, pp. 1248–1257, 2019.
  • [7] M. Moltafet, M. Leinonen, and M. Codreanu, “On the age of information in multi-source queueing models,” IEEE Transactions on Communications, vol. 68, no. 8, pp. 5003–5017, 2020.
  • [8] H. Sac, B. T. Bacinoglu, E. Uysal-Biyikoglu, and G. Durisi, “Age-Optimal Channel Coding Blocklength for an M/G/1 Queue with HARQ,” 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1–5, 2018.
  • [9] M. Xie, Q. Wang, J. Gong, and X. Ma, “Age and energy analysis for LDPC coded status update with and without ARQ,” IEEE Internet of Things Journal, vol. 7, no. 10, pp. 10 388–10 400, 2020.
  • [10] J. You, S. Wu, Y. Deng, J. Jiao, and Q. Zhang, “An age optimized hybrid ARQ scheme for polar codes via Gaussian approximation,” IEEE Wireless Communications Letters, vol. 10, no. 10, pp. 2235–2239, 2021.
  • [11] S. Meng, S. Wu, A. Li, J. Jiao, N. Zhang, and Q. Zhang, “Analysis and Optimization of the HARQ-Based Spinal Coded Timely Status Update System,” IEEE Transactions on Communications, vol. 70, no. 10, pp. 6425–6440, 2022.
  • [12] H. Pan, T.-T. Chan, V. C. Leung, and J. Li, “Age of Information in Physical-layer Network Coding Enabled Two-way Relay Networks,” IEEE Transactions on Mobile Computing, 2022.
  • [13] A. Maatouk, M. Assaad, and A. Ephremides, “Minimizing The Age of Information: NOMA or OMA?” IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 102–108, 2019.
  • [14] S. Wu, Z. Deng, A. Li, J. Jiao, N. Zhang, and Q. Zhang, “Minimizing Age of Information in HARQ-CC Aided NOMA Systems,” IEEE Transactions on Wireless Communications, 2022.
  • [15] E. T. Ceran, D. Gündüz, and A. György, “Average Age of Information with Hybrid ARQ under a resource constraint,” IEEE Transactions on Wireless Communications, vol. 18, no. 3, pp. 1900–1913, 2019.
  • [16] D. Li, S. Wu, J. Jiao, N. Zhang, and Q. Zhang, “Age-Oriented Transmission Protocol Design in Space-Air-Ground Integrated Networks,” IEEE Transactions on Wireless Communications, vol. 21, no. 7, pp. 5573–5585, 2022.
  • [17] F. Peng, Z. Jiang, S. Zhang, and S. Xu, “Age of Information Optimized MAC in V2X Sidelink via Piggyback-Based Collaboration,” IEEE Transactions on Wireless Communications, vol. 20, pp. 607–622, 2020.
  • [18] H. Pan, T.-T. Chan, J. Li, and V. C. M. Leung, “Age of information with collision-resolution random access,” IEEE Transactions on Vehicular Technology, vol. 71, no. 10, pp. 11 295–11 300, 2022.
  • [19] A. Li, S. Wu, J. Jiao, N. Zhang, and Q. Zhang, “Age of Information with Hybrid-ARQ: A Unified Explicit Result,” IEEE Transactions on Communications, vol. 70, no. 12, pp. 7899–7914, 2022.
  • [20] B. Zhou and W. Saad, “Joint status sampling and updating for minimizing age of information in the internet of things,” IEEE Transactions on Communications, vol. 67, no. 11, pp. 7468–7482, 2019.
  • [21] X. Xie, H. Wang, and X. Liu, “Scheduling for minimizing the age of information in multi-sensor multi-server industrial iot systems,” IEEE Transactions on Industrial Informatics, pp. 1–10, 2023.
  • [22] M. A. Abd-Elmagid and H. S. Dhillon, “Closed-Form Characterization of the MGF of AoI in Energy Harvesting Status Update Systems,” IEEE Transactions on Information Theory, vol. 68, no. 6, pp. 3896–3919, 2022.
  • [23] M. Hatami, M. Leinonen, Z. Chen, N. Pappas, and M. Codreanu, “On-Demand AoI Minimization in Resource-Constrained Cache-Enabled IoT Networks With Energy Harvesting Sensors,” IEEE Transactions on Communications, vol. 70, no. 11, pp. 7446–7463, 2022.
  • [24] R. D. Yates, “Lazy is timely: Status updates by an energy harvesting source,” in 2015 IEEE International Symposium on Information Theory (ISIT), 2015, pp. 3008–3012.
  • [25] A. Arafa, J. Yang, S. Ulukus, and H. V. Poor, “Age-minimal transmission for energy harvesting sensors with finite batteries: Online policies,” IEEE Transactions on Information Theory, vol. 66, no. 1, pp. 534–556, 2019.
  • [26] R. D. Yates, Y. Sun, D. R. Brown, S. K. Kaul, E. Modiano, and S. Ulukus, “Age of Information: An Introduction and Survey,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 5, pp. 1183–1210, 2021.
  • [27] Y. Sun and B. Cyr, “Sampling for data freshness optimization: Non-linear age functions,” Journal of Communications and Networks, vol. 21, no. 3, pp. 204–219, 2019.
  • [28] A. Kosta, N. Pappas, A. Ephremides, and V. Angelakis, “The cost of delay in status updates and their value: Non-linear ageing,” IEEE Transactions on Communications, vol. 68, no. 8, pp. 4905–4918, 2020.
  • [29] J. Cho and H. Garcia-Molina, “Effective page refresh policies for web crawlers,” ACM Transactions on Database Systems (TODS), vol. 28, no. 4, pp. 390–426, 2003.
  • [30] Y. Azar, E. Horvitz, E. Lubetzky, Y. Peres, and D. Shahaf, “Tractable near-optimal policies for crawling,” Proceedings of the National Academy of Sciences, vol. 115, no. 32, pp. 8099–8103, 2018.
  • [31] M. Bastopcu and S. Ulukus, “Information freshness in cache updating systems,” IEEE Transactions on Wireless Communications, vol. 20, no. 3, pp. 1861–1874, 2020.
  • [32] K. T. Truong and R. W. Heath, “Effects of Channel Aging in Massive MIMO systems,” Journal of Communications and Networks, vol. 15, no. 4, pp. 338–351, 2013.
  • [33] Y. Sun and B. Cyr, “Information aging through queues: A mutual information perspective,” in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2018, pp. 1–5.
  • [34] Z. Wang, M.-A. Badiu, and J. P. Coon, “A framework for characterizing the value of information in hidden markov models,” IEEE Transactions on Information Theory, vol. 68, no. 8, pp. 5203–5216, 2022.
  • [35] G. Chen, S. C. Liew, and Y. Shao, “Uncertainty-of-information scheduling: A restless multiarmed bandit framework,” IEEE Transactions on Information Theory, vol. 68, no. 9, pp. 6151–6173, 2022.
  • [36] C. Kam, S. Kompella, G. D. Nguyen, J. E. Wieselthier, and A. Ephremides, “Towards an effective age of information: Remote estimation of a Markov source,” in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2018, pp. 367–372.
  • [37] Y. Sun, Y. Polyanskiy, and E. Uysal, “Sampling of the Wiener process for remote estimation over a channel with random delay,” IEEE Transactions on Information Theory, vol. 66, no. 2, pp. 1118–1135, 2019.
  • [38] T. Z. Ornee and Y. Sun, “Sampling for remote estimation through queues: Age of information and beyond,” in 2019 International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOPT), 2019, pp. 1–8.
  • [39] A. Arafa, K. Banawan, K. G. Seddik, and H. V. Poor, “Sample, quantize, and encode: Timely estimation over noisy channels,” IEEE Transactions on Communications, vol. 69, no. 10, pp. 6485–6499, 2021.
  • [40] X. Wang, W. Lin, C. Xu, X. Sun, and X. Chen, “Age of changed information: Content-aware status updating in the internet of things,” IEEE Transactions on Communications, vol. 70, no. 1, pp. 578–591, 2022.
  • [41] X. Zheng, S. Zhou, and Z. Niu, “Urgency of information for context-aware timely status updates in remote control systems,” IEEE Transactions on Wireless Communications, vol. 19, no. 11, pp. 7237–7250, 2020.
  • [42] J. Zhong, R. D. Yates, and E. Soljanin, “Two freshness metrics for local cache refresh,” in 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1924–1928.
  • [43] A. Maatouk, S. Kriouile, M. Assaad, and A. Ephremides, “The age of incorrect information: A new performance metric for status updates,” IEEE/ACM Transactions on Networking, vol. 28, no. 5, pp. 2215–2228, 2020.
  • [44] J. Cao, X. Zhu, S. Sun, P. Popovski, S. Feng, and Y. Jiang, “Age of Loop for Wireless Networked Control System in the Finite Blocklength Regime: Average, Variance and Outage Probability,” IEEE Transactions on Wireless Communications, pp. 1–1, 2023.
  • [45] A. Nikkhah, A. Ephremides, and N. Pappas, “Age of Actuation in a Wireless Power Transfer System,” in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2023.
  • [46] M. Kountouris and N. Pappas, “Semantics-empowered communication for networked intelligent systems,” IEEE Communications Magazine, vol. 59, pp. 96–102, 2020.
  • [47] N. Pappas and M. Kountouris, “Goal-oriented communication for real-time tracking in autonomous systems,” in 2021 IEEE International Conference on Autonomous Systems (ICAS), 2021, pp. 1–5.
  • [48] M. Salimnejad, M. Kountouris, and N. Pappas, “Real-time Reconstruction of Markov Sources and Remote Actuation over Wireless Channels,” arXiv, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2302.13927
  • [49] E. Fountoulakis, N. Pappas, and M. Kountouris, “Goal-oriented Policies for Cost of Actuation Error Minimization in Wireless Autonomous Systems,” IEEE Communications Letter, 2023.
  • [50] A. Li, S. Wu, S. Meng, and Q. Zhang, “Towards goal-oriented semantic communications: New metrics, open challenges, and future research directions,” Submitted to IEEE Communications Magazine., 2023. [Online]. Available: https://arxiv.org/abs/2304.00848
  • [51] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, “The complexity of decentralized control of markov decision processes,” Mathematics of operations research, vol. 27, no. 4, pp. 819–840, 2002.
  • [52] E. Uysal, O. Kaya, A. Ephremides, J. Gross, M. Codreanu, P. Popovski, M. Assaad, G. Liva, A. Munari, B. Soret et al., “Semantic communications in networked systems: A data significance perspective,” IEEE Network, vol. 36, no. 4, pp. 233–240, 2022.
  • [53] A. Maatouk, M. Assaad, and A. Ephremides, “The age of incorrect information: An enabler of semantics-empowered communication,” IEEE Transactions on Wireless Communications, 2022.
  • [54] C. Kam, S. Kompella, and A. Ephremides, “Age of incorrect information for remote estimation of a binary markov source,” in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2020, pp. 1–6.
  • [55] D. Bertsekas, Dynamic programming and optimal control.   Athena scientific, 2012, vol. 1.
  • [56] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming.   John Wiley & Sons, 2014.
  • [57] R. Nair, M. Tambe, M. Yokoo, D. V. Pynadath, and S. Marsella, “Taming decentralized pomdps: Towards efficient policy computation for multiagent settings,” in Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, 2003.
  • [58] D. P. Bertsekas and J. N. Tsitsiklis, “Neuro-dynamic programming: an overview,” in Proceedings of 1995 34th IEEE conference on decision and control, vol. 1, pp. 560–564.
  • [59] Y. Li, B. Yin, and H. Xi, “Finding optimal memoryless policies of pomdps under the expected average reward criterion,” European Journal of Operational Research, vol. 211, no. 3, pp. 556–567, 2011.

Appendix A The Proof of Lemma 1

By taking into account the conditional independence among X_{t+1}, \hat{X}_{t+1}, and \Phi_{t+1}, given \mathbf{W}_{t}=(X_{t},\hat{X}_{t},\Phi_{t}) and \mathbf{a}(t), we can express the following:

Pr{𝐖t+1=(su,x,vr)|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}\displaystyle\Pr\left\{\mathbf{W}_{t+1}=(s_{u},x,v_{r})\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\} (42)
=Pr{Xt+1=su|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}×\displaystyle=\Pr\left\{X_{t+1}=s_{u}\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\}\times
Pr{X^t+1=x|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}×\displaystyle\Pr\left\{\hat{X}_{t+1}=x\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\}\times
Pr{Φt+1=vr|𝐖t=(si,sj,vk),𝐚(t)=(1,am)},\displaystyle\Pr\left\{\Phi_{t+1}=v_{r}\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\},

wherein the first, second, and third terms can be derived through conditional independence, resulting in simplified expressions of (11), (16), and (12), respectively:

Pr{Xt+1=su|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}=pi,u(k,m),\displaystyle\Pr\left\{X_{t+1}=s_{u}\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\}=p_{i,u}^{(k,m)}, (43)
Pr{X^t+1=x|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}=pS𝟙{x=si}+(1pS)𝟙{x=sj},\displaystyle\Pr\left\{\hat{X}_{t+1}=x\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\}=p_{S}\cdot\mathbbm{1}_{\left\{x=s_{i}\right\}}+(1-p_{S})\cdot\mathbbm{1}_{\left\{x=s_{j}\right\}}, (44)
Pr{Φt+1=vr|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}=pk,r,\displaystyle\Pr\left\{\Phi_{t+1}=v_{r}\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\}=p_{k,r}, (45)

Substituting (43), (44), and (45) into (42) yields the first expression in Lemma 1. In the case where a_{S}(t)=0, we can obtain a similar expression by replacing \mathbf{a}(t)=(1,a_{m}) with \mathbf{a}(t)=(0,a_{m}). Substituting (11), (13), and (12) into this new expression yields (20) in Lemma 1, which completes the proof.
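The factorization in (42)–(45) amounts to a product of three terms, as assembled in the sketch below; the arrays p_source (the controlled source dynamics (11)) and p_context (the context dynamics (12)), as well as their indexing, are hypothetical placeholders introduced for illustration.

```python
def transition_prob(p_source, p_context, p_s, w, w_next, a_m, sampled=True):
    """One entry of Pr{W_{t+1} = w_next | W_t = w, a(t) = (a_S, a_m)} via (42)-(45).

    w, w_next : index tuples (i, j, k) and (u, x, r) for (X_t, X_hat_t, Phi_t)
    p_source  : p_source[k][a_m][i][u] = p_{i,u}^{(k,m)}, controlled source dynamics (11)
    p_context : p_context[k][r] = p_{k,r}, context dynamics (12)
    p_s       : successful transmission probability
    sampled   : True if a_S(t) = 1, False if a_S(t) = 0
    """
    (i, j, k), (u, x, r) = w, w_next
    term_source = p_source[k][a_m][i][u]                        # (43)
    if sampled:
        term_estimate = p_s * (x == i) + (1 - p_s) * (x == j)   # (44)
    else:
        term_estimate = 1.0 if x == j else 0.0                  # estimate frozen without a sample
    term_context = p_context[k][r]                              # (45)
    return term_source * term_estimate * term_context
```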

Appendix B Proof of Lemma 3

QπAπS(𝐰,aA)Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A}) could be simplified as follows:

limsupT1T𝔼(πS,πA)[t=0T1(r(t)ηπAπS)|𝐖0=𝐰,aA(0)=aA]\displaystyle{\mathop{\lim\sup}\limits_{T\to\infty}\frac{1}{T}{\mathbb{E}^{(\pi_{S},\pi_{A})}}\left[{\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)}\left|\mathbf{W}_{0}=\mathbf{w},a_{A}(0)=a_{A}\right.\right]} (46)
=πS(𝐰,aA)ηπAπS+𝐰pπS(𝐰|𝐰,aA)limsupT1T1𝔼(πS,πA)[t=1T1(r(t)ηπAπS)|𝐖1=𝐰]gπAπS(𝐰)\displaystyle=\mathcal{R}^{\pi_{S}}(\mathbf{w},a_{A})-\eta_{\pi_{A}}^{\pi_{S}}+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w},a_{A})\cdot\underbrace{{\mathop{\lim\sup}\limits_{T\to\infty}\frac{1}{T-1}{\mathbb{E}^{(\pi_{S},\pi_{A})}}\left[{\sum\limits_{t=1}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)}\left|\mathbf{W}_{1}=\mathbf{w}^{\prime}\right.\right]}}_{g_{\pi_{A}}^{\pi_{S}}\mathbf{(w}^{\prime})}
=πS(𝐰,aA)ηπAπS+𝐰pπS(𝐰|𝐰,aA)gπAπS(𝐰),\displaystyle=\mathcal{R}^{\pi_{S}}(\mathbf{w},a_{A})-\eta_{\pi_{A}}^{\pi_{S}}+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w},a_{A})g_{\pi_{A}}^{\pi_{S}}\mathbf{(w}^{\prime}),

QπAπS(oA,aA)Q_{\pi_{A}}^{\pi_{S}}(o_{A},a_{A}) could be solved as:

limsupT1T𝔼(πS,πA)[t=0T1(r(t)ηπAπS)|oA(0)=oA,aA(0)=aA]\displaystyle{\mathop{\lim\sup}\limits_{T\to\infty}\frac{1}{T}{\mathbb{E}^{(\pi_{S},\pi_{A})}}\left[{\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)}\left|o_{A}^{(0)}=o_{A},a_{A}(0)=a_{A}\right.\right]} (47)
=𝐰pπAπS(𝐰|oA,aA)limsupT1T1𝔼(πS,πA)[t=0T1(r(t)ηπAπS)|𝐖0=𝐰,aA(0)=aA]QπAπS(𝐰,aA)\displaystyle=\sum_{\mathbf{w}\in\mathcal{I}}p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A},a_{A})\cdot\underbrace{{\mathop{\lim\sup}\limits_{T\to\infty}\frac{1}{T-1}{\mathbb{E}^{(\pi_{S},\pi_{A})}}\left[{\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)}\left|\mathbf{W}_{0}=\mathbf{w},a_{A}(0)=a_{A}\right.\right]}}_{Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A})}
=𝐰pπAπS(𝐰|oA,aA)QπAπS(𝐰,aA).\displaystyle=\sum_{\mathbf{w}\in\mathcal{I}}p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A},a_{A})Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A}).

where p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A},a_{A}) is the posterior conditional probability. Noting that the state \mathbf{w} is independent of a_{A} when o_{A} is known, we have

pπAπS(𝐰|oA,aA)=pπAπS(𝐰|oA)=pπAπS(𝐰,oA)pπAπS(oA)=𝝁πAπS(𝐰)p(oA|𝐰)𝐰𝝁πAπS(𝐰)p(oA|𝐰).\displaystyle p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A},a_{A})=p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A})=\frac{p_{\pi_{A}}^{\pi_{S}}(\mathbf{w},o_{A})}{p_{\pi_{A}}^{\pi_{S}}(o_{A})}=\frac{\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w})p(o_{A}|\mathbf{w})}{\sum_{\mathbf{w}\in\mathcal{I}}\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w})p(o_{A}|\mathbf{w})}. (48)