
Goal-oriented Tensor: Beyond Age of Information Towards Semantics-Empowered Goal-Oriented Communications

Aimin Li, Graduate Student Member, IEEE, Shaohua Wu, Member, IEEE,
Sumei Sun, Fellow, IEEE, and Jie Cao, Member, IEEE
Abstract

Optimizations premised on open-loop metrics such as Age of Information (AoI) enhance the system's decision-making utility only indirectly. We therefore propose a novel closed-loop metric named Goal-oriented Tensor (GoT) to directly quantify the impact of semantic mismatches on goal-oriented decision-making utility. Leveraging the GoT, we consider a sampler & decision-maker pair that works collaboratively and distributively to achieve a shared communication goal. We formulate a two-agent infinite-horizon Decentralized Partially Observable Markov Decision Process (Dec-POMDP) to conjointly deduce the optimal deterministic sampling policy and decision-making policy. To circumvent the curse of dimensionality inherent in obtaining an optimal deterministic joint policy through Brute-Force-Search, a sub-optimal yet computationally efficient algorithm is developed. This algorithm is predicated on the search for a Nash Equilibrium between the sampler and the decision-maker. Simulation results reveal that the proposed sampler & decision-maker co-design surpasses the current literature on AoI and its variants in terms of both goal achievement utility and sparse sampling rate, signifying progress in semantics-conscious, goal-driven sparse sampling design.

Index Terms:
Goal-oriented communications, Goal-oriented Tensor, Status updates, Age of Information, Age of Incorrect Information, Value of Information, Semantics-aware sampling.

I Introduction

The recent advancement of the emerging 5G and beyond has spawned the proliferation of both theoretical development and application instances for Internet of Things (IoT) networks. In such networks, timely status updates are becoming increasingly crucial for enabling real-time monitoring and actuation across a plethora of applications. To address the inadequacies of traditional throughput and delay metrics in such contexts, the Age of Information (AoI) has emerged as an innovative metric to capture the data freshness perceived by the receiver [1], defined as $\mathrm{AoI}(t)=t-U(t)$, where $U(t)$ denotes the generation time of the latest received packet before time $t$. Since its inception, AoI has garnered significant research attention and has been extensively analyzed in queuing systems [2, 3, 4, 5, 6, 7], physical-layer communications [8, 9, 10, 11, 12, 13, 14], MAC-layer communications [15, 16, 17, 18, 19], industrial IoT [20, 21], energy harvesting systems [22, 23, 24, 25], etc. (see [26] and the references therein for a more comprehensive review). These research efforts are driven by the consensus that a freshly received message typically contains more valuable information, thereby enhancing the utility of decision-making processes.

Though AoI has been proven efficient in many freshness-critical applications, it exhibits several critical shortcomings. Specifically, (a) AoI does not provide a direct measure of information value; (b) AoI does not consider the content dynamics of source data and ignores the effect of End-to-End (E2E) information mismatch on the decision-making process.

To address shortcoming (a), a typical approach is to impose a non-linear penalty on AoI [27, 28, 29, 30, 31]. In [27], the authors map the AoI to a non-linear and non-decreasing function $f(\mathrm{AoI}(t))$ to evaluate the degree of “discontent” resulting from stale information. Subsequently, the optimal sampling policy is deduced for an arbitrary non-decreasing penalty function. The authors in [28] introduce two penalty functions, namely the exponential penalty function $a^{\mathrm{AoI}(t)}-1$ and the logarithmic penalty function $\log_{a}\left(\mathrm{AoI}(t+1)\right)$, for evaluating the Cost of Update Delay (CoUD). In [29] and [30], the binary indicator function $\mathbbm{1}_{\left\{\mathrm{AoI}(t)>d\right\}}$ is applied to evaluate whether the most recently received message is up-to-date. Specifically, the penalty assumes a value of $1$ when the instantaneous AoI surpasses a predetermined threshold $d$; otherwise, the penalty reverts to $0$. The freshness of web crawling is evaluated through this AoI-threshold binary indicator function. In [31], an analogous binary indicator approach is implemented in caching systems to evaluate the freshness of information.

The above works tend to tailor a particular non-linear penalty function to evaluate the information value. However, the selection of penalty functions in the above research relies exclusively on empirical configurations, devoid of rigorous derivations. To this end, several information-theoretic techniques strive to explicitly derive the non-linear penalty function in terms of AoI [32, 33, 34, 35]. One such quintessential work is the auto-correlation function $\mathbb{E}\left[X_{t}^{*}X_{t-\mathrm{AoI}(t)}\right]$, which proves to be a non-linear function of AoI when the source is stationary [32]. Another methodology worth noting is the mutual information metric between the present source state $X_{t}$ and the vector consisting of an ensemble of successfully received updates $\mathbf{W}_{t}$ [33, 34]. In the context of a Markovian source, this metric can be reduced to $I(X_{t};X_{t-\mathrm{AoI}(t)})$, which demonstrates a non-linear dependency on AoI under both the Gaussian Markov source and the Binary Markov source [33]. In [34], an analogous approach is utilized to characterize the value of information (VoI) for the Ornstein-Uhlenbeck (OU) process, which likewise demonstrates a non-linear dependency on AoI. In [35], the conditional entropy $H(X_{t}|\mathbf{W}_{t})$ is further investigated to measure the uncertainty of the source for a remote estimator given the history of received updates $\mathbf{W}_{t}$. When applied to a Markov source, this conditional entropy simplifies to $H(X_{t}|X_{t-\mathrm{AoI}(t)})$, thus exemplifying a non-linear penalty associated with AoI.

To address shortcoming (b), substantial research efforts have been invested in the optimization of the Mean Squared Error (MSE), denoted by $(X_{t}-\hat{X}_{t})^{2}$, with the ultimate objective of constructing a remote estimation system for real-time reconstruction [36, 37, 38, 39]. In [36], a metric termed effective age is proposed to minimize the MSE for the remote estimation of a Markov source. In [37] and [38], two Markov sources of interest, the Wiener process and the OU process, are investigated to deduce the MSE-optimal sampling policy. Intriguingly, these policies are found to be threshold-based, activating sampling only when the instantaneous MSE exceeds a predefined threshold and otherwise remaining idle. The authors in [39] explore the trade-off between MSE and quantization over a noisy channel, and derive the MSE-optimal sampling strategy for the OU process.

Complementary to the above MSE-centered research, variants of AoI that address shortcoming (b) have also been conceptualized [40, 41, 42, 43]. In [40], the Age of Changed Information (AoCI) is proposed to address AoI's ignorance of content dynamics. In this regard, unchanged statuses do not necessarily provide new information and thus are not prioritized for transmission. In [41], the authors introduce a context-aware weighting coefficient to propose the Urgency of Information (UoI), a metric capable of measuring the weighted MSE in diverse urgency contexts. In [42], the authors propose a novel age penalty named Age of Synchronization (AoS), a metric quantifying the time elapsed since the most recent end-to-end synchronization. Moreover, considering that an E2E status mismatch may exert a detrimental effect on the overall system's performance over time, the authors of [43] propose a novel metric called Age of Incorrect Information (AoII). This metric quantifies the adverse effect stemming from the duration of the E2E mismatch, revealing that both the degree and duration of E2E semantic mismatches lead to a utility reduction for subsequent decision-making.

The above studies focused on the open-loop generation-to-reception process within a transmitter-receiver pair, neglecting the closed-loop perception-actuation timeliness. A notable development addressing this issue is the extension from Up/Down-Link (UL/DL) AoI to a closed-loop AoI metric, referred to as the Age of Loop (AoL) [44]. Unlike the traditional open-loop AoI, which diminishes upon successful reception of a unidirectional update, the AoL decreases solely when both the UL status and the DL command are successfully received. Another advanced metric in [45], called Age of Actuation (AoA), also encapsulates the actuation timeliness, which proves pertinent when the receiver employs the received update to execute timely actuation.

Figure 1: Interconnections of Age of Information and Beyond in the literature.

Notwithstanding the above advancements, the question of how the E2E mismatch affects the closed-loop utility of decision-making has yet to be addressed. To address this issue, [46, 47, 48, 49] introduce a metric termed Cost of Actuation Error to delve deeper into the cost resulting from erroneous actuation due to imprecise real-time estimations. Specifically, the Cost of Actuation Error is denoted by an asymmetric zero-diagonal matrix $\mathbf{C}$, with each value $C_{X_{t},\hat{X}_{t}}$ representing the instant cost under the E2E mismatch status $(X_{t},\hat{X}_{t})_{X_{t}\neq\hat{X}_{t}}$. In this regard, the authors reveal that the utility of decision-making bears a close relation to the E2E semantic mismatch category, as opposed to the mismatch duration (AoII) or mismatch degree (MSE). For example, an E2E semantic mismatch in which “Fire” is estimated as “No Fire” will result in a high cost, while in the opposite scenario the cost is low. Nonetheless, we notice that i) the method to obtain a Cost of Actuation Error remains unclear, as it implicitly necessitates a pre-established decision-making policy; ii) the Cost of Actuation Error does not consider context-varying factors, which may also affect the decision-making utility; iii) the zero-diagonal property of the matrix implies the supposition that if $X_{t}=\hat{X}_{t}$, then $C_{X_{t},\hat{X}_{t}}=0$, thereby signifying that error-less actuation necessitates no energy expenditure. Fig. 1 provides an overview of the existing metrics.

Against this background, the present authors have recently proposed a new metric referred to as GoT in [50], which, compared to the Cost of Actuation Error, introduces the new dimensions of the context $\Phi_{t}$ and the decision-making policy $\pi_{A}$ to describe the true utility of decision-making. Remarkably, we find that GoT offers a versatile degeneration to established metrics such as AoI, MSE, UoI, AoII, and Cost of Actuation Error. Additionally, it provides a balanced evaluation of the cost trade-off between sampling and decision-making. The contributions of this work can be summarized as follows:

\bullet We focus on the decision utility issue directly by employing the GoT. A controlled Markov source is observed, wherein the transition of the source depends on both the decision-making at the receiver and the contextual situation in which it is situated. In this case, the decision-making affects utility in three aspects: i) the future evolution of the source; ii) the instant cost at the source; iii) the energy and resources consumed by actuation.

\bullet We accomplish the goal-oriented sampler & decision-maker co-design, which, to the best of our knowledge, represents the first work that addresses the trade-off between semantics-aware sampling and goal-oriented decision-making. Specifically, we formulate this problem as a two-agent infinite-horizon Decentralized Partially Observable Markov Decision Process (Dec-POMDP) problem, with one agent embodying the semantics and context-aware sampler, and the other representing the goal-oriented decision-maker. Noting that the optimal solution of even a finite-horizon Dec-POMDP is known to be NEXP-hard [51], we develop the RVI-Brute-Force-Search Algorithm, which seeks to derive optimal deterministic joint policies for both sampling and decision-making. A thorough discussion on the optimality of our algorithm is also presented.

\bullet To further mitigate the “curse of dimensionality” intricately linked with the execution of the optimal RVI-Brute-Force-Search, we introduce a low-complexity yet efficient algorithm to solve the problem. The algorithm is designed by decoupling the problem into two distinct components: a Markov Decision Process (MDP) problem and a Partially Observable Markov Decision Process (POMDP) problem. Following this separation, the algorithm searches for a joint Nash Equilibrium between the sampler and the decision-maker, providing a sub-optimal solution to this goal-oriented communication-decision-making co-design.

II Goal-oriented Tensor

II-A Specific Examples of Goal-Oriented Communications

Consider a time-slotted communication system. Let $X_{t}\in\mathcal{S}$ represent the perceived semantic status at time slot $t$ at the source, and $\hat{X}_{t}\in\mathcal{S}$ denote the estimated semantic status constructed at time slot $t$. Real-time reconstruction-oriented communication is a special type of goal-oriented communication, whose goal is to achieve accurate real-time reconstruction:

\min \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}(X_{t}-\hat{X}_{t})^{2}. \qquad (1)

Although real-time reconstruction is important, it does not represent the ultimate goals of communications, such as implementing accurate actuation (as opposed to merely real-time reconstruction) and minimizing the system’s long-term Cost of Actuation Error, as in [46, 47, 48, 49]:

\min \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}C_{X_{t},\hat{X}_{t}}, \qquad (2)

where $C_{X_{t},\hat{X}_{t}}$ represents the instantaneous cost of the system at time slot $t$. This cost is derived from the erroneous actuation stemming from the status mismatch between transceivers.

Following the avenue of Cost of Actuation Error, we notice that the matrix-based metric could be further augmented to tensors to realize more flexible goal characterizations. For example, drawing parallels with the concept of Urgency of Information [41], we can introduce a context element $\Phi_{t}$ to incorporate context-aware attributes into this metric (note that the GoT could be expanded into higher dimensions by integrating additional components, including actuation policies, task-specific attributes, and other factors). Accordingly, we can define a three-dimensional GoT through a specified mapping: $(X_{t},\Phi_{t},\hat{X}_{t})\in\mathcal{S}\times\mathcal{V}\times\mathcal{S}\overset{\mathcal{L}}{\rightarrow}\mathrm{GoT}(t)\in\mathbb{R}$, which is visualized in Fig. 2. In this regard, the GoT, denoted by $\mathcal{L}(X_{t},\Phi_{t},\hat{X}_{t})$ or $\mathrm{GoT}(t)$, indicates the instantaneous cost of the system at time slot $t$, given the knowledge of $(X_{t},\Phi_{t},\hat{X}_{t})$. Consequently, the overarching goal of this system can be succinctly expressed as:

\min \limsup_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1}\mathrm{GoT}(t). \qquad (3)

II-B Degeneration to Existing Metrics

In this subsection, we demonstrate, through visualized examples and mathematical formulations, that a three-dimensional GoT can degenerate to existing metrics. Fig. 2 showcases a variety of instances of the GoT metric.

\bullet Degeneration to AoI: AoI is generally defined as $\mathrm{AoI}(t)\triangleq t-\max\left\{G_{i}:D_{i}<t\right\}$, where $G_{i}$ is the generation time stamp of the $i$-th status update and $D_{i}$ represents the corresponding delivery time slot. Since AoI is known to be semantics-agnostic [52, 46], the values in the tensor only depend on the freshness context $\Phi_{t}\triangleq\mathrm{AoI}(t)$. In this case, each tensor value given a determined context status $\Phi_{t}$ is a constant, and the GoT reduces to

\mathrm{GoT}(t)=\mathcal{L}(X_{t},\Phi_{t},\hat{X}_{t})\overset{(a)}{=}\Phi_{t}=\mathrm{AoI}(t), \qquad (4)

where (a) indicates that AoI is semantics-agnostic. In this case, the context is exactly the AoI. The process of reducing GoT to AoI is visually depicted in Fig. 2(a).

Figure 2: Visualization of the GoT.

\bullet Degeneration to AoII: AoII is defined as $\mathrm{AoII}(t)\triangleq f(\mathrm{AoS}(t))\cdot g(X_{t},\hat{X}_{t})$, where $\mathrm{AoS}(t)\triangleq t-\max\left\{\tau:\tau\leq t,X_{\tau}=\hat{X}_{\tau}\right\}$. AoII embraces the semantics-aware characteristics and is hence regarded as an enabler of semantic communications [53]. The inherent multiplicative characteristic of AoII guarantees the existence of a base layer of the tensor representation, from which other layers are derived by multiplying this base layer, as depicted in Fig. 2(b). Letting $\Phi_{t}\triangleq f(\mathrm{AoS}(t))$, the GoT reduces to

\mathrm{GoT}(t)=\mathcal{L}\left(X_{t},\Phi_{t},\hat{X}_{t}\right)\overset{(a)}{=}\Phi_{t}\cdot g\left(X_{t},\hat{X}_{t}\right)=f(\mathrm{AoS}(t))\cdot g(X_{t},\hat{X}_{t})=\mathrm{AoII}(t), \qquad (5)

where $g(X_{t},\hat{X}_{t})$, typically characterized by $\mathbbm{1}_{\left\{X_{t}\neq\hat{X}_{t}\right\}}$, represents the base layer in the tensor, and $(a)$ indicates the inherent multiplicative characteristic of AoII. The visual representation of the GoT reduction to AoII is depicted in Fig. 2(b).

\bullet Degeneration to MSE: MSE is defined as $\mathrm{MSE}(t)\triangleq\left(X_{t}-\hat{X}_{t}\right)^{2}$. MSE is intuitively context-agnostic. In the scenario where $g(X_{t},\hat{X}_{t})=(X_{t}-\hat{X}_{t})^{2}$, the GoT collapses to the MSE metric:

\mathrm{GoT}(t)=\mathcal{L}\left(X_{t},\Phi_{t},\hat{X}_{t}\right)\overset{(a)}{=}g\left(X_{t},\hat{X}_{t}\right)=\left(X_{t}-\hat{X}_{t}\right)^{2}=\mathrm{MSE}(t), \qquad (6)

where $(a)$ holds due to the context-agnostic nature of MSE. The visualization of the GoT reduction to MSE is shown in Fig. 2(c).

\bullet Degeneration to UoI: UoI is defined by $\mathrm{UoI}(t)\triangleq\Phi_{t}\cdot\left(X_{t}-\hat{X}_{t}\right)^{2}$, where the context-aware weighting coefficient $\Phi_{t}$ is further introduced [41]. When $g(X_{t},\hat{X}_{t})=(X_{t}-\hat{X}_{t})^{2}$, the GoT can be transformed into the UoI by

\mathrm{GoT}(t)=\mathcal{L}\left(X_{t},\Phi_{t},\hat{X}_{t}\right)\overset{(a)}{=}\Phi_{t}\cdot g\left(X_{t},\hat{X}_{t}\right)=\Phi_{t}\cdot\left(X_{t}-\hat{X}_{t}\right)^{2}=\mathrm{UoI}(t), \qquad (7)

where $(a)$ indicates the inherent multiplicative characteristic of UoI. The visualization of the GoT reduction to UoI is shown in Fig. 2(d).

\bullet Degeneration to Cost of Actuation Error: The Cost of Actuation Error is defined by $C_{X_{t},\hat{X}_{t}}$, which indicates the instantaneous system cost when the source status $X_{t}$ and the estimated status $\hat{X}_{t}$ mismatch [48]. Letting $g(X_{t},\hat{X}_{t})=C_{X_{t},\hat{X}_{t}}$, the GoT collapses to the Cost of Actuation Error:

\mathrm{GoT}(t)=\mathcal{L}\left(X_{t},\Phi_{t},\hat{X}_{t}\right)\overset{(a)}{=}g\left(X_{t},\hat{X}_{t}\right)=C_{X_{t},\hat{X}_{t}}, \qquad (8)

where $(a)$ holds due to the context-agnostic nature of the Cost of Actuation Error. The visualization of the GoT reduction to Cost of Actuation Error is shown in Fig. 2(e).
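To make the degenerations above concrete, the short Python sketch below builds a generic GoT lookup table and shows how particular choices of the context layer and the base layer recover AoII-, MSE-, and UoI-style costs. The alphabets and penalty choices are hypothetical illustrations, not values taken from this paper:

import numpy as np

# A minimal sketch: a GoT is a mapping (X_t, Phi_t, X_hat_t) -> instantaneous cost,
# realized here as a |S| x |V| x |S| array of per-slot costs.

S = np.array([0.0, 1.0, 2.0])          # hypothetical semantic alphabet (numeric for MSE/UoI)
V = np.array([1.0, 2.0, 3.0])          # hypothetical context alphabet (e.g., f(AoS) or weights)

def build_got(context_layer, base_layer):
    """GoT[i, k, j] = context_layer(V[k]) * base_layer(S[i], S[j])  (multiplicative structure)."""
    got = np.empty((len(S), len(V), len(S)))
    for i, x in enumerate(S):
        for k, phi in enumerate(V):
            for j, x_hat in enumerate(S):
                got[i, k, j] = context_layer(phi) * base_layer(x, x_hat)
    return got

# Degeneration to AoII: Phi_t = f(AoS(t)) and g = indicator of mismatch.
got_aoii = build_got(context_layer=lambda phi: phi,
                     base_layer=lambda x, x_hat: float(x != x_hat))

# Degeneration to MSE: context-agnostic (layer constantly 1) and g = squared error.
got_mse = build_got(context_layer=lambda phi: 1.0,
                    base_layer=lambda x, x_hat: (x - x_hat) ** 2)

# Degeneration to UoI: context-aware weight times squared error.
got_uoi = build_got(context_layer=lambda phi: phi,
                    base_layer=lambda x, x_hat: (x - x_hat) ** 2)

# Example lookup: cost when X_t = 2, Phi_t = 3, X_hat_t = 0.
print(got_aoii[2, 2, 0], got_mse[2, 2, 0], got_uoi[2, 2, 0])  # 3.0, 4.0, 12.0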

II-C Goal Characterization Through GoT

As shown in Fig. 2(f), a more general GoT exhibits neither symmetry, a zero-diagonal property, nor a base layer, offering considerable versatility contingent upon the specific goal. Here we propose a four-step method for constructing a specific GoT:

\bullet Step 1: Clarify the scenario and the communication goal. For instance, in a wireless accident monitoring and rescue system, the goal is to minimize the long-term cumulative average cost associated with accidents and their corresponding rescue interventions.

\bullet Step 2: Define the sets of semantic status, $\mathcal{S}$, and contextual status, $\mathcal{V}$. These sets can be modeled as collections of discrete components. For instance, the set $\mathcal{S}$ might encompass the gravity of industrial mishaps in intelligent factories, whereas the set $\mathcal{V}$ could encompass the meteorological circumstances, which potentially influence the source dynamics and the costs.

\bullet Step 3: The GoT can be decoupled into three factors (the definitions of costs are diverse, covering aspects such as financial loss, resource usage, or a normalized metric derived through scaling, depending on the specific objective they are designed to address):

i) The status inherent cost $C_{1}(X_{t},\Phi_{t})$, which quantifies the cost associated with different status pairs $(X_{t},\Phi_{t})$ in the absence of external influences;

ii) The actuation gain $C_{2}(\pi_{A}(\hat{X}_{t}))$, where $\pi_{A}$ is a deterministic decision policy contingent upon $\hat{X}_{t}$. This term quantifies the positive utility resulting from the actuation $\pi_{A}(\hat{X}_{t})$;

iii) The actuation resource expenditure $C_{3}(\pi_{A}(\hat{X}_{t}))$, which reflects the resources consumed by a particular actuation $\pi_{A}(\hat{X}_{t})$.

\bullet Step 4: Construct the GoT. Given a specific triple $(X_{t},\hat{X}_{t},\Phi_{t})$ and a certain decision strategy $\pi_{A}$, the GoT is calculated by

\mathrm{GoT}^{\pi_{A}}(t)=\mathcal{L}(X_{t},\Phi_{t},\hat{X}_{t},\pi_{A})=\left[C_{1}(X_{t},\Phi_{t})-aC_{2}(\pi_{A}(\hat{X}_{t}))\right]^{+}+bC_{3}(\pi_{A}(\hat{X}_{t})). \qquad (9)

The ramp function $\left[\cdot\right]^{+}$ ensures that any actuation $\pi_{A}(\hat{X}_{t})$ can reduce the cost at most to $0$. A visualization of a specific GoT construction is shown in Fig. 3, where the GoT is obtained through the following definition:

\begin{array}{c}C_{1}(X_{t},\Phi_{t})=\left(\begin{array}{c|ccc}{}&0&1&2\\ \hline 0&0&1&3\\ 1&0&2&5\end{array}\right),\quad\pi_{A}(\hat{X}_{t})=\left(\begin{array}{ccc}0&1&2\\ \hline a_{0}&a_{1}&a_{2}\end{array}\right),\\ C_{2}(a_{A}(t))=\left(\begin{array}{ccc}a_{0}&a_{1}&a_{2}\\ \hline 0&2&4\end{array}\right),\quad C_{3}(a_{A}(t))=\left(\begin{array}{ccc}a_{0}&a_{1}&a_{2}\\ \hline 0&1&2\end{array}\right).\end{array} \qquad (10)
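As a worked illustration of Step 4, the following sketch enumerates the GoT tensor of (9) using the example costs in (10). The weights $a$ and $b$ in (9) are not fixed by the example, so a = b = 1 is assumed purely for illustration:

import numpy as np

# A sketch of Step 4: building the GoT tensor of (9) from the example costs in (10).
# The scaling weights a and b in (9) are not specified here; we assume a = b = 1.

C1 = np.array([[0, 1, 3],      # row: context Phi_t = 0, columns: X_t = 0, 1, 2
               [0, 2, 5]])     # row: context Phi_t = 1
pi_A = {0: 'a0', 1: 'a1', 2: 'a2'}          # deterministic decision policy pi_A(X_hat_t)
C2 = {'a0': 0, 'a1': 2, 'a2': 4}            # actuation gain
C3 = {'a0': 0, 'a1': 1, 'a2': 2}            # actuation resource expenditure
a, b = 1, 1                                  # assumed weights

def got(x, phi, x_hat):
    """GoT^{pi_A}(t) = [C1(X_t, Phi_t) - a*C2(pi_A(X_hat_t))]^+ + b*C3(pi_A(X_hat_t))."""
    action = pi_A[x_hat]
    return max(C1[phi, x] - a * C2[action], 0) + b * C3[action]

# Enumerate the full |S| x |V| x |S| tensor.
tensor = np.array([[[got(x, phi, x_hat) for x_hat in range(3)]
                    for phi in range(2)]
                   for x in range(3)])
print(tensor[2, 1, 0])  # X_t = 2, Phi_t = 1, X_hat = 0: [5 - 0]^+ + 0 = 5
print(tensor[2, 1, 2])  # X_t = 2, Phi_t = 1, X_hat = 2: [5 - 4]^+ + 2 = 3
print(tensor[0, 0, 2])  # X_t = 0, Phi_t = 0, X_hat = 2: [0 - 4]^+ + 2 = 2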

III System Model

Figure 3: A visualized example for characterizing the GoT through (9), where $[\cdot]^{+}$ represents the ramp function.
Figure 4: Illustration of our considered system, where the transmitted semantic status arrives at a receiver for decision-making to achieve a certain goal.

We consider a time-slotted perception-actuation loop where both the perceived semantics $X_{t}\in\mathcal{S}=\left\{s_{1},\cdots,s_{|\mathcal{S}|}\right\}$ and the context $\Phi_{t}\in\mathcal{V}=\left\{v_{1},\cdots,v_{|\mathcal{V}|}\right\}$ are input into a semantic sampler, tasked with determining the significance of the present status $X_{t}$ and subsequently deciding whether it warrants transmission via an unreliable channel. The semantics and context are extracted and assumed to perfectly describe the status of the observed process. The binary indicator $a_{S}(t)=\pi_{S}(X_{t},\Phi_{t},\hat{X}_{t})\in\left\{0,1\right\}$ signifies the sampling and transmission action at time slot $t$, with the value $1$ representing the execution of sampling and transmission, and the value $0$ indicating the idleness of the sampler. Here $\pi_{S}$ represents the sampling policy. We consider a perfect and delay-free feedback channel [46, 47, 48, 49], with ACK representing a successful transmission and NACK representing a failed one. The decision-maker at the receiver makes decisions $a_{A}(t)\in\mathcal{A}_{A}=\left\{a_{1},\cdots,a_{|\mathcal{A}_{A}|}\right\}$ based on the estimate $\hat{X}_{t}$ (we consider a general abstract decision-making set $\mathcal{A}_{A}$ that exhibits adaptability to diverse applications; notably, this decision-making set can be customized and tailored to suit specific requirements), which will ultimately affect the utility of the system. An illustration of our considered model is shown in Fig. 4.

III-A Semantics and Context Dynamics

Thus far, a plethora of studies have delved into the analysis of various discrete Markov sources, encompassing the Birth-Death Markov Processes elucidated in [49] and the binary Markov sources investigated in [54], among others. In real-world situations, the context and the actuation also affect the source's evolution. Consequently, we consider a context-dependent controlled Discrete Markov source:

\Pr\left(X_{t+1}=s_{u}\left|X_{t}=s_{i},a_{A}(t)=a_{m},\Phi_{t}=v_{k}\right.\right)=p_{i,u}^{(k,m)}, \qquad (11)

where the dynamics of the source depend on both the decision-making $a_{A}(t)$ and the context $\Phi_{t}$. Furthermore, we take into account the variations in context $\Phi_{t}$, characterized by:

\Pr\left(\Phi_{t+1}=v_{r}\left|\Phi_{t}=v_{k}\right.\right)=p_{k,r}. \qquad (12)

Note that the semantic status $X_{t}$ and the context status $\Phi_{t}$ could be tailored according to the specific application scenario. In general, these two processes are independent of each other.
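The sketch below simulates one step of the dynamics in (11) and (12); the transition probabilities are random placeholders, used only to make the example runnable:

import numpy as np

# A sketch of the dynamics (11)-(12): a controlled, context-dependent Markov source
# and an independent Markov context. All transition probabilities below are
# hypothetical placeholders, chosen only to make the sketch runnable.

rng = np.random.default_rng(0)

n_S, n_V, n_A = 2, 2, 2                       # |S| semantic states, |V| contexts, |A_A| actions
# p_source[k, m, i, u] = Pr(X_{t+1}=s_u | X_t=s_i, a_A(t)=a_m, Phi_t=v_k)   (eq. (11))
p_source = rng.dirichlet(np.ones(n_S), size=(n_V, n_A, n_S))
# p_context[k, r] = Pr(Phi_{t+1}=v_r | Phi_t=v_k)                           (eq. (12))
p_context = rng.dirichlet(np.ones(n_V), size=n_V)

def step(x, phi, a_A):
    """Draw (X_{t+1}, Phi_{t+1}) given the current semantic state, context, and decision."""
    x_next = rng.choice(n_S, p=p_source[phi, a_A, x])
    phi_next = rng.choice(n_V, p=p_context[phi])
    return x_next, phi_next

x, phi = 0, 0
for t in range(5):
    x, phi = step(x, phi, a_A=0)   # keep the decision fixed just for illustration
    print(t, x, phi)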

III-B Unreliable Channel and Estimate Transition

Figure 5: An illustration of AoI, AoCI, AoII, and GoT in a time-slotted status update system. Here, the value of GoT is taken from the tensor shown on the right-hand side of Fig. 3.

We assume that the channel realizations are independent and identically distributed (i.i.d.) across time slots, following a Bernoulli distribution. Particularly, the channel realization $h_{t}$ assumes a value of $1$ in the event of successful transmission, and $0$ otherwise. Accordingly, we define the probability of successful transmission as $\Pr\left(h_{t}=1\right)=p_{S}$ and the failure probability as $\Pr\left(h_{t}=0\right)=1-p_{S}$. To characterize the dynamic process of $\hat{X}_{t}$, we consider two cases as described below:

\bullet $a_{S}(t)=0$. In this case, the sampler and transmitter remain idle, meaning that no new knowledge is given to the receiver, i.e., $\hat{X}_{t+1}=\hat{X}_{t}$. As such, we have:

\Pr\left(\hat{X}_{t+1}=x\left|\hat{X}_{t}=s_{j},a_{S}(t)=0\right.\right)=\mathbbm{1}_{\left\{x=s_{j}\right\}}. \qquad (13)

\bullet $a_{S}(t)=1$. In this case, the sampler and transmitter transmit the current semantic status $X_{t}$ through an unreliable channel. As the channel is unreliable, we differentiate between two distinct situations: $h_{t}=1$ and $h_{t}=0$:

(a) $h_{t}=1$. In this case, the transmission is successful. As such, the estimate at the receiver becomes $\hat{X}_{t+1}=X_{t}$, and the transition probability is

\Pr\left(\hat{X}_{t+1}=x\left|\hat{X}_{t}=s_{j},X_{t}=s_{i},a_{S}(t)=1,h_{t}=1\right.\right)=\mathbbm{1}_{\left\{x=s_{i}\right\}}. \qquad (14)

(b) $h_{t}=0$. In this case, the transmission is not successfully decoded by the receiver. As such, the estimate at the receiver remains unchanged, i.e., $\hat{X}_{t+1}=\hat{X}_{t}$. In this way, the transition probability is

\Pr\left(\hat{X}_{t+1}=x\left|\hat{X}_{t}=s_{j},X_{t}=s_{i},a_{S}(t)=1,h_{t}=0\right.\right)=\mathbbm{1}_{\left\{x=s_{j}\right\}}. \qquad (15)

As the channel realization $h_{t}$ is independent of the processes $X_{t}$, $\hat{X}_{t}$, and $a_{S}(t)$, we have that

\Pr\left(\hat{X}_{t+1}=x\left|\hat{X}_{t}=s_{j},X_{t}=s_{i},a_{S}(t)=1\right.\right)
=\sum_{h_{t}}p\left(h_{t}\right)\Pr\left(\hat{X}_{t+1}=x\left|\hat{X}_{t}=s_{j},X_{t}=s_{i},a_{S}(t)=1,h_{t}\right.\right)
=p_{S}\cdot\mathbbm{1}_{\left\{x=s_{i}\right\}}+(1-p_{S})\cdot\mathbbm{1}_{\left\{x=s_{j}\right\}}. \qquad (16)

Combining (13) with (16) yields the dynamics of the estimate.
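As a quick illustration of (13)-(16), the following sketch updates the receiver-side estimate for both sampling actions, assuming a hypothetical success probability p_S; averaging over many channel draws recovers the mixture in (16):

import numpy as np

# Sketch of the estimate dynamics (13)-(16): the estimate changes to X_t only if the
# sampler transmits (a_S = 1) and the Bernoulli(p_S) channel realization succeeds.
# p_S and the alphabet below are illustrative assumptions.

rng = np.random.default_rng(1)
p_S = 0.8

def estimate_update(x_hat, x, a_S):
    """Return X_hat_{t+1} given X_hat_t, X_t, and the sampling action a_S(t)."""
    if a_S == 0:
        return x_hat                      # (13): no update, estimate is frozen
    h = rng.random() < p_S                # channel realization h_t ~ Bernoulli(p_S)
    return x if h else x_hat              # (14) on success, (15) on failure

# Averaging over many channel realizations recovers the mixture in (16).
samples = [estimate_update(x_hat='s1', x='s2', a_S=1) for _ in range(10000)]
print(sum(s == 's2' for s in samples) / len(samples))  # approximately p_S = 0.8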

III-C Goal-oriented Decision-making and Actuation

We note that previous works primarily focus on minimizing the open-loop freshness-related or error-related penalty for a transmitter-receiver system. Nevertheless, irrespective of fresh delivery or accurate end-to-end timely reconstruction, the ultimate goal of such optimization efforts is to ensure precise and effective decision-making. To this end, we broaden the open-loop transmitter-receiver information flow to a perception-actuation closed-loop utility flow by incorporating the decision-making and actuation processes. As a result, decision-making and actuation enable the conversion of status updates into ultimate effectiveness. Here the decision-making at time slot $t$ follows $a_{A}(t)=\pi_{A}(\hat{X}_{t})$, with $\pi_{A}$ representing the deterministic decision-making policy.

IV Problem Formulation and Solution

Traditionally, the development of sampling strategies has been designed separately from the decision-making process. An archetypal illustration of this two-stage methodology involves first determining the optimal sampling policy based on AoI, MSE, or their derivatives, such as AoII, and subsequently accomplishing goal-oriented decision-making. This two-stage separate design arises from an inherent limitation of existing metrics: they fail to capture the closed-loop decision utility. Nevertheless, the GoT metric empowers us to undertake a co-design of sampling and decision-making.

We adopt team decision theory, wherein two agents, one embodying the sampler and the other the decision-maker, collaborate to achieve a shared goal. We aim at determining a joint deterministic policy $\boldsymbol{\pi}_{C}=(\pi_{S},\pi_{A})$ that minimizes the long-term average cost of the system. It is considered that the sampling and transmission of an update also consume energy, incurring a cost $C_{S}$. In this case, the instant cost of the system can be written as $\mathrm{GoT}^{\pi_{A}}(t)+C_{S}\cdot a_{S}(t)$, and the problem is characterized as:

\mathcal{P}1:\quad\mathop{\min}\limits_{\boldsymbol{\pi}_{C}\in\Upsilon}\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{\boldsymbol{\pi}_{C}}\left(\sum\limits_{t=0}^{T-1}\mathrm{GoT}^{\pi_{A}}(t)+C_{S}\cdot a_{S}(t)\right), \qquad (17)

where $\boldsymbol{\pi}_{C}=(\pi_{S},\pi_{A})$ denotes the joint sampling and decision-making policy, comprising $\pi_{S}=(a_{S}(0),a_{S}(1),\cdots)$ and $\pi_{A}=(a_{A}(0),a_{A}(1),\cdots)$, which correspond to the sampling action sequence and the actuation sequence, respectively. Note that $\mathrm{GoT}^{\pi_{A}}(t)$ is characterized by (9).

IV-A Dec-POMDP Formulation

The problem in (17) aims to find the optimal decentralized policy $\boldsymbol{\pi}_{C}$ so that the long-term average cost of the system is minimized. To solve the problem $\mathcal{P}1$, we formulate a Dec-POMDP problem, a framework initially introduced in [51] to solve cooperative sequential decision issues for distributed multi-agent systems. Within a Dec-POMDP framework, a team of agents cooperates to achieve a shared goal, relying solely on their localized knowledge. A typical Dec-POMDP is denoted by a tuple $\mathscr{M}_{\mathrm{Dec-POMDP}}\triangleq\left\langle n,\mathcal{I},\mathcal{A},\mathcal{T},\Omega,\mathcal{O},\mathcal{R}\right\rangle$:

\bullet $n$ denotes the number of agents. In this instance, we have $n=2$, signifying the presence of two agents within this context: one agent, $\mathcal{A}gent_{S}$, embodies the semantics-context-aware sampler and transmitter, while the other, denoted by $\mathcal{A}gent_{A}$, represents the $\hat{X}_{t}$-dependent decision-maker.

\bullet $\mathcal{I}$ is the finite set of global system statuses, characterized by $(X_{t},\hat{X}_{t},\Phi_{t})\in\mathcal{S}\times\mathcal{S}\times\mathcal{V}$. For the sake of brevity, we henceforth denote $\mathbf{W}_{t}=(X_{t},\hat{X}_{t},\Phi_{t})$ in the sequel.

\bullet $\mathcal{T}$ is the transition function defined by

\mathcal{T}\left(\mathbf{w},\mathbf{a},\mathbf{w}^{\prime}\right)\triangleq\Pr(\mathbf{W}_{t+1}=\mathbf{w}^{\prime}|\mathbf{W}_{t}=\mathbf{w},\mathbf{a}_{t}=\mathbf{a}), \qquad (18)

i.e., the transition probability from global status $\mathbf{W}_{t}=\mathbf{w}$ to status $\mathbf{W}_{t+1}=\mathbf{w}^{\prime}$ after the agents in the system take a joint action $\mathbf{a}_{t}=\mathbf{a}=(a_{S}(t),a_{A}(t))$. For the sake of concise notation, we let $p(\mathbf{w}^{\prime}|\mathbf{w},\mathbf{a})$ symbolize $\mathcal{T}\left(\mathbf{w},\mathbf{a},\mathbf{w}^{\prime}\right)$ in the subsequent discourse. Then, by taking into account the conditional independence among $X_{t+1}$, $\Phi_{t+1}$, and $\hat{X}_{t+1}$, given $(X_{t},\Phi_{t},\hat{X}_{t})$ and $\mathbf{a}(t)$, the transition functions can be calculated as in Lemma 1.

Lemma 1.

The transition functions of the Dec-POMDP:

p\left((s_{u},x,v_{r})\left|(s_{i},s_{j},v_{k}),(1,a_{m})\right.\right)=p_{i,u}^{(k,m)}\cdot p_{k,r}\cdot\left(p_{S}\cdot\mathbbm{1}_{\left\{x=s_{i}\right\}}+(1-p_{S})\cdot\mathbbm{1}_{\left\{x=s_{j}\right\}}\right), \qquad (19)
p\left((s_{u},x,v_{r})\left|(s_{i},s_{j},v_{k}),(0,a_{m})\right.\right)=p_{i,u}^{(k,m)}\cdot p_{k,r}\cdot\mathbbm{1}_{\left\{x=s_{j}\right\}}, \qquad (20)

for any $x\in\mathcal{S}$ and indexes $i$, $j$, $u\in\left\{1,2,\cdots,|\mathcal{S}|\right\}$, $k$, $r\in\left\{1,2,\cdots,|\mathcal{V}|\right\}$, and $m\in\left\{1,2,\cdots,|\mathcal{A}_{A}|\right\}$.

Proof.

The transition function can be derived by incorporating the dynamics in equations (11), (12), (13), and (16). A more comprehensive proof is presented in Appendix A. ∎

\bullet $\mathcal{A}=\mathcal{A}_{S}\times\mathcal{A}_{A}$, with $\mathcal{A}_{S}\triangleq\left\{0,1\right\}$ representing the set of binary sampling actions executed by the sampler, and $\mathcal{A}_{A}\triangleq\left\{a_{0},\cdots,a_{M-1}\right\}$ representing the set of decision actions undertaken by the actuator.

\bullet $\Omega=\Omega_{S}\times\Omega_{A}$ constitutes a finite set of joint observations. Generally, the observation made by a single agent regarding the system status is partially observable. $\Omega_{S}$ signifies the sampler's observation domain. In this instance, the sampler $\mathcal{A}gent_{S}$ is entirely observable, with $\Omega_{S}$ encompassing the comprehensive system state $o_{S}^{(t)}=\mathbf{W}_{t}$. $\Omega_{A}$ signifies the actuator's observation domain. In this case, the actuator (or decision-maker) $\mathcal{A}gent_{A}$ is partially observable, with $\Omega_{A}$ comprising $o_{A}^{(t)}=\hat{X}_{t}$. The joint observation at time instant $t$ is denoted by $\mathbf{o}_{t}=(o_{S}^{(t)},o_{A}^{(t)})$.

\bullet $\mathcal{O}=\mathcal{O}_{S}\times\mathcal{O}_{A}$ represents the observation function, where $\mathcal{O}_{S}$ and $\mathcal{O}_{A}$ denote the observation functions of the sampler $\mathcal{A}gent_{S}$ and the actuator $\mathcal{A}gent_{A}$, respectively, defined as:

\mathcal{O}(\mathbf{w},\mathbf{o})\triangleq\Pr(\mathbf{o}_{t}=\mathbf{o}|\mathbf{W}_{t}=\mathbf{w}),
\mathcal{O}_{S}(\mathbf{w},o_{S})\triangleq\Pr(o_{S}^{(t)}=o_{S}|\mathbf{W}_{t}=\mathbf{w}),
\mathcal{O}_{A}(\mathbf{w},o_{A})\triangleq\Pr(o_{A}^{(t)}=o_{A}|\mathbf{W}_{t}=\mathbf{w}). \qquad (21)

The observation function of an agent $\mathcal{A}gent_{i}$ signifies the conditional probability of agent $\mathcal{A}gent_{i}$ perceiving $o_{i}$, contingent upon the prevailing global system state being $\mathbf{W}_{t}=\mathbf{w}$. For the sake of brevity, we henceforth let $p_{A}(o_{A}|\mathbf{w})$ represent $\mathcal{O}_{A}(\mathbf{w},o_{A})$ and $p_{S}(o_{S}|\mathbf{w})$ represent $\mathcal{O}_{S}(\mathbf{w},o_{S})$ in the subsequent discourse. In our considered model, the observation functions are deterministic, as characterized by Lemma 2.

Lemma 2.

The observation functions of the Dec-POMDP:

p_{S}\left((s_{u},s_{r},v_{q})|(s_{i},s_{j},v_{k})\right)=\mathbbm{1}_{\left\{(s_{u},s_{r},v_{q})=(s_{i},s_{j},v_{k})\right\}},\quad p_{A}\left(s_{z}|(s_{i},s_{j},v_{k})\right)=\mathbbm{1}_{\left\{s_{z}=s_{j}\right\}}, \qquad (22)

for indexes $z$, $i$, $j$, $u$, $r\in\left\{1,2,\cdots,|\mathcal{S}|\right\}$, and $k$, $q\in\left\{1,2,\cdots,|\mathcal{V}|\right\}$.

\bullet $\mathcal{R}$ is the reward function, characterized as a mapping $\mathcal{I}\times\mathcal{A}\rightarrow\mathbb{R}$. In the long-term average reward maximizing setup, resolving a Dec-POMDP is equivalent to addressing the problem $\mathop{\min}\limits_{\boldsymbol{\pi}_{C}\in\Upsilon}\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{\boldsymbol{\pi}_{C}}\left(-\sum_{t=0}^{T-1}r(t)\right)$. Subsequently, to establish congruence with the problem in (17), the reward function is defined as:

r(t)=\mathcal{R}^{\pi_{A}}(\mathbf{w},a_{S})=-\mathrm{GoT}^{\pi_{A}}(t)-C_{S}\cdot a_{S}(t). \qquad (23)
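To see how Lemma 1 translates into an implementable object, the sketch below assembles the full Dec-POMDP transition function p(w'|w, a) from the source dynamics (11), the context dynamics (12), and the estimate dynamics (13)-(16); all dimensions and probabilities are illustrative assumptions:

import numpy as np

# A sketch of Lemma 1: assembling the Dec-POMDP transition function
# p(w' | w, a) for w = (X, X_hat, Phi) and a = (a_S, a_A). Dimensions and
# probabilities are illustrative placeholders, not values from this paper.

rng = np.random.default_rng(2)
n_S, n_V, n_A = 2, 2, 2
p_S = 0.8
p_source = rng.dirichlet(np.ones(n_S), size=(n_V, n_A, n_S))   # [k, m, i, u], eq. (11)
p_context = rng.dirichlet(np.ones(n_V), size=n_V)               # [k, r], eq. (12)

# T[i, j, k, aS, m, u, x, r] = p((s_u, x, v_r) | (s_i, s_j, v_k), (aS, a_m))
T = np.zeros((n_S, n_S, n_V, 2, n_A, n_S, n_S, n_V))
for i in range(n_S):
    for j in range(n_S):
        for k in range(n_V):
            for m in range(n_A):
                for u in range(n_S):
                    for x in range(n_S):
                        for r in range(n_V):
                            base = p_source[k, m, i, u] * p_context[k, r]
                            # (20): idle sampler, the estimate stays at s_j
                            T[i, j, k, 0, m, u, x, r] = base * (x == j)
                            # (19): transmission, the estimate moves to s_i w.p. p_S
                            T[i, j, k, 1, m, u, x, r] = base * (p_S * (x == i) + (1 - p_S) * (x == j))

# Each conditional distribution over (u, x, r) sums to one.
print(np.allclose(T.sum(axis=(5, 6, 7)), 1.0))  # True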

IV-B Solutions to the Infinite-Horizon Dec-POMDP

In general, solving a Dec-POMDP is known to be NEXP-complete for the finite-horizon setup [51], signifying that it necessitates formulating a conjecture about the solution non-deterministically, while each validation of a conjecture demands exponential time. For the infinite-horizon setup, finding an optimal policy for a Dec-POMDP is known to be undecidable. Nevertheless, within our considered model, both the sampling and decision-making processes are deterministic, given as $a_{S}(t)=\pi_{S}(\mathbf{w})$ and $a_{A}(t)=\pi_{A}(o_{A})$. In such a circumstance, it is feasible to determine a joint optimal deterministic policy via Brute-Force-Search across the decision-making policy space.

IV-B1 Optimal Solution

The idea is based on the finding that, given a deterministic decision-making policy $\pi_{A}$, the sampling problem can be formulated as a standard fully observed MDP problem denoted by $\mathscr{M}^{\pi_{A}}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{T}^{\pi_{A}},\mathcal{A}_{S},\mathcal{R}\rangle$.

Definition 1.

Given a deterministic decision-making policy $\pi_{A}$, the optimal sampling problem can be formulated as a typical fully observed MDP problem $\mathscr{M}^{\pi_{A}}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{S},\mathcal{T}_{\mathrm{MDP}}^{\pi_{A}},\mathcal{R}\rangle$, where the elements are given as follows:

  • \mathcal{I}: the same as the pre-defined Dec-POMDP tuple.

  • $\mathcal{A}_{S}=\left\{0,1\right\}$: the sampling and transmission action set.

  • $\mathcal{T}^{\pi_{A}}$: the transition function given a deterministic decision-making policy $\pi_{A}$, which is

    \mathcal{T}^{\pi_{A}}(\mathbf{w},a_{S},\mathbf{w}^{\prime})=p^{\pi_{A}}\left(\mathbf{w}^{\prime}|\mathbf{w},a_{S}\right)=\sum_{o_{A}\in\mathcal{O}_{A}}p\left(\mathbf{w}^{\prime}|\mathbf{w},(a_{S},\pi_{A}(o_{A}))\right)p_{A}(o_{A}|\mathbf{w}), \qquad (24)

    where $p\left(\mathbf{w}^{\prime}|\mathbf{w},(a_{S},\pi_{A}(o_{A}))\right)$ can be obtained by Lemma 1 and $p_{A}(o_{A}|\mathbf{w})$ can be obtained by Lemma 2.

  • \mathcal{R}: the same as the pre-defined Dec-POMDP tuple.

We now proceed to solve the MDP problem $\mathscr{M}^{\pi_{A}}_{\mathrm{MDP}}$, which is characterized by the tuple $\langle\mathcal{I},\mathcal{T}^{\pi_{A}},\mathcal{A}_{S},\mathcal{R}\rangle$. In order to deduce the optimal sampling policy under a deterministic decision-making policy $\pi_{A}$, it is imperative to resolve the Bellman equations [55]:

\theta^{*}_{\pi_{A}}+V_{\pi_{A}}(\mathbf{w})=\mathop{\max}\limits_{a_{S}\in\mathcal{A}_{S}}\left\{\mathcal{R}^{\pi_{A}}(\mathbf{w},a_{S})+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p(\mathbf{w}^{\prime}|\mathbf{w},a_{S})V_{\pi_{A}}(\mathbf{w}^{\prime})\right\}, \qquad (25)

where $V_{\pi_{A}}(\mathbf{w})$ is the value function and $\theta^{*}_{\pi_{A}}$ is the optimal long-term average reward given the decision-making policy $\pi_{A}$. We apply the relative value iteration (RVI) algorithm to solve this problem. The details are shown in Algorithm 1:

Input: The MDP tuple $\langle\mathcal{I},\mathcal{A}_{S},\mathcal{T}^{\pi_{A}},\mathcal{R}\rangle$, $\epsilon$, $\pi_{A}$;
1 Initialization: $\forall\mathbf{w}\in\mathcal{I}$, $\tilde{V}^{0}_{\pi_{A}}(\mathbf{w})=0$, $\tilde{V}^{-1}_{\pi_{A}}(\mathbf{w})=\infty$, $k=0$;
2 Choose $\mathbf{w}^{ref}$ arbitrarily;
3 while $||\tilde{V}_{\pi_{A}}^{k}(\mathbf{w})-\tilde{V}_{\pi_{A}}^{k-1}(\mathbf{w})||\geq\epsilon$ do
4     $k=k+1$;
5     for $\mathbf{w}\in\mathcal{I}-\mathbf{w}^{ref}$ do
6         $\tilde{V}_{\pi_{A}}^{k}(\mathbf{w})=-g_{k}+\mathop{\max}\limits_{a_{S}}\left\{\mathcal{R}(\mathbf{w},a_{S})+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}-\mathbf{w}^{ref}}p(\mathbf{w}^{\prime}|\mathbf{w},a_{S})\tilde{V}^{k-1}_{\pi_{A}}(\mathbf{w}^{\prime})\right\}$;
7 $\theta^{*}(\pi_{A},\pi_{S}^{*})=-\tilde{V}^{k}_{\pi_{A}}(\mathbf{w})+\mathop{\max}\limits_{a_{S}\in\mathcal{A}_{S}}\left\{\mathcal{R}(\mathbf{w},a_{S})+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p(\mathbf{w}^{\prime}|\mathbf{w},a_{S})\tilde{V}^{k}_{\pi_{A}}(\mathbf{w}^{\prime})\right\}$;
8 for $\mathbf{w}\in\mathcal{I}$ do
9     $\pi_{S}^{*}(\pi_{A},\mathbf{w})=\mathop{\arg\max}\limits_{a_{S}}\left\{\mathcal{R}(\mathbf{w},a_{S})+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p(\mathbf{w}^{\prime}|\mathbf{w},a_{S})\tilde{V}^{k}_{\pi_{A}}(\mathbf{w}^{\prime})\right\}$;
Output: $\pi_{S}^{*}(\pi_{A})$, $\theta^{*}(\pi_{A},\pi_{S}^{*})$
Algorithm 1 The RVI Algorithm to Solve the MDP Given $\pi_{A}$
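For concreteness, a minimal Python sketch of relative value iteration on a generic average-reward MDP is given below. The random transition tensor and rewards are placeholders standing in for the tuple of Definition 1, not quantities derived from this paper's model:

import numpy as np

# A sketch of Algorithm 1: relative value iteration (RVI) for an average-reward MDP
# with a transition tensor P[a, s, s'] and reward R[s, a]. The tiny random MDP below
# is an illustrative stand-in for the tuple <I, A_S, T^{pi_A}, R> of Definition 1.

rng = np.random.default_rng(3)
n_states, n_actions = 6, 2
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, :] sums to 1
R = rng.normal(size=(n_states, n_actions))                         # reward R(s, a)

def rvi(P, R, eps=1e-9, ref=0, max_iter=100000):
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q(s, a) = R(s, a) + sum_{s'} P(s'|s, a) V(s')
        Q = R + np.einsum('asn,n->sa', P, V)
        V_new = Q.max(axis=1)
        gain = V_new[ref]                 # current estimate of the average reward
        V_new = V_new - gain              # keep values relative to the reference state
        if np.max(np.abs(V_new - V)) < eps:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)             # greedy (deterministic) policy
    return gain, V, policy

theta, V, pi_S = rvi(P, R)
print('average reward:', theta)
print('deterministic policy:', pi_S)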

With Definition 1 and Algorithm 1 in hand, we can then perform a Brute-Force-Search across the decision-making policy space $\Upsilon_{A}$, thereby acquiring the joint sampling and decision-making policy. The resulting algorithm is called the RVI-Brute-Force-Search Algorithm, which is elaborated in Algorithm 2. In the following theorem, we discuss the optimality of the RVI-Brute-Force-Search Algorithm.

Theorem 1.

The RVI-Brute-Force-Search Algorithm (Algorithm 2) achieves the optimal joint deterministic policies $(\pi_{S}^{*},\pi_{A}^{*})$, given that the transition function $\mathcal{T}^{\pi_{A}}$ is unichain.

Proof. If the transition function $\mathcal{T}^{\pi_{A}}$ is unichain, we obtain from [56, Theorem 8.4.5] that for any $\pi_{A}$, we can obtain the optimal deterministic policy $\pi_{S}^{*}$ such that $\theta^{*}({\pi_{A}},\pi_{S}^{*})\leq\theta^{*}({\pi_{A}},\pi_{S})$. Also, Algorithm 2 assures that for any $\pi_{A}$, $\theta^{*}({\pi^{*}_{A}},\pi_{S}^{*})\leq\theta^{*}({\pi_{A}},\pi_{S}^{*})$. This leads to the conclusion that for any $\boldsymbol{\pi}_{C}=(\pi_{S},\pi_{A})\in\Upsilon$, we have that

\theta^{*}({\pi^{*}_{A}},\pi_{S}^{*})\leq\theta^{*}({\pi_{A}},\pi_{S}^{*})\leq\theta^{*}({\pi_{A}},\pi_{S}). \qquad (26)

Nonetheless, the Brute-Force-Search across the decision-making policy space remains computationally expensive, as the size of the decision-making policy space $\Upsilon_{A}$ amounts to $|\Upsilon_{A}|=|\mathcal{A}_{A}|^{|\mathcal{O}_{A}|}$. This implies that executing the RVI algorithm $|\mathcal{A}_{A}|^{|\mathcal{O}_{A}|}$ times is necessary to attain the optimal policy. Consequently, although proven to be optimal, such an algorithm is ill-suited for scenarios where $|\mathcal{O}_{A}|$ and $|\mathcal{A}_{A}|$ are considerably large. To ameliorate this challenge, we propose a sub-optimal, yet computation-efficient alternative in the subsequent section.
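The following short sketch illustrates how quickly this policy space grows: a deterministic decision-making policy is simply a mapping from observations to actions, so the space can be enumerated with a Cartesian product (the alphabet sizes below are hypothetical):

from itertools import product

# Enumerating the deterministic decision-making policy space Upsilon_A: every mapping
# from the actuator's observation set (here |O_A| = 4 estimates) to actions in A_A.
# Sizes are illustrative; the count |A_A| ** |O_A| is what makes brute force expensive.

observations = ['s1', 's2', 's3', 's4']        # hypothetical O_A
actions = ['a0', 'a1', 'a2']                    # hypothetical A_A

policy_space = [dict(zip(observations, assignment))
                for assignment in product(actions, repeat=len(observations))]
print(len(policy_space), len(actions) ** len(observations))   # 81 81

# The RVI-Brute-Force-Search of Algorithm 2 would run Algorithm 1 once per entry here.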

IV-B2 A Sub-optimal Solution

The method here is to instead find a locally optimal algorithm to circumvent the high complexity of the Brute-Force-Search-based approach. We apply the Joint Equilibrium-Based Search for Policies (JESP) to find Nash equilibrium solutions [57]. Within this framework, the sampling policy is optimally responsive to the decision-making policy and vice versa, i.e., $\forall\pi_{S},\pi_{A}$: $\theta(\pi_{S}^{*},\pi_{A}^{*})\leq\theta(\pi_{S},\pi_{A}^{*})$ and $\theta(\pi_{S}^{*},\pi_{A}^{*})\leq\theta(\pi_{S}^{*},\pi_{A})$.

Input: The Dec-POMDP tuple $\mathscr{M}_{\mathrm{Dec-POMDP}}\triangleq\left\langle n,\mathcal{I},\mathcal{A},\mathcal{T},\Omega,\mathcal{O},\mathcal{R}\right\rangle$;
1 for $\pi_{A}\in\Upsilon_{A}$ do
2     Formulate the MDP problem $\mathscr{M}^{\pi_{A}}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{S},\mathcal{T}_{\mathrm{MDP}}^{\pi_{A}},\mathcal{R}\rangle$ as given in Definition 1;
3     Run Algorithm 1 to obtain $\pi_{S}^{*}(\pi_{A})$ and $\theta^{*}({\pi_{A}},\pi_{S}^{*})$;
4 Calculate the optimal joint policy: $\pi_{A}^{*}=\mathop{\arg\min}\nolimits_{\pi_{A}}\theta^{*}({\pi_{A}},\pi_{S}^{*})$ and $\pi_{S}^{*}=\pi_{S}^{*}(\pi^{*}_{A})$;
Output: $\pi_{S}^{*}$, $\pi_{A}^{*}$
Algorithm 2 The RVI-Brute-Force-Search Algorithm

To search for the Nash equilibrium, we first search for the optimal sampling policy given a prescribed decision-making policy. This problem can be formulated as a standard fully observed MDP problem denoted by $\mathscr{M}^{\pi_{A}}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{S},\mathcal{T}_{\mathrm{MDP}}^{\pi_{A}},\mathcal{R}\rangle$ (see Definition 1). Next, we alternately fix the sampling policy $\pi_{S}$ and solve for the optimal decision-making policy $\pi_{A}$. This problem can be modeled as a memoryless partially observable Markov decision process (POMDP), denoted by $\mathscr{M}^{\pi_{S}}_{\mathrm{POMDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{A},\Omega_{A},\mathcal{O}_{A},\mathcal{T}^{\pi_{S}}_{\mathrm{POMDP}},\mathcal{R}\rangle$ (see Definition 2). Then, by alternately iterating between $\mathcal{A}gent_{S}$ and $\mathcal{A}gent_{A}$, we can obtain a Nash equilibrium between the two agents.

Definition 2.

Given a deterministic sampling policy $\pi_{S}$, the optimal decision-making problem can be formulated as a memoryless POMDP problem $\mathscr{M}^{\pi_{S}}_{\mathrm{POMDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{A},\Omega_{A},\mathcal{O}_{A},\mathcal{T}^{\pi_{S}}_{\mathrm{POMDP}},\mathcal{R}\rangle$, where the elements are given as follows:

  • $\mathcal{I}$, $\Omega_{A}$, $\mathcal{A}_{A}$, and $\mathcal{O}_{A}$: the same as in the pre-defined Dec-POMDP tuple.

  • $\mathcal{T}^{\pi_{S}}_{\mathrm{POMDP}}$: the transition function given a deterministic sampling policy $\pi_{S}$, which is

    \mathcal{T}^{\pi_{S}}_{\mathrm{POMDP}}(\mathbf{w},a_{A},\mathbf{w}^{\prime})=p^{\pi_{S}}\left(\mathbf{w}^{\prime}|\mathbf{w},a_{A}\right)=p\left(\mathbf{w}^{\prime}|\mathbf{w},(\pi_{S}(\mathbf{w}),a_{A})\right), \qquad (27)

    where $p\left(\mathbf{w}^{\prime}|\mathbf{w},(\pi_{S}(\mathbf{w}),a_{A})\right)$ can be obtained by Lemma 1.

  • $\mathcal{R}$: the reward function is denoted as $\mathcal{R}^{\pi_{S}}(\mathbf{w},a_{A})$, which can be obtained by (9).

We then proceed to solve the memoryless POMDP problem discussed in Definition 2 to obtain the deterministic decision-making policy. Denoting $p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w})$ as the transition probability $\Pr\left\{\mathbf{W}_{t+1}=\mathbf{w}^{\prime}|\mathbf{W}_{t}=\mathbf{w}\right\}$ under the policies $\pi_{A}$ and $\pi_{S}$, we have that

p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w})=\sum_{o_{A}\in\mathcal{O}_{A}}p(o_{A}|\mathbf{w})\sum_{a_{A}\in\mathcal{A}_{A}}p^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w},a_{A})\pi_{A}(a_{A}|o_{A}), \qquad (28)

where $p(o_{A}|\mathbf{w})$ can be obtained by Lemma 2 and $p^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w},a_{A})$ is obtained by (27). By assuming ergodicity of $p^{\pi_{S}}_{\pi_{A}}(\mathbf{w}^{\prime}|\mathbf{w})$ and rewriting it as a matrix $\mathbf{P}_{\pi_{A}}^{\pi_{S}}$, we can then solve for the stationary distribution of the system status under the policies $\pi_{S}$ and $\pi_{A}$, denoted as $\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}$, by solving the balance equations:

\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}\mathbf{P}_{\pi_{A}}^{\pi_{S}}=\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}},\quad\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}\mathbf{e}=1, \qquad (29)

where $\mathbf{e}$ is the all-one vector $[1,\cdots,1]_{|\mathcal{I}|\times 1}$, and $\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}$ can be solved by Cramer's rule. Denote ${\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w})$ as the stationary probability of $\mathbf{w}$. Also, we denote $r_{\pi_{A}}^{\pi_{S}}(\mathbf{w})$ as the expected reward of global system status $\mathbf{w}$ under policies $\pi_{A}$ and $\pi_{S}$, which can be calculated as:

r_{\pi_{A}}^{\pi_{S}}(\mathbf{w})=\sum_{o_{A}\in\mathcal{O}_{A}}p(o_{A}|\mathbf{w})\mathcal{R}^{\pi_{S}}(\mathbf{w},\pi_{A}(o_{A})). \qquad (30)

The performance measure is the long-term average reward:

\eta_{\pi_{A}}^{\pi_{S}}=\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{(\pi_{S},\pi_{A})}\left(\sum\limits_{t=0}^{T-1}r(t)\right)=\sum_{\mathbf{w}\in\mathcal{I}}{\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w})\cdot r_{\pi_{A}}^{\pi_{S}}(\mathbf{w}). \qquad (31)

With $\eta_{\pi_{A}}^{\pi_{S}}$ in hand, we then introduce the relative reward $g_{\pi_{A}}^{\pi_{S}}(\mathbf{w})$, defined by

g_{\pi_{A}}^{\pi_{S}}(\mathbf{w})\triangleq\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{(\pi_{S},\pi_{A})}\left[\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)\left|\mathbf{W}_{0}=\mathbf{w}\right.\right], \qquad (32)

which satisfies the Poisson equations [58]:

\eta_{\pi_{A}}^{\pi_{S}}+g_{\pi_{A}}^{\pi_{S}}(\mathbf{w})=r_{\pi_{A}}^{\pi_{S}}(\mathbf{w})+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w})g_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}). \qquad (33)

Denote $\mathbf{g}^{\pi_{S}}_{\pi_{A}}$ as the vector consisting of $g_{\pi_{A}}^{\pi_{S}}(\mathbf{w})$ and $\mathbf{r}^{\pi_{S}}_{\pi_{A}}$ as the vector consisting of ${r}^{\pi_{S}}_{\pi_{A}}(\mathbf{w})$, $\mathbf{w}\in\mathcal{I}$. Then $\mathbf{g}^{\pi_{S}}_{\pi_{A}}$ can be solved by utilizing [56]:

\mathbf{g}^{\pi_{S}}_{\pi_{A}}=\left[(I-\mathbf{P}^{\pi_{S}}_{\pi_{A}}+\mathbf{e}\boldsymbol{\mu}^{\pi_{S}}_{\pi_{A}})^{-1}-\mathbf{e}\boldsymbol{\mu}^{\pi_{S}}_{\pi_{A}}\right]\mathbf{r}^{\pi_{S}}_{\pi_{A}}. \qquad (34)
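A compact numerical sketch of (29), (31), and (34) for a generic ergodic chain is given below; the transition matrix and per-state rewards are random placeholders rather than quantities derived from the model, and the Poisson equation (33) is used as a sanity check:

import numpy as np

# Sketch of (29), (31) and (34): stationary distribution mu, long-term average reward
# eta, and relative reward vector g for a fixed policy pair (pi_S, pi_A). The chain
# P and per-state expected rewards r below are random, illustrative placeholders.

rng = np.random.default_rng(4)
n = 5
P = rng.dirichlet(np.ones(n), size=n)        # P[w, w'] = p_{pi_A}^{pi_S}(w' | w), eq. (28)
r = rng.normal(size=n)                        # r(w), eq. (30)

# (29): mu P = mu, mu e = 1  ->  solve the linear system (append the normalization row).
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
mu, *_ = np.linalg.lstsq(A, b, rcond=None)

# (31): eta = sum_w mu(w) r(w)
eta = mu @ r

# (34): g = [(I - P + e mu)^{-1} - e mu] r, with e the all-one column vector.
e = np.ones((n, 1))
g = (np.linalg.inv(np.eye(n) - P + e @ mu[None, :]) - e @ mu[None, :]) @ r

# Sanity check: the Poisson equation (33), eta + g = r + P g, holds componentwise.
print(np.allclose(eta + g, r + P @ g))        # True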

With the relative reward $g_{\pi_{A}}^{\pi_{S}}$ in hand, we then introduce $Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A})$ and $Q_{\pi_{A}}^{\pi_{S}}(o_{A},a_{A})$ as follows:

Lemma 3.

$Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A})$ and $Q_{\pi_{A}}^{\pi_{S}}(o_{A},a_{A})$ are defined and calculated as:

Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A})\triangleq\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{(\pi_{S},\pi_{A})}\left[\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)\left|\mathbf{W}_{0}=\mathbf{w},a_{A}(0)=a_{A}\right.\right]=\mathcal{R}^{\pi_{S}}(\mathbf{w},a_{A})-\eta_{\pi_{A}}^{\pi_{S}}+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w},a_{A})g_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}), \qquad (35)
Q_{\pi_{A}}^{\pi_{S}}(o_{A},a_{A})\triangleq\limsup_{T\to\infty}\frac{1}{T}\mathbb{E}^{(\pi_{S},\pi_{A})}\left[\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)\left|o_{A}^{(0)}=o_{A},a_{A}(0)=a_{A}\right.\right]=\sum_{\mathbf{w}\in\mathcal{I}}p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A})Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A}), \qquad (36)

where $p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w})$ can be obtained by (28) and $p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A})$ can be obtained by the Bayesian formula:

p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A})=\frac{{\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w})p(o_{A}|\mathbf{w})}{\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}{\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w}^{\prime})p(o_{A}|\mathbf{w}^{\prime})}. \qquad (37)
Proof.

Please refer to Appendix B. ∎

With $Q_{\pi_{A}}^{\pi_{S}}(o_{A},a_{A})$ in hand, it is then easy to conduct the Policy Iteration (PI) Algorithm with Step Sizes [59] to iteratively improve the deterministic memoryless decision-making policy $\pi_{A}$. The detailed steps are shown in Algorithm 3.

Input: The POMDP tuple POMDPπS,𝒜A,ΩA,𝒪A,𝒯POMDPπS,\mathscr{M}^{\pi_{S}}_{\mathrm{POMDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{A},\Omega_{A},\mathcal{O}_{A},\mathcal{T}^{\pi_{S}}_{\mathrm{POMDP}},\mathcal{R}\rangle, ϵ\epsilon, πS\pi_{S};
1 Initialization: randomly choose decision-making policy πA(1)\pi_{A}^{(1)}, ηπA(0)πS=0\eta_{{\pi_{A}}^{(0)}}^{\pi_{S}}=0, ηπA(1)πS=\eta_{{\pi_{A}}^{(-1)}}^{\pi_{S}}=\infty, k=0k=0;
2 while |ηπA(k)πSηπA(k1)πS|ϵ|\eta_{{\pi_{A}}^{(k)}}^{\pi_{S}}-\eta_{{\pi_{A}}^{(k-1)}}^{\pi_{S}}|\geq\epsilon do
3       k=k+1k=k+1;
4       Calculate the transition probability pπA(k)πS(𝐰|𝐰)p_{{\pi_{A}}^{(k)}}^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w}) by (28);
5       Solve the stationary distribution 𝝁πA(k)πS\boldsymbol{\mu}_{{\pi_{A}}^{(k)}}^{\pi_{S}} by the stationary equations (29);
6       Calculate the expectation reward rπA(k)πS(𝐰)r_{{\pi_{A}}^{(k)}}^{\pi_{S}}(\mathbf{w}) by (30) and the long-term average reward ηπA(k)πS\eta_{{\pi_{A}}^{(k)}}^{\pi_{S}} by (31);
7       Calculate the relative reward gπA(k)πS(𝐰)g_{{\pi_{A}}^{(k)}}^{\pi_{S}}(\mathbf{w}) by (34);
8       for oA𝒪Ao_{A}\in\mathcal{O}_{A} do
9             for aA𝒜Aa_{A}\in\mathcal{A}_{A} do
10                   Calculate QπA(k)πS(oA,aA)Q_{{\pi_{A}}^{(k)}}^{\pi_{S}}(o_{A},a_{A}) by Lemma 3;
11            
12      for oA𝒪Ao_{A}\in\mathcal{O}_{A} do
13             πA(|oA)=argmaxπA(|oA)QπA(k)πS(oA,aA)\pi_{A}(\cdot|o_{A})=\mathop{\arg\max}\limits_{\pi_{A}(\cdot|o_{A})}Q_{\pi_{A}^{(k)}}^{\pi_{S}}(o_{A},a_{A});
14             \pi_{A}^{(k)}(\cdot|o_{A})=\pi_{A}^{(k-1)}(\cdot|o_{A})+\delta_{k}*(\pi_{A}(\cdot|o_{A})-\pi_{A}^{(k-1)}(\cdot|o_{A}));
15      
16for oA𝒪Ao_{A}\in\mathcal{O}_{A} do
17       πA(oA)=argmaxaAπA(k)(aA|oA)\pi_{A}^{*}(o_{A})=\mathop{\arg\max}\limits_{a_{A}}\pi_{A}^{(k)}(a_{A}|o_{A});
18θ(πS,πA)=ηπAπS\theta^{*}(\pi_{S},\pi_{A}^{*})=\eta_{{\pi_{A}}^{*}}^{\pi_{S}};
Output: πA(πS)\pi_{A}^{*}(\pi_{S}), θ(πS,πA)\theta^{*}(\pi_{S},\pi_{A}^{*})
Algorithm 3 The PI Algorithm with Step Size to Solve the POMDP Given πS\pi_{S}
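For clarity, a compact Python sketch of the main loop of Algorithm 3 is given below. The helper routines pomdp.transition, pomdp.stationary, pomdp.avg_reward, and pomdp.q_values stand in for (28)–(31), (34), and Lemma 3; they, together with the diminishing step-size schedule, are assumptions made for illustration rather than the authors' implementation.

```python
import numpy as np

def policy_iteration_with_step_sizes(pomdp, pi_S, eps=1e-6, max_iter=500):
    """Sketch of Algorithm 3: memoryless PI with step sizes for the POMDP
    induced by a fixed sampling policy pi_S. All pomdp.* helpers are
    placeholders for the quantities derived in (28)-(31), (34), and Lemma 3."""
    num_obs, num_act = pomdp.num_obs, pomdp.num_act
    # stochastic decision-making policy pi_A(a | o), initialized at random
    pi_A = np.random.dirichlet(np.ones(num_act), size=num_obs)
    eta_prev, eta = np.inf, 0.0
    k = 0
    while abs(eta - eta_prev) >= eps and k < max_iter:
        k += 1
        P = pomdp.transition(pi_S, pi_A)                       # (28)
        mu = pomdp.stationary(P)                               # (29)
        eta_prev, eta = eta, pomdp.avg_reward(mu, pi_S, pi_A)  # (30)-(31)
        Q = pomdp.q_values(pi_S, pi_A, mu, eta)                # Lemma 3: Q(o_A, a_A)
        greedy = np.eye(num_act)[Q.argmax(axis=1)]             # greedy improvement per o_A
        delta = 1.0 / (k + 1)                                  # assumed diminishing step size
        pi_A = pi_A + delta * (greedy - pi_A)                  # move pi_A towards the greedy policy
    # final deterministic policy: most probable action per observation
    return pi_A.argmax(axis=1), eta
```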

Thus far, we have solved two problems: i) by capitalizing on Definition 1 and Algorithm 1, we have ascertained an optimal sampling policy \pi_{S}^{*} contingent upon a given decision-making policy \pi_{A}; ii) by harnessing Definition 2 and Algorithm 3, we have determined an optimal decision-making policy \pi_{A}^{*} predicated on a given sampling policy \pi_{S}. Consequently, we can iteratively employ Algorithm 1 and Algorithm 3 in an alternating fashion, whereby Algorithm 1 yields the optimal sampling policy \pi_{S}^{*(k)}(\pi_{A}^{(k-1)}), which subsequently serves as an input to Algorithm 3 to derive the decision-making policy \pi_{A}^{*(k)}(\pi_{S}^{*(k)}). The procedure persists until the average reward \theta^{*}(\pi_{S}^{*(k)},\pi_{A}^{*(k)}) converges, indicating that the solution reaches a Nash equilibrium between the sampler and the decision-maker. The intricacies of the procedure are delineated in Algorithm 4.

Remark 1.

Generally, the JESP algorithm needs to be restarted with multiple randomly chosen initial decision-making policies \pi_{A}^{*(1)} to ensure a good solution, since a poor initialization of \pi_{A}^{*(1)} often leads to a poor local optimum. We here investigate a heuristic initialization to find a good solution quickly and reliably. Specifically, we assume that the decision-maker can fully observe the state and solve the following MDP problem:

Definition 3.

\mathscr{M}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{A},\mathcal{T}_{\mathrm{MDP}},\mathcal{R}\rangle, where the elements are given as follows:

  • \mathcal{I}: the set of (Xt,Φt)𝒮×𝒱(X_{t},\Phi_{t})\in\mathcal{S}\times\mathcal{V}.

  • \mathcal{A}_{A}=\left\{a_{0},\cdots,a_{M-1}\right\}: the decision-making set.

  • 𝒯\mathcal{T}: the transition function, given as

    𝒯((Xt,Φt),aA,(Xt+1,Φt+1))\displaystyle\mathcal{T}((X_{t},\Phi_{t}),a_{A},(X_{t+1},\Phi_{t+1})) (38)
    =p(Xt+1|Xt,aA,Φt)p(Φt+1|Φt),\displaystyle=p(X_{t+1}|X_{t},a_{A},\Phi_{t})\cdot p(\Phi_{t+1}|\Phi_{t}),

    where p(Xt+1|Xt,aA,Φt)p(X_{t+1}|X_{t},a_{A},\Phi_{t}) and p(Φt+1|Φt)p(\Phi_{t+1}|\Phi_{t}) could be obtained by (11) and (12), respectively.

  • \mathcal{R}: the same as the pre-defined Dec-POMDP tuple.

By solving the above MDP problem, we can explicitly obtain the Q-function Q(X_{t},\Phi_{t},a_{A}). Defining Q(X_{t},a_{A}) as {\mathbb{E}}_{\Phi_{t}}[Q(X_{t},\Phi_{t},a_{A})], the initial decision-making policy \pi_{A}^{*(1)} is given as

πA(1)(X^t)=argminaAQ(X^t,aA).\pi_{A}^{*(1)}(\hat{X}_{t})={\arg\min}_{a_{A}}Q(\hat{X}_{t},a_{A}). (39)
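A possible realization of this heuristic initialization is sketched below, assuming the fully observable MDP of Definition 3 has already been solved (e.g., by relative value iteration) to yield Q(X_{t},\Phi_{t},a_{A}); the array names and the supplied marginal of \Phi_{t} are hypothetical.

```python
import numpy as np

def initial_decision_policy(Q, phi_marginal):
    """Heuristic initialization (39): pi_A^{*(1)}(x) = argmin_a E_Phi[Q(x, Phi, a)].

    Q            : (num_X, num_Phi, num_A) Q-function of the fully observable MDP
    phi_marginal : (num_Phi,) distribution of the context Phi_t (assumed given)
    """
    Q_bar = np.einsum('p,xpa->xa', phi_marginal, Q)  # Q(X_t, a_A) = E_Phi[Q(X_t, Phi_t, a_A)]
    return Q_bar.argmin(axis=1)                      # one action index per estimate X_hat
```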
Input: The Dec-POMDP tuple \mathscr{M}_{\mathrm{Dec-POMDP}}\triangleq\left\langle n,\mathcal{I},\mathcal{A},\mathcal{T},\Omega,\mathcal{O},\mathcal{R}\right\rangle, \epsilon;
1 Initialization: θ0=0\theta^{*}_{0}=0, θ1=\theta^{*}_{-1}=\infty, k=0k=0;
2 Initialize πA(1)\pi_{A}^{*(1)} by calculating (39);
3 while θkθk1ϵ||\theta^{*}_{k}-\theta^{*}_{k-1}||\geq\epsilon do
4      k=k+1k=k+1;
5       Formulate the MDP problem \mathscr{M}^{\pi_{A}^{*(k)}}_{\mathrm{MDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{S},\mathcal{T}_{\mathrm{MDP}}^{\pi_{A}^{*(k)}},\mathcal{R}\rangle as given in Definition 1;
6       Run Algorithm 1 to obtain πS(k)\pi_{S}^{*(k)};
7       Formulate the POMDP problem \mathscr{M}^{\pi_{S}^{*(k)}}_{\mathrm{POMDP}}\triangleq\langle\mathcal{I},\mathcal{A}_{A},\Omega_{A},\mathcal{O}_{A},\mathcal{T}^{\pi_{S}^{*(k)}}_{\mathrm{POMDP}},\mathcal{R}\rangle as given in Definition 2;
8       Run Algorithm 3 to obtain θ(πS(k),πA(k))\theta^{*}(\pi_{S}^{*(k)},\pi_{A}^{*(k)}) and πA(k)\pi_{A}^{*(k)};
9       θk=θ(πS(k),πA(k))\theta^{*}_{k}=\theta^{*}(\pi_{S}^{*(k)},\pi_{A}^{*(k)});
10      
11The joint sub-optimal policy is: \pi_{S}^{*}=\pi_{S}^{*(k)}, \pi_{A}^{*}=\pi_{A}^{*(k)};
12 The sub-optimal average reward is: θ=θk\theta^{*}=\theta^{*}_{k};
Output: πS\pi_{S}^{*}, πA\pi_{A}^{*}, θ\theta^{*}
Algorithm 4 The Improved JESP Algorithm
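The alternating structure of Algorithm 4 can be summarized by the sketch below, where initial_decision_policy_from_mdp, solve_sampler_mdp, and solve_decision_pomdp stand in for the heuristic initialization (39), Algorithm 1, and Algorithm 3, respectively; these names and the convergence test are illustrative placeholders.

```python
def improved_jesp(dec_pomdp, eps=1e-6, max_iter=100):
    """Sketch of Algorithm 4 (improved JESP): alternate between the sampler
    MDP (Definition 1, Algorithm 1) and the decision-maker POMDP
    (Definition 2, Algorithm 3) until the average reward converges."""
    pi_A = initial_decision_policy_from_mdp(dec_pomdp)   # heuristic initialization (39)
    theta_prev, theta = float('inf'), 0.0
    k = 0
    while abs(theta - theta_prev) >= eps and k < max_iter:
        k += 1
        pi_S = solve_sampler_mdp(dec_pomdp, pi_A)                 # Algorithm 1 given pi_A
        pi_A, theta_new = solve_decision_pomdp(dec_pomdp, pi_S)   # Algorithm 3 given pi_S
        theta_prev, theta = theta, theta_new
    return pi_S, pi_A, theta
```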

V Simulation Results

Traditional metrics such as Age of Information have been developed under the assumption that a fresher or more accurate packet, capable of aiding in source reconstruction, holds a higher value for the receiver, thus promoting goal-oriented decision-making. Nevertheless, the manner in which a packet update impacts the system’s utility via decision-making remains an unexplored domain. Through the simulations, we endeavor to elucidate the following observations of interest:

\bullet GoT-optimal vs. State-of-the-art. In contrast with state-of-the-art sampling policies, the proposed goal-oriented sampler & decision-maker co-design is capable of concurrently maximizing goal attainment and conserving communication resources, accomplishing a closed-loop utility optimization via sparse sampling. (See Fig. 6 and Fig. 7)

\bullet Separate Design vs. Co-Design. Compared to the two-stage framework in which sampling and decision-making are designed separately, the co-design of sampling and decision-making not only achieves superior goal achievement but also alleviates the resource expenditure engendered by communication and actuation. (See Fig. 6 and Fig. 7)

\bullet Optimal Brute-Force-Search vs. Sub-optimal JESP. Under different successful transmission probabilities p_{S} and sampling costs C_{S}, the sub-optimal yet computation-efficient JESP algorithm converges to near-optimal solutions. (See Fig. 8)

\bullet Trade-off: Transmission vs. Actuation. There is a trade-off between transmission and actuation in terms of resource expenditure: under reliable channel conditions, it is apt to increase communication overhead to ensure effective decision-making; conversely, under poor channel conditions, it is advisable to curtail communication expenses and augment actuation resources to attain maximal system utility. (See Fig. 9)

V-A Comparing Benchmarks

Fig. 6 illustrates the simulation results, which characterize the utility by the average cost composed of the status inherent cost C_{1}(X_{t},\Phi_{t}), the actuation gain cost C_{2}(\pi_{A}(\hat{X}_{t})), the actuation inherent cost C_{3}(\pi_{A}(\hat{X}_{t})), and the sampling cost C_{S}. For the simulation setup, we set \mathcal{A}_{A}=\left\{a_{0},\cdots,a_{10}\right\}, \mathcal{S}=\left\{s_{0},s_{1},s_{2}\right\}, \mathcal{V}=\left\{v_{0},v_{1}\right\}, and the corresponding costs are given as:

𝐂1|𝒱|×|𝒮|=(0205001020),\mathbf{C}_{1}^{|\mathcal{V}|\times|\mathcal{S}|}=\begin{pmatrix}0&20&50\\ 0&10&20\end{pmatrix}, (40)

C_{2}(\pi_{A}(\hat{X}_{t})) and C_{3}(\pi_{A}(\hat{X}_{t})) are both linear in the actuation, with C_{2}(\pi_{A}(\hat{X}_{t}))=C_{g}\cdot\pi_{A}(\hat{X}_{t}) and C_{3}(\pi_{A}(\hat{X}_{t}))=C_{I}\cdot\pi_{A}(\hat{X}_{t}). The following comparison benchmarks of interest are considered:

\bullet Uniform. Sampling is triggered periodically in this policy. In this case, a_{S}(t)=\mathbbm{1}_{\left\{t=K\Delta\right\}}, where K=0,1,2,\cdots and \Delta\in\mathbb{N}^{+}. For each \Delta, we can calculate the sampling rate as 1/\Delta and explicitly obtain the long-term average cost through Markov chain simulations under the pre-defined decision-making policy \pi_{A}=[a_{0},a_{3},a_{7}]. The choice of \Delta represents a trade-off between utility and sampling frequency, as depicted in Fig. 6: if C_{S} is minimal, sampling and transmission contribute positively to the utility; if sampling is expensive, it may not yield adequate utility; if a single sample consumes moderate resources, the utility exhibits a U-shaped pattern in terms of the sampling rate.

\bullet Age-aware. Sampling is executed when the AoI attains a predetermined threshold, a principle that has been established as the threshold-based structure of AoI-optimal sampling [27]. In this case, a_{S}(t)=\mathbbm{1}_{\left\{\mathrm{AoI}(t)>\delta\right\}}, where the AoI-optimal threshold \delta can be ascertained using the bisection method delineated in Algorithm 1 of [27]. Here, rather than fixing the threshold that minimizes AoI, we dynamically shift the threshold to explore the balance between sampling and utility. As evidenced in Fig. 6, the utility derived from this sampling policy consistently surpasses that of the uniform sampling policy.

\bullet Change-aware. Sampling is triggered whenever the source status changes. In such a case, a_{S}(t)=\mathbbm{1}_{\left\{X_{t}\neq X_{t-1}\right\}}. The performance of this policy depends on the dynamics of the system: if the source status changes frequently, the sampling rate will be higher. The utility of this policy may be arbitrarily poor owing to its semantics-unaware nature. In our considered model, the Change-aware policy spends 89.5\% of the time in the unsynchronized status X_{t}=s_{0} while \hat{X}_{t}=s_{2}. In this case, the actuator implements the actuation a_{A}=a_{7} according to the estimate \hat{X}_{t}=s_{2}, which in turn drives the status X_{t} back towards s_{0}.

\bullet Optimal MSE. This is a type of E2E goal-oriented sampling policy when the goal is real-time reconstruction. Nevertheless, this policy disregards the semantics conveyed by the packet and the ensuing actuation updates precipitated by semantics updates. The problem can be formulated as a standard MDP and solved through the Relative Value Iteration (RVI) algorithm. The sampling rate and average cost are obtained under the MSE-optimal sampling policy and the pre-defined decision-making policy \pi_{A}=[a_{0},a_{3},a_{7}].

\bullet Optimal AoII (also optimal AoCI). It has been proven in [43] that the AoII-optimal sampling policy is a_{S}(t)=\mathbbm{1}_{\left\{X_{t}\neq\hat{X}_{t}\right\}}. From [40], the AoCI-optimal sampling policy is a_{S}(t)=\mathbbm{1}_{\left\{X_{t}\neq{X}_{t-\mathrm{AoI}(t)}\right\}}. Since \hat{X}_{t}={X}_{t-\mathrm{AoI}(t)}, these two sampling policies are equivalent. The sampling rate and average cost are obtained under this sampling policy and the greedy decision-making policy \pi_{A}=[a_{0},a_{3},a_{7}]. A minimal sketch of these benchmark sampling rules is given after this list.
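To make the benchmark definitions concrete, the following sketch implements the closed-form sampling rules as indicator functions of the instantaneous state; the MSE-optimal policy is omitted since it requires solving an MDP. The function and variable names are assumptions introduced for illustration only.

```python
def uniform_sampling(t, Delta):
    """Uniform: sample every Delta slots, a_S(t) = 1{t = K*Delta}."""
    return int(t % Delta == 0)

def age_aware_sampling(aoi, delta_th):
    """Age-aware: sample when the AoI exceeds the threshold delta_th."""
    return int(aoi > delta_th)

def change_aware_sampling(x_t, x_prev):
    """Change-aware: sample whenever the source status changes."""
    return int(x_t != x_prev)

def aoii_optimal_sampling(x_t, x_hat_t):
    """AoII-optimal (also AoCI-optimal): sample on any mismatch X_t != X_hat_t."""
    return int(x_t != x_hat_t)
```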

V-B Separate Design vs. Co-Design

Conventionally, the sampling and actuation policies are designed in a two-stage manner: the sampling policy is first designed to optimize an open-loop performance metric such as the average mean squared error (MSE) or the average Age of Information (AoI), and the decision-making policy \pi_{A} is then designed separately. Specifically, we consider that \pi_{A} is predetermined using a greedy methodology:

\pi_{A}(\hat{X}_{t})=\mathop{\arg\min}\limits_{a_{A}\in\mathcal{A}_{A}}\mathop{\mathbb{E}}\limits_{\Phi_{t}}\left\{\left[C_{1}(\hat{X}_{t},\Phi_{t})-C_{2}(a_{A})\right]^{+}+C_{3}(a_{A})\right\}. (41)

This greedy approach entails selecting the actuation that minimizes the expected cost in the current step, assuming the estimate \hat{X}_{t} is correct. By calculating (41), we obtain \pi_{A}=[a_{0},a_{3},a_{7}].
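A small sketch of the greedy rule (41), using the cost matrix C_{1} in (40) and the linear costs with C_{g}=8 and C_{I}=1, is given below. The marginal distribution of \Phi_{t} is not specified numerically in this section (it follows from the context dynamics (12)); under the hypothetical marginal Pr\{\Phi_{t}=v_{0}\}=0.6 used here for illustration, the rule recovers the stated policy [a_{0},a_{3},a_{7}].

```python
import numpy as np

# Cost setup from (40); Cg = 8, CI = 1 as in the simulations.
C1 = np.array([[0, 20, 50],    # context v0
               [0, 10, 20]])   # context v1
Cg, CI = 8.0, 1.0
actions = np.arange(11)        # actuations a0, ..., a10

def greedy_policy(phi_marginal):
    """Greedy rule (41): for each estimate X_hat, pick the actuation minimizing
    E_Phi[ [C1(X_hat, Phi) - Cg*a]^+ + CI*a ]. phi_marginal is an assumed
    marginal of Phi_t; in the paper it follows from the context dynamics (12)."""
    policy = []
    for x_hat in range(C1.shape[1]):
        cost = sum(p * (np.maximum(C1[v, x_hat] - Cg * actions, 0.0) + CI * actions)
                   for v, p in enumerate(phi_marginal))
        policy.append(int(np.argmin(cost)))
    return policy

# With the hypothetical context marginal Pr{v0} = 0.6, this returns [0, 3, 7],
# i.e., pi_A = [a0, a3, a7] as used for the benchmarks.
print(greedy_policy([0.6, 0.4]))
```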

Figure 6: Average Cost vs. Sampling Rate under different policies and parameter setups. Here we set C_{g}=8 and C_{I}=1.
Figure 7: Comparisons among GoT-optimal, AoII-optimal, and MSE-optimal policies. Here we set C_{g}=8 and C_{I}=1.
Figure 8: Distance between optimal solutions through the RVI-Brute-Force-Search algorithm and sub-optimal solutions through the JESP algorithm.
Figure 9: Trade-off between long-term average transmission cost and long-term average actuation cost.

However, we note that sampling and actuation are closely intertwined, highlighting the potential for further co-design. In this paper, we have proposed the RVI-Brute-Force-Search and the improved JESP algorithms for such optimal co-design. As shown in Fig. 6, the sampler & decision-maker co-design achieves the optimal utility through sparse sampling. Specifically, only semantically important information is sampled and transmitted, while non-essential data is excluded. This goal-oriented, semantics-aware, and sparse sampling design represents a significant advancement in sampling policy design. By incorporating a best-matching decision-making policy, the sparse sampling achieves superior performance compared to existing methods.

Fig. 7 presents a comparative analysis of the AoII (or AoCI)-optimal sampling policy, the MSE-optimal sampling policy, and our proposed GoT-optimal sampler & decision-maker co-design. It is verified that, under different p_{S} and C_{S}, the proposed GoT-optimal sampler & decision-maker co-design achieves the optimal goal-oriented utility. Importantly, under the condition of an extremely unreliable channel p_{S}=0.2 and a high sampling cost C_{S}=10, the proposed co-design facilitates a reduction in long-term average cost exceeding 60\%. This underscores the superiority of the GoT-optimal sampler & decision-maker co-design.

V-C Optimal vs. Sub-Optimal

Fig. 8 presents a comparative visualization between optimal and sub-optimal solutions over a wide range of C_{S} and p_{S} values. The negligible, near-zero values in Fig. 8 imply a trivial deviation between the optimal and sub-optimal solutions, indicating that the latter converges towards near-optimal outcomes. The minimal variance testifies to the sub-optimal algorithm’s consistent ability to closely approximate the optimal solutions. This observation underscores the practical advantage of employing the sub-optimal improved JESP algorithm, especially in scenarios with large \mathcal{A}_{A} and \mathcal{O}_{A}.

V-D Trade-off: Transmission vs. Actuation

Fig. 9 exemplifies the resource allocation trade-off between transmission and actuation when the long-term average cost is minimized. When p_{S} is low (signifying an unreliable channel) or C_{S} is high (indicating expensive sampling), it becomes prudent to decrease sampling and transmission while concurrently augmenting actuation resources for optimal system utility. In contrast, when the channel is reliable, sampling and transmission resources can be harmonized with actuation resources to better achieve the goal. This indicates that, through the optimal co-design of the sampler & decision-maker pair, a trade-off between transmission and actuation resources can be achieved.

VI Conclusion

In this paper, we have investigated the GoT metric to directly describe the goal-oriented decision-making utility of a system. Employing the proposed GoT, we have formulated an infinite-horizon Dec-POMDP problem to accomplish the co-design of sampling and actuation. To address this problem, we have developed two algorithms: the computationally intensive RVI-Brute-Force-Search, which is proven to be optimal, and the more efficient, albeit sub-optimal, improved JESP algorithm. Comparative analyses have substantiated that the proposed GoT-optimal sampler & decision-maker pair achieves sparse sampling while maximizing utility, signifying an initial realization of a sparse, goal-oriented, and semantics-aware sampler design.

References

  • [1] S. Kaul, M. Gruteser, V. Rai, and J. Kenney, “Minimizing age of information in vehicular networks,” in 2011 8th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad-Hoc Communications and Networks, 2011, pp. 350–358.
  • [2] Y. Sun, E. Uysal-Biyikoglu, R. D. Yates, C. E. Koksal, and N. B. Shroff, “Update or wait: How to keep your data fresh,” IEEE Transactions on Information Theory, vol. 63, no. 11, pp. 7492–7508, 2017.
  • [3] M. Costa, M. Codreanu, and A. Ephremides, “On the age of information in status update systems with packet management,” IEEE Transactions on Information Theory, vol. 62, no. 4, pp. 1897–1910, 2016.
  • [4] R. D. Yates and S. K. Kaul, “The age of information: Real-time status updating by multiple sources,” IEEE Transactions on Information Theory, vol. 65, no. 3, pp. 1807–1827, 2018.
  • [5] C. Kam, S. Kompella, G. D. Nguyen, and A. Ephremides, “Effect of message transmission path diversity on status age,” IEEE Transactions on Information Theory, vol. 62, no. 3, pp. 1360–1374, 2015.
  • [6] A. M. Bedewy, Y. Sun, and N. B. Shroff, “The age of information in multihop networks,” IEEE/ACM Transactions on Networking, vol. 27, no. 3, pp. 1248–1257, 2019.
  • [7] M. Moltafet, M. Leinonen, and M. Codreanu, “On the age of information in multi-source queueing models,” IEEE Transactions on Communications, vol. 68, no. 8, pp. 5003–5017, 2020.
  • [8] H. Sac, B. T. Bacinoglu, E. Uysal-Biyikoglu, and G. Durisi, “Age-Optimal Channel Coding Blocklength for an M/G/1 Queue with HARQ,” 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), pp. 1–5, 2018.
  • [9] M. Xie, Q. Wang, J. Gong, and X. Ma, “Age and energy analysis for LDPC coded status update with and without ARQ,” IEEE Internet of Things Journal, vol. 7, no. 10, pp. 10 388–10 400, 2020.
  • [10] J. You, S. Wu, Y. Deng, J. Jiao, and Q. Zhang, “An age optimized hybrid ARQ scheme for polar codes via Gaussian approximation,” IEEE Wireless Communications Letters, vol. 10, no. 10, pp. 2235–2239, 2021.
  • [11] S. Meng, S. Wu, A. Li, J. Jiao, N. Zhang, and Q. Zhang, “Analysis and Optimization of the HARQ-Based Spinal Coded Timely Status Update System,” IEEE Transactions on Communications, vol. 70, no. 10, pp. 6425–6440, 2022.
  • [12] H. Pan, T.-T. Chan, V. C. Leung, and J. Li, “Age of Information in Physical-layer Network Coding Enabled Two-way Relay Networks,” IEEE Transactions on Mobile Computing, 2022.
  • [13] A. Maatouk, M. Assaad, and A. Ephremides, “Minimizing The Age of Information: NOMA or OMA?” IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 102–108, 2019.
  • [14] S. Wu, Z. Deng, A. Li, J. Jiao, N. Zhang, and Q. Zhang, “Minimizing Age of Information in HARQ-CC Aided NOMA Systems,” IEEE Transactions on Wireless Communications, 2022.
  • [15] E. T. Ceran, D. Gündüz, and A. György, “Average Age of Information with Hybrid ARQ under a resource constraint,” IEEE Transactions on Wireless Communications, vol. 18, no. 3, pp. 1900–1913, 2019.
  • [16] D. Li, S. Wu, J. Jiao, N. Zhang, and Q. Zhang, “Age-Oriented Transmission Protocol Design in Space-Air-Ground Integrated Networks,” IEEE Transactions on Wireless Communications, vol. 21, no. 7, pp. 5573–5585, 2022.
  • [17] F. Peng, Z. Jiang, S. Zhang, and S. Xu, “Age of Information Optimized MAC in V2X Sidelink via Piggyback-Based Collaboration,” IEEE Transactions on Wireless Communications, vol. 20, pp. 607–622, 2020.
  • [18] H. Pan, T.-T. Chan, J. Li, and V. C. M. Leung, “Age of information with collision-resolution random access,” IEEE Transactions on Vehicular Technology, vol. 71, no. 10, pp. 11 295–11 300, 2022.
  • [19] A. Li, S. Wu, J. Jiao, N. Zhang, and Q. Zhang, “Age of Information with Hybrid-ARQ: A Unified Explicit Result,” IEEE Transactions on Communications, vol. 70, no. 12, pp. 7899–7914, 2022.
  • [20] B. Zhou and W. Saad, “Joint status sampling and updating for minimizing age of information in the internet of things,” IEEE Transactions on Communications, vol. 67, no. 11, pp. 7468–7482, 2019.
  • [21] X. Xie, H. Wang, and X. Liu, “Scheduling for minimizing the age of information in multi-sensor multi-server industrial iot systems,” IEEE Transactions on Industrial Informatics, pp. 1–10, 2023.
  • [22] M. A. Abd-Elmagid and H. S. Dhillon, “Closed-Form Characterization of the MGF of AoI in Energy Harvesting Status Update Systems,” IEEE Transactions on Information Theory, vol. 68, no. 6, pp. 3896–3919, 2022.
  • [23] M. Hatami, M. Leinonen, Z. Chen, N. Pappas, and M. Codreanu, “On-Demand AoI Minimization in Resource-Constrained Cache-Enabled IoT Networks With Energy Harvesting Sensors,” IEEE Transactions on Communications, vol. 70, no. 11, pp. 7446–7463, 2022.
  • [24] R. D. Yates, “Lazy is timely: Status updates by an energy harvesting source,” in 2015 IEEE International Symposium on Information Theory (ISIT), 2015, pp. 3008–3012.
  • [25] A. Arafa, J. Yang, S. Ulukus, and H. V. Poor, “Age-minimal transmission for energy harvesting sensors with finite batteries: Online policies,” IEEE Transactions on Information Theory, vol. 66, no. 1, pp. 534–556, 2019.
  • [26] R. D. Yates, Y. Sun, D. R. Brown, S. K. Kaul, E. Modiano, and S. Ulukus, “Age of Information: An Introduction and Survey,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 5, pp. 1183–1210, 2021.
  • [27] Y. Sun and B. Cyr, “Sampling for data freshness optimization: Non-linear age functions,” Journal of Communications and Networks, vol. 21, no. 3, pp. 204–219, 2019.
  • [28] A. Kosta, N. Pappas, A. Ephremides, and V. Angelakis, “The cost of delay in status updates and their value: Non-linear ageing,” IEEE Transactions on Communications, vol. 68, no. 8, pp. 4905–4918, 2020.
  • [29] J. Cho and H. Garcia-Molina, “Effective page refresh policies for web crawlers,” ACM Transactions on Database Systems (TODS), vol. 28, no. 4, pp. 390–426, 2003.
  • [30] Y. Azar, E. Horvitz, E. Lubetzky, Y. Peres, and D. Shahaf, “Tractable near-optimal policies for crawling,” Proceedings of the National Academy of Sciences, vol. 115, no. 32, pp. 8099–8103, 2018.
  • [31] M. Bastopcu and S. Ulukus, “Information freshness in cache updating systems,” IEEE Transactions on Wireless Communications, vol. 20, no. 3, pp. 1861–1874, 2020.
  • [32] K. T. Truong and R. W. Heath, “Effects of Channel Aging in Massive MIMO systems,” Journal of Communications and Networks, vol. 15, no. 4, pp. 338–351, 2013.
  • [33] Y. Sun and B. Cyr, “Information aging through queues: A mutual information perspective,” in 2018 IEEE 19th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), 2018, pp. 1–5.
  • [34] Z. Wang, M.-A. Badiu, and J. P. Coon, “A framework for characterizing the value of information in hidden markov models,” IEEE Transactions on Information Theory, vol. 68, no. 8, pp. 5203–5216, 2022.
  • [35] G. Chen, S. C. Liew, and Y. Shao, “Uncertainty-of-information scheduling: A restless multiarmed bandit framework,” IEEE Transactions on Information Theory, vol. 68, no. 9, pp. 6151–6173, 2022.
  • [36] C. Kam, S. Kompella, G. D. Nguyen, J. E. Wieselthier, and A. Ephremides, “Towards an effective age of information: Remote estimation of a Markov source,” in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2018, pp. 367–372.
  • [37] Y. Sun, Y. Polyanskiy, and E. Uysal, “Sampling of the Wiener process for remote estimation over a channel with random delay,” IEEE Transactions on Information Theory, vol. 66, no. 2, pp. 1118–1135, 2019.
  • [38] T. Z. Ornee and Y. Sun, “Sampling for remote estimation through queues: Age of information and beyond,” in 2019 International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOPT), 2019, pp. 1–8.
  • [39] A. Arafa, K. Banawan, K. G. Seddik, and H. V. Poor, “Sample, quantize, and encode: Timely estimation over noisy channels,” IEEE Transactions on Communications, vol. 69, no. 10, pp. 6485–6499, 2021.
  • [40] X. Wang, W. Lin, C. Xu, X. Sun, and X. Chen, “Age of changed information: Content-aware status updating in the internet of things,” IEEE Transactions on Communications, vol. 70, no. 1, pp. 578–591, 2022.
  • [41] X. Zheng, S. Zhou, and Z. Niu, “Urgency of information for context-aware timely status updates in remote control systems,” IEEE Transactions on Wireless Communications, vol. 19, no. 11, pp. 7237–7250, 2020.
  • [42] J. Zhong, R. D. Yates, and E. Soljanin, “Two freshness metrics for local cache refresh,” in 2018 IEEE International Symposium on Information Theory (ISIT), pp. 1924–1928.
  • [43] A. Maatouk, S. Kriouile, M. Assaad, and A. Ephremides, “The age of incorrect information: A new performance metric for status updates,” IEEE/ACM Transactions on Networking, vol. 28, no. 5, pp. 2215–2228, 2020.
  • [44] J. Cao, X. Zhu, S. Sun, P. Popovski, S. Feng, and Y. Jiang, “Age of Loop for Wireless Networked Control System in the Finite Blocklength Regime: Average, Variance and Outage Probability,” IEEE Transactions on Wireless Communications, pp. 1–1, 2023.
  • [45] A. Nikkhah, A. Ephremides, and N. Pappas, “Age of Actuation in a Wireless Power Transfer System,” in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2023.
  • [46] M. Kountouris and N. Pappas, “Semantics-empowered communication for networked intelligent systems,” IEEE Communications Magazine, vol. 59, pp. 96–102, 2020.
  • [47] N. Pappas and M. Kountouris, “Goal-oriented communication for real-time tracking in autonomous systems,” in 2021 IEEE International Conference on Autonomous Systems (ICAS), 2021, pp. 1–5.
  • [48] M. Salimnejad, M. Kountouris, and N. Pappas, “Real-time Reconstruction of Markov Sources and Remote Actuation over Wireless Channels,” arXiv, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2302.13927
  • [49] E. Fountoulakis, N. Pappas, and M. Kountouris, “Goal-oriented Policies for Cost of Actuation Error Minimization in Wireless Autonomous Systems,” IEEE Communications Letter, 2023.
  • [50] A. Li, S. Wu, S. Meng, and Q. Zhang, “Towards goal-oriented semantic communications: New metrics, open challenges, and future research directions,” Submitted to IEEE Communications Magazine., 2023. [Online]. Available: https://arxiv.org/abs/2304.00848
  • [51] D. S. Bernstein, R. Givan, N. Immerman, and S. Zilberstein, “The complexity of decentralized control of markov decision processes,” Mathematics of operations research, vol. 27, no. 4, pp. 819–840, 2002.
  • [52] E. Uysal, O. Kaya, A. Ephremides, J. Gross, M. Codreanu, P. Popovski, M. Assaad, G. Liva, A. Munari, B. Soret et al., “Semantic communications in networked systems: A data significance perspective,” IEEE Network, vol. 36, no. 4, pp. 233–240, 2022.
  • [53] A. Maatouk, M. Assaad, and A. Ephremides, “The age of incorrect information: An enabler of semantics-empowered communication,” IEEE Transactions on Wireless Communications, 2022.
  • [54] C. Kam, S. Kompella, and A. Ephremides, “Age of incorrect information for remote estimation of a binary markov source,” in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2020, pp. 1–6.
  • [55] D. Bertsekas, Dynamic programming and optimal control.   Athena scientific, 2012, vol. 1.
  • [56] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming.   John Wiley & Sons, 2014.
  • [57] R. Nair, M. Tambe, M. Yokoo, D. V. Pynadath, and S. Marsella, “Taming decentralized pomdps: Towards efficient policy computation for multiagent settings,” in Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, 2003.
  • [58] D. P. Bertsekas and J. N. Tsitsiklis, “Neuro-dynamic programming: an overview,” in Proceedings of 1995 34th IEEE conference on decision and control, vol. 1, pp. 560–564.
  • [59] Y. Li, B. Yin, and H. Xi, “Finding optimal memoryless policies of pomdps under the expected average reward criterion,” European Journal of Operational Research, vol. 211, no. 3, pp. 556–567, 2011.

Appendix A The Proof of Lemma 1

By taking into account the conditional independence among X_{t+1}, \hat{X}_{t+1}, and \Phi_{t+1}, given \mathbf{W}_{t}=(X_{t},\hat{X}_{t},\Phi_{t}) and \mathbf{a}(t), we can express the following:

Pr{𝐖t+1=(su,x,vr)|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}\displaystyle\Pr\left\{\mathbf{W}_{t+1}=(s_{u},x,v_{r})\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\} (42)
=Pr{Xt+1=su|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}×\displaystyle=\Pr\left\{X_{t+1}=s_{u}\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\}\times
Pr{X^t+1=x|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}×\displaystyle\Pr\left\{\hat{X}_{t+1}=x\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\}\times
Pr{Φt+1=vr|𝐖t=(si,sj,vk),𝐚(t)=(1,am)},\displaystyle\Pr\left\{\Phi_{t+1}=v_{r}\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\},

wherein the first, second, and third terms can be derived through conditional independence, resulting in simplified expressions of (11), (16), and (12), respectively:

Pr{Xt+1=su|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}=pi,u(k,m),\displaystyle\Pr\left\{X_{t+1}=s_{u}\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\}=p_{i,u}^{(k,m)}, (43)
Pr{X^t+1=x|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}=pS𝟙{x=si}+(1pS)𝟙{x=sj},\displaystyle\Pr\left\{\hat{X}_{t+1}=x\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\}=p_{S}\cdot\mathbbm{1}_{\left\{x=s_{i}\right\}}+(1-p_{S})\cdot\mathbbm{1}_{\left\{x=s_{j}\right\}}, (44)
Pr{Φt+1=vr|𝐖t=(si,sj,vk),𝐚(t)=(1,am)}=pk,r,\displaystyle\Pr\left\{\Phi_{t+1}=v_{r}\left|\mathbf{W}_{t}=(s_{i},s_{j},v_{k}),\mathbf{a}(t)=(1,a_{m})\right.\right\}=p_{k,r}, (45)

Substituting (43), (44), and (45) into (42) yields the first expression in Lemma 1. In the case where a_{S}(t)=0, we can obtain a similar expression by replacing \mathbf{a}(t)=(1,a_{m}) with \mathbf{a}(t)=(0,a_{m}). Substituting (11), (13), and (12) into this new expression yields (20) in Lemma 1, which completes the proof.
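The factorization in (42)–(45) amounts to a product of three terms, as assembled in the sketch below; the arrays p_source (the controlled source dynamics (11)) and p_context (the context dynamics (12)), as well as their indexing, are hypothetical placeholders introduced for illustration.

```python
def transition_prob(p_source, p_context, p_s, w, w_next, a_m, sampled=True):
    """One entry of Pr{W_{t+1} = w_next | W_t = w, a(t) = (a_S, a_m)} via (42)-(45).

    w, w_next : index tuples (i, j, k) and (u, x, r) for (X_t, X_hat_t, Phi_t)
    p_source  : p_source[k][a_m][i][u] = p_{i,u}^{(k,m)}, controlled source dynamics (11)
    p_context : p_context[k][r] = p_{k,r}, context dynamics (12)
    p_s       : successful transmission probability
    sampled   : True if a_S(t) = 1, False if a_S(t) = 0
    """
    (i, j, k), (u, x, r) = w, w_next
    term_source = p_source[k][a_m][i][u]                        # (43)
    if sampled:
        term_estimate = p_s * (x == i) + (1 - p_s) * (x == j)   # (44)
    else:
        term_estimate = 1.0 if x == j else 0.0                  # estimate frozen without a sample
    term_context = p_context[k][r]                              # (45)
    return term_source * term_estimate * term_context
```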

Appendix B Proof of Lemma 3

QπAπS(𝐰,aA)Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A}) could be simplified as follows:

limsupT1T𝔼(πS,πA)[t=0T1(r(t)ηπAπS)|𝐖0=𝐰,aA(0)=aA]\displaystyle{\mathop{\lim\sup}\limits_{T\to\infty}\frac{1}{T}{\mathbb{E}^{(\pi_{S},\pi_{A})}}\left[{\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)}\left|\mathbf{W}_{0}=\mathbf{w},a_{A}(0)=a_{A}\right.\right]} (46)
=πS(𝐰,aA)ηπAπS+𝐰pπS(𝐰|𝐰,aA)limsupT1T1𝔼(πS,πA)[t=1T1(r(t)ηπAπS)|𝐖1=𝐰]gπAπS(𝐰)\displaystyle=\mathcal{R}^{\pi_{S}}(\mathbf{w},a_{A})-\eta_{\pi_{A}}^{\pi_{S}}+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w},a_{A})\cdot\underbrace{{\mathop{\lim\sup}\limits_{T\to\infty}\frac{1}{T-1}{\mathbb{E}^{(\pi_{S},\pi_{A})}}\left[{\sum\limits_{t=1}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)}\left|\mathbf{W}_{1}=\mathbf{w}^{\prime}\right.\right]}}_{g_{\pi_{A}}^{\pi_{S}}\mathbf{(w}^{\prime})}
=πS(𝐰,aA)ηπAπS+𝐰pπS(𝐰|𝐰,aA)gπAπS(𝐰),\displaystyle=\mathcal{R}^{\pi_{S}}(\mathbf{w},a_{A})-\eta_{\pi_{A}}^{\pi_{S}}+\sum_{\mathbf{w}^{\prime}\in\mathcal{I}}p^{\pi_{S}}(\mathbf{w}^{\prime}|\mathbf{w},a_{A})g_{\pi_{A}}^{\pi_{S}}\mathbf{(w}^{\prime}),

QπAπS(oA,aA)Q_{\pi_{A}}^{\pi_{S}}(o_{A},a_{A}) could be solved as:

limsupT1T𝔼(πS,πA)[t=0T1(r(t)ηπAπS)|oA(0)=oA,aA(0)=aA]\displaystyle{\mathop{\lim\sup}\limits_{T\to\infty}\frac{1}{T}{\mathbb{E}^{(\pi_{S},\pi_{A})}}\left[{\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)}\left|o_{A}^{(0)}=o_{A},a_{A}(0)=a_{A}\right.\right]} (47)
=𝐰pπAπS(𝐰|oA,aA)limsupT1T1𝔼(πS,πA)[t=0T1(r(t)ηπAπS)|𝐖0=𝐰,aA(0)=aA]QπAπS(𝐰,aA)\displaystyle=\sum_{\mathbf{w}\in\mathcal{I}}p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A},a_{A})\cdot\underbrace{{\mathop{\lim\sup}\limits_{T\to\infty}\frac{1}{T-1}{\mathbb{E}^{(\pi_{S},\pi_{A})}}\left[{\sum\limits_{t=0}^{T-1}\left(r(t)-\eta_{\pi_{A}}^{\pi_{S}}\right)}\left|\mathbf{W}_{0}=\mathbf{w},a_{A}(0)=a_{A}\right.\right]}}_{Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A})}
=𝐰pπAπS(𝐰|oA,aA)QπAπS(𝐰,aA).\displaystyle=\sum_{\mathbf{w}\in\mathcal{I}}p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A},a_{A})Q_{\pi_{A}}^{\pi_{S}}(\mathbf{w},a_{A}).

where p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A},a_{A}) is the posterior conditional probability. Noting that the state \mathbf{w} is independent of a_{A} when o_{A} is known, we have

pπAπS(𝐰|oA,aA)=pπAπS(𝐰|oA)=pπAπS(𝐰,oA)pπAπS(oA)=𝝁πAπS(𝐰)p(oA|𝐰)𝐰𝝁πAπS(𝐰)p(oA|𝐰).\displaystyle p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A},a_{A})=p_{\pi_{A}}^{\pi_{S}}(\mathbf{w}|o_{A})=\frac{p_{\pi_{A}}^{\pi_{S}}(\mathbf{w},o_{A})}{p_{\pi_{A}}^{\pi_{S}}(o_{A})}=\frac{\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w})p(o_{A}|\mathbf{w})}{\sum_{\mathbf{w}\in\mathcal{I}}\boldsymbol{\mu}_{\pi_{A}}^{\pi_{S}}(\mathbf{w})p(o_{A}|\mathbf{w})}. (48)