
Hierarchical Deep Reinforcement Learning for Age-of-Information Minimization in IRS-aided and Wireless-powered Wireless Networks

Shimin Gong, Leiyang Cui, Bo Gu, Bin Lyu, Dinh Thai Hoang, and Dusit Niyato
Shimin Gong, Leiyang Cui, and Bo Gu are with the School of Intelligent Systems Engineering, Sun Yat-sen University, Shenzhen 518055, China (email: [email protected], [email protected], [email protected]). Bin Lyu is with Key Laboratory of Ministry of Education in Broadband Wireless Communication and Sensor Network Technology, Nanjing University of Posts and Telecommunications, China (e-mail: [email protected]). Dinh Thai Hoang is with the School of Electrical and Data Engineering, University of Technology Sydney, Australia (email: [email protected]). Dusit Niyato is with School of Computer Science and Engineering, Nanyang Technological University, Singapore (email: [email protected]).
Abstract

In this paper, we focus on a wireless-powered sensor network coordinated by a multi-antenna access point (AP). Each node can generate sensing information and report the latest information to the AP using the energy harvested from the AP's signal beamforming. We aim to minimize the average age-of-information (AoI) by jointly adapting the nodes' transmission scheduling and the transmission control strategies. To reduce the transmission delay, an intelligent reflecting surface (IRS) is used to enhance the channel conditions by controlling the AP's beamforming vector and the IRS's phase-shifting matrix. Considering dynamic data arrivals at different sensing nodes, we propose a hierarchical deep reinforcement learning (DRL) framework for AoI minimization in two steps. The users' transmission scheduling is firstly determined by the outer-loop DRL approach, e.g., the DQN or PPO algorithm, and then the inner-loop optimization is used to adapt either the uplink information transmission or the downlink energy transfer to all nodes. A simple and efficient approximation is also proposed to reduce the inner-loop run-time overhead. Numerical results verify that the hierarchical learning framework outperforms typical baselines in terms of the average AoI and proportional fairness among different nodes.

Index Terms:
AoI minimization, wireless power transfer, IRS-aided wireless network, deep reinforcement learning.

I Introduction

With the development of the future Internet of Things (IoT), a large portion of the emerging applications (e.g., autonomous driving, interactive gaming, and virtual reality) depends on timely transmissions and processing of the IoT devices’ sensing data in the physical world, e.g., [1] and [2]. Maintaining information freshness in time-sensitive applications requires a new network performance metric, i.e., the age-of-information (AoI), which is defined as the elapsed time since the generation of the latest status update successfully received by the receiver [3]. In wireless networks, the overall time delay of each node is mainly caused by the waiting time or scheduling delay before information transmission and the in-the-air transmission delay. The waiting delay is usually determined by the multi-user scheduling strategy, while the transmission delay depends on the network capacity and channel conditions, e.g., mutual interference or channel fading effect. To minimize the transmission delay, we need to explore preferable channel opportunities and adapt the transmission control parameters accordingly, e.g., power control, channel allocation, and beamforming strategies. More recently, the intelligent reflecting surface (IRS) has been used to reduce the transmission delay by shaping the wireless channels in favor of information transmission via passive beamforming optimization [4]. The AoI-aware scheduling and transmission control in wireless networks were previously studied by the queueing theory [5]. The closed-form analysis of the AoI performance is tedious and usually difficult, typically relying on specific probability distributions of the sensor nodes’ data arrivals, the data transmission/service time, and the rate of information requests at user devices. For more complex wireless networks, e.g., with users’ mobility and limited energy resources, it is still very challenging to optimize multi-user scheduling policy to maximize the overall AoI performance.

In energy-constrained wireless networks, the AoI minimization depends on the optimal control of each user’s energy supply and demand, especially in wireless powered communication networks. When a sensor node is scheduled more often to update its sensing information, more wireless energy transfer is required for the node to achieve self-sustainability. This may reduce the transmission opportunities and the energy transfer to other nodes, leading to the AoI-energy tradeoff study in energy-constrained wireless networks [6]. The joint optimization of multi-user scheduling and energy management is usually formulated as a dynamic program due to the spatial-temporal couplings among wireless sensor nodes. With limited energy supply, the users’ scheduling becomes more challenging to balance the tradeoff between AoI and energy consumption. The AoI minimization problem is further complicated by the unknown channel conditions and the sensor nodes’ dynamic data arrivals. Without complete system information, the scheduling strategy has to be adapted according to the users’ historical AoI information. Instead of the optimization approaches, the AoI minimization problems can be flexibly solved by the model-free deep reinforcement learning (DRL) approaches, e.g., [6, 7, 8, 9]. The DRL approaches can reformulate the AoI minimization problem into Markov decision process (MDP). The action includes the scheduling and transmission control strategies. With a large-size IRS, the action and state spaces become high-dimensional and lead to unreliable and slow learning performance. This motivates us to design a more efficient learning approach for the AoI minimization problem.

Specifically, in this paper, we aim to minimize the average AoI in a wireless-powered network, consisting of a multi-antenna access point (AP), a wall-mounted IRS, and multiple single-antenna sensor nodes that sample status information from different physical processes. The IRS is used to enhance the channel conditions and reduce the transmission delay. It can assist either the downlink wireless power transfer from the AP to the sensing nodes or the uplink information transmissions from the nodes to the AP [4]. The joint scheduling and beamforming optimization normally leads to a high-dimensional mixed-integer problem that is difficult to solve in practice. Different from the conventional DRL solutions, we devise a two-step hierarchical learning framework to improve the overall learning efficiency. The basic idea is to adapt the scheduling strategy by the outer-loop DRL algorithm, e.g., the value-based deep Q-network (DQN) or the policy-based proximal policy optimization (PPO) algorithm [10], and optimize the beamforming strategy by the inner-loop optimization module. Given the outer-loop decision, the inner-loop procedures either optimize the AP's downlink energy transfer or the scheduled node's uplink information transmission. Specifically, the contributions of this paper are summarized as follows:

  • The IRS-assisted scheduling and beamforming for AoI minimization: We aim to reduce both the packet waiting and transmission delays for updating sensing information in a wireless-powered and IRS-assisted wireless network. The IRS's passive beamforming not only enhances the wireless power transfer to the sensor nodes, but also assists their uplink information transmissions to the AP. We formulate the AoI minimization problem by jointly adapting the user scheduling and beamforming strategies.

  • The hierarchical DRL approach for AoI minimization: A hierarchical DRL framework is proposed to solve the AoI minimization in two steps. The model-free DRL in the outer loop adapts the combinatorial scheduling decision according to each user's energy status and AoI performance. Given the outer-loop scheduling, we optimize the joint beamforming strategies for either the downlink energy transfer or the uplink information transmission.

  • Policy-based PPO algorithm for outer-loop scheduling: Our simulation results verify that the hierarchical learning framework significantly reduces the average AoI compared with typical baseline strategies. Besides, we compare both the traditional DQN and the PPO methods for the outer-loop scheduling optimization. The PPO-based hierarchical learning can improve convergence and achieve a lower AoI value compared to the DQN-based method.

Some preliminary results of this work have been presented in [11]. In this extension, we include a detailed analysis of the PPO algorithm and compare it with the DQN method. We also propose a simple approximation for the inner-loop optimization to reduce the time overhead while achieving comparable AoI performance at convergence. The remainder of this paper is organized as follows. We discuss related works in Section II and present our system model in Section III. We present the hierarchical learning framework in Section IV. The inner-loop optimization and outer-loop learning procedures are detailed in Sections V and VI, respectively. Finally, we present the numerical results in Section VII and conclude the paper in Section VIII.

II Related works

II-A DRL Approaches for AoI Minimization

DRL has been introduced recently as an effective solution for AoI minimization by adapting the scheduling and beamforming strategies according to time-varying traffic demands and channel conditions. The authors in [7] designed a freshness-aware scheduling solution by using the DQN method. The DQN agent continuously updates its scheduling strategy to maximize the freshness of information in the long term. The authors in [8] focused on a multi-user status update system, where a single sensor node monitors a physical process and schedules its information updates to multiple users with time-varying channel conditions. Based on the users' instantaneous ACK/NACK feedback, the DRL agent at the information source can decide on when and to which user to transmit the next information update, without any prior information on the random channel conditions. The authors in [9] focused on a multi-access system where the base station (BS) aims to maximize the information collected from all wireless users. Given a strict time deadline for each wireless user, the PPO algorithm was used to adapt the scheduling policy by learning the users' traffic patterns from past experiences. The authors in [12] considered a different multi-user scheduling scheme that allows a group of sensor nodes to transmit their information simultaneously. The scheduling decision is adapted to minimize the AoI by using the double DQN (DDQN) method [13], an extension of the DQN method that uses two sets of deep neural networks (DNNs) to approximate the Q-value. The AoI-energy tradeoff study in [14] revealed that the sensor nodes' energy consumption can be reduced without a significant increase in the AoI, by using DQN to adapt the content updates at the caching node.

II-B RF-powered Scheduling for AoI Minimization

The wireless power transfer and energy harvesting are promising techniques to sustain the massive number of low-cost sensor nodes. However, energy harvesting is usually unreliable depending on the channel conditions. The dynamics and scarcity of energy harvesting make the sensor nodes' energy management and AoI minimization more challenging. The authors in [15] focused on AoI minimization in a cognitive radio network with dynamic supplies of the energy and spectrum resources. The optimal scheduling policy is derived by a dynamic programming approach, revealing a threshold structure depending on the sensor nodes' AoI states. The authors in [16] revealed that the optimal policy allows each sensor to send a status update only if the AoI value is higher than some threshold that depends on its energy level. Considering stochastic energy harvesting at sensor devices, the authors in [17] studied the AoI minimization problem with long-term energy constraints and proposed Lyapunov-based dynamic optimization to derive an approximate solution. Generally, the dynamic optimization approaches are not only computationally demanding, but also rely on the availability of system information. Without knowing the dynamic energy arrivals, the authors in [18] reformulated the AoI minimization problem into an MDP. The online Q-learning method was proposed to adaptively schedule the wireless devices' information updates. The authors in [19] focused on the RF-powered wireless network, where the wireless devices can harvest RF power from a dedicated BS and then transmit their update packets to the BS. To minimize the long-term AoI, the DQN algorithm was used to adapt the scheduling between the downlink energy transfer and the uplink information transmission. Both the DQN and dueling DDQN methods were used in [20] to adapt the sensing and information update policy for AoI minimization in a spectrum-sharing cognitive radio system. Considering energy harvesting ad hoc networks, the authors in [21] solved the AoI minimization problem by using the advantage actor-critic (A2C) algorithm to adapt the scheduling and power allocation policy, which shows faster runtime and comparable AoI performance to the optimum. The above-mentioned works typically solve the AoI minimization by using conventional model-free DRL methods. These methods become inflexible and unreliable due to slow convergence as the state and action spaces increase.

II-C IRS-Assisted AoI Minimization

The IRS's reconfigurability can be used to enhance the channel quality/capacity or reduce the transmission delay by tuning the phase shifts of the reflecting elements [4]. Only a few existing works have discussed the IRS's application for AoI minimization in wireless networks. The authors in [22] focused on a mobile edge computing (MEC) system and proposed using the IRS to minimize the workload processing delay by optimizing the IRS's passive beamforming strategy. The authors in [23] set delay constraints on the wireless users' uplink information transmissions and revealed that the IRS's passive beamforming can help reduce the wireless users' transmit power. The authors in [24] employed the IRS to enhance the AoI performance by jointly optimizing the users' scheduling and the IRS's passive beamforming strategies. The combinatorial scheduling decision is adapted by the model-free DRL algorithm, while the passive beamforming optimization relies on the solution to the conventional semi-definite relaxation (SDR) problem. The authors in [25] employed a UAV-carried IRS to relay information from the ground users to the BS. The AoI minimization requires the optimization of the UAV's altitude, the ground users' transmission scheduling, and the IRS's passive beamforming strategies. Compared to [24] and [25], our work in this paper exploits the performance gains in both the uplink and downlink of the IRS-assisted system. The IRS's passive beamforming not only assists the uplink information transmission but also enhances or balances the AP's downlink energy transfer to the users.

III System model

We consider an IRS-assisted wireless sensor network deployed in smart cities to assist information sensing and decision making, similar to that in [11]. The system consists of a multi-antenna AP with M antennas, an IRS with N reflecting elements, and K single-antenna IoT devices, denoted by the set \mathcal{K}\triangleq\{1,2,\ldots,K\}. The system model can be straightforwardly extended to the cases with multiple APs or multiple IRSs. The sensing information is typically a small amount of data, which should be timely updated to the AP for real-time status monitoring. We assume that all sensor nodes are low-power devices and wirelessly powered by harvesting RF energy from the AP's beamforming signals. The wireless powered communications technology has been verified and evaluated in [26], showing that LoRa-based sensor nodes typically have 0.5-1.5 mJ energy consumption, and require 2-5 seconds of energy harvesting time at 3.2 meters away to sustain periodical information sensing and data transmissions up to 200 bytes. The IRS can be deployed on the exterior walls of buildings to enhance the channel conditions between the sensor nodes and the AP. We aim to collect all sensor nodes' data in a timely fashion by scheduling their uplink data transmissions, based on their channel conditions, traffic demands, and energy status. A list of notations is provided in Table I.

TABLE I: A list of Notations
Notation | Description
M | Number of the AP's antennas
N | Size of the IRS (number of reflecting elements)
K | Number of IoT devices
\mathcal{T} | The set of time slots
\psi_{0}(t)\in\{0,1\} | The AP's mode selection
\psi_{k}(t)\in\{0,1\} | The AP's uplink scheduling
{\bf G}(t) | The AP-IRS channel matrix
{\bf h}^{r}_{k}(t) | The IRS-user channel vector
{\boldsymbol{\Theta}}_{d}(t), {\boldsymbol{\Theta}}_{u}(t) | The IRS's beamforming strategies
{\bf w}_{d}(t), {\bf w}_{u}(t) | The AP's beamforming vectors
\eta | Energy conversion efficiency
E^{h}_{k}(t) | Energy harvested by user k
E_{k}^{c}(t) | User k's energy consumption
{E}_{k}(t) | User k's energy state
E_{\max} | Maximum energy capacity
\gamma_{k}(t) | The received SNR at the AP
r_{k}(t) | User k's uplink throughput
\tau_{k} | User k's uplink transmission time
d_{k} | User k's data size
A_{k}(t) | User k's AoI value

III-A Mode Selection and Scheduling

We consider a time-slotted frame structure to avoid contention between different nodes. Each data frame is equally divided into T time slots allocated to different sensor nodes. Let \mathcal{T}\triangleq\{1,2,\ldots,T\} denote the set of all time slots. In each time slot, we need to firstly decide the AP's operation mode, i.e., the time slot is used for either the downlink energy transfer or the uplink data transmission. We use \psi_{0}(t)\in\{0,1\} to denote the AP's mode selection in each time slot, i.e., \psi_{0}(t)=1 indicates the downlink energy beamforming and \psi_{0}(t)=0 represents the uplink information transmission. We further use \psi_{k}(t)\in\{0,1\} to denote the uplink scheduling policy, i.e., \psi_{k}(t)=1 represents that the k-th sensor node is allowed to access the channel for uploading its sensing information to the AP. We require that at most one sensor node can access the channel in each time slot, which implies the following scheduling constraint:

\psi_{0}(t)+\sum_{k\in\mathcal{K}}\psi_{k}(t)\leq 1,\quad\forall\,t\in\mathcal{T}. (1)

We denote \boldsymbol{\Psi}(t)=[\psi_{0}(t),\psi_{1}(t),\ldots,\psi_{K}(t)] as the AP's scheduling policy, which depends on the sensor nodes' traffic demands, channel conditions, and energy status.
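To make the combinatorial action space concrete, the following minimal Python sketch (an illustration, not the paper's implementation) enumerates all scheduling vectors \boldsymbol{\Psi}(t) that satisfy constraint (1) for K users: the idle slot, the downlink energy-transfer slot, and one uplink slot per user.

import numpy as np

def feasible_actions(K):
    """All vectors [psi_0, psi_1, ..., psi_K] with at most one entry equal to 1,
    as required by the scheduling constraint (1)."""
    actions = [np.zeros(K + 1, dtype=int)]      # idle slot: neither energy transfer nor uplink
    for i in range(K + 1):
        a = np.zeros(K + 1, dtype=int)
        a[i] = 1                                # i = 0: downlink energy transfer; i >= 1: schedule node i
        actions.append(a)
    return actions

print(len(feasible_actions(4)))                 # K + 2 = 6 feasible actions for K = 4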

Let \mathcal{N}\triangleq\{1,2,\ldots,N\} denote the set of the IRS's reflecting elements and \theta_{n}(t)\in(0,2\pi] denote the phase shift of the n-th reflecting element in the t-th time slot. We define the IRS's phase shifting vector in the t-th time slot as \boldsymbol{\theta}(t)=[e^{j\theta_{n}(t)}]_{n\in\mathcal{N}}. Note that the IRS can set different beamforming vectors, denoted as \boldsymbol{\theta}_{d}(t)\triangleq[e^{j\theta_{d,n}(t)}]_{n\in\mathcal{N}} and \boldsymbol{\theta}_{u}(t)\triangleq[e^{j\theta_{u,n}(t)}]_{n\in\mathcal{N}}, for the downlink and uplink phases, respectively. The channel matrix from the AP to the IRS in the t-th time slot is given by {\bf G}(t)\in\mathbb{C}^{M\times N}. The channel vectors from the IRS and the AP to the k-th sensor node are denoted by {\bf h}^{r}_{k}(t)\in\mathbb{C}^{N\times 1} and {\bf h}^{d}_{k}(t)\in\mathbb{C}^{M\times 1}, respectively. The AP can estimate the channel information by a training period at the beginning of each time slot.

III-B Downlink Energy Transfer

When \psi_{0}(t)=1, the IRS-assisted downlink energy transfer ensures the sustainable operation of the system. Given the IRS's passive beamforming strategy \boldsymbol{\theta}_{d}(t), the equivalent downlink channel vector {\bf f}_{d,k}(t) from the AP to the k-th sensor node can be expressed as follows:

{\bf f}_{d,k}(t)={\bf h}^{d}_{k}(t)+{\bf G}_{k}(t){\boldsymbol{\Theta}}_{d}(t){\bf h}^{r}_{k}(t), (2)

where we define \boldsymbol{\Theta}_{d}(t)\triangleq\text{diag}(\boldsymbol{\theta}_{d}(t)) as a diagonal matrix with the diagonal elements given by \boldsymbol{\theta}_{d}(t). The phase shifting matrix \boldsymbol{\Theta}_{d}(t) represents the IRS's passive beamforming strategy in the downlink energy transfer. Let {\bf w}_{d}(t)\in\mathbb{C}^{M\times 1} denote the AP's transmit beamforming vector in the downlink energy transfer phase. Given the AP's transmit power p_{s}, the AP's beamforming signal is given by {\bf x}(t)=\sqrt{p_{s}}{\bf w}_{d}(t)s_{0}(t), where s_{0}(t)\in\mathbb{C} denotes a random complex symbol with unit power. Then, the received signal at the k-th sensor node is given as y_{k}(t)={\bf f}^{H}_{d,k}(t){\bf x}(t)+n_{k}(t), where (\cdot)^{H} denotes the conjugate transpose and n_{k}(t) is the normalized Gaussian noise with zero mean and unit power. Considering a linear energy harvesting (EH) model [27], the received signal y_{k}(t) can be converted to energy as follows:

E^{h}_{k}(t)=\eta\mathbb{E}[|{\bf f}^{H}_{d,k}(t){\bf x}(t)|^{2}]=\eta p_{s}|{\bf f}^{H}_{d,k}(t){\bf w}_{d}(t)|^{2}, (3)

where \eta denotes the energy conversion efficiency. The energy harvested from the noise signal is assumed to be negligible. It is clear that the AP can control the energy transfer to different sensor nodes by optimizing the downlink beamforming vector {\bf w}_{d}(t).

In particular, the AP can steer the beam direction toward the sensor nodes with insufficient energy supply. Besides, the IRS's passive beamforming strategy \boldsymbol{\Theta}_{d}(t) controls the downlink channel conditions {\bf f}_{d,k}(t) to individual receivers. Due to the broadcast nature of wireless signals, the AP's energy transfer to different sensor nodes depends on the joint beamforming strategies ({\bf w}_{d}(t),\boldsymbol{\Theta}_{d}(t)) in different time slots. A more practical non-linear EH model can also be applied to our system. In this case, the harvested power firstly increases with the received signal power and then becomes saturated as the received signal power continues to increase, e.g., [28]. This can be approximated by a piecewise linear EH model: E^{h}_{k}(t)=\min\{\eta p_{s}|{\bf f}^{H}_{d,k}(t){\bf w}_{d}(t)|^{2},\,p_{\text{sat}}\}, where p_{\text{sat}} denotes the saturation power. Our algorithm in the following part adopts the linear EH model in (3) and can be easily applied to the piecewise linear model with minor modifications.
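As a quick illustration of the EH models above, the short Python sketch below (illustrative only; the channel and parameter values are hypothetical) computes the harvested energy in (3) and optionally clips it with the piecewise-linear saturation model.

import numpy as np

def harvested_energy(f_dk, w_d, p_s, eta, p_sat=None):
    """Linear EH model in (3): E_k^h = eta * p_s * |f_dk^H w_d|^2.
    If p_sat is given, apply the piecewise-linear saturation model instead."""
    gain = np.abs(np.vdot(f_dk, w_d)) ** 2      # |f_dk^H w_d|^2
    e_h = eta * p_s * gain
    return e_h if p_sat is None else min(e_h, p_sat)

# Illustrative values: M = 4 antennas, beam aligned to the downlink channel
M = 4
f = (np.random.randn(M) + 1j * np.random.randn(M)) / np.sqrt(2)
w = f / np.linalg.norm(f)
print(harvested_energy(f, w, p_s=1.0, eta=0.7),
      harvested_energy(f, w, p_s=1.0, eta=0.7, p_sat=0.5))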

III-C Sensing Information Updates

We assume that the uplink channels are the same as the downlink channels in each time slot due to channel reciprocity, similar to that in [11] and [19]. Let p_{k}(t) denote the transmit power of the k-th sensor node when it is scheduled in the t-th time slot, i.e., \psi_{k}(t)=1. The signal received at the AP is given by {\bf y}_{k}=\sqrt{p_{k}(t)}{\bf f}_{u,k}(t)s_{k}+{\bf n}_{k}(t), where {\bf f}_{u,k}(t) denotes the uplink channel from the k-th sensor node to the AP and s_{k}(t) denotes its information symbol. Similar to (2), the IRS-assisted uplink channel is given by {\bf f}_{u,k}(t)={\bf h}^{d}_{k}(t)+{\bf G}_{k}(t){\boldsymbol{\Theta}}_{u}(t){\bf h}^{r}_{k}(t), where {\boldsymbol{\Theta}}_{u}(t)\triangleq\text{diag}(\boldsymbol{\theta}_{u}(t)) denotes the IRS's uplink passive beamforming strategy. Without loss of generality, the noise signal {\bf n}_{k}(t) received by the AP can be normalized to unit power. Thus, the received SNR can be characterized as \gamma_{k}(t)=p_{k}(t)|{\bf f}_{u,k}^{H}(t){\bf w}_{u}(t)|^{2}, where {\bf w}_{u}(t) represents the AP's receive beamforming vector. By using the time division protocol, the sensor nodes' uplink transmissions can avoid mutual interference. The AP can simply align its receive beamforming vector {\bf w}_{u}(t) to the uplink channel {\bf f}_{u,k}(t). As such, we can denote the received SNR as \gamma_{k}(t)=p_{k}(t)||{\bf f}_{u,k}(t)||^{2} and characterize the uplink throughput as r_{k}(t)=\tau_{k}\log(1+p_{k}(t)||{\bf f}_{u,k}(t)||^{2}), where \tau_{k}\in[0,1] denotes the uplink transmission time. Given the data size d_{k}, we require r_{k}(t)\geq d_{k} to ensure the successful transmission of the sensing information.

In each time slot, the sensor node's energy consumption is given by E_{k}^{c}(t)=\tau_{k}(t)(p_{k}(t)+p_{c}), where p_{c} denotes a constant circuit power to maintain the node's activity. The energy consumption \tau_{k}(t)p_{k}(t) in uplink data transmission is linearly proportional to the transmit power p_{k}(t) and the transmission time \tau_{k}(t). The transmit power p_{k}(t) can vary with the channel conditions to ensure the rate constraint r_{k}(t)\geq d_{k}. Let E_{\max} denote the sensor nodes' maximum battery capacity and {E}_{k}(t) be the energy state in the t-th time slot. Then, we have the following energy dynamics:

{E}_{k}(t+1)=\min\Big\{\Big({E}_{k}(t)-\psi_{k}(t)E^{c}_{k}(t)\Big)^{+}+\psi_{0}(t)E^{h}_{k}(t),\,E_{\max}\Big\}. (4)

Here we denote (x)^{+}\triangleq\max\{x,0\} for simplicity.
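The battery evolution in (4) can be written compactly as follows; this is a minimal sketch with hypothetical argument names.

def update_energy(E_k, psi_k, E_c, psi_0, E_h, E_max):
    """Energy dynamics in (4): spend E_c if node k transmits in this slot,
    harvest E_h if the slot is used for downlink energy transfer, clip to [0, E_max]."""
    residual = max(E_k - psi_k * E_c, 0.0)      # the (x)^+ operation
    return min(residual + psi_0 * E_h, E_max)

print(update_energy(E_k=0.8, psi_k=1, E_c=0.3, psi_0=0, E_h=0.0, E_max=1.0))   # 0.5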

IV Hierarchical Learning for AoI Minimization

In this paper, the physical sensing process is beyond our control and we only focus on the transmission scheduling and beamforming optimization over the wireless network. The sensing information can be randomly generated by the sensor nodes, depending on the energy status and the physical process under monitoring. Once new sensing data arrives, each sensor node will discard the existing data in the cache and always cache the latest sensing data. From the perspective of the receiver, the sensing data from each sensor node is considered as the new information and used to replace the obsolete information at the receiver. When node k is scheduled for uplink data transmission, the cached information will be uploaded to the AP and then the AP will replace the sensing information by the latest copy.

For each sensor node k\in\mathcal{K}, the caching delay depends on the AP's scheduling policy \boldsymbol{\Psi}(t). The transmission delay can be minimized by optimizing the sensor node's transmit control strategy (p_{k}(t),\tau_{k}(t)) and the joint beamforming strategies ({\bf w}_{u}(t),{\boldsymbol{\Theta}}_{u}(t)) in the uplink transmissions. Let A_{k}(t) denote the AoI value of the k-th sensor node, which is used to characterize the information freshness at the AP. When node k is scheduled to update its information in the t-th time slot, the AP can replace the obsolete information by the new sensing information and thus update the AoI in the next time slot as A_{k}(t+1)=1. Here we assume that node k can successfully finish its data transmission at the end of each time slot. Otherwise, when node k is not scheduled, its AoI will be further increased by one, i.e., A_{k}(t+1)=A_{k}(t)+1. Therefore, the AoI of each sensor node k\in\mathcal{K} will be updated by the following rules:

A_{k}(t+1)=\begin{cases}1,&\text{if }o_{k}(t)=1,\,\psi_{k}(t)=1,\,r_{k}(t)\geq d_{k},\,E_{k}(t)\geq E^{c}_{k}(t),\\ A_{k}(t)+1,&\text{otherwise}.\end{cases} (5)

Here o_{k}(t)\in\{0,1\} indicates the status of the caching space. When the cache is non-empty with o_{k}(t)=1 and node k is currently scheduled with \psi_{k}(t)=1, the AP can update the sensing information if the uplink data transmission is successful. Given the size d_{k} of the sensing data, the scheduled node k will have a successful data transmission if it has sufficient energy, i.e., E_{k}(t)\geq E^{c}_{k}(t), to fulfill its traffic demand, i.e., r_{k}(t)\geq d_{k}, where the uplink data rate r_{k}(t) depends on the control parameters (p_{k}(t),\tau_{k}) and the joint beamforming strategies ({\bf w}_{u}(t),{\boldsymbol{\Theta}}_{u}(t)).
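For clarity, the AoI update rule in (5) can be sketched in a few lines of Python (argument names are illustrative):

def update_aoi(A_k, o_k, psi_k, r_k, d_k, E_k, E_c):
    """AoI dynamics in (5): reset to 1 only if node k is scheduled, has cached data,
    meets the rate requirement, and has enough energy; otherwise the AoI grows by 1."""
    success = (o_k == 1) and (psi_k == 1) and (r_k >= d_k) and (E_k >= E_c)
    return 1 if success else A_k + 1

print(update_aoi(A_k=3, o_k=1, psi_k=1, r_k=2.1, d_k=2.0, E_k=0.6, E_c=0.4))   # 1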

We aim to minimize the AoI by optimizing the scheduling policy \boldsymbol{\Psi}\triangleq\{\boldsymbol{\Psi}(t)\}_{t\in\mathcal{T}} and the joint beamforming strategies ({\bf w},{\boldsymbol{\Theta}})\triangleq({\bf w}_{m}(t),{\boldsymbol{\Theta}}_{m}(t))_{m\in\{d,u\},t\in\mathcal{T}} in both the downlink and uplink phases. Considering different priorities of the sensing information, we assign different weights to the AoI values and define the time-averaged weighted AoI as follows:

\bar{A}(\boldsymbol{\Psi},{\bf w},\boldsymbol{\Theta})=\lim_{T\rightarrow\infty}\frac{1}{TK}\mathbb{E}\left[\sum_{t\in\mathcal{T}}\sum_{k\in\mathcal{K}}\lambda_{k}A_{k}(t)\right], (6)

where the constant \lambda_{k} indicates the delay sensitivity of different sensing information. A larger weight should be given to more critical sensing information, e.g., the safety monitoring in autonomous driving. At this point, we can formulate the AoI minimization problem as follows:

\min_{\boldsymbol{\Psi},{\bf w},\boldsymbol{\Theta}}~\bar{A}(\boldsymbol{\Psi},{\bf w},\boldsymbol{\Theta}),\quad\text{s.t.}~\text{(1)}-\text{(5)}. (7)

Given the mode selection \psi_{0}(t), the optimization of ({\bf w},\boldsymbol{\Theta}) corresponds to either the downlink energy transfer or the uplink information transmission. The downlink energy transfer determines the power budgets of different sensor nodes, which should be jointly optimized with the users' scheduling policy \{\psi_{k}(t)\}_{k\in\mathcal{K}} to improve the overall AoI performance. Problem (7) is firstly challenged by its stochasticity and high dimensionality. The energy dynamics in (4) and the time-averaged AoI objective in (7) imply that the solutions (\boldsymbol{\Psi}(t),{\bf w}(t),\boldsymbol{\Theta}(t)) in different time slots are temporally correlated. A dynamic programming approach to solve (7) can be practically intractable due to the curse of dimensionality. The joint scheduling and beamforming optimization also leads to a high-dimensional mixed-integer problem that is difficult to solve in practice. The second difficulty of problem (7) lies in that the dynamics of the data arrival process at each sensor node can be unknown to the AP, which makes the scheduling optimization more complicated in practice. Without complete information, the AP has to adapt its scheduling policy based on the past observations of the AoI dynamics. The third difficulty lies in the combinatorial nature of the AP's scheduling policy \boldsymbol{\Psi}. Even given the scheduling policy \boldsymbol{\Psi}, challenges remain as the joint beamforming strategies ({\bf w},\boldsymbol{\Theta}) are not only coupled with each other in a non-convex form, but also introduce competition for the energy resource among different sensor nodes.

Figure 1: The hierarchical DRL framework includes the outer-loop DRL and the inner-loop optimization methods.

To overcome these difficulties, we devise a hierarchical learning structure for problem (7) that decomposes the optimization of (\boldsymbol{\Psi},{\bf w},\boldsymbol{\Theta}) into two parts. The overall algorithm framework is shown in Fig. 1, which mainly includes the outer-loop learning module for scheduling and the inner-loop optimization module for beamforming optimization. In fact, we may also apply an inner-loop learning method to adapt the beamforming strategy. However, this may still require huge action and state spaces and thus lead to slow learning performance. Instead of an inner-loop learning method, the AP can estimate the beamforming strategy ({\bf w}(t),\boldsymbol{\Theta}(t)) efficiently by using the optimization method, based on the AP's observation of the users' channel conditions. This motivates us to devise a hybrid solution structure that exploits the outer-loop learning and the inner-loop optimization modules. Specifically, due to the combinatorial nature of the scheduling policy \boldsymbol{\Psi}(t), we employ the model-free DRL approach to adapt \boldsymbol{\Psi}(t) in the outer-loop learning procedure. In each iteration, the DRL agent firstly determines \boldsymbol{\Psi}(t) based on the past observations of the nodes' AoI dynamics. Then, the inner-loop joint optimization of ({\bf w}(t),\boldsymbol{\Theta}(t)) becomes much easier by using the alternating optimization (AO) and semi-definite relaxation (SDR) methods. According to the outer-loop mode selection \psi_{0}(t), the inner-loop optimization aims to either maximize the downlink energy transfer to all sensor nodes or fulfill the uplink information transmission of the scheduled sensor node. After the inner-loop optimization, the AP can execute the joint action (\boldsymbol{\Psi}(t),{\bf w}(t),\boldsymbol{\Theta}(t)) in the t-th time slot and then update the AoI state of each sensor node. The evaluation of the time-averaged AoI performance will drive the outer-loop DRL agent to adapt the scheduling decision \boldsymbol{\Psi}(t+1) and the beamforming strategies ({\bf w}(t+1),\boldsymbol{\Theta}(t+1)) in the next time slot. By such a decomposition, the inner-loop optimization becomes computation-efficient, while the outer-loop learning becomes time-efficient as it only adapts the combinatorial scheduling policy with a smaller action space.
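The interaction between the two loops can be summarized by the following Python skeleton. It is only a sketch of the control flow under assumed interface names (agent, env, inner_downlink, and inner_uplink are hypothetical), not the implementation used in the experiments.

def hierarchical_step(agent, env, inner_downlink, inner_uplink):
    """One iteration of the hierarchical framework: outer-loop scheduling by the DRL
    agent, inner-loop beamforming by optimization, then the AoI/energy state update."""
    state = env.observe()                        # s_t = (A(t), E(t))
    psi = agent.select_action(state)             # outer-loop scheduling decision Psi(t)
    if psi[0] == 1:                              # downlink energy transfer slot
        w_d, theta_d = inner_downlink(env.channels, state)
        reward, next_state = env.step(psi, w_d, theta_d)
    else:                                        # uplink slot for the scheduled node (if any)
        w_u, theta_u, tau, p = inner_uplink(env.channels, psi)
        reward, next_state = env.step(psi, w_u, theta_u, tau, p)
    agent.store(state, psi, reward, next_state)  # experience for outer-loop training
    agent.train()
    return next_state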

V Inner-loop Optimization Problems

Given the scheduling decision \psi_{0}(t)\in\{0,1\}, the AP either beamforms RF signals for downlink energy transfer or receives the sensing information from an individual sensor node. In each case, the AP will optimize the joint beamforming strategies ({\bf w}(t),\boldsymbol{\Theta}(t)). In the sequel, we discuss the inner-loop optimization problems in the two cases.

V-A Energy Minimization in Uplink Transmission

In the t-th time slot, when the k-th sensor node is allowed to update its sensing information with \psi_{k}(t)=1, all other sensor nodes have to wait for scheduling in the other time slots. The sensor nodes' AoI values will be either increased by 1 or reset to 1, depending on whether the transmission is unsuccessful or successful, respectively, as shown in (5). In this case, we can minimize the energy consumption E^{c}_{k}(t)=\tau_{k}(t)(p_{k}(t)+p_{c}) of the scheduled sensor node conditioned on the successful transmission of its sensing data, i.e., r_{k}(t)\geq d_{k}. This will preserve more energy for its future use. Thus, we have the following energy minimization problem:

\min_{\tau_{k},p_{k},{\bf w}_{u},\boldsymbol{\Theta}_{u}}~\tau_{k}(p_{k}+p_{c}) (8a)
s.t.~\tau_{k}\log(1+p_{k}|{\bf f}^{H}_{k}{\bf w}_{u}|^{2})\geq d_{k}, (8b)
\tau_{k}\in(0,1)\text{ and }\theta_{u,n}\in(0,2\pi],\,n\in\mathcal{N}. (8c)

The uplink transmission only considers node k's rate constraint in (8b). Hence, the AP's receive beamforming vector {\bf w}_{u} can be aligned with the uplink channel {\bf f}_{u,k}={\bf h}^{d}_{k}+{\bf G}_{k}{\boldsymbol{\Theta}}_{u}{\bf h}^{r}_{k}. Given {\bf w}_{u}={\bf f}_{u,k}/||{\bf f}_{u,k}||, the optimal passive beamforming strategy \boldsymbol{\Theta}_{u} needs to maximize the IRS-assisted channel gain |{\bf f}_{k}^{H}{\bf w}_{u}|^{2} as follows:

\max_{\theta_{u,n}\in(0,2\pi]}~||{\bf h}^{d}_{k}+{\bf G}_{k}{\boldsymbol{\Theta}}_{u}{\bf h}^{r}_{k}||^{2}, (9)

which can be easily solved by the SDR method similar to that in [24] and [29]. The transmission control parameters (\tau_{k},p_{k}) can be also optimized to minimize the energy consumption in (8a). Let e_{k}\triangleq\tau_{k}p_{k} denote the node's energy consumption in RF communications. Given the optimal \boldsymbol{\Theta}_{u} to (9), we can find the optimal (\tau_{k},p_{k}) by the following problem:

\min_{e_{k},\tau_{k}\in(0,1)}~e_{k}+p_{c}\tau_{k},\quad\text{s.t.}\quad\tau_{k}\log\left(1+\frac{e_{k}}{\tau_{k}}||{\bf f}_{u,k}||^{2}\right)\geq d_{k}. (10)

Problem (10) is convex in (\tau_{k},e_{k}) and satisfies Slater's condition, which allows us to find a closed-form solution by using the Lagrangian dual method [30]. After this, we can easily find the optimal transmit power as p_{k}=e_{k}/\tau_{k}. If the energy budget holds, i.e., e_{k}+p_{c}\tau_{k}\leq E_{k}, node k's data transmission will be successful and thus we can update its AoI as A_{k}(t+1)=1.
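Since (10) is convex and the rate constraint is tight at the optimum, a simple numerical alternative to the closed-form Lagrangian solution is to eliminate e_{k} and minimize over \tau_{k} directly. The sketch below assumes a natural-logarithm rate, a given channel gain ||{\bf f}_{u,k}||^{2}, and illustrative parameter values; it is not the paper's implementation.

import numpy as np
from scipy.optimize import minimize_scalar

def solve_uplink_control(d_k, gain, p_c, tau_max=1.0):
    """Solve (10) numerically: with the rate constraint tight, e_k(tau) = tau/gain *
    (exp(d_k/tau) - 1); minimize the convex 1-D function e_k(tau) + p_c * tau."""
    total = lambda tau: tau / gain * np.expm1(d_k / tau) + p_c * tau
    res = minimize_scalar(total, bounds=(1e-6, tau_max), method='bounded')
    tau = res.x
    e_k = tau / gain * np.expm1(d_k / tau)
    return tau, e_k / tau                        # optimal (tau_k, p_k = e_k / tau_k)

print(solve_uplink_control(d_k=2.0, gain=5.0, p_c=0.1))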

V-B Weighted Energy Transfer Maximization

In the downlink energy transfer with \psi_{0}(t)=1, we aim to supply sufficient energy to all sensor nodes to sustain their uplink information transmissions and thus minimize the time-averaged AoI performance. However, it is difficult to explicitly quantify how the downlink energy transfer affects the AoI performance. Instead, our intuitive design is to transfer more energy to those sensor nodes with relatively worse AoI conditions. If node k has a higher AoI value, we expect to transfer more energy to it. This allows node k to increase its sampling frequency and report its sensing information with a shorter transmission delay, thus reducing its AoI value in the following sensing and reporting cycles. Following this intuition, in the downlink phase we aim to maximize the AoI-weighted energy transfer to all sensor nodes, formulated as follows:

\max_{{\bf w}_{d},\boldsymbol{\Theta}_{d}}~\sum_{k\in\mathcal{K}}v_{k}|{\bf f}^{H}_{d,k}{\bf w}_{d}|^{2} (11a)
s.t.~E^{c}_{k}\leq E_{k}+E^{h}_{k},\quad\forall\,k\in\mathcal{K}, (11b)
||{\bf w}_{d}||\leq 1\text{ and }\theta_{d,n}\in(0,2\pi),\quad\forall\,n\in\mathcal{N}, (11c)

where E^{c}_{k}=\tau_{k}(p_{k}+p_{c}) denotes node k's energy consumption in the uplink transmission. For each node k, we define the weight parameter as v_{k}=A_{k}+\alpha_{k}E_{k}^{-1}, which is increasing in the AoI value A_{k} and inversely proportional to the energy level E_{k}. Thus, we prefer to transfer more energy to the sensor node with a higher AoI value and a lower energy supply, which is prioritized by a larger weight parameter v_{k} in (11a). Such a heuristic is expected to reduce the average AoI of all sensor nodes in the long term. The constant \alpha_{k} characterizes the tradeoff between the sensor node's energy supply and AoI performance. A larger value of \alpha_{k} indicates that the sensor node is more sensitive to energy insufficiency. Besides, we expect that any sensor node may need to upload its data in future time slots, but we do not know when it will be scheduled to transmit. For a conservative consideration, we impose the constraint (11b) to ensure that all sensor nodes will have sufficient energy to upload their data in the next time slot. If we remove (11b), it becomes possible that some node k may not have sufficient energy to upload its data after the beamforming optimization. If this node k happens to be scheduled by the DRL agent in the next time slot, its data transmission will be unsuccessful and thus its AoI will continue increasing at the AP. Therefore, we include the constraint in (11b) as a one-step lookahead safety mechanism to ensure that every sensor node has sufficient energy for data transmission if it is scheduled in the next time slot.

V-B1 Alternating optimization (AO) for problem (11)

Given the uplink ({\bf w}_{u},\boldsymbol{\Theta}_{u}) and the control parameters (\tau_{k},p_{k})_{k\in\mathcal{K}}, the optimization of the downlink ({\bf w}_{d},\boldsymbol{\Theta}_{d}) in (11) can be decomposed by the AO method into two sub-problems, similar to that in [29]. In the first sub-problem, we optimize \boldsymbol{\Theta}_{d} in problem (11) with the fixed {\bf w}_{d}. Note that only the IRS-enhanced downlink channel {\bf f}_{d,k} relates to \boldsymbol{\Theta}_{d}, as shown in (2). We can simplify problem (11) by introducing a few auxiliary variables. Define {\bf F}_{k}\triangleq{\bf G}_{k}\text{diag}({\bf h}^{r}_{k}) and then we can simplify (2) as {\bf f}_{d,k}={\bf h}^{d}_{k}(t)+{\bf F}_{k}{\boldsymbol{\theta}}_{d}. We further define \bar{\boldsymbol{\theta}}_{d}\triangleq[\boldsymbol{\theta}_{d},\zeta]^{T}, where \zeta\geq 0 and |\zeta|=1. The quadratic term in (11a) can be rewritten as |{\bf f}^{H}_{d,k}{\bf w}_{d}|^{2}=\bar{\boldsymbol{\theta}}_{d}^{H}{\bf R}_{k}\bar{\boldsymbol{\theta}}_{d}+|({\bf h}^{d}_{k})^{H}{\bf w}_{d}|^{2}, where the matrix coefficient is given by {\bf R}_{k}=\begin{bmatrix}{\bf F}^{H}_{k}{\bf w}_{d}{\bf w}_{d}^{H}{\bf F}_{k}&{\bf F}^{H}_{k}{\bf w}_{d}{\bf w}_{d}^{H}{\bf h}^{d}_{k}\\({\bf h}^{d}_{k})^{H}{\bf w}_{d}{\bf w}_{d}^{H}{\bf F}_{k}&0\end{bmatrix}. We can further apply SDR to \bar{\boldsymbol{\theta}}^{H}_{d}{\bf R}_{k}\bar{\boldsymbol{\theta}}_{d} by introducing the matrix variable \boldsymbol{\Phi}_{d}=\bar{\boldsymbol{\theta}}_{d}\bar{\boldsymbol{\theta}}^{H}_{d}. A similar transformation can be applied to the energy budget constraint in (11b). As such, the optimization of the downlink \boldsymbol{\Theta}_{d} can be converted into the following SDP, similar to that in [29] and [31].

\max_{{\boldsymbol{\Phi}}_{d}\succeq 0}~\sum_{k\in\mathcal{K}}v_{k}\textbf{Tr}\bigl({\bf R}_{k}{\boldsymbol{\Phi}}_{d}\bigr) (12a)
s.t.~\textbf{Tr}\bigl({\bf R}_{k}{\boldsymbol{\Phi}}_{d}\bigr)\geq(\eta p_{s})^{-1}(E^{c}_{k}-E_{k}),\quad\forall\,k\in\mathcal{K}, (12b)
{\boldsymbol{\Phi}}_{d}(n,n)=1,\quad\forall\,n\in\mathcal{N}, (12c)

where \textbf{Tr}(\cdot) denotes the matrix trace. Given the constant weights v_{k}, problem (12) becomes a conventional beamforming optimization for the downlink MISO system [32], which can be solved efficiently by the interior-point algorithm. In the second sub-problem, we optimize {\bf w}_{d} in problem (11) with the fixed {\boldsymbol{\Theta}}_{d} and (\tau_{k},p_{k}). This follows a similar SDR approach as that in (12) by introducing a matrix variable {\bf W}_{d}={\bf w}_{d}{\bf w}_{d}^{H}. Once the optimal solution {\bf W}_{d} or {\boldsymbol{\Phi}}_{d} is obtained, we can extract the rank-one beamformer {\bf w}_{d} or {\boldsymbol{\theta}}_{d} by the Gaussian randomization method. We continue the iterations between {\bf w}_{d} and {\boldsymbol{\Theta}}_{d} until convergence to a stable point.

Algorithm 1 AO Method for Downlink Energy Transfer
0:  The channel information \{{\bf h}^{d}_{k}(t),{\bf G}_{k}(t),{\bf h}^{r}_{k}(t)\}, AoI state A_{k}, energy state E_{k}, and energy demand E_{k}^{c} of each sensor node k\in\mathcal{K}
0:  A feasible beamforming strategy ({\bf w}_{d},{\boldsymbol{\Theta}}_{d}), \tau\leftarrow 0
1:  v_{k}\leftarrow A_{k}+\alpha_{k}E_{k}^{-1}
2:  E_{d}^{(\tau)}\leftarrow 0, E_{d}^{(\tau+1)}\leftarrow\sum_{k\in\mathcal{K}}v_{k}|{\bf f}^{H}_{d,k}{\bf w}_{d}|^{2}
3:  while ||E_{d}^{(\tau+1)}-E_{d}^{(\tau)}||\geq\epsilon do
4:     \tau\leftarrow\tau+1
5:     Solve the SDP (12) by the interior-point algorithm
6:     Extract the rank-one passive beamformer {\boldsymbol{\Theta}}_{d}
7:     Update {\bf w}_{d} with the fixed {\boldsymbol{\Theta}}_{d}
8:     E_{d}^{(\tau)}\leftarrow E_{d}^{(\tau+1)}
9:     E_{d}^{(\tau+1)}\leftarrow\sum_{k\in\mathcal{K}}v_{k}|{\bf f}^{H}_{d,k}{\bf w}_{d}|^{2}
10:  end while
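For reference, step 5 of Algorithm 1 can be prototyped with a generic convex solver. The following CVXPY sketch (assumed inputs: the matrices {\bf R}_{k} from the SDR transformation, the weights v_{k}, and the right-hand sides of (12b); not the paper's code) solves the SDP (12) and recovers a phase-only \boldsymbol{\theta}_{d} from the leading eigenvector.

import numpy as np
import cvxpy as cp

def solve_sdp_phase_shift(R_list, v, rhs, N):
    """SDR sub-problem (12): maximize sum_k v_k Tr(R_k Phi) subject to the per-node
    energy constraints (12b) and the unit-diagonal constraint (12c), with Phi PSD."""
    Phi = cp.Variable((N + 1, N + 1), hermitian=True)
    obj = cp.Maximize(cp.real(sum(v[k] * cp.trace(R_list[k] @ Phi) for k in range(len(v)))))
    cons = [Phi >> 0, cp.diag(Phi) == 1]
    cons += [cp.real(cp.trace(R_list[k] @ Phi)) >= rhs[k] for k in range(len(v))]
    cp.Problem(obj, cons).solve()
    # Leading-eigenvector (or Gaussian randomization) step to recover a rank-one solution
    eigval, eigvec = np.linalg.eigh(Phi.value)
    theta_bar = np.sqrt(max(eigval[-1], 0)) * eigvec[:, -1]
    return np.exp(1j * np.angle(theta_bar[:N] / theta_bar[N]))      # unit-modulus theta_d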

V-B2 Simple approximation to problem (11)

Given the scheduling decision \psi_{k}(t), the inner-loop optimization estimates the beamforming strategies ({\bf w}_{d},{\boldsymbol{\Theta}}_{d}) and the transmission parameters (\tau_{k},p_{k})_{k\in\mathcal{K}}. The inner-loop optimization should be very efficient to minimize the computational overhead and processing delay in each iteration. Note that the AO algorithm for ({\bf w}_{d},\boldsymbol{\Theta}_{d}) can be inefficient as each iteration requires solving the SDP problem (12) with a high computational complexity. The number of AO iterations till convergence is also unknown. Instead of the AO method, we further propose a simple approximation to problem (11) by optimizing \boldsymbol{\Theta}_{d} with a fixed and feasible {\bf w}_{d}. This solution can avoid the iterations between {\bf w}_{d} and \boldsymbol{\Theta}_{d}, and thus improve the inner-loop computation efficiency. In particular, we firstly consider an optimistic case in which the AP's downlink beamforming vector {\bf w}_{d} can be aligned to all users' downlink channels {\bf f}_{d,k}, and thus we can relax problem (11) as follows:

\max_{\boldsymbol{\Theta}_{d}}~\sum_{k\in\mathcal{K}}v_{k}||{\bf f}_{d,k}||^{2},\quad\text{s.t.}\quad\text{(11b)}-\text{(11c)}, (13)

which only relies on {\boldsymbol{\Theta}}_{d} and can be solved by a similar SDR method as for problem (9). However, problem (13) overestimates the total energy transfer to all sensor nodes. In the second step, we can reorder the sensor nodes by the channel gain ||{\bf f}_{d,k}||^{2} and fix the downlink beamforming vector as {\bf w}_{d}={\bf f}^{m}_{d,k}/||{\bf f}^{m}_{d,k}||, where {\bf f}^{m}_{d,k}=\arg\min_{k\in\mathcal{K}}||{\bf f}_{d,k}||^{2}. This intuition ensures that we transfer more RF energy to the sensor node with the worst channel condition. In the third step, we optimize \boldsymbol{\Theta}_{d} by solving (12) with the fixed {\bf w}_{d}, which provides a feasible lower bound to problem (11). As such, we only need to solve two SDPs to approximate the solution ({\bf w}_{d},\boldsymbol{\Theta}_{d}).

Algorithm 2 Simple Approximation for (𝐰d,𝚯d)({\bf w}_{d},\boldsymbol{\Theta}_{d})
0:  The channel information \{{\bf h}^{d}_{k}(t),{\bf G}_{k}(t),{\bf h}^{r}_{k}(t)\}, AoI state A_{k}, energy state E_{k}, and energy demand E_{k}^{c} of each sensor node k\in\mathcal{K}
1:  v_{k}\leftarrow A_{k}+\alpha_{k}E_{k}^{-1}
2:  Solve problem (13) in the optimistic case
3:  {\bf f}^{m}_{d,k}\leftarrow\arg\min_{k\in\mathcal{K}}||{\bf f}_{d,k}||^{2}
4:  {\bf w}_{d}\leftarrow{\bf f}^{m}_{d,k}/||{\bf f}^{m}_{d,k}||
5:  Solve problem (12) with the fixed {\bf w}_{d}
6:  Extract the passive beamforming strategy {\boldsymbol{\Theta}}_{d}

VI Outer-loop Learning for Scheduling

The outer-loop DRL approach aims to update the AP's scheduling policy by continuously interacting with the network environment. We can reformulate the scheduling optimization problem into a Markov decision process (MDP), which can be characterized by a tuple (\mathcal{S},\mathcal{A},\mathcal{R}). The state space \mathcal{S} denotes the set of all system states. In each decision epoch, the AP's state {\bf s}_{t}\in\mathcal{S} includes all sensor nodes' AoI values, denoted as a vector {\bf A}(t)\triangleq[A_{1}(t),A_{2}(t),\ldots,A_{K}(t)], and the energy status, denoted as {\bf E}(t)=[E_{1}(t),E_{2}(t),\ldots,E_{K}(t)]. Hence, we can define the system state in each time slot as {\bf s}_{t}=({\bf A}(t),{\bf E}(t)). For each sensor node k, we have A_{k}(t)\geq 1 as its AoI value can keep increasing from 1. The energy status E_{k}(t) is upper bounded by the maximum battery capacity, i.e., E_{k}(t)\in[0,E_{\max}]. The action space \mathcal{A} denotes the set of all feasible scheduling decisions {\bf a}_{t}\triangleq\{\psi_{0}(t),\psi_{1}(t),\ldots,\psi_{K}(t)\}\in\{0,1\}^{K+1} that satisfy the inequality in (1). Given the AP's scheduling decision, we can obtain ({\bf w}_{d},\boldsymbol{\Theta}_{d}) by the inner-loop optimization and then update the AoI performance of different sensor nodes. The reward \mathcal{R} assigns each state-action pair an immediate value v_{t}({\bf s}_{t},{\bf a}_{t}). It also influences the DRL agent's action adaptation to maximize the long-term reward, namely, the value function V_{\pi}({\bf s}_{0})\triangleq\sum_{t\in\mathcal{T}}\varepsilon^{t}v_{t}({\bf s}_{t},{\bf a}_{t}), where \varepsilon\in(0,1) is the discount factor for cumulating the reward and \pi({\bf a}_{t}|{\bf s}_{t}) denotes the policy function mapping each state {\bf s}_{t} to the action {\bf a}_{t}. Specifically, considering the AoI minimization in (7), we can define the reward v_{t}({\bf s}_{t},{\bf a}_{t}) in the t-th time step as follows:

v_{t}({\bf s}_{t},{\bf a}_{t})=-\frac{1}{K|\mathcal{H}_{t}|}\sum_{\tau\in\mathcal{H}_{t}}\sum_{k\in\mathcal{K}}\lambda_{k}A_{k}(\tau), (14)

where \mathcal{H}_{t}\triangleq\{t-t_{o},\ldots,t-1,t\}\subset\mathcal{T} denotes a set of past time slots with the length |\mathcal{H}_{t}|. Hence, the reward v_{t}({\bf s}_{t},{\bf a}_{t}) can be considered as the negative of the weighted AoI of all sensor nodes averaged over the most recent sliding window of past time slots.
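A minimal implementation of the sliding-window reward in (14) is sketched below (the class and argument names are our own, for illustration):

from collections import deque
import numpy as np

class AoIReward:
    """Reward (14): negative weighted AoI averaged over the last |H_t| time slots."""
    def __init__(self, K, window, weights):
        self.K = K
        self.history = deque(maxlen=window)      # keeps only the most recent |H_t| slots
        self.weights = np.asarray(weights, dtype=float)

    def step(self, aoi_vector):
        self.history.append(np.dot(self.weights, aoi_vector))
        return -float(np.mean(self.history)) / self.K

reward = AoIReward(K=3, window=5, weights=[1.0, 1.0, 2.0])
print(reward.step([1, 2, 3]))                    # -(1 + 2 + 6) / 3 = -3.0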

DRL approaches such as the value-based DQN and policy-based policy gradient (PG) algorithms have been demonstrated to be effective for solving MDPs in large-scale wireless networks by using DNNs to approximate either the value function V_{\pi}({\bf s}_{t}) or the policy function \pi({\bf a}_{t}|{\bf s}_{t}) [33]. In particular, the DQN method is an extension of the classic Q-learning method [34], using a DNN to approximate the Q-value function Q_{\boldsymbol{\mu}}({\bf s}_{t},{\bf a}_{t}), where \boldsymbol{\mu} denotes the DNN weight parameters of the Q-value network. Starting from the initial state {\bf s}_{0}, the PG algorithms directly optimize the value function V_{\pi}({\bf s}_{0}) by using gradient-based approaches to update the policy \pi_{\boldsymbol{\omega}}({\bf a}_{t}|{\bf s}_{t}), where \boldsymbol{\omega} denotes the DNN weight parameters of the policy network. The proximal policy optimization (PPO) algorithm was recently proposed in [10] as a sample-efficient and easy-to-implement PG algorithm, striking a favorable balance between complexity, simplicity, and learning efficiency. In the sequel, we implement both the DQN and PPO algorithms and compare their performance for the outer-loop scheduling optimization.

VI-A DQN Algorithm for Scheduling

The DQN algorithm relies on two DNNs to stabilize the learning, denoted by the online Q-network and the target Q-network. Given the DNN parameters {\boldsymbol{\mu}}_{t} and {\boldsymbol{\mu}}_{t}^{\prime} of the two Q-networks, their outputs are given by Q_{{\boldsymbol{\mu}}_{t}}({\bf s}_{t},{\bf a}_{t}) and Q_{{\boldsymbol{\mu}}_{t}^{\prime}}({\bf s}_{t},{\bf a}_{t}), respectively. At each learning episode, the DQN agent observes the system state {\bf s}_{t}=({\bf A}(t),{\bf E}(t)) and chooses the best scheduling action {\bf a}_{t} with the maximum value of Q_{{\boldsymbol{\mu}}_{t}}({\bf s}_{t},{\bf a}_{t}). To enable random exploration, the DQN agent can also take a random action with a small probability. Once the action {\bf a}_{t} is fixed and executed in the network, the DQN agent observes the instant reward v_{t}({\bf s}_{t},{\bf a}_{t}) and records the transition to the next state {\bf s}_{t+1}. Each transition sample ({\bf s}_{t},{\bf a}_{t},v_{t},{\bf s}_{t+1}) will be stored in the experience replay buffer. Meanwhile, the DQN agent estimates the target Q-value as y_{t}=v_{t}({\bf s}_{t},{\bf a}_{t})+\varepsilon Q_{{\boldsymbol{\mu}}_{t}^{\prime}}({\bf s}_{t+1},{\bf a}_{t+1}) by using the target Q-network with a different weight parameter {\boldsymbol{\mu}}_{t}^{\prime}.

DQN's success lies in the design of the experience replay mechanism that improves the learning efficiency by reusing the historical transition samples to train the DNN in each iteration [33]. The DNN training aims to adjust the parameter {\boldsymbol{\mu}}_{t} to minimize a loss function \ell({\boldsymbol{\mu}}_{t}), which is defined as the gap, or more formally the temporal-difference (TD) error, between the online Q-network Q_{{\boldsymbol{\mu}}_{t}}({\bf s}_{t},{\bf a}_{t}) and the target value y_{t}, specified as follows:

\ell({\boldsymbol{\mu}}_{t})=\mathbb{E}[|y_{t}-Q_{{\boldsymbol{\mu}}_{t}}({\bf s}_{t},{\bf a}_{t})|^{2}]. (15)

The expectation in (15) is taken over a random subset (i.e., mini-batch) of transition samples from the experience replay buffer. For each mini-batch sample, we can evaluate the target value y_{t} and generate the Q-value Q_{{\boldsymbol{\mu}}_{t}}({\bf s}_{t},{\bf a}_{t}). The weight parameters {\boldsymbol{\mu}}_{t} can be updated by the backpropagation of the gradient information. The DQN method stabilizes the learning by periodically copying the DNN parameter {\boldsymbol{\mu}}_{t} of the online Q-network to the target Q-network.
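The following PyTorch sketch shows one such training step for a standard DQN, where the target uses the maximum over actions of the target network (a common instantiation of the update described above); tensor shapes and names are assumptions, not the paper's code.

import torch
import torch.nn.functional as F

def dqn_update(online_q, target_q, optimizer, batch, gamma):
    """One mini-batch TD update minimizing the loss in (15)."""
    s, a, r, s_next = batch                      # states, action indices, rewards, next states
    q_sa = online_q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                        # target value y_t from the target Q-network
        y = r + gamma * target_q(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()                              # backpropagate the TD error
    optimizer.step()
    return loss.item()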

Figure 2: The DQN and PPO algorithms for the outer-loop scheduling optimization.

VI-B Policy-based Actor-Critic Algorithms

Different from the value-based DQN method, the policy-based approach improves the value function by updating the parametric policy \pi_{\boldsymbol{\omega}} in gradient-based methods [33]. Given the system state {\bf s}_{t}, the policy \pi_{\boldsymbol{\omega}}({\bf a}_{t}|{\bf s}_{t}) specifies a probability distribution over different actions {\bf a}_{t}\in\mathcal{A}. The DNN training aims at updating the policy network to improve the expected value function, rewritten as J({\boldsymbol{\omega}})=\sum_{{\bf s}\in\mathcal{S}}d({\bf s})V_{\pi}({\bf s})=\sum_{{\bf s}\in\mathcal{S}}d({\bf s})\sum_{{\bf a}\in\mathcal{A}}\pi_{\boldsymbol{\omega}}({\bf a}|{\bf s})Q^{\pi}({\bf s},{\bf a}), where d({\bf s}) is the stationary state distribution corresponding to the policy \pi_{\boldsymbol{\omega}} and Q^{\pi}({\bf s},{\bf a}) denotes the Q-value of the state-action pair ({\bf s},{\bf a}) following the policy \pi_{\boldsymbol{\omega}}. Now the expected value function J({\boldsymbol{\omega}}) becomes a function of the policy parameter {\boldsymbol{\omega}}. The policy gradient theorem in [35] simplifies the evaluation of the policy gradient \nabla_{\boldsymbol{\omega}}J({\boldsymbol{\omega}}) as follows:

\nabla_{\boldsymbol{\omega}}J({\boldsymbol{\omega}})=\mathbb{E}_{\pi}\Big[Q^{\pi}({\bf s},{\bf a})\nabla_{\boldsymbol{\omega}}\ln\pi_{\boldsymbol{\omega}}({\bf a}|{\bf s})\Big], (16)

where the expectation is taken over all possible state-action pairs following the same policy π𝝎\pi_{\boldsymbol{\omega}}.

For practical implementation, the policy gradient $\nabla_{\boldsymbol{\omega}}J({\boldsymbol{\omega}})$ can be evaluated by sampling historical decision-making trajectories. At each learning epoch $t$, the DRL agent interacts with the environment through the state-action pair $({\bf s}_{t},{\bf a}_{t})$, collects an immediate reward $v_{t}$, and observes the transition to the next state ${\bf s}_{t+1}$. Let $\{{\bf s}_{1},{\bf a}_{1},v_{1},{\bf s}_{2},{\bf a}_{2},v_{2},{\bf s}_{3},\ldots,v_{T}\}$ denote the state-action trajectory collected as the DRL agent interacts with the environment. We can estimate the Q-value $Q^{\pi}({\bf s}_{t},{\bf a}_{t})$ in (16) by the discounted return $g_{t}=\sum_{\tau=t}^{T}\varepsilon^{\tau-t}v_{\tau}$. As such, we can approximate the policy gradient $\nabla_{\boldsymbol{\omega}}J({\boldsymbol{\omega}})$ in each time step by the stochastic sample $g_{t}\nabla_{\boldsymbol{\omega}}\ln\pi_{\boldsymbol{\omega}}({\bf a}_{t}|{\bf s}_{t})$ and update the policy network as ${\boldsymbol{\omega}}\leftarrow{\boldsymbol{\omega}}+\alpha_{\boldsymbol{\omega}}g_{t}\nabla_{\boldsymbol{\omega}}\ln\pi_{\boldsymbol{\omega}}({\bf a}_{t}|{\bf s}_{t})$, where $\alpha_{\boldsymbol{\omega}}$ denotes the step size of the gradient update. Besides this stochastic approximation, we can also employ another DNN to approximate the Q-value $Q^{\pi}({\bf s}_{t},{\bf a}_{t})$ in (16), replacing the Monte Carlo estimate $g_{t}$ by the DNN approximation $Q_{\boldsymbol{\mu}}({\bf s}_{t},{\bf a}_{t})$ with the weight parameter ${\boldsymbol{\mu}}$, similar to the DQN algorithm. This motivates the actor-critic framework, which updates both the policy network and the Q-value network [35]. The actor updates the policy network while the critic updates the Q-network by minimizing a loss function similar to (15). We can take the derivative of the loss function and update the weight parameter as ${\boldsymbol{\mu}}\leftarrow{\boldsymbol{\mu}}+\alpha_{\boldsymbol{\mu}}\delta_{t}\nabla_{\boldsymbol{\mu}}Q_{\boldsymbol{\mu}}({\bf s}_{t},{\bf a}_{t})$, where $\delta_{t}=y_{t}-Q_{{\boldsymbol{\mu}}_{t}}({\bf s}_{t},{\bf a}_{t})$ denotes the TD error. The Q-value estimate can also be replaced by the advantage function $A_{\pi}({\bf s}_{t},{\bf a}_{t})\triangleq Q^{\pi}({\bf s}_{t},{\bf a}_{t})-V_{\pi}({\bf s}_{t})$ to reduce the variance of the gradient estimate and improve the learning efficiency.
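As an illustration of the Monte Carlo return $g_{t}$ and the advantage used above, the sketch below computes both from one recorded trajectory; the array inputs (per-step rewards and critic values) are assumed to be available and are not part of the paper's code.

```python
import numpy as np

def discounted_returns(rewards, discount=0.99):
    """Monte Carlo return g_t = sum_{tau=t}^{T} discount^(tau-t) * v_tau."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount * running
        returns[t] = running
    return returns

def advantage_estimates(returns, critic_values):
    """Advantage A(s_t, a_t) = g_t - V(s_t), with V(s_t) given by the critic."""
    return returns - critic_values
```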

The gradient estimation in (16) requires a complete trajectory generated by the same target policy $\pi_{\boldsymbol{\omega}}$. It is thus an on-policy method, which prevents us from reusing past experience and limits the sample efficiency. This drawback can be avoided by a minor revision to the policy gradient:

𝝎J(𝝎)=𝔼π𝝎o[π𝝎(𝐬,𝐚)π𝝎o(𝐬,𝐚)Aπ𝝎(𝐬,𝐚)𝝎lnπ𝝎(𝐚|𝐬)],\nabla_{\boldsymbol{\omega}}J({\boldsymbol{\omega}})=\mathbb{E}_{\pi_{\boldsymbol{\omega}}^{o}}\Big{[}\frac{\pi_{\boldsymbol{\omega}}({\bf s},{\bf a})}{\pi^{o}_{\boldsymbol{\omega}}({\bf s},{\bf a})}A_{\pi_{\boldsymbol{\omega}}}({\bf s},{\bf a})\nabla_{\boldsymbol{\omega}}\ln\pi_{\boldsymbol{\omega}}({\bf a}|{\bf s})\Big{]}, (17)

where the behavior policy $\pi_{\boldsymbol{\omega}}^{o}$ is used to collect the training samples. This off-policy gradient allows us to estimate (17) by using past experience collected from a different and even obsolete behavior policy $\pi_{\boldsymbol{\omega}}^{o}$. Hence, we can improve the sample efficiency by maintaining an experience replay buffer, similar to the DQN method. To further improve the training stability, the off-policy trust region policy optimization (TRPO) method imposes an additional constraint on the gradient update [36], i.e., the new policy $\pi_{\boldsymbol{\omega}}$ should not deviate too much from the old policy $\pi_{\boldsymbol{\omega}}^{o}$. Therefore, the policy optimization amounts to solving the following constrained problem:

max𝝎\displaystyle\max_{{\boldsymbol{\omega}}}~{} 𝔼π𝝎o[π𝝎(𝐬,𝐚)π𝝎o(𝐬,𝐚)Aπ𝝎o(𝐬,𝐚)],s.t.𝔼π𝝎o[DKL(π𝝎,π𝝎o)]δKL,\displaystyle~{}\mathbb{E}_{\pi_{\boldsymbol{\omega}}^{o}}\Big{[}\frac{\pi_{\boldsymbol{\omega}}({\bf s},{\bf a})}{\pi^{o}_{\boldsymbol{\omega}}({\bf s},{\bf a})}A_{\pi_{\boldsymbol{\omega}}^{o}}({\bf s},{\bf a})\Big{]},\quad\text{s.t.}\quad\mathbb{E}_{\pi_{\boldsymbol{\omega}}^{o}}[D_{KL}(\pi_{\boldsymbol{\omega}},\pi_{\boldsymbol{\omega}}^{o})]\leq\delta_{KL}, (18)

where $D_{KL}(P_{1},P_{2})\triangleq\int_{-\infty}^{\infty}P_{1}(x)\log\left(P_{1}(x)/P_{2}(x)\right)dx$ denotes the Kullback-Leibler (KL) divergence between two probability distributions [36]. The inequality constraint in (18) enforces that the KL divergence between the two policies $\pi_{\boldsymbol{\omega}}$ and $\pi_{\boldsymbol{\omega}}^{o}$ is bounded by $\delta_{KL}$. The advantage function $A_{\pi_{\boldsymbol{\omega}}^{o}}$ in the objective of (18) is an approximation of the true advantage $A_{\pi_{\boldsymbol{\omega}}}$ corresponding to the target policy $\pi_{\boldsymbol{\omega}}$. However, the exact solution to problem (18) is not easy to obtain. TRPO typically resorts to a first-order approximation of the objective and a second-order approximation of the KL constraint in (18).
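For the discrete scheduling actions considered here, the KL term in (18) reduces to a finite sum over actions; a small PyTorch sketch is given below, where the logit tensors produced by the two policy networks are hypothetical inputs.

```python
import torch

def categorical_kl(new_logits, old_logits):
    """Mini-batch estimate of E[D_KL(pi_w, pi_w^o)] for discrete action policies."""
    logp_new = torch.log_softmax(new_logits, dim=-1)
    logp_old = torch.log_softmax(old_logits, dim=-1)
    # KL(pi_w || pi_w^o) per sampled state, averaged over the mini-batch.
    return (logp_new.exp() * (logp_new - logp_old)).sum(dim=-1).mean()
```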

VI-C PPO Algorithm for Scheduling

The proximal policy optimization (PPO) algorithm proposed in [10] further improves the objective in (18) by limiting the probability ratio, i.e., the importance weight $\rho_{\boldsymbol{\omega}}({\bf s},{\bf a})\triangleq\frac{\pi_{\boldsymbol{\omega}}({\bf s},{\bf a})}{\pi^{o}_{\boldsymbol{\omega}}({\bf s},{\bf a})}$, to a safer region $[1-\epsilon,1+\epsilon]$:

J~(π𝝎)=𝔼π𝝎o[min{ρ𝝎Aπ𝝎o,clip(ρ𝝎,1ϵ,1+ϵ)Aπ𝝎o}],\tilde{J}(\pi_{\boldsymbol{\omega}})=\mathbb{E}_{\pi_{\boldsymbol{\omega}}^{o}}\Big{[}\min\Big{\{}\rho_{\boldsymbol{\omega}}A_{\pi_{\boldsymbol{\omega}}^{o}},\text{clip}(\rho_{\boldsymbol{\omega}},1-\epsilon,1+\epsilon)A_{\pi_{\boldsymbol{\omega}}^{o}}\Big{\}}\Big{]}, (19)

where the function clip(ρ𝝎,1ϵ,1+ϵ)\text{clip}(\rho_{\boldsymbol{\omega}},1-\epsilon,1+\epsilon) returns ρ𝝎\rho_{\boldsymbol{\omega}} if ρ𝝎[1ϵ,1+ϵ]\rho_{\boldsymbol{\omega}}\in[1-\epsilon,1+\epsilon] and returns 1ϵ1-\epsilon (or 1+ϵ1+\epsilon) if ρ𝝎<1ϵ\rho_{\boldsymbol{\omega}}<1-\epsilon (or ρ𝝎>1+ϵ\rho_{\boldsymbol{\omega}}>1+\epsilon). The parameter ϵ\epsilon is used to control the clipping range. The approximate value function J~(π𝝎)\tilde{J}(\pi_{\boldsymbol{\omega}}) ensures that the target policy π𝝎{\pi_{\boldsymbol{\omega}}} will not deviate too far from the behavior policy π𝝎o{\pi_{\boldsymbol{\omega}}^{o}} for either positive or negative advantage Aπ𝝎oA_{\pi_{\boldsymbol{\omega}}^{o}}. With the clipped value function J~(π𝝎)\tilde{J}(\pi_{\boldsymbol{\omega}}), we further introduce a Lagrangian dual variable βKL\beta_{KL} and reformulate the constrained problem (18) into the unconstrained maximization as follows:

\max_{{\boldsymbol{\omega}}}\,\,\tilde{J}(\pi_{\boldsymbol{\omega}})-\beta_{KL}\mathbb{E}_{\pi_{\boldsymbol{\omega}}^{o}}\Big{[}D_{KL}(\pi_{\boldsymbol{\omega}},\pi_{\boldsymbol{\omega}}^{o})\Big{]}. (20)

The policy parameter ${\boldsymbol{\omega}}$ of the new value function in (20) can be easily updated by the stochastic gradient ascent method. The penalty coefficient $\beta_{KL}$ can also be adapted according to the difference between the measured KL divergence $\mathbb{E}_{\pi_{\boldsymbol{\omega}}^{o}}[D_{KL}(\pi_{\boldsymbol{\omega}},\pi_{\boldsymbol{\omega}}^{o})]$ and its target value $\delta_{KL}$.
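Putting (19) and (20) together, a minimal PyTorch sketch of the resulting training loss is shown below; the log-probability tensors, advantage estimates, and the fixed coefficients eps_clip and beta_kl are illustrative assumptions rather than the paper's exact implementation, and the adaptation of $\beta_{KL}$ is omitted.

```python
import torch

def ppo_objective(logp_new, logp_old, advantages, kl_estimate,
                  eps_clip=0.2, beta_kl=0.01):
    """Negative of the clipped surrogate (19) with the KL penalty in (20)."""
    ratio = torch.exp(logp_new - logp_old)            # importance weight rho_w(s, a)
    clipped = torch.clamp(ratio, 1.0 - eps_clip, 1.0 + eps_clip)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()
    # Gradient ascent on (20) is implemented as descent on its negative.
    return -(surrogate - beta_kl * kl_estimate)
```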

Algorithm 3 Energy-and-AoI-Aware Scheduling and Transmission Control Algorithm
0:  Target policy π𝝎\pi_{\boldsymbol{\omega}} and behavior policy π𝝎o\pi_{\boldsymbol{\omega}}^{o},
0:  $t\leftarrow 0$, $E_{k}(0)\leftarrow B$, $A_{k}(0)\leftarrow 0$
1:  for Episode={1,2,,MAX=3000}\text{Episode}=\{1,2,\dots,\text{MAX}=3000\} do
2:     while tTt\neq T do
3:        Observe the system state (𝐀(t),𝐄(t))({\bf A}(t),{\bf E}(t))
4:        Choose the outer-loop action 𝚿(t)\boldsymbol{\Psi}(t) for scheduling
5:        case ψ0(t)=0\psi_{0}(t)=0: optimize uplink data transmission in (8)
6:        case ψ0(t)=1\psi_{0}(t)=1: optimize downlink energy transfer in (11)
7:        Execute the joint action ${\bf a}(t)\triangleq(\boldsymbol{\Psi}(t),{\bf w}(t),\boldsymbol{\Theta}(t))$ and evaluate $v_{t}({\bf s}_{t},{\bf a}_{t})$
8:        Record the next state 𝐬t+1{\bf s}_{t+1} and buffer the transition (𝐬t,𝐚t,vt,𝐬t+1)({\bf s}_{t},{\bf a}_{t},v_{t},{\bf s}_{t+1})
9:        tt+1t\leftarrow t+1
10:     end while
11:     Sample a mini-batch from the experience replay buffer
12:     Estimate advantage Aπ𝝎oA_{\pi_{\boldsymbol{\omega}}^{o}} and update 𝝎{\boldsymbol{\omega}} to maximize (20)
13:     Update behavior policy π𝝎o(1μ)π𝝎o+μπ𝝎\pi_{\boldsymbol{\omega}}^{o}\leftarrow(1-\mu)\pi_{\boldsymbol{\omega}}^{o}+\mu\pi_{\boldsymbol{\omega}}
14:  end for

As shown in Fig. 2, we highlight the comparison between the DQN and the PPO algorithms for the outer-loop scheduling optimization. Different from the DQN algorithm, the PPO algorithm employs the actor-critic structure, which relies on two sets of DNNs to approximate the policy networks. The difference between the target policy network and the behavior policy network generates the importance sampling weight $\rho_{\boldsymbol{\omega}}({\bf s},{\bf a})$. The behavior policy network interacts with the environment and stores the transition samples in the experience replay buffer. The target policy network updates the DNN weight parameter ${\boldsymbol{\omega}}$ by sampling a mini-batch randomly from the experience replay buffer and maximizing the objective in (20). The complete solution procedure is listed in Algorithm 3. Given the preferable learning efficiency and robustness of the PPO algorithm, we integrate it into Algorithm 3 to adapt the outer-loop scheduling policy in the hierarchical learning framework. At the initialization stage, we randomly initialize the DNN weight parameter ${\boldsymbol{\omega}}$ of the policy network. In each learning episode, the AP collects the observations $({\bf A}(t),{\bf E}(t))$ of the system and decides the outer-loop scheduling decision $\boldsymbol{\Psi}(t)$, as shown in line 4 of Algorithm 3. Given the scheduling decision $\boldsymbol{\Psi}(t)$, the AP optimizes the joint beamforming strategy $({\bf w}(t),\boldsymbol{\Theta}(t))$ for either uplink information transmission or downlink energy transfer, corresponding to lines 5-6 of Algorithm 3. Note that the solution to downlink energy transfer can be determined by either the iterative Algorithm 1 or the simplified Algorithm 2. Once both the outer-loop and inner-loop decision variables are determined, we execute the joint action $(\boldsymbol{\Psi}(t),{\bf w}(t),\boldsymbol{\Theta}(t))$ in the wireless system and evaluate the reward function, as shown in lines 7-8 of Algorithm 3. The DNN training is based on a random mini-batch sampled from the experience replay buffer, as shown in lines 11-13 of Algorithm 3. For performance comparison, the DQN algorithm is also implemented for the outer-loop scheduling. Our numerical evaluation in Section VII reveals that the PPO-based algorithm improves the convergence performance and achieves a lower AoI value compared to the DQN-based algorithm.
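To illustrate how the outer-loop PPO agent and the inner-loop optimizers interact within one episode (lines 3-9 of Algorithm 3), the following Python sketch is provided; the environment interface, the agent object, and the two inner-loop solvers are hypothetical placeholders for the paper's modules.

```python
def run_episode(env, ppo_agent, uplink_solver, downlink_solver, horizon):
    """One outer-loop episode of Algorithm 3 (hypothetical interfaces)."""
    state = env.reset()                      # observe (A(0), E(0))
    transitions = []
    for t in range(horizon):
        psi = ppo_agent.act(state)           # outer-loop scheduling decision Psi(t)
        if psi[0] == 1:                      # psi_0(t) = 1: downlink energy transfer (11)
            w, theta = downlink_solver(env.channels(t))        # Algorithm 1 or 2
        else:                                # psi_0(t) = 0: uplink data transmission (8)
            w, theta = uplink_solver(env.channels(t), psi)
        next_state, reward = env.step(psi, w, theta)
        transitions.append((state, psi, reward, next_state))   # into the replay buffer
        state = next_state
    return transitions
```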

TABLE II: Parameter settings
Parameters | Values | Parameters | Values
Number of DNN hidden layers | 3 | Optimizer | Adam
Actor's learning rate | 0.0005 | Number of AP's antennas | 4
Critic's learning rate | 0.001 | AP's transmit power $p_{s}$ | 30 dBm
Reward discount | 0.99 | Energy conversion efficiency $\eta$ | 0.9
Number of neurons | 64 | Noise power $\sigma^{2}$ | $-75$ dBm
Activation function | Tanh and Softmax | Sensor nodes' data size $D$ | 5 Kbits

VII Simulation Results

In this section, we present simulation results to verify the performance gain of the proposed algorithms for the wireless-powered and IRS-assisted wireless sensor network. The $(x,y,z)$-coordinates of the AP and the IRS in meters are given by $(200,-200,0)$ and $(0,0,0)$, respectively. The sensor nodes are randomly distributed in a rectangular area $[5,35]\times[-35,35]$ in the $(x,y)$-plane with $z=-20$. We consider that the direct channel from the AP to each sensor node-$k$ follows the Rayleigh fading distribution, i.e., ${\bf h}^{d}_{k}(t)=\beta_{0,k}\tilde{\bf h}^{d}_{k}(t)$, where $\tilde{\bf h}^{d}_{k}(t)\sim\mathcal{CN}(0,{\boldsymbol{I}})$ denotes the complex Gaussian random vector and $\beta_{0,k}$ denotes the path loss of the direct channel. Given the distance $d^{\rm AS}_{k}$ from the AP to the sensor node-$k$, we have $\beta_{0,k}=32.6+36.7\log(d^{\rm AS}_{k})$. A similar channel model is employed in [37]. The IRS-sensor channel ${\bf h}^{r}_{k}(t)$ and the AP-IRS channel ${\bf G}_{k}(t)$ are modelled similarly. More detailed parameters are summarized in Table II.
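The following NumPy sketch shows one way to sample the direct AP-node channel under the stated model, interpreting $\beta_{0,k}$ as the path loss in dB and applying its square root (in linear scale) as an amplitude attenuation; this interpretation and the function interface are our assumptions, not the paper's simulation code.

```python
import numpy as np

def sample_direct_channel(num_antennas, distance_m, rng=None):
    """Sample h_k^d(t) with Rayleigh fading and path loss 32.6 + 36.7*log10(d) in dB."""
    rng = np.random.default_rng() if rng is None else rng
    path_loss_db = 32.6 + 36.7 * np.log10(distance_m)
    amplitude = np.sqrt(10.0 ** (-path_loss_db / 10.0))        # linear-scale attenuation
    fading = (rng.standard_normal(num_antennas)
              + 1j * rng.standard_normal(num_antennas)) / np.sqrt(2.0)   # CN(0, I)
    return amplitude * fading
```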

(a) Reward dynamics in outer-loop learning algorithms
(b) Convergence with linear and non-linear EH models
(c) Run time overhead for the inner-loop optimization
(d) Reward dynamics for the inner-loop optimization
Figure 3: The outer-loop learning and inner-loop optimization of the hierarchical PPO algorithm.

VII-A Improved Learning Efficiency and Convergence

In Fig. 3(a), we compare the reward performance among the hierarchical learning Algorithm 3, the hierarchical DQN algorithm, and the conventional PPO algorithm, in which all decision variables $(\boldsymbol{\Psi}(t),{\bf w}(t),\boldsymbol{\Theta}(t))$ are adapted simultaneously. Hence, we denote the conventional PPO as the All-in-One PPO algorithm in Fig. 3. The solid line in red represents the dynamics of the AoI performance of the proposed hierarchical PPO algorithm, which splits the decision variables into two parts and optimizes the beamforming strategy $({\bf w}(t),\boldsymbol{\Theta}(t))$ by the inner-loop Algorithm 1. The dotted line in red denotes the hierarchical DQN algorithm, and the dash-dotted line in black denotes the conventional All-in-One PPO algorithm. It is clear that the All-in-One PPO may not converge effectively due to the huge action space in the mixed discrete and continuous domain. Compared with the hierarchical PPO algorithm, the hierarchical DQN algorithm is less stable, as shown in Fig. 3. The hierarchical PPO for joint scheduling and transmission control reduces the action space in the outer-loop learning framework and thus achieves a significantly higher reward and faster convergence, guided by the inner-loop optimization modules. We also implement a two-step learning algorithm (denoted as Two-Step PPO in Fig. 3) in which both the inner-loop and outer-loop control variables are adapted by PPO learning methods. The Two-Step PPO algorithm converges better than the conventional All-in-One PPO algorithm. However, its reward performance is much inferior to that of the hierarchical learning framework, which is guided by the inner-loop optimization-driven target during the outer-loop learning.

Figure 3(b) reveals how different EH models affect the AoI performance. The linear EH model results in a slightly smaller AoI value than the non-linear EH model. The reason is that the linear EH model over-estimates the sensor nodes' EH capabilities. For the non-linear EH model, the harvested power does not further increase once the received signal power exceeds the saturation power $p_{\text{sat}}$. Fig. 3(b) shows that $p_{\text{sat}}$ also affects the AoI performance. Generally, we can expect a better AoI performance with a higher saturation power $p_{\text{sat}}$. In our simulation, we also evaluate the overall reward and average AoI performance with different discount factors $\varepsilon\in\{0.99,0.95,0.90,0.80\}$, which are used to accumulate the rewards over different decision epochs. The simulation results reveal that the learning becomes more stable with a larger discount factor. We further show the run time and performance comparison between the iterative Algorithm 1 and the simplified Algorithm 2 for the inner-loop beamforming optimization. It is clear that the run time of each inner-loop optimization algorithm increases with the size of the IRS and the number of sensor nodes. Besides, we observe that Algorithm 2 significantly saves the run time by reducing the number of iterations, especially with a large-size IRS, as shown in Fig. 3(c). However, the reward performances of the two algorithms are very close to each other, as shown in Fig. 3(d). This implies that we can preferably deploy Algorithm 2 in practice.

VII-B Performance Gain over Existing Scheduling Policies

In this part, we develop two baseline policies to verify the performance gain of Algorithm 3. The first baseline is the round-robin (ROR) scheduling policy, which periodically selects one sensor node to upload its status-update information. In each scheduling period, we jointly optimize the active and passive beamforming strategies to enhance the information transmission. The second baseline is the Max-Age-First (MAF) scheduling policy, i.e., the AP selects the sensor node with the highest AoI value to upload its sensing information [38]. Both baselines rely on the same EH policy, i.e., the AP starts downlink energy transfer only when the scheduled sensor node has insufficient energy, e.g., below some threshold value. In Fig. 4(a), we show the AoI performance as we increase the number of sensor nodes. For all algorithms, we set the same coordinates for the AP and the IRS. Generally, all scheduling policies achieve a small AoI value when there are only a few sensor nodes. The MAF policy performs better than the ROR policy, as it gives higher priority to the sensor nodes with unsatisfactory AoI performance. As the number of nodes increases, Algorithm 3 always outperforms the baselines by adapting the scheduling strategy to the sensor nodes' stochastic data arrivals, as shown in Fig. 4(a).

(a) Minimum AoI values compared with the baselines
(b) Enhanced fairness among different nodes
Figure 4: Performance gain over baseline scheduling policies.

In Fig. 4(b), we compare the scheduling fairness by showing the average AoI of different sensor nodes. For a fair comparison, we set the same weight parameters $\lambda_{k}$ for all sensor nodes in (6). It can be seen from Fig. 4(b) that Algorithm 3 achieves a smaller AoI value for each sensor node. Moreover, different sensor nodes achieve very similar AoI values, which implies enhanced fairness of the scheduling policy obtained by Algorithm 3. The ROR scheme shows a large deviation compared to the other two schemes. The reason is that the ROR scheme cannot adapt to the dynamic data arrival process. An interesting observation is that the MAF scheme also has a smaller AoI deviation among different sensor nodes. This is because the MAF scheme always chooses the sensor node with the highest AoI value to upload its sensing information, which effectively prevents the AoI of any sensor node from growing too high.

(a) Larger-size IRS reduces the transmission failure rate
(b) Larger-size IRS improves the AoI performance
Figure 5: Performance gain by using a larger-size IRS.

VII-C Performance Gain by Using a Larger-Size IRS

The use of the IRS not only improves the downlink wireless energy transfer, but also enhances the uplink channels for sensing information transmission. Both aspects implicitly improve the system's AoI performance. In this part, we intend to verify the performance gain achievable by using the IRS. In Fig. 5, we show the dynamics of the sensor nodes' transmission failure rate and the average AoI as the size of the IRS increases from 20 to 100. Specifically, we count the number of transmission failures within 30K time slots and visualize the transmission failure rate in Fig. 5(a). It is quite intuitive that a larger-size IRS can reduce the sensor nodes' transmission failure rate, owing to the IRS's reconfigurability to improve the channel quality. The increase in the IRS's size makes it more flexible to reshape the wireless channels and improve the uplink transmission success probability, which helps minimize the AoI values in the long run. However, the AoI values do not keep decreasing as the IRS's size increases. As shown in Fig. 5(b), the performance gap between our method and the baselines becomes smaller as the size increases. That is because the channel conditions become much better with a large-size IRS, and thus the bottleneck of the AoI performance becomes the scheduling delay instead of the transmission delay.

(a) Trade-off between AoI and energy consumption
(b) Transmission scheduling with different priorities
Figure 6: Energy-aware AoI minimization with different user priorities. We set $N=80$ and $K=3$ in the experiment.

VII-D Trade-off Between AoI and Energy Consumption

In this part, we study the trade-off between the AoI and the energy consumption of the sensor nodes. To capture this trade-off, we revise the DRL agent's reward as follows:

vt(𝐬t,𝐚t)=1K|t|τtk𝒦(λkAk(τ)+λ0Ekc(τ)),v_{t}({\bf s}_{t},{\bf a}_{t})=-\frac{1}{K|\mathcal{H}_{t}|}\sum_{\tau\in\mathcal{H}_{t}}\sum_{k\in\mathcal{K}}\Bigl{(}\lambda_{k}A_{k}(\tau)+\lambda_{0}E^{c}_{k}(\tau)\Bigr{)},

where the AoI-energy trade-off parameter $\lambda_{0}$ is used to balance the sensor nodes' AoI and energy consumption. With a smaller $\lambda_{0}$, a sensor node becomes more aggressive in uploading its information, which may cause energy outage due to the insufficient energy supply to the sensor node. With a larger $\lambda_{0}$, the sensor node focuses more on its energy status and becomes more tolerant to the information delay. As shown in Fig. 6(a), the AoI value and the energy consumption evolve with different trends under different $\lambda_{0}$. When $\lambda_{0}=0$, the average AoI is reduced by 53.8\% compared to that with a higher $\lambda_{0}=2000$. We also show the performance gain with different priorities $\lambda_{k}$ for the sensor nodes. Considering three sensor nodes, we gradually increase $\lambda_{1}$ from 1 to 4 for node-1 while fixing $\lambda_{2}=\lambda_{3}=2$. We evaluate the scheduling frequency over 30K time slots and plot in Fig. 6(b) the change of node-1's scheduling frequency, which is shown to increase linearly with its priority $\lambda_{1}$. Fig. 6(b) also shows the change of the three sensor nodes' AoI performance as we increase $\lambda_{1}$ for node-1. It is clear that node-1 experiences a much higher AoI value than the other two nodes when it has a smaller priority, e.g., $\lambda_{1}=1$. As we gradually increase its priority, our algorithm becomes more sensitive to node-1's AoI performance and thus tries to reduce its AoI by scheduling it more often, as revealed in Fig. 6(b). As such, node-1 takes up more transmission opportunities by sacrificing the AoI performance of the other two nodes. This verifies that our algorithm can adapt to the change of the sensor nodes' priorities.
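For completeness, the revised reward above can be evaluated as in the following sketch, where the AoI and energy-consumption histories over the window $\mathcal{H}_{t}$ are assumed to be stored as arrays of shape $(|\mathcal{H}_{t}|, K)$; this array layout and the function name are our assumptions.

```python
import numpy as np

def aoi_energy_reward(aoi_window, energy_window, node_weights, lambda_0):
    """Reward -1/(K*|H_t|) * sum_tau sum_k (lambda_k * A_k(tau) + lambda_0 * E_k^c(tau))."""
    num_slots, num_nodes = aoi_window.shape
    # node_weights has shape (K,) and broadcasts across the time dimension.
    penalty = (node_weights * aoi_window + lambda_0 * energy_window).sum()
    return -penalty / (num_nodes * num_slots)
```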

VIII Conclusions

In this paper, we have focused on a wireless-powered and IRS-assisted network and aimed to minimize the overall AoI for information updates. We have formulated the AoI minimization problem as a mixed-integer program and devised a hierarchical learning framework, which includes the outer-loop model-free learning and the inner-loop optimization methods. Simulation results demonstrate that our algorithm can significantly reduce the average AoI and achieve controllable fairness among sensor nodes. More specifically, the hierarchical PPO algorithm achieves a significantly higher reward and faster convergence compared to the hierarchical DQN algorithm. It also outperforms typical baseline strategies in terms of the AoI performance and fairness. The performance gain is more significant with a small-size IRS.

References

  • [1] M. A. Abd-Elmagid, N. Pappas, and H. S. Dhillon, “On the role of age of information in the internet of things,” IEEE Commun. Mag., vol. 57, no. 12, pp. 72–77, Dec. 2019.
  • [2] J. Lee, D. Niyato, Y. L. Guan, and D. I. Kim, “Learning to schedule joint radar-communication requests for optimal information freshness,” in proc. IEEE Intelligent Vehicles Symp. (IV), Jul. 2021, pp. 8–15.
  • [3] Y. Sun, E. Uysal-Biyikoglu, R. D. Yates, C. E. Koksal, and N. B. Shroff, “Update or Wait: How to keep your data fresh,” IEEE Trans. Inf. Theory, vol. 63, no. 11, pp. 7492–7508, Nov. 2017.
  • [4] S. Gong, X. Lu, D. T. Hoang, D. Niyato, L. Shu, D. I. Kim, and Y.-C. Liang, “Toward smart wireless communications via intelligent reflecting surfaces: A contemporary survey,” IEEE Commun. Surv. Tut., vol. 22, no. 4, pp. 2283–2314, Jun. 2020.
  • [5] Z. Bao, Y. Dong, Z. Chen, P. Fan, and K. B. Letaief, “Age-optimal service and decision processes in internet of things,” IEEE Internet Things J., vol. 8, no. 4, pp. 2826–2841, Feb. 2021.
  • [6] C. Xu, Y. Xie, X. Wang, H. H. Yang, D. Niyato, and T. Q. S. Quek, “Optimal status update for caching enabled IoT networks: A dueling deep r-network approach,” IEEE Trans. Wireless Commun., vol. 20, no. 12, pp. 8438–8454, Dec. 2021.
  • [7] C. Zhou, G. Li, J. Li, Q. Zhou, and B. Guo, “FAS-DQN: Freshness-aware scheduling via reinforcement learning for latency-sensitive applications,” IEEE Trans. Comput., pp. 1–1, Nov. 2021.
  • [8] E. T. Ceran, D. Gündüz, and A. György, “A reinforcement learning approach to age of information in multi-user networks with HARQ,” IEEE J. Sel. Area. Commun., vol. 39, no. 5, pp. 1412–1426, May 2021.
  • [9] B.-M. Robaglia, A. Destounis, M. Coupechoux, and D. Tsilimantos, “Deep reinforcement learning for scheduling uplink IoT traffic with strict deadlines,” in proc. IEEE GLOBECOM, Dec. 2021, pp. 1–6.
  • [10] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv:1707.06347, 2017. [Online]. Available: http://arxiv.org/abs/1707.06347
  • [11] L. Cui, Y. Long, D. T. Hoang, and S. Gong, “Hierarchical learning approach for age-of-information minimization in wireless sensor networks,” in proc. IEEE Int. Symp. World Wirel. Mob. Multimed. Netw. (WoWMoM), Jun. 2022, pp. 1–7.
  • [12] J. Feng, W. Mai, and X. Chen, “Simultaneous multi-sensor scheduling based on double deep Q-learning under multi-constraint,” in proc. IEEE ICCC, Jul. 2021, pp. 224–229.
  • [13] H. v. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. Thirtieth AAAI Conf. Artificial Intelligence, 2016, pp. 2094–2100.
  • [14] X. Wu, X. Li, J. Li, P. C. Ching, and H. V. Poor, “Deep reinforcement learning for IoT networks: Age of information and energy cost tradeoff,” in proc. IEEE GLOBECOM, Dec. 2020, pp. 1–6.
  • [15] S. Leng and A. Yener, “Age of information minimization for an energy harvesting cognitive radio,” IEEE Trans. Cogn. Commun. Network., vol. 5, no. 2, pp. 427–439, 2019.
  • [16] A. Arafa, J. Yang, S. Ulukus, and H. V. Poor, “Age-minimal transmission for energy harvesting sensors with finite batteries: Online policies,” IEEE Trans. Inf. Theory, vol. 66, no. 1, pp. 534–556, 2020.
  • [17] X. Ling, J. Gong, R. Li, S. Yu, Q. Ma, and X. Chen, “Dynamic age minimization with real-time information preprocessing for edge-assisted IoT devices with energy harvesting,” IEEE Trans. Netw. Sci. Eng., vol. 8, no. 3, pp. 2288–2300, 2021.
  • [18] M. Hatami, M. Leinonen, and M. Codreanu, “AoI minimization in status update control with energy harvesting sensors,” IEEE Trans. Commun., vol. 69, no. 12, pp. 8335–8351, 2021.
  • [19] M. A. Abd-Elmagid, H. S. Dhillon, and N. Pappas, “A reinforcement learning framework for optimizing age of information in RF-powered communication systems,” IEEE Trans. Commun., vol. 68, no. 8, pp. 4747–4760, Aug. 2020.
  • [20] A. H. Zarif, P. Azmi, N. Mokari, M. R. Javan, and E. Jorswieck, “AoI minimization in energy harvesting and spectrum sharing enabled 6G networks,” IEEE Trans. Green Commun. Network., pp. 1–1, 2022.
  • [21] S. Leng and A. Yener, “Learning to transmit fresh information in energy harvesting networks,” IEEE Trans Green Commun. Network., pp. 1–1, 2022.
  • [22] T. Bai, C. Pan, Y. Deng, M. Elkashlan, A. Nallanathan, and L. Hanzo, “Latency minimization for intelligent reflecting surface aided mobile edge computing,” IEEE J. Sel. Area. Commun., vol. 38, no. 11, pp. 2666–2682, Nov. 2020.
  • [23] Y. Cao, T. Lv, Z. Lin, and W. Ni, “Delay-constrained joint power control, user detection and passive beamforming in intelligent reflecting surface-assisted uplink mmwave system,” IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 2, pp. 482–495, Jun. 2021.
  • [24] A. Muhammad, M. Elhattab, M. A. Arfaoui, A. Al-Hilo, and C. Assi, “Age of information optimization in a RIS-assisted wireless network,” arXiv:2103.06405, 2021. [Online]. Available: https://arxiv.org/abs/2103.06405
  • [25] M. Samir, M. Elhattab, C. Assi, S. Sharafeddine, and A. Ghrayeb, “Optimizing age of information through aerial reconfigurable intelligent surfaces: A deep reinforcement learning approach,” IEEE Trans. Veh. Technol., vol. 70, no. 4, pp. 3978–3983, Apr. 2021.
  • [26] G. Cuozzo, C. Buratti, and R. Verdone, “A 2.4-GHz LoRa-based protocol for communication and energy harvesting on industry machines,” IEEE Internet Things J., vol. 9, no. 10, pp. 7853–7865, May 2022.
  • [27] J. Yao and N. Ansari, “Wireless power and energy harvesting control in IoD by deep reinforcement learning,” IEEE Trans. Green Commun. Netw., vol. 5, no. 2, pp. 980–989, Jun. 2021.
  • [28] B. Lyu, P. Ramezani, D. T. Hoang, S. Gong, Z. Yang, and A. Jamalipour, “Optimized energy and information relaying in self-sustainable IRS-empowered WPCN,” IEEE Trans. Commun., vol. 69, no. 1, pp. 619–633, Jan. 2021.
  • [29] B. Lyu, D. T. Hoang, S. Gong, D. Niyato, and D. I. Kim, “IRS-based wireless jamming attacks: When jammers can attack without power,” IEEE Wireless Commun. Lett., vol. 9, no. 10, pp. 1663–1667, Oct. 2020.
  • [30] S. Boyd, S. P. Boyd, and L. Vandenberghe, Convex optimization.   Cambridge, U.K.: Cambridge Univ. Press, 2004.
  • [31] Y. Zou, Y. Long, S. Gong, D. T. Hoang, W. Liu, W. Cheng, and D. Niyato, “Robust beamforming optimization for self-sustainable intelligent reflecting surface assisted wireless networks,” IEEE Trans. Cogn. Commun. Netw., pp. 1–1, Dec. 2021.
  • [32] Y. Zou, S. Gong, J. Xu, W. Cheng, D. T. Hoang, and D. Niyato, “Wireless powered intelligent reflecting surfaces for enhancing wireless communications,” IEEE Trans. Veh. Technol., vol. 69, no. 10, pp. 12 369–12 373, Oct. 2020.
  • [33] S. Gong, Y. Xie, J. Xu, D. Niyato, and Y.-C. Liang, “Deep reinforcement learning for backscatter-aided data offloading in mobile edge computing,” IEEE Netw., vol. 34, no. 5, pp. 106–113, Oct. 2020.
  • [34] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Commun. Surv. Tut., vol. 21, no. 4, pp. 3133–3174, May 2019.
  • [35] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning.” in proc. Int. Conf. Learn. Represent. (ICLR), Jan. 2016.
  • [36] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in proc. Int. Conf. Mach. Learn. (ICML), Jul. 2015, pp. 1889–1897.
  • [37] T. Jiang, H. V. Cheng, and W. Yu, “Learning to reflect and to beamform for intelligent reflecting surface with implicit channel estimation,” IEEE J. Sel. Area. Commun., vol. 39, no. 7, pp. 1931–1945, Jul. 2021.
  • [38] I. Kadota, A. Sinha, E. Uysal-Biyikoglu, R. Singh, and E. Modiano, “Scheduling policies for minimizing age of information in broadcast wireless networks,” IEEE/ACM Trans. Netw., vol. 26, no. 6, pp. 2637–2650, Dec. 2018.