
Meta-Reinforcement Learning for Timely and Energy-efficient Data Collection in Solar-powered UAV-assisted IoT Networks

Mengjie Yi, Xijun Wang, Juan Liu, Yan Zhang, and Ronghui Hou M. Yi and R. Hou are with School of Cyber Engineering, Xidian University, Xi’an 710071, China (e-mail: [email protected], [email protected]). X. Wang is with School of Electronics and Information Technology, Sun Yat-sen University, Guangzhou, 510006, China (e-mail: [email protected]). J. Liu is with School of Electrical Engineering and Computer Science, Ningbo University, Zhejiang, 315211, China (e-mail: [email protected]). Y. Zhang is with State Key Lab of Integrated Service Networks, Information Science Institute, Xidian University, Xi’an, Shaanxi, 710071, China (e-mail: [email protected]).
Abstract

Unmanned aerial vehicles (UAVs) have the potential to greatly aid Internet of Things (IoT) networks in mission-critical data collection, thanks to their flexibility and cost-effectiveness. However, challenges arise due to the UAV’s limited onboard energy and the unpredictable status updates from sensor nodes (SNs), which impact the freshness of collected data. In this paper, we investigate the energy-efficient and timely data collection in IoT networks through the use of a solar-powered UAV. Each SN generates status updates at stochastic intervals, while the UAV collects and subsequently transmits these status updates to a central data center. Furthermore, the UAV harnesses solar energy from the environment to maintain its energy level above a predetermined threshold. To minimize both the average age of information (AoI) for SNs and the energy consumption of the UAV, we jointly optimize the UAV trajectory, SN scheduling, and offloading strategy. Then, we formulate this problem as a Markov decision process (MDP) and propose a meta-reinforcement learning algorithm to enhance the generalization capability. Specifically, the compound-action deep reinforcement learning (CADRL) algorithm is proposed to handle the discrete decisions related to SN scheduling and the UAV’s offloading policy, as well as the continuous control of UAV flight. Moreover, we incorporate meta-learning into CADRL to improve the adaptability of the learned policy to new tasks. To validate the effectiveness of our proposed algorithms, we conduct extensive simulations and demonstrate their superiority over other baseline algorithms.

Index Terms:
Age of information, Internet of things, compound-action deep reinforcement learning, meta learning, unmanned aerial vehicle.

I Introduction

In recent years, the Internet of Things (IoT) has rapidly evolved with the unprecedented popularity of mobile devices, including smart home appliances and personal devices, to support all aspects of our daily lives [1]. However, the expanding scale of the IoT poses several challenges for network operators and service providers. One of the key challenges stems from the fact that IoT devices typically operate with limited energy resources. These constraints restrict their capacity to transmit data over long distances, thus hindering their ability to establish robust connections with remote base stations (BSs) [2]. Additionally, it is of significant importance to deploy IoT applications such as environmental sensing and disaster monitoring in challenging environments, including swamps, deserts, hazardous battlefields, and contaminated sites [3]. Unfortunately, in such scenarios, IoT devices are often placed in locations that lack proper wireless infrastructure.

Unmanned aerial vehicles (UAVs) are poised to become a pivotal element of future wireless networks, primarily due to their complete mobility control, economical operation, and rapid deployment capabilities. Within the context of IoT networks, UAVs can adapt their flight paths to collect environmental information from sensor nodes (SNs) in their vicinity. Through the establishment of a line-of-sight (LoS) communication link, UAVs can communicate reliably with SNs, thereby contributing to a reduction in the transmit power requirements of SNs and an extension of the network’s operational lifespan. However, this reduction in transmission energy for SNs comes at the expense of consuming the propulsion energy of the UAV itself [4]. This presents a challenge in the deployment of UAVs for data collection from SNs in IoT networks, given the finite onboard energy of UAVs. While the optimization of energy consumption can extend the UAV’s operational lifetime to a certain extent [5, 6], it cannot completely overcome the inherent energy limitation problem. Consequently, efficient energy replenishment mechanisms are imperative for the practical deployment of UAVs. Solar-powered UAVs, in particular, have attracted considerable interest for their ability to sustain flight operations over extended periods [7, 8]. Importantly, these solar-powered UAVs eliminate the necessity for periodic returns to a charging station to replenish their energy reserves [9]. This feature not only enhances their operational autonomy but also holds promise for a wide range of applications, where uninterrupted aerial data collection or surveillance is essential.

Numerous studies have concentrated on UAV-assisted data collection within IoT networks, with a primary focus on traditional performance metrics like throughput and latency as their principal optimization objectives [10, 11, 12]. Nonetheless, these conventional metrics may not fully reflect the timeliness of the data. Fortunately, the emergence of the age of information (AoI) has ushered in a novel perspective for assessing the freshness of data from SNs at the receiver’s perspective [13, 14]. Many efforts have employed deep reinforcement learning (DRL) algorithms to refine system parameters, such as UAV trajectories and SN associations, with the goal of reducing the AoI in UAV-assisted IoT networks [15, 16, 17]. However, a noteworthy limitation of policies obtained through DRL algorithms is their limited generalization capability. When the task undergoes changes, the policies previously learned often become less applicable, resulting in a decline in performance.

Driven by these challenges, we investigate the problem of timely and energy-efficient data collection assisted by a solar-powered UAV in IoT networks. In the system under consideration, each SN stochastically samples the environment and generates update packets. These update packets are subsequently gathered by a UAV, which caches them in its onboard buffer. When the UAV reaches a suitable location, it offloads the gathered data to the data center (DC). We employ a DRL approach to achieve the dual objectives of reducing the average AoI of SNs and reducing the energy consumption of the UAV. Furthermore, we augment DRL with meta-learning to enhance the algorithm's generalization capability. The primary contributions of this paper can be outlined as follows:

  • We study the timely and energy-efficient data collection problem in solar-powered UAV-assisted IoT networks. We undertake the joint optimization of the UAV’s velocity, SN scheduling, and data offloading decisions while considering the energy and kinematic constraints inherent to the UAV. This problem is formulated as a Markov decision process (MDP). Given the combination of continuous and discrete actions in this problem, we design a compound-action DRL (CADRL) algorithm to minimize the weighted sum of the average AoI of SNs and the energy consumption of the UAV.

  • To enhance the CADRL algorithm's ability to adapt to new tasks and improve its generalization, we propose a meta-learning-based CADRL (MLCADRL) algorithm. MLCADRL acquires a meta-policy from a multitude of learning tasks and thus converges quickly on new tasks, resulting in increased scalability and adaptability.

  • We undertake comprehensive simulations to assess the performance of the algorithms we propose. Our results reveal that the CADRL-based algorithm adeptly coordinates the UAV’s velocity, the scheduling of SNs, and the offloading strategy, ultimately surpassing baseline algorithms by achieving the lowest system cost. Moreover, the MLCADRL-based algorithm proves effective in accelerating convergence when dealing with new tasks.

The rest of this paper follows this structure: Section II provides an overview of the related research. Section III presents the system model and problem formulation. In Section IV, CADRL-based and MLCADRL-based data collection algorithms are proposed. The simulation results are presented in Section V. Finally, conclusions are presented in Section VI.

II Related Work

II-A UAV-assisted Data Collection in IoT Networks

In the realm of UAV-assisted IoT networks, various studies have been dedicated to optimizing data collection with a focus on ensuring the freshness of information. Zhu et al. [18] employed transformers and weighted A* to optimize UAV hover points for cumulative AoI minimization. Liu et al. [19] centered on the optimization of UAV trajectory and SN associations, and proposed two optimization problems aiming at minimizing the maximum and average AoI, respectively. Furthermore, the work by [20] considered the UAV’s onboard energy, focusing on minimizing the completion time of the UAV’s mission while considering AoI and UAV onboard energy constraints. The aforementioned studies primarily explored scenarios involving a single UAV. In contrast, other studies introduced the use of multiple UAVs [21, 22, 23, 24]. Long et al. [21] investigated the use of multiple UAVs for transmitting data from ground users to distant BSs. They adopted a multi-stage stochastic optimization approach to reduce the long-term AoI through trajectory and scheduling optimization. Additionally, Zhang et al. [22] studied the AoI minimization in large-scale, ultra-reliable, and low-latency communication scenarios, considering statistical delay and bit error rate constraints. Wang et al. [23] used a multi-agent DRL approach to jointly optimize trajectories and SN scheduling of UAVs to minimize AoI. Furthermore, Liu et al. [24] examined a system involving multiple UAVs taking off from a DC, collecting data from ground SNs, distributing it to users, and then returning to the DC. They minimized AoI by optimizing task assignment, interaction point selection, and UAV flight trajectories.

In the above studies, a common assumption was made that the SNs followed a generate-at-will generation policy [25, 26]. However, it is worth noting that in certain scenarios, SNs may utilize their own independent sampling mechanisms, separate from the data transmission scheme. Since both the sampling and transmission processes influence the AoI, it becomes imperative to jointly consider these two factors to effectively reduce the AoI. Recognizing the uncertainty surrounding the SN's sampling process, Zhou et al. [27] utilized a DQN-based algorithm to optimize the UAV's trajectory with the goal of minimizing the AoI. Tong et al. [28] assumed that SNs can sample information at fixed or random rates and optimized the UAV's flight trajectory with the dual objective of minimizing the AoI and reducing the packet loss rate. Building upon the work of [28], Li et al. [29] further extended the investigation to scenarios involving multiple UAVs.

The aforementioned studies predominantly dealt with battery-powered UAVs, which inherently possess limited operating durations due to their constrained energy storage capacity. In contrast, solar-powered UAVs offer a promising avenue for extending operational periods by harnessing solar energy from the sun [7, 8]. Sami et al. [30] focused on optimizing UAV altitude adjustments and channel access management. They achieved this through the use of constrained DRL algorithms, with the primary objective of maximizing network capacity. Furthermore, Zhang et al. [31] extended this optimization by considering the three-dimensional trajectory and time allocation of solar-powered UAVs. This extension aimed to enhance overall network throughput while accommodating various constraints, including energy constraints, quality of service demands, and shifting flight conditions. [8] detailed the application of solar-powered UAVs in the data collection process from IoT devices and simultaneously recharged these devices using laser technology. The primary aim was to optimize the UAV’s energy resources while satisfying the requirements of IoT devices. Moreover, in [32], the UAV could acquire energy from both solar power and charging stations. Through optimization of the UAV’s trajectory, a delicate balance was struck between achieving average data transmission rates, minimizing total energy consumption, and ensuring fair coverage of IoT terminals. While these studies explored various aspects of solar-powered UAV operations in IoT networks, none of them specifically addressed the optimization of AoI.

II-B DRL Methods in UAV-assisted Data Collection

Over the past few years, DRL has received significant attention due to its impressive success in intricate tasks such as the game of Go, video games, and the control of complex machinery [33, 34, 35]. This success has propelled the application of DRL methods in UAV-assisted data collection for IoT networks [11, 27, 28, 29, 30]. Nevertheless, the action space considered in these studies is typically confined to either continuous or discrete domains [36, 37, 38, 39, 40]. Applying these DRL algorithms directly to problems involving both discrete and continuous action spaces is not feasible. To overcome this obstacle, Hu et al. [41] studied UAVs engaged in collaborative perception and transmission for sensing tasks. They proposed the compound action actor-critic (CA2C) method to optimize UAV trajectories and task selection for AoI reduction. Akbari et al. [42] investigated the optimization problem of virtual network function placement and scheduling in industrial IoT networks and also utilized the CA2C method to address the problem with compound actions. The CA2C algorithm can be seen as a combination of DQN and DDPG, in which the continuous actions selected by DDPG serve as inputs to DQN. This may increase the sum of Q values at the cost of decreasing the maximum Q value [43]. In addition, the DRL approaches proposed in [41, 42] fail to extract well-generalized knowledge from training tasks, so the DRL agents must be trained from scratch when tackling new tasks.

There has been some work exploring how to improve the generalizability of DRL algorithms in UAV-assisted IoT data collection scenarios. Zhu et al. [44] investigated the minimization of total energy consumption through the joint design of the UAV's trajectory and cluster head selection. They proposed a DRL method with a sequential model to address this issue and validated the algorithm's generalization capability. However, combining a sequence-to-sequence model with DRL can be challenging because of the intricate training process. Chu et al. [17] aimed at maximizing data collection capacity and energy efficiency, and proposed a DRL algorithm combined with transfer learning to enhance the efficacy of the DRL-based algorithm on new tasks. Since single-task transfer learning may suffer from learning bias, Yi et al. [45] utilized multi-task transfer learning in conjunction with DRL to optimize the UAV's trajectory and recharging decisions so as to reduce the AoI. While multi-task transfer DRL can leverage knowledge acquired from training tasks to enhance its performance on new tasks, it exhibits relatively longer convergence times compared to meta-DRL approaches. Lu et al. [46] explored UAV data collection from ground nodes, with a focus on maximizing the collected data by optimizing the UAV's trajectory. A meta-DRL-based algorithm was proposed to enhance generalization capabilities. However, that work limited the UAV's actions to a single type of discrete action, namely, the flight direction.

III System Model and Problem Formulation

Figure 1: The scene of solar-powered UAV-assisted data collection. The UAV collects status updates from SNs and caches them in its onboard buffer. During the collection process, the UAV offloads the cached data to the DC. Additionally, the UAV can harness energy from the environment through the solar panel.

As depicted in Figure 1, we consider solar-powered UAV-assisted data collection within the IoT network, in which NN sensor nodes (SNs) are randomly distributed across the area. Each SN performs random environmental sampling, and the status update for each SN follows a Poisson process with an arrival rate of λ0\lambda_{0}. The collected data is cached in the buffer as a packet, each containing ww bits and accompanied by a timestamp. We use 𝒩={1,2,,N}\mathcal{N}=\{1,2,\ldots,N\} to denote the set of all the SNs, with the location of SN n𝒩n\in\mathcal{N} specified as 𝑪n=(xn,yn)\boldsymbol{C}_{n}=(x_{n},y_{n}). The location of the DC is represented by 𝑪0=(x0,y0)\boldsymbol{C}_{0}=(x_{0},y_{0}). We define 𝒩+={0}𝒩\mathcal{N}^{+}=\{0\}\cup\mathcal{N}, signifying the set of all nodes, including the DC.

We consider a discrete-time system where time is partitioned into equidistant time slots. Each slot has a duration of $\tau_{0}$ seconds. Suppose that the UAV is required to enable data collection for $T$ slots. Specifically, the rotary-wing UAV serves as a mobile relay, taking off from the DC, flying over SNs to collect data packets, and temporarily storing the collected data packets in its onboard buffer. Once the UAV reaches an appropriate location, it offloads all buffered data packets to the DC for processing. It is assumed that the UAV operates at a constant altitude, denoted by $H$. Thus, the UAV's location at time slot $t$ can be denoted by its ground-level projection $\boldsymbol{C}_{u}(t)=[x_{u}(t),y_{u}(t)]$. At the beginning of time slot $t$, the UAV's velocity can be represented in polar coordinates as $\boldsymbol{v}(t)=(v_{s}(t),\phi(t))$, where $v_{s}(t)=\left\|\boldsymbol{v}(t)\right\|$ signifies the UAV's speed, while $\phi(t)$ stands for the velocity direction, within the range $0\leq\phi(t)\leq 2\pi$. We consider a constant acceleration $\bm{a}_{\textrm{c}}(t)=\dot{\boldsymbol{v}}(t)$ throughout a time slot, leading to the velocity update $\bm{v}(t+1)=\bm{v}(t)+\bm{a}_{\textrm{c}}(t)\tau_{0}$. We assume that the rotary-wing UAV can swiftly alter its orientation as a time slot commences and then maintain it for the duration of the time slot, since the rotary-wing UAV can readily turn by modifying the rotation of its rotors. In operation, the UAV faces kinematic restrictions. Specifically, the UAV's speed at slot $t$ does not exceed its maximum limit $v_{s}^{\textrm{max}}$, i.e., $v_{s}(t)\leq v_{s}^{\textrm{max}}$, and the UAV's turning angle at slot $t$, $\triangle\phi(t)=\phi(t)-\phi(t-1)$, does not exceed its maximum limit $\triangle\phi_{\max}$, i.e., $\mid\triangle\phi(t)\mid\leq\triangle\phi_{\max}$. Let $\boldsymbol{V}=(\boldsymbol{v}(1),\boldsymbol{v}(2),\ldots,\boldsymbol{v}(T))$ denote the series of UAV velocities. The UAV's flight path encompasses the series of locations it crosses, i.e., $\bm{p}=(\boldsymbol{C}_{u}(1),\boldsymbol{C}_{u}(2),\ldots,\boldsymbol{C}_{u}(T))$, where $\boldsymbol{C}_{u}(1)=\boldsymbol{C}_{0}$.
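To make the kinematic model concrete, the following is a minimal sketch in Python (variable names and default limits are illustrative, not taken from the paper) of how the per-slot velocity and position updates under the speed and turning-angle constraints might be simulated.

```python
import numpy as np

def step_kinematics(pos, v_s, phi, a_c, phi_new, tau0=1.0,
                    v_max=25.0, dphi_max=np.pi / 4):
    """One-slot UAV kinematic update under speed and turning-angle limits.

    pos:     current ground projection (x_u, y_u) in meters
    v_s:     current speed v_s(t)
    phi:     previous heading phi(t-1)
    a_c:     commanded acceleration parallel to the velocity
    phi_new: commanded heading phi(t)
    """
    # Enforce the turning-angle constraint |phi(t) - phi(t-1)| <= dphi_max.
    dphi = np.clip(phi_new - phi, -dphi_max, dphi_max)
    phi_t = (phi + dphi) % (2 * np.pi)

    # Constant acceleration within the slot: v(t+1) = v(t) + a_c(t) * tau0,
    # clipped to the admissible speed range [0, v_max].
    v_next = np.clip(v_s + a_c * tau0, 0.0, v_max)

    # Displacement uses the average speed over the slot along heading phi(t),
    # mirroring the relative-position update used later in (16).
    avg_speed = 0.5 * (v_s + v_next)
    pos_next = (pos[0] + avg_speed * tau0 * np.cos(phi_t),
                pos[1] + avg_speed * tau0 * np.sin(phi_t))
    return pos_next, v_next, phi_t
```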

During each time slot, the UAV is required to make a determination regarding the scheduling of a specific SN. The UAV’s SN scheduling vector is denoted as 𝒃=(b(1),b(2),,b(T))\boldsymbol{b}=(b(1),b(2),\ldots,b(T)), with b(t)=nb(t)=n indicating that SN nn will transmit its status updates to the UAV during time slot tt, while b(t)=0b(t)=0 signifies that no SN is scheduled for time slot tt. On the other hand, the UAV needs to determine at each time slot whether to offload data packets in its buffer to the DC based on the SNs’ AoI, the UAV’s energy level, and the link status between the UAV and the DC. We use 𝒒=(q(1),q(2),,q(T))\boldsymbol{q}=(q(1),q(2),\ldots,q(T)) to represent the UAV’s offloading vector, with q(t){0,1}q(t)\in\{0,1\}. Specifically, q(t)=1q(t)=1 means that the data packets in the UAV’s buffer are offloaded to the DC in time slot tt; otherwise, q(t)=0q(t)=0. Note that in a given time slot, the UAV cannot simultaneously schedule SN data transmission and offload data to the DC. It can first schedule a SN for a duration of τs\tau_{\textrm{s}} and then offload data to the DC within a duration of τ0τs\tau_{0}-\tau_{\textrm{s}}.

The UAV’s maximum onboard energy is EmaxE_{\textrm{max}}, and its energy level in slot tt is indicated by E(t)[0,Emax]E(t)\in[0,E_{\textrm{max}}]. To ensure that the UAV does not crash due to insufficient energy, we assume that the UAV has two modes, namely, working and restoring, which are determined by the UAV’s energy level. Let m(t){0,1}m(t)\in\{0,1\} denote the mode of the UAV. m(t)=0m(t)=0 means that the UAV is in the restoring mode for the time slot tt; otherwise, the UAV is in the working mode. In the working mode, the UAV can gather status updates from SNs and offload the data packets in its buffer to the DC. When the UAV’s energy level falls below the threshold Eth1E_{\textrm{th}}^{1}, i.e., E(t)<Eth1E(t)<E_{\textrm{th}}^{1}, it switches to the restoring mode. The UAV lands on the ground and waits for the harvested energy. When the UAV’s energy level exceeds a certain threshold Eth2E_{\textrm{th}}^{2}, i.e., E(t)Eth2E(t)\geq E_{\textrm{th}}^{2}, it switches to the working mode, ascends from the ground to altitude HH, and continues its data collection task.

III-A Transmission Model

In this work, we account for both large-scale and small-scale fading in the channel model [47, 48]. Let hu,n(t)h_{u,n}(t) represent the channel coefficient between the UAV and node n𝒩+n\in\mathcal{N}^{+} in time slot tt, as expressed by

h_{u,n}(t)=\sqrt{g_{u,n}(t)}\,\widetilde{h}_{u,n}(t), (1)

where gu,n(t)g_{u,n}(t) denotes the large-scale fading factor and h~u,n(t)\widetilde{h}_{u,n}(t) is the small-scale attenuation coefficient with 𝔼[|h~u,n(t)|2]=1\mathbb{E}[|\widetilde{h}_{u,n}(t)|^{2}]=1. The large-scale channel fading is either line-of-sight (LoS) or non-line-of-sight (NLoS), depending on the propagation environment. Thus, we assume that the UAV-node n𝒩+n\in\mathcal{N}^{+} channel experiences LoS fading in a probabilistic manner. The LoS probability between the UAV and node n𝒩+n\in\mathcal{N}^{+} is expressed as [49]

p_{u,n}^{\text{LoS}}(t)=\frac{1}{1+\beta\exp\left(-\beta^{\prime}\left(\frac{180}{\pi}\arcsin\left(\frac{H}{d_{u,n}(t)}\right)-\beta\right)\right)}, (2)

where β\beta and β\beta^{\prime} are constants that depend on the environment, and du,n(t)=H2+𝑪u(t)𝑪n2d_{u,n}(t)=\sqrt{H^{2}+\|\boldsymbol{C}_{u}(t)-\boldsymbol{C}_{n}\|^{2}} signifies the Euclidean distance between the UAV and node n𝒩+n\in\mathcal{N}^{+} in slot tt. Thus, the large-scale channel gain between the UAV and node n𝒩+n\in\mathcal{N}^{+} in slot tt can be represented as

g_{u,n}(t)=\begin{cases}\beta_{0}[d_{u,n}(t)]^{-\varsigma},&\textrm{w.p. }p_{u,n}^{\text{LoS}}(t),\\ \kappa\beta_{0}[d_{u,n}(t)]^{-\varsigma},&\textrm{w.p. }1-p_{u,n}^{\text{LoS}}(t),\end{cases} (3)

where β0\beta_{0} represents the channel gain at a reference distance of one meter, ς\varsigma denotes the path-loss exponent, κ\kappa (κ<1)(\kappa<1) denotes the additional attenuation factor associated with NLoS, and “w.p.” is an abbreviation for “with probability”.
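As a sketch of how the probabilistic channel model in (2)-(3) could be simulated per slot (the environment constants beta, beta', beta0, varsigma, and kappa below are placeholder values, not the ones used in the paper's simulations):

```python
import numpy as np

def large_scale_gain(uav_pos, node_pos, H, beta=10.0, beta_p=0.6,
                     beta0=1e-3, varsigma=2.3, kappa=0.2, rng=np.random):
    """Sample the large-scale gain g_{u,n}(t) of (3) for one time slot.

    The LoS probability follows (2); with probability p_los the link is LoS,
    otherwise the additional NLoS attenuation factor kappa is applied.
    """
    d = np.sqrt(H**2 + (uav_pos[0] - node_pos[0])**2
                      + (uav_pos[1] - node_pos[1])**2)
    elevation_deg = np.degrees(np.arcsin(H / d))
    p_los = 1.0 / (1.0 + beta * np.exp(-beta_p * (elevation_deg - beta)))
    gain = beta0 * d ** (-varsigma)
    return gain if rng.random() < p_los else kappa * gain
```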

The signal-to-noise ratio (SNR) of the air-to-ground (A2G) channel linking the UAV and SN n𝒩n\in\mathcal{N} in time slot tt can be described as

\xi_{u,n}(t)=\frac{P_{s}|h_{u,n}(t)|^{2}}{\sigma^{2}}, (4)

where PsP_{s} denotes the transmission power of each SN, and σ2\sigma^{2} denotes the noise power of the A2G channel. If ξu,n(t)\xi_{u,n}(t) is not lower than the threshold ξth\xi_{\textrm{th}}, i.e., ξu,n(t)ξth\xi_{u,n}(t)\geq\xi_{\textrm{th}}, the UAV successfully receives the status updates from SN nn during time slot tt. Otherwise, it fails.

The transmission rate of the A2G channel that links the UAV and the DC in time slot tt is given by

R_{u}(t)=B\log_{2}\left(1+\frac{P_{\textrm{c}}^{\textrm{t}}|h_{u,0}(t)|^{2}}{\sigma^{2}}\right), (5)

where PctP_{\textrm{c}}^{\textrm{t}} denotes the UAV’s transmission power and BB denotes the channel bandwidth. When the UAV offloads the data packets in its buffer to the DC within a specified time τ(t)\tau(t) and W(t)Ru(t)τ(t)\frac{W(t)}{R_{u}(t)}\leq\tau(t), it is considered that the data offloading is successful, where W(t)W(t) is the size of data in the UAV’s buffer at slot tt. Otherwise, it fails. Specifically, if the UAV only offloads data to the DC during the time slot tt, then τ(t)=τ0\tau(t)=\tau_{0}. If the UAV both schedules SNs and offloads data during time slot tt, then τ(t)=τ0τs\tau(t)=\tau_{0}-\tau_{\textrm{s}}.
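Combining (4)-(5), a per-slot sketch of the two link checks, successful upload from a scheduled SN and successful offloading to the DC, might look as follows (the power, bandwidth, noise, and threshold values are illustrative assumptions).

```python
import numpy as np

def upload_ok(g_un, h_tilde, P_s=0.1, sigma2=1e-13, xi_th=5.0):
    """SN -> UAV upload succeeds if the SNR in (4) reaches the threshold xi_th."""
    snr = P_s * g_un * abs(h_tilde) ** 2 / sigma2
    return snr >= xi_th

def offload_ok(g_u0, h_tilde, W_bits, tau, P_ct=1.0, B=1e6, sigma2=1e-13):
    """UAV -> DC offload succeeds if W(t)/R_u(t) <= tau(t), cf. (5)."""
    rate = B * np.log2(1.0 + P_ct * g_u0 * abs(h_tilde) ** 2 / sigma2)
    return W_bits / rate <= tau
```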

III-B Age of Information

SN n𝒩n\in\mathcal{N} utilizes a random sampling strategy to sample information from its surroundings and packages it into a time-stamped data packet. The data packet is placed in the buffer of SN nn. Let kn(t){0,1}k_{n}(t)\in\{0,1\} denote the packet arrival status of SN nn at time slot tt. kn(t)=1k_{n}(t)=1 means SN nn samples its surrounding environment, and a new data packet reaches SN nn in time slot tt. Otherwise, kn(t)=0k_{n}(t)=0. When the buffer of SN nn contains data and a new data packet arrives, the old data packet will be replaced by the new one.

Let $Y_{n}(t)$ keep track of the lifetime of the data packet at SN $n$ if there exists one at time slot $t$. The update of $Y_{n}(t)$ is

Y_{n}(t)=\begin{cases}0,&\textrm{if }k_{n}(t)=1,\\ Y_{n}(t-1)+1,&\textrm{otherwise}.\end{cases} (6)

In (6), if a new packet arrives at SN nn at time slot tt, i.e., kn(t)=1k_{n}(t)=1, Yn(t)Y_{n}(t) is set to zero. Otherwise, Yn(t1)Y_{n}(t-1) is incremented by one.

The inherent unpredictability of the channel connecting SN nn and the UAV implies that, even with scheduling, the transmission may fail to be successful. Let zn(t){0,1}z_{n}(t)\in\{0,1\} denote the uploading status of SN nn. Specifically, zn(t)=1z_{n}(t)=1 signifies the successful transmission of SN nn to the UAV, i.e., b(t)=n and ξu,n(t)ξthb(t)=n\textrm{ and }\xi_{u,n}(t)\geq\xi_{\textrm{th}}; otherwise, zn(t)=0z_{n}(t)=0. The UAV has a buffer of capacity NwNw that can cache one data packet for each SN. When a data packet from SN nn already exists in the UAV’s buffer and a new data packet from SN nn is received, the old data packet of SN nn will be replaced with this new data packet. Un(t)U_{n}(t) is used to track the lifetime of the data packet of SN nn in the UAV’s buffer if there exists one. Therefore, the update of Un(t)U_{n}(t) is represented as

U_{n}(t)=\begin{cases}Y_{n}(t)+1,&\textrm{if }z_{n}(t)=1,\\ U_{n}(t-1)+1,&\textrm{otherwise}.\end{cases} (7)

In (7), if the update packet from SN nn is effectively delivered to the UAV, i.e., zn(t)=1z_{n}(t)=1, Un(t)U_{n}(t) is set to Yn(t)+1Y_{n}(t)+1. Otherwise, Un(t1)U_{n}(t-1) is increased by one.

Let o(t){0,1}o(t)\in\{0,1\} indicate the offload status of the UAV. In particular, o(t)=1o(t)=1 represents that the data packets are successfully offloaded from the UAV to the DC, i.e., q(t)=1q(t)=1 and W(t)Ru(t)τ(t)\frac{W(t)}{R_{u}(t)}\leq\tau(t); otherwise, o(t)=0o(t)=0. The AoI is utilized to depict the freshness of data packets originating from SNs at the DC. More specifically, δn(t)\delta_{n}(t), which represents the AoI of SN nn, is defined as the time that has passed since the DC received the latest status update. This can be expressed as:

\delta_{n}(t)=\begin{cases}U_{n}(t),&\textrm{if }U_{n}(t)\geq 0\textrm{ and }o(t)=1,\\ \delta_{n}(t-1)+1,&\textrm{otherwise}.\end{cases} (8)

In (8), if a data packet of SN nn is cached in the UAV’s buffer in time slot tt, i.e., Un(t)0U_{n}(t)\geq 0, and the cached data packets are successfully offloaded by the UAV to the DC within time slot tt, i.e., o(t)=1o(t)=1, δn(t)\delta_{n}(t) is set to Un(t)U_{n}(t). Otherwise, δn(t1)\delta_{n}(t-1) is increased by one.
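The coupled updates (6)-(8) can be stepped jointly once the sampling, upload, and offload outcomes of a slot are known; below is a minimal vectorized sketch for all N SNs (a sketch, assuming U is initialized to -1 for SNs whose packets have never reached the UAV).

```python
import numpy as np

def step_age(Y, U, delta, arrived, uploaded, offload_success):
    """One-slot update of (6)-(8) for all SNs.

    Y, U, delta:      length-N arrays (lifetime at SN, lifetime at UAV, AoI)
    arrived:          boolean array, k_n(t) = 1 if a fresh sample reached SN n
    uploaded:         boolean array, z_n(t) = 1 if SN n delivered to the UAV
    offload_success:  scalar bool, o(t) = 1 if the buffer reached the DC
    """
    Y = np.where(arrived, 0, Y + 1)                 # Eq. (6)
    U = np.where(uploaded, Y + 1, U + 1)            # Eq. (7)
    has_packet = U >= 0                             # UAV buffer holds a packet of SN n
    if offload_success:
        delta = np.where(has_packet, U, delta + 1)  # Eq. (8)
    else:
        delta = delta + 1
    return Y, U, delta
```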

III-C Energy Model

III-C1 Energy Consumption Model

The energy consumption of a rotary-wing UAV is chiefly determined by its propulsion and communication necessities. The following is an expression for the UAV’s propulsion power during time slot tt [6]:

P_{\textrm{c}}^{\textrm{p}}(t)=n_{r}\Bigg[\frac{\chi}{8}\left(\frac{T_{\textrm{h}}(t)}{x_{T}\rho A}+3\left(v_{s}(t)\right)^{2}\right)\sqrt{\frac{T_{\textrm{h}}(t)\rho x_{s}^{2}A}{x_{T}}}+\frac{1}{2}d_{0}\rho x_{s}A\left(v_{s}(t)\right)^{3}+(1+x_{f})T_{\textrm{h}}(t)\left(\sqrt{\frac{\left(T_{\textrm{h}}(t)\right)^{2}}{4\rho^{2}A^{2}}+\frac{\left(v_{s}(t)\right)^{4}}{4}}-\frac{\left(v_{s}(t)\right)^{2}}{2}\right)^{\frac{1}{2}}\Bigg], (9)

where $n_{r}$ represents the number of rotors, $x_{T}$, $x_{s}$, and $x_{f}$ indicate the thrust coefficient based on disc area, the rotor solidity, and the incremental correction factor of induced power, respectively, $\chi$ is the local blade section drag coefficient, $\rho$ is the air density, $A$ denotes the disc area of each rotor, $d_{0}$ denotes the fuselage drag ratio of each rotor, and $T_{\textrm{h}}(t)$ indicates the thrust of each rotor. For ease of exposition, we only take into account the acceleration component that is parallel to the velocity [6]. Hence, each rotor's thrust can be expressed as

T_{\textrm{h}}(t)=\frac{1}{n_{r}}\left[\left(Ma_{\textrm{c}}(t)+\frac{1}{2}\rho\left(v_{s}(t)\right)^{2}S_{FA}\right)^{2}+(Mg)^{2}\right]^{1/2}, (10)

where ac(t)=(vs(t+1)vs(t))/τ0a_{\textrm{c}}(t)=(v_{s}(t+1)-v_{s}(t))/\tau_{0}, SFAS_{FA} denotes the fuselage equivalent flat plate area, MM and gg are the UAV’s weight and the gravity acceleration, respectively.

The energy consumption $e_{\textrm{c}}(t)$ of the UAV in a time slot is discussed in the following four scenarios: If the UAV is in the restoring mode at time slot $t$, i.e., $m(t)=0$, and the energy level is less than $E_{\textrm{th}}^{2}$, i.e., $E(t)<E_{\textrm{th}}^{2}$, the energy consumption is zero; if the UAV is in the working mode at time slot $t$, i.e., $m(t)=1$, the energy level is no less than $E_{\textrm{th}}^{1}$, i.e., $E(t)\geq E_{\textrm{th}}^{1}$, and it schedules an SN while offloading data packets, i.e., $b(t)\neq 0$ and $q(t)=1$, the energy consumption is $\tau_{0}P_{\textrm{c}}^{\textrm{p}}(t)+(\tau_{0}-\tau_{\textrm{s}})P_{\textrm{c}}^{\textrm{t}}$; if the UAV is in the working mode at time slot $t$, i.e., $m(t)=1$, the energy level is no less than $E_{\textrm{th}}^{1}$, i.e., $E(t)\geq E_{\textrm{th}}^{1}$, and it only offloads data packets without scheduling an SN, i.e., $b(t)=0$ and $q(t)=1$, the energy consumption is $\tau_{0}\left(P_{\textrm{c}}^{\textrm{p}}(t)+P_{\textrm{c}}^{\textrm{t}}\right)$; in other cases, the UAV uses its propulsion energy only for flying. Hence, the UAV's energy consumption $e_{\textrm{c}}(t)$ in time slot $t$ is summarized in (11).

e_{\textrm{c}}(t)=\begin{cases}0,&\textrm{if }m(t)=0\textrm{ and }E(t)<E_{\textrm{th}}^{2},\\ \tau_{0}P_{\textrm{c}}^{\textrm{p}}(t)+(\tau_{0}-\tau_{\textrm{s}})P_{\textrm{c}}^{\textrm{t}},&\textrm{if }m(t)=1,E(t)\geq E_{\textrm{th}}^{1},b(t)\neq 0,\textrm{ and }q(t)=1,\\ \tau_{0}\left(P_{\textrm{c}}^{\textrm{p}}(t)+P_{\textrm{c}}^{\textrm{t}}\right),&\textrm{if }m(t)=1,E(t)\geq E_{\textrm{th}}^{1},b(t)=0,\textrm{ and }q(t)=1,\\ \tau_{0}P_{\textrm{c}}^{\textrm{p}}(t),&\textrm{otherwise}.\end{cases} (11)
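Given the propulsion power (9) and the case analysis in (11), the per-slot energy consumption could be computed along the following lines (a sketch; P_prop is assumed to come from an implementation of (9)).

```python
def energy_consumed(mode, E, b, q, P_prop, P_ct, tau0, tau_s, E_th1, E_th2):
    """Per-slot energy consumption e_c(t) following the cases of (11)."""
    if mode == 0 and E < E_th2:
        return 0.0                                   # restoring on the ground
    if mode == 1 and E >= E_th1 and q == 1:
        if b != 0:                                   # schedule an SN, then offload
            return tau0 * P_prop + (tau0 - tau_s) * P_ct
        return tau0 * (P_prop + P_ct)                # offload for the whole slot
    return tau0 * P_prop                             # propulsion only
```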

III-C2 Energy Harvesting Model

Providing a sustainable energy supply for the UAV with limited battery capacity is a crucial challenge. We assume that the UAV can harvest energy from sunlight; however, solar energy is not always available and depends on various factors, such as weather and altitude. The process of solar energy harvesting by the UAV is modeled as an independent Bernoulli process with parameter $\lambda_{1}$ [50]. This signifies that the probability of solar energy arriving at the UAV in each time slot is $\lambda_{1}$. The solar energy reaching the UAV during time slot $t$ can be expressed as [8]

e_{\textrm{r}}(t)=\tau_{0}\eta_{1}S_{u}G_{1}\left(\varphi_{1}-\varphi_{2}e^{-\frac{h(t)}{h_{1}}}\right), (12)

where h(t)h(t) is the UAV’s altitude in time slot tt, η1\eta_{1} denotes the efficiency of energy conversion, SuS_{u} indicates the area of the solar panel that effectively receives light, G1G_{1} represents the average solar radiation on the ground, φ1\varphi_{1} and φ2\varphi_{2} are the maximum atmospheric transmittance value and the atmosphere’s extinction coefficient, respectively, and h1h_{1} denotes the earth’s scale height. Therefore, the energy harvested by the UAV during time slot tt is given by

e_{\textrm{h}}(t)=\begin{cases}e_{\textrm{r}}(t),&\textrm{w.p. }\lambda_{1},\\ 0,&\textrm{w.p. }1-\lambda_{1}.\end{cases} (13)

To sum up, the dynamics of the UAV’s battery level can be represented as

E(t+1)=\min\left(E(t)+e_{\textrm{h}}(t)-e_{\textrm{c}}(t),E_{\max}\right). (14)
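Equations (12)-(14) lend themselves to a short simulation step; the sketch below draws the Bernoulli solar arrival and updates the battery, with all physical constants treated as placeholder values.

```python
import numpy as np

def step_battery(E, e_c, h, tau0=1.0, lam1=0.7, eta1=0.4, S_u=0.5,
                 G1=1367.0, phi1=0.89, phi2=0.16, h1=8000.0,
                 E_max=5e5, rng=np.random):
    """One-slot battery update E(t+1) = min(E(t) + e_h(t) - e_c(t), E_max)."""
    e_r = tau0 * eta1 * S_u * G1 * (phi1 - phi2 * np.exp(-h / h1))  # Eq. (12)
    e_h = e_r if rng.random() < lam1 else 0.0                        # Eq. (13)
    return min(E + e_h - e_c, E_max)                                 # Eq. (14)
```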

III-D Problem Formulation

In this solar-powered UAV-assisted data collection problem, our goal is to minimize the time-averaged expected total AoI of SNs and energy consumption of the UAV. To this end, we jointly optimize the UAV’s trajectory, the schedule of SNs, as well as the offloading strategy to minimize the system cost, and formulate an optimization problem as follows:

\text{P1: }\min_{\boldsymbol{p},\boldsymbol{b},\boldsymbol{q}}\quad\frac{1}{T}\mathbb{E}\left[\sum_{t=1}^{T}\left(\omega_{1}\sum_{n=1}^{N}\delta_{n}(t)+\omega_{2}e_{\textrm{c}}(t)\right)\right], (15a)
\text{s.t.}\quad\boldsymbol{C}_{u}(1)=\boldsymbol{C}_{0}, (15b)
0\leq v_{s}(t)\leq v_{s}^{\textrm{max}}, (15c)
0\leq\phi(t)\leq 2\pi, (15d)
|\triangle\phi(t)|\leq\triangle\phi_{\max},\ t\geq 2, (15e)

where ω1\omega_{1} and ω2\omega_{2} represent the weights of the SNs’ average AoI and energy consumption, respectively. Constraint (15b) specifies that the UAV takes off from the DC. The speed constraint (15c) and direction constraints (15d) and (15e) guarantee that the UAV adheres to the kinematic restrictions. The aforementioned stochastic optimization problem is exceedingly difficult to solve due to the unknown environmental dynamics, including the stochastic sampling rates of SNs, the random solar energy arrivals of the UAV, and the uncertain A2G channel. To deal with this problem, we propose a reinforcement learning-based algorithm that enables the UAV to autonomously learn from its interactions with the environment, thereby jointly optimizing its trajectory, the scheduling of SNs, and the offloading strategy. The following sections will delve into a detailed discussion of this approach.

IV Meta-reinforcement learning Approach

In this section, we begin by reformulating the data collection problem in a solar-powered UAV-assisted IoT network as a MDP. Subsequently, we propose a compound-action deep reinforcement learning (CADRL) algorithm to address the MDP, which can handle discrete and continuous actions simultaneously in the UAV’s action space. Furthermore, we introduce the meta-learning method to enhance the generalization of the CADRL-based algorithm for new tasks.

IV-A MDP Formulation

Most often, an MDP is described by {𝒮,𝒜,r,𝒫}\{\mathcal{S},\mathcal{A},r,\mathcal{P}\}, where 𝒮,𝒜,r\mathcal{S},\mathcal{A},r, and 𝒫\mathcal{P} stand for the state space, action space, reward function, and state transition function, respectively. Detailed definitions of state, action, reward function, and state transition are provided below.

IV-A1 State

The state in time slot tt is denoted as 𝒔(t)=(𝚿(t),vs(t),ϕ(t1),𝒀(t),𝑼(t),𝜹(t),E(t))\boldsymbol{s}(t)=(\boldsymbol{\Psi}(t),v_{s}(t),\phi(t-1),\boldsymbol{Y}(t),\boldsymbol{U}(t),\boldsymbol{\delta}(t),E(t)), where

  • 𝚿(t)\boldsymbol{\Psi}(t) denotes the set of relative positions between the UAV’s ground projection at time slot tt and all the nodes, i.e., 𝚿(t)=(𝝍0(t),𝝍1(t),,𝝍N(t))\boldsymbol{\Psi}(t)=(\boldsymbol{\psi}_{0}(t),\boldsymbol{\psi}_{1}(t),\ldots,\boldsymbol{\psi}_{N}(t)), where 𝝍n(t)=𝑪u(t)𝑪n=(xu(t)xn,yu(t)yn),n𝒩+\boldsymbol{\psi}_{n}(t)=\boldsymbol{C}_{u}(t)-\boldsymbol{C}_{n}=(x_{u}(t)-x_{n},y_{u}(t)-y_{n}),n\in\mathcal{N}^{+}.

  • vs(t)v_{s}(t) denotes the UAV’s speed at the start of time slot tt.

  • ϕ(t1)\phi(t-1) denotes the UAV’s direction of velocity at time slot t1t-1.

  • 𝒀(t)\boldsymbol{Y}(t) is the set of lifetime of the update data packet in time slot tt for all SNs, i.e., 𝒀(t)=(Y1(t),Y2(t),,YN(t))\boldsymbol{Y}(t)=(Y_{1}(t),Y_{2}(t),\ldots,Y_{N}(t)).

  • 𝑼(t)\boldsymbol{U}(t) is the set of lifetime of the update data packet at the UAV in time slot tt for all SNs, i.e., 𝑼(t)=(U1(t),U2(t),,UN(t))\boldsymbol{U}(t)=(U_{1}(t),U_{2}(t),\ldots,U_{N}(t)).

  • 𝜹(t)\boldsymbol{\delta}(t) is the set of AoI values in time slot tt for all SNs, i.e., 𝜹(t)=(δ1(t),δ2(t),,δN(t))\boldsymbol{\delta}(t)=(\delta_{1}(t),\delta_{2}(t),\ldots,\delta_{N}(t)).

  • E(t)E(t) denotes the UAV’s energy level in time slot tt.

IV-A2 Action

The UAV’s action in time slot tt is characterized by its speed vs(t+1)v_{s}(t+1) at the start of time slot t+1t+1 and direction ϕ(t)\phi(t), the scheduling of SNs b(t)b(t), and the offloading decision q(t)q(t), i.e., 𝒂(t)=(vs(t+1),ϕ(t),b(t),q(t))\boldsymbol{a}(t)=(v_{s}(t+1),\phi(t),b(t),q(t)), where vs(t+1)v_{s}(t+1) and ϕ(t)\phi(t) are continuous, while b(t)b(t) and q(t)q(t) are discrete.

IV-A3 State Transition

We detail the transition of each element in 𝒔(t)\boldsymbol{s}(t). The update of the relative position of the UAV’s projection on the ground to the node nn, 𝝍n(t)\boldsymbol{\psi}_{n}(t) n𝒩+n\in\mathcal{N}^{+}, relies on the UAV’s location, speed, and direction of the velocity, which can be expressed as

\boldsymbol{\psi}_{n}(t+1)=\boldsymbol{\psi}_{n}(t)+\frac{v_{s}(t)+v_{s}(t+1)}{2}\tau_{0}\left(\cos\phi(t),\sin\phi(t)\right). (16)

The lifetime update of the update data packet at each SN depends on the arrival status of the packet. In particular, if a new packet is received at SN nn, its lifetime at the SN is initialized to zero; otherwise, the lifetime at the SN is increased by one. The update of the lifetime of the update data packet at the SN is given in Eq. (6).

The lifetime update of the update data packet for each SN at the UAV depends on the scheduling status zn(t)z_{n}(t) from SN n𝒩n\in\mathcal{N} to the UAV. Specifically, if the update status of SN nn is successfully transmitted to the UAV, the lifetime of SN nn at the UAV is set to its lifetime at SN nn plus one; otherwise, the lifetime at the UAV is increased by one. The update of the lifetime of the update data packet at the UAV is given in Eq. (7).

The AoI update for each SN depends on the offloading status o(t)o(t) from the UAV to the DC. Due to the channel’s stochastic characteristics, the scheduled SN’s AoI may not decrease. Specifically, if the UAV successfully offloads the packet of SN nn in its buffer to the DC within time slot tt, the AoI of SN nn is updated to the lifetime of SN nn at the UAV; otherwise, the AoI of SN nn is incremented by one. The AoI update is presented in Eq. (8).

The update of the UAV’s energy level is dependent on the energy consumption and energy arrival status, which is represented in Eq. (14).

IV-A4 Reward

The UAV’s reward in time slot tt is defined to minimize the weighted sum of the average AoI of SNs and the energy consumption of the UAV, and it is expressed as r(t)=1T(ω1n=1Nδn(t)+ω2ec(t))r(t)=-\frac{1}{T}(\omega_{1}\sum_{n=1}^{N}\delta_{n}(t)+\omega_{2}e_{\textrm{c}}(t)).

The goal of the MDP is to discover an optimal policy that, beginning with the initial state s(1)s(1), maximizes the expected total reward during TT time slots. In particular, the optimal policy is the solution of the MDP, i.e.,

\pi^{*}=\arg\max_{\pi}\mathbb{E}\left[\sum_{t=1}^{T}r(t)\,\Big|\,s(1)\right], (17)

where the expectation is relative to the distribution of the sequence of states and actions following the policy π\pi and the transition probabilities.

IV-B Meta-learning-based Compound-action DRL Approach

IV-B1 Compound-action DRL Approach

In this subsection, we first introduce the architecture of the CADRL-based algorithm shown in Fig. 2, which can manage both discrete and continuous actions. This architecture is based on the actor-critic architecture but contains two parallel actor networks. These two actor networks are utilized to perform discrete-action selection and continuous-action selection, respectively. In particular, the two actor networks share the first few layers with parameters 𝜽s\boldsymbol{\theta}^{\textrm{s}} to extract the features from the input state 𝒔\boldsymbol{s}. Then, the single-stream neural network is partitioned into two streams, forming the discrete actor network’s output layer with parameters 𝜽d\boldsymbol{\theta}^{\textrm{d}} and the continuous actor network’s output layer with parameters 𝜽c\boldsymbol{\theta}^{\textrm{c}}, respectively. A stochastic policy π(𝒔;𝜽s,𝜽d)\pi(\boldsymbol{s};\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}}) is learned by the discrete actor network for selecting discrete actions 𝒂d\boldsymbol{a}^{\textrm{d}}, and the continuous actor network learns a stochastic policy π(𝒔;𝜽s,𝜽c)\pi(\boldsymbol{s};\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}}) for determining continuous actions 𝒂c\boldsymbol{a}^{\textrm{c}}.

To generate the stochastic policy π(𝒔;𝜽s,𝜽d)\pi(\boldsymbol{s};\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}}), the discrete actor network outputs a vector of values f=(f𝒂1d,f𝒂2d,,f𝒂Add)f=(f_{\boldsymbol{a}_{1}^{\textrm{d}}},f_{\boldsymbol{a}_{2}^{\textrm{d}}},\ldots,f_{\boldsymbol{a}_{A_{\textrm{d}}}^{\textrm{d}}}) for the AdA_{\textrm{d}} discrete actions, where Ad=2(N+1)A_{\textrm{d}}=2(N+1) is the dimension of the discrete action space. Then, a discrete action 𝒂d\boldsymbol{a}^{\textrm{d}} is randomly sampled from the distribution represented by softmax(f)\textrm{softmax}(f). The continuous actor network outputs both the mean and variance of the Gaussian distribution to form the stochastic policy π(𝒔;𝜽s,𝜽c)\pi(\boldsymbol{s};\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}}) for continuous actions 𝒂c\boldsymbol{a}^{\textrm{c}}. Both the discrete and continuous actor networks are updated using the trust region policy optimization (TRPO) method in this study. The stochastic policy π(𝒔;𝜽s,𝜽d)\pi(\boldsymbol{s};\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}}) and the stochastic policy π(𝒔;𝜽s,𝜽c)\pi(\boldsymbol{s};\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}}) are updated by minimizing their loss functions, respectively. The loss function of the discrete policy π(𝒔;𝜽s,𝜽d)\pi(\boldsymbol{s};\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}}) is given by [51]

\mathcal{L}^{\textrm{d}}(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}})=\mathbb{E}\left[\frac{\pi(\boldsymbol{a}^{\textrm{d}}(t)|\boldsymbol{s}(t);\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}})}{\pi(\boldsymbol{a}^{\textrm{d}}(t)|\boldsymbol{s}(t);\boldsymbol{\theta}_{\textrm{old}}^{\textrm{s}},\boldsymbol{\theta}_{\textrm{old}}^{\textrm{d}})}\hat{A}(\boldsymbol{s}(t))\right],
\textrm{s.t.}\quad\mathbb{E}\left[\textrm{KL}[\pi(\cdot|\boldsymbol{s}(t);\boldsymbol{\theta}_{\textrm{old}}^{\textrm{s}},\boldsymbol{\theta}_{\textrm{old}}^{\textrm{d}}),\pi(\cdot|\boldsymbol{s}(t);\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}})]\right]\leq\epsilon, (18)

where $\mathbb{E}[\cdot]$ denotes the expectation over a limited batch of experiences, $(\boldsymbol{\theta}_{\textrm{old}}^{\textrm{s}},\boldsymbol{\theta}_{\textrm{old}}^{\textrm{d}})$ are the parameters before the update, $\textrm{KL}[\pi_{1},\pi_{2}]$ represents the KL divergence between two policies $\pi_{1}$ and $\pi_{2}$, and $\epsilon$ denotes the maximum KL divergence constraint. Similarly, the loss function of the continuous policy $\pi(\boldsymbol{s};\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}})$ is given by [51]

\mathcal{L}^{\textrm{c}}(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}})=\mathbb{E}\left[\frac{\pi(\boldsymbol{a}^{\textrm{c}}(t)|\boldsymbol{s}(t);\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}})}{\pi(\boldsymbol{a}^{\textrm{c}}(t)|\boldsymbol{s}(t);\boldsymbol{\theta}_{\textrm{old}}^{\textrm{s}},\boldsymbol{\theta}_{\textrm{old}}^{\textrm{c}})}\hat{A}(\boldsymbol{s}(t))\right],
\textrm{s.t.}\quad\mathbb{E}\left[\textrm{KL}[\pi(\cdot|\boldsymbol{s}(t);\boldsymbol{\theta}_{\textrm{old}}^{\textrm{s}},\boldsymbol{\theta}_{\textrm{old}}^{\textrm{c}}),\pi(\cdot|\boldsymbol{s}(t);\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}})]\right]\leq\epsilon. (19)
Figure 2: The architecture of the CADRL approach.
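As an illustration of the architecture in Fig. 2, the sketch below (PyTorch; layer sizes and the use of a state-independent log-std are assumptions) builds a shared trunk with parameters theta^s, a discrete head theta^d producing softmax logits over the A_d = 2(N+1) discrete actions, a continuous head theta^c producing a Gaussian mean, and a separate critic for V(s; vartheta).

```python
import torch
import torch.nn as nn

class CompoundActor(nn.Module):
    """Shared trunk (theta^s) with discrete (theta^d) and continuous (theta^c) heads."""

    def __init__(self, state_dim, n_discrete, n_continuous, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, hidden), nn.Tanh())
        self.discrete_head = nn.Linear(hidden, n_discrete)       # logits f
        self.mu_head = nn.Linear(hidden, n_continuous)           # Gaussian mean
        self.log_std = nn.Parameter(torch.zeros(n_continuous))   # Gaussian log-std

    def forward(self, state):
        z = self.trunk(state)
        disc = torch.distributions.Categorical(logits=self.discrete_head(z))
        cont = torch.distributions.Normal(self.mu_head(z), self.log_std.exp())
        return disc, cont   # pi(s; theta^s, theta^d) and pi(s; theta^s, theta^c)

class Critic(nn.Module):
    """State-value network V(s; vartheta)."""

    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, hidden), nn.Tanh(),
                                 nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)
```

Sampling a compound action then amounts to drawing one index from the categorical head and one vector from the Gaussian head of the same forward pass.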

In the CADRL architecture, there is one critic network with parameters ϑ\boldsymbol{\vartheta} that is utilized to estimate the state-value function V(𝒔;ϑ)V(\boldsymbol{s};\boldsymbol{\vartheta}). The critic network undergoes updates by minimizing the loss function, which is defined as

\mathcal{L}(\boldsymbol{\vartheta})=\left(V^{\textrm{target}}(t)-V(\boldsymbol{s}(t);\boldsymbol{\vartheta})\right)^{2}, (20)

where Vtarget(t)=A^(t)+V(s(t))V^{\textrm{target}}(t)=\hat{A}(t)+V(s(t)), and A^(t)\hat{A}(t) is the generalized advantage estimation (GAE) that is expressed by [52]

\hat{A}(t)=r(t)+\gamma V(\boldsymbol{s}(t+1))-V(\boldsymbol{s}(t))+\gamma\lambda\hat{A}(t-1), (21)

where λ\lambda is the GAE parameter, γ\gamma represents the discount factor, and A^(0)=0\hat{A}(0)=0.
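A short sketch of the advantage recursion exactly as written in (21), i.e., the one-step TD error at slot t plus gamma*lambda times the previous estimate, with the initialization A-hat(0) = 0:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates following (21).

    rewards: list of r(t) for t = 1..T
    values:  list of V(s(t)) for t = 1..T+1 (bootstrap value appended)
    """
    advantages, a_prev = [], 0.0   # a_prev plays the role of A_hat(t-1), A_hat(0) = 0
    for t in range(len(rewards)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        a_prev = delta + gamma * lam * a_prev
        advantages.append(a_prev)
    return advantages
```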

Algorithm 1 outlines the process of the CADRL-based solar-powered UAV-assisted data collection algorithm. The process begins with the initialization of the maximum KL divergence $\epsilon$, the replay buffer $D$, as well as the parameters of the actor networks $(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}},\boldsymbol{\theta}^{\textrm{c}})$ and the critic network $\boldsymbol{\vartheta}$ (Line 1). In time slot $t$, the UAV selects the discrete action component $\boldsymbol{a}^{\textrm{d}}(t)\sim\pi(\boldsymbol{s}(t);\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}})$ and the continuous action component $\boldsymbol{a}^{\textrm{c}}(t)\sim\pi(\boldsymbol{s}(t);\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}})$ according to the current state $\boldsymbol{s}(t)$, conducts the action $\boldsymbol{a}(t)=(\boldsymbol{a}^{\textrm{d}}(t),\boldsymbol{a}^{\textrm{c}}(t))$, receives a reward $r(t)$, and transitions to a new state $\boldsymbol{s}(t+1)$. Then, the UAV places the experience $(\boldsymbol{s}(t),\boldsymbol{a}(t),r(t),\boldsymbol{s}(t+1))$ in the replay buffer (Lines 5~8). To update the actor and critic networks, a mini-batch of experiences is sampled from the replay buffer. The advantage function is computed according to (21). The loss functions of the discrete and continuous actor networks are calculated according to (18) and (19), respectively, and are minimized to update the two actor networks. The critic network is updated by minimizing its loss function (20) (Lines 9~12).

Algorithm 1 CADRL-based solar-powered UAV-assisted data collection algorithm.
1:  Initialize the maximum KL divergence ϵ\epsilon, the replay buffer DD, the discrete and continuous actor networks parameters (𝜽s,𝜽d,𝜽c)(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}},\boldsymbol{\theta}^{\textrm{c}}), and the critic network parameters ϑ\boldsymbol{\vartheta};
2:  for  ep=1:E\textrm{ep}=1:E  do
3:     Initialize the environment;
4:     for t=1:Tt=1:T do
5:        Observe the state 𝒔(t)\boldsymbol{s}(t);
6:        Select a discrete action 𝒂d(t)π(𝒔(t);𝜽s,𝜽d)\boldsymbol{a}^{\textrm{d}}(t)\sim\pi(\boldsymbol{s}(t);\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}}) and a continuous action 𝒂c(t)π(𝒔(t);𝜽s,𝜽c)\boldsymbol{a}^{\textrm{c}}(t)\sim\pi(\boldsymbol{s}(t);\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}});
7:        Conduct action 𝒂(t)=(𝒂d(t),𝒂c(t))\boldsymbol{a}(t)=(\boldsymbol{a}^{\textrm{d}}(t),\boldsymbol{a}^{\textrm{c}}(t)), receive the reward r(t)r(t), and transition to the next state 𝒔(t+1)\boldsymbol{s}(t+1);
8:        Store experience (𝒔(t),𝒂(t),r(t),𝒔(t+1))(\boldsymbol{s}(t),\boldsymbol{a}(t),r(t),\boldsymbol{s}(t+1)) in DD;
9:        Get out a mini-batch of JJ experiences (𝒔(j),𝒂(j),r(j),𝒔(j+1))(\boldsymbol{s}(j),\boldsymbol{a}(j),r(j),\boldsymbol{s}(j+1)) from DD to update the discrete and continuous actor networks;
10:        π(𝒔;𝜽olds,𝜽oldd)=π(𝒔;𝜽s,𝜽d)\pi(\boldsymbol{s};\boldsymbol{\theta}_{\textrm{old}}^{\textrm{s}},\boldsymbol{\theta}_{\textrm{old}}^{\textrm{d}})=\pi(\boldsymbol{s};\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}}) and π(𝒔;𝜽olds,𝜽oldc)=π(𝒔;𝜽s,𝜽c)\pi(\boldsymbol{s};\boldsymbol{\theta}_{\textrm{old}}^{\textrm{s}},\boldsymbol{\theta}_{\textrm{old}}^{\textrm{c}})=\pi(\boldsymbol{s};\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}});
11:        Compute the advantage function depends on (21);
12:        The discrete and continuous actor networks are updated by minimizing the loss functions (18) and (19), respectively, and the critic network undergoes updates through the minimization of its loss function (20).
13:     end for
14:  end for

IV-B2 Meta-learning-based Compound-action DRL

The CADRL-based algorithm is applicable to a specific task in which the number and positions of SNs are fixed. When the task changes, the policy learned by the CADRL-based algorithm is no longer applicable to the new task. A meta-learning-based DRL method brings the advantage of quick adaptation to new tasks by drawing on knowledge from prior experience, thus minimizing the need for extensive training data [53]. In parallel, model-agnostic meta-learning (MAML) is a gradient-based meta-RL approach that concentrates on optimizing the parameters of the meta-policy during meta-training, which serves as a solid starting point for unforeseen tasks [54]. Hence, we utilize MAML to enhance the generalization of the DRL method across different tasks and propose a meta-learning-based compound-action deep reinforcement learning (MLCADRL) algorithm.

An agent in MAML endeavors to acquire a meta-policy with parameters 𝜽=(𝜽s,𝜽d,𝜽c)\boldsymbol{\theta}=(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}},\boldsymbol{\theta}^{\textrm{c}}) across a multitude of tasks from a given task distribution p(M)p(M). Each task Mip(M)M_{i}\sim p(M) can be described as an MDP model defined by (𝒮i,𝒜i,ri,𝒫i)(\mathcal{S}_{i},\mathcal{A}_{i},r_{i},\mathcal{P}_{i}), where 𝒮i\mathcal{S}_{i}, 𝒜i\mathcal{A}_{i}, rir_{i}, and 𝒫i\mathcal{P}_{i} denote the state space, action space, reward function, and state transition function of task MiM_{i}, respectively. As depicted in Fig. 3, the training process for the MLCADRL-based algorithm consists of two alternating learning phases: the inner-loop task learner and the outer-loop meta learner. The meta parameters 𝜽\boldsymbol{\theta} in both loops are optimized by using gradient descent. In the inner-loop, the task learner is initialized with the meta parameters 𝜽\boldsymbol{\theta}. It computes the updated parameters 𝜽i\boldsymbol{\theta}_{i} of the task learner for each training task MiM_{i} using the training data 𝒟itr\mathcal{D}_{i}^{\textrm{tr}}, which is collected by the meta-policy 𝜽\boldsymbol{\theta}, which is expressed as

\boldsymbol{\theta}_{i}\leftarrow\boldsymbol{\theta}-\alpha_{1}\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta},\mathcal{D}_{i}^{\textrm{tr}}), (22)

where α1\alpha_{1} is the task learner’s learning rate and \mathcal{L} denotes the loss function of an RL learning method. Then, the task learner utilizes validation data 𝒟ivd\mathcal{D}_{i}^{\textrm{vd}} sampled from trajectories collected with updated parameters 𝜽i\boldsymbol{\theta}_{i} for task MiM_{i} to estimate the loss function, expressed as

\mathcal{L}_{M_{i}}(\boldsymbol{\theta}_{i},\mathcal{D}_{i}^{\textrm{vd}})=\mathcal{L}_{M_{i}}(\boldsymbol{\theta}-\alpha_{1}\nabla_{\boldsymbol{\theta}}\mathcal{L}(\boldsymbol{\theta},\mathcal{D}_{i}^{\textrm{tr}}),\mathcal{D}_{i}^{\textrm{vd}}). (23)

In the outer-loop, as the policy is updated for each training task, the meta learner accumulates the loss Mi(𝜽i,𝒟ivd)\mathcal{L}_{M_{i}}(\boldsymbol{\theta}_{i},\mathcal{D}_{i}^{\textrm{vd}}) and carries out a meta-gradient update on the meta parameters 𝜽\boldsymbol{\theta} as

\boldsymbol{\theta}\leftarrow\boldsymbol{\theta}-\alpha_{2}\nabla_{\boldsymbol{\theta}}\sum_{M_{i}\sim p(M)}\mathcal{L}_{M_{i}}(\boldsymbol{\theta}_{i},\mathcal{D}_{i}^{\textrm{vd}}), (24)

where α2\alpha_{2} is the learning rate of the meta learner. These processes iterate for LmetaL_{\textrm{meta}} times. Upon the completion of meta-training, the meta-policy can be employed as the initial policy for new tasks, facilitating rapid adaptation to new tasks.
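A compact sketch of the inner/outer structure in (22)-(24) is given below, written generically over a task sampler and loss functions (sample_tasks, inner_loss, outer_loss, and the task's train_data/valid_data attributes are placeholders, not the paper's interfaces).

```python
import torch

def maml_meta_step(meta_params, sample_tasks, inner_loss, outer_loss,
                   alpha1=0.01, alpha2=1e-3, num_tasks=4):
    """One meta-iteration: inner adaptation (22) per task, then the
    meta-gradient update (24) on the accumulated validation losses (23)."""
    meta_opt = torch.optim.SGD(meta_params, lr=alpha2)
    total_val_loss = 0.0
    for task in sample_tasks(num_tasks):
        # Inner loop: one gradient step from the meta-parameters on D_i^tr.
        train_loss = inner_loss(meta_params, task.train_data)
        grads = torch.autograd.grad(train_loss, meta_params, create_graph=True)
        adapted = [p - alpha1 * g for p, g in zip(meta_params, grads)]
        # Validation loss of the adapted parameters on D_i^vd, cf. (23).
        total_val_loss = total_val_loss + outer_loss(adapted, task.valid_data)
    # Outer loop: differentiate through the inner update (second-order MAML).
    meta_opt.zero_grad()
    total_val_loss.backward()
    meta_opt.step()
```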

Figure 3: The training phase of meta-learning for compound-action DRL.

The solar-powered UAV-assisted data collection algorithm, based on MLCADRL, is detailed in Algorithm 2. First, the task distribution p(M)p(M), the number of meta-iterations LmetaL_{\textrm{meta}}, the number of tasks of each meta-iteration II, the number of trajectories sampled for each task KK, and the parameters of the meta-policy π(𝜽s,𝜽d,𝜽c)\pi(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}},\boldsymbol{\theta}^{\textrm{c}}) are initialized (Line 1). In the meta-training phase, the inner-loop task learner and outer-loop meta learner are performed alternately. In particular, in the inner-loop task learner phase, the task-specific policy is updated by using the vanilla policy gradient algorithm (REINFORCE) [55]. We sample II tasks from the task distribution (Line 3). The loss functions of the discrete and continuous actor networks for each task MiM_{i} are given by

\mathcal{L}_{i}^{\textrm{i,d}}(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}})=\frac{1}{KT}\sum_{k=1}^{K}\sum_{t=1}^{T}\mathbb{E}\left[\log\pi(\boldsymbol{a}_{i,k}^{\textrm{d}}(t)|\boldsymbol{s}_{i,k}(t);\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}})\,\hat{A}(\boldsymbol{s}_{i,k}(t))\right], (25)

\mathcal{L}_{i}^{\textrm{i,c}}(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}})=\frac{1}{KT}\sum_{k=1}^{K}\sum_{t=1}^{T}\mathbb{E}\left[\log\pi(\boldsymbol{a}_{i,k}^{\textrm{c}}(t)|\boldsymbol{s}_{i,k}(t);\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}})\,\hat{A}(\boldsymbol{s}_{i,k}(t))\right], (26)

where \boldsymbol{s}_{i,k}(t), \boldsymbol{a}_{i,k}^{\textrm{c}}(t), and \boldsymbol{a}_{i,k}^{\textrm{d}}(t) respectively denote the state, continuous action, and discrete action of task M_{i} at time slot t along trajectory k. We generate K trajectories \mathcal{D}_{i}^{\textrm{tr}} by executing the meta-policy on task M_{i} to estimate the gradients of the loss functions in (25) and (26). The task-specific policy parameters (\boldsymbol{\theta}_{i}^{\textrm{s}},\boldsymbol{\theta}_{i}^{\textrm{d}},\boldsymbol{\theta}_{i}^{\textrm{c}}) are then obtained by performing one or more gradient updates as in (22) (Lines 5-7). In the outer-loop meta-learner phase, the loss functions of the discrete and continuous actor networks are given by

\mathcal{L}^{\textrm{o,d}}(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}})=\sum_{i=1}^{I}\mathcal{L}_{i}^{\textrm{i,d}}(\boldsymbol{\theta}_{i}^{\textrm{s}},\boldsymbol{\theta}_{i}^{\textrm{d}}), (27)
\mathcal{L}^{\textrm{o,c}}(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{c}})=\sum_{i=1}^{I}\mathcal{L}_{i}^{\textrm{i,c}}(\boldsymbol{\theta}_{i}^{\textrm{s}},\boldsymbol{\theta}_{i}^{\textrm{c}}). (28)

To calculate the gradients of the loss functions in (27) and (28), the meta learner aggregates the trajectories \mathcal{D}_{i}^{\textrm{vd}} sampled with the updated policy \pi(\boldsymbol{\theta}_{i}^{\textrm{s}},\boldsymbol{\theta}_{i}^{\textrm{d}},\boldsymbol{\theta}_{i}^{\textrm{c}}) for each task. The meta-policy is then updated by minimizing these loss functions with the TRPO method (Lines 8 and 10).
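As an illustration (not the authors' exact implementation), the Monte-Carlo estimates of (25) and (26) can be computed from the log-probabilities of the discrete and continuous policy heads weighted by the shared advantage estimate; the terms are negated below so that minimizing the losses with gradient descent performs policy-gradient ascent.

```python
import torch
from torch.distributions import Categorical, Normal

# Sketch of the per-task losses (25)-(26): pi_d and pi_c are the discrete
# (SN scheduling / offloading) and continuous (flight control) heads that
# share the encoder parameters theta_s; advantages holds A_hat(s) for each
# sampled (state, action) pair.
def compound_reinforce_losses(pi_d: Categorical, pi_c: Normal,
                              a_disc: torch.Tensor, a_cont: torch.Tensor,
                              advantages: torch.Tensor):
    loss_d = -(pi_d.log_prob(a_disc) * advantages).mean()          # cf. (25)
    loss_c = -(pi_c.log_prob(a_cont).sum(-1) * advantages).mean()  # cf. (26)
    return loss_d, loss_c
```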

Algorithm 2 MLCADRL-based algorithm for solar-powered UAV-assisted data collection.
1:  Initialize the task distribution p(M), the number of meta-iterations L_{\textrm{meta}}, the number of tasks per meta-iteration I, the number of trajectories sampled for each task K, and the parameters of the compound-action meta-policy \pi(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}},\boldsymbol{\theta}^{\textrm{c}});
2:  for l_{\textrm{meta}}=1:L_{\textrm{meta}} do
    Task learner in the inner loop:
3:     Sample I tasks M_{i}\sim p(M);
4:     for each task M_{i} do
5:        Sample K trajectories \mathcal{D}_{i}^{\textrm{tr}} using \pi(\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}},\boldsymbol{\theta}^{\textrm{c}});
6:        Compute the inner-loop loss functions of the discrete and continuous actor networks in (25) and (26) using \mathcal{D}_{i}^{\textrm{tr}};
7:        Update the task-specific parameters (\boldsymbol{\theta}_{i}^{\textrm{s}},\boldsymbol{\theta}_{i}^{\textrm{d}},\boldsymbol{\theta}_{i}^{\textrm{c}}) as in (22);
8:        Collect trajectories \mathcal{D}_{i}^{\textrm{vd}} using the updated policy \pi(\boldsymbol{\theta}_{i}^{\textrm{s}},\boldsymbol{\theta}_{i}^{\textrm{d}},\boldsymbol{\theta}_{i}^{\textrm{c}}) in task M_{i};
9:     end for
    Meta learner in the outer loop:
10:    Update the meta parameters (\boldsymbol{\theta}^{\textrm{s}},\boldsymbol{\theta}^{\textrm{d}},\boldsymbol{\theta}^{\textrm{c}}) by minimizing the loss functions of the discrete and continuous actor networks in (27) and (28) using \mathcal{D}_{i}^{\textrm{vd}};
11: end for

V Simulation Results

In this section, we conduct extensive simulations to gauge the effectiveness of our CADRL-based and MLCADRL-based algorithms. Firstly, the simulation setup and baseline algorithms are introduced. We then explore the CADRL-based algorithm’s convergence and performance under different environmental parameters. Additionally, we delve into the influence of parameters on the MLCADRL-based algorithm’s training and its performance on new tasks.

V-A Simulation Setup

In our simulation, SNs are randomly distributed within a square area measuring 200\,\textrm{m} on each side. The DC is situated at coordinates (0,160\,\textrm{m}). Table I lists the essential system parameters of the IoT network.

TABLE I: System parameters
Parameter | Value
P_{\textrm{c}}^{\textrm{t}}, v_{s}^{\max}, \triangle\phi_{\max} | 1 W, 20 m/s, \frac{\pi}{3}
\beta, \beta^{\prime}, \beta_{0} | 11.95, 0.14, -60 dB
w, \kappa, \varsigma | 10240 bits, 0.2, 2.3
\lambda_{0}, \sigma^{2}, \xi_{\textrm{th}} | 0.1, -100 dBm [15], 2 dB
H, T, \tau_{0}, \tau_{\textrm{s}} | 100 m, 100 slots, 0.5 s, 0.25 s
P_{s}, B | 50 mW, 5 MHz
n_{r}, \chi, x_{T} | 4, 0.012, 0.302 [6]
\rho, A, x_{s} | 1.293 kg/m^{3}, 0.0314 m^{2}, 0.0955
d_{0}, x_{f}, S_{FA} | 0.834, 0.131, 0.1
M, g | 2 kg, 9.8 m/s^{2}
E_{\textrm{th}}^{1}, E_{\textrm{th}}^{2}, E_{\max} | 1\times 10^{3} J, 3\times 10^{3} J, 6\times 10^{3} J
\eta_{1}, S_{u}, G_{1} | 0.4, 0.14 m^{2}, 1367 W/m^{2} [8]
\varphi_{1}, \varphi_{2}, h_{1} | 0.8978, 0.2804, 8000
\omega_{1}, \omega_{2} | 1, 10
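As a simple illustration of the deployment described above, the SN and DC positions can be generated as follows (coordinates in meters; the seed and variable names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)                          # illustrative seed
N = 20                                                  # number of SNs (varied in the experiments)
sn_positions = rng.uniform(0.0, 200.0, size=(N, 2))     # SNs in a 200 m x 200 m square
dc_position = np.array([0.0, 160.0])                    # data center at (0, 160 m)
```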

We run the simulations on an NVIDIA GeForce RTX 3080 GPU and an Intel(R) Core(TM) i9-12900K CPU with a clock frequency of 3.20 GHz. The software environment consists of Torch 1.3.0 and Python 3.6, running on the Ubuntu 20.04 LTS platform. In the CADRL-based algorithm, the actor network is composed of three fully connected hidden layers with 256 neurons each. Beyond the third hidden layer, the network splits into two streams: one with the same size as the discrete action space, representing the discrete stochastic policy, and the other with the same size as the continuous action dimension, representing the continuous stochastic policy. The critic network comprises two fully connected hidden layers, each containing 256 neurons, and an output layer with a single neuron representing the value function. The input layers of both the actor and critic networks have the same dimensionality as the state in one time slot. In the MLCADRL-based algorithm, the meta-policy network adopts the same architecture as the actor network of the CADRL-based algorithm. Further details of the hyperparameters for both the CADRL-based and MLCADRL-based algorithms are listed in Table II, and a sketch of this network layout is given after the table.

TABLE II: Hyperparameters of the CADRL-based and MLCADRL-based algorithms
Parameter | Value
E, D | 30000, 2048
J, \epsilon | 64, 0.01
\lambda, \gamma, \alpha_{1} | 0.95, 0.99, 0.1
L_{\textrm{meta}}, K, I | 10000, 5, 15
Activation function | ReLU
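The following is a minimal PyTorch sketch of the actor and critic layouts described above; the dimension arguments and the state-independent log-standard-deviation of the continuous head are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
from torch import nn
from torch.distributions import Categorical, Normal

class Actor(nn.Module):
    """Shared trunk of three 256-neuron layers splitting into discrete and continuous streams."""
    def __init__(self, state_dim, n_discrete, cont_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.discrete_head = nn.Linear(hidden, n_discrete)  # discrete stochastic policy
        self.mu_head = nn.Linear(hidden, cont_dim)           # mean of continuous policy
        self.log_std = nn.Parameter(torch.zeros(cont_dim))   # assumed state-independent

    def forward(self, state):
        h = self.trunk(state)
        return (Categorical(logits=self.discrete_head(h)),
                Normal(self.mu_head(h), self.log_std.exp()))

class Critic(nn.Module):
    """Two 256-neuron hidden layers with a single-neuron value output."""
    def __init__(self, state_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)
```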

The performance of the proposed algorithms is assessed in comparison to five different baseline algorithms, namely:

  • DQN-based algorithm [56]: In this algorithm, we discretize the continuous actions into discrete actions.

  • DDPG-based algorithm [57]: In this algorithm, we convert the discrete actions into continuous actions.

  • P-DQN-based algorithm [58]: This algorithm can be seen as a combination of DQN and DDPG to solve the problem involving continuous and discrete actions.

  • AoI-based algorithm: In this algorithm, the UAV flies toward and schedules an SN during each time slot according to the following rule: among the SNs with updated data packets in their buffers, the SN with the maximum AoI value at the DC is selected. Once the UAV collects data from the SN, it promptly forwards the collected information to the DC (a sketch of this rule appears after this list).

  • TLCADRL-based algorithm [59]: The algorithm is a combination of the CADRL-based algorithm and transfer learning. This approach allows the strategy obtained from similar tasks to be transferred to new tasks.
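As referenced in the AoI-based baseline above, its scheduling rule can be sketched as follows; the SN attributes has_update and aoi_at_dc are hypothetical names used only for illustration.

```python
# Hypothetical sketch of the AoI-based baseline's scheduling rule: among SNs
# whose buffers hold a fresh status update, select the one with the largest
# AoI at the DC; return None if no SN is eligible in the current slot.
def aoi_based_schedule(sns):
    candidates = [sn for sn in sns if sn.has_update]
    if not candidates:
        return None
    return max(candidates, key=lambda sn: sn.aoi_at_dc)
```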

V-B The Performance of CADRL-based Algorithm

Figure 4: Evaluation of convergence performance for the proposed CADRL-based algorithm and other learning-based algorithms (\omega_{1}:\omega_{2}=1:10, N=20, \lambda_{0}=0.1, and \lambda_{1}=0.7).

Fig. 4 displays the convergence curves of four algorithms: our proposed CADRL-based algorithm, the DQN-based algorithm, the DDPG-based algorithm, and the P-DQN-based algorithm. The horizontal axis is the number of training episodes, while the vertical axis represents the cumulative reward. It is evident that our CADRL-based algorithm outperforms the other algorithms. Additionally, the CADRL-based algorithm’s performance matches that of the P-DQN-based algorithm in the final stages, but with faster convergence. This superiority can be attributed to its innovative hybrid structure, which enables this algorithm to effectively address the challenges posed by the compound action space.

Figure 5: Performance comparison of the proposed CADRL-based algorithm and four baseline algorithms (\omega_{1}:\omega_{2}=1:10, \lambda_{0}=0.1, and \lambda_{1}=0.7). (a) The average AoI of SNs in relation to the number of SNs N. (b) The average energy consumption in relation to the number of SNs N. (c) The weighted sum in relation to the number of SNs N.

Fig. 5(a) shows the average AoI of SNs in relation to the number of SNs N. As N grows, a noticeable rise in the average AoI can be observed. This can be attributed to the constraint that the UAV can schedule only one SN for status updates in each time slot. Consequently, with a larger N, each SN waits longer to update its status, leading to longer AoI update periods at the DC. Fig. 5(b) shows the UAV's average energy consumption in relation to N. As N increases, the learning-based methods exhibit an increasing trend in average energy consumption, while the AoI-based algorithm maintains a relatively constant average energy consumption. Furthermore, the average energy consumption of the AoI-based algorithm surpasses that of the learning-based algorithms. This is because, in the AoI-based algorithm, the UAV accelerates to its maximum speed with maximum acceleration to collect the status updates from the target SN, and energy consumption grows with acceleration. The learning-based algorithms, by contrast, control the UAV's velocity effectively, leading to reduced energy consumption; as N grows, the UAV must adjust its velocity more frequently to adapt to changes in its flight state. Finally, Fig. 5(c) shows that the weighted sum increases with N and that our proposed CADRL-based algorithm attains the lowest weighted sum among all compared algorithms.

Figure 6: Performance comparison of the proposed CADRL-based algorithm and four baseline algorithms (\omega_{1}:\omega_{2}=1:10, N=12, and \lambda_{1}=0.7). (a) The average AoI of SNs in relation to the sampling rate \lambda_{0}. (b) The average energy consumption in relation to the sampling rate \lambda_{0}. (c) The weighted sum in relation to the sampling rate \lambda_{0}.

Fig. 6(a) illustrates the relationship between the SNs' sampling rate \lambda_{0} and the average AoI. As \lambda_{0} increases, the average AoI of SNs first decreases rapidly, after which the rate of decline gradually slows, ultimately reaching a stable value. When \lambda_{0} is small, SNs need a long time to generate status updates, resulting in a large AoI. As \lambda_{0} increases, status updates are generated more quickly, and the AoI decreases. However, as \lambda_{0} continues to increase, the UAV's limited data collection capacity becomes the bottleneck and the AoI stabilizes. Fig. 6(b) presents the relationship between \lambda_{0} and the average energy consumption, and reveals that the impact of \lambda_{0} on average energy consumption is minimal. This is because \lambda_{0} primarily influences the UAV's communication energy consumption, which is considerably smaller in magnitude than the energy consumed during flight. Fig. 6(c) displays the relationship between the weighted sum and \lambda_{0}; our proposed CADRL-based algorithm achieves the minimum weighted sum, demonstrating its superior performance.

V-C The Performance of MLCADRL-based Algorithm

Figure 7: The influence of the number of tasks I selected from the task distribution on the convergence performance and training time of the MLCADRL-based algorithm (\omega_{1}:\omega_{2}=1:10, \lambda_{0}=0.1, and \lambda_{1}=0.7).

Fig. 7 illustrates the influence of the number of tasks I selected from the task distribution on the convergence performance and training time of the MLCADRL-based algorithm during the training phase. The solid line represents the convergence performance, while the dashed line represents the training time. As I increases, the performance of the MLCADRL-based algorithm continuously improves, reaching its best performance at I=15; further increasing I does not lead to any significant improvement. This is because more training tasks provide the algorithm with more knowledge, enhancing performance, but by I=15 the algorithm has already acquired sufficiently comprehensive knowledge. On the other hand, the training time of the MLCADRL-based algorithm increases with I, since more training tasks require more computational resources.

Figure 8: Performance comparison of algorithms when the task changes.

Fig. 8 compares the performance of three algorithms, namely the MLCADRL-based, TLCADRL-based, and CADRL-based algorithms, when the task changes. At the 1500th episode, the number of SNs in the environment decreases from 16 to 12, and the positions of the SNs are also altered. From Fig. 8, we observe that the MLCADRL-based algorithm adapts more rapidly to the new task than the TLCADRL-based and CADRL-based algorithms. This can be attributed to the meta-policy learned during the training phase of the MLCADRL-based algorithm, which enables quick adaptation to varying environmental conditions.

VI Conclusions

This paper has investigated the problem of timely and energy-efficient data collection in solar-powered UAV-assisted IoT networks, where the UAV schedules SNs to gather their status updates, temporarily stores the data packets in its buffer, and subsequently offloads them to the DC according to an offloading strategy, while each SN samples the environment stochastically. The UAV's trajectory, the scheduling of SNs, and the offloading strategy have been jointly optimized to minimize the weighted sum of the average AoI of SNs and the energy consumption of the UAV. The problem has been formulated as a finite-horizon MDP, and the CADRL-based and MLCADRL-based data collection algorithms have been proposed. The CADRL-based algorithm handles both the continuous and discrete actions of the UAV, while the MLCADRL-based algorithm, which combines meta-learning with the CADRL-based algorithm, improves the performance of the learned policy on new tasks. Simulation results have confirmed that the proposed algorithms reduce the weighted sum of the average AoI of SNs and the energy consumption of the UAV compared with the baseline algorithms, and that the combination of meta-learning and CADRL enables fast adaptation to new tasks.

References

  • [1] A. Al-Fuqaha, M. Guizani, M. Mohammadi, M. Aledhari, and M. Ayyash, “Internet of Things: A Survey on Enabling Technologies, Protocols, and Applications,” IEEE Commun. Surveys Tuts., vol. 17, no. 4, pp. 2347–2376, 2015.
  • [2] N. Hossein Motlagh, T. Taleb, and O. Arouk, “Low-Altitude Unmanned Aerial Vehicles-Based Internet of Things Services: Comprehensive Survey and Future Perspectives,” IEEE Internet Things J., vol. 3, no. 6, pp. 899–922, 2016.
  • [3] Q. Wu, L. Liu, and R. Zhang, “Fundamental Trade-offs in Communication and Trajectory Design for UAV-Enabled Wireless Network,” IEEE Wireless Commun., vol. 26, no. 1, pp. 36–44, 2019.
  • [4] D. Yang, Q. Wu, Y. Zeng, and R. Zhang, “Energy Tradeoff in Ground-to-UAV Communication via Trajectory Design,” IEEE Trans. Veh. Technol., vol. 67, no. 7, pp. 6721–6726, 2018.
  • [5] S. F. Abedin, M. S. Munir, N. H. Tran, Z. Han, and C. S. Hong, “Data Freshness and Energy-Efficient UAV Navigation Optimization: A Deep Reinforcement Learning Approach,” IEEE Trans. Intell. Transp. Syst., vol. 22, no. 9, pp. 5994–6006, 2021.
  • [6] R. Ding, F. Gao, and X. S. Shen, “3D UAV Trajectory Design and Frequency Band Allocation for Energy-Efficient and Fair Communication: A Deep Reinforcement Learning Approach,” IEEE Trans. Wireless Commun., vol. 19, no. 12, pp. 7796–7809, 2020.
  • [7] Y. Sun, D. Xu, D. W. K. Ng, L. Dai, and R. Schober, “Optimal 3D-Trajectory Design and Resource Allocation for Solar-Powered UAV Communication Systems,” IEEE Trans. Commun., vol. 67, no. 6, pp. 4281–4298, 2019.
  • [8] Y. Fu, H. Mei, K. Wang, and K. Yang, “Joint Optimization of 3D Trajectory and Scheduling for Solar-Powered UAV Systems,” IEEE Trans. Veh. Technol., vol. 70, no. 4, pp. 3972–3977, 2021.
  • [9] A. Trotta, M. D. Felice, F. Montori, K. R. Chowdhury, and L. Bononi, “Joint Coverage, Connectivity, and Charging Strategies for Distributed UAV Networks,” IEEE Trans. Robot., vol. 34, no. 4, pp. 883–900, 2018.
  • [10] L. P. Qian, H. Zhang, Q. Wang, Y. Wu, and B. Lin, “Joint Multi-Domain Resource Allocation and Trajectory Optimization in UAV-Assisted Maritime IoT Networks,” IEEE Internet Things J., vol. 10, no. 1, pp. 539–552, 2023.
  • [11] L. Wang, H. Zhang, S. Guo, and D. Yuan, “Deployment and Association of Multiple UAVs in UAV-Assisted Cellular Networks With the Knowledge of Statistical User Position,” IEEE Trans. Wireless Commun., vol. 21, no. 8, pp. 6553–6567, 2022.
  • [12] X. Zhang, H. Zhao, J. Wei, C. Yan, J. Xiong, and X. Liu, “Cooperative Trajectory Design of Multiple UAV Base Stations With Heterogeneous Graph Neural Networks,” IEEE Trans. Wireless Commun., vol. 22, no. 3, pp. 1495–1509, 2023.
  • [13] S. Kaul, M. Gruteser, V. Rai, and J. Kenney, “Minimizing Age of Information in Vehicular Networks,” in Proc. IEEE 8th Annu. Commun. Soc. Conf. Sensor, Mesh, Ad Hoc Commun. Netw., 2011, pp. 350–358.
  • [14] C. Guo, X. Wang, L. Liang, and G. Y. Li, “Age of Information, Latency, and Reliability in Intelligent Vehicular Networks,” IEEE Network, pp. 1–8, 2022.
  • [15] M. A. Abd-Elmagid, A. Ferdowsi, H. S. Dhillon, and W. Saad, “Deep Reinforcement Learning for Minimizing Age-of-Information in UAV-Assisted Networks,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), Puako, HI, USA, May 2019.
  • [16] X. Zhang, H. Zhao, J. Wei, C. Yan, J. Xiong, and X. Liu, “Cooperative Trajectory Design of Multiple UAV Base Stations With Heterogeneous Graph Neural Networks,” IEEE Trans. Wireless Commun., vol. 22, no. 3, pp. 1495–1509, 2023.
  • [17] N. H. Chu, D. T. Hoang, D. N. Nguyen, N. Van Huynh, and E. Dutkiewicz, “Joint Speed Control and Energy Replenishment Optimization for UAV-assisted IoT Data Collection with Deep Reinforcement Transfer Learning,” IEEE Internet Things J., pp. 1–1, 2022.
  • [18] B. Zhu, E. Bedeer, H. H. Nguyen, R. Barton, and Z. Gao, “UAV Trajectory Planning for AoI-Minimal Data Collection in UAV-Aided IoT Networks by Transformer,” IEEE Trans. Wireless Commun., vol. 22, no. 2, pp. 1343–1358, 2023.
  • [19] J. Liu, P. Tong, X. Wang, B. Bai, and H. Dai, “UAV-Aided Data Collection for Information Freshness in Wireless Sensor Networks,” IEEE Trans. Wireless Commun., vol. 20, no. 4, pp. 2368–2382, 2021.
  • [20] K. Liu and J. Zheng, “UAV Trajectory Optimization for Time-Constrained Data Collection in UAV-Enabled Environmental Monitoring Systems,” IEEE Internet Things J., vol. 9, no. 23, pp. 24 300–24 314, 2022.
  • [21] Y. Long, W. Zhang, S. Gong, X. Luo, and D. Niyato, “AoI-aware Scheduling and Trajectory Optimization for Multi-UAV-assisted Wireless Networks,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2022, pp. 2163–2168.
  • [22] X. Zhang, J. Wang, and H. V. Poor, “AoI-Driven Statistical Delay and Error-Rate Bounded QoS Provisioning for mURLLC Over UAV-Multimedia 6G Mobile Networks Using FBC,” IEEE J. Sel. Areas Commun., vol. 39, no. 11, pp. 3425–3443, 2021.
  • [23] X. Wang, M. Yi, J. Liu, Y. Zhang, M. Wang, and B. Bai, “Cooperative Data Collection With Multiple UAVs for Information Freshness in the Internet of Things,” IEEE Trans. Commun., vol. 71, no. 5, pp. 2740–2755, 2023.
  • [24] C. Liu, Y. Guo, N. Li, and X. Song, “AoI-Minimal Task Assignment and Trajectory Optimization in Multi-UAV-Assisted IoT Networks,” IEEE Internet Things J., vol. 9, no. 21, pp. 21 777–21 791, 2022.
  • [25] Y. Sun, E. Uysal-Biyikoglu, R. D. Yates, C. E. Koksal, and N. B. Shroff, “Update or Wait: How to Keep Your Data Fresh,” IEEE Trans. Inf. Theory, vol. 63, no. 11, pp. 7492–7508, 2017.
  • [26] M. A. Abd-Elmagid, H. S. Dhillon, and N. Pappas, “A Reinforcement Learning Framework for Optimizing Age of Information in RF-Powered Communication Systems,” IEEE Trans. Commun., vol. 68, no. 8, pp. 4747–4760, 2020.
  • [27] C. Zhou, H. He, P. Yang, F. Lyu, W. Wu, N. Cheng, and X. Shen, “Deep RL-based Trajectory Planning for AoI Minimization in UAV-assisted IoT,” in Proc. 11th Int. Conf. Wireless Commun. Signal Process. (WCSP), Xi'an, China, Oct. 2019, pp. 1–6.
  • [28] P. Tong, J. Liu, X. Wang, B. Bai, and H. Dai, “Deep Reinforcement Learning for Efficient Data Collection in UAV-Aided Internet of Things,” in Proc. IEEE Int. Conf. Commun. Workshops, 2020, pp. 1–6.
  • [29] Z. Li, P. Tong, J. Liu, X. Wang, L. Xie, and H. Dai, “Learning-Based Data Gathering for Information Freshness in UAV-Assisted IoT Networks,” IEEE Internet Things J., vol. 10, no. 3, pp. 2557–2573, 2023.
  • [30] S. Khairy, P. Balaprakash, L. X. Cai, and Y. Cheng, “Constrained Deep Reinforcement Learning for Energy Sustainable Multi-UAV Based Random Access IoT Networks With NOMA,” IEEE J. Sel. Areas Commun., vol. 39, no. 4, pp. 1101–1115, 2021.
  • [31] Z. Zhang, C. Xu, Z. Li, X. Zhao, and R. Wu, “Deep Reinforcement Learning for Aerial Data Collection in Hybrid-Powered NOMA-IoT Networks,” IEEE Internet Things J., vol. 10, no. 2, pp. 1761–1774, 2023.
  • [32] L. Zhang, A. Celik, S. Dang, and B. Shihada, “Energy-Efficient Trajectory Optimization for UAV-Assisted IoT Networks,” IEEE Trans. Mobile Comput., vol. 21, no. 12, pp. 4323–4337, 2022.
  • [33] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., “A General Reinforcement Learning Algorithm that Masters Chess, Shogi, and Go Through Self-play,” Science, vol. 362, no. 6419, pp. 1140–1144, 2018.
  • [34] C. Berner, G. Brockman, B. Chan, V. Cheung, P. Dębiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse et al., “Dota 2 with Large Scale Deep Reinforcement Learning,” arXiv preprint arXiv:1912.06680, 2019.
  • [35] A. Rajeswaran, V. Kumar, A. Gupta, G. Vezzani, J. Schulman, E. Todorov, and S. Levine, “Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations,” arXiv preprint arXiv:1709.10087, 2017.
  • [36] Y. Yuan, L. Lei, T. X. Vu, S. Chatzinotas, S. Sun, and B. Ottersten, “Energy Minimization in UAV-Aided Networks: Actor-Critic Learning for Constrained Scheduling Optimization,” IEEE Trans. Veh. Technol., vol. 70, no. 5, pp. 5028–5042, 2021.
  • [37] X. Liu, Y. Liu, and Y. Chen, “Reinforcement Learning in Multiple-UAV Networks: Deployment and Movement Design,” IEEE Trans. Veh. Technol., vol. 68, no. 8, pp. 8036–8049, 2019.
  • [38] A. M. Seid, J. Lu, H. N. Abishu, and T. A. Ayall, “Blockchain-Enabled Task Offloading With Energy Harvesting in Multi-UAV-Assisted IoT Networks: A Multi-Agent DRL Approach,” IEEE J. Sel. Areas Commun., vol. 40, no. 12, pp. 3517–3532, 2022.
  • [39] M. Sun, X. Xu, X. Qin, and P. Zhang, “AoI-Energy-Aware UAV-Assisted Data Collection for IoT Networks: A Deep Reinforcement Learning Method,” IEEE Internet Things J., vol. 8, no. 24, pp. 17 275–17 289, 2021.
  • [40] K. Li, W. Ni, E. Tovar, and M. Guizani, “Joint Flight Cruise Control and Data Collection in UAV-Aided Internet of Things: An Onboard Deep Reinforcement Learning Approach,” IEEE Internet Things J., vol. 8, no. 12, pp. 9787–9799, 2021.
  • [41] J. Hu, H. Zhang, L. Song, R. Schober, and H. V. Poor, “Cooperative Internet of UAVs: Distributed Trajectory Design by Multi-Agent Deep Reinforcement Learning,” IEEE Trans. Commun., vol. 68, no. 11, pp. 6807–6821, 2020.
  • [42] M. Akbari, M. R. Abedi, R. Joda, M. Pourghasemian, N. Mokari, and M. Erol-Kantarci, “Age of Information Aware VNF Scheduling in Industrial IoT Using Deep Reinforcement Learning,” IEEE J. Sel. Areas Commun., vol. 39, no. 8, pp. 2487–2500, 2021.
  • [43] Z. Fan, R. Su, W. Zhang, and Y. Yu, “Hybrid Actor-critic Reinforcement Learning in Parameterized Action Space,” arXiv preprint arXiv:1903.01344, 2019.
  • [44] B. Zhu, E. Bedeer, H. H. Nguyen, R. Barton, and J. Henry, “Joint Cluster Head Selection and Trajectory Planning in UAV-Aided IoT Networks by Reinforcement Learning With Sequential Model,” IEEE Internet Things J., vol. 9, no. 14, pp. 12 071–12 084, 2022.
  • [45] M. Yi, X. Wang, J. Liu, Y. Zhang, and R. Hou, “Multi-Task Transfer Deep Reinforcement Learning for Timely Data Collection in Rechargeable-UAV-aided IoT Networks,” IEEE Internet Things J., pp. 1–1, 2023.
  • [46] Z. Lu, X. Wang, and M. C. Gursoy, “Trajectory Design for Unmanned Aerial Vehicles via Meta-Reinforcement Learning,” in Proc. IEEE Int. Conf. Comput. Commun. Workshops (INFOCOM WKSHPS), 2023, pp. 1–6.
  • [47] M. Samir, S. Sharafeddine, C. M. Assi, T. M. Nguyen, and A. Ghrayeb, “UAV Trajectory Planning for Data Collection from Time-Constrained IoT Devices,” IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 34–46, 2020.
  • [48] D.-H. Tran, V.-D. Nguyen, S. Chatzinotas, T. X. Vu, and B. Ottersten, “UAV Relay-Assisted Emergency Communications in IoT Networks: Resource Allocation and Trajectory Optimization,” IEEE Trans. Wireless Commun., vol. 21, no. 3, pp. 1621–1637, 2022.
  • [49] A. Al-Hourani, S. Kandeepan, and S. Lardner, “Optimal LAP Altitude for Maximum Coverage,” IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, 2014.
  • [50] A. Baknina and S. Ulukus, “Optimal and Near-Optimal Online Strategies for Energy Harvesting Broadcast Channels,” IEEE J. Sel. Areas Commun., vol. 34, no. 12, pp. 3696–3708, 2016.
  • [51] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in Proc. Int. Conf. Mach. Learn.   PMLR, 2015, pp. 1889–1897.
  • [52] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional Continuous Control Using Generalized Advantage Estimation,” arXiv preprint arXiv:1506.02438, 2015.
  • [53] S. Thrun and L. Pratt, “Learning to Learn: Introduction and Overview,” in Learning to Learn.   Springer, 1998, pp. 3–17.
  • [54] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic Meta-learning for Fast Adaptation of Deep Networks,” in Proc. Int. Conf. Mach. Learn., PMLR, 2017, pp. 1126–1135.
  • [55] R. J. Williams, “Simple Statistical Gradient-following Algorithms for Connectionist Reinforcement Learning,” Mach. Learn., pp. 5–32, 1992.
  • [56] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level Control Through Deep Reinforcement Learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [57] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous Control with Deep Reinforcement Learning,” arXiv preprint arXiv:1509.02971, 2015.
  • [58] J. Xiong, Q. Wang, Z. Yang, P. Sun, L. Han, Y. Zheng, H. Fu, T. Zhang, J. Liu, and H. Liu, “Parametrized Deep Q-networks Learning: Reinforcement Learning with Discrete-continuous Hybrid Action Space,” arXiv preprint arXiv:1810.06394, 2018.
  • [59] M. E. Taylor and P. Stone, “Transfer Learning for Reinforcement Learning Domains: A Survey.” J. Mach. Learn. Res., vol. 10, no. 7, 2009.