
Trajectory Design for UAV-Based Internet-of-Things Data Collection: A Deep Reinforcement Learning Approach

Yang Wang, Zhen Gao, Jun Zhang, Xianbin Cao, Dezhi Zheng, Yue Gao, Derrick Wing Kwan Ng, and Marco Di Renzo

Y. Wang, Z. Gao, and J. Zhang are with the School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China (e-mail: [email protected]; [email protected]; [email protected]). X. Cao is with the School of Electronic and Information Engineering, Beihang University, Beijing 100083, China (e-mail: [email protected]). D. Zheng is with the Innovation Institute of Frontier Science and Technology, School of Instrumentation and Optoelectronic Engineering, Beihang University, Beijing 100083, China (e-mail: [email protected]). Y. Gao is with the Department of Electrical and Electronic Engineering, University of Surrey, Surrey GU2 7XH, U.K. (e-mail: [email protected]). D. W. K. Ng is with the School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney, NSW 2025, Australia (e-mail: [email protected]). M. Di Renzo is with Université Paris-Saclay, CNRS, CentraleSupélec, Laboratoire des Signaux et Systèmes, 3 Rue Joliot-Curie, 91192 Gif-sur-Yvette, France (e-mail: [email protected]).
Abstract

In this paper, we investigate an unmanned aerial vehicle (UAV)-assisted Internet-of-Things (IoT) system in a sophisticated three-dimensional (3D) environment, where the UAV's trajectory is optimized to efficiently collect data from multiple IoT ground nodes. Unlike existing approaches that focus only on a simplified two-dimensional scenario and assume the availability of perfect channel state information (CSI), this paper considers a practical 3D urban environment with imperfect CSI, where the UAV's trajectory is designed to minimize the data collection completion time subject to practical throughput and flight movement constraints. Specifically, inspired by state-of-the-art deep reinforcement learning approaches, we leverage the twin-delayed deep deterministic policy gradient (TD3) to design the UAV's trajectory, and we present a TD3-based trajectory design for completion time minimization (TD3-TDCTM) algorithm. In particular, we introduce an additional piece of information, namely the merged pheromone, to represent the state information of the UAV and its environment as a reference for the reward, which facilitates the algorithm design. By taking the service statuses of the IoT nodes, the UAV's position, and the merged pheromone as input, the proposed algorithm can continuously and adaptively learn how to adjust the UAV's movement strategy. By interacting with the external environment in the corresponding Markov decision process, the proposed algorithm can achieve a near-optimal navigation strategy. Our simulation results show the superiority of the proposed TD3-TDCTM algorithm over three conventional non-learning-based baseline methods.

Index Terms:
UAV communications, trajectory design, data collection, Internet-of-Things (IoT), deep reinforcement learning.

I Introduction

The unmanned aerial vehicle (UAV)-assisted communication paradigm is expected to play a pivotal role in next-generation wireless communication systems by providing ubiquitous connectivity with broader and deeper coverage [1, 2]. In particular, adopting UAVs as aerial mobile base stations (BSs) to collect data from distributed Internet-of-Things (IoT) ground nodes is anticipated to be a promising technology for realizing green communications [3]. Compared with terrestrial BS-based IoT systems, the UAV-based aerial BS system has salient attributes, such as a high probability of establishing strong line-of-sight (LoS) channels to improve coverage, flexible deployment and fast response for unexpected or limited-duration missions, and dynamic three-dimensional (3D) placement and movement for improving spectral and energy efficiency [4].

I-A Related Works

Thanks to their high maneuverability, UAVs can approach the potential IoT nodes closely, offering a low-power solution for the IoT nodes to upload their data, provided that the UAVs' trajectories are carefully optimized. To date, various related works have investigated trajectory design for UAV-assisted IoT data collection systems, where different optimization objectives, such as energy efficiency [5, 6, 7, 8], data collection rate [9, 10, 11], flight time [13, 14], and age-of-information (AoI) [15], have been considered. Specifically, the authors in [5] maximized the energy efficiency of wireless sensor networks by optimizing the UAV's trajectory and the sensors' wake-up schedule. Besides, in [6], to minimize the total transmit power of the IoT nodes, the authors optimized the UAVs' 3D placement, each IoT node's transmit power, and the association between IoT nodes and UAVs. Also, in [7], the authors studied the UAV's trajectory and route design problem, in which the UAV adopted a hover-and-fly model for data collection; a suboptimal design was proposed to maximize the minimum residual energy of the sensor nodes after data transmission. However, the above works mainly considered the sensors' energy consumption and did not consider the UAV's energy consumption, which is one of the important factors determining the performance of UAV-assisted communication systems. In [8], the authors jointly optimized the UAV's trajectory, the resource allocation strategy, and the jamming policy to maximize the UAV's energy efficiency.

Generally, a UAV would cruise sufficiently close to each sensor for efficient data collection. However, the associated high energy consumption of the UAV would limit its maneuverability and endurance in long flights, which jeopardizes the total collection throughput of the whole network. Therefore, in addition to energy efficiency, several works have also considered optimizing the data collection rate [9, 10, 11, 12]. For instance, the authors in [9] jointly designed the sensors' transmission scheduling, the power allocation, and the UAV's trajectory to maximize the minimum data collection rate from the ground sensors to a multi-antenna UAV. Also, in [10], the authors optimized the UAV's 3D trajectory to maximize the minimum average data collection rate, taking into account angle-dependent Rician fading channels. Besides, in [11, 12], the authors jointly designed the UAV's trajectory and the wireless resource allocation to maximize the system throughput.

On the other hand, there have also been several prior works studying UAV-assisted data collection with flight time optimization [13, 14]. Specifically, the authors in [13] and [14] jointly designed the UAV's flight trajectory and the wireless resource allocation/scheduling to minimize the mission completion time, where sensors deployed in one-dimensional (1D) and two-dimensional (2D) spaces are considered, respectively. Besides, a few works have considered the AoI of the collected data. In [15], the authors minimized the average AoI of the data collected from all sensors by jointly optimizing the UAV's trajectory, the energy harvesting duration, and the time required for data collection at each sensor.

TABLE I: Comparison of related works with our work.
References | Optimization objective | Optimization variables | Optimization method | Channel model | Number of UAVs
[5] | Energy consumption of all sensors minimization | Sensors' wake-up schedule and UAV's 2D trajectory | Successive convex optimization technique | Simplified LoS channel | Single
[6] | Total transmit power of IoT devices minimization | 3D placement of the UAVs, the association of devices, and uplink power control | Traditional convex optimization technique | Probabilistic LoS channel | Multiple
[7] | Sensors' minimum residual energy maximization | UAV's hovering locations and 2D route | Voronoi diagram theory | Simplified LoS channel | Single
[8] | UAV's energy-efficiency maximization | UAV's 2D trajectory, resource allocation, and jamming policy | Alternating optimization algorithm | Simplified LoS channel | Single
[9] | Minimum data collection rate maximization | Sensors' transmission scheduling, power allocation, and UAV's 2D trajectory | Block coordinate descent and successive convex approximation | Simplified LoS channel | Single
[10] | Minimum average data collection rate maximization | UAV's 3D trajectory | Block coordinate descent and successive convex approximation | Angle-dependent Rician fading channel | Single
[11] | Uplink minimum throughput maximization | UAV's 2D trajectory and transmission resource allocation | Alternating optimization and successive convex programming | Simplified LoS channel | Single
[12] | Sum throughput maximization | UAV's 3D trajectory, transmit power, and subcarrier allocation | Monotonic optimization and successive convex programming | Simplified LoS channel | Single
[13] | UAV's total flight time minimization | UAV's 1D speed and sensors' transmission power | Dynamic programming | Simplified LoS channel | Single
[14] | Total mission time minimization | UAV's 2D trajectory, fixed altitude, velocity, and scheduling | Block coordinate descent | Probabilistic LoS channel | Single
[15] | Average age of information minimization | Time of energy transmission plus data collection, and UAV's 2D trajectory | Dynamic programming and ant colony heuristic algorithm [22] | Simplified LoS channel | Single
[17] | Energy efficiency maximization | UAV's 3D trajectory | Deep deterministic policy gradient (DDPG) algorithm [23] | Probabilistic LoS channel | Multiple
[18] | Energy efficiency maximization | UAV's 2D trajectory and charging station's 2D trajectory | Deep Q-network (DQN) [24] | No channel model | Single
[19] | Energy efficiency maximization | UAV's 2D trajectory | Multi-agent deep deterministic policy gradient (MADDPG) [25] | No channel model | Multiple
[20] | Weighted sum of mission completion time and expected communication outage duration minimization | UAV's 2D trajectory | Temporal difference learning [16] | Practical urban LoS channel | Single
[21] | Weighted sum of the age of information minimization | UAV's 2D trajectory and transmission scheduling of the sensors | DQN | Simplified LoS channel | Single
Our work | Average mission completion time minimization | UAV's navigation altitude and UAV's 2D trajectory | Twin-delayed deep deterministic policy gradient (TD3) algorithm [26] | Practical urban LoS channel | Single

However, the above UAV trajectory designs based on conventional optimization solutions suffer from some critical limitations. First, the formulation of an optimization problem requires an accurate and tractable radio propagation model. For this reason, recent works, e.g., [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], have mostly adopted statistical models such as the simplified LoS-dominated model, the probabilistic LoS model, and the angle-dependent Rician fading model. However, these models can only predict the performance in an average sense and cannot provide any performance guarantee for the local environment where the UAVs are actually deployed. Second, offline optimization-based trajectory designs assume that perfect channel state information (CSI) can be explicitly derived from a specific radio propagation model. Unfortunately, perfect CSI is often difficult to acquire in practice due to the uncertainty of the UAV's position and the time-varying dynamics of the communication environment. Lastly, most of these optimization problems in UAV-assisted communication systems are highly non-convex and difficult to solve efficiently. In addition, machine learning paradigms such as supervised and unsupervised learning require sufficient data samples in advance, which is unrealistic for decision-making problems. In contrast, deep reinforcement learning (DRL) has been serving as an efficient solution to tackle such decision-making issues thanks to its essential trait of learning dynamically from the real world. Specifically, since there is no prior information about the environment model in the Markov decision process (MDP), the agent, i.e., the learning entity, interacts with the external environment, collects samples in real time, and then designs the optimal strategy based on these samples [16].

Recently, there have been several research works leveraging DRL to optimize the UAV's trajectory in UAV-assisted IoT data collection systems. For example, the authors in [17] designed a DRL-based 3D continuous movement control algorithm to solve a trajectory optimization problem with the goal of maximizing the system energy efficiency. Also, in [18, 19], the authors proposed DRL-based UAV control methods for maximizing the energy efficiency in mobile crowd sensing systems. Besides, in [20], to minimize the weighted sum of the mission completion time and the expected communication outage duration, the authors focused on optimizing the UAV's trajectory based on DRL. Moreover, in [21], the weighted sum of the AoI was adopted as the performance metric in UAV-assisted wireless powered IoT networks, where the UAV's trajectory and the transmission scheduling of the sensors were jointly optimized by a DRL-based approach. For clarity, a comparison of the related works [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], [17, 18, 19, 20, 21] is summarized in Table I.

I-B Motivations

Existing UAV trajectory design schemes for IoT data collection systems, e.g., [5, 7, 8, 9, 11, 12, 13, 15, 21], rely on the existence of pure LoS channels. Although adopting the overly simplified LoS channel model facilitates system performance analysis and resource allocation optimization, these works fail to capture the dynamics of practical IoT communication environments. On the other hand, another well-known ground-to-air (G2A) channel model, known as the probabilistic LoS model [29], has been adopted in [6, 14, 17] as an alternative. This channel model considers the existence of both LoS and non-line-of-sight (NLoS) links, each occurring with a certain probability following an assumed distribution model. However, the existence of a LoS link should depend on the actual environment, i.e., whether the direct link between the UAV and the IoT node is physically blocked by obstacles, rather than on a probabilistic model or a simplified LoS channel model. To the best of our knowledge, only the authors in [20] considered a practical urban LoS model. Yet, the UAV's trajectory in [20] was designed for a single-sensor scenario, which is not applicable to the case of multiple sensors, and only the optimization of the UAV's discrete horizontal direction was considered. Therefore, there are few works considering such a practical urban LoS channel for UAV-assisted IoT data collection systems, which motivates our research in this article.

I-C Contributions

In this paper, we propose a trajectory design for completion time minimization (TDCTM) in UAV-assisted IoT data collection systems. Specifically, the UAV is employed to collect delay-tolerant data from multiple IoT ground nodes distributed in an urban scenario. In order to address the challenges of practical urban LoS channel modeling and imperfect CSI, we leverage the twin-delayed deep deterministic policy gradient (TD3) [26], a state-of-the-art DRL approach, to design the UAV's trajectory for minimizing the mission completion time, subject to throughput and flight movement constraints. Our contributions are summarized as follows:

  • First, to minimize the mission completion time, we formulate the trajectory design problem for UAV-assisted IoT data collection systems. This problem, however, is generally intractable for conventional convex optimization methods, since the sophisticated distribution of buildings in a relatively practical environment leads to imperfect CSI. To address this difficulty, we reformulate the original problem as an equivalent MDP, to which DRL techniques can be applied to acquire a near-optimal solution.

  • Next, to cope with the continuous control problem with an infinite action space, we propose a TD3-based TDCTM (TD3-TDCTM) algorithm, which enables the UAV agent to continuously and adaptively learn how to adjust its movement strategy. Besides, inspired by the ant colony algorithm [22], we introduce additional information, i.e., the merged pheromone, to represent the state information of the UAV and the environment, which is adopted as the basis of the reward function. The proposed TD3-TDCTM algorithm requires only some commonly accessible information, i.e., the service statuses of the IoT nodes, the UAV's position, and the merged pheromone, which can be obtained with minimal information exchange between the agent and the environment.

  • Then, to facilitate the convergence of TD3-TDCTM: i) we propose an information-enhancement technique in which a self-defined pheromone of the UAV is added to the state to enhance the learning efficiency; ii) considering the sparse reward of the original problem, we propose a reward-shaping mechanism that transforms the original sparse rewards into dense rewards; iii) to stabilize the convergence of the Q-loss, we integrate the dimension-spread and done-or-terminated techniques. Also, we conduct extensive simulations to verify the effectiveness of these four techniques.

  • Last, we compare the proposed algorithm with various baseline methods. The results show that the proposed algorithm consistently outperforms the others. Also, through extensive simulations, we identify an appropriate set of hyperparameters, in terms of the discount factor, the number of neurons, and the experience replay buffer size, that achieves good performance.

The remainder of this article is organized as follows. In Section II, the UAV-assisted IoT communication system model is presented. In Section III-A, the UAV data collection problem is formulated as a non-convex optimization problem. Section III-B provides a brief introduction to TD3. In the rest of Section III, the TD3-TDCTM algorithm is described in detail. Then, simulation results are presented in Section IV to evaluate the performance of the proposed algorithm, as well as its robustness and convergence with respect to various parameters. Finally, we conclude the paper in Section V.

Notations: In this paper, scalars are denoted by italic letters, and vectors are denoted by boldface letters. The Euclidean norm of a vector is denoted by \left\|\cdot\right\|, \left\{\cdot\right\} denotes an array, \mathbb{E}\left[\cdot\right] denotes the statistical expectation, \nabla denotes the gradient of a function, and {\rm clip}\left(\cdot\right) denotes the boundary clipping function. The distribution of a Gaussian random variable with mean \mu and variance \sigma^{2} is denoted by \mathcal{N}\left(\mu,\sigma^{2}\right).

II System Model

As shown in Fig. 1, we consider a UAV-assisted IoT data collection system, where the UAV is dispatched to collect delay-tolerant data from a large number of distributed IoT ground nodes, e.g., wireless sensors in applications such as smart city monitoring, traffic flow monitoring, and health monitoring. We assume that both the UAV and the IoT ground nodes employ a single omni-directional antenna and that K IoT nodes are randomly distributed in a given geographical region of D\times D m^2. The positions of the k-th IoT node and the UAV are denoted by \bm{w}_{k}=[\bar{x}_{k},\bar{y}_{k},0]\in\mathbb{R}^{3} and \bm{q}(t)=[x_{t},y_{t},H]\in\mathbb{R}^{3}, 0\leq t\leq T, respectively, where (\bar{x}_{k},\bar{y}_{k}) denotes the horizontal coordinate of the k-th IoT node, (x_{t},y_{t}) is the UAV location projected onto the horizontal plane, H is the UAV altitude, and T is the mission execution duration. The K IoT nodes' locations are assumed to be fixed for designing the UAV's trajectory and performing data collection.

Figure 1: UAV-assisted IoT data collection system.

As mentioned in Table I, most existing works adopt the simplified LoS channel or a statistical channel model (i.e., the probabilistic LoS channel) to model the G2A link. By contrast, we consider a more practical G2A channel model, characterized by large-scale fading and small-scale fading that are calculated based on a simulated 3D map taking into account the existence of buildings as propagation scatterers. Specifically, the locations and the heights of the buildings are generated according to a statistical model recommended by the International Telecommunication Union (ITU) [31]. In this model, three parameters characterize an urban environment: \alpha is the ratio of the land area covered by buildings to the total land area; \beta is the average number of buildings per square kilometer; and \lambda is the mean value of the Rayleigh distribution that represents the building height distribution.
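For concreteness, the building-map generation described above can be instantiated numerically. The Python snippet below is our own illustrative sketch (the regular-grid layout, random seed, and function name are assumptions, not part of [31]); it places \beta buildings per square kilometer with footprints covering a fraction \alpha of the land and Rayleigh-distributed heights of mean \lambda, clipped as in Section IV.

```python
import numpy as np

def generate_buildings(alpha=0.3, beta=144, lam=50.0, area_side=1000.0,
                       h_min=10.0, h_max=50.0, seed=0):
    """Sketch of an ITU-style urban layout: 'beta' buildings per km^2 on a
    regular grid, footprints sized so that they cover a fraction 'alpha' of
    the land, and Rayleigh-distributed heights with mean 'lam' (clipped)."""
    rng = np.random.default_rng(seed)
    n_buildings = int(beta * (area_side / 1000.0) ** 2)   # buildings in the area
    per_row = int(np.ceil(np.sqrt(n_buildings)))
    cell = area_side / per_row                             # grid cell size [m]
    side = cell * np.sqrt(alpha)                           # footprint side so coverage = alpha
    # Rayleigh scale chosen so that the mean height equals lam (paper's convention)
    heights = np.clip(rng.rayleigh(scale=lam * np.sqrt(2.0 / np.pi), size=n_buildings),
                      h_min, h_max)
    buildings = []                                         # (x_min, x_max, y_min, y_max, height)
    for i in range(n_buildings):
        cx = (i % per_row + 0.5) * cell
        cy = (i // per_row + 0.5) * cell
        buildings.append((cx - side / 2, cx + side / 2,
                          cy - side / 2, cy + side / 2, heights[i]))
    return buildings
```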

Given a specific area with the simulated building locations and heights, we can accurately determine whether there is a LoS link between the UAV and the k-th IoT node by checking whether the direct communication link between them is blocked by a building. Thus, the large-scale fading of the G2A channel associated with the k-th IoT node can be expressed as [29]

{\rm PL}_{k}(t)=\begin{cases}L_{k}^{\rm{FS}}(t)+\eta_{\rm LoS},&{\rm LoS\;link},\\ L_{k}^{\rm{FS}}(t)+\eta_{\rm NLoS},&{\rm NLoS\;link},\end{cases} (1)

where L_{k}^{\rm{FS}}(t)=20\log_{10}d_{k}(t)+20\log_{10}f_{c}+20\log_{10}\left(\frac{4\pi}{c}\right) represents the free-space pathloss between the UAV and the k-th IoT node, d_{k}(t)=\left\|\bm{q}(t)-\bm{w}_{k}\right\| denotes the distance from the UAV to the k-th IoT node, f_{c} denotes the carrier frequency, and c represents the speed of light. Besides, \eta_{\rm LoS} and \eta_{\rm NLoS} represent the additional propagation losses of the LoS and NLoS links, respectively. The above pathloss expressions are all in dB.

The small-scale fading coefficient \tilde{h}_{k}(t) is assumed to follow Rayleigh fading for the NLoS case and Rician fading with a 15 dB Rician factor for the LoS case, respectively. Furthermore, the Doppler effect caused by the UAV mobility is assumed to be well estimated and then compensated at the receiver by using existing advanced compensation algorithms [30]. Thus, the channel gain from the UAV to the k-th IoT node can be expressed as

h_{k}(t)=10^{-{\rm PL}_{k}\left(t\right)/20}\tilde{h}_{k}\left(t\right). (2)
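The LoS test and the channel gain in (1)-(2) can be evaluated directly on the simulated map. The following Python sketch is illustrative only: the segment-sampling LoS check, the 2.4 GHz carrier frequency, and the Rician construction are our own assumptions consistent with the description above.

```python
import numpy as np

C = 3e8  # speed of light [m/s]

def is_los(q_uav, w_node, buildings, n_samples=100):
    """Check LoS by sampling points on the node-UAV segment and testing
    whether any point falls inside a building below its height (illustrative)."""
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y, z = (1 - t) * np.asarray(w_node) + t * np.asarray(q_uav)
        for (x0, x1, y0, y1, h) in buildings:
            if x0 <= x <= x1 and y0 <= y <= y1 and z <= h:
                return False
    return True

def channel_gain(q_uav, w_node, buildings, fc=2.4e9,
                 eta_los=0.1, eta_nlos=21.0, rician_k_db=15.0, rng=None):
    """Large-scale pathloss (1) plus small-scale fading, giving h_k(t) as in (2)."""
    if rng is None:
        rng = np.random.default_rng()
    d = np.linalg.norm(np.asarray(q_uav) - np.asarray(w_node))
    pl_fs = 20 * np.log10(d) + 20 * np.log10(fc) + 20 * np.log10(4 * np.pi / C)
    if is_los(q_uav, w_node, buildings):
        pl = pl_fs + eta_los
        k = 10 ** (rician_k_db / 10)                      # Rician factor (linear)
        los_part = np.sqrt(k / (k + 1))
        scatter = np.sqrt(1 / (2 * (k + 1))) * (rng.standard_normal() + 1j * rng.standard_normal())
        h_small = los_part + scatter                      # Rician fading
    else:
        pl = pl_fs + eta_nlos
        h_small = np.sqrt(0.5) * (rng.standard_normal() + 1j * rng.standard_normal())  # Rayleigh
    return 10 ** (-pl / 20) * h_small
```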

III Proposed DRL-Based TD3-TDCTM Scheme

In this section, we formulate the UAV's trajectory optimization problem and reformulate it as an MDP, which is the common representation in the DRL framework. Furthermore, we propose the TD3-TDCTM algorithm, together with several convergence-facilitating techniques, for minimizing the mission completion time.

III-A Problem Formulation

To make the UAV's trajectory optimization problem tractable, the continuous time domain is discretized into N time steps with unequal-length durations \delta_{n}, n\in\{0,1,\ldots,N\}, and a data collection task is performed within a series of time steps, i.e., \{\delta_{0},\delta_{1},\ldots,\delta_{N}\} [5]. In addition, we consider that each time step consists of two parts, i.e., \delta_{n}=\delta_{\rm ft}+\delta_{{\rm ht},n}, where \delta_{\rm ft} is the fixed flight time and \delta_{{\rm ht},n} is the hovering time for data collection. If there is no active IoT node at the current position, the UAV skips hovering and directly executes the next time step, i.e., \delta_{{\rm ht},n}=0 s. During each time step, the UAV mobility can be expressed as

x_{n+1}=x_{n}+m_{n}\cos(\bar{\theta}_{n}), (3)
y_{n+1}=y_{n}+m_{n}\sin(\bar{\theta}_{n}), (4)

where m_{n}=\delta_{\rm ft}\upsilon_{n} represents the moving distance of the UAV in the n-th time step, \upsilon_{n}\in[0,\upsilon_{\rm max}] denotes the average flight speed in the n-th time step, \upsilon_{\rm max} denotes the maximum cruising speed in each time step, and \bar{\theta}_{n}\in(0,2\pi] denotes the horizontal direction of the UAV in the xy-plane with respect to the x-axis.
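A minimal sketch of the mobility update (3)-(4) is given below, including the boundary handling used later in Algorithm 1 (the function name and the move-cancellation flag are illustrative assumptions).

```python
import numpy as np

def move_uav(x, y, speed, theta, delta_ft=2.5, area_side=1000.0):
    """Update the UAV's horizontal position according to (3)-(4).
    If the move would leave the region, it is canceled (the UAV stays put),
    mirroring the boundary-violation handling of Algorithm 1."""
    m = delta_ft * speed                      # moving distance in this time step
    x_new = x + m * np.cos(theta)
    y_new = y + m * np.sin(theta)
    out_of_bounds = not (0.0 <= x_new <= area_side and 0.0 <= y_new <= area_side)
    if out_of_bounds:
        return x, y, True                     # canceled move, flag boundary violation
    return x_new, y_new, False
```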

We assume that IoT nodes can begin to upload data with a constant transmission power P_{\rm Tx} only when they are woken up by the UAV; otherwise, they remain in silent mode for energy saving. In the n-th time step, the corresponding uplink signal-to-noise ratio (SNR) between the k-th IoT node and the UAV can be expressed as

\rho_{k,n}=\frac{P_{\rm Tx}{\left|h_{k,n}\right|}^{2}}{P_{N}}, (5)

where |h_{k,n}|^{2} is the channel power gain at the hover stage of the n-th time step and P_{N} represents the power of the additive white Gaussian noise (AWGN) at the UAV receiver. For the uplink data collection service associated with the k-th IoT node, we set a pre-defined SNR threshold \rho_{\rm th}, such that the k-th IoT node can be woken up and served by the UAV if and only if \rho_{k,n}\geq\rho_{\rm th}. Furthermore, we define the indicator function

b_{k,n}=\begin{cases}1,\quad{\rm if}\,\rho_{k,n}\geq\rho_{\rm th},\\ 0,\quad{\rm otherwise},\end{cases} (6)

to indicate whether the k-th IoT node satisfies the SNR requirement of the UAV in the n-th time step. Since each IoT node can be served at most once in one realization, we define the following indicator function for the k-th IoT node:

\tilde{b}_{k,n}=\begin{cases}1,\quad{\rm if}\,b_{k,n}=1\,{\rm and}\,c_{k,n}=0,\\ 0,\quad{\rm otherwise},\end{cases} (7)

where c_{k,n}\in\left\{0,1\right\} is a binary variable indicating whether the k-th IoT node has already been served by the UAV. For simplicity, we assume that orthogonal frequency division multiple access (OFDMA) is utilized to allow simultaneous data collection from at most K_{\rm up} IoT nodes, i.e., each IoT node satisfying the upload requirements is allocated a bandwidth W, so that the inter-user interference can be neglected. Therefore, the constraint on \tilde{b}_{k,n} can be expressed as

\sum_{k=1}^{K}\tilde{b}_{k,n}\leq K_{\rm up}, (8)

where, if the number of woken-up IoT nodes exceeds the maximum access number K_{\rm up}, the system selects the K_{\rm up} IoT nodes with the largest SNRs as the serving objects and sets the other IoT nodes to be silent, i.e., \tilde{b}_{k,n}=0. Besides, we define the serving flag c_{k,n} as

c_{k,n}=\min\left\{\sum_{i=0}^{n}\tilde{b}_{k,i},1\right\},\,c_{k,0}=0, (9)

where, if c_{k,n}=1, the k-th IoT node has been served during the mission; otherwise, the k-th IoT node has not been served. Then, the transmission rate between the UAV and the k-th IoT node can be expressed as

R_{k,n}=\tilde{b}_{k,n}W\log_{2}\left(1+\rho_{k,n}\right), (10)

where the data of the k-th IoT node can be uploaded only when the k-th IoT node is served in the n-th time step. Thus, the hovering time of the UAV, which equals the maximum upload duration among the served IoT nodes in the n-th time step, can be expressed as

\delta_{{\rm ht},n}=\max_{k\in\{1,\ldots,K\}}\left\{\frac{\tilde{b}_{k,n}D_{\rm file}}{R_{k,n}+\kappa}\right\}, (11)

where D_{\rm file} denotes the information file size of the k-th IoT node and \kappa is a small value preventing the denominator from being zero. The completion criterion of the data collection mission is that the data of all IoT nodes has been collected, which can be expressed as

\sum_{k=1}^{K}c_{k,N}=K. (12)
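A compact sketch of one hover stage, combining (5)-(11) with the serving-flag update (9), is given below; the function signature and the vectorized top-K_{\rm up} selection are our own illustrative choices under the parameter values of Section IV.

```python
import numpy as np

def collection_step(gains, served, p_tx_dbm=10.0, p_noise_dbm=-75.0,
                    snr_th_db=0.0, k_up=6, bandwidth=10e6, d_file=10e6, kappa=1e-9):
    """One hover stage: evaluate (5)-(11) for all IoT nodes (illustrative sketch).
    'gains' are complex channel gains h_{k,n}; 'served' are the flags c_{k,n-1}."""
    snr = (10 ** (p_tx_dbm / 10) * np.abs(gains) ** 2) / (10 ** (p_noise_dbm / 10))
    b = (10 * np.log10(np.maximum(snr, 1e-12)) >= snr_th_db)    # coverage indicator (6)
    candidates = np.where(b & ~served)[0]                       # meets SNR and not yet served (7)
    if len(candidates) > k_up:                                   # keep the K_up strongest nodes (8)
        candidates = candidates[np.argsort(snr[candidates])[::-1][:k_up]]
    rates = bandwidth * np.log2(1.0 + snr[candidates])           # transmission rates (10), bits/s
    hover_time = (d_file / (rates + kappa)).max() if len(candidates) else 0.0  # hover time (11)
    served = served.copy()
    served[candidates] = True                                    # update serving flags (9)
    return served, hover_time
```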

Thus, the problem to minimize the mission completion time via trajectory optimization can be formulated as

\begin{array}{cl}\displaystyle\mathop{\mathrm{minimize}}\limits_{\left\{\upsilon_{n},\bar{\theta}_{n}\right\},\left\{x_{n},y_{n}\right\},N}&\sum_{n=0}^{N}\delta_{n}\\ {\rm s.t.}&\sum_{k=1}^{K}\tilde{b}_{k,n}\leq K_{\rm up},\ \forall n,\\ &c_{k,n}=\min\left\{\sum_{i=0}^{n}\tilde{b}_{k,i},1\right\},\ \forall n,k,\\ &\sum_{k=1}^{K}c_{k,N}=K,\\ &0\leq\upsilon_{n}\leq\upsilon_{\rm max},\ \forall n,\\ &0\leq\bar{\theta}_{n}\leq 2\pi,\ \forall n,\\ &0\leq x_{n}\leq D,\ \forall n,\\ &0\leq y_{n}\leq D,\ \forall n.\end{array} (13)

It is noteworthy that the above optimization problem is a mixed-integer non-convex problem, which is known to be NP-hard. Moreover, in the considered scenario, both the large-scale fading and small-scale fading depend on the instantaneous locations of both the UAV and the IoT nodes as well as the surrounding buildings, so that it is intractable to solve the above problem by employing traditional optimization methods, such as constraint programming and mixed integer programming [32]. In contrast, DRL has been demonstrated to be an efficient approach for handling sophisticated control problems in high-dimensional continuous spaces[26]. Hence, in the following subsection, we propose a DRL-based solution to tackle this challenging control problem.

III-B Preliminaries

Reinforcement learning (RL) considers the paradigm of an agent interacting with its environment with the aim of learning a reward-maximizing policy [16]. Specifically, RL can be used to address an MDP described by the 4-tuple \left\langle\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R}\right\rangle, where \mathcal{S} is the state space, \mathcal{A} is the action space, \mathcal{P} is the state transition probability, and \mathcal{R} is the reward function. At each discrete time step n, given a state s\in\mathcal{S}, the agent selects an action a\in\mathcal{A} according to its policy \pi and receives a reward r. The return is defined as R_{n}=\sum_{i=n}^{N}\gamma^{i-n}r\left(s_{i},a_{i}\right), where \gamma is a discount factor determining the priority of short-term rewards.

DRL can be considered the "deep" version of RL, which uses multiple deep neural networks (DNNs) as approximators of the Q-value function Q\left(s,a\right)=\mathbb{E}\left[R_{n}|s,a\right]. Here, Q\left(s,a\right) is the expected return when performing action a in state s. In the deep deterministic policy gradient (DDPG) algorithm [23], the Q-value approximator Q_{\theta}\left(s,a\right) with parameters \theta can be updated by minimizing the following loss function

L\left(\theta\right)=\mathbb{E}\left[\left(y-Q_{\theta}\left(s,a\right)\right)^{2}\right], (14)

where y is the target value, which can be estimated by

y=r+\gamma Q_{\theta^{\prime}}\left(s^{\prime},a^{\prime}\right),\ a^{\prime}\sim\pi_{\phi^{\prime}}\left(s^{\prime}\right), (15)

where s^{\prime} is the next state, a^{\prime} is an action selected from the target actor network \pi_{\phi^{\prime}}, and Q_{\theta^{\prime}} is a target network that maintains a fixed objective y over multiple updates. The policy can be updated through the deterministic policy gradient, which is given by

{\nabla}_{\phi}{J(\phi)}=\mathbb{E}\left[{\nabla}_{a}Q_{\theta}(s,a)|_{a=\pi_{\phi}(s)}{\nabla}_{\phi}\pi_{\phi}(s)\right]. (16)

As a realization of the celebrated actor-critic framework, DDPG achieves good performance in many applications. Yet, it has a critical shortcoming, namely the overestimation problem, which leads to accumulated errors. In particular, this kind of error can cause an originally poor state to be assigned a high Q-value, leading to a highly suboptimal strategy. To address this issue, TD3 was proposed in [26], which considers the interplay between policy and value updates in the presence of function approximation errors. Based on DDPG, TD3 applies the following three techniques to address the overestimation bias problem and achieves better performance than its counterparts, e.g., proximal policy optimization [27] and soft actor-critic [28].

Figure 2: TD3-based trajectory design for UAV-assisted IoT data collection system.

The first technique is clipped double Q-learning. TD3 learns two Q-functions instead of one (hence "twin"), i.e., Q_{\theta_{1}} and Q_{\theta_{2}}. Therefore, the critic network and the critic target network each contain two independent Q-networks. In each update, the smaller of the two target Q-values is chosen as the Q target. Specifically, the target update of the clipped double Q-learning can be expressed as

y=r+\gamma\min_{i=1,2}\left\{Q_{\theta^{\prime}_{i}}\left(s^{\prime},\tilde{a}\right)\right\}, (17)

where r is the designated reward and \tilde{a} is the output of the target policy.

The second technique is delayed policy updates. TD3 updates the actor network and the target network less frequently than the critic networks. That is, the model does not update the policy unless the model’s value function has changed sufficiently. These less frequent policy updates will lead to a lower variance in the value estimation and thus should result in a better policy. This allows the critic network to become more stable and reduce errors before it is used to update the actor network.

The third technique is target policy smoothing regularization. A concern with deterministic policies is that they may overfit to narrow peaks in the value estimate. When updating the critic, a learning target based on a deterministic policy is highly susceptible to inaccuracies induced by function approximation errors, which increases the variance of the target. This induced variance can be reduced through regularization. TD3 adds clipped random noise to the target action and averages over mini-batches of size B to smooth the value estimate, as shown below:

\tilde{a}=\pi_{\phi^{\prime}}\left(s^{\prime}\right)+\epsilon,\ \epsilon\sim{\rm clip}\left(\mathcal{N}\left(0,\tilde{\omega}\right),-c,c\right), (18)

where the added noise is clipped to keep the target close to the original action. The TD3 algorithm's update rules are similar to those of the DDPG algorithm and can be expressed as

\theta_{i}\leftarrow{\mathrm{arg\,min}}_{\theta_{i}}B^{-1}\sum\left(y-Q_{\theta_{i}}(s,a)\right)^{2}, (19)
{\nabla}_{\phi}{J(\phi)}=B^{-1}\sum{\nabla}_{a}Q_{\theta_{1}}(s,a)|_{a=\pi_{\phi}(s)}{\nabla}_{\phi}\pi_{\phi}(s). (20)
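For concreteness, a minimal PyTorch sketch of one TD3 update combining (17)-(20) is given below: clipped double-Q targets with target-policy smoothing, delayed actor updates, and soft target updates. The network factory, hyperparameter values, and variable names are illustrative assumptions rather than the exact implementation used in this paper, and actions are assumed to be normalized to [-1, 1] by a tanh output.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=400, out_act=None):
    """Simple 2-hidden-layer fully-connected network (illustrative)."""
    layers = [nn.Linear(in_dim, hidden), nn.ReLU(),
              nn.Linear(hidden, hidden), nn.ReLU(),
              nn.Linear(hidden, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

# Example instantiation (state_dim = 2K + 3, action_dim = 2):
# actor = mlp(83, 2, out_act=nn.Tanh()); critic1 = mlp(85, 1); critic2 = mlp(85, 1)

def td3_update(batch, actor, actor_t, critic1, critic2, critic1_t, critic2_t,
               actor_opt, critic_opt, step, gamma=0.99, tau=0.005,
               noise_std=0.2, noise_clip=0.5, policy_delay=2):
    """One TD3 update: clipped double-Q target (17), critic regression (19),
    delayed deterministic policy gradient (20), and soft target updates."""
    s, a, r, s2, d = batch                                    # column tensors: state, action, reward, next state, done
    with torch.no_grad():
        eps = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a2 = (actor_t(s2) + eps).clamp(-1.0, 1.0)             # smoothed target action (18)
        q_t = torch.min(critic1_t(torch.cat([s2, a2], 1)),
                        critic2_t(torch.cat([s2, a2], 1)))    # clipped double Q
        y = r + (1.0 - d) * gamma * q_t                       # target value, cut at termination
    sa = torch.cat([s, a], 1)
    critic_loss = ((critic1(sa) - y) ** 2).mean() + ((critic2(sa) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    if step % policy_delay == 0:                              # delayed policy update
        actor_loss = -critic1(torch.cat([s, actor(s)], 1)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for net, net_t in [(actor, actor_t), (critic1, critic1_t), (critic2, critic2_t)]:
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)     # soft target update
```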

III-C MDP Formulation

Combining the optimization problem formulated in Section III-A and the state-of-the-art DRL approach TD3 introduced in Section III-B, we reformulate the original problem as an MDP so that DRL algorithms can be applied. As shown in Fig. 2, the UAV is treated as an agent. During the training process, the TD3-TDCTM network and the replay buffer are deployed on a central server. The replay buffer regularly collects the current state information from the interaction between the UAV and the environment, and the TD3-TDCTM network then selects a better strategy to control the flight path of the UAV agent based on the historical states and rewards. Once the actor network has been trained well, we deploy it on the UAV agent during the testing stage. The UAV then receives the state information of the environment and obtains the flight action command directly through the actor network. This process repeats until the mission is completed. We first define the state, action, and reward for the UAV trajectory design problem as follows.

1) State s_{n} in the n-th time step, \forall\,n:

  • b_{k,n}\in\{0,1\}: the coverage indicator of the k-th IoT node in the n-th time step. If b_{k,n}=1, the k-th IoT node satisfies the SNR requirement; otherwise, it does not. Hence, our proposed DRL-based trajectory design only needs imperfect CSI, i.e., SNR information.

  • c_{k,n}\in\{0,1\}: the serving flag of the k-th IoT node during the whole mission. If c_{k,n}=1, the k-th IoT node has been served during the mission; otherwise, it has not been served.

  • [x_{n},y_{n}]: the two-dimensional coordinate of the UAV in the given region.

  • \zeta_{n}: the pheromone of the UAV, which can be expressed as

    \zeta_{n}=\zeta_{n-1}+K_{{\rm cov},n}\cdot\kappa_{\rm cov}-\kappa_{\rm dis}-P_{\rm ob}, (21)

    where \zeta_{n-1} is the remaining pheromone in the (n-1)-th time step, K_{{\rm cov},n}=\sum_{k=1}^{K}\tilde{b}_{k,n} is the number of IoT nodes served by the UAV in the n-th time step, \kappa_{\rm cov} is a positive constant expressing the captured pheromone per IoT node, \kappa_{\rm dis} is a positive constant expressing the lost pheromone, and P_{\rm ob} is a penalty applied when an action causes a boundary violation of the UAV.

Formally, s_{n}=[b_{1,n},\cdots,b_{K,n};c_{1,n},\cdots,c_{K,n};x_{n},y_{n};\zeta_{n}] is the complete representation of the n-th state, which has a cardinality of 2K+3. In state s_{n}, both b_{k,n} and c_{k,n} reflect the data collection situation of the k-th IoT node; [x_{n},y_{n}] represents the UAV's movement status; and \zeta_{n} denotes the merged information between the environment and the UAV agent during the mission, which can be regarded as additional information that enhances the decision efficiency.
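The pheromone recursion (21) and the assembly of s_{n} can be sketched as follows; the position normalization and function names are illustrative assumptions.

```python
import numpy as np

def pheromone_update(zeta_prev, n_covered, kappa_cov=10.0, kappa_dis=1.0, p_ob=0.0):
    """Pheromone dynamics in (21): gain for freshly served nodes, constant
    evaporation, and an extra penalty on boundary violations (sketch)."""
    return zeta_prev + n_covered * kappa_cov - kappa_dis - p_ob

def build_state(b, c, x, y, zeta, area_side=1000.0):
    """Assemble s_n = [b_1..b_K, c_1..c_K, x, y, zeta]; the UAV position is
    normalized to [0, 1] here purely as an illustrative choice."""
    return np.concatenate([b.astype(np.float32), c.astype(np.float32),
                           np.array([x / area_side, y / area_side, zeta], dtype=np.float32)])
```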

Algorithm 1 TD3-TDCTM
1:  Initialize critic networks Q_{\theta_{1}}, Q_{\theta_{2}}, and actor network \pi_{\phi} with random parameters \theta_{1}, \theta_{2}, and \phi
2:  Initialize target networks \theta_{1}^{\prime}\leftarrow\theta_{1}, \theta_{2}^{\prime}\leftarrow\theta_{2}, and \phi^{\prime}\leftarrow\phi
3:  Initialize experience replay buffer R
4:  for episode = 0 to M do
5:     Initialize the environment, receive an initial state s_{0}, let the time step n=0 and the terminated flag d=0
6:     repeat
7:        Select an action a_{n}=\pi_{\phi}\left(s_{n}\right)+\sigma\epsilon, where \epsilon is a Gaussian noise and \sigma is a decay constant, and observe a reward r_{n}=r_{\rm tanh}\left(\zeta_{n}\right) and a new state s_{n+1}
8:        if the UAV flies over the border then
9:           \zeta_{n}=\zeta_{n}-P_{\rm ob}, where P_{\rm ob} is a given penalty. Meanwhile, the movement of the UAV is canceled and r_{n}, s_{n+1} are updated accordingly
10:        end if
11:        if the UAV completes the data collection task, i.e., \sum_{k=1}^{K}c_{k,n}=K then
12:           r_{n}=r_{n}+N_{\rm re}, and the episode is terminated in advance, i.e., d=1
13:        end if
14:        Store the transition \left(s_{n},a_{n},r_{n},s_{n+1},d\right) in R
15:        if the replay buffer R stores more than 2,000 transitions then
16:           Sample a mini-batch of B transitions \left(s,a,r,s^{\prime},d\right) from R
17:           \tilde{a}\leftarrow\pi_{\phi^{\prime}}\left(s^{\prime}\right)+\epsilon, \epsilon\sim{\rm clip}\left(\mathcal{N}\left(0,\tilde{\omega}\right),-c,c\right)
18:           Set y\leftarrow r+\left(1-d\right)\cdot\gamma\min_{i=1,2}\left\{Q_{\theta^{\prime}_{i}}\left(s^{\prime},\tilde{a}\right)\right\}
19:           Update critics by minimizing the loss:
20:           \theta_{i}\leftarrow{\rm arg\,min}_{\theta_{i}}B^{-1}\sum\left(y-Q_{\theta_{i}}\left(s,a\right)\right)^{2}
21:           Update the actor policy \phi by the deterministic policy gradient:
22:           \nabla_{\phi}J\left(\phi\right)=B^{-1}\sum\nabla_{a}Q_{\theta_{1}}\left(s,a\right)|_{a=\pi_{\phi}\left(s\right)}\nabla_{\phi}\pi_{\phi}\left(s\right)
23:           Update target networks:
24:           \theta_{i}^{\prime}\leftarrow\tau\theta_{i}+\left(1-\tau\right)\theta_{i}^{\prime}
25:           \phi^{\prime}\leftarrow\tau\phi+\left(1-\tau\right)\phi^{\prime}
26:        end if
27:        Update n\leftarrow n+1
28:     until n=N_{\max} or \sum_{k=1}^{K}c_{k,n}=K
29:  end for

2) Action a_{n} in the n-th time step, \forall\,n:

  • \bar{\theta}_{n}\in(0,2\pi]: the horizontal direction of the UAV in the next time step.

  • \upsilon_{n}\in[0,\upsilon_{\rm max}]: the flying speed of the UAV in the next time step.

Formally, the action is defined as a_{n}=[\bar{\theta}_{n},\upsilon_{n}]. Since both action variables take continuous values, the UAV's trajectory optimization is a continuous control problem.

3) Reward r_{n} in the n-th time step, \forall\,n. For the above data collection mission, the UAV agent cannot obtain a positive reward until it completes the data collection from all IoT nodes within the specified number of time steps, i.e., there is no reward for the intermediate steps. Furthermore, at the beginning of training, the agent's strategy is random and obtaining a reward requires a long sequence of appropriate actions. Therefore, the data collection mission is a sparse-reward problem [16]. If we directly apply a reinforcement learning algorithm to such a problem, the training difficulty increases exponentially with the number of IoT nodes, and convergence cannot be guaranteed. To overcome this issue, we propose a reward-shaping mechanism, which transforms the original sparse rewards into dense rewards. Specifically, the reward is defined as

r_{n}=\begin{cases}r_{\rm tanh}\left(\zeta_{n}\right)+N_{\rm re},&{\rm if}\,\sum_{k=1}^{K}c_{k,n}=K,\\ r_{\rm tanh}\left(\zeta_{n}\right),&{\rm otherwise},\end{cases} (22)

where r_{\rm tanh}\left(\zeta_{n}\right)=\frac{2}{1+{\rm exp}\left(-\zeta_{n}/\left(K\cdot\kappa_{\rm cov}\right)\right)}-1 is a shaped reward function of the pheromone \zeta_{n}. The function r_{\rm tanh}\left(\cdot\right) approximates the {\rm tanh}\left(\cdot\right) function, but its gradient is smoother. Owing to the changes of the pheromone \zeta_{n}, the UAV agent obtains dense rewards during the exploration stage. Furthermore, the gradient information of the reward function accelerates the convergence of the algorithm. Besides, the UAV obtains a remaining-time reward N_{\rm re}=N_{\max}-n at the mission completion time step, which encourages the UAV to complete the data collection mission as soon as possible.
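A minimal sketch of the shaped reward (22) is given below; the argument names mirror the setting of Section IV and are otherwise illustrative.

```python
import numpy as np

def shaped_reward(zeta, n, served, K, kappa_cov=10.0, n_max=200):
    """Dense reward of (22): a smooth tanh-like squashing of the pheromone,
    plus the remaining-time bonus N_re = N_max - n on mission completion."""
    r = 2.0 / (1.0 + np.exp(-zeta / (K * kappa_cov))) - 1.0   # r_tanh(zeta)
    if served.sum() == K:                                      # all IoT nodes collected
        r += n_max - n
    return r
```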

III-D TD3-Based UAV Trajectory Design

To solve the UAV’s trajectory design issue, we propose a TD3-TDCTM algorithm, whose pseudocode is listed in Algorithm 1. As mentioned above, since the UAV’s trajectory design is a continuous control task, we adopt TD3, a state-of-the-art actor-critic algorithm, as the starting point of our design, whose basic idea has been introduced in Section III-B. In the following, we utilize information-enhancement, dimension-spread, and done-or-terminated tricks to stabilize the training process of TD3, whose details are illustrated below:

1) Information-enhancement: Inspired by the ant colony algorithm [22], we introduce additional information, i.e., the merged pheromone \zeta_{n}, as part of the state to enhance the learning efficiency. We assume that each IoT node contains some pheromone, which can also be regarded as special data to be collected. During the cruising of the UAV, the data of active IoT nodes are collected and the pheromone of these IoT nodes is transferred to the UAV. At the same time, the pheromone carried by the UAV evaporates continuously, and more pheromone evaporates when the UAV's movement violates the boundary. The pheromone represents the fused information of the UAV and the environment, which also serves as a reference for the reward design. Since the pheromone concentration determines the reward, it pilots the UAV agent to better explore the environment.

Figure 3: The TD3-TDCTM network framework proposed for the UAV-assisted IoT data collection system, where each step of Algorithm 1 is marked as a decimal number.

2) Dimension-spread: According to the definition of the state in the MDP formulation, most of the state dimensions are related to the coverage indicators of the IoT nodes, while only two dimensions are related to the location of the UAV and only one dimension is related to the pheromone of the UAV. Obviously, there is a dimension imbalance problem, and we establish a pre-spread network to spread these state dimensions such that they are comparable to the coverage indicator dimensions [34]. To this end, the low-dimensional states, including the UAV's location and pheromone, are first fed into a dense layer with 2K neurons to extend their dimension to 2K, as sketched below. Then, these spread states and the coverage indicator states are concatenated as the input of the actor and critic networks.
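A minimal PyTorch sketch of such a pre-spread actor is given below; the layer sizes follow the description above and Section IV, while the ReLU activation of the spread layer and the class name are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpreadActor(nn.Module):
    """Actor with dimension-spread input: the 3 low-dimensional UAV features
    (position and pheromone) are expanded to 2K dimensions before being
    concatenated with the 2K indicator entries (illustrative sketch)."""
    def __init__(self, num_nodes, action_dim=2, hidden=400):
        super().__init__()
        self.k = num_nodes
        self.spread = nn.Sequential(nn.Linear(3, 2 * num_nodes), nn.ReLU())
        self.body = nn.Sequential(
            nn.Linear(4 * num_nodes, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())          # tanh bounds the actions

    def forward(self, state):
        indicators = state[:, :2 * self.k]                      # b_{k,n} and c_{k,n}
        uav_info = state[:, 2 * self.k:]                        # [x_n, y_n, zeta_n]
        return self.body(torch.cat([indicators, self.spread(uav_info)], dim=1))
```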

3) Done-or-terminated: There are two situations that trigger "done" in the environment: reaching the maximum number of time steps and completing the task. According to the Bellman equation, the target value used for the critic update is calculated as follows

Q\left(s,a\right)\approx r+\left(1-I_{\rm done}\right)\cdot\gamma\min_{i=1,2}\left\{Q_{\theta^{\prime}_{i}}\left(s^{\prime},\tilde{a}\right)\right\}, (23)

where I_{\rm done}\in\left\{0,1\right\} is a binary variable indicating whether the episode is over, i.e., "done". However, it is incorrect to set "done" when the maximum number of time steps of the episode is reached, because the episode is then artificially terminated and the future Q-value is abandoned. In fact, if the environment continued to run, the future Q-value would not be zero. Note that only when the UAV completes the task and terminates can the future Q-value be set to zero. This is because the goal of the framework is to train the UAV to complete the data collection task; once the task is completed, the episode truly ends and there is no further reward signal. If the "done" state and the "terminated" state of the environment are not distinguished, the critic learning will oscillate, resulting in performance degradation. Therefore, we set a "terminated" flag, denoted by the binary variable d\in\{0,1\}, to record whether the data collection task has been completed by the UAV. In this way, the target value function can be expressed as

y=r+\left(1-d\right)\cdot\gamma\min_{i=1,2}\left\{Q_{\theta^{\prime}_{i}}\left(s^{\prime},\tilde{a}\right)\right\}. (24)
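In code, this distinction only affects the bootstrapping mask of the target value (24); a minimal sketch (illustrative names) is:

```python
def td_target(r, q_next_min, terminated, gamma=0.99):
    """Target value of (24): bootstrapping is cut only when the task is truly
    completed (terminated), not when the episode merely hits N_max (sketch)."""
    return r + (1.0 - float(terminated)) * gamma * q_next_min
```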

III-E Flow of the Proposed Algorithm

In Fig. 3, we show the TD3-TDCTM network framework proposed for the UAV-assisted IoT data collection system. Integrating the information-enhancement, dimension-spread, and done-or-terminated techniques into the TD3 algorithm, we propose a UAV trajectory optimization algorithm, TD3-TDCTM, to minimize the mission completion time. According to the above discussion, the flow of the TD3-TDCTM algorithm can be described as follows. First, the algorithm randomly initializes the weights \theta_{1}, \theta_{2}, and \phi of the critic and actor networks, respectively (Line 1 in Algorithm 1). Then, critic and actor target networks are employed to improve the learning stability. The target networks have the same structures as the original actor and critic networks, and their weights \theta_{1}^{\prime}, \theta_{2}^{\prime}, and \phi^{\prime} are initialized in the same manner as those of the original networks (Line 2). However, these network parameters are updated in entirely different ways. For instance, the target networks' parameters are updated by performing Lines 24-25, where a soft target update technique is applied to control the updating rate.

The second part (Lines 5-13) is the exploration process. At the beginning of each episode, the algorithm initializes the environment and receives an initial state and a terminated flag. During exploration, the algorithm derives an action from the current actor network \pi_{\phi}\left(\cdot\right) and then adds a random noise \sigma\epsilon, where \epsilon is a Gaussian noise and \sigma is a decay constant. Through a reasonable setting of the exploration noise and the decay factor, we can effectively balance the trade-off between exploration and exploitation. We also need to handle the important case in which an action leads to a boundary violation. In this case, we assign a penalty value P_{\rm ob}=\frac{1}{K} to the pheromone \zeta_{n} (Lines 8-10), cancel the corresponding movement (i.e., the UAV stays put), and update the corresponding reward and next state accordingly. Besides, we consider another special case in which the UAV completes the data collection task in the n-th time step; the remaining time, denoted by N_{\rm re}=N_{\max}-n, is then regarded as a positive reward added to the current reward. Finally, we terminate this episode in advance and store the terminated flag d=1, which is used in the done-or-terminated technique.

The third part is the update of the neural networks (Lines 14-26). Similar to the classical TD3 algorithm, we use a replay buffer, which is initialized at the beginning with size R, for updating the actor and critic networks (Line 3). Specifically, we first store the collected samples into the replay buffer (Line 14) and then sample a mini-batch from the buffer to update the actor and critic networks (Lines 16-22).

TABLE II: Main simulation parameters.
Simulation parameters | Value
Maximum number of episodes | M=8000
Gaussian exploration noise | \epsilon\sim\mathcal{N}(0,0.36)
Exploration noise decay rate | \sigma=0.999
Experience replay buffer capacity | R=1\times 10^{5}
Target network soft-update rate | \tau=0.005
Discount factor | \gamma=0.99
Mini-batch size | B=256
Actor and critic networks learning rate | l_{r}=0.0001
Maximum number of time steps per episode | N_{\max}=200

As explained above, the critic networks are updated by minimizing the loss function in (19), and the actor network is updated by computing the gradient in (20).

IV Simulation Results

In this section, simulation results are presented to evaluate the performance of the proposed TD3-TDCTM algorithm.

IV-A Simulation Settings

As shown in Fig. 4, we consider an urban area of size 1,000\times 1,000 m^2 with dense and high-rise buildings that are generated by one realization of the statistical model in [31]. Figs. 4(a) and 4(b) show the 2D and 3D views of one particular realization of the building locations and heights, respectively, with parameters \alpha=0.3, \beta=144 buildings/km^2, and \lambda=50 m. To ensure practicality, the building heights are clipped to h\in[10,50] m.

Figure 4: UAV's 2D (a) and 3D (b) flight trajectories according to the proposed TD3-TDCTM algorithm, where 40 IoT nodes are considered.
Figure 5: Average mission completion time with different UAV altitudes.

We assume that the transmit power of each IoT node is P_{\rm Tx}=10 dBm, the noise power is P_{N}=-75 dBm, the SNR threshold satisfying the basic data collection requirement is \rho_{\rm th}=0 dB, the propagation losses are \eta_{\rm LoS}=0.1 dB and \eta_{\rm NLoS}=21 dB [29], and the information file size of each IoT node is D_{\rm file}=10 Mbits. Besides, owing to the OFDMA scheme, the maximum number of IoT nodes that the UAV can serve in the n-th time step is K_{\rm up}=6, and the bandwidth allocated to each woken-up IoT node is 10 MHz. The average flight speed of the UAV is assumed to be \upsilon\in[0,20] m/s, the flight time per step is \delta_{\rm ft}=2.5 s, the hovering time of the UAV can be computed by (11), and the fixed altitude H of the UAV is a parameter to be optimized. The pheromone design parameters are \kappa_{\rm cov}=10 and \kappa_{\rm dis}=1.

As for Algorithm 1, all the actor and critic networks are constructed as 2-layer fully-connected feedforward neural networks with 400 neurons. The actor network utilizes \tanh\left(\cdot\right) as the activation function to bound the actions, while the critic networks utilize the ReLU function for activation. Other major simulation parameters are summarized in Table II, and our simulations run in the following environment: Python 3.6 and PyTorch 1.4 on a Windows server with 3 NVIDIA 2080Ti GPUs.

To evaluate the performance, we compare the proposed TD3-TDCTM algorithm with three conventional baseline methods.

  • Scan strategy: The UAV flies according to a preset trajectory which is a rectangular strip track. The track starts from the lower left corner of the area and ends at the upper left corner. Note that such a trajectory design ensures that all locations within the target region are covered by the UAV.

  • ACO-based approach: Treating each IoT node as a graph node, this approach fixes the initial position of the UAV and exploits the ant colony optimization (ACO) algorithm [22] to find the shortest path visiting every node from the given starting point.

  • RRT-based approach: Based on the ACO solution, we use a typical search algorithm, namely the rapidly exploring random tree (RRT) algorithm [35], to solve the multi-target path planning problem. The UAV explores each node in turn according to the ACO sequence. If the data collection of the target node and of some additional nodes is completed during this period, these nodes are deleted from the sequence, until the sequence is empty.

IV-B Results and Analysis

To verify the effectiveness of our proposed algorithm, we use the trained model for testing. In each simulation realization, the UAV's initial horizontal position is randomly generated. We execute 25 mutually independent realizations in total, whose outputs are averaged to obtain the final results.

For UAV-assisted IoT data collection, the establishment of a LoS G2A channel is desired for connecting the UAV and the IoT node with improved link quality. Typically, the G2A LoS link depends on various factors, such as the density and height of the buildings, the locations of both the IoT node and the UAV, as well as the deployment altitude of the UAV. Specifically, the LoS probability is a monotonically increasing function of the altitude [29]. A higher altitude enhances the probability of a G2A LoS link, but the impact of large-scale fading becomes more severe. Similarly, a lower altitude mitigates the impact of large-scale fading, but the probability of a G2A LoS link is obviously reduced. Therefore, we need to make a trade-off between the large-scale fading and the LoS probability so that an optimal altitude can be obtained for maximizing the coverage radius. However, it is hard to obtain an analytical solution for the optimal UAV altitude, which strongly depends on the specific urban environment. Thus, we first compare the average mission completion time with different UAV altitudes to obtain the best altitude in Fig. 5. We can observe that the UAV altitude H=95 m is the best in the considered setting. Besides, we compare the 3D trajectory design scheme with adaptive adjustment of the flight altitude against the fixed-altitude scheme. Through simulation, we observe that the performance of the 3D trajectory is poor compared with the case of fixed flight altitude. This is because the 3D trajectory policy obtained by DRL is still far from the optimal policy. In addition, for buildings of different heights, the optimization of the 3D trajectory requires constant adjustment of the UAV altitude during the flight to achieve optimal coverage. This makes the time consumed by the UAV on the 3D trajectory during the mission greater than that in the case of a fixed flight altitude. Therefore, when serving a large number of distributed IoT nodes, the fixed-altitude scheme is more suitable for wide-area data collection missions. Accordingly, we choose the fixed altitude H=95 m in the following simulations.

Figure 6: The impact of the number of IoT nodes on (a) average mission completion time and (b) average flight energy consumption.

In Fig. 4, the UAV’s trajectory is plotted for the case of 40 IoT nodes and H = 95 m, where the red triangles represent the served IoT nodes and the blue curve represents the UAV’s trajectory. We can observe that the UAV completes the data collection mission for all IoT nodes. In such a dense urban environment, buildings are likely to block the LoS links between the aerial UAV and the terrestrial IoT nodes. As the learning process progresses, once the UAV discovers that an LoS link is blocked, it adopts an appropriate cruising direction to re-establish the G2A LoS link as soon as possible. This shows that the TD3-TDCTM algorithm can pilot the UAV to sense and learn the external environment, and thus learn an approximately optimal strategy for this practical problem with imperfect CSI.

In Fig. 6(a), we compare the average mission completion time of the different methods versus the number of IoT nodes. We can observe that the proposed TD3-TDCTM algorithm outperforms the conventional schemes in terms of average mission completion time. For 25 IoT nodes, the TD3-TDCTM algorithm saves 54.2 s compared with the RRT algorithm, 84.9 s compared with the ACO algorithm, and 300.3 s compared with the Scan strategy. For the Scan strategy, although the UAV is guaranteed to serve all IoT nodes, the exceedingly long mission completion time is intolerable. The ACO algorithm solves the shortest-route problem for visiting each IoT node, but it does not exploit the sensing ability of the UAV, so considerable redundancy remains in the flight trajectory. Developed from the ACO algorithm, the RRT algorithm integrates an exploration mechanism to avoid unnecessary flight; however, since this exploration is completely random, the reduction in mission completion time is limited. In contrast, the TD3-TDCTM algorithm can sufficiently and adaptively learn how to adjust its exploration strategy. Therefore, the TD3-TDCTM algorithm takes the minimum time to complete the data collection task while ensuring that each IoT node is served.

Figure 7: Effectiveness of different tricks including information-enhancement, reward-shaping, dimension-spread, and done-or-terminated, where 40 IoT nodes are considered.

In order to show the benefit of our proposed approach in terms of energy saving, we compute the energy consumption of the UAV. Specifically, based on [9], the propulsion power of the UAV can be modeled as

Figure 8: Accumulated reward versus episode with different numbers of IoT nodes.
Figure 9: Average LoS coverage ratio with different numbers of IoT nodes.
P(\upsilon) = P_0\left(1 + \frac{3\upsilon^2}{U_{\rm tip}^2}\right) + P_1\left(\sqrt{1 + \frac{\upsilon^4}{4\upsilon_0^4}} - \frac{\upsilon^2}{2\upsilon_0^2}\right)^{1/2} + \frac{1}{2} d_0 \varrho s A \upsilon^3,    (25)

where υ = υ_n is the horizontal speed of the UAV, P_0 = 79.8563 W and P_1 = 88.6279 W are two constants, U_tip = 120 m/s represents the tip speed of the rotor blade, υ_0 = 4.03 m/s is the mean rotor induced velocity in hover, d_0 = 0.6 and s = 0.05 are the fuselage drag ratio and rotor solidity, respectively, and ϱ = 1.225 kg/m³ and A = 0.503 m² denote the air density and rotor disc area, respectively. Fig. 6(b) shows the average flight energy consumption of the different methods versus the number of IoT nodes. It is observed that, for 25 IoT nodes, the proposed TD3-TDCTM algorithm saves 43.6% of the flight energy compared with the RRT algorithm, 54.9% compared with the ACO algorithm, and 81.2% compared with the Scan strategy. By substantially reducing the flight energy consumption, more time and energy can be reserved for the UAV to perform data collection.
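The following Python sketch evaluates the propulsion power model in (25) with the constants listed above and accumulates the flight energy as power multiplied by an assumed time-slot duration; it is a numerical illustration of (25), not the exact energy-accounting code of our simulator.

import numpy as np

P0, P1 = 79.8563, 88.6279   # blade profile and induced power constants (W)
U_tip, v0 = 120.0, 4.03     # rotor tip speed (m/s), mean induced velocity in hover (m/s)
d0, s = 0.6, 0.05           # fuselage drag ratio, rotor solidity
rho, A = 1.225, 0.503       # air density (kg/m^3), rotor disc area (m^2)

def propulsion_power(v):
    """Propulsion power (W) at horizontal speed v (m/s), following Eq. (25)."""
    blade_profile = P0 * (1 + 3 * v**2 / U_tip**2)
    induced = P1 * (np.sqrt(1 + v**4 / (4 * v0**4)) - v**2 / (2 * v0**2))**0.5
    parasite = 0.5 * d0 * rho * s * A * v**3
    return blade_profile + induced + parasite

def flight_energy(speeds, delta_t=1.0):
    """Total propulsion energy (J) over per-slot speeds, assuming slots of delta_t seconds."""
    return sum(propulsion_power(v) * delta_t for v in speeds)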

In Fig. 7, we demonstrate the effectiveness of the tricks introduced in Section III-D. As shown in Fig. 7(a), the TD3-TDCTM algorithm without reward-shaping does not work at all in the case of 40 IoT nodes, whereas with reward-shaping (the black and pink curves) it completes the data collection task and converges to a larger and stable reward. This shows that the reward-shaping mechanism guarantees the convergence of the proposed algorithm. Besides, without information-enhancement, the convergence of the accumulated reward is obviously slower: about 5000 episodes are required to converge, whereas the full TD3-TDCTM algorithm only requires about 2000 episodes. This shows that introducing the extra pheromone information as part of the state enhances the agent’s decision efficiency. In Fig. 7(b), we illustrate the effectiveness of the other two tricks from the perspective of the Q-loss. Specifically, the Q-loss of the TD3-TDCTM algorithm without dimension-spread and without done-or-terminated fluctuates more strongly; hence, these two tricks further improve the convergence performance.
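As an illustration only, the snippet below shows a generic potential-based reward-shaping mechanism of the kind that can densify sparse rewards and stabilize training; it is not the exact shaping term of TD3-TDCTM defined in Section III-D, and the potential function (the negative number of still-unserved IoT nodes) is a hypothetical choice.

def shaped_reward(raw_reward, prev_unserved, curr_unserved, gamma=0.99):
    """Add a potential-based shaping term F = gamma * phi(s') - phi(s) to the raw reward."""
    phi_prev = -prev_unserved   # potential: fewer unserved nodes -> larger potential
    phi_curr = -curr_unserved
    return raw_reward + gamma * phi_curr - phi_prev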

TABLE III: Impact of different discount factors and buffer sizes on the average completion time.
                      Buffer size R
Discount factor γ | 0.5×10^5 | 0.75×10^5 | 1×10^5 | 1.25×10^5 | 1.5×10^5
0.8               | None     | None      | None   | None      | None
0.9               | 79.19    | 76.72     | 74.52  | 74.53     | 74.33
0.99              | 76.45    | 75.43     | 71.49  | 71.29     | 76.71
TABLE IV: Impact of different neuron numbers and buffer sizes on the average completion time.
                      Buffer size R
Neuron number     | 0.5×10^5 | 0.75×10^5 | 1×10^5 | 1.25×10^5 | 1.5×10^5
200               | 76.45    | 75.43     | 71.49  | 71.29     | 76.71
400               | 73.62    | 74.10     | 70.52  | 73.55     | 74.96
600               | 85.41    | 78.40     | 86.48  | 77.56     | 76.36

Fig. 8 shows the accumulated reward per episode in the training stage for different numbers of IoT nodes. We observe that the accumulated reward exhibits an upward trend as the number of training episodes increases. After about 2000 training episodes, the accumulated reward gradually becomes smooth and stable. As shown in Fig. 8, the proposed TD3-TDCTM algorithm exhibits similar convergence behavior for different numbers of IoT nodes; hence, the proposed scheme achieves good convergence and robustness.

Fig. 9 shows the average LoS coverage ratio of the UAV trajectory designed by the proposed TD3-TDCTM algorithm. We can observe that the probability of completing the data collection tasks via LoS links is always above 85%. This indicates that, with the proposed algorithm, the designed UAV trajectory makes full use of the environment information so that G2A LoS links are established as often as possible; in this way, more IoT nodes can be covered simultaneously along the designed UAV trajectory.

In Tables III and IV, we present the experimental results used to select appropriate hyperparameters for the proposed algorithm. Here, we consider the discount factor γ, the number of neurons in each layer of all six networks (namely, the actor network, target actor network, critic network 1, target critic network 1, critic network 2, and target critic network 2 shown in Fig. 3), and the experience replay buffer size R, in the case of 25 IoT nodes. As shown in Table III, we first present the impact of the discount factor γ on the average mission completion time when the neuron number is set to 200. We observe that the UAV cannot complete the task at all for γ = 0.8. However, as γ increases, the UAV begins to complete the mission and the average completion time approaches that of the near-optimal solution. This is because a larger γ implies a longer-term consideration of future rewards. Since our objective is to minimize the mission completion time, increasing γ helps the UAV focus more attention on the possible benefits of future actions and navigate such that future movements complete the current task more quickly.
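As a rough, back-of-the-envelope illustration (not a result from our experiments), the agent maximizes the discounted return G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}, whose effective planning horizon is on the order of 1/(1-\gamma): roughly 5 steps for γ = 0.8 but roughly 100 steps for γ = 0.99. Since serving all IoT nodes requires far more than 5 movement steps, a small γ essentially hides the completion-time objective from the agent, which is consistent with the failure observed for γ = 0.8 in Table III.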

In Table IV, we examine how the neuron number and the buffer size affect the mission completion time for a fixed discount factor γ = 0.99. We observe that with the buffer size R = 1×10^5, increasing the neuron number from 200 to 400 decreases the average mission completion time from 71.49 to 70.52; that is, our model achieves the best performance with 400 neurons per layer and R = 1×10^5. This is because an appropriate number of neurons increases the capacity of the networks, which helps the agent better learn the complex correlations among state, action, and reward. Meanwhile, the buffer size R = 1×10^5 achieves better performance than the other settings: a smaller buffer provides insufficient transition storage and less randomness in sampling, whereas a larger buffer reduces the chance that informative samples are replayed.
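To make the buffer-size discussion concrete, the following Python sketch shows a standard first-in-first-out experience replay buffer with uniform sampling; it is a textbook construction rather than the exact implementation used for TD3-TDCTM. A small capacity R evicts transitions quickly and reduces sample diversity, while an overly large R dilutes recent, informative transitions among stale ones.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=int(1e5)):      # capacity plays the role of R
        self.buffer = deque(maxlen=capacity)    # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(list(self.buffer), batch_size)
        return tuple(zip(*batch))                # batched (s, a, r, s', done) tuples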

V Conclusion

In this paper, we proposed a DRL-based algorithm, TD3-TDCTM, which designs an efficient flight trajectory for a UAV to collect data from IoT nodes in a practical 3D urban environment with imperfect CSI. In particular, we introduced additional information, i.e., the merged pheromone, to represent the state of the UAV and the environment, which serves as a reference for the reward and facilitates the algorithm design. By taking the service statuses of the IoT nodes, the UAV’s position, and the merged pheromone as input, the TD3-TDCTM algorithm can continuously and adaptively learn how to adjust the UAV’s movement strategy to minimize the completion time under the flight movement and throughput constraints. Numerical results show a significant performance gain of the TD3-TDCTM algorithm over the existing optimization-based methods. In future work, we will consider 3D trajectory design for multi-antenna UAV and multi-UAV scenarios, where beamforming and G2A channel estimation should be taken into account.

References

  • [1] B. Li, Z. Fei, and Y. Zhang, “UAV communications for 5G and beyond: Recent advances and future trends,” IEEE Internet Things J., vol. 6, no. 2, pp. 2241-2263, Apr. 2019.
  • [2] Q. Wu, J. Xu, Y. Zeng, D. W. K. Ng, N. Al-Dhahir, R. Schober, and A. L. Swindlehurst, “5G-and-beyond networks with UAVs: From communications to sensing and intelligence,” arXiv preprint arXiv:2010.09317, 2020.
  • [3] A. E. A. A. Abdulla, Z. M. Fadlullah, H. Nishiyama, N. Kato, F. Ono, and R. Miura, “An optimal data collection technique for improved utility in UAS-aided networks,” in Proc. IEEE Int. Conf. Comput. Commun. (INFOCOM), Toronto, Canada, May 2014, pp. 736-744.
  • [4] Y. Zeng, R. Zhang, and T. J. Lim, “Wireless communications with unmanned aerial vehicles: Opportunities and challenges,” IEEE Commun. Mag., vol. 54, no. 5, pp. 36-42, May 2016.
  • [5] C. Zhan, Y. Zeng, and R. Zhang, “Energy-efficient data collection in UAV enabled wireless sensor network,” IEEE Wireless Commun. Lett., vol. 7, no. 3, pp. 328-331, Jun. 2018.
  • [6] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Mobile unmanned aerial vehicles (UAVs) for energy-efficient internet of things communications,” IEEE Trans. Wireless Commun., vol. 16, no. 11, pp. 7574-7589, Nov. 2017.
  • [7] J. Baek, S. I. Han, and Y. Han, “Energy-efficient UAV routing for wireless sensor networks,” IEEE Trans. Veh. Technol., vol. 69, no. 2, pp. 1741-1750, Feb. 2020.
  • [8] Y. Cai, Z. Wei, R. Li, D. W. K. Ng, and J. Yuan, “Joint trajectory and resource allocation design for energy-efficient secure UAV communication systems,” IEEE Trans. Commun., vol. 68, no. 7, pp. 4536-4553, Jul. 2020.
  • [9] J. Zhang, Y. Zeng, and R. Zhang, “Multi-antenna UAV data harvesting: Joint trajectory and communication optimization,” J. Commun. Inf. Netw., vol. 5, no. 1, pp. 86-99, Mar. 2020.
  • [10] C. You and R. Zhang, “3D trajectory optimization in Rician fading for UAV-enabled data harvesting,” IEEE Trans. Wireless Commun., vol. 18, no. 6, pp. 3192-3207, Jun. 2019.
  • [11] L. Xie, J. Xu, and R. Zhang, “Throughput maximization for UAV-enabled wireless powered communication networks,” IEEE Internet Things J., vol. 6, no. 2, pp. 1690-1703, Apr. 2019.
  • [12] Y. Sun, D. Xu, D. W. K. Ng, L. Dai, and R. Schober, “Optimal 3D-trajectory design and resource allocation for solar-powered UAV communication systems,” IEEE Trans. on Commun., vol. 67, no. 6, pp. 4281-4298, Jun. 2019.
  • [13] J. Gong, T. Chang, C. Shen, and X. Chen, “Flight time minimization of UAV for data collection over wireless sensor networks,” IEEE J. Sel. Areas Commun., vol. 36, no. 9, pp. 1942-1954, Sep. 2018.
  • [14] J. Li, H. Zhao, H. Wang, F. Gu, J. Wei, H. Yin, and B. Ren, “Joint optimization on trajectory, altitude, velocity and link scheduling for minimum mission time in UAV-aided data collection,” IEEE Internet Things J., vol. 7, no. 2, pp. 1464-1475, Feb. 2020.
  • [15] H. Hu, K. Xiong, G. Qu, Q. Ni, P. Fan, and K. B. Letaief, “AoI-minimal trajectory planning and data collection in UAV-assisted wireless powered IoT networks,” IEEE Internet Things J., vol. 8, no. 2, pp. 1211-1223, Jan. 2021.
  • [16] R. S. Sutton and A. G. Barto, “Reinforcement learning: An introduction”. MIT press, 2018.
  • [17] P. Yang, X. Cao, X. Xi, W. Du, Z. Xiao, and D. Wu, “Three-dimensional continuous movement control of drone cells for energy-efficient communication coverage,” IEEE Trans. Veh. Technol., vol. 68, no. 7, pp. 6535-6546, Jul. 2019.
  • [18] B. Zhang, C. H. Liu, J. Tang, Z. Xu, J. Ma, and W. Wang, “Learning-based energy-efficient data collection by unmanned vehicles in smart cities,” IEEE Trans. Ind. Inf., vol. 14, no. 4, pp. 1666-1676, Apr. 2018.
  • [19] C. H. Liu, Z. Chen, and Y. Zhan, “Energy-efficient distributed mobile crowd sensing: A deep learning approach,” IEEE J. Sel. Areas in Commun., vol. 37, no. 6, pp. 1262-1276, Jun. 2019.
  • [20] Y. Zeng and X. Xu, “Path design for cellular-connected UAV with reinforcement learning,” IEEE Global Commun. Conf. (GLOBECOM), Waikoloa, HI, USA, 2019, pp. 1-6.
  • [21] M. Yi, X. Wang, J. Liu, Y. Zhang, and B. Bai, “Deep reinforcement learning for fresh data collection in UAV-assisted IoT networks,” Proc. IEEE INFOCOM, Toronto, ON, Canada, 2020, pp. 716-721.
  • [22] M. Dorigo, V. Maniezzo, and A. Colorni, “Ant system: Optimization by a colony of cooperating agents,” IEEE Trans. Sys., Man, and Cybernetics, Part B (Cybernetics), vol. 26, no. 1, pp. 29-41, Feb. 1996.
  • [23] T. P. Lillicrap et al., “Continuous control with deep reinforcement learning,” Comput. Sci., vol. 8, no. 6, 2015, Art. no. A187.
  • [24] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, pp. 529–533, Feb. 2015.
  • [25] R. Lowe, Y. Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Proc. NIPS, pp. 6379–6390, 2017.
  • [26] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in Proc. ICML, pp. 1582-1591, 2018.
  • [27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [28] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” arXiv preprint arXiv:1801.01290, 2018.
  • [29] A. Al-Hourani, S. Kandeepan, and S. Lardner, “Optimal LAP altitude for maximum coverage,” IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569–572, Dec. 2014.
  • [30] Q. Zhang, H. Sun, Z. Feng, H. Gao, and W. Li, “Data-aided Doppler frequency shift estimation and compensation for UAVs,” IEEE Internet Things J., vol. 7, no. 1, pp. 400-415, Jan. 2020.
  • [31] ITU-R, Rec. P.1410-5, “Propagation data and prediction methods required for the design of terrestrial broadband radio access systems operating in a frequency range from 3 to 60 GHz,” Radiowave propagation, Feb. 2012.
  • [32] S. Boyd and L. Vandenberghe, “Convex optimization,” Cambridge University Press, 2004.
  • [33] H. V. Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proc. AAAI-16, Feb. 2016.
  • [34] R. Ding, F. Gao, and X. S. Shen, “3D UAV trajectory design and frequency band allocation for energy-efficient and fair communication: A deep reinforcement learning approach,” IEEE Trans. Wireless Commun., vol. 19, no. 12, pp. 7796-7809, Dec. 2020.
  • [35] S. M. LaValle, “Rapidly-exploring random trees: A new tool for path planning”, Oct. 1998.