
Multi-Agent Graph Reinforcement Learning based On-Demand Wireless Energy Transfer in Multi-UAV-aided IoT Network

Ze Yu Zhao, Yueling Che, Sheng Luo, Kaishun Wu, and Victor C. M. Leung
College of Computer Science and Software Engineering, Shenzhen University, China
Email: 21002711031@email.szu.edu.cn, {yuelingche, sluo, wu, vleung}@szu.edu.cn
Abstract

This paper proposes a new on-demand wireless energy transfer (WET) scheme for multiple unmanned aerial vehicles (UAVs). Unlike the existing studies that simply pursue maximization of the total or the minimum harvested energy at the Internet of Things (IoT) devices, where the IoT devices’ own energy requirements are barely considered, we propose a new metric called the hungry-level of energy (HoE), which reflects the time-varying energy demand of each IoT device based on the energy gap between its required energy and the harvested energy from the UAVs. To minimize the overall HoE of the IoT devices whose energy requirements are not satisfied, we optimally determine all the UAVs’ trajectories and WET decisions over time, under the practical mobility and energy constraints of the UAVs. Although the proposed problem is of high complexity to solve, by excavating the UAVs’ self-attentions for their collaborative WET, we propose the multi-agent graph reinforcement learning (MAGRL) based approach. Through the offline training of the MAGRL model, where the global training at the central controller guides the local training at each UAV agent, each UAV then distributively determines its trajectory and WET based on the well-trained local neural networks. Simulation results show that the proposed MAGRL-based approach outperforms various benchmarks in meeting the IoT devices’ energy requirements.

Index Terms:
Multiple unmanned aerial vehicles (UAVs) aided network, wireless energy transfer (WET), hungry-level of energy (HoE), multi-agent deep reinforcement learning, self-attentions.

I Introduction

The technology of radio frequency (RF) based wireless energy transfer (WET) has been recognized as a promising approach to support energy-sustainable Internet of Things (IoT) networks [1]. Conventionally, ground infrastructure (e.g., dedicated energy transmitters or base stations) is utilized to charge low-power IoT devices. However, due to the generally low end-to-end wireless energy transmission efficiency, to assure non-zero harvested energy at the IoT devices, the effective transmission distance from the ground infrastructure to each IoT device is restricted (to, e.g., 10 meters [2]).

By exploiting the UAVs’ flexible mobility to effectively shorten the transmission distances between the UAVs and the IoT devices, UAV-aided WET has attracted a great deal of attention. For example, the UAV’s dynamic wireless energy and information transmission scheme in the presence of primary users was investigated in [3]. The UAV’s hovering position to maximize the minimum harvested energy at the IoT devices was studied in [4]. The UAV’s trajectory and WET were jointly optimized for maximizing the total harvested energy at the IoT devices in [5].

The above studies only considered WET by a single UAV, which usually struggles to serve a large-scale network due to its limited on-board battery energy. By tackling the more complicated joint design of multiple UAVs’ trajectories and wireless energy transmissions, the multi-UAV-aided WET for maximizing the total harvested energy at the IoT devices has been studied in [6] and [7], where the Lagrange multiplier method and the fly-and-hover based trajectory design were utilized, respectively. The joint design of multi-UAV-aided wireless data and energy transmissions has also been investigated in [8] and [9], where deep reinforcement learning (DRL) is applied to adapt the multiple UAVs’ transmissions to the dynamic environment.

However, we notice that the existing UAV-aided WET schemes may sacrifice the energy demands of some IoT devices for achieving higher global benefits. For example, to maximize the total harvested energy at all the IoT devices, the UAVs with limited on-board energy may choose to transmit energy to the closely-located IoT devices more often, but barely fly to serve the distantly-located IoT devices, so as to save energy; and to maximize the minimum harvested energy at all the IoT devices, the IoT devices with high energy demands may not be able to harvest sufficient energy, for the sake of equal energy harvesting at each IoT device. As a result, both designs may deviate from the IoT devices’ own energy demands.

To cater to the IoT devices’ energy requirements, we propose a new metric called the hungry-level of energy (HoE), which reflects the time-varying energy demand of each IoT device based on the energy gap between its required energy and the harvested energy from the UAVs. Moreover, to explore the UAVs’ potential collaborations, such that they can automatically determine their joint or separate WET depending on the IoT devices’ HoE, we employ the UAVs’ self-attentions based on a graph-based representation. Finally, we propose the novel multi-agent graph reinforcement learning (MAGRL) based approach to minimize the overall HoE of the IoT devices whose energy requirements are not satisfied.

Figure 1: Multi-UAV assisted WET in an IoT system.

The main contributions are summarized as follows:

  • HoE-based Multi-UAV WET Modeling and Novel Problem Formulation: Section II newly proposes the metric of HoE to guide the multi-UAV-aided WET for satisfying the different energy demands of the IoT devices. Based on each IoT device’s non-linear energy harvesting model and each UAV’s velocity-determined energy consumption model, the battery energy management at both the UAVs and the IoT devices is properly modeled. By optimally determining all the UAVs’ trajectories and WET decisions over time, the novel HoE minimization problem is formulated under the UAVs’ practical mobility and energy constraints.

  • MAGRL-based Approach for Distributed and Collaborative Multi-UAV WET: Sections III and IV propose the MAGRL-based approach to solve the complicated HoE minimization problem, where the UAVs’ self-attentions are excavated for their collaborative WET. Through the offline training of the proposed MAGRL model, where the central controller leads the global training to guide the local training at each UAV agent, each UAV then distributively determines its trajectory and WET decision based on the well-trained policy neural networks.

  • Extensive Simulation Results for Performance Evaluation: Section V conducts extensive simulations to verify the validity of the proposed HoE metric for guiding the UAVs’ on-demand WET, by comparing with various benchmarks. The UAVs’ collaborative WET under the proposed MAGRL-based approach is also illustrated.

II System Model and the Problem Formulation

As shown in Fig. 1, we consider that $U$ UAVs with $U\geq 2$ act as the airborne wireless energy transmitters to charge in total $I$ IoT devices with low power consumption on the ground. Each UAV flies at a fixed altitude of $h_{fix}$ meters (m). Denote the sets of the UAVs and the IoT devices as $\mathbb{U}=\{1,2,...,U\}$ and $\mathbb{I}=\{1,2,...,I\}$, respectively. The UAVs' task period of WET is divided into $T$ time slots with $T\geq 1$, where the slot length is $\vartheta$ seconds (s). The set of the time slots is denoted as $\mathbb{T}=\{1,2,...,T\}$. The coordinate of UAV-$u$ in slot $t$ is represented as $q_{u}[t]=\{x_{u}[t],y_{u}[t],h_{fix}\}$, $\forall u\in\mathbb{U}$, $\forall t\in\mathbb{T}$. Since the slot length $\vartheta$ is usually very small, $q_{u}[t]$ is assumed to be unchanged within each slot, but may change over different slots. The coordinate of device-$i$ with $i\in\mathbb{I}$ on the ground is denoted as $q_{i}=\{x_{i},y_{i},0\}$. Let $d_{i}^{u}[t]=\left\|q_{u}[t]-q_{i}\right\|$ denote the distance between UAV-$u$ and device-$i$ in slot $t$. Let $C_{u}[t]\in\{0,1\}$ denote UAV-$u$'s WET decision in slot $t$, where UAV-$u$ broadcasts energy in slot $t$ if $C_{u}[t]=1$, or keeps silent, otherwise.

According to [10], denote $P_{LoS,i}^{u}[t]=\frac{1}{1+a\exp(-b\beta_{i}^{u}[t]+ab)}$ as the line-of-sight (LoS) probability of the air-to-ground (AtG) channel from UAV-$u$ to device-$i$ in slot $t$, where $\beta_{i}^{u}[t]=\sin^{-1}\left(\frac{h_{fix}}{d_{i}^{u}[t]}\right)$ is the elevation angle from UAV-$u$ to device-$i$ in slot $t$, and the constants $a$ and $b$ are environment-related parameters. We then obtain the non-line-of-sight (NLoS) probability as $P_{NLoS,i}^{u}[t]=1-P_{LoS,i}^{u}[t]$. As a result, the average AtG channel power gain from UAV-$u$ to device-$i$ is

G_{i}^{u}[t]=P_{LoS,i}^{u}[t]G_{0}d_{i}^{u}[t]^{-\alpha_{L}}+P_{NLoS,i}^{u}[t]G_{0}d_{i}^{u}[t]^{-\alpha_{N}}, \quad (1)

where $G_{0}$ is the average channel gain at a reference distance of 1 m, and $\alpha_{L}$ and $\alpha_{N}$ are the channel path-loss exponents for the LoS and NLoS links, respectively.
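
For concreteness, the following minimal sketch evaluates the LoS probability and the average AtG channel gain in (1). The function name and the default value of $G_{0}$ are illustrative assumptions, the $a,b$ defaults follow Table I, and the comment on the angle unit only reflects the common convention of [10].

```python
import numpy as np

def avg_channel_gain(q_uav, q_dev, a=12.08, b=0.11, G0=1e-3, alpha_L=3.0, alpha_N=5.0):
    """Average AtG channel power gain G_i^u[t] in (1); G0 is an illustrative assumption."""
    q_uav, q_dev = np.asarray(q_uav, float), np.asarray(q_dev, float)
    d = np.linalg.norm(q_uav - q_dev)                      # distance d_i^u[t]
    beta = np.arcsin(q_uav[2] / d)                         # elevation angle beta_i^u[t] (radians)
    # Note: the (a, b) fit of [10] is usually stated for the angle in degrees;
    # use np.degrees(beta) instead if the fitted parameters assume degrees.
    p_los = 1.0 / (1.0 + a * np.exp(-b * beta + a * b))    # LoS probability
    p_nlos = 1.0 - p_los
    return p_los * G0 * d ** (-alpha_L) + p_nlos * G0 * d ** (-alpha_N)

# Example: UAV at altitude h_fix = 5 m, 20 m horizontal offset from a ground device.
print(avg_channel_gain([20.0, 0.0, 5.0], [0.0, 0.0, 0.0]))
```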

II-A UAV Energy Consumption Model

Each UAV's energy consumption is mainly caused by propulsion and WET. According to [11], in each slot $t$, UAV-$u$'s propulsion power consumption is determined by its velocity $V_{u}[t]=\frac{1}{\vartheta}\left\|q_{u}[t+1]-q_{u}[t]\right\|$ as follows:

P_{pro}(V_{u}[t])=P_{a}\left(1+\frac{3V_{u}[t]^{2}}{V_{tip}^{2}}\right)+\frac{1}{2}f_{0}\rho e_{1}A V_{u}[t]^{3}+P_{b}\left(\sqrt{1+\frac{V_{u}[t]^{4}}{4e_{0}^{4}}}-\frac{V_{u}[t]^{2}}{2e_{0}^{2}}\right)^{\frac{1}{2}}, \quad (2)

where the constants $P_{a}$, $P_{b}$, $V_{tip}$, $e_{0}$, $e_{1}$, $f_{0}$ and $\rho$ are the UAV's mechanical-related parameters. Hence, the propulsion energy consumption of UAV-$u$ in slot $t$ is obtained as $P_{pro}(V_{u}[t])\vartheta$. Denote $P_{u}$ as each UAV-$u$'s transmit power for WET. The energy consumption of UAV-$u$'s WET in slot $t$ is thus $C_{u}[t]P_{u}\vartheta$. Denote $B_{u}[t]$ as the battery level of UAV-$u$ at the beginning of slot $t$, which is updated as

B_{u}[t]=\max\left(B_{u}[t-1]-P_{pro}(V_{u}[t-1])\vartheta-C_{u}[t-1]P_{u}\vartheta,\ 0\right). \quad (3)

We assume that each UAV-$u$ is initially fully charged with $B_{u}[0]=B_{u}^{max}$, where $B_{u}^{max}$ is UAV-$u$'s battery capacity.
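
A minimal sketch of the propulsion power in (2) and the UAV battery update in (3) is given below; the numerical propulsion parameters are assumed, illustrative values in the style of [11], not necessarily those used in our simulations.

```python
import numpy as np

# Illustrative rotary-wing propulsion parameters in the style of [11] (assumed values).
P_a, P_b = 79.86, 88.63        # blade profile power and induced power (W)
V_tip, e_0 = 120.0, 4.03       # rotor tip speed (m/s) and mean rotor induced velocity (m/s)
f_0, rho, e_1, A = 0.6, 1.225, 0.05, 0.503  # fuselage drag ratio, air density, rotor solidity, rotor disc area

def propulsion_power(v):
    """Velocity-dependent propulsion power P_pro(V_u[t]) in (2)."""
    blade = P_a * (1.0 + 3.0 * v**2 / V_tip**2)
    parasite = 0.5 * f_0 * rho * e_1 * A * v**3
    induced = P_b * np.sqrt(np.sqrt(1.0 + v**4 / (4.0 * e_0**4)) - v**2 / (2.0 * e_0**2))
    return blade + parasite + induced

def uav_battery_update(B_prev, v_prev, c_prev, P_u=1.0, theta=1.0):
    """Battery update of UAV-u in (3): propulsion plus WET consumption, floored at zero."""
    return max(B_prev - propulsion_power(v_prev) * theta - c_prev * P_u * theta, 0.0)

print(propulsion_power(10.0), uav_battery_update(140000.0, 10.0, 1))
```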

II-B Energy Harvesting at IoT devices

Each IoT device is installed with a rectenna and an energy harvester. According to [12], the energy harvester transforms the received RF power $p\geq 0$ at the rectenna into the direct-current (DC) power $\mathcal{F}(p)$ as follows:

\mathcal{F}(p)=\begin{cases}0,&p\in[0,P_{sen}),\\ f(p),&p\in[P_{sen},P_{sat}),\\ f(P_{sat}),&p\in[P_{sat},+\infty),\end{cases} \quad (4)

where $P_{sen}$ and $P_{sat}$ with $0<P_{sen}<P_{sat}$ are the sensitivity power and the saturation power at the energy harvester, respectively, and $f(\cdot)$ is a non-linear power transform function that can be easily obtained through the curve-fitting technique [12]. From (4), no DC power is harvested if $p$ is below $P_{sen}$, and the harvested DC power keeps unchanged if $p\geq P_{sat}$. The value of $P_{sen}$ is usually high (e.g., -10 dBm [13]) in practice. Hence, the transmission distance from the UAV to the IoT device needs to be sufficiently short to assure effective WET with non-zero harvested energy. For each device-$i$, since its harvested energy from multiple UAVs in each slot can be accumulated, based on (1) and (4), the harvested energy at device-$i$ in slot $t$ is obtained as

E_{i}^{har}[t]=\mathcal{F}\left(\sum_{u=1}^{U}P_{u}C_{u}[t]G_{i}^{u}[t]\right)\vartheta. \quad (5)

From (5), device-$i$ can harvest more energy if more UAVs are located nearby and transmit energy to it jointly. Denote $B_{i}[t]$ as the battery level of device-$i$ at the beginning of slot $t$, which is updated as

B_{i}[t]=\min\left(B_{i}[t-1]+E_{i}^{har}[t-1],\ B_{i}^{max}\right), \quad (6)

where $B_{i}^{max}$ is the battery capacity of device-$i$ and the initial battery energy $B_{i}[0]\geq 0$ is given. Denote $B^{thr}$ as the required battery energy level at each IoT device, where $B_{i}^{max}\geq B^{thr}>B_{i}[0]>0$ holds in general. We say an IoT device is energy-satisfied if its accumulated battery energy reaches the required $B^{thr}$ before or at the last slot $T$. Hence, $\mathbb{I}_{l}[t]=\{i\,|\,B_{i}[t]+E_{i}^{har}[t]<B^{thr}\}$ is the set of energy-unsatisfied IoT devices at the end of slot $t$. After the UAVs' WET for $T$ slots, the set of energy-unsatisfied IoT devices is obtained as $\mathbb{I}_{l}[T]$.
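
The piecewise harvesting model in (4)-(6) can be sketched as follows; the non-linear transform $f(\cdot)$ is replaced here by a simple constant-efficiency surrogate, since the curve-fitted function of [12] is not reproduced in this paper.

```python
import numpy as np

P_SEN = 1e-3 * 10 ** (-10 / 10)   # sensitivity power: -10 dBm in watts
P_SAT = 1e-3 * 10 ** (7 / 10)     # saturation power: 7 dBm in watts

def f_nonlinear(p):
    """Surrogate for the curve-fitted non-linear transform f(p) of [12] (assumption)."""
    return 0.3 * p

def harvested_dc_power(p):
    """Piecewise non-linear harvesting model F(p) in (4)."""
    if p < P_SEN:
        return 0.0
    if p < P_SAT:
        return f_nonlinear(p)
    return f_nonlinear(P_SAT)

def device_energy(p_u, c, g, theta=1.0):
    """Harvested energy E_i^har[t] in (5): received RF powers from all UAVs are summed
    before the non-linear transform, then scaled by the slot length."""
    p_rx = np.sum(np.asarray(p_u) * np.asarray(c) * np.asarray(g))
    return harvested_dc_power(p_rx) * theta

def device_battery_update(B_prev, E_har_prev, B_max):
    """Battery update of device-i in (6), capped at the battery capacity."""
    return min(B_prev + E_har_prev, B_max)
```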

II-C Hungry-Level of Energy at IoT devices

The HoE is defined to measure the time-varying energy demand of each IoT device. Denote $H_{i}[t]$ as the HoE of device-$i$ at the beginning of slot $t$. We define the time variation of $H_{i}[t]$ as

H_{i}[t]=\begin{cases}\max(H_{i}[t-1]-1,\,1),&\text{if }E_{i}^{har}[t-1]\geq E^{exp}\text{ and }B_{i}[t]<B^{thr},\\ H_{i}[t-1]+1,&\text{if }E_{i}^{har}[t-1]<E^{exp}\text{ and }B_{i}[t]<B^{thr},\\ 0,&\text{if }B_{i}[t]\geq B^{thr},\end{cases} \quad (7)

where $E^{exp}\triangleq\frac{B^{thr}}{T}$ denotes the average amount of energy that device-$i$ expects to harvest in each slot for reaching $B^{thr}$ after $T$ slots. If device-$i$'s harvested energy $E_{i}^{har}[t-1]$ in slot $(t-1)$ reaches the expected $E^{exp}$, but the resultant battery energy $B_{i}[t]$ at the beginning of slot $t$ is still lower than the required $B^{thr}$, $H_{i}[t]$ is reduced by 1 at the beginning of slot $t$, where the minimum allowable HoE when $B_{i}[t]<B^{thr}$ is set to 1. If $E_{i}^{har}[t-1]$ is lower than the expected $E^{exp}$ and the resultant $B_{i}[t]$ is also lower than the required $B^{thr}$, $H_{i}[t]$ is increased by 1 at the beginning of slot $t$. Moreover, if device-$i$'s required energy is satisfied with $B_{i}[t]\geq B^{thr}$ at the beginning of slot $t$, $H_{i}[t]$ becomes 0. The overall HoE of all the energy-unsatisfied IoT devices over $T$ slots is then obtained as

H_{total}=\sum_{i\in\mathbb{I}_{l}[T]}\sum_{t=1}^{T}H_{i}[t]. \quad (8)

It is easy to find that $H_{total}$ is 0 if $\mathbb{I}_{l}[T]$ is empty.
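
The HoE update in (7) reduces to a few comparisons per slot, as sketched below; the initial HoE value and the toy harvesting trace are illustrative assumptions.

```python
def hoe_update(H_prev, E_har_prev, B_t, E_exp, B_thr):
    """One-slot HoE update of device-i following (7)."""
    if B_t >= B_thr:                 # energy requirement met: HoE resets to 0
        return 0
    if E_har_prev >= E_exp:          # harvested at least the per-slot expectation
        return max(H_prev - 1, 1)    # HoE decreases, but never below 1 while unsatisfied
    return H_prev + 1                # harvested less than expected: HoE grows

# Toy run over T slots for one device (illustrative numbers; H_i[0] assumed to be 1).
T, B_thr = 10, 10.0
E_exp = B_thr / T
B, H, history = 2.0, 1, []
for E_har in [0.0, 0.0, 1.5, 0.0, 2.0, 2.0, 0.0, 1.0, 2.0, 0.5]:
    B = min(B + E_har, 20.0)
    H = hoe_update(H, E_har, B, E_exp, B_thr)
    history.append(H)
print(history)
```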

II-D Problem Formulation

By optimally determining all the UAVs' WET decisions $\boldsymbol{C}=\{C_{u}[t]\}$ and trajectories $\boldsymbol{Q}=\{q_{u}[t]\}$ over $T$ slots, we minimize the overall HoE $H_{total}$ in (8) under the UAVs' practical mobility and energy constraints as follows:

(P1): \min_{\boldsymbol{Q},\boldsymbol{C}}\ \sum_{i\in\mathbb{I}_{l}[T]}\sum_{t=1}^{T}H_{i}[t],
\mathrm{s.t.}\quad (3),\ (6),\ (7),
\left\|q_{u}[t]-q_{u}[t-1]\right\|\leq V_{u}^{max}\vartheta,\ \forall u\in\mathbb{U},\forall t\in\mathbb{T}, \quad (9)
C_{u}[t]\in\{0,1\},\ \forall u\in\mathbb{U},\forall t\in\mathbb{T}, \quad (10)
B_{u}[T]\geq B_{u}^{min},\ \forall u\in\mathbb{U}, \quad (11)
d_{u}^{u'}[t]\geq d_{min},\ \forall u,u'\in\mathbb{U},u\neq u',\forall t\in\mathbb{T}, \quad (12)
x_{u}[t]\in[0,W_{max}],\ y_{u}[t]\in[0,L_{max}],\ \forall u\in\mathbb{U},\forall t\in\mathbb{T}. \quad (13)

The constraint in (9) ensures that the velocity of UAV-$u$ does not exceed its maximum allowable velocity $V_{u}^{max}$. The constraint in (10) gives each UAV's binary WET decision. The constraint in (11) ensures that each UAV's remaining energy at the end of slot $T$ is no less than the minimum required energy $B_{u}^{min}$ for a safe return after the WET task. The constraint in (12) guarantees a safe distance between any two UAVs in each slot to avoid collisions. The constraint in (13) confines each UAV's horizontal moving space within an area of length $L_{max}$ and width $W_{max}$.

Problem (P1) is a mixed-integer programming problem. It is also noticed that, to minimize the overall HoE, the multiple UAVs need to be efficiently organized, by either jointly transmitting energy to the same set of closely-located IoT devices or separately serving different sets of distantly-located IoT devices. Hence, all the UAVs' trajectories and WET decisions are naturally coupled with each other over time. Moreover, as constrained by (11), each UAV must use its limited battery energy wisely for reducing the IoT devices' HoE. Therefore, problem (P1) is generally difficult to solve efficiently using traditional optimization methods.

III MDP Modeling and Global Graph Design

Considering the above complicated and coupled relations among the multiple UAVs in problem (P1), a multi-agent DRL approach is leveraged in this paper. As shown in Fig. 1, each UAV acts as an agent and reports its environment states to the central controller (e.g., a base station or a satellite). By using the global environment information of all the UAVs, the central controller's training output also guides each UAV's local training. Although the training is centralized, after the training process, each UAV distributively determines its own WET decision and trajectory based on its local policy. Moreover, to explore the potential collaboration of all the UAVs for efficient WET, we take the global UAV information as a graph and introduce the similarity matrix [14] and the self-attention block [15] to operate on the graph-based global information at the central controller [16]. By doing so, a new MAGRL-based approach is proposed to solve problem (P1). In this section, we model the Markov decision process (MDP) at each UAV, and then introduce the UAVs' graph-based representation at the central controller. The MAGRL-based solution will be specified in Section IV.

III-A MDP Modeling

According to problem (P1), by letting each UAV act as an agent, we model the MDP for each of the $U$ agents [17]. For each agent, define the MDP as a set of states $\mathbb{S}$, a set of actions $\mathbb{A}$, and a set of rewards $\mathbb{R}$. The state set $\mathbb{S}$ embraces all the possible environment configurations at each UAV, including the UAV's own location, the HoE of all the IoT devices, the battery levels of all the IoT devices (it is assumed that all the IoT devices share their HoE and battery levels with the UAVs via a common channel), and the UAV's own battery level. The action set $\mathbb{A}$ provides the action space of each UAV's decision on its trajectory and WET. For any given state $s_{u}[t]\in\mathbb{S}$ of UAV-$u$ at the beginning of slot $t$, UAV-$u$ applies the policy $\pi_{u}:s_{u}[t]\to a_{u}[t]$ to select the action $a_{u}[t]\in\mathbb{A}$, and then obtains the corresponding reward $r_{u}[t]\in\mathbb{R}$ at the end of slot $t$.

Specifically, the state in slot $t$ is defined as $s_{u}[t]=\{x_{u}[t],y_{u}[t],H_{1}[t],...,H_{I}[t],B_{1}[t],...,B_{I}[t],B_{u}[t]\}$, which contains in total $M=2I+3$ elements. Denoting $\varphi_{u}[t]$ as UAV-$u$'s horizontal rotation angle in slot $t$, UAV-$u$'s horizontal location $(x_{u}[t],y_{u}[t])$ in problem (P1) is determined once $\varphi_{u}[t]$ and $V_{u}[t]$ are obtained. Thus UAV-$u$'s MDP action is defined as $a_{u}[t]=\{V_{u}[t],\varphi_{u}[t],C_{u}[t]\}$, where $C_{u}[t]=0$ if the policy network output is negative, or $C_{u}[t]=1$ otherwise. The reward function is proposed as

r_{u}[t]=\xi_{0}r_{u,0}[t]-\xi_{1}r_{u,1}[t], \quad (14)

where $r_{u,0}[t]$ and $r_{u,1}[t]$ are the reward and penalty that UAV-$u$ receives in slot $t$, respectively, and $\xi_{0},\xi_{1}\in(0,1)$ are the corresponding weights. Specifically, letting $w_{i}^{u}[t]\triangleq\frac{\mathcal{F}(P_{u}C_{u}[t]G_{i}^{u}[t])\vartheta}{E_{i}^{har}[t]}$ denote the ratio of the DC energy that device-$i$ harvests from UAV-$u$ to that from all the UAVs, we use $N_{u}[t]=\frac{w_{u}[t]}{\sum_{u'\in\mathbb{U}}w_{u'}[t]}$ with $w_{u}[t]=\sum_{i\in\mathbb{I}}w_{i}^{u}[t]$ to represent UAV-$u$'s effective WET weight among all the UAVs in slot $t$. The IoT devices harvest more energy from a UAV with a higher $N_{u}[t]$, and vice versa. We then propose to use the following $r_{u,0}[t]$:

r_{u,0}[t]=\frac{N_{u}[t]\sum_{i\in\mathbb{I}_{l}[t]}(B_{i}[t+1]-B_{i}[t])\cdot H_{i}[t]}{1+|\mathbb{I}_{l}[t]|\sum_{i\in\mathbb{I}_{l}[t]}H_{i}[t]}+\xi_{2}(B_{u}[t+1]-B_{u}^{min}). \quad (15)

From (15), while the UAVs prefer to perform WET more frequently to reduce the IoT devices' HoE, they also need to use their battery energy carefully to satisfy the constraint in (11). Hence, the reward in (15) contains two terms. The first term is UAV-$u$'s reward for charging the IoT devices, where the numerator is the product of UAV-$u$'s weight $N_{u}[t]$ and the IoT devices' harvested energy weighted by their HoE, and the denominator is the product of the HoE summation over all the energy-unsatisfied IoT devices and the set size $|\mathbb{I}_{l}[t]|$, plus 1 to prevent the denominator from being 0. It is easy to find that the more energy the high-HoE IoT devices can harvest, the higher value the first term achieves. The second term is UAV-$u$'s battery energy gap between $B_{u}[t+1]$ and $B_{u}^{min}$ in constraint (11) after taking action $a_{u}[t]$ in slot $t$, where a balance parameter $\xi_{2}$ is multiplied. The penalty in (14) is designed as

r_{u,1}[t]=\sum_{u=1}^{U}\sum_{j=0}^{1}PEN_{u}^{j}, \quad (16)

where $PEN_{u}^{0}=1$ (or $PEN_{u}^{1}=1$) if the constraint in (12) (or (13)) is not satisfied, and $PEN_{u}^{0}=0$ (or $PEN_{u}^{1}=0$) otherwise.
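
The per-slot reward in (14)-(16) can be assembled as in the following sketch, assuming the per-slot quantities ($N_{u}[t]$, device battery levels, HoE values and penalty indicators) have been computed elsewhere; the weights default to the values in Table I.

```python
import numpy as np

def reward_u(N_u, B_next, B_now, H, unsat, B_u_next, B_u_min,
             pen_collision, pen_boundary, xi0=0.25, xi1=1.0, xi2=1e-5):
    """Per-slot reward r_u[t] in (14)-(16) for UAV-u (a sketch).

    N_u:            UAV-u's effective WET weight N_u[t]
    B_next, B_now:  device battery levels B_i[t+1] and B_i[t]
    H:              device HoE values H_i[t]
    unsat:          boolean mask of energy-unsatisfied devices (the set I_l[t])
    pen_collision, pen_boundary: penalty indicators summed over all UAVs as in (16)
    """
    B_next, B_now, H = map(np.asarray, (B_next, B_now, H))
    unsat = np.asarray(unsat, bool)
    gain = np.sum((B_next[unsat] - B_now[unsat]) * H[unsat])
    denom = 1.0 + np.count_nonzero(unsat) * np.sum(H[unsat])
    r0 = N_u * gain / denom + xi2 * (B_u_next - B_u_min)   # reward term (15)
    r1 = pen_collision + pen_boundary                       # penalty term (16)
    return xi0 * r0 - xi1 * r1                              # weighted combination (14)
```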

III-B Graph Representation of UAVs

By receiving each UAV's state information, the central controller obtains the global information. Let $\boldsymbol{o}[t]=\{s_{1}[t],...,s_{U}[t]\}$, $\boldsymbol{a}[t]=\{a_{1}[t],...,a_{U}[t]\}$ and $\boldsymbol{r}[t]=\{r_{1}[t],...,r_{U}[t]\}$ denote the global observations, actions, and rewards, respectively. To explore the potential connections among the UAVs to improve the overall WET performance as well as to avoid collisions, the central controller uses a graph to represent all the UAVs, by treating each UAV as a node in the graph. According to [14], we use the following similarity matrix among the UAVs to represent the strength of their connections:

\boldsymbol{Z}[t]=\begin{pmatrix}z_{11}[t]&\cdots&z_{1U}[t]\\ \vdots&\ddots&\vdots\\ z_{U1}[t]&\cdots&z_{UU}[t]\end{pmatrix}_{U\times U}, \quad (17)

where the element $z_{uu'}[t]=\exp\left(-\frac{\left\|q_{u}[t]-q_{u'}[t]\right\|^{2}}{2\varrho^{2}}\right)$, $\forall u,u'\in\mathbb{U}$, $u\neq u'$, is the Gaussian distance between UAV-$u$ and UAV-$u'$, $\varrho^{2}$ is a constant, and the diagonal element $z_{uu}[t]=\sum_{u'\in\mathbb{U},u'\neq u}z_{uu'}[t]$ is used as the degree of UAV-$u$. Specifically, following [15], to obtain the global feature matrix $\tilde{\boldsymbol{o}}$, the central controller first generates the attention matrix $\boldsymbol{W}_{att}$ and the value matrix $\boldsymbol{W}_{v}'$ as follows:

\boldsymbol{W}_{att}=\mathrm{softmax}\left(\frac{1}{\sqrt{M}}\left(\boldsymbol{o}\times\boldsymbol{W}_{q}\right)\times\left(\boldsymbol{o}\times\boldsymbol{W}_{k}\right)^{T}\cdot\boldsymbol{Z}\right),\qquad \boldsymbol{W}_{v}'=\boldsymbol{o}\times\boldsymbol{W}_{v}, \quad (18)

where the symbol $\times$ denotes matrix multiplication and $(\cdot)^{T}$ is the matrix transposition. For any matrix $\boldsymbol{Z}\in\mathbf{R}^{U\times U}$, the $\mathrm{softmax}(\cdot)$ function transforms the element $z_{uu'}$ into $\frac{\exp(z_{uu'})}{\sum_{u''\in\mathbb{U}}\exp(z_{uu''})}$, $\forall u,u'\in\mathbb{U}$. Then, the global feature matrix $\tilde{\boldsymbol{o}}$ is obtained as $\tilde{\boldsymbol{o}}=\boldsymbol{W}_{att}\times\boldsymbol{W}_{v}'+\boldsymbol{W}_{v}'$, which is used as the new observation matrix for the central controller.
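
The graph-based attention in (17) and (18) can be sketched as below; the projection matrices are random placeholders for the learnable parameters, and the "$\cdot\boldsymbol{Z}$" operation in (18) is interpreted here as an element-wise masking of the attention scores, which is one possible reading of the notation.

```python
import numpy as np

def similarity_matrix(q, var=100.0):
    """Similarity matrix Z[t] in (17): Gaussian distances off-diagonal, node degrees on the diagonal."""
    U = len(q)
    Z = np.zeros((U, U))
    for u in range(U):
        for v in range(U):
            if u != v:
                Z[u, v] = np.exp(-np.linalg.norm(q[u] - q[v]) ** 2 / (2.0 * var))
    np.fill_diagonal(Z, Z.sum(axis=1))          # degree of each UAV
    return Z

def softmax(X):
    e = np.exp(X - X.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def global_feature(o, Z, Wq, Wk, Wv):
    """Graph-weighted self-attention in (18): o is the U x M global observation matrix."""
    M = o.shape[1]
    W_att = softmax((o @ Wq) @ (o @ Wk).T / np.sqrt(M) * Z)   # attention scores masked by Z
    Wv_prime = o @ Wv
    return W_att @ Wv_prime + Wv_prime                        # residual connection

# Example with U = 4 UAVs and M = 2I+3 = 15 state elements (I = 6 devices, assumed sizes).
rng = np.random.default_rng(0)
q = rng.uniform(0, 400, size=(4, 3))
o = rng.standard_normal((4, 15))
Wq, Wk, Wv = (rng.standard_normal((15, 15)) for _ in range(3))
print(global_feature(o, similarity_matrix(q), Wq, Wk, Wv).shape)
```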

Figure 2: MAGRL framework.

IV MAGRL-based Solution

IV-A MAGRL Training Flow

As shown in Fig. 2, the MAGRL framework includes two parts: the local training at each UAV and the global training at the central controller. In the training stage, a tuple $(\boldsymbol{o}[t],\boldsymbol{a}[t],\boldsymbol{r}[t],\boldsymbol{o}[t+1],\boldsymbol{Z}[t],\boldsymbol{Z}[t+1])$ is stored in the experience replay buffer $\mathcal{D}$, and all the neural networks for both the local and global training apply the stochastic gradient descent (SGD) algorithm to update their parameters.

Local Training: For each UAV's local training, we apply the soft actor-critic (SAC) algorithm proposed in [18] to enhance each agent's exploration of the environment. As shown in the left side of Fig. 2, for any UAV-$u$, five neural networks are employed for its local training: the policy network $Actor_{L}$, the local Q-networks $Q_{L0}~Critic$ and $Q_{L1}~Critic$, and the local V-networks $V_{L0}~Critic$ and $V_{L1}~Critic$, with the corresponding network parameters denoted by $\theta^{\pi_{u}}$, $\eta_{0,u}$, $\eta_{1,u}$, $\phi_{0,u}$ and $\phi_{1,u}$, respectively. The policy network is trained for the policy function $\pi_{u}(\cdot)$ that maps UAV-$u$'s state $s_{u}[t]$ to its action $a_{u}[t]$, the two local Q-networks are trained for the local state-action value functions $Q_{L0}(\cdot)$ and $Q_{L1}(\cdot)$, and the two local V-networks are trained for the local state value functions $V_{L0}(\cdot)$ and $V_{L1}(\cdot)$. The information entropy $\mathcal{H}(\cdot)$ is used to enhance the agent's exploration of the environment [18]. The goal of the local training is to obtain the optimal policy $\pi_{u}^{*}=\arg\max_{\pi_{u}}\sum_{t\in\mathbb{T}}\mathbb{E}\left[r_{u}[t]+\alpha_{u}\mathcal{H}\left(\pi_{u}(\cdot|s_{u}[t])\right)\right]$, where the temperature coefficient $\alpha_{u}$ is the weight of the information entropy, and $\mathbb{E}[\cdot]$ is the expectation over all the possible actions. The performance of UAV-$u$ when taking action $a_{u}[t]$ in state $s_{u}[t]$ is evaluated by the local Q-networks $Q_{L0}~Critic$ and $Q_{L1}~Critic$, with $Q_{Lj}(s_{u}[t],a_{u}[t])=r_{u}[t]+\gamma\mathbb{E}[V_{L1}(s_{u}[t+1])]$, $j\in\{0,1\}$. The performance of UAV-$u$ in state $s_{u}[t]$ is evaluated by the local V-network $V_{L1}~Critic$, with $V_{L1}(s_{u}[t])=\mathbb{E}_{a_{u}'\sim\pi_{u}}[Q_{min}(s_{u}[t],a_{u}')-\alpha_{u}\log(\pi_{u}(a_{u}'|s_{u}[t]))]$, where $a_{u}'\sim\pi_{u}$ denotes the action taken from policy $\pi_{u}$, and $Q_{min}(\cdot)=\min(Q_{L0}(\cdot),Q_{L1}(\cdot))$.

The parameters of all the five neural networks are updated based on the corresponding loss functions. Specifically, after receiving the $k$-th local experience $(s_{u,k},a_{u,k},r_{u,k},s_{u,k}')$ from the mini-batch $\mathcal{D}_{k}$ of the central controller for the local training, UAV-$u$ uses the loss function

J_{V_{L0}}(\phi_{0,u})=\mathbb{E}\left[\frac{1}{2}\left(V_{L0}(s_{u,k};\phi_{0,u})-Q_{exp}\right)^{2}\right] \quad (19)

to update $\phi_{0,u}$ for the local V-network $V_{L0}~Critic$ in the negative direction of the gradient $\widehat{\nabla}_{\phi_{0,u}}J_{V_{L0}}(\phi_{0,u})$, where $Q_{exp}\triangleq\mathbb{E}_{a_{u}'\sim\pi_{u}}\left[Q_{min}(s_{u,k},a_{u}';\eta_{j,u})-\alpha_{u}\mathcal{H}_{k}\right]$ is the expected entropy-added local minimum Q-value and $\mathcal{H}_{k}\triangleq\log\left(\pi_{u}(a_{u}'|s_{u,k};\theta^{\pi_{u}})\right)$ is the entropy. For the parameter $\phi_{1,u}$ of the $V_{L1}~Critic$ network, we perform a soft update via $\phi_{1,u}\leftarrow\tau\phi_{1,u}+(1-\tau)\phi_{0,u}$, $\tau\in[0,1)$. The parameters $\eta_{0,u}$ and $\eta_{1,u}$ of the $Q_{L0}~Critic$ and $Q_{L1}~Critic$ networks, respectively, are also updated in the negative direction of the corresponding loss function's gradient, with

J_{Q_{Lj}}(\eta_{j,u})=\mathbb{E}\left[\frac{1}{2}\left(Q_{Lj}(s_{u,k},a_{u,k};\eta_{j,u})-y_{Lj}\right)^{2}\right], \quad (20)

where $j\in\{0,1\}$ and, by using $Q_{G}(\cdot)$ to denote the global Q-value used to guide the training of the two local Q-networks, we define $y_{Lj}\triangleq\epsilon\left(r_{u,k}+\gamma\mathbb{E}\left[V_{L1}(s_{u,k}';\phi_{1,u})\right]\right)+(1-\epsilon)\mathbb{E}\left[Q_{G}(\cdot)\right]$ with $\epsilon\in(0,1]$. Similarly, according to the loss function for the policy network

J_{\pi_{u}}(\theta^{\pi_{u}})=\mathbb{E}_{\varepsilon\sim\mathcal{N}}\left[\alpha_{u}\mathcal{H}_{k}'-Q_{min}\left(s_{u,k},f_{\theta^{\pi_{u}}};\eta_{j,u}\right)\right], \quad (21)

the network parameter $\theta^{\pi_{u}}$ is updated in the negative direction of the gradient $\widehat{\nabla}_{\theta^{\pi_{u}}}J_{\pi_{u}}(\theta^{\pi_{u}})$, where the information entropy $\mathcal{H}_{k}'\triangleq\log(\pi_{u}\left(f_{\theta^{\pi_{u}}}(s_{u,k};\varepsilon)|s_{u,k}\right))$ is calculated from the noise-added action $f_{\theta^{\pi_{u}}}(s_{u,k};\varepsilon)$, and $\varepsilon$ is the noise sampled from a fixed distribution $\mathcal{N}$. Based on [18], adding noise to the action prevents the network from overfitting and ensures stable network training. We use the loss function

J_{\alpha_{u}}(\alpha_{u})=\mathbb{E}_{a_{u}'\sim\pi_{u}}\left[-\alpha_{u}\log\left(\pi_{u}(a_{u}'|s_{u,k};\theta^{\pi_{u}})\right)-\alpha_{u}\tilde{\mathcal{H}}\right] \quad (22)

to update the temperature coefficient $\alpha_{u}$ in the negative direction of the gradient $\widehat{\nabla}_{\alpha_{u}}J_{\alpha_{u}}(\alpha_{u})$, with $\tilde{\mathcal{H}}\triangleq|s_{u}[t]|$.
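
The distinctive step of the local training is the hybrid target $y_{Lj}$ in (20), which blends the local TD target with the global Q-value, together with the soft update of the target V-network; a scalar sketch with the Table I values of $\gamma$, $\epsilon$ and $\tau$ is given below.

```python
def local_q_target(r, V_next, Q_global, gamma=0.985, epsilon=0.8):
    """Target y_Lj in (20): convex combination of the local TD target and the
    global Q-value that guides the local critics (gamma and epsilon as in Table I)."""
    return epsilon * (r + gamma * V_next) + (1.0 - epsilon) * Q_global

def soft_update(target_params, source_params, tau=0.999):
    """Soft update of the target V-network parameters, phi_1 <- tau*phi_1 + (1-tau)*phi_0."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(target_params, source_params)]

# Example with scalar placeholders for one sampled experience (illustrative numbers).
print(local_q_target(r=0.4, V_next=1.2, Q_global=1.5))
```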

Global Training: The global training is designed for the central controller, which employs three neural networks: the global Q-network $Q_{G}~Critic$ and the two global V-networks $V_{G0}~Critic$ and $V_{G1}~Critic$, with the corresponding network parameters denoted as $\eta_{G}$, $\phi_{G0}$ and $\phi_{G1}$, respectively. The global Q-network is trained for the global state-action value function $Q_{G}(\cdot)$ and the two global V-networks are trained for the global state value functions $V_{G0}(\cdot)$ and $V_{G1}(\cdot)$. As shown in the right side of Fig. 2, each neural network contains the self-attention block, which extracts the global feature matrix $\tilde{\boldsymbol{o}}$ to capture the connections among the UAVs as specified in Section III-B. The goal of the global training is to find the optimal global Q-value $Q_{G}^{*}(\boldsymbol{o}[t],\boldsymbol{a}[t])=\boldsymbol{r}[t]+\gamma\mathbb{E}[V_{G1}(\boldsymbol{o}[t+1])]$.

The parameters of the three networks are also updated based on the corresponding loss functions. Specifically, after receiving the $k$-th global experience $(\boldsymbol{o}_{k},\boldsymbol{a}_{k},\boldsymbol{r}_{k},\boldsymbol{o}_{k}',\boldsymbol{Z}_{k},\boldsymbol{Z}_{k}')$ from the mini-batch $\mathcal{D}_{k}$, the central controller uses the loss function

J_{V_{G0}}(\phi_{G0})=\mathbb{E}\left[\frac{1}{2}\left(V_{G0}(\boldsymbol{o}_{k};\phi_{G0})-\mathbb{E}_{\boldsymbol{a}'\sim\pi_{u}}\left[Q_{G}(\boldsymbol{o}_{k},\boldsymbol{a}';\eta_{G})\right]\right)^{2}\right] \quad (23)

to update $\phi_{G0}$ for the global V-network $V_{G0}~Critic$ in the negative direction of the gradient $\widehat{\nabla}_{\phi_{G0}}J_{V_{G0}}(\phi_{G0})$. For the parameter $\phi_{G1}$ of the $V_{G1}~Critic$ network, we perform a soft update via $\phi_{G1}\leftarrow\tau\phi_{G1}+(1-\tau)\phi_{G0}$, $\tau\in[0,1)$. The parameter $\eta_{G}$ of the $Q_{G}~Critic$ network is also updated in the negative direction of the corresponding loss function's gradient, where the loss function is given as

J_{Q_{G}}(\eta_{G})=\mathbb{E}\left[\frac{1}{2}\left(Q_{G}(\boldsymbol{o}_{k},\boldsymbol{a}_{k};\eta_{G})-\left(\boldsymbol{r}_{k}+\gamma\mathbb{E}\left[V_{G1}(\boldsymbol{o}_{k}';\phi_{G1})\right]\right)\right)^{2}\right]. \quad (24)

IV-B MAGRL-Based Algorithm

Based on the above framework, we propose the MAGRL-based algorithm to solve problem (P1). The MAGRL-based algorithm is specified in Algorithm 1.

Figure 3: Comparison of the MAGRL method with the baseline methods. (a) Reward variation. (b) $H_{total}$ variation.
TABLE I: Simulation parameters

| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| $h_{fix}$ | 5 m | $\sigma^{2},\varrho^{2}$ | $-90$ dBm, 100 |
| $P_{u}$ | 1 W | $P_{sen},P_{sat}$ | $-10$, 7 dBm |
| $\alpha_{L},\alpha_{N}$ | 3, 5 | $B^{thr}$ | 10 mW·s |
| $B_{u}^{min}$, $B_{u}^{max}$ | 20000, 140000 W·s | $d_{min}$ | 5 m |
| $\gamma,\epsilon,\tau$ | 0.985, 0.8, 0.999 | $a,b$ for $P_{LoS}$ | 12.08, 0.11 |
| $\xi_{0},\xi_{1},\xi_{2}$ | 0.25, 1, 0.00001 | Size of $\mathcal{D}$ | $2^{17}$ |
| Size of $\mathcal{D}_{k}$ | 128 | $\alpha_{u}$'s learning rate | 0.0002 |

V Simulation Results

To evaluate the performance of our proposed MAGRL method, we conduct simulations based on Python 3.9.12 and PyTorch 1.12.1. Unless specified otherwise, in all the simulations, the starting horizontal position of each UAV is selected randomly in the considered area, and each IoT device's initial battery energy is set randomly in the range between 2 mW·s and 5 mW·s. The UAV's propulsion model parameters are set as in [11]. Each neural network for the local training has 4 layers and the number of neurons in the hidden layer is 256. Each neural network in the global training contains a self-attention block and 2 fully connected layers. The learning rate is set to 0.0002 for all neural networks except for the policy network $Actor_{L}$, which has a learning rate of 0.0003. Other parameters can be found in Table I.

Algorithm 1 MAGRL-based solution
1:  Initialize the replay buffer $\mathcal{D}$, learning rate $\lambda$, discount factor $\gamma$, soft-update weight $\tau$ and temperature factor $\alpha_{u}$, $\forall u\in\mathbb{U}$. Initialize the parameters of the five local networks of each UAV and the three global networks;
2:  for Episode $\leftarrow 1,...,EPS$ do
3:     Initialize the locations and energy of all UAVs and IoT devices;
4:     Initialize the observation $\boldsymbol{o}[0]$ and the similarity matrix $\boldsymbol{Z}[0]$;
5:     for $t\leftarrow 1,...,T$ do
6:        Get action $a_{u}[t]=\pi_{u}(s_{u}[t]|\theta^{\pi_{u}})$, $\forall u\in\mathbb{U}$;
7:        Execute action $a_{u}[t]=[V_{u}[t],\varphi_{u}[t],C_{u}[t]]$, $\forall u\in\mathbb{U}$, and obtain $\boldsymbol{o}[t+1]$, $\boldsymbol{r}[t]$ and $\boldsymbol{Z}[t+1]$;
8:        Store $\left(\boldsymbol{o}[t],\boldsymbol{a}[t],\boldsymbol{r}[t],\boldsymbol{o}[t+1],\boldsymbol{Z}[t],\boldsymbol{Z}[t+1]\right)$ into the experience replay buffer $\mathcal{D}$;
9:        if $|\mathcal{D}|\geq$ the mini-batch size $\triangle$ then
10:           for $u\leftarrow 1,...,U$ do
11:              $\phi_{0,u}\leftarrow\phi_{0,u}-\lambda\widehat{\nabla}_{\phi_{0,u}}J_{V_{L0}}(\phi_{0,u})$, update $\phi_{0,u}$ based on (19);
12:              $\eta_{j,u}\leftarrow\eta_{j,u}-\lambda\widehat{\nabla}_{\eta_{j,u}}J_{Q_{Lj}}(\eta_{j,u})$, $j\in\{0,1\}$;
13:              $\theta^{\pi_{u}}\leftarrow\theta^{\pi_{u}}-\lambda\widehat{\nabla}_{\theta^{\pi_{u}}}J_{\pi_{u}}(\theta^{\pi_{u}})$;
14:              $\alpha_{u}\leftarrow\alpha_{u}-\lambda\widehat{\nabla}_{\alpha_{u}}J_{\alpha_{u}}(\alpha_{u})$, update the temperature factor based on (22);
15:              $\phi_{1,u}\leftarrow\tau\phi_{1,u}+(1-\tau)\phi_{0,u}$, soft update;
16:           end for
17:           $\phi_{G0}\leftarrow\phi_{G0}-\lambda\widehat{\nabla}_{\phi_{G0}}J_{V_{G0}}(\phi_{G0})$, update $\phi_{G0}$ based on (23);
18:           $\eta_{G}\leftarrow\eta_{G}-\lambda\widehat{\nabla}_{\eta_{G}}J_{Q_{G}}(\eta_{G})$;
19:           $\phi_{G1}\leftarrow\tau\phi_{G1}+(1-\tau)\phi_{G0}$, soft update;
20:        end if
21:        $\boldsymbol{o}[t]\leftarrow\boldsymbol{o}[t+1]$ and $\boldsymbol{Z}[t]\leftarrow\boldsymbol{Z}[t+1]$;
22:     end for
23:  end for

V-A Training stage

In the training stage, we compare MAGRL with the following 3 benchmarks:

  • MAGRL-HoE: This method does not consider HoE at each IoT device. By only considering the battery energy at each IoT device, its reward function is reduced from (15) as

    r_{u,0}[t]=\frac{N_{u}[t]\sum_{i\in\mathbb{I}_{l}[t]}(B_{i}[t+1]-B_{i}[t])}{1+|\mathbb{I}_{l}[T]|}+\xi_{2}(B_{u}[t+1]-B_{u}^{min}). \quad (25)
  • MAGRL-G: This method removes the global training, where the loss function is given in (20) with $\epsilon=1$.

  • MAGRL-HoE-G: This method exploits neither the HoE nor the global training, where both (25) and (20) with $\epsilon=1$ are applied.

We consider an area of 400 m $\times$ 400 m with 4 UAVs and 6 IoT devices. For the proposed MAGRL method and the benchmarks, we show the accumulated average reward $r_{ac}=\frac{1}{U}\sum_{t=1}^{T}\sum_{u=1}^{U}r_{u}[t]$ in Fig. 3(a). Fig. 3(b) shows the variation of $H_{total}$ for all four methods. The convergence of our proposed MAGRL algorithm is observed in Fig. 3(a), where the proposed MAGRL method outperforms the other 3 benchmarks and achieves the highest $r_{ac}$ after convergence. This implies that the global training can learn the potential connections among the states of the UAVs, thus improving the learning ability of the multiple agents. From Fig. 3(b), it is observed that, by considering HoE, $H_{total}$ under our proposed MAGRL method is the lowest among all four methods, which confirms that the goal of HoE minimization can guide the UAVs' WET to cater to each IoT device's energy requirements.

V-B Testing stage

To show the performance of the UAVs' WET in the testing stage, we illustrate an example where 2 UAVs are dispatched to charge 3 IoT devices in a horizontal area of 200 m $\times$ 200 m. Each UAV agent applies the trained $Actor_{L}$ network to determine its actions. Fig. 4(a) shows the trajectories of the two UAVs, Fig. 4(b) shows each IoT device's battery variation over time, and Fig. 5 shows the two UAVs' binary WET decisions. It is observed from Fig. 4(a) and Fig. 5 that, although each UAV distributively determines its own trajectory and WET, due to the exploration of their self-attentions in the global training, the two UAVs can automatically serve different IoT devices in a collaborative manner, where UAV-1 transmits energy mainly to the two closely-located IoT devices, while UAV-2 mainly serves the other distantly-located IoT device. Due to their effective collaboration for WET, it is also observed that each IoT device's battery energy reaches the required threshold $B^{thr}$ in Fig. 4(b), and thus the overall HoE in problem (P1) becomes 0. Moreover, at the end of $T=100$ slots, the remaining energy of the two UAVs is 24657.72 W·s and 24538.863 W·s, respectively, such that the required $B_{u}^{min}=20000$ W·s in (11) is satisfied for both UAVs.

Figure 4: UAVs' trajectories and the IoT devices' battery energy under the proposed MAGRL method. (a) UAVs' trajectories. (b) IoT devices' battery energy variation.
Figure 5: UAVs' WET decisions under the proposed MAGRL method. (a) UAV-1's WET decisions over time. (b) UAV-2's WET decisions over time.

VI Conclusion

This paper proposes a novel on-demand WET scheme for multiple UAVs. We propose a new metric of HoE to measure each IoT device’s time-varying energy demand based on its required battery energy and the harvested energy from the UAVs. We formulate the HoE minimization problem under the UAVs’ practical mobility and energy constraints, by optimally determining the UAVs’ coupled trajectories and WET decisions over time. Due to the high complexity of this problem, we leverage DRL and propose the MAGRL-based approach, where the UAVs’ collaborations for WET are exploited by excavating the UAVs’ self-attentions in the global training. Through the offline global and local training at the central controller and each UAV, respectively, each UAV can then distributively determine its own trajectory and WET based on the well-trained local neural networks. Simulation results verify the validity of the proposed HoE metric for guiding the UAVs’ on-demand WET, as well as the UAVs’ collaborative WET under the proposed MAGRL-based approach.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grant 62072314.

References

  • [1] S. Bi, C. K. Ho, and R. Zhang, “Wireless powered communication: Opportunities and challenges,” IEEE Commun. Mag., vol. 53, no. 4, pp. 117–125, Apr., 2015.
  • [2] B. Clerckx, et al., “Fundamentals of wireless information and power transfer: From RF energy harvester models to signal and system designs,” IEEE J. Sel. Areas Commun., vol. 37, no. 1, pp. 4–33, Jan. 2018.
  • [3] Y. L. Che, Y. Lai, S. Luo, K. Wu, and L. Duan, “UAV-aided information and energy transmissions for cognitive and sustainable 5G networks,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1668-1683, Mar. 2021.
  • [4] Z. Yang, W. Xu, and M. Shikh-Bahaei, “Energy efficient UAV communication with energy harvesting,” IEEE Trans. Veh. Technol., vol. 69, no. 2, pp. 1913-1927, Feb. 2020.
  • [5] J. Xu, Y. Zeng and R. Zhang, “UAV-enabled wireless power transfer: trajectory design and energy optimization,” IEEE Trans. Wireless Commun., vol. 17, no. 8, pp. 5092-5106, Aug. 2018.
  • [6] J. Mu and Z. Sun, “Trajectory design for multi-UAV-aided wireless power transfer toward future wireless systems,” Sensors, vol. 22, no. 18, pp. 6859, Aug. 2022.
  • [7] L. Xie, X. Cao, J. Xu and R. Zhang, “UAV-enabled wireless power transfer: a tutorial overview,” IEEE Transactions on Green Communications and Networking, vol. 5, no. 4, pp. 2042-2064, Dec. 2021.
  • [8] K. Li, W. Ni, E. Tovar and A. Jamalipour, “On-board deep Q-network for UAV-assisted online power transfer and data collection,” IEEE Trans. Veh. Technol., vol. 68, no. 12, pp. 12215-12226, Dec. 2019.
  • [9] O. S. Oubbati et al., “Synchronizing UAV teams for timely data collection and energy transfer by deep reinforcement learning,” IEEE Trans. Veh. Technol., vol. 71, no. 6, pp. 6682-6697, Jun. 2022.
  • [10] A. Al-Hourani, S. Kandeepan, and S. Lardner, “Optimal LAP altitude for maximum coverage,” IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569-572, Dec. 2014.
  • [11] Y. Zeng, J. Xu and R. Zhang, “Energy minimization for wireless communication with rotary-wing UAV,” IEEE Trans. Wireless Commun., vol. 18, no. 4, pp. 2329-2345, Apr. 2019.
  • [12] P. N. Alevizos and A. Bletsas, “Sensitive and nonlinear far-field RF energy harvesting in wireless communications,” IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 3670-3685, Jun. 2018.
  • [13] PowerCast Module. Accessed: Jul. 2020. [Online]. Available: http://www.mouser.com/ds/2/329/P2110B-Datasheet-Rev-3-1091766.pdf
  • [14] U. V. Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 3, pp. 395-416, Aug. 2007.
  • [15] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017.
  • [16] J. Jiang et al., “Graph convolutional reinforcement learning,” in Proc. International Conference on Learning Representations (ICLR), Oct. 2018.
  • [17] M. L. Puterman, “Markov decision processes: Discrete stochastic dynamic programming,” John Wiley and Sons, 2014.
  • [18] T. Haarnoja et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” in Proc. International Conference on Machine Learning (ICML), Aug. 2018.