
Multi-Agent Graph Reinforcement Learning based On-Demand Wireless Energy Transfer in Multi-UAV-aided IoT Network

Ze Yu Zhao, Yueling Che, Sheng Luo, Kaishun Wu, and Victor C. M. Leung
College of Computer Science and Software Engineering, Shenzhen University, China
Email: 21002711031@email.szu.edu.cn, {yuelingche, sluo, wu, vleung}@szu.edu.cn
Abstract

This paper proposes a new on-demand wireless energy transfer (WET) scheme for multiple unmanned aerial vehicles (UAVs). Unlike the existing studies that simply pursue maximization of the total or the minimum harvested energy at the Internet of Things (IoT) devices, where the IoT devices’ own energy requirements are barely considered, we propose a new metric called the hungry-level of energy (HoE), which reflects the time-varying energy demand of each IoT device based on the energy gap between its required energy and the harvested energy from the UAVs. To minimize the overall HoE of the IoT devices whose energy requirements are not satisfied, we optimally determine all the UAVs’ trajectories and WET decisions over time, under the practical mobility and energy constraints of the UAVs. Although the proposed problem is of high complexity to solve, by excavating the UAVs’ self-attentions for their collaborative WET, we propose the multi-agent graph reinforcement learning (MAGRL) based approach. Through the offline training of the MAGRL model, where the global training at the central controller guides the local training at each UAV agent, each UAV then distributively determines its trajectory and WET based on the well-trained local neural networks. Simulation results show that the proposed MAGRL-based approach outperforms various benchmarks in meeting the IoT devices’ energy requirements.

Index Terms:
Multiple unmanned aerial vehicles (UAVs) aided network, wireless energy transfer (WET), hungry-level of energy (HoE), multi-agent deep reinforcement learning, self-attentions.

I Introduction

The technology of radio frequency (RF) based wireless energy transfer (WET) has been recognized as a promising approach to support energy-sustainable Internet of Things (IoT) networks [1]. Conventionally, ground infrastructure (e.g., dedicated energy transmitters or base stations) is utilized to charge low-power IoT devices. However, due to the generally low end-to-end wireless energy transmission efficiency, to assure non-zero harvested energy at the IoT devices, the effective transmission distance from the ground infrastructure to each IoT device is restricted (to, e.g., 10 meters [2]).

By exploiting the UAVs’ flexible mobility to effectively shorten the transmission distances between the UAVs and the IoT devices, UAV-aided WET has attracted a great deal of attention. For example, the UAV’s dynamic wireless energy and information transmission scheme in the presence of primary users was investigated in [3]. The UAV’s hovering position to maximize the minimum harvested energy at the IoT devices was studied in [4]. The UAV’s trajectory and WET were jointly optimized for maximizing the total harvested energy at the IoT devices in [5].

The above studies only considered WET by a single UAV, which usually struggles to serve a large-scale network due to its limited on-board battery energy. By tackling the more complicated joint design of multiple UAVs’ trajectories and wireless energy transmissions, the multi-UAV-aided WET for maximizing the total harvested energy at the IoT devices has been studied in [6] and [7], where the Lagrange multiplier method and the fly-and-hover based trajectory design were utilized, respectively. The joint design of multi-UAV-aided wireless data and energy transmissions has also been investigated in [8] and [9], where deep reinforcement learning (DRL) is applied to adapt the multiple UAVs’ transmissions to the dynamic environment.

However, we notice that the existing UAV-aided WET schemes may sacrifice the energy demands of some IoT devices for achieving higher global benefits. For example, to maximize the total harvested energy at all the IoT devices, the UAVs with limited on-board energy may choose to transmit energy to the closely-located IoT devices more often, but barely fly to serve the distantly-located IoT devices, so as to save energy; and to maximize the minimum harvested energy at all the IoT devices, the IoT devices with high energy demands may not be able to harvest sufficient energy, for the sake of equal energy harvesting at each IoT device. As a result, both designs may deviate from the IoT devices’ own energy demands.

To cater to the IoT devices’ energy requirements, we propose a new metric called the hungry-level of energy (HoE), which reflects the time-varying energy demand of each IoT device based on the energy gap between its required energy and the harvested energy from the UAVs. Moreover, to explore the UAVs’ potential collaborations, such that they can automatically determine their joint or separate WET depending on the IoT devices’ HoE, we employ the UAVs’ self-attentions based on a graph-based representation. Finally, we propose the novel multi-agent graph reinforcement learning (MAGRL) based approach to minimize the overall HoE of the IoT devices whose energy requirements are not satisfied.

Figure 1: Multi-UAV assisted WET in an IoT system.

The main contributions are summarized as follows:

  • HoE-based Multi-UAV WET Modeling and Novel Problem Formulation: Section II newly proposes the metric of HoE to guide the multi-UAV-aided WET for satisfying the different energy demands of the IoT devices. Based on each IoT device’s non-linear energy harvesting model and each UAV’s velocity-determined energy consumption model, the battery energy management at both the UAVs and the IoT devices is properly modeled. By optimally determining all the UAVs’ trajectories and WET decisions over time, the novel HoE minimization problem is formulated under the UAVs’ practical mobility and energy constraints.

  • MAGRL-based Approach for Distributed and Collaborative Multi-UAV WET: Sections III and IV propose the MAGRL-based approach to solve the complicated HoE minimization problem, where the UAVs’ self-attentions are excavated for their collaborative WET. Through the offline training of the proposed MAGRL model, where the central controller leads the global training to guide the local training at each UAV agent, each UAV then distributively determines its trajectory and WET decision based on the well-trained policy neural networks.

  • Extensive Simulation Results for Performance Evaluation: Section V conducts extensive simulations to verify the validity of the proposed HoE metric for guiding the UAVs’ on-demand WET, by comparing with various benchmarks. The UAVs’ collaborative WET under the proposed MAGRL-based approach is also illustrated.

II System Model and the Problem Formulation

As shown in Fig. 1, we consider that $U$ UAVs with $U\geq 2$ act as the airborne wireless energy transmitters to charge in total $I$ IoT devices with low power consumption on the ground. Each UAV flies at a fixed altitude of $h_{fix}$ meters (m). Denote the sets of the UAVs and the IoT devices as $\mathbb{U}=\{1,2,...,U\}$ and $\mathbb{I}=\{1,2,...,I\}$, respectively. The UAVs' task period of WET is divided into $T$ time slots with $T\geq 1$, where the slot length is $\vartheta$ seconds (s). The set of the time slots is denoted as $\mathbb{T}=\{1,2,...,T\}$. The coordinate of UAV-$u$ in slot $t$ is represented as $q_{u}[t]=\{x_{u}[t],y_{u}[t],h_{fix}\}$, $\forall u\in\mathbb{U}$, $\forall t\in\mathbb{T}$. Since the slot length $\vartheta$ is usually very small, $q_{u}[t]$ is assumed to be unchanged within each slot, but may change over different slots. The coordinate of device-$i$ with $i\in\mathbb{I}$ on the ground is denoted as $q_{i}=\{x_{i},y_{i},0\}$. Let $d_{i}^{u}[t]=\left\|q_{u}[t]-q_{i}\right\|$ denote the distance between UAV-$u$ and device-$i$ in slot $t$. Let $C_{u}[t]\in\{0,1\}$ denote UAV-$u$'s WET decision in slot $t$, where UAV-$u$ broadcasts energy in slot $t$ if $C_{u}[t]=1$, or keeps silent, otherwise.

According to [10], denote $P_{LoS,i}^{u}[t]=\frac{1}{1+a\exp(-b\beta_{i}^{u}[t]+ab)}$ as the line-of-sight (LoS) probability of the air-to-ground (AtG) channel from UAV-$u$ to device-$i$ in slot $t$, where $\beta_{i}^{u}[t]=\sin^{-1}\left(\frac{h_{fix}}{d_{i}^{u}[t]}\right)$ is the elevation angle from UAV-$u$ to device-$i$ in slot $t$, and the constants $a$ and $b$ are environment-related parameters. We then obtain the non-line-of-sight (NLoS) probability as $P_{NLoS,i}^{u}[t]=1-P_{LoS,i}^{u}[t]$. As a result, the average AtG channel power gain from UAV-$u$ to device-$i$ is

G_{i}^{u}[t]=P_{LoS,i}^{u}[t]G_{0}d_{i}^{u}[t]^{-\alpha_{L}}+P_{NLoS,i}^{u}[t]G_{0}d_{i}^{u}[t]^{-\alpha_{N}}, \quad (1)

where $G_{0}$ is the average channel gain at a reference distance of 1 m, and $\alpha_{L}$ and $\alpha_{N}$ are the channel path-loss exponents for the LoS and NLoS links, respectively.
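
For concreteness, the following minimal sketch evaluates the LoS probability and the average AtG channel gain in (1). The function name and the default value of $G_{0}$ are illustrative assumptions, the $a,b$ defaults follow Table I, and the comment on the angle unit only reflects the common convention of [10].

```python
import numpy as np

def avg_channel_gain(q_uav, q_dev, a=12.08, b=0.11, G0=1e-3, alpha_L=3.0, alpha_N=5.0):
    """Average AtG channel power gain G_i^u[t] in (1); G0 is an illustrative assumption."""
    q_uav, q_dev = np.asarray(q_uav, float), np.asarray(q_dev, float)
    d = np.linalg.norm(q_uav - q_dev)                      # distance d_i^u[t]
    beta = np.arcsin(q_uav[2] / d)                         # elevation angle beta_i^u[t] (radians)
    # Note: the (a, b) fit of [10] is usually stated for the angle in degrees;
    # use np.degrees(beta) instead if the fitted parameters assume degrees.
    p_los = 1.0 / (1.0 + a * np.exp(-b * beta + a * b))    # LoS probability
    p_nlos = 1.0 - p_los
    return p_los * G0 * d ** (-alpha_L) + p_nlos * G0 * d ** (-alpha_N)

# Example: UAV at altitude h_fix = 5 m, 20 m horizontal offset from a ground device.
print(avg_channel_gain([20.0, 0.0, 5.0], [0.0, 0.0, 0.0]))
```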

II-A UAV Energy Consumption Model

Each UAV's energy consumption is mainly caused by propulsion and WET. According to [11], in each slot $t$, UAV-$u$'s propulsion power consumption is determined by its velocity $V_{u}[t]=\frac{1}{\vartheta}\left\|q_{u}[t+1]-q_{u}[t]\right\|$ as follows:

P_{pro}(V_{u}[t])=P_{a}\left(1+\frac{3V_{u}[t]^{2}}{V_{tip}^{2}}\right)+\frac{1}{2}f_{0}\rho e_{1}A V_{u}[t]^{3}+P_{b}\left(\sqrt{1+\frac{V_{u}[t]^{4}}{4e_{0}^{4}}}-\frac{V_{u}[t]^{2}}{2e_{0}^{2}}\right)^{\frac{1}{2}}, \quad (2)

where the constants $P_{a}$, $P_{b}$, $V_{tip}$, $e_{0}$, $e_{1}$, $f_{0}$ and $\rho$ are the UAV's mechanical-related parameters. Hence, the propulsion energy consumption of UAV-$u$ in slot $t$ is obtained as $P_{pro}(V_{u}[t])\vartheta$. Denote $P_{u}$ as each UAV-$u$'s transmit power for WET. The energy consumption of UAV-$u$'s WET in slot $t$ is thus $C_{u}[t]P_{u}\vartheta$. Denote $B_{u}[t]$ as the battery level of UAV-$u$ at the beginning of slot $t$, which is updated as

B_{u}[t]=\max\left(B_{u}[t-1]-P_{pro}(V_{u}[t-1])\vartheta-C_{u}[t-1]P_{u}\vartheta,\ 0\right). \quad (3)

We assume that each UAV-$u$ is initially fully charged with $B_{u}[0]=B_{u}^{max}$, where $B_{u}^{max}$ is UAV-$u$'s battery capacity.
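
A minimal sketch of the propulsion power in (2) and the UAV battery update in (3) is given below; the numerical propulsion parameters are assumed, illustrative values in the style of [11], not necessarily those used in our simulations.

```python
import numpy as np

# Illustrative rotary-wing propulsion parameters in the style of [11] (assumed values).
P_a, P_b = 79.86, 88.63        # blade profile power and induced power (W)
V_tip, e_0 = 120.0, 4.03       # rotor tip speed (m/s) and mean rotor induced velocity (m/s)
f_0, rho, e_1, A = 0.6, 1.225, 0.05, 0.503  # fuselage drag ratio, air density, rotor solidity, rotor disc area

def propulsion_power(v):
    """Velocity-dependent propulsion power P_pro(V_u[t]) in (2)."""
    blade = P_a * (1.0 + 3.0 * v**2 / V_tip**2)
    parasite = 0.5 * f_0 * rho * e_1 * A * v**3
    induced = P_b * np.sqrt(np.sqrt(1.0 + v**4 / (4.0 * e_0**4)) - v**2 / (2.0 * e_0**2))
    return blade + parasite + induced

def uav_battery_update(B_prev, v_prev, c_prev, P_u=1.0, theta=1.0):
    """Battery update of UAV-u in (3): propulsion plus WET consumption, floored at zero."""
    return max(B_prev - propulsion_power(v_prev) * theta - c_prev * P_u * theta, 0.0)

print(propulsion_power(10.0), uav_battery_update(140000.0, 10.0, 1))
```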

II-B Energy Harvesting at IoT devices

Each IoT device is installed with a rectenna and an energy harvester. According to [12], the energy harvester transforms the received RF power $p\geq 0$ at the rectenna into the direct-current (DC) power $\mathcal{F}(p)$ as follows:

\mathcal{F}(p)=\begin{cases}0,&p\in[0,P_{sen}),\\ f(p),&p\in[P_{sen},P_{sat}),\\ f(P_{sat}),&p\in[P_{sat},+\infty),\end{cases} \quad (4)

where $P_{sen}$ and $P_{sat}$ with $0<P_{sen}<P_{sat}$ are the sensitivity power and the saturation power at the energy harvester, respectively, and $f(\cdot)$ is a non-linear power transform function that can be easily obtained through the curve-fitting technique [12]. From (4), no DC power is harvested if $p$ is below $P_{sen}$, and the harvested DC power keeps unchanged if $p\geq P_{sat}$. The value of $P_{sen}$ is usually high (e.g., -10 dBm [13]) in practice. Hence, the transmission distance from the UAV to the IoT device needs to be sufficiently short to assure effective WET with non-zero harvested energy. For each device-$i$, since its harvested energy from multiple UAVs in each slot can be accumulated, based on (1) and (4), the harvested energy at device-$i$ in slot $t$ is obtained as

E_{i}^{har}[t]=\mathcal{F}\left(\sum_{u=1}^{U}P_{u}C_{u}[t]G_{i}^{u}[t]\right)\vartheta. \quad (5)

From (5), device-$i$ can harvest more energy if more UAVs are located nearby and transmit energy to it jointly. Denote $B_{i}[t]$ as the battery level of device-$i$ at the beginning of slot $t$, which is updated as

B_{i}[t]=\min\left(B_{i}[t-1]+E_{i}^{har}[t-1],\ B_{i}^{max}\right), \quad (6)

where $B_{i}^{max}$ is the battery capacity of device-$i$ and the initial battery energy $B_{i}[0]\geq 0$ is given. Denote $B^{thr}$ as the required battery energy level at each IoT device, where $B_{i}^{max}\geq B^{thr}>B_{i}[0]>0$ holds in general. We say an IoT device is energy-satisfied if its accumulated battery energy reaches the required $B^{thr}$ before or at the last slot $T$. Hence, $\mathbb{I}_{l}[t]=\{i\,|\,B_{i}[t]+E_{i}^{har}[t]<B^{thr}\}$ is the set of energy-unsatisfied IoT devices at the end of slot $t$. After the UAVs' WET for $T$ slots, the set of energy-unsatisfied IoT devices is obtained as $\mathbb{I}_{l}[T]$.
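
The piecewise harvesting model in (4)-(6) can be sketched as follows; the non-linear transform $f(\cdot)$ is replaced here by a simple constant-efficiency surrogate, since the curve-fitted function of [12] is not reproduced in this paper.

```python
import numpy as np

P_SEN = 1e-3 * 10 ** (-10 / 10)   # sensitivity power: -10 dBm in watts
P_SAT = 1e-3 * 10 ** (7 / 10)     # saturation power: 7 dBm in watts

def f_nonlinear(p):
    """Surrogate for the curve-fitted non-linear transform f(p) of [12] (assumption)."""
    return 0.3 * p

def harvested_dc_power(p):
    """Piecewise non-linear harvesting model F(p) in (4)."""
    if p < P_SEN:
        return 0.0
    if p < P_SAT:
        return f_nonlinear(p)
    return f_nonlinear(P_SAT)

def device_energy(p_u, c, g, theta=1.0):
    """Harvested energy E_i^har[t] in (5): received RF powers from all UAVs are summed
    before the non-linear transform, then scaled by the slot length."""
    p_rx = np.sum(np.asarray(p_u) * np.asarray(c) * np.asarray(g))
    return harvested_dc_power(p_rx) * theta

def device_battery_update(B_prev, E_har_prev, B_max):
    """Battery update of device-i in (6), capped at the battery capacity."""
    return min(B_prev + E_har_prev, B_max)
```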

II-C Hungry-Level of Energy at IoT devices

The HoE is defined to measure the time-varying energy demand of each IoT device. Denote $H_{i}[t]$ as the HoE of device-$i$ at the beginning of slot $t$. We define the time variation of $H_{i}[t]$ as

H_{i}[t]=\begin{cases}\max(H_{i}[t-1]-1,\,1),&\text{if }E_{i}^{har}[t-1]\geq E^{exp}\text{ and }B_{i}[t]<B^{thr},\\ H_{i}[t-1]+1,&\text{if }E_{i}^{har}[t-1]<E^{exp}\text{ and }B_{i}[t]<B^{thr},\\ 0,&\text{if }B_{i}[t]\geq B^{thr},\end{cases} \quad (7)

where $E^{exp}\triangleq\frac{B^{thr}}{T}$ denotes the average amount of energy that device-$i$ expects to harvest in each slot for reaching $B^{thr}$ after $T$ slots. If device-$i$'s harvested energy $E_{i}^{har}[t-1]$ in slot $(t-1)$ reaches the expected $E^{exp}$, but the resultant battery energy $B_{i}[t]$ at the beginning of slot $t$ is still lower than the required $B^{thr}$, $H_{i}[t]$ is reduced by 1 at the beginning of slot $t$, where the minimum allowable HoE when $B_{i}[t]<B^{thr}$ is set to 1. If $E_{i}^{har}[t-1]$ is lower than the expected $E^{exp}$ and the resultant $B_{i}[t]$ is also lower than the required $B^{thr}$, $H_{i}[t]$ is increased by 1 at the beginning of slot $t$. Moreover, if device-$i$'s required energy is satisfied with $B_{i}[t]\geq B^{thr}$ at the beginning of slot $t$, $H_{i}[t]$ becomes 0. The overall HoE of all the energy-unsatisfied IoT devices over $T$ slots is then obtained as

H_{total}=\sum_{i\in\mathbb{I}_{l}[T]}\sum_{t=1}^{T}H_{i}[t]. \quad (8)

It is easy to find that $H_{total}$ is 0 if $\mathbb{I}_{l}[T]$ is empty.
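
The HoE update in (7) reduces to a few comparisons per slot, as sketched below; the initial HoE value and the toy harvesting trace are illustrative assumptions.

```python
def hoe_update(H_prev, E_har_prev, B_t, E_exp, B_thr):
    """One-slot HoE update of device-i following (7)."""
    if B_t >= B_thr:                 # energy requirement met: HoE resets to 0
        return 0
    if E_har_prev >= E_exp:          # harvested at least the per-slot expectation
        return max(H_prev - 1, 1)    # HoE decreases, but never below 1 while unsatisfied
    return H_prev + 1                # harvested less than expected: HoE grows

# Toy run over T slots for one device (illustrative numbers; H_i[0] assumed to be 1).
T, B_thr = 10, 10.0
E_exp = B_thr / T
B, H, history = 2.0, 1, []
for E_har in [0.0, 0.0, 1.5, 0.0, 2.0, 2.0, 0.0, 1.0, 2.0, 0.5]:
    B = min(B + E_har, 20.0)
    H = hoe_update(H, E_har, B, E_exp, B_thr)
    history.append(H)
print(history)
```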

II-D Problem Formulation

By optimally determining all the UAVs' WET decisions $\boldsymbol{C}=\{C_{u}[t]\}$ and trajectories $\boldsymbol{Q}=\{q_{u}[t]\}$ over $T$ slots, we minimize the overall HoE $H_{total}$ in (8) under the UAVs' practical mobility and energy constraints as follows:

(P1): \min_{\boldsymbol{Q},\boldsymbol{C}}\ \sum_{i\in\mathbb{I}_{l}[T]}\sum_{t=1}^{T}H_{i}[t],
\mathrm{s.t.}\quad (3),\ (6),\ (7),
\left\|q_{u}[t]-q_{u}[t-1]\right\|\leq V_{u}^{max}\vartheta,\ \forall u\in\mathbb{U},\forall t\in\mathbb{T}, \quad (9)
C_{u}[t]\in\{0,1\},\ \forall u\in\mathbb{U},\forall t\in\mathbb{T}, \quad (10)
B_{u}[T]\geq B_{u}^{min},\ \forall u\in\mathbb{U}, \quad (11)
d_{u}^{u'}[t]\geq d_{min},\ \forall u,u'\in\mathbb{U},u\neq u',\forall t\in\mathbb{T}, \quad (12)
x_{u}[t]\in[0,W_{max}],\ y_{u}[t]\in[0,L_{max}],\ \forall u\in\mathbb{U},\forall t\in\mathbb{T}. \quad (13)

The constraint in (9) ensures that the velocity of UAV-$u$ does not exceed its maximum allowable velocity $V_{u}^{max}$. The constraint in (10) gives each UAV's binary WET decision. The constraint in (11) ensures that each UAV's remaining energy at the end of slot $T$ is no less than the minimum required energy $B_{u}^{min}$ for a safe return after the WET task. The constraint in (12) guarantees a safe distance between any two UAVs in each slot to avoid collisions. The constraint in (13) confines each UAV's horizontal moving space within an area of length $L_{max}$ and width $W_{max}$.

Problem (P1) is a mixed-integer programming problem. It is also noticed that, to minimize the overall HoE, the multiple UAVs need to be efficiently organized, by either jointly transmitting energy to the same set of closely-located IoT devices or separately serving different sets of distantly-located IoT devices. Hence, all the UAVs' trajectories and WET decisions are naturally coupled with each other over time. Moreover, as constrained by (11), each UAV must use its limited battery energy wisely for reducing the IoT devices' HoE. Therefore, problem (P1) is generally difficult to solve efficiently using traditional optimization methods.

III MDP Modeling and Global Graph Design

Considering the above complicated and coupled relations among the multiple UAVs in problem (P1), a multi-agent DRL approach is leveraged in this paper. As shown in Fig. 1, each UAV acts as an agent and reports its environment states to the central controller (e.g., a base station or a satellite). By using the global environment information of all the UAVs, the central controller's training output also guides each UAV's local training. Although the training is centralized, after the training process, each UAV distributively determines its own WET decision and trajectory based on its local policy. Moreover, to explore the potential collaboration of all the UAVs for efficient WET, we take the global UAV information as a graph and introduce the similarity matrix [14] and the self-attention block [15] to operate on the graph-based global information at the central controller [16]. By doing so, a new MAGRL-based approach is proposed to solve problem (P1). In this section, we model the Markov decision process (MDP) at each UAV, and then introduce the UAVs' graph-based representation at the central controller. The MAGRL-based solution will be specified in Section IV.

III-A MDP Modeling

According to problem (P1), by letting each UAV act as an agent, we model the MDP for each of the $U$ agents [17]. For each agent, define the MDP as a set of states $\mathbb{S}$, a set of actions $\mathbb{A}$, and a set of rewards $\mathbb{R}$. The state set $\mathbb{S}$ embraces all the possible environment configurations at each UAV, including the UAV's own location, the HoE of all the IoT devices, the battery levels of all the IoT devices (it is assumed that all the IoT devices share their HoE and battery levels with the UAVs via a common channel), and the UAV's own battery level. The action set $\mathbb{A}$ provides the action space of each UAV's decision on its trajectory and WET. For any given state $s_{u}[t]\in\mathbb{S}$ of UAV-$u$ at the beginning of slot $t$, UAV-$u$ applies the policy $\pi_{u}:s_{u}[t]\to a_{u}[t]$ to select the action $a_{u}[t]\in\mathbb{A}$, and then obtains the corresponding reward $r_{u}[t]\in\mathbb{R}$ at the end of slot $t$.

Specifically, the state in slot $t$ is defined as $s_{u}[t]=\{x_{u}[t],y_{u}[t],H_{1}[t],...,H_{I}[t],B_{1}[t],...,B_{I}[t],B_{u}[t]\}$, which contains in total $M=2I+3$ elements. Denoting $\varphi_{u}[t]$ as UAV-$u$'s horizontal rotation angle in slot $t$, UAV-$u$'s horizontal location $(x_{u}[t],y_{u}[t])$ in problem (P1) is determined once $\varphi_{u}[t]$ and $V_{u}[t]$ are obtained. Thus UAV-$u$'s MDP action is defined as $a_{u}[t]=\{V_{u}[t],\varphi_{u}[t],C_{u}[t]\}$, where $C_{u}[t]=0$ if the policy network output is negative, or $C_{u}[t]=1$ otherwise. The reward function is proposed as

r_{u}[t]=\xi_{0}r_{u,0}[t]-\xi_{1}r_{u,1}[t], \quad (14)

where $r_{u,0}[t]$ and $r_{u,1}[t]$ are the reward and penalty that UAV-$u$ receives in slot $t$, respectively, and $\xi_{0},\xi_{1}\in(0,1)$ are the corresponding weights. Specifically, letting $w_{i}^{u}[t]\triangleq\frac{\mathcal{F}(P_{u}C_{u}[t]G_{i}^{u}[t])\vartheta}{E_{i}^{har}[t]}$ denote the ratio of the DC energy that device-$i$ harvests from UAV-$u$ to that from all the UAVs, we use $N_{u}[t]=\frac{w_{u}[t]}{\sum_{u'\in\mathbb{U}}w_{u'}[t]}$ with $w_{u}[t]=\sum_{i\in\mathbb{I}}w_{i}^{u}[t]$ to represent UAV-$u$'s effective WET weight among all the UAVs in slot $t$. The IoT devices harvest more energy from a UAV with a higher $N_{u}[t]$, and vice versa. We then propose to use the following $r_{u,0}[t]$:

r_{u,0}[t]=\frac{N_{u}[t]\sum_{i\in\mathbb{I}_{l}[t]}(B_{i}[t+1]-B_{i}[t])\cdot H_{i}[t]}{1+|\mathbb{I}_{l}[t]|\sum_{i\in\mathbb{I}_{l}[t]}H_{i}[t]}+\xi_{2}(B_{u}[t+1]-B_{u}^{min}). \quad (15)

From (15), while the UAVs prefer to perform WET more frequently to reduce the IoT devices' HoE, they also need to use their battery energy carefully to satisfy the constraint in (11). Hence, the reward in (15) contains two terms. The first term is UAV-$u$'s reward for charging the IoT devices, where the numerator is the product of UAV-$u$'s weight $N_{u}[t]$ and the IoT devices' harvested energy weighted by their HoE, and the denominator is the product of the HoE summation over all the energy-unsatisfied IoT devices and the set size $|\mathbb{I}_{l}[t]|$, plus 1 to prevent the denominator from being 0. It is easy to find that the more energy the high-HoE IoT devices can harvest, the higher value the first term achieves. The second term is UAV-$u$'s battery energy gap between $B_{u}[t+1]$ and $B_{u}^{min}$ in constraint (11) after taking action $a_{u}[t]$ in slot $t$, where a balance parameter $\xi_{2}$ is multiplied. The penalty in (14) is designed as

r_{u,1}[t]=\sum_{u=1}^{U}\sum_{j=0}^{1}PEN_{u}^{j}, \quad (16)

where $PEN_{u}^{0}=1$ (or $PEN_{u}^{1}=1$) if the constraint in (12) (or (13)) is not satisfied, and $PEN_{u}^{0}=0$ (or $PEN_{u}^{1}=0$) otherwise.
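
The per-slot reward in (14)-(16) can be assembled as in the following sketch, assuming the per-slot quantities ($N_{u}[t]$, device battery levels, HoE values and penalty indicators) have been computed elsewhere; the weights default to the values in Table I.

```python
import numpy as np

def reward_u(N_u, B_next, B_now, H, unsat, B_u_next, B_u_min,
             pen_collision, pen_boundary, xi0=0.25, xi1=1.0, xi2=1e-5):
    """Per-slot reward r_u[t] in (14)-(16) for UAV-u (a sketch).

    N_u:            UAV-u's effective WET weight N_u[t]
    B_next, B_now:  device battery levels B_i[t+1] and B_i[t]
    H:              device HoE values H_i[t]
    unsat:          boolean mask of energy-unsatisfied devices (the set I_l[t])
    pen_collision, pen_boundary: penalty indicators summed over all UAVs as in (16)
    """
    B_next, B_now, H = map(np.asarray, (B_next, B_now, H))
    unsat = np.asarray(unsat, bool)
    gain = np.sum((B_next[unsat] - B_now[unsat]) * H[unsat])
    denom = 1.0 + np.count_nonzero(unsat) * np.sum(H[unsat])
    r0 = N_u * gain / denom + xi2 * (B_u_next - B_u_min)   # reward term (15)
    r1 = pen_collision + pen_boundary                       # penalty term (16)
    return xi0 * r0 - xi1 * r1                              # weighted combination (14)
```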

III-B Graph Representation of UAVs

By receiving each UAV's state information, the central controller obtains the global information. Let $\boldsymbol{o}[t]=\{s_{1}[t],...,s_{U}[t]\}$, $\boldsymbol{a}[t]=\{a_{1}[t],...,a_{U}[t]\}$ and $\boldsymbol{r}[t]=\{r_{1}[t],...,r_{U}[t]\}$ denote the global observations, actions, and rewards, respectively. To explore the potential connections among the UAVs to improve the overall WET performance as well as to avoid collisions, the central controller uses a graph to represent all the UAVs, by treating each UAV as a node in the graph. According to [14], we use the following similarity matrix among the UAVs to represent the strength of their connections:

\boldsymbol{Z}[t]=\begin{pmatrix}z_{11}[t]&\cdots&z_{1U}[t]\\ \vdots&\ddots&\vdots\\ z_{U1}[t]&\cdots&z_{UU}[t]\end{pmatrix}_{U\times U}, \quad (17)

where the element $z_{uu'}[t]=\exp\left(-\frac{\left\|q_{u}[t]-q_{u'}[t]\right\|^{2}}{2\varrho^{2}}\right)$, $\forall u,u'\in\mathbb{U}$, $u\neq u'$, is the Gaussian distance between UAV-$u$ and UAV-$u'$, $\varrho^{2}$ is a constant, and the diagonal element $z_{uu}[t]=\sum_{u'\in\mathbb{U},u'\neq u}z_{uu'}[t]$ is used as the degree of UAV-$u$. Specifically, following [15], to obtain the global feature matrix $\tilde{\boldsymbol{o}}$, the central controller first generates the attention matrix $\boldsymbol{W}_{att}$ and the value matrix $\boldsymbol{W}_{v}'$ as follows:

\boldsymbol{W}_{att}=\mathrm{softmax}\left(\frac{1}{\sqrt{M}}\left(\boldsymbol{o}\times\boldsymbol{W}_{q}\right)\times\left(\boldsymbol{o}\times\boldsymbol{W}_{k}\right)^{T}\cdot\boldsymbol{Z}\right),\qquad \boldsymbol{W}_{v}'=\boldsymbol{o}\times\boldsymbol{W}_{v}, \quad (18)

where the symbol $\times$ denotes matrix multiplication and $(\cdot)^{T}$ is the matrix transposition. For any matrix $\boldsymbol{Z}\in\mathbf{R}^{U\times U}$, the $\mathrm{softmax}(\cdot)$ function transforms the element $z_{uu'}$ into $\frac{\exp(z_{uu'})}{\sum_{u''\in\mathbb{U}}\exp(z_{uu''})}$, $\forall u,u'\in\mathbb{U}$. Then, the global feature matrix $\tilde{\boldsymbol{o}}$ is obtained as $\tilde{\boldsymbol{o}}=\boldsymbol{W}_{att}\times\boldsymbol{W}_{v}'+\boldsymbol{W}_{v}'$, which is used as the new observation matrix for the central controller.
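
The graph-based attention in (17) and (18) can be sketched as below; the projection matrices are random placeholders for the learnable parameters, and the "$\cdot\boldsymbol{Z}$" operation in (18) is interpreted here as an element-wise masking of the attention scores, which is one possible reading of the notation.

```python
import numpy as np

def similarity_matrix(q, var=100.0):
    """Similarity matrix Z[t] in (17): Gaussian distances off-diagonal, node degrees on the diagonal."""
    U = len(q)
    Z = np.zeros((U, U))
    for u in range(U):
        for v in range(U):
            if u != v:
                Z[u, v] = np.exp(-np.linalg.norm(q[u] - q[v]) ** 2 / (2.0 * var))
    np.fill_diagonal(Z, Z.sum(axis=1))          # degree of each UAV
    return Z

def softmax(X):
    e = np.exp(X - X.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def global_feature(o, Z, Wq, Wk, Wv):
    """Graph-weighted self-attention in (18): o is the U x M global observation matrix."""
    M = o.shape[1]
    W_att = softmax((o @ Wq) @ (o @ Wk).T / np.sqrt(M) * Z)   # attention scores masked by Z
    Wv_prime = o @ Wv
    return W_att @ Wv_prime + Wv_prime                        # residual connection

# Example with U = 4 UAVs and M = 2I+3 = 15 state elements (I = 6 devices, assumed sizes).
rng = np.random.default_rng(0)
q = rng.uniform(0, 400, size=(4, 3))
o = rng.standard_normal((4, 15))
Wq, Wk, Wv = (rng.standard_normal((15, 15)) for _ in range(3))
print(global_feature(o, similarity_matrix(q), Wq, Wk, Wv).shape)
```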

Figure 2: MAGRL framework.

IV MAGRL-based Solution

IV-A MAGRL Training Flow

As shown in Fig. 2, the MAGRL framework includes two parts: the local training at each UAV and the global training at the central controller. In the training stage, a tuple $(\boldsymbol{o}[t],\boldsymbol{a}[t],\boldsymbol{r}[t],\boldsymbol{o}[t+1],\boldsymbol{Z}[t],\boldsymbol{Z}[t+1])$ is stored in the experience replay buffer $\mathcal{D}$, and all the neural networks for both the local and global training apply the stochastic gradient descent (SGD) algorithm to update their parameters.

Local Training: For each UAV's local training, we apply the soft actor-critic (SAC) algorithm proposed in [18] to enhance each agent's exploration of the environment. As shown in the left side of Fig. 2, for any UAV-$u$, five neural networks are employed for its local training: the policy network $Actor_{L}$, the local Q-networks $Q_{L0}~Critic$ and $Q_{L1}~Critic$, and the local V-networks $V_{L0}~Critic$ and $V_{L1}~Critic$, with the corresponding network parameters denoted by $\theta^{\pi_{u}}$, $\eta_{0,u}$, $\eta_{1,u}$, $\phi_{0,u}$ and $\phi_{1,u}$, respectively. The policy network is trained for the policy function $\pi_{u}(\cdot)$ that maps UAV-$u$'s state $s_{u}[t]$ to its action $a_{u}[t]$, the two local Q-networks are trained for the local state-action value functions $Q_{L0}(\cdot)$ and $Q_{L1}(\cdot)$, and the two local V-networks are trained for the local state value functions $V_{L0}(\cdot)$ and $V_{L1}(\cdot)$. The information entropy $\mathcal{H}(\cdot)$ is used to enhance the agent's exploration of the environment [18]. The goal of the local training is to obtain the optimal policy $\pi_{u}^{*}=\arg\max_{\pi_{u}}\sum_{t\in\mathbb{T}}\mathbb{E}\left[r_{u}[t]+\alpha_{u}\mathcal{H}\left(\pi_{u}(\cdot|s_{u}[t])\right)\right]$, where the temperature coefficient $\alpha_{u}$ is the weight of the information entropy, and $\mathbb{E}[\cdot]$ is the expectation over all the possible actions. The performance of UAV-$u$ when taking action $a_{u}[t]$ in state $s_{u}[t]$ is evaluated by the local Q-networks $Q_{L0}~Critic$ and $Q_{L1}~Critic$, with $Q_{Lj}(s_{u}[t],a_{u}[t])=r_{u}[t]+\gamma\mathbb{E}[V_{L1}(s_{u}[t+1])]$, $j\in\{0,1\}$. The performance of UAV-$u$ in state $s_{u}[t]$ is evaluated by the local V-network $V_{L1}~Critic$, with $V_{L1}(s_{u}[t])=\mathbb{E}_{a_{u}'\sim\pi_{u}}[Q_{min}(s_{u}[t],a_{u}')-\alpha_{u}\log(\pi_{u}(a_{u}'|s_{u}[t]))]$, where $a_{u}'\sim\pi_{u}$ denotes the action taken from policy $\pi_{u}$, and $Q_{min}(\cdot)=\min(Q_{L0}(\cdot),Q_{L1}(\cdot))$.

The parameters of all the five neural networks are updated based on the corresponding loss functions. Specifically, after receiving the $k$-th local experience $(s_{u,k},a_{u,k},r_{u,k},s_{u,k}')$ from the mini-batch $\mathcal{D}_{k}$ of the central controller for the local training, UAV-$u$ uses the loss function

J_{V_{L0}}(\phi_{0,u})=\mathbb{E}\left[\frac{1}{2}\left(V_{L0}(s_{u,k};\phi_{0,u})-Q_{exp}\right)^{2}\right] \quad (19)

to update $\phi_{0,u}$ for the local V-network $V_{L0}~Critic$ in the negative direction of the gradient $\widehat{\nabla}_{\phi_{0,u}}J_{V_{L0}}(\phi_{0,u})$, where $Q_{exp}\triangleq\mathbb{E}_{a_{u}'\sim\pi_{u}}\left[Q_{min}(s_{u,k},a_{u}';\eta_{j,u})-\alpha_{u}\mathcal{H}_{k}\right]$ is the expected entropy-added local minimum Q-value and $\mathcal{H}_{k}\triangleq\log\left(\pi_{u}(a_{u}'|s_{u,k};\theta^{\pi_{u}})\right)$ is the entropy. For the parameter $\phi_{1,u}$ of the $V_{L1}~Critic$ network, we perform a soft update via $\phi_{1,u}\leftarrow\tau\phi_{1,u}+(1-\tau)\phi_{0,u}$, $\tau\in[0,1)$. The parameters $\eta_{0,u}$ and $\eta_{1,u}$ of the $Q_{L0}~Critic$ and $Q_{L1}~Critic$ networks, respectively, are also updated in the negative direction of the corresponding loss function's gradient, with

J_{Q_{Lj}}(\eta_{j,u})=\mathbb{E}\left[\frac{1}{2}\left(Q_{Lj}(s_{u,k},a_{u,k};\eta_{j,u})-y_{Lj}\right)^{2}\right], \quad (20)

where $j\in\{0,1\}$ and, by using $Q_{G}(\cdot)$ to denote the global Q-value used to guide the training of the two local Q-networks, we define $y_{Lj}\triangleq\epsilon\left(r_{u,k}+\gamma\mathbb{E}\left[V_{L1}(s_{u,k}';\phi_{1,u})\right]\right)+(1-\epsilon)\mathbb{E}\left[Q_{G}(\cdot)\right]$ with $\epsilon\in(0,1]$. Similarly, according to the loss function for the policy network

J_{\pi_{u}}(\theta^{\pi_{u}})=\mathbb{E}_{\varepsilon\sim\mathcal{N}}\left[\alpha_{u}\mathcal{H}_{k}'-Q_{min}\left(s_{u,k},f_{\theta^{\pi_{u}}};\eta_{j,u}\right)\right], \quad (21)

the network parameter $\theta^{\pi_{u}}$ is updated in the negative direction of the gradient $\widehat{\nabla}_{\theta^{\pi_{u}}}J_{\pi_{u}}(\theta^{\pi_{u}})$, where the information entropy $\mathcal{H}_{k}'\triangleq\log(\pi_{u}\left(f_{\theta^{\pi_{u}}}(s_{u,k};\varepsilon)|s_{u,k}\right))$ is calculated from the noise-added action $f_{\theta^{\pi_{u}}}(s_{u,k};\varepsilon)$, and $\varepsilon$ is the noise sampled from a fixed distribution $\mathcal{N}$. Based on [18], adding noise to the action prevents the network from overfitting and ensures stable network training. We use the loss function

J_{\alpha_{u}}(\alpha_{u})=\mathbb{E}_{a_{u}'\sim\pi_{u}}\left[-\alpha_{u}\log\left(\pi_{u}(a_{u}'|s_{u,k};\theta^{\pi_{u}})\right)-\alpha_{u}\tilde{\mathcal{H}}\right] \quad (22)

to update the temperature coefficient $\alpha_{u}$ in the negative direction of the gradient $\widehat{\nabla}_{\alpha_{u}}J_{\alpha_{u}}(\alpha_{u})$, with $\tilde{\mathcal{H}}\triangleq|s_{u}[t]|$.
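
The distinctive step of the local training is the hybrid target $y_{Lj}$ in (20), which blends the local TD target with the global Q-value, together with the soft update of the target V-network; a scalar sketch with the Table I values of $\gamma$, $\epsilon$ and $\tau$ is given below.

```python
def local_q_target(r, V_next, Q_global, gamma=0.985, epsilon=0.8):
    """Target y_Lj in (20): convex combination of the local TD target and the
    global Q-value that guides the local critics (gamma and epsilon as in Table I)."""
    return epsilon * (r + gamma * V_next) + (1.0 - epsilon) * Q_global

def soft_update(target_params, source_params, tau=0.999):
    """Soft update of the target V-network parameters, phi_1 <- tau*phi_1 + (1-tau)*phi_0."""
    return [tau * t + (1.0 - tau) * s for t, s in zip(target_params, source_params)]

# Example with scalar placeholders for one sampled experience (illustrative numbers).
print(local_q_target(r=0.4, V_next=1.2, Q_global=1.5))
```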

Global Training: The global training is designed for the central controller, which employs three neural networks: the global Q-network $Q_{G}~Critic$ and the two global V-networks $V_{G0}~Critic$ and $V_{G1}~Critic$, with the corresponding network parameters denoted as $\eta_{G}$, $\phi_{G0}$ and $\phi_{G1}$, respectively. The global Q-network is trained for the global state-action value function $Q_{G}(\cdot)$ and the two global V-networks are trained for the global state value functions $V_{G0}(\cdot)$ and $V_{G1}(\cdot)$. As shown in the right side of Fig. 2, each neural network contains the self-attention block, which extracts the global feature matrix $\tilde{\boldsymbol{o}}$ to capture the connections among the UAVs as specified in Section III-B. The goal of the global training is to find the optimal global Q-value $Q_{G}^{*}(\boldsymbol{o}[t],\boldsymbol{a}[t])=\boldsymbol{r}[t]+\gamma\mathbb{E}[V_{G1}(\boldsymbol{o}[t+1])]$.

The parameters of the three networks are also updated based on the corresponding loss functions. Specifically, after receiving the $k$-th global experience $(\boldsymbol{o}_{k},\boldsymbol{a}_{k},\boldsymbol{r}_{k},\boldsymbol{o}_{k}',\boldsymbol{Z}_{k},\boldsymbol{Z}_{k}')$ from the mini-batch $\mathcal{D}_{k}$, the central controller uses the loss function

J_{V_{G0}}(\phi_{G0})=\mathbb{E}\left[\frac{1}{2}\left(V_{G0}(\boldsymbol{o}_{k};\phi_{G0})-\mathbb{E}_{\boldsymbol{a}'\sim\pi_{u}}\left[Q_{G}(\boldsymbol{o}_{k},\boldsymbol{a}';\eta_{G})\right]\right)^{2}\right] \quad (23)

to update $\phi_{G0}$ for the global V-network $V_{G0}~Critic$ in the negative direction of the gradient $\widehat{\nabla}_{\phi_{G0}}J_{V_{G0}}(\phi_{G0})$. For the parameter $\phi_{G1}$ of the $V_{G1}~Critic$ network, we perform a soft update via $\phi_{G1}\leftarrow\tau\phi_{G1}+(1-\tau)\phi_{G0}$, $\tau\in[0,1)$. The parameter $\eta_{G}$ of the $Q_{G}~Critic$ network is also updated in the negative direction of the corresponding loss function's gradient, where the loss function is given as

J_{Q_{G}}(\eta_{G})=\mathbb{E}\left[\frac{1}{2}\left(Q_{G}(\boldsymbol{o}_{k},\boldsymbol{a}_{k};\eta_{G})-\left(\boldsymbol{r}_{k}+\gamma\mathbb{E}\left[V_{G1}(\boldsymbol{o}_{k}';\phi_{G1})\right]\right)\right)^{2}\right]. \quad (24)

IV-B MAGRL-Based Algorithm

Based on the above framework, we propose the MAGRL-based algorithm to solve problem (P1). The MAGRL-based algorithm is specified in Algorithm 1.

Figure 3: Comparison of the MAGRL method with the baseline methods. (a) Reward variation. (b) $H_{total}$ variation.
TABLE I: Simulation parameters

| Parameter | Value | Parameter | Value |
| --- | --- | --- | --- |
| $h_{fix}$ | 5 m | $\sigma^{2},\varrho^{2}$ | $-90$ dBm, 100 |
| $P_{u}$ | 1 W | $P_{sen},P_{sat}$ | $-10$, 7 dBm |
| $\alpha_{L},\alpha_{N}$ | 3, 5 | $B^{thr}$ | 10 mW·s |
| $B_{u}^{min}$, $B_{u}^{max}$ | 20000, 140000 W·s | $d_{min}$ | 5 m |
| $\gamma,\epsilon,\tau$ | 0.985, 0.8, 0.999 | $a,b$ for $P_{LoS}$ | 12.08, 0.11 |
| $\xi_{0},\xi_{1},\xi_{2}$ | 0.25, 1, 0.00001 | Size of $\mathcal{D}$ | $2^{17}$ |
| Size of $\mathcal{D}_{k}$ | 128 | $\alpha_{u}$'s learning rate | 0.0002 |

V Simulation Results

To evaluate the performance of our proposed MAGRL method, we conduct simulations based on Python 3.9.12 and PyTorch 1.12.1. Unless specified otherwise, in all the simulations, the starting horizontal position of each UAV is selected randomly in the considered area, and each IoT device's initial battery energy is set randomly in the range between 2 mW·s and 5 mW·s. The UAV's propulsion model parameters are set as in [11]. Each neural network for the local training has 4 layers and the number of neurons in the hidden layer is 256. Each neural network in the global training contains a self-attention block and 2 fully connected layers. The learning rate is set to 0.0002 for all neural networks except for the policy network $Actor_{L}$, which has a learning rate of 0.0003. Other parameters can be found in Table I.

Algorithm 1 MAGRL-based solution
1:  Initialize the replay buffer $\mathcal{D}$, learning rate $\lambda$, discount factor $\gamma$, soft-update weight $\tau$ and temperature factor $\alpha_{u}$, $\forall u\in\mathbb{U}$. Initialize the parameters of the five local networks of each UAV and the three global networks;
2:  for Episode $\leftarrow 1,...,EPS$ do
3:     Initialize the locations and energy of all UAVs and IoT devices;
4:     Initialize the observation $\boldsymbol{o}[0]$ and the similarity matrix $\boldsymbol{Z}[0]$;
5:     for $t\leftarrow 1,...,T$ do
6:        Get action $a_{u}[t]=\pi_{u}(s_{u}[t]|\theta^{\pi_{u}})$, $\forall u\in\mathbb{U}$;
7:        Execute action $a_{u}[t]=[V_{u}[t],\varphi_{u}[t],C_{u}[t]]$, $\forall u\in\mathbb{U}$, and obtain $\boldsymbol{o}[t+1]$, $\boldsymbol{r}[t]$ and $\boldsymbol{Z}[t+1]$;
8:        Store $\left(\boldsymbol{o}[t],\boldsymbol{a}[t],\boldsymbol{r}[t],\boldsymbol{o}[t+1],\boldsymbol{Z}[t],\boldsymbol{Z}[t+1]\right)$ into the experience replay buffer $\mathcal{D}$;
9:        if $|\mathcal{D}|\geq$ the mini-batch size $\triangle$ then
10:           for $u\leftarrow 1,...,U$ do
11:              $\phi_{0,u}\leftarrow\phi_{0,u}-\lambda\widehat{\nabla}_{\phi_{0,u}}J_{V_{L0}}(\phi_{0,u})$, update $\phi_{0,u}$ based on (19);
12:              $\eta_{j,u}\leftarrow\eta_{j,u}-\lambda\widehat{\nabla}_{\eta_{j,u}}J_{Q_{Lj}}(\eta_{j,u})$, $j\in\{0,1\}$;
13:              $\theta^{\pi_{u}}\leftarrow\theta^{\pi_{u}}-\lambda\widehat{\nabla}_{\theta^{\pi_{u}}}J_{\pi_{u}}(\theta^{\pi_{u}})$;
14:              $\alpha_{u}\leftarrow\alpha_{u}-\lambda\widehat{\nabla}_{\alpha_{u}}J_{\alpha_{u}}(\alpha_{u})$, update the temperature factor based on (22);
15:              $\phi_{1,u}\leftarrow\tau\phi_{1,u}+(1-\tau)\phi_{0,u}$, soft update;
16:           end for
17:           $\phi_{G0}\leftarrow\phi_{G0}-\lambda\widehat{\nabla}_{\phi_{G0}}J_{V_{G0}}(\phi_{G0})$, update $\phi_{G0}$ based on (23);
18:           $\eta_{G}\leftarrow\eta_{G}-\lambda\widehat{\nabla}_{\eta_{G}}J_{Q_{G}}(\eta_{G})$;
19:           $\phi_{G1}\leftarrow\tau\phi_{G1}+(1-\tau)\phi_{G0}$, soft update;
20:        end if
21:        $\boldsymbol{o}[t]\leftarrow\boldsymbol{o}[t+1]$ and $\boldsymbol{Z}[t]\leftarrow\boldsymbol{Z}[t+1]$;
22:     end for
23:  end for

V-A Training stage

In the training stage, we compare MAGRL with the following 3 benchmarks:

  • MAGRL-HoE: This method does not consider HoE at each IoT device. By only considering the battery energy at each IoT device, its reward function is reduced from (15) as

    r_{u,0}[t]=\frac{N_{u}[t]\sum_{i\in\mathbb{I}_{l}[t]}(B_{i}[t+1]-B_{i}[t])}{1+|\mathbb{I}_{l}[T]|}+\xi_{2}(B_{u}[t+1]-B_{u}^{min}). \quad (25)
  • MAGRL-G: This method removes the global training, where the loss function is given in (20) with $\epsilon=1$.

  • MAGRL-HoE-G: This method exploits neither the HoE nor the global training, where both (25) and (20) with $\epsilon=1$ are applied.

We consider an area of 400 m $\times$ 400 m with 4 UAVs and 6 IoT devices. For the proposed MAGRL method and the benchmarks, we show the accumulated average reward $r_{ac}=\frac{1}{U}\sum_{t=1}^{T}\sum_{u=1}^{U}r_{u}[t]$ in Fig. 3(a). Fig. 3(b) shows the variation of $H_{total}$ for all four methods. The convergence of our proposed MAGRL algorithm is observed in Fig. 3(a), where the proposed MAGRL method outperforms the other 3 benchmarks and achieves the highest $r_{ac}$ after convergence. This implies that the global training can learn the potential connections among the states of the UAVs, thus improving the learning ability of the multiple agents. From Fig. 3(b), it is observed that, by considering HoE, $H_{total}$ under our proposed MAGRL method is the lowest among all four methods, which confirms that the goal of HoE minimization can guide the UAVs' WET to cater to each IoT device's energy requirements.

V-B Testing stage

To show the performance of the UAVs' WET in the testing stage, we illustrate an example where 2 UAVs are dispatched to charge 3 IoT devices in a horizontal area of 200 m $\times$ 200 m. Each UAV agent applies the trained $Actor_{L}$ network to determine its actions. Fig. 4(a) shows the trajectories of the two UAVs, Fig. 4(b) shows each IoT device's battery variation over time, and Fig. 5 shows the two UAVs' binary WET decisions. It is observed from Fig. 4(a) and Fig. 5 that, although each UAV distributively determines its own trajectory and WET, due to the exploration of their self-attentions in the global training, the two UAVs can automatically serve different IoT devices in a collaborative manner, where UAV-1 transmits energy mainly to the two closely-located IoT devices, while UAV-2 mainly serves the other distantly-located IoT device. Due to their effective collaboration for WET, it is also observed that each IoT device's battery energy reaches the required threshold $B^{thr}$ in Fig. 4(b), and thus the overall HoE in problem (P1) becomes 0. Moreover, at the end of $T=100$ slots, the remaining energy of the two UAVs is 24657.72 W·s and 24538.863 W·s, respectively, such that the required $B_{u}^{min}=20000$ W·s in (11) is satisfied for both UAVs.

Figure 4: UAVs' trajectories and the IoT devices' battery energy under the proposed MAGRL method. (a) UAVs' trajectories. (b) IoT devices' battery energy variation.
Figure 5: UAVs' WET decisions under the proposed MAGRL method. (a) UAV-1's WET decisions over time. (b) UAV-2's WET decisions over time.

VI Conclusion

This paper proposes a novel on-demand WET scheme for multiple UAVs. We propose a new metric of HoE to measure each IoT device’s time-varying energy demand based on its required battery energy and the harvested energy from the UAVs. We formulate the HoE minimization problem under the UAVs’ practical mobility and energy constraints, by optimally determining the UAVs’ coupled trajectories and WET decisions over time. Due to the high complexity of this problem, we leverage DRL and propose the MAGRL-based approach, where the UAVs’ collaborations for WET are exploited by excavating the UAVs’ self-attentions in the global training. Through the offline global and local training at the central controller and each UAV, respectively, each UAV can then distributively determine its own trajectory and WET based on the well-trained local neural networks. Simulation results verify the validity of the proposed HoE metric for guiding the UAVs’ on-demand WET, as well as the UAVs’ collaborative WET under the proposed MAGRL-based approach.

Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grant 62072314.

References

  • [1] S. Bi, C. K. Ho, and R. Zhang, “Wireless powered communication: Opportunities and challenges,” IEEE Commun. Mag., vol. 53, no. 4, pp. 117–125, Apr., 2015.
  • [2] B. Clerckx, et al., “Fundamentals of wireless information and power transfer: From RF energy harvester models to signal and system designs,” IEEE J. Sel. Areas Commun., vol. 37, no. 1, pp. 4–33, Jan. 2018.
  • [3] Y. L. Che, Y. Lai, S. Luo, K. Wu, and L. Duan, “UAV-aided information and energy transmissions for cognitive and sustainable 5G networks,” IEEE Trans. Wireless Commun., vol. 20, no. 3, pp. 1668-1683, Mar. 2021.
  • [4] Z. Yang, W. Xu, and M. Shikh-Bahaei, “Energy efficient UAV communication with energy harvesting,” IEEE Trans. Veh. Technol., vol. 69, no. 2, pp. 1913-1927, Feb. 2020.
  • [5] J. Xu, Y. Zeng and R. Zhang, “UAV-enabled wireless power transfer: trajectory design and energy optimization,” IEEE Trans. Wireless Commun., vol. 17, no. 8, pp. 5092-5106, Aug. 2018.
  • [6] J. Mu and Z. Sun, “Trajectory design for multi-UAV-aided wireless power transfer toward future wireless systems,” Sensors, vol. 22, no. 18, pp. 6859, Aug. 2022.
  • [7] L. Xie, X. Cao, J. Xu and R. Zhang, “UAV-enabled wireless power transfer: a tutorial overview,” IEEE Transactions on Green Communications and Networking, vol. 5, no. 4, pp. 2042-2064, Dec. 2021.
  • [8] K. Li, W. Ni, E. Tovar and A. Jamalipour, “On-board deep Q-network for UAV-assisted online power transfer and data collection,” IEEE Trans. Veh. Technol., vol. 68, no. 12, pp. 12215-12226, Dec. 2019.
  • [9] O. S. Oubbati et al., “Synchronizing UAV teams for timely data collection and energy transfer by deep reinforcement learning,” IEEE Trans. Veh. Technol., vol. 71, no. 6, pp. 6682-6697, Jun. 2022.
  • [10] A. Al-Hourani, S. Kandeepan, and S. Lardner, “Optimal LAP altitude for maximum coverage,” IEEE Wireless Commun. Lett., vol. 3, no. 6, pp. 569-572, Dec. 2014.
  • [11] Y. Zeng, J. Xu and R. Zhang, “Energy minimization for wireless communication with rotary-wing UAV,” IEEE Trans. Wireless Commun., vol. 18, no. 4, pp. 2329-2345, Apr. 2019.
  • [12] P. N. Alevizos and A. Bletsas, “Sensitive and nonlinear far-field RF energy harvesting in wireless communications,” IEEE Trans. Wireless Commun., vol. 17, no. 6, pp. 3670-3685, Jun. 2018.
  • [13] PowerCast Module. Accessed: Jul. 2020. [Online]. Available: http://www.mouser.com/ds/2/329/P2110B-Datasheet-Rev-3-1091766.pdf
  • [14] U. V. Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, no. 3, pp. 395-416, Aug. 2007.
  • [15] A. Vaswani et al., “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017.
  • [16] J. Jiang et al., “Graph convolutional reinforcement learning,” in Proc. International Conference on Learning Representations (ICLR), Oct. 2018.
  • [17] M. L. Puterman, “Markov decision processes: Discrete stochastic dynamic programming,” John Wiley and Sons, 2014.
  • [18] T. Haarnoja et al., “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor,” in Proc. International Conference on Machine Learning (ICML), Aug. 2018.