
Meta Reinforcement Learning for Strategic IoT Deployments Coverage in Disaster-Response UAV Swarms

Marwan Dhuheir1, Aiman Erbad1, Ala Al-Fuqaha1 1Division of Information and Computing Technology, College of Science and Engineering,
Hamad Bin Khalifa University, Qatar Foundation, Doha, Qatar.
Abstract

In the past decade, Unmanned Aerial Vehicles (UAVs) have attracted the attention of researchers in academia and industry for their potential use in critical emergency applications, such as providing wireless services to ground users and collecting data from areas affected by disasters, thanks to their maneuverability and movement flexibility. However, the UAVs' limited resources, energy budget, and strict mission completion time pose challenges to adopting UAVs in these applications. Our system model considers a UAV swarm that navigates an area to collect data from ground IoT devices, focusing on providing better service to strategic locations and allowing UAVs to join and leave the swarm (e.g., for recharging) dynamically. In this work, we introduce an optimization model that minimizes the total energy consumption and provides the optimal path planning of UAVs under completion time and transmit power constraints. The formulated optimization problem is NP-hard, making it unsuitable for real-time decision-making. Therefore, we introduce a lightweight meta-reinforcement learning solution that can also cope with sudden changes in the environment through fast convergence. We conduct extensive simulations and compare our approach to three state-of-the-art learning models. Our simulation results show that our approach outperforms the three state-of-the-art algorithms in providing coverage to strategic locations with fast convergence.

Index Terms:
Optimization; QoS; UAVs positions; energy consumption; meta-reinforcement learning; reinforcement learning; UAVs; strategic locations.

I introduction

Unmanned aerial vehicles (UAVs) are playing a key role in many applications. Recently, UAVs have been widely used as flying wireless communication base stations due to their advantages over traditional platforms. These advantages include cost effectiveness, flexibility, easy maneuverability, scalability, and line-of-sight (LoS) service. They have led to the wide adoption of UAVs in critical applications, including surveillance, forest fire detection, object tracking, and wireless communication services in catastrophic disasters (e.g., areas affected by earthquakes, hurricanes, and floods) [1, 2]. Furthermore, UAVs can provide a cost-effective and reliable solution to collect data from IoT devices dispersed over a wide geographical area whose terrestrial infrastructure has been impacted by natural or man-made disasters.

UAVs have many advantages over traditional platforms allowing them to effectively accomplish their mission with less energy consumption and within the time constraints [3]. In this way, UAVs can work as data aggregators, which are flying base stations that provide wireless communication services to ground users that cannot be reached with terrestrial base stations. Nevertheless, several challenges need to be addressed to effectively use UAVs, including optimizing mission trajectory, energy consumption, resource allocation, and the completion time of missions.

UAV-based solutions for data collection from IoT devices can optimize the mission's path, energy consumption, and completion time [4, 5, 6]. The authors in [4] propose to jointly optimize the UAVs' positions and transmit power for reliable information transmission and rapid data collection from ground users. In [5], the authors introduce a joint position and travel path optimization to consume less energy while gathering data; the trajectory planning optimization aims to collect more user data under energy consumption constraints. In [6], the authors use deep reinforcement learning for UAV trajectory planning optimization and focus on finding the optimal paths of UAVs under time constraints in which the UAVs need to deliver their collected data to a central station before the data becomes irrelevant. All the above-mentioned studies [4, 5, 6] focus on optimizing the UAVs' path planning to reduce energy consumption and maximize coverage so as to include a larger number of users and collect more data. However, none of these studies investigated planning the UAV paths to cover the affected areas with a focus on the most vulnerable spots, so that the UAVs plan their trajectories to collect data from the most affected ground IoT devices and upload the data to remote base stations. In particular, the UAVs can navigate the whole area while concentrating their trajectories on some strategic locations, all while maintaining the minimum data rate that guarantees reliable data transmission to other UAVs within the swarm.

Machine learning-based solutions to optimize the path planning of UAVs in dynamic environments have been addressed in several research studies [7, 8, 9, 10]. The works in [7, 8] investigate UAV path optimization to reduce latency and improve transmission reliability using a deep reinforcement learning approach. The authors in [9] propose a machine learning solution to provide on-demand services to ground users. In [10], the authors use multi-agent reinforcement learning with a Q-learning approach to choose the trajectories of UAVs based on predicting user movements. Nevertheless, these studies are not suitable in the case of rapid changes in the environment, especially since the UAV path parameters constantly change during missions in dynamic environments. For example, UAVs can join and leave the swarm dynamically for recharging or due to other urgent issues (e.g., damage, malfunction). Due to the complexity of the problem, conventional RL algorithms are slow to re-converge while respecting the path planning constraints; hence, we propose meta-learning to quickly adapt the learned policy to new environments with high efficiency across related tasks [11].

In this work, we address the problem of UAV path planning that minimizes the energy consumption of UAVs while guaranteeing the minimum data rate required to reliably deliver the data to the UAVs, using meta-RL algorithms to cope with dynamic environments through rapid convergence. In our approach, we define strategic locations in the covered geographical area that UAVs need to visit more frequently. These strategic locations represent the more affected buildings/neighbourhoods after a disaster or areas with more victims, so the UAVs need to plan their trajectories to traverse these buildings/neighbourhoods more often to ensure timely data collection from severely affected victims. However, our approach is general enough to cover a wide range of UAV applications such as object tracking and wireless communication services in dense areas. Our contributions can be summarized as follows:

  • We develop a system model containing a UAV swarm navigating an area to collect data from IoT devices distributed on the ground. The covered area contains strategic locations that need better service from the UAV swarm.

  • We formulate the approach as an optimization problem that seeks to minimize the total energy consumption of the UAVs in the grid, subject to constraints ensuring reliable data delivery among UAVs, bounded mission completion time, and path planning that covers the strategic locations.

  • We investigate the proposed system model through extensive simulations to evaluate its performance by varying parameters such as the number of UAVs in the grid and comparing our meta-RL solution with three state-of-the-art algorithms, namely reinforcement learning with proximal policy optimization (PPO), the actor-critic algorithm, and the DQN algorithm, and we show that our Meta-RL algorithm provides better demand service satisfaction for the strategic locations than the other competitive algorithms.

The rest of this article is organized as follows: Section II presents the description of our system model. In Section III, we delineate the problem formulation. Section IV describes the proposed meta-reinforcement learning solution. Section V explains the implementation results of the proposed approach. Finally, Section VI concludes and discusses future research directions.

II system model

We divide the covered area into equal-sized cells, where each cell is monitored by one UAV for data collection from the IoT devices distributed in the area, as shown in Figure 1. According to this approach, $U$ UAVs denoted as $u=\{1,2,\ldots,U\}$ navigate to cover the area, focusing on its strategic locations. In particular, the $U$ UAVs collect data from $N\gg 1$ ground IoT devices indexed by $i=\{1,2,\ldots,N\}$. The position of ground IoT device $i$ is denoted by $Q_{i}=[x_{i},y_{i},0]$, where $i\in N$, and is assumed to be known using the global positioning system (GPS). As shown in Figure 1, the movements of UAVs and their directions are optimized by a ground station (control center).

Figure 1: System model for multiple UAVs covering an area with strategic locations. The UAVs' mission is data collection from ground devices.

Let us suppose that the UAVs operate over a time frame $T$, where $T>0$ in seconds (s). The time frame $T$ is divided into small time intervals $t$, such that $0\leq t\leq T$; thus the 3D position of UAV $u$ at time instance $t$ can be described as $Q_{u}(t)=[x_{u}(t),y_{u}(t),h_{u}(t)]\in\mathbb{R}^{3}$, where $u\in U$ is a member of the set of UAVs. Moreover, we further assume that the time frame $T$ is divided into $M$ equal time slots, indexed by the set $m=\{1,\ldots,M\}$. Each time slot $m$ has length $\mu=\frac{T}{M}$, which is small enough that the UAV 3D location can be considered stable within a slot [12]. Accordingly, the UAV 3D position can be rewritten as $Q_{u}[m]=[x_{u}(t),y_{u}(t),h_{u}(t)]$, $m\in M$. The gathered data from ground devices is sent to the ground base station after the UAVs complete one time frame $T$.

II-A Wireless Channel Model

In our approach, and for practical scenarios, the obstacle information, including the number, height, and locations of obstacles, might not be known; hence, we consider the random availability of line-of-sight (LoS) and non-line-of-sight (NLoS) channels on the air-to-ground link between UAVs and IoT devices. Note that the LoS and NLoS conditions depend on the type of environment (e.g., rural, urban, suburban, etc.), the locations of UAVs and IoT devices, and the altitude of the flying UAVs. Hence, the probability of LoS is given by [3]:

$P_{ui}^{LoS}=\frac{1}{1+\omega_{1}\exp\left(-\omega_{2}[\theta_{ui}-\omega_{1}]\right)}$ (1)

where $\omega_{1}$ and $\omega_{2}$ are constant parameters whose values depend on the type of environment, and $\theta_{ui}$ represents the elevation angle between UAV $u$ and IoT device $i$. Particularly, $\theta_{ui}=\frac{180}{\pi}\times\sin^{-1}\left(\frac{h_{i}}{d_{ui}}\right)$, where $d_{ui}=\sqrt{(x_{i}-x_{u})^{2}+(y_{i}-y_{u})^{2}+h_{i}^{2}}$ is the distance between UAV $u$ and device $i$. The probability of NLoS is given by $P_{ui}^{NLoS}=1-P_{ui}^{LoS}$.
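To make the geometry behind (1) concrete, the following Python sketch computes the UAV-device distance, the elevation angle, and the LoS/NLoS probabilities; the function names and the example coordinates are illustrative choices of ours, and the urban values of $\omega_{1}$ and $\omega_{2}$ are taken from Section V.

```python
import math

# Urban-environment constants reported in Section V (assumed here for illustration).
OMEGA_1, OMEGA_2 = 11.95, 0.14

def distance_3d(uav_xy, device_xy, h):
    """Distance d_ui between UAV u and ground IoT device i (h is the height term)."""
    dx = device_xy[0] - uav_xy[0]
    dy = device_xy[1] - uav_xy[1]
    return math.sqrt(dx**2 + dy**2 + h**2)

def elevation_angle_deg(h, d_ui):
    """Elevation angle theta_ui in degrees: (180/pi) * asin(h / d_ui)."""
    return (180.0 / math.pi) * math.asin(h / d_ui)

def p_los(theta_deg):
    """Probability of LoS, Eq. (1)."""
    return 1.0 / (1.0 + OMEGA_1 * math.exp(-OMEGA_2 * (theta_deg - OMEGA_1)))

# Example: a UAV 100 m above a point 150 m (horizontally) away from a device.
d = distance_3d((0.0, 0.0), (150.0, 0.0), h=100.0)
theta = elevation_angle_deg(100.0, d)
print(f"d_ui = {d:.1f} m, theta = {theta:.1f} deg, "
      f"P_LoS = {p_los(theta):.3f}, P_NLoS = {1 - p_los(theta):.3f}")
```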

The path loss models of LoS and NLoS are expressed as follows [3]:

$L_{LoS}^{ui}=\psi_{LoS}\left(\frac{4\pi f_{c}d_{ui}}{c}\right)^{2}$ (2)
$L_{NLoS}^{ui}=\psi_{NLoS}\left(\frac{4\pi f_{c}d_{ui}}{c}\right)^{2}$ (3)

where $f_{c}$ denotes the carrier frequency and $c$ is the speed of light. $\psi_{LoS}$ and $\psi_{NLoS}$ denote the excessive path loss, relative to the free-space propagation loss, for LoS and NLoS links, respectively.

The total average path loss of the communication link between the UAVs and the IoT devices is given by:

$\bar{L}_{ui}=P_{ui}^{LoS}L_{LoS}^{ui}+P_{ui}^{NLoS}L_{NLoS}^{ui}$ (4)

Moreover, the average channel gain of the communication link between IoT devices and UAVs is $\bar{g}_{ui}=1/\bar{L}_{ui}$. The threshold of minimum data rate for successful transmission is expressed as:

$\rho_{u,i}(lb)=B_{u,i}\log_{2}\left(1+\frac{P_{r}}{\sigma^{2}}\right)$ (5)

where $B_{u,i}$ is the transmission bandwidth between UAV $u$ and IoT device $i$, $P_{r}$ is the received power at UAV $u$ with $P_{r}=P_{i}\times\bar{g}_{ui}$, $P_{i}$ denotes the transmit power of device $i$, and $\sigma^{2}$ represents the thermal noise power. Hence, to achieve reliable data collection by the UAVs from the IoT devices, the following condition needs to be satisfied:

$\rho_{u,i}[m]\geq\rho_{u,i}(lb)$ (6)

To ensure reliable transmission of data from IoT devices to the flying UAVs, the maximum altitude of the UAVs needs to satisfy the minimum SNR $\Gamma$ as follows [13]:

$h_{i}\leq\left(\frac{P_{max}}{\Gamma F_{0}^{2}\sigma^{2}\psi_{LoS}}\right)$ (7)

where $F_{0}=\frac{4\pi f_{c}}{c}$, and $P_{max}$ is the maximum transmit power of each UAV.
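As a small illustration of Eqs. (2)-(6), the sketch below evaluates the average path loss, the channel gain, and the achievable rate; the carrier frequency is an assumed value, while the excess-loss, bandwidth, and noise figures follow Section V and Table I, and all function names are ours.

```python
import math

C = 3e8                               # speed of light (m/s)
FC = 2e9                              # carrier frequency (Hz), assumed for this sketch
PSI_LOS_DB, PSI_NLOS_DB = 3.0, 23.0   # excessive path loss (dB), Section V
BANDWIDTH = 1e6                       # B_{u,i} (Hz), Table I
NOISE_DBM = -170.0                    # sigma^2, Table I

def db_to_linear(db):
    return 10 ** (db / 10.0)

def avg_path_loss(d_ui, p_los):
    """Average path loss L_bar, Eqs. (2)-(4)."""
    fspl = (4 * math.pi * FC * d_ui / C) ** 2
    l_los = db_to_linear(PSI_LOS_DB) * fspl
    l_nlos = db_to_linear(PSI_NLOS_DB) * fspl
    return p_los * l_los + (1 - p_los) * l_nlos

def achievable_rate(p_tx_w, d_ui, p_los):
    """Rate under the average channel gain g_bar = 1 / L_bar, Eq. (5)."""
    gain = 1.0 / avg_path_loss(d_ui, p_los)
    noise_w = db_to_linear(NOISE_DBM) / 1000.0   # dBm -> W
    return BANDWIDTH * math.log2(1 + (p_tx_w * gain) / noise_w)

def meets_rate_threshold(rate_bps, rho_lb_bps):
    """Reliability condition of Eq. (6)."""
    return rate_bps >= rho_lb_bps

# Example: 200 mW transmit power, 180 m link, LoS probability 0.64.
print(f"rate = {achievable_rate(0.2, 180.0, 0.64) / 1e6:.1f} Mbps")
```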

II-B Device-to-Device (D2D) Time Delay Model

The D2D time is the delay required to complete the data collection and deliver it to the central station. The completion time consists of the time required to transfer the data from the IoT devices ($D_{data}$) and the time required by the UAVs to accomplish the mission ($D_{com}$). These times are given by:

$D_{data}=\sum_{i\in N}\frac{K_{i}}{\rho_{i,k}[M]}$ (8)
$D_{com}=\sum_{u\in U}\sum_{m=1}^{M}\tau_{u,m}$ (9)

where $K_{i}$ is the size of the IoT device's data packet, and $\tau_{u,m}$ is the $u^{th}$ UAV's travelling time between two successive time slots $m$ and $m+1$, which is calculated by:

$\tau_{u,m}=\frac{\lVert Q_{u}[m+1]-Q_{u}[m]\rVert}{V},\; m=1,\ldots,M$ (10)

where $V$ is the average speed of the $u^{th}$ UAV travelling between two consecutive locations.

Then, the total completion time of the $u^{th}$ UAV mission is given by:

$D_{tot}=D_{data}+D_{com}$ (11)

To respect the completion time constraint, the total completion time should satisfy the following inequality:

$D_{tot}\leq T_{max}$ (12)

where $T_{max}$ is the maximum time for a UAV to complete its mission.
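A minimal sketch of the delay bookkeeping in (8)-(12), assuming the flight speed and packet size of Table I; the mission deadline T_MAX below is a hypothetical value, not one reported in the paper.

```python
import math

SPEED = 10.0          # V: average UAV flight speed (m/s), Table I
PACKET_BITS = 1e6     # K_i: IoT data packet size (1 Mbit), Table I
T_MAX = 120.0         # hypothetical mission deadline (s), an assumption of this sketch

def data_time(rates_bps):
    """D_data, Eq. (8): transfer time summed over IoT devices."""
    return sum(PACKET_BITS / r for r in rates_bps)

def travel_time(waypoints):
    """D_com contribution of one UAV, Eqs. (9)-(10): sum of hop times along its trajectory."""
    total = 0.0
    for p0, p1 in zip(waypoints[:-1], waypoints[1:]):
        total += math.dist(p0, p1) / SPEED
    return total

def mission_feasible(rates_bps, trajectories):
    """Check the completion-time constraint D_tot <= T_max, Eqs. (11)-(12)."""
    d_com = sum(travel_time(wp) for wp in trajectories)
    return data_time(rates_bps) + d_com <= T_MAX

# Example: two devices at 2 Mbps each and one UAV flying three 100 m hops.
print(mission_feasible([2e6, 2e6], [[(0, 0, 100), (100, 0, 100), (100, 100, 100), (0, 100, 100)]]))
```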

II-C Energy Consumption Model

The energy consumption of a UAV is the total energy it consumes during the mission. The energy consumed by the $u^{th}$ UAV can be expressed as:

$\varepsilon_{u}=P_{u,oper}D_{tot}+P_{u,comm}D_{data}$ (13)

where $P_{u,oper}$ and $P_{u,comm}$ are the UAV's operating power and communication power, respectively. This energy captures what the UAV consumes to follow its planned trajectory and collect data while focusing on strategic locations, such that within each time frame $T$, every strategic location is visited at least once. The total energy consumed by all the UAVs in the swarm (in Joules) is then expressed as:

$\varepsilon_{tot}=\sum_{u\in U}\delta_{u,i,q}\cdot\varepsilon_{u}$ (14)

where $\delta_{u,i,q}$ is a binary indicator that encourages UAVs to pass through strategic locations and spend their energy collecting data from the IoT devices there; it is defined as:

$\delta_{u,i,q}=\begin{cases}1 & \text{if UAV } u \text{ is collecting data from IoT device } i \text{ located in strategic location } q_{i},\\ 0 & \text{otherwise}\end{cases}$ (15)
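To illustrate the energy bookkeeping of (13)-(15), the following sketch uses the operating and communication powers from Table I; for brevity it collapses the indicator $\delta_{u,i,q}$ into a per-UAV flag, which is a simplification of ours rather than the paper's exact formulation.

```python
P_OPER = 300.0   # UAV operating power (W), Table I
P_COMM = 5.0     # UAV communication power (W), Table I

def uav_energy(d_tot, d_data):
    """Per-UAV energy, Eq. (13): operating power over the whole mission
    plus communication power during data transfer."""
    return P_OPER * d_tot + P_COMM * d_data

def swarm_energy(per_uav, in_strategic):
    """Total swarm energy in the spirit of Eqs. (14)-(15).
    per_uav maps each UAV id to (d_tot, d_data); in_strategic maps each UAV id
    to a 0/1 flag standing in for the indicator delta_{u,i,q}."""
    return sum(uav_energy(*per_uav[u]) for u in per_uav if in_strategic[u])

# Example: two UAVs, only the first of which serves a strategic location.
per_uav = {0: (90.0, 12.0), 1: (80.0, 10.0)}
print(f"total energy = {swarm_energy(per_uav, {0: 1, 1: 0}):.0f} J")
```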

III problem formulation

The main objective is to minimize the total energy consumption of the UAVs in the swarm, presented in equation (14), by finding the optimal positions of the UAVs while respecting the maximum completion time and the minimum achievable data rate for reliable data collection. The problem formulation is expressed as follows:

$\min_{Q}\;\varepsilon_{tot}$ (16)

s.t. (6), (7), (12)

$\forall u\in U,\forall q\in Q,\;\sum_{i=1}^{N}\delta_{u,i,q}\geq 1$ (17)
$\delta_{u,i,q}\in\{0,1\}$ (18)

The objective function in equation (16) aims to minimize the total energy consumption of UAVs in the swarm by finding the optimal UAVs paths and optimal transmit power while navigating to cover the area, focusing on visiting strategic locations and collecting data from IoT devices. The mission needs to consider some constraints, including the maximum completion time, the minimum data rate, and the maximum altitude for the UAVs for reliable data transmission from IoT devices into the flying UAVs.

The objective function in equation (16) is NP-hard due to the non-linearity in constraints (6) and (7). Hence, we use meta-reinforcement learning to solve the optimization problem under its constraints and provide an efficient sub-optimal solution, as described in the next sections.

IV Meta-reinforcement learning for efficient energy consumption and path planning

The model handles a dynamic UAV swarm, i.e., at any moment, UAVs can join and leave the swarm (e.g., for recharging); hence, the approach needs to be flexible to these unforeseen changes while still respecting the constraints as the swarm flies to collect data from the area. Due to the complexity of the problem, conventional RL algorithms are slow to re-converge while respecting the path planning constraints; hence, we adopt meta-learning to quickly adapt the policy to new environments with high efficiency across related tasks [11]. In particular, a single model can be trained to learn new tasks effectively and converge quickly, rather than working correctly for only one single task.

Meta-RL applies meta-learning to reinforcement learning and aims to make the agents learn the general policy of related tasks. A task consists of a set of states and actions, dynamics, and rewards [11]. In particular, meta-RL agents do not aim to learn the optimal policy of a specific task; instead, they aim to learn a general policy that can be applied to new environments from the same family of tasks. The benefit is that it enables the agents to quickly reach the optimal policy of new environments within a minimal number of episodes.

In our RL approach, one episode contains a set of time steps, and in each step, the algorithm needs to choose the 3D position of each UAV. The decision to select the best position of a UAV in the swarm is based on several factors, including minimizing the energy consumption and ensuring the data is sent successfully from the ground IoT device to the UAV. Our RL approach is represented by a Markov Decision Process (MDP) $(S,A,P,R,\gamma)$, where $S$ is the environment state, $A$ is the action vector (the path-planning decision), $P$ is the probability of the possible transitions, $R$ represents the reward that the agent gets for each action, and $\gamma$ denotes the discount factor.

  1. Environment Modeling: The environment is a swarm of UAVs navigating the covered area while focusing on strategic locations. Let us define $\pi$ as the stochastic policy that the agent tries to learn, where $\pi:S\times A\rightarrow[0,1]$. The RL algorithm receives the environment details from a centralized agent that interacts with the environment, takes action $A$, gets either the reward or penalty for that action, and approaches the optimal policy $\pi^{*}$ through the reward $R$ received at each time step $t$. The algorithm tries to reach the optimal policy $\pi^{*}$ with the maximum $v^{\pi^{*}}(s)$ for all states $s\in S$. The value $v^{\pi}(s)$ denotes the expected feedback from the rewards obtained after acting in state $s$ under policy $\pi$, where $\mathbf{E}$ denotes the expected value, and it is given by:

    $v^{\pi}(s)=\mathbf{E}_{a_{t},s_{t+1}}\left(\sum_{k=t}^{\infty}\gamma^{k-t}R_{k}\,\middle|\,S_{t}=s\right)$ (19)
  2. States and Actions: The RL agent needs informative hints about the environment to improve the system performance. The state vector in our approach consists of the positions of the UAVs, the strategic locations, and the visited cells in the grid. The action of the agent is the UAV's movement direction, with a focus on reaching strategic locations.

  3. Reward function: The reward is a crucial factor in helping the algorithm learn the optimal policy. In our algorithm, the agent is positively rewarded when a UAV's direction is chosen such that it visits one of the strategic locations and consumes less energy. The agent gets a negative reward if the next UAV position violates the minimum required data rate or collides with other UAVs. The UAVs need to pass through strategic locations and collect data from the IoT devices. In particular, each strategic location has a demand-service requirement, proportional to its importance, that the UAVs need to partially satisfy when they pass through; this demand-service factor represents the quality of service (QoS) at these strategic locations. When UAVs pass through these strategic locations, they satisfy part of the demand service, denoted by $\phi_{\Omega}$. Therefore, following [7], we define the reward for satisfying the number of visits to these strategic locations as (a short code sketch of these MDP components is given after this list):

    $R_{\Omega}=\frac{1}{1+\sum_{i=1}^{\Omega}\phi_{\Omega}(t)}$ (20)

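The sketch below collects the state, action, and reward components described in this list into a small Python structure; the class and field names, the grid encoding, and the +1/-1 reward shaping are illustrative assumptions of ours, with $R_{\Omega}$ implemented as in (20).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Cell = Tuple[int, int]  # (row, col) index of a grid cell

@dataclass
class SwarmState:
    uav_cells: List[Cell]            # current cell of each UAV
    strategic_cells: List[Cell]      # cells marked as strategic locations
    visited: set = field(default_factory=set)  # cells already covered

ACTIONS = ["up", "down", "left", "right", "hover"]  # per-UAV movement directions

def strategic_reward(demand_satisfied: List[float]) -> float:
    """R_Omega of Eq. (20): reward tied to the demand already satisfied."""
    return 1.0 / (1.0 + sum(demand_satisfied))

def step_reward(state: SwarmState, new_cell: Cell,
                rate_ok: bool, collision: bool,
                demand_satisfied: List[float]) -> float:
    """Shaped reward sketch: penalty for infeasible moves, +1 for a feasible
    collision-free move, plus R_Omega when the move lands on a strategic cell."""
    if collision or not rate_ok:
        return -1.0
    reward = 1.0
    if new_cell in state.strategic_cells:
        reward += strategic_reward(demand_satisfied)
    return reward
```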
Algorithm 1 delineates the steps of the meta-RL algorithm for converging while respecting the constraints related to the objective function in (16). According to this algorithm, if the UAV does not collide with other UAVs, it gets a reward (lines 10-12). Then, if the UAV chooses a next step that consumes less energy while satisfying the minimum data rate and the maximum expected completion time, it also gets a reward. The agent gets an extra reward if it visits a strategic location at that time, encouraging the UAVs to collect data from the strategic locations (lines 13-16).

Algorithm 1 Meta-RL
1: Q-network parameters initialization $\theta$ and $\theta_{v}$
2: old Q-network parameters initialization $\theta^{\prime}$ and $\theta_{v}^{\prime}$
3: for each episode $r \in R$ do
4:     gradient initialization: $d\theta\leftarrow 0$ and $d\theta_{v}\leftarrow 0$
5:     Q-network initialization: $\theta^{\prime}=\theta$ and $\theta_{v}^{\prime}=\theta_{v}$
6:     for each $t \in T$ do
7:         for each $u\in\{1,2,3,\ldots,U\}$ do
8:             $S(u)=\{S_{u}[m],S_{SL}[m]\}$
9:             choose action $A_{i}$ based on $\epsilon$
10:            if $Q_{u}[t]\notin Q_{u}[T]$ and (6), (7), (12) then
11:                $R_{t}=R_{t}+1$
12:            if $S_{u}\in StrategicLocation$ then
13:                $energyEpisode\mathrel{+}=\varepsilon_{tot}(Q,N_{i})$  equation (16)
14:                $R_{t}=R_{t}+R_{\Omega}$  equation (20)
15:            observe $S_{i+1}$
16:            observe $R_{t}$
17:            system reset with new UAV positions
18:            save $(S_{i},A_{i},r_{i},S_{i+1})$ in replay memory
19:            sample a minibatch of $(S_{i},A_{i},r_{i},S_{i+1})$
20:            gradient accumulation w.r.t. $\theta^{\prime}$: $d\theta\leftarrow d\theta+\nabla_{\theta^{\prime}}\log\pi(a_{i}|s_{i};\theta^{\prime})(R-V(s_{i};\theta_{v}^{\prime}))$
21:            gradient accumulation w.r.t. $\theta_{v}^{\prime}$: $d\theta_{v}\leftarrow d\theta_{v}+\partial(R-V(s_{i};\theta_{v}^{\prime}))^{2}/\partial\theta_{v}^{\prime}$
22:    asynchronous update of $\theta$ using $d\theta$ and $\theta_{v}$ using $d\theta_{v}$
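For illustration, the following Python sketch mirrors the gradient-accumulation structure of Algorithm 1 (local parameter copies, per-step policy and value gradient accumulation, and an update at the end of each episode). It substitutes a toy linear softmax policy, a random stub environment, and plain NumPy for the paper's Q-networks and UAV-swarm simulator, and it omits the across-task meta-adaptation, so it is a structural sketch rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS, N_FEATURES = 5, 8
GAMMA, LR = 0.85, 1e-4        # discount factor and learning rate from Table I

theta = rng.normal(scale=0.1, size=(N_ACTIONS, N_FEATURES))   # policy parameters
theta_v = rng.normal(scale=0.1, size=N_FEATURES)              # value-function parameters

def policy(params, s):
    """Softmax policy over the per-UAV movement actions."""
    logits = params @ s
    p = np.exp(logits - logits.max())
    return p / p.sum()

def env_step(s, a):
    """Stand-in for the swarm environment of Algorithm 1 (lines 8-17):
    returns a random next state and a random shaped reward."""
    return rng.normal(size=N_FEATURES), rng.normal()

for episode in range(200):                          # lines 3-5
    d_theta, d_theta_v = np.zeros_like(theta), np.zeros_like(theta_v)
    theta_p, theta_v_p = theta.copy(), theta_v.copy()
    s = rng.normal(size=N_FEATURES)
    for t in range(20):                             # lines 6-9
        probs = policy(theta_p, s)
        a = rng.choice(N_ACTIONS, p=probs)
        s_next, r = env_step(s, a)
        advantage = r + GAMMA * (theta_v_p @ s_next) - (theta_v_p @ s)
        grad_log = -np.outer(probs, s)              # d/dtheta' log pi(a|s)
        grad_log[a] += s
        d_theta += grad_log * advantage             # line 20: policy gradient
        d_theta_v += advantage * s                  # line 21: value gradient
        s = s_next
    theta += LR * d_theta                           # line 22: end-of-episode update
    theta_v += LR * d_theta_v
```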

V simulation results and analysis

In our approach, we used a 440 m $\times$ 440 m area divided into 25 equal-sized cells. Among these 25 cells, three strategic locations are placed at different positions on the grid. For the environment type, we used an urban area with parameters $\omega_{1}=11.95$ and $\omega_{2}=0.14$. The values of $\psi_{LoS}$ and $\psi_{NLoS}$ are 3 dB and 23 dB, respectively. The list of parameters is summarized in Table I.

TABLE I: Simulation Parameters.
Parameter | Description | Value
$P_{max}$ | maximum transmit power | 200 mW
$\sigma^{2}$ | noise power | -170 dBm
$\gamma$ | discount factor | 0.85
$\alpha$ | learning rate | 0.0001
$B_{u}$ | bandwidth | 1 MHz
$K_{i}$ | IoT data packet size | 1 Mbit
$V$ | average flight speed | 10 m/s
$P^{oper}_{u}$ | operating power | 300 W
$P^{comm}_{u}$ | communication power | 5 W
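For convenience, the Table I values (together with the grid setup described above) can be collected into a single configuration object; the dictionary below is a sketch with key names of our choosing.

```python
# Simulation parameters from Table I and Section V, gathered for reuse in the
# earlier sketches; the key names are illustrative.
SIM_PARAMS = {
    "p_max_w": 0.2,          # maximum transmit power: 200 mW
    "noise_dbm": -170.0,     # noise power sigma^2
    "gamma": 0.85,           # discount factor
    "learning_rate": 1e-4,   # alpha
    "bandwidth_hz": 1e6,     # B_u: 1 MHz
    "packet_bits": 1e6,      # K_i: 1 Mbit
    "speed_mps": 10.0,       # V: average flight speed
    "p_oper_w": 300.0,       # UAV operating power
    "p_comm_w": 5.0,         # UAV communication power
    "area_m": 440,           # 440 m x 440 m grid
    "n_cells": 25,           # equal-sized cells
    "n_strategic": 3,        # strategic locations
}
```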
Figure 2: The number of visits to strategic locations in one time frame $T$.

Figure 2 shows the number of visits to strategic locations compared to non-strategic locations in one time frame $T$. As shown in the figure, our approach successfully encourages the UAVs to pass through these strategic locations more often than the other locations.

Figure 3: Adaptivity of the Meta-RL algorithm to environment changes during learning. The algorithm started with 4 UAVs, then 1 more UAV joined the swarm, and after that, 2 UAVs left the swarm. The Meta-RL algorithm learns the optimal policy quickly and converges to its maximum expected reward.

We investigate a practical scenario in which UAVs join and leave the swarm during learning. Figure 3 shows the corresponding average reward convergence. The simulation started with a swarm of four UAVs, and we highlight that Meta-RL converges rapidly, taking 900 episodes to reach the maximum expected reward. During learning, one UAV joined the swarm, and Meta-RL adapted to the new environment and converged very quickly to the maximum expected reward. After that, two UAVs departed the swarm (e.g., for recharging), and Meta-RL again converged to the maximum expected reward.

Figures 4(a) and 4(b) compare the Meta-RL solution and traditional RL algorithms in terms of energy consumption at strategic and non-strategic locations when the number of UAVs in the swarm changes. As shown in the figures, Meta-RL tends to spend more energy than the others on strategic locations and less energy on non-strategic locations, as its priority is to serve strategic locations. Nevertheless, Meta-RL outperforms the traditional RL algorithms in successfully satisfying the demand service needs of strategic locations, reaching 96% when 7 UAVs are used, as shown in Figure 4(c). Meta-RL also outperforms the traditional RL algorithms in convergence speed, as shown in Figure 4(d): it needs fewer episodes than the other algorithms to learn the optimal policy and hence learns faster.

Figure 4: Energy consumption of the different algorithms for strategic and non-strategic locations, convergence speed, and demand service satisfaction.

VI conclusion

In this article, we investigated the optimization of path planning and energy consumption for UAV swarms tasked with collecting data from IoT devices while visiting strategic locations more often than the other areas in the grid. We formulated the problem as an optimization problem that seeks the minimum energy consumption by controlling the UAV deployments and ensuring the minimum data rate. We proposed an efficient sub-optimal solution using meta-reinforcement learning to deal with dynamic environments and achieve fast convergence. We compared our proposed solution with three competitive solutions and showed the effectiveness of our approach in providing better demand service in strategic locations. Our simulation results demonstrated the effectiveness of the proposed method in adapting to sudden environment changes and converging quickly to the maximum reward. In future work, we plan to study the interference and the Doppler effect on UAV movements.

Acknowledgment

This work was made possible by NPRP grant # NPRP13S-0205-200265 from the Qatar National Research Fund (a member of Qatar Foundation). The findings achieved herein are solely the responsibility of the authors.

References

  • [1] Q. Wu, J. Xu, Y. Zeng, D. W. K. Ng, N. Al-Dhahir, R. Schober, and A. L. Swindlehurst, “A comprehensive overview on 5g-and-beyond networks with uavs: From communications to sensing and intelligence,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 10, pp. 2912–2945, 2021.
  • [2] J.-C. Padró, F.-J. Muñoz, J. Planas, and X. Pons, “Comparison of four uav georeferencing methods for environmental monitoring purposes focusing on the combined use with airborne and satellite remote sensing platforms,” International journal of applied earth observation and geoinformation, vol. 75, pp. 130–140, 2019.
  • [3] M. Mozaffari, W. Saad, M. Bennis, and M. Debbah, “Mobile unmanned aerial vehicles (uavs) for energy-efficient internet of things communications,” IEEE Transactions on Wireless Communications, vol. 16, no. 11, pp. 7574–7589, 2017.
  • [4] Z. Huang, C. Chen, and M. Pan, “Multiobjective uav path planning for emergency information collection and transmission,” IEEE Internet of Things Journal, vol. 7, no. 8, pp. 6993–7009, 2020.
  • [5] M. B. Ghorbel, D. Rodríguez-Duarte, H. Ghazzai, M. J. Hossain, and H. Menouar, “Joint position and travel path optimization for energy efficient wireless data gathering using unmanned aerial vehicles,” IEEE Transactions on Vehicular Technology, vol. 68, no. 3, pp. 2165–2175, 2019.
  • [6] S. Wan, J. Lu, P. Fan, and K. B. Letaief, “Toward big data processing in iot: Path planning and resource management of uav base stations in mobile-edge computing system,” IEEE Internet of Things Journal, vol. 7, no. 7, pp. 5995–6009, 2020.
  • [7] M. A. Dhuheir, E. Baccour, A. Erbad, S. S. Al-Obaidi, and M. Hamdi, “Deep reinforcement learning for trajectory path planning and distributed inference in resource-constrained uav swarms,” IEEE Internet of Things Journal, vol. 10, no. 9, pp. 8185–8201, 2023.
  • [8] M. Dhuheir, A. Erbad, and S. Sabeeh, “Llhr: Low latency and high reliability cnn distributed inference for resource-constrained uav swarms,” in 2023 IEEE Wireless Communications and Networking Conference (WCNC).   IEEE, 2023, pp. 1–6.
  • [9] Q. Zhang, W. Saad, M. Bennis, X. Lu, M. Debbah, and W. Zuo, “Predictive deployment of uav base stations in wireless networks: Machine learning meets contract theory,” IEEE Transactions on Wireless Communications, vol. 20, no. 1, pp. 637–652, 2021.
  • [10] X. Liu, Y. Liu, Y. Chen, and L. Hanzo, “Trajectory design and power control for multi-uav assisted wireless networks: A machine learning approach,” IEEE Transactions on Vehicular Technology, vol. 68, no. 8, pp. 7957–7969, 2019.
  • [11] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” in International conference on machine learning.   PMLR, 2017, pp. 1126–1135.
  • [12] Y. Zeng, X. Xu, and R. Zhang, “Trajectory design for completion time minimization in uav-enabled multicasting,” IEEE Transactions on Wireless Communications, vol. 17, no. 4, pp. 2233–2246, 2018.
  • [13] K. Chen, Y. Wang, J. Zhao, X. Wang, and Z. Fei, “Urllc-oriented joint power control and resource allocation in uav-assisted networks,” IEEE Internet of Things Journal, vol. 8, no. 12, pp. 10 103–10 116, 2021.