
User-centric Heterogeneous-action Deep Reinforcement Learning for Virtual Reality in the Metaverse over Wireless Networks

Wenhan Yu, Terence Jie Chua, and
Jun Zhao
The authors are all with Nanyang Technological University, Singapore. Corresponding author: Jun Zhao, Email: [email protected]. A 6-page short version containing partial results is accepted to the 2023 IEEE International Conference on Communications (ICC) [1].
Abstract

The Metaverse is emerging as maturing technologies empower its different facets. Virtual Reality (VR) technologies serve as the backbone of the virtual universe within the Metaverse to offer a highly immersive user experience. As mobility is emphasized in the Metaverse context, VR devices are made lighter at the expense of local computation capability. In this paper, for a system consisting of a Metaverse server and multiple VR users, we consider two cases: (i) the server generating frames and transmitting them to users, and (ii) users generating frames locally and thus consuming device energy. As the Metaverse emphasizes accessibility for all users anywhere and anytime, the users can have very different characteristics, devices, and demands. In this paper, the channel access arrangement (including the decisions on frame generation location) and the transmission powers for the downlink communications from the server to the users are jointly optimized by our proposed user-centric Deep Reinforcement Learning (DRL) algorithm, namely User-centric Critic with Heterogeneous Actors (UCHA). Comprehensive experiments demonstrate that our UCHA algorithm leads to remarkable results under various requirements and constraints.

Index Terms:
Metaverse, resource allocation, reinforcement learning, wireless networks.

I Introduction

I-A Background

The recent futuristic notion of the Metaverse is an extension, a complete simulation, and a mirror of the real world, empowered by maturing high-performance Extended Reality (XR) technologies. Among them, Virtual Reality (VR) provides users with a fully immersive digital world and has been used in many fields such as entertainment, socialization, and industry [2, 3]. To alleviate the obtrusive sense of restriction users feel when moving around wearing VR devices, the mobility of these devices is essential. Feasible solutions adopt wireless connections and reduce device weight at the expense of local computation capability. Thus, even state-of-the-art VR devices (e.g., HTC Vive [4]) do not have sufficient local computing power to support high-resolution and high-frame-rate applications. Hence, transferring some VR frame generation to a remote server is necessary. Furthermore, in the context of the Metaverse, the boundary between virtual and physical environments becomes increasingly blurred, which brings more frequent demands from users to access the virtual world. As the Metaverse emphasizes building the bridge between the virtual and real worlds and supporting user inputs anywhere and anytime, it is necessary to consider how to allocate the limited resources to a wide variety of VR users (VUs) with distinct characteristics and demands.

I-B Challenges and motivations

We first explain the challenges with motivations for this work from the following aspects.

User-centric features and user-diverse problems. The main reason for the Metaverse's popularity can be attributed to its user-centric services, which serve as the fundamental basis of the next-generation Internet [2]. The Metaverse is a perceived virtual universe and, ideally, allows people to access it anywhere and anytime, for any purpose. Compared to the traditional network structure, the user-centric features also pose a user-diverse problem: the wide variety of use cases (applications) and user-inherent characteristics (e.g., device types, battery power) induce highly different demands for frames per second (FPS), resolutions, etc. Further, the much larger size of transmitted data in VR can pose a challenge for devices with lower capabilities, which may struggle to complete tasks without sufficient support from remote computing resources. Therefore, the first and foremost challenge is how to achieve efficient utilization of the network resources in a multi-user scenario where each user has a different purpose of use and different requirements. This propels us to seek a more user-centric solution to handle widely different users.

Wireless communication for VR. VR is a key feature of an immersive Metaverse socialization experience. Compared to traditional two-dimensional images, generating $360^{\circ}$ panoramic images for the VR experience is computationally intensive. However, as the mobility of VR devices is of great importance in the Metaverse context, manufacturers have to reduce device weight at the cost of local computation capability. As a consequence, existing VR devices lack the local computation capability for high-resolution, high-frame-rate applications. A feasible solution to powering an immersive socialization experience on VR devices is to let the Metaverse server assist with frame generation and send the generated frames to VUs. However, given the frequent demand for and high congestion of network resources, it is necessary to lighten the network burden by occasionally assigning devices with higher local computing power to generate frames locally. Therefore, this paper considers two cases: (i) server generation, and (ii) local generation.

Joint optimization in wireless communication. In many scenarios, there is more than one important objective to be optimized. In the wireless scenario where both server and local generation are taken into account, two factors are critical to data transfer efficiency: (i) the channel access arrangement for VUs (including the option of being assigned no channel and generating frames locally), and (ii) the transmission power allocation for VUs. Based on this, this paper focuses on jointly optimizing these variables in the wireless downlink transmission scenario while considering the VUs' diverse characteristics.

I-C Related work and our novelty

The related work is organized along the challenges and motivations above, since our contributions span these multiple domains. The references and our novelty relative to them are expounded in the following.

User-centric Metaverse. The user-centric nature of the Metaverse is by now well recognized. Lee et al. [2] claim that the Metaverse is user-centric by design and will rely on pervasive network access; consequently, users with diverse purposes, devices, and demands can access the universe anytime and anywhere. Du et al. [5] emphasized the user-centric demands in the Metaverse and proposed an attention-aware network resource allocation considering the diversity of users' interests and applications. Both works introduced attractive and fresh concepts and discussed potential challenges and future directions. However, neither studied a specific problem scenario or designed novel algorithms to tackle it. In this paper, we design a concrete user-diverse problem scenario and a user-centric DRL structure accordingly.

VR over wireless communication. In recent years, VR services over wireless communication have been thoroughly studied. Yang et al. [6] investigated the problem of providing ultra-reliable and power-efficient VR strategies for wireless mobile VUs using Deep Reinforcement Learning (DRL) algorithms. To fit the discrete action space required by their DRL algorithm, they quantize the continuous actions into discrete actions. Xiao et al. [7] studied predictive VR video delivery by optimizing the bitrate with DRL methods. Other works have also demonstrated the excellent performance of DRL methods in wireless communications, owing to their ability to explore and exploit in self-defined environments [8, 9, 10]. However, none of the previous works considered the varying purposes of use and requirements of different VUs, and no existing work has designed a user-centric and user-oriented DRL method like our proposed solution.

Joint optimization in wireless communications with DRL methods. Some remarkable works have investigated using DRL methods to solve joint optimization problems [11, 12]. For instance, Guo et al. [13] solved the joint handover control and power allocation problem using Multi-agent Proximal Policy Optimization (MAPPO) and obtained satisfactory results. As mixed continuous-discrete actions complicate the direct application of MAPPO, they use discrete power requirement factors instead of a continuous power allocation to simplify the problem. Thus, the problem they addressed does not have heterogeneous actions (both discrete and continuous actions), so they could directly use MAPPO. He et al. [14] studied joint optimization of channel access and power allocation, using DRL to determine the channel assignment and traditional optimization methods for the power allocation given the channel information. However, none of these works considers a user-diverse scenario or problems with heterogeneous actions. Compared to them, we propose a novel Multi-Agent Deep Reinforcement Learning (MADRL) structure, which is equipped with a user-centric view and able to handle interactive and heterogeneous actions.

I-D Methodology and Contributions

This paper proposes a novel multi-user VR model in a downlink Non-Orthogonal Multiple Access (NOMA) system. Specifically, we jointly optimize the channel access arrangement and the downlink power allocation, taking the diversity among VUs into consideration. We design a novel MADRL algorithm, User-centric Critic with Heterogeneous Actors (UCHA), which considers the users' varying purposes of use and requirements and handles the heterogeneous action spaces. As the backbone of our algorithm, we re-design the widely used Proximal Policy Optimization (PPO) algorithm [15] with a reward-decomposition structure in the Critic and two asymmetric Actors.

Our contributions are as follows:

  • Formulating user-centric VR in the Metaverse: We study the user-centric Metaverse over the wireless network, designing a multi-user VR scenario where a Metaverse Server assists users in generating reality-assisted virtual environments.

  • Heterogeneous Actors for inseparable optimization variables: We create two asymmetric Actors that interact with each other to handle the inseparable, discrete-continuous mixed optimization variables. Specifically, Actor one handles the channel access arrangement, and Actor two handles the power allocation based on the solution produced by Actor one.

  • User-centric Critic for the user-diverse scenario: We craft a novel user-centric Critic with a more user-specific architecture, in which we decompose the reward across the VUs and evaluate the value for each VU. To the best of our knowledge, we are the first to embed the hybrid reward architecture in multi-agent reinforcement learning and to use it to solve communication problems.

  • Novel and comprehensive simulation: We conduct comprehensive experiments and design novel metrics to evaluate our proposed solution. The experimental results indicate that UCHA achieves the fastest convergence and attains the highest rewards among all baseline models. UCHA's allocations are more tailored to individual users and more reasonable, as they fulfill the users' differing requirements.

I-E Organization

The rest of the paper is organized as follows. Section II introduces our system model. Then Sections III and IV propose our deep reinforcement learning setting and algorithm. In Section V, extensive experiments are performed, and various methods are compared to show the prowess of our strategy. Section VI concludes the paper.

A 6-page short version is accepted by the 2023 IEEE International Conference on Communications (ICC) [1]. In that conference version, we compare the user-centric Critic structure with the normal Critic structure and demonstrate its remarkable performance and ability to accelerate convergence. We also demonstrate that using this structure in PPO is much superior to Hybrid Reward DQN [16]. However, that version considers neither the frame resolution nor the multi-agent structure for further optimizing the transmission power, both of which are highlights of this journal version. Furthermore, this paper also designs new algorithms, metrics, and evaluation methods compared to the conference version.

II System Model

Consider a multi-user wireless downlink transmission in an indoor environment, in which $T$ frames are generated by the Metaverse server and sent to $N$ VR Users (VUs) in one second. To ensure a smooth experience, we apply the clock signal from the server for synchronization and use a slotted time structure, where one second is divided into $T$ time slots (steps), and the duration of each slot is $\iota=\frac{1}{T}$. In each slot, a high-resolution 3D scene frame for each user is generated and sent to the $N$ VUs $\mathcal{N}=\{1,2,\ldots,N\}$ via a set of channels $\mathcal{M}=\{1,2,\ldots,M\}$. These VUs have distinct characteristics (e.g., local computation capability) and different requirements (e.g., frames per second (FPS)). The FPS requirement is decided by the specific application: assume one VU is playing a VR video game while another is having a virtual meeting; the FPS requirement of the former should evidently be much higher than that of the latter. Each user can accept VR frame rates as low as a minimum tolerable FPS $\tau_{n,F}$, i.e., the number of successfully received frames in one second ($T$ time slots). Our objective is to obtain channel access and downlink power arrangements for each VU.

II-A Channel allocation and frame generation

In terms of channel allocation, we define an $N\times T$ matrix $\boldsymbol{Z}$ such that the element in its $n$th row and $t$th column is $z_{n}^{t}$, for $n\in\{1,2,\ldots,N\}$ and $t\in\{1,2,\ldots,T\}$, indicating that the channel allocation for VU $n$ at $t$ is $z_{n}^{t}$. In other words, $\boldsymbol{Z}$ denotes the selection of the downlink channel arrangement. Specifically, in the studied system of one Metaverse server and $N$ VUs, our channel allocation for each VU $n\in\{1,2,\ldots,N\}$ includes two cases:

  • Case 1: Server-generated frame. If VU $n$ is assigned a channel $m$ at time step $t$ (i.e., $z_{n}^{t}=m$, $m\in\mathcal{M}$), the Metaverse server selects the $m$th channel for downlink communication with VU $n$ to deliver the frame that the server generates for VU $n$. In this case, if the sum delay (for generation and transmission) of the frame exceeds the slot duration $\iota$, the frame is deemed a failure.

  • Case 2: VU-generated frame. If VU $n$ is assigned no channel (i.e., $z_{n}^{t}=0$), it generates the frame locally with a lower computation capability, at the expense of energy consumption and without communicating with the server. In this case, each VU assigned to local generation produces the frame at the highest in-time-processable resolution (i.e., one that can be generated within $\iota$). If the generated resolution is below the minimum acceptable resolution, the frame is deemed a failure.

Thus, the channel allocation in this paper also includes the decisions on whether the frames for the VUs are generated by the server or by the VUs (for simplicity, we sometimes just say channel allocation without mentioning the frame-generation location decisions, since the former includes the latter). In other words, $z_{n}^{t}=m$, $m\in\mathcal{M}$ indicates that VU $n$ is assigned to channel $m$ at step $t$, and $z_{n}^{t}=0$ means that VU $n$ needs to generate the frame locally. Next, we explain the two cases.

Figure 1: System model. This figure illustrates a single time slot execution, where the frame generation contains two cases, server-generation and local-generation. Then, different metrics are evaluated and rewards are given accordingly.

II-B The case of the frame being generated and sent by server

In each time step, the server manages the downlink channels $\mathcal{M}$ of all VUs $\mathcal{N}$, and subsequently allocates the downlink transmission powers $\boldsymbol{p}^{t}$ given this Channel State Information (CSI). Here we define $\boldsymbol{P}:=[\boldsymbol{p}^{1},\boldsymbol{p}^{2},\ldots,\boldsymbol{p}^{T}]$ as the downlink power, where $\boldsymbol{p}^{t}:=[p_{1}^{t},p_{2}^{t},\ldots,p_{N}^{t}]$, $p_{n}^{t}$ is the power for VU $n$ at $t$, and we enforce $\sum_{n\in\mathcal{N}}p_{n}^{t}\leq p_{max}$. As the total delay in this case and the resolutions are important metrics, the achievable rate and resolutions are discussed in the following:

II-B1 Achievable rate from the server to each VU

We adopt the Non-Orthogonal Multiple Access (NOMA) system as this work's propagation model since it allows multiple users to share the same frequency band, which increases the capacity of the network [17]. In our NOMA system, several VUs can be multiplexed on one channel by superposition coding, and each VU exploits successive interference cancellation (SIC) at its receiver. The decoding process follows the approach described by Dai et al. [17]. Specifically, with $\boldsymbol{z}^{t}$ denoting $[z_{1}^{t},z_{2}^{t},\ldots,z_{N}^{t}]$, we let $\mathcal{N}_{m}^{t}(\boldsymbol{z}^{t})$ be the set of $N_{m}^{t}(\boldsymbol{z}^{t})$ users among the $N$ VUs that receive VR frames over channel $m$ at time step $t$. Formally, we have

\mathcal{N}_{m}^{t}(\boldsymbol{z}^{t}):=\{n\in\mathcal{N}\,|\,z_{n}^{t}=m\},\ \text{ and }\ N_{m}^{t}(\boldsymbol{z}^{t})=|\mathcal{N}_{m}^{t}(\boldsymbol{z}^{t})|. \qquad (1)

The part "$(\boldsymbol{z}^{t})$" indicates that $\mathcal{N}_{m}^{t}(\boldsymbol{z}^{t})$ and $N_{m}^{t}(\boldsymbol{z}^{t})$ are both functions of $\boldsymbol{z}^{t}$. Based on [17], we order the $N_{m}^{t}(\boldsymbol{z}^{t})$ VUs of $\mathcal{N}_{m}^{t}(\boldsymbol{z}^{t})$ such that the channel-to-noise ratios between them and the server on channel $m$ are in decreasing order. Formally, recalling that all $N$ VUs are indexed by $1,2,\ldots,N$, suppose that this ordering produces VU indices $u_{1},u_{2},\ldots,u_{N_{m}^{t}(\boldsymbol{z}^{t})}$; i.e., with $h_{i,m}^{t}$ denoting the channel attenuation between the server and VU $i$ over channel $m$ at time step $t$, and $(\sigma_{i,m}^{t})^{2}$ denoting the power spectral density of additive white Gaussian noise at VU $i$ over channel $m$ at time step $t$ (we write $(\sigma_{i,m}^{t})^{2}$ for generality; in our experiments of Section V, $(\sigma_{i,m}^{t})^{2}$ is the same $\sigma^{2}$ for all $i,m,t$), we define $u_{1},u_{2},\ldots,u_{N_{m}^{t}(\boldsymbol{z}^{t})}$ such that they together form the set $\mathcal{N}_{m}^{t}(\boldsymbol{z}^{t})$ of Eq. (1), and

\frac{|h^{t}_{u_{1},m}|^{2}}{(\sigma^{t}_{u_{1},m})^{2}}\geq\frac{|h^{t}_{u_{2},m}|^{2}}{(\sigma^{t}_{u_{2},m})^{2}}\geq\ldots\geq\frac{|h^{t}_{u_{N_{m}^{t}(\boldsymbol{z}^{t})},m}|^{2}}{(\sigma^{t}_{u_{N_{m}^{t}(\boldsymbol{z}^{t})},m})^{2}}. \qquad (2)

Note that if two VUs with indices $u$ and $u^{\prime}$ have the same channel-to-noise ratio, their ordering can be arbitrary; this paper places the larger index ahead of the smaller one, i.e., $u$ is before (resp., after) $u^{\prime}$ if $u$ is greater (resp., smaller) than $u^{\prime}$. Following the same rationale as [17], after decoding via SIC, the interference to VU $n$ satisfying $z_{n}^{t}=m$ comes from the signals sent by the server that are intended for users positioned before VU $n$ in the sequence $u_{1},u_{2},\ldots,u_{N_{m}^{t}(\boldsymbol{z}^{t})}$; formally, supposing $u_{\nu}$ is $n$ (such $\nu$ exists since $n\in\mathcal{N}_{m}^{t}(\boldsymbol{z}^{t})$ follows from $z_{n}^{t}=m$), those users are $u_{1},u_{2},\ldots,u_{\nu-1}$, and the sum of the interference to VU $n$ (i.e., $u_{\nu}$) is $\sum_{j=1}^{\nu-1}p_{u_{j}}^{t}|h_{n,m}^{t}|^{2}$, where $p_{i}^{t}$ is the transmit power used by the server for the signal intended for VU $i$ at time step $t$.

Then the achievable rate of VU $n$ over its assigned channel $z_{n}^{t}=m$ is

r_{n}^{t}(\boldsymbol{z}^{t},\boldsymbol{p}^{t})=W_{m}\log\left(1+\frac{p_{n}^{t}|h_{n,m}^{t}|^{2}}{\sum_{j=1}^{\nu-1}p_{u_{j}}^{t}|h_{n,m}^{t}|^{2}+W_{m}(\sigma_{n,m}^{t})^{2}}\right), \qquad (3)

for $\nu$ satisfying $u_{\nu}=n$ after defining the related notations via (1) and (2), where $W_{m}$ denotes the bandwidth of channel $m$.
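To make the SIC ordering of Eq. (2) and the rate of Eq. (3) concrete, the following minimal Python sketch evaluates the per-VU achievable rate for a given channel assignment and power vector (the function and variable names are our own illustrative assumptions, and the logarithm is taken base 2 to express the rate in bits per second):

```python
import numpy as np

def achievable_rates(z, p, h, sigma2, W):
    """Per-VU NOMA downlink rate, Eq. (3).

    z[n]          : channel of VU n (0 = local generation, 1..M = channel index)
    p[n]          : transmit power intended for VU n
    h[n, m-1]     : channel attenuation h_{n,m}^t
    sigma2[n, m-1]: noise power spectral density (sigma_{n,m}^t)^2
    W[m-1]        : bandwidth of channel m
    """
    N, M = len(z), len(W)
    r = np.zeros(N)
    for m in range(1, M + 1):
        users = [n for n in range(N) if z[n] == m]
        # SIC ordering of Eq. (2): decreasing channel-to-noise ratio,
        # ties broken by placing the larger VU index first.
        users.sort(key=lambda n: (np.abs(h[n, m - 1]) ** 2 / sigma2[n, m - 1], n),
                   reverse=True)
        for nu, n in enumerate(users):
            # Interference from signals intended for users ordered before VU n.
            interf = sum(p[users[j]] * np.abs(h[n, m - 1]) ** 2 for j in range(nu))
            sinr = p[n] * np.abs(h[n, m - 1]) ** 2 / (interf + W[m - 1] * sigma2[n, m - 1])
            r[n] = W[m - 1] * np.log2(1 + sinr)
    return r
```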

II-B2 Resolution of server-generated frame

We denote $D_{n}^{t}$ ($n\in\mathcal{N}$) as the transmission data size of the VR frame at time step $t$ that needs to be executed and transmitted by the server to user $n$, and $\mathcal{G}=\{G_{1},\ldots,G_{J}\}$ as the frame sizes of the different graphic resolutions in descending order (e.g., $G_{1}$ is 1440p and $G_{2}$ is 1080p), where $G_{1}$ and $G_{J}$ are the highest and lowest acceptable resolutions, respectively. For convenience, we also define $G_{0}=+\infty$ and $G_{J+1}=0$. $Res_{n}^{t}$ is the received frame resolution of VU $n$ at $t$ ($Res_{n}^{t}\in\mathcal{G}$). Assuming that the server-generated frames are all at the highest resolution ($G_{1}$), the transmission data size from the server is

D_{n}^{t}=\frac{Res_{n}^{t}}{Com_{n}^{t}}. \qquad (4)

In parallel with the fast-developing VR devices, video compression technologies are advancing as well. VR compression leverages the likeness of images from different cameras and uses advanced slicing and tiling techniques [18]. The compression ratio is not a constant but varies with the VR image quality and data size. To better fit the real situation, we use $Com_{n}^{t}$ as the varying compression ratio of the frame of VU $n$ at time step $t$.

Accordingly, the delay $d_{n}^{t}(\boldsymbol{z}^{t},\boldsymbol{p}^{t})$ of each frame in time step $t$ is divided into (1) execution time and (2) downlink transmission time:

d_{n}^{t}(\boldsymbol{z}^{t},\boldsymbol{p}^{t})=\frac{D_{n}^{t}\times c_{n}^{t}}{f_{v}}+\frac{D_{n}^{t}}{r_{n}^{t}(\boldsymbol{z}^{t},\boldsymbol{p}^{t})}, \qquad (5)

where $f_{v}$ is the computation capability of the server (i.e., cycles per second), and $c_{n}^{t}$ is the required number of cycles per bit of this frame [19].
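A short sketch of the server-generation delay of Eqs. (4)-(5), with toy numbers that are purely illustrative assumptions:

```python
def server_frame_delay(res_bits, com, c, f_v, rate):
    """Execution time plus downlink transmission time, Eq. (5).

    res_bits : raw frame size Res_n^t in bits
    com      : compression ratio Com_n^t
    c        : required CPU cycles per bit c_n^t
    f_v      : server computation capability (cycles/s)
    rate     : achievable rate r_n^t (bits/s)
    """
    D = res_bits / com                 # Eq. (4): transmitted data size
    return D * c / f_v + D / rate      # Eq. (5)

# Example: a 1440p stereo frame (16 bits/pixel, two eyes), compression ratio 400.
d = server_frame_delay(res_bits=2560 * 1440 * 16 * 2, com=400, c=100,
                       f_v=10e9, rate=50e6)
# In Case 1 the frame succeeds only if d <= iota = 1 / T.
```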

II-C The case of the frame being generated by VUs locally

When a VU is not allocated a channel, it needs to generate the VR frame locally, at the expense of energy consumption and resolution degradation, according to its local computing capability (CPU frequency). Let $f_{n}$ be the computation capability of VU $n$, which varies across VUs. Adopting the model from [20], the energy per cycle can be expressed as $e_{n,cyc}=\eta f_{n}^{2}$. Therefore, the energy consumption overhead of local computing can be derived as:

e_{n,l}^{t}=\begin{cases}\mu_{n}\times D_{n}^{t}\times c_{n}^{t}\times e_{n,cyc},&z_{n}^{t}=0,\\ 0,&z_{n}^{t}\neq 0.\end{cases} \qquad (6)

Here, $\mu_{n}$ is the battery weighting parameter of energy for VU $n$. The battery state of each VU can differ, and we assume $\mu_{n}$ is closer to 0 when the battery level is higher.

In terms of the resolution, if VU $n$ performs local generation, the resolution of the current frame degrades to the highest resolution that VU $n$ can process within the tolerable delay. Thus, the frame resolution of VU $n$ at $t$ is formulated as:

Res_{n}^{t}=\begin{cases}G_{1},&z_{n}^{t}\neq 0,\\ G_{J_{n}^{t}},&z_{n}^{t}=0,\end{cases} \qquad (7)

where

G_{J_{n}^{t}}/Com_{n}^{t}\leq f_{n}\times\iota\leq G_{(J_{n}^{t}-1)}/Com_{n}^{t},\ \text{and}\ 1\leq J_{n}^{t}\leq J.

Here, $J_{n}^{t}$ is the rank of the highest available resolution of VU $n$ among all resolutions $\mathcal{G}$, and $J$ is the number of resolution types. $\iota$ is the duration of each time step ($\iota=\frac{1}{T}$), and $f_{n}\times\iota$ is the maximum data size that can be locally generated by VU $n$ in one step. The overall system model is shown in Fig. 1.
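The local-generation case of Eqs. (6)-(7) can be sketched as follows (a minimal illustration under the assumptions above; the helper name and arguments are ours):

```python
def local_generation(f_n, iota, com, c, eta, mu_n, G):
    """Resolution and energy when VU n generates locally (z_n^t = 0).

    f_n  : local computation capability (cycles/s)
    iota : slot duration 1/T
    com  : compression ratio Com_n^t
    c    : cycles per bit c_n^t
    eta  : coefficient so that energy per cycle e_cyc = eta * f_n^2
    mu_n : battery weighting parameter
    G    : acceptable resolutions in descending order (G_1, ..., G_J), in bits
    Returns (resolution, local energy, success flag).
    """
    budget = f_n * iota                      # max data size processable in one slot
    for G_j in G:                            # pick the highest in-time-processable one
        if G_j / com <= budget:
            D = G_j / com
            energy = mu_n * D * c * eta * f_n ** 2   # Eq. (6), z_n^t = 0 branch
            return G_j, energy, True
    return 0, 0.0, False                     # below the minimum acceptable: frame failure
```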

II-D Problem formulation

Different users have different purposes of use (video games, group chat, etc.). Therefore, they also have different expectations of a satisfactory FPS $\tau_{n,F}$. For each frame, a total delay exceeding the tolerable threshold (which can occur in "Case 1") or an insufficient resolution ($Res_{n}^{t}<G_{J}$, which can occur in "Case 2") leads to a frame failure. We set a frame success flag $I_{n}^{t}$ for VU $n$ at step $t$ as:

I_{n}^{t}=\begin{cases}0,&\text{if }z_{n}^{t}\in\{1,2,\ldots,M\}\text{ and }d_{n}^{t}(\boldsymbol{z}^{t},\boldsymbol{p}^{t})>\iota,\\ 0,&\text{if }z_{n}^{t}=0\text{ and }Res_{n}^{t}<G_{J},\\ 1,&\text{otherwise}.\end{cases} \qquad (8)

Our goal is to decide the channel allocation and downlink power allocation for the transmission of $T$ frames. The objectives are to 1) fulfill the FPS requirements of the different users as much as possible by minimizing the total transmission failures, 2) optimize the VUs' device energy usage with respect to their battery states, and 3) increase the resolution of the frames received by the VUs.

We define an indicator function $\chi[A]$ which takes the value $1$ if event $A$ occurs and $0$ otherwise. From the above discussion, our optimization problem is

\max\limits_{\boldsymbol{Z},\boldsymbol{P}}\left\{\omega_{1}\left(\min\limits_{n\in\mathcal{N}}\left[\sum_{t=1}^{T}I_{n}^{t}-\tau_{n,F}\right]\right)+\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T}\left[\omega_{3}Res_{n}^{t}-\omega_{2}e_{n,l}^{t}\right]\right\} \qquad (9)
s.t.\ \ C1:\ z_{n}^{t}\in\{0,1,\ldots,M\},\ \forall n\in\mathcal{N},\ \forall t\in\{1,2,\ldots,T\}, \qquad (10)
\phantom{s.t.\ \ }C2:\ \sum_{n\in\mathcal{N}}\chi[z_{n}^{t}\neq 0]\,p_{n}^{t}\leq p_{max},\ \forall t\in\{1,2,\ldots,T\}. \qquad (11)

Here, $\omega_{1},\omega_{2},\omega_{3}$ are the weighting parameters for frame failures, local device energy consumption, and received frame resolution, respectively. In practice, these parameters are reflected in the reward setting of Section III. The first part of the objective function (i.e., $\min_{n\in\mathcal{N}}[(\sum_{t=1}^{T}I_{n}^{t})-\tau_{n,F}]$) drives every VU to fulfill its FPS requirement as much as possible, and the second part minimizes the local energy usage and maximizes the frame resolutions. Constraint $C1$ defines the integer optimization variable, which denotes the computing method and channel assignment for each user at every time step. Constraint $C2$ limits the total downlink transmission power over all VUs.
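For reference, the objective of Eq. (9) for one episode can be evaluated as in the following sketch (array names and shapes are illustrative assumptions):

```python
import numpy as np

def objective(I, Res, E_local, tau_F, w1, w2, w3):
    """Value of Eq. (9) for one episode.

    I[n, t]       : success flags I_n^t from Eq. (8)
    Res[n, t]     : frame resolutions Res_n^t
    E_local[n, t] : local energy e_{n,l}^t
    tau_F[n]      : required FPS of VU n
    """
    N, T = I.shape
    worst_vu = np.min(I.sum(axis=1) - tau_F)          # min_n [sum_t I_n^t - tau_{n,F}]
    per_vu = (w3 * Res - w2 * E_local).sum() / N      # resolution minus energy, averaged over VUs
    return w1 * worst_vu + per_vu
```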

II-E The execution flow of the system

Here we describe the execution flow of the system given a solution to the optimization problem. The key idea of the two-case design (server or local generation) is to utilize the limited network resources by letting some VUs spare the resources through local generation at each time step. Thus, this paper uses a slotted time structure, applying the clock signal from the server for synchronization. Specifically, in each time slot $t$, one frame for each VU is executed. Considering the very short duration $\iota$ of each time slot, we assume that the channel attenuation $h_{n,m}^{t}$ varies across slots but remains constant within one slot. At the beginning of each slot $t$ (frame), the server collects the time-varying information of each VU (e.g., channel attenuation, compression rate) and sends the channel access decisions (local or server generation, and, for server generation, which channel) to all VUs through a dedicated channel. Then, the VUs assigned no channel immediately generate the frame locally, while the others wait for the server to transmit the frames to them.

II-F Motivation of using MADRL in the optimization problem

The optimization variables in the formulated problem (the computation method choice, channel arrangement, and power allocation) make it a highly coupled mixed-integer non-linear programming (MINLP) problem, where the discrete variable $\boldsymbol{Z}$ and the continuous variable $\boldsymbol{P}$ are inseparable; such problems are NP-hard [21]. Moreover, the formulated problem is time-sequential, and the number of variables grows with $T$. Thus, traditional optimization methods are unsuitable for our proposed problem due to the daunting computational complexity. Also, as the problem contains many random variables, model-based reinforcement learning (RL) approaches that require transition probabilities are infeasible for tackling it. Heuristic search methods could possibly handle sequential problems; however, they do not improve the approximation of the policy, but merely make improved action selections given the current value function, and they usually consider a huge tree of possible continuations [22], which makes them impractical in such a complicated scenario with a huge decision dimension. Therefore, it is highly necessary to design a comprehensive model-free Multi-Agent Deep Reinforcement Learning (MADRL) method to tackle the problem with heterogeneous optimization variables and distinct requirements from all VUs.

III Environment in Our Multi-Agent Deep Reinforcement Learning (MADRL)

To tackle a problem with a DRL method, designing a comprehensive reinforcement learning environment based on the formulated problem is the first and foremost step. For a reinforcement learning environment, the most important components are (1) State: the key factors for an agent to make a decision; (2) Action: the operation decided by an agent to interact with the environment; and (3) Reward: the feedback for the agent to evaluate the action taken in this state. We expound on these three components next.

III-A State

In the DRL environment, weeding out less relevant and less time-varying variables is essential. We set two agents: $Agent_{1}$ for optimizing the channel arrangement, and $Agent_{2}$ for allocating the downlink transmission power.

III-A1 $Agent_{1}$ State $s_{1}^{t}$

We include the following attributes in the state: (1) each VU's frame size $D_{n}^{t}=Res_{n}^{t}/Com_{n}^{t}$; (2) each VU's remaining tolerable frame transmission failure count; (3) the channel attenuation of each VU, $h_{n,m}^{t}$; (4) the remaining number of time slots, $(T-t)$; (5) the gap from the requirement for each VU, $\tau_{n,F}-\sum_{i=1}^{t}I_{n}^{i}$.

III-A2 $Agent_{2}$ State $s_{2}^{t}$

The power allocation can only be performed after the CSI, i.e., the channel arrangement, is obtained. As a result, the action of $Agent_{1}$ is significant to the decision of $Agent_{2}$. Besides this action, the other important attributes in $s_{1}^{t}$ should be considered by $Agent_{2}$ as well. Therefore, we use the concatenation as the state: $s_{2}^{t}=\{a_{1}^{t};s_{1}^{t}\}$.

III-B Action

Appropriate action settings, directly related to the optimization variables, are critical for finding a good solution. In this environment, $a_{1}^{t}$ and $a_{2}^{t}$ are the actions of $Agent_{1}$ and $Agent_{2}$, respectively, explained below.

Figure 2: UL action encoding method. The action $a_{1}^{t}=\sum_{n=1}^{N}z_{n}^{t}(M+1)^{n-1}$.

III-B1 $Agent_{1}$ Action $a_{1}^{t}$

The discrete action of $Agent_{1}$ is the channel allocation $\boldsymbol{z}^{t}=\{z_{1}^{t},z_{2}^{t},\ldots,z_{N}^{t}\}$. In a DRL environment, discrete actions are represented by discrete indices, so we need to assign consecutive discrete indices to the possible allocations. As $\boldsymbol{z}^{t}$ contains $N$ elements ($N$ VUs), and each element has $M+1$ possible values ($M$ channels plus one for the local-generation case), we treat $\boldsymbol{z}^{t}$ as an $N$-digit base-$(M+1)$ number and decode it into a decimal index, as shown in Fig. 2:

a_{1}^{t}=\sum_{n=1}^{N}z_{n}^{t}(M+1)^{n-1},\ \forall t\in\{1,\ldots,T\}. \qquad (12)
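The encoding of Eq. (12) and its inverse can be written compactly as below (a minimal sketch; the function names are ours):

```python
def encode_channel_action(z, M):
    """Eq. (12): map z^t = (z_1, ..., z_N), z_n in {0, ..., M}, to one discrete index."""
    return sum(z_n * (M + 1) ** n for n, z_n in enumerate(z))

def decode_channel_action(a, N, M):
    """Inverse mapping: recover the per-VU channel assignment from the index."""
    z = []
    for _ in range(N):
        z.append(a % (M + 1))
        a //= (M + 1)
    return z

# Example with N = 3 VUs and M = 3 channels:
# z = [2, 0, 1] (VU 1 on channel 2, VU 2 local, VU 3 on channel 1)
# gives a_1^t = 2 + 0*4 + 1*16 = 18, and decode_channel_action(18, 3, 3) == [2, 0, 1].
```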

III-B2 $Agent_{2}$ Action $a_{2}^{t}$

The continuous action of $Agent_{2}$ should be the downlink transmission powers $\{p_{1}^{t},p_{2}^{t},\ldots,p_{N}^{t}\}$. However, it is impractical for the RL agent to allocate absolute powers to all VUs while respecting the sum-power constraint. Therefore, we add a softmax layer to $Agent_{2}$, so that $a_{2}^{t}$ becomes the set of portions of $p_{max}$. We use $P_{n}^{t}$ to denote the portion allocated to VU $n$ at $t$:

a_{2}^{t}=\{P_{1}^{t},P_{2}^{t},\ldots,P_{N}^{t}\},\ \forall t\in\{1,\ldots,T\}. \qquad (13)
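A sketch of how the softmax outputs can be mapped to downlink powers that respect constraint $C2$ (the masking of locally generating VUs is our own illustrative assumption):

```python
import numpy as np

def allocate_power(logits, z, p_max):
    """Turn Agent_2's raw network outputs into downlink powers under the sum constraint.

    logits : length-N raw Actor outputs
    z      : channel assignment; only VUs with z[n] != 0 receive downlink power
    p_max  : total downlink power budget of the server
    """
    logits = np.asarray(logits, dtype=float)
    portions = np.exp(logits - logits.max())
    portions /= portions.sum()                 # softmax -> portions P_n^t of Eq. (13)
    powers = p_max * portions
    powers[np.asarray(z) == 0] = 0.0           # locally generating VUs get no power
    return powers                              # sum of powers <= p_max, i.e., C2 in Eq. (11)
```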

III-C Reward

In the traditional CTDE framework [23], the rewards are shared by the different agents. However, in our scenario, the objectives and optimization variables of the two agents are highly different; e.g., if VU $n$ is allocated to local computation, the rewards for energy consumption and resolution degradation are irrelevant to $Agent_{2}$. Therefore, we design the rewards for $Agent_{1}$ and $Agent_{2}$ separately. Moreover, as we consider a user-centric scenario, we decompose the rewards of $Agent_{1}$ and $Agent_{2}$ across the different VUs, and construct a novel algorithm accordingly in Section IV.

III-C1 $Agent_{1}$ Reward for each VU $R_{1}^{t}[n]$

$Agent_{1}$ is responsible for the channel access arrangement, which takes an important position in the whole system. The rewards for any VU $n$ in $Agent_{1}$ contain: (1) a descending reward $R_{r}^{t}[n]$ for the received frame resolution, from high to low; (2) a penalty $R_{f}^{t}[n]$ for every transmission failure; (3) a weighted penalty $R_{e}^{t}[n]$ for energy consumption corresponding to the VU's battery state, $\omega_{e}\times e_{n,l}^{t}(\mu_{n})$. To fulfill our objective in Eq. (9), we also give (4) a large reward/penalty $R_{w}^{t}[n]$ according to the worst VU at the final step, $\omega_{end}\times\left(\min_{n\in\mathcal{N}}\left[\left(\sum_{t=1}^{T}I_{n}^{t}\right)-\tau_{n,F}\right]\right)$. However, sparse rewards are devastating to a goal-conditioned DRL environment [24]: giving a large penalty only at the final step makes the rewards sparse, as the FPS-related feedback, a very important metric in the proposed problem, is not visible to the agent during the episode, which makes training hard and convergence slow. To avoid sparse rewards and accelerate training, we set an early-termination flag: for any VU, if the number of frame failures exceeds the tolerable failure count (failure count of $\text{VU}_{n}>T-\tau_{n,F}$), we assign (5) an additional early-termination penalty $R_{term}^{t}[n]$ proportional to the number of remaining frames, $\omega_{f}\times(T-t)$, and end the episode immediately. The parameters $\omega_{e},\omega_{end},\omega_{f}$ above are all hyper-parameters to be set in the experiments.
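The per-VU reward of $Agent_{1}$ can be assembled as in the following sketch (the argument names and the scaling by $N$ follow the description above; the exact reward magnitudes are hyper-parameters and are not specified here):

```python
def agent1_reward(success, res_reward, fail_penalty, energy, w_e,
                  is_final_step, worst_gap, w_end,
                  early_termination, frames_left, w_f, N):
    """Per-VU reward R_1^t[n] combining the five components of Section III-C1.

    success           : frame success flag I_n^t
    res_reward        : descending resolution reward R_r^t[n]
    fail_penalty      : magnitude of the transmission-failure penalty R_f^t[n]
    energy, w_e       : local energy e_{n,l}^t and its battery-dependent weight
    worst_gap         : min_n [sum_t I_n^t - tau_{n,F}], applied only at the final step
    early_termination : True if this VU's failures exceeded T - tau_{n,F}
    frames_left       : T - t, scaling the early-termination penalty
    """
    r = res_reward if success else -fail_penalty
    r -= w_e * energy
    if is_final_step:
        r += w_end * worst_gap                 # reward/penalty R_w^t[n] for the worst VU
    if early_termination:
        r -= w_f * frames_left                 # early-termination penalty R_term^t[n]
    return r / N                               # rewards are scaled by the number of VUs
```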

III-C2 $Agent_{2}$ Reward for each VU $R_{2}^{t}[n]$

$Agent_{2}$ is responsible for the downlink power allocation. We remove from $R_{1}^{t}[n]$ the rewards that are not related to $Agent_{2}$. Therefore, the rewards for every VU $n$ in $Agent_{2}$ contain: (1) the resolution reward $R_{r}^{t}[n]$, (2) the transmission failure penalty $R_{f}^{t}[n]$, (3) the large penalty for the worst VU, $R_{w}^{t}[n]$, and (4) the early-termination (task failure) penalty $R_{term}^{t}[n]$.

For every reward, we narrow the range by dividing by the number of VUs $N$ to ease training.

IV Our User-centric MADRL Approach

Our proposed User-centric Critic with Heterogeneous Actors (UCHA) structure uses the state-of-the-art Proximal Policy Optimization (PPO) algorithm as its backbone. Inspired by the effective Hybrid Reward Architecture (HRA) [16], we design a user-centric Critic, which evaluates the current state-value from a more user-specific view. Considering the heterogeneous action space that incorporates both discrete and continuous actions, we create a heterogeneous Actor structure. The preliminaries, PPO (the backbone) and HRA (the inspiring structure), are introduced first; we then explain UCHA.

IV-A Preliminary

IV-A1 Backbone: Proximal Policy Optimization (PPO)

Why PPO? As we emphasize developing a user-centric model that considers VUs' varying purposes of use and requirements and uses multiple agents with heterogeneous actions, the algorithm's policy stability and its ability to handle both discrete and continuous actions are essential. Therefore, algorithms based on the Advantage Actor-Critic structure [25], which directly evaluate the V values of states instead of the Q values of actions, are an advisable choice. Among them, Proximal Policy Optimization (PPO) by OpenAI [15] is an enhancement of the traditional Advantage Actor-Critic that fulfills the two requirements: it has better sample efficiency, by using two separate policies for sampling and training, and it is more stable, by applying policy constraints.

PPO has been actively used to solve wireless communication and Metaverse problems [26, 27], and its prowess has been demonstrated in many scenarios, such as the recently widely discussed chatbot ChatGPT [28]. Next, we introduce the pipeline of PPO and expound on its two pivotal features: (i) importance sampling and (ii) the policy constraint.

Use of importance sampling. Importance sampling (IS) refers to using another distribution to approximate the original distribution [29]. To increase sample efficiency, PPO uses two separate policies (distributions) for training and sampling to better utilize the collected trajectories, following the theory of importance sampling [15]. To distinguish the two policies, we use $\pi_{\theta}$ and $\pi_{\bar{\theta}}$ to denote the policies for training and sampling, where $\pi$ is the policy network and $\theta,\bar{\theta}$ are its parameters. In practice, the Temporal-Difference-1 (TD1) state-action pairs are used, and the objective function is reformulated as:

J(\theta)=\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\theta}}\left[\pi_{\theta}(s^{t},a^{t})A^{t}\right]=\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\bar{\theta}}}\left[\frac{\pi_{\theta}(s^{t},a^{t})}{\pi_{\bar{\theta}}(s^{t},a^{t})}A^{t}\right]\approx\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\bar{\theta}}}\left[\frac{\pi_{\theta}(a^{t}|s^{t})}{\pi_{\bar{\theta}}(a^{t}|s^{t})}A^{t}\right], \qquad (14)

where $A^{t}$ is short for the advantage function $A(s^{t},a^{t})$ (a standard term in reinforcement learning measuring how good or bad a certain action is given a certain state). How to sensibly estimate $A$ has been widely discussed. In this paper, the advantage function is estimated by a truncated version of generalized advantage estimation (GAE) [30], which is explained in Section IV-B. We assume $\pi_{\theta}(s^{t})=\pi_{\bar{\theta}}(s^{t})$ here, as calculating the probabilities of state occurrences is impractical [15], and we use $\approx$ instead of $=$ to avoid misunderstanding.

Add KL-divergence penalty. To increase stability, we need to reduce the distance between the two distributions $\pi_{\theta}$ and $\pi_{\bar{\theta}}$. Trust Region Policy Optimization (TRPO) [31], the predecessor of PPO, adds a Kullback-Leibler (KL) divergence constraint to the objective function to limit the distance between the two distributions by directly imposing $D_{KL}(\pi_{\theta}||\pi_{\bar{\theta}})\leq\varrho$, where $\varrho$ is a hyper-parameter set in the experiments. Nonetheless, this constraint is imposed on every observation and is very hard to use in practice. Thus, PPO re-formulates its objective function into the following [15]:

\Delta\theta=\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\bar{\theta}}}\left[\triangledown f^{t}(\theta,A^{t})\right], \qquad (15)

where

f^{t}(\theta,A^{t})=\min\left\{\frac{\pi_{\theta}(a^{t}|s^{t})}{\pi_{\bar{\theta}}(a^{t}|s^{t})}A^{t},\ \mathrm{clip}\left(\frac{\pi_{\theta}(a^{t}|s^{t})}{\pi_{\bar{\theta}}(a^{t}|s^{t})},1-\epsilon,1+\epsilon\right)A^{t}\right\}. \qquad (16)

Critic loss. Considering the particularity of our Critic structure, we separate the Critic and Actor networks instead of using shared layers as in the popular Stable Baselines3 (SB3) implementation [32]. The loss function is as follows:

L(\phi)=\left(V_{\phi}(s^{t})-V_{target}^{t}\right)^{2}, \qquad (17)

where $V_{target}^{t}=A^{GAE}+V_{\phi^{\prime}}(s^{t})$. $V$ is the Critic (value) network and $\phi$ denotes its parameters, so $V_{\phi}(s)$ is the state-value [22] of state $s$ produced by the Critic network. Note that $\phi^{\prime}$ is the parameter of the target Critic network, which is periodically replaced by $\phi$; this prevailing trick increases the stability of the target [22]. The advantage function is estimated by the GAE algorithm, and we simply write $A^{GAE}$ here.
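The clipped surrogate of Eqs. (15)-(16) and the Critic loss of Eq. (17) can be implemented as in the following PyTorch-style sketch (a generic PPO loss, not the exact UCHA update, which is given in Section IV-B):

```python
import torch

def ppo_losses(logp_new, logp_old, adv, v_pred, v_target, eps=0.2):
    """Clipped PPO surrogate (Eqs. (15)-(16)) and Critic MSE loss (Eq. (17)).

    logp_new : log pi_theta(a^t | s^t) from the policy being trained
    logp_old : log pi_theta_bar(a^t | s^t) from the sampling policy (detached)
    adv      : advantage estimates A^t (e.g., GAE)
    v_pred   : V_phi(s^t);  v_target = A^GAE + V_phi'(s^t)
    """
    ratio = torch.exp(logp_new - logp_old)                    # importance-sampling ratio
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    actor_loss = -surrogate.mean()                            # ascend f^t by descending -f^t
    critic_loss = ((v_pred - v_target) ** 2).mean()
    return actor_loss, critic_loss
```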

IV-A2 Hybrid Reward Architecture (HRA)

Since communication scenarios often involve mixed objectives, which lead to hybrid rewards in reinforcement learning, we usually need a way to tackle the hybrid-reward problem. In this scenario, the hybrid rewards refer to the different rewards of different VUs, as we consider their distinct inherent characteristics and requirements. The issue of using RL to optimize a high-dimensional objective function was first studied in [16], which proposed the HRA structure for Deep Q-learning (DQN): the reward is decomposed into different reward functions, one per objective, e.g., decomposing the reward into energy-consumption and delay-minimization objectives and calculating two different losses accordingly. The illustration of HRA is shown in Fig. 3. HRA can exploit domain knowledge to a much greater extent and works remarkably well in environments with different roles and objectives; it has been demonstrated to speed up learning with better performance in various domains such as video game playing [33], mimicry learning, and complementary tasks [34]. However, to the best of our knowledge, no one has explored this structure in depth or used it to solve communication problems. This paper re-designs the normal Critic into a user-centric Critic by embedding the HRA structure: specifically, we decompose the rewards across the VUs instead of across objectives, and calculate the loss of each VU accordingly. Next, we expound on our novel and effective algorithm, User-centric Critic with Heterogeneous Actors (UCHA).

Figure 3: The Hybrid Reward Architecture from [16]. The reward is decomposed into multiple domains, and different from the normal value network which only outputs one Q value for the action, HRA outputs multiple Q values for every domain.

IV-B User-centric Critic with Heterogeneous Actors (UCHA)

In contrast to decomposing the overall reward into separate sub-goal rewards as done in HRA, we build a user-centric reward-decomposition Critic, which takes in the rewards of the different users and calculates their values separately. In other words, we give the network a view of the value for each user, instead of merely evaluating the overall value of an action based on an overall state. Simultaneously, considering the highly different roles, action spaces, and objectives of the two agents, we design the heterogeneous Actors structure. The two agents are updated by different rewards and advantages, as expounded in the following.

Function process: In each episode, when the current transmission is accomplished with the channel access arrangement $a_{1}^{t}$ from $Agent_{1}$ and the downlink power allocation $a_{2}^{t}$ from $Agent_{2}$, the environment issues every VU's rewards, $\boldsymbol{R}_{1}^{t}=\{R_{1}^{t}[1],R_{1}^{t}[2],\ldots,R_{1}^{t}[N]\}$ for $Agent_{1}$ and $\boldsymbol{R}_{2}^{t}=\{R_{2}^{t}[1],R_{2}^{t}[2],\ldots,R_{2}^{t}[N]\}$ for $Agent_{2}$, as feedback for the different VUs. The global state $s_{1}^{t}$ and the next global state $s_{1}^{t+1}$ are sent to the user-centric Critic to generate the state-values for each VU. Then, the rewards $\boldsymbol{R}_{1}^{t}$ and $\boldsymbol{R}_{2}^{t}$, together with the state-values, are used to calculate the advantages $\boldsymbol{A}_{1}^{t}=\{A_{1}^{t}[1],A_{1}^{t}[2],\ldots,A_{1}^{t}[N]\}$ and $\boldsymbol{A}_{2}^{t}=\{A_{2}^{t}[1],A_{2}^{t}[2],\ldots,A_{2}^{t}[N]\}$ for $Agent_{1}$ and $Agent_{2}$. The user-centric Critic takes in the global state and advantages to calculate the Critic losses $\{L_{1},L_{2},\ldots,L_{N}\}$ for the different VUs, and updates with the sum of the losses, similar to HRA. Note that the global state ($s_{1}^{t}$), reward ($\boldsymbol{R}_{1}^{t}$), and advantage used by the Critic all come from $Agent_{1}$, as those of $Agent_{2}$ are mainly a subset extracted from $Agent_{1}$. In terms of the Actors, we use $\boldsymbol{A}_{1}^{t}$ and $\boldsymbol{A}_{2}^{t}$ to update the Actors of $Agent_{1}$ and $Agent_{2}$ separately. The above process is illustrated in Fig. 4.

Figure 4: The structure of User-centric Critic with Heterogeneous Actors (UCHA). The top of the figure is the function process and overall map of UCHA, while the bottoms are the network structures in practice (coding). The two distinct Actors take in different states and select different actions, and the Critic evaluates the global state, outputting values for each VU, separately.

Heterogeneous Actors update function: In Eq. (15) (i.e., $\Delta\theta=\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\bar{\theta}}}[\triangledown f^{t}(\theta,A^{t})]$), we established the policy gradient of the PPO Actor; our UCHA uses the sum of the advantages of all VUs for the update. Then, we have the gradients $\Delta\theta_{1}$ and $\Delta\theta_{2}$ for $Agent_{1}$ and $Agent_{2}$ as:

\Delta\theta_{1}=\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\theta^{\prime}}}\left[\triangledown f^{t}\left(\theta_{1},\sum_{n=1}^{N}A_{1}^{t}[n]\right)\right], \qquad (18)
\Delta\theta_{2}=\mathbb{E}_{(s^{t},a^{t})\sim\pi_{\theta^{\prime}}}\left[\triangledown f^{t}\left(\theta_{2},\sum_{n=1}^{N}A_{2}^{t}[n]\right)\right], \qquad (19)

where $A_{1}^{t}[n]$ and $A_{2}^{t}[n]$ denote the advantages of VU $n$ for $Agent_{1}$ and $Agent_{2}$, respectively. A truncated version of generalized advantage estimation (GAE) [35] is chosen as the advantage function:

A_{1}^{t}[n]=\delta_{1}^{t}[n]+(\gamma\lambda)\delta_{1}^{t+1}[n]+\ldots+(\gamma\lambda)^{\bar{T}-1}\delta_{1}^{t+\bar{T}-1}[n], \qquad (20)
A_{2}^{t}[n]=\delta_{2}^{t}[n]+(\gamma\lambda)\delta_{2}^{t+1}[n]+\ldots+(\gamma\lambda)^{\bar{T}-1}\delta_{2}^{t+\bar{T}-1}[n], \qquad (21)

where

\delta_{1}^{t}[n]=R_{1}^{t}[n]+\gamma V_{\phi^{\prime}}(s_{1}^{t+1})[n]-V_{\phi^{\prime}}(s_{1}^{t})[n], \qquad (22)
\delta_{2}^{t}[n]=R_{2}^{t}[n]+\gamma V_{\phi^{\prime}}(s_{1}^{t+1})[n]-V_{\phi^{\prime}}(s_{1}^{t})[n]. \qquad (23)

Here, $\bar{T}$ specifies the length of the given trajectory segment, $\gamma$ is the discount factor, and $\lambda$ is the GAE parameter; these parameters are specified in the experiments. $V_{\phi^{\prime}}$ is the target Critic with parameters $\phi^{\prime}$, which are periodically replaced by $\phi$. The user-centric Critic has $N$ output heads, where $V_{\phi^{\prime}}(\cdot)[n]$ denotes the Critic head for VU $n$, which takes in the state and outputs the state-value for VU $n$. Note that both agents use the same global state $s_{1}^{t}$, together with their separate rewards, because we only give the Critic an overall view of the whole task.
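The per-VU advantages of Eqs. (20)-(23) can be computed with the standard backward GAE recursion, as in the following sketch (which is equivalent to the truncated sums above within one trajectory segment):

```python
import numpy as np

def per_vu_gae(rewards, values, next_values, gamma=0.99, lam=0.95):
    """Truncated GAE of Eqs. (20)-(23), computed separately for each VU.

    rewards     : array [T_bar, N], per-VU rewards R^t[n] of one agent
    values      : array [T_bar, N], V_phi'(s_1^t)[n]
    next_values : array [T_bar, N], V_phi'(s_1^{t+1})[n]
    Returns advantages A^t[n] of shape [T_bar, N].
    """
    T_bar, N = rewards.shape
    deltas = rewards + gamma * next_values - values      # TD errors, Eqs. (22)-(23)
    adv = np.zeros((T_bar, N))
    running = np.zeros(N)
    for t in reversed(range(T_bar)):                     # backward accumulation
        running = deltas[t] + gamma * lam * running      # Eqs. (20)-(21)
        adv[t] = running
    return adv
```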

User-centric Critic update function: In this user-centric Critic, we compute the value losses for each VU separately to give UCHA a user-specific view. Similar to the update method in HRA [16], which uses the sum of the losses of the different components, we use the sum of the losses of the different VUs. Therefore, the traditional PPO Critic loss in Eq. (17) (i.e., $L(\phi)=(V_{\phi}(s^{t})-V_{target}^{t})^{2}$) is re-formulated as:

L(\phi)=\sum_{n=1}^{N}L(\phi)[n]=\sum_{n=1}^{N}\left(V_{\phi}(s_{1}^{t})[n]-V_{target}^{t}[n]\right)^{2}, \qquad (24)

where $V_{target}^{t}[n]=A_{1}^{t}[n]+V_{\phi^{\prime}}(s_{1}^{t})[n]$, and $L(\phi)[n]$ is the Critic loss for VU $n$.
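A minimal sketch of the user-centric Critic, i.e., a shared body with one value head per VU, and the summed loss of Eq. (24) (network sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

class UserCentricCritic(nn.Module):
    """Critic with one state-value head per VU."""
    def __init__(self, state_dim, n_vus, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.heads = nn.Linear(hidden, n_vus)     # V_phi(s)[n] for n = 1, ..., N

    def forward(self, state):
        return self.heads(self.body(state))       # shape [batch, N]

def user_centric_critic_loss(v_pred, v_target):
    """Eq. (24): sum the per-VU squared errors, then average over the batch."""
    return ((v_pred - v_target) ** 2).sum(dim=-1).mean()
```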

As explained above, the user-centric Critic uses centralized training with Eq. (24). The computational complexity and the real-time expenditure in the experiments are discussed in Section V.

Algorithm 1 User-centric Critic with Heterogeneous Actors
0:  Require: Actor 1 parameters $\theta_{1}$, Actor 2 parameters $\theta_{2}$, user-centric Critic parameters $\phi$ and target network $\phi^{\prime}$, initial state $s_{1}^{0}$.
1:  for iteration $=1,2,\ldots$ do
2:     $Agent_{1}$ executes an action according to $\pi_{\theta_{1}^{\prime}}(a_{1}^{t}|s_{1}^{t})$
3:     Get rewards $R_{1}^{t}[1],R_{1}^{t}[2],\ldots,R_{1}^{t}[N]$ and the current state of $Agent_{2}$: $s_{2}^{t}$
4:     $Agent_{2}$ executes an action according to $\pi_{\theta_{2}^{\prime}}(a_{2}^{t}|s_{2}^{t})$
5:     Get rewards $R_{2}^{t}[1],R_{2}^{t}[2],\ldots,R_{2}^{t}[N]$ and the next state of $Agent_{1}$: $s_{1}^{t+1}$, then set $s_{1}^{t}\leftarrow s_{1}^{t+1}$
6:     Sample $\{s^{t},a^{t},(R_{1}^{t}[1],\ldots,R_{1}^{t}[N]),(R_{2}^{t}[1],\ldots,R_{2}^{t}[N]),s^{t+1}\}$ until the end of the episode
7:     Compute advantages $\{A_{1}^{t}[1],\ldots,A_{1}^{t}[N]\}$ for $Agent_{1}$, $\{A_{2}^{t}[1],\ldots,A_{2}^{t}[N]\}$ for $Agent_{2}$, and target values $\{V_{target}^{t}[1],\ldots,V_{target}^{t}[N]\}$
8:     for $k=1,2,\ldots,K$ do
9:        Shuffle the data order, set batch size $bs$
10:        for $j=0,1,\ldots,\frac{\text{Trajectory length}}{bs}-1$ do
11:           Compute gradients for Actor 1 and Actor 2 using Eqs. (18) and (19)
12:           Update the Actors separately by gradient ascent
13:           Compute the value losses for each VU
14:           Update the Critic with the MSE loss of Eq. (24)
15:        end for
16:        Assign the target network $\phi^{\prime}\leftarrow\phi$ every $C$ steps
17:     end for
18:  end for

V Experiments

In this section, we conduct extensive experiments to compare UCHA with the baselines, highlighting its remarkable performance. We first introduce our baseline algorithms and metrics, and then present our results with detailed analysis.

V-A Baselines

As we introduce a novel UCHA algorithm with unique Actor and Critic structures, we remove the key components of UCHA one by one and design the following baseline algorithms:

  • Independent PPO (IPPO). The most intuitive way of using RL in such a cooperative interactive environment is to implement two independent RL agents interacting with each other. We implement two independent PPO agents with user-centric Critics for optimizing the channel access and the downlink power allocation. Each agent uses its own separate Actor and Critic and updates with its own states, actions, and rewards. This baseline examines the effect of the heterogeneous Actors structure in UCHA.

  • Heterogeneous-Actor PPO (HAPPO). We also implement an HAPPO structure that is similar to UCHA. HAPPO does not apply the user-centric architecture; it merely uses one normal Critic that outputs a single value for the whole global state. Therefore, in HAPPO, we use the sum of the per-VU rewards of each agent as its single reward, i.e., $Agent_{1}$ uses the reward $\boldsymbol{R_{1}^{t}}=\sum_{n=1}^{N}R_{1}^{t}[n]$, and $Agent_{2}$ uses $\boldsymbol{R_{2}^{t}}=\sum_{n=1}^{N}R_{2}^{t}[n]$. This baseline tests the effect of the user-centric structure of the Critic.

  • Random. The two random agents select actions randomly. This naive baseline represents the system performance when no optimization strategy is performed.

V-B Metrics

We introduce a set of metrics to evaluate the effectiveness of our proposed methods.

  • Worst VU frames. We define the worst VU frames as $\min_{n\in\mathcal{N}}(\sum_{t=1}^{T}I_{n}^{t}-\tau_{n,F})$ in Eq. (9). The "worst VU" is the VU with the largest gap between its successfully received frames and its required FPS, and the "worst VU frames" refers to this gap.

  • Successful FPS. The number of successful frames among the total $T$ frames determines the FPS of the VR stream and hence the fluidity of the Metaverse VR experience.

  • The sum local energy consumption. We first illustrate the sum energy consumption during training, and then investigate it for different VUs in detail.

  • The received frame resolutions. The received frame resolutions of the different VUs are shown to verify the comprehensive view of our proposed methods.

V-C Channel Attenuation

When $z_{n}^{t}=m$, the channel attenuation becomes a very important variable for the downlink rate. In this paper, the channel attenuation is simulated as $h_{n,m}^{t}=\sqrt{\beta_{n}^{t}}g_{n,m}^{t}$. Rician fading is used as the small-scale fading, $g_{n,m}^{t}=\sqrt{\frac{K}{K+1}}\bar{g}_{n,m}^{t}+\sqrt{\frac{1}{K+1}}\tilde{g}_{n,m}^{t}$, where $\bar{g}_{n,m}^{t}$ stands for the Line-Of-Sight (LOS) component, and $\tilde{g}_{n,m}^{t}$ is the Non-LOS (NLOS) component following the standard complex normal distribution $\mathcal{CN}(0,1)$. The large-scale fading is $\beta_{n}^{t}=\beta_{0}(L_{n})^{-\alpha}$, where $L_{n}$ represents the distance between the $n$th VU and the server, and $\beta_{0}$ denotes the channel attenuation at the reference distance $L_{0}=1$ m. The path-loss exponent $\alpha$ is set to $2$, and the Rician factor $K$ is set to $3$. Note that as the duration of each time slot is very short, we do not consider the geographical movements of VUs within a time step.
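A sketch of how one channel coefficient can be drawn under this model (the unit-modulus LOS component is an assumption, as the paper does not specify it):

```python
import numpy as np

def sample_channel(dist, beta0=1.0, alpha=2.0, K=3.0, rng=np.random.default_rng()):
    """Sample h_{n,m}^t = sqrt(beta_n^t) * g_{n,m}^t with Rician small-scale fading.

    dist : distance L_n between the VU and the server (metres)
    """
    beta = beta0 * dist ** (-alpha)                      # large-scale fading
    g_los = 1.0 + 0j                                     # LOS component (unit modulus, assumed)
    g_nlos = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)  # CN(0, 1)
    g = np.sqrt(K / (K + 1)) * g_los + np.sqrt(1 / (K + 1)) * g_nlos
    return np.sqrt(beta) * g
```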

Figure 5: Train-time model performances with 6 and 8 VUs (3 channels). Panels: (a) training reward with 6 VUs; (b) energy consumption with 6 VUs; (c) worst VU frames with 6 VUs; (d) training reward with 8 VUs; (e) energy consumption with 8 VUs; (f) worst VU frames with 8 VUs. In the simpler configuration (6 VUs), UCHA converges much more quickly than the other baselines, and in the more complicated configuration (8 VUs), UCHA obtains much more remarkable performance in all aspects.

V-D Numerical Setting

Consider a $30\times 30$ m$^{2}$ indoor space in which multiple VUs are distributed uniformly at random. We set the number of channels to $3$ in every experiment configuration, and the number of VUs ranges from $5$ to $8$ across the different configurations. To simplify the simulation, we use the most common resolution standards but ignore the various horizontal FoVs and aspect ratios of different VR devices. The resolution of one frame can be 1440p ($2560\times 1440$, known as Quad HD), 1080p ($1920\times 1080$, known as Full HD), 720p ($1280\times 720$, known as HD), or below (considered insufficient resolution). Each pixel is stored in 16 bits [36], and the compression factor is drawn uniformly at random from $[300,600]$ [18]. Each frame consists of two images, one for each eye. Therefore, the data size of each frame is $D_{n}^{t}\in\{\frac{2560\times 1440\times 16\times 2}{compression},\frac{1920\times 1080\times 16\times 2}{compression},\frac{1280\times 720\times 16\times 2}{compression}\}$. The maximum refresh rate, i.e., $T$ frames in one second, is taken to be 90, which is considered a comfortable rate for VR applications [4]. The bandwidth of each channel is set to $10\times 180$ kHz. The required successful frame transmission count $\tau_{n,F}$ is selected uniformly from $[60,90]$, which is above the acceptable minimum of $60$ [4]. The maximum power of the server is $100$ Watt. The server computation frequency is $10$ GHz, and the computation capability of each VU is selected uniformly from $[0.3,0.9]$ GHz. For all experiments, we use a single NVIDIA GTX 2080 Ti and $2\times 10^{5}$ training steps, and the evaluation interval is set to $500$ training steps. As there are several random variables in our environment (e.g., channel attenuation, compression factors), all experiments are conducted under ten different global random seeds, and error bands are drawn to better illustrate the model performances.
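As an illustration of the data-size model above, the following sketch computes $D_{n}^{t}$ for the three resolutions with a compression factor drawn uniformly from $[300,600]$; the constant and function names are our own.

```python
import numpy as np

RESOLUTIONS = {"1440p": (2560, 1440), "1080p": (1920, 1080), "720p": (1280, 720)}
BITS_PER_PIXEL = 16
IMAGES_PER_FRAME = 2   # one image per eye

def frame_data_size_bits(resolution: str, rng=None) -> float:
    """Data size D_n^t (in bits) of one stereoscopic frame after compression."""
    rng = np.random.default_rng() if rng is None else rng
    width, height = RESOLUTIONS[resolution]
    compression = rng.uniform(300, 600)   # compression factor drawn uniformly from [300, 600]
    return width * height * BITS_PER_PIXEL * IMAGES_PER_FRAME / compression
```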

V-E Train-time result analysis

Comprehensive experiments are of vital importance in such a complicated scenario. In this section, we illustrate and analyze the evaluated metrics during training. For brevity, we show the performance of the different models in two experimental configurations: the less complicated scenario with 6 VUs and the more complicated one with 8 VUs; the overall results of every scenario are shown in Table I. We also analyze the computational complexity, and then illustrate the value losses of each VU to give a more “user-specific” analysis. Note that, to better evaluate the different metrics, we do not apply early termination if the task fails in the evaluation stage.

Figure 6: The Critic losses (value losses) for each VU when using UCHA in the 3-channel, 8-VU scenario.

V-E1 Train-time model performance in the “3 Channels, 6 VUs” and “3 Channels, 8 VUs” configurations

The training reward, sum local energy consumption, and worst VU frames show an overall upward trend as training progresses. On these metrics, UCHA performs the best among all tested algorithms in both configurations.

In the experimental setting with $6$ VUs, although HAPPO (without the user-centric structure in the Critic) attains similar peak rewards in the late training stages, UCHA converges in about one-fifth of the training steps that HAPPO takes to achieve convergence, as shown in Fig. 5(c). As for the energy usage shown in Fig. 5(b), both UCHA and HAPPO first increase the energy consumption (i.e., perform more local generation) and then decrease it in subsequent steps, while their training rewards increase steadily. Evidently, both UCHA and HAPPO learn to decrease the congestion degree in each channel by successfully allocating some VUs to local generation: they initially reduce congestion to fulfill the FPS requirements by allocating more VUs to local generation, and then save local battery energy by gradually decreasing the number of local generations. There is a sudden drop in the training reward of UCHA at about $100$k steps (Fig. 5(a)), which we attribute to the sharp decrease in energy usage (local computation times): UCHA tries to further save energy by rapidly reducing the local computation times, but having too many VUs share the limited channel and downlink power resources can be devastating to the overall transmission process. On balance, UCHA and HAPPO both succeed in lowering the local energy usage to a level similar to the case where no optimization is conducted (random policy). Lowering energy usage to the level of the random policy is in fact impressive, because intuitively more local energy must be sacrificed to fulfill the task, and yet the reward and worst VU frames they obtain are far higher than those of the random policy.

Different from UCHA and HAPPO, IPPO fails to find an acceptable solution even in this simpler scenario, although we can still observe from Fig. 5(a) that its reward increases during training. IPPO has the worst stability across different global random seeds, which is reflected in the huge error bands and the sudden drops even in the late training stages. These observations signify that the Agents in IPPO fail to work cooperatively and do not achieve good arrangements of channel access and downlink power compared with UCHA and HAPPO. Nevertheless, IPPO still performs well in the simplest “3 Channels, 5 VUs” scenario (Table I).

The results of the experiment with $8$ VUs show that UCHA is clearly superior to HAPPO and IPPO in almost all aspects. Compared to the simpler configuration with $6$ VUs, UCHA needs many more steps to reach the peak reward, as shown in Fig. 5(d). Moreover, in this scenario, the local energy consumption (local generation times) of UCHA keeps increasing throughout training, which differs from the observations in the “3 Channels, 6 VUs” configuration. This is reasonable because the scenario with more VUs requires more local computation arrangements to avoid unacceptable interference, and the rationality of doing so is reflected in the steadily increasing reward (Fig. 5(d)) and the slower rise of the energy usage (Fig. 5(e)). The complete results are shown in Table I. We observe that UCHA obtains the best performance for almost every metric under every scenario. This demonstrates that decomposing the reward and using the sum of the per-VU losses, which provides a user-centric view for the RL agent, is a good approach to tackling the formulated problem.

TABLE I: Overall results
Scenario | Energy usage (J) | Worst VU frames | Reward | Train step-time (ms) | Execution step-time (ms)
UCHA
3 Channels, 5 VUs | 11.78 | 9.97 | 80.04 | 19 | 0.95
3 Channels, 6 VUs | 10.84 | 4.76 | 69.72 | 18 | 0.94
3 Channels, 7 VUs | 23.67 | 5.43 | 62.78 | 20 | 0.94
3 Channels, 8 VUs | 37.25 | 2.56 | 65.6 | 22 | 0.98
HAPPO
3 Channels, 5 VUs | 14.46 | 4.33 | 75.46 | 15 | 0.92
3 Channels, 6 VUs | 13.37 | -1.53 | 64.98 | 15 | 0.91
3 Channels, 7 VUs | 9.88 | -43.76 | 16.10 | 16 | 0.90
3 Channels, 8 VUs | 21.96 | -60.67 | -2.83 | 18 | 0.96
IPPO
3 Channels, 5 VUs | 10.76 | 1.32 | 72.22 | 20 | 0.55
3 Channels, 6 VUs | 9.41 | -49.67 | 14.46 | 27 | 0.53
3 Channels, 7 VUs | 10.96 | -66.34 | -16.81 | 32 | 0.55
3 Channels, 8 VUs | 12.23 | -73.83 | -14.00 | 56 | 0.57
Figure 7: Detailed user-specific evaluation. (a) The received frame resolutions of each VU: the table on the left shows the VU-inherent characteristics (i.e., max resolution when computing locally, battery state, and FPS demand), and the bar chart on the right compares the received frame resolutions of each VU in one second under the different algorithms. (b) The local computation times and energy usage; the color of the VU index denotes the battery state.

We then illustrate the value losses of each VU (calculated by the user-centric Critic) to show the convergence of our proposed UCHA in Fig. 6. Overall, the losses of all VUs decrease, which shows the convergence of the hybrid Critic even with eight branches (one branch per VU). However, the ranges and decline patterns of the losses all differ. Here, we discuss four representative VUs in detail. In this setting, $\text{VU}_{1}$ is the toughest to optimize: it lacks sufficient local computing power for a high resolution (i.e., higher than 720p), so local generation inevitably results in frame loss, and it has a high FPS requirement (80 FPS). Therefore, $\text{VU}_{1}$ must be scheduled carefully, or it can cause huge losses. $\text{VU}_{2}$ has a high loss initially, as it has a high FPS requirement; however, this is easily addressed by allocating it more local computation, since $\text{VU}_{2}$ has sufficient local computing power for 1440p resolution. $\text{VU}_{6}$ is in a more involved situation: it can generate 1080p frames locally but has low battery energy. Thus, if it is arranged to compute locally, there is no frame loss, but it incurs energy consumption and resolution degradation penalties. On the contrary, $\text{VU}_{3}$ can generate 1440p frames locally and has a high frame-failure tolerance, so its loss remains low.

V-F Computational complexity

We analyze the computational complexity and report the measured step times in Table I. Let $K^{l}_{A},K^{l}_{A^{\prime}},K^{l}_{C}$ denote the numbers of neurons in layer $l$ of Actor1, Actor2, and the user-centric Critic, respectively. Let $(\mathcal{A}_{0},\mathcal{A}^{\prime}_{0},\mathcal{C}_{0})$ be the sizes of the input layers (proportional to the state dimensions shown in Fig. 4), and $(L_{A},L_{A^{\prime}},L_{C})$ the numbers of trainable layers of the three parts. Each training step consists of two Actor updates and one Critic update. Considering the mini-batch size $B$ in the training stage, the complexity of one training step is $O\big(B(\mathcal{A}_{0}K^{1}_{A}+\sum_{l=1}^{L_{A}-1}K^{l}_{A}K^{l+1}_{A}+\mathcal{A}^{\prime}_{0}K^{1}_{A^{\prime}}+\sum_{l=1}^{L_{A^{\prime}}-1}K^{l}_{A^{\prime}}K^{l+1}_{A^{\prime}}+\mathcal{C}_{0}K^{1}_{C}+\sum_{l=1}^{L_{C}-1}K^{l}_{C}K^{l+1}_{C})\big)$. According to [37], the overall computational complexity also depends on the total number of steps needed to converge to the optimal policy. In practice, network training can be performed offline for a finite number of steps at a centralized, powerful unit (such as the server). Table I gives an intuitive illustration of the time of a single training and execution step (in ms).
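As a worked example of the complexity expression above, the small helper below counts the dense-layer multiply-accumulates for given layer widths; the example input and output sizes are placeholders rather than the exact network dimensions.

```python
def dense_mac_count(layer_widths):
    """Multiply-accumulate count of one fully connected stack: sum_l K^l * K^{l+1}."""
    return sum(a * b for a, b in zip(layer_widths[:-1], layer_widths[1:]))

def per_training_step_macs(batch_size, actor1_widths, actor2_widths, critic_widths):
    """B * (Actor1 + Actor2 + Critic) MACs, matching the O(.) expression above."""
    return batch_size * (dense_mac_count(actor1_widths)
                         + dense_mac_count(actor2_widths)
                         + dense_mac_count(critic_widths))

# Example with the hidden width 64 used in the appendix; input/output sizes are illustrative.
print(per_training_step_macs(64, [32, 64, 64, 8], [32, 64, 64, 4], [48, 64, 64, 8]))
```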

V-G Evaluating performance of the user-specific effect

In this section, we load the trained models of UCHA, HAPPO, and IPPO and evaluate them in the same evaluation environment. Note that, to better compare the different algorithms, we remove the early termination constraint (i.e., if any VU’s frame loss count reaches the tolerance limit, the episode ends immediately) in evaluation, which means all evaluations finish the whole $90$ frame transmissions.

Fig. 7(a) presents the frame resolutions of each VU: the left table lists the inherent characteristics of each VU (i.e., max resolution when computing locally, battery state, and FPS requirement), and the right bar chart compares the resolutions received by each VU under the different algorithms. Fig. 7(b) further shows the local computation times and local energy usage of each VU. To simplify the illustration, in this evaluation we only set the battery states as Low, Middle, and High, and fix the local max resolution of each VU. Note that “fail” in the bar chart means a frame that was allocated to be processed on the server but was not transmitted to the VU in time; a resolution below 720p is also deemed a failure.

To begin with, only UCHA is able to fulfill the FPS requirements of the different VUs, and it indeed acquires a “user-specific” ability. For the VUs with less local computing power (e.g., $\text{VU}_{1}$, $\text{VU}_{2}$, $\text{VU}_{6}$) and with low battery energy (e.g., $\text{VU}_{5}$), UCHA avoids allocating them many local computations. We observe from Fig. 7(b) that UCHA lets $\text{VU}_{1}$ and $\text{VU}_{5}$ compute remotely every time, as $\text{VU}_{1}$, with a high FPS requirement, does not have sufficient local computing power for 720p (i.e., the lowest resolution deemed successful), and $\text{VU}_{5}$ has a low battery and a high local max resolution; $\text{VU}_{5}$ would therefore consume more energy at its high local resolution if it were allocated to compute locally. On the contrary, $\text{VU}_{3}$, $\text{VU}_{7}$, and $\text{VU}_{8}$ are arranged by UCHA to do more local computation, as they have high battery energy and good local computation capabilities, which spares the channel and downlink power resources for other VUs. Furthermore, we notice that $\text{VU}_{4}$ receives the fewest successful frames, but this is reasonable: $\text{VU}_{4}$ is in a less FPS-sensitive scenario and has the lowest FPS requirement, so at some steps more resources should be allocated to other VUs with higher requirements. These remarkable behaviors demonstrate that UCHA has a “user-centric” ability.

Different from UCHA, HAPPO and IPPO fail to do well in such a user-centric task, and IPPO performs worse than HAPPO. We observe from the local computation arrangement in Fig. 7(b) that HAPPO (without the user-centric Critic) learns a rather extreme policy: $\text{VU}_{2}$, $\text{VU}_{3}$, and $\text{VU}_{7}$ are allocated to local computation almost every time, as $\text{VU}_{3}$ and $\text{VU}_{7}$ have high battery energy, which leads to relatively low local computing penalties, and although $\text{VU}_{2}$ has a low battery, it only generates 720p resolution locally, so its energy usage is lower. As for IPPO, the disorderly and unreasonable allocations signify that its agents are non-cooperative and do not achieve a good overall channel and power selection.

VI Conclusion

In this paper, we study user-centric multi-user VR for the Metaverse over wireless networks. Users with varying requirements are considered, and a novel user-centric DRL algorithm called UCHA is designed to tackle the studied problem. Extensive experimental results show that UCHA has the quickest convergence speed and achieves the highest reward among all compared algorithms, and that it successfully acquires a user-specific view. We envision our work to motivate more research on calibrating deep reinforcement learning for research problems of the Metaverse.

Acknowledgement

This research is supported by the Singapore Ministry of Education Academic Research Fund under Tier 1 Grants RG90/22, RG97/20, and RG24/20, and Tier 2 Grant MOE2019-T2-1-176; and by the NTU-Wallenberg AI, Autonomous Systems and Software Program (WASP) Joint Project.

 
References

  • [1] W. Yu, T. J. Chua, and J. Zhao, “Virtual Reality in Metaverse over Wireless Networks with User-centered Deep Reinforcement Learning,” IEEE International Conference on Communications (ICC), 2023, also available online at https://arxiv.org/abs/2303.04349 .
  • [2] L.-H. Lee, T. Braud, P. Zhou, L. Wang, D. Xu, Z. Lin, A. Kumar, C. Bermejo, and P. Hui, “All one needs to know about metaverse: A complete survey on technological singularity, virtual ecosystem, and research agenda,” arXiv preprint arXiv:2110.05352, 2021.
  • [3] Y. Sun, J. Chen, Z. Wang, M. Peng, and S. Mao, “Enabling mobile virtual reality with open 5G, fog computing and reinforcement learning,” IEEE Network, 2022.
  • [4] E. Bastug, M. Bennis, M. Médard, and M. Debbah, “Toward interconnected virtual reality: Opportunities, challenges, and enablers,” IEEE Communications Magazine, vol. 55, no. 6, pp. 110–117, 2017.
  • [5] H. Du, J. Wang, D. Niyato, J. Kang, Z. Xiong, X. S. Shen, and D. I. Kim, “Exploring attention-aware network resource allocation for customized metaverse services,” IEEE Network, 2022.
  • [6] P. Yang, T. Q. S. Quek, J. Chen, C. You, and X. Cao, “Feeling of presence maximization: mmwave-enabled virtual reality meets deep reinforcement learning,” IEEE Transactions on Wireless Communications, 2022.
  • [7] G. Xiao, M. Wu, Q. Shi, Z. Zhou, and X. Chen, “DeepVR: Deep reinforcement learning for predictive panoramic video streaming,” IEEE Transactions on Cognitive Communications and Networking, vol. 5, no. 4, pp. 1167–1177, 2019.
  • [8] Q. Wu, X. Chen, Z. Zhou, L. Chen, and J. Zhang, “Deep reinforcement learning with spatio-temporal traffic forecasting for data-driven base station sleep control,” IEEE/ACM Transactions on Networking, vol. 29, no. 2, pp. 935–948, 2021.
  • [9] S. Yang, J. Liu, F. Zhang, F. Li, X. Chen, and X. Fu, “Caching-enabled computation offloading in multi-region mec network via deep reinforcement learning,” IEEE Internet of Things Journal, 2022.
  • [10] S. Yu, X. Chen, Z. Zhou, X. Gong, and D. Wu, “When deep reinforcement learning meets federated learning: Intelligent multitimescale resource management for multiaccess edge computing in 5G ultradense network,” IEEE Internet of Things Journal, vol. 8, no. 4, pp. 2238–2251, 2020.
  • [11] T. J. Chua, W. Yu, and J. Zhao, “Play to earn in the metaverse over wireless networks with deep reinforcement learning,” submitted to the 2023 EAI Game Theory for Networks (GameNets). [Online]. Available: https://personal.ntu.edu.sg/JunZhao/ICC2023MALS.pdf
  • [12] Z. Wang, L. Li, Y. Xu, H. Tian, and S. Cui, “Handover control in wireless systems via asynchronous multiuser deep reinforcement learning,” IEEE Internet of Things Journal, 2018.
  • [13] D. Guo, L. Tang, X. Zhang, and Y.-C. Liang, “Joint optimization of handover control and power allocation based on multi-agent deep reinforcement learning,” IEEE Transactions on Vehicular Technology, 2020.
  • [14] C. He, Y. Hu, Y. Chen, and B. Zeng, “Joint power allocation and channel assignment for noma with deep reinforcement learning,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 10, pp. 2200–2210, 2019.
  • [15] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  • [16] H. Van Seijen, M. Fatemi, J. Romoff, R. Laroche, T. Barnes, and J. Tsang, “Hybrid reward architecture for reinforcement learning,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [17] L. Dai, B. Wang, Z. Ding, Z. Wang, S. Chen, and L. Hanzo, “A survey of non-orthogonal multiple access for 5G,” IEEE Communications Surveys & Tutorials, 2018.
  • [18] E. Bastug, M. Bennis, M. Médard, and M. Debbah, “Toward interconnected virtual reality: Opportunities, challenges, and enablers,” IEEE Communications Magazine, vol. 55, no. 6, pp. 110–117, 2017.
  • [19] C. You, Y. Zeng, R. Zhang, and K. Huang, “Asynchronous mobile-edge computation offloading: Energy-efficient resource management,” IEEE Transactions on Wireless Communications, vol. 17, no. 11, pp. 7590–7605, 2018.
  • [20] Y. Wen, W. Zhang, and H. Luo, “Energy-optimal mobile application execution: Taming resource-poor mobile devices with cloud clones,” in IEEE INFOCOM, 2012, pp. 2716–2720.
  • [21] L. Liberti, “Undecidability and hardness in mixed-integer nonlinear programming,” RAIRO-Operations Research, 2019.
  • [22] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction.   MIT press, 2018.
  • [23] P. K. Sharma, R. Fernandez, E. Zaroukian, M. Dorothy, A. Basak, and D. E. Asher, “Survey of recent multi-agent reinforcement learning algorithms utilizing centralized training,” in Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications III, vol. 11746, 2021, pp. 665–676.
  • [24] Y. Li, T. Gao, J. Yang, H. Xu, and Y. Wu, “Phasic self-imitative reduction for sparse-reward goal-conditioned reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2022, pp. 12 765–12 781.
  • [25] Y. Wu, E. Mansimov, S. Liao, A. Radford, and J. Schulman, “OpenAI baselines: Acktr & A2C,” 2017. [Online]. Available: https://openai.com/blog/baselines-acktr-a2c
  • [26] T. J. Chua, W. Yu, and J. Zhao, “Resource allocation for mobile metaverse with the Internet of Vehicles over 6G wireless communications: A deep reinforcement learning approach,” in 8th IEEE World Forum on the Internet of Things (WFIoT), 2022, also available online at https://arxiv.org/pdf/2209.13425.pdf .
  • [27] X. Wang, Y. Han, V. C. M. Leung, D. Niyato, X. Yan, and X. Chen, “Convergence of edge computing and deep learning: A comprehensive survey,” IEEE Communications Surveys & Tutorials, vol. 22, no. 2, pp. 869–904, 2020.
  • [28] OpenAI, “ChatGPT: Optimizing language models for dialogue,” 2022. [Online]. Available: https://openai.com/blog/chatgpt/
  • [29] A. Owen and Y. Zhou, “Safe and effective importance sampling,” Journal of the American Statistical Association, 2000.
  • [30] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
  • [31] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International conference on machine learning.   PMLR, 2015, pp. 1889–1897.
  • [32] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021.
  • [33] N. Justesen, P. Bontrager, J. Togelius, and S. Risi, “Deep learning for video game playing,” IEEE Transactions on Games, 2020.
  • [34] L. Hasenclever, F. Pardo, R. Hadsell, N. Heess, and J. Merel, “CoMic: Complementary task learning & mimicry for reusable skills,” in International Conference on Machine Learning, 2020, pp. 4105–4115.
  • [35] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015.
  • [36] Meta Quest, “Mobile virtual reality media overview.” [Online]. Available: https://developer.oculus.com/documentation/mobilesdk/latest/concepts/mobile-media-overview/
  • [37] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan, “Is Q{Q}-learning provably efficient?” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • [38] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

Appendix A Implementation details

We provide the hyper-parameters for the reward setting as a reference:

  • Resolution-dependent rewards $R_{r}^{t}[n]$ (see the sketch after this list): $0$ for $[0,720\text{p})$, $2$ for $[720\text{p},1080\text{p})$, $3$ for $[1080\text{p},2\text{k})$, $5$ for $[2\text{k},+\infty)$.

  • Transmission failure penalty $R_{f}^{t}[n]$: $-1$.

  • Weight in the energy consumption penalty $R_{e}^{t}[n]$: $\omega_{e}=-0.5$.

  • Weight in the worst VU penalty $R_{w}^{t}[n]$: $\omega_{end}=-10$.

  • Weight in the early termination penalty $R_{term}^{t}[n]$: $\omega_{f}=-10$.
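A minimal sketch of how these per-slot reward terms could be combined is given below; the function signatures, the pixel-width thresholds, and the separation of episode-level penalties are our own illustrative assumptions based on the listed coefficients.

```python
def resolution_reward(width_px: int) -> float:
    """Resolution reward R_r^t[n]; the thresholds in pixel width are our interpretation
    of the 720p / 1080p / 2k interval boundaries listed above."""
    if width_px >= 2048:        # [2k, +inf)
        return 5.0
    if width_px >= 1920:        # [1080p, 2k)
        return 3.0
    if width_px >= 1280:        # [720p, 1080p)
        return 2.0
    return 0.0                  # [0, 720p): insufficient resolution

def per_vu_step_reward(width_px: int, frame_failed: bool, local_energy_j: float,
                       w_e: float = -0.5) -> float:
    """Per-slot reward of one VU: resolution reward plus failure and energy penalties.
    The episode-level worst-VU and early-termination penalties (weights -10) are added separately."""
    r = resolution_reward(width_px)
    if frame_failed:
        r += -1.0               # transmission failure penalty R_f^t[n]
    r += w_e * local_energy_j   # energy consumption penalty with omega_e = -0.5
    return r
```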

For the neural network hyper-parameter settings, adaptive moment estimation (Adam) [38] is selected as the optimizer. The discount factors $\gamma$ for Actor1 and Actor2 are $0.99$ and $0.9$, respectively, and the GAE factor $\lambda$ is fixed at $0.95$. The batch size is set to $64$, and the hidden layer widths are all set to $64$. The learning rates of the Actors and the Critic are $2\times 10^{-4}$.
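The sketch below collects these training hyper-parameters into a single configuration object for reproducibility; the dataclass and field names are our own naming, not the original code.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    """Training hyper-parameters listed above (field names are illustrative)."""
    gamma_actor1: float = 0.99   # discount factor for Actor1
    gamma_actor2: float = 0.9    # discount factor for Actor2
    gae_lambda: float = 0.95     # GAE factor lambda
    batch_size: int = 64
    hidden_width: int = 64       # width of every hidden layer
    lr_actor: float = 2e-4       # learning rate of both Actors (Adam)
    lr_critic: float = 2e-4      # learning rate of the user-centric Critic (Adam)

config = TrainingConfig()
```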

Wenhan Yu (Student Member, IEEE) received his B.S. degree in Computer Science and Technology from Sichuan University, Sichuan, China in 2021. He is currently pursuing a Ph.D. degree in the School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore. His research interests cover wireless communications, deep reinforcement learning, optimization, and the Metaverse.
Terence Jie Chua received his B.S. degree from Nanyang Technological University, Singapore. He is currently pursuing a Ph.D. degree in the School of Computer Science and Engineering, Nanyang Technological University (NTU), Singapore. His research interests cover wireless communications, adversarial machine learning, deep reinforcement learning, optimization, and the Metaverse.
Jun Zhao (Member, IEEE) received the bachelor’s degree from Shanghai Jiao Tong University, China, in July 2010, and the joint Ph.D. degree in electrical and computer engineering from Carnegie Mellon University (CMU), USA, in May 2015 (advisors: Virgil Gligor and Osman Yagan; collaborator: Adrian Perrig), affiliated with CMU’s renowned CyLab Security & Privacy Institute. He is currently an Assistant Professor with the School of Computer Science and Engineering (SCSE), Nanyang Technological University (NTU), Singapore. Before joining NTU as a faculty member, he was a post-doctoral researcher under the supervision of Xiaokui Xiao, and prior to that he was a post-doctoral researcher at Arizona State University as an Arizona Computing Post-Doctoral Researcher Best Practices Fellow (advisors: Junshan Zhang and Vincent Poor).