
Wireless MAC Protocol Synthesis and Optimization with Multi-Agent Distributed Reinforcement Learning

Navid Keshtiarast, Oliver Renaldi, and Marina Petrova

Navid Keshtiarast, Oliver Renaldi, and Marina Petrova are with the Mobile Communications and Computing Group, RWTH Aachen University, Germany (e-mail: {navid.keshtiarast@mcc, oliver.renaldi, petrova@mcc}.rwth-aachen.de).
Abstract

In this letter, we propose a novel Multi-Agent Deep Reinforcement Learning (MADRL) framework for Medium Access Control (MAC) protocol design. Unlike centralized approaches, which rely on a single entity for decision-making, MADRL empowers individual network nodes to autonomously learn and optimize their MAC protocols based on local observations. Leveraging ns3-ai and RLlib, our framework is, to the best of our knowledge, the first of its kind to enable distributed multi-agent learning within the ns-3 environment, facilitating the design and synthesis of adaptive MAC protocols tailored to specific environmental conditions. We demonstrate the effectiveness of the MADRL MAC framework through extensive simulations, showcasing superior performance compared to legacy protocols across diverse scenarios. Our findings highlight the potential of MADRL-based MAC protocols to significantly enhance Quality of Service (QoS) for future wireless applications.

Index Terms:
ML-based protocol design, MADRL, intelligent wireless protocols

I Introduction

Wireless networks are continuously faced with a multitude of demands, ranging from high-reliability and low-latency connectivity to support for bandwidth-intensive applications such as virtual reality (VR), gaming, and holographic video. These applications highlight the growing need for adaptable and application-specific channel access mechanisms and resource allocation. However, the large number of configurable parameters in the medium access control (MAC) layer and their entangled inter-dependencies, combined with the dynamic nature of wireless networks [1], make it challenging to optimize and fine-tune protocols with traditional methods, especially in uncoordinated environments such as unlicensed bands, which lack a centralized authority to regulate channel access across all network nodes.

In recent years, AI-driven approaches, particularly deep reinforcement learning (DRL), have shown great promise in optimizing wireless network performance [2] by letting MAC protocols learn and adapt autonomously based on real-time feedback from the environment, ensuring more intelligent and adaptive behavior. However, most existing works in this area optimize and configure only a few parameters in the MAC or physical layer [3]. Some recent studies also explore applications of DRL for generating protocols and signalling for cellular [4, 5] and Wi-Fi networks [6, 7]. While these studies take a significant step forward, the proposed solutions rely on centralized entities for training. Even where distributed inference is employed, the overarching approach remains centralized, hampering scalability and adaptability, particularly in dynamic network environments where decentralized decision-making is desirable. Expanding upon our prior work [6, 8], which proposed a DRL-based MAC design framework with centralized learning and execution for Wi-Fi networks, in this letter we introduce a Multi-Agent Deep Reinforcement Learning (MADRL) framework that supports both centralized and distributed learning, along with distributed inference. This advancement increases flexibility and adaptability, empowering individual nodes to manage diverse traffic loads and environmental conditions effectively. We implement our MADRL framework within ns-3 by integrating ns3-ai [9] and RLlib. To the best of our knowledge, this is the first implementation that enables distributed multi-agent learning using ns3-ai and Ray RLlib in the ns-3 environment. Furthermore, we extend the capabilities of our framework to support 5G New Radio Unlicensed (5G NR-U) technology, which demonstrates the versatility of our solution across emerging wireless technologies beyond Wi-Fi. We test and verify the performance of our framework and the learnt protocols through extensive system-level simulations and demonstrate the superior performance of the MADRL-synthesized protocols over legacy protocols across diverse scenarios. This underlines the potential of our framework to significantly enhance QoS for future applications through a novel protocol design approach.

II System Concept

Figure 1 illustrates our system concept. The synthesis of MAC protocols is performed from a set of atomic building blocks interconnected through Machine Learning (ML) driven policies. To drive this process, we supply the ML framework with atomic building-block functions such as backoff, sensing, defer, and modulation and coding scheme, together with all of their related parameters, as well as application requirements and environment characteristics such as the type of traffic, packet arrival rates, and number of nodes. The ML framework synthesizes a new protocol, which is subsequently evaluated in the ns-3 simulator using a pre-defined reward selected to fit the application requirements. The computed reward is then fed back to the ML framework. This iterative process continues until a protocol is constructed that incorporates the optimal set of building blocks for the current network/environment configuration. This modular approach empowers the agents to discern the most effective combination of these blocks, thereby generating novel MAC protocols or refining existing ones.

Figure 1: System concept.

In this letter, we showcase MAC protocol synthesis with our MADRL framework in a network comprising 5G New Radio-Unlicensed (5G NR-U) gNBs. 5G NR-U is a radio access technology developed by 3GPP and first introduced in Release 16 [10]. One of its key features is the channel access mechanism, namely Listen Before Talk (LBT), inherited from LTE licensed-assisted access (LTE-LAA) and operating in the 5 GHz unlicensed band. Before transmission, 5G NR-U devices have to sense the channel to ensure harmonious coexistence with other unlicensed devices, such as IEEE 802.11ax. 5G NR-U devices perform the LBT procedure in the downlink, similar to Wi-Fi's CSMA/CA protocol. As shown in Figure 2, after an idle channel period $T_f$ lasting $16\,\mu s$, the gNB initiates Clear Channel Assessment (CCA) in a sequence of $d_i$ consecutive CCA observation slots, each with a duration of $9\,\mu s$. The defer period $T_{df}$ is then computed as $T_{df} = T_f + d_i \times T_{CCA}$, where $d_i$ is determined by the priority assigned to the various traffic types in the standard. If the channel remains idle throughout the defer time, the gNB initiates the backoff procedure by randomly selecting a number from the set $\{0, 1, \dots, CW-1\}$ while continuing to sense the channel and decrementing the backoff counter. Upon reaching zero, the gNB starts transmission. If the channel becomes occupied during any of these slots, the backoff counter freezes, and the process restarts once the channel becomes idle again with the remaining backoff counter from the previous attempt. Upon gaining access to the channel, the gNB can occupy it for a maximum duration, known as the Maximum Channel Occupancy Time (MCOT). MCOT takes different values for different traffic types, as defined in the standard specification [10].

Figure 2: NR-U LBT4 Channel access mechanism.
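To make the timing arithmetic of the LBT procedure described above concrete, the snippet below computes the defer period and draws a random backoff counter. The constants follow the values quoted in the text, while $d_i = 3$ and $CW = 16$ are illustrative choices only, not values prescribed here.

```python
import random

# Timing constants from the LBT description (values in microseconds)
T_F = 16      # initial idle period T_f
T_CCA = 9     # duration of one CCA observation slot
D_I = 3       # example number of CCA slots d_i for some priority class (assumption)

def defer_period(d_i: int, t_f: int = T_F, t_cca: int = T_CCA) -> int:
    """Deferred period T_df = T_f + d_i * T_CCA."""
    return t_f + d_i * t_cca

def backoff_counter(cw: int) -> int:
    """Random backoff drawn uniformly from {0, 1, ..., CW-1}."""
    return random.randrange(cw)

if __name__ == "__main__":
    print("T_df =", defer_period(D_I), "us")         # 16 + 3*9 = 43 us
    print("initial backoff =", backoff_counter(16))  # e.g. CW = 16 -> {0, ..., 15}
```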

III Multi-Agent Deep Reinforcement Learning for Reconfigurable MAC protocol

Our goal is to develop a multi-agent learning algorithm for reconfigurable MAC that can effectively adjust according to real wireless network scenarios while overcoming the complexity due to large parameter spaces and partial observability of the environment. First, we define the agent’s decision process as a Partially Observable Markov Decision Process (POMDP), consisting of observations, actions, and rewards. Thereafter, we leverage the proximal policy optimization (PPO) algorithm to train the agents efficiently in a distributed manner.

III-A Problem Formulation

Our ultimate objective is to maximize the long-term throughput averaged over all $N$ gNBs and episode time steps $T$, as follows:

\max \frac{1}{T}\frac{1}{N} \sum_{t=1}^{T} \sum_{j=1}^{N} Th_{j}(t),     (1)

where $Th_j$ is the aggregated downlink throughput of gNB $j$. We consider a network environment where multiple gNBs are deployed, each serving a particular area with a diverse set of traffic types, including Poisson traffic with different arrival rates $\lambda$ and augmented and/or virtual reality (AR/VR) traffic, modeled as bursty traffic with different frame rates [11]. We assume partial observability of the environment at each gNB. Each node has complete autonomy in creating the MAC protocol and adjusting its parameters. This means each agent can manipulate the defer period ($T_{df}$) by modifying the size and number of clear channel assessment slots or the defer time $d_f$, as well as adjust the backoff number, size, and its functions. This allows each agent to create various types of MAC protocols. Moreover, agents can control parameters such as the energy detection threshold ($ED_{Th}$) and the transmission power ($Tx$).
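For clarity, objective (1) is simply the average of the per-gNB downlink throughput over time steps and gNBs; a minimal numpy illustration with synthetic numbers:

```python
import numpy as np

# Th[t, j] holds the aggregated downlink throughput of gNB j at time step t
# (synthetic values, for illustration only).
T, N = 4, 3
Th = np.array([[40., 35., 50.],
               [42., 30., 55.],
               [38., 33., 52.],
               [41., 36., 49.]])  # Mbps

objective = Th.sum() / (T * N)    # (1/T)(1/N) * sum_t sum_j Th_j(t)
print(f"average per-gNB throughput = {objective:.2f} Mbps")  # 41.75 Mbps
```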

III-B Partially Observable Markov Decision Process (POMDP)

Observation space $\mathcal{O}$: Observation $\mathcal{O}_x$ = $\langle CurrentAction_x, NN_x, RSSI_C, RSSI_I, Throughput_x, TR_x, Delay_x, Airtime_x \rangle$. The observation space of agent $x$ is defined as a tuple, which includes the current MAC protocol blocks specified by $CurrentAction_x$; the number of visible nodes in the surrounding area $NN_x$, which depends on the energy detection threshold; the interference from other nodes ($RSSI_I$); and the received power level from the connected user ($RSSI_C$). The agent can also calculate the throughput $Throughput_x$ and the delay $Delay_x$. The parameter $Airtime_x$ is the airtime occupied by other users on the channel, obtained through the sensing capability of the agent. We assume that each node broadcasts its traffic characteristics $TR_x$ and its aggregated downlink throughput. Broadcasting can be done using the X2 interface defined in the NR protocols for communication between neighbouring nodes.
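One possible way to encode this observation tuple for an RL library is sketched below; the dictionary layout and the feature bounds are our assumptions rather than the paper's exact implementation.

```python
import numpy as np
from gymnasium import spaces

# Sketch of the per-agent observation O_x; bounds are illustrative assumptions.
observation_space = spaces.Dict({
    "current_action": spaces.MultiDiscrete([21, 4, 64, 11, 29, 21, 31, 21]),      # CurrentAction_x (Table I)
    "num_neighbors":  spaces.Box(low=0.0, high=64.0, shape=(1,), dtype=np.float32),    # NN_x
    "rssi_connected": spaces.Box(low=-100.0, high=0.0, shape=(1,), dtype=np.float32),  # RSSI_C [dBm]
    "rssi_interf":    spaces.Box(low=-100.0, high=0.0, shape=(1,), dtype=np.float32),  # RSSI_I [dBm]
    "throughput":     spaces.Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float32),  # Throughput_x
    "traffic_rate":   spaces.Box(low=0.0, high=3000.0, shape=(1,), dtype=np.float32),  # TR_x [packets/s]
    "delay":          spaces.Box(low=0.0, high=np.inf, shape=(1,), dtype=np.float32),  # Delay_x
    "airtime":        spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32),     # Airtime_x
})
```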

Action space $\mathcal{A}$: Action $\mathcal{A}_x$ = $\langle MCOT_x, Tx_x, MCS_x, ED_{Th,x}, T_{df}, Backoff\_type_x, CW\_min_x, Sensing\ slot\ duration_x \rangle$. The action space $\mathcal{A}_x$ for each agent $x$ is defined as a tuple containing the MAC block functions and their parameters that determine the behaviour of the MAC protocol. These include the backoff function type $Backoff\_type_x$ and its relevant parameters, such as the sensing slot duration, the minimum contention window size $CW\_min_x$, the energy detection threshold $ED_{Th,x}$, and the defer time $T_{df}$. Additionally, the tuple specifies the modulation and coding scheme $MCS_x$, the maximum channel occupancy time $MCOT_x$, and the transmission power $Tx_x$. Each agent decides whether to include specific MAC protocol blocks and chooses appropriate values for their parameters. Table I summarizes the action space parameters and their corresponding values. Each parameter has a range of possible values, allowing agents to make diverse decisions when configuring the MAC protocol.

Figure 3: Distributed training and execution architecture.

Reward \mathcal{R}: Each agent broadcasts its throughput and traffic rate to the nodes within its range. Each node can also calculate the airtime of other nodes within its range. We define the reward for each agent as follows:

\mathcal{R}_{i} = \frac{\overline{Th}_{i}}{\overline{\lambda}_{i}} - \alpha\,\overline{t}_{air,i},     (2)

where $\overline{Th}_i$ represents the mean normalized aggregated downlink throughput of the $i$-th network, $\overline{\lambda}_i$ is the normalized traffic arrival rate, and $\overline{t}_{air,i}$ denotes the normalized airtime of the $i$-th gNB. Both $\overline{\lambda}_i$ and $\overline{t}_{air,i}$ are normalized with respect to the other nodes within the sensing range. The reward encourages effective usage of the channel by minimizing airtime while maximizing throughput. Additionally, the reward function discourages greediness among agents by considering the throughput and airtime of other nodes within range.
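A minimal sketch of reward (2) is given below. It assumes the agent has already collected the throughput, arrival rate, and airtime of every node within its sensing range, and it interprets the normalization as division by the sum over those nodes, which is one possible reading of the text rather than the paper's exact implementation.

```python
import numpy as np

ALPHA = 0.3  # airtime penalty weight (Table II)

def reward(i, th, lam, airtime):
    """th, lam, airtime: arrays over the nodes within agent i's sensing range (i included)."""
    th_norm  = th[i]      / np.sum(th)        # normalized throughput of node i
    lam_norm = lam[i]     / np.sum(lam)       # normalized arrival rate of node i
    air_norm = airtime[i] / np.sum(airtime)   # normalized airtime of node i
    return th_norm / lam_norm - ALPHA * air_norm

# Example: three mutually sensing gNBs (synthetic values)
th      = np.array([45.0, 40.0, 50.0])    # Mbps
lam     = np.array([1000., 900., 1200.])  # packets/s
airtime = np.array([0.30, 0.28, 0.35])
print(reward(0, th, lam, airtime))
```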

Table I: The Action Space
 | Action parameter | Values range | Standard value
$a_1$ | Sensing slot size [$\mu s$] | {0, 1, 2, …, 20} | 9
$a_2$ | Backoff type | Off, EDID, BEB, Constant | BEB
$a_3$ | Minimum CW | {0, 1, 2, …, 63} | 15
$a_4$ | MCOT [ms] | {0, 1, 2, …, 10} | 2, 3, 5, 8
$a_5$ | MCS | {0, 1, 2, …, 28} | Auto. rate control
$a_6$ | $T_{df}$ [$\mu s$] | {0, 1, 2, …, 20} | 16
$a_7$ | $ED_{Th}$ [dBm] | {-90, -89, …, -60} | -62
$a_8$ | $Tx$ [dBm] | {10, 11, …, 30} | 23
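The discrete ranges in Table I map naturally onto a multi-discrete action space. The encoding below is illustrative only; the index order and the dBm decoding helpers are assumptions, not the paper's implementation.

```python
from gymnasium import spaces

BACKOFF_TYPES = ["Off", "EDID", "BEB", "Constant"]

# One index per action parameter a1..a8 of Table I
action_space = spaces.MultiDiscrete([
    21,                   # a1: sensing slot size {0, ..., 20}
    len(BACKOFF_TYPES),   # a2: backoff type
    64,                   # a3: minimum CW {0, ..., 63}
    11,                   # a4: MCOT {0, ..., 10}
    29,                   # a5: MCS {0, ..., 28}
    21,                   # a6: T_df {0, ..., 20}
    31,                   # a7: ED_Th index, mapped to {-90, ..., -60} dBm
    21,                   # a8: Tx index, mapped to {10, ..., 30} dBm
])

def decode_ed_threshold(index: int) -> int:
    """Map the a7 index back to dBm (assumed linear mapping)."""
    return -90 + index

def decode_tx_power(index: int) -> int:
    """Map the a8 index back to dBm (assumed linear mapping)."""
    return 10 + index
```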

Input: All training parameters from Table II
Output: $\pi_{\Theta_x}$, $V_{\phi_x}$ for all $x \in \{1, 2, \dots, NN\}$

1: For all $x \in \{1, 2, \dots, NN\}$, initialize the actor $\pi_{\Theta_x}(a|s)$ and the critic $V_{\phi_x}$ with random parameters $\Theta_x$ and $\phi_x$, respectively.
2: for iteration = 1, 2, …, $N_{eps}$ do
3:    Initialize the environment
4:    $j = 0$
5:    while $j < T$ do
6:        for agents at $gNB_x$, $x \in \{1, 2, \dots, NN\}$, deployed in parallel do
7:            Generate an experience set of $N$ time steps by following the MAC block policy $\pi_{\Theta_x}$ of every agent in parallel
8:            Collect $(o^x_{t-1}, a^x_{t-1}, r^x_t, o^x_t)$
9:            For each step, calculate the advantage function $A_t = \sum_{i=t}^{L} \gamma^{i-t} r_i - V_{\phi}(o_t)$ and the return function $G_t = A_t + V_{\phi}(o_t)$
10:           Each agent collects a subset of $M$ random samples (mini-batches) from its current set of experiences separately
11:           Calculate the value function loss $L^{VF} = \frac{1}{2M}\sum_{i=1}^{M}(G_i - V_{\phi}(o_i))^2$, minimize $L^{VF}$ using gradient descent, and update $\phi$
12:           Compute the ratio $r_t(\Theta)$ and the entropy $S[\pi_{\Theta}(a_t|o_t)]$, and calculate the surrogate objective $L^{\mathrm{CLIP}} = \frac{1}{M}\sum_{i=1}^{M}\left[\min\left(r_t(\Theta)A_t,\ \mathrm{clip}(r_t(\Theta), 1-\epsilon, 1+\epsilon)A_t\right) + c\, S[\pi_{\Theta}(a_t|o_t)]\right]$
13:           Update the policy network parameters $\Theta$ by maximizing $L^{\mathrm{CLIP}}$, taking gradient ascent
14:           $j = j + N$
15:        end for
16:    end while
17: end for
Algorithm 1: Multi-Agent Distributed Training and Distributed Execution (DTDE) for Reconfigurable MAC Protocol

III-C PPO for multi-agent reconfigurable wireless MAC protocol

We use Proximal Policy Optimization (PPO) [12], developed by OpenAI, for designing MAC policies across different deployment scenarios. PPO is an actor-critic algorithm, meaning it employs two separate neural networks, one for policy and one for value estimation. We adopt a fully distributed approach for both learning and execution. Each gNB node hosts a single agent dedicated to training and inference tasks, as illustrated in Figure 3. Each agent operates autonomously with its own dedicated neural networks, ensuring complete autonomy and decentralization. As shown in Algorithm 1, every agent initializes its value network $V_\phi$ and policy network $\pi_\Theta$ with the respective parameters $\phi$ and $\Theta$ and maintains separate mini-batches. In each iteration, the policy at each node is executed in the environment independently, and each agent accumulates its experiences according to the PPO algorithm (lines 5-6), facilitating individualized learning and adaptation. Subsequently, the advantage function is computed for each time step at each node. The advantage function $A_t$ measures the potential benefit of choosing a particular action in a certain state compared to the average outcome expected when following the current policy, and is defined as follows:

A_t = \sum_{k=0}^{T-t-1} \gamma^{k} r_{t+k} - V_{\phi}(o_t),     (3)

where the first term, the discounted return $G_t = \sum_{k=0}^{T-t-1}\gamma^{k} r_{t+k}$, is calculated from the collected rewards, and $V_\phi(o_t)$ is the value estimate for each observation $o_t$ from the value network.
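The discounted return and advantage in (3) can be computed backwards over a collected trajectory, as in the short sketch below; the discount factor of 0.99 is an assumed value, not one reported in Table II.

```python
import numpy as np

def returns_and_advantages(rewards, values, gamma=0.99):
    """rewards[t] and values[t] = V_phi(o_t) for t = 0, ..., T-1 of one trajectory."""
    T = len(rewards)
    G = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running   # G_t = sum_k gamma^k * r_{t+k}
        G[t] = running
    A = G - np.asarray(values, dtype=float)      # A_t = G_t - V_phi(o_t), Eq. (3)
    return G, A

G, A = returns_and_advantages([1.0, 0.5, 0.8], [0.9, 0.7, 0.6])
```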

To optimize the policy and value networks, we randomly collect samples and add them to mini-batches. The value network is optimized by minimizing the value loss, which is defined as the mean squared error between the predicted values and the computed target values.

L^{VF} = \frac{1}{2M}\sum_{i=1}^{M}\left(G_i - V_{\phi}(o_i)\right)^2     (4)

Following this, we update the parameters $\phi$ of the value network by minimizing the value loss using gradient descent. Concurrently, the optimization process in PPO updates the policy network by maximizing a clipped surrogate objective, given by $L^{\mathrm{CLIP}} = \frac{1}{M}\sum_{i=1}^{M}\left[\min\left(r_i(\Theta)A_i,\ \mathrm{clip}(r_i(\Theta), 1-\epsilon, 1+\epsilon)A_i\right) + c\, S[\pi_\Theta(a_t|o_t)]\right]$. This objective uses the ratio of the new policy to the old policy, which is computed as follows:

r_t(\Theta) = \frac{\pi_{\Theta}(a_t|o_t)}{\pi_{\Theta_{old}}(a_t|o_t)}     (5)

Additionally, a clipping function ensures that updates are not extreme, as significant deviations could destabilize learning progress. Here, $\epsilon$ denotes the clipping parameter. The entropy term, denoted by $S[\pi(\cdot|s;\Theta)]$, encourages exploration by the policy and prevents premature settling on suboptimal deterministic policies. The entropy term is defined as:

S[\pi_{\Theta}(a_t|o_t)] = -\sum_{a}\pi_{\Theta}(a|o_t)\log\pi_{\Theta}(a|o_t),     (6)

where the parameter $c$ is a coefficient that controls the weight of the entropy term. Ultimately, the policy network parameters $\Theta$ are updated by maximizing the objective using gradient ascent.
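Putting (4)-(6) together, a single PPO update on a mini-batch can be sketched as follows; the clipping parameter and entropy coefficient values are placeholders, and the tensor layout is an assumption rather than the paper's implementation.

```python
import torch

def ppo_losses(new_logp, old_logp, advantages, returns, values, entropy,
               eps=0.2, c_entropy=0.01):
    """One PPO update on a mini-batch of M transitions (all inputs are 1-D tensors)."""
    ratio = torch.exp(new_logp - old_logp)                    # Eq. (5)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Maximizing the clipped surrogate plus the entropy bonus is implemented
    # as minimizing its negative with a standard optimizer.
    policy_loss = -(surrogate + c_entropy * entropy).mean()
    value_loss = 0.5 * ((returns - values) ** 2).mean()       # Eq. (4)
    return policy_loss, value_loss

# Mini-batch of M = 4 transitions with placeholder numbers
M = 4
p_loss, v_loss = ppo_losses(torch.randn(M), torch.randn(M), torch.randn(M),
                            torch.randn(M), torch.randn(M), torch.full((M,), 1.2))
```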

Figure 4: Integration of ns3-ai gym with RLlib using a dummy environment.

During the training phase, the broadcasting of rewards and traffic characteristics can be accomplished using the X2 interface defined in the NR protocols for communication between neighbouring nodes, or by adding this information to the packet header.

IV Simulation and Learning Environment

We implement our MADRL framework by integrating ns3-ai and RLlib. To ensure compatibility between the two, we created a dummy environment within RLlib, which collects the data from the ns3-ai gym environment. Our primary simulation environment is ns-3, with ns3-ai responsible for transferring actions and observations between ns-3 and RLlib's dummy environment, as illustrated in Figure 4. Observations collected from the ns-3 environment are relayed to the dummy environment, where the agents analyze them to determine suitable actions. These actions are then applied directly to the corresponding agents within the ns-3 simulation, enabling uninterrupted simulation. We use the 5G NR-U module, a full-stack implementation of NR including the channel access mechanism specified for NR-U, and have ensured its full functionality and compatibility with ns3-ai.
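The RLlib side of this integration can be sketched as below. The bridge object stands in for the ns3-ai shared-memory interface that actually exchanges data with ns-3; its methods, the environment name, and the per-agent policy naming are our assumptions, not the real ns3-ai API, and the hyperparameters only loosely mirror Table II.

```python
import gymnasium as gym
import numpy as np
from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.multi_agent_env import MultiAgentEnv

class Ns3DummyMultiAgentEnv(MultiAgentEnv):
    """Dummy RLlib environment; the real observations/actions flow through ns3-ai."""

    def __init__(self, config):
        super().__init__()
        self.bridge = config.get("bridge")            # placeholder for the ns3-ai proxy
        self._agent_ids = {f"gnb_{i}" for i in range(config.get("num_gnbs", 6))}
        # Per-agent spaces mirroring Sec. III-B and Table I (flattened observation assumed)
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(8,), dtype=np.float32)
        self.action_space = gym.spaces.MultiDiscrete([21, 4, 64, 11, 29, 21, 31, 21])

    def reset(self, *, seed=None, options=None):
        obs = self.bridge.reset()                     # per-agent observation dict from ns-3
        return obs, {}

    def step(self, action_dict):
        obs, rewards, done = self.bridge.step(action_dict)   # forwarded via ns3-ai
        return obs, rewards, {"__all__": done}, {"__all__": False}, {}

tune.register_env("ns3_madrl_env", lambda cfg: Ns3DummyMultiAgentEnv(cfg))

config = (
    PPOConfig()
    .environment(env="ns3_madrl_env", env_config={"num_gnbs": 6})
    .framework("torch")
    .training(lr=1e-3, train_batch_size=1000,         # Table II: learning rate, batch size
              model={"fcnet_hiddens": [256, 256], "use_lstm": True})
    .multi_agent(                                     # one independent policy per gNB agent
        policies={f"gnb_{i}" for i in range(6)},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: agent_id,
    )
)
# algo = config.build(); algo.train()  # run once the real ns3-ai bridge is attached
```

In this sketch, the decentralization of Algorithm 1 is expressed through the one-policy-per-agent mapping, so that each gNB's actor and critic are optimized only from that agent's own experience batches.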

The simulation and training processes were conducted on a server equipped with 2 GPUs and 64 CPU cores. The ns-3 simulations ran on the CPUs, while the training process, involving the machine learning algorithms and neural network models, ran simultaneously on the GPUs for faster convergence of the overall learning process. Figure 5 illustrates the learning convergence of the proposed distributed training and execution approach (DTDE) against the centralized training and execution approach (CTCE) introduced in [6]. In CTCE, a single agent is responsible for both training and execution. The distributed approach converges notably faster than the centralized one, which can be attributed to the CTCE agent having to manage a significantly larger action space, thereby slowing down the learning process. It is also worth noting that, due to the distributed nature of the system and the lack of full control and knowledge over other nodes, DTDE achieves a slightly lower mean reward than centralized learning.

Figure 5: Learning curves comparing the convergence of proposed DTDE against the centralized approach CTCE.
Table II: Training and Environment Parameters
Number of networks ($NN$) | 1-6
Frequency | 6 GHz
Bandwidth | 20 MHz
Traffic characteristic (TR) | Poisson and AR/VR with arrival rate $\lambda \in$ [0, 3000] packets/s
Packet size | 1500 bytes
Learning rate, optimizer | 0.001, Adam
Policy | RNN (2 layers of 256)
Batch size, $M$ | 1000
Step size | 0.1 s
Episode duration | 50 s
$\alpha$ | 0.3

V Performance Evaluation

For the evaluation of our distributed multi-agent approach, i.e., DTDE, we randomly deployed six 5G NR-U gNBs in an area of $200 \times 200\ m^2$. Each gNB is connected to at least one UE. At each gNB, a single agent is deployed that learns the optimal MAC blocks and their parameters based on the current environment and application requirements. We consider Poisson traffic and AR/VR traffic, which has bursty characteristics [11]. We evaluate the performance of the synthesized MAC protocols in terms of mean downlink throughput and average end-to-end packet delay per gNB.

Figure LABEL:fig:results shows the evaluation results for various traffic densities and types. We consider four traffic densities based on the packet arrival rate $\lambda$: low traffic (10 to 500 packets/s), medium traffic (500 to 1000 packets/s), high traffic (1000 to 3000 packets/s), and random-rate traffic (10 to 3000 packets/s). We ran our distributed approach for 10 episodes, with each agent using its own learned model. Afterwards, we compared its performance with that of the standard-based 5G NR-U and the centralized approach, i.e., CTCE. The system parameters of 5G NR-U are listed in Table I.

Figures LABEL:TSUS_1_1 and LABEL:TSUS_2_1 illustrate the distribution of mean throughput and delay across all nodes in the environment for low-density Poisson and AR/VR traffic. As the contention for channel access is minimal and all packets in the queue can be successfully transmitted, both baselines (the standard 5G NR-U and CTCE) and our distributed multi-agent NR-U protocol show similar throughput. In terms of delay, however, both learning approaches show an improvement, owing to the selection of appropriate MAC blocks and parameters and the removal of unnecessary overhead present in the standard 5G NR-U.

Figures LABEL:TSUS_1_2, LABEL:TSUS_1_3, and LABEL:TSUS_1_4 display the throughput distributions for medium, high, and mixed-rate traffic. The results demonstrate that our distributed approach improves mean throughput by at least 10%, primarily due to the reduction of carrier-sensing overhead and the dynamic selection of MAC protocol blocks tailored to each node's specific requirements. This is achieved by selecting the appropriate backoff algorithm, defer, and sensing parameters at each node based on the environmental characteristics observed by each agent.

It is also worth noting that some nodes achieved significantly higher throughput without adversely affecting others, as indicated by the red outlier points. This improvement stems from our framework's ability not only to select the optimal MAC protocols for each scenario but also to adjust transmission power levels to minimize interference with coexisting nodes. Additionally, nodes adjust their sensitivity to interference from neighbouring nodes by changing $ED_{Th}$, which allows them to access the channel more freely, similar to the Basic Service Set (BSS) coloring technique used in Wi-Fi. As a result, nodes gain more opportunities to transmit, leading to a significant reduction in end-to-end packet delay, as shown in Figures LABEL:TSUS_2_2, LABEL:TSUS_2_3, and LABEL:TSUS_2_4.

Overall, our distributed multi-agent NR-U protocol consistently surpasses the standard 5G NR-U protocol and closely matches the performance of the centralized baseline, despite each agent having only partial observation compared to the centralized model, which possesses complete knowledge. This success is largely attributed to our reward function, which ensures that each agent at each gNB considers not only its own performance but also that of neighboring nodes within its sensing range.

VI Conclusions

In this letter, we have proposed a MADRL framework that leverages distributed multi-agent machine learning to empower individual network nodes to autonomously design, optimize, and configure MAC protocols, thus overcoming the limitations of centralized decision-making. By enabling nodes to customize their medium access based on local observations, our approach offers adaptability and scalability tailored to specific environmental conditions. Through extensive simulations, we have demonstrated the superiority of the MADRL-synthesized protocols over the legacy 5G NR-U MAC, highlighting the potential of this new protocol design approach to enhance QoS for future wireless applications.

References

  • [1] N. Naderializadeh, J. J. Sydir, M. Simsek, and H. Nikopour, “Resource Management in Wireless Networks via Multi-Agent Deep Reinforcement Learning,” IEEE Transactions on Wireless Communications, vol. 20, no. 6, pp. 3507–3523, 2021.
  • [2] S. Szott and et al., “Wi-Fi Meets ML: A Survey on Improving IEEE 802.11 Performance With Machine Learning,” IEEE Communications Surveys & Tutorials, vol. 24, no. 3, pp. 1843–1893, 2022.
  • [3] K. Kosek-Szott, S. Szott, and F. Dressler, “Improving IEEE 802.11ax UORA Performance: Comparison of Reinforcement Learning and Heuristic Approaches,” IEEE Access, vol. 10, pp. 120285–120295, 2022.
  • [4] L. Miuccio et al., “Learning Generalized Wireless MAC Communication Protocols via Abstraction,” in Proc. IEEE GLOBECOM, 2022, pp. 2322–2327.
  • [5] M. P. Mota, A. Valcarce, J.-M. Gorce, and J. Hoydis, “The Emergence of Wireless MAC Protocols with Multi-Agent Reinforcement Learning,” in Proc. IEEE Globecom Workshops, 2021, pp. 1–6.
  • [6] N. Keshtiarast and M. Petrova, “ML Framework for Wireless MAC Protocol Design,” in Proc. IEEE International Conference on Machine Learning for Communication and Networking (ICMLCN), 2024, pp. 560–565.
  • [7] H. B. Pasandi and T. Nadeem, “Towards a Learning-Based Framework for Self-Driving Design of Networking Protocols,” IEEE Access, vol. 9, pp. 34829–34844, 2021.
  • [8] P. Wang, M. Petrova, and P. Mähönen, “DMDL: A hierarchical approach to design, visualize, and implement MAC protocols,” in Proc. IEEE WCNC, 2018, pp. 1–6.
  • [9] H. Yin et al., “Ns3-Ai: Fostering Artificial Intelligence Algorithms for Networking Research,” in Proceedings of the 2020 Workshop on Ns-3, ser. WNS3 2020.   New York, NY, USA: Association for Computing Machinery, 2020, p. 57–64. [Online]. Available: https://doi.org/10.1145/3389400.3389404
  • [10] “3GPP; TSG RAN; study on NR- based access to unlicensed spectrum,” document TR 38.889, V16.0.0, 3GPP, Dec. 2018.
  • [11] M. Lecci, A. Zanella, and M. Zorzi, “An ns-3 implementation of a bursty traffic framework for virtual reality sources,” in Proceedings of the 2021 Workshop on Ns-3, ser. WNS3 ’21.   New York, NY, USA: Association for Computing Machinery, 2021, p. 73–80.
  • [12] J. Schulman et al., “Proximal Policy Optimization Algorithms,” arXiv preprint arXiv:1707.06347, 2017.