Content Caching-Assisted Vehicular Edge Computing Using Multi-Agent Graph Attention Reinforcement Learning
Abstract
In order to avoid repeated task offloading and to reuse popular task computing results, we construct a novel content caching-assisted vehicular edge computing (VEC) framework. To cope with the irregular network topology and the unknown environmental dynamics, we further propose a multi-agent graph attention reinforcement learning (MGARL) based edge caching scheme, which utilizes graph attention convolution kernels to integrate the neighboring nodes' features of each agent and thereby enhance the cooperation among agents. Our simulation results show that the proposed scheme improves the utilization of caching resources while reducing the long-term task computing latency compared to the baselines.
Index Terms:
Vehicular Edge Computing, Edge Caching, Multi-Agent, Graph Attention Reinforcement Learning

I Introduction
Recently, mobile edge computing (MEC) has emerged as a promising technology for ultra-reliable and low-latency communication (URLLC) applications, such as virtual reality and autonomous driving [1]. Instead of sending tasks to the far-end cloud, MEC sinks computing and caching resources to the edge nodes close to the users [2]. In vehicular networks, vehicles have limited computing capability but can offload requested tasks to edge servers (ESs) for execution [3, 4, 5], thus relieving the computing pressure on vehicles [6].
Nevertheless, in practical scenarios where vehicular users (VUs) are driving during rush-hour traffic or in congested urban areas, the data they offload exhibits high spatial, temporal, and social correlation, which results in different VUs requesting the same services simultaneously, such as navigation assistance, real-time traffic updates, or recommendations for nearby amenities [7]. As such, multiple VUs may issue similar computing tasks associated with the same computing results when accessing these shared services. In this case, instead of duplicating task re-uploading and re-computing, the previous task computing results can be cached by ESs, e.g., the surrounding Roadside Units (RSUs) or VUs, and further utilized by other VUs to reduce the latency of subsequent tasks [8]. Therefore, designing an optimal content caching policy for edge computing is of great significance for vehicular edge computing (VEC) networks.
Considering the unknown dynamics of time-varying network topology and channel conditions, the edge computing content caching problem is essentially a model-free sequential decision-making problem [9]. By combining the advantages of deep learning in identifying data features and reinforcement learning in dynamic programming, deep reinforcement learning (DRL) has been widely employed in solving such problems [10]. For example, H. Tian et al. [11] used the deep deterministic policy gradient (DDPG) approach to orchestrate joint offloading, caching, and resource allocation decision-making for VEC, taking into account the time-varying content popularity. However, it relies on a centralized learning framework without cooperative learning. Additionally, X. Ye et al. [12] designed a blockchain-based collaborative computing and caching framework for VEC using the multi-agent asynchronous advantage actor-critic (A3C) approach, which verified the efficiency of cooperative learning compared to non-cooperative learning.
TABLE I: Comparison with the related literature.

| Ref. | Dynamic Topology | Irregular Topology | Content Popularity | Graphical Information | Cooperative Learning |
|------|------------------|--------------------|--------------------|-----------------------|----------------------|
| [10] | ✓ | ✗ | ✗ | ✗ | ✓ |
| [11] | ✓ | ✗ | ✓ | ✗ | ✗ |
| [12] | ✓ | ✗ | ✓ | ✗ | ✓ |
| [16] | ✗ | ✗ | ✓ | ✓ | ✓ |
| Ours | ✓ | ✓ | ✓ | ✓ | ✓ |
However, the randomness of VU mobility leads to dynamic and temporal-spatially irregular network topologies, which results in environmental uncertainty and makes environmental knowledge more difficult to learn. Moreover, the aforementioned commonly used DRL solutions relying on convolutional neural networks are weak at extracting the temporal and spatial relationships between nodes in an irregular topology. To address this issue, graph convolutional neural networks have been employed in DRL, which have the benefit of acting directly on graphs and fully utilizing structural information such as the relationships among nodes [13]. Consequently, they have been used for solving optimization problems by jointly considering the information of the agents close to a given agent [14], which further enhances the cooperation among agents [15]. Although D. Wang et al. [16] explored a deep graph reinforcement learning approach for solving the joint caching, communication, and resource allocation problem, they did not carefully consider the dynamic graphical topology characteristics resulting from high vehicular mobility. In this context, this paper investigates the content caching-assisted VEC problem, where the irregular and dynamic network topology and the cooperative learning among multiple vehicles are considered. To the best of our knowledge, at the time of writing, this is the first attempt in the related literature to study the content caching problem in VEC networks relying on a multi-agent graph attention reinforcement learning (MGARL) framework. Compared to the existing literature, the MGARL method has the promising potential to capture the dynamics and irregularities of time-varying vehicular networks, as well as to make better use of the neighboring nodes' information to facilitate cooperative content caching decision-making.
Our main contributions are boldly and explicitly contrasted to the literature in Table I and are detailed as follows:
• A content caching-assisted VEC framework is established, where each VU needs to decide whether to cache the task computing result, with the aim of reducing computing latency and reusing popular content.
• The edge caching decision-making problem is formulated as a decentralized partially observable Markov decision process (DEC-POMDP), where each VU agent observes its local information and takes actions individually based on its learned policy.
• An MGARL-based edge caching scheme is proposed, where graph attention convolution kernels are utilized to capture the relationships among agents by taking the neighboring agents' features into account when making caching decisions.
• Simulation results show that the proposed scheme outperforms the baselines in terms of both improving caching resource utilization and reducing the long-term task computing latency.
II System Model and Problem Formulation
II-A Network Model

As shown in Fig. 1, we consider a Manhattan vehicular network model which consists of multiple horizontal and vertical two-lane two-way roads. RSUs are deployed uniformly, and VUs drive along the roads with their velocities independently obeying a Markov-Gaussian stochastic process; VUs may change their direction at intersections with a given probability $\varepsilon$. Let $\mathcal{R}=\{1,\dots,R\}$ and $\mathcal{V}=\{1,\dots,V\}$ denote the index sets of RSUs and VUs, respectively. Both RSUs and VUs have the capability to compute tasks and cache contents, and are referred to as service nodes (SNs). Let $\mathcal{S}=\mathcal{R}\cup\mathcal{V}$ represent the index set of SNs.
Without loss of generality, we assume that the system operates in a time-slot (TS) mode and that the environmental parameters (e.g., transmit power and channel gain) remain unchanged during each TS. Additionally, new tasks with different popularity characteristics arrive every certain number of TSs, and each task can be completed within one TS. Let $\mathcal{F}=\{1,\dots,F\}$ represent the index set of tasks, where task $f\in\mathcal{F}$ is represented by a 3-tuple $(d_f, c_f, o_f)$ comprising the task size $d_f$, the number of CPU cycles $c_f$ required to process the task, and the size $o_f$ of the processed task content. Note that if the content of the task result has been cached on SNs within the communication range of the VU, the content can be fetched directly from those SNs. Otherwise, the VU has to compute the task itself or offload it for execution. In this case, after completing the task, each VU needs to decide whether to cache the content and to which SN.
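For concreteness, the following minimal sketch illustrates how such a task tuple and the cached-result lookup described above could be represented; the field names and the find_cached_copy helper are purely illustrative and not part of the system model.

```python
from dataclasses import dataclass

@dataclass
class Task:
    """3-tuple describing a task: input size, required CPU cycles, result size."""
    data_size: float      # task size d_f (e.g., in bits)
    cpu_cycles: float     # CPU cycles c_f needed to process the task
    content_size: float   # size o_f of the processed task content

def find_cached_copy(task_id, reachable_sns, cache_map):
    """Return an SN within communication range that caches the task's result,
    or None if the VU must compute or offload the task itself."""
    for sn in reachable_sns:
        if task_id in cache_map.get(sn, set()):
            return sn
    return None
```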
II-B Communication Model
Let $p_v$ and $h_{v,s}(t)$ denote the transmit power of VU $v$ and the channel gain from VU $v$ to SN $s$ in TS $t$, respectively. In this framework, we consider the small-scale fading and the path loss of vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications, neglecting the effects of shadow fading [4]. We assume that inter-user interference has been eliminated by allocating orthogonal resource blocks. Thus, the uplink transmission rate from VU $v$ to SN $s$ in TS $t$ is given by
$$r^{\mathrm{up}}_{v,s}(t) = B \log_2\!\left(1 + \frac{p_v h_{v,s}(t)}{\sigma^2}\right), \qquad (1)$$
where $B$ is the bandwidth of each channel and $\sigma^2$ is the power of the additive white Gaussian noise, both of which are assumed to be the same for each TS. Similarly, let $p_s$ and $h_{s,v}(t)$ denote the transmit power of SN $s$ and the channel gain from SN $s$ to VU $v$ in TS $t$, respectively. Thus, the downlink transmission rate from SN $s$ to VU $v$ in TS $t$ is given by
$$r^{\mathrm{down}}_{s,v}(t) = B \log_2\!\left(1 + \frac{p_s h_{s,v}(t)}{\sigma^2}\right). \qquad (2)$$
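The rate expressions (1) and (2) share the same Shannon-capacity form, as summarized in the hedged sketch below; the symbol-to-argument mapping follows our notation and the numeric example is arbitrary.

```python
import math

def transmission_rate(bandwidth_hz, tx_power_w, channel_gain, noise_power_w):
    """Shannon rate B * log2(1 + p * h / sigma^2), as in (1) and (2)."""
    snr = tx_power_w * channel_gain / noise_power_w
    return bandwidth_hz * math.log2(1.0 + snr)

# Example: 1 MHz channel, 0.1 W transmit power, gain 1e-7, noise power 1e-13 W
r_up = transmission_rate(1e6, 0.1, 1e-7, 1e-13)   # bits per second
```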
II-C Computing Model
Let $f_v(t)$ denote the task requested by VU $v$ in TS $t$ and $\beta_v(t)\in\{0,1\}$ denote whether VU $v$ offloads $f_v(t)$. If $\beta_v(t)=0$, VU $v$ computes $f_v(t)$ itself; otherwise, VU $v$ offloads $f_v(t)$ to the nearest SN $s$.
• Local Computing: Let $g_v$ be the computation capability of VU $v$. The latency of computing task $f_v(t)$ locally is given by $D^{\mathrm{loc}}_v(t) = c_{f_v(t)} / g_v$.
• Task Offloading: Let $g_s$ be the computation capability of SN $s$. Then, the latency of completing task $f_v(t)$ is expressed as $D^{\mathrm{off}}_v(t) = D^{\mathrm{tra}}_v(t) + D^{\mathrm{comp}}_v(t) + D^{\mathrm{fet}}_v(t)$, where $D^{\mathrm{tra}}_v(t) = d_{f_v(t)} / r^{\mathrm{up}}_{v,s}(t)$, $D^{\mathrm{comp}}_v(t) = c_{f_v(t)} / g_s$, and $D^{\mathrm{fet}}_v(t) = o_{f_v(t)} / r^{\mathrm{down}}_{s,v}(t)$ denote the task offloading latency, the task computing latency, and the result content fetching latency, respectively (a short illustrative sketch of both latency branches follows this list).
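The two latency branches above can be summarized as follows; this is a sketch under our notation, reusing the illustrative Task record from Section II-A and the transmission rates of (1) and (2).

```python
def local_latency(cpu_cycles, vu_capability_hz):
    """Latency of computing the task on the VU itself: c_f / g_v."""
    return cpu_cycles / vu_capability_hz

def offload_latency(task, sn_capability_hz, r_up, r_down):
    """Offloading latency = uploading + edge computing + result fetching."""
    t_upload = task.data_size / r_up            # d_f / r_up
    t_compute = task.cpu_cycles / sn_capability_hz   # c_f / g_s
    t_fetch = task.content_size / r_down        # o_f / r_down
    return t_upload + t_compute + t_fetch
```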
II-D Caching Model
Let $\mathbf{c}(t) = \{c_{f,s}(t) \mid f\in\mathcal{F}, s\in\mathcal{S}\}$ represent whether the computing result contents are cached by the SNs in TS $t$. Specifically, $c_{f,s}(t) = 1$ if the content of task $f$ is cached on SN $s$ in TS $t$, and $c_{f,s}(t) = 0$ otherwise. Note that all SNs have limited caching capacity, with $C_s$ for SN $s$; thus it is impossible to cache all task contents. Let us further denote $a_v(t) \in \{0\} \cup \mathcal{S}$ as the caching decision variable of VU $v$ in TS $t$. To be specific, if $a_v(t) = s$, VU $v$ caches the computing result content to SN $s$ and $c_{f_v(t),s}(t+1) = 1$; otherwise ($a_v(t) = 0$), VU $v$ does not cache the content to any SN. Then, the caching decisions of all VUs in TS $t$ can be denoted as $\mathbf{a}(t) = \{a_1(t), \dots, a_V(t)\}$.
Once the content of the computing result of task $f$ is cached by at least one SN in TS $t$, we have $\sum_{s\in\mathcal{S}} c_{f,s}(t) \geq 1$. We assume that the task content popularity follows the Zipf distribution; thus the probability of task content $f$ being requested is expressed as $P_f(t) = k_f(t)^{-\delta} / \sum_{f'\in\mathcal{F}} k_{f'}(t)^{-\delta}$, where $k_f(t)$ denotes the ranking of the popularity of task content $f$ in descending order in TS $t$, and $\delta$ is the Zipf distribution parameter.
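As an illustration of the popularity model, the following sketch generates Zipf request probabilities from a given popularity ranking and samples one request; the ranking and the parameter value are placeholders, not simulation settings.

```python
import numpy as np

def zipf_request_probabilities(ranks, delta):
    """P_f = rank_f^{-delta} / sum_{f'} rank_{f'}^{-delta} (Zipf popularity)."""
    weights = np.asarray(ranks, dtype=float) ** (-delta)
    return weights / weights.sum()

# Example: 5 contents ranked 1..5 with Zipf parameter 0.8
probs = zipf_request_probabilities([1, 2, 3, 4, 5], delta=0.8)
requested = np.random.choice(5, p=probs)   # index of the task requested in this TS
```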
II-E Problem Formulation
Our aim is to design an edge caching scheme for the VEC system to minimize the long-term task computing latency by utilizing the limited edge caching resources. To formulate our problem, we introduce an auxiliary variable $e_v(t) \in \{0,1\}$, where $e_v(t) = 1$ when VU $v$ can fetch the required content directly from SNs within its communication range, and $e_v(t) = 0$ otherwise. Thus, the total computation latency is the local computing latency or the task offloading latency when $e_v(t) = 0$, and is the latency of fetching the cached task result when $e_v(t) = 1$. Then, the task computing latency of VU $v$ in TS $t$ can be expressed as $D_v(t) = \big(1 - e_v(t)\big)\big[(1-\beta_v(t)) D^{\mathrm{loc}}_v(t) + \beta_v(t) D^{\mathrm{off}}_v(t)\big] + e_v(t) D^{\mathrm{fet}}_v(t)$. Mathematically, the optimization problem is formulated as
$$\min_{\{\mathbf{a}(t)\}} \ \lim_{T\to\infty} \frac{1}{T} \sum_{t=1}^{T} \sum_{v\in\mathcal{V}} D_v(t) \qquad (3a)$$
$$\text{s.t.} \quad a_v(t) \in \{0\} \cup \mathcal{S}, \ c_{f,s}(t) \in \{0,1\}, \ \forall v, f, s, t, \qquad (3b)$$
$$\sum_{f\in\mathcal{F}} o_f \, c_{f,s}(t) \leq C_s, \ \forall s \in \mathcal{S}, \ \forall t, \qquad (3c)$$
where constraint (3c) indicates that the caching capacity occupied at SN $s$ cannot exceed its maximum capacity $C_s$.

III The Design of DEC-POMDP
Considering that the vehicular environment is dynamically changing and that each VU is unable to obtain global environmental knowledge, we further formulate the above edge caching decision-making problem as a DEC-POMDP. Explicitly, each VU agent obtains a local observation and takes actions individually. By interacting with the environment iteratively, the agents learn their strategies so as to maximize the system reward. Next, we elaborate on the definitions of the DEC-POMDP.
III-A Observation
Due to limited sensing and positioning technology, we assume that each VU agent can only observe its own location, its caching state, and the current remaining caching capacities of all SNs. Accordingly, the observation of VU agent $v$ in TS $t$ can be defined as

$$o_v(t) = \left\{x_v(t), \ y_v(t), \ c_v(t), \ C^{\mathrm{re}}_1(t), \dots, C^{\mathrm{re}}_S(t)\right\}, \qquad (4)$$

where $x_v(t)$ and $y_v(t)$ are the current horizontal and vertical coordinates of VU agent $v$, respectively, $c_v(t)$ is its caching state, and $C^{\mathrm{re}}_s(t)$ is the remaining caching capacity of SN $s$.
III-B Action
Based on its learned policy and its observation $o_v(t)$ in TS $t$, VU agent $v$ selects an action $a_v(t)$ to decide whether to cache the content and, if so, to which SN.
III-C Reward Function
In order to minimize the long-term task computing latency, the local reward of VU agent $v$ in TS $t$ is designed as

$$r_v(t) = -D_v(t) + \phi_v(t) + \psi_v(t), \qquad (5)$$

where $\phi_v(t)$ is the reward for encouraging the caching of computing results. Moreover, $\psi_v(t)$ is negative if the caching decision made by VU agent $v$ leads to an excess of caching capacity, and $\psi_v(t) = 0$ otherwise. Accordingly, the system reward, defined as the sum of all local rewards, is given by $r(t) = \sum_{v\in\mathcal{V}} r_v(t)$.
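A minimal sketch of how the local and system rewards of (5) could be evaluated is given below; the action encoding (0 for "do not cache", s > 0 for "cache to SN s"), the bonus, and the penalty magnitudes are our assumptions for illustration.

```python
def local_reward(task_latency, action, cache_fits, cache_bonus=1.0, penalty=-1.0):
    """r_v(t) = -D_v(t) + caching bonus + capacity penalty, cf. (5)."""
    reward = -task_latency
    if action > 0:                 # the agent decided to cache its result to SN `action`
        reward += cache_bonus      # encourage reusing popular computing results
        if not cache_fits:         # the decision exceeds the SN's remaining capacity
            reward += penalty
    return reward

def system_reward(local_rewards):
    """System reward is the sum of all local rewards."""
    return sum(local_rewards)
```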
IV MGARL-based Scheme
To adapt to the dynamically changing irregular topology and capture the relationships among agents, we resort to MGARL so as to exploit the cooperation of multiple agents in solving the edge caching decision-making problem. As illustrated in Fig. 2, MGARL consists of three modules: graph modeling, latent feature generating, and parameter updating.
IV-A Graph Modeling
The vehicular network topology in TS $t$ can be modeled as a graph $G(t) = (\mathcal{V}, \mathcal{E}(t))$, where the node set $\mathcal{V}$ is the set of VUs and the edge set $\mathcal{E}(t)$ is determined by the distances among vehicles. To be specific, there exists an edge between two nodes when their distance does not exceed the maximum communication range, and such vehicles are neighbors of each other. Let $\mathcal{B}_v(t)$ denote the set of neighbors of node $v$ in TS $t$. Each node in the graph has its own node features, denoted by $\boldsymbol{z}_v(t)$ for node $v$ in TS $t$, which are generated from the observation $o_v(t)$ through a fully connected network.
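A minimal sketch of this graph-construction step, assuming planar VU coordinates and a fixed V2V communication range, is given below.

```python
import numpy as np

def build_neighbor_sets(positions, comm_range):
    """positions: (V, 2) array of VU coordinates; returns each node's neighbor set."""
    positions = np.asarray(positions, dtype=float)
    v = len(positions)
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    return [set(np.flatnonzero((dists[i] <= comm_range) & (np.arange(v) != i)))
            for i in range(v)]
```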
Algorithm 1 Training Process for MGARL-Based Content Caching-Assisted VEC Scheme
IV-B Latent Feature Generating
The latent feature generating stage utilizes a convolutional layer to integrate the node features in node $v$'s local region, which includes node $v$ and its neighbors $\mathcal{B}_v(t)$, so as to generate the latent feature in TS $t$. Firstly, in TS $t$, all nodes' feature vectors are merged into a feature matrix $\boldsymbol{Z}(t)$ of size $V \times L$ in the order of index, where $L$ is the length of the feature vector. Then, let $\mathcal{B}^{+}_v(t)$ represent the set comprising node $v$ and $\mathcal{B}_v(t)$, and we introduce an adjacency matrix $\boldsymbol{A}_v(t)$ of size $(|\mathcal{B}_v(t)|+1) \times V$ to denote the one-hot representations of $\mathcal{B}^{+}_v(t)$. To be specific, the first row of $\boldsymbol{A}_v(t)$ is the one-hot representation of node $v$, and the $(i+1)$th row is the representation of the $i$th neighbor of node $v$. Then, the features in the local region of node $v$ are obtained by $\boldsymbol{Z}_v(t) = \boldsymbol{A}_v(t) \boldsymbol{Z}(t)$.
To further capture the relationships among nodes, multi-head attention is adopted as the convolution kernel to integrate the feature vectors in the local region of node $v$ and generate the latent feature vector $\boldsymbol{z}'_v(t)$. Let $\boldsymbol{W}^m_Q$, $\boldsymbol{W}^m_K$, and $\boldsymbol{W}^m_V$ denote the parameter matrices corresponding to the query, key, and value representations of attention head $m$, respectively. For attention head $m$, the relationship between node $v$ and its neighbor $u \in \mathcal{B}^{+}_v(t)$ in TS $t$ can be formulated as
$$\alpha^{m}_{v,u}(t) = \frac{\exp\!\left(\tau \cdot \big(\boldsymbol{W}^m_Q \boldsymbol{z}_v(t)\big)^{\top} \boldsymbol{W}^m_K \boldsymbol{z}_u(t)\right)}{\sum_{u' \in \mathcal{B}^{+}_v(t)} \exp\!\left(\tau \cdot \big(\boldsymbol{W}^m_Q \boldsymbol{z}_v(t)\big)^{\top} \boldsymbol{W}^m_K \boldsymbol{z}_{u'}(t)\right)}, \qquad (6)$$
where $\tau$ is a scaling factor. Next, the outputs of the $M$ attention heads for node $v$ are concatenated and then fed into a nonlinear function $\mu(\cdot)$ to produce the output of the convolutional layer, expressed as $\boldsymbol{z}'_v(t) = \mu\big(\mathrm{con}\big[\sum_{u \in \mathcal{B}^{+}_v(t)} \alpha^{m}_{v,u}(t) \boldsymbol{W}^m_V \boldsymbol{z}_u(t), \ \forall m \in \{1,\dots,M\}\big]\big)$, where con denotes the concatenation of the outputs of the $M$ attention heads. Note that more graphical information can be extracted by increasing the number of convolutional layers.
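The following numpy sketch illustrates a single graph attention convolution kernel of the kind described above; the parameter shapes, the ReLU output nonlinearity, and the per-node loop (instead of the one-hot adjacency matrix multiplication) are our simplifying assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def graph_attention_layer(features, neighbors, Wq, Wk, Wv, tau=0.25):
    """features: (V, L) node features; neighbors: list of neighbor index sets;
    Wq/Wk/Wv: (M, L, D) per-head projection matrices. Returns (V, M*D) latent features."""
    V = features.shape[0]
    M, _, D = Wq.shape
    out = np.zeros((V, M * D))
    for i in range(V):
        region = [i] + sorted(neighbors[i])            # node i plus its neighbors B_i
        Z = features[region]                           # local-region features, (|region|, L)
        heads = []
        for m in range(M):
            q = features[i] @ Wq[m]                    # query of node i, shape (D,)
            K = Z @ Wk[m]                              # keys of the local region, (|region|, D)
            alpha = softmax(tau * (K @ q))             # attention weights over the region, cf. (6)
            heads.append(alpha @ (Z @ Wv[m]))          # weighted sum of value vectors, (D,)
        out[i] = np.maximum(0.0, np.concatenate(heads))  # concatenate heads, then ReLU
    return out
```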
IV-C Parameter Updating
Similar to the traditional deep Q-network (DQN) algorithm, which uses Q values to estimate the long-term cumulative reward of taking a certain action in the current state, the MGARL algorithm utilizes both a training network and a target network, where the training network parameters are updated by minimizing a loss function in which the target network provides the temporal-difference target. By extracting graphical information to facilitate cooperative learning, the MGARL method can capture the dynamics and irregularities better than traditional DRL methods. In order to utilize historical data for training and improve sample utilization, the tuple of vehicular agent $v$ is stored in the shared replay buffer in each TS. During training, mini-batches of samples are randomly selected to update the parameters of the training network. Particularly, in order to achieve stable interactions between agents, MGARL regularizes the attention weight distribution of the last convolutional layer by minimizing the Kullback-Leibler (KL) divergence between the attention weight distribution in the current TS and that in the next TS. Then, the loss function is expressed as
$$\mathcal{L}(\theta) = \frac{1}{V}\sum_{v\in\mathcal{V}} \left( y_v(t) - Q\big(O_v(t), a_v(t); \theta\big) \right)^{2} + \lambda \frac{1}{M}\sum_{m=1}^{M} D_{\mathrm{KL}}\!\left( \mathcal{G}^{m}_{K}\big(O_v(t); \theta\big) \,\big\|\, \mathcal{G}^{m}_{K}\big(O_v(t+1); \theta\big) \right), \qquad (7)$$

where $y_v(t) = r_v(t) + \gamma \max_{a'} Q\big(O_v(t+1), a'; \theta^{-}\big)$ is the target Q value generated by the target network with parameters $\theta^{-}$, and $\gamma$ is the discount factor. $O_v(t)$ denotes the set of observations of the nodes in agent $v$'s local region, determined by the adjacency matrix $\boldsymbol{A}_v(t)$. The Q function parameterized by $\theta$ takes $O_v(t)$ as input and outputs the Q value for agent $v$. $\lambda$ is the coefficient of the KL loss, and $\mathcal{G}^{m}_{K}\big(O_v(t); \theta\big)$ denotes the attention weight distribution of the relation representations of attention head $m$ in the last convolutional layer $K$ for agent $v$. The details of the training process of the proposed MGARL-based algorithm are listed in Algorithm 1.
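For illustration, the sketch below evaluates a single-sample version of this loss: a DQN-style TD error using the target network, plus the KL regularizer over the last layer's attention weights; the tensor shapes and the way the regularizer is averaged are our assumptions rather than the exact batching used in training.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def mgarl_loss(q_train, q_target_next, actions, rewards, gamma,
               attn_now, attn_next, lam):
    """q_train: (V, A) training-network Q values; q_target_next: (V, A) target-network
    Q values for the next TS; attn_now/attn_next: (V, M, R) attention weights of the
    last convolutional layer in the current and next TS."""
    V = q_train.shape[0]
    td_loss, kl_loss = 0.0, 0.0
    for i in range(V):
        y = rewards[i] + gamma * q_target_next[i].max()   # TD target y_v(t), cf. (7)
        td_loss += (y - q_train[i, actions[i]]) ** 2
        for m in range(attn_now.shape[1]):                # loop over attention heads
            kl_loss += kl_divergence(attn_now[i, m], attn_next[i, m])
    return (td_loss + lam * kl_loss) / V
```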
IV-D Complexity Analysis
Let us now discuss the complexity of the graph attention convolutional layer in TS $t$, which includes the feature mapping of nodes and the calculation of attention weights. We assume that the graph constructed in TS $t$ consists of $|\mathcal{E}(t)|$ edges and that each node feature is mapped from dimension $L$ to a space of dimension $L'$. Then, the computational complexity of mapping the node features is $\mathcal{O}(V L L')$, while the computational complexity of calculating the attention weights is related to the number of edges and is expressed as $\mathcal{O}(|\mathcal{E}(t)| L')$. Therefore, the total computational complexity with $M$ attention heads is determined by the number of VUs, the number of edges, and the node feature dimensions, and is given by $\mathcal{O}\big(M (V L L' + |\mathcal{E}(t)| L')\big)$.
V Simulation Results
V-A Simulation Settings
We consider a Manhattan vehicular network model, where the length and width of the road are both km. By default, we set , , MB, MB, Megacycles , MHz, dBm/Hz, , , and . The transmit powers of RSUs and VUs are set to dBm and dBm, respectively. The caching capacities of RSUs and VUs are MB and MB, respectively. The computation capabilities of RSUs and VUs are GHz and GHz, respectively. The communication ranges of RSUs and of V2V communication are set to m and m, respectively. We adopt the channel model of [4]. In terms of the VU mobility model, we set and the mean VU velocity as km/h; for the other parameter settings, please refer to [17]. With regard to the learning configurations, , , the learning rate is and the buffer capacity is . For a fair comparison, we adopt the proposed scheme without attention (w/o attention), the multi-agent independent double deep Q-network (IDDQN) based scheme, and the random content caching scheme as baselines.
V-B Performance Evaluation

Fig. 3 presents the convergence performance of all schemes. It can be observed that the cumulative reward of the proposed scheme is higher than those of the baselines. This is attributed to the fact that the proposed scheme, together with the w/o attention and IDDQN-based schemes, can learn a policy by continuously interacting with the environment. Moreover, the proposed scheme utilizes graph attention convolution kernels to capture the relationships among agents and thereby further enhance cooperation, which makes it more adaptable to irregular network topologies and consequently yields the highest cumulative reward.


The converged content hit ratio versus the number of VUs is investigated in Fig. 4 for the cases where the number of content types (CTs) is 10 and 20. The first point to observe is that the converged content hit ratio increases as the number of VUs increases. This is because more users lead to an increasing number of content requests, thus increasing the probability of reusing the same cached content. Secondly, it is clear that the content hit ratio is reduced when the number of CTs is increased from 10 to 20. The underlying reason is that, due to the limited caching capacity of the SNs, they may not be able to cache all the contents, which decreases the hit ratio of each content. Moreover, we can see that our proposed scheme outperforms the other schemes in improving the content hit ratio regardless of both the number of CTs and the number of VUs, which verifies the effectiveness of the proposed scheme in improving caching resource utilization.
Fig. 5 compares the converged total system latency versus the number of RSUs. Firstly, as the number of RSUs increases, the converged total system latency exhibits a downward trend. The underlying reason for this phenomenon is that with more RSUs having computation and caching capability, VUs are more likely to fetch the cached result content from the surrounding RSUs, thus avoiding the repeated uploading and computing process. Secondly, it can be seen that the proposed scheme outperforms other schemes under different numbers of RSUs, since multiple graph attention convolutional layers can extract more hidden structural information to facilitate cooperative learning. This phenomenon implies that the proposed scheme can make better use of densely deployed cache resources to further reduce the total system latency.
VI Conclusion
In this paper, we proposed a content caching-assisted VEC framework that takes into account the reuse of popular task computing results. To adapt to the irregular network topology and the environmental uncertainty, we developed an MGARL-based edge caching scheme for VEC networks, which exploits the cooperation among agents by integrating neighboring nodes' information into the decision-making. Compared to the baselines, the proposed scheme can better learn the irregular topology dynamics, thus significantly reducing the task computing latency and improving the utilization of densely deployed caching resources.
References
- [1] K. Jiang, H. Zhou, X. Chen, and H. Zhang, “Mobile edge computing for ultra-reliable and low-latency communications,” IEEE Commun. Mag., vol. 5, no. 2, pp. 68–75, 2021.
- [2] Z. Yao, S. Xia, Y. Li, and G. Wu, “Cooperative task offloading and service caching for digital twin edge networks: A graph attention multi-agent reinforcement learning approach,” IEEE J. Sel. Areas Commun., vol. 41, no. 11, pp. 3401–3413, 2023.
- [3] H. Zhou, K. Jiang, S. He, G. Min, and J. Wu, “Distributed deep multi-agent reinforcement learning for cooperative edge caching in Internet of Vehicles,” IEEE Trans. Wireless Commun., vol. 22, no. 12, pp. 9595–9609, 2023.
- [4] Y. Lin, Y. Zhang, J. Li, F. Shu, and C. Li, “Popularity-aware online task offloading for heterogeneous vehicular edge computing using contextual clustering of bandits,” IEEE Internet of Things J., vol. 9, no. 7, pp. 5422–5433, 2022.
- [5] Y. Lin, J. Bao, Y. Zhang, J. Li, F. Shu, and L. Hanzo, “Privacy-preserving joint edge association and power optimization for the Internet of Vehicles via federated multi-agent reinforcement learning,” IEEE Trans. Veh. Technol., vol. 72, no. 6, pp. 8256–8261, 2023.
- [6] L. Liu, X. Yuan, N. Zhang, D. Chen, K. Yu, and A. Taherkordi, “Joint computation offloading and data caching in multi-access edge computing enabled Internet of Vehicles,” IEEE Trans. Veh. Technol., vol. 72, no. 11, pp. 14939–14954, 2023.
- [7] M. W. Al Azad and S. Mastorakis, “The promise and challenges of computation deduplication and reuse at the network edge,” IEEE Wireless Commun., vol. 29, no. 6, pp. 112–118, 2022.
- [8] F. Zeng, K. Zhang, L. Wu, and J. Wu, “Efficient caching in vehicular edge computing based on edge-cloud collaboration,” IEEE Trans. Veh. Technol., vol. 72, no. 2, pp. 2468–2481, 2023.
- [9] G. Qiao, S. Leng, S. Maharjan, Y. Zhang, and N. Ansari, “Deep reinforcement learning for cooperative content caching in vehicular edge computing and networks,” IEEE Internet of Things J., vol. 7, no. 1, pp. 247–257, 2020.
- [10] L. Geng, H. Zhao, J. Wang, A. Kaushik, S. Yuan, and W. Feng, “Deep-reinforcement-learning-based distributed computation offloading in vehicular edge computing networks,” IEEE Internet of Things J., vol. 10, no. 14, pp. 12416–12433, 2023.
- [11] H. Tian, X. Xu, L. Qi, X. Zhang, W. Dou, S. Yu, and Q. Ni, “Copace: Edge computation offloading and caching for self-driving with deep reinforcement learning,” IEEE Trans. Veh. Technol., vol. 70, no. 12, pp. 13281–13293, 2021.
- [12] X. Ye, M. Li, P. Si, R. Yang, Z. Wang, and Y. Zhang, “Collaborative and intelligent resource optimization for computing and caching in IoV with blockchain and MEC using A3C approach,” IEEE Trans. Veh. Technol., vol. 72, no. 2, pp. 1449–1463, 2023.
- [13] R. Akmam Dziyauddin, D. Niyato, N. Cong Luong, M. A. M. Izhar, M. Hadhari, and S. Daud, “Computation offloading and content caching delivery in vehicular edge computing: A survey,” arXiv e-prints, 2019.
- [14] J. Jiang, C. Dun, T. Huang, and Z. Lu, “Graph convolutional reinforcement learning,” arXiv preprint arXiv:1810.09202, 2018.
- [15] Z. Yao, Y. Li, S. Xia, and G. Wu, “Attention cooperative task offloading and service caching in edge computing,” in Proc. IEEE Global Communications Conference (GLOBECOM), 2022, pp. 5189–5194.
- [16] D. Wang, Y. Bai, G. Huang, B. Song, and F. R. Yu, “Cache-aided MEC for IoT: Resource allocation using deep graph reinforcement learning,” IEEE Internet of Things J., vol. 10, no. 13, pp. 11486–11496, 2023.
- [17] Y. Lin, Z. Zhang, Y. Huang, J. Li, F. Shu, and L. Hanzo, “Heterogeneous user-centric cluster migration improves the connectivity-handover trade-off in vehicular networks,” IEEE Trans. Veh. Technol., vol. 69, no. 12, pp. 16027–16043, 2020.