Edge-Assisted Lightweight Region-of-Interest Extraction and Transmission for Vehicle Perception
Abstract
To enhance on-road environmental perception for autonomous driving, accurate and real-time analytics on high-resolution video frames generated by on-board cameras is crucial. In this paper, we design a lightweight object localization method based on class activation mapping (CAM) to rapidly capture region-of-interest (RoI) boxes that contain driving-safety-related objects from on-board camera frames, which not only improves the inference accuracy of vision tasks but also reduces the amount of transmitted data. Considering the limited on-board computation resources, the RoI boxes extracted from the raw image are offloaded to the edge for further processing. Taking into account both the dynamics of vehicle-to-edge communications and the limited edge resources, we propose an adaptive RoI box offloading algorithm that ensures prompt and accurate inference by adjusting the down-sampling rate of each box. Extensive experimental results on four high-resolution video streams demonstrate that our approach can improve the overall accuracy by up to 16% and reduce the transmission demand by up to 49%, compared with other benchmarks.
I Introduction
The potential of autonomous driving to reduce traffic congestion and improve driving safety has attracted much attention. Its core functional components are sensing, planning, and control [1]. Highly accurate environmental sensing is fundamental to decision making in autonomous driving, especially on complex road segments and in hazardous weather conditions [2]. However, limited by the sensing range and accuracy of on-board sensors, most off-the-shelf vehicles can only reach Level 2 (L2) autonomy [3, 4].
High-resolution video cameras can capture rich on-road information and improve the inference accuracy of perception tasks when paired with high-performance compute [5, 14]. Moreover, advanced convolutional neural networks (CNNs) can process video streams far more effectively than traditional feature extraction methods, but they require substantial accelerator resources, such as central/graphics processing units (CPUs/GPUs). Most existing CNNs are designed for low-resolution inputs to increase processing speed, and such coarse-grained down-sampling of high-resolution input can significantly reduce recognition accuracy [6, 7]. Consequently, taking full advantage of the rich information in high-resolution videos while meeting real-time requirements is one of the key challenges in video analytics. Moreover, self-driving vehicles are latency-sensitive, as latency is directly related to driving safety. The massive amount of data generated by multiple high-resolution cameras deployed inside and outside the vehicle puts tremendous pressure on real-time video analytics systems [9, 10, 8]. Generally, a self-driving vehicle can generate up to 4 terabytes of data per day, and L5 autonomous driving requires a computing facility capable of executing over 2000 tera-operations per second [3]. In this regard, it is paramount to investigate how to achieve highly accurate and low-latency on-board inference with limited computing resources.
Cloud servers with powerful GPUs can help mitigate the computation-intensive inference burden on devices [4, 12]. However, the strict end-to-end latency requirements of self-driving vehicles make cloud-based offloading infeasible [1, 11]. In particular, considering the high-speed motion of vehicles, edge nodes (ENs) in their proximity, typically deployed at road-side units, are more suitable for vehicular networks; a vehicle can offload part of its analytics tasks whenever it is within the coverage of an EN [3, 21]. Considering the packet loss caused by unstable network connections in the Internet of Vehicles (IoV), the amount of data transmitted to ENs should be reduced without deteriorating inference accuracy, in order to guarantee driving safety.
The redundant information in the video frames generated by on-board cameras not only affects detection accuracy but also increases the transmission burden. The RoI in a high-resolution video frame actually occupies only a small fraction of the pixels, which opens the possibility for data compression [5]. RoI extraction is a classic problem in computer vision, addressed, e.g., by the region proposal network in Faster R-CNN [15] and the selective search strategy in R-CNN [16]. However, these RoI extraction methods are time-consuming and difficult to perform in real time. As a result, we need a low-cost RoI extraction method for background de-redundancy that reduces the amount of transmitted data. Class activation mapping (CAM) [18] has a remarkable ability to localize objects with the help of the feature maps extracted by convolutional layers, and can therefore be used for RoI extraction. More importantly, it takes only tenths of a millisecond to process one image.
In this paper, we investigate a lightweight RoI extraction method for autonomous vehicles and propose an edge-assisted video analytics system that minimizes end-to-end latency while ensuring high accuracy for vision tasks. RoI boxes extracted from the high-resolution frame using the localization ability of CAM are assigned different down-sampling rates and offloaded to the edge, aiming to lighten the computing burden on autonomous vehicles. The main contributions of this paper are summarized as follows.
• We propose an edge-assisted real-time video analytics system for high-resolution video streams generated by on-board cameras, which significantly improves the inference accuracy of vision-based vehicle perception.
• We design a lightweight CAM-based RoI extraction method, which extracts RoI boxes from high-resolution frames at low complexity.
• Aiming to strike a balance between inference accuracy and the cost of data processing, we propose an adaptive algorithm, which selects a resolution for each valid RoI box based on network fluctuations and available edge resources.
The remainder of this paper is organized as follows. Section II describes the motivation of our design, followed by the system model in Section III. Section IV presents the adaptive box offloading strategy. Performance evaluation is given in Section V. Finally, Section VI concludes this paper with future work.
II Motivation

[Fig. 1: Object detection results on a 4K frame: (a) directly feeding the down-sampled raw image to the CNN; (b) CAM-based RoI extraction.]
II-A On-Board Processing of High-Resolution Videos
High-resolution images and videos contain many more pixels, which provide rich on-road information for safe autonomous driving [13]. The feature maps extracted by CNNs from such inputs are semantically richer, and the spatial relationships between objects are more accurate. However, there are few lightweight CNNs designed for real-time processing of high-resolution images. The datasets used for training CNNs mostly consist of small images, such as 224×224 or 640×640, which may not be applicable to vision tasks on 4K or 8K images [5]. High-resolution video streams therefore need to be down-sampled before being fed to CNNs, leading to an accuracy drop. The architecture of CNNs could also be scaled up and trained on high-resolution datasets, but at the cost of unacceptable latency [17, 20]. For example, it takes about 49 milliseconds (ms) to run object detection on a 4K image using YOLOv5x (1280) [19] on an RTX 3090 GPU, which is far from real-time inference on video streams.
II-B CAM-based Frame Partitioning
Equal image partitioning is an effective way to improve prediction accuracy, but it cannot work well with limited computation and bandwidth resources [9]. Instead of reducing the amount of data transmitted to the edge, image partitioning even multiplies the computational burden of prediction (each part of the image requires a CNN pass), leading to high end-to-end latency. On the other hand, image partitioning that ignores the object distribution loses accuracy for objects at partition boundaries and wastes resources on areas where no objects exist. Therefore, we consider RoI extraction based on an object localization algorithm to reduce the large volume of redundant data transmitted and to improve prediction accuracy without increasing the input size of the CNNs.
To demonstrate the advantages of image partitioning, we conduct experiments on a computer with an RTX 3090 GPU and employ YOLOv5x (640) for object detection on a 4K YouTube video. As shown in Fig. 1, the detection accuracy of our proposed system is significantly higher than that of directly feeding the raw image to the CNN. Fig. 1(b) shows the result of CAM-based RoI extraction, which effectively improves the inference accuracy. Hence, in this paper, we investigate a low-complexity CAM-based RoI extraction method to achieve fast and highly accurate vehicle perception.
III System Model and Problem Formulation
III-A System Overview

[Fig. 3: CAM-based RoI extraction pipeline: (a) raw image; (b) CAM; (c) masking; (d) RoI extraction; (e) resizing of valid RoI boxes; (f) final result.]
As shown in Fig. 2, the edge-assisted real-time video analytics system consists of three components: feature extraction and cropping, CAM-based RoI extraction, and adaptive RoI box offloading. To reduce the computing consumption of vehicles and the amount of data transferred to the edge, we employ a lightweight feature extractor and a fixed-mode feature partition method for video frame pre-processing. Then, considering the limited on-board computing capacity, a lightweight CAM-based RoI extractor is applied to each feature crop to obtain a set of boxes containing the target objects while disregarding the background region. Before being offloaded from vehicles to ENs, the RoI boxes are re-selected to remove invalid boxes and assigned different down-sampling rates based on the currently available bandwidth and edge resources, so as to better balance inference accuracy and transmission cost. In the rest of this section, we introduce the details of each main component.
III-B Feature Extraction and Cropping
The CAM has a remarkable object localization ability without training on any pixel-level labels [18]. For a given image, let $f_k(x, y)$ denote the activation of channel $k$ in the feature map extracted from the last convolutional layer of the lightweight CNN at spatial location $(x, y)$. After the global average pooling layer, we obtain a set of scalars $F_k = \frac{1}{Z}\sum_{x, y} f_k(x, y)$, each corresponding to the mean value of feature map $k$, where $Z$ is the number of spatial locations. Combined with the classification weight $w_k^c$, we obtain the class activation map of class $c$:

$$M_c(x, y) = \sum_k w_k^c\, f_k(x, y), \qquad (1)$$

where $w_k^c$ reflects the importance of the feature map of channel $k$ for class $c$. Thus, $M_c(x, y)$ gives the contribution of each spatial location to class $c$: a larger $M_c(x, y)$ indicates an area where the object is more likely to appear, which enables CAM-based object localization. However, our experiments reveal that CAM is not good at localizing multiple objects, especially when the objects are small and spatially scattered, and missed target objects deteriorate the overall accuracy. Therefore, to maximize the performance of low-cost CAM-based object localization, we crop the feature map extracted by the lightweight CNN into five parts and perform CAM on each crop separately, as shown in Fig. 2(a), obtaining the activation values of each part. Compared with employing CAM on image partitions directly, our method conducts feature extraction only once, whereas the latter requires four times more computational cost.
On the one hand, the convolution operation preserves the spatial mapping between objects. Therefore, extracting features first and then cropping the feature map yields the same result as partitioning the original image first and then extracting features from each part separately. On the other hand, CAM-based object localization on each feature crop helps to capture discrete objects scattered across different parts of the image. In conclusion, combining feature cropping with CAM for object localization achieves high RoI extraction accuracy with as few computational resources as possible.
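To make the pipeline concrete, the following minimal sketch shows how the class activation maps of Eq. (1) can be computed once per frame and then evaluated on five feature-map crops. It assumes a torchvision ResNet-18 backbone as the lightweight extractor, ImageNet classification weights, an equal-proportion "four plus one" layout, and illustrative class indices; none of these specifics is mandated by the design described above.

```python
# Sketch of CAM computation (Eq. (1)) on feature-map crops.
# Assumptions: torchvision ResNet-18 backbone, ImageNet weights, equal-proportion crops.
import torch
import torchvision

backbone = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
extractor = torch.nn.Sequential(*list(backbone.children())[:-2])  # up to the last conv block
fc_weights = backbone.fc.weight.detach()                          # (num_classes, C) GAP weights

def cam_for_crop(feat_crop: torch.Tensor, class_ids) -> torch.Tensor:
    """M_c(x, y) = sum_k w_k^c f_k(x, y), aggregated over the classes of interest."""
    # feat_crop: (C, h, w) feature-map crop; class_ids: driving-related classes (assumed indices)
    w = fc_weights[class_ids]                        # (len(class_ids), C)
    cam = torch.einsum("nc,chw->nhw", w, feat_crop)  # one activation map per class
    cam = cam.sum(dim=0)                             # aggregate classes of interest
    cam -= cam.min()
    return cam / (cam.max() + 1e-8)                  # normalize to [0, 1]

@torch.no_grad()
def cams_per_part(frame: torch.Tensor, class_ids):
    """Extract features once, then run CAM on each of the five feature-map crops."""
    # frame: normalized (3, H, W) tensor of the input image
    feat = extractor(frame.unsqueeze(0))[0]          # (C, H', W') feature map of the whole frame
    C, H, W = feat.shape
    # Equal-proportion "four plus one" layout: four quadrants plus a centre crop (assumed sizes).
    crops = [feat[:, :H // 2, :W // 2], feat[:, :H // 2, W // 2:],
             feat[:, H // 2:, :W // 2], feat[:, H // 2:, W // 2:],
             feat[:, H // 4:3 * H // 4, W // 4:3 * W // 4]]
    return [cam_for_crop(c, class_ids) for c in crops]
```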
III-C CAM-based RoI Extraction
After employing CAM on each feature-map crop, we obtain a heatmap that indicates the locations of target objects (Fig. 3(b)); areas with higher heat values indicate that an object is more likely to appear there. Therefore, by setting a heat value threshold, we obtain a mask to extract the RoI, as illustrated in Fig. 3(c). A larger threshold leads to smaller extracted regions, which may omit key information and thus reduce the overall accuracy. Conversely, a small threshold makes the extracted RoI contain much redundant background information: the small proportion of the target object in the image increases the difficulty of the visual task, and the redundant information also reduces the efficiency of RoI extraction, increasing the amount of transmitted data. Additionally, retaining a small portion of the background area helps improve the accuracy of the visual task.
In short, we choose an empirical threshold to balance the amount of information extracted and the efficiency of region extraction. After region segmentation, we obtain a set of RoI boxes, as shown in Fig. 3(e), which not only reduces the amount of data transferred but also improves inference accuracy.
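A possible implementation of the thresholding and box extraction step is sketched below, using OpenCV connected components; the heat threshold and the minimum box area are illustrative values rather than the empirically tuned ones used in our experiments.

```python
# Sketch of mask-based RoI box extraction from a normalized CAM heatmap.
# The heat threshold (0.4) and minimum box area are illustrative assumptions.
import cv2
import numpy as np

def extract_roi_boxes(cam: np.ndarray, frame_shape, heat_threshold: float = 0.4,
                      min_area: int = 32 * 32):
    """Return (x, y, w, h) boxes in frame coordinates from a CAM normalized to [0, 1]."""
    h_f, w_f = frame_shape[:2]
    # Upsample the low-resolution CAM to the frame size, then threshold it into a binary mask.
    heat = cv2.resize(cam.astype(np.float32), (w_f, h_f), interpolation=cv2.INTER_LINEAR)
    mask = (heat >= heat_threshold).astype(np.uint8)
    # Connected components give one candidate box per activated region.
    num, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    boxes = []
    for i in range(1, num):                      # label 0 is the background
        x, y, w, h, area = stats[i]
        if area >= min_area:                     # drop tiny activations
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes
```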
III-D RoI Boxes Selection and Offloading
The upper part of Fig. 3(b) illustrates the poor localization performance of CAM over pure background areas: most RoI boxes extracted from such areas are invalid, i.e., they do not contain any target objects. Offloading invalid boxes to the servers not only wastes bandwidth resources but also increases the computational burden at the edge; we therefore adopt an RoI box selection algorithm to remove the invalid boxes.
According to our experimental results, approximately 2 to 6 valid RoI boxes are extracted from a raw frame by our RoI box selection algorithm, with each box ranging in size from 20 KB to 500 KB. On the one hand, due to non-line-of-sight conditions and the high mobility of vehicles, the IoV connection is unstable, leading to packet loss and untimely updates of inference results. On the other hand, predicting the RoI boxes increases the computational burden of the edge by several times compared to inputting the raw video frame directly, which puts pressure on resource-constrained edge servers. Lowering the resolution of the boxes reduces the prediction latency of the CNNs deployed at the edge, at the cost of accuracy deterioration. Therefore, to mitigate the impact of network fluctuations on computation offloading and maximize the prediction performance of the CNNs at the edge, we propose an adaptive resizing strategy that determines the resolution of each box by adjusting its down-sampling rate according to the currently available bandwidth and the computation resources at the edge.
Generally, image resolution largely affects both the inference accuracy and the profiling cost of video analytics tasks. For example, the higher the resolution of each box, the more accurate the inference results obtained by the CNNs, but at the cost of longer prediction latency and higher GPU utilization. To measure the performance of the CNNs in processing the RoI boxes extracted from the video frame at time $t$, we define a utility function:

$$U_t = \sum_{n=1}^{N_t} \left[ A_n(s_n) - \eta\, C_n(s_n) \right], \qquad (2)$$

where $N_t$ denotes the total number of valid RoI boxes at time $t$, $A_n(s_n)$ and $C_n(s_n)$ represent the accuracy and the resource consumption in predicting the $n$-th box with down-sampling rate $s_n$, respectively, and $\eta$ is a weighting factor. The resource consumption consists of two components, i.e., the inference latency and the GPU utilization of the edge server.
III-E Problem Formulation
Our objective is to maximize the utility function under the network bandwidth and edge-side resource constraints:

$$\mathcal{P}: \quad \max_{\{s_n\}} \; U_t \qquad (3)$$
$$\text{s.t.} \quad \sum_{n=1}^{N_t} D_n(s_n) \le B_t, \qquad (4)$$
$$\sum_{n=1}^{N_t} G_n(s_n) \le G_t, \qquad (5)$$
$$R_n(s_n) \ge R_{\min}, \;\; \forall n \in \{1, \dots, N_t\}, \qquad (6)$$

where $B_t$ and $G_t$ denote the current maximum available bandwidth and GPU resources, respectively, $D_n(s_n)$ and $G_n(s_n)$ denote the data volume and the GPU usage of the $n$-th box under down-sampling rate $s_n$, and $R_n(s_n)$ is the resulting box resolution. Constraint (4) ensures that the total data volume transmitted per frame does not exceed the currently available bandwidth, and constraint (5) ensures that the total GPU usage required for RoI box inference does not exceed the available GPU resources at the edge. When the box resolution is smaller than the input size of the CNN model, the inference accuracy drops significantly, so we impose a lower bound $R_{\min}$ on the resolution in constraint (6). We discretize the decision variables by restricting the down-sampling rate to a finite set of values and adopt a hill-climbing algorithm to solve the optimization problem $\mathcal{P}$.

[Fig. 4: (a) "four plus one" feature-map cropping; (b) probability of target objects appearing in P1 and P2; (c) probability after concatenating P1 and P2 with P4.]
Table I: Probability (%) of target objects appearing in the discarded parts under four datasets.

Parts | Dataset1 | Dataset2 | Dataset3 | Dataset4
---|---|---|---|---
P1 and P2 | 1.008 | 1.146 | 19.531 | 1.692
P1 and P2 (concatenated with P4) | 0.129 | 1.076 | 7.513 | 0.866
IV Adaptive Boxes Offloading Strategy
To maximize the performance of CAM-based RoI extraction, we crop the feature map. From the structure of video frames captured by on-board cameras, the middle area of the frame carries the most important information and the densest distribution of objects. Hence, we adopt a "four plus one" cropping method, as shown in Fig. 4(a). Note that the partitions can be of non-equal proportions; only the "four plus one" mode is fixed. The proportion of each part is determined offline according to the focal length of the on-board cameras. Cameras with a large focal length produce frames with a small view range, so we can reduce the proportion of Part 1 (parts are referred to as P for simplicity in the rest of the paper) and P2. In the following experiments, we divide the frame in equal proportions, which covers most scenarios, i.e., each part has the same size.
P1 and P2 mainly contain background areas such as sky and buildings, so RoI boxes extracted from them are almost always invalid. We count the probability of target objects appearing in P1 and P2 (as shown in Fig. 4(b)), as well as in their concatenation with P4 (as shown in Fig. 4(c)), under four different in-vehicle 4K video datasets derived from YouTube. As Table I shows, except for dataset3, where objects appear in the upper half of the frame with relatively high probability, the probability of missing useful information by discarding P1 and P2 can be kept within 1% for the remaining datasets. The higher probability in dataset3 can be handled by expanding the proportion of P3 and P4: when we adjust the ratio of the top half to the bottom half to 1:3, the probability of a target object appearing in P1 and P2 of dataset3 falls below 0.1%. As a result, we do not need to perform CAM on P1 and P2.
As for the invalid boxes extracted from P3, P4, and P5, we introduce an algorithm that controls the transmission frequency of each part to further reduce the amount of data transmitted and the computational burden of the edge, as shown in Algorithm 1 (a sketch is given below). Specifically, all extracted RoI boxes are initially offloaded to the edge at the highest frequency of 30 frames per second (FPS). If no object is detected in a part, the offloading frequency of this part is reduced by 5 FPS, with a lower bound of 1 FPS (the video frame rate is 30 FPS). Once objects are detected in the RoI boxes offloaded to the edge, the transmission frequency is reset to 30 FPS.
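A compact sketch of this frequency control logic is shown below; the class and method names are illustrative, while the 30/5/1 FPS values follow the description above.

```python
# Minimal sketch of the per-part offloading frequency control (Algorithm 1).
# Part names and the dictionary-based state are illustrative assumptions.
MAX_FPS, STEP, MIN_FPS = 30, 5, 1

class PartScheduler:
    """Tracks an offloading frequency for each of P3, P4, and P5."""
    def __init__(self, parts=("P3", "P4", "P5")):
        self.freq = {p: MAX_FPS for p in parts}   # start at the full frame rate

    def should_offload(self, part: str, frame_idx: int) -> bool:
        # Offload roughly freq[part] out of every MAX_FPS frames.
        period = max(1, round(MAX_FPS / self.freq[part]))
        return frame_idx % period == 0

    def update(self, part: str, objects_detected: bool):
        if objects_detected:
            self.freq[part] = MAX_FPS             # reset once objects reappear
        else:
            self.freq[part] = max(MIN_FPS, self.freq[part] - STEP)
```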
V Performance Evaluation
We evaluate the proposed CAM-based RoI extraction and offloading system on multiple video datasets captured by in-vehicle cameras. Key takeaways are as follows:
- The proposed system improves the overall accuracy of the object detection task by up to 16% and the data compression rate by up to 49% compared with other baselines.
- The system adapts well to bandwidth changes, improving the inference accuracy by up to 23% under poor network conditions.
V-A Experimental Setting
Platforms. We implement the CAM-based RoI extraction and adaptive box offloading on an NVIDIA Jetson TX2 (256 CUDA cores), which is regarded as an on-board processor with considerable computation resources. The edge server runs Ubuntu and is equipped with one NVIDIA GeForce RTX 3090 GPU (10496 CUDA cores) and an Intel Core i9-10900K CPU.
CNN Models and Datasets. To validate the ability of our RoI extraction system to handle computer vision tasks, we choose a common autonomous driving task and employ state-of-the-art CNN models for inference: YOLOv5x (640) is used for object detection on the edge server, and ResNet-18 is chosen as the lightweight feature extractor. The test datasets are derived from YouTube and contain four different video streams generated by in-vehicle cameras. We use the original encoded MP4 format (3840×2160 resolution, 30 FPS frame rate) as the input.
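For reference, the models can be instantiated as sketched below, assuming the publicly available torch.hub entry points for YOLOv5 and torchvision's ResNet-18; the 1280-input ground-truth variant and the frame path are assumptions for illustration.

```python
# Sketch of the model setup, assuming the public YOLOv5 torch.hub entry points.
import torch
import torchvision

# Edge-side detector (YOLOv5x, 640-pixel input) and server-side ground-truth model.
detector = torch.hub.load("ultralytics/yolov5", "yolov5x")
ground_truth_model = torch.hub.load("ultralytics/yolov5", "yolov5x6")  # 1280-input variant (assumed)

# On-board lightweight feature extractor.
feature_extractor = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()

results = detector("frame_0001.jpg")   # hypothetical path to a 3840x2160 frame
print(results.pandas().xyxy[0])        # detections as a DataFrame (YOLOv5 hub API)
```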
Baseline. We compare our proposed system with the following video analytics pipelines.
- ELF [9] offloads a low-resolution video frame every three frames to edge servers to handle newly appearing objects. Region proposals are obtained via an attention-based long short-term memory (LSTM) prediction network and the previous inference results.
- EdgeDuet [10] partitions the image into equal tiles and offloads the tiles containing small objects to remote CNN models located on the edge servers, with the aim of improving detection accuracy.
- A local-only variant of our proposed system, which performs only CAM-based RoI extraction and box selection on the vehicle; the RoI boxes are then fed into a local CNN model for inference. To ensure real-time inference on the local side, we choose the YOLOv5s (640) model to run the object detection task on the TX2.
- A raw-frame baseline, which directly inputs the raw video frames into the CNN model without image partitioning.
V-B Performance Comparison
In this section, we evaluate our proposed video analytics pipeline on four different datasets and under varying bandwidth conditions, and compare it with the baselines. We employ the F1 score [14] to measure object detection performance and use YOLOv5x (1280) deployed on the server to obtain the ground truth. Performance is evaluated by the mean accuracy and by the frame size after processing with the different algorithms.
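The F1 computation follows the standard practice of matching each detection to at most one ground-truth box by intersection-over-union (IoU); a minimal sketch under an assumed 0.5 IoU threshold and (x1, y1, x2, y2) box format is given below.

```python
# Sketch of F1-score computation against the server-side ground truth.
# Assumptions: greedy IoU matching at a 0.5 threshold, boxes as (x1, y1, x2, y2) plus class id.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def f1_score(detections, ground_truth, iou_thr: float = 0.5) -> float:
    """detections / ground_truth: lists of (box, class_id); each GT box is matched at most once."""
    matched, tp = set(), 0
    for box, cls in detections:
        best_j, best_iou = None, iou_thr
        for j, (gt_box, gt_cls) in enumerate(ground_truth):
            if j in matched or cls != gt_cls:
                continue
            score = iou(box, gt_box)
            if score >= best_iou:
                best_j, best_iou = j, score
        if best_j is not None:
            matched.add(best_j)
            tp += 1
    fp = len(detections) - tp
    fn = len(ground_truth) - tp
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    return 2 * precision * recall / (precision + recall + 1e-8)
```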

[Fig. 5: (a) mean accuracy of the proposed system and baselines under four on-board datasets; (b) transmitted frame size of three consecutive frames under different algorithms.]
The results are shown in Fig. 5, where Fig. 5(a) compares the mean accuracy of our proposed system and the baselines under four different on-board datasets. Our proposed system obtains the highest average accuracy on all datasets. Both ELF and EdgeDuet partition the raw image into tiles and handle each tile separately to improve detection accuracy. However, ELF obtains region proposals through a prediction network based on previous inference results of the low-resolution video frame transmitted to the edge at the beginning, which limits the accuracy improvement of the whole system. As for EdgeDuet, its equal partitioning scheme cannot eliminate background areas very well, which further hurts detection accuracy. In contrast, our proposed system performs low-cost RoI extraction based on CAM. The results in Fig. 5(a) show that, compared with these baselines, our system achieves up to a 16% increase in mean accuracy.
Figure 5(b) shows the frame size of three consecutive frames, i.e., the amount of data transferred under different algorithms. Our proposed system likewise demonstrates significant advantages in data compression compared with the baselines: the total size of the resized boxes of one frame is only about 100 KB (including four valid boxes). The data size of the first frame in ELF is relatively high because the low-resolution frame is transmitted to the edge for processing together with the region proposals, while the data size of another baseline depends on the frame position within a group of pictures and is also larger than ours.
Vehicles' high mobility leads to intermittent IoV connections and, consequently, packet loss, which calls for adaptability to network conditions in the system design. Our model shows great advantages in terms of bandwidth adaptability. Fig. 6(a) demonstrates that, under poor network conditions, e.g., 20 Mbps, our method can guarantee a mean accuracy of 0.7, while the baselines drop to as low as 0.45. In Fig. 6(b), when the network bandwidth fluctuates, the inference accuracy of the other baselines changes drastically, while our algorithm maintains stable and high inference accuracy. Thanks to the extremely high data compression rate of the RoI extraction process, the proposed algorithm can maintain high accuracy under various network conditions.
Note that the overall computational overhead of our system on the device side consists of 2.94 ms for feature extraction and 2 ms for box size selection. At the edge, different analytics tasks incur different processing latencies; e.g., object detection takes only 10 ms. Moreover, since the total size of the valid boxes after resizing is only about 100 KB, the time for offloading them to the edge in parallel is negligible. In summary, our system can run in real time.

[Fig. 6: (a) mean accuracy under poor network conditions; (b) inference accuracy under fluctuating bandwidth.]
VI Conclusion
In this paper, we have proposed a CAM-based RoI box extraction and adaptive transmission system for vehicle perception, aiming at achieving high accuracy with low edge resource consumption. The system performs feature extraction and cropping on the video frames and conducts CAM on each crop to extract RoI boxes. Valid RoI boxes are then selected, resized, and offloaded to edge servers for better inference. Extensive experimental results have demonstrated the advantages of the proposed system. For future work, we will consider scaling the system up to multi-vehicle collaboration.
References
- [1] E. Coronado, G. Cebrian-Marquez, and R. Riggio, “Enabling computation offloading for autonomous and assisted driving in 5G networks,” in Proc. IEEE GLOBECOM, 2019, pp. 1–6.
- [2] L. Ale, N. Zhang, X. Fang, X. Chen, S. Wu, and L. Li, “Delay-aware and energy-efficient computation offloading in mobile-edge computing using deep reinforcement learning,” IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 3, pp. 881-892, 2021.
- [3] A. Caillot, S. Ouerghi, P. Vasseur, R. Boutteau, and Y. Dupuis, “Survey on cooperative perception in an automotive context,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 14204–14223, 2022.
- [4] P. Yang, J. Hou, L. Yu, W. Chen, and Y. Wu, “Edge-coordinated energy-efficient video analytics for digital twin in 6G,” China Commun., vol. 20, no. 2, pp. 14–25, 2023.
- [5] S. Jiang, Z. Lin, Y. Li, Y. Shu, and Y. Liu, “Flexible high-resolution object detection on edge devices with tunable latency,” in Proc. ACM MobiCom, 2021, pp. 559–572.
- [6] J. Hou, P. Yang, T. Qin, and W. Wu, “Edge-coordinated on-road perception for connected autonomous vehicles using point cloud,” in Proc. Bienn. Symp. Commun. (BSC), 2023, pp. 77–82.
- [7] C. Zhou, P. Yang, Z. Zhang, C. Wang, and N. Zhang, “Bandwidth-efficient edge video analytics via frame partitioning and quantization optimization,” in Proc. IEEE ICC, 2023, pp. 1–7.
- [8] P. Yang, F. Lyu, W. Wu et al., “Edge coordinated query configuration for low-latency and accurate video analytics,” IEEE Trans. Ind. Informat., vol. 16, no. 7, pp. 4855–4864, 2020.
- [9] W. Zhang, Z. He, L. Liu et al., “ELF: Accelerate high-resolution mobile deep vision with content-aware parallel offloading,” in Proc. ACM MobiCom, 2021, pp. 201–214.
- [10] X. Wang, Z. Yang, J. Wu, Y. Zhao, and Z. Zhou, “EdgeDuet: Tiling small object detection for edge assisted autonomous mobile vision,” in Proc. IEEE INFOCOM, 2021, pp. 1–10.
- [11] T. Murad, A. Nguyen, and Z. Yan, “DAO: Dynamic adaptive offloading for video analytics,” in Proc. ACM Multimedia, 2022, pp. 3017–3025.
- [12] X. Dai, P. Yang, X. Zhang et al., “RESPIRE: Reducing spatial-temporal redundancy for efficient edge-based industrial video analytics,” IEEE Trans. Ind. Informat., vol. 18, no. 12, pp. 9324–9334, 2022.
- [13] S. Liu, T. Wang, J. Li et al., “AdaMask: Enabling machine-centric video streaming with adaptive frame masking for DNN inference offloading,” in Proc. ACM Multimedia, 2022, pp. 3035–3044.
- [14] C. Wang, P. Yang, J. Lin, W. Wu, and N. Zhang, “Object-based resolution selection for efficient edge-assisted multi-task video analytics,” in Proc. IEEE GLOBECOM, 2022, pp. 5081–5086.
- [15] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137–1149, 2017.
- [16] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proc. IEEE CVPR, 2014.
- [17] R. Bhardwaj, Z. Xia et al., “Ekya: Continuous learning of video analytics models on edge compute servers,” in Proc. USENIX NSDI, 2022, pp. 119–135.
- [18] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning Deep Features for Discriminative Localization,” in Proc. IEEE CVPR, 2016, pp. 2921–2929.
- [19] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE CVPR, 2016, pp. 779–788.
- [20] Y. Kong, P. Yang, and Y. Cheng, “Edge-assisted on-device model update for video analytics in adverse environments,” in Proc. ACM Multimedia, 2023.
- [21] Y. He, P. Yang, T. Qin, and N. Zhang, “End-edge coordinated joint encoding and neural enhancement for low-light video analytics,” in Proc. IEEE GLOBECOM, 2023.