Multimodal Collaboration Networks for Geospatial Vehicle Detection in Dense, Occluded, and Large-Scale Events
Abstract
In large-scale disaster events, the planning of optimal rescue routes depends on the ability to detect objects at the disaster scene, with one of the main challenges being the presence of dense and occluded objects. Existing methods, which are typically based on the RGB modality, struggle to distinguish targets with similar colors and textures in crowded environments and are unable to identify obscured objects. To this end, we first construct two multimodal dense and occluded vehicle detection datasets for large-scale events, utilizing RGB and height map modalities. Based on these datasets, we propose a multimodal collaboration network for dense and occluded vehicle detection, MuDet for short. MuDet hierarchically enhances the completeness of discriminable information within and across modalities and differentiates between simple and complex samples. MuDet includes three main modules: Unimodal Feature Hierarchical Enhancement (Uni-Enh), Multimodal Cross Learning (Mul-Lea), and the Hard-easy Discriminative (He-Dis) Pattern. Uni-Enh and Mul-Lea enhance the features within each modality and facilitate the cross-integration of features from two heterogeneous modalities. He-Dis effectively separates densely occluded vehicle targets with significant intra-class differences and minimal inter-class differences by defining and thresholding confidence values, thereby suppressing the complex background. Experimental results on two re-labeled multimodal benchmark datasets, the 4K-SAI-LCS dataset and the ISPRS Potsdam dataset, demonstrate the robustness and generalization of MuDet. The code of this work is openly available at https://github.com/Shank2358/MuDet.
Index Terms:
Large-scale Disaster Events, Remote Sensing, Multimodal Vehicle Detection, Convolutional Neural Networks, Dense and Occluded, Hard-easy Balanced Attention
I Introduction
Remote sensing (RS) imagery has long been utilized across various phases of large-scale disaster events, including early warning and damage evaluation. With the rapid development of satellites and aircraft, as well as the growing popularity of unmanned aerial vehicles (UAVs) [1], the spatial and spectral resolutions of RS data have continuously improved [2], providing a solid data foundation for applications in large-scale disaster events. Object detection [3, 4, 5] in RS images is the technical basis for risk assessment and rescue. However, research focusing on object detection, e.g., vehicle detection, in large-scale events, which involves challenges such as density, occlusion, and even distortion, remains relatively limited.
In RS object detection, especially when utilizing deep learning (DL) models, high-quality labeling is crucial for defining precise object boundaries and categories. This accuracy is vital for the model's ability to learn and recognize the distinctive features of each object class. The Northwestern Polytechnical University Very-High-Resolution dataset (NWPU VHR-10) [6], the INDIA aerial picture dataset, UCAS-AOD [7], the Remote Sensing Object Detection (RSOD) dataset [8], and the Dataset for Object deTection (DOTA) [9] are representative examples of publicly available object detection datasets. The majority of these datasets are sourced from Google Maps in the RGB modality, with vehicle targets predominantly located in parking lots, roadside areas, residential zones, and other common scenes, and with only minor variations among the image scenes. There is a significant difference between vehicle target classes, yet a high degree of similarity within each class. Additionally, vehicles in these scenes adhere to predefined parking rules and regulations. A selection of annotated vehicle samples is presented above the dashed line in Fig. 1.

Currently, numerous methods exist for detecting vehicles in such scenes. For instance, Ref. [10] utilized adversarial learning to create vehicle images to diversify the dataset and enhance detection accuracy. To concentrate on regions of interest while minimizing the impact of occlusions, Zhang et al. [11] developed a triple-head network incorporating regional attention. Zhu et al. [12] proposed a Hard Samples Metric Learning (HSML) strategy aimed at reducing intra-class variance and lowering the rate of false detections. Meanwhile, Huang et al. [13] developed an Object-Adaptation Label Assignment (OLA) method that adapts neural network learning to the specific characteristics of different objects, indirectly addressing the challenge of densely packed boats and vehicles. Unfortunately, RGB data alone falls short of distinguishing objects that are densely packed or occluded and that exhibit minimal inter-class differences. This limitation is particularly evident at large-scale events, for example, where recreational vehicles (RVs), vehicles with tents mounted on their roofs, and flat-topped rectangular tents are densely parked in a limited area. This scenario is depicted in the example images below the dashed line in Fig. 1.
Multimodal data [14], which integrates information from various sensors or sources, such as visible light (RGB), infrared, light detection and ranging (LIDAR), synthetic aperture radar (SAR), and optical photogrammetry, significantly enhances the ability to detect and differentiate vehicles. This integration leverages the strengths of each modality to overcome the limitations inherent in any single data type, particularly in challenging conditions such as denseness, occlusion, or similar appearance among objects. Sharma et al. [15] utilized mid-level feature fusion to integrate data from visible and infrared (IR) modalities. Sumbul et al. [16] developed a unified object detection framework that integrates feature representations and attention mechanisms from both visible and LIDAR data. A significant challenge in multimodal object detection is crafting effective fusion strategies. In response, Hong et al. [17] raised a promising research problem, i.e., cross-region or cross-city land cover classification, and proposed a novel multimodal deep learning method called high-resolution domain adaptation networks. However, current multimodal datasets and their associated methodologies mainly concentrate on object deformation, such as variations in scale and orientation, while often neglecting issues like irregular parking and obstructions, e.g., tents or branches. This oversight leads to significant challenges in detecting vehicles amidst dense occlusion at large-scale events.
To this end, we first construct and label two Multimodal Vehicle Detection (MVD) datasets at large-scale events, incorporating both RGB and height map modalities. These datasets are characterized by densely packed vehicles, occlusions, and instances of partial deformation. RGB images provide color and texture information about objects, aiding in the identification of vehicular surface characteristics. Height maps offer elevation data for objects, enabling the differentiation of objects with similar colors and textures in crowded environments by their distinct heights. Furthermore, height maps can also indicate the presence of occluded objects based on their height differences. Then, we propose a multimodal collaboration network (MuDet) for dense and occluded vehicle detection. Specifically, we design a Unimodal Feature Hierarchical Enhancement (Uni-Enh) network and a Multimodal Cross Learning (Mul-Lea) strategy to enhance the distinct features of each modality and enrich the feature representation of vehicles. Following this, a Hard-Easy Discriminative (He-Dis) pattern is designed to enhance the discriminability between hard and easy objects and to minimize the impact of complex background interference. The contributions of this paper are summarized as follows:
• A multi-modal vehicle detection dataset is constructed and labeled, specifically targeting vehicles in dense and occluded scenarios at large-scale events. These vehicles are categorized as "hard vehicles" due to the complexity of their detection conditions.
• A Multimodal Collaboration Network (MuDet) is proposed to detect dense and occluded vehicles in large-scale events. By integrating RGB and height map data, it enhances features within each modality and improves the completeness of feature fusion across modalities. MuDet significantly enhances the discriminability and separability of multimodal features.
• A unimodal feature hierarchical enhancement (Uni-Enh) network and a multimodal cross learning (Mul-Lea) strategy are designed to enhance the distinct features of each modality and enrich the distinguishing features of vehicles. A hard-easy discriminative (He-Dis) pattern is designed to balance hard-easy object discriminability and suppress interference from complex backgrounds.
• We evaluate the detection performance of the proposed MuDet on two new multimodal vehicle detection datasets, namely the 4K-SAI-LCS dataset and the ISPRS Potsdam dataset, demonstrating substantial improvements over various existing methods. The code and datasets will be made available for reproducibility and to advance the research direction of multimodal RSOD.
This paper is organized as follows: Section II introduces the proposed MuDet framework in detail, including the Uni-Enh network, the Mul-Lea strategy, the He-Dis module, and the loss function. Quantitative experiments and visual analysis of the proposed MuDet on two newly annotated multimodal datasets, the 4K-SAI-LCS dataset and the ISPRS Potsdam dataset, are discussed in Section III. Section IV concludes the paper and discusses prospects for the proposed MuDet framework.
II Proposed MuDet Framework
In this section, we provide a detailed description of the proposed MuDet for dense and occluded vehicle detection. Fig. 2 illustrates the network architecture of MuDet. In detail, we first introduce a unimodal feature hierarchical enhancement (Uni-Enh) network designed to amplify the distinctive features of each stream, aiming to capture intramodal relationships more precisely. Then, the generated features are sent to the multimodal cross learning (Mul-Lea) module to interactively learn features between the two heterogeneous modalities and improve the completeness of information fusion across them. To further enhance the detection of vehicles in dense and occluded conditions, we develop a hard-easy discriminative (He-Dis) pattern that differentiates vehicles across varying levels of density and occlusion.
II-A Unimodal Feature Hierarchical Enhancement (Uni-Enh)
Convolutional Neural Networks (CNNs) feature a unique architecture of local weight sharing, significantly benefiting image processing and other areas by enabling the extraction of highly discriminative object features from input images. Currently, CNNs are widely used in the field of remote sensing [18, 19]. Unimodal feature hierarchical enhancement (Uni-Enh) involves dual-stream feature learning via CNNs and hierarchical enhancement of each stream to more effectively capture intramodal relationships.
Firstly, we introduce a dual-stream CNN-based network for feature learning. Each stream in the network is composed of several CNN blocks, with each block consisting of a convolutional layer, Batch Normalization (BN), and Leaky ReLU activation. We define $\mathbf{X}$ as the CNN features. To distinguish the two streams, $\mathbf{X}_{RGB}$ and $\mathbf{X}_{H}$ are used to represent the feature maps of the RGB image and the height map (H), respectively; $C_{RGB}$ and $C_{H}$ represent the number of RGB-stream channels and height-map-stream channels, respectively, and $h \times w$ denotes the feature map size. The output of the $l$-th layer of MuDet is denoted as

$$\mathbf{X}^{(l)} = \sigma\left(\mathbf{W}^{(l)} \ast \mathbf{X}^{(l-1)} + \mathbf{b}^{(l)}\right), \qquad (1)$$

where $\mathbf{X}^{(l)}$ denotes the feature maps of the $l$-th layer, $\sigma(\cdot)$ is a nonlinear activation function, $l$ indicates the layer index of the CNN, and $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are the learned weights and biases of the $l$-th layer, respectively.
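As a concrete illustration (a minimal sketch, not the authors' implementation), the dual-stream CNN blocks of Eq. (1) can be written in PyTorch as follows; the channel widths, kernel sizes, and input resolutions are placeholders.

```python
# Minimal sketch of the dual-stream CNN blocks (Conv + BN + Leaky ReLU, Eq. (1)).
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One CNN block: convolution, batch normalization, Leaky ReLU."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride,
                              padding=1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.LeakyReLU(0.1, inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

# Dual-stream feature extraction: one stream per modality (placeholder widths).
rgb_stream = nn.Sequential(ConvBlock(3, 32), ConvBlock(32, 64, stride=2))
height_stream = nn.Sequential(ConvBlock(1, 16), ConvBlock(16, 64, stride=2))

x_rgb = torch.randn(1, 3, 256, 256)   # RGB image
x_h = torch.randn(1, 1, 256, 256)     # height map
f_rgb, f_h = rgb_stream(x_rgb), height_stream(x_h)  # per-modality feature maps
```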
Then, we develop a hierarchical enhancement strategy to amplify the distinctive features of each stream and integrate the outcomes into a cross-attention mechanism, thereby aiming to more precisely capture intramodal relationships.
For the RGB stream, we measure the grayscale values of the RGB image and employ gamma transformation with diverse coefficients to refine details across both low and high grayscale ranges. Thus, the input of the RGB stream is

$$\mathbf{I}_{RGB}' = \mathcal{G}\left(\mathbf{I}_{RGB}\right) = c \cdot \mathbf{I}_{RGB}^{\gamma}, \qquad (2)$$

where $c$ is a constant and $\mathcal{G}(\cdot)$ represents the gamma transformation function with coefficient $\gamma$.
For the height map stream $\mathbf{I}_{H}$, we employ grayscale slicing to emphasize the height information of foreground objects while masking the height values of background objects, leveraging expert prior knowledge:

$$\mathbf{I}_{H}'(i,j) = \begin{cases} h_{\min}, & t_{1} \le \mathbf{I}_{H}(i,j) \le t_{2}, \\ \mathbf{I}_{H}(i,j), & \text{otherwise}, \end{cases} \qquad (3)$$

where $t_{1}$ and $t_{2}$ are the experiential slicing thresholds bounding the height range of background objects, $\mathbf{I}_{H}(i,j)$ represents the height map value at position $(i,j)$, and the constants $h_{\min}$ and $h_{\max}$ signify the minimum and maximum height values, respectively.
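For illustration, the two Uni-Enh inputs of Eqs. (2) and (3) can be sketched as follows; this reflects one plausible reading of the gamma transformation and grayscale slicing, and the gamma coefficients and background thresholds below are assumed values, not those used in the paper.

```python
# Illustrative sketch of the Uni-Enh inputs (not the exact preprocessing).
import numpy as np

def gamma_transform(rgb, gamma, c=1.0):
    """Apply c * I^gamma (Eq. (2)) to an image normalized to [0, 1]."""
    return np.clip(c * np.power(rgb, gamma), 0.0, 1.0)

def grayscale_slice(height, bg_low, bg_high, h_min=0.0):
    """Mask heights falling inside the background band [bg_low, bg_high] (Eq. (3))."""
    return np.where((height >= bg_low) & (height <= bg_high), h_min, height)

rgb = np.random.rand(256, 256, 3)        # RGB image in [0, 1]
height = np.random.rand(256, 256)        # normalized height map
rgb_dark = gamma_transform(rgb, gamma=0.5)    # expands detail in low grayscale values
rgb_bright = gamma_transform(rgb, gamma=2.0)  # expands detail in high grayscale values
height_fg = grayscale_slice(height, bg_low=0.0, bg_high=0.1)  # suppress ground-level heights
```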
II-B Multimodal Cross Learning (Mul-Lea)
The concept of cross-attention was first introduced in the transformer architecture for language processing, owing to its potent semantic feature extraction and long-range dependency modeling capabilities [20]. It asymmetrically combines two independent sequences of embeddings of the same dimension; here, the two sequences correspond to the features of the two modalities. Given the feature maps $\mathbf{X}_{RGB}$ and $\mathbf{X}_{H}$ of the two modalities, the cross-attention mechanism is defined as follows:

$$\mathrm{Att}\left(\mathbf{Q}, \mathbf{K}, \mathbf{V}\right) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d}}\right)\mathbf{V}, \qquad (4)$$

where $\mathbf{Q} = \mathbf{W}_{Q} \ast \mathbf{X}_{RGB}$, $\mathbf{K} = \mathbf{W}_{K} \ast \mathbf{X}_{H}$, and $\mathbf{V} = \mathbf{W}_{V} \ast \mathbf{X}_{H}$, with $\mathbf{W}_{Q}$, $\mathbf{W}_{K}$, and $\mathbf{W}_{V}$ being convolutions and $d$ the channel dimension. Thus, the cross-attended feature is also represented as follows:

$$\mathbf{X}_{RGB \leftrightarrow H} = \mathrm{softmax}\!\left(\frac{\left(\mathbf{W}_{Q} \ast \mathbf{X}_{RGB}\right)\left(\mathbf{W}_{K} \ast \mathbf{X}_{H}\right)^{\top}}{\sqrt{d}}\right)\left(\mathbf{W}_{V} \ast \mathbf{X}_{H}\right). \qquad (5)$$
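A minimal PyTorch sketch of such a cross-attention block is given below; the 1x1 convolution projections, the scaling factor, and the residual connection are illustrative choices, not necessarily the exact Mul-Lea design.

```python
# Minimal cross-attention sketch (Eqs. (4)-(5)): query from one modality,
# key/value from the other.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.q_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.k_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.v_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = channels ** -0.5

    def forward(self, x_query, x_context):
        b, c, h, w = x_query.shape
        q = self.q_conv(x_query).flatten(2).transpose(1, 2)     # (B, HW, C)
        k = self.k_conv(x_context).flatten(2)                   # (B, C, HW)
        v = self.v_conv(x_context).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = F.softmax(torch.bmm(q, k) * self.scale, dim=-1)  # (B, HW, HW)
        out = torch.bmm(attn, v).transpose(1, 2).reshape(b, c, h, w)
        return out + x_query                                    # residual fusion (assumed)

cross = CrossAttention(64)
f_rgb2h = cross(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
```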
Overall, the Uni-Enh block and the Mul-Lea block hierarchically enhance vehicle differentiation within each modality and interactively learn features between two heterogeneous modalities, respectively. They dynamically improve the completeness of information fusion both within and across modalities, resulting in enhanced multimodal feature discriminability and separability.
II-C Hard-easy Discriminative (He-Dis) Pattern
To enhance the distinction and detection of vehicles in large-scale events, we design a hard-easy discriminative pattern. This pattern begins by calculating the confidence value of the features within each modality, followed by constructing and thresholding easy-to-predict and hard-to-predict masks to accurately detect vehicles. This pattern ensures precise supervision of each modality, facilitating more effective vehicle localization. Fig. 3 shows an illustration of the hard-easy discriminative pattern. More specifically, we define the confidence value of the RGB stream as

$$p_{RGB}(i,j) = \mathrm{sigmoid}\left(\varphi_{RGB}\left(\mathbf{X}_{RGB}\right)(i,j)\right), \qquad (6)$$

and that of the height map stream as

$$p_{H}(i,j) = \mathrm{sigmoid}\left(\varphi_{H}\left(\mathbf{X}_{H}\right)(i,j)\right), \qquad (7)$$

where $\varphi_{RGB}(\cdot)$ and $\varphi_{H}(\cdot)$ represent the convolutions of the two streams.
To accurately differentiate between hard and easy vehicles, we define a threshold $\tau$. If the vehicle confidence predicted by both the RGB stream and the height map stream exceeds the threshold $\tau$, the vehicle predicted at position $(i,j)$ is classified as an easy-to-predict sample, and a mask $\mathbf{M}_{e}$ is given:

$$\mathbf{M}_{e}(i,j) = \begin{cases} 1, & p_{RGB}(i,j) > \tau \ \text{and} \ p_{H}(i,j) > \tau, \\ 0, & \text{otherwise}. \end{cases} \qquad (8)$$

If the object confidence predicted by either the RGB branch or the height map branch is greater than the threshold $\tau$ while the other is less than $\tau$, indicating that not both modal features can detect the object, then the object predicted at position $(i,j)$ is considered a hard-to-predict sample. Thus, two masks $\mathbf{M}_{h}^{RGB}$ and $\mathbf{M}_{h}^{H}$ are given:

$$\mathbf{M}_{h}^{RGB}(i,j) = \begin{cases} 1, & p_{RGB}(i,j) > \tau \ \text{and} \ p_{H}(i,j) \le \tau, \\ 0, & \text{otherwise}, \end{cases} \qquad (9)$$

$$\mathbf{M}_{h}^{H}(i,j) = \begin{cases} 1, & p_{H}(i,j) > \tau \ \text{and} \ p_{RGB}(i,j) \le \tau, \\ 0, & \text{otherwise}. \end{cases} \qquad (10)$$
Finally, all detected vehicles can be formulated as follows:

$$\mathbf{M} = \mathbf{M}_{e} + \mathbf{M}_{h}^{RGB} + \mathbf{M}_{h}^{H}. \qquad (11)$$
The hard-easy discriminability strategy streamlines the differentiation between hard and easy vehicles through soft thresholding with the hard-easy mask, thereby significantly improving the separation of dense and occluded vehicles.
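A minimal sketch of the hard-easy masks of Eqs. (8)-(11) is given below, assuming per-position confidence maps $p_{RGB}$ and $p_{H}$ in $[0,1]$; the threshold value and the clamped additive combination of the masks are illustrative assumptions.

```python
# Sketch of the hard-easy discriminative masks (Eqs. (8)-(11)).
import torch

def hard_easy_masks(p_rgb, p_h, tau=0.5):
    """p_rgb, p_h: per-position confidence maps in [0, 1] from the two streams."""
    easy = (p_rgb > tau) & (p_h > tau)        # both modalities agree: easy sample
    hard_rgb = (p_rgb > tau) & (p_h <= tau)   # only the RGB stream fires
    hard_h = (p_h > tau) & (p_rgb <= tau)     # only the height-map stream fires
    return easy.float(), hard_rgb.float(), hard_h.float()

p_rgb = torch.sigmoid(torch.randn(1, 1, 64, 64))
p_h = torch.sigmoid(torch.randn(1, 1, 64, 64))
m_easy, m_hard_rgb, m_hard_h = hard_easy_masks(p_rgb, p_h)
# Detected vehicles combine all three masks (Eq. (11)).
m_all = torch.clamp(m_easy + m_hard_rgb + m_hard_h, max=1.0)
```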
II-D Loss Function
In this section, we employ distinct loss functions to supervise hard and easy vehicles separately.

For the easy-to-predict vehicles, we employ two loss functions: an object classification loss, denoted as $\mathcal{L}_{cls}$, and an Oriented Bounding Box (OBB) regression loss, represented as $\mathcal{L}_{reg}$. Specifically, the classification loss is formulated using the focal loss [21], that is,

$$\mathcal{L}_{cls} = -\left(1 - p_{t}\right)^{\gamma}\log\left(p_{t}\right), \qquad (12)$$

$$p_{t} = \begin{cases} p, & y = 1, \\ 1 - p, & \text{otherwise}, \end{cases} \qquad (13)$$

where $p$ represents the predicted probability and $y$ denotes the true label. In alignment with [21], we set the hyperparameter $\gamma$ to 2.
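For concreteness, a small sketch of the focal loss of Eqs. (12)-(13) with $\gamma = 2$ is shown below; the $\alpha$ balancing term is an additional assumption taken from [21] and is not stated in the text above.

```python
# Focal loss sketch for the easy-sample classification term (Eqs. (12)-(13)).
import torch

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """p: predicted probability, y: binary ground-truth label (same shape)."""
    p_t = torch.where(y > 0.5, p, 1.0 - p)                     # p_t as in Eq. (13)
    alpha_t = torch.where(y > 0.5, torch.full_like(p, alpha),
                          torch.full_like(p, 1.0 - alpha))     # optional alpha balancing
    return (-alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=eps))).mean()

loss_cls = focal_loss(torch.sigmoid(torch.randn(8)),
                      torch.randint(0, 2, (8,)).float())
```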
To refine the regression results, the regression loss employs the definition provided in Ref. [13], utilizing OBBs for more precise object localization:

$$\mathcal{L}_{reg} = \mathcal{L}_{OBB}\left(\{l, t, r, b\}, \{d_{1}, \ldots, d_{4}\}, \{s_{1}, \ldots, s_{4}\}, \mathrm{IoU}\right), \qquad (14)$$

where $l$, $t$, $r$, and $b$ represent the distances from the sampling point to the horizontal bounding box (HBB) boundaries, $d_{1}, \ldots, d_{4}$ denote the distances between the HBB vertices and the OBB vertices, $s_{1}, \ldots, s_{4}$ are the ratios of the HBB to the OBB in terms of area, and $\mathrm{IoU}$ signifies the Intersection over Union between the two HBBs. Define the ground-truth distances as $(l^{*}, t^{*}, r^{*}, b^{*})$ and the predicted distances as $(l, t, r, b)$. The area of the ground-truth HBB is $S^{*}$ and the area of the predicted HBB is $S$. Then, the overlapping area is represented as
$$S_{I} = \left(\min(l, l^{*}) + \min(r, r^{*})\right)\left(\min(t, t^{*}) + \min(b, b^{*})\right). \qquad (15)$$

The area of the circumscribed HBB of the two HBBs above is represented as

$$S_{C} = \left(\max(l, l^{*}) + \max(r, r^{*})\right)\left(\max(t, t^{*}) + \max(b, b^{*})\right). \qquad (16)$$

The area of the union region of the two HBBs above is represented as $S_{U} = S + S^{*} - S_{I}$. Thus,

$$\mathrm{IoU} = \frac{S_{I}}{S_{U}}, \qquad \mathrm{GIoU} = \mathrm{IoU} - \frac{S_{C} - S_{U}}{S_{C}}. \qquad (17)$$
Thus, the loss of easy samples is

$$\mathcal{L}_{easy} = \frac{1}{N_{e}} \sum_{(i,j)} \mathbf{M}_{e}(i,j)\left(\mathcal{L}_{cls} + \mathcal{L}_{reg}\right), \qquad (18)$$
where $N_{e}$ represents the number of easy samples. Fig. 4 provides a detailed representation of the OBB.
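The HBB terms of Eqs. (15)-(17) can be sketched as follows; the distance ordering, the GIoU form, and the final $1 - \mathrm{GIoU}$ regression loss are assumptions patterned on Ref. [13] rather than the exact implementation.

```python
# Sketch of the HBB overlap terms built from per-point distances (Eqs. (15)-(17)).
import torch

def hbb_terms(pred, gt, eps=1e-7):
    """pred, gt: (..., 4) distances (l, t, r, b) from a sampling point to the
    four sides of the predicted / ground-truth HBB."""
    area_p = (pred[..., 0] + pred[..., 2]) * (pred[..., 1] + pred[..., 3])
    area_g = (gt[..., 0] + gt[..., 2]) * (gt[..., 1] + gt[..., 3])
    inter_w = torch.min(pred[..., 0], gt[..., 0]) + torch.min(pred[..., 2], gt[..., 2])
    inter_h = torch.min(pred[..., 1], gt[..., 1]) + torch.min(pred[..., 3], gt[..., 3])
    s_i = inter_w * inter_h                       # overlapping area, Eq. (15)
    enc_w = torch.max(pred[..., 0], gt[..., 0]) + torch.max(pred[..., 2], gt[..., 2])
    enc_h = torch.max(pred[..., 1], gt[..., 1]) + torch.max(pred[..., 3], gt[..., 3])
    s_c = enc_w * enc_h                           # circumscribed HBB area, Eq. (16)
    s_u = area_p + area_g - s_i                   # union area
    iou = s_i / (s_u + eps)
    giou = iou - (s_c - s_u) / (s_c + eps)        # GIoU form of Eq. (17)
    return iou, giou

iou, giou = hbb_terms(torch.rand(16, 4) + 0.1, torch.rand(16, 4) + 0.1)
loss_reg = (1.0 - giou).mean()   # a common IoU-style regression loss form (assumed)
```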
For the hard vehicles, the total loss is

$$\mathcal{L}_{hard} = \frac{1}{N_{h}} \sum_{(i,j)} \left(\mathbf{M}_{h}^{RGB}(i,j) + \mathbf{M}_{h}^{H}(i,j)\right)\left(\mathcal{L}_{cls} + \mathcal{L}_{reg}\right), \qquad (19)$$

where $N_{h}$ represents the number of hard samples.
Ultimately, by individually supervising and learning vehicles of varying difficulty levels, the model's optimization direction can be intentionally adjusted to further improve the detection performance of dense and occluded vehicles. Thus, the total loss is represented as

$$\mathcal{L} = \mathcal{L}_{easy} + \mathcal{L}_{hard}. \qquad (20)$$
III Experimental Results and Analysis
III-A Data Annotation and Description
In this section, we label and present two multimodal vehicle detection benchmark datasets for remote sensing imagery. These datasets are distinguished from existing vehicle detection datasets by four unique features: 1) They are expressly crafted for multimodal vehicle detection in the context of large-scale events, with both an RGB modality and a height map modality; each dataset has a different resolution and was collected from a different platform. 2) They encompass densely packed and irregularly arranged objects, including a variety of vehicle styles as well as tents and branches. 3) They feature a wide range of occlusions, such as those caused by tents and branches. 4) They increase the complexity of vehicle detection due to the presence of distorted vehicles and the varied distribution of vehicles across large-scale areas.
The detailed descriptions of the two multimodal datasets are given below. The data annotation employs the oriented bounding box (OBB) format, represented as $(x_{c}, y_{c}, w, h, \theta)$, where $(x_{c}, y_{c})$ denotes the center coordinates, $w$ and $h$ specify the width and height of the bounding box, and $\theta$ represents the rotation angle relative to the horizontal axis of the standard (axis-aligned) bounding box.
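As a brief illustration of this parameterization (a hypothetical helper, not part of the datasets' tooling), the following sketch converts an OBB annotation into its four corner points; the angle convention (radians, counter-clockwise from the horizontal axis) is an assumption.

```python
# Convert an OBB annotation (x_c, y_c, w, h, theta) to its four corner points.
import numpy as np

def obb_to_corners(xc, yc, w, h, theta):
    """Return a 4x2 array of corner coordinates of the oriented box."""
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    dx, dy = w / 2.0, h / 2.0
    # Corners of the axis-aligned box, centered at the origin.
    corners = np.array([[-dx, -dy], [dx, -dy], [dx, dy], [-dx, dy]])
    rot = np.array([[cos_t, -sin_t], [sin_t, cos_t]])
    return corners @ rot.T + np.array([xc, yc])

print(obb_to_corners(100.0, 50.0, 40.0, 20.0, np.pi / 6))
```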

Dataset | Annotation | # Categories | # Instances | Image resolution
---|---|---|---|---
ISPRS Potsdam | OBB | 1 | 4,896 |
4K-SAI-LCS | OBB | 2 | 339,111 |


1) 4K-SAI-LCS MVD Dataset:
The 4K Stereo Aerial Imagery of a Large Camping Site (4K-SAI-LCS) dataset is a subset of aerial imagery acquired over a large-scale event site [22, 23] and covers an extensive camping area. Utilizing the German Aerospace Center's advanced optical 4K camera system, images were captured with both leftward and rightward orientations at two different flight altitudes above ground level. Image pre-orientation is performed using open-source SRTM (Shuttle Radar Topography Mission) data and the measured GPS positions of the image projection centers. Precise image orientation is then accomplished by bundle adjustment using automatically extracted SIFT (Scale-Invariant Feature Transform) tie points [24]. Afterward, the 3D point cloud is calculated using semi-global matching [25, 26, 27].
To preserve the original rich textures and the sharp boundaries of the vehicles in the RGB images, instead of generating true orthophoto (TOP) and DSM images, the multimodal dataset in this paper consists of the original RGB images and height maps. Unlike DSMs, height maps use the original image coordinates instead of geo-coordinates, facilitating a more direct correspondence with the RGB imagery. To further improve the point density, each test region is covered by 4-6 overlapping images. Point clouds created from different views are merged and filtered. This process ensures that, after projection, there is a one-to-one relationship between each pixel in the 2D height map and its corresponding pixel in the RGB image. The images and height maps share the same ground sampling distance and scene resolution.
We have annotated the 4K-SAI-LCS MVD dataset in the oriented bounding box (OBB) format using the LabelMe toolbox. This newly labeled dataset is now employed for multimodal occluded and dense vehicle detection. Fig. 5(I) provides an example annotation image. The designated control zone of the festival scene encompasses a spacious parking lot and tent area. As a result, the primary objects depicted in the scene images include vehicles, tents, roads, and sanitation facilities, effectively representing the campground environment. The dataset presents significant challenges due to the dense and irregular parking arrangements of vehicles, which fall into diverse subcategories, including cars, transport vehicles, transport trailers, recreational vehicles, and camping trailers. This diversity leads to substantial intra-class variation. Moreover, the visual resemblance between vehicles and tents significantly complicates the task of vehicle detection, thereby increasing the complexity of the dataset and placing higher demands on the detection algorithms. Fig. 5(II) presents some examples of vehicles with varying degrees of occlusion and density.
2) ISPRS Potsdam City MVD Dataset (http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-potsdam.html): The original Potsdam dataset was constructed for the semantic segmentation competition of the ISPRS III/4 working group and was first published in the ISPRS 2D semantic labeling contest. Potsdam is a typical historical city, and this dataset includes different regions with true orthophotos (TOPs) and digital surface models (DSMs). The TOP images were generated using Trimble INPHO OrthoVista, while the DSMs, detailing the absolute elevation value of each pixel, were produced via dense image-matching techniques using Trimble INPHO 5.3 software. Both modalities share the same ground sampling distance and image resolution.
Different from the existing segmentation labels, we have re-annotated the Potsdam dataset in the OBB format for the images that encompass both VIS and height map data. It features vehicles located in extensive building complexes, narrow lanes, and densely populated residential zones. The designated parking areas display notable overlaps and occlusions, posing challenges in distinguishing black vehicles, particularly those obscured by foliage. Fig. 5(II) presents some examples of vehicles with varying degrees of occlusion and density.
3) Dataset Statistics: Table I lists detailed instance counts for the two multimodal vehicle datasets. Given the varying scenes of image acquisition, the vehicle density in the 4K-SAI-LCS dataset is higher than in the ISPRS dataset. The 4K-SAI-LCS dataset contains over 300,000 instances, while the ISPRS dataset comprises approximately 5,000 instances. The presence of targets with varying densities also poses a significant challenge for detecting densely packed vehicles. Fig. 6 plots the number of vehicles against vehicle area for both datasets, highlighting the diversity and balanced distribution of vehicle objects within the two datasets.
III-B Experimental Setup
In the experiment, data preprocessing and augmentation were applied to all images to prevent overfitting. For the 4K-SAI-LCS dataset, input images were cropped into three patch sizes with overlapping regions to maximize the capture of object information. The training and testing sets were equally divided at a 1:1 ratio. To balance the volumes of the two multimodal datasets, original images from the ISPRS Potsdam City dataset featuring aligned VIS and DSM scenarios were selected; these images were likewise cropped into overlapping patches. The ratio of the training set to the testing set for the ISPRS dataset was adjusted accordingly to align with the experimental requirements.
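As a rough illustration of the cropping strategy (a sketch, not the exact pipeline; the patch size and overlap are placeholders since the exact crop sizes are not reproduced here), overlapping tiles can be generated as follows:

```python
# Tile a large scene into overlapping patches (placeholder patch size and overlap).
import numpy as np

def tile_image(image, patch=1024, overlap=256):
    """Yield (y, x, crop) patches covering the image with the given overlap."""
    stride = patch - overlap
    h, w = image.shape[:2]
    # Include the bottom/right edges so the whole scene is covered.
    ys = sorted(set(list(range(0, max(h - patch, 0) + 1, stride)) + [max(h - patch, 0)]))
    xs = sorted(set(list(range(0, max(w - patch, 0) + 1, stride)) + [max(w - patch, 0)]))
    for y in ys:
        for x in xs:
            yield y, x, image[y:y + patch, x:x + patch]

scene = np.zeros((4000, 4000, 3), dtype=np.uint8)   # dummy scene
patches = list(tile_image(scene))
```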
The learning rate is decayed from an initial value to a final value over the course of training. We employ the Stochastic Gradient Descent (SGD) optimization strategy with weight decay and momentum. The training process runs for a fixed maximum number of epochs, with fixed confidence and Non-Maximum Suppression (NMS) thresholds applied at inference. Given the distinct information content of height maps and RGB data, we utilize different backbone networks for each data stream: ResNet18 for height maps and Darknet53 for RGB data. The proposed MuDet architecture is implemented using the PyTorch framework on an NVIDIA GeForce RTX 3090 GPU.
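For reference, a hedged sketch of such a training configuration in PyTorch is shown below; every numeric value is a placeholder rather than the setting used in the paper.

```python
# Hedged sketch of the training configuration (all values are placeholders).
import torch

model = torch.nn.Conv2d(3, 16, 3)                    # stand-in for the MuDet network
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-3,                 # placeholder initial learning rate
                            momentum=0.9,            # placeholder momentum
                            weight_decay=5e-4)       # placeholder weight decay
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-6)              # decay toward a placeholder final LR
CONF_THRESH = 0.3                                    # placeholder confidence threshold
NMS_THRESH = 0.45                                    # placeholder NMS IoU threshold
```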

III-C Evaluation Metrics
Three common object detection evaluation criteria are utilized for quantitative analysis, including Precision (P), Recall (R), and Average Precision (AP):
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad (21)$$

where $TP$, $FP$, and $FN$ represent true positive, false positive, and false negative objects, respectively. Generally, higher values of these metrics indicate superior detection performance.
Modality | Attention | Backbone | ISPRS Potsdam AP0.5 (%) | 4K-SAI-LCS AP0.5 (%)
---|---|---|---|---
RGB | - | Darknet53 | 90.03 | 86.99
Height map | - | ResNet18 | 13.13 | 27.04
RGB | Self Attention | Darknet53 | 91.47 | 90.02
Height map | Self Attention | ResNet18 | 11.94 | 27.53
MuDet(v3) | Mul-Lea | Darknet53/ResNet18 | 93.63 | 92.57
MuDet(GGHL) | Mul-Lea | Darknet53/ResNet18 | 94.58 | 94.19
MuDet(v8) | Mul-Lea | CSPDarknet53/ResNet18 | 94.92 | 95.07
Fusion method | Network | ISPRS Potsdam AP0.5 (%) | 4K-SAI-LCS AP0.5 (%)
---|---|---|---
Image-level | Darknet53 | 93.12 | 87.89
Feature-level | Darknet53/ResNet18 | 94.05 | 89.27
Uni-Enh+Mul-Lea | Darknet53/ResNet18 | 94.37 | 91.11
Feature-level+He-Dis | Darknet53/ResNet18 | 94.28 | 90.64
MuDet(v3) | Darknet53/ResNet18 | 93.63 | 92.57
MuDet(GGHL) | Darknet53/ResNet18 | 94.58 | 94.19
MuDet(v8) | CSPDarknet53/ResNet18 | 94.92 | 95.07
AP is a global indicator, enabling fair comparison across different detection methods. In our experiments, AP0.5 refers to the Average Precision (AP) calculated at an Intersection over Union (IoU) threshold of 0.5:
$$AP = \sum_{n} \left(R_{n} - R_{n-1}\right) P_{n}, \qquad (22)$$

where $n$ represents the threshold index, $P_{n}$ denotes the precision at the $n$-th threshold, $R_{n}$ is the recall at the $n$-th threshold, and $(R_{n} - R_{n-1})$ calculates the change in recall between the consecutive $(n-1)$-th and $n$-th thresholds.
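As a worked illustration of Eq. (22) (a toy sketch with made-up detections, not dataset results), AP can be computed from a ranked detection list as follows:

```python
# Toy AP computation matching Eq. (22): precision weighted by recall increments.
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """scores: detection confidences; is_tp: 1 if a detection matches a ground
    truth at IoU >= 0.5, else 0; num_gt: number of ground-truth objects."""
    order = np.argsort(-np.asarray(scores))
    tp = np.cumsum(np.asarray(is_tp)[order])
    fp = np.cumsum(1 - np.asarray(is_tp)[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    prev_r = np.concatenate(([0.0], recall[:-1]))
    # Sum of precision times the recall increment at each threshold.
    return float(np.sum(precision * (recall - prev_r)))

print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=4))
```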

III-D Comparison with State-of-the-Art MVD Models
In the experiment, six commonly used methods for vehicle detection in multimodal remote sensing (RS) images were selected for both quantitative and qualitative comparisons. These methods include You only look once (YOLOv3)[28], RetinaNet[21], Fully Convolutional One-Stage object detector (FCOS)[29], General Gaussian Heatmap Label Assignment (GGHL)[13], Representative Points (RepPoints)[31], and YOLOv8 [30]. Darknet53 was employed as the backbone network across all methods, complemented by a multi-scale feature pyramid network (FPN) for upsampling and fusion to ensure a balanced comparison.
III-E Ablation Study
In this section, we evaluated the contribution of the proposed MuDet through two ablation analysis experiments. Specifically, 1) the contribution of the modality increment; 2) the contribution of the fusion increment.
III-E1 Contribution of the Modality Increment.
Table II quantifies the improvement in vehicle detection performance achieved by the incremental addition of modalities. Note that RGB images offer rich color and texture information, whereas height maps supply solely elevation data. Employing distinct backbones for each modality, Darknet53/CSPDarknet53 for RGB and ResNet18 for the height map, enables the extraction of the most valuable features from each modality while avoiding overfitting. For the unimodal case, height information alone proves insufficient for distinguishing vehicles without the complementary support of RGB data. The performance enhancement achieved by integrating self-attention into each modality falls short of the improvements of the three variants of MuDet (the YOLOv3-based MuDet (MuDet(v3)), the GGHL-based MuDet (MuDet(GGHL)), and the YOLOv8-based MuDet (MuDet(v8))), with the largest performance gap exceeding 5%. Fig. 7 shows the visualization of vehicle separation results with incremental modality utilization, including RGB images, height map images, RGB + self-attention, height map + self-attention, and the YOLOv8-based MuDet, when applied to the 4K-SAI-LCS dataset. MuDet outperforms the other methods in separating densely packed vehicles.
III-E2 Contribution of the Fusion Increment.
Table III quantifies the improvement in vehicle detection performance achieved by various fusion strategies, including image-level fusion, feature-level fusion, Uni-Enh+Mul-Lea, feature-level+He-Dis, and the three variants of MuDet. Overall, image-level fusion and feature-level fusion yield the poorest detection performance. The best variant, MuDet(v8), achieves an approximate 6% improvement in AP on the 4K-SAI-LCS dataset and a 1% improvement in AP on the ISPRS Potsdam dataset. In contrast, the feature-level+He-Dis method, which relies on plain feature fusion, demonstrates minimal improvement due to its lack of consideration for the interplay among multimodal features. Note that the vehicle categories within the ISPRS Potsdam dataset are relatively uniform and their density is low; consequently, the improvement offered by the feature-level+He-Dis module is somewhat constrained.
III-F Results and Analysis on the 4K-SAI-LCS Data


Table IV lists the quantitative detection accuracy of the six methods on the 4K-SAI-LCS dataset in terms of AP. Overall, multimodal data significantly outperforms single-modality data in terms of AP. Specifically, RetinaNet, owing to its focal loss design, reduces the weight of easily detected objects and achieves a detection performance approximately 2% higher than YOLOv3. FCOS, GGHL, and YOLOv8, all anchor-free methods, show distinct advantages over anchor-based methods. FCOS enhances detection precision by predicting object presence at each pixel, thus avoiding the complex anchor matching process. GGHL employs a Gaussian heatmap distribution technique to improve the learning of objects of different sizes at various positions. YOLOv8 has been redesigned with a regression-based loss function and new sample matching strategies, among other modules, offering significant improvements in both speed and accuracy, approximately 6% higher than YOLOv3. Unlike traditional anchor-based methods, RepPoints models each object's unique shape and contour without relying on predefined anchor box sizes or ratios, achieving optimal single-modality performance, although it may still be influenced by background or semantically irrelevant foreground information. The proposed MuDet increases detection accuracy across different network backbones, including YOLOv3, GGHL, and YOLOv8, by over 5% compared with their corresponding single-modality counterparts. MuDet introduces single-modality feature enhancement, multimodal cross-fusion, and a strategy for balancing samples of varying difficulty, effectively segregating vehicles and eliminating the impact of background information.
Fig. 8 shows the visualization results of the MuDet on selected example images, demonstrating effective separability for dense and occluded vehicles, such as RVs with open doors and tents mounted on cars. However, a small fraction of white recreational vehicles (RVs) were missed due to the absence of open-door samples in the labeled 4K-SAI-LCS vehicle dataset. Additionally, we have to admit that the 4K-SAI-LCS dataset presents significant challenges for vehicle detection. Notably, some vehicles are difficult to distinguish, with their similarity to tents sometimes exceeding 80%.
Given the substantial impact that varying confidence thresholds can have on model performance, it is crucial to analyze their effects. Fig. 9(a) illustrates the Precision-Recall (PR) curve for all comparative methods and the proposed MuDet across different thresholds. It can be seen that the proposed method not only achieves the highest precision at a given level of recall but also has the largest area under the PR curve, reaffirming the efficacy and superiority of MuDet. However, the recall metric still warrants enhancement. This likely stems from the significant intra-class variance and subtle inter-class variance. The challenge of precise fine-grained separation becomes apparent when all patterned vehicles are merged into a broad "vehicle" category: amalgamating distinct types, such as cars and trucks, into a single class hinders precise differentiation.

III-G Results and Analysis on the ISPRS Potsdam City Data
Table V lists the quantitative performance analysis on the ISPRS Potsdam city dataset, while Fig. 10 shows the visualization results for sample images using our method.
Overall, the detection performance on the Potsdam city dataset is consistent with that on the 4K-SAI-LCS dataset, further demonstrating that MuDet improves the detection of dense and occluded vehicle objects. MuDet achieves a significant improvement over unimodal YOLOv3 and multimodal YOLOv3, with increases of 10% and 8%, respectively. Relying solely on single-modality data raises the likelihood of inaccurate detections or missed vehicles, especially for dark vehicles that are obscured by tree branches or closely match the color of the branches. Furthermore, compared with competitive multimodal methods based on YOLOv8 and RepPoints, MuDet achieves improvements of up to 3% and 1.5%, respectively, further verifying the method's robustness.
Fig. 9 (b) shows the PR curve for all comparison methods and the proposed MuDet under dynamic thresholds. MuDet obtains the largest area under the PR curve, similar to the performance on the 4K-SAI-LCS dataset, further demonstrating MuDet’s effectiveness and superiority. However, at equivalent levels of accuracy, this dataset exhibits a lower recall rate compared to the 4K-SAI-LCS dataset. This discrepancy is likely attributed to varying occlusion levels from tree branches and vehicle deformation, which not only diminishes the contrast between background and foreground but also amplifies the intra-class variation of vehicle targets, leading to a reduced recall rate.
Fig. 10 shows some detection results on the ISPRS Potsdam city dataset. While this dataset has a lower vehicle density compared to the 4K-SAI-LCS dataset, it also features other challenges, such as black vehicles obscured by tree branches and vehicles with deformations. For these vehicles, MuDet achieved commendable detection results. However, there remains a considerable opportunity for further refinement, particularly in improving the detection of vehicles with significant deformations or those extensively occluded by tree branches.
IV Conclusion
In this article, we initially develop two multimodal datasets for dense and occluded vehicle detection in large-scale scenarios, employing both RGB and height map modalities. Subsequently, we propose a multimodal collaboration network, termed MuDet, for dense and occluded vehicle detection in large-scale events. MuDet is designed to fully exploit unimodal enhanced features, multimodal cross-features, and patterns that distinguish between hard and easy vehicle detection. Leveraging the integrated data from RGB and height maps, MuDet excels in differentiating vehicles based on color, identifying vehicles with similar colors and textures in crowded scenes through their unique height values, and detecting occluded vehicles, thereby enhancing its utility in complex environments. Extensive experiments conducted on two newly constructed and labeled image datasets demonstrate MuDet’s superiority in MVD compared to commonly employed detection methods.
However, MuDet currently does not adapt well across multimodal datasets with significant distributional variance. Therefore, future work will focus on exploring domain adaptation techniques in multimodal contexts to improve the model's ability to generalize effectively across various domains.
Acknowledgement
References
- [1] X. Wu, W. Li, D. Hong, R. Tao, and Q. Du, “Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey,” IEEE Geoscience and Remote Sensing Magazine, vol. 10, no. 1, pp. 91–124, 2022.
- [2] D. Hong, B. Zhang, X. Li, Y. Li, C. Li, J. Yao, N. Yokoya, H. Li, P. Ghamisi, X. Jia, A. Plaza, P. Gamba, J. A. Benediktsson, and J. Chanussot, “Spectralgpt: Spectral remote sensing foundation model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. DOI:10.1109/TPAMI.2024.3362475.
- [3] I. Weber, J. Bongartz, and R. Roscher, “Artificial and beneficial–exploiting artificial images for aerial vehicle detection,” ISPRS J. Photogramm. Remote Sens., vol. 175, pp. 158–170, 2021.
- [4] X. Wu, D. Hong, and J. Chanussot, “Uiu-net: U-net in u-net for infrared small object detection,” IEEE Transactions on Image Processing, vol. 32, pp. 364–376, 2022.
- [5] X. Wu, D. Hong, J. Tian, J. Chanussot, W. Li, and R. Tao, “Orsim detector: A novel object detection framework in optical remote sensing imagery using spatial-frequency channel features,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 7, pp. 5146–5158, 2019.
- [6] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, 2016.
- [7] H. Q. Zhu, X. G. Chen, W. Q. Dai, K. Fu, Q. X. Ye, and J. B. Jiao, "Orientation robust object detection in aerial images using deep convolutional neural network," in Proc. IEEE International Conference on Image Processing (ICIP), pp. 3735–3739, IEEE, 2015.
- [8] Y. Long, Y. P. Gong, Z. F. Xiao, and Q. Liu, “Accurate object localization in remote sensing images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 5, pp. 2486–2498, 2017.
- [9] G. S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. B. Luo, M. Datcu, M. Pelillo, and L. P. Zhang, “Dota: A large-scale dataset for object detection in aerial images,” in Proc. IEEE Int. Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 3974–3983, 2018.
- [10] X. L. Wang, A. Shrivastava, and A. Gupta, "A-fast-rcnn: Hard positive generation via adversary for object detection," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2606–2615, 2017.
- [11] W. Zhang, C. H. Liu, F. L. Chang, and Y. Song, “Multi-scale and occlusion aware network for vehicle detection and segmentation on uav aerial images,” Remote Sens., vol. 12, no. 11, 2020.
- [12] D. J. Zhu, S. X. Xia, J. Q. Zhao, Y. Zhou, Q. Niu, R. Yao, and Y. Chen, “Spatial hierarchy perception and hard samples metric learning for high-resolution remote sensing image object detection,” Appl. Artif. Intell., pp. 1–16, 2021.
- [13] Z. C. Huang, W. Li, X. G. Xia, and R. Tao, “A general gaussian heatmap label assignment for arbitrary-oriented object detection,” IEEE Trans. Image Process., vol. 31, pp. 1895–1910, 2022.
- [14] D. Hong, J. Yao, C. Li, D. Meng, N. Yokoya, and J. Chanussot, “Decoupled-and-coupled networks: Self-supervised hyperspectral image super-resolution with subpixel fusion,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
- [15] M. Sharma, M. Dhanaraj, S. Karnam, D. G. Chachlakis, R. Ptucha, P. P. Markopoulos, and E. Saber, “Yolors: Object detection in multimodal remote sensing imagery,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 14, pp. 1497–1508, 2020.
- [16] G. Sumbul, R. G. Cinbis, and S. Aksoy, “Multisource region attention network for fine-grained object recognition in remote sensing imagery,” IEEE Trans. Geosci. Remote Sens., vol. 57, no. 7, pp. 4929–4937, 2019.
- [17] D. Hong, B. Zhang, H. Li, Y. Li, J. Yao, C. Li, M. Werner, J. Chanussot, A. Zipf, and X. X. Zhu, “Cross-city matters: A multimodal remote sensing benchmark dataset for cross-city semantic segmentation using high-resolution domain adaptation networks,” Remote Sensing of Environment, vol. 299, p. 113856, 2023.
- [18] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han, “Object detection in optical remote sensing images: A survey and a new benchmark,” ISPRS journal of photogrammetry and remote sensing, vol. 159, pp. 296–307, 2020.
- [19] C. Li, B. Zhang, D. Hong, J. Yao, and J. Chanussot, “Lrr-net: An interpretable deep unfolding network for hyperspectral anomaly detection,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
- [20] J. Yao, B. Zhang, C. Li, D. Hong, and J. Chanussot, “Extended vision transformer (exvit) for land use and land cover classification: A multimodal deep learning framework,” IEEE Transactions on Geoscience and Remote Sensing, 2023.
- [21] T. Y. Lin, P. Goyal, R. Girshick, K. M. He, and P. Dollár, “Focal loss for dense object detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 2, pp. 318–327, 2020.
- [22] V. Gstaiger, J. Tian, R. Kiefl, and F. Kurz, “2d vs. 3d change detection using aerial imagery to support crisis management of large-scale events,” Remote Sens., vol. 10, no. 12, p. 2054, 2018.
- [23] X. Wu, W. Li, D. Hong, J. Tian, R. Tao, and Q. Du, “Vehicle detection of multi-source remote sensing data using active fine-tuning network,” ISPRS J. Photogramm. Remote Sens., vol. 167, pp. 39–53, 2020.
- [24] F. Kurz, S. Tuermer, O. Meynberg, D. Rosenbaum, H. Runge, P. Reinartz, and J. Leitloff, “Low-cost optical camera systems for real-time mapping applications,” Photogrammetrie-Fernerkundung-Geoinformation, pp. 159–176, 2012.
- [25] “Region-based automatic building and forest change detection on cartosat-1 stereo imagery,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 79, pp. 226–239, 2013.
- [26] P. d’Angelo, “Improving semi-global matching: cost aggregation and confidence measure,” The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 41, pp. 299–304, 2016.
- [27] C. Kempf, J. Tian, F. Kurz, P. D’Angelo, T. Schneider, and P. Reinartz, “Oblique view individual tree crown delineation,” International Journal of Applied Earth Observation and Geoinformation, vol. 99, p. 102314, 2021.
- [28] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
- [29] Z. Tian, C. H. Shen, H. Chen, and T. He, "FCOS: Fully convolutional one-stage object detection," in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), (Seoul, South Korea), pp. 9627–9636, Oct. 2019.
- [30] Ultralytics, "YOLOv8," https://github.com/ultralytics/ultralytics, open-sourced on 2023-01-10, 2023.
- [31] Z. Yang, S. H. Liu, H. Hu, L. W. Wang, and S. Lin, "Reppoints: Point set representation for object detection," in Proc. IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9657–9666, 2019.
Xin Wu (Senior Member, IEEE) received the Ph.D. degree from the School of Information and Electronics, Beijing Institute of Technology (BIT), Beijing, China, in 2020. She is currently an Assistant Professor at the School of Computer Science, Beijing University of Posts and Telecommunications (BUPT), Beijing, China. Her research interests include deep learning, remote sensing, object detection, and multimodal intelligent perception. Dr. Wu is a Topical Associate Editor of the IEEE Transactions on Geoscience and Remote Sensing (TGRS). She was a recipient of the Best Reviewer Award of the IEEE TGRS in 2023 and the IEEE JSTARS in 2022 as well as the Jose Bioucas Dias award for recognizing the outstanding paper at the Workshop on Hyperspectral Imaging and Signal Processing: Evolution in Remote Sensing (WHISPERS) in 2021. She is also a Leading Guest Editor of the IEEE JSTARS and Remote Sensing.
Zhanchao Huang (Member, IEEE) received the Ph.D. degree from the School of Information and Electronics, Beijing Institute of Technology (BIT), Beijing, China, in 2023. He is currently an Assistant Professor at The Academy of Digital China (ADC), Fuzhou University (FZU), Fuzhou 350108, China. His research interests include object detection and remote sensing image interpretation.
Li Wang (Senior Member, IEEE) received the Ph.D. degree from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2009. She is currently a Full Professor with the School of Computer Science, National Pilot Software Engineering School, BUPT, where she is also an Associate Dean and the Head of the High Performance Computing and Networking Laboratory. She is also a Member of the Key Laboratory of the Universal Wireless Communications, Ministry of Education, China. She is also a rotating director of the Key Laboratory of Application Innovation in Emergency Command Communication Technology, Ministry of Emergency Management, China. She also held Visiting Positions with the School of Electrical and Computer Engineering, Georgia Tech, Atlanta, GA, USA, from December 2013 to January 2015, and with the Department of Signals and Systems, Chalmers University of Technology, Gothenburg, Sweden, from August to November 2015 and July to August 2018. She has authored or coauthored almost 70 journal papers and four books. Her research interests include wireless communications, distributed networking and storage, vehicular communications, social networks, and edge AI. She currently serves on the Editorial Boards for IEEE Transactions on Vehicular Technology, IEEE Transactions on Cognitive Communications and Networking, IEEE Internet of Things Journal, and China Communications. She was an Associate Editor for IEEE Transactions on Green Communications and Networking, the Symposium Chair of IEEE ICC 2019 on Cognitive Radio and Networks Symposium, and a Tutorial Chair of IEEE VTC 2019. She also is the chair of the Special Interest Group (SIG) on Sensing, Communications, Caching, and Computing (C3) in Cognitive Networks for the IEEE Technical Committee on Cognitive Networks. She was the Vice Chair of the Meetings and Conference Committee (MCC) for the IEEE Communication Society (ComSoc) Asia Pacific Board (APB) for the term of 2020–2021. She was the recipient of the 2013 Beijing Young Elite Faculty for Higher Education Award, best paper awards from several IEEE conferences, IEEE ICCC 2017, IEEE GLOBECOM 2018, and IEEE WCSP 2019. She was also the recipient of the Beijing Technology Rising Star Award in 2018. She has served on the TPC of multiple IEEE conferences, including IEEE Infocom, Globecom, International Conference on Communications, IEEE Wireless Communications and Networking Conference, and IEEE Vehicular Technology Conference in recent years.
Jocelyn Chanussot (IEEE Fellow) received the M.Sc. degree in electrical engineering from the Grenoble Institute of Technology (Grenoble INP), Grenoble, France, in 1995, and the Ph.D. degree from the Université de Savoie, Annecy, France, in 1998. From 1999 to 2023, he has been with Grenoble INP, where he was a Professor of signal and image processing. He is currently a Research Director with INRIA, Grenoble. His research interests include image analysis, hyperspectral remote sensing, data fusion, machine learning, and artificial intelligence. He has been a visiting scholar at Stanford University (USA), KTH (Sweden), and NUS (Singapore). Since 2013, he has been an Adjunct Professor at the University of Iceland. In 2015-2017, he was a visiting professor at the University of California, Los Angeles (UCLA). He holds the AXA chair in remote sensing and is an Adjunct professor at the Chinese Academy of Sciences, Aerospace Information Research Institute, Beijing, China. Dr. Chanussot is the founding President of the IEEE Geoscience and Remote Sensing French chapter (2007-2010) which received the 2010 IEEE GRSS Chapter Excellence Award. He was the Vice-President of the IEEE Geoscience and Remote Sensing Society, in charge of meetings and symposia (2017-2019). He is an Associate Editor for the IEEE Transactions on Geoscience and Remote Sensing, the IEEE Transactions on Image Processing, and the Proceedings of the IEEE. He was the Editor-in-Chief of the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2011-2015). In 2014 he served as a Guest Editor for the IEEE Signal Processing Magazine. He is a Fellow of the IEEE, an ELLIS Fellow, a Fellow of AAIA, a member of the Institut Universitaire de France (2012-2017), and a Highly Cited Researcher (Clarivate Analytics/Thomson Reuters, since 2018).
Jiaojiao Tian (M’19–SM’21) received her B.S. degree in geoinformation systems from the China University of Geoscience, Beijing, in 2006, her M. Eng. degree in cartography and geoinformation at the Chinese Academy of Surveying and Mapping, Beijing, in 2009, and her Ph.D. degree in mathematics and computer science from Osnabrück University, Germany, in 2013. Since 2009, she has been with the Photogrammetry and Image Analysis Department, Remote Sensing Technology Institute, German Aerospace Center, Wessling, Germany, where she is currently head of the 3D and Modeling Group. In 2011, she was a guest scientist with the Institute of Photogrammetry and Remote Sensing, ETH Zürich, Switzerland. She serves as a co-chair of the ISPRS Commission WG I/8: Multi-sensor Modelling and Cross-modality Fusion. She is a member of the editorial board of the ISPRS Journal of Photogrammetry and Remote Sensing and of the International Journal of Image and Data Fusion. Her research interests include 3D change detection, digital surface model (DSM) generation, 3D point cloud semantic segmentation, object extraction, DSM-assisted building reconstruction, and forest monitoring and classification.