MAFNet: A Multi-Attention Fusion Network
for RGB-T Crowd Counting
Abstract
RGB-Thermal (RGB-T) crowd counting is a challenging task that uses thermal images as complementary information to RGB images to address the degraded performance of unimodal RGB-based methods in scenes with low illumination or similar backgrounds. Most existing methods propose well-designed structures for cross-modal fusion in RGB-T crowd counting. However, these methods have difficulty encoding cross-modal contextual semantic information in RGB-T image pairs. Considering this problem, we propose a two-stream RGB-T crowd counting network called the Multi-Attention Fusion Network (MAFNet), which aims to fully capture long-range contextual information from the RGB and thermal modalities based on the attention mechanism. Specifically, in the encoder part, a Multi-Attention Fusion (MAF) module is embedded into different stages of the two modality-specific branches for cross-modal fusion at the global level. In addition, a Multi-modal Multi-scale Aggregation (MMA) regression head is introduced to make full use of the multi-scale and contextual information across modalities to generate high-quality crowd density maps. Extensive experiments on two popular datasets show that the proposed MAFNet is effective for RGB-T crowd counting and achieves state-of-the-art performance.
Index Terms:
RGB-T crowd counting, attention mechanism, multi-modal fusion.
I INTRODUCTION
In recent years, with the increase in population, crowd counting has become an important topic in the fields of intelligent surveillance [1, 2, 3], human swarm analysis [4, 5, 6] and public security [7, 8], where the target is to accurately estimate the number of people in images. Current crowd counting methods [9, 10, 11, 12, 13, 14, 15, 16, 17, 18] mainly rely on visual features of RGB images. However, RGB images are susceptible to illumination variations, which limits the application of RGB-based methods in scenes with low illumination or similar backgrounds. Different from RGB images, thermal images are insensitive to illumination variations and have strong penetration through fog and smog. Recently, some studies introduce thermal images as complementary information to RGB images in computer vision tasks. As shown in Fig. 1-(a, b), crowds in RGB images are invisible in dark conditions, whereas the outlines of crowds in thermal images are clearly visible. Meanwhile, thermal images have low resolution and suffer from the thermal crossover problem, which confuses different objects. In contrast, RGB images have high resolution and rich semantic information, which can compensate for these shortcomings of thermal images. As shown in Fig. 1-(c), crowds in thermal images are difficult to distinguish because of thermal crossover, while the same crowds are clearly visible in RGB images under bright illumination. Thus, it is necessary to study the RGB-Thermal cross-modal fusion problem in crowd counting.

Recently, RGB-Thermal (RGB-T) crowd counting has received much attention, and several datasets and methods have been proposed. Existing methods [19, 20, 21, 22] manually design well-crafted structures for RGB-T fusion based on convolutional neural networks (CNNs) and show excellent performance on the DroneRGBT [19] and RGBT-CC [20] datasets. Peng et al. [19] first propose an end-to-end framework named Multi-Modal Crowd Counting Network (MMCCN) with an adaptive fusion module. Based on this, Zhang et al. [21] propose I-MMCCN, which improves MMCCN by introducing a hard example mining module and a novel block-averaged absolute error loss. Besides, Liu et al. [20] propose an Information Aggregation-Distribution Module (IADM) for cross-modal representation learning. Further, Tang et al. [22] propose a three-stream adaptive fusion network (TAFNet) that extracts combined RGB-T features through an additional stream. However, purely CNN-based methods have difficulty adequately capturing long-range contextual information [23, 24] from both modalities on account of the small receptive field of convolution operations, which limits the effectiveness of cross-modal fusion.
In recent works, the attention mechanism [25] has demonstrated its usefulness in cross-modal fusion for obtaining long-range correlations and global information of the two modalities [26, 27, 28]. To tackle this limitation of CNN-based RGB-T crowd counting methods, we propose a two-stream framework called the Multi-Attention Fusion Network (MAFNet). Specifically, in the encoder part, two VGG19 [29] backbones are used for modality-specific representation learning, and attention-based Multi-Attention Fusion (MAF) modules are embedded in the backbones to capture long-range contextual information for cross-modal fusion in a hierarchical manner. Besides, considering that the feature maps may lose crucial information in the process of down-sampling, we introduce a Multi-modal Multi-scale Aggregation (MMA) regression head. The extracted features of different scales from the two modalities are fed to the MMA regression head to make full use of the multi-scale and contextual information across modalities and generate high-quality crowd density maps.
Our main contributions are summarized as follows:
1) A two-stream RGB-T crowd counting framework termed Multi-Attention Fusion Network (MAFNet) is proposed, with an attention-based Multi-Attention Fusion (MAF) module in the encoder part to fully capture long-range contextual information from both modalities.
2) A Multi-modal Multi-scale Aggregation (MMA) regression head is introduced to make full use of the multi-scale and contextual information across modalities to generate high-quality crowd density maps.
3) Experiments show that the proposed MAFNet achieves state-of-the-art performance on two RGB-T crowd counting datasets, RGBT-CC and DroneRGBT.
II RELATED WORK
The discussion of related work is divided into two categories: (A) RGB-based crowd counting and (B) multi-modal crowd counting.
II-A RGB-Based Crowd Counting
The development of crowd counting has gone through three stages: detection-based, regression-based, and density estimation-based methods.
Detection-based methods. The early crowd counting methods are mainly based on object detection [30, 31, 32, 33, 34, 35]. After the detection of individuals, the number of targets is counted to obtain the total number of people. Such algorithms perform well in scenes with sparse crowds and no occlusion, but poorly in dense crowds.
Regression-based methods. Due to the limitations of detection-based methods, researchers propose regression-based crowd counting methods [36, 37, 38]. These methods first extract global or local features of the input image, and then use regression to learn a mapping function that yields the crowd count. However, the extracted low-level features may ignore critical features of the targets, and such methods cannot obtain the spatial distribution of the crowds, limiting their application in real scenes.
Density estimation-based methods. Methods based on density estimation [9, 10, 11, 39, 13, 40, 14, 41, 42, 15, 16, 18, 43, 44, 45] use spatial information to estimate crowd density maps and are the current mainstream crowd counting methods. A crowd density map not only reflects the spatial distribution of the crowd in the image, but its integral also gives the crowd count. Zhang et al. [10] propose a Multi-column Convolutional Neural Network (MCNN) for crowd counting, which utilizes filters with receptive fields of different sizes to extract features. Li et al. [11] propose the Congested Scene Recognition Network (CSRNet), which uses dilated convolutions in the backend to enlarge the receptive field and enhance the feature extraction capability. Recently, some works improve crowd counting performance by introducing multi-scale information [39, 13, 40] and perspective estimation [14, 41, 42]. To alleviate the difficulty of collecting and annotating crowd counting datasets, some works [15, 16, 18] explore domain-adaptive crowd counting from synthetic datasets to the real world. In addition, since the Vision Transformer (ViT) [3] first applied the transformer structure to vision tasks, many transformer-based [25] crowd counting methods [43, 44, 45] have been proposed with outstanding performance.

II-B Multi-Modal Crowd Counting
Multi-modal crowd counting aims at improving the performance of crowd counting methods by comprehending multi-modal data through multi-modal representation learning [46, 47, 48, 49, 50]. Multi-modal crowd counting methods mainly include three research directions: the methods based on RGB and depth images (RGB-D), based on auditory and visual information (Audio-Visual), and based on RGB and thermal images (RGB-T).
RGB-D crowd counting methods. The depth image reflects the scale information of the image, which facilitates the detection of small-scale targets for crowd counting. Bondi et al. [51] propose an RGB-D crowd counting dataset, MICC, and use depth images to segment people and localize candidate heads. Arteta et al. [52] introduce depth information and improve crowd density estimation through foreground and background segmentation. Song et al. [53] use the Kinect sensor to obtain depth images and propose a depth region proposal network for counting. Lian et al. [54], [55] propose a strategy to generate depth-adaptive crowd density maps, and collect a real-world RGB-D crowd counting dataset called ShanghaiTechRGBD and a synthetic dataset ShanghaiTechRGBD-syn. Yang et al. [56] propose a Bidirectional Cross-modal Attention (BCA) mechanism that focuses on crowded regions in images through depth information. Li et al. [57] fuse RGB and depth images with cycle-attention and optimize the counting model from both fine pixel-aware and coarse pyramid region-aware perspectives. However, depth information is easily disturbed and not robust to illumination changes and occlusions in practical applications.
Audio-Visual crowd counting methods. Research in the field of neurobiology has shown that visual and auditory information are widely used as perceptual mediators by human beings. Several studies introduce auditory information as an auxiliary cue for crowd counting. Hu et al. [58] propose an estimation model that jointly learns the visual and audio modalities, and release a large-scale audiovisual crowd counting dataset, DISCO. Sajid et al. [59] propose an audio-visual multi-task network based on the transformer structure to achieve better pattern association and efficient feature extraction. Hu et al. [60] propose an Audio-Visual Multi-Scale Network (AVMSN) to model unconstrained visual and auditory sources for crowd counting. Nevertheless, auditory information is easily corrupted by surrounding noise in noisy scenarios.
RGB-T crowd counting methods. Thermal images are insensitive to illumination variations and have a strong ability to penetrate particulate matter such as smog and fog, so they can be used as complementary information to RGB images for crowd counting. Peng et al. [19] propose the DroneRGBT dataset, the first RGB-T crowd counting dataset from a drone view, and propose the MMCCN network to learn from RGB and thermal images simultaneously. Based on this, Zhang et al. [21] propose an improved model of MMCCN, the I-MMCCN network, which introduces a hard example mining module and a novel block-averaged absolute error loss for performance enhancement. Besides, Liu et al. [20] propose a large-scale RGB-T crowd counting dataset, RGBT-CC, and a two-stream cross-modal representation learning framework. Tang et al. [22] propose a three-stream adaptive fusion network named TAFNet, which extracts combined features of RGB and thermal images through an additional stream. After introducing thermal information, the performance of crowd counting methods under low illumination and similar backgrounds is effectively improved. However, the existing RGB-T crowd counting methods use CNN-based cross-modal fusion structures, which make it difficult to obtain long-range contextual information across modalities due to the small receptive field. Unlike the previous methods, we adopt a multi-modal fusion method based on the attention mechanism, which can effectively focus on the global and contextual information between the modalities.
III PROPOSED METHOD
In this section, we introduce the full structure of the proposed Multi-Attention Fusion Network (MAFNet) for RGB-T crowd counting. As shown in Fig. 2, the MAFNet consists of an encoder part embedded with Multi-Attention Fusion (MAF) modules for cross-modal fusion and a Multi-modal Multi-scale Aggregation (MMA) regression head to make use of the cross-modal and contextual information for crowd density map prediction. We first introduce the encoder architecture of the MAFNet and the MAF module, which is the main component of the encoder part. Then, we describe the MMA regression head. Finally, the loss function is reported.
III-A Encoder Architecture
In the encoder part, two VGG19 [29] networks with the fully-connected layers discarded are used as the backbones to extract multi-scale features of the RGB and thermal modalities. Each backbone is divided into 5 stages, and the last layer of each stage is a max-pooling layer with a stride of 2 for down-sampling. To fully exchange information between the two modalities, an intermediate fusion strategy is adopted by embedding the proposed MAF modules between adjacent stages of the backbones. If the information interaction between large-scale features of the two modalities were carried out at the shallow stages, the fusion could be affected by the misalignment of the paired RGB-T data. Therefore, in our implementation, the MAF modules are embedded between the deep stages to fuse features that are small in scale but high in semantic level, which alleviates the misalignment problem through the translation invariance of the max-pooling layers [20] in the feature extraction stage. More specifically, the RGB and thermal images are fed to the two backbones respectively for modality-specific representation learning in stages 1 and 2. Starting from stage 3, the features are first further extracted through the backbones, and then fed to the MAF modules for information interaction and enhancement. Finally, three paired feature maps from the two modalities are obtained from the outputs of the MAF modules. Their scales are 1/8, 1/16 and 1/32 of the input images, and the corresponding numbers of channels are 256, 512 and 512, respectively.
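As a rough illustration of this two-stream layout, the sketch below splits a VGG19 feature extractor into the five stages described above and applies a fusion call after stages 3-5. It is a minimal sketch, not the authors' implementation: the torchvision layer indices, class names and the identity placeholder for the MAF fusion are our assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

def make_vgg19_stages():
    # Split torchvision's VGG19 feature extractor at its five stride-2 max-pooling layers;
    # the fully-connected layers are discarded.
    features = vgg19(weights=None).features
    cut_points = [5, 10, 19, 28, 37]           # each slice ends with a max-pool layer
    stages, start = nn.ModuleList(), 0
    for end in cut_points:
        stages.append(features[start:end])
        start = end
    return stages                               # channels after stages 3/4/5: 256, 512, 512

class TwoStreamEncoder(nn.Module):
    """Modality-specific stages 1-2, cross-modal fusion from stage 3 onwards (placeholder fuse_fns)."""
    def __init__(self, fuse_fns=None):
        super().__init__()
        self.rgb_stages = make_vgg19_stages()
        self.t_stages = make_vgg19_stages()
        # three fusion modules for stages 3-5; identity placeholders stand in for the MAF modules
        self.fuse_fns = fuse_fns or [lambda r, t: (r, t)] * 3

    def forward(self, rgb, thermal):
        outs = []
        for i in range(5):
            rgb, thermal = self.rgb_stages[i](rgb), self.t_stages[i](thermal)
            if i >= 2:                          # stages 3-5: cross-modal fusion
                rgb, thermal = self.fuse_fns[i - 2](rgb, thermal)
                outs.append((rgb, thermal))     # paired features at 1/8, 1/16, 1/32 scale
        return outs
```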

III-B Multi-Attention Fusion Module
On account of the small receptive field of convolution operations, existing CNN-based multi-modal fusion modules have difficulty adequately capturing long-range contextual information from the RGB and thermal modalities. Considering the powerful cross-modal global modeling capability of the attention mechanism, an attention-based Multi-Attention Fusion (MAF) module is proposed for better information interaction between the two modalities. The implementation of the MAF module is presented in Fig. 3-(a).
Patch Embedding. To use the attention mechanism, the input 2D feature maps are first converted into a sequence of patches. Given a feature map $F_m \in \mathbb{R}^{C \times H \times W}$ of modality $m$ with channel number $C$, height $H$ and width $W$, it is reshaped into a sequence of $N$ patches of size $P \times P$ with dimension $P^2 \cdot C$, where $N = HW/P^2$. Then the patches are flattened and mapped to $D$-dimensional latent vectors via a trainable linear projection layer. The outputs of the projection layer are called patch embeddings, which can be formulated as:

$z_m = \left[ x_m^1 \mathbf{E};\ x_m^2 \mathbf{E};\ \cdots;\ x_m^N \mathbf{E} \right], \quad \mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D}$   (1)

where $x_m^i$ denotes the $i$-th flattened patch ($i = 1, 2, \dots, N$) and $\mathbf{E}$ denotes the linear projection matrix of the embedding process. In our implementation, the patch size of the first MAF module is set to 2 and those of the last two are set to 1. The embedding dimension $D$ is set to 768.
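A minimal sketch of this patch-embedding step in the spirit of Eq. (1). The unfold-and-project implementation, the class name PatchEmbedding and the use of nn.Linear are illustrative assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Flatten P x P patches of a C x H x W feature map and project them to D dimensions (Eq. 1)."""
    def __init__(self, in_channels, patch_size, embed_dim=768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(in_channels * patch_size ** 2, embed_dim)

    def forward(self, x):                                     # x: (B, C, H, W)
        p = self.patch_size
        patches = x.unfold(2, p, p).unfold(3, p, p)           # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).flatten(3)  # (B, H/p, W/p, C*p*p)
        patches = patches.flatten(1, 2)                       # (B, N, C*p*p), N = HW / p^2
        return self.proj(patches)                             # (B, N, D)

# e.g. the first MAF module: 256-channel 1/8-scale features, patch size 2, D = 768
embed = PatchEmbedding(in_channels=256, patch_size=2)
tokens = embed(torch.randn(1, 256, 32, 32))                   # -> (1, 256, 768)
```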
MAF block. To fully capture the long-range information within each modality and across modalities by the attention mechanism, the MAF block, consisting of Intra-Modality-Attention (IMA) modules and Cross-Modality-Attention (CMA) modules, is proposed. By stacking a suitable number of MAF blocks, the MAF module has a powerful cross-modal fusion capability at the global level. In this part, we first review the Multi-Head Attention mechanism and then introduce the implementation of the IMA and CMA modules.
(a) Multi-Head Attention mechanism. First, we introduce the implementation of the Multi-Head Attention mechanism [25]. Given two patch embeddings $z_1$ and $z_2$ as inputs, Multi-Head Attention executes multiple independent attention heads in parallel and concatenates their outputs, which are then projected to obtain the final value. The input of one attention head is a triplet $Q$ (query), $K$ (key) and $V$ (value) calculated from the input embeddings:

$Q = z_1 W^{Q}, \quad K = z_2 W^{K}, \quad V = z_2 W^{V}$   (2)

where $W^{Q}$, $W^{K}$, $W^{V}$ are the three learnable linear projection matrices and $d$ is the dimension of the channels. One attention head is formulated as:

$\mathrm{head} = \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V$   (3)

where $\mathrm{head}$ denotes the output of one attention head. Then the Multi-Head Attention is expressed by the following formula:

$z_{out} = \mathrm{Concat}\left(\mathrm{head}_1, \mathrm{head}_2, \dots, \mathrm{head}_h\right) W^{O}$   (4)

where $z_{out}$ is the output of the Multi-Head Attention processing, $h$ is the number of attention heads, $\mathrm{Concat}(\cdot)$ denotes the concatenation operation over all attention heads and $W^{O}$ is the linear projection matrix. In summary, the processing of the Multi-Head Attention is presented as follows:

$z_{out} = \mathrm{MHA}(z_1, z_2)$   (5)
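As a small sanity check of the $\mathrm{MHA}(z_1, z_2)$ convention above (queries from the first argument, keys and values from the second), the following sketch uses PyTorch's built-in multi-head attention. The choice of nn.MultiheadAttention, eight heads and the variable names are our assumptions.

```python
import torch
import torch.nn as nn

# One multi-head attention layer shared by the examples below.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)

def mha(z_q, z_kv):
    """MHA(z_q, z_kv): queries from the first argument, keys/values from the second (Eqs. 2-5)."""
    out, _ = attn(query=z_q, key=z_kv, value=z_kv, need_weights=False)
    return out                                    # same sequence length as the query input

z_r, z_t = torch.randn(1, 256, 768), torch.randn(1, 256, 768)   # RGB / thermal tokens
intra = mha(z_r, z_r)                             # intra-modality attention (IMA-style)
cross = mha(z_t, z_r)                             # thermal queries attend to RGB keys/values (CMA-style)
```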
(b) Intra-Modality-Attention (IMA) module and Cross-Modality-Attention (CMA) module. Based on the Multi-Head Attention mechanism, the IMA and CMA modules are designed as shown in Fig. 3-(b, c), respectively. Specifically, in the $l$-th MAF block, the input patch embeddings of the two modalities are $z_r^{l-1}$ and $z_t^{l-1}$. The input query, key and value of the IMA module are from the same modality for intra-modal feature enhancement. Formally:

$z_{r}^{\mathrm{IMA}} = \mathrm{MHA}\left(z_r^{l-1}, z_r^{l-1}\right)$   (6)

$z_{t}^{\mathrm{IMA}} = \mathrm{MHA}\left(z_t^{l-1}, z_t^{l-1}\right)$   (7)

where the superscript $\mathrm{IMA}$ indicates a variable used in the IMA module, the subscript $r$ means the operation is performed within the RGB modality and the subscript $t$ means the operation is performed within the thermal modality. Different from the IMA module, the input key and value of the CMA module are from the same modality but the query is from the other. The CMA module pays attention to the regions of interest across modalities for feature alignment and cross-modal information interaction by means of attention maps computed from the query and key of different modalities. Formally:

$z_{t \rightarrow r}^{\mathrm{CMA}} = \mathrm{MHA}\left(z_t^{l-1}, z_r^{l-1}\right)$   (8)

$z_{r \rightarrow t}^{\mathrm{CMA}} = \mathrm{MHA}\left(z_r^{l-1}, z_t^{l-1}\right)$   (9)

where the superscript $\mathrm{CMA}$ indicates a variable used in the CMA module, the subscript $t \rightarrow r$ means that the query is obtained from the thermal modality while the key and value are obtained from the RGB modality for co-attention, and, in contrast, $r \rightarrow t$ means that the query is from the RGB modality while the key and value are from the thermal modality. Then, the features obtained from the different sub-modules are aggregated by element-wise product in each modality. The aggregation process can be formulated as:

$z_r^{l} = z_{r}^{\mathrm{IMA}} \odot z_{r \rightarrow t}^{\mathrm{CMA}}, \quad z_t^{l} = z_{t}^{\mathrm{IMA}} \odot z_{t \rightarrow r}^{\mathrm{CMA}}$   (10)

Finally, the features obtained from the last MAF block of an MAF module are denoted as $z_r^{L}$ and $z_t^{L}$, where $L$ is the number of MAF blocks in one MAF module. In our implementation, $L$ is set to 2 in each MAF module, i.e., two MAF blocks are stacked.
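The sketch below expresses one MAF block as we read Eqs. (6)-(10): two IMA self-attention branches, two CMA cross-attention branches, and an element-wise product per modality. The pairing of each IMA output with the CMA output whose query comes from the same modality is our interpretation, and the layer-norm/feed-forward details of the original blocks are omitted; it is not the authors' implementation.

```python
import torch
import torch.nn as nn

class MAFBlock(nn.Module):
    """One MAF block: Intra-Modality-Attention + Cross-Modality-Attention, fused per modality."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.ima_r = nn.MultiheadAttention(dim, heads, batch_first=True)    # RGB self-attention
        self.ima_t = nn.MultiheadAttention(dim, heads, batch_first=True)    # thermal self-attention
        self.cma_r2t = nn.MultiheadAttention(dim, heads, batch_first=True)  # RGB query -> thermal K/V
        self.cma_t2r = nn.MultiheadAttention(dim, heads, batch_first=True)  # thermal query -> RGB K/V

    def forward(self, z_r, z_t):                   # token sequences of shape (B, N, D)
        ima_r, _ = self.ima_r(z_r, z_r, z_r)       # Eq. (6): intra-modal enhancement, RGB
        ima_t, _ = self.ima_t(z_t, z_t, z_t)       # Eq. (7): intra-modal enhancement, thermal
        cma_r, _ = self.cma_r2t(z_r, z_t, z_t)     # Eq. (9): RGB queries attend to thermal K/V
        cma_t, _ = self.cma_t2r(z_t, z_r, z_r)     # Eq. (8): thermal queries attend to RGB K/V
        return ima_r * cma_r, ima_t * cma_t        # Eq. (10): element-wise product per modality

# An MAF module stacks L = 2 such blocks:
blocks = nn.ModuleList([MAFBlock(), MAFBlock()])
z_r, z_t = torch.randn(1, 256, 768), torch.randn(1, 256, 768)
for blk in blocks:
    z_r, z_t = blk(z_r, z_t)
```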
Recovering 2D structure and skip connection. In order to obtain the 2D feature maps for the next stage of feature extraction, the output features of the last MAF block are mapped back to the dimension of the input embeddings by a linear projection layer and reshaped to recover the 2D structure. Besides, since the internal complexity of the MAF module may degrade the performance of the network, a skip connection is introduced, inspired by ResNet [61]: the final output feature map of each stage is obtained by directly summing the input and output feature maps.
III-C Multi-Modal Multi-Scale Aggregation Regression Head
Considering that the output feature maps from the encoder may lose crucial information in the process of down-sampling, and that feature maps at different scales focus on different levels of semantic information from different modalities, a Multi-modal Multi-scale Aggregation (MMA) regression head is proposed to aggregate the multi-modal and multi-scale feature information, reconstruct detail information and generate high-quality crowd density maps. Specifically, the paired feature maps of the two modalities at the same scale output by the MAF modules are first concatenated along the channel dimension and then fed to convolution layers to fully aggregate the information of the RGB and thermal modalities. The multi-scale features are then up-sampled to the same scale and added together. Following this, the added features are fed into four parallel branches: three of them contain dilated convolutions with dilation rates of 1, 2 and 3 to enlarge the receptive fields, and one is a skip connection with a convolution layer. The output features of the first three branches are concatenated and added to the skip connection to obtain the final aggregated feature maps. Finally, the aggregated feature maps are fed to two convolution layers to generate the predicted crowd density map.
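The following is a minimal sketch of an MMA-style regression head following the description above. The kernel sizes (3×3 dilated convolutions, 1×1 projections), the intermediate channel width and the class name are our assumptions, since the text leaves these details unspecified.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMAHead(nn.Module):
    """Fuse per-scale RGB-T pairs, aggregate scales, enlarge the receptive field, regress density."""
    def __init__(self, in_channels=(256, 512, 512), width=128):
        super().__init__()
        # per-scale cross-modal aggregation on channel-concatenated RGB-T features
        self.fuse = nn.ModuleList(nn.Conv2d(2 * c, width, 3, padding=1) for c in in_channels)
        # three parallel dilated branches (rates 1, 2, 3) plus a 1x1 skip branch
        self.dilated = nn.ModuleList(
            nn.Conv2d(width, width, 3, padding=d, dilation=d) for d in (1, 2, 3))
        self.skip = nn.Conv2d(width, 3 * width, 1)
        self.regress = nn.Sequential(nn.Conv2d(3 * width, width, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(width, 1, 1))   # predicted density map

    def forward(self, pairs):                      # [(rgb_i, t_i)] at 1/8, 1/16, 1/32 scales
        target = pairs[0][0].shape[-2:]            # up-sample everything to the largest scale
        fused = [F.interpolate(f(torch.cat(p, dim=1)), size=target,
                               mode='bilinear', align_corners=False)
                 for f, p in zip(self.fuse, pairs)]
        x = torch.stack(fused).sum(dim=0)          # add the multi-scale features together
        branches = torch.cat([conv(x) for conv in self.dilated], dim=1)
        x = branches + self.skip(x)                # concat dilated branches, add the skip path
        return self.regress(x)

head = MMAHead()
pairs = [(torch.randn(1, c, s, s), torch.randn(1, c, s, s))
         for c, s in ((256, 32), (512, 16), (512, 8))]
density = head(pairs)                              # shape: (1, 1, 32, 32)
```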
III-D Loss Function
For training the proposed model, we adopt the standard Mean Squared Error (MSE) loss function, which uses Euclidean distance to calculate the gap between the ground truth and the predicted crowd density map. The loss function can be formulated as:
$\mathcal{L}(\Theta) = \frac{1}{2N}\sum_{i=1}^{N} \left\| D^{pred}\!\left(X_i;\Theta\right) - D^{gt}_i \right\|_2^2$   (11)

where $N$ is the number of training samples, $X_i$ denotes the $i$-th training sample, $D^{pred}(X_i;\Theta)$ represents the predicted density map with network parameters $\Theta$, and $D^{gt}_i$ represents the density map generated from the ground truth.
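A few lines expressing Eq. (11) as code; the 1/(2N) scaling follows the standard crowd-counting MSE formulation described above, and the function name is illustrative.

```python
import torch

def density_mse_loss(pred, gt):
    """Eq. (11): scaled squared L2 distance between predicted and ground-truth density maps."""
    n = pred.shape[0]                               # number of training samples in the batch
    diff = (pred - gt).flatten(1)                   # (N, H*W) per-sample residuals
    return diff.pow(2).sum(dim=1).sum() / (2 * n)

loss = density_mse_loss(torch.rand(16, 1, 64, 64), torch.rand(16, 1, 64, 64))
```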
IV EXPERIMENTAL RESULTS AND ANALYSIS
The experiments are implemented on two RGB-T crowd counting datasets, RGBT-CC [20] and DroneRGBT [19]. Meanwhile, ablation studies are performed on the RGBT-CC dataset to demonstrate the effectiveness of the proposed network. Finally, further analyses are discussed.
IMA | CMA | MMA | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE
---|---|---|---|---|---|---|---
✗ | ✗ | ✗ | 14.55 | 19.75 | 24.55 | 34.13 | 28.37 |
✔ | ✗ | ✗ | 13.13 | 18.34 | 23.11 | 33.00 | 24.67 |
✗ | ✔ | ✗ | 12.50 | 16.16 | 20.55 | 26.99 | 22.30 |
✗ | ✗ | ✔ | 13.67 | 16.26 | 19.67 | 25.80 | 24.90 |
✔ | ✗ | ✔ | 12.46 | 15.29 | 19.11 | 24.54 | 23.91 |
✗ | ✔ | ✔ | 11.68 | 14.81 | 18.78 | 24.44 | 22.19 |
✔ | ✔ | ✗ | 11.85 | 15.66 | 20.47 | 28.14 | 22.11 |
✔ | ✔ | ✔ | 10.90 | 14.44 | 18.36 | 24.01 | 20.82 |
IV-A Datasets
RGBT-CC [20] is a large-scale RGB-T crowd counting dataset captured from a surveillance view. It contains 2030 pairs of RGB-T images with a size of 640×480, covering 138,389 annotated pedestrians. Among the 2030 image pairs, 1013 are captured under bright conditions and 1017 under dark conditions.
DroneRGBT [19] is an RGB-T crowd counting dataset captured from a drone view. It contains 3600 pairs of RGB-T images with a size of 640×512, covering 175,698 annotated pedestrians. Among them, about 1600 pairs were captured under dusk conditions, 1300 under bright conditions, and 900 under dark conditions.
IV-B Experimental Setup
Implementation details. The experiments are implemented on an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory. In the training phase, we adopt the AdamW optimizer, an initial learning rate with a linear warmup strategy, a batch size of 16 and a maximum of 300 iterations. Standard data augmentation is applied to the input images, including rescaling, random cropping and flipping. The input images are randomly cropped to a size of 256×256. To convert the point annotations into the corresponding crowd density maps, we use a Gaussian kernel with a size of 7 and a variance of 2 to expand the sparse points. In addition, our code is built on the C³ Framework [62].
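As an illustration of the density-map generation step, the sketch below places a unit impulse at each head annotation and smooths it with a fixed Gaussian. It assumes the annotations are (x, y) head coordinates and reads "variance of 2" as σ = 2 for illustration (if σ² = 2 is intended, use sigma = 2 ** 0.5 instead); the truncate value is derived so that the effective support is 7×7.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def points_to_density(points, height, width, sigma=2.0, kernel_size=7):
    """Place a unit impulse at each head annotation and smooth it with a fixed Gaussian kernel."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:                                    # (x, y) pixel coordinates of heads
        density[int(y), int(x)] += 1.0
    radius = (kernel_size - 1) // 2                        # 7x7 support -> radius 3
    return gaussian_filter(density, sigma=sigma, truncate=radius / sigma)

dmap = points_to_density([(120, 80), (130, 85)], height=480, width=640)
print(dmap.sum())                                          # integral stays close to the head count
```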
Evaluation Metrics. We use the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE) as evaluation metrics on the two datasets. MAE and RMSE are defined as:

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left| P_i - G_i \right|$   (12)

$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left( P_i - G_i \right)^2}$   (13)

where $P_i$ and $G_i$ refer to the predicted count and the ground truth of the $i$-th RGB-T image pair, respectively, and $N$ is the number of test image pairs. In addition, following [20], the Grid Average Mean Absolute Error (GAME) is utilized on the RGBT-CC dataset to evaluate the performance in different regions. Specifically, for a particular level $l$, the image is divided into $4^{l}$ non-overlapping regions, and the counting errors of all regions are summed up. GAME is defined as:

$\mathrm{GAME}(l) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{4^{l}}\left| P_i^{j} - G_i^{j} \right|$   (14)

where $P_i^{j}$ and $G_i^{j}$ refer to the predicted count and the ground truth of the $j$-th region of the $i$-th RGB-T image pair, respectively. It is worth noting that GAME(0) corresponds to the counting error of the whole image, which is equivalent to the MAE.
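A minimal sketch of Eq. (14) for a single image pair; averaging the returned error over the test set gives GAME(l), and GAME(0) reduces to the MAE term as noted above. The density maps are assumed to be NumPy arrays whose regional sums are the counts.

```python
import numpy as np

def game(pred, gt, level):
    """Grid Average Mean Absolute Error contribution of one image pair at grid level `level` (Eq. 14)."""
    cells = 2 ** level                                     # 4**level regions = cells x cells grid
    h_edges = np.linspace(0, pred.shape[0], cells + 1, dtype=int)
    w_edges = np.linspace(0, pred.shape[1], cells + 1, dtype=int)
    err = 0.0
    for i in range(cells):
        for j in range(cells):
            p = pred[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]].sum()
            g = gt[h_edges[i]:h_edges[i + 1], w_edges[j]:w_edges[j + 1]].sum()
            err += abs(p - g)                              # per-region counting error
    return err

# game(pred, gt, 0) is simply |predicted count - ground-truth count| for that image pair.
```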
IV-C Ablation Studies
In order to verify the effectiveness of each component of the proposed Multi-Attention Fusion Network (MAFNet), extensive ablation studies are performed on the RGBT-CC dataset. A model with a dual VGG19 encoder and a plain regression head consisting of two convolution layers is used as the baseline. We gradually add the proposed modules to the baseline to obtain variant models and compare their performance with the baseline to verify the effectiveness of each module. The evaluation results are shown in Table I, where row 1 shows the result of the baseline.
Effectiveness of Multi-Attention Fusion module. The Multi-Attention Fusion (MAF) module is stacked from MAF blocks, each of which consists of two Intra-Modality-Attention (IMA) modules and a Cross-Modality-Attention (CMA) module. To demonstrate the effectiveness of the MAF module, we perform ablation studies on both the IMA and CMA modules. For the results in rows 2 and 3, the models with the IMA and CMA modules, respectively, both perform better than the baseline on all evaluation metrics, which verifies the effectiveness of the IMA and CMA modules. In particular, the performance improvement of the model with the CMA module is much larger than that of the model with the IMA module. This is because the cross-modal information interaction relies mainly on the CMA module, which therefore plays a more important role in the MAF module. Comparing the results in rows 5 and 6, the model with CMA also performs better than the model with IMA, further illustrating the importance of CMA in cross-modal information interaction. As shown in row 7, the model with both the IMA and CMA modules exhibits satisfactory performance, which is close to the performance of the full MAFNet (row 7 vs. row 8).

Effectiveness of Multi-modal Multi-scale Aggregation regression head. The Multi-modal Multi-scale Aggregation (MMA) regression head is proposed to aggregate the multi-modal and multi-scale feature information to predict crowd density maps. Comparing row 4 with the baseline in row 1, the performance of the model is improved after adding the MMA module to the baseline model, which verifies the effectiveness of the MMA regression head. Further, the performance gains of combining the MMA module with the IMA and CMA modules, respectively, are explored in rows 5 and 6. The models with the IMA and CMA modules are both greatly improved on all evaluation metrics by adding the MMA regression head, with GAME(0) improved by 5.10% / 6.56% (row 2 vs. row 5 and row 3 vs. row 6). From this, it can be observed that the model using the IMA module benefits more from adding the MMA regression head than the model with the CMA module. Since the IMA module cannot perform cross-modal information interaction in the encoder stage, the multi-modal information is instead aggregated in the regression stage by the MMA head, which compensates for the limitation of IMA. Through the above extensive ablation studies, the effectiveness of the MMA regression head in aggregating multi-modal and multi-scale information in MAFNet is well demonstrated.
Method | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE
---|---|---|---|---|---
CSRNet [11] | 20.40 | 23.58 | 28.03 | 35.51 | 35.26 |
BL [63] | 18.70 | 22.55 | 26.83 | 34.62 | 32.67 |
CMCRL [20] | 15.61 | 19.95 | 24.69 | 32.89 | 28.18 |
TAFNet [22] | 12.38 | 16.98 | 21.86 | 30.19 | 22.45 |
MAFNet w/o MMA | 11.85 | 15.66 | 20.47 | 28.14 | 22.11 |
MAFNet (Ours) | 10.90 | 14.44 | 18.36 | 24.01 | 20.82 |
Illumination | Method | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE |
---|---|---|---|---|---|---|
Brightness | CMCRL [20] | 20.36 | 23.57 | 28.49 | 36.29 | 32.57
Brightness | TAFNet [22] | 15.57 | 20.65 | 26.67 | 36.17 | 24.25
Brightness | Ours | 11.31 | 14.61 | 18.92 | 25.31 | 21.81
Darkness | CMCRL [20] | 15.44 | 19.23 | 23.79 | 30.28 | 29.11
Darkness | TAFNet [22] | 14.20 | 19.20 | 24.00 | 31.63 | 27.50
Darkness | Ours | 10.48 | 14.26 | 17.79 | 22.67 | 19.75
IV-D Comparison with the SOTAs on RGBT-CC Dataset
Table II lists the performance comparison between MAFNet and four state-of-the-art methods on the two datasets. MAFNet shows strong performance and outperforms the existing methods. As shown in Table II-(a), on the RGBT-CC dataset, MAFNet improves GAME(0), GAME(1), GAME(2), GAME(3) and RMSE by 11.95%, 14.96%, 16.01%, 20.47% and 7.26%, respectively, compared with the state-of-the-art method TAFNet [22]. As shown in Table II-(b), on the DroneRGBT dataset, compared with I-MMCCN [21], MAFNet achieves 7.38% and 9.77% improvements in MAE and RMSE. In addition, existing methods such as CMCRL [20] and TAFNet [22] do not utilize multi-scale information, so their performance depends entirely on the proposed multi-modal fusion modules. For a fair comparison, we also evaluate MAFNet without the MMA regression head against the state-of-the-art methods (row 4 vs. row 5 in Table II-(a) and (b)); the experiments show that MAFNet without the MMA regression head still outperforms the state-of-the-art methods, which demonstrates the superiority of the designed MAF module over existing multi-modal fusion modules. A visualization comparison of several state-of-the-art methods and MAFNet, regarding the generated density maps at different density levels on the two datasets, is shown in Fig. 4. The visualization results show that MAFNet predicts more accurate counting results and performs better in dense crowd scenes than the state-of-the-art methods.
To further illustrate the robustness of MAFNet to illumination conditions, the RGBT-CC test set of 800 images is divided into 406 bright and 394 dark images to test the model's performance under different illumination conditions. As shown in Table III, compared with CMCRL [20] and TAFNet [22], MAFNet outperforms these methods under both illumination conditions. It is worth noting that MAFNet exhibits similar performance under the different illumination conditions, while the performance of the other methods varies considerably. This phenomenon shows that MAFNet is more robust to illumination variations than the other methods.

IV-E Analysis and Discussion

Visualization analysis. To further illustrate the effectiveness of the proposed model, the attention maps of the three MAF modules are visualized in Fig. 5. It is clear that the network's attention is indeed focused on the target crowds, which illustrates the effectiveness of the attention mechanism used in the MAF module. Further, by comparing the attention maps obtained from different MAF modules, it can be observed that MAF modules embedded at different locations focus on different levels of feature information. The shallow MAF module, whose inputs are large-scale feature maps, focuses well on heads in sparse scenes, but has difficulty covering the entire crowd in denser scenes. In contrast, the deep MAF module uses semantics-rich, small-scale feature maps that model global information well in dense scenes, but its ability to identify specific targets in sparse scenes decreases due to the low resolution. In particular, the second MAF module, embedded in the middle stage, attends well to crowds across different scales and density levels. Based on these visualized results, it is natural to introduce the MMA regression head to make full use of the multi-modal and multi-scale information and improve the robustness of the model to scenes with different densities.
Effectiveness of RGB-T multi-modal data. To verify the effectiveness of multi-modal data, RGB and thermal images are used separately to train the proposed model, and the performance is compared with that of the RGB-T trained model. As shown in Table IV, using only RGB data to train the model works unsatisfactorily (row 1), because it is difficult to discriminate people in RGB images under dark conditions, which provides insufficient information for the crowd counting task. In comparison, the performance of the model trained with thermal images (row 2) is much better and closer to that of the model trained with RGB-T data (row 3), indicating that thermal images play a more important role in RGB-T crowd counting, especially in dark environments. To further illustrate the necessity of multi-modal data, some extreme examples are shown in Fig. 6. In the first row, Fig. 6-(a) shows an RGB image under a low-illumination condition, which leads to a large error in the predicted crowd density map shown in Fig. 6-(d). In the last row, the thermal image shown in Fig. 6-(b) is poorly imaged, with unclear outlines of the people in the boxed region, leading to large counting errors as shown in Fig. 6-(e). Thanks to the high complementarity of RGB and thermal images, each modality compensates for the missing information of the other, ensuring the quality of the predicted crowd density maps shown in Fig. 6-(f).
Input | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE |
---|---|---|---|---|---|
RGB | 23.85 | 30.18 | 37.24 | 45.40 | 45.46 |
T | 12.34 | 15.50 | 19.25 | 24.45 | 22.84 |
RGB-T | 10.90 | 14.44 | 18.36 | 24.01 | 20.82 |
Patch Size. In conventional vision transformer models, using large patches such as 16×16 in ViT [3] yields only coarse-grained features, which may cause large errors in dense prediction tasks such as crowd counting. In contrast, using small patches makes the token sequence so long that the computation becomes expensive. However, as mentioned in Section III-A, the MAF modules are only embedded at the deep stages to process small-scale features, so they do not incur much computation. Therefore, in our implementation the patch size of the first MAF module is set to 2 and those of the last two are set to 1, which ensures both the acquisition of fine-grained features and low computation.
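A short back-of-the-envelope check of the sequence lengths behind this choice, assuming the 256×256 training crops from Section IV-B; the numbers only illustrate why small patches remain affordable at the deep stages.

```python
# Token counts per MAF module for a 256 x 256 crop:
crop = 256
for stage, (scale, patch) in enumerate(((8, 2), (16, 1), (32, 1)), start=3):
    side = crop // scale // patch            # feature side length after patching
    print(f"stage {stage}: {side * side} tokens")
# stage 3: 256 tokens, stage 4: 256 tokens, stage 5: 64 tokens,
# versus (256 // 16) ** 2 = 256 tokens for 16 x 16 patches on the raw image in ViT,
# but here the tokens carry high-level, small-scale features rather than raw pixels.
```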
The necessity of Position Embedding. Although position embedding is a default component of attention-based transformer models [25], it is worth discussing whether it is necessary in the MAF module. The comparison results with and without position embedding are shown in Table V, where PE denotes the position embedding. It can be found that the two settings achieve comparable performance. Specifically, the network with position embedding performs slightly worse in GAME(0) (10.95 vs. 10.90) but better in RMSE (19.85 vs. 20.82). Considering that both convolution and attention mechanisms are used in the network, the skip connection of the MAF module implicitly introduces the inductive bias of convolution, which provides location information for the attention computation. Therefore, the addition of position embedding has little effect on the model's performance.
Method | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE |
---|---|---|---|---|---|
w. PE | 10.95 | 14.41 | 18.62 | 24.48 | 19.85 |
w/o PE | 10.90 | 14.44 | 18.36 | 24.01 | 20.82 |
Comparison of different numbers of MAF modules. The number of MAF modules is an important hyperparameter. If the MAF modules were also embedded at the earlier stages, the computation would increase on the one hand, and additional counting errors would be introduced on the other hand due to the misalignment of the paired images, as mentioned in Section III-A. To find the optimal number of MAF modules, MAF modules are embedded after each stage in turn, starting from the last stage of the backbone. As shown in Table VI, the performance of the network gradually improves as the number of MAF modules increases, and the best result is obtained with 3 MAF modules. When the number is increased to 4, the model's performance starts to degrade. Therefore, the number of MAF modules is set to 3 in the network.
Nums | Depth | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE |
---|---|---|---|---|---|---|
1 | 2,2,2 | 11.88 | 15.33 | 19.14 | 26.25 | 23.04 |
2 | 2,2,2 | 11.55 | 15.12 | 19.24 | 26.35 | 21.14 |
3 | 1,1,1 | 12.42 | 15.43 | 18.97 | 24.08 | 22.09 |
3 | 1,2,4 | 12.08 | 15.45 | 19.33 | 26.56 | 22.87
3 | 4,2,1 | 11.67 | 14.54 | 18.05 | 23.33 | 21.08
3 | 2,2,2 | 10.90 | 14.44 | 18.36 | 24.01 | 20.82
4 | 2,2,2 | 12.22 | 15.09 | 18.52 | 23.87 | 21.67 |
Comparison of different depths of MAF module. The depth of an MAF module is the number of stacked MAF blocks in the module, which is also an important hyperparameter. We test network configurations with different depths of the MAF modules and show the results in Table VI. The three numbers in the second column of Table VI correspond to the depths of the three MAF modules. The results indicate that the depth of the MAF modules greatly affects the performance of the model. Considering that MAE (GAME(0)) and RMSE are more important than the other evaluation metrics, we set the depth of all three MAF modules to 2 in our final implementation to obtain the best result.
V CONCLUSION
In this paper, we propose an effective two-stream framework for RGB-T crowd counting termed the Multi-Attention Fusion Network (MAFNet). By embedding the attention-based Multi-Attention Fusion (MAF) modules between different stages of the backbones, the receptive field of the network in the cross-modal fusion stage is enlarged to fully capture long-range contextual information across modalities. The proposed Multi-modal Multi-scale Aggregation (MMA) regression head integrates multi-scale information across modalities and further improves the network's performance. Extensive experiments show the effectiveness of the proposed method, and the network achieves state-of-the-art performance on the RGBT-CC and DroneRGBT datasets. The ablation studies demonstrate the effectiveness of the individual modules.
References
- [1] S. Zhang, G. Wu, J. P. Costeira, and J. M. Moura, “Understanding traffic density from large-scale web camera data,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5898–5907.
- [2] F. Xiong, X. Shi, and D.-Y. Yeung, “Spatiotemporal modeling for crowd counting in videos,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5151–5159.
- [3] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
- [4] L. B. Rosenberg, “Human swarms, a real-time method for collective intelligence,” in ECAL 2015: the 13th European Conference on Artificial Life. MIT Press, 2015, pp. 658–659.
- [5] L. Rosenberg, D. Baltaxe, and N. Pescetelli, “Crowds vs swarms, a comparison of intelligence,” in 2016 Swarm/Human Blended Intelligence Workshop (SHBI), 2016, pp. 1–4.
- [6] D. Yu, C. P. Chen, and H. Xu, “Intelligent decision making and bionic movement control of self-organized swarm,” IEEE Transactions on Industrial Electronics, vol. 68, no. 7, pp. 6369–6378, 2020.
- [7] D. Helbing, I. J. Farkas, P. Molnar, and T. Vicsek, “Simulation of pedestrian crowds in normal and evacuation situations,” Pedestrian and evacuation dynamics, vol. 21, no. 2, pp. 21–58, 2002.
- [8] J. Sang, W. Wu, H. Luo, H. Xiang, Q. Zhang, H. Hu, and X. Xia, “Improved crowd counting method based on scale-adaptive convolutional neural network,” IEEE Access, vol. 7, pp. 24 411–24 419, 2019.
- [9] V. Lempitsky and A. Zisserman, “Learning to count objects in images,” Advances in neural information processing systems, vol. 23, 2010.
- [10] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-image crowd counting via multi-column convolutional neural network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 589–597.
- [11] Y. Li, X. Zhang, and D. Chen, “Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 1091–1100.
- [12] X. Jiang, L. Zhang, P. Lv, Y. Guo, R. Zhu, Y. Li, Y. Pang, X. Li, B. Zhou, and M. Xu, “Learning multi-level density maps for crowd counting,” IEEE transactions on neural networks and learning systems, vol. 31, no. 8, pp. 2705–2715, 2019.
- [13] J. Gao, Q. Wang, and Y. Yuan, “Scar: Spatial-/channel-wise attention regression networks for crowd counting,” Neurocomputing, vol. 363, pp. 1–8, 2019.
- [14] J. Gao, Q. Wang, and X. Li, “Pcc net: Perspective crowd counting via spatial convolutional network,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 10, pp. 3486–3498, 2019.
- [15] Q. Wang, J. Gao, W. Lin, and Y. Yuan, “Learning from synthetic data for crowd counting in the wild,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8198–8207.
- [16] J. Gao, Y. Yuan, and Q. Wang, “Feature-aware adaptation and density alignment for crowd counting in video surveillance,” IEEE transactions on cybernetics, vol. 51, no. 10, pp. 4822–4833, 2020.
- [17] Q. Wang, T. Han, J. Gao, and Y. Yuan, “Neuron linear transformation: Modeling the domain shift for crowd counting,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
- [18] J. Gao, T. Han, Y. Yuan, and Q. Wang, “Domain-adaptive crowd counting via high-quality image translation and density reconstruction,” IEEE Transactions on Neural Networks and Learning Systems, 2021.
- [19] T. Peng, Q. Li, and P. Zhu, “Rgb-t crowd counting from drone: A benchmark and mmccn network,” in Proceedings of the Asian Conference on Computer Vision, 2020.
- [20] L. Liu, J. Chen, H. Wu, G. Li, C. Li, and L. Lin, “Cross-modal collaborative representation learning and a large-scale rgbt benchmark for crowd counting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4823–4833.
- [21] B. Zhang, Y. Du, Y. Zhao, J. Wan, and Z. Tong, “I-mmccn: Improved mmccn for rgb-t crowd counting of drone images,” in 2021 7th IEEE International Conference on Network Intelligence and Digital Content (IC-NIDC). IEEE, 2021, pp. 117–121.
- [22] H. Tang, Y. Wang, and L.-P. Chau, “Tafnet: A three-stream adaptive fusion network for rgb-t crowd counting,” arXiv preprint arXiv:2202.08517, 2022.
- [23] J.-B. Cordonnier, A. Loukas, and M. Jaggi, “On the relationship between self-attention and convolutional layers,” arXiv preprint arXiv:1911.03584, 2019.
- [24] B. Zhao, H. Li, X. Lu, and X. Li, “Reconstructive sequence-graph network for video summarization,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- [25] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
- [26] H. Tan and M. Bansal, “Lxmert: Learning cross-modality encoder representations from transformers,” arXiv preprint arXiv:1908.07490, 2019.
- [27] W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 5583–5594.
- [28] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning. PMLR, 2021, pp. 8748–8763.
- [29] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
- [30] I. S. Topkaya, H. Erdogan, and F. Porikli, “Counting people by clustering person detector outputs,” in 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2014, pp. 313–318.
- [31] P. Viola, M. J. Jones, and D. Snow, “Detecting pedestrians using patterns of motion and appearance,” International Journal of Computer Vision, vol. 63, no. 2, pp. 153–161, 2005.
- [32] M. Li, Z. Zhang, K. Huang, and T. Tan, “Estimating the number of people in crowded scenes by mid based foreground segmentation and head-shoulder detection,” in 2008 19th international conference on pattern recognition. IEEE, 2008, pp. 1–4.
- [33] B. Leibe, E. Seemann, and B. Schiele, “Pedestrian detection in crowded scenes,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1. IEEE, 2005, pp. 878–885.
- [34] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2009.
- [35] J. Wang, J. Gao, Y. Yuan, and Q. Wang, “Crowd localization from gaussian mixture scoped knowledge and scoped teacher,” arXiv preprint arXiv:2206.05717, 2022.
- [36] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, “Privacy preserving crowd monitoring: Counting people without people models or tracking,” in 2008 IEEE conference on computer vision and pattern recognition. IEEE, 2008, pp. 1–7.
- [37] A. B. Chan and N. Vasconcelos, “Bayesian poisson regression for crowd counting,” in 2009 IEEE 12th international conference on computer vision. IEEE, 2009, pp. 545–551.
- [38] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, “Multi-source multi-scale counting in extremely dense crowd images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2013, pp. 2547–2554.
- [39] V. A. Sindagi and V. M. Patel, “Generating high-quality crowd density maps using contextual pyramid cnns,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1861–1870.
- [40] W. Liu, M. Salzmann, and P. Fua, “Context-aware crowd counting,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5099–5108.
- [41] Z. Yan, Y. Yuan, W. Zuo, X. Tan, Y. Wang, S. Wen, and E. Ding, “Perspective-guided convolution networks for crowd counting,” in Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 952–961.
- [42] Y. Yang, G. Li, Z. Wu, L. Su, Q. Huang, and N. Sebe, “Reverse perspective network for perspective-aware object counting,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 4374–4383.
- [43] X. Wei, Y. Kang, J. Yang, Y. Qiu, D. Shi, W. Tan, and Y. Gong, “Scene-adaptive attention network for crowd counting,” arXiv preprint arXiv:2112.15509, 2021.
- [44] Y. Tian, X. Chu, and H. Wang, “Cctrans: Simplifying and improving crowd counting with transformer,” arXiv preprint arXiv:2109.14483, 2021.
- [45] H. Lin, Z. Ma, R. Ji, Y. Wang, and X. Hong, “Boosting crowd counting via multifaceted attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19 628–19 637.
- [46] W. Guo, J. Wang, and S. Wang, “Deep multimodal representation learning: A survey,” IEEE Access, vol. 7, pp. 63 373–63 394, 2019.
- [47] K. Bayoudh, R. Knani, F. Hamdaoui, and A. Mtibaa, “A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets,” The Visual Computer, pp. 1–32, 2021.
- [48] X. Yuan, Z. Lin, J. Kuen, J. Zhang, Y. Wang, M. Maire, A. Kale, and B. Faieta, “Multimodal contrastive training for visual representation learning,” in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 6991–7000.
- [49] K. Fu, D.-P. Fan, G.-P. Ji, and Q. Zhao, “Jl-dcf: Joint learning and densely-cooperative fusion framework for rgb-d salient object detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3052–3062.
- [50] K. Fu, D.-P. Fan, G.-P. Ji, Q. Zhao, J. Shen, and C. Zhu, “Siamese network for rgb-d salient object detection and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2021.
- [51] E. Bondi, L. Seidenari, A. D. Bagdanov, and A. Del Bimbo, “Real-time people counting from depth imagery of crowded environments,” in 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2014, pp. 337–342.
- [52] C. Arteta, V. Lempitsky, and A. Zisserman, “Counting in the wild,” in European conference on computer vision. Springer, 2016, pp. 483–498.
- [53] D. Song, Y. Qiao, and A. Corbetta, “Depth driven people counting using deep region proposal network,” in 2017 IEEE International Conference on Information and Automation (ICIA). IEEE, 2017, pp. 416–421.
- [54] D. Lian, J. Li, J. Zheng, W. Luo, and S. Gao, “Density map regression guided detection network for rgb-d crowd counting and localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 1821–1830.
- [55] D. Lian, X. Chen, J. Li, W. Luo, and S. Gao, “Locating and counting heads in crowds with a depth prior,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.
- [56] S.-D. Yang, H.-T. Su, W. H. Hsu, and W.-C. Chen, “Deccnet: Depth enhanced crowd counting,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019, pp. 0–0.
- [57] H. Li, S. Zhang, and W. Kong, “Rgb-d crowd counting with cross-modal cycle-attention fusion and fine-coarse supervision,” IEEE Transactions on Industrial Informatics, 2022.
- [58] D. Hu, L. Mou, Q. Wang, J. Gao, Y. Hua, D. Dou, and X. X. Zhu, “Ambient sound helps: Audiovisual crowd counting in extreme conditions,” arXiv preprint arXiv:2005.07097, 2020.
- [59] U. Sajid, X. Chen, H. Sajid, T. Kim, and G. Wang, “Audio-visual transformer based crowd counting,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 2249–2259.
- [60] R. Hu, Q. Mo, Y. Xie, Y. Xu, J. Chen, Y. Yang, H. Zhou, Z.-R. Tang, and E. Q. Wu, “Avmsn: An audio-visual two stream crowd counting framework under low-quality conditions,” IEEE Access, vol. 9, pp. 80 500–80 510, 2021.
- [61] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [62] J. Gao, W. Lin, B. Zhao, D. Wang, C. Gao, and J. Wen, “C^3 framework: An open-source pytorch code for crowd counting,” arXiv preprint arXiv:1907.02724, 2019.
- [63] Z. Ma, X. Wei, X. Hong, and Y. Gong, “Bayesian loss for crowd count estimation with point supervision,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6142–6151.
Pengyu Chen received the B.E. degree in Automation from Northwestern Polytechnical University, Xi’an 710072, Shaanxi, P. R. China, in 2022. He is currently pursuing the M.S. degree in computer science and technology with the School of Computer Science, and the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University. His research interests include computer vision and pattern recognition.
Junyu Gao received the B.E. degree and the Ph.D. degree in computer science and technology from Northwestern Polytechnical University, Xi’an 710072, Shaanxi, P. R. China, in 2015 and 2021, respectively. He is currently a researcher with the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an, China. His research interests include computer vision and pattern recognition.
Yuan Yuan (M’05-SM’09) is currently a Full Professor with the School of Computer Science and the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an, China. She has authored or coauthored over 150 papers, including about 100 in reputable journals, such as the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, as well as conference papers in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), British Machine Vision Conference (BMVC), International Conference on Image Processing (ICIP), and International Conference on Acoustics, Speech and Signal Processing (ICASSP). Her current research interests include visual information processing and image/video content analysis.
Qi Wang received the B.E. degree in automation and the Ph.D. degree in pattern recognition and intelligent systems from the University of Science and Technology of China, Hefei, China, in 2005 and 2010, respectively. He is currently a Professor with the School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi’an, China. His research interests include computer vision, pattern recognition, and remote sensing.