Explicit Attention-Enhanced Fusion for RGB-Thermal Perception Tasks
Abstract
Recently, RGB-Thermal based perception has shown significant advances. Thermal information provides useful clues when visual cameras suffer from poor lighting conditions, such as low light and fog. However, how to effectively fuse RGB images and thermal data remains an open challenge. Previous works employ naive fusion strategies, such as merging the two modalities at the input, concatenating multi-modality features inside models, or applying attention to each modality. These fusion strategies are straightforward yet insufficient. In this paper, we propose a novel fusion method named Explicit Attention-Enhanced Fusion (EAEF) that fully exploits each type of data. Specifically, we consider the following cases: i) both RGB data and thermal data generate discriminative features, ii) only one of them does, and iii) neither of them does. EAEF uses one branch to enhance feature extraction for i) and iii) and the other branch to remedy insufficient representations for ii). The outputs of the two branches are fused to form complementary features. As a result, the proposed fusion method outperforms the state of the art by 1.6% on semantic segmentation, 2.3% on object detection, and 1.19 in RMSE on crowd counting. The code is available at https://github.com/FreeformRobotics/EAEFNet.
Index Terms:
Multi-modality data fusion, RGB-Thermal fusion, RGB-thermal perception
I Introduction
Over the last decade, we have witnessed significant progress in many perception tasks. Based on data-driven learning, deep neural networks (DNNs) can learn to estimate semantic maps [FCN, UNet, DeepLab, pspnet, upernet, gcnet], object categories [FasterRCNN, Yolo, cascadeRCNN, FCOS], depth maps [NYU, KITTI, Hu2019RevisitingSI, jiang2021plnet], etc., from only RGB images. This paradigm has continuously boosted perception tasks for robots, for which various models, loss functions, and learning strategies have been explored.
However, current methods highly depend on the quality of RGB images. In reality, visual cameras are particularly susceptible to noise [suganuma2019attention], poor lighting [hu2021two], adverse weather [liu2019dual], etc. In these cases, the performance of DNNs tends to degrade significantly. To handle these issues, researchers have sought to employ thermal data to complement RGB images and have developed different multi-modality fusion strategies for RGB-T based perception.
The core of RGB-T based methods is the fusion strategy for RGB data and thermal data. Previous methods [UCNet, m3fd] directly combine them at the input. Some works [MFNet, RTFNet] use two separate encoders to extract features from RGB and thermal images, respectively; these features are then merged and fed to a decoder to yield the final prediction. More recently, many studies [ABMDRNet, FEANet] have utilized the attention mechanism for multi-modality data fusion. These approaches commonly apply channel attention to intermediate features of each data type and obtain the fused features by weighting their importance. However, these fusion strategies are implicit and insufficient. In particular, it is unclear how multi-modality data can (or cannot) complement each other.
Different from existing studies, we explicitly consider three circumstances for multi-modality data fusion: i) both RGB and thermal images yield useful features, as in Fig. 1(a), ii) only one of them generates meaningful representations, as in Fig. 1(b), and iii) neither provides useful features, as in Fig. 1(c). In this paper, we propose Explicit Attention-Enhanced Fusion (EAEF), which performs a more effective fusion. The key inspiration of EAEF is a case-specific design that uses one branch to preserve meaningful representations for i) and enhance feature extraction for iii), and the other branch to force CNNs to pay attention to insufficient representations for ii). At least one of these branches generates useful features, and their combination yields complementary features for the final prediction.
To validate the effectiveness of this fusion method, we design an RGB-T framework by integrating EAEF into an encoder-decoder network and evaluate it on benchmark datasets for various vision tasks, including semantic segmentation, object detection, and crowd counting. Extensive experiments confirm that our method is more effective for RGB-T based visual perception.
In summary, our contributions are:
• A novel multi-modality fusion method that effectively fuses RGB features and thermal features in an explicit manner.
• An effective encoder-decoder based network assembled with the proposed feature fusion strategy for dense prediction tasks.
• State-of-the-art performance on semantic segmentation, object detection, and crowd counting, with open-source code.
The remainder of this paper is organized as follows. Section II reviews related studies. Section III presents the framework and the proposed multi-modality data fusion method. Section IV provides quantitative and qualitative experimental results on three tasks. Section V concludes our work.
II Related Work
Most vision tasks aim to make predictions from only RGB images. However, they suffer from low accuracy, robustness, and generalizability when images are captured under poor lighting conditions, noise, motion blur, etc. A promising solution is to utilize multi-modality data to complement RGB images. A representative example is depth maps [Fusenet, he2017std2p, seichter2021efficient, Hu2022DeepDC], which have motivated new designs of deep neural networks, loss functions, and learning strategies.
Recently, there has been a trend of introducing thermal data into perception tasks. The key point of these methods is the fusion strategy between RGB images and thermal images. We can categorize the current methods into two types according to whether attention mechanisms are leveraged in the fusion process.
Regarding non-attention methods, MFNet [MFNet] is an early attempt in which features from both modalities are extracted by a dual-encoder structure and fused into the decoder by symmetric shortcut blocks. RTFNet [RTFNet] proposes an auxiliary branch for extracting thermal features, which are element-wise summed into the main RGB branch at different spatial scales. Based on RTFNet's structure, FuseSeg [FuseSeg] further introduces skip-connection modules for improvement. Besides the dual-encoder structure, PST900 [pst900] decomposes the prediction process into two stages: it makes a coarse prediction from the RGB image alone in the first stage and leverages the thermal input for refinement in the second stage. TarDAL [m3fd] fuses the two modalities in an early-fusion manner and achieves impressive performance.
As mentioned before, utilizing the attention mechanism for cross-modal fusion has recently become a new trend. ABMDRNet [ABMDRNet] fuses the two modalities by implementing channel attention in its channel-wise fusion (CWF) module. FEANet [FEANet] further adds a spatial attention operation right after channel attention to recover detailed structures. A similar combination of spatial and channel attention is also introduced in the Shallow Feature Fusion Module (SFFM) of GMNet [GMNet], but implemented in a more refined manner. MFTNet [MFTNet] resorts to the popular self-attention mechanism to improve feature fusion.
In this work, we propose an explicit attention-enhanced mechanism that analyzes the modal differences and takes full advantage of cross-modal fusion on multiple perception tasks.

III Methodology
III-A Framework Overview
We adopt the classical encoder-decoder architecture for dense prediction tasks. The framework consists of an RGB encoder, a thermal encoder, and a shared decoder. Similar to existing approaches, the RGB encoder extracts features from RGB images and the thermal encoder extracts features from thermal images. The proposed Explicit Attention-Enhanced Fusion (EAEF) is applied between the two encoders to fuse features at multiple scales. A diagram of our dense prediction network is given in Fig. 2, where we show a semantic segmentation network built on ResNet [ResNet]. The backbone uses five convolutional blocks to extract multi-scale features; thus, we apply five EAEF modules to fuse RGB and thermal features.
Note that the framework naturally uses different backbones for different tasks; therefore, the detailed implementation is task-specific. Nevertheless, all models built on our framework share the same technical components, i.e., one RGB encoder, one thermal encoder, EAEF modules, and a shared decoder.
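To make the overall structure concrete, below is a minimal PyTorch-style sketch of the dual-encoder framework. It is not the released EAEFNet implementation: the per-scale fusion is stubbed out as a simple element-wise mean, the decoder is reduced to a single 1x1 prediction head with bilinear upsampling, and the routing of the fused map back into both streams is an assumption based on Fig. 2.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class DualEncoderSegNet(nn.Module):
    """Dual-encoder RGB-T network: features are fused at every encoder
    stage and the fused map is fed back into both streams."""

    def __init__(self, num_classes=9):
        super().__init__()
        self.rgb_stages = self._make_stages(resnet50(weights=None), in_ch=3)
        self.thermal_stages = self._make_stages(resnet50(weights=None), in_ch=1)
        self.head = nn.Conv2d(2048, num_classes, kernel_size=1)  # minimal decoder head

    @staticmethod
    def _make_stages(backbone, in_ch):
        if in_ch != 3:  # thermal input is single-channel
            backbone.conv1 = nn.Conv2d(in_ch, 64, 7, stride=2, padding=3, bias=False)
        return nn.ModuleList([
            nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu),
            nn.Sequential(backbone.maxpool, backbone.layer1),
            backbone.layer2, backbone.layer3, backbone.layer4,
        ])

    @staticmethod
    def _fuse(f_r, f_t):
        # Placeholder fusion (element-wise mean); one EAEF module per scale
        # would be plugged in here.
        return 0.5 * (f_r + f_t)

    def forward(self, rgb, thermal):
        f_r, f_t, fused = rgb, thermal, None
        for rgb_stage, th_stage in zip(self.rgb_stages, self.thermal_stages):
            f_r, f_t = rgb_stage(f_r), th_stage(f_t)
            fused = self._fuse(f_r, f_t)
            f_r = f_t = fused  # fused map re-enters both streams (cf. Fig. 2)
        logits = self.head(fused)
        return F.interpolate(logits, size=rgb.shape[-2:], mode="bilinear", align_corners=False)

model = DualEncoderSegNet()
out = model(torch.randn(1, 3, 480, 640), torch.randn(1, 1, 480, 640))
print(out.shape)  # torch.Size([1, 9, 480, 640])
```

Replacing the placeholder `_fuse` with one EAEF module per scale recovers the architecture described in Sec. III-B.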
III-B Explicit Attention-Enhanced Fusion
Suppose $\mathbf{F}_R$ and $\mathbf{F}_T$ are the features extracted by the RGB encoder and the thermal encoder at a certain scale. We first quantify whether $\mathbf{F}_R$ or $\mathbf{F}_T$ contains sufficiently discriminative features. We apply channel-wise global average pooling to $\mathbf{F}_R$ and $\mathbf{F}_T$ and then apply an MLP to obtain the channel weights as follows:
$W_R = \mathcal{M}\big(\mathcal{G}(\mathbf{F}_R)\big), \quad W_T = \mathcal{M}\big(\mathcal{G}(\mathbf{F}_T)\big)$ (1)
where $W_R$ and $W_T$ are the extracted weights for the RGB features and thermal features, respectively; $\mathcal{G}$ and $\mathcal{M}$ denote global average pooling and the MLP, respectively. In many previous works, feature fusion is conducted by applying $\sigma(W_R)\otimes\mathbf{F}_R$ and $\sigma(W_T)\otimes\mathbf{F}_T$, where $\sigma$ is the sigmoid activation that generates channel-wise attention and $\otimes$ denotes channel-wise multiplication. However, this fusion is effective only if either $\mathbf{F}_R$ or $\mathbf{F}_T$ has been activated sufficiently.
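For reference, a minimal sketch of the channel-weight extraction of Eq. (1) and the conventional channel-attention fusion described above; the MLP width (a reduction ratio of 16) and the summation of the two attended features are assumptions of this sketch, not the released code.

```python
import torch
import torch.nn as nn

class ChannelWeights(nn.Module):
    """Eq. (1): global average pooling followed by an MLP, producing one
    weight per channel; the same structure is used for both modalities."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, feat):              # feat: (B, C, H, W)
        pooled = feat.mean(dim=(2, 3))    # channel-wise global average pooling -> (B, C)
        return self.mlp(pooled)           # channel weights W -> (B, C)

def conventional_fusion(f_rgb, f_th, w_rgb, w_th):
    """Fusion used by many prior works: sigmoid channel attention applied to
    each modality independently, then combined (here by summation)."""
    a_rgb = torch.sigmoid(w_rgb)[..., None, None]   # (B, C, 1, 1)
    a_th = torch.sigmoid(w_th)[..., None, None]
    return a_rgb * f_rgb + a_th * f_th
```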
Table I: Four cases of the extracted channel attention ($W_R$ and $W_T$ treated as scalars).

Case | $\sigma(W_R)$ | $\sigma(W_T)$ | $\sigma(C\cdot W_R\odot W_T)$ | $1-\sigma(C\cdot W_R\odot W_T)$
---|---|---|---|---
$W_R>0,\ W_T>0$ | [0.5, 1) | [0.5, 1) | [0.5, 1) | (0, 0.5]
$W_R<0,\ W_T>0$ | (0, 0.5] | [0.5, 1) | (0, 0.5] | [0.5, 1)
$W_R>0,\ W_T<0$ | [0.5, 1) | (0, 0.5] | (0, 0.5] | [0.5, 1)
$W_R<0,\ W_T<0$ | (0, 0.5] | (0, 0.5] | [0.5, 1) | (0, 0.5]
Differing from existing approaches, we delve into feature fusion by explicitly considering the interaction of multi-modality features. We specify four cases for the extracted attention, as seen in Table I, where $W_R$ and $W_T$ are the weights extracted by Eq. (1). Noting that $W_R$ and $W_T$ are vectors, we treat them as scalars in Table I for simplicity. Positive weights denote higher attention, i.e., $\sigma(W)\in[0.5,1)$; similarly, negative weights denote lower attention, i.e., $\sigma(W)\in(0,0.5]$. For all of these cases, we apply attention enhancement to generate higher attention for both RGB and thermal features.
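The effect of using the channel count $C$ as a scaling factor can be checked numerically. The sketch below evaluates $\sigma(C\cdot W_R\odot W_T)$ and its complement for one representative (made-up) weight pair per case of Table I:

```python
import math

C = 64  # number of channels, used as the scaling factor
sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))

cases = {
    "W_R>0, W_T>0": (0.3, 0.4),     # both modalities activated
    "W_R<0, W_T>0": (-0.3, 0.4),    # only thermal activated
    "W_R>0, W_T<0": (0.3, -0.4),    # only RGB activated
    "W_R<0, W_T<0": (-0.3, -0.4),   # neither activated
}
for name, (w_r, w_t) in cases.items():
    s = sigmoid(C * w_r * w_t)
    print(f"{name}:  sigma(W_R)={sigmoid(w_r):.2f}  sigma(W_T)={sigmoid(w_t):.2f}  "
          f"sigma(C*W_R*W_T)={s:.2f}  1-sigma={1 - s:.2f}")
```

For the first and last cases the scaled sigmoid saturates towards 1, while for the two mixed cases it collapses towards 0 and the complement takes over, matching the last two columns of Table I.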
Specifically, we decompose the feature fusion into an Attention Interaction Branch (AIB) and an Attention Complement Branch (ACB), as shown in Fig. 3. The former handles the cases where both encoders, or neither of them, capture discriminative features, while the latter tackles the cases where only one encoder extracts useful features.
AIB takes an element-wise multiplication between $W_R$ and $W_T$ to generate correlated attention and then applies channel-wise multiplication to the RGB and thermal features. It is represented as:
$\mathbf{F}_R^{AI} = \sigma(C\cdot W_R\odot W_T)\otimes\mathbf{F}_R, \quad \mathbf{F}_T^{AI} = \sigma(C\cdot W_R\odot W_T)\otimes\mathbf{F}_T$ (2)
where $\odot$ denotes element-wise multiplication. $C$ is the number of channels and plays the role of a scaling factor for attention enhancement, such that $\sigma(C\cdot W_R\odot W_T)\approx 1$ when $W_R\odot W_T>0$ and $\sigma(C\cdot W_R\odot W_T)\approx 0$ when $W_R\odot W_T<0$.
Let $\mathbf{F}_U$ be the concatenation of $\mathbf{F}_R^{AI}$ and $\mathbf{F}_T^{AI}$; then, AIB further performs multi-modality interaction between $\mathbf{F}_R^{AI}$ and $\mathbf{F}_T^{AI}$ by a data interaction module:
$\hat{\mathbf{F}}_U = \sigma\Big(\mathcal{M}\big(\mathcal{G}_{max}(\mathcal{D}(\mathbf{F}_U))\big)\Big)\otimes\mathbf{F}_U$ (3)
where $\mathcal{M}$ and $\sigma$ denote the MLP and sigmoid operations, the same as in Eq. (1); $\mathcal{D}$ is a depth-wise convolution and $\mathcal{G}_{max}$ is the global max pooling operation. $\hat{\mathbf{F}}_U$ is the feature output by the data interaction module, and it is further split into RGB features $\hat{\mathbf{F}}_R^{AI}$ and thermal features $\hat{\mathbf{F}}_T^{AI}$, respectively.
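A possible PyTorch sketch of the attention interaction branch, following Eqs. (2)-(3); the reduction ratio, the depth-wise kernel size, and the exact placement of the sigmoid gate are assumptions of this sketch rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionInteractionBranch(nn.Module):
    """Correlated channel attention (Eq. 2) followed by a data interaction
    module built from a depth-wise convolution, global max pooling, an MLP
    and a sigmoid gate (Eq. 3)."""

    def __init__(self, channels, reduction=16, dw_kernel=3):
        super().__init__()
        self.dwconv = nn.Conv2d(2 * channels, 2 * channels, dw_kernel,
                                padding=dw_kernel // 2, groups=2 * channels)
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
        )

    def forward(self, f_rgb, f_th, w_rgb, w_th):
        b, c, _, _ = f_rgb.shape
        # Eq. (2): correlated attention, scaled by the channel count C
        att = torch.sigmoid(c * w_rgb * w_th)[..., None, None]   # (B, C, 1, 1)
        f_u = torch.cat([att * f_rgb, att * f_th], dim=1)        # (B, 2C, H, W)
        # Eq. (3): data interaction module
        gate = torch.sigmoid(self.mlp(self.dwconv(f_u).amax(dim=(2, 3))))
        f_u = gate[..., None, None] * f_u
        return torch.split(f_u, c, dim=1)                        # RGB half, thermal half
```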
For the cases where only one modality provides sufficiently discriminative features, i.e., $W_R>0,\ W_T<0$ or $W_R<0,\ W_T>0$, $\sigma(C\cdot W_R\odot W_T)$ tends to be small. Thus, we use the attention complement branch, which applies the enhancement by:
$\mathbf{F}_R^{AC} = \big(1-\sigma(C\cdot W_R\odot W_T)\big)\otimes\mathbf{F}_R, \quad \mathbf{F}_T^{AC} = \big(1-\sigma(C\cdot W_R\odot W_T)\big)\otimes\mathbf{F}_T$ (4)
Then, the enhanced RGB and thermal features are obtained by aggregating outputs from AIB and ACB:
$\hat{\mathbf{F}}_R = \hat{\mathbf{F}}_R^{AI} + \mathbf{F}_R^{AC}, \quad \hat{\mathbf{F}}_T = \hat{\mathbf{F}}_T^{AI} + \mathbf{F}_T^{AC}$ (5)
We empirically found that the interaction module contributes little to the model's performance when added to the ACB. Therefore, we do not apply the interaction module in ACB, which reduces the model complexity.
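The complement branch and the aggregation of Eqs. (4)-(5) then reduce to a few lines (again a sketch under the same notation assumptions):

```python
import torch

def attention_complement_branch(f_rgb, f_th, w_rgb, w_th):
    """Eq. (4): weight each modality by the complement of the correlated
    attention so that single-modality cues are emphasised."""
    c = f_rgb.shape[1]
    comp = 1.0 - torch.sigmoid(c * w_rgb * w_th)[..., None, None]
    return comp * f_rgb, comp * f_th

def aggregate(aib_rgb, aib_th, acb_rgb, acb_th):
    """Eq. (5): sum the two branches per modality to get the enhanced features."""
    return aib_rgb + acb_rgb, aib_th + acb_th
```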
Finally, we apply a spatial attention mechanism, implemented with a $1\times1$ convolutional layer, to further merge the enhanced RGB and thermal features. Formally, the final merged features are obtained by:
$\mathbf{F}_O = \sigma\big(\mathcal{C}_{1\times1}(\mathbf{F}_{cat})\big)\otimes\mathbf{F}_{cat}$ (6)
where $\mathbf{F}_{cat}$ denotes the concatenation of $\hat{\mathbf{F}}_R$ and $\hat{\mathbf{F}}_T$, and $\mathcal{C}_{1\times1}$ is the $1\times1$ convolutional layer. $\mathbf{F}_O$ is the fused feature output by the EAEF module, which is sent to the RGB encoder and the thermal encoder, respectively.
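A sketch of the spatial merge in Eq. (6), where a $1\times1$ convolution over the concatenated features produces a single-channel spatial attention map; this particular gating form follows the reconstruction above and is an assumption.

```python
import torch
import torch.nn as nn

class SpatialMerge(nn.Module):
    """Eq. (6): a 1x1 convolution over the concatenated enhanced features
    produces a single-channel spatial attention map that gates them."""

    def __init__(self, channels):
        super().__init__()
        self.conv1x1 = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, f_rgb_hat, f_th_hat):
        f_cat = torch.cat([f_rgb_hat, f_th_hat], dim=1)   # (B, 2C, H, W)
        att = torch.sigmoid(self.conv1x1(f_cat))          # (B, 1, H, W)
        return att * f_cat                                # fused output F_O
```

In practice a further projection back to the per-stream channel count may be needed before the fused map re-enters the two encoders, as suggested by Fig. 2.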
IV Experimental Results
Table II: Comparative results (%) on the MFNet dataset; per-class Acc and IoU are reported.

Method | Backbone | Params.(M) | Car Acc | Car IoU | Person Acc | Person IoU | Bike Acc | Bike IoU | Curve Acc | Curve IoU | Car stop Acc | Car stop IoU | Guardrail Acc | Guardrail IoU | Color Cone Acc | Color Cone IoU | Bump Acc | Bump IoU | mAcc | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
MFNet[MFNet] | - | 8.4 | 77.2 | 65.9 | 67.0 | 58.9 | 53.9 | 42.9 | 36.2 | 29.9 | 19.1 | 9.9 | 0.1 | 8.5 | 30.3 | 25.2 | 30.0 | 27.7 | 45.1 | 39.7 |
FuseNet[Fusenet] | VGG16 | 284.0 | 81.0 | 75.6 | 75.2 | 66.3 | 64.5 | 51.9 | 51.0 | 37.8 | 17.4 | 15.0 | 0.0 | 0.0 | 31.1 | 21.4 | 51.9 | 45.0 | 52.4 | 45.6 |
RTFNet-50[RTFNet] | ResNet50 | 245.7 | 91.3 | 86.3 | 78.2 | 67.8 | 71.5 | 58.2 | 59.8 | 43.7 | 32.1 | 24.3 | 13.4 | 3.6 | 40.4 | 26.0 | 73.5 | 57.2 | 62.2 | 51.7 |
RTFNet-152[RTFNet] | ResNet152 | 337.1 | 91.3 | 87.4 | 79.3 | 70.3 | 76.8 | 62.7 | 60.7 | 45.3 | 38.5 | 29.8 | 0.0 | 0.0 | 45.5 | 29.1 | 74.7 | 55.7 | 63.1 | 53.2 |
PSTNet[pst900] | ResNet18 | 105.8 | - | 76.8 | - | 52.6 | - | 55.3 | - | 29.6 | - | 25.1 | - | 15.1 | - | 39.4 | - | 45.0 | - | 48.4 |
FuseSeg-161[FuseSeg] | DenseNet161 | - | 93.1 | 87.9 | 81.4 | 71.7 | 78.5 | 64.6 | 68.4 | 44.8 | 29.1 | 22.7 | 63.7 | 6.4 | 55.8 | 46.9 | 66.4 | 49.7 | 70.6 | 54.5 |
ABMDRNet[ABMDRNet] | ResNet50 | - | 94.3 | 84.8 | 90.0 | 69.6 | 75.7 | 60.3 | 64.0 | 45.1 | 44.1 | 33.1 | 31.0 | 5.1 | 61.7 | 47.4 | 66.2 | 50.0 | 69.5 | 54.8 |
FEANet[FEANet] | ResNet152 | 337.1 | 93.3 | 87.8 | 82.7 | 71.1 | 76.7 | 61.1 | 65.5 | 46.5 | 26.6 | 22.1 | 70.8 | 6.6 | 66.6 | 55.3 | 77.3 | 48.9 | 73.2 | 55.3 |
GMNet[GMNet] | ResNet50 | 149.5 | 94.1 | 86.5 | 83.0 | 73.1 | 76.9 | 61.7 | 59.7 | 44.0 | 55.0 | 42.3 | 71.2 | 14.5 | 54.7 | 48.7 | 73.1 | 47.4 | 74.1 | 57.3 |
EGFNet[EGFNet] | ResNet152 | 201.3 | 95.8 | 87.6 | 89.0 | 69.8 | 80.6 | 58.8 | 71.5 | 42.8 | 48.7 | 33.8 | 33.6 | 7.0 | 65.3 | 48.3 | 71.1 | 47.1 | 72.7 | 54.8 |
MFTNet[MFTNet] | ResNet152 | 360.9 | 95.1 | 87.9 | 85.2 | 66.8 | 83.9 | 64.4 | 64.3 | 47.1 | 50.8 | 36.1 | 45.9 | 8.4 | 62.8 | 55.5 | 73.8 | 62.2 | 74.7 | 57.3 |
Ours | ResNet50 | 109.1 | 93.9 | 86.8 | 84.6 | 71.8 | 80.4 | 62.0 | 66.8 | 49.7 | 43.5 | 29.7 | 58.5 | 7.1 | 61.8 | 50.9 | 70.9 | 46.7 | 73.2 | 55.9 |
Ours | ResNet152 | 200.4 | 95.4 | 87.6 | 85.2 | 72.6 | 79.9 | 63.8 | 70.6 | 48.6 | 47.9 | 35.0 | 62.8 | 14.2 | 62.7 | 52.4 | 71.9 | 58.3 | 75.1 | 58.9 |
Table III: Comparative results (%) on the PST900 dataset; per-class Acc and IoU are reported.

Method | Background Acc | Background IoU | Hand-Drill Acc | Hand-Drill IoU | Backpack Acc | Backpack IoU | Fire-Extinguisher Acc | Fire-Extinguisher IoU | Survivor Acc | Survivor IoU | mAcc | mIoU
---|---|---|---|---|---|---|---|---|---|---|---|---
Efficient FCN (3C)[EfficientFCN] | 99.81 | 98.63 | 32.08 | 30.12 | 60.06 | 58.15 | 78.87 | 39.96 | 32.76 | 28.00 | 60.72 | 50.98 |
Efficient FCN (4C)[EfficientFCN] | 99.80 | 98.85 | 48.75 | 38.58 | 69.96 | 67.59 | 76.45 | 46.28 | 38.86 | 35.06 | 66.75 | 57.27 |
CCNet (3C)[CCNet] | 99.86 | 99.05 | 51.77 | 32.27 | 68.30 | 66.42 | 67.79 | 51.84 | 60.84 | 57.50 | 69.71 | 61.42 |
CCNet (4C)[CCNet] | 99.59 | 97.74 | 54.09 | 51.01 | 75.96 | 72.95 | 88.06 | 73.80 | 49.45 | 33.52 | 73.43 | 66.00 |
ACNet[Acnet] | 99.83 | 99.25 | 53.59 | 51.46 | 85.56 | 83.19 | 84.88 | 59.95 | 69.10 | 65.19 | 78.67 | 71.81 |
SA-Gate[SAGate] | 99.74 | 99.25 | 89.88 | 81.01 | 89.03 | 79.77 | 80.70 | 72.97 | 64.19 | 62.22 | 84.71 | 79.05 |
RTFNet[RTFNet] | 99.78 | 99.25 | 7.79 | 7.07 | 79.96 | 74.17 | 62.39 | 51.93 | 78.51 | 70.11 | 65.69 | 60.46 |
PSTNet[pst900] | / | 98.85 | / | 53.60 | / | 69.20 | / | 70.12 | / | 50.03 | / | 68.36 |
GMNet[GMNet] | 99.81 | 99.44 | 90.29 | 85.17 | 89.01 | 83.82 | 88.28 | 73.79 | 80.65 | 78.36 | 89.61 | 84.12 |
Ours | 99.83 | 99.52 | 92.24 | 80.41 | 91.14 | 87.69 | 93.25 | 83.96 | 80.63 | 76.22 | 91.42 | 85.56 |
IV-A Semantic Segmentation
IV-A1 Datasets
MFNet dataset [MFNet] is the most popular benchmark for RGB-T based semantic segmentation. It covers nine semantic categories in urban street scenes, including one unlabeled background category and eight hand-labeled object categories. The dataset contains 1569 pairs of RGB and thermal images with a resolution of 640×480. Following RTFNet [RTFNet], we use 784 pairs for training, 392 pairs for validation, and the remaining 393 pairs for testing.
PST900 dataset [pst900] is also a popular benchmark for RGB-T based semantic segmentation. It contains five semantic categories and 894 RGB-T image pairs with a resolution of 1280×720. Among them, 597 pairs are used for training, and the remaining 297 pairs are used for testing.
IV-A2 Implementation Details and Evaluation Metrics
We use the stochastic gradient descent (SGD) [SGD] optimizer for training. The initial learning rate is set to 0.02; momentum and weight decay are set to 0.9 and 0.0005, respectively. The batch size is set to 5, and we apply an ExponentialLR schedule to gradually decrease the learning rate. The loss function has a DiceLoss [Dice] term and a SoftCrossEntropy [soft] term, each weighted by a scalar of 0.5. For the MFNet dataset, we train the model for 100 epochs and use the best model on the validation set for evaluation. For the PST900 dataset, we train the model for 60 epochs.
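For reference, a condensed sketch of this optimization setup; the Dice and soft cross-entropy terms below are generic stand-ins for the cited losses, and the label-smoothing factor and ExponentialLR decay rate are assumptions since they are not specified above.

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(logits, target, eps=1.0):
    """Minimal multi-class soft Dice loss; target holds class indices (long)."""
    num_classes = logits.shape[1]
    probs = logits.softmax(dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    inter = (probs * one_hot).sum(dim=(0, 2, 3))
    union = probs.sum(dim=(0, 2, 3)) + one_hot.sum(dim=(0, 2, 3))
    return 1.0 - ((2 * inter + eps) / (union + eps)).mean()

def segmentation_loss(logits, target):
    # 0.5 * Dice + 0.5 * soft (label-smoothed) cross-entropy
    ce = F.cross_entropy(logits, target, label_smoothing=0.1)  # smoothing value assumed
    return 0.5 * soft_dice_loss(logits, target) + 0.5 * ce

model = DualEncoderSegNet()  # the framework sketch from Sec. III-A (or the released model)
optimizer = torch.optim.SGD(model.parameters(), lr=0.02, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)  # gamma assumed
```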
Following previous works [RTFNet], we use two measures for quantifying results: Accuracy (Acc) and Intersection over Union (IoU). mAcc and mIoU are their averages over all categories.
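Both measures can be computed from a per-class confusion matrix accumulated over the test set, as in the short sketch below (Acc is the per-class recall).

```python
import numpy as np

def segmentation_scores(conf):
    """conf[i, j] counts pixels of ground-truth class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    acc = tp / np.maximum(conf.sum(axis=1), 1)                         # per-class Acc (recall)
    iou = tp / np.maximum(conf.sum(axis=1) + conf.sum(axis=0) - tp, 1)
    return acc.mean(), iou.mean()                                      # mAcc, mIoU
```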
IV-A3 Results
Results on MFNet
We first conduct quantitative comparisons between the proposed method and existing approaches, including MFNet [MFNet], FuseNet [Fusenet], RTFNet-152 [RTFNet], FuseSeg-161 [FuseSeg], FEANet [FEANet], GMNet [GMNet], MFTNet [MFTNet], PSTNet [pst900], RTFNet-50 [RTFNet], ABMDRNet [ABMDRNet], and EGFNet [EGFNet]. Since the model complexity differs among existing methods, we implement our method on two backbones, a larger ResNet-152 and a smaller ResNet-50, for fair comparison.
Table II shows the quantitative results. It is clear that our method achieves the best mean accuracy. With a similarly small model complexity, our method beats PSTNet significantly. Besides, our method built on ResNet-152 achieves superior performance for most categories, e.g., the second-best IoU on "Person", "Bike", "Curve", and "Bump". Most importantly, the proposed method gains improvements of 0.4% in mAcc and 1.6% in mIoU over the current state-of-the-art MFTNet. These quantitative results verify that our method extracts better complementary cross-modality features.
Figure 4 exhibits qualitative results under different lighting conditions. In general, our method has the following advantages. First, it demonstrates better results than existing approaches in both daytime and nighttime conditions, with slightly better performance on daytime images and clearly superior results on nighttime images. Second, it captures tiny objects in both RGB and thermal images more effectively, such as the pedestrian in the 3rd column and the bump on the road in the 5th column. These advantages validate the effectiveness of our multi-modality feature fusion strategy.
[Figure: qualitative semantic segmentation results; rows show RGB, Thermal, Ground truth, and Ours.]
Results on PST900
We then conduct experiments on PST900 dataset. We compare our method with Efficient FCN [EfficientFCN], CCNet [CCNet], ACNet [Acnet], SA-Gate [SAGate], RTFNet [RTFNet], PSTNet [pst900], and GMNet [GMNet].
The quantitative results are given in Table III. Our method clearly achieves the best overall results, with 91.42 in mAcc and 85.56 in mIoU, outperforming the state-of-the-art GMNet [GMNet] by 1.81% in mAcc and 1.44% in mIoU. We also visualize several predicted semantic maps in Fig. 5.
IV-B Object Detection
IV-B1 Dataset
M3FD dataset [m3fd] contains a set of autonomous-driving scenarios. It has 4200 pairs of RGB-T images with 33603 annotated labels across six classes: "People", "Car", "Bus", "Motorcycle", "Truck", and "Lamp". Moreover, the dataset is split into "Daytime", "Overcast", "Night", and "Challenge" scenarios according to the characteristics of the environments.
IV-B2 Implementation Details and Evaluation Metric
We build a network for object detection by integrating EAEF into YOLOv5 [mmyolo]. We use the stochastic gradient descent (SGD) [SGD] optimizer for training. The initial learning rate is set to 0.01; momentum and weight decay are set to 0.9 and 0.0005, respectively. The batch size is set to 32, and we apply an ExponentialLR schedule to gradually decrease the learning rate. The loss function has an IoULoss [Iouloss] term and a CrossEntropy [celoss] term, weighted by scalars of 0.3 and 0.7, respectively. For evaluation, we adopt the mAP@0.5 metric, following TarDAL [m3fd].
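A sketch of the weighted loss combination; the IoU term below is a generic axis-aligned box IoU loss rather than YOLOv5's exact objective, and matching of predictions to ground truth is assumed to be done beforehand.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred_boxes, gt_boxes, eps=1e-7):
    """1 - IoU for matched box pairs in (x1, y1, x2, y2) format."""
    x1 = torch.max(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = torch.max(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = torch.min(pred_boxes[:, 2], gt_boxes[:, 2])
    y2 = torch.min(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_p + area_g - inter + eps)
    return (1.0 - iou).mean()

def detection_loss(pred_boxes, gt_boxes, cls_logits, cls_targets):
    # weights of 0.3 (IoU) and 0.7 (cross-entropy), as stated above
    return 0.3 * iou_loss(pred_boxes, gt_boxes) + 0.7 * F.cross_entropy(cls_logits, cls_targets)
```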
Table IV: Quantitative comparisons (mAP@0.5) on the M3FD object detection dataset.

Method | Day | Overcast | Night | Challenge | mAP@0.5
---|---|---|---|---|---
Only RGB | 0.759 | 0.729 | 0.863 | 0.815 | 0.772 |
Only Thermal | 0.717 | 0.727 | 0.852 | 0.991 | 0.753 |
TarDAL[m3fd] | 0.745 | 0.741 | 0.893 | 0.983 | 0.778 |
Ours | 0.783 | 0.786 | 0.895 | 0.979 | 0.801 |
[Fig. 6: qualitative detection results on the M3FD dataset; rows show RGB, Thermal, Ground truth, TarDAL, and Ours.]
IV-B3 Results
To evaluate the effectiveness of the proposed method on object detection, we perform experiments on the M3FD object detection dataset. We compare our method against approaches using only RGB images, only thermal images, and TarDAL [m3fd], the current state of the art.
The experimental results are shown in Table IV. The method using only thermal data shows the worst overall performance, although it attains better accuracy than the RGB-only baseline in the "Challenge" scenarios. Both TarDAL and our method obtain better accuracy than the single-modality baselines. Moreover, our method outperforms the other three approaches by a clear margin, reaching 0.801 mAP@0.5 and surpassing TarDAL by 2.3%. Figure 6 shows the qualitative results. Although TarDAL also correctly identifies the locations of objects, our method attains higher recognition accuracy.
IV-C Crowd Counting
IV-C1 Dataset
RGBT-CC dataset [iadm] has 2,030 RGB-T pairs captured in public scenarios. A total of 138,389 pedestrians are marked with point annotations, approximately 68 people per image. The training, validation, and test sets have 1545, 300, and 1200 RGB-T pairs, respectively.
IV-C2 Implementation Details and Evaluation Metric
We adopt the same training strategy as IADM [iadm]. We use the Adam optimizer with a learning rate of 0.00001. We train for 300 epochs and evaluate the model every 10 epochs; the best model on the validation set is used for the final evaluation. For evaluation metrics, we use the root mean square error (RMSE) and the grid average mean absolute error (GAME) [GAME].
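GAME(l) divides each image into 4^l non-overlapping regions and sums the absolute counting error over the regions, so GAME(0) reduces to the mean absolute error; below is a small sketch of both metrics on predicted and ground-truth density maps.

```python
import torch

def game(pred_density, gt_density, level):
    """Grid Average Mean absolute Error for a batch of density maps (B, H, W)."""
    n = 2 ** level                       # an n x n grid, i.e. 4**level regions
    _, h, w = pred_density.shape
    err = 0.0
    for i in range(n):
        for j in range(n):
            sl = (slice(None), slice(i * h // n, (i + 1) * h // n),
                  slice(j * w // n, (j + 1) * w // n))
            err = err + (pred_density[sl].sum(dim=(1, 2)) -
                         gt_density[sl].sum(dim=(1, 2))).abs()
    return err.mean().item()             # averaged over the batch

def rmse(pred_density, gt_density):
    diff = pred_density.sum(dim=(1, 2)) - gt_density.sum(dim=(1, 2))
    return diff.pow(2).mean().sqrt().item()
```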
Table V: Quantitative comparisons on the RGBT-CC dataset.

Method | GAME(0) | GAME(1) | GAME(2) | GAME(3) | RMSE
---|---|---|---|---|---
UCNet[UCNet] | 33.96 | 42.42 | 53.06 | 65.07 | 56.31
HDFNet[HDFNet] | 22.36 | 27.79 | 33.68 | 42.48 | 33.93 |
BBSNet[BBSNet] | 19.56 | 25.07 | 31.25 | 39.24 | 32.48 |
MVMS[MVMS] | 19.97 | 25.1 | 31.02 | 38.91 | 33.97 |
IADM[iadm] | 15.61 | 19.95 | 24.69 | 32.89 | 28.18 |
Ours | 14.85 | 19.24 | 24.10 | 32.57 | 26.99 |
[Figure: qualitative crowd counting results on the RGBT-CC dataset; rows show RGB, Thermal, Ground truth, IADM, and Ours.]
IV-C3 Results
To evaluate the performance of our method on the crowd counting task, we provide quantitative comparisons against previous approaches, including UCNet [UCNet], HDFNet [HDFNet], BBSNet [BBSNet], MVMS [MVMS], and the current state-of-the-art IADM [iadm].
The results are shown in Table V. Our method achieves the best performance on all metrics, outperforming UCNet, HDFNet, BBSNet, and MVMS by 5.49 to 29.32 in RMSE. Moreover, it surpasses the state-of-the-art IADM by 1.19 in RMSE. These results further verify the effectiveness of our method on crowd counting.
Table VI: Ablation study on the MFNet dataset.

Baseline | AIB | ACB | mAcc | mIoU
---|---|---|---|---
✓ | | | 71.7 | 56.5
✓ | ✓ | | 72.5 | 57.1
✓ | | ✓ | 74.3 | 57.7
✓ | ✓ | ✓ | 75.1 | 58.9
IV-D Ablation Study
We analyze the effectiveness of each component of our EAEF through additional experiments on the MFNet dataset. We establish a baseline by removing the AIB and ACB from the EAEF. The results are shown in Table VI. Both AIB and ACB improve the performance of the baseline, and their combination, i.e., the full EAEF, achieves the best performance.
V CONCLUSIONS
In this paper, we studied how to better fuse RGB images and thermal data for perception tasks. We explicitly specified the cases where i) both RGB and thermal data, ii) only one type of data, and iii) neither of them can provide sufficiently useful features. We proposed Explicit Attention-Enhanced Fusion (EAEF), which enhances feature extraction and compensates for insufficient representations. We evaluated our method on three different perception tasks: semantic segmentation, object detection, and crowd counting. It achieved state-of-the-art performance on all three tasks, providing the robotics community with a better fusion approach for RGB-thermal based perception.