
Equally corresponding authors.

1] School of Computer Science and Artificial Intelligence, Changzhou University, Changzhou 213000, Jiangsu, China

2] Department of Computer Science, University of Sheffield, Sheffield S10 2TN, U.K.

Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding

Yi Liu [email protected]    Chengxin Li [email protected]    Shoukun Xu [email protected]    Jungong Han [email protected]
Abstract

Multi-modal fusion plays a vital role in multi-modal scene understanding. Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion, which is essential for real-world applications like autonomous driving, where visible, depth, event, LiDAR, and other sensors are used. Besides, the few existing attempts at multi-modal fusion, e.g., simple concatenation, cross-modal attention, and token selection, cannot adequately capture the intrinsic shared and specific details of multiple modalities. To tackle this challenge, in this paper, we propose a Part-Whole Relational Fusion (PWRF) framework. For the first time, this framework treats multi-modal fusion as part-whole relational fusion. It routes multiple individual part-level modalities to a fused whole-level modality using the part-whole relational routing ability of Capsule Networks (CapsNets). Through this part-whole routing, our PWRF generates modal-shared and modal-specific semantics from the whole-level modal capsules and the routing coefficients, respectively. These modal-shared and modal-specific details are then employed for multi-modal scene understanding, including synthetic multi-modal semantic segmentation and visible-depth-thermal salient object detection in this paper. Experiments on several datasets demonstrate the superiority of the proposed PWRF framework for multi-modal scene understanding. The source code has been released on https://github.com/liuyi1989/PWRF.

keywords:
Multi-modal fusion, Part-whole relational fusion, Capsule network, Synthetic multi-modal semantic segmentation, VDT salient object detection

1 Introduction

Due to the limited perception ability of a single sensor, multiple sensors are deployed to capture different fields of perception, e.g., depth, thermal, and LiDAR [1, 2, 3, 4, 5]. Naturally, multi-modal fusion plays a fundamental role in multi-sensor scene understanding, with applications ranging from synthetic autonomous driving perception [6, 2] to unmanned aerial vehicles [7].

Previous multi-modal fusion methods mostly focus on cross-modal combinations [8, 9, 10], which aim to find complementary details in different modalities to enhance individual representations. Although these methods have advanced the progress of multi-modal fusion, they largely concentrate on specific sensor pairs and lag behind the current trend of fusing multiple modalities [11]. Meanwhile, the noise and misalignment of multi-modal sensors pose significant challenges for fusing more modalities, such as triple-modal fusion, which is crucial for many autonomous driving applications.

Figure 1: Comparison of different multi-modal fusion methods. (a) Multi-modal fusion via concatenation. (b) Multi-modal fusion through parallel cross-attention to attend the primary modality. (c) Multi-modal fusion via selection mechanism. (d) Our multi-modal fusion via part-whole relational routing to generate modal-shared and modal-specific details.

Within the scope of multi-modal fusion (in this paper, multi-modal fusion refers to triple-modal fusion rather than cross-modal fusion), a few approaches address this field, as shown in Fig. 1. For example, Fig. 1(a) explores triple-modal fusion using simple concatenation [12]. Fig. 1(b) uses a parallel cross-attention mechanism to attend to the primary modality [13]. Fig. 1(c) selects the most informative modal token from the three modalities as the fusion result [11, 14]. To sum up, existing multi-modal fusion approaches mostly combine the important details from multiple modalities using attention [13, 11] or select the most important modality for each patch using a maximum criterion [14]. Despite some progress, these methods still face challenges for arbitrary-modal fusion. First, attention-based fusion methods [13, 11] focus on identifying important details rather than capturing intrinsic knowledge among multiple modalities, including shared and specific knowledge, thus lacking intrinsic fusion effectiveness. Second, some methods discard many supplementary modalities, e.g., the informative modal selection method [14], which can cause performance degradation in scenarios requiring the fusion of more modalities.

To tackle this issue, we explore an alternative fusion strategy, called Part-Whole Relational Fusion (PWRF), which treats the relationships between the individual modalities and the fused modality as relationships between part-level modalities and the whole-level modality. In this sense, multi-modal fusion reduces to routing part-level modalities to the whole-level modality, which can be achieved through the part-whole relational routing of Capsule Networks (CapsNets) [15]. To this end, considering the heavy computation of CapsNets, we opt for a lightweight version [16], named Disentangled Capsule Routing (DCR), to route the part-level modalities to the whole-level modality. Concretely, DCR begins by disentangling part-level modal capsules from each single modality along the horizontal and vertical dimensions, which are fed into a capsule routing mechanism to generate horizontal and vertical whole-level modal capsules. These orthogonal capsules are then entangled to obtain the whole-level modal capsules. Thanks to the primitive fusion of PWRF in Fig. 1(d), modal-shared and modal-specific semantics, two vital kinds of fusion semantics in multi-modal fusion [17], are both included. To be concrete, modal-shared semantics are represented by the whole-level modal capsules, generated by exploring common properties across different individual modalities. Modal-specific semantics for each modality are computed via the routing coefficients from each part-level modality to the whole-level modality, reflecting the associations between each single modality and the fused version.

To explore the potential of the proposed PWRF for multi-modal scene understanding, we select two fundamental tasks to validate its superiority: Synthetic Multi-Modal (SMM) semantic segmentation [14] and Visible-Depth-Thermal (VDT) salient object detection [12]. These two tasks are chosen due to their complementary focus areas in multi-modal scene understanding. Specifically, SMM semantic segmentation is concerned with segmenting various semantic classes in multi-modal input, which includes visible, depth, event, and LiDAR data, thereby enhancing detailed scene understanding by leveraging the strengths of each modality for comprehensive semantic representation. In contrast, VDT salient object detection is a crucial task in autonomous driving and robotic navigation, where the primary focus is to identify the most salient objects in challenging multi-modal settings involving visible, depth, and thermal data. This task is essential for applications that prioritize real-time decision-making, especially under adverse or uncertain conditions where identifying key objects becomes critical for effective scene understanding. Together, the two tasks provide a holistic evaluation of the proposed PWRF framework: SMM segmentation assesses the capability of the framework to understand and represent detailed semantic aspects of the scene, while VDT saliency detection focuses on the ability to extract and prioritize critical information. The integration of both tasks showcases the versatility of our approach in tackling different facets of multi-modal scene understanding, thus demonstrating its generalizability and robustness across varied application scenarios. Experiments on SMM semantic segmentation [14] and VDT salient object detection [12] datasets prove the superiority of the proposed PWRF framework for multi-modal scene understanding.

Contributions of this paper are described as follows:

(i) We propose a PWRF framework for multi-modal scene understanding, which, to the best of our knowledge, is the first to treat multi-modal fusion as the part-whole relational fusion. Under this framework, we can obtain modal-shared and modal-specific details using the whole-level modality and routing coefficients, respectively, which can be further employed to enhance multi-modal scene understanding.

(ii) To apply PWRF to SMM semantic segmentation, we design a shared-specific integration module that fuses modal-shared and modal-specific details to detect semantic objects in synthetic multi-modal scenarios.

(iii) To configure PWRF for VDT salient object detection, we design a stacking adjacent-scale attention decoder to integrate modal-shared and modal-specific details, which is experimentally shown to be superior in detecting salient objects in the visible-depth-thermal environment.

The paper is organized as follows. Sec. 2 reviews the works related to the proposed method. Sec. 3 details the proposed PWRF framework. Sec. 4 illustrates the architectures for SMM semantic segmentation and VDT salient object detection using PWRF. Sec. 5 presents extensive experiments and analyses of the proposed method. Sec. 7 concludes the paper.

2 Related work

In this section, we review the works related to our method, including multi-modal semantic segmentation, multi-modal salient object detection, and CapsNets.

2.1 Multi-modal semantic segmentation

Semantic segmentation has been a fundamental task in the computer vision community since fully convolutional networks [18] revolutionized its development. Unlike RGB-modal semantic segmentation [19, 20, 21], which relies on the RGB modality and may suffer from sensor limitations, multi-modal semantic segmentation enhances scene perception by incorporating multiple modalities, e.g., depth, events, and LiDAR. For example, Hazirbas et al. [22] leveraged the rich color and texture from the RGB modality, and the geometric and structural information from the depth modality, for semantic segmentation. Wang et al. [23] proposed to extract common and modality-specific features from RGB and depth images to improve semantic segmentation accuracy for indoor scenes. Wang et al. [11] proposed a multi-modal token fusion method that substitutes uninformative tokens with important ones. Wang et al. [24] dynamically exchanged channels between sub-networks of different modalities for multi-modal fusion. Zhao et al. [25] presented a coarse-to-fine fusion mechanism for LiDAR-camera semantic segmentation by leveraging low-level contextual information and designing an offset correction. Liu et al. [26] fused the camera and LiDAR modalities in a bi-directional manner. Liang et al. [27] derived a region-guided filter to select informative combinations of multiple modality classes. Zhang et al. [28] utilized a cross-modal feature correction module to enhance complementary information and a feature fusion module to achieve full fusion and long-distance context exchange. Based on this, they subsequently presented a cross-modal segmentation model [14] by fusing the primary modality with the selected informative auxiliary modalities.

Most of the previous methods focus on fusing complementary cues [11] or selecting one informative modality while discarding the others, which can lead to performance degradation due to a lack of intrinsic integration. In contrast, we treat multi-modal fusion as part-whole relational fusion that routes part-level modalities to the whole-level modality, thus enabling primitive fusion and generating shared and specific details.

2.2 Multi-modal salient object detection

Unlike RGB salient object detection [29, 30, 31, 32, 33], which often struggles in challenging scenarios such as cluttered backgrounds and high similarity between salient objects and their surroundings, multi-modal learning can enhance saliency understanding under these difficult conditions. The most popular multi-modal salient object detection settings are RGB-D and RGB-T saliency, which detect salient objects from RGB & depth and RGB & thermal data, respectively, via cross-modal fusion. For example, Wu et al. [34] proposed a multi-level and multi-scale fusion scheme to fuse RGB-depth features for saliency prediction. Li et al. [35] designed a boundary-aware fusion framework for RGB-D salient object detection. Xie et al. [36] developed a double bi-directional interaction network for RGB-D saliency detection. Zhang et al. [37] utilized saliency prototypes of the primary modality to enhance the semantics of the auxiliary modality, followed by dynamically allocating weights for the auxiliary modality during the fusion stage. Different from RGB-D and RGB-T salient object detection, VDT salient object detection tackles salient object detection in an environment of visible, depth, and thermal images. To solve this challenge, Song et al. [38] utilized an attention mechanism to fuse the primary modality and the auxiliary modalities in parallel for triple-modal salient object detection, highlighting an innovative method for complementary aggregation of triple-modal information. Wan et al. [12] proposed a triple-modal fusion encoder and a progressive feature enhancement decoder for VDT salient object detection. They further designed a triple-modal interaction encoder and a multi-scale fusion decoder for VDT salient object detection [39]. Bao et al. [40] developed a quality-aware selective fusion network for VDT saliency detection.

Different from existing VDT salient object detection methods, which cannot explore modal-shared and modal-specific semantics, our PWRF framework explores the associations among the three modalities to find modal-shared and modal-specific semantics for saliency prediction.

2.3 CapsNets

Unlike convolutional neural networks, which aim to capture discriminative features, CapsNets target capturing part-whole relations to find the targets. The classical CapsNets, i.e., vector CapsNets [41] and matrix CapsNets [15], are known for their large-scale parameters and heavy computation. A lot of effort has been devoted to achieving lightweight CapsNets. For example, a prediction tuning framework was proposed to allow a deep architecture [42]. Shi et al. [43] utilized sparse optimization to compress CapsNets by reducing unnecessary weight parameters and computational cost. In our previous work, a residual pose routing [44] and a disentangled-entangled routing mechanism [16] were proposed to speed up CapsNets. Beyond CapsNets architectures, they have been introduced to a wide range of applications. Jampour et al. [45] introduced a new regularization term into CapsNets to improve generalization for signature identification. Our previous works employed CapsNets for part-whole relational visual saliency [32] and visual camouflage [46]. Wu et al. [47] fed capsule features from multiple modalities into a long short-term memory for motion recognition.

In this paper, we apply CapsNets to multi-modal scene understanding through the proposed PWRF framework. Different from [47], which utilized CapsNets to extract modal capsule features rather than for multi-modal fusion, we employ the part-whole relational routing ability of CapsNets in the multi-modal fusion stage.

3 Proposed Part-Whole Relational Fusion

In this section, we describe the details of PWRF, which treats multi-modal fusion as routing multiple individual part-level modalities to the fused whole-level modality. The PWRF framework consists of two core components: part-whole modality routing, and modal-shared and modal-specific details generation. We take triple modalities as the illustrative case in the following.

3.1 Part-whole modality routing

To achieve primitive fusion of multiple modalities, we take each single modality and the fused modality as a part-level modality and the whole-level modality, respectively, which turns multi-modal fusion into the problem of routing multiple part-level modalities to their whole-level modality. Leveraging the part-whole relational routing ability of CapsNets [15], we implement the part-whole relational fusion. However, considering the heavy computation demands of the original CapsNets [15], we use our previous lightweight capsule routing version, the Disentangled Capsule Routing (DCR) algorithm [16], as an alternative. In the following, we detail the PWRF framework using DCR to route part-level modalities to the whole-level modality, which is composed of part-level modality capsules disentanglement and part-level to whole-level capsules routing.

3.1.1 Part-level modality capsules disentanglement

Part-level modality capsules disentanglement aims to disentangle the full-resolution capsule features of each single modality into two orthogonal versions, i.e., horizontal and vertical capsules, which transforms two-dimensional capsule routing into a one-dimensional version and greatly reduces computation.

Given features $\left\{{\bf{f}}_{i}^{n}\in\Re^{H_{i}\times W_{i}\times C_{i}}\mid n\in\left\{1,2,3\right\},i=1,2,3,4\right\}$ from the ResNet-50 backbone network [48], where $H_{i}\times W_{i}$ and $C_{i}$ represent the resolution and channel number of features at stage-$i$ for modality $n$, respectively, each single-modality capsule, including the pose matrix ${\bf{p}}_{i}^{n}$ and the activation value $\boldsymbol{a}_{i}^{n}$, is constructed as follows:

${\bf{p}}_{i}^{n}\in\Re^{H_{i}\times W_{i}\times T^{p}\times 16}=Conv\left({\bf{f}}_{i}^{n},dim=3\right),\quad \boldsymbol{a}_{i}^{n}\in\Re^{H_{i}\times W_{i}\times T^{p}\times 1}=\sigma\left(Conv\left({\bf{f}}_{i}^{n},dim=3\right)\right),$   (1)

where $Conv\left(*,dim=3\right)$ and $\sigma$ refer to convolution and Sigmoid operations along the 3rd dimension, respectively. $T^{p}$ is the number of part-level capsule types. 16 and 1 represent the pose matrix dimension and the activation value dimension, respectively. The full-resolution capsules $\mathbf{FP}_{i}^{n}$ are composed as

$\mathbf{FP}_{i}^{n}\in\Re^{H_{i}\times W_{i}\times T^{p}\times 17}=Concat\left(\mathbf{p}_{i}^{n},\boldsymbol{a}_{i}^{n},dim=4\right).$   (2)

On top of the full-resolution capsules $\mathbf{FP}_{i}^{n}$, we disentangle the horizontal 1-dimensional capsules as follows

$\mathbf{P}_{i,H}^{n}\in\Re^{H_{i}\times 1\times T^{p}\times 17}=Conv\left(\mathbf{FP}_{i}^{n},dim=2\right).$   (3)

Similarly, the vertical part-level capsules can be disentangled as

$\mathbf{P}_{i,V}^{n}\in\Re^{1\times W_{i}\times T^{p}\times 17}=Conv\left(\mathbf{FP}_{i}^{n},dim=1\right).$   (4)
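For clarity, the following PyTorch sketch illustrates Eqs. (1)-(4) for one modality at one stage. It is a minimal approximation rather than the authors' implementation: the class name is hypothetical, and average pooling along one spatial axis stands in for the disentangling convolutions of DCR [16] in Eqs. (3)-(4).

```python
import torch
import torch.nn as nn

class PartCapsuleDisentangle(nn.Module):
    """Builds part-level capsules (Eqs. (1)-(2)) and disentangles them (Eqs. (3)-(4))."""
    def __init__(self, in_channels, num_types=8, pose_dim=16):
        super().__init__()
        self.num_types, self.pose_dim = num_types, pose_dim
        self.pose_conv = nn.Conv2d(in_channels, num_types * pose_dim, 3, padding=1)  # pose, Eq. (1)
        self.act_conv = nn.Conv2d(in_channels, num_types, 3, padding=1)              # activation, Eq. (1)

    def forward(self, f):                                    # f: (B, C_i, H_i, W_i) backbone feature
        b, _, h, w = f.shape
        pose = self.pose_conv(f).view(b, self.num_types, self.pose_dim, h, w)
        act = torch.sigmoid(self.act_conv(f)).view(b, self.num_types, 1, h, w)
        fp = torch.cat([pose, act], dim=2)                   # Eq. (2): (B, T^p, 17, H, W)
        # Eqs. (3)-(4): collapse one spatial axis to obtain 1-D horizontal/vertical capsules
        p_h = fp.mean(dim=4).permute(0, 3, 1, 2)             # (B, H, T^p, 17)
        p_v = fp.mean(dim=3).permute(0, 3, 1, 2)             # (B, W, T^p, 17)
        return fp, p_h, p_v

caps = PartCapsuleDisentangle(in_channels=256)
fp, p_h, p_v = caps(torch.randn(2, 256, 32, 32))             # one modality at one stage
```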

3.1.2 Part-level to whole-level capsules routing

The disentangled horizontal and vertical part-level modal capsules, $\mathbf{P}_{i,H}^{n}$ and $\mathbf{P}_{i,V}^{n}$, are fed into the capsule routing algorithm [15] to find the whole-level modal capsules by exploring part-whole relations.

Specifically, the part-level capsules of the multiple modalities are combined by separately concatenating the horizontal part-level capsules and the vertical part-level capsules of the multiple modalities along the type dimension, i.e.,

$\mathbf{P}_{i,H}\in\Re^{H_{i}\times 1\times(T^{p}+T^{p}+T^{p})\times 17}=Concat\left(\left.\mathbf{P}_{i,H}^{n}\right|_{n=1,2,3},dim=3\right),$   (5)
$\mathbf{P}_{i,V}\in\Re^{1\times W_{i}\times(T^{p}+T^{p}+T^{p})\times 17}=Concat\left(\left.\mathbf{P}_{i,V}^{n}\right|_{n=1,2,3},dim=3\right).$   (6)

The horizontal and vertical whole-level modal capsules are computed by applying capsule routing [15] to the horizontal and vertical part-level modal capsules separately, as follows

$\mathbf{W}_{i,H}\in\Re^{H_{i}\times 1\times T^{w}\times 17},\ \mathbf{R}_{i,H}\in\Re^{(H_{i}\times 1)\times(T^{p}+T^{p}+T^{p})\times T^{w}}=Routing\left(\mathbf{P}_{i,H}\right),$   (7)
$\mathbf{W}_{i,V}\in\Re^{1\times W_{i}\times T^{w}\times 17},\ \mathbf{R}_{i,V}\in\Re^{(1\times W_{i})\times(T^{p}+T^{p}+T^{p})\times T^{w}}=Routing\left(\mathbf{P}_{i,V}\right),$   (8)

where $Routing\left(*\right)$ represents the capsule routing algorithm [15]. $\mathbf{R}_{i,H}$ and $\mathbf{R}_{i,V}$ are the routing coefficients from the part-level modalities to the whole-level modality, which reveal the affinities between the part-level modalities and the whole-level modality along the horizontal and vertical dimensions, respectively. $\mathbf{W}_{i,H}$ and $\mathbf{W}_{i,V}$ are the horizontal and vertical whole-level modal capsules, respectively.

On top of that, the full-resolution whole-level modal capsules can be achieved by entangling them as follows

$\mathbf{WP}_{i}\in\Re^{H_{i}\times W_{i}\times T^{w}\times 17}=\mathbf{W}_{i,H}\otimes\mathbf{W}_{i,V},$   (9)

where $\otimes$ represents the operation of matrix multiplication along the resolution dimension.
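A simplified sketch of Eqs. (5)-(9) is given below. It concatenates the 1-D part capsules of the three modalities along the type dimension, routes them to whole-level capsules, and entangles the horizontal and vertical results through an outer product. The dot-product agreement loop only stands in for the routing algorithm of [15], and the capsule type counts are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def route_parts_to_whole(parts, num_whole, iters=3):
    """parts: (B, L, Tin, D) 1-D part capsules, L = H or W, Tin = 3*T^p, D = 17.
    Returns whole-level capsules (B, L, Tw, D) and routing coefficients (B, L, Tin, Tw)."""
    b, l, tin, d = parts.shape
    logits = parts.new_zeros(b, l, tin, num_whole)
    for _ in range(iters):
        r = F.softmax(logits, dim=-1)                              # routing coefficients R
        whole = (r.unsqueeze(-1) * parts.unsqueeze(3)).sum(dim=2)  # weighted votes
        whole = whole / (r.sum(dim=2).unsqueeze(-1) + 1e-8)
        logits = logits + (parts.unsqueeze(3) * whole.unsqueeze(2)).sum(-1)  # agreement update
    return whole, F.softmax(logits, dim=-1)

# Eqs. (5)-(6): concatenate the three modalities' part capsules along the type dimension
p_h = torch.cat([torch.randn(2, 32, 8, 17) for _ in range(3)], dim=2)   # (B, H, 3*T^p, 17)
p_v = torch.cat([torch.randn(2, 32, 8, 17) for _ in range(3)], dim=2)   # (B, W, 3*T^p, 17)
w_h, r_h = route_parts_to_whole(p_h, num_whole=25)                      # Eq. (7), Tw illustrative
w_v, r_v = route_parts_to_whole(p_v, num_whole=25)                      # Eq. (8)
# Eq. (9): entangle into full-resolution whole-level capsules, (B, H, W, Tw, 17)
wp = torch.einsum('bhtd,bwtd->bhwtd', w_h, w_v)
```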

3.2 Modal-shared and modal-specific details generation

3.2.1 Modal-shared details

The whole-level modal capsules $\mathbf{WP}_{i}$ in Eq. (9) capture associations across the different individual modalities, since they are computed by routing information from multiple modalities. The generation of modal-shared details is a crucial step in the PWRF framework, allowing the model to leverage information that is consistent across multiple modalities. Since the whole-level capsules aggregate features from all part-level modalities through the part-whole relational routing mechanism, we treat them as the modal-shared semantic details.

3.2.2 Modal-specific details

The routing coefficients $\mathbf{R}_{i,H}$ and $\mathbf{R}_{i,V}$ in Eqs. (7) and (8) indicate the likelihood of each part-level capsule belonging to each latent whole-level capsule class, and thus reveal the contributions of the different part-level modalities to the whole-level modality. Therefore, we model the modal-specific details for each modality using the routing coefficients and each modality's capsules. First, the part-level correlations for each modality are extracted from the horizontal and vertical routing coefficients $\mathbf{R}_{i,H}$ and $\mathbf{R}_{i,V}$:

$\text{R}_{i,H}^{n}\in\Re^{H_{i}\times 1\times T^{p}}=\frac{1}{T^{w}}\sum_{k=1}^{T^{w}}\mathbf{R}_{i,H}[:,:,T^{p}(n-1)+1:T^{p}n,k],\quad n=1,2,3,$   (10)
$\text{R}_{i,V}^{n}\in\Re^{1\times W_{i}\times T^{p}}=\frac{1}{T^{w}}\sum_{k=1}^{T^{w}}\mathbf{R}_{i,V}[:,:,T^{p}(n-1)+1:T^{p}n,k],\quad n=1,2,3,$   (11)

where $\text{R}_{i,H}^{n}$ and $\text{R}_{i,V}^{n}$ represent the horizontal and vertical routing coefficients for the part-level capsules of the $n^{\text{th}}$ modality, respectively. The third dimension of the routing coefficients is split into three parts, each of size $T^{p}$.

Figure 2: Visualization for the split of routing coefficients.

The horizontal and vertical modal-specific details are computed by multiplying each part-level correlation with its capsule features:

$\mathbf{SP}_{i,H}^{n}\in\Re^{H_{i}\times 1\times T^{p}\times 17}=\mathbf{P}_{i,H}^{n}\odot\text{R}_{i,H}^{n},$   (12)
$\mathbf{SP}_{i,V}^{n}\in\Re^{1\times W_{i}\times T^{p}\times 17}=\mathbf{P}_{i,V}^{n}\odot\text{R}_{i,V}^{n},$   (13)

where $\odot$ means element-wise multiplication. The full-resolution modal-specific details are further computed by entangling $\mathbf{SP}_{i,H}^{n}$ and $\mathbf{SP}_{i,V}^{n}$ along the resolution dimension as follows

$\mathbf{SP}_{i}^{n}\in\Re^{H_{i}\times W_{i}\times T^{p}\times 17}=\mathbf{SP}_{i,H}^{n}\otimes\mathbf{SP}_{i,V}^{n}.$   (14)

It is noted that each modal-specific detail tensor $\mathbf{SP}_{i}^{n}$ is reshaped to $\mathbf{SP}_{i}^{n}\in\Re^{H_{i}\times W_{i}\times(T^{p}\times 17)}$ by merging the last two dimensions for subsequent use. On top of that, the modal-specific components $\mathbf{SP}_{i}^{n}$ of the multiple modalities are integrated to obtain the merged modal-specific details as

$\mathbf{SP}_{i}=ConvCate\left(\mathbf{SP}_{i}^{n},dim=3\right),$   (15)

where $ConvCate(*)$ denotes concatenation followed by convolution along the specified dimension.

To make the entire routing process more intuitive, Fig. 2 visualizes the process of splitting the routing coefficients. As shown in Fig. 2, the routing coefficients are obtained from the routing between the high-level capsules and the low-level capsules. To obtain the routing coefficients corresponding to the respective modalities, the routing coefficients are split and averaged along the last dimension, and a reshape operation is then applied for subsequent processing.
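The sketch below, continuing the assumed shapes from the previous snippet, illustrates how Eqs. (10)-(15) turn the routing coefficients into modal-specific details; the routing coefficients are assumed to have been reshaped to (B, L, 3*T^p, T^w), indexing is 0-based, and the helper names are hypothetical.

```python
import torch
import torch.nn as nn

def modal_specific(parts_h, parts_v, r_h, r_v, n, t_p):
    """parts_*: (B, L, T^p, 17) capsules of modality n; r_*: (B, L, 3*T^p, Tw) coefficients."""
    # Eqs. (10)-(11): slice modality n's block (0-based here) and average over whole-level types
    c_h = r_h[:, :, t_p * n:t_p * (n + 1), :].mean(-1)      # (B, H, T^p)
    c_v = r_v[:, :, t_p * n:t_p * (n + 1), :].mean(-1)      # (B, W, T^p)
    # Eqs. (12)-(13): weight the part capsules by their routing coefficients
    sp_h = parts_h * c_h.unsqueeze(-1)                      # (B, H, T^p, 17)
    sp_v = parts_v * c_v.unsqueeze(-1)                      # (B, W, T^p, 17)
    sp = torch.einsum('bhtd,bwtd->bhwtd', sp_h, sp_v)       # Eq. (14): (B, H, W, T^p, 17)
    return sp.flatten(3)                                    # reshape to (B, H, W, T^p*17)

# Eq. (15): concatenate the per-modality specific details and fuse with a convolution
sps = [modal_specific(torch.randn(2, 32, 8, 17), torch.randn(2, 32, 8, 17),
                      torch.randn(2, 32, 24, 25), torch.randn(2, 32, 24, 25), n, t_p=8)
       for n in range(3)]
fuse = nn.Conv2d(3 * 8 * 17, 64, kernel_size=1)
sp_merged = fuse(torch.cat(sps, dim=-1).permute(0, 3, 1, 2))   # (B, 64, H, W)
```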

4 Multi-modal scene understanding using PWRF

In this section, we select two fundamental tasks, including SMM semantic segmentation and VDT salient object detection, to explore the contributions of our PWRF framework for multi-modal scene understanding.

4.1 SMM semantic segmentation

Figure 3: SMM semantic segmentation framework based on PWRF. There are 4 stages with different-scale features and outputs. It is noted that we use the same stage 1 as in [14] due to the heavy computation of DCR. In stages 2-4, our PWRF models the modal-shared and modal-specific details of the different auxiliary modalities, which are further integrated with the primary RGB modality. The outputs of the four stages are fed to the SegFormer head [49] for semantic segmentation.

To explore the contribution of our PWRF to SMM semantic segmentation, which targets segmenting semantic objects under the synthetic RGB-depth-event-LiDAR condition, we plug our PWRF framework into the baseline [14]. As shown in Fig. 3, the proposed PWRF-based synthetic multi-modal semantic segmentation network consists of four stages. In each stage, RGB and the remaining modalities are taken as the primary and auxiliary modalities, respectively. For the primary RGB modality, Multi-Head Self-Attention (MHSA) [49] is used to extract deep features. For the auxiliary modalities, their fusion is handled by the proposed PWRF framework, on top of which modal-shared and modal-specific details are further integrated via the designed shared-specific integration module. Finally, the primary RGB modality and the integrated auxiliary-modality details are fed into the segmentation head [49] for semantic segmentation. The shared-specific integration module, which builds on the modal-shared and modal-specific details from PWRF, is illustrated in the following. In Fig. 3, the feature refine fusion module, parallel pooling mixer blocks, and token-select hub follow [14] and are not detailed here since they are not our contributions.

4.1.1 Shared-specific Integration

Given the complementary characteristics of modal-shared and modal-specific cues with respect to the input, their primitive integration benefits semantic exploration. To this end, we design a shared-specific integration module to combine the complementary strengths of modal-shared and modal-specific details for better semantic inference, which consists of two components: primitive modal-specific details generation and shared-specific interaction.

A. Primitive modal-specific details generation

To attenuate the noise of each modality's specific information, primitive modal-specific information is generated using the original modality features ${\bf{f}}_{i}^{n}$ and the modal-shared details $\mathbf{F}_{i,shd}$ as follows

$\hat{\mathbf{F}}_{i,sp}^{n}\in\Re^{H_{i}\times W_{i}\times C}=ConvCate\left(\left\{\mathbf{f}_{i}^{n},\mathbf{SP}_{i}^{n}\right\},dim=3\right),$   (16)
$\mathbf{F}_{i,sp}^{n}\in\Re^{H_{i}\times W_{i}\times C}=ConvCate\left(\left\{\mathbf{F}_{i,shd},\mathbf{SP}_{i}^{n}\right\},dim=3\right),$   (17)

where the modal-shared details, denoted as $\mathbf{F}_{i,shd}$, are derived by reshaping the whole-level modal capsules $\mathbf{WP}_{i}$. The primitive modal-specific details can then be derived as

$\mathbf{F}_{i,psg}^{n}\in\Re^{H_{i}\times W_{i}\times C}=\sigma\left(\hat{\mathbf{F}}_{i,sp}^{n}\right)\odot\mathbf{F}_{i,sp}^{n}+\mathbf{F}_{i,sp}^{n}.$   (18)

The merged primitive modal-specific information is then obtained by combining the three single modalities as follows

$\mathbf{F}_{i,psg}\in\Re^{H_{i}\times W_{i}\times C}=ConvCate\left(\left.\mathbf{F}_{i,psg}^{n}\right|_{n=1,2,3},dim=3\right).$   (19)
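A minimal sketch of Eqs. (16)-(19) is given below; the channel sizes, the use of 3x3 convolutions inside ConvCate, and the module name are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class PrimitiveSpecific(nn.Module):
    def __init__(self, feat_c, sp_c, shared_c, out_c):
        super().__init__()
        self.cc1 = nn.Conv2d(feat_c + sp_c, out_c, 3, padding=1)    # ConvCate of Eq. (16)
        self.cc2 = nn.Conv2d(shared_c + sp_c, out_c, 3, padding=1)  # ConvCate of Eq. (17)
        self.merge = nn.Conv2d(3 * out_c, out_c, 3, padding=1)      # ConvCate of Eq. (19)

    def forward(self, feats, sps, shared):
        outs = []
        for f, sp in zip(feats, sps):                               # loop over the three modalities
            f_hat = self.cc1(torch.cat([f, sp], dim=1))             # Eq. (16)
            f_sp = self.cc2(torch.cat([shared, sp], dim=1))         # Eq. (17)
            outs.append(torch.sigmoid(f_hat) * f_sp + f_sp)         # Eq. (18)
        return self.merge(torch.cat(outs, dim=1))                   # Eq. (19)

# usage with assumed channel sizes
module = PrimitiveSpecific(feat_c=64, sp_c=136, shared_c=64, out_c=64)
feats = [torch.randn(2, 64, 32, 32) for _ in range(3)]
sps = [torch.randn(2, 136, 32, 32) for _ in range(3)]
f_psg = module(feats, sps, torch.randn(2, 64, 32, 32))              # (2, 64, 32, 32)
```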

B. Shared-specific interaction

To integrate modal-shared and modal-specific details, we propose a shared-specific interaction module. Besides the modal-shared details $\mathbf{F}_{i,shd}$ and the primitive modal-specific details $\mathbf{F}_{i,psg}$, we also employ the selected modal details of the Self-Query Hub [14], denoted as $\mathbf{F}_{i,sqh}$. Specifically, three parallel branches are first designed to interact these three components. Within each branch, one component is selected as the primary cue, while the remaining two components are utilized to attend the primary cue for better semantic exploration, which is achieved by a spatial attention and a channel attention. The spatial attention is implemented as follows

$\mathbf{SA}_{i}=\sigma\left(Conv\left(CGMP\left(\mathbf{CP}_{i}^{1}+\mathbf{CP}_{i}^{2}+\mathbf{CP}_{i}^{3}\right),dim=3\right)\right),$   (20)

where $\mathbf{CP}_{i}^{1}$ is the primary component, $\mathbf{CP}_{i}^{2}$ and $\mathbf{CP}_{i}^{3}$ are the remaining two components, and $CGMP(*)$ denotes the global max pooling operation performed along the channel direction.

The channel attention is implemented as

$\mathbf{CA}_{i}=\sigma\left(Conv\left(GMP\left(\mathbf{CP}_{i}^{1}\odot\mathbf{SA}_{i}+\mathbf{CP}_{i}^{1}\right),dim=3\right)\right),$   (21)

where $GMP(*)$ refers to the adaptive global max pooling operation.

Based on the spatial and channel attentions, the primary component can be attended as

$\mathbf{F}_{i,ssi}^{1}=\mathbf{CP}_{i}^{1}\odot\mathbf{CA}_{i}+\mathbf{CP}_{i}^{1}.$   (22)

By applying Eqs. (20)-(22) with each component $\mathbf{CP}_{i}^{j},\ j=1,2,3$, taken in turn as the primary one, we obtain the shared-specific interaction result for each branch.

The interacted components $\mathbf{F}_{i,ssi}^{j}$ of the three branches are integrated to obtain the merged information using element-wise multiplication and element-wise addition, written as

$\mathbf{F}_{i}^{u}=ConvCate\left(\left\{\left.\otimes\left\{\mathbf{F}_{i,ssi}^{j}\right\}\right|_{j=1,2,3},\left.\oplus\left\{\mathbf{F}_{i,ssi}^{j}\right\}\right|_{j=1,2,3}\right\}\right),$   (23)

where $\oplus$ represents element-wise addition.
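The condensed sketch below shows one branch of Eqs. (20)-(22) and the merge of Eq. (23); the kernel sizes and channel counts are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpecificInteraction(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.sa_conv = nn.Conv2d(1, 1, 7, padding=3)        # spatial attention conv, Eq. (20)
        self.ca_conv = nn.Conv2d(c, c, 1)                   # channel attention conv, Eq. (21)
        self.merge = nn.Conv2d(2 * c, c, 3, padding=1)      # ConvCate of Eq. (23)

    def branch(self, cp1, cp2, cp3):
        # Eq. (20): channel-wise global max pooling (CGMP) of the summed components
        sa = torch.sigmoid(self.sa_conv((cp1 + cp2 + cp3).max(dim=1, keepdim=True).values))
        # Eq. (21): adaptive global max pooling (GMP) over the spatial dimensions
        ca = torch.sigmoid(self.ca_conv(F.adaptive_max_pool2d(cp1 * sa + cp1, 1)))
        return cp1 * ca + cp1                               # Eq. (22)

    def forward(self, shared, specific, sqh):
        comps = [shared, specific, sqh]
        branches = [self.branch(comps[j], *[c for k, c in enumerate(comps) if k != j])
                    for j in range(3)]                      # each component takes a turn as primary
        prod = branches[0] * branches[1] * branches[2]
        summ = branches[0] + branches[1] + branches[2]
        return self.merge(torch.cat([prod, summ], dim=1))   # Eq. (23)

ssi = SharedSpecificInteraction(64)
f_u = ssi(*[torch.randn(2, 64, 32, 32) for _ in range(3)])  # (2, 64, 32, 32)
```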

4.1.2 Model training

On top of the fused features $\mathbf{F}_{i}^{u}$, we integrate them with the primary RGB modality using the fusion step of [14] to obtain the multi-modal fusion features, which are fed into the multi-layer perceptron decoder [49] to predict the semantic result $\mathbf{Pre}$. To train the model, the online hard example mining (OHEM) cross-entropy loss [50] is used to measure the difference between the semantic predictions and the ground truth $\mathbf{GT}$, i.e.,

$Loss=OHEMCrossEntropyLoss\left(\mathbf{Pre},\mathbf{GT}\right).$   (24)
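A minimal sketch of an OHEM cross-entropy loss in the spirit of [50] is shown below; the probability threshold and the number of kept pixels are assumed hyper-parameters, not the settings used in this paper.

```python
import torch
import torch.nn.functional as F

def ohem_ce_loss(pred, gt, thresh=0.7, min_kept=100000, ignore_index=255):
    """pred: (B, num_classes, H, W) logits; gt: (B, H, W) class indices."""
    pixel_loss = F.cross_entropy(pred, gt, ignore_index=ignore_index, reduction='none').flatten()
    with torch.no_grad():
        valid = (gt != ignore_index).flatten()
        prob = F.softmax(pred, dim=1)
        gt_prob = prob.gather(1, gt.clamp(0, pred.shape[1] - 1).unsqueeze(1)).squeeze(1).flatten()
        hard = (gt_prob < thresh) & valid                   # pixels the model is unsure about
        if hard.sum() < min_kept:                           # keep at least min_kept hardest pixels
            k = min(min_kept, int(valid.sum()))
            hard = torch.zeros_like(hard)
            hard[pixel_loss.topk(k).indices] = True
    return pixel_loss[hard].mean()
```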

4.2 VDT salient object detection

To explore the effectiveness of our PWRF for VDT salient object detection, we design the network shown in Fig. 4. Concretely, Swin-Transformer [51] is utilized to learn the backbone features of the triple modalities, which are further fed into our PWRF to obtain modal-shared and modal-specific semantics. After that, a stacking adjacent-scale attention decoder is designed to integrate the modal-shared/specific semantics at different scales. The predictions of the two sub-decoders are combined to produce the final saliency map. The following details the Adjacent-Scale Attention (ASA) module and the stacking ASA decoder.

Figure 4: VDT salient object detection framework based on PWRF. Swin-Transformer [51] is utilized to learn the backbone features of the triple modalities, which are further fed into our PWRF to obtain modal-shared and modal-specific semantics. After that, a stacking adjacent-scale attention decoder is designed to integrate the different-scale modal-shared/specific semantics. The predictions of the two sub-decoders are combined to produce the final saliency map.

4.2.1 Adjacent-scale attention

High-level features contain rich semantic information, encapsulating the overall properties of salient objects. In contrast, low-level features preserve the edge details of salient objects. To combine their respective strengths, we design an ASA module to integrate adjacent high-level and low-level semantics for the modal-shared and modal-specific streams, as shown in Fig. 5. It is composed of three components: adjacent-scale integration, dual-branch attention, and selective aggregation.

Figure 5: Adjacent-Scale Attention module, which is composed of three components: adjacent-scale integration, dual-branch attention, and selective aggregation.

Adjacent-scale integration. The adjacent-level modal-shared details ($\mathbf{WP}_{i-1}$ and $\mathbf{WP}_{i}$) and modal-specific details ($\mathbf{SP}_{i-1}$ and $\mathbf{SP}_{i}$) are integrated via

$\mathbf{F}_{i-1,shd}=\mathbf{WP}_{i-1}\oplus\mathbf{WP}_{i}^{\prime},$   (25)
$\mathbf{F}_{i-1,spc}=\mathbf{SP}_{i-1}\oplus\mathbf{SP}_{i}^{\prime}.$   (26)

$\mathbf{WP}_{i}^{\prime}$ and $\mathbf{SP}_{i}^{\prime}$ can be obtained by

$\mathbf{WP}_{i}^{\prime}=Bilinear\left(CBR\left(\mathbf{WP}_{i}\right)\right),$   (27)
$\mathbf{SP}_{i}^{\prime}=Bilinear\left(CBR\left(\mathbf{SP}_{i}\right)\right),$   (28)

where $CBR(\cdot)$ and $Bilinear(\cdot)$ represent the operations of (Convolution + BatchNorm + ReLU) and bilinear upsampling, respectively.

Dual-branch attention. A dual-branch attention comprising a local attention and a global attention is designed to attend to the informative regions. Specifically, the dual-branch attention is computed as

$\mathbf{F}_{i,shd}^{dba}=CBRCB\left(\mathbf{F}_{i,shd}\right)+ACRC\left(\mathbf{F}_{i,shd}\right),$   (29)
$\mathbf{F}_{i,spc}^{dba}=CBRCB\left(\mathbf{F}_{i,spc}\right)+ACRC\left(\mathbf{F}_{i,spc}\right),$   (30)

where $CBRCB(\cdot)$ is the local attention implemented as (Convolution + BatchNorm + ReLU + Convolution + BatchNorm), and $ACRC(\cdot)$ is the global attention implemented as (Average pooling + Convolution + ReLU + Convolution).

Selective aggregation. To address the feature discrepancy, a selective aggregation strategy is designed to suppress redundant information and prevent feature contamination. To this end, a gate signal mechanism is introduced to aggregate adjacent-level modal-shared and modal-specific details, i.e.,

$\mathbf{F}_{i,shd}^{asa}=\mathbf{WP}_{i}^{\prime}\otimes\sigma\left(\mathbf{F}_{i,shd}^{dba}\right)+\mathbf{WP}_{i-1}\otimes\left(1-\sigma\left(\mathbf{F}_{i,shd}^{dba}\right)\right),$   (31)
$\mathbf{F}_{i,spc}^{asa}=\mathbf{SP}_{i}^{\prime}\otimes\sigma\left(\mathbf{F}_{i,spc}^{dba}\right)+\mathbf{SP}_{i-1}\otimes\left(1-\sigma\left(\mathbf{F}_{i,spc}^{dba}\right)\right).$   (32)
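The following condensed sketch covers one stream (modal-shared or modal-specific) of the ASA module, i.e., Eqs. (25)-(32); the channel sizes and the exact layer configuration are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASA(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.cbr = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU())
        self.local = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c), nn.ReLU(),
                                   nn.Conv2d(c, c, 3, padding=1), nn.BatchNorm2d(c))   # CBRCB
        self.glob = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(c, c, 1), nn.ReLU(),
                                  nn.Conv2d(c, c, 1))                                   # ACRC

    def forward(self, low, high):
        # Eqs. (27)-(28): refine and upsample the higher-level features to the lower-level size
        high_up = F.interpolate(self.cbr(high), size=low.shape[2:], mode='bilinear',
                                align_corners=False)
        fused = low + high_up                           # Eqs. (25)-(26): adjacent-scale integration
        dba = self.local(fused) + self.glob(fused)      # Eqs. (29)-(30): dual-branch attention
        gate = torch.sigmoid(dba)
        return high_up * gate + low * (1 - gate)        # Eqs. (31)-(32): selective aggregation

asa = ASA(64)
out = asa(torch.randn(2, 64, 48, 48), torch.randn(2, 64, 24, 24))   # (2, 64, 48, 48)
```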

4.2.2 Stacking ASA decoder

As illustrated in Fig. 4(b), we stack two ASA-based sub-decoders to improve feature aggregation for saliency map generation, implementing feature aggregation along bottom-up and top-down flows. Specifically, in the bottom-up process, the ASA within each decoder progressively aggregates features from high levels to low levels for the modal-shared and modal-specific streams separately. The resulting aggregated features $\mathbf{F}_{i,shd}^{asa}$ and $\mathbf{F}_{i,spc}^{asa}$ contribute to the generation of a preliminary saliency map. Conversely, in the top-down process, the shallowest aggregated features of the first sub-decoder are densely used to guide the second sub-decoder to learn primitive features. In addition, edge cues are employed to enhance the depth feature maps by compensating for their poor quality. This improves boundary refinement, allowing more accurate delineation of object edges and a better representation of spatial structures within the depth maps. The aggregated features of the two sub-decoders are combined to generate the final saliency prediction.

4.2.3 Model training

The loss function LL is defined as:

$L=\sum\limits_{i=1}^{5}\left(L_{B}\left(\mathbf{Pre}_{i},\mathbf{GT}\right)+L_{S}\left(\mathbf{Pre}_{i},\mathbf{GT}\right)+L_{I}\left(\mathbf{Pre}_{i},\mathbf{GT}\right)\right),$   (33)

where $L_{B}$, $L_{S}$, and $L_{I}$ represent the binary cross-entropy loss [52], structural similarity loss [53], and intersection-over-union loss [54], respectively. $\mathbf{Pre}_{i}$ represents the $i$th predicted saliency map, and $\mathbf{GT}$ is the ground truth.
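A sketch of the hybrid objective in Eq. (33) is given below, combining the BCE and IoU terms over the five multi-scale predictions; the structural similarity term $L_{S}$ of [53] is omitted here for brevity, and the nearest-neighbour resizing of the ground truth is an assumption.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, gt, eps=1e-6):
    p = torch.sigmoid(pred)
    inter = (p * gt).sum(dim=(2, 3))
    union = (p + gt - p * gt).sum(dim=(2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()       # L_I

def vdt_loss(preds, gt):
    """preds: list of 5 saliency logit maps at different scales; gt: (B, 1, H, W) binary map."""
    total = 0.0
    for pred in preds:
        gt_s = F.interpolate(gt, size=pred.shape[2:], mode='nearest')
        total = total + F.binary_cross_entropy_with_logits(pred, gt_s) + iou_loss(pred, gt_s)
    return total
```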

4.3 Intermediate Visualization of PWRF

To provide a more intuitive view of the effectiveness of our PWRF, the modal-shared and primitive modal-specific features are visualized in Fig. 6. As shown in Fig. 6, the modal-shared knowledge captures the common details of the three modalities well, albeit with blurry boundaries. In contrast, the modal-specific features complement this by focusing on object shapes with clear boundaries. This clearly demonstrates the necessity of integrating modal-shared and modal-specific details for the final decision.

Figure 6: Visualization of modal-shared and primitive modal-specific features. Modal-shared knowledge captures the common details of the three modalities well. Modal-specific features focus on different cues such as object shape.

5 Experiment and Analysis

In this section, we will discuss the experimental results of the proposed methods for the tasks of SMM semantic segmentation and VDT salient object detection.

5.1 Datasets

DELIVER [14] is a large-scale dataset for synthetic multi-modal semantic segmentation, including RGB, depth, LiDAR, and event modalities, which was created using the CARLA simulator. Each image has a resolution of 1042×1042. It contains 47,310 frames, with 7,885 front-view samples divided into 3,983/2,005/1,897 samples for training/validation/testing, respectively. The dataset introduces adverse conditions and sensor failure scenarios, such as environmental variations and partial sensor malfunctions, which pose challenges for autonomous driving.

The VDT-2048 dataset [38] consists of 2,048 images with pixel-wise annotations (ground truths) for VDT salient object detection. The dataset is divided into 1,048 training images and 1,000 testing images.

5.2 Implementation details

Multi-modal synthetic semantic segmentation. We train our SMM semantic segmentation model on four A100 GPUs with an initial learning rate (LR) of 6e-5, using the poly strategy with a power of 0.9. The LR is set to 0.1× the original LR for the first 10 epochs for warm-up. The AdamW optimizer is used for training, with epsilon set to 1e-8 and weight decay set to 1e-2. Data augmentation techniques include random resizing with a ratio ranging from 0.5 to 2.0, random horizontal flipping, random color jittering, random Gaussian blurring, and random cropping.

VDT salient object detection. We train our VDT salient object detection model on an RTX 3090 GPU. During training, we resize all training images to 384×384 and apply data augmentation techniques such as random flipping and clipping. The backbone network parameters are initialized with pre-trained weights from the Swin-B network [55]. The Adam optimizer is used to train the model with a batch size of 4 and an initial learning rate of 5e-5. The learning rate is decreased by a factor of 10 every 80 epochs.
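The optimizer and learning-rate schedules described above can be set up roughly as in the sketch below; the model stand-in, the total epoch count, and the use of LambdaLR/StepLR are assumptions rather than the authors' exact training scripts.

```python
import torch

model = torch.nn.Conv2d(3, 1, 3)                  # stand-in for either network

# SMM segmentation: AdamW, poly decay (power 0.9), 0.1x LR warm-up for the first 10 epochs
opt_seg = torch.optim.AdamW(model.parameters(), lr=6e-5, eps=1e-8, weight_decay=1e-2)
max_epochs, warmup = 200, 10                      # max_epochs is an assumed value
sched_seg = torch.optim.lr_scheduler.LambdaLR(
    opt_seg, lambda e: 0.1 if e < warmup else (1 - e / max_epochs) ** 0.9)

# VDT saliency: Adam, LR 5e-5, batch size 4, decayed by 10x every 80 epochs
opt_sal = torch.optim.Adam(model.parameters(), lr=5e-5)
sched_sal = torch.optim.lr_scheduler.StepLR(opt_sal, step_size=80, gamma=0.1)
```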

5.3 Evaluation Metrics

To quantitatively evaluate the performance on DELIVER [14], we adopt mIoU as the metric, consistent with the state-of-the-art methods.

To quantitatively evaluate different saliency models on the VDT-2048 dataset, we use comprehensive evaluation metrics, including the S-measure ($S$) [56], Mean Absolute Error ($MAE$) [57], F-measure metrics [57] ($F_{\beta}^{mean}$, $F_{\beta}^{adp}$), and E-measure metrics [58] ($E_{\xi}^{mean}$, $E_{\xi}^{adp}$). The mathematical definitions of these metrics are as follows.

MAE:

$\mathrm{MAE}=\frac{1}{W\times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\left|\mathbf{P}(x,y)-\mathbf{G}(x,y)\right|,$   (34)

where $\mathbf{P}(x,y)$ and $\mathbf{G}(x,y)$ denote the predicted saliency map and the ground truth, respectively.

F-measure:

$F_{\beta}=\frac{(1+\beta^{2})\cdot Precision\cdot Recall}{\beta^{2}\cdot Precision+Recall},$   (35)

where $\beta^{2}$ is the parameter used to balance $Precision$ and $Recall$.

E-measure:

$E_{\xi}=\frac{1}{W\times H}\sum_{x=1}^{W}\sum_{y=1}^{H}\theta(x,y),$   (36)

where $\theta$ represents the alignment between the predicted map and the ground truth, and $H\times W$ is the spatial resolution of the input.

S-measure:

$\mathrm{S}=\alpha S_{o}+(1-\alpha)S_{r},$   (37)

where $S_{o}$ and $S_{r}$ denote the object-aware and region-aware structural similarity, respectively. $\alpha$ is set to 0.5.
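For reference, the MAE and F-measure of Eqs. (34)-(35) can be computed as in the NumPy sketch below; the adaptive threshold (twice the mean saliency) and $\beta^{2}=0.3$ are common choices in the saliency literature and are assumptions here.

```python
import numpy as np

def mae(pred, gt):
    return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()        # Eq. (34)

def adaptive_f_measure(pred, gt, beta2=0.3):
    thresh = min(2 * pred.mean(), 1.0)             # adaptive threshold (assumption)
    binary = pred >= thresh
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)  # Eq. (35)

pred = np.random.rand(384, 384)                    # toy prediction and ground truth
gt = (np.random.rand(384, 384) > 0.5).astype(np.float64)
print(mae(pred, gt), adaptive_f_measure(pred, gt))
```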

5.4 Comparison against the State of the Art

5.4.1 SMM semantic segmentation

Table 1 provides a comprehensive comparison between our approach and several State-Of-The-Art (SOTA) methods, including HRFuser [13], SegFormer [49], TokenFusion [11], CMX [28], and CMNeXt [14], on the DELIVER dataset [14]. Following [14], we report the results of different models on the validation set of DELIVER [14] in Table 1. In Table 1, most cross-modal fusion approaches are outperformed by the multi-modal fusion methods. Moreover, our model achieves the best mIoU among the multi-modal fusion methods, which demonstrates the superiority of our model over the other methods. The IoU values for each class are listed in Table 2. For a more comprehensive comparison, we also test our model and the SOTA multi-modal fusion method CMNeXt [14] on the test set of the DELIVER dataset [14]. The mIoU values of our model and CMNeXt [14] are 54.29% and 53%, respectively, which demonstrates a clear improvement of our model over the SOTA method.

Table 1: mIoU values of different models on DeLIVER dataset [14].
Method Modal Backbone DeLIVER
HRFuser [13] RGB-D HRFormer-T [13] 51.88
TokenFusion [11] RGB-D MiT-B2 [49] 60.25
CMX [28] RGB-D MiT-B2 62.67
CMNeXt [14] RGB-D MiT-B2 [49] 63.58
HRFuser [13] RGB-E HRFormer-T [13] 42.22
TokenFusion [11] RGB-E MiT-B2 [49] 45.63
CMX [28] RGB-E MiT-B2 [49] 56.52
CMNeXt [14] RGB-E MiT-B2 [49] 57.48
HRFuser [13] RGB-Li HRFormer-T [13] 43.13
TokenFusion [11] RGB-Li MiT-B2 [49] 53.01
CMX [28] RGB-Li MiT-B2 56.37
CMNeXt [14] RGB-Li MiT-B2 [49] 58.04
HRFuser [13] RGB-D-E-Li HRFormer-T [13] 52.97
CMNeXt [14] RGB-D-E-Li MiT-B2 [49] 66.30
OURS RGB-D-E-Li MiT-B2 [49] 66.47
Table 2: IoU, F1, and Accuracy for Different Classes in the DeLIVER dataset [14].
Class IoU (%) F1 (%) Accuracy (%)
Building 89.28 94.34 98.29
Fence 44.4 61.5 59.45
Other 0 0 0
Pedestrian 75.94 86.33 85.78
Pole 75.61 86.11 85.2
RoadLine 86.17 92.57 90.63
Road 98.33 99.16 98.94
SideWalk 80.87 89.43 96.11
Vegetation 89.17 94.28 93.92
Cars 88.8 94.07 98.57
Wall 64.45 78.38 88.45
TrafficSign 72.4 83.99 77.28
Sky 99.43 99.71 99.75
Ground 2.77 5.38 4.2
Bridge 51.22 67.74 59.4
RailTrack 54.38 70.45 73.97
GroundRail 48.82 65.61 50.14
TrafficLight 83.19 90.82 87.28
Static 33.04 49.67 34.98
Dynamic 34.29 51.07 49.79
Water 42.11 59.27 42.4
Terrain 84.52 91.61 93.79
TwoWheeler 75.83 86.25 87.12
Bus 92.69 96.21 95.95
Truck 93.97 96.89 96.89
Mean 66.47 75.63 73.93
Table 3: Results on adverse conditions in the DeLIVER dataset [14]. Sensor failure cases are MB: Motion Blur; OE: Over-Exposure; UE: Under-Exposure; LJ: LiDAR-Jitter; and EL: Event Low-resolution.
Model-modality Cloudy Foggy Night Rainy Sunny MB OE UE LJ EL Mean
HRFuser-RGB [13] 49.26 48.64 42.57 50.61 50.47 48.33 35.13 26.86 49.06 49.88 47.95
SegFormer-RGB [49] 59.99 57.30 50.45 58.69 60.21 57.28 56.64 37.44 57.17 59.12 57.20
TokenFusion-RGB-D [11] 50.92 52.02 43.37 50.70 52.21 49.22 46.22 36.39 49.58 49.17 49.86
CMX-RGB-D [14] 63.70 62.77 60.74 62.37 63.14 59.50 60.14 55.84 62.65 63.26 62.66
HRFuser-RGB-D-E-L [13] 56.20 52.39 49.85 52.53 54.02 49.44 46.31 46.92 53.94 52.72 52.97
CMNeXt-RGB-D-E-L [14] 68.70 65.67 62.46 67.50 66.57 62.91 64.59 60.00 65.92 65.48 66.30
PWRF-RGB-D-E-L 69.53 65.11 64.05 65.8 67.5 63.02 64.84 60.37 66.2 67.14 66.47
Table 4: Quantitative comparison results (%) of $S$, $F_{\beta}^{mean}$, $F_{\beta}^{adp}$, $E_{\xi}^{mean}$, $E_{\xi}^{adp}$, and $MAE$ on the VDT-2048 dataset. Here, "$\uparrow$" ("$\downarrow$") means that the larger (smaller) the better. The best three results in each column are marked in red, green, and blue, respectively. Note: Red indicates the best performance in each metric.
Methods Type $S\uparrow$ $MAE\downarrow$ $E_{\xi}^{adp}\uparrow$ $E_{\xi}^{mean}\uparrow$ $F_{\beta}^{adp}\uparrow$ $F_{\beta}^{mean}\uparrow$
CPD [59] V 90.44 0.39 92.70 95.01 76.45 83.76
RAS [60] V 89.00 0.40 96.15 96.50 80.79 82.92
BBSNet [61] VD 91.17 0.46 87.47 93.57 69.57 82.67
DPANet [62] VD 72.26 1.92 53.28 72.22 29.19 48.52
RD3D [63] VD 90.95 0.47 83.54 92.31 64.62 81.20
SwinNet [55] VD 91.98 0.37 89.78 95.07 73.21 84.58
HRTransNet [64] VD 91.44 0.31 96.17 97.60 88.27 85.49
Ours VD 90.84 0.33 97.57 97.61 84.77 85.51
CGFNet [65] VT 91.66 0.33 93.19 94.47 78.22 84.80
CSRNet [53] VT 88.21 0.50 94.94 95.57 78.88 82.78
DCNet [66] VT 87.87 0.38 96.58 94.36 85.21 84.5
LSNet [67] VT 88.67 0.44 93.27 96.31 76.07 80.97
SwinNet[55] VT 93.70 0.26 94.44 97.46 80.90 88.87
HRTransNet[64] VT 92.81 0.26 96.80 98.09 84.46 87.59
Ours VT 92.85 0.26 98.65 98.43 89.02 89.45
HWSI [38] VDT 93.18 0.26 98.15 98.45 87.18 89.61
MFFNet [12] VDT 93.94 0.25 98.31 98.25 87.58 90.34
Ours VDT 93.27 0.23 98.84 98.52 90.17 90.38
Table 5: Quantitative results (%) in V challenges. Here, "$\uparrow$" ("$\downarrow$") means that the larger (smaller) the better. The best three results in each column are marked in red, green, and blue, respectively. Note: Red indicates the best performance in each metric.
Method V-BSO V-LI V-MSO V-NI V-SA V-SI V-SSO
$S\uparrow$ $MAE\downarrow$ $E_{\xi}^{adp}\uparrow$ $F_{\beta}^{adp}\uparrow$ (reported for each of the seven challenges)
BBSNet [61] 95.98 0.77 98.94 92.64 89.87 0.64 86.46 67.20 91.31 0.53 91.59 75.30 82.54 0.80 75.57 52.34 92.13 0.34 90.67 72.06 89.29 0.65 86.29 66.80 85.48 0.28 65.83 41.83
CGFNet [65] 95.96 0.71 99.19 93.77 90.35 0.46 93.63 77.55 91.18 0.49 94.77 80.20 88.40 0.35 86.93 68.57 92.10 0.30 92.90 76.67 91.19 0.43 93.53 77.84 84.21 0.12 79.42 56.03
CPD [59] 94.29 0.92 98.58 93.18 88.92 0.55 93.07 75.50 90.65 0.46 95.59 80.12 81.11 0.55 84.02 59.55 91.96 0.30 94.65 79.65 89.29 0.55 92.95 74.29 84.78 0.12 77.80 53.16
CSRNet [53] 90.94 1.39 96.15 89.85 87.73 0.58 95.72 79.34 85.98 0.89 94.14 79.40 85.89 0.43 92.29 72.60 86.15 0.47 95.11 75.69 86.71 0.78 94.15 77.77 81.89 0.14 83.10 56.74
DCNet [66] 94.52 0.82 99.06 94.44 87.10 0.48 97.04 84.13 88.11 0.54 98.00 85.11 82.42 0.38 92.07 78.35 89.60 0.32 98.26 83.79 88.55 0.47 98.17 85.46 77.40 0.12 92.66 72.01
DPANet [62] 77.27 4.18 93.12 69.80 71.93 2.15 54.10 29.53 71.82 2.64 58.67 31.95 63.14 2.55 44.28 20.84 71.13 1.66 45.61 21.45 71.07 2.21 52.63 27.37 62.04 1.08 30.16 5.97
LSNet [67] 94.90 0.99 98.94 92.58 87.47 0.57 94.05 75.98 88.98 0.60 94.45 78.60 81.76 0.52 87.92 63.91 88.48 0.40 93.83 76.77 86.21 0.64 94.05 74.97 79.43 0.18 76.55 51.17
RAS [60] 93.88 0.95 98.24 92.36 87.23 0.57 95.61 78.08 88.67 0.55 96.78 81.98 80.71 0.52 90.60 66.15 89.33 0.33 97.11 82.12 86.40 0.65 95.86 78.07 83.07 0.13 89.18 65.19
RD3D [63] 95.52 0.91 98.60 90.81 89.46 0.62 82.77 62.15 91.31 0.57 88.83 70.76 82.63 0.67 69.80 47.31 91.48 0.38 87.22 67.05 88.66 0.70 82.42 62.10 84.66 0.24 57.07 33.14
SwinNet(VD) [55] 96.39 0.68 99.17 93.62 90.60 0.48 90.03 72.24 91.57 0.48 92.53 76.79 84.84 0.52 81.23 59.70 93.69 0.28 92.20 75.56 89.90 0.54 90.58 72.69 86.13 0.24 66.73 42.93
SwinNet(VT) [55] 96.76 0.53 99.40 95.15 92.94 0.35 95.03 81.19 92.74 0.40 95.48 82.07 90.59 0.27 91.26 74.33 93.80 0.23 95.15 80.66 92.56 0.39 95.58 81.57 88.59 0.09 77.83 54.77
HRTrans(VD) [64] 95.72 0.64 99.26 94.63 90.43 0.42 96.42 81.57 91.24 0.42 96.91 83.39 83.98 0.42 92.16 69.74 92.40 0.25 97.37 84.25 89.81 0.49 96.52 81.45 84.63 0.12 85.45 61.91
HRTrans(VT) [64] 96.15 0.57 99.24 94.75 91.84 0.37 96.94 83.75 91.70 0.39 97.17 84.82 89.07 0.30 94.38 77.38 92.78 0.23 97.38 84.69 91.17 0.42 96.78 83.21 85.58 0.11 86.93 64.92
HWSI [38] 95.92 0.61 99.29 94.63 91.28 0.38 97.43 84.02 92.23 0.40 97.60 86.56 90.51 0.28 95.89 80.28 92.48 0.24 97.87 85.56 91.86 0.38 97.81 85.08 89.45 0.08 93.62 75.10
MFFNet [12] 96.43 0.57 99.37 95.28 92.33 0.35 98.12 86.00 93.22 0.41 98.14 87.29 91.12 0.26 97.17 82.17 93.76 0.22 97.61 85.88 92.28 0.37 98.17 86.07 90.70 0.07 93.51 74.25
Ours 96.19 0.51 99.43 95.96 92.63 0.32 98.86 88.43 92.34 0.36 98.91 89.32 89.93 0.25 97.13 84.97 93.66 0.20 99.18 89.24 91.35 0.38 98.76 87.78 88.97 0.07 97.13 81.34
Table 6: Quantitative results (%) in D challenges. Here, "$\uparrow$" ("$\downarrow$") means that the larger (smaller) the better. The best three results in each column are marked in red, green, and blue, respectively. Note: Red indicates the best performance in each metric.
Method D-BI D-BM D-II D-SSO
$S\uparrow$ $MAE\downarrow$ $E_{\xi}^{adp}\uparrow$ $F_{\beta}^{adp}\uparrow$ (reported for each of the four challenges)
BBSNet [61] 90.67 0.46 86.07 67.12 90.27 0.41 86.87 68.43 92.74 0.46 91.91 77.23 85.48 0.28 65.83 41.83
DPANet [62] 70.68 1.89 49.05 24.17 72.26 1.82 55.24 32.07 76.65 2.06 65.92 44.05 62.04 1.08 30.16 5.97
RD3D [63] 90.31 0.46 81.78 61.80 90.76 0.44 82.94 63.97 92.89 0.52 89.08 73.35 84.66 0.24 57.07 33.14
SwinNet(VD) [55] 91.55 0.34 88.70 71.11 91.27 0.44 88.77 71.56 93.30 0.47 93.21 79.82 86.13 0.24 66.73 42.93
HRTrans(VD) [64] 90.92 0.29 95.74 80.87 90.81 0.32 95.84 80.91 93.03 0.36 97.37 86.49 84.63 0.12 85.45 61.91
HWSI [38] 92.86 0.23 98.05 86.55 92.85 0.29 97.73 85.86 94.17 0.35 98.43 89.20 89.45 0.08 93.62 75.10
MFFNet [12] 93.64 0.23 98.14 86.76 93.27 0.26 98.23 86.63 94.61 0.32 98.73 89.94 90.47 0.07 93.51 74.25
Ours 92.98 0.22 98.83 89.59 92.66 0.25 98.36 89.11 94.18 0.29 98.86 91.93 88.97 0.07 97.13 81.34
Table 7: Quantitative results (%) in T challenges. Here, "$\uparrow$" ("$\downarrow$") means that the larger (smaller) the better. The best three results in each column are marked in red, green, and blue, respectively. Note: Red indicates the best performance in each metric.
Method T-Cr T-HR T-RD
$S\uparrow$ $MAE\downarrow$ $E_{\xi}^{adp}\uparrow$ $F_{\beta}^{adp}\uparrow$ (reported for each of the three challenges)
CGFNet [65] 90.10 0.34 91.03 74.04 95.23 0.29 96.72 83.39 93.02 0.46 97.61 85.50
CSRNet [53] 84.53 0.50 92.75 73.69 93.74 0.33 98.12 89.14 89.81 0.61 97.76 84.97
DCNet [66] 85.82 0.39 95.10 81.97 93.05 0.31 98.94 89.95 90.38 0.51 98.33 88.76
LSNet [67] 88.08 0.42 91.37 73.25 90.95 0.43 95.67 79.60 90.11 0.64 96.97 81.73
SwinNet(VT) [55] 92.63 0.26 92.53 77.18 96.08 0.21 97.69 85.98 94.34 0.37 98.01 87.43
HRTrans(VT) [64] 91.30 0.27 95.42 81.13 94.82 0.22 98.67 87.88 93.92 0.38 98.69 88.47
HWSI [38] 92.53 0.25 97.44 84.92 94.70 0.26 99.02 89.04 93.24 0.40 99.01 89.10
MFFNet [12] 92.87 0.23 97.49 84.90 96.02 0.19 99.03 90.26 93.36 0.37 98.95 89.82
Ours 91.90 0.23 97.85 87.91 95.29 0.19 99.43 92.20 93.70 0.35 99.22 91.29

For a more concrete analysis, we compare against established multi-modal fusion methods across varying conditions, including adverse weather and partial sensor failure scenarios. As shown in Table 3, single-modal and cross-modal fusion methods exhibit limitations in adverse scenarios due to fewer auxiliary modalities. Although multi-modal fusion methods, e.g., HRFuser [13] and CMNeXt [14], obtain good performance, our model is superior in most adverse scenarios thanks to the primitive fusion of multiple modalities.

5.4.2 VDT salient object detection

To evaluate the effectiveness of our PWRF for VDT salient object detection, we compare it with 13 state-of-the-art approaches, including two RGB salient object detection methods (CPD [59] and RAS [60]), three RGB-D salient object detection methods (BBSNet [61], DPANet [62], and RD3D [63]), six RGB-T salient object detection methods (CGFNet [65], CSRNet [53], DCNet [66], LSNet [67], SwinNet [55], and HRTransNet [64]), and two VDT salient object detection methods (HWSI [38] and MFFNet [12]). To ensure fair comparisons, all model predictions are either provided by the authors or generated using their source codes with default settings.

As shown in Table 4, three findings can be drawn: i) Compared with the RGB methods, most cross-modal methods achieve better performance, owing to the complementary information from the depth and thermal modalities; ii) Compared with the V-D and V-T approaches, the VDT methods obtain higher metric values, thanks to the complementarity of the triple modalities; iii) Compared with the previous VDT methods, our method achieves consistently superior performance on five of the six evaluation metrics, which we attribute to the superior multi-modal fusion of PWRF over the previous simple fusion mechanisms. More concretely, compared to the second-best model, MFFNet [12], our model achieves significant improvements. Besides, our framework under the cross-modal settings, including VD and VT, also achieves consistently superior performance, as can be seen from the second and third blocks of Table 4, where our model under the VD and VT conditions outperforms the other methods.

In addition, to demonstrate the robustness of our method in handling challenging scenarios, we present the performance comparison for visible-challenge, depth-challenge, and thermal-challenge scenes in Table 5, Table 6, and Table 7, respectively. Our method still achieves superior performance in these challenging scenarios.

5.5 Visual comparison

5.5.1 SMM semantic segmentation

To visually demonstrate the performance of different models for SMM semantic segmentation, we select two representative scenarios for comparison: “Cloud & Underexposure" and “Rainy & LiDAR Jitter". As shown in Fig. 7 (the values of the semantic labels are scaled by $\times 10$ for visualization), multi-modal fusion methods provide more accurate semantic analysis than the single-modal method SegFormer [49]. Moreover, compared with the multi-modal fusion method CMNeXt [14], our method benefits from the Part-Whole Relational Fusion (PWRF) framework, resulting in more complete object segmentation in complex scenarios.



Figure 7: Visual comparison for SMM semantic segmentation.

5.5.2 VDT salient object detection

To visually demonstrate the superiority of our model, we present several visualization results for challenging scenes across the visible, depth, and thermal images. Specifically, V-challenges include big salient object (BSO), low illumination (LI), multiple salient objects (MSO), no illumination (NI), similar appearance (SA), side illumination (SI), and small salient object (SSO). D-challenges contain background interference (BI), background messy (BM), incomplete information (II), and small salient object (SSO). T-challenges cover crossover (Cr), heat reflection (HR), and radiation dispersion (RD). As shown in Fig. 8(a), (b), and (c), the proposed model handles these challenging scenes better than the other methods.


(a) Visual comparison of V-challenge.


(b) Visual comparison of D-challenge.


(c) Visual comparison of T-challenge.

Figure 8: Visual comparison of V-challenge, D-challenge, and T-challenge scenes for VDT salient object detection.

5.6 Ablation study

5.6.1 Ablation analysis for SMM semantic segmentation

In this subsection, we conduct ablation experiments on the DELIVER dataset [14] to evaluate the contributions of the key components of the SMM semantic segmentation model.

Different number of capsule types for PWRF. The number of capsule types plays a vital role in exploring part-whole relations and thus affects the performance of the whole model. To study this hyper-parameter thoroughly, we run several rounds with different numbers of part-level capsule types (the number of whole-level capsule types is set to the category number of each dataset). As shown in Table 8, both too few and too many capsule types lower the performance: too few capsules cannot capture accurate part-whole relations, while too many introduce noisy capsule assignments. By contrast, 8 types achieve the best mIoU, which is the setting adopted in this paper.
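To make the capsule-type hyper-parameter concrete, the illustrative snippet below shows one common way of forming K part-level capsule types by grouping backbone feature channels; the channel width (64) and the grouping scheme are assumptions for illustration and not necessarily the exact PWRF implementation.

```python
import torch

def to_part_capsules(feat: torch.Tensor, num_types: int) -> torch.Tensor:
    """Group a backbone feature map (B, C, H, W) into `num_types` part-level capsule
    types, each a (C // num_types)-dimensional pose vector per spatial position.
    Returns a tensor of shape (B, num_types, C // num_types, H, W)."""
    b, c, h, w = feat.shape
    assert c % num_types == 0, "channels must be divisible by the number of capsule types"
    return feat.view(b, num_types, c // num_types, h, w)

# Example: 64 channels split into 8 capsule types of dimension 8,
# matching the best-performing setting in Table 8.
caps = to_part_capsules(torch.randn(2, 64, 32, 32), num_types=8)
print(caps.shape)  # torch.Size([2, 8, 8, 32, 32])
```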

Shared parameters. Since there are multiple stages, our PWRF architecture is repeated for each modality branch, which yields a multi-branch structure. To examine parameter sharing across these identical structures, we carry out experiments with shared and unshared parameters (a schematic illustration follows Table 8). As listed in the last two columns of Table 8, shared parameters improve the semantic segmentation performance compared with the unshared setting. The reason is threefold: i) shared parameters reduce noise caused by the modality gap; ii) shared parameters learn a consistent fusion trend for different modalities; iii) by sharing structures and parameters, data from different modalities assist each other in understanding the same scene.

Table 8: Ablation study for different capsule types and parameter sharing on the DELIVER dataset [14].
Primary Caps mIoU (Shared) mIoU (Unshared)
4 64.50 63.68
8 66.47 64.55
16 63.50 62.65
25 64.50 63.51
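The shared and unshared settings in Table 8 can be illustrated as follows: the shared variant reuses a single fusion module for every modality branch, whereas the unshared variant instantiates an independent copy per branch. This is a schematic PyTorch sketch with a placeholder block, not the actual PWRF module.

```python
import copy
import torch.nn as nn

def build_branches(fusion_block: nn.Module, num_modalities: int, shared: bool) -> nn.ModuleList:
    """Return one fusion block per modality branch.
    shared=True  -> every branch points to the same module (shared weights);
    shared=False -> every branch gets an independent deep copy (unshared weights)."""
    if shared:
        return nn.ModuleList([fusion_block] * num_modalities)
    return nn.ModuleList([copy.deepcopy(fusion_block) for _ in range(num_modalities)])

# Example with a placeholder block; PWRF would plug its routing module in here.
branches = build_branches(nn.Conv2d(64, 64, 3, padding=1), num_modalities=3, shared=True)
assert branches[0] is branches[1]  # identical object, hence shared parameters
```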

5.6.2 Ablation analysis for VDT salient object detection

In this subsection, we conduct ablation experiments on the VDT-2048 dataset [38] to evaluate the contributions of the key modules of our proposed method.

Table 9: Ablation analysis (%) on our baseline gradually including the newly proposed components on the VDT-2048 dataset [38].
Component $S\uparrow$ $MAE\downarrow$ $E_{\xi}^{adp}\uparrow$ $E_{\xi}^{mean}\uparrow$ $F_{\beta}^{adp}\uparrow$ $F_{\beta}^{mean}\uparrow$
(a) Baseline 88.57 0.57 90.41 90.96 76.59 77.24
(b) + PWRF 92.24 0.28 98.37 98.31 87.35 88.37
(c) + Stacking ASA decoder 90.14 0.41 94.91 95.48 84.03 84.36
(d) + PWRF + Stacking ASA decoder 93.27 0.23 98.84 98.52 90.17 90.38

Different components. To verify the effectiveness of the different components in our method, we perform the ablation experiments reported in Table 9. First, comparing (a) & (b) and (a) & (c) in Table 9, the proposed PWRF and the stacking ASA decoder each significantly boost the performance. The improvements come from two aspects: i) PWRF dynamically captures informative semantics from different modalities for fusion; ii) the stacking ASA decoder helps to extract the primitive context of different modalities for prediction. Second, comparing (b) & (d) and (c) & (d) in Table 9, the combination of PWRF and the stacking ASA decoder achieves the highest performance, which confirms that the proposed components are complementary.

Different fusion mechanisms. To investigate the contribution of our PWRF to triple-modal fusion, we carry out several experiments by replacing PWRF with different fusion mechanisms, including addition, concatenation, a QKV attention mechanism, and EM routing [15], in our VDT salient object detection model. As shown in Table 10, there are three findings: i) our PWRF surpasses the simple addition and concatenation mechanisms thanks to its primitive fusion of multiple modalities; ii) compared with the attention mechanism, our model still achieves a significant superiority; iii) our PWRF outperforms the previous EM routing [15] by a large margin, which demonstrates the superiority of the DCR routing in our PWRF for VDT salient object detection. Illustrative implementations of these baselines are sketched after Table 10.

Table 10: Ablation study (%) on different triple-modal fusion strategies on the VDT-2048 dataset [38].
Settings $S\uparrow$ $MAE\downarrow$ $E_{\xi}^{adp}\uparrow$ $E_{\xi}^{mean}\uparrow$ $F_{\beta}^{adp}\uparrow$ $F_{\beta}^{mean}\uparrow$
Addition 91.72 0.32 95.96 96.82 85.98 87.14
Concatenation 90.14 0.41 94.17 95.48 80.31 84.36
QKV Attention 91.08 0.36 95.03 96.02 84.72 86.39
Concatenation + EM routing 71.59 1.04 82.57 87.13 33.88 52.67
Ours(PWRF) 93.27 0.23 98.84 98.52 90.17 90.38
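For clarity, the simpler fusion baselines in Table 10 can be approximated as below. These are generic stand-ins (element-wise addition, concatenation with a 1×1 projection, and QKV cross-attention over flattened spatial tokens, with the visible branch as the query), not the exact implementations used in our ablation.

```python
import torch
import torch.nn as nn

def addition_fusion(v, d, t):
    """Element-wise addition of the three modality features."""
    return v + d + t

class ConcatFusion(nn.Module):
    """Concatenate the three modality features and project back to C channels."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, v, d, t):
        return self.proj(torch.cat([v, d, t], dim=1))

class QKVFusion(nn.Module):
    """Cross-attention: visible features query depth and thermal tokens."""
    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, v, d, t):
        b, c, h, w = v.shape
        q = v.flatten(2).transpose(1, 2)                                     # (B, HW, C)
        kv = torch.cat([d.flatten(2), t.flatten(2)], dim=2).transpose(1, 2)  # (B, 2HW, C)
        out, _ = self.attn(q, kv, kv)
        return out.transpose(1, 2).view(b, c, h, w) + v                      # residual on the primary modality

# Drop-in usage with 64-channel feature maps from the three branches.
v, d, t = (torch.randn(2, 64, 16, 16) for _ in range(3))
fused = QKVFusion(64)(v, d, t)
```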

Stacking ASA decoder. The stacking ASA decoder contains two sub-decoders connected by a bridge. To examine its contribution, we conduct experiments with three settings: the baseline with the stacking ASA decoder removed, one sub-decoder, and two sub-decoders. As shown in Table 11, one sub-decoder clearly improves over the baseline, because ASA emphasizes semantic feature channels while suppressing noisy ones. In addition, stacking two sub-decoders performs better than one, which shows that the bridge connection between the two sub-decoders helps the model reach superior performance.

Table 11: Ablation analysis (%) for stacking ASA decoder on the VDT-2048 dataset [38].
No. Settings $S\uparrow$ $MAE\downarrow$ $E_{\xi}^{adp}\uparrow$ $E_{\xi}^{mean}\uparrow$ $F_{\beta}^{adp}\uparrow$ $F_{\beta}^{mean}\uparrow$
1 Baseline + PWRF 92.24 0.28 97.74 98.31 87.11 88.37
2 Baseline + PWRF + one sub-decoder 92.47 0.26 97.92 98.33 87.95 88.73
3 Baseline + PWRF + two sub-decoders 93.27 0.23 98.84 98.52 90.17 90.38

Ablation analysis on different modalities. To assess the impact of different modalities, we conduct four experiments detailed in Table 12, covering the modality combinations V+D, V+T, D+T, and V+D+T. Since our PWRF requires at least two distinct modality features for routing, single-modality experiments are omitted. For the combinations V+D, V+T, and D+T, we simply remove one capsule feature branch while keeping the other operations unchanged (see the sketch after Table 12). From Table 12, it is evident that V+T performs better than V+D and D+T. Leveraging all three modalities improves the performance significantly, which demonstrates the superiority of fusing more modalities over cross-modal fusion.

Table 12: Ablation study (%) on different modalities on the VDT-2048 dataset [38].
No. Settings $S\uparrow$ $MAE\downarrow$ $E_{\xi}^{adp}\uparrow$ $E_{\xi}^{mean}\uparrow$ $F_{\beta}^{adp}\uparrow$ $F_{\beta}^{mean}\uparrow$
1 V+D 90.84 0.33 97.57 97.61 84.77 85.51
2 V+T 92.85 0.26 98.65 98.43 89.02 89.45
3 D+T 90.21 0.35 97.88 97.79 84.30 84.68
4 V+D+T 93.27 0.23 98.84 98.52 90.17 90.38
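As referenced above, removing a modality for the ablations in Table 12 amounts to dropping that modality's part-level capsule branch before routing. The hypothetical helper below illustrates this with randomly generated capsule tensors; the shapes are assumptions for illustration, not the exact PWRF ones.

```python
from typing import Optional
import torch

# Hypothetical part-level capsules per modality: (B, K, D, H, W) with K capsule
# types of dimension D per spatial position.
part_caps = {m: torch.randn(2, 8, 8, 16, 16) for m in ("V", "D", "T")}

def gather_part_capsules(caps: dict, dropped: Optional[str] = None) -> torch.Tensor:
    """Stack the part-level capsules of the active modalities along the type axis;
    e.g., dropped='D' reproduces the V+T setting in Table 12."""
    active = [m for m in ("V", "D", "T") if m != dropped]
    return torch.cat([caps[m] for m in active], dim=1)  # (B, K * len(active), D, H, W)

vt_input = gather_part_capsules(part_caps, dropped="D")
print(vt_input.shape)  # torch.Size([2, 16, 8, 16, 16])
```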

5.6.3 Routing coefficients explanation

Most previous methods cannot interpret the fusion of different modalities, which limits their reliability in real-world applications. In contrast, our model provides an explanation for multi-modal fusion through the part-whole relational routing from the part-level modalities to the whole-level modality. Specifically, we plot the horizontal and vertical routing coefficients for one pixel in Fig. 9, where a higher routing value indicates a larger contribution of the corresponding modality to the fusion. In Fig. 9, we observe differences in the contributions of each modality under the horizontal and vertical routing conditions. The x-axis, y-axis, and z-axis represent the part-level capsule types, whole-level capsule types, and routing coefficients, respectively. For example, as shown in the red rectangle of Fig. 9(a) for the horizontal dimension, from the 8th part-level capsule to the first whole-level capsule, the depth and event modalities contribute substantially to the fusion while LiDAR contributes little, indicating that depth and event dominate the semantic understanding at this pixel, whereas LiDAR plays almost no role. By contrast, in the vertical direction at the same pixel, other modalities might dominate. Such differences arise from the unique characteristics captured by each modality along different spatial dimensions. These directional differences in contribution help capture complementary information from different modalities, thus enhancing the feature fusion. A simplified sketch of this inspection is given after Fig. 9.


(a) Horizontal routing coefficients.


(b) Vertical routing coefficients.

Figure 9: Explanation using the routing coefficients. Red, blue, and green markers represent the routing coefficients of the depth, event, and LiDAR modalities, respectively. The x-axis, y-axis, and z-axis denote the whole-level capsule types, part-level capsule types, and routing values, respectively.
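As an aside, the kind of routing-coefficient inspection shown in Fig. 9 can be reproduced schematically as follows: dynamic routing [41] normalizes per-part logits into coefficients over the whole-level capsules via a softmax, which can then be plotted per modality. The DCR routing [16] used in PWRF has a disentangled formulation, so this snippet is only a simplified illustration with random logits.

```python
import torch
import matplotlib.pyplot as plt

# Illustrative routing logits at one pixel: 8 part-level capsule types per modality
# for 3 modalities (24 rows) routed to 8 whole-level capsule types (columns).
logits = torch.randn(24, 8)
coeffs = torch.softmax(logits, dim=1)  # each part capsule distributes its vote over the whole capsules

fig, axes = plt.subplots(1, 3, figsize=(12, 3), sharey=True)
for idx, (ax, name) in enumerate(zip(axes, ("depth", "event", "LiDAR"))):
    block = coeffs[idx * 8:(idx + 1) * 8]  # this modality's 8 part-level capsule types
    im = ax.imshow(block.numpy(), cmap="viridis", vmin=0, vmax=1)
    ax.set_title(name)
    ax.set_xlabel("whole-level capsule type")
axes[0].set_ylabel("part-level capsule type")
fig.colorbar(im, ax=axes, label="routing coefficient")
plt.show()
```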

6 Limitations and Future Works

6.1 Limitations

Complexity and Resource Requirements. The proposed PWRF framework involves routing operations in CapsNets as well as multi-modal data fusion. Despite utilizing the lightweight DCR [16] mechanism, the framework still has high computational complexity and resource demands, especially when processing high-resolution and large-scale multi-modal data. This could limit its application in resource-constrained environments.

Alignment and Noise in Multi-Modal Data. Multi-modal sensors often suffer from spatial and temporal misalignment, and certain modalities (e.g., depth and event data) tend to contain significant noise. Although PWRF employs CapsNets to extract both modal-shared and modal-specific information, these issues have not been entirely addressed in the current implementation, which could negatively impact the overall quality of the results. Additionally, the current usage of attention mechanisms in the framework is relatively preliminary, and there is significant potential for further integration to improve feature selection and noise reduction, thereby enhancing robustness and the ability to capture critical information.

Adaptability and Generalization. While PWRF has demonstrated notable performance in applications such as autonomous driving and multi-modal object detection, its adaptability and generalization to other domains, such as multi-modal emotion recognition [68] and medical imaging analysis [69], have not yet been thoroughly evaluated. The effectiveness of the framework in these new domains remains an open problem for future research.

6.2 Future Work

Lightweight Design for Real-Time Optimization. We intend to further reduce the computational complexity of PWRF by exploring new lightweight architectures or combining PWRF with attention mechanisms [70, 71], to achieve higher performance in resource-limited environments and make the framework more suitable for real-time applications.

Expansion to More Application Scenarios. We intend to extend the PWRF framework to other multi-modal tasks, such as emotion recognition [68] and medical imaging analysis [69]. Through experimentation in these new tasks, we hope to validate the generalizability of the model and improve its performance across different industries and applications.

7 Conclusion

In this article, we have presented a novel multi-modal fusion model from the perspective of part-whole relational fusion, which treated multi-modal fusion as routing each individual part-level modality to the fused whole-level modality. Using disentangled capsule routing, we modeled the modal-shared and modal-specific details for primitive fusion. Experiments on SMM semantic segmentation and VDT salient object detection demonstrate the superiority of the proposed PWRF framework for multi-modal scene understanding. In the future, we will study more primitive capsule routing for part-whole relational fusion and fusion explainability for reliable applications.

Acknowledgements

This work is supported in part by the National Natural Science Foundation of Jiangsu Province under Grant No. BK20221379, and in part by the State Key Laboratory of Reliability and Intelligence of Electrical Equipment, Hebei University of Technology, under Grant EERI KF2022005.

References

  • Huang et al. [2021] Huang, Y., Du, C., Xue, Z., Chen, X., Zhao, H., Huang, L.: What makes multi-modal learning better than single (provably). Advances in Neural Information Processing Systems 34, 10944–10956 (2021)
  • Wang et al. [2023] Wang, Y., Mao, Q., Zhu, H., Deng, J., Zhang, Y., Ji, J., Li, H., Zhang, Y.: Multi-modal 3d object detection in autonomous driving: a survey. International Journal of Computer Vision 131(8), 2122–2152 (2023)
  • Liu et al. [2024] Liu, J., Lin, R., Wu, G., Liu, R., Luo, Z., Fan, X.: Coconet: Coupled contrastive learning network with multi-level feature ensemble for multi-modality image fusion. International Journal of Computer Vision 132(5), 1748–1775 (2024)
  • Planamente et al. [2024] Planamente, M., Plizzari, C., Peirone, S.A., Caputo, B., Bottino, A.: Relative norm alignment for tackling domain shift in deep multi-modal classification. International Journal of Computer Vision 132, 2618–2638 (2024)
  • Zhu et al. [2024] Zhu, X.-F., Xu, T., Liu, Z., Tang, Z., Wu, X.-J., Kittler, J.: Unimod1k: Towards a more universal large-scale dataset and benchmark for multi-modal learning. International Journal of Computer Vision 132, 2845–2860 (2024)
  • Cao et al. [2021] Cao, J., Leng, H., Lischinski, D., Cohen-Or, D., Tu, C., Li, Y.: Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7088–7097 (2021)
  • Sun et al. [2022] Sun, Y., Cao, B., Zhu, P., Hu, Q.: Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning. IEEE Transactions on Circuits and Systems for Video Technology 32(10), 6700–6713 (2022)
  • Chen et al. [2021] Chen, L.-Z., Lin, Z., Wang, Z., Yang, Y.-L., Cheng, M.-M.: Spatial information guided convolution for real-time rgbd semantic segmentation. IEEE Transactions on Image Processing 30, 2313–2324 (2021)
  • Xiang et al. [2021] Xiang, K., Yang, K., Wang, K.: Polarization-driven semantic segmentation via efficient attention-bridged fusion. Optics Express 29(4), 4802–4820 (2021)
  • Zhou et al. [2021] Zhou, W., Liu, J., Lei, J., Yu, L., Hwang, J.-N.: Gmnet: Graded-feature multilabel-learning network for rgb-thermal urban scene semantic segmentation. IEEE Transactions on Image Processing 30, 7790–7802 (2021)
  • Wang et al. [2022] Wang, Y., Chen, X., Cao, L., Huang, W., Sun, F., Wang, Y.: Multimodal token fusion for vision transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12186–12195 (2022)
  • Wan et al. [2023] Wan, B., Zhou, X., Sun, Y., Wang, T., Lv, C., Wang, S., Yin, H., Yan, C.: Mffnet: Multi-modal feature fusion network for vdt salient object detection. IEEE Transactions on Multimedia (2023)
  • Broedermann et al. [2023] Broedermann, T., Sakaridis, C., Dai, D., Van Gool, L.: Hrfuser: A multi-resolution sensor fusion architecture for 2d object detection. In: 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), pp. 4159–4166 (2023). IEEE
  • Zhang et al. [2023] Zhang, J., Liu, R., Shi, H., Yang, K., Reiß, S., Peng, K., Fu, H., Wang, K., Stiefelhagen, R.: Delivering arbitrary-modal semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1136–1147 (2023)
  • Hinton et al. [2018] Hinton, G.E., Sabour, S., Frosst, N.: Matrix capsules with em routing. In: International Conference on Learning Representations (2018)
  • Liu et al. [2022] Liu, Y., Zhang, D., Liu, N., Xu, S., Han, J.: Disentangled capsule routing for fast part-object relational saliency. IEEE Transactions on Image Processing 31, 6719–6732 (2022)
  • Wang et al. [2023] Wang, H., Chen, Y., Ma, C., Avery, J., Hull, L., Carneiro, G.: Multi-modal learning with missing modality via shared-specific feature modelling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15878–15887 (2023)
  • Long et al. [2015] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440 (2015)
  • Jin et al. [2021] Jin, Z., Gong, T., Yu, D., Chu, Q., Wang, J., Wang, C., Shao, J.: Mining contextual information beyond image for semantic segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7231–7241 (2021)
  • Borse et al. [2021] Borse, S., Wang, Y., Zhang, Y., Porikli, F.: Inverseform: A loss function for structured boundary-aware segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5901–5911 (2021)
  • Gu et al. [2022] Gu, J., Kwon, H., Wang, D., Ye, W., Li, M., Chen, Y.-H., Lai, L., Chandra, V., Pan, D.Z.: Multi-scale high-resolution vision transformer for semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12094–12103 (2022)
  • Hazirbas et al. [2017] Hazirbas, C., Ma, L., Domokos, C., Cremers, D.: Fusenet: Incorporating depth into semantic segmentation via fusion-based cnn architecture. In: Computer Vision–ACCV 2016: 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, Revised Selected Papers, Part I 13, pp. 213–228 (2017). Springer
  • Wang et al. [2016] Wang, J., Wang, Z., Tao, D., See, S., Wang, G.: Learning common and specific features for rgb-d semantic segmentation with deconvolutional networks. In: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14, pp. 664–679 (2016). Springer
  • Wang et al. [2020] Wang, Y., Huang, W., Sun, F., Xu, T., Rong, Y., Huang, J.: Deep multimodal fusion by channel exchanging. Advances in neural information processing systems 33, 4835–4845 (2020)
  • Zhao et al. [2023] Zhao, L., Zhou, H., Zhu, X., Song, X., Li, H., Tao, W.: Lif-seg: Lidar and camera image fusion for 3d lidar semantic segmentation. IEEE Transactions on Multimedia 26, 1158–1168 (2023)
  • Liu et al. [2022] Liu, H., Lu, T., Xu, Y., Liu, J., Li, W., Chen, L.: Camliflow: bidirectional camera-lidar fusion for joint optical flow and scene flow estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5791–5801 (2022)
  • Liang et al. [2022] Liang, Y., Wakaki, R., Nobuhara, S., Nishino, K.: Multimodal material segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19800–19808 (2022)
  • Zhang et al. [2023] Zhang, J., Liu, H., Yang, K., Hu, X., Liu, R., Stiefelhagen, R.: Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. IEEE Transactions on Intelligent Transportation Systems (2023)
  • Tian et al. [2023] Tian, X., Zhang, J., Xiang, M., Dai, Y.: Modeling the distributional uncertainty for salient object detection models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19660–19670 (2023)
  • Li et al. [2023] Li, G., Bai, Z., Liu, Z., Zhang, X., Ling, H.: Salient object detection in optical remote sensing images driven by transformer. IEEE Transactions on Image Processing (2023)
  • Yuan et al. [2023] Yuan, J., Zhu, A., Xu, Q., Wattanachote, K., Gong, Y.: Ctif-net: A cnn-transformer iterative fusion network for salient object detection. IEEE Transactions on Circuits and Systems for Video Technology (2023)
  • Liu et al. [2021] Liu, Y., Zhang, D., Zhang, Q., Han, J.: Part-object relational visual saliency. IEEE transactions on pattern analysis and machine intelligence 44(7), 3688–3704 (2021)
  • Liu et al. [2023] Liu, Y., Zhou, L., Wu, G., Xu, S., Han, J.: Tcgnet: Type-correlation guidance for salient object detection. IEEE Transactions on Intelligent Transportation Systems (2023)
  • Wu et al. [2023] Wu, Z., Allibert, G., Meriaudeau, F., Ma, C., Demonceaux, C.: Hidanet: Rgb-d salient object detection via hierarchical depth awareness. IEEE Transactions on Image Processing 32, 2160–2173 (2023)
  • Li et al. [2023] Li, J., Ji, W., Zhang, M., Piao, Y., Lu, H., Cheng, L.: Delving into calibrated depth for accurate rgb-d salient object detection. International Journal of Computer Vision 131(4), 855–876 (2023)
  • Xie et al. [2023] Xie, Z., Shao, F., Chen, G., Chen, H., Jiang, Q., Meng, X., Ho, Y.-S.: Cross-modality double bidirectional interaction and fusion network for rgb-t salient object detection. IEEE Transactions on Circuits and Systems for Video Technology (2023)
  • Zhang et al. [2023] Zhang, Z., Wang, J., Han, Y.: Saliency prototype for rgb-d and rgb-t salient object detection. In: Proceedings of the 31st ACM International Conference on Multimedia, pp. 3696–3705 (2023)
  • Song et al. [2022] Song, K., Wang, J., Bao, Y., Huang, L., Yan, Y.: A novel visible-depth-thermal image dataset of salient object detection for robotic visual perception. IEEE/ASME Transactions on Mechatronics 28(3), 1558–1569 (2022)
  • Wan et al. [2024] Wan, B., Zhou, X., Sun, Y., Zhu, Z., Wang, H., Yan, C., et al.: Tmnet: Triple-modal interaction encoder and multi-scale fusion decoder network for vdt salient object detection. Pattern Recognition 147, 110074 (2024)
  • Bao et al. [2024] Bao, L., Zhou, X., Lu, X., Sun, Y., Yin, H., Hu, Z., Zhang, J., Yan, C.: Quality-aware selective fusion network for vdt salient object detection. IEEE Transactions on Image Processing (2024)
  • Sabour et al. [2017] Sabour, S., Frosst, N., Hinton, G.E.: Dynamic routing between capsules. Advances in neural information processing systems 30 (2017)
  • Pan and Velipasalar [2021] Pan, C., Velipasalar, S.: Pt-capsnet: A novel prediction-tuning capsule network suitable for deeper architectures. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11996–12005 (2021)
  • Shi et al. [2022] Shi, R., Niu, L., Zhou, R.: Sparse capsnet with explicit regularizer. Pattern Recognition 124, 108486 (2022)
  • Liu et al. [2024] Liu, Y., Cheng, D., Zhang, D., Xu, S., Han, J.: Capsule networks with residual pose routing. IEEE Transactions on Neural Networks and Learning Systems (2024)
  • Jampour et al. [2021] Jampour, M., Abbaasi, S., Javidi, M.: Capsnet regularization and its conjugation with resnet for signature identification. Pattern Recognition 120, 107851 (2021)
  • Liu et al. [2021] Liu, Y., Zhang, D., Zhang, Q., Han, J.: Integrating part-object relationship and contrast for camouflaged object detection. IEEE Transactions on Information Forensics and Security 16, 5154–5166 (2021)
  • Wu et al. [2022] Wu, J., Mai, S., Hu, H.: Interpretable multimodal capsule fusion. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, 1815–1826 (2022)
  • He et al. [2016] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  • Xie et al. [2021] Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems 34, 12077–12090 (2021)
  • Shrivastava et al. [2016] Shrivastava, A., Gupta, A., Girshick, R.: Training region-based object detectors with online hard example mining. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769 (2016)
  • Liu et al. [2021] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
  • De Boer et al. [2005] De Boer, P.-T., Kroese, D.P., Mannor, S., Rubinstein, R.Y.: A tutorial on the cross-entropy method. Annals of operations research 134, 19–67 (2005)
  • Li et al. [2021] Li, J., Su, J., Xia, C., Ma, M., Tian, Y.: Salient object detection with purificatory mechanism and structural similarity loss. IEEE Transactions on Image Processing 30, 6855–6868 (2021)
  • Rahman and Wang [2016] Rahman, M.A., Wang, Y.: Optimizing intersection-over-union in deep neural networks for image segmentation. In: International Symposium on Visual Computing, pp. 234–244 (2016). Springer
  • Liu et al. [2021] Liu, Z., Tan, Y., He, Q., Xiao, Y.: Swinnet: Swin transformer drives edge-aware rgb-d and rgb-t salient object detection. IEEE Transactions on Circuits and Systems for Video Technology 32(7), 4486–4497 (2021)
  • Fan et al. [2017] Fan, D.-P., Cheng, M.-M., Liu, Y., Li, T., Borji, A.: Structure-measure: A new way to evaluate foreground maps. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 4548–4557 (2017)
  • Achanta et al. [2009] Achanta, R., Hemami, S., Estrada, F., Susstrunk, S.: Frequency-tuned salient region detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1597–1604 (2009)
  • Fan et al. [2018] Fan, D.-P., Gong, C., Cao, Y., Ren, B., Cheng, M.-M., Borji, A.: Enhanced-alignment measure for binary foreground map evaluation. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 698–704 (2018)
  • Wu et al. [2019] Wu, Z., Su, L., Huang, Q.: Cascaded partial decoder for fast and accurate salient object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3907–3916 (2019)
  • Chen et al. [2018] Chen, S., Tan, X., Wang, B., Hu, X.: Reverse attention for salient object detection. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 234–250 (2018)
  • Fan et al. [2020] Fan, D.-P., Zhai, Y., Borji, A., Yang, J., Shao, L.: Bbs-net: Rgb-d salient object detection with a bifurcated backbone strategy network. In: European Conference on Computer Vision, pp. 275–292 (2020). Springer
  • Chen et al. [2020] Chen, Z., Cong, R., Xu, Q., Huang, Q.: Dpanet: Depth potentiality-aware gated attention network for rgb-d salient object detection. IEEE Transactions on Image Processing 30, 7012–7024 (2020)
  • Chen et al. [2022] Chen, Q., Zhang, Z., Lu, Y., Fu, K., Zhao, Q.: 3-d convolutional neural networks for rgb-d salient object detection and beyond. IEEE Transactions on Neural Networks and Learning Systems 35(3), 4309–4323 (2022)
  • Tang et al. [2022] Tang, B., Liu, Z., Tan, Y., He, Q.: Hrtransnet: Hrformer-driven two-modality salient object detection. IEEE Transactions on Circuits and Systems for Video Technology 33(2), 728–742 (2022)
  • Wang et al. [2021] Wang, J., Song, K., Bao, Y., Huang, L., Yan, Y.: Cgfnet: Cross-guided fusion network for rgb-t salient object detection. IEEE Transactions on Circuits and Systems for Video Technology 32(5), 2949–2961 (2021)
  • Tu et al. [2022] Tu, Z., Li, Z., Li, C., Tang, J.: Weakly alignment-free rgbt salient object detection with deep correlation network. IEEE Transactions on Image Processing 31, 3752–3764 (2022)
  • Zhou et al. [2023] Zhou, W., Zhu, Y., Lei, J., Yang, R., Yu, L.: Lsnet: Lightweight spatial boosting network for detecting salient objects in rgb-thermal images. IEEE Transactions on Image Processing 32, 1329–1340 (2023)
  • Yang et al. [2023] Yang, D., Chen, Z., Wang, Y., Wang, S., Li, M., Liu, S., Zhao, X., Huang, S., Dong, Z., Zhai, P., et al.: Context de-confounded emotion recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19005–19015 (2023)
  • Rana and Bhushan [2023] Rana, M., Bhushan, M.: Machine learning and deep learning approach for medical image analysis: diagnosis to detection. Multimedia Tools and Applications 82(17), 26731–26769 (2023)
  • Alman and Song [2024] Alman, J., Song, Z.: Fast attention requires bounded entries. Advances in Neural Information Processing Systems 36 (2024)
  • Agarwal and Arora [2023] Agarwal, A., Arora, C.: Attention attention everywhere: Monocular depth prediction with skip attention. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 5861–5870 (2023)