[1] School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
[2] Endoscopy Center, Zhongshan Hospital of Fudan University, Shanghai 200032, China
Deep Multimodal Collaborative Learning for Polyp Re-Identification
Abstract
Colonoscopic polyp re-identification aims to match the same polyp from a large gallery of images taken from different views with different cameras, and it plays an important role in the computer-aided prevention and treatment of colorectal cancer. However, traditional object ReID methods that directly adopt CNN models trained on ImageNet usually produce unsatisfactory retrieval performance on colonoscopic datasets due to the large domain gap. Worse still, these solutions typically learn unimodal representations from visual samples alone, which fails to exploit complementary information from other modalities. To address this challenge, we propose a novel Deep Multimodal Collaborative Learning framework named DMCL for polyp re-identification, which effectively encourages modality collaboration and reinforces generalization capability in medical scenarios. On top of it, a dynamic multimodal feature fusion strategy is introduced to leverage the optimized multimodal representations for multimodal fusion via end-to-end training. Experiments on standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the specialized multimodal fusion strategy, demonstrating that learning representations from multiple modalities can be competitive with methods based on unimodal representation learning. We also hope that our method will shed light on related research, especially on multimodal collaborative learning. The code is publicly available at https://github.com/JeremyXSC/DMCL.
keywords:
Colonoscopic Polyp Re-Identification, modal representation, multimodal collaborative learning, generalization capability

1 Introduction
Colonoscopic polyp re-identification (Polyp ReID) aims to match a specific polyp across a large gallery captured by different cameras at different locations, and it has been studied intensively due to its practical importance in the computer-aided prevention and treatment of colorectal cancer (Koffas et al, 2022). With the development of deep convolutional neural networks and the availability of video re-identification datasets, video retrieval methods have achieved remarkable performance in a supervised manner (Feng et al, 2019; Xiang et al, 2023b, c), where a model is trained and tested on different splits of the same dataset. However, in practice, manually labeling a large diversity of pairwise polyp-area data is time-consuming and labor-intensive when directly deploying a polyp ReID system to new hospital scenarios (Chen et al, 2023). Moreover, compared with conventional ReID (Xiang et al, 2020c, a), polyp ReID faces additional challenges: 1) from the model perspective, traditional object ReID methods learn unimodal representations by greedily "pre-training" several layers of features on visual samples alone (Xiang et al, 2023a, 2024a), ignoring complementary information from other modalities; and 2) from the data perspective, polyp ReID suffers from large variations in background, viewpoint, and illumination, which poses great challenges to the clinical deployment of deep models in real-world scenarios.
Based on the aforementioned findings, we propose a novel deep multimodal collaborative learning framework named DMCL to encourage modality collaboration and reinforce generalization capability in medical scenarios. On top of it, a dynamic multimodal training strategy is introduced to leverage the unimodal representations for multimodal fusion via end-to-end training on multimodal tasks, thereby improving the performance of our model, as shown in Fig. 1. To the best of our knowledge, this is the first attempt to employ visual-text features with a collaborative training mechanism for colonoscopic polyp re-identification.
In summary, the major contributions of our work are as follows:
-
A deep multimodal collaborative learning framework, DMCL, is proposed for the first time to exploit visual and textual information simultaneously, thereby promoting the development of multimodal polyp re-identification.
-
Based on the DMCL framework, we introduce a simple but effective multimodal feature fusion strategy, which greatly helps the model learn more discriminative information during the training and testing stages.
-
Comprehensive experiments on standard benchmark datasets demonstrate that our method achieves state-of-the-art results, proving its efficacy.
The remainder of this paper is structured as follows. Section 2 reviews related work on hand-crafted and deep learning based methods in the medical area and briefly positions our method. Section 3 presents the details of our multimodal collaborative learning method as well as the dynamic feature fusion strategy. Extensive evaluations against state-of-the-art methods and comprehensive analyses of the proposed approach are reported in Section 4. Finally, Section 5 concludes the paper and discusses future work.

2 Related Work
In this section, we briefly review related work on re-identification. The mainstream idea of existing methods is to learn a discriminative and robust model for the downstream polyp ReID task. These methods can be roughly divided into hand-crafted approaches and deep learning based approaches.
2.1 Hand-crafted based Approaches
Traditional research works (Prosser et al, 2010; Zhao et al, 2013, 2014) on hand-crafted systems for the image retrieval task aim to design or learn discriminative representations of pedestrian features. For example, Prosser et al (2010) reformulate person re-identification as a learning-to-rank problem. Zhao et al (2013) exploit the pairwise salience distribution relationship between pedestrian images and solve person re-identification by proposing a salience matching strategy. Besides directly using mid-level color and texture features, some methods (Zhao et al, 2014; Xiang et al, 2020a) also explore the different discriminative abilities of local patches for robust generalization. Unfortunately, these hand-crafted feature based approaches fail to produce competitive results on large-scale datasets. The main reason is that these early works are mostly based on heuristic design, and thus cannot learn optimal discriminative features on current large-scale datasets.
2.2 Deep Learning based Approaches
Recently, there has been significant research interest in deep learning based approaches for image or video retrieval (Shao et al, 2021; Xiang et al, 2023a; Ma et al, 2020; Xiang et al, 2020a). For example, Shao et al (2021) propose a temporal context aggregation method that incorporates long-range temporal information for content-based video retrieval. Xiang et al (2020a) propose a feature fusion strategy based on a traditional convolutional neural network for pedestrian retrieval. Ma et al (2020) explore video moment retrieval in a weakly-supervised manner. Lin et al (2020) propose a semantic completion network with a proposal generation module that scores all candidate proposals in a single pass. Besides, Kordopatis-Zilos et al (2022) propose a video retrieval framework based on knowledge distillation that addresses the performance-efficiency trade-off. As for self-supervised learning, Wu et al (2018) present an unsupervised feature learning approach called instance-wise contrastive learning, which maximizes the distinction between instances via a non-parametric softmax formulation. In addition, Chen et al (2023) propose a self-supervised contrastive representation learning scheme named Colo-SCRL to learn spatial representations from a colonoscopic video dataset. However, on the one hand, all of the above approaches divide the learning process into multiple stages, each requiring independent optimization. On the other hand, despite the tremendous success achieved by deep learning based approaches, semantic feature embedding remains largely unexplored in this area.
In this work, we propose a novel Deep Multimodal Collaborative Learning framework named DMCL for multimodal polyp re-identification, which remarkably encourages modality collaboration and reinforces generalization capability in medical scenarios. On top of it, a dynamic multimodal feature fusion strategy is proposed to fully exploit the prior knowledge of visual-text features, thereby improving the accuracy of polyp ReID in complex textured regions. To the best of our knowledge, this is the first attempt to employ visual-text features with a collaborative learning mechanism for the colonoscopic polyp re-identification task. The proposed deep multimodal collaborative learning system not only improves polyp ReID performance but also effectively maintains the semantic consistency of the extracted features, which ensures the reliability and usability of the paradigm in various application scenarios and provides strong support for its application.
3 Our Method
In this section, we first give the problem definition of the colonoscopic polyp ReID task. Then we introduce the deep multimodal collaborative learning framework (DMCL), as illustrated in Fig. 2. Finally, we elaborate on the details of our multimodal fusion strategy.
3.1 Preliminary
Assume that we are given a training dataset $\mathcal{D} = \{x_i\}_{i=1}^{N}$ with its own label space in terms of visual and textual features, where $N$ is the number of images in $\mathcal{D}$. Each sample $x_i$ is associated with an identity label $y_i \in \{1, 2, \ldots, M\}$, where $M$ is the number of identities in $\mathcal{D}$, along with the corresponding text description from the colonoscopic polyp dataset. The main goal of this work is to learn a robust polyp ReID model on the basis of the visual-text representation.
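To make the formulation concrete, the snippet below sketches one plausible way to organize a training sample as described above; the class and field names are illustrative assumptions rather than the authors' actual data schema.

```python
# Hypothetical container for one DMCL training sample: a colonoscopic
# image, its paired text description, and the polyp identity label.
from dataclasses import dataclass

@dataclass
class PolypSample:
    image_path: str    # visual sample x_i
    description: str   # paired text description of the polyp
    identity: int      # identity label y_i in {1, ..., M}

sample = PolypSample(
    image_path="frames/polyp_0001_view_03.jpg",   # illustrative path
    description="A small pedunculated polyp in the sigmoid colon.",
    identity=1,
)
```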

3.2 Our Proposed DMCL network
In this work, we design a DMCL framework for polyp re-identification in medical scenarios, which includes a visual feature extractor, a textual feature encoder and dynamic multimodal feature fusion strategy.
Image Encoder. The image encoder serves as the visual feature extractor in our DMCL framework. Specifically, we adopt ResNet-50 (He et al, 2016) as the backbone of the image encoder module. Benefiting from its residual connection design, the ResNet-50 network effectively alleviates the gradient-vanishing problem, enabling the model to learn deep image features. Moreover, the image encoder extracts rich details and global structures to form an efficient feature encoding, exhibiting strong performance and generalization ability in feature extraction. Given a 224 × 224 × 3 polyp image, the image encoder encodes the input into a 2048-dimensional vector for subsequent feature fusion and processing, which provides a strong foundation for the downstream polyp ReID task.
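As a rough illustration of this branch, the following sketch builds a 2048-dimensional visual embedding with a torchvision ResNet-50 (assuming torchvision ≥ 0.13); the class name and the choice of ImageNet weights are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class PolypImageEncoder(nn.Module):
    """ResNet-50 backbone with the ImageNet classification head removed."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Keep everything up to (and including) global average pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, x):            # x: (B, 3, 224, 224)
        f = self.features(x)         # (B, 2048, 1, 1)
        return f.flatten(1)          # (B, 2048) visual embedding

images = torch.randn(4, 3, 224, 224)
img_feat = PolypImageEncoder()(images)   # shape: (4, 2048)
```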
Text Encoder. For the text encoder module, we employ the BERT model (Devlin et al, 2018) to encode the text description of the corresponding polyp. In essence, BERT is a pre-trained language model based on the Transformer architecture, which learns language representations through masked language modeling and next-sentence prediction on a large amount of unlabeled text. In this work, given a text description of the polyp with $n$ characters, BERT encodes it into a 768-dimensional vector matrix $T \in \mathbb{R}^{n \times 768}$ for subsequent feature fusion and processing. The bidirectional encoder design of BERT enables it to capture the contextual information of words, generating more accurate text representations.
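A minimal sketch of this branch with the HuggingFace transformers implementation of BERT is given below; the bert-base-uncased checkpoint and the example description are assumptions for illustration.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical polyp description paired with a colonoscopic image.
description = "A small sessile polyp located in the ascending colon."
tokens = tokenizer(description, return_tensors="pt", truncation=True)

with torch.no_grad():
    out = bert(**tokens)

text_feat = out.last_hidden_state   # (1, n, 768): one 768-d vector per token
```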
Dynamic Multimodal Feature Fusion. In general scenarios, there exists an obvious discrepancy between the image and text data in our dataset, which inevitably has a negative impact on the model's performance. To overcome this issue, we introduce a multimodal fusion strategy based on the self-attention mechanism, which is a crucial part of our proposed deep collaborative training framework. To be more specific, we obtain the image encoding $F_v$ after passing the image through the image encoder, and its dimension is then reduced from the original 2048 to 768. Similarly, we obtain the text encoding through the text encoder, which is denoted as:
$F_t = \mathrm{BERT}(T) = [t_1, t_2, \ldots, t_n]$   (1)
where $F_t$ denotes the text encoding and $i$ indexes the characters in the polyp's text description. Subsequently, we concatenate the image encoding and the text encoding to form a new fused encoding, which can be formalized as:
$F_f = \mathcal{F}\big(\mathrm{Concat}(F_v, F_t)\big)$   (2)
where $F_f$ represents the features generated by the multimodal fusion component $\mathcal{F}$. Remarkably, the multimodal fusion component consists of multiple self-attention layers that accomplish the fusion of visual and text features. The concatenated feature is projected by three linear layers into the query, key and value vectors, which are then fed into the self-attention module. The output of the self-attention module is:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right)V$   (3)
where $Q$, $K$ and $V$ represent the query, key and value of the self-attention mechanism respectively, and $d_k$ denotes the dimension of the input vectors. After modal fusion, we obtain the updated features:
$\hat{F} = \mathrm{Attention}(Q, K, V)$   (4)
Generally speaking, the visual feature is of paramount importance for the task of polyp re-identification. Inspired by this, we take the feature of $\hat{F}$ at the position corresponding to the original image feature, which is adopted as the test-time feature of the corresponding polyp for the downstream polyp ReID task. The whole procedure of our collaborative multimodal training system is illustrated in Algorithm 1.
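The sketch below illustrates the fusion step described above: the 2048-d visual feature is projected to 768-d, concatenated with the token-level text features, passed through stacked self-attention layers, and the updated feature at the image position is kept for testing. The number of attention heads and layers, and all class names, are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8, num_layers=3):
        super().__init__()
        self.img_proj = nn.Linear(2048, dim)   # reduce visual feature: 2048 -> 768
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, img_feat, text_feat):
        # img_feat: (B, 2048), text_feat: (B, n, 768)
        img_token = self.img_proj(img_feat).unsqueeze(1)   # (B, 1, 768)
        fused = torch.cat([img_token, text_feat], dim=1)   # (B, 1 + n, 768)
        for attn in self.attn_layers:
            fused, _ = attn(fused, fused, fused)           # self-attention: Q = K = V
        # Updated feature at the image-token position, used at test time.
        return fused[:, 0]                                 # (B, 768)

fusion = MultimodalFusion()
out = fusion(torch.randn(2, 2048), torch.randn(2, 20, 768))   # (2, 768)
```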
3.3 Dynamic Network Updating
In essence, many previous works (Luo et al, 2019; Xiang et al, 2023a, 2020b) have found that training with multiple losses has great potential for learning a robust and generalizable ReID model. During training, we adopt an identification loss (Zheng et al, 2017a) and a triplet loss (Hermans et al, 2017) as the optimization objectives for our DMCL model. Specifically, the triplet loss aims to reduce the feature distance between images of the same polyp while increasing the feature distance between different polyps. The identification loss, on the other hand, enhances the model's ability to recognize and classify polyp categories.
To be more specific, for a single-label classification task, the identification loss (cross-entropy loss) is written as,
$\mathcal{L}_{id} = -\dfrac{1}{N_b}\sum_{i=1}^{N_b}\log p(y_i \mid x_i)$   (5)
where $N_b$ is the number of labeled training images in a batch, and $p(y_i \mid x_i)$ is the predicted probability of the input $x_i$ belonging to its ground-truth class $y_i$. For a triplet $\{x_a, x_p, x_n\}$, the soft-margin triplet loss can be calculated as:
$\mathcal{L}_{tri} = \log\big(1 + \exp(\lVert f_a - f_p \rVert_2 - \lVert f_a - f_n \rVert_2)\big)$   (6)
where $x_a$, $x_p$ and $x_n$ denote the anchor image, the positive sample and the negative sample respectively, and $f_a$, $f_p$ and $f_n$ are their corresponding features. As for triplet selection, we randomly select $P$ different polyps from the patients and sample $K$ images or texts from the image or text modality for each patient to form the triplets, which greatly helps to construct a discriminative embedding for multimodal representation learning.
Finally, the overall objective loss function in a training batch is expressed as:
$\mathcal{L}_{total} = \mathcal{L}_{id} + \mathcal{L}_{tri}$   (7)
In this way, our deep multimodal collaborative learning scheme fully considers the information from both images and texts, providing stronger support for clinical colonoscopic polyp examination and improving the model's performance by a clear margin.
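A minimal sketch of this objective, assuming a standard cross-entropy identification loss and the soft-margin form of the triplet loss in Eq. (6), is shown below; batch construction and the classifier head are simplified for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_margin_triplet_loss(anchor, positive, negative):
    # log(1 + exp(d(a, p) - d(a, n))), averaged over the batch
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.softplus(d_ap - d_an).mean()

id_criterion = nn.CrossEntropyLoss()

def total_loss(logits, labels, anchor, positive, negative):
    # Eq. (7): identification loss plus soft-margin triplet loss
    return id_criterion(logits, labels) + soft_margin_triplet_loss(anchor, positive, negative)

logits = torch.randn(8, 100)             # scores over 100 polyp identities (assumed)
labels = torch.randint(0, 100, (8,))
a, p, n = (torch.randn(8, 768) for _ in range(3))
loss = total_loss(logits, labels, a, p, n)
```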
4 Experiments
4.1 Datasets and Evaluation Metric
We conduct experiments on several large-scale public datasets, including Colo-Pair (Chen et al, 2023), Market-1501 (Zheng et al, 2015), DukeMTMC-reID (Zheng et al, 2017b) and CUHK03 (Li et al, 2014). We follow the standard evaluation protocol (Zheng et al, 2015) used in the ReID task and adopt mean Average Precision (mAP) and Cumulative Matching Characteristics (CMC) at Rank-1, Rank-5 and Rank-10 for performance evaluation.
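For reference, the sketch below computes CMC and mAP from a precomputed query-gallery distance matrix under a simplified single-shot protocol; the camera-ID filtering used in the full Market-1501 protocol is omitted, and every query is assumed to have at least one correct gallery match.

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, ranks=(1, 5, 10)):
    """dist: (num_query, num_gallery) distances; q_ids / g_ids: identity labels."""
    order = np.argsort(dist, axis=1)            # gallery sorted by ascending distance
    matches = g_ids[order] == q_ids[:, None]    # True where the ranked item is a correct match

    # CMC: fraction of queries whose first correct match appears within the top-k
    first_hit = matches.argmax(axis=1)          # assumes each query has >= 1 correct match
    cmc = {k: float((first_hit < k).mean()) for k in ranks}

    # mAP: mean over queries of the average precision at each correct match
    aps = []
    for row in matches:
        hits = np.where(row)[0]
        precision = (np.arange(hits.size) + 1) / (hits + 1)
        aps.append(precision.mean())
    return cmc, float(np.mean(aps))
```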
Method | Venue | mAP | Rank-1 | Rank-5 | Rank-10 (video retrieval)
ViSiL (Kordopatis-Zilos et al, 2019) | ICCV 19 | 24.9 | 14.5 | 30.6 | 51.6 |
CoCLR (Han et al, 2020) | NIPS 20 | 16.3 | 6.5 | 22.6 | 33.9 |
TCA (Shao et al, 2021) | WACV 21 | 27.8 | 16.1 | 35.5 | 53.2 |
ViT (Caron et al, 2021) | CVPR 21 | 20.4 | 9.7 | 30.6 | 43.5 |
CVRL (Qian et al, 2021) | CVPR 21 | 23.6 | 11.3 | 32.3 | 53.2 |
CgSc (Kordopatis-Zilos et al, 2022) | IJCV 22 | 21.4 | 8.1 | 35.5 | 45.2 |
FgAttS (Kordopatis-Zilos et al, 2022) | IJCV 22 | 23.6 | 9.7 | 40.3 | 50.0 |
FgBinS (Kordopatis-Zilos et al, 2022) | IJCV 22 | 21.2 | 9.7 | 32.3 | 48.4 |
Colo-SCRL (Chen et al, 2023) | ICME 23 | 31.5 | 22.6 | 41.9 | 58.1 |
VT-ReID (Xiang et al, 2024b) | ICASSP 24 | 37.9 | 23.4 | 44.5 | 60.1 |
DMCL | Ours | 46.4 | 54.3 | 57.9 | 60.4 |
4.2 Implementation details
In our experiments, ResNet-50 (He et al, 2016) is adopted as the backbone, with no bells and whistles, for visual feature extraction. Following the training procedure in (Tong et al, 2022), we adopt common data augmentation methods such as random flipping and random cropping, and employ the Adam optimizer with weight decay for parameter optimization. Besides, we adopt the ID loss and triplet loss to train the model for 180 iterations, and cosine distance is used to compute the similarity of polyp features for the polyp re-identification task. Please note that the text information is also employed during the testing phase. The batch size for training is set to 64. All experiments are performed with the PyTorch framework on one Nvidia GeForce RTX 2080Ti GPU in a server equipped with an Intel Xeon Gold 6130T CPU.
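The snippet below mirrors this setup in PyTorch as a rough sketch; the learning rate, weight-decay value and crop padding are assumptions, since the exact numbers are not given above, and a single linear layer stands in for the full DMCL network.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Random flipping and random cropping for data augmentation, as described above.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(224, padding=10),   # padding size is an assumption
    transforms.ToTensor(),
])

model = nn.Linear(2048, 768)                  # placeholder for the DMCL network
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=5e-4)  # assumed values

def cosine_distance(query, gallery):
    # Test-time ranking: cosine distance between fused polyp features.
    q = nn.functional.normalize(query, dim=1)
    g = nn.functional.normalize(gallery, dim=1)
    return 1.0 - q @ g.t()
```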
4.3 Comparison with State-of-the-Arts
Colonoscopic Polyp Re-Identification. In this section, we compare our proposed method with state-of-the-art algorithms, including: (1) transformer based (soft attention) models ViT (Caron et al, 2021), Colo-SCRL (Chen et al, 2023) and VT-ReID (Xiang et al, 2024b); (2) knowledge distillation based methods, such as CgSc, FgAttS and FgBinS (Kordopatis-Zilos et al, 2022); and (3) feature level based methods, such as ViSiL (Kordopatis-Zilos et al, 2019), CoCLR (Han et al, 2020), TCA (Shao et al, 2021) and CVRL (Qian et al, 2021). According to the results in Table 1, our method shows clear superiority over the other state-of-the-arts, with significant Rank-1 and mAP advantages. For instance, compared to the knowledge distillation based network FgAttS (Kordopatis-Zilos et al, 2022), our model improves Rank-1 by +44.6% (54.3 vs. 9.7). Besides, our DMCL also surpasses the recent transformer based (soft attention) models ViT (Caron et al, 2021), Colo-SCRL (Chen et al, 2023) and VT-ReID (Xiang et al, 2024b). In particular, our method outperforms the second best model VT-ReID (Xiang et al, 2024b) by +8.5% (46.4 vs. 37.9) and +30.9% (54.3 vs. 23.4) in terms of mAP and Rank-1 accuracy, respectively. The superiority of our proposed method can be largely attributed to the visual-text representation mined by DMCL during collaborative training, as well as the dynamic multimodal feature fusion strategy, which is beneficial for learning a more robust and discriminative model for the polyp ReID task.
Person Re-Identification. To further prove the effectiveness of our method on other related object ReID tasks, we also compare our DMCL with existing methods in Table 2. We can easily observe that our method achieves state-of-the-art performance on the Market-1501, DukeMTMC-reID and CUHK03 datasets with considerable advantages. For example, our DMCL method achieves mAP/Rank-1 performance of 92.1%/96.3% on Market-1501, leading to +1.0% and +0.6% improvements in mAP and Rank-1 accuracy compared to the second best methods NFormer (Wang et al, 2022) and MGN (Wang et al, 2018), respectively. Besides, our DMCL obtains improvements of +1.2% and +1.4% in mAP and Rank-1 accuracy on CUHK03 compared to VT-ReID (Xiang et al, 2024b). These experiments strongly demonstrate the superiority of our proposed deep multimodal collaborative learning framework.
Method | Market-1501 mAP | Market-1501 Rank-1 | DukeMTMC-reID mAP | DukeMTMC-reID Rank-1 | CUHK03 mAP | CUHK03 Rank-1
PCB (Wang et al, 2020) | 81.6 | 93.8 | 69.2 | 83.3 | 57.5 | 63.7 |
MHN (Chen et al, 2019) | 85.0 | 95.1 | 77.2 | 77.3 | 76.5 | 71.7 |
ISP (Zhu et al, 2020) | 88.6 | 95.3 | 80.0 | 89.6 | 71.4 | 75.2 |
CBDB (Tan et al, 2021) | 85.0 | 94.4 | 74.3 | 87.7 | 72.8 | 75.4 |
C2F (Zhang et al, 2021) | 87.7 | 94.8 | 74.9 | 87.4 | 84.1 | 81.3 |
NFormer (Wang et al, 2022) | 91.1 | 94.7 | 83.5 | 89.4 | 74.7 | 77.3 |
MGN (Wang et al, 2018) | 86.9 | 95.7 | 78.4 | 88.7 | 66.0 | 66.8 |
SCSN (Chen et al, 2020) | 88.3 | 92.4 | 79.0 | 91.0 | 81.0 | 84.7 |
VT-ReID (Xiang et al, 2024b) | 88.1 | 93.8 | 79.2 | 92.6 | 85.3 | 88.3 |
DMCL (Ours) | 92.1 | 96.3 | 87.6 | 93.5 | 86.5 | 89.7 |
4.4 Ablation Studies
In this section, to verify the effectiveness of our deep multimodal collaborative learning framework, we perform ablation studies on the multimodal features from both qualitative and quantitative perspectives.
Effectiveness of deep multimodal collaborative learning: Firstly, from the quantitative aspect, we evaluate the effectiveness of the deep multimodal collaborative learning (DMCL) framework. As illustrated in Table 3, when adopting our deep multimodal collaborative learning framework with visual-text induction, mAP accuracy improves significantly from 25.9% to 46.4% on the Colo-Pair dataset. Similar improvements can also be observed on the Rank-1 and Rank-5 metrics, with gains of +36.8% and +20.8% respectively. These results prove that the multimodal collaborative learning paradigm has a direct impact on the downstream polyp ReID task.
Method | mAP | Rank-1 | Rank-5 (Colo-Pair)
Baseline | 25.9 | 17.5 | 37.1
DMCL w/ text | 28.5 | 18.6 | 38.8
DMCL w/ image | 36.6 | 41.8 | 46.2
DMCL (Ours) | 46.4 | 54.3 | 57.9

Secondly, from the qualitative aspect, we also provide qualitative results of our proposed deep multimodal collaborative learning framework DMCL. Fig. 3 visualizes ranking results of DMCL with the collaborative training mechanism on the Colo-ReID dataset. We can clearly observe that our model attends to relevant image regions and discriminative parts when making polyp predictions, indicating that DMCL helps the model learn more global context and meaningful visual features with better semantic understanding, which makes our collaborative training method more robust to perturbations.
Effectiveness of the dynamic multimodal training strategy: In this section, we evaluate the effectiveness of the dynamic multimodal training strategy by testing whether the text modality or the image modality matters. According to Table 3, our dynamic multimodal training strategy with text representation (DMCL w/ text) leads to a Rank-1 improvement of +1.1% (18.6 vs. 17.5) on the Colo-Pair dataset compared with the baseline setting. Furthermore, when adopting the image representation, our method (DMCL w/ image) obtains a remarkable mAP of 36.6%, a significant improvement of +8.1% compared to DMCL w/ text. The effectiveness of the dynamic multimodal training strategy can be largely attributed to the fact that it enhances the discrimination capability of the collaborative networks during multimodal representation learning, which is vital for polyp domain adaptation when target supervision is not available.

Visualization of Feature Response: To further explain why our deep multimodal collaborative learning strategy works, we perform an in-depth analysis of the feature response of the DMCL method and show qualitative EigenGradCAM (Muhammad and Yeasin, 2020) visualizations in Fig. 4. Grad-CAM style methods for explainable AI in computer vision provide attributions to the inputs and to neurons of intermediate layers, making CNN-based models more transparent by producing visual explanations. Specifically, EigenGradCAM decodes the importance of each feature map for a specific class by analyzing the gradients within the last convolutional layer of a CNN, generating a heatmap that highlights the image regions contributing most to the prediction. This heatmap visually represents which parts of the image are most influential in the model's final decision. As illustrated in Fig. 4, compared with the CNN-based training method without text information, our model attends to relevant image regions and discriminative parts when making decisions, indicating that DMCL can effectively explore more global context and meaningful visual features with better semantic understanding, which makes our model more robust to perturbations.
5 Conclusion
This study investigates the possibility of applying multimodal collaborative learning to the polyp retrieval task with a visual-text dataset, and proposes a simple but effective multimodal representation learning network named DMCL to improve polyp ReID performance. To further enhance the robustness of the DMCL model, a dynamic multimodal feature fusion strategy is introduced to leverage the optimized multimodal representations for multimodal fusion via end-to-end training. Comprehensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance on the polyp ReID task and other image retrieval tasks. In the future, we will explore the interpretability of this method and apply it to other related computer vision tasks, e.g., polyp detection and segmentation.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China under Grant No.62301315, Startup Fund for Young Faculty at SJTU (SFYF at SJTU) under Grant No.23X010501967 and Shanghai Municipal Health Commission Health Industry Clinical Research Special Project under Grant No.202340010. The authors would like to thank the anonymous reviewers for their valuable suggestions and constructive criticisms.
Declarations
•
Funding: This work was partially supported by the National Natural Science Foundation of China under Grant No.62301315, the Startup Fund for Young Faculty at SJTU (SFYF at SJTU) under Grant No.23X010501967, and the Shanghai Municipal Health Commission Health Industry Clinical Research Special Project under Grant No.202340010.
•
Conflict of interest: The authors declare that they have no conflict of interest.
•
Ethics approval: Not applicable. The datasets and the work do not contain personal or sensitive information, and no ethical issue is concerned.
•
Consent to participate: The authors consent to the submission and publication of this work by the Machine Learning journal. There is no human study in this work, so this aspect is not applicable.
•
Consent for publication: The authors consent to the publication of this work (including all content, data and images) by the Machine Learning journal.
•
Availability of data and material: The data used for the experiments in this paper are available online; see Section 4.1 for more details.
•
Code availability: The code is publicly available at https://github.com/JeremyXSC/DMCL.
•
Authors' contributions: Suncheng Xiang and Jiacheng Ruan contributed to the conception and design of the study, as well as the experimental process, and interpreted the model results. Suncheng Xiang obtained funding for the project and provided clinical guidance. Suncheng Xiang drafted the manuscript. All authors contributed to manuscript revision and read and approved the submitted version.
References
- Caron et al (2021) Caron M, Touvron H, Misra I, et al (2021) Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9650–9660
- Chen et al (2019) Chen B, Deng W, Hu J (2019) Mixed high-order attention network for person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 371–381
- Chen et al (2023) Chen Q, Cai S, Cai C, et al (2023) Colo-scrl: Self-supervised contrastive representation learning for colonoscopic video retrieval. arXiv preprint arXiv:230315671
- Chen et al (2020) Chen X, Fu C, Zhao Y, et al (2020) Salience-guided cascaded suppression network for person re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3300–3310
- Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805
- Feng et al (2019) Feng Y, Ma L, Liu W, et al (2019) Spatio-temporal video re-localization by warp lstm. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1288–1297
- Han et al (2020) Han T, Xie W, Zisserman A (2020) Self-supervised co-training for video representation learning. Advances in Neural Information Processing Systems 33:5679–5690
- He et al (2016) He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
- Hermans et al (2017) Hermans A, Beyer L, Leibe B (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:170307737
- Koffas et al (2022) Koffas A, Papaefthymiou A, Laskaratos FM, et al (2022) Colon capsule endoscopy in the diagnosis of colon polyps: Who needs a colonoscopy? Diagnostics 12(9):2093
- Kordopatis-Zilos et al (2019) Kordopatis-Zilos G, Papadopoulos S, Patras I, et al (2019) Visil: Fine-grained spatio-temporal video similarity learning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6351–6360
- Kordopatis-Zilos et al (2022) Kordopatis-Zilos G, Tzelepis C, Papadopoulos S, et al (2022) Dns: Distill-and-select for efficient and accurate video indexing and retrieval. International Journal of Computer Vision 130(10):2385–2407
- Li et al (2014) Li W, Zhao R, Xiao T, et al (2014) Deepreid: Deep filter pairing neural network for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 152–159
- Lin et al (2020) Lin Z, Zhao Z, Zhang Z, et al (2020) Weakly-supervised video moment retrieval via semantic completion network. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 11,539–11,546
- Luo et al (2019) Luo H, Jiang W, Gu Y, et al (2019) A strong baseline and batch normalization neck for deep person re-identification. IEEE Transactions on Multimedia 22(10):2597–2609
- Ma et al (2020) Ma M, Yoon S, Kim J, et al (2020) Vlanet: Video-language alignment network for weakly-supervised video moment retrieval. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII 16, Springer, pp 156–171
- Muhammad and Yeasin (2020) Muhammad MB, Yeasin M (2020) Eigen-cam: Class activation map using principal components. In: 2020 international joint conference on neural networks (IJCNN), IEEE, pp 1–7
- Prosser et al (2010) Prosser BJ, Zheng WS, Gong S, et al (2010) Person re-identification by support vector ranking. In: Bmvc, p 6
- Qian et al (2021) Qian R, Meng T, Gong B, et al (2021) Spatiotemporal contrastive video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6964–6974
- Shao et al (2021) Shao J, Wen X, Zhao B, et al (2021) Temporal context aggregation for video retrieval with contrastive learning. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3268–3278
- Tan et al (2021) Tan H, Liu X, Bian Y, et al (2021) Incomplete descriptor mining with elastic loss for person re-identification. IEEE Transactions on Circuits and Systems for Video Technology 32(1):160–171
- Tong et al (2022) Tong Z, Song Y, Wang J, et al (2022) Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems 35:10,078–10,093
- Wang et al (2018) Wang G, Yuan Y, Chen X, et al (2018) Learning discriminative features with multiple granularities for person re-identification. In: Proceedings of the 26th ACM international conference on Multimedia, pp 274–282
- Wang et al (2022) Wang H, Shen J, Liu Y, et al (2022) Nformer: Robust person re-identification with neighbor transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7297–7307
- Wang et al (2020) Wang Y, Liao S, Shao L (2020) Surpassing real-world source training data: Random 3d characters for generalizable person re-identification. In: ACM International Conference on Multimedia, pp 3422–3430
- Wu et al (2018) Wu Z, Xiong Y, Yu SX, et al (2018) Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3733–3742
- Xiang et al (2020a) Xiang S, Fu Y, Chen H, et al (2020a) Multi-level feature learning with attention for person re-identification. Multimedia Tools and Applications 79:32,079–32,093
- Xiang et al (2020b) Xiang S, Fu Y, Xie M, et al (2020b) Unsupervised person re-identification by hierarchical cluster and domain transfer. Multimedia Tools and Applications 79:19,769–19,786
- Xiang et al (2020c) Xiang S, Fu Y, You G, et al (2020c) Unsupervised domain adaptation through synthesis for person re-identification. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6
- Xiang et al (2023a) Xiang S, Fu Y, Guan M, et al (2023a) Learning from self-discrepancy via multiple co-teaching for cross-domain person re-identification. Machine Learning 112(6):1923–1940
- Xiang et al (2023b) Xiang S, Qian D, Gao J, et al (2023b) Rethinking person re-identification via semantic-based pretraining. ACM Transactions on Multimedia Computing, Communications and Applications 20(3):1–17
- Xiang et al (2023c) Xiang S, Qian D, Guan M, et al (2023c) Less is more: Learning from synthetic data with fine-grained attributes for person re-identification. ACM Transactions on Multimedia Computing, Communications and Applications 19(5s):1–20
- Xiang et al (2024a) Xiang S, Chen H, Ran W, et al (2024a) Deep multimodal representation learning for generalizable person re-identification. Machine Learning 113(4):1921–1939
- Xiang et al (2024b) Xiang S, Liu C, Ruan J, et al (2024b) Vt-reid: Learning discriminative visual-text representation for polyp re-identification. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 3170–3174
- Zhang et al (2021) Zhang A, Gao Y, Niu Y, et al (2021) Coarse-to-fine person re-identification with auxiliary-domain classification and second-order information bottleneck. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 598–607
- Zhao et al (2013) Zhao R, Ouyang W, Wang X (2013) Person re-identification by salience matching. 2013 IEEE International Conference on Computer Vision pp 2528–2535
- Zhao et al (2014) Zhao R, Ouyang W, Wang X (2014) Learning mid-level filters for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 144–151
- Zheng et al (2015) Zheng L, Shen L, Tian L, et al (2015) Scalable person re-identification: A benchmark. In: Proceedings of the IEEE international conference on computer vision, pp 1116–1124
- Zheng et al (2017a) Zheng Z, Zheng L, Yang Y (2017a) A discriminatively learned cnn embedding for person reidentification. ACM transactions on Multimedia Computing, Communications, and Applications 14(1):1–20
- Zheng et al (2017b) Zheng Z, Zheng L, Yang Y (2017b) Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In: Proceedings of the IEEE international conference on computer vision, pp 3754–3762
- Zhu et al (2020) Zhu K, Guo H, Liu Z, et al (2020) Identity-guided human semantic parsing for person re-identification. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, Springer, pp 346–363