[1] School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
[2] Endoscopy Center, Zhongshan Hospital of Fudan University, Shanghai 200032, China
Deep Multimodal Collaborative Learning for Polyp Re-Identification
Abstract
Colonoscopic polyp re-identification aims to match the same polyp from a large gallery of images taken from different views with different cameras, and it plays an important role in the computer-aided prevention and treatment of colorectal cancer. However, traditional object ReID methods that directly adopt CNN models trained on ImageNet usually produce unsatisfactory retrieval performance on colonoscopic datasets due to the large domain gap. Worse still, these solutions typically learn unimodal representations from visual samples alone, which fails to exploit complementary information from other modalities. To address this challenge, we propose a novel Deep Multimodal Collaborative Learning framework named DMCL for polyp re-identification, which effectively encourages modality collaboration and reinforces generalization capability in medical scenarios. On top of it, a dynamic multimodal feature fusion strategy is introduced to leverage the optimized multimodal representations for multimodal fusion via end-to-end training. Experiments on standard benchmarks show the benefits of the multimodal setting over state-of-the-art unimodal ReID models, especially when combined with the specialized multimodal fusion strategy, demonstrating that learning representations from multiple modalities can be competitive with methods based on unimodal representation learning. We also hope that our method will shed light on related research, especially on multimodal collaborative learning. The code is publicly available at https://github.com/JeremyXSC/DMCL.
keywords:
Colonoscopic Polyp Re-Identification, modal representation, multimodal collaborative learning, generalization capability

1 Introduction
Colonoscopic polyp re-identification (Polyp ReID) aims to match a specific polyp across a large gallery captured by different cameras at different locations, and it has been studied intensively due to its practical importance in the computer-aided prevention and treatment of colorectal cancer (Koffas et al, 2022). With the development of deep convolutional neural networks and the availability of video re-identification datasets, video retrieval methods have achieved remarkable performance in a supervised manner (Feng et al, 2019; Xiang et al, 2023b, c), where a model is trained and tested on different splits of the same dataset. However, in practice, manually labeling a large diversity of pairwise polyp-area data is time-consuming and labor-intensive when directly deploying a polyp ReID system to new hospital scenarios (Chen et al, 2023). Moreover, compared with conventional ReID (Xiang et al, 2020c, a), polyp ReID faces additional challenges: 1) from the model perspective, traditional object ReID methods learn unimodal representations by greedily "pre-training" several layers of features on visual samples alone (Xiang et al, 2023a, 2024a), ignoring complementary information from other modalities; and 2) from the data perspective, polyp ReID suffers from large variations in background, viewpoint, and illumination, which poses great challenges to the clinical deployment of deep models in real-world scenarios.
Based on the aforementioned findings, we propose a novel deep multimodal collaborative learning framework named DMCL to encourage modality collaboration and reinforce generalization capability in medical scenarios. On top of it, a dynamic multimodal training strategy is introduced to leverage the unimodal representations for multimodal fusion via end-to-end training on multimodal tasks, thereby improving the performance of our model, as shown in Fig. 1. To the best of our knowledge, this is the first attempt to employ visual-text features with a collaborative training mechanism for colonoscopic polyp re-identification.
In summary, the major contributions of our work are as follows:
-
A deep multimodal collaborative learning framework, DMCL, is proposed for the first time to exploit visual and textual information simultaneously, thereby promoting the development of multimodal polyp re-identification.
-
Based on the DMCL framework, we introduce a simple but effective multimodal feature fusion strategy, which greatly helps the model learn more discriminative information during the training and testing stages.
-
Comprehensive experiments on standard benchmark datasets demonstrate that our method achieves state-of-the-art results, proving its efficacy.
The remainder of this paper is structured as follows. Section 2 reviews related work on hand-crafted and deep learning based methods in the medical area and briefly positions our method. Section 3 presents the details of our multimodal collaborative learning method as well as the dynamic feature fusion strategy. Extensive evaluations against state-of-the-art methods and comprehensive analyses of the proposed approach are reported in Section 4. Finally, Section 5 concludes the paper and discusses future work.

2 Related Work
In this section, we briefly review related work on re-identification. The mainstream idea of existing methods is to learn a discriminative and robust model for the downstream polyp ReID task. These methods can be roughly divided into hand-crafted approaches and deep learning based approaches.
2.1 Hand-crafted based Approaches
Traditional research works (Prosser et al, 2010; Zhao et al, 2013, 2014) on hand-crafted systems for the image retrieval task aim to design or learn discriminative representations of pedestrian features. For example, Prosser et al (2010) reformulate person re-identification as a learning-to-rank problem. Zhao et al (2013) exploit the pairwise salience distribution relationship between pedestrian images and solve person re-identification by proposing a salience matching strategy. Besides directly using mid-level color and texture features, some methods (Zhao et al, 2014; Xiang et al, 2020a) also explore the different discriminative abilities of local patches for robust generalization. Unfortunately, these hand-crafted feature based approaches fail to produce competitive results on large-scale datasets. The main reason is that these early works are mostly based on heuristic design, and thus cannot learn optimal discriminative features on current large-scale datasets.
2.2 Deep Learning based Approaches
Recently, there has been significant research interest in deep learning based approaches for image or video retrieval (Shao et al, 2021; Xiang et al, 2023a; Ma et al, 2020; Xiang et al, 2020a). For example, Shao et al (2021) propose a temporal context aggregation method that incorporates long-range temporal information for content-based video retrieval. Xiang et al (2020a) propose a feature fusion strategy based on a traditional convolutional neural network for pedestrian retrieval. Ma et al (2020) explore video moment retrieval in a weakly-supervised manner. Lin et al (2020) propose a semantic completion network with a proposal generation module that scores all candidate proposals in a single pass. Besides, Kordopatis-Zilos et al (2022) propose a video retrieval framework based on knowledge distillation that addresses the performance-efficiency trade-off. As for self-supervised learning, Wu et al (2018) present an unsupervised feature learning approach called instance-wise contrastive learning, which maximizes the distinction between instances via a non-parametric softmax formulation. In addition, Chen et al (2023) propose a self-supervised contrastive representation learning scheme named Colo-SCRL to learn spatial representations from a colonoscopic video dataset. However, on the one hand, all of the above approaches divide the learning process into multiple stages, each requiring independent optimization. On the other hand, despite the tremendous success achieved by deep learning based approaches, semantic feature embedding remains largely unexplored in this area.
In this work, we propose a novel Deep Multimodal Collaborative Learning framework named DMCL for multimodal polyp re-identification, which remarkably encourages modality collaboration and reinforces generalization capability in medical scenarios. On top of it, a dynamic multimodal feature fusion strategy is proposed to fully exploit the prior knowledge of visual-text features, thereby improving the accuracy of polyp ReID in complex textured regions. To the best of our knowledge, this is the first attempt to employ visual-text features with a collaborative learning mechanism for the colonoscopic polyp re-identification task. The proposed deep multimodal collaborative learning system not only improves polyp ReID performance but also effectively maintains the semantic consistency of the extracted features, which ensures the reliability and usability of the paradigm in various application scenarios and provides strong support for its application.
3 Our Method
In this section, we first give the problem definition of the colonoscopic polyp ReID task. Then we introduce the deep multimodal collaborative learning framework (DMCL), as illustrated in Fig. 2. Finally, we elaborate on the details of our multimodal fusion strategy.
3.1 Preliminary
Assume that we are given a training dataset $\mathcal{D} = \{x_i\}_{i=1}^{N}$ with its own label space in terms of visual and textual features, where $N$ is the number of images in $\mathcal{D}$. Each sample $x_i$ is associated with an identity label $y_i \in \{1, 2, \ldots, M\}$, where $M$ is the number of identities in $\mathcal{D}$, along with the corresponding text description from the colonoscopic polyp dataset. The main goal of this work is to learn a robust polyp ReID model on the basis of the visual-text representation.
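To make the formulation concrete, the snippet below sketches one plausible way to organize a training sample as described above; the class and field names are illustrative assumptions rather than the authors' actual data schema.

```python
# Hypothetical container for one DMCL training sample: a colonoscopic
# image, its paired text description, and the polyp identity label.
from dataclasses import dataclass

@dataclass
class PolypSample:
    image_path: str    # visual sample x_i
    description: str   # paired text description of the polyp
    identity: int      # identity label y_i in {1, ..., M}

sample = PolypSample(
    image_path="frames/polyp_0001_view_03.jpg",   # illustrative path
    description="A small pedunculated polyp in the sigmoid colon.",
    identity=1,
)
```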

3.2 Our Proposed DMCL network
In this work, we design a DMCL framework for polyp re-identification in medical scenarios, which includes a visual feature extractor, a textual feature encoder and dynamic multimodal feature fusion strategy.
Image Encoder. The image encoder serves as the visual feature extractor in our DMCL framework. Specifically, we adopt ResNet-50 (He et al, 2016) as the backbone of the image encoder module. Benefiting from its residual connection design, the ResNet-50 network effectively alleviates the gradient-vanishing problem, enabling the model to learn deep image features. Moreover, the image encoder extracts rich details and global structures to form an efficient feature encoding, exhibiting strong performance and generalization ability in feature extraction. Given a 224 × 224 × 3 polyp image, the image encoder encodes the input into a 2048-dimensional vector for subsequent feature fusion and processing, which provides a strong foundation for the downstream polyp ReID task.
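As a rough illustration of this branch, the following sketch builds a 2048-dimensional visual embedding with a torchvision ResNet-50 (assuming torchvision ≥ 0.13); the class name and the choice of ImageNet weights are assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

class PolypImageEncoder(nn.Module):
    """ResNet-50 backbone with the ImageNet classification head removed."""
    def __init__(self):
        super().__init__()
        backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        # Keep everything up to (and including) global average pooling.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, x):            # x: (B, 3, 224, 224)
        f = self.features(x)         # (B, 2048, 1, 1)
        return f.flatten(1)          # (B, 2048) visual embedding

images = torch.randn(4, 3, 224, 224)
img_feat = PolypImageEncoder()(images)   # shape: (4, 2048)
```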
Text Encoder. For the text encoder module, we employ the BERT model (Devlin et al, 2018) to encode the text description of the corresponding polyp. In essence, BERT is a pre-trained language model based on the Transformer architecture, which learns language representations through masked language modeling and next-sentence prediction on a large amount of unlabeled text. In this work, given a text description of the polyp with $n$ characters, BERT encodes it into a 768-dimensional vector matrix $T \in \mathbb{R}^{n \times 768}$ for subsequent feature fusion and processing. The bidirectional encoder design of BERT enables it to capture the contextual information of words, generating more accurate text representations.
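A minimal sketch of this branch with the HuggingFace transformers implementation of BERT is given below; the bert-base-uncased checkpoint and the example description are assumptions for illustration.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Hypothetical polyp description paired with a colonoscopic image.
description = "A small sessile polyp located in the ascending colon."
tokens = tokenizer(description, return_tensors="pt", truncation=True)

with torch.no_grad():
    out = bert(**tokens)

text_feat = out.last_hidden_state   # (1, n, 768): one 768-d vector per token
```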
Dynamic Multimodal Feature Fusion. In general scenarios, there exists an obvious discrepancy between the image and text data in our dataset, which inevitably has a negative impact on the model's performance. To overcome this issue, we introduce a multimodal fusion strategy based on the self-attention mechanism, which is a crucial part of our proposed deep collaborative training framework. To be more specific, we obtain the image encoding $F_v$ after passing the image through the image encoder, and its dimension is then reduced from the original 2048 to 768. Similarly, we obtain the text encoding through the text encoder, which is denoted as:
$F_t = \mathrm{BERT}(T) = [t_1, t_2, \ldots, t_n]$   (1)
where $F_t$ denotes the text encoding and $i$ indexes the characters in the polyp's text description. Subsequently, we concatenate the image encoding and the text encoding to form a new fused encoding, which can be formalized as:
$F_f = \mathcal{F}\big(\mathrm{Concat}(F_v, F_t)\big)$   (2)
where $F_f$ represents the features generated by the multimodal fusion component $\mathcal{F}$. Remarkably, the multimodal fusion component consists of multiple self-attention layers that accomplish the fusion of visual and text features. The concatenated feature is projected by three linear layers into the query, key and value vectors, which are then fed into the self-attention module. The output of the self-attention module is:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\dfrac{QK^{\top}}{\sqrt{d_k}}\right)V$   (3)
where $Q$, $K$ and $V$ represent the query, key and value of the self-attention mechanism respectively, and $d_k$ denotes the dimension of the input vectors. After modal fusion, we obtain the updated features:
$\hat{F} = \mathrm{Attention}(Q, K, V)$   (4)
Generally speaking, the visual feature is of paramount importance for the task of polyp re-identification. Inspired by this, we take the feature of $\hat{F}$ at the position corresponding to the original image feature, which is adopted as the test-time feature of the corresponding polyp for the downstream polyp ReID task. The whole procedure of our collaborative multimodal training system is illustrated in Algorithm 1.
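The sketch below illustrates the fusion step described above: the 2048-d visual feature is projected to 768-d, concatenated with the token-level text features, passed through stacked self-attention layers, and the updated feature at the image position is kept for testing. The number of attention heads and layers, and all class names, are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8, num_layers=3):
        super().__init__()
        self.img_proj = nn.Linear(2048, dim)   # reduce visual feature: 2048 -> 768
        self.attn_layers = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        )

    def forward(self, img_feat, text_feat):
        # img_feat: (B, 2048), text_feat: (B, n, 768)
        img_token = self.img_proj(img_feat).unsqueeze(1)   # (B, 1, 768)
        fused = torch.cat([img_token, text_feat], dim=1)   # (B, 1 + n, 768)
        for attn in self.attn_layers:
            fused, _ = attn(fused, fused, fused)           # self-attention: Q = K = V
        # Updated feature at the image-token position, used at test time.
        return fused[:, 0]                                 # (B, 768)

fusion = MultimodalFusion()
out = fusion(torch.randn(2, 2048), torch.randn(2, 20, 768))   # (2, 768)
```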
3.3 Dynamic Network Updating
In essence, many previous works (Luo et al, 2019; Xiang et al, 2023a, 2020b) have found that training with multiple losses has great potential for learning a robust and generalizable ReID model. During training, we adopt an identification loss (Zheng et al, 2017a) and a triplet loss (Hermans et al, 2017) as the optimization objectives for our DMCL model. Specifically, the triplet loss aims to reduce the feature distance between images of the same polyp while increasing the feature distance between different polyps. The identification loss, on the other hand, enhances the model's ability to recognize and classify polyp categories.
To be more specific, for a single-label classification task, the identification loss (cross-entropy loss) is written as,
$\mathcal{L}_{id} = -\dfrac{1}{N_b}\sum_{i=1}^{N_b}\log p(y_i \mid x_i)$   (5)
where $N_b$ is the number of labeled training images in a batch, and $p(y_i \mid x_i)$ is the predicted probability of the input $x_i$ belonging to its ground-truth class $y_i$. For a triplet $\{x_a, x_p, x_n\}$, the soft-margin triplet loss can be calculated as:
$\mathcal{L}_{tri} = \log\big(1 + \exp(\lVert f_a - f_p \rVert_2 - \lVert f_a - f_n \rVert_2)\big)$   (6)
where $x_a$, $x_p$ and $x_n$ denote the anchor image, the positive sample and the negative sample respectively, and $f_a$, $f_p$ and $f_n$ are their corresponding features. As for triplet selection, we randomly select $P$ different polyps from the patients and sample $K$ images or texts from the image or text modality for each patient to form the triplets, which greatly helps to construct a discriminative embedding for multimodal representation learning.
Finally, the overall objective loss function in a training batch is expressed as:
$\mathcal{L}_{total} = \mathcal{L}_{id} + \mathcal{L}_{tri}$   (7)
In this way, our deep multimodal collaborative learning scheme fully considers the information from both images and texts, providing stronger support for clinical colonoscopic polyp examination and improving the model's performance by a clear margin.
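A minimal sketch of this objective, assuming a standard cross-entropy identification loss and the soft-margin form of the triplet loss in Eq. (6), is shown below; batch construction and the classifier head are simplified for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_margin_triplet_loss(anchor, positive, negative):
    # log(1 + exp(d(a, p) - d(a, n))), averaged over the batch
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative)
    return F.softplus(d_ap - d_an).mean()

id_criterion = nn.CrossEntropyLoss()

def total_loss(logits, labels, anchor, positive, negative):
    # Eq. (7): identification loss plus soft-margin triplet loss
    return id_criterion(logits, labels) + soft_margin_triplet_loss(anchor, positive, negative)

logits = torch.randn(8, 100)             # scores over 100 polyp identities (assumed)
labels = torch.randint(0, 100, (8,))
a, p, n = (torch.randn(8, 768) for _ in range(3))
loss = total_loss(logits, labels, a, p, n)
```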
4 Experiments
4.1 Datasets and Evaluation Metric
We conduct experiments on several large-scale public datasets, including Colo-Pair (Chen et al, 2023), Market-1501 (Zheng et al, 2015), DukeMTMC-reID (Zheng et al, 2017b) and CUHK03 (Li et al, 2014). We follow the standard evaluation protocol (Zheng et al, 2015) used in the ReID task and adopt mean Average Precision (mAP) and Cumulative Matching Characteristics (CMC) at Rank-1, Rank-5 and Rank-10 for performance evaluation.
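For reference, the sketch below computes CMC and mAP from a precomputed query-gallery distance matrix under a simplified single-shot protocol; the camera-ID filtering used in the full Market-1501 protocol is omitted, and every query is assumed to have at least one correct gallery match.

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, ranks=(1, 5, 10)):
    """dist: (num_query, num_gallery) distances; q_ids / g_ids: identity labels."""
    order = np.argsort(dist, axis=1)            # gallery sorted by ascending distance
    matches = g_ids[order] == q_ids[:, None]    # True where the ranked item is a correct match

    # CMC: fraction of queries whose first correct match appears within the top-k
    first_hit = matches.argmax(axis=1)          # assumes each query has >= 1 correct match
    cmc = {k: float((first_hit < k).mean()) for k in ranks}

    # mAP: mean over queries of the average precision at each correct match
    aps = []
    for row in matches:
        hits = np.where(row)[0]
        precision = (np.arange(hits.size) + 1) / (hits + 1)
        aps.append(precision.mean())
    return cmc, float(np.mean(aps))
```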
Method | Venue | mAP | Rank-1 | Rank-5 | Rank-10 (video retrieval)
ViSiL (Kordopatis-Zilos et al, 2019) | ICCV 19 | 24.9 | 14.5 | 30.6 | 51.6 |
CoCLR (Han et al, 2020) | NIPS 20 | 16.3 | 6.5 | 22.6 | 33.9 |
TCA (Shao et al, 2021) | WACV 21 | 27.8 | 16.1 | 35.5 | 53.2 |
ViT (Caron et al, 2021) | CVPR 21 | 20.4 | 9.7 | 30.6 | 43.5 |
CVRL (Qian et al, 2021) | CVPR 21 | 23.6 | 11.3 | 32.3 | 53.2 |
CgSc (Kordopatis-Zilos et al, 2022) | IJCV 22 | 21.4 | 8.1 | 35.5 | 45.2 |
FgAttS (Kordopatis-Zilos et al, 2022) | IJCV 22 | 23.6 | 9.7 | 40.3 | 50.0 |
FgBinS (Kordopatis-Zilos et al, 2022) | IJCV 22 | 21.2 | 9.7 | 32.3 | 48.4 |
Colo-SCRL (Chen et al, 2023) | ICME 23 | 31.5 | 22.6 | 41.9 | 58.1 |
VT-ReID (Xiang et al, 2024b) | ICASSP 24 | 37.9 | 23.4 | 44.5 | 60.1 |
DMCL | Ours | 46.4 | 54.3 | 57.9 | 60.4 |
4.2 Implementation details
In our experiments, ResNet-50 (He et al, 2016) is adopted as the backbone, with no bells and whistles, for visual feature extraction. Following the training procedure in (Tong et al, 2022), we adopt common data augmentation methods such as random flipping and random cropping, and employ the Adam optimizer with weight decay for parameter optimization. Besides, we adopt the ID loss and triplet loss to train the model for 180 iterations, and cosine distance is used to compute the similarity of polyp features for the polyp re-identification task. Please note that the text information is also employed during the testing phase. The batch size for training is set to 64. All experiments are performed with the PyTorch framework on one Nvidia GeForce RTX 2080Ti GPU in a server equipped with an Intel Xeon Gold 6130T CPU.
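The snippet below mirrors this setup in PyTorch as a rough sketch; the learning rate, weight-decay value and crop padding are assumptions, since the exact numbers are not given above, and a single linear layer stands in for the full DMCL network.

```python
import torch
import torch.nn as nn
from torchvision import transforms

# Random flipping and random cropping for data augmentation, as described above.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(224, padding=10),   # padding size is an assumption
    transforms.ToTensor(),
])

model = nn.Linear(2048, 768)                  # placeholder for the DMCL network
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=5e-4)  # assumed values

def cosine_distance(query, gallery):
    # Test-time ranking: cosine distance between fused polyp features.
    q = nn.functional.normalize(query, dim=1)
    g = nn.functional.normalize(gallery, dim=1)
    return 1.0 - q @ g.t()
```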
4.3 Comparison with State-of-the-Arts
Colonoscopic Polyp Re-Identification. In this section, we compare our proposed method with state-of-the-art algorithms, including: (1) transformer based (soft attention) models ViT (Caron et al, 2021), Colo-SCRL (Chen et al, 2023) and VT-ReID (Xiang et al, 2024b); (2) knowledge distillation based methods, such as CgSc, FgAttS and FgBinS (Kordopatis-Zilos et al, 2022); and (3) feature level based methods, such as ViSiL (Kordopatis-Zilos et al, 2019), CoCLR (Han et al, 2020), TCA (Shao et al, 2021) and CVRL (Qian et al, 2021). According to the results in Table 1, our method shows clear superiority over the other state-of-the-arts, with significant Rank-1 and mAP advantages. For instance, compared to the knowledge distillation based network FgAttS (Kordopatis-Zilos et al, 2022), our model improves Rank-1 by +44.6% (54.3 vs. 9.7). Besides, our DMCL also surpasses the recent transformer based (soft attention) models ViT (Caron et al, 2021), Colo-SCRL (Chen et al, 2023) and VT-ReID (Xiang et al, 2024b). In particular, our method outperforms the second best model VT-ReID (Xiang et al, 2024b) by +8.5% (46.4 vs. 37.9) and +30.9% (54.3 vs. 23.4) in terms of mAP and Rank-1 accuracy, respectively. The superiority of our proposed method can be largely attributed to the visual-text representation mined by DMCL during collaborative training, as well as the dynamic multimodal feature fusion strategy, which is beneficial for learning a more robust and discriminative model for the polyp ReID task.
Person Re-Identification. To further prove the effectiveness of our method on other related object ReID tasks, we also compare our DMCL with existing methods in Table 2. We can easily observe that our method achieves state-of-the-art performance on the Market-1501, DukeMTMC-reID and CUHK03 datasets with considerable advantages. For example, our DMCL method achieves mAP/Rank-1 performance of 92.1%/96.3% on Market-1501, leading to +1.0% and +0.6% improvements in mAP and Rank-1 accuracy compared to the second best methods NFormer (Wang et al, 2022) and MGN (Wang et al, 2018), respectively. Besides, our DMCL obtains improvements of +1.2% and +1.4% in mAP and Rank-1 accuracy on CUHK03 compared to VT-ReID (Xiang et al, 2024b). These experiments strongly demonstrate the superiority of our proposed deep multimodal collaborative learning framework.
Method | Market-1501 mAP | Market-1501 Rank-1 | DukeMTMC-reID mAP | DukeMTMC-reID Rank-1 | CUHK03 mAP | CUHK03 Rank-1
PCB (Wang et al, 2020) | 81.6 | 93.8 | 69.2 | 83.3 | 57.5 | 63.7 |
MHN (Chen et al, 2019) | 85.0 | 95.1 | 77.2 | 77.3 | 76.5 | 71.7 |
ISP (Zhu et al, 2020) | 88.6 | 95.3 | 80.0 | 89.6 | 71.4 | 75.2 |
CBDB (Tan et al, 2021) | 85.0 | 94.4 | 74.3 | 87.7 | 72.8 | 75.4 |
C2F (Zhang et al, 2021) | 87.7 | 94.8 | 74.9 | 87.4 | 84.1 | 81.3 |
NFormer (Wang et al, 2022) | 91.1 | 94.7 | 83.5 | 89.4 | 74.7 | 77.3 |
MGN (Wang et al, 2018) | 86.9 | 95.7 | 78.4 | 88.7 | 66.0 | 66.8 |
SCSN (Chen et al, 2020) | 88.3 | 92.4 | 79.0 | 91.0 | 81.0 | 84.7 |
VT-ReID (Xiang et al, 2024b) | 88.1 | 93.8 | 79.2 | 92.6 | 85.3 | 88.3 |
DMCL (Ours) | 92.1 | 96.3 | 87.6 | 93.5 | 86.5 | 89.7 |
4.4 Ablation Studies
In this section, to verify the effectiveness of our deep multimodal collaborative learning framework, we perform ablation studies on the multimodal features from both qualitative and quantitative perspectives.
Effectiveness of deep multimodal collaborative learning: Firstly, from the quantitative aspect, we evaluate the effectiveness of the deep multimodal collaborative learning (DMCL) framework. As illustrated in Table 3, when adopting our deep multimodal collaborative learning framework with visual-text induction, mAP accuracy improves significantly from 25.9% to 46.4% on the Colo-Pair dataset. Similar improvements can also be observed on the Rank-1 and Rank-5 metrics, with gains of +36.8% and +20.8% respectively. These results prove that the multimodal collaborative learning paradigm has a direct impact on the downstream polyp ReID task.
Method | mAP | Rank-1 | Rank-5 (Colo-Pair)
Baseline | 25.9 | 17.5 | 37.1
DMCL w/ text | 28.5 | 18.6 | 38.8
DMCL w/ image | 36.6 | 41.8 | 46.2
DMCL (Ours) | 46.4 | 54.3 | 57.9

Secondly, from the qualitative aspect, we also provide qualitative results of our proposed deep multimodal collaborative learning framework DMCL. Fig. 3 visualizes ranking results of DMCL with the collaborative training mechanism on the Colo-ReID dataset. We can clearly observe that our model attends to relevant image regions and discriminative parts when making polyp predictions, indicating that DMCL helps the model learn more global context and meaningful visual features with better semantic understanding, which makes our collaborative training method more robust to perturbations.
Effectiveness of the dynamic multimodal training strategy: In this section, we evaluate the effectiveness of the dynamic multimodal training strategy by testing whether the text modality or the image modality matters. According to Table 3, our dynamic multimodal training strategy with text representation (DMCL w/ text) leads to a Rank-1 improvement of +1.1% (18.6 vs. 17.5) on the Colo-Pair dataset compared with the baseline setting. Furthermore, when adopting the image representation, our method (DMCL w/ image) obtains a remarkable mAP of 36.6%, a significant improvement of +8.1% compared to DMCL w/ text. The effectiveness of the dynamic multimodal training strategy can be largely attributed to the fact that it enhances the discrimination capability of the collaborative networks during multimodal representation learning, which is vital for polyp domain adaptation when target supervision is not available.

Visualization of Feature Response: To further explain why our deep multimodal collaborative learning strategy works, we perform an in-depth analysis of the feature response of the DMCL method and show qualitative EigenGradCAM (Muhammad and Yeasin, 2020) visualizations in Fig. 4. Grad-CAM style methods for explainable AI in computer vision provide attributions to the inputs and to neurons of intermediate layers, making CNN-based models more transparent by producing visual explanations. Specifically, EigenGradCAM decodes the importance of each feature map for a specific class by analyzing the gradients within the last convolutional layer of a CNN, generating a heatmap that highlights the image regions contributing most to the prediction. This heatmap visually represents which parts of the image are most influential in the model's final decision. As illustrated in Fig. 4, compared with the CNN-based training method without text information, our model attends to relevant image regions and discriminative parts when making decisions, indicating that DMCL can effectively explore more global context and meaningful visual features with better semantic understanding, which makes our model more robust to perturbations.
5 Conclusion
This study investigates the possibility of applying multimodal collaborative learning to the polyp retrieval task with a visual-text dataset, and proposes a simple but effective multimodal representation learning network named DMCL to improve polyp ReID performance. To further enhance the robustness of the DMCL model, a dynamic multimodal feature fusion strategy is introduced to leverage the optimized multimodal representations for multimodal fusion via end-to-end training. Comprehensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance on the polyp ReID task and other image retrieval tasks. In the future, we will explore the interpretability of this method and apply it to other related computer vision tasks, e.g., polyp detection and segmentation.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China under Grant No.62301315, Startup Fund for Young Faculty at SJTU (SFYF at SJTU) under Grant No.23X010501967 and Shanghai Municipal Health Commission Health Industry Clinical Research Special Project under Grant No.202340010. The authors would like to thank the anonymous reviewers for their valuable suggestions and constructive criticisms.
Declarations
•
Funding: This work was partially supported by the National Natural Science Foundation of China under Grant No.62301315, the Startup Fund for Young Faculty at SJTU (SFYF at SJTU) under Grant No.23X010501967, and the Shanghai Municipal Health Commission Health Industry Clinical Research Special Project under Grant No.202340010.
•
Conflict of interest: The authors declare that they have no conflict of interest.
•
Ethics approval: Not applicable. The datasets and the work do not contain personal or sensitive information, and no ethical issue is concerned.
•
Consent to participate: The authors consent to the submission and publication of this work by the Machine Learning journal. There is no human study in this work, so this aspect is not applicable.
•
Consent for publication: The authors consent to the publication of this work (including all content, data and images) by the Machine Learning journal.
•
Availability of data and material: The data used for the experiments in this paper are available online; see Section 4.1 for more details.
•
Code availability: The code is publicly available at https://github.com/JeremyXSC/DMCL.
•
Authors' contributions: Suncheng Xiang and Jiacheng Ruan contributed to the conception and design of the study, as well as the experimental process, and interpreted the model results. Suncheng Xiang obtained funding for the project and provided clinical guidance. Suncheng Xiang drafted the manuscript. All authors contributed to manuscript revision and read and approved the submitted version.
References
- Caron et al (2021) Caron M, Touvron H, Misra I, et al (2021) Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9650–9660
- Chen et al (2019) Chen B, Deng W, Hu J (2019) Mixed high-order attention network for person re-identification. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 371–381
- Chen et al (2023) Chen Q, Cai S, Cai C, et al (2023) Colo-scrl: Self-supervised contrastive representation learning for colonoscopic video retrieval. arXiv preprint arXiv:230315671
- Chen et al (2020) Chen X, Fu C, Zhao Y, et al (2020) Salience-guided cascaded suppression network for person re-identification. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3300–3310
- Devlin et al (2018) Devlin J, Chang MW, Lee K, et al (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:181004805
- Feng et al (2019) Feng Y, Ma L, Liu W, et al (2019) Spatio-temporal video re-localization by warp lstm. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 1288–1297
- Han et al (2020) Han T, Xie W, Zisserman A (2020) Self-supervised co-training for video representation learning. Advances in Neural Information Processing Systems 33:5679–5690
- He et al (2016) He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
- Hermans et al (2017) Hermans A, Beyer L, Leibe B (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:170307737
- Koffas et al (2022) Koffas A, Papaefthymiou A, Laskaratos FM, et al (2022) Colon capsule endoscopy in the diagnosis of colon polyps: Who needs a colonoscopy? Diagnostics 12(9):2093
- Kordopatis-Zilos et al (2019) Kordopatis-Zilos G, Papadopoulos S, Patras I, et al (2019) Visil: Fine-grained spatio-temporal video similarity learning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 6351–6360
- Kordopatis-Zilos et al (2022) Kordopatis-Zilos G, Tzelepis C, Papadopoulos S, et al (2022) Dns: Distill-and-select for efficient and accurate video indexing and retrieval. International Journal of Computer Vision 130(10):2385–2407
- Li et al (2014) Li W, Zhao R, Xiao T, et al (2014) Deepreid: Deep filter pairing neural network for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 152–159
- Lin et al (2020) Lin Z, Zhao Z, Zhang Z, et al (2020) Weakly-supervised video moment retrieval via semantic completion network. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 11,539–11,546
- Luo et al (2019) Luo H, Jiang W, Gu Y, et al (2019) A strong baseline and batch normalization neck for deep person re-identification. IEEE Transactions on Multimedia 22(10):2597–2609
- Ma et al (2020) Ma M, Yoon S, Kim J, et al (2020) Vlanet: Video-language alignment network for weakly-supervised video moment retrieval. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVIII 16, Springer, pp 156–171
- Muhammad and Yeasin (2020) Muhammad MB, Yeasin M (2020) Eigen-cam: Class activation map using principal components. In: 2020 international joint conference on neural networks (IJCNN), IEEE, pp 1–7
- Prosser et al (2010) Prosser BJ, Zheng WS, Gong S, et al (2010) Person re-identification by support vector ranking. In: Bmvc, p 6
- Qian et al (2021) Qian R, Meng T, Gong B, et al (2021) Spatiotemporal contrastive video representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 6964–6974
- Shao et al (2021) Shao J, Wen X, Zhao B, et al (2021) Temporal context aggregation for video retrieval with contrastive learning. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 3268–3278
- Tan et al (2021) Tan H, Liu X, Bian Y, et al (2021) Incomplete descriptor mining with elastic loss for person re-identification. IEEE Transactions on Circuits and Systems for Video Technology 32(1):160–171
- Tong et al (2022) Tong Z, Song Y, Wang J, et al (2022) Videomae: Masked autoencoders are data-efficient learners for self-supervised video pre-training. Advances in neural information processing systems 35:10,078–10,093
- Wang et al (2018) Wang G, Yuan Y, Chen X, et al (2018) Learning discriminative features with multiple granularities for person re-identification. In: Proceedings of the 26th ACM international conference on Multimedia, pp 274–282
- Wang et al (2022) Wang H, Shen J, Liu Y, et al (2022) Nformer: Robust person re-identification with neighbor transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 7297–7307
- Wang et al (2020) Wang Y, Liao S, Shao L (2020) Surpassing real-world source training data: Random 3d characters for generalizable person re-identification. In: ACM International Conference on Multimedia, pp 3422–3430
- Wu et al (2018) Wu Z, Xiong Y, Yu SX, et al (2018) Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3733–3742
- Xiang et al (2020a) Xiang S, Fu Y, Chen H, et al (2020a) Multi-level feature learning with attention for person re-identification. Multimedia Tools and Applications 79:32,079–32,093
- Xiang et al (2020b) Xiang S, Fu Y, Xie M, et al (2020b) Unsupervised person re-identification by hierarchical cluster and domain transfer. Multimedia Tools and Applications 79:19,769–19,786
- Xiang et al (2020c) Xiang S, Fu Y, You G, et al (2020c) Unsupervised domain adaptation through synthesis for person re-identification. In: 2020 IEEE International Conference on Multimedia and Expo (ICME), IEEE, pp 1–6
- Xiang et al (2023a) Xiang S, Fu Y, Guan M, et al (2023a) Learning from self-discrepancy via multiple co-teaching for cross-domain person re-identification. Machine Learning 112(6):1923–1940
- Xiang et al (2023b) Xiang S, Qian D, Gao J, et al (2023b) Rethinking person re-identification via semantic-based pretraining. ACM Transactions on Multimedia Computing, Communications and Applications 20(3):1–17
- Xiang et al (2023c) Xiang S, Qian D, Guan M, et al (2023c) Less is more: Learning from synthetic data with fine-grained attributes for person re-identification. ACM Transactions on Multimedia Computing, Communications and Applications 19(5s):1–20
- Xiang et al (2024a) Xiang S, Chen H, Ran W, et al (2024a) Deep multimodal representation learning for generalizable person re-identification. Machine Learning 113(4):1921–1939
- Xiang et al (2024b) Xiang S, Liu C, Ruan J, et al (2024b) Vt-reid: Learning discriminative visual-text representation for polyp re-identification. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 3170–3174
- Zhang et al (2021) Zhang A, Gao Y, Niu Y, et al (2021) Coarse-to-fine person re-identification with auxiliary-domain classification and second-order information bottleneck. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 598–607
- Zhao et al (2013) Zhao R, Ouyang W, Wang X (2013) Person re-identification by salience matching. 2013 IEEE International Conference on Computer Vision pp 2528–2535
- Zhao et al (2014) Zhao R, Ouyang W, Wang X (2014) Learning mid-level filters for person re-identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 144–151
- Zheng et al (2015) Zheng L, Shen L, Tian L, et al (2015) Scalable person re-identification: A benchmark. In: Proceedings of the IEEE international conference on computer vision, pp 1116–1124
- Zheng et al (2017a) Zheng Z, Zheng L, Yang Y (2017a) A discriminatively learned cnn embedding for person reidentification. ACM transactions on Multimedia Computing, Communications, and Applications 14(1):1–20
- Zheng et al (2017b) Zheng Z, Zheng L, Yang Y (2017b) Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In: Proceedings of the IEEE international conference on computer vision, pp 3754–3762
- Zhu et al (2020) Zhu K, Guo H, Liu Z, et al (2020) Identity-guided human semantic parsing for person re-identification. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, Springer, pp 346–363