Robust Noisy Correspondence Learning via Self-Drop and Dual-Weight

Fan Liu Chenwei Dong, Chuanyi Zhang, Hualiang Zhou,
Jun Zhou Corresponding author: Chuanyi Zhang.Fan Liu, Chenwei Dong are with the College of Computer Science and Software Engineer, Hohai University, Nanjing, 210098, China.Chuanyi Zhang is with the College of Artificial Intelligence and Automation, Hohai University, Nanjing, 210098, China.Hualiang Zhou is with NARI Technology Co., Ltd., Nanjing, 211106, ChinaJun Zhou is with the School of Information and Communication Technology, Griffith University, Nathan, Queensland 4111, Australia.

Abstract

Many researchers collect data from the internet through crowd-sourcing or web crawling to alleviate the data-hungry challenge associated with cross-modal matching. Although such practice does not require expensive annotations, it inevitably introduces mismatched pairs and results in a noisy correspondence problem. Current approaches leverage the memorization effect of deep neural networks to distinguish noise and perform re-weighting. However, briefly lowering the weight of noisy pairs cannot eliminate the negative impact of noisy correspondence in the training process. In this paper, we propose a novel self-drop and dual-weight approach, which achieves elaborate data processing by qua-partitioning the data. Specifically, our approach partitions all data into four types: clean and significant, clean yet insignificant, vague, and noisy. We analyze the effect of noisy and clean data pairs and find that for vision-language pre-training models, a small number of clean samples is more valuable than a majority of noisy ones. Based on this observation, we employ self-drop to discard noisy samples to effectively mitigate the impact of noise. In addition, we adopt a dual-weight strategy to ensure that the model focuses more on significant samples while appropriately leveraging vague samples. Compared to the prior works, our approach is more robust and demonstrates relatively more stable performance on noisy datasets, especially under a high noise ratio. Extensive experiments on three widely used datasets, including Flickr30K, MS-COCO, and Conceptual Captions, validate the effectiveness of our approach.

Index Terms:

Cross-modal Matching, Noisy Correspondence, Multi-modal Learning.

I Introduction

Cross-modal matching [1, 2, 3, 4] endeavors to establish correspondences and alignments between content from distinct modalities, such as images and text or images and audio. Despite the rapid advancements in cross-modal matching methods in recent years, most methods assume that the training data is correctly aligned across various modalities. However, collecting such manual-annotated datasets, e.g., Flickr30K [5] and MSCOCO [6], is labor-intensive and time-consuming in real-world scenarios. Therefore, to mitigate the high cost of annotation, recent datasets, e.g., Conceptual Captions [7], were acquired through crowd-sourcing or web crawling. Inevitably, numerous mismatched pairs are erroneously considered as matched ones, leading to the issue of noisy correspondence [8]. In contrast to the extensively researched issue of noisy labels [9, 10, 11, 12, 13, 14], noisy correspondence refers to mismatched cross-modality pairs, posing a more challenging problem than noisy labels as they represent instance-level noise [15].

Refer to caption — Figure 1: Illustrations of complex data distributions in the real world and our observations. (a) We partition all data into four types, namely, clean and significant (green), clean yet insignificant (light green), vague (grey), and noisy (orange). The previous methods of adopting bi-partition or tri-partition are insufficient to handle such complex data distributions. (b) Comparison of rSum of CLIP under different noise/drop ratios on Flickr30K and MS-COCO 1K. The performance of VLP is differently sensitive to noisy data and discarded data and therefore a small number of clean samples is more valuable than a majority of noisy ones for fine-tuning a VLP.

Many works have paid attention to the issue of noisy correspondence [8, 16, 17]. These methods are typically based on the memorization effect [18, 19] of Deep Neural Networks (DNNs) that models tend to learn simple patterns before fitting noisy samples. This effect makes clean pairs display lower losses than noisy ones in the initial few epochs. Therefore, prior works typically train from scratch and employ a warm-up process to achieve preliminary model convergence. Nonetheless, for vision-language pre-training models (VLP) such as CLIP [20], the cost of training from scratch is substantial. In response to this issue, DSDMR [21] employs the remarkable zero-shot capabilities of pre-trained models to differentiate between noisy and clean samples. NPC [22] evaluates the negative impact of each sample to assign confidence weights. Surprisingly, NPC’s performance under noise-free conditions outperforms CLIP, demonstrating that not all clean samples are equally significant for VLP.

Most existing methods typically adopt a bi-partition assumption [8, 17] that divides data into clean and noisy subsets. Some researchers further investigate tri-partition approaches [23, 16] and divide data into clean, noisy, and vague subsets, where vague samples are challenging to distinguish from noisy ones. However, real-world data distributions tend to be more complex. Since clean samples can be unequally significant for VLP, all data can be partitioned into four types: clean and significant, clean yet insignificant, vague, and noisy (Fig. 1 (a)). Consequently, simple data partitioning strategies tend to be insufficient for handling complex data distributions and cannot eliminate the impact of noise.

To address the aforementioned challenges, we propose an efficient approach called Self-Drop and Dual-weight (SDD). Our approach is predicated on the observation that a small number of clean samples is more valuable than a majority of noisy ones for fine-tuning VLP. As illustrated in Fig. 1 (b), we adjust the noise and data drop ratios on Flickr30K and MSCOCO to evaluate the performance of CLIP. CLIP demonstrates relatively stable performance across varying data drop ratios. Conversely, as the noise ratio gradually increases, the model’s performance rapidly declines. This contrasting phenomenon becomes more pronounced with increasing noise and drop ratio. It indicates that noisy data is detrimental for VLP while decreasing the amount of training data only has a minor effect. Inspired by this observation, our method aims to identify and discard noisy data, even including some vague samples, to minimize optimization risks. Specifically, our SDD encompasses two core modules to mitigate the impact of noise and enhance its focus on significant parts within clean samples. In the first module, SDD leverages the zero-shot capabilities inherent in VLP to compute the similarity of the given image-text pairs. Subsequently, SDD treats samples with low similarity as noisy ones and opts to discard them. This sample selection strategy can effectively mitigate the impact of noise to benefit robust cross-modal matching. In the second module, SDD introduces a dual-weight strategy to assign confidence and significance weights from different perspectives. In detail, the confidence weights are dynamically determined by the similarity between sample pairs, and the significance weights are derived by assessing the samples’ contribution to model training. The former appropriately leverages vague samples, while the latter ensures the model focuses more on clean and significant samples.

Our main contributions can be summarized as follows:

•

We emphasize that existing works should consider a more complex data assumption: clean and significant, clean yet insignificant, vague, and noisy. Properly treating four types of data contributes to noise-robust training.
•

We propose an efficient Self-Drop and Dual-weight (SDD) approach to achieve robustness against noisy correspondence. Specifically, self-drop effectively mitigates the misguidance from noisy data. Dual-weight strategy enhances the impact of clean and significant samples while appropriately leveraging vague ones.
•

Extensive experiments on three widely used datasets, including Flickr30K, MS-COCO, and Conceptual Captions, demonstrate the effectiveness of our approach.

II RELATED WORK

In this section, we provide a brief overview of recent advancements in three interconnected fields: cross-modal matching, learning with noisy labels, and learning with noisy correspondence.

II-A Cross-modal Matching

Cross-modal matching strives to create associations and alignments between content from various modalities, such as images and text. Conventional image-text matching models can be broadly classified into two types: 1) global-level matching methods [24, 25, 26], which aligned the visual features captured from an image with the overall semantic feature extracted from a text; and 2) local-level matching methods [1, 2, 27, 28], which aimed to establish fine-grained connections between local regions of an image and individual words in text.

In recent years, vision-language pre-training models, e.g., CLIP [20], have demonstrated powerful zero-shot capabilities. Moreover, CLIP has been shown to exhibit flexibility, allowing it to be incorporated into a wide range of cross-modal tasks, including detection [29, 30], segmentation [31], and captioning [32].

However, CLIP still exhibits a high sensitivity to noise when attempting to generalize to downstream tasks through fine-tuning. In this paper, we employ CLIP as our backbone and propose a novel approach for fine-tuning VLP for cross-modal matching under the noisy correspondence scenario.

II-B Learning with Noisy Labels

Learning with noisy labels has been extensively studied and is generally concerned with addressing the label noise problem in classification tasks. Noisy label learning algorithms can be broadly categorized into four types: adding regularization [33, 34, 35, 36], loss adjustment [37, 38], sample selection [39, 9, 10], and label correction [40, 11]. Adding regularization can effectively mitigate overfitting [41] such as data augmentation [33], weight decay [34], dropout [35] and batch normalization [36], yet it exhibits suboptimal performance in high-noise scenarios. Loss adjustment is adaptively conducted based on the distinction between noisy and clean samples. Reed et al. [37] enhanced the robustness against noise by injecting consistency through network prediction targets. Active Bias [38] emphasized samples with uncertain predictions by assigning the predicted variance as the training weight. Sample selection involves choosing clean samples for training from a noisy dataset. Inspired by the memorization effect [18] of DNNs, Arazo et al. [39] selected clean samples with the discrepancy in loss values observed during the training process. Although this method is outstanding, it inevitably leads to error accumulation. Therefore, subsequent methods [9, 10] often employed two DNNs for training in order to mitigate error accumulation. The objective of label correction is to rectify the labels of unreliable samples. AdaCorr [40] introduced a novel label correction algorithm based on the predictions of a noise classifier. SELFIE [11] selectively refurbished and utilized imprecise samples that can be corrected with high accuracy, thereby progressively augmenting the number of available training samples.

However, since learning with noisy labels is tailored for classification tasks and noisy correspondence is an instance-level issue, the methods designed for noisy labels cannot be directly transferred to address the noisy correspondence.

II-C Learning with Noisy Correspondence

Noisy correspondence refers to mismatched pairs that are erroneously considered matched ones. Given the broad applicability of data pairs, methods against noise correspondence have developed across numerous fields, including person re-identification [42, 43, 44, 45], graph matching [46], multi-view clustering [47, 48, 49, 50], and image captioning [51, 52].

Earlier works [8, 53, 54] on cross-modal matching have explored the issue of noisy correspondence. NCR [8] recast the rectified labels into soft margins for triplet loss, thereby enabling robust cross-modal matching. CTPR [23] further extended NCR by dividing the image-text pairs within the training data into subsets of clean, hard, and noisy pairs to mitigate the challenge of selecting hard samples. BiCro [55] reassigned the noisy image-text pairs through bidirectional cross-modal similarity consistency that similar images should have similar textual descriptions and vice versa. MSCN [56] introduced meta-learning and proposed a meta-correction network to provide reliable noise similarity scores. CREAM [16] rectified possible noisy correspondence within positive sample pairs and exploited diverse latent consistency in negative sample pairs. GSC [57] introduced a multimodal geometrical structure consistency by optimizing a contrastive loss that aligns the geometrical structures within modalities and a traditional loss for cross-modal alignment, thereby preserving the integrity of geometrical structures. ESC [58] posited that for any two matched samples, the semantic changes induced by image variations should be proportional to those induced by text variations. By leveraging this equivariant similarity consistency, ESC achieves robust cross-modal matching.

However, the noisy correspondence research in VLP is still in its infancy. DSDMR [21] employs the remarkable zero-shot capabilities of pre-trained models to differentiate between noisy and clean samples. NPC [22] evaluates the negative impact of each sample to assign confidence weights. Although these models have demonstrated promising performances, they typically adopt a bi-partition assumption or tri-partition assumption. In contrast to these works, we adopt a more detailed assumption to achieve robustness against noisy correspondence.

III METHOD

In this section, we provide a detailed explanation of the proposed SDD, which consists of two modules integrated with a novel objective function to achieve robust cross-modal matching against noisy correspondences. Section III-A defines the problem of noisy correspondence. Section III-B revisits the feature representation in CLIP. Section III-C discusses the method of discarding the majority of noisy samples through self-drop. Section III-D elaborates on the dual-weight strategy, which adaptively weights the remaining samples based on confidence and importance. Section III-E introduces the proposed objective function to robust cross-modal matching. Finally, we discuss the differences between our SDD and NPC [8] in Section III-F to show our innovation. The framework of SDD is illustrated in Fig. 2.

III-A Problem Definition

To ensure broad applicability, we explore the issue of noisy correspondence in cross-modal matching with the example of image-text matching. Given the noisy dataset $D=\{(I_{i},T_{i})\}_{i=1}^{N}$ that consists of a total of $N$ training samples, where $(\,I_{i},T_{i})\,$ represents the $i$ -th image-text pair. Generally, the aim of image-text matching models with parameters $\Theta$ is to project visual and textual modalities into a shared representation space via image encoder $f$ and text encoder $g$ , respectively. Then the similarity of the given image-text pairs is calculated through $S(I_{i},T_{i})$ , which we abbreviate as $s_{i,i}$ for simplicity, as expressed in the following equation:

S(I_{i},T_{i})=s_{i,i}=\frac{f(I_{i})\cdot g(T_{i})}{||f(I_{i})||\cdot||g(T_{i})||}.

(1)

Algorithm 1 Self-Drop and Dual-Weight Training Algorithm

Input: Training set $D=\{(I_{i},T_{i})\}_{i=1}^{N}$ , memory bank $D_{MB}=\{(I_{i}^{I},T_{i}^{I}),(I_{i}^{T},T_{i}^{T})\}_{i=1}^{N}$
Output: Robust matching model $\theta_{b}^{E}$

1:Initialize parameters

\theta_{b}^{e}

for base model

2:for each epoch

e=1,2,...,E

//

self-drop

4: Compute the similarity of the given image-text pairs

s=\{s_{i,i}\}_{i=0}^{N}

by Eq. (1)

5: Construct partial dataset

D_{p}=\{(I_{i},T_{i})|s_{i,i}>\alpha\}

//

confidence weight

7: Obtain confidence weights

\{w_{con}^{i}\}_{i=0}^{N}\longleftarrow GMM(s)

by Eq. (5)-(6)

8: Copy parameters from

\theta_{b}^{e}

to create siamese model

\theta_{s}^{e}

9: for each mini-batch

\{(I_{i},T_{i})\}_{i=1}^{B}

from

D_{p}

with a batch size of

B

10:

//

significance weight

11: Compute siamese model loss

l_{i}=\{(l_{i2t}^{i}[e],l_{t2i}^{i}[e])\}_{i=1}^{B}

by Eq. (9)

12: Calculate siamese model loss

w_{con}^{i}\mathcal{L}_{\mathrm{InfoNCE}}(I_{i},T_{i})

with mini-batch

\{(I_{i},T_{i})\}_{i=1}^{B}

13: Update siamese model parameters

\theta_{s}^{e}

\theta_{s}^{e+1}

14: Compute siamese model loss

l_{i}=\{(l_{i2t}^{i}[e+1],l_{t2i}^{i}[e+1])\}_{i=1}^{B}

by Eq. (9)

15: Calculate significance weights

\{w_{sig}^{i}\}_{i=1}^{B}

by Eq. (10)-(11)

16:

//

robust matching

17: Calculate base model loss

w_{con}^{i}w_{sig}^{i}\mathcal{L}_{\mathrm{InfoNCE}}(I_{i},T_{i})

with mini-batch

\{(I_{i},T_{i})\}_{i=1}^{B}

18: Update base model parameters

\theta_{b}^{e}

\theta_{b}^{e+1}

19: end for

20:end for

III-B Revisiting Feature Representations in CLIP

In various vision-language learning tasks, the VLP model CLIP [20] has demonstrated remarkable generalization capabilities. In this paper, we choose CLIP as the backbone of our approach. It adopts a simple InfoNCE loss function for optimization [59], which promotes the alignment of matched image-text samples while pushing apart mismatched samples:

\mathcal{L}_{\mathrm{InfoNCE}}(I_{i},T_{i})=\mathcal{C}E(I_{i},T_{i})+\mathcal{C}E(T_{i},I_{i}),

(2)

\mathcal{C}E(x_{i},y_{i})=-\log\left(\frac{\exp(S(x_{i},y_{i})/\tau)}{\begin{matrix}\sum_{j=1}^{B}\end{matrix}\exp(S(x_{i},y_{j})/\tau)}\right),

(3)

where $\tau$ is a temperature parameter and $B$ is the batch size.

CLIP achieves strong zero-shot capabilities by leveraging large-scale image-text pairs collected from the internet for training. However, as shown in Fig. 1 (b), CLIP still exhibits a high sensitivity to noise when attempting to generalize to downstream tasks through fine-tuning.

III-C Self-Drop Module

Prior works are typically based on the memorization effect [18, 19] of DNNs, which tend to learn simple patterns before fitting noisy samples. Consequently, these works typically train from scratch and employ a warm-up process to achieve preliminary model convergence. However, this paradigm presents two shortcomings: 1) for VLP, such as CLIP, the cost of training from scratch is substantial; and 2) in the warm-up period, models would inevitably fit the noisy samples [17], thus degrading the performance. To tackle these issues, by considering the strong zero-shot capabilities inherent in CLIP, we aim to identify and discard noisy data, even including vague samples, to minimize optimization risk.

To begin, as illustrated in Fig. 2, we project $N$ image-text pairs into a shared embedding space through image and text encoder, respectively. Subsequently, we compute the cosine similarity of image-text pairs by Eq. (1) and store it in a matrix of size $N^{2}$ . The required similarity of image-text pairs is located on the diagonal of the matrix, which can be expressed as $\{S_{i,i}\}_{i=1}^{N}$ . Since noisy samples are detrimental (Fig. 1 (b)), selecting an appropriate proportion of noisy data and discarding them may be a more cost-effective approach compared to utilizing the noisy data. Specifically, by setting a threshold $\alpha$ for $\{s_{i,i}\}_{i=1}^{N}$ , we drop the majority of noisy pairs to construct the partial dataset.

As a special phenomenon of the CLIP-based model, the cosine similarity of positive pairs in CLIP is typically around 0.3, while of negative pairs it is around 0.1 [60, 61]. Due to the minimal difference, CLIP usually employs a temperature parameter $\tau$ to amplify it by a factor of 100. Thus, In CLIP, the similarity scores for noisy and clean pairs are approximately 10 and 30, respectively. For simplicity, we set the threshold $\alpha$ to 20 through all experiments. The selection of hyperparameter $\alpha$ will be discussed in subsequent visualization experiments. Samples with $s_{i,i}\leq\alpha$ will be discarded and the remaining samples will construct a partial dataset as follows:

D_{p}=\{(I_{i},T_{i})|s_{i,i}>\alpha\}.

(4)

III-D Dual-Weight Module

After self-drop, it is still risky to directly train the model on $D_{p}$ [62] because noise discarding cannot be perfect. The filtered dataset $D_{p}$ potentially contains a small number of noisy pairs and vague pairs that are difficult to distinguish from the noisy ones [23, 16]. Additionally, clean yet insignificant samples are often overlooked by prior works. Therefore, we propose a dual-weight strategy that assigns confidence and significance weights from different perspectives to suppress noise data, properly utilize vague pairs, and enhance the contribution of significant clean samples.

III-D1 Confidence Weight

The purpose of confidence weight is to minimize the impact of a small number of noisy samples in $D_{p}$ while appropriately leveraging the vague samples within $D_{p}$ . Following DSDMR [21], we fit the similarity of all pairs by a two-component Gaussian Mixture Model (GMM) [63, 64]:

p(x|\theta)=\sum_{k=1}^{K}\alpha_{k}\phi(x|\theta_{k}),

(5)

where $\alpha_{k}$ denotes the mixture coefficient, and $\phi(x|\theta_{k})$ denotes the probability density of the $k^{th}$ component. The confidence probability $w_{con}^{i}$ of pair $i$ is calculated by:

w_{con}^{i}=p(\theta_{k}|x_{i})=\frac{p(\theta_{k})p(x_{i}|\theta_{k})}{p(x_{i})},

(6)

where $\theta_{k}$ is the Gaussian component with the higher mean.

We utilize $w_{con}$ generated by the GMM to re-weight the data in $D_{p}$ . After that, vague samples can contribute to model training with proper weight, while the negative impact of a small portion of noisy samples in $D_{p}$ is diminished.

III-D2 Significance Weight

NPC’s [22] performance on noise-free remarkably outperforms CLIP, demonstrating that not all clean samples are equally significant for VLP. Inspired by this observation, we introduce a significant weight into our approach.

It is a rational assumption that the model’s performance should be improved after training on clean samples compared to before the training. In other words, if the model’s performance is worse after training than before, we can regard training samples that are insignificant or even noisy. For convenience, we utilize the loss value of the model as a substitute for its performance. We evaluate the change in loss values before and after training with a memory bank $(MB)$ to calculate the significance of each sample. The memory bank is a reliable set of clean samples that correspond to each sample in the noisy dataset $D$ .

To obtain $MB$ , a clean dataset $D_{c}$ is sampled with a relatively high threshold $\beta$ (30):

D_{c}=\{(I_{i},T_{i})|s_{i,i}>\beta\}.

(7)

Then, for each image-text pair $(I_{i},T_{i})$ in the dataset $D$ , we assign two evaluation entries $(I_{i}^{I},T_{i}^{I})$ and $(I_{i}^{T},T_{i}^{T})$ to it, where $I_{i}^{I}$ shows the highest similarity with $I_{i}$ in $D_{c}$ and $T_{i}^{T}$ shows the highest similarity with $T_{i}$ in $D_{c}$ . Specifically, the memory bank can be denoted as:

D_{MB}=\{(I_{i}^{I},T_{i}^{I}),(I_{i}^{T},T_{i}^{T})\}_{i=1}^{N},(I_{i}^{*},T_{i}^{*})\in D_{c}.

(8)

Samples in the partial dataset $D_{p}$ are evaluated to obtain significance weight by clean samples from the memory bank. As illustrated in Fig. 2, we initially create a siamese model by copying the parameters $\Theta_{b}$ of the base model, with the parameters of the siamese model represented as $\Theta_{s}$ . We use the model’s loss $l^{i}$ on clean $D_{MB}$ to represent performance, where a lower loss indicates a higher significance of the sample. Formally, for the $i$ -th data pair $(I_{i},T_{i})$ in $D_{p}$ , we can compute its i2t loss and t2i loss of its evaluation entry in the current epoch $e$ as follows:

\begin{split}&l_{i2t}^{i}[e]=\mathcal{C}E(I_{i}^{I},T_{i}^{I})+\mathcal{C}E(I_{i}^{T},T_{i}^{T}),\\ &l_{t2i}^{i}[e]=\mathcal{C}E(T_{i}^{I},I_{i}^{I})+\mathcal{C}E(T_{i}^{T},I_{i}^{T}).\end{split}

(9)

Then, we train the siamese model using data from $D_{p}$ weighted by $w_{con}$ and update the siamese model parameters $\Theta_{s}^{e}$ to $\Theta_{s}^{e+1}$ . The change in performance after updating the siamese model parameters can be calculated by:

c_{i}=\frac{1}{2}(\frac{l_{i2t}^{i}[e]}{l_{i2t}^{i}[e+1]}+\frac{l_{t2i}^{i}[e]}{l_{t2i}^{i}[e+1]}).

(10)

Finally, the significance of clean samples can be defined by:

w_{sig}^{i}=\begin{cases}tanh(c_{i})&,\text{if $c_{i}<1$}\\ 1&,\text{otherwise}\end{cases}.

(11)

The values of performance change $c_{i}$ are densely distributed around 1. Therefore, we aim to discretize the weights of significant and insignificant samples through a mapping function, such as the $tanh$ or the $sigmoid$ function. Samples that decrease performance after training ( $c_{i}<1$ ) will be assigned with smaller weights, while those that improve performance ( $c_{i}\geq 1$ ) are set the weights to 1.

III-E Loss Function

After obtaining dual-weight $w_{con}$ and $w_{sig}$ , we can update the base model parameters $\Theta_{b}^{e}$ to $\Theta_{b}^{e+1}$ by utilizing partial dataset $D_{p}$ . The overall loss function of SDD is formulated as expressed as follows:

\mathcal{L}=\sum_{(I_{i},T_{i})\in D_{p}}w_{con}^{i}w_{sig}^{i}\mathcal{L}_{\mathrm{InfoNCE}}(I_{i},T_{i}),

(12)

where confidence weight $w_{con}^{i}$ minimizes the impact of noisy samples and appropriately leverages vague samples, while significance weight $w_{sig}^{i}$ ensures the model focuses more on clean and significant samples. Algorithm 1 summarizes our proposed SDD.

III-F Discussions

It should be acknowledged that NPC [22] is one of our motivations as well as a SOTA approach for noisy correspondence problems. In this subsection, we present detailed discussions from various aspects to deeply recognize the differences between our SDD and NPC. We believe a comprehensive analysis could provide a more thorough understanding of our approach.

III-F1 Sample Selection

NPC aims to estimate the negative impact of each sample and re-weights all samples before model training without sample selection. Conversely, our proposed SDD is established from the observation that a small number of clean samples is more valuable than a majority of noisy ones for fine-tuning VLP. Therefore, we perform sample selection to explicitly discard a large proportion of noisy data to enhance the data reliability.

III-F2 Weighting Method

Our significance weight is indeed motivated by NPC, indicating that not each sample contributes equally to model training. However, in our research, we discover that relying solely on a single weight (NPC’s method) is insufficient to eliminate the negative impact of noise (we will demonstrate it in experiment Fig. 4). Therefore, we further explore the weighting strategy and propose a dual-weight method to re-weight samples from both confidence and significance perspectives.

TABLE I: Image-text matching performance under synthetic noise ratios of 20%, 40%, and 60% on Flickr30K and MS-COCO 1K. The best and second-best results are highlighted in each column.

		Flickr30K							MS-COCO 1K
		Image $\longrightarrow$ Text			Text $\longrightarrow$ Image				Image $\longrightarrow$ Text			Text $\longrightarrow$ Image
Noise	Methods	R@1	R@5	R@10	R@1	R@5	R@10	rSum	R@1	R@5	R@10	R@1	R@5	R@10	rSum
20%	NCR [8]	75.0	93.9	97.5	58.3	83.0	89.0	496.7	77.7	95.6	98.2	62.6	89.3	95.3	518.7
	BiCro [55]	78.1	94.4	97.5	60.4	84.4	89.9	504.7	78.8	96.1	98.6	63.7	90.3	95.7	523.2
	MSCN [56]	76.4	94.5	97.6	58.8	83.5	89.2	500.0	78.1	97.2	98.8	64.3	90.4	95.8	524.6
	CRCL [65]	77.9	95.4	98.3	60.9	84.7	90.6	507.8	79.6	96.1	98.7	64.7	90.6	95.9	525.6
	GSC [57]	78.3	94.6	97.8	60.1	84.5	90.5	505.8	79.5	96.4	98.9	64.4	90.6	95.9	525.7
	ESC [58]	79.0	94.8	97.5	59.1	83.8	89.1	503.3	79.2	97.0	99.1	64.8	90.7	96.0	526.8
	DSDMR [21]	85.1	97.0	99.2	69.7	90.3	94.8	536.1	80.5	95.8	98.4	65.8	90.9	96.2	527.6
	NPC [22]	87.3	97.5	98.8	72.9	92.1	95.8	544.4	79.9	95.9	98.4	66.3	90.8	98.4	529.7
	SDD	87.6	97.5	99.5	74.3	93.4	96.8	549.1	81.4	96.0	98.5	67.1	91.2	98.5	532.7
40%	NCR [8]	73.5	92.6	95.8	55.7	80.3	86.9	484.8	76.6	95.6	98.2	61.0	88.9	94.9	515.2
	BiCro [55]	74.6	92.7	96.2	55.5	81.1	87.4	487.5	77.0	95.9	98.3	61.8	89.2	94.9	517.1
	MSCN [56]	69.5	90.8	95.7	53.2	79.9	86.4	475.5	74.5	96.0	98.1	60.8	89.0	95.0	513.4
	CRCL [65]	77.8	95.2	98.0	60.0	84.0	90.2	505.2	78.2	95.7	98.3	63.3	90.3	95.7	521.5
	GSC [57]	76.5	94.1	97.6	57.5	82.7	88.9	497.3	78.2	95.9	98.2	62.5	89.7	95.4	519.9
	ESC [58]	76.1	93.1	96.4	56.0	80.8	87.2	489.6	78.6	96.6	99.0	63.2	90.6	95.9	523.9
	DSDMR [21]	85.2	97.0	99.0	69.6	90.2	94.9	535.9	79.9	95.9	98.3	64.8	90.3	95.9	525.1
	NPC [22]	85.6	97.5	98.4	71.3	91.3	95.3	539.4	79.4	95.1	98.3	65.0	90.1	98.3	526.2
	SDD	87.3	98.0	99.2	73.5	92.9	96.6	547.5	81.5	95.9	98.5	66.8	91.0	98.5	532.2
60%	NCR [8]	70.0	91.0	94.4	52.3	76.9	84.0	468.6	72.6	93.8	97.4	57.0	86.4	93.6	500.8
	BiCro [55]	67.6	90.8	94.4	51.2	77.6	84.7	466.3	73.9	94.4	97.8	58.3	87.2	93.9	505.5
	MSCN [56]	68.8	88.6	93.1	48.8	76.4	84.0	459.7	73.7	95.1	98.5	57.0	86.9	94.0	505.2
	CRCL [65]	73.1	93.4	95.8	54.8	81.9	88.3	487.3	76.3	95.1	97.9	60.8	89.0	95.1	514.2
	GSC [57]	70.8	91.1	95.9	53.6	79.8	86.8	478.0	75.6	95.1	98.0	60.0	88.3	94.6	511.7
	ESC [58]	72.6	90.9	94.6	53.0	78.6	85.3	475.0	77.2	95.1	98.1	61.1	88.6	94.9	515.0
	DSDMR [21]	85.8	96.9	99.1	69.5	90.1	94.7	536.1	78.9	95.2	98.2	63.5	89.5	95.7	521.0
	NPC [22]	83.0	95.9	98.6	68.1	89.6	94.2	529.4	78.2	94.4	97.7	63.1	89.0	97.7	520.1
	SDD	87.2	96.8	99.0	72.4	92.4	96.2	544.0	79.5	95.9	98.6	65.2	90.3	98.6	528.1

IV Experiments

In this section, we present a comprehensive experimental validation of the proposed SDD from multiple perspectives. To this end, we conducted extensive experiments on image-text retrieval across three benchmark datasets. The structure of this section is organized as follows. In Section IV-A, we provide a detailed description of the experimental setup, including the datasets and implementation details. In Section IV-B, we compare the performance of our method with 8 state-of-the-art (SOTA) approaches to validate its effectiveness. Furthermore, we assess the noise robustness of our method through stability comparisons and progressive analyses. Finally, in the interest of fairness, we also compare SDD with other methods that similarly utilize a CLIP ViT-B/32-based backbone. In Section IV-C, we conduct a series of ablation studies to provide a comprehensive understanding of SDD.

IV-A Experimental Setting

IV-A1 Datasets and Evaluation Metrics

Following previous works, the proposed SDD was evaluated on three benchmark datasets, Flickr30K [5], MSCOCO [6], and CC120K [22]:

•

Flickr30K comprises 31,783 images with 5 annotated texts per image. Following previous works [8], 29,783, 1,000, and 1,000 images were used for training, validation, and testing, respectively.
•

MS-COCO encompasses 123,287 images with 5 annotated captions per image. Following previous works [8], we employed 113,287 images for training, with 5,000 images allocated for validation and 5,000 images reserved for testing.
•

CC120K is a subset of the web-crawled dataset Conceptual Captions [7], with about 3%-20% mismatched image-text pairs. CC120K consists of 120,851 images with a single caption for each. Following NPC [22], we utilized 118,851 images for training, 1,000 images for validation, and 1,000 images reserved for testing.

TABLE II: Image-Text Matching on MS-COCO 5K. The Best and second-best results are highlighted.

		Image $\longrightarrow$ Text			Text $\longrightarrow$ Image
Noise	Methods	R@1	R@5	R@10	R@1	R@5	R@10	rSum
40%	NCR [8]	55.5	82.4	90.2	39.7	68.5	79.2	415.5
	BiCro [55]	56.3	83.0	90.8	40.1	69.0	79.5	418.7
	MSCN [56]	49.7	78.9	88.0	36.9	66.1	77.1	396.7
	CRCL [65]	55.8	83.1	90.1	40.9	67.8	80.6	418.3
	ESC [58]	56.2	83.2	90.7	41.0	69.5	79.8	420.4
	NPC [22]	61.1	84.8	90.7	44.7	72.1	81.7	435.1
	SDD	63.7	85.6	91.8	47.0	73.8	83.0	444.9
60%	NCR [8]	49.6	78.1	87.3	35.5	64.2	75.7	390.4
	BiCro [55]	52.5	80.0	88.4	37.8	66.2	77.1	402.0
	MSCN [56]	48.1	76.0	85.5	34.5	63.5	75.1	382.7
	CRCL [65]	53.1	81.2	89.0	37.6	66.3	77.4	404.6
	ESC [58]	53.4	81.1	89.2	38.2	66.7	77.5	406.1
	NPC [22]	59.7	82.9	89.7	43.0	70.2	79.9	425.4
	SDD	62.0	85.1	91.5	45.5	72.2	81.9	438.2

TABLE III: Comparison with baselines on CC120K. The Best and second-best results are highlighted in each column.

	Image $\longrightarrow$ Text			Text $\longrightarrow$ Image
Methods	R@1	R@5	R@10	R@1	R@5	R@10	rSum
CLIP [20]	68.8	87.0	92.9	67.8	86.4	90.9	493.8
NPC [22]	71.1	92.0	96.2	73.0	90.5	94.8	517.6
SDD	72.2	92.2	95.5	73.0	91.4	95.0	519.3

We evaluated SDD with the recall rate at K (R@K) which measures the proportion of relevant items found within the top K results of a ranked list. We took the image and text as queries, respectively, and calculated the corresponding results of R@1, R@5, and R@10, which were further summed to evaluate the overall performance, i.e., rSum. Additionally, we used the variance $(var)$ of rSum under different noise ratios to evaluate the approaches’ performance stability, with lower $var$ indicating higher stability.

IV-A2 Implementation Details

We adopted the pre-trained CLIP [20] with ViT-B/32 as our backbone and trained SDD on a single RTX 3090 GPU. We employed a batch size of 128 and an AdamW [66] optimizer with a weight decay of 0.2. In all experiments, we trained the model for 5 epochs with a learning rate of $2e-7$ . The hyperparameters $\alpha$ and $\beta$ were set to 20 and 30, respectively.

IV-B Comparison with State-of-the-Art

IV-B1 Experiments on Flickr30K and MS-COCO

To verify the effectiveness of SDD, we conducted comparison experiments with 8 state-of-the-art methods, including NCR [8], BiCro [55], MSCN [56], CRCL [65], GSC [57], ESC [58], DSDMR [21], and NPC [22]. Given that the data in Flickr30K and MS-COCO are correctly paired, we introduced noisy correspondences by randomly shuffling the captions of 20%, 40%, and 60% of training images. Table I shows results on Flickr30K and MS-COCO under different noise ratios. For Flickr30K and MS-COCO 1K, we obtained results from ESC [58] and the publications introducing the respective models. For MS-COCO 5K, we used the data reported in the ESC as the results.

From the experimental results, it can be seen that the proposed SDD demonstrates significant improvements over all state-of-the-art methods. In the case of Flick30K, SDD shows advantages over the second-best method in the rSum score of recalls by 4.7%, 8.1%, and 7.9% under different noise ratios, respectively. For MS-COCO 1K, the rSum scores of SDD are 3.0%, 6.0%, and 7.1% higher than the second-best method. In Table II, we also demonstrate the superior performance of SDD on the MS-COCO full 5K dataset. SDD surpasses the second-best method in the rSum score of recalls by 9.8% and 12.8%.

IV-B2 Experiments on CC120K

Table III shows the results on CC120K under a real-world noisy correspondence scenario. The baseline results are derived from NPC [22] for convenience. It can be observed from Table III that SDD achieves the best performance with an overall score of 519.3%, surpassing the second-best method NPC by 1.7%. The results affirm SDD’s capability to manage not only simply simulated but also complex, real-world noisy correspondences.

IV-B3 Stability Comparison

To further investigate the performance stability of SDD, we present in Fig. 3 the rSum change curves of different methods on Flickr30K and MS-COCO 1K under varying noise ratios. The performance of each method is obtained from its original paper respectively. From Fig. 3, it can be observed that the proposed SDD considerably outperforms all methods across all noise ratios. SDD demonstrates relatively stable performance under different noise ratios, whereas the performance of other methods significantly decreases as the noise ratio increases. Furthermore, we calculate the variance of each method under different noise ratios to quantify the stability of the methods. Our model shows a significant stability gap compared with most other methods, with variances of only 4.54% and 4.25% on Flickr30K and MS-COCO 1K, respectively. Although DSDMR has a lower variance than SDD on Flickr30K, its performance is evidently worse than ours. These results validate the effectiveness of mitigating the negative effects of noise through self-drop and confidence weight $w_{con}$ .

TABLE IV: Comparison of methods with ViT-B/32 backbone on noisy MSCOCO. The best and second-best results are highlighted in each column.

noise	method	1K R@1	5K R@1	1K RSUM
20%	CLIP [20]	66.8	47.2	507.2
	VSE $\infty$ [67]	72.0	51.4	520.2
	PCME [68]	69.9	48.1	519.3
	PCME++ [69]	70.8	49.5	522.4
	PAU [70]	71.4	51.7	521.5
	NPC [22]	73.1	53.8	529.8
	SDD	74.3	55.5	532.7
50%	CLIP [20]	60.9	41.4	486
	VSE $\infty$ [67]	38.5	18.4	390.5
	PCME [68]	65.8	43.0	505.7
	PCME++ [69]	65.7	44.0	503.9
	PAU [70]	69.3	49.6	513.4
	NPC [22]	71.3	51.9	523.4
	SDD	73.0	54.5	530.1

TABLE V: Effectiveness of each component on Flickr30K with 40% noise ratio. The best results are marked by bold.

Components			Image $\longrightarrow$ Text
self-drop	$w_{con}$	$w_{sig}$	R@1	R@5	R@10	rSum
$\checkmark$	$\checkmark$	$\checkmark$	87.3	98.0	99.2	284.5
$\checkmark$	$\checkmark$		87.2	97.8	99.1	284.1
$\checkmark$		$\checkmark$	86.2	97.9	99.4	283.5
	$\checkmark$	$\checkmark$	86.7	98.0	99.3	284.0
$\checkmark$			85.8	97.8	99.4	283.0
			Text $\longrightarrow$ Image
$\checkmark$	$\checkmark$	$\checkmark$	73.5	92.9	96.6	263.0
$\checkmark$	$\checkmark$		73.6	92.8	96.3	262.7
$\checkmark$		$\checkmark$	72.1	92.4	96.2	260.7
	$\checkmark$	$\checkmark$	73.2	92.9	96.4	262.5
$\checkmark$			72.3	92.5	96.2	261.0

IV-B4 Progressive Comparison

In order to demonstrate the noise-robustness of our approach, we further investigate the training process by conducting a progressive comparison. We selected NPC as a compared baseline method because it generally demonstrated the second-best performance on three datasets. The results of the NPC were reproduced by us with the same experimental setting as ours. In Fig. 4, we present the results of the R@1 average values for NPC and SDD on the Flickr30K and MS-COCO validation sets. The results indicate that our SDD demonstrates promising performance across all noise ratios and maintains relative stability throughout the training process. Conversely, although NPC initially exhibits notable performance, its effectiveness gradually declines due to the accumulation of errors. This phenomenon indicates that using a single weight for re-weighting (NPC’s strategy) is insufficient to address the challenges posed by noisy correspondence. Therefore, we opt to discard noisy samples using self-drop and further mitigate the impact of noise by dual-weight.

IV-B5 Comparison with ViT-B/32 Backbone Methods

Since the CLIP ViT-B/32-based backbone inherently has advantages over other backbones, we also provide results of other methods adopting the ViT-B/32 backbone in Table IV, including VSE $\infty$ [67], PCME [68], PCME++ [69], PAU [70], NPC [22], for a fair comparison. The results of the models compared with SDD in this experiment are all from the paper of NPC [22]. Specifically, we conducted comparative experiments under 20% and 50% noise correspondence scenarios, and report the mean R@1 MS-COCO 1K and 5K, as well as rSum on 1K.

As demonstrated in Table IV, under 20% and 50% noise ratios, the rSum of SDD outperforms the previous best-performing method NPC by 2.9%, and 6.7%, respectively. Moreover, under 20% noise ratio, SDD surpasses NPC by 1.2% and 1.7% on MSCOCO 1K and 5K, respectively, and under 50% noise ratio, SDD surpasses NPC by 1.7% and 2.6% on MSCOCO 1K and 5K, respectively. It is worth noting that, as a representative of VLP, CLIP exhibits strong zero-shot capabilities but its performance significantly deteriorates in noisy scenarios. Specifically, when the noise ratio increases from 20% to 50%, CLIP’s rSum on MS-COCO 1K decreases by 21.2%. Compared to other CLIP-based methods, our work significantly reduces the risk of learning noise data with CLIP ViT-B/32-based backbone as much as possible, which demonstrates the value of our work.

IV-C Ablation Study

We undertook the ablation study on the Flickr30K with a noise ratio of 40% to detailedly demonstrate our approach.

IV-C1 Effectiveness of each component

We show the effect of each component in Table V. Specifically, we ablated the contributions of three key components of SDD, i.e., self-drop, confidence weight $w_{con}$ , and significance weight $w_{sig}$ . From Table V, we observe the following conclusions: 1) The full SDD could achieve the best overall performance, showing that all three components are important to improve the robustness against noisy correspondence. 2) The model’s performance significantly declines without confidence weight $w_{con}$ demonstrating that a small number of noisy samples can have a severe impact on the model.

IV-C2 Reliability Analysis

The memory bank is used to assess the significance of samples, thus a reliable memory bank is a prerequisite for the superiority of our method. To verify the reliability of the memory bank, we visualize the data distribution of the clean dataset $D_{c}$ in Fig. 6 (a). We compare different choices of threshold $\beta$ on the Flickr30K dataset with low (20%) to high (60%) noise ratios, respectively. From Fig. 6 (a), it can be observed that the proportion of noisy samples in $D_{c}$ is larger when $\beta$ is small, but it can be negligible under a high threshold setting. Therefore, the samples in our memory bank are sufficiently reliable.

In addition, we also visualize the distribution of samples discarded by SDD in Fig. 6 (b) to verify the reliability of self-drop. Obviously, with the selection of an appropriate threshold, the self-drop strategy merely discards a small number of clean samples. At low noise ratios (20%), the self-drop strategy discards only about 1% of the clean samples, and at high noise ratios (60%), the number of clean samples discarded by the self-drop strategy is even lower than 0.2%. As we previously observed in Fig. 1 (b), discarding a small portion of clean samples does not significantly impact the model, whereas a few noisy samples can have a substantial negative effect. Therefore, it is sufficiently reasonable that SDD filters noisy correspondence samples by self-drop.

IV-C3 Similarity Distribution

To further explore the influence of our method, we visualized the similarity distribution of clean and noisy pairs at different training stages of our SDD in Fig. 5 with the second-best baseline NPC for comparison. As training progresses, the similarity of clean samples of SDD rises higher while that of noisy data decreases, exhibiting a clear trend of separation. This phenomenon indicates that our approach less fits noisy samples and remarkably suppresses the negative effect caused by them. Meanwhile, one could observe that there is a clear distinction between noisy and clean samples in NPC’s distribution in the initial stage (Fig. 5 (a)). However, the overlap between noisy and clean samples increases in the following epochs (Fig. 5 (b)-(d)), indicating that NPC inevitably fits noisy data. Conversely, the noise samples and clean samples in SDD exhibit a more evident separation trend. This phenomenon accounts for the reason why SDD outperforms NPC across multiple datasets and demonstrates greater stability.

IV-C4 Hyperparameter Analysis

The selection of hyperparameters $\alpha$ and $\beta$ is crucial for SDD. The former determines the threshold for discarding noisy samples, while the latter sets the threshold for constructing the memory bank. As illustrated in Fig. 5 (a), the similarity of noisy samples is typically below 20, whereas that of clean samples is generally above 30. This phenomenon accounts for the hyperparameter setting $\alpha=20$ and $\beta=30$ . In other words, $\alpha$ and $\beta$ are easy to adjust according to similarity distribution and will not complicate our approach.

IV-C5 Case Study

We conducted the case study and show some examples intuitively. Fig. 7 (a)-(d) respectively illustrate the cases of our qua-partition: noisy, vague, clean yet insignificant, and clean and significant. We report similarity score $sim$ for the noisy sample and the confidence weight $w_{con}$ and significance weight $w_{sig}$ for others. In particular, noisy image-text pairs with low similarity are directly discarded, and thus their weights are not displayed. From Fig. 7 (a), we can see that the content of images has no relationship with their corresponding captions. Therefore, it is a cost-effective choice to discard such samples directly. Vague samples are tough to distinguish from noisy data according to image-text similarity and they are assigned proper confidence weights (Fig. 7 (b)). Clean samples share high $w_{con}$ while $w_{sig}$ balances their contribution to model training (Fig. 7 (c)-(d)).

Due to the complexity of vague samples, we present a case study about more vague samples in Fig. 8 for a comprehensive understanding of them. For the vague samples in Fig. 8, we leverage different colors to distinguish between correct (green) and incorrect (orange) correspondences. Specifically, vague mismatched pairs may exhibit subtle semantic misalignment, e.g., action and the scene in the second and third images. Moreover, complete misalignment and proper alignment are also very common in vague samples. Consequently, vague samples are located near the decision boundary [23], which presents a challenging issue to resolve in noisy correspondence. Therefore, we assign a smaller confidence weight to cautiously utilize these samples.

V Conclusion

In this paper, we proposed a simple yet effective self-drop and dual-weight method to address a significant and challenging problem of learning with noisy correspondence. Specifically, we analyzed the effect of noisy and clean data pairs and found that for vision-language pre-training models, a small number of clean samples is more valuable than a majority of noisy ones. Based on this observation, we employed self-drop to discard potentially noisy samples to effectively mitigate the impact of noise. In addition, we adopted a dual-weight strategy to ensure that the model focuses more on significant samples while appropriately leveraging vague ones. Comprehensive experimental results revealed that our proposed approach surpasses contemporary state-of-the-art methods, yielding robust and competitive outcomes even under elevated noise ratios.

References

[1] K.-H. Lee, X. Chen, G. Hua, H. Hu, and X. He, “Stacked cross attention for image-text matching,” in European Conference on Computer Vision, 2018, pp. 201–216.
[2] H. Diao, Y. Zhang, L. Ma, and H. Lu, “Similarity reasoning and filtration for image-text matching,” in Association for the Advancement of Artificial Intelligence, vol. 35, no. 2, 2021, pp. 1218–1226.
[3] C. Liu, Z. Mao, T. Zhang, H. Xie, B. Wang, and Y. Zhang, “Graph structured network for image-text matching,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2020, pp. 10 921–10 930.
[4] J. Wang, A. Zheng, Y. Yan, R. He, and J. Tang, “Attribute-guided cross-modal interaction and enhancement for audio-visual matching,” IEEE Transactions on Information Forensics and Security, 2024.
[5] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier, “From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions,” Transactions of the Association for Computational Linguistics, vol. 2, pp. 67–78, 2014.
[6] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision, 2014, pp. 740–755.
[7] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning,” in Annual Meeting of the Association for Computational Linguistics, 2018, pp. 2556–2565.
[8] Z. Huang, G. Niu, X. Liu, W. Ding, X. Xiao, H. Wu, and X. Peng, “Learning with noisy correspondence for cross-modal matching,” Conference on Neural Information Processing Systems, vol. 34, pp. 29 406–29 419, 2021.
[9] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” Advances in neural information processing systems, vol. 31, 2018.
[10] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, and M. Sugiyama, “How does disagreement help generalization against label corruption?” in International conference on machine learning. PMLR, 2019, pp. 7164–7173.
[11] H. Song, M. Kim, and J.-G. Lee, “Selfie: Refurbishing unclean samples for robust deep learning,” in International Conference on Machine Learning, 2019, pp. 5907–5915.
[12] Q. Yuan, G. Gou, Y. Zhu, Y. Zhu, G. Xiong, and Y. Wang, “Mcre: A unified framework for handling malicious traffic with noise labels based on multidimensional constraint representation,” IEEE Transactions on Information Forensics and Security, 2023.
[13] M. Ye and P. C. Yuen, “Purifynet: A robust person re-identification model with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 15, pp. 2655–2666, 2020.
[14] X. Wu, R. He, Z. Sun, and T. Tan, “A light cnn for deep face representation with noisy labels,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 11, pp. 2884–2896, 2018.
[15] L. Jiang, D. Huang, M. Liu, and W. Yang, “Beyond synthetic noise: Deep learning on controlled noisy labels,” in International Conference on Machine Learning, 2020, pp. 4804–4815.
[16] X. Ma, M. Yang, Y. Li, P. Hu, J. Lv, and X. Peng, “Cross-modal retrieval with noisy correspondence via consistency refining and mining,” IEEE Transactions on Image Processing, 2024.
[17] Z. Huang, P. Hu, G. Niu, X. Xiao, J. Lv, and X. Peng, “Learning with noisy correspondence,” International Journal of Computer Vision, pp. 1–22, 2024.
[18] D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio et al., “A closer look at memorization in deep networks,” in International Conference on Machine Learning, 2017, pp. 233–242.
[19] X. Xia, T. Liu, B. Han, C. Gong, N. Wang, Z. Ge, and Y. Chang, “Robust early-learning: Hindering the memorization of noisy labels,” in International Conference on Learning Representations, 2020.
[20] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, 2021, pp. 8748–8763.
[21] H. Shi, M. Liu, X. Mu, X. Song, Y. Hu, and L. Nie, “Breaking through the noisy correspondence: A robust model for image-text matching,” ACM Transactions on Information Systems, 2024.
[22] X. Zhang, H. Li, and M. Ye, “Negative pre-aware for noisy cross-modal matching,” in Association for the Advancement of Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7341–7349.
[23] Z. Feng, Z. Zeng, C. Guo, Z. Li, and L. Hu, “Learning from noisy correspondence with tri-partition for cross-modal matching,” IEEE Transactions on Multimedia, 2023.
[24] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler, “Vse++: Improving visual-semantic embeddings with hard negatives,” arXiv preprint arXiv:1707.05612, 2017.
[25] Y. Liu, Y. Guo, E. M. Bakker, and M. S. Lew, “Learning a recurrent residual fusion network for multimodal matching,” in International Conference on Computer Vision, 2017, pp. 4107–4116.
[26] S. Qian, D. Xue, H. Zhang, Q. Fang, and C. Xu, “Dual adversarial graph neural networks for multi-label cross-modal retrieval,” in Association for the Advancement of Artificial Intelligence, vol. 35, no. 3, 2021, pp. 2440–2448.
[27] K. Li, Y. Zhang, K. Li, Y. Li, and Y. Fu, “Visual semantic reasoning for image-text matching,” in International Conference on Computer Vision, 2019, pp. 4654–4662.
[28] H. Chen, G. Ding, X. Liu, Z. Lin, J. Liu, and J. Han, “Imram: Iterative matching with recurrent attention memory for cross-modal image-text retrieval,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2020, pp. 12 655–12 663.
[29] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary object detection via vision and language knowledge distillation,” International Conference on Learning Representations, 2021.
[30] Y. Zhong, J. Yang, P. Zhang, C. Li, N. Codella, L. H. Li, L. Zhou, X. Dai, L. Yuan, Y. Li et al., “Regionclip: Region-based language-image pretraining,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2022, pp. 16 793–16 803.
[31] F. Liang, B. Wu, X. Dai, K. Li, Y. Zhao, H. Zhang, P. Zhang, P. Vajda, and D. Marculescu, “Open-vocabulary semantic segmentation with mask-adapted clip,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2023, pp. 7061–7070.
[32] R. Mokady, A. Hertz, and A. H. Bermano, “Clipcap: Clip prefix for image captioning,” arXiv preprint arXiv:2111.09734, 2021.
[33] C. Shorten and T. M. Khoshgoftaar, “A survey on image data augmentation for deep learning,” Journal of big data, vol. 6, no. 1, pp. 1–48, 2019.
[34] A. Krogh and J. Hertz, “A simple weight decay can improve generalization,” Conference on Neural Information Processing Systems, vol. 4, 1991.
[35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The journal of machine learning research, vol. 15, no. 1, pp. 1929–1958, 2014.
[36] S. Ioffe, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” International Conference on Machine Learning, 2015.
[37] S. Reed, H. Lee, D. Anguelov, C. Szegedy, D. Erhan, and A. Rabinovich, “Training deep neural networks on noisy labels with bootstrapping,” arXiv preprint arXiv:1412.6596, 2014.
[38] H.-S. Chang, E. Learned-Miller, and A. McCallum, “Active bias: Training more accurate neural networks by emphasizing high variance samples,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[39] E. Arazo, D. Ortego, P. Albert, N. O’Connor, and K. McGuinness, “Unsupervised label noise modeling and loss correction,” in International conference on machine learning. PMLR, 2019, pp. 312–321.
[40] S. Zheng, P. Wu, A. Goswami, M. Goswami, D. Metaxas, and C. Chen, “Error-bounded correction of noisy labels,” in International Conference on Machine Learning. PMLR, 2020, pp. 11 447–11 457.
[41] H. Song, M. Kim, D. Park, Y. Shin, and J.-G. Lee, “Learning from noisy labels with deep neural networks: A survey,” IEEE Transactions on Neural Networks and learning systems, vol. 34, no. 11, pp. 8135–8153, 2022.
[42] M. Yang, Z. Huang, P. Hu, T. Li, J. Lv, and X. Peng, “Learning with twin noisy labels for visible-infrared person re-identification,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2022, pp. 14 308–14 317.
[43] M. Yang, Z. Huang, and X. Peng, “Robust object re-identification with coupled noisy labels,” International Journal of Computer Vision, pp. 1–19, 2024.
[44] S. Wu, S. Shan, G. Xiao, M. S. Lew, and X. Gao, “Modality blur and batch alignment learning for twin noisy labels-based visible–infrared person re-identification,” Engineering Applications of Artificial Intelligence, vol. 133, p. 107990, 2024.
[45] Y. Qin, Y. Chen, D. Peng, X. Peng, J. T. Zhou, and P. Hu, “Noisy-correspondence learning for text-to-image person re-identification,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2024, pp. 27 197–27 206.
[46] Y. Lin, M. Yang, J. Yu, P. Hu, C. Zhang, and X. Peng, “Graph matching with bi-level noisy correspondence,” in International Conference on Computer Vision, 2023, pp. 23 362–23 371.
[47] Z. Huang, P. Hu, J. T. Zhou, J. Lv, and X. Peng, “Partially view-aligned clustering,” Conference on Neural Information Processing Systems, vol. 33, pp. 2892–2902, 2020.
[48] M. Yang, Y. Li, Z. Huang, Z. Liu, P. Hu, and X. Peng, “Partially view-aligned representation learning with noise-robust contrastive loss,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2021, pp. 1134–1143.
[49] M. Yang, Y. Li, P. Hu, J. Bai, J. Lv, and X. Peng, “Robust multi-view clustering with incomplete information,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 1055–1069, 2022.
[50] Y. Lu, Y. Lin, M. Yang, D. Peng, P. Hu, and X. Peng, “Decoupled contrastive multi-view clustering with high-order random walks,” in Association for the Advancement of Artificial Intelligence, vol. 38, no. 13, 2024, pp. 14 193–14 201.
[51] W. Kang, J. Mun, S. Lee, and B. Roh, “Noise-aware learning from web-crawled image-text data for image captioning,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2023, pp. 2942–2952.
[52] Z. Fu, K. Song, L. Zhou, and Y. Yang, “Noise-aware image captioning with progressively exploring mismatched words,” in Association for the Advancement of Artificial Intelligence, vol. 38, no. 11, 2024, pp. 12 091–12 099.
[53] Y. Qin, D. Peng, X. Peng, X. Wang, and P. Hu, “Deep evidential learning with noisy correspondence for cross-modal retrieval,” in ACM International Conference on Multimedia, 2022, pp. 4948–4956.
[54] P. Hu, Z. Huang, D. Peng, X. Wang, and X. Peng, “Cross-modal retrieval with partially mismatched pairs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 8, pp. 9595–9610, 2023.
[55] S. Yang, Z. Xu, K. Wang, Y. You, H. Yao, T. Liu, and M. Xu, “Bicro: Noisy correspondence rectification for multi-modality data via bi-directional cross-modal similarity consistency,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2023, pp. 19 883–19 892.
[56] H. Han, K. Miao, Q. Zheng, and M. Luo, “Noisy correspondence learning with meta similarity correction,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2023, pp. 7517–7526.
[57] Z. Zhao, M. Chen, T. Dai, J. Yao, B. Han, Y. Zhang, and Y. Wang, “Mitigating noisy correspondence by geometrical structure consistency learning,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2024, pp. 27 381–27 390.
[58] Y. Yang, L. Wang, E. Yang, and C. Deng, “Robust noisy correspondence learning with equivariant similarity consistency,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2024, pp. 17 700–17 709.
[59] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
[60] X. Jiang, F. Liu, Z. Fang, H. Chen, T. Liu, F. Zheng, and B. Han, “Negative label guided ood detection with pretrained vision-language models,” International Conference on Learning Representations, 2024.
[61] Y. Ming, H. Yin, and Y. Li, “On the impact of spurious correlation for out-of-distribution detection,” in Association for the Advancement of Artificial Intelligence, vol. 36, no. 9, 2022, pp. 10 051–10 059.
[62] Z. Dang, M. Luo, C. Jia, G. Dai, X. Chang, and J. Wang, “Noisy correspondence learning with self-reinforcing errors mitigation,” in Association for the Advancement of Artificial Intelligence, vol. 38, no. 2, 2024, pp. 1463–1471.
[63] H. Permuter, J. Francos, and I. Jermyn, “A study of gaussian mixture models of color and texture features for image classification and segmentation,” Pattern Recognition, vol. 39, no. 4, pp. 695–706, 2006.
[64] J. Li, R. Socher, and S. C. Hoi, “Dividemix: Learning with noisy labels as semi-supervised learning,” International Conference on Learning Representations, 2020.
[65] Y. Qin, Y. Sun, D. Peng, J. T. Zhou, X. Peng, and P. Hu, “Cross-modal active complementary learning with self-refining correspondence,” Conference on Neural Information Processing Systems, vol. 36, pp. 24 829–24 840, 2023.
[66] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” International Conference on Learning Representations, 2019.
[67] J. Chen, H. Hu, H. Wu, Y. Jiang, and C. Wang, “Learning the best pooling strategy for visual semantic embedding,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2021, pp. 15 789–15 798.
[68] S. Chun, S. J. Oh, R. S. De Rezende, Y. Kalantidis, and D. Larlus, “Probabilistic embeddings for cross-modal retrieval,” in IEEE/CVF Computer Vision and Pattern Recognition Conference, 2021, pp. 8415–8424.
[69] S. Chun, “Improved probabilistic image-text representations,” in International Conference on Learning Representations, 2024.
[70] H. Li, J. Song, L. Gao, X. Zhu, and H. Shen, “Prototype-based aleatoric uncertainty quantification for cross-modal retrieval,” Conference on Neural Information Processing Systems, vol. 36, 2024.