
¹ YouTu Lab, Tencent, Shanghai
  {boshenzhang,yukiyxli,jeromepeng,caseywang}@tencent.com
² Tongji University, Shanghai
  {2030809, zhaocairong}@tongji.edu.cn
³ Key Laboratory of Image Processing and Intelligent Control, Ministry of Education, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology, China
  {cunlin_wu,Yang_Xiao}@hust.edu.cn

Learning from Noisy Labels with Coarse-to-Fine Sample Credibility Modeling

Boshen Zhang*¹, Yuxi Li¹, Yuanpeng Tu², Jinlong Peng¹, Yabiao Wang¹, Cunlin Wu³, Yang Xiao³, Cairong Zhao²
Abstract

Training deep neural networks (DNNs) with noisy labels is practically challenging since inaccurate labels severely degrade the generalization ability of DNNs. Previous efforts tend to handle part or all of the data in a unified denoising flow, identifying noisy data with a coarse small-loss criterion to mitigate the interference from noisy labels, and ignore the fact that noisy samples differ in difficulty, so a rigid, unified data selection pipeline cannot tackle this problem well. In this paper, we propose a coarse-to-fine robust learning method called CREMA to handle noisy data in a divide-and-conquer manner. At the coarse level, clean and noisy sets are first separated in terms of credibility in a statistical sense. Since it is practically impossible to categorize all noisy samples correctly, we further process them in a fine-grained manner by modeling the credibility of each sample. Specifically, for the clean set, we deliberately design a memory-based modulation scheme that dynamically adjusts the contribution of each sample in terms of its historical credibility sequence during training, thus alleviating the effect of noisy samples incorrectly grouped into the clean set. Meanwhile, for samples categorized into the noisy set, a selective label update strategy is proposed to correct noisy labels while mitigating the problem of correction error. Extensive experiments are conducted on benchmarks of different modalities, including image classification (CIFAR, Clothing1M, etc.) and text recognition (IMDB), with either synthetic or natural semantic noise, demonstrating the superiority and generality of CREMA.

Keywords:
robust learning, label noise, divide-and-conquer
* Authors contributed equally to this work.
† Yabiao Wang is the corresponding author ([email protected]).
Figure 1: Training on MNIST with 50% symmetric noise. (a) Compared with noisy samples, clean samples yield relatively smaller loss values and more consistent predictions. (b) The empirical PDF (probability density function) of loss values and (c) their standard deviation justify this conclusion: clean and noisy samples possess distinctive statistical properties. However, noisy samples cannot be completely identified via a simple threshold filter (blue dotted line in (b) and (c)) on these statistical metrics. The existence of easy and hard noisy samples requires handling them in different ways. A similar experimental conclusion also holds on other synthetic noisy datasets (i.e., CIFAR-10, CIFAR-100) across different noise settings.

1 Introduction

Deep learning has achieved significant progress in the recognition of multimedia signals (e.g., images, text, speech). The key to its success is the availability of large-scale datasets with reliable manual annotations. Collecting such datasets, however, is time-consuming and expensive. Alternative ways to obtain labeled data, such as web crawling [46], inevitably yield samples with noisy labels, which are not appropriate for directly training DNNs since these complex models can easily over-fit (i.e., memorize) noisy labels [2, 50].

To handle this problem, classical Learning with Noisy Labels (LNL) approaches focus on either identifying and dropping noisy samples (i.e., sample selection) [10, 14, 49, 45] or adjusting the objective term of each sample during training (i.e., loss adjustment) [29, 48, 37]. The former usually makes use of the small-loss trick to select clean samples and then uses them to update DNNs. However, the sample selection procedure cannot guarantee that the selected clean samples are completely clean. As indicated in Fig. 1, a division based on statistical metrics can still admit hard noisy samples into the training set, which are then treated the same as normal samples in the following training stages. Thus the negative impact of wrongly grouped noisy samples can still confuse the optimization process and lower the test performance of DNNs [49]. On the other hand, the latter schemes reweight loss values or update labels by estimating the confidence of how clean a sample is. Typical methods include loss correction via an estimated noise transition matrix [29, 8, 12]. However, estimating an accurate noise transition matrix is practically challenging. Recently, some approaches directly correct the labels of all training samples [37, 48]. However, we empirically find that unconstrained label correction on full data can harm clean samples and in turn hinder model performance.

Towards the problems above, we propose a simple but effective method called CREMA (Coarse-to-fine sample cREdibility Modeling and Adaptive loss reweighting), which adaptively reduces the impact of noisy labels by modeling the credibility (i.e., quality) of each sample. At the coarse level, with sample credibility estimated by simple statistical metrics, clean and noisy samples can be roughly separated and handled in a divide-and-conquer manner. Since it is practically impossible to separate these samples perfectly, for the selected clean samples we use their historical credibility sequences to adjust the contribution of each sample to the objective, thus mitigating the negative impact of hard noisy samples (i.e., noisy samples incorrectly grouped into the clean set) in a fine-grained manner. As for the separated noisy samples, some of them are actually clean (i.e., hard clean samples) and can be helpful for model training. Thus, instead of discarding them as previous sample selection methods do [10, 45], we make use of them via a selective label correction scheme.

The insight behind CREMA comes from observing the loss values during training on noisy data (illustrated in Fig. 1): clean and noisy samples manifest distinctive statistical properties during training, where clean samples yield relatively smaller loss values [33] and more consistent predictions. Hence these statistical features can be utilized to coarsely model sample credibility. However, Fig. 1 also shows that the full data cannot be perfectly separated by simple statistical metrics. This inspires us to adaptively cope with noise of different difficulty levels with a more fine-grained design. For easily recognized noisy samples, we can directly apply a label correction scheme while avoiding erroneous correction of normal samples. For samples that fall into the confusing area and mix with clean ones, since the coarsely estimated credibility in the current epoch is not informative enough to identify noisy samples, CREMA applies a fine-grained likelihood estimator of noisy samples by resorting to the historical sequence of sample credibility. This is achieved by maintaining a historical memory bank along the training process and estimating the likelihood function through a consistency metric and a Markov assumption on the sequence.

CREMA is built upon a classic co-training framework [45, 10]. The fine-grained sample credibility estimated by one network is used to adjust the loss terms of credible samples for the other network. Extensive experiments are conducted on benchmarks of different modalities, including image classification (CIFAR, MNIST, Clothing1M, etc.) and text recognition (IMDB), with either synthetic or natural semantic noise, demonstrating the superiority and generality of the proposed method. In a nutshell, the key contributions of this paper include:

\bullet CREMA: a novel LNL algorithm that combats noisy labels via coarse-to-fine sample credibility modeling. At the coarse level, clean and noisy sets are roughly separated and handled respectively in the spirit of divide-and-conquer. Easily recognized noisy samples are handled via a selective label update strategy;

\bullet In CREMA, likelihood estimation over the historical credibility sequence is proposed to help identify hard noisy samples; it naturally serves as a dynamic weight to modulate the loss term of each training sample in a fine-grained manner;

\bullet CREMA is evaluated on six synthetic and real-world noisy datasets with different modalities, noise types, and strengths. Extensive ablation studies and qualitative analysis are provided to verify the effectiveness of each component.

Figure 2: The pipeline of CREMA. CREMA trains two parallel networks simultaneously. A clean set (mostly clean) $\mathcal{X}_{c}$ and a noisy set (mostly noisy) $\mathcal{X}_{u}$ are separated by estimating the credibility of each training sample. A selective label distribution learning scheme is applied to easily distinguishable noisy samples in $\mathcal{X}_{u}$. For the clean set $\mathcal{X}_{c}$, likelihood estimation over the historical credibility sequence handles hard noisy samples by adaptively modulating their loss terms during training.

2 Related Works

Existing LNL approaches can be mainly categorized into three groups: loss adjustment, label correction, and noisy sample detection. Next, we introduce and discuss existing works for training DNNs with noisy labels.

Loss Adjustment. Adjusting the loss values of all training samples can reduce the negative impact of noisy labels. To this end, many approaches seek robust loss functions, such as Robust MAE [7], generalized cross entropy [55], symmetric cross entropy [44], Improved MAE [42], and curriculum loss [23]. Rather than treating all samples equally, some methods rectify the loss of each sample by estimating the label transition matrix [12, 29, 8, 9, 46, 19, 52] or imposing different importance on each sample to formulate a weighted training procedure [40, 22, 4, 43]. The noise transition matrix, however, is relatively hard to estimate, and many approaches [12, 39, 21, 14, 20, 6, 5, 34, 35, 56] assume that a small clean-labeled dataset exists. In real-world scenarios, such a condition is not always fulfilled, limiting the applicability of these approaches.

Label correction. Label correction methods seek to refurbish the ground truth of noisy samples, thus preventing DNNs from overfitting to false labels. The most common ways to obtain the updated label include bootstrapping (i.e., a convex combination of the noisy label and the DNN prediction) [33, 11, 1, 41] and label replacement [37, 48, 36, 54]. One critical problem for label correction methods is to define the confidence of each label being clean; that is, samples with high clean probability should keep their labels almost unchanged, and vice versa. Previous solutions include cross-validation [33], fitting a two-component mixture model [1], local intrinsic dimensionality measurement [13, 24], and leveraging the prediction consistency of DNN models [36, 41]. However, updating the labels of the whole training set is challenging, and well-designed regularization terms are important to prevent DNNs from falling into trivial solutions [37, 48].

Noisy sample detection. A common heuristic used to discover noisy samples is the memorization effect (i.e., DNNs fit clean samples first and then noisy ones). As a result, after a warm-up stage trained on all noisy samples, a DNN can identify clean samples by taking the small-loss ones. The small-loss trick is exploited by many sample selection methods [10, 49, 26, 14, 45, 28, 36, 47]. After separating the noisy samples from the clean ones, Co-teaching [10] and its variants [10, 49, 45, 47] update two parallel networks with the clean samples and abandon the noisy ones. Training two deep networks simultaneously is effective for avoiding the confirmation bias problem (i.e., a model accumulating its own errors through self-training) [10, 14, 16]. Other noise measurement metrics such as the Area Under the Margin (AUM) [32] have also been proposed to better distinguish and remove noisy samples, based on the hypothesis that noisy samples have, in expectation, smaller AUM than clean samples. However, discarding noisy samples means that valuable data may be lost, which leads to slow convergence of DNN models [4]. Instead, some methods utilize both clean and noisy samples by discarding only the labels of identified noisy samples, naturally converting the LNL problem into a semi-supervised learning one, for which powerful semi-supervised learning methods can be leveraged to boost performance [16, 57, 51, 3].

Hybrid. There is also research combining two or more of the techniques above to boost the performance of robust learning. For example, RoCL [58] and SELFIE [36] propose to dynamically discover informative (refurbished) samples and correct their labels with model predictions. CREMA belongs to the hybrid group and differs from existing methods in that (1) it adaptively copes with noise of different difficulty levels by estimating sample credibility in a coarse-to-fine fashion, handling easy and hard noisy samples in a divide-and-conquer strategy; (2) it estimates the likelihood of the historical sample credibility sequence to dynamically modulate the loss terms of hard noisy samples; (3) it explores a selective label correction scheme to deal with hard clean samples while mitigating correction error; and (4) it is end-to-end trainable and requires no extra computation or modification to the model.

Figure 3: Training on MNIST with 50% symmetric noise, after warming up (i.e., training on all samples with original noisy labels) for T epochs. (a) Global label learning: updating the labels of all training samples causes relatively large gradient values even on clean samples (areas within blue dotted lines), making it hard to focus on correcting noisy labels. (b) Selective label learning in CREMA: CREMA effectively identifies noisy samples and focuses the relatively large gradient values on correcting noisy labels.

3 Method

In this section, we introduce CREMA, an end-to-end approach for the LNL problem. The technical pipeline is shown in Fig. 2. The training process is built upon a classic co-training framework [10, 45] to avoid confirmation bias; at the coarse stage, credible samples (mostly clean) and noisy samples (mostly noisy) are separated via per-sample loss values. The separated samples are then handled by the succeeding fine-grained processes in a divide-and-conquer manner.

3.1 Coarse level separation

Formally, for multi-class classification with noisy labels, let $\mathcal{D}=\{(x_{i},y_{i})\}_{i=1}^{N}$ denote the training data, where $x_{i}$ is a data sample and $y_{i}\in\{0,1\}^{C}$ is its one-hot label over $C$ classes. $f(x_{i};\theta)$ denotes the sample feature extracted by the DNN model. With the loss $\mathcal{L}(x,y)$ from the DNN model, the clean set $\mathcal{X}_{c}$ and the noisy set $\mathcal{X}_{u}$ are separated via the widely used low-loss criterion [10, 45],

\mathcal{X}_{c}=\{(x,y)\,|\,\mathcal{L}(x,y)<\tau,\ (x,y)\in\mathcal{D}\},
\mathcal{X}_{u}=\{(x,y)\,|\,\mathcal{L}(x,y)\geq\tau,\ (x,y)\in\mathcal{D}\},   (1)

where $\tau$ is a threshold determined by a dynamic memory rate $R(t)\in[0,1]$, which is set so that the DNN gradually treats the $(1-R(t))$ fraction of data with the highest loss values as noisy samples while keeping the other samples as the clean set. The loss value simply serves as the credibility of each sample at a coarse level. However, as illustrated in Fig. 1, this simple separation criterion cannot strictly eliminate noisy samples. Hence we handle the two sets respectively: $\mathcal{X}_{c}$ is exploited to update DNN parameters via fine-grained, sample-credibility-guided loss adjustment (Sec. 3.2), and $\mathcal{X}_{u}$ is leveraged via a label learning scheme (Sec. 3.3).
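To make the split concrete, here is a minimal PyTorch-style sketch of Eq. (1) with the dynamic memory rate; the names (`logits`, `labels`, `memory_rate`) are illustrative, and per-sample cross-entropy stands in here for the generic loss $\mathcal{L}$:

```python
import torch
import torch.nn.functional as F

def coarse_split(logits, labels, memory_rate):
    """Eq. (1): keep the R(t) fraction of samples with the lowest loss as
    the (mostly) clean set X_c; the rest form the (mostly) noisy set X_u."""
    # per-sample loss, no reduction, so each sample keeps its own value
    losses = F.cross_entropy(logits, labels, reduction="none")
    num_clean = int(memory_rate * losses.numel())
    order = torch.argsort(losses)  # ascending: low-loss samples first
    clean_idx, noisy_idx = order[:num_clean], order[num_clean:]
    return clean_idx, noisy_idx
```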

3.2 Fine-grained sequential credibility modeling

Sequential credibility analysis. Previous works prefer to assess data reliability purely based on a statistical property at a single point in time (e.g., the loss value in the current epoch) during training, i.e., they regard the credibility $w(x,y)$ of a data sample $(x,y)$ as proportional to the joint distribution or likelihood of its data-label pair,

w(x,y)\propto P(x,y)\quad\textbf{or}\quad w(x,y)\propto\log P(x,y).   (2)

However, as shown in Fig. 1, the training curves of normal and noisy samples usually yield different statistics: noisy samples have relatively larger loss values and poorer prediction consistency than clean ones. Therefore the historical training record is also informative for distinguishing noisy from clean data.

This observation inspires us to estimate data credibility in a sequential manner. Specifically, we define a sequence of length $n$ as:

\mathbf{L}^{n}_{t}=[\mathbf{f}_{t},\mathbf{f}_{t-1},\cdots,\mathbf{f}_{t-n+1}],\quad\mathbf{f}_{t}=f(x;\theta_{t}).   (3)

Eq. (3) describes a sliding window covering the feature snapshots of a sample from the previous $n$ epochs up to the current time point, where $\theta_{t}$ denotes the model parameters at the $t$-th epoch. We model the data credibility with the likelihood and consistency of these historical sequences,

w(x,y)\propto C(\mathbf{L}^{n}_{t},y)\log P\left(\mathbf{L}^{n}_{t}\,|\,\mathbf{f}_{t-n},y\right).   (4)

Eq. (4) decouples into two terms: $C(\mathbf{L}^{n}_{t},y)$ measures the stability of the training sequence given its label $y$, while $\log P\left(\mathbf{L}^{n}_{t}\,|\,\mathbf{f}_{t-n},y\right)$ is the log-likelihood of the sequence generated from the $(t-n)$-th observation of the network training process. To estimate the sequential log-likelihood, we further assume that the observations in the sequence $\mathbf{L}^{n}_{t}$ satisfy a Markov property:

\mathbf{f}_{t}\perp\mathbf{f}_{i}\,|\,(\mathbf{f}_{t-1},y)\quad\forall\ i<t-1.   (5)

The assumption in Eq. (5) is reasonable since, in most iterative learning algorithms like SGD, the data feature distribution is decided only by the last observation and its label. With this assumption, we can further derive the likelihood as:

\log P\left(\mathbf{L}^{n}_{t}\,|\,\mathbf{f}_{t-n},y\right)=\log P\left(\mathbf{f}_{t}\,|\,\mathbf{L}^{n}_{t-1},y\right)+\log P\left(\mathbf{L}^{n-1}_{t-1}\,|\,\mathbf{f}_{t-n},y\right)
=\sum_{i=0}^{n-1}\log P\left(\mathbf{f}_{t-i}\,|\,\mathbf{L}^{n-i}_{t-i-1},y\right)
=\sum_{i=0}^{n-1}\log P\left(\mathbf{f}_{t-i}\,|\,\mathbf{f}_{t-i-1},y\right).   (6)

With Eq. (6), we can represent the sequential likelihood as the sum of the conditional likelihoods of data observations at adjacent epochs within a sliding window of length $n$. In the implementation, we can apply a normalized mixture model such as a GMM [16] or BMM [1] as the estimator of the conditional probability $P\left(\mathbf{f}_{t-i}\,|\,\mathbf{f}_{t-i-1},y\right)$ in Eq. (6) by modeling the distribution of sample-wise loss values. Meanwhile, based on these conditional probability estimates, the stability measurement $C(\mathbf{L}^{n}_{t},y)$ is designed as a modulator to suppress the loss of training sequences with intense fluctuation,

C(\mathbf{L}^{n}_{t},y)=1-\sqrt{\frac{1}{n}\sum_{i=0}^{n-1}\left(P\left(\mathbf{f}_{t-i}\,|\,\mathbf{f}_{t-i-1},y\right)-\bar{P}\left(\mathbf{L}^{n}_{t},y\right)\right)^{2}},   (7)
\bar{P}\left(\mathbf{L}^{n}_{t},y\right)=\frac{1}{n}\sum_{i=0}^{n-1}P\left(\mathbf{f}_{t-i}\,|\,\mathbf{f}_{t-i-1},y\right).   (8)
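To make Eqs. (4)-(8) concrete, the sketch below fits the per-epoch mixture model on sample-wise losses and combines a sliding window of the resulting conditional probabilities into a credibility score. The two-component GMM posterior follows the common practice of [16]; all names are illustrative, and rescaling the raw score into a usable weight for Eq. (9) is left out as an implementation detail:

```python
from collections import deque

import numpy as np
from sklearn.mixture import GaussianMixture

def clean_posterior(epoch_losses):
    """Fit a two-component 1-D GMM on this epoch's per-sample losses and
    return P(clean | loss), the posterior of the low-mean component, used
    as the per-epoch conditional probability in Eq. (6)."""
    losses = np.asarray(epoch_losses, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, max_iter=50, reg_covar=5e-4).fit(losses)
    clean_comp = int(np.argmin(gmm.means_))  # low-loss component = clean
    return gmm.predict_proba(losses)[:, clean_comp]

def credibility(prob_history, eps=1e-8):
    """prob_history: the last n conditional probabilities of one sample.
    Returns C(L_t^n, y) * log P(L_t^n | f_{t-n}, y) as in Eqs. (4), (6), (7)."""
    p = np.asarray(prob_history)
    log_likelihood = np.sum(np.log(p + eps))  # Eq. (6): sum over adjacent epochs
    consistency = 1.0 - p.std()               # Eq. (7): penalize fluctuation
    return consistency * log_likelihood

# per-sample memory bank with window length n = 3 (toy values for one sample)
history = deque(maxlen=3)
for p_epoch in (0.90, 0.85, 0.95):
    history.append(p_epoch)
w_raw = credibility(history)
```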

Adaptive loss adjustment. The sequential likelihood $\log P\left(\mathbf{L}^{n}_{t}\,|\,\mathbf{f}_{t-n},y\right)$ and stability measurement $C(\mathbf{L}^{n}_{t},y)$ reflect how confident we are that a sample is clean. With the estimated credibility, we reweight the loss to update the DNN as:

\theta_{t+1}=\theta_{t}-\eta\nabla\Big(\frac{1}{|\mathcal{X}_{c}|}\sum_{(x,y)\in\mathcal{X}_{c}}w(x,y)\,\mathcal{L}\big(f(x;\theta_{t}),y\big)\Big),   (9)

where $\mathcal{L}$ is the objective function and $w(x,y)$ is the sample credibility, which modulates the contribution of each sample to the gradient descent update. Note that Eq. (9) is applied only on the clean set $\mathcal{X}_{c}$; in this way, the negative impact of hard noisy samples within $\mathcal{X}_{c}$ can be mitigated.

Objective function. Inspired by the design of the symmetric cross entropy (SCE) loss [44], a symmetric JS-divergence function with a co-regularization term is leveraged in CREMA:

\mathcal{L}=D_{\mathrm{JS}}(y\,||\,h(f_{1}(x;\theta)))+D_{\mathrm{JS}}(y\,||\,h(f_{2}(x;\theta)))+D_{\mathrm{JS}}\left(h(f_{1}(x;\theta))\,||\,h(f_{2}(x;\theta))\right),   (10)

where $h(\cdot)$ is the softmax function and $f_{1}(x)$, $f_{2}(x)$ are the features extracted by the two models. We choose JS-divergence [27] instead of cross entropy (CE) as the loss function because CE tends to over-fit noisy samples, which contribute relatively large gradient values during training, whereas JS-divergence mitigates this problem by also using the current model's predictions as supervising signals; for noisy samples, DNN predictions are usually more reliable than the labels. Following previous work [37], a prior label distribution term and a negative entropy term are included to regularize training and further alleviate over-fitting.
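The following sketch implements the objective of Eq. (10) and applies the per-sample credibility weight of Eq. (9) at reduction time; the prior-distribution and negative-entropy regularizers of [37] are omitted for brevity, and the function names are ours rather than the released code's:

```python
import torch
import torch.nn.functional as F

def js_div(p, q, eps=1e-8):
    """Per-sample Jensen-Shannon divergence between two batches of
    probability distributions of shape (B, C)."""
    m = 0.5 * (p + q)
    kl_pm = (p * (torch.log(p + eps) - torch.log(m + eps))).sum(dim=1)
    kl_qm = (q * (torch.log(q + eps) - torch.log(m + eps))).sum(dim=1)
    return 0.5 * (kl_pm + kl_qm)

def crema_clean_loss(logits1, logits2, targets, w):
    """Eq. (10) weighted by the credibility w of Eq. (9).
    targets: one-hot (or soft) label distributions of shape (B, C);
    w: per-sample credibility weights of shape (B,)."""
    p1, p2 = F.softmax(logits1, dim=1), F.softmax(logits2, dim=1)
    per_sample = js_div(targets, p1) + js_div(targets, p2) + js_div(p1, p2)
    return (w * per_sample).mean()
```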

 1: Input: network parameters θ^(1) and θ^(2), training dataset 𝒟, dynamic memory rate R(t), soft label distribution ỹ, memory sequences L^n_{1,t} and L^n_{2,t}.
 2: while t < MaxEpoch do
 3:     Fetch mini-batch 𝒟_n from 𝒟;
 4:     Divide 𝒟_n into 𝒳_c and 𝒳_u based on R(t);   // split into clean and noisy sets via the low-loss criterion
 5:     for x_c ∈ 𝒳_c do
 6:         Calculate w(x_c, y_c) based on Eq. (6) and Eq. (7);   // sample credibility modeling
 7:         Update θ^(1) and θ^(2) based on Eq. (9);   // adaptive loss adjustment
 8:     end for
 9:     for x_u ∈ 𝒳_u do
10:         Update ỹ, θ^(1), and θ^(2) through gradient descent;   // update soft label distribution and model parameters
11:     end for
12:     Update R(t);
13:     Update L^n_{1,t} and L^n_{2,t};   // enqueue feature snapshot of the current epoch
14: end while
15: Output: θ^(1) and θ^(2).
Algorithm 1: CREMA. Lines 5-8: sequential credibility modeling; Lines 9-11: selective label update.

3.3 Selective label distribution learning

Following the divide-and-conquer idea, we also attempt to leverage the separated noisy set $\mathcal{X}_{u}$. Some hard clean samples are blended with the separated noisy ones, so instead of discarding these data as in most sample detection methods [10, 45], we resort to label correction approaches [37, 48] to exploit them with gradually corrected labels and further boost performance.

Specifically, the labels of $\mathcal{X}_{u}$ are treated as extra parameters and updated through back-propagation to optimize a certain objective. This means both the network parameters and the labels are updated simultaneously during training, where the original one-hot labels $y$ turn into a soft label distribution $\tilde{y}=h(y)$ after updating. Formally, $\tilde{y}$ is updated as $\tilde{y}\leftarrow\tilde{y}-\lambda(\partial\mathcal{L}_{l}/\partial\tilde{y})$, where $\lambda$ is the learning rate and $\mathcal{L}_{l}$ is the objective supervising the label correction process, $\mathcal{L}_{l}=D_{\mathrm{JS}}(h(f_{1}(x;\theta))\,||\,\tilde{y})+D_{\mathrm{JS}}(h(f_{2}(x;\theta))\,||\,\tilde{y})$.
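A minimal sketch of this selective update, storing the soft labels of $\mathcal{X}_{u}$ as a learnable tensor in the spirit of PENCIL [48]; `num_classes`, `label_lr`, the toy `noisy_labels`, and the initialization scale 10 are assumptions for illustration, and `js_div` is the per-sample JS divergence from the earlier sketch:

```python
import torch
import torch.nn.functional as F

num_classes, label_lr = 14, 0.1                         # illustrative values
noisy_labels = torch.randint(0, num_classes, (1000,))   # stand-in labels of X_u

# learnable label logits, initialized from the original one-hot labels
# (scaled, then softmaxed on the fly)
label_logits = torch.nn.Parameter(
    10.0 * F.one_hot(noisy_labels, num_classes).float())
label_optim = torch.optim.SGD([label_logits], lr=label_lr)

def selective_label_loss(logits1, logits2, batch_idx):
    """L_l = JS(h(f1) || y~) + JS(h(f2) || y~). Gradients flow into both the
    network parameters and label_logits, so the labels of X_u are corrected
    gradually while labels of the clean set stay untouched."""
    y_soft = F.softmax(label_logits[batch_idx], dim=1)  # current soft labels
    p1, p2 = F.softmax(logits1, dim=1), F.softmax(logits2, dim=1)
    return (js_div(p1, y_soft) + js_div(p2, y_soft)).mean()
```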

Empirical insight. In our experiments, we find that a global label learning strategy (i.e., correcting the labels of all training data) suffers from correction error on clean data. As observed in Fig. 3 (a), large gradient values are also imposed on many correctly-labeled samples; consequently, the labels of these clean samples are unnecessarily updated. In contrast, we choose to update only the separated noisy set $\mathcal{X}_{u}$ (mostly noisy). As shown in Fig. 3 (b), the proposed selective label correction strategy focuses more on learning noisy labels: the number of correctly-labeled samples with large gradient values is far smaller than under a global correction scheme, indicating that selective label correction mitigates correction error. Experiments in Sec. 4.3 also quantitatively verify the effectiveness of the selective label learning strategy over the global manner.

Putting this all together, Algorithm 1 delineates CREMA in detail. In a nutshell, CREMA is built on a divide-and-conquer framework. First, the clean set $\mathcal{X}_{c}$ and noisy set $\mathcal{X}_{u}$ are separated based on the low-loss criterion [10]. For $\mathcal{X}_{c}$, we compute the likelihood of the historical credibility sequence, which adaptively modulates the loss term of each training sample. For $\mathcal{X}_{u}$, a selective label correction scheme updates the label distribution and model parameters simultaneously. After each training epoch, the memory sequences $\mathbf{L}^{n}_{1,t}$ and $\mathbf{L}^{n}_{2,t}$ are updated with the feature snapshots of the current epoch.

4 Experiments

4.1 Datasets and Implementation Details

Datasets. To validate the effectiveness of the proposed method, we experiment on four synthetic noisy datasets, i.e., IMDB [25], MNIST, CIFAR-10, and CIFAR-100 [15], and two real-world label noise datasets, i.e., Clothing1M [46] and Animal-10N [36]. IMDB [25] is a collection of highly polarized movie reviews (positive/negative). It consists of 25,000 training samples and 25,000 testing samples; the task is formalized as binary classification of the sentiment polarity of a review. MNIST consists of 70,000 images of size $28\times 28$ for 10 classes, with 60,000 images for training and the remaining 10,000 for testing. Both CIFAR-10 and CIFAR-100 contain 50,000 training images and 10,000 testing images of size $32\times 32\times 3$; the former has 10 classes, while CIFAR-100 has 100 classes. Clothing1M is a large-scale real-world noisy dataset collected from multiple online shopping websites. It contains 1 million training images and clean subsets (47K for training, 14K for validation, and 10K for testing) with 14 classes; its noise rate is around 38.5%. Animal-10N contains 55,000 human-labeled online images of 10 confusable animals and includes approximately 8% noisy-labeled samples. Following the settings of previous works [36, 54], 50,000 images are used for training and the rest for testing.

Implementation Details. For the three synthetic image classification noisy datasets, MNIST, CIFAR-10, and CIFAR-100, we follow the settings of previous works [10, 49, 45] and consider three noise types, i.e., symmetric noise (uniformly random), asymmetric noise, and pairflip noise. For the IMDB text classification dataset, we tokenize each sentence with a vocabulary size of 10,000, and test symmetric noise at different levels. Specifically, symmetric noise is generated by uniformly replacing labels in each class with labels of other classes. Asymmetric noise simulates fine-grained classification with noisy labels (for example, lynx and cat in Animal-10N [36]), where labels are corrupted to a set of similar classes. Pairflip noise is generated by flipping each class to its adjacent class. Varying noise rates $\tau$ are tested to fully evaluate the proposed method: for symmetric label noise we set $\tau\in\{20\%,50\%,80\%\}$ on the image datasets and $\tau\in\{20\%,40\%\}$ on the text dataset; $\tau=40\%$ for asymmetric noise; and $\tau\in\{40\%,45\%\}$ for pairflip label noise. For the real-world noisy Clothing1M dataset, following [48, 54], we do not use the 50K clean data, and a randomly sampled pseudo-balanced subset of about 260K images is used as training data.

For the network structure, a 2-layer bi-directional LSTM with embedding size 128 and hidden size 128 is adopted for IMDB. A 9-layer CNN with Leaky-ReLU activations [10] is used for MNIST, CIFAR-10, and CIFAR-100, while ResNet-50 is adopted for Clothing1M and Animal-10N. The batch size is set to 64 for all datasets. For fair comparison, we train our model for 200 epochs in total and report the average test accuracy over the last 10 epochs on the three synthetic image noisy datasets. For IMDB, we train for 100 epochs and also report the accuracy over the last ten epochs. The total training epochs for Clothing1M and Animal-10N are 80 and 150, respectively. All methods are implemented in PyTorch and run on NVIDIA Tesla V100 GPUs. We use the Adam optimizer for all experiments with an initial learning rate of 0.001, which is decayed by a factor of 5 every 30 epochs for Clothing1M and every 50 epochs for Animal-10N. The two classifiers in our method are two networks with the same structure but different initialization parameters. Following [10], $R(t)$ is linearly decreased during training until it reaches a lower bound $\sigma$; for Clothing1M and Animal-10N we empirically set $\sigma$ to 0.8 and 0.92, respectively.
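A sketch of this schedule, following the linear decay used in Co-teaching [10]; the warm-up length `t_k` is an assumed hyper-parameter:

```python
def memory_rate(t, t_k, sigma):
    """R(t): linearly decays from 1.0 to the lower bound sigma over the
    first t_k epochs, then stays at sigma (e.g., sigma = 0.8 for Clothing1M)."""
    return max(1.0 - (1.0 - sigma) * t / t_k, sigma)

# e.g., sigma = 0.8 reached after 10 epochs
rates = [memory_rate(t, t_k=10, sigma=0.8) for t in range(15)]
```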

Noise rate τ | Standard | PENCIL | Co-teaching | Co-teaching+ | JoCoR | CREMA (ours)
Symmetry-20% | 79.94±0.10 | 97.20±0.53 | 97.40±0.09 | 97.81±0.03 | 97.98±0.02 | 98.40±0.14
Symmetry-50% | 52.92±0.21 | 96.22±0.13 | 92.47±0.14 | 95.80±0.09 | 96.35±0.02 | 98.07±0.24
Symmetry-80% | 23.95±0.18 | 87.64±0.25 | 82.04±0.43 | 58.92±0.37 | 85.51±0.08 | 92.02±0.54
Asymmetry-40% | 78.80±0.09 | 94.39±0.37 | 90.57±0.04 | 93.28±0.43 | 94.14±0.12 | 97.15±0.26
Pairflip-40% | 58.51±0.29 | 94.06±0.09 | 90.73±0.22 | 89.91±0.31 | 93.47±0.10 | 95.80±0.51
Pairflip-45% | 54.54±0.30 | 90.73±0.29 | 89.42±0.22 | 85.81±0.30 | 91.30±0.25 | 94.12±0.58
Table 1: Average test accuracy (%) on MNIST over the last ten epochs.
Noise rate τ | Standard | PENCIL | Co-teaching | Co-teaching+ | JoCoR | CREMA (ours)
Symmetry-20% | 68.67±0.11 | 78.78±0.15 | 82.56±0.24 | 82.27±0.21 | 85.73±0.19 | 86.32±0.16
Symmetry-50% | 42.31±0.18 | 64.71±0.27 | 72.97±0.22 | 63.01±0.33 | 79.53±0.10 | 81.63±0.13
Symmetry-80% | 15.94±0.07 | 26.96±0.37 | 24.03±0.18 | 17.96±0.06 | 27.30±0.08 | 29.66±0.16
Asymmetric-40% | 70.04±0.08 | 70.06±0.28 | 75.96±0.15 | 72.21±0.43 | 76.31±0.21 | 82.49±0.13
Pairflip-40% | 51.66±0.11 | 75.26±0.18 | 75.10±0.23 | 57.59±0.45 | 68.56±0.16 | 85.00±0.13
Pairflip-45% | 45.78±0.13 | 71.18±0.28 | 70.68±0.23 | 49.60±0.23 | 57.68±0.21 | 82.94±0.12
Table 2: Average test accuracy (%) on CIFAR-10 over the last ten epochs.
Noise rate τ | Standard | PENCIL | Co-teaching | Co-teaching+ | JoCoR | CREMA (ours)
Symmetry-20% | 34.72±0.07 | 52.11±0.21 | 50.48±0.24 | 49.27±0.03 | 53.41±0.09 | 57.21±0.25
Symmetry-50% | 16.86±0.09 | 39.89±0.30 | 38.24±0.26 | 40.04±0.70 | 43.37±0.09 | 43.95±0.42
Symmetry-80% | 4.60±0.12 | 16.08±0.15 | 11.78±0.12 | 13.44±0.37 | 12.33±0.13 | 17.10±0.19
Asymmetric-40% | 26.93±0.10 | 32.81±0.23 | 33.36±0.28 | 33.62±0.39 | 32.66±0.13 | 38.61±0.25
Pairflip-40% | 27.48±0.12 | 33.83±0.52 | 33.94±0.18 | 33.80±0.25 | 33.89±0.12 | 38.06±0.34
Pairflip-45% | 24.21±0.11 | 29.01±0.28 | 29.57±0.15 | 26.93±0.34 | 28.83±0.10 | 32.50±0.29
Table 3: Average test accuracy (%) on CIFAR-100 over the last ten epochs.
Method | Category | Accuracy
Cross-Entropy | – | 79.4
ActiveBias [4] | LA | 80.5
PLC [54] | LC | 83.4
Co-teaching [10] | ND | 80.2
SELFIE [36] | LC+ND | 81.8
CREMA (Ours) | LA+LC+ND | 84.2
Table 4: Test accuracy (%) on Animal-10N. "LA", "LC" and "ND" denote "Loss Adjustment", "Label Correction" and "Noisy sample Detection" respectively.
Method | Sym-20% | Sym-40%
Standard | 74.08±0.23 | 58.37±0.26
PENCIL [48] | 73.73±0.21 | 58.07±0.30
Co-teaching [10] | 82.07±0.07 | 73.25±0.19
Co-teaching+ [49] | 82.27±0.23 | 53.56±3.04
JoCoR [45] | 84.82±0.07 | 76.12±0.17
CREMA (ours) | 86.44±0.04 | 78.39±0.14
Table 5: Average test accuracy (%) on IMDB over the last ten epochs under symmetric noise rates τ.
Method | Category | Accuracy
Cross-Entropy | – | 69.21
GCE [55] | LA | 69.75
IMAE [42] | LA | 73.20
SCE [44] | LA | 71.02
DM [43] | LA | 73.30
F-correction [29] | LA | 69.84
M-correction [1] | LA | 71.00
Masking [9] | LA | 71.10
Joint-Optim [37] | LC | 72.23
Meta-Cleaner [53] | LC | 72.50
Meta-Learning [17] | LC | 73.47
PENCIL [48] | LC | 73.49
PLC [54] | LC | 74.02
Self-Learning [11] | LC | 74.45
ProSelfLC [41] | LC | 73.40
Co-teaching [10] | ND | 70.15
JoCoR [45] | ND | 70.30
C2D [57] | ND | 74.30
DivideMix [16] | ND | 74.48
CREMA (Ours) | LA+LC+ND | 74.53
Table 6: Comparison with state-of-the-art methods in test accuracy (%) on Clothing1M. † means the result without model ensemble [18].

4.2 Comparison with state-of-the-art methods

Results on synthetic noisy datasets. Table 1, Table 2, Table 3, and Table 5 show detailed results of CREMA and other methods under multiple synthetic noise settings on four widely used datasets, i.e., MNIST, CIFAR-10, CIFAR-100, and IMDB. Specifically, four state-of-the-art LNL methods closely related to our work are chosen for comparison: PENCIL [48], Co-teaching [10], Co-teaching+ [49], and JoCoR [45]. Standard DNN training with cross entropy is also included as a baseline. All results of these methods are reproduced with their public code and suggested hyper-parameters for fair comparison. From these tables, we observe that most of these methods outperform Standard in the mildest Symmetry-20% case, except for PENCIL on IMDB, which verifies their robustness. Among them, JoCoR performs much better than the other compared methods. However, in the Pairflip-40% and Pairflip-45% cases on the image datasets, their performance drops significantly. In contrast, the proposed CREMA achieves consistent improvements over the other methods on all four benchmarks across various noise settings. In the Pairflip-40% and Pairflip-45% cases, CREMA outperforms the other baselines by a large margin: 16.44% and 25.26% higher accuracy than JoCoR on CIFAR-10, respectively. In the extremely noisy scenario, e.g., Symmetry-80%, CREMA also performs generally better than the other compared methods. These results demonstrate the superiority and generality of the proposed robust learning method across various types and levels of label noise on multimedia (i.e., image and text) datasets.

Results on real-world noisy datasets. Experiments on the real-world noisy datasets Clothing1M [46] and Animal-10N [36] are also conducted to verify the effectiveness of the proposed method. The baselines are chosen from recently proposed LNL methods: loss adjustment methods, including GCE [55], IMAE [42], SCE [44], DM [43], F-correction [29], M-correction [1], Masking [9], and ActiveBias [4]; label correction methods, including Joint-Optim [37], PENCIL [48], Self-Learning [11], PLC [54], and ProSelfLC [41]; noisy sample detection approaches, including Co-teaching [10], JoCoR [45], C2D [57], and DivideMix [16]; and a hybrid method, SELFIE [36]. Table 4 and Table 6 show results on the two real-world noisy datasets, respectively. On the large-scale Clothing1M dataset, CREMA outperforms all compared methods. Note that CREMA follows the standard DNN training procedure and is similar to other co-training methods [10, 49, 45] in training time, since the cost of sample credibility modeling is negligible compared with the DNN update. It is worth noting that the proposed method outperforms the co-teaching methods by a large margin, indicating that the samples discarded by those methods are actually valuable, and that CREMA makes good use of all training samples. CREMA also achieves the best test accuracy among the compared methods on Animal-10N. These results indicate that the proposed method works well on a high-noise-level (Clothing1M) and a fine-grained (Animal-10N) real-world noisy dataset.

Method | Test Accuracy (%)
Baseline | 72.81
+ Selective label update | 73.25
+ Sequential likelihood | 74.00
+ Stability measurement | 74.53
Table 7: Ablation studies of each component of CREMA on the Clothing1M dataset.
Estimator | Test Accuracy (%)
BMM | 74.09
GMM | 74.53
Table 8: Investigation of different mixture models on the Clothing1M dataset.
Length of sequence n | 1 | 2 | 3 | 4 | 5 | 6
Test Accuracy (%) | 73.25 | 73.99 | 74.53 | 74.40 | 74.27 | 73.96
Table 9: Investigation of the sequence length n on the Clothing1M dataset.

4.3 Ablation studies

\bullet Component Analysis. CREMA contains several important components, including the selective label learning strategy, the sequential likelihood $\log P\left(\mathbf{L}^{n}_{t}\,|\,\mathbf{f}_{t-n},y\right)$, and the stability measurement $C(\mathbf{L}^{n}_{t},y)$. To verify the effectiveness of each component, we conduct experiments on the large-scale noisy dataset Clothing1M. The baseline is built upon a simple co-teaching framework [45] combined with a global label correction scheme (as in [48]), without the credibility-guided loss adjustment strategy. As shown in Table 7, and consistent with the observation in Fig. 3, the proposed selective label learning strategy achieves better results than its global correction counterpart. The sequential likelihood and stability measurement further boost performance with 0.75% and 0.53% accuracy gains, respectively, indicating that the proposed sequential sample credibility modeling effectively combats hard noisy samples mixed with clean ones. With all three key components, CREMA achieves 74.53% test accuracy on Clothing1M.

\bullet Length of sequence $n$. We also investigate how the sequence length $n$ affects performance. Table 9 shows results on Clothing1M with $n\in\{1,2,3,4,5,6\}$. Increasing the sequence length helps achieve higher accuracy at first, but performance degrades after hitting the peak. Intuitively, when $n=1$ no temporal information is available, so CREMA cannot utilize the consistency metric to identify hard noisy samples blended with clean ones, leading to inferior results. When $n$ is larger than 4, we also notice that performance degrades, probably because unreliable model states early in a very long sequence harm sample credibility modeling and in turn hinder the final result.

\bullet Effect of different estimators. The probabilistic model estimates the conditional probability $P\left(\mathbf{f}_{t}\,|\,\mathbf{f}_{t-1},y\right)$ in Eq. (6). We compare two estimators, the Gaussian Mixture Model (GMM) [31] and the Beta Mixture Model (BMM) [1], on Clothing1M. As shown in Table 8, GMM obtains a relatively higher test accuracy, but BMM also achieves a good result (74.09%). This indicates that the final result is not sensitive to the choice of normalized mixture model.

\bullet Reliability of the estimated sample credibility. The sample credibility serves as a dynamic weight that modulates the loss term of each training sample, as in Eq. (9). In Fig. 4 we visualize the empirical PDF of the learned credibility weights of all training samples for (b) the sequential estimation manner in CREMA and (a) its non-sequential counterpart (i.e., $n=1$) under two different noise settings. The credibility weights in (b) are clearly distinguishable between clean and noisy data: clean samples possess larger weights, thus contributing more gradient during DNN training, while noisy samples are assigned relatively small weights to alleviate their negative impact. In contrast, the non-sequential weights yield significantly more samples that cannot be correctly separated by a fixed threshold, indicating that the proposed fine-grained sequential credibility estimation is more effective for reliable sample weight modeling.

Figure 4: Empirical PDF of the learned sample credibility weight for the typical non-sequential manner (i.e., sequence length n=1) and the sequential manner in CREMA. The model is trained on MNIST with 20% and 50% symmetric label noise.

5 Conclusion

In this paper, we propose a novel end-to-end robust learning method called CREMA. To address the problem that previous works overlook the intrinsic differences in difficulty among noisy samples, we follow the idea of divide-and-conquer and separate clean and noisy samples by estimating the credibility of each training sample. Two branches are designed to handle the imperfectly separated sets respectively. For easily recognizable noisy samples, we apply a selective label correction scheme that avoids erroneous label updates on clean samples. For hard noisy samples blended with clean ones, likelihood estimation over the historical credibility sequence adaptively modulates the loss term of each sample during training. Extensive experiments on several synthetic and real-world noisy datasets verify the superiority of the proposed method.

References

  • [1] Arazo, E., Ortego, D., Albert, P., O’Connor, N.E., McGuinness, K.: Unsupervised label noise modeling and loss correction. In: Proc. International Conference on Machine Learning (ICML) (2019)
  • [2] Arpit, D., Jastrzkebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M.S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al.: A closer look at memorization in deep networks. In: Proc. International Conference on Machine Learning (ICML). pp. 233–242 (2017)
  • [3] Berthelot, D., Carlini, N., Goodfellow, I.J., Papernot, N., Oliver, A., Raffel, C.: Mixmatch: A holistic approach to semi-supervised learning. Proc. Advances in Neural Information Processing Systems (NeurIPS) (2019)
  • [4] Chang, H.S., Learned-Miller, E., McCallum, A.: Active Bias: Training more accurate neural networks by emphasizing high variance samples. In: Proc. Advances in Neural Information Processing Systems (NeurIPS). pp. 1002–1012 (2017)
  • [5] Dehghani, M., Severyn, A., Rothe, S., Kamps, J.: Avoiding your teacher’s mistakes: Training neural networks with controlled weak supervision. arXiv preprint arXiv:1711.00313 (2017)
  • [6] Dehghani, M., Severyn, A., Rothe, S., Kamps, J.: Learning to learn from weak supervision by full supervision. In: Proc. Advances in Neural Information Processing Systems Workshop (NeurIPSW) (2017)
  • [7] Ghosh, A., Kumar, H., Sastry, P.: Robust loss functions under label noise for deep neural networks. In: Proc. Association for the Advancement of Artificial Intelligence (AAAI) (2017)
  • [8] Goldberger, J., Ben-Reuven, E.: Training deep neural-networks using a noise adaptation layer. In: Proc. International Conference on Learning Representations (ICLR) (2017)
  • [9] Han, B., Yao, J., Niu, G., Zhou, M., Tsang, I., Zhang, Y., Sugiyama, M.: Masking: A new perspective of noisy supervision. In: Proc. Advances in Neural Information Processing Systems (NeurIPS). pp. 5836–5846 (2018)
  • [10] Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., Sugiyama, M.: Co-teaching: Robust training of deep neural networks with extremely noisy labels. In: Proc. Advances in Neural Information Processing Systems (NeurIPS). pp. 8527–8537 (2018)
  • [11] Han, J., Luo, P., Wang, X.: Deep self-learning from noisy labels. In: Proc. IEEE International Conference on Computer Vision (ICCV). pp. 5138–5147 (2019)
  • [12] Hendrycks, D., Mazeika, M., Wilson, D., Gimpel, K.: Using trusted data to train deep networks on labels corrupted by severe noise. In: Proc. Advances in Neural Information Processing Systems (NeurIPS). pp. 10456–10465 (2018)
  • [13] Houle, M.E.: Local intrinsic dimensionality I: An extreme-value-theoretic foundation for similarity applications. In: Proc. International Conference on Similarity Search and Applications (SISAP). pp. 64–79 (2017)
  • [14] Jiang, L., Zhou, Z., Leung, T., Li, L.J., Fei-Fei, L.: MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In: Proc. International Conference on Machine Learning (ICML) (2018)
  • [15] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
  • [16] Li, J., Socher, R., Hoi, S.C.: Dividemix: Learning with noisy labels as semi-supervised learning. arXiv preprint arXiv:2002.07394 (2020)
  • [17] Li, J., Wong, Y., Zhao, Q., Kankanhalli, M.: Learning to learn from noisy labeled data. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5046–5054 (2019)
  • [18] Li, J., Xiong, C., Hoi, S.C.: Learning from noisy data with robust representation learning. In: Proc. IEEE International Conference on Computer Vision (ICCV). pp. 9485–9494 (2021)
  • [19] Li, X., Liu, T., Han, B., Niu, G., Sugiyama, M.: Provably end-to-end label-noise learning without anchor points. Proc. International Conference on Machine Learning (ICML) (2021)
  • [20] Li, Y., Yang, J., Song, Y., Cao, L., Luo, J., Li, L.J.: Learning from noisy labels with distillation. In: Proc. IEEE International Conference on Computer Vision (ICCV). pp. 1910–1918 (2017)
  • [21] Litany, O., Freedman, D.: Soseleto: A unified approach to transfer learning and training with noisy labels. arXiv preprint arXiv:1805.09622 (2018)
  • [22] Liu, T., Tao, D.: Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 38(3), 447–461 (2015)
  • [23] Lyu, Y., Tsang, I.W.: Curriculum loss: Robust learning and generalization against label corruption. In: Proc. International Conference on Learning Representations (ICLR) (2020)
  • [24] Ma, X., Wang, Y., Houle, M.E., Zhou, S., Erfani, S.M., Xia, S.T., Wijewickrema, S., Bailey, J.: Dimensionality-driven learning with noisy labels. In: Proc. International Conference on Machine Learning (ICML) (2018)
  • [25] Maas, A., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies. pp. 142–150 (2011)
  • [26] Malach, E., Shalev-Shwartz, S.: Decoupling "when to update" from "how to update". In: Proc. Advances in Neural Information Processing Systems (NeurIPS). pp. 960–970 (2017)
  • [27] Manning, C., Schutze, H.: Foundations of statistical natural language processing. MIT press (1999)
  • [28] Nguyen, D.T., Mummadi, C.K., Ngo, T.P.N., Nguyen, T.H.P., Beggel, L., Brox, T.: SELF: Learning to filter noisy labels with self-ensembling. In: Proc. International Conference on Learning Representations (ICLR) (2020)
  • [29] Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., Qu, L.: Making deep neural networks robust to label noise: A loss correction approach. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1944–1952 (2017)
  • [30] Pereyra, G., Tucker, G., Chorowski, J., Kaiser, L., Hinton, G.E.: Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548 (2017)
  • [31] Permuter, H., Francos, J.M., Jermyn, I.: A study of gaussian mixture models of color and texture features for image classification and segmentation. Pattern Recognition (PR). 39, 695–706 (2006)
  • [32] Pleiss, G., Zhang, T., Elenberg, E.R., Weinberger, K.Q.: Identifying mislabeled data using the area under the margin ranking. Proc. Advances in Neural Information Processing Systems (NeurIPS) (2020)
  • [33] Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., Rabinovich, A.: Training deep neural networks on noisy labels with bootstrapping. In: Proc. International Conference on Learning Representations (ICLR) (2015)
  • [34] Ren, M., Zeng, W., Yang, B., Urtasun, R.: Learning to reweight examples for robust deep learning. In: Proc. International Conference on Machine Learning (ICML) (2018)
  • [35] Shu, J., Xie, Q., Yi, L., Zhao, Q., Zhou, S., Xu, Z., Meng, D.: Meta-Weight-Net: Learning an explicit mapping for sample weighting. In: Proc. Advances in Neural Information Processing Systems (NeurIPS). pp. 1917–1928 (2019)
  • [36] Song, H., Kim, M., Lee, J.G.: SELFIE: Refurbishing unclean samples for robust deep learning. In: Proc. International Conference on Machine Learning (ICML). pp. 5907–5915 (2019)
  • [37] Tanaka, D., Ikami, D., Yamasaki, T., Aizawa, K.: Joint optimization framework for learning with noisy labels. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5552–5560 (2018)
  • [38] Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In: Proc. Advances in Neural Information Processing Systems (NeurIPS) (2017)
  • [39] Veit, A., Alldrin, N., Chechik, G., Krasin, I., Gupta, A., Belongie, S.: Learning from noisy large-scale datasets with minimal supervision. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)
  • [40] Wang, R., Liu, T., Tao, D.: Multiclass learning with partially corrupted labels. IEEE Transactions on Neural Networks and Learning Systems 29(6), 2568–2580 (2017)
  • [41] Wang, X., Hua, Y., Kodirov, E., Clifton, D.A., Robertson, N.M.: Proselflc: Progressive self label correction for training robust deep neural networks. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 752–761 (2021)
  • [42] Wang, X., Hua, Y., Kodirov, E., Robertson, N.M.: Imae for noise-robust learning: Mean absolute error does not treat examples equally and gradient magnitude’s variance matters. arXiv preprint arXiv:1903.12141 (2019)
  • [43] Wang, X., Kodirov, E., Hua, Y., Robertson, N.M.: Derivative manipulation for general example weighting. arXiv preprint arXiv:1905.11233 (2019)
  • [44] Wang, Y., Ma, X., Chen, Z., Luo, Y., Yi, J., Bailey, J.: Symmetric cross entropy for robust learning with noisy labels. In: Proc. IEEE International Conference on Computer Vision (ICCV). pp. 322–330 (2019)
  • [45] Wei, H., Feng, L., Chen, X., An, B.: Combating noisy labels by agreement: A joint training method with co-regularization. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13726–13735 (2020)
  • [46] Xiao, T., Xia, T., Yang, Y., Huang, C., Wang, X.: Learning from massive noisy labeled data for image classification. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2691–2699 (2015)
  • [47] Yao, Q., Yang, H., Han, B., Niu, G., Kwok, J.: Searching to exploit memorization effect in learning with noisy labels. In: Proc. International Conference on Machine Learning (ICML) (2020)
  • [48] Yi, K., Wu, J.: Probabilistic end-to-end noise correction for learning with noisy labels. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7017–7025 (2019)
  • [49] Yu, X., Han, B., Yao, J., Niu, G., Tsang, I.W., Sugiyama, M.: How does disagreement help generalization against label corruption? In: Proc. International Conference on Machine Learning (ICML) (2019)
  • [50] Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep learning requires rethinking generalization. In: Proc. International Conference on Learning Representations (ICLR) (2017)
  • [51] Zhang, H., Cissé, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: Proc. International Conference on Learning Representations (ICLR) (2018)
  • [52] Zhang, M., Lee, J., Agarwal, S.: Learning from noisy labels with no change to the training process. In: Proc. International Conference on Machine Learning (ICML). pp. 12468–12478 (2021)
  • [53] Zhang, W., Wang, Y., Qiao, Y.: Metacleaner: Learning to hallucinate clean representations for noisy-labeled visual recognition. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7365–7374 (2019)
  • [54] Zhang, Y., Zheng, S., Wu, P., Goswami, M., Chen, C.: Learning with feature-dependent label noise: A progressive approach. In: Proc. International Conference on Learning Representations (ICLR) (2021)
  • [55] Zhang, Z., Sabuncu, M.: Generalized cross entropy loss for training deep neural networks with noisy labels. In: Proc. Advances in Neural Information Processing Systems (NeurIPS). pp. 8778–8788 (2018)
  • [56] Zhang, Z., Zhang, H., Arik, S.O., Lee, H., Pfister, T.: Distilling effective supervision from severe label noise. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9294–9303 (2020)
  • [57] Zheltonozhskii, E., Baskin, C., Mendelson, A., Bronstein, A.M., Litany, O.: Contrast to divide: Self-supervised pre-training for learning with noisy labels. arXiv preprint arXiv:2103.13646 (2021)
  • [58] Zhou, T., Wang, S., Bilmes, J.: Robust curriculum learning: From clean label detection to noisy label self-correction. In: Proc. International Conference on Learning Representations (ICLR) (2021)