SemiNLL: A Framework of Noisy-Label Learning by Semi-Supervised Learning

Zhuowei Wang¹, Jing Jiang¹, Bo Han^2,4, Lei Feng³,
Bo An³, Gang Niu⁴, Guodong Long¹

¹University of Technology Sydney, Australia
²Hong Kong Baptist University, HKSAR, China
³Nanyang Technological University, Singapore
⁴RIKEN Center for Advanced Intelligence Project, Japan
{zhuowei.wang}@student.uts.edu.au, {jing.jiang, guodong.long}@uts.edu.au,
{feng0093, boan}@ntu.edu.sg, {bo.han, gang.niu}@riken.jp

Abstract

Deep learning with noisy labels is a challenging task. Recent prominent methods that build on a specific sample selection (SS) strategy and a specific semi-supervised learning (SSL) model achieved state-of-the-art performance. Intuitively, better performance could be achieved if stronger SS strategies and SSL models are employed. Following this intuition, one might easily derive various effective noisy-label learning methods using different combinations of SS strategies and SSL models, which is, however, reinventing the wheel in essence. To prevent this problem, we propose SemiNLL, a versatile framework that combines SS strategies and SSL models in an end-to-end manner. Our framework can absorb various SS strategies and SSL backbones, utilizing their power to achieve promising performance. We also instantiate our framework with different combinations, which set the new state of the art on benchmark-simulated and real-world datasets with noisy labels.

1 Introduction

Deep Neural Networks (DNNs) have achieved great success in different computer vision problems, e.g., image classification [20], detection [36], and semantic segmentation [27]. Such success is demanding for large datasets with clean human-annotated labels. However, it is costly and time-consuming to correctly label massive images for building a large-scale dataset like ImageNet [5]. Some common and less expensive ways to collect large datasets are through online search engines [37] or crowdsourcing [51], which would, unfortunately, bring noisy labels to the collected datasets. Besides, an in-depth study [52] showed that deep learning with noisy labels can lead to severe performance deterioration. Thus, it is crucial to alleviate the negative effects caused by noisy labels for training DNNs.

A typical strategy is to conduct sample selection (SS) and to train DNNs with selected samples [12, 18, 40, 45, 50]. Since DNNs tend to learn simple patterns first before fitting noisy samples [3], many studies utilize the small-loss trick, where the samples with smaller losses are taken as clean ones. For example, Co-teaching [12] leverages two networks to select small-loss samples within each mini-batch for training each other. Later, Yu et al. [50] pointed out the importance of the disagreement between two networks and proposed Co-teaching+, which updates the two networks using the data on which the two networks hold different predictions. By contrast, JoCoR [45] proposes to reduce the diversity between two networks by training them simultaneously with a joint loss calculated from the selected small-loss samples. Although these methods have achieved satisfactory performance by training with selected small-loss samples, they simply discard other large-loss samples which may contain potentially useful information for the training process.

To make full use of all given samples, a prominent strategy is to consider selected samples as labeled “clean” data and other samples as unlabeled data, and to perform semi-supervised learning (SSL) [2, 4, 22, 42]. Following this strategy, SELF [32] detects clean samples by progressively removing noisy samples whose self-ensemble predictions of the model do not match the given labels in each iteration. With the selected labeled and unlabeled data, the problem becomes an SSL problem, and a Mean-Teacher model [42] can be trained. Another recent method, DivideMix [26], leverages Gaussian Mixture Model (GMM) [34] to distinguish clean (labeled) and noisy (unlabeled) data, and then uses a strong SSL backbone called MixMatch [4]. DivideMix achieves state-of-the-art results across different benchmark datasets.

As shown above, both methods rely on a specific SS strategy and a specific SSL model. The two components play a vitally important role for combating label noise, and stronger components are expected to achieve better performance. This motivates us to investigate a general algorithmic framework that can leverage various SS strategies and SSL models. In this paper, we propose SemiNLL, which is a versatile framework to bridge the gap between SSL and noisy-label learning (NLL). Our framework can absorb various SS strategies and SSL backbones, utilizing their power to achieve promising performance. Guided by our framework, one can easily instantiate a specific learning algorithm for NLL, by specifying a commonly used SSL backbone with an SS strategy. The key contributions of our paper can be summarized as follows:

•

To avoid reinventing the wheel for NLL using SSL algorithms, we propose a versatile framework that can absorb various SS strategies and SSL algorithms. Our framework is advantageous since better performance would be achieved if stronger components (including the ones proposed in the future) are used.
•

To instantiate our framework, we propose DivideMix+ by replacing the epoch-level selection strategy of DivideMix [26] with a mini-batch level one. We also propose GPL, another instantiation of our framework that leverages a two-component Gaussian mixture model [26, 34] to select labeled (unlabeled) data and uses Pseudo-Labeling [2] as the SSL backbone.
•

We conduct extensive experiments on benchmark-simulated and real-world datasets with noisy labels. Empirical results show that the stronger SS strategies and SSL backbones we use, the better performance SemiNLL could achieve. In addition, our instantiations, DivideMix+ and GPL, outperform other state-of-the-art noisy-label learning methods.

2 Related work

In this section, we briefly review several related aspects on which our framework builds.

2.1 Learning with noisy labels

For NLL, most of the existing methods could be roughly categorized into the following groups:

Sample selection. This family of methods regards samples with small loss as “clean” and trains the model only on selected clean samples. For example, self-paced MentorNet [18], or equivalently self-teaching, selects small-loss samples and uses them to train the network by itself. To alleviate the sample-selection bias in self-teaching, Han et al. [12] proposed an algorithm called Co-teaching [12], where two networks choose the next batch of data for each other for training based on the samples with smaller loss values. Co-teaching+ [50] bridges the disagreement strategy [29] with Co-teaching [12] by updating the networks over data where two networks make different predictions. In contrast, Wei et al. [45] leveraged the agreement maximization algorithm [21] by designing a joint loss to train two networks on the same mini-batch data and selected small-loss samples to update the parameters of both networks. The mini-batch SS strategy in our framework belongs to this direction. However, instead of ignoring the large-loss unclean samples, we just discard their labels and exploit the associated images in an SSL setup.

Noise transition estimation. Another line of NLL is to estimate the noise transition matrix for loss correction [11, 16, 30, 31, 33, 44, 47]. Patrini et al. [33] first estimated the noise transition matrix and trained the network with two different loss corrections. Hendrycks et al. [16] proposed a loss correction technique that utilizes a small portion of trusted samples to estimate the noise transition matrix. However, the limitation of these methods is that they do not perform well on datasets with a large number of classes.

Other deep learning methods. Some other interesting and promising directions for NLL include meta-learning [8, 39] based, pseudo-label estimation [24] based, and robust loss [7, 10, 28, 43, 48, 54] based approaches. For meta-learning based approaches, most studies fall into two main categories: training a model that adapts fast to different learning tasks without overfitting to corrupted labels [9, 25], and learning to reweight loss of each mini-batch to alleviate the adverse effects of corrupted labels [35, 38, 55]. Pseudo-label estimation based approaches reassign the labels for noisy samples. For example, Joint-Optim [41] corrects labels during training and updates network parameters simultaneously. PENCIL [49] proposes a probabilistic model, which can update network parameters and reassign labels as label distributions. The family of pseudo-label estimation has a close relationship with semi-supervised learning [13, 24, 41, 49]. Robust loss based approaches focus on designing loss functions that are robust to noisy labels.

2.2 Semi-supervised learning

SSL methods leverage unlabeled data to provide additional information for the training model. A line of work is based on the concept of consistency regularization: if a perturbation is given to an unlabeled sample, the model predictions of the same sample should not be too different. Laine and Aila [22] applied consistency between the output of the current network and the exponential moving average (EMA) of the output from the past epochs. Instead of averaging the model outputs, Tarvainen and Valpola [42] proposed to update the network on every mini-batch using an EMA of model parameter values. Berthelot et al. [4] introduced a holistic approach that well combines MixUp [53], entropy minimization, and consistency regularization. Another line of SSL is pseudo-labeling, the objective of which is to generate pseudo-labels for unlabeled samples to enhance the learning process. Recently, Arazo et al. [2] proposed a method to improve previous pseudo-labeling methods [17, 24] by adding MixUp augmentation [53] and setting a minimum number of labeled samples per mini-batch to reduce accumulated error of wrong pseudo-labels.

2.3 Combination of SS and SSL

Some previous studies that combine a specific SS strategy and a specific SSL backbone could be regarded as special cases in our framework. Ding et al. [6] used a pre-trained DNN on the noisy dataset to select labeled samples. In the SSL stage, Temporal Ensembling [22] was used to handle labeled and unlabeled data. Nguyen et al. [32] proposed a progressive noise filtering mechanism based on the Mean-Teacher model [42] and its self-ensemble prediction. Li et al. [26] used a Gaussian Mixture Model (GMM) to divide noisy and clean samples based on their training losses and fitted them into a recent SSL algorithm called MixMatch [4]. Each specific component used in these methods has its own pros and cons. This motivates us to propose a versatile framework that can build on a variety of SS strategies and SSL backbones. In other words, all the methods mentioned in this subsection could be taken as instantiations of our framework.

Refer to caption — Figure 1: The schematic of SemiNLL. First, each mini-batch of data is forwarded to the network to conduct SS, which divides the original data into the labeled/unlabeled sets. Second, labeled/unlabeled samples are used to train the SSL backbone to produce accurate model output.

3 The overview of SemiNLL

Algorithm 1 SemiNLL

0: Network

f_{\theta}

, SS strategy select, SSL method semi, epoch

T_{\text{max}}

, iteration

I_{\text{max}}

;

1: for

t

= 1,2,…,

T_{\text{max}}

2: Shuffle training set

{\mathcal{D}}_{train}

;

3: for

n=1,\ldots,I_{\text{max}}

4: Fetch mini-batch

D_{n}

from

{\mathcal{D}}_{train}

;

5: Obtain

{\mathcal{X}}_{m},{\mathcal{U}}_{m}\leftarrow\textsc{select}(D_{n},f_{\theta})

;

6: Update

f_{\theta}\leftarrow\textsc{semi}({\mathcal{X}}_{m},{\mathcal{U}}_{m},f_{\theta})

;

7: end for

8: end for

f_{\theta}

In this section, we present SemiNLL, a versatile framework of learning with noisy labels by SSL. The idea behind our framework is that we effectively take advantage of the whole training set by trusting the labels of undoubtedly correct samples and utilizing only the image content of potentially corrupted samples. Previous sample selection methods [12, 18, 45, 50] train the network only with selected clean samples, and they discard all potentially corrupted samples to avoid the harmful memorization of DNNs caused by the noisy labels of these samples. In this way, the feature information contained in the associated images might be discarded without being exploited. Our framework, alternatively, makes use of those corrupted samples by ignoring their labels while keeping the associated image content, transforming the NLL problem into an SSL setup. The mechanism of SSL that leverages labeled data to guide the learning of unlabeled data naturally fits well in training the model with the clean and noisy samples divided by our SS strategy. The schematic of our framework is shown in Figure 1. We first discuss the advantages of the mini-batch SS strategy in our framework and then introduce several SSL backbones used in our framework.

3.1 Mini-batch sample selection

During the SS process, a hazard called confirmation bias [42] is worth noting. Since the model is trained using the selected clean (labeled) and noisy (unlabeled) samples, wrongly selected clean samples in this iteration may keep being considered clean ones in the next iteration due to the model overfitting to their labels. Most existing methods [26, 32] divide the whole training set into the clean/noisy set on an epoch level. In this case, the learned selecting knowledge is incorporated into the SSL phase and will not be updated till the next epoch. Thus, the confirmation bias induced from those wrongly divided samples will accumulate within the whole epoch. To overcome this problem, our mini-batch SS strategy divides each mini-batch of samples into the clean subset ${\mathcal{X}}_{m}$ and the noisy subset ${\mathcal{U}}_{m}$ (Line 5 in Algorithm 1) right before updating the network using SSL backbones. In the next mini-batch, the network can know better to distinguish clean and noisy samples, alleviating the confirmation bias mini-batch by mini-batch.

3.2 SSL backbones

The mechanism of SSL that uses labeled data to guide the learning of unlabeled data fits well when dealing with clean/noisy data in NLL. The difference lies in an extra procedure, as introduced in Subsection 3.1, that divides the whole dataset into clean and noisy data. After the SS process, clean samples are considered labeled data and keep their annotated labels. The others are considered noisy samples, and their labels are discarded to be treated as unlabeled ones in SSL backbones. SemiNLL can build on a variety of SSL algorithms without any modifications to form an end-to-end training scheme for NLL. Concretely, we consider the following representative SSL backbones ranging from weak to strong according to their performance in SSL tasks:

(i)

Temporal Ensembling [22], where the model uses an exponential moving average (EMA) of label predictions from the past epochs as a target for the unsupervised loss. It enforces consistency of predictions by minimizing the difference between the current outputs and the EMA outputs.
(ii)

MixMatch [4], which is a holistic method that combines MixUp [53], entropy minimization, consistency regularization, and other traditional regularization tricks. It guesses low-entropy labels for augmented unlabeled samples and mixes labeled and unlabeled data using MixUp [53].
(iii)

Pseudo-Labeling [2], which learns from unlabeled data by combining soft pseudo-label generation [41] and MixUp augmentation [53] to reduce confirmation bias in training.

In the next section, we will instantiate our framework by applying specific SS strategies and SSL backbones to the select and semi placeholders in Algorithm 1.

4 The instantiations of SemiNLL

4.1 Instantiation 1: DivideMix+

In Algorithm 1, if we (i) specify the select placeholder as a GMM [34], (ii) specify the semi placeholder as MixMatch [4] mentioned in Subsection 3.2, and (iii) train two independent networks wherein each network selects clean/noisy samples in the SS phase and predicts labels in the SSL phase for the other network, then our framework is instantiated into a mini-batch version of DivideMix [26]. Specifically, during the SS process, DivideMix [26] fits a two-component GMM to the loss $\ell_{i}$ of each sample using the Expectation-Maximization technique and obtains the posterior probability of a sample being clean or noisy:

p\!\left(k\mid\ell_{i}\right)=\frac{p\!\left(k\right)p\!\left(\ell_{i}\mid k\right)}{p\!\left(\ell_{i}\right)},

(1)

where $k=0\left(1\right)$ denotes the clean (noisy) set. During the SSL phase, the clean set ${\mathcal{X}}_{e}$ and the noisy set ${\mathcal{U}}_{e}$ are fit into an improved MixMatch [4] strategy with label co-refinement and co-guessing. As shown in Figure 2(b), the SS strategy (GMM) of DivideMix [26] is conducted on an epoch level. Since ${\mathcal{X}}_{e}$ and ${\mathcal{U}}_{e}$ are updated only once per epoch, the confirmation bias induced from the wrongly divided samples will be accumulated within the whole epoch. However, our mini-batch version, which is called DivideMix+ (Figure 2(c)), divides each mini-batch of data into a clean subset ${\mathcal{X}}_{m}$ and a noisy subset ${\mathcal{U}}_{m}$ , and updates the networks using the SSL backbone right afterwards. In the next mini-batch, the updated networks could better distinguish clean and noisy samples.

Datasets	Method	Symmetric
		Noise ratio				Mean
		20%	40%	60%	80%
MNIST	Cross-Entropy	$86.16\pm 0.34$	$70.39\pm 0.59$	$50.35\pm 0.51$	$23.41\pm 0.96$	$57.58$
	Coteaching	$91.20\pm 0.03$	$90.02\pm 0.02$	$83.21\pm 0.71$	$25.33\pm 0.84$	$72.44$
	F-correction	$93.93\pm 0.10$	$84.30\pm 0.43$	$65.06\pm 0.64$	$29.81\pm 0.63$	$68.27$
	GCE	$94.36\pm 0.11$	$93.61\pm 0.17$	$92.46\pm 0.20$	$85.04\pm 0.66$	$91.37$
	M-correction	$\boldsymbol{97.25\pm 0.03}$	$\underline{96.63\pm 0.04}$	$95.07\pm 0.08$	$86.19\pm 0.42$	$93.79$
	DivideMix	$96.80\pm 0.08$	$96.53\pm 0.06$	$\underline{96.47\pm 0.04}$	$\underline{95.15\pm 0.25}$	$\underline{96.24}$
	GPL (ours)	$96.67\pm 0.09$	$96.27\pm 0.08$	$95.82\pm 0.09$	$94.81\pm 0.15$	$95.89$
	DivideMix+ (ours)	$\underline{96.83\pm 0.06}$	$\boldsymbol{96.79\pm 0.06}$	$\boldsymbol{96.69\pm 0.03}$	$\boldsymbol{95.91\pm 0.10}$	$\boldsymbol{96.56}$
FASHION MNIST	Cross-Entropy	$90.83\pm 0.26$	$86.44\pm 0.11$	$77.27\pm 0.56$	$61.84\pm 1.27$	$79.10$
	Coteaching	$89.18\pm 0.32$	$89.13\pm 0.05$	$80.08\pm 0.25$	$60.36\pm 2.15$	$79.69$
	F-correction	$\boldsymbol{93.37\pm 0.17}$	$92.27\pm 0.06$	$90.32\pm 0.30$	$85.78\pm 0.06$	$90.43$
	GCE	$\underline{93.35\pm 0.09}$	$92.58\pm 0.11$	$91.30\pm 0.20$	$88.01\pm 0.22$	$\underline{91.31}$
	M-correction	$93.03\pm 0.15$	$\underline{92.74\pm 0.42}$	$\underline{91.61\pm 0.02}$	$85.25\pm 0.23$	$90.66$
	DivideMix	$92.98\pm 0.17$	$92.55\pm 0.13$	$91.55\pm 0.31$	$\underline{88.55\pm 0.24}$	$90.66$
	GPL (ours)	$92.94\pm 0.20$	$91.38\pm 0.54$	$89.97\pm 0.16$	$87.14\pm 0.65$	$90.36$
	DivideMix+ (ours)	$93.20\pm 0.08$	$\boldsymbol{92.89\pm 0.15}$	$\boldsymbol{92.15\pm 0.16}$	$\boldsymbol{88.70\pm 0.17}$	$\boldsymbol{91.74}$
CIFAR-10	Cross-Entropy	$83.48\pm 0.17$	$68.49\pm 0.40$	$48.65\pm 0.06$	$27.56\pm 0.43$	$57.05$
	Coteaching	$67.73\pm 0.71$	$62.83\pm 0.72$	$48.81\pm 0.78$	$27.56\pm 2.71$	$51.73$
	F-correction	$83.27\pm 0.04$	$73.67\pm 0.30$	$77.64\pm 0.11$	$63.95\pm 0.32$	$74.63$
	GCE	$89.72\pm 0.10$	$87.75\pm 0.05$	$84.11\pm 0.26$	$72.84\pm 0.30$	$83.61$
	M-correction	$92.01\pm 0.40$	$90.09\pm 0.68$	$85.90\pm 0.22$	$70.57\pm 0.85$	$84.64$
	DivideMix	$\underline{94.82\pm 0.09}$	$93.95\pm 0.14$	$92.28\pm 0.08$	$89.30\pm 0.17$	$92.59$
	GPL (ours)	$94.45\pm 0.20$	$\underline{94.00\pm 0.22}$	$\boldsymbol{93.32\pm 0.10}$	$\underline{91.76\pm 0.23}$	$\underline{93.38}$
	DivideMix+ (ours)	$\boldsymbol{94.84\pm 0.12}$	$\boldsymbol{94.03\pm 0.20}$	$\underline{93.08\pm 0.19}$	$\boldsymbol{91.91\pm 0.07}$	$\boldsymbol{93.47}$
CIFAR-100	Cross-Entropy	$60.93\pm 0.40$	$46.24\pm 0.74$	$29.00\pm 0.38$	$11.42\pm 0.19$	$36.90$
	F-correction	$60.49\pm 0.29$	$48.93\pm 0.21$	$48.74\pm 0.41$	$22.93\pm 0.78$	$45.27$
	GCE	$69.20\pm 0.10$	$65.90\pm 0.25$	$57.33\pm 0.18$	$18.19\pm 1.15$	$52.66$
	M-correction	$67.96\pm 0.17$	$64.48\pm 0.76$	$55.37\pm 0.72$	$24.21\pm 1.06$	$53.01$
	DivideMix	$\underline{73.17\pm 0.28}$	$\underline{71.01\pm 0.16}$	$\underline{66.61\pm 0.18}$	$43.25\pm 0.82$	$63.51$
	GPL (ours)	$71.24\pm 0.24$	$68.89\pm 0.07$	$65.80\pm 0.63$	$\boldsymbol{59.96\pm 0.15}$	$\underline{66.47}$
	DivideMix+ (ours)	$\boldsymbol{73.22\pm 0.21}$	$\boldsymbol{71.03\pm 0.32}$	$\boldsymbol{67.52\pm 0.19}$	$\underline{58.07\pm 0.71}$	$\boldsymbol{67.46}$

Table 1: Average test accuracy (%) and standard deviation (5 runs) of all the methods in various datasets under symmetric label noise. The best accuracy is bold-faced. The second-best accuracy is underlined.

4.2 Instantiation 2: GPL

Intuitively, the choice of stronger SS strategies and SSL models would achieve better performance based on our framework. Thus, we still choose GMM to distinguish clean and noisy samples due to its flexibility in the sharpness of distribution [26]. As for the SSL backbone, we choose the strongest Pseudo-Labeling [2] introduced in Subsection 3.2. We call this instantiation GPL (GMM + Pseudo-Labeling). Note that we do not train two networks in GPL as in DivideMix [26] and DivideMix+. To our understanding, training two networks simultaneously might provide significant improvements in performance. However, this is outside the scope of this paper, since our goal is to demonstrate the versatility of our framework.

4.3 Self-prediction divider

Inspired by SELF [32], we introduce the self-prediction divider, a simple yet effective SS strategy which leverages the information provided by the network’s own prediction to distinguish clean and noisy samples. Based on the phenomenon that DNN’s predictions tend to be consistent on clean samples and inconsistent on noisy samples in different training iterations, we select the correctly annotated samples via the consistency between the original label set and the model’s own predictions. The self-prediction divider determines potentially clean samples in a mini-batch if the samples’ maximal likelihood predictions of the network match their annotated labels. Specifically, the samples are divided into the labeled set only if the model predicts the annotated label to be the correct class with the highest likelihood. The others are considered noisy samples, and their labels will be discarded to be regarded as unlabeled ones in SSL backbones. Compared to previous small-loss SS methods [12, 45, 50], which depend on a known noise ratio to control how many small-loss samples should be selected in each training iteration, self-prediction divider does not need any additional information to perform SS strategy where the clean subset and the noisy subset are determined by the network itself. Concretely, we instantiate three learning algorithms by combining our self-prediction divider (SPD) with three SSL backbones introduced in Subsection 3.2 and denote them as SPD-Temporal Ensembling, SPD-MixMatch, and SPD-Pseudo-Labeling, respectively.

Datasets	Method	Asymmetric
		Noise ratio				Mean
		10%	20%	30%	40%
MNIST	Cross-Entropy	$95.78\pm 0.19$	$91.15\pm 0.26$	$86.01\pm 0.25$	$79.92\pm 0.32$	$88.22$
	Coteaching	$90.32\pm 0.02$	$89.03\pm 0.02$	$79.80\pm 0.27$	$64.94\pm 0.02$	$81.02$
	F-correction	$96.39\pm 0.04$	$94.27\pm 0.21$	$89.33\pm 0.94$	$81.61\pm 0.42$	$90.40$
	GCE	$94.61\pm 0.13$	$94.43\pm 0.07$	$94.00\pm 0.12$	$93.42\pm 0.12$	$94.12$
	M-correction	$\underline{96.74\pm 0.03}$	$\underline{96.70\pm 0.10}$	$\boldsymbol{96.67\pm 0.07}$	$94.85\pm 0.40$	$96.24$
	DivideMix	$96.17\pm 0.06$	$96.11\pm 0.09$	$95.88\pm 0.05$	$95.83\pm 0.05$	$96.00$
	GPL (ours)	$\boldsymbol{96.76\pm 0.04}$	$\boldsymbol{96.71\pm 0.03}$	$96.49\pm 0.08$	$\underline{96.45\pm 0.04}$	$\boldsymbol{96.60}$
	DivideMix+ (ours)	$96.67\pm 0.04$	$96.66\pm 0.07$	$\underline{96.50\pm 0.04}$	$\boldsymbol{96.46\pm 0.04}$	$\underline{96.57}$
FASHION MNIST	Cross-Entropy	$\underline{93.88\pm 0.16}$	$92.20\pm 0.33$	$90.41\pm 0.67$	$84.56\pm 0.41$	$90.26$
	Coteaching	$88.01\pm 0.03$	$78.88\pm 0.20$	$70.07\pm 0.38$	$61.97\pm 0.21$	$74.73$
	F-correction	$\boldsymbol{94.17}\pm 0.12$	$\boldsymbol{93.88\pm 0.10}$	$\boldsymbol{93.50\pm 0.10}$	$\boldsymbol{93.25\pm 0.16}$	$\boldsymbol{93.7}$
	GCE	$93.51\pm 0.17$	$\underline{93.24\pm 0.14}$	$\underline{92.21\pm 0.27}$	$89.53\pm 0.53$	$92.12$
	M-correction	$92.11\pm 0.93$	$91.26\pm 1.35$	$89.79\pm 1.28$	$89.58\pm 2.20$	$90.69$
	DivideMix	$91.83\pm 0.24$	$91.09\pm 0.08$	$89.90\pm 0.26$	$87.58\pm 0.26$	$90.10$
	GPL (ours)	$92.52\pm 0.22$	$92.23\pm 0.09$	$92.15\pm 0.26$	$\underline{91.64\pm 0.31}$	$\underline{92.14}$
	DivideMix+ (ours)	$92.56\pm 0.39$	$92.25\pm 0.21$	$91.62\pm 0.08$	$89.67\pm 0.44$	$91.53$
CIFAR-10	Cross-Entropy	$90.85\pm 0.06$	$87.23\pm 0.40$	$81.92\pm 0.32$	$76.23\pm 0.45$	$84.06$
	Coteaching	$62.85\pm 2.20$	$61.04\pm 1.31$	$54.50\pm 0.39$	$51.68\pm 1.66$	$57.52$
	F-correction	$89.79\pm 0.33$	$86.79\pm 0.67$	$83.34\pm 0.30$	$76.81\pm 1.08$	$84.18$
	GCE	$90.40\pm 0.09$	$89.30\pm 0.13$	$86.89\pm 0.22$	$82.60\pm 0.17$	$87.30$
	M-correction	$92.28\pm 0.12$	$92.13\pm 0.17$	$91.38\pm 0.11$	$90.43\pm 0.23$	$91.56$
	DivideMix	$93.61\pm 0.15$	$92.99\pm 0.21$	$91.79\pm 0.36$	$90.57\pm 0.31$	$92.24$
	GPL (ours)	$\boldsymbol{94.32\pm 0.01}$	$\boldsymbol{94.23\pm 0.07}$	$\boldsymbol{93.79\pm 0.06}$	$\boldsymbol{93.02\pm 0.30}$	$\boldsymbol{93.84}$
	DivideMix+ (ours)	$\underline{94.27\pm 0.23}$	$\underline{93.92\pm 0.20}$	$\underline{92.82\pm 0.28}$	$\underline{91.91\pm 0.24}$	$\underline{93.23}$
CIFAR-100	Cross-Entropy	$68.58\pm 0.34$	$68.82\pm 0.22$	$53.99\pm 0.50$	$44.31\pm 0.23$	$58.93$
	F-correction	$68.87\pm 0.06$	$64.11\pm 0.37$	$56.45\pm 0.59$	$46.44\pm 0.50$	$58.97$
	GCE	$70.77\pm 0.14$	$69.22\pm 0.15$	$64.60\pm 0.25$	$51.72\pm 1.17$	$64.08$
	M-correction	$69.44\pm 0.52$	$67.25\pm 0.81$	$63.16\pm 1.55$	$52.90\pm 1.79$	$63.19$
	DivideMix	$\boldsymbol{74.00\pm 0.29}$	$\underline{73.28\pm 0.42}$	$\boldsymbol{72.84\pm 0.36}$	$54.33\pm 0.69$	$68.61$
	GPL (ours)	$71.94\pm 0.29$	$71.22\pm 0.11$	$70.56\pm 0.23$	$\boldsymbol{69.84\pm 0.41}$	$\boldsymbol{70.89}$
	DivideMix+ (ours)	$\underline{73.49\pm 0.31}$	$\boldsymbol{73.30\pm 0.22}$	$\underline{72.36\pm 0.43}$	$\underline{55.63\pm 0.60}$	$\underline{68.70}$

Table 2: Average test accuracy (%) and standard deviation (5 runs) of all the methods in various datasets under asymmetric label noise. The best accuracy is bold-faced. The second-best accuracy is underlined.

4.4 Effects of the two components

This section demonstrates the effects of SS strategies and SSL backbones in our framework. To prove that a more robust SS strategy can boost performance for our framework, we propose DivideMix- (Figure 2(a)) by replacing the GMM in DivideMix [26] with our self-prediction divider on an epoch level. Since self-prediction divider is supposed to be weaker than GMM, DivideMix- is expected to achieve lower performance than DivideMix [26]. To prove the effectiveness of the SSL backbone, we remove it after the SS process and only update the model using the supervised loss calculated from the clean samples. We will give detailed discussions in Subsection 5.3 and Subsection 5.4.

5 Experiments

In this section, we first compare two instantiations of our framework, DivideMix+ and GPL, with other state-of-the-art methods. We also analyze the effects of SS strategies by comparing DivideMix-, DivideMix [26], and DivideMix+, then analyze the effects of SSL backbones by combining three representative SSL methods with our self-prediction divider. More information of our experiments can be found in supplementary materials.

5.1 Experiment setup

Datasets. We thoroughly evaluate our proposed DivideMix+ and GPL on five datasets, including MNIST [23], FASHION-MNIST [46], CIFAR-10, CIFAR-100 [19], and Clothing1M [47].

MNIST and FASHION-MNIST contain 60K training images and 10K test images of size $28\times 28$ . CIFAR-10 and CIFAR-100 contain 50K training images and 10K test images of size $32\times 32$ with three channels. According to previous studies [26, 45, 54], we experiment with two types of label noise: symmetric noise and asymmetric noise. Symmetric label noise is produced by changing the original label to all possible labels randomly and uniformly according to the noise ratio. Asymmetric label noise is similar to real-world noise, where labels are flipped to similar classes.

Clothing1M is a large-scale real-world dataset that consists of one million training images from online shopping websites with labels annotated from surrounding texts. The estimated noise ratio is approximately 40% [47].

Datasets	Method	Symmetric				Asymmetric
		Noise ratio				Noise ratio
		20%	40%	60%	80%	10%	20%	30%	40%
CIFAR-10	DivideMix-	$94.49\pm 0.02$	$93.64\pm 0.12$	$91.65\pm 0.34$	$76.61\pm 1.26$	$93.58\pm 0.02$	$92.87\pm 0.14$	$91.21\pm 0.21$	$90.42\pm 0.23$
	DivideMix	$94.82\pm 0.09$	$93.95\pm 0.14$	$92.28\pm 0.08$	$89.30\pm 0.17$	$93.61\pm 0.15$	$92.99\pm 0.21$	$91.79\pm 0.36$	$90.57\pm 0.31$
	DivideMix+ (ours)	$\boldsymbol{94.84\pm 0.12}$	$\boldsymbol{94.03\pm 0.20}$	$\boldsymbol{93.08\pm 0.19}$	$\boldsymbol{91.91\pm 0.07}$	$\boldsymbol{94.27\pm 0.23}$	$\boldsymbol{93.92\pm 0.20}$	$\boldsymbol{92.82\pm 0.28}$	$\boldsymbol{91.91\pm 0.24}$
CIFAR-100	DivideMix-	$72.51\pm 0.32$	$69.27\pm 0.46$	$61.13\pm 0.60$	$25.96\pm 0.78$	$73.62\pm 0.12$	$72.32\pm 0.24$	$70.64\pm 0.20$	$\boldsymbol{68.04\pm 1.24}$
	DivideMix	$73.17\pm 0.28$	$71.01\pm 0.16$	$66.61\pm 0.18$	$43.25\pm 0.82$	$\boldsymbol{74.00\pm 0.29}$	$73.28\pm 0.42$	$\boldsymbol{72.84\pm 0.36}$	$54.33\pm 0.69$
	DivideMix+ (ours)	$\boldsymbol{73.22\pm 0.21}$	$\boldsymbol{71.03\pm 0.32}$	$\boldsymbol{67.52\pm 0.19}$	$\boldsymbol{58.07\pm 0.71}$	$73.49\pm 0.31$	$\boldsymbol{73.30\pm 0.22}$	$72.36\pm 0.43$	$55.63\pm 0.60$

Table 3: Test accuracy (%) of DivideMix-, DivideMix, and DivideMix+.

Network Structure and Optimizer. Following previous works [2, 26, 45, 54], we use a 2-layer MLP for MNIST, a ResNet-18 [14] for FASHION-MNIST, the well-known “13-CNN” architecture [42] for CIFAR-10 and CIFAR-100, and an 18-layer PreAct Resnet [15] for Clothing1M. To ensure a fair comparison between the instantiations of our framework and other methods, we keep the training settings for MNIST, CIFAR-10, CIFAR-100, and Clothing1M as close as possible to DivideMix [26] and FASHION-MNIST close to GCE [54].

For FASHION-MNIST, the network is trained using stochastic gradient descent (SGD) with 0.9 momentum and a weight decay of $1\times 10^{-4}$ for 120 epochs. For MNIST, CIFAR-10, and CIFAR-100, all networks are trained using SGD with 0.9 momentum and a weight decay of $5\times 10^{-4}$ for 300 epochs. For Clothing1M, the momentum is 0.9, and the weight decay is 0.001.

Baselines. We compare DivideMix+ and GPL with the following state-of-the-art algorithms and implement all methods by PyTorch on NVIDIA Tesla V100 GPUs.

(i)

Coteaching [12], which trains two networks and cross-updates the parameters of peer networks.
(ii)

GCE [54], which uses a theoretically grounded and easy-to-use loss function, the $\mathcal{L}_{q}$ loss, for NLL.
(iii)

F-correction [33], which corrects the prediction by the label transition matrix. As suggested by the authors, we first train a standard network using the cross-entropy loss to estimate the transition matrix.
(iv)

M-correction [1], which models clean and noisy samples by fitting a two-component BMM and applies MixUp data augmentation [53].
(v)

DivideMix [26], which divides clean and noisy samples by using a GMM on an epoch level and leverages MixMatch [4] as the SSL backbone.

5.2 Performance Comparison

The results of all the methods under symmetric and asymmetric noise types on MNIST, FASHION-MNIST, CIFAR-10, and CIFAR-100 are shown in Table 1 and Table 2. The results on Clothing1M are shown in Table 5.

Results on MNIST. DivideMix+ surpasses DivideMix across symmetric and asymmetric noise at all noise ratios, showing the effectiveness of the mini-batch SS strategy in our framework. M-correction performs well under low noise ratios. However, in the hardest symmetric 80% case, DivideMix+ achieves best test accuracy.

Results on FASHION-MNIST. FASHION-MNIST is quite similar to MNIST but more complicated. DivideMix+ still outperforms DivideMix on symmetric and asymmetric noise at all noise ratios. In the harder asymmetric 40% noise, DivideMix+ and GPL outperform the other methods by a large margin.

Results on CIFAR-10. DivideMix+ constantly outperforms DivideMix, especially in the cases with higher noise ratios. We believe the reason is that the mini-batch SS strategy used in our framework can better mitigate the confirmation bias induced from wrongly divided samples in more challenging scenarios. Overall, GPL and DivideMix+ surpass the other methods over a large margin, with the latter performing extremely well on asymmetric noise.

Results on CIFAR-100. In most cases, DivideMix+ and DivideMix achieve higher test accuracy than the other approaches, with DivideMix+ performing better. Specifically, DivideMix+ surpasses DivideMix by 14.82% in the hardest symmetric 80% case. An interesting phenomenon is that all the approaches suffer from performance deterioration in the asymmetric 40% cases except GPL, which significantly outperforms the second-best algorithm over +14%.

Dataset	CIFAR-10			CIFAR-100
Method/Noise ratio	20%	50%	80%	20%	50%	80%
SPD-Cross-Entropy	$83.13\pm 0.16$	$79.74\pm 0.10$	$49.14\pm 0.15$	$45.07\pm 0.55$	$35.02\pm 0.57$	$10.22\pm 0.10$
SPD-Temporal Ensembling	$83.15\pm 0.06$	$80.16\pm 0.36$	$49.10\pm 0.13$	$46.16\pm 0.12$	$39.91\pm 0.60$	$12.37\pm 0.67$
SPD-MixMatch	$93.53\pm 0.52$	$90.22\pm 0.18$	$88.77\pm 0.20$	$72.89\pm 0.30$	$68.57\pm 0.20$	$33.92\pm 0.20$
SPD-Pseudo-Labeling	$\boldsymbol{94.52\pm 0.06}$	$\boldsymbol{93.24\pm 0.36}$	$\boldsymbol{90.27\pm 0.34}$	$\boldsymbol{73.84\pm 0.48}$	$\boldsymbol{68.61\pm 0.40}$	$\boldsymbol{55.37\pm 0.34}$

Table 4: Test accuracy (%) of the baseline and three SSL backbones integrated into our proposed framework.

Results on Clothing1M. To show the robustness of our framework under real-world noisy labels, we demonstrate the effectiveness of DivideMix+ and GPL on Clothing1M. As shown in Table 5, the performance of DivideMix+ is better than that of DivideMix and other methods.

5.3 The effects of SS strategies

To study how SS strategies can affect the performance of our framework, we propose DivideMix- by replacing the GMM component in DivideMix with our self-prediction divider yet maintaining the epoch-level SS strategy for a fair comparison. Due to constraints of space, we only provide the mean value of the results in Table 3, which can show the overall tendency. Results with mean and standard deviation can be found in supplementary materials. In CIFAR-10, the difference between DivideMix- and DivideMix is not obvious in the lower noise ratios. However, in the most difficult symmetric 80% case, the test accuracy of DivideMix is +12.69% higher than DivideMix-. The difference is even greater in CIFAR-100, showing that GMM is better able to distinguish clean and noisy labels in most cases. An impressive phenomenon to note is that DivideMix- excels in the asymmetric 40% case in CIFAR-100, which means the self-prediction divider performs better in nosier asymmetric cases than GMM. The reason is explained in the original paper of DivideMix [26], that GMM cannot effectively distinguish clean and noisy samples under asymmetric noise with high noise ratio in datasets with a large number of classes. At the same time, the fact that DivideMix+ constantly outperforms DivideMix in most cases shows that the mini-batch SS strategy in our framework is better than the epoch-level one in DivideMix.

5.4 The effects of SSL backbones

We evaluate the effects of SSL backbones in our framework by combining the self-prediction divider (SPD) with three different SSL methods and a baseline which only updates the model using the cross-entropy loss calculated from clean samples. We denote them as SPD-Temporal Ensembling, SPD-MixMatch, SPD-Pseudo-Labeling, and SPD-Cross-Entropy, respectively. For a fair comparison, we use the “13-CNN” architecture [42] for all methods across different datasets. We keep most hyperparameters introduced by the SSL methods close to their original papers [2, 4, 22], since they can be easily integrated into our framework without massive adjustments.

In Table 4, we list these four algorithms in the left column from weak to strong according to their performance in their original papers. The test accuracies demonstrate their corresponding performance for NLL based on our framework. SPD-MixMatch and SPD-Pseudo-Labeling outperform SPD-Temporal Ensembling by a large domain in both CIFAR-10 and CIFAR-100, especially under 80% noise ratio (over 40% in CIFAR-10). This phenomenon is reasonable because Temporal Ensembling [22] only uses consistency regularization for unsupervised loss, while MixMatch [4] and Pseudo-Labeling [2] also leverage entropy regularization as well as MixUp data augmentation [53]. Moreover, SPD-Pseudo-Labeling achieves remarkable test accuracy under 80% noise ratio in CIFAR-100, which is +21.44% higher than SPD-MixMatch and +42.66% higher than SPD-Temporal Ensembling. We assume that this is due to the additional loss used in SPD-Pseudo-Labeling that prevents the model from assigning all labels to a single class at the early training stage.

From the results of SPD-Cross-Entropy, we can see that after the removal of the SSL backbone, the test accuracy drops dramatically compared to SPD-MixMatch and SPD-Pseudo-Labeling, especially in high noise ratios and datasets with more classes (e.g., CIFAR-100). This is possibly due to the substantial amount of data that has been removed by the self-prediction divider, leaving very few samples per class. Thus, instead of discarding noisy samples, transferring them to unlabeled ones in SSL backbones is an effective way to combat noisy labels.

Methods	Test Accuracy
Cross-Entropy	69.21
F-correction [33]	69.84
M-correction [1]	71.00
Joint-Optim [41]	72.16
Dividemix [26]	73.91
GPL(ours)	73.19
Dividemix+(ours)	$\boldsymbol{74.14}$

Table 5: Test accuracy (%) on Clothing1M.

6 Conclusion

This paper proposes a versatile framework called SemiNLL for NLL. This framework consists of two main parts: the mini-batch SS strategy and the SSL backbone. We conduct extensive experiments on benchmark-simulated and real-world datasets to demonstrate that SemiNLL can absorb a variety of SS strategies and SSL backbones, leveraging their power to achieve state-of-the-art performance in different noise scenarios. Moreover, we throughly analyze the effects of the two components in our framework.

References

Arazo et al. [2019] Eric Arazo, Diego Ortego, Paul Albert, Noel O’Connor, and Kevin McGuinness. Unsupervised label noise modeling and loss correction. In ICML, 2019.
Arazo et al. [2020] Eric Arazo, Diego Ortego, Paul Albert, Noel O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In IJCNN, pages 1–8, 2020.
Arpit et al. [2017] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, and Yoshua Bengio. A closer look at memorization in deep networks. In ICML, 2017.
Berthelot et al. [2019] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. Mixmatch: A holistic approach to semi-supervised learning. In NeurIPS, pages 5049–5059, 2019.
Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, and Fei-Fei Li. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
Ding et al. [2018] Yifan Ding, Liqiang Wang, Deliang Fan, and Boqing Gong. A semi-supervised two-stage approach to learning from noisy labels. In WACV, pages 1215–1224, 2018.
Feng et al. [2020] Lei Feng, Senlin Shu, Zhuoyi Lin, Fengmao Lv, Li Li, and Bo An. Can cross entropy loss be robust to label noise? In IJCAI, 2020.
Finn et al. [2017] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, pages 1126–1135, 2017.
Garcia et al. [2016] Luis Garcia, Andre de Carvalho, and Ana Lorena. Noise detection in the meta-learning level. Neurocomputing, 176:14–25, 2016.
Ghosh et al. [2017] Aritra Ghosh, Himanshu Kumar, and PS Sastry. Robust loss functions under label noise for deep neural networks. In AAAI, 2017.
Goldberger and Ben-Reuven [2016] Jacob Goldberger and Ehud Ben-Reuven. Training deep neural-networks using a noise adaptation layer. In ICLR, 2016.
Han et al. [2018] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pages 8527–8537, 2018.
Han et al. [2019] Jiangfan Han, Ping Luo, and Xiaogang Wang. Deep self-learning from noisy labels. In ICCV, pages 5138–5147, 2019.
He et al. [2016a] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016a.
He et al. [2016b] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, pages 630–645, 2016b.
Hendrycks et al. [2018] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. Using trusted data to train deep networks on labels corrupted by severe noise. In NeurIPS, pages 10456–10465, 2018.
Iscen et al. [2019] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Label propagation for deep semi-supervised learning. In CVPR, pages 5070–5079, 2019.
Jiang et al. [2018] Lu Jiang, Zhengyuan Zhou, Thomas Leung, Li-Jia Li, and Fei-Fei Li. Mentornet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pages 2304–2313, 2018.
Krizhevsky and Hinton [2009] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In NeurIPS, pages 1097–1105, 2012.
Kumar et al. [2010] Abhishek Kumar, Avishek Saha, and Hal Daume. Co-regularization based semi-supervised domain adaptation. In NeurIPS, pages 478–486, 2010.
Laine and Aila [2016] Samuli Laine and Timo Aila. Temporal ensembling for semi-supervised learning. In ICLR, 2016.
Lecun et al. [1998] Yann Lecun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 1998.
Lee [2013] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on ICML, 2013.
Li et al. [2019] Junnan Li, Yongkang Wong, Qi Zhao, and Mohan Kankanhalli. Learning to learn from noisy labeled data. In CVPR, pages 5051–5059, 2019.
Li et al. [2020] Junnan Li, Richard Socher, and Steven Hoi. Dividemix: Learning with noisy labels as semi-supervised learning. In ICLR, 2020.
Long et al. [2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
Ma et al. [2020] Xingjun Ma, Hanxun Huang, Yisen Wang, Simone Romano, Sarah Erfani, and James Bailey. Normalized loss functions for deep learning with noisy labels. In ICML, 2020.
Malach and Shalev-Shwartz [2017] Eran Malach and Shai Shalev-Shwartz. Decoupling" when to update" from" how to update". In NeurIPS, pages 960–970, 2017.
Menon et al. [2015] Aditya Menon, Brendan Van Rooyen, Cheng Soon Ong, and Bob Williamson. Learning from corrupted binary labels via class-probability estimation. In ICML, pages 125–134, 2015.
Natarajan et al. [2013] Nagarajan Natarajan, Inderjit Dhillon, Pradeep Ravikumar, and Ambuj Tewari. Learning with noisy labels. In NeurIPS, pages 1196–1204, 2013.
Nguyen et al. [2020] Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi Phuong Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, and Thomas Brox. Self: Learning to filter noisy labels with self-ensembling. In ICLR, 2020.
Patrini et al. [2017] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, pages 1944–1952, 2017.
Permuter et al. [2006] Haim Permuter, Joseph Francos, and Ian Jermyn. A study of gaussian mixture models of color and texture features for image classification and segmentation. Pattern Recognition, 39(4):695–706, 2006.
Ren et al. [2018] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. Learning to reweight examples for robust deep learning. In ICML, 2018.
Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, pages 91–99, 2015.
Schroff et al. [2010] Florian Schroff, Antonio Criminisi, and Andrew Zisserman. Harvesting image databases from the web. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):754–766, 2010.
Shu et al. [2019] Jun Shu, Qi Xie, Lixuan Yi, Qian Zhao, Sanping Zhou, Zongben Xu, and Deyu Meng. Meta-weight-net: Learning an explicit mapping for sample weighting. In NeurIPS, pages 1919–1930, 2019.
Snell et al. [2017] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In NeurIPS, pages 4077–4087, 2017.
Song et al. [2019] Hwanjun Song, Minseok Kim, and Jae-Gil Lee. Selfie: Refurbishing unclean samples for robust deep learning. In ICML, pages 5907–5915, 2019.
Tanaka et al. [2018] Daiki Tanaka, Daiki Ikami, Toshihiko Yamasaki, and Kiyoharu Aizawa. Joint optimization framework for learning with noisy labels. In CVPR, pages 5552–5560, 2018.
Tarvainen and Valpola [2017] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In NeurIPS, pages 1195–1204, 2017.
Wang et al. [2019] Yisen Wang, Xingjun Ma, Zaiyi Chen, Yuan Luo, Jinfeng Yi, and James Bailey. Symmetric cross entropy for robust learning with noisy labels. In ICCV, pages 322–330, 2019.
Wang et al. [2020] Zhen Wang, Guosheng Hu, and Qinghua Hu. Training noise-robust deep neural networks via meta-learning. In CVPR, pages 4524–4533, 2020.
Wei et al. [2020] Hongxin Wei, Lei Feng, Xiangyu Chen, and Bo An. Combating noisy labels by agreement: A joint training method with co-regularization. In CVPR, pages 13726–13735, 2020.
Xiao et al. [2017] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
Xiao et al. [2015] Tong Xiao, Tian Xia, Yi Yang, Chang Huang, and Xiaogang Wang. Learning from massive noisy labeled data for image classification. In CVPR, pages 2691–2699, 2015.
Xu et al. [2019] Yilun Xu, Peng Cao, Yuqing Kong, and Yizhou Wang. L_dmi: An information-theoretic noise-robust loss function. In NeurIPS, 2019.
Yi and Wu [2019] Kun Yi and Jianxin Wu. Probabilistic end-to-end noise correction for learning with noisy labels. In CVPR, pages 7017–7025, 2019.
Yu et al. [2019] Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor Tsang, and Masashi Sugiyama. How does disagreement help generalization against label corruption? In ICML, 2019.
Yu et al. [2018] Xiyu Yu, Tongliang Liu, Mingming Gong, and Dacheng Tao. Learning with biased complementary labels. In ECCV, pages 68–83, 2018.
Zhang et al. [2016] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2016.
Zhang et al. [2018] Hongyi Zhang, Moustapha Cisse, Yann Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization. In ICLR, 2018.
Zhang and Sabuncu [2018] Zhilu Zhang and Mert Sabuncu. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, pages 8778–8788, 2018.
Zhang et al. [2020] Zizhao Zhang, Han Zhang, Sercan Arik, Honglak Lee, and Tomas Pfister. Distilling effective supervision from severe label noise. In CVPR, pages 9294–9303, 2020.