
Group Benefits Instances Selection for Data Purification

Zhenhuang Cai Nanjing University of Science and Technology, Nanjing, 210014, China    Chuanyi Zhang Hohai University, Nanjing, 210024, China    Dan Huang Beijing Institute of Technology, Beijing, 100081, China    Yuanbo Chen Beijing Research Institute of Mechanical and Electrical Technology, Beijing, 100083, China    Xiuyun Guan China Ordnance Industrial Standardization Research Institute, Beijing, 100089, China    Yazhou Yao Nanjing University of Science and Technology, Nanjing, 210014, China
(2023)
Abstract

Manually annotating datasets for training deep models is labor-intensive and time-consuming. To overcome this limitation, directly leveraging web images to construct training data becomes a natural choice. Nevertheless, the presence of label noise in web data usually degrades model performance. Existing methods for combating label noise are typically designed and tested on synthetic noisy datasets, and they tend to fail to achieve satisfying results on real-world noisy datasets. To this end, we propose a method named GRIP to alleviate the noisy label problem on both synthetic and real-world datasets. Specifically, GRIP utilizes a group regularization strategy that estimates class soft labels to improve noise-robustness. Soft label supervision reduces overfitting on noisy labels and learns inter-class similarities to benefit classification. Furthermore, an instance purification operation globally identifies noisy labels by measuring the difference between each training sample and its class soft label. Through operations at both the group and instance levels, our approach integrates the advantages of noise-robust and noise-cleaning methods and remarkably alleviates the performance degradation caused by noisy labels. Comprehensive experimental results on synthetic and real-world datasets demonstrate the superiority of GRIP over existing state-of-the-art methods. The data and source code of this work are available at: https://github.com/NUST-Machine-Intelligence-Laboratory/GRIP.

Figure 1: Our approach (b) boosts the typical noise identification (a) through a group regularization strategy. Specifically, it utilizes the similarity between the predicted probability distribution of each sample and its class soft label to identify noisy labels. The predicted probability distributions of clean samples tend to be closer to class soft labels than those of noisy ones.

1 Introduction


Recently, deep neural networks (DNNs) have achieved satisfying performance in various image recognition challenges1, 2, 3, 4, 5, 6, 7, 8, e.g. ImageNet9 and COCO10. These impressive results typically rely on the availability of large-scale and well-labeled datasets. Unfortunately, high-quality and reliable annotations can be laborious and expensive11, 12, 13, and are not always available in domains such as fine-grained visual recognition due to the requirement of expert knowledge14. To address the expensive-annotation problem, a promising solution is to straightforwardly utilize data from the web or multimedia to construct large-scale datasets15, 16, 17, 18, 19. For example, WebFG-49620 leverages free web images as training data with keywords as labels, while YFCC100M21 and Youtube-8m22 contain millions of media objects. Nevertheless, annotations from the web or multimedia only provide unreliable supervision and inevitably contain label noise. According to the memorization effect23, 24, DNNs can perfectly fit a training set with noisy labels and consequently degrade in generalization.

To tackle this issue, researchers have proposed a number of methods for combating noisy labels25, 26, 27, 28, 29, 30, 31, 32, 33. An active research direction is to investigate noise-robust methods, which aim to reduce the contributions of false-labeled samples in model optimization, e.g. robust loss functions34, 35, 36, early-learning37, label smoothing (LS)38, and online label smoothing (OLS) regularization39. This type of method does not involve specific designs for noisy labels and is therefore flexible in practical applications. However, since noisy labels are not explicitly coped with, noise-robust approaches still inevitably suffer from the performance drop caused by label noise.

Another intuitive research direction is to perform noise-cleaning, which aims to correct or discard mislabeled samples to purify datasets. For example, label correction methods aim to revise false labels through noise transition matrix estimation40 or label re-assignment41. Sample selection works42, 43, 44, 45, 46, 47 typically select instances using manually defined criteria, e.g. the small-loss principle48, which regards images with small losses as clean data. Recently, some hybrid approaches49, 50 combine label correction and sample selection for more efficient noise-cleaning. However, these approaches tend to be designed and tested on synthetic noisy datasets such as CIFAR51 and typically do not take the real-world scenario into consideration. Consequently, they tend to be less practical on real-world noisy datasets.

To this end, we propose a simple yet effective approach termed GRIP (Group Regularization and Instance Purification) to boost instance purification via group regularization. Specifically, the proposed group regularization strategy estimates the soft label of each class to provide additional supervision. It guides the network to learn inter-class similarities and improves robustness by preventing overfitting to noisy labels. Resorting to the estimated class soft labels, an instance purification strategy is applied to clean noisy labels from the entire dataset in a global manner. It measures the similarity between the predicted probability distribution of each sample and its estimated class soft label to identify noisy and revisable labels. Subsequently, noisy samples are discarded, while revisable instances are re-labeled with model predictions and then utilized for training. How group regularization benefits instance purification is visualized in Fig. 1. Owing to the regularization strategy, clean samples and the corresponding class centers are encouraged to be closer, which makes noise identification easier.

Owing to operations from both group and instance aspects, GRIP integrates the advantages of noise-robust and noise-cleaning methods for tackling noisy labels. To sum up, our contributions are as follows:

1. We propose a group regularization strategy to estimate class soft labels that benefit instance purification. It also improves robustness against noisy labels from the category aspect and greatly boosts model generalization.

2. We propose an instance purification strategy that resorts to the estimated class soft labels. It globally identifies noisy and revisable labels from the entire dataset. Experimental results demonstrate that it surpasses the widely-used small-loss principle in noise identification on real-world noisy datasets.

3. Our approach integrates the advantages of noise-robust and noise-cleaning methods. Comprehensive experimental results demonstrate that GRIP significantly outperforms state-of-the-art methods on both synthetic and real-world noisy datasets.

Figure 2: The framework of our proposed approach, with Web-bird20 as an example. In each epoch $t$, the network produces a probability $p(x_i)$ for each image $x_i$. Then $p(x_i)$ updates the soft label of its class, and EMA is utilized to smooth the update. The estimated class soft labels $S$ are leveraged in noise identification and provide supervision through $\mathcal{L}_{Soft}$. In noise identification, we compute the JS divergence $d_i$ between the probability $p(x_i)$ and the soft label $S^{t-1}_{y_i}$ to select clean samples. For noisy ones, we compute the JS divergence $\hat{d}_i$ between the probability $p(x_i)$ and the soft label of its prediction $S^{t-1}_{\hat{y}_i}$ to divide revisable and discarded instances. The prediction $\hat{y}_i$ is assigned as the pseudo label for each revisable sample. Finally, clean and revisable images are trained using $\mathcal{L}_{GR}$, and $\mathcal{L}_{ME}$ is applied on discarded ones as regularization.

2 Related Works

2.1 Noise-Robust

Noise-robust learning approaches directly train models using noisy labeled data. They aim to become insensitive to the presence of noisy labels52. One typical branch is to develop robust loss functions to overcome the problem that cross-entropy loss is sensitive to samples with corrupted labels53. For example, Wang et al. proposed to leverage a noise-robust reverse cross-entropy34 to symmetrically boost cross-entropy loss. Ma et al. 35 proposed a framework to build new loss functions by combining active and passive robust loss functions. Zhang et al. 36 proposed a generalization of mean absolute error and cross-entropy loss. However, since these robust loss functions typically aim to deal with noisy labels through under-learning, they inevitably underfit clean samples. Another branch is to employ regularization to improve robustness. For example, LS38 built soft labels by combining a one-hot and uniform distribution to provide regularization. OLS39 further improved LS by replacing the uniform distribution with a more reasonable probability on non-target categories. However, these approaches do not explicitly tackle noisy labels, leading to a suboptimal performance.

2.2 Noise-Cleaning

Noise-cleaning methods aim to tackle noisy labels by discarding or relabeling them. An intuitive type of research is label correction that corrects noisy labels. For example, several works40 tried to correct noisy labels by estimating the noise transition matrix. However, it is difficult to estimate the accurate noise transition matrix. Furthermore, these label correction methods are unable to deal with out-of-distribution (OOD) instances28 whose true labels do not belong to the training set. This drawback restricts their application in the real-world scenario. Another typical idea of combating noisy labels is to perform sample selection that identifies and removes corrupted data through proper criteria46, 47, 54. For example, Co-teaching43 utilized the small-loss principle and let peer networks select noisy samples for each other. JoCoR44 claimed that different models tend to disagree on false labeled samples and leveraged this principle for noise identification. Nevertheless, the above approaches typically performed sample selection within each mini-batch with a fixed drop rate. Jo-SRC50 and Zhang et al. 55 claimed that noise ratios in different mini-batches inevitably fluctuate in the training process. To overcome this problem, they replaced the widely-used mini-batch noise identifying strategy with a global one for the purpose of stabilizing the selection results. A growing number of methods such as SELFIE49 and Co-learning56 combined label correction and sample selection to further boost the performance. Some state-of-the-art noise-cleaning approaches also adopted noise-robust techniques. For example, Jo-SRC50 and DivideMix41 leveraged LS trick and mix-up57, respectively.

3 The Proposed Method

3.1 Preliminary

Generally, we train a DNN on a noisy dataset $\mathcal{D}=\{(x_{i},y_{i})\,|\,1\leq i\leq N\}$ with $C$ classes, where $x_i$ and $y_i$ denote the $i$-th training sample and the corresponding label, respectively. We define the label distribution $q$ of the one-hot label $y_i$ as $q(c=y_{i}|x_{i})=1$ and $q(c\neq y_{i}|x_{i})=0$. The DNN model takes each sample $x_i$ as input and predicts a probability $p(c|x_{i})$ for each class $c$ using the softmax function. The cross-entropy training loss between the predicted probability distributions $p$ of training images and their corresponding labels $q$ is written as

$$\mathcal{L}_{CE}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}q(c|x_{i})\log{p(c|x_{i})}. \qquad (1)$$

Since the commonly-used cross-entropy loss has been shown to be sensitive to noisy labels53 and DNNs can perfectly memorize noisy samples23, 24, deep models tend to suffer from label noise when trained on the noisy dataset $\mathcal{D}$ with $\mathcal{L}_{CE}$.
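As a point of reference, the following minimal sketch (PyTorch is assumed here; it is not the authors' released code) shows the standard cross-entropy objective of Eq. (1) that GRIP builds on. With one-hot labels $q$, the inner sum reduces to $-\log p(y_i|x_i)$, which is what `F.cross_entropy` computes from raw logits.

```python
import torch
import torch.nn.functional as F

# A batch of raw model outputs (logits) for N = 4 samples and C = 10 classes,
# together with their (possibly noisy) hard labels y_i.
logits = torch.randn(4, 10)
labels = torch.tensor([3, 1, 7, 0])

# Eq. (1): L_CE = -1/N * sum_i log p(y_i | x_i); softmax is applied internally.
loss_ce = F.cross_entropy(logits, labels)
```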

Figure 3: Label distributions of $\mathcal{L}_{Soft}$ (a) and $\mathcal{L}_{Soft}+\mathcal{L}_{ME}$ (b) on Web-bird. We scale the $y$-axis using the log function for visualization. Soft labels are generated during the training process of a ResNet-18 model.

3.2 Group Regularization

Our learning framework is illustrated in Fig. 2. Motivated by OLS39, we adopt a noise-robust strategy through regularizing the training process at the category level. We define $S=\{S^{0},S^{1},\cdots,S^{t},\cdots,S^{T-1}\}$ as the collection of class soft labels for $T$ training epochs. For each epoch $t$, $S^{t}$ is a $C\times C$ matrix, whose columns correspond to soft labels and are initialized to zero. Given an input image $x_i$, if the classification is in accord with its label $y_i$, the soft label $S_{y_i}^{t}$ of the target class $y_i$ is updated using the predicted probability $p(x_i)$ through

$$S_{y_{i},c}^{t}=\frac{1}{M}\sum_{j=1}^{M}p(c|x_{j}), \qquad (2)$$

where $c\in\{1,2,\cdots,C\}$ indexes the entries of $S_{y_i}^{t}$ and $M$ indicates the number of correctly predicted samples with label $y_i$. According to Eq. (2), each soft label is the average predicted probability of correctly classified samples belonging to one class.

To stabilize the estimated class soft labels $S^{t}$, we further apply an exponential moving average (EMA) strategy through

$$S_{y_{i},c}^{t}=m\,S_{y_{i},c}^{t-1}+\frac{1-m}{M}\sum_{j=1}^{M}p(c|x_{j}), \qquad (3)$$

where $m$ denotes the momentum that controls the weight on the previous result. EMA smoothes out $S^{t}$ by alleviating the issue that probabilities predicted by the model can be unstable during training. After obtaining the soft labels, we utilize $S^{t-1}$ to supervise the training process in each epoch $t$. The soft training loss can then be formulated as

$$\mathcal{L}_{Soft}=-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}S_{y_{i},c}^{t-1}\log{p(c|x_{i})}. \qquad (4)$$

Similar to LS, $\mathcal{L}_{Soft}$ assigns weights to non-target categories. Consequently, it reduces overfitting and improves robustness against noisy labels. Moreover, it encourages DNNs to learn inter-class similarities and benefits challenging image recognition tasks, e.g. fine-grained classification39.
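A minimal sketch of how the class soft labels of Eqs. (2)-(4) could be maintained, assuming a PyTorch setup. The helper names `update_soft_labels`, `soft_loss`, and `S_prev` are illustrative, not from the released code, and the per-batch update here simplifies the per-epoch average described above.

```python
import torch
import torch.nn.functional as F

def update_soft_labels(S_prev, logits, labels, m=0.5):
    """Eqs. (2)-(3): average the predicted distributions of correctly classified
    samples of each class, then smooth the estimate with EMA momentum m."""
    probs = F.softmax(logits, dim=1)                  # p(x_i)
    correct = probs.argmax(dim=1).eq(labels)          # prediction agrees with label y_i
    S_new = S_prev.clone()
    for c in range(S_prev.size(0)):
        mask = correct & labels.eq(c)
        if mask.any():                                # only update classes seen in this step
            S_new[c] = m * S_prev[c] + (1 - m) * probs[mask].mean(dim=0)
    return S_new

def soft_loss(logits, labels, S_prev):
    """Eq. (4): cross-entropy between predictions and the class soft labels S^{t-1}."""
    log_p = F.log_softmax(logits, dim=1)
    return -(S_prev[labels] * log_p).sum(dim=1).mean()
```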

However, we find that $S_{y_i}^{t}$ tends to approach the one-hot label distribution, where $y_i$ has a large value while other categories only share tiny weights, as shown in Fig. 3 (a). This behavior may derive from the strong fitting ability of the cross-entropy loss. To address this issue, we utilize the maximum entropy principle58 as a strong regularization. It forces the model prediction to be less confident. The maximum entropy loss can be formulated as

$$\mathcal{L}_{ME}=-\frac{1}{N}\sum_{i=1}^{N}\mathrm{H}(p(x_{i}))=\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}p(c|x_{i})\log{p(c|x_{i})}. \qquad (5)$$

Since $\mathcal{L}_{ME}$ aims to increase the entropy of the prediction $p(x_i)$, it leads to more reasonable soft labels, as illustrated in Fig. 3 (b). More importantly, since $\mathcal{L}_{ME}$ makes predictions less confident, it further reduces the risk of overfitting noisy labels and guides DNNs to become more noise-robust.

Finally, both the hard and soft labels are leveraged as supervision with the maximum-entropy principle applied as regularization. The total training loss of our group regularization strategy can be represented by

$$\mathcal{L}_{GR}=(1-w)\mathcal{L}_{CE}+w\mathcal{L}_{Soft}+\gamma\mathcal{L}_{ME}, \qquad (6)$$

where $w$ and $\gamma$ are weights that balance the contributions of $\mathcal{L}_{CE}$, $\mathcal{L}_{Soft}$, and $\mathcal{L}_{ME}$. Owing to $\mathcal{L}_{Soft}$ and $\mathcal{L}_{ME}$, our group regularization strategy guides DNNs to become less sensitive to noisy labels and boosts their robustness. Furthermore, the class soft labels generated by the group regularization strategy can be leveraged for instance purification.
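Continuing the sketch above (same caveats: PyTorch assumed, illustrative names), the maximum entropy term of Eq. (5) and the combined group regularization loss of Eq. (6) could look as follows; the default $w=\gamma=0.5$ mirrors the real-world setting reported later.

```python
def entropy_loss(logits):
    """Eq. (5): L_ME = -1/N * sum_i H(p(x_i)); minimizing it lowers prediction confidence."""
    p = F.softmax(logits, dim=1)
    log_p = F.log_softmax(logits, dim=1)
    return (p * log_p).sum(dim=1).mean()

def gr_loss(logits, labels, S_prev, w=0.5, gamma=0.5):
    """Eq. (6): weighted sum of hard-label CE, soft-label CE, and maximum entropy."""
    return ((1 - w) * F.cross_entropy(logits, labels)
            + w * soft_loss(logits, labels, S_prev)
            + gamma * entropy_loss(logits))
```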

3.3 Instance Purification

After obtaining the class soft labels, we can perform data purification from the instance aspect. According to the memorization effect23, 24, DNNs learn clean and easy patterns in the initial epochs. Accordingly, clean instances tend to contribute more to the estimation of class soft labels at the early stage. Moreover, our group regularization strategy reduces the contribution of noisy labels in model optimization by significantly improving noise-robustness. Therefore, the generated class soft labels should be closer to the predictions of clean samples than to those of noisy ones. Based on this phenomenon, we can build a noise identification criterion.

Inspired by Jo-SRC50, we adopt the Jensen-Shannon (JS) divergence between each probability $p(x_i)$ and its class soft label $S_{y_i}^{t-1}$ as our noise identification criterion:

$$d_{i}=\mathrm{D}_{JS}\big(p(x_{i})\,\|\,S_{y_{i}}^{t-1}\big)=\frac{1}{2}\mathrm{D}_{KL}\Big(p_{i}\,\Big\|\,\frac{p_{i}+S_{y_{i}}^{t-1}}{2}\Big)+\frac{1}{2}\mathrm{D}_{KL}\Big(S_{y_{i}}^{t-1}\,\Big\|\,\frac{p_{i}+S_{y_{i}}^{t-1}}{2}\Big), \qquad (7)$$

where $\mathrm{D}_{KL}$ is the Kullback-Leibler (KL) divergence and $p_i$ denotes the simplified form of $p(x_i)$. Here, $d_i$ measures the difference between the predicted probability $p_i$ and the corresponding class soft label $S_{y_i}^{t-1}$, where a larger value indicates a more significant difference. Moreover, $d_i$ is a symmetric measure and is bounded in $[0,1]$ when using a base-2 logarithm59.
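A sketch of the JS-divergence criterion of Eq. (7), again assuming PyTorch and reusing the illustrative `logits`, `labels`, and `S_prev` from the earlier sketches; base-2 logarithms keep $d_i$ in $[0,1]$, and the small `eps` guarding against log of zero is our own addition.

```python
def js_divergence(p, q, eps=1e-12):
    """Eq. (7): symmetric JS divergence between rows of p and q, with base-2 logs."""
    m = 0.5 * (p + q)
    kl_pm = (p * ((p + eps).log2() - (m + eps).log2())).sum(dim=1)
    kl_qm = (q * ((q + eps).log2() - (m + eps).log2())).sum(dim=1)
    return 0.5 * kl_pm + 0.5 * kl_qm

probs = F.softmax(logits, dim=1)
d = js_divergence(probs, S_prev[labels])   # d_i for every sample in the batch
```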

Figure 4: The distribution of $d$ in epoch 5 (a) and epoch 75 (b) during the training process of a ResNet-18 model on Web-bird. The red bar indicates the threshold $thr$. As the training proceeds, $d_i$ decreases and $thr$ automatically adapts to the change of $d_i$. The distribution becomes more discrete because noisy samples are discarded.

Since clean samples tend to be closer to their class soft labels, they should have smaller values of $d_i$ than noisy instances. Accordingly, we can utilize a threshold to separate clean instances from noisy ones. We define the threshold $thr$ for each epoch $t$ as

$$thr=\mathbf{mean}(d)+\alpha\cdot\mathbf{std}(d), \qquad (8)$$

where $d=\{d_{1},d_{2},\cdots,d_{i},\cdots,d_{N}\}$ indicates the collection of $d_i$ over the entire dataset, $\mathbf{mean}(\cdot)$ and $\mathbf{std}(\cdot)$ denote the average value and standard deviation respectively, and $\alpha$ is a hyperparameter. Specifically, for each training batch $\mathcal{B}$, we divide it into a clean set $\mathcal{B}_{clean}$ and a noisy set $\mathcal{B}_{noisy}$ after a warm-up period $t_m$ by

$$\begin{aligned}\mathcal{B}_{clean}&=\{(x_{i},y_{i})\,|\,d_{i}\leq thr,\;t\geq t_{m}\},\\ \mathcal{B}_{noisy}&=\{(x_{i},y_{i})\,|\,d_{i}>thr,\;t\geq t_{m}\}.\end{aligned} \qquad (9)$$

Once noisy samples are identified, we further perform label re-assignment on $\mathcal{B}_{noisy}$ to leverage the revisable ones. Specifically, we first calculate the JS divergence between the probability $p(x_i)$ of each instance in $\mathcal{B}_{noisy}$ and the soft label $S_{\hat{y}_{i}}^{t-1}$ of its predicted class $\hat{y}_{i}$ through

$$\hat{d}_{i}=\mathrm{D}_{JS}\big(p(x_{i})\,\|\,S_{\hat{y}_{i}}^{t-1}\big),\quad x_{i}\in\mathcal{B}_{noisy}. \qquad (10)$$

Then, since $\hat{d}_{i}$ is bounded in $[0,1]$, we can utilize a hard threshold $\tau$ to select revisable samples in $\mathcal{B}_{noisy}$ through

$$\begin{aligned}\mathcal{B}_{relabel}&=\{(x_{i},\hat{y}_{i})\,|\,x_{i}\in\mathcal{B}_{noisy},\;\hat{d}_{i}<\tau\},\\ \mathcal{B}_{discard}&=\{(x_{i},y_{i})\,|\,x_{i}\in\mathcal{B}_{noisy},\;\hat{d}_{i}\geq\tau\}.\end{aligned} \qquad (11)$$

Eq. (11) indicates that we regard a noisy sample as revisable when its probability $p(x_i)$ is close enough to the soft label of its predicted category $S_{\hat{y}_{i}}^{t-1}$. Its predicted class $\hat{y}_{i}$ is then assigned as the pseudo label.
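A matching sketch of the re-assignment step in Eqs. (10)-(11), reusing `js_divergence`, `probs`, and `S_prev` from the previous sketches; the helper name `relabel_or_discard` is hypothetical, and the default $\tau=0.04$ only mirrors the Web-bird setting reported later.

```python
def relabel_or_discard(probs, S_prev, noisy_mask, tau=0.04):
    """Eqs. (10)-(11): among noisy samples, keep those whose prediction is close
    (JS divergence below the fixed threshold tau) to the soft label of the
    predicted class, and use that prediction as the pseudo label."""
    pred = probs.argmax(dim=1)                        # \hat{y}_i
    d_hat = js_divergence(probs, S_prev[pred])        # Eq. (10)
    relabel_mask = noisy_mask & (d_hat < tau)         # B_relabel, trained with pseudo label pred
    discard_mask = noisy_mask & (d_hat >= tau)        # B_discard, trained only with L_ME
    return pred, relabel_mask, discard_mask
```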

Finally, we jointly utilize clean and relabeled samples for training with Eq. (6). Furthermore, we apply $\mathcal{L}_{ME}$ in Eq. (5) on discarded images to encourage the model to generate even probabilities for them. This design guides the predicted probabilities of noisy samples to become more different from the estimated class soft labels and therefore improves the reliability of noise-cleaning in the following epochs. Details of our proposed GRIP are presented in Algorithm 1.

3.4 Discussion

3.4.1 Comparison with OLS

Our group regularization strategy is motivated by OLS39. Compared with OLS, our approach applies maximum entropy loss as regularization to smooth the predicted probability. It addresses the problem that the estimated soft label tends to approach the one-hot distribution. Furthermore, EMA utilizes previous results to stabilize the process of soft label estimation. Consequently, our group regularization strategy becomes more practical and efficient than OLS.

3.4.2 Dynamic and Fixed Thresholds

Notably, in the initial stages of training, there is a significant disparity between the predicted distributions and the soft labels, yielding generally large values of $d$. Conversely, as training progresses, the values of $d$ become smaller. A fixed threshold would therefore cause the model to select too few clean samples early on, leading to a low recall, and to select too many samples in later stages, including a substantial number of noisy false positives and resulting in low precision. To address these challenges, we formulate the selection threshold from the mean and standard deviation of the distances, which enhances the flexibility and adaptability of our approach. This selection strategy is widely adopted by classical algorithms60, which attests to its effectiveness and generality.

As shown in Fig. 4 (a) and (b), $d_i$ decreases as the DNN gradually fits the dataset during training. Accordingly, we utilize a dynamic threshold $thr$, whose value depends on the distribution of $d$. The mean value, which roughly separates "low" and "high", can be utilized as a rough threshold to identify clean and noisy labels. To further adjust the threshold, we leverage the standard deviation, controlled by the hyperparameter $\alpha$, as an offset to the mean. Compared with a fixed offset, the standard deviation can automatically adapt to the distribution of $d$: it is small if the data distribution is dense and becomes large otherwise. Briefly, the threshold is jointly controlled by the mean and standard deviation. It adapts to the changing $d_i$ and selects samples with lower $d_i$ for training. Similar to $d_i$, $\hat{d}_{i}$ also gradually decreases during training. Contrary to the threshold $thr$, we fix $\tau$ for label re-assignment. This guides the number of relabeled samples to grow from a small value as training proceeds. As the model becomes more robust, more noisy samples can be relabeled for training. This progressive re-assignment strategy controls the number of relabeled samples and stabilizes the training process.

3.4.3 Selection Criterion

Since $thr$ is obtained according to the distribution of $d$ over the entire dataset, the selection operation is performed from a global view. This global selection strategy can alleviate the problem that noise ratios in different mini-batches inevitably fluctuate50, 55. A mini-batch is a sampling of the entire dataset. If the batch size is large enough (e.g. >100), the noise ratio fluctuation problem may not be severe. However, if the batch size has to be small due to a large model or input image size, this problem can harm the noise-cleaning performance. We will illustrate the advantage of the global selection strategy over a mini-batch one in the experiments.

Compared with the popular small-loss principle, our selection criterion takes advantage of soft labels in noise identification. The soft label encodes more information than the training loss and therefore tends to be more reliable for selecting samples.

Input: Network $\theta$, web set $\mathcal{D}$, warm-up epoch $t_m$, momentum $m$, weights $w$ and $\gamma$, hyperparameter $\alpha$, threshold $\tau$, and maximum epochs $T$.
Initialize network $\theta$.
for $t=1,2,\dots,T$ do
    for each mini-batch $\mathcal{B}$ in $\mathcal{D}$ do
        if $t<t_m$ then
            Compute $\mathcal{L}$ using $\mathcal{B}$ by Eq. (6).
        else
            Update $d_i$ according to Eq. (7).
            Update $thr$ according to Eq. (8).
            Obtain $\mathcal{B}_{clean}$ and $\mathcal{B}_{noisy}$ by Eq. (9).
            Obtain $\mathcal{B}_{relabel}$ and $\mathcal{B}_{discard}$ by Eq. (11).
            Compute $\mathcal{L}_{1}$ using $\mathcal{B}_{clean}$ and $\mathcal{B}_{relabel}$ according to Eq. (6).
            Compute $\mathcal{L}_{2}$ using $\mathcal{B}_{discard}$ by Eq. (5).
            Sum $\mathcal{L}=\mathcal{L}_{1}+\mathcal{L}_{2}$.
        end if
        Update $S^{t}$ according to Eq. (3).
        Update $\theta=\mathrm{SGD}(\mathcal{L};\theta)$.
    end for
end for
Output: Updated network $\theta$.
Algorithm 1: Group Benefits Instances Selection for Data Purification

4 Experiments on Synthetic Noisy Datasets

In order to demonstrate the superiority and applicability of our approach, we conduct experiments on both synthetic and real-world datasets. In this experiment, we first evaluate our approach on synthetic noisy datasets for coarse-grained classification to validate its robustness under different types of noisy labels and varying noise ratios.

Table 1: Average Classification Accuracy (ACA %) on CIFAR-10 over the last 10 epochs.

| Noise Ratio $\epsilon$ | Backbone | Decoupling42 | Co-teaching43 | Co-teaching+61 | JoCoR44 | SELC62 | GRIP |
|---|---|---|---|---|---|---|---|
| Symmetry-20% | 69.18 ± 0.52 | 69.32 ± 0.40 | 78.23 ± 0.27 | 78.71 ± 0.34 | 85.73 ± 0.19 | 87.08 ± 0.07 | 88.83 ± 0.09 |
| Symmetry-50% | 42.71 ± 0.42 | 40.22 ± 0.30 | 71.30 ± 0.13 | 57.05 ± 0.54 | 79.41 ± 0.25 | 81.66 ± 0.11 | 84.59 ± 0.07 |
| Symmetry-80% | 28.67 ± 0.47 | 26.76 ± 0.35 | 34.53 ± 0.28 | 29.05 ± 0.21 | 47.74 ± 0.25 | 54.58 ± 0.15 | 59.32 ± 0.07 |
| Asymmetric-40% | 69.43 ± 0.33 | 68.72 ± 0.30 | 73.78 ± 0.22 | 68.84 ± 0.20 | 76.36 ± 0.49 | 78.90 ± 0.12 | 80.82 ± 0.28 |
Table 2: Average Classification Accuracy (ACA %) on CIFAR-100 over the last 10 epochs.

| Noise Ratio $\epsilon$ | Backbone | Decoupling42 | Co-teaching43 | Co-teaching+61 | JoCoR44 | SELC62 | GRIP |
|---|---|---|---|---|---|---|---|
| Symmetry-20% | 35.14 ± 0.44 | 33.10 ± 0.12 | 43.73 ± 0.16 | 49.27 ± 0.03 | 53.01 ± 0.04 | 55.44 ± 0.09 | 61.40 ± 0.13 |
| Symmetry-50% | 16.97 ± 0.40 | 15.25 ± 0.20 | 34.96 ± 0.50 | 40.04 ± 0.70 | 43.49 ± 0.46 | 46.73 ± 0.08 | 51.27 ± 0.15 |
| Symmetry-80% | 9.67 ± 0.34 | 8.69 ± 0.09 | 23.62 ± 0.38 | 27.08 ± 0.51 | 31.92 ± 0.27 | 34.51 ± 0.15 | 39.67 ± 0.15 |
| Asymmetric-40% | 27.29 ± 0.25 | 26.11 ± 0.39 | 28.35 ± 0.25 | 33.62 ± 0.39 | 32.70 ± 0.35 | 45.19 ± 0.12 | 53.48 ± 0.11 |

4.1 Datasets and Evaluation Metric

We follow the common experimental settings of recent works43, 44 and manually corrupt CIFAR-10 and CIFAR-10051 to create synthetic noisy datasets. Specifically, we generate symmetric and asymmetric noisy labels with a noise transition matrix $Q$.

We generate symmetric noisy labels by uniformly flipping a given ratio of training labels to other classes. Asymmetric label noise simulates the fine-grained recognition task with noisy labels, where very similar classes may confuse annotators. In this setting, we only flip a specific set of classes, e.g. flipping CAT to DOG in CIFAR-10. The definition of the transition matrix $Q$ is as follows:

$$Q_{\text{symmetry}}=\begin{pmatrix}1-\epsilon&\frac{\epsilon}{n-1}&\dots&\frac{\epsilon}{n-1}&\frac{\epsilon}{n-1}\\ \frac{\epsilon}{n-1}&1-\epsilon&\dots&\frac{\epsilon}{n-1}&\frac{\epsilon}{n-1}\\ \vdots&&\ddots&&\vdots\\ \frac{\epsilon}{n-1}&\frac{\epsilon}{n-1}&\dots&1-\epsilon&\frac{\epsilon}{n-1}\\ \frac{\epsilon}{n-1}&\frac{\epsilon}{n-1}&\dots&\frac{\epsilon}{n-1}&1-\epsilon\end{pmatrix},$$

$$Q_{\text{asymmetry}}=\begin{pmatrix}1-\epsilon&\epsilon&0&\dots&0\\ 0&1-\epsilon&\epsilon&\dots&0\\ \vdots&&\ddots&\ddots&\vdots\\ 0&0&\dots&1-\epsilon&\epsilon\\ \epsilon&0&\dots&0&1-\epsilon\end{pmatrix},$$

where $n$ and $\epsilon$ denote the number of classes and the noise ratio, respectively. For evaluating classification performance, we take Average Classification Accuracy (ACA) as the evaluation metric.
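For illustration, a small NumPy sketch of how such a transition matrix and the corresponding corrupted labels could be generated; the helpers `build_Q` and `corrupt_labels` are our own hypothetical names, not the benchmark's official generation code.

```python
import numpy as np

def build_Q(n_classes, eps, asymmetric=False):
    """Symmetric noise spreads eps uniformly over the other n-1 classes;
    asymmetric noise flips each class to the next one with probability eps."""
    if asymmetric:
        Q = (1 - eps) * np.eye(n_classes)
        for c in range(n_classes):
            Q[c, (c + 1) % n_classes] = eps
    else:
        Q = np.full((n_classes, n_classes), eps / (n_classes - 1))
        np.fill_diagonal(Q, 1 - eps)
    return Q

def corrupt_labels(labels, Q, seed=0):
    """Draw each noisy label from the row of Q indexed by the clean label."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(Q), p=Q[y]) for y in labels])
```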

4.2 Implementation Details

Following JoCoR44, we adopt a 7-layer CNN architecture and leverage the Adam optimizer with momentum set to 0.9 for training. We train the network for 200 epochs with the batch size set to 128. The learning rate is initialized to 0.001 and starts to decrease linearly to 0 after 80 epochs. As for hyper-parameters, we set the EMA momentum $m$ and the weights $w$, $\gamma$ to 0.5, 0.5, and 1, respectively. The warm-up epoch $t_m$ and hyper-parameter $\tau$ are empirically set to 10 and 0.03, respectively. As for $\alpha$, we adopt a linear decrease from 1 to 0.2 over 5 epochs to smooth the discarding process, similar to the increasing drop rate trick in Co-teaching43. All experiments are conducted on one NVIDIA Tesla V100 GPU.

4.3 Baseline Methods

To evaluate our approach on synthetic datasets, we compare our approach with the following state-of-the-art algorithms: Decoupling42, Co-teaching43, Co-teaching+61, JoCoR44, SELC62 and the backbone network straightforwardly trained on noisy datasets. As the work of JoCoR44 has reproduced most other baselines42, 43, 61, we directly copy results from it for simplicity. For SELC62, we re-execute their code and report the experimental results. Our method is trained using the same experimental settings as the above methods, and therefore the comparisons are fair.

Figure 5: The symmetric (a) and asymmetric (b) noise transition matrices and the corresponding estimated soft labels after the warm-up period on CIFAR-10 ((c) and (d)). The noise ratios $\epsilon$ are set to 0.5 and 0.4 for symmetric and asymmetric noise, respectively.

4.4 Experimental Results and Analysis

4.4.1 Results on CIFAR-10

We report the test accuracy of each method on CIFAR-10 in Table 1. As we can see from Table 1, GRIP consistently outperforms other approaches in all four noisy settings. Specifically, GRIP surpasses the best baseline results by 1.75%, 2.93%, 4.74%, and 1.92% in the Symmetry-20%, Symmetry-50%, Symmetry-80%, and Asymmetry-40% cases, respectively. From these experimental results, we can conclude that our approach is effective on synthetic noisy datasets for simple coarse-grained classification tasks.

4.4.2 Results on CIFAR-100

The test accuracy of each approach on CIFAR-100 is shown in Table 2. By observing Table 2, we can find that GRIP achieves the highest test accuracy among all methods. Specifically, the improvements over the best baseline reach 5.96%, 4.54%, 5.16%, and 8.29% in the Symmetry-20%, Symmetry-50%, Symmetry-80%, and Asymmetry-40% cases, respectively. Our large margin in the Asymmetry-40% case deserves attention: since asymmetric noise is a simulation of real-world label noise41, our advantage in this case indicates that GRIP is promising on real-world noisy datasets.

Moreover, by comparing the results on CIFAR-10 and CIFAR-100, we can find that GRIP shows more significant improvements over baselines as the classification task becomes more challenging (from 10 to 100 classes). This experimental result demonstrates the superiority of our method in more complicated classification tasks with noisy labels.

4.5 Noise Matrix and Soft Labels

In this subsection, we visualize the noise transition matrix $Q$ and the soft labels $S$ estimated by GRIP to illustrate the noise-robustness of our approach. In detail, we investigate the Symmetry-50% and Asymmetry-40% cases and show the soft labels estimated after the warm-up period in Fig. 5.

From Fig. 5, we can observe that in the Symmetry-50% case, all non-target classes of the soft labels share similar weights (Fig. 5 (a) and (c)), and target categories have lower weights in the soft labels than in the noise transition matrix. This phenomenon indicates that our approach becomes less confident on the noisy dataset and shows robustness against label noise.

In the Asymmetry-40% case, flipped noisy classes have much lower weights in the soft labels than in the noise transition matrix (Fig. 5 (b) and (d)), which indicates that our approach is less likely to overfit noisy labels. This strong noise-robustness results from the maximum entropy regularization, which guides the model to become less confident on potential noisy labels.

Table 3: Numbers of classes, training and test images, as well as estimated label accuracies of the three sub-datasets.

| Dataset | Web-bird | Web-aircraft | Web-car |
|---|---|---|---|
| Class Number | 200 | 100 | 196 |
| Training Images | 18388 | 13503 | 21448 |
| Test Images | 5794 | 3333 | 8041 |
| Label Accuracy | 65% | 73% | 67% |
Figure 6: ACA (%) performances of our approach and baselines on Web-bird, Web-aircraft, Web-car, and average performances.
Table 4: ACA (%) performances of baseline methods and ours on Web-aircraft, Web-bird, and Web-car.

| Method | Publication | Web-bird | Web-aircraft | Web-car |
|---|---|---|---|---|
| Backbone | - | 73.30 | 78.71 | 80.86 |
| Decoupling42 | NeurIPS 2017 | 68.81 | 73.21 | 78.71 |
| Co-teaching43 | NeurIPS 2018 | 75.51 | 79.51 | 83.42 |
| Co-teaching+61 | ICML 2019 | 70.12 | 74.80 | 76.77 |
| Sub-center63 | ECCV 2020 | 75.77 | 79.80 | 82.59 |
| Self-Adaptive64 | NeurIPS 2020 | 78.49 | 77.92 | 78.19 |
| JoCoR44 | CVPR 2020 | 79.19 | 80.11 | 85.10 |
| DivideMix41 | ICLR 2020 | 74.40 | 82.48 | 84.27 |
| PLC65 | ICLR 2021 | 76.22 | 79.24 | 81.87 |
| Peer-learning20 | ICCV 2021 | 75.37 | 78.64 | 82.48 |
| OLS39 | TIP 2021 | 79.11 | 81.27 | 86.58 |
| Jo-SRC50 | CVPR 2021 | 81.52 | 82.73 | 88.13 |
| SELC62 | IJCAI 2022 | 77.25 | 80.26 | 80.89 |
| Co-LDL66 | TMM 2022 | 80.11 | 81.97 | 86.95 |
| AGCE67 | TPAMI 2023 | 75.54 | 82.21 | 82.76 |
| CMW-Net-SL68 | TPAMI 2023 | 77.41 | 76.48 | 79.70 |
| GRIP (Ours) | - | 82.53 | 83.29 | 89.49 |

5 Experiments on Real-world Noisy Datasets

In this experiment, we further evaluate our approach on more challenging web noisy datasets to demonstrate its applicability in real-world scenarios.

5.1 Datasets and Evaluation Metric

We evaluate our approach on WebFG-49620, which is designed for research on webly supervised fine-grained classification. It contains three real-world sub-datasets: Web-bird, Web-aircraft, and Web-car. They utilize the fine-grained category labels of the benchmark datasets CUB200-201169, FGVC-aircraft70, and Cars-19671 as target categories to crawl web images from the Bing Image Search Engine. Compared with the benchmark datasets, they have larger numbers of training samples (18388, 13503, and 21448, respectively) but potentially noisy labels. The test sets are taken directly from the three benchmark datasets (CUB200-2011, FGVC-aircraft, and Stanford Cars). Details of the three sub-datasets are summarized in Table 3 and described below.

Web-bird consists of 200 different subcategories of birds. Its web training set contains 18388 images and the test set has 5794 clean samples. The training label accuracy estimated by random sampling is about 65%.

Web-aircraft covers 100 variants of aircraft. The number of web training samples and clean test images are 13503 and 3333, respectively. Its estimated web label accuracy is approximately 73%.

Web-car contains 196 types of vehicles. Its training set consists of 21448 web samples with a roughly estimated label accuracy of 67%. The test set contains 8041 manually labeled images.

We follow the evaluation metric in section 4.1 and utilize ACA to evaluate model performance.

5.2 Implementation Details

We adopt an ImageNet pre-trained ResNet-5072 model as our backbone network and resize input images to 448×448 with random horizontal flipping as weak data augmentation. The network is trained for 100 epochs with the batch size set to 32. The learning rate is initialized to 0.01 and decreases in a cosine annealing manner73. The momentum and weight decay of the stochastic gradient descent (SGD)74 optimizer are set to 0.9 and $10^{-5}$, respectively.

As for hyper-parameters, we set the EMA momentum $m$ and the weights $w$ and $\gamma$ to 0.5. The warm-up epoch $t_m$ is empirically set to 5. We set $\alpha$ to 0.5 on Web-bird and adopt a linearly decreasing $\alpha$ on Web-aircraft and Web-car. Specifically, $\alpha$ linearly decreases from 1 to 0.8 and to 0.3 over 5 epochs on Web-aircraft and Web-car, respectively, which is motivated by the increasing drop rate trick43. The threshold $\tau$ for relabeling is set to 0.04, 0.02, and 0.04 on Web-bird, Web-aircraft, and Web-car, respectively. All our experiments are performed on one NVIDIA Tesla V100 GPU.

5.3 Baseline Methods

On real-world datasets, our baselines contain the following state-of-the-art methods: Decoupling42, Co-teaching43, Sub-center63, Co-teaching+61, Self-Adaptive64, JoCoR44, DivideMix41, Jo-SRC50, OLS39, PLC65, Peer-learning20, SELC62, Co-LDL66, AGCE67, and CMW-Net-SL68. We reproduce most of the above baselines42, 43, 63, 61, 64, 44, 41, 39, 65, 20, 62, 67 using the same backbone network for fair comparisons. In addition, we also train a ResNet-50 network using the cross-entropy loss for comparison (Backbone).

Figure 7: Parameter sensitivities on Web-bird with ResNet-18 as the backbone.

5.4 Experimental Results and Analysis

We report the ACA performances of the baseline approaches and GRIP on WebFG-496 in Table 4. From Table 4, we can observe that GRIP significantly outperforms the baselines on all three datasets. Specifically, it surpasses the best baseline results by 1.01%, 0.56%, and 1.36% on Web-bird, Web-aircraft, and Web-car, respectively. For a clearer illustration, we also present the test accuracy curves of GRIP and compare them with some representative noise-cleaning baselines in Fig. 6. As illustrated in Fig. 6, GRIP shows higher accuracies and faster training speeds than other noise-cleaning approaches42, 44 on all benchmark datasets. This superiority owes to our group regularization strategy, which significantly improves noise-robustness. In addition, compared with the noise-robust approaches63, 39, 62, 67, 68 in Table 4, our approach achieves higher test accuracies by leveraging the instance purification strategy to specifically tackle noisy labels. The significant improvements over baselines demonstrate the effectiveness of simultaneously leveraging noise-robust and noise-cleaning strategies for combating noisy labels.

Notably, GRIP shows superior performance to DivideMix41, which also utilizes noise-cleaning and noise-robust algorithms simultaneously. DivideMix has large gaps to GRIP on Web-bird (8.13%) and Web-car (5.22%). The reason may lie in the Gaussian Mixture Model (GMM)75 used for dividing training samples in DivideMix: it can be difficult to fit a GMM to the real-world noisy data distribution, and if the GMM fails, the performance inevitably declines.

Furthermore, we can observe from Table 4 that some baselines (Decoupling42, Co-teaching+61, Self-Adaptive64, PLC65, CMW-Net-SL68) only show slight improvements or even inferior performances to Backbone. The reason can be that they are designed and tested on coarse-grained synthetic noisy datasets. As a result, they tend to be less practical for fine-grained tasks in a real-world scenario. Compared with them, our approach is more effective in practical application.

6 Ablation Studies

In order to further analyze our approach, we conduct experiments on the real-world dataset Web-bird using ResNet-18 by default.

6.1 Parameter Analysis

In this experiment, we investigate the parameter sensitivities of the weights $w$ and $\gamma$ for the loss functions, the momentum $m$ for EMA, $\alpha$ and $\tau$ for noise-cleaning, and the warm-up epoch $t_m$. Although our approach seems to have many parameters, we show that half of them are robust and easy to adjust. The experimental results are shown in Fig. 7.

We set $\gamma$ to 0.5 and vary $w$ from 0 to 0.9. From Fig. 7 (a), we can observe that the performance first steadily increases to the optimal value as $w$ rises. This indicates that $\mathcal{L}_{Soft}$ boosts robustness with a proper $w$. However, if $w$ further increases, the supervision of $\mathcal{L}_{CE}$ is weakened, which results in a performance decline. To achieve the best performance, we set $w$ to 0.5 on all datasets, which balances $\mathcal{L}_{Soft}$ and $\mathcal{L}_{CE}$. In addition, we can also see that $w$ is robust in $[0.3,0.6]$, which indicates that a relatively balanced $w$ works well.

Similar to the analysis on $w$, we set $w$ to 0.5 and vary $\gamma$ from 0 to 1. We can observe from Fig. 7 (b) that the performance climbs as $\gamma$ rises from 0 to 0.5, then slightly decreases when $\gamma$ further increases. Supported by this result, we simply set $\gamma$ to 0.5 on all datasets. We can also find that $\gamma$ is robust in $[0.4,1]$.

We analyze the effect of $m$ in Fig. 7 (c). It can be observed from Fig. 7 (c) that unless $m$ is too large (over 0.8), applying EMA boosts the performance over the baseline ($m=0$). Furthermore, EMA works well and achieves close test accuracies in $[0.3,0.7]$.

We do not perform label re-assignment and only analyze the effect of $\alpha$ on sample selection in Fig. 7 (d). When $\alpha$ increases from 0 to 0.5, the performance steadily climbs because more training samples are utilized. However, when $\alpha$ is larger than 0.5, the performance declines with some fluctuations. The reason is that fewer noisy images are discarded and thus noise-cleaning becomes less effective.

Figure 8: ACA (%) vs. number of images per class. GR: Group Regularization; GRIP: Group Regularization and Instance Purification. "all" indicates using the entire dataset.

The analysis of $\tau$ is illustrated in Fig. 7 (e). From Fig. 7 (e), we can see that the performance climbs as $\tau$ increases from 0.03 to 0.05. It then drops fast when $\tau$ becomes larger ($\tau=0.07$). The reason can be that too many noisy samples are relabeled and reused; some of them may still have false labels or even be out-of-distribution (OOD). From this result, we believe that a small value of $\tau$ is safe and can boost performance.

The analysis of $t_m$ is illustrated in Fig. 7 (f). We can observe from Fig. 7 (f) that if the warm-up period is too short (fewer than 3 epochs), the model cannot learn reliable soft labels for noise identification, which results in an unsatisfying performance. A proper $t_m$ lies in $[5,10]$. If $t_m$ further increases, the model is affected by noisy labels during the warm-up period and noise-cleaning becomes less useful, resulting in declining test accuracies.

From the experimental results in Fig. 7, we can conclude that the weights $w$, $\gamma$, and momentum $m$ are robust and easy to adjust. The parameters $\alpha$ and $\tau$ for noise-cleaning should be tuned per dataset because they are concerned with noise ratios. As for $t_m$, a relatively small value is recommended, e.g. in $[5,10]$.

6.2 Analysis on Dataset Sizes

In this experiment, we investigate the applicability of GRIP to small datasets by changing the number of web images used for each category on Web-bird20. In detail, we construct sub-datasets from Web-bird by randomly selecting 10 to 90 samples per class. We then compare the performance of the backbone network, our group regularization strategy (GR), and our proposed GRIP on each sub-dataset in Fig. 8. It can be observed from Fig. 8 that our group regularization strategy shows significant and consistent improvements over the baseline across different dataset sizes. After applying the instance purification strategy, GRIP further boosts the performance on each sub-dataset. It shows remarkable noise-robustness even when the dataset is rather small, e.g. 20 images per class. From these results, we conclude that our approach is insensitive to the dataset size.

We can also see from Fig. 8 that the ACA performance climbs steadily when more training samples are utilized. Therefore, leveraging free web images is a promising research direction as it allows boosting model performance and robustness through enlarging datasets.

Table 5: The ACA (%) performances and improvements of different backbones on Web-bird. Baseline indicates that the network is straightforwardly trained using cross-entropy loss.
   Backbone     Method Performance Improvement
VGG-16 Baseline 66.34 Δ\Delta 9.53
Ours 75.87
ResNet-18 Baseline 71.10 Δ\Delta 7.81
Ours 78.91
ResNet-50 Baseline 73.30 Δ\Delta 9.23
Ours 82.53

6.3 Applicability Across Different Backbones

We test our approach using different backbone networks on Web-bird to analyze its applicability in Table 5. We can observe from Table 5 that our approach boosts the performance by 9.53%, 7.81%, and 9.23% on VGG-1676, ResNet-18, and ResNet-50, respectively. The experimental results indicate that our approach is robust and achieves remarkable improvements across different CNN architectures.

6.4 Contribution of Each Component

In this subsection, we gradually add the components of our method to the baseline model and present the contribution of each component in Table 6. From lines 1 to 6 in Table 6, we can observe that all components contribute to performance improvements. In detail, leveraging $\mathcal{L}_{Soft}$ surpasses the baseline by around 2%. We then make further remarkable improvements by applying $\mathcal{L}_{ME}$ (3.35%) and EMA (0.44%). Owing to our group regularization strategy, the ACA performance reaches 76.90% and significantly surpasses the baseline. After further applying instance purification, the final performance reaches 78.91% by employing JS divergence noise identification (1.42%) and relabeling (0.59%). Thanks to the contribution of each component, our approach boosts the final performance by 7.81% over the baseline.

Table 6: The contribution of each component in our approach (1 to 6) and comparisons with other similar methods (7 to 11). The abbreviations are as follows: $\mathcal{L}_{Soft}$ (Soft), $\mathcal{L}_{ME}$ (ME), exponential moving average (EMA), JS divergence noise identification criterion (JS), relabeling (RE), label smoothing (LS), small-loss principle (SL), and JS divergence noise identification within each mini-batch (MB). '✓' indicates the component is utilized in training.

| No | Soft | ME | EMA | JS | RE | LS | SL | MB | ACA |
|---|---|---|---|---|---|---|---|---|---|
| 1 | | | | | | | | | 71.10 |
| 2 | ✓ | | | | | | | | 73.11 |
| 3 | ✓ | ✓ | | | | | | | 76.46 |
| 4 | ✓ | ✓ | ✓ | | | | | | 76.90 |
| 5 | ✓ | ✓ | ✓ | ✓ | | | | | 78.32 |
| 6 | ✓ | ✓ | ✓ | ✓ | ✓ | | | | 78.91 |
| 7 | | | | | | | | | 75.13 |
| 8 | | | | | | | | | 72.28 |
| 9 | | | | | | | | | 75.18 |
| 10 | | | | | | | | | 77.58 |
| 11 | | | | | | | | | 77.98 |
Figure 9: Illustration of noise-cleaning results on Web-bird, Web-car, and Web-aircraft. Each row illustrates ten samples that are from the same fine-grained category. Blue and red boxes indicate revisable and discarded samples, respectively.

6.5 Comparisons

In this experiment, we demonstrate the superiority of our proposed approach over other similar strategies. We first compare the proposed group regularization with the widely-used LS trick. Then we compare JS divergence noise identification with the small-loss principle. We also compare the global and mini-batch selection based on our JS divergence principle.

The experimental results are presented in Table 6 (lines 7 to 11). We can observe from Table 6 that LS shows a slightly inferior performance to simply leveraging the soft label supervision $\mathcal{L}_{Soft}$. Applying LS and the maximum entropy strategy $\mathcal{L}_{ME}$ together shows nearly no improvement over utilizing $\mathcal{L}_{ME}$ alone. On the contrary, combining $\mathcal{L}_{Soft}$ and $\mathcal{L}_{ME}$ boosts the performance significantly. This result demonstrates the advantages of utilizing estimated soft labels over LS: higher performance and better flexibility. Owing to $\mathcal{L}_{Soft}$ and $\mathcal{L}_{ME}$, our group regularization strategy surpasses LS remarkably.

Comparing lines 10 and 11 in Table 6, we can find that noise identification within a mini-batch based on JS divergence is superior to that based on loss. This result supports our argument that utilizing class soft labels is more effective than simply relying on loss values in noise identification. Comparing lines 5 and 11, we can observe that global selection further boosts the performance. The improvement derives from alleviating the noise ratio imbalance problem.

Note that we apply $\mathcal{L}_{ME}$ on discarded images. To demonstrate the contribution of this design, we remove it for comparison and find that the performance drops from 78.32% to 77.66%. Since noisy samples are utilized for training during the warm-up period, the network memorizes them to some extent; they potentially misguide noise identification and degrade performance. To solve this problem, $\mathcal{L}_{ME}$ provides a strong regularization that guides the network to forget noisy samples and pushes them farther from the class soft labels, guaranteeing more reliable noise identification.

6.6 Noise-cleaning Visualization

We sample noise-cleaning results on WebFG-496 and visualize them in Fig. 9. We can observe from Fig. 9 that our method effectively divides clean, revisable, and discarded samples. We can also notice that web datasets inevitably contain noisy labels. For example, searching the web for car images risks retrieving images of tires and steering wheels. This phenomenon reminds us that the noise-cleaning operation is necessary for training robust models when learning with noisy labels.

7 Conclusion

In this paper, we proposed an effective training method named GRIP that leverages group regularization to benefit instance purification on both synthetic and real-world datasets. The proposed group regularization strategy generates reliable class soft labels to boost model robustness against label noise. By measuring the differences between instances and class soft labels, our method can globally identify noisy and revisable samples. Resorting to the regularization from the category aspect and purification at the instance level, GRIP inherits the advantages of both noise-robust and noise-cleaning strategies. By conducting comprehensive experiments, we demonstrate the superiority of our approach over existing methods in combating noisy labels on both synthetic and real-world datasets.

References

  • 1 G. Pei, F. Shen, Y. Yao, G.-S. Xie, Z. Tang, and J. Tang, “Hierarchical feature alignment network for unsupervised video object segmentation,” in Proceedings of the European Conference on Computer Vision, 2022, pp. 596–613.
  • 2 S.-H. Wang, D. R. Nayak, D. S. Guttery, X. Zhang, and Y.-D. Zhang, “Covid-19 classification by ccshnet with deep fusion using transfer learning and discriminant correlation analysis,” Information Fusion, vol. 68, pp. 131–148, 2021.
  • 3 S. Wang, M. E. Celebi, Y.-D. Zhang, X. Yu, S. Lu, X. Yao, Q. Zhou, M.-G. Miguel, Y. Tian, J. M. Gorriz et al., “Advances in data preprocessing for biomedical data fusion: An overview of the methods, challenges, and prospects,” Information Fusion, vol. 76, pp. 376–421, 2021.
  • 4 Y.-D. Zhang, Z. Dong, S.-H. Wang, X. Yu, X. Yao, Q. Zhou, H. Hu, M. Li, C. Jiménez-Mesa, J. Ramirez et al., “Advances in multimodal data fusion in neuroimaging: Overview, challenges, and novel orientation,” Information Fusion, vol. 64, pp. 149–187, 2020.
  • 5 H. Zhu, S. Liu, L. Deng, Y. Li, and F. Xiao, “Infrared small target detection via low-rank tensor completion with top-hat regularization,” IEEE Transactions on Geoscience and Remote Sensing, vol. 58, no. 2, pp. 1004–1016, 2019.
  • 6 L. Deng, H. Zhu, Q. Zhou, and Y. Li, “Adaptive top-hat filter based on quantum genetic algorithm for infrared small target detection,” Multimedia Tools and Applications, vol. 77, pp. 10 539–10 551, 2018.
  • 7 H. Zhu, H. Ni, S. Liu, G. Xu, and L. Deng, “Tnlrs: Target-aware non-local low-rank modeling with saliency filtering regularization for infrared small target detection,” IEEE Transactions on Image Processing, vol. 29, pp. 9546–9558, 2020.
  • 8 L. Deng, J. Zhang, G. Xu, and H. Zhu, “Infrared small target detection via adaptive m-estimator ring top-hat transformation,” Pattern Recognition, vol. 112, p. 107729, 2021.
  • 9 J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • 10 T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in Proceedings of the European Conference on Computer Vision.   Springer, 2014, pp. 740–755.
  • 11 Y. Yao, T. Chen, G.-S. Xie, C. Zhang, F. Shen, Q. Wu, Z. Tang, and J. Zhang, “Non-salient region object mining for weakly supervised semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 2623–2632.
  • 12 T. Chen, Y. Yao, and J. Tang, “Multi-granularity denoising and bidirectional alignment for weakly supervised semantic segmentation,” IEEE Transactions on Image Processing, vol. 32, pp. 2960–2971, 2023.
  • 13 H. Liu, P. Peng, T. Chen, Q. Wang, Y. Yao, and X.-S. Hua, “Fecanet: Boosting few-shot semantic segmentation with feature-enhanced context-aware network,” IEEE Transactions on Multimedia, pp. 1–13, 2023.
  • 14 Q. Tian, Y. Cheng, S. He, and J. Sun, “Unsupervised multi-source domain adaptation for person re-identification via feature fusion and pseudo-label refinement,” Computers and Electrical Engineering, vol. 113, p. 109029, 2024.
  • 15 Y. Yao, J. Zhang, F. Shen, X. Hua, J. Xu, and Z. Tang, “Exploiting web images for dataset construction: A domain robust approach,” IEEE Transactions on Multimedia, vol. 19, no. 8, pp. 1771–1784, 2017.
  • 16 Y. Yao, J. Zhang, F. Shen, L. Liu, F. Zhu, D. Zhang, and H. T. Shen, “Towards automatic construction of diverse, high-quality image datasets,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 6, pp. 1199–1211, 2019.
  • 17 Y. Yao, F. Shen, G. Xie, L. Liu, F. Zhu, J. Zhang, and H. T. Shen, “Exploiting web images for multi-output classification: From category to subcategories,” IEEE Transactions on Neural Networks and Learning Systems, vol. 31, no. 7, pp. 2348–2360, 2020.
  • 18 Y. Yao, Z. Sun, F. Shen, L. Liu, L. Wang, F. Zhu, L. Ding, G. Wu, and L. Shao, “Dynamically visual disambiguation of keyword-based image search,” 2019, pp. 996–1002.
  • 19 T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from massive noisy labeled data for image classification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2691–2699.
  • 20 Z. Sun, Y. Yao, X.-S. Wei, Y. Zhang, F. Shen, J. Wu, J. Zhang, and H. T. Shen, “Webly supervised fine-grained recognition: Benchmark datasets and an approach,” in Proceedings of the International Conference on Computer Vision, 2021, pp. 10 602–10 611.
  • 21 B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li, “Yfcc100m: The new data in multimedia research,” Communications of the ACM, vol. 59, no. 2, pp. 64–73, 2016.
  • 22 S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan, “Youtube-8m: A large-scale video classification benchmark,” arXiv preprint arXiv:1609.08675, 2016.
  • 23 D. Arpit, S. Jastrzębski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. Courville, Y. Bengio et al., “A closer look at memorization in deep networks,” in Proceedings of the International Conference on Machine Learning, 2017, pp. 233–242.
  • 24 C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in Proceedings of the International Conference on Learning Representations, 2017, pp. 1–15.
  • 25 Z. Sun, F. Shen, D. Huang, Q. Wang, X. Shu, Y. Yao, and J. Tang, “Pnp: Robust learning from noisy labels by probabilistic noise prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 5311–5320.
  • 26 T. Liu and D. Tao, “Classification with noisy labels by importance reweighting,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 447–461, 2015.
  • 27 C. Zhang, Q. Wang, G. Xie, Q. Wu, F. Shen, and Z. Tang, “Robust learning from noisy web images via data purification for fine-grained recognition,” IEEE Transactions on Multimedia, vol. 24, pp. 1198–1209, 2021.
  • 28 C. Zhang, Y. Yao, X. Shu, Z. Li, Z. Tang, and Q. Wu, “Data-driven meta-set based fine-grained visual recognition,” in Proceedings of the ACM International Conference on Multimedia, 2020, pp. 2372–2381.
  • 29 H. Liu, C. Zhang, Y. Yao, X.-S. Wei, F. Shen, Z. Tang, and J. Zhang, “Exploiting web images for fine-grained visual recognition by eliminating open-set noise and utilizing hard examples,” IEEE Transactions on Multimedia, vol. 24, pp. 546–557, 2021.
  • 30 C. Zhang, G. Lin, Q. Wang, F. Shen, Y. Yao, and Z. Tang, “Guided by meta-set: A data-driven method for fine-grained visual recognition,” IEEE Transactions on Multimedia, pp. 4691–4703, 2022.
  • 31 M. Ren, W. Zeng, B. Yang, and R. Urtasun, “Learning to reweight examples for robust deep learning,” in Proceedings of the International Conference on Machine Learning. PMLR, 2018, pp. 4334–4343.
  • 32 J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng, “Meta-weight-net: Learning an explicit mapping for sample weighting,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • 33 M. Sheng, Z. Sun, Z. Cai, T. Chen, Y. Zhou, and Y. Yao, “Adaptive integration of partial label learning and negative learning for enhanced noisy label learning,” arXiv preprint arXiv:2312.09505, 2023.
  • 34 Y. Wang, X. Ma, Z. Chen, Y. Luo, J. Yi, and J. Bailey, “Symmetric cross entropy for robust learning with noisy labels,” in Proceedings of the International Conference on Computer Vision, 2019, pp. 322–330.
  • 35 X. Ma, H. Huang, Y. Wang, S. Romano, S. Erfani, and J. Bailey, “Normalized loss functions for deep learning with noisy labels,” in Proceedings of the International Conference on Machine Learning. PMLR, 2020, pp. 6543–6553.
  • 36 Z. Zhang and M. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • 37 X. Xia, T. Liu, B. Han, C. Gong, N. Wang, Z. Ge, and Y. Chang, “Robust early-learning: Hindering the memorization of noisy labels,” in Proceedings of the International Conference on Learning Representations, 2021, pp. 1–15.
  • 38 C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the Inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
  • 39 C.-B. Zhang, P.-T. Jiang, Q. Hou, Y. Wei, Q. Han, Z. Li, and M.-M. Cheng, “Delving deep into label smoothing,” IEEE Transactions on Image Processing, pp. 5984–5996, 2021.
  • 40 J. Goldberger and E. Ben-Reuven, “Training deep neural-networks using a noise adaptation layer,” in Proceedings of the International Conference on Learning Representations, 2017.
  • 41 J. Li, R. Socher, and S. C. Hoi, “Dividemix: Learning with noisy labels as semi-supervised learning,” in Proceedings of the International Conference on Learning Representations, 2020, pp. 1–14.
  • 42 E. Malach and S. Shalev-Shwartz, “Decoupling ‘when to update’ from ‘how to update’,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • 43 B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama, “Co-teaching: Robust training of deep neural networks with extremely noisy labels,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • 44 H. Wei, L. Feng, X. Chen, and B. An, “Combating noisy labels by agreement: A joint training method with co-regularization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 13 726–13 735.
  • 45 H. Liu, H. Zhang, J. Lu, and Z. Tang, “Exploiting web images for fine-grained visual recognition via dynamic loss correction and global sample selection,” IEEE Transactions on Multimedia, vol. 24, pp. 1105–1115, 2022.
  • 46 Z. Cai, G.-S. Xie, X. Huang, D. Huang, Y. Yao, and Z. Tang, “Robust learning from noisy web data for fine-grained recognition,” Pattern Recognition, vol. 134, p. 109063, 2023.
  • 47 Z. Cai, H. Liu, D. Huang, Y. Yao, and Z. Tang, “Co-mining: Mining informative samples with noisy labels,” Signal Processing, vol. 209, p. 109003, 2023.
  • 48 X.-J. Gui, W. Wang, and Z.-H. Tian, “Towards understanding deep learning from noisy labels with small-loss criterion,” in Proceedings of the International Joint Conference on Artificial Intelligence, 2021, pp. 2469–2475.
  • 49 H. Song, M. Kim, and J.-G. Lee, “Selfie: Refurbishing unclean samples for robust deep learning,” in Proceedings of the International Conference on Machine Learning, 2019, pp. 5907–5915.
  • 50 Y. Yao, Z. Sun, C. Zhang, F. Shen, Q. Wu, J. Zhang, and Z. Tang, “Jo-src: A contrastive approach for combating noisy labels,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 5192–5201.
  • 51 A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Technical report, University of Toronto, vol. 1, no. 4, p. 7, 2009.
  • 52 X. Peng, K. Wang, Z. Zeng, Q. Li, J. Yang, and Y. Qiao, “Suppressing mislabeled data via grouping and self-attention,” in Proceedings of the European Conference on Computer Vision. Springer, 2020, pp. 786–802.
  • 53 A. Ghosh, H. Kumar, and P. Sastry, “Robust loss functions under label noise for deep neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
  • 54 Z. Sun, X.-S. Hua, Y. Yao, X.-S. Wei, G. Hu, and J. Zhang, “Crssc: Salvage reusable samples from noisy data for robust learning,” in Proceedings of the ACM International Conference on Multimedia, 2020, pp. 92–101.
  • 55 C. Zhang, Y. Yao, H. Liu, G.-S. Xie, X. Shu, T. Zhou, Z. Zhang, F. Shen, and Z. Tang, “Web-supervised network with softly update-drop training for fine-grained visual classification,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 12 781–12 788.
  • 56 C. Tan, J. Xia, L. Wu, and S. Z. Li, “Co-learning: Learning from noisy labels with self-supervision,” in Proceedings of the ACM International Conference on Multimedia, 2021, pp. 1405–1413.
  • 57 D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel, “Mixmatch: A holistic approach to semi-supervised learning,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • 58 A. Dubey, O. Gupta, R. Raskar, and N. Naik, “Maximum-entropy fine grained classification,” Advances in Neural Information Processing Systems, vol. 31, 2018.
  • 59 J. Lin, “Divergence measures based on the Shannon entropy,” IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145–151, 1991.
  • 60 D. Patel and P. Sastry, “Adaptive sample selection for robust learning under label noise,” in IEEE Winter Conference on Applications of Computer Vision, 2023, pp. 3932–3942.
  • 61 X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, and M. Sugiyama, “How does disagreement help generalization against label corruption?” in Proceedings of the International Conference on Machine Learning, 2019, pp. 7164–7173.
  • 62 Y. Lu and W. He, “SELC: Self-ensemble label correction improves learning with noisy labels,” in Proceedings of the International Joint Conference on Artificial Intelligence, vol. 31, 2022, pp. 3278–3284.
  • 63 J. Deng, J. Guo, T. Liu, M. Gong, and S. Zafeiriou, “Sub-center arcface: Boosting face recognition by large-scale noisy web faces,” in Proceedings of the European Conference on Computer Vision. Springer, 2020, pp. 741–757.
  • 64 L. Huang, C. Zhang, and H. Zhang, “Self-adaptive training: Beyond empirical risk minimization,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • 65 Y. Zhang, S. Zheng, P. Wu, M. Goswami, and C. Chen, “Learning with feature-dependent label noise: A progressive approach,” in Proceedings of the International Conference on Learning Representations, 2021, pp. 1–13.
  • 66 Z. Sun, H. Liu, Q. Wang, T. Zhou, Q. Wu, and Z. Tang, “Co-ldl: A co-training-based label distribution learning method for tackling label noise,” IEEE Transactions on Multimedia, vol. 24, pp. 1093–1104, 2022.
  • 67 X. Zhou, X. Liu, D. Zhai, J. Jiang, and X. Ji, “Asymmetric loss functions for noise-tolerant learning: theory and applications,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 7, pp. 8094–8109, 2023.
  • 68 J. Shu, X. Yuan, D. Meng, and Z. Xu, “Cmw-net: Learning a class-aware sample weighting mapping for robust deep learning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • 69 C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The caltech-ucsd birds-200-2011 dataset,” Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • 70 S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi, “Fine-grained visual classification of aircraft,” arXiv preprint arXiv:1306.5151, 2013.
  • 71 J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations for fine-grained categorization,” in Proceedings of the International Conference on Computer Vision, 2013, pp. 554–561.
  • 72 K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • 73 I. Loshchilov and F. Hutter, “Sgdr: Stochastic gradient descent with warm restarts,” in Proceedings of the International Conference on Learning Representations, 2017, pp. 1–16.
  • 74 L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of the International Conference on Computational Statistics, 2010, p. 177.
  • 75 H. Permuter, J. Francos, and I. Jermyn, “A study of Gaussian mixture models of color and texture features for image classification and segmentation,” Pattern Recognition, pp. 695–706, 2006.
  • 76 K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.