Label noise breakdown point
1 Trunk’s data model
We deduce the breakdown point for Trunk's data model as a simple corollary of Lemma 15 in blanchard2016classification, following their problem formulation in Section 7.1:
Let $(X, Y)$ be random on $\mathcal{X} \times \{0, 1\}$, where $\mathcal{X}$ is a Borel space, and let $P$ denote the probability measure governing $(X, Y)$. Let $\mathcal{F}$ denote the set of decision functions, i.e., the set of measurable functions $f: \mathcal{X} \to \mathbb{R}$. Every $f \in \mathcal{F}$ induces a classifier $\operatorname{step}(f(x))$, where $\operatorname{step}(z) := \mathbb{1}\{z > 0\}$ is the unit step function.
For any $f \in \mathcal{F}$, define the $P$-risk of $f$ as
\[
R_P(f) := P\big(\operatorname{step}(f(X)) \neq Y\big).
\]
Define the Bayes $P$-risk $R_P^* := \inf_{f \in \mathcal{F}} R_P(f)$.
From devroye2013probabilistic (Thm 2.2): for any $f \in \mathcal{F}$, the excess $P$-risk satisfies
\begin{align}
R_P(f) - R_P^* = \mathbb{E}_X\Big[\big|2\eta(X) - 1\big|\,\mathbb{1}\big\{\operatorname{step}(f(X)) \neq \operatorname{step}(2\eta(X) - 1)\big\}\Big], \tag{1}
\end{align}
where $\eta(x) := P(Y = 1 \mid X = x)$ is the posterior of the clean labels.
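As a sanity check on equation (1), the identity can be verified numerically on a toy one-dimensional Gaussian mixture. The mixture, the deliberately suboptimal threshold classifier, and all names below are illustrative choices for this sketch, not part of the formal setup above.

```python
import numpy as np
from scipy.stats import norm

# Toy model: X | Y=0 ~ N(-1, 1), X | Y=1 ~ N(+1, 1), P(Y = 1) = 1/2.
rng = np.random.default_rng(0)
n = 1_000_000
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

def eta(x):
    # Clean posterior P(Y = 1 | X = x) for the mixture above.
    p1, p0 = norm.pdf(x, loc=+1.0), norm.pdf(x, loc=-1.0)
    return p1 / (p1 + p0)

t = 0.7                                # deliberately suboptimal threshold (Bayes threshold is 0)
g = (x > t).astype(int)                # classifier induced by f(x) = x - t via the unit step
g_star = (eta(x) > 0.5).astype(int)    # Bayes classifier step(2*eta(x) - 1)

excess_risk = np.mean(g != y) - np.mean(g_star != y)        # left-hand side of (1)
rhs = np.mean(np.abs(2.0 * eta(x) - 1.0) * (g != g_star))   # right-hand side of (1)
print(excess_risk, rhs)                # the two estimates agree up to Monte Carlo error
```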
In the noisy label setting, we do not observe $Y$ but the noisy label $\tilde{Y}$. Let $(X, Y, \tilde{Y})$ be jointly distributed, where the corruption is independent of the feature vector $X$, meaning that the conditional distribution of $\tilde{Y}$ given $(X, Y)$ depends only on $Y$. We consider symmetric corruption with probability $\rho$, i.e., $P(\tilde{Y} \neq Y \mid Y = 0) = P(\tilde{Y} \neq Y \mid Y = 1) = \rho$. Let $\tilde{\eta}(x) := P(\tilde{Y} = 1 \mid X = x)$ be the posterior of the noisy labels.
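For concreteness, the symmetric corruption model above can be simulated directly; the following is a minimal sketch of a flip-with-probability-$\rho$ routine for binary labels (the function name and setup are illustrative, not our experimental code).

```python
import numpy as np

def corrupt_labels(y, rho, rng):
    """Symmetric label noise: flip each binary label independently of the features,
    so that P(Y_tilde != Y | Y) = rho for both classes."""
    flip = rng.random(len(y)) < rho
    return np.where(flip, 1 - y, y)

rng = np.random.default_rng(0)
y_clean = rng.integers(0, 2, size=10_000)
y_noisy = corrupt_labels(y_clean, rho=0.3, rng=rng)
print("empirical flip rate:", np.mean(y_noisy != y_clean))   # close to rho = 0.3
```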
Ideally, we would like to minimize $R_P(f)$, but we can only do so via $R_{\tilde{P}}(f)$, where $\tilde{P}$ denotes the distribution of $(X, \tilde{Y})$. Hence, the breakdown point occurs when minimizing $R_{\tilde{P}}$ effectively maximizes $R_P$ (i.e., it becomes counter-productive). This happens when the noisy label probability $\rho > 1/2$.
Proof.
We write the posterior of the clean labels $\eta$ as a function of the posterior of the noisy labels $\tilde{\eta}$ and the corruption probability $\rho$. Conditioning on the clean label $Y$ and using the independence of the corruption from $X$:
\begin{align}
\tilde{\eta}(x) &= P(\tilde{Y} = 1 \mid X = x) \tag{2}\\
&= P(\tilde{Y} = 1 \mid Y = 1)\, P(Y = 1 \mid X = x) + P(\tilde{Y} = 1 \mid Y = 0)\, P(Y = 0 \mid X = x) \tag{3}\\
&= (1 - \rho)\,\eta(x) + \rho\,\big(1 - \eta(x)\big) \tag{4}\\
&= \rho + (1 - 2\rho)\,\eta(x). \tag{5}
\end{align}
Thus, for $\rho \neq 1/2$:
\begin{align}
\eta(x) = \frac{\tilde{\eta}(x) - \rho}{1 - 2\rho}. \tag{6}
\end{align}
Therefore:
\begin{align}
2\eta(x) - 1 &= \frac{2\big(\tilde{\eta}(x) - \rho\big)}{1 - 2\rho} - 1 \tag{7}\\
&= \frac{2\tilde{\eta}(x) - 2\rho - (1 - 2\rho)}{1 - 2\rho} \tag{8}\\
&= \frac{2\tilde{\eta}(x) - 1}{1 - 2\rho}. \tag{9}
\end{align}
By definition $\rho \in [0, 1]$, so $1 - 2\rho \in [-1, 1]$.
Now, when $\rho > 1/2$, the denominator $1 - 2\rho$ is negative, so the right-hand side of equation (9) has the opposite sign of $2\tilde{\eta}(x) - 1$; in particular, it is negative whenever $\tilde{\eta}(x) > 1/2$.
This implies that $\operatorname{sign}(2\eta(x) - 1) = -\operatorname{sign}(2\tilde{\eta}(x) - 1)$ wherever $\tilde{\eta}(x) \neq 1/2$. In other words, the Bayes classifier for the noisy labels disagrees with the Bayes classifier for the clean labels, so by equation (1), minimizing $R_{\tilde{P}}$ effectively maximizes the excess $P$-risk. ∎
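To see the sign flip concretely, relations (5) and (9) can be checked numerically: for $\rho < 1/2$ the clean and noisy posteriors lie on the same side of $1/2$, while for $\rho > 1/2$ they lie on opposite sides. The posterior values and noise levels below are arbitrary illustrative choices.

```python
import numpy as np

def eta_tilde(eta, rho):
    # Noisy posterior from equation (5): rho + (1 - 2*rho) * eta.
    return rho + (1.0 - 2.0 * rho) * eta

eta = np.array([0.05, 0.2, 0.35, 0.65, 0.8, 0.95])   # arbitrary clean posterior values
for rho in (0.2, 0.8):
    et = eta_tilde(eta, rho)
    # Equation (9): 2*eta - 1 = (2*eta_tilde - 1) / (1 - 2*rho).
    assert np.allclose(2 * eta - 1, (2 * et - 1) / (1 - 2 * rho))
    same_side = np.sign(2 * eta - 1) == np.sign(2 * et - 1)
    print(f"rho = {rho}: clean and noisy posteriors on the same side of 1/2? {same_side.all()}")
# rho = 0.2 -> True: the noisy Bayes classifier agrees with the clean one.
# rho = 0.8 -> False: the noisy Bayes classifier is flipped everywhere.
```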
2 Regularization via early stopping
From the empirical studies on label noise, we observe that early stopping dramatically improves test-set accuracy even when the label noise probability $\rho$ is large. Specifically, the following works used our symmetric label corruption scheme:
1. zhang2017understanding trained large-capacity neural networks on noisy CIFAR10. All the models achieve zero training loss, and their test-time performance is reported (their Figure 1c). Notably, these interpolating models have much worse test-set accuracies even when $\rho$ is small, and thus do not exhibit a dramatic breakdown point when $\rho$ is around 0.8. For our experiments, we fix the number of training epochs to the same value for all models, and thereby implicitly use early stopping as a criterion.
2. jiang2020synthetic trained large-capacity neural networks, from scratch or by fine-tuning, on noisy CIFAR10, Mini-ImageNet (100 classes), and Stanford Cars (196 classes). They report model performance in terms of (1) the peak test accuracy over the whole training phase and (2) the final test accuracy. Their results suggest that the final test accuracy is much worse than the peak test accuracy (see their Tables 4, 5, 6, and 7, blue noise setting). In other words, early stopping during training (using a small clean validation set) can yield robust performance even with large $\rho$ around 0.8; a minimal sketch of this criterion is given after this list.
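A minimal sketch of the early-stopping criterion these studies point to: train on noisy labels, monitor accuracy on a small clean validation set after every epoch, and keep the best checkpoint instead of the final (interpolating) model. The synthetic data, the scikit-learn MLP, and all hyperparameters below are illustrative stand-ins, not the setups of the cited papers.

```python
import warnings
from copy import deepcopy

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import accuracy_score
from sklearn.neural_network import MLPClassifier

warnings.filterwarnings("ignore", category=ConvergenceWarning)  # max_iter=1 triggers it each epoch

rng = np.random.default_rng(0)

def make_data(n):
    # Synthetic two-class Gaussian data as a stand-in for the image benchmarks.
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=2.0 * y[:, None] - 1.0, scale=2.0, size=(n, 20))
    return x, y

x_train, y_train = make_data(2_000)
x_val, y_val = make_data(500)          # small *clean* validation set
x_test, y_test = make_data(5_000)

rho = 0.4                              # symmetric noise applied to the training labels only
flip = rng.random(len(y_train)) < rho
y_train_noisy = np.where(flip, 1 - y_train, y_train)

# Over-parameterized MLP trained one epoch per fit() call (warm_start keeps the weights).
model = MLPClassifier(hidden_layer_sizes=(512, 512), max_iter=1, warm_start=True, random_state=0)

best_val_acc, best_model = -1.0, None
for epoch in range(200):
    model.fit(x_train, y_train_noisy)                        # one more pass over the noisy labels
    val_acc = accuracy_score(y_val, model.predict(x_val))
    if val_acc > best_val_acc:                               # early-stopping bookkeeping:
        best_val_acc, best_model = val_acc, deepcopy(model)  # keep the best clean-validation checkpoint

print("final-epoch test accuracy:  ", accuracy_score(y_test, model.predict(x_test)))
print("early-stopped test accuracy:", accuracy_score(y_test, best_model.predict(x_test)))
```

With a high-capacity model fit to heavily corrupted labels, the early-stopped checkpoint is typically the better of the two, mirroring the peak-versus-final gap reported above.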