
Label noise breakdown point

Ningyuan (Teresa) Huang

1 Trunk’s data model

We deduce the breakdown point for Trunk's data model as a simple corollary of Lemma 15 in blanchard2016classification, following their problem formulation in Section 7.1:

Let $(X,Y)$ be random on $\mathcal{X}\times\{0,1\}$, where $\mathcal{X}$ is a Borel space, and let $P$ denote the probability measure governing $(X,Y)$. Let $\mathcal{M}$ denote the set of decision functions, i.e., the set of measurable functions $\mathcal{X}\rightarrow\mathbb{R}$. Every $f\in\mathcal{M}$ induces a classifier $x\mapsto u(f(x))$, where $u(t)$ is the unit step function

\[
u(t):=\begin{cases}1,&t>0\\ 0,&t\leq 0\end{cases}
\]

For any $f\in\mathcal{M}$, define the $P$-risk of $f$ as:

\[
\mathcal{R}_{P}(f):=\mathbb{E}_{(X,Y)\sim P}\left[\mathbf{1}_{\{u(f(X))\neq Y\}}\right]
\]

Define the Bayes $P$-risk $\mathcal{R}_{P}^{*}:=\inf_{f\in\mathcal{M}}\mathcal{R}_{P}(f)$.
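The objects defined so far can be sketched numerically; the linear decision function $f$ and the data below are purely illustrative choices, not part of the model:

```python
import numpy as np

def u(t):
    """Unit step: 1 if t > 0, else 0."""
    return (np.asarray(t) > 0).astype(int)

def empirical_risk(f, X, Y):
    """Monte Carlo estimate of the P-risk R_P(f) = E[1{u(f(X)) != Y}]."""
    return np.mean(u(f(X)) != Y)

# Example: a linear decision function on scalar features.
f = lambda x: x - 0.5           # induced classifier predicts 1 iff x > 0.5
X = np.array([0.2, 0.6, 0.9])
Y = np.array([0, 1, 1])
empirical_risk(f, X, Y)         # -> 0.0, since u(f(x)) matches every label
```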

From devroye2013probabilistic (Thm 2.2): for any $f\in\mathcal{M}$, the excess $P$-risk satisfies

\[
\mathcal{R}_{P}(f)-\mathcal{R}_{P}^{*}=2\,\mathbb{E}_{X}\left[\mathbf{1}_{\left\{u(f(X))\neq u\left(\eta(X)-\frac{1}{2}\right)\right\}}\left|\eta(X)-\frac{1}{2}\right|\right], \tag{1}
\]

where $\eta(x):=P(Y=1\mid X=x)$ is the posterior of the clean labels.
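Identity (1) can be sanity-checked exactly on a small discrete feature space (all distribution values below are illustrative):

```python
import numpy as np

# Discrete feature space with marginal P(X = x_i) and posterior eta(x_i).
px  = np.array([0.3, 0.4, 0.3])           # marginal of X
eta = np.array([0.9, 0.4, 0.2])           # eta(x) = P(Y=1 | X=x)

def risk(pred):
    """Exact P-risk of a classifier given its 0/1 predictions per atom of X."""
    # Predicting 0 misclassifies w.p. eta; predicting 1 misclassifies w.p. 1-eta.
    return np.sum(px * np.where(pred == 1, 1 - eta, eta))

bayes_pred = (eta > 0.5).astype(int)      # u(eta(x) - 1/2)
bayes_risk = risk(bayes_pred)             # the Bayes P-risk

pred = np.array([1, 1, 0])                # an arbitrary classifier
lhs = risk(pred) - bayes_risk             # excess P-risk, computed directly
rhs = 2 * np.sum(px * (pred != bayes_pred) * np.abs(eta - 0.5))  # identity (1)
assert np.isclose(lhs, rhs)
```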

In the noisy-label setting, we do not observe $Y$ but a noisy label $Z$. Let $(X,Y,Z)$ be jointly distributed, where $Z$ is conditionally independent of the feature vector $X$ given $Y$; that is, the conditional distribution of $Z$ given $X$ and $Y$ depends only on $Y$. Let $\tilde{\eta}(x):=P(Z=1\mid X=x)$ be the posterior of the noisy labels.

Ideally, we would like to minimize $\mathcal{R}_{P}(f)$, but we can only do so via $\mathcal{R}_{\tilde{P}}(f)$, where $(X,Z)\sim\tilde{P}$. The breakdown point occurs when minimizing $\mathcal{R}_{\tilde{P}}(f)$ effectively maximizes $\mathcal{R}_{P}(f)$, i.e., training becomes counter-productive. This happens when the label-flipping probability $p>0.5$; at $p=0.5$ the noisy labels carry no information at all.
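The breakdown can be seen concretely on a small discrete example. Under symmetric flipping with probability $p$, the noisy posterior is $\tilde{\eta}(x)=(1-2p)\,\eta(x)+p$; past $p=0.5$, the minimizer of the noisy risk is exactly the complement of the clean Bayes classifier (the distribution values below are illustrative):

```python
import numpy as np

px  = np.array([0.3, 0.4, 0.3])           # illustrative marginal of X
eta = np.array([0.9, 0.4, 0.2])           # clean posterior P(Y=1 | X=x)

def clean_risk(pred):
    """Exact clean risk R_P for 0/1 predictions on each atom of X."""
    return np.sum(px * np.where(pred == 1, 1 - eta, eta))

clean_bayes = (eta > 0.5).astype(int)

for p in (0.2, 0.7):                      # flip probability below / above 0.5
    eta_tilde = (1 - 2 * p) * eta + p     # noisy posterior under symmetric flipping
    noisy_bayes = (eta_tilde > 0.5).astype(int)   # minimizer of the noisy risk
    if p < 0.5:
        assert np.array_equal(noisy_bayes, clean_bayes)      # still optimal
    else:
        assert np.array_equal(noisy_bayes, 1 - clean_bayes)  # fully inverted
        # Its clean risk is the worst possible over all 2^3 classifiers:
        assert clean_risk(noisy_bayes) == max(
            clean_risk(np.array(b)) for b in np.ndindex(2, 2, 2))
```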

Proof.

We write the posterior of the noisy labels $\tilde{\eta}(x)$ as a function of the posterior of the clean labels $\eta(x)$ and the corruption probability $p$, using the conditional independence of $Z$ and $X$ given $Y$:

\begin{align}
\tilde{\eta}(x)&=\Pr(Z=1,Y=1\mid X=x)+\Pr(Z=1,Y=0\mid X=x) \tag{2}\\
&=\Pr(Z=1\mid Y=1,X=x)\,\eta(x)+\Pr(Z=1\mid Y=0,X=x)\,(1-\eta(x)) \tag{3}\\
&=(1-p)\,\eta(x)+p\,(1-\eta(x)) \tag{4}\\
&=(1-2p)\,\eta(x)+p \tag{5}
\end{align}

Thus:

\[
\tilde{\eta}(x)-\frac{1}{2}=(1-2p)\left(\eta(x)-\frac{1}{2}\right) \tag{6}
\]

Applying identity (1) to $\tilde{P}$, whose posterior is $\tilde{\eta}$:

\[
\mathcal{R}_{\tilde{P}}(f)-\mathcal{R}_{\tilde{P}}^{*}=2\,\mathbb{E}_{X}\left[\mathbf{1}_{\left\{u(f(X))\neq u\left(\tilde{\eta}(X)-\frac{1}{2}\right)\right\}}\left|\tilde{\eta}(X)-\frac{1}{2}\right|\right] \tag{7}
\]

When $p<\frac{1}{2}$, equation (6) gives $u(\tilde{\eta}(X)-\frac{1}{2})=u(\eta(X)-\frac{1}{2})$ and $|\tilde{\eta}(X)-\frac{1}{2}|=(1-2p)\,|\eta(X)-\frac{1}{2}|$, so

\begin{align}
\mathcal{R}_{\tilde{P}}(f)-\mathcal{R}_{\tilde{P}}^{*}&=2\,(1-2p)\,\mathbb{E}_{X}\left[\mathbf{1}_{\left\{u(f(X))\neq u\left(\eta(X)-\frac{1}{2}\right)\right\}}\left|\eta(X)-\frac{1}{2}\right|\right] \tag{8}\\
&=(1-2p)\left(\mathcal{R}_{P}(f)-\mathcal{R}_{P}^{*}\right), \tag{9}
\end{align}

and minimizing the noisy excess risk also minimizes the clean excess risk.

When $p>\frac{1}{2}$, equation (6) flips the sign: $u(\tilde{\eta}(X)-\frac{1}{2})=1-u(\eta(X)-\frac{1}{2})$ whenever $\eta(X)\neq\frac{1}{2}$, and $|\tilde{\eta}(X)-\frac{1}{2}|=(2p-1)\,|\eta(X)-\frac{1}{2}|$. Since $\mathbf{1}_{\{u(f(X))\neq 1-u(\eta(X)-\frac{1}{2})\}}=1-\mathbf{1}_{\{u(f(X))\neq u(\eta(X)-\frac{1}{2})\}}$, equation (7) becomes

\[
\mathcal{R}_{\tilde{P}}(f)-\mathcal{R}_{\tilde{P}}^{*}=(2p-1)\left(2\,\mathbb{E}_{X}\left|\eta(X)-\tfrac{1}{2}\right|-\left(\mathcal{R}_{P}(f)-\mathcal{R}_{P}^{*}\right)\right). \tag{10}
\]

The noisy excess risk is now a decreasing affine function of the clean excess risk: $f$ minimizes $\mathcal{R}_{\tilde{P}}$ if and only if it maximizes $\mathcal{R}_{P}$, so training on the noisy labels is counter-productive. At $p=\frac{1}{2}$, $\tilde{\eta}\equiv\frac{1}{2}$ and every classifier attains the same noisy risk, so the noisy labels carry no information. ∎
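As a numerical sanity check: under symmetric flipping with probability $p$, the noisy excess risk equals $(1-2p)$ times the clean excess risk for $p<\frac{1}{2}$, and is a decreasing affine function of it for $p>\frac{1}{2}$. Both relations can be verified exactly on a small discrete example (all distribution values below are illustrative):

```python
import numpy as np

px  = np.array([0.3, 0.4, 0.3])           # illustrative marginal of X
eta = np.array([0.9, 0.4, 0.2])           # clean posterior P(Y=1 | X=x)

def excess(post, pred):
    """Excess risk 2*E[1{pred != bayes(post)} * |post - 1/2|] for a posterior."""
    bayes = (post > 0.5).astype(int)
    return 2 * np.sum(px * (pred != bayes) * np.abs(post - 0.5))

pred = np.array([1, 1, 0])                # an arbitrary classifier

for p in (0.2, 0.7):
    eta_t = (1 - 2 * p) * eta + p         # noisy posterior under symmetric flipping
    if p < 0.5:   # noisy excess = (1-2p) * clean excess
        assert np.isclose(excess(eta_t, pred), (1 - 2 * p) * excess(eta, pred))
    else:         # decreasing affine relation for p > 1/2
        total = 2 * np.sum(px * np.abs(eta - 0.5))
        assert np.isclose(excess(eta_t, pred),
                          (2 * p - 1) * (total - excess(eta, pred)))
```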

2 Regularization via early stopping

From empirical studies on label noise, we observe that early stopping dramatically improves test-set accuracy even when the label-noise probability is large. Specifically, the following works used our symmetric label-corruption scheme:

  1.

    zhang2017understanding trained large-capacity neural networks on noisy CIFAR10. All the models achieve zero training loss, and their test-time performance is reported (Figure 1c). Notably, their interpolating models already have much worse test-set accuracy when $p$ is small, and thus do not exhibit a dramatic breakdown point when $p$ is around 0.8. For our experiments, we fix the number of training epochs for all models at 33, implicitly using early stopping as a criterion.

  2.

    jiang2020synthetic trained large-capacity neural networks from scratch or by fine-tuning on noisy CIFAR10, mini-ImageNet (100 classes), and Stanford Cars (196 classes). They report model performance in terms of: 1) peak test accuracy over the whole training phase; 2) final test accuracy. Their results show that the final test accuracy is much worse than the peak test accuracy (see Tables 4, 5, 6, and 7, blue-noise setting). In other words, early stopping during the training phase (using a small clean validation set) can yield robust performance even with $p$ as large as 0.8.
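For reference, a minimal sketch of one common symmetric corruption scheme for multi-class labels (with probability $p$, replace the label with one drawn uniformly from the other classes); whether this matches the exact scheme used in the works above is an assumption:

```python
import numpy as np

def corrupt_labels(y, p, num_classes, rng=None):
    """Symmetric label corruption: with probability p, replace each label
    with a label drawn uniformly from the *other* num_classes-1 classes."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    flip = rng.random(y.shape) < p
    # An offset in {1, ..., num_classes-1} added modulo num_classes is
    # uniform over the incorrect classes.
    offset = rng.integers(1, num_classes, size=y.shape)
    return np.where(flip, (y + offset) % num_classes, y)

# Example: corrupt CIFAR10-style labels (10 classes) with p = 0.8.
y = np.arange(10)
y_noisy = corrupt_labels(y, p=0.8, num_classes=10, rng=0)
```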