
Label noise breakdown point

Ningyuan (Teresa) Huang

1 Trunk’s data model

We deduce the breakdown point for Trunk's data model as a simple corollary of Lemma 15 in blanchard2016classification, following their problem formulation in Section 7.1:

Let $(X,Y)$ be random on $\mathcal{X}\times\{0,1\}$, where $\mathcal{X}$ is a Borel space, and let $P$ denote the probability measure governing $(X,Y)$. Let $\mathcal{M}$ denote the set of decision functions, i.e., the set of measurable functions $\mathcal{X}\rightarrow\mathbb{R}$. Every $f\in\mathcal{M}$ induces a classifier $x\mapsto u(f(x))$, where $u(t)$ is the unit step function

\[
u(t):=\begin{cases}1,&t>0\\ 0,&t\leq 0\end{cases}
\]

For any $f\in\mathcal{M}$, define the $P$-risk of $f$ as:

\[
\mathcal{R}_{P}(f):=\mathbb{E}_{(X,Y)\sim P}\left[\mathbf{1}_{\{u(f(X))\neq Y\}}\right]
\]

Define the Bayes $P$-risk $\mathcal{R}_{P}^{*}:=\inf_{f\in\mathcal{M}}\mathcal{R}_{P}(f)$.
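The objects defined so far can be sketched numerically; the linear decision function $f$ and the data below are purely illustrative choices, not part of the model:

```python
import numpy as np

def u(t):
    """Unit step: 1 if t > 0, else 0."""
    return (np.asarray(t) > 0).astype(int)

def empirical_risk(f, X, Y):
    """Monte Carlo estimate of the P-risk R_P(f) = E[1{u(f(X)) != Y}]."""
    return np.mean(u(f(X)) != Y)

# Example: a linear decision function on scalar features.
f = lambda x: x - 0.5           # induced classifier predicts 1 iff x > 0.5
X = np.array([0.2, 0.6, 0.9])
Y = np.array([0, 1, 1])
empirical_risk(f, X, Y)         # -> 0.0, since u(f(x)) matches every label
```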

From devroye2013probabilistic (Thm 2.2): for any $f\in\mathcal{M}$, the excess $P$-risk satisfies

\[
\mathcal{R}_{P}(f)-\mathcal{R}_{P}^{*}=2\,\mathbb{E}_{X}\left[\mathbf{1}_{\left\{u(f(X))\neq u\left(\eta(X)-\frac{1}{2}\right)\right\}}\left|\eta(X)-\frac{1}{2}\right|\right], \tag{1}
\]

where $\eta(x):=P(Y=1\mid X=x)$ is the posterior of the clean labels.
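Identity (1) can be sanity-checked exactly on a small discrete feature space (all distribution values below are illustrative):

```python
import numpy as np

# Discrete feature space with marginal P(X = x_i) and posterior eta(x_i).
px  = np.array([0.3, 0.4, 0.3])           # marginal of X
eta = np.array([0.9, 0.4, 0.2])           # eta(x) = P(Y=1 | X=x)

def risk(pred):
    """Exact P-risk of a classifier given its 0/1 predictions per atom of X."""
    # Predicting 0 misclassifies w.p. eta; predicting 1 misclassifies w.p. 1-eta.
    return np.sum(px * np.where(pred == 1, 1 - eta, eta))

bayes_pred = (eta > 0.5).astype(int)      # u(eta(x) - 1/2)
bayes_risk = risk(bayes_pred)             # the Bayes P-risk

pred = np.array([1, 1, 0])                # an arbitrary classifier
lhs = risk(pred) - bayes_risk             # excess P-risk, computed directly
rhs = 2 * np.sum(px * (pred != bayes_pred) * np.abs(eta - 0.5))  # identity (1)
assert np.isclose(lhs, rhs)
```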

In the noisy-label setting, we do not observe $Y$ but a noisy label $Z$. Let $(X,Y,Z)$ be jointly distributed, where $Z$ is conditionally independent of the feature vector $X$ given $Y$; that is, the conditional distribution of $Z$ given $X$ and $Y$ depends only on $Y$. Let $\tilde{\eta}(x):=P(Z=1\mid X=x)$ be the posterior of the noisy labels.

Ideally, we would like to minimize $\mathcal{R}_{P}(f)$, but we can only do so via $\mathcal{R}_{\tilde{P}}(f)$, where $(X,Z)\sim\tilde{P}$. The breakdown point occurs when minimizing $\mathcal{R}_{\tilde{P}}(f)$ effectively maximizes $\mathcal{R}_{P}(f)$, i.e., training becomes counter-productive. This happens when the label-flipping probability $p>0.5$; at $p=0.5$ the noisy labels carry no information at all.
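The breakdown can be seen concretely on a small discrete example. Under symmetric flipping with probability $p$, the noisy posterior is $\tilde{\eta}(x)=(1-2p)\,\eta(x)+p$; past $p=0.5$, the minimizer of the noisy risk is exactly the complement of the clean Bayes classifier (the distribution values below are illustrative):

```python
import numpy as np

px  = np.array([0.3, 0.4, 0.3])           # illustrative marginal of X
eta = np.array([0.9, 0.4, 0.2])           # clean posterior P(Y=1 | X=x)

def clean_risk(pred):
    """Exact clean risk R_P for 0/1 predictions on each atom of X."""
    return np.sum(px * np.where(pred == 1, 1 - eta, eta))

clean_bayes = (eta > 0.5).astype(int)

for p in (0.2, 0.7):                      # flip probability below / above 0.5
    eta_tilde = (1 - 2 * p) * eta + p     # noisy posterior under symmetric flipping
    noisy_bayes = (eta_tilde > 0.5).astype(int)   # minimizer of the noisy risk
    if p < 0.5:
        assert np.array_equal(noisy_bayes, clean_bayes)      # still optimal
    else:
        assert np.array_equal(noisy_bayes, 1 - clean_bayes)  # fully inverted
        # Its clean risk is the worst possible over all 2^3 classifiers:
        assert clean_risk(noisy_bayes) == max(
            clean_risk(np.array(b)) for b in np.ndindex(2, 2, 2))
```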

Proof.

We write the posterior of the noisy labels $\tilde{\eta}(x)$ as a function of the posterior of the clean labels $\eta(x)$ and the corruption probability $p$, using the conditional independence of $Z$ and $X$ given $Y$:

\begin{align}
\tilde{\eta}(x)&=\Pr(Z=1,Y=1\mid X=x)+\Pr(Z=1,Y=0\mid X=x) \tag{2}\\
&=\Pr(Z=1\mid Y=1,X=x)\,\eta(x)+\Pr(Z=1\mid Y=0,X=x)\,(1-\eta(x)) \tag{3}\\
&=(1-p)\,\eta(x)+p\,(1-\eta(x)) \tag{4}\\
&=(1-2p)\,\eta(x)+p \tag{5}
\end{align}

Thus:

\[
\tilde{\eta}(x)-\frac{1}{2}=(1-2p)\left(\eta(x)-\frac{1}{2}\right) \tag{6}
\]

Applying identity (1) to $\tilde{P}$, whose posterior is $\tilde{\eta}$:

\[
\mathcal{R}_{\tilde{P}}(f)-\mathcal{R}_{\tilde{P}}^{*}=2\,\mathbb{E}_{X}\left[\mathbf{1}_{\left\{u(f(X))\neq u\left(\tilde{\eta}(X)-\frac{1}{2}\right)\right\}}\left|\tilde{\eta}(X)-\frac{1}{2}\right|\right] \tag{7}
\]

When $p<\frac{1}{2}$, equation (6) gives $u(\tilde{\eta}(X)-\frac{1}{2})=u(\eta(X)-\frac{1}{2})$ and $|\tilde{\eta}(X)-\frac{1}{2}|=(1-2p)\,|\eta(X)-\frac{1}{2}|$, so

\begin{align}
\mathcal{R}_{\tilde{P}}(f)-\mathcal{R}_{\tilde{P}}^{*}&=2\,(1-2p)\,\mathbb{E}_{X}\left[\mathbf{1}_{\left\{u(f(X))\neq u\left(\eta(X)-\frac{1}{2}\right)\right\}}\left|\eta(X)-\frac{1}{2}\right|\right] \tag{8}\\
&=(1-2p)\left(\mathcal{R}_{P}(f)-\mathcal{R}_{P}^{*}\right), \tag{9}
\end{align}

and minimizing the noisy excess risk also minimizes the clean excess risk.

When $p>\frac{1}{2}$, equation (6) flips the sign: $u(\tilde{\eta}(X)-\frac{1}{2})=1-u(\eta(X)-\frac{1}{2})$ whenever $\eta(X)\neq\frac{1}{2}$, and $|\tilde{\eta}(X)-\frac{1}{2}|=(2p-1)\,|\eta(X)-\frac{1}{2}|$. Since $\mathbf{1}_{\{u(f(X))\neq 1-u(\eta(X)-\frac{1}{2})\}}=1-\mathbf{1}_{\{u(f(X))\neq u(\eta(X)-\frac{1}{2})\}}$, equation (7) becomes

\[
\mathcal{R}_{\tilde{P}}(f)-\mathcal{R}_{\tilde{P}}^{*}=(2p-1)\left(2\,\mathbb{E}_{X}\left|\eta(X)-\tfrac{1}{2}\right|-\left(\mathcal{R}_{P}(f)-\mathcal{R}_{P}^{*}\right)\right). \tag{10}
\]

The noisy excess risk is now a decreasing affine function of the clean excess risk: $f$ minimizes $\mathcal{R}_{\tilde{P}}$ if and only if it maximizes $\mathcal{R}_{P}$, so training on the noisy labels is counter-productive. At $p=\frac{1}{2}$, $\tilde{\eta}\equiv\frac{1}{2}$ and every classifier attains the same noisy risk, so the noisy labels carry no information. ∎
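As a numerical sanity check: under symmetric flipping with probability $p$, the noisy excess risk equals $(1-2p)$ times the clean excess risk for $p<\frac{1}{2}$, and is a decreasing affine function of it for $p>\frac{1}{2}$. Both relations can be verified exactly on a small discrete example (all distribution values below are illustrative):

```python
import numpy as np

px  = np.array([0.3, 0.4, 0.3])           # illustrative marginal of X
eta = np.array([0.9, 0.4, 0.2])           # clean posterior P(Y=1 | X=x)

def excess(post, pred):
    """Excess risk 2*E[1{pred != bayes(post)} * |post - 1/2|] for a posterior."""
    bayes = (post > 0.5).astype(int)
    return 2 * np.sum(px * (pred != bayes) * np.abs(post - 0.5))

pred = np.array([1, 1, 0])                # an arbitrary classifier

for p in (0.2, 0.7):
    eta_t = (1 - 2 * p) * eta + p         # noisy posterior under symmetric flipping
    if p < 0.5:   # noisy excess = (1-2p) * clean excess
        assert np.isclose(excess(eta_t, pred), (1 - 2 * p) * excess(eta, pred))
    else:         # decreasing affine relation for p > 1/2
        total = 2 * np.sum(px * np.abs(eta - 0.5))
        assert np.isclose(excess(eta_t, pred),
                          (2 * p - 1) * (total - excess(eta, pred)))
```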

2 Regularization via early stopping

From empirical studies on label noise, we observe that early stopping dramatically improves test-set accuracy even when the label-noise probability is large. Specifically, the following works used our symmetric label-corruption scheme:

  1.

    zhang2017understanding trained large-capacity neural networks on noisy CIFAR10. All the models achieve zero training loss, and their test-time performance is reported (Figure 1c). Notably, their interpolating models already have much worse test-set accuracy when $p$ is small, and thus do not exhibit a dramatic breakdown point when $p$ is around 0.8. For our experiments, we fix the number of training epochs for all models at 33, implicitly using early stopping as a criterion.

  2.

    jiang2020synthetic trained large-capacity neural networks from scratch or by fine-tuning on noisy CIFAR10, mini-ImageNet (100 classes), and Stanford Cars (196 classes). They report model performance in terms of: 1) peak test accuracy over the whole training phase; 2) final test accuracy. Their results show that the final test accuracy is much worse than the peak test accuracy (see Tables 4, 5, 6, and 7, blue-noise setting). In other words, early stopping during the training phase (using a small clean validation set) can yield robust performance even with $p$ as large as 0.8.
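For reference, a minimal sketch of one common symmetric corruption scheme for multi-class labels (with probability $p$, replace the label with one drawn uniformly from the other classes); whether this matches the exact scheme used in the works above is an assumption:

```python
import numpy as np

def corrupt_labels(y, p, num_classes, rng=None):
    """Symmetric label corruption: with probability p, replace each label
    with a label drawn uniformly from the *other* num_classes-1 classes."""
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    flip = rng.random(y.shape) < p
    # An offset in {1, ..., num_classes-1} added modulo num_classes is
    # uniform over the incorrect classes.
    offset = rng.integers(1, num_classes, size=y.shape)
    return np.where(flip, (y + offset) % num_classes, y)

# Example: corrupt CIFAR10-style labels (10 classes) with p = 0.8.
y = np.arange(10)
y_noisy = corrupt_labels(y, p=0.8, num_classes=10, rng=0)
```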