
Provably End-to-end Label-noise Learning without Anchor Points

Xuefeng Li    Tongliang Liu    Bo Han    Gang Niu    Masashi Sugiyama
Abstract

In label-noise learning, the transition matrix plays a key role in building statistically consistent classifiers. Existing consistent estimators for the transition matrix have been developed by exploiting anchor points. However, the anchor-point assumption is not always satisfied in real scenarios. In this paper, we propose an end-to-end framework for solving label-noise learning without anchor points, in which we simultaneously optimize two objectives: the cross entropy loss between the noisy label and the predicted probability by the neural network, and the volume of the simplex formed by the columns of the transition matrix. Our proposed framework can identify the transition matrix if the clean class-posterior probabilities are sufficiently scattered. This is by far the mildest assumption under which the transition matrix is provably identifiable and the learned classifier is statistically consistent. Experimental results on benchmark datasets demonstrate the effectiveness and robustness of the proposed method.


1 Introduction

The success of modern deep learning algorithms heavily relies on large-scale accurately annotated data (Daniely & Granot, 2019; Han et al., 2020b; Xia et al., 2020; Berthon et al., 2021). However, it is often expensive or even infeasible to annotate large datasets. Therefore, cheap but less accurate annotating methods have been widely used (Xiao et al., 2015; Li et al., 2017; Han et al., 2020a; Yu et al., 2020; Zhu et al., 2021a). As a consequence, these alternatives inevitably introduce label noise. Training deep learning models on noisy data can significantly degenerate the test performance due to overfitting to the noisy labels (Arpit et al., 2017; Zhang et al., 2017; Xia et al., 2021; Wu et al., 2021).

To mitigate the negative impacts of label noise, many methods have been developed, some of which are based on a loss correction procedure. In general, these methods are statistically consistent, i.e., they guarantee that the classifier learned from the noisy data approaches the optimal classifier defined on the clean risk as the size of the noisy training set increases (Liu & Tao, 2016; Scott, 2015; Natarajan et al., 2013; Goldberger & Ben-Reuven, 2017; Patrini et al., 2017; Thekumparampil et al., 2018). The idea is that the clean class-posterior $P(\boldsymbol{Y}|X=\boldsymbol{x}):=[P(Y=1|X=\boldsymbol{x}),\ldots,P(Y=C|X=\boldsymbol{x})]^{\top}$ can be inferred by utilizing the noisy class-posterior $P(\tilde{\boldsymbol{Y}}|X=\boldsymbol{x})$ and the transition matrix $\boldsymbol{T}(\boldsymbol{x})$, where $\boldsymbol{T}_{ij}(\boldsymbol{x})=P(\tilde{Y}=i|Y=j,X=\boldsymbol{x})$, i.e., $P(\boldsymbol{Y}|X=\boldsymbol{x})=[\boldsymbol{T}(\boldsymbol{x})]^{-1}P(\tilde{\boldsymbol{Y}}|X=\boldsymbol{x})$. While these methods theoretically guarantee statistical consistency, they all heavily rely on the success of estimating transition matrices.

Generally, the transition matrix is unidentifiable without additional assumptions (Xia et al., 2019). In the literature, methods have been developed to estimate the transition matrices under the so-called anchor-point assumption: it assumes the existence of anchor points, i.e., instances belonging to a specific class with probability one (Liu & Tao, 2016). The assumption is reasonable in certain applications (Liu & Tao, 2016; Patrini et al., 2017). However, the violation of the assumption in some cases could lead to a poorly learned transition matrix and a degenerated classifier (Xia et al., 2019). This motivates the development of algorithms without exploiting anchor points (Xia et al., 2019; Liu & Guo, 2020; Xu et al., 2019; Zhu et al., 2021b). However, the performance is not theoretically guaranteed in these works.

Motivation. In this work, our interest lies in designing a consistent algorithm without anchor points, subject to class-dependent label noise, i.e., $\boldsymbol{T}(\boldsymbol{x})=\boldsymbol{T}$ for any $\boldsymbol{x}$ in the feature space. Our algorithm is based on a geometric property of the label corruption process. Given an instance $\boldsymbol{x}$, the noisy class-posterior probability $P(\tilde{\boldsymbol{Y}}|X=\boldsymbol{x}):=[P(\tilde{Y}=1|X=\boldsymbol{x}),\ldots,P(\tilde{Y}=C|X=\boldsymbol{x})]^{\top}$ can be viewed as a point in the $C$-dimensional space, where $C$ is the number of classes. Since $P(\tilde{\boldsymbol{Y}}|X=\boldsymbol{x})=\boldsymbol{T}P(\boldsymbol{Y}|X=\boldsymbol{x})$ and $\sum_{i=1}^{C}P(Y=i|X=\boldsymbol{x})=1$, $P(\tilde{\boldsymbol{Y}}|X=\boldsymbol{x})$ is a convex combination of the columns of $\boldsymbol{T}$. This means that the simplex $\mathrm{Sim}\{\boldsymbol{T}\}$ formed by the columns of $\boldsymbol{T}$ encloses $P(\tilde{\boldsymbol{Y}}|X=\boldsymbol{x})$ for any $\boldsymbol{x}$ (Boyd et al., 2004). Thus, the problem of identifying the transition matrix can be treated as the problem of recovering $\mathrm{Sim}\{\boldsymbol{T}\}$. However, when no assumption is made, the problem is ill-posed because $\mathrm{Sim}\{\boldsymbol{T}\}$ is not identifiable, i.e., there exists an infinite number of simplexes enclosing $P(\tilde{\boldsymbol{Y}}|X)$, and any of them could be the true simplex $\mathrm{Sim}\{\boldsymbol{T}\}$. Under the anchor-point assumption, $\mathrm{Sim}\{\boldsymbol{T}\}$ can be uniquely determined by exploiting anchor points, whose noisy class-posterior probabilities are the vertices of $\mathrm{Sim}\{\boldsymbol{T}\}$. The goal is thus to identify the points which have the largest noisy class-posterior probabilities for each class (Liu & Tao, 2016; Patrini et al., 2017). However, if there are no anchor points, the identified points would not be the vertices of $\mathrm{Sim}\{\boldsymbol{T}\}$. In this case, existing methods cannot consistently estimate the transition matrix. To recover $\boldsymbol{T}$ without anchor points, a key observation is that, among all simplexes enclosing $P(\tilde{\boldsymbol{Y}}|X)$, $\mathrm{Sim}\{\boldsymbol{T}\}$ is the one with minimum volume. See Figure 1 for a geometric illustration. This observation motivates our method, which incorporates the minimum-volume constraint on $\mathrm{Sim}\{\boldsymbol{T}\}$ into label-noise learning.

To this end, we propose the Volume Minimization Network (VolMinNet) to consistently estimate the transition matrix and build a statistically consistent classifier. Specifically, VolMinNet consists of a classification network $\boldsymbol{h}_{\boldsymbol{\theta}}$ and a trainable transition matrix $\hat{\boldsymbol{T}}$. We simultaneously optimize $\hat{\boldsymbol{T}}$ and $\boldsymbol{h}_{\boldsymbol{\theta}}$ with two objectives: i) the discrepancy between $\hat{\boldsymbol{T}}\boldsymbol{h}_{\boldsymbol{\theta}}(\boldsymbol{x})$ and the noisy class-posterior distribution $P(\tilde{\boldsymbol{Y}}|X=\boldsymbol{x})$; ii) the volume of the simplex formed by the columns of $\hat{\boldsymbol{T}}$. The proposed framework is end-to-end, and there is no need to identify anchor points or pseudo anchor points (i.e., instances belonging to a specific class with probability close to one) (Xia et al., 2019). Since our method does not rely on any specific data points, it yields better noise robustness than existing methods. Under the so-called sufficiently scattered assumption, which requires the clean class-posterior distribution to be far from uniform, we theoretically prove that $\hat{\boldsymbol{T}}$ converges to the true transition matrix $\boldsymbol{T}$ while $\boldsymbol{h}_{\boldsymbol{\theta}}(\boldsymbol{x})$ converges to the clean class-posterior $P(\boldsymbol{Y}|X=\boldsymbol{x})$. We also prove that the anchor-point assumption is a special case of the sufficiently scattered assumption.

The rest of this paper is organized as follows. In Section 2, we set up the notations and review the background of label-noise learning with anchor points. In Section 3, we introduce our proposed VolMinNet. In Section 4, we present the main theoretical results. In Section 5, we briefly introduce the related works in the literature. Experimental results on both synthetic and real-world datasets are provided in Section 6. Finally, we conclude the paper in Section 7.

Figure 1: Geometric illustration of the problem of estimating the transition matrix without anchor points. The red triangle is the simplex of $\boldsymbol{T}$ with vertices denoted by $\boldsymbol{T}_{:,i}$. When there are no anchor points, the simplex found by existing methods (blue triangle) using extreme-valued noisy class-posterior probabilities (see Eq. (4)) is not the true simplex (red triangle). Among all possible enclosing simplexes (black and red triangles), the true simplex $\mathrm{Sim}\{\boldsymbol{T}\}$ has the minimum volume.

2 Label-Noise Learning with Anchor Points

In this section, we review the background of label-noise learning. We follow common notational conventions in the label-noise learning literature. $\boldsymbol{v}\in\mathbb{R}^{n}$ and $\boldsymbol{V}\in\mathbb{R}^{n\times m}$ denote a real-valued $n$-dimensional vector and a real-valued $n\times m$ matrix, respectively. Elements of a vector are denoted by a subscript (e.g., $\boldsymbol{v}_{j}$), while rows and columns of a matrix are denoted by $\boldsymbol{V}_{i,:}$ and $\boldsymbol{V}_{:,i}$, respectively. The $i$th standard basis vector in $\mathbb{R}^{C}$ is denoted by $\boldsymbol{e}_{i}$. We denote the all-ones vector by $\boldsymbol{1}$, and $\Delta^{C-1}\subset[0,1]^{C}$ is the probability simplex in $\mathbb{R}^{C}$. In this work, we also make extensive use of convex analysis. Let $\mathcal{V}=\{\boldsymbol{v}_{1},\ldots,\boldsymbol{v}_{m}\}$ be a set of vectors. The convex cone of $\mathcal{V}$ is denoted by $\mathrm{cone}(\mathcal{V})=\{\boldsymbol{v}\,|\,\boldsymbol{v}=\sum_{j=1}^{m}\boldsymbol{v}_{j}\alpha_{j},\;\alpha_{j}\geq 0,\;\forall j\}$. Similarly, the convex hull of $\mathcal{V}$ is defined as $\mathrm{conv}(\mathcal{V})=\{\boldsymbol{v}\,|\,\boldsymbol{v}=\sum_{j=1}^{m}\boldsymbol{v}_{j}\alpha_{j},\;\alpha_{j}\geq 0,\;\sum_{j=1}^{m}\alpha_{j}=1,\;\forall j\}$. In particular, when $\{\boldsymbol{v}_{1},\ldots,\boldsymbol{v}_{m}\}$ are affinely independent, $\mathrm{conv}(\mathcal{V})$ is also called a simplex, which we denote by $\mathrm{Sim}(\mathcal{V})$.

Let $\mathcal{D}$ be the underlying distribution generating a pair of random variables $(X,Y)\in\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}\subseteq\mathbb{R}^{d}$ is the feature space, $\mathcal{Y}=\{1,2,\ldots,C\}$ is the label space and $C$ is the number of classes. In many real-world applications, samples drawn from $\mathcal{D}$ are unavailable. Before being observed, the labels of these samples are contaminated with noise, and we obtain a set of corrupted data $\{(\boldsymbol{x}_{i},\tilde{y}_{i})\}_{i=1}^{n}$, where $\tilde{y}$ is the noisy label. We denote by $\tilde{\mathcal{D}}$ the distribution of the noisy random pair $(X,\tilde{Y})\in\mathcal{X}\times\mathcal{Y}$.

Given an instance $\boldsymbol{x}$ sampled from $X$, $\tilde{Y}$ is derived from the random variable $Y$ through a noise transition matrix $\boldsymbol{T}(\boldsymbol{x})\in[0,1]^{C\times C}$:

$P(\tilde{\boldsymbol{Y}}|X=\boldsymbol{x})=\boldsymbol{T}(\boldsymbol{x})P(\boldsymbol{Y}|X=\boldsymbol{x}),$ (1)

where $P(\boldsymbol{Y}|X=\boldsymbol{x})=[P(Y=1|X=\boldsymbol{x}),\ldots,P(Y=C|X=\boldsymbol{x})]^{\top}$ and $P(\tilde{\boldsymbol{Y}}|X=\boldsymbol{x})=[P(\tilde{Y}=1|X=\boldsymbol{x}),\ldots,P(\tilde{Y}=C|X=\boldsymbol{x})]^{\top}$ are the clean class-posterior probability and the noisy class-posterior probability, respectively. The $ij$-th entry of the transition matrix, i.e., $\boldsymbol{T}_{ij}(\boldsymbol{x})=P(\tilde{Y}=i|Y=j,X=\boldsymbol{x})$, represents the probability that the instance $\boldsymbol{x}$ with clean label $Y=j$ will have the noisy label $\tilde{Y}=i$. Generally, the transition matrix is non-identifiable without additional assumptions (Xia et al., 2019). For example, we can decompose the transition matrix as $\boldsymbol{T}(\boldsymbol{x})=\boldsymbol{T}_{1}(\boldsymbol{x})\boldsymbol{T}_{2}(\boldsymbol{x})$. If we define $P^{\prime}(\boldsymbol{Y}|X=\boldsymbol{x})=\boldsymbol{T}_{2}(\boldsymbol{x})P(\boldsymbol{Y}|X=\boldsymbol{x})$, then $P(\tilde{\boldsymbol{Y}}|X=\boldsymbol{x})=\boldsymbol{T}(\boldsymbol{x})P(\boldsymbol{Y}|X=\boldsymbol{x})=\boldsymbol{T}_{1}(\boldsymbol{x})P^{\prime}(\boldsymbol{Y}|X=\boldsymbol{x})$ are both valid factorizations. Therefore, in this paper, we study the class-dependent and instance-independent transition matrix, on which the majority of existing methods focus (Han et al., 2018b, a; Patrini et al., 2017; Northcutt et al., 2017; Natarajan et al., 2013). Formally, we have:

$P(\tilde{\boldsymbol{Y}}|X=\boldsymbol{x})=\boldsymbol{T}P(\boldsymbol{Y}|X=\boldsymbol{x}),$ (2)

where the transition matrix $\boldsymbol{T}$ is now independent of the instance $\boldsymbol{x}$. In this work, we also assume that the true transition matrix $\boldsymbol{T}$ is diagonally dominant (this definition of diagonal dominance differs from the one in matrix analysis, but it is commonly used in label-noise learning (Xu et al., 2019)). Specifically, the transition matrix $\boldsymbol{T}$ is diagonally dominant if, for every column of $\boldsymbol{T}$, the magnitude of the diagonal entry is larger than that of any non-diagonal entry, i.e., $\boldsymbol{T}_{ii}>\boldsymbol{T}_{ji}$ for any $j\neq i$. This assumption has been commonly used in the literature of label-noise learning (Patrini et al., 2017; Xia et al., 2019; Yao et al., 2020b).

As shown in Eq. (2), the clean class-posterior probability $P(\boldsymbol{Y}|X)$ can be inferred from the noisy class-posterior probability $P(\tilde{\boldsymbol{Y}}|X)$ and the transition matrix $\boldsymbol{T}$ as $P(\boldsymbol{Y}|X)=\boldsymbol{T}^{-1}P(\tilde{\boldsymbol{Y}}|X)$. For this reason, the transition matrix has been widely exploited to build statistically consistent classifiers, i.e., classifiers that converge to the optimal classifier defined on the clean risk. Specifically, the transition matrix has been used to modify loss functions to build risk-consistent estimators (Goldberger & Ben-Reuven, 2017; Patrini et al., 2017; Yu et al., 2018; Xia et al., 2019), and to correct hypotheses to build classifier-consistent algorithms (Natarajan et al., 2013; Scott, 2015; Patrini et al., 2017). Thus, the success of these consistent algorithms relies on an accurately learned transition matrix.
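To make this concrete, the following toy sketch (our illustration, not code from the paper) simulates Eq. (2) for a hypothetical 3-class transition matrix and verifies that, when the true $\boldsymbol{T}$ is known, the clean class-posterior can be recovered by inversion; the matrix values are illustrative only.

```python
import numpy as np

# Toy 3-class, class-dependent transition matrix (columns sum to one,
# diagonally dominant): T[i, j] = P(noisy label = i | clean label = j).
T = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.7, 0.1],
              [0.1, 0.1, 0.8]])

# A clean class-posterior P(Y | X = x) for some instance x.
p_clean = np.array([0.6, 0.3, 0.1])

# Eq. (2): the noisy class-posterior is a convex combination of T's columns.
p_noisy = T @ p_clean

# With the true T, the clean posterior is recovered by inversion.
p_recovered = np.linalg.solve(T, p_noisy)
print(np.allclose(p_recovered, p_clean))  # True
```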

In recent years, considerable efforts have been invested in designing algorithms for estimating the transition matrix. These algorithms rely on a so-called anchor-point assumption which requires that there exist anchor points for each class (Liu & Tao, 2016; Xia et al., 2019).

Definition 1 (anchor-point assumption).

For each class $j\in\{1,2,\ldots,C\}$, there exists an instance $\boldsymbol{x}^{j}\in\mathcal{X}$ such that $P(Y=j|X=\boldsymbol{x}^{j})=1$.

Under the anchor-point assumption, the task of estimating the transition matrix boils down to finding anchor points for each class. For example, given an anchor point $\boldsymbol{x}^{j}$ for class $j$, we have

$P(\tilde{Y}=i\mid X=\boldsymbol{x}^{j})=\sum_{k=1}^{C}T_{ik}P(Y=k\mid X=\boldsymbol{x}^{j})=T_{ij}.$ (3)

Namely, the transition matrix can be obtained from the noisy class-posterior probabilities of anchor points. Assuming that we can accurately model the noisy class-posterior $P(\tilde{\boldsymbol{Y}}|X)$ given a sufficient number of noisy data, anchor points can be found as follows (Liu & Tao, 2016; Patrini et al., 2017):

$\boldsymbol{x}^{j}=\operatorname*{arg\,max}_{\boldsymbol{x}}P(\tilde{Y}=j|X=\boldsymbol{x}).$ (4)

However, when the anchor-point assumption is not satisfied, points found with Eq. (4) are no longer anchor points. Hence, the above method cannot consistently estimate the transition matrix with Eq. (3), which leads to a statistically inconsistent classifier. This motivates us to design a statistically classifier-consistent algorithm which can consistently estimate the transition matrix without anchor points.
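As a brief illustration of the anchor-point estimator in Eqs. (3) and (4), the toy sketch below (our own, with an assumed sampling setup) picks, for each class, the sample with the largest noisy class-posterior probability and uses its noisy posterior as the corresponding column of the estimated transition matrix; when no true anchor points are present among the samples, this estimate becomes biased.

```python
import numpy as np

def estimate_T_with_anchors(noisy_posteriors: np.ndarray) -> np.ndarray:
    """noisy_posteriors: (n, C) array, row i approximates P(noisy Y | X = x_i).

    Eq. (4): for each class j, pick x^j = argmax_x P(noisy Y = j | X = x).
    Eq. (3): the j-th column of T is then P(noisy Y | X = x^j).
    """
    C = noisy_posteriors.shape[1]
    T_hat = np.zeros((C, C))
    for j in range(C):
        anchor_idx = np.argmax(noisy_posteriors[:, j])
        T_hat[:, j] = noisy_posteriors[anchor_idx]
    return T_hat

# Toy check: generate noisy posteriors from a known T and clean posteriors
# that do contain anchor points (rows of the identity matrix).
rng = np.random.default_rng(0)
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.1, 0.1, 0.8]])
clean = rng.dirichlet(np.ones(3), size=500)
clean[:3] = np.eye(3)            # inject one anchor point per class
noisy = clean @ T.T              # row-wise P(noisy|x) = T @ P(clean|x)
print(np.abs(estimate_T_with_anchors(noisy) - T).max())  # ~0
```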

3 Volume Minimization Network

Figure 2: Overview of the proposed VolMinNet. The training in the proposed framework is carried out in an end-to-end manner with two objectives (red blocks) optimized simultaneously.

In this section, we propose a novel framework for label-noise learning which we call the Volume Minimization Network (VolMinNet). The proposed framework is end-to-end, and there is no need for identifying anchor points or a second stage for loss correction, resulting in better noise robustness than existing methods.

To learn the clean class-posterior $P(\boldsymbol{Y}|X)$, we define a transformation $h_{\boldsymbol{\theta}}:\mathcal{X}\rightarrow\Delta^{C-1}$, where $h_{\boldsymbol{\theta}}$ is a differentiable function represented by a neural network with parameters $\boldsymbol{\theta}$. To estimate the transition matrix, we construct a trainable diagonally dominant column-stochastic matrix $\hat{\boldsymbol{T}}$, i.e., $\hat{\boldsymbol{T}}\in[0,1]^{C\times C}$, $\sum_{i=1}^{C}\hat{\boldsymbol{T}}_{ij}=1$ and $\hat{\boldsymbol{T}}_{ii}>\hat{\boldsymbol{T}}_{ji}$ for any $j\neq i$. To learn the noisy class-posterior distribution from the noisy data, with some abuse of notation, we define the composition of $\hat{\boldsymbol{T}}$ and $h_{\boldsymbol{\theta}}$ as $\hat{\boldsymbol{T}}h_{\boldsymbol{\theta}}:\mathcal{X}\rightarrow\Delta^{C-1}$.

Intuitively, as explained in Section 1, if $\hat{\boldsymbol{T}}h_{\boldsymbol{\theta}}$ models $P(\tilde{\boldsymbol{Y}}|X)$ perfectly while the simplex of $\hat{\boldsymbol{T}}$ has the minimum volume, $\hat{\boldsymbol{T}}$ will converge to the true transition matrix $\boldsymbol{T}$ and $h_{\boldsymbol{\theta}}$ will converge to $P(\boldsymbol{Y}|X)$. This motivates the following criterion, which corresponds to a constrained optimization problem:

$\min_{\hat{\boldsymbol{T}}\in\mathbb{T}}\;\mathrm{vol}(\hat{\boldsymbol{T}})\quad\text{s.t.}\quad\hat{\boldsymbol{T}}h_{\boldsymbol{\theta}}=P(\tilde{\boldsymbol{Y}}|X),$ (5)

where $\mathbb{T}=\{\hat{\boldsymbol{T}}\in[0,1]^{C\times C}\;|\;\sum_{i=1}^{C}\hat{\boldsymbol{T}}_{ij}=1,\;\hat{\boldsymbol{T}}_{ii}>\hat{\boldsymbol{T}}_{ji},\;\forall\;i\neq j\}$ is the set of diagonally dominant column-stochastic matrices, and $\mathrm{vol}(\hat{\boldsymbol{T}})$ denotes a measure that is related or proportional to the volume of the simplex formed by the columns of $\hat{\boldsymbol{T}}$.
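For completeness (this identity is standard geometry and is not stated explicitly in the paper), note that all columns of a column-stochastic $\hat{\boldsymbol{T}}$ lie on the hyperplane $\boldsymbol{1}^{\top}\boldsymbol{x}=1$, so the volume of the corresponding simplex is proportional to the determinant:

$\mathrm{vol}\big(\mathrm{Sim}\{\hat{\boldsymbol{T}}\}\big)=\frac{\sqrt{C}}{(C-1)!}\,\big|\det(\hat{\boldsymbol{T}})\big|.$

This is why determinant-based measures such as $\det(\hat{\boldsymbol{T}})$ and $\log\det(\hat{\boldsymbol{T}})$, discussed below, serve as valid proxies for the simplex volume.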

To solve criterion (5), we first note that the constraint $\hat{\boldsymbol{T}}h_{\boldsymbol{\theta}}=P(\tilde{\boldsymbol{Y}}|X)$ can be enforced via expected risk minimization (Patrini et al., 2017). The risk is defined as $\tilde{R}(h_{\boldsymbol{\theta}})=\mathbb{E}_{(\boldsymbol{x},\tilde{y})\sim\tilde{\mathcal{D}}}[\ell(\hat{\boldsymbol{T}}h_{\boldsymbol{\theta}}(\boldsymbol{x}),\tilde{y})]$, where $\ell$ is a loss function; we use the cross-entropy loss throughout this paper. We can then rewrite criterion (5) as a Lagrangian under the KKT conditions (Karush, 1939; Kuhn & Tucker, 2014) to obtain:

$\mathcal{L}(\boldsymbol{\theta},\hat{\boldsymbol{T}}):=\mathrm{vol}(\hat{\boldsymbol{T}})+\beta\cdot\mathbb{E}_{(\boldsymbol{x},\tilde{y})\sim\tilde{\mathcal{D}}}[\ell(\hat{\boldsymbol{T}}h_{\boldsymbol{\theta}}(\boldsymbol{x}),\tilde{y})],$ (6)

where $\beta>0$ is the KKT multiplier. In the literature, various functions for measuring the volume of the simplex have been investigated (Fu et al., 2015; Li & Bioucas-Dias, 2008; Miao & Qi, 2007). Since $\hat{\boldsymbol{T}}$ is a square matrix, a common choice is $\mathrm{vol}(\hat{\boldsymbol{T}})=\det(\hat{\boldsymbol{T}})$, where $\det$ denotes the determinant. However, this function is numerically unstable and computationally hard to optimize. Hence, we adopt a popular alternative, $\log\det(\hat{\boldsymbol{T}})$, which has been widely used in low-rank matrix recovery and non-negative matrix decomposition (Fazel et al., 2003; Liu et al., 2012; Fu et al., 2016). Moreover, since we only have access to a set of noisy training examples $\{(\boldsymbol{x}_{i},\tilde{y}_{i})\}_{i=1}^{n}$ instead of the distribution $\tilde{\mathcal{D}}$, we employ the empirical risk for training. Formally, we propose the following objective function:

$\mathcal{L}(\boldsymbol{\theta},\hat{\boldsymbol{T}}):=\frac{1}{n}\sum_{i=1}^{n}\ell(\hat{\boldsymbol{T}}h_{\boldsymbol{\theta}}(\boldsymbol{x}_{i}),\tilde{y}_{i})+\lambda\cdot\log\det(\hat{\boldsymbol{T}}),$ (7)

where $\lambda>0$ is a regularization coefficient that balances distribution fidelity versus volume minimization.

It remains to design $\hat{\boldsymbol{T}}$ so that it is differentiable, diagonally dominant and column stochastic. Specifically, we first create a matrix $\boldsymbol{A}\in\mathbb{R}^{C\times C}$ whose diagonal elements are $\boldsymbol{A}_{ii}=1$ for all $i\in\{1,2,\ldots,C\}$, and whose off-diagonal elements are $\boldsymbol{A}_{ij}=\sigma(w_{ij})$ for all $i\neq j$, where $\sigma(w)=\frac{1}{1+e^{-w}}$ is the sigmoid function and each $w_{ij}\in\mathbb{R}$ is a real-valued variable updated throughout training. We then normalize each column, $\hat{\boldsymbol{T}}_{ij}=\frac{\boldsymbol{A}_{ij}}{\sum_{k=1}^{C}\boldsymbol{A}_{kj}}$, so that the sum of each column of $\hat{\boldsymbol{T}}$ equals one. Since the sigmoid function returns a value in $(0,1)$, we have $\hat{\boldsymbol{T}}_{ii}>\hat{\boldsymbol{T}}_{ji}$ for all $i\neq j$. With this specially designed $\hat{\boldsymbol{T}}$, we ensure that $\hat{\boldsymbol{T}}\in\mathbb{T}$, i.e., $\hat{\boldsymbol{T}}$ is a diagonally dominant column-stochastic matrix. In addition, $\hat{\boldsymbol{T}}$ is differentiable because the sigmoid function and the normalization operation are differentiable.
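A minimal PyTorch sketch of this parameterization is given below; the module name `SigT` and the initialization value are our own choices for illustration (the initialization follows the $\ln\frac{1}{C-2}$ scheme reported in Section 6), and the code is not the authors' released implementation.

```python
import math
import torch
import torch.nn as nn

class SigT(nn.Module):
    """Trainable diagonally dominant, column-stochastic transition matrix."""

    def __init__(self, num_classes: int, init: float = -2.0):
        super().__init__()
        # Off-diagonal logits w_ij; the diagonal of A is fixed to 1.
        self.w = nn.Parameter(torch.full((num_classes, num_classes), init))
        self.register_buffer("eye", torch.eye(num_classes))

    def forward(self) -> torch.Tensor:
        # A_ii = 1, A_ij = sigmoid(w_ij) for i != j.
        A = self.eye + torch.sigmoid(self.w) * (1.0 - self.eye)
        # Column normalization makes every column sum to one while keeping
        # the diagonal entry the largest in its column.
        return A / A.sum(dim=0, keepdim=True)

T_hat = SigT(num_classes=10, init=math.log(1.0 / (10 - 2)))
print(T_hat().sum(dim=0))  # each column sums to 1
```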

With both $\hat{\boldsymbol{T}}$ and $h_{\boldsymbol{\theta}}$ being differentiable, the objective in Eq. (7) can be optimized with any standard gradient-based learning rule. This allows us to replace the two-stage loss correction procedure in existing works with an end-to-end learning framework. See Figure 2 for an illustration of the proposed learning framework; a code sketch of one training step follows below.
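The sketch below shows how one optimization step of Eq. (7) could look in PyTorch, reusing the hypothetical `SigT` module above; `model` is assumed to output logits, and the snippet is a simplified illustration rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def volmin_step(model, T_module, x, noisy_y, opt_model, opt_T, lam=1e-4):
    """One optimization step of Eq. (7) on a mini-batch (x, noisy_y)."""
    opt_model.zero_grad()
    opt_T.zero_grad()

    clean_post = F.softmax(model(x), dim=1)   # h_theta(x), estimate of P(Y|x)
    T_hat = T_module()                        # current transition matrix
    noisy_post = clean_post @ T_hat.t()       # row-wise T_hat @ h_theta(x)

    ce = F.nll_loss(torch.log(noisy_post + 1e-12), noisy_y)  # cross entropy
    vol = torch.logdet(T_hat)                 # log det(T_hat) term of Eq. (7)
    loss = ce + lam * vol

    loss.backward()
    opt_model.step()
    opt_T.step()
    return loss.item()
```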

4 Theoretical Results

In this section, we show that criterion (5) guarantees the consistency of the estimated transition matrix and the learned classifier under the sufficiently scattered assumption. We also show that the anchor-point assumption is a special case of the sufficiently scattered assumption. We first give a formal definition of the sufficiently scattered assumption:

Figure 3: Illustration of the anchor-point assumption and the sufficiently scattered assumption in the case of $C=3$, assuming the viewer is facing the hyperplane $\boldsymbol{1}^{\top}\boldsymbol{x}=1$ from the positive orthant. The dots are class-posterior probabilities $P(\boldsymbol{Y}|X=\boldsymbol{x})$; the triangle is the non-negative orthant; the inner circle is $\mathcal{R}$; the region bounded by the red lines is $\mathrm{cone}\{\boldsymbol{H}\}$. Clearly, the anchor-point assumption (a) is a special case of the sufficiently scattered assumption (b).
Definition 2 (Sufficiently Scattered).

The clean class-posterior $P(\boldsymbol{Y}|X)$ is said to be sufficiently scattered if there exists a set $\mathcal{H}=\{x_{1},\ldots,x_{m}\}\subseteq\mathcal{X}$ such that the matrix $\boldsymbol{H}=[P(\boldsymbol{Y}|X=x_{1}),\ldots,P(\boldsymbol{Y}|X=x_{m})]$ satisfies: (1) $\mathcal{R}\subseteq\mathrm{cone}\{\boldsymbol{H}\}$, where $\mathcal{R}=\{\boldsymbol{v}\in\mathbb{R}^{C}\mid\boldsymbol{v}^{\top}\boldsymbol{1}\geq\sqrt{C-1}\|\boldsymbol{v}\|_{2}\}$ and $\mathrm{cone}\{\boldsymbol{H}\}$ denotes the convex cone formed by the columns of $\boldsymbol{H}$; (2) $\mathrm{cone}\{\boldsymbol{H}\}\nsubseteq\mathrm{cone}\{\boldsymbol{Q}\}$ for any unitary matrix $\boldsymbol{Q}\in\mathbb{R}^{C\times C}$ that is not a permutation matrix.

This assumption is adapted from previous works on non-negative matrix decomposition (Fu et al., 2015, 2018) with necessary modifications. Intuitively, in the case of $C=3$, $\mathcal{R}$ corresponds to a "ball" tangential to the triangle formed by a permutation matrix, e.g., $\boldsymbol{I}=[\boldsymbol{e}_{1},\boldsymbol{e}_{2},\boldsymbol{e}_{3}]$. Viewed on this hyperplane, $\mathrm{cone}\{\boldsymbol{H}\}$ is a polytope inside this triangle. The columns of $\boldsymbol{Q}$ also form triangles, which are rotated versions of the triangle defined by permutation matrices; the facets of those triangles are also tangential to $\mathcal{R}$. Condition (1) of the sufficiently scattered assumption requires that $\mathcal{R}$ is enclosed by $\mathrm{cone}\{\boldsymbol{H}\}$, i.e., $\mathcal{R}$ is a subset of $\mathrm{cone}\{\boldsymbol{H}\}$. Condition (2) ensures that, given condition (1), $\mathrm{cone}\{\boldsymbol{H}\}$ is enclosed only by the triangle formed by a permutation matrix and not by any other unitary matrix.
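Condition (1) can also be probed numerically for a given matrix $\boldsymbol{H}$: since $\mathcal{R}$ is a rotationally symmetric cone around the all-ones direction, one can sample rays on its boundary and test membership in $\mathrm{cone}\{\boldsymbol{H}\}$ with non-negative least squares. The sketch below is our own sampling-based sanity check (not part of the paper's method) and provides numerical evidence rather than a certificate.

```python
import numpy as np
from scipy.optimize import nnls

def check_condition_one(H: np.ndarray, n_dirs: int = 2000, tol: float = 1e-6) -> bool:
    """Sampling-based check of condition (1): R is contained in cone{H}."""
    C = H.shape[0]
    ones = np.ones(C) / np.sqrt(C)
    cos_t, sin_t = np.sqrt((C - 1) / C), np.sqrt(1 / C)
    rng = np.random.default_rng(0)
    for _ in range(n_dirs):
        w = rng.standard_normal(C)
        w -= (w @ ones) * ones        # make w orthogonal to the all-ones vector
        w /= np.linalg.norm(w)
        v = cos_t * ones + sin_t * w  # a boundary ray of R
        _, residual = nnls(H, v)      # v lies in cone{H} iff residual ~ 0
        if residual > tol:
            return False
    return True

# Anchor points present: H = I satisfies condition (1).
print(check_condition_one(np.eye(3)))   # True
```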

To understand the sufficiently scattered assumption and its relationship with the anchor-point assumption, we provide several examples in Figure 3. In Figure 3.(a), we show a situation where both the anchor-point assumption and sufficiently scattered assumption are satisfied. Figure 3.(b) shows a situation where the sufficiently scattered assumption is satisfied while the anchor-point assumption is violated. In Figure 3.(c) and 3.(d), both assumptions are violated. However, in Figure 3.(d), only condition (2) of the sufficiently scattered assumption is violated while both conditions of the sufficiently scattered assumption are violated in 3.(c).

The first observation is that if the anchor-point assumption is satisfied, then the sufficiently scattered assumption must hold, but not vice versa. Intuitively, if the anchor-point assumption is satisfied, then there exists a matrix $\boldsymbol{H}=[P(\boldsymbol{Y}|X=\boldsymbol{x}^{1}),\ldots,P(\boldsymbol{Y}|X=\boldsymbol{x}^{C})]=\boldsymbol{I}$, where $\boldsymbol{x}^{1},\ldots,\boldsymbol{x}^{C}$ are anchor points for the different classes and $\boldsymbol{I}$ is the identity matrix. From Figure 3.(a), it is clear that $\mathcal{R}\subseteq\mathrm{cone}\{\boldsymbol{I}\}=\mathrm{cone}\{\boldsymbol{H}\}$, and $\mathrm{cone}\{\boldsymbol{H}\}$ can only be enclosed by the convex cone of permutation matrices. This shows that the sufficiently scattered assumption is satisfied. In contrast, Figure 3.(b) shows a case where the sufficiently scattered assumption is satisfied but the anchor-point assumption is not. Formally, we show that:

Proposition 1.

The anchor-point assumption is a sufficient but not necessary condition for the sufficiently scattered assumption when $C>2$.

The proof of Proposition 1 is included in the supplementary material. Proposition 1 implies that the anchor-point assumption is a special case of the sufficiently scattered assumption. This means that the proposed framework can also deal with the case where the anchor-point assumption holds. Under the sufficiently scattered assumption, we get our main result:

Theorem 1.

Given sufficiently many noisy data, if $P(\boldsymbol{Y}|X)$ is sufficiently scattered, then $\hat{\boldsymbol{T}}_{\star}=\boldsymbol{T}$ and $h_{\boldsymbol{\theta}_{\star}}(\boldsymbol{x})=P(\boldsymbol{Y}|X=\boldsymbol{x})$ must hold, where $(\hat{\boldsymbol{T}}_{\star},\boldsymbol{\theta}_{\star})$ are optimal solutions of Eq. (5).

The proof of Theorem 1 can be found in the supplementary material. Intuitively, if $P(\boldsymbol{Y}|X)$ is sufficiently scattered, the noisy class-posterior $P(\tilde{\boldsymbol{Y}}|X)$ will be sufficiently spread within the simplex formed by the columns of $\boldsymbol{T}$. Then, finding the minimum-volume convex hull enclosing $P(\tilde{\boldsymbol{Y}}|X)$ recovers the ground-truth $\boldsymbol{T}$ and $P(\boldsymbol{Y}|X)$.

5 Related Works

In this section, we review existing methods in label-noise learning. Based on the statistical consistency of the learned classifier, we divide existing methods for label-noise learning into two categories: heuristic algorithms and statistically consistent algorithms.

Methods in the first category focus on employing heuristics to reduce the side effects of noisy labels. For example, many methods use a specially designed strategy to select reliable samples (Yu et al., 2019; Han et al., 2018b; Malach & Shalev-Shwartz, 2017; Ren et al., 2018; Jiang et al., 2018; Yao et al., 2020a) or correct labels (Ma et al., 2018; Kremer et al., 2018; Tanaka et al., 2018; Reed et al., 2015). Although these methods work well empirically, none of them provides a theoretical guarantee on the consistency of the learned classifier.

Statistically consistent algorithms are primarily developed based on a loss correction procedure (Liu & Tao, 2016; Patrini et al., 2017; Zhang & Sabuncu, 2018). For these methods, the noise transition matrix plays a key role in building consistent classifiers. For example, Patrini et al. (2017) leveraged a two-stage training procedure, first estimating the noise transition matrix and then using it to modify the loss to ensure risk consistency. These works rely on anchor points, i.e., instances belonging to a specific class with probability one or approximately one. When there are no anchor points in the datasets or data distributions, none of the aforementioned methods can guarantee statistical consistency. Another approach is to jointly learn the noise transition matrix and the classifier. For instance, a constrained linear layer (Sukhbaatar et al., 2015) or a nonlinear softmax layer (Goldberger & Ben-Reuven, 2017) can be added on top of the softmax layer of the classification network to model the noise transition matrix. Zhang et al. (2021) concurrently proposed a one-step method for the label-noise learning problem and a derivative-free method for estimating the transition matrix. Specifically, their method uses a total variation regularization term to prevent the overconfidence problem of the neural network, which leads to a more accurate noisy class-posterior. However, the anchor-point assumption is still needed for their method, and, based on different motivations, assumptions and learning objectives, their method achieves different theoretical results from ours. Learning with noisy labels is also closely related to learning with complementary labels, where only complementary labels instead of noisy labels are given for training (Yu et al., 2018; Chou et al., 2020; Feng et al., 2020).

Recently, some methods exploiting semi-supervised learning techniques have been proposed to solve the label-noise learning problem, such as SELF (Nguyen et al., 2019) and DivideMix (Li et al., 2019). These methods are aggregations of multiple techniques such as data augmentation and multiple networks, and they significantly improve noise robustness. However, they are sensitive to the choice of hyperparameters, and changes in data or noise type can lead to degenerated classifiers. In addition, their computational cost increases significantly compared with previous methods.

6 Experiments

Figure 4: Transition matrix estimation error on MNIST, CIFAR-10 and CIFAR-100. The shaded area in each figure indicates the standard deviation. Lower is better.

In this section, we verify the robustness of the proposed Volume Minimization Network (VolMinNet) from two aspects: the estimation error of the transition matrix and the classification accuracy.

Datasets. We evaluate the proposed method on three synthetic noisy datasets, i.e., MNIST, CIFAR-10 and CIFAR-100, and one real-world noisy dataset, i.e., Clothing1M. We leave out 10% of the training examples as the validation set. The three synthetic datasets contain clean data; we corrupt the training and validation sets manually according to transition matrices. Specifically, we conduct experiments with two commonly used types of noise: (1) symmetry flipping (Patrini et al., 2017); (2) pair flipping (Han et al., 2018b). We report both the classification accuracy on the test set and the estimation error between the estimated transition matrix $\hat{\boldsymbol{T}}$ and the true transition matrix $\boldsymbol{T}$. All experiments are repeated five times on all datasets. Following T-Revision (Xia et al., 2019), we also conduct experiments on datasets where possible anchor points are removed. The details and more experimental results can be found in the supplementary material.
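For concreteness, the sketch below shows how the two noise types could be simulated: it builds a symmetry-flipping or pair-flipping transition matrix for a given noise rate and samples noisy labels column-wise. The construction follows the standard definitions (Patrini et al., 2017; Han et al., 2018b), but it is our own illustration, not the exact corruption script used in the experiments.

```python
import numpy as np

def symmetric_T(num_classes: int, noise_rate: float) -> np.ndarray:
    """Symmetry flipping: flip to every other class with equal probability."""
    T = np.full((num_classes, num_classes), noise_rate / (num_classes - 1))
    np.fill_diagonal(T, 1.0 - noise_rate)
    return T  # column j = P(noisy label | clean label = j)

def pair_T(num_classes: int, noise_rate: float) -> np.ndarray:
    """Pair flipping: flip only to the next class (mod C)."""
    T = np.eye(num_classes) * (1.0 - noise_rate)
    for j in range(num_classes):
        T[(j + 1) % num_classes, j] = noise_rate
    return T

def corrupt_labels(clean_labels: np.ndarray, T: np.ndarray, seed: int = 0) -> np.ndarray:
    """Sample a noisy label for each clean label from the column T[:, y]."""
    rng = np.random.default_rng(seed)
    C = T.shape[0]
    return np.array([rng.choice(C, p=T[:, y]) for y in clean_labels])

noisy = corrupt_labels(np.zeros(1000, dtype=int), symmetric_T(10, 0.5))
print((noisy == 0).mean())  # roughly 0.5 for Sym-50% noise
```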

Clothing1M is a real-world noisy dataset consisting of 1M images with real-world noisy labels. Existing methods such as Forward (Patrini et al., 2017) and T-Revision (Xia et al., 2019) use the additional 50k clean training data to help initialize the transition matrix and validate on 14k clean validation data. Here we use another setting which is also commonly used in the literature (Xia et al., 2020): we only exploit the 1M noisy data for both transition matrix estimation and classifier training. Specifically, we leave out 10% of the noisy training examples as a noisy validation set for model selection. We consider this setting more natural as it does not require any clean data. All results of baseline methods are quoted from PTD (Xia et al., 2020), which uses the same setting.

Network structure and optimization. For a fair comparison, we implement all methods with default parameters in PyTorch on a Tesla V100-SXM2. For MNIST, we use a LeNet-5 network. SGD is used to train the classification network $h_{\boldsymbol{\theta}}$ with batch size 128, momentum 0.9, weight decay $10^{-3}$ and a learning rate of $10^{-2}$. Adam with default parameters is used to train the transition matrix $\hat{\boldsymbol{T}}$. The algorithm is run for 60 epochs. For CIFAR-10, we use a ResNet-18 network. SGD is used to train both the classification network $h_{\boldsymbol{\theta}}$ and the transition matrix $\hat{\boldsymbol{T}}$ with batch size 128, momentum 0.9, weight decay $10^{-3}$ and an initial learning rate of $10^{-2}$. The algorithm is run for 150 epochs and the learning rate is divided by 10 after the 30th and 60th epochs. For CIFAR-100, we use a ResNet-32 network. SGD is used to train the classification network $h_{\boldsymbol{\theta}}$ with batch size 128, momentum 0.9, weight decay $10^{-3}$ and an initial learning rate of $10^{-2}$. Adam with default parameters is used to train the transition matrix $\hat{\boldsymbol{T}}$. The algorithm is run for 150 epochs and the learning rate is divided by 10 after the 30th and 60th epochs. For CIFAR-10 and CIFAR-100, we perform data augmentation by horizontal random flips and $32\times 32$ random crops after padding 4 pixels on each side. For Clothing1M, we use a ResNet-50 pre-trained on ImageNet and only use the 1M noisy data to train and validate the network. For the optimization, SGD is used to train both the classification network $h_{\boldsymbol{\theta}}$ and the transition matrix $\hat{\boldsymbol{T}}$ with momentum 0.9, weight decay $10^{-3}$ and batch size 32, run with learning rates $2\times 10^{-3}$ and $2\times 10^{-5}$ for 5 epochs each. For each epoch, we ensure the noisy labels for each class are balanced by undersampling. Throughout all experiments, we fix $\lambda=10^{-4}$, and the trainable weights $w$ of $\hat{\boldsymbol{T}}$ are initialized with $\ln\frac{1}{C-2}$ (roughly $-2$ for MNIST and CIFAR-10, $-4.5$ for CIFAR-100 and $-2.5$ for Clothing1M).
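A compact sketch of the two-optimizer setup (SGD for $h_{\boldsymbol{\theta}}$, Adam for $\hat{\boldsymbol{T}}$, as used on MNIST and CIFAR-100, with the learning-rate schedule used on CIFAR) is given below; it reuses the hypothetical `SigT` and `volmin_step` sketches from Section 3, the classifier is a simple stand-in, and the hyperparameters only mirror the values quoted above.

```python
import math
import torch
import torch.nn as nn

C = 10
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, C))  # stand-in classifier
T_module = SigT(num_classes=C, init=math.log(1.0 / (C - 2)))

opt_model = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9,
                            weight_decay=1e-3)
opt_T = torch.optim.Adam(T_module.parameters())   # default Adam settings
scheduler = torch.optim.lr_scheduler.MultiStepLR(opt_model,
                                                 milestones=[30, 60], gamma=0.1)

# Training loop sketch:
# for epoch in range(num_epochs):
#     for x, noisy_y in noisy_loader:
#         volmin_step(model, T_module, x, noisy_y, opt_model, opt_T, lam=1e-4)
#     scheduler.step()
```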

| Method | MNIST Sym-20% | MNIST Sym-50% | CIFAR-10 Sym-20% | CIFAR-10 Sym-50% | CIFAR-100 Sym-20% | CIFAR-100 Sym-50% |
|---|---|---|---|---|---|---|
| Decoupling | 97.04±0.06 | 94.58±0.08 | 77.32±0.35 | 54.07±0.46 | 41.92±0.49 | 22.63±0.44 |
| MentorNet | 97.21±0.06 | 95.56±0.15 | 81.35±0.23 | 73.47±0.15 | 42.88±0.41 | 32.66±0.40 |
| Co-teaching | 97.07±0.10 | 95.20±0.23 | 82.27±0.07 | 75.55±0.07 | 48.48±0.66 | 36.77±0.52 |
| Forward | 98.60±0.19 | 97.77±0.16 | 85.20±0.80 | 74.82±0.78 | 54.90±0.74 | 41.85±0.71 |
| T-Revision | 98.72±0.10 | 98.23±0.10 | 87.95±0.36 | 80.01±0.62 | 62.72±0.69 | 49.12±0.22 |
| DMI | 98.70±0.02 | 98.12±0.21 | 87.54±0.20 | 82.68±0.21 | 62.65±0.39 | 52.42±0.64 |
| Dual T | 98.43±0.05 | 98.15±0.12 | 88.35±0.33 | 82.54±0.19 | 62.16±0.58 | 52.49±0.37 |
| VolMinNet | 98.74±0.08 | 98.23±0.16 | 89.58±0.26 | 83.37±0.25 | 64.94±0.40 | 53.89±1.26 |

| Method | MNIST Pair-20% | MNIST Pair-45% | CIFAR-10 Pair-20% | CIFAR-10 Pair-45% | CIFAR-100 Pair-20% | CIFAR-100 Pair-45% |
|---|---|---|---|---|---|---|
| Decoupling | 96.93±0.07 | 94.34±0.54 | 77.12±0.30 | 53.71±0.99 | 40.12±0.26 | 27.97±0.12 |
| MentorNet | 96.89±0.04 | 91.98±0.46 | 77.42±0.00 | 61.03±0.20 | 39.22±0.47 | 26.48±0.37 |
| Co-teaching | 97.00±0.06 | 96.25±0.01 | 80.65±0.20 | 73.02±0.23 | 42.79±0.79 | 27.97±0.20 |
| Forward | 98.84±0.10 | 75.06±12.61 | 88.21±0.48 | 77.44±6.89 | 56.12±0.54 | 36.88±2.32 |
| T-Revision | 98.89±0.08 | 84.56±8.18 | 90.33±0.52 | 78.94±2.58 | 64.33±0.49 | 41.55±0.95 |
| DMI | 98.84±0.09 | 97.92±0.76 | 89.89±0.45 | 73.15±7.31 | 59.56±0.73 | 38.17±2.02 |
| Dual T | 98.86±0.04 | 96.71±0.12 | 89.77±0.25 | 76.53±2.51 | 67.21±0.43 | 47.60±0.43 |
| VolMinNet | 99.01±0.07 | 99.00±0.07 | 90.37±0.30 | 88.54±0.21 | 68.45±0.69 | 58.90±0.89 |

Table 1: Classification accuracy (percentage) on MNIST, CIFAR-10 and CIFAR-100.

6.1 Transition Matrix Estimation

For evaluating the effectiveness of estimating the transition matrix, we compare the proposed method with the following methods: (1) T-estimator max (Patrini et al., 2017), which identifies instances with extreme-valued noisy class-posterior probabilities from the given samples to estimate the transition matrix; (2) T-estimator 3% (Patrini et al., 2017), which takes an $\alpha$-percentile in place of the arg max in Eq. (4); (3) T-Revision (Xia et al., 2019), which introduces a slack variable to revise the noise transition matrix after initializing it with the T-estimator; (4) Dual T-estimator (Yao et al., 2020b), which introduces an intermediate class to avoid directly estimating the noisy class-posterior and factorizes the transition matrix into the product of two easy-to-estimate transition matrices.

To show that the proposed method is more robust in estimating the transition matrix, we plot the estimation error of the transition matrix, i.e., $\|T-\hat{T}-\Delta\hat{T}\|_{1}/\|T\|_{1}$. Figure 4 depicts the estimation errors of the transition matrices estimated by the proposed VolMinNet and the baseline methods. For all noise settings on the three datasets (original intact datasets), VolMinNet consistently gives better results than the baselines, which shows its superior robustness against label noise. For example, on CIFAR-100 with Pair-45% noise, our method achieves an estimation error of around 0.25, while the baseline methods only reach around 0.75. These results show that our method establishes a new state of the art in estimating transition matrices.
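The reported estimation error can be computed as in the minimal sketch below, where `delta_T` is the slack variable used by T-Revision and is zero for methods without one.

```python
import numpy as np

def estimation_error(T_true, T_hat, delta_T=None):
    """Relative entry-wise L1 error: ||T - T_hat - delta_T||_1 / ||T||_1."""
    if delta_T is None:
        delta_T = np.zeros_like(T_hat)
    return np.abs(T_true - T_hat - delta_T).sum() / np.abs(T_true).sum()
```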

6.2 Classification Accuracy Evaluation

We compare the classification accuracy of the proposed method with the following methods: (1) Decoupling (Malach & Shalev-Shwartz, 2017); (2) MentorNet (Jiang et al., 2018); (3) Co-teaching (Han et al., 2018b); (4) Forward (Patrini et al., 2017); (5) T-Revision (Xia et al., 2019); (6) DMI (Xu et al., 2019); (7) Dual T (Yao et al., 2020b). Note that we do not compare the proposed method with methods such as SELF (Nguyen et al., 2019) and DivideMix (Li et al., 2019), because they are aggregations of semi-supervised learning techniques which have high computational complexity and are sensitive to the choice of hyperparameters. In this work, we focus on solving label-noise learning without anchor points in a theoretically grounded manner.

In Table 1, we present the classification accuracy of the proposed VolMinNet and the baseline methods on the synthetic noisy datasets. VolMinNet outperforms the baseline methods in almost all noise settings. This result is expected, given that VolMinNet achieves a smaller estimation error of the transition matrix than the baseline methods. While the differences in accuracy among different methods are marginal for symmetric noise, VolMinNet outperforms the baselines by over 10% with Pair-45% noise and has much smaller standard deviations. These results show the clear advantage of the proposed VolMinNet: it is more robust to different noise settings and datasets than the baseline methods.

| Method | Accuracy |
|---|---|
| Decoupling | 54.53 |
| MentorNet | 56.79 |
| Co-teaching | 60.15 |
| Forward | 69.91 |
| T-Revision | 70.97 |
| DMI | 70.12 |
| PTD | 71.67 |
| VolMinNet | 72.42 |

Table 2: Classification accuracy (percentage) on Clothing1M. Only noisy data are exploited for training and validation.

Finally, we show the results on Clothing1M in Table 2. As explained in the previous section, Forward and T-Revision exploit the 50k clean data and their noisy versions within the 1M noisy data to help initialize the noise transition matrix, which is not practical in real-world settings. For a fair comparison, we report results obtained by only using the 1M noisy data to train and validate the network. As shown in Table 2, our method outperforms previous transition-matrix-based methods and heuristic methods on the Clothing1M dataset. In addition, the performance on Clothing1M shows that the proposed method also has a certain robustness against instance-dependent noise.

7 Discussion and Conclusion

In this paper, we considered the problem of label-noise learning without anchor points and relaxed the anchor-point assumption with the proposed VolMinNet. The consistency of the estimated transition matrix and the learned classifier is theoretically proven under the sufficiently scattered assumption. Experimental results demonstrate the robustness of the proposed VolMinNet. Future work should focus on improving the estimation of the noisy class-posterior, which we believe is the bottleneck of our method.

Acknowledgements

XL was supported by an Australian Government RTP Scholarship. TL was supported by Australian Research Council Project DE-190101473. BH was supported by the RGC Early Career Scheme No. 22200720, NSFC Young Scientists Fund No. 62006202 and HKBU CSD Departmental Incentive Grant. GN and MS were supported by JST AIP Acceleration Research Grant Number JPMJCR20U3, Japan. MS was also supported by Institute for AI and Beyond, UTokyo. The authors also thank Dr. Alan Blair, Kevin Lam, Yivan Zhang and the members of the Trustworthy Machine Learning Lab at the University of Sydney for their help.

References

  • Arpit et al. (2017) Arpit, D., Jastrzębski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In ICML, pp.  233–242, 2017.
  • Berthon et al. (2021) Berthon, A., Han, B., Niu, G., Liu, T., and Sugiyama, M. Confidence scores make instance-dependent label-noise learning possible. In ICML, 2021.
  • Boyd et al. (2004) Boyd, S., Boyd, S. P., and Vandenberghe, L. Convex optimization. Cambridge university press, 2004.
  • Chou et al. (2020) Chou, Y.-T., Niu, G., Lin, H.-T., and Sugiyama, M. Unbiased risk estimators can mislead: A case study of learning with complementary labels. In ICML, pp.  1929–1938. PMLR, 2020.
  • Daniely & Granot (2019) Daniely, A. and Granot, E. Generalization bounds for neural networks via approximate description length. In NeurIPS, pp.  13008–13016, 2019.
  • Fazel et al. (2003) Fazel, M., Hindi, H., and Boyd, S. P. Log-det heuristic for matrix rank minimization with applications to hankel and euclidean distance matrices. In ACC, pp.  2156–2162, 2003.
  • Feng et al. (2020) Feng, L., Kaneko, T., Han, B., Niu, G., An, B., and Sugiyama, M. Learning with multiple complementary labels. In ICML, pp.  3072–3081. PMLR, 2020.
  • Fu et al. (2015) Fu, X., Ma, W.-K., Huang, K., and Sidiropoulos, N. D. Blind separation of quasi-stationary sources: Exploiting convex geometry in covariance domain. IEEE Transactions on Signal Processing, 63(9):2306–2320, 2015.
  • Fu et al. (2016) Fu, X., Huang, K., Yang, B., Ma, W.-K., and Sidiropoulos, N. D. Robust volume minimization-based matrix factorization for remote sensing and document clustering. IEEE Transactions on Signal Processing, 64(23):6254–6268, 2016.
  • Fu et al. (2018) Fu, X., Huang, K., and Sidiropoulos, N. D. On identifiability of nonnegative matrix factorization. IEEE Signal Processing Letters, 25(3):328–332, 2018.
  • Goldberger & Ben-Reuven (2017) Goldberger, J. and Ben-Reuven, E. Training deep neural-networks using a noise adaptation layer. In ICLR, 2017.
  • Han et al. (2018a) Han, B., Yao, J., Niu, G., Zhou, M., Tsang, I., Zhang, Y., and Sugiyama, M. Masking: A new perspective of noisy supervision. In NeurIPS, pp.  5836–5846, 2018a.
  • Han et al. (2018b) Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pp.  8527–8537, 2018b.
  • Han et al. (2020a) Han, B., Niu, G., Yu, X., Yao, Q., Xu, M., Tsang, I., and Sugiyama, M. Sigua: Forgetting may make learning with noisy labels more robust. In ICML, pp.  4006–4016. PMLR, 2020a.
  • Han et al. (2020b) Han, B., Yao, Q., Liu, T., Niu, G., Tsang, I. W., Kwok, J. T., and Sugiyama, M. A survey of label-noise representation learning: Past, present and future, 2020b.
  • Jiang et al. (2018) Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pp.  2309–2318, 2018.
  • Karush (1939) Karush, W. Minima of functions of several variables with inequalities as side constraints. M. Sc. Dissertation. Dept. of Mathematics, Univ. of Chicago, 1939.
  • Kremer et al. (2018) Kremer, J., Sha, F., and Igel, C. Robust active label correction. In AISTATS, pp.  308–316, 2018.
  • Kuhn & Tucker (2014) Kuhn, H. W. and Tucker, A. W. Nonlinear programming. In Traces and emergence of nonlinear programming, pp. 247–258. Springer, 2014.
  • Li & Bioucas-Dias (2008) Li, J. and Bioucas-Dias, J. M. Minimum volume simplex analysis: A fast algorithm to unmix hyperspectral data. In IGARSS, pp.  III–250, 2008.
  • Li et al. (2019) Li, J., Socher, R., and Hoi, S. C. Dividemix: Learning with noisy labels as semi-supervised learning. In ICLR, 2019.
  • Li et al. (2017) Li, W., Wang, L., Li, W., Agustsson, E., and Van Gool, L. Webvision database: Visual learning and understanding from web data. arXiv preprint arXiv:1708.02862, 2017.
  • Liu et al. (2012) Liu, G., Lin, Z., Yan, S., Sun, J., Yu, Y., and Ma, Y. Robust recovery of subspace structures by low-rank representation. IEEE transactions on pattern analysis and machine intelligence, 35(1):171–184, 2012.
  • Liu & Tao (2016) Liu, T. and Tao, D. Classification with noisy labels by importance reweighting. IEEE Transactions on pattern analysis and machine intelligence, 38(3):447–461, 2016.
  • Liu & Guo (2020) Liu, Y. and Guo, H. Peer loss functions: Learning from noisy labels without knowing noise rates. In ICML, pp.  6226–6236. PMLR, 2020.
  • Ma et al. (2018) Ma, X., Wang, Y., Houle, M. E., Zhou, S., Erfani, S. M., Xia, S.-T., Wijewickrema, S., and Bailey, J. Dimensionality-driven learning with noisy labels. In ICML, pp.  3361–3370, 2018.
  • Malach & Shalev-Shwartz (2017) Malach, E. and Shalev-Shwartz, S. Decoupling "when to update" from "how to update". In NeurIPS, pp. 960–970, 2017.
  • Miao & Qi (2007) Miao, L. and Qi, H. Endmember extraction from highly mixed data using minimum volume constrained nonnegative matrix factorization. IEEE Transactions on Geoscience and Remote Sensing, 45(3):765–777, 2007.
  • Natarajan et al. (2013) Natarajan, N., Dhillon, I. S., Ravikumar, P. K., and Tewari, A. Learning with noisy labels. In NeurIPS, pp.  1196–1204, 2013.
  • Nguyen et al. (2019) Nguyen, D. T., Mummadi, C. K., Ngo, T. P. N., Nguyen, T. H. P., Beggel, L., and Brox, T. Self: Learning to filter noisy labels with self-ensembling. In ICLR, 2019.
  • Northcutt et al. (2017) Northcutt, C. G., Wu, T., and Chuang, I. L. Learning with confident examples: Rank pruning for robust classification with noisy labels. In UAI, 2017.
  • Patrini et al. (2017) Patrini, G., Rozza, A., Krishna Menon, A., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, pp.  1944–1952, 2017.
  • Reed et al. (2015) Reed, S. E., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. Training deep neural networks on noisy labels with bootstrapping. In ICLR, Workshop Track Proceedings, 2015.
  • Ren et al. (2018) Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In ICML, pp.  4331–4340, 2018.
  • Scott (2015) Scott, C. A rate of convergence for mixture proportion estimation, with application to learning from noisy labels. In AISTATS, pp.  838–846, 2015.
  • Sukhbaatar et al. (2015) Sukhbaatar, S., Estrach, J. B., Paluri, M., Bourdev, L., and Fergus, R. Training convolutional networks with noisy labels. In ICLR, 2015.
  • Tanaka et al. (2018) Tanaka, D., Ikami, D., Yamasaki, T., and Aizawa, K. Joint optimization framework for learning with noisy labels. In CVPR, pp.  5552–5560, 2018.
  • Thekumparampil et al. (2018) Thekumparampil, K. K., Khetan, A., Lin, Z., and Oh, S. Robustness of conditional gans to noisy labels. In NeurIPS, pp.  10271–10282, 2018.
  • Wu et al. (2021) Wu, S., Xia, X., Liu, T., Han, B., Gong, M., Wang, N., Liu, H., and Niu, G. Class2simi: A new perspective on learning with label noise. In ICML. PMLR, 2021.
  • Xia et al. (2019) Xia, X., Liu, T., Wang, N., Han, B., Gong, C., Niu, G., and Sugiyama, M. Are anchor points really indispensable in label-noise learning? In NeurIPS, pp.  6835–6846, 2019.
  • Xia et al. (2020) Xia, X., Liu, T., Han, B., Wang, N., Gong, M., Liu, H., Niu, G., Tao, D., and Sugiyama, M. Part-dependent label noise: Towards instance-dependent label noise. In NeurIPS, 2020.
  • Xia et al. (2021) Xia, X., Liu, T., Han, B., Gong, C., Wang, N., Ge, Z., and Chang, Y. Robust early-learning: Hindering the memorization of noisy labels. In ICLR, 2021.
  • Xiao et al. (2015) Xiao, T., Xia, T., Yang, Y., Huang, C., and Wang, X. Learning from massive noisy labeled data for image classification. In CVPR, pp.  2691–2699, 2015.
  • Xu et al. (2019) Xu, Y., Cao, P., Kong, Y., and Wang, Y. L_dmi: A novel information-theoretic loss function for training deep nets robust to label noise. In NeurIPS, pp.  6222–6233, 2019.
  • Yao et al. (2020a) Yao, Q., Yang, H., Han, B., Niu, G., and Kwok, J. T.-Y. Searching to exploit memorization effect in learning with noisy labels. In ICML, pp.  10789–10798. PMLR, 2020a.
  • Yao et al. (2020b) Yao, Y., Liu, T., Han, B., Gong, M., Deng, J., Niu, G., and Sugiyama, M. Dual t: Reducing estimation error for transition matrix in label-noise learning. In NeurIPS, 2020b.
  • Yu et al. (2018) Yu, X., Liu, T., Gong, M., and Tao, D. Learning with biased complementary labels. In ECCV, pp.  68–83, 2018.
  • Yu et al. (2019) Yu, X., Han, B., Yao, J., Niu, G., Tsang, I., and Sugiyama, M. How does disagreement help generalization against label corruption? In ICML, pp.  7164–7173, 2019.
  • Yu et al. (2020) Yu, X., Liu, T., Gong, M., Zhang, K., Batmanghelich, K., and Tao, D. Label-noise robust domain adaptation. In ICML, pp.  10913–10924. PMLR, 2020.
  • Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
  • Zhang et al. (2021) Zhang, Y., Niu, G., and Sugiyama, M. Learning noise transition matrix from only noisy labels via total variation regularization. In ICML, 2021.
  • Zhang & Sabuncu (2018) Zhang, Z. and Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. In NeurIPS, pp.  8778–8788, 2018.
  • Zhu et al. (2021a) Zhu, Z., Liu, T., and Liu, Y. A second-order approach to learning with instance-dependent label noise. In CVPR, 2021a.
  • Zhu et al. (2021b) Zhu, Z., Song, Y., and Liu, Y. Clusterability as an alternative to anchor points when learning with noisy labels. In ICML, 2021b.

Appendix A Appendix

A.1 Proof of Proposition 1

To prove Proposition 1, we first show that the anchor-point assumption is a sufficient condition for the sufficiently scattered assumption. In other words, we need to show that if the anchor-point assumption is satisfied, then the two conditions of the sufficiently scattered assumption must hold.

We start with condition (2) of the sufficiently scattered assumption. We need to show that if the anchor-point assumption holds, then there exists a set $\mathcal{H}=\{x_{1},\ldots,x_{m}\}\subseteq\mathcal{X}$ such that the matrix $\boldsymbol{H}=[P(\boldsymbol{Y}|X=x_{1}),\ldots,P(\boldsymbol{Y}|X=x_{m})]$ satisfies $\mathrm{cone}\{\boldsymbol{H}\}\nsubseteq\mathrm{cone}\{\boldsymbol{Q}\}$ for any unitary matrix $\boldsymbol{Q}\in\mathbb{R}^{C\times C}$ that is not a permutation matrix.

Since the anchor-point assumption is satisfied, there exists a matrix $\boldsymbol{H}=[P(\boldsymbol{Y}|X=\boldsymbol{x}^{1}),\ldots,P(\boldsymbol{Y}|X=\boldsymbol{x}^{C})]$, where $\boldsymbol{x}^{1},\ldots,\boldsymbol{x}^{C}$ are anchor points for each class. From the definition of anchor points, we have $P(\boldsymbol{Y}|X=\boldsymbol{x}^{i})=\boldsymbol{e}_{i}$. This implies that

$\boldsymbol{H}=[P(\boldsymbol{Y}|X=\boldsymbol{x}^{1}),\ldots,P(\boldsymbol{Y}|X=\boldsymbol{x}^{C})]=\boldsymbol{I},$ (8)

where 𝑰\boldsymbol{I} is the identity matrix. By the definition of the identity matrix 𝑰\boldsymbol{I}, it is clear that cone{𝑯}=cone{𝑰}cone{𝑸}\mathrm{cone}\{\boldsymbol{H}\}=\mathrm{cone}\{\boldsymbol{I}\}\nsubseteq\mathrm{cone}\{\boldsymbol{Q}\} for any unitary matrix 𝑸C×C\boldsymbol{Q}\in\mathbb{R}^{C\times C} that is not a permutation matrix. This shows that condition (2) of the sufficiently scattered assumption is satisfied if the anchor-point assumption holds.

Next, we show that condition (1) is also satisfied, i.e., the convex cone cone{𝑯},\mathcal{R}\subseteq\mathrm{cone}\{\boldsymbol{H}\}, where ={𝒗C|𝒗𝟏C1𝒗2}\mathcal{R}=\{\boldsymbol{v}\in\mathbb{R}^{C}|\boldsymbol{v}^{\top}\boldsymbol{1}\geq\sqrt{C-1}\|\boldsymbol{v}\|_{2}\}. By Eq. (8), condition (1) of the sufficiently scattered assumption is equivalent to

cone{𝑰}={𝒖|𝒖=j=1C𝒆jαj,αj0,j}.\mathcal{R}\subseteq\mathrm{cone}\{\boldsymbol{I}\}=\{\boldsymbol{u}|\boldsymbol{u}=\sum_{j=1}^{C}\boldsymbol{e}_{j}\alpha_{j},\;\alpha_{j}\geq 0,\;\forall j\}. (9)

This means that all elements of \mathcal{R} must lie in the non-negative orthant of C\mathbb{R}^{C}, i.e., for all 𝒗\boldsymbol{v}\in\mathcal{R}, 𝒗i0\boldsymbol{v}_{i}\geq 0 for all i{1,,C}i\in\{1,\ldots,C\}. Consider 𝒗\boldsymbol{v}\in\mathcal{R} and let 𝒗^\hat{\boldsymbol{v}} be the normalized version of 𝒗\boldsymbol{v}. By the definition of \mathcal{R}, we have the following chain:

𝒗𝟏\displaystyle\boldsymbol{v}^{\top}\boldsymbol{1} C1𝒗2,\displaystyle\geq\sqrt{C-1}\|\boldsymbol{v}\|_{2}, (10a)
𝒗𝒗𝟏=𝒗^𝟏\displaystyle\frac{\boldsymbol{v}^{\top}}{\|\boldsymbol{v}\|}\boldsymbol{1}=\hat{\boldsymbol{v}}^{\top}\boldsymbol{1} C1,\displaystyle\geq\sqrt{C-1}, (10b)
i{1,2,,C}𝒗^i\displaystyle\sum_{i\in\{1,2,\ldots,C\}}\hat{\boldsymbol{v}}_{i} C1.\displaystyle\geq\sqrt{C-1}. (10c)

Showing that 𝒗\boldsymbol{v} is non-negative is equivalent to showing that 𝒗^\hat{\boldsymbol{v}} is non-negative, i.e., k{1,,C}\forall k\in\{1,\ldots,C\}, 𝒗^k0\hat{\boldsymbol{v}}_{k}\geq 0. Let 𝒖C1\boldsymbol{u}\in\mathbb{R}^{C-1} be the vector obtained from 𝒗^\hat{\boldsymbol{v}} by removing its kkth element 𝒗^k\hat{\boldsymbol{v}}_{k}. Following Eq. (10), we have:

𝒗^k\displaystyle\;\hat{\boldsymbol{v}}_{k} C1i{1,2,,C}{k}𝒗^i,\displaystyle\geq\sqrt{C-1}-\sum_{i\in\{1,2,\ldots,C\}\setminus\{k\}}\hat{\boldsymbol{v}}_{i}, (11a)
𝒗^k\displaystyle\hat{\boldsymbol{v}}_{k} C1𝒖𝟏.\displaystyle\geq\sqrt{C-1}-\boldsymbol{u}^{\top}\boldsymbol{1}. (11b)

By the Cauchy-Schwarz inequality, we get the following inequality:

|𝒖𝟏|𝒖𝟏.|\boldsymbol{u}^{\top}\boldsymbol{1}|\leq\|\boldsymbol{u}\|\|\boldsymbol{1}\|. (12)

Since 𝒖\boldsymbol{u} is a subvector of the unit vector 𝒗^\hat{\boldsymbol{v}} and 𝟏\boldsymbol{1} here denotes the all-ones vector in \mathbb{R}^{C-1}, we have 𝒖1\|\boldsymbol{u}\|\leq 1 and 𝟏=C1\|\boldsymbol{1}\|=\sqrt{C-1}. Combining this with Eq. (11) and Eq. (12), we get:

𝒗^kC1𝒖𝟏0.\hat{\boldsymbol{v}}_{k}\geq\sqrt{C-1}-\|\boldsymbol{u}\|\|\boldsymbol{1}\|\geq 0. (13)

This implies that 𝒗k0\boldsymbol{v}_{k}\geq 0 for all k{1,2,,C}k\in\{1,2,\ldots,C\}, and hence the anchor-point assumption is a sufficient condition for the sufficiently scattered assumption.
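The non-negativity argument above can also be checked numerically. The following minimal sketch (assuming NumPy; the boundary parametrization of \mathcal{R} used for sampling is our own illustrative choice, not taken from the paper) samples extreme rays of \mathcal{R} and verifies that they are entry-wise non-negative:

```python
# Numerical sanity check: extreme rays of R = {v : 1^T v >= sqrt(C-1) * ||v||_2}
# are entry-wise non-negative, matching Eqs. (10)-(13).
import numpy as np

rng = np.random.default_rng(0)
C = 5
ones = np.ones(C)

for _ in range(10_000):
    # Extreme ray of R: v = 1 + sqrt(C/(C-1)) * d with d a unit vector
    # orthogonal to the all-ones vector.
    d = rng.standard_normal(C)
    d -= d.mean()                      # project out the all-ones direction
    d /= np.linalg.norm(d)             # unit vector orthogonal to 1
    v = ones + np.sqrt(C / (C - 1)) * d
    # v lies (numerically) on the boundary of R ...
    assert np.isclose(ones @ v, np.sqrt(C - 1) * np.linalg.norm(v))
    # ... and, as proved above, it is entry-wise non-negative.
    assert v.min() >= -1e-9
print("all sampled extreme rays of R are non-negative")
```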

We now prove that the anchor-point assumption is not a necessary condition for the sufficiently scattered assumption. Suppose P(𝒀|X)P(\boldsymbol{Y}|X) has the property that x1,x2,xC𝒳x^{1},x^{2},\ldots x^{C}\notin\mathcal{X}, which means that the anchor-point assumption is not satisfied. We also assume that there exists a set ={x1,,xm}𝒳\mathcal{H}=\{x_{1},\ldots,x_{m}\}\subseteq\mathcal{X} such that cone{𝑯}\mathrm{cone}\{\boldsymbol{H}\} covers the whole non-negative orthant except the regions along each axis (the regions spanned by the clean class-posteriors of anchor points). Since these regions along the axes are not part of \mathcal{R} when C>2C>2, condition (1) of the sufficiently scattered assumption is satisfied. Moreover, by the definition of 𝑯\boldsymbol{H}, the only unitary matrices whose cones can cover cone{𝑯}\mathrm{cone}\{\boldsymbol{H}\} are permutation matrices. This shows that condition (2) of the sufficiently scattered assumption is also satisfied, which completes the proof. ∎
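To make the non-necessity concrete, the following sketch (assuming NumPy and SciPy; the particular choice of clean class-posteriors for C=3 is ours and serves only as an example) spot-checks condition (1) by verifying, via non-negative least squares, that sampled extreme rays of \mathcal{R} lie in cone{𝑯} even though no column of 𝑯 is an anchor point; condition (2) can be argued geometrically in the same spirit and is not checked numerically here:

```python
# Illustrative C = 3 example: clean class-posteriors with no anchor point
# (no entry equals one) whose cone still contains R.
import numpy as np
from scipy.optimize import nnls

C = 3
H = np.array([            # columns straddle each edge midpoint of the simplex
    [0.7, 0.3, 0.7, 0.3, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.3, 0.7, 0.3, 0.7],
])
assert not np.any(np.isclose(H, 1.0)), "H contains no anchor point"

rng = np.random.default_rng(0)
ones = np.ones(C)
max_residual = 0.0
for _ in range(5_000):
    d = rng.standard_normal(C)
    d -= d.mean()
    d /= np.linalg.norm(d)
    v = ones + np.sqrt(C / (C - 1)) * d   # extreme ray of R (see previous sketch)
    _, residual = nnls(H, v)              # v in cone{H}  <=>  residual ~ 0
    max_residual = max(max_residual, residual)
print(f"max NNLS residual over sampled extreme rays of R: {max_residual:.2e}")
```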

A.2 Proof of Theorem 1

The insights of our proof come from previous work on non-negative matrix factorization (Fu et al., 2015). To proceed, let us first introduce the following classic lemmas from convex analysis:

Lemma 1.

If 𝒦1\mathcal{K}_{1} and 𝒦2\mathcal{K}_{2} are convex cones and 𝒦1𝒦2\mathcal{K}_{1}\subseteq\mathcal{K}_{2}, then dual{𝒦2}dual{𝒦1}dual\{\mathcal{K}_{2}\}\subseteq dual\{\mathcal{K}_{1}\}.

Lemma 2.

If 𝐀\boldsymbol{A} is invertible, then dual(𝐀)=cone(𝐀)dual(\mathbf{A})=\mathrm{cone}(\mathbf{A}^{-\top}).

Readers are referred to Boyd et al. (2004) for details. Our purpose is to show that Criterion (5) has a unique solution, namely the ground-truth P(𝒀|X)P(\boldsymbol{Y}|X) and 𝑻\boldsymbol{T}. To this end, let us denote by (𝑻,h𝜽)(\boldsymbol{T}_{\star},h_{\boldsymbol{\theta}_{\star}}) an optimal solution of Criterion (5), i.e.,

𝑻h𝜽=𝑻P(𝒀|X)=P(𝒀~|X).\boldsymbol{T}_{\star}h_{\boldsymbol{\theta}_{\star}}=\boldsymbol{T}P(\boldsymbol{Y}|X)=P(\tilde{\boldsymbol{Y}}|X). (14)

As in the sufficiently scattered assumption, we have the matrix 𝑯=[P(𝒀|X=x1),,P(𝒀|X=xm)]\boldsymbol{H}=[P(\boldsymbol{Y}|X=x_{1}),\ldots,P(\boldsymbol{Y}|X=x_{m})] defined on the set ={x1,,xm}𝒳\mathcal{H}=\{x_{1},\ldots,x_{m}\}\subseteq\mathcal{X}. Letting 𝑯=[h𝜽(𝒙1),,h𝜽(𝒙m)]\boldsymbol{H}_{\star}=[h_{\boldsymbol{\theta}_{\star}}(\boldsymbol{x}_{1}),\ldots,h_{\boldsymbol{\theta}_{\star}}(\boldsymbol{x}_{m})], it follows that

𝑻𝑯=𝑻𝑯.\boldsymbol{T}_{\star}\boldsymbol{H}_{\star}=\boldsymbol{T}\boldsymbol{H}. (15)

Note that both 𝑻\boldsymbol{T}_{\star} and 𝑻\boldsymbol{T} have full rank because they are diagonally dominant square matrices by definition. In addition, since the sufficiently scattered assumption is satisfied, rank(𝑯)=Crank(\boldsymbol{H})=C also holds (Fu et al., 2015). Therefore, there exists an invertible matrix 𝑨C×C\boldsymbol{A}\in\mathbb{R}^{C\times C} such that

𝑻=𝑻𝑨,\boldsymbol{T}_{\star}=\boldsymbol{T}\boldsymbol{A}^{-\top}, (16)

where 𝑨=𝑯𝑯\boldsymbol{A}^{-\top}=\boldsymbol{H}\boldsymbol{H}_{\star}^{\dagger} and 𝑯=𝑯(𝑯𝑯)1\boldsymbol{H}_{\star}^{\dagger}=\boldsymbol{H}_{\star}^{\top}(\boldsymbol{H}_{\star}\boldsymbol{H}_{\star}^{\top})^{-1} is the pseudo-inverse of 𝑯\boldsymbol{H}_{\star}.

Since 𝟏𝑯=𝟏\boldsymbol{1}^{\top}\boldsymbol{H}=\boldsymbol{1}^{\top} and 𝟏𝑯=𝟏\mathbf{1}^{\top}\boldsymbol{H}_{\star}=\mathbf{1}^{\top} by definition, we get

𝟏𝑨=𝟏𝑯𝑯=𝟏𝑯=𝟏𝑯𝑯=𝟏.\mathbf{1}^{\top}\boldsymbol{A}^{-\top}=\mathbf{1}^{\top}\boldsymbol{H}\boldsymbol{H}_{\star}^{\dagger}=\mathbf{1}^{\top}\boldsymbol{H}_{\star}^{\dagger}=\mathbf{1}^{\top}\boldsymbol{H}_{\star}\boldsymbol{H}_{\star}^{\dagger}=\mathbf{1}^{\top}. (17)

Let 𝒗cone{𝑯},\boldsymbol{v}\in\mathrm{cone}\{\boldsymbol{H}\}, which by definition takes the form 𝒗=𝑯𝒖\boldsymbol{v}=\boldsymbol{H}\boldsymbol{u} for some 𝒖0.\boldsymbol{u}\geq 0. Using 𝑯=𝑨𝑯\boldsymbol{H}=\boldsymbol{A}^{-\top}\boldsymbol{H}_{\star}, which follows from Eqs. (15) and (16) since 𝑻\boldsymbol{T} is invertible, 𝒗\boldsymbol{v} can be expressed as 𝒗=𝑨𝒖~\boldsymbol{v}=\boldsymbol{A}^{-\top}\boldsymbol{\tilde{u}} where 𝒖~=𝑯𝒖0\boldsymbol{\tilde{u}}=\boldsymbol{H}_{\star}\boldsymbol{u}\geq 0. This implies that 𝒗\boldsymbol{v} also lies in cone{𝑨}\mathrm{cone}\{\boldsymbol{A}^{-\top}\}, i.e., cone{𝑯}cone{𝑨}\mathrm{cone}\{\boldsymbol{H}\}\subseteq\mathrm{cone}\{\boldsymbol{A}^{-\top}\}.

Recall Condition (1) of the sufficiently scattered assumption, i.e., cone{𝑯}\mathcal{R}\subseteq\mathrm{cone}\{\boldsymbol{H}\} where ={𝒗C|𝟏𝒗C1𝒗2}.\mathcal{R}=\{\boldsymbol{v}\in\mathbb{R}^{C}|\mathbf{1}^{\top}\boldsymbol{v}\geq\sqrt{C-1}\|\boldsymbol{v}\|_{2}\}. It implies

cone{𝑯}cone(𝑨).\mathcal{R}\subseteq\mathrm{cone}\{\boldsymbol{H}\}\subseteq\mathrm{cone}(\boldsymbol{A}^{-\top}). (18)

By applying Lemmas 1 and 2 to Eq. (18), we have

cone(𝑨)dual{},\mathrm{cone}(\boldsymbol{A})\subseteq dual\{\mathcal{R}\}, (19)

where dual{}dual\{\mathcal{R}\} is the dual cone of ,\mathcal{R}, which can be shown to be

dual{}={𝒗C|𝒗2𝟏𝒗}.dual\{\mathcal{R}\}=\{\boldsymbol{v}\in\mathbb{R}^{C}|\|\boldsymbol{v}\|_{2}\leq\mathbf{1}^{\top}\boldsymbol{v}\}. (20)
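This expression for the dual cone can be corroborated numerically: for a random vector 𝒗, membership in dual{\mathcal{R}} can be decided exactly by minimizing 𝒗^{\top}𝒘 over the extreme rays 𝒘 of \mathcal{R}, and the sign of that minimum should agree with the sign of \mathbf{1}^{\top}\boldsymbol{v}-\|\boldsymbol{v}\|_{2}. A minimal sketch (assuming NumPy; the sampling-free membership test is our own rewriting of the extreme-ray parametrization used earlier):

```python
# Numerical corroboration of Eq. (20): the exact dual-cone membership test
# agrees with the closed form ||v||_2 <= 1^T v.
import numpy as np

rng = np.random.default_rng(0)
C = 4
for _ in range(20_000):
    v = rng.standard_normal(C)
    s = v.sum()
    # min over extreme rays w = 1 + sqrt(C/(C-1)) d (d unit, d orthogonal to 1)
    # of v.w equals s - sqrt(C/(C-1)) * ||v - mean(v) * 1||.
    direct = s - np.sqrt(C / (C - 1)) * np.linalg.norm(v - v.mean())
    closed_form = s - np.linalg.norm(v)          # Eq. (20)
    if abs(direct) > 1e-9 and abs(closed_form) > 1e-9:
        assert (direct > 0) == (closed_form > 0)
print("the direct dual-cone test agrees with Eq. (20) on all samples")
```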

Then we have the following inequalities:

|det(𝑨)|\displaystyle|det(\boldsymbol{A})| i=1C𝑨:,i2\displaystyle\leq\prod_{i=1}^{C}\|\boldsymbol{A}_{:,i}\|_{2} (21a)
i=1C𝟏𝑨:,i\displaystyle\leq\prod_{i=1}^{C}\mathbf{1}^{\top}\boldsymbol{A}_{:,i} (21b)
(i=1C𝟏𝑨:,iC)C\displaystyle\leq(\frac{\sum_{i=1}^{C}\mathbf{1}^{\top}\boldsymbol{A}_{:,i}}{C})^{C} (21c)
=(𝟏𝑨𝟏C)C=1,\displaystyle=(\frac{\mathbf{1}^{\top}\boldsymbol{A}\mathbf{1}}{C})^{C}=1, (21d)

where (21a) is Hadamard’s inequality; (21b) follows from Eqs. (19) and (20), which give \mathbf{1}^{\top}\boldsymbol{A}_{:,i}\geq\|\boldsymbol{A}_{:,i}\|_{2} for every column; (21c) is the arithmetic-geometric mean inequality; and (21d) follows from Eq. (17), which implies \boldsymbol{A}\mathbf{1}=\mathbf{1} and hence \mathbf{1}^{\top}\boldsymbol{A}\mathbf{1}=C.
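As a sanity check of the chain (21), one can sample matrices 𝑨 with non-negative entries (a sufficient, though not necessary, condition for every column to lie in dual{\mathcal{R}}) whose rows sum to one (equivalent to Eq. (17), since \mathbf{1}^{\top}\boldsymbol{A}^{-\top}=\mathbf{1}^{\top} is equivalent to \boldsymbol{A}\mathbf{1}=\mathbf{1}) and verify |det(\boldsymbol{A})|\leq 1. A minimal sketch assuming NumPy:

```python
# Monte Carlo sanity check of the bound |det(A)| <= 1 obtained in (21).
import numpy as np

rng = np.random.default_rng(0)
C = 6
worst = 0.0
for _ in range(20_000):
    A = rng.random((C, C))              # non-negative entries: columns in dual{R}
    A /= A.sum(axis=1, keepdims=True)   # each row sums to one, i.e., A 1 = 1
    worst = max(worst, abs(np.linalg.det(A)))
print(f"largest |det(A)| over samples: {worst:.4f} (the bound from (21) is 1)")
assert worst <= 1.0 + 1e-12
```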

Note that |det(𝑨)|1=|det(𝑨)||det(\boldsymbol{A})|^{-1}=|det(\boldsymbol{A}^{-\top})| and det(𝑻)=det(𝑻𝑨)=det(𝑻)|det(𝑨)|1det(\boldsymbol{T}_{\star})=det(\boldsymbol{T}\boldsymbol{A}^{-\top})=det(\boldsymbol{T})|det(\boldsymbol{A})|^{-1} by properties of the determinant. It then follows from Eq. (21) that det(𝑻)det(𝑻)det(\boldsymbol{T}_{\star})\geq det(\boldsymbol{T}). On the other hand, det(𝑻)det(𝑻)det(\boldsymbol{T}_{\star})\leq det(\boldsymbol{T}) must hold because (𝑻,h𝜽)(\boldsymbol{T}_{\star},h_{\boldsymbol{\theta}_{\star}}) minimizes Criterion (5) while (𝑻,P(𝒀|X))(\boldsymbol{T},P(\boldsymbol{Y}|X)) is feasible. Hence we have

det(𝑻)=det(𝑻)det(\boldsymbol{T}_{\star})=det(\boldsymbol{T}) (22)

Eq. (22) forces |det(𝑨)|=1|det(\boldsymbol{A})|=1, so every inequality in (21) must hold with equality. By Hadamard’s inequality, the equality in (21a) holds only if 𝑨\boldsymbol{A} is column-orthogonal, which is equivalent to 𝑨\boldsymbol{A}^{-\top} being column-orthogonal. Considering condition (2) in the definition of the sufficiently scattered assumption and the property that cone{𝑯}cone(𝑨)\mathrm{cone}\{\boldsymbol{H}\}\subseteq\mathrm{cone}(\boldsymbol{A}^{-\top}), the only possible choices of a column-orthogonal 𝑨\boldsymbol{A}^{-\top} are

𝑨=𝚷𝚽\boldsymbol{A}^{-\top}=\boldsymbol{\Pi}\boldsymbol{\Phi} (23)

where 𝚷C×C\boldsymbol{\Pi}\in\mathbb{R}^{C\times C} is any permutation matrix and 𝚽C×C\boldsymbol{\Phi}\in\mathbb{R}^{C\times C} is any diagonal matrix with non-zero diagonal entries. By Eq. (17), we must have 𝚽=𝑰\boldsymbol{\Phi}=\boldsymbol{I}. Consequently, we are left with 𝑨=𝚷\boldsymbol{A}^{-\top}=\boldsymbol{\Pi}, or equivalently, 𝑻=𝚷𝑻\boldsymbol{T}_{\star}=\boldsymbol{\Pi}\boldsymbol{T}. Since 𝑻\boldsymbol{T} and 𝑻\boldsymbol{T}_{\star} are both diagonally dominant, the only possible permutation matrix is 𝑰\boldsymbol{I}, which means 𝑻=𝑻\boldsymbol{T}_{\star}=\boldsymbol{T} holds. By Eq. (14), it then follows that h𝜽=P(𝒀|X)h_{\boldsymbol{\theta}_{\star}}=P(\boldsymbol{Y}|X). Hence we conclude that (𝑻,h𝜽)=(𝑻,P(𝒀|X))(\boldsymbol{T}_{\star},h_{\boldsymbol{\theta}_{\star}})=(\boldsymbol{T},P(\boldsymbol{Y}|X)) is the unique optimal solution to Criterion (5). ∎
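The conclusion of Theorem 1 can also be illustrated with a toy Monte Carlo experiment (a sketch of our own, assuming that Criterion (5) amounts to minimizing the simplex volume, proportional to |det(𝑻′)|, over transition matrices 𝑻′ that reproduce the noisy class-posteriors with entry-wise non-negative clean class-posteriors; the true 𝑻, the candidate sampler, and the feasibility test are illustrative choices). Using the sufficiently scattered 𝑯 from the earlier sketch, no feasible candidate attains a smaller volume than the true 𝑻:

```python
# Toy Monte Carlo illustration of Theorem 1: every candidate T' that
# reproduces S = T H with valid clean posteriors satisfies
# |det(T')| >= |det(T)|, so volume minimization identifies T.
import numpy as np

rng = np.random.default_rng(0)
C = 3
H = np.array([
    [0.7, 0.3, 0.7, 0.3, 0.0, 0.0],
    [0.3, 0.7, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.3, 0.7, 0.3, 0.7],
])
T_true = 0.7 * np.eye(C) + 0.1 * np.ones((C, C))   # symmetric 20% label noise
S = T_true @ H                                     # noisy class-posteriors

def random_candidate():
    """Column-stochastic matrix whose diagonal entry is the largest in its column."""
    M = rng.dirichlet(np.ones(C), size=C).T        # each column on the simplex
    for j in range(C):                             # move the column maximum to the diagonal
        i = int(np.argmax(M[:, j]))
        M[[j, i], j] = M[[i, j], j]
    return M

candidates = [T_true, np.eye(C)]
candidates += [(1 - lam) * T_true + lam * np.eye(C) for lam in np.linspace(0.1, 0.9, 9)]
candidates += [random_candidate() for _ in range(10_000)]

vol_true = abs(np.linalg.det(T_true))
n_feasible = 0
for Tc in candidates:
    H_c = np.linalg.solve(Tc, S)                   # implied clean class-posteriors
    if H_c.min() >= -1e-9:                         # feasible: non-negative posteriors
        n_feasible += 1
        assert abs(np.linalg.det(Tc)) >= vol_true - 1e-9
print(f"{n_feasible} feasible candidates checked; none has volume below "
      f"|det(T_true)| = {vol_true:.2f}")
```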

Figure 5: Transition matrix estimation error on MNIST/NA, CIFAR-10/NA, and CIFAR-100/NA, where the “/NA” suffix indicates that possible anchor points have been removed. The shaded region in each figure shows the standard deviation; lower is better.

Appendix B Experiments on datasets where possible anchor points are manually removed

Method | MNIST/NA Sym-20% | MNIST/NA Sym-50% | CIFAR-10/NA Sym-20% | CIFAR-10/NA Sym-50% | CIFAR-100/NA Sym-20% | CIFAR-100/NA Sym-50%
Decoupling | 96.72±0.16 | 92.72±0.33 | 75.51±0.38 | 49.96±0.51 | 38.83±0.37 | 20.42±0.53
MentorNet | 97.10±0.10 | 95.10±0.14 | 80.25±0.52 | 71.65±0.28 | 39.72±0.35 | 29.39±0.35
Co-teaching | 97.06±0.12 | 94.89±0.10 | 81.74±0.32 | 73.38±0.45 | 44.92±0.11 | 33.13±0.88
Forward | 98.46±0.07 | 97.59±0.05 | 84.25±0.22 | 70.00±3.07 | 50.58±0.68 | 36.79±1.86
T-Revision | 98.72±0.13 | 97.86±0.11 | 86.81±0.19 | 74.10±2.34 | 59.57±1.13 | 43.75±0.84
DMI | 98.42±0.03 | 97.87±0.18 | 83.42±0.54 | 77.82±0.45 | 56.29±0.28 | 41.81±0.70
Dual T | 98.61±0.12 | 97.91±0.12 | 86.70±0.06 | 78.92±0.42 | 56.99±1.00 | 42.04±1.96
VolMinNet | 98.72±0.06 | 97.94±0.07 | 88.72±0.03 | 82.38±0.65 | 63.40±1.25 | 51.04±1.23

Method | MNIST/NA Pair-20% | MNIST/NA Pair-45% | CIFAR-10/NA Pair-20% | CIFAR-10/NA Pair-45% | CIFAR-100/NA Pair-20% | CIFAR-100/NA Pair-45%
Decoupling | 96.92±0.06 | 93.29±0.57 | 77.06±0.26 | 50.81±0.73 | 40.42±0.47 | 26.21±0.67
MentorNet | 96.88±0.04 | 88.17±0.70 | 77.62±0.28 | 57.60±0.35 | 39.11±0.41 | 25.17±0.36
Co-teaching | 96.96±0.07 | 95.34±0.09 | 80.70±0.18 | 69.15±0.89 | 43.04±0.73 | 26.67±0.29
Forward | 98.61±0.33 | 78.51±17.48 | 85.87±0.82 | 53.92±11.39 | 51.37±0.99 | 34.69±1.37
T-Revision | 98.71±0.31 | 82.65±14.61 | 87.52±0.58 | 53.96±14.67 | 59.70±1.43 | 38.35±0.60
DMI | 98.78±0.11 | 97.46±1.38 | 86.14±1.52 | 70.01±5.63 | 54.05±1.09 | 35.03±2.91
Dual T | 98.76±0.13 | 85.77±7.85 | 89.02±0.40 | 65.17±0.72 | 59.07±3.79 | 36.95±3.19
VolMinNet | 98.87±0.11 | 97.80±2.43 | 89.26±0.22 | 84.48±3.85 | 64.88±1.87 | 56.07±3.35
Table 3: Classification accuracy (percentage) on MNIST/NA, CIFAR-10/NA, and CIFAR-100/NA, where the “/NA” suffix indicates that possible anchor points have been removed. The upper block reports results under symmetric label noise and the lower block under pair label noise; VolMinNet achieves the best accuracy in every setting.

Following Xia et al. (2019), to show the importance of anchor points, we remove possible anchor points, i.e., instances with large estimated class-posterior probability P(Y|X),P(Y|X), from the datasets before corrupting the training and validation sets. For MNIST, we remove 40%40\% of the instances with the largest estimated class-posterior probabilities in each class; for CIFAR-10 and CIFAR-100, we remove 10%10\%. We append "/NA" to a dataset's name to denote the version modified by removing possible anchor points. The detailed experimental results are shown in Figure 5 (estimation error) and Table 3 (classification accuracy). The results show that the proposed method consistently outperforms the baseline methods when possible anchor points are removed.
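For concreteness, the removal step can be sketched as follows (assuming NumPy; the function name and the toy data are illustrative, and the estimated class-posterior probabilities are assumed to come from a model fitted beforehand):

```python
# Sketch of the "/NA" construction described above: for each class, drop the
# fraction of instances with the largest estimated class-posterior probability
# for that class, so that likely anchor points are removed before the labels
# are corrupted.
import numpy as np

def remove_possible_anchor_points(labels, posteriors, drop_fraction):
    """Return the indices of the instances kept after removing likely anchor points."""
    keep = []
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        conf = posteriors[idx, c]                  # estimated confidence in the own class
        n_drop = int(np.floor(drop_fraction * len(idx)))
        order = np.argsort(conf)                   # ascending confidence
        keep.extend(idx[order[:len(idx) - n_drop]])
    return np.sort(np.asarray(keep))

# Toy usage with random stand-in posteriors; 40% per class as done for MNIST.
rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=1000)
P = rng.dirichlet(np.ones(10), size=1000)
kept = remove_possible_anchor_points(y, P, drop_fraction=0.4)
print(f"kept {len(kept)} of {len(y)} instances")
```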