
FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning

Yidong Wang1,2*, Hao Chen3*, Qiang Heng4, Wenxin Hou5, Yue Fan6, Zhen Wu7, Jindong Wang1†, Marios Savvides3, Takahiro Shinozaki2, Bhiksha Raj3, Bernt Schiele6, Xing Xie1

1Microsoft Research Asia, 2Tokyo Institute of Technology, 3Carnegie Mellon University, 4North Carolina State University, 5Microsoft STCA, 6Max-Planck-Institut für Informatik, 7Nanjing University

*Equal contribution: [email protected], [email protected]; work done when Yidong was a research intern at MSRA.
†Correspondence to: [email protected]
Abstract

Semi-supervised Learning (SSL) has witnessed great success owing to the impressive performances brought by various methods based on pseudo labeling and consistency regularization. However, we argue that existing methods might fail to utilize the unlabeled data more effectively since they either use a pre-defined / fixed threshold or an ad-hoc threshold adjusting scheme, resulting in inferior performance and slow convergence. We first analyze a motivating example to obtain intuitions on the relationship between the desirable threshold and the model's learning status. Based on the analysis, we hence propose FreeMatch to adjust the confidence threshold in a self-adaptive manner according to the model's learning status. We further introduce a self-adaptive class fairness regularization penalty to encourage the model to make diverse predictions during the early training stage. Extensive experiments indicate the superiority of FreeMatch, especially when the labeled data are extremely rare. FreeMatch achieves 5.78%, 13.59%, and 1.28% error rate reduction over the latest state-of-the-art method FlexMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively. Moreover, FreeMatch can also boost the performance of imbalanced SSL. The code can be found at https://github.com/microsoft/Semi-supervised-learning. Note that the results of this paper are obtained using TorchSSL (Zhang et al., 2021); we also provide code and logs in USB (Wang et al., 2022).

1 Introduction

The superior performance of deep learning heavily relies on supervised training with sufficient labeled data (He et al., 2016; Vaswani et al., 2017; Dong et al., 2018). However, it remains laborious and expensive to obtain massive labeled data. To alleviate such reliance, semi-supervised learning (SSL) (Zhu, 2005; Zhu & Goldberg, 2009; Sohn et al., 2020; Rosenberg et al., 2005; Gong et al., 2016; Kervadec et al., 2019; Dai et al., 2017) is developed to improve the model’s generalization performance by exploiting a large volume of unlabeled data. Pseudo labeling (Lee et al., 2013; Xie et al., 2020b; McLachlan, 1975; Rizve et al., 2020) and consistency regularization (Bachman et al., 2014; Samuli & Timo, 2017; Sajjadi et al., 2016) are two popular paradigms designed for modern SSL. Recently, their combinations have shown promising results (Xie et al., 2020a; Sohn et al., 2020; Pham et al., 2021; Xu et al., 2021; Zhang et al., 2021). The key idea is that the model should produce similar predictions or the same pseudo labels for the same unlabeled data under different perturbations following the smoothness and low-density assumptions in SSL (Chapelle et al., 2006).

A potential limitation of these threshold-based methods is that they either need a fixed threshold (Xie et al., 2020a; Sohn et al., 2020; Zhang et al., 2021; Guo & Li, 2022) or an ad-hoc threshold adjusting scheme (Xu et al., 2021) to compute the loss with only confident unlabeled samples. Specifically, UDA (Xie et al., 2020a) and FixMatch (Sohn et al., 2020) retain a fixed high threshold to ensure the quality of pseudo labels. However, a fixed high threshold (0.95) could lead to low data utilization in the early training stages and ignores the different learning difficulties of different classes. Dash (Xu et al., 2021) and AdaMatch (Berthelot et al., 2022) propose to gradually grow the fixed global (dataset-specific) threshold as training progresses. Although the utilization of unlabeled data is improved, their ad-hoc threshold adjusting scheme is arbitrarily controlled by hyper-parameters and thus disconnected from the model's learning process. FlexMatch (Zhang et al., 2021) demonstrates that different classes should have different local (class-specific) thresholds. While the local thresholds take into account the learning difficulties of different classes, they are still mapped from a pre-defined fixed global threshold. Adsh (Guo & Li, 2022) obtains adaptive thresholds from a pre-defined threshold for imbalanced semi-supervised learning by optimizing the number of pseudo labels for each class. In a nutshell, these methods might be incapable or insufficient in terms of adjusting thresholds according to the model's learning progress, thus impeding the training process, especially when labeled data is too scarce to provide adequate supervision.

Figure 1: Demonstration of how FreeMatch works on the “two-moon” dataset. (a) Decision boundary of FreeMatch and other SSL methods. (b) Decision boundary improvement of self-adaptive fairness (SAF) on two labeled samples per class. (c) Class-average confidence threshold. (d) Class-average sampling rate of FreeMatch during training. The experimental details are in Appendix A.

For example, as shown in Figure 1(a), on the “two-moon” dataset with only 1 labeled sample for each class, the decision boundaries obtained by previous methods fail to satisfy the low-density assumption. Two questions then naturally arise: 1) Is it necessary to determine the threshold based on the model's learning status? and 2) How can the threshold be adaptively adjusted for the best training efficiency?

In this paper, we first leverage a motivating example to demonstrate that different datasets and classes should determine their global (dataset-specific) and local (class-specific) thresholds based on the model's learning status. Intuitively, we need a low global threshold to utilize more unlabeled data and speed up convergence at early training stages. As the prediction confidence increases, a higher global threshold is necessary to filter out wrong pseudo labels and alleviate the confirmation bias (Arazo et al., 2020). Besides, a local threshold should be defined for each class based on the model's confidence about its predictions. The “two-moon” example in Figure 1(a) shows that the decision boundary is more reasonable when the thresholds are adjusted based on the model's learning status.

We then propose FreeMatch to adjust the thresholds in a self-adaptive manner according to the learning status of each class (Guo et al., 2017). Specifically, FreeMatch uses the self-adaptive thresholding (SAT) technique to estimate both the global (dataset-specific) and local (class-specific) thresholds via the exponential moving average (EMA) of the unlabeled data confidence. To handle barely supervised settings (Sohn et al., 2020) more effectively, we further propose a class fairness objective to encourage the model to produce fair (i.e., diverse) predictions among all classes (as shown in Figure 1(b)). The overall training objective of FreeMatch maximizes the mutual information between the model's input and output (John Bridle, 1991), producing confident and diverse predictions on unlabeled data. Benchmark results validate its effectiveness. To conclude, our contributions are:

  •  

    Using a motivating example, we discuss why thresholds should reflect the model’s learning status and provide some intuitions for designing a threshold-adjusting scheme.

  •  

    We propose a novel approach, FreeMatch, which consists of Self-Adaptive Thresholding (SAT) and Self-Adaptive class Fairness regularization (SAF). SAT is a threshold-adjusting scheme that is free of setting thresholds manually and SAF encourages diverse predictions.

  •  

    Extensive results demonstrate the superior performance of FreeMatch on various SSL benchmarks, especially when the number of labels is very limited (e.g., an error rate reduction of 5.78% on CIFAR-10 with 1 labeled sample per class).

2 A Motivating Example

In this section, we introduce a binary classification example to motivate our threshold-adjusting scheme. Despite the simplification of the actual model and training process, the analysis leads to some interesting implications and provides insight into how the thresholds should be set.

We aim to demonstrate the necessity of self-adaptability and increased granularity in confidence thresholding for SSL. Inspired by (Yang & Xu, 2020), we consider a binary classification problem where the true distribution is an even mixture of two Gaussians (i.e., the label $Y$ is equally likely to be positive ($+1$) or negative ($-1$)). The input $X$ has the following conditional distribution:

X\mid Y=-1\sim\mathcal{N}(\mu_{1},\sigma_{1}^{2}),\quad X\mid Y=+1\sim\mathcal{N}(\mu_{2},\sigma_{2}^{2}). (1)

We assume $\mu_{2}>\mu_{1}$ without loss of generality. Suppose that our classifier outputs the confidence score $s(x)=1/[1+\exp(-\beta(x-\frac{\mu_{1}+\mu_{2}}{2}))]$, where $\beta$ is a positive parameter that reflects the model learning status and is expected to gradually grow during training as the model becomes more confident. Note that $\frac{\mu_{1}+\mu_{2}}{2}$ is in fact the Bayes-optimal linear decision boundary. We consider the scenario where a fixed threshold $\tau\in(\frac{1}{2},1)$ is used to generate pseudo labels. A sample $x$ is assigned pseudo label $+1$ if $s(x)>\tau$ and $-1$ if $s(x)<1-\tau$. The pseudo label is $0$ (masked) if $1-\tau\leq s(x)\leq\tau$.

We then derive the following theorem to show the necessity of self-adaptive threshold:

Theorem 2.1.

For a binary classification problem as mentioned above, the pseudo label $Y_{p}$ has the following probability distribution:

\begin{split}P(Y_{p}=1)&=\frac{1}{2}\Phi(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}})+\frac{1}{2}\Phi(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}),\\ P(Y_{p}=-1)&=\frac{1}{2}\Phi(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}})+\frac{1}{2}\Phi(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}),\\ P(Y_{p}=0)&=1-P(Y_{p}=1)-P(Y_{p}=-1),\end{split} (2)

where $\Phi$ is the cumulative distribution function of a standard normal distribution. Moreover, $P(Y_{p}=0)$ increases as $\mu_{2}-\mu_{1}$ gets smaller.

The proof is offered in Appendix B. Theorem 2.1 has the following implications or interpretations:

  1. (i)

    Trivially, the unlabeled data utilization (sampling rate) $1-P(Y_{p}=0)$ is directly controlled by the threshold $\tau$. As the confidence threshold $\tau$ gets larger, the unlabeled data utilization gets lower. At early training stages, adopting a high threshold may lead to a low sampling rate and slow convergence since $\beta$ is still small.

  2. (ii)

    More interestingly, $P(Y_{p}=1)\neq P(Y_{p}=-1)$ if $\sigma_{1}\neq\sigma_{2}$. In fact, the larger $\tau$ is, the more imbalanced the pseudo labels are. This is potentially undesirable in the sense that we aim to tackle a balanced classification problem. Imbalanced pseudo labels may distort the decision boundary and lead to the so-called pseudo label bias. An easy remedy for this is to use class-specific thresholds $\tau_{2}$ and $1-\tau_{1}$ to assign pseudo labels.

  3. (iii)

    The sampling rate $1-P(Y_{p}=0)$ decreases as $\mu_{2}-\mu_{1}$ gets smaller. In other words, the more similar the two classes are, the more likely an unlabeled sample will be masked. As the two classes get more similar, there would be more samples mixed in feature space where the model is less confident about its predictions, thus a moderate threshold is needed to balance the sampling rate. Otherwise we may not have enough samples to train the model to classify the already difficult-to-classify classes.

The intuitions provided by Theorem 2.1 are that at the early training stages, $\tau$ should be low to encourage diverse pseudo labels, improve unlabeled data utilization, and speed up convergence. However, as training continues and $\beta$ grows larger, a consistently low threshold will lead to unacceptable confirmation bias. Ideally, the threshold $\tau$ should increase along with $\beta$ to maintain a stable sampling rate throughout. Since different classes have different levels of intra-class diversity (different $\sigma$) and some classes are harder to classify than others ($\mu_{2}-\mu_{1}$ being small), a fine-grained class-specific threshold is desirable to encourage fair assignment of pseudo labels to different classes. The challenge is how to design a threshold adjusting scheme that takes all implications into account, which is the main contribution of this paper. We demonstrate our algorithm by plotting the average threshold trend and marginal pseudo label probability (i.e., sampling rate) during training in Figure 1(c) and 1(d). To sum up, we should determine global (dataset-specific) and local (class-specific) thresholds by estimating the learning status via predictions from the model. Then, we detail FreeMatch.
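To make these implications concrete, the following Python sketch (ours, not part of the paper's released code; the parameter values are purely illustrative) evaluates the closed-form distribution in Eq. (2) and prints how the masked fraction and the pseudo-label imbalance react to the confidence scale $\beta$ under a fixed high threshold.

```python
# Numerical illustration of Theorem 2.1 for the two-Gaussian toy problem.
import numpy as np
from scipy.stats import norm

def pseudo_label_dist(mu1, mu2, sigma1, sigma2, beta, tau):
    """Return (P(Yp=+1), P(Yp=-1), P(Yp=0)) from Eq. (2)."""
    shift = np.log(tau / (1.0 - tau)) / beta          # (1/beta) * log(tau / (1 - tau))
    half_gap = (mu2 - mu1) / 2.0
    p_pos = 0.5 * norm.cdf((half_gap - shift) / sigma2) + 0.5 * norm.cdf((-half_gap - shift) / sigma1)
    p_neg = 0.5 * norm.cdf((half_gap - shift) / sigma1) + 0.5 * norm.cdf((-half_gap - shift) / sigma2)
    return p_pos, p_neg, 1.0 - p_pos - p_neg

# With unequal class variances, a fixed high threshold masks almost everything early on
# (small beta) and yields imbalanced pseudo labels even once the model is confident.
for beta in (0.5, 2.0, 8.0):
    p_pos, p_neg, p_mask = pseudo_label_dist(mu1=-1.0, mu2=1.0, sigma1=0.5, sigma2=1.5,
                                             beta=beta, tau=0.95)
    print(f"beta={beta:4.1f}  P(+1)={p_pos:.3f}  P(-1)={p_neg:.3f}  masked={p_mask:.3f}")
```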

3 Preliminaries

In SSL, the training data consists of labeled and unlabeled data. Let $\mathcal{D}_{L}=\{(x_{b},y_{b}):b\in[N_{L}]\}$ and $\mathcal{D}_{U}=\{u_{b}:b\in[N_{U}]\}$ be the labeled and unlabeled data, where $N_{L}$ and $N_{U}$ are their numbers of samples, respectively, and $[N]:=\{1,2,\ldots,N\}$. The supervised loss for labeled data is:

\mathcal{L}_{s}=\frac{1}{B}\sum_{b=1}^{B}\mathcal{H}(y_{b},p_{m}(y|\omega(x_{b}))), (3)

where $B$ is the batch size, $\mathcal{H}(\cdot,\cdot)$ denotes the cross-entropy loss, $\omega(\cdot)$ is the stochastic data augmentation function, and $p_{m}(\cdot)$ is the output probability from the model.

For unlabeled data, we focus on pseudo labeling using the cross-entropy loss with a confidence threshold for entropy minimization. We also adopt the “Weak and Strong Augmentation” strategy introduced by UDA (Xie et al., 2020a). Formally, the unsupervised training objective for unlabeled data is:

\mathcal{L}_{u}=\frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathbbm{1}(\max(q_{b})>\tau)\cdot\mathcal{H}(\hat{q}_{b},Q_{b}). (4)

We use $q_{b}$ and $Q_{b}$ as abbreviations of $p_{m}(y|\omega(u_{b}))$ and $p_{m}(y|\Omega(u_{b}))$, respectively. $\hat{q}_{b}$ is the hard “one-hot” label converted from $q_{b}$, $\mu$ is the ratio of the unlabeled batch size to the labeled batch size, and $\mathbbm{1}(\cdot>\tau)$ is the indicator function for confidence-based thresholding with $\tau$ being the threshold. The weak augmentation (i.e., random crop and flip) and strong augmentation (i.e., RandAugment (Cubuk et al., 2020)) are represented by $\omega(\cdot)$ and $\Omega(\cdot)$, respectively.
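For reference, a minimal PyTorch-style sketch of the losses in Eqs. (3) and (4) is given below. It is our illustration rather than any released implementation; the tensor names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_loss(logits_x_w: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Eq. (3): cross-entropy on the weakly augmented labeled batch.
    return F.cross_entropy(logits_x_w, labels)

def fixed_threshold_unsup_loss(logits_u_w: torch.Tensor,
                               logits_u_s: torch.Tensor,
                               tau: float = 0.95) -> torch.Tensor:
    # Eq. (4): pseudo labels come from the weak view; the loss is applied to the strong
    # view only for samples whose confidence exceeds the fixed threshold tau.
    q_b = torch.softmax(logits_u_w.detach(), dim=-1)   # q_b = p_m(y | weak(u_b))
    conf, pseudo = q_b.max(dim=-1)                     # max(q_b), argmax(q_b)
    mask = (conf > tau).float()                        # 1(max(q_b) > tau)
    per_sample = F.cross_entropy(logits_u_s, pseudo, reduction="none")
    return (per_sample * mask).mean()
```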

Besides, a fairness objective $\mathcal{L}_{f}$ is usually introduced to encourage the model to predict each class at the same frequency, which usually has the form $\mathcal{L}_{f}=\mathbf{U}\log\mathbbm{E}_{\mu B}[q_{b}]$ (Andreas Krause, 2010), where $\mathbf{U}$ is a uniform prior distribution. One may notice that using a uniform prior not only prevents generalization to non-uniform data distributions but also ignores the fact that the underlying pseudo label distribution of a mini-batch may be imbalanced due to the sampling mechanism. Nonetheless, uniformity across a batch is essential for fair utilization of samples with per-class thresholds, especially at early training stages.

4 FreeMatch

Figure 2: Illustration of Self-Adaptive Thresholding (SAT). FreeMatch adopts both global and local self-adaptive thresholds computed from the EMA of prediction statistics from unlabeled samples. Filtered (masked) samples are marked with red X.

4.1 Self-Adaptive Thresholding

We advocate that the key to determining thresholds for SSL is that thresholds should reflect the learning status. The learning effect can be estimated by the prediction confidence of a well-calibrated model (Guo et al., 2017). Hence, we propose self-adaptive thresholding (SAT) that automatically defines and adaptively adjusts the confidence threshold for each class by leveraging the model predictions during training. SAT first estimates a global threshold as the EMA of the confidence from the model. Then, SAT modulates the global threshold via the local class-specific thresholds estimated as the EMA of the probability for each class from the model. When training starts, the threshold is low to accept more possibly correct samples into training. As the model becomes more confident, the threshold adaptively increases to filter out possibly incorrect samples to reduce the confirmation bias. Thus, as shown in Figure 2, we define SAT as $\tau_{t}(c)$, indicating the threshold for class $c$ at the $t$-th iteration.

Self-adaptive Global Threshold

We design the global threshold based on the following two principles. First, the global threshold in SAT should be related to the model's confidence on unlabeled data, reflecting the overall learning status. Moreover, the global threshold should stably increase during training to ensure incorrect pseudo labels are discarded. We set the global threshold $\tau_{t}$ as the average confidence from the model on unlabeled data, where $t$ represents the $t$-th time step (iteration). However, it would be time-consuming to compute the confidence for all unlabeled data at every time step or even every training epoch due to its large volume. Instead, we estimate the global confidence as the exponential moving average (EMA) of the confidence at each training time step. We initialize $\tau_{0}$ as $\frac{1}{C}$, where $C$ indicates the number of classes. The global threshold $\tau_{t}$ is defined and adjusted as:

\tau_{t}=\begin{cases}\frac{1}{C},&\text{ if }t=0,\\ \lambda\tau_{t-1}+(1-\lambda)\frac{1}{\mu B}\sum_{b=1}^{\mu B}\max(q_{b}),&\text{ otherwise, }\end{cases} (5)

where $\lambda\in(0,1)$ is the momentum decay of the EMA.
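A small sketch (ours, with assumed names and a default momentum) of how the global threshold in Eq. (5) can be maintained:

```python
import torch

class GlobalThreshold:
    def __init__(self, num_classes: int, momentum: float = 0.999):
        self.tau = 1.0 / num_classes      # tau_0 = 1 / C
        self.momentum = momentum          # lambda in Eq. (5)

    @torch.no_grad()
    def update(self, probs_u_w: torch.Tensor) -> float:
        # probs_u_w: softmax probabilities q_b on the weakly augmented unlabeled batch.
        batch_conf = probs_u_w.max(dim=-1).values.mean().item()
        self.tau = self.momentum * self.tau + (1.0 - self.momentum) * batch_conf
        return self.tau
```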

Self-adaptive Local Threshold

The local threshold aims to modulate the global threshold in a class-specific fashion to account for the intra-class diversity and the possible class adjacency. We compute the expectation of the model's predictions on each class $c$ to estimate the class-specific learning status:

\tilde{p}_{t}(c)=\begin{cases}\frac{1}{C},&\text{ if }t=0,\\ \lambda\tilde{p}_{t-1}(c)+(1-\lambda)\frac{1}{\mu B}\sum_{b=1}^{\mu B}q_{b}(c),&\text{ otherwise, }\end{cases} (6)

where $\tilde{p}_{t}=[\tilde{p}_{t}(1),\tilde{p}_{t}(2),\dots,\tilde{p}_{t}(C)]$ is the list containing all $\tilde{p}_{t}(c)$. Integrating the global and local thresholds, we obtain the final self-adaptive threshold $\tau_{t}(c)$ as:

\tau_{t}(c)=\operatorname{MaxNorm}(\tilde{p}_{t}(c))\cdot\tau_{t}=\frac{\tilde{p}_{t}(c)}{\max\{\tilde{p}_{t}(c):c\in[C]\}}\cdot\tau_{t}, (7)

where $\operatorname{MaxNorm}$ is the maximum normalization (i.e., $x^{\prime}=\frac{x}{\max(x)}$). Finally, the unsupervised training objective $\mathcal{L}_{u}$ at the $t$-th iteration is:

\mathcal{L}_{u}=\frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathbbm{1}(\max(q_{b})>\tau_{t}(\arg\max(q_{b})))\cdot\mathcal{H}(\hat{q}_{b},Q_{b}). (8)
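The local statistics of Eq. (6), the modulation of Eq. (7), and the SAT-masked loss of Eq. (8) can be sketched as follows; this reuses the GlobalThreshold sketch above for $\tau_{t}$ and is our hedged illustration, not the official implementation.

```python
import torch
import torch.nn.functional as F

class LocalThreshold:
    def __init__(self, num_classes: int, momentum: float = 0.999):
        self.p_tilde = torch.full((num_classes,), 1.0 / num_classes)  # p~_0(c) = 1/C
        self.momentum = momentum                                       # lambda

    @torch.no_grad()
    def update(self, probs_u_w: torch.Tensor, tau_global: float) -> torch.Tensor:
        # Eq. (6): EMA of the batch-mean class probabilities.
        self.p_tilde = self.momentum * self.p_tilde + (1 - self.momentum) * probs_u_w.mean(dim=0)
        # Eq. (7): modulate the global threshold by MaxNorm of the class statistics.
        return self.p_tilde / self.p_tilde.max() * tau_global

def sat_unsup_loss(logits_u_w, logits_u_s, tau_per_class):
    q_b = torch.softmax(logits_u_w.detach(), dim=-1)
    conf, pseudo = q_b.max(dim=-1)
    mask = (conf > tau_per_class[pseudo]).float()      # 1(max(q_b) > tau_t(argmax(q_b)))
    per_sample = F.cross_entropy(logits_u_s, pseudo, reduction="none")
    return (per_sample * mask).mean()                  # Eq. (8)
```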

4.2 Self-Adaptive Fairness

We include the class fairness objective mentioned in Section 3 in FreeMatch to encourage the model to make diverse predictions for each class and thus produce a meaningful self-adaptive threshold, especially in settings where labeled data are rare. Instead of using a uniform prior as in (Arazo et al., 2020), we use the EMA of model predictions $\tilde{p}_{t}$ from Eq. 6 as an estimate of the expectation of the prediction distribution over unlabeled data. We optimize the cross-entropy of $\tilde{p}_{t}$ and $\overline{p}=\mathbbm{E}_{\mu B}[p_{m}(y|\Omega(u_{b}))]$ over the mini-batch as an estimate of $H(\mathbbm{E}_{u}[p_{m}(y|u)])$. Considering that the underlying pseudo label distribution may not be uniform, we propose to modulate the fairness objective in a self-adaptive way, i.e., normalizing the expectation of probability by the histogram distribution of pseudo labels to counter the negative effect of imbalance:

\begin{split}&\overline{p}=\frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathbbm{1}\left(\max(q_{b})\geq\tau_{t}(\arg\max(q_{b}))\right)Q_{b},\\ &\overline{h}=\operatorname{Hist}_{\mu B}\left(\mathbbm{1}\left(\max(q_{b})\geq\tau_{t}(\arg\max(q_{b}))\right)\hat{Q}_{b}\right).\end{split} (9)

Similar to $\tilde{p}_{t}$, we compute $\tilde{h}_{t}$ as:

\tilde{h}_{t}=\lambda\tilde{h}_{t-1}+(1-\lambda)\operatorname{Hist}_{\mu B}(\hat{q}_{b}). (10)

The self-adaptive fairness (SAF) loss $\mathcal{L}_{f}$ at the $t$-th iteration is formulated as:

\mathcal{L}_{f}=-\mathcal{H}\left(\operatorname{SumNorm}\left(\frac{\tilde{p}_{t}}{\tilde{h}_{t}}\right),\operatorname{SumNorm}\left(\frac{\overline{p}}{\overline{h}}\right)\right), (11)

where $\operatorname{SumNorm}(\cdot)=(\cdot)/\sum(\cdot)$. SAF encourages the expected output probability of each mini-batch to be close to the marginal class distribution of the model, after normalization by the histogram distribution. It helps the model produce diverse predictions, especially in barely supervised settings (Sohn et al., 2020), and thus converge faster and generalize better, as also shown in Figure 1(b).

The overall objective for FreeMatch at the $t$-th iteration is:

\mathcal{L}=\mathcal{L}_{s}+w_{u}\mathcal{L}_{u}+w_{f}\mathcal{L}_{f}, (12)

where $w_{u}$ and $w_{f}$ represent the loss weights for $\mathcal{L}_{u}$ and $\mathcal{L}_{f}$, respectively. With $\mathcal{L}_{u}$ and $\mathcal{L}_{f}$, FreeMatch maximizes the mutual information between its outputs and inputs. We present the procedure of FreeMatch in Algorithm 1 of the Appendix.
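A hedged sketch (ours) of the SAF term in Eqs. (9)-(11) and how it enters the overall objective of Eq. (12); `p_tilde` and `h_tilde` denote the EMA statistics from Eqs. (6) and (10), and all names are illustrative.

```python
import torch

def saf_loss(probs_u_w, probs_u_s, tau_per_class, p_tilde, h_tilde, num_classes, eps=1e-8):
    conf, pseudo = probs_u_w.max(dim=-1)
    keep = conf >= tau_per_class[pseudo]                 # mask used in Eq. (9)
    if keep.sum() == 0:                                  # nothing retained in this batch
        return probs_u_w.new_zeros(())
    # p_bar: batch-mean strong-view probability over retained samples (masked rows are zero).
    p_bar = (probs_u_s * keep.unsqueeze(-1).float()).mean(dim=0)
    # h_bar: histogram of strong-view pseudo labels over retained samples.
    pseudo_s = probs_u_s.argmax(dim=-1)[keep]
    h_bar = torch.bincount(pseudo_s, minlength=num_classes).float() / probs_u_s.shape[0]
    target = p_tilde / (h_tilde + eps); target = target / target.sum()   # SumNorm(p~ / h~)
    pred = p_bar / (h_bar + eps); pred = pred / pred.sum()               # SumNorm(p_bar / h_bar)
    # Eq. (11): L_f = -H(target, pred) = sum(target * log(pred)).
    return (target * torch.log(pred + eps)).sum()

# Eq. (12): overall objective (weights as in the experiments section).
# loss = sup_loss + w_u * sat_unsup_loss(...) + w_f * saf_loss(...)
```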

5 Experiments

5.1 Setup

We evaluate FreeMatch on common benchmarks: CIFAR-10/100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), STL-10 (Coates et al., 2011) and ImageNet (Deng et al., 2009). Following previous work (Sohn et al., 2020; Xu et al., 2021; Zhang et al., 2021; Oliver et al., 2018), we conduct experiments with varying amounts of labeled data. In addition to the commonly-chosen labeled amounts, following (Sohn et al., 2020), we further include the most challenging case of CIFAR-10: each class has only one labeled sample.

For fair comparison, we train and evaluate all methods using the unified codebase TorchSSL (Zhang et al., 2021) with the same backbones and hyperparameters. Concretely, we use Wide ResNet-28-2 (Zagoruyko & Komodakis, 2016) for CIFAR-10, Wide ResNet-28-8 for CIFAR-100, Wide ResNet-37-2 (Zhou et al., 2020) for STL-10, and ResNet-50 (He et al., 2016) for ImageNet. We use SGD with a momentum of 0.9 as the optimizer. The initial learning rate is 0.03 with a cosine learning rate decay schedule $\eta=\eta_{0}\cos(\frac{7\pi k}{16K})$, where $\eta_{0}$ is the initial learning rate, $k$ ($K$) is the current (total) training step, and we set $K=2^{20}$ for all datasets. At the testing phase, we use an exponential moving average of the training model with a momentum of 0.999 to conduct inference for all algorithms. The batch size of labeled data is 64, except for ImageNet where we set it to 128. We use the same weight decay value, pre-defined threshold $\tau$, unlabeled batch ratio $\mu$, and loss weights introduced for Pseudo-Label (Lee et al., 2013), $\Pi$ model (Rasmus et al., 2015), Mean Teacher (Tarvainen & Valpola, 2017), VAT (Miyato et al., 2018), MixMatch (Berthelot et al., 2019b), ReMixMatch (Berthelot et al., 2019a), UDA (Xie et al., 2020a), FixMatch (Sohn et al., 2020), and FlexMatch (Zhang et al., 2021).
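For clarity, the cosine learning-rate schedule above can be written as the following small sketch (ours); the hyperparameter values follow the text, and everything else is illustrative.

```python
import math

def cosine_lr(step: int, total_steps: int = 2 ** 20, base_lr: float = 0.03) -> float:
    # eta = eta_0 * cos(7 * pi * k / (16 * K))
    return base_lr * math.cos(7.0 * math.pi * step / (16.0 * total_steps))

# Example: learning rate at the start, midway, and at the end of training.
for k in (0, 2 ** 19, 2 ** 20):
    print(k, round(cosine_lr(k), 5))
```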

We implement MPL based on UDA as in (Pham et al., 2021), where we set the temperature to 0.8 and $w_{u}$ to 10. We do not fine-tune MPL on labeled data as in (Pham et al., 2021), since we find that fine-tuning makes the model overfit the labeled data, especially when there are very few of them. For Dash, we use the same parameters as in (Xu et al., 2021), except that we warm up on labeled data for only 2 epochs (i.e., 2,048 training iterations), since too much warm-up leads to overfitting. For FreeMatch, we set $w_{u}=1$ for all experiments. Besides, we set $w_{f}=0.01$ for CIFAR-10 with 10 labels, CIFAR-100 with 400 labels, STL-10 with 40 labels, ImageNet with 100k labels, and all experiments on SVHN. For other settings, we use $w_{f}=0.05$. For SVHN, we find that using a low threshold at the early training stage impedes the model from clustering the unlabeled data, so we adopt two training techniques: (1) warming up the model on only labeled data for 2 epochs as in Dash; and (2) restricting the SAT threshold to the range $[0.9, 0.95]$. The detailed hyperparameters are introduced in Appendix D. We train each algorithm 3 times using different random seeds and report the best error rates of all checkpoints (Zhang et al., 2021).

Table 1: Error rates on CIFAR-10/100, SVHN, and STL-10 datasets. The fully-supervised results of STL-10 are unavailable since we do not have label information for its unlabeled data. Bold indicates the best result and underline indicates the second-best result. The significance tests and average error rates for each dataset can be found in Appendix E.1.
Dataset CIFAR-10 CIFAR-100 SVHN STL-10
# Label 10 40 250 4000 400 2500 10000 40 250 1000 40 1000
Π Model (Rasmus et al., 2015) 79.18±1.11 74.34±1.76 46.24±1.29 13.13±0.59 86.96±0.80 58.80±0.66 36.65±0.00 67.48±0.95 13.30±1.12 7.16±0.11 74.31±0.85 32.78±0.40
Pseudo Label (Lee et al., 2013) 80.21±0.55 74.61±0.26 46.49±2.20 15.08±0.19 87.45±0.85 57.74±0.28 36.55±0.24 64.61±5.6 15.59±0.95 9.40±0.32 74.68±0.99 32.64±0.71
VAT (Miyato et al., 2018) 79.81±1.17 74.66±2.12 41.03±1.79 10.51±0.12 85.20±1.40 46.84±0.79 32.14±0.19 74.75±3.38 4.33±0.12 4.11±0.20 74.74±0.38 37.95±1.12
MeanTeacher (Tarvainen & Valpola, 2017) 76.37±0.44 70.09±1.60 37.46±3.30 8.10±0.21 81.11±1.44 45.17±1.06 31.75±0.23 36.09±3.98 3.45±0.03 3.27±0.05 71.72±1.45 33.90±1.37
MixMatch (Berthelot et al., 2019b) 65.76±7.06 36.19±6.48 13.63±0.59 6.66±0.26 67.59±0.66 39.76±0.48 27.78±0.29 30.60±8.39 4.56±0.32 3.69±0.37 54.93±0.96 21.70±0.68
ReMixMatch (Berthelot et al., 2019a) 20.77±7.48 9.88±1.03 6.30±0.05 4.84±0.01 42.75±1.05 26.03±0.35 20.02±0.27 24.04±9.13 6.36±0.22 5.16±0.31 32.12±6.24 6.74±0.14
UDA (Xie et al., 2020a) 34.53±10.69 10.62±3.75 5.16±0.06 4.29±0.07 46.39±1.59 27.73±0.21 22.49±0.23 5.12±4.27 1.92±0.05 1.89±0.01 37.42±8.44 6.64±0.17
FixMatch (Sohn et al., 2020) 24.79±7.65 7.47±0.28 4.86±0.05 4.21±0.08 46.42±0.82 28.03±0.16 22.20±0.12 3.81±1.18 2.02±0.02 1.96±0.03 35.97±4.14 6.25±0.33
Dash (Xu et al., 2021) 27.28±14.09 8.93±3.11 5.16±0.23 4.36±0.11 44.82±0.96 27.15±0.22 21.88±0.07 2.19±0.18 2.04±0.02 1.97±0.01 34.52±4.30 6.39±0.56
MPL (Pham et al., 2021) 23.55±6.01 6.62±0.91 5.76±0.24 4.55±0.04 46.26±1.84 27.71±0.19 21.74±0.09 9.33±8.02 2.29±0.04 2.28±0.02 35.76±4.83 6.66±0.00
FlexMatch (Zhang et al., 2021) 13.85±12.04 4.97±0.06 4.98±0.09 4.19±0.01 39.94±1.62 26.49±0.20 21.90±0.15 8.19±3.20 6.59±2.29 6.72±0.30 29.15±4.16 5.77±0.18
FreeMatch 8.07±4.24 4.90±0.04 4.88±0.18 4.10±0.02 37.98±0.42 26.47±0.20 21.68±0.03 1.97±0.02 1.97±0.01 1.96±0.03 15.56±0.55 5.63±0.15
Fully-Supervised 4.62±0.05 19.30±0.09 2.13±0.01 -

5.2 Quantitative Results

The Top-1 classification error rates on CIFAR-10/100, SVHN, and STL-10 are reported in Table 1. The results on ImageNet with 100 labels per class are in Table 2. We also provide detailed results on precision, recall, F1 score, and confusion matrices in Appendix E.3. These quantitative results demonstrate that FreeMatch achieves the best performance on the CIFAR-10, STL-10, and ImageNet datasets, and it produces results on SVHN very close to those of the best competitor. On CIFAR-100, FreeMatch is better than ReMixMatch when there are 400 labels. The good performance of ReMixMatch on CIFAR-100 (2500) and CIFAR-100 (10000) is probably brought by the mixup (Zhang et al., 2017) technique and the self-supervised learning part. On ImageNet with 100k labels, FreeMatch significantly outperforms the latest counterpart FlexMatch by 1.28%. Following (Zhang et al., 2021), we train ImageNet for $2^{20}$ iterations like the other datasets for a fair comparison, using 4 Tesla V100 GPUs. We also notice from Table 2 that FreeMatch exhibits fast computation on ImageNet. Note that FlexMatch is much slower than FixMatch and FreeMatch because it needs to maintain a list that records whether each sample is clean, which incurs a heavy indexing overhead on large datasets.

Table 2: Error rates and runtime on ImageNet with 100 labels per class.
Method Top-1 Top-5 Runtime (sec./iter.)
FixMatch 43.66 21.80 0.4
FlexMatch 41.85 19.48 0.6
FreeMatch 40.57 18.77 0.4

It is noteworthy that FreeMatch consistently outperforms other methods by a large margin in settings with extremely limited labeled data: by 5.78% on CIFAR-10 with 10 labels, 1.96% on CIFAR-100 with 400 labels, and, surprisingly, 13.59% on STL-10 with 40 labels. STL-10 is a more realistic and challenging dataset than the others, as it contains a large unlabeled set of 100k images. The significant improvements demonstrate the capability and potential of FreeMatch for deployment in real-world applications.

5.3 Qualitative Analysis

We present some qualitative analysis: why and how does FreeMatch work, and what other benefits does it bring? We evaluate the class-average threshold and average sampling rate of FreeMatch on STL-10 (40) (i.e., 40 labeled samples on STL-10) to demonstrate that it works in line with our theoretical analysis. We record the threshold and compute the sampling rate for each batch during training. The sampling rate is calculated on unlabeled data as $\frac{\sum_{b=1}^{\mu B}\mathbbm{1}(\max(q_{b})>\tau_{t}(\arg\max(q_{b})))}{\mu B}$. We also plot the convergence speed in terms of accuracy and the confusion matrix to show that the proposed components in FreeMatch help improve performance. From Figure 3(a) and Figure 3(b), one can observe that the threshold and sampling rate changes of FreeMatch are mostly consistent with our theoretical analysis. That is, at the early stage of training, the threshold of FreeMatch is relatively lower compared to FlexMatch and FixMatch, resulting in higher unlabeled data utilization (sampling rate), which speeds up convergence. As the model learns better and becomes more confident, the threshold of FreeMatch increases to a high value to alleviate the confirmation bias, leading to a stably high sampling rate. Correspondingly, the accuracy of FreeMatch increases rapidly (as shown in Figure 3(c)), resulting in better class-wise accuracy (as shown in Figure 3(d)). Note that Dash fails to learn properly because it maintains a high sampling rate until 100k iterations.

To further demonstrate the effectiveness of the class-specific threshold in FreeMatch, we present the t-SNE (Van der Maaten & Hinton, 2008) visualization of the features of FlexMatch and FreeMatch on STL-10 (40) in Figure 5 of Section E.8. We exhibit the corresponding local threshold for each class. Interestingly, FlexMatch has a high threshold, i.e., the pre-defined 0.95, for class 0 and class 6, yet their feature variances are very large and they are confused with other classes. This means the class-wise thresholds in FlexMatch cannot accurately reflect the learning status. In contrast, FreeMatch clusters most classes better. Besides, for the similar classes 1, 3, 5, and 7 that are confused with each other, FreeMatch retains a higher average threshold of 0.87 than the 0.84 of FlexMatch, enabling it to mask more wrong pseudo labels. We also study the pseudo label accuracy in Appendix E.9, which shows that FreeMatch reduces noise during training.

Figure 3: How FreeMatch works in STL-10 with 40 labels, compared to others. (a) Class-average confidence threshold; (b) class-average sampling rate; (c) convergence speed in terms of accuracy; (d) confusion matrix, where fading colors of diagonal elements refer to the disparity of accuracy.

5.4 Ablation Study

Self-adaptive Threshold

We conduct experiments on the components of SAT in FreeMatch and compare to the components in FlexMatch (Zhang et al., 2021), FixMatch (Sohn et al., 2020), Class-Balanced Self-Training (CBST) (Zou et al., 2018), and Relative Threshold (RT) in AdaMatch (Berthelot et al., 2022). The ablation is conducted on CIFAR-10 (40 labels).

Table 3: Comparison of different thresholding schemes.
Threshold CIFAR-10 (40)
$\tau$ (FixMatch) 7.47±0.28
$\tau\cdot\mathcal{M}(\beta(c))$ (FlexMatch) 4.97±0.06
$\tau\cdot\operatorname{MaxNorm}(\tilde{p}_{t}(c))$ 5.13±0.03
$\tau_{t}$ (Global) 6.06±0.65
$\tau_{t}\cdot\mathcal{M}(\beta(c))$ 8.40±2.49
CBST 16.65±2.90
RT (AdaMatch) 6.09±0.65
SAT (Global and Local) 4.92±0.04

As shown in Table 3, SAT achieves the best performance among all the thresholding schemes. The self-adaptive global threshold $\tau_{t}$ and local threshold $\operatorname{MaxNorm}(\tilde{p}_{t}(c))$ by themselves also achieve comparable results to the fixed threshold $\tau$, demonstrating that both the proposed local and global thresholds are good learning-status estimators. When using CPL $\mathcal{M}(\beta(c))$ to adjust $\tau_{t}$, the result is worse than the fixed threshold and exhibits a larger variance, indicating potential instability of CPL. AdaMatch (Berthelot et al., 2022) uses RT, which can be viewed as a global threshold at the $t$-th iteration computed on the predictions of labeled data without EMA, whereas FreeMatch computes $\tau_{t}$ with EMA on unlabeled data, which better reflects the overall data distribution. For the class-wise threshold, CBST (Zou et al., 2018) maintains a pre-defined sampling rate, which could be the reason for its poor performance, since the sampling rate should change during training as we analyzed in Sec. 2. Note that we did not include $\mathcal{L}_{f}$ in this ablation for a fair comparison. Ablation studies in Appendix E.4 and E.5 on FixMatch and FlexMatch with different thresholds show that SAT reduces the hyperparameter-tuning computation (or overall training time) while achieving performance similar to that of an optimally selected threshold.

Table 4: Comparison of different class fairness items.
Fairness CIFAR-10 (10)
w/o fairness 10.37±7.70
$U\log\overline{p}$ 9.57±6.67
$U\log\operatorname{SumNorm}(\frac{\overline{p}}{\overline{h}})$ 12.07±5.23
DA (AdaMatch) 32.94±1.83
DA (ReMixMatch) 11.06±8.21
SAF 8.07±4.24

Self-adaptive Fairness

As illustrated in Table 4, we also empirically study the effect of SAF on CIFAR-10 (10 labels). We study the original version of the fairness objective as in (Arazo et al., 2020). Based on that, we study the operation of normalizing the probability by histograms and show that countering the effect of the imbalanced underlying distribution indeed helps the model learn better and make more diverse predictions. One may notice that adding the original fairness regularization alone already improves performance, whereas adding the normalization operation inside the log hurts performance, suggesting that the underlying batch data are indeed not uniformly distributed. We also evaluate Distribution Alignment (DA) for class fairness as in ReMixMatch (Berthelot et al., 2019a) and AdaMatch (Berthelot et al., 2022), which shows inferior results to SAF. A possible reason for the worse performance of DA (AdaMatch) is that it only uses the labeled-batch prediction as the target distribution, which cannot reflect the true data distribution, especially when labeled data is scarce; changing the target distribution to the ground-truth uniform distribution, i.e., DA (ReMixMatch), is better in the case with extremely limited labels. We also show that SAF can be easily plugged into FlexMatch and brings improvements in Appendix E.6. The EMA decay ablation and the performance in imbalanced settings are in Appendix E.5 and Appendix E.7.

6 Related Work

To reduce confirmation bias (Arazo et al., 2020) in pseudo labeling, confidence-based thresholding techniques have been proposed to ensure the quality of pseudo labels (Xie et al., 2020a; Sohn et al., 2020; Zhang et al., 2021; Xu et al., 2021), where only the unlabeled data whose confidences are higher than the threshold are retained. UDA (Xie et al., 2020a) and FixMatch (Sohn et al., 2020) keep a fixed pre-defined threshold during training. FlexMatch (Zhang et al., 2021) adjusts the pre-defined threshold in a class-specific fashion according to the per-class learning status estimated by the number of confident unlabeled samples. A concurrent work, Adsh (Guo & Li, 2022), explicitly optimizes the number of pseudo labels for each class in the SSL objective to obtain adaptive thresholds for imbalanced semi-supervised learning. However, it still needs a user-predefined threshold. Dash (Xu et al., 2021) defines a threshold according to the loss on labeled data and adjusts it according to a fixed mechanism. A more recent work, AdaMatch (Berthelot et al., 2022), aims to unify SSL and domain adaptation using a pre-defined threshold multiplied by the average confidence of the labeled data batch to mask noisy pseudo labels. It needs a pre-defined threshold and ignores the unlabeled data distribution, especially when labeled data is too rare to reflect it. Besides, distribution alignment (Berthelot et al., 2019a; 2022) is also utilized in AdaMatch to encourage fair predictions on unlabeled data. Previous methods might fail to choose meaningful thresholds because they ignore the relationship between the model's learning status and the thresholds. Chen et al. (2020); Kumar et al. (2020) try to understand self-training / thresholding from the theoretical perspective. In this work, a motivating example is used to derive implications for adjusting the threshold according to the learning status.

Besides consistency regularization, entropy-based regularization is also used in SSL. Entropy minimization (Grandvalet et al., 2005) encourages the model to make confident predictions for all samples regardless of the actual class predicted. Maximizing the entropy of the expected prediction over all samples (Andreas Krause, 2010; Arazo et al., 2020) has also been proposed to induce fairness, enforcing the model to predict each class at the same frequency. However, previous methods assume a uniform prior for the underlying data distribution and also ignore the batch data distribution. Distribution alignment (Berthelot et al., 2019a) adjusts the pseudo labels according to the labeled data distribution and the EMA of model predictions.

7 Conclusion

We proposed FreeMatch that utilizes self-adaptive thresholding and class-fairness regularization for SSL. FreeMatch outperforms strong competitors across a variety of SSL benchmarks, especially in the barely-supervised setting. We believe that confidence thresholding has more potential in SSL. A potential limitation is that the adaptiveness still originates from the heuristics of the model prediction, and we hope the efficacy of FreeMatch inspires more research for optimal thresholding.

References

  • Andreas Krause (2010) Ryan Gomes Andreas Krause, Pietro Perona. Discriminative clustering by regularized information maximization. In Advances in neural information processing systems, 2010.
  • Arazo et al. (2020) Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pp.  1–8. IEEE, 2020.
  • Bachman et al. (2014) Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. Advances in neural information processing systems, 27:3365–3373, 2014.
  • Berthelot et al. (2019a) David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution matching and augmentation anchoring. In International Conference on Learning Representations, 2019a.
  • Berthelot et al. (2019b) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems, 32, 2019b.
  • Berthelot et al. (2022) David Berthelot, Rebecca Roelofs, Kihyuk Sohn, Nicholas Carlini, and Alex Kurakin. Adamatch: A unified approach to semi-supervised learning and domain adaptation. In International Conference on Learning Representations (ICLR), 2022.
  • Carlini et al. (2019) Nicholas Carlini, Ulfar Erlingsson, and Nicolas Papernot. Distribution density, tails, and outliers in machine learning: Metrics and applications. arXiv preprint arXiv:1910.13427, 2019.
  • Chapelle et al. (2006) Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien (eds.). Semi-Supervised Learning. The MIT Press, 2006.
  • Chen et al. (2020) Yining Chen, Colin Wei, Ananya Kumar, and Tengyu Ma. Self-training avoids using spurious features under domain shift. Advances in Neural Information Processing Systems, 33:21061–21071, 2020.
  • Coates et al. (2011) Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp.  215–223. JMLR Workshop and Conference Proceedings, 2011.
  • Cubuk et al. (2020) Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp.  702–703, 2020.
  • Dai et al. (2017) Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Russ R Salakhutdinov. Good semi-supervised learning that requires a bad gan. Advances in neural information processing systems, 30, 2017.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.  248–255. Ieee, 2009.
  • Dong et al. (2018) Linhao Dong, Shuang Xu, and Bo Xu. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  5884–5888. IEEE, 2018.
  • Fan et al. (2021) Yue Fan, Dengxin Dai, and Bernt Schiele. Cossl: Co-learning of representation and classifier for imbalanced semi-supervised learning. arXiv preprint arXiv:2112.04564, 2021.
  • Gong et al. (2016) Chen Gong, Dacheng Tao, Stephen J Maybank, Wei Liu, Guoliang Kang, and Jie Yang. Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing, 25(7):3249–3260, 2016.
  • Grandvalet et al. (2005) Yves Grandvalet, Yoshua Bengio, et al. Semi-supervised learning by entropy minimization. volume 367, pp.  281–296, 2005.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. PMLR, 2017.
  • Guo & Li (2022) Lan-Zhe Guo and Yu-Feng Li. Class-imbalanced semi-supervised learning with adaptive thresholding. In International Conference on Machine Learning, pp. 8082–8094. PMLR, 2022.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • John Bridle (1991) David MacKay John Bridle, Anthony Heading. Unsupervised classifiers, mutual information and ’phantom targets. 1991.
  • Kervadec et al. (2019) Hoel Kervadec, Jose Dolz, Éric Granger, and Ismail Ben Ayed. Curriculum semi-supervised segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp.  568–576. Springer, 2019.
  • Kim et al. (2020) Jaehyung Kim, Youngbum Hur, Sejun Park, Eunho Yang, Sung Ju Hwang, and Jinwoo Shin. Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning. Advances in Neural Information Processing Systems, 33:14567–14579, 2020.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krizhevsky et al. (2009) Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
  • Kumar et al. (2020) Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning, pp. 5468–5479. PMLR, 2020.
  • Lee et al. (2013) Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, pp.  896, 2013.
  • Lee et al. (2021) Hyuck Lee, Seungjae Shin, and Heeyoung Kim. Abc: Auxiliary balanced classifier for class-imbalanced semi-supervised learning. Advances in Neural Information Processing Systems, 34, 2021.
  • McLachlan (1975) Geoffrey J McLachlan. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association, 70(350):365–369, 1975.
  • Miyato et al. (2018) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Oliver et al. (2018) Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. Advances in neural information processing systems, 31, 2018.
  • Pham et al. (2021) Hieu Pham, Zihang Dai, Qizhe Xie, and Quoc V Le. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11557–11568, 2021.
  • Rasmus et al. (2015) Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. Advances in Neural Information Processing Systems, 28:3546–3554, 2015.
  • Rizve et al. (2020) Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In International Conference on Learning Representations, 2020.
  • Rosenberg et al. (2005) Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. 2005.
  • Sajjadi et al. (2016) Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in neural information processing systems, 29:1163–1171, 2016.
  • Samuli & Timo (2017) Laine Samuli and Aila Timo. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations (ICLR), volume 4, pp.  6, 2017.
  • Sohn et al. (2020) Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33, 2020.
  • Tarvainen & Valpola (2017) Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp.  1195–1204, 2017.
  • Van der Maaten & Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. (2022) Yidong Wang, Hao Chen, Yue Fan, SUN Wang, Ran Tao, Wenxin Hou, Renjie Wang, Linyi Yang, Zhi Zhou, Lan-Zhe Guo, et al. Usb: A unified semi-supervised learning benchmark for classification. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • Wei et al. (2021) Chen Wei, Kihyuk Sohn, Clayton Mellina, Alan Yuille, and Fan Yang. Crest: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10857–10866, 2021.
  • Xie et al. (2020a) Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33, 2020a.
  • Xie et al. (2020b) Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10687–10698, 2020b.
  • Xu et al. (2021) Yi Xu, Lei Shang, Jinxing Ye, Qi Qian, Yu-Feng Li, Baigui Sun, Hao Li, and Rong Jin. Dash: Semi-supervised learning with dynamic thresholding. In International Conference on Machine Learning, pp. 11525–11536. PMLR, 2021.
  • Yang & Xu (2020) Yuzhe Yang and Zhi Xu. Rethinking the value of labels for improving class-imbalanced learning. In NeurIPS, 2020.
  • Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
  • Zhang et al. (2021) Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34, 2021.
  • Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Zhou et al. (2020) Tianyi Zhou, Shengjie Wang, and Jeff Bilmes. Time-consistent self-supervision for semi-supervised learning. In International Conference on Machine Learning, pp. 11523–11533. PMLR, 2020.
  • Zhu & Goldberg (2009) Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning, 3(1):1–130, 2009.
  • Zhu (2005) Xiaojin Jerry Zhu. Semi-supervised learning literature survey. 2005.
  • Zou et al. (2018) Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pp.  289–305, 2018.

Appendix A Experimental details of the “two-moon” dataset.

We generate only two labeled data points (one label per class, denoted by the black dot and round circle) and 1,000 unlabeled data points (in gray) in 2-D space. We train a 3-layer MLP with 64 neurons in each layer and ReLU activation for 2,000 iterations. The red samples indicate the samples whose confidence values are above the threshold of FreeMatch but below that of FixMatch. The sampling rate is computed on unlabeled data as $\sum_{b=1}^{N_{U}}\mathbbm{1}(\max(q_{b})>\tau)/N_{U}$. Results are averaged over 5 runs.
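A minimal sketch (ours) of how such a setup could be constructed with scikit-learn and PyTorch is shown below; the labeled indices, noise level, and network definition are assumptions, and the SSL training loop is omitted.

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_moons

# 1,000+ unlabeled points plus one labeled point per class (indices are illustrative).
x_all, y_all = make_moons(n_samples=1002, noise=0.1, random_state=0)
x_all = torch.tensor(x_all, dtype=torch.float32)
y_all = torch.tensor(y_all)
labeled_idx = [int((y_all == 0).nonzero()[0]), int((y_all == 1).nonzero()[0])]

mlp = nn.Sequential(                      # 3-layer MLP, 64 neurons per hidden layer, ReLU
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),
)
print(x_all.shape, mlp(x_all[labeled_idx]).shape)
```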

Appendix B Proof of Theorem 2.1

Theorem 2.1 For a binary classification problem as mentioned above, the pseudo label $Y_{p}$ has the following probability distribution:

\begin{split}P(Y_{p}=1)&=\frac{1}{2}\Phi(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}})+\frac{1}{2}\Phi(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}),\\ P(Y_{p}=-1)&=\frac{1}{2}\Phi(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}})+\frac{1}{2}\Phi(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}),\\ P(Y_{p}=0)&=1-P(Y_{p}=1)-P(Y_{p}=-1),\end{split} (13)

where $\Phi$ is the cumulative distribution function of a standard normal distribution. Moreover, $P(Y_{p}=0)$ increases as $\mu_{2}-\mu_{1}$ gets smaller.

Proof.

A sample $x$ will be assigned pseudo label $+1$ if

\frac{1}{1+\exp{(-\beta(x-\frac{\mu_{1}+\mu_{2}}{2}))}}>\tau,

which is equivalent to

x>\frac{\mu_{1}+\mu_{2}}{2}+\frac{1}{\beta}\log(\frac{\tau}{1-\tau}).

Likewise, x will be assigned pseudo label -1 if

\frac{1}{1+\exp{(-\beta(x-\frac{\mu_{1}+\mu_{2}}{2}))}}<1-\tau,

which is equivalent to

x<\frac{\mu_{1}+\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau}).

If we integrate over x, we arrive at the following conditional probabilities:

\displaystyle P(Y_{p}=1|Y=1)=\Phi\left(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}\right),
\displaystyle P(Y_{p}=1|Y=-1)=\Phi\left(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}\right),
\displaystyle P(Y_{p}=-1|Y=-1)=\Phi\left(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}\right),
\displaystyle P(Y_{p}=-1|Y=1)=\Phi\left(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}\right).

Recall that P(Y=1)=P(Y=-1)=0.5, therefore

\displaystyle P(Y_{p}=1)=\frac{1}{2}\Phi\left(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}\right)+\frac{1}{2}\Phi\left(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}\right),
\displaystyle P(Y_{p}=-1)=\frac{1}{2}\Phi\left(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}\right)+\frac{1}{2}\Phi\left(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}\right).

Now, let us use z to denote \mu_{2}-\mu_{1}. To show that P(Y_{p}=0) increases as \mu_{2}-\mu_{1} gets smaller, it suffices to show that P(Y_{p}=-1)+P(Y_{p}=1) increases as z increases. We write P(Y_{p}=-1)+P(Y_{p}=1) as

P(Y_{p}=-1)+P(Y_{p}=1)=\frac{1}{2}\Phi(a_{1}z-b_{1})+\frac{1}{2}\Phi(-a_{1}z-b_{1})+\frac{1}{2}\Phi(a_{2}z-b_{2})+\frac{1}{2}\Phi(-a_{2}z-b_{2}),

where a_{1}=\frac{1}{2\sigma_{1}},a_{2}=\frac{1}{2\sigma_{2}},b_{1}=\frac{\log(\frac{\tau}{1-\tau})}{\beta\sigma_{1}},b_{2}=\frac{\log(\frac{\tau}{1-\tau})}{\beta\sigma_{2}} are positive constants. It therefore suffices to show that f(z)=\frac{1}{2}\Phi(a_{1}z-b_{1})+\frac{1}{2}\Phi(-a_{1}z-b_{1}) is monotone increasing on (0,\infty); the same argument applies to the pair involving a_{2} and b_{2}. Taking the derivative with respect to z, we have

f^{\prime}(z)=\frac{1}{2}a_{1}(\phi(a_{1}z-b_{1})-\phi(-a_{1}z-b_{1})),

where \phi is the probability density function of the standard normal distribution. Since |a_{1}z-b_{1}|<|-a_{1}z-b_{1}| for z>0, we have f^{\prime}(z)>0, and the proof is complete.
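As a numerical sanity check (not part of the proof), the closed-form probabilities in Theorem 2.1 can be compared against a Monte-Carlo simulation of the balanced two-Gaussian mixture; the parameter values below are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import norm

# Arbitrary illustrative parameters (mu1 < mu2).
mu1, mu2, sigma1, sigma2, beta, tau = -1.0, 1.0, 1.0, 1.5, 2.0, 0.95
shift = np.log(tau / (1 - tau)) / beta
m = (mu2 - mu1) / 2 - shift
n = (mu1 - mu2) / 2 - shift

# Closed-form probabilities from Theorem 2.1.
p_pos = 0.5 * norm.cdf(m / sigma2) + 0.5 * norm.cdf(n / sigma1)
p_neg = 0.5 * norm.cdf(m / sigma1) + 0.5 * norm.cdf(n / sigma2)

# Monte-Carlo estimate: sample x from the balanced mixture and apply the sigmoid thresholding rule.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=1_000_000)
x = np.where(y == 1, rng.normal(mu2, sigma2, y.size), rng.normal(mu1, sigma1, y.size))
conf = 1.0 / (1.0 + np.exp(-beta * (x - (mu1 + mu2) / 2)))
print(p_pos, (conf > tau).mean())      # P(Y_p = 1)
print(p_neg, (conf < 1 - tau).mean())  # P(Y_p = -1)
```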

Appendix C Algorithm

We present the pseudo-code of FreeMatch in Algorithm 1. Compared to FixMatch, each training step additionally updates the global and local thresholds from the unlabeled data batch and computes the corresponding histograms. FreeMatch thus introduces only a negligible additional computation cost over FixMatch, as also demonstrated in our main paper. A short PyTorch-style sketch of the threshold update is provided after Algorithm 1.

Algorithm 1 FreeMatch algorithm at the t-th iteration.
1:  Input: Number of classes C, labeled batch \mathcal{X}=\{(x_{b},y_{b}):b\in(1,2,\dots,B)\}, unlabeled batch \mathcal{U}=\{u_{b}:b\in(1,2,\dots,\mu B)\}, unsupervised loss weight w_{u}, fairness loss weight w_{f}, and EMA decay \lambda.
2:  Compute \mathcal{L}_{s} on labeled data: \mathcal{L}_{s}=\frac{1}{B}\sum_{b=1}^{B}\mathcal{H}(y_{b},p_{m}(y|\omega(x_{b})))
3:  Update the global threshold: \tau_{t}=\lambda\tau_{t-1}+(1-\lambda)\frac{1}{\mu B}\sum_{b=1}^{\mu B}\max(q_{b}) {q_{b} is an abbreviation of p_{m}(y|\omega(u_{b})); shape of \tau_{t}: [1]}
4:  Update the local threshold: \tilde{p}_{t}=\lambda\tilde{p}_{t-1}+(1-\lambda)\frac{1}{\mu B}\sum_{b=1}^{\mu B}q_{b} {shape of \tilde{p}_{t}: [C]}
5:  Update the histogram for \tilde{p}_{t}: \tilde{h}_{t}=\lambda\tilde{h}_{t-1}+(1-\lambda)\operatorname{Hist}_{\mu B}(\hat{q}_{b}) {shape of \tilde{h}_{t}: [C]}
6:  for c=1 to C do
7:     \tau_{t}(c)=\operatorname{MaxNorm}(\tilde{p}_{t}(c))\cdot\tau_{t} {calculate the self-adaptive threshold (SAT)}
8:  end for
9:  Compute \mathcal{L}_{u} on unlabeled data: \mathcal{L}_{u}=\frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathbbm{1}(\max(q_{b})\geq\tau_{t}(\arg\max(q_{b})))\cdot\mathcal{H}(\hat{q}_{b},Q_{b})
10:  Compute the expectation of probability on unlabeled data: \overline{p}=\frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathbbm{1}(\max(q_{b})\geq\tau_{t}(\arg\max(q_{b})))Q_{b} {Q_{b} is an abbreviation of p_{m}(y|\Omega(u_{b})); shape of \overline{p}: [C]}
11:  Compute the histogram for \overline{p}: \overline{h}=\operatorname{Hist}_{\mu B}(\mathbbm{1}(\max(q_{b})\geq\tau_{t}(\arg\max(q_{b})))\hat{Q}_{b}) {shape of \overline{h}: [C]}
12:  Compute \mathcal{L}_{f} on unlabeled data: \mathcal{L}_{f}=-\mathcal{H}(\operatorname{SumNorm}(\frac{\tilde{p}_{t}}{\tilde{h}_{t}}),\operatorname{SumNorm}(\frac{\overline{p}}{\overline{h}}))
13:  Return: \mathcal{L}_{s}+w_{u}\cdot\mathcal{L}_{u}+w_{f}\cdot\mathcal{L}_{f}
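For readers who prefer code, a minimal PyTorch-style sketch of steps 3-9 (threshold updates, SAT, and the masked unsupervised loss) follows. The helper functions and initial values are ours, the histogram-based fairness loss is omitted, and this is a simplification rather than the released implementation:

```python
import torch
import torch.nn.functional as F

def update_thresholds(q, tau_prev, p_tilde_prev, lam=0.999):
    """EMA update of the global threshold tau_t and the local (per-class) estimate p_tilde_t
    from weak-augmentation probabilities q of shape [mu*B, C]; returns them with the SAT."""
    tau_t = lam * tau_prev + (1 - lam) * q.max(dim=1).values.mean()
    p_tilde_t = lam * p_tilde_prev + (1 - lam) * q.mean(dim=0)
    sat = p_tilde_t / p_tilde_t.max() * tau_t        # MaxNorm(p_tilde_t) * tau_t
    return tau_t, p_tilde_t, sat

def unsupervised_loss(q_weak, logits_strong, sat):
    """Masked cross-entropy on the strongly-augmented view, using the per-class SAT."""
    conf, pseudo = q_weak.max(dim=1)
    mask = (conf >= sat[pseudo]).float()             # keep samples above their class threshold
    return (mask * F.cross_entropy(logits_strong, pseudo, reduction="none")).mean()

# Illustrative usage with random tensors (C classes, mu*B unlabeled samples).
C, muB = 10, 448
q_weak = torch.softmax(torch.randn(muB, C), dim=1)
logits_strong = torch.randn(muB, C)
tau, p_tilde = torch.tensor(1.0 / C), torch.full((C,), 1.0 / C)   # initialization at 1/C
tau, p_tilde, sat = update_thresholds(q_weak, tau, p_tilde)
print(unsupervised_loss(q_weak, logits_strong, sat))
```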

Appendix D Hyperparameter setting

For reproducibility, we list the detailed hyperparameter settings of FreeMatch in Tables 5 and 6, for algorithm-dependent and algorithm-independent hyperparameters, respectively.

Table 5: Algorithm dependent hyperparameters.

Algorithm FreeMatch
Unlabeled data to labeled data ratio (CIFAR-10/100, STL-10, SVHN) 7
Unlabeled data to labeled data ratio (ImageNet) 1
Loss weight w_u for all experiments 1
Loss weight w_f for CIFAR-10 (10), CIFAR-100 (400), STL-10 (40), ImageNet (100k), SVHN 0.01
Loss weight w_f for other settings 0.05
Thresholding EMA decay for all experiments 0.999

Table 6: Algorithm independent hyperparameters.

Dataset CIFAR-10 CIFAR-100 STL-10 SVHN ImageNet
Model WRN-28-2 WRN-28-8 WRN-37-2 WRN-28-2 ResNet-50
Weight decay 5e-4 1e-3 5e-4 5e-4 3e-4
Batch size 64 64 64 64 128
Learning rate 0.03 0.03 0.03 0.03 0.03
SGD momentum 0.9 0.9 0.9 0.9 0.9
EMA decay 0.999 0.999 0.999 0.999 0.999

Note that for the ImageNet experiments, we used the same learning rate, optimizer scheme, and training iterations as in the other experiments, and adopted a batch size of 128, whereas FixMatch uses a larger batch size of 1024 and a different optimizer. From our experiments, we found that training ImageNet for only 2^{20} iterations is not enough: the model is still converging at the end of training. Longer training on ImageNet will be explored in the future. A single NVIDIA V100 is used for training on CIFAR-10, CIFAR-100, SVHN, and STL-10. Training takes about 2 days on CIFAR-10 and SVHN, and about 10 days on CIFAR-100 and STL-10.

Appendix E Extensive Experiment Details and Results

We present extensive experiment details and results as complementary to the experiments in the main paper.

E.1 Significance Test

We performed a significance test using the Friedman test. We choose the top 7 algorithms on 4 datasets (i.e., N=4, k=7). We then compute the F value as \tau_{F}=3.56, which is clearly larger than the critical values 2.661 (\alpha=0.05) and 2.130 (\alpha=0.1). The test indicates that there are significant differences among the algorithms.
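The following SciPy sketch illustrates how such a test can be computed; the error-rate matrix is a random placeholder rather than our actual results, and the Iman-Davenport F correction of the Friedman statistic is our assumption about the exact variant used:

```python
import numpy as np
from scipy.stats import friedmanchisquare, f

# Placeholder error-rate matrix: N = 4 datasets (rows) x k = 7 algorithms (columns).
rng = np.random.default_rng(0)
errors = rng.random((4, 7))
N, k = errors.shape

chi2 = friedmanchisquare(*[errors[:, j] for j in range(k)]).statistic
F_F = (N - 1) * chi2 / (N * (k - 1) - chi2)                 # Iman-Davenport correction
critical = f.ppf(0.95, dfn=k - 1, dfd=(k - 1) * (N - 1))    # critical value at alpha = 0.05
print(F_F, critical)
```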

To further demonstrate the significance, we report the average error rates on each dataset in Table 7. FreeMatch outperforms most SSL algorithms by a clear margin.

Table 7: The average error rates for each dataset.
CIFAR-10 CIFAR-100 SVHN STL-10 Total Average
\Pi Model 53.22 60.80 29.31 53.55 49.19
Pseudo Label 54.10 60.58 29.87 53.66 49.59
VAT 51.50 54.73 27.73 56.35 47.17
MeanTeacher 48.01 52.68 14.27 52.81 41.54
MixMatch 30.56 45.04 12.95 38.32 31.07
ReMixMatch 10.45 29.60 11.85 19.43 17.08
UDA 13.65 32.20 2.98 22.03 17.02
FixMatch 10.33 32.22 2.60 21.11 15.67
Dash 11.43 31.28 2.07 20.46 15.56
MPL 10.12 31.90 4.63 21.21 16.04
FlexMatch 7.00 29.44 7.17 17.46 14.40
FreeMatch 5.49 28.71 1.97 10.60 11.26

E.2 CIFAR-10 (10) Labeled Data

Following (Sohn et al., 2020), we investigate the limitations of SSL algorithms by providing only one labeled training sample per class. The three selected labeled training sets are visualized in Figure 4; they are obtained by Sohn et al. (2020) using the ordering mechanism of Carlini et al. (2019).

Refer to caption
Figure 4: CIFAR-10 (10) labeled samples visualization, sorted from the most prototypical dataset (first row) to least prototypical dataset (last row).

E.3 Detailed Results

To comprehensively evaluate the performance of all methods in a classification setting, we further report the precision, recall, F1 score, and AUC (area under the curve) results on CIFAR-10 with the same 10 labels, CIFAR-100 with 400 labels, SVHN with 40 labels, and STL-10 with 40 labels. As shown in Tables 8 and 9, FreeMatch also achieves the best precision, recall, F1 score, and AUC, in addition to the top-1 error rates reported in the main paper.

Table 8: Precision, recall, F1 score, and AUC results on CIFAR-10/100.

Datasets CIFAR-10 (10) CIFAR-100 (400)
Criteria Precision Recall F1 Score AUC Precision Recall F1 Score AUC
UDA 0.5304 0.5121 0.4754 0.8258 0.5813 0.5484 0.5087 0.9475
FixMatch 0.6436 0.6622 0.6110 0.8934 0.5574 0.5430 0.4946 0.9363
Dash 0.6409 0.5410 0.4955 0.8458 0.5833 0.5649 0.5215 0.9456
MPL 0.6286 0.6857 0.6178 0.7993 0.5799 0.5606 0.5193 0.9316
FlexMatch 0.6769 0.6861 0.6780 0.9126 0.6135 0.6193 0.6107 0.9675
FreeMatch 0.8619 0.8593 0.8523 0.9843 0.6243 0.6261 0.6137 0.9692

Table 9: Precision, recall, F1 score, and AUC results on SVHN and STL-10.

Datasets SVHN (40) STL-10 (40)
Criteria Precision Recall F1 Score AUC Precision Recall F1 Score AUC
UDA 0.9783 0.9777 0.9780 0.9977 0.6385 0.5319 0.4765 0.8581
FixMatch 0.9731 0.9706 0.9716 0.9962 0.6590 0.5830 0.5405 0.8862
Dash 0.9782 0.9778 0.9780 0.9978 0.8117 0.6020 0.5448 0.8827
MPL 0.9564 0.9513 0.9512 0.9844 0.6191 0.5740 0.4999 0.8529
FlexMatch 0.9566 0.9691 0.9625 0.9975 0.6403 0.6755 0.6518 0.9249
FreeMatch 0.9783 0.9800 0.9791 0.9979 0.8489 0.8439 0.8354 0.9792

E.4 Ablation of pre-defined thresholds on FixMatch and FlexMatch

As shown in Table 10, the performance of FixMatch and FlexMatch is quite sensitive to changes of the pre-defined threshold \tau.

Table 10: FixMatch and FlexMatch with different thresholds on CIFAR-10 (40).
\tau FixMatch FlexMatch
0.25 11.76±0.60 18.84±0.36
0.5 16.29±0.31 14.16±0.21
0.75 15.61±0.23 6.08±0.17
0.95 7.47±0.28 4.97±0.06
0.98 8.01±0.91 5.40±0.11

E.5 Ablation on EMA decay on CIFAR-10 (40)

We provide an ablation study on the EMA decay parameter \lambda used in Equations 5 and 6. As shown in Table 11, different decay values \lambda produce close results on CIFAR-10 with 40 labels, indicating that FreeMatch is not sensitive to this hyperparameter. A very large \lambda is not encouraged since it could impede the update of the global / local thresholds.

Table 11: Error rates of different thresholding EMA decay.
Thresholding EMA decay CIFAR-10 (40)
0.9 4.94±0.06
0.99 4.92±0.08
0.999 4.90±0.04
0.9999 5.03±0.07

E.6 Ablation of SAF on FlexMatch and FreeMatch

In Table 13, we compare different class fairness objectives on CIFAR-10 with 10 labels. FreeMatch outperforms FlexMatch in both settings. In addition, SAF also proves effective when combined with FlexMatch.

Table 13: Ablation of SAF on FlexMatch and FreeMatch on CIFAR-10 (10)
Fairness Objective FlexMatch FreeMatch
w/o SAF 13.85±12.04 10.37±7.70
w/ SAF 12.60±8.16 8.07±4.24

E.7 Ablation of Imbalanced SSL

Table 14: Error rates (%) of imbalanced SSL using 3 different random seeds.
Dataset CIFAR-10-LT CIFAR-100-LT
Imbalance ratio \gamma 50 150 20 100
FixMatch 18.5±0.48 31.2±1.08 49.1±0.62 62.5±0.36
FlexMatch 17.8±0.24 29.5±0.47 48.9±0.71 62.7±0.08
FreeMatch 17.7±0.33 28.8±0.64 48.4±0.91 62.5±0.23
FixMatch w/ ABC 14.0±0.22 22.3±1.08 46.6±0.69 58.3±0.41
FlexMatch w/ ABC 14.2±0.34 23.1±0.70 46.2±0.47 58.9±0.51
FreeMatch w/ ABC 13.9±0.03 22.3±0.26 45.6±0.76 58.9±0.55

To further demonstrate the effectiveness of FreeMatch, we evaluate it on the imbalanced SSL setting (Kim et al., 2020; Wei et al., 2021; Lee et al., 2021; Fan et al., 2021), where both the labeled and the unlabeled data are imbalanced. We conduct experiments on CIFAR-10-LT and CIFAR-100-LT with different imbalance ratios. The imbalance ratio on the CIFAR datasets is defined as \gamma=N_{max}/N_{min}, where N_{max} is the number of samples of the head (most frequent) class and N_{min} that of the tail (rarest) class. The number of samples for class k is computed as N_{k}=N_{max}\gamma^{-\frac{k-1}{C-1}}, where C is the number of classes. Following (Lee et al., 2021; Fan et al., 2021), we set N_{max}=1500 for CIFAR-10 and N_{max}=150 for CIFAR-100, and the number of unlabeled samples per class is twice that of the labeled samples. We use WRN-28-2 (Zagoruyko & Komodakis, 2016) as the backbone and Adam (Kingma & Ba, 2014) as the optimizer. The initial learning rate is 0.002 with a cosine decay schedule \eta=\eta_{0}\cos(\frac{7\pi k}{16K}), where \eta_{0} is the initial learning rate, k (K) is the current (total) training step, and we set K=2.5\times 10^{5} for all datasets. The batch sizes of labeled and unlabeled data are 64 and 128, respectively. Weight decay is set to 4e-5. Each experiment is run on three different data splits, and we report the average of the best error rates.
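For concreteness, a small sketch of the per-class sample counts and the cosine learning-rate schedule described above is given below; the function names are ours, and the printed values mirror the CIFAR-10-LT setting with \gamma=100:

```python
import math

def class_counts(n_max: int, gamma: float, num_classes: int):
    """N_k = N_max * gamma^(-(k-1)/(C-1)) for k = 1, ..., C."""
    return [round(n_max * gamma ** (-(k - 1) / (num_classes - 1))) for k in range(1, num_classes + 1)]

def cosine_lr(eta0: float, step: int, total_steps: int) -> float:
    """eta = eta0 * cos(7 * pi * k / (16 * K))."""
    return eta0 * math.cos(7 * math.pi * step / (16 * total_steps))

print(class_counts(n_max=1500, gamma=100, num_classes=10))      # labeled counts per class for CIFAR-10-LT
print(cosine_lr(eta0=0.002, step=125_000, total_steps=250_000))  # learning rate at the halfway point
```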

The results are summarized in Table 14. Compared with the other standard SSL methods, FreeMatch achieves the best performance across all settings. Especially on CIFAR-10 at imbalance ratio 150, FreeMatch outperforms the second best by 2.4%. Moreover, when plugged into the imbalanced SSL method ABC (Lee et al., 2021), FreeMatch still attains the best performance in most of the settings.

E.8 t-SNE Visualization on STL-10 (40)

We plot the t-SNE visualization of the features on STL-10 with 40 labels from FlexMatch (Zhang et al., 2021) and FreeMatch. FreeMatch shows a better-separated feature space than FlexMatch, with fewer confusing clusters.

Refer to caption
(a) FlexMatch (train, test)
Refer to caption
(b) FreeMatch (train, test)
Figure 5: t-SNE visualization of FlexMatch and FreeMatch features on STL-10 (40). Unlabeled data are shown in gray. The local threshold \tau_{t}(c) for each class is shown in the legend.

E.9 Pseudo Label accuracy on CIFAR-10 (10)

We average the pseudo label accuracy over three random seeds and report it in Figure 6. The results indicate that mapping thresholds from a high fixed threshold, as FlexMatch does, can prevent unlabeled samples from being involved in training. In this case, the model can overfit the labeled data and the small amount of unlabeled data that passes the threshold, so the predictions on unlabeled data incorporate more noise. Introducing appropriate unlabeled data during training avoids this overfitting and yields more accurate pseudo labels.

Refer to caption
Figure 6: CIFAR-10 (10) Pseudo Label accuracy visualization.

E.10 CIFAR-10 (10) Confusion Matrix

We plot the confusion matrices of FreeMatch and other SSL methods on CIFAR-10 (10) in Figure 7. Notably, even with the least prototypical labeled data in our setting, FreeMatch still obtains good results, while the other SSL methods fail to separate the unlabeled data into different clusters, which is inconsistent with the low-density assumption in SSL.

Refer to caption
(a) The most prototypical labeled samples
Refer to caption
(b) The second-most prototypical labeled samples
Refer to caption
(c) The least prototypical labeled samples
Figure 7: Confusion matrix on the test set of CIFAR-10 (10). Rows correspond to the rows in Figure 4. Columns correspond to different SSL methods.