
FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning

Yidong Wang1,2*, Hao Chen3*, Qiang Heng4, Wenxin Hou5, Yue Fan6, Zhen Wu7, Jindong Wang1†, Marios Savvides3, Takahiro Shinozaki2, Bhiksha Raj3, Bernt Schiele6, Xing Xie1

1Microsoft Research Asia, 2Tokyo Institute of Technology, 3Carnegie Mellon University, 4North Carolina State University, 5Microsoft STCA, 6Max-Planck-Institut für Informatik, 7Nanjing University

*Equal contribution: [email protected], [email protected]; work done when Yidong was a research intern at MSRA.
†Correspondence to: [email protected]
Abstract

Semi-supervised Learning (SSL) has witnessed great success owing to the impressive performances brought by various methods based on pseudo labeling and consistency regularization. However, we argue that existing methods might fail to utilize the unlabeled data more effectively since they either use a pre-defined / fixed threshold or an ad-hoc threshold adjusting scheme, resulting in inferior performance and slow convergence. We first analyze a motivating example to obtain intuitions on the relationship between the desirable threshold and the model's learning status. Based on the analysis, we hence propose FreeMatch to adjust the confidence threshold in a self-adaptive manner according to the model's learning status. We further introduce a self-adaptive class fairness regularization penalty to encourage the model to make diverse predictions during the early training stage. Extensive experiments indicate the superiority of FreeMatch, especially when the labeled data are extremely rare. FreeMatch achieves 5.78%, 13.59%, and 1.28% error rate reduction over the latest state-of-the-art method FlexMatch on CIFAR-10 with 1 label per class, STL-10 with 4 labels per class, and ImageNet with 100 labels per class, respectively. Moreover, FreeMatch can also boost the performance of imbalanced SSL. The code can be found at https://github.com/microsoft/Semi-supervised-learning. Note that the results of this paper are obtained using TorchSSL (Zhang et al., 2021); we also provide code and logs in USB (Wang et al., 2022).

1 Introduction

The superior performance of deep learning heavily relies on supervised training with sufficient labeled data (He et al., 2016; Vaswani et al., 2017; Dong et al., 2018). However, it remains laborious and expensive to obtain massive labeled data. To alleviate such reliance, semi-supervised learning (SSL) (Zhu, 2005; Zhu & Goldberg, 2009; Sohn et al., 2020; Rosenberg et al., 2005; Gong et al., 2016; Kervadec et al., 2019; Dai et al., 2017) is developed to improve the model’s generalization performance by exploiting a large volume of unlabeled data. Pseudo labeling (Lee et al., 2013; Xie et al., 2020b; McLachlan, 1975; Rizve et al., 2020) and consistency regularization (Bachman et al., 2014; Samuli & Timo, 2017; Sajjadi et al., 2016) are two popular paradigms designed for modern SSL. Recently, their combinations have shown promising results (Xie et al., 2020a; Sohn et al., 2020; Pham et al., 2021; Xu et al., 2021; Zhang et al., 2021). The key idea is that the model should produce similar predictions or the same pseudo labels for the same unlabeled data under different perturbations following the smoothness and low-density assumptions in SSL (Chapelle et al., 2006).

A potential limitation of these threshold-based methods is that they either need a fixed threshold (Xie et al., 2020a; Sohn et al., 2020; Zhang et al., 2021; Guo & Li, 2022) or an ad-hoc threshold adjusting scheme (Xu et al., 2021) to compute the loss with only confident unlabeled samples. Specifically, UDA (Xie et al., 2020a) and FixMatch (Sohn et al., 2020) retain a fixed high threshold to ensure the quality of pseudo labels. However, a fixed high threshold (0.95) could lead to low data utilization in the early training stages and ignores the different learning difficulties of different classes. Dash (Xu et al., 2021) and AdaMatch (Berthelot et al., 2022) propose to gradually grow the fixed global (dataset-specific) threshold as training progresses. Although the utilization of unlabeled data is improved, their ad-hoc threshold adjusting scheme is arbitrarily controlled by hyper-parameters and thus disconnected from the model's learning process. FlexMatch (Zhang et al., 2021) demonstrates that different classes should have different local (class-specific) thresholds. While the local thresholds take into account the learning difficulties of different classes, they are still mapped from a pre-defined fixed global threshold. Adsh (Guo & Li, 2022) obtains adaptive thresholds from a pre-defined threshold for imbalanced semi-supervised learning by optimizing the number of pseudo labels for each class. In a nutshell, these methods might be incapable or insufficient in terms of adjusting thresholds according to the model's learning progress, thus impeding the training process, especially when labeled data is too scarce to provide adequate supervision.

Figure 1: Demonstration of how FreeMatch works on the “two-moon” dataset. (a) Decision boundary of FreeMatch and other SSL methods. (b) Decision boundary improvement of self-adaptive fairness (SAF) on two labeled samples per class. (c) Class-average confidence threshold. (d) Class-average sampling rate of FreeMatch during training. The experimental details are in Appendix A.

For example, as shown in Figure 1(a), on the “two-moon” dataset with only 1 labeled sample for each class, the decision boundaries obtained by previous methods fail to satisfy the low-density assumption. Two questions then naturally arise: 1) Is it necessary to determine the threshold based on the model's learning status? and 2) How can the threshold be adaptively adjusted for the best training efficiency?

In this paper, we first leverage a motivating example to demonstrate that different datasets and classes should determine their global (dataset-specific) and local (class-specific) thresholds based on the model's learning status. Intuitively, we need a low global threshold to utilize more unlabeled data and speed up convergence at early training stages. As the prediction confidence increases, a higher global threshold is necessary to filter out wrong pseudo labels and alleviate the confirmation bias (Arazo et al., 2020). Besides, a local threshold should be defined for each class based on the model's confidence about its predictions. The “two-moon” example in Figure 1(a) shows that the decision boundary is more reasonable when the thresholds are adjusted based on the model's learning status.

We then propose FreeMatch to adjust the thresholds in a self-adaptive manner according to the learning status of each class (Guo et al., 2017). Specifically, FreeMatch uses the self-adaptive thresholding (SAT) technique to estimate both the global (dataset-specific) and local (class-specific) thresholds via the exponential moving average (EMA) of the unlabeled data confidence. To handle barely supervised settings (Sohn et al., 2020) more effectively, we further propose a class fairness objective to encourage the model to produce fair (i.e., diverse) predictions among all classes (as shown in Figure 1(b)). The overall training objective of FreeMatch maximizes the mutual information between the model's input and output (John Bridle, 1991), producing confident and diverse predictions on unlabeled data. Benchmark results validate its effectiveness. To conclude, our contributions are:

  •  

    Using a motivating example, we discuss why thresholds should reflect the model’s learning status and provide some intuitions for designing a threshold-adjusting scheme.

  •  

    We propose a novel approach, FreeMatch, which consists of Self-Adaptive Thresholding (SAT) and Self-Adaptive class Fairness regularization (SAF). SAT is a threshold-adjusting scheme that is free of setting thresholds manually and SAF encourages diverse predictions.

  •  

    Extensive results demonstrate the superior performance of FreeMatch on various SSL benchmarks, especially when the number of labels is very limited (e.g., an error rate reduction of 5.78% on CIFAR-10 with 1 labeled sample per class).

2 A Motivating Example

In this section, we introduce a binary classification example to motivate our threshold-adjusting scheme. Despite the simplification of the actual model and training process, the analysis leads to some interesting implications and provides insight into how the thresholds should be set.

We aim to demonstrate the necessity of self-adaptability and increased granularity in confidence thresholding for SSL. Inspired by (Yang & Xu, 2020), we consider a binary classification problem where the true distribution is an even mixture of two Gaussians (i.e., the label $Y$ is equally likely to be positive ($+1$) or negative ($-1$)). The input $X$ has the following conditional distribution:

X\mid Y=-1\sim\mathcal{N}(\mu_{1},\sigma_{1}^{2}),\quad X\mid Y=+1\sim\mathcal{N}(\mu_{2},\sigma_{2}^{2}). (1)

We assume $\mu_{2}>\mu_{1}$ without loss of generality. Suppose that our classifier outputs the confidence score $s(x)=1/[1+\exp(-\beta(x-\frac{\mu_{1}+\mu_{2}}{2}))]$, where $\beta$ is a positive parameter that reflects the model learning status and is expected to gradually grow during training as the model becomes more confident. Note that $\frac{\mu_{1}+\mu_{2}}{2}$ is in fact the Bayes-optimal linear decision boundary. We consider the scenario where a fixed threshold $\tau\in(\frac{1}{2},1)$ is used to generate pseudo labels. A sample $x$ is assigned pseudo label $+1$ if $s(x)>\tau$ and $-1$ if $s(x)<1-\tau$. The pseudo label is $0$ (masked) if $1-\tau\leq s(x)\leq\tau$.

We then derive the following theorem to show the necessity of self-adaptive threshold:

Theorem 2.1.

For a binary classification problem as mentioned above, the pseudo label $Y_{p}$ has the following probability distribution:

\begin{split}P(Y_{p}=1)&=\frac{1}{2}\Phi(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}})+\frac{1}{2}\Phi(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}),\\ P(Y_{p}=-1)&=\frac{1}{2}\Phi(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}})+\frac{1}{2}\Phi(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}),\\ P(Y_{p}=0)&=1-P(Y_{p}=1)-P(Y_{p}=-1),\end{split} (2)

where $\Phi$ is the cumulative distribution function of a standard normal distribution. Moreover, $P(Y_{p}=0)$ increases as $\mu_{2}-\mu_{1}$ gets smaller.

The proof is offered in Appendix B. Theorem 2.1 has the following implications or interpretations:

  1. (i)

    Trivially, the unlabeled data utilization (sampling rate) $1-P(Y_{p}=0)$ is directly controlled by the threshold $\tau$. As the confidence threshold $\tau$ gets larger, the unlabeled data utilization gets lower. At early training stages, adopting a high threshold may lead to a low sampling rate and slow convergence since $\beta$ is still small.

  2. (ii)

    More interestingly, $P(Y_{p}=1)\neq P(Y_{p}=-1)$ if $\sigma_{1}\neq\sigma_{2}$. In fact, the larger $\tau$ is, the more imbalanced the pseudo labels are. This is potentially undesirable in the sense that we aim to tackle a balanced classification problem. Imbalanced pseudo labels may distort the decision boundary and lead to the so-called pseudo label bias. An easy remedy for this is to use class-specific thresholds $\tau_{2}$ and $1-\tau_{1}$ to assign pseudo labels.

  3. (iii)

    The sampling rate $1-P(Y_{p}=0)$ decreases as $\mu_{2}-\mu_{1}$ gets smaller. In other words, the more similar the two classes are, the more likely an unlabeled sample will be masked. As the two classes get more similar, there would be more samples mixed in feature space where the model is less confident about its predictions, thus a moderate threshold is needed to balance the sampling rate. Otherwise we may not have enough samples to train the model to classify the already difficult-to-classify classes.

The intuitions provided by Theorem 2.1 are that at the early training stages, $\tau$ should be low to encourage diverse pseudo labels, improve unlabeled data utilization, and speed up convergence. However, as training continues and $\beta$ grows larger, a consistently low threshold will lead to unacceptable confirmation bias. Ideally, the threshold $\tau$ should increase along with $\beta$ to maintain a stable sampling rate throughout. Since different classes have different levels of intra-class diversity (different $\sigma$) and some classes are harder to classify than others ($\mu_{2}-\mu_{1}$ being small), a fine-grained class-specific threshold is desirable to encourage fair assignment of pseudo labels to different classes. The challenge is how to design a threshold adjusting scheme that takes all implications into account, which is the main contribution of this paper. We demonstrate our algorithm by plotting the average threshold trend and marginal pseudo label probability (i.e., sampling rate) during training in Figure 1(c) and 1(d). To sum up, we should determine global (dataset-specific) and local (class-specific) thresholds by estimating the learning status via predictions from the model. Then, we detail FreeMatch.
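To make these implications concrete, the following Python sketch (ours, not part of the paper's released code; the parameter values are purely illustrative) evaluates the closed-form distribution in Eq. (2) and prints how the masked fraction and the pseudo-label imbalance react to the confidence scale $\beta$ under a fixed high threshold.

```python
# Numerical illustration of Theorem 2.1 for the two-Gaussian toy problem.
import numpy as np
from scipy.stats import norm

def pseudo_label_dist(mu1, mu2, sigma1, sigma2, beta, tau):
    """Return (P(Yp=+1), P(Yp=-1), P(Yp=0)) from Eq. (2)."""
    shift = np.log(tau / (1.0 - tau)) / beta          # (1/beta) * log(tau / (1 - tau))
    half_gap = (mu2 - mu1) / 2.0
    p_pos = 0.5 * norm.cdf((half_gap - shift) / sigma2) + 0.5 * norm.cdf((-half_gap - shift) / sigma1)
    p_neg = 0.5 * norm.cdf((half_gap - shift) / sigma1) + 0.5 * norm.cdf((-half_gap - shift) / sigma2)
    return p_pos, p_neg, 1.0 - p_pos - p_neg

# With unequal class variances, a fixed high threshold masks almost everything early on
# (small beta) and yields imbalanced pseudo labels even once the model is confident.
for beta in (0.5, 2.0, 8.0):
    p_pos, p_neg, p_mask = pseudo_label_dist(mu1=-1.0, mu2=1.0, sigma1=0.5, sigma2=1.5,
                                             beta=beta, tau=0.95)
    print(f"beta={beta:4.1f}  P(+1)={p_pos:.3f}  P(-1)={p_neg:.3f}  masked={p_mask:.3f}")
```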

3 Preliminaries

In SSL, the training data consists of labeled and unlabeled data. Let $\mathcal{D}_{L}=\{(x_{b},y_{b}):b\in[N_{L}]\}$ and $\mathcal{D}_{U}=\{u_{b}:b\in[N_{U}]\}$ be the labeled and unlabeled data, where $N_{L}$ and $N_{U}$ are their numbers of samples, respectively, and $[N]:=\{1,2,\ldots,N\}$. The supervised loss for labeled data is:

\mathcal{L}_{s}=\frac{1}{B}\sum_{b=1}^{B}\mathcal{H}(y_{b},p_{m}(y|\omega(x_{b}))), (3)

where $B$ is the batch size, $\mathcal{H}(\cdot,\cdot)$ denotes the cross-entropy loss, $\omega(\cdot)$ is the stochastic data augmentation function, and $p_{m}(\cdot)$ is the output probability from the model.

For unlabeled data, we focus on pseudo labeling using the cross-entropy loss with a confidence threshold for entropy minimization. We also adopt the “Weak and Strong Augmentation” strategy introduced by UDA (Xie et al., 2020a). Formally, the unsupervised training objective for unlabeled data is:

\mathcal{L}_{u}=\frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathbbm{1}(\max(q_{b})>\tau)\cdot\mathcal{H}(\hat{q}_{b},Q_{b}). (4)

We use $q_{b}$ and $Q_{b}$ as abbreviations of $p_{m}(y|\omega(u_{b}))$ and $p_{m}(y|\Omega(u_{b}))$, respectively. $\hat{q}_{b}$ is the hard “one-hot” label converted from $q_{b}$, $\mu$ is the ratio of the unlabeled batch size to the labeled batch size, and $\mathbbm{1}(\cdot>\tau)$ is the indicator function for confidence-based thresholding with $\tau$ being the threshold. The weak augmentation (i.e., random crop and flip) and strong augmentation (i.e., RandAugment (Cubuk et al., 2020)) are represented by $\omega(\cdot)$ and $\Omega(\cdot)$, respectively.
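For reference, a minimal PyTorch-style sketch of the losses in Eqs. (3) and (4) is given below. It is our illustration rather than any released implementation; the tensor names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def supervised_loss(logits_x_w: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Eq. (3): cross-entropy on the weakly augmented labeled batch.
    return F.cross_entropy(logits_x_w, labels)

def fixed_threshold_unsup_loss(logits_u_w: torch.Tensor,
                               logits_u_s: torch.Tensor,
                               tau: float = 0.95) -> torch.Tensor:
    # Eq. (4): pseudo labels come from the weak view; the loss is applied to the strong
    # view only for samples whose confidence exceeds the fixed threshold tau.
    q_b = torch.softmax(logits_u_w.detach(), dim=-1)   # q_b = p_m(y | weak(u_b))
    conf, pseudo = q_b.max(dim=-1)                     # max(q_b), argmax(q_b)
    mask = (conf > tau).float()                        # 1(max(q_b) > tau)
    per_sample = F.cross_entropy(logits_u_s, pseudo, reduction="none")
    return (per_sample * mask).mean()
```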

Besides, a fairness objective $\mathcal{L}_{f}$ is usually introduced to encourage the model to predict each class at the same frequency, which usually has the form $\mathcal{L}_{f}=\mathbf{U}\log\mathbbm{E}_{\mu B}[q_{b}]$ (Andreas Krause, 2010), where $\mathbf{U}$ is a uniform prior distribution. One may notice that using a uniform prior not only prevents generalization to non-uniform data distributions but also ignores the fact that the underlying pseudo label distribution of a mini-batch may be imbalanced due to the sampling mechanism. Nonetheless, uniformity across a batch is essential for fair utilization of samples with per-class thresholds, especially at early training stages.

4 FreeMatch

Figure 2: Illustration of Self-Adaptive Thresholding (SAT). FreeMatch adopts both global and local self-adaptive thresholds computed from the EMA of prediction statistics from unlabeled samples. Filtered (masked) samples are marked with red X.

4.1 Self-Adaptive Thresholding

We advocate that the key to determining thresholds for SSL is that thresholds should reflect the learning status. The learning effect can be estimated by the prediction confidence of a well-calibrated model (Guo et al., 2017). Hence, we propose self-adaptive thresholding (SAT) that automatically defines and adaptively adjusts the confidence threshold for each class by leveraging the model predictions during training. SAT first estimates a global threshold as the EMA of the confidence from the model. Then, SAT modulates the global threshold via the local class-specific thresholds estimated as the EMA of the probability for each class from the model. When training starts, the threshold is low to accept more possibly correct samples into training. As the model becomes more confident, the threshold adaptively increases to filter out possibly incorrect samples to reduce the confirmation bias. Thus, as shown in Figure 2, we define SAT as $\tau_{t}(c)$, indicating the threshold for class $c$ at the $t$-th iteration.

Self-adaptive Global Threshold

We design the global threshold based on the following two principles. First, the global threshold in SAT should be related to the model's confidence on unlabeled data, reflecting the overall learning status. Moreover, the global threshold should stably increase during training to ensure incorrect pseudo labels are discarded. We set the global threshold $\tau_{t}$ as the average confidence from the model on unlabeled data, where $t$ represents the $t$-th time step (iteration). However, it would be time-consuming to compute the confidence for all unlabeled data at every time step or even every training epoch due to its large volume. Instead, we estimate the global confidence as the exponential moving average (EMA) of the confidence at each training time step. We initialize $\tau_{0}$ as $\frac{1}{C}$, where $C$ indicates the number of classes. The global threshold $\tau_{t}$ is defined and adjusted as:

\tau_{t}=\begin{cases}\frac{1}{C},&\text{ if }t=0,\\ \lambda\tau_{t-1}+(1-\lambda)\frac{1}{\mu B}\sum_{b=1}^{\mu B}\max(q_{b}),&\text{ otherwise, }\end{cases} (5)

where $\lambda\in(0,1)$ is the momentum decay of the EMA.
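A small sketch (ours, with assumed names and a default momentum) of how the global threshold in Eq. (5) can be maintained:

```python
import torch

class GlobalThreshold:
    def __init__(self, num_classes: int, momentum: float = 0.999):
        self.tau = 1.0 / num_classes      # tau_0 = 1 / C
        self.momentum = momentum          # lambda in Eq. (5)

    @torch.no_grad()
    def update(self, probs_u_w: torch.Tensor) -> float:
        # probs_u_w: softmax probabilities q_b on the weakly augmented unlabeled batch.
        batch_conf = probs_u_w.max(dim=-1).values.mean().item()
        self.tau = self.momentum * self.tau + (1.0 - self.momentum) * batch_conf
        return self.tau
```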

Self-adaptive Local Threshold

The local threshold aims to modulate the global threshold in a class-specific fashion to account for the intra-class diversity and the possible class adjacency. We compute the expectation of the model's predictions on each class $c$ to estimate the class-specific learning status:

\tilde{p}_{t}(c)=\begin{cases}\frac{1}{C},&\text{ if }t=0,\\ \lambda\tilde{p}_{t-1}(c)+(1-\lambda)\frac{1}{\mu B}\sum_{b=1}^{\mu B}q_{b}(c),&\text{ otherwise, }\end{cases} (6)

where $\tilde{p}_{t}=[\tilde{p}_{t}(1),\tilde{p}_{t}(2),\dots,\tilde{p}_{t}(C)]$ is the list containing all $\tilde{p}_{t}(c)$. Integrating the global and local thresholds, we obtain the final self-adaptive threshold $\tau_{t}(c)$ as:

\tau_{t}(c)=\operatorname{MaxNorm}(\tilde{p}_{t}(c))\cdot\tau_{t}=\frac{\tilde{p}_{t}(c)}{\max\{\tilde{p}_{t}(c):c\in[C]\}}\cdot\tau_{t}, (7)

where $\operatorname{MaxNorm}$ is the maximum normalization (i.e., $x^{\prime}=\frac{x}{\max(x)}$). Finally, the unsupervised training objective $\mathcal{L}_{u}$ at the $t$-th iteration is:

\mathcal{L}_{u}=\frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathbbm{1}(\max(q_{b})>\tau_{t}(\arg\max(q_{b})))\cdot\mathcal{H}(\hat{q}_{b},Q_{b}). (8)
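The local statistics of Eq. (6), the modulation of Eq. (7), and the SAT-masked loss of Eq. (8) can be sketched as follows; this reuses the GlobalThreshold sketch above for $\tau_{t}$ and is our hedged illustration, not the official implementation.

```python
import torch
import torch.nn.functional as F

class LocalThreshold:
    def __init__(self, num_classes: int, momentum: float = 0.999):
        self.p_tilde = torch.full((num_classes,), 1.0 / num_classes)  # p~_0(c) = 1/C
        self.momentum = momentum                                       # lambda

    @torch.no_grad()
    def update(self, probs_u_w: torch.Tensor, tau_global: float) -> torch.Tensor:
        # Eq. (6): EMA of the batch-mean class probabilities.
        self.p_tilde = self.momentum * self.p_tilde + (1 - self.momentum) * probs_u_w.mean(dim=0)
        # Eq. (7): modulate the global threshold by MaxNorm of the class statistics.
        return self.p_tilde / self.p_tilde.max() * tau_global

def sat_unsup_loss(logits_u_w, logits_u_s, tau_per_class):
    q_b = torch.softmax(logits_u_w.detach(), dim=-1)
    conf, pseudo = q_b.max(dim=-1)
    mask = (conf > tau_per_class[pseudo]).float()      # 1(max(q_b) > tau_t(argmax(q_b)))
    per_sample = F.cross_entropy(logits_u_s, pseudo, reduction="none")
    return (per_sample * mask).mean()                  # Eq. (8)
```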

4.2 Self-Adaptive Fairness

We include the class fairness objective mentioned in Section 3 in FreeMatch to encourage the model to make diverse predictions for each class and thus produce a meaningful self-adaptive threshold, especially in settings where labeled data are rare. Instead of using a uniform prior as in (Arazo et al., 2020), we use the EMA of model predictions $\tilde{p}_{t}$ from Eq. 6 as an estimate of the expectation of the prediction distribution over unlabeled data. We optimize the cross-entropy of $\tilde{p}_{t}$ and $\overline{p}=\mathbbm{E}_{\mu B}[p_{m}(y|\Omega(u_{b}))]$ over the mini-batch as an estimate of $H(\mathbbm{E}_{u}[p_{m}(y|u)])$. Considering that the underlying pseudo label distribution may not be uniform, we propose to modulate the fairness objective in a self-adaptive way, i.e., normalizing the expectation of probability by the histogram distribution of pseudo labels to counter the negative effect of imbalance:

\begin{split}&\overline{p}=\frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathbbm{1}\left(\max(q_{b})\geq\tau_{t}(\arg\max(q_{b}))\right)Q_{b},\\ &\overline{h}=\operatorname{Hist}_{\mu B}\left(\mathbbm{1}\left(\max(q_{b})\geq\tau_{t}(\arg\max(q_{b}))\right)\hat{Q}_{b}\right).\end{split} (9)

Similar to $\tilde{p}_{t}$, we compute $\tilde{h}_{t}$ as:

\tilde{h}_{t}=\lambda\tilde{h}_{t-1}+(1-\lambda)\operatorname{Hist}_{\mu B}(\hat{q}_{b}). (10)

The self-adaptive fairness (SAF) loss $\mathcal{L}_{f}$ at the $t$-th iteration is formulated as:

\mathcal{L}_{f}=-\mathcal{H}\left(\operatorname{SumNorm}\left(\frac{\tilde{p}_{t}}{\tilde{h}_{t}}\right),\operatorname{SumNorm}\left(\frac{\overline{p}}{\overline{h}}\right)\right), (11)

where $\operatorname{SumNorm}(\cdot)=(\cdot)/\sum(\cdot)$. SAF encourages the expected output probability of each mini-batch to be close to the marginal class distribution of the model, after normalization by the histogram distribution. It helps the model produce diverse predictions, especially in barely supervised settings (Sohn et al., 2020), and thus converge faster and generalize better, as also shown in Figure 1(b).

The overall objective for FreeMatch at the $t$-th iteration is:

\mathcal{L}=\mathcal{L}_{s}+w_{u}\mathcal{L}_{u}+w_{f}\mathcal{L}_{f}, (12)

where $w_{u}$ and $w_{f}$ represent the loss weights for $\mathcal{L}_{u}$ and $\mathcal{L}_{f}$, respectively. With $\mathcal{L}_{u}$ and $\mathcal{L}_{f}$, FreeMatch maximizes the mutual information between its outputs and inputs. We present the procedure of FreeMatch in Algorithm 1 of the Appendix.
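A hedged sketch (ours) of the SAF term in Eqs. (9)-(11) and how it enters the overall objective of Eq. (12); `p_tilde` and `h_tilde` denote the EMA statistics from Eqs. (6) and (10), and all names are illustrative.

```python
import torch

def saf_loss(probs_u_w, probs_u_s, tau_per_class, p_tilde, h_tilde, num_classes, eps=1e-8):
    conf, pseudo = probs_u_w.max(dim=-1)
    keep = conf >= tau_per_class[pseudo]                 # mask used in Eq. (9)
    if keep.sum() == 0:                                  # nothing retained in this batch
        return probs_u_w.new_zeros(())
    # p_bar: batch-mean strong-view probability over retained samples (masked rows are zero).
    p_bar = (probs_u_s * keep.unsqueeze(-1).float()).mean(dim=0)
    # h_bar: histogram of strong-view pseudo labels over retained samples.
    pseudo_s = probs_u_s.argmax(dim=-1)[keep]
    h_bar = torch.bincount(pseudo_s, minlength=num_classes).float() / probs_u_s.shape[0]
    target = p_tilde / (h_tilde + eps); target = target / target.sum()   # SumNorm(p~ / h~)
    pred = p_bar / (h_bar + eps); pred = pred / pred.sum()               # SumNorm(p_bar / h_bar)
    # Eq. (11): L_f = -H(target, pred) = sum(target * log(pred)).
    return (target * torch.log(pred + eps)).sum()

# Eq. (12): overall objective (weights as in the experiments section).
# loss = sup_loss + w_u * sat_unsup_loss(...) + w_f * saf_loss(...)
```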

5 Experiments

5.1 Setup

We evaluate FreeMatch on common benchmarks: CIFAR-10/100 (Krizhevsky et al., 2009), SVHN (Netzer et al., 2011), STL-10 (Coates et al., 2011) and ImageNet (Deng et al., 2009). Following previous work (Sohn et al., 2020; Xu et al., 2021; Zhang et al., 2021; Oliver et al., 2018), we conduct experiments with varying amounts of labeled data. In addition to the commonly-chosen labeled amounts, following (Sohn et al., 2020), we further include the most challenging case of CIFAR-10: each class has only one labeled sample.

For fair comparison, we train and evaluate all methods using the unified codebase TorchSSL (Zhang et al., 2021) with the same backbones and hyperparameters. Concretely, we use Wide ResNet-28-2 (Zagoruyko & Komodakis, 2016) for CIFAR-10, Wide ResNet-28-8 for CIFAR-100, Wide ResNet-37-2 (Zhou et al., 2020) for STL-10, and ResNet-50 (He et al., 2016) for ImageNet. We use SGD with a momentum of 0.9 as the optimizer. The initial learning rate is 0.03 with a cosine learning rate decay schedule $\eta=\eta_{0}\cos(\frac{7\pi k}{16K})$, where $\eta_{0}$ is the initial learning rate, $k$ ($K$) is the current (total) training step, and we set $K=2^{20}$ for all datasets. At the testing phase, we use an exponential moving average of the training model with a momentum of 0.999 to conduct inference for all algorithms. The batch size of labeled data is 64, except for ImageNet where we set it to 128. We use the same weight decay value, pre-defined threshold $\tau$, unlabeled batch ratio $\mu$, and loss weights introduced for Pseudo-Label (Lee et al., 2013), $\Pi$ model (Rasmus et al., 2015), Mean Teacher (Tarvainen & Valpola, 2017), VAT (Miyato et al., 2018), MixMatch (Berthelot et al., 2019b), ReMixMatch (Berthelot et al., 2019a), UDA (Xie et al., 2020a), FixMatch (Sohn et al., 2020), and FlexMatch (Zhang et al., 2021).
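For clarity, the cosine learning-rate schedule above can be written as the following small sketch (ours); the hyperparameter values follow the text, and everything else is illustrative.

```python
import math

def cosine_lr(step: int, total_steps: int = 2 ** 20, base_lr: float = 0.03) -> float:
    # eta = eta_0 * cos(7 * pi * k / (16 * K))
    return base_lr * math.cos(7.0 * math.pi * step / (16.0 * total_steps))

# Example: learning rate at the start, midway, and at the end of training.
for k in (0, 2 ** 19, 2 ** 20):
    print(k, round(cosine_lr(k), 5))
```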

We implement MPL based on UDA as in (Pham et al., 2021), where we set the temperature to 0.8 and $w_{u}$ to 10. We do not fine-tune MPL on labeled data as in (Pham et al., 2021), since we find that fine-tuning makes the model overfit the labeled data, especially when there are very few of them. For Dash, we use the same parameters as in (Xu et al., 2021), except that we warm up on labeled data for only 2 epochs (i.e., 2,048 training iterations), since too much warm-up leads to overfitting. For FreeMatch, we set $w_{u}=1$ for all experiments. Besides, we set $w_{f}=0.01$ for CIFAR-10 with 10 labels, CIFAR-100 with 400 labels, STL-10 with 40 labels, ImageNet with 100k labels, and all experiments on SVHN. For other settings, we use $w_{f}=0.05$. For SVHN, we find that using a low threshold at the early training stage impedes the model from clustering the unlabeled data, so we adopt two training techniques: (1) warming up the model on only labeled data for 2 epochs as in Dash; and (2) restricting the SAT threshold to the range $[0.9, 0.95]$. The detailed hyperparameters are introduced in Appendix D. We train each algorithm 3 times using different random seeds and report the best error rates of all checkpoints (Zhang et al., 2021).

Table 1: Error rates on CIFAR-10/100, SVHN, and STL-10 datasets. The fully-supervised results of STL-10 are unavailable since we do not have label information for its unlabeled data. Bold indicates the best result and underline indicates the second-best result. The significance tests and average error rates for each dataset can be found in Appendix E.1.
Dataset CIFAR-10 CIFAR-100 SVHN STL-10
# Label 10 40 250 4000 400 2500 10000 40 250 1000 40 1000
Π Model (Rasmus et al., 2015) 79.18±1.11 74.34±1.76 46.24±1.29 13.13±0.59 86.96±0.80 58.80±0.66 36.65±0.00 67.48±0.95 13.30±1.12 7.16±0.11 74.31±0.85 32.78±0.40
Pseudo Label (Lee et al., 2013) 80.21±0.55 74.61±0.26 46.49±2.20 15.08±0.19 87.45±0.85 57.74±0.28 36.55±0.24 64.61±5.6 15.59±0.95 9.40±0.32 74.68±0.99 32.64±0.71
VAT (Miyato et al., 2018) 79.81±1.17 74.66±2.12 41.03±1.79 10.51±0.12 85.20±1.40 46.84±0.79 32.14±0.19 74.75±3.38 4.33±0.12 4.11±0.20 74.74±0.38 37.95±1.12
MeanTeacher (Tarvainen & Valpola, 2017) 76.37±0.44 70.09±1.60 37.46±3.30 8.10±0.21 81.11±1.44 45.17±1.06 31.75±0.23 36.09±3.98 3.45±0.03 3.27±0.05 71.72±1.45 33.90±1.37
MixMatch (Berthelot et al., 2019b) 65.76±7.06 36.19±6.48 13.63±0.59 6.66±0.26 67.59±0.66 39.76±0.48 27.78±0.29 30.60±8.39 4.56±0.32 3.69±0.37 54.93±0.96 21.70±0.68
ReMixMatch (Berthelot et al., 2019a) 20.77±7.48 9.88±1.03 6.30±0.05 4.84±0.01 42.75±1.05 26.03±0.35 20.02±0.27 24.04±9.13 6.36±0.22 5.16±0.31 32.12±6.24 6.74±0.14
UDA (Xie et al., 2020a) 34.53±10.69 10.62±3.75 5.16±0.06 4.29±0.07 46.39±1.59 27.73±0.21 22.49±0.23 5.12±4.27 1.92±0.05 1.89±0.01 37.42±8.44 6.64±0.17
FixMatch (Sohn et al., 2020) 24.79±7.65 7.47±0.28 4.86±0.05 4.21±0.08 46.42±0.82 28.03±0.16 22.20±0.12 3.81±1.18 2.02±0.02 1.96±0.03 35.97±4.14 6.25±0.33
Dash (Xu et al., 2021) 27.28±14.09 8.93±3.11 5.16±0.23 4.36±0.11 44.82±0.96 27.15±0.22 21.88±0.07 2.19±0.18 2.04±0.02 1.97±0.01 34.52±4.30 6.39±0.56
MPL (Pham et al., 2021) 23.55±6.01 6.62±0.91 5.76±0.24 4.55±0.04 46.26±1.84 27.71±0.19 21.74±0.09 9.33±8.02 2.29±0.04 2.28±0.02 35.76±4.83 6.66±0.00
FlexMatch (Zhang et al., 2021) 13.85±12.04 4.97±0.06 4.98±0.09 4.19±0.01 39.94±1.62 26.49±0.20 21.90±0.15 8.19±3.20 6.59±2.29 6.72±0.30 29.15±4.16 5.77±0.18
FreeMatch 8.07±4.24 4.90±0.04 4.88±0.18 4.10±0.02 37.98±0.42 26.47±0.20 21.68±0.03 1.97±0.02 1.97±0.01 1.96±0.03 15.56±0.55 5.63±0.15
Fully-Supervised 4.62±0.05 19.30±0.09 2.13±0.01 -

5.2 Quantitative Results

The Top-1 classification error rates on CIFAR-10/100, SVHN, and STL-10 are reported in Table 1. The results on ImageNet with 100 labels per class are in Table 2. We also provide detailed results on precision, recall, F1 score, and confusion matrices in Appendix E.3. These quantitative results demonstrate that FreeMatch achieves the best performance on the CIFAR-10, STL-10, and ImageNet datasets, and it produces results on SVHN very close to those of the best competitor. On CIFAR-100, FreeMatch is better than ReMixMatch when there are 400 labels. The good performance of ReMixMatch on CIFAR-100 (2500) and CIFAR-100 (10000) is probably brought by the mixup (Zhang et al., 2017) technique and the self-supervised learning part. On ImageNet with 100k labels, FreeMatch significantly outperforms the latest counterpart FlexMatch by 1.28%. Following (Zhang et al., 2021), we train ImageNet for $2^{20}$ iterations like the other datasets for a fair comparison, using 4 Tesla V100 GPUs. We also notice from Table 2 that FreeMatch exhibits fast computation on ImageNet. Note that FlexMatch is much slower than FixMatch and FreeMatch because it needs to maintain a list that records whether each sample is clean, which incurs a heavy indexing overhead on large datasets.

Table 2: Error rates and runtime on ImageNet with 100 labels per class.
Method Top-1 Top-5 Runtime (sec./iter.)
FixMatch 43.66 21.80 0.4
FlexMatch 41.85 19.48 0.6
FreeMatch 40.57 18.77 0.4

It is noteworthy that FreeMatch consistently outperforms other methods by a large margin in settings with extremely limited labeled data: by 5.78% on CIFAR-10 with 10 labels, 1.96% on CIFAR-100 with 400 labels, and, surprisingly, 13.59% on STL-10 with 40 labels. STL-10 is a more realistic and challenging dataset than the others, as it contains a large unlabeled set of 100k images. The significant improvements demonstrate the capability and potential of FreeMatch for deployment in real-world applications.

5.3 Qualitative Analysis

We present some qualitative analysis: why and how does FreeMatch work, and what other benefits does it bring? We evaluate the class-average threshold and average sampling rate of FreeMatch on STL-10 (40) (i.e., 40 labeled samples on STL-10) to demonstrate that it works in line with our theoretical analysis. We record the threshold and compute the sampling rate for each batch during training. The sampling rate is calculated on unlabeled data as $\frac{\sum_{b=1}^{\mu B}\mathbbm{1}(\max(q_{b})>\tau_{t}(\arg\max(q_{b})))}{\mu B}$. We also plot the convergence speed in terms of accuracy and the confusion matrix to show that the proposed components in FreeMatch help improve performance. From Figure 3(a) and Figure 3(b), one can observe that the threshold and sampling rate changes of FreeMatch are mostly consistent with our theoretical analysis. That is, at the early stage of training, the threshold of FreeMatch is relatively lower compared to FlexMatch and FixMatch, resulting in higher unlabeled data utilization (sampling rate), which speeds up convergence. As the model learns better and becomes more confident, the threshold of FreeMatch increases to a high value to alleviate the confirmation bias, leading to a stably high sampling rate. Correspondingly, the accuracy of FreeMatch increases rapidly (as shown in Figure 3(c)), resulting in better class-wise accuracy (as shown in Figure 3(d)). Note that Dash fails to learn properly because it maintains a high sampling rate until 100k iterations.

To further demonstrate the effectiveness of the class-specific threshold in FreeMatch, we present the t-SNE (Van der Maaten & Hinton, 2008) visualization of the features of FlexMatch and FreeMatch on STL-10 (40) in Figure 5 of Section E.8. We exhibit the corresponding local threshold for each class. Interestingly, FlexMatch has a high threshold, i.e., the pre-defined 0.95, for class 0 and class 6, yet their feature variances are very large and they are confused with other classes. This means the class-wise thresholds in FlexMatch cannot accurately reflect the learning status. In contrast, FreeMatch clusters most classes better. Besides, for the similar classes 1, 3, 5, and 7 that are confused with each other, FreeMatch retains a higher average threshold of 0.87 than the 0.84 of FlexMatch, enabling it to mask more wrong pseudo labels. We also study the pseudo label accuracy in Appendix E.9, which shows that FreeMatch reduces noise during training.

Figure 3: How FreeMatch works in STL-10 with 40 labels, compared to others. (a) Class-average confidence threshold; (b) class-average sampling rate; (c) convergence speed in terms of accuracy; (d) confusion matrix, where fading colors of diagonal elements refer to the disparity of accuracy.

5.4 Ablation Study

Self-adaptive Threshold

We conduct experiments on the components of SAT in FreeMatch and compare to the components in FlexMatch (Zhang et al., 2021), FixMatch (Sohn et al., 2020), Class-Balanced Self-Training (CBST) (Zou et al., 2018), and Relative Threshold (RT) in AdaMatch (Berthelot et al., 2022). The ablation is conducted on CIFAR-10 (40 labels).

Table 3: Comparison of different thresholding schemes.
Threshold CIFAR-10 (40)
$\tau$ (FixMatch) 7.47±0.28
$\tau\cdot\mathcal{M}(\beta(c))$ (FlexMatch) 4.97±0.06
$\tau\cdot\operatorname{MaxNorm}(\tilde{p}_{t}(c))$ 5.13±0.03
$\tau_{t}$ (Global) 6.06±0.65
$\tau_{t}\cdot\mathcal{M}(\beta(c))$ 8.40±2.49
CBST 16.65±2.90
RT (AdaMatch) 6.09±0.65
SAT (Global and Local) 4.92±0.04

As shown in Table 3, SAT achieves the best performance among all the thresholding schemes. The self-adaptive global threshold $\tau_{t}$ and local threshold $\operatorname{MaxNorm}(\tilde{p}_{t}(c))$ by themselves also achieve comparable results to the fixed threshold $\tau$, demonstrating that both the proposed local and global thresholds are good learning-status estimators. When using CPL $\mathcal{M}(\beta(c))$ to adjust $\tau_{t}$, the result is worse than the fixed threshold and exhibits a larger variance, indicating potential instability of CPL. AdaMatch (Berthelot et al., 2022) uses RT, which can be viewed as a global threshold at the $t$-th iteration computed on the predictions of labeled data without EMA, whereas FreeMatch computes $\tau_{t}$ with EMA on unlabeled data, which better reflects the overall data distribution. For the class-wise threshold, CBST (Zou et al., 2018) maintains a pre-defined sampling rate, which could be the reason for its poor performance, since the sampling rate should change during training as we analyzed in Sec. 2. Note that we did not include $\mathcal{L}_{f}$ in this ablation for a fair comparison. Ablation studies in Appendix E.4 and E.5 on FixMatch and FlexMatch with different thresholds show that SAT reduces the hyperparameter-tuning computation (or overall training time) while achieving performance similar to that of an optimally selected threshold.

Table 4: Comparison of different class fairness items.
Fairness CIFAR-10 (10)
w/o fairness 10.37±7.70
$U\log\overline{p}$ 9.57±6.67
$U\log\operatorname{SumNorm}(\frac{\overline{p}}{\overline{h}})$ 12.07±5.23
DA (AdaMatch) 32.94±1.83
DA (ReMixMatch) 11.06±8.21
SAF 8.07±4.24

Self-adaptive Fairness

As illustrated in Table 4, we also empirically study the effect of SAF on CIFAR-10 (10 labels). We study the original version of the fairness objective as in (Arazo et al., 2020). Based on that, we study the operation of normalizing the probability by histograms and show that countering the effect of the imbalanced underlying distribution indeed helps the model learn better and make more diverse predictions. One may notice that adding the original fairness regularization alone already improves performance, whereas adding the normalization operation inside the log hurts performance, suggesting that the underlying batch data are indeed not uniformly distributed. We also evaluate Distribution Alignment (DA) for class fairness as in ReMixMatch (Berthelot et al., 2019a) and AdaMatch (Berthelot et al., 2022), which shows inferior results to SAF. A possible reason for the worse performance of DA (AdaMatch) is that it only uses the labeled-batch prediction as the target distribution, which cannot reflect the true data distribution, especially when labeled data is scarce; changing the target distribution to the ground-truth uniform distribution, i.e., DA (ReMixMatch), is better in the case with extremely limited labels. We also show that SAF can be easily plugged into FlexMatch and brings improvements in Appendix E.6. The EMA decay ablation and the performance in imbalanced settings are in Appendix E.5 and Appendix E.7.

6 Related Work

To reduce confirmation bias (Arazo et al., 2020) in pseudo labeling, confidence-based thresholding techniques have been proposed to ensure the quality of pseudo labels (Xie et al., 2020a; Sohn et al., 2020; Zhang et al., 2021; Xu et al., 2021), where only the unlabeled data whose confidences are higher than the threshold are retained. UDA (Xie et al., 2020a) and FixMatch (Sohn et al., 2020) keep a fixed pre-defined threshold during training. FlexMatch (Zhang et al., 2021) adjusts the pre-defined threshold in a class-specific fashion according to the per-class learning status estimated by the number of confident unlabeled samples. A concurrent work, Adsh (Guo & Li, 2022), explicitly optimizes the number of pseudo labels for each class in the SSL objective to obtain adaptive thresholds for imbalanced semi-supervised learning. However, it still needs a user-predefined threshold. Dash (Xu et al., 2021) defines a threshold according to the loss on labeled data and adjusts it according to a fixed mechanism. A more recent work, AdaMatch (Berthelot et al., 2022), aims to unify SSL and domain adaptation using a pre-defined threshold multiplied by the average confidence of the labeled data batch to mask noisy pseudo labels. It needs a pre-defined threshold and ignores the unlabeled data distribution, especially when labeled data is too rare to reflect it. Besides, distribution alignment (Berthelot et al., 2019a; 2022) is also utilized in AdaMatch to encourage fair predictions on unlabeled data. Previous methods might fail to choose meaningful thresholds because they ignore the relationship between the model's learning status and the thresholds. Chen et al. (2020); Kumar et al. (2020) try to understand self-training / thresholding from the theoretical perspective. In this work, a motivating example is used to derive implications for adjusting the threshold according to the learning status.

Besides consistency regularization, entropy-based regularization is also used in SSL. Entropy minimization (Grandvalet et al., 2005) encourages the model to make confident predictions for all samples regardless of the actual class predicted. Maximizing the entropy of the expected prediction over all samples (Andreas Krause, 2010; Arazo et al., 2020) has also been proposed to induce fairness, enforcing the model to predict each class at the same frequency. However, previous methods assume a uniform prior for the underlying data distribution and also ignore the batch data distribution. Distribution alignment (Berthelot et al., 2019a) adjusts the pseudo labels according to the labeled data distribution and the EMA of model predictions.

7 Conclusion

We proposed FreeMatch that utilizes self-adaptive thresholding and class-fairness regularization for SSL. FreeMatch outperforms strong competitors across a variety of SSL benchmarks, especially in the barely-supervised setting. We believe that confidence thresholding has more potential in SSL. A potential limitation is that the adaptiveness still originates from the heuristics of the model prediction, and we hope the efficacy of FreeMatch inspires more research for optimal thresholding.

References

  • Andreas Krause (2010) Ryan Gomes Andreas Krause, Pietro Perona. Discriminative clustering by regularized information maximization. In Advances in neural information processing systems, 2010.
  • Arazo et al. (2020) Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pp.  1–8. IEEE, 2020.
  • Bachman et al. (2014) Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. Advances in neural information processing systems, 27:3365–3373, 2014.
  • Berthelot et al. (2019a) David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. Remixmatch: Semi-supervised learning with distribution matching and augmentation anchoring. In International Conference on Learning Representations, 2019a.
  • Berthelot et al. (2019b) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems, 32, 2019b.
  • Berthelot et al. (2022) David Berthelot, Rebecca Roelofs, Kihyuk Sohn, Nicholas Carlini, and Alex Kurakin. Adamatch: A unified approach to semi-supervised learning and domain adaptation. In International Conference on Learning Representations (ICLR), 2022.
  • Carlini et al. (2019) Nicholas Carlini, Ulfar Erlingsson, and Nicolas Papernot. Distribution density, tails, and outliers in machine learning: Metrics and applications. arXiv preprint arXiv:1910.13427, 2019.
  • Chapelle et al. (2006) Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien (eds.). Semi-Supervised Learning. The MIT Press, 2006.
  • Chen et al. (2020) Yining Chen, Colin Wei, Ananya Kumar, and Tengyu Ma. Self-training avoids using spurious features under domain shift. Advances in Neural Information Processing Systems, 33:21061–21071, 2020.
  • Coates et al. (2011) Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp.  215–223. JMLR Workshop and Conference Proceedings, 2011.
  • Cubuk et al. (2020) Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp.  702–703, 2020.
  • Dai et al. (2017) Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Russ R Salakhutdinov. Good semi-supervised learning that requires a bad gan. Advances in neural information processing systems, 30, 2017.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.  248–255. Ieee, 2009.
  • Dong et al. (2018) Linhao Dong, Shuang Xu, and Bo Xu. Speech-transformer: a no-recurrence sequence-to-sequence model for speech recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  5884–5888. IEEE, 2018.
  • Fan et al. (2021) Yue Fan, Dengxin Dai, and Bernt Schiele. Cossl: Co-learning of representation and classifier for imbalanced semi-supervised learning. arXiv preprint arXiv:2112.04564, 2021.
  • Gong et al. (2016) Chen Gong, Dacheng Tao, Stephen J Maybank, Wei Liu, Guoliang Kang, and Jie Yang. Multi-modal curriculum learning for semi-supervised image classification. IEEE Transactions on Image Processing, 25(7):3249–3260, 2016.
  • Grandvalet et al. (2005) Yves Grandvalet, Yoshua Bengio, et al. Semi-supervised learning by entropy minimization. volume 367, pp.  281–296, 2005.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330. PMLR, 2017.
  • Guo & Li (2022) Lan-Zhe Guo and Yu-Feng Li. Class-imbalanced semi-supervised learning with adaptive thresholding. In International Conference on Machine Learning, pp. 8082–8094. PMLR, 2022.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • John Bridle (1991) David MacKay John Bridle, Anthony Heading. Unsupervised classifiers, mutual information and ’phantom targets. 1991.
  • Kervadec et al. (2019) Hoel Kervadec, Jose Dolz, Éric Granger, and Ismail Ben Ayed. Curriculum semi-supervised segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp.  568–576. Springer, 2019.
  • Kim et al. (2020) Jaehyung Kim, Youngbum Hur, Sejun Park, Eunho Yang, Sung Ju Hwang, and Jinwoo Shin. Distribution aligning refinery of pseudo-label for imbalanced semi-supervised learning. Advances in Neural Information Processing Systems, 33:14567–14579, 2020.
  • Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krizhevsky et al. (2009) Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
  • Kumar et al. (2020) Ananya Kumar, Tengyu Ma, and Percy Liang. Understanding self-training for gradual domain adaptation. In International Conference on Machine Learning, pp. 5468–5479. PMLR, 2020.
  • Lee et al. (2013) Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3, pp.  896, 2013.
  • Lee et al. (2021) Hyuck Lee, Seungjae Shin, and Heeyoung Kim. Abc: Auxiliary balanced classifier for class-imbalanced semi-supervised learning. Advances in Neural Information Processing Systems, 34, 2021.
  • McLachlan (1975) Geoffrey J McLachlan. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association, 70(350):365–369, 1975.
  • Miyato et al. (2018) Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8):1979–1993, 2018.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  • Oliver et al. (2018) Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. Realistic evaluation of deep semi-supervised learning algorithms. Advances in neural information processing systems, 31, 2018.
  • Pham et al. (2021) Hieu Pham, Zihang Dai, Qizhe Xie, and Quoc V Le. Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11557–11568, 2021.
  • Rasmus et al. (2015) Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. Semi-supervised learning with ladder networks. Advances in Neural Information Processing Systems, 28:3546–3554, 2015.
  • Rizve et al. (2020) Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In International Conference on Learning Representations, 2020.
  • Rosenberg et al. (2005) Chuck Rosenberg, Martial Hebert, and Henry Schneiderman. Semi-supervised self-training of object detection models. 2005.
  • Sajjadi et al. (2016) Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in neural information processing systems, 29:1163–1171, 2016.
  • Samuli & Timo (2017) Laine Samuli and Aila Timo. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations (ICLR), volume 4, pp.  6, 2017.
  • Sohn et al. (2020) Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems, 33, 2020.
  • Tarvainen & Valpola (2017) Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pp.  1195–1204, 2017.
  • Van der Maaten & Hinton (2008) Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. (2022) Yidong Wang, Hao Chen, Yue Fan, SUN Wang, Ran Tao, Wenxin Hou, Renjie Wang, Linyi Yang, Zhi Zhou, Lan-Zhe Guo, et al. Usb: A unified semi-supervised learning benchmark for classification. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • Wei et al. (2021) Chen Wei, Kihyuk Sohn, Clayton Mellina, Alan Yuille, and Fan Yang. Crest: A class-rebalancing self-training framework for imbalanced semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10857–10866, 2021.
  • Xie et al. (2020a) Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33, 2020a.
  • Xie et al. (2020b) Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10687–10698, 2020b.
  • Xu et al. (2021) Yi Xu, Lei Shang, Jinxing Ye, Qi Qian, Yu-Feng Li, Baigui Sun, Hao Li, and Rong Jin. Dash: Semi-supervised learning with dynamic thresholding. In International Conference on Machine Learning, pp. 11525–11536. PMLR, 2021.
  • Yang & Xu (2020) Yuzhe Yang and Zhi Xu. Rethinking the value of labels for improving class-imbalanced learning. In NeurIPS, 2020.
  • Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference 2016. British Machine Vision Association, 2016.
  • Zhang et al. (2021) Bowen Zhang, Yidong Wang, Wenxin Hou, Hao Wu, Jindong Wang, Manabu Okumura, and Takahiro Shinozaki. Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34, 2021.
  • Zhang et al. (2017) Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
  • Zhou et al. (2020) Tianyi Zhou, Shengjie Wang, and Jeff Bilmes. Time-consistent self-supervision for semi-supervised learning. In International Conference on Machine Learning, pp. 11523–11533. PMLR, 2020.
  • Zhu & Goldberg (2009) Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis lectures on artificial intelligence and machine learning, 3(1):1–130, 2009.
  • Zhu (2005) Xiaojin Jerry Zhu. Semi-supervised learning literature survey. 2005.
  • Zou et al. (2018) Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In Proceedings of the European conference on computer vision (ECCV), pp.  289–305, 2018.

Appendix A Experimental details of the “two-moon” dataset.

We generate only two labeled data points (one label per class, denoted by the black dot and round circle) and 1,000 unlabeled data points (in gray) in 2-D space. We train a 3-layer MLP with 64 neurons in each layer and ReLU activation for 2,000 iterations. The red samples indicate the samples whose confidence values are above the threshold of FreeMatch but below that of FixMatch. The sampling rate is computed on unlabeled data as $\sum_{b=1}^{N_{U}}\mathbbm{1}(\max(q_{b})>\tau)/N_{U}$. Results are averaged over 5 runs.
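A minimal sketch (ours) of how such a setup could be constructed with scikit-learn and PyTorch is shown below; the labeled indices, noise level, and network definition are assumptions, and the SSL training loop is omitted.

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_moons

# 1,000+ unlabeled points plus one labeled point per class (indices are illustrative).
x_all, y_all = make_moons(n_samples=1002, noise=0.1, random_state=0)
x_all = torch.tensor(x_all, dtype=torch.float32)
y_all = torch.tensor(y_all)
labeled_idx = [int((y_all == 0).nonzero()[0]), int((y_all == 1).nonzero()[0])]

mlp = nn.Sequential(                      # 3-layer MLP, 64 neurons per hidden layer, ReLU
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),
)
print(x_all.shape, mlp(x_all[labeled_idx]).shape)
```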

Appendix B Proof of Theorem 2.1

Theorem 2.1 For a binary classification problem as mentioned above, the pseudo label $Y_{p}$ has the following probability distribution:

\begin{split}P(Y_{p}=1)&=\frac{1}{2}\Phi(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}})+\frac{1}{2}\Phi(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}),\\ P(Y_{p}=-1)&=\frac{1}{2}\Phi(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}})+\frac{1}{2}\Phi(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}),\\ P(Y_{p}=0)&=1-P(Y_{p}=1)-P(Y_{p}=-1),\end{split} (13)

where $\Phi$ is the cumulative distribution function of a standard normal distribution. Moreover, $P(Y_{p}=0)$ increases as $\mu_{2}-\mu_{1}$ gets smaller.

Proof.

A sample $x$ will be assigned pseudo label $+1$ if

\frac{1}{1+\exp{(-\beta(x-\frac{\mu_{1}+\mu_{2}}{2}))}}>\tau,

which is equivalent to

x>\frac{\mu_{1}+\mu_{2}}{2}+\frac{1}{\beta}\log(\frac{\tau}{1-\tau}).

Likewise, x will be assigned pseudo label -1 if

\frac{1}{1+\exp{(-\beta(x-\frac{\mu_{1}+\mu_{2}}{2}))}}<1-\tau,

which is equivalent to

x<\frac{\mu_{1}+\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau}).

If we integrate over x, we arrive at the following conditional probabilities:

\displaystyle P(Y_{p}=1|Y=1)=\Phi\left(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}\right),
\displaystyle P(Y_{p}=1|Y=-1)=\Phi\left(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}\right),
\displaystyle P(Y_{p}=-1|Y=-1)=\Phi\left(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}\right),
\displaystyle P(Y_{p}=-1|Y=1)=\Phi\left(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}\right).

Recall that P(Y=1)=P(Y=-1)=0.5, therefore

\displaystyle P(Y_{p}=1)=\frac{1}{2}\Phi\left(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}\right)+\frac{1}{2}\Phi\left(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}\right),
\displaystyle P(Y_{p}=-1)=\frac{1}{2}\Phi\left(\frac{\frac{\mu_{2}-\mu_{1}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{1}}\right)+\frac{1}{2}\Phi\left(\frac{\frac{\mu_{1}-\mu_{2}}{2}-\frac{1}{\beta}\log(\frac{\tau}{1-\tau})}{\sigma_{2}}\right).

Now, let us use z to denote \mu_{2}-\mu_{1}. To show that P(Y_{p}=0) increases as \mu_{2}-\mu_{1} gets smaller, it suffices to show that P(Y_{p}=-1)+P(Y_{p}=1) increases as z increases. We write P(Y_{p}=-1)+P(Y_{p}=1) as

P(Y_{p}=-1)+P(Y_{p}=1)=\frac{1}{2}\Phi(a_{1}z-b_{1})+\frac{1}{2}\Phi(-a_{1}z-b_{1})+\frac{1}{2}\Phi(a_{2}z-b_{2})+\frac{1}{2}\Phi(-a_{2}z-b_{2}),

where a_{1}=\frac{1}{2\sigma_{1}},a_{2}=\frac{1}{2\sigma_{2}},b_{1}=\frac{\log(\frac{\tau}{1-\tau})}{\beta\sigma_{1}},b_{2}=\frac{\log(\frac{\tau}{1-\tau})}{\beta\sigma_{2}} are positive constants. It therefore suffices to show that f(z)=\frac{1}{2}\Phi(a_{1}z-b_{1})+\frac{1}{2}\Phi(-a_{1}z-b_{1}) is monotone increasing on (0,\infty); the same argument applies to the pair involving a_{2} and b_{2}. Taking the derivative with respect to z, we have

f^{\prime}(z)=\frac{1}{2}a_{1}(\phi(a_{1}z-b_{1})-\phi(-a_{1}z-b_{1})),

where \phi is the probability density function of the standard normal distribution. Since |a_{1}z-b_{1}|<|-a_{1}z-b_{1}| for z>0, we have f^{\prime}(z)>0, and the proof is complete.
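As a numerical sanity check (not part of the proof), the closed-form probabilities in Theorem 2.1 can be compared against a Monte-Carlo simulation of the balanced two-Gaussian mixture; the parameter values below are arbitrary illustrative choices:

```python
import numpy as np
from scipy.stats import norm

# Arbitrary illustrative parameters (mu1 < mu2).
mu1, mu2, sigma1, sigma2, beta, tau = -1.0, 1.0, 1.0, 1.5, 2.0, 0.95
shift = np.log(tau / (1 - tau)) / beta
m = (mu2 - mu1) / 2 - shift
n = (mu1 - mu2) / 2 - shift

# Closed-form probabilities from Theorem 2.1.
p_pos = 0.5 * norm.cdf(m / sigma2) + 0.5 * norm.cdf(n / sigma1)
p_neg = 0.5 * norm.cdf(m / sigma1) + 0.5 * norm.cdf(n / sigma2)

# Monte-Carlo estimate: sample x from the balanced mixture and apply the sigmoid thresholding rule.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=1_000_000)
x = np.where(y == 1, rng.normal(mu2, sigma2, y.size), rng.normal(mu1, sigma1, y.size))
conf = 1.0 / (1.0 + np.exp(-beta * (x - (mu1 + mu2) / 2)))
print(p_pos, (conf > tau).mean())      # P(Y_p = 1)
print(p_neg, (conf < 1 - tau).mean())  # P(Y_p = -1)
```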

Appendix C Algorithm

We present the pseudo-code of FreeMatch in Algorithm 1. Compared to FixMatch, each training step additionally updates the global and local thresholds from the unlabeled data batch and computes the corresponding histograms. FreeMatch thus introduces only a negligible additional computation cost over FixMatch, as also demonstrated in our main paper. A short PyTorch-style sketch of the threshold update is provided after Algorithm 1.

Algorithm 1 FreeMatch algorithm at the t-th iteration.
1:  Input: Number of classes C, labeled batch \mathcal{X}=\{(x_{b},y_{b}):b\in(1,2,\dots,B)\}, unlabeled batch \mathcal{U}=\{u_{b}:b\in(1,2,\dots,\mu B)\}, unsupervised loss weight w_{u}, fairness loss weight w_{f}, and EMA decay \lambda.
2:  Compute \mathcal{L}_{s} on labeled data: \mathcal{L}_{s}=\frac{1}{B}\sum_{b=1}^{B}\mathcal{H}(y_{b},p_{m}(y|\omega(x_{b})))
3:  Update the global threshold: \tau_{t}=\lambda\tau_{t-1}+(1-\lambda)\frac{1}{\mu B}\sum_{b=1}^{\mu B}\max(q_{b}) {q_{b} is an abbreviation of p_{m}(y|\omega(u_{b})); shape of \tau_{t}: [1]}
4:  Update the local threshold: \tilde{p}_{t}=\lambda\tilde{p}_{t-1}+(1-\lambda)\frac{1}{\mu B}\sum_{b=1}^{\mu B}q_{b} {shape of \tilde{p}_{t}: [C]}
5:  Update the histogram for \tilde{p}_{t}: \tilde{h}_{t}=\lambda\tilde{h}_{t-1}+(1-\lambda)\operatorname{Hist}_{\mu B}(\hat{q}_{b}) {shape of \tilde{h}_{t}: [C]}
6:  for c=1 to C do
7:     \tau_{t}(c)=\operatorname{MaxNorm}(\tilde{p}_{t}(c))\cdot\tau_{t} {calculate the self-adaptive threshold (SAT)}
8:  end for
9:  Compute \mathcal{L}_{u} on unlabeled data: \mathcal{L}_{u}=\frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathbbm{1}(\max(q_{b})\geq\tau_{t}(\arg\max(q_{b})))\cdot\mathcal{H}(\hat{q}_{b},Q_{b})
10:  Compute the expectation of probability on unlabeled data: \overline{p}=\frac{1}{\mu B}\sum_{b=1}^{\mu B}\mathbbm{1}(\max(q_{b})\geq\tau_{t}(\arg\max(q_{b})))Q_{b} {Q_{b} is an abbreviation of p_{m}(y|\Omega(u_{b})); shape of \overline{p}: [C]}
11:  Compute the histogram for \overline{p}: \overline{h}=\operatorname{Hist}_{\mu B}(\mathbbm{1}(\max(q_{b})\geq\tau_{t}(\arg\max(q_{b})))\hat{Q}_{b}) {shape of \overline{h}: [C]}
12:  Compute \mathcal{L}_{f} on unlabeled data: \mathcal{L}_{f}=-\mathcal{H}(\operatorname{SumNorm}(\frac{\tilde{p}_{t}}{\tilde{h}_{t}}),\operatorname{SumNorm}(\frac{\overline{p}}{\overline{h}}))
13:  Return: \mathcal{L}_{s}+w_{u}\cdot\mathcal{L}_{u}+w_{f}\cdot\mathcal{L}_{f}
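For readers who prefer code, a minimal PyTorch-style sketch of steps 3-9 (threshold updates, SAT, and the masked unsupervised loss) follows. The helper functions and initial values are ours, the histogram-based fairness loss is omitted, and this is a simplification rather than the released implementation:

```python
import torch
import torch.nn.functional as F

def update_thresholds(q, tau_prev, p_tilde_prev, lam=0.999):
    """EMA update of the global threshold tau_t and the local (per-class) estimate p_tilde_t
    from weak-augmentation probabilities q of shape [mu*B, C]; returns them with the SAT."""
    tau_t = lam * tau_prev + (1 - lam) * q.max(dim=1).values.mean()
    p_tilde_t = lam * p_tilde_prev + (1 - lam) * q.mean(dim=0)
    sat = p_tilde_t / p_tilde_t.max() * tau_t        # MaxNorm(p_tilde_t) * tau_t
    return tau_t, p_tilde_t, sat

def unsupervised_loss(q_weak, logits_strong, sat):
    """Masked cross-entropy on the strongly-augmented view, using the per-class SAT."""
    conf, pseudo = q_weak.max(dim=1)
    mask = (conf >= sat[pseudo]).float()             # keep samples above their class threshold
    return (mask * F.cross_entropy(logits_strong, pseudo, reduction="none")).mean()

# Illustrative usage with random tensors (C classes, mu*B unlabeled samples).
C, muB = 10, 448
q_weak = torch.softmax(torch.randn(muB, C), dim=1)
logits_strong = torch.randn(muB, C)
tau, p_tilde = torch.tensor(1.0 / C), torch.full((C,), 1.0 / C)   # initialization at 1/C
tau, p_tilde, sat = update_thresholds(q_weak, tau, p_tilde)
print(unsupervised_loss(q_weak, logits_strong, sat))
```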

Appendix D Hyperparameter setting

For reproducibility, we list the detailed hyperparameter settings of FreeMatch in Tables 5 and 6, for algorithm-dependent and algorithm-independent hyperparameters, respectively.

Table 5: Algorithm dependent hyperparameters.

Algorithm FreeMatch
Unlabeled data to labeled data ratio (CIFAR-10/100, STL-10, SVHN) 7
Unlabeled data to labeled data ratio (ImageNet) 1
Loss weight w_u for all experiments 1
Loss weight w_f for CIFAR-10 (10), CIFAR-100 (400), STL-10 (40), ImageNet (100k), SVHN 0.01
Loss weight w_f for other settings 0.05
Thresholding EMA decay for all experiments 0.999

Table 6: Algorithm independent hyperparameters.

Dataset CIFAR-10 CIFAR-100 STL-10 SVHN ImageNet
Model WRN-28-2 WRN-28-8 WRN-37-2 WRN-28-2 ResNet-50
Weight decay 5e-4 1e-3 5e-4 5e-4 3e-4
Batch size 64 64 64 64 128
Learning rate 0.03 0.03 0.03 0.03 0.03
SGD momentum 0.9 0.9 0.9 0.9 0.9
EMA decay 0.999 0.999 0.999 0.999 0.999

Note that for the ImageNet experiments, we used the same learning rate, optimizer scheme, and training iterations as in the other experiments, and adopted a batch size of 128, whereas FixMatch uses a larger batch size of 1024 and a different optimizer. From our experiments, we found that training ImageNet for only 2^{20} iterations is not enough: the model is still converging at the end of training. Longer training on ImageNet will be explored in the future. A single NVIDIA V100 is used for training on CIFAR-10, CIFAR-100, SVHN, and STL-10. Training takes about 2 days on CIFAR-10 and SVHN, and about 10 days on CIFAR-100 and STL-10.

Appendix E Extensive Experiment Details and Results

We present extensive experiment details and results as complementary to the experiments in the main paper.

E.1 Significance Test

We performed a significance test using the Friedman test. We choose the top 7 algorithms on 4 datasets (i.e., N=4, k=7). We then compute the F value as \tau_{F}=3.56, which is clearly larger than the critical values 2.661 (\alpha=0.05) and 2.130 (\alpha=0.1). The test indicates that there are significant differences among the algorithms.
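The following SciPy sketch illustrates how such a test can be computed; the error-rate matrix is a random placeholder rather than our actual results, and the Iman-Davenport F correction of the Friedman statistic is our assumption about the exact variant used:

```python
import numpy as np
from scipy.stats import friedmanchisquare, f

# Placeholder error-rate matrix: N = 4 datasets (rows) x k = 7 algorithms (columns).
rng = np.random.default_rng(0)
errors = rng.random((4, 7))
N, k = errors.shape

chi2 = friedmanchisquare(*[errors[:, j] for j in range(k)]).statistic
F_F = (N - 1) * chi2 / (N * (k - 1) - chi2)                 # Iman-Davenport correction
critical = f.ppf(0.95, dfn=k - 1, dfd=(k - 1) * (N - 1))    # critical value at alpha = 0.05
print(F_F, critical)
```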

To further demonstrate the significance, we report the average error rates on each dataset in Table 7. FreeMatch outperforms most SSL algorithms by a clear margin.

Table 7: The average error rates for each dataset.
CIFAR-10 CIFAR-100 SVHN STL-10 Total Average
\Pi Model 53.22 60.80 29.31 53.55 49.19
Pseudo Label 54.10 60.58 29.87 53.66 49.59
VAT 51.50 54.73 27.73 56.35 47.17
MeanTeacher 48.01 52.68 14.27 52.81 41.54
MixMatch 30.56 45.04 12.95 38.32 31.07
ReMixMatch 10.45 29.60 11.85 19.43 17.08
UDA 13.65 32.20 2.98 22.03 17.02
FixMatch 10.33 32.22 2.60 21.11 15.67
Dash 11.43 31.28 2.07 20.46 15.56
MPL 10.12 31.90 4.63 21.21 16.04
FlexMatch 7.00 29.44 7.17 17.46 14.40
FreeMatch 5.49 28.71 1.97 10.60 11.26

E.2 CIFAR-10 (10) Labeled Data

Following (Sohn et al., 2020), we investigate the limitations of SSL algorithms by providing only one labeled training sample per class. The three selected labeled training sets are visualized in Figure 4; they are obtained by Sohn et al. (2020) using the ordering mechanism of Carlini et al. (2019).

Refer to caption
Figure 4: CIFAR-10 (10) labeled samples visualization, sorted from the most prototypical dataset (first row) to least prototypical dataset (last row).

E.3 Detailed Results

To comprehensively evaluate the performance of all methods in a classification setting, we further report the precision, recall, F1 score, and AUC (area under the curve) results on CIFAR-10 with the same 10 labels, CIFAR-100 with 400 labels, SVHN with 40 labels, and STL-10 with 40 labels. As shown in Tables 8 and 9, FreeMatch also achieves the best precision, recall, F1 score, and AUC, in addition to the top-1 error rates reported in the main paper.

Table 8: Precision, recall, F1 score, and AUC results on CIFAR-10/100.

Datasets CIFAR-10 (10) CIFAR-100 (400)
Criteria Precision Recall F1 Score AUC Precision Recall F1 Score AUC
UDA 0.5304 0.5121 0.4754 0.8258 0.5813 0.5484 0.5087 0.9475
FixMatch 0.6436 0.6622 0.6110 0.8934 0.5574 0.5430 0.4946 0.9363
Dash 0.6409 0.5410 0.4955 0.8458 0.5833 0.5649 0.5215 0.9456
MPL 0.6286 0.6857 0.6178 0.7993 0.5799 0.5606 0.5193 0.9316
FlexMatch 0.6769 0.6861 0.6780 0.9126 0.6135 0.6193 0.6107 0.9675
FreeMatch 0.8619 0.8593 0.8523 0.9843 0.6243 0.6261 0.6137 0.9692

Table 9: Precision, recall, F1 score, and AUC results on SVHN and STL-10.

Datasets SVHN (40) STL-10 (40)
Criteria Precision Recall F1 Score AUC Precision Recall F1 Score AUC
UDA 0.9783 0.9777 0.9780 0.9977 0.6385 0.5319 0.4765 0.8581
FixMatch 0.9731 0.9706 0.9716 0.9962 0.6590 0.5830 0.5405 0.8862
Dash 0.9782 0.9778 0.9780 0.9978 0.8117 0.6020 0.5448 0.8827
MPL 0.9564 0.9513 0.9512 0.9844 0.6191 0.5740 0.4999 0.8529
FlexMatch 0.9566 0.9691 0.9625 0.9975 0.6403 0.6755 0.6518 0.9249
FreeMatch 0.9783 0.9800 0.9791 0.9979 0.8489 0.8439 0.8354 0.9792

E.4 Ablation of pre-defined thresholds on FixMatch and FlexMatch

As shown in Table 10, the performance of FixMatch and FlexMatch is quite sensitive to changes of the pre-defined threshold \tau.

Table 10: FixMatch and FlexMatch with different thresholds on CIFAR-10 (40).
\tau FixMatch FlexMatch
0.25 11.76±0.60 18.84±0.36
0.5 16.29±0.31 14.16±0.21
0.75 15.61±0.23 6.08±0.17
0.95 7.47±0.28 4.97±0.06
0.98 8.01±0.91 5.40±0.11

E.5 Ablation on EMA decay on CIFAR-10 (40)

We provide an ablation study on the EMA decay parameter \lambda used in Equations 5 and 6. As shown in Table 11, different decay values \lambda produce close results on CIFAR-10 with 40 labels, indicating that FreeMatch is not sensitive to this hyperparameter. A very large \lambda is not encouraged since it could impede the update of the global / local thresholds.

Table 11: Error rates of different thresholding EMA decay.
Thresholding EMA decay CIFAR-10 (40)
0.9 4.94±0.06
0.99 4.92±0.08
0.999 4.90±0.04
0.9999 5.03±0.07

E.6 Ablation of SAF on FlexMatch and FreeMatch

In Table 13, we compare different class fairness objectives on CIFAR-10 with 10 labels. FreeMatch outperforms FlexMatch in both settings. In addition, SAF also proves effective when combined with FlexMatch.

Table 13: Ablation of SAF on FlexMatch and FreeMatch on CIFAR-10 (10)
Fairness Objective FlexMatch FreeMatch
w/o SAF 13.85±12.04 10.37±7.70
w/ SAF 12.60±8.16 8.07±4.24

E.7 Ablation of Imbalanced SSL

Table 14: Error rates (%) of imbalanced SSL using 3 different random seeds.
Dataset CIFAR-10-LT CIFAR-100-LT
Imbalance ratio \gamma 50 150 20 100
FixMatch 18.5±0.48 31.2±1.08 49.1±0.62 62.5±0.36
FlexMatch 17.8±0.24 29.5±0.47 48.9±0.71 62.7±0.08
FreeMatch 17.7±0.33 28.8±0.64 48.4±0.91 62.5±0.23
FixMatch w/ ABC 14.0±0.22 22.3±1.08 46.6±0.69 58.3±0.41
FlexMatch w/ ABC 14.2±0.34 23.1±0.70 46.2±0.47 58.9±0.51
FreeMatch w/ ABC 13.9±0.03 22.3±0.26 45.6±0.76 58.9±0.55

To further demonstrate the effectiveness of FreeMatch, we evaluate it on the imbalanced SSL setting (Kim et al., 2020; Wei et al., 2021; Lee et al., 2021; Fan et al., 2021), where both the labeled and the unlabeled data are imbalanced. We conduct experiments on CIFAR-10-LT and CIFAR-100-LT with different imbalance ratios. The imbalance ratio on the CIFAR datasets is defined as \gamma=N_{max}/N_{min}, where N_{max} is the number of samples of the head (most frequent) class and N_{min} that of the tail (rarest) class. The number of samples for class k is computed as N_{k}=N_{max}\gamma^{-\frac{k-1}{C-1}}, where C is the number of classes. Following (Lee et al., 2021; Fan et al., 2021), we set N_{max}=1500 for CIFAR-10 and N_{max}=150 for CIFAR-100, and the number of unlabeled samples per class is twice that of the labeled samples. We use WRN-28-2 (Zagoruyko & Komodakis, 2016) as the backbone and Adam (Kingma & Ba, 2014) as the optimizer. The initial learning rate is 0.002 with a cosine decay schedule \eta=\eta_{0}\cos(\frac{7\pi k}{16K}), where \eta_{0} is the initial learning rate, k (K) is the current (total) training step, and we set K=2.5\times 10^{5} for all datasets. The batch sizes of labeled and unlabeled data are 64 and 128, respectively. Weight decay is set to 4e-5. Each experiment is run on three different data splits, and we report the average of the best error rates.
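For concreteness, a small sketch of the per-class sample counts and the cosine learning-rate schedule described above is given below; the function names are ours, and the printed values mirror the CIFAR-10-LT setting with \gamma=100:

```python
import math

def class_counts(n_max: int, gamma: float, num_classes: int):
    """N_k = N_max * gamma^(-(k-1)/(C-1)) for k = 1, ..., C."""
    return [round(n_max * gamma ** (-(k - 1) / (num_classes - 1))) for k in range(1, num_classes + 1)]

def cosine_lr(eta0: float, step: int, total_steps: int) -> float:
    """eta = eta0 * cos(7 * pi * k / (16 * K))."""
    return eta0 * math.cos(7 * math.pi * step / (16 * total_steps))

print(class_counts(n_max=1500, gamma=100, num_classes=10))      # labeled counts per class for CIFAR-10-LT
print(cosine_lr(eta0=0.002, step=125_000, total_steps=250_000))  # learning rate at the halfway point
```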

The results are summarized in Table 14. Compared with the other standard SSL methods, FreeMatch achieves the best performance across all settings. Especially on CIFAR-10 at imbalance ratio 150, FreeMatch outperforms the second best by 2.4%. Moreover, when plugged into the imbalanced SSL method ABC (Lee et al., 2021), FreeMatch still attains the best performance in most of the settings.

E.8 t-SNE Visualization on STL-10 (40)

We plot the t-SNE visualization of the features on STL-10 with 40 labels from FlexMatch (Zhang et al., 2021) and FreeMatch. FreeMatch shows a better-separated feature space than FlexMatch, with fewer confusing clusters.

Refer to caption
(a) FlexMatch (train, test)
Refer to caption
(b) FreeMatch (train, test)
Figure 5: t-SNE visualization of FlexMatch and FreeMatch features on STL-10 (40). Unlabeled data are shown in gray. The local threshold \tau_{t}(c) for each class is shown in the legend.

E.9 Pseudo Label accuracy on CIFAR-10 (10)

We average the pseudo label accuracy over three random seeds and report it in Figure 6. The results indicate that mapping thresholds from a high fixed threshold, as FlexMatch does, can prevent unlabeled samples from being involved in training. In this case, the model can overfit the labeled data and the small amount of unlabeled data that passes the threshold, so the predictions on unlabeled data incorporate more noise. Introducing appropriate unlabeled data during training avoids this overfitting and yields more accurate pseudo labels.

Refer to caption
Figure 6: CIFAR-10 (10) Pseudo Label accuracy visualization.

E.10 CIFAR-10 (10) Confusion Matrix

We plot the confusion matrices of FreeMatch and other SSL methods on CIFAR-10 (10) in Figure 7. Notably, even with the least prototypical labeled data in our setting, FreeMatch still obtains good results, while the other SSL methods fail to separate the unlabeled data into different clusters, which is inconsistent with the low-density assumption in SSL.

Refer to caption
(a) The most prototypical labeled samples
Refer to caption
(b) The second-most prototypical labeled samples
Refer to caption
(c) The least prototypical labeled samples
Figure 7: Confusion matrix on the test set of CIFAR-10 (10). Rows correspond to the rows in Figure 4. Columns correspond to different SSL methods.