Semi-Supervised Learning with Multi-Head Co-Training
Abstract
Co-training, extended from self-training, is one of the frameworks for semi-supervised learning. Without a natural split of features, single-view co-training works at the cost of training extra classifiers, and the algorithm has to be delicately designed to prevent the individual classifiers from collapsing into each other. To remove these obstacles, which deter the adoption of single-view co-training, we present a simple and efficient algorithm, Multi-Head Co-Training. By integrating base learners into a multi-head structure, the model requires only a minimal number of extra parameters. Every classification head in the unified model interacts with its peers through a “Weak and Strong Augmentation” strategy, in which diversity is naturally brought by the strong data augmentation. Therefore, the proposed method facilitates single-view co-training by 1). promoting diversity implicitly and 2). requiring only a small extra computational overhead. The effectiveness of Multi-Head Co-Training is demonstrated in an empirical study on standard semi-supervised learning benchmarks.
Introduction
Benefiting from rich data sources and growing computing power in the last decade, the field of machine learning has been thriving. The advent of public datasets with a large amount of high-quality labels has further spawned many successful deep learning methods (Deng et al. 2009; He et al. 2016; Zagoruyko and Komodakis 2016; Krizhevsky, Sutskever, and Hinton 2012). However, there can be various difficulties in obtaining label information, such as privacy, labor costs, safety or ethical issues, and the requirement of domain experts (Zhou 2018; Chapelle, Schlkopf, and Zien 2010; Mahajan et al. 2018). All of these impel us to find a way to bring unlabeled data into full play. Semi-Supervised Learning (SSL) is a branch of machine learning that seeks to address this problem (Chapelle, Schlkopf, and Zien 2010; Chapelle, Chi, and Zien 2006; Prakash and Nithya 2014; Van Engelen and Hoos 2020). It utilizes both labeled and unlabeled data to improve performance.
As one of the earliest and most popular SSL frameworks, self-training works by iteratively retraining the model using pseudo-labels obtained from itself (Lee 2013; Berthelot et al. 2019b, a; McLachlan 1975). Despite its simplicity and alignment to the task of interest (Zoph et al. 2020), self-training underperforms due to the “confirmation bias” or “error accumulation”. It means that some incorrect predictions could be selected as pseudo-labels to guide subsequent training, resulting in a loop of self-reinforcing errors (Zhang et al. 2016).
As an extension of self-training, co-training lets multiple individual models iteratively learn from each other (Zhou and Li 2010; Wang and Zhou 2017). In the early multi-view co-training setting (Blum and Mitchell 1998), there should be a natural split of features for which the “sufficiency and independence” assumptions hold, i.e., it is sufficient to make predictions based on each view, and the views are conditionally independent. Later studies gradually revealed that co-training can also be successful in the single-view setting (Wang and Zhou 2017; Dasgupta, Littman, and McAllester 2002; Abney 2002; Balcan, Blum, and Yang 2005; Wang and Zhou 2007). Despite being feasible, the single-view co-training framework has received little attention recently. We attribute this to (a) the extra computational cost, which means at least twice the model parameters of its self-training counterpart, and (b) the loss in simplicity, i.e., more design choices and hyper-parameters are introduced to keep the individual classifiers uncorrelated.
In this paper, we aim to facilitate the adoption of single-view co-training. Inspired by recent developments of data augmentation and its applications in SSL (Berthelot et al. 2019b; Sohn et al. 2020; Cubuk et al. 2020; DeVries and Taylor 2017), we find that the enormous size of the augmentation search space naturally prevents base learners from converging to a consensus. Employing stochastic image augmentation frees us from delicately designing different network structures or training algorithms. Moreover, by replacing multiple individual models with a shared module followed by multiple classification heads, the model achieves co-training with a minimal number of extra parameters. Combining these, we propose Multi-Head Co-Training, a new algorithm that facilitates the usage of single-view co-training. The main contributions are as follows:
• Multi-Head Co-Training addresses two obstacles of standard single-view co-training, i.e., extra design effort and extra computational cost.
• Experimentally, we show that our method obtains state-of-the-art results on CIFAR, SVHN, and Mini-ImageNet. Besides, we systematically study the components of Multi-Head Co-Training.
• We further analyze the calibration of SSL methods and provide insights regarding the link between confirmation bias and model calibration.
Related work
In this section, we concentrate on relevant studies to set the stage for Multi-Head Co-Training. More extensive surveys on SSL can be found in (Prakash and Nithya 2014; Van Engelen and Hoos 2020; Zhu 2005; Zhou and Li 2010; Zhou 2018; Subramanya and Talukdar 2014).
The basic assumptions in SSL are the smoothness assumption and the low-density assumption. The smoothness assumption states that if two or more data points are close in the sample space, they should belong to the same class. Similarly, the low-density assumption states that the decision boundary of a classification model should not pass through high-density regions of the sample space. These assumptions are intuitive in vision tasks because an image with small noise is still semantically identical to the original one. A dominant paradigm in SSL is grounded on these assumptions. From this point of view, various ways of making use of unlabeled data, including consistency regularization, entropy minimization, perturbation-based methods, self-training, and co-training, are essentially similar.
Consistency regularization constrains the model to make consistent predictions across the same example under variants of noises. In a general form,
$$d\big(p_\theta(y \mid x),\ p_{\theta'}(y \mid \hat{x})\big) \qquad (1)$$
where $p_\theta$ and $p_{\theta'}$ are the modeled distributions. Different subscripts are used here to indicate that they could come from different models. The target example is denoted as $x$ and its noisy counterpart as $\hat{x}$. $d(\cdot,\cdot)$ can be any distance measurement, such as KL divergence or mean squared error. SSL methods falling into this category differ in the source of noise, the models for the two distributions, and the distance measurement. For example, VAT (Miyato et al. 2018) generates noise in an adversarial direction. Laine & Aila (Laine and Aila 2016) propose the Π-Model and Temporal Ensembling. The Π-Model applies Gaussian noise, dropout, etc., to augment images. Temporal Ensembling further ensembles prior network evaluations to encourage consistent predictions. Mean Teacher (Tarvainen and Valpola 2017) instead maintains an Exponential Moving Average (EMA) of the model's parameters. ICT (Verma et al. 2019) applies consistency regularization between the prediction on an interpolation of unlabeled points and the interpolation of the predictions at those points. UDA (Xie et al. 2019) replaces the traditional data augmentation with unsupervised data augmentation.
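As a rough illustration of Eq. (1), the following PyTorch-style sketch computes a generic consistency loss. Here `model` and `noise_fn` are placeholders (any classifier returning logits and any noise or augmentation function), and KL divergence is chosen as one possible distance $d(\cdot,\cdot)$; different methods instantiate these pieces differently.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x, noise_fn):
    # Target distribution on the clean example; stopping gradients here is one
    # common choice, the exact treatment varies between methods.
    with torch.no_grad():
        p_clean = F.softmax(model(x), dim=-1)
    # Distribution on the noisy counterpart of the same example.
    log_p_noisy = F.log_softmax(model(noise_fn(x)), dim=-1)
    # KL divergence as one possible choice of the distance d(., .) in Eq. (1).
    return F.kl_div(log_p_noisy, p_clean, reduction="batchmean")
```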
Self-training (in the field of SSL, the terminology “self-training” overlaps with “pseudo-labeling”, which refers to training the model incrementally instead of retraining in every iteration; for clarity, we use “self-training” throughout this paper to refer to the broad category of methods) favors low-density separation by using the model's own predictions as pseudo-labels. Pseudo-Labeling (Lee 2013) picks the most confident predictions as hard (one-hot) pseudo-labels. Apart from that, the method also uses an auto-encoder and dropout as regularization. MixMatch (Berthelot et al. 2019b) uses the average of predictions on an image under multiple augmentations as the soft pseudo-label. Furthermore, Mixup (Zhang et al. 2017), as a regularizer, is used to mix labeled and unlabeled data. ReMixMatch (Berthelot et al. 2019a) improves MixMatch by introducing other regularization techniques and a modified version of AutoAugment (Cubuk et al. 2019). FixMatch (Sohn et al. 2020) finds an effective combination of image augmentation techniques and pseudo-labeling. One of the reasons for the popularity of self-training is its simplicity: it can be used with almost all supervised classifiers (Van Engelen and Hoos 2020). Another important but rarely mentioned factor is its awareness of the task of interest. Although some other unsupervised constraints may help build general representations, it has been shown that self-training aligns well with the task of interest and benefits the model in a more secure way (Zoph et al. 2020). However, confirmation bias, where prediction mistakes accumulate during the training process, damages the performance of self-training.
As an extension of self-training, co-training alleviates the problem of confirmation bias. Two or more models are trained on each other's predictions. In the original form (Blum and Mitchell 1998), two individual classifiers are trained on two views. Moreover, the proposed co-training algorithm requires the “sufficiency and independence” assumptions to hold, i.e., each view should be sufficient to perform accurate prediction, and the views should be conditionally independent given the class. Nevertheless, later studies show that weak independence (Abney 2002; Balcan, Blum, and Yang 2005) or single-view data (Wang and Zhou 2017, 2007; Du, Ling, and Zhou 2010) is enough for a successful co-training algorithm.
Without distinct views of the data containing complementary information, single-view co-training has to promote diversity in some other way. It has also been shown that the more uncorrelated the individual classifiers are, the more effective the algorithm is (Wang and Zhou 2017, 2007). Several early studies attempt to split the single-view data (Balcan, Blum, and Yang 2005; Chen, Weinberger, and Chen 2011), and some further approach a pure single-view setting by introducing diversity among the classifiers in other ways (Zhou and Goldman 2004; Goldman and Zhou 2000; Xu, He, and Man 2012). Recently, Deep Co-training (Qiao et al. 2018) maintains disagreement through a view difference constraint. Tri-Net (Dong-DongChen et al. 2018) adopts a multi-head structure; to prevent consensus, it designs different head structures and samples different sub-datasets for the learners. CoMatch (Li, Xiong, and Hoi 2020) applies a graph-based smoothness regularization and also integrates supervised contrastive learning. Although these methods have achieved a number of successes, they complicate the practical adoption of co-training. In Multi-Head Co-Training, employing stochastic image augmentation frees us from delicately designing different network structures or training algorithms. Unlike other methods, no preventive measures, e.g., an extra loss term or different base learners, are needed to avoid collapse. Along with the reduction of computational cost brought by the multi-head structure, the proposed method facilitates the adoption of single-view co-training.
Multi-Head Co-Training
[Figure 1: Diagram of Multi-Head Co-Training with three heads.]
In this section, we introduce the details of Multi-Head Co-Training. Formally, for a $C$-class classification problem, SSL aims to model a class distribution $p(y \mid x)$ for input $x$ utilizing both labeled and unlabeled data. In Multi-Head Co-Training, the parametric model consists of a shared module $f$ and multiple classification heads $g^m$ ($m = 1, \dots, M$) with the same structure. Let $p^m(y \mid x)$ represent the predicted class distribution produced by $f$ and $g^m$. All classification heads are updated using a consensus of predictions from the other heads. Apart from that, following the principle of recent successful SSL algorithms, this method utilizes image augmentation techniques to employ a “Weak and Strong Augmentation” strategy: it uses predictions on weakly augmented images, which are relatively more accurate, to correct the predictions on strongly augmented images. The weak and strong augmentation functions are denoted as $\alpha(\cdot)$ and $\mathcal{A}(\cdot)$ respectively and are further introduced in The “Weak and Strong Augmentation” strategy. The diagram of Multi-Head Co-Training with three heads is shown in Figure 1. The overall algorithm is shown in Algorithm 1.
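As a concrete illustration, the multi-head structure can be sketched in PyTorch as below. The `shared` and `head` sub-modules are placeholders (e.g., the first blocks of a WRN and its last block plus a fully connected layer); this is a minimal sketch, not the exact implementation used in the experiments.

```python
import copy
import torch.nn as nn

class MultiHeadModel(nn.Module):
    """A shared module followed by M classification heads with identical structure."""

    def __init__(self, shared: nn.Module, head: nn.Module, num_heads: int = 3):
        super().__init__()
        self.shared = shared  # e.g. the early blocks of a WRN backbone
        # Each head is an independent copy of the same block structure.
        self.heads = nn.ModuleList(
            [copy.deepcopy(head) for _ in range(num_heads)]
        )

    def forward(self, x):
        z = self.shared(x)                 # shared feature representation
        return [g(z) for g in self.heads]  # one logit tensor per head
```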
Algorithm 1: Multi-Head Co-Training.
Input: Labeled batch $\mathcal{X}$, unlabeled batch $\mathcal{U}$, unsupervised loss weight $\lambda_u$.
Output: Updated model.
The multi-head structure
In every training iteration of Multi-Head Co-Training, a batch of labeled examples $\mathcal{X} = \{(x_b, y_b)\}_{b=1}^{B}$ and a batch of unlabeled examples $\mathcal{U} = \{u_b\}_{b=1}^{\mu B}$ are randomly sampled from the labeled and unlabeled dataset respectively. For supervised training, the parameters in the shared module $f$ and all heads $g^m$ ($m = 1, \dots, M$) are updated to minimize the cross-entropy loss between predictions and true labels, i.e.,
$$\mathcal{L}_s \;=\; \frac{1}{MB}\sum_{m=1}^{M}\sum_{b=1}^{B} \mathrm{H}\big(y_b,\ p^m(y \mid \alpha(x_b))\big) \qquad (2)$$
where $\mathrm{H}(\cdot,\cdot)$ represents the cross-entropy loss function and $\alpha(x_b)$ is the weakly augmented labeled example.
For co-training, every head interacts with its peers through pseudo-labels on unlabeled data. To obtain reliable predictions for pseudo-labeling, weakly augmented unlabeled examples first pass through the shared module and all heads simultaneously,
$$\hat{y}_b^m \;=\; \operatorname*{arg\,max}_{y}\ p^m\big(y \mid \alpha(u_b)\big), \quad m = 1, \dots, M \qquad (3)$$
where $\arg\max$ picks the class with the maximal probability, i.e., the most confident predicted class $\hat{y}_b^m$. To avoid confirmation bias, pseudo-labels for each head depend on the predicted classes from the other heads,
$$\mathcal{Y}_b^m \;=\; \{\hat{y}_b^1, \dots, \hat{y}_b^M\} \setminus \{\hat{y}_b^m\} \qquad (4)$$
where $\setminus$ is the operation of removing an element from the multiset. The most frequently predicted class in the multiset $\mathcal{Y}_b^m$ is the pseudo-label $\tilde{y}_b^m$ for the $m$-th head, and it is selected only when more than half of the heads agree on the prediction,
$$\tilde{y}_b^m \;=\; \operatorname{Mode}\big(\mathcal{Y}_b^m\big), \qquad \omega_b^m \;=\; \big[\!\big[\ \big|\{\hat{y} \in \mathcal{Y}_b^m : \hat{y} = \tilde{y}_b^m\}\big| > M/2\ \big]\!\big] \qquad (5)$$
where $[\![\cdot]\!]$ refers to the Iverson brackets, defined to be 1 if the statement inside is true and 0 otherwise. $\omega_b^m$ indicates whether the pseudo-label of the $b$-th example is selected for the $m$-th head. After the selection process, uncertain (less agreed upon) examples are filtered out. In the meantime, the strongly augmented unlabeled examples $\mathcal{A}(u_b)$ go through the shared module $f$ and the corresponding head $g^m$. The average cross-entropy is then calculated:
$$\mathcal{L}_b^m \;=\; \omega_b^m\, \mathrm{H}\big(\tilde{y}_b^m,\ p^m(y \mid \mathcal{A}(u_b))\big), \qquad \mathcal{L}_u \;=\; \frac{1}{M \mu B}\sum_{m=1}^{M}\sum_{b=1}^{\mu B} \mathcal{L}_b^m \qquad (6)$$
where $\mathcal{A}(u_b)$ comes from $M$ separate applications of the strong augmentation $\mathcal{A}(\cdot)$, one for each head. Note that the transformation function $\mathcal{A}(\cdot)$ generates a differently augmented image every time. The supervised loss and unsupervised loss are added together as the total loss (weighted by the coefficient $\lambda_u$), i.e.,
$$\mathcal{L} \;=\; \mathcal{L}_s + \lambda_u\, \mathcal{L}_u \qquad (7)$$
The algorithm proceeds until reaching a fixed number of iterations (training details are given in the supplementary material).
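The pseudo-label selection of Eqs. (3)-(5) and the losses of Eqs. (2), (6) and (7) can be sketched as follows. This is a simplified, illustrative implementation: `model` is assumed to return a list of per-head logits (as in the multi-head sketch earlier), `weak_aug` and `strong_aug` are assumed to act on whole batches, and the real implementation may organize the forward passes more efficiently.

```python
import torch
import torch.nn.functional as F

def pseudo_labels_for_head(preds, m):
    """Pseudo-labels for head m by majority vote among its peers (Eqs. 4-5).

    preds: LongTensor [M, B] of per-head predicted classes on the weakly
           augmented unlabeled batch (Eq. 3).
    Returns (pseudo, mask): pseudo-labels and a 0/1 selection mask for head m.
    """
    M = preds.shape[0]
    others = torch.cat([preds[:m], preds[m + 1:]], dim=0)  # predictions of the other heads
    pseudo, _ = torch.mode(others, dim=0)                  # most frequent class per example
    agree = (others == pseudo.unsqueeze(0)).sum(dim=0)     # number of agreeing peers
    mask = (agree > M / 2).float()                         # keep only well-agreed examples
    return pseudo, mask

def training_step(model, x, y, u, weak_aug, strong_aug, lambda_u=1.0):
    """One training iteration (Eqs. 2, 6 and 7); `model` returns a list of per-head logits."""
    # Supervised loss on weakly augmented labeled data, averaged over heads (Eq. 2).
    sup_logits = model(weak_aug(x))
    loss_s = sum(F.cross_entropy(logits, y) for logits in sup_logits) / len(sup_logits)

    # Per-head predicted classes on weakly augmented unlabeled data (Eq. 3).
    with torch.no_grad():
        preds = torch.stack([l.argmax(dim=-1) for l in model(weak_aug(u))])  # [M, B]

    # Each head learns from peer pseudo-labels on its own strongly augmented copy (Eq. 6).
    loss_u = 0.0
    for m in range(preds.shape[0]):
        pseudo, mask = pseudo_labels_for_head(preds, m)
        logits_m = model(strong_aug(u))[m]  # a fresh strong augmentation for every head
        per_example = F.cross_entropy(logits_m, pseudo, reduction="none")
        loss_u = loss_u + (mask * per_example).mean()
    loss_u = loss_u / preds.shape[0]

    return loss_s + lambda_u * loss_u       # total loss (Eq. 7)
```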
As discussed in Related work, the quality of pseudo-labels and the diversity between individual classifiers are two key requirements for a co-training algorithm to succeed. In our method, the diversity between heads inherently comes from the randomness in the strong augmentation function (consequently, the unlabeled examples for each head are differently augmented, selected, and pseudo-labeled). Unlike typical co-training algorithms, diversity is promoted implicitly in Multi-Head Co-Training. In terms of the quality of pseudo-labels, the ensemble of the other heads' predictions on weakly augmented examples is used for accurate pseudo-labeling and selection.
The “Weak and Strong Augmentation” strategy
Due to the scarcity of labels, preventing overfitting, or in other words improving generalization, is the core challenge of SSL. Data augmentation approaches this problem by expanding the training set and thus plays a vital role in SSL. In Multi-Head Co-Training, the weak and strong augmentation functions differ in the degree of image augmentation. Specifically, the weak transformation function $\alpha(\cdot)$ applies random horizontal flip and random crop. Two augmentation techniques, namely RandAugment (Cubuk et al. 2020) and Cutout (DeVries and Taylor 2017), constitute the strong transformation function $\mathcal{A}(\cdot)$. In RandAugment, a given number of operations are randomly selected from a fixed set of geometric and photometric transformations, such as affine transformations and color adjustments, and applied to images with a given magnitude. Cutout randomly masks out square regions of images. Both are applied sequentially in the strong augmentation. It has been shown that unsupervised learning benefits from stronger data augmentation (Chen et al. 2020), and the same preference extends to SSL. Thus, the setting of RandAugment follows the modified, stronger version in FixMatch (Sohn et al. 2020); details are reported in the supplementary material.
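One possible instantiation of the two pipelines with torchvision is sketched below for 32×32 CIFAR-style images. The RandAugment and RandomErasing settings are assumptions (RandomErasing stands in for Cutout, and `RandAugment` requires a recent torchvision version); the exact parameters used in the paper are listed in the supplementary material, and the SVHN pipeline omits the horizontal flip.

```python
from torchvision import transforms

# Weak augmentation alpha(.): flip and crop only.
weak_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4, padding_mode="reflect"),
    transforms.ToTensor(),
])

# Strong augmentation A(.): RandAugment followed by a Cutout-like masking.
strong_aug = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4, padding_mode="reflect"),
    transforms.RandAugment(num_ops=2, magnitude=10),  # random geometric/photometric ops
    transforms.ToTensor(),
    transforms.RandomErasing(p=1.0, scale=(0.02, 0.25), value=0.5),  # stand-in for Cutout
])
```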
Exponential moving average
To enforce smoothness, Exponential Moving Average (EMA) is a widely used technique in SSL. In this paper, we maintain an EMA model for evaluation. Its parameters are updated every iteration using the training-time model’s parameters:
$$\theta'_t \;=\; \beta\, \theta'_{t-1} + (1 - \beta)\, \theta_t \qquad (8)$$
where $\theta'$ denotes the parameters of the EMA model, $\theta$ denotes the parameters of the training-time model, and $\beta$ is the decay rate which controls how much the average model moves at every update. At test time, we simply ensemble all heads' predictions of the EMA model by adding them together.
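A minimal sketch of the EMA update of Eq. (8) and the test-time head ensemble is given below. The decay rate 0.999 is an assumed, commonly used value, and summing softmax outputs is one way of adding the heads' predictions together.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(ema_model, model, decay=0.999):
    # Eq. (8): theta' <- decay * theta' + (1 - decay) * theta.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)
    for b_ema, b in zip(ema_model.buffers(), model.buffers()):
        b_ema.copy_(b)  # keep buffers such as BatchNorm statistics in sync

@torch.no_grad()
def ensemble_predict(ema_model, x):
    # Test-time prediction: sum the heads' softmax outputs and take the argmax.
    probs = [F.softmax(logits, dim=-1) for logits in ema_model(x)]
    return torch.stack(probs).sum(dim=0).argmax(dim=-1)
```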
Experiments
The number and structure of heads in our framework can be arbitrary, but we set the head number as three and the head structure as the last residual block (Zagoruyko and Komodakis 2016) in most of the experiments. The choice is the result of a trade-off between accuracy and efficiency (illustrated in Ablation study).
Results
Table 1: Error rates (%) on SVHN, CIFAR-10, and CIFAR-100 with varying numbers of labels.
Method | SVHN 250 labels | SVHN 500 labels | SVHN 1000 labels | CIFAR-10 250 labels | CIFAR-10 1000 labels | CIFAR-10 4000 labels | CIFAR-100 10000 labels
Tri-net | - | - | 3.71±0.14 | - | - | 8.45±0.22 | -
Π-Model | 17.65±0.27 | 11.44±0.39 | 8.60±0.18 | 53.02±2.05 | 31.53±0.98 | 17.41±0.37 | 37.88±0.11
Pseudo-Label | 21.16±0.88 | 14.35±0.37 | 10.19±0.41 | 49.98±1.17 | 30.91±1.73 | 16.21±0.11 | 36.21±0.19
VAT | 8.41±1.01 | 7.44±0.79 | 5.98±0.21 | 36.03±2.82 | 18.68±0.40 | 11.05±0.31 | -
Mean Teacher | 6.45±2.43 | 3.82±0.17 | 3.75±0.10 | 47.32±4.71 | 17.32±4.00 | 10.36±0.25 | 35.83±0.24
MixMatch | 3.78±0.26 | 3.27±0.31 | 3.27±0.31 | 11.08±0.87 | 7.75±0.32 | 6.24±0.06 | 28.31±0.33
ReMixMatch | 3.10±0.50 | - | 2.83±0.30 | 6.27±0.34 | 5.73±0.16 | 5.14±0.04 | 23.03±0.56
FixMatch (RA) | 2.48±0.38 | - | 2.28±0.11 | 5.07±0.65 | - | 4.26±0.05 | 22.60±0.12
FixMatch (CTA) | 2.64±0.64 | - | 2.36±0.19 | 5.07±0.33 | - | 4.31±0.15 | 23.18±0.11
Ours | 2.21±0.18 | 2.18±0.08 | 2.16±0.05 | 4.98±0.30 | 4.74±0.16 | 3.84±0.09 | 21.68±0.16
We benchmark the proposed method on experimental settings using CIFAR-10 (Krizhevsky, Hinton et al. 2009), CIFAR-100 (Krizhevsky, Hinton et al. 2009), and SVHN (Netzer et al. 2011), as is standard practice. Different portions of labeled data, ranging from 0.5% to 20%, are experimented with. For comparison, we consider Tri-net (Dong-DongChen et al. 2018), Π-Model (Laine and Aila 2016), Pseudo-Label (Lee 2013), Mean Teacher (Tarvainen and Valpola 2017), VAT (Miyato et al. 2018), MixMatch (Berthelot et al. 2019b), ReMixMatch (Berthelot et al. 2019a), and FixMatch (Sohn et al. 2020). The results of these methods reported in this section are reproduced by (Berthelot et al. 2019b, a) using the same backbone and training protocol (except for Tri-net and FixMatch, whose results are taken from their papers). The main criterion is the error rate. Variance over 5 runs with different seeds is also reported to ensure the results are statistically significant. We report the final performance of the EMA model. SGD with Nesterov momentum (Sutskever et al. 2013) is used, along with weight decay and cosine learning rate decay (Loshchilov and Hutter 2016). The details are in the supplementary material.
CIFAR-10, CIFAR-100
We first compare Multi-Head Co-Training to the state-of-the-art methods on CIFAR (Krizhevsky, Hinton et al. 2009), one of the most commonly used image recognition datasets. We randomly choose 250-4000 of the 50000 training images of CIFAR-10 as labeled examples; the labels of the other images are thrown away. The backbone model is WRN 28-2 (extra heads are added). As shown in Table 1, Multi-Head Co-Training performs consistently better than the state-of-the-art methods. For example, it achieves an average error rate of 3.84% on CIFAR-10 with 4000 labeled images, comparing favorably to the state-of-the-art results.
We randomly choose 10000 of the 50000 training images of CIFAR-100 as labeled examples and throw away the other images' label information. In Table 1, we present the results of CIFAR-100 with 10000 labels. As is common practice in recent methods, WRN 28-8 is used to accommodate the more challenging task (more classes, each with fewer examples). We reduce the number of channels of the final block in WRN 28-8 from 512 to 256; by doing so, the model has a much smaller size. As shown in Table 1, Multi-Head Co-Training achieves a 21.68% error rate, an improvement of 1.5% compared to the best results of previous methods, with an even smaller model.
Table 2: Error rates (%) on Mini-ImageNet.
Method | 4000 labels | 10000 labels
Mean Teacher | 72.51±0.22 | 57.55±1.11
Pseudo-Label | 56.49±0.51 | 46.08±0.11
LaplaceNet | 46.32±0.27 | 39.43±0.09
Ours | 46.53±0.15 | 39.74±0.12
SVHN
Similarly, we evaluate the accuracy of our method with a varying number of labels from 250 to 1000 on the SVHN dataset (Netzer et al. 2011) (the extra training set is not used). The image augmentation for SVHN is different because some operations are not suitable for digit images (e.g., horizontal flip for asymmetrical digit images). Its details are in the supplementary material. The results of Multi-Head Co-Training and other methods are shown in Table 1. Multi-Head Co-Training outperforms other methods by a small margin.
Mini-ImageNet
We further evaluate our model on the more complex dataset Mini-ImageNet (Vinyals et al. 2016), which is a subset of ImageNet (Deng et al. 2009). The training set of Mini-ImageNet consists of 50000 images of size 84 × 84 in 100 object classes. We randomly choose 4000 and 10000 images as labeled examples and throw away the other images' label information. The backbone model is ResNet-18 (Wang et al. 2017), early stopped using the ImageNet validation set. Other methods' results are from (Sellars, Aviles-Rivero, and Schönlieb 2021). Our method achieves error rates of 47.88% and 39.74% for 4k and 10k labeled images, respectively. The results are competitive with a recent method, LaplaceNet (Sellars, Aviles-Rivero, and Schönlieb 2021), which uses a graph-based constraint and multiple strong augmentations. Besides, our co-training method, which is simple and efficient, is orthogonal to other SSL constraints.
Table 3: Number of parameters and training time.
Model | WRN 28-2 | WRN 28-8
Original | 1.4 M (30 min) | 23.4 M (136 min) |
Three model | 4.2 M (66 min) | 70.2 M (344 min) |
Ours | 3.7 M (39 min) | 19.9 M (168 min) |
Computational cost analysis
We report the training cost of the original backbone, standard co-training with three models, and our method in Table 3. The reduction in the number of parameters and training time is significant. For example, with the WRN 28-8 backbone, our three-head model actually has fewer parameters than the original single model. Only 23.5% extra training time, compared to self-training, is needed for our method, while standard co-training needs 152.9% extra time.
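For reference, parameter counts such as those in Table 3 can be reproduced with a small helper like the following (a generic utility, not code from the paper):

```python
def count_params_millions(model):
    # Trainable parameter count in millions, comparable to the values in Table 3.
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```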
Ablation study
This section presents an ablation study to measure the contribution of different components of Multi-Head Co-Training. Specifically, we measure the effect of:
1). Multi-Head Co-Training with only one head. Pseudo-labels are generated from the head's own predictions and selected with a confidence threshold of 0.95.
2). Multi-Head Co-Training with one strong augmentation. The strong augmentation is performed only once and the result is forwarded to all three heads.
3). Multi-Head Co-Training without weak augmentation. The original images are used for pseudo-labeling.
4). Multi-Head Co-Training with three heads sharing the same initialization.
5). Multi-Head Co-Training without EMA.
Table 4: Error rates (%) for the ablation study.
Ablation | One head | Ensemble
Multi-Head Co-Training | 4.22 | 3.84 |
1). One head | 4.43 | 4.23 |
2). One strong augmentation | 4.45 | 4.03 |
3). W/o weak augmentation | 4.86 | 4.55 |
4). Same heads’ initialization | 4.28 | 3.86 |
5). W/o EMA | 6.23 | 5.30 |
We first set Multi-Head Co-Training's self-training counterpart, as described in 1), as a baseline. It has the same backbone and hyper-parameters but only one head. This self-training baseline obtains sub-optimal performance. To further verify that the main improvement of our method does not come from ensembling, the self-training model is retrained three times with different initializations to produce an ensemble result (“Ensemble” in row 1)). It can be observed from Table 4 that Multi-Head Co-Training, as a co-training algorithm, first shows its effectiveness by outperforming this baseline.
Promoting diversity between individual models is critical to the success of the co-training framework. Otherwise, they would produce overly similar predictions, and co-training would degenerate into self-training. Other single-view co-training methods create diversity mainly in several ways, including automatic view splitting (Balcan, Blum, and Yang 2005; Chen, Weinberger, and Chen 2011) and using different individual classifiers or classifiers with different structures (Zhou and Goldman 2004; Goldman and Zhou 2000; Xu, He, and Man 2012; Dong-DongChen et al. 2018). Unlike a typical co-training algorithm, Multi-Head Co-Training does not promote diversity explicitly. The diversity between heads inherently comes from the randomness in parameter initialization and augmentation (consequently, the examples selected for each head are different). To study their impact, we remove each source of diversity in 2) and 4) respectively. As shown in Table 4, the accuracy drops more when differently augmented images are missing. Moreover, the individual heads' error rate is then almost the same as that of the self-training baseline 1). This shows the important role strong augmentation plays in Multi-Head Co-Training. As a regularizer, data augmentation is considered to confine learning to a subset of the hypothesis space (Zhang et al. 2016). We believe multiple strong augmentations confine the classification heads of Multi-Head Co-Training to different subsets of the hypothesis space and thus keep them uncorrelated.
According to the observation in 3), replacing the weakly augmented images with the original ones leads to worse final performance. Note that pseudo-labels obtained from the original images are more accurate. This means that pseudo-labels from weakly augmented images, even with lower accuracy, lead to a better model in later training. An interesting fact implied by this phenomenon is that accuracy is not the only important property of pseudo-labels.
The number and structure of heads
One main difference between Multi-Head Co-Training and other SSL methods is the multi-head structure, which brings several benefits. Firstly, it naturally produces multiple uncorrelated predictions for each example, which regularizes the feature representation learned by the shared module. Secondly, pseudo-labels coming from the ensemble of predictions on weakly augmented examples are more reliable. Thirdly, the number of parameters is much smaller because the base learners share a module. Based on WRN (Zagoruyko and Komodakis 2016), we empirically find a structure that is both accurate and efficient. Specifically, we experiment with different numbers of heads and different head structures in this section. Considering that it is impractical to search all combinations, the WRN 28-2 backbone on CIFAR-10 with 4000 labels is studied. The head structure is fixed when we search for the optimal number of heads; similarly, the number of heads is fixed when we search for the optimal head structure.
We first test Multi-Head Co-Training with 1-7 heads while fixing the structure of each head as the last block of WRN 28-2. For consistency, when the number of heads is 1 or 2, i.e., when the pseudo-labels come from only one head's predictions, a confidence threshold of 0.95 is used for selection. As shown in Figure 2a, better performance can be obtained with more heads, but the accuracy gain slows down. Structures with even more heads are not considered because the gain becomes insignificant as the number of heads increases.
In terms of the head structure, the most important thing is finding the balance between the size of the shared module and the size of the heads. As shown in Figure 2b, the best performance is observed when each head consists of one block and one fully connected layer and the shared module consists of one convolutional layer and two blocks. Either increasing or decreasing the size of the heads damages performance. Our explanation is that if there are too many shared parameters, there is little room for the heads to make diverse predictions; if there are too many independent parameters, the heads easily fit the pseudo-labels and, again, fail to make diverse predictions.
Considering our main purpose of developing an effective co-training algorithm, we set the number of heads as three and the head as one block in our experiments.
Calibration of SSL
Confirmation bias comes up frequently in the SSL literature. However, it is hard to formulate or observe the problem. We notice that most self-training or co-training methods select pseudo-labels by some criterion, such as a confidence threshold. In these cases, confirmation bias is closely related to the over-confidence of the network: wrong predictions with high confidence are likely to be selected and then used as pseudo-labels. Thus, we link confirmation bias to model calibration, i.e., the problem of whether the predicted probability represents the true correctness likelihood. We envision that calibration measurements can be used to evaluate confirmation bias and help the design of self-training and co-training algorithms. Apart from this, the challenges of SSL and calibration can appear simultaneously in the real world. For example, in medical diagnosis, one of the applications of SSL, control should be passed to human experts when the confidence of the automatic diagnosis is low. In such scenarios, a well-calibrated SSL model is needed. To the best of our knowledge, these two problems have hitherto been studied independently.
According to our observation, SSL models have poor performance in terms of calibration due to entropy minimization and other similar constraints. We analyze FixMatch (implemented using the same code-base), as one of the typical SSL methods, and Multi-Head Co-Training on CIFAR-100 with 10000 labels from the perspective of model calibration. Several common calibration indicators are used, namely Expected Calibration Error (ECE), confidence histogram, and reliability diagram (illustrated in the supplementary material). As shown in Figure 3a, FixMatch has an average confidence of 78.75% but only 74.93% accuracy, producing over-confident results with an ECE value of 15.61. In Figure 3b, we show the results of Multi-Head Co-Training. Although our method produces over-confident predictions, the ECE value is smaller, indicating better probability estimation.
To further investigate, we apply a simple calibration technique called “temperature scaling” (Guo et al. 2017) (see the supplementary material). From the calibrated results in Figure 3c and Figure 3d, it can be observed that the miscalibration is remedied. The ECE value of calibrated FixMatch improves to 5.27%, and Multi-Head Co-Training's reliability diagram closely recovers the desired diagonal function with a low ECE value of 2.82. It can be concluded that Multi-Head Co-Training produces good probability estimates naturally and can be better calibrated with simple techniques. Therefore, we suggest that confirmation bias is better addressed in our method from the perspective of calibration.
Conclusion
The field of SSL encompasses a broad spectrum of algorithms. However, the co-training framework has received little attention recently because of the diversity requirement and the extra computational cost. Multi-Head Co-Training addresses these issues directly. It achieves single-view co-training by integrating the individual models into one multi-head structure and utilizing data augmentation techniques. As a result, the proposed method 1). adds only a minimal number of extra parameters and hyper-parameters and 2). does not need extra effort to promote diversity. We present systematic experiments to show that Multi-Head Co-Training is a successful co-training method and outperforms state-of-the-art methods. The solid empirical results suggest that it is possible to scale co-training to more realistic SSL settings. In future work, we are interested in combining modality-agnostic data augmentation to make Multi-Head Co-Training ready to be applied to other tasks.
Acknowledgements
This paper is supported by the National Key Research and Development Program of China (Grant No. 2018YFB1403400), the National Natural Science Foundation of China (Grant No. 61876080), the Key Research and Development Program of Jiangsu (Grant No. BE2019105), and the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University.
References
- Abney (2002) Abney, S. 2002. Bootstrapping. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 360–367.
- Balcan, Blum, and Yang (2005) Balcan, M.-F.; Blum, A.; and Yang, K. 2005. Co-training and expansion: Towards bridging theory and practice. Advances in neural information processing systems, 17: 89–96.
- Berthelot et al. (2019a) Berthelot, D.; Carlini, N.; Cubuk, E. D.; Kurakin, A.; Sohn, K.; Zhang, H.; and Raffel, C. 2019a. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785.
- Berthelot et al. (2019b) Berthelot, D.; Carlini, N.; Goodfellow, I.; Papernot, N.; Oliver, A.; and Raffel, C. 2019b. Mixmatch: A holistic approach to semi-supervised learning. arXiv preprint arXiv:1905.02249.
- Blum and Mitchell (1998) Blum, A.; and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, 92–100.
- Chapelle, Chi, and Zien (2006) Chapelle, O.; Chi, M.; and Zien, A. 2006. A continuation method for semi-supervised SVMs. In Proceedings of the 23rd international conference on Machine learning, 185–192.
- Chapelle, Schlkopf, and Zien (2010) Chapelle, O.; Schlkopf, B.; and Zien, A. 2010. Semi-Supervised Learning. The MIT Press, 1st edition. ISBN 0262514125.
- Chen, Weinberger, and Chen (2011) Chen, M.; Weinberger, K. Q.; and Chen, Y. 2011. Automatic feature decomposition for single view co-training. In ICML.
- Chen et al. (2020) Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607. PMLR.
- Cubuk et al. (2019) Cubuk, E. D.; Zoph, B.; Mane, D.; Vasudevan, V.; and Le, Q. V. 2019. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 113–123.
- Cubuk et al. (2020) Cubuk, E. D.; Zoph, B.; Shlens, J.; and Le, Q. V. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 702–703.
- Dasgupta, Littman, and McAllester (2002) Dasgupta, S.; Littman, M. L.; and McAllester, D. 2002. PAC generalization bounds for co-training. Advances in neural information processing systems, 1: 375–382.
- Deng et al. (2009) Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255. Ieee.
- DeVries and Taylor (2017) DeVries, T.; and Taylor, G. W. 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
- Dong-DongChen et al. (2018) Dong-DongChen; WeiWang; WeiGao; and Zhou, Z.-H. 2018. Tri-net for semi-supervised deep learning. In International Joint Conferences on Artificial Intelligence.
- Du, Ling, and Zhou (2010) Du, J.; Ling, C. X.; and Zhou, Z.-H. 2010. When does cotraining work in real data? IEEE Transactions on Knowledge and Data Engineering, 23(5): 788–799.
- Goldman and Zhou (2000) Goldman, S.; and Zhou, Y. 2000. Enhancing supervised learning with unlabeled data. In ICML, 327–334. Citeseer.
- Guo et al. (2017) Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On calibration of modern neural networks. In International Conference on Machine Learning, 1321–1330. PMLR.
- He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
- Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images.
- Krizhevsky, Sutskever, and Hinton (2012) Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25: 1097–1105.
- Laine and Aila (2016) Laine, S.; and Aila, T. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
- Lee (2013) Lee, D.-H. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, volume 3.
- Li, Xiong, and Hoi (2020) Li, J.; Xiong, C.; and Hoi, S. 2020. CoMatch: Semi-supervised Learning with Contrastive Graph Regularization. arXiv preprint arXiv:2011.11183.
- Loshchilov and Hutter (2016) Loshchilov, I.; and Hutter, F. 2016. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.
- Mahajan et al. (2018) Mahajan, D.; Girshick, R.; Ramanathan, V.; He, K.; Paluri, M.; Li, Y.; Bharambe, A.; and Van Der Maaten, L. 2018. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), 181–196.
- McLachlan (1975) McLachlan, G. J. 1975. Iterative reclassification procedure for constructing an asymptotically optimal rule of allocation in discriminant analysis. Journal of the American Statistical Association, 70(350): 365–369.
- Miyato et al. (2018) Miyato, T.; Maeda, S.-i.; Koyama, M.; and Ishii, S. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence, 41(8): 1979–1993.
- Netzer et al. (2011) Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; and Ng, A. Y. 2011. Reading digits in natural images with unsupervised feature learning.
- Prakash and Nithya (2014) Prakash, V. J.; and Nithya, D. L. 2014. A survey on semi-supervised learning techniques. arXiv preprint arXiv:1402.4645.
- Qiao et al. (2018) Qiao, S.; Shen, W.; Zhang, Z.; Wang, B.; and Yuille, A. 2018. Deep co-training for semi-supervised image recognition. In Proceedings of the european conference on computer vision (eccv), 135–152.
- Sellars, Aviles-Rivero, and Schönlieb (2021) Sellars, P.; Aviles-Rivero, A. I.; and Schönlieb, C.-B. 2021. LaplaceNet: A Hybrid Energy-Neural Model for Deep Semi-Supervised Classification. arXiv preprint arXiv:2106.04527.
- Sohn et al. (2020) Sohn, K.; Berthelot, D.; Li, C.-L.; Zhang, Z.; Carlini, N.; Cubuk, E. D.; Kurakin, A.; Zhang, H.; and Raffel, C. 2020. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685.
- Subramanya and Talukdar (2014) Subramanya, A.; and Talukdar, P. P. 2014. Graph-based semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 8(4): 1–125.
- Sutskever et al. (2013) Sutskever, I.; Martens, J.; Dahl, G.; and Hinton, G. 2013. On the importance of initialization and momentum in deep learning. In International conference on machine learning, 1139–1147. PMLR.
- Tarvainen and Valpola (2017) Tarvainen, A.; and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. arXiv preprint arXiv:1703.01780.
- Van Engelen and Hoos (2020) Van Engelen, J. E.; and Hoos, H. H. 2020. A survey on semi-supervised learning. Machine Learning, 109(2): 373–440.
- Verma et al. (2019) Verma, V.; Kawaguchi, K.; Lamb, A.; Kannala, J.; Bengio, Y.; and Lopez-Paz, D. 2019. Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825.
- Vinyals et al. (2016) Vinyals, O.; Blundell, C.; Lillicrap, T.; Wierstra, D.; et al. 2016. Matching networks for one shot learning. Advances in neural information processing systems, 29: 3630–3638.
- Wang et al. (2017) Wang, F.; Jiang, M.; Qian, C.; Yang, S.; Li, C.; Zhang, H.; Wang, X.; and Tang, X. 2017. Residual attention network for image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3156–3164.
- Wang and Zhou (2007) Wang, W.; and Zhou, Z.-H. 2007. Analyzing co-training style algorithms. In European conference on machine learning, 454–465. Springer.
- Wang and Zhou (2017) Wang, W.; and Zhou, Z.-H. 2017. Theoretical foundation of co-training and disagreement-based algorithms. arXiv preprint arXiv:1708.04403.
- Xie et al. (2019) Xie, Q.; Dai, Z.; Hovy, E.; Luong, M.-T.; and Le, Q. V. 2019. Unsupervised data augmentation for consistency training. arXiv preprint arXiv:1904.12848.
- Xu, He, and Man (2012) Xu, J.; He, H.; and Man, H. 2012. DCPE co-training for classification. Neurocomputing, 86: 75–85.
- Zagoruyko and Komodakis (2016) Zagoruyko, S.; and Komodakis, N. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.
- Zhang et al. (2016) Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; and Vinyals, O. 2016. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
- Zhang et al. (2017) Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2017. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412.
- Zhou and Goldman (2004) Zhou, Y.; and Goldman, S. 2004. Democratic co-learning. In 16th IEEE International Conference on Tools with Artificial Intelligence, 594–602. IEEE.
- Zhou (2018) Zhou, Z.-H. 2018. A brief introduction to weakly supervised learning. National science review, 5(1): 44–53.
- Zhou and Li (2010) Zhou, Z.-H.; and Li, M. 2010. Semi-supervised learning by disagreement. Knowledge and Information Systems, 24(3): 415–439.
- Zhu (2005) Zhu, X. J. 2005. Semi-supervised learning literature survey.
- Zoph et al. (2020) Zoph, B.; Ghiasi, G.; Lin, T.-Y.; Cui, Y.; Liu, H.; Cubuk, E. D.; and Le, Q. V. 2020. Rethinking pre-training and self-training. arXiv preprint arXiv:2006.06882.
Appendix A Notation and definitions
Notation | Definition
$\mathrm{H}(\cdot,\cdot)$ | Cross-entropy between two distributions.
$\mathcal{X}$, $\mathcal{U}$ | A batch of examples. $\mathcal{X}$ and $\mathcal{U}$ are a batch of labeled examples and unlabeled examples respectively.
$p^m$ | Modeled distribution. $p^m$ corresponds to the $m$-th head.
$B$, $\mu B$ | Batch size. $B$ and $\mu B$ are the number of labeled and unlabeled examples respectively.
$x$ | Labeled example.
$y$ | Label.
$u$ | Unlabeled example.
$\hat{y}$, $\tilde{y}$ | Predicted class. $\hat{y}_b^m$ is the predicted class of the $m$-th head on the $b$-th example. $\tilde{y}_b^m$ is the pseudo-label for the $m$-th head on the $b$-th example.
$\alpha(\cdot)$, $\mathcal{A}(\cdot)$ | Augmentation function. $\alpha(\cdot)$ and $\mathcal{A}(\cdot)$ are the weak and strong augmentation functions respectively.
$f$ | Shared module in Multi-Head Co-Training.
$g^m$ | Classification head in Multi-Head Co-Training. $g^m$ is the $m$-th head.
$[\![\cdot]\!]$ | The Iverson brackets, defined to be 1 if the statement inside is true, otherwise 0.
$\omega$ | Value of the indicator function. $\omega_b^m$ represents the value of the indicator function corresponding to the $b$-th example and the $m$-th head.
$\mathcal{L}$ | Training loss. $\mathcal{L}_s$ and $\mathcal{L}_u$ are the training losses for labeled and unlabeled examples respectively. $\mathcal{L}_b^m$ is the loss corresponding to the $b$-th example and the $m$-th head.
$\arg\max$ | Choose the most confident predicted class.
Appendix B Experimental setup
We implement our experiments in PyTorch 1.6 (https://github.com/pytorch/pytorch).
Details of transformations
The weak image transformation function $\alpha(\cdot)$ applies random horizontal flip and random crop. The one used for SVHN is slightly different from the other datasets, as shown in Table 5. For CIFAR-10 and CIFAR-100, the operations include random horizontal flip followed by random crop. Operations such as horizontal flip could produce incorrect variants of asymmetrical digit images, so only random crop is used for SVHN. Note that one may further remove the crop and translate transformations because they may turn a number 8 into a number 3; for convenience, we do not do this.
Table 5: Operations and ranges of the weak augmentation.
Operation | Range
Horizontal Flip | [0,1] |
Crop | [-0.125,0.125] |
The strong transformation $\mathcal{A}(\cdot)$ is a modified version of RandAugment (Cubuk et al. 2020) followed by Cutout (DeVries and Taylor 2017). The operations of RandAugment are shown in Table 6. The ranges are similar to the original version, so we do not elaborate on their meaning here. Cutout randomly masks a square of pixels (with side length ranging from 0 to 0.5× the image length) to gray.
Table 6: Operations and ranges of RandAugment.
Operation | Range
AutoContrast | [0, 1] |
Brightness | [0.05, 0.95] |
Color | [0.05, 0.95] |
Contrast | [0.05, 0.95] |
Equalize | [0, 1] |
Identity | [0, 1] |
Posterize | [4, 8] |
Rotate | [-30, 30] |
Sharpness | [0.05, 0.95] |
ShearX | [-0.3, 0.3] |
ShearY | [-0.3, 0.3] |
Solarize | [0, 256] |
TranslateX | [-0.3, 0.3] |
TranslateY | [-0.3, 0.3] |
Hyper-parameters
We report the hyper-parameters for Multi-Head Co-Training with three heads. The setting basically follows FixMatch (Sohn et al. 2020), as shown in Table 7. All hyper-parameters stay the same across datasets unless otherwise stated. For parameter updates, we use SGD with Nesterov momentum, weight decay, and cosine learning rate decay (Loshchilov and Hutter 2016). The learning rate decay follows:
$$\eta_k \;=\; \eta\, \cos\!\left(\frac{7\pi k}{16K}\right) \qquad (9)$$
where $k$ is the current training step, $K$ is the total number of training steps, and $\eta$ is the initial learning rate.
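Under the assumption that the schedule follows the FixMatch-style cosine decay written above, Eq. (9) can be implemented with a standard LambdaLR scheduler, e.g.:

```python
import math
from torch.optim.lr_scheduler import LambdaLR

def cosine_decay_scheduler(optimizer, total_steps):
    # lr_k = lr_0 * cos(7 * pi * k / (16 * K)), with k the current step and K the total steps.
    return LambdaLR(
        optimizer,
        lr_lambda=lambda k: math.cos(7.0 * math.pi * k / (16.0 * total_steps)),
    )
```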
For CIFAR-10 and SVHN, WRN 28-2 is used as the backbone. We trained the network on one single NVIDIA V100 for about 90 hours. For CIFAR-100, the widen factor is adjusted to 8. In Multi-Head Co-Training, the number of channels of the final block in WRN 28-8 is changed from 512 to 256. We trained the network on one single NVIDIA V100 for about 200 hours. For Mini-ImageNet, we trained the network on one single NVIDIA V100 for about 30 hours.
Table 7: Hyper-parameters of Multi-Head Co-Training.
Hyper-parameter | CIFAR-10, SVHN | CIFAR-100 | Mini-ImageNet
Batch size of labeled data $B$ | 64 | 64 | 64
Batch size of unlabeled data $\mu B$ | 448 | 448 | 192
Iterations | 300000 | 300000 | 300000
Weight of unsupervised loss $\lambda_u$ | 1 | 1 | 1
Learning rate | 0.3 | 0.3 | 0.4
Momentum | 0.9 | 0.9 | 0.95
Weight decay | 0.0005 | 0.001 | 0.0002
Appendix C Calibration
Reliability diagrams, confidence histograms, and the Expected Calibration Error (ECE) are used in the section Calibration of SSL. Predictions on the test set are divided into bins based on the confidence score (i.e., the maximum softmax probability). The reliability diagram shows the average accuracy of the examples in each bin, so the gap between the actual average accuracy and the expected accuracy reflects whether the model is calibrated properly. The confidence histogram shows how many examples fall into each bin. ECE is the average of the absolute differences between accuracy and confidence,
$$\mathrm{ECE} \;=\; \sum_{i=1}^{S} \frac{|\mathcal{B}_i|}{|\mathcal{D}|}\, \big|\, \mathrm{acc}(\mathcal{B}_i) - \mathrm{conf}(\mathcal{B}_i) \,\big| \qquad (10)$$
where $\mathcal{D}$ is the entire dataset, $\mathcal{B}_i$ refers to the examples in the $i$-th bin, and $S$ is the number of bins. For convenience, $\mathrm{acc}(\mathcal{B}_i)$ refers to the average accuracy and $\mathrm{conf}(\mathcal{B}_i)$ to the average confidence of the examples in the $i$-th bin. As one of the calibration methods, temperature scaling (Guo et al. 2017) introduces a scale parameter $T$ in the softmax calculation,
$$\hat{p}_c \;=\; \frac{\exp(z_c / T)}{\sum_{j=1}^{C} \exp(z_j / T)} \qquad (11)$$
where $z_c$ is the logit for class $c$, $C$ represents the number of classes, and $T$ is a fixed scalar temperature.
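A compact sketch combining Eq. (10) and Eq. (11) is given below; the number of bins and the temperature value are illustrative choices, not the exact settings of the paper.

```python
import torch
import torch.nn.functional as F

def expected_calibration_error(logits, labels, n_bins=15, temperature=1.0):
    """ECE (Eq. 10) with optional temperature scaling (Eq. 11)."""
    probs = F.softmax(logits / temperature, dim=-1)   # Eq. (11) when temperature != 1
    conf, pred = probs.max(dim=-1)
    correct = pred.eq(labels).float()
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    ece = torch.zeros(())
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # |B_i| / |D| * |acc(B_i) - conf(B_i)|
            ece += in_bin.float().mean() * (correct[in_bin].mean() - conf[in_bin].mean()).abs()
    return ece
```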