How Does Pseudo-Labeling Affect the Generalization Error of the Semi-Supervised Gibbs Algorithm?
Abstract
We provide an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm. The gen-error is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset. Distribution-free upper and lower bounds on the gen-error can also be obtained. Our findings offer the new insight that the generalization performance of SSL with pseudo-labeling is affected not only by the information between the output hypothesis and the input training data but also by the information shared between the labeled and pseudo-labeled data samples. This serves as a guideline for choosing an appropriate pseudo-labeling method from a given family of methods. To deepen our understanding, we further explore two examples—mean estimation and logistic regression. In particular, we analyze how the ratio of the number of unlabeled to labeled data affects the gen-error in both scenarios. As this ratio increases, the gen-error for mean estimation decreases and then saturates at a value larger than the gen-error obtained when all the samples are labeled; this gap can be quantified exactly with our analysis and depends on the cross-covariance between the labeled and pseudo-labeled data samples. For logistic regression, the gen-error and the variance component of the excess risk also decrease as this ratio increases.
1 Introduction
There are several areas, such as natural language processing, computer vision, and finance, where labeled data are scarce but unlabeled data are abundant. In these situations, semi-supervised learning (SSL) techniques enable us to utilize both labeled and unlabeled data.
Self-training algorithms (Ouali et al., 2020) are a subcategory of SSL techniques. These algorithms use the confident predictions of a model trained in a supervised manner to infer labels for the unlabeled data. Entropy minimization and pseudo-labeling are two basic approaches used in self-training-based SSL. In entropy minimization approaches (Grandvalet et al., 2005), the entropy function may be viewed as a regularization term that penalizes uncertainty in the label predictions of the unlabeled data. The manifold assumption (Iscen et al., 2019)—where it is assumed that labeled and unlabeled features are drawn from a common data manifold—or the cluster assumption (Chapelle et al., 2003)—where it is assumed that similar data features have similar labels—are the typical assumptions made when adopting entropy minimization. In contrast, in pseudo-labeling, which is the focus of this work, a model is trained using the labeled data and then used to produce a pseudo-label for each unlabeled data point (Lee et al., 2013). These pseudo-labels are then utilized to build another model, which is trained using both the labeled and pseudo-labeled data in a supervised manner. Studying the generalization error (gen-error) of this procedure is critical to understanding and improving pseudo-labeling performance.
There have been various efforts to characterize the gen-error of SSL algorithms. Rigollet (2007) derives an upper bound on the gen-error of binary classification under the cluster assumption. Niu et al. (2013) provides an upper bound on the gen-error based on the Rademacher complexity for binary classification with squared-loss mutual information regularization. Göpfert et al. (2019) employs the VC-dimension method to characterize the SSL gen-error. In Göpfert et al. (2019) and Zhu (2020), upper bounds for the SSL gen-error using Bayes classifiers are provided. Zhu (2020) also provides an upper bound, based on the conditional mutual information, on the excess risk of an SSL algorithm by assuming an exponentially concave loss function (a function $f$ is called $\eta$-exponentially concave if $e^{\eta f}$ is concave). He et al. (2022) investigates the gen-error of iterative SSL techniques based on pseudo-labels. An information-theoretic gen-error upper bound for self-training algorithms under the covariate-shift assumption is proposed by Aminian et al. (2022a). These upper bounds on the excess risk and gen-error do not entirely capture the impact of SSL, in particular of pseudo-labeling and of the relative number of labeled and unlabeled data, and thus constrain our ability to fully comprehend the performance of SSL.
In this paper, we are interested in characterizing the expected gen-error of pseudo-labeling-based SSL—using an appropriately designed Gibbs algorithm—and studying how it depends on the output hypothesis, the labeled data, and the pseudo-labeled data. Moreover, we intend to understand the effect of the ratio of the number of unlabeled to labeled training examples on the gen-error in different scenarios.
Our main contributions in this paper are as follows:
• We provide an exact characterization of the expected gen-error of a Gibbs algorithm that models pseudo-labeling-based SSL. This characterization can be applied to obtain novel and informative upper and lower bounds.
• The characterization and bounds offer the insight that reducing the shared information between the labeled and pseudo-labeled samples can help to improve the generalization performance of pseudo-labeling-based SSL.
• We analyze the effect of the ratio of the number of unlabeled data to labeled data on the gen-error of the Gibbs algorithm using a mean estimation example.
• Finally, we study the asymptotic behaviour of the Gibbs algorithm and analyze the effect of this ratio on the gen-error and the excess risk, applying our results to logistic regression.
2 Related Works
Semi-supervised learning: The SSL approaches can be partitioned into several main classes, including generative models, low-density separation methods, graph-based methods, self-training, and co-training (Zhu, 2008). Among all these, SSL first appeared as self-training, in which the model is first trained on the labeled data and then annotates the unlabeled data to improve the initial model (Chapelle et al., 2006). SSL gradually gained more attention after the well-known expectation-maximization (EM) algorithm (Moon, 1996) was proposed in 1996. One key problem of interest in SSL is whether the unlabeled data can help to improve the performance of learning algorithms. Many existing works have studied this problem either theoretically (e.g., providing bounds) or empirically (e.g., proposing new algorithms). One classical work by Castelli and Cover (1996) set out to study SSL under a traditional setup with an unknown mixture of known distributions and characterized the error probability via the Fisher information of the labeled and unlabeled data. Szummer and Jaakkola (2002) proposed an algorithm that utilizes the unlabeled data to learn the marginal data distribution in order to aid learning the class-conditional distribution. Amini and Gallinari (2002) empirically showed that semi-supervised logistic regression based on the EM algorithm has higher accuracy than the naive Bayes classifier. Singh et al. (2008) studied the benefit of unlabeled data on the excess risk based on the number of unlabeled data and the margin between classes. Ji et al. (2012) developed a simple algorithm based on the top eigenfunctions of an integral operator derived from both labeled and unlabeled examples, which can improve the regression error bound. Li et al. (2019) showed how the unlabeled data can improve the Rademacher-complexity-based generalization error bound of a multi-class classification problem. In deep learning, Berthelot et al. (2019) introduced an effective algorithm that generates low-entropy labels for unlabeled data and then mixes them up with the labeled data to train the model. Sohn et al. (2020) showed that augmenting the confidently pseudo-labeled images can help to improve the accuracy of their model. For a more comprehensive overview of SSL, one can refer to the report by Seeger (2000) and the book by Chapelle et al. (2006).
Pseudo-labeling: Pseudo-labeling is one of the approaches in self-training (Zhu and Goldberg, 2009). Because pseudo-labeling relies on the quality of the pseudo-labels, the approach might perform poorly when those labels are inaccurate. Seeger (2000) stated that there exists a tradeoff between the robustness of the learning algorithm and the information gained from the pseudo-labels. Rizve et al. (2020) offered an uncertainty-aware pseudo-labeling strategy to circumvent this difficulty. In Wei et al. (2020), a theoretical framework for combining input-consistency regularization with self-training algorithms in deep neural networks is provided. Dupre et al. (2019) empirically showed that iterative pseudo-labeling with a confidence threshold can improve the test accuracy in the early stages of training. Arazo et al. (2020) showed that generating soft labels for the unlabeled data together with mixup augmentation can outperform consistency regularization methods.
Despite the abundance of works on SSL and pseudo-labeling, our work provides a new viewpoint for understanding the effect of the pseudo-labeling method on the generalization error using information-theoretic quantities.
Information-theoretic upper bounds: Russo and Zou (2019) and Xu and Raginsky (2017) proposed to use the mutual information between the input training set and the output hypothesis to upper bound the expected generalization error. This paves a new way of understanding and improving the generalization performance of a learning algorithm from an information-theoretic viewpoint. Tighter upper bounds based on the individual-sample mutual information are proposed by Bu et al. (2020). Asadi et al. (2018) proposed using chaining mutual information, and Hafez-Kolahi et al. (2020), Haghifam et al. (2020) and Steinke and Zakynthinou (2020) advocated conditioning and processing techniques. Information-theoretic generalization error bounds using other information quantities have also been studied, e.g., $f$-divergences, the Rényi divergence and the generalized Jensen–Shannon divergence (Esposito et al., 2021; Aminian et al., 2022b, 2021b).
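For reference, the bound of Xu and Raginsky (2017, Theorem 1) states that if the loss is $\sigma$-sub-Gaussian under the data distribution and the training set $S$ consists of $n$ i.i.d. samples, then the expected gen-error of an algorithm with output hypothesis $W$ satisfies (the symbols here are generic and not tied to this paper's notation):
\[
\big|\overline{\mathrm{gen}}\big| \;\le\; \sqrt{\frac{2\sigma^{2}\, I(W;S)}{n}} .
\]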
3 Semi-Supervised Learning Via The Gibbs Algorithm
In this section, we formulate our problem using the Gibbs algorithm based on both the labeled training data and the unlabeled training data with pseudo-labels. The Gibbs algorithm is a tractable and idealized model for learning algorithms arising from various approaches, e.g., stochastic optimization methods or relative entropy regularization (Raginsky et al., 2017).
3.1 Problem Formulation
The labeled training dataset consists of feature–label pairs, where each pair is drawn independently and identically distributed (i.i.d.) from the true joint data distribution; conditioned on a feature, each label is drawn i.i.d. from the true conditional label distribution. The unlabeled training dataset consists of features drawn i.i.d. from the same marginal feature distribution. Based on the labeled dataset, a pseudo-labeling method assigns a pseudo-label to each unlabeled feature; each pseudo-label is drawn conditionally i.i.d. from the distribution induced by the pseudo-labeling method. Each unlabeled feature together with its pseudo-label forms a pseudo-labeled data point, and these points form the pseudo-labeled dataset. For the labeled and pseudo-labeled datasets, we also work with the corresponding joint distributions of their data samples.
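To make the data-generation pipeline concrete, the following is a minimal Python sketch of how a pseudo-labeled dataset is typically produced from a labeled dataset and an unlabeled pool; the classifier, the helper names and the Gaussian toy distribution are illustrative assumptions rather than the paper's specification.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_data(n, d=2, mu=1.0):
    """Toy two-class Gaussian data: y in {0, 1}, x ~ N(+/- mu * 1, I)."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d)) + mu * (2 * y[:, None] - 1)
    return x, y

# Labeled dataset (n samples) and unlabeled features (m samples).
x_lab, y_lab = make_data(n=50)
x_unl, _ = make_data(n=500)          # true labels discarded

# Step 1: fit an initial model on the labeled data only.
init_model = LogisticRegression().fit(x_lab, y_lab)

# Step 2: assign a pseudo-label to every unlabeled feature.
y_pseudo = init_model.predict(x_unl)

# The pseudo-labeled dataset (x_unl, y_pseudo) is then combined with
# (x_lab, y_lab) to train the final model in a supervised manner.
```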
In semi-supervised learning, one considers an empirical risk based on both the labeled and unlabeled data. In this paper, by fixing a mixing weight, for any loss function, the total empirical risk of an output hypothesis is the weighted sum of the empirical risks over the labeled and pseudo-labeled data (McLachlan, 2005; Chapelle et al., 2006)
(1) |
where and . Note that, by normalizing the weight , minimizing the empirical risk is equivalent to minimizing
(2) |
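As an illustration of the weighted empirical risk in (1)–(2), the sketch below computes one common convention, namely a convex combination, with mixing weight `lam`, of the average loss on the labeled data and the average loss on the pseudo-labeled data; the exact weighting used in the paper may differ, so this is only an assumed form for illustration.

```python
import numpy as np

def weighted_empirical_risk(loss, w, labeled, pseudo_labeled, lam):
    """Convex combination of the two average losses (illustrative convention).

    loss           : callable (w, x, y) -> float
    labeled        : list of (x, y) pairs from the labeled dataset
    pseudo_labeled : list of (x, y_hat) pairs from the pseudo-labeled dataset
    lam            : mixing weight in [0, 1]
    """
    risk_lab = np.mean([loss(w, x, y) for x, y in labeled])
    risk_psl = np.mean([loss(w, x, y) for x, y in pseudo_labeled])
    return (1.0 - lam) * risk_lab + lam * risk_psl
```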
The population risk under the true data distribution is
(3) |
Under the i.i.d. assumption, this definition reduces to the expected loss on a single fresh sample drawn from the true data distribution.
Any SSL algorithm can be characterized by a conditional distribution, i.e., a stochastic map from the labeled dataset and the pseudo-labeled dataset to the output hypothesis. For any training datasets and any SSL algorithm, the expected gen-error is defined as the expected gap between the population risk and the empirical risk on the labeled data, i.e.,
(4) |
which measures the extent to which the algorithm overfits to the labeled training data.
In particular, we consider the Gibbs algorithm (also known as the Gibbs posterior (Catoni, 2007)) to model a pseudo-labeling-based SSL algorithm. Given any , the -Gibbs algorithm (Gibbs, 1902; Jaynes, 1957) is
where is the “inverse temperature”, is the partition function and is the prior of . We provide more motivations for the Gibbs algorithm model in Appendix A.
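In generic notation, writing $\gamma$ for the inverse temperature, $\pi$ for the prior on the hypothesis, $L_E$ for the total empirical risk above and $\Lambda$ for the partition function, the Gibbs posterior takes the standard form below (a sketch of the usual definition; the paper's precise tuple of parameters may differ):
\[
P_{W\mid S,\bar S}(w \mid s, \bar s)
\;=\; \frac{\pi(w)\, e^{-\gamma L_E(w, s, \bar s)}}{\Lambda(s, \bar s)},
\qquad
\Lambda(s, \bar s) \;=\; \int \pi(w)\, e^{-\gamma L_E(w, s, \bar s)} \,\mathrm{d}w .
\]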
Our goal—relying on this characterization of the learning algorithm—is to precisely quantify the gen-error in (4) as a function of various information-theoretic quantities.
3.2 Main Results
One of our main results offers an exact closed-form information-theoretic expression for the gen-error of the -Gibbs algorithm in terms of the symmetrized KL information defined above.
Theorem 1.
Under the -Gibbs algorithm, the expected gen-error is
(5) |
where and .
The proof of Theorem 1 is provided in Appendix B, where we also show that, when the contribution of the pseudo-labeled data is removed, the result reduces to that for supervised learning (SL). In addition, we can even extend Theorem 1 to SSL based on other methods, e.g., entropy minimization (Amini and Gallinari, 2002; Grandvalet et al., 2005). This corollary is provided in Appendix C.
Theorem 1 can also be applied to derive novel bounds on the expected gen-error of the Gibbs algorithm as follows.
Proposition 1.
Assume that the loss function is bounded in . Then, the expected gen-error of the -Gibbs algorithm satisfies
(6) |
where and .
The proof and further discussion are provided in Appendix D. In fact, we can further explore how the gen-error depends on the information that the output hypothesis contains about the labeled and unlabeled datasets by writing (see Appendix E)
(7) |
From Theorem 1 and (3.2), we observe that, given the pseudo-labeled data, as the dependency (or the information shared) between the output hypothesis and the labeled data decreases, the expected gen-error decreases, and the algorithm is less likely to overfit to the training data. This intuition dovetails with the results for supervised learning in Xu and Raginsky (2017), Russo and Zou (2019) and Aminian et al. (2021a). However, the difference here is that the quantities are conditioned on the pseudo-labeled data, which reflects the impact of SSL.
Together with Proposition 1, we can observe that the gen-error also depends on the information shared between the input labeled and pseudo-labeled data. If the mutual information between them decreases, the expected gen-error is likely to decrease as well. This implies that pseudo-labels that are highly dependent on the labeled dataset may not be beneficial in terms of the generalization performance of an algorithm. In fact, in our subsequent example in Section 4, we exactly quantify the shared information using the cross-covariance between the labeled and pseudo-labeled data. This result sheds light on the future design of pseudo-labeling methods.
3.2.1 Special Cases of Theorem 1
It is instructive to elaborate on how Theorem 1 specializes to some well-known learning settings, such as transfer learning and SSL that does not reuse the labeled data.
• Case 1 (pseudo-labeled dataset independent of the labeled dataset): If the pseudo-labels are not generated based on the labeled dataset, e.g., randomly assigned or generated from another domain (similar to transfer learning), the pseudo-labeled dataset is independent of the labeled dataset. According to the basic properties of mutual and lautum information (Palomar and Verdú, 2008) and (3.2), we have
and . That is,
(8)
which corresponds to the result of transfer learning in Bu et al. (2022, Theorem 1).
• Case 2 (Markov chain from the labeled data to the pseudo-labeled data to the output hypothesis): When this Markov chain holds, meaning that the output hypothesis is independent of the labeled data conditioned on the pseudo-labeled data, we have
Thus, the expected gen-error vanishes. However, this is a degenerate case: it only occurs when the labeled dataset is used for generating the pseudo-labels but is not used in the Gibbs algorithm to learn the output hypothesis.
3.2.2 SSL vs. SL with labeled data
It is also instructive to elaborate on how SSL compares to SL with labeled data based on Theorem 1. In particular, assume that the labeled training dataset contains as many samples, drawn i.i.d. from the true data distribution, as the total number of labeled and unlabeled samples in the SSL setup. Then, for any output hypothesis, the population risk is given by (cf. (3)), and the empirical risk of the labeled data samples is given by
From Aminian et al. (2021a, Theorem 1), under the -Gibbs algorithm, the expected gen-error is
(9) |
In the SSL setup, in comparison with (3.2.2), under the -Gibbs algorithm, we let the mixing weight of the empirical risk in (2) be the ratio of the number of unlabeled to labeled data. We similarly define another expected gen-error, which is also evaluated over all of the (labeled and pseudo-labeled) data points, as follows:
Consider the situation where the pseudo-labeling method is perfect, so that each pseudo-label is drawn from the true conditional label distribution. Then any pseudo-labeled sample has the same distribution as any labeled sample. By applying the same technique as in the proof of Theorem 1, we obtain (details provided in Appendix F)
(10) |
However, the equality above holds only under additional conditions on the pseudo-labels; otherwise, the gen-errors of SL in (3.2.2) and of SSL in (10) may not be equal. This is because the pseudo-labels are assumed to be generated based on the labeled dataset, and this dependence affects the gen-error. As we have discussed following Proposition 1, if the information shared between the labeled and pseudo-labeled data is large, the learning algorithm tends to overfit to the labeled training data.
If the pseudo-labels are independent of the labeled dataset, then this condition holds for every sample. Consider the situation where the pseudo-labeling method is close to perfect, i.e., the pseudo-label distribution differs from the true conditional label distribution by a small perturbation. We have
where the extra term is proportional to the size of the perturbation and vanishes as the perturbation tends to zero (details are provided in Appendix F.2). Therefore, in this special case, we see that the gap between the gen-error of SSL and that of SL can be directly quantified by the gap between the labeled and pseudo-labeled distributions.
3.3 Distribution-Free Upper and Lower Bounds
We can also build upon Theorem 1 and Xu and Raginsky (2017, Theorem 1) to provide distribution-free upper and lower bounds on the gen-error in (5). We next state an informal version of such bounds (the formal version and the proof are provided in Appendix G). Throughout this section, we let the mixing weight equal the ratio of the number of unlabeled to labeled data.
Proposition 2 (Informal Version).
Assume , for some that depends on and . Assume that is bounded in . Let . Then the expected gen-error can be bounded as follows
(11) |
Recall that, for SL with labeled data, the expected gen-error decays inversely with the number of labeled samples (Aminian et al., 2021a, Theorem 2). This can be naturally extended to SL in which all of the samples are labeled, with the expected gen-error then decaying inversely with the total number of samples. Compared to the SL case, we see that using the pseudo-labeled data may degrade the convergence rate of the expected gen-error, and the extent to which it degrades depends on a constant that captures the impact of the amount of information shared among the output hypothesis, the input labeled and unlabeled data, as well as the ratio of unlabeled to labeled data. If this constant is small, the resulting upper bound is tight.
If we fix and assume is independent of , then and
which coincides with the result of transfer learning in Bu et al. (2022, Remark 3) and corresponds to the analysis of (8). Although it appears that using independent pseudo-labeled data does not degrade the convergence rate of the expected gen-error, it may affect the excess risk or the test accuracy, which, taken in unison with the gen-error, represent the performance of a learning algorithm. The definition and further analysis of the excess risk are provided in Section 5.1.1.
When the ratio of unlabeled to labeled data tends to zero, the gen-error converges to that of SL with only labeled data, and the distribution-free upper bound is of the order given in Aminian et al. (2021a, Theorem 2).
4 An Application to Mean Estimation
To deepen our understanding of the expected gen-error in Theorem 1, we study a mean estimation example and analyze how the ratio of the number of unlabeled to labeled data affects the gen-error.
4.1 Problem Setup
For any and , we assume that , , , and . Any is drawn i.i.d. from the same distribution as . Consider the problem of learning using and . We adopt the mean-squared loss and assume the prior is uniform on . Based on learned from the labeled dataset (e.g., ), we assign a pseudo-label to each as (e.g. ) and let . Let us construct and . The empirical risk is given by (cf. (2) by replacing with )
The expected gen-error is equal to the right-hand side of (5) by replacing with .
The -Gibbs algorithm is given by the following Gibbs posterior distribution
where and
Details are provided in Appendix H. With this posterior Gaussian distribution , the expected gen-error in Theorem 1 can be exactly computed as follows
(12) |
where and run over the labeled and unlabeled datasets respectively. From the definition of the expected gen-error in (4), we obtain the same result, which corroborates the characterization of the expected gen-error in Theorem 1. See Appendix H for details.
Note that the second term in (12) is the trace of the cross-covariance between the labeled and pseudo-labeled data sample, i.e., for any and ,
This result shows that for any fixed , the expected gen-error decreases when the trace of the cross-covariance decreases, which corroborates our analyses in Section 3.2.
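To illustrate how the cross-covariance term in (12) can be probed numerically, the sketch below estimates the trace of the cross-covariance between a labeled sample and a pseudo-labeled sample by Monte Carlo; the Gaussian data model and the sign-based pseudo-labeling rule are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)

def cross_cov_trace(n=20, m=200, d=5, mu=0.5, sigma=1.0, trials=20000):
    """Monte Carlo estimate of the trace of Cov(first labeled sample, first pseudo-labeled sample)."""
    z1 = np.zeros((trials, d))
    zbar1 = np.zeros((trials, d))
    for t in range(trials):
        # Labeled data: signed Gaussian features, so that y_i * x_i has mean mu * 1.
        y = 2 * rng.integers(0, 2, size=n) - 1
        x = mu * y[:, None] + sigma * rng.normal(size=(n, d))
        z = y[:, None] * x
        theta0 = z.mean(axis=0)                      # initial hypothesis from the labeled data
        # Unlabeled features and sign-based pseudo-labels.
        yu = 2 * rng.integers(0, 2, size=m) - 1
        xu = mu * yu[:, None] + sigma * rng.normal(size=(m, d))
        yhat = np.sign(xu @ theta0)
        zbar = yhat[:, None] * xu
        z1[t], zbar1[t] = z[0], zbar[0]
    cross = np.cov(np.hstack([z1, zbar1]).T)[:d, d:]  # d x d cross-covariance block
    return np.trace(cross)

print("estimated trace of the cross-covariance:", cross_cov_trace())
```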
4.2 Effect of the Pseudo-labeling and the Ratio of Unlabeled to Labeled Samples
To further study the effect of the pseudo-labeling method and the ratio of unlabeled to labeled samples, in this mean estimation example, we consider the class-conditional feature distribution of for any . Let be a dataset with independent copies of . Similarly to (12), for the supervised -Gibbs algorithm and the supervised -Gibbs algorithm, the expected gen-errors are respectively given by (see details in Appendix H.1)
Let denote for brevity. By comparing and , we observe that:
1. If , then , which means that SSL with such a pseudo-labeling method even has better generalization performance than SL with labeled data. This is the most desirable case in terms of the generalization error;
2. If , then . This implies that if the cross-covariance between the labeled and pseudo-labeled data is larger than a certain threshold, the pseudo-labeling method does not help to improve the generalization performance;
3. If , then . This implies that if the cross-covariance is sufficiently small, pseudo-labeling improves the generalization error.
The value of this cross-covariance also depends on the ratio of unlabeled to labeled samples. Next, to study the effect of this ratio, let the initial hypothesis be learned from the labeled data and let the pseudo-label for each unlabeled feature be generated from this initial hypothesis.
Inspired by the derivation techniques in He et al. (2022), we can rewrite the expected gen-error in (12) as
(13) |
where , , , and are functions with domain , and is a sequence of correlation coefficients. Details are presented in Appendix H.2, where we also prove that and , which means that we always have here. Furthermore, we observe that does not depend on and thus, (13) converges to when we fix , and let .
In Figure 1, we numerically plot the gen-error of SSL by varying the ratio of unlabeled to labeled samples and compare it with the gen-errors of SL, for different values of the noise level and the number of labeled data samples. Under the different choices of these parameters, we observe that the gen-error of SSL lies between that of SL with only the labeled data and that of SL when all the samples are labeled. Moreover, the gen-error of SSL monotonically decreases as the ratio increases, showing an obvious improvement compared to SL with only the labeled data. However, there remains a gap to the gen-error of SL when all the samples are labeled. For a large enough number of labeled samples or a small enough noise level, this gap is almost negligible, which means that pseudo-labeling yields comparable generalization performance to SL when all the samples are labeled.
[Figure 1: The gen-error of SSL as a function of the ratio of unlabeled to labeled samples, compared with the gen-errors of SL, for different noise levels and numbers of labeled samples.]
4.3 Choosing a Pseudo-labeling Method from a Given Family of Methods
We have shown that the generalization error decreases as the information shared between the input labeled and pseudo-labeled data (represented by the mutual information or, in the mean estimation example, by the cross-covariance term) decreases. This result serves as a guideline for choosing an appropriate pseudo-labeling method. For example, if we are given a family of pseudo-labeling functions, we can choose the one that yields the minimum mutual information between the labeled and pseudo-labeled data. To make our ideas more concrete, for the mean estimation example that we have been discussing thus far, assume that the pseudo-labeling functions are indexed by various "confidence" thresholds. For instance, let us consider the setting of Figure 1(b). As shown in Figure 2, one can choose the threshold that minimizes the shared information. From this figure, we observe that a particular threshold approximately minimizes the shared information and hence the generalization error. We have thus exhibited a concrete way to choose one pseudo-labeling method from a given family of methods via our characterization of the generalization error.
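The selection rule described above can be emulated numerically: sweep the candidate thresholds, estimate a shared-information proxy (here, the trace of the empirical cross-covariance between labeled and pseudo-labeled samples) for each, and keep the minimizer. The thresholding rule, data model and proxy below are assumptions for illustration, reusing the toy setup from the previous sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

def pseudo_label(xu, theta0, tau):
    """Keep only unlabeled points whose score exceeds the confidence threshold tau."""
    score = xu @ theta0
    keep = np.abs(score) >= tau
    return xu[keep], np.sign(score[keep])

def shared_info_proxy(tau, n=20, m=200, d=5, mu=0.5, trials=5000):
    """Trace of the empirical cross-covariance between one labeled and one pseudo-labeled sample."""
    z1, zb1 = [], []
    for _ in range(trials):
        y = 2 * rng.integers(0, 2, size=n) - 1
        x = mu * y[:, None] + rng.normal(size=(n, d))
        theta0 = (y[:, None] * x).mean(axis=0)
        xu = mu * (2 * rng.integers(0, 2, size=m) - 1)[:, None] + rng.normal(size=(m, d))
        xk, yk = pseudo_label(xu, theta0, tau)
        if len(yk) == 0:
            continue
        z1.append(y[0] * x[0])
        zb1.append(yk[0] * xk[0])
    z1, zb1 = np.array(z1), np.array(zb1)
    return np.trace(np.cov(np.hstack([z1, zb1]).T)[:d, d:])

taus = np.linspace(0.0, 2.0, 9)
best = min(taus, key=shared_info_proxy)
print("threshold minimizing the shared-information proxy:", best)
```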
[Figure 2: The shared-information measure between the labeled and pseudo-labeled data as a function of the confidence threshold used for pseudo-labeling.]
5 Particularizing The Gibbs Algorithm To Empirical Risk Minimization
In this section, we consider the asymptotic behavior of the expected gen-error and excess risk for the Gibbs algorithm under SSL as the “inverse temperature” tends to infinity. It is known that the Gibbs algorithm converges to empirical risk minimization (ERM) in this limit; see Appendix A.
Given any , assume that there exists a unique minimizer of the empirical risk (also known as the single-well case), i.e.,
Let the Hessian matrix of the empirical risk at be defined as
Under the single-well case, we obtain a characterization of the asymptotic expected gen-error in the following theorem.
Theorem 2.
Under the -Gibbs algorithm, as the inverse temperature tends to infinity, if the Hessian matrix is not singular and the minimizer of the empirical risk is unique, the expected gen-error converges to
(14) |
where the expectations and .
The proof of Theorem 2 is provided in Appendix I. Theorem 2 shows that the gen-error of the Gibbs algorithm under SSL in this limit depends strongly on the second derivatives of the empirical risk (a.k.a. the loss landscape) and on the relationship between the labeled and pseudo-labeled data. The loss landscape is an important tool for understanding the dynamics of the learning process in deep learning. In the extreme case where the labeled and pseudo-labeled data are independent, a simplified expression for the asymptotic gen-error is provided in Appendix J.
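To see numerically why the large-inverse-temperature limit behaves like ERM with a Gaussian (Laplace-type) correction around the empirical-risk minimizer, the following one-dimensional sketch tracks the mean and variance of the Gibbs posterior as the inverse temperature grows; the quadratic-plus-quartic risk is an arbitrary illustrative choice, not the paper's example.

```python
import numpy as np

def gibbs_moments(emp_risk, gamma, grid):
    """Mean and variance of the Gibbs posterior p(w) proportional to exp(-gamma * emp_risk(w)) on a 1-D grid."""
    logp = -gamma * emp_risk(grid)
    p = np.exp(logp - logp.max())
    p /= np.trapz(p, grid)
    mean = np.trapz(grid * p, grid)
    var = np.trapz((grid - mean) ** 2 * p, grid)
    return mean, var

emp_risk = lambda w: (w - 0.3) ** 2 + 0.1 * w ** 4   # a single-well empirical risk (illustrative)
grid = np.linspace(-3.0, 3.0, 20001)
w_erm = grid[np.argmin(emp_risk(grid))]              # empirical risk minimizer on the grid

for gamma in [1, 10, 100, 1000]:
    mean, var = gibbs_moments(emp_risk, gamma, grid)
    print(f"gamma={gamma:5d}  posterior mean={mean:+.4f}  ERM={w_erm:+.4f}  gamma*variance={gamma * var:.4f}")
# As gamma grows, the posterior mean approaches the ERM solution and gamma * variance
# approaches the reciprocal of the second derivative of the empirical risk at the
# minimizer, i.e., the Gaussian (Laplace) approximation underlying Theorem 2.
```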
5.1 Semi-Supervised Maximum Likelihood Estimation
In particular, by letting the loss function be the negative log-likelihood, this algorithm becomes semi-supervised maximum likelihood estimation (SS-MLE). In this part, we consider SS-MLE in the asymptotic regime. Throughout this section, we let the mixing weight in (2) equal the ratio of the number of unlabeled to labeled data.
We aim to fit the training data with a parametric family , where . Consider the negative log-loss function . For the single-well case, the unique minimizer is
(15) |
Given any labeled dataset , we use to denote the conditional distribution of the pseudo-labeled data (i.e., ), where is the initial hypothesis learned only from and used to generate pseudo-labels for . Assume that is learned from using MLE, i.e.,
The initial hypothesis converges, as the number of labeled samples grows, to a limit that depends only on the true distribution of the labeled data. Then the pseudo-labeled samples become (asymptotically) independent of one another and of the labeled dataset. We analogously define the minimizer
(16) |
and the landscapes of the loss function
Let . Given any ratio , as , by the law of large numbers, the Hessian matrix converges as follows
which is independent of . Leveraging these asymptotic approximations, from Theorem 2, we obtain the following characterization of the gen-error of SS-MLE.
Corollary 1.
In the asymptotic regime where , the expected gen-error of SS-MLE is given by
The proof of Corollary 1 is provided in Appendix J. For the extreme case in which the ratio of unlabeled to labeled data tends to zero, the gen-error degenerates to that of the SL case with labeled data.
For the other case where , from Appendix J, we have and
For a large ratio of unlabeled to labeled data, the gen-error is order-wise the same as that of SL with labeled data. The intuition is that, for a large ratio, the pseudo-labeled samples only depend on the labeled data distribution instead of the labeled samples themselves. However, the performance of an algorithm depends not only on the gen-error but also on the excess risk. Even when the gen-error is small, the bias of the excess risk may be high.
5.1.1 Excess Risk as
In this section, we discuss the excess risk of SS-MLE in the same asymptotic regime. The excess risk is defined as the gap between the expected population risk of the learned hypothesis and the minimum of the population risk, i.e.,
(17)–(18)
where the minimizer of the population risk is the optimal hypothesis. The second term in (18) is known as the estimation error. We observe that the excess risk depends on both the gen-error and the estimation error. Even when the gen-error is controlled to be sufficiently small, if the estimation error is large, the excess risk can still be large.
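In generic notation, writing $L_P$ for the population risk, $L_E$ for the empirical risk on the labeled data, $W$ for the output hypothesis and $w^*$ for the optimal hypothesis, one standard way to express this decomposition is the following sketch (the paper's own display in (17)–(18) may group the terms differently):
\[
\underbrace{\mathbb{E}\big[L_P(W)\big] - L_P(w^*)}_{\text{excess risk}}
\;=\; \underbrace{\mathbb{E}\big[L_P(W) - L_E(W)\big]}_{\text{expected gen-error}}
\;+\; \underbrace{\mathbb{E}\big[L_E(W)\big] - L_P(w^*)}_{\text{estimation error}} .
\]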
Corollary 2.
In the asymptotic regime where , the excess risk of SS-MLE is given by
(19) |
The proof of Corollary 2 is provided in Appendix K. In (19), the first term represents the bias caused by learning with the mixture of labeled and pseudo-labeled data; as the ratio of unlabeled to labeled data increases, this bias increases. The second term represents the variance component of the excess risk, which is of the same order as the gen-error in Corollary 1.
5.2 An Application to Logistic Regression
To re-emphasize, we let the mixing weight in (2) equal the ratio of the number of unlabeled to labeled data. We now apply SS-MLE to logistic regression to study the effect of this ratio on the gen-error and the excess risk. For any hypothesis , let be the conditional likelihood of a label upon seeing a feature sample under . Assume that the label . For any , the underlying distribution is and the logistic regression model uses to approximate . Let . To stabilize the solution , we consider a regularized version of the logistic regression model, where for a fixed , the objective function can be expressed as
We assume that there exists a unique minimizer of the empirical risk in this example. We also assume that the initial hypothesis is learned from the labeled dataset :
Consider the case when and . Then
Let the pseudo label for any be defined as . The conditional distribution of the pseudo-labeled data sample given converges as follows
Let us rewrite the minimizer in (16) as
Recall the expected gen-error in Corollary 1, which can also be rewritten as
Details of the derivation are provided in Appendix L. We focus on the right-hand side, which depends on the ratio of unlabeled to labeled data instead of the individual sample sizes. As mentioned in Section 5.1, when the ratio is large, the gen-error behaves as in the SL case. On the other hand, the excess risk of this example is given by Corollary 2 with the relevant quantities replaced by their regularized counterparts. Intuitively, as the regularization parameter increases, the gen-error decreases.
[Figure 3: The theoretical gen-error, the bias and variance components of the excess risk, and the average empirical gen-error of semi-supervised logistic regression, plotted against the ratio of unlabeled to labeled samples.]
For different values of the ratio of unlabeled to labeled samples, we can numerically calculate the learned hypothesis, the expected gen-error and the excess risk. Consider the example of a dataset which contains two classes of Gaussian samples. In Figure 3, we plot the theoretical gen-error and the components of the excess risk versus the ratio. We also run the corresponding experiments on synthetic data multiple times and plot the average empirical gen-error. In both the theoretical and empirical plots, we observe that as the ratio increases, the gen-error of SSL decreases and is smaller than that of SL with only the labeled data. We also observe similar behaviour of the empirical gen-error of logistic regression on the MNIST dataset (see Appendix L). On the other hand, the variance component of the excess risk behaves similarly to the gen-error, while the bias component increases. Summing the two components, we can see that the excess risk first decreases and then increases as the ratio increases. In this setting, our results suggest that when the ratio exceeds some threshold, utilizing more unlabeled data may not be beneficial to the performance of the learning algorithm.
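For completeness, the following is a minimal sketch of a pseudo-labeling logistic regression experiment of the kind described above: two Gaussian classes, an initial ridge-regularized logistic model fitted on the labeled data, pseudo-labels for the unlabeled pool, a final model fitted on the mixture, and the empirical gen-error estimated as the gap between the test loss and the training loss on the labeled data. All parameter values and helper names are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(3)

def two_gaussians(n, d=5, mu=0.8):
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d)) + mu * (2 * y[:, None] - 1)
    return x, y

def empirical_gen_error(n_lab=50, ratio=4, n_test=20000, reg_c=1.0):
    """Test loss minus training loss on the labeled data, for one run."""
    x_lab, y_lab = two_gaussians(n_lab)
    x_unl, _ = two_gaussians(int(ratio * n_lab))
    x_test, y_test = two_gaussians(n_test)

    init = LogisticRegression(C=reg_c).fit(x_lab, y_lab)      # initial hypothesis from labeled data
    y_pseudo = init.predict(x_unl)                             # pseudo-labels

    x_all = np.vstack([x_lab, x_unl])
    y_all = np.concatenate([y_lab, y_pseudo])
    final = LogisticRegression(C=reg_c).fit(x_all, y_all)      # trained on labeled + pseudo-labeled data

    return (log_loss(y_test, final.predict_proba(x_test))
            - log_loss(y_lab, final.predict_proba(x_lab)))

for ratio in [1, 2, 5, 10]:
    runs = [empirical_gen_error(ratio=ratio) for _ in range(20)]
    print(f"ratio={ratio:2d}: average empirical gen-error = {np.mean(runs):.4f}")
```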
6 Concluding Remarks
To develop a comprehensive understanding of SSL, we present an exact characterization of the expected gen-error of pseudo-labeling-based SSL via the Gibbs algorithm. Our results reveal that the expected gen-error is influenced by the information shared between the labeled and pseudo-labeled data and by the ratio of the number of unlabeled to labeled data. This sheds light on our quest to design good pseudo-labeling methods, which should penalize the dependence between the labeled and pseudo-labeled data so as to improve the generalization performance. To understand the ERM counterpart of pseudo-labeling-based SSL, we also investigate the asymptotic regime in which the inverse temperature tends to infinity. Finally, we present two examples—mean estimation and logistic regression—as applications of our theoretical findings.
Acknowledgment
Haiyun He and Vincent Tan are supported by a Singapore National Research Foundation (NRF) Fellowship (Grant Number: A-0005077-01-00). Gholamali Aminian is supported by the UKRI Prosperity Partnership Scheme (FAIR) under the EPSRC Grant EP/V056883/1. M. R. D. Rodrigues and Gholamali Aminian are also supported by the Alan Turing Institute.
References
- Amini and Gallinari (2002) Massih-Reza Amini and Patrick Gallinari. Semi-supervised logistic regression. In ECAI, volume 2, page 11, 2002.
- Aminian et al. (2015) Gholamali Aminian, Hamidreza Arjmandi, Amin Gohari, Masoumeh Nasiri-Kenari, and Urbashi Mitra. Capacity of diffusion-based molecular communication networks over LTI-Poisson channels. IEEE Transactions on Molecular, Biological and Multi-Scale Communications, 1(2):188–201, 2015.
- Aminian et al. (2021a) Gholamali Aminian, Yuheng Bu, Laura Toni, Miguel Rodrigues, and Gregory Wornell. An exact characterization of the generalization error for the Gibbs algorithm. Advances in Neural Information Processing Systems, 34, 2021a.
- Aminian et al. (2021b) Gholamali Aminian, Laura Toni, and Miguel RD Rodrigues. Information-theoretic bounds on the moments of the generalization error of learning algorithms. In IEEE International Symposium on Information Theory (ISIT), 2021b.
- Aminian et al. (2022a) Gholamali Aminian, Mahed Abroshan, Mohammad Mahdi Khalili, Laura Toni, and Miguel Rodrigues. An information-theoretical approach to semi-supervised learning under covariate-shift. In International Conference on Artificial Intelligence and Statistics, pages 7433–7449. PMLR, 2022a.
- Aminian et al. (2022b) Gholamali Aminian, Saeed Masiha, Laura Toni, and Miguel RD Rodrigues. Learning algorithm generalization error bounds via auxiliary distributions. arXiv preprint arXiv:2210.00483, 2022b.
- Arazo et al. (2020) Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
- Asadi et al. (2018) Amir Asadi, Emmanuel Abbe, and Sergio Verdú. Chaining mutual information and tightening generalization bounds. In Advances in Neural Information Processing Systems, pages 7234–7243, 2018.
- Asadi and Abbe (2020) Amir R Asadi and Emmanuel Abbe. Chaining meets chain rule: Multilevel entropic regularization and training of neural networks. Journal of Machine Learning Research, 21(139):1–32, 2020.
- Athreya and Hwang (2010) KB Athreya and Chii-Ruey Hwang. Gibbs measures asymptotics. Sankhya A, 72(1):191–207, 2010.
- Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. Mixmatch: A holistic approach to semi-supervised learning. Advances in neural information processing systems, 32, 2019.
- Bu et al. (2020) Yuheng Bu, Shaofeng Zou, and Venugopal V Veeravalli. Tightening mutual information-based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory, 1(1):121–130, 2020.
- Bu et al. (2022) Yuheng Bu, Gholamali Aminian, Laura Toni, Gregory W Wornell, and Miguel Rodrigues. Characterizing and understanding the generalization error of transfer learning with gibbs algorithm. In International Conference on Artificial Intelligence and Statistics, pages 8673–8699. PMLR, 2022.
- Castelli and Cover (1996) Vittorio Castelli and Thomas M Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6):2102–2117, 1996.
- Catoni (2007) Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv preprint arXiv:0712.0248, 2007.
- Chapelle et al. (2003) Olivier Chapelle, Jason Weston, and Bernhard Scholkopf. Cluster kernels for semi-supervised learning. Advances in neural information processing systems, pages 601–608, 2003.
- Chapelle et al. (2006) Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. The MIT Press, 2006.
- Chiang et al. (1987) Tzuu-Shuh Chiang, Chii-Ruey Hwang, and Shuenn Jyi Sheu. Diffusion for global optimization in . SIAM Journal on Control and Optimization, 25(3):737–753, 1987.
- Cover (1999) Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
- Dupre et al. (2019) Robert Dupre, Jiri Fajtl, Vasileios Argyriou, and Paolo Remagnino. Improving dataset volumes and model accuracy with semi-supervised iterative self-learning. IEEE Transactions on Image Processing, 29:4337–4348, 2019.
- Esposito et al. (2021) Amedeo Roberto Esposito, Michael Gastpar, and Ibrahim Issa. Generalization error bounds via Rényi-, f-divergences and maximal leakage. IEEE Transactions on Information Theory, 2021.
- Gibbs (1902) Josiah Willard Gibbs. Elementary principles of statistical mechanics. Compare, 289:314, 1902.
- Göpfert et al. (2019) Christina Göpfert, Shai Ben-David, Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, and Ruth Urner. When can unlabeled data improve the learning rate? In Conference on Learning Theory, pages 1500–1518. PMLR, 2019.
- Grandvalet et al. (2005) Yves Grandvalet, Yoshua Bengio, et al. Semi-supervised learning by entropy minimization. CAP, 367:281–296, 2005.
- Hafez-Kolahi et al. (2020) Hassan Hafez-Kolahi, Zeinab Golgooni, Shohreh Kasaei, and Mahdieh Soleymani. Conditioning and processing: Techniques to improve information-theoretic generalization bounds. Advances in Neural Information Processing Systems, 33, 2020.
- Haghifam et al. (2020) Mahdi Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M Roy, and Gintare Karolina Dziugaite. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. Advances in Neural Information Processing Systems, 2020.
- He et al. (2022) Haiyun He, Hanshu Yan, and Vincent YF Tan. Information-theoretic characterization of the generalization error for iterative semi-supervised learning. Journal of Machine Learning Research, 23:1–52, 2022.
- Iscen et al. (2019) Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5070–5079, 2019.
- Jaynes (1957) Edwin T Jaynes. Information theory and statistical mechanics. Physical review, 106(4):620, 1957.
- Jeffreys (1946) Harold Jeffreys. An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 186(1007):453–461, 1946.
- Ji et al. (2012) Ming Ji, Tianbao Yang, Binbin Lin, Rong Jin, and Jiawei Han. A simple algorithm for semi-supervised learning with improved generalization error bound. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 835–842, 2012.
- Kuzborskij et al. (2019) Ilja Kuzborskij, Nicolò Cesa-Bianchi, and Csaba Szepesvári. Distribution-dependent analysis of Gibbs-ERM principle. In Conference on Learning Theory, pages 2028–2054. PMLR, 2019.
- Lee et al. (2013) Dong-Hyun Lee et al. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
- Li et al. (2019) Jian Li, Yong Liu, Rong Yin, and Weiping Wang. Multi-class learning using unlabeled samples: Theory and algorithm. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2880–2886, 2019.
- Markowich and Villani (2000) Peter A Markowich and Cédric Villani. On the trend to equilibrium for the Fokker-Planck equation: an interplay between physics and functional analysis. Mat. Contemp, 19:1–29, 2000.
- McLachlan (2005) Geoffrey J McLachlan. Discriminant Analysis and Statistical Pattern Recognition. John Wiley & Sons, 2005.
- Moon (1996) Todd K Moon. The expectation-maximization algorithm. IEEE Signal processing magazine, 13(6):47–60, 1996.
- Niu et al. (2013) Gang Niu, Wittawat Jitkrittum, Bo Dai, Hirotaka Hachiya, and Masashi Sugiyama. Squared-loss mutual information regularization: A novel information-theoretic approach to semi-supervised learning. In International Conference on Machine Learning, pages 10–18. PMLR, 2013.
- Ouali et al. (2020) Yassine Ouali, Céline Hudelot, and Myriam Tami. An overview of deep semi-supervised learning. arXiv preprint arXiv:2006.05278, 2020.
- Palomar and Verdú (2008) Daniel P Palomar and Sergio Verdú. Lautum information. IEEE transactions on information theory, 54(3):964–975, 2008.
- Raginsky et al. (2017) Maxim Raginsky, Alexander Rakhlin, and Matus Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703. PMLR, 2017.
- Rigollet (2007) Philippe Rigollet. Generalization error bounds in semi-supervised classification under the cluster assumption. Journal of Machine Learning Research, 8(7), 2007.
- Rizve et al. (2020) Mamshad Nayeem Rizve, Kevin Duarte, Yogesh S Rawat, and Mubarak Shah. In defense of pseudo-labeling: An uncertainty-aware pseudo-label selection framework for semi-supervised learning. In International Conference on Learning Representations, 2020.
- Russo and Zou (2019) Daniel Russo and James Zou. How much does your data exploration overfit? controlling bias via information usage. IEEE Transactions on Information Theory, 66(1):302–323, 2019.
- Seeger (2000) Matthias Seeger. Learning with labeled and unlabeled data. 2000. URL http://infoscience.epfl.ch/record/161327.
- Singh et al. (2008) Aarti Singh, Robert Nowak, and Jerry Zhu. Unlabeled data: Now it helps, now it doesn’t. Advances in Neural Information Processing Systems, 21:1513–1520, 2008.
- Sohn et al. (2020) Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in neural information processing systems, 33:596–608, 2020.
- Steinke and Zakynthinou (2020) Thomas Steinke and Lydia Zakynthinou. Reasoning about generalization via conditional mutual information. In Conference on Learning Theory, pages 3437–3452. PMLR, 2020.
- Szummer and Jaakkola (2002) Martin Szummer and Tommi Jaakkola. Information regularization with partially labeled data. Advances in Neural Information processing systems, 15, 2002.
- Vershynin (2018) Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- Wei et al. (2020) Colin Wei, Kendrick Shen, Yining Chen, and Tengyu Ma. Theoretical analysis of self-training with deep networks on unlabeled data. In International Conference on Learning Representations, 2020.
- Xu and Raginsky (2017) Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pages 2524–2533, 2017.
- Zhu (2020) Jingge Zhu. Semi-supervised learning: the case when unlabeled data is equally useful. In Conference on Uncertainty in Artificial Intelligence, pages 709–718. PMLR, 2020.
- Zhu and Goldberg (2009) Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–130, 2009.
- Zhu (2008) Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2008.
Appendix A Motivations for Gibbs Algorithm
There are different motivations for the Gibbs algorithm, as discussed in Raginsky et al. (2017); Kuzborskij et al. (2019); Asadi and Abbe (2020); Aminian et al. (2021a). The following is an overview of the most prominent ones:
Empirical Risk Minimization: The -Gibbs algorithm can be viewed as a randomized version of empirical risk minimization (ERM). As the inverse temperature tends to infinity, the hypothesis generated by the Gibbs algorithm converges to the hypothesis corresponding to standard ERM.
SGLD Algorithm: As discussed in Chiang et al. (1987) and Markowich and Villani (2000), the Stochastic Gradient Langevin Dynamics (SGLD) algorithm can be viewed as the discrete version of the continuous-time Langevin diffusion, and it is defined as follows:
(20) |
where the injected noise is a standard Gaussian random vector and the step size is fixed. In Raginsky et al. (2017), it is proved that, under some conditions on the loss function, the conditional distribution induced by the SGLD algorithm is close to the corresponding Gibbs distribution in the 2-Wasserstein distance after sufficiently many iterations.
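A minimal sketch of an SGLD update of the kind in (20) is given below for a generic differentiable empirical risk, in one common parametrization that targets a Gibbs distribution; the step size, risk function and iteration count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def sgld(grad_emp_risk, w0, gamma=10.0, eta=1e-3, n_iter=5000):
    """SGLD targeting p(w) proportional to exp(-gamma * L_E(w)).

    Update: w_{k+1} = w_k - eta * gamma * grad L_E(w_k) + sqrt(2 * eta) * xi_k, with xi_k ~ N(0, I).
    """
    w = np.array(w0, dtype=float)
    for _ in range(n_iter):
        w = w - eta * gamma * grad_emp_risk(w) + np.sqrt(2 * eta) * rng.normal(size=w.shape)
    return w

# Example: quadratic empirical risk L_E(w) = 0.5 * ||w - w_star||^2, so grad L_E(w) = w - w_star.
w_star = np.array([1.0, -2.0])
samples = np.array([sgld(lambda w: w - w_star, w0=np.zeros(2)) for _ in range(200)])
print("sample mean (should be close to w_star):", samples.mean(axis=0))
print("sample covariance (should be close to I / gamma):\n", np.cov(samples.T))
```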
Appendix B Proof of Theorem 1
Under the -Gibbs algorithm, we have
(21)–(24)
where and .
Thus, the expected gen-error is given by
(25) |
Appendix C Exact Characterization of Expected Gen-Error Under Entropy Minimization
Let us recall SSL by entropy minimization in Amini and Gallinari (2002) and Grandvalet et al. (2005). In these works, by letting the loss function be the negative log-likelihood under hypothesis , the authors considered minimizing the regularized empirical risk as follows:
Under the -Gibbs algorithm, the posterior distribution of can be denoted as . The output hypothesis only depends on the labeled data and unlabeled data instead of pseudo-labeled data . Since and are independent, according to (8), by replacing with , we can characterize the expected gen-error (cf. (4)) inspired by Theorem 1 in the following corollary.
Corollary 3.
Under the semi-supervised Gibbs algorithm with entropy minimization (namely, the -Gibbs algorithm), the expected gen-error is
Appendix D Proof of Proposition 1
We provide the proof of Proposition 1 and a novel lower bound on .
Proof.
A $\sigma$-sub-Gaussian random variable $X$ is one whose cumulant generating function satisfies $\log \mathbb{E}[e^{t(X-\mathbb{E}[X])}] \le \sigma^{2}t^{2}/2$ for all $t \in \mathbb{R}$ (Vershynin, 2018).
Assume that the loss function is bounded in for any and . Then we have . According to Hoeffding’s lemma, the loss function is -sub-Gaussian and is -sub-Gaussian. From the Donsker–Varadhan representation, we have
(27) |
From Theorem 1, we can directly obtain
(28) |
∎
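For completeness, the Donsker–Varadhan variational representation invoked in the proof states that, for probability measures $P$ and $Q$ on the same space,
\[
D(P\,\|\,Q) \;=\; \sup_{f}\Big\{\mathbb{E}_{P}[f] \;-\; \log \mathbb{E}_{Q}\big[e^{f}\big]\Big\},
\]
where the supremum is over measurable functions $f$ for which the expectations exist; in particular, $\mathbb{E}_{P}[f] \le D(P\|Q) + \log \mathbb{E}_{Q}[e^{f}]$ for every such $f$.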
We can also provide another lower bound on . For any random variable , from the chain rule of the mutual information, we have . We can also expand the lautum information as . Thus, we have
(29) |
Then the gen-error is lower bounded by
(30) |
The sign of the lower bound depends on the loss function, the prior distribution and the data distributions. As , or , or , the lower bound vanishes to .
Appendix E Proof of Sec. 3.2
Appendix F Proofs for Comparison to SL with Labeled Data
F.1 Proof of Eq. 10
F.2 Proof for when the pseudo-labeling is close to perfect
Without loss of generality, let and . With the assumption that , the joint distributions and are given by
(39)–(43)
where (42) follows since , and .
First recall that in the SL setting with labeled training data samples, the expected gen-error is given by
(44) |
Next, in the SSL setting with this assumption, the expected gen-error is given by
(45)–(48)
where is proportional to and as . The result can be directly extended to with data samples and with data samples.
Appendix G Proof of Proposition 2
The formal version of Proposition 2 is stated as follows.
Proposition 3 (Formal Version).
For any , suppose for some constant and , where depends on and . If is bounded within for any and , the expected gen-error can be bounded as follows
(49) |
Since , we always have
(50) |
Without loss of generality, we can take . Recall . Then we have (i.e., the informal version in Proposition 2)
(51) |
Proof.
If is -sub-Gaussian under for every , from (Xu and Raginsky, 2017, Theorem 1), we have
(52)–(53)
With the assumption that the loss function is bounded within for any and , we have . According to Hoeffding’s lemma, the loss function is -sub-Gaussian and is -sub-Gaussian. Thus, from Theorem 1, we have
(54)–(55)
Furthermore, by Donsker–Varadhan representation, can be upper bounded as follows
(56) |
Then (55) can be further upper bounded as
(57) |
Suppose for some constant and , where depends on and . Then we have
(58)–(60)
Thus, we have
(61) |
Appendix H Proofs of Mean Estimation Example
The -Gibbs algorithm is given by the following Gibbs posterior distribution
(65)–(66)
where ,
(67) |
and
(68) |
It can be seen that is a Gaussian distribution. Thus, the output hypothesis can be written as
(69)–(71)
where is independent of , , , and .
First, let us calculate the expected gen-error according to (5) in Theorem 1. Let for simplicity. We have
(72)–(80)
where (a) follows since , and
(81)–(84)
where (b) follows since is symmetric for any and . By letting and combining (80) and (84), the expected gen-error of this example is given by
(85) |
From the definition of gen-error in (4), we can also derive the same result as follows. Let be an independent copy of for any .
(86)–(93)
where (c) follows from (69), (d) follows since is symmetric for any and .
H.1 Mean Estimation under Supervised Learning
Under the supervised -Gibbs algorithm, the posterior distribution is given by
(94) |
According to (Aminian et al., 2021a, Theorem 1), with the similar techniques of obtaining (80), the expected gen-error is given by
(95) |
where and .
Similarly, under the supervised -Gibbs algorithm, where contains i.i.d. samples and , the expected gen-error (cf. (3.2.2)) is given by
(96) |
H.2 Proofs for Eq. 13
Let us rewrite the gen-error in (12) as follows: for any and .
(97)–(98)
Let . Using the proof idea from (He et al., 2022), we can decompose it as
(99) |
where , , and is perpendicular to and independent of . The normalized can be written as
(100) |
where , and
(101) |
For any , since , we have
(102) |
where , , and is independent of . Given any for , we have
(103)–(105)
where , , and is perpendicular to and independent of . The normalized version is given by
(106) |
where and
(107) |
Define the correlation evolution function :
(108) |
where and .
For any , we decompose the Gaussian random vector in another way
(109) |
where , , and are mutually independent and . Then we decompose and as
(110)–(111)
Then we have
(112) |
where (112) follows since is independent of and . Since , we have
(113) |
and similarly,
(114) |
Recall the definitions of and . Then we have
(115) |
and similarly
(116)–(119)
By plugging (115) and (119) back to (98), we have
(120)–(121)
where (121) follows since . Since and for any , we have that .
Remark 2 (Sign of ).
First, since and , we have
(122) |
Next, for some constant , by fixing , we have
(123) |
Since is an odd increasing function on , we have
(124) |
and
(125) |
Thus, by recalling that , we can deduce that
(126) |
In conclusion, .
Appendix I Proof of Theorem 2
In this case, there exists a unique minimizer of the empirical risk, i.e.,
(127) |
According to (Athreya and Hwang, 2010), if the following Hessian matrix
(128) |
is not singular, then as
(129) |
and
(130) |
Then we have
(131) |
By applying Theorem 1, we use the Gaussian approximation to simplify the symmetrized KL information as follows
(132)–(136)
From (130), we have
(137) |
which means
(138) |
Therefore, by applying Theorem 1, the expected gen-error can be rewritten as
(139) |
Appendix J Proof of Corollary 1
When and are independent, we can simplify the asymptotic gen-error as
(140) |
Let the MLE optimizer be denoted as . Recall the definition
(141) |
and then we have
(142) |
According to Theorem 2 and (140), the asymptotic gen-error of MLE is given by
(143)–(145)
Fix any pseudo-labeled data set and let
(146) |
Then we have given any ratio , as ,
(147) |
and
(148) |
By applying Taylor expansion to around , we have
(149) |
By the multivariate central limit theorem, as , the first term in (149) converges as follows
(150) |
By the law of large numbers, as , the second term in (149) converges as follows
(151) |
Given any ratio , as , according to (147),
(152) |
Thus, we have
(153) |
Finally, the expected gen-error in (145) can be rewritten as
(154)–(155)
Appendix K Proof of Corollary 2
Appendix L Proof of Logistic Regression Example
The first and second derivatives of the loss function are as follows
(162)–(163)
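In generic notation, for the standard (unregularized) logistic loss $\ell(w, x, y) = \log\big(1 + e^{-y\,w^{\top}x}\big)$ with $y \in \{-1, +1\}$ and sigmoid $\varsigma(u) = 1/(1 + e^{-u})$, the first and second derivatives take the familiar form below (a sketch under these assumptions; the exact parametrization of (162)–(163) may differ):
\[
\nabla_{w}\,\ell(w, x, y) \;=\; -\,y\,\varsigma\!\big(-y\,w^{\top}x\big)\,x,
\qquad
\nabla_{w}^{2}\,\ell(w, x, y) \;=\; \varsigma\!\big(y\,w^{\top}x\big)\,\varsigma\!\big(-y\,w^{\top}x\big)\,x\,x^{\top}.
\]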
The expected Hessian matrices and expected product of the first derivative are given as follows:
(164) |
(165) |
(166) |
We can see that .
[Figure 4: The empirical gen-error of logistic regression on the MNIST "0–1" digit pair versus the ratio of unlabeled to labeled samples.]
Recall the proof of Corollary 1 in Appendix J. In the logistic regression with regularization, the unique minimizer of the empirical risk in (15) is rewritten as
(167) |
and the Hessian matrix of the empirical risk at is rewritten as
(168) |
Recall the definition of with regularization
(169) |
Given any ratio , as , the Hessian matrix converges as follows
(170) |
Then the asymptotic expected gen-error in (145) is rewritten as
(171) |
By redefining in (146) with the regularization term , we can similarly obtain
(172) |
Then the expected gen-error in (145) can be rewritten as
(173) |
Similarly, the excess risk in (19) can be rewritten as
(174) |
where .
In addition to the experiments on the synthetic datasets, we implement a logistic regression experiment on the "0–1" digit pair of the MNIST dataset with appropriately chosen parameters. In Figure 4, we observe that the gen-error decreases as the ratio of unlabeled to labeled samples increases.