
Information-Theoretic Characterization of the Generalization Error for Iterative Semi-Supervised Learning

Haiyun He ([email protected])
Department of Electrical and Computer Engineering, National University of Singapore, 117583 Singapore

Hanshu Yan ([email protected])
Department of Electrical and Computer Engineering, National University of Singapore, 117583 Singapore

Vincent Y. F. Tan ([email protected])
Department of Mathematics, Department of Electrical and Computer Engineering, Institute of Operations Research and Analytics, National University of Singapore, 119076 Singapore
Abstract

Using information-theoretic principles, we consider the generalization error (gen-error) of iterative semi-supervised learning (SSL) algorithms that iteratively generate pseudo-labels for a large amount of unlabelled data to progressively refine the model parameters. In contrast to most previous works that bound the gen-error, we provide an exact expression for the gen-error and particularize it to the binary Gaussian mixture model. Our theoretical results suggest that when the class conditional variances are not too large, the gen-error decreases with the number of iterations, but quickly saturates. On the flip side, if the class conditional variances (and so the amount of overlap between the classes) are large, the gen-error increases with the number of iterations. To mitigate this undesirable effect, we show that regularization can reduce the gen-error. The theoretical results are corroborated by extensive experiments on the MNIST and CIFAR datasets, in which we notice that for easy-to-distinguish classes, the gen-error improves after several pseudo-labelling iterations, but saturates afterwards, and for more difficult-to-distinguish classes, regularization improves the generalization performance.

Keywords: Generalization error, Semi-supervised learning, Pseudo-label, Information theory, Binary Gaussian mixture.

1 Introduction

In real-life machine learning applications, it is relatively easy and inexpensive to obtain large amounts of unlabelled data, while the number of labelled data examples is usually small due to the high cost of annotating them with true labels. In light of this, semi-supervised learning (SSL) has come to the fore (Chapelle et al., 2006; Zhu, 2008; Van Engelen and Hoos, 2020). SSL makes use of the abundant unlabelled data to augment the performance of learning tasks with few labelled data examples. This has been shown to outperform supervised and unsupervised learning under certain conditions. For example, in a classification problem, the correlation between the additional unlabelled data and the labelled data may help to enhance the accuracy of classifiers. Among the plethora of SSL methods, pseudo-labelling (Lee, 2013) has been observed to be a simple and efficient way to improve the generalization performance empirically. In this paper, we consider the problem of pseudo-labelling a subset of the unlabelled data at each iteration based on the previous output parameter and then refining the model progressively, but we are interested in analysing this procedure theoretically. Our goal in this paper is to understand the impact of pseudo-labelling on the generalization error.

A learning algorithm can be viewed as a randomized map from the training dataset to the output model parameter. The output is highly data-dependent and may suffer from overfitting to the given dataset. In statistical learning theory, the generalization error (gen-error), or generalization bias, is defined as the expected gap between the test and training losses, and is used to measure the extent to which the algorithms overfit to the training data (Russo and Zou, 2016; Xu and Raginsky, 2017; Kawaguchi et al., 2017). In SSL problems, the unlabelled data are expected to improve the generalization performance in a certain manner and thus, it is a worthy endeavor to investigate the behaviour theoretically. Although there exist many works studying the gen-error for supervised learning problems, the gen-error of SSL algorithms is yet to be explored.

1.1 Related Works

The extensive literature review is categorized into three aspects.

Semi-supervised learning:

There are many existing works discussing various methods of SSL. The book by Chapelle et al. (2006) presented a comprehensive overview of SSL methods, both theoretically and practically. Chawla and Karakoulas (2005) presented an empirical study of various SSL techniques on a variety of datasets and investigated sample-selection bias when the labelled and unlabelled data are from different distributions. Zhu (2008) partitioned SSL methods into several main classes, including generative models, low-density separation methods, graph-based methods, self-training and co-training. Pseudo-labelling is a technique used in self-training and co-training (Zhu and Goldberg, 2009). In self-training, the model is initially trained on the limited number of labelled data examples and generates pseudo-labels for the unlabelled data. Subsequently, the model is retrained with the pseudo-labelled data, and the process is repeated iteratively. It is a simple and effective SSL method without restrictions on the data samples (Triguero et al., 2015). A variety of works have also shown the benefits of utilizing the unlabelled data. Singh et al. (2008) developed a finite-sample analysis that characterized how the unlabelled data improve the excess risk compared to supervised learning, in terms of the number of unlabelled data examples and the margin between different classes. Li et al. (2019) studied multi-class classification with unlabelled data and provided a sharper generalization error bound, using the notion of Rademacher complexity, that yields a faster convergence rate. Zhu (2020) considered the general SSL setting by assuming the loss function to be $\beta$-exponentially concave or the 0-1 loss, and used a Bayesian method for prediction instead of the empirical risk minimization that we consider. The author presented an upper bound on the excess risk and the learning rate in terms of the numbers of labelled and unlabelled data examples. Carmon et al. (2019) proved that using unlabelled data can help to achieve high robust accuracy as well as high standard accuracy at the same time. Dupre et al. (2019) considered iteratively pseudo-labelling the whole unlabelled dataset with a confidence threshold and showed that the accuracy converges relatively quickly. Oymak and Gülcü (2021), on which part of our analysis hinges, studied SSL under the binary Gaussian mixture model setup and characterized the correlation between the learned and the optimal estimators in terms of the margin and the regularization factor. Recently, Aminian et al. (2022) considered the scenario where the labelled and unlabelled data are not generated from the same distribution and these distributions may change over time, exhibiting so-called covariate shifts. They provided an upper bound for the gen-error and proposed the Covariate-shift SSL (CSSL) method, which outperforms some previous SSL algorithms under this setting. However, these works do not investigate how the unlabelled data affect the generalization error over the iterations.

Generalization error bounds:

The traditional way of analyzing the generalization error involves using the Vapnik–Chervonenkis (VC) dimension (Vapnik, 2000) and the Rademacher complexity (Boucheron et al., 2005). Recently, Russo and Zou (2016) proposed using the mutual information between the estimated output of an algorithm and the actual realized value of the estimates to analyze and bound the bias in data analysis, which can be regarded as equivalent to the generalization error. This new approach is simpler and can handle a wider range of loss functions compared to the abovementioned methods and other methods such as differential privacy. It also paves a new way towards improving the generalization capability of learning algorithms from an information-theoretic viewpoint. Following Russo and Zou (2016), Xu and Raginsky (2017) derived upper bounds on the generalization error of learning algorithms in terms of the mutual information between the input dataset and the output hypothesis, which formalizes the intuition that the less information a learning algorithm can extract from the training dataset, the less it overfits. Later, Pensia et al. (2018) derived generalization error bounds for noisy and iterative algorithms; the key contribution is to bound the mutual information between the input data and the output hypothesis. Negrea et al. (2019) improved the mutual information bounds for Stochastic Gradient Langevin Dynamics (SGLD) via data-dependent estimates, in contrast to distribution-dependent bounds.

However, one major shortcoming of the aforementioned mutual information bounds is that the bounds go to infinity for (deterministic) learning algorithms without noise, e.g., Stochastic Gradient Descent (SGD). Some other works have tried to overcome this problem. Lopez and Jog (2018) derived upper bounds on the generalization error using the Wasserstein distance involving the distributions of the input data and the output hypothesis, which are shown to be tighter in some natural cases. Esposito et al. (2021) derived generalization error bounds via Rényi divergences, $f$-divergences and maximal leakage. Steinke and Zakynthinou (2020) proposed using the Conditional Mutual Information (CMI) to bound the generalization error; the CMI is useful as it possesses the chain rule property. Bu et al. (2020) provided a tightened upper bound based on the individual mutual information (IMI) between an individual data sample and the output. Wu et al. (2020) extended the result of Bu et al. (2020) to transfer learning problems and characterized the upper bound based on the IMI and the KL-divergence. In a similar manner, Jose and Simeone (2021) provided a tightened bound on the transfer generalization error based on the Jensen–Shannon divergence. More recently, Aminian et al. (2021) and Bu et al. (2022) derived exact characterizations of the gen-error for supervised learning and transfer learning with the Gibbs algorithm.

Regularization is an important technique to reduce the model variance (Anzai, 2012), but there are few works that theoretically analyse the relationship between the gen-error and regularization. Moody (1992) characterized the gen-error as a function of the regularization parameter in supervised nonlinear learning systems and showed that the gen-error decreases as the parameter increases. Bousquet and Elisseeff (2002) provided a stability-based gen-error upper bound in terms of the regularization parameter in supervised learning. Mignacco et al. (2020) studied how regularization affects the expected accuracy in a high-dimensional supervised GMM classification problem.

Gaussian mixture models (GMM):

The GMM is a popular, simple but non-trivial model that has been studied by many researchers. The performance in GMM classification problems depends on the data structure. The classical work of Castelli and Cover (1996) studied the classification problem in a binary mixture model with known conditional distributions but an unknown mixing parameter, and characterized the relative value of labelled and unlabelled data in improving the convergence rate of the classification error probability. Akaho and Kappen (2000) characterized the generalization bias of general GMMs in supervised learning and discussed its dependence on the data noise. Watanabe and Watanabe (2006) considered GMMs in Bayesian learning and provided bounds for the variational stochastic complexity. Wang and Thrampoulidis (2022) and Muthukumar et al. (2021) studied the dependence of the bGMM classification performance (using the 0-1 loss) on the structure of the data covariance by considering SVMs and linear interpolation.

However, none of these aforementioned works investigates the generalization performance of SSL algorithms.

1.2 Contributions

Our main contributions are as follows.

  1. In Section 3, we leverage results by Bu et al. (2020) and Wu et al. (2020) to derive an information-theoretic gen-error bound at each iteration for iterative SSL; see Theorem 2.A. Moreover, in contrast to most previous works that bound the gen-error, we derive an exact characterization of the gen-error at each iteration for negative log-likelihood (NLL) loss functions (see Theorem 2.B).

  2. In Section 4, we particularize Theorem 2.B to the binary Gaussian mixture model (bGMM) with in-class variance $\sigma^2$. We show that for any fixed number of data samples, there exists a critical value $\sigma_0$ such that when the data variance (representing the overlap between classes) satisfies $\sigma^2<\sigma_0^2$, the gen-error decreases in the iteration count $t$ and converges quickly with a sufficiently large amount of unlabelled data. When $\sigma^2>\sigma_0^2$, the gen-error increases instead, which means that using the unlabelled data does not help to reduce the gen-error across the SSL iterations. The empirical gen-error corroborates the theoretical results, which suggests that the characterization serves as a useful rule-of-thumb to understand how the gen-error changes across the SSL iterations, and it can be used to establish conditions under which unlabelled data can help in terms of generalization.

  3. In Section 5, we theoretically and empirically show that for difficult-to-classify problems with large overlap between classes, regularization can effectively help to mitigate the undesirable increase of the gen-error across the SSL iterations.

  4. In Section 6, we implement the pseudo-labelling procedure on the MNIST and CIFAR datasets with few labelled data and abundant unlabelled data. The experimental results corroborate the phenomena observed for the bGMM: the gen-error decreases quickly in the early pseudo-labelling iterations and saturates thereafter for easy-to-distinguish classes, but increases for hard-to-distinguish classes. By adding $\ell_2$-regularization to the hard-to-distinguish problem, we also observe improvements to the gen-error similar to those for the bGMM.

2 Problem Setup

Let the instance space be $\mathcal{Z}=\mathcal{X}\times\mathcal{Y}\subset\mathbb{R}^{d+1}$, the model parameter space be $\Theta$ and the loss function be $l:\mathcal{Z}\times\Theta\to\mathbb{R}$, where $d\in\mathbb{N}$. We are given a labelled training dataset $S_{\mathrm{l}}=\{Z_1,\ldots,Z_n\}=\{(X_i,Y_i)\}_{i=1}^n$ drawn from $\mathcal{Z}$, where each $Z_i=(X_i,Y_i)$ is independently and identically distributed (i.i.d.) according to $P_Z=P_{X,Y}\in\mathcal{P}(\mathcal{Z})$. For any $i\in[n]$, $X_i$ is a vector of features and $Y_i$ is a label indicating the class to which $X_i$ belongs. However, in many real-life machine learning applications, we only have a limited number of labelled data examples, while we have access to a large amount of unlabelled data, which are expensive to annotate. We can then incorporate the unlabelled training data together with the labelled data to improve the performance of the model. This procedure is called semi-supervised learning (SSL). We are given an independent unlabelled training dataset $S_{\mathrm{u}}=\{X'_1,\ldots,X'_{\tau m}\}$, $\tau\in\mathbb{N}$, where each $X'_i$ is generated i.i.d. from $P_X\in\mathcal{P}(\mathcal{X})$. Typically, $m\gg n$.

In the following, we consider iterative self-training with pseudo-labelling in the SSL setup, as shown in Figure 1. Let $t\in[0:\tau]$ denote the iteration count. In the initial round ($t=0$), the labelled data $S_{\mathrm{l}}$ are first used to learn an initial model parameter $\theta_0\in\Theta$. Next, we split the unlabelled dataset $S_{\mathrm{u}}$ into $\tau$ disjoint equal-size sub-datasets $\{S_{\mathrm{u},k}\}_{k=1}^{\tau}$, where $S_{\mathrm{u},k}=\{X'_{(k-1)m+1},\ldots,X'_{km}\}$. In each subsequent round $t\in[1:\tau]$, based on $\theta_{t-1}$ trained in the previous round, we use a predictor $f_{\theta_{t-1}}:\mathcal{X}\mapsto\mathcal{Y}$ to assign a pseudo-label $\hat{Y}'_i$ to the unlabelled sample $X'_i$ for all $i\in\mathcal{I}_t:=\{(t-1)m+1,(t-1)m+2,\ldots,tm\}$. Let $\hat{S}_{\mathrm{u},t}=\{(X'_i,\hat{Y}'_i)\}_{i\in\mathcal{I}_t}$ denote the $t^{\mathrm{th}}$ pseudo-labelled dataset. After pseudo-labelling, both the labelled data $S_{\mathrm{l}}$ and the pseudo-labelled data $\hat{S}_{\mathrm{u},t}$ are used to learn a new model parameter $\theta_t$. The procedure is then repeated until the maximum number of iterations $\tau$ is reached.
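To make the procedure concrete, the following minimal Python sketch mirrors the loop just described; the choice of an off-the-shelf logistic-regression learner and the helper names are illustrative assumptions, not part of the formal setup.

```python
# A minimal sketch of iterative self-training with pseudo-labelling.
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_ssl(X_l, y_l, X_u, tau):
    """X_l, y_l: n labelled examples; X_u: tau*m unlabelled examples."""
    m = X_u.shape[0] // tau
    model = LogisticRegression().fit(X_l, y_l)             # initial round t = 0
    for t in range(1, tau + 1):
        X_t = X_u[(t - 1) * m : t * m]                      # t-th sub-dataset S_{u,t}
        y_hat = model.predict(X_t)                          # pseudo-labels via f_{theta_{t-1}}
        X_train = np.vstack([X_l, X_t])                     # labelled + pseudo-labelled data
        y_train = np.concatenate([y_l, y_hat])
        model = LogisticRegression().fit(X_train, y_train)  # refine theta_t
    return model
```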

Figure 1: Paradigm of iterative self-training with pseudo-labelling in SSL.

This setup is a classical and widely-used model in the realm of self-training in SSL (Chapelle et al., 2006; Zhu, 2008; Zhu and Goldberg, 2009; Lee, 2013), where in each iteration, only a subset of the unlabelled data are used. Furthermore, as discussed by Arazo et al. (2020), this method is less likely to overfit to incorrect pseudo-labels, compared to using all the unlabelled data in each iteration (also see Figure 10). Under this setup of iterative SSL, during each iteration $t$, our goal is to find a model parameter $\theta_t\in\Theta$ that minimizes the population risk with respect to the underlying data distribution

L_{P_{Z}}(\theta_{t}) := \mathbb{E}_{Z\sim P_{Z}}[l(\theta_{t},Z)]. \qquad (1)

Since $P_Z$ is unknown, $L_{P_Z}(\theta_t)$ cannot be computed directly. Hence, we instead minimize the empirical risk. This procedure is termed empirical risk minimization (ERM). For any model parameter $\theta_t\in\Theta$, the empirical risk of the labelled data is defined as

L_{S_{\mathrm{l}}}(\theta_{t}) := \frac{1}{n}\sum_{i=1}^{n} l(\theta_{t},Z_{i}), \qquad (2)

and for $t\geq 1$, the empirical risk of the pseudo-labelled data $\hat{S}_{\mathrm{u},t}$ as

L_{\hat{S}_{\mathrm{u},t}}(\theta_{t}) := \frac{1}{m}\sum_{i\in\mathcal{I}_{t}} l(\theta_{t},(X'_{i},\hat{Y}'_{i})). \qquad (3)

We set $L_{\hat{S}_{\mathrm{u},t}}(\theta_t)=0$ for $t=0$. For a fixed weight $w\in[0,1]$, the total empirical risk can be defined as the following linear combination of $L_{S_{\mathrm{l}}}(\theta_t)$ and $L_{\hat{S}_{\mathrm{u},t}}(\theta_t)$:

L_{S_{\mathrm{l}},\hat{S}_{\mathrm{u},t}}(\theta_{t}) := w\,L_{S_{\mathrm{l}}}(\theta_{t}) + (1-w)\,L_{\hat{S}_{\mathrm{u},t}}(\theta_{t}). \qquad (4)

In the usual case where the algorithm minimizes the average of the empirical training losses, one should set $w=\frac{n}{n+m}$. An SSL algorithm can be characterized by a randomized map from the labelled and unlabelled training data $S_{\mathrm{l}}$, $S_{\mathrm{u}}$ to a model parameter $\theta$ according to a conditional distribution $P_{\theta|S_{\mathrm{l}},S_{\mathrm{u}}}$. Then at each iteration $t$, we can use the sequence of conditional distributions $\{P_{\theta_k|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t}$ with $P_{\theta_0|S_{\mathrm{l}},S_{\mathrm{u}}}=P_{\theta_0|S_{\mathrm{l}}}$ to represent an iterative SSL algorithm. The generalization error at the $t$-th iteration is defined as the expected gap between the population risk of $\theta_t$ and the empirical risk on the training data:

\mathrm{gen}_{t}(P_{Z},P_{X},\{P_{\theta_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\theta_{k}}\}_{k=0}^{t-1})
  := \mathbb{E}[L_{P_{Z}}(\theta_{t})-L_{S_{\mathrm{l}},\hat{S}_{\mathrm{u},t}}(\theta_{t})] \qquad (5)
  = w\bigg(\mathbb{E}_{\theta_{t}}\mathbb{E}_{Z}[l(\theta_{t},Z)]-\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\theta_{t},Z_{i}}[l(\theta_{t},Z_{i})]\bigg)
  \quad + (1-w)\bigg(\mathbb{E}_{\theta_{t}}\mathbb{E}_{Z}[l(\theta_{t},Z)]-\frac{1}{m}\sum_{i\in\mathcal{I}_{t}}\mathbb{E}_{\theta_{t},X'_{i},\hat{Y}'_{i}}[l(\theta_{t},(X'_{i},\hat{Y}'_{i}))]\bigg). \qquad (6)

When $t=0$ and $w=1$, the definition of the generalization error reduces to that of vanilla supervised learning. Based on this definition, the expected population risk can be decomposed as

\mathbb{E}[L_{P_{Z}}(\theta_{t})]=\mathbb{E}[L_{S_{\mathrm{l}},\hat{S}_{\mathrm{u},t}}(\theta_{t})]+\mathrm{gen}_{t}, \qquad (7)

where the first term on the right-hand side of this equation is what the algorithm minimizes and reflects how well the output hypothesis fits the dataset, and the second term $\mathrm{gen}_t$ is used to measure the extent to which the iterative learning algorithm overfits the training data at the $t$-th iteration. To minimize $\mathbb{E}[L_{P_Z}(\theta_t)]$, we need both terms in (7) to be small, but there exists a natural trade-off between them. While the algorithm aims to minimize the empirical risk $\mathbb{E}[L_{S_{\mathrm{l}},\hat{S}_{\mathrm{u},t}}(\theta_t)]$, studying and controlling $\mathrm{gen}_t$ can also help to reduce the population risk $\mathbb{E}[L_{P_Z}(\theta_t)]$, which is the ultimate goal of learning. Instead of focusing on the total generalization error induced during the entire process, we are interested in the following questions. How does $\mathrm{gen}_t$ evolve as $t$ increases? Do the unlabelled data examples in $S_{\mathrm{u}}$ help to improve the generalization error?
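As a concrete illustration of definition (5), the short sketch below estimates $\mathrm{gen}_t$ for one draw of $\theta_t$ by contrasting a fresh test set (a proxy for the population risk) with the weighted training risk in (4); the function names are hypothetical placeholders.

```python
# Minimal sketch: estimate gen_t in (5) for one realization of theta_t, assuming a
# per-example loss `loss(theta, X, y)` that returns one value per example.
def estimate_gen_error(theta, loss, X_test, y_test, X_l, y_l, X_pl, y_pl, w):
    """Return (population-risk proxy) minus the weighted empirical risk in (4)."""
    population_risk = loss(theta, X_test, y_test).mean()   # proxy for E_Z[l(theta, Z)]
    train_risk = (w * loss(theta, X_l, y_l).mean()
                  + (1.0 - w) * loss(theta, X_pl, y_pl).mean())
    return population_risk - train_risk
```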

3 General Results

Inspired by the information-theoretic generalization results in Bu et al. (2020, Theorem 1) and Wu et al. (2020, Theorem 1), we derive an upper bound on the gen-error $\mathrm{gen}_t$ in terms of the mutual information between the input data samples (either labelled or pseudo-labelled) and the output model parameter $\theta_t$, as well as the KL-divergence between the data distribution and the joint distribution of feature vectors and pseudo-labels (cf. Theorem 2.A). Furthermore, by considering the NLL loss function (MacKay, 2003; Goodfellow et al., 2016), we derive the exact characterization of the gen-error $\mathrm{gen}_t$ (cf. Theorem 2.B).

Recall that for a given $R>0$, $L$ is an $R$-sub-Gaussian random variable (Vershynin, 2018) if its cumulant generating function satisfies $\Lambda_L(\lambda):=\log\mathbb{E}[\exp(\lambda(L-\mathbb{E}[L]))]\leq\lambda^2 R^2/2$ for all $\lambda\in\mathbb{R}$. If $L$ is $R$-sub-Gaussian, we write this as $L\sim\mathrm{subG}(R)$. Furthermore, let us recall the following somewhat non-standard information quantities (Negrea et al., 2019; Haghifam et al., 2020).

Definition 1.

For random variables $X$, $Y$ and $U$, define the disintegrated mutual information between $X$ and $Y$ given $U$ as $I_U(X;Y):=D(P_{X,Y|U}\|P_{X|U}\otimes P_{Y|U})$, and the disintegrated KL-divergence between $P_X$ and $P_Y$ given $U$ as $D_U(P_X\|P_Y):=D(P_{X|U}\|P_{Y|U})$. These are $\sigma(U)$-measurable random variables. It follows that the conditional mutual information $I(X;Y|U)=\mathbb{E}_U[I_U(X;Y)]$ and the conditional KL-divergence $D(P_{X|U}\|P_{Y|U}|P_U)=\mathbb{E}_U[D_U(P_X\|P_Y)]$.

For distributions $P$, $Q$ and $V$, define the cross-entropy as $h(P,Q):=\mathbb{E}_P[-\log Q]$ and the divergence between the cross-entropies as $\mathrm{\Delta h}(P\|Q|V):=h(P,V)-h(Q,V)$.

Let $\theta^{(t)}=(\theta_0,\ldots,\theta_t)$ for any $t\in[0:\tau]$ and set $w=1$ for $t=0$. In iterative SSL, we can characterize the gen-error as shown in Theorems 2.A and 2.B by applying the law of total expectation.

Theorem 2.A (Gen-error upper bound for iterative SSL).

Suppose $l(\theta,Z)\sim\mathrm{subG}(R)$ under $Z\sim P_Z$ for all $\theta\in\Theta$. Then for any $t\in[0:\tau]$,

\big|\mathrm{gen}_{t}(P_{Z},P_{X},\{P_{\theta_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\theta_{k}}\}_{k=0}^{t-1})\big|
  \leq \frac{w}{n}\sum_{i=1}^{n}\mathbb{E}_{\theta^{(t-1)}}\Big[\sqrt{2R^{2}I_{\theta^{(t-1)}}^{(i)}}\Big]+\frac{1-w}{m}\sum_{i\in\mathcal{I}_{t}}\mathbb{E}_{\theta^{(t-1)}}\Big[\sqrt{2R^{2}\big(I_{\theta^{(t-1)}}^{\prime(i)}+D_{\theta^{(t-1)}}^{\prime(i)}\big)}\Big], \qquad (8)

where $I_{\theta^{(t-1)}}^{(i)}:=I_{\theta^{(t-1)}}(\theta_{t};Z_{i})$, $I_{\theta^{(t-1)}}^{\prime(i)}:=I_{\theta^{(t-1)}}(\theta_{t};X'_{i},\hat{Y}'_{i})$, and $D_{\theta^{(t-1)}}^{\prime(i)}:=D_{\theta^{(t-1)}}(P_{X'_{i},\hat{Y}'_{i}}\|P_{Z})$.

Theorem 2.B (Exact gen-error for iterative SSL).

Consider the NLL loss function $l(\theta,Z)=-\log p_{\theta}(Z)$, where $p_{\theta}(Z)$ is the likelihood of $Z$ under parameter $\theta$. For any $t\in[0:\tau]$,

\mathrm{gen}_{t}(P_{Z},P_{X},\{P_{\theta_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\theta_{k}}\}_{k=0}^{t-1})
  = \mathbb{E}_{\theta^{(t)}}\bigg[\frac{w}{n}\sum_{i=1}^{n}\mathrm{\Delta h}^{(i)}_{\theta_{t}}+\frac{1-w}{m}\sum_{i\in\mathcal{I}_{t}}\big(\mathrm{\Delta h}^{\prime(i)}_{\theta^{(t)}}+\widetilde{\mathrm{\Delta h}}^{\prime(i)}_{\theta^{(t)}}\big)\bigg], \qquad (9)

where

\mathrm{\Delta h}^{(i)}_{\theta_{t}} := \mathrm{\Delta h}(P_{Z}\|P_{Z_{i}|\theta_{t}}|p_{\theta_{t}}),\qquad \mathrm{\Delta h}^{\prime(i)}_{\theta^{(t)}} := \mathrm{\Delta h}(P_{Z}\|P_{X'_{i},\hat{Y}'_{i}|\theta^{(t-1)}}|p_{\theta_{t}}),\quad\mbox{and} \qquad (10)
\widetilde{\mathrm{\Delta h}}^{\prime(i)}_{\theta^{(t)}} := \mathrm{\Delta h}(P_{X'_{i},\hat{Y}'_{i}|\theta^{(t-1)}}\|P_{X'_{i},\hat{Y}'_{i}|\theta^{(t)}}|p_{\theta_{t}}). \qquad (11)

The proof of Theorem 2.A is provided in Appendix A, in which we provide a general upper bound that applies beyond sub-Gaussian loss functions. The proof of Theorem 2.B is provided in Appendix B. Specifically, for NLL loss functions, Theorem 2.B provides an exact characterization of the gen-error at each iteration. This is in stark contrast to most works on information-theoretic generalization error in which only bounds are provided.

In contrast to Bu et al. (2020, Theorem 1) and Wu et al. (2020, Theorem 1), which pertain to supervised learning, Theorems 2.A and 2.B characterize the gen-error at each iteration during the pseudo-labelling and training process. Note that the quantities in Theorem 2.B satisfy $\mathrm{\Delta h}^{(i)}_{\theta_t}=I_{\theta^{(t-1)}}^{(i)}+D(P_Z\|p_{\theta_t})-D(P_{Z_i|\theta_t}\|p_{\theta_t})$ and $\widetilde{\mathrm{\Delta h}}^{\prime(i)}_{\theta^{(t)}}=I_{\theta^{(t-1)}}^{\prime(i)}+D_{\theta^{(t-1)}}(P_{X'_i,\hat{Y}'_i}\|p_{\theta_t})-D_{\theta^{(t-1)}}(P_{X'_i,\hat{Y}'_i|\theta_t}\|p_{\theta_t})$. Thus, it is plausible that the upper bound based on $I_{\theta^{(t-1)}}^{(i)}$ and $I_{\theta^{(t-1)}}^{\prime(i)}$ in Theorem 2.A can help to understand and control the exact gen-error. Intuitively, the mutual information between the individual input data sample $Z_i$ and the output model parameter $\theta_t$ in Theorem 2.A and the cross-entropy divergences $\mathrm{\Delta h}^{(i)}_{\theta_t}$, $\widetilde{\mathrm{\Delta h}}^{\prime(i)}_{\theta^{(t)}}$ in Theorem 2.B both measure the extent to which the algorithm is sensitive to each data example at each iteration $t$. The KL-divergence between the underlying $P_Z$ and the pseudo-labelled distribution $P_{X'_i,\hat{Y}'_i}$ in Theorem 2.A and the cross-entropy divergence $\mathrm{\Delta h}^{\prime(i)}_{\theta^{(t)}}$ in Theorem 2.B measure how effectively the pseudo-labelling process works. As $n\to\infty$ and $m\to\infty$, we show that the mutual information (as well as $\mathrm{\Delta h}^{(i)}_{\theta_t}$ and $\widetilde{\mathrm{\Delta h}}^{\prime(i)}_{\theta^{(t)}}$) vanishes but the divergences $D_{\theta^{(t-1)}}^{\prime(i)}$ and $\mathrm{\Delta h}^{\prime(i)}_{\theta^{(t)}}$ do not, which reflects the impact of pseudo-labelling on the gen-error.

In iterative learning algorithms, by applying the law of total expectation and conditioning the information-theoretic quantities on the output model parameters $\theta^{(t-1)}=(\theta_0,\ldots,\theta_{t-1})$ from previous iterations, we are able to calculate the gen-error iteratively. In the next section, we apply the exact iterated gen-error in Theorem 2.B to a classification problem under a specific generative model, namely the bGMM. This simple model allows us to derive a tractable characterization of the gen-error as a function of the iteration number $t$ that we can compute numerically.

4 Main Results on bGMM

We now particularize the iterative semi-supervised classification setup to the bGMM. We evaluate (9) to understand the effect of multiple self-training rounds on the gen-error.

4.1 Iterative SSL under bGMM

Fix a unit vector $\bm{\mu}\in\mathbb{R}^d$ and a scalar $\sigma\in\mathbb{R}_+=(0,\infty)$. Under the bGMM with mean $\bm{\mu}$ and standard deviation (std. dev.) $\sigma$, denoted bGMM($\bm{\mu},\sigma$), we assume that the distribution of any labelled data example $(\mathbf{X},Y)$ can be specified as follows. Let $\mathcal{Y}=\{-1,+1\}$, $Y\sim P_Y=\mathrm{unif}\{-1,+1\}$, and $\mathbf{X}|Y\sim\mathcal{N}(Y\bm{\mu},\sigma^2\mathbf{I}_d)$, where $\mathbf{I}_d$ is the identity matrix of size $d\times d$.

The random vector 𝐗\mathbf{X} is distributed according to the mixture distribution

p_{\bm{\mu}}=\frac{1}{2}\mathcal{N}(\bm{\mu},\sigma^{2}\mathbf{I}_{d})+\frac{1}{2}\mathcal{N}(-\bm{\mu},\sigma^{2}\mathbf{I}_{d}).

In the unlabelled dataset $S_{\mathrm{u}}$, each $\mathbf{X}'_i$ for $i\in[1:\tau m]$ is drawn i.i.d. from $p_{\bm{\mu}}$.

Let $\Theta\subset\mathbb{R}^d$ be such that $\bm{\mu}\in\Theta$. For any $\bm{\theta}\in\Theta$, under the bGMM($\bm{\theta},\sigma$), the joint distribution of any pair $(\mathbf{X},Y)\in\mathcal{Z}$ is given by $\mathcal{N}(Y\bm{\theta},\sigma^2\mathbf{I}_d)\otimes P_Y$. The NLL loss function can be expressed as

l(\bm{\theta},(\mathbf{X},Y)) = -\log p_{\bm{\theta}}(\mathbf{X},Y) = -\log\big(P_{Y}(Y)\,p_{\bm{\theta}}(\mathbf{X}|Y)\big)
  = -\log\frac{1}{2\sqrt{(2\pi)^{d}}\sigma^{d}}+\frac{1}{2\sigma^{2}}(\mathbf{X}-Y\bm{\theta})^{\top}(\mathbf{X}-Y\bm{\theta}). \qquad (12)

The population risk minimizer is given by $\operatorname*{arg\,min}_{\bm{\theta}\in\Theta}\mathbb{E}_{\mathbf{X},Y}[l(\bm{\theta},(\mathbf{X},Y))]=\bm{\mu}$.

Under this setup, the iterative SSL procedure is as shown in Figure 1, except that the labelled dataset $S_{\mathrm{l}}$ is only used for training in the initial round $t=0$; we discuss the reuse of $S_{\mathrm{l}}$ in all iterations in Corollary 10. That is, in (4), we set $w=0$. The algorithm operates in the following steps (a code sketch is given after the steps).

  • Step 1: Initial round $t=0$ with $S_{\mathrm{l}}$: By minimizing the empirical risk of the labelled dataset $S_{\mathrm{l}}$,

    L_{S_{\mathrm{l}}}(\bm{\theta})=\frac{1}{n}\sum_{i=1}^{n}l(\bm{\theta},(\mathbf{X}_{i},Y_{i}))\overset{\mathrm{c}}{=}\frac{1}{2\sigma^{2}n}\sum_{i=1}^{n}(\mathbf{X}_{i}-Y_{i}\bm{\theta})^{\top}(\mathbf{X}_{i}-Y_{i}\bm{\theta}), \qquad (13)

    where $\overset{\mathrm{c}}{=}$ means that both sides differ by a constant independent of $\bm{\theta}$, we obtain the minimizer

    \bm{\theta}_{0}=\operatorname*{arg\,min}_{\bm{\theta}\in\Theta}L_{S_{\mathrm{l}}}(\bm{\theta})=\frac{1}{n}\sum_{i=1}^{n}Y_{i}\mathbf{X}_{i}. \qquad (14)

  • Step 2: Pseudo-label data in $S_{\mathrm{u}}$: At each iteration $t\in[1:\tau]$, for any $i\in\mathcal{I}_{t}$, we use $\bm{\theta}_{t-1}$ to assign a pseudo-label to $\mathbf{X}'_{i}$, that is, $\hat{Y}'_{i}=f_{\bm{\theta}_{t-1}}(\mathbf{X}'_{i})=\operatorname{sgn}(\bm{\theta}_{t-1}^{\top}\mathbf{X}'_{i})$.

  • Step 3: Refine the model: We then use the pseudo-labelled dataset $\hat{S}_{\mathrm{u},t}$ to train the new model. By minimizing the empirical risk of $\hat{S}_{\mathrm{u},t}$,

    L_{\hat{S}_{\mathrm{u},t}}(\bm{\theta})=\frac{1}{m}\sum_{i\in\mathcal{I}_{t}}l(\bm{\theta},(\mathbf{X}'_{i},\hat{Y}'_{i}))\overset{\mathrm{c}}{=}\frac{1}{2\sigma^{2}m}\sum_{i\in\mathcal{I}_{t}}(\mathbf{X}'_{i}-\hat{Y}'_{i}\bm{\theta})^{\top}(\mathbf{X}'_{i}-\hat{Y}'_{i}\bm{\theta}), \qquad (15)

    we obtain the new model parameter

    \bm{\theta}_{t}=\frac{1}{m}\sum_{i\in\mathcal{I}_{t}}\hat{Y}'_{i}\mathbf{X}'_{i}=\frac{1}{m}\sum_{i\in\mathcal{I}_{t}}\operatorname{sgn}(\bm{\theta}_{t-1}^{\top}\mathbf{X}'_{i})\mathbf{X}'_{i}. \qquad (16)

    If $t<\tau$, go back to Step 2.
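The following minimal simulation (a sketch; the dimension, sample sizes and random seed are illustrative assumptions) implements Steps 1-3 via the closed-form updates (14) and (16).

```python
# Sketch of the iterative SSL procedure under the bGMM.
import numpy as np

rng = np.random.default_rng(0)
d, n, m, tau, sigma = 2, 10, 1000, 5, 0.6
mu = np.zeros(d)
mu[0] = 1.0                                          # unit mean vector

def sample_bgmm(size):
    """Draw `size` samples from bGMM(mu, sigma)."""
    y = rng.choice([-1.0, 1.0], size=size)
    x = y[:, None] * mu + sigma * rng.standard_normal((size, d))
    return x, y

# Step 1: initial round with labelled data, theta_0 = (1/n) sum_i Y_i X_i  (Eq. (14))
X_l, Y_l = sample_bgmm(n)
theta = (Y_l[:, None] * X_l).mean(axis=0)

for t in range(1, tau + 1):
    X_u, _ = sample_bgmm(m)                          # t-th block of unlabelled data
    Y_hat = np.sign(X_u @ theta)                     # Step 2: pseudo-labels sgn(theta^T X')
    theta = (Y_hat[:, None] * X_u).mean(axis=0)      # Step 3: refine via Eq. (16)
    print(t, "correlation with mu:", theta @ mu / np.linalg.norm(theta))
```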

4.2 Definitions

To state our result succinctly, we first define some non-standard notation and functions. From (14), we know that $\bm{\theta}_0\sim\mathcal{N}(\bm{\mu},\frac{\sigma^2}{n}\mathbf{I}_d)$ and, inspired by Oymak and Gülcü (2021), we can decompose $\bm{\theta}_0$ as

\bm{\theta}_{0}=\Big(1+\frac{\sigma}{\sqrt{n}}\xi_{0}\Big)\bm{\mu}+\frac{\sigma}{\sqrt{n}}\bm{\mu}^{\bot},

where $\xi_0\sim\mathcal{N}(0,1)$, $\bm{\mu}^{\bot}\sim\mathcal{N}(\mathbf{0},\mathbf{I}_d-\bm{\mu}\bm{\mu}^{\top})$, and $\bm{\mu}^{\bot}$ is perpendicular to $\bm{\mu}$ and independent of $\xi_0$ (the details of this decomposition are provided in Appendix C).

Given a pair of vectors $(\mathbf{a},\mathbf{b})$, define their correlation coefficient as $\rho(\mathbf{a},\mathbf{b}):=\frac{\langle\mathbf{a},\mathbf{b}\rangle}{\|\mathbf{a}\|_2\|\mathbf{b}\|_2}$. The correlation coefficient between the estimated and true parameters is

\alpha(\xi_{0},\bm{\mu}^{\bot}):=\rho(\bm{\theta}_{0},\bm{\mu})=\frac{1+\frac{\sigma}{\sqrt{n}}\xi_{0}}{\sqrt{(1+\frac{\sigma}{\sqrt{n}}\xi_{0})^{2}+\frac{\sigma^{2}}{n}\|\bm{\mu}^{\bot}\|_{2}^{2}}}. \qquad (17)

Let $\beta(\xi_0,\bm{\mu}^{\bot})=\sqrt{1-\alpha(\xi_0,\bm{\mu}^{\bot})^2}$. We abbreviate $\alpha(\xi_0,\bm{\mu}^{\bot})$ and $\beta(\xi_0,\bm{\mu}^{\bot})$ to $\alpha$ and $\beta$, respectively, in the following. Then the normalized vector $\bm{\theta}_0/\|\bm{\theta}_0\|_2$ can be decomposed as follows:

\bar{\bm{\theta}}_{0}:=\frac{\bm{\theta}_{0}}{\|\bm{\theta}_{0}\|_{2}}=\alpha\bm{\mu}+\beta\bm{\upsilon}, \qquad (18)

where $\bm{\upsilon}=\bm{\mu}^{\bot}/\|\bm{\mu}^{\bot}\|_2$. Let $\bar{\bm{\theta}}_0^{\bot}:=(2\beta^2\bm{\mu}-2\alpha\beta\bm{\upsilon})/\sigma$, which is a vector perpendicular to $\bar{\bm{\theta}}_0$.

Let $\mathrm{Q}(\cdot):=1-\Phi(\cdot)$. Define the correlation evolution function $F_\sigma:[-1,1]\to[-1,1]$ that quantifies the increase in the correlation (between the current model parameter and the optimal one) and the improvement in the generalization error as the iteration counter increases from $t$ to $t+1$:

F_{\sigma}(x) := \frac{J_{\sigma}(x)}{\sqrt{J_{\sigma}^{2}(x)+K_{\sigma}^{2}(x)}},\quad\mbox{where} \qquad (19)
J_{\sigma}(x) := 1-2\mathrm{Q}\Big(\frac{x}{\sigma}\Big)+\frac{2\sigma x}{\sqrt{2\pi}}\exp\Big(-\frac{x^{2}}{2\sigma^{2}}\Big),\quad\mbox{and} \qquad (20)
K_{\sigma}(x) := \frac{2\sigma\sqrt{1-x^{2}}}{\sqrt{2\pi}}\exp\Big(-\frac{x^{2}}{2\sigma^{2}}\Big). \qquad (21)

The $t^{\mathrm{th}}$ iterate of the function $F_\sigma$ is defined recursively as $F_\sigma^{(t)}:=F_\sigma\circ F_\sigma^{(t-1)}$ with $F_\sigma^{(0)}(x)=x$. As shown in Figure 2, for any fixed $\sigma$, we can see that $F_\sigma^{(2)}(x)\geq F_\sigma(x)\geq x$ for $x\geq 0$ and $F_\sigma^{(2)}(x)<F_\sigma(x)<x$ for $x<0$. It can also be easily deduced that for any $t\in[0:\tau]$, $F_\sigma^{(t+1)}(x)\geq F_\sigma^{(t)}(x)$ for any $x\geq 0$ and $F_\sigma^{(t+1)}(x)<F_\sigma^{(t)}(x)$ for any $x<0$. This important observation implies that if the correlation $\alpha$, defined in (17), is positive, then $F_\sigma^{(t)}(\alpha)$ increases with $t$; and vice versa. Moreover, as shown in Figure 12 in Appendix C, by varying $\sigma$, we observe that a smaller $\sigma$ results in a larger $|F_\sigma(x)|$.
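The following short numerical sketch evaluates $F_\sigma$ and its iterates as defined in (19)-(21); the starting correlation and the number of iterations below are illustrative choices.

```python
# Numerical sketch of the correlation evolution function F_sigma and its iterates.
import numpy as np
from scipy.stats import norm

def J(x, sigma):
    # J_sigma(x) = 1 - 2 Q(x/sigma) + (2 sigma x / sqrt(2 pi)) exp(-x^2 / (2 sigma^2))
    return (1.0 - 2.0 * norm.sf(x / sigma)
            + 2.0 * sigma * x / np.sqrt(2.0 * np.pi) * np.exp(-x**2 / (2.0 * sigma**2)))

def K(x, sigma):
    # K_sigma(x) = (2 sigma sqrt(1 - x^2) / sqrt(2 pi)) exp(-x^2 / (2 sigma^2))
    return (2.0 * sigma * np.sqrt(1.0 - x**2) / np.sqrt(2.0 * np.pi)
            * np.exp(-x**2 / (2.0 * sigma**2)))

def F(x, sigma):
    # correlation evolution function F_sigma(x) in (19)
    return J(x, sigma) / np.sqrt(J(x, sigma)**2 + K(x, sigma)**2)

# Iterating F_sigma from a positive initial correlation drives it towards 1.
sigma, x = 0.5, 0.2
for t in range(5):
    print(t, round(float(x), 4))
    x = F(x, sigma)
```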

Figure 2: $F_\sigma^{(t)}(x)$ versus $x$ for different $t$ when $\sigma=0.5$.

4.3 Main Theorem

By applying the result in Theorem 2.B, the following theorem provides an exact characterization of the generalization error at each iteration $t$ for $m$ large enough.

Theorem 3 (Exact gen-error for iterative SSL under bGMM).

Fix any $\sigma\in\mathbb{R}_+$ and $d\in\mathbb{N}$. The gen-error at $t=0$ is given by

\mathrm{gen}_{0}(P_{\mathbf{Z}},P_{\mathbf{X}},P_{\bm{\theta}_{0}|S_{\mathrm{l}},S_{\mathrm{u}}})=\frac{d}{n}. \qquad (22)

Let $\alpha=\alpha(\xi_0,\bm{\mu}^{\bot})$. For each $t\in[1:\tau]$, for almost all sample paths (i.e., almost surely),

\mathrm{gen}_{t}(P_{\mathbf{Z}},P_{\mathbf{X}},\{P_{\bm{\theta}_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\bm{\theta}_{k}}\}_{k=0}^{t-1})
  = \mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg[\frac{(m-1)\big(J_{\sigma}^{2}(F_{\sigma}^{(t-1)}(\alpha))+K_{\sigma}^{2}(F_{\sigma}^{(t-1)}(\alpha))\big)}{m\sigma^{2}}-\frac{J_{\sigma}(F_{\sigma}^{(t-1)}(\alpha))}{\sigma^{2}}\bigg]+o(1), \qquad (23)

where $o(1)$ is a term that vanishes as $m\to\infty$.

The proof of Theorem 3 is provided in Appendix C. Several remarks are in order.

First, the gen-error at $t=0$ corresponds to the asymptotic result of supervised maximum likelihood estimation in the works by Akaho and Kappen (2000) and Aminian et al. (2021). We numerically plot the quantity in (23), $g_\sigma^{(m)}(x):=((m-1)(J_\sigma^2(x)+K_\sigma^2(x))-mJ_\sigma(x))/\sigma^2$, for $x\in[-1,1]$ in Figure 12 in Appendix C, which shows that for all $\sigma_1>\sigma_2$, $g_{\sigma_1}^{(m)}(x)>g_{\sigma_2}^{(m)}(x)$ when $x>0$. From (17), we can see that $\alpha$ is close to 1 with high probability, which means that $\sigma\mapsto g_\sigma^{(m)}(x)$ is monotonically increasing in $\sigma$ with high probability. As a result, (23) increases as $\sigma$ increases. This is consistent with the intuition that when the training data have a larger overlap between classes, it is more difficult to generalize well. Moreover, $F_\sigma^{(t)}(\alpha)$ is also close to 1 with high probability, and thus (23) saturates with $t$ quickly.

Second, by ignoring the $o(1)$ term, we compare the theoretical $\mathrm{gen}_t$ (cf. (22) and (23)) and the empirical gen-error from repeated synthetic experiments with $d=2$, $n=10$ and $m=1000$, as shown in Figure 3. It can be seen that the theoretical $\mathrm{gen}_t$ matches the empirical gen-error well, which means that the characterization in (23) serves as a useful rule-of-thumb for how the gen-error changes over the SSL iterations. When the variance is small (e.g., $\sigma^2=0.6^2$), as shown in Figure 3(a), the gen-error decreases significantly from $t=0$ to $t=1$ and then quickly converges to a non-zero constant. Recall the correlation evolution function $F_\sigma$ in (19). Given any pair $(\xi_0,\bm{\mu}^{\bot})$, if $\alpha(\xi_0,\bm{\mu}^{\bot})>0$, then $F_\sigma^{(t)}(\alpha(\xi_0,\bm{\mu}^{\bot}))>F_\sigma^{(t-1)}(\alpha(\xi_0,\bm{\mu}^{\bot}))$ for all $t\in[1:\tau]$, as shown in Figure 2. This means that if the quality of the labelled data $S_{\mathrm{l}}$ is reasonably good, then by using $\bm{\theta}_0$, which is learned from $S_{\mathrm{l}}$, the generated pseudo-labels for the unlabelled data are largely correct. Then the subsequent parameters $\bm{\theta}_t$ for $t\geq 1$, learned from the large number of pseudo-labelled data examples, can improve the generalization error. With a sufficiently large amount of training data, the algorithm converges at a very early stage. In addition, for more general cases (e.g., non-diagonal class covariance matrices), it takes more iterations for the gen-error to reach a plateau, as shown in Figure 4.

When the variance is large (e.g., $\sigma^2=3^2$), as shown in Figure 3(b), the gen-error increases with the iteration $t$. This result shows that when the overlap between different classes is large enough, using the unlabelled data may not improve the generalization performance. The intuition is that at the initial iteration with a limited number of labelled data, the learned parameter $\bm{\theta}_0$ cannot pseudo-label the unlabelled data with sufficiently high accuracy. Thus, the unlabelled data are not labelled well by the pseudo-labelling operation and hence cannot help to improve the generalization error. To gain more insight, in Figure 5, we numerically plot $\mathrm{gen}_1$ versus different values of $\sigma$ under the same setting. It is interesting to find that there exists a $\sigma_0$ such that for $\sigma<\sigma_0$, $\mathrm{gen}_1<\mathrm{gen}_0$, which means that the gen-error can be reduced with the help of abundant unlabelled data, while for $\sigma>\sigma_0$, using the unlabelled data can even harm the generalization performance.
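For completeness, the self-contained sketch below evaluates the theoretical gen-error in (22) and (23) by Monte Carlo over $(\xi_0,\bm{\mu}^{\bot})$, ignoring the $o(1)$ term; the Monte Carlo sample size and seed are illustrative assumptions, and the helper functions repeat those of the previous sketch so that the snippet runs on its own.

```python
# Monte Carlo evaluation of the theoretical gen-error in (22)-(23).
import numpy as np
from scipy.stats import norm

def J(x, sigma):
    return (1.0 - 2.0 * norm.sf(x / sigma)
            + 2.0 * sigma * x / np.sqrt(2.0 * np.pi) * np.exp(-x**2 / (2.0 * sigma**2)))

def K(x, sigma):
    return (2.0 * sigma * np.sqrt(1.0 - x**2) / np.sqrt(2.0 * np.pi)
            * np.exp(-x**2 / (2.0 * sigma**2)))

def F(x, sigma):
    return J(x, sigma) / np.sqrt(J(x, sigma)**2 + K(x, sigma)**2)

def theoretical_gen(t, d=2, n=10, m=1000, sigma=0.6, num_mc=200000, seed=0):
    """Monte Carlo estimate of (23), ignoring the o(1) term; (22) for t = 0."""
    if t == 0:
        return d / n
    rng = np.random.default_rng(seed)
    xi0 = rng.standard_normal(num_mc)
    mu_perp_sq = rng.chisquare(d - 1, size=num_mc)   # ||mu_perp||_2^2 for a unit mu
    a = 1.0 + sigma / np.sqrt(n) * xi0
    alpha = np.clip(a / np.sqrt(a**2 + sigma**2 / n * mu_perp_sq), -1.0, 1.0)
    x = alpha
    for _ in range(t - 1):                           # F_sigma^{(t-1)}(alpha)
        x = F(x, sigma)
    val = ((m - 1) * (J(x, sigma)**2 + K(x, sigma)**2) / (m * sigma**2)
           - J(x, sigma) / sigma**2)
    return float(val.mean())

for t in range(4):
    print(t, round(theoretical_gen(t), 4))
```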

Figure 3: Comparison of the theoretical $\mathrm{gen}_t$ and the empirical gen-error at each iteration $t$: (a) $\sigma=0.6$; (b) $\sigma=3$.
Figure 4: Empirical gen-error with covariance matrix $\sigma^2\times[0.6,0.3;0.3,0.8]$.
Figure 5: Theoretical gen-error at $t=1$ versus different std. devs. $\sigma\in[0.1,3]$.

Third, let us examine the effect of $n$, the number of labelled training samples. By expanding $\alpha$, defined in (17), using a Taylor series, we have

\alpha=1-\frac{\sigma^{2}}{2n}\|\bm{\mu}^{\bot}\|_{2}^{2}+o\bigg(\frac{1}{n}\bigg). \qquad (24)
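To see how (24) arises, divide the numerator and denominator of (17) by $1+\frac{\sigma}{\sqrt{n}}\xi_0$ and apply $(1+u)^{-1/2}=1-\frac{u}{2}+o(u)$:

\alpha=\bigg(1+\frac{\sigma^{2}\|\bm{\mu}^{\bot}\|_{2}^{2}}{n\big(1+\frac{\sigma}{\sqrt{n}}\xi_{0}\big)^{2}}\bigg)^{-1/2}=1-\frac{\sigma^{2}}{2n}\|\bm{\mu}^{\bot}\|_{2}^{2}+o\bigg(\frac{1}{n}\bigg),

since $\big(1+\frac{\sigma}{\sqrt{n}}\xi_{0}\big)^{-2}=1+O(1/\sqrt{n})$ almost surely.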

It can be seen that as $n$ increases, $\alpha$ converges to $1$ in probability. Suppose the dimension $d=2$ and $\bm{\mu}=(1,0)$. Then $\bm{\mu}^{\bot}=[0,\mu_2^{\bot}]$, where $\mu_2^{\bot}\sim\mathcal{N}(0,1)$. By letting $m\to\infty$, the gen-error $\mathrm{gen}_1$ (cf. (23)) can be rewritten as

\mathrm{gen}_{1}\approx\int_{-\sqrt{2}}^{\sqrt{2}}\sqrt{\frac{n}{\pi\sigma^{2}}}\,e^{-\frac{ny^{2}}{\sigma^{2}}}\,g_{\sigma}^{(\infty)}(1-y^{2})~\mathrm{d}y, \qquad (25)

where $g_\sigma^{(\infty)}(1-y^2):=(J_\sigma^2(1-y^2)+K_\sigma^2(1-y^2)-J_\sigma(1-y^2))/\sigma^2$, and thus $\mathrm{gen}_1$ is a decreasing function of $n$. We further deduce that for any $t$, $\mathrm{gen}_t$ is decreasing in $n$.

Fourth, we consider an “enhanced” scenario in which the labelled data in $S_{\mathrm{l}}$ are reused in each iteration. Set $w=\frac{n}{n+m}$ in (4). We can extend Theorem 3 to Corollary 10, provided in Appendix F. It can be seen from Figure 17 that $\mathrm{gen}_t$ still decreases from $t=0$ to $1$ and saturates afterwards. We find that when $\sigma=0.6$, $n=10$ and $m=1000$, the gen-error is almost the same as that in Figure 3(a), which means that for large enough $\frac{m}{n}$, reusing the labelled data does not necessarily help to improve the generalization performance. Moreover, when $m=100$, $\mathrm{gen}_t$ is higher than that for $m=1000$, which coincides with the intuition that increasing the number of unlabelled data helps to reduce the generalization error.

Fifth, it is natural to wonder what the effect is when $m$, the number of unlabelled data examples, is held fixed and $n$, the number of labelled data examples, increases. In Figure 6, we numerically plot $\mathrm{gen}_0=\frac{d}{n}$ in (22) and the theoretical $\mathrm{gen}_1$ in (23) for $n$ ranging from $2$ to $50$, $m=1000$, $\sigma=0.6$ and $d=2$. As $n$ increases, $\mathrm{gen}_0$ and $\mathrm{gen}_1$ both decrease, which is as expected. However, when $n$ is larger than a certain value ($30$ in this case), we find that $\mathrm{gen}_0$ becomes smaller than $\mathrm{gen}_1$. This implies that with sufficiently many labelled training data, the generalization error based on the labelled training data is already sufficiently low, and incorporating the pseudo-labelled data in fact adversely affects the generalization error. Understanding this phenomenon precisely is an interesting avenue for future work.

Figure 6: $\mathrm{gen}_0$ vs. $\mathrm{gen}_1$ for different $n$.

Finally, to verify the validity of the gen-error upper bound in Theorem 2.A, we further apply the bound to this setup and prove that the upper bound exhibits behaviour similar to the evolution of the gen-error as $t$ increases. See Appendix D.

Figure 7: Empirical gen-error versus $t$ for $\sigma=3$ for different $\lambda$.
Figure 8: Theoretical and empirical $\mathrm{gen}_1$ vs. $\lambda$ for different $\sigma$.

5 Improving the Gen-Error for Difficult Problems via Regularization

In Section 4.3, it is shown that for difficult classification problems with large class conditional variance, the gen-error increases after using pseudo-labelled data. The reason is that the learned initial parameter $\bm{\theta}_0$ can only generate pseudo-labels of low accuracy, and thus the pseudo-labelled data cannot help improve the generalization performance. In this section, we prove that by adding regularization to the loss function, we can mitigate the undesirable increase of the gen-error across the pseudo-labelling iterations.

Since $\mathrm{gen}_0$ in (22) does not depend on the data variance $\sigma^2$, here we focus on the subsequent iterations $t\in[1:\tau]$. By considering $\ell_2$ regularization (i.e., adding $\frac{\lambda}{2}\|\bm{\theta}\|_2^2$ to (15)), we obtain the new parameters (cf. (16)) as follows:

\bm{\theta}_{t}^{\mathrm{reg}}=\frac{\bm{\theta}_{t}}{1+\sigma^{2}\lambda},\quad\forall\,t\in[1:\tau]. \qquad (26)
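For intuition, (26) follows by setting the gradient of the $\ell_2$-regularized version of (15) to zero and using $(\hat{Y}'_i)^2=1$:

\Big(\frac{1}{\sigma^{2}}+\lambda\Big)\bm{\theta}-\frac{1}{\sigma^{2}m}\sum_{i\in\mathcal{I}_{t}}\hat{Y}'_{i}\mathbf{X}'_{i}=\mathbf{0}\;\Longrightarrow\;\bm{\theta}_{t}^{\mathrm{reg}}=\frac{\frac{1}{m}\sum_{i\in\mathcal{I}_{t}}\hat{Y}'_{i}\mathbf{X}'_{i}}{1+\sigma^{2}\lambda}=\frac{\bm{\theta}_{t}}{1+\sigma^{2}\lambda}.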

The derivations are provided in Appendix H. By applying Theorem 3, the following theorem provides a characterization of the gen-error for the case with regularization at each iteration $t$ for $m$ large enough. Let $\mathrm{gen}_t^{\mathrm{reg}}$ denote the gen-error in the case with regularization; we drop the fixed quantities $(P_{\mathbf{Z}},P_{\mathbf{X}},\{P_{\bm{\theta}_k|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\bm{\theta}_k}\}_{k=0}^{t-1})$ for notational simplicity.

Theorem 4 (Gen-error with regularization).

Fix any $d\in\mathbb{N}$ and $\sigma,\lambda\in\mathbb{R}_+$. The gen-error at any $t\in[1:\tau]$ is

\mathrm{gen}_{t}^{\mathrm{reg}}=\frac{\mathrm{gen}_{t}}{1+\sigma^{2}\lambda}. \qquad (27)

The proof of Theorem 4 is provided in Appendix H. From (27), we observe that as $\lambda$ increases, the gen-error decreases. In Figure 7, we first empirically show that regularization can help mitigate the increase of the gen-error during the SSL iterations for hard-to-distinguish classes by comparing the empirical gen-error under $\lambda=0,0.1,0.5,2$ when $\sigma=3$ and $d=2$. Then in Figure 8, we plot the theoretical gen-error in (27) versus $\lambda$ at $t=1$ for the cases with small and large variances, i.e., $\sigma^2=0.6^2$ and $\sigma^2=3^2$. We also compare the theoretical results with the empirical gen-errors, which turn out to corroborate the theoretical ones. For the case with smaller variance, the improvement in the gen-error is barely visible as $\lambda$ increases. For the case with larger variance, the decrease in the gen-error is more pronounced, which implies that $\ell_2$-regularization can effectively mitigate the impact on the gen-error induced by large class overlap and pseudo-labels with low accuracy.

The adept reader might naturally wonder why one would not set $\lambda\to\infty$ in (27), which results in the gen-error tending to zero, presumably a desirable phenomenon. However, what we ultimately wish to control is the expected population risk, which, according to (7), is the sum of the expected empirical risk on the training data and the gen-error. Even if the gen-error is zero, the expected population risk might be large. Hence, as $\lambda$ increases, we see a tradeoff between the gen-error and the empirical risk.

Classes            RGB-mean $\ell_2$ distance   RGB-variance $\ell_2$ distance   Difficulty
horse-ship         0.0180                       3.90e-05                         Easy
automobile-truck   0.0038                       7.06e-05                         Moderate
cat-dog            0.0007                       4.95e-05                         Challenging
Table 1: The $\ell_2$ distances between the RGB-means and RGB-variances of different pairs of classes from the CIFAR10 dataset.
Classes      horse→ship   ship→horse   automobile→truck   truck→automobile   cat→dog   dog→cat
Number       17           3            61                 64                 93        137
Difficulty   Easy                      Moderate                              Challenging
Table 2: Number of images misclassified out of 1000 (Liu and Mukhopadhyay, 2018).

6 Experiments on Benchmark Datasets

To further illustrate that our theory indeed underpins the empirical behaviour of iterative self-training with pseudo-labelling, in this section we conduct experiments on real-world benchmark datasets, which demonstrate that our theoretical results for the bGMM example also reflect the training dynamics of more realistic real-world tasks. The code to reproduce all the experiments can be found at https://github.com/HerianHe/GenErrorSSL_2022.git.

Recall that in the bGMM example, a higher standard deviation $\sigma$ represents a higher in-class variance, larger class-overlap, and consequently higher difficulty in classification. By a whitening argument, this also holds for bGMMs with non-isotropic covariance matrices. In our experiments on real-world data, we use the difficulty level of classification to mimic different in-class variances of the bGMM. We pick two easy-to-distinguish class pairs (“automobile” and “truck”, “horse” and “ship”) from the CIFAR-10 dataset (Krizhevsky, 2009) as an analogy to the bGMM with small in-class variance, and one difficult-to-distinguish class pair (“cat” and “dog”) from the same dataset as an analogy to the bGMM with large in-class variance. Furthermore, to extend the analogy to multi-class classification, we conduct experiments on the 10-class MNIST dataset to gain more intuition.

We train deep neural networks (DNNs) via an iterative self-learning strategy (under the same setting as Figure 1) to perform binary and multi-class classification. In the first iteration, we only use a few labelled data examples to initialize the DNN with a sufficient number of epochs. In the subsequent iterations, we first sample a subset of unlabelled data and generate pseudo-labels for them via the model trained from the previous iteration. Then we update the model for a small number of epochs with both the labelled and pseudo-labelled data.

Experimental settings: For binary classification, we collect pairs of classes of images, i.e., “automobile” and “truck”, “horse” and “ship”, and “cat” and “dog”, from the CIFAR10 (Krizhevsky, 2009) dataset. In this dataset, each class has 5000 images for training and 1000 images for testing. For each selected pair of classes, we manually divide the 10000 training images into two sets: the labelled training set with 500 images and the unlabelled training set with 9500 images. We train a convolutional neural network, ResNet-10 (He et al., 2016), and use the stochastic gradient descent (SGD) optimizer to minimize the cross-entropy loss. In PyTorch, the cross-entropy loss is defined as the negative logarithm of the output softmax probability corresponding to the true class, which is analogous to the NLL of the data under the parameters of the neural network. For the task with $\ell_2$-regularization, we train the neural network by setting different weight decay parameters (equivalent to $\lambda\times$ learning rate). In each pseudo-labelling iteration, we sample 2500 unlabelled images. The complete training procedure lasts for 50 self-training iterations.
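For concreteness, the PyTorch-style sketch below illustrates one pseudo-labelling iteration of this procedure; the model, the data tensors and the hyperparameters (batch size, epochs, learning rate) are illustrative assumptions rather than the exact values used in our experiments.

```python
# A PyTorch-style sketch of one self-training iteration.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def pseudo_label(model, X_u, device="cpu"):
    """Assign pseudo-labels to a sampled batch of unlabelled images."""
    model.eval()
    with torch.no_grad():
        logits = model(X_u.to(device))
        return logits.argmax(dim=1).cpu()

def refine(model, X_l, y_l, X_u, y_hat, epochs=5, lr=0.01, weight_decay=0.0):
    """Update the model on labelled + pseudo-labelled data with the cross-entropy loss."""
    data = TensorDataset(torch.cat([X_l, X_u]), torch.cat([y_l, y_hat]))
    loader = DataLoader(data, batch_size=128, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = criterion(model(xb), yb)
            loss.backward()
            opt.step()
    return model
```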

We further validate our theoretical contributions on a multi-class classification problem in which we train a ResNet-6 model with the cross-entropy loss to perform 10-class handwritten digit classification on the MNIST (LeCun et al., 1998) dataset. We sample 51000 images from the training set, which contains 6000 images for each of the ten classes. We divide them into two sets, i.e., a labelled training set with 1000 images and an unlabelled set with 50000 images. The optimizer and training iterations follow those in the aforementioned binary classification tasks without regularization.

Figure 9: (a)–(c) & (e)–(g): easier-to-distinguish classes “horse” vs. “ship” and “automobile” vs. “truck”; (d), (h), (i)–(k): harder-to-distinguish classes “cat” vs. “dog”. Panels: (a) “horse-ship”: gen-error; (b) “automobile-truck”: gen-error; (c) MNIST: gen-error; (d) “cat-dog”: gen-error; (e) “horse-ship”: accuracy; (f) “automobile-truck”: accuracy; (g) MNIST: accuracy; (h) “cat-dog”: accuracy; (i) “cat-dog”: gen-error with weight decay set to $5\times 10^{-4}$; (j) “cat-dog”: accuracy with weight decay set to $5\times 10^{-4}$; (k) “cat-dog”: gen-error after convergence versus weight decay.

Experimental observations: We perform each experiment 3 times and report the average test and training (cross-entropy) losses, the gen-error, and the test and training accuracies in Figure 9. To illustrate the difficulty level of classification for each pair, we first calculate the mean and variance of the RGB (i.e., the red-green-blue color) values of the images to quantify the difference between the images of the two classes. In Table 1, we display the $\ell_2$ distances between the RGB-means and between the RGB-variances of the test data for pairs of classes taken from the CIFAR10 dataset. We observe that the RGB-variance distances of each pair are almost 0 (and small compared to the RGB-mean $\ell_2$ distances), and thus, the RGB-mean $\ell_2$ distance is indicative of the difficulty of the classification task. Indeed, a smaller RGB-mean $\ell_2$ distance implies a higher overlap of the two classes and consequently, greater difficulty in distinguishing them. Therefore, the “cat-dog” pair, which is more difficult to disambiguate compared to the “horse-ship” and “automobile-truck” pairs, is analogous to the bGMM with large variance (i.e., large overlap between the positive and negative classes). Furthermore, in Table 2, we quote the commonly-used confusion matrix for the CIFAR10 dataset in (Liu and Mukhopadhyay, 2018, Fig. 7), which quantifies how many out of 1000 images of each class are misclassified to any other class. It is obvious that fewer misclassified images indicate lower classification difficulty, which corresponds with Table 1. These two tables provide an indication of the level of difficulty in distinguishing different pairs of classes.

In Figures 9(a)–9(c), for easier-to-distinguish classes (based on the high classification accuracy and low loss, as well as the two tables above), the gen-error decreases noticeably in the early training iterations and then fluctuates around a constant value. For example, in Figure 9(a), the gen-error converges to around 0.25 after 5 iterations; in Figure 9(b), it converges to around 0.45 after 5 iterations. For the multi-class classification of MNIST in Figure 9(c), the gen-error also converges to around 0.25 after 20 iterations. These results corroborate the theoretical and empirical analyses in the bGMM case with small variance, which again verifies that Theorem 3 and Corollary 10 can shed light on the empirical gen-error on benchmark datasets. It also reveals that the generalization performance of iterative self-training on real datasets with relatively distinguishable classes can be quickly improved with the help of unlabelled data. In Figures 9(e), 9(f) and 9(g), we also show that the test accuracy increases with the iterations and improves significantly compared to the initial iteration, in which only labelled data are used.

In Figures 9(d) and 9(h), we perform another binary classification experiment on the harder-to-distinguish pair, “cat” and “dog” (based on the low accuracy at the initial point, as well as the two tables above). We observe that the gen-error (and the test loss) does not decrease across the self-training iterations even though the test accuracy increases. This again corroborates the result in Figure 3(b) for the bGMM with large variance. The fact that both the test loss and test accuracy appear to increase with the iterations is, in fact, not contradictory. To see this intuitively, consider binary classification using the softmax (hence, logistic) function to predict the output classes: whenever the learned probability p\in(1/2,1] of a data example belonging to its true class exceeds 1/2, the classification is correct, i.e., the accuracy is 100%. However, as p (i.e., the classification confidence) decreases towards (1/2)^{+}, the corresponding decision margin 2p-1 (Cao et al., 2019) also decreases and the test loss -\log p increases commensurately. Thus, when the decision margin is small, even though the test accuracy may increase as the iteration counter increases, the test loss may also increase at the same time; this reflects our lack of confidence.
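A tiny numerical illustration of this point: any confidence p>1/2 yields a correct prediction, yet the loss -\log p grows as the margin 2p-1 shrinks.

```python
import math

# Correct classification whenever p > 1/2, but the loss -log p grows as the
# confidence p shrinks towards 1/2, i.e., as the decision margin 2p - 1 shrinks.
for p in [0.99, 0.9, 0.7, 0.55, 0.51]:
    margin = 2 * p - 1
    loss = -math.log(p)
    print(f"p = {p:.2f}  correct = {p > 0.5}  margin = {margin:.2f}  loss = {loss:.3f}")
```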

We further investigate the effect of \ell_{2}-regularization on “cat-dog” classification. In Figures 9(i) and 9(j), we show that by setting the weight decay parameter to 0.0005, the increase of the gen-error for the “cat-dog” classification task can be mitigated and the test accuracy is improved by 0.6% as well; compare these to Figures 9(d) and 9(h). In Figure 9(k), we plot the average gen-error over the last 10 iterations versus the weight decay parameter; it decreases as the weight decay increases (compare with Figure 8). In summary, the above observations correspond to those for the bGMM, namely that the unlabelled data do not always help to improve the gen-error, but adding regularization can help to compensate for the undesirable impact.

Figure 10: Comparison of the gen-error for the “horse” and “ship” classification task

Furthermore, we study the effect of reusing all the unlabelled data at each iteration. Under the same experimental setup as above, we conduct an additional experiment on the “horse-ship” pair in the CIFAR-10 dataset by using all 9500 unlabelled images in each iteration. The self-training procedure lasts for 10 iterations. Figure 10 compares the gen-error of this additional experiment with that of the corresponding experiment in Figure 9(a). We find that when all the unlabelled data are used at once, the gen-error across the pseudo-labelling iterations is even higher than that for our original setup. This can possibly be attributed to overfitting.

7 Concluding Remarks and Future Work

In this paper, we have analyzed the gen-error of iterative SSL algorithms that pseudo-label large amounts of unlabelled data to progressively refine the parameters of a given model. We particularized the general bounds and the exact expression for the gen-error to the bGMM to gain theoretical insight into the problem. These results were then corroborated by experiments on benchmark datasets. The theoretical analyses and experimental results reinforce the main message of this paper, namely that in the low-class-overlap or easy-to-classify scenario, pseudo-labelling can help to reduce the gen-error, whereas in the high-class-overlap or difficult-to-classify scenario, pseudo-labelling can in fact hurt. Thus, the key takeaway from our paper is that practitioners should be judicious in adopting pseudo-labelling techniques, for they may degrade the overall performance.

There are three avenues for future research. First, our analytical results are only applicable to the bGMM. This yields valuable insights, but the model is admittedly restrictive. Generalizing our analyses to other statistical models for classification such as logistic regression will be instructive. Secondly, our work focuses on the gen-error. Often bounds on the population risk are desired as the population risk is the key determinant of the performance of classification algorithms. Bounding the population risk in the SSL setting would thus be interesting. Finally, analyzing other families of SSL algorithms beyond those that utilize pseudo-labelling would provide a clearer theoretical picture about the utility of SSL.

Acknowledgements

We sincerely thank the reviewers and action editor for their meticulous reading and insightful comments that led to a significantly improved version of the paper.

This research/project is supported by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-RP-2020-018), by an NRF Fellowship (A-0005077-01-00), and by Singapore Ministry of Education (MOE) AcRF Tier 1 Grants (A-0009042-01-00 and A-8000189-01-00).

Appendices

Appendix A Proof of Theorem 2.A

We commence with some notation. For any convex function \psi:[0,b)\mapsto\mathbb{R}, its Legendre dual \psi^{*} is defined as \psi^{*}(x):=\sup_{\lambda\in[0,b)}\lambda x-\psi(\lambda) for all x\in[0,\infty). According to Boucheron et al. (2013, Lemma 2.4), when \psi(0)=\psi^{\prime}(0)=0, \psi^{*}(x) is a nonnegative, convex and nondecreasing function on [0,\infty). Moreover, for every y\geq 0, its generalized inverse \psi^{*-1}(y):=\inf\{x\geq 0:\psi^{*}(x)\geq y\} is concave and can be rewritten as \psi^{*-1}(y)=\inf_{\lambda\in[0,b)}\frac{y+\psi(\lambda)}{\lambda}.

We first introduce the following theorem that is applicable to more general loss functions.

Theorem 5.

For any \tilde{\theta}_{t}\in\Theta, let \psi_{-}(\lambda,\tilde{\theta}_{t}) and \psi_{+}(\lambda,\tilde{\theta}_{t}) be convex functions of \lambda with \psi_{+}(0,\tilde{\theta}_{t})=\psi_{+}^{\prime}(0,\tilde{\theta}_{t})=\psi_{-}(0,\tilde{\theta}_{t})=\psi_{-}^{\prime}(0,\tilde{\theta}_{t})=0. Assume that \Lambda_{l(\tilde{\theta}_{t},\tilde{Z})}(\lambda,\tilde{\theta}_{t})\leq\psi_{+}(\lambda,\tilde{\theta}_{t}) for all \lambda\in[0,b_{+}) and \Lambda_{l(\tilde{\theta}_{t},\tilde{Z})}(\lambda,\tilde{\theta}_{t})\leq\psi_{-}(\lambda,\tilde{\theta}_{t}) for \lambda\in(b_{-},0] under the distribution P_{\tilde{Z}|\theta^{(t-1)}}=P_{Z}, where 0<b_{+}\leq\infty and -\infty\leq b_{-}<0. Let \psi_{+}(\lambda)=\sup_{\tilde{\theta}_{t}}\psi_{+}(\lambda,\tilde{\theta}_{t}) and \psi_{-}(\lambda)=\sup_{\tilde{\theta}_{t}}\psi_{-}(\lambda,\tilde{\theta}_{t}). We have

gent(PZ,PX,{Pθk|Sl,Su}k=0t,{fθk}k=0t1)\displaystyle\mathrm{gen}_{t}(P_{Z},P_{X},\{P_{\theta_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\theta_{k}}\}_{k=0}^{t-1})
wni=1n𝔼θ(t1)[ψ1(Iθ(t1)(θt;Zi))]\displaystyle\leq\frac{w}{n}\sum_{i=1}^{n}\mathbb{E}_{\theta^{(t-1)}}\left[\psi_{-}^{*-1}(I_{\theta^{(t-1)}}(\theta_{t};Z_{i}))\right]
+1wmit𝔼θ(t1)[ψ1(Iθ(t1)(θt;Xi,Y^i)+Dθ(t1)(PXi,Y^iPZ))],\displaystyle\quad+\frac{1-w}{m}\sum_{i\in\mathcal{I}_{t}}\mathbb{E}_{\theta^{(t-1)}}\left[\psi_{-}^{*-1}\big{(}I_{\theta^{(t-1)}}(\theta_{t};X_{i}^{\prime},\hat{Y}_{i}^{\prime})+D_{\theta^{(t-1)}}(P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}}\|P_{Z})\big{)}\right], (28)

and

gent(PZ,PX,{Pθk|Sl,Su}k=0t,{fθk}k=0t1)\displaystyle-\mathrm{gen}_{t}(P_{Z},P_{X},\{P_{\theta_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\theta_{k}}\}_{k=0}^{t-1})
wni=1n𝔼θ(t1)[ψ+1(Iθ(t1)(θt;Zi))]\displaystyle\leq\frac{w}{n}\sum_{i=1}^{n}\mathbb{E}_{\theta^{(t-1)}}\left[\psi_{+}^{*-1}(I_{\theta^{(t-1)}}(\theta_{t};Z_{i}))\right]
+1wmit𝔼θ(t1)[ψ+1(Iθ(t1)(θt;Xi,Y^i)+Dθ(t1)(PXi,Y^iPZ))],\displaystyle\quad+\frac{1-w}{m}\sum_{i\in\mathcal{I}_{t}}\mathbb{E}_{\theta^{(t-1)}}\left[\psi_{+}^{*-1}\big{(}I_{\theta^{(t-1)}}(\theta_{t};X_{i}^{\prime},\hat{Y}_{i}^{\prime})+D_{\theta^{(t-1)}}(P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}}\|P_{Z})\big{)}\right], (29)

where P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t-1)}}(x,y|\hat{\theta}^{(t-1)})=P_{X}(x)\text{1}\{y=f_{\hat{\theta}_{t-1}}(x)\} for any x\in\mathcal{X}, y\in\mathcal{Y} and \hat{\theta}^{(t-1)}\in\Theta^{t-1}, and P_{Z|\theta^{(t-1)}}=P_{Z}.

Proof  Consider the Donsker–Varadhan variational representation of the KL-divergence between any two distributions P and Q on \mathcal{X}:

D(PQ)=supg𝒢{𝔼XP[g(X)]log𝔼XQ[eg(X)]},\displaystyle D(P\|Q)=\sup_{g\in\mathcal{G}}\Big{\{}\mathbb{E}_{X\sim P}[g(X)]-\log\mathbb{E}_{X\sim Q}[e^{g(X)}]\Big{\}}, (30)

where the supremum is taken over the set of measurable functions \mathcal{G}=\{g:\mathcal{X}\mapsto\mathbb{R}:\mathbb{E}_{X\sim Q}[e^{g(X)}]<\infty\}.
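As a quick illustrative aside (not part of the proof), the representation (30) can be checked numerically: for P=\mathcal{N}(1,1) and Q=\mathcal{N}(0,1), D(P\|Q)=1/2, the optimal g=\log\frac{\mathrm{d}P}{\mathrm{d}Q} attains it, and any other g gives a strictly smaller lower bound. A minimal Monte Carlo sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
# Donsker-Varadhan check for P = N(1,1), Q = N(0,1), where D(P||Q) = 1/2.
xp = rng.normal(1.0, 1.0, 1_000_000)   # samples from P
xq = rng.normal(0.0, 1.0, 1_000_000)   # samples from Q

def dv_lower_bound(g):
    return g(xp).mean() - np.log(np.mean(np.exp(g(xq))))

g_opt = lambda x: x - 0.5          # log dP/dQ, the optimal test function
g_sub = lambda x: 0.5 * x          # a sub-optimal test function

print("true KL               :", 0.5)
print("DV bound, optimal g   :", dv_lower_bound(g_opt))   # ~0.5
print("DV bound, suboptimal g:", dv_lower_bound(g_sub))   # strictly smaller (~0.375)
```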

Recall that \tilde{\theta}_{t} and \tilde{Z} are independent copies of \theta_{t} and Z, respectively, such that P_{\tilde{\theta}_{t},\tilde{Z}}=Q_{\theta_{t}}\otimes P_{Z} and P_{\tilde{\theta}_{t},\tilde{Z}|\theta^{(t-1)}}=P_{\theta_{t}|\theta^{(t-1)}}\otimes P_{Z}. For any iterative SSL algorithm, by applying the law of total expectation, the generalization error can be rewritten as

gent(PZ,PX,{Pθk|Sl,Su}k=0t,{fθk}k=0t1)\displaystyle\mathrm{gen}_{t}(P_{Z},P_{X},\{P_{\theta_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\theta_{k}}\}_{k=0}^{t-1})
=w(𝔼θt[𝔼Z[l(θt,Z)]]1ni=1n𝔼θt,Zi[l(θt,Zi)])\displaystyle=w\bigg{(}\mathbb{E}_{\theta_{t}}[\mathbb{E}_{Z}[l(\theta_{t},Z)]]-\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{\theta_{t},Z_{i}}[l(\theta_{t},Z_{i})]\bigg{)}
+(1w)(𝔼θt[𝔼Z[l(θt,Z)]]1mit𝔼θt,Xi,Y^i[l(θt,(Xi,Y^i))])\displaystyle\quad+(1-w)\bigg{(}\mathbb{E}_{\theta_{t}}[\mathbb{E}_{Z}[l(\theta_{t},Z)]]-\frac{1}{m}\sum_{i\in\mathcal{I}_{t}}\mathbb{E}_{\theta_{t},X^{\prime}_{i},\hat{Y}^{\prime}_{i}}[l(\theta_{t},(X^{\prime}_{i},\hat{Y}^{\prime}_{i}))]\bigg{)} (31)
=wni=1n(𝔼θ~t,Z~[l(θ~t,Z~)]𝔼θt,Zi[l(θt,Zi)])\displaystyle=\frac{w}{n}\sum_{i=1}^{n}\bigg{(}\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}[l(\tilde{\theta}_{t},\tilde{Z})]-\mathbb{E}_{\theta_{t},Z_{i}}[l(\theta_{t},Z_{i})]\bigg{)}
+1wmit(𝔼θ~t,Z~[l(θ~t,Z~)]𝔼θt,Xi,Y^i[l(θt,(Xi,Y^i))])\displaystyle\quad+\frac{1-w}{m}\sum_{i\in\mathcal{I}_{t}}\bigg{(}\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}[l(\tilde{\theta}_{t},\tilde{Z})]-\mathbb{E}_{\theta_{t},X^{\prime}_{i},\hat{Y}^{\prime}_{i}}[l(\theta_{t},(X^{\prime}_{i},\hat{Y}^{\prime}_{i}))]\bigg{)} (32)
=wni=1n𝔼θ(t1)[𝔼θ~t,Z~[l(θ~t,Z~)|θ(t1)]𝔼θt,Zi[l(θt,Zi)|θ(t1)]]\displaystyle=\frac{w}{n}\sum_{i=1}^{n}\mathbb{E}_{\theta^{(t-1)}}\Big{[}\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}\big{[}l(\tilde{\theta}_{t},\tilde{Z})|\theta^{(t-1)}\big{]}-\mathbb{E}_{\theta_{t},Z_{i}}\big{[}l(\theta_{t},Z_{i})|\theta^{(t-1)}\big{]}\Big{]}
+1wmit𝔼θ(t1)[𝔼θ~t,Z~[l(θ~t,Z~)|θ(t1)]𝔼θt,Xi,Y^i[l(θt,(Xi,Y^i))|θ(t1)]].\displaystyle\quad+\frac{1-w}{m}\sum_{i\in\mathcal{I}_{t}}\mathbb{E}_{\theta^{(t-1)}}\Big{[}\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}\big{[}l(\tilde{\theta}_{t},\tilde{Z})|\theta^{(t-1)}\big{]}-\mathbb{E}_{\theta_{t},X^{\prime}_{i},\hat{Y}^{\prime}_{i}}\big{[}l(\theta_{t},(X^{\prime}_{i},\hat{Y}^{\prime}_{i}))|\theta^{(t-1)}\big{]}\Big{]}. (33)

Note that \psi_{+}(\lambda)=\sup_{\tilde{\theta}_{t}}\psi_{+}(\lambda,\tilde{\theta}_{t}) and \psi_{-}(\lambda)=\sup_{\tilde{\theta}_{t}}\psi_{-}(\lambda,\tilde{\theta}_{t}) are convex, and so their Legendre duals \psi_{-}^{*}, \psi_{+}^{*} and the corresponding inverses are well-defined.

Let \check{l}(\theta,z)=l(\theta,z)-\mathbb{E}_{Z}[l(\theta,Z)] and note that \mathbb{E}_{\tilde{Z}}[\check{l}(\tilde{\theta}_{t},\tilde{Z})]=0 for any \tilde{\theta}_{t}. Again, by the Donsker–Varadhan variational representation of the KL-divergence, for any fixed \theta^{(t-1)} and any \lambda\in[0,b_{+}), we have

Iθ(t1)(θt;Z)=D(Pθt,Z|θ(t1)Pθt|θ(t1)PZ)\displaystyle I_{\theta^{(t-1)}}(\theta_{t};Z)=D(P_{\theta_{t},Z|\theta^{(t-1)}}\|P_{\theta_{t}|\theta^{(t-1)}}\otimes P_{Z})
𝔼θt,Z[λlˇ(θt,Z)|θ(t1)]log𝔼θ~t,Z~[eλlˇ(θ~t,Z~)|θ(t1)]\displaystyle\geq\mathbb{E}_{\theta_{t},Z}[\lambda\check{l}(\theta_{t},Z)|\theta^{(t-1)}]-\log\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}[e^{\lambda\check{l}(\tilde{\theta}_{t},\tilde{Z})}|\theta^{(t-1)}] (34)
=𝔼θt,Z[λlˇ(θt,Z)|θ(t1)]log𝔼θ~t|θ(t1)𝔼Z~[eλlˇ(θ~t,Z~)]\displaystyle=\mathbb{E}_{\theta_{t},Z}[\lambda\check{l}(\theta_{t},Z)|\theta^{(t-1)}]-\log\mathbb{E}_{\tilde{\theta}_{t}|\theta^{(t-1)}}\mathbb{E}_{\tilde{Z}}\big{[}e^{\lambda\check{l}(\tilde{\theta}_{t},\tilde{Z})}\big{]} (35)
=𝔼θt,Z[λlˇ(θt,Z)|θ(t1)]log𝔼θ~t|θ(t1)[exp(Λl(θ~t,Z~)(λ,θ~t))]\displaystyle=\mathbb{E}_{\theta_{t},Z}[\lambda\check{l}(\theta_{t},Z)|\theta^{(t-1)}]-\log\mathbb{E}_{\tilde{\theta}_{t}|\theta^{(t-1)}}\big{[}\exp\big{(}\Lambda_{l(\tilde{\theta}_{t},\tilde{Z})}(\lambda,\tilde{\theta}_{t})\big{)}\big{]} (36)
λ𝔼θt,Z[l(θt,Z)𝔼Z[l(θt,Z)]|θ(t1)]log𝔼θ~t|θ(t1)[exp(ψ+(λ,θ~t))]\displaystyle\geq\lambda\mathbb{E}_{\theta_{t},Z}[l(\theta_{t},Z)-\mathbb{E}_{Z}[l(\theta_{t},Z)]|\theta^{(t-1)}]-\log\mathbb{E}_{\tilde{\theta}_{t}|\theta^{(t-1)}}\big{[}\exp(\psi_{+}(\lambda,\tilde{\theta}_{t}))\big{]} (37)
λ𝔼θt,Z[l(θt,Z)𝔼Z[l(θt,Z)]|θ(t1)]ψ+(λ)\displaystyle\geq\lambda\mathbb{E}_{\theta_{t},Z}[l(\theta_{t},Z)-\mathbb{E}_{Z}[l(\theta_{t},Z)]|\theta^{(t-1)}]-\psi_{+}(\lambda) (38)
=λ(𝔼θt,Z[l(θt,Z)|θ(t1)]𝔼θ~t,Z~[l(θ~t,Z~)|θ(t1)])ψ+(λ).\displaystyle=\lambda\big{(}\mathbb{E}_{\theta_{t},Z}[l(\theta_{t},Z)|\theta^{(t-1)}]-\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}[l(\tilde{\theta}_{t},\tilde{Z})|\theta^{(t-1)}]\big{)}-\psi_{+}(\lambda). (39)

where (36) follows from the definition of \Lambda_{l(\tilde{\theta}_{t},\tilde{Z})}(\lambda,\tilde{\theta}_{t}) in (49), (37) follows from the assumption that \Lambda_{l(\tilde{\theta}_{t},\tilde{Z})}(\lambda,\tilde{\theta}_{t})\leq\psi_{+}(\lambda,\tilde{\theta}_{t}) for all \lambda\in[0,b_{+}), and (38) follows because \psi_{+}(\lambda)=\sup_{\tilde{\theta}_{t}}\psi_{+}(\lambda,\tilde{\theta}_{t}). Thus, we have

𝔼θt,Z[l(θt,Z)|θ(t1)]𝔼θ~t,Z~[l(θ~t,Z~)|θ(t1)]\displaystyle\mathbb{E}_{\theta_{t},Z}[l(\theta_{t},Z)|\theta^{(t-1)}]-\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}[l(\tilde{\theta}_{t},\tilde{Z})|\theta^{(t-1)}]
infλ[0,b+)Iθ(t1)(θt;Z)+ψ+(λ)λ=ψ+1(Iθ(t1)(θt;Z)).\displaystyle\leq\inf_{\lambda\in[0,b_{+})}\frac{I_{\theta^{(t-1)}}(\theta_{t};Z)+\psi_{+}(\lambda)}{\lambda}=\psi_{+}^{*-1}\big{(}I_{\theta^{(t-1)}}(\theta_{t};Z)\big{)}. (40)

Similarly, for λ(b,0]\lambda\in(b_{-},0],

𝔼θ~t,Z~[l(θ~t,Z~)|θ(t1)]𝔼θt,Z[l(θt,Z)|θ(t1)]\displaystyle\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}[l(\tilde{\theta}_{t},\tilde{Z})|\theta^{(t-1)}]-\mathbb{E}_{\theta_{t},Z}[l(\theta_{t},Z)|\theta^{(t-1)}]
infλ[0,b)Iθ(t1)(θt;Z)+ψ(λ)λ=ψ1(Iθ(t1)(θt;Z)).\displaystyle\leq\inf_{\lambda\in[0,-b_{-})}\frac{I_{\theta^{(t-1)}}(\theta_{t};Z)+\psi_{-}(\lambda)}{\lambda}=\psi_{-}^{*-1}\big{(}I_{\theta^{(t-1)}}(\theta_{t};Z)\big{)}. (41)

By applying the same techniques, for any pair of pseudo-labelled random variables (X^{\prime},\hat{Y}^{\prime}) used at iteration t and any \lambda\in[0,b_{+}), we have

Iθ(t1)(θt;X,Y^)+Dθ(t1)(PX,Y^PZ)\displaystyle I_{\theta^{(t-1)}}(\theta_{t};X^{\prime},\hat{Y}^{\prime})+D_{\theta^{(t-1)}}(P_{X^{\prime},\hat{Y}^{\prime}}\|P_{Z})
=Dθ(t1)(Pθt,X,Y^PθtPX,Y^)+Dθ(t1)(PθtPX,Y^PθtPZ)\displaystyle=D_{\theta^{(t-1)}}(P_{\theta_{t},X^{\prime},\hat{Y}^{\prime}}\|P_{\theta_{t}}\otimes P_{X^{\prime},\hat{Y}^{\prime}})+D_{\theta^{(t-1)}}(P_{\theta_{t}}\otimes P_{X^{\prime},\hat{Y}^{\prime}}\|P_{\theta_{t}}\otimes P_{Z}) (42)
𝔼θt,X,Y^[λl(θt,(X,Y^))|θ(t1)]log𝔼θt[𝔼X,Y^[eλl(θt,(X,Y^))|θ(t1)]|θ(t1)]\displaystyle\geq\mathbb{E}_{\theta_{t},X^{\prime},\hat{Y}^{\prime}}[\lambda l(\theta_{t},(X^{\prime},\hat{Y}^{\prime}))|\theta^{(t-1)}]-\log\mathbb{E}_{\theta_{t}}\big{[}\mathbb{E}_{X^{\prime},\hat{Y}^{\prime}}[e^{\lambda l(\theta_{t},(X^{\prime},\hat{Y}^{\prime}))}|\theta^{(t-1)}]|\theta^{(t-1)}\big{]}
+𝔼θt[𝔼X,Y^[λl(θt,(X,Y^))|θ(t1)]|θ(t1)]log𝔼θt[𝔼Z[eλl(θt,Z)]|θ(t1)]\displaystyle\quad+\mathbb{E}_{\theta_{t}}\big{[}\mathbb{E}_{X^{\prime},\hat{Y}^{\prime}}[\lambda l(\theta_{t},(X^{\prime},\hat{Y}^{\prime}))|\theta^{(t-1)}]|\theta^{(t-1)}\big{]}-\log\mathbb{E}_{\theta_{t}}\big{[}\mathbb{E}_{Z}[e^{\lambda l(\theta_{t},Z)}]|\theta^{(t-1)}\big{]} (43)
𝔼θt,X,Y^[λl(θt,(X,Y^))|θ(t1)]log𝔼θt[𝔼Z[eλl(θt,Z)]|θ(t1)]\displaystyle\geq\mathbb{E}_{\theta_{t},X^{\prime},\hat{Y}^{\prime}}[\lambda l(\theta_{t},(X^{\prime},\hat{Y}^{\prime}))|\theta^{(t-1)}]-\log\mathbb{E}_{\theta_{t}}\big{[}\mathbb{E}_{Z}[e^{\lambda l(\theta_{t},Z)}]|\theta^{(t-1)}\big{]} (44)
=λ(𝔼θt,X,Y^[λl(θt,(X,Y^))|θ(t1)]𝔼θt[𝔼Z[l(θt,Z)]|θ(t1)])\displaystyle=\lambda\Big{(}\mathbb{E}_{\theta_{t},X^{\prime},\hat{Y}^{\prime}}[\lambda l(\theta_{t},(X^{\prime},\hat{Y}^{\prime}))|\theta^{(t-1)}]-\mathbb{E}_{\theta_{t}}\big{[}\mathbb{E}_{Z}[l(\theta_{t},Z)]|\theta^{(t-1)}\big{]}\Big{)}
log𝔼θ~t|θ(t1)[exp(Λl(θ~t,Z~)(λ,θ~t))]\displaystyle\quad-\log\mathbb{E}_{\tilde{\theta}_{t}|\theta^{(t-1)}}\big{[}\exp\big{(}\Lambda_{l(\tilde{\theta}_{t},\tilde{Z})}(\lambda,\tilde{\theta}_{t})\big{)}\big{]} (45)
λ(𝔼θt,X,Y^[λl(θt,(X,Y^))]𝔼θt[𝔼Z[l(θt,Z)]])ψ+(λ),\displaystyle\geq\lambda\Big{(}\mathbb{E}_{\theta_{t},X^{\prime},\hat{Y}^{\prime}}[\lambda l(\theta_{t},(X^{\prime},\hat{Y}^{\prime}))]-\mathbb{E}_{\theta_{t}}\big{[}\mathbb{E}_{Z}[l(\theta_{t},Z)]\big{]}\Big{)}-\psi_{+}(\lambda), (46)

where (44) follows from Jensen’s inequality. Thus, we get

𝔼θt,Xi,Y^i[l(θt,(X,Y^))|θ(t1)]𝔼θ~t,Z~[l(θ~t,Z~)|θ(t1)]\displaystyle\mathbb{E}_{\theta_{t},X^{\prime}_{i},\hat{Y}^{\prime}_{i}}\big{[}l(\theta_{t},(X^{\prime},\hat{Y}^{\prime}))|\theta^{(t-1)}\big{]}-\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}\big{[}l(\tilde{\theta}_{t},\tilde{Z})|\theta^{(t-1)}\big{]}
ψ+1(Iθ(t1)(θt;X,Y^)+Dθ(t1)(PX,Y^PZ))\displaystyle\leq\psi_{+}^{*-1}\big{(}I_{\theta^{(t-1)}}(\theta_{t};X^{\prime},\hat{Y}^{\prime})+D_{\theta^{(t-1)}}(P_{X^{\prime},\hat{Y}^{\prime}}\|P_{Z})\big{)} (47)

and

𝔼θ~t,Z~[l(θ~t,Z~)|θ(t1)]𝔼θt,Xi,Y^i[l(θt,(X,Y^))|θ(t1)]\displaystyle\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}\big{[}l(\tilde{\theta}_{t},\tilde{Z})|\theta^{(t-1)}\big{]}-\mathbb{E}_{\theta_{t},X^{\prime}_{i},\hat{Y}^{\prime}_{i}}\big{[}l(\theta_{t},(X^{\prime},\hat{Y}^{\prime}))|\theta^{(t-1)}\big{]}
ψ1(Iθ(t1)(θt;X,Y^)+Dθ(t1)(PX,Y^PZ)).\displaystyle\leq\psi_{-}^{*-1}\big{(}I_{\theta^{(t-1)}}(\theta_{t};X^{\prime},\hat{Y}^{\prime})+D_{\theta^{(t-1)}}(P_{X^{\prime},\hat{Y}^{\prime}}\|P_{Z})\big{)}. (48)

The proof is completed by applying inequalities (40), (41), (47) and (48) to the expansion of \mathrm{gen}_{t} in (33).

Let \tilde{\theta}_{t} and \tilde{Z} be independent copies of \theta_{t} and Z, respectively, such that P_{\tilde{\theta}_{t},\tilde{Z}}=Q_{\theta_{t}}\otimes P_{Z}, where Q_{\theta_{t}} is the marginal distribution of \theta_{t}. For any fixed \tilde{\theta}_{t}\in\Theta, let the cumulant generating function (CGF) of l(\tilde{\theta}_{t},\tilde{Z}) be

Λl(θ~t,Z~)(λ,θ~t):=log𝔼Z~[eλ(l(θ~t,Z~)𝔼Z~[l(θ~t,Z~)])].\displaystyle\Lambda_{l(\tilde{\theta}_{t},\tilde{Z})}(\lambda,\tilde{\theta}_{t}):=\log\mathbb{E}_{\tilde{Z}}[e^{\lambda(l(\tilde{\theta}_{t},\tilde{Z})-\mathbb{E}_{\tilde{Z}}[l(\tilde{\theta}_{t},\tilde{Z})])}]. (49)

When the loss function l(\theta,Z)\sim\mathrm{subG}(R) under Z\sim P_{Z} for any \theta\in\Theta, we have \Lambda_{l(\tilde{\theta}_{t},\tilde{Z})}(\lambda,\tilde{\theta}_{t})\leq\frac{R^{2}\lambda^{2}}{2} for all \lambda\in\mathbb{R}. Then we can let \psi_{-}(\lambda,\tilde{\theta}_{t})=\psi_{+}(\lambda,\tilde{\theta}_{t})=\frac{R^{2}\lambda^{2}}{2} for all \tilde{\theta}_{t}\in\Theta. Hence, \psi_{+}(\lambda)=\psi_{-}(\lambda)=\sup_{\tilde{\theta}_{t}\in\Theta}\frac{R^{2}\lambda^{2}}{2}=\frac{R^{2}\lambda^{2}}{2} and \psi_{+}^{*-1}(y)=\psi_{-}^{*-1}(y)=\sqrt{2R^{2}y} for any y\geq 0. Theorem 2.A then follows directly from Theorem 5; the intermediate Legendre-transform computation is recorded below for completeness.
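For concreteness, the elementary computation behind the last step is (a routine Legendre transform, stated as a sanity check):

\psi^{*}(x)=\sup_{\lambda\geq 0}\Big\{\lambda x-\frac{R^{2}\lambda^{2}}{2}\Big\}=\frac{x^{2}}{2R^{2}}\quad\text{(attained at }\lambda=x/R^{2}\text{)},\qquad \psi^{*-1}(y)=\inf\Big\{x\geq 0:\frac{x^{2}}{2R^{2}}\geq y\Big\}=\sqrt{2R^{2}y}.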

Appendix B Proof of Theorem 2.B

Recall that \tilde{\theta}_{t} and \tilde{Z} are independent copies of \theta_{t} and Z, respectively, such that P_{\tilde{\theta}_{t},\tilde{Z}}=Q_{\theta_{t}}\otimes P_{Z} and P_{\tilde{\theta}_{t},\tilde{Z}|\theta^{(t-1)}}=P_{\theta_{t}|\theta^{(t-1)}}\otimes P_{Z}.

Recall the gen-error given in (33). The first term in (33) can be rewritten as

𝔼θ(t1)[𝔼θ~t,Z~[l(θ~t,Z~)|θ(t1)]𝔼θt,Zi[l(θt,Zi)|θ(t1)]]\displaystyle\mathbb{E}_{\theta^{(t-1)}}\Big{[}\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}\big{[}l(\tilde{\theta}_{t},\tilde{Z})|\theta^{(t-1)}\big{]}-\mathbb{E}_{\theta_{t},Z_{i}}\big{[}l(\theta_{t},Z_{i})|\theta^{(t-1)}\big{]}\Big{]}
=𝔼θ~t,Z~[logpθ~t]𝔼θt,Zi[logpθt]\displaystyle=\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}\big{[}-\log p_{\tilde{\theta}_{t}}\big{]}-\mathbb{E}_{\theta_{t},Z_{i}}\big{[}-\log p_{\theta_{t}}\big{]} (50)
=𝔼θt[h(PZ,pθt)h(PZi|θt,pθt)]\displaystyle=\mathbb{E}_{\theta_{t}}\Big{[}h(P_{Z},p_{\theta_{t}})-h(P_{Z_{i}|\theta_{t}},p_{\theta_{t}})\Big{]} (51)
=𝔼θt[Δh(PZPZi|θt|pθt)].\displaystyle=\mathbb{E}_{\theta_{t}}\Big{[}\mathrm{\Delta h}(P_{Z}\|P_{Z_{i}|\theta_{t}}|p_{\theta_{t}})\Big{]}. (52)

The second term in (33) can be rewritten as

𝔼θ~t,Z~[l(θ~t,Z~)]𝔼θt,Xi,Y^i[l(θt,(Xi,Y^i))]\displaystyle\mathbb{E}_{\tilde{\theta}_{t},\tilde{Z}}[l(\tilde{\theta}_{t},\tilde{Z})]-\mathbb{E}_{\theta_{t},X^{\prime}_{i},\hat{Y}^{\prime}_{i}}[l(\theta_{t},(X^{\prime}_{i},\hat{Y}^{\prime}_{i}))]
=𝔼θt,Zi[logpθt]𝔼θt,Xi,Y^i[logpθt]\displaystyle=\mathbb{E}_{\theta_{t},Z_{i}}[-\log p_{\theta_{t}}]-\mathbb{E}_{\theta_{t},X^{\prime}_{i},\hat{Y}^{\prime}_{i}}[-\log p_{\theta_{t}}] (53)
=𝔼θ(t1)[𝔼θt|θ(t1)[h(PZ,pθt)h(PXi,Y^i|θ(t1),pθt)+h(PXi,Y^i|θ(t1),pθt)\displaystyle=\mathbb{E}_{\theta^{(t-1)}}\Big{[}\mathbb{E}_{\theta_{t}|\theta^{(t-1)}}\big{[}h(P_{Z},p_{\theta_{t}})-h(P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t-1)}},p_{\theta_{t}})+h(P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t-1)}},p_{\theta_{t}})
h(PXi,Y^i|θ(t),pθt)]]\displaystyle\quad-h(P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t)}},p_{\theta_{t}})\big{]}\Big{]} (54)
=𝔼θ(t)[Δh(PZPXi,Y^i|θ(t1)|pθt)+Δh(PXi,Y^i|θ(t1)PXi,Y^i|θ(t)|pθt)].\displaystyle=\mathbb{E}_{\theta^{(t)}}\Big{[}\mathrm{\Delta h}(P_{Z}\|P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t-1)}}|p_{\theta_{t}})+\mathrm{\Delta h}(P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t-1)}}\|P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t)}}|p_{\theta_{t}})\Big{]}. (55)

By combining (52) and (55), the gen-error is finally given by

gent(PZ,PX,{Pθk|Sl,Su}k=0t,{fθk}k=0t1)\displaystyle\mathrm{gen}_{t}(P_{Z},P_{X},\{P_{\theta_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\theta_{k}}\}_{k=0}^{t-1})
=wni=1n𝔼θt[Δh(PZPZi|θt|pθt)]\displaystyle=\frac{w}{n}\sum_{i=1}^{n}\mathbb{E}_{\theta_{t}}\Big{[}\mathrm{\Delta h}(P_{Z}\|P_{Z_{i}|\theta_{t}}|p_{\theta_{t}})\Big{]}
+wmit𝔼θ(t)[Δh(PZPXi,Y^i|θ(t1)|pθt)+Δh(PXi,Y^i|θ(t1)PXi,Y^i|θ(t)|pθt)]\displaystyle\quad+\frac{w}{m}\sum_{i\in\mathcal{I}_{t}}\mathbb{E}_{\theta^{(t)}}\Big{[}\mathrm{\Delta h}(P_{Z}\|P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t-1)}}|p_{\theta_{t}})+\mathrm{\Delta h}(P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t-1)}}\|P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t)}}|p_{\theta_{t}})\Big{]} (56)
=𝔼θ(t)[wni=1nΔh(PZPZi|θt|pθt)+wmit(Δh(PZPXi,Y^i|θ(t1)|pθt)\displaystyle=\mathbb{E}_{\theta^{(t)}}\bigg{[}\frac{w}{n}\sum_{i=1}^{n}\mathrm{\Delta h}(P_{Z}\|P_{Z_{i}|\theta_{t}}|p_{\theta_{t}})+\frac{w}{m}\sum_{i\in\mathcal{I}_{t}}\big{(}\mathrm{\Delta h}(P_{Z}\|P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t-1)}}|p_{\theta_{t}})
+Δh(PXi,Y^i|θ(t1)PXi,Y^i|θ(t)|pθt))].\displaystyle\quad+\mathrm{\Delta h}(P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t-1)}}\|P_{X_{i}^{\prime},\hat{Y}_{i}^{\prime}|\theta^{(t)}}|p_{\theta_{t}})\big{)}\bigg{]}. (57)

Theorem 2.B is thus proved.

Appendix C Proof of Theorem 3

In the following, we abbreviate \mathrm{gen}_{t}(P_{\mathbf{Z}},P_{\mathbf{X}},\{P_{\bm{\theta}_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\bm{\theta}_{k}}\}_{k=0}^{t-1}) as \mathrm{gen}_{t} if there is no risk of confusion. When the labelled data are not reused in the subsequent iterations, we have w=0 for all t\geq 1.
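Before diving into the proof, the following NumPy sketch simulates the bGMM self-training dynamics analyzed below and can be used to sanity-check the closed-form expressions numerically. The dimension, sample sizes and noise level are illustrative choices, and each iteration draws a fresh unlabelled batch in the spirit of the disjoint index sets \mathcal{I}_{t}.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, m, sigma, T = 10, 50, 5000, 0.5, 5     # illustrative values
mu = np.zeros(d)
mu[0] = 1.0                                   # ||mu||_2 = 1, as in the bGMM analysis

# labelled data: Y uniform on {-1,+1}, X | Y ~ N(Y*mu, sigma^2 I_d)
y = rng.choice([-1.0, 1.0], size=n)
x = y[:, None] * mu + sigma * rng.standard_normal((n, d))
theta = (y[:, None] * x).mean(axis=0)         # theta_0 = (1/n) sum_i Y_i X_i

for t in range(1, T + 1):
    yu = rng.choice([-1.0, 1.0], size=m)      # true (unobserved) labels of the batch
    xu = yu[:, None] * mu + sigma * rng.standard_normal((m, d))
    y_hat = np.sign(xu @ theta)               # pseudo-labels sgn(theta^T X'_i)
    theta = (y_hat[:, None] * xu).mean(axis=0)   # theta_t = (1/m) sum_i Yhat_i X'_i
    alpha = theta @ mu / np.linalg.norm(theta)   # correlation with mu, cf. (17)
    print(f"t = {t}: alpha = {alpha:.3f}")
```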

  • Iteration t=0: Since Y_{i}\mathbf{X}_{i}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(\bm{\mu},\sigma^{2}\mathbf{I}_{d}), we have \bm{\theta}_{0}\sim\mathcal{N}(\bm{\mu},\frac{\sigma^{2}}{n}\mathbf{I}_{d}). The gen-error \mathrm{gen}_{0} is given by

    gen0\displaystyle\mathrm{gen}_{0} =𝔼𝜽0[𝔼𝐙[logp𝜽0(𝐙)]𝔼𝐙i|𝜽0[logp𝜽0(𝐙i)]]\displaystyle=\mathbb{E}_{\bm{\theta}_{0}}\bigg{[}\mathbb{E}_{\mathbf{Z}}[-\log p_{\bm{\theta}_{0}}(\mathbf{Z})]-\mathbb{E}_{\mathbf{Z}_{i}|\bm{\theta}_{0}}[-\log p_{\bm{\theta}_{0}}(\mathbf{Z}_{i})]\bigg{]}
    =Q𝜽0(𝜽)(P𝐙(𝐳)P𝐙i|𝜽0(𝐳|𝜽))log1p𝜽(𝐳)d𝐳d𝜽\displaystyle=\int Q_{\bm{\theta}_{0}}(\bm{\theta})(P_{\mathbf{Z}}(\mathbf{z})-P_{\mathbf{Z}_{i}|\bm{\theta}_{0}}(\mathbf{z}|\bm{\theta}))\log\frac{1}{p_{\bm{\theta}}(\mathbf{z})}\mathrm{d}\mathbf{z}\mathrm{d}\bm{\theta} (58)
    =12σ2Q𝜽0(𝜽)(P𝐙(𝐱,y)P𝐙i|𝜽0(𝐱,y|𝜽))(𝐱𝐱2y𝜽𝐱+𝜽𝜽)d𝐱dyd𝜽\displaystyle=\frac{1}{2\sigma^{2}}\int Q_{\bm{\theta}_{0}}(\bm{\theta})(P_{\mathbf{Z}}(\mathbf{x},y)-P_{\mathbf{Z}_{i}|\bm{\theta}_{0}}(\mathbf{x},y|\bm{\theta}))\big{(}\mathbf{x}^{\top}\mathbf{x}-2y\bm{\theta}^{\top}\mathbf{x}+\bm{\theta}^{\top}\bm{\theta}\big{)}\mathrm{d}\mathbf{x}\mathrm{d}y\mathrm{d}\bm{\theta} (59)
    =12σ2P𝐙(𝐱,y)(Q𝜽0(𝜽)P𝜽0|𝐙i(𝜽|𝐱,y))2y𝜽𝐱d𝐱dyd𝜽\displaystyle=-\frac{1}{2\sigma^{2}}\int P_{\mathbf{Z}}(\mathbf{x},y)(Q_{\bm{\theta}_{0}}(\bm{\theta})-P_{\bm{\theta}_{0}|\mathbf{Z}_{i}}(\bm{\theta}|\mathbf{x},y))2y\bm{\theta}^{\top}\mathbf{x}~{}\mathrm{d}\mathbf{x}\mathrm{d}y\mathrm{d}\bm{\theta} (60)
    =12σ2(P𝐙(𝐱,1)(Q𝜽0(𝜽)P𝜽0|𝐙i(𝜽|𝐱,1))\displaystyle=-\frac{1}{2\sigma^{2}}\int\Big{(}P_{\mathbf{Z}}(\mathbf{x},1)(Q_{\bm{\theta}_{0}}(\bm{\theta})-P_{\bm{\theta}_{0}|\mathbf{Z}_{i}}(\bm{\theta}|\mathbf{x},1))
    P𝐙(𝐱,1)(Q𝜽0(𝜽)P𝜽0|𝐙i(𝜽|𝐱,1)))2𝜽𝐱d𝐱d𝜽\displaystyle\qquad\qquad\quad-P_{\mathbf{Z}}(\mathbf{x},-1)(Q_{\bm{\theta}_{0}}(\bm{\theta})-P_{\bm{\theta}_{0}|\mathbf{Z}_{i}}(\bm{\theta}|\mathbf{x},-1))\Big{)}2\bm{\theta}^{\top}\mathbf{x}~{}\mathrm{d}\mathbf{x}\mathrm{d}\bm{\theta} (61)
    =1σ2(𝝁𝐱nP𝐙(𝐱,1)𝝁+𝐱nP𝐙(𝐱,1))𝐱d𝐱\displaystyle=-\frac{1}{\sigma^{2}}\int\Big{(}\frac{\bm{\mu}-\mathbf{x}}{n}P_{\mathbf{Z}}(\mathbf{x},1)-\frac{\bm{\mu}+\mathbf{x}}{n}P_{\mathbf{Z}}(\mathbf{x},-1)\Big{)}^{\top}\mathbf{x}~{}\mathrm{d}\mathbf{x} (62)
    =1σ2(dσ22ndσ22n)\displaystyle=-\frac{1}{\sigma^{2}}\bigg{(}-\frac{d\sigma^{2}}{2n}-\frac{d\sigma^{2}}{2n}\bigg{)} (63)
    =dn.\displaystyle=\frac{d}{n}. (64)
  • Pseudo-label using \bm{\theta}_{0}: For any i\in[1:m] and X_{i}^{\prime}\in S_{\mathrm{u}}, the pseudo-label is

    Y^i=sgn(𝜽0𝐗i).\displaystyle\hat{Y}_{i}^{\prime}=\operatorname{sgn}(\bm{\theta}_{0}^{\top}\mathbf{X}^{\prime}_{i}). (65)

    Given any pair of (\xi_{0},\bm{\mu}^{\bot}), \bm{\theta}_{0} is fixed and \{\hat{Y}_{i}^{\prime}\}_{i\in[1:m]} are conditionally i.i.d. from P_{\hat{Y}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\in\mathcal{P}(\mathcal{Y}). Recall that the pseudo-labelled dataset is defined as \hat{S}_{\mathrm{u},1}=\{(\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime})\}_{i=1}^{m}.

    Since \bm{\theta}_{0}\sim\mathcal{N}(\bm{\mu},\frac{\sigma^{2}}{n}\mathbf{I}_{d}), inspired by Oymak and Gülcü (2021), we can decompose it as follows:

    𝜽0\displaystyle\bm{\theta}_{0} =𝝁+σn𝝃=𝝁+σn(ξ0𝝁+𝝁)=(1+σnξ0)𝝁+σn𝝁,\displaystyle=\bm{\mu}+\frac{\sigma}{\sqrt{n}}\bm{\xi}=\bm{\mu}+\frac{\sigma}{\sqrt{n}}(\xi_{0}\bm{\mu}+\bm{\mu}^{\bot})=\bigg{(}1+\frac{\sigma}{\sqrt{n}}\xi_{0}\bigg{)}\bm{\mu}+\frac{\sigma}{\sqrt{n}}\bm{\mu}^{\bot}, (66)

    where 𝝃𝒩(0,𝐈d)\bm{\xi}\sim\mathcal{N}(0,\mathbf{I}_{d}), ξ0𝒩(0,1)\xi_{0}\sim\mathcal{N}(0,1), 𝝁𝝁\bm{\mu}^{\bot}\perp\bm{\mu}, 𝝁𝒩(0,𝐈d𝝁𝝁)\bm{\mu}^{\bot}\sim\mathcal{N}(0,\mathbf{I}_{d}-\bm{\mu}\bm{\mu}^{\top}) and 𝝁\bm{\mu}^{\bot} is independent of ξ0\xi_{0}.

    Recall the correlation between \bm{\theta}_{0} and \bm{\mu} given in (17), the decomposition of \bar{\bm{\theta}}_{0} in (18), and the definitions of \alpha and \beta. Since \operatorname{sgn}(\bm{\theta}_{0}^{\top}\mathbf{X}_{i}^{\prime})=\operatorname{sgn}(\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}_{i}^{\prime}), in the following we can analyze the normalized parameter \bar{\bm{\theta}}_{0} instead.

    Given any (ξ0,𝝁)(\xi_{0},\bm{\mu}^{\bot}), α\alpha is fixed, and for any ii\in\mathbb{N}, let us define a Gaussian noise vector 𝐠i𝒩(0,𝐈d)\mathbf{g}_{i}\sim\mathcal{N}(0,\mathbf{I}_{d}) and decompose it as follows

    𝐠i=g0,i𝝁+gi𝝊+𝐠i,\displaystyle\mathbf{g}_{i}=g_{0,i}\bm{\mu}+g_{i}\bm{\upsilon}+\mathbf{g}_{i}^{\bot}, (67)

    where g0,i,gi𝒩(0,1)g_{0,i},g_{i}\sim\mathcal{N}(0,1), 𝐠i𝒩(0,𝐈d𝝁𝝁𝝊𝝊)\mathbf{g}_{i}^{\bot}\sim\mathcal{N}(0,\mathbf{I}_{d}-\bm{\mu}\bm{\mu}^{\top}-\bm{\upsilon}\bm{\upsilon}^{\top}), 𝐠i𝝁\mathbf{g}_{i}^{\bot}\perp\bm{\mu}, 𝐠i𝝊\mathbf{g}_{i}^{\bot}\perp\bm{\upsilon}, and g0,i,gi,𝐠ig_{0,i},g_{i},\mathbf{g}^{\bot}_{i} are mutually independent.

    For any sample 𝐗i𝒩(𝝁,σ2𝐈d)\mathbf{X}^{\prime}_{i}\sim\mathcal{N}(\bm{\mu},\sigma^{2}\mathbf{I}_{d}), we can decompose it as

    𝐗i\displaystyle\mathbf{X}_{i}^{\prime} =𝝁+σ𝐠i=𝝁+σ(g0,i𝝁+gi𝝊+𝐠i).\displaystyle=\bm{\mu}+\sigma\mathbf{g}_{i}=\bm{\mu}+\sigma(g_{0,i}\bm{\mu}+g_{i}\bm{\upsilon}+\mathbf{g}_{i}^{\bot}). (68)

    Then we have

    𝜽¯0𝐗i\displaystyle\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}_{i}^{\prime} =(α𝝁+β𝝊)(𝝁+σ𝐠i)\displaystyle=(\alpha\bm{\mu}+\beta\bm{\upsilon})^{\top}(\bm{\mu}+\sigma\mathbf{g}_{i}) (69)
    =α+σ(α𝝁+β𝝊)(g0,i𝝁+gi𝝊+𝐠i)\displaystyle=\alpha+\sigma(\alpha\bm{\mu}+\beta\bm{\upsilon})^{\top}(g_{0,i}\bm{\mu}+g_{i}\bm{\upsilon}+\mathbf{g}_{i}^{\bot}) (70)
    =α+σ(αg0,i+βgi)=:α+σhi.\displaystyle=\alpha+\sigma(\alpha g_{0,i}+\beta g_{i})=:\alpha+\sigma h_{i}. (71)

    Note that hi𝒩(0,1)h_{i}\sim\mathcal{N}(0,1) for any α[1,1]\alpha\in[-1,1]. Similarly, for any sample 𝐗i𝒩(μ,σ2𝐈d)\mathbf{X}^{\prime}_{i}\sim\mathcal{N}(-\mu,\sigma^{2}\mathbf{I}_{d}), we have

    𝐗i=𝝁+σ𝐠i\displaystyle\mathbf{X}^{\prime}_{i}=-\bm{\mu}+\sigma\mathbf{g}_{i} (72)

    and

    𝜽¯0𝐗i\displaystyle\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}_{i}^{\prime} =α+σhi.\displaystyle=-\alpha+\sigma h_{i}. (73)

    Denote the true label of \mathbf{X}^{\prime}_{i} as Y^{\prime}_{i}, where P_{Y^{\prime}_{i}}=P_{Y}=\mathrm{unif}(\{-1,+1\}). The probability that the pseudo-label \hat{Y}_{i}^{\prime} is equal to 1 is given by

    Pr(Y^i=1)\displaystyle\Pr(\hat{Y}_{i}^{\prime}=1) =Pr(𝜽¯0𝐗i>0)\displaystyle=\Pr\big{(}\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{i}>0\big{)} (74)
    =12Pr(𝜽¯0𝐗i>0|Yi=1)+12Pr(𝜽¯0𝐗i>0|Yi=1)\displaystyle=\frac{1}{2}\Pr\big{(}\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{i}>0|Y^{\prime}_{i}=1\big{)}+\frac{1}{2}\Pr\big{(}\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{i}>0|Y^{\prime}_{i}=-1\big{)} (75)
    =12𝔼α[Pr(α+σhi>0)]+12𝔼α[Pr(α+σhi>0)]\displaystyle=\frac{1}{2}\mathbb{E}_{\alpha}\big{[}\Pr(\alpha+\sigma h_{i}>0)\big{]}+\frac{1}{2}\mathbb{E}_{\alpha}\big{[}\Pr\big{(}-\alpha+\sigma h_{i}>0\big{)}\big{]} (76)
    =12𝔼α[Q(ασ)]+12𝔼α[Q(ασ)]=12.\displaystyle=\frac{1}{2}\mathbb{E}_{\alpha}\bigg{[}\mathrm{Q}\bigg{(}-\frac{\alpha}{\sigma}\bigg{)}\bigg{]}+\frac{1}{2}\mathbb{E}_{\alpha}\bigg{[}\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)}\bigg{]}=\frac{1}{2}. (77)

    We also have Pr(Y^i=1)=1Pr(Y^i=1)=1/2\Pr(\hat{Y}_{i}^{\prime}=-1)=1-\Pr(\hat{Y}_{i}^{\prime}=1)=1/2, and so PY^i=PYP_{\hat{Y}_{i}^{\prime}}=P_{Y}.

  • Iteration t=1: Recall (16); the new model parameter learned from the pseudo-labelled dataset \hat{S}_{\mathrm{u},1} is given by

    𝜽1\displaystyle\bm{\theta}_{1} =1mi=1mY^i𝐗i=1mi=1msgn(𝜽0𝐗i)𝐗i=1mi=1msgn(𝜽¯0𝐗i)𝐗i.\displaystyle=\frac{1}{m}\sum_{i=1}^{m}\hat{Y}_{i}^{\prime}\mathbf{X}^{\prime}_{i}=\frac{1}{m}\sum_{i=1}^{m}\operatorname{sgn}(\bm{\theta}_{0}^{\top}\mathbf{X}^{\prime}_{i})\mathbf{X}^{\prime}_{i}=\frac{1}{m}\sum_{i=1}^{m}\operatorname{sgn}(\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{i})\mathbf{X}^{\prime}_{i}. (78)

    First, let us calculate the conditional expectation of \bm{\theta}_{1} given \bm{\theta}_{0}. Given any (\xi_{0},\bm{\mu}^{\bot}) and any j\in[1:m], let \bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}:=\mathbb{E}[\operatorname{sgn}(\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{j})\mathbf{X}^{\prime}_{j}|\xi_{0},\bm{\mu}^{\bot}], and let \mathbb{P}_{\xi_{0},\bm{\mu}^{\bot}} denote the probability measure under the parameters (\xi_{0},\bm{\mu}^{\bot}).

    The expectation 𝝁1ξ0,𝝁\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}} can be calculated as follows:

    𝝁1ξ0,𝝁\displaystyle\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}} =𝔼[sgn(𝜽¯0𝐗j)𝐗j|ξ0,𝝁]\displaystyle=\mathbb{E}[\operatorname{sgn}(\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{j})\mathbf{X}^{\prime}_{j}|\xi_{0},\bm{\mu}^{\bot}] (79)
    =𝔼Yj[𝔼[sgn(𝜽¯0𝐗j)𝐗j|ξ0,𝝁,Yj]]\displaystyle=\mathbb{E}_{Y_{j}^{\prime}}[~{}\mathbb{E}[\operatorname{sgn}(\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{j})\mathbf{X}^{\prime}_{j}~{}|~{}\xi_{0},\bm{\mu}^{\bot},Y_{j}^{\prime}]~{}] (80)
    =12𝔼[sgn(𝜽¯0𝐗j)𝐗j|ξ0,𝝁,Yj=1]+12𝔼[sgn(𝜽¯0𝐗j)𝐗j|ξ0,𝝁,Yj=1].\displaystyle=\frac{1}{2}\mathbb{E}[\operatorname{sgn}(\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{j})\mathbf{X}^{\prime}_{j}~{}|~{}\xi_{0},\bm{\mu}^{\bot},Y_{j}^{\prime}=-1]+\frac{1}{2}\mathbb{E}[\operatorname{sgn}(\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{j})\mathbf{X}^{\prime}_{j}~{}|~{}\xi_{0},\bm{\mu}^{\bot},Y_{j}^{\prime}=1]. (81)

    In contrast to (67), here we decompose the Gaussian random vector 𝐠j𝒩(0,𝐈d)\mathbf{g}_{j}\sim\mathcal{N}(0,\mathbf{I}_{d}) in another way

    𝐠j=g~j𝜽¯0+𝐠~j,\displaystyle\mathbf{g}_{j}=\tilde{g}_{j}\bar{\bm{\theta}}_{0}+\tilde{\mathbf{g}}_{j}^{\bot}, (82)

    where g~j𝒩(0,1)\tilde{g}_{j}\sim\mathcal{N}(0,1), 𝐠~j𝒩(0,𝐈d𝜽¯0𝜽¯0)\tilde{\mathbf{g}}_{j}^{\bot}\sim\mathcal{N}(0,\mathbf{I}_{d}-\bar{\bm{\theta}}_{0}\bar{\bm{\theta}}_{0}^{\top}), g~j\tilde{g}_{j} and 𝐠~j\tilde{\mathbf{g}}_{j}^{\bot} are mutually independent and 𝐠~j𝜽¯0\tilde{\mathbf{g}}_{j}^{\bot}\perp\bar{\bm{\theta}}_{0}.

    Then we decompose 𝐗j\mathbf{X}_{j}^{\prime} and 𝜽¯0𝐗j\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}_{j}^{\prime} as

    𝐗j\displaystyle\mathbf{X}_{j}^{\prime} =Yj𝝁+σg~j𝜽¯0+σ𝐠~j, and\displaystyle=Y_{j}^{\prime}\bm{\mu}+\sigma\tilde{g}_{j}\bar{\bm{\theta}}_{0}+\sigma\tilde{\mathbf{g}}_{j}^{\bot},\text{~{}and} (83)
    𝜽¯0𝐗j\displaystyle\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}_{j}^{\prime} =Yjα+σg~j.\displaystyle=Y_{j}^{\prime}\alpha+\sigma\tilde{g}_{j}. (84)

    Then we have

    𝔼[sgn(𝜽¯0𝐗j)𝐗j|ξ0,𝝁,Yj=1]\displaystyle\mathbb{E}[\operatorname{sgn}(\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{j})\mathbf{X}^{\prime}_{j}~{}|~{}\xi_{0},\bm{\mu}^{\bot},Y_{j}^{\prime}=-1]
    =𝔼[sgn(α+σg~j)(𝝁+σg~j𝜽¯0+σ𝐠~)|ξ0,𝝁]\displaystyle=\mathbb{E}[\operatorname{sgn}(-\alpha+\sigma\tilde{g}_{j})(-\bm{\mu}+\sigma\tilde{g}_{j}\bar{\bm{\theta}}_{0}+\sigma\tilde{\mathbf{g}}^{\bot})~{}|~{}\xi_{0},\bm{\mu}^{\bot}] (85)
    =𝔼[sgn(α+σg~j)|ξ0,𝝁]𝝁+σ𝔼[sgn(α+σg~j)g~j|ξ0,𝝁]𝜽¯0\displaystyle=-\mathbb{E}[\operatorname{sgn}(-\alpha+\sigma\tilde{g}_{j})|\xi_{0},\bm{\mu}^{\bot}]\bm{\mu}+\sigma\mathbb{E}[\operatorname{sgn}(-\alpha+\sigma\tilde{g}_{j})\tilde{g}_{j}|\xi_{0},\bm{\mu}^{\bot}]\bar{\bm{\theta}}_{0}
    +σ𝔼[sgn(α+σg~j)𝐠~|ξ0,𝝁]\displaystyle\quad+\sigma\mathbb{E}[\operatorname{sgn}(-\alpha+\sigma\tilde{g}_{j})\tilde{\mathbf{g}}^{\bot}|\xi_{0},\bm{\mu}^{\bot}] (86)
    =𝔼[sgn(α+σg~j)|ξ0,𝝁]𝝁+σ𝔼[sgn(α+σg~j)g~j|ξ0,𝝁]𝜽¯0,\displaystyle=-\mathbb{E}[\operatorname{sgn}(-\alpha+\sigma\tilde{g}_{j})|\xi_{0},\bm{\mu}^{\bot}]\bm{\mu}+\sigma\mathbb{E}[\operatorname{sgn}(-\alpha+\sigma\tilde{g}_{j})\tilde{g}_{j}|\xi_{0},\bm{\mu}^{\bot}]\bar{\bm{\theta}}_{0}, (87)

    where (87) follows since 𝐠~\tilde{\mathbf{g}}^{\bot} is independent of g~j\tilde{g}_{j} and 𝔼[𝐠~]=0\mathbb{E}[\tilde{\mathbf{g}}^{\bot}]=0.

    For the first term in (87), recall g~j𝒩(0,1)\tilde{g}_{j}\sim\mathcal{N}(0,1) and we have

    𝔼[sgn(α+σg~j)|ξ0,𝝁]𝝁=(12Q(ασ))𝝁.\displaystyle-\mathbb{E}[\operatorname{sgn}(-\alpha+\sigma\tilde{g}_{j})|\xi_{0},\bm{\mu}^{\bot}]\bm{\mu}=\bigg{(}1-2\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)}\bigg{)}\bm{\mu}. (88)

    For the second term in (87), we have

    𝔼[sgn(α+σg~j)g~j|ξ0,𝝁]𝜽¯0\displaystyle\mathbb{E}[\operatorname{sgn}(-\alpha+\sigma\tilde{g}_{j})\tilde{g}_{j}|\xi_{0},\bm{\mu}^{\bot}]\bar{\bm{\theta}}_{0}
    =(ασ12πeg22gdg+ασ12πeg22gdg)𝜽¯0=22πexp(α22σ2)𝜽¯0.\displaystyle=\bigg{(}-\int_{-\infty}^{\frac{\alpha}{\sigma}}\frac{1}{\sqrt{2\pi}}e^{-\frac{g^{2}}{2}}g~{}\mathrm{d}g\!+\!\int_{\frac{\alpha}{\sigma}}^{\infty}\frac{1}{\sqrt{2\pi}}e^{-\frac{g^{2}}{2}}g~{}\mathrm{d}g\bigg{)}\bar{\bm{\theta}}_{0}\!=\!\frac{2}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bar{\bm{\theta}}_{0}. (89)

    By combining (88) and (89), we have

    𝔼[sgn(𝜽¯0𝐗j)𝐗j|ξ0,𝝁,Yj=1]\displaystyle\mathbb{E}\big{[}\operatorname{sgn}\big{(}\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{j}\big{)}\mathbf{X}^{\prime}_{j}~{}|~{}\xi_{0},\bm{\mu}^{\bot},Y_{j}^{\prime}=-1\big{]} =(12Q(ασ))𝝁+2σ2πexp(α22σ2)𝜽¯0,\displaystyle=\bigg{(}1-2\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)}\bigg{)}\bm{\mu}+\frac{2\sigma}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bar{\bm{\theta}}_{0}, (90)

    and similarly,

    𝔼[sgn(𝜽¯0𝐗j)𝐗j|ξ0,𝝁,Yj=1]\displaystyle\mathbb{E}\big{[}\operatorname{sgn}\big{(}\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{j}\big{)}\mathbf{X}^{\prime}_{j}~{}|~{}\xi_{0},\bm{\mu}^{\bot},Y_{j}^{\prime}=1\big{]} =(2Q(ασ)1)𝝁+2σ2πexp(α22σ2)𝜽¯0.\displaystyle=\bigg{(}2\mathrm{Q}\bigg{(}-\frac{\alpha}{\sigma}\bigg{)}-1\bigg{)}\bm{\mu}+\frac{2\sigma}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bar{\bm{\theta}}_{0}. (91)

    Thus, recalling J_{\sigma} and K_{\sigma} defined in (20) and (21) and the decomposition \bar{\bm{\theta}}_{0}=\alpha\bm{\mu}+\beta\bm{\upsilon}, the conditional mean \bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}} is given by

    𝝁1ξ0,𝝁=𝔼[sgn(𝜽0𝐗j)𝐗j|ξ0,𝝁]\displaystyle\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}=\mathbb{E}[\operatorname{sgn}(\bm{\theta}_{0}^{\top}\mathbf{X}^{\prime}_{j})\mathbf{X}^{\prime}_{j}|\xi_{0},\bm{\mu}^{\bot}]
    =(12Q(ασ))𝝁+2σ2πexp(α22σ2)𝜽¯0\displaystyle=\bigg{(}1-2\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)}\bigg{)}\bm{\mu}+\frac{2\sigma}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bar{\bm{\theta}}_{0} (92)
    =(12Q(ασ)+2σα2πexp(α22σ2))𝝁+2σβ2πexp(α22σ2)𝝊\displaystyle=\bigg{(}1-2\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)}+\frac{2\sigma\alpha}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bigg{)}\bm{\mu}+\frac{2\sigma\beta}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bm{\upsilon} (93)
    =Jσ(α(ξ0,𝝁))𝝁+Kσ(α(ξ0,𝝁))𝝊.\displaystyle=J_{\sigma}(\alpha(\xi_{0},\bm{\mu}^{\bot}))\bm{\mu}+K_{\sigma}(\alpha(\xi_{0},\bm{\mu}^{\bot}))\bm{\upsilon}. (94)

    From (9), the gen-error at t=1 is given by

    gen1\displaystyle\mathrm{gen}_{1} =1mi=1m𝔼ξ0,𝝁𝔼𝜽1|ξ0,𝝁[Δh(P𝐗i,Y^i|ξ0,𝝁P𝐗i,Y^i|ξ0,𝝁,𝜽1|p𝜽1)\displaystyle=\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\mathbb{E}_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}}\Big{[}\mathrm{\Delta h}(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot},\bm{\theta}_{1}}|p_{\bm{\theta}_{1}})
    +Δh(P𝐙P𝐗i,Y^i|ξ0,𝝁|p𝜽1)].\displaystyle\qquad+\mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}|p_{\bm{\theta}_{1}})\Big{]}. (95)

    Next, we calculate the two Δh\mathrm{\Delta h} terms in (95) respectively.

    • Calculate 𝔼θ1|ξ0,μ[Δh(P𝐗i,Y^i|ξ0,μP𝐗i,Y^i|ξ0,μ,θ1|pθ1)]\mathbb{E}_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}}\big{[}\mathrm{\Delta h}(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot},\bm{\theta}_{1}}|p_{\bm{\theta}_{1}})\big{]}:

      𝔼𝜽1|ξ0,𝝁[Δh(P𝐗i,Y^i|ξ0,𝝁P𝐗i,Y^i|ξ0,𝝁,𝜽1|p𝜽1)]\displaystyle\mathbb{E}_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}}\big{[}\mathrm{\Delta h}(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot},\bm{\theta}_{1}}|p_{\bm{\theta}_{1}})\big{]}
      =𝔼𝜽1|ξ0,𝝁[h(P𝐗i,Y^i|ξ0,𝝁,p𝜽1)h(P𝐗i,Y^i|ξ0,𝝁,𝜽1,p𝜽1)]\displaystyle=\mathbb{E}_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}}\Big{[}h(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}},p_{\bm{\theta}_{1}})-h(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot},\bm{\theta}_{1}},p_{\bm{\theta}_{1}})\Big{]} (96)
      =12σ2Q𝜽1|ξ0,𝝁(𝜽|ξ0,𝝁)(P𝐗i,Y^i|ξ0,𝝁(𝐱,y|ξ0,𝝁)\displaystyle=\frac{1}{2\sigma^{2}}\int Q_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}|\xi_{0},\bm{\mu}^{\bot})\big{(}P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\mathbf{x},y|\xi_{0},\bm{\mu}^{\bot})
      P𝐗i,Y^i|ξ0,𝝁,𝜽1(𝐱,y|ξ0,𝝁,𝜽))(𝐱𝐱2y𝜽𝐱+𝜽𝜽)d𝐱dyd𝜽\displaystyle\qquad\qquad-P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot},\bm{\theta}_{1}}(\mathbf{x},y|\xi_{0},\bm{\mu}^{\bot},\bm{\theta})\big{)}(\mathbf{x}^{\top}\mathbf{x}-2y\bm{\theta}^{\top}\mathbf{x}+\bm{\theta}^{\top}\bm{\theta})\mathrm{d}\mathbf{x}\mathrm{d}y\mathrm{d}\bm{\theta} (97)
      =1σ2P𝐗i,Y^i|ξ0,𝝁(𝐱,y|ξ0,𝝁)(Q𝜽1|ξ0,𝝁(𝜽|ξ0,𝝁)\displaystyle=-\frac{1}{\sigma^{2}}\int P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\mathbf{x},y|\xi_{0},\bm{\mu}^{\bot})\big{(}Q_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}|\xi_{0},\bm{\mu}^{\bot})
      P𝜽1|𝐗i,Y^i,ξ0,𝝁(𝜽|𝐱,y,ξ0,𝝁))(y𝜽𝐱)d𝐱dyd𝜽\displaystyle\qquad\qquad-P_{\bm{\theta}_{1}|\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime},\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}|\mathbf{x},y,\xi_{0},\bm{\mu}^{\bot})\big{)}(y\bm{\theta}^{\top}\mathbf{x})\mathrm{d}\mathbf{x}\mathrm{d}y\mathrm{d}\bm{\theta} (98)
      =1σ2P𝐗i,Y^i|ξ0,𝝁(𝐱,y|ξ0,𝝁)(𝝁1ξ0,𝝁y𝐱m)(y𝐱)d𝐱dy\displaystyle=-\frac{1}{\sigma^{2}}\int P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\mathbf{x},y|\xi_{0},\bm{\mu}^{\bot})\bigg{(}\frac{\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}-y\mathbf{x}}{m}\bigg{)}^{\top}(y\mathbf{x})\mathrm{d}\mathbf{x}\mathrm{d}y (99)
      =1mσ2(𝔼[𝐗i𝐗i|ξ0,𝝁](𝝁1ξ0,𝝁)𝝁1ξ0,𝝁)\displaystyle=\frac{1}{m\sigma^{2}}\Big{(}\mathbb{E}[\mathbf{X}_{i}^{\prime\top}\mathbf{X}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}]-(\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}})^{\top}\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}\Big{)} (100)
      =dσ2+𝝁𝝁(𝝁1ξ0,𝝁)𝝁1ξ0,𝝁mσ2\displaystyle=\frac{d\sigma^{2}+\bm{\mu}^{\top}\bm{\mu}-(\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}})^{\top}\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}}{m\sigma^{2}} (101)
      =dσ2+1Jσ2(α)Kσ2(α)mσ2.\displaystyle=\frac{d\sigma^{2}+1-J_{\sigma}^{2}(\alpha)-K_{\sigma}^{2}(\alpha)}{m\sigma^{2}}. (102)
    • Calculate \mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}|p_{\bm{\theta}_{1}}): Given any (\xi_{0},\bm{\mu}^{\bot}), in the following we drop the conditioning on (\xi_{0},\bm{\mu}^{\bot}) for notational simplicity. Since P_{\hat{Y}_{i}^{\prime}}(1)=P_{\hat{Y}_{i}^{\prime}}(-1)=P_{Y_{i}^{\prime}}(1)=P_{Y_{i}^{\prime}}(-1)=\frac{1}{2} (cf. (77)), we have

      Δh(P𝐙P𝐗i,Y^i|ξ0,𝝁|p𝜽1)=h(P𝐙,p𝜽1)h(P𝐗i,Y^i|ξ0,𝝁,p𝜽1)\displaystyle\mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}|p_{\bm{\theta}_{1}})=h(P_{\mathbf{Z}},p_{\bm{\theta}_{1}})-h(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}},p_{\bm{\theta}_{1}})
      =12(P𝐗|Y=1(𝐱)P𝐗i|Y^i=1(𝐱))log1PY(1)p𝜽1(𝐱|1)d𝐱\displaystyle=\frac{1}{2}\int(P_{\mathbf{X}|Y=1}(\mathbf{x})-P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}=1}(\mathbf{x}))\log\frac{1}{P_{Y}(1)p_{\bm{\theta}_{1}}(\mathbf{x}|1)}\mathrm{d}\mathbf{x}
      +12(P𝐗|Y=1(𝐱)P𝐗i|Y^i=1(𝐱))log1PY(1)p𝜽1(𝐱|1)d𝐱\displaystyle\quad+\frac{1}{2}\int(P_{\mathbf{X}|Y=-1}(\mathbf{x})-P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}=-1}(\mathbf{x}))\log\frac{1}{P_{Y}(-1)p_{\bm{\theta}_{1}}(\mathbf{x}|-1)}\mathrm{d}\mathbf{x} (103)
      =12(P𝐗|Y=1(𝐱)P𝐗i|Y^i=1(𝐱))log1p𝜽1(𝐱|1)d𝐱\displaystyle=\frac{1}{2}\int(P_{\mathbf{X}|Y=1}(\mathbf{x})-P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}=1}(\mathbf{x}))\log\frac{1}{p_{\bm{\theta}_{1}}(\mathbf{x}|1)}\mathrm{d}\mathbf{x}
      +12(P𝐗|Y=1(𝐱)P𝐗i|Y^i=1(𝐱))log1p𝜽1(𝐱|1)d𝐱.\displaystyle\quad+\frac{1}{2}\int(P_{\mathbf{X}|Y=-1}(\mathbf{x})-P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}=-1}(\mathbf{x}))\log\frac{1}{p_{\bm{\theta}_{1}}(\mathbf{x}|-1)}\mathrm{d}\mathbf{x}. (104)

      Since given 𝜽1\bm{\theta}_{1}, p𝜽1(𝐱|)p_{\bm{\theta}_{1}}(\mathbf{x}|\cdot) is a Gaussian distribution, for any y{±1}y\in\{\pm 1\}, we have

      12P𝜽1|ξ0,𝝁(𝜽)(P𝐗|Y(𝐱|𝐲)P𝐗i|Y^i(𝐱|y))log1p𝜽(𝐱|y)d𝐱d𝜽\displaystyle\frac{1}{2}\int P_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta})\big{(}P_{\mathbf{X}|Y}(\mathbf{x}|\mathbf{y})-P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}}(\mathbf{x}|y)\big{)}\log\frac{1}{p_{\bm{\theta}}(\mathbf{x}|y)}\mathrm{d}\mathbf{x}\mathrm{d}\bm{\theta}
      =14σ2P𝜽1|ξ0,𝝁(𝜽)(P𝐗|Y(𝐱|y)P𝐗i|Y^i(𝐱|y))(𝐱𝐱2y𝜽𝐱+𝜽𝜽)d𝐱d𝜽\displaystyle=\frac{1}{4\sigma^{2}}\int P_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta})\big{(}P_{\mathbf{X}|Y}(\mathbf{x}|y)-P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}}(\mathbf{x}|y)\big{)}\big{(}\mathbf{x}^{\top}\mathbf{x}-2y\bm{\theta}^{\top}\mathbf{x}+\bm{\theta}^{\top}\bm{\theta}\big{)}\mathrm{d}\mathbf{x}\mathrm{d}\bm{\theta} (105)
      =12σ2P𝜽1|ξ0,𝝁(𝜽)(12P𝐗|Y(𝐱|y)12P𝐗i|Y^i(𝐱|y))(y𝜽𝐱)d𝐱d𝜽\displaystyle=-\frac{1}{2\sigma^{2}}\int P_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta})\bigg{(}\frac{1}{2}P_{\mathbf{X}|Y}(\mathbf{x}|y)-\frac{1}{2}P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}}(\mathbf{x}|y)\bigg{)}\big{(}y\bm{\theta}^{\top}\mathbf{x}\big{)}\mathrm{d}\mathbf{x}\mathrm{d}\bm{\theta} (106)
      =12σ2(𝝁1ξ0,𝝁)(𝝁𝝁1ξ0,𝝁)\displaystyle=-\frac{1}{2\sigma^{2}}(\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}})^{\top}\big{(}\bm{\mu}-\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}\big{)} (107)
      =Jσ2(α)+Kσ2(α)Jσ(α)2σ2.\displaystyle=\frac{J_{\sigma}^{2}(\alpha)+K_{\sigma}^{2}(\alpha)-J_{\sigma}(\alpha)}{2\sigma^{2}}. (108)

      Thus,

      𝔼𝜽1|ξ0,𝝁[Δh(P𝐙P𝐗i,Y^i|ξ0,𝝁|p𝜽1)]=Jσ2(α)+Kσ2(α)Jσ(α)σ2.\displaystyle\mathbb{E}_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}}[\mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}|p_{\bm{\theta}_{1}})]=\frac{J_{\sigma}^{2}(\alpha)+K_{\sigma}^{2}(\alpha)-J_{\sigma}(\alpha)}{\sigma^{2}}. (109)

    Finally, the gen-error at t=1 can be characterized as follows:

    gen1=𝔼ξ0,𝝁[Jσ2(α(ξ0,𝝁))+Kσ2(α(ξ0,𝝁))Jσ(α(ξ0,𝝁))σ2\displaystyle\mathrm{gen}_{1}=\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\frac{J_{\sigma}^{2}(\alpha(\xi_{0},\bm{\mu}^{\bot}))+K_{\sigma}^{2}(\alpha(\xi_{0},\bm{\mu}^{\bot}))-J_{\sigma}(\alpha(\xi_{0},\bm{\mu}^{\bot}))}{\sigma^{2}}
    +dσ2+1Jσ2(α(ξ0,𝝁))Kσ2(α(ξ0,𝝁))mσ2]\displaystyle\qquad\qquad~{}+\frac{d\sigma^{2}+1-J_{\sigma}^{2}(\alpha(\xi_{0},\bm{\mu}^{\bot}))-K_{\sigma}^{2}(\alpha(\xi_{0},\bm{\mu}^{\bot}))}{m\sigma^{2}}\bigg{]} (110)
    =𝔼ξ0,𝝁[(m1)(Jσ2(α(ξ0,𝝁))+Kσ2(α(ξ0,𝝁)))mJσ(α(ξ0,𝝁))+dσ2+1mσ2].\displaystyle=\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\frac{(m-1)(J_{\sigma}^{2}(\alpha(\xi_{0},\bm{\mu}^{\bot}))+K_{\sigma}^{2}(\alpha(\xi_{0},\bm{\mu}^{\bot})))-mJ_{\sigma}(\alpha(\xi_{0},\bm{\mu}^{\bot}))+d\sigma^{2}+1}{m\sigma^{2}}\bigg{]}. (111)
  • Pseudo-label using \bm{\theta}_{1}: Let \bar{\bm{\theta}}_{1}:=\bm{\theta}_{1}/\|\bm{\theta}_{1}\|_{2}. For any i\in\mathcal{I}_{2}, the pseudo-labels are given by

    Y^i=sgn(𝜽1𝐗i)=sgn(𝜽¯1𝐗i).\displaystyle\hat{Y}_{i}^{\prime}=\operatorname{sgn}(\bm{\theta}_{1}^{\top}\mathbf{X}^{\prime}_{i})=\operatorname{sgn}(\bar{\bm{\theta}}_{1}^{\top}\mathbf{X}^{\prime}_{i}). (112)

    It can be seen that the pseudo-labels \{\hat{Y}_{i}^{\prime}\}_{i\in\mathcal{I}_{2}} are conditionally i.i.d. given \bm{\theta}_{1}; let us denote the conditional distribution under a fixed \bm{\theta}_{1} as P_{\hat{Y}^{\prime}|\bm{\theta}_{1}}\in\mathcal{P}(\mathcal{Y}). The pseudo-labelled dataset is denoted as \hat{S}_{\mathrm{u},2}=\{(\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime})\}_{i\in\mathcal{I}_{2}}.

    For any fixed (\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}), let \bm{\theta}_{1} be decomposed as \bm{\theta}_{1}=A_{1}(\xi_{0},\bm{\mu}^{\bot})\bm{\mu}+B_{1}(\xi_{0},\bm{\mu}^{\bot})\bm{\upsilon}, where A_{1}^{2}(\xi_{0},\bm{\mu}^{\bot})+B_{1}^{2}(\xi_{0},\bm{\mu}^{\bot})=\|\bm{\theta}_{1}\|_{2}^{2}. In addition, let \alpha_{1}(\xi_{0},\bm{\mu}^{\bot}):=A_{1}/\sqrt{A_{1}^{2}+B_{1}^{2}} and \beta_{1}(\xi_{0},\bm{\mu}^{\bot}):=\sqrt{1-(\alpha_{1}(\xi_{0},\bm{\mu}^{\bot}))^{2}}. In the following, we write A_{1}, B_{1}, \alpha_{1} and \beta_{1} for these quantities if there is no risk of confusion.

    Recall the decomposition of 𝐗i\mathbf{X}_{i}^{\prime} and 𝜽¯0𝐗i\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}_{i}^{\prime} in (68) and (71). Similarly, we have

    𝜽¯1𝐗i=:Yiα1+σhi1,\displaystyle\bar{\bm{\theta}}_{1}^{\top}\mathbf{X}^{\prime}_{i}=:Y_{i}^{\prime}\alpha_{1}+\sigma h_{i}^{1}, (113)

    where h_{i}^{1}\sim\mathcal{N}(0,1). Note that P_{\hat{Y}_{i}^{\prime}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}=P_{\hat{Y}_{i}^{\prime}|\bm{\theta}_{1}}, and the conditional probability P_{\hat{Y}_{i}^{\prime}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}} is given by

    PY^i|𝜽1,ξ0,𝝁(1)\displaystyle P_{\hat{Y}_{i}^{\prime}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}(1) =PY^i|𝜽1(1)=𝜽1(𝜽¯1𝐗i>0)\displaystyle=P_{\hat{Y}_{i}^{\prime}|\bm{\theta}_{1}}(1)=\mathbb{P}_{\bm{\theta}_{1}}\big{(}\bar{\bm{\theta}}_{1}^{\top}\mathbf{X}_{i}^{\prime}>0\big{)} (114)
    =12𝜽1(α1+σhi1>0)+12𝜽1(α1+σhi10)=12,\displaystyle=\frac{1}{2}\mathbb{P}_{\bm{\theta}_{1}}\big{(}\alpha_{1}+\sigma h_{i}^{1}>0\big{)}+\frac{1}{2}\mathbb{P}_{\bm{\theta}_{1}}\big{(}\alpha_{1}+\sigma h_{i}^{1}\leq 0\big{)}=\frac{1}{2}, (115)

    and PY^i|𝜽1,ξ0,𝝁(1)=1/2P_{\hat{Y}_{i}^{\prime}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}(-1)=1/2, where 𝜽1\mathbb{P}_{\bm{\theta}_{1}} denotes the probability measure under parameter 𝜽1\bm{\theta}_{1}.

    Figure 11: F_{\sigma}(x) versus x for \sigma\in\{0.3,0.5\}.
    Figure 12: g_{\sigma}^{(m)}(x) versus x for \sigma\in\{0.6,0.8,1,2,3\}.
  • Iteration t=2: Recall (16); the new model parameter learned from the pseudo-labelled dataset \hat{S}_{\mathrm{u},2} is given by

    𝜽2\displaystyle\bm{\theta}_{2} =1mi2Y^i𝐗i=1mi2sgn(𝜽¯1𝐗i)𝐗i,\displaystyle=\frac{1}{m}\sum_{i\in\mathcal{I}_{2}}\hat{Y}_{i}^{\prime}\mathbf{X}^{\prime}_{i}=\frac{1}{m}\sum_{i\in\mathcal{I}_{2}}\operatorname{sgn}(\bar{\bm{\theta}}_{1}^{\top}\mathbf{X}^{\prime}_{i})\mathbf{X}^{\prime}_{i}, (116)

    where {sgn(𝜽¯1𝐗i)𝐗i}i2\{\operatorname{sgn}(\bar{\bm{\theta}}_{1}^{\top}\mathbf{X}^{\prime}_{i})\mathbf{X}^{\prime}_{i}\}_{i\in\mathcal{I}_{2}} are conditionally i.i.d. random variables given 𝜽1,ξ0,𝝁\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}.

    Similar to (94), the expectation of \bm{\theta}_{2} conditioned on (\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}) is given by

    𝝁2𝜽1,ξ0,𝝁:=𝔼[𝜽2|𝜽1,ξ0,𝝁]=𝔼[sgn(𝜽¯1𝐗j)𝐗j|𝜽1,ξ0,𝝁]\displaystyle\bm{\mu}_{2}^{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}:=\mathbb{E}[\bm{\theta}_{2}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}]=\mathbb{E}[\operatorname{sgn}(\bar{\bm{\theta}}_{1}^{\top}\mathbf{X}^{\prime}_{j})\mathbf{X}^{\prime}_{j}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}]
    =(12Q(α1σ)+2σα12πexp(α122σ2))𝝁+2σβ12πexp(α122σ2)𝝊\displaystyle=\bigg{(}1-2\mathrm{Q}\bigg{(}\frac{\alpha_{1}}{\sigma}\bigg{)}+\frac{2\sigma\alpha_{1}}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha_{1}^{2}}{2\sigma^{2}}\bigg{)}\bigg{)}\bm{\mu}+\frac{2\sigma\beta_{1}}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha_{1}^{2}}{2\sigma^{2}}\bigg{)}\bm{\upsilon} (117)
    =Jσ(α1(ξ0,𝝁))𝝁+Kσ(α1(ξ0,𝝁))𝝊.\displaystyle=J_{\sigma}(\alpha_{1}(\xi_{0},\bm{\mu}^{\bot}))\bm{\mu}+K_{\sigma}(\alpha_{1}(\xi_{0},\bm{\mu}^{\bot}))\bm{\upsilon}. (118)

    As m\to\infty, by the strong law of large numbers, \bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}\to\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}} almost surely. By the continuous mapping theorem, we also have \alpha_{1}(\xi_{0},\bm{\mu}^{\bot})\to F_{\sigma}(\alpha(\xi_{0},\bm{\mu}^{\bot})) almost surely. Equivalently, for almost all sample paths, there exists a vanishing sequence \epsilon_{m} (i.e., \epsilon_{m}\to 0 as m\to\infty) such that |\alpha_{1}(\xi_{0},\bm{\mu}^{\bot})-F_{\sigma}(\alpha(\xi_{0},\bm{\mu}^{\bot}))|=\epsilon_{m}.

    The gen-error at t=2 is given by

    gen2=1mi2𝔼𝜽1,ξ0,𝝁[𝔼𝜽2|𝜽1,ξ0,𝝁[Δh(P𝐙P𝐗i,Y^i|𝜽1,ξ0,𝝁|p𝜽2)\displaystyle\mathrm{gen}_{2}=\frac{1}{m}\sum_{i\in\mathcal{I}_{2}}\mathbb{E}_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\Big{[}\mathbb{E}_{\bm{\theta}_{2}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\big{[}\mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}|p_{\bm{\theta}_{2}})
    +Δh(P𝐗i,Y^i|𝜽1,ξ0,𝝁P𝐗i,Y^i|𝜽2,𝜽1,ξ0,𝝁|p𝜽2)]].\displaystyle\qquad\qquad+\mathrm{\Delta h}(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\bm{\theta}_{2},\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}|p_{\bm{\theta}_{2}})\big{]}\Big{]}. (119)

    By applying the same techniques as at iteration t=1, we obtain the following exact characterization of the gen-error at t=2. By the uniform continuity of J_{\sigma}, for any vanishing sequence \epsilon_{m}>0, there exist \epsilon_{m}^{\prime},\epsilon_{m}^{\prime\prime}>0 such that \sup_{x\in[0,1]}|J_{\sigma}(x+\epsilon_{m})-J_{\sigma}(x)|=\epsilon_{m}^{\prime} and \sup_{x\in[0,1]}|J_{\sigma}(x-\epsilon_{m})-J_{\sigma}(x)|=\epsilon_{m}^{\prime\prime}, where \epsilon_{m}^{\prime},\epsilon_{m}^{\prime\prime}\downarrow 0 as \epsilon_{m}\downarrow 0. The same result holds for K_{\sigma}.

    Finally we can obtain the gen-error as follows. For almost all sample paths, there exists a vanishing sequence ϵm\epsilon_{m} (i.e., ϵm0\epsilon_{m}\to 0 as mm\to\infty), such that

    gen2\displaystyle\mathrm{gen}_{2} =𝔼ξ0,𝝁[Jσ2(Fσ(α))+Kσ2(Fσ(α))Jσ(Fσ(α))σ2\displaystyle=\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\frac{J_{\sigma}^{2}(F_{\sigma}(\alpha))+K_{\sigma}^{2}(F_{\sigma}(\alpha))-J_{\sigma}(F_{\sigma}(\alpha))}{\sigma^{2}}
    +dσ2+1Jσ2(Fσ(α))Kσ2(Fσ(α))mσ2]+ϵm\displaystyle\qquad\qquad~{}+\frac{d\sigma^{2}+1-J_{\sigma}^{2}(F_{\sigma}(\alpha))-K_{\sigma}^{2}(F_{\sigma}(\alpha))}{m\sigma^{2}}\bigg{]}+\epsilon_{m} (120)
    =𝔼ξ0,𝝁[(m1)(Jσ2(Fσ(α))+Kσ2(Fσ(α)))mJσ(Fσ(α))mσ2]+ϵm,\displaystyle=\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\frac{(m-1)(J_{\sigma}^{2}(F_{\sigma}(\alpha))+K_{\sigma}^{2}(F_{\sigma}(\alpha)))-mJ_{\sigma}(F_{\sigma}(\alpha))}{m\sigma^{2}}\bigg{]}+\epsilon_{m}^{\prime}, (121)

    where ϵm=ϵm+dσ2+1mσ2\epsilon_{m}^{\prime}=\epsilon_{m}+\frac{d\sigma^{2}+1}{m\sigma^{2}} and α\alpha stands for α(ξ0,𝝁)\alpha(\xi_{0},\bm{\mu}^{\bot}).

  • Iteration t[2:τ]t\in[2:\tau]: By iterating the above calculation, we finally obtain the characterization of gent\mathrm{gen}_{t} as follows. For almost all sample paths, there exists a vanishing sequence ϵm,t\epsilon_{m,t} (ϵm,t0\epsilon_{m,t}\to 0 as mm\to\infty), such that

    gent=𝔼ξ0,𝝁[(m1)(Jσ2(Fσ(t1)(α))+Kσ2(Fσ(t1)(α)))mJσ(Fσ(t1)(α))mσ2]+ϵm,t,\displaystyle\mathrm{gen}_{t}=\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\frac{(m-1)(J_{\sigma}^{2}(F_{\sigma}^{(t-1)}(\alpha))+K_{\sigma}^{2}(F_{\sigma}^{(t-1)}(\alpha)))-mJ_{\sigma}(F_{\sigma}^{(t-1)}(\alpha))}{m\sigma^{2}}\bigg{]}+\epsilon_{m,t}^{\prime}, (122)

    where ϵm,t=ϵm,t+dσ2+1mσ2\epsilon_{m,t}^{\prime}=\epsilon_{m,t}+\frac{d\sigma^{2}+1}{m\sigma^{2}} and α\alpha stands for α(ξ0,𝝁)\alpha(\xi_{0},\bm{\mu}^{\bot}).

    The proof is thus completed.

Remark 6 (Numerical plots of Fσ(t)()F_{\sigma}^{(t)}(\cdot) and gσ(m)()g_{\sigma}^{(m)}(\cdot)).

Recall gσ(m)(x)=((m1)(Jσ2(x)+Kσ2(x))mJσ(x))/σ2g_{\sigma}^{(m)}(x)=((m-1)(J_{\sigma}^{2}(x)+K_{\sigma}^{2}(x))-mJ_{\sigma}(x))/\sigma^{2} for any x[1,1]x\in[-1,1], which is the quantity that determines the behaviour of (23). To gain more insight, we numerically plot Fσ(t)(x)F_{\sigma}^{(t)}(x) for t=0,1,2t=0,1,2 in Figure 2 and Fσ(x)F_{\sigma}(x) under different σ\sigma in Figure 12. We also plot gσ(m)(x)g_{\sigma}^{(m)}(x) under different σ\sigma in Figure 12.
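To make Remark 6 concrete, the following is a minimal Python sketch (not part of the original analysis) that evaluates J_{\sigma}, K_{\sigma}, the iterates of F_{\sigma} and g_{\sigma}^{(m)}, and the leading term of (122). Treating the initial correlation α as a fixed number and dropping the expectation over (ξ0, 𝝁⊥) as well as the vanishing ϵ_{m,t} term are simplifying assumptions made purely for illustration.

```python
# Sketch of J_sigma, K_sigma, the map F_sigma and its iterates, g_sigma^(m),
# and the leading term of the gen-error in (122).
import numpy as np
from scipy.special import erfc

def Q(x):
    # Gaussian tail function Q(x) = P(N(0,1) > x)
    return 0.5 * erfc(x / np.sqrt(2.0))

def J(x, sigma):
    return 1.0 - 2.0 * Q(x / sigma) + (2.0 * sigma * x / np.sqrt(2.0 * np.pi)) * np.exp(-x**2 / (2.0 * sigma**2))

def K(x, sigma):
    return (2.0 * sigma * np.sqrt(np.clip(1.0 - x**2, 0.0, None)) / np.sqrt(2.0 * np.pi)) * np.exp(-x**2 / (2.0 * sigma**2))

def F(x, sigma):
    # correlation after one pseudo-labelling pass, cf. (218)-(219)
    return J(x, sigma) / np.sqrt(J(x, sigma)**2 + K(x, sigma)**2)

def F_iter(x, sigma, t):
    # t-th iterate of F_sigma with F^(0)(x) = x
    for _ in range(t):
        x = F(x, sigma)
    return x

def g(x, sigma, m):
    # g_sigma^(m)(x) = ((m-1)(J^2 + K^2) - m J) / sigma^2
    return ((m - 1) * (J(x, sigma)**2 + K(x, sigma)**2) - m * J(x, sigma)) / sigma**2

if __name__ == "__main__":
    sigma, m, alpha = 0.6, 1000, 0.9          # illustrative values
    for t in range(1, 6):
        x = F_iter(alpha, sigma, t - 1)
        gen_t = g(x, sigma, m) / m            # leading term of (122)
        print(f"t = {t}: F^(t-1)(alpha) = {x:.4f}, gen_t approx {gen_t:.5f}")
```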

Appendix D Applying Theorem 2.A to bGMMs

In anticipation of leveraging Theorem 2.A together with the sub-Gaussianity of the loss function for the bGMM to derive generalization bounds in terms of information-theoretic quantities (just as in Russo and Zou (2016); Xu and Raginsky (2017); Bu et al. (2020)), we find it convenient to show that 𝐗\mathbf{X} and l(𝜽,(𝐗,Y))l(\bm{\theta},(\mathbf{X},Y)) are bounded with high probability (w.h.p.). By defining the ℓ∞\ell_{\infty} ball ry:={𝐱d:𝐱y𝝁r}\mathcal{B}_{r}^{y}:=\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}-y\bm{\mu}\|_{\infty}\leq r\}, we see that

Pr(𝐗rY)=(12Φ(rσ))d=:1δr,d,\displaystyle\Pr(\mathbf{X}\in\mathcal{B}_{r}^{Y})=\bigg{(}1-2\Phi\Big{(}-\frac{r}{\sigma}\Big{)}\bigg{)}^{d}=:1-\delta_{r,d}, (123)

where Φ()\Phi(\cdot) is the Gaussian cumulative distribution function. By choosing rr appropriately, the failure probability δr,d\delta_{r,d} can be made arbitrarily small.

To show that 𝜽\bm{\theta} is bounded with high probability, define the set Θ𝝁,c:={𝜽Θ:𝜽𝝁c}\Theta_{\bm{\mu},c}:=\{\bm{\theta}\in\Theta:\|\bm{\theta}-\bm{\mu}\|_{\infty}\leq c\} for some c>0c>0. For any 𝜽Θ𝝁,c\bm{\theta}\in\Theta_{\bm{\mu},c}, we have

min(𝐱,y)𝒵l(𝜽,(𝐱,y))\displaystyle\min_{(\mathbf{x},y)\in\mathcal{Z}}l(\bm{\theta},(\mathbf{x},y)) =log(2(2π)dσd)=:c1, and\displaystyle=\log(2\sqrt{(2\pi)^{d}}\sigma^{d})=:c_{1},\text{~{}~{}~{}and} (124)
max𝐱ry,y𝒴l(𝜽,(𝐱,y))\displaystyle\max_{\mathbf{x}\in\mathcal{B}_{r}^{y},y\in\mathcal{Y}}l(\bm{\theta},(\mathbf{x},y)) log(2(2π)dσd)+d(c+r)22σ2=:c2.\displaystyle\leq\log(2\sqrt{(2\pi)^{d}}\sigma^{d})+\frac{d(c+r)^{2}}{2\sigma^{2}}=:c_{2}. (125)

For any (𝐗,Y)(\mathbf{X},Y) from the bGMM(𝝁,σ\bm{\mu},\sigma) and any 𝜽Θ𝝁,c\bm{\theta}\in\Theta_{\bm{\mu},c}, the probability that l(𝜽,(𝐗,Y))l(\bm{\theta},(\mathbf{X},Y)) belongs to the interval [c1,c2][c_{1},c_{2}] (where c2c_{2} depends on the choices of cc and rr) can be lower bounded by

Pr(l(𝜽,(𝐗,Y))[c1,c2])1δr,d.\displaystyle\Pr\big{(}l(\bm{\theta},(\mathbf{X},Y))\in[c_{1},c_{2}]\big{)}\geq 1-\delta_{r,d}. (126)

Thus, according to Hoeffding’s lemma, with probability at least 1δr,d1-\delta_{r,d}, l(𝜽,(𝐗,Y))subG((c2c1)/2)l(\bm{\theta},(\mathbf{X},Y))\sim\mathrm{subG}((c_{2}-c_{1})/2) under (𝐗,Y)𝒩(Y𝝁,σ2𝐈d)PY(\mathbf{X},Y)\sim\mathcal{N}(Y\bm{\mu},\sigma^{2}\mathbf{I}_{d})\otimes P_{Y} for all 𝜽Θ𝝁,c\bm{\theta}\in\Theta_{\bm{\mu},c}, i.e., for all λ\lambda\in\mathbb{R},

𝔼𝐗,Y[exp(λ(l(𝜽,(𝐗,Y))𝔼𝐗,Y[l(𝜽,(𝐗,Y))]))]\displaystyle\mathbb{E}_{\mathbf{X},Y}\big{[}\exp\big{(}\lambda\big{(}l(\bm{\theta},(\mathbf{X},Y))-\mathbb{E}_{\mathbf{X},Y}[l(\bm{\theta},(\mathbf{X},Y))]\big{)}\big{)}\big{]}
exp(λ2(c2c1)2/8).\displaystyle\leq\exp\big{(}\lambda^{2}(c_{2}-c_{1})^{2}/8\big{)}. (127)
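The boundedness argument above is straightforward to evaluate numerically. The sketch below (an illustration, not part of the original proof) computes δ_{r,d} from (123), the loss range [c_1, c_2] from (124)–(125) and the resulting sub-Gaussian proxy (c_2−c_1)/2; the radius c of Θ_{𝝁,c} and the values of r are assumptions chosen only for this example.

```python
# Sketch of (123)-(127): failure probability, loss range and sub-Gaussian proxy.
import numpy as np
from scipy.stats import norm

def delta_rd(r, sigma, d):
    # failure probability 1 - Pr(X in B_r^Y) = 1 - (1 - 2*Phi(-r/sigma))^d, cf. (123)
    return 1.0 - (1.0 - 2.0 * norm.cdf(-r / sigma))**d

def loss_range(sigma, d, c, r):
    c1 = np.log(2.0 * (2.0 * np.pi)**(d / 2.0) * sigma**d)   # (124)
    c2 = c1 + d * (c + r)**2 / (2.0 * sigma**2)              # (125)
    return c1, c2

if __name__ == "__main__":
    sigma, d, c = 0.6, 2, 1.0          # c > 0 is an assumed radius of Theta_{mu,c}
    for r in (2.0, 3.0, 4.0):
        c1, c2 = loss_range(sigma, d, c, r)
        print(f"r = {r}: delta_r,d = {delta_rd(r, sigma, d):.2e}, "
              f"c1 = {c1:.3f}, c2 = {c2:.3f}, sub-Gaussian proxy (c2-c1)/2 = {(c2 - c1) / 2:.3f}")
```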

Recall the definition of α\alpha in (17) and the decomposition of 𝜽¯0\bar{\bm{\theta}}_{0} in (18). Define Gσ:[1,1]××d[0,)G_{\sigma}:[-1,1]\times\mathbb{R}\times\mathbb{R}^{d}\to[0,\infty), the KL-divergence between the pseudo-labelled data distribution and the true data distribution after the first iteration, as follows:

Gσ(α,ξ0,𝝁):=D(pg~,𝐠~pg~p𝐠~),\displaystyle G_{\sigma}(\alpha,\xi_{0},\bm{\mu}^{\bot}):=D\big{(}p^{\prime}_{\tilde{g},\tilde{\mathbf{g}}^{\bot}}\big{\|}p_{\tilde{g}}\otimes p_{\tilde{\mathbf{g}}^{\bot}}\big{)}, (128)

where

pg~,𝐠~=\displaystyle p^{\prime}_{\tilde{g},\tilde{\mathbf{g}}^{\bot}}= Φ(ασ)pg~+2ασ|g~ασp𝐠~+𝜽¯0+Φ(ασ)pg~|g~ασp𝐠~,\displaystyle\Phi\Big{(}-\frac{\alpha}{\sigma}\Big{)}p_{\tilde{g}+\frac{2\alpha}{\sigma}|\tilde{g}\leq-\frac{\alpha}{\sigma}}\otimes p_{\tilde{\mathbf{g}}^{\bot}+\bar{\bm{\theta}}_{0}^{\bot}}+\Phi\Big{(}\frac{\alpha}{\sigma}\Big{)}p_{\tilde{g}|\tilde{g}\leq\frac{\alpha}{\sigma}}\otimes p_{\tilde{\mathbf{g}}^{\bot}}, (129)

g~𝒩(0,1)\tilde{g}\sim\mathcal{N}(0,1), 𝐠~𝒩(0,𝐈d𝜽¯0𝜽¯0)\tilde{\mathbf{g}}^{\bot}\sim\mathcal{N}(0,\mathbf{I}_{d}-\bar{\bm{\theta}}_{0}\bar{\bm{\theta}}_{0}^{\top}), 𝐠~\tilde{\mathbf{g}}^{\bot} is independent of g~\tilde{g} and perpendicular to 𝜽¯0\bar{\bm{\theta}}_{0}. Note that pg~+2ασ|g~ασp_{\tilde{g}+\frac{2\alpha}{\sigma}|\tilde{g}\leq-\frac{\alpha}{\sigma}} is the Gaussian probability density function with mean 2ασ\frac{2\alpha}{\sigma} and variance 11 truncated to the interval (,ασ)(-\infty,-\frac{\alpha}{\sigma}), and similarly for pg~|g~ασp_{\tilde{g}|\tilde{g}\leq\frac{\alpha}{\sigma}}. In general, when Gσ(α,ξ0,𝝁)G_{\sigma}(\alpha,\xi_{0},\bm{\mu}^{\bot}) is small, so is the generalization error.

By applying the result in Theorem 2.A, the following theorem provides an upper bound for the generalization error at each iteration tt for mm large enough.

Theorem 7.

Fix any σ+\sigma\in\mathbb{R}_{+}, dd\in\mathbb{N}, ϵ+\epsilon\in\mathbb{R}_{+} and δ(0,1)\delta\in(0,1). With probability at least 1δ1-\delta, the absolute generalization error at t=0t=0 can be upper bounded as follows

|gen0(P𝐙,P𝐗,P𝜽0|Sl,Su)|(c2c1)2d4lognn1.\displaystyle\big{|}\mathrm{gen}_{0}(P_{\mathbf{Z}},P_{\mathbf{X}},P_{\bm{\theta}_{0}|S_{\mathrm{l}},S_{\mathrm{u}}})\big{|}\leq\sqrt{\frac{(c_{2}-c_{1})^{2}d}{4}\log\frac{n}{n-1}}. (130)

For each t[1:τ]t\in[1:\tau], for mm large enough, with probability at least 1δ1-\delta,

|gent(P𝐙,P𝐗,{P𝜽k|Sl,Su}k=0t,{f𝜽k}k=0t1)|\displaystyle\big{|}\mathrm{gen}_{t}(P_{\mathbf{Z}},P_{\mathbf{X}},\{P_{\bm{\theta}_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\bm{\theta}_{k}}\}_{k=0}^{t-1})\big{|}
c2c12𝔼ξ0,𝝁[Gσ(Fσ(t1)(α(ξ0,𝝁)),ξ0,𝝁)+ϵ].\displaystyle\leq\frac{c_{2}-c_{1}}{\sqrt{2}}\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\sqrt{G_{\sigma}\big{(}F_{\sigma}^{(t-1)}(\alpha(\xi_{0},\bm{\mu}^{\bot})),\xi_{0},\bm{\mu}^{\bot}\big{)}+\epsilon}\bigg{]}. (131)

The proof of Theorem 7 is provided in Appendix E. Several remarks about Theorem 7 are as follows.
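Before turning to these remarks, the t=0 bound (130) is easy to evaluate numerically. In the short sketch below, c_1 and c_2 are computed from (124)–(125) with assumed values of c and r, so the numbers are only indicative of how the bound decays in n.

```python
# Evaluation of the t = 0 bound (130) for a few values of n.
import numpy as np

sigma, d = 0.6, 2
c, r = 1.0, 3.0                                             # assumed radii for Theta_{mu,c} and B_r^y
c1 = np.log(2.0 * (2.0 * np.pi)**(d / 2.0) * sigma**d)      # (124)
c2 = c1 + d * (c + r)**2 / (2.0 * sigma**2)                 # (125)
for n in (10, 50, 100, 500):
    bound0 = np.sqrt((c2 - c1)**2 * d / 4.0 * np.log(n / (n - 1.0)))
    print(f"n = {n}: bound (130) on |gen_0| approx {bound0:.3f}")
```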

Figure 13 ((a) σ=0.6\sigma=0.6; (b) σ=3\sigma=3): Upper bounds for generalization error at t=0t=0 and t=1t=1 under different σ\sigma when d=2d=2 and 𝝁=(1,0)\bm{\mu}=(1,0).
Figure 14 ((a) σ=0.6\sigma=0.6; (b) σ=3\sigma=3): The comparison between the upper bound for |gent||\mathrm{gen}_{t}| and the empirical generalization error at each iteration tt. The upper bounds are both for d=2d=2.
Figure 15: Gσ(α)G_{\sigma}(\alpha) versus α\alpha for different σ\sigma.

First, we compare the upper bounds for |gen0||\mathrm{gen}_{0}| and |gen1||\mathrm{gen}_{1}|, as shown in Figures 13(a) and 13(b). For any fixed σ\sigma, when nn is sufficiently small, the upper bound for |gen0||\mathrm{gen}_{0}| is greater than that for |gen1||\mathrm{gen}_{1}|. As nn increases, the upper bound for |gen1||\mathrm{gen}_{1}| surpasses that of |gen0||\mathrm{gen}_{0}|, as shown in Figure 13(b). This is consistent with the intuition that when the labelled data is limited, using the unlabelled data can help improve the generalization performance. However, as the number of labelled data increases, using the unlabelled data may degrade the generalization performance, if the distributions corresponding to classes +1+1 and 1-1 have a large overlap. This is because the labelled data is already effective in learning the unknown parameter 𝜽t\bm{\theta}_{t} well and additional pseudo-labelled data does not help to further boost the generalization performance. Furthermore, by comparing Figures 13(a) and 13(b), we can see that for smaller σ\sigma, the improvement from |gen0||\mathrm{gen}_{0}| to |gen1||\mathrm{gen}_{1}| is more pronounced. The intuition is that when σ\sigma decreases, the data samples have smaller variance and thus the pseudo-labelling is more accurate. In this case, unlabelled data can improve the generalization performance. Let us examine the effect of nn, the number of labelled training samples. Recall the Taylor expansion of α\alpha in (24). It can be seen that as nn increases, α\alpha converges to 11 in probability. Suppose the dimension d=2d=2 and 𝝁=(1,0)\bm{\mu}=(1,0). Then 𝝁=[0,μ2]\bm{\mu}^{\bot}=[0,\mu_{2}^{\bot}] where μ2𝒩(0,1)\mu_{2}^{\bot}\sim\mathcal{N}(0,1). The upper bound for the absolute generalization error at t=1t=1 can be rewritten as

|gen1|c2c1222nπσ2eny2σ2Gσ(1y2)dy,\displaystyle|\mathrm{gen}_{1}|\lessapprox\frac{c_{2}-c_{1}}{\sqrt{2}}\int_{-\sqrt{2}}^{\sqrt{2}}\sqrt{\frac{n}{\pi\sigma^{2}}}\,e^{-\frac{ny^{2}}{\sigma^{2}}}\,\sqrt{G_{\sigma}(1-y^{2})}~{}\mathrm{d}y, (132)

which is a decreasing function of nn, as shown in Figures 13(a) and 13(b).
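The right-hand side of (132) can be evaluated numerically. The following sketch specializes G_{\sigma}(α) in (206) to d=2 and 𝝁=(1,0): in this case the component of the data orthogonal to \bar{\bm{\theta}}_{0} is a scalar and, by our reading of (193), the shift \bar{\bm{\theta}}_{0}^{\bot} has magnitude 2\sqrt{1-\alpha^{2}}/\sigma. The integration cut-offs, the interpolation grid, and the constants c and r entering c_1 and c_2 are illustrative assumptions, so the printed values are only indicative; the last two lines also let one inspect the monotonicity in σ discussed in Remark 8 below.

```python
# Sketch of G_sigma(alpha) from (206) for d = 2 and of the RHS of (132).
import numpy as np
from scipy.stats import norm
from scipy.integrate import dblquad, quad

def G(alpha, sigma, cut=10.0):
    """G_sigma(alpha) from (206), specialized to d = 2 and mu = (1, 0)."""
    shift = 2.0 * alpha / sigma                          # shift of the component along theta_0-bar
    h = 2.0 * np.sqrt(max(1.0 - alpha**2, 0.0)) / sigma  # magnitude of theta_0-bar^perp (our reading of (193))

    def integrand(u, v):
        # log-ratio of the shifted and unshifted Gaussian products, computed stably
        t = shift * u - shift**2 / 2.0 + h * v - h**2 / 2.0
        log_term = np.logaddexp(0.0, t)                  # log(1 + e^t)
        return (norm.pdf(u) * norm.pdf(v) + norm.pdf(u - shift) * norm.pdf(v - h)) * log_term

    # u ranges over (-inf, alpha/sigma], v over the one-dimensional orthogonal direction;
    # both tails are truncated at +-cut for numerical integration
    val, _ = dblquad(integrand, -cut, cut, lambda v: -cut, lambda v: alpha / sigma)
    return val

if __name__ == "__main__":
    sigma = 0.6
    c1 = np.log(2.0 * (2.0 * np.pi) * sigma**2)          # (124) with d = 2
    c2 = c1 + 2.0 * (1.0 + 3.0)**2 / (2.0 * sigma**2)    # (125) with assumed c = 1, r = 3

    grid = np.linspace(-0.999, 0.999, 41)                # coarse alpha grid, interpolated below
    Gvals = np.array([G(a, sigma) for a in grid])

    def rhs_132(n):
        f = lambda y: (np.sqrt(n / (np.pi * sigma**2)) * np.exp(-n * y**2 / sigma**2)
                       * np.sqrt(max(np.interp(1.0 - y**2, grid, Gvals), 0.0)))
        val, _ = quad(f, -np.sqrt(2.0), np.sqrt(2.0), points=[0.0], limit=200)
        return (c2 - c1) / np.sqrt(2.0) * val

    for n in (5, 10, 20, 50):
        print(f"n = {n}: RHS of (132) approx {rhs_132(n):.3f}")
    for s in (0.6, 1.0):                                 # cf. Remark 8: G_sigma(alpha) for two values of sigma
        print(f"sigma = {s}: G_sigma(0.9) approx {G(0.9, s):.4f}")
```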

Second, in Figure 14(a), we plot the theoretical upper bound in (131) by ignoring ϵ\epsilon. Unfortunately, it is computationally difficult to numerically calculate the bound in (131) for high dimensions dd (due to the need for high-dimensional numerical integration), but we can still gain insight from the result for d=2d=2. It is shown that the upper bound for |gent||\mathrm{gen}_{t}| decreases as tt increases and finally converges to a non-zero constant. The gap between the upper bounds for |gent||\mathrm{gen}_{t}| and for |gent+1||\mathrm{gen}_{t+1}| decreases as tt increases and shrinks to almost 0 for t2t\geq 2. The intuition is that as mm\to\infty, there are sufficient data at each iteration and the algorithm can converge at a very early stage. In the empirical simulation, we let n=10n=10, d=50d=50, 𝝁=(1,0,,0)\bm{\mu}=(1,0,\ldots,0) and iteratively run the self-training procedure for 2020 iterations and 20002000 rounds. We find that the behaviour of the empirical generalization error (the red ‘-x’ line) is similar to the theoretical upper bound (the blue ‘-o’ line), which almost converges to its final value at t=2t=2. This result shows that the theoretical upper bound in (131) serves as a useful rule-of-thumb for how the generalization error changes over iterations. In Figure 14(b), we plot the theoretical bound and the result from the empirical simulation based on the toy example for d=2d=2 but larger nn and σ\sigma. This figure shows that when we increase nn and σ\sigma, using unlabelled data may not be able to improve the generalization performance. The intuition is that for nn large enough, merely using the labelled data can yield sufficiently low generalization error and for subsequent iterations with the pseudo-labelled data, the reduction in the test loss is negligible but the training loss will decrease more significantly (thus causing the generalization error to increase). When σ\sigma is larger, the data samples have larger variance and the classes have a larger overlap, and thus, the initial parameter 𝜽0\bm{\theta}_{0} learned by the labelled data cannot produce pseudo-labels with sufficiently high accuracy. Thus, the pseudo-labelled data cannot help to improve the generalization error significantly.

Remark 8 (Numerical plot of Gσ()G_{\sigma}(\cdot)).

To gain more insight, we plot Gσ(α,ξ0,𝛍)G_{\sigma}(\alpha,\xi_{0},\bm{\mu}^{\bot}) when d=2d=2 and 𝛍=(1,0)\bm{\mu}=(1,0) in Figure 15. Under these settings, Gσ(α,ξ0,𝛍)G_{\sigma}(\alpha,\xi_{0},\bm{\mu}^{\bot}) depends only on α\alpha and hence, we can rewrite it as Gσ(α)G_{\sigma}(\alpha). As shown in Figure 15, for all σ1>σ2\sigma_{1}>\sigma_{2}, there exists an α0[1,1]\alpha_{0}\in[-1,1] such that for all αα0=α0(σ1,σ2)\alpha\geq\alpha_{0}=\alpha_{0}(\sigma_{1},\sigma_{2}), Gσ1(α)>Gσ2(α)G_{\sigma_{1}}(\alpha)>G_{\sigma_{2}}(\alpha). From (17), we can see that α\alpha is close to 1 with high probability, which means that σGσ(α)\sigma\mapsto G_{\sigma}(\alpha) is monotonically increasing in σ\sigma with high probability. As a result, 𝔼α[Gσ(α)]\mathbb{E}_{\alpha}[\sqrt{G_{\sigma}(\alpha)}] increases as σ\sigma increases. This is consistent with the intuition that when the training data has a larger in-class variance, it is more difficult to generalize well.

Remark 9 (Discussion on Theorem 7).

As nn\to\infty, 𝛉0𝛍\bm{\theta}_{0}\to\bm{\mu} and α=ρ(𝛉0,𝛍)1\alpha=\rho(\bm{\theta}_{0},\bm{\mu})\to 1 almost surely, which means that the estimator converges to the optimal classifier for this bGMM. However, since there is no margin between two groups of data samples, the error probability Pr(Y^jYj)Q(1/σ)>0\Pr(\hat{Y}_{j}^{\prime}\neq Y_{j}^{\prime})\to\mathrm{Q}(1/\sigma)>0 (which is the Bayes error rate) and the disintegrated KL-divergence Dξ0,𝛍(P𝐗j,Y^jP𝐗,Y)D_{\xi_{0},\bm{\mu}^{\bot}}(P_{\mathbf{X}_{j}^{\prime},\hat{Y}_{j}^{\prime}}\|P_{\mathbf{X},Y}) between the estimated and underlying distributions cannot converge to 0.

In the other extreme case, when α=ρ(𝛉0,𝛍)=1\alpha=\rho(\bm{\theta}_{0},\bm{\mu})=-1 and 𝛉¯0=𝛍\bar{\bm{\theta}}_{0}=-\bm{\mu}, the error probability Pr(Y^jYj)=1Q(1/σ)>12\Pr(\hat{Y}_{j}^{\prime}\neq Y_{j}^{\prime})=1-\mathrm{Q}(1/\sigma)>\frac{1}{2} (for all σ>0\sigma>0) and Dξ0,𝛍(P𝐗j,Y^jP𝐗,Y)<D_{\xi_{0},\bm{\mu}^{\bot}}(P_{\mathbf{X}_{j}^{\prime},\hat{Y}_{j}^{\prime}}\|P_{\mathbf{X},Y})<\infty, so in this other extreme (flipped) scenario, we have more mistakes than correct pseudo-labels. The reason why Dα,𝛍(P𝐗j,Y^jP𝐗,Y)D_{\alpha,\bm{\mu}^{\bot}}(P_{\mathbf{X}_{j}^{\prime},\hat{Y}_{j}^{\prime}}\|P_{\mathbf{X},Y}) is finite is that when P𝐗,Y(𝐱,y)P_{\mathbf{X},Y}(\mathbf{x},y) is small, it means that 𝐱\mathbf{x} is far from both 𝛍-\bm{\mu} and 𝛍\bm{\mu}, and then P𝐗(𝐱)P_{\mathbf{X}}(\mathbf{x}) is also small. Thus, P𝐗j,Y^j(𝐱,y)=PY^j|𝐗j(y|𝐱)P𝐗(𝐱)P_{\mathbf{X}_{j}^{\prime},\hat{Y}_{j}^{\prime}}(\mathbf{x},y)=P_{\hat{Y}_{j}^{\prime}|\mathbf{X}_{j}^{\prime}}(y|\mathbf{x})P_{\mathbf{X}}(\mathbf{x}) is also small.
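The limiting error probabilities discussed in this remark are easy to check by simulation. The following minimal Monte Carlo sketch (the sample size and seed are arbitrary assumptions) compares the empirical pseudo-label error rate for \bar{\bm{\theta}}_{0}=\bm{\mu} and \bar{\bm{\theta}}_{0}=-\bm{\mu} against Q(1/σ) and 1−Q(1/σ), respectively.

```python
# Monte Carlo check of the pseudo-label error rates in Remark 9.
import numpy as np
from scipy.special import erfc

def Q(x):
    # Gaussian tail function Q(x) = P(N(0,1) > x)
    return 0.5 * erfc(x / np.sqrt(2.0))

rng = np.random.default_rng(0)
sigma, d, m = 0.6, 2, 200_000            # assumed simulation sizes
mu = np.zeros(d)
mu[0] = 1.0

Y = rng.choice([-1.0, 1.0], size=m)
X = Y[:, None] * mu + sigma * rng.standard_normal((m, d))   # samples from bGMM(mu, sigma)

for theta_bar, name in [(mu, "theta_0-bar = +mu"), (-mu, "theta_0-bar = -mu")]:
    Y_hat = np.sign(X @ theta_bar)       # pseudo-labels sgn(theta_bar^T X)
    print(f"{name}: empirical error rate = {np.mean(Y_hat != Y):.4f}")
print(f"Q(1/sigma) = {Q(1.0 / sigma):.4f}, 1 - Q(1/sigma) = {1.0 - Q(1.0 / sigma):.4f}")
```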

Appendix E Proof of Theorem 7

For simplicity, in the following, we abbreviate gent(P𝐙,P𝐗,{P𝜽k|Sl,Su}k=0t,{f𝜽k}k=0t1)\mathrm{gen}_{t}(P_{\mathbf{Z}},P_{\mathbf{X}},\{P_{\bm{\theta}_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\bm{\theta}_{k}}\}_{k=0}^{t-1}) as gent\mathrm{gen}_{t}.

1.

    Initial round t=0t=0: Since Yi𝐗ii.i.d.𝒩(𝝁,σ2𝐈d)Y_{i}\mathbf{X}_{i}\overset{\text{i.i.d.}}{\sim}\mathcal{N}(\bm{\mu},\sigma^{2}\mathbf{I}_{d}), we have 𝜽0𝒩(𝝁,σ2n𝐈d)\bm{\theta}_{0}\sim\mathcal{N}(\bm{\mu},\frac{\sigma^{2}}{n}\mathbf{I}_{d}) and for some constant c+c\in\mathbb{R}_{+},

    Pr(𝜽0Θ𝝁,c)=Pr(𝜽0𝝁c)=(12Φ(ncσ))d=:1δnc,d.\displaystyle\Pr(\bm{\theta}_{0}\in\Theta_{\bm{\mu},c})=\Pr(\|\bm{\theta}_{0}-\bm{\mu}\|_{\infty}\leq c)=\bigg{(}1-2\Phi\Big{(}-\frac{\sqrt{n}c}{\sigma}\Big{)}\bigg{)}^{d}=:1-\delta_{\sqrt{n}c,d}. (133)

    By choosing cc large enough, δnc,d\delta_{\sqrt{n}c,d} can be made arbitrarily small. Consider 𝜽~0\tilde{\bm{\theta}}_{0} and (𝐗~,Y~)(\tilde{\mathbf{X}},\tilde{Y}) as independent copies of 𝜽0Q𝜽0\bm{\theta}_{0}\sim Q_{\bm{\theta}_{0}} and (𝐗,Y)P𝐗,Y=PY𝒩(Y𝝁,σ2Id)(\mathbf{X},Y)\sim P_{\mathbf{X},Y}=P_{Y}\otimes\mathcal{N}(Y\bm{\mu},\sigma^{2}I_{d}), respectively, such that Pθ~0,𝐗~,Y~=Q𝜽0P𝐗,YP_{\tilde{\theta}_{0},\tilde{\mathbf{X}},\tilde{Y}}=Q_{\bm{\theta}_{0}}\otimes P_{\mathbf{X},Y}. Then the probability that l(𝜽0,(𝐗,Y))subG((c2c1)/2)l(\bm{\theta}_{0},(\mathbf{X},Y))\sim\mathrm{subG}((c_{2}-c_{1})/2) under (𝐗,Y)P𝐗,Y(\mathbf{X},Y)\sim P_{\mathbf{X},Y} is given as follows

    Pr(Λl(𝜽~0,(𝐗~,Y~))(λ,𝜽~0)λ2(c2c1)28)\displaystyle\Pr\bigg{(}\Lambda_{l(\tilde{\bm{\theta}}_{0},(\tilde{\mathbf{X}},\tilde{Y}))}(\lambda,\tilde{\bm{\theta}}_{0})\leq\frac{\lambda^{2}(c_{2}-c_{1})^{2}}{8}\bigg{)} (134)
    Pr(Λl(𝜽~0,(𝐗~,Y~))(λ,𝜽~0)λ2(c2c1)28 and 𝜽~0Θ𝝁,c)\displaystyle\geq\Pr\bigg{(}\Lambda_{l(\tilde{\bm{\theta}}_{0},(\tilde{\mathbf{X}},\tilde{Y}))}(\lambda,\tilde{\bm{\theta}}_{0})\leq\frac{\lambda^{2}(c_{2}-c_{1})^{2}}{8}\text{ and }\tilde{\bm{\theta}}_{0}\in\Theta_{\bm{\mu},c}\bigg{)} (135)
    =Pr(𝜽~0Θ𝝁,c)Pr(Λl(𝜽~0,(𝐗~,Y~))(λ,𝜽~0)λ2(c2c1)28|𝜽~0Θ𝝁,c)\displaystyle=\Pr(\tilde{\bm{\theta}}_{0}\in\Theta_{\bm{\mu},c})\Pr\bigg{(}\Lambda_{l(\tilde{\bm{\theta}}_{0},(\tilde{\mathbf{X}},\tilde{Y}))}(\lambda,\tilde{\bm{\theta}}_{0})\leq\frac{\lambda^{2}(c_{2}-c_{1})^{2}}{8}\Big{|}\tilde{\bm{\theta}}_{0}\in\Theta_{\bm{\mu},c}\bigg{)} (136)
    =(1δnc,d)(1δr,d),\displaystyle=(1-\delta_{\sqrt{n}c,d})(1-\delta_{r,d}), (137)

    where (137) follows from (127) and (133).

    Fix some dd\in\mathbb{N}, ϵ>0\epsilon>0 and δ(0,1)\delta\in(0,1). There exists n0(d,δ)n_{0}(d,\delta)\in\mathbb{N}, r0(d,δ)+r_{0}(d,\delta)\in\mathbb{R}_{+} such that for all n>n0,r>r0n>n_{0},r>r_{0}, δnc,d<δ3\delta_{\sqrt{n}c,d}<\frac{\delta}{3}, δr,d<δ3\delta_{r,d}<\frac{\delta}{3}, and then with probability at least 1δ1-\delta, the absolute generalization error can be upper bounded as follows

    |gen0|1ni=1n(c2c1)22I(𝜽0;𝐗i,Yi).\displaystyle|\mathrm{gen}_{0}|\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{\frac{(c_{2}-c_{1})^{2}}{2}I(\bm{\theta}_{0};\mathbf{X}_{i},Y_{i})}. (138)

    Then mutual information can be calculated as follows

    I(𝜽0;𝐗i,Yi)\displaystyle I(\bm{\theta}_{0};\mathbf{X}_{i},Y_{i}) =h(𝜽0)h(𝜽0|𝐗i,Yi)\displaystyle=h(\bm{\theta}_{0})-h(\bm{\theta}_{0}|\mathbf{X}_{i},Y_{i}) (139)
    =h(1nj=1nYj𝐗j)h(1nj=1nYj𝐗j|𝐗i,Yi)\displaystyle=h\bigg{(}\frac{1}{n}\sum_{j=1}^{n}Y_{j}\mathbf{X}_{j}\bigg{)}-h\bigg{(}\frac{1}{n}\sum_{j=1}^{n}Y_{j}\mathbf{X}_{j}\bigg{|}\mathbf{X}_{i},Y_{i}\bigg{)} (140)
    =d2log(2πeσ2n)h(1nj[n],jiYj𝐗j)\displaystyle=\frac{d}{2}\log\bigg{(}\frac{2\pi e\sigma^{2}}{n}\bigg{)}-h\bigg{(}\frac{1}{n}\sum_{j\in[n],j\neq i}Y_{j}\mathbf{X}_{j}\bigg{)} (141)
    =d2log(2πeσ2n)d2log(2πe(n1)σ2n2)\displaystyle=\frac{d}{2}\log\bigg{(}\frac{2\pi e\sigma^{2}}{n}\bigg{)}-\frac{d}{2}\log\bigg{(}\frac{2\pi e(n-1)\sigma^{2}}{n^{2}}\bigg{)} (142)
    =d2lognn1.\displaystyle=\frac{d}{2}\log\frac{n}{n-1}. (143)

    Thus we obtain (130).

2.

Pseudo-label using θ0\bm{\theta}_{0}: This step is the same as in Appendix C.

3.

    Iteration t=1t=1: Recall 𝜽1\bm{\theta}_{1} in (78), the new model parameter learned from the pseudo-labelled dataset S^u,1\hat{S}_{\mathrm{u},1}.

(a)

Recall the conditional expectation 𝝁1ξ0,𝝁\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}} defined in (94). The ℓ∞\ell_{\infty} norm of the difference between 𝝁1ξ0,𝝁\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}} and 𝝁\bm{\mu} can be upper bounded by

      𝝁1ξ0,𝝁𝝁\displaystyle\|\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}-\bm{\mu}\|_{\infty} (2Q(ασ)+2σα2πexp(α22σ2))2+2σ2β2πexp(2α22σ2)\displaystyle\leq\sqrt{\bigg{(}-2\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)}+\frac{2\sigma\alpha}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bigg{)}^{2}+\frac{2\sigma^{2}\beta^{2}}{\pi}\exp\bigg{(}-\frac{2\alpha^{2}}{2\sigma^{2}}\bigg{)}} (144)
      <(2Φ(1σ)+2σ2π)2+2σ2π=:c~1,\displaystyle<\sqrt{\bigg{(}2\Phi\bigg{(}\frac{1}{\sigma}\bigg{)}+\frac{2\sigma}{\sqrt{2\pi}}\bigg{)}^{2}+\frac{2\sigma^{2}}{\pi}}=:\tilde{c}_{1}, (145)

      where c~1\tilde{c}_{1} is a constant only dependent on σ\sigma.

(b)

Next, we need to calculate the probability that l(𝜽1,(𝐗,Y))subG((c2c1)/2)l(\bm{\theta}_{1},(\mathbf{X},Y))\sim\mathrm{subG}((c_{2}-c_{1})/2) under (𝐗,Y)P𝐗,Y(\mathbf{X},Y)\sim P_{\mathbf{X},Y}.

      Let 𝐕i=sgn(𝜽¯0𝐗i)𝐗i𝝁1ξ0,𝝁\mathbf{V}_{i}=\operatorname{sgn}(\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}_{i}^{\prime})\mathbf{X}_{i}^{\prime}-\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}. For any k[d]k\in[d], let Vi,kV_{i,k}, θ1,k\theta_{1,k}, μ1,k\mu_{1,k} denote the kk-th components of 𝐕i\mathbf{V}_{i}, 𝜽1\bm{\theta}_{1} and 𝝁1ξ0,𝝁\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}, respectively. Recall the decompositions 𝐗i=Yi𝝁+σg~i𝜽¯0+σ𝐠~i\mathbf{X}_{i}^{\prime}=Y_{i}^{\prime}\bm{\mu}+\sigma\tilde{g}_{i}\bar{\bm{\theta}}_{0}+\sigma\tilde{\mathbf{g}}_{i}^{\bot} in (83) and 𝜽¯0𝐗i=Yiα+σg~i\bar{\bm{\theta}}_{0}\mathbf{X}_{i}^{\prime}=Y_{i}^{\prime}\alpha+\sigma\tilde{g}_{i} in (84). Suppose the basis of d\mathbb{R}^{d} is denoted by B={𝐯1,,𝐯d}B=\{\mathbf{v}_{1},\ldots,\mathbf{v}_{d}\} and let 𝐯1=𝜽¯0\mathbf{v}_{1}=\bar{\bm{\theta}}_{0}. Then we have

      𝐠~i=g~i,2𝐯2++g~i,d𝐯d,\displaystyle\tilde{\mathbf{g}}_{i}^{\bot}=\tilde{g}_{i,2}^{\bot}\mathbf{v}_{2}+\ldots+\tilde{g}_{i,d}^{\bot}\mathbf{v}_{d}, (146)

where g~i,k𝒩(0,1)\tilde{g}_{i,k}^{\bot}\sim\mathcal{N}(0,1) for any k[2:d]k\in[2:d] and {g~i,k}k=2d\{\tilde{g}_{i,k}^{\bot}\}_{k=2}^{d} are mutually independent. We also let 𝝁=(μ0,1,,μ0,d)\bm{\mu}=(\mu_{0,1},\ldots,\mu_{0,d}).

      Given any (ξ0,𝝁)(\xi_{0},\bm{\mu}^{\bot}), the moment generating function (MGF) of Vi,1V_{i,1} is given as follows: for any s1>0s_{1}>0,

      𝔼Vi,1[es1Vi,1]\displaystyle\mathbb{E}_{V_{i,1}}[e^{s_{1}V_{i,1}}]
      =Q(ασ)𝔼g~i[es1(μ0,1μ1,1+σg~i)|g~i>ασ]+Q(ασ)𝔼g~i[es1(μ0,1μ1,1+σg~i)|g~i>ασ]\displaystyle=\mathrm{Q}\bigg{(}-\frac{\alpha}{\sigma}\bigg{)}\mathbb{E}_{\tilde{g}_{i}}\Big{[}e^{s_{1}(\mu_{0,1}-\mu_{1,1}+\sigma\tilde{g}_{i})}\Big{|}\tilde{g}_{i}>-\frac{\alpha}{\sigma}\Big{]}+\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)}\mathbb{E}_{\tilde{g}_{i}}\Big{[}e^{s_{1}(-\mu_{0,1}-\mu_{1,1}+\sigma\tilde{g}_{i})}\Big{|}\tilde{g}_{i}>\frac{\alpha}{\sigma}\Big{]} (147)
      =es1(μ0,1μ1,1)eσ2s122Φ(ασ+σs1)+es1(μ0,1μ1,1)eσ2s122Φ(ασ+σs1).\displaystyle=e^{s_{1}(\mu_{0,1}-\mu_{1,1})}e^{\frac{\sigma^{2}s_{1}^{2}}{2}}\Phi\bigg{(}\frac{\alpha}{\sigma}+\sigma s_{1}\bigg{)}+e^{s_{1}(-\mu_{0,1}-\mu_{1,1})}e^{\frac{\sigma^{2}s_{1}^{2}}{2}}\Phi\bigg{(}-\frac{\alpha}{\sigma}+\sigma s_{1}\bigg{)}. (148)

The final equality follows from the fact that the MGF of a zero-mean univariate Gaussian truncated to (a,b)(a,b) is eσ2s2/2[Φ(bσs)Φ(aσs)Φ(b)Φ(a)]e^{\sigma^{2}s^{2}/2}\Big{[}\frac{\Phi(b-\sigma s)-\Phi(a-\sigma s)}{\Phi(b)-\Phi(a)}\Big{]}. The second derivative of log𝔼Vi,1[es1Vi,1]\log\mathbb{E}_{V_{i,1}}[e^{s_{1}V_{i,1}}] can be upper bounded as

\displaystyle\tilde{R}_{1}(s_{1}):=\frac{\mathrm{d}^{2}\log\mathbb{E}_{V_{i,1}}[e^{s_{1}V_{i,1}}]}{\mathrm{d}s_{1}^{2}} (149)
\displaystyle\leq\sigma^{2}+\frac{\mathrm{const.}}{\big{(}\Phi(\frac{\alpha}{\sigma}+\sigma s_{1})e^{s_{1}\mu_{0,1}}+\Phi(\frac{-\alpha}{\sigma}+\sigma s_{1})e^{-s_{1}\mu_{0,1}}\big{)}^{2}}<\infty. (150)

      For k[2:d]k\in[2:d] and any sk>0s_{k}>0, the MGF of Vi,kV_{i,k} is given as

      𝔼Vi,k[eskVi,k]=𝔼σg~i,k,Yi[esk(Yiμ0,kμ1,k+σg~i,k)]\displaystyle\mathbb{E}_{V_{i,k}}[e^{s_{k}V_{i,k}}]=\mathbb{E}_{\sigma\tilde{g}_{i,k}^{\bot},Y_{i}^{\prime}}\Big{[}e^{s_{k}(Y_{i}^{\prime}\mu_{0,k}-\mu_{1,k}+\sigma\tilde{g}_{i,k}^{\bot})}\Big{]} (151)
      =Q(ασ)esk(μ0,kμ1,k)eσ2sk22+Q(ασ)esk(μ0,kμ1,k)eσ2sk22,\displaystyle=Q\bigg{(}-\frac{\alpha}{\sigma}\bigg{)}e^{s_{k}(\mu_{0,k}-\mu_{1,k})}e^{\frac{\sigma^{2}s_{k}^{2}}{2}}+Q\bigg{(}\frac{\alpha}{\sigma}\bigg{)}e^{s_{k}(-\mu_{0,k}-\mu_{1,k})}e^{\frac{\sigma^{2}s_{k}^{2}}{2}}, (152)

      and the second derivative of log𝔼Vi,k[eskVi,k]\log\mathbb{E}_{V_{i,k}}[e^{s_{k}V_{i,k}}] is given by

\displaystyle\tilde{R}_{k}(s_{k}):=\frac{\mathrm{d}^{2}\log\mathbb{E}_{V_{i,k}}[e^{s_{k}V_{i,k}}]}{\mathrm{d}s_{k}^{2}}=\sigma^{2}+\frac{4\mu_{0,k}^{2}\mathrm{Q}(-\frac{\alpha}{\sigma})\mathrm{Q}(\frac{\alpha}{\sigma})}{(\mathrm{Q}(-\frac{\alpha}{\sigma})e^{s_{k}\mu_{0,k}}+\mathrm{Q}(\frac{\alpha}{\sigma})e^{-s_{k}\mu_{0,k}})^{2}}. (153)

      Fix k[1:d]k\in[1:d]. According to Taylor’s theorem, we have

      log𝔼Vi,k[eskVi,k]=R~k(ξL,k)2sk2,\displaystyle\log\mathbb{E}_{V_{i,k}}[e^{s_{k}V_{i,k}}]=\frac{\tilde{R}_{k}(\xi_{\mathrm{L},k})}{2}s_{k}^{2}, (154)

      for some ξL,k(0,sk)\xi_{\mathrm{L},k}\in(0,s_{k}) and R~k(ξL,k)<\tilde{R}_{k}(\xi_{\mathrm{L},k})<\infty. Then the Cramér transform of log𝔼Vi,k[eskVi,k]\log\mathbb{E}_{V_{i,k}}[e^{s_{k}V_{i,k}}] can be lower bounded as follows: for any ε>0\varepsilon>0,

      supsk>0(skεlog𝔼Vi,k[eskVi,k])supsk>0(skεR~k(ξL,k)sk22)=ε22R~k(ξL,k).\displaystyle\sup_{s_{k}>0}\Big{(}s_{k}\varepsilon-\log\mathbb{E}_{V_{i,k}}[e^{s_{k}V_{i,k}}]\Big{)}\geq\sup_{s_{k}>0}\Big{(}s_{k}\varepsilon-\frac{\tilde{R}_{k}(\xi_{\mathrm{L},k})s_{k}^{2}}{2}\Big{)}=\frac{\varepsilon^{2}}{2\tilde{R}_{k}(\xi_{\mathrm{L},k})}. (155)

Let R~=maxξ0,𝝁maxk[1:d]R~k(ξL,k)\tilde{R}^{*}=\max_{\xi_{0},\bm{\mu}^{\bot}}\max_{k\in[1:d]}\tilde{R}_{k}(\xi_{\mathrm{L},k}), which is a finite constant only dependent on σ\sigma. Since {sgn(𝜽¯0𝐗i)𝐗i}i=1m\{\operatorname{sgn}(\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}_{i}^{\prime})\mathbf{X}_{i}^{\prime}\}_{i=1}^{m} are i.i.d. random variables conditioned on (ξ0,𝝁)(\xi_{0},\bm{\mu}^{\bot}), by applying the Chernoff–Cramér inequality, we have for all ε>0\varepsilon>0

      ξ0,𝝁(𝜽1𝝁1ξ0,𝝁>ε)\displaystyle\mathbb{P}_{\xi_{0},\bm{\mu}^{\bot}}\Big{(}\|\bm{\theta}_{1}-\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}\|_{\infty}>\varepsilon\Big{)}
      =ξ0,𝝁(maxk[1:d]|θ1,kμ1,k|>ε)\displaystyle=\mathbb{P}_{\xi_{0},\bm{\mu}^{\bot}}\Big{(}\max_{k\in[1:d]}|\theta_{1,k}-\mu_{1,k}|>\varepsilon\Big{)} (156)
      k=1dξ0,𝝁(|θ1,kμ1,k|>ε)\displaystyle\leq\sum_{k=1}^{d}\mathbb{P}_{\xi_{0},\bm{\mu}^{\bot}}\Big{(}|\theta_{1,k}-\mu_{1,k}|>\varepsilon\Big{)} (157)
      =k=1dξ0,𝝁(|1mi=1mVi,k|>ε)\displaystyle=\sum_{k=1}^{d}\mathbb{P}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{(}\bigg{|}\frac{1}{m}\sum_{i=1}^{m}V_{i,k}\bigg{|}>\varepsilon\bigg{)} (158)
      k=1d2exp(msups>0(sεlog𝔼Vi,k[esVi,k]))\displaystyle\leq\sum_{k=1}^{d}2\exp\Big{(}-m\sup_{s>0}\Big{(}s\varepsilon-\log\mathbb{E}_{V_{i,k}}[e^{sV_{i,k}}]\Big{)}\Big{)} (159)
      2dexp(mε22R~)\displaystyle\leq 2d\exp\bigg{(}-\frac{m\varepsilon^{2}}{2\tilde{R}^{*}}\bigg{)} (160)
      =:δm,ε,d,\displaystyle=:\delta_{m,\varepsilon,d}, (161)

where δm,ε,d0\delta_{m,\varepsilon,d}\to 0 as mm\to\infty and does not depend on (ξ0,𝝁)(\xi_{0},\bm{\mu}^{\bot}).

      Choose some c(c~1,)c\in(\tilde{c}_{1},\infty) (c~1\tilde{c}_{1} defined in (145)). We have

      ξ0,𝝁(𝜽1Θ𝝁,c)ξ0,𝝁(𝜽1𝝁1ξ0,𝝁cc~1)1δm,cc~1,d.\displaystyle\mathbb{P}_{\xi_{0},\bm{\mu}^{\bot}}\big{(}\bm{\theta}_{1}\in\Theta_{\bm{\mu},c})\geq\mathbb{P}_{\xi_{0},\bm{\mu}^{\bot}}(\|\bm{\theta}_{1}-\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}\|_{\infty}\leq c-\tilde{c}_{1}\big{)}\geq 1-\delta_{m,c-\tilde{c}_{1},d}. (162)

      Consider 𝜽~1\tilde{\bm{\theta}}_{1} as an independent copy of 𝜽1\bm{\theta}_{1} and independent of (𝐗~,Y~)(\tilde{\mathbf{X}},\tilde{Y}). Then the probability that l(𝜽1,(𝐗,Y))subG((c2c1)/2)l(\bm{\theta}_{1},(\mathbf{X},Y))\sim\mathrm{subG}((c_{2}-c_{1})/2) under (𝐗,Y)P𝐗,Y(\mathbf{X},Y)\sim P_{\mathbf{X},Y} is given as follows

      ξ0,𝝁(Λl(𝜽~1,(𝐗~,Y~))(λ,𝜽~1)λ2(c2c1)28)\displaystyle\mathbb{P}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{(}\Lambda_{l(\tilde{\bm{\theta}}_{1},(\tilde{\mathbf{X}},\tilde{Y}))}(\lambda,\tilde{\bm{\theta}}_{1})\leq\frac{\lambda^{2}(c_{2}-c_{1})^{2}}{8}\bigg{)} (163)
      ξ0,𝝁(𝜽~1Θ𝝁,c)ξ0,𝝁(Λl(𝜽~1,(𝐗~,Y~))(λ,𝜽~1)λ2(c2c1)28|𝜽~1Θ𝝁,c)\displaystyle\geq\mathbb{P}_{\xi_{0},\bm{\mu}^{\bot}}(\tilde{\bm{\theta}}_{1}\in\Theta_{\bm{\mu},c})\mathbb{P}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{(}\Lambda_{l(\tilde{\bm{\theta}}_{1},(\tilde{\mathbf{X}},\tilde{Y}))}(\lambda,\tilde{\bm{\theta}}_{1})\leq\frac{\lambda^{2}(c_{2}-c_{1})^{2}}{8}\Big{|}\tilde{\bm{\theta}}_{1}\in\Theta_{\bm{\mu},c}\bigg{)} (164)
\displaystyle\geq(1-\delta_{m,c-\tilde{c}_{1},d})(1-\delta_{r,d}). (165)

    Thus, for some c(c~1,)c\in(\tilde{c}_{1},\infty), with probability at least (1δm,cc~1,d)(1δr,d)(1-\delta_{m,c-\tilde{c}_{1},d})(1-\delta_{r,d}), the absolute generalization error can be upper bounded as follows:

    |gen1|=|𝔼[LP𝐙(𝜽1)LS^u,1(𝜽1)]|\displaystyle|\mathrm{gen}_{1}|=|\mathbb{E}[L_{P_{\mathbf{Z}}}(\bm{\theta}_{1})-L_{\hat{S}_{\mathrm{u},1}}(\bm{\theta}_{1})]| (166)
    =|1mi=1m𝔼ξ0,𝝁[𝔼[l(𝜽1,(𝐗,Y))l(𝜽1,(𝐗i,Y^i))|ξ0,𝝁]]|\displaystyle=\bigg{|}\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\mathbb{E}\left[l(\bm{\theta}_{1},(\mathbf{X},Y))-l(\bm{\theta}_{1},(\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime}))|\xi_{0},\bm{\mu}^{\bot}\right]\bigg{]}\bigg{|} (167)
    1mi=1m𝔼ξ0,𝝁[(c2c1)22(Iξ0,𝝁(𝜽1,(𝐗i,Y^i))+Dξ0,𝝁(P𝐗i,Y^iP𝐗,Y))],\displaystyle\leq\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\sqrt{\frac{(c_{2}-c_{1})^{2}}{2}\Big{(}I_{\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}_{1},(\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime}))+D_{\xi_{0},\bm{\mu}^{\bot}}(P_{\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime}}\|P_{\mathbf{X},Y})\Big{)}}~{}\bigg{]}, (168)

    where P𝜽1,(𝐗,Y)|ξ0,𝝁=Q𝜽1|ξ0,𝝁P𝐗,YP_{\bm{\theta}_{1},(\mathbf{X},Y)|\xi_{0},\bm{\mu}^{\bot}}=Q_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}}\otimes P_{\mathbf{X},Y} and Q𝜽1|ξ0,𝝁Q_{\bm{\theta}_{1}|\xi_{0},\bm{\mu}^{\bot}} denotes the marginal distribution of 𝜽1\bm{\theta}_{1} under parameters (ξ0,𝝁)(\xi_{0},\bm{\mu}^{\bot}).

    In the following, we derive closed-form expressions of the mutual information and KL-divergence in (168). For any j[1:m]j\in[1:m]:

    • Calculate Iξ0,μ(θ1;𝐗j,Y^j)I_{\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}_{1};\mathbf{X}^{\prime}_{j},\hat{Y}_{j}^{\prime}): For arbitrary random variables XX and UU, we define the disintegrated conditional differential entropy of XX given UU as

      hU(X):=h(PX|U).\displaystyle h_{U}(X):=h(P_{X|U}). (169)

      This is a σ(U)\sigma(U)-measurable random variable. Conditioned on a certain pair (ξ0,𝝁)(\xi_{0},\bm{\mu}^{\bot}), the mutual information between 𝜽1\bm{\theta}_{1} and (𝐗i,Y^i)(\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime}) is

      Iξ0,𝝁(𝜽1;𝐗i,Y^i)\displaystyle I_{\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}_{1};\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime})
      =hξ0,𝝁(1mi=1msgn(𝜽0𝐗i)𝐗i)hξ0,𝝁(1mj=1mY^j𝐗j|𝐗i,Y^i)\displaystyle=h_{\xi_{0},\bm{\mu}^{\bot}}\bigg{(}\frac{1}{m}\sum_{i=1}^{m}\operatorname{sgn}(\bm{\theta}_{0}^{\top}\mathbf{X}^{\prime}_{i})\mathbf{X}^{\prime}_{i}\bigg{)}-h_{\xi_{0},\bm{\mu}^{\bot}}\bigg{(}\frac{1}{m}\sum_{j=1}^{m}\hat{Y}_{j}^{\prime}\mathbf{X}^{\prime}_{j}\bigg{|}\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime}\bigg{)} (170)
      =hξ0,𝝁(1mi=1msgn(𝜽0𝐗i)𝐗i)hξ0,𝝁(1mj[m],jisgn(𝜽0𝐗j)𝐗j)\displaystyle=h_{\xi_{0},\bm{\mu}^{\bot}}\bigg{(}\frac{1}{m}\sum_{i=1}^{m}\operatorname{sgn}(\bm{\theta}_{0}^{\top}\mathbf{X}^{\prime}_{i})\mathbf{X}^{\prime}_{i}\bigg{)}-h_{\xi_{0},\bm{\mu}^{\bot}}\bigg{(}\frac{1}{m}\sum_{j\in[m],j\neq i}\operatorname{sgn}(\bm{\theta}_{0}^{\top}\mathbf{X}^{\prime}_{j})\mathbf{X}^{\prime}_{j}\bigg{)} (171)
      =hξ0,𝝁(1mi=1msgn(𝜽0𝐗i)𝐗i)\displaystyle=h_{\xi_{0},\bm{\mu}^{\bot}}\bigg{(}\frac{1}{m}\sum_{i=1}^{m}\operatorname{sgn}(\bm{\theta}_{0}^{\top}\mathbf{X}^{\prime}_{i})\mathbf{X}^{\prime}_{i}\bigg{)}
      hξ0,𝝁(1m1j[m],jisgn(𝜽0𝐗j)𝐗j)dlogm1m.\displaystyle\quad-h_{\xi_{0},\bm{\mu}^{\bot}}\bigg{(}\frac{1}{m-1}\sum_{j\in[m],j\neq i}\operatorname{sgn}(\bm{\theta}_{0}^{\top}\mathbf{X}^{\prime}_{j})\mathbf{X}^{\prime}_{j}\bigg{)}-d\log\frac{m-1}{m}. (172)

As mm\to\infty, Iξ0,𝝁(𝜽1;𝐗i,Y^i)0I_{\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}_{1};\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime})\to 0 almost surely and hence in probability. Thus, for any ϵ>0\epsilon>0, there exists m0(ϵ,d,δ)m_{0}(\epsilon,d,\delta)\in\mathbb{N} such that for all m>m0m>m_{0},

      ξ0,𝝁(Iξ0,𝝁(𝜽1;𝐗i,Y^i)>ϵ)δ.\displaystyle\mathbb{P}_{\xi_{0},\bm{\mu}^{\bot}}(I_{\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}_{1};\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime})>\epsilon)\leq\delta. (173)
    • Calculate Dξ0,μ(P𝐗j,Y^jP𝐗,Y)D_{\xi_{0},\bm{\mu}^{\bot}}(P_{\mathbf{X}_{j}^{\prime},\hat{Y}_{j}^{\prime}}\|P_{\mathbf{X},Y}): First of all, since PY^j=PYP_{\hat{Y}_{j}^{\prime}}=P_{Y} (cf. (77)) regardless of the values of (ξ0,𝝁)(\xi_{0},\bm{\mu}^{\bot}), the disintegrated conditional KL-divergence can be rewritten as

      Dξ0,𝝁(P𝐗j,Y^jP𝐗,Y)\displaystyle D_{\xi_{0},\bm{\mu}^{\bot}}(P_{\mathbf{X}_{j}^{\prime},\hat{Y}_{j}^{\prime}}\|P_{\mathbf{X},Y})
      =PY^j(1)Dξ0,𝝁(P𝐗j|Y^j=1P𝐗|Y=1)+PY^j(1)Dξ0,𝝁(P𝐗j|Y^j=1P𝐗|Y=1).\displaystyle=P_{\hat{Y}_{j}^{\prime}}(-1)D_{\xi_{0},\bm{\mu}^{\bot}}(P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime}=-1}\|P_{\mathbf{X}|Y=-1})+P_{\hat{Y}_{j}^{\prime}}(1)D_{\xi_{0},\bm{\mu}^{\bot}}(P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime}=1}\|P_{\mathbf{X}|Y=1}). (174)

Recall the decomposition of a Gaussian vector 𝐠~j𝒩(0,𝐈d)\tilde{\mathbf{g}}_{j}\sim\mathcal{N}(0,\mathbf{I}_{d}) in (82). Note that rank(𝖢𝗈𝗏(𝐠~j))=rank(𝐈d𝜽¯0𝜽¯0)=d1\operatorname{rank}(\operatorname{\mathsf{Cov}}(\tilde{\mathbf{g}}_{j}^{\bot}))=\operatorname{rank}(\mathbf{I}_{d}-\bar{\bm{\theta}}_{0}\bar{\bm{\theta}}_{0}^{\top})=d-1.

      For any pair of labelled data sample (𝐗,Y)(\mathbf{X},Y), from (83), we similarly decompose 𝐗\mathbf{X} as 𝐗=Y𝝁+σ(g~𝜽¯0+𝐠~)\mathbf{X}=Y\bm{\mu}+\sigma(\tilde{g}\bar{\bm{\theta}}_{0}+\tilde{\mathbf{g}}^{\bot}), where g~𝒩(0,1)\tilde{g}\sim\mathcal{N}(0,1) and 𝐠~𝒩(0,𝐈d𝜽¯0𝜽¯0)\tilde{\mathbf{g}}^{\bot}\sim\mathcal{N}(0,\mathbf{I}_{d}-\bar{\bm{\theta}}_{0}\bar{\bm{\theta}}_{0}^{\top}). Let pg~p_{\tilde{g}} and p𝐠~p_{\tilde{\mathbf{g}}^{\bot}} denote the probability density functions of g~\tilde{g} and 𝐠~\tilde{\mathbf{g}}^{\bot}, respectively. For any 𝐱=𝝁+σ(u𝜽¯0+𝐮)d\mathbf{x}=\bm{\mu}+\sigma(u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp})\in\mathbb{R}^{d}, the joint probability distribution at (𝐗,Y)=(𝐱,1)(\mathbf{X},Y)=(\mathbf{x},1) is given by

      P𝐗,Y(𝐱,1)\displaystyle P_{\mathbf{X},Y}(\mathbf{x},1) =PY(1)p𝝁(𝐱|1)\displaystyle=P_{Y}(1)p_{\bm{\mu}}(\mathbf{x}|1)
      =PY(1)(2π)dσdexp(12σ2(𝐱y𝝁)(𝐱y𝝁))\displaystyle=\frac{P_{Y}(1)}{\sqrt{(2\pi)^{d}}\sigma^{d}}\exp\bigg{(}-\frac{1}{2\sigma^{2}}(\mathbf{x}-y\bm{\mu})^{\top}(\mathbf{x}-y\bm{\mu})\bigg{)} (175)
      =PY(1)(2π)dσdexp(12σ2(σu𝜽0¯+σ𝐮)(σu𝜽0¯+σ𝐮))\displaystyle=\frac{P_{Y}(1)}{\sqrt{(2\pi)^{d}}\sigma^{d}}\exp\bigg{(}-\frac{1}{2\sigma^{2}}(\sigma u\bar{\bm{\theta}_{0}}+\sigma\mathbf{u}^{\bot})^{\top}(\sigma u\bar{\bm{\theta}_{0}}+\sigma\mathbf{u}^{\bot})\bigg{)} (176)
      =PY(y)(2π)dσdexp(u22)exp((𝐮)𝐮2)\displaystyle=\frac{P_{Y}(y)}{\sqrt{(2\pi)^{d}}\sigma^{d}}\exp\bigg{(}-\frac{u^{2}}{2}\bigg{)}\exp\bigg{(}-\frac{(\mathbf{u}^{\bot})^{\top}\mathbf{u}^{\bot}}{2}\bigg{)} (177)
      =PY(1)pg~(u)p𝐠~(𝐮).\displaystyle=P_{Y}(1)p_{\tilde{g}}(u)p_{\tilde{\mathbf{g}}^{\bot}}(\mathbf{u}^{\bot}). (178)

      Similarly, for any 𝐱=𝝁+σ(u𝜽¯0+𝐮)d\mathbf{x}=-\bm{\mu}+\sigma(u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp})\in\mathbb{R}^{d}, the joint probability density evaluated at (X,Y)=(𝐱,1)(X,Y)=(\mathbf{x},-1) is given by

      P𝐗,Y(𝐱,1)=PY(1)p𝝁(𝐱|1)=PY(1)pg~(u)p𝐠~(𝐮).\displaystyle P_{\mathbf{X},Y}(\mathbf{x},-1)=P_{Y}(-1)p_{\bm{\mu}}(\mathbf{x}|-1)=P_{Y}(-1)p_{\tilde{g}}(u)p_{\tilde{\mathbf{g}}^{\bot}}(\mathbf{u}^{\bot}). (179)

      Second, we have P𝐗j|Y^j=y{1,+1}P𝐗j|Y^j,Yj=yPYj=y|Y^jP_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime}}=\sum_{y\in\{-1,+1\}}P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}=y}P_{Y_{j}^{\prime}=y|\hat{Y}_{j}^{\prime}}. The conditional probability distribution PYj|Y^jP_{Y_{j}^{\prime}|\hat{Y}_{j}^{\prime}} can be calculated as follows

      PYj|Y^j=PY^j|YjPYjPY^j=PY^j|Yj,\displaystyle P_{Y_{j}^{\prime}|\hat{Y}_{j}^{\prime}}=\frac{P_{\hat{Y}_{j}^{\prime}|Y_{j}^{\prime}}P_{Y_{j}^{\prime}}}{P_{\hat{Y}_{j}^{\prime}}}=P_{\hat{Y}_{j}^{\prime}|Y_{j}^{\prime}}, (180)

      where the last equality follows since PYj(1)=PYj(1)=PY^j(1)=PY^j(1)=1/2P_{Y_{j}^{\prime}}(-1)=P_{Y_{j}^{\prime}}(1)=P_{\hat{Y}_{j}^{\prime}}(-1)=P_{\hat{Y}_{j}^{\prime}}(1)=1/2. Since Y^j=sgn(Yjα+σg~j)\hat{Y}_{j}^{\prime}=\operatorname{sgn}(Y_{j}^{\prime}\alpha+\sigma\tilde{g}_{j}) (cf. (84)), we have

      PY^j|Yj(1|1)=Pr(Yjα+σg~j<0|Yj=1)=Q(ασ),\displaystyle P_{\hat{Y}_{j}^{\prime}|Y_{j}^{\prime}}(-1|-1)=\Pr(Y_{j}^{\prime}\alpha+\sigma\tilde{g}_{j}<0|Y_{j}^{\prime}=-1)=\mathrm{Q}\bigg{(}-\frac{\alpha}{\sigma}\bigg{)}, (181)

      and similarly,

      PY^j|Yj(1|1)=Q(ασ),PY^j|Yj(1|1)=Q(ασ),PY^j|Yj(1|1)=Q(ασ).\displaystyle P_{\hat{Y}_{j}^{\prime}|Y_{j}^{\prime}}(1|-1)=\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)},\;\;\;\;P_{\hat{Y}_{j}^{\prime}|Y_{j}^{\prime}}(-1|1)=\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)},\;\;\;\;P_{\hat{Y}_{j}^{\prime}|Y_{j}^{\prime}}(1|1)=\mathrm{Q}\bigg{(}-\frac{\alpha}{\sigma}\bigg{)}. (182)

      Thus, we conclude that

      PYj|Y^j(yj|y^j)={Q(ασ)yj=y^jQ(ασ)yjy^j.\displaystyle P_{Y_{j}^{\prime}|\hat{Y}_{j}^{\prime}}(y_{j}^{\prime}|\hat{y}_{j}^{\prime})=\left\{\begin{array}[]{cc}\mathrm{Q}(-\frac{\alpha}{\sigma})&y_{j}^{\prime}=\hat{y}_{j}^{\prime}\\ \mathrm{Q}(\frac{\alpha}{\sigma})&y_{j}^{\prime}\neq\hat{y}_{j}^{\prime}.\end{array}\right. (185)

      To calculate the conditional probability distribution P𝐗j|Y^j,YjP_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}, recall the decomposition of 𝐗j\mathbf{X}_{j}^{\prime} and 𝜽¯0𝐗j\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}_{j}^{\prime} in (83) and (84). Since the event {Y^j=1,Yj=1}\{\hat{Y}_{j}^{\prime}=-1,Y_{j}^{\prime}=-1\} is equivalent to {g~j<α/σ}\{\tilde{g}_{j}<\alpha/\sigma\} and g~j𝒩(0,1)\tilde{g}_{j}\sim\mathcal{N}(0,1), the conditional density of g~j\tilde{g}_{j} given Y^j=1,Yj=1\hat{Y}_{j}^{\prime}=-1,Y_{j}^{\prime}=-1 is given by

      pg~j|Y^j,Yj(u|1,1)=pg~j|g~jα/σ(u)=1{uα/σ}pg~j(u)Φ(α/σ),u.\displaystyle p_{\tilde{g}_{j}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(u|-1,-1)=p_{\tilde{g}_{j}|\tilde{g}_{j}\leq\alpha/\sigma}(u)=\frac{\text{1}\{u\leq\alpha/\sigma\}p_{\tilde{g}_{j}}(u)}{\Phi(\alpha/\sigma)},\quad\forall u\in\mathbb{R}. (186)

      Similarly, for any uu\in\mathbb{R}

\displaystyle p_{\tilde{g}_{j}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(u|-1,1) =p_{\tilde{g}_{j}|\tilde{g}_{j}\leq-\alpha/\sigma}(u)=\frac{\text{1}\{u\leq-\alpha/\sigma\}p_{\tilde{g}_{j}}(u)}{\Phi(-\alpha/\sigma)}, (187)
\displaystyle p_{\tilde{g}_{j}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(u|1,-1) =p_{\tilde{g}_{j}|\tilde{g}_{j}>\alpha/\sigma}(u)=\frac{\text{1}\{u>\alpha/\sigma\}p_{\tilde{g}_{j}}(u)}{\mathrm{Q}(\alpha/\sigma)}, (188)
\displaystyle p_{\tilde{g}_{j}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(u|1,1) =p_{\tilde{g}_{j}|\tilde{g}_{j}>-\alpha/\sigma}(u)=\frac{\text{1}\{u>-\alpha/\sigma\}p_{\tilde{g}_{j}}(u)}{\mathrm{Q}(-\alpha/\sigma)}. (189)

      For any 𝐱=𝝁+σ(u𝜽¯0+𝐮)d\mathbf{x}=\bm{\mu}+\sigma(u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp})\in\mathbb{R}^{d}, given Y^j=1,Yj=1\hat{Y}_{j}^{\prime}=1,Y_{j}^{\prime}=1, the conditional probability distribution at 𝐗j=𝐱\mathbf{X}_{j}^{\prime}=\mathbf{x} is given by

      P𝐗j|Y^j,Yj(𝐱|1,1)\displaystyle P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(\mathbf{x}|1,1) =P𝝁+σg~j𝜽¯0+σ𝐠~j|Y^j,Yj(𝝁+σ(u𝜽¯0+𝐮)|1,1)\displaystyle=P_{\bm{\mu}+\sigma\tilde{g}_{j}\bar{\bm{\theta}}_{0}+\sigma\tilde{\mathbf{g}}_{j}^{\bot}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(\bm{\mu}+\sigma(u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp})|1,1) (190)
      =Pσg~j𝜽¯0+σ𝐠~j|Y^j,Yj(σ(u𝜽¯0+𝐮)|1,1)\displaystyle=P_{\sigma\tilde{g}_{j}\bar{\bm{\theta}}_{0}+\sigma\tilde{\mathbf{g}}_{j}^{\bot}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(\sigma(u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp})|1,1) (191)
      =pg~j|Y^j,Yj(u|1,1)p𝐠~j(𝐮),\displaystyle=p_{\tilde{g}_{j}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(u|1,1)p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}), (192)

      where (192) follows since g~j\tilde{g}_{j} and 𝐠~j\tilde{\mathbf{g}}_{j}^{\bot} are mutually independent and 𝜽¯0𝐠~j\bar{\bm{\theta}}_{0}\perp\tilde{\mathbf{g}}_{j}^{\bot}.

      Since we can decompose 2𝝁/σ2\bm{\mu}/\sigma as

      2𝝁σ=2α𝜽¯0+2β2𝝁2αβ𝝊σ=2ασ𝜽¯0+𝜽¯0,\displaystyle\frac{2\bm{\mu}}{\sigma}=\frac{2\alpha\bar{\bm{\theta}}_{0}+2\beta^{2}\bm{\mu}-2\alpha\beta\bm{\upsilon}}{\sigma}=\frac{2\alpha}{\sigma}\bar{\bm{\theta}}_{0}+\bar{\bm{\theta}}_{0}^{\bot}, (193)

      given Y^j=1,Yj=1\hat{Y}_{j}^{\prime}=1,Y_{j}^{\prime}=-1, the conditional probability distribution at 𝐗j=𝐱\mathbf{X}_{j}^{\prime}=\mathbf{x} is given by

      P𝐗j|Y^j,Yj(𝐱|1,1)\displaystyle P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(\mathbf{x}|1,-1) =P𝝁+σg~j𝜽¯0+σ𝐠~j|Y^j,Yj(𝝁+σ(u𝜽¯0+𝐮)|1,1)\displaystyle=P_{-\bm{\mu}+\sigma\tilde{g}_{j}\bar{\bm{\theta}}_{0}+\sigma\tilde{\mathbf{g}}_{j}^{\bot}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(\bm{\mu}+\sigma(u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp})|1,-1) (194)
      =Pσg~j𝜽¯0+σ𝐠~j|Y^j,Yj(σ(2𝝁σ+u𝜽¯0+𝐮)|1,1)\displaystyle=P_{\sigma\tilde{g}_{j}\bar{\bm{\theta}}_{0}+\sigma\tilde{\mathbf{g}}_{j}^{\bot}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}\Big{(}\sigma\Big{(}\frac{2\bm{\mu}}{\sigma}+u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp}\Big{)}\Big{|}1,-1\Big{)} (195)
      =pg~j|Y^j,Yj(u+2ασ|1,1)p𝐠~j(𝐮+𝜽¯0).\displaystyle=p_{\tilde{g}_{j}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}\Big{(}u+\frac{2\alpha}{\sigma}\Big{|}1,-1\Big{)}p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}+\bar{\bm{\theta}}_{0}^{\bot}). (196)

      Similarly, for any 𝐱=𝝁+σ(u𝜽¯0+𝐮)d\mathbf{x}=-\bm{\mu}+\sigma(u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp})\in\mathbb{R}^{d}, given Y^j=1,Yj=1\hat{Y}_{j}^{\prime}=-1,Y_{j}^{\prime}=1, the conditional distribution at 𝐗j=𝐱\mathbf{X}_{j}^{\prime}=\mathbf{x} is given by

      P𝐗j|Y^j,Yj(𝐱|1,1)\displaystyle P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(\mathbf{x}|-1,1) =P𝝁+σg~j𝜽¯0+σ𝐠~j|Y^j,Yj(𝝁+σ(u𝜽¯0+𝐮)|1,1)\displaystyle=P_{\bm{\mu}+\sigma\tilde{g}_{j}\bar{\bm{\theta}}_{0}+\sigma\tilde{\mathbf{g}}_{j}^{\bot}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(-\bm{\mu}+\sigma(u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp})|-1,1) (197)
      =pg~j|Y^j,Yj(u2ασ|1,1)p𝐠~j(𝐮𝜽¯0);\displaystyle=p_{\tilde{g}_{j}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}\Big{(}u-\frac{2\alpha}{\sigma}\Big{|}-1,1\Big{)}p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}-\bar{\bm{\theta}}_{0}^{\bot}); (198)

      and given Y^j=1,Yj=1\hat{Y}_{j}^{\prime}=-1,Y_{j}^{\prime}=-1,

      P𝐗j|Y^j,Yj(𝐱|1,1)\displaystyle P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(\mathbf{x}|-1,-1) =P𝝁+σg~j𝜽¯0+σ𝐠~j|Y^j,Yj(𝝁+σ(u𝜽¯0+𝐮)|1,1)\displaystyle=P_{-\bm{\mu}+\sigma\tilde{g}_{j}\bar{\bm{\theta}}_{0}+\sigma\tilde{\mathbf{g}}_{j}^{\bot}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(-\bm{\mu}+\sigma(u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp})|-1,-1) (199)
      =pg~j|Y^j,Yj(u|1,1)p𝐠~j(𝐮).\displaystyle=p_{\tilde{g}_{j}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(u|-1,-1)p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}). (200)

      Furthermore, for any 𝐱=𝝁+σ(u𝜽¯0+𝐮)d\mathbf{x}=-\bm{\mu}+\sigma(u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp})\in\mathbb{R}^{d}, we have

      P𝐗j|Y^j=1(𝐱)\displaystyle P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime}=-1}(\mathbf{x}) =y{1,+1}P𝐗j|Y^j=1,Yj=y(𝐱)PYj|Y^j=1(y)\displaystyle=\sum_{y\in\{-1,+1\}}P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime}=-1,Y_{j}^{\prime}=y}(\mathbf{x})P_{Y_{j}^{\prime}|\hat{Y}_{j}^{\prime}=-1}(y) (201)
      =PYj|Y^j=1(1)pg~j|Y^j,Yj(u2ασ|1,1)p𝐠~j(𝐮𝜽¯0)\displaystyle=P_{Y_{j}^{\prime}|\hat{Y}_{j}^{\prime}=-1}(1)p_{\tilde{g}_{j}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}\Big{(}u-\frac{2\alpha}{\sigma}\Big{|}-1,1\Big{)}p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}-\bar{\bm{\theta}}_{0}^{\bot})
      +PYj|Y^j=1(1)pg~j|Y^j,Yj(u|1,1)p𝐠~j(𝐮)\displaystyle\quad+P_{Y_{j}^{\prime}|\hat{Y}_{j}^{\prime}=-1}(-1)p_{\tilde{g}_{j}|\hat{Y}_{j}^{\prime},Y_{j}^{\prime}}(u|-1,-1)p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}) (202)
      =1{uασ}pg~j(u2ασ)p𝐠~j(𝐮𝜽¯0)+1{uασ}pg~j(u)p𝐠~j(𝐮);\displaystyle=\text{1}\Big{\{}u\leq\frac{\alpha}{\sigma}\Big{\}}p_{\tilde{g}_{j}}\Big{(}u-\frac{2\alpha}{\sigma}\Big{)}p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}-\bar{\bm{\theta}}_{0}^{\bot})+\text{1}\Big{\{}u\leq\frac{\alpha}{\sigma}\Big{\}}p_{\tilde{g}_{j}}(u)p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}); (203)

      for any 𝐱=𝝁+σ(u𝜽¯0+𝐮)d\mathbf{x}=\bm{\mu}+\sigma(u\bar{\bm{\theta}}_{0}+\mathbf{u}^{\perp})\in\mathbb{R}^{d}, we have

      P𝐗j|Y^j=1(𝐱)=y{1,+1}P𝐗j|Y^j=1,Yj=y(𝐱)PYj|Y^j=1(y)\displaystyle P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime}=1}(\mathbf{x})=\sum_{y\in\{-1,+1\}}P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime}=1,Y_{j}^{\prime}=y}(\mathbf{x})P_{Y_{j}^{\prime}|\hat{Y}_{j}^{\prime}=1}(y) (204)
      =1{u>ασ}pg~j(u+2ασ)p𝐠~j(𝐮+𝜽¯0)+1{u>ασ}pg~j(u)p𝐠~j(𝐮).\displaystyle=\text{1}\Big{\{}u>-\frac{\alpha}{\sigma}\Big{\}}p_{\tilde{g}_{j}}\Big{(}u+\frac{2\alpha}{\sigma}\Big{)}p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}+\bar{\bm{\theta}}_{0}^{\bot})+\text{1}\Big{\{}u>-\frac{\alpha}{\sigma}\Big{\}}p_{\tilde{g}_{j}}(u)p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}). (205)

      Define the set 𝒰0(ξ0,𝝁):={𝐮d:𝐮𝜽0}\mathcal{U}_{0}^{\bot}(\xi_{0},\bm{\mu}^{\bot}):=\{\mathbf{u}^{\bot}\in\mathbb{R}^{d}:\mathbf{u}^{\bot}\perp\bm{\theta}_{0}\}. We also use 𝒰0\mathcal{U}_{0}^{\bot} to represent 𝒰0(ξ0,𝝁)\mathcal{U}_{0}^{\bot}(\xi_{0},\bm{\mu}^{\bot}), if there is no risk of confusion. Recall (128) and note that 𝒰0p𝐠~(𝐮)d𝐮=1\int_{\mathcal{U}_{0}^{\bot}}p_{\tilde{\mathbf{g}}^{\bot}}(\mathbf{u}^{\bot})\mathrm{d}\mathbf{u}^{\bot}=1. Finally, the KL-divergence is given by

      Dξ0,𝝁(P𝐗j|Y^j=1P𝐗|Y=1)\displaystyle D_{\xi_{0},\bm{\mu}^{\bot}}(P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime}=-1}\|P_{\mathbf{X}|Y=-1})
      =𝒰0ασ(pg~j(u2ασ)p𝐠~j(𝐮𝜽¯0)+pg~j(u)p𝐠~j(𝐮))\displaystyle=\int_{\mathcal{U}_{0}^{\bot}}\int^{\frac{\alpha}{\sigma}}_{-\infty}\bigg{(}p_{\tilde{g}_{j}}\Big{(}u-\frac{2\alpha}{\sigma}\Big{)}p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}-\bar{\bm{\theta}}_{0}^{\bot})+p_{\tilde{g}_{j}}(u)p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot})\bigg{)}
      ×log(1+pg~j(u2ασ)p𝐠~j(𝐮𝜽¯0)pg~j(u)p𝐠~j(𝐮))dud𝐮=Gσ(α,ξ0,𝝁)\displaystyle\qquad\times\log\bigg{(}1+\frac{p_{\tilde{g}_{j}}\Big{(}u-\frac{2\alpha}{\sigma}\Big{)}p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}-\bar{\bm{\theta}}_{0}^{\bot})}{p_{\tilde{g}_{j}}(u)p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot})}\bigg{)}\,\mathrm{d}u\,\mathrm{d}\mathbf{u}^{\bot}=G_{\sigma}(\alpha,\xi_{0},\bm{\mu}^{\bot}) (206)

      and

      Dξ0,𝝁(P𝐗j|Y^j=1P𝐗|Y=1)\displaystyle D_{\xi_{0},\bm{\mu}^{\bot}}(P_{\mathbf{X}_{j}^{\prime}|\hat{Y}_{j}^{\prime}=1}\|P_{\mathbf{X}|Y=1})
      =𝒰0ασ+(pg~j(u+2ασ)p𝐠~j(𝐮+𝜽¯0)+pg~j(u)p𝐠~j(𝐮))\displaystyle=\int_{\mathcal{U}_{0}^{\bot}}\int_{-\frac{\alpha}{\sigma}}^{+\infty}\bigg{(}p_{\tilde{g}_{j}}\Big{(}u+\frac{2\alpha}{\sigma}\Big{)}p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}+\bar{\bm{\theta}}_{0}^{\bot})+p_{\tilde{g}_{j}}(u)p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot})\bigg{)}
      ×log(1+pg~j(u+2ασ)p𝐠~j(𝐮+𝜽¯0)pg~j(u)p𝐠~j(𝐮))dud𝐮=Gσ(α,ξ0,𝝁),\displaystyle\qquad\times\log\bigg{(}1+\frac{p_{\tilde{g}_{j}}\Big{(}u+\frac{2\alpha}{\sigma}\Big{)}p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot}+\bar{\bm{\theta}}_{0}^{\bot})}{p_{\tilde{g}_{j}}(u)p_{\tilde{\mathbf{g}}_{j}^{\bot}}(\mathbf{u}^{\bot})}\bigg{)}\,\mathrm{d}u\,\mathrm{d}\mathbf{u}^{\bot}=G_{\sigma}(\alpha,\xi_{0},\bm{\mu}^{\bot}), (207)

where (207) follows since pg~jp_{\tilde{g}_{j}} and p𝐠~jp_{\tilde{\mathbf{g}}_{j}^{\bot}} are zero-mean (and hence symmetric) Gaussian densities. Then from (174), we have

      Dξ0,𝝁(P𝐗j,Y^jP𝐗,Y)=Gσ(α,ξ0,𝝁).\displaystyle D_{\xi_{0},\bm{\mu}^{\bot}}(P_{\mathbf{X}_{j}^{\prime},\hat{Y}_{j}^{\prime}}\|P_{\mathbf{X},Y})=G_{\sigma}(\alpha,\xi_{0},\bm{\mu}^{\bot}). (208)

    Thus, by combining the aforementioned results, we get the closed-form expression of the upper bound for |gen1||\mathrm{gen}_{1}|. Indeed, if we fix some dd\in\mathbb{N}, ϵ>0\epsilon>0 and δ(0,1)\delta\in(0,1), there exists n0(d,δ)n_{0}(d,\delta)\in\mathbb{N}, m0(ϵ,d,δ)m_{0}(\epsilon,d,\delta)\in\mathbb{N}, c0(d,δ)(c~1,)c_{0}(d,\delta)\in(\tilde{c}_{1},\infty), r0(d,δ)+r_{0}(d,\delta)\in\mathbb{R}_{+} such that for all n>n0,m>m0,c>c0,r>r0n>n_{0},m>m_{0},c>c_{0},r>r_{0}, δm,cc~1,d<δ3\delta_{m,c-\tilde{c}_{1},d}<\frac{\delta}{3}, δr,d<δ3\delta_{r,d}<\frac{\delta}{3}, and with probability at least 1δ1-\delta,

    |gen1|(c2c1)22𝔼ξ0,𝝁[Gσ(α(ξ0,𝝁),ξ0,𝝁)+ϵ].\displaystyle|\mathrm{gen}_{1}|\leq\sqrt{\frac{(c_{2}-c_{1})^{2}}{2}}\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\sqrt{G_{\sigma}(\alpha(\xi_{0},\bm{\mu}^{\bot}),\xi_{0},\bm{\mu}^{\bot})+\epsilon}~{}\bigg{]}. (209)
4.

Pseudo-label using θ1\bm{\theta}_{1}: This step is the same as in Appendix C.

5.

    Iteration t=2t=2: Recall 𝜽2\bm{\theta}_{2} in (116), the new model parameter learned from the pseudo-labelled dataset S^u,2\hat{S}_{\mathrm{u},2}.

Given any (𝜽1,ξ0,𝝁)(\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}), for any j2j\in\mathcal{I}_{2}, let 𝝁2𝜽1,ξ0,𝝁:=𝔼[sgn(𝜽¯1𝐗j)𝐗j|𝜽1,ξ0,𝝁]\bm{\mu}_{2}^{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}:=\mathbb{E}[\operatorname{sgn}(\bar{\bm{\theta}}_{1}^{\top}\mathbf{X}^{\prime}_{j})\mathbf{X}^{\prime}_{j}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}] and let 𝜽1,ξ0,𝝁\mathbb{P}_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}} denote the probability measure given the parameters (𝜽1,ξ0,𝝁)(\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}). Following similar steps to those leading to (161), for any ε>0\varepsilon>0, we have

    𝜽1,ξ0,𝝁(𝜽2𝝁2𝜽1,ξ0,𝝁>ε)δm,ε,d.\displaystyle\mathbb{P}_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\big{(}\|\bm{\theta}_{2}-\bm{\mu}_{2}^{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\|_{\infty}>\varepsilon\big{)}\leq\delta_{m,\varepsilon,d}. (210)

From (145), no matter what 𝜽1\bm{\theta}_{1} is, we always have 𝝁2𝜽1,ξ0,𝝁𝝁c~1\|\bm{\mu}_{2}^{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}-\bm{\mu}\|_{\infty}\leq\tilde{c}_{1}. Then, for some c(c~1,)c\in(\tilde{c}_{1},\infty),

    𝜽1,ξ0,𝝁(𝜽2Θ𝝁,c)1δm,cc~1,d.\displaystyle\mathbb{P}_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}_{2}\in\Theta_{\bm{\mu},c})\geq 1-\delta_{m,c-\tilde{c}_{1},d}. (211)

    With probability at least (1δm,cc~1,d)(1δr,d)(1-\delta_{m,c-\tilde{c}_{1},d})(1-\delta_{r,d}), the absolute generalization error can be upper bounded as follows:

    |gen2|=|𝔼[LP𝐙(𝜽2)LS^u,2(𝜽2)]|\displaystyle|\mathrm{gen}_{2}|=|\mathbb{E}[L_{P_{\mathbf{Z}}}(\bm{\theta}_{2})-L_{\hat{S}_{\mathrm{u},2}}(\bm{\theta}_{2})]| (212)
    =|1mi2𝔼𝜽1,ξ0,𝝁[𝔼[l(𝜽2,(𝐗,Y))l(𝜽2,(𝐗i,Y^i))|𝜽1,ξ0,𝝁]]|\displaystyle=\bigg{|}\frac{1}{m}\sum_{i\in\mathcal{I}_{2}}\mathbb{E}_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\mathbb{E}\left[l(\bm{\theta}_{2},(\mathbf{X},Y))-l(\bm{\theta}_{2},(\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime}))|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}\right]\bigg{]}\bigg{|} (213)
    (c2c1)221mi2𝔼𝜽1,ξ0,𝝁[I𝜽1,ξ0,𝝁(𝜽2;(𝐗i,Y^i))+D𝜽1,ξ0,𝝁(P𝐗i,Y^iP𝐗,Y)],\displaystyle\leq\sqrt{\frac{(c_{2}-c_{1})^{2}}{2}}\frac{1}{m}\sum_{i\in\mathcal{I}_{2}}\mathbb{E}_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\sqrt{I_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}_{2};(\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime}))+D_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}(P_{\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime}}\|P_{\mathbf{X},Y})}\bigg{]}, (214)

    where P𝜽2,𝐗,Y|𝜽1,ξ0,𝝁=P𝜽2|𝜽1,ξ0,𝝁P𝐗,YP_{\bm{\theta}_{2},\mathbf{X},Y|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}=P_{\bm{\theta}_{2}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\otimes P_{\mathbf{X},Y}.

    Similar to (173), for any ϵ>0\epsilon>0 and δ(0,1)\delta\in(0,1), there exists m1(ϵ,d,δ)m_{1}(\epsilon,d,\delta) such that for all m>m1m>m_{1},

    𝜽1,ξ0,𝝁(I𝜽1,ξ0,𝝁(𝜽2;(𝐗i,Y^i))>ϵ)δ.\displaystyle\mathbb{P}_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}(I_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}_{2};(\mathbf{X}^{\prime}_{i},\hat{Y}^{\prime}_{i}))>\epsilon)\leq\delta. (215)

Recall from (115) that PY^i|𝜽1,ξ0,𝝁=unif({1,+1})P_{\hat{Y}_{i}^{\prime}|\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}=\textrm{unif}(\{-1,+1\}). For any fixed (𝜽1,ξ0,𝝁)(\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}), recall that 𝜽¯1\bar{\bm{\theta}}_{1} can be decomposed as 𝜽¯1=α1(ξ0,𝝁)𝝁+β1(ξ0,𝝁)𝝊\bar{\bm{\theta}}_{1}=\alpha_{1}(\xi_{0},\bm{\mu}^{\bot})\bm{\mu}+\beta_{1}(\xi_{0},\bm{\mu}^{\bot})\bm{\upsilon}.

By following similar steps to those in the first iteration, the disintegrated conditional KL-divergence between the pseudo-labelled distribution and the true distribution is given by

    D𝜽1,ξ0,𝝁(P𝐗i,Y^iP𝐗,Y)\displaystyle D_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\big{(}P_{\mathbf{X}^{\prime}_{i},\hat{Y}^{\prime}_{i}}\|P_{\mathbf{X},Y}\big{)}
    =12D𝜽1,ξ0,𝝁(P𝐗i|Y^i=1P𝐗|Y=1)+12D𝜽1,ξ0,𝝁(P𝐗i|Y^i=1P𝐗|Y=1)\displaystyle=\frac{1}{2}D_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\big{(}P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}=-1}\|P_{\mathbf{X}|Y=-1}\big{)}+\frac{1}{2}D_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\big{(}P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}=1}\|P_{\mathbf{X}|Y=1}\big{)} (216)
    =Gσ(α1(ξ0,𝝁),ξ0,𝝁).\displaystyle=G_{\sigma}\big{(}\alpha_{1}(\xi_{0},\bm{\mu}^{\bot}),\xi_{0},\bm{\mu}^{\bot}\big{)}. (217)

Given any pair (ξ0,𝝁)(\xi_{0},\bm{\mu}^{\bot}), recall the decomposition of 𝝁1ξ0,𝝁\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}} in (94). Then the correlation between 𝝁1ξ0,𝝁\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}} and 𝝁\bm{\mu} is given by (see also the Monte Carlo sketch after this proof)

    ρ(𝝁1ξ0,𝝁,𝝁)\displaystyle\rho(\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}},\bm{\mu}) =12Q(ασ)+2σα2πexp(α22σ2)(12Q(ασ)+2σα2πexp(α22σ2))2+2σ2(1α2)πexp(α2σ2)\displaystyle=\frac{1-2\mathrm{Q}\big{(}\frac{\alpha}{\sigma}\big{)}+\frac{2\sigma\alpha}{\sqrt{2\pi}}\exp(-\frac{\alpha^{2}}{2\sigma^{2}})}{\sqrt{\big{(}1-2\mathrm{Q}\big{(}\frac{\alpha}{\sigma}\big{)}+\frac{2\sigma\alpha}{\sqrt{2\pi}}\exp(-\frac{\alpha^{2}}{2\sigma^{2}})\big{)}^{2}+\frac{2\sigma^{2}(1-\alpha^{2})}{\pi}\exp(-\frac{\alpha^{2}}{\sigma^{2}})}} (218)
    =Fσ(α(ξ0,𝝁)).\displaystyle=F_{\sigma}(\alpha(\xi_{0},\bm{\mu}^{\bot})). (219)

    By the strong law of large numbers, we have α1(ξ0,𝝁)a.s.Fσ(α(ξ0,𝝁))\alpha_{1}(\xi_{0},\bm{\mu}^{\bot})\xrightarrow{\text{a.s.}}F_{\sigma}(\alpha(\xi_{0},\bm{\mu}^{\bot})) as mm\to\infty. Then for any ϵ>0\epsilon>0 and δ(0,1)\delta\in(0,1), there exists m2(ϵ,d,δ)m_{2}(\epsilon,d,\delta) such that for all m>m2m>m_{2},

    𝜽1,ξ0,𝝁(|Gσ(α1(ξ0,𝝁),ξ0,𝝁)Gσ(Fσ(α(ξ0,𝝁)),ξ0,𝝁)|>ϵ)δ.\displaystyle\mathbb{P}_{\bm{\theta}_{1},\xi_{0},\bm{\mu}^{\bot}}\Big{(}\Big{|}G_{\sigma}\big{(}\alpha_{1}(\xi_{0},\bm{\mu}^{\bot}),\xi_{0},\bm{\mu}^{\bot}\big{)}-G_{\sigma}\Big{(}F_{\sigma}(\alpha(\xi_{0},\bm{\mu}^{\bot})),\xi_{0},\bm{\mu}^{\bot}\Big{)}\Big{|}>\epsilon\Big{)}\leq\delta. (220)

    Therefore, fix some dd\in\mathbb{N}, ϵ>0\epsilon>0 and δ(0,1)\delta\in(0,1). There exists n0(d,δ)n_{0}(d,\delta)\in\mathbb{N}, m3(ϵ,d,δ)m_{3}(\epsilon,d,\delta)\in\mathbb{N}, c0(d,δ)(c~1,)c_{0}(d,\delta)\in(\tilde{c}_{1},\infty), r0(d,δ)+r_{0}(d,\delta)\in\mathbb{R}_{+} such that for all n>n0,m>m3,c>c0,r>r0n>n_{0},m>m_{3},c>c_{0},r>r_{0}, δm,cc~1,d<δ3\delta_{m,c-\tilde{c}_{1},d}<\frac{\delta}{3}, δr,d<δ3\delta_{r,d}<\frac{\delta}{3}, and then with probability at least 1δ1-\delta, the absolute generalization error at t=2t=2 can be upper bounded as follows:

    |gen2|c2c12𝔼ξ0,𝝁[Gσ(Fσ(α(ξ0,𝝁)),ξ0,𝝁)+ϵ].\displaystyle|\mathrm{gen}_{2}|\leq\frac{c_{2}-c_{1}}{\sqrt{2}}\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\left[\sqrt{G_{\sigma}\Big{(}F_{\sigma}(\alpha(\xi_{0},\bm{\mu}^{\bot})),\xi_{0},\bm{\mu}^{\bot}\Big{)}+\epsilon}\right]. (221)
6.

    Any iteration t[3:τ]t\in[3:\tau]: By similarly repeating the calculation in iteration t=2t=2, we obtain the upper bound for |gent||\mathrm{gen}_{t}| in (131).
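As referenced in the correlation step (218)–(219) above, the following Monte Carlo sketch checks that the one-step estimate 𝜽_1 computed from m pseudo-labelled samples has correlation with 𝝁 close to F_{\sigma}(α). The dimension, sample size, seed and the particular choice of \bar{\bm{\theta}}_{0} are illustrative assumptions.

```python
# Monte Carlo check that corr(theta_1, mu) concentrates around F_sigma(alpha).
import numpy as np
from scipy.special import erfc

def Q(x):
    return 0.5 * erfc(x / np.sqrt(2.0))

def J(x, sigma):
    return 1.0 - 2.0 * Q(x / sigma) + (2.0 * sigma * x / np.sqrt(2.0 * np.pi)) * np.exp(-x**2 / (2.0 * sigma**2))

def K(x, sigma):
    return (2.0 * sigma * np.sqrt(1.0 - x**2) / np.sqrt(2.0 * np.pi)) * np.exp(-x**2 / (2.0 * sigma**2))

def F(x, sigma):
    # one-step correlation map, cf. (218)-(219)
    return J(x, sigma) / np.sqrt(J(x, sigma)**2 + K(x, sigma)**2)

rng = np.random.default_rng(1)
sigma, d, m, alpha = 0.6, 2, 100_000, 0.7      # assumed values
mu = np.array([1.0, 0.0])
theta0_bar = alpha * mu + np.sqrt(1.0 - alpha**2) * np.array([0.0, 1.0])   # unit vector with <theta0_bar, mu> = alpha

Yp = rng.choice([-1.0, 1.0], size=m)
Xp = Yp[:, None] * mu + sigma * rng.standard_normal((m, d))
theta1 = np.mean(np.sign(Xp @ theta0_bar)[:, None] * Xp, axis=0)           # one pseudo-labelling pass
rho = theta1 @ mu / np.linalg.norm(theta1)

print(f"empirical corr(theta_1, mu) = {rho:.4f}, F_sigma(alpha) = {F(alpha, sigma):.4f}")
```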

Appendix F Reusing SlS_{\mathrm{l}} in Each Iteration

If the labelled data SlS_{\mathrm{l}} are reused in each iteration and w=nn+mw=\frac{n}{n+m} (cf. (4)), for each t[1:τ]t\in[1:\tau], the learned model parameter is given by

𝜽t\displaystyle\bm{\theta}_{t}^{\prime} =nn+m𝜽0+1n+mitY^i𝐗i\displaystyle=\frac{n}{n+m}\bm{\theta}_{0}+\frac{1}{n+m}\sum_{i\in\mathcal{I}_{t}}\hat{Y}_{i}^{\prime}\mathbf{X}^{\prime}_{i} (222)
=nn+m𝜽0+1n+mitsgn(𝜽¯t1𝐗i)𝐗i.\displaystyle=\frac{n}{n+m}\bm{\theta}_{0}+\frac{1}{n+m}\sum_{i\in\mathcal{I}_{t}}\operatorname{sgn}(\bar{\bm{\theta}}_{t-1}^{\prime\top}\mathbf{X}^{\prime}_{i})\mathbf{X}^{\prime}_{i}. (223)
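Before turning to the correlation map, here is a small simulation sketch of the reuse update (222)–(223) for the bGMM: 𝜽_0 is computed from n labelled samples, and each subsequent iteration pseudo-labels a fresh batch of m unlabelled samples with the previous parameter. The sample sizes and the seed are illustrative assumptions; the printed correlations ρ(𝜽_t′, 𝝁) are expected to track the map \tilde{F} defined next.

```python
# Simulation sketch of the self-training update with label reuse, cf. (222)-(223).
import numpy as np

rng = np.random.default_rng(2)
sigma, d, n, m, tau = 0.6, 2, 10, 1000, 5     # assumed problem sizes
mu = np.array([1.0, 0.0])

# labelled estimate theta_0 = (1/n) sum_i Y_i X_i
Y = rng.choice([-1.0, 1.0], size=n)
X = Y[:, None] * mu + sigma * rng.standard_normal((n, d))
theta0 = np.mean(Y[:, None] * X, axis=0)

theta = theta0.copy()
for t in range(1, tau + 1):
    # a fresh batch of m unlabelled samples, pseudo-labelled by the previous parameter
    Xp = rng.choice([-1.0, 1.0], size=m)[:, None] * mu + sigma * rng.standard_normal((m, d))
    Y_hat = np.sign(Xp @ (theta / np.linalg.norm(theta)))
    # update (223): keep the labelled term with weight n/(n+m)
    theta = (n / (n + m)) * theta0 + (1.0 / (n + m)) * np.sum(Y_hat[:, None] * Xp, axis=0)
    print(f"t = {t}: corr(theta_t', mu) = {theta @ mu / np.linalg.norm(theta):.4f}")
```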

Similarly to FσF_{\sigma}, let us define the enhanced correlation evolution function F~σ,ξ0,𝝁:[1,1][1,1]\tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}:[-1,1]\to[-1,1] as follows:

\displaystyle\tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}(x)=\bigg{(}1+\frac{\big{(}\frac{w\sigma\|\bm{\mu}^{\bot}\|_{2}}{\sqrt{n}}+(1-w)\frac{2\sigma\sqrt{1-x^{2}}}{\sqrt{2\pi}}\exp\big{(}-\frac{x^{2}}{2\sigma^{2}}\big{)}\big{)}^{2}}{\big{(}w\big{(}1+\frac{\sigma}{\sqrt{n}}\xi_{0}\big{)}+(1-w)\big{(}1-2\mathrm{Q}\big{(}\frac{x}{\sigma}\big{)}+\frac{2\sigma x}{\sqrt{2\pi}}\exp\big{(}-\frac{x^{2}}{2\sigma^{2}}\big{)}\big{)}\big{)}^{2}}\bigg{)}^{-\frac{1}{2}}. (224)

From Theorem 7, we can obtain a similar characterization of \mathrm{gen}_{t}.

Corollary 10.

Fix any σ+\sigma\in\mathbb{R}_{+}, dd\in\mathbb{N} and α=α(ξ0,𝛍)\alpha=\alpha(\xi_{0},\bm{\mu}^{\bot}). For almost all sample paths,

gento(1)\displaystyle\mathrm{gen}_{t}-o(1)
=𝔼ξ0,𝝁[m(m1)(Jσ2(F~σ,ξ0,𝝁(α))+Kσ2(F~σ,ξ0,𝝁(α)))m(mn)Jσ(F~σ,ξ0,𝝁(α))nm(n+m)2σ2].\displaystyle=\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\frac{m(m\!-\!1)(J_{\sigma}^{2}(\tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}(\alpha))\!+\!K_{\sigma}^{2}(\tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}(\alpha)))\!-\!m(m-n)J_{\sigma}(\tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}(\alpha))\!-\!nm}{(n+m)^{2}\sigma^{2}}\bigg{]}. (225)

The proof of Corollary 10 is provided in Appendix G.

Recall the definition of the function \tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}} in (224). Let the t-th iterate of \tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}} be denoted as \tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}^{(t)}, with initial condition \tilde{F}^{(0)}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}(x)=x. As shown in Figure 16, for any fixed (\sigma,\xi_{0},\bm{\mu}^{\bot}), \tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}^{(t)} behaves similarly to F_{\sigma}^{(t)} as t increases, which implies that the gen-error in (225) in Corollary 10 also decreases as t increases. Consequently, \tilde{F}^{(t)}_{\sigma,\xi_{0},\bm{\mu}^{\bot}} captures the improvement of the model parameter \bm{\theta}_{t}^{\prime} over the iterations.
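
The saturation can also be checked numerically. The sketch below (a minimal illustration, not the paper's code) iterates \tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}} as defined in (224) in the setting of Figure 16, with the closed forms of J_{\sigma} and K_{\sigma} read off (218) and (257)-(258); after one or two applications the iterates are already close to their limit.

import numpy as np
from scipy.stats import norm

def J(x, s):
    return 1 - 2 * norm.sf(x / s) + 2 * s * x / np.sqrt(2 * np.pi) * np.exp(-x**2 / (2 * s**2))

def K(x, s):
    return 2 * s * np.sqrt(1 - x**2) / np.sqrt(2 * np.pi) * np.exp(-x**2 / (2 * s**2))

def F_tilde(x, s, xi0, mu_perp_norm, n, m):  # enhanced correlation evolution function (224)
    w = n / (n + m)
    A = w * (1 + s * xi0 / np.sqrt(n)) + (1 - w) * J(x, s)
    B = w * s * mu_perp_norm / np.sqrt(n) + (1 - w) * K(x, s)
    return (1 + (B / A)**2) ** -0.5

s, xi0, mu_perp_norm, n, m = 0.5, 0.0, 1.0, 10, 1000          # the setting of Figure 16
for x0 in (0.2, 0.5, 0.8):
    x, iterates = x0, [x0]
    for _ in range(3):
        x = F_tilde(x, s, xi0, mu_perp_norm, n, m)
        iterates.append(x)
    print(f"x0 = {x0}: " + " -> ".join(f"{v:.4f}" for v in iterates))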

Refer to caption
Figure 16: F~σ,ξ0,𝝁(t)(x)\tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}^{(t)}(x) versus xx for t{0,1,2}t\in\{0,1,2\} when σ=0.5\sigma=0.5, ξ0=0\xi_{0}=0, 𝝁2=1\|\bm{\mu}^{\bot}\|_{2}=1, n=10n=10, and m=1000m=1000.
Refer to caption
Figure 17: gent\mathrm{gen}_{t} versus tt for m=100m=100 and m=1000m=1000, when n=10n=10, σ=0.6\sigma=0.6, d=2d=2, and 𝝁=(1,0)\bm{\mu}=(1,0).

As shown in Figure 17, under the same setup as Figure 14(a), when the labelled data S_{\mathrm{l}} are reused in each iteration, \mathrm{gen}_{t} is also a decreasing function of t. When m=1000, \mathrm{gen}_{t} is almost the same as the gen-error obtained when the labelled data are not reused in the subsequent iterations, which means that for large enough m/n, reusing the labelled data does not necessarily help to improve the generalization performance. Moreover, when m=100, \mathrm{gen}_{t} is higher than that for m=1000, which coincides with the intuition that increasing the number of unlabelled data helps to reduce the generalization error.
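
To reproduce this comparison numerically, the sketch below (an illustration, not the paper's code) evaluates the exact t=1 gen-error expression derived in Appendix G (cf. (247)) by Monte Carlo over (\xi_{0},\bm{\mu}^{\bot}) in the setting of Figure 17 (n=10, \sigma=0.6, d=2). The distribution of (\xi_{0},\|\bm{\mu}^{\bot}\|_{2}) is taken, consistently with the decomposition of \bm{\theta}_{0} in (226), from \bm{\theta}_{0}=(1+\frac{\sigma}{\sqrt{n}}\xi_{0})\bm{\mu}+\frac{\sigma}{\sqrt{n}}\bm{\mu}^{\bot} with \xi_{0}\sim\mathcal{N}(0,1) and \bm{\mu}^{\bot} the component of a standard Gaussian vector orthogonal to \bm{\mu}; the estimate for m=100 comes out larger than that for m=1000, in line with the discussion above.

import numpy as np
from scipy.stats import norm

def J(x, s):
    return 1 - 2 * norm.sf(x / s) + 2 * s * x / np.sqrt(2 * np.pi) * np.exp(-x**2 / (2 * s**2))

def K(x, s):
    return 2 * s * np.sqrt(1 - x**2) / np.sqrt(2 * np.pi) * np.exp(-x**2 / (2 * s**2))

rng = np.random.default_rng(2)
d, s, n, trials = 2, 0.6, 10, 200_000
xi0 = rng.standard_normal(trials)
mu_perp_norm = np.sqrt(rng.chisquare(d - 1, size=trials))
a = 1 + s * xi0 / np.sqrt(n)
alpha = a / np.sqrt(a**2 + s**2 * mu_perp_norm**2 / n)        # alpha(xi_0, mu_perp) = rho(theta_0, mu)

for m in (100, 1000):
    num = (m * (m - 1) * (J(alpha, s)**2 + K(alpha, s)**2) - m * (m - n) * J(alpha, s)
           - n * m + m * (1 + d * s**2) + n * d * s**2)       # numerator of (247)
    print(f"m = {m}: gen_1 (Monte Carlo average of (247)) ~ {np.mean(num / ((n + m)**2 * s**2)):.4f}")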

Appendix G Proof of Corollary 10

Following steps similar to those in Appendix D, we first characterize \mathrm{gen}_{1}.

At t=1, from (66) and (94), the conditional expectation \bm{\mu}_{1}^{\prime\xi_{0},\bm{\mu}^{\bot}}:=\mathbb{E}[\bm{\theta}_{1}^{\prime}|\xi_{0},\bm{\mu}^{\bot}] can be written as

\displaystyle\bm{\mu}_{1}^{\prime\xi_{0},\bm{\mu}^{\bot}} =\frac{n}{n+m}\bm{\theta}_{0}+\frac{1}{n+m}\sum_{i=1}^{m}\mathbb{E}[\operatorname{sgn}(\bar{\bm{\theta}}_{0}^{\top}\mathbf{X}^{\prime}_{i})\mathbf{X}^{\prime}_{i}|\xi_{0},\bm{\mu}^{\bot}]
=nn+m((1+σnξ0)𝝁+σn𝝁)\displaystyle=\frac{n}{n+m}\bigg{(}\bigg{(}1+\frac{\sigma}{\sqrt{n}}\xi_{0}\bigg{)}\bm{\mu}+\frac{\sigma}{\sqrt{n}}\bm{\mu}^{\bot}\bigg{)}
+mn+m((12Q(ασ)+2σα2πexp(α22σ2))𝝁+2σβ2πexp(α22σ2)𝝊)\displaystyle\quad+\frac{m}{n+m}\bigg{(}\bigg{(}1-2\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)}+\frac{2\sigma\alpha}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bigg{)}\bm{\mu}+\frac{2\sigma\beta}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bm{\upsilon}\bigg{)} (226)
=(1+nσξ0n+m+mn+m(2Q(ασ)+2σα2πexp(α22σ2)))𝝁\displaystyle=\bigg{(}1+\frac{\sqrt{n}\sigma\xi_{0}}{n+m}+\frac{m}{n+m}\bigg{(}-2\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)}+\frac{2\sigma\alpha}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bigg{)}\bigg{)}\bm{\mu}
+(nσ𝝁2n+m+mn+m2σβ2πexp(α22σ2))𝝊.\displaystyle\quad+\bigg{(}\frac{\sqrt{n}\sigma\|\bm{\mu}^{\bot}\|_{2}}{n+m}+\frac{m}{n+m}\frac{2\sigma\beta}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bigg{)}\bm{\upsilon}. (227)

Then the correlation between 𝝁1ξ0,𝝁\bm{\mu}_{1}^{\prime\xi_{0},\bm{\mu}^{\bot}} and 𝝁\bm{\mu} is given by

ρ(𝝁1ξ0,𝝁,𝝁)=F~σ,ξ0,𝝁(α).\displaystyle\rho(\bm{\mu}_{1}^{\prime\xi_{0},\bm{\mu}^{\bot}},\bm{\mu})=\tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}(\alpha). (228)

The gen-error gen1\mathrm{gen}_{1} is given by

\displaystyle\mathrm{gen}_{1} =\frac{1}{n+m}\sum_{i=1}^{n}\mathbb{E}\Big{[}l(\bm{\theta}_{1}^{\prime},(\mathbf{X},Y))-l(\bm{\theta}_{1}^{\prime},(\mathbf{X}_{i},Y_{i}))\Big{]}
\displaystyle\quad+\frac{1}{n+m}\sum_{i=1}^{m}\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\mathbb{E}\left[l(\bm{\theta}_{1}^{\prime},(\mathbf{X},Y))-l(\bm{\theta}_{1}^{\prime},(\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime}))|\xi_{0},\bm{\mu}^{\bot}\right]\bigg{]} (229)
\displaystyle=\frac{1}{n+m}\sum_{i=1}^{n}\mathbb{E}_{\bm{\theta}_{1}^{\prime}}\Big{[}\mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{Z}_{i}|\bm{\theta}_{1}^{\prime}}|p_{\bm{\theta}_{1}^{\prime}})\Big{]}
\displaystyle\quad+\frac{1}{n+m}\sum_{i=1}^{m}\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\mathbb{E}_{\bm{\theta}_{1}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}|p_{\bm{\theta}_{1}^{\prime}})
\displaystyle\quad+\mathrm{\Delta h}(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot},\bm{\theta}_{1}^{\prime}}|p_{\bm{\theta}_{1}^{\prime}})\bigg{]}. (230)
  • Calculate 𝔼θ1[Δh(P𝐙P𝐙i|θ1|pθ1)]\mathbb{E}_{\bm{\theta}_{1}^{\prime}}\Big{[}\mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{Z}_{i}|\bm{\theta}_{1}^{\prime}}|p_{\bm{\theta}_{1}^{\prime}})\Big{]}:

    𝔼𝜽1[Δh(P𝐙P𝐙i|𝜽1|p𝜽1)]\displaystyle\mathbb{E}_{\bm{\theta}_{1}^{\prime}}\Big{[}\mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{Z}_{i}|\bm{\theta}_{1}^{\prime}}|p_{\bm{\theta}_{1}^{\prime}})\Big{]}
    =Q𝜽1(𝜽)(P𝐙(𝐳)P𝐙i|𝜽1(𝐳|𝜽))log1p𝜽(𝐳)d𝐳d𝜽\displaystyle=\int Q_{\bm{\theta}_{1}^{\prime}}(\bm{\theta})(P_{\mathbf{Z}}(\mathbf{z})-P_{\mathbf{Z}_{i}|\bm{\theta}_{1}^{\prime}}(\mathbf{z}|\bm{\theta}))\log\frac{1}{p_{\bm{\theta}}(\mathbf{z})}\mathrm{d}\mathbf{z}\mathrm{d}\bm{\theta} (231)
    =12σ2(P𝐙(𝐱,1)(Q𝜽1(𝜽)P𝜽1|𝐙i(𝜽|𝐱,1))\displaystyle=-\frac{1}{2\sigma^{2}}\int\Big{(}P_{\mathbf{Z}}(\mathbf{x},1)(Q_{\bm{\theta}_{1}^{\prime}}(\bm{\theta})-P_{\bm{\theta}_{1}^{\prime}|\mathbf{Z}_{i}}(\bm{\theta}|\mathbf{x},1))
    P𝐙(𝐱,1)(Q𝜽1(𝜽)P𝜽1|𝐙i(𝜽|𝐱,1)))2𝜽𝐱d𝐱d𝜽\displaystyle\qquad-P_{\mathbf{Z}}(\mathbf{x},-1)(Q_{\bm{\theta}_{1}^{\prime}}(\bm{\theta})-P_{\bm{\theta}_{1}^{\prime}|\mathbf{Z}_{i}}(\bm{\theta}|\mathbf{x},-1))\Big{)}2\bm{\theta}^{\top}\mathbf{x}~{}\mathrm{d}\mathbf{x}\mathrm{d}\bm{\theta} (232)
    =1σ2(𝝁𝐱n+mP𝐙(𝐱,1)𝝁+𝐱n+mP𝐙(𝐱,1))𝐱d𝐱\displaystyle=-\frac{1}{\sigma^{2}}\int\Big{(}\frac{\bm{\mu}-\mathbf{x}}{n+m}P_{\mathbf{Z}}(\mathbf{x},1)-\frac{\bm{\mu}+\mathbf{x}}{n+m}P_{\mathbf{Z}}(\mathbf{x},-1)\Big{)}^{\top}\mathbf{x}~{}\mathrm{d}\mathbf{x} (233)
    =1σ2(𝝁𝝁𝝁𝝁dσ22(n+m)𝝁𝝁+𝝁𝝁+dσ22(n+m))\displaystyle=-\frac{1}{\sigma^{2}}\bigg{(}\frac{\bm{\mu}^{\top}\bm{\mu}-\bm{\mu}^{\top}\bm{\mu}-d\sigma^{2}}{2(n+m)}-\frac{-\bm{\mu}^{\top}\bm{\mu}+\bm{\mu}^{\top}\bm{\mu}+d\sigma^{2}}{2(n+m)}\bigg{)} (234)
    =dn+m.\displaystyle=\frac{d}{n+m}. (235)
  • Calculate 𝔼θ1|ξ0,μ[Δh(P𝐗i,Y^i|ξ0,μP𝐗i,Y^i|ξ0,μ,θ1|pθ1)]\mathbb{E}_{\bm{\theta}_{1}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\big{[}\mathrm{\Delta h}(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot},\bm{\theta}_{1}^{\prime}}|p_{\bm{\theta}_{1}})\big{]}:

    𝔼𝜽1|ξ0,𝝁[Δh(P𝐗i,Y^i|ξ0,𝝁P𝐗i,Y^i|ξ0,𝝁,𝜽1|p𝜽1)]\displaystyle\mathbb{E}_{\bm{\theta}_{1}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\big{[}\mathrm{\Delta h}(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot},\bm{\theta}_{1}^{\prime}}|p_{\bm{\theta}_{1}})\big{]}
    =1σ2P𝐗i,Y^i|ξ0,𝝁(𝐱,y|ξ0,𝝁)(Q𝜽1|ξ0,𝝁(𝜽|ξ0,𝝁)\displaystyle=-\frac{1}{\sigma^{2}}\int P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\mathbf{x},y|\xi_{0},\bm{\mu}^{\bot})\big{(}Q_{\bm{\theta}_{1}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}|\xi_{0},\bm{\mu}^{\bot})
    P𝜽1|𝐗i,Y^i,ξ0,𝝁(𝜽|𝐱,y,ξ0,𝝁))(y𝜽𝐱)d𝐱dyd𝜽\displaystyle\qquad\qquad-P_{\bm{\theta}_{1}^{\prime}|\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime},\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}|\mathbf{x},y,\xi_{0},\bm{\mu}^{\bot})\big{)}(y\bm{\theta}^{\top}\mathbf{x})\mathrm{d}\mathbf{x}\mathrm{d}y\mathrm{d}\bm{\theta} (236)
    =1σ2P𝐗i,Y^i|ξ0,𝝁(𝐱,y|ξ0,𝝁)(𝝁1ξ0,𝝁y𝐱n+m)(y𝐱)d𝐱dy\displaystyle=-\frac{1}{\sigma^{2}}\int P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\mathbf{x},y|\xi_{0},\bm{\mu}^{\bot})\bigg{(}\frac{\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}-y\mathbf{x}}{n+m}\bigg{)}^{\top}(y\mathbf{x})\mathrm{d}\mathbf{x}\mathrm{d}y (237)
    =1(n+m)σ2(𝔼[𝐗i𝐗i|ξ0,𝝁](𝝁1ξ0,𝝁)𝝁1ξ0,𝝁)\displaystyle=\frac{1}{(n+m)\sigma^{2}}\Big{(}\mathbb{E}[\mathbf{X}_{i}^{\prime\top}\mathbf{X}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}]-(\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}})^{\top}\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}\Big{)} (238)
    =dσ2+𝝁𝝁(𝝁1ξ0,𝝁)𝝁1ξ0,𝝁(n+m)σ2\displaystyle=\frac{d\sigma^{2}+\bm{\mu}^{\top}\bm{\mu}-(\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}})^{\top}\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}}{(n+m)\sigma^{2}} (239)
    =dσ2+1Jσ2(α)Kσ2(α)(n+m)σ2.\displaystyle=\frac{d\sigma^{2}+1-J_{\sigma}^{2}(\alpha)-K_{\sigma}^{2}(\alpha)}{(n+m)\sigma^{2}}. (240)
  • Calculate 𝔼θ1|ξ0,μ[Δh(P𝐙P𝐗i,Y^i|ξ0,μ|pθ1)]\mathbb{E}_{\bm{\theta}_{1}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}[\mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}|p_{\bm{\theta}_{1}^{\prime}})]: Since given any fixed 𝜽1\bm{\theta}_{1}, p𝜽1(𝐱|)p_{\bm{\theta}_{1}^{\prime}}(\mathbf{x}|\cdot) is a Gaussian distribution, for any y{±1}y\in\{\pm 1\}, we have

    12P𝜽1|ξ0,𝝁(𝜽)(P𝐗|Y(𝐱|𝐲)P𝐗i|Y^i(𝐱|y))log1p𝜽(𝐱|y)d𝐱d𝜽\displaystyle\frac{1}{2}\int P_{\bm{\theta}_{1}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta})\big{(}P_{\mathbf{X}|Y}(\mathbf{x}|\mathbf{y})-P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}}(\mathbf{x}|y)\big{)}\log\frac{1}{p_{\bm{\theta}}(\mathbf{x}|y)}\mathrm{d}\mathbf{x}\mathrm{d}\bm{\theta}
    =14σ2P𝜽1|ξ0,𝝁(𝜽)(P𝐗|Y(𝐱|y)P𝐗i|Y^i(𝐱|y))(𝐱𝐱2y𝜽𝐱+𝜽𝜽)d𝐱d𝜽\displaystyle=\frac{1}{4\sigma^{2}}\int P_{\bm{\theta}_{1}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta})\big{(}P_{\mathbf{X}|Y}(\mathbf{x}|y)-P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}}(\mathbf{x}|y)\big{)}\big{(}\mathbf{x}^{\top}\mathbf{x}-2y\bm{\theta}^{\top}\mathbf{x}+\bm{\theta}^{\top}\bm{\theta}\big{)}\mathrm{d}\mathbf{x}\mathrm{d}\bm{\theta} (241)
    =12σ2P𝜽1|ξ0,𝝁(𝜽)(12P𝐗|Y(𝐱|y)12P𝐗i|Y^i(𝐱|y))(y𝜽𝐱)d𝐱d𝜽\displaystyle=-\frac{1}{2\sigma^{2}}\int P_{\bm{\theta}_{1}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta})\bigg{(}\frac{1}{2}P_{\mathbf{X}|Y}(\mathbf{x}|y)-\frac{1}{2}P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}}(\mathbf{x}|y)\bigg{)}\big{(}y\bm{\theta}^{\top}\mathbf{x}\big{)}\mathrm{d}\mathbf{x}\mathrm{d}\bm{\theta} (242)
    =12σ2(𝝁1ξ0,𝝁)(𝝁𝝁1ξ0,𝝁)\displaystyle=-\frac{1}{2\sigma^{2}}(\bm{\mu}_{1}^{\prime\xi_{0},\bm{\mu}^{\bot}})^{\top}\big{(}\bm{\mu}-\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}\big{)} (243)
    =m(Jσ2(α)+Kσ2(α))(mn)Jσ(α)n2σ2(n+m),\displaystyle=\frac{m(J_{\sigma}^{2}(\alpha)+K_{\sigma}^{2}(\alpha))-(m-n)J_{\sigma}(\alpha)-n}{2\sigma^{2}(n+m)}, (244)

    and then

    𝔼𝜽1|ξ0,𝝁[Δh(P𝐙P𝐗i,Y^i|ξ0,𝝁|p𝜽1)]=m(Jσ2(α)+Kσ2(α))(mn)Jσ(α)nσ2(n+m).\displaystyle\mathbb{E}_{\bm{\theta}_{1}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}[\mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}|p_{\bm{\theta}_{1}^{\prime}})]=\frac{m(J_{\sigma}^{2}(\alpha)+K_{\sigma}^{2}(\alpha))-(m-n)J_{\sigma}(\alpha)-n}{\sigma^{2}(n+m)}. (245)

Therefore, the gen-error at t=1t=1 can be exactly characterized as follows

\displaystyle\mathrm{gen}_{1} =\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\frac{nd}{(n+m)^{2}}
\displaystyle\qquad+\frac{m}{n+m}\frac{d\sigma^{2}+1-J_{\sigma}^{2}(\alpha)-K_{\sigma}^{2}(\alpha)+m(J_{\sigma}^{2}(\alpha)+K_{\sigma}^{2}(\alpha))-(m-n)J_{\sigma}(\alpha)-n}{(n+m)\sigma^{2}}\bigg{]} (246)
\displaystyle=\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\frac{m(m-1)(J_{\sigma}^{2}(\alpha)+K_{\sigma}^{2}(\alpha))-m(m-n)J_{\sigma}(\alpha)-nm+m(1+d\sigma^{2})+nd\sigma^{2}}{(n+m)^{2}\sigma^{2}}\bigg{]}. (247)

For t\geq 2, similar to the derivation in Appendix C, we iteratively apply the same calculation and only need to replace \alpha with its image under the correlation evolution function \tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}(\cdot) (cf. (224)). The gen-error for any t\geq 1 is then characterized as follows: for almost all sample paths, there exists a vanishing sequence \epsilon_{m} (\epsilon_{m}\to 0 as m\to\infty) such that

gent=𝔼ξ0,𝝁[\displaystyle\mathrm{gen}_{t}=\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[} m(m1)(Jσ2(F~σ,ξ0,𝝁(α))+Kσ2(F~σ,ξ0,𝝁(α)))(n+m)2σ2\displaystyle\frac{m(m-1)(J_{\sigma}^{2}(\tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}(\alpha))+K_{\sigma}^{2}(\tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}(\alpha)))}{(n+m)^{2}\sigma^{2}}
+m(mn)Jσ(F~σ,ξ0,𝝁(α))nm(n+m)2σ2]+ϵm,\displaystyle+\frac{-m(m-n)J_{\sigma}(\tilde{F}_{\sigma,\xi_{0},\bm{\mu}^{\bot}}(\alpha))-nm}{(n+m)^{2}\sigma^{2}}\bigg{]}+\epsilon_{m}^{\prime}, (248)

where \epsilon_{m}^{\prime}=\epsilon_{m}+\frac{m(1+d\sigma^{2})+nd\sigma^{2}}{(n+m)^{2}\sigma^{2}}\to 0 as m\to\infty and \alpha stands for \alpha(\xi_{0},\bm{\mu}^{\bot}).

The proof of Corollary 10 is thus completed.

Appendix H Proof of Theorem 4

Theorem 4 can be proved similarly to Theorem 7. For simplicity, in the following proof, we abbreviate \mathrm{gen}_{t}(P_{\mathbf{Z}},P_{\mathbf{X}},\{P_{\bm{\theta}_{k}|S_{\mathrm{l}},S_{\mathrm{u}}}\}_{k=0}^{t},\{f_{\bm{\theta}_{k}}\}_{k=0}^{t-1}) as \mathrm{gen}_{t}. With \ell_{2}-regularization, the algorithm operates in the following steps. Let \lambda\in\mathbb{R}_{+} be the regularization parameter.

  • Step 1: Initial round t=0t=0 with SlS_{\mathrm{l}}: By minimizing the regularized empirical risk of labelled dataset SlS_{\mathrm{l}}

    LSlreg(𝜽)\displaystyle L_{S_{\mathrm{l}}}^{\mathrm{reg}}(\bm{\theta}) =1ni=1nl(𝜽,(𝐗i,Yi))+λ2𝜽22=c12σ2ni=1n(𝐗iYi𝜽)(𝐗iYi𝜽)+λ2𝜽22,\displaystyle=\frac{1}{n}\sum_{i=1}^{n}l(\bm{\theta},(\mathbf{X}_{i},Y_{i}))+\frac{\lambda}{2}\|\bm{\theta}\|_{2}^{2}\overset{\mathrm{c}}{=}\frac{1}{2\sigma^{2}n}\sum_{i=1}^{n}(\mathbf{X}_{i}-Y_{i}\bm{\theta})^{\top}(\mathbf{X}_{i}-Y_{i}\bm{\theta})+\frac{\lambda}{2}\|\bm{\theta}\|_{2}^{2}, (249)

    where =c\stackrel{{\scriptstyle\mathrm{c}}}{{=}} means that both sides differ by a constant independent of 𝜽\bm{\theta}, we obtain the minimizer

    \displaystyle\bm{\theta}_{0}^{\mathrm{reg}} =\operatorname*{arg\,min}_{\bm{\theta}\in\Theta}L_{S_{\mathrm{l}}}^{\mathrm{reg}}(\bm{\theta})=\frac{1}{n}\sum_{i=1}^{n}\frac{Y_{i}\mathbf{X}_{i}}{1+\sigma^{2}\lambda}
    \displaystyle=\frac{\bm{\theta}_{0}}{1+\sigma^{2}\lambda}\sim\mathcal{N}\bigg{(}\frac{\bm{\mu}}{1+\sigma^{2}\lambda},\frac{\sigma^{2}}{n(1+\sigma^{2}\lambda)^{2}}\mathbf{I}_{d}\bigg{)}. (250)
  • Step 2: Pseudo-label data in SuS_{\mathrm{u}}: At each iteration t[1:τ]t\in[1:\tau], for any iti\in\mathcal{I}_{t}, we use 𝜽t1reg\bm{\theta}_{t-1}^{\mathrm{reg}} to assign a pseudo-label for 𝐗i\mathbf{X}_{i}^{\prime}, that is, Y^i=f𝜽t1reg(𝐗i)=sgn(𝐗i𝜽t1reg)\hat{Y}_{i}^{\prime}=f_{\bm{\theta}_{t-1}^{\mathrm{reg}}}(\mathbf{X}_{i}^{\prime})=\operatorname{sgn}(\mathbf{X}_{i}^{\prime\top}\bm{\theta}_{t-1}^{\mathrm{reg}}).

  • Step 3: Refine the model: We then use the pseudo-labelled dataset \hat{S}_{\mathrm{u},t} to train the new model. By minimizing the regularized empirical risk of \hat{S}_{\mathrm{u},t}

    \displaystyle L_{\hat{S}_{\mathrm{u},t}}(\bm{\theta}) =\frac{1}{m}\sum_{i\in\mathcal{I}_{t}}l(\bm{\theta},(\mathbf{X}^{\prime}_{i},\hat{Y}_{i}^{\prime}))+\frac{\lambda}{2}\|\bm{\theta}\|_{2}^{2}
    \displaystyle\overset{\mathrm{c}}{=}\frac{1}{2\sigma^{2}m}\sum_{i\in\mathcal{I}_{t}}(\mathbf{X}^{\prime}_{i}-\hat{Y}_{i}^{\prime}\bm{\theta})^{\top}(\mathbf{X}^{\prime}_{i}-\hat{Y}_{i}^{\prime}\bm{\theta})+\frac{\lambda}{2}\|\bm{\theta}\|_{2}^{2}, (251)

    we obtain the new model parameter

    𝜽treg\displaystyle\bm{\theta}_{t}^{\mathrm{reg}} =1mitY^i𝐗i1+σ2λ=1mitsgn(𝐗i𝜽t1reg)𝐗i1+σ2λ.\displaystyle=\frac{1}{m}\sum_{i\in\mathcal{I}_{t}}\frac{\hat{Y}_{i}^{\prime}\mathbf{X}^{\prime}_{i}}{1+\sigma^{2}\lambda}=\frac{1}{m}\sum_{i\in\mathcal{I}_{t}}\frac{\operatorname{sgn}(\mathbf{X}_{i}^{\prime\top}\bm{\theta}_{t-1}^{\mathrm{reg}})\mathbf{X}^{\prime}_{i}}{1+\sigma^{2}\lambda}. (252)

    If t<τt<\tau, go back to Step 2.

  1. 1.

    Characterization of |gen1||\mathrm{gen}_{1}|:

    From (250), we still have ρ(𝜽0reg,𝝁)=α(ξ0,𝝁)\rho(\bm{\theta}_{0}^{\mathrm{reg}},\bm{\mu})=\alpha(\xi_{0},\bm{\mu}^{\bot}) and

    𝜽¯0reg:=𝜽0reg𝜽0reg22=α𝝁+β𝝊=𝜽¯0.\displaystyle\bar{\bm{\theta}}_{0}^{\mathrm{reg}}:=\frac{\bm{\theta}_{0}^{\mathrm{reg}}}{\|\bm{\theta}_{0}^{\mathrm{reg}}\|_{2}^{2}}=\alpha\bm{\mu}+\beta\bm{\upsilon}=\bar{\bm{\theta}}_{0}. (253)

    From (253), we can rewrite (252) as follows

    \displaystyle\bm{\theta}_{1}^{\mathrm{reg}} =\frac{1}{m}\sum_{i=1}^{m}\frac{\operatorname{sgn}(\mathbf{X}_{i}^{\prime\top}\bar{\bm{\theta}}_{0}^{\mathrm{reg}})\mathbf{X}^{\prime}_{i}}{1+\sigma^{2}\lambda}=\frac{1}{m}\sum_{i=1}^{m}\frac{\operatorname{sgn}(\mathbf{X}_{i}^{\prime\top}\bar{\bm{\theta}}_{0})\mathbf{X}^{\prime}_{i}}{1+\sigma^{2}\lambda}=\frac{\bm{\theta}_{1}}{1+\sigma^{2}\lambda}. (254)

    Thus, the expectation of 𝜽1reg\bm{\theta}_{1}^{\mathrm{reg}} conditioned on (ξ0,𝝁)(\xi_{0},\bm{\mu}^{\bot}) is given by

    𝝁1reg|ξ0,𝝁:=11+σ2λ𝔼[sgn(𝐗j𝜽0reg)𝐗j|ξ0,𝝁]\displaystyle\bm{\mu}_{1}^{\mathrm{reg}|\xi_{0},\bm{\mu}^{\bot}}:=\frac{1}{1+\sigma^{2}\lambda}\mathbb{E}[\operatorname{sgn}(\mathbf{X}_{j}^{\prime\top}\bm{\theta}_{0}^{\mathrm{reg}})\mathbf{X}^{\prime}_{j}|\xi_{0},\bm{\mu}^{\bot}] (255)
    =11+σ2λ𝝁1ξ0,𝝁\displaystyle=\frac{1}{1+\sigma^{2}\lambda}\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}} (256)
    =11+σ2λ((12Q(ασ)+2σα2πexp(α22σ2))𝝁+2σβ2πexp(α22σ2)𝝊)\displaystyle=\frac{1}{1+\sigma^{2}\lambda}\bigg{(}\bigg{(}1-2\mathrm{Q}\bigg{(}\frac{\alpha}{\sigma}\bigg{)}+\frac{2\sigma\alpha}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bigg{)}\bm{\mu}+\frac{2\sigma\beta}{\sqrt{2\pi}}\exp\bigg{(}-\frac{\alpha^{2}}{2\sigma^{2}}\bigg{)}\bm{\upsilon}\bigg{)} (257)
    =Jσ(α)𝝁+Kσ(α)𝝊1+σ2λ.\displaystyle=\frac{J_{\sigma}(\alpha)\bm{\mu}+K_{\sigma}(\alpha)\bm{\upsilon}}{1+\sigma^{2}\lambda}. (258)

    Recall gen1\mathrm{gen}_{1} given in (95). In the case with regularization, the gen-error gen1reg\mathrm{gen}_{1}^{\mathrm{reg}} has the same definition as gen1\mathrm{gen}_{1}. To derive gen1reg\mathrm{gen}_{1}^{\mathrm{reg}}, we need to calculate the following two terms.

    • Calculate 𝔼θ1reg|ξ0,μ[Δh(P𝐗i,Y^i|ξ0,μP𝐗i,Y^i|ξ0,μ,θ1reg|pθ1reg)]\mathbb{E}_{\bm{\theta}_{1}^{\mathrm{reg}}|\xi_{0},\bm{\mu}^{\bot}}\big{[}\mathrm{\Delta h}(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot},\bm{\theta}_{1}^{\mathrm{reg}}}|p_{\bm{\theta}_{1}^{\mathrm{reg}}})\big{]}:

      𝔼𝜽1reg|ξ0,𝝁[Δh(P𝐗i,Y^i|ξ0,𝝁P𝐗i,Y^i|ξ0,𝝁,𝜽1reg|p𝜽1reg)]\displaystyle\mathbb{E}_{\bm{\theta}_{1}^{\mathrm{reg}}|\xi_{0},\bm{\mu}^{\bot}}\big{[}\mathrm{\Delta h}(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot},\bm{\theta}_{1}^{\mathrm{reg}}}|p_{\bm{\theta}_{1}^{\mathrm{reg}}})\big{]}
      =12σ2Q𝜽1reg|ξ0,𝝁(𝜽|ξ0,𝝁)(P𝐗i,Y^i|ξ0,𝝁(𝐱,y|ξ0,𝝁)\displaystyle=\frac{1}{2\sigma^{2}}\int Q_{\bm{\theta}_{1}^{\mathrm{reg}}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}|\xi_{0},\bm{\mu}^{\bot})\big{(}P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\mathbf{x},y|\xi_{0},\bm{\mu}^{\bot})
      P𝐗i,Y^i|ξ0,𝝁,𝜽1reg(𝐱,y|ξ0,𝝁,𝜽))(𝐱𝐱2y𝜽𝐱+𝜽𝜽)d𝐱dyd𝜽\displaystyle\qquad\qquad-P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot},\bm{\theta}_{1}^{\mathrm{reg}}}(\mathbf{x},y|\xi_{0},\bm{\mu}^{\bot},\bm{\theta})\big{)}(\mathbf{x}^{\top}\mathbf{x}-2y\bm{\theta}^{\top}\mathbf{x}+\bm{\theta}^{\top}\bm{\theta})\mathrm{d}\mathbf{x}\mathrm{d}y\mathrm{d}\bm{\theta} (259)
      =1σ2P𝐗i,Y^i|ξ0,𝝁(𝐱,y|ξ0,𝝁)(Q𝜽1reg|ξ0,𝝁(𝜽|ξ0,𝝁)\displaystyle=-\frac{1}{\sigma^{2}}\int P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\mathbf{x},y|\xi_{0},\bm{\mu}^{\bot})\big{(}Q_{\bm{\theta}_{1}^{\mathrm{reg}}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}|\xi_{0},\bm{\mu}^{\bot})
      P𝜽1reg|𝐗i,Y^i,ξ0,𝝁(𝜽|𝐱,y,ξ0,𝝁))(y𝜽𝐱)d𝐱dyd𝜽\displaystyle\qquad\qquad-P_{\bm{\theta}_{1}^{\mathrm{reg}}|\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime},\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta}|\mathbf{x},y,\xi_{0},\bm{\mu}^{\bot})\big{)}(y\bm{\theta}^{\top}\mathbf{x})\mathrm{d}\mathbf{x}\mathrm{d}y\mathrm{d}\bm{\theta} (260)
      =1σ2P𝐗i,Y^i|ξ0,𝝁(𝐱,y|ξ0,𝝁)(𝝁1ξ0,𝝁y𝐱m(1+σ2λ))(y𝐱)d𝐱dy\displaystyle=-\frac{1}{\sigma^{2}}\int P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}(\mathbf{x},y|\xi_{0},\bm{\mu}^{\bot})\bigg{(}\frac{\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}-y\mathbf{x}}{m(1+\sigma^{2}\lambda)}\bigg{)}^{\top}(y\mathbf{x})\mathrm{d}\mathbf{x}\mathrm{d}y (261)
      =1mσ2(1+σ2λ)(𝔼[𝐗i𝐗i|ξ0,𝝁](𝝁1ξ0,𝝁)𝝁1ξ0,𝝁)\displaystyle=\frac{1}{m\sigma^{2}(1+\sigma^{2}\lambda)}\Big{(}\mathbb{E}[\mathbf{X}_{i}^{\prime\top}\mathbf{X}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}]-(\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}})^{\top}\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}\Big{)} (262)
      =dσ2+𝝁𝝁(𝝁1ξ0,𝝁)𝝁1ξ0,𝝁mσ2(1+σ2λ)\displaystyle=\frac{d\sigma^{2}+\bm{\mu}^{\top}\bm{\mu}-(\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}})^{\top}\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}}{m\sigma^{2}(1+\sigma^{2}\lambda)} (263)
      =dσ2+1Jσ2(α)Kσ2(α)mσ2(1+σ2λ).\displaystyle=\frac{d\sigma^{2}+1-J_{\sigma}^{2}(\alpha)-K_{\sigma}^{2}(\alpha)}{m\sigma^{2}(1+\sigma^{2}\lambda)}. (264)
    • Calculate 𝔼θ1reg|ξ0,μ[h(P𝐙,pθ1reg)h(P𝐗i,Y^i|ξ0,μ,pθ1reg)]\mathbb{E}_{\bm{\theta}_{1}^{\mathrm{reg}}|\xi_{0},\bm{\mu}^{\bot}}\big{[}h(P_{\mathbf{Z}},p_{\bm{\theta}_{1}^{\mathrm{reg}}})-h(P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}},p_{\bm{\theta}_{1}^{\mathrm{reg}}})\big{]}: Given any (ξ0,𝝁)(\xi_{0},\bm{\mu}^{\bot}), in the following, we drop the condition on ξ0,𝝁\xi_{0},\bm{\mu}^{\bot} for notational simplicity. Since given 𝜽1\bm{\theta}_{1}, p𝜽1(𝐱|)p_{\bm{\theta}_{1}}(\mathbf{x}|\cdot) is a Gaussian distribution, for any y{±1}y\in\{\pm 1\}, we have

      12P𝜽1reg|ξ0,𝝁(𝜽)(P𝐗|Y(𝐱|𝐲)P𝐗i|Y^i(𝐱|y))log1p𝜽(𝐱|y)d𝐱d𝜽\displaystyle\frac{1}{2}\int P_{\bm{\theta}_{1}^{\mathrm{reg}}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta})\big{(}P_{\mathbf{X}|Y}(\mathbf{x}|\mathbf{y})-P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}}(\mathbf{x}|y)\big{)}\log\frac{1}{p_{\bm{\theta}}(\mathbf{x}|y)}\mathrm{d}\mathbf{x}\mathrm{d}\bm{\theta}
      =12σ2P𝜽1reg|ξ0,𝝁(𝜽)(12P𝐗|Y(𝐱|y)12P𝐗i|Y^i(𝐱|y))(y𝜽𝐱)d𝐱d𝜽\displaystyle=-\frac{1}{2\sigma^{2}}\int P_{\bm{\theta}_{1}^{\mathrm{reg}}|\xi_{0},\bm{\mu}^{\bot}}(\bm{\theta})\bigg{(}\frac{1}{2}P_{\mathbf{X}|Y}(\mathbf{x}|y)-\frac{1}{2}P_{\mathbf{X}_{i}^{\prime}|\hat{Y}_{i}^{\prime}}(\mathbf{x}|y)\bigg{)}\big{(}y\bm{\theta}^{\top}\mathbf{x}\big{)}\mathrm{d}\mathbf{x}\mathrm{d}\bm{\theta} (265)
      =12σ2(𝝁1reg|ξ0,𝝁)(𝝁𝝁1ξ0,𝝁)\displaystyle=-\frac{1}{2\sigma^{2}}(\bm{\mu}_{1}^{\mathrm{reg}|\xi_{0},\bm{\mu}^{\bot}})^{\top}\big{(}\bm{\mu}-\bm{\mu}_{1}^{\xi_{0},\bm{\mu}^{\bot}}\big{)} (266)
      =Jσ2(α)+Kσ2(α)Jσ(α)2σ2(1+σ2λ).\displaystyle=\frac{J_{\sigma}^{2}(\alpha)+K_{\sigma}^{2}(\alpha)-J_{\sigma}(\alpha)}{2\sigma^{2}(1+\sigma^{2}\lambda)}. (267)

      Thus, we have

      \displaystyle\mathbb{E}_{\bm{\theta}_{1}^{\mathrm{reg}}|\xi_{0},\bm{\mu}^{\bot}}[\mathrm{\Delta h}(P_{\mathbf{Z}}\|P_{\mathbf{X}_{i}^{\prime},\hat{Y}_{i}^{\prime}|\xi_{0},\bm{\mu}^{\bot}}|p_{\bm{\theta}_{1}^{\mathrm{reg}}})]=\frac{J_{\sigma}^{2}(\alpha)+K_{\sigma}^{2}(\alpha)-J_{\sigma}(\alpha)}{\sigma^{2}(1+\sigma^{2}\lambda)}. (268)

    Finally, the gen-error at t=1t=1 can be characterized as follows:

    gen1reg=𝔼ξ0,𝝁[Jσ2(α)+Kσ2(α)Jσ(α)σ2(1+σ2λ)+dσ2+1Jσ2(α)Kσ2(α)mσ2(1+σ2λ)]=gen11+σ2λ,\displaystyle\mathrm{gen}_{1}^{\mathrm{reg}}=\mathbb{E}_{\xi_{0},\bm{\mu}^{\bot}}\bigg{[}\frac{J_{\sigma}^{2}(\alpha)+K_{\sigma}^{2}(\alpha)-J_{\sigma}(\alpha)}{\sigma^{2}(1+\sigma^{2}\lambda)}+\frac{d\sigma^{2}+1-J_{\sigma}^{2}(\alpha)-K_{\sigma}^{2}(\alpha)}{m\sigma^{2}(1+\sigma^{2}\lambda)}\bigg{]}=\frac{\mathrm{gen}_{1}}{1+\sigma^{2}\lambda}, (269)

    where α\alpha stands for α(ξ0,𝝁)\alpha(\xi_{0},\bm{\mu}^{\bot}).

  2. 2.

Iteration t\in[2:\tau]: Since \bm{\theta}_{t}^{\mathrm{reg}}=\frac{\bm{\theta}_{t}}{1+\sigma^{2}\lambda}, by iteratively applying the same techniques as in iteration t=1 and in Appendix C, the gen-error at any t\in[2:\tau] can be characterized as follows:

    gentreg=gent1+σ2λ.\displaystyle\mathrm{gen}_{t}^{\mathrm{reg}}=\frac{\mathrm{gen}_{t}}{1+\sigma^{2}\lambda}. (270)

Theorem 4 is thus proved.
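
As a brief numerical check of the key step in the proof above (a hedged sketch, not the paper's code), the snippet below verifies the closed-form shrinkage \bm{\theta}_{0}^{\mathrm{reg}}=\bm{\theta}_{0}/(1+\sigma^{2}\lambda) in (250) by directly minimizing the regularized objective (249) on synthetic bGMM data; the same scaling propagates through (252)-(254) and is what yields \mathrm{gen}_{t}^{\mathrm{reg}}=\mathrm{gen}_{t}/(1+\sigma^{2}\lambda).

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
d, sigma, lam, n = 5, 0.8, 2.0, 50
mu = np.zeros(d); mu[0] = 1.0
Y = rng.choice([-1, 1], size=n)
X = Y[:, None] * mu + sigma * rng.standard_normal((n, d))     # labelled bGMM samples

theta0 = np.mean(Y[:, None] * X, axis=0)                      # unregularized minimizer (lambda = 0)
theta0_reg = theta0 / (1 + sigma**2 * lam)                    # closed-form ridge minimizer, cf. (250)

def regularized_risk(theta):                                  # objective (249), up to an additive constant
    return (np.sum((X - Y[:, None] * theta)**2) / (2 * sigma**2 * n)
            + lam / 2 * np.sum(theta**2))

theta_num = minimize(regularized_risk, np.zeros(d)).x
print("max |closed form - numerical| =", np.max(np.abs(theta0_reg - theta_num)))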

References

  • Akaho and Kappen (2000) Shotaro Akaho and Hilbert J Kappen. Nonmonotonic generalization bias of Gaussian mixture models. Neural Computation, 12(6):1411–1427, 2000.
  • Aminian et al. (2021) Gholamali Aminian, Yuheng Bu, Laura Toni, Miguel Rodrigues, and Gregory Wornell. An exact characterization of the generalization error for the Gibbs algorithm. Advances in Neural Information Processing Systems, 34, 2021.
  • Aminian et al. (2022) Gholamali Aminian, Mahed Abroshan, Mohammad Mahdi Khalili, Laura Toni, and Miguel Rodrigues. An information-theoretical approach to semi-supervised learning under covariate-shift. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151, pages 7433–7449. PMLR, 28–30 Mar 2022.
  • Anzai (2012) Yuichiro Anzai. Pattern Recognition and Machine Learning. Elsevier, 2012.
  • Arazo et al. (2020) Eric Arazo, Diego Ortego, Paul Albert, Noel E O’Connor, and Kevin McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8. IEEE, 2020.
  • Boucheron et al. (2005) Stéphane Boucheron, Olivier Bousquet, and Gábor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
  • Boucheron et al. (2013) Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
  • Bousquet and Elisseeff (2002) Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2:499–526, 2002.
  • Bu et al. (2020) Yuheng Bu, Shaofeng Zou, and Venugopal V Veeravalli. Tightening mutual information based bounds on generalization error. IEEE Journal on Selected Areas in Information Theory, 1(1):121–130, 2020.
  • Bu et al. (2022) Yuheng Bu, Gholamali Aminian, Laura Toni, Gregory W. Wornell, and Miguel Rodrigues. Characterizing and understanding the generalization error of transfer learning with Gibbs algorithm. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151, pages 8673–8699. PMLR, 28–30 Mar 2022.
  • Cao et al. (2019) Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 1567–1578, 2019.
  • Carmon et al. (2019) Yair Carmon, Aditi Raghunathan, Ludwig Schmidt, Percy Liang, and John C Duchi. Unlabeled data improves adversarial robustness. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, pages 11192–11203, 2019.
  • Castelli and Cover (1996) Vittorio Castelli and Thomas M Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6):2102–2117, 1996.
  • Chapelle et al. (2006) Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, editors. Semi-Supervised Learning. The MIT Press, 2006.
  • Chawla and Karakoulas (2005) Nitesh V Chawla and Grigoris Karakoulas. Learning from labeled and unlabeled data: An empirical study across techniques and domains. Journal of Artificial Intelligence Research, 23:331–366, 2005.
  • Dupre et al. (2019) Robert Dupre, Jiri Fajtl, Vasileios Argyriou, and Paolo Remagnino. Improving dataset volumes and model accuracy with semi-supervised iterative self-learning. IEEE Transactions on Image Processing, 29:4337–4348, 2019.
  • Esposito et al. (2021) Amedeo Roberto Esposito, Michael Gastpar, and Ibrahim Issa. Generalization error bounds via Rényi-, ff-divergences and maximal leakage. IEEE Transactions on Information Theory, 67(8):4986–5004, 2021.
  • Goodfellow et al. (2016) Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT press, 2016.
  • Haghifam et al. (2020) Mahdi Haghifam, Jeffrey Negrea, Ashish Khisti, Daniel M Roy, and Gintare Karolina Dziugaite. Sharpened generalization bounds based on conditional mutual information and an application to noisy, iterative algorithms. Advances in Neural Information Processing Systems, 33:9925–9935, 2020.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • Jose and Simeone (2021) Sharu Theresa Jose and Osvaldo Simeone. Information-theoretic bounds on transfer generalization gap based on Jensen–Shannon divergence. In 2021 29th European Signal Processing Conference (EUSIPCO), pages 1461–1465. IEEE, 2021.
  • Kawaguchi et al. (2017) Kenji Kawaguchi, Leslie Pack Kaelbling, and Yoshua Bengio. Generalization in deep learning. arXiv preprint arXiv:1710.05468, 2017.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009.
  • LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Lee (2013) Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, page 896, 2013.
  • Li et al. (2019) Jian Li, Yong Liu, Rong Yin, and Weiping Wang. Multi-class learning using unlabeled samples: Theory and algorithm. In International Joint Conference on Artificial Intelligence (IJCAI), pages 2880–2886, 2019.
  • Liu and Mukhopadhyay (2018) Qun Liu and Supratik Mukhopadhyay. Unsupervised learning using pretrained CNN and associative memory bank. In 2018 International Joint Conference on Neural Networks (IJCNN), pages 01–08. IEEE, 2018.
  • Lopez and Jog (2018) Adrian Tovar Lopez and Varun Jog. Generalization error bounds using Wasserstein distances. In 2018 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2018.
  • MacKay (2003) David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
  • Mignacco et al. (2020) Francesca Mignacco, Florent Krzakala, Yue Lu, Pierfrancesco Urbani, and Lenka Zdeborova. The role of regularization in classification of high-dimensional noisy Gaussian mixture. In International Conference on Machine Learning, pages 6874–6883. PMLR, 2020.
  • Moody (1992) John Moody. The effective number of parameters: An analysis of generalization and regularization in nonlinear learning systems. Advances in Neural Information Processing Systems, 4, 1992.
  • Muthukumar et al. (2021) Vidya Muthukumar, Adhyyan Narang, Vignesh Subramanian, Mikhail Belkin, Daniel Hsu, and Anant Sahai. Classification vs regression in overparameterized regimes: Does the loss function matter? Journal of Machine Learning Research, 22(222):1–69, 2021.
  • Negrea et al. (2019) Jeffrey Negrea, Mahdi Haghifam, Gintare Karolina Dziugaite, Ashish Khisti, and Daniel M Roy. Information-theoretic generalization bounds for SGLD via data-dependent estimates. In Advances in Neural Information Processing Systems, pages 11013–11023, 2019.
  • Oymak and Gülcü (2021) Samet Oymak and Talha Cihad Gülcü. A theoretical characterization of semi-supervised learning with self-training for Gaussian mixture models. In International Conference on Artificial Intelligence and Statistics, pages 3601–3609. PMLR, 2021.
  • Pensia et al. (2018) Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 546–550. IEEE, 2018.
  • Russo and Zou (2016) Daniel Russo and James Zou. Controlling bias in adaptive data analysis using information theory. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51, pages 1232–1240, Cadiz, Spain, 09–11 May 2016. PMLR.
  • Singh et al. (2008) Aarti Singh, Robert Nowak, and Jerry Zhu. Unlabeled data: Now it helps, now it doesn’t. Advances in Neural Information Processing Systems, 21:1513–1520, 2008.
  • Steinke and Zakynthinou (2020) Thomas Steinke and Lydia Zakynthinou. Reasoning About Generalization via Conditional Mutual Information. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125, pages 3437–3452. PMLR, 09–12 Jul 2020.
  • Triguero et al. (2015) Isaac Triguero, Salvador García, and Francisco Herrera. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowledge and Information Systems, 42(2):245–284, 2015.
  • Van Engelen and Hoos (2020) Jesper E Van Engelen and Holger H Hoos. A survey on semi-supervised learning. Machine Learning, 109(2):373–440, 2020.
  • Vapnik (2000) Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer, 2000.
  • Vershynin (2018) Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.
  • Wang and Thrampoulidis (2022) Ke Wang and Christos Thrampoulidis. Binary classification of Gaussian mixtures: Abundance of support vectors, benign overfitting, and regularization. SIAM Journal on Mathematics of Data Science, 4(1):260–284, 2022.
  • Watanabe and Watanabe (2006) Kazuho Watanabe and Sumio Watanabe. Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. The Journal of Machine Learning Research, 7:625–644, 2006.
  • Wu et al. (2020) Xuetong Wu, Jonathan H Manton, Uwe Aickelin, and Jingge Zhu. Information-theoretic analysis for transfer learning. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 2819–2824. IEEE, 2020.
  • Xu and Raginsky (2017) Aolin Xu and Maxim Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, pages 2524–2533, 2017.
  • Zhu (2020) Jingge Zhu. Semi-supervised learning: the case when unlabeled data is equally useful. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), volume 124, pages 709–718. PMLR, 03–06 Aug 2020.
  • Zhu and Goldberg (2009) Xiaojin Zhu and Andrew B Goldberg. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 3(1):1–130, 2009.
  • Zhu (2008) Xiaojin Jerry Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2008.