Consistency Regularization with Generative Adversarial Networks for Semi-Supervised Learning

Zexi Chen, Bharathkumar Ramachandra, Ranga Raju Vatsavai

Abstract

Generative Adversarial Networks (GANs) based semi-supervised learning (SSL) approaches are shown to improve classification performance by utilizing a large number of unlabeled samples in conjunction with limited labeled samples. However, their performance still lags behind the state-of-the-art non-GAN based SSL approaches. We identify that the main reason for this is the lack of consistency in class probability predictions on the same image under local perturbations. Following the general literature, we address this issue via label consistency regularization, which enforces the class probability predictions for an input image to be unchanged under various semantic-preserving perturbations. In this work, we introduce consistency regularization into the vanilla semi-GAN to address this critical limitation. In particular, we present a new composite consistency regularization method which, in spirit, leverages both local consistency and interpolation consistency. We demonstrate the efficacy of our approach on two SSL image classification benchmark datasets, SVHN and CIFAR-10. Our experiments show that this new composite consistency regularization based semi-GAN significantly improves its performance and achieves new state-of-the-art performance among GAN-based SSL approaches.

Introduction

Figure 1: A visual comparison of top-2 predictions between semi-GAN (no consistency) and our semi-GAN (with composite consistency) on a CIFAR-10 test image under different augmentations. The blue bars indicate predicted probabilities.

In the past decade, supervised classification performance has improved significantly with the advent of deep neural networks (Simonyan and Zisserman 2014; He et al. 2016; Huang et al. 2017). These advancements can be chiefly attributed to the training of deep neural networks on large-scale, well-annotated image classification datasets such as ImageNet (Deng et al. 2009). However, obtaining such datasets with large amounts of labeled data is often prohibitive due to time, cost, expertise, and privacy restrictions. Semi-supervised learning (SSL) presents an alternative, where models can learn representations from plentiful unlabeled data, thus reducing the heavy dependence on the availability of large labeled datasets.

In recent years, Deep Generative Models (DGMs) (Kingma and Welling 2013; Goodfellow et al. 2014) have emerged as an advanced framework for learning data representations in an unsupervised manner. In particular, Generative Adversarial Networks (GANs) (Goodfellow et al. 2014) have demonstrated an ability to learn a generative model of an arbitrary data distribution and to produce visually realistic artificial (fake) images. GANs set up an adversarial game between a generator network and a discriminator network, where the generator is tasked with tricking the discriminator with generated samples, whereas the discriminator is tasked with telling apart real and generated samples. Semi-GAN (Salimans et al. 2016) is one of the earlier extensions of GANs to the SSL domain, where the discriminator employs a (K+1)-class predictor with the extra class referring to the fake samples from the generator.

We first observe that semi-GAN suffers from inconsistent predictions in our experiments on the CIFAR-10 dataset. In this experiment, each unlabeled image is augmented with two different data augmentations and fed into the well-trained discriminator of a semi-GAN. Figure 1 depicts such input images on which the semi-GAN’s discriminator produces inconsistent predictions, whereas our proposed composite consistency based semi-GAN produces desired results. Although many approaches (Dai et al. 2017; Qi et al. 2018; Dumoulin et al. 2016; Lecouat et al. 2018) have been developed to improve the performance of semi-GAN, regularizing semi-GAN with consistency techniques has barely been explored in the literature. Consistency regularization specifies that the classifier should always make consistent predictions for an unlabeled data sample, in particular, under semantic-preserving perturbations. It follows from the popular smoothness assumption (Chapelle, Scholkopf, and Zien 2009) in SSL that if two points in a high-density region of data manifold are close, then so should the corresponding outputs. Based on this intuition, we hypothesize that the discriminator of semi-GAN should also produce consistent outputs on perturbed versions of the same image.

Thus, in this work we propose to extend semi-GAN by integrating consistency regularization into the discriminator. Previous works on consistency regularization focus on either local consistency that regularizes the classifier to be resilient to local perturbations added to data samples, or interpolation consistency that regularizes the classifier to produce consistent predictions at interpolations of data samples. In this work, we propose a new composite consistency regularization by exploring both of them. In summary, we make the following contributions:

• We propose a new consistency measure called composite consistency, which combines both local consistency and interpolation consistency. We experimentally show that this composite consistency with semi-GAN produces the best results among the three consistency-based techniques.
• We propose an integration of consistency regularization into the discriminator of semi-GAN, encouraging it to make consistent predictions for data under perturbations, thus leading to improved semi-supervised classification. Experimentally, our semi-GAN with composite consistency sets new state-of-the-art performance on the two SSL benchmark datasets SVHN and CIFAR-10, with error rates reduced by 2.87% and 3.13% respectively while using the least amount of labeled data.

Preliminaries

In a general SSL setting, we are given a small set of labeled samples $(\mathbf{x}_{l},y_{l})$ and a large set of unlabeled samples $\mathbf{x}_{u}$ , where every $\mathbf{x}\in\mathbb{R}^{d}$ is a $d$ -dimensional input data sample and $y\in\{1,2,...,K\}$ is one of $K$ class labels. The objective of SSL is to learn a classifier $D(y|\mathbf{x};\theta):\mathcal{X}\rightarrow\mathcal{Y}$ , mapping from the input space $\mathcal{X}$ to the label space $\mathcal{Y}$ , parameterized by $\theta$ . In deep SSL approaches, $D(y|\mathbf{x};\theta)$ is chosen to be represented by a deep neural network.

Review of semi-GAN

In a Generative Adversarial Network (GAN), an adversarial two-player game is set up between discriminator and generator networks. The objective of the generator $G(\mathbf{z};\delta)$ is to transform a random vector $\mathbf{z}$ into a fake sample that cannot be distinguished from real samples by the discriminator. The discriminator is a binary classifier tasked with judging whether a sample is real or fake. Salimans et al. (2016) pioneered the extension of GANs to SSL by proposing the first GAN-based SSL approach, named semi-GAN. In semi-GAN, the discriminator is adjusted into a $(K+1)$ -head classifier, where the first $K$ are real classes originating from the dataset and the $(K+1)$ -th class is the fake class referring to generated samples. The objective function for the discriminator is formulated as:

\begin{split}\mathcal{L}_{D}&=-\mathbb{E}_{p(\mathbf{x}_{l},y_{l})}[\text{log }D(y_{l}|\mathbf{x}_{l};\theta)]\\ &\quad-\mathbb{E}_{p(\mathbf{z})}[\text{log }D(y=K+1|G(\mathbf{z};\delta);\theta)]\\ &\quad-\mathbb{E}_{p(\mathbf{x})}[\text{log }(1-D(y=K+1|\mathbf{x};\theta))]\end{split}

(1)

The first term is the standard supervised loss $\mathcal{L}_{\text{supervised}}$ that maximizes the log-likelihood that a labeled data sample is classified correctly into its ground-truth class. The second and third terms constitute the unsupervised loss $\mathcal{L}_{\text{unsupervised}}$ that classifies real samples $\mathbf{x}$ as non-fake ( $y<K+1$ ) and generated samples $G(\mathbf{z})$ as fake ( $y=K+1$ ).
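As a rough illustration of Eq. 1, the sketch below estimates the three terms from mini-batches of $(K+1)$ -way logits in plain Python. The function names `softmax` and `semi_gan_d_loss`, the 0-based class indexing, and the convention that the last logit is the fake class are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def semi_gan_d_loss(labeled_logits, labels, real_logits, fake_logits):
    """Mini-batch estimate of the discriminator loss in Eq. 1.

    Each logit vector has K+1 entries; the LAST entry is assumed to be
    the fake class. Labels are assumed to be 0-based indices in [0, K).
    """
    # Supervised term: -log D(y_l | x_l) averaged over labeled samples.
    sup = -sum(
        math.log(softmax(z)[y]) for z, y in zip(labeled_logits, labels)
    ) / len(labels)
    # Generated samples should be assigned to the fake class.
    fake = -sum(math.log(softmax(z)[-1]) for z in fake_logits) / len(fake_logits)
    # Real samples should fall into any of the K real classes:
    # 1 - D(y = K+1 | x) equals the summed probability of the real classes.
    real = -sum(math.log(sum(softmax(z)[:-1])) for z in real_logits) / len(real_logits)
    return sup + fake + real


# Toy usage with K = 2 real classes and uninformative (all-zero) logits.
uniform = [0.0, 0.0, 0.0]
loss = semi_gan_d_loss([uniform], [0], [uniform], [uniform])
```

With uniform predictions over three classes, each of the first two terms equals $\log 3$ and the third equals $\log(3/2)$ , which gives a simple sanity check for the sketch.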

They also proposed a feature matching loss for the generator, where the objective is to minimize the discrepancy of the first moment between real and generated data distributions in feature space, represented as:

\mathcal{L}_{G}={||\mathbb{E}_{p(\mathbf{x})}\mathbf{f}(\mathbf{x};\theta_{f})-\mathbb{E}_{p(\mathbf{z})}\mathbf{f}(G(\mathbf{z};\delta);\theta_{f})||}_{2}^{2}

(2)

where $\mathbf{f}$ is an intermediate layer from the discriminator $D$ , and $\theta_{f}$ is a subset of $\theta$ , including all the parameters up to that intermediate layer of the discriminator. In practice, feature matching loss has exhibited excellent performance for SSL tasks and has been broadly employed by follow-on GAN-based SSL approaches (Dai et al. 2017; Qi et al. 2018).
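The feature matching objective in Eq. 2 can be sketched as follows, with the two expectations replaced by mini-batch means. The function name `feature_matching_loss` and the representation of features as lists of floats are assumptions for illustration only.

```python
def feature_matching_loss(real_feats, fake_feats):
    """Squared L2 distance between the batch-mean feature vectors (Eq. 2).

    real_feats and fake_feats are lists of equal-length feature vectors
    taken from the same intermediate discriminator layer.
    """
    dim = len(real_feats[0])
    mean_real = [sum(f[i] for f in real_feats) / len(real_feats) for i in range(dim)]
    mean_fake = [sum(f[i] for f in fake_feats) / len(fake_feats) for i in range(dim)]
    return sum((r - g) ** 2 for r, g in zip(mean_real, mean_fake))


# Toy usage: mean real feature is [2, 3], mean generated feature is [1, 1].
fm = feature_matching_loss([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
```

Only the first moments are compared, so two batches with identical means give zero loss even if their individual samples differ.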

Review of consistency regularization

Consistency regularization has been widely used in semi-supervised or unsupervised learning approaches (Valpola 2015; Laine and Aila 2016; Tarvainen and Valpola 2017; Miyato et al. 2018). The intuition behind it is that the classifier should make consistent predictions that are invariant to small perturbations added to either inputs or intermediate representations, for both labeled and unlabeled data. To enforce consistency, the $\Gamma$ -model (Rasmus et al. 2015) evaluates each data input with and without perturbation, and minimizes the discrepancy between the two predictions. In this case, the classifier can be considered as assuming two parallel roles, one as a student model for regular learning and the other as a teacher model for generating learning targets.

More formally, the consistency loss term is defined as the divergence of the predictions between the student model and the teacher model, formulated as

\mathcal{L}_{cons}=\mathbb{E}_{p(\mathbf{x})}d[D(y|\mathbf{x};\theta,\xi),D(y|\mathbf{x};\theta^{\prime},\xi^{\prime})]

(3)

where $D(y|\mathbf{x};\theta,\xi)$ is the student with parameters $\theta$ and random perturbation $\xi$ , and $D(y|\mathbf{x};\theta^{\prime},\xi^{\prime})$ is the teacher with parameters $\theta^{\prime}$ and random perturbation $\xi^{\prime}$ . $d[\mathord{\cdot},\mathord{\cdot}]$ measures the divergence between the two predictions, usually chosen to be Euclidean distance or Kullback-Leibler divergence.
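The two common choices for the divergence $d$ in Eq. 3 can be sketched as below. The function names, the small `eps` guard, and the argument order of the KL divergence are assumptions for this example; the text does not fix which prediction plays the role of the reference distribution.

```python
import math


def squared_euclidean(p, q):
    """Squared Euclidean distance between two probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); eps guards against log(0) (an illustrative choice)."""
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def consistency_loss(student_probs, teacher_probs, d=squared_euclidean):
    """Mini-batch estimate of Eq. 3: the mean of d[student, teacher].

    In training, the teacher predictions would normally be treated as
    constant targets (no gradient flows through them).
    """
    pairs = list(zip(student_probs, teacher_probs))
    return sum(d(s, t) for s, t in pairs) / len(pairs)


# Toy usage: the first sample disagrees with the teacher, the second agrees.
student = [[0.5, 0.5], [1.0, 0.0]]
teacher = [[1.0, 0.0], [1.0, 0.0]]
cons = consistency_loss(student, teacher)
```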

Method

To address the prediction inconsistency of semi-GAN (Salimans et al. 2016), we integrate consistency regularization into semi-GAN, leading it to produce consistent outputs (predictions) under small perturbations. More specifically, we incorporate consistency regularization as an additional auxiliary loss term to the discriminator, as shown in Eq.4.

\begin{split}\mathcal{L}_{D}&=-\mathbb{E}_{p(\mathbf{x}_{l},y_{l})}[\text{log }D(y_{l}|\mathbf{x}_{l};\theta,\xi)]\\ &\quad-\mathbb{E}_{p(\mathbf{z})}[\text{log }D(y=K+1|G(\mathbf{z};\delta);\theta)]\\ &\quad-\mathbb{E}_{p(\mathbf{x})}[\text{log }(1-D(y=K+1|\mathbf{x};\theta,\xi))]\\ &\quad+\lambda_{cons}\mathbb{E}_{p(\mathbf{x})}d[D(y|\mathbf{x};\theta,\xi),D(y|\mathbf{x};\theta^{\prime},\xi^{\prime})]\end{split}

(4)

where the first three terms come from the original discriminator loss of semi-GAN (see Eq.1), the fourth term is the consistency loss (see Eq.3), and the coefficient $\lambda_{cons}$ is a hyper-parameter controlling the importance of the consistency loss. Figure 2 displays our new model architecture. As shown in the figure, the discriminator $D(y|\mathbf{x};\theta)$ in semi-GAN (Salimans et al. 2016) is also treated as the student model for the consistency regularization, and the consistency loss is enforced as the prediction difference between the student and teacher models for real data.

Two types of consistency regularization methods have been developed in recent years. One is local consistency, which encourages the classifier to be resilient to local perturbations added to data samples. Local perturbations are usually represented in the form of input augmentations (Laine and Aila 2016; Tarvainen and Valpola 2017) or adversarial noise (Miyato et al. 2018). In this work, we explore the integration of local consistency into semi-GAN and choose the Mean Teacher (MT) method (Tarvainen and Valpola 2017) as our consistency regularization.

MT imposes consistency by adding random perturbations to the input of the model. As shown in Figure 3 (a), the input data are transformed with certain types of augmentation (e.g., image shifting, flipping, etc.) randomly twice. The two augmented inputs are then fed into the student model and teacher model separately, and the consistency is achieved by minimizing the prediction difference between the student model and teacher model. One key aspect of MT is that it improves the quality of the learning targets from the teacher model by forming a better teacher model. Namely, the parameters $\theta^{\prime}$ of the teacher model are maintained as an exponential moving average (EMA) of the parameters $\theta$ of the student model during training, formulated as:

\theta_{t}^{\prime}=k\theta^{\prime}_{t-1}+(1-k)\theta_{t}

(5)

where $t$ indexes the training step and the hyper-parameter $k$ is the EMA decay coefficient. By aggregating information from the student model in an EMA manner at training time, a better teacher model can generate more stable predictions which serve as higher quality learning targets to guide the learning of the student model.
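A minimal sketch of this EMA update is shown below, with parameters stored as flat lists of floats. The function name `ema_update` and the flat-list representation are assumptions for illustration; a real implementation would iterate over the tensors of both networks.

```python
def ema_update(teacher_params, student_params, k=0.99):
    """Return new teacher parameters as an exponential moving average.

    Each teacher parameter keeps a fraction k of its previous value and
    takes a fraction (1 - k) from the current student parameter.
    """
    return [k * t + (1.0 - k) * s for t, s in zip(teacher_params, student_params)]


# Toy usage: the teacher drifts toward a fixed student over three steps.
teacher = [0.0]
for _ in range(3):
    teacher = ema_update(teacher, [1.0], k=0.5)
```

With a fixed student at 1.0 and $k=0.5$ , the teacher moves through 0.5, 0.75, and 0.875, which shows how a larger $k$ makes the teacher change more slowly.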

We then explore the integration of the other type of consistency regularization, interpolation consistency, which encourages consistent predictions at interpolations of two data samples. It was first proposed in Interpolation Consistency Training (ICT) (Verma et al. 2019). In ICT, the interpolation is implemented using the MixUp operation (Zhang et al. 2018). Given any two vectors $u$ and $v$ , we can define the MixUp operation as

{Mix}_{\lambda}(u,v)=\lambda\cdot u+(1-\lambda)\cdot v

(6)

where $\lambda\in[0,1]$ is a parameter randomly sampled from a Beta distribution, $\lambda\sim\text{Beta}(\alpha,\alpha)$ , and $\alpha$ is a hyper-parameter controlling the sampling process. With the MixUp operation, given two randomly shuffled versions of the dataset $\mathbf{x}$ after data augmentation $\xi$ , denoted $\mathbf{x}_{m}$ and $\mathbf{x}_{n}$ , the ICT consistency is computed as

\begin{split}\mathcal{L}_{ict\_cons}&=\mathbb{E}_{p(\mathbf{x}_{m},\mathbf{x}_{n}|\mathbf{x},\xi)}d[D(y_{mix}|{Mix}_{\lambda}(\mathbf{x}_{m},\mathbf{x}_{n});\theta),\\ &\quad\quad\quad\quad{Mix}_{\lambda}(D(y_{m}|\mathbf{x}_{m};\theta^{\prime}),D(y_{n}|\mathbf{x}_{n};\theta^{\prime}))]\end{split}

(7)

and it encourages the predictions from the student at interpolations of any two data samples (denoted as $D(y_{mix}|{Mix}_{\lambda}(\mathbf{x}_{m},\mathbf{x}_{n});\theta)$ ) to be consistent with the interpolations of the predictions from the teacher on the two samples (denoted as ${Mix}_{\lambda}(D(y_{m}|\mathbf{x}_{m};\theta^{\prime}),D(y_{n}|\mathbf{x}_{n};\theta^{\prime}))$ ), shown in Figure 3 (b).
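The MixUp operation of Eq. 6 and the ICT consistency of Eq. 7 can be sketched for a single pair of samples as follows. The names `mix` and `ict_consistency`, the use of callables for the student and teacher, and the choice of squared Euclidean distance for $d$ are assumptions made for this example.

```python
import random


def mix(lam, u, v):
    """MixUp of two equal-length vectors (Eq. 6)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]


def ict_consistency(student, teacher, x_m, x_n, lam):
    """ICT consistency for one pair of samples (Eq. 7).

    student and teacher are callables mapping an input vector to a
    prediction vector; squared Euclidean distance stands in for d.
    """
    # Student prediction at the interpolated input.
    pred_at_mix = student(mix(lam, x_m, x_n))
    # Interpolation of the teacher predictions on the two inputs.
    target = mix(lam, teacher(x_m), teacher(x_n))
    return sum((a - b) ** 2 for a, b in zip(pred_at_mix, target))


# lambda ~ Beta(alpha, alpha); alpha = 0.1 is used here only as an example.
lam = random.betavariate(0.1, 0.1)

# Toy usage: a nonlinear "model" is penalized because its prediction at the
# interpolated input differs from the interpolated predictions.
square = lambda x: [v * v for v in x]
ict = ict_consistency(square, square, [1.0], [0.0], lam=0.5)
```

A model that is linear between the two samples incurs zero loss, so the term can be read as a penalty on curvature of the predictor between data points.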

Composite Consistency

Though MT chooses to perturb data samples by certain types of data augmentations, and the ICT method chooses to perturb data samples from the perspective of data interpolations, they share some characteristics. If we set $\lambda=1$ in ICT, the interpolated sample ${Mix}_{\lambda}(\mathbf{x}_{m},\mathbf{x}_{n})$ is reduced to $\mathbf{x}_{m}$ , hence the ICT consistency loss term is reduced to

\begin{split}\mathcal{L}_{ict\_cons}&=\mathbb{E}_{p(\mathbf{x}_{m},\mathbf{x}_{n}|\mathbf{x},\xi)}d[D(y_{mix}|\mathbf{x}_{m};\theta),D(y_{m}|\mathbf{x}_{m};\theta^{\prime})]\end{split}

(8)

This loss term is the same as the MT consistency loss (see Eq.3) except that the same data augmentation $\xi$ is applied to the inputs of both student and teacher models. Accordingly, if two different data augmentations are applied to the inputs of the student and teacher models separately, as in MT, we can make ICT also robust to local perturbations, as shown in Figure 3 (c). In other words, we can combine these two consistency techniques so that the model is robust to both local perturbations and interpolation perturbations. We name the combination of these two consistency techniques composite consistency, and formulate the corresponding loss term $\mathcal{L}_{comp\_cons}$ as

\begin{split}\mathcal{L}_{comp\_cons}&=\mathbb{E}_{p(\mathbf{x}_{m},\mathbf{x}_{n}|\mathbf{x})}d[D(y_{mix}|{Mix}_{\lambda}(\mathbf{x}_{m},\mathbf{x}_{n});\theta,\xi),\\ &\quad\quad\quad{Mix}_{\lambda}(D(y_{m}|\mathbf{x}_{m};\theta^{\prime},\xi^{\prime}),D(y_{n}|\mathbf{x}_{n};\theta^{\prime},\xi^{\prime}))]\end{split}

(9)
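One possible reading of Eq. 9 for a single pair of samples is sketched below. The names `mix`, `composite_consistency`, `aug_student`, and `aug_teacher` are hypothetical, and applying each augmentation to the individual samples before mixing is an assumption of this sketch rather than a detail confirmed by the text.

```python
def mix(lam, u, v):
    """MixUp of two equal-length vectors (Eq. 6)."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(u, v)]


def composite_consistency(student, teacher, x_m, x_n, lam, aug_student, aug_teacher):
    """Composite consistency for one pair of samples (a reading of Eq. 9).

    aug_student and aug_teacher stand for two independently drawn
    augmentations (xi and xi'); squared Euclidean distance stands in for d.
    """
    # Student branch: augment with xi, interpolate the inputs, then predict.
    mixed_input = mix(lam, aug_student(x_m), aug_student(x_n))
    pred = student(mixed_input)
    # Teacher branch: augment with xi', predict, then interpolate predictions.
    target = mix(lam, teacher(aug_teacher(x_m)), teacher(aug_teacher(x_n)))
    return sum((a - b) ** 2 for a, b in zip(pred, target))


# Toy usage with lam = 1: the interpolation disappears and only the effect of
# the two different augmentations remains (the local-consistency case).
identity = lambda x: list(x)
shift = lambda x: [v + 1.0 for v in x]
cc = composite_consistency(identity, identity, [0.0, 0.0], [5.0, 5.0],
                           lam=1.0, aug_student=shift, aug_teacher=identity)
```

Setting both augmentations to the identity recovers the interpolation-only case, while setting $\lambda=1$ recovers a local-consistency term on $\mathbf{x}_{m}$ , which mirrors the argument used to motivate the combination.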

Experiments

Datasets and Implementation Details

Following the common practice in evaluating GAN-based SSL approaches (Salimans et al. 2016; Dumoulin et al. 2016; Chongxuan et al. 2017; Qi et al. 2018), we quantitatively evaluate our extensions using two SSL benchmark datasets: SVHN and CIFAR-10. The SVHN dataset consists of 73,257 training images and 26,032 test images. Each image has a size of 32 $\times$ 32 and is centered on a street view house number (a digit from 0 to 9). There are a total of 10 classes in the dataset. The CIFAR-10 dataset consists of 50,000 training images and 10,000 test images. Similarly, the CIFAR-10 dataset also has images of size 32 $\times$ 32 and 10 classes.

We utilize the same discriminator and generator network architectures as used in CT-GAN (Wei et al. 2018). See Appendix for more details of the network architectures. When training models on SVHN training data, we augment the images with random translation, where the image is randomly translated in both horizontal and vertical directions with a maximum of 2 pixels. For the CIFAR-10 dataset, we apply both random translation (in the same way as SVHN) and horizontal flips. For both datasets, we train the models with a batch size of 128 labeled samples and 128 unlabeled samples. We run the experiments with Adam Optimizer (with $\beta_{1}=0.5$ , $\beta_{2}=0.999$ ), where the learning rate is set to be 3e-4 for the first 400 epochs and linearly decayed to 0 in the next 200 epochs. Following the same training schema as in MT and ICT, we also employ the ramp-up phase for the consistency loss, where we increase consistency loss weight $\lambda_{cons}$ from 0 to its final value in the first 200 epochs. We adopt the same sigmoid-shaped function $e^{-5(1-\gamma)^{2}}$ (Tarvainen and Valpola 2017) as our ramp-up function, where $\gamma\in[0,1]$ . We set the EMA decay coefficient $k$ to 0.99 and the parameter $\alpha$ in $\text{Beta}(\alpha,\alpha)$ distribution to 0.1 through all our experiments.
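The two schedules described above can be sketched as simple functions of the epoch index. The function names, the mapping $\gamma=\text{epoch}/200$ clipped to $[0,1]$ , and evaluating the schedules once per epoch are assumptions of this example.

```python
import math


def consistency_weight(epoch, final_weight, rampup_epochs=200):
    """Sigmoid-shaped ramp-up exp(-5 (1 - gamma)^2) of the consistency weight.

    gamma is assumed to be the fraction of the ramp-up phase completed.
    At epoch 0 the weight is final_weight * exp(-5), which is small but
    not exactly zero.
    """
    gamma = min(max(epoch / rampup_epochs, 0.0), 1.0)
    return final_weight * math.exp(-5.0 * (1.0 - gamma) ** 2)


def learning_rate(epoch, base_lr=3e-4, constant_epochs=400, decay_epochs=200):
    """Constant learning rate, then linear decay to zero."""
    if epoch < constant_epochs:
        return base_lr
    remaining = max(constant_epochs + decay_epochs - epoch, 0)
    return base_lr * remaining / decay_epochs


# Toy usage: schedule values at a few epochs for a final weight of 10.
schedule = [(e, consistency_weight(e, 10.0), learning_rate(e)) for e in (0, 100, 200, 500, 600)]
```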

Ablation study

Effect of consistency loss weight $\lambda_{cons}$ : The most important hyper-parameter influencing model performance is the consistency loss weight $\lambda_{cons}$ . We conduct an experiment using semi-GAN with composite consistency on CIFAR-10 with 4,000 labeled images, where we train our model with a wide range of $\lambda_{cons}$ values; the results are shown in Figure 4. From the figure, we see that there is a sharp decrease in error rate as $\lambda_{cons}$ increases from 0 to 10, implying that composite consistency starts taking effect early on. The error rate then reaches a relatively steady state (between 10 and 20), after which it gradually increases as $\lambda_{cons}$ grows. In short, the test error drops quickly and is stable for $\lambda_{cons}\in[10,20]$ , and it rises again for larger values of $\lambda_{cons}$ .

Performance of different consistency techniques: As we describe three choices of consistency-based regularizers, it is necessary to quantify the benefits of integrating these into the semi-GAN. So we compare them empirically on CIFAR-10 with 1,000 and 4,000 labeled images, respectively. Table 1 shows the comparison results, and it is clear that incorporating consistency regularization into semi-GAN consistently improves the performance, and semi-GAN with composite consistency yields better results than MT or ICT consistency individually.


Models	Error rate (%)
	CIFAR-10	CIFAR-10
	$n_{l}=1,000$	$n_{l}=4,000$
semi-GAN	17.27 $\pm$ 0.83	14.12 $\pm$ 0.29
semi-GAN + MT	15.28 $\pm$ 1.03	12.08 $\pm$ 0.27
semi-GAN + ICT	15.11 $\pm$ 0.86	11.66 $\pm$ 0.50
semi-GAN + CC	14.36 $\pm$ 0.35	11.03 $\pm$ 0.42

Table 1: Performance of the three consistency measures with semi-GAN. The experiments are conducted over 5 runs and percent error rate is used as the evaluation criterion. “CC” is short for our proposed composite consistency.

Performance Comparison with Consistency-based techniques: In addition, we have also conducted experiments with MT and ICT as standalone methods to demonstrate that semi-GAN with consistency regularization produces better results. For a fair comparison, we conducted the experiments under the same network architecture. We used the source code of the original ICT method (Verma et al. 2019) for the ICT experiment and the source code of the LC+MT method (Chen et al. 2020) for the MT experiment, since they reported better MT performance than the original MT method. For both methods, we ran each experiment 5 times with our network architecture while keeping all the other hyper-parameters unchanged. Table 2 shows the comparison results. From the results, we can observe that semi-GAN and consistency regularization are complementary and achieve better performance when combined.


Models	Error rate (%)
	CIFAR-10	CIFAR-10
	$n_{l}=1,000$	$n_{l}=4,000$
MT	28.99 $\pm$ 2.72	12.29 $\pm$ 0.20
ICT	30.18 $\pm$ 1.90	17.73 $\pm$ 0.73
semi-GAN + CC	14.36 $\pm$ 0.35	11.03 $\pm$ 0.42

Table 2: Performance comparison with MT and ICT. The experiments are conducted over 5 runs and percent error rate is used as the evaluation criterion. “CC” is short for our proposed composite consistency.

Effect of imposing consistency at different positions of the discriminator: Although consistency has always been imposed at output space in consistency-based approaches (Laine and Aila 2016; Tarvainen and Valpola 2017; Park et al. 2018; Miyato et al. 2018), it could also be imposed at feature space to help the model learn high-level features invariant to diverse perturbations. Therefore, in this study, we choose to impose consistency with three different settings: 1) on the output layer of the discriminator for prediction consistency; 2) on the intermediate layer of the discriminator (the layer right before FC + softmax as shown in Figure 2) for feature consistency; 3) on both the output layer and the intermediate layer of the discriminator for prediction and feature consistencies. When imposing feature consistency, we perform hyper-parameter search for its consistency weight over the values in {0.01, 0.1, 1.0, 10, 100} and report the results with the optimal hyper-parameter value. We conducted experiments on CIFAR-10 dataset with 1,000 and 4,000 labeled images, respectively. From Table 3, we can observe that incorporating consistency in both output space and feature space yields the best performance among the three, implying both feature consistency and prediction consistency can benefit the semi-supervised learning task.


Consistency type	Error rate (%)
	CIFAR-10	CIFAR-10
	$n_{l}=1,000$	$n_{l}=4,000$
Prediction	14.36 $\pm$ 0.35	11.03 $\pm$ 0.42
Feature	16.78 $\pm$ 0.87	13.19 $\pm$ 0.50
Prediction & Feature	14.14 $\pm$ 0.23	10.69 $\pm$ 0.49

Table 3: Effects of imposing consistency at different positions of the discriminator. The experiments are conducted using semi-GAN with composite consistency over 5 runs.

Models	Error rate (%) on CIFAR-10
	$n_{l}=1,000$	$n_{l}=2,000$	$n_{l}=4,000$
semi-GAN (Salimans et al. 2016)	21.83 $\pm$ 2.01	19.61 $\pm$ 2.09	18.63 $\pm$ 2.32
Bad GAN (Dai et al. 2017)	-	-	14.41 $\pm$ 0.30
CLS-GAN (Qi 2019)	-	-	17.30 $\pm$ 0.50
Triple-GAN (Chongxuan et al. 2017)	-	-	16.99 $\pm$ 0.36
Local GAN (Qi et al. 2018)	17.44 $\pm$ 0.25	-	14.23 $\pm$ 0.27
ALI (Dumoulin et al. 2016)	19.98 $\pm$ 0.89	19.09 $\pm$ 0.44	17.99 $\pm$ 1.62
Manifold Regularization (Lecouat et al. 2018)	16.37 $\pm$ 0.42	15.25 $\pm$ 0.35	14.34 $\pm$ 0.17
semi-GAN*	17.27 $\pm$ 0.83	15.36 $\pm$ 0.74	14.12 $\pm$ 0.29
semi-GAN + CC (ours)	14.14 $\pm$ 0.23	12.11 $\pm$ 0.46	10.69 $\pm$ 0.49

Table 4: Percent error rate comparison with GAN-based approaches on CIFAR-10 over 5 runs. “*” indicates our re-implementation of the method. “CC” is short for our proposed composite consistency.

Models	Error rate (%) on SVHN
	$n_{l}=500$	$n_{l}=1,000$
semi-GAN (Salimans et al. 2016)	18.44 $\pm$ 4.80	8.11 $\pm$ 1.30
Bad GAN (Dai et al. 2017)	-	7.42 $\pm$ 0.65
CLS-GAN (Qi 2019)	-	5.98 $\pm$ 0.27
Triple-GAN (Chongxuan et al. 2017)	-	5.77 $\pm$ 0.17
Local GAN (Qi et al. 2018)	5.48 $\pm$ 0.29	4.73 $\pm$ 0.29
ALI (Dumoulin et al. 2016)	-	7.41 $\pm$ 0.65
Manifold Regularization (Lecouat et al. 2018)	5.67 $\pm$ 0.11	4.63 $\pm$ 0.11
semi-GAN*	6.66 $\pm$ 0.58	5.36 $\pm$ 0.31
semi-GAN + CC (ours)	3.79 $\pm$ 0.23	3.64 $\pm$ 0.08

Table 5: Percent error rate comparison with GAN-based approaches on SVHN over 5 runs. “*” indicates our re-implementation of the method. “CC” is short for our proposed composite consistency.

Results and Visualization

Following the standard evaluation criteria used in the GAN-based approaches (Salimans et al. 2016; Dumoulin et al. 2016; Chongxuan et al. 2017; Qi et al. 2018), we trained these models on SVHN training data with 500 and 1,000 randomly selected labeled images respectively and evaluated the model classification performance on the corresponding test dataset. For CIFAR-10, we trained the models on training data with 1,000, 2,000, and 4,000 randomly selected labeled images and then evaluated them on test data. The results are provided in Tables 4 and 5. To provide enough evidence for the comparison, we only include methods that have performance reported in multiple settings in our comparison tables. For both datasets, semi-GAN with composite consistency outperforms vanilla semi-GAN by a large margin and sets new state-of-the-art performance among GAN-based SSL approaches.

Note that we could not perform a direct comparison between our approach and non-GAN-based SSL approaches due to the differences in network architecture. However, as a sanity check, we experimented with the CNN-13 architecture adopted in the recent consistency-based SSL approaches (Laine and Aila 2016; Tarvainen and Valpola 2017; Miyato et al. 2018; Verma et al. 2019) as our discriminator, but encountered mode collapse (Goodfellow et al. 2014) during training in multiple trials. We suspect that this is due to the discriminator being easily dominated by the generator in this setting.

We also produced visualizations (see Figure 5) with the learned feature embeddings of semi-GAN model and semi-GAN + CC on both CIFAR-10 and SVHN test datasets using t-SNE (Maaten and Hinton 2008). We trained models on CIFAR-10 with 4,000 labeled images and SVHN with 1,000 labeled images respectively, and projected the feature embeddings ( $\mathbf{f}(\mathbf{x})\in\mathbb{R}^{128}$ ) into 2-D space using t-SNE, where the feature embeddings are obtained from the layer right before final FC + softmax layer. From the figure, observe that the feature embeddings of our semi-GAN + CC model are more concentrated within each class and the classes are more separable in both CIFAR-10 and SVHN test datasets, while they are more mixed in the semi-GAN model. This visualization further validates our hypothesis that the composite consistency regularization in semi-GAN improves the classification performance.

Related Work

Having already discussed consistency-based approaches, we only focus on reviewing the most relevant GAN-based SSL approaches and other categories of deep SSL approaches.

GAN-based SSL approaches: Following semi-GAN (Salimans et al. 2016), Qi et al. (2018) propose Local-GAN to improve the robustness of the discriminator on locally noisy samples, which are generated by a local generator at the neighborhood of real samples on a real data manifold. Instead, our approach attempts to improve the robustness of the discriminator from the perspective of consistency directly on real samples. Likewise, Dai et al. (2017) have proposed a complement generator. They show both theoretically and empirically that a preferred generator should generate complementary samples in low-density regions of the feature space, so that real samples are pushed to separable high-density regions and hence the discriminator can learn to correct class decision boundaries. Based on information theory principles, CatGAN (Springenberg 2016) adapts the real/fake adversary formulation of the standard GAN to the adversary on the level of confidence in class predictions, where the discriminator is encouraged to predict real samples into one of the $K$ classes with high confidence and to predict fake samples into all of the $K$ classes with low confidence, and the generator is designated to perform in the opposite. Similarly, the CLS-GAN (Qi 2019) designs a new loss function for the discriminator with the assumption that the prediction error of real samples should always be smaller than that of fake ones by a desired margin, and further regularizes this loss with Lipschitz continuity on the density of real samples. Apart from them, Chongxuan et al. (2017) design a Triple-GAN consisting of three networks, including a discriminator, a classifier, and a generator. Here, the discriminator is responsible for distinguishing real image-label pairs from fake ones, which are generated by either the classifier or the generator using conditional generation. 
Most of these methods attempt to improve the classification performance from the perspective of better separating real/fake samples, whereas our approach validates that improving the ability of the discriminator in itself with consistency is critical.

Other deep SSL categories: Variational Auto-Encoders (VAEs) have also been explored in the deep generative models (DGMs) domain. VAE-based SSL approaches (Kingma et al. 2014; Rezende and Mohamed 2015) treat class label as an additional latent variable and learn data distribution by optimizing the lower bound of data likelihood using a stochastic variational inference mechanism. Aside from DGMs, graph-based approaches (Atwood and Towsley 2016; Kipf and Welling 2017) have also been developed with deep neural networks, which smooth the label information on a pre-constructed similarity graph using variants of label propagation mechanisms (Bengio, Delalleau, and Le Roux 2006). Differing from graph-based approaches, deep clustering approaches (Haeusser, Mordvintsev, and Cremers 2017; Kamnitsas et al. 2018) build the graph directly in feature space instead of obtaining a pre-constructed graph from input space and perform clustering on the graph guided by partial labeled information. Furthermore, some recent advances (Wang, Li, and Gool 2019; Berthelot et al. 2019) focus on the idea of distribution alignment, attempting to reduce the empirical distribution mismatch between labeled and unlabeled data caused by sampling bias.

Conclusions

In this work, we identified an important limitation of semi-GAN and extended it with a consistency regularizer. In particular, we developed a simple but effective composite consistency regularizer and integrated it with the semi-GAN approach. This composite consistency measure is resilient to both local perturbations and interpolation perturbations. Our experiments and ablation studies showed the effectiveness of semi-GAN with composite consistency on the two benchmark datasets SVHN and CIFAR-10, where it consistently produced lower error rates than the other GAN-based SSL approaches.

Since composite consistency with semi-GAN has proved effective on real images, we plan to study the effect of enforcing composite consistency on generated images from the generator as well in future work. Although we applied standard data augmentations to the input images in this work, we are interested in exploring stronger, more recent forms of data augmentation (e.g., AutoAugment (Cubuk et al. 2019), RandAugment (Cubuk et al. 2020)).

References

Atwood and Towsley (2016) Atwood, J.; and Towsley, D. 2016. Diffusion-convolutional neural networks. In Advances in neural information processing systems, 1993–2001.
Bengio, Delalleau, and Le Roux (2006) Bengio, Y.; Delalleau, O.; and Le Roux, N. 2006. Label propagation and quadratic criterion. In Semi-Supervised Learning, chapter 11.
Berthelot et al. (2019) Berthelot, D.; Carlini, N.; Cubuk, E. D.; Kurakin, A.; Sohn, K.; Zhang, H.; and Raffel, C. 2019. ReMixMatch: Semi-Supervised Learning with Distribution Alignment and Augmentation Anchoring. arXiv preprint arXiv:1911.09785.
Chapelle, Scholkopf, and Zien (2009) Chapelle, O.; Scholkopf, B.; and Zien, A. 2009. Semi-Supervised Learning (Chapelle, O. et al., eds.; 2006) [Book reviews]. IEEE Transactions on Neural Networks 20(3): 542–542.
Chen et al. (2020) Chen, Z.; Dutton, B.; Ramachandra, B.; Wu, T.; and Vatsavai, R. R. 2020. Local Clustering with Mean Teacher for Semi-supervised Learning. arXiv preprint arXiv:2004.09665.
Chongxuan et al. (2017) Chongxuan, L.; Xu, T.; Zhu, J.; and Zhang, B. 2017. Triple generative adversarial nets. In Advances in neural information processing systems, 4088–4098.
Cubuk et al. (2019) Cubuk, E. D.; Zoph, B.; Mane, D.; Vasudevan, V.; and Le, Q. V. 2019. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, 113–123.
Cubuk et al. (2020) Cubuk, E. D.; Zoph, B.; Shlens, J.; and Le, Q. V. 2020. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 702–703.
Dai et al. (2017) Dai, Z.; Yang, Z.; Yang, F.; Cohen, W. W.; and Salakhutdinov, R. R. 2017. Good semi-supervised learning that requires a bad gan. In Advances in neural information processing systems, 6510–6520.
Deng et al. (2009) Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, 248–255. IEEE.
Dumoulin et al. (2016) Dumoulin, V.; Belghazi, I.; Poole, B.; Mastropietro, O.; Lamb, A.; Arjovsky, M.; and Courville, A. 2016. Adversarially learned inference. arXiv preprint arXiv:1606.00704.
Goodfellow et al. (2014) Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
Haeusser, Mordvintsev, and Cremers (2017) Haeusser, P.; Mordvintsev, A.; and Cremers, D. 2017. Learning by Association–A Versatile Semi-Supervised Training Method for Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 89–98.
He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, 770–778.
Huang et al. (2017) Huang, G.; Liu, Z.; Van Der Maaten, L.; and Weinberger, K. Q. 2017. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4700–4708.
Kamnitsas et al. (2018) Kamnitsas, K.; Castro, D. C.; Folgoc, L. L.; Walker, I.; Tanno, R.; Rueckert, D.; Glocker, B.; Criminisi, A.; and Nori, A. 2018. Semi-supervised learning via compact latent space clustering. arXiv preprint arXiv:1806.02679.
Kingma et al. (2014) Kingma, D. P.; Mohamed, S.; Rezende, D. J.; and Welling, M. 2014. Semi-supervised learning with deep generative models. In Advances in neural information processing systems, 3581–3589.
Kingma and Welling (2013) Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Kipf and Welling (2017) Kipf, T. N.; and Welling, M. 2017. Semi-Supervised Classification with Graph Convolutional Networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net. URL https://openreview.net/forum?id=SJU4ayYgl.
Laine and Aila (2016) Laine, S.; and Aila, T. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
Lecouat et al. (2018) Lecouat, B.; Foo, C.-S.; Zenati, H.; and Chandrasekhar, V. 2018. Manifold regularization with gans for semi-supervised learning. arXiv preprint arXiv:1807.04307.
Maaten and Hinton (2008) Maaten, L. v. d.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of machine learning research 9(Nov): 2579–2605.
Miyato et al. (2018) Miyato, T.; Maeda, S.-i.; Ishii, S.; and Koyama, M. 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence.
Park et al. (2018) Park, S.; Park, J.; Shin, S.-J.; and Moon, I.-C. 2018. Adversarial dropout for supervised and semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence.
Qi (2019) Qi, G.-J. 2019. Loss-sensitive generative adversarial networks on lipschitz densities. International Journal of Computer Vision 1–23.
Qi et al. (2018) Qi, G.-J.; Zhang, L.; Hu, H.; Edraki, M.; Wang, J.; and Hua, X.-S. 2018. Global versus localized generative adversarial nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1517–1525.
Rasmus et al. (2015) Rasmus, A.; Berglund, M.; Honkala, M.; Valpola, H.; and Raiko, T. 2015. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems, 3546–3554.
Rezende and Mohamed (2015) Rezende, D. J.; and Mohamed, S. 2015. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770.
Salimans et al. (2016) Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; and Chen, X. 2016. Improved techniques for training gans. In Advances in neural information processing systems, 2234–2242.
Simonyan and Zisserman (2014) Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Springenberg (2016) Springenberg, J. T. 2016. Unsupervised and semi-supervised learning with categorical generative adversarial networks. stat 1050: 30.
Tarvainen and Valpola (2017) Tarvainen, A.; and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in neural information processing systems, 1195–1204.
Valpola (2015) Valpola, H. 2015. From neural PCA to deep unsupervised learning. In Advances in Independent Component Analysis and Learning Machines, 143–171. Elsevier.
Verma et al. (2019) Verma, V.; Lamb, A.; Kannala, J.; Bengio, Y.; and Lopez-Paz, D. 2019. Interpolation consistency training for semi-supervised learning. arXiv preprint arXiv:1903.03825.
Wang, Li, and Gool (2019) Wang, Q.; Li, W.; and Gool, L. V. 2019. Semi-supervised learning by augmented distribution alignment. In Proceedings of the IEEE International Conference on Computer Vision, 1466–1475.
Wei et al. (2018) Wei, X.; Gong, B.; Liu, Z.; Lu, W.; and Wang, L. 2018. Improving the improved training of wasserstein gans: A consistency term and its dual effect. arXiv preprint arXiv:1803.01541.
Zhang et al. (2018) Zhang, H.; Cisse, M.; Dauphin, Y. N.; and Lopez-Paz, D. 2018. mixup: Beyond Empirical Risk Minimization. In International Conference on Learning Representations. URL https://openreview.net/forum?id=r1Ddp1-Rb.

Appendix

Network Architectures

Table 6 and Table 7 present the network architectures used in all our experiments. They are identical to the network architectures used in CT-GAN (Wei et al. 2018), except that we remove the first dropout layer of the discriminator, as we find it slightly degrades classification performance.

Discriminator $\mathbf{D}$
Input: 32 $\times$ 32 RGB image
3 $\times$ 3 conv, 128, Pad=1, Stride=1, WeightNorm, lReLU( $0.2$ )
3 $\times$ 3 conv, 128, Pad=1, Stride=1, WeightNorm, lReLU( $0.2$ )
3 $\times$ 3 conv, 128, Pad=1, Stride=2, WeightNorm, lReLU( $0.2$ )
Dropout: $p=0.5$
3 $\times$ 3 conv, 256, Pad=1, Stride=1, WeightNorm, lReLU( $0.2$ )
3 $\times$ 3 conv, 256, Pad=1, Stride=1, WeightNorm, lReLU( $0.2$ )
3 $\times$ 3 conv, 256, Pad=1, Stride=2, WeightNorm, lReLU( $0.2$ )
Dropout: $p=0.5$
3 $\times$ 3 conv, 512, Pad=0, Stride=1, WeightNorm, lReLU( $0.2$ )
1 $\times$ 1 conv, 256, Pad=0, Stride=1, WeightNorm, lReLU( $0.2$ )
1 $\times$ 1 conv, 128, Pad=0, Stride=1, WeightNorm, lReLU( $0.2$ )
Global AveragePool
MLP 10, WeightNorm, Softmax

Table 6: The discriminator network architecture used in our experiments.
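
As a sanity check on Table 6, the spatial size of the feature map after each convolution can be traced by hand. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes the common floor convention for convolution output size, out = floor((in + 2·Pad − kernel) / Stride) + 1, and the names `conv_out`, `DISCRIMINATOR_CONVS`, and `trace_discriminator` are hypothetical helpers introduced only for illustration.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a square convolution (floor convention, assumed)."""
    return (size + 2 * pad - kernel) // stride + 1


# (kernel, out_channels, pad, stride) for each conv row of Table 6, in order.
# Dropout layers are omitted because they do not change tensor shapes.
DISCRIMINATOR_CONVS = [
    (3, 128, 1, 1), (3, 128, 1, 1), (3, 128, 1, 2),
    (3, 256, 1, 1), (3, 256, 1, 1), (3, 256, 1, 2),
    (3, 512, 0, 1), (1, 256, 0, 1), (1, 128, 0, 1),
]


def trace_discriminator(size=32, channels=3):
    """Return the (channels, height, width) shape after each conv layer."""
    shapes = [(channels, size, size)]
    for kernel, out_channels, pad, stride in DISCRIMINATOR_CONVS:
        size = conv_out(size, kernel, stride, pad)
        shapes.append((out_channels, size, size))
    return shapes


shapes = trace_discriminator()
# Global average pooling collapses height and width, leaving one value per channel.
feature_dim = shapes[-1][0]

for shape in shapes:
    print(shape)
print("features fed to the final MLP:", feature_dim)
```

Under these assumptions, the two stride-2 convolutions reduce the 32 $\times$ 32 input to 16 $\times$ 16 and then 8 $\times$ 8, the unpadded 3 $\times$ 3 convolution yields 6 $\times$ 6, and global average pooling leaves a 128-dimensional feature vector for the final 10-way layer.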

Generator $\mathbf{G}$
Input: $\mathbf{z}\sim\mathbf{U}(0,1)$, 100 dimensions
MLP 8192, BatchNorm, ReLU
Reshape to 512 $\times$ 4 $\times$ 4
5 $\times$ 5 deconv, 256, InputPad=2, Stride=2, OutputPad=1, BatchNorm, ReLU
5 $\times$ 5 deconv, 128, InputPad=2, Stride=2, OutputPad=1, BatchNorm, ReLU
5 $\times$ 5 deconv, 3, InputPad=2, Stride=2, OutputPad=1, WeightNorm, Tanh

Table 7: The generator network architecture used in our experiments.
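
The layer sizes in Table 7 can be checked the same way. The sketch below is illustrative only and assumes the usual transposed-convolution size formula, out = (in − 1)·Stride − 2·InputPad + kernel + OutputPad. The names `deconv_out`, `GENERATOR_DECONVS`, and `trace_generator` are hypothetical helpers, not part of any released code.

```python
def deconv_out(size, kernel, stride, pad, out_pad):
    """Spatial output size of a square transposed convolution (assumed formula)."""
    return (size - 1) * stride - 2 * pad + kernel + out_pad


# (kernel, out_channels, input_pad, stride, output_pad) for each deconv row of Table 7.
GENERATOR_DECONVS = [
    (5, 256, 2, 2, 1),
    (5, 128, 2, 2, 1),
    (5, 3, 2, 2, 1),
]


def trace_generator(mlp_units=8192, channels=512, size=4):
    """Return the (channels, height, width) shape after the reshape and each deconv."""
    # The MLP output must reshape exactly into channels x size x size.
    if mlp_units != channels * size * size:
        raise ValueError("MLP width does not match the reshape target")
    shapes = [(channels, size, size)]
    for kernel, out_channels, pad, stride, out_pad in GENERATOR_DECONVS:
        size = deconv_out(size, kernel, stride, pad, out_pad)
        shapes.append((out_channels, size, size))
    return shapes


gen_shapes = trace_generator()
for shape in gen_shapes:
    print(shape)
```

With these settings each transposed convolution doubles the spatial resolution (4, 8, 16, 32), so the generator output is a 3 $\times$ 32 $\times$ 32 image that matches the discriminator input in Table 6.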