Learning Consistent Deep Generative Models from Sparse Data via Prediction Constraints
Abstract
We develop a new framework for learning variational autoencoders and other deep generative models that balances generative and discriminative goals. Our framework optimizes model parameters to maximize a variational lower bound on the likelihood of observed data, subject to a task-specific prediction constraint that prevents model misspecification from leading to inaccurate predictions. We further enforce a consistency constraint, derived naturally from the generative model, that requires predictions on reconstructed data to match those on the original data. We show that these two contributions – prediction constraints and consistency constraints – lead to promising image classification performance, especially in the semi-supervised scenario where category labels are sparse but unlabeled data is plentiful. Our approach enables advances in generative modeling to directly boost semi-supervised classification performance, an ability we demonstrate by augmenting deep generative models with latent variables capturing spatial transformations.
1 Introduction
We develop broadly applicable methods for learning flexible models of high-dimensional data, like images, that are paired with (discrete or continuous) labels. We are particularly interested in semi-supervised learning (Zhu, 2005, Oliver et al., 2018) from data that is sparsely labeled, a common situation in practice due to the cost or privacy concerns associated with data annotation. Given a large and sparsely labeled dataset, we seek a single probabilistic model that simultaneously makes good predictions of labels and provides a high-quality generative model of the high-dimensional input data. Strong generative models are valuable because they can allow incorporation of domain knowledge, can address partially missing or corrupted data, and can be visualized to improve interpretability.
Prior approaches for the semi-supervised learning of deep generative models include methods based on variational autoencoders (VAEs) (Kingma et al., 2014, Siddharth et al., 2017), generative adversarial networks (GANs) (Dumoulin et al., 2017, Kumar et al., 2017), and hybrids of the two (Larsen et al., 2016, de Bem et al., 2018, Zhang et al., 2019). While these all allow sampling of data, a major shortcoming of these approaches is that they do not adequately use labels to inform the generative model. Furthermore, GAN-based approaches lack the ability to evaluate the learned probability density function, which can be important for tasks such as model selection and anomaly detection.
This paper develops a framework for training prediction constrained variational autoencoders (PC-VAEs) that minimize application-motivated loss functions in the prediction of labels, while simultaneously learning high-quality generative models of the raw data. Our approach is inspired by the prediction-constrained framework recently proposed for learning supervised topic models of “bag of words” count data (Hughes et al., 2018), but differs in four major ways. First, we develop scalable algorithms for learning a much larger and richer family of deep generative models. Second, we capture uncertainty in latent variables rather than simply using point estimates. Third, we allow more flexible specification of loss functions. Finally, we show that the generative model structure leads to a natural consistency constraint vital for semi-supervised learning from very sparse labels.
Our experiments demonstrate that consistent prediction-constrained (CPC) VAE training leads to prediction performance competitive with state-of-the-art discriminative methods on fully-labeled datasets, and excels over these baselines when given semi-supervised datasets where labels are rare.
Figure 1: Test accuracy on the half-moon binary classification task for several SSL methods (panel visualizations omitted). Top row: 6 labeled examples; bottom row: 100 labeled examples. Unless marked "(14)", models use a 2-dimensional latent space; "(14)" indicates a 14-dimensional latent space.

| Labels | VAE-then-MLP | PC-VAE | CPC-VAE | M2 | M2 (14) | CPC-VAE (14) |
|---|---|---|---|---|---|---|
| 6 | 77.9% | 78.1% | 98.4% | 98.1% | 80.6% | 98.5% |
| 100 | 83.8% | 98.2% | 98.4% | 98.1% | 96.4% | 98.1% |
2 Background: Deep Generative Models and Semi-supervision
We now describe VAEs as deep generative models and review previous methods for semi-supervised learning (SSL) of VAEs, highlighting weaknesses that we later improve upon. We assume all SSL tasks provide two training datasets: an unsupervised (or unlabeled) dataset of feature vectors $\mathcal{D}^U = \{x_n\}$, and a supervised (or labeled) dataset $\mathcal{D}^S = \{(x_n, y_n)\}$ containing pairs of features $x_n$ and labels $y_n$. Labels are often sparse ($|\mathcal{D}^S| \ll |\mathcal{D}^U|$) and can be discrete or continuous.
2.1 Unsupervised Generative Modeling with the VAE
The variational autoencoder (Kingma & Welling, 2014) is an unsupervised model with two components: a generative model and an inference model. The generative model defines for each example $n$ a joint distribution $p_\theta(x_n, z_n)$ over "features" (observed vector $x_n$) and "encodings" (hidden vector $z_n$). The "inference model" of the VAE defines an approximate posterior $q_\phi(z_n \mid x_n)$, which is trained to be close to the true posterior $p_\theta(z_n \mid x_n)$ but much easier to evaluate. As in Kingma & Welling (2014), we assume the following conditional independence structure:
$$p_\theta(x_n, z_n) = p(z_n)\, p_\theta(x_n \mid z_n), \quad p(z_n) = \mathcal{N}(z_n \mid 0, I), \quad p_\theta(x_n \mid z_n) = \mathcal{N}\big(x_n \mid \mu_\theta(z_n), \Sigma_\theta(z_n)\big), \quad q_\phi(z_n \mid x_n) = \mathcal{N}\big(z_n \mid \mu_\phi(x_n), \Sigma_\phi(x_n)\big). \qquad (1)$$
The likelihood $p_\theta(x_n \mid z_n)$ is often multivariate normal, but other distributions may give robustness to outliers. The (deterministic) functions $\mu_\theta(\cdot)$ and $\Sigma_\theta(\cdot)$, with trainable parameters $\theta$, define the mean and covariance of the likelihood. Given any observation $x_n$, the posterior of $z_n$ is approximated as normal with mean $\mu_\phi(x_n)$ and (diagonal) covariance $\Sigma_\phi(x_n)$ parameterized by $\phi$. These functions can be represented as multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), or other (deep) neural networks.
We would ideally learn generative parameters $\theta$ by maximizing the marginal likelihood of the features $x_n$, integrating over the latent variable $z_n$. Since this is intractable, we instead maximize a variational lower bound:
$$\log p_\theta(x_n) \;\geq\; \mathcal{L}(x_n; \theta, \phi) \;=\; \mathbb{E}_{q_\phi(z_n \mid x_n)}\big[\log p_\theta(x_n \mid z_n) + \log p(z_n) - \log q_\phi(z_n \mid x_n)\big]. \qquad (2)$$
This expectation can be evaluated via Monte Carlo samples from the inference model $q_\phi(z_n \mid x_n)$. Gradients with respect to $\phi$ can be similarly estimated via the reparameterization "trick" of representing $z_n$ as a linear transformation of standard normal variables (Kingma & Welling, 2014).
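To make Eq. (2) and the reparameterization trick concrete, the following is a minimal PyTorch sketch (not the authors' released code) assuming fully connected encoder and decoder networks, a diagonal-Gaussian inference model, and a unit-variance Gaussian likelihood; the single-sample Monte Carlo estimate of the bound is differentiable in both $\theta$ and $\phi$.

```python
# Minimal sketch, assuming a diagonal-Gaussian q_phi(z|x) and a unit-variance
# Gaussian p_theta(x|z); not the paper's implementation.
import torch
import torch.nn as nn


class ToyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=50, hidden=1000):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.Softplus())
        self.enc_mu = nn.Linear(hidden, z_dim)
        self.enc_logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.Softplus(),
                                 nn.Linear(hidden, x_dim))

    def encode(self, x):
        h = self.enc(x)
        return self.enc_mu(h), self.enc_logvar(h)

    def elbo(self, x):
        mu, logvar = self.encode(x)
        eps = torch.randn_like(mu)                       # reparameterization: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * eps
        x_mean = self.dec(z)
        log_px = -0.5 * ((x - x_mean) ** 2).sum(dim=1)   # Gaussian log-likelihood (up to a constant)
        kl = 0.5 * (torch.exp(logvar) + mu ** 2 - 1.0 - logvar).sum(dim=1)  # KL(q || N(0, I))
        return (log_px - kl).mean()                      # single-sample Monte Carlo bound


vae = ToyVAE()
x = torch.rand(8, 784)
(-vae.elbo(x)).backward()  # maximize the bound by minimizing its negation
```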
Throughout this paper, we denote variational parameters by $\phi$. Because the factorization of $q_\phi$ changes for more complex models, we write $\phi^a$ to denote the parameters specific to a factor $q_{\phi^a}(a \mid \cdot)$ over variable $a$.
2.2 Two-Stage SSL: Maximize Feature Likelihood then Train Predictor
One way to employ the VAE for a semi-supervised task is a two-stage "VAE-then-MLP" procedure. First, train a VAE to maximize the unsupervised likelihood bound (2) of all observed features (both labeled and unlabeled). Second, define a label-from-code predictor $\hat{y}_w(z_n)$ that maps each learned code representation $z_n$ to a predicted label; we use an MLP with weights $w$, though any predictor could do. Let $\text{loss}(y_n, \hat{y}_n)$ be a loss function, such as cross-entropy, appropriate for the prediction task. We train the predictor to minimize the total loss $\sum_{n \in \mathcal{D}^S} \text{loss}\big(y_n, \hat{y}_w(z_n)\big)$, with codes $z_n$ drawn from the first-stage inference model. Importantly, this second stage uses only the small labeled dataset and relies on fixed parameters $\theta, \phi$ from stage one.
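For concreteness, a sketch of this two-stage baseline follows, reusing the ToyVAE sketch above; the training loops, optimizer settings, and helper names are illustrative rather than the authors' exact protocol.

```python
# Two-stage "VAE-then-MLP" sketch: stage 1 fits the VAE on all features; stage 2 fits a
# label-from-code classifier on frozen posterior means using only the labeled subset.
import torch
import torch.nn as nn
import torch.nn.functional as F


def train_two_stage(vae, x_all, x_lab, y_lab, n_classes=10, steps=200):
    opt = torch.optim.Adam(vae.parameters(), lr=1e-3)
    for _ in range(steps):                      # stage 1: unsupervised VAE training
        opt.zero_grad()
        (-vae.elbo(x_all)).backward()
        opt.step()
    with torch.no_grad():
        codes, _ = vae.encode(x_lab)            # frozen encoder; use posterior means as codes
    clf = nn.Linear(codes.shape[1], n_classes)  # stage 2: label-from-code predictor
    opt2 = torch.optim.Adam(clf.parameters(), lr=1e-3)
    for _ in range(steps):
        opt2.zero_grad()
        F.cross_entropy(clf(codes), y_lab).backward()
        opt2.step()
    return clf
```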
2.3 Semi-supervised VAEs: Maximize Joint Likelihood of Labels and Features
To overcome the weaknesses of the two-stage approach, previous work by Kingma et al. (2014) presented a VAE-inspired model called "M2" focused on the joint generative modeling of labels $y_n$ and data $x_n$. M2 has two components: a generative model $p_\theta(x_n, y_n, z_n)$ and an inference model $q_\phi(y_n, z_n \mid x_n)$. Their generative model is factorized to sample labels $y_n$ (with class frequencies $\pi$) first, and then features $x_n$:
$$p_\theta(x_n, y_n, z_n) = p(y_n \mid \pi)\, p(z_n)\, p_\theta(x_n \mid y_n, z_n), \quad p(y_n \mid \pi) = \text{Cat}(y_n \mid \pi), \quad p(z_n) = \mathcal{N}(z_n \mid 0, I). \qquad (3)$$
The M2 inference model sets $q_\phi(y_n, z_n \mid x_n) = q_\phi(y_n \mid x_n)\, q_\phi(z_n \mid x_n, y_n)$, where $q_\phi(y_n \mid x_n) = \text{Cat}\big(y_n \mid \pi_\phi(x_n)\big)$ and $q_\phi(z_n \mid x_n, y_n) = \mathcal{N}\big(z_n \mid \mu_\phi(x_n, y_n), \Sigma_\phi(x_n, y_n)\big)$.
To train M2, Kingma et al. (2014) maximize the likelihood of all observations (labels and features):
$$\max_{\theta, \phi} \;\; \sum_{(x_n, y_n) \in \mathcal{D}^S} \mathcal{L}^{\text{sup}}(x_n, y_n; \theta, \phi) \;+\; \sum_{x_n \in \mathcal{D}^U} \mathcal{L}^{\text{unsup}}(x_n; \theta, \phi). \qquad (4)$$
The first, “supervised” term in Eq. (4) is a variational bound for the feature-and-label joint likelihood:
$$\log p_\theta(x_n, y_n) \;\geq\; \mathcal{L}^{\text{sup}}(x_n, y_n; \theta, \phi) \;=\; \mathbb{E}_{q_\phi(z_n \mid x_n, y_n)}\big[\log p_\theta(x_n \mid y_n, z_n) + \log p(y_n \mid \pi) + \log p(z_n) - \log q_\phi(z_n \mid x_n, y_n)\big]. \qquad (5)$$
The second, "unsupervised" term is a variational lower bound for the features-only likelihood $p_\theta(x_n)$, where $\mathcal{L}^{\text{unsup}}$ can be simply expressed in terms of $\mathcal{L}^{\text{sup}}$:
$$\log p_\theta(x_n) \;\geq\; \mathcal{L}^{\text{unsup}}(x_n; \theta, \phi) \;=\; \sum_{y} q_\phi(y \mid x_n)\, \mathcal{L}^{\text{sup}}(x_n, y; \theta, \phi) \;+\; \mathbb{H}\big[q_\phi(y \mid x_n)\big]. \qquad (6)$$
As with the unsupervised VAE, both terms in the objective can be computed via Monte Carlo sampling from the variational posterior, and gradients can be estimated via the reparameterization trick.
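To make the cost of Eq. (6) concrete, the sketch below (with assumed helper signatures, not the code of Kingma et al.) evaluates the unlabeled bound by looping over every possible class label; this per-class loop is the source of M2's extra runtime discussed next.

```python
# Each unlabeled example requires one supervised-bound evaluation per class.
import torch


def m2_unlabeled_bound(x, q_y_given_x, supervised_bound, n_classes):
    """q_y_given_x(x): class probabilities, shape (batch, C).
    supervised_bound(x, c): Monte Carlo estimate of Eq. (5) for class id c, shape (batch,)."""
    probs = q_y_given_x(x)                                    # (batch, C)
    per_class = torch.stack([supervised_bound(x, c) for c in range(n_classes)], dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)   # H[q(y|x)]
    return (probs * per_class).sum(dim=1) + entropy           # U(x) per example
```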
M2’s prediction dilemma and heuristic fix.
After training parameters $\theta, \phi$, we need to predict labels given test data $x_*$. M2's structure assumes we make predictions via the inference model's discriminator density $q_\phi(y_* \mid x_*)$. However, the discriminator's parameters $\phi^y$ are only informed by the unlabeled data when using the objective above: they are not used to compute the supervised term of Eq. (5). We cannot expect accurate predictions from parameters that never touch any labeled examples in the training set.
To partially overcome this issue, Kingma et al. (2014) and later work use a weighted objective:
$$\max_{\theta, \phi} \;\; \alpha \sum_{(x_n, y_n) \in \mathcal{D}^S} \log q_\phi(y_n \mid x_n) \;+\; \sum_{(x_n, y_n) \in \mathcal{D}^S} \mathcal{L}^{\text{sup}}(x_n, y_n; \theta, \phi) \;+\; \lambda \sum_{x_n \in \mathcal{D}^U} \mathcal{L}^{\text{unsup}}(x_n; \theta, \phi). \qquad (7)$$
This objective biases the inference model's discriminator to do well on the labeled set via an extra loss term, $\log q_\phi(y_n \mid x_n)$, weighted by a hyperparameter $\alpha$. We can further include a weight $\lambda$ to balance the supervised and unsupervised terms. Originally, Kingma et al. (2014) fixed one weight and tuned $\alpha$ to achieve good performance. Later, Siddharth et al. (2017) further tuned these weights to improve performance. Maaløe et al. (2016) used this same term for labeled data to train VAEs with auxiliary variables.
Disadvantage: What Justification? While the $\mathcal{L}^{\text{sup}}$ and $\mathcal{L}^{\text{unsup}}$ terms in Eq. (7) have a rigorous justification as maximizing the data likelihood under the assumed generative model, the first term, $\alpha \sum_n \log q_\phi(y_n \mid x_n)$, is not justified by the generative or inference model. In particular, suppose the training data were fully labeled: we would drop the $\mathcal{L}^{\text{unsup}}$ terms altogether, and the remaining terms would decouple the generative parameters $\theta, \phi^z$ from the discriminator parameters $\phi^y$. This is deeply unsatisfying: we want a single model guided by both generative and discriminative goals, not two separate models. Even in partially-labeled scenarios, including this term does not adequately balance generative and discriminative goals, as we demonstrate in later examples. An overly flexible yet misspecified generative model may go astray and compromise predictions.
Disadvantage: Runtime Cost. Another disadvantage is that the computation of $\mathcal{L}^{\text{unsup}}$ in Eq. (6) is expensive. If labels are discrete, computing this sum exactly is possible but requires a sum over all possible class labels, computing a Monte Carlo estimate of $\mathcal{L}^{\text{sup}}(x_n, y)$ for each one. In Appendix C, we demonstrate that in practice M2's runtime per training step is several times longer than that of our consistent prediction-constrained approach. While further Monte Carlo approximations could avoid the explicit sum over classes in Eq. (6), they may make gradients far too noisy.
Extensions. Siddharth et al. (2017) showed how the M2 generative and inference models could be extended to any desired conditional independence structure for $y_n$ and $z_n$, generalizing the label-then-code factorization of Kingma et al. (2014). While importance sampling leads to likelihood bounds, the overall objective still has two undesirable traits. First, it is expensive, requiring either marginalization of $y_n$ to compute $\mathcal{L}^{\text{unsup}}$ in Eq. (6) or marginalization of $z_n$ to compute $q_\phi(y_n \mid x_n)$. Second, the approach requires the heuristic inclusion of the discriminator loss $\log q_\phi(y_n \mid x_n)$. While recent parallel work by Gordon & Hernández-Lobato (2020) also tries to improve SSL for VAEs, their approach couples discriminative and generative terms only distantly through a joint prior over parameters, and still requires expensive sums over labels when computing generative likelihoods.
Figure 2: MNIST test accuracy for SSL methods restricted to a 2-dimensional latent space, given 10 labeled examples per digit (latent-space visualizations omitted): VAE-then-MLP 54.9%, Supervised VAE 66.2%, PC-VAE 74.1%, CPC-VAE 81.1%, M2 69.1%.
3 Prediction-Constrained Learning with Consistency
We now highlight two experiments that demonstrate disadvantages of prior SSL methods, and contrast them with our new approaches. In Fig. 1 we show the predictive accuracy of several SSL methods on the widely-used "half-moon" task, where the goal is to predict a binary label from 2-dimensional features. We focus on the top row, which shows results given only 6 labeled examples (3 per class) but hundreds of unlabeled examples. Notably, while M2 reaches 98.1% accuracy with a small encoding space (2 dimensions), if the generative model is too flexible (14 dimensions) it learns overly complex structure that does not help label-from-feature predictions, dropping accuracy to only 80.6%. In contrast, our consistent prediction constrained (CPC) VAE exceeds 98% accuracy with either 2 or 14 latent dimensions. We have verified that it maintains 98% accuracy at even larger latent dimensionalities, while M2 shows further instability.
Second, in Fig. 2 we show SSL methods for classifying images of MNIST digits (LeCun et al., 2010), given only 10 labeled examples per digit. We seek models with highly accurate label-from-feature predictions, as well as interpretable relationships between the encoding and these predictions. When forced to use a 2-dimensional latent space, M2 has worse accuracy and (by design) no apparent relationship between the encoding $z_n$ and the label $y_n$. In contrast, our CPC approach offers noticeable advantages over all baselines in both accuracy and interpretability of the encoding space.
3.1 Prediction Constrained Training for VAEs
We develop a framework for jointly learning a strong generative model of features $x$ while making label-given-feature predictions of uncompromised quality, by requiring predictions to meet a user-specified quality threshold. Our prediction constrained training objective enables end-to-end estimation of all parameters while incorporating the same task-specific prediction rules and loss functions that will be used in heldout evaluation ("test") scenarios. Our goals are similar to previous work on end-to-end approximate inference for task-specific losses with simpler probabilistic models (Lacoste-Julien et al., 2011, Stoyanov et al., 2011), but our approach yields simpler algorithms.
Generative model. Our generative model does not include labels $y_n$, only features $x_n$ and encodings $z_n$. Their joint distribution factorizes as the unsupervised VAE of Eq. (1), and we also use the inference model defined in Eq. (1). While M2 included the labels in its generative model (Kingma et al., 2014), our goals are different: we wish to make label-given-feature predictions, but we are not interested in label marginals or other distributions over $y$ that do not condition on $x$.
Label-from-feature prediction. To predict labels $y_n$ from features $x_n$, we use a predictor similar to the two-stage method of Sec. 2.2. We first sample an encoding $z_n \sim q_\phi(z_n \mid x_n)$ from the learned inference model, and then transform this encoding into a label via the predictor function $\hat{y}_w(z_n)$ with parameters $w$. By sharing the random variable $z_n$, the generative model is involved in label-from-feature predictions.
Constrained PC objective. Unlike the two-stage model, our approach does not do post-hoc prediction with a previously learned generative model. Instead, we train the predictor simultaneously with the generative model via a new, prediction-constrained (PC) objective:
$$\max_{\theta, \phi, w} \;\; \sum_{x_n \in \mathcal{D}^U \cup \mathcal{D}^S} \mathcal{L}(x_n; \theta, \phi) \quad \text{subject to} \quad \frac{1}{|\mathcal{D}^S|} \sum_{(x_n, y_n) \in \mathcal{D}^S} \mathbb{E}_{q_\phi(z_n \mid x_n)}\big[\text{loss}\big(y_n, \hat{y}_w(z_n)\big)\big] \;\leq\; \epsilon. \qquad (8)$$
The constraint requires that any feasible solution achieve average prediction loss less than $\epsilon$ on the labeled training set. Both the loss function and the scalar threshold $\epsilon$ can be set to reflect task-specific needs (e.g., classification must reach a certain false positive rate or overall accuracy). The loss function may be any differentiable function, and need not equal the log-likelihood of discrete labels as assumed by previous work specialized to supervision of topic models (Hughes et al., 2018).
Unconstrained PC objective. Using the KKT conditions, we define an equivalent unconstrained objective that maximizes the unsupervised likelihood but penalizes inaccurate label predictions:
$$\max_{\theta, \phi, w} \;\; \sum_{x_n \in \mathcal{D}^U \cup \mathcal{D}^S} \mathcal{L}(x_n; \theta, \phi) \;-\; \lambda \sum_{(x_n, y_n) \in \mathcal{D}^S} \mathbb{E}_{q_\phi(z_n \mid x_n)}\big[\text{loss}\big(y_n, \hat{y}_w(z_n)\big)\big]. \qquad (9)$$
Here $\lambda > 0$ is a Lagrange multiplier chosen to ensure that the target prediction constraint is achieved; smaller loss tolerances $\epsilon$ require larger penalty multipliers $\lambda$. This PC objective, and gradients for the parameters $\theta, \phi, w$, can be estimated via Monte Carlo samples from $q_\phi(z_n \mid x_n)$.
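A sketch of a single-sample Monte Carlo estimate of Eq. (9) follows, reusing the ToyVAE sketch from Sec. 2.1; the predictor head and the default multiplier (the MNIST setting from Table 3) are illustrative, not the released implementation.

```python
# Unconstrained PC objective: unsupervised bound on all features minus a lambda-weighted
# prediction loss on the labeled subset, with predictions made from sampled codes.
import torch
import torch.nn.functional as F


def pc_objective(model, predictor, x_unlab, x_lab, y_lab, lam=25.0):
    gen_term = model.elbo(x_unlab) + model.elbo(x_lab)        # likelihood bound, labeled + unlabeled
    mu, logvar = model.encode(x_lab)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # z ~ q_phi(z|x), reparameterized
    pred_loss = F.cross_entropy(predictor(z), y_lab)          # task loss; any differentiable loss works
    return gen_term - lam * pred_loss                         # maximize this quantity
```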
Justification. While the PC objective of Eq. (9) may look superficially similar to Eq. (7), we emphasize two key differences. First, our objective couples the generative likelihood and the prediction loss via the shared variational parameters $\phi$. This makes both generative and discriminative performance depend on the same learned encoding $z_n$. (Later we show how to partition $z_n$ so that some entries are discriminative, while others affect generative "style" only.) In contrast, the M2 objective uses a label-given-features conditional $q_\phi(y_n \mid x_n)$ to make predictions, and that conditional does not share any of its parameters with the supervised likelihood bound. Second, our objective is more affordable: no term requires an expensive marginalization over labels. This is key to scaling to big unlabeled datasets, and also enables tractable learning from datasets whose labels are continuous or multi-dimensional.
Hyperparameters. The major hyperparameter influencing PC training is the constraint multiplier $\lambda$. Setting $\lambda = 0$ leads to unsupervised maximum likelihood training (or MAP training, given priors on $\theta$) of a classic VAE. Setting $\lambda = 1$ and choosing a probabilistic loss produces a "supervised VAE" that maximizes a joint likelihood of features and labels. But as illustrated in Fig. 2, because features $x_n$ have much higher dimension than labels $y_n$, the resulting model may have weak predictive performance. Satisfying the strong prediction constraint of Eq. (8) typically requires $\lambda \gg 1$, and in practice we use validation data to select the best of several candidate values. If a task motivates a concrete tolerance $\epsilon$, we can test an increasing sequence of $\lambda$ values until the constraint is satisfied.
We emphasize that although Eq. (9) is easier to optimize, we prefer to think of the constrained problem in Eq. (8) as the "primary" objective, because our applied goal is to satisfy discriminative quality first; a generative model that predicts poorly is not plausible. Furthermore, the constrained objective is far more natural for semi-supervised learning: the choice of $\epsilon$ need not be concerned with the relative sizes of the labeled and unlabeled datasets. In contrast, if either $|\mathcal{D}^S|$ or $|\mathcal{D}^U|$ changes, the value of $\lambda$ may need to change dramatically to reach the same prediction quality.

3.2 Enforcing Consistent Predictions from Generative Model Reconstructions
While the PC objective is effective given sufficient labeled data, it may generalize poorly when labels are very sparse (see Fig. 1). This fundamental problem arises because in the PC objective of Eq. (8), the parameters $w$ of the predictor are only directly informed by the labeled training data.
Revisiting the generative model, let $x$ and $x'$ be two observations sampled from the same latent code $z$. Even if the true label of $x$ is uncertain, we know that for this model to be useful for predictive tasks, $x'$ must have the same label as $x$. We formalize this relationship via a consistency constraint requiring label predictions for common-code data pairs to approximately match (see Fig. 3). As we show, this regularization may dramatically boost performance.
Given features $x_n$, our method predicts labels by sampling $z_n \sim q_\phi(z_n \mid x_n)$ from the approximate posterior and then applying our predictor $\hat{y}_w(z_n)$. Alternatively, given $x_n$ we can first simulate alternative features $x'_n$ with a matching code by sampling from the inference and generative models ($z_n \sim q_\phi(\cdot \mid x_n)$, $x'_n \sim p_\theta(\cdot \mid z_n)$), and then predict the label associated with $x'_n$. We constrain the label predictions for $x_n$, and for $x'_n$, to be similar via a consistency penalty function $\Delta(\cdot, \cdot)$. For the classification tasks considered below, we use a cross-entropy consistency penalty. Given this penalty, we constrain the maximum values of the following consistency costs on unlabeled and labeled examples, respectively:
$$\mathcal{C}^U(\theta, \phi, w) \;=\; \sum_{x_n \in \mathcal{D}^U} \mathbb{E}_{z_n \sim q_\phi(\cdot \mid x_n),\; x'_n \sim p_\theta(\cdot \mid z_n),\; z'_n \sim q_\phi(\cdot \mid x'_n)} \Big[\Delta\big(\hat{y}_w(z_n), \hat{y}_w(z'_n)\big)\Big], \qquad (10)$$
$$\mathcal{C}^S(\theta, \phi, w) \;=\; \sum_{(x_n, y_n) \in \mathcal{D}^S} \mathbb{E}_{z_n \sim q_\phi(\cdot \mid x_n),\; x'_n \sim p_\theta(\cdot \mid z_n),\; z'_n \sim q_\phi(\cdot \mid x'_n)} \Big[\Delta\big(y_n, \hat{y}_w(z'_n)\big)\Big]. \qquad (11)$$
Consistent PC: Unconstrained objective. To train parameters, we apply our consistency costs to unlabeled and labeled feature vectors, respectively. The overall (unconstrained) objective becomes:
$$\max_{\theta, \phi, w} \;\; \sum_{x_n \in \mathcal{D}^U \cup \mathcal{D}^S} \mathcal{L}(x_n; \theta, \phi) \;-\; \lambda \sum_{(x_n, y_n) \in \mathcal{D}^S} \mathbb{E}_{q_\phi(z_n \mid x_n)}\big[\text{loss}\big(y_n, \hat{y}_w(z_n)\big)\big] \;-\; \gamma \big[\mathcal{C}^U(\theta, \phi, w) + \mathcal{C}^S(\theta, \phi, w)\big],$$
where $\mathcal{L}$ is the unsupervised likelihood bound, the $\lambda$-weighted term is the predictor loss, and $\mathcal{C}^U, \mathcal{C}^S$ are the consistency costs of Eqs. (10)-(11). Here, $\gamma$ is a scalar Lagrange multiplier for the consistency terms, with a similar interpretation as $\lambda$.
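The sketch below computes a single-sample estimate of these consistency costs under our reading of Eqs. (10)-(11): encode, reconstruct, re-encode, and compare the two label predictions (or the observed label) with cross-entropy. The helper names (`decode_sample`, `predictor`) and the choice to stop gradients through the original prediction are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn.functional as F


def consistency_penalty(model, predictor, decode_sample, x, y=None):
    mu, logvar = model.encode(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    x_prime = decode_sample(z)                                # simulate features with a matching code
    mu2, logvar2 = model.encode(x_prime)
    z_prime = mu2 + torch.exp(0.5 * logvar2) * torch.randn_like(mu2)
    logits, logits_prime = predictor(z), predictor(z_prime)
    if y is None:                                             # unlabeled case, Eq. (10)
        target = F.softmax(logits, dim=1).detach()            # stop-gradient: one possible design choice
        return -(target * F.log_softmax(logits_prime, dim=1)).sum(dim=1).mean()
    return F.cross_entropy(logits_prime, y)                   # labeled case, Eq. (11)
```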
Aggregate Label Consistency. For SSL applications, we find it also useful to regularize our model with an aggregate label consistency constraint, which forces the distribution of label predictions for unlabeled data to be aligned with a known target distribution $\bar{p}(y)$. This discourages predictions on ambiguous unlabeled examples from collapsing to a single value. We define the aggregate consistency loss as the penalty between the target distribution and the average predicted label distribution over unlabeled examples, $\Delta\big(\bar{p},\; \tfrac{1}{|\mathcal{D}^U|} \sum_{x_n \in \mathcal{D}^U} \hat{y}_w(z_n)\big)$, and again use a cross-entropy penalty. If the target distribution of labels is unknown, we set it to the empirical distribution of the labeled data.
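A short sketch of this aggregate penalty (illustrative names; the batch average stands in for the average over the full unlabeled set):

```python
import torch
import torch.nn.functional as F


def aggregate_consistency(logits_unlabeled, target_dist):
    avg_pred = F.softmax(logits_unlabeled, dim=1).mean(dim=0)  # average predicted class distribution
    return -(target_dist * torch.log(avg_pred + 1e-8)).sum()   # cross-entropy to the target distribution
```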
Related work on consistency. Recently popular SSL image classifiers focused on discriminative goals train the weights of a CNN to minimize a modified objective that penalizes both label errors and a notion of inconsistency or non-smoothness on unlabeled data. Examples include consistency under adversarial perturbations (Miyato et al., 2019), under label-invariant transformations (Laine & Aila, 2017), and when interpolating between training features (Berthelot et al., 2019). This regularization can deliver competitive discriminative performance, but does not meet our goal of generative modeling. Recently, Unsupervised Data Augmentation (UDA, Xie et al. (2020)) achieved state-of-the-art vision and text SSL classification by enforcing label consistency on augmented samples of unlabeled features. UDA relies on the availability of well-engineered augmentation routines for specific domains (e.g., image processing library transforms for vision, or back-translation for text). In contrast, we learn a generative model that produces the feature vectors for which predictions need to be consistent. Our approach is thus more applicable to new domains where advanced augmentation routines are not available.
In broader machine learning, "cycle-consistency" has improved generative adversarial methods for images (Zhu et al., 2017, Zhou et al., 2016) and biomedical data (McDermott et al., 2018). Others have developed cycle-consistent objectives for VAEs (Jha et al., 2018) that focus on consistency in the code vectors $z$. In contrast, our work focuses on semi-supervised learning and enforces cycle consistency in the labels $y$. Recently, Miller et al. (2019) developed discriminative regularization for VAEs. Their objective is not designed for SSL and uses a direct feature-to-label prediction model that must be consistent with predictions on reconstructions. Our approach instead uses code-to-label prediction and targets SSL.
3.3 Improved Generative Models: Robust Likelihoods and Spatial Transformers
As our approach is applicable to any generative model, we can incorporate prior knowledge of the data domain to improve both generative and discriminative performance. We consider two examples: likelihoods that model noisy pixels, and explicit affine transformations to model image deformations.
Robust Likelihoods. Instead of a normal (or other common) likelihood, we use a "Noise-Normal" likelihood to model images more robustly. We assume that pixel intensities have values in the interval $[-1, 1]$ and rescale our datasets to match. Our Noise-Normal likelihood is defined as a 2-component mixture of a truncated normal and a uniform "noise" distribution, with pixel-specific mixture weights. Let $\varphi(\cdot)$ denote the standard normal PDF and $\Phi(\cdot)$ the standard normal CDF. We write the probability density function of the Noise-Normal distribution with parameters $(\mu, \sigma, \rho)$, where $\rho$ is the weight of the normal component, as:
$$\text{NoiseNormal}(x \mid \mu, \sigma, \rho) \;=\; \rho\, \frac{\tfrac{1}{\sigma}\,\varphi\!\big(\tfrac{x - \mu}{\sigma}\big)}{\Phi\!\big(\tfrac{1 - \mu}{\sigma}\big) - \Phi\!\big(\tfrac{-1 - \mu}{\sigma}\big)} \;+\; (1 - \rho)\,\frac{1}{2}, \qquad x \in [-1, 1]. \qquad (12)$$
Following Eq. (1), our (unsupervised) VAE now uses the revised generative and inference models:
$$p_\theta(x_n \mid z_n) \;=\; \prod_{d} \text{NoiseNormal}\big(x_{nd} \mid \mu_{\theta,d}(z_n),\, \sigma_{\theta,d}(z_n),\, \rho_{\theta,d}(z_n)\big), \qquad q_\phi(z_n \mid x_n) \;=\; \mathcal{N}\big(z_n \mid \mu_\phi(x_n), \Sigma_\phi(x_n)\big). \qquad (13)$$
This approach allows our model to avoid sensitivity to outliers and noise in the observed images, and boosts the performance of our CPC method for SSL.
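A sketch of the Noise-Normal log-density follows, matching the reconstruction of Eq. (12) above (a $\rho$-weighted mixture of a normal truncated to $[-1, 1]$ and a Uniform$(-1, 1)$ component); it is an illustration, not the authors' implementation.

```python
import torch


def noise_normal_log_prob(x, mu, sigma, rho):
    sn = torch.distributions.Normal(0.0, 1.0)
    trunc_mass = sn.cdf((1.0 - mu) / sigma) - sn.cdf((-1.0 - mu) / sigma)
    trunc_pdf = torch.exp(sn.log_prob((x - mu) / sigma)) / (sigma * trunc_mass)  # truncated normal on [-1, 1]
    uniform_pdf = 0.5                                                            # Uniform(-1, 1) density
    return torch.log(rho * trunc_pdf + (1.0 - rho) * uniform_pdf)
```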
Spatial Transformer VAE. Our spatial transformer VAE retains the structure of a standard VAE, but reinterprets the latent code as two components. We denote the first 6 latent dimensions as $z^A_n$, and associate these with 6 affine transformation parameters capturing image translation, rotation, scaling, and shear. The generative model maps each value into a fixed range and creates an affine transformation matrix, $A(z^A_n)$, by applying the transformations in a fixed order. The remainder of the latent code, $z^B_n$, is used to generate parameters for independent, per-pixel likelihoods. Assuming normal likelihoods, the output parameters for the pixel with coordinates $(i, j)$ are $\mu_{ij}(z^B_n)$ and $\sigma_{ij}(z^B_n)$.
We re-orient our per-pixel likelihoods according to the affine transform $A(z^A_n)$. The parameters of the likelihood for pixel $(i, j)$ will use the decoder outputs at coordinates $(i', j')$, where $A(z^A_n)$ defines the affine mapping from $(i, j)$ to $(i', j')$. We apply this transformation in a (sub)differentiable way via a spatial transformer layer (Jaderberg et al., 2015) that takes as input $z^A_n$ and the appropriate parameter maps $\mu, \sigma$, and outputs a final set of parameters for the individual pixel likelihoods. As $(i', j')$ may not correspond to integer coordinates, we use bilinear interpolation over an appropriate representation of the likelihood parameters, and appropriately pad the size of the decoder output.
We further account for this special structure in our prediction and consistency constraints. For many applications, we have prior knowledge that small affine transforms should not affect the class of an image, and thus we can define constraints whose predictions condition on $z^B_n$ but not $z^A_n$.
Table 1: SSL test accuracy on MNIST (100 labels), SVHN (1000 labels), and NORB (1000 labels). "-" indicates results not reported by the cited source.

| Source | Method | MNIST (100) | SVHN (1000) | NORB (1000) |
|---|---|---|---|---|
| Tables 1-2 of Kingma et al. (2014) | M1 + M2 | | | - |
| Table 2 of Maaløe et al. (2016) | ADGM | | | |
| Table 2 of Maaløe et al. (2016) | SDGM | | | |
| Gordon & Hernández-Lobato (2020) | Blended M2 | | - | - |
| Tables 3-4 of Miyato et al. (2019) | VAT | | | - |
| ours, using labeled-set only | WRN | | | |
| ours | CPC VAE | | | |
Table 2: Internal comparison of CPC variants and M2 on SSL MNIST with 100 labeled examples (test accuracy).

| Method | MNIST (100) | Method | MNIST (100) |
|---|---|---|---|
| CPC (2 layer) | | M1 + M2 (Kingma et al., 2014) | |
| CPC (2 layer, w/o aggregate loss) | | M2 (1 layer, original $\alpha$) (Kingma et al., 2014) | |
| CPC (2 layer, w/o transforms) | | M2 (2 layer, original $\alpha$) | |
| CPC (4 layer, w/o transforms) | | M2 (4 layer, original $\alpha$) | |
| PC (2 layer) | | M2 (4 layer, tuned $\alpha$) | |
| VAE + MLP | | M2 (1 layer, original $\alpha$, Noise-Normal) | |
4 Experiments
We assess our consistent prediction-constrained (CPC) VAE on two key goals: accurate prediction of labels given features (especially when labels are rare) and useful generative modeling of . We compare to ablations of our own method (without consistency, without spatial transformations) and to external baselines. We report each method’s mean and standard deviation in classification accuracy across 10 labeled subsets. We trained using ADAM (Kingma & Ba, 2014), with each minibatch containing 50% labeled and 50% unlabeled data. Hyperparameter search used Optuna (Akiba et al., 2019) to maximize accuracy on validation data. Supervised baselines used either MLP or wide residual nets (WRN, Zagoruyko & Komodakis (2016)). Reproducible details are in appendices.
SSL classification on MNIST with thorough internal comparisons. In Table 2 we compare several variations of our CPC methods and the M2 model on an SSL version of MNIST (LeCun et al., 2010; 10 classes, 100 labeled and 49,900 unlabeled training examples, 10,000 validation, 10,000 test).
SSL classification on SVHN and NORB. Table 1 compares our methods on two standard SSL tasks: Street-View House Numbers (SVHN) (Netzer et al., 2011; 10 classes, 1,000 labeled and 62,257 unlabeled training examples, 10,000 validation, 26,032 test) and the NYU Object Recognition Benchmark (NORB) (LeCun et al., 2004; 5 classes, 1,000 labeled and 21,300 unlabeled training examples, 2,000 validation, 24,300 test).
SSL classification on CelebA. We ran additional experiments on a variant of the CelebA dataset (Liu et al., 2015). For these trials we created a classification problem with 4 classes based on the combination of gender (woman/man) and facial expression (neutral/smiling), using 1,000 labeled training examples, 2,000 validation examples, and 19,962 test examples. We report our results in Figure 6.
Across all evaluations, we can conclude:
Both consistency and prediction constraints are needed for high accuracy. In Table 2, PC alone gets 80% accuracy on 100-label MNIST, while adding consistency yields 97% for CPC. The benefits of CPC over PC in both accuracy and latent interpretability are visible in Figs. 1-2. Our aggregate label consistency improves robustness, reducing the variance in CPC accuracy across labeled subsets.
CPC training delivers strong improvements in SSL prediction quality over baselines. In Table 1, our CPC-VAE achieves 94.22% and 92.0% accuracy on the challenging 1000-label SVHN and NORB benchmarks, surpassing all reported baselines by more than 1.4% while remaining reliable across runs. The M1+M2 baseline (Kingma et al., 2014) is not a coherent generative model, but rather a discriminative ensemble of multiple models. It performs well on MNIST, but very poorly on the more challenging SVHN.
CPC delivers better generative performance; it is not all about prediction. We improve on unsupervised VAEs by explicitly learning latent representations informed by class labels (Fig. 2). In Figs. 4, 5, and 6, we show visually-plausible class-conditional samples from our best CPC models. Additional visuals from learned VAE and CPC-VAE models are in the supplement.
With improved generative models, CPC can improve predictions. Fig. 5 shows that including spatial transformations allows learning a canonical orientation and scale for each digit. This generative improvement boosts classifier accuracy (e.g., MNIST improves from 91.9% to 96.7% in Table 2).





5 Conclusion
We have developed a new optimization framework for semi-supervised VAEs that can balance discriminative and generative goals. Across image classification tasks, our CPC-VAE method delivers superior accuracy and label-informed generative models with visually-plausible samples. Unlike previous efforts to enforce constraints on latent variable models, such as expectation constraints (Mann & McCallum, 2010), posterior regularization (Zhu et al., 2014; 2012), posterior constraints (Ganchev et al., 2010), or prediction constraints for topic models (Hughes et al., 2018), our approach is the only one that coherently and simultaneously treats uncertainty in the latent variables $z$, applies to flexible "deep" non-conjugate models, and offers scalable training and test evaluation via amortized inference.
A further contribution is demonstrating the necessity of consistency for improving discrimination. Our CPC approach is an antidote to model misspecification: the constraints on prediction quality and consistency prevent training from settling on a generative model that is unaligned with the classification task, or that overfits as generative capacity grows (as M2 is vulnerable to do). As we show with spatial transformers, our work lets improvements in generative model quality directly improve semi-supervised label prediction, helping realize the promise of deep generative models.
References
- Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
- Akiba et al. (2019) Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2019.
- Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin Raffel. MixMatch: A Holistic Approach to Semi-Supervised Learning. In Advances in Neural Information Processing Systems, 2019. URL http://arxiv.org/abs/1905.02249.
- Cowell et al. (2006) Robert G Cowell, Philip Dawid, Steffen L Lauritzen, and David J Spiegelhalter. Probabilistic networks and expert systems: Exact computational methods for Bayesian networks. Springer Science & Business Media, 2006.
- de Bem et al. (2018) Rodrigo de Bem, Arnab Ghosh, Thalaiyasingam Ajanthan, Ondrej Miksik, N. Siddharth, and Philip Torr. A Semi-supervised Deep Generative Model for Human Body Analysis. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018. URL http://openaccess.thecvf.com/content_eccv_2018_workshops/w11/html/de_A_Semi-supervised_Deep_Generative_Modelfor_Human_Body_Analysis_ECCVW_2018_paper.html.
- Dumoulin et al. (2017) Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially Learned Inference. In International Conference on Learning Representations (ICLR), 2017.
- Figurnov et al. (2018) Michael Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, 2018.
- Ganchev et al. (2010) Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. Posterior Regularization for Structured Latent Variable Models. Journal of Machine Learning Research, 11:2001–2049, 2010.
- Gordon & Hernández-Lobato (2020) Jonathan Gordon and José Miguel Hernández-Lobato. Combining deep generative and discriminative models for Bayesian semi-supervised learning. Pattern Recognition, 100, 2020.
- Grandvalet & Bengio (2004) Yves Grandvalet and Yoshua Bengio. Semi-supervised learning by entropy minimization. In Proceedings of the 17th International Conference on Neural Information Processing Systems, NIPS’04, pp. 529–536, Cambridge, MA, USA, 2004. MIT Press.
- Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR), 2017.
- Hughes et al. (2018) Michael C. Hughes, Gabriel Hope, Leah Weiner, Thomas H. McCoy, Roy H. Perlis, Erik B. Sudderth, and Finale Doshi-Velez. Semi-Supervised Prediction-Constrained Topic Models. In Artificial Intelligence and Statistics, 2018. URL http://proceedings.mlr.press/v84/hughes18a.html.
- Jaderberg et al. (2015) Max Jaderberg, Karen Simonyan, Andrew Zisserman, and koray kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems, 2015. URL http://papers.nips.cc/paper/5854-spatial-transformer-networks.pdf.
- Jha et al. (2018) Ananya Harsh Jha, Saket Anand, Maneesh Singh, and V. S. R. Veeravasarapu. Disentangling Factors of Variation with Cycle-Consistent Variational Auto-encoders. In European Conference on Computer Vision (ECCV). Springer International Publishing, 2018.
- Kingma & Ba (2014) Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], 2014. URL http://arxiv.org/abs/1412.6980.
- Kingma & Welling (2014) Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations, 2014. URL http://arxiv.org/abs/1312.6114.
- Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 2014. URL https://papers.nips.cc/paper/5352-semi-supervised-learning-with-deep-generative-models.pdf.
- Kumar et al. (2017) Abhishek Kumar, Prasanna Sattigeri, and Tom Fletcher. Semi-supervised Learning with GANs: Manifold Invariance with Improved Inference. In Advances in Neural Information Processing Systems, 2017. URL https://papers.nips.cc/paper/7137-semi-supervised-learning-with-gans-manifold-invariance-with-improved-inference.pdf.
- Lacoste-Julien et al. (2011) Simon Lacoste-Julien, Ferenc Huszár, and Zoubin Ghahramani. Approximate inference for the loss-calibrated bayesian. In Artificial Intelligence and Statistics, 2011.
- Laine & Aila (2017) Samuli Laine and Timo Aila. Temporal Ensembling for Semi-Supervised Learning. In International Conference on Learning Representations, 2017. URL https://openreview.net/pdf?id=BJ6oOfqge.
- Larsen et al. (2016) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning, pp. 1558–1566, 2016. URL http://proceedings.mlr.press/v48/larsen16.html.
- LeCun et al. (2004) Y LeCun, F. J. Huang, and L. Bottou. Learning Methods for Generic Object Recognition with Invariance to Pose and Lighting. In IEEE Computer Vision and Pattern Recognition (CVPR), 2004. URL http://yann.lecun.com/exdb/publis/pdf/lecun-04.pdf.
- LeCun et al. (2010) Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database, 2010. URL http://yann.lecun.com/exdb/mnist/.
- Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), December 2015.
- Maaløe et al. (2016) Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary Deep Generative Models. arXiv:1602.05473 [cs, stat], 2016. URL http://arxiv.org/abs/1602.05473.
- Mann & McCallum (2010) Gideon S Mann and Andrew McCallum. Generalized expectation criteria for semi-supervised learning with weakly labeled data. Journal of Machine Learning Research, 11(Feb):955–984, 2010.
- McDermott et al. (2018) Matthew B A McDermott, Tom Yan, Tristan Naumann, Nathan Hunt, Harini Suresh, Peter Szolovits, and Marzyeh Ghassemi. Semi-Supervised Biomedical Translation with Cycle Wasserstein Regression GANs. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 8, 2018. URL https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/viewFile/16938/15951.
- Miller et al. (2019) Andrew C Miller, Ziad Obermeyer, John P Cunningham, and Sendhil Mullainathan. Discriminative Regularization for Latent Variable Models with Applications to Electrocardiography. In International Conference on Machine Learning, pp. 10, 2019. URL https://proceedings.mlr.press/v97/miller19a/miller19a.pdf.
- Miyato et al. (2019) Takeru Miyato, Shin-Ichi Maeda, Masanori Koyama, and Shin Ishii. Virtual Adversarial Training: A Regularization Method for Supervised and Semi-Supervised Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):1979–1993, 2019. URL https://ieeexplore.ieee.org/document/8417973/.
- Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. URL http://ufldl.stanford.edu/housenumbers.
- Oliver et al. (2018) Avital Oliver, Augustus Odena, Colin Raffel, Ekin D. Cubuk, and Ian J. Goodfellow. Realistic Evaluation of Deep Semi-Supervised Learning Algorithms. arXiv:1804.09170 [cs, stat], 2018. URL http://arxiv.org/abs/1804.09170.
- Siddharth et al. (2017) N. Siddharth, Brooks Paige, Jan-Willem van de Meent, Alban Desmaison, Noah D. Goodman, Pushmeet Kohli, Frank Wood, and Philip H. S. Torr. Learning Disentangled Representations with Semi-Supervised Deep Generative Models. In Advances in Neural Information Processing Systems, 2017. URL http://arxiv.org/abs/1706.00400.
- Stoyanov et al. (2011) Veselin Stoyanov, Alexander Ropson, and Jason Eisner. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In Artificial Intelligence and Statistics, 2011.
- Xie et al. (2020) Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V. Le. Unsupervised data augmentation for consistency training. In Advances in Neural Information Processing Systems, 2020.
- Zagoruyko & Komodakis (2016) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Edwin R. Hancock Richard C. Wilson and William A. P. Smith (eds.), Proceedings of the British Machine Vision Conference (BMVC), pp. 87.1–87.12. BMVA Press, September 2016. ISBN 1-901725-59-6. doi: 10.5244/C.30.87. URL https://dx.doi.org/10.5244/C.30.87.
- Zhang et al. (2019) Xiang Zhang, Lina Yao, and Feng Yuan. Adversarial Variational Embedding for Robust Semi-supervised Learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 139–147, Anchorage AK USA, 2019. ACM.
- Zhou et al. (2016) Tinghui Zhou, Philipp Krahenbuhl, Mathieu Aubry, Qixing Huang, and Alexei A. Efros. Learning Dense Correspondence via 3D-Guided Cycle Consistency. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 117–126, Las Vegas, NV, USA, 2016. IEEE.
- Zhu et al. (2012) Jun Zhu, Amr Ahmed, and Eric P Xing. MedLDA: Maximum margin supervised topic models. The Journal of Machine Learning Research, 13(1):2237–2278, 2012.
- Zhu et al. (2014) Jun Zhu, Ning Chen, and Eric P Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. Journal of Machine Learning Research, 15(1):1799–1847, 2014.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired Image-to-Image Translation Using Cycle-Consistent Adversarial Networks. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2242–2251, Venice, 2017. IEEE.
- Zhu (2005) Xiaojin Zhu. Semi-Supervised Learning Literature Survey. Technical Report Technical Report 1530, Department of Computer Science, University of Wisconsin Madison., 2005.
Appendix A Details and Visualizations of Generative Models
A.1 Noise-Normal Likelihood
As discussed in Sec. 3.3, we use a "Noise-Normal" distribution as the pixel likelihood for many of our experiments. We define this distribution to be a parameterized two-component mixture of a truncated-normal distribution and a uniform distribution. We use $\rho$ to denote the mixture probability of the normal component, and $\mu$ and $\sigma$ to denote the mean and standard deviation of the truncated normal, respectively. The generative model (or decoder) predicts a distinct outlier probability for each pixel. We assume that pixel intensities are defined on the domain $[-1, 1]$ and rescale our datasets to match. We can write the probability density function of the Noise-Normal distribution via the standard normal PDF $\varphi(\cdot)$ and standard normal CDF $\Phi(\cdot)$ as follows:
$$p(x \mid \mu, \sigma, \rho) \;=\; \rho\, \frac{\tfrac{1}{\sigma}\,\varphi\!\big(\tfrac{x - \mu}{\sigma}\big)}{\Phi\!\big(\tfrac{1 - \mu}{\sigma}\big) - \Phi\!\big(\tfrac{-1 - \mu}{\sigma}\big)} \;+\; (1 - \rho)\,\frac{1}{2}, \qquad x \in [-1, 1]. \qquad (14)$$
We can similarly express the cumulative distribution function of the Noise-Normal distribution as:
$$F(x \mid \mu, \sigma, \rho) \;=\; \rho\, \frac{\Phi\!\big(\tfrac{x - \mu}{\sigma}\big) - \Phi\!\big(\tfrac{-1 - \mu}{\sigma}\big)}{\Phi\!\big(\tfrac{1 - \mu}{\sigma}\big) - \Phi\!\big(\tfrac{-1 - \mu}{\sigma}\big)} \;+\; (1 - \rho)\,\frac{x + 1}{2}, \qquad x \in [-1, 1]. \qquad (15)$$
In order to propagate gradients through the sampling process of the Noise-Normal distribution, we use the implicit reparameterization gradients approach of Figurnov et al. (2018). Given a sample $x$ drawn from this distribution, we compute the gradient with respect to the parameters $\mu$, $\sigma$, and $\rho$ as:
$$\nabla_{(\mu, \sigma, \rho)}\, x \;=\; -\,\frac{\nabla_{(\mu, \sigma, \rho)}\, F(x \mid \mu, \sigma, \rho)}{p(x \mid \mu, \sigma, \rho)}. \qquad (16)$$
When fitting the parameters of this distribution using gradient descent, we enforce the constraints that $\mu \in [-1, 1]$, $\sigma > 0$, and $\rho \in [0, 1]$. To do this, we optimize unconstrained parameters $\tilde{\mu}, \tilde{\sigma}, \tilde{\rho}$, and then define $\mu$, $\sigma$, and $\rho$ via smooth squashing transformations that map onto the valid ranges.
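The sketch below illustrates this scheme under the reconstructions of Eqs. (14)-(16): invert the CDF numerically (without tracking gradients) to draw a sample, then attach gradients via the implicit formula. It reuses the `noise_normal_log_prob` sketch from Sec. 3.3; the bisection inversion is an assumption for illustration, not the authors' implementation.

```python
import torch


def noise_normal_cdf(x, mu, sigma, rho):
    sn = torch.distributions.Normal(0.0, 1.0)
    lo, hi = sn.cdf((-1.0 - mu) / sigma), sn.cdf((1.0 - mu) / sigma)
    trunc_cdf = (sn.cdf((x - mu) / sigma) - lo) / (hi - lo)
    return rho * trunc_cdf + (1.0 - rho) * (x + 1.0) / 2.0     # Eq. (15)


def sample_noise_normal(mu, sigma, rho, n_bisect=40):
    u = torch.rand_like(mu)
    lo, hi = torch.full_like(mu, -1.0), torch.full_like(mu, 1.0)
    with torch.no_grad():                                      # gradient-free CDF inversion by bisection
        for _ in range(n_bisect):
            mid = 0.5 * (lo + hi)
            too_low = noise_normal_cdf(mid, mu, sigma, rho) < u
            lo = torch.where(too_low, mid, lo)
            hi = torch.where(too_low, hi, mid)
        x = 0.5 * (lo + hi)
    pdf = torch.exp(noise_normal_log_prob(x, mu, sigma, rho))  # from the sketch in Sec. 3.3
    F_x = noise_normal_cdf(x, mu, sigma, rho)
    return x - (F_x - F_x.detach()) / pdf.detach()             # gradient obeys Eq. (16)
```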
A.2 Spatial Transformer VAE
We now describe how to sample affine transformations for use in our generative model of images. As described in Sec. 3.3, the latent transformation code $z^A$ has 6 real-valued dimensions, each corresponding to one of the following 6 affine transformation parameters:
- horizontal translation,
- vertical translation,
- rotation,
- shear,
- horizontal scale,
- vertical scale.
To constrain our transformations to a fixed range of plausible values, we construct $A$ from parameters that are first mapped to the interval $[-1, 1]$, and then linearly rescaled to an appropriate range via hyperparameters (the translation, rotation, shear, and scale ranges listed in Table 3). Figure 7 illustrates that the induced prior over each transformation parameter places most of its mass near the extreme values, encouraging aggressive augmentation when sampling from the prior. The mapping function could be changed to modify this distribution for other applications.

Given these latent transformation parameters, we define an affine transformation matrix as follows:
$$A \;=\; \underbrace{\begin{bmatrix} 1 & 0 & t_x \\ 0 & 1 & t_y \\ 0 & 0 & 1 \end{bmatrix}}_{\text{translation}} \underbrace{\begin{bmatrix} \cos\theta_r & -\sin\theta_r & 0 \\ \sin\theta_r & \cos\theta_r & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{rotation}} \underbrace{\begin{bmatrix} 1 & m & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{shear}} \underbrace{\begin{bmatrix} s_x & 0 & 0 \\ 0 & s_y & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{scale}}, \qquad (17)$$
where $t_x, t_y$ are the translations, $\theta_r$ the rotation, $m$ the shear, and $s_x, s_y$ the scales.
To determine the parameters of the likelihood function for the pixel at coordinates $(i, j)$, we use the generative model (or decoder) output at the pixel $(i', j')$ for which
$$\begin{bmatrix} i' \\ j' \\ 1 \end{bmatrix} \;=\; A \begin{bmatrix} i \\ j \\ 1 \end{bmatrix}. \qquad (18)$$
This corresponds to applying horizontal and vertical scaling, followed by rotation and shear, followed by translation. We use the spatial transformer layer proposed by Jaderberg et al. (2015) with bilinear interpolation to apply this transformation with non-integer pixel coordinates. For the Noise-Normal distribution we independently interpolate the , , and parameters.
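For illustration, the sketch below builds the matrix of Eq. (17) in the order stated above and resamples a decoder parameter map with bilinear interpolation, using `torch.nn.functional.affine_grid` / `grid_sample` as one possible spatial-transformer implementation; the exact parameterization and padding choices are assumptions, not the released code.

```python
import math
import torch
import torch.nn.functional as F


def affine_matrix(tx, ty, rot, shear, sx, sy):
    """Compose translation @ rotation @ shear @ scale (scale is applied first).
    Plain floats here for simplicity; in a real model these entries would be
    tensor-valued so gradients flow through the transformation code z^A."""
    T = torch.tensor([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])
    R = torch.tensor([[math.cos(rot), -math.sin(rot), 0.0],
                      [math.sin(rot), math.cos(rot), 0.0],
                      [0.0, 0.0, 1.0]])
    Sh = torch.tensor([[1.0, shear, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
    S = torch.tensor([[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]])
    return T @ R @ Sh @ S


def warp_parameter_map(param_map, A):
    """param_map: (N, C, H, W) decoder output (e.g. per-pixel mu); A: 3x3 affine matrix."""
    theta = A[:2, :].unsqueeze(0).expand(param_map.shape[0], -1, -1)        # (N, 2, 3)
    grid = F.affine_grid(theta, list(param_map.shape), align_corners=False)
    return F.grid_sample(param_map, grid, mode="bilinear",
                         padding_mode="border", align_corners=False)
```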
A.3 Class-conditional sampling
A standard VAE generates data by sampling $z \sim \mathcal{N}(0, I)$ and then sampling $x \sim p_\theta(x \mid z)$, using a normal likelihood or an alternative like the Noise-Normal likelihood. For the PC-VAE or CPC-VAE, we can further sample images conditioned on a particular class label. As labels are not explicitly part of the generative model, we accomplish this by sampling images that would be confidently predicted as the target class. We use a rejection sampler, repeatedly sampling until a sample's predicted probability for the target class exceeds a threshold $\tau$; we use a fixed high threshold in our experiments.
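A sketch of this rejection sampler follows; `decode_sample` and `predictor` are assumed helpers, and the threshold value is illustrative rather than the paper's exact setting.

```python
import torch
import torch.nn.functional as F


def class_conditional_sample(decode_sample, predictor, target_class, z_dim=50,
                             threshold=0.99, max_tries=1000):
    for _ in range(max_tries):
        z = torch.randn(1, z_dim)                  # sample a code from the standard normal prior
        probs = F.softmax(predictor(z), dim=1)
        if probs[0, target_class] >= threshold:    # illustrative threshold, not the paper's exact value
            return decode_sample(z)
    raise RuntimeError("no confidently classified sample found")
```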
MNIST digit samples for models with a 2-D latent space.
Fig. 2 in the main text shows 2-dimensional latent space encodings of the MNIST dataset using several different models. We provide a complementary visualization of generative models in Fig. 8, where we compare class-conditional samples for three of these models. The unsupervised VAE’s encodings of some classes (e.g., 2’s and 4’s and 8’s and 9’s) are not separated, and samples thus frequently appear to be the wrong class. Model M2 (Kingma et al., 2014) explicitly encodes the class label as a latent variable, but nevertheless many sampled images do not visually match the conditioned class. In contrast, for our CPC-VAE model almost all samples are easily recognized as the target class.




Appendix B Sensitivity to constraint multipliers
We compare the test accuracy of our consistency-constrained model on MNIST over a range of values for both $\lambda$ (the prediction constraint multiplier) and $\gamma$ (the consistency constraint multiplier) in Figure 10. All runs used our best consistency-constrained model for MNIST with dense networks. We kept all hyperparameters identical to the previous results (see Appendix D), changing only the value of interest for each run.
We see that the resulting test accuracy smoothly varies across several orders of magnitude, with the optimal result being at or near the values we chose for our experiments. Performance is superior to the M2 baseline model for a wide range of hyperparameter values.


Appendix C Training time
Figure 11 below provides an empirical comparison of the average training time cost per step using the MNIST models summarized in Table 2. Our CPC-VAE implementation runs both the encoder and decoder networks twice to compute the objective (once for the standard VAE loss and an additional time to compute the consistency reconstruction and prediction), thus the runtime is approximately twice that of the PC-VAE. We see that this approximately holds in practice: The PC-VAE requires 38 milliseconds per training step, while the CPC-VAE requires 80.7 milliseconds.
Furthermore, our empirical findings show that training M2 is more expensive than our proposed CPC-VAE in practice, which we expect given the runtime analysis described in Sec. 2.3. The M2 model must run the encoder and decoder networks once per class in order to compute the loss, due to the marginalization of the labels required for the unsupervised loss in Eq. (6). This increases the runtime by a factor equivalent to the number of classes. In our empirical test, we see that the training time per step is 6.7x that of the PC-VAE model, close to the 10x slowdown we would expect for the 10 digit classes of MNIST. In our experiments, we did not find substantial differences in the size of networks or number of training steps needed to train each of these models effectively.

Appendix D Experimental Protocol
Here we provide details about models and experiments which did not fit into the primary paper.
D.1 Hyperparameter optimization
The hyperparameter search for all models, including the CPC-VAE and various baselines, used Optuna (Akiba et al., 2019) to achieve the best accuracy on a labeled validation set. For our 2-layer and 4-layer M2 experiments, we used our own implementation (available in our code release) and followed the hyperparameters used by the original authors. For the 4-layer variant, we tested 10 different settings of the discriminator weight $\alpha$, reporting both the result using the originally suggested value and the best value we found in this setting. For M2, we also dynamically reduced the learning rate when the validation loss plateaued.
D.2 Network architectures
For our PC-VAE and CPC-VAE models of the MNIST data, we use fully-connected encoder and decoder networks with two hidden layers, 1000 hidden units per layer, and softplus activation functions. Like the M2 model (Kingma et al., 2014), we use a 50-dimensional latent space. The original M2 experiments used networks with a single hidden layer of 500 units. We compare this to replications with networks matching ours, as well as 4-layer networks.
For the SVHN and NORB datasets, we adapt the wide-residual network architecture (WRN-28-2) (Zagoruyko & Komodakis, 2016) that was proposed as a standard for semi-supervised deep learning research in Oliver et al. (2018). In particular, we use this architecture for our encoder with two notable changes: we replace the final global average pooling layer with 3 dense layers of 1000 hidden units each, and add a final dense layer that outputs means and variances for the latent space. We find that the dense layers provide the capacity needed for accomplishing both generative and discriminative tasks with a single network. For the decoder network we use a "mirrored" version of this architecture, reversing the order of layer sizes, replacing convolutions with transposed convolutions, and removing pooling layers. We maintain the residual structure of the network. Our best classification results with this architecture were achieved with a latent space dimension of 200.
D.3 Beta-VAE regularization
As an additional form of regularization for our model, we allow our hyperparameter optimization to adjust a weight on the KL-divergence term in the variational lower bound, which we call $\beta$ as in previous work (Higgins et al., 2017):
$$\mathcal{L}_\beta(x_n; \theta, \phi) \;=\; \mathbb{E}_{q_\phi(z_n \mid x_n)}\big[\log p_\theta(x_n \mid z_n)\big] \;-\; \beta\, \text{KL}\big(q_\phi(z_n \mid x_n)\,\|\, p(z_n)\big). \qquad (19)$$
This allows us to encourage $q_\phi(z_n \mid x_n)$ to more closely conform to the prior, which may be necessary to balance the scale of the objective, depending on the likelihoods used and the dimensionality of the dataset.
D.4 Prediction model regularization
We add two standard regularization terms to the prediction model $\hat{y}_w$ used in our constraint. The first is an $L_2$ regularizer on the predictor weights $w$, to help reduce overfitting. The second is an entropy penalty. As $\hat{y}_w(z_n)$ defines a categorical distribution over labels, we compute this as the entropy $-\sum_c \hat{y}_w(z_n)_c \log \hat{y}_w(z_n)_c$, which has been shown to be helpful for semi-supervised learning in Grandvalet & Bengio (2004) and was used as part of the standardized training framework of Oliver et al. (2018). We allowed our hyperparameter optimization approach to select appropriate weights for both terms.
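A sketch of these two regularizers (the default weights follow Table 3; the function and argument names are illustrative):

```python
import torch
import torch.nn.functional as F


def predictor_regularizers(predictor, logits, l2_weight=1.0, entropy_weight=0.5):
    l2 = sum((p ** 2).sum() for p in predictor.parameters())                 # L2 penalty on weights
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * F.log_softmax(logits, dim=1)).sum(dim=1).mean()      # predictive entropy
    return l2_weight * l2 + entropy_weight * entropy                         # added to the minimized loss
```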
D.5 Image pre-processing
For all of our image datasets, we rescale the inputs to the range [-1, 1]. For our NORB classification results, we downsample each image to 48x48 pixels. For our SVHN classification results, we convert images to greyscale to reduce the representational load on our generative model. Before the grayscale conversion, we apply contrast normalization to better disambiguate the colors within each image.
For the SVHN and NORB results, we follow the recommendation of a recent survey of semi-supervised learning methods (Oliver et al., 2018) and apply a single data augmentation technique: random translations by up to 2-pixels in each direction. For generative results, we retained the original color images and trained with full labels.
D.6 Likelihoods
For all of our image datasets, we use the Noise-Normal likelihood for our CPC methods. For all experiments on toy data (e.g. half-moon), we used a normal likelihood.
For our implementation of M2 in the extensive experiments on MNIST, we retained the Bernoulli likelihood used by the original authors (Kingma et al., 2014). That is, we rescale each pixel's numerical intensity value to the unit interval [0, 1], and then sample binary values from a Bernoulli with probability equal to the intensity.
D.7 Summary of hyperparameter settings for final results
Table 3 below provides all hyperparameter settings used in our experiments.
Hyperparameter | MNIST (100) | SVHN (1000) | NORB (1000)
---|---|---|---
Encoder/decoder | 2 FC layers | WRN-28-2 + 3 FC | WRN-28-2 + 3 FC
Fully connected layer size | 1000 units | 1000 units | 1000 units
Network activations | Softplus | Leaky ReLU | Leaky ReLU
Latent dimension | 50 | 200 | 200
Pixel likelihood | Noise-Normal | Noise-Normal | Noise-Normal
Prediction multiplier $\lambda$ | 25 | 140 | 80
Consistency multiplier $\gamma$ | 4.25 | 1.25 | 4
Aggregate consistency penalty | 0.1 | 0.2 | 0.2
$\beta$-VAE weight | 1 | 1.3 | 2
Predictor reg. ($L_2$) | 1 | 1 | 1
Entropy reg. | 0.5 | 0.5 | 0.5
Translation range | 0.2 (image-width) | 0.2 (image-width) | 0.2 (image-width)
Rotation range | 0.4 rad | 0.5 rad | 0.4 rad
Shear range | 0.2 rad | 0.2 rad | 0.2 rad
Scale range | 1.5 | 1.5 | 1.5
Optimizer | ADAM | ADAM | ADAM
Learning rate | | |
Appendix E Dataset Details
For each dataset considered in our paper, we provide a more detailed overview of its contents and properties.
E.1 MNIST
Overview. We consider a 10-way exclusive categorization task for MNIST digits.
We use 28-by-28 pixel grayscale images.
Public availability. We will make code to extract our version available after publication.
Data statistics.
Statistics for MNIST are shown in Table 4.
split | num. examples | label distribution |
---|---|---|
labeled train | 100 | [0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1] |
unlabeled train | 49900 | [0.1 0.11 0.1 0.1 0.1 0.09 0.1 0.1 0.1 0.1] |
labeled valid | 10000 | [0.1 0.11 0.1 0.1 0.1 0.09 0.1 0.1 0.1 0.1] |
labeled test | 10000 | [0.1 0.11 0.1 0.1 0.1 0.09 0.1 0.1 0.1 0.1] |
E.2 SVHN
Overview. We consider a 10-way exclusive categorization task for SVHN digits.
We use 32x32 pixel grayscale images.
Public availability. We will make code to extract our version available after publication.
Data statistics.
Statistics for SVHN are shown in Table 5.
split | num. examples | label distribution |
---|---|---|
labeled train | 1000 | [0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10] |
unlabeled train | 62257 | [0.07 0.19 0.15 0.12 0.10 0.09 0.08 0.08 0.07 0.06] |
labeled valid | 10000 | [0.07 0.19 0.14 0.12 0.10 0.09 0.08 0.08 0.07 0.06] |
labeled test | 26032 | [0.07 0.20 0.16 0.11 0.10 0.09 0.08 0.08 0.06 0.06] |
E.3 NORB
Overview.
We use 48x48 pixel grayscale images.
Public availability. We will make code to extract our version available after publication.
Data statistics.
Statistics for NORB are shown in Table 6.
split | num. examples | label distribution |
---|---|---|
labeled train | 1000 | [0.2 0.2 0.2 0.2 0.2] |
unlabeled train | 21300 | [0.2 0.2 0.2 0.2 0.2] |
labeled valid | 2000 | [0.2 0.2 0.2 0.2 0.2] |
labeled test | 24300 | [0.2 0.2 0.2 0.2 0.2] |
E.4 CelebA
Overview.
We use 64x64 pixel grayscale images. Images were cropped to square from the CelebA aligned variant and downscaled to our 64x64 resolution for computational efficiency. Labels were generated from the provided attributes. Our dataset used 4 classes: woman/neutral face, man/neutral face, woman/smiling, man/smiling.
Public availability. We will make code to extract our version available after publication.
Data statistics.
Statistics for CelebA are shown in Table 7.
split | num. examples | label distribution |
---|---|---|
labeled train | 1000 | [0.25 0.25 0.25 0.25] |
unlabeled train | 21300 | [0.25 0.25 0.34 0.16] |
labeled valid | 2000 | [0.25 0.25 0.25 0.25] |
labeled test | 24300 | [0.27 0.23 0.35 0.15] |