Partitioning Image Representation in Contrastive Learning
Abstract
In contrastive learning in the image domain, the anchor and positive samples are forced to have representations that are as close as possible. However, forcing the two samples to have the same representation can be misleading because data augmentation makes the two samples different. In this paper, we introduce a new representation, the partitioned representation, which can learn both the common and the unique features of the anchor and positive samples in contrastive learning. The partitioned representation consists of two parts: a content part and a style part. The content part represents features common to the class, and the style part represents features specific to each sample, which can capture the effect of data augmentation. We obtain the partitioned representation simply by decomposing a loss function of contrastive learning into two terms, one on each of the two separate parts. To evaluate the representation, we adopt two frameworks: the Variational AutoEncoder (VAE), to show that content and style are separable, and Bootstrap Your Own Latent (BYOL), to confirm the generalization ability in classification. Based on the experiments, we show that our approach can separate the two types of information in the VAE framework and outperforms the conventional BYOL on classification and few-shot learning as downstream tasks.
I Introduction
Learning good image representations is important for a wide range of downstream tasks. Recently, contrastive self-supervised learning has emerged as a powerful method to learn image representations [1, 2, 3, 4, 5]. Many previous works have shown results competitive with supervised learning [6, 7, 8, 9]. Moreover, contrastive self-supervised learning can surpass supervised learning when bigger models are used with only 1% of the labels [10]. In contrastive learning, the objective is to minimize the distance between the representations of the anchor and positive samples and to maximize the distance between the representations of the anchor and negative samples. Here, the positive and negative samples are from the same and different classes, respectively.
However, although contrastive learning forces the representations of the anchor and positive samples to be close to each other, the two are actually different images. As shown in Figure 1, the multiple data augmentation operations can change an image drastically. Therefore, it seems unreasonable to treat the representations of the two samples as identical, and forcing the two samples to share the exact same representation can degrade its quality. Moreover, forcing two samples onto the same point in a single vector space can be misleading because deep neural networks tend to exploit any available features that minimize the objective on the training samples, even features that are not semantic [11, 12]. Minimizing the discrepancy in a single representation can therefore hinder the learning of semantic features for a given task and lead to a sub-optimal solution that relies on non-semantic features.
To overcome these issues, we introduce a new representation method, the partitioned representation, which splits the representation into two parts: one for common features (content) and the other for unique features (style). The content part represents class-related information shared by the anchor and positive samples. Note that the anchor and positive samples belong to the same class. The style part represents information unique to each sample, i.e., the within-class variation introduced by data augmentation.
To learn such a partitioned representation, the anchor and positive samples pull each other in the content part and push each other in the style part. This can be done by decomposing a loss function of contrastive learning for positive pairs into two terms on the two separate representations, respectively.
To evaluate the quality of our partitioned representation, we adopt two frameworks: the Variational AutoEncoder (VAE), to show the separability of content and style features, and Bootstrap Your Own Latent (BYOL), to confirm the generalization ability in classification. The experiments show that the partitioned representation separates the two types of features in the VAE framework, and that it outperforms the conventional BYOL in terms of classification accuracy on downstream tasks.
II Related Work
II-A Contrastive Self-Supervised Learning
Contrastive self-supervised learning is capable of learning representations from unlabelled data. In the image domain, contrastive methods have achieved state-of-the-art results on various downstream tasks. Each method has its own design for choosing the anchor and positive samples. Contrastive Predictive Coding for images (CPCv2) learns to extract image representations from multiple image patches [13]. It samples anchor and positive patches from the same image, which are forced to have high mutual information. Augmented Multiscale Deep InfoMax (AMDIM) maximizes the mutual information between intermediate CNN features of the anchor and positive samples [2]. The Simple Framework for Contrastive Learning of Visual Representations (SimCLR) maximizes the cosine similarity between the projection-head representations of the anchor and positive samples [7]. Data augmentation generates the anchor and positive samples as two different views of the same image, and it can be a combination of geometric operations (random crop and random flip) and color distortions (color jitter, color drop, solarization, etc.). These contrastive methods need negative samples for training, usually taken from other images in the mini-batch, and the negative and positive samples can be selected more effectively with labels, as in Supervised Contrastive Learning (SupCon) [14].
Compared with the methods above, Bootstrap Your Own Latent (BYOL) obtains image representations without any negative samples [8]. It minimizes only the Euclidean distance between the representations of the anchor and positive samples from two views, using asymmetric network architectures. In more detail, BYOL has an online network and a target network, which have different numbers of head layers on top of an encoder. The online network, fed the anchor images, predicts the output of the target network, fed the positive images. During training, the online network is updated by the objective function, while the target network is updated as an exponential moving average of the online network. That is, the representation of the anchor sample is forced to be close to that of the positive sample.
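As a minimal sketch of the exponential-moving-average target update described above (the function name and the decay rate `tau` are illustrative choices, not values taken from the paper):

```python
import copy
import torch

def update_target_network(online_net: torch.nn.Module,
                          target_net: torch.nn.Module,
                          tau: float = 0.996) -> None:
    """BYOL-style target update: the target parameters are not updated by
    gradients; they slowly track the online parameters:
    theta_target <- tau * theta_target + (1 - tau) * theta_online."""
    with torch.no_grad():
        for p_online, p_target in zip(online_net.parameters(), target_net.parameters()):
            p_target.data.mul_(tau).add_(p_online.data, alpha=1.0 - tau)

# Usage: the target network starts as a copy of the online network,
#   target = copy.deepcopy(online)
# and update_target_network(online, target) is called after every optimizer step.
```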
In this paper, we take BYOL as the baseline to focus on the relationship between the anchor and positive samples, and we modify the last representation of BYOL into the partitioned representation. We then compare our method (BYOL with the partitioned representation) to the conventional BYOL on classification and few-shot learning tasks to assess generalization ability.
II-B Dividing Representation
To the best of our knowledge, the first attempt to split an embedding vector of images was proposed to exploit the embedding more efficiently in the deep metric learning domain [15]. The main idea is a 'divide and conquer' approach: solving one big task is harder than solving a set of smaller ones. The authors divide one problem into K sub-problems that are supposed to be separate, so each part of the embedding is responsible for solving one sub-problem. By reducing the problem complexity, their method increases convergence speed and improves generalization. They divide the embedding dimensions and the dataset into multiple parts, and in each iteration only one part is updated by its own learner out of the K learners. However, this does not mean that content and style features are separated across the sub-problems; rather, the method is closer to an ensemble.
We can also view the latent space of the Variational AutoEncoder (VAE) [16] as a kind of partitioned representation. VAE is a generative model consisting of an encoder and a decoder. The encoder produces a latent vector of the input, and the decoder reconstructs the input from the latent vector; the latent vector is then regarded as a representation of the input. One of its advantages is the ability to learn a disentangled representation where one dimension of the latent space captures one semantic or style feature. Many previous works aim to encourage a disentangled latent space [17, 18, 19, 20], but full disentanglement is not guaranteed without inductive biases [21]. In this paper, however, we aim for a representation in which some dimensions carry class-related features while the others carry class-independent features.
III Proposed Method
In this section, we propose a new method to train a partitioned representation in the contrastive learning framework.
We start from the fact that even though the anchor and positive samples belong to the same class, they are different. By forcing two samples to be close, a model would only focus on common features of the two samples or even non-semantic features to minimize the objective. Therefore, we partition the image representation into two parts: content and style parts. The content part is supposed to learn common features between the anchor and positive samples based on class-related information, and the style part is supposed to represent unique features of each sample focusing on non-class information. Our method considers only the relationship between the anchor and positive samples, not negative samples from different classes. As in the conventional contrastive loss, we force the representations in the content part to be close. However, the style part is trained to push the samples far apart. The objective function between the anchor and positive samples can be defined by
$$\mathcal{L}_{\text{pair}}(x, x^{+}) = d\big(f_c(x), f_c(x^{+})\big) \;-\; \lambda\, d\big(f_s(x), f_s(x^{+})\big) \qquad (1)$$
where $x$ and $x^{+}$ are an anchor and a positive sample, $d(\cdot,\cdot)$ is the distance used by the underlying contrastive framework, and $f$ is a neural network with a content part $f_c$ and a style part $f_s$, parametrized by trainable parameters. The hyperparameter $\lambda$ controls the weight between the content and style terms. We train the partitioned representation in two frameworks: VAE and BYOL [16, 8].
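For illustration, the sketch below implements the pair loss of Equation 1 using the squared Euclidean distance as $d$; the function and argument names (`partitioned_pair_loss`, `content_dim`, `lam`) are ours, and the choice of distance is an assumption rather than the paper's specification.

```python
import torch
import torch.nn.functional as F

def partitioned_pair_loss(z_anchor: torch.Tensor,
                          z_positive: torch.Tensor,
                          content_dim: int,
                          lam: float = 1.0) -> torch.Tensor:
    """Pull the content parts together and push the style parts apart (Eq. 1).

    z_anchor, z_positive: (batch, d) representations of an anchor/positive pair.
    The first `content_dim` dimensions are the content part; the rest are style.
    """
    c_a, s_a = z_anchor[:, :content_dim], z_anchor[:, content_dim:]
    c_p, s_p = z_positive[:, :content_dim], z_positive[:, content_dim:]

    pull = F.mse_loss(c_a, c_p)   # content: minimize the distance
    push = F.mse_loss(s_a, s_p)   # style: maximize the distance
    return pull - lam * push
```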
First, we apply our method in the VAE framework to see whether the content and style features are separated. The goal of VAE is to maximize the likelihood of the input by maximizing a lower bound of the likelihood called the Evidence Lower BOund (ELBO). Equivalently, the VAE minimizes the loss in Equation 2, which has two terms: a reconstruction error from the latent space and the KL divergence between the approximate posterior produced by the encoder and a prior distribution. Without changing these two terms, we add the pair loss of Equation 1, computed on the mean vectors $\mu$ and $\mu^{+}$ produced by the encoder, as in Equation 3. We choose the positive sample from the same class as the anchor sample, as presented in Figure 2.
$$\mathcal{L}_{\text{VAE}}(x) = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big) \qquad (2)$$
$$\mathcal{L}(x, x^{+}) = \mathcal{L}_{\text{VAE}}(x) + \mathcal{L}_{\text{pair}}(\mu, \mu^{+}) \qquad (3)$$

To examine the generalization ability of the partitioned representation, we also train it in the BYOL framework. BYOL minimizes the distance between the prediction vector of the anchor sample from the online network and the projection vector of the positive sample from the target network. It generates the two samples by applying multiple augmentations to one image, such as a random horizontal flip, random crop, Gaussian blur, and color distortions. As illustrated in Figure 3, we manually partition the last representation of both networks into two parts and apply Equation 1 to it directly, instead of minimizing the distance over the entire representation.
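A rough sketch of such a two-view augmentation pipeline using torchvision is shown below; the exact probabilities and jitter strengths are illustrative and not taken from the paper.

```python
import torchvision.transforms as T

def make_view_transform(image_size: int = 96) -> T.Compose:
    """One stochastic view transform; applying it twice to the same image
    yields the anchor and positive views."""
    return T.Compose([
        T.RandomResizedCrop(image_size),
        T.RandomHorizontalFlip(p=0.5),
        T.RandomApply([T.ColorJitter(0.4, 0.4, 0.2, 0.1)], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.RandomApply([T.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0))], p=0.5),
        T.ToTensor(),
    ])

# transform = make_view_transform()
# x_anchor, x_positive = transform(img), transform(img)
```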

IV Experiment Results
We conducted experiments with the two frameworks, VAE and BYOL. The proposed VAE model selects positive samples in a supervised way (from the same class as the anchor), while the proposed BYOL model generates the anchor and positive samples by applying data augmentation to the same image.
IV-A Variational AutoEncoder
IV-A1 Qualitative Results
In this section, we check qualitatively whether the partitioned representation can split content and style features, in particular with VAE. A traversal map of the VAE latent space shows what feature is learned by each dimension when the value of only that dimension is changed. We use two datasets: Fashion MNIST [22] and colored MNIST, where we add a color feature to MNIST [23].
We train a VAE model with the partitioned representation and then visualize the latent space to see what is represented in each dimension. The model has three convolution layers in the encoder and the decoder, and a 10-dimensional latent space with 7 dimensions for the content part and 3 for the style part. For each layer, the kernel size, stride, and padding are 3, 2, and 1, respectively, and the channel sizes of the encoder layers are 32, 64, and 128 (in reverse order for the decoder). We use the Adam optimizer with a learning rate of 0.001. For contrastive learning, positive images are randomly selected from the same class as the anchor image.
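A minimal PyTorch sketch of such an encoder is given below; only the convolutional hyperparameters and the 7/3 latent split follow the description above, while the fully connected heads, the feature-map size, and the class name are our assumptions.

```python
import torch
import torch.nn as nn

class PartitionedEncoder(nn.Module):
    """Three stride-2 conv layers (32, 64, 128 channels, kernel 3, padding 1)
    and a 10-d latent space split into 7 content and 3 style dimensions."""

    def __init__(self, in_channels: int = 1, content_dim: int = 7, style_dim: int = 3):
        super().__init__()
        self.content_dim, self.style_dim = content_dim, style_dim
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat_dim = 128 * 4 * 4  # assumes 28x28 inputs: 28 -> 14 -> 7 -> 4
        latent_dim = content_dim + style_dim
        self.fc_mu = nn.Linear(feat_dim, latent_dim)
        self.fc_logvar = nn.Linear(feat_dim, latent_dim)

    def forward(self, x):
        h = self.conv(x)
        # The first 7 dimensions of mu form the content part, the last 3 the style part.
        return self.fc_mu(h), self.fc_logvar(h)

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```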
We first train the model on the Fashion MNIST dataset, and Figure 4 shows traversal maps of the latent space. In each traversal map, corresponding to one row, the center image is the image reconstructed from the unmodified latent vector. Given the variance of each dimension, we perturb one dimension of the latent vector at a time and reconstruct images from the manipulated latent vectors. As shown in Figure 4, the first seven dimensions (content part) show variation from one class to another, representing class-related features. In contrast, the last three dimensions (style part) show variation within the class of the input image, which means they do not represent class-dependent features. Note that the features common to the anchor and positive images are class-related because the two images come from the same class, while the unique features concern the style of each image.

For further investigation, we inject a content or style feature on purpose with colored MNIST. There are two cases of color injection, a biased and an unbiased case, as shown in Figure 5. In the biased case, each digit has its own unique color, so color is a class-related feature shared by the anchor and positive samples; the channel values for each biased color are listed in Table I (a minimal construction sketch follows the table). In the unbiased case, every image is colored randomly, so color is a class-independent feature. Note that the anchor and positive samples share the digit information, not style attributes such as rotation, thickness, and font.
Class | Red | Green | Blue |
---|---|---|---|
0 | 255 | 100 | 0 |
1 | 0 | 100 | 0 |
2 | 188 | 143 | 143 |
3 | 255 | 0 | 0 |
4 | 255 | 215 | 0 |
5 | 0 | 255 | 0 |
6 | 65 | 105 | 225 |
7 | 0 | 225 | 255 |
8 | 0 | 0 | 255 |
9 | 255 | 20 | 147 |
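Assuming the standard single-channel MNIST digits, a minimal sketch of the two coloring schemes could look as follows; the helper name `colorize` and the tensor layout are our own choices, while the per-class RGB values come from Table I.

```python
import torch

# Per-class RGB values from Table I (biased case), scaled to [0, 1].
CLASS_COLORS = torch.tensor([
    [255, 100,   0], [  0, 100,   0], [188, 143, 143], [255,   0,   0],
    [255, 215,   0], [  0, 255,   0], [ 65, 105, 225], [  0, 225, 255],
    [  0,   0, 255], [255,  20, 147],
], dtype=torch.float32) / 255.0

def colorize(gray: torch.Tensor, label: int, biased: bool = True) -> torch.Tensor:
    """Turn a (1, H, W) grayscale digit in [0, 1] into a (3, H, W) colored digit.

    biased=True: the color is determined by the class label (class-related feature).
    biased=False: the color is sampled uniformly at random (class-independent feature).
    """
    color = CLASS_COLORS[label] if biased else torch.rand(3)
    return gray * color.view(3, 1, 1)  # broadcast (1,H,W) * (3,1,1) -> (3,H,W)
```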
As shown in Figure 6, in the biased color case (first row), color appears in the content part because it is a feature shared by the anchor and positive images: color changes only when the class (digit) changes, and the style features appear in the style part. In the unbiased color case (second row), color appears in the style part, together with the other style features, because random color is one of the prevalent within-class variations in the dataset. We observe that, by partitioning the representation into the two parts, our approach successfully separates the two types of features in the VAE framework.
As an application of the partitioned representation, it is possible to generate a new sample by switching the content and style parts of two samples, as in Figure 7. In Figure 7(a), the first two images of each row are the input images to a VAE trained on unbiased colored MNIST, and the last two images are the reconstructions with swapped styles. Figure 7(b) shows how the new images are obtained: we exchange the style parts of the mean vectors of the two input images and reconstruct new samples from the resulting mean vectors, each consisting of the content part of one image and the style part of the other. As a result, each new sample shows the same digit as one input rendered in the style of the other.
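A small sketch of this style-swapping step on the mean vectors is shown below; the helper `swap_style` and the 7-dimensional content split are assumptions consistent with the VAE described above.

```python
import torch

def swap_style(mu_a: torch.Tensor, mu_b: torch.Tensor, content_dim: int = 7):
    """Exchange the style parts of two mean vectors while keeping their content parts.

    Decoding the returned vectors yields the original digits rendered in each
    other's style, as in Figure 7."""
    new_a = torch.cat([mu_a[..., :content_dim], mu_b[..., content_dim:]], dim=-1)
    new_b = torch.cat([mu_b[..., :content_dim], mu_a[..., content_dim:]], dim=-1)
    return new_a, new_b

# Usage with a hypothetical decoder: x_a_restyled = decoder(swap_style(mu_a, mu_b)[0])
```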
IV-A2 Quantitative Results
In this section, we investigate the effect of the partitioned representation on classification tasks. We first train an additional linear classifier on the mean vectors of the trained VAE model. To probe the model, we add Gaussian noise to either the style part or the content part of the latent variable. The test cases are summarized as follows:
- Noise on Style: Gaussian noise is added to the style part of the mean vector.
- Noise on Content: Gaussian noise is added to the content part of the mean vector.
By adding noise, we perturb the style or content part of the mean vector to check if the proposed method can split content and style features.
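A minimal sketch of this perturbation, assuming the 7/3 content/style split used above (the helper name `perturb` and the intensity argument `t` are ours):

```python
import torch

def perturb(mu: torch.Tensor, part: str = "style",
            content_dim: int = 7, t: float = 1.0) -> torch.Tensor:
    """Add Gaussian noise of intensity t to either the style or the content part
    of the mean vector; the other part is left untouched."""
    noisy = mu.clone()
    if part == "style":
        noisy[..., content_dim:] += t * torch.randn_like(noisy[..., content_dim:])
    else:
        noisy[..., :content_dim] += t * torch.randn_like(noisy[..., :content_dim])
    return noisy

# The linear classifier trained on clean mean vectors is then evaluated on
# perturb(mu, "style", t=t) and perturb(mu, "content", t=t).
```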
Table II shows the results for both datasets. For the VAE model trained on the biased color dataset, the classifier achieves 99% accuracy on the clean test set. When Gaussian noise is added, the representation is far more robust to noise on the style part than to noise on the content part, because the style features are irrelevant to the classification task while the content features are class-related. The same holds for the unbiased color dataset. Another interesting point is that, under noise on the content part, the biased color case degrades less than the unbiased color case. Since the biased color case has a strong class-related feature, color, embedded in the content part, its representation is more robust against random noise on that part.
Dataset | Clean Test (%) | Noise on Style (%) | Noise on Content (%)
---|---|---|---
Biased Color | 99.01 | 98.18 | 72.36
Unbiased Color | 91.55 | 90.32 | 43.54
Additionally, Table III shows how the accuracy changes as the intensity of the noise increases. Given Gaussian noise $\epsilon$ and an intensity level $t$, we add $t \cdot \epsilon$ to the content or style part. For both colored datasets, the style part remains considerably more robust than the content part, especially as the noise intensity increases.
Dataset | Noise on | t = 1 | t = 2 | t = 3 | t = 4
---|---|---|---|---|---
Biased Color | Style | 98.2 | 94.2 | 86.4 | 78.0
Biased Color | Content | 72.4 | 41.5 | 29.0 | 24.1
Unbiased Color | Style | 90.3 | 86.2 | 80.9 | 74.7
Unbiased Color | Content | 43.5 | 25.4 | 19.7 | 16.7
IV-B BYOL
IV-B1 Quantitative Result
In this section, we evaluate the generalization ability of the partitioned representation on various tasks. In all experiments, we compare our approach (BYOL with the partitioned representation) to the conventional BYOL in terms of downstream-task accuracy. For the partitioned representation, we divide the 256-dimensional representation of the conventional BYOL into 192 dimensions for the content part and 64 dimensions for the style part. Our implementation is based on the official implementation of [8].
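A sketch of how the BYOL objective could be modified accordingly is given below; using BYOL's normalized-MSE (cosine) distance for both terms, as well as the names `byol_partitioned_loss` and `lam`, are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def byol_partitioned_loss(pred_online: torch.Tensor,   # (batch, 256) online prediction
                          proj_target: torch.Tensor,   # (batch, 256) target projection (detached)
                          content_dim: int = 192,
                          lam: float = 1.0) -> torch.Tensor:
    """Replace BYOL's single distance term with the partitioned objective:
    pull the first 192 dimensions (content) together and push the last 64
    dimensions (style) apart between the two views."""
    p_c, p_s = pred_online[:, :content_dim], pred_online[:, content_dim:]
    z_c, z_s = proj_target[:, :content_dim], proj_target[:, content_dim:]

    # BYOL-style normalized MSE on each part: 2 - 2 * cosine similarity.
    pull = 2 - 2 * F.cosine_similarity(p_c, z_c, dim=-1)
    push = 2 - 2 * F.cosine_similarity(p_s, z_s, dim=-1)
    return (pull - lam * push).mean()
```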
First, we train the two BYOL models with a ResNet18 encoder on the STL10 dataset, which consists of 100k unlabeled images and 13k labeled images over ten classes (5k for training and 8k for testing) [24]. We follow the conventional BYOL for the data augmentation and training strategy [8], and the linear evaluation procedure described in [7] to measure the linear separability of the representations. We first pretrain both BYOL models on the unlabeled data for 800 epochs and then train a linear classifier on the 5k training images over the frozen encoder. Finally, we measure the classification accuracy on the test data for the linear evaluation task. As shown in Table IV, our method achieves 79.7% accuracy, which is higher than the conventional BYOL (78.3%).
We also conducted the same experiment with BYOL models trained for 40 epochs with a batch size of 256 on ImageNet [25], which consists of 1.28M training images and 50k validation images. As presented in Table IV, our method achieves higher accuracy than the conventional BYOL on the linear evaluation task (57.0% vs. 56.0%). We believe that the partitioned representation gives the model more flexibility to extract effective features, which leads to better generalization.
Dataset | Encoder | BYOL (%) | Ours (%)
---|---|---|---
STL10 | ResNet18 | 78.3 | 79.7
ImageNet | ResNet50 | 56.0 | 57.0
Lastly, we experiment with a few-shot learning task as a downstream task of the pretrained encoder, since few-shot learning from a pretrained encoder can measure its generalization ability [26]. Few-shot learning combined with meta-learning classifies images with only a few training samples, and the test images come from classes unseen during training. It is typically set up as an 'N-shot K-way' task: we randomly take K classes out of the entire set of classes and draw N samples for each class. In every episode, the model is adapted with these N×K samples and then tested on the unseen classes.
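For concreteness, a small sketch of how one such episode can be sampled (the helper `sample_episode` and the query-set size are illustrative):

```python
import random
from collections import defaultdict

def sample_episode(labels, n_shot: int = 5, k_way: int = 5, n_query: int = 15):
    """Sample one 'n_shot k_way' episode: pick k_way classes at random, then
    draw n_shot support and n_query query indices for each chosen class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    classes = random.sample(list(by_class), k_way)
    support, query = [], []
    for c in classes:
        chosen = random.sample(by_class[c], n_shot + n_query)
        support += chosen[:n_shot]
        query += chosen[n_shot:]
    return support, query  # indices into the dataset
```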
In [26], it was proposed to pretrain an embedding network, AMDIM [2], in a self-supervised way and then fine-tune it for few-shot classification. The method has two training phases: self-supervised pretraining and meta fine-tuning. We replace the embedding network of [26] with either the conventional BYOL or our modified BYOL.
We use the MiniImageNet dataset [27], which contains 60k images from 100 classes, and follow the data split of the baseline, dividing the 100 classes into 64 for training, 16 for validation, and 20 for testing [26]. First, the embedding network is pretrained on the 100 classes, and meta-learning is then applied to fine-tune the network as described in [26]. We follow the same training process for the conventional BYOL and for our BYOL with the partitioned representation, both based on ResNet50, on the '1-shot 5-way' and '5-shot 5-way' classification tasks.
As shown in Table V, the conventional BYOL outperforms the baseline on both tasks because it has better transferability than AMDIM, and our method improves the accuracy further. The partitioned representation extracts class-related and class-unrelated features precisely at the same time; that is, it disentangles the information relevant for classification, which makes the downstream classification more efficient.
Task | BYOL (%) | Ours (%)
---|---|---
1-shot 5-way | () | ()
5-shot 5-way | () | ()
V Conclusion
We introduced a new representation method, the partitioned representation, and trained it successfully in a contrastive learning setting with only anchor and positive samples. We obtain the partitioned representation by simply dividing a single representation into two parts, content and style, which can represent the common and unique features of the anchor and positive samples at the same time. We showed that our method can separate the two types of features in the latent space of a VAE, and that the partitioned representation yields a better generalized and more transferable representation with BYOL.
In this paper, we have focused on the relationship between only the anchor and positive samples. In future work, the method can be extended to include negative samples as well.
Acknowledgement
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea funded by the Ministry of Education (NRF-2022R1A2C1012633), and by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (No. 2018-0-00749, Development of virtual network management technology based on artificial intelligence).
References
- [1] A. v. d. Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv preprint arXiv:1807.03748, 2018.
- [2] P. Bachman, R. D. Hjelm, and W. Buchwalter, “Learning representations by maximizing mutual information across views,” arXiv preprint arXiv:1906.00910, 2019.
- [3] Y. M. Asano, C. Rupprecht, and A. Vedaldi, “Self-labelling via simultaneous clustering and representation learning,” arXiv preprint arXiv:1911.05371, 2019.
- [4] Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16. Springer, 2020, pp. 776–794.
- [5] I. Misra and L. v. d. Maaten, “Self-supervised learning of pretext-invariant representations,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6707–6717.
- [6] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
- [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in International conference on machine learning. PMLR, 2020, pp. 1597–1607.
- [8] J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. H. Richemond, E. Buchatskaya, C. Doersch, B. Á. Pires, Z. D. Guo, M. G. Azar, B. Piot, K. Kavukcuoglu, R. Munos, and M. Valko, “Bootstrap your own latent: A new approach to self-supervised learning,” arXiv preprint arXiv:2006.07733, 2020.
- [9] X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 15750–15758.
- [10] T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. Hinton, “Big self-supervised models are strong semi-supervised learners,” arXiv preprint arXiv:2006.10029, 2020.
- [11] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry, “Adversarial examples are not bugs, they are features,” arXiv preprint arXiv:1905.02175, 2019.
- [12] F. Ahmed, Y. Bengio, H. van Seijen, and A. Courville, “Systematic generalisation with group invariant predictions,” in International Conference on Learning Representations, 2020.
- [13] O. Henaff, “Data-efficient image recognition with contrastive predictive coding,” in International Conference on Machine Learning. PMLR, 2020, pp. 4182–4192.
- [14] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 18661–18673, 2020.
- [15] A. Sanakoyeu, V. Tschernezki, U. Buchler, and B. Ommer, “Divide and conquer the embedding space for metric learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 471–480.
- [16] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” arXiv preprint arXiv:1312.6114, 2013.
- [17] I. Higgins, L. Matthey, A. Pal, C. P. Burgess, X. Glorot, M. M. Botvinick, S. Mohamed, and A. Lerchner, “Beta-vae: Learning basic visual concepts with a constrained variational framework,” in ICLR, 2017.
- [18] H. Kim and A. Mnih, “Disentangling by factorising,” in International Conference on Machine Learning. PMLR, 2018, pp. 2649–2658.
- [19] R. T. Chen, X. Li, R. Grosse, and D. Duvenaud, “Isolating sources of disentanglement in variational autoencoders,” arXiv preprint arXiv:1802.04942, 2018.
- [20] S. Hahn and H. Choi, “Disentangling latent factors of variational auto-encoder with whitening,” in International Conference on Artificial Neural Networks. Springer, 2019, pp. 590–603.
- [21] F. Locatello, S. Bauer, M. Lucic, G. Raetsch, S. Gelly, B. Schölkopf, and O. Bachem, “Challenging common assumptions in the unsupervised learning of disentangled representations,” in international conference on machine learning. PMLR, 2019, pp. 4114–4124.
- [22] H. Xiao, K. Rasul, and R. Vollgraf, “Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms,” arXiv preprint arXiv:1708.07747, 2017.
- [23] L. Deng, “The mnist database of handwritten digit images for machine learning research,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 141–142, 2012.
- [24] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 215–223.
- [25] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
- [26] D. Chen, Y. Chen, Y. Li, F. Mao, Y. He, and H. Xue, “Self-supervised learning for few-shot image classification,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 1745–1749.
- [27] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra et al., “Matching networks for one shot learning,” Advances in neural information processing systems, vol. 29, pp. 3630–3638, 2016.