Clustering-friendly Representation Learning for Enhancing Salient Features
Abstract
Recently, representation learning with contrastive learning algorithms has been successfully applied to challenging unlabeled datasets. However, these methods cannot distinguish important features from unimportant ones in purely unsupervised settings, and definitions of importance vary according to the downstream task or analysis goal, such as the identification of objects or backgrounds. In this paper, we focus on unsupervised image clustering as the downstream task and propose a representation learning method that enhances features critical to the clustering task. We extend a clustering-friendly contrastive learning method and incorporate a contrastive analysis approach, which utilizes a reference dataset to separate important features from unimportant ones, into the design of loss functions. In an experimental evaluation of image clustering on three datasets with characteristic backgrounds, our method achieves higher clustering scores on all datasets than conventional contrastive analysis and deep clustering methods.
1 Introduction
Clustering is one of the most fundamental methods in unsupervised machine learning; it aims to group objects based on measures of similarity between unlabeled samples. Advances in deep learning have produced deep clustering methods, in which feature extraction with deep neural networks (DNNs) and a clustering process are integrated at various levels [23]. Self-supervised learning (SSL) has also recently been attracting attention in the representation learning field. SSL methods learn representations by solving user-defined pretext tasks, such as instance discrimination and contrastive learning, and their impressive performance in capturing visual features has been demonstrated even for complex real-world datasets [4, 21, 3, 5]. High-resolution representations extracted by SSL are naturally expected to be applicable to clustering. Several approaches that simultaneously perform SSL and clustering have been proposed and have achieved state-of-the-art performance [13, 10, 18, 20].
However, simply capturing features at high resolution, or in ways that are easy for machines to exploit, does not necessarily improve clustering scores, and can even worsen them, because human and algorithmic criteria for feature importance differ. As a concrete example, Fig. 1 shows instance discrimination and feature decorrelation (IDFD) [18], an SSL method, forming a cluster based on the features of a prominent mesh structure (a foreground component) rather than on objects such as a leopard or an orange.
[Fig. 1: Clusters formed by IDFD in which images are grouped by a prominent mesh pattern rather than by the depicted objects.]
Even though each feature is correctly extracted, this is not a desirable situation for clustering that focuses on objects. Similar problems can arise when such methods are applied to real-world problems. In inspection images taken at a factory, for example, it is not uncommon for complex product patterns to be more prominent than the defects targeted for classification [17]. For this reason, models require proper guidance so that they pick up only the features that are important for the targeted clustering.
Contrastive analysis (CA) [9] is one technique for providing inductive bias with unsupervised methods to distinguish important from unimportant components. In CA settings, two datasets are prepared: a target dataset that includes salient features of interest for clustering and a background dataset that contains only information that is ineffectual for clustering and should be discarded. Several unsupervised methods making use of CA settings have recently been proposed [1, 8, 2]. These architectures have successfully isolated and extracted variables of interest from backgrounds, but they use classical architectures and learning schemes such as matrix decomposition and simple encoder-decoder structures, and they mainly minimize reconstruction loss functions. There is thus room for improvement in the ability to learn representations for more complex real-world datasets.
In this paper, we propose contrastive IDFD (cIDFD), a method that combines a CA setting with a recent SSL scheme based on instance discrimination. cIDFD learns unimportant features at high resolution from the background dataset via normal instance discrimination and feature decorrelation. Similarities among the unimportant features of target samples are then used to guide the network toward extracting only the features that matter for the desired clustering. This feature extraction is performed by minimizing a newly defined loss function, the contrastive instance discrimination loss. We apply our method to various datasets with characteristic background patterns and show improvements over conventional methods on clustering tasks.
Our main contributions are as follows:
• We propose a new clustering method based on instance discrimination with a CA setting that selectively extracts features meaningful for the user-defined clustering objective.
• We conduct experimental evaluations of clustering on three challenging datasets with characteristic backgrounds. The results show that cIDFD achieves higher clustering scores than conventional contrastive analysis models and state-of-the-art SSL methods.
2 Related Work
Contrastive learning has recently been proposed as a pretext task for SSL: positive pairs generated by data augmentation of the same sample are pulled closer together, while negative pairs generated from different samples are pushed apart [21, 4]. Features learned through these pretext tasks generalize well and achieve excellent performance on a variety of downstream tasks. Although negative pairs play an important role in avoiding collapsed solutions, they bring drawbacks such as the need for large batch sizes, and thus several methods that do not use negative pairs have been proposed [5, 3].
Deep clustering has shown superior performance in a variety of areas. End-to-end learning was proposed to simultaneously perform representation learning and clustering [23]. To improve representation learning ability, recent works integrate clustering with SSL, in particular contrastive learning [20, 10, 19, 18, 14, 6, 13, 22]. IDFD [18] focuses on the representation learning phase and proposes a clustering-friendly representation learning method that combines an instance discrimination loss with a feature decorrelation loss motivated by the properties of classical spectral clustering; it achieves high performance despite using simple $k$-means in the clustering phase. MiCE [19] and CC [10] focus on end-to-end learning by integrating contrastive learning and clustering. SCAN [20], RUC [14], NNM [6], TCL [22], and SPICE [13] focus on the clustering phase and are multi-stage deep clustering methods.
The principle of contrastive analysis [9] was proposed to separate features to be emphasized from irrelevant features that should be suppressed. cPCA [1] utilizes this principle to separate the principal components to be emphasized from those to be suppressed by introducing a dataset without the features of interest. Contrastive multivariate singular spectrum analysis (cSSA) [8] extends the cPCA concept to time-series datasets. The contrastive VAE (cVAE) [2] was developed to capture complex, nonlinear relations between latent variables and inputs.
3 Proposed Method
Our goal is to learn only those representations that are appropriate for clustering a target dataset $X^t$ and to cluster its samples into meaningful groups, under the condition that a background dataset $X^b$ can be prepared. cIDFD utilizes $X^b$ to reject the influence of background features that are unimportant with respect to clustering $X^t$. In this section, we describe the model architecture, the loss computation, and the actual training process of cIDFD.
[Fig. 2: Overall framework of cIDFD, with a background branch and a target branch.]
3.1 The framework of cIDFD
As Fig. 2 shows, we use two embedding functions, $f_\theta$ and $g_\psi$, which map images to feature vectors distributed on a $d$-dimensional unit sphere. These functions are modeled as deep neural networks with parameters $\theta$ and $\psi$, typically a CNN backbone followed by an L2-normalization layer. $g_\psi$ is learned from the background dataset $X^b$ and is assigned the role of extracting the background features $b_i = g_\psi(x^b_i)$ and $\tilde{b}_i = g_\psi(x^t_i)$ for image samples from $X^b$ and $X^t$, respectively. The other embedding function, $f_\theta$, is learned so as to extract the target features $v_i = f_\theta(x^t_i)$ from $X^t$. To train the background branch $g_\psi$, we minimize the instance discrimination and feature decorrelation losses, following the same process as in previous work [18]. For the target branch $f_\theta$, the parameters are optimized with a newly defined loss function, the contrastive instance discrimination loss, together with the feature decorrelation loss. In the computation of the contrastive instance discrimination loss, a weighted similarity between negative samples, which incorporates the influence of the background features $\tilde{b}_i$ and $\tilde{b}_j$, is used as input to a non-parametric softmax classifier.
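To make the framework concrete, here is a minimal PyTorch sketch of the two branches; the ResNet-18 backbone, the 128-dimensional output, and all identifiers are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class Encoder(nn.Module):
    """CNN backbone followed by L2 normalization onto the d-dimensional unit sphere."""
    def __init__(self, dim: int = 128):
        super().__init__()
        backbone = models.resnet18(weights=None)  # assumed backbone choice
        backbone.fc = nn.Linear(backbone.fc.in_features, dim)
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.backbone(x)
        return nn.functional.normalize(z, dim=1)  # ||z|| = 1

f_theta = Encoder()  # target branch, trained in the second stage
g_psi = Encoder()    # background branch, trained first and then frozen
```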
3.2 Loss for background feature extraction
We apply the instance discrimination proposed in [21] to learn representations from the background dataset. For given samples $x^b_i \in X^b$ $(i = 1, \dots, n_b)$, the corresponding representations are $b_i = g_\psi(x^b_i)$, where each $b_i$ is normalized so that $\|b_i\| = 1$. The probability of a representation $b$ being assigned to the $i$th class is given by the non-parametric softmax formulation
$$P(i \mid b) = \frac{\exp(b_i^\top b / \tau_1)}{\sum_{j=1}^{n_b} \exp(b_j^\top b / \tau_1)} \qquad (1)$$
where the dot product $b_i^\top b$ measures how well $b$ matches the $i$th class and $\tau_1$ is a temperature parameter that determines the concentration of the distribution. The learning objective is to maximize the joint probability $\prod_{i=1}^{n_b} P(i \mid b_i)$, or equivalently to minimize the negative log-likelihood
$$L_I = -\sum_{i=1}^{n_b} \log P(i \mid b_i) \qquad (2)$$
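As a hedged illustration, Eqs. (1)–(2) can be computed batch-wise against a memory bank of stored representations, following the non-parametric scheme of [21]; the bank handling and the default temperature below are assumptions.

```python
import torch
import torch.nn.functional as F

def instance_discrimination_loss(v: torch.Tensor, bank: torch.Tensor,
                                 idx: torch.Tensor, tau1: float = 1.0) -> torch.Tensor:
    """Eq. (2): negative log-likelihood of the non-parametric softmax in Eq. (1).

    v:    (B, d) L2-normalized features of the current batch
    bank: (n, d) stored features of all instances (memory bank)
    idx:  (B,)   instance indices of the batch samples
    """
    logits = v @ bank.t() / tau1         # (B, n) similarities b_j . b / tau_1
    return F.cross_entropy(logits, idx)  # mean of -log P(i | b_i)
```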
We use a constraint for orthogonal features proposed in [18]. The objective is to minimize
$$L_F = -\sum_{l=1}^{d} \log \frac{\exp(q_l^\top q_l / \tau_2)}{\sum_{m=1}^{d} \exp(q_m^\top q_l / \tau_2)} \qquad (3)$$
where $q_l$ are latent feature vectors defined by the transpose of the latent vectors $b_i$, $d$ is the dimensionality of the representations, and $\tau_2$ is the temperature parameter.
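A corresponding sketch of Eq. (3), reusing the imports above; treating the transposed feature matrix of a mini-batch as the latent vectors $q_l$ is a simplification of the full-dataset definition.

```python
def feature_decorrelation_loss(v: torch.Tensor, tau2: float = 2.0) -> torch.Tensor:
    """Eq. (3): softmax-formulated orthogonality constraint, as in IDFD [18].

    v: (B, d) batch features; the rows of v.t() play the role of the q_l.
    """
    q = F.normalize(v.t(), dim=1)           # (d, B) one vector per latent dimension
    logits = q @ q.t() / tau2               # (d, d) similarities between dimensions
    labels = torch.arange(q.size(0), device=v.device)
    return F.cross_entropy(logits, labels)  # pushes q_l . q_m toward 0 for l != m
```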
3.3 Loss for target feature extraction
When the normal instance discrimination loss is used, all negative pairs are separated equally. As a result, samples can be grouped together only when both their target and background features are extremely similar. Fig. 3 shows a conceptual illustration of such a situation. Our motivation is to design a loss function that separates pairs having the same background but different target features while attracting pairs having the same target features, independent of the background. For example, in Fig. 3, the former corresponds to pairs of zeros and ones sharing the same diagonal background lines, while the latter corresponds to pairs of ones, regardless of background type.
This loss function is realized by introducing weight coefficients $w_{ij}$ that depend on the pairwise similarity between the background features $\tilde{b}_i = g_\psi(x^t_i)$ of the target samples. Our new loss function, namely, the contrastive instance discrimination loss, is defined as
$$L_{cI} = -\sum_{i=1}^{n_t} \log \frac{\exp(v_i^\top v_i / \tau_1)}{\sum_{j=1}^{n_t} w_{ij} \exp(v_j^\top v_i / \tau_1)} \qquad (4)$$
where $w_{ij}$ is the weight coefficient between the $i$-th and $j$-th samples, which determines how strongly the target features $v_i$ and $v_j$ are pulled apart in the learning process. This coefficient is formulated simply as
$$w_{ij} = \exp\left(\tilde{b}_i^\top \tilde{b}_j / \tau_3\right) \qquad (5)$$
When the background features of the $i$-th and $j$-th images are similar, $w_{ij}$ becomes large, so their repulsive force in the target feature space increases. This effect, shown in Fig. 3, reduces the similarities between the feature vectors of zeros with diagonal lines and those of ones with diagonal lines. $w_{ij}$ also becomes large for pairs having the same background and the same target, but the usual contrastive learning effect also acts on such pairs, so their repulsion is weakened. For pairs with small $w_{ij}$, the repulsive force is relatively weak, and thus samples with the same target and different backgrounds are only weakly separated. We experimentally demonstrate that our method works according to these perspectives; the details are given in Section 4.4.
The temperature parameters $\tau_1$ and $\tau_2$ control the distributions of the feature vectors, and $\tau_3$ controls the magnitude of the weight coefficients $w_{ij}$. The differences from IDFD are the background module $g_\psi$ and the weights $w_{ij}$. In the limit $\tau_3 \to \infty$, all $w_{ij} \to 1$ and cIDFD is consistent with IDFD, because the contribution from the background module to the target module disappears.
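Continuing the sketch above, Eq. (4) with the weights of Eq. (5) differs from the plain instance discrimination loss only by adding the background similarities in log space; `b_bank`, a bank of background features $\tilde{b}_j$ for all target samples, is assumed to be precomputed with $g_\psi$.

```python
def contrastive_id_loss(v: torch.Tensor, bank: torch.Tensor, idx: torch.Tensor,
                        b_tilde: torch.Tensor, b_bank: torch.Tensor,
                        tau1: float = 1.0, tau3: float = 1.0) -> torch.Tensor:
    """Eq. (4) with the weights of Eq. (5), as reconstructed above.

    v, b_tilde:   (B, d) target / background features of the batch
    bank, b_bank: (n_t, d) stored target / background features of all target samples
    """
    sim = v @ bank.t() / tau1             # (B, n_t) target similarities v_j . v_i
    log_w = b_tilde @ b_bank.t() / tau3   # log w_ij from Eq. (5)
    weighted = sim + log_w                # w_ij * exp(v_j . v_i / tau_1) in log space
    pos = sim.gather(1, idx.unsqueeze(1)).squeeze(1)  # the v_i . v_i / tau_1 terms
    return (torch.logsumexp(weighted, dim=1) - pos).mean()
```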
3.4 Two-stage learning
We learn the embedding functions $g_\psi$ and $f_\theta$ separately. In the first stage, the branch $g_\psi$ learns the background dataset $X^b$ by minimizing the objective function
$$L_b = L_I + L_F \qquad (6)$$
During this stage, the input samples come only from $X^b$, and the parameters $\theta$ of $f_\theta$ are not updated. After learning, $g_\psi$ works as an extractor of the features to be discarded. In the second stage, we freeze the parameters of $g_\psi$ and train $f_\theta$ on the dataset $X^t$. The objective function is a composition of the contrastive instance discrimination loss (4) and the feature decorrelation loss for the target features,
$$L_t = L_{cI} - \sum_{l=1}^{d} \log \frac{\exp(u_l^\top u_l / \tau_2)}{\sum_{m=1}^{d} \exp(u_m^\top u_l / \tau_2)} \qquad (7)$$
where $u_l$ are the vectors defined by the transpose of the target feature vectors $v_i$. The above learning process is summarized as Algorithm 1.
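Algorithm 1 is not reproduced in this text, but a minimal sketch of the two-stage procedure, building on the loss sketches above and assuming that memory banks are simply refreshed with the latest features after every step, might look as follows; optimizer settings and epoch counts are placeholders.

```python
def train_two_stage(f_theta, g_psi, bg_loader, tgt_loader,
                    b_bank, v_bank, bt_bank, epochs=200, lr=0.03):
    """Two-stage learning (Sec. 3.4); the loaders yield (images, instance indices)."""
    opt = torch.optim.SGD(g_psi.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):                          # stage 1: Eq. (6) on X^b
        for x, idx in bg_loader:
            b = g_psi(x)
            loss = instance_discrimination_loss(b, b_bank, idx) \
                 + feature_decorrelation_loss(b)
            opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                b_bank[idx] = b.detach()             # refresh the memory bank
    for p in g_psi.parameters():                     # freeze the background branch
        p.requires_grad_(False)
    opt = torch.optim.SGD(f_theta.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):                          # stage 2: Eq. (7) on X^t
        for x, idx in tgt_loader:
            v = f_theta(x)
            with torch.no_grad():
                bt_bank[idx] = g_psi(x)              # background view of X^t
            loss = contrastive_id_loss(v, v_bank, idx, bt_bank[idx], bt_bank) \
                 + feature_decorrelation_loss(v)
            opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                v_bank[idx] = v.detach()             # refresh the memory bank
```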
4 Experiments
4.1 Datasets
We evaluated the performance of cIDFD on three datasets created from commonly used public datasets. Each dataset contains both target and background datasets. Table 1 summarizes the key details. Figure 4 shows sample images from each dataset.
Stripe MNIST We created a synthetic image dataset using handwritten digits from the Modified National Institute of Standards and Technology (MNIST) database [7] and randomly generated artificial stripe patterns, which are shown in Fig. 4. Our goal was to cluster the ten handwritten digits independently of the background patterns. We also prepared a background dataset of stripe-pattern images that are almost the same as those in the target dataset, but not used in its creation (Fig. 4).
CelebA-ROH We assume a clustering task that focuses on certain attributes of facial images. We created datasets from CelebA [11], a popular dataset of celebrity face images. As the target dataset, which we call CelebA-ROH, we collected images with the target attributes "receding hairline" or "wearing hat" (Fig. 4). We used the remaining celebrity images as the background dataset, which we call CelebA-RNH (Fig. 4). Our goal was to cluster the two target attributes independently of other attributes such as eyeglasses, hair color, or gender.
Birds400-ABC As the target dataset, we collected images from Birds400 [15], a dataset available on Kaggle that contains images of 400 bird species with various backgrounds. From the training split of the original dataset, we used the 144 bird species whose names start with A, B, or C (Fig. 4). To realize bird-oriented clustering with cIDFD, we used the Landscape Pictures dataset [16], which includes high-quality images of natural landscapes. By randomly cropping and resizing those images, we generated 41,733 samples of 224 × 224 pixels for the background dataset (Fig. 4; see the sketch after Fig. 4).
Dataset | Image size | Samples | Classes
---|---|---|---
Stripe MNIST | 28 × 28 × 1 | 60,000 | 10
Stripes | 28 × 28 × 1 | 10,000 | 4
CelebA-ROH | 178 × 218 × 3 | 21,065 | 2
CelebA-RNH | 178 × 218 × 3 | 20,000 | –
Birds400-ABC | 224 × 224 × 3 | 20,822 | 144
Landscape Pictures | 224 × 224 × 3 | 41,733 | –
[Fig. 4: Sample images from the target and background datasets.]
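For reference, the Birds400-ABC background samples could be generated from the Landscape Pictures images with a random crop-and-resize transform along the following lines; the file name and transform parameters are hypothetical.

```python
from PIL import Image
from torchvision import transforms

to_background = transforms.Compose([
    transforms.RandomResizedCrop(224),  # random crop, resized to 224 x 224
    transforms.ToTensor(),
])
img = Image.open("landscape_0001.jpg").convert("RGB")  # hypothetical file name
sample = to_background(img)  # one (3, 224, 224) background sample
```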
4.2 Comparison with conventional methods
We compared cIDFD with VAE, cVAE, and eight other competitive deep clustering methods: CC, MiCE, SCAN, RUC, NNM, TCL, SPICE, and IDFD. Since VAE and the eight deep clustering methods have no way to handle the background dataset, only the target dataset was used for them. In the clustering phase, we applied simple $k$-means to the representations learned by VAE, cVAE, IDFD, and cIDFD. The number of clusters was set equal to the number of classes in each dataset, as shown in Table 1. Clustering performance was evaluated by three popular metrics: clustering accuracy (ACC), normalized mutual information (NMI), and the adjusted Rand index (ARI). Higher scores indicate more accurate clustering assignments, with 1 corresponding to a perfect match.
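As a sketch of this evaluation protocol: ACC is conventionally computed by finding the best one-to-one mapping between cluster indices and class labels with the Hungarian algorithm; the paper's exact implementation is not specified, so the following is a reference version under that assumption.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate(features: np.ndarray, labels: np.ndarray, n_clusters: int):
    """k-means on learned features, then ACC / NMI / ARI.

    labels are assumed to be integers in [0, n_clusters).
    """
    pred = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    cost = np.zeros((n_clusters, n_clusters), dtype=int)
    for p, t in zip(pred, labels):
        cost[p, t] += 1                      # co-occurrence of cluster p and class t
    row, col = linear_sum_assignment(-cost)  # Hungarian matching (maximize matches)
    acc = cost[row, col].sum() / len(labels)
    nmi = normalized_mutual_info_score(labels, pred)
    ari = adjusted_rand_score(labels, pred)
    return acc, nmi, ari
```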
Table 2 lists the performance of each method. The results show that cIDFD clearly outperforms the conventional methods. In terms of ACC, cIDFD improves on the best conventional score by approximately 24 percentage points for Stripe MNIST, 7 points for CelebA-ROH, and 25 points for Birds400-ABC. In particular, our method outperforms cVAE, the most similar method in terms of CA setting usage, even on complex datasets such as CelebA-ROH and Birds400-ABC.
Dataset | Stripe MNIST | | | CelebA-ROH | | | Birds400-ABC | |
---|---|---|---|---|---|---|---|---|---
Metric | ACC | ARI | NMI | ACC | ARI | NMI | ACC | ARI | NMI
VAE | 0.177 | 0.059 | 0.085 | 0.500 | 0.038 | 0.034 | 0.067 | 0.016 | 0.218 |
cVAE | 0.578 | 0.421 | 0.563 | 0.776 | 0.299 | 0.208 | 0.070 | 0.016 | 0.200 |
CC | 0.377 | 0.248 | 0.439 | 0.893 | 0.619 | 0.579 | 0.331 | 0.204 | 0.553 |
MiCE | 0.349 | 0.233 | 0.420 | 0.714 | 0.185 | 0.149 | 0.265 | 0.202 | 0.562 |
SCAN | 0.594 | 0.505 | 0.696 | 0.577 | 0.023 | 0.021 | 0.257 | 0.152 | 0.479 |
RUC | 0.587 | 0.504 | 0.706 | 0.482 | 0.019 | 0.016 | 0.005 | 0.031 | 0.310 |
NNM | 0.402 | 0.321 | 0.548 | 0.632 | 0.070 | 0.055 | 0.226 | 0.140 | 0.470 |
TCL | 0.376 | 0.233 | 0.421 | 0.820 | 0.409 | 0.456 | 0.332 | 0.193 | 0.547 |
SPICE | 0.492 | 0.411 | 0.588 | 0.578 | 0.021 | 0.037 | 0.321 | 0.203 | 0.526 |
IDFD | 0.276 | 0.156 | 0.369 | 0.648 | 0.086 | 0.067 | 0.486 | 0.361 | 0.658 |
cIDFD | 0.830 | 0.809 | 0.908 | 0.969 | 0.879 | 0.796 | 0.738 | 0.664 | 0.848 |
4.3 Representation distribution
Fig. 5 visualizes the feature representations of the three datasets learned by IDFD and cIDFD. The 128-dimensional representations were embedded into two-dimensional space by UMAP [12]. Point colors indicate ground-truth labels. The distributions clearly show that cIDFD is preferable to IDFD for grouping samples into clusters characterized by the ground-truth labels, which are the features we focus on. For Stripe MNIST, IDFD created excessive clusters according to both the digit and stripe features, while cIDFD generated almost exactly ten clusters. For CelebA-ROH, IDFD generated one distribution mixing the two classes, but cIDFD generated distinct distributions according to the ground-truth labels. cIDFD also correctly distinguished an extremely large number of bird species, whereas in the IDFD results samples from many classes degenerated into several large clusters.
[Fig. 5: UMAP embeddings of the representations learned by IDFD and cIDFD for the three datasets.]
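The visualization can be reproduced along these lines, assuming the learned 128-dimensional features and ground-truth labels are available as NumPy arrays; the random stand-ins and UMAP defaults below are illustrative only.

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 128)).astype(np.float32)  # stand-in for learned features
labels = rng.integers(0, 10, size=1000)                     # stand-in for ground truth

emb = umap.UMAP(n_components=2).fit_transform(features)     # 128-dim -> 2-dim
plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=2, cmap="tab10")
plt.show()
```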
4.4 Similarity distribution
To clearly understand how our method works, we examined the similarity distributions of both IDFD and cIDFD on the synthetic Stripe MNIST dataset. Fig. 6 shows the resulting histograms of four types of average similarity, calculated for each instance: the first type is the similarity to instances with the same background and a different target; the second, to instances with a different background and the same target; the third, to instances with the same background and the same target; and the fourth, to instances with a different target and a different background. In the case of IDFD (Fig. 6), the peaks of the first and second similarity types are located in almost the same lower region. As mentioned in Section 3.3, this situation is problematic. In contrast, cIDFD successfully moved the peak of the second type to the same position as the third-type distribution, in the higher region (Fig. 6). Meanwhile, the first- and fourth-type distributions are located in the same lower region. Consequently, instances with the same target features were clustered independently of the background features.
[Fig. 6: Histograms of the four types of average similarity under IDFD and cIDFD.]
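The four per-instance average similarities can be computed as in the sketch below, assuming L2-normalized feature rows `v` and integer arrays of target (digit) and background (stripe) labels for Stripe MNIST; all identifiers are illustrative. Histograms of the four returned arrays reproduce the style of Fig. 6.

```python
import numpy as np

def avg_similarities(v: np.ndarray, target: np.ndarray, background: np.ndarray):
    """Per-instance average cosine similarity for the four pair types of Fig. 6."""
    sim = v @ v.T                           # cosine similarities (v is L2-normalized)
    t_eq = target[:, None] == target[None, :]
    b_eq = background[:, None] == background[None, :]
    off_diag = ~np.eye(len(v), dtype=bool)  # exclude self-pairs
    masks = {
        "same bg, diff target": b_eq & ~t_eq & off_diag,
        "diff bg, same target": ~b_eq & t_eq & off_diag,
        "same bg, same target": b_eq & t_eq & off_diag,
        "diff bg, diff target": ~b_eq & ~t_eq & off_diag,
    }
    return {name: (sim * m).sum(1) / np.maximum(m.sum(1), 1)
            for name, m in masks.items()}
```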
5 Conclusion
We presented cIDFD, a new self-supervised clustering method combining instance discrimination and feature decorrelation with contrastive analysis. Our method is designed to extract unimportant features from a background dataset and reject them in the learning process for the target dataset, resulting in clustering according to only the important features. The experimental results on the Stripe MNIST, CelebA-ROH, and Birds400-ABC datasets showed that cIDFD outperforms state-of-the-art SSL methods and comparable conventional methods based on contrastive analysis. Problem settings that allow utilization of a background dataset appear in various fields, so we expect there to be many situations in which cIDFD can be applied.
References
- [1] Abid, A., Zhang, M.J., Bagaria, V.K., Zou, J.: Exploring patterns enriched in a dataset with contrastive principal component analysis. Nat Commun 9, 2134 (2018)
- [2] Abid, A., Zou, J.: Contrastive Variational Autoencoder Enhances Salient Features. arXiv:1902.04601 (2019)
- [3] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised Learning of Visual Features by Contrasting Cluster Assignments. In: NeurIPS. pp. 9912–9924. Curran Associates, Inc. (2020)
- [4] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A Simple Framework for Contrastive Learning of Visual Representations. In: ICML. pp. 1597–1607. PMLR (2020)
- [5] Chen, X., He, K.: Exploring simple siamese representation learning. In: CVPR. pp. 15750–15758. IEEE (2021)
- [6] Dang, Z., Deng, C., Yang, X., Wei, K., Huang, H.: Nearest neighbor matching for deep clustering. In: CVPR. pp. 13693–13702 (June 2021)
- [7] Deng, L.: The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine 29(6), 141–142 (2012)
- [8] Dirie, A.H., Abid, A., Zou, J.: Contrastive multivariate singular spectrum analysis. In: 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton). pp. 1122–1127 (2019)
- [9] Ge, R., Zou, J.: Rich component analysis. In: ICML. p. 1502–1510. PMLR (2016)
- [10] Li, Y., Hu, P., Liu, Z., Peng, D., Zhou, J.T., Peng, X.: Contrastive Clustering. In: AAAI. pp. 8547–8555. AAAI Press (2021)
- [11] Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: Proceedings of International Conference on Computer Vision (ICCV) (2015)
- [12] McInnes, L., Healy, J., Melville, J.: UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv:1802.03426 (2018)
- [13] Niu, C., Shan, H., Wang, G.: SPICE: Semantic Pseudo-labeling for Image Clustering. arXiv:2103.09382 (2022)
- [14] Park, S., Han, S., Kim, S., Kim, D., Park, S., Hong, S., Cha, M.: Improving unsupervised image clustering with robust learning. In: CVPR (2021)
- [15] Piosenka, G.: Birds 400 - species image classification (2022)
- [16] Rougetet, A.: Landscape pictures (2020)
- [17] Shota, M., Yukako, T.: Application of contrastive representation learning to unsupervised defect classification in semiconductor manufacturing. In: AEC/APC Symposium Asia 2021 (2021)
- [18] Tao, Y., Takagi, K., Nakata, K.: Clustering-friendly Representation Learning via Instance Discrimination and Feature Decorrelation. In: ICLR (2021)
- [19] Tsai, T.W., Li, C., Zhu, J.: Mice: Mixture of contrastive experts for unsupervised image clustering. In: ICLR (2021)
- [20] Van Gansbeke, W., Vandenhende, S., Georgoulis, S., Proesmans, M., Van Gool, L.: SCAN: Learning to classify images without labels. In: ECCV. pp. 268–285. Springer International Publishing (2020)
- [21] Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised Feature Learning via Non-Parametric Instance Discrimination. In: CVPR. pp. 3733–3742. IEEE (2018)
- [22] Li, Y., Yang, M., Peng, D., Li, T., Huang, J., Peng, X.: Twin contrastive learning for online clustering. International Journal of Computer Vision (2022)
- [23] Zhou, S., Xu, H., Zheng, Z., Chen, J., Li, Z., Bu, J., Wu, J., Wang, X., Zhu, W., Ester, M.: A comprehensive survey on deep clustering: Taxonomy, challenges, and future directions (2022)