These authors contributed equally to this work.
Xi Peng*
A Survey on Deep Clustering: From the Prior Perspective
Abstract
Facilitated by the powerful feature extraction ability of neural networks, deep clustering has achieved great success in analyzing high-dimensional and complex real-world data. The performance of deep clustering methods is affected by various factors such as network structures and learning objectives. However, as pointed out in this survey, the essence of deep clustering lies in the incorporation and utilization of prior knowledge, which is largely ignored by existing works. From pioneering deep clustering methods based on data structure assumptions to recent contrastive clustering methods based on data augmentation invariances, the development of deep clustering intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods by categorizing them into six types of prior knowledge. We find that in general the prior innovation follows two trends, namely, i) from mining to constructing, and ii) from internal to external. Besides, we provide a benchmark on five widely-used datasets and analyze the performance of methods with diverse priors. By providing a novel prior knowledge perspective, we hope this survey could provide some novel insights and inspire future research in the deep clustering community.
keywords:
Clustering, Deep Clustering, Unsupervised Learning

1 Introduction
As a fundamental problem in machine learning, clustering aims at grouping data instances into several clusters, where instances from the same cluster share similar semantics and instances from different clusters are dissimilar. Clustering could reveal the inherent semantic structure underlying the data, which benefits downstream analyses such as anomaly detection [86], person re-identification [115], community detection [96], and domain adaptation [111], etc.
In the early stage, various classic clustering methods are developed, such as centroid-based clustering [64], density-based clustering [19], hierarchical clustering [71], and so on. These shallow methods are grounded in theory and enjoy high interpretability. Later on, some works extend shallow clustering methods to diverse data types, such as multi-view [117, 75, 76, 98] and graph data [73, 87]. Other efforts have been made to improve the scalability [118] of shallow clustering methods.
However, shallow clustering methods partition instances based on the similarity [64] or density [19] of the given raw or linear transformed data. Due to the limited feature extraction ability, shallow clustering methods would achieve sub-optimal results when confronted with complex, high-dimensional, and non-linear data in the real world. To tackle this challenge, deep clustering techniques are proposed to incorporate neural networks into clustering methods. In other words, deep clustering simultaneously learns discriminative representations and performs clustering on the learned features, progressively benefiting each other.
Over the past few years, many efforts have been devoted to improving the clustering performance from various aspects, such as network architectures [8, 74], training strategies [69], and loss functions [40, 124]. However, we would like to highlight that the fundamental challenge of deep clustering is the absence of data annotations. Consequently, the key to deep clustering lies in introducing proper prior knowledge to construct the supervision signals. From the early data structure assumption to the recent data augmentation invariance, the development of deep clustering methods intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods from the perspective of prior knowledge.
Inspired by traditional clustering and dimensionality reduction approaches [85, 4], the early deep clustering methods [33, 79, 91] build upon the structure prior of data. Based on the assumption that the inherent data structure could reflect the semantic relation, these methods incorporate classic manifold [85] or subspace learning [101] objectives to optimize the neural network for feature extraction and clustering. The second type of prior knowledge is the distribution prior, which assumes that instances from different clusters follow distinct distributions. Based on such a prior, several generative deep clustering methods [40, 69] propose to learn the latent distribution of samples for the data partition. In the past few years, the success of contrastive learning spawned a new category of prior knowledge, namely, augmentation invariance. Instead of mining data priors, researchers turn to constructing additional priors with the data augmentation technique. Leveraging the invariance across different augmented samples at the instance representation and clustering assignment levels, numerous contrastive clustering methods [39, 52] significantly improve the feature discriminability and clustering performance. Further, researchers find that instances of the same semantics are likely to be mapped into nearby points in the latent space, and accordingly propose the neighborhood consistency prior. Specifically, by encouraging neighboring samples to have similar cluster assignments, several works [97, 124] alleviate the false-negative problem in the contrastive clustering paradigm, thus advancing the clustering results. Another branch of progress is made based on the pseudo label prior, namely, cluster assignments with high confidence are likely to be correct. By selecting confident predictions as pseudo labels, several studies further boost the clustering performance through pseudo-labeling [53, 81] and semi-supervised learning [77]. Very recently, instead of pursuing internal priors from the data itself, some works [7, 54] attempt to introduce abundant external knowledge such as textual descriptions to guide clustering.
In summary, the essence of deep clustering lies in how to find and leverage effective prior knowledge, for both feature extraction and cluster assignment. To provide an overview of the development of deep clustering, in this paper, we categorize a series of state-of-the-art approaches according to the taxonomy of prior knowledge. We hope such a new perspective on deep clustering could inspire future research in the community. The rest of this paper is organized as follows: First, Section 2 introduces the preliminaries on deep clustering. Section 3 reviews the existing deep clustering methods from the prior knowledge perspective. Then, Section 4 provides experimental analyses of deep clustering methods. After that, Section 5 briefly introduces some applications of deep clustering in Vicinagearth security. Lastly, Section 6 summarizes some notable trends and challenges for deep clustering.
Related Surveys
We notice that several surveys on deep clustering have been proposed in recent years. Briefly, Min et al [66] categorizes deep clustering methods according to the network architecture. Dong et al [18] focuses on applications of deep clustering. Ren et al [84] summarizes existing methods from the view of data types, such as single- and multi-view. Zhou et al [125] discusses various interactions between representation learning and clustering. Distinct from existing surveys, this work systematically provides a new perspective from the prior knowledge, which plays a more intrinsic and essential role in deep clustering.
2 Problem Definition
In this section, we introduce the pipeline of deep clustering, including the notation and problem definition. Unless otherwise specified, in this paper, we use bold uppercase and lowercase letters to denote matrices and vectors, respectively. The commonly used notations are summarized in Table 1.
The deep clustering problem is formally defined as follows: given a set of $N$ instances $\{x_i\}_{i=1}^{N}$ belonging to $K$ classes, deep clustering aims to learn discriminative features and group the instances into $K$ clusters according to their semantics. Specifically, deep clustering methods first learn a deep neural network $f$ for feature extraction, i.e., $z_i = f(x_i)$. Given the instance features in the latent space, clustering results could be obtained in two ways. The most straightforward way is to apply classic algorithms such as K-means [64] and DBSCAN [19] on the learned features. The other solution is to train an additional cluster head $g$ to produce soft cluster assignments $p_i = g(z_i) \in \mathbb{R}^{K}$ which satisfy $\sum_{k=1}^{K} p_{ik} = 1$. The hard cluster assignment for the $i$-th instance could be computed by the $\arg\max$ operation, namely,

$\hat{y}_i = \arg\max_{k}\, p_{ik}$    (1)
The cluster assignments provide the inherent semantic structure underlying the data, which could be utilized in various downstream analyses.
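To make the pipeline concrete, below is a minimal PyTorch sketch of the cluster-head strategy, where a toy encoder and cluster head produce soft assignments that are hardened with argmax as in Eq. (1); the network shapes and dimensions are illustrative assumptions rather than those of any specific method.

```python
import torch
import torch.nn as nn

# Illustrative sketch: an encoder f maps instances to features,
# a cluster head g maps features to soft assignments over K clusters.
K, feat_dim, input_dim = 10, 128, 784  # assumed sizes for illustration

f = nn.Sequential(nn.Linear(input_dim, 512), nn.ReLU(), nn.Linear(512, feat_dim))
g = nn.Sequential(nn.Linear(feat_dim, K), nn.Softmax(dim=1))

x = torch.randn(32, input_dim)   # a mini-batch of raw instances
z = f(x)                         # learned features z_i
p = g(z)                         # soft assignments, each row sums to 1
y_hard = p.argmax(dim=1)         # hard assignment per Eq. (1)
print(y_hard.shape)              # torch.Size([32])
```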

Notation | Explanation |
$N$ | Number of data instances |
$M$ | Size of a mini-batch |
$K$ | Number of clusters |
$f$ | Encoder network |
$g$ | Cluster head |
$x_i$ | $i$-th data instance |
$z_i$ | Feature of the $i$-th instance |
$\hat{y}_i$ | Pseudo label of the $i$-th instance |
$\|\cdot\|_2$ | L2-norm of a vector |
$\langle\cdot,\cdot\rangle$ | Dot product operator |
$\mathrm{sim}(\cdot,\cdot)$ | Cosine similarity, i.e., $\mathrm{sim}(a,b)=\frac{\langle a,b\rangle}{\|a\|_2\|b\|_2}$ |
$c_k$ | Centroid of the $k$-th cluster |
$H(\cdot)$ | Entropy, i.e., $H(X)=-\sum_x p(x)\log p(x)$ |
$H(\cdot\mid\cdot)$ | Conditional entropy, i.e., $H(Y\mid X)=-\sum_{x,y} p(x,y)\log p(y\mid x)$ |
$I(\cdot;\cdot)$ | Mutual information, i.e., $I(X;Y)=H(Y)-H(Y\mid X)$ |
$\tau$ | Temperature coefficient of contrastive loss |
3 Priors for Deep Clustering
In this section, we review existing deep clustering methods from the perspective of prior knowledge. The priors are illustrated in Figure 1 and the method categorization is summarized in Table 2.
3.1 Structure Prior
Structure prior is mostly inspired by traditional clustering methods. Traditional clustering is mainly rooted in assumptions about the structural characteristics of clusters in the data space. For example, K-means [64] aims to learn cluster centroids, which assumes that instances in each cluster form a spherical structure around its centroid. DBSCAN [19] is based on the assumption that a cluster in data space is a contiguous region of high point density, separated from other such clusters by regions of low point density. Spectral clustering [4] assumes data lies on a locally linear manifold so that the local neighborhood relations should be preserved in the latent space. Those methods partition instances according to the graph Laplacian. Agglomerative clustering [25] considers the hierarchical structure of data and performs clustering with merging and splitting. Motivated by the success of classic clustering methods, the early exploration of deep clustering mainly focuses on adapting mature structure priors as objective functions to optimize neural networks.
Given well-structured data in the latent space, ABDC [95] iteratively optimizes the data representation and clustering centers motivated by K-means. As the deep extension of classic spectral clustering, DEN [33], SpectralNet [91], and MvLNet [34, 35] compute the graph Laplacian in the latent space learned by auto-encoder [5] and SiameseNets [28, 90], respectively. Likewise, DCC [89] extends the core idea of RCC [88] by performing a relation matching based on the similarity between latent features. The auto-encoder is then optimized by minimizing the distance of paired instances in the latent space. PARTY [79] is the first deep subspace clustering method, which introduces the sparsity prior and self-representation property in subspace learning to optimize neural networks. Motivated by the hierarchical structure of clusters, JULE [110] achieves agglomerative deep clustering by progressively merging clusters and optimizing the features.
3.2 Distribution Prior
Distribution prior refers to instances of different semantics following distinct data distributions. Such a prior arouses the generative deep clustering paradigm, which employs variational autoencoder [43] (VAE) and generative adversarial network [24] (GAN) to learn the underlying distribution. Instances generated from similar distributions are then grouped together to achieve clustering.
VaDE [40] is the first deep generative clustering method, which models different data distributions by fitting a Gaussian mixture model (GMM) in the latent space. To generate an instance, VaDE first samples a cluster $c$ and draws a latent vector $z$ from the corresponding cluster distribution, and then reconstructs the instance $x$ in the input space. The cluster assignment and neural network are jointly optimized by maximizing the log-likelihood of the instance, i.e.,
$\log p(x) = \log \int_{z} \sum_{c} p(x \mid z)\, p(z \mid c)\, p(c)\, dz$    (2)
Since directly computing Eq. 2 is intractable, the optimization is approximated by the evidence lower bound (ELBO) of variational inference objective, namely,
$\mathcal{L}_{\mathrm{ELBO}}(x) = \mathbb{E}_{q(z, c \mid x)}\left[\log \frac{p(x, z, c)}{q(z, c \mid x)}\right]$    (3)
where $q(z, c \mid x)$ is the variational posterior, which approximates the real posterior $p(z, c \mid x)$. The reparameterization trick introduced in VAE [43] is adopted to make the sampling process differentiable.
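As a simplified illustration of the distribution prior (not VaDE's full variational objective), the sketch below fits a Gaussian mixture on features produced by some pre-trained encoder and reads cluster assignments from the posterior responsibilities; the random features stand in for real encoder outputs, and all sizes are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Assume `features` are latent representations from a (pre-trained) encoder;
# random values are used here only as a placeholder.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 64))

K = 10
gmm = GaussianMixture(n_components=K, covariance_type="diag", random_state=0)
gmm.fit(features)

resp = gmm.predict_proba(features)   # soft assignments p(c | z)
labels = resp.argmax(axis=1)         # hard cluster assignments
```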

Though the GMM could effectively distinguish distributions, its Gaussian components are proved to be redundant, which harms the discriminability between different clusters [27]. As an improvement, ClusterGAN [69] and DCGAN [82] propose to adopt GANs to implicitly learn the latent distributions. Specifically, in addition to the continuous latent variable $z_n$, ClusterGAN introduces a one-hot encoding $z_c$ to capture the cluster distribution during generation. The objective function of ClusterGAN is formulated as follows:
$\min_{\Theta_G, \Theta_E}\max_{\Theta_D}\ \mathbb{E}_{x\sim p_x}\big[q(D(x))\big] + \mathbb{E}_{z\sim p_z}\big[q(1 - D(G(z)))\big] + \beta_n\, \mathbb{E}_{z\sim p_z}\big\|z_n - E_n(G(z))\big\|_2^2 + \beta_c\, \mathbb{E}_{z\sim p_z}\big[\mathcal{CE}(z_c, E_c(G(z)))\big]$    (4)
where $z = (z_n, z_c)$ is the mixed latent variable, $E$ is the inverse network which maps data from the raw space to the latent space, $\mathcal{CE}$ denotes the cross-entropy, and $\beta_n$, $\beta_c$ are the weight parameters. The first two terms are consistent with the standard GAN. The last two clustering-specific terms encourage a more distinct cluster distribution, as well as map inputs to the latent space to achieve clustering.

3.3 Augmentation Invariance
In recent years, image augmentation methods [93] have gained widespread attention, grounded in the prior that augmentations of the same instance could preserve consistent semantic information. This augmentation-invariance character inspires exploration of how to leverage the positive pairs (i.e., different augmentations of the same image) with similar semantic information. Notably, mutual-information-based methods and contrastive-learning-based methods have emerged as pioneers in the realm of deep clustering. In this section, we delve into the fundamental concepts and related works of both mutual-information-based and contrastive-learning-based methods.
Firstly, mutual information is a measure of the dependence between two random variables $X$ and $Y$, formally,
$I(X; Y) = \sum_{y \in Y}\sum_{x \in X} p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)}$    (5)
where $p(x, y)$ is the joint probability mass function of $X$ and $Y$, and $p(x)$ and $p(y)$ are the marginal probability mass functions of $X$ and $Y$, respectively. In the context of information theory, leveraging the mutual information between variables of positive instances could enhance the optimization of clustering-related information.
IMSAT [31] stands as a typical information-theoretic approach to deep clustering. Its fundamental concept includes enforcing invariance on pair-wise augmented instances and achieving unambiguous and uniform cluster assignments. Specifically, IMSAT encourages the representations of augmented instances to closely match the representations of the original instances, i.e.,
$\mathcal{R}_{\mathrm{SAT}}(\theta; x, T(x)) = -\sum_{k=1}^{K} p_{\hat{\theta}}(y_k \mid x)\,\log p_{\theta}(y_k \mid T(x))$    (6)
where $T(\cdot)$ denotes the data augmentation and $p_{\theta}(y \mid T(x))$ is the cluster prediction of the augmented instance. This aspect can be viewed as maximizing the mutual information between data and its augmentations. Besides, IMSAT implements regularized information maximization for deep clustering inspired by RIM [44] to keep the cluster assignments unambiguous and uniform. Specifically, IMSAT seeks to maximize the mutual information between instances and their cluster assignments, expressed as:
$I(X; Y) = H(Y) - H(Y \mid X)$    (7)
where $H(\cdot)$ and $H(\cdot \mid \cdot)$ denote the entropy and conditional entropy, respectively. Increasing the first term (marginal entropy $H(Y)$) encourages uniform cluster assignments, i.e., the number of instances in each cluster tends to be the same. Conversely, decreasing the second term (conditional entropy $H(Y \mid X)$) encourages each instance to be unambiguously assigned to a certain cluster.
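The following PyTorch sketch shows one way to instantiate the RIM-style term of Eq. (7) on a mini-batch of soft assignments, plus a consistency penalty between original and augmented predictions. The batch-wise approximation of the marginal and the simple cross-entropy consistency are common assumptions, not IMSAT's exact implementation (which additionally uses virtual adversarial perturbations).

```python
import torch
import torch.nn.functional as F

def rim_loss(p, p_aug, lam=1.0, eps=1e-8):
    """p, p_aug: (batch, K) soft cluster assignments of original/augmented views."""
    # Conditional entropy H(Y|X): average per-sample entropy (should be small).
    cond_ent = -(p * (p + eps).log()).sum(dim=1).mean()
    # Marginal entropy H(Y): entropy of the batch-averaged assignment (should be large).
    p_bar = p.mean(dim=0)
    marg_ent = -(p_bar * (p_bar + eps).log()).sum()
    # Consistency: cross-entropy between original and augmented predictions.
    consistency = -(p.detach() * (p_aug + eps).log()).sum(dim=1).mean()
    # Minimize -(H(Y) - H(Y|X)) + lam * consistency.
    return cond_ent - marg_ent + lam * consistency

p = F.softmax(torch.randn(256, 10), dim=1)
p_aug = F.softmax(torch.randn(256, 10), dim=1)
loss = rim_loss(p, p_aug)
```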
IIC [39] and Completer [57, 58] take a further step in exploring the mutual information between instances and their augmentations. The fundamental concept is to maximize the mutual information between the cluster assignments of pair-wise augmented instances. Specifically, IIC achieves semantically meaningful clustering and avoids trivial solutions by maximizing the mutual information between the cluster assignments,
$\max_{\Phi}\ I(z, z') = \max_{\Phi}\ \sum_{c=1}^{K}\sum_{c'=1}^{K} P_{cc'}\log\frac{P_{cc'}}{P_{c}\,P_{c'}}$    (8)
where $z = \Phi(x)$ and $z' = \Phi(x')$ are the soft cluster predictions of the original instance $x$ and its augmentation $x'$, respectively. The joint distribution of $z$ and $z'$ is given by the $K \times K$ matrix $P$, which is constituted by,
$P = \frac{1}{M}\sum_{i=1}^{M} \Phi(x_i)\,\Phi(x_i')^{\top}$    (9)
where $P_{cc'}$ denotes the element at the $c$-th row and $c'$-th column (in practice, $P$ is symmetrized as $(P + P^{\top})/2$). Additionally, the marginals $P_c$ and $P_{c'}$ can be obtained by summing over the rows and columns of this matrix. Notably, IIC stands out as one of the earliest deep frameworks designed entirely under the framework of information theory, distinguishing itself from IMSAT.
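A minimal sketch of Eqs. (8)-(9) is given below: the joint matrix is built from paired soft assignments, symmetrized, and the resulting mutual information is maximized (i.e., its negative is minimized). Batch size and cluster number are illustrative.

```python
import torch
import torch.nn.functional as F

def iic_loss(p, p_aug, eps=1e-8):
    """p, p_aug: (batch, K) soft cluster assignments of two augmented views."""
    P = (p.unsqueeze(2) * p_aug.unsqueeze(1)).mean(dim=0)  # (K, K) joint matrix
    P = (P + P.t()) / 2                                    # symmetrize
    P = P.clamp(min=eps)
    Pi = P.sum(dim=1, keepdim=True)                        # marginal of view 1
    Pj = P.sum(dim=0, keepdim=True)                        # marginal of view 2
    # Negative mutual information I(z, z'), to be minimized.
    return -(P * (P.log() - Pi.log() - Pj.log())).sum()

p = F.softmax(torch.randn(128, 10), dim=1)
p_aug = F.softmax(torch.randn(128, 10), dim=1)
loss = iic_loss(p, p_aug)
```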
Similar to mutual-information-based methods, contrastive-learning-based methods treat instances augmented from the same instance as positive samples and the rest as negative samples. Let $z_i^a$ and $z_i^b$ represent two augmented representations of the $i$-th instance, the contrastive loss is formulated as:
$\ell_i = -\log\dfrac{\exp\big(\mathrm{sim}(z_i^a, z_i^b)/\tau\big)}{\sum_{j=1}^{M}\mathbb{1}_{[j\neq i]}\exp\big(\mathrm{sim}(z_i^a, z_j^a)/\tau\big) + \sum_{j=1}^{M}\exp\big(\mathrm{sim}(z_i^a, z_j^b)/\tau\big)}$    (10)
where $\ell_i$ represents the pairwise contrastive loss of the $i$-th instance and $\tau$ controls the temperature of the softmax. The function $\mathrm{sim}(\cdot, \cdot)$ denotes the similarity (e.g., cosine similarity) between representations. This loss encourages representations of positive instances to be closer while being separated from negative instances, encouraging meaningful clustering patterns.
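Below is a compact PyTorch sketch of an instance-level contrastive (InfoNCE-style) loss over a batch of paired augmentations, in the spirit of Eq. (10); the masking and normalization details follow the common SimCLR/CC recipe and are assumptions rather than any specific paper's code.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(z_a, z_b, tau=0.5):
    """z_a, z_b: (M, d) features of two augmented views of the same M instances."""
    M = z_a.size(0)
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)  # (2M, d)
    sim = z @ z.t() / tau                                  # cosine similarities
    sim.fill_diagonal_(float("-inf"))                      # exclude self-pairs
    # The positive of sample i is its other augmentation (index offset by M).
    targets = torch.cat([torch.arange(M, 2 * M), torch.arange(0, M)])
    return F.cross_entropy(sim, targets)

z_a, z_b = torch.randn(64, 128), torch.randn(64, 128)
loss = instance_contrastive_loss(z_a, z_b)
```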
Notably, some theoretical works [78, 68, 59] have demonstrated that contrastive learning is equivalent to maximizing the mutual information from the instance level. Motivated by this observation, researchers have further explored the application of contrastive loss at the cluster level, proving beneficial for deep clustering. PICA [32] is one of the pioneer works of this domain. The fundamental concept behind it is to maximize the similarity between the cluster assignment of original and augmented data. This objective can be likened to conducting contrastive learning [60] at the cluster level. Motivated by PICA, CC [52] and DRC [123] conduct contrastive learning on both instance level and cluster level. Specifically, cluster-level contrastive loss helps learn discriminative cluster assignment, which is the key to the clustering task. Formally, the cluster-level contrastive loss is,
$\mathcal{L}_{\mathrm{cluster}} = \frac{1}{2K}\sum_{k=1}^{K}\big(\hat{\ell}_k^a + \hat{\ell}_k^b\big) - H(Y), \qquad \hat{\ell}_k^a = -\log\dfrac{\exp\big(\mathrm{sim}(y_k^a, y_k^b)/\tau_c\big)}{\sum_{k'=1}^{K}\Big[\mathbb{1}_{[k'\neq k]}\exp\big(\mathrm{sim}(y_k^a, y_{k'}^a)/\tau_c\big) + \exp\big(\mathrm{sim}(y_k^a, y_{k'}^b)/\tau_c\big)\Big]}$    (11)
where $y_k^a$ (the $k$-th column of the soft assignment matrix of view $a$) serves as the cluster-level representation and $\tau_c$ is the cluster-level temperature parameter. $H(Y)$ is the entropy of the cluster assignment probabilities of the two augmentations. The inclusion of $H(Y)$ helps avoid the trivial solution where most instances are assigned to the same cluster. Notably, the utilization of contrastive learning at the cluster level in CC and DRC has inspired subsequent works in the field.
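The sketch below illustrates the cluster-level counterpart of Eq. (11): columns of the batch assignment matrices act as cluster representations contrasted across the two views, with an entropy term that discourages collapsed solutions. This follows the general CC recipe; the exact weighting and masking are assumptions.

```python
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(p_a, p_b, tau_c=1.0, eps=1e-8):
    """p_a, p_b: (M, K) soft assignments of two views; columns are cluster features."""
    y_a, y_b = p_a.t(), p_b.t()                 # (K, M): one row per cluster
    K = y_a.size(0)
    y = F.normalize(torch.cat([y_a, y_b], dim=0), dim=1)
    sim = y @ y.t() / tau_c
    sim.fill_diagonal_(float("-inf"))           # exclude self-pairs
    targets = torch.cat([torch.arange(K, 2 * K), torch.arange(0, K)])
    contrast = F.cross_entropy(sim, targets)
    # Entropy of the average assignment; maximizing it avoids trivial solutions.
    q = torch.cat([p_a, p_b], dim=0).mean(dim=0)
    entropy = -(q * (q + eps).log()).sum()
    return contrast - entropy

p_a = F.softmax(torch.randn(64, 10), dim=1)
p_b = F.softmax(torch.randn(64, 10), dim=1)
loss = cluster_contrastive_loss(p_a, p_b)
```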
TCC [92] takes a step further in exploring the interaction between instance-level and cluster-level representations. The core idea is to leverage a unified representation that combines cluster semantics and instance features, enhancing the representation with cluster information to facilitate the clustering task. Formally, for an instance representation $z_i$, the enhanced representation $\tilde{z}_i$ is given by:
$\tilde{z}_i = z_i + \mu(\tilde{y}_i)$    (12)
where $\tilde{y}_i$ represents the cluster assignment of the $i$-th instance after the Gumbel-Softmax, and $\mu(\cdot)$ denotes a single fully-connected network that produces the learnable cluster representation. Different from CC, which applies the contrastive loss on cluster assignments, TCC conducts contrastive learning on the unified representation to better capture cluster semantics. Inspired by TCC, some works [108, 50] explore the fusion of instance-level and cluster-level representations in various domains and then conduct contrastive learning on the unified representation, further demonstrating its effectiveness.

3.4 Neighborhood Consistency
Thanks to the advancements in self-supervised representation learning, the features acquired through discriminative pretext tasks can unveil high-level semantics in the latent space. This provides a crucial prior for clustering, as instances and their neighborhoods in the latent space are likely to belong to the same semantic cluster. Leveraging neighborhood-consistent semantics can further enhance clustering.
SCAN [97] first observes that similar instances will be mapped closely in the latent space through self-supervised pretext tasks. Motivated by this observation, SCAN trains a cluster head based on the consistency of cluster assignments within neighborhoods. Specifically, SCAN first obtains an encoder $f$ by a pretext task [23, 119, 104, 30]. It then optimizes a cluster head $g$ by requiring it to make consistent predictions between instances and their nearest neighbors:
$\mathcal{L}_{\mathrm{SCAN}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{x_j\in\mathcal{N}_i}\log\big\langle g(f(x_i)),\, g(f(x_j))\big\rangle + \lambda\sum_{k=1}^{K}\bar{p}_k\log\bar{p}_k, \qquad \bar{p}_k = \frac{1}{N}\sum_{i=1}^{N} g_k(f(x_i))$    (13)
Here $\mathcal{N}_i$ denotes the $k$-nearest neighbors of the $i$-th instance in the pretext feature space, and $\lambda$ balances the entropy term. The second term in Eq. 13 prevents the model from assigning all instances to a single cluster, which is also used in Eq. 11.
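A minimal sketch of the SCAN objective in Eq. (13) is shown below, given pre-computed nearest-neighbor assignments from a pretext-trained encoder; the dot product of soft assignments serves as the consistency measure, and the entropy weight is an illustrative value.

```python
import torch
import torch.nn.functional as F

def scan_loss(p, p_neighbor, lam=5.0, eps=1e-8):
    """p: (B, K) assignments of anchors; p_neighbor: (B, K) assignments of one
    sampled nearest neighbor per anchor (neighbors found in the pretext space)."""
    # Consistency: maximize the dot product between anchor and neighbor assignments.
    consistency = -((p * p_neighbor).sum(dim=1) + eps).log().mean()
    # Entropy regularization on the mean assignment avoids a single-cluster solution.
    p_bar = p.mean(dim=0)
    entropy = -(p_bar * (p_bar + eps).log()).sum()
    return consistency - lam * entropy

p = F.softmax(torch.randn(128, 10), dim=1)
p_neighbor = F.softmax(torch.randn(128, 10), dim=1)
loss = scan_loss(p, p_neighbor)
```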
NNM [15] and GCC [124] incorporate neighbor information into the framework of contrastive learning to group instances within neighborhoods. In particular, NNM aligns the clustering assignment of an instance with its neighbors through cluster-level contrastive learning:
$\hat{\ell}_k = -\log\dfrac{\exp\big(\mathrm{sim}(q_k, q_k^{\,n})/\tau_c\big)}{\sum_{k'=1}^{K}\exp\big(\mathrm{sim}(q_k, q_{k'}^{\,n})/\tau_c\big)}$    (14)
where $q_k$ and $q_k^{\,n}$ denote the $k$-th columns of the transposed assignment matrices of instances and of their nearest neighbors, respectively. In contrast, GCC introduces the graph structure of the latent space to modify the vanilla instance-level contrastive loss. It constructs a normalized symmetric graph Laplacian based on the $k$-nn graph:
$A = D^{-\frac{1}{2}}\,(W + I)\,D^{-\frac{1}{2}}$    (15)
Then, the loss function is given by the following form:
$\mathcal{L}_{\mathrm{GCC}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} A_{ij}\log\dfrac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{l\neq i}\exp\big(\mathrm{sim}(z_i, z_l)/\tau\big)}$    (16)
where $\tau$ is the temperature. The graph Laplacian guides the model to attract instances within neighborhoods rather than just augmentations of themselves, so that the influence of potential false-negative samples [114, 112] can be mitigated. As a result, GCC can better minimize the intra-cluster variance and maximize the inter-cluster variance. The success of this approach has inspired numerous contrastive learning methods [38, 63] in various domains to leverage neighbor relationships that effectively address the false-negative challenge.

3.5 Pseudo-Labeling
As a prevalent paradigm of semi-supervised classification [48, 6, 94], pseudo-labeling has been extended to deep clustering in recent years. The fundamental assumption of pseudo-labeling is that the predictions on unlabeled data, especially the confident ones, can provide reliable supervision and guide model training. Motivated by this, recent deep clustering works leverage confident predictions to boost clustering performance.
DEC [106] is a pioneering work that utilizes labels generated by itself to simultaneously enhance feature representations and optimize clustering assignments. DEC initializes with a pre-trained auto-encoder and learnable cluster centroids. The soft assignment $q_{ik}$ is calculated using the Student's $t$-distribution, based on the distance between the representation $z_i$ and the centroid $c_k$:
$q_{ik} = \dfrac{\big(1 + \|z_i - c_k\|^2/\alpha\big)^{-\frac{\alpha+1}{2}}}{\sum_{k'=1}^{K}\big(1 + \|z_i - c_{k'}\|^2/\alpha\big)^{-\frac{\alpha+1}{2}}}$    (17)
where $\alpha$ is the degree-of-freedom hyper-parameter of the Student's $t$-distribution and $q_{ik}$ denotes the probability of assigning the $i$-th instance to the $k$-th cluster. DEC refines the clusters by emphasizing the high-confidence assignments and making predictions more confident. Specifically, DEC uses the second power of $q_{ik}$ as a sharpened assignment to guide the training, i.e.,
$p_{ik} = \dfrac{q_{ik}^{2}/f_k}{\sum_{k'=1}^{K} q_{ik'}^{2}/f_{k'}}, \qquad f_k = \sum_{i=1}^{N} q_{ik}$    (18)
where $f_k$ is the soft cluster frequency and the sharpened assignment is normalized by $f_k$ to prevent feature collapse. Finally, a KL divergence loss between $P$ and $Q$ minimizes the distance between the two distributions, i.e., $\mathcal{L} = \mathrm{KL}(P \,\|\, Q) = \sum_{i}\sum_{k} p_{ik}\log\frac{p_{ik}}{q_{ik}}$.
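The following sketch instantiates Eqs. (17)-(18): Student's-t soft assignments against learnable centroids, the sharpened target distribution, and the KL objective. Setting $\alpha = 1$ follows the common DEC choice; all other details are illustrative.

```python
import torch
import torch.nn.functional as F

def dec_soft_assignment(z, centroids, alpha=1.0):
    """z: (N, d) features; centroids: (K, d). Returns (N, K) Student's-t assignments q."""
    dist_sq = torch.cdist(z, centroids) ** 2
    q = (1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def dec_target_distribution(q):
    """Sharpen q into the target p of Eq. (18), normalized by cluster frequency."""
    weight = q ** 2 / q.sum(dim=0)                 # q^2 / f_k
    return weight / weight.sum(dim=1, keepdim=True)

z = torch.randn(512, 64)
centroids = torch.randn(10, 64, requires_grad=True)
q = dec_soft_assignment(z, centroids)
p = dec_target_distribution(q).detach()           # the target is treated as fixed
loss = F.kl_div(q.log(), p, reduction="batchmean")  # KL(P || Q)
```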
Another notable pseudo-labeling method is DeepCluster [8]. This approach employs K-means clustering on the learned representations to obtain cluster assignments as pseudo labels. DeepCluster iteratively performs representation learning and clustering in a mutually beneficial manner to bootstrap each other. However, DeepCluster faces limitations in achieving outstanding performance, primarily due to the restricted semantics of the initial representation. Similar to DeepCluster, ProPos [36] proposes an EM framework of pseudo-labeling, iteratively performing K-means to obtain pseudo labels (E step) and updating the representation (M step). Notably, ProPos significantly outperforms DeepCluster and other methods because it performs K-means on features learned by the state-of-the-art self-supervised paradigm BYOL [26]. This observation demonstrates that the semantic quality of the representation is vital to pseudo-label generation and clustering. Low-quality features would introduce potential noise in pseudo labels, impact subsequent pseudo-label generation, and mislead representation learning, accumulating errors throughout the process.
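Below is a schematic of the alternating pseudo-labeling loop used by DeepCluster-style methods (E step: K-means on current features; M step: train with the resulting pseudo labels). The encoder, classifier head, and data are placeholders, so this is a structural sketch rather than either method's actual training recipe.

```python
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

K, feat_dim = 10, 128
encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, feat_dim))
classifier = nn.Linear(feat_dim, K)
optimizer = torch.optim.SGD(
    list(encoder.parameters()) + list(classifier.parameters()), lr=0.05)
data = torch.randn(2000, 784)   # placeholder dataset

for epoch in range(5):
    # E step: cluster current features to obtain pseudo labels.
    with torch.no_grad():
        feats = encoder(data)
    pseudo = torch.from_numpy(
        KMeans(n_clusters=K, n_init=10).fit_predict(feats.numpy())).long()
    # M step: train encoder + classifier on the pseudo labels.
    for i in range(0, len(data), 256):
        x, y = data[i:i + 256], pseudo[i:i + 256]
        loss = nn.functional.cross_entropy(classifier(encoder(x)), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```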
In addition to the progression of self-supervised paradigms, researchers are actively investigating strategies to alleviate the issue of error accumulation in pseudo-labeling. To be specific, the challenges in the realm of pseudo-labeling deep clustering remain two-fold: enhancing the accuracy of generating pseudo-labels and maximizing the utility of these pseudo-labels for effective clustering. On the one hand, inaccurate pseudo-labels pose a risk of degradation in clustering performance. On the other hand, determining how to effectively leverage these pseudo-labels for clustering is a critical consideration. These two challenges underscore the ongoing efforts in the pseudo-labeling learning of deep clustering.
The first challenge has been addressed by many works through carefully designed selection methods. For instance, SCAN [97] empirically observed that instances exhibiting highly confident predictions (i.e., a large maximum soft assignment probability) tend to be correctly clustered by the cluster head. Building on this insight, SCAN opts to choose instances with the most confident predictions as labeled data to fine-tune the model using the cross-entropy loss,
$\mathcal{L}_{\mathrm{ce}} = -\dfrac{1}{|\mathcal{D}_{c}|}\sum_{x_i\in\mathcal{D}_{c}}\log g_{\hat{y}_i}\big(f(T(x_i))\big), \qquad \mathcal{D}_{c} = \big\{x_i \mid \max_k g_k(f(x_i)) > \delta\big\}$    (19)
where $\delta$ is the threshold hyper-parameter to filter out uncertain instances, $T(\cdot)$ denotes a (strong) augmentation, and $\hat{y}_i$ is the pseudo label given by the most confident prediction. TCL [53] and SPICE [77] have devised more effective selection strategies to enhance the accuracy of pseudo-labeling. Specifically, TCL selects the most confident predictions as pseudo labels from each cluster $k$:
$\mathcal{I}_k = \mathrm{topk}\big(\{p_{ik}\}_{i=1}^{N};\, rN/K\big), \qquad \mathcal{D}_{\mathrm{pseudo}} = \bigcup_{k=1}^{K}\big\{(x_i, k) \mid i \in \mathcal{I}_k\big\}$    (20)
where $\mathrm{topk}(\cdot\,;\, rN/K)$ returns the indices of the top $rN/K$ confident instances of cluster $k$ and $\mathcal{D}_{\mathrm{pseudo}}$ denotes the union of the pseudo labels from all clusters. Here $N$ is the number of instances and $r$ is the selection ratio. The cluster-wise selection leads to more class-balanced pseudo labels compared to threshold-based criteria. It improves the clustering performance, especially for challenging classes.
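A sketch of cluster-wise confident selection in the spirit of Eq. (20) is given below: for each cluster, the top-$r$ fraction of instances by assignment confidence is taken as pseudo-labeled data. The selection ratio and shapes are illustrative assumptions, and an instance may in principle be picked by more than one cluster in this simplified version.

```python
import torch

def select_confident_pseudo_labels(p, ratio=0.1):
    """p: (N, K) soft assignments. Returns indices and pseudo labels of the
    most confident instances, selected per cluster for class balance."""
    N, K = p.shape
    per_cluster = max(1, int(ratio * N / K))
    idx_list, label_list = [], []
    for k in range(K):
        topk = torch.topk(p[:, k], k=per_cluster).indices
        idx_list.append(topk)
        label_list.append(torch.full((per_cluster,), k, dtype=torch.long))
    return torch.cat(idx_list), torch.cat(label_list)

p = torch.softmax(torch.randn(1000, 10), dim=1)
indices, pseudo_labels = select_confident_pseudo_labels(p, ratio=0.1)
```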
SPICE introduces a prototype-based pseudo-labeling approach. Specifically, it first re-computes the centroid of each cluster using only the instances with confident predictions, then re-assigns each instance a new pseudo label according to its similarity to the new centroids, formally:
$c_k = \frac{1}{|\mathcal{S}_k|}\sum_{x_i\in\mathcal{S}_k} z_i, \qquad \hat{y}_i = \arg\max_{k}\ \mathrm{sim}(z_i, c_k)$    (21)

where $\mathcal{S}_k$ denotes the set of confidently predicted instances of the $k$-th cluster.
This operation helps mitigate the influence of potentially incorrect pseudo labels used in calculating centroids, which might accumulate errors in the iterative self-training process.
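The following sketch illustrates the prototype-based re-labeling idea of Eq. (21): centroids are recomputed from confident instances only, and every instance is then re-assigned to its most similar centroid. The confidence threshold, the fallback for empty clusters, and the use of cosine similarity are illustrative choices rather than SPICE's exact settings.

```python
import torch
import torch.nn.functional as F

def prototype_pseudo_labels(z, p, threshold=0.9):
    """z: (N, d) features; p: (N, K) soft assignments. Returns new pseudo labels."""
    N, K = p.shape
    conf, hard = p.max(dim=1)
    centroids = []
    for k in range(K):
        mask = (hard == k) & (conf > threshold)        # confident members of cluster k
        if mask.any():
            centroids.append(z[mask].mean(dim=0))
        else:                                          # fall back to all members (or global mean)
            centroids.append(z[hard == k].mean(dim=0) if (hard == k).any() else z.mean(dim=0))
    centroids = F.normalize(torch.stack(centroids), dim=1)
    sim = F.normalize(z, dim=1) @ centroids.t()        # cosine similarity to centroids
    return sim.argmax(dim=1)                           # new pseudo labels

z = torch.randn(1000, 64)
p = torch.softmax(torch.randn(1000, 10), dim=1)
new_labels = prototype_pseudo_labels(z, p)
```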
To address the second challenge, i.e., better utilizing the confident labels, TCL removes negative pairs with the same pseudo label in the contrastive loss, preventing intra-class instances from being pushed apart, i.e., the false-negative issue. Meanwhile, SPICE and TCL adopt semi-supervised classification techniques like FixMatch [94] that impose pseudo-label consistency for strong augmentations of the same instance. The remarkable results achieved by these works show the effectiveness of combining reliable pseudo-labeling and semi-supervised paradigms in clustering.

3.6 External Knowledge
Most clustering approaches focus on grouping data based on inherent characteristics such as structural priors, distribution priors, and augmentation invariance priors. Instead of pursuing internal priors from the data itself, some recent works [7, 54] attempt to introduce abundant external knowledge such as textual descriptions to guide clustering. These methods prove effective because the utilization of semantic information from natural language offers valuable supervisory signals that enhance the quality of clustering.
SIC [7] is one of the first works to incorporate external knowledge guidance into clustering. The fundamental concept revolves around generating image pseudo labels from a textual space pre-trained by CLIP [83]. The process involves three main steps: i) Construction of the semantic space: SIC selects meaningful texts resembling category names to build a semantic space. ii) Pseudo-labeling: pseudo labels are generated using the text semantic centers $\{t_k\}$ and image representations $z_i$, formally,
$\hat{y}_i = \mathrm{onehot}\Big(\arg\max_{k\in\{1,\dots,K_t\}} \mathrm{sim}(z_i, t_k)\Big)$    (22)
where $K_t$ is the number of semantic centers, $t_k$ is the $k$-th semantic center, and the $\mathrm{onehot}$ operator generates a $K_t$-bit one-hot vector. The pseudo labels $\hat{y}_i$ are utilized to guide the clustering similar to SCAN [97],
$\mathcal{L}_{\mathrm{pse}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{CE}\big(\hat{y}_i,\, g(x_i)\big)$    (23)
where $\mathrm{CE}(\cdot, \cdot)$ is the cross-entropy function. iii) Consistency learning: the clustering is further enhanced by enforcing consistency between images and their neighbors in the image space,
$\mathcal{L}_{\mathrm{con}} = -\frac{1}{N}\sum_{i=1}^{N}\log\big\langle g(x_i),\, g(x_j)\big\rangle$    (24)
where $j$ is an instance index randomly selected from the nearest neighbors of the $i$-th instance. Note that SIC essentially pulls image embeddings closer to embeddings in the semantic space, while ignoring the improvement of text semantic embeddings.
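A minimal sketch of step ii) is shown below: images are pseudo-labeled by their nearest center in a text-derived semantic space, analogous to Eq. (22). The random embeddings are placeholders standing in for CLIP image features and text semantic centers.

```python
import torch
import torch.nn.functional as F

# Placeholders standing in for CLIP-encoded image features and text semantic centers.
image_feats = F.normalize(torch.randn(1000, 512), dim=1)   # image representations z_i
text_centers = F.normalize(torch.randn(10, 512), dim=1)    # semantic centers t_k

sim = image_feats @ text_centers.t()                        # cosine similarities
pseudo = sim.argmax(dim=1)                                  # nearest semantic center
one_hot = F.one_hot(pseudo, num_classes=text_centers.size(0)).float()
```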
Prior Knowledge | Method | Major Contribution | |
Structure Prior | Inherent data structure reflects semantic relations | ABDC (2013) | optimize features and clustering assignments in an EM manner
DEN (2014), SpectralNet (2018) | extend spectral clustering from shallow to deep | ||
PARTY (2016) | introduce the sparsity prior from subspace learning to deep clustering | ||
JULE (2016) | extend agglomerative clustering from shallow to deep | ||
DCC (2018) | propose relation matching to achieve non-parametric deep clustering | ||
Distribution Prior | Instances of different semantics follow distinct data distributions | VaDE (2016) | learn distinct cluster distributions by Gaussian mixture model |
ClusterGAN (2019) DCGAN (2015) | implicitly learn cluster distribution with GAN | ||
Augmentation Invariance | Instance features are invariant to data augmentation | IMSAT (2017) | propose the invariance between pair-wise augmented samples |
IIC (2019), Completer (2021) | propose the mutual information framework with respect to augmentation invariance | ||
Cluster assignments are invariant to data augmentation | PICA (2020) | explore invariance between cluster assignments of augmented samples | |
CC (2021), DRC (2020) | simultaneously explore augmentation invariance at instance and cluster level | ||
TCC (2021) | leverage a unified representation combined of the cluster semantics and instances | ||
Neighborhood Consistency | Neighboring instances have similar semantics | SCAN (2020) | impose consistent cluster assignments between neighboring instances |
NNM (2021) | perform cluster-level contrastive learning between neighbors | ||
GCC (2021) | perform instance- and cluster-level contrastive learning between neighbors | ||
Pseudo Label | Cluster assignments with high confidence are reliable | DEC (2016) | construct target cluster distribution via sharpening |
DeepCluster (2018) | generate pseudo labels with K-means | ||
SCAN (2020) | select high-confident predictions and finetune the model with strong augmented samples | ||
SPICE (2022) | select pseudo labels with the help of prototypes and adopt semi-supervised learning to fine-tune the model | ||
TCL (2022) | use pseudo labels to mitigate false negative pairs in contrastive learning | ||
ProPos (2022) | use pseudo label from K-means to increase cluster compactness | ||
External Knowledge | Abundant clustering-favorable knowledge exists in the open world | SIC (2023) | generate image pseudo labels from the textual space of a pre-trained vision-language model |
TAC (2023b) | construct more discriminative text counterparts and perform cross-modal distillation to improve clustering |
TAC [54] focuses on leveraging textual semantics to enhance the feature discriminability. Specifically, it retrieves a text counterpart among representative nouns for each image, which improves K-means performance without any additional training. Besides, TAC proposes a mutual distillation paradigm to incorporate the image and text modalities, which further improves the clustering performance. The cross-modal mutual distillation strategy is formulated as follows:
$\mathcal{L}_{\mathrm{dis}} = -\dfrac{1}{2K}\sum_{k=1}^{K}\left[\log\dfrac{\exp\big(\mathrm{sim}(p_k^{v}, \tilde{p}_k^{t})/\tau\big)}{\sum_{k'=1}^{K}\exp\big(\mathrm{sim}(p_k^{v}, \tilde{p}_{k'}^{t})/\tau\big)} + \log\dfrac{\exp\big(\mathrm{sim}(p_k^{t}, \tilde{p}_k^{v})/\tau\big)}{\sum_{k'=1}^{K}\exp\big(\mathrm{sim}(p_k^{t}, \tilde{p}_{k'}^{v})/\tau\big)}\right]$    (25)
where $\tau$ is the softmax temperature parameter, $p_k^{v}$ and $p_k^{t}$ are the $k$-th columns of the image and text assignment matrices, and $\tilde{p}_k^{v}$ and $\tilde{p}_k^{t}$ are the $k$-th columns of the assignment matrices of randomly sampled cross-modal nearest neighbors. The mutual distillation strategy has two advantages. On the one hand, it generates more discriminative clusters through the cluster-level contrastive loss. On the other hand, it encourages consistent clustering assignments between each sample and its cross-modal neighbors, which bootstraps the clustering performance in both modalities.
4 Experiment
In this section, we introduce the evaluation of deep clustering. Briefly, we first present the evaluation metrics and common benchmarks. Then we analyze the results of the existing deep clustering methods.
4.1 Evaluation Metrics
For clustering evaluation, three metrics are commonly used to measure how the predicted cluster assignments $\hat{Y}$ match the ground-truth labels $Y$, including accuracy (ACC), normalized mutual information (NMI), and adjusted rand index (ARI). A higher value of the metrics corresponds to better clustering performance. The definitions of the three metrics are as follows:
• ACC measures the proportion of correctly assigned instances under the best one-to-one mapping $m(\cdot)$ between predicted clusters and ground-truth classes, which is found by the Hungarian algorithm [46]:

$\mathrm{ACC} = \max_{m}\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\big\{y_i = m(\hat{y}_i)\big\}$    (26)

• NMI [65] quantifies the mutual information between the predicted labels $\hat{Y}$ and ground-truth labels $Y$:

$\mathrm{NMI}(Y, \hat{Y}) = \frac{2\, I(Y; \hat{Y})}{H(Y) + H(\hat{Y})}$    (27)

where $H(\cdot)$ denotes the entropy and $I(Y; \hat{Y})$ denotes the mutual information between $Y$ and $\hat{Y}$.

• ARI [37] is the normalization of the rand index (RI), which counts the number of instance pairs assigned to the same cluster and to different clusters:

$\mathrm{RI} = \frac{\mathrm{TP} + \mathrm{TN}}{\binom{N}{2}}$    (28)

where $\mathrm{TP}$ and $\mathrm{TN}$ refer to the number of true-positive and true-negative pairs, and $\binom{N}{2}$ is the number of possible instance pairs. ARI is computed by adding the following normalization:

$\mathrm{ARI} = \frac{\mathrm{RI} - \mathbb{E}[\mathrm{RI}]}{\max(\mathrm{RI}) - \mathbb{E}[\mathrm{RI}]}$    (29)

where $\mathbb{E}[\mathrm{RI}]$ denotes the expectation of RI.
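A compact sketch of computing the three metrics with scipy/scikit-learn is given below, where clustering accuracy uses the Hungarian algorithm to find the best cluster-to-class mapping; this mirrors common evaluation code rather than a specific benchmark's implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """Best-match accuracy between predicted clusters and ground-truth classes."""
    K = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((K, K), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                                   # co-occurrence counts
    row, col = linear_sum_assignment(cost.max() - cost)   # optimal mapping (Hungarian)
    return cost[row, col].sum() / len(y_true)

y_true = np.random.randint(0, 10, size=1000)
y_pred = np.random.randint(0, 10, size=1000)
print(clustering_accuracy(y_true, y_pred),
      normalized_mutual_info_score(y_true, y_pred),
      adjusted_rand_score(y_true, y_pred))
```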
Dataset | Split | Samples | Classes | Image Size |
CIFAR-10 | Train+Test | 60,000 | 10 | 32×32 |
CIFAR-100 | Train+Test | 60,000 | 20 | 32×32 |
STL-10 | Train+Test | 13,000 | 10 | 96×96 |
ImageNet-10 | Train | 13,000 | 10 | 96×96 |
ImageNet-Dogs | Train | 19,500 | 15 | 96×96 |
Tiny-ImageNet | Train | 100,000 | 200 | 64×64 |
ImageNet-1K | Train | 1,281,167 | 1000 | 224×224 |
Method | CIFAR-10 | CIFAR-100 | STL-10 | ImageNet-10 | ImageNet-Dogs | ||||||||||
ACC | NMI | ARI | ACC | NMI | ARI | ACC | NMI | ARI | ACC | NMI | ARI | ACC | NMI | ARI | |
K-means (1967) | 22.9 | 8.7 | 4.9 | 13.0 | 8.4 | 2.8 | 19.2 | 12.5 | 6.1 | 24.1 | 11.9 | 5.7 | - | - | - |
JULE (2016) | 27.2 | 19.2 | 13.8 | 13.7 | 10.3 | 3.3 | 27.7 | 18.2 | 16.4 | 30.0 | 17.5 | 13.8 | 13.8 | 5.4 | 2.8 |
DCGAN (2015) | 31.5 | 26.5 | 17.6 | 15.1 | 12.0 | 4.5 | 29.9 | 22.7 | 16.2 | 31.3 | 18.6 | 14.2 | 17.8 | 9.8 | 7.3 |
IIC (2019) | 61.7 | 51.3 | 41.1 | 25.7 | 22.5 | 11.7 | 59.6 | 49.6 | 39.7 | - | - | - | - | - | - |
PICA (2020) | 69.6 | 59.1 | 51.2 | 33.7 | 31.0 | 17.1 | 71.3 | 61.1 | 53.1 | 87.0 | 80.2 | 76.1 | 35.3 | 35.2 | 20.1 |
CC (2021) | 79.0 | 70.5 | 63.7 | 42.9 | 43.1 | 26.6 | 85.0 | 76.4 | 72.6 | 89.3 | 85.9 | 82.2 | 42.9 | 44.5 | 27.4 |
TCC (2021) | 90.6 | 79.0 | 73.3 | 49.1 | 47.9 | 31.2 | 81.4 | 73.2 | 68.9 | 89.7 | 84.8 | 82.5 | 59.5 | 55.4 | 41.7 |
SCAN∗ (2020) | 81.8 | 71.2 | 66.5 | 42.2 | 44.1 | 26.7 | 75.5 | 65.4 | 59.0 | - | - | - | - | - | - |
NNM† (2021) | 83.7 | 73.7 | 69.4 | 45.9 | 48.0 | 30.2 | 76.8 | 66.3 | 59.6 | - | - | - | 58.6 | 60.4 | 44.9 |
GCC (2021) | 85.6 | 76.4 | 72.8 | 47.2 | 47.2 | 30.5 | 78.8 | 68.4 | 63.1 | 90.1 | 84.2 | 82.2 | 52.6 | 49.0 | 36.2 |
DEC (2016) | 30.1 | 25.7 | 16.1 | 18.5 | 13.6 | 5.0 | 35.9 | 27.6 | 18.6 | 38.1 | 28.2 | 20.3 | 19.5 | 12.2 | 7.9 |
DeepCluster (2018) | 37.4 | - | - | - | - | - | 33.4 | - | - | 18.9 | - | - | - | - | - |
SCAN† (2020) | 87.6 | 78.7 | 75.8 | 48.3 | 48.5 | 31.4 | 81.8 | 70.3 | 66.1 | - | - | - | 59.3 | 61.2 | 45.7 |
SPICE (2022) | 83.8 | 73.4 | 70.5 | 46.8 | 44.8 | 29.4 | 90.8 | 81.7 | 81.2 | 92.1 | 82.8 | 83.6 | 64.6 | 57.2 | 47.9 |
TCL (2022) | 88.7 | 81.9 | 78.0 | 53.1 | 52.9 | 35.7 | 86.8 | 79.9 | 75.7 | 89.5 | 87.5 | 83.7 | 64.4 | 62.3 | 51.6 |
ProPos (2022) | 94.3 | 88.6 | 88.4 | 61.4 | 60.6 | 45.1 | 86.7 | 75.8 | 73.7 | 95.6 | 89.6 | 90.6 | 74.5 | 69.2 | 62.7 |
SIC† (2023) | 92.6 | 84.7 | 84.4 | 58.3 | 59.3 | 43.9 | 98.1 | 95.3 | 95.9 | 98.2 | 97.0 | 96.1 | 69.7 | 69.0 | 55.8 |
TAC (2023b) | 92.3 | 84.1 | 83.9 | 60.7 | 61.1 | 44.8 | 98.2 | 95.5 | 96.1 | 99.2 | 98.5 | 98.3 | 83.0 | 80.6 | 72.2 |
4.2 Datasets
In the early stage, deep clustering methods are evaluated on relatively small and low-dimensional datasets (e.g. COIL-20 [72], YaleB [22]). Recently, with the rapid development of deep clustering methods, it has become more popular to evaluate clustering performance on more complex and challenging datasets. There are five widely used benchmark datasets:
• CIFAR-10 [45] consists of 60,000 colored images from 10 different classes including airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.

• CIFAR-100 [45] contains 100 classes grouped into 20 superclasses. Each image comes with a “fine” class label and a “coarse” superclass label.

• STL-10 [13] contains 13,000 labeled images from 10 object classes. Besides, it provides 100,000 unlabeled images for self-supervised learning to enhance the clustering performance.

• ImageNet-10 [9] is a subset of ImageNet [17], consisting of 13,000 images from 10 classes.

• ImageNet-Dogs [9] is another subset of ImageNet. It consists of images belonging to 15 dog breeds, which is suitable for fine-grained clustering tasks.
4.3 Performance Comparisons
The clustering performance on five widely used datasets is shown in Table 4. Thanks to the feature extraction ability of deep neural networks, early deep clustering methods based on structure and distribution priors achieve much better performance than the classic K-means. Then, a series of contrastive clustering methods significantly improve the performance by introducing additional priors through data augmentation. After that, more advanced methods boost the performance by further considering the neighborhood consistency (GCC compared with CC) and utilizing pseudo labels (SCAN† compared with SCAN∗). Notably, the performance gains of different priors are complementary. For example, ProPos remarkably outperforms DEC and CC by additionally utilizing the augmentation invariance and pseudo-labeling priors, respectively. Very recently, external-knowledge-based methods achieve state-of-the-art performance, which proves the promising prospect of such a new deep clustering paradigm. In addition, clustering becomes more challenging when the number of categories grows (from CIFAR-10 to CIFAR-100) or the semantics become more complex (from CIFAR-10 to ImageNet-Dogs). Such results suggest that more challenging datasets such as the full ImageNet-1K are expected to serve as benchmarks in future works.
5 Application in Vicinagearth
In this section, we explore some typical applications of deep clustering within the domain of Vicinagearth, a term crafted from the fusion of “Vicinage” and “Earth”. Vicinagearth represents the critical spatial expanse ranging from 1,000 meters below sea level (the depth at which sunlight ceases to penetrate) to 10,000 meters above sea level (the typical cruising altitude of commercial aircraft). This zone is of great importance as it encompasses the core regions of human activity including areas of habitation and production. Recently, deep clustering has emerged as an indispensable analytical tool within Vicinagearth, instrumental in unveiling complex patterns and structures of data within the vicinal space. The diverse applications of deep clustering in this zone include anomaly detection, environmental monitoring, community detection, person re-identification, and more.
Anomaly Detection, also known as Outlier Detection [14] or Novelty Detection [20], attempts to identify abnormal instances or patterns. In the context of Vicinagearth, deep clustering proves valuable for analyzing sensor data obtained from diverse sources like underwater monitoring systems, aerial sensors, or ground-based sensors [10]. Through the analysis of the patterns and typical behaviors from the sensor data, the system becomes adept at detecting anomalies, which may signal security threats or irregular activities.
Environmental Monitoring involves the analysis of data collected from environmental sensors [105], such as monitoring air quality, water conditions, and geological factors. The primary goal is to ensure the health of ecosystems [103] and detect potential environmental threats, such as pollution events or natural disasters. Deep clustering techniques play a crucial role in grouping similar environmental patterns, facilitating the identification of abnormalities. This application contributes to real-time environmental monitoring [47], enhancing the ability to respond promptly to environmental challenges.
Community Detection [21, 41] involves evaluating how groups of nodes are clustered or partitioned and their tendency to strengthen or break apart within a network. In the context of Vicinagearth, this technique is applied to identify groups of species [70] that interact closely or share similar ecological niches. Deep clustering plays a pivotal role in the analysis of complex ecological networks [67], contributing to a deeper understanding of ecological communities and their dynamics.
Person Re-identification [102, 115] is a crucial task that involves recognizing and matching individuals across different camera views [113]. This technology plays a significant role in public safety and law enforcement initiatives, as it helps to monitor densely populated areas for potential threats or subjects on a watchlist. The integration of deep clustering algorithms has remarkably improved the scalability and efficiency [109] of person re-identification systems. Deep clustering effectively enables the management of the complexities presented by large and dynamically changing crowds. Furthermore, the adaptability of deep clustering techniques broadens their use to include the monitoring of natural habitats and the tracking of wildlife in diverse and uncontrolled settings.
6 Future Challenges
Although existing works achieve remarkable performance, some practical challenges and emerging requirements have yet to be fully addressed. In this section, we delve into some future directions of modern deep clustering.
6.1 Fine-grained Clustering
The objective of fine-grained clustering is to discern subtle and intricate variations within data, which is particularly advantageous in research like the identification of biological subspecies [55, 56]. The primary challenge is that fine-grained classes exhibit a high degree of similarity, where distinctions often lie in coloration, markings, shape, or other subtle characteristics. In such scenarios, traditional coarse-grained clustering priors frequently prove inadequate. For instance, color and shape augmentations in augmentation invariance prior would become ineffective. Recently, C3-GAN [42] employs contrastive learning within adversarial training to generate lifelike images, enabling the nuanced capture of fine-grained details and ensuring the separability between clusters.
6.2 Non-parametric Clustering
Many clustering methods typically require a predefined and fixed number of clusters. However, real-world datasets often present a challenge with an unknown number of clusters, reflecting situations closer to reality. Only a few works [11, 89, 122, 100] have been devoted to solving this problem. These methods often rely on calculating global similarity and introduce huge computational costs, especially on large-scale datasets. Therefore, efficiently determining the optimal number of clusters remains an open challenge, often involving the incorporation of human priors. Among existing works, DeepDPM introduces the Dirichlet Process Gaussian Mixture Model (DPGMM) [3], which utilizes the Dirichlet Process as the prior distribution over mixture components. DeepDPM dynamically adjusts the number of clusters through split and merge operations guided by the Metropolis-Hastings framework [29].
6.3 Fair Clustering
Collecting real-world datasets from diverse sources with various acquisition methods can enhance the generalization of machine learning models. However, these datasets frequently manifest inherent biases, notably in sensitive attributes such as gender, race, and ethnicity. These biases would introduce disparities among individuals and minority groups, leading to cluster partitions that deviate from the underlying objective characteristics of the data. The pursuit of fairness is particularly pertinent in applications where unbiased and equitable analyses are crucial, such as employment, healthcare, and education. To tackle this challenge, fair clustering seeks to mitigate the influence of these biases given the biased attributes of each sample.
To address this daunting task, Chierichetti et al [12] first introduced a data pre-processing method known as fairlet decomposition. Recent advancements address this issue on large-scale data through adversarial training [51] and mutual information maximization [116]. Notably, Zeng et al design a novel metric that assesses both clustering quality and fairness from the perspective of information theory. Despite these developments, there is still room for improvement, and the establishment of better evaluation metrics remains an ongoing direction of this research.
6.4 Multi-view Clustering
Multi-view data [107, 62] is common in real-world situations where information is captured from a variety of sensors or observed from multiple angles. This data is inherently rich, offering diverse yet consistent information. For example, an RGB view would provide color details while the depth view reveals spatial information, which represents the complementary aspects of the views. Simultaneously, there exists a level of view consistency as the same object possesses common attributes across different views. To deal with multi-view data, multi-view clustering [16, 61] is proposed to exploit both the complementary and consistent characters. The goal is to integrate information from all views to produce a unified and insightful clustering result.
Over recent years, several deep learning approaches [2, 99, 121, 80] have been developed to address this challenge. Binary multi-view clustering [120] simultaneously refines binary cluster structures alongside discrete data representations, ensuring cohesive clustering. In pursuit of view consistency, Lin et al [57, 58] maximize mutual information across views, thus aligning common properties. SURE [114] aims to strengthen the consistency of shared features between views by utilizing a robust contrastive loss. Recently, Li et al [50] perform a bound contrastive loss to preserve the view complementarity at the cluster level. These innovative methodologies demonstrate the significant strides made in the field of multi-view analysis, where clustering continues to play a pivotal role in enhancing the synergistic exploitation of multi-view data.
7 Conclusion
The key to deep clustering or unsupervised learning is to seek effective supervision to guide representation learning. Different from traditional taxonomies from the network structure or data type, this survey offers a comprehensive review from the perspective of prior knowledge. With the evolution of clustering technologies, there is a discernible trend shifting from exploring priors within the data itself to external knowledge like natural language guiding. The exploration of external pre-trained models like ChatGPT or GPT-4V(ision) might emerge as a promising avenue. This survey potentially provides some valuable insight and inspires further exploration and advancements in deep clustering.
References
- Amigó et al [2009] Amigó E, Gonzalo J, Artiles J, et al (2009) A comparison of extrinsic clustering evaluation metrics based on formal constraints. Information retrieval 12:461–486
- Andrew et al [2013] Andrew G, Arora R, Bilmes J, et al (2013) Deep canonical correlation analysis. In: International conference on machine learning, PMLR, pp 1247–1255
- Antoniak [1974] Antoniak CE (1974) Mixtures of dirichlet processes with applications to bayesian nonparametric problems. The annals of statistics pp 1152–1174
- Belkin and Niyogi [2001] Belkin M, Niyogi P (2001) Laplacian eigenmaps and spectral techniques for embedding and clustering. Advances in neural information processing systems 14
- Bengio et al [2013] Bengio Y, Courville A, Vincent P (2013) Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence 35(8):1798–1828
- Berthelot et al [2019] Berthelot D, Carlini N, Goodfellow I, et al (2019) Mixmatch: A holistic approach to semi-supervised learning. Advances in neural information processing systems 32
- Cai et al [2023] Cai S, Qiu L, Chen X, et al (2023) Semantic-enhanced image clustering. In: Proceedings of the AAAI Conference on Artificial Intelligence, pp 6869–6878
- Caron et al [2018] Caron M, Bojanowski P, Joulin A, et al (2018) Deep clustering for unsupervised learning of visual features. In: Proceedings of the European conference on computer vision (ECCV), pp 132–149
- Chang et al [2017] Chang J, Wang L, Meng G, et al (2017) Deep adaptive image clustering. In: Proceedings of the IEEE international conference on computer vision, pp 5879–5887
- Chatterjee and Ahmed [2022] Chatterjee A, Ahmed BS (2022) Iot anomaly detection methods and applications: A survey. Internet of Things 19:100568
- Chen [2015] Chen G (2015) Deep learning with nonparametric clustering. arXiv preprint arXiv:150103084
- Chierichetti et al [2017] Chierichetti F, Kumar R, Lattanzi S, et al (2017) Fair clustering through fairlets. Advances in neural information processing systems 30
- Coates et al [2011] Coates A, Ng A, Lee H (2011) An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, JMLR Workshop and Conference Proceedings, pp 215–223
- Comaniciu and Meer [2002] Comaniciu D, Meer P (2002) Mean shift: A robust approach toward feature space analysis. IEEE Transactions on pattern analysis and machine intelligence 24(5):603–619
- Dang et al [2021] Dang Z, Deng C, Yang X, et al (2021) Nearest neighbor matching for deep clustering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 13693–13702
- Deng et al [2015] Deng C, Lv Z, Liu W, et al (2015) Multi-view matrix decomposition: A new scheme for exploring discriminative information. In: Twenty-Fourth International Joint Conference on Artificial Intelligence
- Deng et al [2009] Deng J, Dong W, Socher R, et al (2009) Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, Ieee, pp 248–255
- Dong et al [2021] Dong S, Wang P, Abbas K (2021) A survey on deep learning and its applications. Computer Science Review 40:100379
- Ester et al [1996a] Ester M, Kriegel HP, Sander J, et al (1996a) A density-based algorithm for discovering clusters in large spatial databases with noise. In: kdd, pp 226–231
- Ester et al [1996b] Ester M, Kriegel HP, Sander J, et al (1996b) A density-based algorithm for discovering clusters in large spatial databases with noise. In: kdd, pp 226–231
- Fortunato [2010] Fortunato S (2010) Community detection in graphs. Physics reports 486(3-5):75–174
- Georghiades et al [2001] Georghiades AS, Belhumeur PN, Kriegman DJ (2001) From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE transactions on pattern analysis and machine intelligence 23(6):643–660
- Gidaris et al [2018] Gidaris S, Singh P, Komodakis N (2018) Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:180307728
- Goodfellow et al [2014] Goodfellow I, Pouget-Abadie J, Mirza M, et al (2014) Generative adversarial nets. Advances in neural information processing systems 27
- Gowda and Krishna [1978] Gowda KC, Krishna G (1978) Agglomerative clustering using the concept of mutual nearest neighbourhood. Pattern recognition 10(2):105–112
- Grill et al [2020] Grill JB, Strub F, Altché F, et al (2020) Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems 33:21271–21284
- Gurumurthy et al [2017] Gurumurthy S, Kiran Sarvadevabhatla R, Venkatesh Babu R (2017) Deligan: Generative adversarial networks for diverse and limited data. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 166–174
- Hadsell et al [2006] Hadsell R, Chopra S, LeCun Y (2006) Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE computer society conference on computer vision and pattern recognition (CVPR’06), IEEE, pp 1735–1742
- Hastings [1970] Hastings WK (1970) Monte carlo sampling methods using markov chains and their applications
- He et al [2020] He K, Fan H, Wu Y, et al (2020) Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 9729–9738
- Hu et al [2017] Hu W, Miyato T, Tokui S, et al (2017) Learning discrete representations via information maximizing self-augmented training. In: International conference on machine learning, PMLR, pp 1558–1567
- Huang et al [2020] Huang J, Gong S, Zhu X (2020) Deep semantic clustering by partition confidence maximisation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 8849–8858
- Huang et al [2014] Huang P, Huang Y, Wang W, et al (2014) Deep embedding network for clustering. In: 2014 22nd International conference on pattern recognition, IEEE, pp 1532–1537
- Huang et al [2019] Huang Z, Zhou JT, Peng X, et al (2019) Multi-view spectral clustering network. In: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, pp 2563–2569, 10.24963/ijcai.2019/356
- Huang et al [2021] Huang Z, Zhou JT, Zhu H, et al (2021) Deep spectral representation learning from multi-view data. IEEE Transactions on Image Processing 30:5352–5362
- Huang et al [2022] Huang Z, Chen J, Zhang J, et al (2022) Learning representation for clustering via prototype scattering and positive sampling. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Hubert and Arabie [1985] Hubert L, Arabie P (1985) Comparing partitions. Journal of classification 2:193–218
- Huynh et al [2022] Huynh T, Kornblith S, Walter MR, et al (2022) Boosting contrastive self-supervised learning with false negative cancellation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, pp 2785–2795
- Ji et al [2019] Ji X, Henriques JF, Vedaldi A (2019) Invariant information clustering for unsupervised image classification and segmentation. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9865–9874
- Jiang et al [2016] Jiang Z, Zheng Y, Tan H, et al (2016) Variational deep embedding: An unsupervised and generative approach to clustering. arXiv preprint arXiv:161105148
- Jin et al [2021] Jin D, Yu Z, Jiao P, et al (2021) A survey of community detection approaches: From statistical modeling to deep learning. IEEE Transactions on Knowledge and Data Engineering 35(2):1149–1170
- Kim and Ha [2021] Kim Y, Ha JW (2021) Contrastive fine-grained class clustering via generative adversarial networks
- Kingma and Welling [2013] Kingma DP, Welling M (2013) Auto-encoding variational bayes. arXiv preprint arXiv:13126114
- Krause et al [2010] Krause A, Perona P, Gomes R (2010) Discriminative clustering by regularized information maximization. Advances in neural information processing systems 23
- Krizhevsky et al [2009] Krizhevsky A, Hinton G, et al (2009) Learning multiple layers of features from tiny images
- Kuhn [1955] Kuhn HW (1955) The hungarian method for the assignment problem. Naval research logistics quarterly 2(1-2):83–97
- Kumar et al [2012] Kumar A, Kim H, Hancke GP (2012) Environmental monitoring systems: A review. IEEE Sensors Journal 13(4):1329–1339
- Laine and Aila [2016] Laine S, Aila T (2016) Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:161002242
- Le and Yang [2015] Le Y, Yang X (2015) Tiny imagenet visual recognition challenge. CS 231N 7(7):3
- Li et al [2023a] Li H, Li Y, Yang M, et al (2023a) Incomplete multi-view clustering via prototype-based imputation. arXiv preprint arXiv:230111045
- Li et al [2020] Li P, Zhao H, Liu H (2020) Deep fair clustering for visual learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 9070–9079
- Li et al [2021] Li Y, Hu P, Liu Z, et al (2021) Contrastive clustering. In: Proceedings of the AAAI conference on artificial intelligence, pp 8547–8555
- Li et al [2022] Li Y, Yang M, Peng D, et al (2022) Twin contrastive learning for online clustering. International Journal of Computer Vision 130(9):2205–2221
- Li et al [2023b] Li Y, Hu P, Peng D, et al (2023b) Image clustering with external guidance. arXiv preprint arXiv:2310.11989
- Li et al [2023c] Li Y, Lin Y, Hu P, et al (2023c) Single-cell RNA-seq debiased clustering via batch effect disentanglement. IEEE Transactions on Neural Networks and Learning Systems
- Li et al [2023d] Li Y, Zhang D, Yang M, et al (2023d) scBridge embraces cell heterogeneity in single-cell RNA-seq and ATAC-seq data integration. Nature Communications 14(1):6045
- Lin et al [2021] Lin Y, Gou Y, Liu Z, et al (2021) COMPLETER: Incomplete multi-view clustering via contrastive prediction. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 11174–11183
- Lin et al [2022] Lin Y, Gou Y, Liu X, et al (2022) Dual contrastive prediction for incomplete multi-view representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence pp 1–14. 10.1109/TPAMI.2022.3197238
- Lin et al [2023] Lin Y, Yang M, Yu J, et al (2023) Graph matching with bi-level noisy correspondence. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 23362–23371
- Liu et al [2022] Liu J, Lin Y, Jiang L, et al (2022) Improve interpretability of neural networks via sparse contrastive coding. In: Findings of the Association for Computational Linguistics: EMNLP 2022, pp 460–470
- Liu et al [2019a] Liu X, Zhu X, Li M, et al (2019a) Multiple kernel k-means with incomplete kernels. IEEE transactions on pattern analysis and machine intelligence 42(5):1191–1204
- Liu et al [2019b] Liu X, Zhu X, Li M, et al (2019b) Multiple kernel k-means with incomplete kernels. IEEE transactions on pattern analysis and machine intelligence 42(5):1191–1204
- Lu et al [2024] Lu Y, Lin Y, Yang M, et al (2024) Decoupled contrastive multi-view clustering with high-order random walks. Proceedings of the AAAI Conference on Artificial Intelligence 38(13):14193–14201
- MacQueen et al [1967] MacQueen J, et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, Oakland, CA, USA, pp 281–297
- McDaid et al [2011] McDaid AF, Greene D, Hurley N (2011) Normalized mutual information to evaluate overlapping community finding algorithms. arXiv preprint arXiv:1110.2515
- Min et al [2018] Min E, Guo X, Liu Q, et al (2018) A survey of clustering with deep learning: From the perspective of network architecture. IEEE Access 6:39501–39514
- Montoya et al [2006] Montoya JM, Pimm SL, Solé RV (2006) Ecological networks and their fragility. Nature 442(7100):259–264
- Moskalev et al [2022] Moskalev A, Sosnovik I, Fischer V, et al (2022) Contrasting quadratic assignments for set-based representation learning. In: European Conference on Computer Vision, Springer, pp 88–104
- Mukherjee et al [2019] Mukherjee S, Asnani H, Lin E, et al (2019) ClusterGAN: Latent space clustering in generative adversarial networks. In: Proceedings of the AAAI conference on artificial intelligence, pp 4610–4617
- Murdock and Yaeger [2011] Murdock J, Yaeger LS (2011) Identifying species by genetic clustering. In: ECAL, pp 564–572
- Murtagh and Contreras [2012] Murtagh F, Contreras P (2012) Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(1):86–97
- Nene et al [1996] Nene SA, Nayar SK, Murase H, et al (1996) Columbia object image library (COIL-20)
- Newman and Girvan [2004] Newman ME, Girvan M (2004) Finding and evaluating community structure in networks. Physical review E 69(2):026113
- Nguyen et al [2021] Nguyen XB, Bui DT, Duong CN, et al (2021) Clusformer: A transformer based clustering approach to unsupervised large-scale face and visual landmark recognition. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 10842–10851, 10.1109/CVPR46437.2021.01070
- Nie et al [2016] Nie F, Li J, Li X, et al (2016) Parameter-free auto-weighted multiple graph learning: a framework for multiview clustering and semi-supervised classification. In: IJCAI
- Nie et al [2017] Nie F, Li J, Li X, et al (2017) Self-weighted multiview clustering with multiple graphs. In: IJCAI, pp 2564–2570
- Niu et al [2022] Niu C, Shan H, Wang G (2022) SPICE: Semantic pseudo-labeling for image clustering. IEEE Transactions on Image Processing 31:7264–7278
- Oord et al [2018] Oord Avd, Li Y, Vinyals O (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748
- Peng et al [2016] Peng X, Xiao S, Feng J, et al (2016) Deep subspace clustering with sparsity prior. In: IJCAI, pp 1925–1931
- Peng et al [2019] Peng X, Huang Z, Lv J, et al (2019) COMIC: Multi-view clustering without parameter selection. In: International conference on machine learning, PMLR, pp 5092–5101
- Qian [2023] Qian Q (2023) Stable cluster discrimination for deep clustering. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 16645–16654
- Radford et al [2015] Radford A, Metz L, Chintala S (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434
- Radford et al [2021] Radford A, Kim JW, Hallacy C, et al (2021) Learning transferable visual models from natural language supervision. In: International conference on machine learning, PMLR, pp 8748–8763
- Ren et al [2022] Ren Y, Pu J, Yang Z, et al (2022) Deep clustering: A comprehensive survey. arXiv preprint arXiv:2210.04142
- Roweis and Saul [2000] Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323–2326
- Saeedi Emadi and Mazinani [2018] Saeedi Emadi H, Mazinani SM (2018) A novel anomaly detection algorithm using DBSCAN and SVM in wireless sensor networks. Wireless Personal Communications 98:2025–2035
- Schaeffer [2007] Schaeffer SE (2007) Graph clustering. Computer science review 1(1):27–64
- Shah and Koltun [2017] Shah SA, Koltun V (2017) Robust continuous clustering. Proceedings of the National Academy of Sciences 114(37):9814–9819
- Shah and Koltun [2018] Shah SA, Koltun V (2018) Deep continuous clustering. arXiv preprint arXiv:1803.01449
- Shaham and Lederman [2018] Shaham U, Lederman RR (2018) Learning by coincidence: Siamese networks and common variable learning. Pattern Recognition 74:52–63
- Shaham et al [2018] Shaham U, Stanton K, Li H, et al (2018) SpectralNet: Spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587
- Shen et al [2021] Shen Y, Shen Z, Wang M, et al (2021) You never cluster alone. Advances in Neural Information Processing Systems 34:27734–27746
- Shorten and Khoshgoftaar [2019] Shorten C, Khoshgoftaar TM (2019) A survey on image data augmentation for deep learning. Journal of big data 6(1):1–48
- Sohn et al [2020] Sohn K, Berthelot D, Li CL, et al (2020) FixMatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685
- Song et al [2013] Song C, Liu F, Huang Y, et al (2013) Auto-encoder based data clustering. In: Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications: 18th Iberoamerican Congress, CIARP 2013, Havana, Cuba, November 20-23, 2013, Proceedings, Part I 18, Springer, pp 117–124
- Su et al [2022] Su X, Xue S, Liu F, et al (2022) A comprehensive survey on community detection with deep learning. IEEE Transactions on Neural Networks and Learning Systems pp 1–21. 10.1109/TNNLS.2021.3137396
- Van Gansbeke et al [2020] Van Gansbeke W, Vandenhende S, Georgoulis S, et al (2020) SCAN: Learning to classify images without labels. In: European conference on computer vision, Springer, pp 268–285
- Wang et al [2018] Wang Q, Chen M, Nie F, et al (2018) Detecting coherent groups in crowd scenes by multiview clustering. IEEE transactions on pattern analysis and machine intelligence 42(1):46–58
- Wang et al [2016] Wang W, Yan X, Lee H, et al (2016) Deep variational canonical correlation analysis. arXiv preprint arXiv:1610.03454
- Wang et al [2021] Wang Z, Ni Y, Jing B, et al (2021) DNB: A joint learning framework for deep Bayesian nonparametric clustering. IEEE Transactions on Neural Networks and Learning Systems 33(12):7610–7620
- Wright et al [2010] Wright J, Ma Y, Mairal J, et al (2010) Sparse representation for computer vision and pattern recognition. Proceedings of the IEEE 98(6):1031–1044
- Wu et al [2019] Wu D, Zheng SJ, Zhang XP, et al (2019) Deep learning-based methods for person re-identification: A comprehensive review. Neurocomputing 337:354–371
- Wu et al [2016] Wu M, Tan L, Xiong N (2016) Data prediction, compression, and recovery in clustered wireless sensor networks for environmental monitoring applications. Information Sciences 329:800–818
- Wu et al [2018] Wu Z, Xiong Y, Yu SX, et al (2018) Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3733–3742
- Xia and Vlajic [2007] Xia D, Vlajic N (2007) Near-optimal node clustering in wireless sensor networks for environment monitoring. In: 21st international conference on advanced information networking and applications (AINA’07), IEEE, pp 632–641
- Xie et al [2016] Xie J, Girshick R, Farhadi A (2016) Unsupervised deep embedding for clustering analysis. In: International conference on machine learning, PMLR, pp 478–487
- Xu et al [2013] Xu C, Tao D, Xu C (2013) A survey on multi-view learning. arXiv preprint arXiv:1304.5634
- Xu et al [2022] Xu J, De Mello S, Liu S, et al (2022) GroupViT: Semantic segmentation emerges from text supervision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 18134–18144
- Yan et al [2023] Yan Y, Li J, Qin J, et al (2023) Efficient person search: An anchor-free approach. International Journal of Computer Vision pp 1–20
- Yang et al [2016] Yang J, Parikh D, Batra D (2016) Joint unsupervised learning of deep representations and image clusters. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 5147–5156
- Yang et al [2023] Yang J, Liu J, Xu N, et al (2023) TVT: Transferable vision transformer for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp 520–530
- Yang et al [2021] Yang M, Li Y, Huang Z, et al (2021) Partially view-aligned representation learning with noise-robust contrastive loss. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- Yang et al [2022a] Yang M, Huang Z, Hu P, et al (2022a) Learning with twin noisy labels for visible-infrared person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
- Yang et al [2022b] Yang M, Li Y, Hu P, et al (2022b) Robust multi-view clustering with incomplete information. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Ye et al [2021] Ye M, Shen J, Lin G, et al (2021) Deep learning for person re-identification: A survey and outlook. IEEE transactions on pattern analysis and machine intelligence 44(6):2872–2893
- Zeng et al [2023] Zeng P, Li Y, Hu P, et al (2023) Deep fair clustering via maximizing and minimizing mutual information: Theory, algorithm and metric. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 23986–23995
- Zhang et al [2015] Zhang C, Fu H, Liu S, et al (2015) Low-rank tensor constrained multiview subspace clustering. In: Proceedings of the IEEE international conference on computer vision, pp 1582–1590
- Zhang et al [2023] Zhang H, Nie F, Li X (2023) Large-scale clustering with structured optimal bipartite graph. IEEE Transactions on Pattern Analysis and Machine Intelligence
- Zhang et al [2019] Zhang L, Qi GJ, Wang L, et al (2019) AET vs. AED: Unsupervised representation learning by auto-encoding transformations rather than data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp 2547–2555
- Zhang et al [2018] Zhang Z, Liu L, Shen F, et al (2018) Binary multi-view clustering. IEEE transactions on pattern analysis and machine intelligence 41(7):1774–1782
- Zhao et al [2016] Zhao H, Liu H, Fu Y (2016) Incomplete multi-modal visual data grouping. In: IJCAI, pp 2392–2398
- Zhao et al [2019] Zhao T, Wang Z, Masoomi A, et al (2019) Streaming adaptive nonparametric variational autoencoder. arXiv preprint arXiv:1906.03288
- Zhong et al [2020] Zhong H, Chen C, Jin Z, et al (2020) Deep robust clustering by contrastive learning. arXiv preprint arXiv:2008.03030
- Zhong et al [2021] Zhong H, Wu J, Chen C, et al (2021) Graph contrastive clustering. In: Proceedings of the IEEE/CVF international conference on computer vision, pp 9224–9233
- Zhou et al [2022] Zhou S, Xu H, Zheng Z, et al (2022) A comprehensive survey on deep clustering: Taxonomy, challenges, and future directions. arXiv preprint arXiv:2206.07579
Author Contributions
All authors contributed to the core insights presented in this paper. Xi Peng supervised this survey and provided valuable guidance throughout the process. Yiding Lu, Haobin Li, Yunfan Li, and Yijie Lin collaboratively wrote the Priors for Deep Clustering section. Yiding Lu took the lead in writing the Introduction, Application, and Future Challenges sections. Haobin Li collected and analyzed the experimental results, created the figures, and compiled the tables. Yunfan Li and Yijie Lin designed the outline, wrote the Abstract, and refined the manuscript.
Data Availability
The datasets utilized in this survey are publicly available and can be accessed from the following sources:
• CIFAR-10 and CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html
• ImageNet-10 and ImageNet-Dogs: Google Drive (preprocessed versions)
• Tiny-ImageNet: http://cs231n.stanford.edu/tiny-imagenet-200.zip
• ImageNet-1K: https://www.image-net.org/
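For readers who wish to obtain the data programmatically, the sketch below shows one possible way to fetch the two CIFAR datasets; it assumes a Python environment with torchvision installed, which is an assumption of this illustration rather than a requirement of the surveyed methods. The ImageNet variants and Tiny-ImageNet must be downloaded manually from the links above.

```python
# A minimal sketch of dataset access (assumes torchvision; any downloader
# that follows the URLs listed above works equally well).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# CIFAR-10 and CIFAR-100 are downloaded automatically from the official source.
cifar10 = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)
cifar100 = datasets.CIFAR100(root="./data", train=True, download=True, transform=to_tensor)

# Tiny-ImageNet, ImageNet-10/Dogs, and ImageNet-1K require a manual download;
# once extracted into class-named folders, they can be read with ImageFolder, e.g.:
# tiny_imagenet = datasets.ImageFolder(root="./data/tiny-imagenet-200/train", transform=to_tensor)

print(len(cifar10), len(cifar100))  # 50000 training images each
```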