Hashing-Based Distributed Clustering
for Massive High-Dimensional Data
Abstract
Clustering analysis is of substantial significance for data mining. The properties of big data place higher demands on efficient and economical distributed clustering methods. However, existing distributed clustering methods mainly focus on the size of the data and ignore the problems caused by high data dimension. To address this, we propose a new distributed algorithm, referred to as Hashing-Based Distributed Clustering (HBDC). Motivated by the outstanding performance of hashing methods for nearest neighbor searching, the algorithm applies the learning-to-hash technique to the clustering problem, which offers notable advantages for data storage, transmission and computation. Following a global-sub-site paradigm, HBDC consists of distributed training of a hashing network and spectral clustering of the hash codes at the global site. The sub-sites use the learnable network as a hash function to convert massive high-dimensional (HD) original data into a small number of hash codes, and send them to the global site for final clustering. In addition, a sample-selection method and lightweight network structures are designed to accelerate the convergence of the hashing network. We also analyze the transmission cost of HBDC, including its upper bound. Our experiments on synthetic and real datasets illustrate the superiority of HBDC compared with existing state-of-the-art algorithms.
Index Terms:
Distributed clustering, learning to hash, high-dimensional data, spectral clustering

1 Introduction
Clustering is a classical unsupervised technique widely used to discover the latent structure of a large dataset. Specifically, clustering aims to partition the samples of a dataset into several clusters according to their distribution features, maximizing the similarity of samples within a cluster while minimizing the similarity of samples across clusters. Various clustering algorithms can distinguish clusters effectively on different kinds of datasets. However, with the arrival of the big-data era, the changing properties of data bring new challenges. Real-world data are usually generated and stored on distributed machines[1], which violates the prerequisites of centralized clustering, and collecting all the data on a central computer is nearly impossible because of the unaffordable transmission cost and privacy concerns. Meanwhile, the high dimension of big data leads to the curse of dimensionality[2] and rising computational complexity, so processing massive high-dimensional data on a single computer may be impossible. Therefore, how to process and cluster data in a distributed scenario is an unavoidable problem.
Some distributed clustering algorithms have been proposed to solve this problem under a global-sub-site paradigm consisting of one global site and several sub-sites. To reduce the overhead, existing distributed clustering methods model the local data at each sub-site and only send the model parameters to the global site, where an overall model is generated by merging all local models and broadcast back to the sub-sites. They fall into four basic groups: density-based methods[3], partition-based methods[4], grid-based methods[5] and model-based methods[6]. Density-based methods search for high-density regions as clusters and collect some core samples as cluster parameters. Partition-based methods execute preliminary local clustering at the sub-sites and then upload the centers and their weights (the number of samples belonging to each cluster). Grid-based methods divide the sample space into a finite number of grids, and the samples are grouped by their grid neighborhood relationships. Model-based methods pre-define a statistical model and fit the local data by parameter estimation. These methods can process massive distributed data and have achieved remarkable clustering performance.
Recently, the data in practical applications such as image retrieval[7] and text analysis[8] have typically been of high dimension (HD). Together with the explosive growth of data volume in the real world, this raises several problems. The first is the curse of dimensionality[2]. Points in HD space are usually sparse, so the variance of the estimated density is inflated as the dimensionality grows to hundreds or more, which degrades clustering performance; the high-density regions found in HD space are also inaccurate for density-based methods. Second, the shapes of clusters become irregular and complicated, while partition-based methods build on centralized convex clustering methods[9], which rely on the assumption that clusters are convex. This assumption usually does not hold in HD space, so the performance cannot be guaranteed. For grid-based methods, the total number of grids grows exponentially with the dimension[10], so the computational cost becomes unaffordable when processing HD data. The last problem is that parameter estimation for HD statistical models often costs enormous computational resources, so most model-based methods have to compromise between cost and accuracy[11]. Overall, distributed clustering for HD data remains an open problem.
Nearest neighbor (NN) searching, which aims to find the point in a given set with the smallest distance to a target query[12], is a basic step in a wide range of tasks and can also be regarded as a fundamental sub-problem of clustering. Even in HD space, the computational cost can be limited by approximate methods, and the results still meet the requirements in most cases. For the NN problem, hashing methods show strong advantages in the efficiency of storage and computation[13] and have thus gained great popularity. Hashing uses a hash function to map the HD original data into binary hash codes, so that similar objects in the HD space are converted to hash codes with small Hamming distances. Computing Hamming distances and storing hash codes obviously cost much less than operating on the original data. The traditional approach, locality-sensitive hashing (LSH)[14], builds hash maps that place data into a limited number of hash buckets, so that similar objects fall into the same buckets with high probability. However, to improve the recall rate[12], LSH methods have to build many hash maps, which makes them difficult to apply to large datasets. Since the hash maps in LSH are independent of the data, a hash function learned automatically from the data can be a more suitable choice for building high-quality codes, and the development of machine learning and neural networks makes this possible. Learning to hash, the combination of hashing and neural networks, has attracted wide attention because of its outstanding performance and its capability to process large and complex datasets.
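As a concrete illustration of why hashing makes NN-style computations cheap, the toy sketch below builds binary codes with random hyperplanes (SimHash, one classical LSH family) and compares them by Hamming distance. It is only illustrative, not the data-dependent hash function used by HBDC, and all function and variable names are ours.

```python
import numpy as np

def simhash_codes(X, n_bits=16, seed=0):
    """Random-hyperplane LSH (SimHash): the sign of a random projection gives one bit."""
    rng = np.random.default_rng(seed)
    hyperplanes = rng.standard_normal((X.shape[1], n_bits))
    return (X @ hyperplanes > 0).astype(np.uint8)        # (n_samples, n_bits) binary codes

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 128))
X[1] = X[0] + 0.01 * rng.standard_normal(128)            # a near neighbor of X[0]
codes = simhash_codes(X)
# the near neighbor shares most bits with X[0]; the unrelated point shares far fewer
print(hamming(codes[0], codes[1]), hamming(codes[0], codes[2]))
```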
Motivated by the superiority of learning to hash in NN searching and by the similarity between NN searching and clustering, a novel distributed clustering approach for HD data, referred to as the Hashing-Based Distributed Clustering (HBDC) algorithm, is proposed in this paper. The algorithm is designed for a global-sub-site paradigm composed of one global site and several sub-sites. First, an initial neural network serving as the hash function is broadcast from the global site to all the sub-sites, and a mini-batch of samples at each sub-site is selected for local training of the hash function. After the local training, the global site collects the renewed parameters of the hash function from the sub-sites and integrates them. This distributed machine-learning process lasts several iterations, until the hash function is capable of generating representative hash codes. Second, all the sub-sites use this hash function to map their local data into hash codes and send the hash codes to the global site for final clustering. Massive HD data are thereby converted into a limited number of hash codes, so the transmission and computational cost can be reduced sharply. The contributions of this work are summarized as follows.
1) An algorithm, HBDC, is proposed for distributed clustering of massive HD data.
2) A distance-driven self-supervised training method for hash functions, focused on clustering, is proposed and implemented to obtain reliable hash functions efficiently.
3) An analysis is presented to illustrate that HBDC is cost-effective for HD data in distributed cases.
The rest of the paper is organized as follows. Section 2 introduces related works. Section 3 defines symbols and notations. Section 4 presents the proposed distributed clustering method HBDC, and Section 5 reports experimental results that justify the efficacy of the proposed HBDC. Finally, Section 6 concludes the paper.
2 RELATED WORKS
2.1 Distributed Clustering
The aim of distributed clustering methods is to perform clustering analysis on data discretely stored at different distributed sites without direct transmission of the original data. The algorithm design is therefore substantially influenced by the topological properties of the site network. Generally, two network paradigms are mainly studied: peer-to-peer (P2P) networks and global-sub-site networks[15]. The main difference between them is whether there is a global site that has access to all sub-sites to collect the data necessary for clustering. In the P2P paradigm there is no such global site, and each sub-site can only communicate with a few other sites for local clustering[15]. By contrast, the global-sub-site paradigm has a global site acting as a central server, and there is no direct communication between sub-sites. In this paper, only the global-sub-site paradigm is discussed.
In [1], Januzaj et al. proposed a DBSCAN-based distributed clustering method called DBDC. In the first step, it clusters the data at the sub-sites locally with DBSCAN. In each local cluster, some core samples are selected together with their neighborhood parameters. These samples and parameters are sent to the global site and participate in a global DBSCAN clustering that generates the final clusters. Although DBDC can find latent clusters with arbitrary shapes, it also inherits the downside of DBSCAN, namely the sensitivity to input parameters, which means that DBDC only performs well when given proper parameters. To address this, [16] improved DBDC and made good progress in clustering performance. On the other hand, in [17], Xu et al. used DBSCAN to parallelize centralized clustering by dividing the dataset at the global site into several clusters and then using distributed machines to handle each cluster in parallel. Overall, these density-based methods can find clusters with complex shapes but are highly dependent on the input parameters.
In [18], Jagannathan and Wright extended the basic k-means clustering algorithm to the distributed case. Privacy protection was also considered by avoiding exposure of the original private data. It performs well when the data at different sites represent different attributes of common entities. Ji and Ling[9] proposed an ensemble-learning-based clustering algorithm. In this distributed k-means algorithm, the clustering process is divided into two steps. First, every sub-site executes local k-means clustering and summarizes the centroid of each cluster together with the number of points in that cluster as its weight. Then all the centroids and their weights are sent to the global site, where another k-means clustering is performed on the centroids. Based on the results of the two clusterings, the sub-sites obtain the final result. Essentially, the whole process is a two-layer hierarchical clustering, with the sub-sites and the global site each taking one layer. In [19], Balcan et al. proposed a distributed k-means clustering based on core-sets of samples to process distributed data more efficiently. Jeong et al.[20] focused on biological data and modified the distributed k-means algorithm; the modified algorithm achieves outstanding performance on enormous datasets. Overall, the foregoing methods are efficient in handling distributed clustering tasks, whereas their performance is limited by the underlying k-means algorithm.
A model-based distributed clustering algorithm based on the expectation-maximization (EM) method[21] was proposed by Kriegel et al. in 2005[22]. It models the local data with a mixture of Gaussian distributions whose parameters are estimated by the EM method. These Gaussian mixture distributions from all sub-sites are then merged at the global site into a general global model. Out of concern for privacy, a modified algorithm in [18] limits data sharing between sites to preclude possible data leakage. Another algorithm, referred to as LDSDC[6], uses a subspace Gaussian model, which is flexible with respect to the data dimension, to perform local clustering at the sub-sites and achieves better performance on HD datasets. However, the number of parameters of the subspace Gaussian model still rises dramatically with the data dimension, so the cost remains high; moreover, the Gaussian model sometimes cannot fit the datasets well, which limits its performance. On the other hand, there are also centralized methods that aim to process HD data directly. In [23], Agrawal et al. proposed the grid-based algorithm CLIQUE. It divides the feature space into isometric intervals along every dimension, and the grids that contain more than a threshold number of samples are regarded as dense cells. Adjacent dense cells are then merged to constitute clusters in a bottom-up fashion. CLIQUE is highly dependent on proper input parameters, including the threshold and the grid size. Another grid-based method, WFC[24], proceeds top-down and overcomes the drawbacks of CLIQUE by using a Weber–Fechner-Law-inspired approach to calculate different scales and obtain multi-scale clustering results. However, these centralized clustering methods cannot be extended to distributed cases while maintaining their efficacy and economy.
2.2 Learning to Hash
To solve the nearest neighbor searching problem, researchers have made a wide range of efforts to learn compact hash functions from data that preserve the distance order of the original space. Gong et al. proposed AQBC[25], which uses the cosine similarity between samples as the measure and maps each vector to the nearest vertex of a binary hypercube. In [26], Gui et al. regressed the semantic labels of samples to hash codes in a supervised manner and then optimized the hash codes. Many further methods[27][28][29] have been proposed to achieve better performance. However, limited by their representation ability, these methods are not scalable to massive and complex datasets.
With the development of deep learning, more and more researchers apply deep neural networks to hashing, an approach called deep hashing (DH). DH uses a hashing neural network as the hash function to map data into hash codes in an end-to-end manner. Generally, DH methods can be divided into supervised and unsupervised DH, and researchers mainly focus on the supervised methods. The core issue of current supervised methods is how to measure the similarity of samples, and there are three general categories: point-wise methods, pairwise methods and ranking-based methods. CSQ[30] employs a point-wise strategy that uses label information directly instead of similarity information: it first generates central hash codes for the different class labels and then enforces the network outputs to approach the hash centers of their own labels. In [31], Liu et al. proposed a pairwise method, DSH. It uses labels indicating whether a pair of samples is semantically similar to train the hashing network, minimizing the Hamming distances between the hash codes of similar pairs and maximizing those of dissimilar pairs. A network consisting of three convolutional layers followed by two fully-connected layers is designed as the hash function of DSH. Lai et al. proposed DNNH[32] based on a ranking method: it groups samples into triplets and encourages the hash codes of more similar samples to be closer.
Supervised methods form the basis of the unsupervised ones. To compensate for the lack of label information, most unsupervised methods utilize pre-trained networks to obtain semantic information and thereby convert the problem into a supervised one. However, some works still choose to train on the data themselves without generating semantic information. For example, AETBH[33] and BinGAN[34] introduced self-supervised techniques into unsupervised deep hashing, using an auto-encoder and a generative adversarial network as hash functions, respectively. In [35], Jang and Cho introduced contrastive learning into the training of hash functions: several inputs are generated from one sample by random perturbation, and inputs derived from the same sample should be mapped to closer hash codes. Although these unsupervised methods can generate high-quality hash codes and achieve promising performance, their computational cost is usually very high because of the heavy network structures. In distributed cases this would result in unacceptable transmission cost, so lighter networks should be chosen. The HBDC proposed in this paper provides a feasible solution.

3 SYMBOLS AND NOTATIONS
Some notations are defined as follows. Lower- and upper-case letters denote scalars. Boldface lower-case letters (e.g., a) and upper-case letters (e.g., A) denote vectors and matrices, and calligraphic letters denote sets; the rows and columns of a matrix, the 1-norm and 2-norm of vectors, and the cardinality of a set are written in the standard way. Dedicated symbols denote the loss function, the inputs, the outputs of the neural network and the output hash codes (in matrix form), the length of the hash codes, the neural network and the set of its neural parameters. A graph consists of vertices and edges.
4 HASHING-BASED DISTRIBUTED CLUSTERING
In this section, the HBDC algorithm is presented. Consider a system consisting of one global site and several sub-sites. The data to be clustered are stored dispersedly across the sub-sites, and the local data at each sub-site are a set of vectors of the same dimension, whose number may differ from site to site. As presented in Fig. 1, HBDC clusters the data in three steps:
1) The global site broadcasts an initial hashing network to all the sub-sites. At each sub-site, a batch of samples is selected for local training of the hashing network.
2) The parameter gradients of the network are sent to the global site, which updates the hashing network by merging all the gradients. The first two steps are repeated for a number of iterations until convergence.
3) The sub-sites use the hashing network to convert their local data into hash codes, and the codes together with their degrees (i.e., the number of samples each code represents) are sent to the global site. A graph is generated with the hash codes as its vertices, so the clustering problem is converted into a graph-cutting problem, and the graph is cut by a normalized-cut method.
The following subsections discuss the distributed training of hash function and the hash code clustering in terms of details and analysis.
4.1 Distributed Training of Hash Function
Unlike existing DH methods, the overall hash function for distributed clustering cannot be trained on a single machine, which means that the communication overhead during the training process must be considered. The transmission cost for training can be calculated as
(1)
where the cost is determined by the number of parameters of the hashing network, the number of iterations in the training process, and the 32 bits occupied by each float32 value; it contains the data transmitted in uplink and downlink communication. For simplicity, this work does not consider quantization techniques for the parameters to reduce the cost further[36][37]; all network parameters are transmitted in the original float32 format.
To limit this cost, we need to reduce the number of network parameters and accelerate the convergence. Centralized DH methods usually use complex networks with millions of parameters, such as ResNet50[38], to achieve promising performance, but distributed training of such networks would result in an enormous communication cost. Thus, lighter and more compact networks are used as hashing networks in this paper.

On the other hand, to compensate for the loss of representation ability caused by the lighter network structure, the whole clustering is divided into two hierarchical steps. The hash codes represent the result of the first-step clustering, which means the hashing network does not have to be powerful enough to generate an end-to-end result. As shown in Fig. 2, the shape information is coupled with color in the hash codes, but the later clustering can still group the hash codes into the proper categories. For more complex datasets with more features, this can be achieved by increasing the length of the hash codes. In this way, part of the network's function is transferred to the second step.
In addition, to increase the convergence rate, a filter is designed to make the samples for local training more representative. The principle of selection is to make the samples as different as possible. After the first sample is chosen randomly, the second one comes from the cluster whose hash code has the largest Hamming distance to the first one's. Following this mechanism, each subsequent sample is chosen randomly from the cluster whose code maximizes the sum of Hamming distances to the previously selected codes, until the whole batch of samples is determined. An illustration of this sample-selection mechanism is shown in Fig. 3.
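The selection filter can be viewed as a greedy farthest-first search over the current hash codes. The snippet below is a minimal sketch under our own naming; for brevity it scores per-sample codes directly, whereas the paper draws the next sample at random from the bucket of the selected code.

```python
import numpy as np

def hamming(a, b):
    """Number of differing bits between two binary codes."""
    return int(np.count_nonzero(a != b))

def select_batch(codes, batch_size, seed=0):
    """Greedy farthest-first selection: after a random first pick, each new sample
    is the one whose current hash code maximizes the summed Hamming distance
    to the codes already chosen."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(codes)))]          # first sample: random
    while len(chosen) < batch_size:
        picked = [codes[i] for i in chosen]
        scores = [sum(hamming(c, p) for p in picked) for c in codes]
        for i in chosen:                              # never pick the same sample twice
            scores[i] = -1
        chosen.append(int(np.argmax(scores)))
    return chosen

toy_codes = list(np.random.default_rng(1).integers(0, 2, size=(100, 8)))  # toy 8-bit codes
print(select_batch(toy_codes, batch_size=4))
```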

The local training of the hashing network exploits a pairwise method. After normalization, samples are grouped into pairs, and the model is trained by minimizing the difference between their distance in the input space and their Hamming distance, i.e.,
(2)
where the pair of inputs comes from a sub-site, the corresponding hash codes are the network outputs, and two hyper-parameters control the loss. Since the maximum Hamming distance between hash codes equals the code length, input pairs whose distance exceeds the corresponding threshold encourage the hash function to map them to completely different codes. Thus, by adjusting this hyper-parameter we can encourage the hash function to yield more distinct hash codes and use the bits to the greatest extent.
Because the gradient of the sign function is zero almost everywhere, the gradients of the hash codes cannot be obtained by back-propagation. Thus, a smooth surrogate is used to approximate the sign function in the training stage, and problem (2) is rewritten as
(3)
where the relaxed codes are the outputs of the hashing network with the last hashing layer replaced by the smooth surrogate. This replacement is only used for updating the neural parameters; afterwards, the sign function is used to generate the hash codes.
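A minimal sketch of a distance-preserving pairwise loss with such a relaxation is given below. The exact form of (2)-(3) and the role of the hyper-parameters are not fully reproduced here, so the target construction (clamping a scaled input distance at the code length) and the use of tanh as the smooth surrogate are our assumptions.

```python
import torch
import torch.nn.functional as F

def pairwise_hash_loss(x1, x2, y1, y2, alpha=1.0, n_bits=8):
    """Distance-preserving pairwise loss (a sketch): push the relaxed Hamming
    distance of each pair toward a target derived from its input-space distance,
    capped at the code length so far-apart pairs aim for fully different codes."""
    # y1, y2 are tanh outputs in (-1, 1); for +/-1 codes the Hamming distance
    # equals (n_bits - <b1, b2>) / 2, so this is its smooth relaxation
    d_hamming = 0.5 * (n_bits - (y1 * y2).sum(dim=1))
    d_input = (x1 - x2).norm(dim=1)                      # distance after normalization
    target = torch.clamp(alpha * d_input, max=float(n_bits))
    return F.mse_loss(d_hamming, target)

# toy usage with random pairs and surrogate (tanh) network outputs
x1, x2 = torch.randn(32, 20), torch.randn(32, 20)
y1, y2 = torch.tanh(torch.randn(32, 8)), torch.tanh(torch.randn(32, 8))
print(pairwise_hash_loss(x1, x2, y1, y2).item())
```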
After the local training, the local average parameter gradients are transmitted to the global site, which updates the model for the next iteration with their average:
(4)
where the learning rate controls the step size. Although we assume that all the sub-sites use the same batch size, (4) can easily be extended to cases in which different sub-sites have different batch sizes, such as online clustering. Although the batch size at each sub-site may be small, the total size of all batches can be much larger, which makes the model update more reliable. The hashing network with the updated parameters is broadcast to the sub-sites to begin the next training iteration, and this procedure is repeated until a high-quality hash function is obtained.
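In code, the global update amounts to a federated-averaging-style gradient step. The sketch below assumes equal batch sizes and plain gradient descent with a single learning rate, which is how we read (4); the function and variable names are ours.

```python
import torch

@torch.no_grad()
def global_update(global_net, local_grads, lr=0.01):
    """Average the parameter gradients reported by the sub-sites and take one
    gradient-descent step on the global hashing network (a sketch of (4))."""
    n_sites = len(local_grads)
    for name, param in global_net.named_parameters():
        avg_grad = sum(g[name] for g in local_grads) / n_sites   # merge sub-site gradients
        param -= lr * avg_grad                                   # descend with learning rate lr
    return global_net

# toy usage: three sub-sites report (random) gradients for a tiny network
net = torch.nn.Linear(20, 8)
grads = [{n: torch.randn_like(p) for n, p in net.named_parameters()} for _ in range(3)]
net = global_update(net, grads)
```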
4.2 Hash Code Clustering
After training, the sub-sites use the hashing network to transform all their local data into hash codes and count the number of samples corresponding to each hash code as its degree. Compared with the massive original data, the number of hash codes is much smaller, and their binary format makes them much more convenient to store, transmit and compute with. The hash codes and their degrees are sent to the global site for the final clustering. In this communication, the volume of transmitted data is
(5)
which depends on the number of hash codes at each sub-site, the length of the hash codes, and the 32 bits of each float32 degree.
Since the hash codes are distributed on the vertices of a hypercube, the clusters of hash codes may not be convex, so directly applying simple clustering methods to the hash codes may not be reasonable. Instead, we use the codes to generate a graph and apply spectral clustering[39] to it. Spectral clustering is a class of clustering algorithms that work on an undirected graph: they divide the vertices into a given number of clusters by cutting the edges between them. There are different graph-cutting criteria, such as Ratio Cut[40] and Normalized Cut[41]. In this work, the Normalized Cut criterion is chosen, which minimizes the loss function
(6)
defined as the sum, over the sub-graphs, of the total weight of the edges connecting each sub-graph to its complementary sub-graph, normalized by the number of vertices in the sub-graph. This graph cutting can be solved by matrix analysis, as detailed in [41]. Note that the feasibility of spectral clustering is also enabled by the compactness of the hash codes: calculating an adjacency matrix between millions of HD samples is time-consuming, while it is much easier for a far smaller number of binary hash codes.
For the hash code clustering, the undirected graph is generated from the hash codes and their degrees. The vertices of the graph are the hash codes, and the weight of the edge between two codes is computed from their Hamming distance and their degrees as follows:
(7)
The weight of an edge represents how strongly two vertices are connected, i.e., how likely the two vertices are to belong to the same cluster. Thus, besides the reciprocal of the Hamming distance, the degrees of the hash codes are also considered when generating the graph. The subsequent graph cutting removes the edges with low weights, which is analogous to density-based clustering methods that separate clusters by low-density regions. The input adjacency matrix for spectral clustering therefore consists of the weights in (7):
(8)
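To make the construction concrete, the sketch below builds an affinity matrix over hash codes in which weights grow with the vertex degrees and shrink with the Hamming distance, and then feeds it to an off-the-shelf spectral clustering with a precomputed affinity. The exact weight formula in (7) is not reproduced here, so the product-over-distance form is our assumption.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def code_graph_affinity(codes, degrees):
    """Affinity over hash codes: larger when degrees are large and the
    Hamming distance is small (one plausible reading of (7))."""
    n = len(codes)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = np.count_nonzero(codes[i] != codes[j])     # Hamming distance >= 1
            W[i, j] = W[j, i] = degrees[i] * degrees[j] / d
    return W

# toy example: four 4-bit codes with their degrees (sample counts)
codes = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [1, 1, 1, 1], [1, 1, 1, 0]])
degrees = np.array([50, 40, 30, 20])
W = code_graph_affinity(codes, degrees)
labels = SpectralClustering(n_clusters=2, affinity="precomputed",
                            assign_labels="discretize", random_state=0).fit_predict(W)
print(labels)       # the two "close" code pairs should land in the same cluster
```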
4.3 Algorithm Analysis
In summary, the whole algorithm is described in Algorithm 1. The algorithm can clearly be modified for a quick start with a pre-trained network, which can improve clustering performance, dramatically reduce the training cost, or even allow the training process to be skipped; choosing a suitable network structure plays a similar role. Compared with existing distributed clustering algorithms, HBDC is more robust to different kinds of datasets because of the freedom in selecting the hashing network.
One of the most important considerations for a distributed algorithm is its transmission cost. In HBDC, it consists of two parts: the cost of distributed training and the cost of hash code transmission. According to (1) and (5), the total cost is
(9)
The first term differs from (1) because the broadcasting of the latest parameters (the corresponding step in Algorithm 1) is included. For simplicity, we do not consider extra quantization methods for the transmitted data. In (9), the size of the parameter set and the number of hash codes are the two key factors that influence the cost most. Regardless of which hashing network is chosen, the number of parameters is governed by the data dimension; traditional algorithms usually use multiple parameters to model each dimension of the data[42][6], but with convolutional-pooling layers the number of parameters in a network can be much smaller than the data dimension. On the other hand, the number of hash codes can be controlled by the length of the hash codes, since it is upper-bounded by the number of possible codes, i.e., 2 to the power of the code length. By properly selecting the hyper-parameter in the loss function (3), we can encourage the hash function to map the data to more distinct hash codes and make the best use of the bits, so the code length can be set to a small value without unacceptable performance loss, and this part of the cost can be limited. Therefore, an upper bound on the total cost is
(10)
Up to some fixed constants, (10) gives the order of the transmission cost.
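A rough cost accounting in code, under our own assumptions (float32 parameters and degrees, one gradient upload and one parameter broadcast per sub-site per iteration, and at most 2^L distinct codes), might look like the following; the exact bookkeeping in (1), (5) and (9)-(10) may count the broadcasts differently.

```python
def hbdc_transmission_bits(n_params, n_iters, n_sites, code_len, n_codes):
    """Rough two-term transmission cost: distributed training plus hash-code upload."""
    # training: every iteration each sub-site uploads its gradients (n_params floats)
    # and receives the updated parameters back from the global site
    training_bits = 32 * n_params * 2 * n_sites * n_iters
    # hash-code stage: each code costs code_len bits plus a 32-bit float degree
    code_bits = n_codes * (code_len + 32)
    return training_bits + code_bits

# toy numbers: a 5k-parameter network, 100 rounds, 10 sub-sites, 12-bit codes
# (the number of distinct codes is at most 2**12)
print(hbdc_transmission_bits(5_000, 100, 10, 12, 2**12) / 8e6, "MB")
```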
5 EXPERIMENTS
This section presents experiments to evaluate the proposed HBDC algorithm.
5.1 Evaluation
Two measures are chosen to assess the quality of the clustering results: purity[43] and normalized mutual information (NMI)[44]. Purity is the number of correctly assigned samples divided by the total number of samples. Formally, for a dataset of size N, let Ω = {ω_1, …, ω_K} be the set of predicted clusters and C = {c_1, …, c_J} the set of ground-truth classes. The purity can be calculated as
purity(Ω, C) = (1/N) Σ_k max_j |ω_k ∩ c_j|.    (11)
Purity has a drawback when the number of clusters is small, so it cannot evaluate the quality of clustering appropriately on its own: in the extreme case where all samples are grouped into one cluster, the purity still equals the proportion of the largest class, which can be high for unbalanced data. Thus, NMI is also included to rule out such situations and make the evaluation more comprehensive. NMI is a normalized variant of the mutual information from information theory. For a dataset of size N with K predicted clusters and J true classes, let n_k, n_j and n_{kj} denote the numbers of samples in cluster ω_k, in class c_j, and in both, respectively. The NMI can be calculated as
NMI(Ω, C) = ( Σ_k Σ_j n_{kj} log( N·n_{kj} / (n_k·n_j) ) ) / sqrt( ( Σ_k n_k log(n_k/N) ) · ( Σ_j n_j log(n_j/N) ) ).    (12)
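Both measures are easy to compute; the sketch below uses our own purity helper together with scikit-learn's NMI. Note that scikit-learn's default normalization (the arithmetic mean of the entropies) differs slightly from the geometric-mean normalization of [44]; passing average_method='geometric' matches the latter.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

def purity(labels_pred, labels_true):
    """Purity: credit each predicted cluster with its most frequent true class,
    sum the matched counts and divide by the total number of samples."""
    labels_pred = np.asarray(labels_pred)
    labels_true = np.asarray(labels_true)
    matched = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        matched += np.bincount(members).max()          # size of the dominant class
    return matched / labels_true.size

pred = [0, 0, 0, 1, 1, 2, 2, 2]
true = [0, 0, 1, 1, 1, 2, 2, 0]
print(purity(pred, true),
      normalized_mutual_info_score(true, pred, average_method="geometric"))
```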
In addition, to evaluate the economy of the methods, the volume of data that needs to be transmitted between the sub-sites and the global site is also reported as the cost in the experiments.
5.2 Datasets
All algorithms are tested on both synthetic and real-world datasets, which the following subsections describe in detail.
5.2.1 Synthetic Datasets
Since it is convenient to generate datasets with arbitrary numbers of samples, clusters and features, synthetic datasets are used to test the algorithms on HD data with large volumes and hundreds of clusters. Motivated by the observation that HD clusters are usually derived from embedded sub-spaces of relatively low dimension, we follow the method in [6] to generate Gaussian clusters. The procedure for generating a high-dimensional Gaussian cluster from a low-dimensional embedded sub-space is as follows (a small code sketch is given after the list):
1) Generate a Gaussian matrix whose elements are sampled randomly from the standard Gaussian distribution, and then normalize the matrix.
2) Generate a shift vector whose elements are sampled from a uniform distribution.
3) Generate a sample by
(13)
4) Repeat step 3) until enough samples have been generated to constitute a cluster.
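Read literally, steps 1)-3) amount to lifting a low-dimensional Gaussian into the ambient space with a normalized random matrix and then shifting it. The sketch below follows that reading; the normalization constant, the uniform range of the shift and the unit latent scale are our assumptions where the text leaves the constants unspecified.

```python
import numpy as np

def gaussian_cluster(n_samples, dim, embed_dim, seed=0):
    """Generate one synthetic cluster: a low-dimensional Gaussian lifted into
    `dim`-dimensional space by a normalized random matrix, plus a uniform shift."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((dim, embed_dim))     # step 1: random Gaussian matrix ...
    A /= np.linalg.norm(A)                        # ... normalized (constant assumed)
    b = rng.uniform(-1.0, 1.0, size=dim)          # step 2: uniform shift vector (range assumed)
    Z = rng.standard_normal((n_samples, embed_dim))
    return Z @ A.T + b                            # steps 3-4: lift every latent sample and shift

X = gaussian_cluster(n_samples=1000, dim=100, embed_dim=5)
print(X.shape)                                    # (1000, 100)
```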
We gradually increase the size, dimension and cluster number of the datasets to test how the performance of each method changes as the datasets become more complex.
Dataset | Embedded Dimensions | |||
---|---|---|---|---|
S1 | ||||
S2 | ||||
S3 | ||||
S4 | ||||
S5 |
• The embedded dimensions denote the dimension of the embedding sub-space in the foregoing generation procedure. For each embedded dimension, multiple different clusters are generated.
5.2.2 Real-world Datasets
Eight real-world datasets are selected for the experiments, including three hyperspectral datasets, Salinas, Pavia Centre and Pavia University[45]; four datasets from UCI[46]; and the handwritten digits of MNIST[47]. A summary of the eight datasets is presented in Table 2, whose columns are defined in the same way as in Table 1.
Dataset | |||
---|---|---|---|
Salinas | |||
Pavia Centre | |||
Pavia University | |||
WBCD | |||
Waveform-noise | |||
LSD | |||
WFRND | |||
MNIST | | |
Some statistics and descriptions of the datasets are presented as follows.
• Salinas was collected by the AVIRIS sensor over Salinas Valley. Each pixel of the scene is featured by 224 bands, but 20 water-absorption bands are discarded, so the dimension is actually 204. Salinas contains 16 classes.
• The Pavia Centre and Pavia University scenes were acquired by the ROSIS sensor over Pavia, Italy. Pixels containing no information are discarded before analysis. Both scenes include the same number of classes.
• The Wisconsin breast cancer dataset (WBCD) is from UCI. Its features are computed from a digitized image of a fine-needle aspirate (FNA) of a breast mass. There are two classes, representing malignant and benign, respectively.
• Waveform-noise, from UCI, is constructed from three classes of waves, each generated from a combination of three base waves. All forty features include noise, and the latter nineteen are pure noise with mean 0 and variance 1.
• The Landsat satellite dataset (LSD), from UCI, is generated from data purchased from NASA. It consists of the multi-spectral values of pixels in small neighborhoods of a satellite image, together with the classification of the central pixel of each neighborhood.
• The wall-following robot navigation dataset (WFRND), from UCI, is collected by ultrasound sensors as the SCITOS G5 robot navigates through a room, following the wall in a clockwise direction for four rounds. The dataset contains three classes representing the movements of the robot.
• MNIST consists of samples of handwritten digits. All samples are 784-dimensional vectors (28×28 grayscale images). It contains 10 classes, the digits 0 to 9.
5.3 Comparison with Centralized Clustering

To test how representative the hash codes can be, we compare HBDC in the distributed setting with centralized clustering algorithms. In this subsection, we conduct experiments on four datasets, WBCD, Waveform-noise, LSD and WFRND, because executing centralized clustering on large HD datasets on one machine would be very time-consuming and these four datasets are relatively small. Since spectral clustering is the downstream clustering method of HBDC, it is the natural choice as the centralized baseline. In addition, we also use k-means as a supplement where spectral clustering does not perform well.
Spectral | K-means | HBDC | ||||
---|---|---|---|---|---|---|
Datasets | Purity | NMI | Purity | NMI | Purity | NMI |
WBCD | ||||||
Waveform-noise | ||||||
LSD | ||||||
WFRND |
• The cluster numbers of all three methods are set to the true class numbers. The parameter of spectral clustering is set to a fixed value. In HBDC, the length of the hash codes is 4 bits for WBCD, 4 bits for Waveform-noise, 12 bits for LSD and 8 bits for WFRND.
To simulate the distributed situation, each dataset is randomly distributed over ten sub-sites, and the data size at each site is kept larger than fifty to make sure all sites have enough data for local training. It is not difficult to extend this to more general cases in reality: even if, under extreme circumstances, the data at some sub-sites are very small, we can reduce their batch sizes and turn equation (4) into a weighted average. The hashing network in this part consists of three fully connected layers. The first two layers have as many units as the dimension of the input data, each followed by a nonlinear activation; the output layer has as many units as the length of the hash codes, with a smooth activation that is replaced by the sign function after training to generate the binary codes.
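A sketch of such a fully-connected hashing network is shown below. The hidden and output activations are not named in the text, so tanh is our assumption, consistent with the smooth surrogate of Section 4.1; the sign function replaces the output activation at inference.

```python
import torch
import torch.nn as nn

class HashNet(nn.Module):
    """Fully-connected hashing network (a sketch): two hidden layers as wide as
    the input, an output layer of length n_bits; tanh during training, sign()
    for producing binary codes afterwards."""
    def __init__(self, in_dim, n_bits):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(in_dim, in_dim), nn.Tanh(),
            nn.Linear(in_dim, in_dim), nn.Tanh(),
            nn.Linear(in_dim, n_bits),
        )

    def forward(self, x, binarize=False):
        y = torch.tanh(self.body(x))
        return torch.sign(y) if binarize else y     # +/-1 codes at inference

# toy usage: a 30-dimensional input mapped to 4-bit codes
net = HashNet(in_dim=30, n_bits=4)
print(net(torch.randn(5, 30), binarize=True))
```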
Table 3 shows the clustering performance in terms of purity and NMI. On the latter three datasets, HBDC outperforms centralized spectral clustering. This is because the original spectral clustering is not well suited to HD data: as the dimension grows, the distribution of samples becomes sparse and their connections become weak, so edge cutting may not lead to promising results. By contrast, graphs generated from the shorter hash codes have edge weights that vary over a wider range, so the input adjacency matrix is denser. On the other hand, the weaker performance on WBCD may be attributed to the small size of the dataset; nevertheless, HBDC still achieves about eighty percent of the centralized performance. On WBCD and LSD, the results of k-means are better than those of spectral clustering. The reason is that the class sizes differ considerably, while spectral clustering based on (6) tends to group samples evenly.
5.4 Convergence
The convergence of distributed training is a substantial factor in the HBDC. We report it on five synthetic Gaussian datasets to see the convergence rate and its sensitivity to the number of sub-sites. In this subsection, the convergence is represented by the relative error ratio (RER), which is defined as:
RER = (ℓ − ℓ_min) / (ℓ_max − ℓ_min),    (14)
where ℓ, ℓ_min and ℓ_max denote the current value, the minimum and the maximum of the loss function (2), respectively.
Fig. 4 shows how the RER changes as the number of iterations grows. The number of sub-sites clearly has a significant influence on the rate of convergence: in general, more sub-sites lead to faster convergence, because more samples participate in training within a given number of iterations, which affects convergence in the same way as increasing the batch size. There are still differences between datasets. S5 is the most complex dataset, so subfigure (e) shows the slowest descent of the curves, and even with the largest number of sub-sites the model still needs many iterations to converge. The behavior on S4 is similar in that the curves representing more sub-sites fall faster, and convergence is reached within fewer iterations than on S5. However, as the datasets become simpler, in addition to the faster overall convergence, the acceleration brought by more sub-sites becomes less pronounced, and the differences between the curves shrink gradually from (e) to (a). On the relatively simple datasets S2 and S3, the blue curves even descend faster than the purple ones, and on S1 the model converges at a high rate with few sub-sites, so further increasing their number brings no conspicuous benefit.
5.5 Horizontal Comparison
In this subsection, five state-of-the-art distributed clustering algorithms are selected for comparison: the partition-based algorithm k-means[48], two density-based algorithms, DBDC[1] and LSHDDP[49], and two model-based algorithms, LDSDC[6] and REMOLD[50]. We evaluate the methods on all synthetic datasets and on four HD real-world datasets: Salinas, Pavia Centre, Pavia University and MNIST.
5.5.1 Settings of Methods
To obtain the best performance, the parameters of each method are tuned through multiple experiments. The settings of the baseline algorithms and their descriptions are presented as follows:
• HBDC's hashing networks are designed according to the dimensions of the datasets. Convolutional-pooling layers are used to process HD data such as the synthetic Gaussian datasets and MNIST, because pooling layers can exponentially decrease the dimensions of the hidden features and the number of parameters in the hashing network. We also use the hyperspectral datasets to test the performance of fully-connected hashing networks. The remaining hyper-parameters are selected from small candidate sets, as is the number of training iterations.
• k-means has a single parameter, the number of clusters, which we set equal to the number of real clusters in each dataset. The initial seeds are chosen by k-means++.
• DBDC uses the cut-off density, with the cut-off estimated as the second percentile of the ascending ordered distances of all sample pairs, following [49]. Its other parameter is chosen from a small candidate set.
• LSHDDP also uses the cut-off density, with its parameters set following [49]. The number of clusters is set to the real number, and the samples with the top gamma values are taken as the cluster centers.
• LDSDC follows the settings in [6]. One parameter is set to the smallest integer such that the retained eigenvalue ratio of the sample covariance matrix exceeds a given threshold; another is set as a function of the numbers of samples and sub-sites.
• REMOLD's corresponding parameter is set in the same way as for LDSDC, and the candidate set for its other parameter is obtained by an auxiliary algorithm following [50].
5.5.2 Experimental Results

The clustering performance on the synthetic datasets S1 to S5 is shown in Fig. 5. From (a) and (b), it can be seen that LSHDDP (gray bars) achieves nearly perfect accuracy. HBDC (green bars) also performs well, attaining high purity and NMI scores except on S2, and its performance is not affected by the increase in the dimension and volume of the datasets; its purity and NMI are stable across all five datasets. With the use of hashing networks, HBDC is more capable of processing HD data and shows better robustness. LDSDC (yellow bars) and REMOLD (light blue bars) show comparable and stable performance with consistently high purity and NMI. By contrast, the performance of k-means (dark blue bars) gradually degrades as the index of the dataset increases. This can be attributed to the fact that its local k-means clustering cannot model the local data at the sub-sites accurately: as the datasets become more complex, the information loss rises and is no longer sufficient to yield a reliable result. Likewise, DBDC (orange bars) shows low-quality performance, although its purity scores on S2, S3 and S4 are relatively high because it divides most samples into a few groups. The reason is similar to that for k-means: an over-simplified model results in too much information loss when processing complex data.
The transmission cost is shown in (c) in megabytes on a semi-log scale. Generally, the cost of all algorithms rises gradually with the index of the dataset because the data dimension increases linearly. LSHDDP shows the highest cost because it needs to reshuffle the dataset multiple times. K-means also incurs a high transmission cost because its cost is proportional to both the data dimension and the number of classes. Of the two model-based methods, LDSDC, as a modified version, costs much less than REMOLD, but it still needs considerably more than HBDC on the HD datasets S4 and S5, where the figures for HBDC include both uplink and downlink. The cost reduction is more obvious on HD datasets than on low-dimensional ones; this benefit comes from the efficiency of the binary hash codes, which spare us from transmitting HD vectors. The cost of DBDC is the lowest, because it does not need to transmit more data even when the dataset is large.

Fig. 6 shows the clustering performance on the four real-world datasets. In (a), DBDC has the highest purity score on PaviaU and PaviaC, but its purity drops sharply on Salinas and MNIST. LDSDC, REMOLD and HBDC show similar performance: REMOLD achieves the highest purity on PaviaC and MNIST, while HBDC is best on PaviaU and Salinas. Comparing (a) and (b), a large difference between purity and NMI usually indicates unbalanced clustering results, as for k-means and DBDC on PaviaC and PaviaU. LSHDDP performs well on the hyperspectral datasets but shows the lowest purity and NMI on MNIST. Considering purity and NMI simultaneously, HBDC is either the best or a close second.
In (c), DBDC and LSHDDP again show the lowest and the highest cost, respectively. Among the three most robust algorithms, HBDC has the lowest transmission cost except on Salinas. The exception on Salinas is caused by the heavy fully-connected structure of its hashing network together with the high data dimension, which indicates that convolutional networks are more suitable for HD data. This is also demonstrated by the performance on MNIST, where HBDC costs less than LDSDC and REMOLD although the dimension of MNIST is much higher than that of Salinas. Overall, the results on the five synthetic datasets and the four real-world datasets illustrate the superiority of the proposed HBDC over the other state-of-the-art algorithms in terms of clustering accuracy, robustness and cost.
6 Conclusion
Motivated by the outstanding performance of the learning-to-hash technique in nearest neighbor searching, we propose a hashing-based distributed clustering (HBDC) algorithm. With the use of hashing networks, the proposed algorithm is capable of clustering massive HD data at low cost while preserving privacy. HBDC exploits a learnable hashing network as its hash function; the training of the hashing network consists of global model merging, broadcasting and local training, is driven in a self-supervised manner, and is accelerated by a sample-selection mechanism. The original data are mapped into hash codes by the hash function and form a graph for the final graph-cutting clustering. Experiments on both synthetic and real-world datasets demonstrate that HBDC achieves better overall performance than the baseline algorithms when weighing accuracy, robustness and cost. In addition, all the experiments are conducted with randomly initialized networks, which makes the training harder, while much research on hashing methods uses pre-trained models. Since pre-trained models are gaining popularity, HBDC has further potential, and this is a feasible direction for future research.
Finally, HBDC can easily be extended to an online-clustering version. When the local data at the sub-sites change gradually or the sub-sites generate new data, the hashing network can be adjusted in real time by distributed training to adapt to the new data distribution. This can be applied to many real-world scenarios such as communications and industrial manufacturing.
References
- [1] E. Januzaj, H.-P. Kriegel, and M. Pfeifle, “DBDC: Density based distributed clustering,” in Advances in Database Technology - EDBT 2004, E. Bertino, S. Christodoulakis, D. Plexousakis, V. Christophides, M. Koubarakis, K. Böhm, and E. Ferrari, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 88–105.
- [2] R. E. Bellman and S. E. Dreyfus, Applied dynamic programming. Princeton university press, 2015, vol. 2050.
- [3] A. Rosato, R. Altilio, and M. Panella, “Recent advances on distributed unsupervised learning,” in Advances in Neural Networks, S. Bassis, A. Esposito, F. C. Morabito, and E. Pasero, Eds. Cham: Springer International Publishing, 2016, pp. 77–86.
- [4] W. Gan, J. C.-W. Lin, H.-C. Chao, and J. Zhan, “Data mining in distributed environment: a survey,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 7, no. 6, p. e1216, 2017.
- [5] P. Berkhin, “A survey of clustering data mining techniques,” Grouping multidimensional data: Recent advances in clustering, pp. 25–71, 2006.
- [6] Y. A. Geng, Q. Li, M. Liang, C. Y. Chi, J. Tan, and H. Huang, “Local-density subspace distributed clustering for high-dimensional data,” IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 8, pp. 1799–1814, 2020.
- [7] Y. Gu, S. Wang, H. Zhang, Y. Yao, and L. Liu, “Clustering-driven unsupervised deep hashing for image retrieval,” Neurocomputing, vol. 368, pp. 114–123, 2019.
- [8] M. T. Krause, C. Nord, and P. Sparrow, “Text analysis in translation: Theory, methodology, and didactic application of a model for translation-oriented text analysis,” Modern Language Journal, vol. 76, no. 4, p. 581, 2005.
- [9] G. Ji and X. Ling, “Ensemble learning based distributed clustering,” in Emerging Technologies in Knowledge Discovery and Data Mining: PAKDD 2007 International Workshops Nanjing, China, May 22-25, 2007 Revised Selected Papers 11. Springer, 2007, pp. 312–321.
- [10] A. Amini, T. Y. Wah, M. R. Saybani, and S. R. A. S. Yazdi, “A study of density-grid based clustering algorithms on data streams,” in 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 3. IEEE, 2011, pp. 1652–1656.
- [11] C. Bouveyron and C. Brunet-Saumard, “Model-based clustering of high-dimensional data: A review,” Computational Statistics & Data Analysis, vol. 71, pp. 52–78, 2014.
- [12] X. Luo, H. Wang, D. Wu, C. Chen, M. Deng, J. Huang, and X.-S. Hua, “A survey on deep hashing methods,” ACM Trans. Knowl. Discov. Data, vol. 17, no. 1, feb 2023.
- [13] R. Cantini, F. Marozzo, G. Bruno, and P. Trunfio, “Learning sentence-to-hashtags semantic mapping for hashtag recommendation on microblogs,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 16, no. 2, pp. 1–26, 2021.
- [14] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proceedings of the thiry-fourth annual ACM symposium on Theory of computing, 2002, pp. 380–388.
- [15] K. M. Hammouda and M. S. Kamel, “Models of distributed data clustering in peer-to-peer environments,” Knowledge and information systems, vol. 38, pp. 303–329, 2014.
- [16] W. Ni, G. Cheng, Y. Wu, and Z. Sun, “Local density based distributed clustering algorithm,” Journal of Software, vol. 19, no. 9, pp. 2339–2348, 2008.
- [17] X. Xu, J. Jäger, and H.-P. Kriegel, “A fast parallel clustering algorithm for large spatial databases,” High Performance Data Mining: Scaling Algorithms, Applications and Systems, pp. 263–290, 2002.
- [18] G. Jagannathan and R. N. Wright, “Privacy-preserving distributed k-means clustering over arbitrarily partitioned data,” in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, 2005, pp. 593–599.
- [19] M.-F. F. Balcan, S. Ehrlich, and Y. Liang, “Distributed -means and -median clustering on general topologies,” Advances in neural information processing systems, vol. 26, 2013.
- [20] J. Jeong, B. Ryu, D. Shin, and D. Shin, “Integration of distributed biological data using modified k-means algorithm,” in Emerging Technologies in Knowledge Discovery and Data Mining: PAKDD 2007 International Workshops Nanjing, China, May 22-25, 2007 Revised Selected Papers 11. Springer, 2007, pp. 469–475.
- [21] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the royal statistical society: series B (methodological), vol. 39, no. 1, pp. 1–22, 1977.
- [22] H.-P. Kriegel, P. Kroger, A. Pryakhin, and M. Schubert, “Effective and efficient distributed model-based clustering,” in Fifth IEEE International Conference on Data Mining (ICDM’05). IEEE, 2005, pp. 8–pp.
- [23] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic subspace clustering of high dimensional data for data mining applications,” in Proceedings of the 1998 ACM SIGMOD international conference on Management of data, 1998, pp. 94–105.
- [24] S. Yang, L. Zhang, C. Xu, H. Yu, J. Fan, and Z. Xu, “Massive data clustering by multi-scale psychological observations,” National Science Review, vol. 9, no. 2, p. nwab183, 2022.
- [25] Y. Gong, S. Kumar, V. Verma, and S. Lazebnik, “Angular quantization-based binary codes for fast similarity search,” Advances in neural information processing systems, vol. 25, 2012.
- [26] J. Gui, T. Liu, Z. Sun, D. Tao, and T. Tan, “Fast supervised discrete hashing,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 2, pp. 490–496, 2017.
- [27] A. Gionis, P. Indyk, R. Motwani et al., “Similarity search in high dimensions via hashing,” in Vldb, vol. 99, no. 6, 1999, pp. 518–529.
- [28] C. Strecha, A. Bronstein, M. Bronstein, and P. Fua, “Ldahash: Improved matching with smaller descriptors,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 1, pp. 66–78, 2011.
- [29] Z. Zhang, X. Zhu, G. Lu, and Y. Zhang, “Probability ordinal-preserving semantic hashing for large-scale image retrieval,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 15, no. 3, pp. 1–22, 2021.
- [30] L. Yuan, T. Wang, X. Zhang, F. E. Tay, Z. Jie, W. Liu, and J. Feng, “Central similarity quantization for efficient image and video retrieval,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3080–3089.
- [31] H. Liu, R. Wang, S. Shan, and X. Chen, “Deep supervised hashing for fast image retrieval,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2064–2072.
- [32] H. Lai, Y. Pan, Y. Liu, and S. Yan, “Simultaneous feature learning and hash coding with deep neural networks,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3270–3278.
- [33] Y. Shen, J. Qin, J. Chen, M. Yu, L. Liu, F. Zhu, F. Shen, and L. Shao, “Auto-encoding twin-bottleneck hashing,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 2815–2824.
- [34] M. Zieba, P. Semberecki, T. El-Gaaly, and T. Trzcinski, “Bingan: Learning compact binary descriptors with a regularized gan,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, ser. NIPS’18. Red Hook, NY, USA: Curran Associates Inc., 2018, p. 3612–3622.
- [35] Y. K. Jang and N. I. Cho, “Self-supervised product quantization for deep unsupervised image retrieval,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 12 065–12 074.
- [36] S. Zheng, C. Shen, and X. Chen, “Design and analysis of uplink and downlink communications for federated learning,” IEEE Journal on Selected Areas in Communications, vol. 39, no. 7, pp. 2150–2167, 2020.
- [37] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, “Fedpaq: A communication-efficient federated learning method with periodic averaging and quantization,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 2021–2031.
- [38] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.
- [39] U. Von Luxburg, “A tutorial on spectral clustering,” Statistics and computing, vol. 17, pp. 395–416, 2007.
- [40] L. Hagen and A. B. Kahng, “New spectral methods for ratio cut partitioning and clustering,” IEEE transactions on computer-aided design of integrated circuits and systems, vol. 11, no. 9, pp. 1074–1085, 1992.
- [41] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on pattern analysis and machine intelligence, vol. 22, no. 8, pp. 888–905, 2000.
- [42] L. M. Aouad, N. An-Lekhac, and T. Kechadi, “Grid-based approaches for distributed data mining applications,” Journal of Algorithms & Computational Technology, vol. 3, no. 4, pp. 517–534, 2009.
- [43] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to information retrieval. Cambridge University Press Cambridge, 2008, vol. 39.
- [44] A. Strehl and J. Ghosh, “Cluster ensembles—a knowledge reuse framework for combining multiple partitions,” Journal of machine learning research, vol. 3, no. Dec, pp. 583–617, 2002.
- [45] “Hyperspectral remote sensing scenes,” https://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes, Last accessed on March. 2023.
- [46] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
- [47] Y. LeCun, “Mnist handwriting dataset,” https://yann.lecun.com/exdb/mnist/, Last accessed on March. 2023.
- [48] B. Bahmani, B. Moseley, A. Vattani, R. Kumar, and S. Vassilvitskii, “Scalable k-means++,” arXiv preprint arXiv:1203.6402, 2012.
- [49] Y. Zhang, S. Chen, and G. Yu, “Efficient distributed density peaks for clustering large data sets in mapreduce,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 12, pp. 3218–3230, 2016.
- [50] M. Liang, Q. Li, Y.-a. Geng, J. Wang, and Z. Wei, “Remold: an efficient model-based clustering algorithm for large datasets with spark,” in 2017 IEEE 23rd International Conference on Parallel and Distributed Systems (ICPADS). IEEE, 2017, pp. 376–383.
Yifeng Xiao received the B.S. degree in applied mathematics from Xi’an Jiaotong University, Xi’an, China, in 2021, where he is currently pursuing the master’s degree in applied mathematics. His current research interests include machine learning, distributed algorithms and learning to optimize.

Jiang Xue (Senior Member, IEEE) received the B.S. degree in information and computing science from Xi’an Jiaotong University, Xi’an, China, in 2005, the M.S. degree in applied mathematics from Lanzhou University, China, and Uppsala University, Sweden, in 2008 and 2009, respectively, and the Ph.D. degree in electrical and electronic engineering from ECIT, Queen’s University of Belfast, U.K., in 2012. From 2013 to 2017, he was a Research Fellow with the University of Edinburgh, U.K. Since 2017, he has been with the National Engineering Laboratory for Big Data Analytics, Xi’an International Academy for Mathematics and Mathematical Technology, School of Mathematics and Statistics, Xi’an Jiaotong University, and Pengcheng Laboratory, China. He is supported by the Zhongying Young Scholars Project. His main interests include machine learning and wireless communication, performance analysis of general multiple antenna systems, CSI estimation and prediction, stochastic geometry, cooperative communications, and cognitive radio.

Deyu Meng received the B.Sc., M.Sc., and Ph.D. degrees from Xi’an Jiaotong University, Xi’an, China, in 2001, 2004, and 2008, respectively. He was a Visiting Scholar at Carnegie Mellon University, Pittsburgh, PA, USA, from 2012 to 2014. He is currently a Professor with the School of Mathematics and Statistics, Xi’an Jiaotong University, and an Adjunct Professor with the Faculty of Information Technology, Macau University of Science and Technology, Taipa, Macau, China. His research interests include model-based deep learning, variational networks, and meta learning.