Hierarchical Sparse Representation Clustering for High-Dimensional Data Streams

Jie Chen, Hua Mao, Yuanbiao Gou, and Xi Peng Manuscript received . (Corresponding author: Xi Peng.)Jie Chen, Yuanbiao Gou and Xi Peng are with the College of Computer Science, Sichuan University, Chengdu 610065, China (E-mail: [email protected]; [email protected]; [email protected]).Hua Mao is with the Department of Computer and Information Sciences, Northumbria University, Newcastle, NE1 8ST, U. K. (E-mail: [email protected]).

Abstract

Data stream clustering reveals patterns within continuously arriving, potentially unbounded data sequences. Numerous data stream algorithms have been proposed to cluster data streams. The existing data stream clustering algorithms still face significant challenges when addressing high-dimensional data streams. First, it is intractable to measure the similarities among high-dimensional data objects via Euclidean distances when constructing and merging microclusters. Second, these algorithms are highly sensitive to the noise contained in high-dimensional data streams. In this paper, we propose a hierarchical sparse representation clustering (HSRC) method for clustering high-dimensional data streams. HSRC first employs an $l_{1}$ -minimization technique to learn an affinity matrix for data objects in individual landmark windows with fixed sizes, where the number of neighboring data objects is automatically selected. This approach ensures that highly correlated data samples within clusters are grouped together. Then, HSRC applies a spectral clustering technique to the affinity matrix to generate microclusters. These microclusters are subsequently merged into macroclusters based on their sparse similarity degrees (SSDs). Additionally, HSRC introduces sparsity residual values (SRVs) to adaptively select representative data objects from the current landmark window. These representatives serve as dictionary samples for the next landmark window. Finally, HSRC refines each macrocluster through fine-tuning. In particular, HSRC enables the detection of outliers in high-dimensional data streams via the associated SRVs. The experimental results obtained on several benchmark datasets demonstrate the effectiveness and robustness of HSRC.

Index Terms:

High-dimensional data stream, clustering, sparse representation, outlier detection

I Introduction

Data stream clustering techniques have attracted attention from data mining and machine learning researchers [1, 2, 3, 4, 5]. They have significant impacts on the development of data stream applications [6], such as network media transmission, sensor transfer information, and computer network traffic monitoring approaches. These applications involve data streams that can be potentially infinite, but only a limited amount of memory resources are available for computation purposes. Extracting potentially valuable knowledge from massive data streams naturally leads to the challenging problem of data stream clustering. As an effective tool for data stream mining, data stream clustering aims to divide the given data objects into a finite number of classes and identify potential outliers.

Traditional clustering algorithms, e.g., spectral clustering [7, 8], statistical learning-based methods and their variants [9], often achieve impressive performance on static and stable datasets. A data stream consists of massive, unbounded sequences of data objects, and its probability distribution changes over time in an unexpected way. This change can be gradual, which is known as concept drift, or sudden, which is referred to as concept shift. As a result, once entirely new classes appear in the stream, traditional classifiers are expensive and time-consuming, and prediction models need to be reconstructed and retrained from scratch via a newly collected data stream.

A variety of clustering methods have been proposed for data streams [10, 11, 12, 13, 14, 15]. For example, an efficient and scalable data clustering algorithm, called balanced iterative reducing and clustering using hierarchies (BIRCH), was proposed for conducting exploratory analyses on very large datasets [16]. BIRCH constructs a height-balanced tree via a clustering feature (CF) vector to compute cluster measures, such as the clustering mean, radius and diameter. The Euclidean distance between a new object and each centroid of the CF entries is calculated and later compared with a given threshold. The CF vector introduced by BIRCH has been extensively employed by different algorithms [17, 18, 19]. Specifically, the concept of the CF vector was extended to microclusters in the CluStream algorithm [17], which includes two stages: online microclustering and offline macroclustering. The density-based data stream clustering (DenStream) algorithm [18] introduces the structures of potential core microclusters and outlier microclusters via their CFs for incremental computation purposes. Additionally, the ClusTree algorithm uses a hierarchical indexing structure to maintain CFs and further builds a microcluster hierarchy at different granularity levels [19].

Data stream clustering algorithms utilize CF vectors to measure the similarity among data objects, which is intrinsically tied to the construction and merging operations involving appropriate microclusters. However, these approaches still face significant challenges. First, a new data object is merged into its nearest microcluster if the new radius (determined based on the associated CFs) is smaller than a predetermined threshold. Selecting an appropriate maximum boundary for this radius is a challenging problem. Second, these algorithms often rely on local similarity measures, typically the Euclidean distances between data objects, to maintain local neighborhood information. However, noise is pervasive in high-dimensional data streams. The use of the Euclidean distance measure for evaluating the relationships among data objects makes the similarity computation highly sensitive to noise. Clustering inherently assumes that highly correlated data objects within the same cluster share similar structural characteristics. Therefore, measuring the similarity of the data objects contained in high-dimensional data streams remains a significant challenge.

Recently, sparse representation has received significant interest across various fields [20, 21, 22, 23, 24], including image classification [25, 26], image denoising [27], and visual tracking [28]. It offers a statistical model for identifying a set of dictionary elements from the same class for each data object. Ideally, sparse representation groups highly correlated data objects belonging to the same class, with the nonzero elements representing the weights assigned to different pairs of data objects. As a result, sparse representation has demonstrated the ability to mitigate the aforementioned challenges by leveraging the self-similarity properties of data objects contained in high-dimensional data streams.

In this paper, we propose a hierarchical sparse representation clustering (HSRC) method for clustering high-dimensional data streams. HSRC first incorporates a sparse constraint into the self-expressiveness property of data objects in high-dimensional data streams. It aims to learn an affinity matrix that can capture the relationships among different data objects. For example, given a set of data objects in a landmark window, each object is represented as a linear combination of the other objects, and the coefficient matrix is sparse. In particular, we integrate adaptive dictionary learning into the sparse representation process to improve its generalization in data stream clustering scenarios. Then, HSRC partitions the affinity matrix into microclusters via a spectral clustering technique. The sparse linear representation of each data object includes more objects from the same cluster, indicating that the affinity matrix encodes the relationships within microclusters. Unlike density-based clustering techniques, HSRC does not rely on Euclidean distances to measure relationships among data objects. Instead, two microclusters are merged into a new macrocluster if the sparse similarity degree (SSD) of the merged cluster is higher than that between the new microcluster and any other microcluster. Finally, the clustering labels of the data objects contained within the macroclusters are fine-tuned to enhance the performance of the data stream clustering process. Additionally, HSRC includes an outlier detection mechanism for identifying potential outliers.

The contributions of this paper can be summarized as follows.

1.

We present an HSRC model that measures the relationships among data objects via sparse representation.
2.

A residual sparsity value is introduced for conducting adaptive dictionary sample selection and outlier identification, which can improve the generalization of the sparse representation process in high-dimensional data stream clustering tasks.
3.

The SSDs among microclusters are introduced to construct macroclusters, which together form a hierarchical structure for performing high-dimensional data stream clustering.
4.

Extensive experimental results obtained on benchmark datasets demonstrate the effectiveness and robustness of HSRC in high-dimensional data stream clustering scenarios.

The remainder of this paper is organized as follows. The related work on sparse representation and data stream clustering is summarized in Section II. Section III describes the HSRC method in detail. Extensive experimental results obtained several benchmark datasets are presented in Section IV. Finally, conclusions are given in Section V.

II Related Work

In this section, we present a brief overview of the related work on sparse representation and data stream clustering.

II-A Sparse Representation Theory

Let $\mathbf{D}=[{\mathbf{d}_{1}},{\mathbf{d}_{2}},...,{\mathbf{d}_{n}}]\in{\mathbf{R}^{d\times n}}$ be a dictionary consisting of $n$ vectors. Given a signal $x=\in{\mathbf{R}^{d}}$ , its sparsest representation over the dictionary $\mathbf{D}$ can be approximately represented by a sparse linear combination of only a few columns of $\mathbf{D}$ , i.e.,

\mathop{\min}\limits_{\mathbf{z}}{\left\|{\mathbf{x}-\mathbf{D}\mathbf{z}}\right\|_{2}}+\lambda{\left\|\mathbf{z}\right\|_{0}},

(1)

where $\mathbf{z}$ represents a sparse coefficient, $\lambda>0$ is a tradeoff parameter, and ${\left\|\cdot\right\|_{0}}$ denotes the number of nonzero elements in a vector.

Since Problem (1) is nonconvex and nondeterministic polynomial-time (NP)-hard, it is often relaxed to the following ${l_{1}}$ -norm minimization problem as a common surrogate for the sparse representation task, i.e.,

\mathop{\min}\limits_{\mathbf{z}}{\left\|{\mathbf{x}-\mathbf{D}\mathbf{z}}\right\|_{2}}+\lambda{\left\|{\mathbf{z}}\right\|_{1}},

(2)

where ${\left\|\cdot\right\|_{1}}$ denotes the $l_{1}$ -norm of a vector. The above optimization problem can be solved via various convex optimization methods, such as iterative thresholding algorithms [29, 24].

Inspired by the advances achieved with respect to ${l_{0}}$ -norm and ${l_{1}}$ -norm techniques, sparse representation has been widely used in various areas of machine learning and pattern recognition [30, 31]. For example, Aharon et al. [32] proposed a $K$ -singular value decomposition (SVD) algorithm for designing overcomplete dictionaries for sparse representation purposes. The $K$ -SVD algorithm can effectively seek the best representation of a given set as a dictionary under strict sparsity constraints. Hence, sparse representation has been proven to be an extremely successful technique for exploiting the intrinsic structures of high-dimensional data.

II-B Data Stream Clustering Techniques

A data stream is a massive, potentially unbounded sequence of data objects which are continuously generated over time. Most existing data stream clustering algorithms can be broadly categorized into three main types: hierarchy-based clustering, density-based clustering and partitioning-based clustering.

Hierarchy-based clustering algorithms first use the data objects to construct a tree, in which each leaf represents a data object. These algorithms then group data objects into the corresponding clusters using agglomerative or divisive hierarchical decomposition. In the agglomerative approach, $n$ objects are merged into more general classes in a bottom-up fashion, whereas in divisive techniques, $n$ objects are divided into smaller clusters in a bottom-up fashion. BIRCH is a typical hierarchical clustering algorithm that constructs a height-balanced tree using the CF vector [16]. BIRCH adopts the CF vector to compute cluster measures such as the mean, radius, and diameter. Finally, the Euclidean distances between new objects and each centroid of the CF entries are calculated and compared against a threshold to determine whether the new object belongs to a new class. The BIRCH algorithm has $O(n)$ complexity and obtains good clustering results through only a single scan. However, it does not work well with data of arbitrary shape. The E-Stream algorithm is an evolution-based approach for data stream clustering that supports five evolutions of the data, namely appearance, disappearance, self-evolution, merging, and splitting [33, 34].

Density-based clustering algorithms define clusters as high-density areas of the features, which can be used to detect any arbitrary-shaped clusters. DenStream, an extended version of the DBSCAN algorithm, discovers clusters of arbitrary shape in an evolving data stream [18]. It introduces a core microcluster, a potential core microcluster, and an outlier microcluster using the CFs developed for BIRCH. The core microcluster is adopted to summarize the clusters with an arbitrary shape, while the potential core microcluster and outlier microcluster structures are used to pursue and distinguish potential clusters and outliers. DenStream involves numerous time vector calculations in the offline clustering phase, which leads to a high computational cost. The D-Stream algorithm automatically and dynamically adjusts the clusters without requiring user specification of the target time horizon or the number of clusters [11]. An online component maps each input data record into a grid, while an offline component computes the grid density and clusters the grids based on the density. However, D-Stream is incapable of processing very-high-dimensional data.

Partitioning-based clustering methods use traditional soft partitioning techniques such as $k$ -median and $k$ -means clustering to deal with data streams. CluStream is a two-phase clustering method that performs online micro-clustering and offline macro-clustering [17]. The first phase involves the acquisition of summary statistics from the data stream by extending the concept of the CF vector in the online micro-clustering. Each microcluster contains five components, three of which are the regular components of the CF vector. The second phase uses these statistics and other inputs to create clusters, where the $k$ -means algorithm is embedded in the offline macro-clustering component. HPStream was developed as an extension to CluStream that performs projected clustering for high-dimensional streaming data. CluStream is not efficient when applied to high-dimensional data streams.

III Hierarchical Sparse Representation Clustering

In this section, we introduce the HSRC method for clustering high-dimensional data streams. Given the potentially infinite nature of data streams, HSRC employs landmark windows, thereby sequentially processing fixed-size, nonoverlapping chunks of data objects. The proposed HSRC model contains three critical processes, including the creation of microclusters, the merging of microclusters into macroclusters and the fine-tuning of macroclusters, which together form a hierarchical structure. Throughout these processes, the data object representatives are repeatedly utilized across three successive HSRC processes. Therefore, the proposed HSRC method can effectively capture the intrinsic structures of the high-dimensional data objects contained within data streams.

III-A Evaluating the Relationships Among Data Objects via Sparse Representation

Let $\mathbf{X}=[{\mathbf{x}_{1}},{\mathbf{x}_{2}},...,{\mathbf{x}_{n}}]\in{\mathbf{R}^{d\times n}}$ be a set of data objects included in a landmark window. The rationale assumption behind clustering algorithms is that data objects within a cluster are more similar to each other than they are to objects belonging to a different cluster. Given a data object ${\mathbf{x}_{i}}$ ( $1\leq i\leq n$ ), HSRC uses the following ${l_{1}}$ -norm minimization problem for sparse representation:

\mathop{arg\min}\limits_{\mathbf{z}_{i}}{\left\|{\mathbf{z}_{i}}\right\|_{1}}\qquad s.t.\qquad{\left\|{{\mathbf{x}_{i}}-\mathbf{X}{\mathbf{z}_{i}}}\right\|_{2}^{2}}\leq\varepsilon

(3)

where $\varepsilon$ represents a noise item. The above optimization problem can be solved in polynomial time via standard linear programming algorithms [29, 24]. Here $\mathbf{Z}$ can be constructed using the vectors of sparse coefficients, i.e., $\mathbf{Z}=[{\mathbf{z}_{1}},{\mathbf{z}_{2}},...,{\mathbf{z}_{n}}]$ , and ${\mathbf{Z}^{*}}=\left|\mathbf{Z}\right|+\left|{{\mathbf{Z}^{T}}}\right|$ . Each element $z_{ij}$ in ${\mathbf{Z}^{*}}$ can be used to measure the similarity between two data objects ${\mathbf{x}_{i}}$ and ${\mathbf{x}_{j}}$ . If the data objects $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are close in terms of the intrinsic geometries of their data distributions, their representations, $\mathbf{z}_{i}$ and $\mathbf{z}_{j}$ , respectively, should also be close with respect to the same basis $\mathbf{X}$ . Hence, sparse representation is utilized to evaluate the relationships among the data objects contained in high-dimensional data streams.

The solution to Problem (3) is individually obtained for each data object. This inevitably results in high computational costs. However, clustering high-dimensional data streams requires data objects to be continuously clustered within a limited period. To increase the efficiency of optimizing individual sparse representations, we use a sparse batch representation optimization problem as a good surrogate for Problem (3). Specifically, we formulate the following convex optimization problem to find a sparse representation $\mathbf{Z}$ :

\min\limits_{\mathbf{Z},\mathbf{E}}{\left\|\mathbf{Z}\right\|_{1}}+\lambda\left\|{\mathbf{E}}\right\|_{l}\qquad s.t.\qquad\mathbf{X}=\mathbf{X}\mathbf{Z}+\mathbf{E},

(4)

where $\mathbf{E}\in{\mathbf{R}^{d\times n}}$ is a noise item and $\lambda$ is a scalar constant.

Algorithm 1 Solving Problem (5) via an inexact ALM framework

0: a data matrix

\mathbf{X}=[{\mathbf{x}_{1}},{\mathbf{x}_{2}},...,{\mathbf{x}_{n}}]\in{\mathbf{R}^{d\times n}}

, a parameter

\lambda>0

Initialize: $\mathbf{Z}=\mathbf{J}=0,\mathbf{E}=0,{\mathbf{Y}_{1}}={\mathbf{Y}_{2}}=0,{\mu}={10^{-2}},$
${\mu_{\max}}={10^{10}},\rho=1.1,\varepsilon={10^{-6}}$

1: while not converged do

2: update the variables via Eq. (8);

3: update the multipliers:

{\mathbf{Y}_{1}}\leftarrow{\mathbf{Y}_{1}}+{\mu}\left(\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E}\right)

;

{\mathbf{Y}_{2}}\leftarrow{\mathbf{Y}_{2}}+{\mu}\left(\mathbf{Z}-\mathbf{J}\right)

;

4: update the parameter

{\mu}

{\mu}\leftarrow\min(\rho{\mu},{\mu_{\max}})

;

5: check the convergence conditions:

{\left\|{\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E}}\right\|_{max}}<\varepsilon

and

{\left\|{\mathbf{Z}-\mathbf{J}}\right\|_{max}}<\varepsilon

;

6: end while

\left({\mathbf{Z},\mathbf{E}}\right)

We first convert Problem (4) to the following equivalent problem by introducing an auxiliary variable $\mathbf{J}$ :

\min\limits_{\mathbf{Z},\mathbf{J},\mathbf{E}}{\left\|\mathbf{J}\right\|_{1}}+\lambda\left\|{\mathbf{E}}\right\|_{l}\qquad s.t.\qquad\mathbf{X}=\mathbf{X}\mathbf{Z}+\mathbf{E},\ \mathbf{Z}=\mathbf{J}.

(5)

The augmented Lagrangian function of Problem (5) is

\begin{split}&\mathop{\min}\limits_{\mathbf{Z},\mathbf{E},\mathbf{J},{\mathbf{Y}_{1}},{\mathbf{Y}_{2}}}{\left\|\mathbf{J}\right\|_{1}}+\lambda\left\|\mathbf{E}\right\|_{l}+tr\left({\mathbf{Y}_{1}^{T}(\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E}}\right)+\\ &tr\left({\mathbf{Y}_{2}^{T}\left({\mathbf{Z}-\mathbf{J}}\right)}\right)+\frac{{{\mu}}}{2}\left(\left\|{\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E}}\right\|_{F}^{2}+\left\|{\mathbf{Z}-\mathbf{J}}\right\|_{F}^{2}\right),\end{split}

(6)

where ${\mathbf{Y}_{1}}$ and ${\mathbf{Y}_{2}}$ are Lagrange multipliers, and ${\mu_{1}}>0$ and ${\mu_{2}}>0$ are penalty parameters. The above optimization problem can be formulated as follows:

\begin{split}&\mathop{\min}\limits_{\mathbf{Z},\mathbf{E},\mathbf{J},{\mathbf{Y}_{1}},{\mathbf{Y}_{2}}}{\left\|\mathbf{J}\right\|_{1}}+\lambda\left\|\mathbf{E}\right\|_{l}+\\ &\frac{{{\mu}}}{2}\left(\left\|{\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E}+\frac{{{\mathbf{Y}_{1}}}}{{{\mu}}}}\right\|_{F}^{2}+\left\|{\mathbf{Z}-\mathbf{J}+\frac{{{\mathbf{Y}_{2}}}}{{{\mu}}}}\right\|_{F}^{2}\right).\end{split}

(7)

This formula can be effectively solved by the inexact augmented Lagrange multiplier (ALM) framework [35]. The variables $\mathbf{J}$ , $\mathbf{Z}$ and $\mathbf{E}$ can be updated alternately at each step, whereas the other two variables are fixed. The updating schemes for the $(k+1)$ th iteration are as follows:

\begin{split}&{\mathbf{J}}\leftarrow\mathop{\min}\limits_{\mathbf{J}}\frac{1}{{{\mu}}}{\left\|\mathbf{J}\right\|_{1}}+\frac{1}{2}\left\|{\mathbf{J}-\left({\mathbf{Z}+\frac{{{\mathbf{Y}_{2}}}}{{{\mu}}}}\right)}\right\|_{F}^{2},\\ &{\mathbf{Z}}\leftarrow\left({\mathbf{I}+{\mathbf{X}^{T}}\mathbf{X}}\right)^{-1}\left({{\mathbf{X}^{T}}\mathbf{X}-{\mathbf{X}^{T}}\mathbf{E}+\mathbf{J}+\frac{{{\mathbf{X}^{T}}{\mathbf{Y}_{1}}-{\mathbf{Y}_{2}}}}{{{\mu}}}}\right),\\ &{\mathbf{E}}\leftarrow\mathop{\min}\limits_{\mathbf{E}}\lambda{\left\|\mathbf{E}\right\|_{l}}+\frac{{{\mu}}}{2}\left\|{\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E}+\frac{{{\mathbf{Y}_{1}}}}{{{\mu}}}}\right\|_{F}^{2}.\end{split}

(8)

Algorithm 2 Creating microclusters

0: a data matrix

\mathbf{X}=[{\mathbf{x}_{1}},{\mathbf{x}_{2}},...,{\mathbf{x}_{n}}]\in{\mathbf{R}^{d\times n}}

; parameters

\lambda>0

m>0

and

\sigma\in\left({0,1}\right)

1: Solve Problem (5) via Algorithm 1 and obtain the optimal solution

\left({\mathbf{Z},\mathbf{E}}\right)

2: Compute the affinity matrix

{\mathbf{Z}^{*}}=\left|\mathbf{Z}\right|+\left|{{\mathbf{Z}^{T}}}\right|

3: Apply NCuts on

{\mathbf{Z}^{*}}

to obtain

m

microclusters.

4: For a new data object

\mathbf{x}

, compute

SRV(\mathbf{x})

using Eq. (9);

5: if

SRV(\mathbf{x})<\sigma

then

\mathbf{x}

is considered an outlier.

7: end if

7: The

m

microclusters.

The first equation in Problem (8) is a convex problem that can be solved via the singular value thresholding (SVT) operator [36]. The solution of the third equation in Problem (8) is closely related to the value of the $l$ -norm, which indicates a certain regularization strategy for characterizing various noises. For example, the ${l_{2,1}}$ -norm encourages the columns of the matrix to be zero, where the last equation has a closed-form solution [37]. The complete procedure for solving Problem (5) is outlined in Algorithm 1.

III-B Evaluating the Importance of Data Objects via Sparsity Residual Values

HSRC uses the learned affinity matrix to identify clusters as groups of microclusters through a spectral clustering algorithm, such as NCuts [7]. For simplicity, we assume that $n$ data objects in a landmark window are segmented into $m$ microclusters, i.e., $\mathbf{X}=[{\mathbf{X}_{1}},{\mathbf{X}_{2}},...,{\mathbf{X}_{m}}]$ , according to Algorithm 2. For example, the $i$ th microcluster contains $n_{i}$ data objects, i.e., ${\mathbf{X}_{i}}=[{\mathbf{X}_{i1}},{\mathbf{X}_{i2}},...,{\mathbf{X}_{i{n_{i}}}}]$ . Each microcluster consists of a set of neighboring objects, where each data object within the microcluster can be linearly represented by other objects in the cluster. The microclusters are identified in a single pass of the landmark window, and their summary statistics are stored offline.

Definition 1

Sparsity residual value (SRV): The SRV of a data object $\mathbf{x}$ is defined as

SRV(\mathbf{x})=\frac{1}{{{{\left\|\mathbf{e}\right\|}_{0}}}}\sum\limits_{j=1}^{n_{i}}{\frac{{\left|{{e_{j}}}\right|}}{{{{\left\|\mathbf{e}\right\|}_{2}}}}}

(9)

where $\mathbf{e}\in{\mathbf{R}^{n_{i}}}$ denotes a corresponding noise item in the sparse representation of the object and $n_{i}$ is the number of samples contained in the $i$ th cluster.

We define SRVs to measure the importance levels of the data objects included in clusters. The lower the SRV value of a data object is, the more significant it is during the sparse representation process. Without loss of generality, the sparse coefficients of a data object $\mathbf{x}$ are represented by a vector $\mathbf{z}$ , and $\mathbf{e}$ denotes a corresponding noise item in the sparse representation of the object. Given a solution $\mathbf{Z}$ for the data objects $\mathbf{X}$ obtained by Algorithm 1, the representative data objects can be adaptively selected from the candidates by sorting their SRVs. The sizes of the representative data objects are aligned with those of the data objects contained in the landmark window. The dictionary samples in an adaptive dictionary selection scheme consist of two parts, including the data objects contained in the current landmark window and the representative data objects chosen from the previous landmark window.

Algorithm 3 Merging microclusters into macroclusters

0: Microclusters

\mathbf{X}=[{\mathbf{X}_{1}},{\mathbf{X}_{2}},...,{\mathbf{X}_{m}}]

and an affinity matrix

{\mathbf{Z}^{*}}

Initialize: $\mathbf{S}\in{\mathbf{R}^{m\times m}}=0$ ;

1: while the clusters in

\mathbf{X}

have been changed do

2: for each cluster

i

among the clusters do

s

\leftarrow

the number of clusters in

\mathbf{X}

;

4: for each cluster

j

among the clusters and

i\neq j

5: Compute

S({\mathbf{X}_{i}},{\mathbf{X}_{j}})

and

S({\mathbf{X}_{i+j}},{\mathbf{X}_{p}})

using Eqs. (11) and (12), respectively, where

p\in\left[{1,s}\right]

p\neq i

and

p\neq j

;

6: if

\mathbf{S}({\mathbf{X}_{i}},{\mathbf{X}_{j}})\leq\mathbf{S}({\mathbf{X}_{i+j}},{\mathbf{X}_{p}})

then

\mathbf{S}[ij]=1

;

8: end if

9: end for

10:

mergedSet=[]

11: for

i=1:s

and

i\notin mergedSet

12:

clusterRows=find(S(i,:)==1)

;

13:

clusterCols=find(S(:,i)==1)

;

14:

clusterSet=clusterRows\cap clusterCols

15: if

\sim isempty(clusterSet)

then

16: Merge cluster

i

and all clusters in

clusterSet

;

17: Add

i

and all items of

clusterSet

mergedSet

;

18: end if

19: end for

20: end for

21: end while

21: The clusters in

\mathbf{X}

In practice, outliers often arise due to various factors, such as data collection, storage, or transmission errors. Outliers are data objects that deviate from a sparse representation-based model. To identify outliers throughout high-dimensional data streams, we set a threshold $\sigma\in\left({0,1}\right)$ . A data object is estimated according to the threshold $\sigma$ . For example, we consider a data object an outlier if it satisfies the following condition:

SRV(\mathbf{x})\geq\sigma.

(10)

HSRC is able to adaptively identify outliers in high-dimensional data streams. Algorithm 2 summarizes the complete microclustering algorithm of HSRC.

Algorithm 4 Fine-tuning macroclusters

s

macroclusters

\mathbf{X}=[{\mathbf{X}_{1}},{\mathbf{X}_{2}},...,{\mathbf{X}_{s}}]\in{\mathbf{R}^{d\times n}}

, a parameter

\sigma\in\left({0,1}\right)

1: for each cluster

i

in the macroclusters do

2: for each data object

\mathbf{x}

in the

i

th cluster do

3: for each cluster

j

in the macroclusters do

4: Construct an error set using Eq. (14) as follows:

errorSet=[error_{1},{error_{2}},...,{error_{s}}]

;

5: Determine the clustering index

j

of the minimum value in the error set;

6: if

i\neq j

then

7: Select

\mathbf{x}

from the

i

th macrocluster and drop it into the

j

th macrocluster;

8: end if

9: end for

10: end for

11: end for

11: The final clusters.

III-C Merging Micro-clusters into Macro-clusters

As microclusters are rough, they need to be merged into macroclusters. Considering the $i$ th microcluster, we first choose a candidate from another microcluster as the merged target. Utilizing only the sparse coefficients associated with the $j$ th microcluster ${\mathbf{X}_{j}}$ , the sparse representation of the $l$ th data object ${\mathbf{x}_{l}}$ contained in the $i$ th microcluster can be written as follows:

{\mathbf{x}_{l}}={\mathbf{X}_{j}}{\mathbf{z}_{j}}+{\mathbf{e}_{l}},

(11)

where $i,j\in[1,m]$ , $l\in[1,n_{i}]$ and $i\neq j$ . The item $\mathbf{z}_{j}\in{\mathbf{R}^{n_{j}}}$ is a sparse coefficient vector whose entries are those associated with $\mathbf{X}_{j}$ . The residual item ${\mathbf{e}_{l}}$ can be used to measure the similarity between ${\mathbf{x}_{l}}$ and ${\mathbf{X}_{j}}$ . We adopt the residual item to define the SSD concept, which is used to measure the similarity between two microclusters in the microcluster merging operation.

Definition 2

SSD: The SSD between cluster $i$ and cluster $j$ is the sum of the residuals of the data objects contained in the $i$ th microcluster ${\mathbf{X}_{i}}$ that are associated with the $j$ th microcluster ${\mathbf{X}_{j}}$ , which can be defined as

S({\mathbf{X}_{i}},{\mathbf{X}_{j}})=\sum\limits_{l=1}^{n_{i}}{{\left\|{\mathbf{e}_{l}}\right\|}_{2}},

(12)

where $i\neq j$ and $i,j\in[1,m]$ .

We further calculate the SSDs of cluster $i$ with the other clusters. Finally, we can obtain a candidate based on these residual items by choosing the lowest error. We need to further check the merging condition to determine whether cluster $i$ and cluster $j$ should be merged. We assume that we obtain a new cluster ${\mathbf{X}_{i+j}}$ after clusters $i$ and $j$ are merged. According to graph theory, the merging condition is that the similarity degree of the merged cluster ${\mathbf{X}_{i+j}}$ must be greater than that of cluster ${\mathbf{X}_{i+j}}$ associated with any other clusters ${\mathbf{X}_{p}}$ ; i.e., the following two conditions must hold:

\begin{split}&S({\mathbf{X}_{i}},{X_{j}})\leq S({\mathbf{X}_{i+j}},{\mathbf{X}_{p}}),\\ &S({\mathbf{X}_{j}},{X_{i}})\leq S({\mathbf{X}_{i+j}},{\mathbf{X}_{p}}),\end{split}

(13)

where $p\neq i$ and $p\neq j$ . If the above conditions hold, two microclusters are merged into a relatively large cluster. The final merging operation is stopped if any two microclusters cannot be further merged. For simplicity, we define a new matrix $\mathbf{S}$ with a size of $m\times m$ to preserve the merged sign for each pair of $m$ microclusters. For instance, ${S_{ij}}$ denotes that cluster $i$ can be merged into cluster $j$ . Algorithm 3 below summarizes the complete microclusters merging procedure.

Algorithm 5 The HSRC algorithm

\mathbf{X}_{t}=[{\mathbf{x}_{1}},{\mathbf{x}_{2}},...,{\mathbf{x}_{n}}]\in{\mathbf{R}^{d\times n}}

, parameters

\lambda>0

m>0

and

\sigma\in\left({0,1}\right)

1: if

t==1

then

\mathbf{X}={\mathbf{X}_{1}}

and

\mathbf{X}_{s}=\mathbf{X}_{1}

;

3: else

\mathbf{X}=\left[{{\mathbf{X}_{s}},{\mathbf{X}_{t}}}\right]

;

5: end if

6: Solve Problem (5) via Algorithm 1 and obtain the optimal solution

\left({\mathbf{Z},\mathbf{E}}\right)

7: Create

m

microclusters via Algorithm 2, and generate

m

microclusters.

8: Perform the merging operation on the

m

microclusters via Algorithm 3, and obtain

s

macroclusters.

9: Purify each of the

s

macroclusters in a fine-tuning manner via Algorithm 4, and obtain the final clusters.

10: for each cluster in the final clusters do

11: for each data object

\mathbf{x}

in a cluster do

12: Computing

SRV\left(\mathbf{x}\right)

using Eq. (9);

13: if

SRV(\mathbf{x})\geq\sigma

then

14:

\mathbf{x}

is regarded as an outlier;

15: end if

16: end for

17: Select the representative data objects from the cluster by sorting their SRVs and then add them to

{\mathbf{X}_{s}}

;

18: end for

18: The final clusters.

III-D Fine-Tuning

Although sparse representation attempts to automatically choose data objects belonging to the same class in data streams, the coefficients of sparse representation are not completely concentrated on a particular microcluster and instead are likely to be widely spread across a few microclusters. Hence, the microclusters can be considered rough clusters. A data object has a sparse representation whose nonzero entries are concentrated mostly in one microcluster. The distribution of the estimated sparse coefficients contains important clustering information about the relationships among the data objects. Under these circumstances, fine-tuning is a strategy in which a few data objects are selected from a microcluster and then dropped into another microcluster under certain conditions. This strategy incrementally improves the clustering performance achieved for data objects.

From a sparse representation perspective, fine-tuning considers each data object in a microcluster with respect to another microcluster as follows:

{error_{j}}=\left\|{{\mathbf{x}_{l}}-{\mathbf{X}_{j}}{\mathbf{Z}_{j}}}\right\|_{F}^{2},

(14)

where $\mathbf{x}_{l}$ denotes the $l$ th data object in the $i$ th microcluster, and ${\mathbf{X}_{j}}$ is the $j$ th microcluster. We assume that $s$ macroclusters are obtained via Algorithm 3. The error set of $\mathbf{x}_{l}$ is represented as $errorSet=[error_{1},{error_{2}},...,{error_{s}}]$ , and the clustering index of the minimum value included in the error set is $j$ .

TABLE I: Descriptions of the streaming datasets.

Datasets	Classes	Data objects	Features
Keystroke	4	1,600	10
Network Intrusion	2	494,000	42
Forest Cover	7	580,000	54
COIL-100	100	7,200	1,024

If $i\neq j$ , $\mathbf{x}_{l}$ is selected from the $i$ th microcluster and dropped into the $j$ th microcluster. For example, a data object is picked up from its original cluster and dropped into a new microcluster if it has a smaller residual with respect to another microcluster.

Finally, it is critical to choose candidates from among the data objects that will be purified in their clusters. We consider these data objects, which indicate the spread of the sparse coefficients over most classes. In particular, we adopt the SRV as a criterion to help choose potential outliers in each cluster. Algorithm 4 below summarizes the complete macrocluster fine-tuning procedure.

III-E Computational Complexity Analysis

We assume that $\mathbf{X}$ in a sliding-window has $n$ data objects that belong to $s$ classes, where the size of $\mathbf{X}$ is $d\times n$ . Algorithm 5 summarizes the complete data stream clustering algorithm of HSRC. We use an inexact ALM framework in Algorithm 1 [35]. The inexact ALM framework has been extensively studied and generally converges well. Algorithm 1 performs well in practical applications. The computational complexity of the first step of Algorithm 1 is $O({n^{2}})$ because it requires the sparse representation of an $n\times n$ matrix to be computed in an SVT operator. The overall computational complexity of Algorithm 1 is $O(t{n^{2}})$ if the ${l_{2,1}}$ -norm is adopted in the last equation of Problem (8), where $t$ is the number of iterations. When $n>d$ , the computational complexity of Algorithm 2 can be considered $O({n^{3}})$ in spectral clustering. The computational complexities of Algorithm 3 and Algorithm 4 are $O({k^{2}}{n^{2}})$ and $O({s^{2}}{n^{2}})$ respectively, where $k$ and $s$ are the numbers of microclusters and macroclusters, respectively. The complexity of HSRC is $O({t{n^{2}}+{n^{3}}+{k^{2}}{n^{2}}+{s^{2}}{n^{2}}})$ . Therefore, the final overall complexity of HSRC is $O({n^{3}})$ if $t\ll n$ , $k\ll n$ and $s\ll n$ .

IV Experimental Study

In this section, we evaluate the performance of the proposed HSRC approach on publicly available datasets by comparing it with existing popular data stream clustering algorithms: CluStream [17], ClusTree [19], CluStreamKM [38], StreamKM++ [39], fuzzy double $c$ -means based on sparse self-representation (FDCM_SSR) [40] and flexible density peak clustering (FDPC) [41]. HSRC is implemented in MATLAB, and all the experiments are conducted on a Windows platform with an Intel i5-2300 CPU and 16 GB of RAM. The anonymous source code¹¹1https://github.com/chenjie20/HSRC for SSCDL is available online. The implementations of the CluStream, ClusTree, CluStreamKM and StreamKM++ algorithms are provided by Massive Online Analysis (MOA), which is a popular open-source tool for data stream mining [38]. The source codes of the FDCM_SSR and FDPC algorithms are provided by their authors.

IV-A Experimental Settings

Four benchmark datasets are used in the evaluation [42, 43]. The data objects are randomly shuffled to alleviate any potential effects caused by the sorted streaming order. The drift interval of the Keystroke dataset is 200. The drift intervals of the other datasets are unknown. The Keystroke dataset is composed of 10 feature variables. The Network Intrusion dataset was collected from seven weeks of network traffic simulated in a military network environment. The Network Intrusion requests are divided into two classes: normal and malicious. All the text attributes of Network Intrusion are manually converted into enumerated values and represented by digits. The Forest Cover dataset consists of 54 cartographic variables. The last variables of the datasets represent the ground-truth labels of the corresponding data objects. The Columbia University Image Library (COIL)-100 dataset contains 7200 images of data objects belonging to 100 categories. The statistics of the datasets are summarized in Table I.

The HSRC algorithm is evaluated based on its clustering quality and temporal efficiency. Clustering quality is assessed via clustering purity and the $F$ -measure [44]. All the parameters of the competing algorithms are manually tuned to obtain their optimal results. In Algorithms 1-4, HSRC involves three parameters: $\lambda$ , $m$ and $\sigma$ . Empirically speaking, parameter $\lambda$ should be relatively large if the input data streams are slightly contaminated by noise, and vice versa. Parameter $m$ controls the number of microclusters and is typically set as a multiple of the maximum number of classes contained in the data streams. For our experiments, we use a modified parameter ${m^{\prime}}$ instead of $m$ , which ranges from 1 to 2. Microcluster merging is not performed if the maximum number of classes is less than or equal to 2 when ${m^{\prime}}=1$ . Parameter $\sigma$ is disregarded in cases where no outliers are detected. Additionally, HSRC disables the fine-tuning operation by setting $w=0$ . Further details concerning the parameters are provided in the experiments.

Refer to caption — (a) Network Intrusion

IV-B Clustering Quality Evaluation

TABLE II: Average clustering results (%) produced over the entire stream of each dataset.

Datasets	Metrics (%)	HSRC	CluStream	ClusTree	CluStreamKM	StreamKM++	FDCM_SSR	FDPC
Keystroke	Purity	89.49	64	73	71	69	74.75	80.06
Keystroke	$F$ -measure	89.03	74	75	72	70	75.27	70.59
Network Intrusion	Purity	96.48	93	91	94	92	94.33	94.89
Network Intrusion	$F$ -measure	97.95	72	70	78	75	90.72	89.51
Forest Cover	Purity	74.58	61	66	53	56	71.6	71.88
Forest Cover	$F$ -measure	45.28	37	33	36	38	44.61	44.18
COIL-100	Purity	67.74	41	39	44	43	51.1	44.96
COIL-100	$F$ -measure	64.24	40	42	38	39	47.97	26.87

TABLE III: The average computational costs required for a landmark window (seconds) in each dataset.

Datasets	HSRC	CluStream	ClusTree	CluStreamKM	StreamKM++	FDCM_SSR	FDPC
Keystroke	0.49	0.15	0.9	0.09	0.1	0.26	0.11
Network Intrusion	1.25	0.47	2.7	0.24	0.44	5.24	0.33
Forest Cover	1.22	0.29	0.72	0.27	0.48	3.75	0.3
COIL-100	5.73	1.71	2.21	1.55	2.45	6.97	1.64

We evaluate the proposed algorithm on the streaming datasets. The four groups of HSRC parameters used for this experiment are (1) $\lambda=100$ , $m^{\prime}=1$ , $w=0$ , (2) $\lambda=0.1$ , $m^{\prime}=1$ , $w=1$ , (3) $\lambda=12$ , $m^{\prime}=1$ , $w=0$ and (4) $\lambda=20$ , $m^{\prime}=1$ , $w=1$ . Compared with the other algorithms, the HSRC algorithm almost consistently obtains the best results in terms of clustering purity and the $F$ -measure in the experiments. This confirms that our proposed method is very effective for high-dimensional data stream clustering cases with varying numbers of clusters. For example, HSRC achieves a high clustering purity level of 89.49% on the Keystroke dataset and improves the clustering purity by at least 9.43% over those of the other algorithms. We observe the same advantages in our proposed method, which contains more data object features. For example, HSRC achieves high clustering purities of 96.48%, 74.58% and 67.74% for the remaining three datasets. Table II shows that it still significantly outperforms the other four algorithms in terms of the $F$ -measure. These clustering results confirm that the relationships calculated from the sparse representations significantly improve the attained clustering performance, especially when the data objects derived from the data streams contain more features. In addition, density-based algorithms can identify clusters with arbitrary shapes in data streams and often perform well when the input data objects have only a few attributes. However, the clustering performance of these algorithms decreases as the number of data object attributes gradually increases.

We evaluate the online performance of the proposed HSRC method on the first 50 windows of two representative datasets, i.e., the Network Intrusion and Forest Cover. Fig. 1 illustrates the online performance attained by the proposed HSRC method on these datasets. We can observe that HSRC almost achieves relatively stable clustering purity and $F$ -measures on both datasets. For example, Fig. 1a shows that the online clustering purity and $F$ -measure values exceed 90% for most of the landmark windows. Similarly, Fig. 1b indicates that both metrics yielded by HSRC are above 25%. These experimental results demonstrate the excellent stability of the online performance exhibited by the proposed HSRC method.

The time costs of all the competing algorithms are presented in Table III. The density-based algorithms generally have lower computational costs than the other methods do. For example, CluStreamKM has a lower computational cost than the other algorithms do. However, HSRC has relatively reasonable computational costs in the experiments. This is because HSRC spends most of its time computing the affinity matrix using sparse representation.

Compared with $k$ -means-based data stream clustering methods such as CluStreamKM and StreamKM++, HSRC employs a spectral clustering technique that also incorporates $k$ -means to create microclusters. CluStreamKM and StreamKM++ compute the centroid of each CF vector via the original data objects. In contrast, HSRC uses an $l_{1}$ -norm technique to construct an affinity matrix, which encodes the membership attributed of the high-dimensional data objects contained in the input data streams. HSRC captures the intrinsic structures of high-dimensional data objects. The experimental results show that HSRC significantly enhances the clustering purity and $F$ -measure values achieved for high-dimensional data streams.

IV-C Robustness to Noise and Outliers

We evaluate the robustness of these algorithms on a more challenging set of high-dimensional data streams. Four artificial pixel noise levels (5%, 10%, 15%, and 20%) are integrated into two datasets: (1) Network Intrusion and (2) Forest Cover. The locations of the corrupted data object attributes are chosen randomly, and the value of each selected location is replaced by a random number within the range $\left[{0,1}\right]$ .

TABLE IV: Average clustering results (%) produced on the Network Intrusion and Forest Cover datasets.

Datasets	Ratio (%)	Metrics	HSRC	StreamKM++	FDCM_SSR	FDPC
Network Intrusion	5%	Purity	95.04	90	93.17	93.9
	5%	$F$ -measure	96.93	74	89.53	88.79
	10%	Purity	94.95	88	92.34	93.54
	10%	$F$ -measure	96.45	72	88.05	80.85
	15%	Purity	94.4	86	91.68	92.15
	15%	$F$ -measure	96.2	71	86.78	64.84
	20%	Purity	94.44	86	91.54	90.04
	20%	$F$ -measure	96.4	70	85.88	56.23
Forest Cover	5%	Purity	74.69	54	68.58	71.03
	5%	$F$ -measure	43.61	35	41.14	37.96
	10%	Purity	73.18	53	66.24	68.81
	10%	$F$ -measure	39.27	34	38.16	30.48
	15%	Purity	71.96	51	65.35	66.14
	15%	$F$ -measure	38.65	32	33.5	33.35
	20%	Purity	69.96	50	65.52	65.18
	20%	$F$ -measure	37.68	29	33.71	30.88

Table IV shows the average clustering results produced for the four different noise levels. As expected, the performance of the algorithms slowly decreases as the percentage of noise increases. HSRC achieves consistently better clustering results than those of the other methods. For example, our method achieves average clustering purities of $95.04\%$ , $94.95\%$ , $94.4\%$ and $94.44\%$ on the Network Intrusion dataset. Regarding the $F$ -measure, HSRC significantly outperforms the competing methods across the different noise levels, which indicates that the numbers of macroclusters are much closer to the ground-truth numbers of clusters contained in the landmark windows. However, the clustering performance of the other competing algorithms dramatically decreases as the percentage of noise slowly increases. We also observe that HSRC still retains the same advantages on the Forest Cover dataset. For example, the clustering purity and $F$ -measure of HSRC are $74.69\%$ and $43.61\%$ , respectively under a noise percentage of 5%. HSRC consistently achieves better clustering performance than that of the competing approaches under the other three noise percentages (10%, 15%, and 20% ). On the one hand, this highlights the benefit of estimating the relationships among data objects via sparse representation. On the other hand, this finding demonstrates that the ${l_{2,1}}$ -norm effectively characterizes the noise term in Problem (4). These experimental results demonstrate the robustness of the proposed HSRC method under noisy circumstances.

The proposed outlier detection mechanism is evaluated on the Forest Cover dataset. We select the first 20 landmark windows, and each window contains 1,000 data objects. In each trial, two successive landmark windows are employed, and the experiment is repeated 10 times. Half of the data objects contained within these windows are segmented into clusters, whereas the remaining data objects are treated as potential outliers for testing purposes. The effectiveness of the outlier detection mechanism is then verified via the SRV from Eq. (9). For comparison, we employ the 1-NN method, which is a general outlier detection technique that calculates the Euclidean distance between a test data object and its nearest neighbor within a landmark window. The outlier detection errors are classified into two categories: valid data objects misidentified as outliers and actual outliers incorrectly classified as valid data. The average outlier error rate for HSRC is 4.21%, whereas it is 6.64% for 1-NN across the 10 trials. Compared with 1-NN, HSRC reduces the average outlier error rate by 2.43%.

IV-D Sensitivity Analysis

In Algorithms 1-4, HSRC has two main parameters: $\lambda$ and $m$ . We examine the sensitivity of different combinations of parameters $\lambda$ and $m$ . In particular, we set parameter $w=1$ for the Network Fusion and Forest Cover datasets. Empirically, parameter $\lambda$ is closely related to the noise level prior of the data stream. Hence, we usually set relatively large values for parameter $\lambda$ in HSRC if the given data stream is slightly contaminated by noise. In addition, parameter $m$ represents the number of microclusters. For convenience, parameter $m^{\prime}$ represents a multiple of the maximum number of classes contained in a data stream. Parameter $m^{\prime}$ ranges from 1 to 2 in the experiments.

Figs. 2 and 3 show the clustering performance achieved on the Network Fusion and Forest Cover datasets, respectively, in terms of the clustering purity and $F$ -measure values produced with different combinations of parameters $\lambda$ and $m^{\prime}$ . Specifically, Fig. 2 indicates that HSRC performs well across a large range of $\lambda$ values when $m^{\prime}=1$ . However, the $F$ -measure declines significantly as $m^{\prime}$ increases from 1 to 2. This is because the number of clusters rapidly decreases as the number of macroclusters gradually increases during the experiments. Similar cases can also be observed in Fig. 3. For example, the $F$ -measure dramatically decreases when $m^{\prime}$ gradually increases from 1 to 2 in Fig. 3. Therefore, relatively large ranges of $\lambda$ yield satisfactory clustering results when $m^{\prime}=1$ .

V Conclusion

In this paper, we propose an HSRC algorithm for clustering high-dimensional data streams. Unlike the existing clustering techniques that rely on Euclidean distance computations, HSRC utilizes an norm technique to capture the intrinsic structures of high-dimensional data objects. It automatically selects the appropriate number of neighboring data objects. This ensures that the highly correlated data objects of clusters are grouped together. HSRC merges microclusters into macroclusters by introducing the SSD. Moreover, fine-tuning is employed to refine the macroclusters. Data object representatives are selected from each macrocluster by sorting the SRVs of the data objects. These representatives are then employed as dictionary samples for the next landmark window. The proposed HSRC method significantly enhances the relationships among data objects, which improves the generalization of sparse representation in high-dimensional data stream clustering tasks. Furthermore, HSRC effectively estimates outlier candidates based on their SRVs. Our extensive experiments conducted on high-dimensional datasets demonstrate the superiority of the proposed HSRC method over several state-of-the-art approaches.

References

[1] J. Gama, R. Sebastião, and P. P. Rodrigues, “On evaluating stream learning algorithms,” Mach. Learn., vol. 90, no. 3, pp. 317–346, Mar. 2013.
[2] J. A. Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. P. L. F. de Carvalho, and J. Gama, “Data stream clustering: A survey,” ACM Comput. Surv., vol. 46, no. 1, pp. 1–31, october 2013.
[3] A. Amini, T. Y. Wah, and H. Saboohi, “On density-based data streams clustering algorithms: A survey,” J. Comput. Sci. Technol., vol. 29, no. 1, pp. 116–141, May 2014.
[4] H. L. Nguyen, Y. K. Woon, and W. K. Ng, “A survey on data stream clustering and classification,” Knowl. Inf. Syst., vol. 45, no. 3, pp. 535–569, october 2015.
[5] M. Hahsler and M. Bolaños, “Clustering data streams based on shared density between micro-clusters,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 6, pp. 1449–1461, Jun. 2016.
[6] M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult, “Monic: modeling and monitoring cluster transitions,” in in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining, Philadelphia, PA, USA, Aug. 2006, pp. 706–711.
[7] J. Shi, J. Malik, and S.Sastry, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. and Mach. Intell., vol. 22, no. 8, pp. 181–214, May 2000.
[8] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Proc. 16th Adv. Neural Inf. Process. Syst., Vancouver, British Columbia, Canada, Dec. 2002, pp. 585–591.
[9] Z. Huang, H. Zhu, J. T. Zhou, and X. Peng, “Multiple marginal fisher analysis,” IEEE Trans. Ind. Electron., vol. 1, no. 1, p. 1, Sept. 2018.
[10] S. Guha, A. Meyerson, R. M. N. Mishra, and O’Callaghan, “Clustering data streams: Theory and practice,” IEEE Trans. Knowl. Data Eng., vol. 15, no. 3, pp. 515–528, May 2003.
[11] L. Tu and Y. Chen, “Stream data clustering based on grid density and attraction,” ACM Trans. Knowl. Discov. Data., vol. 3, no. 3, pp. 1–27, Jul. 2009.
[12] A. Amini, T. Y. Wah, M. R. Saybani, and S. R. A. S. Yazdi, “A study of density-grid based clustering algorithms on data streams,” in 8th Int. Conf. Fuzzy Sys. Knowl. Discov., Haikou, Hainan, China, Jul. 2011, pp. 1652–1656.
[13] M. Shindler, A. Wong, and A. Meyerson, “Fast and accurate k-means for large datasets,” in Proc. 25th Adv. Neural Inf. Process. Syst., Granada, Spain, Dec. 2011, pp. 2375–2383.
[14] X. Zhang, C. Furtlehner, C. Germain-Renaud, and M. Sebag, “Data stream clustering with affinity propagation,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 7, pp. 1644–1656, Jul. 2014.
[15] R. Hyde, P. Angelov, and A. R. MacKenzie, “Fully online clustering of evolving data streams into arbitrarily shaped clusters,” Inf. Sci., vol. 382, no. 1, pp. 96–114, Mar. 2017.
[16] T. Zhang, R. Ramakrishnan, and M. Livny, “Birch: A new data clustering algorithm and its applications,” Data. Mining. Knowl. Discov., vol. 1, no. 2, pp. 2215–2228, Jun. 1997.
[17] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” in Proc. 29th Int. Conf. Very Large Data Bases, Berlin, Germany, Sept. 2003, pp. 81–92.
[18] F. Cao, M. Estert, W. Qian, and A. Zhou, “Density-based clustering over an evolving data stream with noise,” in Proc. SIAM Int. Conf. Data Mining, Bethesda, MD, USA, Apr. 2006, pp. 328–339.
[19] P. Kranen, I. Assent, C. Baldauf, and T. Seidl, “The clustree: indexing micro-clusters for anytime stream mining,” Knowl. Inf. Syst., vol. 29, no. 2, pp. 249–272, Nov. 2011.
[20] J. Chen, Z. Wang, S. Yang, and H. Mao, “Two-stage sparse representation clustering for dynamic data streams,” IEEE Trans. Cybern., vol. 53, no. 10, pp. 6408–6420, Oct. 2023.
[21] J. Chen, S. Yang, C. Fahy, Z. Wang, Y. Guo, and Y. Chen, “Online sparse representation clustering for evolving data streams,” IEEE Trans. Neural. Netw. Learn. Syst., pp. 1–15, Oct. 2023.
[22] B. A.Olshausen and D. J.Field, “Sparse coding with an overcomplete basis set: A strategy employed by v1?” Vis. Res., vol. 37, no. 23, pp. 3311–3326, Dec. 1997.
[23] R. Tibshiran, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B, vol. 58, no. 1, pp. 267–288, Jan. 1996.
[24] D. L. Donoho, “For most large underdetermined systems of linear equations the minimal ${l_{1}}$ ‐norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, vol. 59, no. 6, pp. 797–829, Mar. 2006.
[25] J. Chen, Y. Chen, Z. Wang, H. Zhang, and X. Peng, “Spectral embedding fusion for incomplete multiview clustering,” IEEE Trans. Image Process., pp. 4116–4130, Jul. 2024.
[26] J. Chen, H. Mao, H. Zhang, and Z. Yi, “Symmetric low-rank preserving projections for subspace learning,” Neurocomputing, vol. 315, no. 13, pp. 381–393, Nov. 2018.
[27] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006.
[28] X. Yuan, X. Liu, and S. Yan, “Visual classification with multi-task joint sparse representation,” IEEE Trans. Image Process., vol. 21, no. 10, pp. 4349–4360, Jun. 2012.
[29] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. R. Stat. Soc. Series B, vol. 58, no. 1, pp. 267–288, Jan. 1996.
[30] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM journal on computing, vol. 24, no. 2, pp. 227–234, october 1995.
[31] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images,” SIAM review, vol. 51, no. 1, pp. 34–81, Jul. 2009.
[32] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2009.
[33] K. Udommanetanakit, T. Rakthanmanon, and K. Waiyamai, “E-stream: Evolution-based technique for stream clustering,” in Int. Conf. Adv Data Mining Appl., Heidelberg, Berlin, Aug. 2007, pp. 605–615.
[34] S. Luhr and M. Lazarescu, “Incremental clustering of dynamic data streams using connectivity based representative points,” Data Knowl. Eng., vol. 68, no. 1, pp. 1–27, Jan. 2009.
[35] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low-rank representation,” in Proc. 25th Adv. Neural Inf. Process. Syst., Vancouver, British Columbia, Canada, Dec. 2011, pp. 612–620.
[36] J. F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM J. Optim., vol. 20, no. 4, pp. 1956–1982, Mar. 2010.
[37] G. Liu and Y. Y. Z. Lin, “Robust subspace segmentation by low-rank representation,” in in Proc. 27th Int. Conf. Mach. Learn., Haifa, Israel, Jun. 2010, pp. 1–8.
[38] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “MOA: massive online analysis,” J. Mach. Learn. Res., vol. 11, no. 1, pp. 1601–1604, May 2010.
[39] M. R. Ackermann, M. Märtens, C. Raupach, C. Lammersen, and C. Sohler, “Streamkm++: A clustering algorithm for data streams,” J. Exp. Algorithmics, vol. 17, no. 2, pp. 1–30, May 2012.
[40] J. Gu, L. Jiao, S. Yang, and F. Liu, “Fuzzy double $c$ -means clustering based on sparse self-representation,” IEEE Trans. Fuzzy Syst., vol. 26, no. 2, pp. 612–626, Apr. 2018.
[41] J. Hou, H. Lin, H. Yuan, and M. Pelillo, “Flexible density peak clustering for real-world data,” Pattern Recognit., vol. 156, p. 110772, Dec. 2024.
[42] D. Dua and C. Graff. (2017, Apr.) UCI machine learning repository. [Online]. Available: http://archive.ics.uci.edu/ml
[43] V. M. A. Souza, D. F. Silva, J. Gama, and G. E. A. P. A. Batista, “Data stream classification guided by clustering on nonstationary environments and extreme verification latency,” in Proc. SIAM Int. Conf. Data Mining (SDM), Vancouver, British Columbia, Canada, Apr. 2015, pp. 873–881.
[44] C. D. Manning, P. Raghavan, and H. Sch’́utze, Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press, 2008.

Jie Chen received the BSc degree in Software Engineering, MSc degree and PhD degree in Computer Science from Sichuan University, Chengdu, China, in 2005, 2008 and 2014, respectively. He is currently an Associate Professor in the College of Computer Science, Sichuan University, China. His current research interests include machine learning, big data analysis and deep neural networks.

Hua Mao received the B.S. degree and M.S. degree in Computer Science from University of Electronic Science and Technology of China (UESTC) in 2006 and 2009, respectively. She received her Ph.D. degree in Computer Science and Engineering from Aalborg University, Denmark in 2013. She is currently a Senior Lecturer in Department of Computer and Information Sciences, Northumbria University, U.K. Her current research interests include Deep Neural Networks and Big Data.

Yuanbiao Gou received the B.E. degree from the School of Software Engineering, Sichuan University, Chengdu, China, in 2019. He is currently pursuing the Ph.D. degree with the School of Computer Science, Sichuan University, under the supervision of Prof. Xi Peng. His current research interests include image processing and computer vision.

Xi Peng (Member, IEEE) is currently a Full Professor with the College of Computer Science, Sichuan University. His current research interest includes machine intelligence and has authored more than 50 articles in these areas. He has served as an Associate Editor/Guest Editor for six journals, including the IEEE Transactions on SMC: Systems and IEEE Transactions on Neural Network And Learning Systems and the Area Chair/Senior Program Committee Member for the conferences such as IJCAI, AAAI, and ICME.