
Hierarchical Sparse Representation Clustering for High-Dimensional Data Streams

Jie Chen,   Hua Mao, Yuanbiao Gou, and Xi Peng Manuscript received . (Corresponding author: Xi Peng.)Jie Chen, Yuanbiao Gou and Xi Peng are with the College of Computer Science, Sichuan University, Chengdu 610065, China (E-mail: [email protected]; [email protected]; [email protected]).Hua Mao is with the Department of Computer and Information Sciences, Northumbria University, Newcastle, NE1 8ST, U. K. (E-mail: [email protected]).
Abstract

Data stream clustering reveals patterns within continuously arriving, potentially unbounded data sequences. Although numerous clustering algorithms have been proposed for data streams, they still face significant challenges when addressing high-dimensional data streams. First, it is intractable to measure the similarities among high-dimensional data objects via Euclidean distances when constructing and merging microclusters. Second, these algorithms are highly sensitive to the noise contained in high-dimensional data streams. In this paper, we propose a hierarchical sparse representation clustering (HSRC) method for clustering high-dimensional data streams. HSRC first employs an $l_1$-minimization technique to learn an affinity matrix for data objects in individual landmark windows with fixed sizes, where the number of neighboring data objects is automatically selected. This approach ensures that highly correlated data samples within clusters are grouped together. Then, HSRC applies a spectral clustering technique to the affinity matrix to generate microclusters. These microclusters are subsequently merged into macroclusters based on their sparse similarity degrees (SSDs). Additionally, HSRC introduces sparsity residual values (SRVs) to adaptively select representative data objects from the current landmark window. These representatives serve as dictionary samples for the next landmark window. Finally, HSRC refines each macrocluster through fine-tuning. In particular, HSRC enables the detection of outliers in high-dimensional data streams via the associated SRVs. The experimental results obtained on several benchmark datasets demonstrate the effectiveness and robustness of HSRC.

Index Terms:
High-dimensional data stream, clustering, sparse representation, outlier detection

I Introduction

Data stream clustering techniques have attracted attention from data mining and machine learning researchers [1, 2, 3, 4, 5]. They have a significant impact on the development of data stream applications [6], such as network media transmission, sensor data transfer, and computer network traffic monitoring. These applications involve data streams that are potentially infinite, yet only a limited amount of memory is available for computation. Extracting potentially valuable knowledge from massive data streams naturally leads to the challenging problem of data stream clustering. As an effective tool for data stream mining, data stream clustering aims to divide the given data objects into a finite number of classes and identify potential outliers.

Traditional clustering algorithms, e.g., spectral clustering [7, 8], statistical learning-based methods and their variants [9], often achieve impressive performance on static and stable datasets. A data stream, however, consists of massive, unbounded sequences of data objects, and its probability distribution changes over time in an unexpected way. This change can be gradual, which is known as concept drift, or sudden, which is referred to as concept shift. As a result, once entirely new classes appear in the stream, traditional classifiers become expensive and time-consuming to maintain, and prediction models need to be reconstructed and retrained from scratch on the newly collected data.

A variety of clustering methods have been proposed for data streams [10, 11, 12, 13, 14, 15]. For example, an efficient and scalable data clustering algorithm, called balanced iterative reducing and clustering using hierarchies (BIRCH), was proposed for conducting exploratory analyses on very large datasets [16]. BIRCH constructs a height-balanced tree via a clustering feature (CF) vector to compute cluster measures, such as the clustering mean, radius and diameter. The Euclidean distance between a new object and each centroid of the CF entries is calculated and later compared with a given threshold. The CF vector introduced by BIRCH has been extensively employed by different algorithms [17, 18, 19]. Specifically, the concept of the CF vector was extended to microclusters in the CluStream algorithm [17], which includes two stages: online microclustering and offline macroclustering. The density-based data stream clustering (DenStream) algorithm [18] introduces the structures of potential core microclusters and outlier microclusters via their CFs for incremental computation purposes. Additionally, the ClusTree algorithm uses a hierarchical indexing structure to maintain CFs and further builds a microcluster hierarchy at different granularity levels [19].

Data stream clustering algorithms utilize CF vectors to measure the similarity among data objects, which is intrinsically tied to the construction and merging operations involving appropriate microclusters. However, these approaches still face significant challenges. First, a new data object is merged into its nearest microcluster if the new radius (determined based on the associated CFs) is smaller than a predetermined threshold. Selecting an appropriate maximum boundary for this radius is a challenging problem. Second, these algorithms often rely on local similarity measures, typically the Euclidean distances between data objects, to maintain local neighborhood information. However, noise is pervasive in high-dimensional data streams. The use of the Euclidean distance measure for evaluating the relationships among data objects makes the similarity computation highly sensitive to noise. Clustering inherently assumes that highly correlated data objects within the same cluster share similar structural characteristics. Therefore, measuring the similarity of the data objects contained in high-dimensional data streams remains a significant challenge.

Recently, sparse representation has received significant interest across various fields [20, 21, 22, 23, 24], including image classification [25, 26], image denoising [27], and visual tracking [28]. It offers a statistical model for identifying a set of dictionary elements from the same class for each data object. Ideally, sparse representation groups highly correlated data objects belonging to the same class, with the nonzero elements representing the weights assigned to different pairs of data objects. As a result, sparse representation has demonstrated the ability to mitigate the aforementioned challenges by leveraging the self-similarity properties of data objects contained in high-dimensional data streams.

In this paper, we propose a hierarchical sparse representation clustering (HSRC) method for clustering high-dimensional data streams. HSRC first incorporates a sparse constraint into the self-expressiveness property of data objects in high-dimensional data streams. It aims to learn an affinity matrix that can capture the relationships among different data objects. For example, given a set of data objects in a landmark window, each object is represented as a linear combination of the other objects, and the coefficient matrix is sparse. In particular, we integrate adaptive dictionary learning into the sparse representation process to improve its generalization in data stream clustering scenarios. Then, HSRC partitions the affinity matrix into microclusters via a spectral clustering technique. The sparse linear representation of each data object includes more objects from the same cluster, indicating that the affinity matrix encodes the relationships within microclusters. Unlike density-based clustering techniques, HSRC does not rely on Euclidean distances to measure relationships among data objects. Instead, two microclusters are merged into a new macrocluster if the sparse similarity degree (SSD) of the merged cluster is higher than that between the new microcluster and any other microcluster. Finally, the clustering labels of the data objects contained within the macroclusters are fine-tuned to enhance the performance of the data stream clustering process. Additionally, HSRC includes an outlier detection mechanism for identifying potential outliers.

The contributions of this paper can be summarized as follows.

  1. We present an HSRC model that measures the relationships among data objects via sparse representation.

  2. A sparsity residual value (SRV) is introduced for conducting adaptive dictionary sample selection and outlier identification, which can improve the generalization of the sparse representation process in high-dimensional data stream clustering tasks.

  3. The SSDs among microclusters are introduced to construct macroclusters, which together form a hierarchical structure for performing high-dimensional data stream clustering.

  4. Extensive experimental results obtained on benchmark datasets demonstrate the effectiveness and robustness of HSRC in high-dimensional data stream clustering scenarios.

The remainder of this paper is organized as follows. The related work on sparse representation and data stream clustering is summarized in Section II. Section III describes the HSRC method in detail. Extensive experimental results obtained on several benchmark datasets are presented in Section IV. Finally, conclusions are given in Section V.

II Related Work

In this section, we present a brief overview of the related work on sparse representation and data stream clustering.

II-A Sparse Representation Theory

Let $\mathbf{D}=[\mathbf{d}_1,\mathbf{d}_2,...,\mathbf{d}_n]\in\mathbf{R}^{d\times n}$ be a dictionary consisting of $n$ vectors. Given a signal $\mathbf{x}\in\mathbf{R}^{d}$, its sparsest representation over the dictionary $\mathbf{D}$ can be approximately obtained as a sparse linear combination of only a few columns of $\mathbf{D}$, i.e.,

$\min\limits_{\mathbf{z}}\left\|\mathbf{x}-\mathbf{D}\mathbf{z}\right\|_{2}+\lambda\left\|\mathbf{z}\right\|_{0}$, (1)

where $\mathbf{z}$ represents a sparse coefficient vector, $\lambda>0$ is a tradeoff parameter, and $\left\|\cdot\right\|_{0}$ denotes the number of nonzero elements in a vector.

Since Problem (1) is nonconvex and nondeterministic polynomial-time (NP)-hard, it is often relaxed to the following $l_1$-norm minimization problem as a common surrogate for the sparse representation task, i.e.,

$\min\limits_{\mathbf{z}}\left\|\mathbf{x}-\mathbf{D}\mathbf{z}\right\|_{2}+\lambda\left\|\mathbf{z}\right\|_{1}$, (2)

where $\left\|\cdot\right\|_{1}$ denotes the $l_1$-norm of a vector. The above optimization problem can be solved via various convex optimization methods, such as iterative thresholding algorithms [29, 24].
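As a concrete illustration of the iterative thresholding idea, the following minimal Python sketch applies ISTA to the standard squared-loss form of Problem (2), $\min_{\mathbf{z}}\frac{1}{2}\|\mathbf{x}-\mathbf{D}\mathbf{z}\|_2^2+\lambda\|\mathbf{z}\|_1$; the toy dictionary, step size, iteration count and $\lambda$ are illustrative assumptions rather than settings taken from this paper.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise soft-thresholding, the proximal operator of the l1-norm."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(D, x, lam=0.05, n_iter=200):
    """Iterative soft-thresholding for min_z 0.5*||x - D z||_2^2 + lam*||z||_1."""
    n = D.shape[1]
    z = np.zeros(n)
    # Step size 1/L, where L bounds the Lipschitz constant of the smooth term's gradient.
    L = np.linalg.norm(D, 2) ** 2
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)                 # gradient of the quadratic data-fitting term
        z = soft_threshold(z - grad / L, lam / L)
    return z

# Toy usage: a random dictionary and a signal built from two of its columns.
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 50))
x = 1.5 * D[:, 3] - 0.8 * D[:, 10]
z = ista(D, x)
print(np.flatnonzero(np.abs(z) > 1e-3))          # indices of the recovered active atoms
```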

Inspired by the advances achieved with respect to $l_0$-norm and $l_1$-norm techniques, sparse representation has been widely used in various areas of machine learning and pattern recognition [30, 31]. For example, Aharon et al. [32] proposed a $K$-singular value decomposition (SVD) algorithm for designing overcomplete dictionaries for sparse representation purposes. The $K$-SVD algorithm can effectively seek the best representation of a given set as a dictionary under strict sparsity constraints. Hence, sparse representation has been proven to be an extremely successful technique for exploiting the intrinsic structures of high-dimensional data.

II-B Data Stream Clustering Techniques

A data stream is a massive, potentially unbounded sequence of data objects which are continuously generated over time. Most existing data stream clustering algorithms can be broadly categorized into three main types: hierarchy-based clustering, density-based clustering and partitioning-based clustering.

Hierarchy-based clustering algorithms first use the data objects to construct a tree, in which each leaf represents a data object. These algorithms then group data objects into the corresponding clusters using agglomerative or divisive hierarchical decomposition. In the agglomerative approach, $n$ objects are merged into more general classes in a bottom-up fashion, whereas in divisive techniques, $n$ objects are divided into smaller clusters in a top-down fashion. BIRCH is a typical hierarchical clustering algorithm that constructs a height-balanced tree using the CF vector [16]. BIRCH adopts the CF vector to compute cluster measures such as the mean, radius, and diameter. Finally, the Euclidean distances between new objects and each centroid of the CF entries are calculated and compared against a threshold to determine whether a new object belongs to a new class. The BIRCH algorithm has $O(n)$ complexity and obtains good clustering results through only a single scan. However, it does not work well with data of arbitrary shape. The E-Stream algorithm is an evolution-based approach for data stream clustering that supports five evolutions of the data, namely appearance, disappearance, self-evolution, merging, and splitting [33, 34].

Density-based clustering algorithms define clusters as high-density regions of the feature space, which enables the detection of clusters with arbitrary shapes. DenStream, an extended version of the DBSCAN algorithm, discovers clusters of arbitrary shape in an evolving data stream [18]. It introduces a core microcluster, a potential core microcluster, and an outlier microcluster using the CFs developed for BIRCH. The core microcluster is adopted to summarize the clusters with an arbitrary shape, while the potential core microcluster and outlier microcluster structures are used to pursue and distinguish potential clusters and outliers. DenStream involves numerous time vector calculations in the offline clustering phase, which leads to a high computational cost. The D-Stream algorithm automatically and dynamically adjusts the clusters without requiring user specification of the target time horizon or the number of clusters [11]. An online component maps each input data record into a grid, while an offline component computes the grid density and clusters the grids based on the density. However, D-Stream is incapable of processing very-high-dimensional data.

Partitioning-based clustering methods use traditional soft partitioning techniques such as $k$-median and $k$-means clustering to deal with data streams. CluStream is a two-phase clustering method that performs online micro-clustering and offline macro-clustering [17]. The first phase acquires summary statistics from the data stream by extending the concept of the CF vector in the online micro-clustering component. Each microcluster contains five components, three of which are the regular components of the CF vector. The second phase uses these statistics and other inputs to create clusters, where the $k$-means algorithm is embedded in the offline macro-clustering component. However, CluStream is not efficient when applied to high-dimensional data streams; HPStream was developed as an extension of CluStream that performs projected clustering for high-dimensional streaming data.

III Hierarchical Sparse Representation Clustering

In this section, we introduce the HSRC method for clustering high-dimensional data streams. Given the potentially infinite nature of data streams, HSRC employs landmark windows, thereby sequentially processing fixed-size, nonoverlapping chunks of data objects. The proposed HSRC model contains three critical processes, including the creation of microclusters, the merging of microclusters into macroclusters and the fine-tuning of macroclusters, which together form a hierarchical structure. Throughout these processes, the data object representatives are repeatedly utilized across three successive HSRC processes. Therefore, the proposed HSRC method can effectively capture the intrinsic structures of the high-dimensional data objects contained within data streams.

III-A Evaluating the Relationships Among Data Objects via Sparse Representation

Let $\mathbf{X}=[\mathbf{x}_1,\mathbf{x}_2,...,\mathbf{x}_n]\in\mathbf{R}^{d\times n}$ be a set of data objects included in a landmark window. The rationale behind clustering algorithms is that data objects within a cluster are more similar to each other than they are to objects belonging to a different cluster. Given a data object $\mathbf{x}_i$ ($1\leq i\leq n$), HSRC uses the following $l_1$-norm minimization problem for sparse representation:

$\mathop{\arg\min}\limits_{\mathbf{z}_i}\left\|\mathbf{z}_i\right\|_{1}\quad s.t.\quad\left\|\mathbf{x}_i-\mathbf{X}\mathbf{z}_i\right\|_{2}^{2}\leq\varepsilon$ (3)

where $\varepsilon$ represents a noise tolerance. The above optimization problem can be solved in polynomial time via standard linear programming algorithms [29, 24]. The coefficient matrix $\mathbf{Z}$ is constructed from the vectors of sparse coefficients, i.e., $\mathbf{Z}=[\mathbf{z}_1,\mathbf{z}_2,...,\mathbf{z}_n]$, and $\mathbf{Z}^{*}=\left|\mathbf{Z}\right|+\left|\mathbf{Z}^{T}\right|$. Each element $z_{ij}^{*}$ in $\mathbf{Z}^{*}$ can be used to measure the similarity between two data objects $\mathbf{x}_i$ and $\mathbf{x}_j$. If the data objects $\mathbf{x}_i$ and $\mathbf{x}_j$ are close in terms of the intrinsic geometries of their data distributions, their representations $\mathbf{z}_i$ and $\mathbf{z}_j$ should also be close with respect to the same basis $\mathbf{X}$. Hence, sparse representation is utilized to evaluate the relationships among the data objects contained in high-dimensional data streams.

The solution to Problem (3) is individually obtained for each data object. This inevitably results in high computational costs. However, clustering high-dimensional data streams requires data objects to be continuously clustered within a limited period. To increase the efficiency of optimizing individual sparse representations, we use a sparse batch representation optimization problem as a good surrogate for Problem (3). Specifically, we formulate the following convex optimization problem to find a sparse representation 𝐙\mathbf{Z}:

$\min\limits_{\mathbf{Z},\mathbf{E}}\left\|\mathbf{Z}\right\|_{1}+\lambda\left\|\mathbf{E}\right\|_{l}\quad s.t.\quad\mathbf{X}=\mathbf{X}\mathbf{Z}+\mathbf{E}$, (4)

where $\mathbf{E}\in\mathbf{R}^{d\times n}$ is a noise term and $\lambda$ is a scalar constant.

Algorithm 1 Solving Problem (5) via an inexact ALM framework
Input: a data matrix $\mathbf{X}=[\mathbf{x}_1,\mathbf{x}_2,...,\mathbf{x}_n]\in\mathbf{R}^{d\times n}$, a parameter $\lambda>0$
Initialize: $\mathbf{Z}=\mathbf{J}=0$, $\mathbf{E}=0$, $\mathbf{Y}_1=\mathbf{Y}_2=0$, $\mu=10^{-2}$, $\mu_{\max}=10^{10}$, $\rho=1.1$, $\varepsilon=10^{-6}$
1: while not converged do
2:    update the variables via Eq. (8);
3:    update the multipliers: $\mathbf{Y}_1\leftarrow\mathbf{Y}_1+\mu(\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E})$; $\mathbf{Y}_2\leftarrow\mathbf{Y}_2+\mu(\mathbf{Z}-\mathbf{J})$;
4:    update the parameter $\mu$: $\mu\leftarrow\min(\rho\mu,\mu_{\max})$;
5:    check the convergence conditions: $\left\|\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E}\right\|_{\max}<\varepsilon$ and $\left\|\mathbf{Z}-\mathbf{J}\right\|_{\max}<\varepsilon$;
6: end while
Output: $(\mathbf{Z},\mathbf{E})$

We first convert Problem (4) to the following equivalent problem by introducing an auxiliary variable $\mathbf{J}$:

$\min\limits_{\mathbf{Z},\mathbf{J},\mathbf{E}}\left\|\mathbf{J}\right\|_{1}+\lambda\left\|\mathbf{E}\right\|_{l}\quad s.t.\quad\mathbf{X}=\mathbf{X}\mathbf{Z}+\mathbf{E},\ \mathbf{Z}=\mathbf{J}$. (5)

The augmented Lagrangian function of Problem (5) is

$\min\limits_{\mathbf{Z},\mathbf{E},\mathbf{J},\mathbf{Y}_1,\mathbf{Y}_2}\left\|\mathbf{J}\right\|_{1}+\lambda\left\|\mathbf{E}\right\|_{l}+tr\left(\mathbf{Y}_1^{T}\left(\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E}\right)\right)+tr\left(\mathbf{Y}_2^{T}\left(\mathbf{Z}-\mathbf{J}\right)\right)+\frac{\mu}{2}\left(\left\|\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E}\right\|_{F}^{2}+\left\|\mathbf{Z}-\mathbf{J}\right\|_{F}^{2}\right)$, (6)

where $\mathbf{Y}_1$ and $\mathbf{Y}_2$ are Lagrange multipliers, and $\mu>0$ is a penalty parameter. The above optimization problem can be reformulated as follows:

$\min\limits_{\mathbf{Z},\mathbf{E},\mathbf{J},\mathbf{Y}_1,\mathbf{Y}_2}\left\|\mathbf{J}\right\|_{1}+\lambda\left\|\mathbf{E}\right\|_{l}+\frac{\mu}{2}\left(\left\|\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E}+\frac{\mathbf{Y}_1}{\mu}\right\|_{F}^{2}+\left\|\mathbf{Z}-\mathbf{J}+\frac{\mathbf{Y}_2}{\mu}\right\|_{F}^{2}\right)$. (7)

This formulation can be effectively solved within the inexact augmented Lagrange multiplier (ALM) framework [35]. The variables $\mathbf{J}$, $\mathbf{Z}$ and $\mathbf{E}$ are updated alternately at each step, with the other two variables fixed. The updating schemes for the $(k+1)$th iteration are as follows:

$\mathbf{J}\leftarrow\mathop{\min}\limits_{\mathbf{J}}\frac{1}{\mu}\left\|\mathbf{J}\right\|_{1}+\frac{1}{2}\left\|\mathbf{J}-\left(\mathbf{Z}+\frac{\mathbf{Y}_2}{\mu}\right)\right\|_{F}^{2}$,
$\mathbf{Z}\leftarrow\left(\mathbf{I}+\mathbf{X}^{T}\mathbf{X}\right)^{-1}\left(\mathbf{X}^{T}\mathbf{X}-\mathbf{X}^{T}\mathbf{E}+\mathbf{J}+\frac{\mathbf{X}^{T}\mathbf{Y}_1-\mathbf{Y}_2}{\mu}\right)$,
$\mathbf{E}\leftarrow\mathop{\min}\limits_{\mathbf{E}}\lambda\left\|\mathbf{E}\right\|_{l}+\frac{\mu}{2}\left\|\mathbf{X}-\mathbf{X}\mathbf{Z}-\mathbf{E}+\frac{\mathbf{Y}_1}{\mu}\right\|_{F}^{2}$. (8)
Algorithm 2 Creating microclusters
Input: a data matrix $\mathbf{X}=[\mathbf{x}_1,\mathbf{x}_2,...,\mathbf{x}_n]\in\mathbf{R}^{d\times n}$; parameters $\lambda>0$, $m>0$ and $\sigma\in(0,1)$
1: Solve Problem (5) via Algorithm 1 and obtain the optimal solution $(\mathbf{Z},\mathbf{E})$.
2: Compute the affinity matrix $\mathbf{Z}^{*}=\left|\mathbf{Z}\right|+\left|\mathbf{Z}^{T}\right|$.
3: Apply NCuts on $\mathbf{Z}^{*}$ to obtain $m$ microclusters.
4: For a new data object $\mathbf{x}$, compute $SRV(\mathbf{x})$ using Eq. (9);
5: if $SRV(\mathbf{x})\geq\sigma$ then
6:    $\mathbf{x}$ is considered an outlier.
7: end if
Output: The $m$ microclusters.

The first equation in Problem (8) is a convex problem that can be solved via the singular value thresholding (SVT) operator [36]. The solution of the third equation in Problem (8) is closely related to the choice of the $l$-norm, which corresponds to a certain regularization strategy for characterizing various types of noise. For example, the $l_{2,1}$-norm encourages the columns of the matrix to be zero, in which case the last equation has a closed-form solution [37]. The complete procedure for solving Problem (5) is outlined in Algorithm 1.
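The following minimal Python sketch mirrors Algorithm 1 under the assumption that the $l_{2,1}$-norm is chosen for $\mathbf{E}$, so that the $\mathbf{E}$-update reduces to column-wise shrinkage [37]; the initialization follows Algorithm 1, while the default $\lambda$ and the iteration cap are illustrative additions.

```python
import numpy as np

def solve_l1(W, tau):
    """Proximal operator of tau*||.||_1 (element-wise soft thresholding)."""
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

def solve_l21(Q, tau):
    """Proximal operator of tau*||.||_{2,1}: shrink each column of Q toward zero."""
    norms = np.linalg.norm(Q, axis=0)
    scale = np.maximum(norms - tau, 0.0) / np.maximum(norms, 1e-12)
    return Q * scale

def sparse_batch_representation(X, lam=0.1, mu=1e-2, mu_max=1e10, rho=1.1,
                                eps=1e-6, max_iter=500):
    """Inexact ALM sketch for min ||Z||_1 + lam*||E||_{2,1} s.t. X = XZ + E."""
    d, n = X.shape
    Z = np.zeros((n, n)); J = np.zeros((n, n)); E = np.zeros((d, n))
    Y1 = np.zeros((d, n)); Y2 = np.zeros((n, n))
    XtX = X.T @ X
    for _ in range(max_iter):
        # Update J, Z and E alternately (Eq. (8)), keeping the other variables fixed.
        J = solve_l1(Z + Y2 / mu, 1.0 / mu)
        Z = np.linalg.solve(np.eye(n) + XtX,
                            XtX - X.T @ E + J + (X.T @ Y1 - Y2) / mu)
        E = solve_l21(X - X @ Z + Y1 / mu, lam / mu)
        # Update the multipliers and the penalty parameter, then check convergence.
        R1 = X - X @ Z - E
        R2 = Z - J
        Y1 += mu * R1
        Y2 += mu * R2
        mu = min(rho * mu, mu_max)
        if max(np.abs(R1).max(), np.abs(R2).max()) < eps:
            break
    return Z, E
```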

III-B Evaluating the Importance of Data Objects via Sparsity Residual Values

HSRC uses the learned affinity matrix to identify clusters as groups of microclusters through a spectral clustering algorithm, such as NCuts [7]. For simplicity, we assume that the $n$ data objects in a landmark window are segmented into $m$ microclusters, i.e., $\mathbf{X}=[\mathbf{X}_1,\mathbf{X}_2,...,\mathbf{X}_m]$, according to Algorithm 2. For example, the $i$th microcluster contains $n_i$ data objects, i.e., $\mathbf{X}_i=[\mathbf{X}_{i1},\mathbf{X}_{i2},...,\mathbf{X}_{in_i}]$. Each microcluster consists of a set of neighboring objects, where each data object within the microcluster can be linearly represented by the other objects in the cluster. The microclusters are identified in a single pass of the landmark window, and their summary statistics are stored offline.
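A brief sketch of this step is given below: the symmetrized affinity $\mathbf{Z}^{*}=|\mathbf{Z}|+|\mathbf{Z}^{T}|$ is fed to an off-the-shelf spectral clustering routine, which is used here as a stand-in for the NCuts step of Algorithm 2 rather than as the authors' implementation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def create_microclusters(Z, m):
    """Build the symmetric affinity Z* = |Z| + |Z^T| and cut it into m microclusters."""
    Z_star = np.abs(Z) + np.abs(Z.T)
    labels = SpectralClustering(n_clusters=m, affinity="precomputed",
                                assign_labels="kmeans",
                                random_state=0).fit_predict(Z_star)
    return Z_star, labels

# Usage: Z comes from the ALM sketch above; microcluster i is X[:, labels == i].
```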

Definition 1

Sparsity residual value (SRV): The SRV of a data object $\mathbf{x}$ is defined as

$SRV(\mathbf{x})=\frac{1}{\left\|\mathbf{e}\right\|_{0}}\sum\limits_{j=1}^{n_i}\frac{\left|e_j\right|}{\left\|\mathbf{e}\right\|_{2}}$ (9)

where $\mathbf{e}\in\mathbf{R}^{n_i}$ denotes the corresponding noise term in the sparse representation of the object and $n_i$ is the number of samples contained in the $i$th cluster.

We define SRVs to measure the importance levels of the data objects included in clusters. The lower the SRV of a data object is, the more significant the object is during the sparse representation process. Without loss of generality, the sparse coefficients of a data object $\mathbf{x}$ are represented by a vector $\mathbf{z}$, and $\mathbf{e}$ denotes the corresponding noise term in its sparse representation. Given a solution $\mathbf{Z}$ for the data objects $\mathbf{X}$ obtained by Algorithm 1, the representative data objects can be adaptively selected from the candidates by sorting their SRVs. The size of the set of representative data objects is kept consistent with that of the landmark window. The dictionary samples in the adaptive dictionary selection scheme consist of two parts: the data objects contained in the current landmark window and the representative data objects chosen from the previous landmark window.
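A hedged sketch of SRV-based selection follows: each object's noise vector is taken to be the corresponding column of $\mathbf{E}$ from Algorithm 1 (an assumption made for illustration), objects with the lowest SRVs in each cluster are kept as representatives, and objects with $SRV\geq\sigma$ are flagged as outliers following Eq. (10); the per-cluster budget n_rep, the zero tolerance and $\sigma$ are illustrative values.

```python
import numpy as np

def srv(e, tol=1e-8):
    """Sparsity residual value of Eq. (9): ||e||_1 / (||e||_0 * ||e||_2)."""
    nnz = np.count_nonzero(np.abs(e) > tol)
    l2 = np.linalg.norm(e)
    if nnz == 0 or l2 < tol:
        return 0.0
    return np.abs(e).sum() / (nnz * l2)

def select_representatives(X, E, labels, n_rep=10, sigma=0.5):
    """Rank objects per cluster by SRV: low SRV -> representative, SRV >= sigma -> outlier."""
    reps, outliers = [], []
    srvs = np.array([srv(E[:, i]) for i in range(X.shape[1])])
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        order = idx[np.argsort(srvs[idx])]            # ascending SRV within the cluster
        reps.extend(order[:n_rep].tolist())            # the most significant objects
        outliers.extend(idx[srvs[idx] >= sigma].tolist())
    return X[:, reps], outliers
```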

Algorithm 3 Merging microclusters into macroclusters
Input: microclusters $\mathbf{X}=[\mathbf{X}_1,\mathbf{X}_2,...,\mathbf{X}_m]$ and an affinity matrix $\mathbf{Z}^{*}$
Initialize: $\mathbf{S}\in\mathbf{R}^{m\times m}=0$
1: while the clusters in $\mathbf{X}$ have changed do
2:    for each cluster $i$ among the clusters do
3:       $s\leftarrow$ the number of clusters in $\mathbf{X}$;
4:       for each cluster $j$ among the clusters with $i\neq j$ do
5:          Compute $S(\mathbf{X}_i,\mathbf{X}_j)$ and $S(\mathbf{X}_{i+j},\mathbf{X}_p)$ using Eqs. (11) and (12), respectively, where $p\in[1,s]$, $p\neq i$ and $p\neq j$;
6:          if $S(\mathbf{X}_i,\mathbf{X}_j)\leq S(\mathbf{X}_{i+j},\mathbf{X}_p)$ then
7:             $\mathbf{S}[i,j]=1$;
8:          end if
9:       end for
10:      $mergedSet=[\,]$;
11:      for $i=1:s$ and $i\notin mergedSet$ do
12:         $clusterRows=find(\mathbf{S}(i,:)==1)$;
13:         $clusterCols=find(\mathbf{S}(:,i)==1)$;
14:         $clusterSet=clusterRows\cap clusterCols$;
15:         if $\sim isempty(clusterSet)$ then
16:            Merge cluster $i$ and all clusters in $clusterSet$;
17:            Add $i$ and all items of $clusterSet$ to $mergedSet$;
18:         end if
19:      end for
20:   end for
21: end while
Output: The clusters in $\mathbf{X}$.

In practice, outliers often arise due to various factors, such as data collection, storage, or transmission errors. Outliers are data objects that deviate from the sparse representation-based model. To identify outliers throughout high-dimensional data streams, we set a threshold $\sigma\in(0,1)$ against which each data object is evaluated. For example, we consider a data object an outlier if it satisfies the following condition:

$SRV(\mathbf{x})\geq\sigma$. (10)

HSRC is able to adaptively identify outliers in high-dimensional data streams. Algorithm 2 summarizes the complete microclustering algorithm of HSRC.

Algorithm 4 Fine-tuning macroclusters
Input: $s$ macroclusters $\mathbf{X}=[\mathbf{X}_1,\mathbf{X}_2,...,\mathbf{X}_s]\in\mathbf{R}^{d\times n}$, a parameter $\sigma\in(0,1)$
1: for each cluster $i$ in the macroclusters do
2:    for each data object $\mathbf{x}$ in the $i$th cluster do
3:       for each cluster $j$ in the macroclusters do
4:          Compute $error_j$ using Eq. (14);
5:       end for
6:       Construct the error set $errorSet=[error_1,error_2,...,error_s]$ and determine the clustering index $j$ of its minimum value;
7:       if $i\neq j$ then
8:          Remove $\mathbf{x}$ from the $i$th macrocluster and drop it into the $j$th macrocluster;
9:       end if
10:   end for
11: end for
Output: The final clusters.

III-C Merging Microclusters into Macroclusters

As microclusters are rough, they need to be merged into macroclusters. Considering the $i$th microcluster, we first choose a candidate from another microcluster as the merging target. Utilizing only the sparse coefficients associated with the $j$th microcluster $\mathbf{X}_j$, the sparse representation of the $l$th data object $\mathbf{x}_l$ contained in the $i$th microcluster can be written as follows:

$\mathbf{x}_l=\mathbf{X}_j\mathbf{z}_j+\mathbf{e}_l$, (11)

where $i,j\in[1,m]$, $l\in[1,n_i]$ and $i\neq j$. The term $\mathbf{z}_j\in\mathbf{R}^{n_j}$ is a sparse coefficient vector whose entries are those associated with $\mathbf{X}_j$. The residual term $\mathbf{e}_l$ can be used to measure the similarity between $\mathbf{x}_l$ and $\mathbf{X}_j$. We adopt this residual to define the SSD concept, which is used to measure the similarity between two microclusters in the microcluster merging operation.

Definition 2

SSD: The SSD between cluster $i$ and cluster $j$ is the sum of the residuals of the data objects contained in the $i$th microcluster $\mathbf{X}_i$ that are associated with the $j$th microcluster $\mathbf{X}_j$, which can be defined as

$S(\mathbf{X}_i,\mathbf{X}_j)=\sum\limits_{l=1}^{n_i}\left\|\mathbf{e}_l\right\|_{2}$, (12)

where $i\neq j$ and $i,j\in[1,m]$.

We further calculate the SSDs of cluster $i$ with respect to the other clusters. A candidate cluster $j$ is then obtained by choosing the lowest residual-based SSD. We next check the merging condition to determine whether cluster $i$ and cluster $j$ should be merged. We assume that a new cluster $\mathbf{X}_{i+j}$ is obtained after clusters $i$ and $j$ are merged. According to graph theory, the merging condition is that the SSD between clusters $i$ and $j$ must not exceed the SSD between the merged cluster $\mathbf{X}_{i+j}$ and any other cluster $\mathbf{X}_p$; i.e., the following two conditions must hold:

$S(\mathbf{X}_i,\mathbf{X}_j)\leq S(\mathbf{X}_{i+j},\mathbf{X}_p)$,
$S(\mathbf{X}_j,\mathbf{X}_i)\leq S(\mathbf{X}_{i+j},\mathbf{X}_p)$, (13)

where $p\neq i$ and $p\neq j$. If the above conditions hold, the two microclusters are merged into a relatively large cluster. The merging operation stops when no two microclusters can be further merged. For simplicity, we define a new matrix $\mathbf{S}$ with a size of $m\times m$ to record the merging flag for each pair of the $m$ microclusters. For instance, $S_{ij}=1$ denotes that cluster $i$ can be merged with cluster $j$. Algorithm 3 summarizes the complete microcluster merging procedure.
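The sketch below computes the SSD of Eq. (12) and the merging test of Eq. (13), under the assumption that the coefficient vector $\mathbf{z}_j$ in Eq. (11) is read off from the rows of the global matrix $\mathbf{Z}$ associated with microcluster $j$ and that cluster labels are non-negative integers; it is one interpretation of the procedure, not the reference code.

```python
import numpy as np

def ssd(X, Z, labels, i, j):
    """SSD of Eq. (12): sum of residual norms of cluster i represented by cluster j only."""
    idx_i = np.flatnonzero(labels == i)
    idx_j = np.flatnonzero(labels == j)
    # Keep only the coefficients of each object in cluster i that refer to cluster j.
    residuals = X[:, idx_i] - X[:, idx_j] @ Z[np.ix_(idx_j, idx_i)]
    return np.linalg.norm(residuals, axis=0).sum()

def should_merge(X, Z, labels, i, j):
    """Check the merging condition of Eq. (13) against every other microcluster p."""
    merged = np.where(np.isin(labels, [i, j]), -1, labels)   # -1 marks the merged cluster
    others = [p for p in np.unique(labels) if p not in (i, j)]
    bound = min(ssd(X, Z, merged, -1, p) for p in others) if others else np.inf
    return ssd(X, Z, labels, i, j) <= bound and ssd(X, Z, labels, j, i) <= bound
```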

Algorithm 5 The HSRC algorithm
Input: $\mathbf{X}_t=[\mathbf{x}_1,\mathbf{x}_2,...,\mathbf{x}_n]\in\mathbf{R}^{d\times n}$, parameters $\lambda>0$, $m>0$ and $\sigma\in(0,1)$
1: if $t==1$ then
2:    $\mathbf{X}=\mathbf{X}_1$ and $\mathbf{X}_s=\mathbf{X}_1$;
3: else
4:    $\mathbf{X}=[\mathbf{X}_s,\mathbf{X}_t]$;
5: end if
6: Solve Problem (5) via Algorithm 1 and obtain the optimal solution $(\mathbf{Z},\mathbf{E})$.
7: Create $m$ microclusters via Algorithm 2.
8: Perform the merging operation on the $m$ microclusters via Algorithm 3, and obtain $s$ macroclusters.
9: Purify each of the $s$ macroclusters in a fine-tuning manner via Algorithm 4, and obtain the final clusters.
10: for each cluster in the final clusters do
11:    for each data object $\mathbf{x}$ in the cluster do
12:       Compute $SRV(\mathbf{x})$ using Eq. (9);
13:       if $SRV(\mathbf{x})\geq\sigma$ then
14:          $\mathbf{x}$ is regarded as an outlier;
15:       end if
16:    end for
17:    Select the representative data objects from the cluster by sorting their SRVs and then add them to $\mathbf{X}_s$;
18: end for
Output: The final clusters.

III-D Fine-Tuning

Although sparse representation attempts to automatically choose data objects belonging to the same class in data streams, the sparse coefficients are not completely concentrated on a particular microcluster and may instead be spread across a few microclusters. Hence, the microclusters can be considered rough clusters. Ideally, a data object has a sparse representation whose nonzero entries are concentrated mostly in one microcluster. The distribution of the estimated sparse coefficients thus contains important clustering information about the relationships among the data objects. Under these circumstances, fine-tuning is a strategy in which a few data objects are selected from a microcluster and then dropped into another microcluster under certain conditions. This strategy incrementally improves the clustering performance achieved for data objects.

From a sparse representation perspective, fine-tuning considers each data object in a microcluster with respect to another microcluster as follows:

$error_j=\left\|\mathbf{x}_l-\mathbf{X}_j\mathbf{z}_j\right\|_{2}^{2}$, (14)

where $\mathbf{x}_l$ denotes the $l$th data object in the $i$th microcluster, and $\mathbf{X}_j$ is the $j$th microcluster. We assume that $s$ macroclusters are obtained via Algorithm 3. The error set of $\mathbf{x}_l$ is represented as $errorSet=[error_1,error_2,...,error_s]$, and the clustering index of the minimum value included in the error set is $j$.

TABLE I: Descriptions of the streaming datasets.
Datasets Classes Data objects Features
Keystroke 4 1,600 10
Network Intrusion 2 494,000 42
Forest Cover 7 580,000 54
COIL-100 100 7,200 1,024

If $i\neq j$, $\mathbf{x}_l$ is selected from the $i$th microcluster and dropped into the $j$th microcluster. For example, a data object is picked up from its original cluster and dropped into a new microcluster if it has a smaller residual with respect to another microcluster.

Finally, it is critical to choose candidates from among the data objects to be purified in their clusters. We focus on the data objects whose sparse coefficients are spread over many clusters. In particular, we adopt the SRV as a criterion to help choose potential outliers in each cluster. Algorithm 4 summarizes the complete macrocluster fine-tuning procedure, and a sketch of the reassignment step follows.
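In this minimal sketch, each candidate object (chosen by its SRV, as described above) is moved to the cluster with the smallest Eq. (14) error; reusing the global coefficient matrix $\mathbf{Z}$ restricted to each cluster is an assumption made for illustration.

```python
import numpy as np

def fine_tune(X, Z, labels, candidates):
    """Reassign each candidate object to the cluster with the smallest Eq. (14) error."""
    new_labels = labels.copy()
    clusters = np.unique(labels)
    for l in candidates:
        errors = []
        for c in clusters:
            idx = np.flatnonzero(labels == c)
            # Residual of x_l reconstructed from cluster c's objects and coefficients only.
            r = X[:, l] - X[:, idx] @ Z[idx, l]
            errors.append(np.dot(r, r))
        new_labels[l] = clusters[int(np.argmin(errors))]
    return new_labels
```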

III-E Computational Complexity Analysis

We assume that $\mathbf{X}$ in a landmark window has $n$ data objects that belong to $s$ classes, where the size of $\mathbf{X}$ is $d\times n$. Algorithm 5 summarizes the complete data stream clustering algorithm of HSRC. We use an inexact ALM framework in Algorithm 1 [35]. The inexact ALM framework has been extensively studied and generally converges well, and Algorithm 1 performs well in practical applications. The computational complexity of the first step of Algorithm 1 is $O(n^2)$ because it requires the sparse representation of an $n\times n$ matrix to be computed in an SVT operator. The overall computational complexity of Algorithm 1 is $O(tn^2)$ if the $l_{2,1}$-norm is adopted in the last equation of Problem (8), where $t$ is the number of iterations. When $n>d$, the computational complexity of Algorithm 2 can be considered $O(n^3)$ due to spectral clustering. The computational complexities of Algorithm 3 and Algorithm 4 are $O(k^2n^2)$ and $O(s^2n^2)$, respectively, where $k$ and $s$ are the numbers of microclusters and macroclusters, respectively. The complexity of HSRC is thus $O(tn^2+n^3+k^2n^2+s^2n^2)$. Therefore, the final overall complexity of HSRC is $O(n^3)$ if $t\ll n$, $k\ll n$ and $s\ll n$.

IV Experimental Study

In this section, we evaluate the performance of the proposed HSRC approach on publicly available datasets by comparing it with existing popular data stream clustering algorithms: CluStream [17], ClusTree [19], CluStreamKM [38], StreamKM++ [39], fuzzy double $c$-means based on sparse self-representation (FDCM_SSR) [40] and flexible density peak clustering (FDPC) [41]. HSRC is implemented in MATLAB, and all the experiments are conducted on a Windows platform with an Intel i5-2300 CPU and 16 GB of RAM. The source code of HSRC is available online at https://github.com/chenjie20/HSRC. The implementations of the CluStream, ClusTree, CluStreamKM and StreamKM++ algorithms are provided by Massive Online Analysis (MOA), which is a popular open-source tool for data stream mining [38]. The source codes of the FDCM_SSR and FDPC algorithms are provided by their authors.

IV-A Experimental Settings

Four benchmark datasets are used in the evaluation [42, 43]. The data objects are randomly shuffled to alleviate any potential effects caused by the sorted streaming order. The drift interval of the Keystroke dataset is 200, whereas the drift intervals of the other datasets are unknown. The Keystroke dataset is composed of 10 feature variables. The Network Intrusion dataset was collected from seven weeks of network traffic simulated in a military network environment, and its requests are divided into two classes: normal and malicious. All the text attributes of Network Intrusion are manually converted into enumerated values and represented by digits. The Forest Cover dataset consists of 54 cartographic variables. The last variable of each of these datasets represents the ground-truth label of the corresponding data object. The Columbia University Image Library (COIL)-100 dataset contains 7,200 images of objects belonging to 100 categories. The statistics of the datasets are summarized in Table I.

The HSRC algorithm is evaluated based on its clustering quality and temporal efficiency. Clustering quality is assessed via clustering purity and the $F$-measure [44]. All the parameters of the competing algorithms are manually tuned to obtain their optimal results. In Algorithms 1-4, HSRC involves three parameters: $\lambda$, $m$ and $\sigma$. Empirically speaking, parameter $\lambda$ should be relatively large if the input data streams are only slightly contaminated by noise, and vice versa. Parameter $m$ controls the number of microclusters and is typically set as a multiple of the maximum number of classes contained in the data streams. For our experiments, we use a modified parameter $m^{\prime}$ instead of $m$, which ranges from 1 to 2. Microcluster merging is not performed if the maximum number of classes is less than or equal to 2 when $m^{\prime}=1$. Parameter $\sigma$ is disregarded in cases where no outliers are detected. Additionally, HSRC can disable the fine-tuning operation by setting $w=0$. Further details concerning the parameters are provided in the experiments.
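For reference, the sketch below shows one common way to compute clustering purity and a class-matching $F$-measure from integer-encoded ground-truth labels and predicted cluster labels; the exact $F$-measure variant used in the paper follows [44] and may differ in detail.

```python
import numpy as np

def clustering_purity(y_true, y_pred):
    """Purity: each predicted cluster votes for its majority ground-truth class."""
    total = 0
    for c in np.unique(y_pred):
        members = y_true[y_pred == c]
        total += np.bincount(members).max()
    return total / len(y_true)

def clustering_f_measure(y_true, y_pred):
    """Class-weighted F-measure: each class is matched with its best-fitting cluster."""
    n = len(y_true)
    score = 0.0
    for k in np.unique(y_true):
        in_class = (y_true == k)
        best = 0.0
        for c in np.unique(y_pred):
            in_cluster = (y_pred == c)
            tp = np.sum(in_class & in_cluster)
            if tp == 0:
                continue
            precision = tp / in_cluster.sum()
            recall = tp / in_class.sum()
            best = max(best, 2 * precision * recall / (precision + recall))
        score += in_class.sum() / n * best
    return score
```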

Figure 1: Online performance achieved on the first 50 windows of the Network Intrusion and Forest Cover datasets. (a) Network Intrusion. (b) Forest Cover.

IV-B Clustering Quality Evaluation

TABLE II: Average clustering results (%) produced over the entire stream of each dataset.
Datasets Metrics (%) HSRC CluStream ClusTree CluStreamKM StreamKM++ FDCM_SSR FDPC
Keystroke Purity 89.49 64 73 71 69 74.75 80.06
$F$-measure 89.03 74 75 72 70 75.27 70.59
Network Intrusion Purity 96.48 93 91 94 92 94.33 94.89
$F$-measure 97.95 72 70 78 75 90.72 89.51
Forest Cover Purity 74.58 61 66 53 56 71.6 71.88
$F$-measure 45.28 37 33 36 38 44.61 44.18
COIL-100 Purity 67.74 41 39 44 43 51.1 44.96
$F$-measure 64.24 40 42 38 39 47.97 26.87
TABLE III: The average computational costs required for a landmark window (seconds) in each dataset.
Datasets HSRC CluStream ClusTree CluStreamKM StreamKM++ FDCM_SSR FDPC
Keystroke 0.49 0.15 0.9 0.09 0.1 0.26 0.11
Network Intrusion 1.25 0.47 2.7 0.24 0.44 5.24 0.33
Forest Cover 1.22 0.29 0.72 0.27 0.48 3.75 0.3
COIL-100 5.73 1.71 2.21 1.55 2.45 6.97 1.64

We evaluate the proposed algorithm on the streaming datasets. The four groups of HSRC parameters used for this experiment are (1) $\lambda=100$, $m^{\prime}=1$, $w=0$, (2) $\lambda=0.1$, $m^{\prime}=1$, $w=1$, (3) $\lambda=12$, $m^{\prime}=1$, $w=0$ and (4) $\lambda=20$, $m^{\prime}=1$, $w=1$. Compared with the other algorithms, the HSRC algorithm almost consistently obtains the best results in terms of clustering purity and the $F$-measure in the experiments. This confirms that our proposed method is very effective for high-dimensional data stream clustering cases with varying numbers of clusters. For example, HSRC achieves a high clustering purity level of 89.49% on the Keystroke dataset and improves the clustering purity by at least 9.43% over those of the other algorithms. We observe the same advantages on the datasets whose data objects contain more features. For example, HSRC achieves high clustering purities of 96.48%, 74.58% and 67.74% on the remaining three datasets. Table II shows that it also significantly outperforms the other algorithms in terms of the $F$-measure. These clustering results confirm that the relationships calculated from the sparse representations significantly improve the attained clustering performance, especially when the data objects derived from the data streams contain more features. In addition, density-based algorithms can identify clusters with arbitrary shapes in data streams and often perform well when the input data objects have only a few attributes. However, the clustering performance of these algorithms decreases as the number of data object attributes gradually increases.

We evaluate the online performance of the proposed HSRC method on the first 50 windows of two representative datasets, i.e., Network Intrusion and Forest Cover. Fig. 1 illustrates the online performance attained by the proposed HSRC method on these datasets. We can observe that HSRC achieves relatively stable clustering purity and $F$-measure values on both datasets. For example, Fig. 1a shows that the online clustering purity and $F$-measure values exceed 90% for most of the landmark windows. Similarly, Fig. 1b indicates that both metrics yielded by HSRC are above 25%. These experimental results demonstrate the stability of the online performance exhibited by the proposed HSRC method.

The time costs of all the competing algorithms are presented in Table III. Most of the competing algorithms have lower computational costs than HSRC does. For example, CluStreamKM has the lowest computational cost among all the algorithms. Nevertheless, HSRC incurs reasonable computational costs in the experiments; it spends most of its time computing the affinity matrix via sparse representation.

Compared with $k$-means-based data stream clustering methods such as CluStreamKM and StreamKM++, HSRC employs a spectral clustering technique that also incorporates $k$-means to create microclusters. CluStreamKM and StreamKM++ compute the centroid of each CF vector via the original data objects. In contrast, HSRC uses an $l_1$-norm technique to construct an affinity matrix, which encodes the membership attributes of the high-dimensional data objects contained in the input data streams. HSRC thus captures the intrinsic structures of high-dimensional data objects. The experimental results show that HSRC significantly enhances the clustering purity and $F$-measure values achieved for high-dimensional data streams.

IV-C Robustness to Noise and Outliers

We evaluate the robustness of these algorithms on a more challenging set of high-dimensional data streams. Four artificial attribute noise levels (5%, 10%, 15%, and 20%) are applied to two datasets: (1) Network Intrusion and (2) Forest Cover. The locations of the corrupted data object attributes are chosen randomly, and the value of each selected location is replaced by a random number within the range $[0,1]$.
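A small sketch of this corruption protocol is shown below, assuming the feature values are normalized to $[0,1]$ and stored as floats; the ratio argument corresponds to the reported noise levels.

```python
import numpy as np

def corrupt_attributes(X, ratio, rng=None):
    """Replace a random fraction of attribute entries with uniform noise in [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    Xn = X.copy()
    mask = rng.random(X.shape) < ratio           # randomly chosen corrupted locations
    Xn[mask] = rng.random(np.count_nonzero(mask))
    return Xn

# Example: inject 10% attribute noise into a normalized landmark window.
# X_noisy = corrupt_attributes(X_window, ratio=0.10)
```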

TABLE IV: Average clustering results (%) produced on the Network Intrusion and Forest Cover datasets.
Datasets Ratio (%) Metrics HSRC StreamKM++ FDCM_SSR FDPC
Network Intrusion 5% Purity 95.04 90 93.17 93.9
$F$-measure 96.93 74 89.53 88.79
10% Purity 94.95 88 92.34 93.54
$F$-measure 96.45 72 88.05 80.85
15% Purity 94.4 86 91.68 92.15
$F$-measure 96.2 71 86.78 64.84
20% Purity 94.44 86 91.54 90.04
$F$-measure 96.4 70 85.88 56.23
Forest Cover 5% Purity 74.69 54 68.58 71.03
$F$-measure 43.61 35 41.14 37.96
10% Purity 73.18 53 66.24 68.81
$F$-measure 39.27 34 38.16 30.48
15% Purity 71.96 51 65.35 66.14
$F$-measure 38.65 32 33.5 33.35
20% Purity 69.96 50 65.52 65.18
$F$-measure 37.68 29 33.71 30.88

Table IV shows the average clustering results produced for the four different noise levels. As expected, the performance of the algorithms slowly decreases as the percentage of noise increases. HSRC achieves consistently better clustering results than those of the other methods. For example, our method achieves average clustering purities of 95.04%, 94.95%, 94.4% and 94.44% on the Network Intrusion dataset. Regarding the $F$-measure, HSRC significantly outperforms the competing methods across the different noise levels, which indicates that the numbers of macroclusters it produces are much closer to the ground-truth numbers of clusters contained in the landmark windows. In contrast, the clustering performance of the other competing algorithms decreases dramatically as the percentage of noise slowly increases. We also observe that HSRC retains the same advantages on the Forest Cover dataset. For example, the clustering purity and $F$-measure of HSRC are 74.69% and 43.61%, respectively, under a noise percentage of 5%. HSRC consistently achieves better clustering performance than that of the competing approaches under the other three noise percentages (10%, 15%, and 20%). On the one hand, this highlights the benefit of estimating the relationships among data objects via sparse representation. On the other hand, this finding demonstrates that the $l_{2,1}$-norm effectively characterizes the noise term in Problem (4). These experimental results demonstrate the robustness of the proposed HSRC method under noisy circumstances.

The proposed outlier detection mechanism is evaluated on the Forest Cover dataset. We select the first 20 landmark windows, and each window contains 1,000 data objects. In each trial, two successive landmark windows are employed, and the experiment is repeated 10 times. Half of the data objects contained within these windows are segmented into clusters, whereas the remaining data objects are treated as potential outliers for testing purposes. The effectiveness of the outlier detection mechanism is then verified via the SRV from Eq. (9). For comparison, we employ the 1-NN method, which is a general outlier detection technique that calculates the Euclidean distance between a test data object and its nearest neighbor within a landmark window. The outlier detection errors are classified into two categories: valid data objects misidentified as outliers and actual outliers incorrectly classified as valid data. The average outlier error rate for HSRC is 4.21%, whereas it is 6.64% for 1-NN across the 10 trials. Compared with 1-NN, HSRC reduces the average outlier error rate by 2.43%.
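A minimal sketch of the 1-NN baseline used for this comparison is given below: each test object is scored by the Euclidean distance to its nearest neighbor in the landmark window, and a score above a chosen threshold suggests an outlier; the threshold itself is left unspecified here, as in the comparison.

```python
import numpy as np

def nn_outlier_scores(X_window, X_test):
    """1-NN baseline: distance from each test object (column) to its nearest
    neighbor inside the landmark window; large scores suggest outliers."""
    # Pairwise squared Euclidean distances between test and window objects.
    d2 = (np.sum(X_test**2, axis=0)[:, None]
          + np.sum(X_window**2, axis=0)[None, :]
          - 2 * X_test.T @ X_window)
    return np.sqrt(np.maximum(d2.min(axis=1), 0.0))

# A test object is flagged as an outlier when its score exceeds a chosen threshold.
```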

IV-D Sensitivity Analysis

In Algorithms 1-4, HSRC has two main parameters: $\lambda$ and $m$. We examine the sensitivity of HSRC to different combinations of parameters $\lambda$ and $m$. In particular, we set parameter $w=1$ for the Network Intrusion and Forest Cover datasets. Empirically, parameter $\lambda$ is closely related to the prior noise level of the data stream. Hence, we usually set relatively large values for parameter $\lambda$ in HSRC if the given data stream is only slightly contaminated by noise. In addition, parameter $m$ represents the number of microclusters. For convenience, parameter $m^{\prime}$ represents a multiple of the maximum number of classes contained in a data stream. Parameter $m^{\prime}$ ranges from 1 to 2 in the experiments.

Figure 2: Clustering results obtained with different values of $\lambda$ when using the first 50 windows of the Network Intrusion dataset. (a) $m^{\prime}=1$. (b) $m^{\prime}=2$.
Figure 3: Clustering results obtained with different values of $\lambda$ when using the first 50 windows of the Forest Cover dataset. (a) $m^{\prime}=1$. (b) $m^{\prime}=2$.

Figs. 2 and 3 show the clustering performance achieved on the Network Intrusion and Forest Cover datasets, respectively, in terms of the clustering purity and $F$-measure values produced with different combinations of parameters $\lambda$ and $m^{\prime}$. Specifically, Fig. 2 indicates that HSRC performs well across a large range of $\lambda$ values when $m^{\prime}=1$. However, the $F$-measure declines significantly as $m^{\prime}$ increases from 1 to 2. This is because the number of clusters rapidly decreases as the number of macroclusters gradually increases during the experiments. Similar cases can also be observed in Fig. 3, where the $F$-measure decreases dramatically when $m^{\prime}$ gradually increases from 1 to 2. Therefore, relatively large ranges of $\lambda$ yield satisfactory clustering results when $m^{\prime}=1$.

V Conclusion

In this paper, we propose an HSRC algorithm for clustering high-dimensional data streams. Unlike existing clustering techniques that rely on Euclidean distance computations, HSRC utilizes an $l_1$-norm technique to capture the intrinsic structures of high-dimensional data objects. It automatically selects the appropriate number of neighboring data objects, which ensures that the highly correlated data objects of clusters are grouped together. HSRC merges microclusters into macroclusters by introducing the SSD. Moreover, fine-tuning is employed to refine the macroclusters. Data object representatives are selected from each macrocluster by sorting the SRVs of the data objects. These representatives are then employed as dictionary samples for the next landmark window. The proposed HSRC method significantly enhances the modeling of relationships among data objects, which improves the generalization of sparse representation in high-dimensional data stream clustering tasks. Furthermore, HSRC effectively identifies outlier candidates based on their SRVs. Our extensive experiments conducted on high-dimensional datasets demonstrate the superiority of the proposed HSRC method over several state-of-the-art approaches.

References

  • [1] J. Gama, R. Sebastião, and P. P. Rodrigues, “On evaluating stream learning algorithms,” Mach. Learn., vol. 90, no. 3, pp. 317–346, Mar. 2013.
  • [2] J. A. Silva, E. R. Faria, R. C. Barros, E. R. Hruschka, A. C. P. L. F. de Carvalho, and J. Gama, “Data stream clustering: A survey,” ACM Comput. Surv., vol. 46, no. 1, pp. 1–31, Oct. 2013.
  • [3] A. Amini, T. Y. Wah, and H. Saboohi, “On density-based data streams clustering algorithms: A survey,” J. Comput. Sci. Technol., vol. 29, no. 1, pp. 116–141, May 2014.
  • [4] H. L. Nguyen, Y. K. Woon, and W. K. Ng, “A survey on data stream clustering and classification,” Knowl. Inf. Syst., vol. 45, no. 3, pp. 535–569, Oct. 2015.
  • [5] M. Hahsler and M. Bolaños, “Clustering data streams based on shared density between micro-clusters,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 6, pp. 1449–1461, Jun. 2016.
  • [6] M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult, “Monic: modeling and monitoring cluster transitions,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Disc. Data Mining, Philadelphia, PA, USA, Aug. 2006, pp. 706–711.
  • [7] J. Shi, J. Malik, and S. Sastry, “Normalized cuts and image segmentation,” IEEE Trans. Pattern Anal. and Mach. Intell., vol. 22, no. 8, pp. 181–214, May 2000.
  • [8] M. Belkin and P. Niyogi, “Laplacian eigenmaps and spectral techniques for embedding and clustering,” in Proc. 16th Adv. Neural Inf. Process. Syst., Vancouver, British Columbia, Canada, Dec. 2002, pp. 585–591.
  • [9] Z. Huang, H. Zhu, J. T. Zhou, and X. Peng, “Multiple marginal fisher analysis,” IEEE Trans. Ind. Electron., vol. 1, no. 1, p. 1, Sept. 2018.
  • [10] S. Guha, A. Meyerson, R. M. N. Mishra, and O’Callaghan, “Clustering data streams: Theory and practice,” IEEE Trans. Knowl. Data Eng., vol. 15, no. 3, pp. 515–528, May 2003.
  • [11] L. Tu and Y. Chen, “Stream data clustering based on grid density and attraction,” ACM Trans. Knowl. Discov. Data., vol. 3, no. 3, pp. 1–27, Jul. 2009.
  • [12] A. Amini, T. Y. Wah, M. R. Saybani, and S. R. A. S. Yazdi, “A study of density-grid based clustering algorithms on data streams,” in 8th Int. Conf. Fuzzy Sys. Knowl. Discov., Haikou, Hainan, China, Jul. 2011, pp. 1652–1656.
  • [13] M. Shindler, A. Wong, and A. Meyerson, “Fast and accurate k-means for large datasets,” in Proc. 25th Adv. Neural Inf. Process. Syst., Granada, Spain, Dec. 2011, pp. 2375–2383.
  • [14] X. Zhang, C. Furtlehner, C. Germain-Renaud, and M. Sebag, “Data stream clustering with affinity propagation,” IEEE Trans. Knowl. Data Eng., vol. 26, no. 7, pp. 1644–1656, Jul. 2014.
  • [15] R. Hyde, P. Angelov, and A. R. MacKenzie, “Fully online clustering of evolving data streams into arbitrarily shaped clusters,” Inf. Sci., vol. 382, no. 1, pp. 96–114, Mar. 2017.
  • [16] T. Zhang, R. Ramakrishnan, and M. Livny, “Birch: A new data clustering algorithm and its applications,” Data. Mining. Knowl. Discov., vol. 1, no. 2, pp. 2215–2228, Jun. 1997.
  • [17] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” in Proc. 29th Int. Conf. Very Large Data Bases, Berlin, Germany, Sept. 2003, pp. 81–92.
  • [18] F. Cao, M. Estert, W. Qian, and A. Zhou, “Density-based clustering over an evolving data stream with noise,” in Proc. SIAM Int. Conf. Data Mining, Bethesda, MD, USA, Apr. 2006, pp. 328–339.
  • [19] P. Kranen, I. Assent, C. Baldauf, and T. Seidl, “The clustree: indexing micro-clusters for anytime stream mining,” Knowl. Inf. Syst., vol. 29, no. 2, pp. 249–272, Nov. 2011.
  • [20] J. Chen, Z. Wang, S. Yang, and H. Mao, “Two-stage sparse representation clustering for dynamic data streams,” IEEE Trans. Cybern., vol. 53, no. 10, pp. 6408–6420, Oct. 2023.
  • [21] J. Chen, S. Yang, C. Fahy, Z. Wang, Y. Guo, and Y. Chen, “Online sparse representation clustering for evolving data streams,” IEEE Trans. Neural. Netw. Learn. Syst., pp. 1–15, Oct. 2023.
  • [22] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed by V1?” Vis. Res., vol. 37, no. 23, pp. 3311–3326, Dec. 1997.
  • [23] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the Royal Statistical Society. Series B, vol. 58, no. 1, pp. 267–288, Jan. 1996.
  • [24] D. L. Donoho, “For most large underdetermined systems of linear equations the minimal $l_1$-norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, vol. 59, no. 6, pp. 797–829, Mar. 2006.
  • [25] J. Chen, Y. Chen, Z. Wang, H. Zhang, and X. Peng, “Spectral embedding fusion for incomplete multiview clustering,” IEEE Trans. Image Process., pp. 4116–4130, Jul. 2024.
  • [26] J. Chen, H. Mao, H. Zhang, and Z. Yi, “Symmetric low-rank preserving projections for subspace learning,” Neurocomputing, vol. 315, no. 13, pp. 381–393, Nov. 2018.
  • [27] M. Elad and M. Aharon, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006.
  • [28] X. Yuan, X. Liu, and S. Yan, “Visual classification with multi-task joint sparse representation,” IEEE Trans. Image Process., vol. 21, no. 10, pp. 4349–4360, Jun. 2012.
  • [29] R. Tibshirani, “Regression shrinkage and selection via the lasso,” J. R. Stat. Soc. Series B, vol. 58, no. 1, pp. 267–288, Jan. 1996.
  • [30] B. K. Natarajan, “Sparse approximate solutions to linear systems,” SIAM Journal on Computing, vol. 24, no. 2, pp. 227–234, Oct. 1995.
  • [31] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions of systems of equations to sparse modeling of signals and images,” SIAM review, vol. 51, no. 1, pp. 34–81, Jul. 2009.
  • [32] M. Aharon, M. Elad, and A. Bruckstein, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.
  • [33] K. Udommanetanakit, T. Rakthanmanon, and K. Waiyamai, “E-stream: Evolution-based technique for stream clustering,” in Int. Conf. Adv Data Mining Appl., Heidelberg, Berlin, Aug. 2007, pp. 605–615.
  • [34] S. Luhr and M. Lazarescu, “Incremental clustering of dynamic data streams using connectivity based representative points,” Data Knowl. Eng., vol. 68, no. 1, pp. 1–27, Jan. 2009.
  • [35] Z. Lin, R. Liu, and Z. Su, “Linearized alternating direction method with adaptive penalty for low-rank representation,” in Proc. 25th Adv. Neural Inf. Process. Syst., Vancouver, British Columbia, Canada, Dec. 2011, pp. 612–620.
  • [36] J. F. Cai, E. J. Candès, and Z. Shen, “A singular value thresholding algorithm for matrix completion,” SIAM J. Optim., vol. 20, no. 4, pp. 1956–1982, Mar. 2010.
  • [37] G. Liu, Z. Lin, and Y. Yu, “Robust subspace segmentation by low-rank representation,” in Proc. 27th Int. Conf. Mach. Learn., Haifa, Israel, Jun. 2010, pp. 1–8.
  • [38] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “MOA: massive online analysis,” J. Mach. Learn. Res., vol. 11, no. 1, pp. 1601–1604, May 2010.
  • [39] M. R. Ackermann, M. Märtens, C. Raupach, C. Lammersen, and C. Sohler, “Streamkm++: A clustering algorithm for data streams,” J. Exp. Algorithmics, vol. 17, no. 2, pp. 1–30, May 2012.
  • [40] J. Gu, L. Jiao, S. Yang, and F. Liu, “Fuzzy double $c$-means clustering based on sparse self-representation,” IEEE Trans. Fuzzy Syst., vol. 26, no. 2, pp. 612–626, Apr. 2018.
  • [41] J. Hou, H. Lin, H. Yuan, and M. Pelillo, “Flexible density peak clustering for real-world data,” Pattern Recognit., vol. 156, p. 110772, Dec. 2024.
  • [42] D. Dua and C. Graff. (2017, Apr.) UCI machine learning repository. [Online]. Available: http://archive.ics.uci.edu/ml
  • [43] V. M. A. Souza, D. F. Silva, J. Gama, and G. E. A. P. A. Batista, “Data stream classification guided by clustering on nonstationary environments and extreme verification latency,” in Proc. SIAM Int. Conf. Data Mining (SDM), Vancouver, British Columbia, Canada, Apr. 2015, pp. 873–881.
  • [44] C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval.   New York, NY, USA: Cambridge University Press, 2008.
[Uncaptioned image] Jie Chen received the BSc degree in Software Engineering, MSc degree and PhD degree in Computer Science from Sichuan University, Chengdu, China, in 2005, 2008 and 2014, respectively. He is currently an Associate Professor in the College of Computer Science, Sichuan University, China. His current research interests include machine learning, big data analysis and deep neural networks.
[Uncaptioned image] Hua Mao received the B.S. degree and M.S. degree in Computer Science from University of Electronic Science and Technology of China (UESTC) in 2006 and 2009, respectively. She received her Ph.D. degree in Computer Science and Engineering from Aalborg University, Denmark in 2013. She is currently a Senior Lecturer in Department of Computer and Information Sciences, Northumbria University, U.K. Her current research interests include Deep Neural Networks and Big Data.
[Uncaptioned image] Yuanbiao Gou received the B.E. degree from the School of Software Engineering, Sichuan University, Chengdu, China, in 2019. He is currently pursuing the Ph.D. degree with the School of Computer Science, Sichuan University, under the supervision of Prof. Xi Peng. His current research interests include image processing and computer vision.
[Uncaptioned image] Xi Peng (Member, IEEE) is currently a Full Professor with the College of Computer Science, Sichuan University. His current research interest includes machine intelligence and has authored more than 50 articles in these areas. He has served as an Associate Editor/Guest Editor for six journals, including the IEEE Transactions on SMC: Systems and IEEE Transactions on Neural Network And Learning Systems and the Area Chair/Senior Program Committee Member for the conferences such as IJCAI, AAAI, and ICME.