
High-Dimensional Data Set Simplification by Laplace-Beltrami Operator

Chenkai Xu, Hongwei Lin
C. Xu and H. Lin are with the School of Mathematics, State Key Lab of CAD&CG, Zhejiang University. Email: [email protected]
Abstract

With the development of the Internet and other digital technologies, the speed of data generation has become considerably faster than the speed of data processing. Because big data typically contain massive redundant information, it is possible to significantly simplify a big data set while maintaining the key information it contains. In this paper, we develop a big data simplification method based on the eigenvalues and eigenfunctions of the Laplace-Beltrami operator (LBO). Specifically, given a data set that can be considered as an unorganized data point set in high-dimensional space, a discrete LBO defined on the big data set is constructed and its eigenvalues and eigenvectors are calculated. Then, the local extrema and the saddle points of the eigenfunctions are taken to be the feature points of the data set in high-dimensional space, constituting the simplified data set. Moreover, we develop feature point detection methods for functions defined on an unorganized data point set in high-dimensional space, and devise metrics for measuring the fidelity of the simplified data set to the original set. Finally, examples and applications are presented to validate the efficiency and effectiveness of the proposed methods, showing that data set simplification makes it possible to process a maximum-sized data set with a limited data processing capability.

Index Terms:
Data simplification, Big data, Laplace-Beltrami operator, Feature point detection

1 Introduction

It is well known that big data concern large-volume, complex, and growing data sets with multiple sources [1]. With the development of the Internet and other digital technologies, increasingly larger amounts of data are generated at a speed considerably greater than the data processing speed of human beings. Because big data typically contain massive redundant information, it is possible to significantly simplify a big data set while maintaining the key information it contains. Therefore, data set simplification is a feasible method to process a maximum-sized data set using a limited data processing capability.

The purpose of big data simplification is to retain the maximum amount of information a big data set contains, while deleting the maximum amount of redundant information. In this paper, a big data set is considered as an unorganized point set in a high-dimensional space. Based on this consideration, we develop a data set simplification method by classifying the points in a big data set into feature points and trivial points, and then taking the feature point set as the simplification of the big data set, because it retains the majority of the information contained in the original big data set.

Feature point detection is a fundamental problem in numerous research fields, including computer vision and computer graphics. These fields focus on the processing of one-dimensional (1D) and two-dimensional (2D) manifolds, so the main tool for feature point detection is curvature, including both Gauss and mean curvatures. However, in high-dimensional space, the definition of curvature is complicated, and practical computational methods for calculating the curvature of a high-dimensional manifold are not available. Therefore, for big data processing, it is necessary to develop practical and efficient methods for detecting feature points in big data sets.

It is well known that the eigenvalues and eigenvectors of a Laplace-Beltrami operator (LBO) defined on a manifold are closely related to its geometric properties. Moreover, whatever the dimension of the manifold, after discretization, the LBO (called a discrete LBO) becomes a matrix. That is, the discrete LBO is unrelated to the dimension of the manifold on which it is defined. Therefore, the eigenvalues and eigenvectors of LBOs and the related heat kernel functions are desirable tools for studying the geometric properties of big data sets in high-dimensional space, including the detection of feature points.

In this paper, we develop a data simplification method for high-dimensional big data sets based on feature point detection using the eigenvectors of the LBO. Moreover, three metrics are devised for measuring the fidelity of the simplified data set to the original data set. Finally, the capability of the proposed big data simplification method is extensively discussed using several examples and applications.

This paper is organized as follows: In Section 2, related work is briefly reviewed, including the calculations and applications of eigenvalues and eigenvectors of LBO, feature point detection methods in computer graphics and computer vision, and sampling techniques on high-dimensional data. In Section 3, after introducing the technique for calculating the eigenvalues and eigenvectors of LBO on high-dimensional big data sets, we develop the data simplification method by detecting feature points of the big data sets, and propose three metrics to measure the fidelity of the simplified data set to the original data set. In Section 4, the capabilities of the proposed data simplification method are illustrated and discussed. Finally, the paper is concluded in Section 5.

2 Related work

In this section, we review the related work in three aspects. First, we address the computation techniques and applications of the eigenvalues and eigenvectors of the LBO in dimensionality reduction (DR), shape descriptors, and mesh saliency detection. Second, feature point detection methods are introduced. Finally, data sampling techniques in machine learning and deep learning are discussed.

Computation and applications of LBO. To calculate the eigenvalues and eigenfunctions of an LBO defined on a triangular mesh or big data set, it must first be discretized. One commonly employed technique for the discretization of an LBO is the cotangent Laplacian [2], which requires only the one-ring neighbors of a vertex to construct the discrete LBO at the vertex. Although the cotangent Laplacian has intuitive geometric meaning, it does not converge in generic cases [3]. In 2008, Belkin et al. proposed the mesh Laplacian scheme [4], where the Euclidean distance between each pair of points in the complete data set is required for the calculation of the matrix weights. Furthermore, by discretizing the integration of certain continuous functions defined over the manifold, a symmetrizable discrete LBO was developed to ensure that its eigenvectors are orthogonal [5].

The LBO has been extensively employed in mesh processing in the fields of computer graphics and computer vision. Based on the spectral analysis of the graph Laplacian, Taubin [6] designed low-pass filters for mesh smoothing. Moreover, by first modulating the eigenvalues and eigenvectors of an LBO on a mesh and then reconstructing the shape of the mesh model, mesh editing tasks can be fulfilled, including mesh smoothing, shape detail enhancement, and change of model posture [7, 8].

Using the information in the spectral domain of the LBO, mesh saliency points or regions can be detected. By defining a geometric energy related to the eigenvalues of an LBO, Hu and Hua proposed a method to extract salient geometric feature points [9]. Considering the properties of the log-Laplacian spectrum of a mesh, Song et al. developed a method for detecting mesh saliency by capturing the saliency in the frequency domain [10].

The eigenvalues and eigenvectors of an LBO have also been used in shape analysis and shape retrieval. Reuter et al. [11] used the sequence of eigenvalues of an LBO defined on a mesh as a numerical fingerprint or signature of the mesh, called the Shape DNA, which has been successfully applied in fields such as shape retrieval, copyright protection, and mesh quality assessment. Using the concept of the heat kernel, which contains the information of the eigenvalues and eigenvectors of the LBO defined on a mesh, Sun et al. [3] proposed the heat kernel signature (HKS) for shape analysis. With the intrinsic invariant property of the HKS, Ovsjanikov et al. [12] proposed a one-point isometric matching algorithm to determine the intrinsic symmetries of shapes. Bronstein and Kokkinos [13] developed the scale-invariant heat kernel signature (SI-HKS) for non-rigid shape recognition, based on a logarithmically sampled scale-space.

More importantly, the LBO has been successfully applied in high-dimensional data processing. A well-known application of the LBO in high-dimensional data processing is Laplacian eigenmaps [14]. To appropriately represent high-dimensional data for machine learning and pattern recognition, Belkin and Niyogi [14] proposed a locality-preserving method for nonlinear DR using the LBO eigenfunctions, which has a natural connection to clustering and can be computed efficiently.

Feature point detection. Following the presentation of the scale-invariant feature transform (SIFT) [15] and the histogram of oriented gradients (HOG) [16], a series of feature point detection methods were proposed using the Difference of Gaussians (DoG) [15]. Castellani et al. [17] designed a 3D saliency measure based on the DoG, which allows a small number of sparse salient points to characterize distinctive portions of a surface. Zou et al. [18] defined the DoG function over a curved surface in a geodesic scale space and calculated the local extrema of the DoG, called saliency-driven keypoints, which are robust and stable under noisy input [19, 20, 21].

Similarly, geometric function-based methods have been proposed. Two widely known surface descriptors based on surface geometry are 3D spin images [19] and 3D shape contexts [20]. Schlattmann et al. [21] developed a novel scale-space generalization to detect surface features based on the averaged mean curvature flow.

Another kind of feature point detection approach is based on geometric diffusion equations. The heat kernel signature (HKS) can be used for feature point detection, discerning global and local feature points by adjusting the time parameter [3]. Moreover, by integrating the eigenvalues into the eigenvectors of an LBO on a mesh, the global point signature (GPS) was developed in [22] for feature point detection. For 2D surface quadrangulation, Shen et al. [23] used the extrema of the Laplacian eigenfunctions to build a Morse-Smale complex and construct a well-shaped quadrilateral mesh.

In recent years, deep networks [24] have achieved excellent success in feature point detection [25]. Luo et al. [26] proposed a novel face parsing approach by detecting faces at both the part and component levels and then computing pixel-wise label maps, from which the feature points can easily be obtained from the boundaries of the label maps. Sun et al. [27] presented an approach for detecting the positions of face keypoints with three-level, carefully designed convolutional networks. In addition to convolutional neural networks (CNN), recurrent neural networks (RNN) have also been applied to feature point detection, especially facial feature point detection [28, 25, 29].

Data sampling. In the early machine learning field, to solve the imbalanced data problem in classification, over-sampling and under-sampling approaches were proposed. Tomek [30] proposed ’Tomek links’, each formed by two nearest examples belonging to different classes; a Tomek link indicates that one of the two examples is noise or lies on the borderline. One noted over-sampling method is SMOTE [31], which creates synthetic minority-class examples by interpolating among several minority-class examples that lie close together. Hybrid methods have been proposed based on the above two methods, including ’SMOTE-Tomek links’ [32], ’Borderline-SMOTE’ [33], and ’Safe-Level-SMOTE’ [34].

In the deep learning field, to reduce annotation effort and clean the data before training, the following data under-sampling methods were proposed. In biomedical imaging applications, Zhou et al. [35] presented AFT* to seek ’worthy’ samples for annotation, which reduces the annotation cost by half compared with random selection. Vodrahalli et al. [36] used the gradients in stochastic gradient descent (SGD) to measure the importance of training data, demonstrating that a small subsample is indeed sufficient for training in certain cases.

3 High-dimensional data set simplification

In this section, we develop the high-dimensional data set simplification method, which first detects the feature points in the high-dimensional data set, and then uses the feature point set as the simplified data set. Moreover, three metrics are proposed for measuring the fidelity of the simplified data set to the original data set.

Feature point detection on 1D and 2D manifolds typically depends on the curvature information. However, in high-dimensional space, the definition and computation of the curvature is complicated. Hence, feature point detection using curvature information on high-dimensional data sets is infeasible.

As stated above, whatever the dimension of a space is, the discrete LBO defined on the space is a matrix. Because the eigenvectors of a discrete LBO are closely related to the geometric properties of the space where the LBO is defined, and because there are sophisticated methods for computing the eigenvectors of a matrix (which is a discrete LBO), the discrete LBO and its eigenvectors are desirable tools for feature point detection in high-dimensional data sets.

3.1 Spectrum of LBO and Feature Points

Let $M$ be a Riemannian manifold (a differentiable manifold with a Riemannian metric), and let $f\in C^{2}$ be a real-valued function defined on $M$. The LBO $\Delta$ on $M$ is defined as

\Delta f=\mathrm{div}(\mathrm{grad}\ f), \qquad (1)

where $\mathrm{grad}\ f$ is the gradient of $f$, and $\mathrm{div}(\cdot)$ is the divergence on $M$. The LBO (1) can be represented in local coordinates. Let $\psi$ be a local parametrization

\psi:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n+k},

of a sub-manifold of $M$ with

g_{ij}=\langle\partial_{i}\psi,\partial_{j}\psi\rangle,\quad G=(g_{ij}),\quad (g^{ij})=G^{-1},\quad W=\sqrt{\det G},

where $i,j=1,2,\ldots,n$, $\langle\cdot,\cdot\rangle$ is the dot product, and $\det$ is the determinant. In local coordinates, the LBO (1) can be transformed as,

\Delta f=\frac{1}{W}\sum_{i,j}\partial_{i}\left(g^{ij}W\partial_{j}f\right). \qquad (2)

The eigenvalues and eigenfunctions of the LBO (1) (2) can be calculated by solving the Helmholtz equation:

\Delta f=-\lambda f, \qquad (3)

where $\lambda$ is a real number. The spectrum of the LBO is a list of eigenvalues,

0\leq\lambda_{0}\leq\lambda_{1}\leq\lambda_{2}\leq\cdots\leq+\infty, \qquad (4)

with the corresponding eigenfunctions,

\phi_{0}(x),\phi_{1}(x),\phi_{2}(x),\cdots, \qquad (5)

which are orthogonal.

In the case of a closed manifold, the first eigenvalue $\lambda_{0}$ is always zero. Moreover, the eigenvalues $\lambda_{i}, i=0,1,2,\cdots$ (4) specify the discrete frequency domain of an LBO, and the eigenfunctions are extensions of the Fourier basis functions to a manifold. Eigenfunctions of larger eigenvalues contain higher-frequency information. It is well known that the eigenvalues and eigenfunctions of an LBO are closely related to the geometric properties of the manifold on which the LBO is defined, including its curvature [37], the volume of $M$, the volume of $\partial M$, and the Euler characteristic of $M$ [11].

Specifically, consider a heat diffusion equation defined on a closed manifold $M$ (i.e., $\partial M=\emptyset$):

\left\{\begin{array}{l}\left(\Delta+\frac{\partial}{\partial t}\right)u(x,t)=0,\\ u(x,0)=f(x),\end{array}\right. \qquad (6)

where $x\in M$, $u(x,t)$ is the heat value of the point $x$ at time $t$, and $f$ is the initial heat distribution on $M$. The solution of the heat diffusion equation (6) can be expressed as:

u(x,t)=\int_{M}H(x,y,t)f(y)\,dy,

where $H(x,y,t)$ is the fundamental solution of the heat diffusion equation (6), called the heat kernel. The heat kernel $H(x,y,t)$ measures the amount of heat transferred from $x$ to $y$ within time $t$, and can be rewritten as,

H(x,y,t)=\sum_{i=0}^{\infty}e^{-\lambda_{i}t}\phi_{i}(x)\phi_{i}(y), \qquad (7)

where $\lambda_{i}, i=0,1,\cdots$ (4) are the eigenvalues of the LBO (1) (2), and $\phi_{i}, i=0,1,\cdots$ (5) are the corresponding eigenfunctions.

When $x=y$ in Eq. (7), $H(x,x,t)$, called the HKS, quantifies the change of heat at $x$ over time $t$. It has been shown [38, 39, 37] that, as the time $t$ approaches $0^{+}$, the heat kernel $H(x,x,t)$ can be represented as,

H(x,x,t)=\sum_{i=0}^{\infty}e^{-\lambda_{i}t}\phi_{i}^{2}(x)=\frac{1}{4\pi t}+\frac{K(x)}{12\pi}+O(t), \qquad (8)

where $K(x)$ is the scalar curvature at point $x$ on $M$. In particular, when $M$ is a 2-dimensional manifold, $K(x)$ is the Gaussian curvature at $x$.

From Eq. (8), we can see that the eigenfunctions $\phi_{i}(x), i=0,1,\cdots$ (5) of an LBO are closely related to the scalar curvature $K(x)$ at the point $x\in M$.
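As a numerical illustration of Eqs. (7)-(8), the truncated sum below evaluates the HKS $H(x,x,t)$ from the first $k$ computed eigenpairs. This is a minimal Python sketch (the paper's implementation is in C++ with OpenCV and ARPACK; Python is used here only for illustration), and the argument names are assumptions of this sketch:

import numpy as np

def heat_kernel_signature(evals, evecs, t):
    """Truncated HKS of Eq. (8): H(x, x, t) ~ sum_i exp(-lambda_i * t) * phi_i(x)^2.

    evals : (k,) array of the first k eigenvalues of the discrete LBO.
    evecs : (n, k) array whose columns are the corresponding eigenvectors.
    Returns an (n,) array with the truncated HKS value at every data point.
    """
    return (np.exp(-evals * t) * evecs ** 2).sum(axis=1)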

As stated above, the eigenfunctions $\phi_{i}(x), i=0,1,\cdots$ (5) are extensions of the trigonometric functions. In the trigonometric function system, the local extrema of a trigonometric function with a lower frequency persist in trigonometric functions with higher frequencies (Fig. 1). This property is preserved by the eigenfunctions $\phi_{i}(x), i=0,1,\cdots$ (5) on manifolds. In Fig. 2, eigenfunctions of an LBO defined on a 2D manifold are illustrated with their local extrema. We can observe that the local extrema of the lower-frequency eigenfunctions remain local extrema of the higher-frequency eigenfunctions (Fig. 2). Because the local extrema of $\phi_{j}(x)$ persist in $\phi_{i}(x), i>j$, and the heat kernel (8) is a weighted sum of $\phi_{i}^{2}(x), i=0,1,\cdots$, the local extrema of the eigenfunctions $\phi_{i}(x)$, especially those with low frequency, are candidates for the local extrema of the heat kernel $H(x,x,t)$ (8) and the scalar curvature $K(x)$ (8). In conclusion, the local extrema of the eigenfunctions $\phi_{i}(x), i=0,1,\cdots$ capture the features of the manifold on which the LBO is defined. Therefore, in this paper, we use the local extrema of the eigenfunctions of the LBO defined on a high-dimensional data set as its feature points, which constitute the simplified data set.

Figure 1: Local extrema (indicated by red arrows) of trigonometric functions with lower frequencies hold in those with higher frequencies.
Figure 2: Local extrema (the centres of blue and red regions) of the eigenvectors (5) of a discrete LBO defined on a 2D manifold. (a)-(f) Eigenvectors $\phi_{2}(x)$, $\phi_{3}(x)$, $\phi_{6}(x)$, $\phi_{15}(x)$, $\phi_{19}(x)$, $\phi_{31}(x)$.

3.2 Discrete Computation

To solve the Helmholtz equation (3) for calculating the eigenvalues and eigenfunctions of the LBO (2), the Helmholtz equation (3) must be discretized into a linear system of equations. Traditionally, Eq. (3) is discretized on a mesh with connectivity information between mesh vertices [4, 14, 2]. However, in this paper, we address an unorganized data point set in a high-dimensional space, where the connectivity information is missing. Hence, we use the discretization method developed in [14] to construct the discrete LBO. Because only the Euclidean distances between points are involved in the construction of the discrete LBO, it can be employed for LBO discretization in high-dimensional space.

The LBO discretization method [14] consists of two steps:

  • adjacency graph construction,

  • weight computation,

which are explained in detail in the following.

Adjacency graph construction. There are two methods for constructing an adjacency graph: (a) $k$ nearest neighbors (KNN). For any vertex $v$, its $k$ nearest neighbors $v_{i}, i=1,2,\cdots,k$ are determined using the KNN algorithm [40]. Then, the connection between the vertex $v$ and each neighbor $v_{i}, i=1,2,\cdots,k$ is established, thus constructing the adjacency graph. Because the $k$-nearest-neighbor relation is not symmetric, i.e., it is possible that the vertex $v_{i}$ is among the $k$ nearest neighbors of the vertex $v_{j}$ while $v_{j}$ is not among the $k$ nearest neighbors of $v_{i}$, the weight matrix (see below) constructed by the KNN method is not guaranteed to be symmetric. (b) $\epsilon$-neighbors. For a vertex $v$, a hypersphere is first constructed, which uses $v$ as the center and $\epsilon$ as its radius. The adjacency graph is constructed by connecting the vertex $v$ and each vertex lying in the hypersphere. Although the weight matrix constructed based on the $\epsilon$-neighbors is symmetric, a serious problem is that the value of $\epsilon$ is difficult to choose. An overly small $\epsilon$ will cause certain rows of the weight matrix $W$ to be zero, and an overly large $\epsilon$ will create a matrix that is excessively dense. Therefore, in our implementation, we employ the $k$ nearest neighbors to construct the adjacency graph.
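A minimal Python sketch of the KNN adjacency-graph construction follows. The paper's implementation uses the OpenCV FLANN routines in C++; here scikit-learn's NearestNeighbors is used instead, and the parameter k is an illustrative choice of this sketch:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_adjacency(points, k):
    """Return an (n, k) index array: row i lists the k nearest neighbors of point i."""
    # k + 1 because, for distinct points, each point is returned as its own nearest neighbor.
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(points)
    _, idx = nbrs.kneighbors(points)
    return idx[:, 1:]   # drop the self-neighbor in column 0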

Weight computation. Using the adjacency graph for each point $v_{i}$ in the given data set, the connections between point $v_{i}$ and the other points in the data set are established. Based on the adjacency graph, the weights between point $v_{i}$ and each of the other points can be calculated as follows:

w_{ij}=\left\{\begin{array}{ll}-e^{-\frac{\|v_{i}-v_{j}\|^{2}}{t}}, & \text{if } i,j \text{ are adjacent},\\ \sum_{k\neq i}-w_{ik}, & \text{if } i=j,\\ 0, & \text{otherwise}.\end{array}\right. \qquad (9)

Because only the Euclidean distance $\|v_{i}-v_{j}\|$ is required to compute $w_{ij}$, this construction is appropriate for high-dimensional data sets.
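The sketch below assembles the weight matrix of Eq. (9) from the KNN index array above. It is a dense illustration only (a sparse matrix would be used for large data sets), and the value of the heat parameter t is an assumption of this sketch:

import numpy as np

def weight_matrix(points, knn_idx, t):
    """Weights of Eq. (9): negative Gaussian weights on KNN edges, row sums on the diagonal."""
    n = points.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        for j in knn_idx[i]:
            W[i, j] = -np.exp(-np.sum((points[i] - points[j]) ** 2) / t)
    np.fill_diagonal(W, 0.0)                   # guard: no self-edges
    W[np.diag_indices(n)] = -W.sum(axis=1)     # w_ii = sum_{k != i} (-w_ik), as in Eq. (9)
    return W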

As stated above, the weight matrix $\bar{W}=[w_{ij}]$ (9) constructed by the KNN method is not symmetric, so we take the following matrix:

W=\frac{\bar{W}+\bar{W}^{T}}{2}

as the weight matrix, which is a symmetric matrix. Moreover, we construct a diagonal matrix

A=\mathrm{diag}(a_{1},a_{2},\cdots,a_{n}),

where $a_{i}=w_{ii}$ (refer to Eq. (9)). Then, the LBO can be discretized into the matrix,

L=A^{-1}W,

whatever the dimension of the space on which the LBO is defined. Meanwhile, the Helmholtz equation (3) is discretized as,

L\varphi=A^{-1}W\varphi=\lambda\varphi,

where $\lambda$ is an eigenvalue of the Laplace matrix $L$, and $\varphi$ is the corresponding eigenvector. This is equivalent to,

W\varphi=\lambda A\varphi. \qquad (10)

Because the matrix $W$ is symmetric and the matrix $A$ is diagonal, it has been shown that the eigenvalues $\lambda$ are all non-negative real numbers satisfying [3]:

0=\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n},

and the corresponding eigenvectors $\varphi_{i}, i=1,2,\cdots,n$ are orthogonal, i.e.,

\langle\varphi_{i},\varphi_{j}\rangle=\varphi_{i}^{T}\varphi_{j}=0,\quad i\neq j.
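Putting the pieces above together, the generalized eigenproblem (10) can be solved, for example, with SciPy's sparse symmetric eigensolver eigsh (which wraps the same ARPACK library used in the paper and computes only the smallest few eigenpairs). This is a sketch under the assumption that the matrices fit comfortably in memory:

import numpy as np
from scipy.sparse import csr_matrix, diags
from scipy.sparse.linalg import eigsh

def lbo_eigenpairs(W_bar, k):
    """Solve W * phi = lambda * A * phi (Eq. (10)) for the k smallest eigenpairs."""
    W = 0.5 * (W_bar + W_bar.T)        # symmetrized weight matrix of Section 3.2
    A = diags(W.diagonal())            # A = diag(w_11, ..., w_nn), positive entries
    # A small negative shift avoids factorizing the singular matrix at the zero eigenvalue.
    evals, evecs = eigsh(csr_matrix(W), k=k, M=A, sigma=-1e-8, which="LM")
    return evals, evecs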

3.3 Data Simplification by Feature Point Detection

By solving Eq. (10), the eigenvalues $\lambda_{1}\leq\lambda_{2}\leq\cdots\leq\lambda_{n}$ and the corresponding eigenvectors $\varphi_{i}, i=1,2,\cdots,n$ are determined. The local maximum, minimum, and saddle points of the eigenvectors are used as the feature points. As stated above, the eigenvalue $\lambda_{i}$ measures the ’frequency’ of its corresponding eigenvector $\varphi_{i}$: the larger the eigenvalue, the higher the frequency of the eigenvector $\varphi_{i}$, and thus the greater the number of feature points of the eigenvector $\varphi_{i}$. The data simplification algorithm first detects the feature points of the eigenvector $\varphi_{1}$ and adds them to the simplified data set. Then, it detects and adds the feature points of $\varphi_{2}$ to the simplified data set. This procedure is performed iteratively until the simplified data set satisfies a preset fidelity threshold to the original data point set, according to the metrics defined in Section 3.4.

Because each of the eigenvectors $\varphi_{i}, i=1,2,\cdots,n$ is a scalar function defined on an unorganized point set in a high-dimensional space, the detection of the maximum, minimum, and saddle points of such a function is not straightforward. Suppose $\omega(x)$ is a scalar function defined on each point of an unorganized data point set. For each point $x$ in the data set, we construct its neighborhood $N_{x}$ using the KNN algorithm discussed previously in Section 3.2. The methods for detecting the maximum, minimum, and saddle points of the function $\omega(x)$ are elucidated in the following.

Maximum and minimum point detection: For a data point $x$ in the data set, if $\omega(y)<\omega(x)$, $\forall y\in N_{x}$, the data point $x$ is a local maximum point of $\omega(x)$. Conversely, if $\omega(y)>\omega(x)$, $\forall y\in N_{x}$, the data point $x$ is a local minimum point of $\omega(x)$.
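A direct Python transcription of this definition, assuming the scalar function is given as an array omega indexed like the rows of the (n, k) KNN index array from Section 3.2:

import numpy as np

def local_extrema(omega, knn_idx):
    """Indices of the local maxima and minima of omega over the KNN neighborhoods."""
    maxima, minima = [], []
    for i, nbrs in enumerate(knn_idx):
        if np.all(omega[nbrs] < omega[i]):
            maxima.append(i)
        elif np.all(omega[nbrs] > omega[i]):
            minima.append(i)
    return maxima, minima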

Saddle point detection: To detect the saddle points of $\omega(x)$ defined on an unorganized high-dimensional data point set, the points of the set are projected into a 2D plane using a DR method [14]. The projected data point is still denoted as $x$. For a point $x$, the points in its KNN neighborhood $N_{x}$ comprise its one-ring neighbor $R_{x}$ (see Fig. 3). Each point $x_{r}$ in $R_{x}$ has a sole preceding point $x_{r}^{p}$ and a sole succeeding point $x_{r}^{n}$ in the counter-clockwise direction. As illustrated in Fig. 3, if $\omega(x)\geq\omega(x_{r})$, the point $x_{r}$ is labeled as $\ominus$; otherwise, it is labeled as $\oplus$. Now, we initialize a counter $sum=0$ and traverse the one-ring neighbor $R_{x}$ of point $x$ from an arbitrary point. If the signs of two neighboring points in $R_{x}$ differ (marked with dotted lines), $sum$ is increased by one. Finally, if $sum\geq 4$, point $x$ is a saddle point; otherwise, it is not (Fig. 3). Examples of saddle points and non-saddle points are demonstrated in Fig. 3.
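A sketch of this saddle test follows, under the assumption that the data points have already been projected to 2D by a DR method (array proj, one row per point); the one-ring neighbors are ordered counter-clockwise by their angle around $x$, and sign changes between consecutive neighbors are counted as in Fig. 3:

import numpy as np

def is_saddle(i, omega, knn_idx, proj):
    """True if point i is a saddle of omega (>= 4 sign changes around its one-ring).

    knn_idx : (n, k) integer array of KNN indices; proj : (n, 2) array of 2D coordinates.
    """
    ring = knn_idx[i]
    # Order the one-ring counter-clockwise by angle in the 2D projection.
    angles = np.arctan2(proj[ring, 1] - proj[i, 1], proj[ring, 0] - proj[i, 0])
    ring = ring[np.argsort(angles)]
    # Label each neighbor: +1 if omega(x_r) > omega(x), -1 otherwise.
    signs = np.where(omega[ring] > omega[i], 1, -1)
    # Count sign changes between consecutive neighbors (the ring is cyclic).
    changes = np.count_nonzero(signs != np.roll(signs, 1))
    return changes >= 4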

Figure 3: Saddle point detection. Dotted lines indicate the change of signs between neighboring points. While the points $x$ in (a) and (b) are not saddle points, the points $x$ in (c) and (d) are saddle points.

3.4 Measurement of Fidelity of the Simplified Data Set

As stated in Section 3.3, the data simplification method detects the feature points of the eigenvectors $\varphi_{i}, i=1,2,\cdots,n$, and adds them to the simplified data set iteratively, until the simplified data set satisfies a preset threshold of fidelity to the original data set. In this section, three metrics are devised to measure the fidelity of the simplified data set $\mathcal{S}$ to the original high-dimensional data set $\mathcal{H}$; these are the Kullback-Leibler divergence (KL divergence)-based metric [41], the Hausdorff distance [42], and the determinant of the covariance matrix [43]. They measure the fidelity of the simplified data set $\mathcal{S}$ to the original data set $\mathcal{H}$ from three aspects:

  • entropy of information (KL-divergence-based metric),

  • distance between two sets (Hausdorff distance), and

  • ’volume’ of a data set (determinant of covariance matrix).

KL-divergence-based metric. KL-divergence [41], also called relative entropy, is typically employed to measure the difference between two probability distributions $P=\{p_{i}\}$ and $Q=\{q_{i}\}$, i.e.,

D_{KL}(P\|Q)=\sum_{i}p_{i}\log\left(\frac{p_{i}}{q_{i}}\right). \qquad (11)

To define the KL-divergence-based metric, the original data set $\mathcal{H}$ and the simplified data set $\mathcal{S}$ are represented as an $n\times d$ matrix $M_{\mathcal{H}}$ and an $m\times d$ matrix $M_{\mathcal{S}}$, where $n$ and $m$ ($n>m$) are the numbers of data points in $\mathcal{H}$ and $\mathcal{S}$, respectively. Each row of the matrices $M_{\mathcal{H}}$ and $M_{\mathcal{S}}$ stores a data point in the $d$-dimensional Euclidean space. First, we calculate the probability distribution $P^{(k)}=\{p^{(k)}_{i}\}, k=1,2,\cdots,d$ for each dimension of the data points in the original data set $\mathcal{H}$, corresponding to the $k^{th}$ column of matrix $M_{\mathcal{H}}$. To do this, the $k^{th}$ column elements of matrix $M_{\mathcal{H}}$ are normalized into the interval $[0,1]$, which is divided into $l$ subintervals (in our implementation, $l=100$). By counting the number of elements of the $k^{th}$ column of matrix $M_{\mathcal{H}}$ whose values lie in each of the $l$ subintervals, the value distribution histogram is generated, which is then used as the probability distribution $P^{(k)}=\{p^{(k)}_{i}, i=1,2,\cdots,l\}$ of the $k^{th}$ column of matrix $M_{\mathcal{H}}$, $k=1,2,\cdots,d$. Similarly, the probability distribution $Q^{(k)}=\{q^{(k)}_{i}, i=1,2,\cdots,l\}$ of the $k^{th}$ column of matrix $M_{\mathcal{S}}$ can be calculated. Then, the KL-divergence for the $k^{th}$ column elements of matrices $M_{\mathcal{H}}$ and $M_{\mathcal{S}}$ is

D_{KL}(P^{(k)}\|Q^{(k)})=\sum_{i}p^{(k)}_{i}\log\left(\frac{p^{(k)}_{i}}{q^{(k)}_{i}}\right).

The KL-divergence-based metric is then defined as

d_{KL}(\mathcal{H},\mathcal{S})=\sqrt{\sum_{k=1}^{d}\left(D_{KL}(P^{(k)}\|Q^{(k)})\right)^{2}}. \qquad (12)

The KL-divergence-based metric (12) transforms the high-dimensional data set into its probability distribution, and measures the difference between the probability distributions of the original and simplified data sets. As the simplified data set $\mathcal{S}$ becomes increasingly closer to the original data set $\mathcal{H}$, the KL-divergence-based metric $d_{KL}$ becomes increasingly smaller. When $\mathcal{S}=\mathcal{H}$, $d_{KL}=0$.
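A sketch of $d_{KL}$ (12) under the histogram construction described above; each column is binned over the range of the original data set, and a small constant eps is added to the histograms (an assumption of this sketch, not stated in the paper) so that the logarithm is defined when a bin of $Q^{(k)}$ is empty:

import numpy as np

def d_kl(H, S, l=100, eps=1e-12):
    """KL-divergence-based metric of Eq. (12) between data sets H (n x d) and S (m x d)."""
    total = 0.0
    for k in range(H.shape[1]):
        lo, hi = H[:, k].min(), H[:, k].max()
        if hi == lo:                 # constant column: identical distributions, zero contribution
            continue
        bins = np.linspace(lo, hi, l + 1)
        p = np.histogram(H[:, k], bins=bins)[0].astype(float)
        q = np.histogram(S[:, k], bins=bins)[0].astype(float)
        p = p / p.sum() + eps        # eps smoothing avoids division by zero in empty bins
        q = q / q.sum() + eps
        total += np.sum(p * np.log(p / q)) ** 2
    return np.sqrt(total)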

Hausdorff distance. The Hausdorff distance [42] measures how far two sets are from each other. Suppose $\mathcal{X}$ and $\mathcal{Y}$ are two data sets in $d$-dimensional space, and denote by

D(\mathcal{X},\mathcal{Y})=\sup_{x\in\mathcal{X}}\inf_{y\in\mathcal{Y}}d(x,y),

the maximum distance from an arbitrary point in set $\mathcal{X}$ to set $\mathcal{Y}$, where $d(x,y)$ is the Euclidean distance between two points $x$ and $y$. The Hausdorff distance between the original data set $\mathcal{H}$ and the simplified data set $\mathcal{S}$ is defined as

d_{H}(\mathcal{H},\mathcal{S})=\max\{D(\mathcal{H},\mathcal{S}),D(\mathcal{S},\mathcal{H})\}=\max\left\{\sup_{x\in\mathcal{H}}\inf_{y\in\mathcal{S}}d(x,y),\ \sup_{y\in\mathcal{S}}\inf_{x\in\mathcal{H}}d(x,y)\right\}.

Note that the simplified data set $\mathcal{S}$ is a subset of the original data set $\mathcal{H}$, so $D(\mathcal{S},\mathcal{H})=0$. Therefore, the Hausdorff distance between $\mathcal{H}$ and $\mathcal{S}$ can be calculated by

d_{H}(\mathcal{H},\mathcal{S})=D(\mathcal{H},\mathcal{S})=\sup_{x\in\mathcal{H}}\inf_{y\in\mathcal{S}}d(x,y). \qquad (13)

That is, the Hausdorff distance is the maximum distance from an arbitrary point in the original data set $\mathcal{H}$ to the simplified data set $\mathcal{S}$.
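Because $\mathcal{S}\subset\mathcal{H}$, only the one-sided distance of Eq. (13) is needed. A sketch using a k-d tree over the simplified set (scipy.spatial.cKDTree, a choice of this sketch; exact in any dimension, though k-d trees become slow for very high-dimensional data):

import numpy as np
from scipy.spatial import cKDTree

def d_hausdorff(H, S):
    """One-sided Hausdorff distance of Eq. (13): max over x in H of the distance to S."""
    dists, _ = cKDTree(S).query(H)   # nearest-neighbor distance from every x in H to S
    return dists.max()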

Determinant of covariance matrix. The covariance matrix is frequently employed in statistics and probability theory to measure the joint variability of several variables [43]. Suppose

X_{i}=(x_{i1},x_{i2},\cdots,x_{id}),\quad i=1,2,\cdots,m,

are $m$ samples in a $d$-dimensional sample space $X$, and denote

\bar{X}_{j}=\frac{\sum_{i=1}^{m}x_{ij}}{m},\quad j=1,2,\cdots,d,

where $\bar{X}_{j}$ is the mean value of the $j$-th random variable. The covariance $c_{ij}$ between the $i$-th and $j$-th random variables is

c_{ij}=\frac{1}{m}\sum_{k=1}^{m}(X_{ki}-\bar{X}_{i})(X_{kj}-\bar{X}_{j}),

which is the $(i,j)^{th}$ element of the covariance matrix $Cov(X)$ of the data set $X$. According to [43], the determinant of the covariance matrix of $X$, i.e., $\det(Cov(X))$, can express the ’volume’ of the data set $X$. As the simplified data set $\mathcal{S}$ approaches the original data set $\mathcal{H}$, they are expected to have increasingly close ’volumes’, namely, determinants of covariance matrices. Therefore, the difference of the determinants of the covariance matrices of $\mathcal{H}$ and $\mathcal{S}$:

d_{COV}(\mathcal{H},\mathcal{S})=\left|\det(Cov(\mathcal{H}))-\det(Cov(\mathcal{S}))\right|, \qquad (14)

can measure the fidelity of the simplified data set $\mathcal{S}$ to the original data set $\mathcal{H}$.
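A sketch of $d_{COV}$ (14); np.cov uses the unbiased $1/(m-1)$ normalization by default, so bias=True is passed to match the $1/m$ factor in the definition above:

import numpy as np

def d_cov(H, S):
    """Difference of covariance determinants, Eq. (14). H and S are (points x dimensions)."""
    cov_H = np.cov(H, rowvar=False, bias=True)
    cov_S = np.cov(S, rowvar=False, bias=True)
    # For very high-dimensional data the determinant may under- or overflow;
    # np.linalg.slogdet is a more robust alternative in that case.
    return abs(np.linalg.det(cov_H) - np.linalg.det(cov_S))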

4 Results, Discussion, and Applications

In this section, the proposed method is employed to simplify four well-known high-dimensional data sets (refer to Fig. 4): the swiss roll data set [44] (Fig. 4) with 2000 points in 3D space, the MNIST handwritten digit data set [45] (Fig. 4), which contains 60000 images ($28\times 28$) of handwritten digits from ’0’ to ’9’, the human face data set [44] (Fig. 4), which includes 698 images ($64\times 64$) of a plaster head sculpture, and the CIFAR-10 classification data set [46] (Fig. 4) with 50000 RGB images ($32\times 32\times 3$). The fidelity of the simplified data sets to the original data sets is measured with the three metrics described in Section 3.4. Moreover, two applications of the simplified data set are also demonstrated: the speedup of DR and training data simplification in supervised learning.

The proposed method is implemented using C++ with OpenCV and ARPACK [47], and executes on a PC with a 3.6 GHz Intel Core i7-4790 CPU, a GTX 1060 GPU, and 16 GB memory. We employed the OpenCV function ’cv::flann’ to perform the KNN operation [40] and ARPACK to solve the sparse symmetric matrix eigen-equation (10).

Figure 4: Snapshots of the data sets employed in this paper. (a) Swiss roll. (b) MNIST. (c) Human face. (d) CIFAR-10.

4.1 Simplification Results

In this section, we will illustrate several data simplification results.

Figure 5: The intuitive data simplification procedure of the swiss roll data set. In (a)-(e), the simplified data sets consist of 24, 62, 397, 612, and 1286 data points, generated by detecting the feature points of the first 5, 10, 30, 50, and 100 LBO eigenvectors, respectively. (f) The original data set with 2000 points.

Swiss roll: In Fig. 5, we use the swiss roll data set in 3D space to illustrate the intuitive simplification procedure of the proposed data simplification algorithm. In Fig. 5(a), the simplified data set is generated by detecting the feature points of the first five eigenvectors, which captures the critical points at the head and tail positions of the roll (marked in red and blue colors). In Figs. 5(b)-5(e), by detecting higher-frequency eigenvectors, increasingly more feature points are added into the simplified data set and it approaches increasingly closer to the original data set (Fig. 5(f)). Note that the simplified data set in Fig. 5(c), which contains 397 data points (approximately 20% of the points of the entire data set), includes the majority of the information of the original data set. Please refer to Section 4.2 for the quantitative fidelity analysis of the simplified data set to the original data set, measured by the metrics $d_{KL}$ (12), $d_{H}$ (13), and $d_{COV}$ (14).

Figure 6: Simplification of the MNIST data set. (a-d): The simplified data sets obtained by detecting the first 5, 10, 20, and 30 LBO eigenvectors, which consist of 82, 120, 196, and 280 data points, respectively. (e,f): The DR result (using T-SNE [48]) of the original data set, illustrated by colorful digits (e) and the original images (f).

MNIST: For clarity of the visual analysis, we selected 900 gray images of the digits ’0’ to ’4’ from the MNIST data set, and resized each image to $8\times 8$. The selected handwritten digit images were projected into the 2D plane using the DR method T-SNE [48]. Whereas in Fig. 6(e), colorful digits are placed at the projected 2D points, in Fig. 6(f), the images themselves are located at the projected points. The simplified data sets with 82, 120, 196, and 280 data points, which were generated by detecting the feature points of the first 5, 10, 20, and 30 LBO eigenvectors, are displayed in Figs. 6(a)-6(d). For illustration, their positions in the 2D plane are extracted from the dimensionality-reduced data set (Fig. 6(e)). In the simplified data set in Fig. 6(a), which has 82 data points (approximately 9% of the points of the original data set), all five categories of digits are represented. In the simplified data set in Fig. 6(c), which has 196 data points (approximately 22% of the points of the original data set), virtually all of the key features of the data set are represented, including the subset of digit ’2’ in the class of digit ’1’. In Fig. 6(d), the simplified data set contains 280 data points (approximately 31% of the points of the original data set). We can observe that the boundary of each class of data points has formed (Fig. 6(d)), and the geometric shape of each class is similar to that of the corresponding class in the original data set (Fig. 6(e)).

Figure 7: The simplified data sets of the human face data set, after DR by PCA [49]. Each simplified data set is comprised of the images marked with red frames. (a-e) The simplified data sets obtained by detecting the first 5, 10, 20, 30, and 50 LBO eigenvectors, which consist of 16, 40, 95, 150, and 270 data points, respectively. (f) The feature points of the first 5 eigenvectors.

Human face: The human face data set (Fig. 4) contains 698 gray images ($64\times 64$) of a plaster head sculpture. The images of the data set were captured by adjusting three parameters: up-down pose, left-right pose, and lighting direction. Therefore, the images in the data set can be regarded as distributed on a 3D manifold. Thus, as illustrated in Fig. 7, after the human face data set is dimensionally reduced into a 2D plane using PCA [49], it can be observed that the longitudinal axis represents the sculpture’s rotation in the right-left direction, and the horizontal axis indicates the change of lighting.

In Fig. 7, the images marked with red frames constitute the simplified data set. The simplified data sets in Figs. 7(a)-7(e), which consist of 16, 40, 95, 150, and 270 data points, are generated by detecting the feature points of the first 5, 10, 20, 30, and 50 LBO eigenvectors, respectively. In the simplified data set in Fig. 7(a), the critical points at the boundary are captured. In Figs. 7(b)-7(e), increasingly more critical points at the boundary and in the interior are added into the simplified data sets. In Fig. 7(e), virtually all of the images at the boundary are added into the simplified set. For clarity, we display the feature points of the first five LBO eigenvectors in Fig. 7(f), which indeed capture several ’extreme’ images in the human face data set. Specifically, the two feature points of the first eigenvector, $\phi_{1}$, are two images with ’leftmost pose + most lighting’ and ’rightmost pose + most lighting’. In the feature point set of the second eigenvector, $\phi_{2}$, an image with ’leftmost pose + least lighting’ is added. Similarly, in the feature point sets of the eigenvectors $\phi_{3}$, $\phi_{4}$, and $\phi_{5}$, there are ’extreme’ images of the pose and lighting (Fig. 7(f)).

CIFAR-10: The data set CIFAR-10 (Fig. 4) was also simplified by the proposed method. Similarly, increasingly more critical points of the data set were added into the simplified data set as a consequence of the simplification procedure.

The fidelity of the simplified data sets is quantitatively measured in Section 4.2. In Table I, the running time in seconds of the proposed method is listed. It can be seen that the operations of KNN and eigen-solving consume the majority of the time, and the data simplification time itself ranged from 0.035 to 6.544 seconds.

TABLE I: Running time (seconds) of the developed method
Data set | Data size | KNN | Eigen | Simplification
Swiss roll | 2000 × 3 | 5.535 | 6.986 | 0.044
MNIST | 60000 × 784 | 57.579 | 178.933 | 6.544
Human face | 698 × 4096 | 3.604 | 1.079 | 0.035
CIFAR-10 | 50000 × 3072 | 115.919 | 146.113 | 4.759

4.2 Quantitative Measurement of the Fidelity

In this section, the fidelity of the simplified data set $\mathcal{S}$ to the original data set $\mathcal{H}$ is quantitatively measured by the three metrics developed in Section 3.4, i.e., the KL-divergence-based metric $d_{KL}(\mathcal{H},\mathcal{S})$ (12), the Hausdorff distance $d_{H}(\mathcal{H},\mathcal{S})$ (13), and the determinant of the covariance matrix $d_{COV}(\mathcal{H},\mathcal{S})$ (14). Specifically, for the data simplification procedure of each of the four data sets mentioned above, a series of simplified data sets $\mathcal{S}_{i}, i=1,2,\cdots,n$ was first generated. Then, we calculated the simplification rate $r_{i}$, i.e., the ratio between the cardinality of each data set $\mathcal{S}_{i}$ and that of the original data set $\mathcal{H}$, i.e.,

r_{i}=\frac{Card(\mathcal{S}_{i})}{Card(\mathcal{H})}, \qquad (15)

and the three metrics

d_{KL}(\mathcal{H},\mathcal{S}_{i}),\ d_{H}(\mathcal{H},\mathcal{S}_{i}),\ \text{and}\ d_{COV}(\mathcal{H},\mathcal{S}_{i}),\quad i=1,2,\cdots,n,

which were normalized into $[0,1]$. Finally, the diagrams of $r_{i}$ v.s. $d_{KL}(\mathcal{H},\mathcal{S}_{i})$, $r_{i}$ v.s. $d_{H}(\mathcal{H},\mathcal{S}_{i})$, and $r_{i}$ v.s. $d_{COV}(\mathcal{H},\mathcal{S}_{i})$, $i=1,2,\cdots,n$, were plotted, as displayed in Figs. 8, 9, and 10, respectively.

Figure 8: Diagram of $r_{i}=\frac{Card(\mathcal{S}_{i})}{Card(\mathcal{H})}$ v.s. the normalized $d_{KL}(\mathcal{H},\mathcal{S}_{i}), i=1,2,\cdots,n$.

KL-divergence-based metric: $d_{KL}(\mathcal{H},\mathcal{S})$ (12) measures the similarity of the distributions of the simplified data set $\mathcal{S}$ and the original data set $\mathcal{H}$. If $\mathcal{S}$ is exactly the same as $\mathcal{H}$, then $d_{KL}(\mathcal{H},\mathcal{S})=0$.

Fig. 8 displays the diagrams of $r_{i}=\frac{Card(\mathcal{S}_{i})}{Card(\mathcal{H})}$ v.s. the normalized $d_{KL}(\mathcal{H},\mathcal{S}_{i}), i=1,2,\cdots,n$ of the four data sets. It can be observed that in the simplification procedures of all four data sets, the metric $d_{KL}(\mathcal{H},\mathcal{S})$ is rapidly reduced to a small value (less than 0.1) at approximately $r_{i}=0.1$, i.e., when the simplified data set contains approximately 10% of the data points of the original data set. Using the data set swiss roll as an example (blue line), when the simplified data set $\mathcal{S}$ is composed of 5% of the data points of the original data set $\mathcal{H}$ (i.e., $r_{i}=0.05$), the normalized $d_{KL}(\mathcal{H},\mathcal{S})$ is less than 0.1; when $r_{i}=0.2$ ($\mathcal{S}$ contains 20% of the data points of $\mathcal{H}$), the normalized $d_{KL}(\mathcal{H},\mathcal{S})$ is close to 0, meaning that the simplified data set $\mathcal{S}$ contains virtually all of the information embraced by the original data set $\mathcal{H}$.

Figure 9: Diagram of $r_{i}=\frac{Card(\mathcal{S}_{i})}{Card(\mathcal{H})}$ v.s. the normalized $d_{H}(\mathcal{H},\mathcal{S}_{i}), i=1,2,\cdots,n$.

The Hausdorff distance $d_{H}(\mathcal{H},\mathcal{S})$ (13) measures the geometric distance between two sets. In Fig. 9, the diagram of

r_{i}=\frac{Card(\mathcal{S}_{i})}{Card(\mathcal{H})}\ \text{v.s. the normalized}\ d_{H}(\mathcal{H},\mathcal{S}_{i}),\quad i=1,2,\cdots,n,

is illustrated. Because the Hausdorff distances between the original data set $\mathcal{H}$ and two adjacent simplified data sets, e.g., $d_{H}(\mathcal{H},\mathcal{S}_{i})$ and $d_{H}(\mathcal{H},\mathcal{S}_{i+1}), i=1,2,\cdots,n-1$, are possibly the same, the diagrams demonstrated in Fig. 9 are step-shaped. Based on Fig. 9, in the data simplification procedures of the four data sets, the Hausdorff distances $d_{H}(\mathcal{H},\mathcal{S})$ decrease as increasingly more data points are contained in the simplified data sets. Moreover, the convergence speed of $d_{H}(\mathcal{H},\mathcal{S}_{i})$ with respect to $r_{i}$ (15) is related to the intrinsic dimension of the data set: the lower the intrinsic dimension of a data set, the faster the convergence speed. For example (refer to Fig. 9), the intrinsic dimension of the data set swiss roll is 2 (the smallest of the four data sets), and the convergence speed of the diagram for the simplification procedure of swiss roll is the fastest. The intrinsic dimension of the data set human face is 3, greater than that of swiss roll, and the convergence speed of its diagram is slower than that of swiss roll. The intrinsic dimensions of the data sets MNIST and CIFAR-10 are greater than those of swiss roll and human face, and the convergence speeds of their diagrams are the slowest.

Figure 10: Diagram of $r_{i}=\frac{Card(\mathcal{S}_{i})}{Card(\mathcal{H})}$ v.s. the normalized $d_{COV}(\mathcal{H},\mathcal{S}_{i}), i=1,2,\cdots,n$.

The determinant of the covariance matrix of a data set reflects its ’volume’. The metric $d_{COV}(\mathcal{H},\mathcal{S})$ (14) measures the difference between the ’volume’ of the original data set $\mathcal{H}$ and that of the simplified data set $\mathcal{S}$. In Fig. 10, the diagram of $r_{i}$ (15) v.s. the normalized $d_{COV}(\mathcal{H},\mathcal{S}_{i}), i=1,2,\cdots,n$ is illustrated. In the region with small $r_{i}$ (15), the simplified data set contains a small number of data points and the diagrams oscillate. This is because the ’volume’ of a simplified data set with a small number of data points is heavily influenced by each of its points, and adding merely one point into the set can change its ’volume’ significantly. However, with increasingly more data points inserted into the simplified data set (i.e., with increasingly greater $r_{i}$), the diagrams of all four data sets converge to zero. Considering the data set CIFAR-10 as an example, when the simplified data set contains 20% of the data points of the original data set ($r_{i}=0.2$), the metric $d_{COV}(\mathcal{H},\mathcal{S})$ (14) is nearly zero, meaning that the ’volume’ of the simplified data set has approached that of the original data very closely.

4.3 Applications

In this section, two applications of the proposed data set simplification method are demonstrated; these are the speedup of DR and training data simplification in supervised learning.

Speedup of DR: The computation of DR algorithms for a high-dimensional data set is typically complicated. Given an original data set $\mathcal{H}$ ($Card(\mathcal{H})=n$), if we first perform the DR on the simplified data set $\mathcal{S}$ ($Card(\mathcal{S})=m$), and then employ the ’out-of-sample’ algorithm [50] to calculate the DR coordinates of the remaining data points, the computation requirement can be reduced significantly. Because the simplified data set $\mathcal{S}$ captures the feature points of the original data set, the result generated by DR on the simplified set + ’out-of-sample’ is similar to that generated by DR on the original set.
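A sketch of the ’DR on simplified set + out-of-sample’ strategy for the PCA case used in this section; for PCA, the natural out-of-sample extension is simply projection onto the components fitted on the simplified set (the general out-of-sample algorithm of [50] covers other DR methods), so this sketch is only an illustration of the idea:

import numpy as np
from sklearn.decomposition import PCA

def dr_via_simplified(H, feature_idx, n_components=2):
    """Fit PCA on the simplified subset of H, then project all points of H with it."""
    S = H[feature_idx]                              # simplified data set (feature points only)
    pca = PCA(n_components=n_components).fit(S)
    coords_S = pca.transform(S)                     # DR on the simplified set
    coords_all = pca.transform(H)                   # 'out-of-sample' coordinates for the rest
    return coords_S, coords_all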

Figure 11: Comparison between the DR results on the simplified data sets and those on the original data set human face. The DR results on the simplified data sets $\mathcal{S}_{10}$, $\mathcal{S}_{20}$, $\mathcal{S}_{50}$ are illustrated in (a, c, e), respectively. For comparison, DR is performed on the original data set human face, and the subsets corresponding to the points in $\mathcal{S}_{10}$, $\mathcal{S}_{20}$, $\mathcal{S}_{50}$ are extracted and demonstrated in (b, d, f).

First, we compare the results of DR on the simplified data sets with those of DR on the original data set. Here, the data set human face is used as the original set and its simplified data sets are $\mathcal{S}_{10},\mathcal{S}_{20},\mathcal{S}_{50}$, consisting of the feature points of the first 10, 20, and 50 LBO eigenvectors, respectively. In Figs. 11(a), 11(c), and 11(e), the DR results (by PCA) on the simplified data sets $\mathcal{S}_{10},\mathcal{S}_{20},\mathcal{S}_{50}$ are presented, respectively. For comparison, the DR (by PCA) of the original data set human face was performed, and the subsets of the DR results corresponding to the points in $\mathcal{S}_{10},\mathcal{S}_{20},\mathcal{S}_{50}$ are extracted and demonstrated in Figs. 11(b), 11(d), and 11(f). It can be observed that the shapes and key geometric structures in the DR results of the original data set hold in the DR results of the simplified data sets.

Figure 12: Speedup of DR using the method of ’DR on simplified set + out-of-sample’. (a) Result generated by ’DR on simplified set + out-of-sample’, where the images with red frames are the DR result on the simplified set $\mathcal{S}_{10}$. (b) Result by ’DR on original set’. (c) The above two results displayed in an overlapping manner. (d) The distribution of the correspondence error.

Beginning with the DR result on the simplified data set $\mathcal{S}_{10}$ (Fig. 11(a)), the DR coordinates of the remaining points in the original data set human face were calculated by the ’out-of-sample’ algorithm [50]. The DR result thus generated is illustrated in Fig. 12(a), where the images marked with red frames are the DR results of $\mathcal{S}_{10}$, and the positions of the other images are calculated by the ’out-of-sample’ algorithm [50]. The DR result on the original set human face is demonstrated in Fig. 12(b). For clarity of visual comparison, the two results in Figs. 12(a) and 12(b) are displayed in an overlapping manner in Fig. 12(c). The correspondence errors between the two DR results, i.e., the distances between the two DR coordinates of the same data point, were counted and displayed in Fig. 12(d), where the $x$-axis is the interval of the correspondence error and the $y$-axis is the number of points with correspondence errors in each interval. Because the simplified data set $\mathcal{S}_{10}$ captures the feature points of the original data set, the DR result by ’DR on simplified set + out-of-sample’ is similar to that by ’DR on original set’. They have a highly overlapping distribution and small correspondence errors.

TABLE II: Speedup of DR (seconds).
Data set | DR (PCA) | out-of-sample | Total
Human face | 7.975 | / | 7.975
Simplified data | 0.079 | 0.001 | 0.080

In Table II, the running times of the methods ’DR on simplified set + out-of-sample’ and ’DR on original set’ are listed. Whereas the method ’DR on original set’ required 7.975 seconds, the method ’DR on simplified set + out-of-sample’ required only 0.080 seconds, a run-time savings of two orders of magnitude.

Training data simplification: With the increasing amount of training data used in supervised learning [51], the cost of manual annotation has become increasingly expensive. Moreover, repeated samples or ’bad’ samples can be added into the training data set, adding confusion and complication to the training data and influencing the convergence of the training. Therefore, in supervised and active learning [52], simplifying the training data set by retaining the feature points and deleting the repeated or ’bad’ samples can

  1. lighten the burden of manual annotation,

  2. reduce the number of epochs, and

  3. save storage space.

Figure 13: Diagram of simplification rate v.s. test accuracy.

To validate the superiority of the simplified data set in supervised learning, we constructed a CNN to perform an image classification task. The CNN consisted of two convolutional layers followed by two fully connected layers. Cross-entropy was used as the loss function, and Adam [53] was employed to minimize it.
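The paper specifies only the layer counts, the cross-entropy loss, and the Adam optimizer; the PyTorch sketch below fills in illustrative choices (channel counts, kernel sizes, pooling, hidden width, learning rate) that are assumptions of this sketch, shown for the MNIST input size of $28\times 28$:

import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Two convolutional layers followed by two fully connected layers (MNIST, 10 classes)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28 -> 14
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 7 * 7, 128), nn.ReLU(), nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
criterion = nn.CrossEntropyLoss()                          # cross-entropy loss, as in the paper
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam optimizer [53]

def train_step(images, labels):
    """One optimization step on a batch drawn from the (simplified) training set."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()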

First, we studied the relationship between the simplification rate (15) and test accuracy with two experiments. In the first experiment, the data set MNIST was employed, which contains 60000 training images and 10000 test images. In the second experiment, we used the data set CIFAR-10 with 50000 training images and 10000 test images. In the two experiments, the training image set was first simplified using the proposed method; then, the CNN was trained for 10000 epochs using the simplified training sets. Finally, the test accuracy was calculated on the test images. In Fig. 13, the diagram of the simplification rate $r_{i}$ v.s. test accuracy is illustrated. For the data set MNIST, when $r_{i}=0.177$ (the simplified training set contained nearly 10000 images), the test accuracy reached 97.4%, which was close to the test accuracy of 98.98% obtained using the original training data (60000 images). For the data set CIFAR-10, when $r_{i}=0.584$ and the simplified training data set contained nearly 29000 images, the test accuracy was 78.2%, whereas using the original training images (50000 images), the test accuracy was 80.2%. Therefore, with the proposed simplification method, the simplified data was sufficient for training in the two image classification tasks.

Figure 14: Diagrams of the number of epochs v.s. test accuracy. The arrows indicate the places where the training processes stop.

Furthermore, with the simplified training set, the training process can terminate earlier than with the original training set, thus reducing the number of epochs. In Fig. 14, the diagrams of the number of epochs v.s. test accuracy are illustrated. The ’early stopping’ criterion proposed in [54] was used to determine the stopping time (indicated by arrows). For the data set MNIST, the training process with the original training set (including 60000 images) stopped at 12000 epochs, with a test accuracy of 99.07%; with the simplified training set (including 2991 images), the training process stopped as early as 6000 epochs, with a test accuracy of 95.25%. For the data set CIFAR-10, the training process stopped at 8000 epochs with the simplified training set (including 10630 images), whereas the training process with the original training set (including 50000 images) stopped at 11500 epochs. Therefore, with the simplified training data set, the number of training epochs can be reduced significantly.

5 Conclusion

In this paper, we proposed a big data simplification method that detects the feature points of the eigenvectors of an LBO defined on the data set. Specifically, we first clarified the close relationship between the feature points of the eigenvectors of an LBO and the scalar curvature of the manifold on which the LBO is defined. Then, we proposed methods for detecting the feature points of the eigenvectors of an LBO defined on an unorganized high-dimensional data set. The simplified data set consists of these feature points. Moreover, three metrics were developed to measure the fidelity of the simplified data set to the original data set from three perspectives, i.e., information theory, geometry, and statistics. Finally, the proposed method was validated by simplifying several high-dimensional data sets, and applications of the simplified data sets were demonstrated, including the speedup of DR and training data simplification. These results show that the simplified data set captures the main features of the original data set and can replace it in data processing tasks, thus improving the data processing capability significantly.

Acknowledgement

The authors would like to thank Zihao Wang for his helpful advice, and JiaMing Ai and JinCheng Lu for their help with dimensionality reduction. This work was supported by the National Natural Science Foundation of China under Grant No. 61872316 and the National Key R&D Plan of China under Grant No. 2016YFB1001501.

References

  • [1] X. Wu, X. Zhu, G. Wu, and W. Ding, “Data Mining with Big Data,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 1, pp. 97–107, 2013.
  • [2] U. Pinkall and K. Polthier, “Computing Discrete Minimal Surfaces and their Conjugates,” Experimental Mathematics, vol. 2, no. 1, pp. 15–36, 1993.
  • [3] J. Sun, M. Ovsjanikov, and L. Guibas, “A Concise and Provably Informative Multi-scale Signature Based on Heat Diffusion,” Computer Graphics Forum, vol. 28, no. 5, pp. 1383–1392, 2009.
  • [4] M. Belkin and Y. Wang, “Discrete Laplace Operator on Meshed Surfaces,” in Twenty-Fourth Symposium on Computational Geometry, pp. 278–287, 2008.
  • [5] Y. Liu, B. Prabhakaran, and X. Guo, “Point-Based Manifold Harmonics,” IEEE Transactions on Visualization & Computer Graphics, vol. 18, no. 10, pp. 1693–1703, 2012.
  • [6] G. Taubin, “A Signal Processing Approach to Fair Surface Design,” in Conference on Computer Graphics and Interactive Techniques, pp. 351–358, 1995.
  • [7] B. Levy, “Laplace-Beltrami Eigenfunctions Towards an Algorithm that ‘Understands’ Geometry,” in IEEE International Conference on Shape Modeling and Applications, pp. 13–13, 2006.
  • [8] B. Vallet and B. Levy, “Spectral Geometry Processing with Manifold Harmonics,” Computer Graphics Forum, vol. 27, no. 2, pp. 251–260, 2008.
  • [9] J. Hu and J. Hua, “Salient Spectral Geometric Features for Shape Matching and Retrieval,” Visual Computer, vol. 25, no. 5-7, pp. 667–675, 2009.
  • [10] R. Song, Y. Liu, R. R. Martin, and P. L. Rosin, “Mesh Saliency via Spectral Processing,” Acm Transactions on Graphics, vol. 33, no. 1, pp. 1–17, 2014.
  • [11] M. Reuter, F. E. Wolter, and N. Peinecke, “Laplace-Beltrami Spectra as Shape-DNA of Surfaces and Solids,” Computer-Aided Design, vol. 38, no. 4, pp. 342–366, 2006.
  • [12] M. Ovsjanikov, Q. Mérigot, F. Mémoli, and L. Guibas, “One Point Isometric Matching with the Heat Kernel,” Computer Graphics Forum, vol. 29, no. 5, pp. 1555–1564, 2010.
  • [13] M. M. Bronstein and I. Kokkinos, “Scale-Invariant Heat Kernel Signatures for Non-rigid Shape Recognition,” in Computer Vision and Pattern Recognition, pp. 1704–1711, 2010.
  • [14] M. Belkin and P. Niyogi, “Laplacian Eigenmaps for Dimensionality Reduction and Data Representation,” Neural computation, vol. 15, no. 6, pp. 1373–1396, 2003.
  • [15] D. G. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
  • [16] N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 886–893, 2005.
  • [17] U. Castellani, M. Cristani, S. Fantoni, and V. Murino, “Sparse Points Matching by Combining 3D Mesh Saliency with Statistical Descriptors,” Computer Graphics Forum, vol. 27, no. 2, pp. 643–652, 2010.
  • [18] G. Zou, J. Hua, M. Dong, and H. Qin, “Surface Matching with Salient Keypoints in Geodesic Scale Space,” Computer Animation & Virtual Worlds, vol. 19, no. 3-4, pp. 399–410, 2008.
  • [19] A. E. Johnson and M. Hebert, “Using Spin Images for Efficient Object Recognition in Cluttered 3D Scenes,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 21, no. 5, pp. 433–449, 2002.
  • [20] M. Kortgen, “3D Shape Matching with 3D Shape Contexts,” in Central European Seminar on Computer Graphics, 2003.
  • [21] M. Schlattmann, P. Degener, and R. Klein, “Scale Space Based Feature Point Detection on Surfaces,” in ASME 2012 Internal Combustion Engine Division Fall Technical Conference, pp. 893–905, 2008.
  • [22] R. M. Rustamov, “Laplace-Beltrami Eigenfunctions for Deformation Invariant Shape Representation,” in Eurographics Symposium on Geometry Processing, pp. 225–233, 2007.
  • [23] D. Shen, P. T. Bremer, M. Garland, V. Pascucci, and J. C. Hart, “Spectral Surface Quadrangulation,” Acm Transactions on Graphics, vol. 25, no. 3, pp. 1057–1066, 2006.
  • [24] Y. Lecun and Y. Bengio, “Convolutional Networks for Images, Speech, and Time-Series,” in The Handbook of Brain Theory and Neural Networks, 1995.
  • [25] N. Wang, X. Gao, D. Tao, H. Yang, and X. Li, “Facial Feature Point Detection: A Comprehensive Survey,” Neurocomputing, 2017.
  • [26] P. Luo, “Hierarchical Face Parsing via Deep Learning,” in IEEE Conference on Computer Vision and Pattern Recognition, pp. 2480–2487, 2012.
  • [27] Y. Sun, X. Wang, and X. Tang, “Deep Convolutional Network Cascade for Facial Point Detection,” in Computer Vision and Pattern Recognition, pp. 3476–3483, 2013.
  • [28] X. Peng, R. S. Feris, X. Wang, and D. N. Metaxas, “RED-Net: A Recurrent Encoder-Decoder Network for Video-Based Face Alignment,” International Journal of Computer Vision, no. 2, pp. 1–17, 2018.
  • [29] S. Xiao, J. Feng, J. Xing, H. Lai, S. Yan, and A. Kassim, Robust Facial Landmark Detection via Recurrent Attentive-Refinement Networks. Springer International Publishing, 2016.
  • [30] I. Tomek, “Two Modifications of CNN,” IEEE Trans. Systems, Man and Cybernetics, vol. 6, pp. 769–772, 1976.
  • [31] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-Sampling Technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
  • [32] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data,” ACM SIGKDD explorations newsletter, vol. 6, no. 1, pp. 20–29, 2004.
  • [33] C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, “Safe-Level-Smote: Safe-Level-Synthetic Minority Over-sampling Technique for Handling the Class Imbalanced Problem,” in Pacific-Asia conference on knowledge discovery and data mining, pp. 475–482, Springer, 2009.
  • [34] H. Han, W. Wang, and B. Mao, “Borderline-SMOTE: A New Over-sampling Method in Imbalanced Data Sets Learning,” in International conference on intelligent computing, pp. 878–887, Springer, 2005.
  • [35] Z. Zhou, J. Y. Shin, S. R. Gurudu, M. B. Gotway, and J. Liang, “AFT*: Integrating Active Learning and Transfer Learning to Reduce Annotation Efforts,” arXiv preprint arXiv:1802.00912, 2018.
  • [36] K. Vodrahalli, K. Li, and J. Malik, “Are All Training Examples Created Equal? An Empirical Study,” arXiv preprint arXiv:1811.12569, 2018.
  • [37] K. Benko, M. Kothe, K. D. Semmler, and U. Simon, “Eigenvalues of the Laplacian and Curvature,” Colloquium Mathematicum, vol. 42, no. 1, pp. 19–31, 1979.
  • [38] A. Pleijel, “Some Properties of Eigenfunctions of the Laplace Operator on Riemannian Manifolds,” vol. 1, 1949.
  • [39] M. Berger, “Le Spectre d’une Variete Riemannienne,” Lecture Notes in Math, vol. 194, pp. 141–241, 1971.
  • [40] “FLANN: Fast library for approximate nearest neighbors.”
  • [41] F. Perez-Cruz, “Kullback-Leibler Divergence Estimation of Continuous Distributions,” in IEEE International Symposium on Information Theory, 2008.
  • [42] M. P. Dubuisson and A. K. Jain, “A Modified Hausdorff Distance for Object Matching,” in Proceedings of 12th international conference on pattern recognition, vol. 1, pp. 566–568, IEEE, 1994.
  • [43] A. Sharma and R. Horaud, “Shape Matching Based on Diffusion Embedding and on Mutual Isometric Consistency,” in IEEE Computer Society Conference on Computer Vision & Pattern Recognition-workshops, 2010.
  • [44] J. B. Tenenbaum, V. D. Silva, and J. C. Langford, “A Global Geometric Framework for Nonlinear Dimensionality Reduction,” Science, vol. 290, no. 5500, pp. 2319–2323, 2000.
  • [45] Y. LeCun, C. Cortes, and C. J. Burges, “The MNIST Database of Handwritten Digits, 1998,” URL http://yann.lecun.com/exdb/mnist, vol. 10, p. 34, 1998.
  • [46] A. Krizhevsky, V. Nair, and G. Hinton, “The CIFAR-10 Dataset,” online: http://www.cs.toronto.edu/kriz/cifar.html, vol. 55, 2014.
  • [47] R. Lehoucq, K. Maschhoff, D. Sorensen, and C. Yang, “ARPACK Software,” Website: http://www.caam.rice.edu/software/ARPACK, 2007.
  • [48] L. Maaten and G. Hinton, “Visualizing Data Using t-SNE,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
  • [49] I. Jolliffe, Principal Component Analysis. Springer, 2011.
  • [50] Y. Bengio, J. Paiement, P. Vincent, O. Delalleau, N. L. Roux, and M. Ouimet, “Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering,” in Advances in neural information processing systems, pp. 177–184, 2004.
  • [51] T. G. Dietterich, “Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms,” Neural computation, vol. 10, no. 7, pp. 1895–1923, 1998.
  • [52] R. T. Johnson and D. W. Johnson, “Active Learning: Cooperation in the Classroom,” The annual report of educational psychology in Japan, vol. 47, pp. 29–30, 2008.
  • [53] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [54] L. Prechelt, “Automatic Early Stopping Using Cross Validation: Quantifying the Criteria,” Neural Networks, vol. 11, no. 4, pp. 761–767, 1998.