
Effective data reduction algorithm for topological data analysis

Seonmi Choi, Jinseok Oh, Jeong Rye Park, Seung Yeop Yang, and Hongdae Yun
[email protected], [email protected], [email protected], [email protected], [email protected]
Nonlinear Dynamics and Mathematical Application Center, Kyungpook National University, Daegu 41566, Republic of Korea
Department of Mathematics, Kyungpook National University, Daegu 41566, Republic of Korea
Abstract.

One of the most interesting tools that have recently entered the data science toolbox is topological data analysis (TDA). With the explosion of available data sizes and dimensions, identifying and extracting the underlying structure of a given dataset is a fundamental challenge in data science, and TDA provides a methodology for analyzing the shape of a dataset using tools and perspectives from algebraic topology. However, its computational complexity quickly becomes prohibitive for large datasets, especially high-dimensional ones. Here, we introduce a preprocessing strategy called the Characteristic Lattice Algorithm (CLA), which allows users to reduce the size of a given dataset as desired while maintaining geometric and topological features, in order to make the computation of TDA feasible or to shorten its computation time. In addition, we derive a stability theorem and an upper bound on the barcode errors for CLA based on the bottleneck distance.

Key words and phrases:
topological data analysis, persistent homology, Vietoris-Rips filtration, topology-preserving data reduction
2020 Mathematics Subject Classification:
Primary: 55N31. Secondary: 62R40, 68T09.

1. Introduction

Recent advances in computational science have led to a data explosion in the scientific community exploring complex natural phenomena. In particular, high-dimensional and complex simulations are being implemented in various fields using high-performance computing technology. However, these datasets, especially biological and astronomical data, are often so high-dimensional that they severely limit our visualization capabilities, and they frequently contain substantial noise and missing information.

Topological data analysis (TDA in abbreviation) combines theoretical tools and algorithms from algebraic topology and geometry to enable a mathematically rigorous study of the “shape” of data [2, 3, 9, 11, 22]. TDA provides dimensionality reduction and robustness for high-dimensional, incomplete, and noisy data. One of the main tools of TDA is persistent homology, which is a method for retrieving the essential topological features of a dataset. In order to compute the persistent homology of a dataset, one first represents the dataset as a filtered simplicial complex, which is a sequence of simplicial complexes built on the dataset in order to add topological structure to it. One widely used construction is the Vietoris-Rips filtration [10, 21]. This filtration, unfortunately, is often too computationally expensive to construct in full, as the run-time grows exponentially with the number of simplices [15]. As a result, new approaches to reducing computation costs have recently emerged that employ data preprocessing and sub-sampling [5, 12, 13, 14, 17, 18, 19, 20]. Sheehy [18] introduced an alternative filtration on an $n$-point metric space, called the sparse Vietoris-Rips filtration, that has size $O(n)$ and can be computed in $O(n\log n)$ time. Chazal et al. [5] proposed a method for reducing the computation, independent of the filtration, by repeatedly sub-sampling the data at random and creating an average landscape for the point cloud. de Silva and Carlsson [20] employ a random or MaxMin approach to select “landmark” points that are near other members of the point cloud. Moitra et al. [12] use the centroids of $k$-means++ clusters in order to reduce the number of data points.

In this paper, we introduce a method for reducing data size while preserving its topological features. The computation time for the persistent homology of a given dataset generally grows exponentially with the number of points, which leads to significant computational costs for large datasets. The proposed method allows for a significant reduction in computation time while having a minimal impact on the output results. Furthermore, the reduction ratio of the input data can be adjusted as desired, making it possible to compute persistent homology on big data. This method can be utilized as a preprocessing step to reduce the size of data before inputting it into persistent homology software such as Ripser [1]. Therefore, it can be combined with other methods that reduce the computational cost of persistent homology. The cost-saving effect and the accuracy of this method are analyzed using data generated with the NumPy library.

This paper is organized as follows. Section 2 describes the basic notions and ideas of TDA, particularly on persistent homology. Section 3.1 introduces a new data reduction algorithm suitable for TDA. Sections 3.2 to 3.4 analyze the effects of this algorithm in terms of computational cost reduction and accuracy of data analysis. Finally, Section 4 presents our conclusions.

2. Background: Topological Data Analysis

We briefly review the basic definitions and notions of persistent homology of simplicial complexes obtained from point clouds. See [8, 16] for more details.

Filtered Vietoris-Rips Complex  Let $(X,d_X)$ be a finite metric space and let $\varepsilon>0$ be fixed. The Vietoris-Rips complex $VR_\varepsilon(X,d_X)$ is the abstract simplicial complex such that

  • (1) vertices are the elements of $X$,

  • (2) $[v_0,v_1,\dots,v_k]$ is a $k$-simplex if $d_X(v_i,v_j)\leq 2\varepsilon$ for all $0\leq i,j\leq k$, where $\{v_0,v_1,\dots,v_k\}$ is a subset of $X$.

For $\varepsilon<\varepsilon'$, there exists an induced simplicial map $VR_\varepsilon(X,d_X)\rightarrow VR_{\varepsilon'}(X,d_X)$, as every simplex of $VR_\varepsilon(X,d_X)$ is also a simplex of $VR_{\varepsilon'}(X,d_X)$. Since $X$ is finite, one can choose finitely many values $\varepsilon_1<\varepsilon_2<\cdots<\varepsilon_m$ so that $VR_{\varepsilon_i}(X,d_X)\subsetneq VR_{\varepsilon_{i+1}}(X,d_X)$ for each $i$. Then the natural inclusions

\[ VR_{\varepsilon_1}(X,d_X)\hookrightarrow VR_{\varepsilon_2}(X,d_X)\hookrightarrow\cdots\hookrightarrow VR_{\varepsilon_m}(X,d_X) \]

induce the homomorphisms

\[ PH_k(VR_{\varepsilon_1}(X,d_X);\mathbb{F})\rightarrow PH_k(VR_{\varepsilon_2}(X,d_X);\mathbb{F})\rightarrow\cdots\rightarrow PH_k(VR_{\varepsilon_m}(X,d_X);\mathbb{F}) \]

by applying the $k$th homology group functor $PH_k$, where $\mathbb{F}$ is a field. The sequence of simplicial complexes $VR_\bullet(X,d_X)$ is called a filtered Vietoris-Rips complex on $X$.
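As a concrete illustration of the definition above, the following short Python sketch enumerates the simplices of $VR_\varepsilon(X,d_X)$ for a small point cloud by brute force. The function name is ours; this is not the code used in the paper, and the brute-force enumeration does not scale to large inputs.

```python
import numpy as np
from itertools import combinations
from scipy.spatial.distance import pdist, squareform

def vietoris_rips(X, eps, max_dim=2):
    """List the simplices of VR_eps(X, d_X) for a small point cloud X in R^m.

    A subset {v_0, ..., v_k} spans a k-simplex exactly when all pairwise
    distances are at most 2 * eps, as in the definition above.
    """
    D = squareform(pdist(np.asarray(X, dtype=float)))   # pairwise Euclidean distances
    n = len(X)
    simplices = [(i,) for i in range(n)]                 # every point is a vertex
    for k in range(1, max_dim + 1):
        for subset in combinations(range(n), k + 1):
            if all(D[i, j] <= 2 * eps for i, j in combinations(subset, 2)):
                simplices.append(subset)
    return simplices

# example: four corners of a unit square with eps = 0.5
# vietoris_rips([[0, 0], [1, 0], [0, 1], [1, 1]], eps=0.5) contains the four vertices
# and the four edges of the square, but not the diagonals or any triangle
```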

Barcode and Persistent Diagram  For a filtered Vietoris-Rips complex $VR_\bullet(X,d_X)$, an element $\gamma\in PH_k(VR_{\varepsilon_i}(X,d_X);\mathbb{F})$ is

  • (1) born at $\varepsilon_i$ if $\gamma\notin\mathrm{Im}(\theta_{j,i})$ for any $\varepsilon_j<\varepsilon_i$,

  • (2) dies at $\varepsilon_l>\varepsilon_i$ if $\theta_{i,l}(\gamma)=0$ in $PH_k(VR_{\varepsilon_l}(X,d_X);\mathbb{F})$ or, for some $\varepsilon_j<\varepsilon_l$, there exists an element $\alpha\in PH_k(VR_{\varepsilon_j}(X,d_X);\mathbb{F})$ such that $\theta_{i,l}(\gamma)=\theta_{j,l}(\alpha)$,

where $\theta_{a,b}$ denotes the induced homomorphism $PH_k(VR_{\varepsilon_a}(X,d_X);\mathbb{F})\rightarrow PH_k(VR_{\varepsilon_b}(X,d_X);\mathbb{F})$ for any $\varepsilon_a<\varepsilon_b$. The interval $[\varepsilon_i,\varepsilon_l)$ represents the lifespan of $\gamma$. The $k$th barcode of $X$, denoted by $B_kX$, is defined as the multiset of non-empty intervals, called bars, of the form $[x,y)$ or $[x,\infty)$, where the bars represent the lifespans of elements in $PH_k(VR_\bullet(X,d_X);\mathbb{F})$. It is often convenient to consider a barcode as a set of points in $\mathbb{R}^2$, called a persistent diagram (PD in abbreviation), by replacing each bar $[x,y)$ of the barcode with the point $(x,y)\in\mathbb{R}^2$. Figure 1, for example, illustrates the process of obtaining a Vietoris-Rips complex, a barcode, and a persistent diagram of a given point cloud $X$ consisting of nine points in $\mathbb{R}^2$.

Figure 1. A barcode and a persistent diagram

To state the stability of persistent homology under perturbation precisely, we measure changes in the output of TDA using a distance between barcodes.

The Gromov-Hausdorff Distance  Let $(X,d_X)$ and $(Y,d_Y)$ be two compact metric spaces. The Gromov-Hausdorff distance between $X$ and $Y$ is defined by

\[ d_{GH}((X,d_X),(Y,d_Y)) = \inf_{\theta_1,\theta_2} d_H(\theta_1(X),\theta_2(Y)), \]

where the infimum is taken over all isometric embeddings $\theta_1:X\rightarrow Z$ and $\theta_2:Y\rightarrow Z$ of $(X,d_X)$ and $(Y,d_Y)$ into any common metric space $(Z,d_Z)$, and $d_H$ is the Hausdorff distance in $Z$.

The Bottleneck Distance  For two intervals $[x_1,y_1)$ and $[x_2,y_2)$, we define

\[ d_\infty([x_1,y_1),[x_2,y_2)) = \max\{|x_1-x_2|,|y_1-y_2|\}. \]

For the empty set $\emptyset$, we extend $d_\infty$ by defining

\[ d_\infty([x,y),\emptyset) = \frac{|y-x|}{2}. \]

Let $B_1$ and $B_2$ be barcodes. We add $\emptyset$ to $B_1$ and $B_2$, respectively, and assume that $|B_1|\leq|B_2|$ without loss of generality. A matching is a bijection $\phi:A_1\rightarrow A_2$, where for each $i=1,2$, $A_i$ is a multi-subset of $B_i$ and the elements of $B_i\setminus A_i$ are considered to be matched with $\emptyset$. The bottleneck distance between $B_1$ and $B_2$ is defined to be

\[ d_B(B_1,B_2) = \inf_\phi \sup_{I\in B_1} d_\infty(I,\phi(I)), \]

where $\phi$ varies over all matchings between $B_1$ and $B_2$ and the supremum is taken over all bars $I$ in $B_1$.
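For small barcodes, the bottleneck distance can be computed directly from this definition by padding both barcodes with empty bars and minimizing the maximal cost over all matchings. The following brute-force Python sketch is our own illustration of the definition; it is only practical for a handful of bars, and libraries such as persim provide efficient implementations.

```python
from itertools import permutations

def d_inf(I, J):
    # d_infty between two bars given as (x, y) pairs; None plays the role of the empty bar
    if I is None and J is None:
        return 0.0
    if I is None or J is None:
        x, y = J if I is None else I
        return abs(y - x) / 2
    return max(abs(I[0] - J[0]), abs(I[1] - J[1]))

def bottleneck_bruteforce(B1, B2):
    # pad each barcode with empty bars so that unmatched bars are matched with the empty set
    A1 = list(B1) + [None] * len(B2)
    A2 = list(B2) + [None] * len(B1)
    return min(max(d_inf(I, J) for I, J in zip(A1, perm)) for perm in permutations(A2))

# example: d_B({[0, 3)}, {[0.5, 3.5), [1, 1.2)}) = 0.5, matching [0, 3) with [0.5, 3.5)
# and leaving the short bar [1, 1.2) matched with the empty set (cost 0.1)
# print(bottleneck_bruteforce([(0.0, 3.0)], [(0.5, 3.5), (1.0, 1.2)]))
```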

The bottleneck distance quantifies the maximum difference in the best matching between the two barcodes. In other words, two barcodes are considered similar in the bottleneck distance when after ignoring “short” bars, the endpoints of matching “long” bars are close. The following is the stability theorem for barcodes with respect to the bottleneck distance.

Proposition 2.1.

[4, 6] Let $(X,d_X)$ and $(Y,d_Y)$ be finite metric spaces. Then for all $k\geq 0$,

\[ d_B(B_kX,B_kY) \leq d_{GH}((X,d_X),(Y,d_Y)). \]

3. Results

In this section, we introduce a novel method for reducing the size of a dataset as desired while maintaining geometric and topological features, in order to enable feasible computation of TDA or to reduce its computation time. Moreover, using randomly generated data, we investigate the extent to which this method improves the computation speed of TDA and how well it maintains the accuracy of data analysis based on TDA.

3.1. Topology-Preserving Data Reduction: Characteristic Lattice Algorithm

Let $X$ be a point cloud in the $m$-dimensional Euclidean space $\mathbb{R}^m$ and let $\delta>0$ be fixed. The procedure of our method is outlined as follows:

  1. (1) Divide $\mathbb{R}^m$ into $m$-dimensional hypercubes (also simply called $m$-cubes) with side length $\delta$. To be more precise, we consider the partition $\mathbb{R}^m/{\sim}$ of $\mathbb{R}^m$ induced by the equivalence relation $\sim$ on $\mathbb{R}^m$ defined by

\[ (x_1,x_2,\ldots,x_m)\sim(y_1,y_2,\ldots,y_m) \iff \text{for each } i,\ x_i,y_i\in[k_i\delta,(k_i+1)\delta) \text{ for some } k_i\in\mathbb{Z}, \]

  and then the corresponding equivalence classes are the $m$-cubes.

  2. (2) For each $m$-cube $C$, if $C\cap X\neq\emptyset$, then we select one sample point $\mathbf{x}_C^*$ from $C$ (the sample point does not need to be an element of $C\cap X$). Otherwise, we do not select any sample points. See Figure 2 for an example.

  3. (3) The set of all the sample points obtained in (2) forms a new point cloud, denoted by $X_\delta^*$. See Figure 3 for an example.

Figure 2. Sample points from a 2-cube and a 3-cube
Figure 3. $X_\delta^*$ (the set of orange points) derived from $X$ using CLA

It is obvious that as the value of $\delta$ increases, the size of the new point cloud $X_\delta^*$ decreases. We call this data reduction technique the Characteristic Lattice Algorithm (CLA in abbreviation) with $\delta$, which allows us to shorten the run-time for calculating a persistent diagram while preserving the topological features of the original point cloud $X$.
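The following NumPy sketch illustrates one possible implementation of CLA. The function name and the choice of the first point encountered in each cube as the sample point are our own; by the procedure above, any point of the cube may serve as $\mathbf{x}_C^*$.

```python
import numpy as np

def cla(X, delta):
    """Sketch of CLA: keep one sample point per occupied m-cube of side length delta.

    X is an (n, m) array; the returned array is X_delta^*, where the sample point of
    each occupied cube is taken to be the first point of X encountered in that cube.
    """
    X = np.asarray(X, dtype=float)
    cube_index = np.floor(X / delta).astype(int)   # (k_1, ..., k_m) identifies the cube of each point
    reps = {}
    for point, key in zip(X, map(tuple, cube_index)):
        if key not in reps:                        # only the first point of each cube is kept
            reps[key] = point
    return np.array(list(reps.values()))
```

Each point is assigned the index $(\lfloor x_1/\delta\rfloor,\ldots,\lfloor x_m/\delta\rfloor)$ of the cube containing it, so the size of $X_\delta^*$ equals the number of occupied cubes.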

Remark 3.1.

When implementing CLA, one can choose all sample points to be the centers of the $m$-cubes, and the outcome derived from this selection is denoted by $X_\delta^c$. That is,

\[ X_\delta^c = \{(\overline{x}_1,\overline{x}_2,\ldots,\overline{x}_m)\in\mathbb{R}^m \mid (x_1,x_2,\ldots,x_m)\in X\}, \]

where $\overline{x}_i = \delta\left\lfloor\frac{x_i}{\delta}\right\rfloor + \frac{\delta}{2}$ for each $i=1,2,\ldots,m$.
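Under the same assumptions as the sketch above, the center variant $X_\delta^c$ of Remark 3.1 can be written in a fully vectorized form:

```python
import numpy as np

def cla_centers(X, delta):
    # X_delta^c: snap every point to the center of its m-cube, then remove duplicates
    centers = delta * np.floor(np.asarray(X, dtype=float) / delta) + delta / 2
    return np.unique(centers, axis=0)
```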

The following theorem supports that CLA preserves the topological features of a given dataset well and that the choice of sample points does not significantly affect the results of CLA.

Theorem 3.1.

Let $X$ be a finite point cloud in $\mathbb{R}^m$ and let $\delta>0$ be fixed. Then for all $k\geq 0$,

  • (1) $d_B(B_kX, B_kX_\delta^*)\leq\sqrt{m}\,\delta$;

  • (2) $d_B(B_kX_\delta^*, B_kX_\delta^c)\leq\frac{\sqrt{m}\,\delta}{2}$.

Proof.

Note that the Hausdorff distance between $A$ and $B$ is defined as

\[ d_H(A,B) = \inf\{\varepsilon>0 \mid B\subseteq A_\varepsilon,\ A\subseteq B_\varepsilon\}, \]

where $A_\varepsilon = \bigcup_{a\in A}\{x\in\mathbb{R}^m \mid \|x-a\|\leq\varepsilon\}$ and $B_\varepsilon = \bigcup_{b\in B}\{x\in\mathbb{R}^m \mid \|x-b\|\leq\varepsilon\}$. Since the diameter of an $m$-cube with side length $\delta$ is $\sqrt{m}\,\delta$, every point of $X$ lies within $\sqrt{m}\,\delta$ of the sample point of its cube and vice versa; hence, when $\varepsilon=\sqrt{m}\,\delta$, we have $X_\delta^*\subseteq X_\varepsilon$ and $X\subseteq(X_\delta^*)_\varepsilon$. Similarly, since every point of an $m$-cube lies within $\frac{\sqrt{m}\,\delta}{2}$ of its center, if $\varepsilon=\frac{\sqrt{m}\,\delta}{2}$, then $X_\delta^c\subseteq(X_\delta^*)_\varepsilon$ and $X_\delta^*\subseteq(X_\delta^c)_\varepsilon$. Therefore,

\[ d_H(X,X_\delta^*)\leq\sqrt{m}\,\delta \quad\text{and}\quad d_H(X_\delta^*,X_\delta^c)\leq\frac{\sqrt{m}\,\delta}{2}. \]

By the definition of the Gromov-Hausdorff distance, we have

\[ d_{GH}(X,X_\delta^*)\leq\sqrt{m}\,\delta \quad\text{and}\quad d_{GH}(X_\delta^*,X_\delta^c)\leq\frac{\sqrt{m}\,\delta}{2}. \]

The desired inequalities then follow immediately from Proposition 2.1. ∎

Let $p_k:\mathbb{R}^m\rightarrow\mathbb{R}$ denote the $k$th projection defined by $p_k(x_1,\ldots,x_k,\ldots,x_m)=x_k$ for each $k=1,\ldots,m$. Let $M_k$ and $m_k$ denote the maximum and the minimum of $p_k(X)$, respectively. Then the following is obtained by using the pigeonhole principle.

Lemma 3.2.

If $\delta$ satisfies $\prod\limits_{k=1}^m\left(\left\lfloor\frac{M_k}{\delta}\right\rfloor+\left\lfloor\frac{-m_k}{\delta}\right\rfloor+2\right) < |X|$, then $|X_\delta^*| < |X|$.

Note that CLA is meaningful only when it actually reduces the number of points in $X$, so it is important to consider the extent to which the data size is reduced. From now on, we consider the reduction rate $r$ of the number of points in $X$ and use the corresponding $\delta$ for each $r$. By varying the reduction rate $r$, we can observe the extent to which the computation time decreases when CLA is applied.
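The paper pairs each reduction rate $r$ with a corresponding $\delta$ but does not spell out how $\delta$ is found. One simple possibility, shown below as our own hypothetical helper rather than the authors' code, is a bisection on $\delta$, using the fact that $|X_\delta^*|$ equals the number of occupied cubes and is non-increasing in $\delta$.

```python
import numpy as np

def delta_for_rate(X, r, lo=1e-6, hi=None, iters=50):
    """Hypothetical helper: bisect for a delta whose reduction rate is roughly r.

    r is interpreted as the fraction of points removed, e.g. r = 0.5 keeps about
    half of the points of X; |X_delta^*| is the number of occupied m-cubes.
    """
    X = np.asarray(X, dtype=float)
    if hi is None:
        hi = float(np.max(np.ptp(X, axis=0)))    # a single cube covering the whole range
    target = (1.0 - r) * len(X)                   # desired size of X_delta^*
    for _ in range(iters):
        mid = (lo + hi) / 2
        size = len(np.unique(np.floor(X / mid).astype(int), axis=0))
        if size > target:
            lo = mid                              # too many occupied cubes: enlarge delta
        else:
            hi = mid
    return hi
```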

3.2. Data

We construct the dataset using NumPy 1.24 in Python 3.9; it consists of $n$ pairs of two point clouds in $\mathbb{R}^m$. Both point clouds have the same number of points, say $p$. For a fixed integer $s$ used as the random seed, two types of point clouds, namely the sphere type and the random type, are formed as follows:

  1. (1) Sphere type: the data having $p$ points in an $(m-1)$-dimensional sphere shape with noise $5\times 0.1^{m-1}$, constructed by using the $(m-1)$-dimensional spherical coordinate system;

  2. (2) Random type: the data consisting of $p$ uniformly random points.

One can normalize them in the $m$-dimensional box $[0,100]^m$ in order to have the same scale. Thus, for each seed $s$, there is a pair of two point clouds. As $s$ ranges from $0$ to $n-1$, we obtain $n$ pairs of point clouds; that is, the size of the dataset is $2n$. For example, Figure 4 shows a 2-dimensional (2D) dataset and a 3-dimensional (3D) dataset when $n=1$.
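A sketch of how one such pair of point clouds could be generated is given below. The exact radius and the way the noise term $5\times 0.1^{m-1}$ enters are not fully specified in the text, so the details here (including sampling directions from normalized Gaussians instead of spherical coordinates) are our own assumptions.

```python
import numpy as np

def make_pair(p, m, seed):
    """One (sphere type, random type) pair of point clouds in R^m, seeded by seed."""
    rng = np.random.default_rng(seed)

    # sphere type: random unit directions with radial noise of scale 5 * 0.1**(m - 1)
    directions = rng.normal(size=(p, m))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    sphere = directions * (1.0 + 5 * 0.1 ** (m - 1) * rng.normal(size=(p, 1)))

    # random type: p uniformly random points
    random_cloud = rng.uniform(size=(p, m))

    def to_box(Y):
        # normalize a cloud into the box [0, 100]^m
        Y = Y - Y.min(axis=0)
        return 100.0 * Y / Y.max(axis=0)

    return to_box(sphere), to_box(random_cloud)
```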

Figure 4. A 2D dataset ($p=300$) and a 3D dataset ($p=1{,}000$)

In Figure 5, the datasets whose points are colored orange are constructed from the datasets in Figure 4 by applying CLA with $\delta=10$ to the 2D dataset and $\delta=20$ to the 3D dataset, respectively, using the centers of the $m$-cubes as the sample points.

Figure 5. Data reduction of the datasets in Figure 4 by applying CLA. The point clouds consisting of orange points are obtained from the point clouds consisting of blue points by applying CLA with $\delta=10$ to the 2D dataset and $\delta=20$ to the 3D dataset.

To evaluate the efficiency and effectiveness of CLA, it is necessary to analyze the computation time and the classification accuracy. The workflow on the left in Figure 6 represents a pipeline that measures the time required to compute a persistent diagram with and without CLA. One can evaluate the effectiveness of CLA using a support vector machine (SVM) [7] on the dataset. The workflow on the right in Figure 6 shows a pipeline for comparing the accuracy of data analysis with and without CLA. The hardware environment was a 12th Gen Intel(R) Core(TM) i7-12700K at 3.61 GHz with 128 GB RAM.

Figure 6. Workflows for measuring computation time and analysis accuracy. The left workflow compares the computation time required to obtain the PD when applying CLA with reduction rates of $r=10\%$, $30\%$, or $50\%$, as well as without applying CLA. The right workflow provides the accuracy for classifying the two types of point clouds in the dataset when applying CLA with a reduction rate of $r=50\%$ compared to not applying CLA.

3.3. Efficiency in Computation Time

To assess the efficiency of calculating a PD with CLA, we use random type point clouds as described in Section 3.2. Figure 7 illustrates the run-time for calculating the PD of the 2D and the 3D point clouds. Note that $p$ increases by $1{,}000$ from $1{,}000$ to $30{,}000$ in the 2D case and by $100$ from $100$ to $3{,}000$ in the 3D case. We compare the time it takes to calculate the PD of the original point cloud with the time it takes to calculate the PD of the same point cloud after applying CLA with reduction rates of $r=10\%$, $30\%$, or $50\%$.

Figure 7. PD calculation time graphs. The run-time of calculating the PD from the 2D point cloud and the 3D point cloud is presented in the above graphs, where $p$ increases by $1{,}000$ from $1{,}000$ to $30{,}000$ in the 2D point cloud and by $100$ from $100$ to $3{,}000$ in the 3D point cloud, respectively. (Note that when $p$ reaches $3{,}000$, we were unable to calculate the PD without applying CLA due to excessive computation time in our hardware environment.) The blue dashed line represents the calculation time of the PD, while the other lines indicate the calculation time of the PD after applying CLA with reduction rates of $10\%$, $30\%$, and $50\%$.

As shown in graph (A) of Figure 7, there is a noticeable difference in computation time between applying and not applying CLA as the size of the 2D data increases. In particular, the results for $p=30{,}000$ are shown in Table 1. The reductions in computation time when applying CLA with reduction rates of $10\%$, $30\%$, and $50\%$ exceed $93\%$ in all cases.

$p=30{,}000$ | Without CLA | CLA ($r=10\%$) | CLA ($r=30\%$) | CLA ($r=50\%$)
Computation time (sec) | 1046.5445 | 65.9021 | 64.6348 | 64.9718
Rate of decrease (%) | - | 93.7 | 93.8 | 93.8
Table 1. Reduction rates of calculation time when applying CLA to a 2D point cloud

Graph (B) of Figure 7 shows that using CLA on the 3D point cloud effectively reduces the computation time compared to not using CLA. Specifically, the calculation times for $p=2{,}900$ are shown in Table 2.

$p=2{,}900$ | Without CLA | CLA ($r=10\%$) | CLA ($r=30\%$) | CLA ($r=50\%$)
Computation time (sec) | 3474.1446 | 1302.1153 | 571.3973 | 146.7123
Rate of decrease (%) | - | 62.5 | 83.5 | 95.7
Table 2. Reduction rates of calculation time when applying CLA to a 3D point cloud

The reduction in computation time when applying CLA with the reduction rate of $50\%$ is $95.7\%$. In particular, when the size of the 3D data is $p=3{,}000$, it was not possible to obtain the PD without applying CLA due to our memory constraints and computational limitations. However, when CLA is applied, the PD can be successfully calculated even when $p=3{,}000$.

3.4. Preservation of Classification Accuracy

The goal of this section is to observe the changes in the original dataset and its PD when applying CLA and to compare the data classification accuracy. To this end, we analyze the results of the right workflow in Figure 6 for the sphere type and random type point clouds presented in Section 3.2. The performance of an SVM is intricately tied to the careful selection of hyperparameters; since our goal is a straightforward and unbiased comparison of SVM performance on the two types of point clouds, we deliberately refrained from tuning any parameters of the SVM employed.

First, we analyze how the 2D and 3D datasets, as well as their corresponding PDs, change when applying CLA with reduction rates of $10\%$, $30\%$, and $50\%$. The first rows of Figure 8 and Figure 9 show the original point clouds and their corresponding PDs without applying CLA. The second, third, and fourth rows of Figure 8 and Figure 9 show the reconstructed datasets obtained by applying CLA with reduction rates of $10\%$, $30\%$, and $50\%$, respectively, to the original datasets, along with their corresponding PDs.

Figure 8. Evolution of the PDs for a 2D dataset. The PDs of point clouds of a 2D dataset applying CLA with reduction rates of $10\%$, $30\%$, and $50\%$.

When ignoring the points near the diagonal in the PDs, which can be considered noise, it is evident that the features of the first persistent homology (orange points) of the 2D original datasets and the second persistent homology (green points) of the 3D original datasets are well preserved in the PDs, despite the increase in the reduction rate of CLA.

Figure 9. Evolution of the PDs for a 3D dataset. The PDs of point clouds of a 3D dataset applying CLA with reduction rates of $10\%$, $30\%$, and $50\%$.

Second, we investigate the preservation of classification accuracy when applying CLA. For this purpose, we compare the classification accuracy obtained by three different data processing methods using the dataset with $n=10{,}000$ presented in Section 3.2: classifying the sphere type and random type point clouds using SVM alone, classifying their PDs using SVM, and classifying the PDs of the reconstructed point clouds obtained by applying CLA using SVM. The classification accuracies obtained from the 2D and 3D datasets are depicted in graphs (A) and (B) of Figure 10, respectively. In both datasets, the number of points $p$ increases by $100$ from $100$ to $1{,}000$.
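The paper does not state how the PDs are converted into fixed-length inputs for the SVM. The sketch below is our own choice: it uses the $k$ largest finite lifetimes of each diagram as a feature vector and an untuned scikit-learn SVC, matching the decision above not to adjust hyperparameters.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def pd_features(dgm, k=20):
    """Vectorize one diagram (array of (birth, death) rows) by its k largest finite lifetimes."""
    finite = dgm[np.isfinite(dgm[:, 1])]
    lifetimes = np.sort(finite[:, 1] - finite[:, 0])[::-1][:k]
    return np.pad(lifetimes, (0, k - len(lifetimes)))     # zero-pad short diagrams

# right workflow of Figure 6 (illustrative; diagrams and labels come from the pipeline above):
# features = np.array([pd_features(d) for d in diagrams])   # e.g. one H_1 diagram per point cloud
# X_tr, X_te, y_tr, y_te = train_test_split(features, labels, random_state=0)
# clf = SVC()                                               # default hyperparameters
# print(clf.fit(X_tr, y_tr).score(X_te, y_te))
```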

Figure 10. Classification accuracy of the 2D and 3D datasets. The above graphs show the accuracy of classifying sphere type and random type point clouds in the 2D and 3D datasets, respectively, with $n=10{,}000$, where $p$ increases from $100$ to $1{,}000$ by $100$. The blue dashed lines represent the classification accuracy of the datasets using SVM alone. The orange dotted lines show the classification accuracy obtained by classifying the PDs of the datasets using SVM. Finally, the green dotted lines represent the classification accuracy determined by classifying the PDs of the reconstructed datasets obtained by applying CLA with the reduction rate of $50\%$ using SVM.

The blue dashed lines indicate the classification accuracy when only SVM is applied to the given datasets. The orange dotted lines present the classification accuracy of applying SVM to the PDs obtained from the datasets. The green dotted lines show the classification accuracy of applying SVM to the PDs of the reconstructed datasets obtained by applying CLA with the reduction rate of $50\%$. When using only SVM for classification, graphs (A) and (B) in Figure 10 show that the classification accuracy of the 2D dataset is approximately $0.5$ and that of the 3D dataset is approximately $0.77$ when the number of points in the dataset is $p=1{,}000$. On the other hand, when applying SVM to the PDs obtained from the datasets (i.e., utilizing TDA), the classification accuracy consistently reaches $1$ in both the 2D and 3D datasets. Furthermore, even after reducing the number of points in both datasets by applying CLA, the classification accuracy still remains at $1$. This observation highlights the fact that CLA is a data reduction technique that effectively preserves the topological features of the data.

4. Conclusion

The research results can be summarized as follows. It is natural that as the reduction rate increases, the computation time decreases accordingly. Table 1 indicates that the use of CLA reduces the PD calculation time by approximately $93\%$ compared to calculations without CLA. The disparity in computational cost becomes more pronounced when analyzing the 3D data. In particular, at $p=3{,}000$, the assessment of the computational cost without CLA was impeded by our computer specifications. When CLA is applied with $r=50\%$, the rates of decrease for the 3D data show that the computational cost is up to $95\%$ less than calculations without CLA, as presented in Table 2. Notably, the reduction in computational cost is higher for the 3D data than for the 2D data.

The application of TDA in combination with SVM for the sphere type and random type point clouds resulted in significantly higher classification accuracy compared to using SVM alone, as observed in Figure 10. Although CLA may slightly deform the given data as a preprocessing technique, there was no significant difference in classification accuracy when comparing the combined use of TDA and CLA with using TDA alone. This suggests that CLA is an effective data reduction technique for TDA, and it is also theoretically supported by Theorem 3.1.

TDA proposes a novel approach to big data analysis. However, persistent homology, one of the representative methodologies of TDA, is often infeasible for large datasets due to its rapidly increasing computational complexity. Data reduction is the most fundamental technique used in data science to address computational complexity problems, and various data reduction methods suitable for data analysis using persistent homology have been studied. The method proposed in this paper, CLA, improves upon previous related studies by allowing data reduction as desired while preserving the topological features of high-dimensional and complex datasets. Therefore, CLA can serve as an effective tool for extending the role of topology in machine learning and future data analysis.

Acknowledgement

All authors contributed equally to this paper. The work of Seung Yeop Yang and Seonmi Choi was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2022R1A5A1033624). The work of Seonmi Choi was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2021R1I1A1A01049100). The work of Jeong Rye Park was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2021R1I1A1A01057767).

References

  • [1] U. Bauer, Ripser: efficient computation of Vietoris-Rips persistence barcodes, J. Appl. Comput. Topol. 5 (2021), no. 3, 391–423.
  • [2] G. Carlsson, Topology and data, Bull. Amer. Math. Soc. (N.S.), 46 (2009), no. 2, 255–308.
  • [3] G. Carlsson and V. De Silva, Zigzag persistence, Found. Comput. Math., 10 (2010), no. 4, 367–405.
  • [4] F. Chazal, D. Cohen-Steiner, L. J. Guibas, F. Mémoli, and S. Y. Oudot, Gromov-Hausdorff Stable Signatures for Shapes using Persistence, Computer Graphics Forum, 28 (2009), no. 5, 1393-1403.
  • [5] F. Chazal, B. T. Fasy, F. Lecci, B. Michel, A. Rinaldo, and L. Wasserman, Subsampling methods for persistent homology, In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei Eds. Lille, France, 37 (2015), JMLR: W&CP, 2143–2151.
  • [6] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, Stability of persistence diagrams, Discrete Comput. Geom., 37 (2007), no. 1, 103–120.
  • [7] C. Cortes and V. Vapnik, Support-vector networks, Mach. Learn., 20 (1995), no. 3, 273–297.
  • [8] H. Edelsbrunner and J. Harer, Computational topology: an introduction, American Mathematical Society, 2010.
  • [9] R. Ghrist, Barcodes: the persistent topology of data, Bull. Amer. Math. Soc. (N.S.), 45 (2008), no. 1, 61–75.
  • [10] M. Gromov, Hyperbolic groups, Essays in group theory, 75–263, Math. Sci. Res. Inst. Publ., 8, Springer, New York, 1987.
  • [11] P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, and G. Carlsson, Extracting insights from the shape of complex data using topology, Scientific Reports, 3, Feb. 2013.
  • [12] A. Moitra, N. Malott, and P. A. Wilsey, Cluster-based data reduction for persistent homology, In 2018 IEEE International Conference on Big Data, ser. Big Data 2018, Dec. 2018, 327–334.
  • [13] N. O. Malott, A. M. Sens, and P. A. Wilsey, Topology Preserving Data Reduction for Computing Persistent Homology, In 2020 IEEE International Conference on Big Data, ser. Big Data 2020, Dec. 2020, 2681–2690.
  • [14] N. O. Malott and P. A. Wilsey, Fast computation of persistent homology with data reduction and data partitioning, In 2019 IEEE International Conference on Big Data, ser. Big Data 2019, Dec. 2019, pp. 880–889.
  • [15] N. Otter, M. A. Porter, U. Tillmann, P. Grindrod, and H. A. Harrington, A roadmap for the computation of persistent homology, EPJ Data Science, 6 (2017), no. 1.
  • [16] R. Rabadán and A. J. Blumberg, Topological data analysis for genomics and evolution: Topology in Biology, Cambridge University Press, 2020.
  • [17] K. N. Ramamurthy, K. R. Varshney, and J. J. Thiagarajan, Computing persistent homology under random projection, In 2014 IEEE Workshop on Statistical Signal Processing (SSP), Jun. 2014, pp. 105–108.
  • [18] D. R. Sheehy, Linear-size approximations to the Vietoris-Rips filtration, Discrete Comput. Geom., 49 (2013), no. 4, 778–796.
  • [19] D. R. Sheehy, The persistent homology of distance functions under random projection, In Proceedings of the Thirtieth Annual Symposium on Computational Geometry, ser. SOCG’14. New York, NY, USA: ACM, 2014, pp. 328–334.
  • [20] V. de Silva and G. Carlsson, Topological estimation using witness complexes, In Eurographics Symposium on Point-Based Graphics, ser. SPBG ’04, M. Gross, H. Pfister, M. Alexa, and S. Rusinkiewicz, Eds. Goslar, DEU: The Eurographics Association, (2004), 157–166.
  • [21] L. Vietoris, Über den höheren Zusammenhang kompakter Räume und eine Klasse von zusammenhangstreuen Abbildungen (German), Math. Ann. 97 (1927), no. 1, 454–472.
  • [22] A. Zomorodian and G. Carlsson, Computing Persistent Homology, Discrete Comput. Geom., 33 (2005), no. 2, 249-274.