Effective data reduction algorithm for topological data analysis
Abstract.
One of the most interesting tools that have recently entered the data science toolbox is topological data analysis (TDA). With the explosion of available data sizes and dimensions, identifying and extracting the underlying structure of a given dataset is a fundamental challenge in data science, and TDA provides a methodology for analyzing the shape of a dataset using tools and perspectives from algebraic topology. However, the computational complexity of TDA quickly becomes prohibitive for large datasets, especially those with high dimensions. Here, we introduce a preprocessing strategy called the Characteristic Lattice Algorithm (CLA), which allows users to reduce the size of a given dataset as desired while maintaining geometric and topological features, in order to make the computation of TDA feasible or to shorten its computation time. In addition, we derive a stability theorem and an upper bound on the barcode error for CLA based on the bottleneck distance.
Key words and phrases:
topological data analysis, persistent homology, Vietoris-Rips filtration, topology-preserving data reduction
2020 Mathematics Subject Classification:
Primary: 55N31. Secondary: 62R40, 68T09.
1. Introduction
Recent advances in computational science have led to a data explosion in the scientific community exploring complex natural phenomena. In particular, high-dimensional and complex simulations are being implemented in various fields using high-performance computing technology. However, these datasets, especially biological and astronomical data, are often so high-dimensional that they severely limit our visualization capabilities, and they frequently contain substantial noise and missing information.
Topological data analysis (TDA in abbreviation) combines theoretical tools and algorithms from algebraic topology and geometry to enable a mathematically rigorous study of the “shape” of data [2, 3, 9, 11, 22]. TDA provides dimensionality reduction and robustness for high-dimensional, incomplete, and noisy data. One of the main tools of TDA is persistent homology, which is a method for retrieving the essential topological features of a dataset. In order to calculate the persistent homology of a dataset, one first needs to represent the dataset as a filtered simplicial complex, which is a sequence of simplicial complexes built on the dataset in order to add topological structure to it. One widely used method for doing this is the Vietoris-Rips filtration [10, 21]. This filtration, unfortunately, is often too computationally expensive to construct in full (the number of simplices, and hence the run-time, grows exponentially with the number of data points [15]). As a result, new approaches to reducing computation costs have recently emerged that employ data preprocessing and sub-sampling [5, 12, 13, 14, 17, 18, 19, 20]. Sheehy [18] introduced an alternative filtration on an $n$-point metric space, called the sparse Vietoris-Rips filtration, that has linear size and can be computed in $O(n \log n)$ time. Chazal et al. [5] proposed a method for reducing the computation, independent of the filtration, by repeatedly sub-sampling the data randomly and creating an average landscape for the point cloud. de Silva and Carlsson [20] employ a random or MaxMin approach to select “landmark” points that are near other members of the point cloud. Moitra et al. [12] use the centroids of $k$-means++ clusters in order to reduce the number of data points.
In this paper, we introduce a method for reducing data size while preserving its topological features. The computation time for the persistent homology of a given dataset is generally exponential with respect to the number of points in the data, which requires significant computational costs for large datasets. The proposed method allows for a significant reduction in computation time while having a minimal impact on the output results. Furthermore, the reduction ratio of the input data can be adjusted as desired, making it possible to perform the calculation for the persistent homology on big data. This method can be utilized as a preprocessing step to reduce the size of data before inputting it into persistent homology algorithms such as Ripser [1]. Therefore, it can be combined with other methods that reduce the computational cost of persistent homology. The cost-saving effect and the accuracy of this method will be analyzed using data generated through the NumPy library.
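For readers who want to reproduce this pipeline, the following is a minimal sketch of the standard workflow that CLA is designed to precede, assuming the ripser Python package [1]; the synthetic point cloud and all parameters are illustrative only.

```python
import numpy as np
from ripser import ripser  # persistent homology backend [1]

# Synthetic point cloud: 200 noisy points on a circle in R^2.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(scale=0.05, size=(200, 2))

# Persistence diagrams of the Vietoris-Rips filtration up to degree 1.
diagrams = ripser(X, maxdim=1)["dgms"]
print("H0 bars:", len(diagrams[0]), "H1 bars:", len(diagrams[1]))
```

A data reduction step such as CLA would simply replace `X` by its reduced version before the call to `ripser`.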
This paper is organized as follows. Section 2 describes the basic notions and ideas of TDA, particularly on persistent homology. Section 3.1 introduces a new data reduction algorithm suitable for TDA. Sections 3.2 to 3.4 analyze the effects of this algorithm in terms of computational cost reduction and accuracy of data analysis. Finally, Section 4 presents our conclusions.
2. Background: Topological Data Analysis
We briefly review the basic definitions and notions of persistent homology of simplicial complexes obtained from point clouds. See [8, 16] for more details.
Filtered Vietoris-Rips Complex. Let $(X, d)$ be a finite metric space and let $\epsilon > 0$ be fixed. The Vietoris-Rips complex $\mathrm{VR}_{\epsilon}(X)$ is the abstract simplicial complex such that
(1) the vertices are the elements of $X$,
(2) $[x_0, x_1, \ldots, x_k]$ is a $k$-simplex if $d(x_i, x_j) \leq \epsilon$ for all $0 \leq i, j \leq k$, where $\{x_0, x_1, \ldots, x_k\}$ is a subset of $X$.
For $\epsilon \leq \epsilon'$, there exists the induced simplicial map $\mathrm{VR}_{\epsilon}(X) \hookrightarrow \mathrm{VR}_{\epsilon'}(X)$, as every simplex of $\mathrm{VR}_{\epsilon}(X)$ is also a simplex of $\mathrm{VR}_{\epsilon'}(X)$. Since $X$ is finite, one can choose finitely many values $\epsilon_1 < \epsilon_2 < \cdots < \epsilon_n$ at which the complex changes, so that every Vietoris-Rips complex on $X$ equals $\mathrm{VR}_{\epsilon_i}(X)$ for some $i$. Then the natural inclusions
$$\mathrm{VR}_{\epsilon_1}(X) \hookrightarrow \mathrm{VR}_{\epsilon_2}(X) \hookrightarrow \cdots \hookrightarrow \mathrm{VR}_{\epsilon_n}(X)$$
imply the induced homomorphisms
$$H_k(\mathrm{VR}_{\epsilon_1}(X); F) \rightarrow H_k(\mathrm{VR}_{\epsilon_2}(X); F) \rightarrow \cdots \rightarrow H_k(\mathrm{VR}_{\epsilon_n}(X); F)$$
by the $k$th homology group functor $H_k(-\,; F)$, where $F$ is a field.
The sequence of simplicial complexes $\{\mathrm{VR}_{\epsilon_i}(X)\}_{i=1}^{n}$ is called a filtered Vietoris-Rips complex on $X$.
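As an illustration of the definition above (not of how production software such as Ripser builds the filtration), the following sketch enumerates the simplices of $\mathrm{VR}_{\epsilon}(X)$ up to dimension 2 directly from pairwise distances; the function name and the example points are our own.

```python
import numpy as np
from itertools import combinations

def vietoris_rips(X, eps, max_dim=2):
    """Enumerate simplices of VR_eps(X) up to dimension max_dim.

    A subset of points spans a simplex iff all pairwise distances are <= eps,
    matching the definition of the Vietoris-Rips complex above.
    """
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    simplices = [[(i,) for i in range(n)]]          # 0-simplices: the vertices
    for k in range(1, max_dim + 1):
        simplices.append([
            s for s in combinations(range(n), k + 1)
            if all(dist[i, j] <= eps for i, j in combinations(s, 2))
        ])
    return simplices

# Example: four points forming a unit square.
X = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
for eps in (1.0, 1.5):
    counts = [len(s) for s in vietoris_rips(X, eps)]
    print(f"eps={eps}: number of 0-, 1-, 2-simplices = {counts}")
```

At $\epsilon = 1.0$ only the four sides appear as edges, while at $\epsilon = 1.5$ the diagonals and all triangles are included, illustrating how the complex grows with $\epsilon$.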
Barcode and Persistent Diagram. For a filtered Vietoris-Rips complex $\{\mathrm{VR}_{\epsilon_i}(X)\}_{i=1}^{n}$, let $f_{i,j} : H_k(\mathrm{VR}_{\epsilon_i}(X); F) \rightarrow H_k(\mathrm{VR}_{\epsilon_j}(X); F)$ denote the induced homomorphism for any $i \leq j$. An element $\alpha \in H_k(\mathrm{VR}_{\epsilon_i}(X); F)$ is
(1) born at $\epsilon_i$ if $\alpha$ is not in the image of $f_{j,i}$ for any $j < i$,
(2) dies at $\epsilon_j$ if $f_{i,j}(\alpha) = 0$ in $H_k(\mathrm{VR}_{\epsilon_j}(X); F)$, or if for some $l < i$ there exists an element $\beta \in H_k(\mathrm{VR}_{\epsilon_l}(X); F)$ such that $f_{l,j}(\beta) = f_{i,j}(\alpha)$.
The interval $[\epsilon_i, \epsilon_j)$ represents the lifespan of $\alpha$. The $k$th barcode of $X$, denoted by $\mathrm{bcd}_k(X)$, can be defined as the multiset of non-empty intervals, called bars, each of the form $[\epsilon_i, \epsilon_j)$ or $[\epsilon_i, \infty)$, the bars representing the lifespans of such elements in the filtration. It is often efficient to consider a barcode as a multiset of points in $\mathbb{R} \times (\mathbb{R} \cup \{\infty\})$, called a persistent diagram (PD in abbreviation), by replacing each bar $[b, d)$ of the barcode with the point $(b, d)$. Figure 1, for example, illustrates the process of obtaining a Vietoris-Rips complex, a barcode, and a persistent diagram of a given point cloud consisting of nine points in the plane.

To make precise the stability of persistent homology under perturbation, we measure changes in the output of TDA using distances between barcodes.
The Gromov-Hausdorff Distance. Let $X$ and $Y$ be two compact metric spaces. The Gromov-Hausdorff distance between $X$ and $Y$ is defined by
$$d_{GH}(X, Y) = \inf_{f, g, Z} d_H\big(f(X), g(Y)\big),$$
where $f : X \rightarrow Z$ and $g : Y \rightarrow Z$ are isometric embeddings of $X$ and $Y$ into any metric space $Z$, respectively, and $d_H$ is the Hausdorff distance in $Z$.
The Bottleneck Distance. For two intervals $[b_1, d_1)$ and $[b_2, d_2)$, we define
$$\Delta\big([b_1, d_1), [b_2, d_2)\big) = \max\{|b_1 - b_2|,\, |d_1 - d_2|\}.$$
For the empty set $\emptyset$, we extend $\Delta$ by defining
$$\Delta\big([b, d), \emptyset\big) = \Delta\big(\emptyset, [b, d)\big) = \frac{d - b}{2}.$$
Let $B_1$ and $B_2$ be barcodes. We add copies of $\emptyset$ to $B_1$ and $B_2$, respectively, and assume that $|B_1| = |B_2|$ without loss of generality. A matching is a bijection $\sigma : B_1 \rightarrow B_2$, where bars sent to $\emptyset$ are considered to be matched with the empty interval. The bottleneck distance between $B_1$ and $B_2$ is defined to be
$$d_B(B_1, B_2) = \inf_{\sigma} \sup_{I \in B_1} \Delta\big(I, \sigma(I)\big),$$
where $\sigma$ varies over all matchings between $B_1$ and $B_2$ and the supremum is taken over all bars in $B_1$.
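As a small illustration, the bottleneck distance between two toy diagrams can be computed with the persim package (assumed available); the diagrams below are our own examples, written as (birth, death) points.

```python
import numpy as np
import persim  # assumed available; provides a bottleneck-distance routine

# Two toy degree-1 barcodes, written as (birth, death) points.
dgm_a = np.array([[0.1, 1.0], [0.2, 0.3]])
dgm_b = np.array([[0.15, 0.9]])

# The short bar [0.2, 0.3) is matched with the empty interval, so the optimal
# matching pairs [0.1, 1.0) with [0.15, 0.9), giving max(0.05, 0.1) = 0.1.
print(persim.bottleneck(dgm_a, dgm_b))
```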
The bottleneck distance quantifies the maximum difference in the best matching between the two barcodes. In other words, two barcodes are considered similar in the bottleneck distance when, after ignoring “short” bars, the endpoints of matching “long” bars are close. The following is the stability theorem for barcodes with respect to the bottleneck distance.
Proposition 2.1 ([4]).
Let $X$ and $Y$ be finite metric spaces. Then for all $k \geq 0$,
$$d_B\big(\mathrm{bcd}_k(X), \mathrm{bcd}_k(Y)\big) \leq 2\, d_{GH}(X, Y).$$
3. Results
In this section, we introduce a novel method for reducing the size of a dataset as desired while maintaining geometric and topological features, in order to enable feasible computation of TDA or to reduce its computation time. Moreover, using randomly generated data, we investigate the extent to which the calculation speed of TDA improves and how well the accuracy of data analysis utilizing TDA is maintained when this method is applied.
3.1. Topology-Preserving Data Reduction: Characteristic Lattice Algorithm
Let $X$ be a point cloud in the $d$-dimensional Euclidean space $\mathbb{R}^d$ and let $\epsilon > 0$ be fixed. The procedure of our method is outlined as follows:
(1) Divide $\mathbb{R}^d$ into $d$-dimensional hypercubes (also simply called $d$-cubes) with side length $\epsilon$. To be more precise, we consider the partition of $\mathbb{R}^d$ induced by the equivalence relation $\sim_{\epsilon}$ on $\mathbb{R}^d$ defined by
$$(x_1, \ldots, x_d) \sim_{\epsilon} (y_1, \ldots, y_d) \iff \lfloor x_i / \epsilon \rfloor = \lfloor y_i / \epsilon \rfloor \text{ for all } i = 1, \ldots, d,$$
and then the corresponding equivalence classes are $d$-cubes.
(2) For each $d$-cube $Q$, if $Q \cap X \neq \emptyset$, then we select one sample point from $Q$ (the sample point does not need to be an element of $X$). Otherwise, we do not select any sample points. See Figure 2 for example.
(3) The set of all the sample points obtained in (2) forms a new point cloud, which is denoted by $\mathrm{CLA}_{\epsilon}(X)$. See Figure 3 for example.


It is obvious that as the value of $\epsilon$ increases, the size of the new point cloud decreases. We call this data reduction technique the Characteristic Lattice Algorithm (CLA in abbreviation) with parameter $\epsilon$, which allows us to shorten the run-time for calculating a persistent diagram while preserving the topological features of the original point cloud $X$.
Remark 3.1.
When implementing CLA, one can choose all sample points as the centers of the $d$-cubes, and the outcome derived from this selection is denoted by $\mathrm{CLA}^{c}_{\epsilon}(X)$. That is,
$$\mathrm{CLA}^{c}_{\epsilon}(X) = \{\, c_Q : Q \text{ is a } d\text{-cube with } Q \cap X \neq \emptyset \,\},$$
where $c_Q$ denotes the center of the $d$-cube $Q$.
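A minimal NumPy sketch of CLA as described above, assuming the floor-based grid of $d$-cubes from step (1); one sample point is kept per non-empty cube, either a representative data point or the cube center as in Remark 3.1. The function and parameter names are ours.

```python
import numpy as np

def cla(X, eps, use_centers=False):
    """Characteristic Lattice Algorithm (sketch).

    X           : (n, d) array of points in R^d.
    eps         : side length of the d-cubes.
    use_centers : if True, return the centers of the occupied d-cubes
                  (Remark 3.1); otherwise return one data point per cube.
    """
    keys = np.floor(X / eps).astype(np.int64)              # grid index of each point
    _, first = np.unique(keys, axis=0, return_index=True)  # one representative per occupied cube
    first = np.sort(first)
    if use_centers:
        return (keys[first] + 0.5) * eps                   # centers of the occupied cubes
    return X[first]

# Example: reduce 10,000 uniformly random points in the unit square.
rng = np.random.default_rng(1)
X = rng.random((10_000, 2))
print(len(X), "->", len(cla(X, eps=0.05)), "points")
```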
The following theorem supports that CLA preserves topological features of a given data well and that the choice of sample points does not significantly affect the results of CLA.
Theorem 3.1.
Let $X$ be a finite point cloud in $\mathbb{R}^d$ and let $\epsilon > 0$ be fixed. Then for all $k \geq 0$,
(1) $d_B\big(\mathrm{bcd}_k(X),\, \mathrm{bcd}_k(\mathrm{CLA}_{\epsilon}(X))\big) \leq 2\sqrt{d}\,\epsilon$,
(2) $d_B\big(\mathrm{bcd}_k(X),\, \mathrm{bcd}_k(\mathrm{CLA}^{c}_{\epsilon}(X))\big) \leq 2\sqrt{d}\,\epsilon$.
Proof.
Note that the Hausdorff distance between $X$ and $\mathrm{CLA}_{\epsilon}(X)$ is defined as
$$d_H\big(X, \mathrm{CLA}_{\epsilon}(X)\big) = \max\Big\{ \max_{x \in X} \min_{y \in \mathrm{CLA}_{\epsilon}(X)} \lVert x - y \rVert,\ \max_{y \in \mathrm{CLA}_{\epsilon}(X)} \min_{x \in X} \lVert x - y \rVert \Big\},$$
where $\lVert \cdot \rVert$ denotes the Euclidean norm. When $x \in X$, the $d$-cube containing $x$ also contains a sample point $y$, so $\lVert x - y \rVert \leq \sqrt{d}\,\epsilon$, the diameter of a $d$-cube. Similarly, if $y \in \mathrm{CLA}_{\epsilon}(X)$, then the $d$-cube containing $y$ contains some point $x \in X$, and again $\lVert x - y \rVert \leq \sqrt{d}\,\epsilon$. Therefore,
$$d_H\big(X, \mathrm{CLA}_{\epsilon}(X)\big) \leq \sqrt{d}\,\epsilon.$$
By the definition of the Gromov-Hausdorff distance, we have
$$d_{GH}\big(X, \mathrm{CLA}_{\epsilon}(X)\big) \leq d_H\big(X, \mathrm{CLA}_{\epsilon}(X)\big) \leq \sqrt{d}\,\epsilon,$$
and the same argument applies to $\mathrm{CLA}^{c}_{\epsilon}(X)$, since the center of a $d$-cube lies in that cube. Then the inequalities are obtained immediately by Proposition 2.1, as desired. ∎
Let $p_i : \mathbb{R}^d \rightarrow \mathbb{R}$ denote the $i$th projection defined by $p_i(x_1, \ldots, x_d) = x_i$, for each $i = 1, \ldots, d$. Let $M_i$ and $m_i$ denote the maximum and the minimum of $p_i(X)$, respectively. Then the following is obtained by using the Pigeonhole principle.
Lemma 3.2.
If $\epsilon > 0$ satisfies $\prod_{i=1}^{d}\left(\left\lceil \frac{M_i - m_i}{\epsilon} \right\rceil + 1\right) < |X|$, then $|\mathrm{CLA}_{\epsilon}(X)| < |X|$.
Note that CLA is meaningful to apply only when the number of points in $X$ is actually reduced. Therefore, it is important to consider the extent to which the data size is reduced. From now on, we will consider the reduction rate of the number of points in $X$, denoted by $r$, and use the corresponding value of $\epsilon$ for each $r$. By changing the reduction rate $r$, we can observe the extent to which the computation time is reduced when CLA is applied.
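The paper does not prescribe how $\epsilon$ is chosen for a given reduction rate $r$; one simple possibility (our own choice, purely for illustration) is a bisection search on $\epsilon$, exploiting the fact that the number of non-empty cubes decreases as $\epsilon$ grows.

```python
import numpy as np

def cla_size(X, eps):
    """Number of points remaining after CLA with parameter eps."""
    keys = np.floor(X / eps).astype(np.int64)
    return len(np.unique(keys, axis=0))

def eps_for_reduction(X, rate, iters=50):
    """Bisection search for eps so that roughly `rate` of the points are removed."""
    target = (1.0 - rate) * len(X)          # desired size of CLA_eps(X)
    lo, hi = 1e-9, np.ptp(X, axis=0).max()  # eps range: tiny cubes .. whole bounding box
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if cla_size(X, mid) > target:       # still too many points: enlarge the cubes
            lo = mid
        else:
            hi = mid
    return hi

rng = np.random.default_rng(3)
X = rng.random((5_000, 3))
eps = eps_for_reduction(X, rate=0.9)        # aim to remove roughly 90% of the points
print(eps, cla_size(X, eps) / len(X))
```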
3.2. Data
We construct the dataset using NumPy 1.24 in Python 3.9; it consists of pairs of two point clouds in Euclidean space. Both point clouds in a pair have the same number of points, say $N$. For a fixed integer $s$ used as the random seed, two types of point clouds, namely sphere type and random type, are formed as follows:
(1) Sphere type: $N$ points arranged in a sphere shape with noise, constructed by using the spherical coordinate system;
(2) Random type: $N$ uniformly random points.
One can normalize both point clouds into a common $d$-dimensional box so that they have the same scale. Thus, for each seed $s$, there is a pair of two point clouds, and letting $s$ range over a fixed set of seed values produces the full dataset of such pairs. For example, Figure 4 shows a 2-dimensional (2D) dataset and a 3-dimensional (3D) dataset for one choice of $N$.
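The exact construction of the dataset is not reproduced here; the sketch below shows one way such a pair of point clouds could be generated with NumPy. The noise level, the use of normalized Gaussians instead of explicit spherical coordinates, and the normalization are our own illustrative choices.

```python
import numpy as np

def make_pair(n_points, dim, seed):
    """One sphere-type and one random-type point cloud, normalized to [0, 1]^dim."""
    rng = np.random.default_rng(seed)

    # Sphere type: noisy points on the unit sphere (normalized Gaussians are
    # used here for brevity instead of explicit spherical coordinates).
    g = rng.normal(size=(n_points, dim))
    sphere = g / np.linalg.norm(g, axis=1, keepdims=True)
    sphere += rng.normal(scale=0.05, size=sphere.shape)   # add noise

    # Random type: uniformly random points.
    random_pts = rng.uniform(-1, 1, size=(n_points, dim))

    # Normalize both into the unit box so they share the same scale.
    def to_unit_box(P):
        return (P - P.min(axis=0)) / np.ptp(P, axis=0)

    return to_unit_box(sphere), to_unit_box(random_pts)

sphere2d, random2d = make_pair(500, dim=2, seed=0)
sphere3d, random3d = make_pair(500, dim=3, seed=0)
```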


In Figure 5, the datasets whose points are colored orange are obtained from the datasets in Figure 4 by applying CLA (with suitable values of $\epsilon$) to the 2D dataset and to the 3D dataset, respectively, using the centers of the $d$-cubes as the sample points.


To evaluate the efficiency and effectiveness of CLA, it is necessary to analyze the computation time and the classification accuracy. The workflow on the left in Figure 6 represents a pipeline that measures the time required to compute a persistent diagram with and without CLA. One can evaluate the effectiveness of CLA using a support vector machine (SVM) [7] on the dataset. The workflow on the right in Figure 6 shows a pipeline for comparing the accuracy of data analysis with and without CLA. The hardware environment was a 12th Gen Intel(R) Core(TM) i7-12700K at 3.61 GHz with 128 GB RAM.

3.3. Efficiency in Computation Time
To assess the efficiency of calculating a PD with CLA, we use a random type point cloud as described in Section 3.2. Figure 7 illustrates the run-time for calculating the PD of the 2D and the 3D point clouds as the number of points $N$ increases in fixed increments in both cases. We compare the time it takes to calculate the PD of the original point cloud with the time it takes to calculate the PD of the same point cloud after applying CLA with three different reduction rates, which we denote $r_1$, $r_2$, and $r_3$.
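A sketch of the timing comparison in the left-hand workflow of Figure 6, assuming ripser and the CLA sketch from Section 3.1; the point-cloud size and the value of $\epsilon$ below are illustrative only.

```python
import time
import numpy as np
from ripser import ripser

def cla(X, eps):  # compact CLA sketch (Section 3.1)
    keys = np.floor(X / eps).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return X[np.sort(idx)]

def timed_pd(P, maxdim=1):
    """Return the persistence diagrams of P and the wall-clock time used."""
    t0 = time.perf_counter()
    dgms = ripser(P, maxdim=maxdim)["dgms"]
    return dgms, time.perf_counter() - t0

rng = np.random.default_rng(4)
X = rng.random((2_000, 2))                 # random-type 2D point cloud

_, t_full = timed_pd(X)
_, t_cla = timed_pd(cla(X, eps=0.03))
print(f"without CLA: {t_full:.2f}s, with CLA: {t_cla:.2f}s")
```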


As shown in graph (A) of Figure 7, there is a noticeable difference in computation time between applying and not applying CLA as the size of the 2D data increases. In particular, the results for one representative value of $N$ are shown in Table 1. The rates of decrease in computation time when applying CLA with the reduction rates $r_1$, $r_2$, and $r_3$ are greater than 93% in all cases.
| | Without CLA | CLA ($r_1$) | CLA ($r_2$) | CLA ($r_3$) |
|---|---|---|---|---|
| Computation Time (sec) | 1046.5445 | 65.9021 | 64.6348 | 64.9718 |
| Rate of Decrease (%) | | 93.7 | 93.8 | 93.8 |
Graph (B) of Figure 7 shows that applying CLA to the 3D point cloud effectively reduces the computation time compared to not using CLA. Specifically, the calculation times for one representative value of $N$ are shown in Table 2.
| | Without CLA | CLA ($r_1$) | CLA ($r_2$) | CLA ($r_3$) |
|---|---|---|---|---|
| Computation Time (sec) | 3474.1446 | 1302.1153 | 571.3973 | 146.7123 |
| Rate of Decrease (%) | | 62.5 | 83.5 | 95.7 |
The rates of decrease in computation time when applying CLA with the three reduction rates range from 62.5% to 95.7%. In particular, for the largest 3D point clouds considered, it was not possible to obtain the PD without applying CLA due to our memory constraints and computational limitations. However, when CLA is applied, the PD can be successfully calculated even in those cases.
3.4. Preservation of Classification Accuracy
The goal of this section is to observe the changes in the original dataset and its PD when applying CLA, and to compare the data classification accuracy.
To compare the data classification accuracy when CLA is applied and when it is not applied, we analyze the results of the right workflow in Figure 6 for the sphere type and random type point clouds presented in Section 3.2.
The performance of SVM is intricately tied to the careful selection of hyperparameters. Since our goal is to facilitate a straightforward and unbiased comparison of the SVM performance for the two types of point clouds, we deliberately refrained from adjusting any parameters in the SVM employed.
First, we analyze how the 2D and 3D datasets, as well as their corresponding PDs, change when applying CLA with the reduction rates $r_1$, $r_2$, and $r_3$. The first rows in Figure 8 and Figure 9 show the original point clouds and their corresponding PDs without applying CLA. The second, third, and fourth rows in Figure 8 and Figure 9 show the reconstructed datasets obtained by applying CLA with the reduction rates $r_1$, $r_2$, and $r_3$, respectively, to the original datasets, along with their corresponding PDs.

When ignoring the points near the diagonal in the PDs, which can be considered as noise, it is evident that the features of the first persistent homology (orange points) of the 2D original datasets and the second persistent homology (green points) of the 3D original datasets are well-preserved in the PDs, despite the increase in the reduction rate of CLA.

Second, we investigate the preservation of classification accuracy when applying CLA. For this purpose, we compare the classification accuracy obtained from the following three data processing methods, using the dataset presented in Section 3.2: the first classifies the sphere type and random type point clouds using SVM alone, the second classifies their PDs using SVM, and the third classifies the PDs of the reconstructed point clouds obtained by applying CLA using SVM. The classification accuracies obtained from the 2D and 3D datasets are depicted in graphs (A) and (B) of Figure 10, respectively. In both datasets, the number of points $N$ increases in fixed increments.
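The following sketch outlines the three processing pipelines compared here, assuming ripser and scikit-learn. The paper does not specify how PDs are vectorized before being passed to the SVM, so the sketch uses simple lifetime statistics of the degree-1 diagram purely as a placeholder featurization, with small 2D point clouds for speed; accuracies obtained from it are not the reported results.

```python
import numpy as np
from ripser import ripser
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def cla(X, eps):  # compact CLA sketch (Section 3.1)
    keys = np.floor(X / eps).astype(np.int64)
    _, idx = np.unique(keys, axis=0, return_index=True)
    return X[np.sort(idx)]

def pd_features(P):
    """Placeholder featurization: statistics of the degree-1 lifetimes."""
    dgm = ripser(P, maxdim=1)["dgms"][1]
    life = dgm[:, 1] - dgm[:, 0] if len(dgm) else np.zeros(1)
    return np.array([life.max(), life.mean(), life.sum(), float(len(life))])

def make_cloud(kind, n, seed):
    rng = np.random.default_rng(seed)
    if kind == "sphere":  # noisy circle in R^2
        t = rng.uniform(0, 2 * np.pi, n)
        P = np.c_[np.cos(t), np.sin(t)] + rng.normal(scale=0.05, size=(n, 2))
    else:                 # uniformly random points
        P = rng.uniform(-1, 1, (n, 2))
    return (P - P.min(axis=0)) / np.ptp(P, axis=0)   # normalize into [0, 1]^2

clouds = [make_cloud(kind, 300, seed) for seed in range(40) for kind in ("sphere", "random")]
labels = np.array([0, 1] * 40)

feat_raw = np.array([P[:50].ravel() for P in clouds])              # SVM alone (crude: truncated coordinates)
feat_pd  = np.array([pd_features(P) for P in clouds])              # SVM on PDs
feat_cla = np.array([pd_features(cla(P, 0.05)) for P in clouds])   # SVM on PDs after CLA

for name, F in (("SVM alone", feat_raw), ("TDA + SVM", feat_pd), ("CLA + TDA + SVM", feat_cla)):
    print(name, cross_val_score(SVC(), F, labels, cv=5).mean())
```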


The blue dashed lines indicate the classification accuracy when only SVM is applied to the given datasets. The orange dotted lines present the classification accuracy of applying SVM to the PDs obtained from the datasets. The green dotted lines show the classification accuracy of applying SVM to the PDs of the reconstructed datasets obtained by applying CLA to the original datasets. When using only SVM for classification, it can be observed from graphs (A) and (B) in Figure 10 that the classification accuracy is noticeably lower for both the 2D and the 3D datasets. On the other hand, when applying SVM to the PDs obtained from the datasets (i.e., utilizing TDA), the classification accuracy is consistently high in both the 2D and the 3D datasets. Furthermore, even after reducing the number of points in both the 2D and 3D datasets by applying CLA, the classification accuracy remains at the same high level. This observation highlights the fact that CLA is a data reduction technique that effectively preserves the topological features of the data.
4. Conclusion
The research results can be summarized as follows. It is natural that as the reduction rate increases, the corresponding computation time decreases accordingly. Table 1 indicates that the use of CLA reduces the PD calculation time by more than 93% compared to calculations without CLA. The disparity in computational cost becomes more pronounced when analyzing the 3D data. In particular, for sufficiently large 3D point clouds, the assessment of the computational cost without CLA was impeded by our computer specifications. When CLA is applied, the rates of decrease for the 3D data show that the computational cost is up to 95.7% less than that of calculations without CLA, as presented in Table 2. Notably, the reduction in computational cost is greater for the 3D data than for the 2D data.
The application of TDA in combination with SVM for the sphere type and random type point clouds resulted in significantly higher classification accuracy compared to using SVM alone, as observed in Figure 10.
Although CLA may slightly deform the given data as a preprocessing technique, there was no significant difference in classification accuracy when comparing the combined use of TDA and CLA with using TDA alone.
This suggests that CLA is an effective data reduction technique for TDA, and it is also theoretically supported by Theorem 3.1.
TDA offers a novel approach to big data analysis. However, persistent homology, one of the representative methodologies of TDA, is often infeasible for large datasets due to its exponentially increasing computational complexity. Data reduction is the most fundamental technique used in data science to address such computational complexity problems, and various data reduction methods suitable for data analysis using persistent homology have been studied. The method proposed in this paper, CLA, improves upon previous related studies by allowing the data to be reduced as much as desired while preserving the topological features of high-dimensional and complex datasets. Therefore, CLA can serve as an effective tool for extending the use of topology in machine learning and in future data analysis.
Acknowledgement
All authors contributed equally to this paper. The work of Seung Yeop Yang and Seonmi Choi was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2022R1A5A1033624). The work of Seonmi Choi was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2021R1I1A1A01049100). The work of Jeong Rye Park was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (No. 2021R1I1A1A01057767).
References
- [1] U. Bauer, Ripser: efficient computation of Vietoris-Rips persistence barcodes, J. Appl. Comput. Topol. 5 (2021), no. 3, 391–423.
- [2] G. Carlsson, Topology and data, Bull. Amer. Math. Soc. (N.S.), 46 (2009), no. 2, 255–308.
- [3] G. Carlsson and V. De Silva, Zigzag persistence, Found. Comput. Math., 10 (2010), no. 4, 367–405.
- [4] F. Chazal, D. Cohen-Steiner, L. J. Guibas, F. Mémoli, and S. Y. Oudot, Gromov-Hausdorff Stable Signatures for Shapes using Persistence, Computer Graphics Forum, 28 (2009), no. 5, 1393-1403.
- [5] F. Chazal, B. T. Fasy, F. Lecci, B. Michel, A. Rinaldo, and L. Wasserman, Subsampling methods for persistent homology, In Proceedings of the 32nd International Conference on Machine Learning, F. Bach and D. Blei Eds. Lille, France, 37 (2015), JMLR: W&CP, 2143–2151.
- [6] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, Stability of persistence diagrams, Discrete Comput. Geom., 37 (2007), no. 1, 103–120.
- [7] C. Cortes and V. Vapnik, Support-vector networks, Mach. Learn., 20 (1995), no. 3, 273–297.
- [8] H. Edelsbrunner and J. Harer, Computational topology: an introduction, American Mathematical Society, 2010.
- [9] R. Ghrist, Barcodes: the persistent topology of data, Bull. Amer. Math. Soc. (N.S.), 45 (2008), no. 1, 61–75.
- [10] M. Gromov, Hyperbolic groups, Essays in group theory, 75–263, Math. Sci. Res. Inst. Publ., 8, Springer, New York, 1987.
- [11] P. Y. Lum, G. Singh, A. Lehman, T. Ishkanov, M. Vejdemo-Johansson, M. Alagappan, J. Carlsson, and G. Carlsson, Extracting insights from the shape of complex data using topology, Scientific Reports, 3, Feb. 2013.
- [12] A. Moitra, N. Malott, and P. A. Wilsey, Cluster-based data reduction for persistent homology, In 2018 IEEE International Conference on Big Data, ser. Big Data 2018, Dec. 2018, 327–334.
- [13] N. O. Malott, A. M. Sens, and P. A. Wilsey, Topology Preserving Data Reduction for Computing Persistent Homology, In 2020 IEEE International Conference on Big Data, ser. Big Data 2020, Dec. 2020, 2681–2690.
- [14] N. O. Malott and P. A. Wilsey, Fast computation of persistent homology with data reduction and data partitioning, In 2019 IEEE International Conference on Big Data, ser. Big Data 2019, Dec. 2019, pp. 880–889.
- [15] N. Otter, M. A. Porter, U. Tillmann, P. Grindrod, and H. A. Harrington, A roadmap for the computation of persistent homology, EPJ Data Science, 6 (2017), no. 1.
- [16] R. Rabadán and A. J. Blumberg, Topological Data Analysis for Genomics and Evolution: Topology in Biology, Cambridge University Press, 2020.
- [17] K. N. Ramamurthy, K. R. Varshney, and J. J. Thiagarajan, Computing persistent homology under random projection, In 2014 IEEE Workshop on Statistical Signal Processing (SSP), Jun. 2014, pp. 105–108.
- [18] D. R. Sheehy, Linear-size approximations to the Vietoris-Rips filtration, Discrete Comput. Geom., 49 (2013), no. 4, 778–796.
- [19] D. R. Sheehy, The persistent homology of distance functions under random projection, In Proceedings of the Thirtieth Annual Symposium on Computational Geometry, ser. SOCG’14. New York, NY, USA: ACM, 2014, pp. 328–334.
- [20] V. de Silva and G. Carlsson, Topological estimation using witness complexes, In Eurographics Symposium on Point-Based Graphics, ser. SPBG ’04, M. Gross, H. Pfister, M. Alexa, and S. Rusinkiewicz, Eds. Goslar, DEU: The Eurographics Association, (2004), 157–166.
- [21] L. Vietoris, Über den höheren Zusammenhang kompakter Räume und eine Klasse von zusammenhangstreuen Abbildungen (German), Math. Ann. 97 (1927), no. 1, 454–472.
- [22] A. Zomorodian and G. Carlsson, Computing Persistent Homology, Discrete Comput. Geom., 33 (2005), no. 2, 249-274.