Condensed Representation of Machine Learning Data
Abstract
Training a machine learning model requires sufficient data. Sufficiency is not only a matter of quantity, but also of relevancy and reduced redundancy. Data-generating processes produce massive amounts of data, and using such big data in raw form consumes considerable computational resources. Instead of the raw data, a proper Condensed Representation can be used. By combining K-means, a well-known clustering method, with correction and refinement facilities, a novel Condensed Representation method for Machine Learning applications is introduced. To present the novel method meaningfully and visually, synthetically generated data is employed. It is shown that training on the condensed representation, instead of the raw data, yields acceptably accurate models.
keywords:
Condensed Representation, Supervised Learning, Piecewise Linear Classification
1 Introduction
The concept of condensed representation is considered in [1], [2] as a way of determining frequent patterns in large databases. It is proposed as an idea for improving pattern matching against such databases: instead of querying the actual database, a condensed representation can be queried for efficiency. The condensed representation should give approximately correct results.
Online probability density estimation is used in [3] as a condensed representation method for data mining tasks, especially when the whole dataset is not available in advance.
In [4], online density estimators are employed to obtain condensed representations from data streams.
Wavelet-based PDFs are used in [5] as a condensed representation of training data for training a CNN, which reduced the computational resources required.
The resources in the literature, such as [6, 7, 8], are mainly related, but not limited, to data mining and set operations on sequential datasets.
The machine learning subject most relevant to this study is the autoencoder [9]. Autoencoders represent data in a new space with different dimensionality.
It has been shown that autoencoders can generate condensed representations of the sequential input data [10]. Then a classifier can process the representations instead of the original data.
Autoencoders are used in [11] to obtain condensed representations of urban road networks.
The main goal of this paper is to introduce a novel approach to the condensed representation of machine learning data and to demonstrate its practical use. Accordingly, the principal components of the novel method are defined, combined, and tested on standard synthetic datasets. Finally, concluding thoughts are shared.
2 The Algorithm
The algorithm consists of two main steps. In the initialization step, initial representation centers are placed over the data. Following the initialization, the refinement process improves the purities of the classes.
2.1 Initialization
The initialization process starts with the separation of different classes of data instances. Given labels of the dataset are used to discriminate classes. Then, to distribute initial Condensed Representation Centers (CRCs) over the data, the k-Means algorithm is applied to every class of the data separately. The k parameter can be chosen independently for every class.
The resulting cluster centers, the initial CRCs, are placed over the whole dataset. Every data instance is assigned to the nearest CRC. Thus, the space is tessellated into convex polyhedral cells, each defined by the CRC it encloses.
When all classes are combined, some of the CRCs might lie over the intersections of different classes. In order to determine the class represented by a CRC, plurality voting is applied to the covered data instances; the result of the vote becomes the label of that CRC. (Algorithm 1) (Figure 2) (Figure 4)
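As a minimal illustration of the initialization step, the Python sketch below runs k-means on each class separately, pools the resulting centers as CRCs, hard-assigns every instance to its nearest CRC, and labels each CRC by plurality vote. It assumes NumPy arrays and scikit-learn's KMeans; the function names and the per-class k dictionary are illustrative and are not taken from Algorithm 1.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_crcs(X, y, k_per_class):
    """Run k-means on every class separately and pool the centers as CRCs."""
    centers = []
    for c in np.unique(y):
        km = KMeans(n_clusters=k_per_class[c], n_init=10).fit(X[y == c])
        centers.append(km.cluster_centers_)
    return np.vstack(centers)

def assign_nearest(X, crcs):
    """Hard-assign every data instance to its nearest CRC."""
    d = np.linalg.norm(X[:, None, :] - crcs[None, :, :], axis=2)
    return d.argmin(axis=1)

def vote_labels(assign, y, n_crcs, n_classes):
    """Plurality vote of the covered instances gives each CRC its label."""
    votes = np.zeros((n_crcs, n_classes), dtype=int)
    np.add.at(votes, (assign, y), 1)
    return votes.argmax(axis=1)
```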
2.2 Refinement
The refinement process advances CRCs to improve the purities of the enclosed cells. Several suboperations are required to accomplish the refinement process. The base elements of the refinement should be defined before explaining the overall process. (Figure 2) (Figure 4)
2.2.1 Labeling
Some of the CRCs might lie over the intersections of different classes. In order to determine the classes represented by the CRCs, plurality voting is applied to the assigned data instances; the result of the vote becomes the label of each CRC. (Algorithm 2)
2.2.2 Purity Metric
The purity of data assignments to CRCs is measured to determine how clearly class boundaries are set. Assignment purities are measured independently for every CRC. This metric is used to guide the convergence of the refinement. (Algorithm 3)
2.2.3 Population Counts
The number of data instances contained by a CRC indicates its importance and effectiveness within the total coverage. Less populated polyhedral cells have a smaller effect on the overall purity. (Algorithm 4)
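The sketch below illustrates one plausible realization of the purity metric (Algorithm 3) and the population counts (Algorithm 4), reusing assign_nearest from the initialization sketch above. The population-weighted overall purity is an assumption consistent with the statement that less populated cells affect the overall purity less; the paper's exact formulas are not reproduced here.

```python
import numpy as np

def crc_purity(assign, y, crc_labels):
    """Per-CRC purity and population counts of the current hard assignment."""
    n_crcs = len(crc_labels)
    pop = np.bincount(assign, minlength=n_crcs)                      # population counts
    hits = np.bincount(assign, weights=(y == crc_labels[assign]).astype(float),
                       minlength=n_crcs)
    purity = np.divide(hits, pop, out=np.zeros(n_crcs), where=pop > 0)
    return purity, pop

def overall_purity(purity, pop):
    """Population-weighted purity of the whole CRC configuration."""
    return float((purity * pop).sum() / pop.sum())
```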
2.2.4 Soft Assignment
While the purity metric considers absolute (hard) assignments of the data instances to the CRCs, the refinement uses a weighted soft assignment scheme. A data instance is partially assigned to the nearest and the second-nearest CRCs. Assignment weights for both CRCs are calculated by applying a weighting function. The assignment weights are equal if the distances are equal; as one CRC gets closer, the corresponding assignment weight approaches unity. (Algorithm 5)
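A possible weighting function with the stated properties (equal weights at equal distances, a weight approaching unity as one CRC gets much closer) is sketched below; the specific choice of weights proportional to the opposite distance is an assumption, not necessarily the function used in Algorithm 5.

```python
import numpy as np

def soft_assign(X, crcs):
    """Indices and weights of the nearest and second-nearest CRCs."""
    d = np.linalg.norm(X[:, None, :] - crcs[None, :, :], axis=2)
    order = np.argsort(d, axis=1)[:, :2]                  # nearest, second nearest
    d12 = np.take_along_axis(d, order, axis=1)
    w_near = d12[:, 1] / (d12[:, 0] + d12[:, 1] + 1e-12)  # 0.5 at equal distances
    return order, np.stack([w_near, 1.0 - w_near], axis=1)
```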
2.2.5 Correctness
If a data instance is assigned to a CRC, for a correct assignment, the label of that CRC must be equal to the given label of the data. In the soft assignment case, both CRCs' labels are considered. The correctnesses of those comparisons are specific to the current CRC configuration and might change with a refinement iteration. (Algorithm 6)
2.2.6 Activity Group
A refinement iteration is not executed over the whole dataset. The activity group of the refinement is determined by checking the correctnesses of the first (nearest) and the next (second nearest) CRCs. If both correctnesses are true, the data instance is under a proper cover and needs no further intervention. If both are false, the data instance can be treated as an outlier and left where it is.
A data instance is accepted within the activity group only if its soft assignment is shared between a true and a false CRC. (Eq 1)
\[
\mathcal{A} \;=\; \left\{\, x \;\middle|\; c_{1}(x) \oplus c_{2}(x) \,\right\} \tag{1}
\]
where \(c_{1}(x)\) and \(c_{2}(x)\) denote the correctness of the assignment of \(x\) to its nearest and second-nearest CRCs, respectively.
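The correctness check (Algorithm 6) and the activity-group selection of Eq. (1) can be sketched as follows, assuming the order array returned by soft_assign above; the XOR form mirrors the rule that exactly one of the two soft-assigned CRCs must be correct.

```python
import numpy as np

def correctness(order, y, crc_labels):
    """True where an assigned CRC's label matches the instance's given label."""
    return crc_labels[order] == y[:, None]                # shape (n, 2)

def activity_group(correct):
    """Eq. (1): exactly one of the two soft-assigned CRCs is correct."""
    return np.logical_xor(correct[:, 0], correct[:, 1])
```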
2.2.7 Advancement Vectors
True assignments attract CRCs, and false assignments repel them. For a CRC, the combined effect of all of its assignments forms its advancement vector. The advancement vectors are calculated using weighted vector differences between the assigned data instances and the CRCs. Weights come from the soft assignments and are signed according to the correctnesses. (Algorithm 7)
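One way to realize the advancement vectors (Algorithm 7) is sketched below: within the activity group, correct assignments pull a CRC toward the instance and incorrect ones push it away, each contribution scaled by its soft-assignment weight. The exact accumulation used in the paper may differ.

```python
import numpy as np

def advancement_vectors(X, crcs, order, weights, correct, active):
    """Sum of weighted, signed differences between data instances and CRCs."""
    vec = np.zeros_like(crcs)
    for slot in (0, 1):                                    # nearest, second nearest
        idx = order[active, slot]
        sign = np.where(correct[active, slot], 1.0, -1.0)  # attract vs. repel
        contrib = (sign * weights[active, slot])[:, None] * (X[active] - crcs[idx])
        np.add.at(vec, idx, contrib)
    return vec
```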
2.2.8 Selective Advancement
Advancement of CRCs is selectively applied depending on the purity improvements. If the updated state of a CRC results in better purity, the up-to-date state is used; otherwise, the previous state is kept. (Algorithm 8)
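A sketch of selective advancement (Algorithm 8) is given below, reusing crc_purity and assign_nearest from the earlier sketches; the step size eta is an illustrative parameter, not a value from the paper.

```python
import numpy as np

def selective_advance(X, y, crcs, crc_labels, vec, eta=0.1):
    """Move a CRC only if the move improves the purity of its cell."""
    purity_old, _ = crc_purity(assign_nearest(X, crcs), y, crc_labels)
    candidate = crcs + eta * vec
    purity_new, _ = crc_purity(assign_nearest(X, candidate), y, crc_labels)
    accept = purity_new > purity_old                       # per-CRC acceptance
    return np.where(accept[:, None], candidate, crcs)
```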
2.2.9 Convergence
During the refinement, the overall purity of the present CRC configuration is tracked at every iteration, and the parameters of the purest configuration are saved. If the purity begins to decline, the refinement process ends. (Algorithm 9)
2.2.10 The Overall Refinement Process
In every iteration, soft assignment is applied to the whole dataset for the current state of the CRCs. Then, the correctnesses of the primary and secondary assignments are determined. Afterward, the current activity group is designated using the correctnesses of the soft assignments. Later, the advancement candidate of every CRC is found considering weighted and polarized advancement vectors. Finally, selective advancement is applied, and the next state is reached. (Algorithm 10)
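Chaining the sketches above gives one possible realization of the overall refinement loop, including the convergence rule of Algorithm 9 (save the purest configuration, stop when the overall purity drops); the iteration cap and bookkeeping details are illustrative.

```python
import numpy as np

def refine(X, y, crcs, crc_labels, max_iter=100, eta=0.1):
    """One possible realization of the overall refinement loop (Algorithm 10)."""
    best_crcs, best_total = crcs.copy(), -1.0
    for _ in range(max_iter):
        order, weights = soft_assign(X, crcs)              # soft assignment
        correct = correctness(order, y, crc_labels)        # primary/secondary correctness
        active = activity_group(correct)                   # current activity group
        vec = advancement_vectors(X, crcs, order, weights, correct, active)
        crcs = selective_advance(X, y, crcs, crc_labels, vec, eta)
        purity, pop = crc_purity(assign_nearest(X, crcs), y, crc_labels)
        total = overall_purity(purity, pop)
        if total < best_total:                             # purity starts to go down
            break
        best_crcs, best_total = crcs.copy(), total         # save the purest configuration
    return best_crcs
```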
3 Experimental Results
The platform used for the experimental tests and comparisons is as follows: the CPU is an Intel(R) Core(TM) i7-7700 running at 3.60 GHz; the system has 32 GB of DDR4 RAM running at 2133 MHz; an Nvidia GeForce GTX 1080 Ti is the system GPU.
Circles and moons datasets are generated with varying noise levels using the Scikit-Learn [12] library functions. The condensation is done on the GPU. Using the condensed representations, multilayer perceptron classifier [12] models are trained on the CPU and tested for accuracy on the original datasets. It can be seen in (Figure 5) and (Figure 6) that training nonlinear machine learning models on the condensed representations of the training datasets gives practically acceptable accuracies with the possibility of easier training.
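This workflow can be reproduced in spirit with the sketches above; the snippet below condenses a moons dataset and trains an MLP on the condensed representation, evaluating it on the original data. The dataset size, noise level, per-class k, and MLP settings are illustrative and do not correspond to the paper's reported experiments.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=5000, noise=0.2, random_state=0)

crcs = init_crcs(X, y, k_per_class={0: 20, 1: 20})
crc_labels = vote_labels(assign_nearest(X, crcs), y, len(crcs), n_classes=2)
crcs = refine(X, y, crcs, crc_labels)

# Train on the condensed representation, evaluate on the original data.
clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
clf.fit(crcs, crc_labels)
print("accuracy on the original dataset:", clf.score(X, y))
```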
4 Conclusion
Massive amounts of data are continuously generated by many processes. Such big data mostly contains redundancy, so it can be reduced to a condensed representation without losing much accuracy, especially for machine learning applications. In this paper, an alternative way of handling machine learning datasets is introduced. The ideas introduced in this paper can be summarized as follows:
1. K-means clustering is used as a piecewise-linear classifier model generator.
2. The generated PWL classifier models are composed of condensed representation centers, which have the same dimensionality as the data.
3. The CRC locations can be fine-tuned for class-separation purity, which is equivalent to improving the accuracy of an NN classifier classifying on the CRCs.
4. The final CRCs can be used as training data for various nonlinear classifiers, such as multilayer neural networks.
Using a condensed representation of the data instead of the raw data allows different stages of the machine learning pipeline to be addressed from different aspects. For example, an online machine learning pipeline can be separated into two main stages: online condensation and machine learning. A more specific example could be an autonomous mobility platform that has data-condensation edge devices with central machine learning processing.
The method is constructed flexibly and modularly, so that any sub-element of the novel method introduced in this paper can be replaced or modified to study different aspects of condensed representation.
Acknowledgments
This research was supported by the Turkish Scientific and Technological Research Council (TUBITAK) under project no. 121E537.
We would like to thank the reviewers for their thoughtful comments and their constructive remarks.
References
- [1] H. Mannila, Data mining: Machine learning, statistics, and databases, in: Proceedings of 8th International Conference on Scientific and Statistical Data Base Management, 1996, pp. 2–9. doi:10.1109/SSDM.1996.505910.
- [2] H. Mannila, Methods and problems in data mining, in: F. Afrati, P. Kolaitis (Eds.), Database Theory — ICDT ’97, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 1997, pp. 41–55. doi:10.1007/3-540-62222-5_35.
- [3] M. Geilke, A. Karwath, S. Kramer, A probabilistic condensed representation of data for stream mining, in: 2014 International Conference on Data Science and Advanced Analytics (DSAA), 2014, pp. 297–303. doi:10.1109/DSAA.2014.7058088.
- [4] M. Geilke, A. Karwath, S. Kramer, Modeling recurrent distributions in streams using possible worlds, in: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2015, pp. 1–9. doi:10.1109/DSAA.2015.7344814.
- [5] A. Peel, F. Lalande, J.-L. Starck, V. Pettorino, J. Merten, C. Giocoli, M. Meneghetti, M. Baldi, Distinguishing standard and modified gravity cosmologies with machine learning, Physical Review D 100 (2) (2019) 023508. doi:10.1103/PhysRevD.100.023508.
- [6] A. Soulet, B. Crémilleux, F. Rioult, Condensed Representation of Emerging Patterns, in: H. Dai, R. Srikant, C. Zhang (Eds.), Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2004, pp. 127–132. doi:10.1007/978-3-540-24775-3_16.
- [7] M. Plantevit, B. Crémilleux, Condensed Representation of Sequential Patterns According to Frequency-Based Measures, in: N. M. Adams, C. Robardet, A. Siebes, J.-F. Boulicaut (Eds.), Advances in Intelligent Data Analysis VIII, Vol. 5772, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 155–166. doi:10.1007/978-3-642-03915-7_14.
- [8] M. Boettcher, M. Spott, R. Kruse, A Condensed Representation of Itemsets for Analyzing Their Evolution over Time, in: W. Buntine, M. Grobelnik, D. Mladenić, J. Shawe-Taylor (Eds.), Machine Learning and Knowledge Discovery in Databases, Vol. 5781, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 163–178. doi:10.1007/978-3-642-04180-8_28.
- [9] S. Skansi, Autoencoders, in: S. Skansi (Ed.), Introduction to Deep Learning: From Logical Calculus to Artificial Intelligence, Undergraduate Topics in Computer Science, Springer International Publishing, Cham, 2018, pp. 153–163. doi:10.1007/978-3-319-73004-2_8.
- [10] A. Alazizi, A. Habrard, F. Jacquenet, L. He-Guelton, F. Oblé, Dual Sequential Variational Autoencoders for Fraud Detection, in: M. R. Berthold, A. Feelders, G. Krempl (Eds.), Advances in Intelligent Data Analysis XVIII, Vol. 12080, Springer International Publishing, Cham, 2020, pp. 14–26. doi:10.1007/978-3-030-44584-3_2.
- [11] K. Kempinska, R. Murcio, Modelling urban networks using Variational Autoencoders, Applied Network Science 4 (1) (2019) 114. doi:10.1007/s41109-019-0234-0.
- [12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (85) (2011) 2825–2830.