Condensed Representation of Machine Learning Data
Abstract
Training a machine learning model requires sufficient data. Sufficiency is not only a matter of quantity, but also of relevancy and reduced redundancy. Data-generating processes produce massive amounts of data, and using such big data in raw form consumes considerable computational resources. Instead of the raw data, a proper Condensed Representation can be used. By combining K-means, a well-known clustering method, with correction and refinement facilities, a novel Condensed Representation method for Machine Learning applications is introduced. To present the novel method meaningfully and visually, synthetically generated data is employed. It is shown that training on the condensed representation, instead of the raw data, yields acceptably accurate models.
keywords:
Condensed Representation, Supervised Learning, Piecewise Linear Classification
1 Introduction
The concept of condensed representation is considered in [1], [2] as a way of determining frequent patterns in large databases. It is proposed as an idea for improving pattern matching against such databases: instead of querying the actual database, a condensed representation can be queried for efficiency. The condensed representation should give approximately correct results.
Online probability density estimation is used in [3] as a condensed representation method for data mining tasks, especially when the whole dataset is not available in advance.
In [4], online density estimators are employed to obtain condensed representations from data streams.
Wavelet-based PDFs are used in [5] as a condensed representation of training data for training a CNN, which reduced the computational resources required.
The resources in the literature, such as [6, 7, 8], are mainly related, but not limited, to data mining and set operations on sequential datasets.
The machine learning subject most relevant to this study is the autoencoder [9]. Autoencoders represent data in a new space with different dimensionality.
It has been shown that autoencoders can generate condensed representations of the sequential input data [10]. Then a classifier can process the representations instead of the original data.
Autoencoders are used in [11] to obtain condensed representations of urban road networks.
The main goal of this paper is to introduce a novel approach to the condensed representation of machine learning data and to demonstrate its practical use. Accordingly, the principal components of the novel method are defined, combined, and tested on standard synthetic datasets. Finally, concluding thoughts are shared.
2 The Algorithm
The algorithm consists of two main steps. In the initialization step, initial representation centers are placed over the data. Following the initialization, the refinement process improves the purities of the classes.
2.1 Initialization
The initialization process starts with the separation of different classes of data instances. Given labels of the dataset are used to discriminate classes. Then, to distribute initial Condensed Representation Centers (CRCs) over the data, the k-Means algorithm is applied to every class of the data separately. The k parameter can be chosen independently for every class.
The resulting cluster centers, the initial CRCs, are placed over the whole dataset. Every data instance is assigned to the nearest CRC. Thus, the space is tessellated into convex polyhedral cells, each defined by the CRC it encloses.
When all classes are combined, some of the CRCs might lie over the intersections of different classes. In order to determine the class represented by a CRC, plurality voting is applied to the covered data instances; the result of the vote becomes the label of that CRC. (Algorithm 1) (Figure 2) (Figure 4)
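As a minimal illustration of the initialization step, the Python sketch below runs k-means on each class separately, pools the resulting centers as CRCs, hard-assigns every instance to its nearest CRC, and labels each CRC by plurality vote. It assumes NumPy arrays and scikit-learn's KMeans; the function names and the per-class k dictionary are illustrative and are not taken from Algorithm 1.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_crcs(X, y, k_per_class):
    """Run k-means on every class separately and pool the centers as CRCs."""
    centers = []
    for c in np.unique(y):
        km = KMeans(n_clusters=k_per_class[c], n_init=10).fit(X[y == c])
        centers.append(km.cluster_centers_)
    return np.vstack(centers)

def assign_nearest(X, crcs):
    """Hard-assign every data instance to its nearest CRC."""
    d = np.linalg.norm(X[:, None, :] - crcs[None, :, :], axis=2)
    return d.argmin(axis=1)

def vote_labels(assign, y, n_crcs, n_classes):
    """Plurality vote of the covered instances gives each CRC its label."""
    votes = np.zeros((n_crcs, n_classes), dtype=int)
    np.add.at(votes, (assign, y), 1)
    return votes.argmax(axis=1)
```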
2.2 Refinement
The refinement process advances CRCs to improve the purities of the enclosed cells. Several suboperations are required to accomplish the refinement process. The base elements of the refinement should be defined before explaining the overall process. (Figure 2) (Figure 4)
2.2.1 Labeling
Some of the CRCs might lie over the intersections of different classes. In order to determine the classes represented by the CRCs, plurality voting is applied to the assigned data instances; the result of the vote becomes the label of each CRC. (Algorithm 2)
2.2.2 Purity Metric
The purity of data assignments to CRCs is measured to determine how clearly class boundaries are set. Assignment purities are measured independently for every CRC. This metric is used to guide the convergence of the refinement. (Algorithm 3)
2.2.3 Population Counts
The number of data instances contained by a CRC indicates its importance and effectiveness within the total coverage. Less populated polyhedral cells have a smaller effect on the overall purity. (Algorithm 4)
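The sketch below illustrates one plausible realization of the purity metric (Algorithm 3) and the population counts (Algorithm 4), reusing assign_nearest from the initialization sketch above. The population-weighted overall purity is an assumption consistent with the statement that less populated cells affect the overall purity less; the paper's exact formulas are not reproduced here.

```python
import numpy as np

def crc_purity(assign, y, crc_labels):
    """Per-CRC purity and population counts of the current hard assignment."""
    n_crcs = len(crc_labels)
    pop = np.bincount(assign, minlength=n_crcs)                      # population counts
    hits = np.bincount(assign, weights=(y == crc_labels[assign]).astype(float),
                       minlength=n_crcs)
    purity = np.divide(hits, pop, out=np.zeros(n_crcs), where=pop > 0)
    return purity, pop

def overall_purity(purity, pop):
    """Population-weighted purity of the whole CRC configuration."""
    return float((purity * pop).sum() / pop.sum())
```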
2.2.4 Soft Assignment
While the purity metric considers absolute (hard) assignments of the data instances to the CRCs, the refinement uses a weighted soft assignment scheme. A data instance is partially assigned to the nearest and the second-nearest CRCs. Assignment weights for both CRCs are calculated by applying a weighting function. The assignment weights are equal if the distances are equal; as one CRC gets closer, the corresponding assignment weight approaches unity. (Algorithm 5)
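A possible weighting function with the stated properties (equal weights at equal distances, a weight approaching unity as one CRC gets much closer) is sketched below; the specific choice of weights proportional to the opposite distance is an assumption, not necessarily the function used in Algorithm 5.

```python
import numpy as np

def soft_assign(X, crcs):
    """Indices and weights of the nearest and second-nearest CRCs."""
    d = np.linalg.norm(X[:, None, :] - crcs[None, :, :], axis=2)
    order = np.argsort(d, axis=1)[:, :2]                  # nearest, second nearest
    d12 = np.take_along_axis(d, order, axis=1)
    w_near = d12[:, 1] / (d12[:, 0] + d12[:, 1] + 1e-12)  # 0.5 at equal distances
    return order, np.stack([w_near, 1.0 - w_near], axis=1)
```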
2.2.5 Correctness
If a data instance is assigned to a CRC, for a correct assignment, the label of that CRC must be equal to the given label of the data. In the soft assignment case, both CRCs' labels are considered. The correctnesses of those comparisons are specific to the current CRC configuration and might change with a refinement iteration. (Algorithm 6)
2.2.6 Activity Group
A refinement iteration is not executed over the whole dataset. The activity group of the refinement is determined by checking the correctnesses of the first (nearest) and the next (second nearest) CRCs. If both correctnesses are true, the data instance is under a proper cover and needs no further intervention. If both are false, the data instance can be treated as an outlier and left where it is.
A data instance is accepted within the activity group only if its soft assignment is shared between a true and a false CRC. (Eq 1)
\[
\mathcal{A} \;=\; \left\{\, x \;\middle|\; c_{1}(x) \oplus c_{2}(x) \,\right\} \tag{1}
\]
where \(c_{1}(x)\) and \(c_{2}(x)\) denote the correctness of the assignment of \(x\) to its nearest and second-nearest CRCs, respectively.
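The correctness check (Algorithm 6) and the activity-group selection of Eq. (1) can be sketched as follows, assuming the order array returned by soft_assign above; the XOR form mirrors the rule that exactly one of the two soft-assigned CRCs must be correct.

```python
import numpy as np

def correctness(order, y, crc_labels):
    """True where an assigned CRC's label matches the instance's given label."""
    return crc_labels[order] == y[:, None]                # shape (n, 2)

def activity_group(correct):
    """Eq. (1): exactly one of the two soft-assigned CRCs is correct."""
    return np.logical_xor(correct[:, 0], correct[:, 1])
```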
2.2.7 Advancement Vectors
True assignments attract CRCs, and false assignments repel them. For a CRC, the combined effect of all of its assignments forms its advancement vector. The advancement vectors are calculated using weighted vector differences between the assigned data instances and the CRCs. Weights come from the soft assignments and are signed according to the correctnesses. (Algorithm 7)
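One way to realize the advancement vectors (Algorithm 7) is sketched below: within the activity group, correct assignments pull a CRC toward the instance and incorrect ones push it away, each contribution scaled by its soft-assignment weight. The exact accumulation used in the paper may differ.

```python
import numpy as np

def advancement_vectors(X, crcs, order, weights, correct, active):
    """Sum of weighted, signed differences between data instances and CRCs."""
    vec = np.zeros_like(crcs)
    for slot in (0, 1):                                    # nearest, second nearest
        idx = order[active, slot]
        sign = np.where(correct[active, slot], 1.0, -1.0)  # attract vs. repel
        contrib = (sign * weights[active, slot])[:, None] * (X[active] - crcs[idx])
        np.add.at(vec, idx, contrib)
    return vec
```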
2.2.8 Selective Advancement
Advancement of CRCs is selectively applied depending on the purity improvements. If the updated state of a CRC results in better purity, the up-to-date state is used; otherwise, the previous state is kept. (Algorithm 8)
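A sketch of selective advancement (Algorithm 8) is given below, reusing crc_purity and assign_nearest from the earlier sketches; the step size eta is an illustrative parameter, not a value from the paper.

```python
import numpy as np

def selective_advance(X, y, crcs, crc_labels, vec, eta=0.1):
    """Move a CRC only if the move improves the purity of its cell."""
    purity_old, _ = crc_purity(assign_nearest(X, crcs), y, crc_labels)
    candidate = crcs + eta * vec
    purity_new, _ = crc_purity(assign_nearest(X, candidate), y, crc_labels)
    accept = purity_new > purity_old                       # per-CRC acceptance
    return np.where(accept[:, None], candidate, crcs)
```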
2.2.9 Convergence
During the refinement, the overall purity of the present CRC configuration is tracked at every iteration, and the parameters of the purest configuration are saved. If the purity begins to decline, the refinement process ends. (Algorithm 9)
2.2.10 The Overall Refinement Process
In every iteration, soft assignment is applied to the whole dataset for the current state of the CRCs. Then, the correctnesses of the primary and secondary assignments are determined. Afterward, the current activity group is designated using the correctnesses of the soft assignments. Later, the advancement candidate of every CRC is found considering weighted and polarized advancement vectors. Finally, selective advancement is applied, and the next state is reached. (Algorithm 10)
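Chaining the sketches above gives one possible realization of the overall refinement loop, including the convergence rule of Algorithm 9 (save the purest configuration, stop when the overall purity drops); the iteration cap and bookkeeping details are illustrative.

```python
import numpy as np

def refine(X, y, crcs, crc_labels, max_iter=100, eta=0.1):
    """One possible realization of the overall refinement loop (Algorithm 10)."""
    best_crcs, best_total = crcs.copy(), -1.0
    for _ in range(max_iter):
        order, weights = soft_assign(X, crcs)              # soft assignment
        correct = correctness(order, y, crc_labels)        # primary/secondary correctness
        active = activity_group(correct)                   # current activity group
        vec = advancement_vectors(X, crcs, order, weights, correct, active)
        crcs = selective_advance(X, y, crcs, crc_labels, vec, eta)
        purity, pop = crc_purity(assign_nearest(X, crcs), y, crc_labels)
        total = overall_purity(purity, pop)
        if total < best_total:                             # purity starts to go down
            break
        best_crcs, best_total = crcs.copy(), total         # save the purest configuration
    return best_crcs
```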
3 Experimental Results
The platform used for the experimental tests and comparisons is as follows: the CPU is an Intel(R) Core(TM) i7-7700 running at 3.60 GHz; the system has 32 GB of DDR4 RAM running at 2133 MHz; an Nvidia GeForce GTX 1080 Ti is the system GPU.
Circles and moons datasets are generated with varying noise levels using the Scikit-Learn [12] library functions. The condensation is done on the GPU. Using the condensed representations, multilayer perceptron classifier [12] models are trained on the CPU and tested for accuracy on the original datasets. It can be seen in (Figure 5) and (Figure 6) that training nonlinear machine learning models on the condensed representations of the training datasets gives practically acceptable accuracies with the possibility of easier training.
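This workflow can be reproduced in spirit with the sketches above; the snippet below condenses a moons dataset and trains an MLP on the condensed representation, evaluating it on the original data. The dataset size, noise level, per-class k, and MLP settings are illustrative and do not correspond to the paper's reported experiments.

```python
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=5000, noise=0.2, random_state=0)

crcs = init_crcs(X, y, k_per_class={0: 20, 1: 20})
crc_labels = vote_labels(assign_nearest(X, crcs), y, len(crcs), n_classes=2)
crcs = refine(X, y, crcs, crc_labels)

# Train on the condensed representation, evaluate on the original data.
clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
clf.fit(crcs, crc_labels)
print("accuracy on the original dataset:", clf.score(X, y))
```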
4 Conclusion
Massive amounts of data are continuously generated by many processes. Such big data mostly contains redundancy, so it can be reduced to a condensed representation without losing much accuracy, especially for machine learning applications. In this paper, an alternative way of handling machine learning datasets is introduced. The ideas introduced in this paper can be summarized as follows:
1. K-means clustering is used as a piecewise-linear classifier model generator.
2. The generated PWL classifier models are composed of condensed representation centers, which have the same dimensionality as the data.
3. The CRC locations can be fine-tuned for class-separation purity, which is equivalent to improving the accuracy of an NN classifier classifying on the CRCs.
4. The final CRCs can be used as training data for various nonlinear classifiers, such as multilayer neural networks.
Using a condensed representation of the data instead of the raw data allows different stages of the machine learning pipeline to be addressed from different aspects. For example, an online machine learning pipeline can be separated into two main stages: online condensation and machine learning. A more specific example could be an autonomous mobility platform that has data-condensation edge devices with central machine learning processing.
The method is constructed flexibly and modularly, so that any sub-element of the novel method introduced in this paper can be replaced or modified to study different aspects of condensed representation.
Acknowledgments
This research was supported by the Turkish Scientific and Technological Research Council (TUBITAK) under project no. 121E537.
We would like to thank the reviewers for their thoughtful comments and their constructive remarks.
References
- [1] H. Mannila, Data mining: Machine learning, statistics, and databases, in: Proceedings of 8th International Conference on Scientific and Statistical Data Base Management, 1996, pp. 2–9. doi:10.1109/SSDM.1996.505910.
- [2] H. Mannila, Methods and problems in data mining, in: F. Afrati, P. Kolaitis (Eds.), Database Theory — ICDT ’97, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 1997, pp. 41–55. doi:10.1007/3-540-62222-5_35.
- [3] M. Geilke, A. Karwath, S. Kramer, A probabilistic condensed representation of data for stream mining, in: 2014 International Conference on Data Science and Advanced Analytics (DSAA), 2014, pp. 297–303. doi:10.1109/DSAA.2014.7058088.
- [4] M. Geilke, A. Karwath, S. Kramer, Modeling recurrent distributions in streams using possible worlds, in: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2015, pp. 1–9. doi:10.1109/DSAA.2015.7344814.
- [5] A. Peel, F. Lalande, J.-L. Starck, V. Pettorino, J. Merten, C. Giocoli, M. Meneghetti, M. Baldi, Distinguishing standard and modified gravity cosmologies with machine learning, Physical Review D 100 (2) (2019) 023508. doi:10.1103/PhysRevD.100.023508.
- [6] A. Soulet, B. Crémilleux, F. Rioult, Condensed Representation of Emerging Patterns, in: H. Dai, R. Srikant, C. Zhang (Eds.), Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2004, pp. 127–132. doi:10.1007/978-3-540-24775-3_16.
- [7] M. Plantevit, B. Crémilleux, Condensed Representation of Sequential Patterns According to Frequency-Based Measures, in: N. M. Adams, C. Robardet, A. Siebes, J.-F. Boulicaut (Eds.), Advances in Intelligent Data Analysis VIII, Vol. 5772, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 155–166. doi:10.1007/978-3-642-03915-7_14.
- [8] M. Boettcher, M. Spott, R. Kruse, A Condensed Representation of Itemsets for Analyzing Their Evolution over Time, in: W. Buntine, M. Grobelnik, D. Mladenić, J. Shawe-Taylor (Eds.), Machine Learning and Knowledge Discovery in Databases, Vol. 5781, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 163–178. doi:10.1007/978-3-642-04180-8_28.
- [9] S. Skansi, Autoencoders, in: S. Skansi (Ed.), Introduction to Deep Learning: From Logical Calculus to Artificial Intelligence, Undergraduate Topics in Computer Science, Springer International Publishing, Cham, 2018, pp. 153–163. doi:10.1007/978-3-319-73004-2_8.
- [10] A. Alazizi, A. Habrard, F. Jacquenet, L. He-Guelton, F. Oblé, Dual Sequential Variational Autoencoders for Fraud Detection, in: M. R. Berthold, A. Feelders, G. Krempl (Eds.), Advances in Intelligent Data Analysis XVIII, Vol. 12080, Springer International Publishing, Cham, 2020, pp. 14–26. doi:10.1007/978-3-030-44584-3_2.
- [11] K. Kempinska, R. Murcio, Modelling urban networks using Variational Autoencoders, Applied Network Science 4 (1) (2019) 114. doi:10.1007/s41109-019-0234-0.
- [12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (85) (2011) 2825–2830.