Condensed Representation of Machine Learning Data

Rahman Salim Zengin 111ORCID: https://orcid.org/0000-0002-3104-4677 [email protected] Volkan Sezer 222ORCID: https://orcid.org/0000-0001-9658-2153 [email protected] Department of Mechatronics Engineering, Istanbul Technical University, Istanbul, Turkey Department of Control and Automation Engineering, Istanbul Technical University, Istanbul, Turkey Autonomous Mobility Group, Electrical and Electronics Engineering Faculty, Istanbul Technical University, Istanbul, Turkey
Abstract

Training a Machine Learning model requires sufficient data. The sufficiency of the data is not always a matter of quantity, but of relevancy and reduced redundancy. Data-generating processes create massive amounts of data, and using such big data raw consumes a great deal of computational resources. Instead of the raw data, a proper Condensed Representation can be used. By combining K-means, a well-known clustering method, with correction and refinement facilities, a novel Condensed Representation method for Machine Learning applications is introduced. To present the novel method meaningfully and visually, synthetically generated data is employed. It is shown that by using the condensed representation instead of the raw data, acceptably accurate model training is possible.

keywords:
Condensed Representation, Supervised Learning, Piecewise Linear Classification
journal: nowhere yet!

1 Introduction

The concept of condensed representation is considered in [1], [2] as determining frequent patterns in big databases. It is given as an idea to improve pattern matching against big databases. Instead of using the actual database, a condensed representation can be queried for efficiency. The condensed representation should give approximately correct results.

Online probability density estimation is used in [3] as a condensed representation method for data mining tasks, especially when the whole dataset is not available in advance.

In [4], online density estimators are employed to obtain condensed representations from data streams.

Wavelet-based PDFs are used in [5] as a condensed representation of training data to train a CNN, which reduced computational resource consumption.

The resources in the literature, such as [6, 7, 8], are mainly related, but not limited, to data mining and set operations on sequential datasets.

The most relevant machine learning subject to the study of this paper is autoencoders [9]. Autoencoders represent data in a new space with different dimensionality.

It has been shown that autoencoders can generate condensed representations of the sequential input data [10]. Then a classifier can process the representations instead of the original data.

Autoencoders are used in [11] to obtain condensed representations of the urban road networks.

The main goal of this paper is to introduce a novel approach to the condensed representation of machine learning data and to demonstrate its practical use. Accordingly, the principal components of the novel method are defined, combined, and tested on standard synthetic datasets. Finally, the concluding thoughts are shared.

2 The Algorithm

The algorithm consists of two main steps. In the initialization step, initial representation centers are placed over the data. Following the initialization, the refinement process improves the purities of the classes.

2.1 Initialization

The initialization process starts with the separation of different classes of data instances. Given labels of the dataset are used to discriminate classes. Then, to distribute initial Condensed Representation Centers (CRCs) over the data, the k-Means algorithm is applied to every class of the data separately. The k parameter can be chosen independently for every class.

The resulting cluster centers, the initial CRCs, are placed over the whole dataset. Every data instance is assigned to the nearest CRC. Thus, the space is tessellated into convex polyhedral cells, each defined by its enclosed CRC.

When all classes are combined, some of the CRCs might lie over the intersections of different classes. In order to determine the represented class of a CRC, plurality voting is applied to the covered data instances. The result of the vote becomes the label of the CRC. (Algorithm 1) (Figure 1) (Figure 3)

Function Initialize(n_centers)
Data:
    X ← [x_1 … x_N]    /* Instances */
    y ← [y_1 … y_N]    /* Labels */
Result: Centers    /* Condensed Representation Centers */
begin
    for i = 1, …, n_classes do    /* every class */
        Centers[i] ← KMeans(X[y = i], k_i)
    end for
end
Algorithm 1 Initialization
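
As a concrete illustration, the following is a minimal Python sketch of the initialization step, using NumPy and scikit-learn's KMeans (the paper's experiments use scikit-learn); the mapping from class labels to per-class k values is an assumed interface, not part of the original algorithm.

import numpy as np
from sklearn.cluster import KMeans

def initialize(X, y, k_per_class):
    """Place initial CRCs by running k-means separately on every class.

    X: (N, D) instances, y: (N,) integer labels,
    k_per_class: dict mapping class label -> number of centers (k_i).
    """
    centers = []
    for cls, k in k_per_class.items():
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        km.fit(X[y == cls])            # cluster only this class's instances
        centers.append(km.cluster_centers_)
    return np.vstack(centers)          # stacked initial CRC coordinates

The stacked centers can then be labeled by plurality voting as described in Algorithm 2.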
Figure 1: Condensed representations of synthetic circles dataset, initialization only (panels: (a) clear boundaries, (b) touching boundaries, (c) noisy boundaries).
Figure 2: Condensed representations of synthetic circles dataset, after refinement (panels: (d) clear boundaries, (e) touching boundaries, (f) noisy boundaries).
Figure 3: Condensed representations of synthetic moons dataset, initialization only (panels: (a) clear boundaries, (b) touching boundaries, (c) noisy boundaries).
Figure 4: Condensed representations of synthetic moons dataset, after refinement (panels: (d) clear boundaries, (e) touching boundaries, (f) noisy boundaries).

2.2 Refinement

The refinement process advances CRCs to improve the purities of the enclosed cells. Several suboperations are required to accomplish the refinement process. The base elements of the refinement should be defined before explaining the overall process. (Figure 2) (Figure 4)

2.2.1 Labeling

Some of the CRCs might lie over the intersections of different classes. In order to determine the represented classes of the CRCs, plurality voting is applied to the assigned data instances. The results of the votes become the labels of the CRCs. (Algorithm 2)

Function Labeling()
Data:
    X ← [x_1 … x_N]    /* Data Instances */
    y ← [y_1 … y_N]    /* Data Labels */
    Centers ← [c_1 … c_n_centers]    /* CRCs */
Result: Labels    /* Labels of the centers */
begin
    Assignments ← Assign(X, Centers)
    for i = 1, …, n_centers do    /* every center */
        Classes, Counts ← CountUnique(y[Assignments = i])
        Labels[i] ← Classes[argmax Counts]
    end for
end
Algorithm 2 Labeling
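
The plurality vote can be sketched as follows, assuming Euclidean nearest-center assignment for Assign (the paper does not fix the distance metric) and a sentinel label of -1 for empty cells (an added convention):

import numpy as np
from scipy.spatial.distance import cdist

def assign(X, centers):
    """Hard-assign every instance to its nearest CRC (Euclidean distance assumed)."""
    return cdist(X, centers).argmin(axis=1)

def labeling(X, y, centers):
    """Label every CRC with the plurality class of its assigned instances."""
    assignments = assign(X, centers)
    labels = np.full(len(centers), -1, dtype=int)   # -1 marks an empty cell
    for i in range(len(centers)):
        classes, counts = np.unique(y[assignments == i], return_counts=True)
        if counts.size:
            labels[i] = classes[counts.argmax()]
    return labels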

2.2.2 Purity Metric

The purity of data assignments to CRCs is measured to determine how clearly class boundaries are set. Assignment purities are measured independently for every CRC. This metric is used to guide the convergence of the refinement. (Algorithm 3)

Function Purities()
Data:
    X ← [x_1 … x_N]    /* Data Instances */
    y ← [y_1 … y_N]    /* Data Labels */
    Centers ← [c_1 … c_n_centers]    /* Condensed Representation Centers */
Result: Purities    /* Purities of the assignments */
begin
    Assignments ← Assign(X, Centers)
    for i = 1, …, n_centers do    /* every center */
        Counts ← CountUnique(y[Assignments = i])
        Purities[i] ← max Counts ÷ Σ Counts
    end for
end
Algorithm 3 Purities
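
A corresponding purity sketch, under the same Euclidean-assignment assumption and treating empty cells as pure by convention (an assumption, not stated in the paper):

import numpy as np
from scipy.spatial.distance import cdist

def purities(X, y, centers):
    """Per-CRC purity: share of the assigned instances belonging to the plurality class."""
    assignments = cdist(X, centers).argmin(axis=1)
    p = np.ones(len(centers))     # empty cells count as fully pure (assumed convention)
    for i in range(len(centers)):
        _, counts = np.unique(y[assignments == i], return_counts=True)
        if counts.size:
            p[i] = counts.max() / counts.sum()
    return p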

2.2.3 Population Counts

The number of data instances contained in the cell of a CRC specifies its importance and effectiveness within the total coverage. Less populated polyhedral cells have a smaller effect on the overall purity. (Algorithm 4)

Function Popcount()
Data:
    X ← [x_1 … x_N]    /* Data Instances */
    Centers ← [c_1 … c_n_centers]    /* CRCs */
Result:
    IDs    /* Identifiers of the centers */
    Counts    /* Population counts */
begin
    Assignments ← Assign(X, Centers)
    IDs, Counts ← CountUnique(Assignments)
end
Algorithm 4 Population counts
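
Population counting reduces to counting the nearest-center assignments; a short sketch under the same Euclidean-assignment assumption:

import numpy as np
from scipy.spatial.distance import cdist

def popcount(X, centers):
    """IDs of the non-empty CRC cells and the number of instances each one contains."""
    assignments = cdist(X, centers).argmin(axis=1)
    ids, counts = np.unique(assignments, return_counts=True)
    return ids, counts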

2.2.4 Soft Assignment

Although for the purity metric, absolute assignments of the data instances to the CRCs are considered, for the refinement, a weighted soft assignment scheme is used. A data instance is partially assigned to the nearest and the second nearest CRCs. Assignment weights for both CRCs are calculated by applying a weighting function. The assignment weights are equal if the distances are equal. As one CRC gets closer, the relevant assignment weight approaches unity. (Algorithm 5)

Function SoftAssign()
Data:
    X ← [x_1 … x_N]    /* Data Instances */
    Centers ← [c_1 … c_n_centers]    /* CRCs */
Result:
    IDs    /* Identifiers of the centers */
    Weights    /* Assignment weights */
begin
    IDs, Distances ← TwoNearest(X, Centers)
    Weights ← 1 − Distances² ÷ Σ Distances²
end
Algorithm 5 Soft assignment
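
A sketch of the soft assignment, again assuming Euclidean distances; the small epsilon guarding against a zero denominator is an added safeguard, not part of the original formulation:

import numpy as np
from scipy.spatial.distance import cdist

def soft_assign(X, centers):
    """Partial assignment of every instance to its two nearest CRCs.

    Returns ids (N, 2) of the nearest and second-nearest CRCs and weights (N, 2)
    computed as w = 1 - d^2 / sum(d^2): equal distances give 0.5 / 0.5, and the
    weight of the closer CRC approaches 1 as the instance approaches it.
    """
    d = cdist(X, centers)
    ids = np.argsort(d, axis=1)[:, :2]               # two nearest centers per instance
    d2 = np.take_along_axis(d, ids, axis=1) ** 2     # their squared distances
    weights = 1.0 - d2 / (d2.sum(axis=1, keepdims=True) + 1e-12)
    return ids, weights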

2.2.5 Correctness

If a data instance is assigned to a CRC, for a correct assignment, the label of that CRC must be equal to the given label of the data instance. In the soft assignment case, the labels of both CRCs are considered. The correctnesses of these comparisons are specific to the current CRC configuration and might change with every refinement iteration. (Algorithm 6)

Function Correctness()
Data:
    Labels    /* Labels of the centers */
    IDs    /* Soft assignment identifiers */
    y ← [y_1 … y_N]    /* Data Labels */
Result:
    Correctness    /* Correctnesses of the soft assignments */
begin
    Correctness ← Labels[IDs] ≡ y
end
Algorithm 6 Correctness

2.2.6 Activity Group

A refinement iteration is not executed over the whole dataset. The activity group of the refinement is determined by checking the correctnesses of the first (nearest) and the second (second-nearest) CRCs. If both correctnesses are true, the data instance is under a proper cover and needs no further intervention. If both are false, the data instance can be treated as an outlier and left where it is.

A data instance is accepted within the activity group only if its soft assignment is shared between a true and a false CRC. (Eq 1)

Activity = FirstCorrectness ≠ SecondCorrectness    (1)
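
Correctness and the activity mask of Eq. (1) can be computed together; a minimal sketch operating on the soft-assignment outputs described above:

import numpy as np

def correctness_and_activity(center_labels, ids, y):
    """Correctness of both soft assignments and the activity mask of Eq. (1).

    center_labels: (n_centers,) CRC labels, ids: (N, 2) soft-assignment ids,
    y: (N,) data labels.  An instance is active only when exactly one of its
    two assigned CRCs carries the correct label.
    """
    correctness = center_labels[ids] == y[:, None]      # (N, 2) booleans
    activity = correctness[:, 0] != correctness[:, 1]   # XOR of the two correctnesses
    return correctness, activity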

2.2.7 Advancement Vectors

True assignments attract CRCs, and false assignments repel. For a CRC, when the effects of all assignments are combined, it becomes the advancement vector of that CRC. The advancement vectors are calculated using weighted vector differences between assigned data instances and CRCs. Weights come from soft assignments and are signed according to the correctnesses. (Algorithm 7)

Function AdvanVect()
Data:
    X ← [x_1 … x_N]    /* Data Instances */
    Centers ← [c_1 … c_n_centers]    /* CRCs */
Result:
    A ← [a_1 … a_n_centers]    /* Advancement vectors */
begin
    for i = 1, …, n_centers do    /* every center */
        /* Difference of the assigned instances with the current center, multiplied by the signed weights */
        a_i ← mean((X_i − c_i) × Weights_i)
    end for
end
Algorithm 7 Advancement vectors
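
A sketch of the advancement-vector computation, restricting the contributions to the activity group as described above; averaging over the contributing instances of each CRC is one reasonable reading of the mean in Algorithm 7:

import numpy as np

def advancement_vectors(X, centers, ids, weights, correctness, activity):
    """Per-CRC mean of signed, weighted differences (x - c) over the active soft assignments.

    Correct assignments attract a CRC towards the instance, incorrect ones repel it.
    """
    n_centers, dim = centers.shape
    acc = np.zeros((n_centers, dim))
    cnt = np.zeros(n_centers)
    signs = np.where(correctness, 1.0, -1.0)       # attract (+) / repel (-)
    for n in np.flatnonzero(activity):             # only the activity group
        for j in (0, 1):                           # nearest and second-nearest CRC
            c = ids[n, j]
            acc[c] += signs[n, j] * weights[n, j] * (X[n] - centers[c])
            cnt[c] += 1
    return acc / np.maximum(cnt, 1)[:, None]       # mean over contributing instances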

2.2.8 Selective Advancement

Advancement of CRCs is selectively applied depending on the purity improvements. If the updated state of a CRC results in better purity, the up-to-date state is used; otherwise, the previous state is kept. (Algorithm 8)

Function SelectAdvance()
Data:
    A ← [a_1 … a_n_centers]    /* Advancement vectors */
    Purity_before    /* Purities before */
    Purity_after    /* Purities after */
Result:
    A_selected ← [a_1 … a_n_centers]    /* Selected advancement vectors */
begin
    for i = 1, …, n_centers do    /* every center */
        if Purity_before[i] ≥ Purity_after[i] then    /* Similar or worse purity */
            A_selected[i] ← 0    /* keep the previous state */
        else    /* Better purity */
            A_selected[i] ← a_i
        end if
    end for
end
Algorithm 8 Selective advancement
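
A sketch of selective advancement, reading "keep the previous state" as applying a zero advancement vector and requiring a strict purity improvement (both assumptions consistent with Algorithm 8):

import numpy as np

def select_advance(A, purity_before, purity_after):
    """Apply an advancement vector only where it improves that CRC's purity."""
    improved = purity_after > purity_before
    return np.where(improved[:, None], A, 0.0)   # zero advancement keeps the previous state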

2.2.9 Convergence

During the refinement, the overall purity of the current CRC configuration is tracked at every iteration. The parameters of the purest configuration are saved. If the purity starts to decrease, the refinement process ends. (Algorithm 9)

Function Convergence()
Data:
    OverallPurity    /* The overall purity */
    ModelParams    /* The current parameters */
Result:
    BestPurity    /* The best overall purity */
    BestParams    /* The best parameters */
begin
    BestPurity ← OverallPurity
    while OverallPurity ≥ BestPurity do    /* until the best overall purity is reached */
        BestPurity ← OverallPurity
        BestParams ← ModelParams
        /* Do the refinement */
        Refine()
    end while
end
Algorithm 9 Convergence
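
The convergence loop can be sketched with placeholder callables for the refinement step and the overall-purity score (both hypothetical interfaces, not fixed by the paper):

import copy

def run_until_convergence(state, refine_step, overall_purity):
    """Iterate refine_step while the overall purity does not decrease; return the purest state.

    `state` bundles the CRCs and their labels; `refine_step(state)` returns the next
    state and `overall_purity(state)` scores it.
    """
    best_purity = overall_purity(state)
    best_state = copy.deepcopy(state)
    while True:
        state = refine_step(state)
        purity = overall_purity(state)
        if purity < best_purity:                 # purity started to go down: stop
            return best_state, best_purity
        best_purity, best_state = purity, copy.deepcopy(state)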

2.2.10 The Overall Refinement Process

In every iteration, soft assignment is applied to the whole dataset for the current state of the CRCs. Then, the correctnesses of the primary and secondary assignments are determined. Afterward, the current activity group is designated using the correctnesses of the soft assignments. Later, the advancement candidate of every CRC is found considering weighted and polarized advancement vectors. Finally, selective advancement is applied, and the next state is reached. (Algorithm 10)

Function Refine()
Data:
    X    /* Data Instances */
    y    /* Data Labels */
    Centers    /* CRCs */
    Labels    /* Labels of the centers */
Result:
    NewCenters    /* New CRCs */
    NewLabels    /* New labels of the centers */
begin
    IDs, Weights ← SoftAssign(X, Centers)
    A ← AdvanVect(X, Centers)
    Purity_before ← Purities(X, y, Centers)
    Purity_after ← Purities(X, y, Centers + A)
    A_selected ← SelectAdvance(A, Purity_before, Purity_after)
    NewCenters ← Centers + A_selected
    NewLabels ← Labeling(X, y, NewCenters)
end
Algorithm 10 Refinement
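
Putting the pieces together, one refinement iteration might look as follows. This sketch assumes the helper functions from the previous subsections (soft_assign, correctness_and_activity, advancement_vectors, purities, select_advance, labeling) are in scope, and it evaluates the candidate purities on the fully advanced configuration, which is one possible interpretation:

def refine(X, y, centers, labels):
    """One refinement iteration over the current CRC configuration (sketch)."""
    ids, weights = soft_assign(X, centers)
    correctness, activity = correctness_and_activity(labels, ids, y)
    A = advancement_vectors(X, centers, ids, weights, correctness, activity)
    purity_before = purities(X, y, centers)
    purity_after = purities(X, y, centers + A)            # candidate configuration
    new_centers = centers + select_advance(A, purity_before, purity_after)
    new_labels = labeling(X, y, new_centers)
    return new_centers, new_labels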

3 Experimental Results

The platform used for the experimental tests and comparisons is as follows: an Intel(R) Core(TM) i7-7700 CPU running at 3.60 GHz, 32 GB of DDR4 RAM running at 2133 MHz, and an Nvidia GeForce GTX 1080 Ti GPU.

Circles and moons datasets are generated with varying noise levels using the Scikit-Learn [12] library functions. The condensation is done on the GPU. Using the condensed representations, multilayer-perceptron classifier [12] models are trained on the CPU and tested for accuracy on the original datasets. As can be seen in Figure 5 and Figure 6, training nonlinear machine learning models on the condensed representations of the training datasets yields practically acceptable accuracies, with the possibility of easier training.
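
For illustration only, the evaluation can be reproduced along the following lines; the dataset size, noise level, per-class k values, and MLP hyperparameters are assumed values, and `condense` is a hypothetical wrapper around the pipeline of Section 2, not a library function:

from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# A noisy moons dataset; the size and noise level are assumed values.
X, y = make_moons(n_samples=10_000, noise=0.15, random_state=0)

# `condense` stands in for the initialization + refinement pipeline of Section 2
# (hypothetical helper); it returns the CRCs and their labels.
crc_X, crc_y = condense(X, y, k_per_class={0: 20, 1: 20})

# Train a small multilayer perceptron on the condensed representation only ...
clf = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
clf.fit(crc_X, crc_y)

# ... and measure its accuracy on the full original dataset.
print("accuracy on raw data:", clf.score(X, y))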

Figure 5: Condensed representations of circles dataset and results of models trained on condensed representations (panels: (a)–(c) clear, touching, and noisy boundaries, condensed; (d)–(f) clear, touching, and noisy boundaries, trained).
Figure 6: Condensed representations of moons dataset and results of models trained on condensed representations (panels: (a)–(c) clear, touching, and noisy boundaries, condensed; (d)–(f) clear, touching, and noisy boundaries, trained).

4 Conclusion

Massive amounts of data are continuously generated by many processes. Such big data mostly contains redundancy, so it can be reduced to a condensed representation without losing much accuracy, especially for machine learning applications. In this paper, an alternative way of handling machine learning datasets is introduced. The ideas introduced in this paper can be summarized as follows:

  1. K-means clustering is used as a piecewise-linear classifier model generator.
  2. The generated PWL classifier models are composed of condensed representation centers, which have the same dimensionality as the data.
  3. The CRC locations can be fine-tuned for class-separation purity, which is equivalent to improving the accuracy of an NN classifier operating on the CRCs.
  4. The final CRCs can be used as training data for various nonlinear classifiers, such as multilayer neural networks.

Using a condensed representation of the data instead of the raw data allows different stages of the machine-learning pipeline to be addressed separately. For example, an online machine-learning pipeline can be separated into two main stages: online condensation and machine learning. A more specific example is an autonomous mobility platform that has data-condensation edge devices with central machine-learning processing.

The method is constructed flexibly and modularly, so that any sub-element of the novel method introduced in this paper can be replaced or modified to study different aspects of condensed representation.

Acknowledgments

This research was supported by the Turkish Scientific and Technological Research Council (TUBITAK) under project no. 121E537.

We would like to thank the reviewers for their thoughtful comments and their constructive remarks.

References

  • [1] H. Mannila, Data mining: Machine learning, statistics, and databases, in: Proceedings of 8th International Conference on Scientific and Statistical Data Base Management, 1996, pp. 2–9. doi:10.1109/SSDM.1996.505910.
  • [2] H. Mannila, Methods and problems in data mining, in: F. Afrati, P. Kolaitis (Eds.), Database Theory — ICDT ’97, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 1997, pp. 41–55. doi:10.1007/3-540-62222-5_35.
  • [3] M. Geilke, A. Karwath, S. Kramer, A probabilistic condensed representation of data for stream mining, in: 2014 International Conference on Data Science and Advanced Analytics (DSAA), 2014, pp. 297–303. doi:10.1109/DSAA.2014.7058088.
  • [4] M. Geilke, A. Karwath, S. Kramer, Modeling recurrent distributions in streams using possible worlds, in: 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2015, pp. 1–9. doi:10.1109/DSAA.2015.7344814.
  • [5] A. Peel, F. Lalande, J.-L. Starck, V. Pettorino, J. Merten, C. Giocoli, M. Meneghetti, M. Baldi, Distinguishing standard and modified gravity cosmologies with machine learning, Physical Review D 100 (2) (2019) 023508. doi:10.1103/PhysRevD.100.023508.
  • [6] A. Soulet, B. Crémilleux, F. Rioult, Condensed Representation of Emerging Patterns, in: H. Dai, R. Srikant, C. Zhang (Eds.), Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2004, pp. 127–132. doi:10.1007/978-3-540-24775-3_16.
  • [7] M. Plantevit, B. Crémilleux, Condensed Representation of Sequential Patterns According to Frequency-Based Measures, in: N. M. Adams, C. Robardet, A. Siebes, J.-F. Boulicaut (Eds.), Advances in Intelligent Data Analysis VIII, Vol. 5772, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 155–166. doi:10.1007/978-3-642-03915-7_14.
  • [8] M. Boettcher, M. Spott, R. Kruse, A Condensed Representation of Itemsets for Analyzing Their Evolution over Time, in: W. Buntine, M. Grobelnik, D. Mladenić, J. Shawe-Taylor (Eds.), Machine Learning and Knowledge Discovery in Databases, Vol. 5781, Springer Berlin Heidelberg, Berlin, Heidelberg, 2009, pp. 163–178. doi:10.1007/978-3-642-04180-8_28.
  • [9] S. Skansi, Autoencoders, in: S. Skansi (Ed.), Introduction to Deep Learning: From Logical Calculus to Artificial Intelligence, Undergraduate Topics in Computer Science, Springer International Publishing, Cham, 2018, pp. 153–163. doi:10.1007/978-3-319-73004-2_8.
  • [10] A. Alazizi, A. Habrard, F. Jacquenet, L. He-Guelton, F. Oblé, Dual Sequential Variational Autoencoders for Fraud Detection, in: M. R. Berthold, A. Feelders, G. Krempl (Eds.), Advances in Intelligent Data Analysis XVIII, Vol. 12080, Springer International Publishing, Cham, 2020, pp. 14–26. doi:10.1007/978-3-030-44584-3_2.
  • [11] K. Kempinska, R. Murcio, Modelling urban networks using Variational Autoencoders, Applied Network Science 4 (1) (2019) 114. doi:10.1007/s41109-019-0234-0.
  • [12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, É. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research 12 (85) (2011) 2825–2830.