
MementoML: Performance of selected machine learning algorithm configurations on OpenML100 datasets

Wojciech Kretowicz (Warsaw University of Technology), Przemysław Biecek (Warsaw University of Technology, University of Warsaw)
(July 2020)

1 Introduction

Finding optimal hyperparameters for a machine learning algorithm can often significantly improve its performance Probst et al. [2018]. But how to choose them in a time-efficient way? In this paper we present a protocol for generating benchmark data describing the performance of different ML algorithms under different hyperparameter configurations. Data collected in this way can be used to study the factors influencing algorithm performance.

This collection was prepared for the purposes of the study presented in the EPP paper Gosiewska et al. [2020]. We tested algorithm performance on a dense grid of hyperparameters. The datasets and hyperparameters were chosen before any algorithm was run and were not changed afterwards. This differs from the approach usually used in hyperparameter tuning, where the selection of candidate hyperparameters depends on previously obtained results. However, such a fixed selection allows for systematic analysis of performance sensitivity to individual hyperparameters.

The result is a comprehensive dataset of such benchmarks that we would like to share. We hope that the computed and collected results may be helpful for other researchers. This paper describes how the data was collected. It covers benchmarks of 7 popular machine learning algorithms on 39 OpenML datasets.

The detailed data forming this benchmark are available at: https://www.kaggle.com/mi2datalab/mementoml.

2 Related datasets

Kühn et al. [2018] introduced a benchmark of algorithms built on the OpenML repository. This dataset contains data about 6 algorithms implemented in R: glmnet, rpart, kknn, svm, ranger, xgboost. It also allows running additional computations and obtaining further results in a similar way.

Smith et al. [2014] provide a MongoDB database with data at the instance level. It contains predictions for every single instance in the considered datasets, along with information about the algorithms and their hyperparameters. This benchmark can also be extended.

The mlpack benchmark Edel et al. [2014] contains data about the performance of different algorithms in popular machine learning frameworks and libraries. It also provides comprehensive scripts for further evaluation.

3 Algorithms, datasets and hyperparameters used

We used several popular machine learning algorithms: gradient boosting on decision trees (catboost Prokhorenkova et al. [2017], gbm, xgboost), generalized linear models (glmnet Friedman et al. [2010]), k nearest neighbours (kknn), and random forests (randomforest Liaw and Wiener [2002], ranger Wright and Ziegler [2017]). All computations were made in R and all models came from R packages. Almost all of them were used through the mlr Bischl et al. [2016] framework; only catboost was used directly, because it is not included in mlr.

Benchmarks on predicted values were calculated with the mlr function performance for all models except catboost, for which measureACC and measureAUC were used.
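As a minimal sketch (not the original benchmarking script), the mlr-based evaluation looked roughly as follows; the dataset and learner settings here are illustrative:

# Illustrative evaluation of one model on one bootstrap train/test pair via mlr
library(mlr)
library(mlbench)

data(Sonar)                                          # example binary classification data
task <- makeClassifTask(data = Sonar, target = "Class")
lrn  <- makeLearner("classif.gbm", predict.type = "prob",
                    par.vals = list(n.trees = 500, interaction.depth = 3))

n <- nrow(Sonar)
train.idx <- sample(n, size = n, replace = TRUE)     # bootstrap train set
test.idx  <- setdiff(seq_len(n), train.idx)          # out-of-bag test set

model <- train(lrn, task, subset = train.idx)
pred  <- predict(model, task = task, subset = test.idx)
performance(pred, measures = list(acc, auc))         # the two collected measures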

3.1 Algorithms and parameters

The models used in the computation and the considered ranges of their parameters are listed below:

Table 1: Algorithms, the ranges of their parameters used in the benchmark, and the general drawing rule. U stands for a random variable sampled from a uniform distribution on the appropriate set.

algorithm     parameter            type      lower  upper  grid
catboost      iterations           integer   100    10086  2^U
              depth                integer   6      10     U
              l2_leaf_reg          numeric   0      9      U^2
              bagging_temperature  numeric   0      1.5    U
              learning_rate        numeric   0.001  2      2^U
gbm           n.trees              integer   100    10086  2^U
              interaction.depth    integer   1      5      U
              n.minobsinnode       integer   2      25     U
              shrinkage            numeric   0.001  0.1    10^U
              bag.fraction         numeric   0.2    1      U
glmnet        alpha                numeric   0      1      U
              lambda               numeric   0.001  1024   2^U
kknn          k                    integer   1      30     U
randomforest  ntree                integer   100    10086  2^U
              replace              logical   FALSE  TRUE   U
              nodesize             integer   1      5      U
ranger        num.trees            integer   100    10086  2^U
              min.node.size        integer   1      4      U
              replace              logical   FALSE  TRUE   U
              splitrule            discrete  -      -      U
xgboost       booster              discrete  -      -      U
              nrounds              integer   1      1000   U
              eta                  numeric   0.031  1      2^U
              subsample            numeric   0.5    1      U
              max_depth            integer   6      15     U
              min_child_weight     numeric   1      8      2^U
              colsample_bytree     numeric   0.2    1      U
              colsample_bylevel    numeric   0.2    1      U

The splitrule parameter for ranger was either gini or extratrees. The booster parameter for xgboost was either gbtree or gblinear.

Parameters for each model were randomly chosen within the presented ranges using the corresponding distributions. Although they were drawn randomly, all of them are reproducible.
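For illustration, a draw of this shape for gbm could be produced as below; this is a hedged sketch, not the exact generator script, but it follows the drawing rules from Table 1:

# Illustrative draw of gbm hyperparameter sets following the rules in Table 1
set.seed(123)                                   # fixed seed keeps the draw reproducible
n.sets <- 1000
gbm.params <- data.frame(
  n.trees           = round(2^runif(n.sets, log2(100), log2(10086))),   # 2^U
  interaction.depth = sample(1:5, n.sets, replace = TRUE),              # U
  n.minobsinnode    = sample(2:25, n.sets, replace = TRUE),             # U
  shrinkage         = 10^runif(n.sets, log10(0.001), log10(0.1)),       # 10^U
  bag.fraction      = runif(n.sets, 0.2, 1)                             # U
)
head(gbm.params)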

3.2 Datasets

All datasets used in the benchmark were downloaded from OpenML Vanschoren et al. [2013]. These datasets come from OpenML100 and are all binary classification problems; the number of observations is between 500 and 100000, the number of features is less than 5000, and the ratio of the minority class to the majority class is above 0.05.
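As a hedged sketch (not the original selection code), similar filtering could be reproduced with the OpenML R package; the column names of the listOMLDataSets() output assumed below may differ between package versions:

# Illustrative filtering of OpenML datasets by the criteria described above
library(OpenML)
ds <- listOMLDataSets()
sel <- subset(ds,
              number.of.classes == 2 &
              number.of.instances >= 500 & number.of.instances <= 100000 &
              number.of.features < 5000 &
              minority.class.size / majority.class.size > 0.05)
head(sel[, c("data.id", "name")])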

The considered datasets and their OpenML ids:

id name link rows columns
3 kr-vs-kp https://www.openml.org/d/3 3196 37
31 credit-g https://www.openml.org/d/31 1000 21
37 diabetes https://www.openml.org/d/37 768 9
44 spambase https://www.openml.org/d/44 4601 58
50 tic-tac-toe https://www.openml.org/d/50 958 10
151 electricity https://www.openml.org/d/151 45312 9
312 scene https://www.openml.org/d/312 2407 300
333 monks-problems-1 https://www.openml.org/d/333 556 7
334 monks-problems-2 https://www.openml.org/d/334 601 7
335 monks-problems-3 https://www.openml.org/d/335 554 7
1036 sylva_agnostic https://www.openml.org/d/1036 14395 217
1038 gina_agnostic https://www.openml.org/d/1038 3468 971
1043 ada_agnostic https://www.openml.org/d/1043 4562 49
1046 mozilla4 https://www.openml.org/d/1046 15545 6
1049 pc4 https://www.openml.org/d/1049 1458 38
1050 pc3 https://www.openml.org/d/1050 1563 38
1063 kc2 https://www.openml.org/d/1063 522 22
1067 kc1 https://www.openml.org/d/1067 2109 22
1068 pc1 https://www.openml.org/d/1068 1109 22
1120 MagicTelescope https://www.openml.org/d/1120 19020 12
1461 bank-marketing https://www.openml.org/d/1461 45211 17
1462 banknote-authentication https://www.openml.org/d/1462 1372 5
1464 blood-transfusion-service-center https://www.openml.org/d/1464 748 5
1467 climate-model-simulation-crashes https://www.openml.org/d/1467 540 21
1471 eeg-eye-state https://www.openml.org/d/1471 14980 15
1479 hill-valley https://www.openml.org/d/1479 1212 101
1480 ilpd https://www.openml.org/d/1480 583 11
1485 madelon https://www.openml.org/d/1485 2600 501
1486 nomao https://www.openml.org/d/1486 34465 119
1487 ozone-level-8hr https://www.openml.org/d/1487 2534 73
1489 phoneme https://www.openml.org/d/1489 5404 6
1494 qsar-biodeg https://www.openml.org/d/1494 1055 42
1504 steel-plates-fault https://www.openml.org/d/1504 1941 34
1510 wdbc https://www.openml.org/d/1510 569 31
1570 wilt https://www.openml.org/d/1570 4839 6
4134 Bioresponse https://www.openml.org/d/4134 3751 1777
4135 Amazon_employee_access https://www.openml.org/d/4135 32769 10
4534 PhishingWebsites https://www.openml.org/d/4534 11055 31
40509 Australian https://www.openml.org/d/40509 690 15

4 Data collection

Each dataset was divided into 20 train/test bootstrap pairs. Because each pair was drawn independently in a bootstrap manner, a given row is not guaranteed to appear exactly once in a test set or 19 times in train sets. Each considered model with a particular set of hyperparameters was trained 20 times, once on each train subset, and evaluated on the corresponding test subset. Two measures were collected: ACC and AUC. Thus, for each model and each dataset there should be 20 · |paramset| ACC values and 20 · |paramset| AUC values. However, not all computations have finished; the results are updated on a regular basis as they progress. Additionally, training times were collected.
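A minimal sketch of this resampling scheme (object and helper names such as evaluate_model are illustrative):

# 20 independent bootstrap train/test pairs per dataset
set.seed(1)
n.rows   <- 1000                        # rows in an example dataset
n.splits <- 20
splits <- lapply(seq_len(n.splits), function(i) {
  train <- sample(n.rows, size = n.rows, replace = TRUE)   # bootstrap sample
  test  <- setdiff(seq_len(n.rows), train)                 # out-of-bag rows
  list(train = train, test = test)
})
# For one hyperparameter set this yields 20 ACC and 20 AUC values:
# results <- sapply(splits, function(s) evaluate_model(params, s$train, s$test))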

An important aspect of our approach is that the datasets were chosen arbitrarily and the hyperparameters were drawn randomly, as described in Table 1, before any computation began, and they were not updated afterwards. The train/test splits are also fixed for each dataset. This makes it easier to compare results between different algorithms. We try to cover as much of the hyperparameter space as possible, including subspaces where an algorithm may perform poorly or not work at all. This should enable broader studies of hyperparameter behaviour. However, for practical reasons, the parameters had to be drawn from finite ranges.

4.1 ABC4ML - Automated benchmark collector

ABC4ML stands for Automated Benchmark Collector, the software written and used for easy and reproducible benchmarking on the selected datasets. Its main function, calculate, takes a model name, a vector of OpenML dataset ids, a path to a file with train/test splits, and a path to a file with hyperparameter sets. This allows easy and transparent computation as well as straightforward parallelization. If the considered machine learning algorithm does not converge, NA is returned.

Note that before any calculations you need to create a file with the data splits (given as row ids) and a file with the hyperparameter sets.

Thanks to OpenML, there is no need to download the datasets by hand before running the calculations. They are downloaded into RAM one by one as the computation progresses.
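For example, one of the benchmarked datasets can be fetched on demand with the OpenML R package (an OpenML API key configured via setOMLConfig may be required):

# On-demand download of dataset 37 (diabetes) through the OpenML R package
library(OpenML)
oml <- getOMLDataSet(data.id = 37)
dat <- oml$data                 # plain data.frame kept in memory only
dim(dat)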

4.2 Details

After some initial data collection, we obtained estimated computation times for each dataset. The datasets were then grouped so that the total time required by each group was roughly equal.
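A hedged sketch of such a grouping, using a simple greedy assignment of datasets to a fixed number of groups of roughly equal total time (the estimated times below are illustrative):

# Greedy grouping: always place the next-largest dataset into the lightest group
est.times <- c(`1486` = 40, `151` = 25, `1461` = 20, `37` = 1)   # hours, illustrative
n.groups  <- 2
groups <- vector("list", n.groups)
totals <- numeric(n.groups)
for (id in names(sort(est.times, decreasing = TRUE))) {
  g <- which.min(totals)
  groups[[g]] <- c(groups[[g]], id)
  totals[g] <- totals[g] + est.times[id]
}
groups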

For each model and each group of datasets, a Docker container computing the benchmarks was run.

4.3 Dataset

4.3.1 Benchmarks

The resulting dataset is a data frame with 7 columns. The first column, "dataset", is the OpenML id of the dataset, e.g. 1486; the second, "row_index", is the train/test split identifier from the splits file, e.g. 12; the third, "model", is the model name, e.g. gbm or kknn. The fourth, "param_index", is the hyperparameter set identifier from the parameters file; these identifiers start from 1001 (1001 denotes the first set). The fifth, "time", is the training time measured in ms. The last two columns are the acc and auc measures.
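A hedged example of working with this table (the file name is illustrative; the column names follow the description above):

# Load the benchmark results and summarise one model on one dataset
bench <- read.csv("benchmarks.csv")
head(bench)   # dataset, row_index, model, param_index, time, acc, auc
# Mean AUC per hyperparameter set for gbm on dataset 1486:
subset_gbm <- subset(bench, model == "gbm" & dataset == 1486)
aggregate(auc ~ param_index, data = subset_gbm, FUN = mean)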

4.3.2 Hyperparameters

There is also a data frame for each model with its hyperparameters. The first column is "param_index"; the remaining columns correspond to the hyperparameters of this model used in the calculations.
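The param_index column allows joining the hyperparameter tables with the benchmark results, e.g. (file names are illustrative):

# Attach the hyperparameter values to each benchmark row for gbm
bench      <- read.csv("benchmarks.csv")        # results table
gbm.params <- read.csv("gbm_parameters.csv")    # hyperparameter sets for gbm
merged <- merge(subset(bench, model == "gbm"), gbm.params, by = "param_index")
# Each row now carries both the measured acc/auc and the hyperparameters used.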

4.3.3 Train/test splits

For each dataset used, there is a separate file with the train/test splits. Each of its rows lists the row indices forming a single test subset of that dataset.

5 Reproducibility

All hyperparameters were chosen with a fixed random seed, so they can be reproduced at will. However, not all parameters were chosen in the same way: the catboost parameters were drawn using a newer version of the script. This updated version of the script can also draw parameters for the other algorithms.

Similarly, all algorithms had fixed random seeds before every single run (the default seed parameter of the calculate function). Thus you can easily reproduce a subset of the results independently, without rerunning all previous calculations. Some of the results were reproduced to verify that the whole framework works correctly.

Moreover, we record the versions of the software libraries used.

6 Further calculations and Docker

It is much easier to run your own calculations using our Docker container. The container is a modified r-base image that installs all required Linux and R packages, has a directory structure matching the assumptions of the calculate.R script, provides the screen utility, and copies the train/test splits from the "datasets" directory and the parameters from the "parameters" directory. If you want to add a new algorithm that is not available in the mlr framework, you need to add an install statement to the install_packages.R script and add a new function to calculate.R that returns a result vector in the expected format.
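For illustration, a new entry in calculate.R might look roughly like the sketch below; the function and helper names (calculate_mymodel, my_fit, my_predict) are hypothetical and only show the expected shape of the result vector:

# Hypothetical wrapper for an algorithm outside mlr, returning acc and auc
calculate_mymodel <- function(data, target, train.idx, test.idx, params) {
  fit   <- my_fit(data[train.idx, ], target = target, params = params)   # placeholder fit call
  prob  <- my_predict(fit, newdata = data[test.idx, ])                   # predicted probabilities
  truth <- data[[target]][test.idx]
  pred  <- ifelse(prob > 0.5, levels(truth)[2], levels(truth)[1])
  c(acc = mean(pred == truth),
    auc = mlr::measureAUC(prob, truth,
                          negative = levels(truth)[1], positive = levels(truth)[2]))
}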

To build the Docker image, just run:

sudo docker build .

Before you run the container, make sure a results directory exists on the host. The results directory inside the container has to be mapped to the host's results directory.

To run the container, simply type:

sudo docker run -ti -v [host’s results absolute directory]:/results elo bash

After running this command, open an R console, source the calculate.R script from the "scripts" directory, and run the calculate function with the proper parameters.
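For example (the argument names and their order in calculate are assumed here from the description above and may differ in the actual script):

# Inside the container's R session
source("scripts/calculate.R")
calculate(model      = "gbm",                          # model name
          data_ids   = c(31, 37),                      # OpenML dataset ids
          split_path = "datasets/splits.csv",          # train/test splits file
          param_path = "parameters/gbm_parameters.csv")# hyperparameter sets file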

All results will be saved to the host's results directory passed in the docker run command.

7 Discussion

This work results in a comprehensive dataset covering a wide hyperparameter space. The dataset is ready to use, easily accessible and, if needed, allows further calculations in a compatible form. It can be used in meta-learning research, for finding good hyperparameter defaults and reasonable subspaces, as well as for studying the impact and importance of particular hyperparameters.

This dataset can be found at: https://www.kaggle.com/mi2datalab/mementoml.

References

  • Bischl et al. [2016] B. Bischl, M. Lang, L. Kotthoff, J. Schiffner, J. Richter, E. Studerus, G. Casalicchio, and Z. M. Jones. mlr: Machine learning in r. Journal of Machine Learning Research, 17(170):1–5, 2016. URL http://jmlr.org/papers/v17/15-066.html.
  • Edel et al. [2014] M. Edel, A. Soni, and R. R. Curtin. An automatic benchmarking system. In NIPS 2014 Workshop on Software Engineering for Machine Learning (SE4ML’2014), volume 1, 2014.
  • Friedman et al. [2010] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. URL http://www.jstatsoft.org/v33/i01/.
  • Gosiewska et al. [2020] A. Gosiewska, K. Woznica, and P. Biecek. Interpretable meta-measure for model performance, 2020. URL https://arxiv.org/abs/2006.02293.
  • Kühn et al. [2018] D. Kühn, P. Probst, J. Thomas, and B. Bischl. Automatic Exploration of Machine Learning Experiments on OpenML, 2018. URL https://arxiv.org/pdf/1806.10961.pdf.
  • Liaw and Wiener [2002] A. Liaw and M. Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002. URL https://CRAN.R-project.org/doc/Rnews/.
  • Probst et al. [2018] P. Probst, B. Bischl, and A.-L. Boulesteix. Tunability: Importance of Hyperparameters of Machine Learning Algorithms, 2018. URL https://arxiv.org/abs/1802.09596.
  • Prokhorenkova et al. [2017] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin. Catboost: unbiased boosting with categorical features, 2017. URL https://arxiv.org/abs/1706.09516.
  • Smith et al. [2014] M. R. Smith, A. White, C. Giraud-Carrier, and T. Martinez. An Easy to Use Repository for Comparing and Improving Machine Learning Algorithm Usage, 2014. URL https://arxiv.org/abs/1405.7292.
  • Vanschoren et al. [2013] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked Science in Machine Learning. SIGKDD Explorations, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198.
  • Wright and Ziegler [2017] M. N. Wright and A. Ziegler. ranger: A fast implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1):1–17, 2017. doi: 10.18637/jss.v077.i01.