MementoML: Performance of selected machine learning algorithm configurations on OpenML100 datasets
1 Introduction
Finding optimal hyperparameters for a machine learning algorithm can often significantly improve its performance Probst et al. [2018]. But how can they be chosen in a time-efficient way? In this paper we present the protocol of generating benchmark data describing the performance of different ML algorithms with different hyperparameter configurations. Data collected in this way is used to study the factors influencing the algorithm's performance.
This collection was prepared for the purposes of the study presented in the EPP paper Gosiewska et al. [2020]. We tested algorithm performance on a dense grid of hyperparameters. The tested datasets and hyperparameters were chosen before any algorithm was run and were not changed afterwards. This differs from the approach usually taken in hyperparameter tuning, where the selection of candidate hyperparameters depends on previously obtained results. However, such a fixed selection allows for a systematic analysis of performance sensitivity to individual hyperparameters.
This resulted in a comprehensive dataset of benchmarks that we would like to share. We hope that the computed and collected results may be helpful for other researchers. This paper describes the way the data was collected. It covers benchmarks of 7 popular machine learning algorithms on 39 OpenML datasets.
The detailed data forming this benchmark are available at: https://www.kaggle.com/mi2datalab/mementoml.
2 Related datasets
Kühn et al. [2018] introduced a benchmark of algorithms created for the OpenML repository. This dataset contains data about 6 algorithms implemented in R: glmnet, rpart, kknn, svm, ranger, xgboost. It also allows running additional computations to obtain further results in a similar way.
Smith et al. [2014] provide a MongoDB database with data at the instance level. It contains predictions made for every single instance in the considered datasets, together with information about the algorithms and their hyperparameters. This benchmark can also be extended.
The mlpack benchmark Edel et al. [2014] contains data about the performance of different algorithms in popular machine learning frameworks and libraries. It also provides comprehensive scripts for further evaluation.
3 Algorithms, datasets and hyperparameters used
We used a number of popular machine learning algorithms: gradient boosting on decision trees (catboost Prokhorenkova et al. [2017], gbm, xgboost), generalized linear models (glmnet Friedman et al. [2010]), nearest neighbours (kknn), and random forests (randomforest Liaw and Wiener [2002], ranger Wright and Ziegler [2017]). All computations were made in R and all models come from R packages. Almost all of them were used through the mlr Bischl et al. [2016] framework. Only catboost was used directly, because it is not included in mlr.
Benchmarks on the predicted values were calculated with the mlr function performance for all models except catboost, for which measureACC and measureAUC were used directly.
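For illustration, the snippet below is a minimal sketch of this evaluation path through mlr. The data frame df, its "target" column and the train/test index vectors are placeholders, and the gbm hyperparameter values are arbitrary; only the mlr calls correspond to the setup described above:

library(mlr)

# df, "target", train_idx and test_idx are placeholders, not the actual benchmark data
task <- makeClassifTask(data = df, target = "target")

# predict.type = "prob" is required so that AUC can be computed
learner <- makeLearner("classif.gbm", predict.type = "prob",
                       par.vals = list(n.trees = 500, shrinkage = 0.05))

model <- train(learner, task, subset = train_idx)
pred  <- predict(model, task = task, subset = test_idx)

# ACC and AUC for one train/test split
performance(pred, measures = list(acc, auc))

For catboost, which is not wrapped in mlr, the lower-level helpers measureACC and measureAUC were applied to its predictions instead.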
3.1 Algorithms and parameters
These are the models used in the computations, with the considered parameter ranges (Table 1):
algorithm | parameter | type | lower | upper | grid
catboost | iterations | integer | 100 | 10086 |
catboost | depth | integer | 6 | 10 |
catboost | l2_leaf_reg | numeric | 0 | 9 |
catboost | bagging_temperature | numeric | 0 | 1.5e |
catboost | learning_rate | numeric | 0.001 | 2 |
gbm | n.trees | integer | 100 | 10086 |
gbm | interaction.depth | integer | 1 | 5 |
gbm | n.minobsinnode | integer | 2 | 25 |
gbm | shrinkage | numeric | 0.001 | 0.1 |
gbm | bag.fraction | numeric | 0.2 | 1 |
glmnet | alpha | numeric | 0 | 1 |
glmnet | lambda | numeric | 0.001 | 1024 |
kknn | k | integer | 1 | 30 |
randomforest | ntree | integer | 100 | 10086 |
randomforest | replace | logical | FALSE | TRUE |
randomforest | nodesize | integer | 1 | 5 |
ranger | num.trees | integer | 100 | 10086 |
ranger | min.node.size | integer | 1 | 4 |
ranger | replace | logical | FALSE | TRUE |
ranger | splitrule | discrete | - | - | gini, extratrees
xgboost | booster | discrete | - | - | gbtree, gblinear
xgboost | nrounds | integer | 1 | 1000 |
xgboost | eta | numeric | 0.031 | 1 |
xgboost | subsample | numeric | 0.5 | 1 |
xgboost | max_depth | integer | 6 | 15 |
xgboost | min_child_weight | numeric | 1 | 8 |
xgboost | colsample_bytree | numeric | 0.2 | 1 |
xgboost | colsample_bylevel | numeric | 0.2 | 1 |
The parameter splitrule for ranger was either gini or extratrees; the parameter booster for xgboost was either gbtree or gblinear.
The parameters for each model were randomly drawn within the presented ranges using the corresponding distributions. Although they were drawn randomly, all of them are reproducible.
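As an illustration of this procedure, the sketch below draws one random gbm configuration within the ranges of Table 1. The seed value and the use of uniform distributions are assumptions made for illustration; the actual drawing script may differ:

# Illustrative only: draw one random gbm configuration within the Table 1 ranges
set.seed(123)  # assumed seed, not the one used for the benchmark

draw_gbm_params <- function() {
  list(
    n.trees           = sample(100:10086, 1),
    interaction.depth = sample(1:5, 1),
    n.minobsinnode    = sample(2:25, 1),
    shrinkage         = runif(1, 0.001, 0.1),
    bag.fraction      = runif(1, 0.2, 1)
  )
}

params <- draw_gbm_params()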
3.2 Datasets
All datasets used in the benchmark were downloaded from OpenML Vanschoren et al. [2013]. These datasets come from OpenML100 and are all binary classification problems that satisfy the OpenML100 selection criteria on the number of observations, the number of features, and the ratio of the minority class to the majority class.
The OpenML ids of all considered datasets are included in the benchmark data itself (the "dataset" column described in Section 4.3.1).
4 Data collection
Each dataset was divided into train/test bootstrap pairs. Because each pair was drawn independently in a bootstrap manner, a given row is not guaranteed to be chosen exactly once for a test set or a fixed number of times for the train sets. Each considered model with a particular set of hyperparameters was trained once on each train subset and evaluated on the corresponding test subset. Two measures were collected: ACC and AUC. Thus, for each model, hyperparameter set and dataset there should be one ACC and one AUC value per train/test pair. However, not all computations have finished; the results are updated on a regular basis as they progress. Additionally, learning times were collected.
An important aspect of our approach is that the datasets were chosen arbitrarily and the hyperparameters were drawn randomly, as described above in Table 1, before any calculations began, and they were not updated in the meantime. The train/test splits are also fixed for each dataset. This makes comparing results between different algorithms easier. We try to cover as much of the hyperparameter space as possible, including subspaces in which parameters may not work well or may not work at all. This should enable broader-spectrum research on hyperparameters. However, for practical reasons, we had to draw them only from finite ranges.
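A minimal sketch of how such fixed bootstrap train/test pairs could be generated and stored is given below. The seed, the number of pairs and the assumption that the out-of-bag rows form the test subset are illustrative choices, not a description of the exact procedure used:

set.seed(2020)  # assumed seed
n <- 1000       # number of rows in the dataset (placeholder)
B <- 20         # number of bootstrap pairs (placeholder)

splits <- lapply(seq_len(B), function(b) {
  train <- sample(n, n, replace = TRUE)        # bootstrap train set
  test  <- setdiff(seq_len(n), unique(train))  # out-of-bag rows as test set (assumption)
  list(train = train, test = test)
})

# Store only the test row indices, one split per line, as in the splits files
writeLines(vapply(splits, function(s) paste(s$test, collapse = ","), character(1)),
           "splits_1486.csv")  # hypothetical file name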
4.1 ABC4ML - Automated benchmark collector
ABC4ML is an abbreviation of Automated Benchmark Collector, the software written and used for easy and reproducible benchmarking on the selected datasets. Its main function is calculate, which takes a model name, a vector of OpenML dataset ids, a path to a file with train/test splits and a path to a file with parameter sets. This allows easy and transparent calculations as well as easy parallelization. If the considered machine learning algorithm does not converge, NA is returned.
As described above, before running any calculations you need to create a file with the data splits (in the form of row ids) and a file with the parameter sets.
Thanks to the use of OpenML, there is no need to download the datasets by hand before running the calculations. They are downloaded into RAM one by one as they are needed during the computations.
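For example, with the OpenML R package a dataset can be fetched directly into memory by its id (1486 is used here only because it also appears as the example id in Section 4.3.1):

library(OpenML)

oml_ds <- getOMLDataSet(data.id = 1486)  # downloads the dataset into RAM
df     <- oml_ds$data                    # plain data.frame
target <- oml_ds$target.features         # name of the target column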
4.2 Details
After some initial data collection, we obtained estimated computation times for each dataset. The datasets were then grouped in such a way that the total time required for all datasets in a group is nearly equal across groups.
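The grouping itself can be done with a simple greedy heuristic. The sketch below is our own illustration, not necessarily the procedure that was actually used; it assigns each dataset, from the most to the least expensive, to the currently least loaded group:

# est_times: named vector of estimated computation times per dataset id
balance_groups <- function(est_times, n_groups) {
  order_idx <- order(est_times, decreasing = TRUE)
  groups    <- vector("list", n_groups)
  load      <- numeric(n_groups)
  for (i in order_idx) {
    g           <- which.min(load)                      # least loaded group so far
    groups[[g]] <- c(groups[[g]], names(est_times)[i])
    load[g]     <- load[g] + est_times[i]
  }
  groups
}

# hypothetical dataset names and times
balance_groups(c(ds_a = 12, ds_b = 3, ds_c = 8), n_groups = 2)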
For each model and each group of datasets, a Docker container calculating the benchmarks was run.
4.3 Dataset
4.3.1 Benchmarks
The resulting dataset is a data frame with 7 columns. The first column, "dataset", denotes the OpenML id of the dataset, e.g. 1486; the second, "row_index", is the train/test split identifier from the splits file, e.g. 12; the third, "model", is the model name, e.g. gbm or kknn. The fourth, "param_index", is the hyperparameter set identifier from the parameters file; these identifiers start from 1001 (1001 denotes the first set). The fifth, "time", is the learning time measured in ms. The last two columns are the acc and auc measures.
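A short sketch of how this data frame might be explored is given below; the file name is hypothetical, and only the column names follow the description above:

library(dplyr)

bench <- read.csv("mementoml_benchmark.csv")  # hypothetical file name

# Mean AUC and learning time of each gbm configuration on dataset 1486,
# averaged over the train/test splits
bench %>%
  filter(dataset == 1486, model == "gbm") %>%
  group_by(param_index) %>%
  summarise(mean_auc     = mean(auc,  na.rm = TRUE),
            mean_time_ms = mean(time, na.rm = TRUE)) %>%
  arrange(desc(mean_auc))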
4.3.2 Hyperparameters
There is also a data frame for each model with its hyperparameters. The first column is "param_index"; the remaining columns correspond to the hyperparameters of that model used in the calculations.
4.3.3 Train/test splits
For each dataset used there is a separate file with the train/test splits. Each of its rows contains the row indices of a single test subset of that dataset.
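Given that format, such a file can be read line by line, for example as follows (the comma separator and the file name are assumptions):

# Each line of the splits file holds the row indices of one test subset
read_splits <- function(path) {
  lapply(strsplit(readLines(path), ","), as.integer)
}

splits <- read_splits("splits_1486.csv")  # hypothetical file name
test_1 <- splits[[1]]                     # row indices of the first test subset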
5 Reproducibility
All hyperparameters were chosen with a fixed random seed, thus they can be reproduced at will. However, not all parameters were chosen in the same way: the catboost parameters were drawn using a newer version of the script. This updated version of the script also makes it possible to draw parameters for the other algorithms.
Similarly, a fixed random seed was set before every single run of each algorithm (the default seed parameter of the calculate function). Thus you can easily and independently reproduce a subset of the results without redoing all previous calculations. Some of the results were reproduced to verify the proper functioning of the whole framework.
Moreover, we track the versions of the software libraries used.
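In R, one possible way to record the versions of the loaded packages alongside the results (not necessarily the mechanism used here) is:

# Save the R version and versions of all attached packages next to the results
writeLines(capture.output(sessionInfo()), "session_info.txt")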
6 Further calculations and Docker
It is much easier to run your own calculations using our Docker container. This container is a modified r-base image that installs all needed Linux and R packages, has a directory structure compliant with the assumptions of the calculate.R script, includes the screen utility, and copies the train/test splits from the "datasets" directory and the parameters from the "parameters" directory. If you want to add a new algorithm that is not present in the mlr framework, you need to add an install statement to the install_packages.R script and add a new function to calculate.R that returns a compliant result vector.
To build the Docker image, just run:
sudo docker build -t elo .
Before you run the container, ensure that a results directory exists on the host. The results directory inside the container has to be mapped to the host's results directory.
To run the container, simply type:
sudo docker run -ti -v [host’s results absolute directory]:/results elo bash
After running this command, open an R console, source the calculate function from the "scripts" directory (calculate.R) and run it with the proper parameters.
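A hypothetical invocation is sketched below; the argument names and file names are assumptions, and only the argument order follows the description in Section 4.1:

# Hypothetical call; the actual argument names in calculate.R may differ
source("scripts/calculate.R")

calculate("gbm",                        # model name
          c(1486),                      # vector of OpenML dataset ids
          "datasets/splits_1486.csv",   # path to the train/test splits file (hypothetical name)
          "parameters/gbm_params.csv")  # path to the parameter sets file (hypothetical name)
# a default seed argument is set inside calculate (see Section 5)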
All results will be saved to the host's results directory passed in the docker run command.
7 Discussion
This work results in a comprehensive dataset covering a wide hyperparameter space. The dataset is ready to use, easily accessible and, if needed, allows further calculations in a compliant form. It can be used to develop a branch of machine learning, meta-learning: finding the best hyperparameter defaults and reasonable subspaces, as well as discovering the impact and importance of particular hyperparameters.
This dataset (MementoML) can be found at: https://www.kaggle.com/mi2datalab/mementoml
References
- Bischl et al. [2016] B. Bischl, M. Lang, L. Kotthoff, J. Schiffner, J. Richter, E. Studerus, G. Casalicchio, and Z. M. Jones. mlr: Machine learning in r. Journal of Machine Learning Research, 17(170):1–5, 2016. URL http://jmlr.org/papers/v17/15-066.html.
- Edel et al. [2014] M. Edel, A. Soni, and R. R. Curtin. An automatic benchmarking system. In NIPS 2014 Workshop on Software Engineering for Machine Learning (SE4ML’2014), volume 1, 2014.
- Friedman et al. [2010] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. URL http://www.jstatsoft.org/v33/i01/.
- Gosiewska et al. [2020] A. Gosiewska, K. Woznica, and P. Biecek. Interpretable meta-measure for model performance, 2020. URL https://arxiv.org/abs/2006.02293.
- Kühn et al. [2018] D. Kühn, P. Probst, J. Thomas, and B. Bischl. Automatic Exploration of Machine Learning Experiments on OpenML, 2018. URL https://arxiv.org/pdf/1806.10961.pdf.
- Liaw and Wiener [2002] A. Liaw and M. Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002. URL https://CRAN.R-project.org/doc/Rnews/.
- Probst et al. [2018] P. Probst, B. Bischl, and A.-L. Boulesteix. Tunability: Importance of Hyperparameters of Machine Learning Algorithms, 2018. URL https://arxiv.org/abs/1802.09596.
- Prokhorenkova et al. [2017] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin. Catboost: unbiased boosting with categorical features, 2017. URL https://arxiv.org/abs/1706.09516.
- Smith et al. [2014] M. R. Smith, A. White, C. Giraud-Carrier, and T. Martinez. An Easy to Use Repository for Comparing and Improving Machine Learning Algorithm Usage, 2014. URL https://arxiv.org/abs/1405.7292.
- Vanschoren et al. [2013] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked Science in Machine Learning. SIGKDD Explorations, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198.
- Wright and Ziegler [2017] M. N. Wright and A. Ziegler. ranger: A fast implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1):1–17, 2017. doi: 10.18637/jss.v077.i01.