MementoML: Performance of selected machine learning algorithm configurations on OpenML100 datasets
1 Introduction
Finding optimal hyperparameters for a machine learning algorithm can often significantly improve its performance Probst et al. [2018]. But how can they be chosen in a time-efficient way? In this paper we present the protocol of generating benchmark data describing the performance of different ML algorithms with different hyperparameter configurations. Data collected in this way is used to study the factors influencing the algorithm's performance.
This collection was prepared for the purposes of the study presented in the EPP paper Gosiewska et al. [2020]. We tested algorithm performance on a dense grid of hyperparameters. The tested datasets and hyperparameters were chosen before any algorithm was run and were not changed afterwards. This differs from the approach usually taken in hyperparameter tuning, where the selection of candidate hyperparameters depends on previously obtained results. However, such a fixed selection allows for a systematic analysis of performance sensitivity to individual hyperparameters.
This resulted in a comprehensive dataset of benchmarks that we would like to share. We hope that the computed and collected results may be helpful for other researchers. This paper describes the way the data was collected. It covers benchmarks of 7 popular machine learning algorithms on 39 OpenML datasets.
The detailed data forming this benchmark are available at: https://www.kaggle.com/mi2datalab/mementoml.
2 Related datasets
Kühn et al. [2018] introduced a benchmark of algorithms created for the OpenML repository. This dataset contains data about 6 algorithms implemented in R: glmnet, rpart, kknn, svm, ranger, xgboost. It also allows running additional computations to obtain further results in a similar way.
Smith et al. [2014] provide a MongoDB database with data at the instance level. It contains predictions made for every single instance in the considered datasets, together with information about the algorithms and their hyperparameters. This benchmark can also be extended.
The mlpack benchmark Edel et al. [2014] contains data about the performance of different algorithms in popular machine learning frameworks and libraries. It also provides comprehensive scripts for further evaluation.
3 Algorithms, datasets and hyperparameters used
We used a number of popular machine learning algorithms: gradient boosting on decision trees (catboost Prokhorenkova et al. [2017], gbm, xgboost), generalized linear models (glmnet Friedman et al. [2010]), nearest neighbours (kknn), and random forests (randomforest Liaw and Wiener [2002], ranger Wright and Ziegler [2017]). All computations were made in R and all models come from R packages. Almost all of them were used through the mlr Bischl et al. [2016] framework. Only catboost was used directly, because it is not included in mlr.
Benchmarks on the predicted values were calculated with the mlr function performance for all models except catboost, for which measureACC and measureAUC were used directly.
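For illustration, the snippet below is a minimal sketch of this evaluation path through mlr. The data frame df, its "target" column and the train/test index vectors are placeholders, and the gbm hyperparameter values are arbitrary; only the mlr calls correspond to the setup described above:

library(mlr)

# df, "target", train_idx and test_idx are placeholders, not the actual benchmark data
task <- makeClassifTask(data = df, target = "target")

# predict.type = "prob" is required so that AUC can be computed
learner <- makeLearner("classif.gbm", predict.type = "prob",
                       par.vals = list(n.trees = 500, shrinkage = 0.05))

model <- train(learner, task, subset = train_idx)
pred  <- predict(model, task = task, subset = test_idx)

# ACC and AUC for one train/test split
performance(pred, measures = list(acc, auc))

For catboost, which is not wrapped in mlr, the lower-level helpers measureACC and measureAUC were applied to its predictions instead.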
3.1 Algorithms and parameters
These are the models used in the computations, with the considered parameter ranges (Table 1):
algorithm | parameter | type | lower | upper | grid
catboost | iterations | integer | 100 | 10086 |
catboost | depth | integer | 6 | 10 |
catboost | l2_leaf_reg | numeric | 0 | 9 |
catboost | bagging_temperature | numeric | 0 | 1.5e |
catboost | learning_rate | numeric | 0.001 | 2 |
gbm | n.trees | integer | 100 | 10086 |
gbm | interaction.depth | integer | 1 | 5 |
gbm | n.minobsinnode | integer | 2 | 25 |
gbm | shrinkage | numeric | 0.001 | 0.1 |
gbm | bag.fraction | numeric | 0.2 | 1 |
glmnet | alpha | numeric | 0 | 1 |
glmnet | lambda | numeric | 0.001 | 1024 |
kknn | k | integer | 1 | 30 |
randomforest | ntree | integer | 100 | 10086 |
randomforest | replace | logical | FALSE | TRUE |
randomforest | nodesize | integer | 1 | 5 |
ranger | num.trees | integer | 100 | 10086 |
ranger | min.node.size | integer | 1 | 4 |
ranger | replace | logical | FALSE | TRUE |
ranger | splitrule | discrete | - | - | gini, extratrees
xgboost | booster | discrete | - | - | gbtree, gblinear
xgboost | nrounds | integer | 1 | 1000 |
xgboost | eta | numeric | 0.031 | 1 |
xgboost | subsample | numeric | 0.5 | 1 |
xgboost | max_depth | integer | 6 | 15 |
xgboost | min_child_weight | numeric | 1 | 8 |
xgboost | colsample_bytree | numeric | 0.2 | 1 |
xgboost | colsample_bylevel | numeric | 0.2 | 1 |
The parameter splitrule for ranger was either gini or extratrees; the parameter booster for xgboost was either gbtree or gblinear.
The parameters for each model were randomly drawn within the presented ranges using the corresponding distributions. Although they were drawn randomly, all of them are reproducible.
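As an illustration of this procedure, the sketch below draws one random gbm configuration within the ranges of Table 1. The seed value and the use of uniform distributions are assumptions made for illustration; the actual drawing script may differ:

# Illustrative only: draw one random gbm configuration within the Table 1 ranges
set.seed(123)  # assumed seed, not the one used for the benchmark

draw_gbm_params <- function() {
  list(
    n.trees           = sample(100:10086, 1),
    interaction.depth = sample(1:5, 1),
    n.minobsinnode    = sample(2:25, 1),
    shrinkage         = runif(1, 0.001, 0.1),
    bag.fraction      = runif(1, 0.2, 1)
  )
}

params <- draw_gbm_params()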
3.2 Datasets
All datasets used in the benchmark were downloaded from OpenML Vanschoren et al. [2013]. These datasets come from OpenML100 and are all binary classification problems that satisfy the OpenML100 selection criteria on the number of observations, the number of features, and the ratio of the minority class to the majority class.
The OpenML ids of all considered datasets are included in the benchmark data itself (the "dataset" column described in Section 4.3.1).
4 Data collection
Each dataset was divided into train/test bootstrap pairs. Because each pair was drawn independently in a bootstrap manner, a given row is not guaranteed to be chosen exactly once for a test set or a fixed number of times for the train sets. Each considered model with a particular set of hyperparameters was trained once on each train subset and evaluated on the corresponding test subset. Two measures were collected: ACC and AUC. Thus, for each model, hyperparameter set and dataset there should be one ACC and one AUC value per train/test pair. However, not all computations have finished; the results are updated on a regular basis as they progress. Additionally, learning times were collected.
An important aspect of our approach is that the datasets were chosen arbitrarily and the hyperparameters were drawn randomly, as described above in Table 1, before any calculations began, and they were not updated in the meantime. The train/test splits are also fixed for each dataset. This makes comparing results between different algorithms easier. We try to cover as much of the hyperparameter space as possible, including subspaces in which parameters may not work well or may not work at all. This should enable broader-spectrum research on hyperparameters. However, for practical reasons, we had to draw them only from finite ranges.
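A minimal sketch of how such fixed bootstrap train/test pairs could be generated and stored is given below. The seed, the number of pairs and the assumption that the out-of-bag rows form the test subset are illustrative choices, not a description of the exact procedure used:

set.seed(2020)  # assumed seed
n <- 1000       # number of rows in the dataset (placeholder)
B <- 20         # number of bootstrap pairs (placeholder)

splits <- lapply(seq_len(B), function(b) {
  train <- sample(n, n, replace = TRUE)        # bootstrap train set
  test  <- setdiff(seq_len(n), unique(train))  # out-of-bag rows as test set (assumption)
  list(train = train, test = test)
})

# Store only the test row indices, one split per line, as in the splits files
writeLines(vapply(splits, function(s) paste(s$test, collapse = ","), character(1)),
           "splits_1486.csv")  # hypothetical file name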
4.1 ABC4ML - Automated benchmark collector
ABC4ML is an abbreviation of Automated Benchmark Collector, the software written and used for easy and reproducible benchmarking on the selected datasets. Its main function is calculate, which takes a model name, a vector of OpenML dataset ids, a path to a file with train/test splits and a path to a file with parameter sets. This allows easy and transparent calculations as well as easy parallelization. If the considered machine learning algorithm does not converge, NA is returned.
As described above, before running any calculations you need to create a file with the data splits (in the form of row ids) and a file with the parameter sets.
Thanks to the use of OpenML, there is no need to download the datasets by hand before running the calculations. They are downloaded into RAM one by one as they are needed during the computations.
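For example, with the OpenML R package a dataset can be fetched directly into memory by its id (1486 is used here only because it also appears as the example id in Section 4.3.1):

library(OpenML)

oml_ds <- getOMLDataSet(data.id = 1486)  # downloads the dataset into RAM
df     <- oml_ds$data                    # plain data.frame
target <- oml_ds$target.features         # name of the target column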
4.2 Details
After some initial data collection, we obtained estimated computation times for each dataset. The datasets were then grouped in such a way that the total time required for all datasets in a group is nearly equal across groups.
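The grouping itself can be done with a simple greedy heuristic. The sketch below is our own illustration, not necessarily the procedure that was actually used; it assigns each dataset, from the most to the least expensive, to the currently least loaded group:

# est_times: named vector of estimated computation times per dataset id
balance_groups <- function(est_times, n_groups) {
  order_idx <- order(est_times, decreasing = TRUE)
  groups    <- vector("list", n_groups)
  load      <- numeric(n_groups)
  for (i in order_idx) {
    g           <- which.min(load)                      # least loaded group so far
    groups[[g]] <- c(groups[[g]], names(est_times)[i])
    load[g]     <- load[g] + est_times[i]
  }
  groups
}

# hypothetical dataset names and times
balance_groups(c(ds_a = 12, ds_b = 3, ds_c = 8), n_groups = 2)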
For each model and each group of datasets, a Docker container calculating the benchmarks was run.
4.3 Dataset
4.3.1 Benchmarks
The resulting dataset is a data frame with 7 columns. The first column, "dataset", denotes the OpenML id of the dataset, e.g. 1486; the second, "row_index", is the train/test split identifier from the splits file, e.g. 12; the third, "model", is the model name, e.g. gbm or kknn. The fourth, "param_index", is the hyperparameter set identifier from the parameters file; these identifiers start from 1001 (1001 denotes the first set). The fifth, "time", is the learning time measured in ms. The last two columns are the acc and auc measures.
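A short sketch of how this data frame might be explored is given below; the file name is hypothetical, and only the column names follow the description above:

library(dplyr)

bench <- read.csv("mementoml_benchmark.csv")  # hypothetical file name

# Mean AUC and learning time of each gbm configuration on dataset 1486,
# averaged over the train/test splits
bench %>%
  filter(dataset == 1486, model == "gbm") %>%
  group_by(param_index) %>%
  summarise(mean_auc     = mean(auc,  na.rm = TRUE),
            mean_time_ms = mean(time, na.rm = TRUE)) %>%
  arrange(desc(mean_auc))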
4.3.2 Hyperparameters
There is also a data frame for each model with its hyperparameters. The first column is "param_index"; the remaining columns correspond to the hyperparameters of that model used in the calculations.
4.3.3 Train/test splits
For each dataset used there is a separate file with the train/test splits. Each of its rows contains the row indices of a single test subset of that dataset.
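Given that format, such a file can be read line by line, for example as follows (the comma separator and the file name are assumptions):

# Each line of the splits file holds the row indices of one test subset
read_splits <- function(path) {
  lapply(strsplit(readLines(path), ","), as.integer)
}

splits <- read_splits("splits_1486.csv")  # hypothetical file name
test_1 <- splits[[1]]                     # row indices of the first test subset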
5 Reproducibility
All hyperparameters were chosen with a fixed random seed, thus they can be reproduced at will. However, not all parameters were chosen in the same way: the catboost parameters were drawn using a newer version of the script. This updated version of the script also makes it possible to draw parameters for the other algorithms.
Similarly, a fixed random seed was set before every single run of each algorithm (the default seed parameter of the calculate function). Thus you can easily and independently reproduce a subset of the results without redoing all previous calculations. Some of the results were reproduced to verify the proper functioning of the whole framework.
Moreover, we track the versions of the software libraries used.
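In R, one possible way to record the versions of the loaded packages alongside the results (not necessarily the mechanism used here) is:

# Save the R version and versions of all attached packages next to the results
writeLines(capture.output(sessionInfo()), "session_info.txt")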
6 Further calculations and Docker
It is much easier to run your own calculations using our Docker container. This container is a modified r-base image that installs all needed Linux and R packages, has a directory structure compliant with the assumptions of the calculate.R script, includes the screen utility, and copies the train/test splits from the "datasets" directory and the parameters from the "parameters" directory. If you want to add a new algorithm that is not present in the mlr framework, you need to add an install statement to the install_packages.R script and add a new function to calculate.R that returns a compliant result vector.
To build the Docker image, just run:
sudo docker build -t elo .
Before you run the container, ensure that a results directory exists on the host. The results directory inside the container has to be mapped to the host's results directory.
To run the container, simply type:
sudo docker run -ti -v [host’s results absolute directory]:/results elo bash
After running this command, open an R console, source the calculate function from the "scripts" directory (calculate.R) and run it with the proper parameters.
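A hypothetical invocation is sketched below; the argument names and file names are assumptions, and only the argument order follows the description in Section 4.1:

# Hypothetical call; the actual argument names in calculate.R may differ
source("scripts/calculate.R")

calculate("gbm",                        # model name
          c(1486),                      # vector of OpenML dataset ids
          "datasets/splits_1486.csv",   # path to the train/test splits file (hypothetical name)
          "parameters/gbm_params.csv")  # path to the parameter sets file (hypothetical name)
# a default seed argument is set inside calculate (see Section 5)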
All results will be saved to the host's results directory passed in the docker run command.
7 Discussion
This work results in a comprehensive dataset covering a wide hyperparameter space. The dataset is ready to use, easily accessible and, if needed, allows further calculations in a compliant form. It can be used to develop a branch of machine learning, meta-learning: finding the best hyperparameter defaults and reasonable subspaces, as well as discovering the impact and importance of particular hyperparameters.
This dataset (MementoML) can be found at: https://www.kaggle.com/mi2datalab/mementoml
References
- Bischl et al. [2016] B. Bischl, M. Lang, L. Kotthoff, J. Schiffner, J. Richter, E. Studerus, G. Casalicchio, and Z. M. Jones. mlr: Machine learning in r. Journal of Machine Learning Research, 17(170):1–5, 2016. URL http://jmlr.org/papers/v17/15-066.html.
- Edel et al. [2014] M. Edel, A. Soni, and R. R. Curtin. An automatic benchmarking system. In NIPS 2014 Workshop on Software Engineering for Machine Learning (SE4ML’2014), volume 1, 2014.
- Friedman et al. [2010] J. Friedman, T. Hastie, and R. Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. URL http://www.jstatsoft.org/v33/i01/.
- Gosiewska et al. [2020] A. Gosiewska, K. Woznica, and P. Biecek. Interpretable meta-measure for model performance, 2020. URL https://arxiv.org/abs/2006.02293.
- Kühn et al. [2018] D. Kühn, P. Probst, J. Thomas, and B. Bischl. Automatic Exploration of Machine Learning Experiments on OpenML, 2018. URL https://arxiv.org/pdf/1806.10961.pdf.
- Liaw and Wiener [2002] A. Liaw and M. Wiener. Classification and regression by randomforest. R News, 2(3):18–22, 2002. URL https://CRAN.R-project.org/doc/Rnews/.
- Probst et al. [2018] P. Probst, B. Bischl, and A.-L. Boulesteix. Tunability: Importance of Hyperparameters of Machine Learning Algorithms, 2018. URL https://arxiv.org/abs/1802.09596.
- Prokhorenkova et al. [2017] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, and A. Gulin. Catboost: unbiased boosting with categorical features, 2017. URL https://arxiv.org/abs/1706.09516.
- Smith et al. [2014] M. R. Smith, A. White, C. Giraud-Carrier, and T. Martinez. An Easy to Use Repository for Comparing and Improving Machine Learning Algorithm Usage, 2014. URL https://arxiv.org/abs/1405.7292.
- Vanschoren et al. [2013] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked Science in Machine Learning. SIGKDD Explorations, 15(2):49–60, 2013. doi: 10.1145/2641190.2641198. URL http://doi.acm.org/10.1145/2641190.2641198.
- Wright and Ziegler [2017] M. N. Wright and A. Ziegler. ranger: A fast implementation of Random Forests for High Dimensional Data in C++ and R. Journal of Statistical Software, 77(1):1–17, 2017. doi: 10.18637/jss.v077.i01.