Collecting Meta-Data from the OpenML Public Repository
Abstract
In Machine Learning (ML), selecting the most suitable algorithm for a given problem is a challenge. Meta-Learning (MtL) offers an alternative approach by exploring the relationships between dataset characteristics and the performance of ML algorithms. Conducting an MtL study requires a meta-dataset built from datasets that have varied characteristics and that challenge ML algorithms to different degrees. This study analyzes the information available for building such meta-datasets in the OpenML public repository, which provides a Python API for easy data import. Our assessment of the content currently available on the platform shows that extensive meta-feature values are not yet provided for every dataset, which limits how completely the datasets can be characterized.
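As an illustration of the kind of assessment described above, the following is a minimal sketch, not the procedure used in the study. It assumes the openml-python package, whose `get_dataset` function returns an object with a `qualities` dictionary of meta-features. The helper names `meta_feature_coverage` and `fetch_openml_qualities` are hypothetical, the values in the demo are made up, and the exact keyword arguments accepted by `get_dataset` may differ between library versions.

```python
from typing import Dict, Iterable, List, Optional

# One record per dataset: meta-feature name -> value (None or NaN when missing).
Qualities = Dict[str, Optional[float]]


def meta_feature_coverage(records: List[Qualities]) -> Dict[str, float]:
    """Return, for each meta-feature, the fraction of datasets that have a value."""
    if not records:
        return {}
    names = sorted({name for rec in records for name in rec})
    coverage: Dict[str, float] = {}
    for name in names:
        present = 0
        for rec in records:
            value = rec.get(name)
            # value == value is False only for NaN, so NaN counts as missing.
            if value is not None and value == value:
                present += 1
        coverage[name] = present / len(records)
    return coverage


def fetch_openml_qualities(dataset_ids: Iterable[int]) -> List[Qualities]:
    """Collect the qualities of each dataset from OpenML (requires network access)."""
    import openml  # imported lazily so the coverage helper also works offline

    records: List[Qualities] = []
    for did in dataset_ids:
        # download_data=False asks for metadata only, not the data file itself.
        dataset = openml.datasets.get_dataset(did, download_data=False)
        records.append(dict(dataset.qualities or {}))
    return records


if __name__ == "__main__":
    # Made-up values, used only to show the shape of the input.
    toy = [
        {"NumberOfInstances": 150.0, "NumberOfFeatures": 5.0, "ClassEntropy": 1.58},
        {"NumberOfInstances": 1000.0, "NumberOfFeatures": 21.0, "ClassEntropy": None},
        {"NumberOfInstances": 690.0, "NumberOfFeatures": float("nan")},
    ]
    for name, fraction in meta_feature_coverage(toy).items():
        print(f"{name}: {fraction:.2f}")
```

Passing the output of `fetch_openml_qualities` to `meta_feature_coverage` would give a per-meta-feature coverage table for the chosen datasets. Low fractions would correspond to the incomplete characterization noted in the abstract.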