GPU Acceleration of Clustering Meta-feature Extraction using RAPIDS

Lucas L. Silva; Ricardo Franco; André Carvalho; Wellington Martins

doi:10.5753/wperformance.2023.230098

Lucas L. Silva UFG
Ricardo Franco UFG
André Carvalho USP
Wellington Martins UFG

Abstract

Although machine learning algorithms have been applied successfully to many tasks, selecting the most suitable algorithm for a given dataset is not straightforward. Algorithm recommendation can be automated through meta-learning, but this requires efficient methods for characterizing datasets, i.e., meta-feature extraction. In this work, we propose accelerating the extraction of clustering-based meta-features on GPUs, taking advantage of the optimized libraries and APIs of the RAPIDS framework. We parallelized a well-known meta-feature extraction tool (MFE) with RAPIDS to accelerate the extraction of clustering meta-features. Our experiment shows that the extraction takes significantly less time, running up to 10x faster than the MFE implementation. These results are promising and suggest that large-scale meta-learning experiments are more feasible than before.
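To make concrete what a clustering-based meta-feature can look like, the sketch below computes the Dunn index, one of the internal validity measures discussed in the works listed in the references. It is an illustrative pure-Python version written for this page, not the MFE or RAPIDS code; the function name `dunn_index` and the toy data are assumptions. The nested pairwise-distance loops are the kind of work that a GPU port would parallelize.

```python
import math
from itertools import combinations


def dunn_index(points, labels):
    """Smallest inter-cluster distance divided by the largest cluster diameter."""
    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Largest pairwise distance inside any single cluster (its diameter).
    max_diameter = max(
        (
            math.dist(a, b)
            for members in clusters.values()
            for a, b in combinations(members, 2)
        ),
        default=0.0,
    )

    # Smallest distance between two points that belong to different clusters.
    min_separation = min(
        math.dist(a, b)
        for first, second in combinations(clusters.values(), 2)
        for a in first
        for b in second
    )

    # All clusters are single points: separation is unbounded relative to diameter.
    if max_diameter == 0.0:
        return math.inf
    return min_separation / max_diameter


# Toy dataset (hypothetical): two compact clusters that are 5 units apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
labels = [0, 0, 1, 1]
print(dunn_index(points, labels))
```

Both reductions touch every pair of points, so the cost grows quadratically with the number of instances, which is why measures of this kind are natural candidates for GPU acceleration.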

References

Alcobaça, E., Siqueira, F., Rivolli, A., Garcia, L. P., Oliva, J. T., and De Carvalho, A. C. (2020). MFE: Towards reproducible meta-feature extraction. Journal of Machine Learning Research, 21(1):4503–4507.

Deborah, L. J., Baskaran, R., and Kannan, A. (2010). A survey on internal validity measure for cluster validation. International Journal of Computer Science & Engineering Survey, 1(2):85–102.

Frank, A. (2010). UCI Machine Learning Repository. http://archive.ics.uci.edu/ml.

Lemke, C., Budka, M., and Gabrys, B. (2015). Metalearning: a survey of trends and technologies. Artificial Intelligence Review, 44:117–130.

Luna-Romera, J. M., del Mar Martinez-Ballesteros, M., Garcia-Gutierrez, J., and Riquelme-Santos, J. C. (2016). An approach to Silhouette and Dunn clustering indices applied to big data in Spark. In Advances in Artificial Intelligence: 17th Conference of the Spanish Association for Artificial Intelligence, CAEPIA 2016, Salamanca, Spain, September 14-16, 2016. Proceedings 17, pages 160–169. Springer.

Ncir, C.-E. B., Hamza, A., and Bouaguel, W. (2021). Parallel and scalable Dunn index for the validation of big data clusters. Parallel Computing, 102:102751.

Nishino, R. and Loomis, S. H. C. (2017). CuPy: A NumPy-compatible library for NVIDIA GPU calculations. 31st Conference on Neural Information Processing Systems, 151(7).

Paiva, P. Y. A., Moreno, C. C., Smith-Miles, K., Valeriano, M. G., and Lorena, A. C. (2022). Relating instance hardness to classification performance in a dataset: a visual approach. Machine Learning, 111(8):3085–3123.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Rice, J. R. (1976). The algorithm selection problem. In Advances in Computers, volume 15, pages 65–118. Elsevier.

Rivolli, A., Garcia, L. P., Soares, C., Vanschoren, J., and de Carvalho, A. C. (2022). Meta-features for meta-learning. Knowledge-Based Systems, 240:108101.

Team, R. D. (2018). RAPIDS: Collection of libraries for end to end GPU data science. NVIDIA, Santa Clara, CA, USA.

Zerabi, S., Meshoul, S., and Boucherkha, S. C. (2020). Models for internal clustering validation indexes based on Hadoop-MapReduce. International Journal of Distributed Systems and Technologies (IJDST), 11(3):42–67.