Impulsionando Árvores Extremamente Aleatórias em Paralelo para a Classificação de Dados Textuais

  • Julio Pires Instituto Federal Goiano
  • Wellington Martins Universidade Federal de Goiás

Resumo


Os algoritmos de aprendizado usando conjuntos de árvores de decisão têm se destacado na classificação de documentos, mas não sem pagar um alto custo computacional. A exploração de paralelismo tem sido uma alternativa para viabilizar o uso destes algoritmos mais sofisticados. Neste trabalho propomos um algoritmo paralelo para acelerar a construção destas árvores de decisão utilizadas num método recente que demonstrou superar os classificadores de última geração para dados textuais. Resultados experimentais, utilizando bases de dados textuais padronizadas, mostram que o algoritmo implementado em uma arquitetura manycore (GPU) é capaz de reduzir o tempo de execução em até 26 vezes em comparação a um algoritmo sequencial equivalente.

Referências

Alharthi, A., Krotov, V., and Bowman, M. (2017). Addressing barriers to big data. Business Horizons, 60(3):285 – 292.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2):123–140.

Breiman, L. (2001). Random forests. Mach. Learn., 45(1):5–32.

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA.

Browne, J., Tomita, T. M., Mhembere, D., Burns, R. C., and Vogelstein, J. T. (2018) Forest packing: Fast, parallel decision forests. CoRR, abs/1806.07300.

Campos, R., Canuto, S., Salles, T., de Sá, C. C., and Gonçalves, M. A. (2017). Stacking bagged and boosted forests for effective automated classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pages 105–114, New York, NY, USA. ACM.

Cano, A. (2018). A survey on graphic processing unit computing for large-scale data mining. Wiley Interdiscip. Rev. Data Min. Knowl. Discov., 8(1).

Chiu, C.-C., Luo, G.-H., and Yuan, S.-M. (2011). A decision tree using cuda gpus. In Proceedings of the 13th International Conference on Information Integration and Webbased Applications and Services, iiWAS ’11, pages 399–402, New York, NY, USA. ACM.

Franco, M. A. and Bacardit, J. (2016). Large-scale experimental evaluation of gpu strategies for evolutionary machine learning. Inf. Sci., 330(C):385–402.

Freund, Y. and Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119 – 139.

Geurts, P., Ernst, D., and Wehenkel, L. (2006). Extremely randomized trees. Machine Learning, 63(1):3–42.

Grahn, H., Lavesson, N., Lapajne, M. H., and Slat, D. (2011). Cudarf: A cuda-based implementation of random forests. In 2011 9th IEEE/ACS International Conference on Computer Systems and Applications (AICCSA), pages 95–101.

Han, J., Kamber, M., and Pei, J. (2012). Data mining concepts and techniques, third edition.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The elements of statistical learning: data mining, inference and prediction. Springer, 2 edition.

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning – with Applications in R, volume 103 of Springer Texts in Statistics. Springer, New York.

Jansson, K., Sundell, H., and Boström, H. (2014). gpurf and gpuert: Efficient and scalable gpu algorithms for decision tree ensembles. In 2014 IEEE International Parallel Distributed Processing Symposium Workshops, pages 1612–1621.

Liao, Y., Rubinsteyn, A., Power, R., and Li, J. (2013). Learning random forests on the gpu. In Big Learning 2013: Advances in Algorithms and Data Management, Lake Tahoe.

Mitchell, R., Adinets, A., Rao, T., and Frank, E. (2018). Xgboost: Scalable gpu accelerated learning. CoRR, abs/1806.11248.

Mitchell, R. and Frank, E. (2017). Accelerating the xgboost algorithm using gpu computing. PeerJ Computer Science, 3:e127.

Nasridinov, A., Lee, Y., and Park, Y.-H. (2013). Decision tree construction on gpu: Ubiquitous parallel computing approach. Computing, 96:403–413.

Navarro, C., Hitschfeld, N., and Mateu, L. (2013). A survey on parallel computing and its applications in data-parallel problems using gpu architectures. Communications in Computational Physics, 15:285–329.

Pianu, D., Nerino, R., Ferraris, C., and Chimienti, A. (2016). A novel approach to train random forests on gpu for computer vision applications using local features. IJHPCA, 30:290–304.

Quinlan, J. R. (1986). Induction of decision trees. Mach. Learn., 1(1):81–106.

Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.

Rokach, L. (2016). Decision forest: Twenty years of research. Information Fusion, 27:111 – 125.

Salles, T., Gonçalves, M., Rodrigues, V., and Rocha, L. (2015). Broof: Exploiting outof-bag errors, boosting and random forests for effective automated classification. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, pages 353–362, New York, NY, USA. ACM.

Sharp, T. (2008). Implementing decision trees and forests on a gpu. In Forsyth, D., Torr, P., and Zisserman, A., editors, Computer Vision – ECCV 2008, pages 595–608, Berlin, Heidelberg. Springer Berlin Heidelberg.

Strnad, D. and Nerat, A. (2016). Parallel construction of classification trees on a gpu. Concurr. Comput. : Pract. Exper., 28(5):1417–1436.

Tan, P.-N., Steinbach, M., and Kumar, V. (2013). Introduction to Data Mining: Pearson New International Edition (English Edition). Pearson Education Limited, Harlow, ESX, UK.

Wen, Z., He, B., Kotagiri, R., Lu, S., and Shi, J. (2018). Efficient gradient boosted decision tree training on gpus. In 2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 234–243.

Zhang, H., Si, S., and Hsieh, C.-J. (2017). Gpu-acceleration for large-scale tree boosting. CoRR, abs/1706.08359.
Publicado
08/11/2019
PIRES, Julio; MARTINS, Wellington. Impulsionando Árvores Extremamente Aleatórias em Paralelo para a Classificação de Dados Textuais. In: SIMPÓSIO EM SISTEMAS COMPUTACIONAIS DE ALTO DESEMPENHO (SSCAD), 20. , 2019, Campo Grande. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019 . p. 312-323. DOI: https://doi.org/10.5753/wscad.2019.8678.