Comparative Analysis of Classical and Deep Algorithms for Text Clustering in Brazilian Portuguese
Abstract
This work compares different combinations of text embedding models and clustering algorithms applied to Portuguese-language texts. Three datasets were used (poems, Reddit posts, and product reviews), evaluating embedding models such as BERTimbau and ST5 combined with classical clustering algorithms and the deep method DEC. Using accuracy, V-Measure, and ARI as metrics, the results show that BERTimbau performs better on formal texts, while ST5 excels on informal content. DEC outperformed the other methods only on the largest dataset (product reviews), highlighting the potential of deep clustering approaches for Portuguese text analysis.
References
Borges, B. R. (2025). Comparison of clustering techniques in text documents in Portuguese (Comparação de técnicas de clusterização em documentos de texto em português). iSys: Revista Brasileira de Sistemas de Informação (Brazilian Journal of Information Systems), 18(1):4:1–4:17.
Delibasis, K. K. (2019). A new topology-preserving distance metric with applications to multi-dimensional data clustering. In MacIntyre, J., Maglogiannis, I., Iliadis, L., and Pimenidis, E., editors, Artificial Intelligence Applications and Innovations, pages 155–166, Cham. Springer International Publishing.
Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv preprint arXiv:2203.05794.
Guan, R., Zhang, H., Liang, Y., Giunchiglia, F., Huang, L., and Feng, X. (2020). Deep feature-based text clustering and its explanation. IEEE Transactions on Knowledge and Data Engineering, PP:1–1.
Guo, X., Gao, L., Liu, X., and Yin, J. (2017). Improved deep embedded clustering with local structure preservation. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, IJCAI’17, pages 1753–1759. AAAI Press.
Keraghel, I., Morbieu, S., and Nadif, M. (2024). Beyond words: a comparative analysis of LLM embeddings for effective clustering. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP).
Murtagh, F. and Legendre, P. (2014). Ward’s hierarchical agglomerative clustering method: Which algorithms implement ward’s criterion? Journal of Classification, 31(3):274–295.
Ni, J., Ábrego, G. H., Constant, N., Ma, J., Hall, K. B., Cer, D., and Yang, Y. (2021). Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. Google Research, Mountain View, CA.
Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: pretrained BERT models for Brazilian Portuguese. In 9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, October 20-23.
Subakti, A., Murfi, H., and Hariadi, N. (2022). The performance of BERT as data representation of text clustering. Journal of Big Data, 9(15).
Tan, P.-N., Steinbach, M., and Kumar, V. (2014). Introduction to Data Mining. Pearson, New York.
Wehrli, S., Arnrich, B., and Irrgang, C. (2024). German text embedding clustering benchmark. arXiv preprint arXiv:2401.02709.
Wu, S. and Dredze, M. (2020). Are all languages created equal in multilingual BERT? In Gella, S., Welbl, J., Rei, M., Petroni, F., Lewis, P., Strubell, E., Seo, M., and Hajishirzi, H., editors, Proceedings of the 5th Workshop on Representation Learning for NLP, pages 120–130, Online. Association for Computational Linguistics.
Xie, J., Girshick, R., and Farhadi, A. (2016). Unsupervised deep embedding for clustering analysis. arXiv preprint arXiv:1511.06335.
Published
2025-09-29
How to Cite
MOURÃO, Paulo V.; PESSOA, Marcela P.; EKWOGE, Oswald M.; ANJOS, Marcelo E. Comparative Analysis of Classical and Deep Algorithms for Text Clustering in Brazilian Portuguese. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1728-1738. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.13904.
