Impact of preprocessing on Portuguese-language e-commerce sentiment datasets

Diego D. Bottero; Giancarlo Lucca; Joelson S. Junior; João Pedro S. Moreira; Eduardo N. Borges; Rafael A. Berri; Bruno L. Dalmazo

doi:10.5753/weit.2025.40439

Diego D. Bottero UCPel
Giancarlo Lucca UCPel https://orcid.org/0000-0002-3776-0260
Joelson S. Junior FURG https://orcid.org/0000-0001-5379-8253
João Pedro S. Moreira UCPel https://orcid.org/0009-0005-0408-0305
Eduardo N. Borges FURG
Rafael A. Berri FURG
Bruno L. Dalmazo FURG

Abstract

Although a significant number of datasets are available, cleaning and standardization usually reduce the number of instances. This work analyzes the impact of a preprocessing pipeline, comprising text normalization, deduplication, and downsampling, on three public datasets. Deduplication removed 6.8% of the instances as redundant, and class balancing by downsampling reduced the volume by 73%. The experiments show that the cleaned datasets provide more reliable experimental conditions. The results demonstrate the effectiveness of systematic preprocessing and reinforce the need to continuously expand and update open datasets in order to advance research on sentiment in Portuguese-language e-commerce.

Keywords: natural language processing, sentiment analysis, e-commerce, text preprocessing, Portuguese datasets
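
The three stages named in the abstract (text normalization, deduplication, and downsampling) can be illustrated with a minimal Python sketch. The abstract does not specify how each stage was implemented, so everything here is an assumption made for illustration: the normalization rules (Unicode NFKC, lowercasing, whitespace collapsing), the deduplication key (normalized text plus label), the choice to downsample every class to the size of the smallest one, and the function names and toy reviews themselves.

```python
import random
import re
import unicodedata
from collections import defaultdict


def normalize(text):
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace.

    These rules are illustrative assumptions, not the paper's exact steps.
    """
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(rows):
    """Keep the first occurrence of each (normalized text, label) pair."""
    seen = set()
    unique = []
    for text, label in rows:
        key = (normalize(text), label)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


def downsample(rows, seed=0):
    """Randomly reduce every class to the size of the smallest class."""
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    target = min(len(group) for group in by_label.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], target))
    return balanced


# Hypothetical toy reviews; the first two differ only in case and spacing.
reviews = [
    ("Produto ótimo!", "pos"),
    ("produto  ótimo!", "pos"),
    ("Chegou rápido", "pos"),
    ("Recomendo", "pos"),
    ("Veio quebrado", "neg"),
]
clean = deduplicate(reviews)
balanced = downsample(clean)
```

In this toy run, deduplication drops one of the five reviews and downsampling then keeps one review per class. Balancing to the minority class can discard a large share of the data, which is consistent with the abstract's point that cleaning reduces the number of usable instances.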
