Empirical Evaluation of Preprocessing and Balancing Techniques Impact Across Algorithm-Vectorizer Combinations in Sentiment Classification

  • Nathanael Motta UPE
  • Ana Claudia Maria de Souza UPE
  • Carlo Marcelo Revoredo da Silva UPE
  • Cleyton Mario de Oliveira Rodrigues UPE

Abstract

This study systematically evaluates the individual impact of six commonly applied preprocessing techniques on sentiment classification performance, using 14,224 Steam reviews. Each technique was assessed across four algorithms (SVC, Random Forest, TextBlob, VADER) and three vectorization approaches (TF-IDF, Count Vectorizer, Hashing Vectorizer). Preprocessing effectiveness varied considerably: four consecutive stages (HTML cleanup, emoji removal, number elimination, punctuation normalization) produced identical results, indicating that they are redundant for this corpus, while stop word removal and stemming showed mixed effects depending on the algorithm-vectorizer combination. The machine learning classifiers (SVC, Random Forest) and the lexicon-based methods (TextBlob, VADER) showed different sensitivity patterns. SVC with TF-IDF performed consistently across preprocessing stages. These findings challenge the assumption that preprocessing is universally beneficial and emphasize the importance of empirical validation for specific domains and algorithm combinations.
Keywords: Natural Language Processing, Sentiment Analysis, Preprocessing Techniques, Algorithm-Vectorizer Interactions, Empirical Evaluation, Steam Reviews
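
The redundancy finding in the abstract can be checked cheaply before any model is trained. The following is a minimal sketch, not the authors' pipeline: it applies four cleanup stages cumulatively and counts how many documents each stage actually changes. The stage functions, their regular expressions, and the names `STAGES` and `stage_report` are hypothetical and chosen for illustration only. A stage that changes no documents cannot change downstream results. The converse does not hold, because a vectorizer's own tokenization may discard the same characters anyway.

```python
import html
import re

# Hypothetical stage implementations (assumptions, not the paper's exact rules).
def strip_html(text):
    # Replace tags with a space, then decode entities such as &amp;
    return html.unescape(re.sub(r"<[^>]+>", " ", text))

def remove_emoji(text):
    # Rough coverage of common emoji and symbol ranges only
    return re.sub(r"[\U0001F300-\U0001FAFF\u2600-\u27BF]", "", text)

def remove_numbers(text):
    return re.sub(r"\d+", " ", text)

def normalize_punctuation(text):
    # Collapse runs of the same mark, e.g. "!!!" becomes "!"
    return re.sub(r"([!?.,])\1+", r"\1", text)

STAGES = [
    ("html_cleanup", strip_html),
    ("emoji_removal", remove_emoji),
    ("number_elimination", remove_numbers),
    ("punctuation_normalization", normalize_punctuation),
]

def _squash(text):
    # Normalize whitespace so spacing differences alone do not count as a change
    return " ".join(text.split())

def stage_report(corpus):
    """Apply the stages cumulatively and count the documents each one alters."""
    current = [_squash(doc) for doc in corpus]
    report = []
    for name, stage in STAGES:
        updated = [_squash(stage(doc)) for doc in current]
        changed = sum(before != after for before, after in zip(current, updated))
        report.append((name, changed))
        current = updated
    return report, current

if __name__ == "__main__":
    reviews = ["Great game, would play again", "<b>Loved</b> it!!! 10/10"]
    for name, changed in stage_report(reviews)[0]:
        print(f"{name}: {changed} of {len(reviews)} documents changed")
```

Running `stage_report` on a sample of the corpus first shows which stages are worth including in the experimental grid. Stages with a zero count can be skipped for that corpus without affecting any algorithm-vectorizer combination.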

Published
29/09/2025
MOTTA, Nathanael; SOUZA, Ana Claudia Maria de; SILVA, Carlo Marcelo Revoredo da; RODRIGUES, Cleyton Mario de Oliveira. Empirical Evaluation of Preprocessing and Balancing Techniques Impact Across Algorithm-Vectorizer Combinations in Sentiment Classification. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 502-511. DOI: https://doi.org/10.5753/stil.2025.37850.