Impact of Language on Text Classification Algorithms: A Study Between Portuguese and English

  • Jorge N. S. Pavão CEFET/RJ
  • Kele Belloze CEFET/RJ
  • Gustavo Guedes CEFET/RJ

Abstract


This study examines the influence of language on the performance of machine learning algorithms in text classification tasks. Two parallel Portuguese-English corpora were used, encompassing sentiment analysis and thematic categorization of scientific abstracts. Several supervised algorithms were evaluated under different preprocessing configurations. The results show no significant variations attributable exclusively to language, indicating the robustness of these techniques across linguistic boundaries. Additionally, automatic translation was found not to impair model performance, supporting its use in multilingual scenarios.

Published
2025-09-29
PAVÃO, Jorge N. S.; BELLOZE, Kele; GUEDES, Gustavo. Impact of Language on Text Classification Algorithms: A Study Between Portuguese and English. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 323-333. DOI: https://doi.org/10.5753/stil.2025.37835.