A Multi-Corpus Benchmark of Classical Fake News Classifiers with Contextual Portuguese Embeddings
Abstract
This paper presents a systematic benchmark for fake news detection in Brazilian Portuguese, combining multiple datasets, four contextual embedding models, and eight classical supervised classifiers under a unified evaluation protocol. The results show that, when a strong classifier such as SVC-RBF is adopted, performance varies more across embedding models than across classifiers. Among the evaluated encoders, albertina_ptbr_900m achieved the best overall generalization on the test set, while bertimbau_large showed the strongest validation performance. Overall, the study highlights representation quality as a central factor for robust classical fake news classification in Brazilian Portuguese.
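The comparison between variation across embedding models and variation across classifiers can be made concrete with a small sketch. It assumes the benchmark results are stored as a mapping from (encoder, classifier) pairs to a single score; the metric itself (for example macro-F1) is an assumption, since the abstract does not name it. The function names and the toy entries `enc_a`, `enc_b`, `clf_x`, `clf_y` are hypothetical placeholders, and the numbers are made up for illustration; they are not results from this study.

```python
def spread_over_encoders(scores, classifier):
    """Range (max - min) of the score when the classifier is held fixed
    and only the embedding model changes."""
    vals = [s for (enc, clf), s in scores.items() if clf == classifier]
    return max(vals) - min(vals)


def spread_over_classifiers(scores, encoder):
    """Range (max - min) of the score when the embedding model is held fixed
    and only the classifier changes."""
    vals = [s for (enc, clf), s in scores.items() if enc == encoder]
    return max(vals) - min(vals)


# Toy placeholder scores (NOT results from the paper), keyed by
# (encoder, classifier). Names and values are purely illustrative.
toy_scores = {
    ("enc_a", "clf_x"): 0.80,
    ("enc_a", "clf_y"): 0.78,
    ("enc_b", "clf_x"): 0.90,
    ("enc_b", "clf_y"): 0.89,
}

encoder_spread = spread_over_encoders(toy_scores, "clf_x")
classifier_spread = spread_over_classifiers(toy_scores, "enc_a")

# In this toy table the encoder choice moves the score more than the
# classifier choice does, which is the pattern the abstract describes.
print(encoder_spread > classifier_spread)
```

With a real results table, the same two functions could be called once per classifier and once per encoder to see whether the pattern holds beyond a single fixed choice.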
