Portuguese Fake News Classification with BERT models

  • Vinícius Baião Pires UnB
  • Daniel Guerreiro e Silva UnB

Abstract


Fake news is a phenomenon of great concern because of its potential negative effects on society. Automatic fake news classification has been addressed by the machine learning community for some time, but few works on the subject consider Portuguese as the primary language. At the same time, Transformer models are now widely applied to textual data; among them are BERT (Bidirectional Encoder Representations from Transformers) and BERTimbau, its variant pre-trained specifically for Portuguese. This work therefore trains, on different datasets, these two models and mBERT, the multilingual variant of BERT, for the task of classifying fake news, in order to assess the potential gains of using a language model pre-trained specifically for Portuguese. The overall results indicate that BERTimbau outperforms BERT and mBERT on this task, with average F1-score improvements of 2.37% and 1.07%, respectively.
Keywords: fake news, BERT, classification
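
The comparison reported in the abstract rests on the F1-score, the harmonic mean of precision and recall. The following is a minimal sketch of how such a score, and an average gain across datasets, could be computed. It is not the authors' code: the function names, the choice of the "fake" class as the positive class, the toy labels, and the reading of "average improvement" as a mean per-dataset difference in percentage points are all assumptions made for illustration.

```python
from statistics import mean


def f1_score(y_true, y_pred, positive=1):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # Without true positives, F1 is conventionally reported as zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1_gain(f1_model, f1_baseline):
    """Mean per-dataset F1 difference, in percentage points (assumed definition)."""
    return 100 * mean(a - b for a, b in zip(f1_model, f1_baseline))


# Toy labels invented for illustration only (1 = fake, 0 = true).
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
toy_f1 = f1_score(toy_true, toy_pred)
```

In this sketch, `mean_f1_gain` would take one F1 value per dataset for each model, for example a list for BERTimbau and a list for mBERT in the same dataset order. Whether the paper's percentages are absolute or relative differences is not stated in the abstract.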

Published
2024-11-17
PIRES, Vinícius Baião; GUERREIRO E SILVA, Daniel. Portuguese Fake News Classification with BERT models. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 21., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 834-845. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2024.245138.