Portuguese Fake News Classification with BERT models

Vinícius Baião Pires; Daniel Guerreiro e Silva

doi:10.5753/eniac.2024.245138

Vinícius Baião Pires UnB
Daniel Guerreiro e Silva UnB

Abstract

Fake news is a well-known cause for concern because of the harm it can do to society. Automatic fake news classification has been studied in machine learning for some time, yet comparatively little of this work addresses the Portuguese language. In parallel, the use of Transformer models on textual data has grown since the 2020s; in this context there is the BERT model (Bidirectional Encoder Representations from Transformers) and BERTimbau, a variant pre-trained specifically for Portuguese-language tasks. This work therefore trains these two models, together with mBERT, the multilingual variant of BERT, on fake news classification using several datasets proposed in the literature, in order to assess the potential gains of a language model trained specifically for Portuguese. The results indicate that BERTimbau outperforms BERT and mBERT on this task, with average F1-score improvements of 2.37% and 1.07%, respectively.
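As a rough illustration of the metric reported above, the sketch below computes a binary F1-score and then averages per-dataset F1 differences between two models. It is not the authors' evaluation code: the function names, the dataset keys, and the numbers in the usage note are hypothetical, and treating the improvement as a difference in percentage points is an assumption, since the abstract does not say whether the gains are absolute or relative.

```python
from statistics import mean


def f1_score(y_true, y_pred, positive=1):
    """Binary F1-score: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_improvement(f1_candidate, f1_baseline):
    """Average per-dataset F1 difference, in percentage points (assumed)."""
    diffs = [100 * (f1_candidate[name] - f1_baseline[name])
             for name in f1_baseline]
    return mean(diffs)
```

For example, `f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])` has two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and so is the F1-score. With made-up scores, `mean_improvement({"a": 0.90, "b": 0.80}, {"a": 0.88, "b": 0.79})` averages gains of 2 and 1 points to about 1.5.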

Keywords: fake news, BERT, classification
