FakeTrueBR: Um corpus brasileiro de notícias falsas

  • Juan Pablo Chavarro, UFSC
  • Jonata Tyska Carvalho, UFSC
  • Tarlis Tortelli Portela, UFSC
  • Jonathan Cardoso Silva, London School of Economics and Political Science, London

Abstract


The large volume of fake news circulating on social media endangers society's perception of reality. Machine learning techniques have proven useful in combating misinformation, but they require balanced, high-quality training datasets to produce good results. Because the main publicly available corpora for training fake news detection models are outdated or misaligned, this work proposes an approach to recover true news from fake news items and to improve the similarity and alignment between the two. The result is a dataset that supports verifying and classifying, through natural language processing, the information consumed daily on the web. The corpus was evaluated using classical text representations, namely bag-of-words (BoW) and BoW with TF-IDF weighting, combined with several traditional classification methods. The results show that the dataset is effective for news classification, reaching an F1-score of 0.945 with a multi-layer perceptron. This new corpus is therefore a valuable resource for fighting misinformation and for improving the quality of information available online.
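To make the two text representations named in the abstract concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' pipeline: the whitespace tokenizer, the smoothed IDF formula, the function names, and the two toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(docs):
    """Return one term-count mapping per document (BoW)."""
    return [Counter(tokenize(doc)) for doc in docs]


def tfidf(docs):
    """Weight raw term counts by a smoothed inverse document frequency.

    Assumed formula: idf(t) = ln((1 + n) / (1 + df(t))) + 1,
    where n is the number of documents and df(t) the number of
    documents containing term t. Other IDF variants exist.
    """
    counts = bag_of_words(docs)
    n_docs = len(docs)
    doc_freq = Counter()
    for doc_counts in counts:
        doc_freq.update(doc_counts.keys())  # each term counted once per doc
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }
    return [
        {term: tf * idf[term] for term, tf in doc_counts.items()}
        for doc_counts in counts
    ]


# Hypothetical toy documents, not taken from the corpus.
toy_docs = ["fake news spreads fast", "true news spreads"]
toy_counts = bag_of_words(toy_docs)
toy_weights = tfidf(toy_docs)
```

In this sketch, a term that appears in every document (such as "news") keeps a weight equal to its raw count, while a term unique to one document (such as "fake") is weighted up. Vectors of this kind would then be passed to a classifier such as the multi-layer perceptron mentioned in the abstract; that step is not reproduced here.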

Published
2023-04-11
CHAVARRO, Juan Pablo; CARVALHO, Jonata Tyska; PORTELA, Tarlis Tortelli; SILVA, Jonathan Cardoso. FakeTrueBR: Um corpus brasileiro de notícias falsas. In: REGIONAL DATABASE SCHOOL (ERBD), 18., 2023, Palmas/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 108-117. ISSN 2595-413X. DOI: https://doi.org/10.5753/erbd.2023.229495.