Evaluating the Performance of Generative Data Models for Fake News Classification

  • William Teles de Andrade Júnior IFPE
  • João Gabriel Rocha Silva IFB
  • Rodrigo Cesar Lira IFPE
  • Antônio Correia de Sá Barreto Neto IFPE

Abstract


This paper aimed to investigate the potential of models to generate synthetic data to improve fake news detection. The research compares the results obtained from a real dataset, containing news information, with those obtained from four synthetic datasets generated using GAN, VAE, DDPM and SMOTE. The study results indicate that classification performance improved when using artificial data, with an accuracy score of approximately 87%. These results suggest that synthetic data can be a valuable tool for improving fake news classification performance.

References

Almeida, A. L., Carrara, G., Prates, I., Nascimento, L. C., Souza, P. H., Almeida, T., Cani, R., and Silva, J. G. (2021). Modelo matemático apoiado por um algoritmo genético para classificação de fake news na web. In Anais do VIII Encontro Nacional de Computação dos Institutos Federais, pages 17–20, Porto Alegre, RS, Brasil. SBC.

Assefa, S. A., Dervovic, D., Mahfouz, M., Tillman, R. E., Reddy, P., and Veloso, M. (2020). Generating synthetic data in finance: opportunities, challenges and pitfalls. In Proceedings of the First ACM International Conference on AI in Finance, pages 1–8.

Carrillo-Perez, F., Pizurica, M., Zheng, Y., Nandi, T. N., Madduri, R., Shen, J., and Gevaert, O. (2023). Rna-to-image multi-cancer synthesis using cascaded diffusion models. bioRxiv.

Ferreira, A. L. N., Nascimento, D. G., Basílio, S. C. A., and Silva, J. G. R. (2020). Um modelo matemático para classificação de fake news na web. In Anais do Simpósio Brasileiro de Pesquisa Operacional.

Frid-Adar, M., Diamant, I., Klang, E., Amitai, M., Goldberger, J., and Greenspan, H. (2018). Gan-based synthetic medical image augmentation for increased cnn performance in liver lesion classification. Neurocomputing, 321:321–331.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2020). Generative adversarial networks. Commun. ACM, 63(11):139–144.

Horne, B. and Adali, S. (2017). This just in: Fake news packs a lot in title, uses simpler, repetitive content in text body, more similar to satire than real news. Proceedings of the International AAAI Conference on Web and Social Media, 11(1):759–766.

Kingma, D. P. and Welling, M. (2022). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.

Kotelnikov, A., Baranchuk, D., Rubachev, I., and Babenko, A. (2023). Tabddpm: Modelling tabular data with diffusion models. In International Conference on Machine Learning, pages 17564–17579. PMLR.

Lu, Y., Wang, H., and Wei, W. (2023). Machine learning for synthetic data generation: a review. arXiv preprint arXiv:2302.04062.

Mukherjee, M. and Khushi, M. (2021). Smote-enc: A novel smote-based method to generate synthetic data for nominal and continuous features. Applied System Innovation, 4(1):18.

Nichol, A. Q. and Dhariwal, P. (2021). Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.

Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., and Chen, X. (2016). Improved techniques for training gans. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.

Seghouane, A.-K. and Amari, S.-I. (2007). The aic criterion and symmetrizing the kullback–leibler divergence. IEEE Transactions on Neural Networks, 18(1):97–106.

Shu, K., Sliva, A., Wang, S., Tang, J., and Liu, H. (2017). Fake news detection on social media: A data mining perspective. SIGKDD Explor. Newsl., 19(1):22–36.

Suroso, D., Cherntanomwong, P., and Sooraksa, P. (2023). Synthesis of a small fingerprint database through a deep generative model for indoor localisation. Elektronika Ir Elektrotechnika, 29:69–75.

Vosoughi, S., Roy, D., and Aral, S. (2018). The spread of true and false news online. Science, 359(6380):1146–1151.

Wang, W. Y. (2017). “liar, liar pants on fire”: A new benchmark dataset for fake news detection. In Barzilay, R. and Kan, M.-Y., editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 422–426, Vancouver, Canada. Association for Computational Linguistics.

Zhou, X. and Zafarani, R. (2020). A survey of fake news: Fundamental theories, detection methods, and opportunities. ACM Comput. Surv., 53(5).
Published
2024-07-21
ANDRADE JÚNIOR, William Teles de; SILVA, João Gabriel Rocha; LIRA, Rodrigo Cesar; BARRETO NETO, Antônio Correia de Sá. Evaluating the Performance of Generative Data Models for Fake News Classification. In: NATIONAL COMPUTING MEETING OF FEDERAL INSTITUTES (ENCOMPIF), 11. , 2024, Brasília/DF. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024 . p. 42-49. ISSN 2763-8766. DOI: https://doi.org/10.5753/encompif.2024.1958.