Evaluating the Performance of Generative Data Models for Fake News Classification

William Teles de Andrade Júnior; João Gabriel Rocha Silva; Rodrigo Cesar Lira; Antônio Correia de Sá Barreto Neto

doi:10.5753/encompif.2024.1958

William Teles de Andrade Júnior IFPE
João Gabriel Rocha Silva IFB
Rodrigo Cesar Lira IFPE
Antônio Correia de Sá Barreto Neto IFPE

Abstract

This paper investigates the potential of generative models of synthetic data for fake news detection. The study compares the results obtained with a real dataset, built from information extracted from news articles on the internet, against those obtained with four synthetic datasets generated using GAN, VAE, DDPM, and SMOTE. The results indicate that classification performance improved when the synthetic data was used, reaching an accuracy of approximately 87%. These results suggest that synthetic data can serve as a valuable tool for improving the performance of fake news classification.
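Of the four generators named in the abstract, SMOTE is the simplest to illustrate: each synthetic row is an interpolation between a real row and one of its nearest neighbours. The sketch below shows only that core idea in NumPy. It is not the authors' implementation; the function name `smote_sketch`, the toy feature matrix, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def smote_sketch(X, n_new, k=3, seed=0):
    """Generate n_new synthetic rows by interpolating between a sampled row
    and one of its k nearest neighbours (the core idea behind SMOTE)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Pairwise Euclidean distances; a row is never its own neighbour.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a starting row and one of its neighbours for each synthetic sample.
    base = rng.integers(0, len(X), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolation factor in [0, 1): the new point lies on the segment
    # between the starting row and the chosen neighbour.
    gap = rng.random((n_new, 1))
    return X[base] + gap * (X[picked] - X[base])

# Hypothetical toy matrix standing in for the numeric features of one class.
X_real = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
X_synth = smote_sketch(X_real, n_new=10)
```

Because every synthetic row is a convex combination of two real rows, the generated values stay inside the range of the real features. In a comparison like the one described above, a classifier would be trained on such rows and evaluated against one trained on the real data.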
