GolpeBR: Construction and Validation of an Annotated Dataset on Banking Scams and Fraud
Abstract
This article details the construction and validation of GolpeBR, a dataset built from news articles and Reddit posts. Automated Python scripts extracted and processed the data, which was then annotated with the DeepSeek-R1 LLM following the 5W1H methodology. A cybersecurity expert classified and validated the records, distinguishing banking cybercrimes from unrelated crimes. To validate the dataset, supervised learning algorithms were applied. Models trained on data structured with the 5W1H methodology achieved better accuracy, reaching 0.83 for both Logistic Regression and Random Forest.
References
Al-Khater, W. A., Al-Maadeed S., Ahmed, A. A., Sadiq, A. S. and Khan, M. K. (2020) "Comprehensive Review of Cybercrime Detection Techniques". In IEEE Access, v. 8, p. 137293-137311, DOI: 10.1109/ACCESS.2020.3011259.
Barros, M., Silva, C., and Miranda, P. (2020) “Xphide: Um Sistema Especialista para a Detecção de Phishing”. In Anais do Simpósio Brasileiro de Segurança da Informação e de Sistemas Computacionais (SBSeg).
Balasankula, U. R., Poojitha, B., Chekurtha, S., Buyya, K., Bala, H. and Rao, P. V. (2024) "Banking Fraud Detection Using Machine Learning Algorithms," In 5th International Conference on Electronics and Sustainable Communication Systems (ICESC), Coimbatore, India, 2024, pp. 1228-1233, DOI: 10.1109/ICESC60852.2024.10689731.
Carnaz, G., Antunes, M., and Nogueira, V. B. (2021). "An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing". In Data, DOI: 10.3390/data6070071.
Chen, Y. and Joo, J. (2021). “Understanding and mitigating annotation bias in facial expression recognition”. DOI: 10.48550/arXiv.2108.08504
Deora, R. S. and Chudasama, D. (2021) “Brief Study of Cybercrime on an Internet”. In Journal of Communication Engineering & Systems. p. 1-6.
Demartini, G., Roitero, K. and Mizzaro, S. (2021). “Managing bias in human-annotated data: Moving beyond bias removal”. DOI: 10.48550/arXiv.2110.13504
Dilek, S., Çakır, H., and Aydın, M. (2015) "Applications of Artificial Intelligence Techniques to Combating Cyber Crimes: A Review". In International Journal of Artificial Intelligence & Applications (IJAIA), DOI: 10.5121/ijaia.2015.6102.
Febraban – Federação Brasileira de Bancos (2025) “Radar Febraban: Março 2025”. Available at: [link], June.
Gumma, Y. R. and Peram, S. (2024) "Review of Cybercrime Detection Approaches using Machine Learning and Deep Learning Techniques". In International Conference on Applied Artificial Intelligence and Computing (ICAAIC), Salem, India, pp. 772-779, DOI: 10.1109/ICAAIC60222.2024.10575058.
Gyamfi, N. K.; Abdulai, J.D. (2018) "Bank Fraud Detection Using Support Vector Machine," In IEEE 9th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON), Vancouver, BC, Canada, pp. 37-41, DOI: 10.1109/IEMCON.2018.8614994.
KPMG. (2019) “Pesquisa Global sobre Fraude Bancária. A ameaça multifacetada da fraude: Os bancos estão prontos para enfrentar este desafio?”. Available at: [link], June.
Mallmann, J., Xavier, A. dos S., and Santin, A. O. (2018) “Detecção de Cibercrime em Redes Sociais: Machine Learning”. In The Tenth International Conference On Forensic Computer Science And Cyber Law. São Paulo, 2018. p. 44-49, DOI: 10.5769/C2018005.
Manna, A., Al-Fayoumi, M. and Al-Fawa'reh, M. (2024) "Detecting Text-Based Cybercrimes Using BERT". In International Jordanian Cybersecurity Conference (IJCC), Amman, Jordan, pp. 111-117, DOI: 10.1109/IJCC64742.2024.10847273.
Minastireanu, E.‑A., & Mesniță, G. (2019). An analysis of the most used machine learning algorithms for online fraud detection. Informatica Economica, 23(1), 5–16. DOI: 10.12948/issn14531305/23.1.2019.01.
Monteith, S., Bauer, M., Alda, M., Geddes, J., Whybrow, P. C., and Glen, T. (2021). “Increasing Cybercrime Since the Pandemic: Concerns for Psychiatry”. In Current Psychiatry Reports 23, DOI: 10.1007/s11920-021-01228-w.
Nicholls, J., Kuppa, A. and Le-Khac, N. A. (2021) "Financial Cybercrime: A Comprehensive Survey of Deep Learning Approaches to Tackle the Evolving Financial Crime Landscape". In IEEE Access, v. 9, p. 163965-163986, DOI: 10.1109/ACCESS.2021.3134076
Plath, H. O., Paiva, M. E. O., Pinto, D. L. and Costa, P. D. P. (2022). “Detecção de Discurso de Ódio Contra Mulheres em Textos em Português Brasileiro: Construção da Base MINA-BR e Modelo de Classificação”. In Revista Eletrônica De Iniciação Científica Em Computação, 20(3).
Sabillon, R., Cavaller, V., Cano, J. and Serra-Ruiz, J. (2016) "Cybercriminals, cyberattacks and cybercrime". In IEEE International Conference on Cybercrime and Computer Forensic (ICCCF), Vancouver, Canada, p. 1-9, DOI: 10.1109/ICCCF.2016.7740434.
Sarma, D., Alam, W., Saha, I., Alam, M. N., Alam, M. J., and Hossain, S. (2020) "Bank Fraud Detection using Community Detection Algorithm," In Second International Conference on Inventive Research in Computing Applications (ICIRCA), Coimbatore, India, p. 642-646, DOI: 10.1109/ICIRCA48905.2020.9182954.
Silva, R. L. and Vieira, A. (2021) “Segurança cibernética: o cenário dos crimes virtuais no Brasil”. In Revista Científica Multidisciplinar Núcleo do Conhecimento Ano 06, ed. 04, v. 07, p. 134-149, DOI: 10.32749/nucleodoconhecimento.com.br/ciencia-da-computacao/crimes-virtuais.
Ullah, F., Faheem, A., Azam, U., Ayub, M. S., Kamiran, F. and Karim, A. (2024). "Detecting Cybercrimes in Accordance with Pakistani Law: Dataset and Evaluation Using PLMs". In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, pages 4717–4728, Torino, Italia. ELRA and ICCL
Yang, Q., Zhang, C., Azenkot, S., Bigham, J. P., Dontcheva, M., Fourney, A., Ju, W., Lee, J., Liao, Q., Lim, B. Y., Nebeling, M., Teevan, J., Wigdor, D., Zhu, J., …. Pan, Z. (2024) "The Future of Human-AI Interaction: A Research Agenda." arXiv preprint, arXiv:2412.19437. DOI: 10.48550/arXiv.2412.19437.
Published
2025-09-29
How to Cite
SOUZA, Tamyres Vial de; TIRLONI, Jhonata; BELO, Felipe; ARAUJO, Nelcileno Virgilio; VENTURA, Thiago M.; OLIVEIRA, Allan Gonçalves de. GolpeBR: Construction and Validation of an Annotated Dataset on Banking Scams and Fraud. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 429-440. DOI: https://doi.org/10.5753/stil.2025.37844.
