Machine Learning-based Classification of Portuguese News Articles on Public Procurement Fraud

  • Paulo Marcos de Assis (UFSC)
  • Márcio Castro (UFSC)
  • Jônata Tyska Carvalho (UFSC)

Abstract


Combating fraud in public procurement is a critical task for oversight agencies, and the news media are a rich source for uncovering irregularities. However, the sheer volume of daily news makes it hard to identify relevant reports. We propose a text classification pipeline that flags news articles reporting procurement fraud. Our contributions are a labeled dataset of 9,412 news articles, the optimization and evaluation of multiple machine learning methods, and the identification of an effective combination of feature extractor and classifier. Using BERT embeddings and a Support Vector Machine, we achieved an F1-score of 100% on a test set of 8,553 news articles, of which only 0.15% carry positive labels. The results point to a promising news-based methodology for evidence-based fraud detection.
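To make the role of the F1-score under this class imbalance concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes that 0.15% of 8,553 articles corresponds to roughly 13 positive examples (the abstract gives only the percentage), and the helper `precision_recall_f1` and the "one miss, one false alarm" classifier are hypothetical, introduced purely for illustration.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive class; 0.0 where undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


TEST_SIZE = 8553
# Assumption: 0.15% positives rounds to about 13 articles.
POSITIVES = round(TEST_SIZE * 0.0015)
NEGATIVES = TEST_SIZE - POSITIVES

# Trivial baseline: label every article as "not fraud".
baseline_accuracy = NEGATIVES / TEST_SIZE
baseline_f1 = precision_recall_f1(tp=0, fp=0, fn=POSITIVES)[2]

# Hypothetical classifier: misses one fraud article and raises one false alarm.
one_miss = precision_recall_f1(tp=POSITIVES - 1, fp=1, fn=1)

# Perfect classifier: every positive found, no false alarms.
perfect = precision_recall_f1(tp=POSITIVES, fp=0, fn=0)

print(f"baseline accuracy: {baseline_accuracy:.4f}, baseline F1: {baseline_f1:.2f}")
print(f"one miss + one false alarm F1: {one_miss[2]:.3f}")
print(f"perfect F1: {perfect[2]:.2f}")
```

Under these assumptions, a classifier that never flags anything is still right on almost every article yet scores an F1 of zero, which is why F1 on the positive class is a more informative metric here than accuracy. It also shows that with so few positives, a single error moves the score noticeably.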

Published
12/11/2025
ASSIS, Paulo Marcos de; CASTRO, Márcio; CARVALHO, Jônata Tyska. Machine Learning-based Classification of Portuguese News Articles on Public Procurement Fraud. In: ESCOLA REGIONAL DE APRENDIZADO DE MÁQUINA E INTELIGÊNCIA ARTIFICIAL DA REGIÃO SUL (ERAMIA-RS), 1., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 88-91. DOI: https://doi.org/10.5753/eramiars.2025.16668.