Experimental Evaluation of Machine Learning Algorithms for Classifying Health-Related News with Indications of Irregularity
Abstract
Research Context: The growing production of unstructured data, particularly text, has driven the application of Natural Language Processing (NLP) techniques in public administration. Analyzing multiple information sources makes it possible to identify patterns and build predictive models that optimize strategies, improve service delivery, and support population monitoring and safety. Scientific and/or Practical Problem: The auditing process is costly, slow, and heavily dependent on human and material resources, which calls for solutions capable of automating the analysis of corruption reports. Proposed Solution and/or Analysis: Focusing on the preliminary identification of potential irregularities, this study proposes using machine learning models to support the auditing process by identifying health-related news that may indicate irregularities. Related IS Theory: This study is grounded in Cognitive Load Theory, as it examines methods to reduce information overload. Research Method: A controlled in vitro experiment was conducted to evaluate 54 machine learning models and compare them on Accuracy, Precision, Recall, and F1-score. In addition, an asymptotic complexity analysis of the algorithms was performed. Summary of Results: The Random Forest model stood out in effectiveness, achieving an accuracy of 99.90%, a recall of 98.62%, and an F1-score of 99.28%, while Naive Bayes and Logistic Regression excelled in efficiency, with linear complexity O(nd) for both training and prediction and low memory usage. Contributions and Impact to IS area: The results demonstrate the feasibility of using machine learning models to identify health-related news with potential irregularities. This approach strengthens the stage in which AudSUS auditors gather information and evidence of corruption to detect potential irregularities, thereby contributing to more efficient management of public resources.
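For readers less familiar with the four metrics compared in the experiment, the sketch below shows how Accuracy, Precision, Recall, and F1-score follow from the counts of a binary confusion matrix. It is an illustration under stated assumptions, not the evaluation code used in the study: the function name `classification_metrics` and the example counts are hypothetical and are not taken from the paper's data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute Accuracy, Precision, Recall and F1 from confusion-matrix counts.

    tp/fp/fn/tn are the numbers of true positives, false positives,
    false negatives and true negatives for the "irregularity" class.
    Any ratio with a zero denominator is reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for illustration only (not results from the study).
example = classification_metrics(tp=90, fp=5, fn=10, tn=895)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

With imbalanced classes, Accuracy can stay high even when some positive cases are missed, which is one reason to report Recall and F1-score alongside it.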
