A Hybrid Approach for Named Entity Recognition in Clinical Case Reports
Abstract
Context: Natural Language Processing (NLP) has had a significant impact on healthcare by enabling knowledge extraction from unstructured text. For Brazilian Portuguese, however, there is still a gap in models capable of processing clinical case reports, which are rich sources of information for diagnosis, treatment, and decision-making.
Problem: The few models that exist (e.g., BioBERTpt) deliver limited performance on Portuguese clinical texts. Without accurate entity recognition, it is harder to transform raw data into the structured information that supports clinical decision-making.
Solution: This work presents ClinptHyb, a hybrid Named Entity Recognition (NER) model that combines a spaCy-based pipeline (ClinptCy, trained from scratch on SemClinBr) with a Large Language Model (Llama-3.3-70b) acting as a reviewer. The hybrid design improves entity identification and helps detect inconsistencies or omissions in the corpus.
IS Theory: Grounded in the Organizational Information Processing Theory (OIPT), the study emphasizes how structuring clinical narratives reduces information uncertainty and increases an organization's capacity to process data, thereby improving decision-making in healthcare contexts.
Method: Following Design Science Research, the study comprised corpus preprocessing, stratified dataset splitting, spaCy model training, and prompt-based LLM revision. Evaluation used Precision, Recall, and F1-score against the SemClinBr gold standard.
Summary of Results: ClinptHyb consistently improved over the baseline ClinptCy and outperformed BioBERTpt in 8 of the 10 evaluated categories. The Negation class reached an F1-score of 0.97, a critical result for safe clinical interpretation. The model also revealed annotation inconsistencies, highlighting its potential as a support tool for dataset refinement.
Contributions and Impact on the IS Area: This research presents the first hybrid NER approach that combines spaCy with an LLM for Portuguese clinical texts, in line with the IS focus on data, information, and knowledge management. It contributes to decision support systems, clinical annotation workflows, and the advancement of NLP technologies adapted to the Brazilian healthcare context.
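As an illustration of the evaluation named under Method, the sketch below computes Precision, Recall, and F1-score from gold and predicted entity spans. It is a minimal example, not the authors' evaluation script: the exact-match criterion (same offsets and same label), the function name `prf1`, and the sample spans and labels other than Negation are assumptions made here for illustration.

```python
from typing import Iterable, Set, Tuple

# An entity is represented as (start offset, end offset, label).
Span = Tuple[int, int, str]


def prf1(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under exact span-and-label matching.

    Assumption: a prediction counts as correct only if its offsets and
    label are identical to a gold annotation.
    """
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(predicted)
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical spans, for illustration only (not taken from SemClinBr).
gold = [(0, 5, "Disorder"), (10, 18, "Negation"), (20, 29, "Procedure")]
pred = [(0, 5, "Disorder"), (10, 18, "Negation"), (30, 35, "Disorder")]

p, r, f = prf1(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Filtering both lists by label before calling `prf1` would yield per-category scores, such as the Negation F1-score reported above.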
