Adaptive Clinical Vocabulary Expansion with Biomedical LLMs in Electronic Health Records
Abstract
Biomedical ontologies often fail to capture the lexical variability found in real clinical text, which limits Natural Language Processing (NLP) applications on electronic health records. This work proposes a weak-supervision pipeline for adaptive clinical vocabulary expansion using ICD codes from MIMIC-IV v3.1. From 842 normalized root codes, the BioMistral-7B model generated 18,017 candidate terms. After semantic validation with SapBERT embeddings (θ = 0.60), 4,094 terms (22.7%) were accepted, the highest lexical approval rate among the evaluated configurations. On the downstream task of disease mention detection (627 diseases), the θ = 0.50 configuration performed best, raising macro recall from 4.8% to 21.4% and macro F1 from 2.0% to 5.2%. The results indicate that combining LLM-based lexical generation with embedding-based semantic validation makes it possible to expand clinical vocabularies at scale, broadening diagnostic coverage in clinical text mining tasks.
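The semantic validation step can be pictured with a minimal sketch. It assumes that validation means keeping a candidate term when the cosine similarity between its embedding and the embedding of an anchor description reaches the threshold θ. The abstract does not spell out the exact comparison, so this is an illustration and not the authors' implementation. The function names, the example terms, and the toy 3-dimensional vectors (standing in for SapBERT embeddings) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def filter_candidates(anchor_vec, candidates, theta):
    """Keep the candidate terms whose similarity to the anchor is >= theta."""
    return [
        term
        for term, vec in candidates.items()
        if cosine(anchor_vec, vec) >= theta
    ]


# Hypothetical toy vectors; real embeddings would come from an encoder.
anchor = [1.0, 0.0, 0.0]  # stands in for the code description embedding
candidates = {
    "heart attack": [0.9, 0.1, 0.0],  # similarity about 0.99
    "MI": [0.5, 0.8, 0.0],            # similarity about 0.53
    "broken arm": [0.0, 0.0, 1.0],    # similarity 0.0
}

accepted_strict = filter_candidates(anchor, candidates, theta=0.60)
accepted_loose = filter_candidates(anchor, candidates, theta=0.50)

# Acceptance rate reported in the abstract: 4,094 of 18,017 candidates.
acceptance_rate_pct = round(4094 / 18017 * 100, 1)
```

In this toy setup the lower threshold admits one extra term. That is the kind of trade-off a threshold sweep over θ explores: a looser filter keeps more surface forms, which can help recall at the cost of noisier vocabulary.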
Keywords:
Natural Language Processing, Language Models, Clinical Vocabulary, Weak Supervision, MIMIC-IV
Published
01/06/2026
How to Cite
ANDERLE, Nadine; GRIEBLER, Dalvan. Expansão Adaptativa de Vocabulário Clínico com LLMs Biomédicos em Registros Eletrônicos de Saúde. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 26., 2026, Ouro Preto/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 978-989. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2026.21596.
