Adaptive Clinical Vocabulary Expansion with Biomedical LLMs in Electronic Health Records

  • Nadine Anderle PUCRS
  • Dalvan Griebler PUCRS

Abstract


Biomedical ontologies often fail to capture the lexical variability present in real clinical narratives, which limits Natural Language Processing (NLP) applications over electronic health records. This work proposes a weakly supervised pipeline for adaptive clinical vocabulary expansion using ICD codes from MIMIC-IV v3.1. From 842 normalized root codes, the BioMistral-7B model generated 18,017 candidate terms. After semantic validation using SapBERT embeddings (θ = 0.60), 4,094 terms (22.7%) were retained, representing the highest lexical acceptance rate among the evaluated configurations. In a downstream task of disease mention detection in clinical text (627 diseases), the θ = 0.50 configuration achieved the best empirical performance, increasing macro recall from 4.8% to 21.4% and macro F1 from 2.0% to 5.2%. The results indicate that combining LLM-based lexical generation with embedding-based semantic validation enables scalable expansion of clinical vocabularies, improving diagnostic coverage in clinical text mining tasks.
Keywords: Natural Language Processing, Language Models, Clinical Vocabulary, Weak Supervision, MIMIC-IV
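
The validation step described in the abstract can be illustrated with a minimal sketch. It assumes that each candidate term is compared to its root code description by cosine similarity and kept when the score is at least θ; the abstract does not state the similarity measure or whether the comparison is inclusive, so both are assumptions. The function names (`cosine`, `filter_candidates`) and the toy 3-dimensional vectors are hypothetical stand-ins for real SapBERT embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_candidates(root_vec, candidates, theta):
    """Keep candidate terms whose similarity to the root vector is >= theta.

    Assumption: inclusive threshold on cosine similarity; the source only
    reports the threshold values (e.g. 0.50 and 0.60).
    """
    return [term for term, vec in candidates.items() if cosine(root_vec, vec) >= theta]


# Hypothetical toy embeddings; a real pipeline would obtain these from an encoder.
root = [1.0, 0.0, 0.0]
candidates = {
    "heart attack": [0.9, 0.1, 0.0],
    "myocardial infarct": [0.8, 0.6, 0.0],
    "chest x-ray": [0.1, 0.9, 0.4],
}

kept = filter_candidates(root, candidates, theta=0.60)
acceptance_rate = len(kept) / len(candidates)

# Acceptance rate implied by the counts reported in the abstract.
reported_rate = round(100 * 4094 / 18017, 1)
```

In this sketch, raising θ makes the filter stricter and lowers the acceptance rate, which is the trade-off the evaluated configurations explore.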

Published
2026-06-01
ANDERLE, Nadine; GRIEBLER, Dalvan. Adaptive Clinical Vocabulary Expansion with Biomedical LLMs in Electronic Health Records. In: BRAZILIAN SYMPOSIUM ON COMPUTING APPLIED TO HEALTH (SBCAS), 26., 2026, Ouro Preto/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 978-989. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2026.21596.
