Anonymization of Clinical Texts Using LLMs
Abstract
Data-driven model training is essential for healthcare advancements, enabling more personalized medicine. Clinical text anonymization protects patient privacy amid increasing digitalization. Traditional methods, while effective, may reduce data utility and fail in contextual anonymization. This study proposes a method based on large language models (LLMs), combining named entity recognition (NER) and text rephrasing to ensure coherence and anonymization. Tested on therapeutic transcripts, the method achieved high accuracy in removing sensitive information while preserving textual integrity, making it applicable across contexts.
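The two-stage idea in the abstract (detect sensitive entities, then rephrase the masked text) can be illustrated with a minimal sketch. This is not the authors' implementation: the regular expressions in `PATTERNS` are a hypothetical stand-in for a real NER model, the placeholder labels are assumptions, and the `rephrase` argument is a hook where an LLM call would go (by default it returns the text unchanged).

```python
import re
from typing import Callable, List, Tuple

# Hypothetical patterns standing in for a trained NER model (assumption).
PATTERNS = {
    "NAME": re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\. [A-Z][a-z]+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}


def find_entities(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, label) spans for every pattern match, sorted by start."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


def mask_entities(text: str) -> str:
    """Replace each detected span with a bracketed placeholder such as [NAME]."""
    pieces = []
    cursor = 0
    for start, end, label in find_entities(text):
        if start < cursor:
            continue  # skip a span that overlaps one already masked
        pieces.append(text[cursor:start])
        pieces.append(f"[{label}]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def anonymize(text: str, rephrase: Callable[[str], str] = lambda s: s) -> str:
    """Mask entities first, then pass the masked text to a rephrasing step."""
    return rephrase(mask_entities(text))


sample = "Dr. Silva saw the patient on 12/03/2024; call 555-123-4567."
print(anonymize(sample))
```

Masking before rephrasing means the rephrasing step never receives the original identifiers, which is one plausible reason to order the stages this way.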
Published
2025-06-09
How to Cite
PEREIRA, Arthur M.; MARTINS, Leonardo F.; SARTES, Laisa M. A.; ALMEIDA, Larissa F. de; BERNARDINO, Heder S.; SOUZA, Jairo F. de. Anonymization of Clinical Texts Using LLMs. In: BRAZILIAN SYMPOSIUM ON COMPUTING APPLIED TO HEALTH (SBCAS), 25., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 365-376. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2025.7150.
