Anonymization of Clinical Texts Using LLMs

  • Arthur M. Pereira UFJF
  • Leonardo F. Martins PUC-Rio
  • Laisa M. A. Sartes UFJF
  • Larissa F. de Almeida UFJF
  • Heder S. Bernardino UFJF
  • Jairo F. de Souza UFJF

Abstract

The use of data to train models is essential for advances in healthcare, enabling more personalized treatment. Anonymizing therapeutic texts protects patient privacy amid growing digitalization. Traditional methods, although effective, can reduce the utility of the data and fail at contextual anonymization. This study proposes a method based on large language models (LLMs) that combines named entity recognition (NER) with text rewriting to ensure coherence and contextual anonymization. Tested on therapy session transcripts, the method showed high precision in removing sensitive information without compromising textual integrity, which makes it applicable to different contexts.
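The general idea of pairing entity detection with consistent replacement can be illustrated with a minimal sketch. This is not the authors' implementation: the lexicon lookup below is a hypothetical stand-in for an NER model or LLM, the names `SAMPLE_ENTITIES`, `detect_entities`, and `anonymize` are invented for illustration, and the LLM-based rewriting step is omitted entirely.

```python
import re

# Hypothetical stand-in for an NER model or LLM: known surface forms and their types.
SAMPLE_ENTITIES = {
    "Maria Silva": "PERSON",
    "Juiz de Fora": "LOCATION",
}


def detect_entities(text, lexicon):
    """Return sorted (start, end, label) spans, preferring longer surface forms."""
    spans = []
    for surface in sorted(lexicon, key=len, reverse=True):
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, text):
            # Skip matches that overlap a span already accepted.
            overlaps = any(match.start() < e and s < match.end() for s, e, _ in spans)
            if not overlaps:
                spans.append((match.start(), match.end(), lexicon[surface]))
    return sorted(spans)


def anonymize(text, lexicon):
    """Replace each detected entity with a typed placeholder.

    The same surface form always maps to the same placeholder, so
    references to one person or place stay linked after anonymization.
    """
    placeholders = {}  # (label, surface) -> placeholder string
    counters = {}      # label -> number of distinct entities seen
    pieces = []
    cursor = 0
    for start, end, label in detect_entities(text, lexicon):
        surface = text[start:end]
        key = (label, surface)
        if key not in placeholders:
            counters[label] = counters.get(label, 0) + 1
            placeholders[key] = f"[{label}_{counters[label]}]"
        pieces.append(text[cursor:start])
        pieces.append(placeholders[key])
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


if __name__ == "__main__":
    sample = "Maria Silva lives in Juiz de Fora. Maria Silva reported drinking less."
    print(anonymize(sample, SAMPLE_ENTITIES))
```

Typed, numbered placeholders keep the text readable and preserve who-is-who relations, which is one simple way to limit the loss of utility that the abstract attributes to traditional methods. Contextual identifiers that are not named entities would still need the rewriting step described above.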

Published
9 June 2025
PEREIRA, Arthur M.; MARTINS, Leonardo F.; SARTES, Laisa M. A.; ALMEIDA, Larissa F. de; BERNARDINO, Heder S.; SOUZA, Jairo F. de. Anonimização de Textos Clínicos Utilizando LLM. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 25., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 365-376. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2025.7150.
