Skip to main content

De-Identification of Clinical Notes Using Contextualized Language Models and a Token Classifier

  • Conference paper
  • First Online:
Intelligent Systems (BRACIS 2021)

Abstract

The de-identification of clinical notes is crucial for the reuse of electronic clinical data and is a common Named Entity Recognition (NER) task. Neural language models provide a great improvement in Natural Language Processing (NLP) tasks, such as NER, when they are integrated with neural network methods. This paper evaluates the use of current state-of-the-art deep learning methods (Bi-LSTM-CRF) in the task of identifying patient names in clinical notes, for de-identification purposes. We used two corpora and three language models to evaluate which combination delivers the best performance. In our experiments, the specific corpus for the de-identification of clinical notes and a contextualized embedding with word embeddings achieved the best result: an F-measure of 0.94.

This work was partially supported by Institute of Artificial Intelligence in Healthcare, Memed, Google Latin America Research Awards, and by FCT under the project UIDB/00057/2020 (Portugal).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

  1. 1.

    https://pytorch.org.

  2. 2.

    https://github.com/noharm-ai/noharm-anony.

References

  1. Akbik, A., Bergmann, T., Blythe, D., Rasul, K., Schweter, S., Vollgraf, R.: FLAIR: an easy-to-use framework for state-of-the-art NLP. In: Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, pp. 54–59. Association for Computational Linguistics, Minneapolis, Minnesota, June 2019. https://doi.org/10.18653/v1/N19-4010, https://www.aclweb.org/anthology/N19-4010

  2. Akbik, A., Blythe, D., Vollgraf, R.: Contextual string embeddings for sequence labeling. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1638–1649 (2018)

    Google Scholar 

  3. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguist. 5, 135–146 (2017)

    Article  Google Scholar 

  4. Brown, T.B., et al.: Language models are few-shot learners. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H. (eds.) Proceedings of the 33th Annual Conference on Neural Information Processing Systems (2020)

    Google Scholar 

  5. El Emam, K.: Guide to the De-identification of Personal Health Information. CRC Press, Boca Raton (2013)

    Google Scholar 

  6. Hartmann, N., Fonseca, E., Shulby, C., Treviso, M., Silva, J., Aluísio, S.: Portuguese word embeddings: evaluating on word analogies and natural language tasks. In: Proceedings of the 11th Brazilian Symposium in Information and Human Language Technology, pp. 122–131 (2017)

    Google Scholar 

  7. Hash, J., Bowen, P., Johnson, A., Smith, C., Steinberg, D.: An introductory resource guide for implementing the health insurance portability and accountability act (HIPAA) security rule. US Department of Commerce, Technology Administration, National Institute of \(\ldots \) (2005)

    Google Scholar 

  8. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)

    Article  Google Scholar 

  9. Jiang, Y., Hu, C., Xiao, T., Zhang, C., Zhu, J.: Improved differentiable architecture search for language modeling and named entity recognition. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pp. 3585–3590. Association for Computational Linguistics, Hong Kong, China (2019)

    Google Scholar 

  10. Jurafsky, D., Martin, J.H.: Speech and Language Processing, vol. 3. Pearson, London, United Kingdom (2014)

    Google Scholar 

  11. Lee, K., Filannino, M., Uzuner, Ö.: An empirical test of GRUs and deep contextualized word representations on de-identification. In: MedInfo, pp. 218–222 (2019)

    Google Scholar 

  12. Leevy, J.L., Khoshgoftaar, T.M., Villanustre, F.: Survey on RNN and CRF models for de-identification of medical free text. J. Big Data 7(1), 1–22 (2020)

    Article  Google Scholar 

  13. Magboo, Ma. Sheila A.., Coronel, Andrei D..: Data mining electronic health records to support evidence-based clinical decisions. In: Chen, Yen-Wei., Zimmermann, Alfred, Howlett, Robert J.., Jain, Lakhmi C.. (eds.) Innovation in Medicine and Healthcare Systems, and Multimedia. SIST, vol. 145, pp. 223–232. Springer, Singapore (2019). https://doi.org/10.1007/978-981-13-8566-7_22

    Chapter  Google Scholar 

  14. Meystre, S.M., Friedlin, F.J., South, B.R., Shen, S., Samore, M.H.: Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med. Res. Methodol. 10(1), 1–16 (2010)

    Article  Google Scholar 

  15. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: Bengio, Y., LeCun, Y. (eds.) Proceedings of the 1st International Conference on Learning Representations (2013)

    Google Scholar 

  16. Nakayama, H., Kubo, T., Kamura, J., Taniguchi, Y., Liang, X.: doccano: text annotation tool for human (2018). software available from https://github.com/doccano/doccano

  17. Peters, M.E., et al.: Deep contextualized word representations. In: Proceedings of the Conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 2227–2237 (2018)

    Google Scholar 

  18. Raffel, C., et al.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 140:1–140:67 (2020)

    Google Scholar 

  19. Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. arXiv preprint cs/0306050 (2003)

    Google Scholar 

  20. Santos, D., Freitas, C., Oliveira, H.G., Carvalho, P.: Second harem: new challenges and old wisdom. In: International Conference on Computational Processing of the Portuguese Language. pp. 212–215. Springer (2008). https://doi.org/10.1007/978-3-540-85980-2_22

  21. Santos, D., Seco, N., Cardoso, N., Vilela, R.: Harem: An advanced NER evaluation contest for Portuguese. In: quot; In: Calzolari, N., et al. (ed.) Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa Italy 22–28 May 2006 (2006)

    Google Scholar 

  22. dos Santos, H.D.P., Silva, A.P., Maciel, M.C.O., Burin, H.M.V., Urbanetto, J.S., Vieira, R.: Fall detection in EHR using word embeddings and deep learning. In: 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering (BIBE), pp. 265–268, October 2019. https://doi.org/10.1109/BIBE.2019.00054

  23. dos Santos, H.D.P., Ulbrich, A.H.D., Woloszyn, V., Vieira, R.: DDC-outlier: preventing medication errors using unsupervised learning. IEEE J. Biomed. Health Inform. 23, 8 (2018)

    Google Scholar 

  24. dos Santos, H.D.P., Ulbrich, A.H.D., Woloszyn, V., Vieira, R.: An initial investigation of the Charlson comorbidity index regression based on clinical notes. In: 2018 IEEE 31st International Symposium on Computer-Based Medical Systems (CBMS), pp. 6–11. IEEE (2018)

    Google Scholar 

  25. Santos, J., Consoli, B.S., dos Santos, C.N., Terra, J., Collovini, S., Vieira, R.: Assessing the impact of contextual embeddings for Portuguese named entity recognition. In: Proceedings of the 8th Brazilian Conference on Intelligent Systems, pp. 437–442 (2019)

    Google Scholar 

  26. Santos, J., dos Santos, H.D., Vieira, R.: Fall detection in clinical notes using language models and token classifier. In: 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), pp. 283–288. IEEE (2020)

    Google Scholar 

  27. Straková, J., Straka, M., Hajic, J.: Neural architectures for nested NER through linearization. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5326–5331. Association for Computational Linguistics (2019)

    Google Scholar 

  28. Stubbs, A., Filannino, M., Uzuner, Ö.: De-identification of psychiatric intake records: overview of 2016 CEGS N-GRID shared tasks track 1. J. Biomed. Inform. 75, S4–S18 (2017)

    Article  Google Scholar 

  29. Stubbs, A., Uzuner, Ö.: Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UThealth corpus. J. Biomed. Inform. 58, S20–S29 (2015)

    Article  Google Scholar 

Download references

Acknowledgments

We thank Dr. Ana Helena D. P. S. Ulbrich, who provided the clinical notes dataset from the hospital, for her valuable cooperation. We also thank the volunteers of the Institute of Artificial Intelligence in Healthcare Celso Pereira and Ana Lúcia Dias, for the dataset annotation.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Henrique D. P. dos Santos .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Check for updates. Verify currency and authenticity via CrossMark

Cite this paper

Santos, J., dos Santos, H.D.P., Tabalipa, F., Vieira, R. (2021). De-Identification of Clinical Notes Using Contextualized Language Models and a Token Classifier. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science(), vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_3

Download citation

  • DOI: https://doi.org/10.1007/978-3-030-91699-2_3

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-91698-5

  • Online ISBN: 978-3-030-91699-2

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics