De-Identification of Clinical Notes Using Contextualized Language Models and a Token Classifier

Santos, Joaquim; dos Santos, Henrique D. P.; Tabalipa, Fábio; Vieira, Renata

doi:10.1007/978-3-030-91699-2_3

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13074)

Included in the following conference series:

Brazilian Conference on Intelligent Systems

Abstract

De-identification of clinical notes is crucial for the reuse of electronic clinical data and is commonly framed as a Named Entity Recognition (NER) task. Neural language models substantially improve Natural Language Processing (NLP) tasks such as NER when they are integrated with neural network methods. This paper evaluates a current state-of-the-art deep learning method (Bi-LSTM-CRF) on the task of identifying patient names in clinical notes for de-identification purposes. We used two corpora and three language models to determine which combination delivers the best performance. In our experiments, the corpus built specifically for de-identification of clinical notes, combined with a contextualized embedding and word embeddings, achieved the best result: an F-measure of 0.94.
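
To make the task concrete, the following minimal Python sketch shows how the output of a token classifier could be turned into de-identified text and scored with an F-measure. It is an illustration only, not the authors' implementation: the BIO tagging scheme, the NAME label, the [NAME] placeholder, the exact-span matching criterion, and the function names bio_spans, mask_names, and f_measure are all assumptions introduced here, and the example sentence is invented.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def bio_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence (single entity type assumed)."""
    spans: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags):
        # A B- tag, or an I- tag with no open span, starts a new entity.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            if start is not None:
                spans.add((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def mask_names(tokens: List[str], tags: List[str], placeholder: str = "[NAME]") -> List[str]:
    """Replace each tagged span with a single placeholder token."""
    ends: Dict[int, int] = dict(bio_spans(tags))  # start index -> end index
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if i in ends:
            out.append(placeholder)
            i = ends[i]  # skip the rest of the entity
        else:
            out.append(tokens[i])
            i += 1
    return out


def f_measure(gold: List[str], pred: List[str]) -> float:
    """Span-level F1: a predicted span counts only if it matches a gold span exactly."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example sentence (hypothetical data, not from the paper's corpora).
tokens = ["Paciente", "Maria", "Silva", "relata", "dor", "."]
gold = ["O", "B-NAME", "I-NAME", "O", "O", "O"]
pred = ["O", "B-NAME", "O", "O", "O", "O"]  # classifier missed the surname

masked = mask_names(tokens, gold)
score_perfect = f_measure(gold, gold)
score_partial = f_measure(gold, pred)
```

Under exact-span matching, the partially recognized name in pred earns no credit, which is why a missed surname is costly both for the score and for privacy: the surname would remain in the masked text.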

This work was partially supported by the Institute of Artificial Intelligence in Healthcare, Memed, Google Latin America Research Awards, and by FCT under the project UIDB/00057/2020 (Portugal).

Acknowledgments

We thank Dr. Ana Helena D. P. S. Ulbrich, who provided the clinical notes dataset from the hospital, for her valuable cooperation. We also thank Celso Pereira and Ana Lúcia Dias, volunteers of the Institute of Artificial Intelligence in Healthcare, for annotating the dataset.

Author information

Authors and Affiliations

CIDEHUS, University of Évora, Évora, Portugal
Joaquim Santos & Renata Vieira
Institute of Artificial Intelligence in Healthcare, Porto Alegre, Brazil
Henrique D. P. dos Santos
Memed, Clinical Intelligence Department, São Paulo, Brazil
Fábio Tabalipa

Corresponding author

Correspondence to Henrique D. P. dos Santos.

Editor information

Editors and Affiliations

Universidade Federal de Sergipe, São Cristóvão, Brazil
André Britto
Universidade de São Paulo, São Paulo, Brazil
Karina Valdivia Delgado

About this paper

Cite this paper

Santos, J., dos Santos, H.D.P., Tabalipa, F., Vieira, R. (2021). De-Identification of Clinical Notes Using Contextualized Language Models and a Token Classifier. In: Britto, A., Valdivia Delgado, K. (eds) Intelligent Systems. BRACIS 2021. Lecture Notes in Computer Science, vol 13074. Springer, Cham. https://doi.org/10.1007/978-3-030-91699-2_3

DOI: https://doi.org/10.1007/978-3-030-91699-2_3
Published: 28 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91698-5
Online ISBN: 978-3-030-91699-2
eBook Packages: Computer Science, Computer Science (R0)